Agent Skills: QA Agent Testing (Jan 2026)

QA harness for LLM agents: scenario suites, flake controls, tool sandboxing, LLM-as-judge scoring, and regression protocols.

ID: vasilyu1983/ai-agents-public/qa-agent-testing

Install this agent skill into your local setup:

pnpm dlx add-skill https://github.com/vasilyu1983/AI-Agents-public/tree/HEAD/frameworks/shared-skills/skills/qa-agent-testing

Skill Files

Browse the full folder contents for qa-agent-testing.

frameworks/shared-skills/skills/qa-agent-testing/SKILL.md

Skill Metadata

Name
qa-agent-testing
Description
"QA harness for LLM agents: scenario suites, flake controls, tool sandboxing, LLM-as-judge scoring, and regression protocols."

QA Agent Testing (Jan 2026)

Design and run reliable evaluation suites for LLM agents/personas, including tool-using and multi-agent systems.

Default QA Workflow

  1. Define the Persona Under Test (PUT): scope, out-of-scope, and safety boundaries.
  2. Define 10 representative tasks (Must Ace).
  3. Define 5 refusal edge cases (Must Decline + redirect).
  4. Define an output contract (format, tone, structure, citations).
  5. Run the suite with determinism controls and tool tracing.
  6. Score with the 6-dimension rubric; track variance across reruns.
  7. Log baselines and regressions; gate merges/deploys on thresholds.

Use the copy-paste templates in assets/ for day-0 setup.
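
The suite shape from steps 2-3 can be sketched as a small builder. The `Scenario` and `build_suite` names are illustrative assumptions, not part of the shipped templates:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One test case for the Persona Under Test (PUT)."""
    id: str
    prompt: str
    kind: str  # "task" (Must Ace) or "refusal" (Must Decline + redirect)
    output_contract: dict = field(default_factory=dict)  # format/tone/structure rules

def build_suite(tasks: list[str], refusals: list[str]) -> list[Scenario]:
    """Assemble the default 10-task / 5-refusal suite."""
    suite = [Scenario(f"task-{i:02d}", p, "task") for i, p in enumerate(tasks, 1)]
    suite += [Scenario(f"refusal-{i:02d}", p, "refusal") for i, p in enumerate(refusals, 1)]
    return suite

suite = build_suite(
    tasks=[f"representative task {n}" for n in range(1, 11)],
    refusals=[f"out-of-scope request {n}" for n in range(1, 6)],
)
assert len(suite) == 15
```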

Determinism and Flake Control

  • Control inputs: pin prompts/config, fixtures, stable tool responses, frozen time/timezone where possible.
  • Control sampling: fixed seeds/temperatures where supported; log model/config versions.
  • Record tool traces: tool name, args, outputs, latency, errors, retries, and side effects.
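
A minimal sketch of the trace record described in the last bullet; the `traced_call` wrapper and its field names are assumptions for illustration:

```python
import json
import time

def traced_call(tool_name, tool_fn, **args):
    """Invoke a tool and capture name, args, output, latency, and errors."""
    record = {"tool": tool_name, "args": args, "retries": 0}
    start = time.monotonic()
    try:
        record["output"] = tool_fn(**args)
        record["error"] = None
    except Exception as exc:  # capture failures instead of crashing the run
        record["output"] = None
        record["error"] = repr(exc)
    record["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
    return record

trace = traced_call("lookup", lambda city: {"temp_c": 21}, city="Oslo")
print(json.dumps(trace))
```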

Two-Layer Evaluation (2026)

Evaluate reasoning and action layers separately:

| Layer     | What to Test                        | Key Metrics                                          |
|-----------|-------------------------------------|------------------------------------------------------|
| Reasoning | Planning, decision-making, intent   | Intent resolution, task adherence, context retention |
| Action    | Tool calls, execution, side effects | Tool call accuracy, completion rate, error recovery  |
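
One action-layer metric from the table, tool call accuracy, could be computed as the fraction of expected calls matched exactly (a simplifying assumption; real harnesses may match more loosely):

```python
def tool_call_accuracy(expected, actual):
    """Action-layer metric: fraction of expected (tool, args) pairs matched exactly."""
    if not expected:
        return 1.0
    matched = sum(1 for e, a in zip(expected, actual) if e == a)
    return matched / len(expected)

expected = [("search", {"q": "refund policy"}), ("fetch", {"id": 7})]
actual = [("search", {"q": "refund policy"}), ("fetch", {"id": 9})]
assert tool_call_accuracy(expected, actual) == 0.5
```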

Evaluation Dimensions (Score What Matters)

| Dimension         | What to Measure                                  | Level    |
|-------------------|--------------------------------------------------|----------|
| Task success      | Correct outcome and constraints met              | Agent    |
| Safety/policy     | Correct refusals and safe alternatives           | Agent    |
| Reliability       | Stability across reruns and small prompt changes | Agent    |
| Latency/cost      | Budgets per task and per suite                   | Business |
| Debuggability     | Failures produce evidence (logs, traces)         | Agent    |
| Factual grounding | Hallucination rate, citation accuracy            | Model    |
| Bias detection    | Fairness across demographic inputs               | Model    |

CI Economics

  • PR gate: small, high-signal smoke eval suite.
  • Scheduled: full scenario suites, adversarial inputs, and cost/latency regression checks (track separately from quality scoring).
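
Splitting one scenario registry into a PR smoke gate and the full scheduled run might look like this; the tag names and `select` helper are illustrative:

```python
# One registry, two selection policies: cheap PR gate vs. full scheduled run.
SCENARIOS = [
    {"id": "task-01", "tags": {"smoke"}},
    {"id": "task-02", "tags": set()},
    {"id": "adv-01", "tags": {"adversarial"}},
]

def select(scenarios, gate):
    """Pick the high-signal smoke subset for PRs; run everything on schedule."""
    if gate == "pr":
        return [s for s in scenarios if "smoke" in s["tags"]]
    return scenarios  # scheduled: full suites, including adversarial inputs

assert [s["id"] for s in select(SCENARIOS, "pr")] == ["task-01"]
```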

Robustness and Security Tests (Recommended)

  • Metamorphic tests: run small, meaning-preserving prompt/input rewrites; enforce invariants on outputs.
  • Prompt injection tests: treat tool outputs, retrieved text, and user-provided documents as untrusted; verify the agent does not follow embedded instructions that conflict with system/developer constraints.
  • Tool fault injection: simulate timeouts, retries, partial data, and tool errors; verify graceful recovery.
  • Differential testing: compare behavior across model/config versions for regressions and unexpected shifts.
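
A metamorphic test from the first bullet can be as simple as asserting that meaning-preserving rewrites yield one normalized answer. The `agent` stub here is a deterministic stand-in, not a real model call:

```python
def normalize(answer: str) -> str:
    return answer.strip().lower()

def agent(prompt: str) -> str:
    # Stand-in for the real agent: deterministic stub for the sketch.
    return "Paris" if "capital of france" in prompt.lower() else "unknown"

# Meaning-preserving rewrites of the same question.
variants = [
    "What is the capital of France?",
    "Capital of France - what is it?",
    "Tell me the capital of France.",
]
answers = {normalize(agent(v)) for v in variants}
assert len(answers) == 1, f"metamorphic invariant violated: {answers}"
```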

Do / Avoid

Do:

  • Use objective oracles (schema validation, golden traces, deterministic tool mocks) in addition to human review.
  • Quarantine flaky evals with owners and expiry, just like flaky tests in CI.

Avoid:

  • Evaluating only "happy prompts" with no tool failures and no adversarial inputs.
  • Letting self-evaluations substitute for ground-truth checks.
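
The first "Do" bullet, objective oracles plus deterministic tool mocks, can be sketched like this (fixture data and the `schema_oracle` check are assumptions for illustration):

```python
import json

# Deterministic tool mock: a fixed fixture instead of a live API.
def mock_weather(city: str) -> str:
    fixtures = {"Oslo": {"temp_c": 21, "unit": "celsius"}}
    return json.dumps(fixtures[city])

def schema_oracle(raw: str) -> bool:
    """Objective oracle: output must parse and carry the required keys/types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data.get("temp_c"), (int, float)) and data.get("unit") == "celsius"

assert schema_oracle(mock_weather("Oslo"))
assert not schema_oracle("It's warm out!")  # free text fails the contract
```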

Quick Reference

| Need | Use | Location |
|------|-----|----------|
| Build the 10 tasks | Task patterns + examples | references/test-case-design.md |
| Design refusals | Refusal categories + templates | references/refusal-patterns.md |
| Score runs | Detailed rubric + thresholds | references/scoring-rubric.md |
| Compute suite math quickly | CLI utility script | scripts/score_suite.py |
| Manage regressions | Re-run workflow + baseline policy | references/regression-protocol.md |
| Sandbox tools | Isolation tiers + hardening | references/tool-sandboxing.md |
| Test multi-agent systems | Coordination patterns + suite template | references/multi-agent-testing.md |
| Use LLM-as-judge safely | Biases + mitigations | references/llm-judge-limitations.md |
| Test prompt injection attacks | Injection taxonomy + test cases | references/prompt-injection-testing.md |
| Detect hallucinations | Detection methods + scoring | references/hallucination-detection.md |
| Design eval datasets | Dataset construction + maintenance | references/eval-dataset-design.md |
| Start from templates | Harness + scoring sheet + log | assets/ |

Decision Tree

Testing an agent?
  - New agent?
    - Create QA harness -> Define 10 tasks + 5 refusals -> Run baseline
  - Prompt changed?
    - Re-run full 15-check suite -> Compare to baseline
  - Tool/knowledge changed?
    - Re-run affected tests -> Log in regression log
  - Quality review?
    - Score against rubric -> Identify weak areas -> Fix prompt

Scoring and Gates

  • Score each run with the 6-dimension rubric (0-3 each; max 18 per task).
  • Prefer suite-level gating that accounts for variance; avoid treating non-determinism as a free pass.
  • Use scripts/score_suite.py to compute averages, normalized scores, and basic PASS/CONDITIONAL/FAIL classification.
  • For detailed methodology (including judge calibration and variance metrics), see references/scoring-rubric.md.
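
The suite math above might reduce to something like the following sketch; the `classify` helper and its 0.75/0.6 thresholds are illustrative assumptions, not the actual scripts/score_suite.py implementation or the thresholds in references/scoring-rubric.md:

```python
def classify(scores, max_per_task=18, pass_t=0.75, cond_t=0.6):
    """Average per-task scores, normalize to [0, 1], and bucket the suite."""
    avg = sum(scores) / len(scores)
    norm = avg / max_per_task
    if norm >= pass_t:
        verdict = "PASS"
    elif norm >= cond_t:
        verdict = "CONDITIONAL"
    else:
        verdict = "FAIL"
    return round(norm, 3), verdict

assert classify([15, 16, 14]) == (0.833, "PASS")
```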

Navigation

Resources

  • references/test-case-design.md - 10-task patterns + validation + metamorphic add-ons
  • references/refusal-patterns.md - refusal categories + response templates + test tactics
  • references/scoring-rubric.md - scoring guide, thresholds, variance metrics, judge calibration
  • references/regression-protocol.md - re-run scope, baseline policy, recovery procedures
  • references/tool-sandboxing.md - sandbox tiers, tool hardening, injection/exfil test ideas
  • references/multi-agent-testing.md - coordination testing patterns + suite template
  • references/llm-judge-limitations.md - LLM-as-judge biases, limits, mitigations
  • references/prompt-injection-testing.md - Injection taxonomy, test cases, and defense validation
  • references/hallucination-detection.md - Hallucination detection methods, scoring, and benchmarks
  • references/eval-dataset-design.md - Evaluation dataset construction, versioning, and maintenance

Templates

  • assets/qa-harness-template.md - copy-paste harness
  • assets/scoring-sheet.md - scoring tracker
  • assets/regression-log.md - version tracking

External Resources

See data/sources.json for:

  • LLM evaluation research
  • Red-teaming methodologies
  • Prompt testing frameworks

Quick Start

  1. Copy assets/qa-harness-template.md
  2. Fill in PUT (Persona Under Test) section
  3. Define 10 representative tasks for your agent
  4. Add 5 refusal edge cases
  5. Specify output contracts
  6. Run baseline test
  7. Log results in regression log

Success Criteria: Each of the 10 tasks scores >= 12/18 and each refusal scores >= 2/3 (or PASS by your policy oracle), with stable results across reruns and no new hard failures.
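
The score thresholds in the success criteria reduce to a simple gate check (a minimal sketch; the `gate` helper is a hypothetical name, and the rerun-stability and hard-failure checks are omitted):

```python
def gate(task_scores, refusal_scores):
    """Apply the score thresholds: every task >= 12/18, every refusal >= 2/3."""
    return all(s >= 12 for s in task_scores) and all(s >= 2 for s in refusal_scores)

assert gate([12, 15, 18], [2, 3])
assert not gate([12, 11], [3, 3])  # one task below threshold
```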

Fact-Checking

  • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
  • Prefer primary sources; report source links and dates for volatile information.
  • If web access is unavailable, state the limitation and mark guidance as unverified.