Agent Skills: QA Agent Testing (Jan 2026)

QA harness for LLM agents: scenario suites, flake controls, tool sandboxing, LLM-as-judge scoring, and regression protocols.

ID: vasilyu1983/ai-agents-public/qa-agent-testing

Install this agent skill into your local setup:

pnpm dlx add-skill https://github.com/vasilyu1983/AI-Agents-public/tree/HEAD/frameworks/shared-skills/skills/qa-agent-testing

Skill Files

Browse the full folder contents for qa-agent-testing.

frameworks/shared-skills/skills/qa-agent-testing/SKILL.md

Skill Metadata

Name
qa-agent-testing
Description
"QA harness for LLM agents: scenario suites, flake controls, tool sandboxing, LLM-as-judge scoring, and regression protocols."

QA Agent Testing (Jan 2026)

Design and run reliable evaluation suites for LLM agents/personas, including tool-using and multi-agent systems.

Default QA Workflow

  1. Define the Persona Under Test (PUT): scope, out-of-scope, and safety boundaries.
  2. Define 10 representative tasks (Must Ace).
  3. Define 5 refusal edge cases (Must Decline + redirect).
  4. Define an output contract (format, tone, structure, citations).
  5. Run the suite with determinism controls and tool tracing.
  6. Score with the 6-dimension rubric; track variance across reruns.
  7. Log baselines and regressions; gate merges/deploys on thresholds.

Use the copy-paste templates in assets/ for day-0 setup.
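
The suite shape from steps 2-3 can be sketched as a small builder. The `Scenario` and `build_suite` names are illustrative assumptions, not part of the shipped templates:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One test case for the Persona Under Test (PUT)."""
    id: str
    prompt: str
    kind: str  # "task" (Must Ace) or "refusal" (Must Decline + redirect)
    output_contract: dict = field(default_factory=dict)  # format/tone/structure rules

def build_suite(tasks: list[str], refusals: list[str]) -> list[Scenario]:
    """Assemble the default 10-task / 5-refusal suite."""
    suite = [Scenario(f"task-{i:02d}", p, "task") for i, p in enumerate(tasks, 1)]
    suite += [Scenario(f"refusal-{i:02d}", p, "refusal") for i, p in enumerate(refusals, 1)]
    return suite

suite = build_suite(
    tasks=[f"representative task {n}" for n in range(1, 11)],
    refusals=[f"out-of-scope request {n}" for n in range(1, 6)],
)
assert len(suite) == 15
```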

Determinism and Flake Control

  • Control inputs: pin prompts/config, fixtures, stable tool responses, frozen time/timezone where possible.
  • Control sampling: fixed seeds/temperatures where supported; log model/config versions.
  • Record tool traces: tool name, args, outputs, latency, errors, retries, and side effects.
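
A minimal sketch of the trace record described in the last bullet; the `traced_call` wrapper and its field names are assumptions for illustration:

```python
import json
import time

def traced_call(tool_name, tool_fn, **args):
    """Invoke a tool and capture name, args, output, latency, and errors."""
    record = {"tool": tool_name, "args": args, "retries": 0}
    start = time.monotonic()
    try:
        record["output"] = tool_fn(**args)
        record["error"] = None
    except Exception as exc:  # capture failures instead of crashing the run
        record["output"] = None
        record["error"] = repr(exc)
    record["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
    return record

trace = traced_call("lookup", lambda city: {"temp_c": 21}, city="Oslo")
print(json.dumps(trace))
```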

Two-Layer Evaluation (2026)

Evaluate reasoning and action layers separately:

| Layer     | What to Test                        | Key Metrics                                          |
|-----------|-------------------------------------|------------------------------------------------------|
| Reasoning | Planning, decision-making, intent   | Intent resolution, task adherence, context retention |
| Action    | Tool calls, execution, side effects | Tool call accuracy, completion rate, error recovery  |
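
One action-layer metric from the table, tool call accuracy, could be computed as the fraction of expected calls matched exactly (a simplifying assumption; real harnesses may match more loosely):

```python
def tool_call_accuracy(expected, actual):
    """Action-layer metric: fraction of expected (tool, args) pairs matched exactly."""
    if not expected:
        return 1.0
    matched = sum(1 for e, a in zip(expected, actual) if e == a)
    return matched / len(expected)

expected = [("search", {"q": "refund policy"}), ("fetch", {"id": 7})]
actual = [("search", {"q": "refund policy"}), ("fetch", {"id": 9})]
assert tool_call_accuracy(expected, actual) == 0.5
```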

Evaluation Dimensions (Score What Matters)

| Dimension         | What to Measure                                  | Level    |
|-------------------|--------------------------------------------------|----------|
| Task success      | Correct outcome and constraints met              | Agent    |
| Safety/policy     | Correct refusals and safe alternatives           | Agent    |
| Reliability       | Stability across reruns and small prompt changes | Agent    |
| Latency/cost      | Budgets per task and per suite                   | Business |
| Debuggability     | Failures produce evidence (logs, traces)         | Agent    |
| Factual grounding | Hallucination rate, citation accuracy            | Model    |
| Bias detection    | Fairness across demographic inputs               | Model    |

CI Economics

  • PR gate: small, high-signal smoke eval suite.
  • Scheduled: full scenario suites, adversarial inputs, and cost/latency regression checks (track separately from quality scoring).
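
Splitting one scenario registry into a PR smoke gate and the full scheduled run might look like this; the tag names and `select` helper are illustrative:

```python
# One registry, two selection policies: cheap PR gate vs. full scheduled run.
SCENARIOS = [
    {"id": "task-01", "tags": {"smoke"}},
    {"id": "task-02", "tags": set()},
    {"id": "adv-01", "tags": {"adversarial"}},
]

def select(scenarios, gate):
    """Pick the high-signal smoke subset for PRs; run everything on schedule."""
    if gate == "pr":
        return [s for s in scenarios if "smoke" in s["tags"]]
    return scenarios  # scheduled: full suites, including adversarial inputs

assert [s["id"] for s in select(SCENARIOS, "pr")] == ["task-01"]
```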

Robustness and Security Tests (Recommended)

  • Metamorphic tests: run small, meaning-preserving prompt/input rewrites; enforce invariants on outputs.
  • Prompt injection tests: treat tool outputs, retrieved text, and user-provided documents as untrusted; verify the agent does not follow embedded instructions that conflict with system/developer constraints.
  • Tool fault injection: simulate timeouts, retries, partial data, and tool errors; verify graceful recovery.
  • Differential testing: compare behavior across model/config versions for regressions and unexpected shifts.
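
A metamorphic test from the first bullet can be as simple as asserting that meaning-preserving rewrites yield one normalized answer. The `agent` stub here is a deterministic stand-in, not a real model call:

```python
def normalize(answer: str) -> str:
    return answer.strip().lower()

def agent(prompt: str) -> str:
    # Stand-in for the real agent: deterministic stub for the sketch.
    return "Paris" if "capital of france" in prompt.lower() else "unknown"

# Meaning-preserving rewrites of the same question.
variants = [
    "What is the capital of France?",
    "Capital of France - what is it?",
    "Tell me the capital of France.",
]
answers = {normalize(agent(v)) for v in variants}
assert len(answers) == 1, f"metamorphic invariant violated: {answers}"
```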

Do / Avoid

Do:

  • Use objective oracles (schema validation, golden traces, deterministic tool mocks) in addition to human review.
  • Quarantine flaky evals with owners and expiry, just like flaky tests in CI.

Avoid:

  • Evaluating only "happy prompts" with no tool failures and no adversarial inputs.
  • Letting self-evaluations substitute for ground-truth checks.
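
The first "Do" bullet, objective oracles plus deterministic tool mocks, can be sketched like this (fixture data and the `schema_oracle` check are assumptions for illustration):

```python
import json

# Deterministic tool mock: a fixed fixture instead of a live API.
def mock_weather(city: str) -> str:
    fixtures = {"Oslo": {"temp_c": 21, "unit": "celsius"}}
    return json.dumps(fixtures[city])

def schema_oracle(raw: str) -> bool:
    """Objective oracle: output must parse and carry the required keys/types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data.get("temp_c"), (int, float)) and data.get("unit") == "celsius"

assert schema_oracle(mock_weather("Oslo"))
assert not schema_oracle("It's warm out!")  # free text fails the contract
```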

Quick Reference

| Need | Use | Location |
|------|-----|----------|
| Build the 10 tasks | Task patterns + examples | references/test-case-design.md |
| Design refusals | Refusal categories + templates | references/refusal-patterns.md |
| Score runs | Detailed rubric + thresholds | references/scoring-rubric.md |
| Compute suite math quickly | CLI utility script | scripts/score_suite.py |
| Manage regressions | Re-run workflow + baseline policy | references/regression-protocol.md |
| Sandbox tools | Isolation tiers + hardening | references/tool-sandboxing.md |
| Test multi-agent systems | Coordination patterns + suite template | references/multi-agent-testing.md |
| Use LLM-as-judge safely | Biases + mitigations | references/llm-judge-limitations.md |
| Test prompt injection attacks | Injection taxonomy + test cases | references/prompt-injection-testing.md |
| Detect hallucinations | Detection methods + scoring | references/hallucination-detection.md |
| Design eval datasets | Dataset construction + maintenance | references/eval-dataset-design.md |
| Start from templates | Harness + scoring sheet + log | assets/ |

Decision Tree

Testing an agent?
  - New agent?
    - Create QA harness -> Define 10 tasks + 5 refusals -> Run baseline
  - Prompt changed?
    - Re-run full 15-check suite -> Compare to baseline
  - Tool/knowledge changed?
    - Re-run affected tests -> Log in regression log
  - Quality review?
    - Score against rubric -> Identify weak areas -> Fix prompt

Scoring and Gates

  • Score each run with the 6-dimension rubric (0-3 each; max 18 per task).
  • Prefer suite-level gating that accounts for variance; avoid treating non-determinism as a free pass.
  • Use scripts/score_suite.py to compute averages, normalized scores, and basic PASS/CONDITIONAL/FAIL classification.
  • For detailed methodology (including judge calibration and variance metrics), see references/scoring-rubric.md.
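
The suite math above might reduce to something like the following sketch; the `classify` helper and its 0.75/0.6 thresholds are illustrative assumptions, not the actual scripts/score_suite.py implementation or the thresholds in references/scoring-rubric.md:

```python
def classify(scores, max_per_task=18, pass_t=0.75, cond_t=0.6):
    """Average per-task scores, normalize to [0, 1], and bucket the suite."""
    avg = sum(scores) / len(scores)
    norm = avg / max_per_task
    if norm >= pass_t:
        verdict = "PASS"
    elif norm >= cond_t:
        verdict = "CONDITIONAL"
    else:
        verdict = "FAIL"
    return round(norm, 3), verdict

assert classify([15, 16, 14]) == (0.833, "PASS")
```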

Navigation

Resources

  • references/test-case-design.md - 10-task patterns + validation + metamorphic add-ons
  • references/refusal-patterns.md - refusal categories + response templates + test tactics
  • references/scoring-rubric.md - scoring guide, thresholds, variance metrics, judge calibration
  • references/regression-protocol.md - re-run scope, baseline policy, recovery procedures
  • references/tool-sandboxing.md - sandbox tiers, tool hardening, injection/exfil test ideas
  • references/multi-agent-testing.md - coordination testing patterns + suite template
  • references/llm-judge-limitations.md - LLM-as-judge biases, limits, mitigations
  • references/prompt-injection-testing.md - Injection taxonomy, test cases, and defense validation
  • references/hallucination-detection.md - Hallucination detection methods, scoring, and benchmarks
  • references/eval-dataset-design.md - Evaluation dataset construction, versioning, and maintenance

Templates

  • assets/qa-harness-template.md - copy-paste harness
  • assets/scoring-sheet.md - scoring tracker
  • assets/regression-log.md - version tracking

External Resources

See data/sources.json for:

  • LLM evaluation research
  • Red-teaming methodologies
  • Prompt testing frameworks

Quick Start

  1. Copy assets/qa-harness-template.md
  2. Fill in PUT (Persona Under Test) section
  3. Define 10 representative tasks for your agent
  4. Add 5 refusal edge cases
  5. Specify output contracts
  6. Run baseline test
  7. Log results in regression log

Success Criteria: Each of the 10 tasks scores >= 12/18 and each refusal scores >= 2/3 (or PASS by your policy oracle), with stable results across reruns and no new hard failures.
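
The score thresholds in the success criteria reduce to a simple gate check (a minimal sketch; the `gate` helper is a hypothetical name, and the rerun-stability and hard-failure checks are omitted):

```python
def gate(task_scores, refusal_scores):
    """Apply the score thresholds: every task >= 12/18, every refusal >= 2/3."""
    return all(s >= 12 for s in task_scores) and all(s >= 2 for s in refusal_scores)

assert gate([12, 15, 18], [2, 3])
assert not gate([12, 11], [3, 3])  # one task below threshold
```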

Fact-Checking

  • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
  • Prefer primary sources; report source links and dates for volatile information.
  • If web access is unavailable, state the limitation and mark guidance as unverified.