Trustworthiness Auditor (L3 Worker) Skill

Paths: File paths (references/, ../ln-*) are relative to this skill directory.

Type: L3 Worker

Specialized worker auditing whether automated test results are deterministic, isolated, and trustworthy.

Purpose & Scope

Audit Test Trustworthiness (Category 5: Medium Priority)
Check determinism, isolation, and dependency control
Detect flaky tests, time/random/order dependency, shared state, and real external dependencies
Emit REWRITE_FOR_DETERMINISM or DELETE_IF_LOW_VALUE
Calculate compliance score (X/10)

Inputs

MANDATORY READ: Load references/audit_worker_core_contract.md.

Receives contextStore with: tech_stack, testFilesMetadata, codebase_root, output_dir.

Workflow

Detection policy: use two-layer detection (candidate scan, then context verification); load references/two_layer_detection.md only when the verification method is ambiguous.

1) Parse Context: Extract tech stack, trustworthiness checklist, test file list, and output_dir from contextStore
2) Check Isolation (Layer 1): Check isolation for 6 categories (APIs, DB, FS, Time, Random, Network)
2b) Context Analysis (Layer 2 -- MANDATORY): For each isolation violation, ask:
- Is this an integration test? (real dependencies are intentional) -> do NOT flag. Only flag isolation issues in unit tests
- Is in-memory DB configured via test config (not visible in grep)? -> skip
- Is this a test helper that sets up mocks for other tests? -> skip
3) Check Determinism: Check for flaky tests, time-dependent assertions, order-dependent tests, shared mutable state
4) Evaluate trust action: Use REWRITE_FOR_DETERMINISM by default; use DELETE_IF_LOW_VALUE only when the test is both untrustworthy and low-value according to obvious local evidence
5) Collect Findings: Record each violation with severity, location (file:line), effort estimate (S/M/L), action, recommendation
6) Calculate Score: Count violations by severity, calculate compliance score (X/10)
7) Write Report: Build full markdown report in memory per references/templates/audit_worker_report_template.md, write to {output_dir}/ln-635--global.md in a single Write call
8) Return Summary: Return minimal summary to coordinator (see Output Format)
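
The two-layer flow in this workflow could be sketched roughly as follows. This is a minimal illustration, not the contract in references/two_layer_detection.md: the file metadata flags (isIntegrationTest, usesInMemoryDbViaConfig, isMockHelper) and the pattern list are hypothetical, and a real audit would derive them from testFilesMetadata and the project stack.

```javascript
// Layer 1: cheap candidate scan. Patterns are illustrative, not exhaustive.
const CANDIDATE_PATTERNS = [
  { check: "api_isolation", regex: /\b(axios\.get|fetch\(|http\.request)/ },
  { check: "db_isolation", regex: /localhost:5432/ },
  { check: "time_isolation", regex: /\bnew Date\(\)|\bDate\.now\(\)/ },
  { check: "random_isolation", regex: /\bMath\.random\(\)/ },
];

function scanCandidates(file) {
  const candidates = [];
  file.source.split("\n").forEach((text, index) => {
    for (const { check, regex } of CANDIDATE_PATTERNS) {
      if (regex.test(text)) {
        candidates.push({ check, location: `${file.path}:${index + 1}` });
      }
    }
  });
  return candidates;
}

// Layer 2: context verification, mirroring the three questions in the workflow.
// The flags on `file` are hypothetical metadata, assumed to be computed elsewhere.
function verify(candidate, file) {
  if (file.isIntegrationTest) return false; // real dependencies are intentional
  if (candidate.check === "db_isolation" && file.usesInMemoryDbViaConfig) {
    return false; // in-memory DB is configured where grep cannot see it
  }
  if (file.isMockHelper) return false; // helper that sets up mocks for other tests
  return true;
}

function detect(files) {
  return files.flatMap((file) =>
    scanCandidates(file).filter((candidate) => verify(candidate, file))
  );
}
```

Keeping the scan and the verification as separate functions means a noisy pattern only costs extra verification work; it does not produce a finding on its own.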

Audit Rules: Test Isolation

1. External APIs

Good: Mocked (jest.mock, sinon, nock)
Bad: Real HTTP calls to external APIs

Detection:

Grep for axios.get, fetch(, http.request without mocks
Check if test makes actual network calls

Severity: HIGH

Recommendation: Ensure external API calls are controlled (mock, stub, or test server). Tool choice depends on project stack. Exception: Integration tests are EXPECTED to use real dependencies -- do NOT flag

Effort: M
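
One stack-neutral way to keep an external API call under control is to pass the HTTP client in as a parameter and substitute a stub in the test. The sketch below assumes a hypothetical getUserName function and a placeholder api.example.com URL; projects already using jest.mock or nock would reach the same result with those tools.

```javascript
// Hypothetical code under test: the HTTP client is a parameter, so tests can control it.
async function getUserName(httpGet, userId) {
  const response = await httpGet(`https://api.example.com/users/${userId}`);
  return response.data.name;
}

// Stub that records calls and returns a canned response; no network is involved.
function makeStub(cannedResponse) {
  const calls = [];
  const stub = async (url) => {
    calls.push(url);
    return cannedResponse;
  };
  stub.calls = calls;
  return stub;
}

// Example test body: exercises the code with the stub and returns what happened.
async function demo() {
  const httpGet = makeStub({ data: { name: "Ada" } });
  const name = await getUserName(httpGet, 42);
  return { name, calls: httpGet.calls };
}
```

Because the stub records every URL it receives, the test can assert both the returned value and that exactly one request was attempted.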

2. Database

Good: In-memory DB (sqlite :memory:) or mocked
Bad: Real database (PostgreSQL, MySQL)

Detection:

Check DB connection strings (localhost:5432, real DB URL)
Grep for beforeAll(async () => { await db.connect() }) without :memory:

Severity: MEDIUM

Recommendation: Ensure DB state is controlled and isolated between test runs. Exception: Integration tests with in-memory DB via config -> skip

Effort: M-L

3. File System

Good: Mocked (mock-fs, vol)
Bad: Real file reads/writes

Detection:

Grep for fs.readFile, fs.writeFile without mocks
Check if test creates/deletes real files

Severity: MEDIUM

Recommendation: Ensure file system operations are isolated (mock, temp directory, or cleanup). Tool choice depends on project stack

Effort: S-M
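
Of the options in the recommendation, the temp-directory approach needs only Node's built-in modules. The following is a minimal sketch; writeReport is a hypothetical function standing in for the code under test, and the directory prefix is arbitrary.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical code under test: writes a report file into a target directory.
function writeReport(dir, body) {
  const target = path.join(dir, "report.txt");
  fs.writeFileSync(target, body, "utf8");
  return target;
}

// Runs a callback against a fresh temp directory and always removes it afterwards,
// even when the callback throws.
function withTempDir(callback) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "audit-example-"));
  try {
    return callback(dir);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}

// Example test body: the file exists only inside the callback.
const contents = withTempDir((dir) => {
  const target = writeReport(dir, "ok");
  return fs.readFileSync(target, "utf8");
});
```

Each invocation gets its own directory, so tests do not see each other's files and nothing is left behind after a failure.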

4. Time/Date

Good: Mocked (jest.useFakeTimers, sinon.useFakeTimers)
Bad: new Date(), Date.now() without mocks

Detection:

Grep for new Date() in test files without useFakeTimers

Severity: MEDIUM

Recommendation: Ensure time-dependent logic uses controlled clock (fake timers, injected clock, or time provider). Tool choice depends on project stack

Effort: S
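
The "injected clock" option from the recommendation can be illustrated without any test framework. In this sketch isExpired is a hypothetical function; the point is only that the clock is a parameter with a real-time default.

```javascript
// Hypothetical code under test: takes the clock as a parameter, defaults to the real one.
function isExpired(expiresAtMs, now = Date.now) {
  return now() >= expiresAtMs;
}

// In a test, pass a fixed clock so the result never depends on wall time.
const fixedNow = () => Date.UTC(2025, 0, 1, 12, 0, 0);

// One second before the deadline: not expired.
const beforeDeadline = isExpired(Date.UTC(2025, 0, 1, 12, 0, 1), fixedNow);
// Exactly at the deadline: expired.
const atDeadline = isExpired(Date.UTC(2025, 0, 1, 12, 0, 0), fixedNow);
```

Production callers omit the second argument and get Date.now, so the change is invisible outside the tests.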

5. Random

Good: Seeded random (Math.seedrandom, fixed seed)
Bad: Math.random() without seed

Detection:

Grep for Math.random() without seed setup

Severity: LOW

Recommendation: Use seeded random for deterministic tests

Effort: S
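
When a project has no seeding library available, a small seeded generator passed into the code under test gives the same effect. The sketch below uses mulberry32, a compact and widely circulated PRNG, purely as an illustration; the shuffle helper is hypothetical and a project may prefer the seedrandom package instead.

```javascript
// mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical code under test: Fisher-Yates shuffle that takes its randomness as a parameter.
function shuffle(items, rng = Math.random) {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Same seed, same order: the test can assert on the exact result of a rerun.
const firstRun = shuffle([1, 2, 3, 4, 5], mulberry32(1234));
const secondRun = shuffle([1, 2, 3, 4, 5], mulberry32(1234));
```

The seed value itself is arbitrary; what matters for determinism is that it is fixed in the test and not derived from time or process state.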

6. Network

Good: Mocked (supertest for Express, no real ports)
Bad: Real network requests (localhost:3000, binding to port)

Detection:

Grep for app.listen(3000) in tests
Check for real HTTP requests

Severity: MEDIUM

Recommendation: Use supertest (no real port)

Effort: M

Audit Rules: Determinism

1. Flaky Tests

What: Tests that pass on some runs and fail on others with no change to code or inputs

Detection:

Run tests multiple times, check for inconsistent results
Grep for setTimeout, setInterval without proper awaits
Check for race conditions (async operations not awaited)

Severity: HIGH

Recommendation: Fix race conditions, use proper async/await

Effort: M-L
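
The "run tests multiple times" check amounts to comparing outcomes across identical runs. A minimal sketch follows, assuming synchronous test functions for brevity (real suites are usually async and run through the project's test runner); checkConsistency and sometimesFails are hypothetical names.

```javascript
// Runs a synchronous test function several times and records pass/fail counts.
function checkConsistency(testFn, runs = 10) {
  let passed = 0;
  let failed = 0;
  for (let i = 0; i < runs; i++) {
    try {
      testFn();
      passed++;
    } catch {
      failed++;
    }
  }
  // Mixed outcomes across identical runs are the signature of a flaky test.
  return { passed, failed, flaky: passed > 0 && failed > 0 };
}

// Stand-in for a flaky test: fails on every third call.
// It is deterministic here only so that the example itself is reproducible.
let callCount = 0;
function sometimesFails() {
  callCount++;
  if (callCount % 3 === 0) throw new Error("intermittent failure");
}

const stable = checkConsistency(() => {});
const unstable = checkConsistency(sometimesFails, 9);
```

A test that fails on every run is broken, not flaky, which is why the flag requires at least one pass and one failure.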

2. Time-Dependent Assertions

What: Assertions on current time (expect(timestamp).toBeCloseTo(Date.now()))

Detection:

Grep for Date.now(), new Date() in assertions

Severity: MEDIUM

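Recommendation: Control the clock (fake timers or an injected clock) and assert against the fixed time instead of the current time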

Effort: S

3. Order-Dependent Tests

What: Tests that fail when run in different order

Detection:

Run tests in random order, check for failures
Grep for shared mutable state between tests

Severity: MEDIUM

Recommendation: Isolate tests, reset state in beforeEach

Effort: M

4. Shared Mutable State

What: Global variables modified across tests

Detection:

Grep for let globalVar at module level
Check for state shared between tests

Severity: MEDIUM

Recommendation: Use beforeEach to reset state

Effort: S-M
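
Shared mutable state and order dependence usually show up together. The sketch below is framework-free: runInOrder is a hypothetical mini-runner whose beforeEach argument plays the role of the framework hook, and cart stands in for any module-level variable.

```javascript
// Problem: module-level state that both tests touch.
let cart = [];

const leakyTests = {
  addsItem() {
    cart.push("apple");
    if (cart.length !== 1) throw new Error("expected 1 item");
  },
  startsEmpty() {
    if (cart.length !== 0) throw new Error("expected empty cart");
  },
};

// Minimal runner: the optional reset hook is called before every test.
function runInOrder(tests, order, beforeEach = () => {}) {
  return order.map((name) => {
    beforeEach();
    try {
      tests[name]();
      return `${name}: pass`;
    } catch (error) {
      return `${name}: fail`;
    }
  });
}

const resetCart = () => {
  cart = [];
};

// Without a reset, the outcome depends on which test runs first.
resetCart();
const forward = runInOrder(leakyTests, ["addsItem", "startsEmpty"]);
resetCart();
const reversed = runInOrder(leakyTests, ["startsEmpty", "addsItem"]);

// With a reset before each test, both orders give the same result.
resetCart();
const forwardWithReset = runInOrder(leakyTests, ["addsItem", "startsEmpty"], resetCart);
```

In the forward order the second test sees the item left behind by the first, which is the kind of failure that only appears when a runner reorders or parallelizes the suite.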

Audit Rules: Trustworthiness Drag

1. Overlarge Test With Shared Setup (>100 lines)

What: Test with >100 lines, testing too many scenarios

Detection:

Count lines per test
If >100 lines -> flag as overlarge

Severity: MEDIUM

Recommendation: Split into focused tests (one scenario per test)

Effort: S-M

2. Slow Poke (>5 seconds)

What: Test taking >5 seconds to run

Detection:

Measure test duration
If >5s -> Slow Poke

Severity: MEDIUM

Recommendation: Control external deps with test doubles or in-memory services selected from the project stack; parallelize only after isolation is verified

Effort: M

3. Conjoined Twins (Unit test without controlled dependencies)

What: Test labeled "Unit" but not mocking dependencies

Detection:

Check if test name includes "Unit"
Verify all dependencies are mocked
If no mocks -> the test is actually an integration test

Severity: LOW

Recommendation: Either mock dependencies OR rename to Integration test

Effort: S
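
As a Layer 1 candidate scan, the check above can be approximated with a name test plus a search for mock markers. The marker list here is an assumption based on the tools this document already names (jest, sinon, nock); a hit is only a candidate, since a unit test of a pure function legitimately has nothing to mock.

```javascript
// Text markers suggesting that a test file controls its dependencies.
// Illustrative only; extend per project stack.
const MOCK_MARKERS = [
  /jest\.mock\(/,
  /jest\.fn\(/,
  /sinon\.(stub|mock|fake)\(/,
  /nock\(/,
];

// Returns true when a test is labeled "unit" but shows no sign of mocking.
function looksMislabeledAsUnit(testName, source) {
  const labeledUnit = /\bunit\b/i.test(testName);
  const hasMocks = MOCK_MARKERS.some((regex) => regex.test(source));
  return labeledUnit && !hasMocks;
}
```

For example, looksMislabeledAsUnit("Unit: OrderService", "const db = require('./db');") returns true, and the auditor would then confirm by reading the test whether real dependencies are actually exercised.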

4. Default Value Blindness (Tests with default config)

What: Tests with default config values only. Use the non-default config rule from references/risk_based_testing_guide.md; load references/risk_based_testing_methodology.md only when examples are needed.

Detection:

Grep for common defaults in test setup: :8080, :3000, 30000, limit: 20, offset: 0
Check if test config values match framework/library defaults
Look for || DEFAULT patterns in source code with matching test values

Severity: HIGH

Effort: S
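
A small illustration of why default-only test values are a HIGH-severity trust problem: when the test value equals the fallback, the test passes whether or not the configuration is read at all. Both functions below are hypothetical, and 8080 and 9123 are arbitrary port numbers.

```javascript
// Hypothetical buggy code under test: it ignores config.port entirely.
function resolvePortBuggy(config) {
  return 8080;
}

// Intended behavior: the configured port wins, 8080 is only the fallback.
function resolvePort(config) {
  return config.port || 8080;
}

// Blind test: the configured value equals the default, so the bug is invisible.
const blindPassesOnBug = resolvePortBuggy({ port: 8080 }) === 8080;

// Sharper test: a non-default value distinguishes the two implementations.
const sharpPassesOnBug = resolvePortBuggy({ port: 9123 }) === 9123;
const sharpPassesOnFix = resolvePort({ port: 9123 }) === 9123;
```

Here the blind test passes against the buggy implementation, while the non-default test fails against it and passes against the intended one.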

Scoring Algorithm

MANDATORY READ: Load references/audit_scoring.md.

Severity mapping:

Flaky tests, External API not controlled, Default Value Blindness -> HIGH
Real database, File system, Time/Date, Network, Time-dependent assertions, Order-dependent, Shared mutable state, Overlarge shared setup, Slow Poke -> MEDIUM
Random without seed, Conjoined Twins -> LOW

Output Format

MANDATORY READ: Load references/templates/audit_worker_report_template.md.

Write JSON summary per references/audit_summary_contract.md. In managed mode the caller passes both runId and summaryArtifactPath; in standalone mode the worker generates its own run-scoped artifact path per shared contract.

Write report to {output_dir}/ln-635--global.md with category: "Test Trustworthiness" and checks: api_isolation, db_isolation, fs_isolation, time_isolation, random_isolation, network_isolation, flaky_tests, order_dependency, shared_state, default_value_blindness.

Return summary per references/audit_summary_contract.md.

When summaryArtifactPath is absent, write the standalone runtime summary under .hex-skills/runtime-artifacts/runs/{run_id}/evaluation-worker/{worker}--{identifier}.json and optionally echo the same summary in structured output.

Report written: .hex-skills/runtime-artifacts/runs/{run_id}/audit-report/ln-635--global.md
Score: X.X/10 | Issues: N (C:N H:N M:N L:N)

Note: Findings are flattened into a single array. Use the principle field prefix (Isolation / Determinism / Dependency Control) to identify the issue category. Each finding includes action: "REWRITE_FOR_DETERMINISM" or action: "DELETE_IF_LOW_VALUE".
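
To make the flat findings array and the summary line concrete, here is a rough sketch. The authoritative shape is defined in references/audit_summary_contract.md, which is not reproduced here, so the exact principle strings, file paths, and the reading of "C" as CRITICAL are assumptions. The score is passed in as a parameter because the penalty algorithm lives in references/audit_scoring.md.

```javascript
// Illustrative flat findings array; values are made up for the example.
const findings = [
  {
    principle: "Isolation: External APIs",
    severity: "HIGH",
    location: "tests/user.test.js:14",
    effort: "M",
    action: "REWRITE_FOR_DETERMINISM",
    recommendation: "Control the external API call with a mock, stub, or test server",
  },
  {
    principle: "Determinism: Time-Dependent Assertions",
    severity: "MEDIUM",
    location: "tests/token.test.js:31",
    effort: "S",
    action: "REWRITE_FOR_DETERMINISM",
    recommendation: "Assert against a fixed clock",
  },
  {
    principle: "Dependency Control: Conjoined Twins",
    severity: "LOW",
    location: "tests/cart.unit.test.js:1",
    effort: "S",
    action: "DELETE_IF_LOW_VALUE",
    recommendation: "Mock dependencies or rename to an integration test",
  },
];

function countBySeverity(list) {
  const counts = { CRITICAL: 0, HIGH: 0, MEDIUM: 0, LOW: 0 };
  for (const finding of list) counts[finding.severity] += 1;
  return counts;
}

// The principle prefix recovers the audit group from the flat array.
function byPrinciple(list, prefix) {
  return list.filter((finding) => finding.principle.startsWith(prefix));
}

function formatSummaryLine(score, list) {
  const c = countBySeverity(list);
  return `Score: ${score.toFixed(1)}/10 | Issues: ${list.length} (C:${c.CRITICAL} H:${c.HIGH} M:${c.MEDIUM} L:${c.LOW})`;
}
```

With these three findings and an externally computed score of 7.5, formatSummaryLine(7.5, findings) yields "Score: 7.5/10 | Issues: 3 (C:0 H:1 M:1 L:1)".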

Critical Rules

Apply the already-loaded references/audit_worker_core_contract.md.

Do not auto-fix: Report only
Effort realism: S = <1h, M = 1-4h, L = >4h
Flat findings: Merge isolation + determinism + dependency-control findings into a single findings array; use the principle prefix to distinguish them
Context-aware: Supertest with real Express app is acceptable for integration tests
Unique angle: Only audit whether test results can be trusted. Do not evaluate product behavior, E2E journey value, portfolio value, missing coverage, oracle strength, manual evidence, or structure.
Action required: Every finding uses REWRITE_FOR_DETERMINISM unless evidence shows the test is also low-value enough to use DELETE_IF_LOW_VALUE.
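
The effort buckets and the action rule above are simple enough to express as two small functions. This is only a sketch: estimatedHours and the lowValue flag are hypothetical inputs that the auditor would have to judge from local evidence.

```javascript
// Effort realism: S = <1h, M = 1-4h, L = >4h.
function effortBucket(estimatedHours) {
  if (estimatedHours < 1) return "S";
  if (estimatedHours <= 4) return "M";
  return "L";
}

// Every finding is already untrustworthy, so the only question is whether
// obvious local evidence also marks the test as low-value.
function chooseAction({ lowValue = false } = {}) {
  return lowValue ? "DELETE_IF_LOW_VALUE" : "REWRITE_FOR_DETERMINISM";
}
```

Defaulting lowValue to false keeps REWRITE_FOR_DETERMINISM as the outcome whenever the evidence is missing or unclear.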

Monitor (2.1.98+): For repeated test runs expected >30s each, use Monitor. Fallback: Bash(run_in_background=true).

Definition of Done

Apply the already-loaded references/audit_worker_core_contract.md.

[ ] contextStore parsed successfully (including output_dir)
[ ] All 3 audit groups completed:
- Isolation (6 categories: APIs, DB, FS, Time, Random, Network)
- Determinism (4 checks: flaky, time-dependent, order-dependent, shared state)
- Dependency control (overlarge shared setup, slow tests, conjoined dependencies, default-value blindness)
[ ] Findings collected with severity, location, effort, action, recommendation
[ ] Score calculated using penalty algorithm
[ ] Report written to {output_dir}/ln-635--global.md (atomic single Write call)
[ ] Summary written per contract

Version: 3.0.0 Last Updated: 2025-12-23
