TDD Enforcement Across All Implementation Agents Skill

TDD Enforcement Across All Implementation Agents

Comprehensive guide to Test-Driven Development (TDD) workflow as implemented by Hermes (backend), Aphrodite (frontend), and Demeter (database) agents. This ensures consistent RED → GREEN → REFACTOR cycle across all implementation agents.

⚠️ Non-Interactive Testing Rule

CRITICAL: Agents must ALWAYS run tests in a non-interactive mode. Never use watch modes, debuggers, or commands that require user input (e.g. q to quit).

Frontend: Use npx vitest run instead of vitest (which defaults to watch mode).
Backend/Database: Use pytest but NEVER append --pdb or start interactive shells.

Core TDD Philosophy

TDD = Write Tests First, Not After

Traditional approach ❌:

Write code → Write tests → Discover problems → Fix code
Risk: Tests written AFTER might not catch edge cases

TDD approach ✅:

Write failing test (RED) → Write minimal code to pass (GREEN) → Improve code (REFACTOR)
Benefit: Tests drive design, all paths covered by definition

The RED → GREEN → REFACTOR Cycle

Every implementation agent follows this religiously:

🔴 Phase 1: RED (Test Fails)

What: Write test FIRST, before any implementation code Why: Test defines the requirement Outcome: Test FAILS (RED)

Example - Backend (Hermes):

# tests/test_user_service.py
import pytest
from app.models import User
from app.services import UserService

def test_user_password_hashing():
    """User password must be bcrypt hashed, not plaintext."""
    service = UserService()
    user = service.create_user(
        email="alice@example.com",
        password="SecurePassword123"
    )
    
    # RED: This test FAILS right now
    assert user.password != "SecurePassword123"  # Password should NOT be plaintext
    assert user.verify_password("SecurePassword123")  # But verify should work
    assert not user.verify_password("WrongPassword")  # Wrong password should fail

# Run: pytest tests/test_user_service.py::test_user_password_hashing
# Result: ❌ FAILED - UserService doesn't exist yet

Example - Frontend (Athena):

// src/components/__tests__/LoginForm.test.tsx
import { render, screen, userEvent } from "@testing-library/react";
import { LoginForm } from "../LoginForm";

test("submits form with email and password", async () => {
    const mockSubmit = vi.fn();
    
    render(<LoginForm onSubmit={mockSubmit} />);
    
    // RED: This test FAILS right now
    const emailInput = screen.getByRole("textbox", { name: /email/i });
    const passwordInput = screen.getByLabelText("password");
    const submitButton = screen.getByRole("button", { name: /login/i });
    
    await userEvent.type(emailInput, "alice@example.com");
    await userEvent.type(passwordInput, "SecurePassword123");
    await userEvent.click(submitButton);
    
    expect(mockSubmit).toHaveBeenCalledWith({
        email: "alice@example.com",
        password: "SecurePassword123"
    });
});

// Run: npx vitest run LoginForm.test.tsx
// Result: ❌ FAILED - LoginForm doesn't exist yet

Example - Database (Demeter):

# tests/test_user_migration.py
import pytest
from sqlalchemy import inspect, MetaData

def test_user_table_has_email_column():
    """User table must have email column, unique and non-null."""
    metadata = MetaData()
    metadata.reflect(bind=engine)
    user_table = metadata.tables.get("user")
    
    # RED: This test FAILS because table doesn't exist
    assert user_table is not None
    assert "email" in user_table.columns
    
    email_col = user_table.columns["email"]
    assert email_col.nullable is False
    assert email_col.unique is True

# Run: pytest tests/test_user_migration.py::test_user_table_has_email_column
# Result: ❌ FAILED - user table doesn't exist yet

RED Checklist:

✅ Test is FAILING (not passing)
✅ Test clearly states requirement (readable)
✅ Test catches both happy path AND edge cases
✅ No implementation code exists yet
✅ Error message is clear (helps drive implementation)

🟢 Phase 2: GREEN (Test Passes - Minimal Implementation)

What: Write MINIMAL code to make test PASS Why: Minimal code = no unnecessary complexity Outcome: Test PASSES (GREEN) Rule: Write just enough to pass the test, nothing more

Example - Backend (Hermes):

# app/services.py - MINIMAL implementation
import bcrypt
from app.models import User

class UserService:
    def create_user(self, email: str, password: str) -> User:
        # GREEN: Minimal code to pass test
        hashed = bcrypt.hashpw(password.encode(), bcrypt.gensalt())
        user = User(email=email, password=hashed)
        return user
        # That's it! Don't add validation, logging, etc yet

# app/models.py
import bcrypt

class User:
    def __init__(self, email: str, password: bytes):
        self.email = email
        self.password = password  # Already hashed
    
    def verify_password(self, plaintext: str) -> bool:
        # GREEN: Minimal verification
        return bcrypt.checkpw(plaintext.encode(), self.password)

# Run: pytest tests/test_user_service.py::test_user_password_hashing
# Result: ✅ PASSED - Test is green!

Example - Frontend (Athena):

// src/components/LoginForm.tsx - MINIMAL implementation
export function LoginForm({ onSubmit }) {
    // GREEN: Minimal component to pass test
    return (
        <form onSubmit={(e) => {
            e.preventDefault();
            const formData = new FormData(e.currentTarget);
            onSubmit({
                email: formData.get("email"),
                password: formData.get("password")
            });
        }}>
            <input name="email" placeholder="Email" />
            <input name="password" type="password" placeholder="Password" />
            <button type="submit">Login</button>
        </form>
    );
}

// Run: npx vitest run LoginForm.test.tsx
// Result: ✅ PASSED - Test is green!

Example - Database (Demeter):

# migrations/001_create_user_table.py - MINIMAL schema
from alembic import op
import sqlalchemy as sa

def upgrade():
    # GREEN: Minimal table schema to pass test
    op.create_table(
        'user',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('email', sa.String(255), nullable=False, unique=True),
        sa.Column('password', sa.LargeBinary(), nullable=False),
    )

def downgrade():
    op.drop_table('user')

# Run: alembic upgrade head
# Run: pytest tests/test_user_migration.py::test_user_table_has_email_column
# Result: ✅ PASSED - Test is green!

GREEN Checklist:

✅ All tests PASS (not just one)
✅ Implementation is MINIMAL (no extra features)
✅ Code is straightforward (not clever)
✅ No validation yet (just the core logic)
✅ Coverage might be low still (that's okay for now)

🔄 Phase 3: REFACTOR (Improve Without Breaking Tests)

What: Improve code quality, add validation, add documentation Why: Tests guarantee behavior stays the same Outcome: Tests STILL PASS + code is better Rule: If any test fails, rollback refactoring

Example - Backend (Hermes):

# app/services.py - REFACTORED
import bcrypt
import logging
from typing import Optional
from app.models import User
from app.exceptions import InvalidEmailError, WeakPasswordError

logger = logging.getLogger(__name__)

class UserService:
    """Service for user account management with secure password handling."""
    
    MIN_PASSWORD_LENGTH = 12
    ALLOWED_DOMAINS = ["example.com", "example.org"]
    
    def create_user(self, email: str, password: str) -> User:
        """
        Create a new user with secure password hashing.
        
        Args:
            email: User email (must be valid format)
            password: User password (min 12 chars, mixed case + numbers)
            
        Returns:
            User instance with hashed password
            
        Raises:
            InvalidEmailError: If email format invalid
            WeakPasswordError: If password doesn't meet requirements
        """
        # Validate email format
        if not self._is_valid_email(email):
            raise InvalidEmailError(f"Invalid email: {email}")
        
        # Validate password strength
        if not self._is_strong_password(password):
            raise WeakPasswordError(
                "Password must be 12+ chars with uppercase, lowercase, numbers"
            )
        
        # Hash password with bcrypt
        hashed = bcrypt.hashpw(password.encode(), bcrypt.gensalt(rounds=12))
        
        # Create user
        user = User(email=email, password=hashed)
        logger.info(f"Created user: {email}")
        
        return user
    
    @staticmethod
    def _is_valid_email(email: str) -> bool:
        """Validate email format."""
        if "@" not in email or "." not in email:
            return False
        domain = email.split("@")[-1]
        # Can add domain whitelist if needed
        return True
    
    @staticmethod
    def _is_strong_password(password: str) -> bool:
        """Validate password meets minimum strength requirements."""
        if len(password) < UserService.MIN_PASSWORD_LENGTH:
            return False
        
        has_upper = any(c.isupper() for c in password)
        has_lower = any(c.islower() for c in password)
        has_digit = any(c.isdigit() for c in password)
        
        return has_upper and has_lower and has_digit

# app/models.py - REFACTORED
import bcrypt
from datetime import datetime
from typing import Optional

class User:
    """User model with secure password handling."""
    
    def __init__(self, email: str, password: bytes):
        """
        Initialize user.
        
        Args:
            email: User email address
            password: Bcrypt hashed password (bytes)
        """
        if not email or not password:
            raise ValueError("Email and password are required")
        
        self.email = email
        self.password = password
        self.created_at = datetime.utcnow()
    
    def verify_password(self, plaintext: str) -> bool:
        """
        Verify plaintext password against bcrypt hash.
        
        Args:
            plaintext: Plain text password to verify
            
        Returns:
            True if password matches, False otherwise
        """
        if not plaintext:
            return False
        
        return bcrypt.checkpw(plaintext.encode(), self.password)
    
    def __repr__(self) -> str:
        return f"User(email={self.email}, created_at={self.created_at})"

# Run tests: pytest tests/test_user_service.py -v
# Result: ✅ ALL TESTS PASS (including new edge case tests)

Example - Frontend (Athena):

// src/components/LoginForm.tsx - REFACTORED
import { useState, useCallback, FC, FormEvent } from "react";
import { validateEmail, validatePassword } from "../utils/validation";
import { LoginCredentials } from "../types";
import styles from "./LoginForm.module.css";

interface LoginFormProps {
    onSubmit: (credentials: LoginCredentials) => void;
    isLoading?: boolean;
    onError?: (error: string) => void;
}

export const LoginForm: FC<LoginFormProps> = ({
    onSubmit,
    isLoading = false,
    onError
}) => {
    const [email, setEmail] = useState("");
    const [password, setPassword] = useState("");
    const [errors, setErrors] = useState<Record<string, string>>({});
    
    const handleSubmit = useCallback((e: FormEvent<HTMLFormElement>) => {
        e.preventDefault();
        
        // Validate inputs
        const newErrors: Record<string, string> = {};
        
        if (!validateEmail(email)) {
            newErrors.email = "Invalid email format";
        }
        
        if (!validatePassword(password)) {
            newErrors.password = "Password must be 12+ chars with uppercase, lowercase, numbers";
        }
        
        if (Object.keys(newErrors).length > 0) {
            setErrors(newErrors);
            return;
        }
        
        // Submit valid credentials
        setErrors({});
        onSubmit({ email, password });
    }, [email, password, onSubmit]);
    
    return (
        <form onSubmit={handleSubmit} className={styles.form} noValidate>
            <div className={styles.fieldGroup}>
                <label htmlFor="email">Email</label>
                <input
                    id="email"
                    type="email"
                    name="email"
                    value={email}
                    onChange={(e) => setEmail(e.target.value)}
                    aria-invalid={!!errors.email}
                    aria-describedby={errors.email ? "email-error" : undefined}
                    disabled={isLoading}
                    required
                />
                {errors.email && (
                    <span id="email-error" className={styles.error}>
                        {errors.email}
                    </span>
                )}
            </div>
            
            <div className={styles.fieldGroup}>
                <label htmlFor="password">Password</label>
                <input
                    id="password"
                    type="password"
                    name="password"
                    value={password}
                    onChange={(e) => setPassword(e.target.value)}
                    aria-invalid={!!errors.password}
                    aria-describedby={errors.password ? "password-error" : undefined}
                    disabled={isLoading}
                    required
                />
                {errors.password && (
                    <span id="password-error" className={styles.error}>
                        {errors.password}
                    </span>
                )}
            </div>
            
            <button
                type="submit"
                disabled={isLoading}
                className={styles.submitButton}
            >
                {isLoading ? "Logging in..." : "Login"}
            </button>
        </form>
    );
};

// Run tests: npx vitest run LoginForm.test.tsx
// Result: ✅ ALL TESTS PASS (including accessibility checks)

Example - Database (Demeter):

# migrations/001_create_user_table.py - REFACTORED
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

def upgrade():
    """Create user table with proper indexes and constraints."""
    
    # Create table with comprehensive schema
    op.create_table(
        'user',
        sa.Column('id', postgresql.UUID(as_uuid=True), 
                  primary_key=True, default=sa.func.gen_random_uuid()),
        sa.Column('email', sa.String(255), nullable=False, unique=True),
        sa.Column('password', sa.LargeBinary(), nullable=False),
        sa.Column('verified_at', sa.DateTime(timezone=True), nullable=True),
        sa.Column('created_at', sa.DateTime(timezone=True), 
                  nullable=False, default=sa.func.now()),
        sa.Column('updated_at', sa.DateTime(timezone=True),
                  nullable=False, default=sa.func.now(),
                  onupdate=sa.func.now()),
        sa.UniqueConstraint('email', name='uq_user_email'),
    )
    
    # Add indexes for performance
    op.create_index('ix_user_email', 'user', ['email'])
    op.create_index('ix_user_created_at', 'user', ['created_at'])

def downgrade():
    """Drop user table."""
    op.drop_table('user')

# Run migration: alembic upgrade head
# Run tests: pytest tests/test_user_migration.py -v
# Result: ✅ ALL TESTS PASS (including performance tests)

REFACTOR Checklist:

✅ All tests STILL PASS (zero regressions)
✅ Code quality improved (validation, error handling, docs)
✅ Logging added where appropriate
✅ Type hints added (Python + TypeScript)
✅ Edge cases handled
✅ Documentation strings complete
✅ Performance optimized (indexes for DB, memoization for React)
✅ No API changes (backward compatible)

Coverage Requirements

Minimum Coverage: >80% (non-negotiable, Themis enforces) Target Coverage: 92-95% (stretch goal) Coverage Includes:

Line coverage (every line executed?)
Branch coverage (every if/else taken?)
Function coverage (every function tested?)

Measurement (Backend - Hermes):

# Generate coverage report
pytest tests/ --cov=app --cov-report=html

# Open report
open htmlcov/index.html

# Require minimum
pytest tests/ --cov=app --cov-fail-under=80

Measurement (Frontend - Athena):

// vitest.config.ts
export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      reporter: ['text', 'html'],
      exclude: [
        'node_modules/',
        'dist/',
        '**/*.test.ts{,x}',
        '**/index.ts'
      ],
      lines: 80,        // Minimum 80%
      functions: 80,
      branches: 80,
      statements: 80
    }
  }
});

# Run tests with coverage
npx vitest run --coverage

Measurement (Database - Demeter):

# Migrations are covered by:
# 1. Upgrade tests (can you apply migration?)
# 2. Downgrade tests (can you revert migration?)
# 3. Schema tests (does schema match expectations?)

pytest tests/test_migrations.py --cov=migrations --cov-fail-under=80

Anti-Patterns (What NOT to Do)

❌ Don't write tests after code:

# WRONG:
class User:
    def hash_password(self, pwd):
        return pwd  # Oops, forgot to hash!

def test_hash():  # Written after code - misses the bug
    assert User().hash_password("secret") != "secret"
# Test will catch it, but too late - already in code

✅ Do write tests before code:

# RIGHT:
def test_hash():  # Red - test fails
    assert User().hash_password("secret") != "secret"

class User:
    def hash_password(self, pwd):  # Green - code passes test
        return bcrypt.hash(pwd)  # Forced to hash!

❌ Don't test after full implementation:

// WRONG:
function LoginForm() {
    return (
        <form>
            <input />
            <button>Login</button>
        </form>
    );
}

// Test written 2 weeks later - might miss requirements
test("form submits data", () => { ... });

✅ Do test before starting component:

// RIGHT:
test("submits form with email and password", () => {
    // This test drives the component design
    expect(mockSubmit).toHaveBeenCalledWith({
        email, password
    });
});

// Now implement component to pass test
function LoginForm() {
    // Implementation guided by test
}

❌ Don't skip edge cases:

# WRONG:
def test_user_creation():
    user = create_user("alice@example.com", "Password123")
    assert user.email == "alice@example.com"  # Only happy path!

# Missing edge cases:
# - Invalid email format?
# - Weak password?
# - Duplicate email?
# - SQL injection?

✅ Do test happy path + edge cases:

# RIGHT:
def test_user_creation_happy_path():
    user = create_user("alice@example.com", "StrongPassword123")
    assert user.email == "alice@example.com"

def test_user_creation_invalid_email():
    with pytest.raises(InvalidEmailError):
        create_user("not-an-email", "StrongPassword123")

def test_user_creation_weak_password():
    with pytest.raises(WeakPasswordError):
        create_user("alice@example.com", "weak")

def test_user_creation_duplicate_email():
    create_user("alice@example.com", "StrongPassword123")
    with pytest.raises(DuplicateEmailError):
        create_user("alice@example.com", "DifferentPassword456")

def test_user_creation_sql_injection():
    with pytest.raises(InvalidEmailError):
        create_user("alice@example.com'); DROP TABLE users; --", "Password123")

TDD Metrics to Track

| Metric | Target | Minimum | Notes | |--------|--------|---------|-------| | Code Coverage | 95% | 80% | Themis enforces minimum | | Tests per 100 LOC | 15-20 | 10 | Indicates test density | | Red→Green→Refactor Cycle Time | 5-10 min | - | Per feature | | Bug Escape Rate | 0-1% | <5% | Bugs caught in QA? | | Production Bugs | 0 | - | Ideal state (TDD goal) |

Examples by Use Case

Use Case 1: Complex Algorithm (Backend)

# Test 1: RED - Algorithm not implemented
def test_calculate_shipping_cost():
    calculator = ShippingCalculator()
    cost = calculator.calculate(weight_kg=5, distance_km=100, zone="urban")
    assert cost == 25.50  # Specific expected cost

# Test 1: GREEN - Minimal algorithm
class ShippingCalculator:
    def calculate(self, weight_kg, distance_km, zone):
        return 25.50  # Hardcoded to pass test

# Test 1: REFACTOR - Real algorithm
class ShippingCalculator:
    RATE_PER_KG = 2.0
    RATE_PER_KM = 0.1
    ZONE_MULTIPLIERS = {
        "urban": 1.0,
        "suburban": 1.2,
        "rural": 1.5
    }
    
    def calculate(self, weight_kg, distance_km, zone):
        base_cost = (weight_kg * self.RATE_PER_KG) + (distance_km * self.RATE_PER_KM)
        multiplier = self.ZONE_MULTIPLIERS.get(zone, 1.0)
        return round(base_cost * multiplier, 2)

# Tests: All pass including edge cases
def test_calculate_shipping_cost_urban():
    calc = ShippingCalculator()
    assert calc.calculate(5, 100, "urban") == 25.00

def test_calculate_shipping_cost_rural():
    calc = ShippingCalculator()
    assert calc.calculate(5, 100, "rural") == 37.50

def test_calculate_shipping_cost_invalid_zone():
    calc = ShippingCalculator()
    # Should use default multiplier
    assert calc.calculate(5, 100, "invalid") == 25.00

Use Case 2: React Component State

// Test 1: RED - Component not implemented
test("increments counter when button clicked", async () => {
    render(<Counter />);
    const button = screen.getByRole("button", { name: /increment/i });
    expect(screen.getByText("Count: 0")).toBeInTheDocument();
    
    await userEvent.click(button);
    expect(screen.getByText("Count: 1")).toBeInTheDocument();
});

// Test 1: GREEN - Minimal component
export function Counter() {
    const [count, setCount] = useState(0);
    return (
        <div>
            <p>Count: {count}</p>
            <button onClick={() => setCount(count + 1)}>Increment</button>
        </div>
    );
}

// Test 1: REFACTOR - Better component
export function Counter() {
    const [count, setCount] = useState(0);
    
    const increment = useCallback(() => {
        setCount(prev => prev + 1);
    }, []);
    
    return (
        <div>
            <p aria-live="polite">Count: {count}</p>
            <button onClick={increment} aria-label="Increment counter">
                Increment
            </button>
        </div>
    );
}

// Additional tests for edge cases
test("resets counter when reset button clicked", async () => {
    render(<Counter />);
    await userEvent.click(screen.getByRole("button", { name: /increment/i }));
    await userEvent.click(screen.getByRole("button", { name: /reset/i }));
    expect(screen.getByText("Count: 0")).toBeInTheDocument();
});

When Themis Blocks Coverage

Scenario: Your code has 76% coverage, minimum is 80%

Output from Themis ❌:
"Coverage 76% below minimum 80%
Missing coverage in:
  - app/services.py lines 45-52 (error handling)
  - app/services.py lines 60-65 (edge case)

Action Required:
1. Add tests for error scenarios
2. Add tests for edge cases
3. Re-run coverage"

Response (Hermes):
"Adding tests for missing coverage:"

# New test for error handling
def test_create_user_database_error():
    with patch("app.db.execute") as mock_execute:
        mock_execute.side_effect = DatabaseError("Connection lost")
        with pytest.raises(UserServiceError):
            service.create_user("alice@example.com", "Password123")

# New test for edge case
def test_create_user_email_already_exists():
    service.create_user("alice@example.com", "Password123")
    with pytest.raises(DuplicateEmailError):
        service.create_user("alice@example.com", "DifferentPassword456")

# Re-run coverage: pytest --cov
# Result: 94% coverage ✅

Final Checklist for Agents

Before marking phase complete, all agents verify:

TDD Phase Completion Checklist:

RED ✅
  [ ] Test written first (before code)
  [ ] Test is failing
  [ ] Test clearly states requirement
  [ ] Test covers happy path + edge cases
  [ ] Error message in RED is clear

GREEN ✅
  [ ] Test now passes
  [ ] Implementation is minimal (no extra features)
  [ ] All related tests pass
  [ ] No broken existing tests

REFACTOR ✅
  [ ] Code quality improved
  [ ] All tests still pass
  [ ] Type hints added (if applicable)
  [ ] Documentation strings complete
  [ ] Error handling robust
  [ ] Edge cases handled
  [ ] Performance acceptable

COVERAGE ✅
  [ ] Coverage >80% minimum
  [ ] Coverage report reviewed
  [ ] Missing coverage identified and addressed
  [ ] Target: 92-95% coverage

READY FOR PRODUCTION ✅
  [ ] Zero test failures
  [ ] Coverage >80%
  [ ] Themis review passed
  [ ] Commit message clear
  [ ] Ready to merge

Version: 1.0
Standard: Enforced by Hermes, Athena, Demeter
Checkpoints: Themis enforces coverage >80%
Status: Mandatory for all implementation agents

Remember: RED first, then GREEN, then REFACTOR. Coverage >80%. Always.

Agent-Level Evaluation

Beyond code-level tests, agents must pass behavioral evaluation:

Hallucination Testing

class HallucinationEvaluator:
    """Evaluates LLM outputs for factual consistency with source context."""
    
    @staticmethod
    async def evaluate_faithfulness(
        response: str,
        context: list[str]
    ) -> dict:
        """Check if response is faithful to provided context."""
        from langsmith import Client
        
        client = Client()
        
        # Create evaluation example
        example = client.create_example(
            inputs={"question": "What did the analysis conclude?"},
            outputs={
                "response": response,
                "context": context
            },
            metadata={"evaluator": "faithfulness"}
        )
        
        # Score = number of claims supported / total claims
        # Returns score (0-1), unsupported_claims, verdict
        ...
    
    @classmethod
    def assert_faithfulness(
        cls, response: str, context: list[str], threshold: float = 0.8
    ):
        """Assert that response is faithful to context above threshold."""
        result = cls.evaluate_faithfulness(response, context)
        assert result["score"] >= threshold, (
            f"Faithfulness {result['score']:.2f} < {threshold:.2f}\n"
            f"Unsupported claims: {result['unsupported_claims']}"
        )

Behavioral Regression Testing

class AgentBehavioralTest:
    """Define expected agent behaviors as test cases."""
    
    scenarios: list[dict] = [
        {
            "name": "user_registration_flow",
            "steps": [
                {"action": "call", "agent": "hermes", "input": "POST /users", "params": {...}},
                {"expected": "status_201", "check": "response.status_code == 201"},
                {"expected": "user_created", "check": "response.json()['id'] is not None"},
            ],
            "post_conditions": [
                "user exists in database",
                "welcome email scheduled",
                "rate limit incremented",
            ]
        },
        {
            "name": "search_with_filters",
            "steps": [...],
            "expected_behavior": "results filtered by category",
        }
    ]

LangSmith Evaluation Integration

from langsmith import Client, evaluate

def eval_agent_run(run: dict, example: dict) -> dict:
    """Evaluation function for agent runs in LangSmith."""
    response = run.outputs.get("response", "")
    expected = example.outputs.get("expected_response", "")
    
    # Multiple evaluation dimensions
    return {
        "correctness": _check_correctness(response, expected),
        "faithfulness": _check_faithfulness(response, run.inputs["context"]),
        "conciseness": _check_conciseness(response),
        "latency_ok": run.latency_ms < 5000,
    }

# Run evaluation on test dataset
results = evaluate(
    eval_agent_run,
    data="agent-behavioral-tests",
    experiment_prefix="agent-v2.1",
    max_concurrency=5
)

Agent Skills: TDD Enforcement Across All Implementation Agents

Install this agent skill to your local

Skill Files