# Test Props Live Deployment

Knowledge base for manually running the props critic and grader evaluation stack. Covers infrastructure setup, database lifecycle, agent image push, running critics, grading, and exporting results.

**Not for standard tests** — use `bazel test //props/...` for unit, integration, and e2e Bazel tests. This skill is for manual live deployment and evaluation runs.
For evaluation workflow background:
- @props/docs/openai_evaluation/evaluation.md
- @props/docs/local_llm_evaluation/evaluation.md
## Environment Detection

Detect gVisor (Claude Code on the Web) by kernel version:

```bash
if [[ "$(uname -r)" == "4.4.0" ]]; then
  IS_GVISOR=true
else
  IS_GVISOR=false
fi
```

gVisor requires:

- `PROPS_DOCKER_NETWORK=host` (agent containers must use host networking)
- `--annotation run.oci.keep_original_groups=1` on all podman containers
- Verify `$DOCKER_HOST` is set (the session hook sets this)
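The requirements above can be folded into a small helper so manual podman invocations pick up the right flags on either kernel. A sketch; the function name is illustrative, and the flag values are taken from the list above:

```shell
# Emit the extra podman flags needed under gVisor; empty string otherwise.
# Hypothetical helper: flag values come from the gVisor requirements above.
gvisor_extra_podman_args() {
  local kernel="${1:-$(uname -r)}"
  if [[ "$kernel" == "4.4.0" ]]; then
    echo "--network host --annotation run.oci.keep_original_groups=1"
  fi
}
```

Usage: `podman run $(gvisor_extra_podman_args) ...` then behaves the same inside and outside gVisor.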
## Infrastructure Setup

Start Postgres (port 5433) and the OCI registry (port 5050):

```bash
bash .claude/skills/test_props/start_infra.sh
```

Set PG environment variables for subsequent commands:

```bash
export PGHOST=127.0.0.1
export PGPORT=5433
export PGUSER=postgres
export PGPASSWORD=$(cat props/.devenv/state/pg_password)
export PGDATABASE=eval_results
```
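Before running anything heavier, it is worth confirming the variables point at a live server. A sketch; `pg_ok` is an illustrative name:

```shell
# Succeeds iff psql can reach the server using the exported PG* variables.
pg_ok() {
  psql -tAc 'SELECT 1' 2>/dev/null | grep -qx 1
}

pg_ok && echo "postgres reachable" || echo "check PG* variables" >&2
```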
## Database Lifecycle

### Fresh schema

`db recreate` drops all schema objects (tables, views, functions, policies), then runs Alembic migrations to recreate the schema from scratch and syncs model metadata. It does not sync specimens.

```bash
PGHOST=127.0.0.1 PGPORT=5433 PGUSER=postgres \
PGPASSWORD=$(cat props/.devenv/state/pg_password) \
PGDATABASE=eval_results \
bazel run //props/cli:cli -- db recreate --yes
```
### Migrations and specimen sync via backend lifespan

When `auto_migrate = true` is set in the config file, the backend runs `alembic upgrade head` on startup (idempotent — only applies pending migrations). When `auto_sync_specimens = true` is set, the backend scans `/specimens` and syncs all specimens on startup. Before starting the backend, symlink the repo's specimens directory:

```bash
ln -sf "$(git rev-parse --show-toplevel)/props/specimens" /specimens
```

The `config.ollama.toml` in this skill directory enables both flags.
### Importing an existing dump

To continue from a saved dump (e.g., from a previous session):

1. Recreate schema via `db recreate`.
2. Sync specimens — start the backend with `auto_sync_specimens = true` (and the `/specimens` symlink in place), or run `bazel run //props/cli:cli -- db sync-specimen` per specimen.
3. Import the dump:

```bash
# OpenAI evaluation results:
zstd -dc props/docs/openai_evaluation/results.sql.zst \
  | psql --set ON_ERROR_STOP=on -d eval_results

# Local LLM (ollama) results:
zstd -dc props/docs/local_llm_evaluation/results.sql.zst \
  | psql --set ON_ERROR_STOP=on -d eval_results
```

The dump excludes ground-truth tables (snapshots, file_sets, true_positives, etc.) that come from specimens, so specimens must be in the DB before importing.
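The recreate and import steps can be strung together. A sketch; `import_results_dump` is an illustrative name, and it deliberately leaves the specimen-sync step to the backend lifespan (or a manual `db sync-specimen`), which must happen between the two commands:

```shell
# Recreate the schema, then restore a data-only dump.
# NOTE: specimens must be synced between these two steps (see step 2 above).
import_results_dump() {
  local dump="$1"   # e.g. props/docs/openai_evaluation/results.sql.zst
  bazel run //props/cli:cli -- db recreate --yes || return 1
  zstd -dc "$dump" | psql --set ON_ERROR_STOP=on -d eval_results
}
```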
## Backend Startup

### With remote Ollama (gpt-oss:20b)

Retrieve the Ollama API key from k8s:

```bash
export OLLAMA_API_KEY=$(kubectl get secret ollama-api-key -n claude-sandbox \
  -o jsonpath='{.data.api-key}' | base64 -d)
```

Symlink specimens and build:

```bash
ln -sf "$(git rev-parse --show-toplevel)/props/specimens" /specimens
bazel build //props/backend:backend_bin
```

Start in background:

```bash
PROPS_CONFIG_FILE=.claude/skills/test_props/config.ollama.toml \
PGHOST=127.0.0.1 PGPORT=5433 PGUSER=postgres \
PGPASSWORD=$(cat props/.devenv/state/pg_password) \
PGDATABASE=eval_results \
PROPS_REGISTRY_UPSTREAM_URL=http://127.0.0.1:5050 \
PROPS_DOCKER_NETWORK=host \
DOCKER_HOST=$DOCKER_HOST \
OLLAMA_API_KEY=$OLLAMA_API_KEY \
bazel-bin/props/backend/backend_bin serve > /tmp/backend.log 2>&1 &
```

The backend logs the admin token on startup:

```bash
grep "Admin token" /tmp/backend.log
```

Check health: `curl -s http://127.0.0.1:8000/health`
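Startup can take a few seconds, so a polling loop is handy before firing requests at the API. A sketch; the function name and retry counts are arbitrary:

```shell
# Poll the health endpoint until the backend answers or we give up.
wait_for_backend() {
  local url="${1:-http://127.0.0.1:8000/health}" tries="${2:-30}"
  local i
  for ((i = 0; i < tries; i++)); do
    if curl -fsS "$url" > /dev/null 2>&1; then
      return 0
    fi
    sleep 2
  done
  echo "backend not healthy after $tries attempts" >&2
  return 1
}
```

On failure, `tail /tmp/backend.log` usually explains why.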
### With OpenAI

Use `config.podman.toml` and set `OPENAI_API_KEY` + `OPENAI_BASE_URL`:

```bash
PROPS_CONFIG_FILE=.claude/skills/test_props/config.podman.toml \
OPENAI_API_KEY=$OPENAI_API_KEY \
OPENAI_BASE_URL=https://api.openai.com/v1 \
PGHOST=127.0.0.1 PGPORT=5433 PGUSER=postgres \
PGPASSWORD=$(cat props/.devenv/state/pg_password) \
PGDATABASE=eval_results \
PROPS_REGISTRY_UPSTREAM_URL=http://127.0.0.1:5050 \
PROPS_DOCKER_NETWORK=host \
DOCKER_HOST=$DOCKER_HOST \
bazel-bin/props/backend/backend_bin serve > /tmp/backend.log 2>&1 &
```
## Push Agent Images

Push images to the registry proxy (port 8000), not the upstream registry (port 5050). The proxy records agent definitions, and the grader supervisor listens for grader tag changes.

Set up Docker auth for the registry proxy:

```bash
PG_PASSWORD=$(cat props/.devenv/state/pg_password)
AUTH_B64=$(echo -n "postgres:$PG_PASSWORD" | base64)
mkdir -p ~/.docker
cat > ~/.docker/config.json <<EOF
{
  "auths": {
    "localhost:8000": { "auth": "$AUTH_B64" }
  }
}
EOF
```

Push both agent types:

```bash
bazel run //props/agents/critic:push
bazel run //props/agents/grader:push
```

These push to `localhost:8000/<type>:latest`. The grader supervisor starts grader containers automatically when the grader image is pushed.
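If the proxy implements the standard Docker Registry HTTP API v2 (an assumption; adjust if it does not), you can confirm the pushes landed, reusing the `AUTH_B64` value set above:

```shell
# List repositories known to the registry proxy (assumes the v2 catalog API).
list_proxy_repos() {
  curl -fsS -H "Authorization: Basic ${AUTH_B64:?set AUTH_B64 first}" \
    http://localhost:8000/v2/_catalog
}
```

Both `critic` and `grader` should appear in the returned `repositories` array.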
## Running Critics

Get the admin token and run a critic via the API. The call blocks until the critic container exits.

```bash
PG_PASSWORD=$(cat props/.devenv/state/pg_password)
AUTH_TOKEN=$(echo -n "postgres:$PG_PASSWORD" | base64)

# File-set example: rank-1 example (33 TPs, 39 occurrences, fastest to run)
curl -s -X POST http://127.0.0.1:8000/api/runs/critic \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "definition_id": "latest",
    "example": {
      "kind": "file_set",
      "snapshot_slug": "ducktape/2025-09-03-00",
      "files_hash": "8e2209f20bd1df0c5bc4073dfff739fe"
    },
    "critic_model": "gpt-oss-20b",
    "timeout_seconds": 1800,
    "budget_usd": 0.0
  }'
```
Top file-set examples (file-set runs are faster than whole-snapshot):
| Rank | Snapshot | files_hash | TPs | Occurrences |
| ---- | ------------------------------ | ---------------------------------- | --- | ----------- |
| 1 | ducktape/2025-09-03-00 | 8e2209f20bd1df0c5bc4073dfff739fe | 33 | 39 |
| 2 | ducktape/2025-11-20-00 | bb8aff17944a6348a8089790457e3094 | 15 | 31 |
| 3 | ducktape/2025-11-26-00 | 6e416fb1d095abc7fdc79131434c7dac | 20 | 21 |
| 4 | ducktape/2025-11-21-00 | 15702f4d16234db852e973e31323fbdd | 21 | 21 |
| 5 | gmail-archiver/2025-12-17-00 | 9e218584782810e5a65195da8f63931a | 14 | 21 |
`budget_usd = 0.0` for Ollama (local cluster inference is free).
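Because the API call blocks, a second shell is useful for watching progress. A polling sketch against the `agent_runs` table; it assumes the `PG*` variables are exported, and the terminal status names are guesses, so match them to what your `status` column actually contains:

```shell
# Poll agent_runs until the run leaves its in-progress states.
# CAUTION: loops forever if run_id does not exist (status comes back empty).
wait_for_run() {
  local run_id="$1" status
  while :; do
    status=$(psql -tAc \
      "SELECT status FROM agent_runs WHERE agent_run_id = '$run_id'")
    case "$status" in
      ""|running|pending) sleep 10 ;;
      *) echo "$status"; return 0 ;;
    esac
  done
}
```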
## Monitoring Runs

```bash
# All runs with status
psql -c "SELECT agent_run_id, type_config->>'agent_type' AS type, model, status,
         container_exit_code FROM agent_runs ORDER BY created_at"

# Reported issues for a specific critic run
psql -c "SELECT COUNT(*) FROM reported_issues WHERE agent_run_id = '<run_id>'"

# Grading edges (populated by the grader after the critic finishes)
psql -c "SELECT ge.critique_run_id, ge.critique_issue_id,
         ge.tp_id, ge.fp_id, ge.grade
         FROM grading_edges ge
         WHERE ge.critique_run_id = '<critic_run_id>'"

# Grading pending (count should reach 0 when grading is complete)
psql -c "SELECT COUNT(*) FROM grading_pending WHERE critique_run_id = '<critic_run_id>'"
```
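The pending count lends itself to a wait loop. A sketch; the function name and the 10-second interval are arbitrary:

```shell
# Block until grading_pending drains to 0 for the given critic run.
# Assumes PG* env vars are exported.
wait_for_grading() {
  local run_id="$1" pending
  while :; do
    pending=$(psql -tAc \
      "SELECT COUNT(*) FROM grading_pending WHERE critique_run_id = '$run_id'")
    [ "$pending" = "0" ] && return 0
    sleep 10
  done
}
```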
## Grader Supervisor

The `GraderSupervisor` is enabled by `grader_model` in the config file. It starts automatically when the backend starts and listens for `grader_definition_changed` pg_notify events. When a grader image is pushed, it (re)starts grader containers for all active snapshots.

Graders run continuously, watching for new critic runs to grade. After a critic completes, the grader picks up the new reported issues and creates `grading_edges` rows.

If graders don't appear after pushing the image, check:

- Was the image pushed to the proxy (port 8000), not directly to the registry (port 5050)?
- Backend logs: `grep -i grader /tmp/backend.log`
## Exporting Results

Export run results (excluding ground-truth and infrastructure tables) for import in a future session. The export excludes specimen data (snapshots, file_sets, true_positives, etc.) since those come from specimens on import.

```bash
# For ollama/local LLM evaluation:
pg_dump eval_results \
  --data-only --no-owner --no-privileges \
  --exclude-table=true_positives \
  --exclude-table=true_positive_occurrences \
  --exclude-table=false_positives \
  --exclude-table=false_positive_occurrences \
  --exclude-table=fp_occurrence_relevant_files \
  --exclude-table=occurrence_ranges \
  --exclude-table=critic_scopes_expected_to_recall \
  --exclude-table=file_sets \
  --exclude-table=file_set_members \
  --exclude-table=snapshots \
  --exclude-table=snapshot_files \
  --exclude-table=model_metadata \
  --exclude-table=agent_role_salt \
  --exclude-table=alembic_version \
  -f props/docs/local_llm_evaluation/results.sql \
  && zstd --rm --ultra -22 props/docs/local_llm_evaluation/results.sql
```

`llm_requests` stores full conversation transcripts (O(N²) growth across turns). zstd compresses cross-row redundancy far better than per-row gzip.
## Troubleshooting

### Image resolution errors

Add insecure registry entries to `~/.cache/claude-hooks/podman/registries.conf`:

```toml
[[registry]]
prefix = "127.0.0.1:5050"
location = "127.0.0.1:5050"
insecure = true

[[registry]]
prefix = "127.0.0.1:8000"
location = "127.0.0.1:8000"
insecure = true
```

### Password issues

Use hex-only passwords in `props/.devenv/state/pg_password` (no `/`, `+`, or `=` characters, which break asyncpg DSN parsing).
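One way to generate a compliant password, assuming `openssl` is available (any hex generator works):

```shell
# Emits a 32-character password drawn only from [0-9a-f],
# so it contains no DSN-breaking characters.
new_pg_password() {
  openssl rand -hex 16
}
```

Write it with `new_pg_password > props/.devenv/state/pg_password` before starting the infra.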
### Ollama API key

Retrieve a fresh key from k8s if the current one doesn't work:

```bash
kubectl get secret ollama-api-key -n claude-sandbox \
  -o jsonpath='{.data.api-key}' | base64 -d
```
## Key Architecture Points

- **Registry proxy**: Integrated into the backend. Push images to port 8000 (backend), which proxies to port 5050 (upstream registry) and records agent definitions.
- **Grader supervisor**: Listens for `grader_definition_changed` pg_notify. When a grader tag is pushed, all grader containers are (re)started.
- **Agent containers**: Run with host networking, per-agent PostgreSQL roles, and RLS-scoped database access.
- **Model config**: `config.ollama.toml` (this skill directory) configures `gpt-oss:20b` via the remote ollama cluster at `ollama.allegedly.works`. The cluster serves the model with 131072-token context (`OLLAMA_NUM_CTX=131072`).