# Debug Agent Traces
Debug props agent behavior by reading their LLM traces from the database and optionally resurrecting past conversations to ask follow-up questions.
Argument: $ARGUMENTS
## Prerequisites

- PostgreSQL running at `127.0.0.1:5433`
- `OPENAI_API_KEY` in environment (for speak-with-dead)
## Database Connection

```bash
export PGHOST=127.0.0.1
export PGPORT=5433
export PGUSER=postgres
export PGPASSWORD=$(cat props/.devenv/state/pg_password)
export PGDATABASE=eval_results
```
## Capabilities

### 1. List Agent Runs

Find runs to debug:

```sql
SELECT agent_run_id, status,
       type_config->>'agent_type' as agent_type,
       type_config->'example'->>'snapshot_slug' as snapshot,
       created_at
FROM agent_runs
ORDER BY created_at DESC;
```
### 2. Read LLM Trace

For a given `agent_run_id`, list all LLM round-trips:

```sql
SELECT id, model, input_tokens, output_tokens, latency_ms, error, created_at
FROM llm_requests
WHERE agent_run_id = '<run_id>'
ORDER BY created_at;
```
### 3. Parse Tool Calls from a Trace

LLM requests/responses use the OpenAI Responses API format:

- `request_body` has keys: `input` (conversation array), `model`, `tools`, `instructions` (sometimes null — check `input[0]` for the system message)
- `response_body` has key: `output` (array of response items)

Input items have a `role` (messages) or a `type` (tool items):

- `system` — system prompt (check `content[0].text`)
- `function_call` — tool call from the model (has `name`, `arguments`, `call_id`)
- `function_call_output` — tool result (has `call_id`, `output`)

Output items have a `type`:

- `message` — text response (`content[0].text`)
- `function_call` — tool call (`name`, `arguments`, `call_id`)
- `reasoning` — reasoning trace if the model supports it (`summary`)
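To make those shapes concrete, here is a minimal hand-written `response_body`. The field values (tool name, call id, text) are illustrative, not taken from a real run:

```python
import json

# Illustrative response_body; every value here is made up for demonstration.
response_body = {
    "output": [
        {
            "type": "reasoning",
            "summary": ["Compared the reported issue against the TP list."],
        },
        {
            "type": "function_call",
            "name": "assign_credit",  # hypothetical tool name
            "arguments": json.dumps({"tp_id": "tp-1", "credit": 1.0}),
            "call_id": "call_abc123",
        },
        {
            "type": "message",
            "content": [{"type": "output_text", "text": "Credit assigned."}],
        },
    ]
}

for item in response_body["output"]:
    print(item["type"])
```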
To extract all tool calls from a run, save each `response_body` and parse:

```python
import json

# For each response_body:
output = response_body.get('output', [])
for item in output:
    if item['type'] == 'function_call':
        name = item['name']
        args = json.loads(item['arguments'])
        print(f'{name}({args})')
    elif item['type'] == 'message':
        text = item['content'][0]['text']
        print(f'TEXT: {text}')
    elif item['type'] == 'reasoning':
        summaries = item.get('summary', [])
        print(f'REASONING: {summaries}')
### 4. Read Specific Turn

To read the full request and response for a specific LLM turn:

```sql
SELECT request_body::text, response_body::text
FROM llm_requests WHERE id = <turn_id>;
```

Save to files and parse with Python to inspect the full conversation context, the tools available, and the model's response.
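One way to do that, sketched below; the file names and the `psql` flags (`-t` no headers, `-A` unaligned) are assumptions, not project conventions:

```python
import json
from pathlib import Path

# Produce the files first, e.g.:
#   psql -t -A -c "SELECT request_body::text FROM llm_requests WHERE id = 58" > request.json
#   psql -t -A -c "SELECT response_body::text FROM llm_requests WHERE id = 58" > response.json

def load_turn(request_path: str, response_path: str) -> tuple[dict, dict]:
    """Load one saved turn and print a short summary of it."""
    request_body = json.loads(Path(request_path).read_text())
    response_body = json.loads(Path(response_path).read_text())
    print(f"model: {request_body.get('model')}")
    print(f"input items: {len(request_body.get('input', []))}")
    print(f"tools: {[t.get('name') for t in request_body.get('tools', [])]}")
    print(f"output items: {len(response_body.get('output', []))}")
    return request_body, response_body
```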
### 5. Speak With Dead (Resurrect Conversation)

Resurrect a past agent conversation to ask follow-up questions. This sends the agent's original conversation prefix (up to a chosen turn) plus a new question to the same model, getting back the agent's perspective.

Steps:

1. **Pick the turn** — find the `llm_requests.id` at or after the decision you want to ask about.
2. **Extract the conversation prefix** — the `request_body.input` array contains all turns up to that point. The `response_body.output` array contains the model's response for that turn.
3. **Build the resurrection request:**

   ```python
   import json
   import os

   import httpx

   # Load the turn: request_body and response_body from the database
   # for the chosen turn.

   # Take the original request as the base
   resurrection = {
       "model": request_body["model"],
       "input": request_body["input"].copy(),
       "tools": request_body["tools"],
       "instructions": request_body.get("instructions"),
   }

   # Append the model's response from that turn
   for item in response_body["output"]:
       resurrection["input"].append(item)

   # If there are subsequent turns to include, append their
   # function_call_output and further exchanges from later request_bodies.
   # Each subsequent request_body.input will have new items appended after
   # the previous response — extract just the NEW items (tool results, etc.)

   # Add the follow-up question
   resurrection["input"].append({
       "role": "user",
       "content": "Why did you assign credit=1.0 to all TPs for the "
                  "broad-except issue? What was your reasoning?"
   })

   # Send to OpenAI
   resp = httpx.post(
       "https://api.openai.com/v1/responses",
       headers={
           "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
           "Content-Type": "application/json",
       },
       json=resurrection,
       timeout=60,
   )
   result = resp.json()

   # Extract the answer
   for item in result.get("output", []):
       if item.get("type") == "message":
           for c in item.get("content", []):
               if c.get("type") == "output_text":
                   print(c["text"])
   ```

4. **Include full context** — for best results, include all turns up to and including the turn of interest. The model needs the same context it had when making the decision.
**Shortcut: include ALL turns up to turn N.** Rather than manually stitching, you can reconstruct the full conversation from sequential `llm_requests` rows. Each `request_body.input` for turn N+1 contains all of turn N's prefix + turn N's response + any new tool results. So the last `request_body` before or at your target turn already has the full prefix. Just append that turn's response and your question.

```python
# To ask about turn 58's decision:
# 1. Load request_body from turn 58 (has full prefix)
# 2. Load response_body from turn 58 (has the model's decision)
# 3. Append response output items to input
# 4. Append your question
# 5. Send to API
```
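Steps 1–4 of the shortcut can be captured in a pure helper that builds the payload without touching the network. A sketch; `build_resurrection` is a hypothetical name, not an existing utility:

```python
def build_resurrection(request_body: dict, response_body: dict,
                       question: str) -> dict:
    """Build a Responses API payload: full prefix + the turn's response + a question."""
    payload = {
        "model": request_body["model"],
        "input": list(request_body["input"]),  # copy: full prefix up to this turn
        "tools": request_body.get("tools", []),
        "instructions": request_body.get("instructions"),
    }
    # Replay the model's own response for the target turn
    payload["input"].extend(response_body.get("output", []))
    # Then ask the follow-up question
    payload["input"].append({"role": "user", "content": question})
    return payload
```

Send the returned dict as the JSON body to `https://api.openai.com/v1/responses`, exactly as in step 3 above.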
### 6. View Grading Results

Check what a grader decided:

```sql
-- Grading edges for a specific critic run
SELECT ge.critique_issue_id,
       COALESCE(ge.tp_id, ge.fp_id) as ground_truth_id,
       CASE WHEN ge.tp_id IS NOT NULL THEN 'TP' ELSE 'FP' END as edge_type,
       ge.credit,
       LEFT(ge.rationale, 120) as rationale
FROM grading_edges ge
WHERE ge.critique_run_id = '<critic_run_id>'
ORDER BY ge.critique_issue_id, edge_type, ground_truth_id;
```

Check grading completeness:

```sql
SELECT COUNT(*) as pending FROM grading_pending
WHERE critique_run_id = '<critic_run_id>';
-- 0 = complete, >0 = still grading
```
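Once the edge rows are fetched, a quick per-issue summary can be computed in Python. A sketch over rows shaped like the first query's columns; the row values below are made up:

```python
from collections import defaultdict

# Rows as (critique_issue_id, ground_truth_id, edge_type, credit) tuples,
# e.g. fetched with a DB driver or pasted from psql output (values made up).
rows = [
    ("issue-1", "tp-1", "TP", 1.0),
    ("issue-1", "tp-2", "TP", 0.5),
    ("issue-2", "fp-1", "FP", 0.0),
]

summary = defaultdict(lambda: {"tp_credit": 0.0, "edges": 0})
for issue_id, _gt_id, edge_type, credit in rows:
    summary[issue_id]["edges"] += 1
    if edge_type == "TP":
        summary[issue_id]["tp_credit"] += credit

for issue_id, s in sorted(summary.items()):
    print(f"{issue_id}: {s['edges']} edges, total TP credit {s['tp_credit']}")
```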
### 7. View Critic Findings

```sql
SELECT ri.issue_id, ri.rationale,
       rio.locations
FROM reported_issues ri
LEFT JOIN reported_issue_occurrences rio
  ON rio.agent_run_id = ri.agent_run_id
  AND rio.reported_issue_id = ri.issue_id
WHERE ri.agent_run_id = '<critic_run_id>';
```
### 8. Cost Analysis

```sql
-- Per-run costs
SELECT agent_run_id, model, total_cost, total_input_tokens, total_output_tokens
FROM llm_run_costs;

-- Per-request breakdown
SELECT id, model, input_tokens, cached_input_tokens, output_tokens, latency_ms
FROM llm_requests
WHERE agent_run_id = '<run_id>'
ORDER BY created_at;
```
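If a run is missing from `llm_run_costs`, a rough cost can be recomputed from the per-request token counts. The prices and the assumption that `cached_input_tokens` is a subset of `input_tokens` are both placeholders to verify against your actual billing data:

```python
# Placeholder per-million-token prices: NOT authoritative, substitute real rates.
PRICES = {
    "gpt-4.1-mini": {"input": 0.40, "cached_input": 0.10, "output": 1.60},
}

def request_cost(model: str, input_tokens: int, cached_input_tokens: int,
                 output_tokens: int) -> float:
    """Estimate one request's cost in dollars.

    Assumes cached_input_tokens is counted inside input_tokens and is
    billed at the cheaper cached rate.
    """
    p = PRICES[model]
    uncached = input_tokens - cached_input_tokens
    return (uncached * p["input"]
            + cached_input_tokens * p["cached_input"]
            + output_tokens * p["output"]) / 1_000_000

print(round(request_cost("gpt-4.1-mini", 10_000, 4_000, 2_000), 6))  # 0.006
```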
## Tips

- Turn IDs are sequential — a lower `llm_requests.id` means earlier in the conversation. Use `ORDER BY created_at` to see chronological order.
- The last request has the longest input — each turn appends to the conversation, so `input_tokens` grows monotonically.
- Reasoning models (o-series, etc.) will have `reasoning` items in output with `summary` arrays showing chain-of-thought.
- gpt-4.1-mini is the default grader model — it doesn't produce reasoning traces but is fast and cheap.
- For speak-with-dead, using the same model preserves behavioral fidelity. Using a stronger model may give better explanations but different reasoning.