# End-to-End Testing

End-to-end tests verify ikigai behavior through its control socket (`ikigai-ctl`). Each test is a self-contained JSON file. Tests run sequentially — they share a single ikigai instance.
## Test File Location

Tests live in `tests/e2e/`. Run order is defined by `tests/e2e/index.json` — a JSON array of test filenames in execution order. When asked to "run the e2e tests", read `index.json` and execute each listed test file sequentially.
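For illustration only, an `index.json` might look like this (the filenames are hypothetical):

```json
[
  "basic-chat.json",
  "model-switch.json",
  "clear-command.json"
]
```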
## Execution Modes

| Mode | Backend | Steps | Assertions |
|------|---------|-------|------------|
| mock | `bin/mock-provider` | all steps, including `mock_expect` | `assert` + `assert_mock` |
| live | real provider (Anthropic, OpenAI, Google) | `mock_expect` steps skipped | `assert` only |

Every test always includes `mock_expect` steps and `assert_mock` assertions — tests are written once and run in either mode. In live mode, `mock_expect` steps are skipped and `assert_mock` is not evaluated.
## JSON Schema

```json
{
  "name": "human-readable test name",
  "steps": [
    ...
  ],
  "assert": [
    ...
  ],
  "assert_mock": [
    ...
  ]
}
```

- `name` — describes what the test verifies
- `steps` — ordered list of actions to execute
- `assert` — assertions checked in ALL modes
- `assert_mock` — assertions checked only in mock mode
## Step Types
### send_keys

Send keystrokes to ikigai via `ikigai-ctl send_keys`.

```json
{"send_keys": "/model gpt-5-mini\\r"}
```

Include `\\r` at the end of the string to submit. Use `ikigai-ctl send_keys` escaping conventions.
### read_framebuffer

Capture the current screen contents via `ikigai-ctl read_framebuffer`. The captured state is what assertions run against.

```json
{"read_framebuffer": true}
```

Always `read_framebuffer` before asserting. Each `read_framebuffer` replaces the previous capture.
### wait

Pause for N seconds. Use after `send_keys` to allow UI updates or LLM responses.

```json
{"wait": 0.5}
```

- After UI commands (`/model`, `/clear`): 0.5 seconds
- After sending a prompt to the LLM: 3-5 seconds
### wait_idle

Wait until the current agent becomes idle (ready for input), or until the timeout elapses. Calls `ikigai-ctl wait_idle <timeout_ms>`.

```json
{"wait_idle": 10000}
```

- Value is `timeout_ms` (integer milliseconds)
- Exit code 0 = agent became idle; exit code 1 = timed out (report FAIL)
- Use instead of `{"wait": N}` after sending prompts to the LLM
- Keep `{"wait": 0.5}` for UI-only commands (`/clear`, `/model`) that don't trigger the LLM
### mock_expect

Configure the mock provider's next-response queue. Sends a POST to `/_mock/expect`. Skipped in live mode.

```json
{"mock_expect": {"responses": [{"content": "The capital of France is Paris."}]}}
```

The object is sent as the JSON body to `/_mock/expect`. The `responses` array is a FIFO queue — each LLM request pops the next entry. Entries contain either `content` (text) or `tool_calls` (array), never both. Must appear before the `send_keys` that triggers the LLM call.
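The FIFO semantics can be sketched in Python (an illustrative model only, not the mock provider's actual implementation):

```python
from collections import deque

# Illustrative model of the mock provider's response queue (FIFO).
queue = deque()

def expect(body):
    """Models POST /_mock/expect: enqueue the listed responses."""
    queue.extend(body["responses"])

def next_response():
    """Models an incoming LLM request: pop the oldest queued entry."""
    return queue.popleft()

expect({"responses": [{"content": "first"}, {"content": "second"}]})
print(next_response())  # {'content': 'first'}
print(next_response())  # {'content': 'second'}
```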
## Assertion Types

Assertions run against the most recent `read_framebuffer` capture. The framebuffer response contains a `lines` array; each line has `spans` with `text` fields. Concatenate span texts per row to reconstruct screen content.
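The reconstruction step can be sketched as follows (the payload shape follows the description above; the field values are hypothetical):

```python
import json

# A hypothetical read_framebuffer payload with the lines/spans/text shape.
payload = json.loads(
    '{"lines": ['
    '{"spans": [{"text": "model: "}, {"text": "gpt-5-mini"}]},'
    '{"spans": [{"text": "> "}]}'
    ']}'
)

# Concatenate span texts per row to reconstruct screen content.
rows = ["".join(span["text"] for span in line["spans"]) for line in payload["lines"]]
print(rows)  # ['model: gpt-5-mini', '> ']
```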
### contains

At least one row contains the given substring.

```json
{"contains": "gpt-5-mini"}
```
### not_contains

No row contains the given substring.

```json
{"not_contains": "error"}
```
### line_prefix

At least one row starts with the given prefix (after trimming leading whitespace).

```json
{"line_prefix": "●"}
```
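Taken together, the three assertion types reduce to simple predicates over the reconstructed rows (a sketch; the sample rows are hypothetical):

```python
# Sketch of the three assertion types over reconstructed framebuffer rows.
def contains(rows, needle):
    return any(needle in row for row in rows)

def not_contains(rows, needle):
    return all(needle not in row for row in rows)

def line_prefix(rows, prefix):
    # Prefix match after trimming leading whitespace.
    return any(row.lstrip().startswith(prefix) for row in rows)

rows = ["  ● The capital of France is Paris.", "> "]
print(contains(rows, "Paris"))      # True
print(not_contains(rows, "error"))  # True
print(line_prefix(rows, "●"))       # True
```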
## Running Tests — Direct Execution, No Scripts

Every step must be executed directly using individual tool calls — never via a script, the runner, or any programmatic wrapper. This applies to mock mode, live mode, single tests, and full suite runs.

Why: the reason you are asked to run e2e tests (instead of the user running `tests/e2e/runner` themselves) is that direct execution lets you observe every response and react to unexpected behavior — crashes, garbled output, timing issues. A script executes steps mechanically and silently masks errors. The fully automated scripted runner already exists for when scripted execution is appropriate; when the user asks you to run e2e tests, they want the manual path precisely because the runner is not sufficient for what they're investigating.
Procedure for each test file:

- Read the JSON file
- Determine mode (mock or live) from context — mock if ikigai is connected to `mock-provider`, live otherwise
- Execute each step in order, one tool call per step:
  - `send_keys`: run `ikigai-ctl send_keys "<value>"`
  - `wait`: `sleep N`
  - `wait_idle`: run `ikigai-ctl wait_idle <value>`, fail if exit code is 1
  - `read_framebuffer`: run `ikigai-ctl read_framebuffer`, store the result
  - `mock_expect`: in mock mode, `curl -s 127.0.0.1:<port>/_mock/expect -d '<json>'`; in live mode, skip
- After all steps, evaluate assertions:
  - Always evaluate `assert`
  - In mock mode, also evaluate `assert_mock`
- Report PASS or FAIL with evidence (cite relevant framebuffer rows)
## Running Large Test Batches with Sub-Agents
When asked to run a large number of e2e tests (more than 20), in either mock or live mode, divide the work across sub-agents running serially (one after the next, never in parallel):
- Read `tests/e2e/index.json` to get the full ordered list of test files
- Partition the list into chunks of at most 20 tests each
- Launch one sub-agent per chunk, sequentially — wait for each to complete before launching the next
- Each sub-agent receives: its assigned test files (in order), the ikigai socket path, the mock provider port (if mock mode), and the full contents of this skill file (`/load e2e-testing` or inline the text). The sub-agent needs the complete context — step types, assertion types, execution rules, key rules — to execute correctly. Without it, the sub-agent will improvise and introduce errors. Do not pre-read the test files yourself — pass only the filenames and let the sub-agent read them.
- Collect pass/fail results from each sub-agent and summarize at the end
Why serially: Tests share a single ikigai instance. Running sub-agents concurrently would interleave keystrokes and framebuffer reads across tests, corrupting results.
Why chunked: Tests consume context window space. Chunking prevents any single agent from exhausting its context window mid-run.
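The partitioning step above can be sketched as (test filenames are hypothetical):

```python
# Partition an ordered test list into chunks of at most `size` tests,
# preserving execution order within and across chunks.
def chunk(tests, size=20):
    return [tests[i:i + size] for i in range(0, len(tests), size)]

tests = [f"test-{n:03d}.json" for n in range(45)]  # hypothetical filenames
batches = chunk(tests)
print([len(b) for b in batches])  # [20, 20, 5]
```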
## Key Rules

- Never start ikigai — the user manages the instance
- Never use the runner script (`tests/e2e/runner`) — it exists for CI/automated use. When the user asks you to run tests, they want direct execution so they can see every step and every response.
- Never use any script or programmatic wrapper — no Ruby, no shell loops, no automation of any kind. One tool call per step. This applies equally to sub-agents executing chunked batches.
- One test file = one test — self-contained, no dependencies on other test files
- Steps execute in order — sequential, never parallel
- Always `read_framebuffer` before asserting — assertions reference the last capture
- Never chain anything after `wait_idle` — `wait_idle` must always be the last command in a Bash tool call. If it succeeds (exit code 0) and a subsequent command fails, the overall exit code of 1 is indistinguishable from `wait_idle` timing out. Run `read_framebuffer` in a separate Bash tool call after `wait_idle` completes.
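The exit-code ambiguity behind the last rule can be demonstrated with stand-in commands (`true` standing in for a `wait_idle` that succeeds, `false` for a failed follow-up or a timed-out `wait_idle`):

```shell
# A chained pipeline reports only the last failure's exit code.
sh -c 'true && false'; echo "exit: $?"   # exit: 1 (follow-up command failed)
sh -c 'false'; echo "exit: $?"           # exit: 1 (first command "timed out")
# Both cases report 1, so a chained caller cannot tell which command failed.
```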
## Example: UI-only test

```json
{
  "name": "no model indicator on fresh start",
  "steps": [
    {"read_framebuffer": true}
  ],
  "assert": [
    {"contains": "(no model)"}
  ]
}
```
## Example: mock provider test

```json
{
  "name": "basic chat completion via mock provider",
  "steps": [
    {"mock_expect": {"responses": [{"content": "The capital of France is Paris."}]}},
    {"send_keys": "What is the capital of France?\\r"},
    {"wait": 3},
    {"read_framebuffer": true}
  ],
  "assert": [
    {"line_prefix": "●"}
  ],
  "assert_mock": [
    {"contains": "The capital of France is Paris."}
  ]
}
```
## Example: model switching test

```json
{
  "name": "set model to gpt-5-mini with low reasoning",
  "steps": [
    {"send_keys": "/clear\\r"},
    {"wait": 0.5},
    {"send_keys": "/model gpt-5-mini/low\\r"},
    {"wait": 0.5},
    {"mock_expect": {"responses": [{"content": "Mock response from gpt-5-mini."}]}},
    {"send_keys": "Hello\\r"},
    {"wait_idle": 10000},
    {"read_framebuffer": true}
  ],
  "assert": [
    {"contains": "gpt-5-mini/low"},
    {"contains": "low effort"},
    {"line_prefix": "●"}
  ],
  "assert_mock": [
    {"contains": "Mock response from gpt-5-mini."}
  ]
}
```