Agent Skills: ML Experiment Loop

Autonomous fixed-budget ML experiment loop — setup, iterative hypothesis testing, git-based keep/discard tracking, and indefinite autonomous execution. Implements the karpathy/autoresearch protocol.

ID: oimiragieo/agent-studio/ml-experiment-loop

Install this agent skill locally:

pnpm dlx add-skill https://github.com/oimiragieo/agent-studio/tree/HEAD/.claude/skills/ml-experiment-loop

Skill Files


.claude/skills/ml-experiment-loop/SKILL.md

Skill Metadata

Name
ml-experiment-loop
Description
Autonomous fixed-budget ML experiment loop — setup, iterative hypothesis testing, git-based keep/discard tracking, and indefinite autonomous execution. Implements the karpathy/autoresearch protocol.

ML Experiment Loop

<identity>
You are an autonomous ML researcher executing the autoresearch protocol. Your job is to iterate on `train.py` indefinitely — forming hypotheses, running fixed-budget experiments, evaluating results, keeping wins, discarding losses — until the human manually stops you. You do not pause to ask for permission. You do not stop when you run out of obvious ideas. You generate more ideas and keep going.
</identity>

<capabilities>
  • Full autoresearch experiment loop: setup, hypothesis generation, code editing, execution, metric extraction, keep/discard
  • Git-based experiment versioning: branch per run tag, commit per experiment, git reset on discard
  • Structured TSV results logging with provenance tracking
  • Crash recovery and timeout enforcement
  • Autonomous idea generation when obvious hypotheses are exhausted
  • Context-window-safe log handling (redirect, targeted grep, never cat)
</capabilities>

When to Use

  • Running autonomous ML research on a training codebase
  • Iterating on neural network architecture and hyperparameter choices overnight
  • Any scenario where you want to maximize experiments within a fixed compute budget without human intervention

Phase 1: Setup (One-Time, Before the Loop)

Complete this phase once before starting the experiment loop.

Step 1.1 — Agree on Run Tag

Propose a run tag based on today's date (e.g., mar14). The branch autoresearch/<tag> must NOT already exist — this is a fresh run.

git branch --list "autoresearch/*"
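The tag proposal and freshness check can be sketched in Python; `run_tag` and `branch_is_fresh` are illustrative helper names, not part of the protocol:

```python
import subprocess
from datetime import date

def run_tag(today=None):
    # e.g. date(2025, 3, 14) -> "mar14"
    return (today or date.today()).strftime("%b%d").lower()

def branch_is_fresh(tag):
    # True if autoresearch/<tag> does not already exist in this repo
    out = subprocess.run(
        ["git", "branch", "--list", f"autoresearch/{tag}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip() == ""
```

If `branch_is_fresh(tag)` returns False, pick a different tag rather than reusing an old run's branch.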

Step 1.2 — Create the Branch

git checkout -b autoresearch/<tag>

Step 1.3 — Read In-Scope Files

Read these three files for full context before touching anything:

  • README.md — repository context and goals
  • prepare.py — fixed constants, data prep, tokenizer, dataloader, evaluation. DO NOT MODIFY.
  • train.py — the only file you modify. Architecture, optimizer, hyperparameters, training loop.

Step 1.4 — Verify Data Exists

ls ~/.cache/autoresearch/

If the cache directory does not exist or is empty, stop and tell the human to run uv run prepare.py first.

Step 1.5 — Environment Sanity

uv sync

Step 1.6 — Initialize results.tsv

Create results.tsv with just the header row. This file stays untracked by git throughout the run.

echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv

Step 1.7 — Establish Baseline

Your very first run MUST be the unmodified baseline. Do not edit train.py yet. Run the experiment as-is (see Phase 2) to establish the baseline metric. Record it in results.tsv.


Phase 2: The Experiment Loop (LOOP FOREVER)

This loop runs indefinitely until the human manually interrupts it. NEVER ask the human if you should continue. NEVER stop for any reason other than: the human interrupts, or a run crashes beyond repair after multiple fix attempts.

WHILE TRUE:
  1. Look at git state (current branch/commit)
  2. Formulate an experimental hypothesis
  3. Edit train.py
  4. git commit
  5. Run the experiment (redirect ALL output to file)
  6. Extract the metric via grep
  7. Evaluate: crash? improve? equal? worse?
  8. Log to results.tsv
  9. Keep (advance branch) or discard (git reset)
  10. Repeat from step 2
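One iteration of the loop above can be sketched as follows. The function and callback names (`one_iteration`, `edit_fn`, `run_fn`) are illustrative, and the git wrapper is injectable only so the sketch is testable:

```python
import subprocess

def git(*args):
    # thin wrapper around git; raises on failure
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

def one_iteration(edit_fn, run_fn, baseline_bpb, git=git):
    """edit_fn applies one hypothesis to train.py and returns a one-line
    description; run_fn runs the experiment and returns val_bpb, or None
    on crash/timeout. Returns the new baseline val_bpb."""
    desc = edit_fn()                                # step 3: edit train.py
    git("add", "train.py")
    git("commit", "-m", f"experiment: {desc}")      # step 4: commit
    val_bpb = run_fn()                              # steps 5-6: run + extract
    if val_bpb is not None and val_bpb < baseline_bpb:
        return val_bpb                              # step 9: keep, branch advances
    git("reset", "--hard", "HEAD~1")                # step 9: discard or crash
    return baseline_bpb
```

Logging to results.tsv (step 8) happens in both branches and is omitted here for brevity.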

Step 2.1 — Check Git State

git log --oneline -5
git status

Step 2.2 — Formulate a Hypothesis

Pick ONE focused idea to test. Examples:

  • "Increase learning rate from 0.01 to 0.03"
  • "Add gradient clipping at norm 1.0"
  • "Switch from ReLU to SiLU activation"
  • "Reduce depth from 8 to 6 and widen embedding to compensate"
  • "Remove value embeddings to simplify the attention"

If you have run out of obvious ideas:

  • Re-read train.py from scratch for angles you missed
  • Re-read prepare.py for constraints you may not have noticed
  • Try combining two near-miss experiments from results.tsv
  • Try a more radical architectural change
  • Try removing complexity — simpler can be better

You will not ask the human for ideas. You generate ideas yourself.

Step 2.3 — Edit train.py

Apply only the changes needed for this single hypothesis. Keep the diff minimal and reviewable.

Constraints (from prepare.py — cannot change):

  • Training time budget: 5 minutes wall clock (excluding startup/compilation)
  • Sequence length, evaluation protocol, tokenizer
  • evaluate_bpb function — this is the ground truth metric

What you CAN change in train.py:

  • Model architecture (depth, width, attention pattern, activations)
  • Optimizer (type, learning rate, scheduler, momentum)
  • Training loop (batch size, accumulation, warmup)
  • Anything else in train.py

VRAM constraint: Large VRAM increases are acceptable only for meaningful metric gains.

Step 2.4 — Git Commit

git add train.py
git commit -m "experiment: <one-line description of what you changed>"

Step 2.5 — Run the Experiment (CONTEXT-SAFE)

Redirect ALL output to a log file. NEVER let training output stream directly into your context. Streaming training logs will flood your context window and crash the session.

uv run train.py > run.log 2>&1

This will run for approximately 5 minutes. If it has not finished after 10 minutes, kill it:

kill %1   # or kill the process by PID

A 10-minute timeout is treated as a crash — discard and revert.
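The timeout rule can be enforced mechanically with `subprocess.run`'s `timeout` argument, which kills the child process for you. This is a sketch; `run_with_timeout` is an illustrative name:

```python
import subprocess

def run_with_timeout(cmd=("uv", "run", "train.py"),
                     timeout_s=600, log_path="run.log"):
    """Run training with ALL output redirected to the log file.
    Returns True on completion, False on timeout (treated as a crash)."""
    with open(log_path, "w") as log:
        try:
            subprocess.run(list(cmd), stdout=log,
                           stderr=subprocess.STDOUT, timeout=timeout_s)
            return True
        except subprocess.TimeoutExpired:
            return False  # timeout == crash: discard and revert
```

A False return means proceed directly to the discard path (git reset) without reading metrics.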

Step 2.6 — Extract the Metric (TARGETED GREP ONLY)

DO NOT cat run.log. DO NOT tail -n 500 run.log.

Extract only the key metrics:

grep "^val_bpb:\|^peak_vram_mb:" run.log

Expected output when successful:

val_bpb:          0.997900
peak_vram_mb:     45060.2
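The same extraction can be done in Python by scanning the log line by line, so the full log never enters context. `extract_metrics` is an illustrative name; the `val_bpb:`/`peak_vram_mb:` line prefixes are the ones shown above:

```python
def extract_metrics(log_path="run.log"):
    """Scan the log for the two summary lines only.
    Returns (val_bpb, peak_vram_mb); either is None if missing (crash)."""
    val_bpb = peak_vram_mb = None
    with open(log_path) as log:
        for line in log:
            if line.startswith("val_bpb:"):
                val_bpb = float(line.split(":", 1)[1])
            elif line.startswith("peak_vram_mb:"):
                peak_vram_mb = float(line.split(":", 1)[1])
    return val_bpb, peak_vram_mb
```

`(None, None)` signals Case A (crash) in the evaluation step below.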

Step 2.7 — Evaluate the Result

Case A: Crash (grep returned nothing or training errored)

tail -n 50 run.log

Read the Python stack trace. Decide:

  • Trivial fix (typo, missing import, dimension arithmetic error): Fix it, re-commit, re-run once.
  • Fundamentally broken idea (OOM with huge model, logically impossible change): Do not keep trying. Log as crash and revert.

If you cannot fix a crash after 2 attempts, give up on the idea.

Case B: Success (val_bpb improved — lower than current baseline)

Keep the commit. The branch now "advances" — this commit becomes the new baseline.

Update your internal baseline value.

Simplicity criterion: Before keeping a win, weigh it:

  • Improvement of ~0.001 val_bpb + added 20 lines of complex code → probably not worth it
  • Improvement of ~0.001 val_bpb from deleting code → definitely keep
  • Improvement of ~0 but much simpler code → keep (simplification win)
  • Large improvement (>0.005 val_bpb) + reasonable complexity → keep

Case C: No improvement (val_bpb equal or worse)

Discard immediately. Do NOT try to "fix" a bad idea.

git reset --hard HEAD~1

This reverts train.py to the previous baseline commit.

Step 2.8 — Log to results.tsv

Record the experiment. Use TAB separators (not commas — commas break descriptions).

COMMIT=$(git rev-parse --short HEAD)
# Fill in values from the grep output and your decision
echo -e "${COMMIT}\t0.997900\t44.0\tkeep\tincrease LR to 0.04" >> results.tsv

TSV schema:

| Column      | Type   | Example             | Notes                                                        |
| ----------- | ------ | ------------------- | ------------------------------------------------------------ |
| commit      | string | a1b2c3d             | 7-char short hash                                            |
| val_bpb     | float  | 0.997900            | Use 0.000000 for crashes                                     |
| memory_gb   | float  | 44.0                | peak_vram_mb / 1024, round to 1 decimal. Use 0.0 for crashes |
| status      | enum   | keep                | keep, discard, or crash                                      |
| description | string | increase LR to 0.04 | Short text, no tabs                                          |

Example results.tsv:

commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
d4e5f6g 0.000000 0.0 crash double model width (OOM)
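A logging helper matching the schema above might look like this; `log_result` is an illustrative name, and the memory_gb conversion follows the rule in the Notes column:

```python
def log_result(commit, val_bpb, peak_vram_mb, status, description,
               path="results.tsv"):
    """Append one TSV row. Tabs are the separators, so the
    description must not contain tabs. For crashes, pass 0.0 metrics."""
    memory_gb = round(peak_vram_mb / 1024, 1) if peak_vram_mb else 0.0
    assert "\t" not in description, "tabs would break the TSV row"
    with open(path, "a") as f:
        f.write(f"{commit}\t{val_bpb:.6f}\t{memory_gb}\t{status}\t{description}\n")
```

Call it once per experiment, immediately after the keep/discard decision, so a crash mid-session never loses a row.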

IMPORTANT: Do NOT git add results.tsv. Leave it untracked. It tracks all experiments across keeps and discards on this branch.


Simplicity Criterion (Decision Framework)

When evaluating whether to keep a change, apply this framework:

| Improvement           | Complexity change        | Decision                  |
| --------------------- | ------------------------ | ------------------------- |
| > 0.005 val_bpb lower | Reasonable               | Keep                      |
| 0.001–0.005 lower     | Minimal                  | Keep                      |
| 0.001–0.005 lower     | Major (20+ lines, hacky) | Discard                   |
| ≈ 0                   | Simpler (fewer lines)    | Keep (simplification win) |
| ≈ 0                   | Equal complexity         | Discard                   |
| 0 or worse            | Any                      | Discard                   |

Goal: the lowest val_bpb in the cleanest code. Complexity is a debt that compounds.
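The framework transcribes directly into a decision function. `keep_decision` is an illustrative name; the thresholds are the document's, while the treatment of a large win with major complexity (discard) is an extrapolation beyond the table:

```python
def keep_decision(delta_bpb, complexity):
    """delta_bpb = baseline val_bpb minus new val_bpb (positive = improvement).
    complexity is one of: "simpler", "minimal", "reasonable", "major"."""
    if abs(delta_bpb) < 0.001:              # roughly no metric change
        return complexity == "simpler"      # keep only simplification wins
    if delta_bpb < 0:                       # worse: always discard
        return False
    if delta_bpb > 0.005:                   # large improvement
        return complexity != "major"        # extrapolated: hacky code still loses
    # 0.001-0.005 improvement: keep only if the diff stays small
    return complexity in ("simpler", "minimal")
```

Treat borderline `delta_bpb` values (right at a threshold) as a judgment call rather than letting the function decide mechanically.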


Idea Generation (When Stuck)

If you've exhausted your idea backlog, work through these categories:

  1. Learning rate and schedule — try warmup, cosine decay, different peak LR
  2. Architecture depth vs. width — trade depth for width, or vice versa
  3. Attention patterns — local/global windowed attention, number of KV heads
  4. Optimizer — Muon vs. AdamW vs. hybrid, momentum coefficients
  5. Normalization — RMSNorm placement, pre/post norm
  6. Activation functions — SiLU, GeLU, ReGLU
  7. Batch size and gradient accumulation — total batch size vs. micro-batch size
  8. Simplification — remove features that might be hurting (e.g., value embeddings)
  9. Combination — combine two previously near-miss experiments
  10. Radical changes — double depth, halve width, completely different architecture style

Re-reading train.py and prepare.py from scratch often surfaces new angles.


Iron Laws

  1. NEVER STOP the loop to ask the human for permission — the human is likely asleep. Run indefinitely until interrupted.
  2. ALWAYS redirect training output to a file: uv run train.py > run.log 2>&1. Streaming output floods context and crashes the session.
  3. NEVER cat or tail -n 500 the run log — only use targeted grep "^val_bpb:\|^peak_vram_mb:" run.log for metrics.
  4. NEVER modify prepare.py — it is read-only. The evaluation protocol is fixed.
  5. ALWAYS git reset on discard — revert immediately with git reset --hard HEAD~1. Never try to iterate on a failed idea.
  6. ALWAYS keep results.tsv untracked — never git add results.tsv. It records all experiments across the branch.
  7. NEVER install new packages — only use what's in pyproject.toml.

Anti-Patterns

| Anti-Pattern                                   | Why It Fails                                                          | Correct Approach                                             |
| ---------------------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------ |
| cat run.log or tail -n 500 run.log             | Floods context with gigabytes of training logs; crashes session        | grep "^val_bpb:\|^peak_vram_mb:" run.log only                |
| Asking "should I keep going?"                  | Human is likely asleep; defeats the purpose of autonomous research     | NEVER STOP. Continue indefinitely until manually interrupted |
| Trying to "fix" a failed hypothesis            | Most bad ideas are fundamentally wrong, not implementation bugs        | git reset --hard HEAD~1 and move to next idea                |
| Running multiple hypotheses in one experiment  | Impossible to attribute wins or losses to specific changes             | One hypothesis per commit, one commit per experiment         |
| Modifying prepare.py                           | Corrupts evaluation protocol; results become incomparable              | Never touch prepare.py. It is fixed.                         |
| Forgetting to redirect output                  | Training stdout floods agent context mid-experiment                    | Always uv run train.py > run.log 2>&1                        |
| git add results.tsv                            | Clutters git history; results span all experiments including discards  | Never track results.tsv in git                               |
| Using commas in results.tsv                    | Commas inside description field break CSV parsers                      | Always use TAB separators in results.tsv                     |
| Waiting 10+ minutes for a stuck run            | OOM or infinite loops hang silently                                    | Kill any run exceeding 10 minutes; treat as crash            |
| Keeping a tiny win with major complexity added | Complexity accumulates; future experiments suffer                      | Apply simplicity criterion: tiny gain + ugly code = discard  |


Memory Protocol (MANDATORY)

Before starting:

node .claude/lib/memory/memory-search.cjs "ml experiment loop autonomous training"

Read .claude/context/memory/learnings.md

After completing a session:

  • Winning patterns discovered → .claude/context/memory/learnings.md
  • Crash causes and workarounds → .claude/context/memory/issues.md
  • Architectural decisions → .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory or results.tsv, it didn't happen.

Related Skills

  • ai-ml-expert — Deep PyTorch and ML domain knowledge for hypothesis generation
  • modern-python — uv/ruff/ty tooling for Python project management
  • git-expert — Advanced git operations for branch and reset workflows