Compute Strategy — backend selection and iteration discipline
You are about to run a compute job. Before submitting anything, walk this skill.
This skill answers three questions for any compute-bearing task:
- Which backend? HPC, cloud (Vast.ai/etc.), or local?
- Within that backend, which partition / queue / instance? Testing, production, long-running?
- What's the iteration loop? How do you avoid wasting hours on a config that was never going to work?
The framework below is backend-agnostic. The backends/ subdirectory contains per-backend pages with the concrete partition/queue/instance details — read the relevant one(s) when picking specifics.
Project-level rules win. If the project's
AGENTS.mdorCLAUDE.mdsays "always use CCM, never call vastai directly" or "this group has a CU Alpine allocation, use it for production," respect that. This skill is the framework; the project tells you the authority gradient.
Core principle: iteration speed beats compute size
A failed cheap-partition job costs ~20 min. A failed production job costs 6–24 hr + queue politics.
The moment you understand this, the rest of the skill follows. Always validate on the cheapest partition that can run real physics, before scaling. This isn't optional discipline — it's the default workflow. Skip it only when the config is byte-identical to a previously-validated one.
Backend selection — decision tree
Walk the questions in order. Stop at the first match.
1. Is this a sub-30-minute CPU job (file munging, lint, plot)?
→ LOCAL. Don't pay setup overhead.
2. Is this a < 1-hour GPU job, or a smoke test of a new config?
→ CHEAPEST GPU AVAILABLE: testing partition on HPC if accessible,
otherwise short Vast.ai instance. (~$0.20–0.50)
3. Is this a 1–8 hour GPU job, single-shot, no chained dependencies?
→ IF urgent or HPC busy: Vast.ai (instant queue, ~$0.50–4 total).
ELSE: HPC production partition (free with allocation).
4. Is this an 8–24 hour GPU job?
→ HPC production partition. Cost balance favors HPC for jobs > ~6 hr.
5. Is this a > 24 hour single-shot job?
→ Prefer chained HPC jobs with restart. Use long-QoS only if
restart isn't viable (rare for MD; common for some QM).
6. Is this an embarrassingly-parallel ensemble (50+ independent jobs)?
→ JOB ARRAYS on HPC if it fits. Vast.ai if HPC is throttling concurrency.
Multi-backend split if budget + capacity allow.
7. Is the primary backend unavailable (network, VPN, allocation expired)?
→ Fall over to next-best backend per the table above. Document the
fallback. Don't wait passively for primary if work can proceed.
If none of the above fits, stop and ask the user — that's a signal the job has unusual characteristics worth a human-in-the-loop discussion.
The universal iteration loop
Every backend, every job size, follows this loop. The partition names change; the structure doesn't.
┌─ STAGE 0 Local sanity (< 5 min, free)
│ dry-run, file manifest, syntax lint
└─ ↓ if clean
┌─ STAGE 1 Smoke on cheapest GPU (~20 min, ≈ free)
│ 5–10 ps run, real physics, real GPU
│ Pass criteria: exit 0, finite energies, restart written,
│ throughput in expected ballpark
└─ ↓ if pass
┌─ STAGE 2 Single-system production (hours)
│ Real run, smallest realistic input
│ Catches drift, density, long-time bugs
└─ ↓ if pass
┌─ STAGE 3 Fan-out / production (full campaign)
│ Parallel jobs, real compute, real cost
↑ if any stage fails: don't escalate to a bigger partition.
Drop *back down* to the smoke partition, fix, re-smoke, then continue.
The big partition is read-only for validated configs. Debugging substrate is the smoke partition.
What a "smoke test" actually is
Same config as production, but numsteps (or max_iter, or nsteps, etc.) reduced so the job completes in 10–30 minutes. Same physics: same timestep, same thermostat, same force field, same parallelism settings, same output frequency. Only difference is the run length.
If smoke passes, the only untested behaviors are long-time ones (drift, slow equilibration, rare events). The bug classes filtered by a smoke test:
- File / parameter parsing errors
- Initial-geometry clashes (NaN energies in step 1–10)
- Output writing failures
- GPU memory issues
- Throughput in wrong ballpark (signals broken parallelism or missed flag)
- Restart file write failures
That's ~95% of failure modes for ~20 minutes of testing-partition time.
Submission hygiene — universal
Always
- Dry-run before push. Whatever local lint or
--dry-runflag the deploy tooling has, run it. - Check the queue first.
squeue -u <me>before submitting; never stack a duplicate job. - Notify on completion.
--mail-type=END,FAILon SLURM; equivalent on other backends. Avoid polling. - Log job IDs at submit time. Drop into a tracking doc (project-level
STATUS.md, playbook, or stage-status table). Don't rely on memory. - Use job arrays for fan-out.
--array=1-N%Ccaps concurrency at C. Friendly to shared queues. - Chain stages with dependencies.
--dependency=afterok:$JOBIDover manual orchestration. - Restart-friendly checkpointing. For long jobs, write restart files often enough that a walltime hit costs ≤ 1 hr of redo.
Never
- ❌ Submit a production job without smoke first. The whole framework collapses.
- ❌ Submit while a sibling job for the same target is queued or running.
- ❌ Keep submitting after 3 consecutive same-error failures. Stop, ask the user.
- ❌ Hand-edit files on a remote backend. Fixes belong in the project repo, redeployed via the deploy tooling.
- ❌
scancel(or equivalent) without reading the log first. Even failed jobs have data. - ❌ Poll long jobs in tight loops. Use mail-type notifications + scheduled wakeups, never
while true; do squeue; sleep 10; done. - ❌ Escalate "the queue was slow" → "use long QoS." Slow queue is a load signal, not a config bug.
- ❌ Run multiple smokes in parallel hoping one will work. Fix the first failure first.
Failure → debug protocol
When any stage fails, walk this exact sequence. Don't skip steps.
- Pull the log. Most recent
.outand.err(or backend equivalent),tail -200. - Diagnose locally. Read the last error. Cross-reference against the project's "common failure modes" table (every project should have one — if it doesn't, add it). If the symptom is new, document it after diagnosing.
- Reproduce on the smoke partition if at all possible. The smoke partition is the debugging substrate. Don't iterate against the production partition.
- Fix the source in the repo. Edit the config / script / input locally, in the project's git tree. Hand-edits on a remote backend that don't survive a redeploy are tech debt.
- Re-smoke. Stage 1 again. Then Stage 2. Then resume the campaign.
- Append to a "lessons learned" log. One line: date, what failed, what fixed it. Future agents (and future-you) save real time.
Waiting cadence
How long the job takes determines whether you wait, schedule, or detach.
| Job class | Wall | Strategy |
|---|---|---|
| Smoke | ~20 min | Wait in-conversation if no other work. Otherwise ScheduleWakeup ~1500 s. |
| Short production | 1–4 hr | Submit, log JOBID, advance. Re-check on user prompt or notification. |
| Long production | 4–24 hr | Fully detached. --mail-type=END,FAIL. Never poll. |
| Decorrelation / multi-day | 1+ days | Detached. Periodic sacct (or backend equivalent) summaries; don't watch. |
Never sleep && squeue busy-wait. Use the backend's notification channel + wakeup primitives.
Multi-backend workflows
Some campaigns span backends — local prep + cloud GPU + HPC long-runs. The pattern:
| Phase | Compute load | Default backend | |---|---|---| | Literature search, structure prep, FF acquisition | CPU, light | LOCAL | | Structure relaxation, equilibration | GPU, < 1 hr | LOCAL GPU or short cloud | | Equilibration, single-system NPT | GPU, 1–8 hr | HPC or cloud | | Production trajectory (long) | GPU, 8–24 hr | HPC (free) or cloud (urgent) | | Ensemble fan-out (many short) | GPU, embarrassingly parallel | HPC arrays + cloud overflow | | Analysis, plotting, manuscript | CPU, light | LOCAL |
Data handoffs are usually rsync (HPC ↔ local) or scp over the cloud instance's public IP. Plan handoffs explicitly — files that need to move between backends are first-class steps in the workflow, not afterthoughts.
When to involve the user
Escalate (don't silently retry) on:
- Two consecutive smoke failures with the same error after a fix attempt.
- An unfamiliar bug not in the project's failure-modes catalog and you don't have high confidence in the fix.
- Compute cost about to exceed a previously agreed budget (cloud) or use > N% of a quarterly allocation (HPC).
- Any destructive action (canceling > 1 job, deleting remote data, restarting a campaign from scratch).
- A choice between two reasonable backends with materially different cost/latency profiles — the human gets to set the trade-off.
Companion skills — composes with
This skill answers WHERE to run. Two siblings answer the other questions:
compute-validation/SKILL.md— IS IT READY. Verification (reasoning, no compute) + smoke loop (cheap-compute measurement + extrapolation) before committing to production. Read this before any expensive submission.campaign-orchestration/SKILL.md— HOW TO MANAGE LONG EXECUTION. Stateless agents over per-campaign WORKFLOW.md files. Read this when a campaign has multiple stages over hours-to-days.
Read all three when driving a non-trivial compute campaign.
Backend-specific details
Each available backend has a page in backends/ with concrete partition/queue/instance details. Read the relevant one before picking specifics.
| Backend | Page | Use for |
|---|---|---|
| CU Boulder Alpine HPC | backends/alpine.md | Production GPU work, free with CU allocation |
| ALCF Polaris | backends/polaris.md | A100 GPU HPC (PBS Pro), free with DOE allocation; high-throughput MD; readiness vehicle toward ALCC/INCITE |
| ALCF Crux | backends/crux.md | CPU HPC (dual EPYC Rome, 128 cores/node, PBS Pro), free with DOE allocation; large/parallel CPU MD, bulk core-hours |
| Vast.ai | backends/vast-ai.md | On-demand GPU, no queue, pay per hour |
| Local (WSL2 / native) | backends/local.md | Sub-hour CPU/GPU work, debugging, prep |
To add a new backend, copy backends/TEMPLATE.md and fill in the partition/queue/instance facts. Adding a backend should not require any change to this SKILL.md — it's framework, the backend page is data.
Project-level layering
This skill is the framework. Projects are encouraged to add their own thin specialization layer:
- Project
AGENTS.md/CLAUDE.md: which backends are authorized, any group-specific overrides ("never call vastai directly, use CCM"), per-project budget caps. - Project playbook (e.g.,
simulations/<campaign>/HPC_PLAYBOOK.md): the applied version of this skill for one specific campaign — concrete commands, the stage-status tracking table, the lessons-learned log, the "common failure modes" catalog for that project's physics.
The skill provides the standards and the decision framework. The project provides the application.
A canonical example of the project-level layer (slab Thrust 4 in the Pt-NEC LOHC project): ~/work/research/hydrogenation/simulations/surfaces/HPC_PLAYBOOK.md.
Quick checklist before any compute submission
Use this as a self-check before you hit submit:
- [ ] Have I run the local dry-run / lint?
- [ ] Have I picked the backend via the decision tree?
- [ ] Have I checked the backend page for current partition/queue rules?
- [ ] If this is a new or modified config, have I smoke-tested it?
- [ ] If smoke passed, have I picked the right partition for production?
- [ ] Is
--mail-type=END,FAIL(or equivalent) set? - [ ] Have I checked
squeue/ equivalent for duplicates? - [ ] Will my walltime + restart frequency survive a 24-hr cap?
- [ ] Have I logged the job ID into the project's tracking doc?
- [ ] Do I have a plan for what happens if it fails (which log to read, where the failure-modes catalog is)?
If any of those is "no" or "I don't know," fix that before submitting.