Campaign Orchestration — stateless agents, durable workflow files
A practical pattern for managing many long-running compute campaigns (think: 10+ HPC simulations, each a multi-stage pipeline, each stage taking hours to days) without dedicating a persistent agent to each one.
Core idea
State lives in
WORKFLOW.mdfiles. Agents are stateless workers that tick over the files.
Each campaign has a WORKFLOW.md that declares its stages, current position, job IDs, action queue, failure history. An orchestration tick reads all the files under a search root, advances or debugs each one, writes the files back, and exits. Next tick is a fresh agent that picks up where it left off. The file is the state machine; the agent is a function (file, time) → (new file, side effects).
This is the same pattern that powers Argo Workflows, Airflow, and anything else that has to survive worker restarts.
When this skill is the right pattern
- You have N ≥ 2 campaigns (otherwise just drive interactively).
- Stages are long (hours to days). Interactive driving wastes context.
- Failures are expected but diagnosable — the agent can read a log, infer a fix, retry.
- You want a single dashboard view (
cat WORKFLOW.md) without building a UI. - You want to escalate to a human only on novel or destructive situations.
If your work is one short campaign, skip this — interactive is fine. If failures are catastrophic / irreversible, this isn't the right substrate.
The WORKFLOW.md schema
One file per campaign. YAML frontmatter for machine-readable state, Markdown sections for human-readable detail.
---
campaign: <slug> # short identifier, must match dirname
project: <project-name> # e.g. hydrogenation
created: 2026-05-05
last_tick: 2026-05-05T20:10:00Z # updated each orchestration tick
status: <enum> # see Status section
current_stage: <stage-name> # the first non-done stage
escalation_required: false # true halts ticks until cleared
escalation_reason: "" # one-line summary if escalated
budget:
backend: alpine # alpine | vast-ai | local
cap_usd: 0 # 0 for free Alpine; cumulative cap for paid
spent_usd: 0 # updated post-stage
notify:
on_advance: false # ping user when stage advances
on_escalate: true # ping user when blocked
references: # files agents should read for context
- path: AGENTS.md
section: HPC operations
- path: simulations/surfaces/HPC_PLAYBOOK.md
---
# <Campaign Title>
One-paragraph purpose. Why this campaign exists, what it produces.
## Stages
| # | Name | Entry condition | Exit condition | Status | JobID | Started | Completed | Retries |
|---|-------------------|-----------------------|-------------------------------|----------|----------|-------------|-------------|---------|
| 0 | dry-run | always | manifest OK | done | — | YYYY-MM-DD | YYYY-MM-DD | 0 |
| 1 | smoke | stage 0 done | exit=0, restart, no NaN | done | 12345678 | YYYY-MM-DD | YYYY-MM-DD | 1 |
| 2 | npt-prod | stage 1 done | npt_equil.restart.xsc, ρ ≈ ρ* | running | 12345700 | YYYY-MM-DD | — | 0 |
| ... |
Each row tracks one stage. **Entry condition** is what must be true before this stage can submit. **Exit condition** is the validation predicate. **Status** is the enum below. **Retries** counts attempts at this stage (escalation triggers at retries ≥ 2 for the same error).
## Status enum
| Status | Meaning | Next tick action |
|-----------|------------------------------------------------------------|------------------|
| `pending` | not yet eligible (entry condition not met) | check entry condition; advance to `ready` if met |
| `ready` | entry condition met, not yet submitted | submit; advance to `running` |
| `running` | submitted to scheduler, job in flight | poll status; on completion validate exit and advance to `done` or `failed` |
| `done` | exit condition validated | move on to next stage |
| `failed` | scheduler reported failure OR exit condition not met | diagnose, attempt one fix, retry once → `ready`; on second same-error → `escalated` |
| `escalated` | requires human | no-op, log timestamp |
## Action queue
Free-form bullets describing what the agent should consider on the next tick. The agent updates this section as it works. Example:
- [ ] Job 12345700 submitted at 19:55. Re-check status next tick.
- [ ] If job completes cleanly: rsync npt_equil.restart.* back locally, then deploy stage 3.
- [ ] If job fails with "OOM": switch to gh200 partition. (Pre-approved fallback.)
## Failure log
Append-only. Each failure adds a row.
| Date | Stage | Error class | Error msg (1 line) | Action taken | Result | |------|-------|-------------|---------------------|--------------|--------| | 2026-05-05 | smoke | velocity-limit | Atom 19658 v=14658 > 12000 | minimize 1000 → 10000 | RESOLVED |
## Lessons learned
Free-form notes. Append as discovered. Useful for "next time we hit this, do X."
## References
Pointers to playbooks, framework skills, related project docs.
---
## Orchestration agent behavior — spec
This is what an agent runs when invoked with "orchestrate the campaigns under <root>."
### Tick procedure
For each WORKFLOW.md found under <root>:
- Read frontmatter + stage table + action queue.
- If escalation_required is true: log timestamp, skip this campaign.
- Find the first stage with status != done.
a. If status == pending: check entry condition (parse it from the row).
- If met: advance to ready.
- Else: skip. b. If status == ready: submit (deploy_hpc.sh, sbatch, etc.).
- Record JobID, set status=running, started=now. c. If status == running: query scheduler.
- Still running: log "still running at HH:MM:SS", no-op.
- Completed: evaluate exit condition (parse from row, e.g. "exit=0, restart written"). If passes: status=done, completed=now. Continue loop to next stage same tick. If fails: status=failed, append to failure log.
- Failed (scheduler-reported): same as failed-validation path. d. If status == failed: Pull log, classify error against catalog (or new):
- If catalogued + auto-fix exists + retries < 2: apply fix, status=ready, retries++.
- Else: status=escalated, escalation_required=true, escalation_reason=<class>. e. If status == escalated: skip (handled at top).
- Update last_tick timestamp.
- Save WORKFLOW.md.
After processing all campaigns: Compute next-tick cadence based on most-active campaign (see cadence table). ScheduleWakeup with that delay.
### Cadence table (next-tick delay)
| Most active campaign state | Re-tick in |
|----------------------------|------------|
| Active short job (smoke, < 1 hr expected) | 4–6 min |
| Active medium job (1–4 hr expected) | 30–60 min |
| Active long job (4–24 hr expected) | 2–4 hr |
| Active multi-day job (> 24 hr) | 8–12 hr |
| All campaigns idle / done / escalated | 12 hr (or until user prompt) |
Choose the **shortest** cadence among active campaigns. Don't sleep through fast jobs to be polite to slow ones.
### Failure classification
The agent maintains two catalogs:
1. **Cross-project (in this skill, see `failure-catalog.md`)** — generic patterns: walltime exceeded, GPU OOM, node death, transient SSH failure.
2. **Per-project (in the project's docs)** — physics-specific: NAMD velocity limit, density drift, snapshot extraction errors. Lives in the project's HPC playbook or AGENTS.md.
When the agent sees a failure: try to classify against both catalogs. If matched and an auto-fix exists, apply it. If unmatched, escalate. After resolving, append to the project catalog so the next agent recognizes it.
### Escalation rules — when to halt and ping the user
The agent **halts the campaign** (sets `escalation_required: true`) on any of:
- **2 consecutive same-error failures** at the same stage. The fix didn't work; humans needed.
- **Unfamiliar error** not matched in either catalog and not in a "safe to retry" pattern.
- **Cost cap exceeded** (paid backends only).
- **Destructive action proposed** (cancel > 1 job, delete data, restart-from-zero, modify force-field params).
- **Stage stuck > 3× expected walltime** without checkpoint progress (deadlock indicator).
Escalated campaigns sit until the human resolves and clears the flag. Other campaigns continue normally.
### Pre-approved actions (no escalation)
The agent may freely:
- Submit a stage when its entry condition is met
- Validate exit conditions and advance
- Re-submit a job that hit walltime *if* checkpoint state advanced
- Apply a catalogued fix once
- Sync output files back to local
- Update WORKFLOW.md state and timestamps
- Append to failure log and lessons learned
- Call `squeue`, `sacct`, `scancel <single_job>` (single-job cancellation only)
## How to invoke
Three reasonable invocation patterns:
### Pattern 1: User-driven manual tick
"Run an orchestration tick across all campaigns under simulations/."
The agent walks the files, advances/debugs, schedules a wakeup, exits.
### Pattern 2: Self-paced loop
/loop "Run an orchestration tick on simulations/<project>/. Schedule the next tick yourself based on cadence."
The agent re-fires itself on each tick. Most autonomous mode.
### Pattern 3: Cron-triggered (most production-grade)
System cron or `/schedule` triggers the agent every N minutes with the same prompt as Pattern 1. Stateless invocations; durable WORKFLOW.md state.
## Project-level integration
The skill is project-agnostic. To use it on a new project:
1. **Pick a search root** — usually `simulations/` or `campaigns/`.
2. **Drop a `WORKFLOW.md` in each campaign directory** following the schema above.
3. **(Optional) Add a project-level slash command** (e.g. `/<project>-tick`) that just expands to "orchestrate campaigns under <root>" — saves typing.
4. **Maintain a project-specific failure catalog** in your project's HPC playbook (e.g., `simulations/.../HPC_PLAYBOOK.md` § Common failure modes). The agent reads it on each tick.
## Templates
Starter WORKFLOW.md skeletons live in `templates/`. Copy and edit:
- `templates/WORKFLOW.template.md` — generic multi-stage HPC campaign
- `templates/WORKFLOW.namd-ensemble.md` — pre-fab for NAMD MD ensembles (NPT → 1000K decorr → snapshots → cool+prod array)
## Runtime monitoring during the tick (agentic anomaly scan)
Each orchestration tick should do more than progress state machines. It should also reason about whether the *current state of all campaigns* looks healthy — the runtime sibling to `compute-validation/workflows/orchestration-safety.md` (which is static pre-submission analysis).
This is agentic, not threshold-based. The agent reads sacct + queue state + recent logs and asks:
- **Does the pattern of completions look right?** Compare submissions/completions per hour against expected throughput for each campaign. Sudden spikes (>10 completions in 15 min) or pile-ups (>50 same-name jobs queued) are anomalies even if no single threshold fires.
- **Are any chains stalling?** A chain link in PD for >24 hr while others advance is suspicious.
- **Are any failures replicating?** Two consecutive same-error failures on a chain = pattern; investigate before continuing.
- **Are any logs producing unexpected output?** Real-time log tail check for FATAL, OOM, "RESTART" without progress, or other anomaly signatures.
Use the project's `.priors.yaml` (`class: orchestration` patterns) as seeds for what anomalies to look for. The runaway-resubmit pattern (today's bug class) shows up as "many fast completions in short time" — a heuristic the agent can apply.
When the scan flags concerns:
- **Low confidence**: write to the campaign's WORKFLOW.md action queue for next tick
- **Medium confidence**: pause the chain (`scancel` future links) and write to ORCHESTRATION_CHECK.md
- **High confidence + critical class**: escalate to user immediately
The agent should reason about the specific campaign, not just match thresholds. A pattern that looks like a runaway in a NAMD context might be normal in a different campaign — context matters. Same falsification mindset that drives Layer A and Layer A' applies to runtime monitoring.
This is the *post-submission* analog to Layer A' (`compute-validation/workflows/orchestration-safety.md`'s pre-submission static analysis). Together they catch:
- Pre-submission: predictable orchestration anti-patterns + brainstormed risks
- Runtime: novel anomalies that emerge during execution
- Post-run: bidirectional learning loop updates priors
## Why this design
- **Durable state** — files survive agent restarts, reboots, model changes
- **Stateless workers** — agents are cheap to spin up, don't need conversation continuity
- **Inspectable** — `cat WORKFLOW.md` is the dashboard
- **Composable** — adding a 6th campaign is dropping a file
- **Human-in-the-loop where it matters** — escalation rules keep destructive things gated
- **Cadence-aware** — fast jobs get fast checks, slow jobs get slow checks; nothing's polled in a tight loop
## What this skill does NOT do
- Build a job-tracking database. The filesystem is the database.
- Provide a web dashboard. `cat WORKFLOW.md` is the dashboard.
- Wire event-driven notifications from HPC. Periodic polling is fine at HPC timescales.
- Execute simulations. It only orchestrates them — it calls deploy scripts and SLURM submit scripts that already exist.
- Replace project judgment about what stages should exist. Each project defines its own stages in its WORKFLOW.md files.
## Cross-references
- `compute-strategy/SKILL.md` — backend selection, smoke-first iteration, the underlying compute decisions this skill orchestrates over
- `compute-validation/SKILL.md` — verification + orchestration-safety + smoke-loop discipline. A campaign's stages should follow this skill's protocol. WORKFLOW.md may include `0.5-verification` (requires `VERIFICATION.md`) AND `0.6-orchestration-check` (requires `ORCHESTRATION_CHECK.md`) before any compute-bearing stage.
- `compute-validation/workflows/orchestration-safety.md` — Layer A' static analysis of submission scripts; this skill's runtime-monitoring tick is the post-submission complement.
- `vast-cloud/SKILL.md` — Vast.ai backend driver
- For Alpine HPC specifics: `compute-strategy/backends/alpine.md`