Mission Control Promotion Evaluation Skill
Core Purpose
Systematically evaluate the health of a Mission Control environment as a platform (not demo workloads) to support release and promotion readiness decisions. This skill queries the live environment using MCP tools and produces a structured diagnostic report.
Important Distinction
This evaluation focuses on Mission Control platform health — the components that make MC work (canary-checker, config-db, mission-control deployments, core health checks, scrapers, jobs). It does NOT evaluate demo workloads or user-created resources unless they indicate a platform problem.
Known expected-fail checks: Some health checks have the label Expected-Fail=true. These are intentional test checks and should be excluded from failure counts and findings.
Parameters
When invoked, check if the user specified:
- `time_window`: Lookback period (default: 24h)
- `target`: Environment to evaluate (ask user if not specified)
Evaluation Procedure
Execute these phases sequentially. After each phase, record component status and findings.
Initialize a running JSON result conforming to @skills/promotion-eval-mission-control/schema.json with:
{
"verdict": "READY",
"evaluated_at": "<current ISO timestamp>",
"time_window": "<window>",
"target": "<target environment>",
"components": {},
"findings": [],
"recommendations": []
}
Catalog Type Reference
These are the confirmed MissionControl catalog types:
- `MissionControl::ScrapeConfig` — config scrapers
- `MissionControl::Playbook` — playbook definitions
- `MissionControl::Notification` — notification rules
- `MissionControl::Job` — background jobs
- `MissionControl::Canary` — canary check definitions
- `MissionControl::Connection` — external connections
- `MissionControl::Topology` — topology definitions
Phase 1: Health Check Pipelines
Goal: Determine if health checks are running and passing.
- Get failing checks directly: `view_failing-health-checks_mission-control` with `withRows=true` and `select=["id","name","type","status","severity","last_transition_time","description"]`
- Get total check count: `list_all_checks` for baseline metrics
- Filter out expected failures: Exclude checks with label `Expected-Fail=true` from failure counts
- Drill into real failures: For each genuinely unhealthy check (not expected-fail), call `get_check_status(id, limit=10)` to retrieve recent execution history. Classify as:
  - Transient: Occasional failures mixed with passes
  - Persistent: Consistently failing across recent executions
- Assess staleness: From the check list, identify checks where `updated_at` is older than the time window
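The transient-vs-persistent classification above can be sketched in Python. This is a minimal sketch: the list-of-statuses shape is an assumption for illustration, not the actual `get_check_status` response format.

```python
def classify_failure(history):
    """Classify a check from its recent execution history (newest first).

    `history` is assumed to be a list of execution statuses such as
    ["fail", "fail", "pass", ...] -- a simplification of whatever
    get_check_status(id, limit=10) actually returns.
    """
    if "fail" not in history:
        return "healthy"
    # Persistent: every recent execution failed; otherwise transient.
    if all(status == "fail" for status in history):
        return "persistent"
    return "transient"

print(classify_failure(["fail"] * 10))             # persistent
print(classify_failure(["pass", "fail", "pass"]))  # transient
```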
Metrics to record:
- `total_checks`: Total number of health checks
- `healthy_count`: Number currently healthy
- `unhealthy_count`: Number currently unhealthy (excluding expected-fail)
- `expected_fail_count`: Checks labeled Expected-Fail
- `persistent_failures`: Number failing consistently
- `stale_count`: Number not updated within time window
- `health_rate`: Percentage healthy (excluding expected-fail from denominator)
Verdict logic:
- PASS: No persistent failures, stale_count == 0, health_rate > 95%
- WARN: Some transient failures OR 1-2 stale checks OR health_rate 80-95%
- FAIL: Any persistent failures OR stale_count > 2 OR health_rate < 80%
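One possible reading of the Phase 1 metrics and verdict rules, assuming the counts have already been gathered (the function name and signature are illustrative, not part of any API; transient failures surface through a lowered health rate):

```python
def phase1_verdict(total, unhealthy, expected_fail, persistent, stale):
    """Compute health_rate and the Phase 1 verdict.

    Expected-fail checks are excluded from both numerator and denominator,
    matching the metric definitions above; `unhealthy` is assumed to
    already exclude expected-fail checks.
    """
    denominator = total - expected_fail
    healthy = denominator - unhealthy
    health_rate = 100.0 * healthy / denominator if denominator else 100.0
    if persistent > 0 or stale > 2 or health_rate < 80:
        return health_rate, "FAIL"
    if stale > 0 or health_rate <= 95:
        return health_rate, "WARN"
    return health_rate, "PASS"
```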
Phase 2: Config Scrapers
Goal: Verify config scrapers are active and producing fresh data.
- Find scraper configs: `search_catalog` with `type=MissionControl::ScrapeConfig` and `select=["id","name","health","status","updated_at"]`
- Check freshness: For each scraper, check the `updated_at` timestamp. Flag any not updated within the expected schedule (typically 1h)
- Check scraper errors: `view_mission-control-system_mission-control` with `withPanels=true` — this returns scraper error counts and a list of scrapers with errors
- Review recent changes: `search_catalog_changes` with `type=MissionControl::ScrapeConfig created_at>now-{window}` for config changes
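The freshness step can be sketched as a timestamp comparison. The dict shape with an ISO-8601 `updated_at` field is an assumption about the `search_catalog` rows, not a documented format:

```python
from datetime import datetime, timedelta, timezone

def scraper_freshness(scrapers, expected_interval=timedelta(hours=1), now=None):
    """Split scrapers into active vs. stale by their updated_at timestamp.

    `scrapers` is assumed to be a list of dicts with "name" and an
    ISO-8601 "updated_at" field -- a simplification of the catalog rows.
    """
    now = now or datetime.now(timezone.utc)
    active, stale = [], []
    for s in scrapers:
        updated = datetime.fromisoformat(s["updated_at"])
        (active if now - updated <= expected_interval else stale).append(s["name"])
    return active, stale
```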
Metrics to record:
- `total_scrapers`: Number of scraper configs found
- `active_count`: Scrapers updated within expected window
- `stale_count`: Scrapers not recently updated
- `error_count`: From system view scraper errors panel
Verdict logic:
- PASS: All scrapers active, no errors
- WARN: 1-2 scrapers slightly stale OR minor errors
- FAIL: Any scraper missing updates for > 2h OR significant errors
Phase 3: Background Jobs & Playbooks
Goal: Check for failed playbook runs and job errors.
- Get failed job history: `view_jobhistory_mission-control` with `withRows=true` and `select=["name","status","duration","error","timestamp"]` `limit=20`
- Get failed playbook runs: `get_playbook_failed_runs(limit=10)` for recent failures
- Get recent playbook runs: `get_playbook_recent_runs(limit=20)` to calculate success rate
- Drill into failures: For any failed playbook runs, call `get_playbook_run_steps(run_id)` to understand the failure cause
- Check playbook catalog health: `search_catalog` with `type=MissionControl::Playbook health=unhealthy` to find unhealthy playbook definitions
Metrics to record:
- `total_recent_runs`: Total playbook runs in window
- `failed_runs`: Number of failed runs
- `success_rate`: Percentage of successful runs
- `job_errors`: Count of job errors from job history view
Verdict logic:
- PASS: success_rate > 95%, no recurring job errors
- WARN: success_rate 80-95% OR some job errors
- FAIL: success_rate < 80% OR critical/recurring job failures
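The success-rate calculation can be sketched as below. The run-dict shape and the `"completed"` status value are assumptions about the `get_playbook_recent_runs` output, not documented fields:

```python
def playbook_success_rate(runs):
    """Percentage of successful playbook runs in the window.

    `runs` is assumed to be a list of dicts with a "status" field where
    "completed" means success -- an illustrative simplification.
    """
    if not runs:
        return None  # no runs in window: report SKIP rather than FAIL
    succeeded = sum(1 for r in runs if r["status"] == "completed")
    return 100.0 * succeeded / len(runs)
```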
Phase 4: Notification Delivery
Goal: Verify the notification pipeline is functioning.
- Get notification send history: `view_notification-send-history_mission-control` with `withRows=true` and `select=["id","age","resource_name","resource_current_health","title","notification"]` `limit=20`
- Get notification stats from system view: The `view_mission-control-system_mission-control` panel (already fetched in Phase 2) includes notification counts by status (SENT, SILENCED, REPEAT-INTERVAL, etc.)
- Find notification configs: `search_catalog` with `type=MissionControl::Notification` and `select=["id","name","health","status"]`
- Check for error notifications: For each notification config, call `get_notifications_for_resource(resource_id, status=error, since=now-{window})`
Metrics to record:
- `total_notification_configs`: Number of notification rules
- `sent_count`: From system view
- `silenced_count`: From system view
- `error_count`: Notifications with error status
- `delivery_rate`: sent / (sent + error) percentage
Verdict logic:
- PASS: No delivery errors, system view shows sends happening
- WARN: Some errors but delivery_rate > 95%
- FAIL: delivery_rate ≤ 95% OR notification system appears down
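The `delivery_rate` formula, with a guard for a window in which nothing was attempted (illustrative helper, not an API):

```python
def delivery_rate(sent, errors):
    """sent / (sent + error) as a percentage, guarding the no-traffic case."""
    attempted = sent + errors
    if attempted == 0:
        return None  # nothing attempted in the window: treat as SKIP
    return 100.0 * sent / attempted
```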
Phase 5: System & Event Queue
Goal: Check overall system health indicators, database, and event queue.
- System overview: `view_mission-control-system_mission-control` with `withPanels=true` (reuse from Phase 2 if already fetched)
  - Check scraper errors, notification stats, agent resource counts
- Database health: `view_mission-control-database_mission-control` with `withPanels=true`
  - Check DB size, active users, DB connections
- Connection health: `list_connections` to verify external integrations are configured
Metrics to record:
- `db_size_bytes`: Database size
- `db_connections`: Active connections
- `active_users`: User count
- `total_connections`: Number of configured connections
Verdict logic:
- PASS: DB healthy, connections configured, no concerning metrics
- WARN: High DB connections or large DB size growth
- FAIL: Database unreachable or critical system errors
Phase 6: MC Infrastructure Health
Goal: Verify Mission Control's own Kubernetes resources are healthy.
- Get MC pods directly: `view_mission-control-pods_mission-control` with `withRows=true` and `select=["name","namespace","status","health","updated"]` — this returns all MC-related pods
- Find MC deployments: `search_catalog` with `type=Kubernetes::Deployment` and name patterns `name=mission-control*`, `name=canary-checker*`, `name=config-db*`. Use `select=["id","name","health","status","updated_at"]` for each.
- Describe unhealthy resources: For any unhealthy MC deployment or pod, call `describe_catalog(id)` to get full details including error messages
- Check recent changes: `search_catalog_changes` with `type=Kubernetes::Deployment name=mission-control* created_at>now-{window}` (and similarly for canary-checker and config-db)
- Check related configs: For unhealthy resources, use `get_related_configs` to trace Deployment → ReplicaSet → Pod
Metrics to record:
- `total_mc_pods`: MC pods found
- `healthy_pods`: Healthy MC pods
- `unhealthy_pods`: Unhealthy MC pods
- `total_mc_deployments`: MC deployments found
- `healthy_deployments`: Healthy MC deployments
- `recent_changes`: Changes to MC components in time window
Verdict logic:
- PASS: All MC resources healthy, no concerning recent changes
- WARN: Recent deployments/changes but all healthy, OR minor pod restarts
- FAIL: Any unhealthy MC deployment or persistent pod failures
Report Generation
After all phases complete, produce the final report in two parts:
Part 1: Markdown Report
# Promotion Evaluation Report
**Target**: <target environment>
**Evaluated at**: <timestamp>
**Time window**: <window>
**Verdict**: **<READY|CAUTION|NOT_READY>**
## Summary
| Component | Status | Key Metrics |
|-----------|--------|-------------|
| Health Checks | <PASS/WARN/FAIL> | <health_rate>% healthy, <persistent_failures> persistent failures |
| Config Scrapers | <PASS/WARN/FAIL> | <active_count>/<total_scrapers> active, <error_count> errors |
| Jobs & Playbooks | <PASS/WARN/FAIL> | <success_rate>% success rate, <failed_runs> failures |
| Notifications | <PASS/WARN/FAIL/SKIP> | <sent_count> sent, <error_count> errors |
| System & DB | <PASS/WARN/FAIL> | DB <db_size_bytes> bytes, <db_connections> connections |
| MC Infrastructure | <PASS/WARN/FAIL> | <healthy_pods>/<total_mc_pods> pods healthy |
## Findings
<For each finding, sorted by severity (critical first)>
### [severity] [component]: [message]
- **Resource**: [name] ([type], ID: [id])
- **Evidence**: [evidence]
## Recent Changes (Risk Factors)
<List recent changes to MC infrastructure that could affect stability>
## Recommendations
<Numbered list of actionable recommendations>
Part 2: Structured JSON
Output the completed JSON object conforming to the schema. Wrap in a code block with language json.
Overall Verdict Logic
Derive the top-level verdict from component statuses:
- READY: All components PASS (or SKIP for non-critical ones)
- CAUTION: Any component is WARN, but none are FAIL
- NOT_READY: Any component is FAIL
Components that must not FAIL for READY: health_checks, config_scrapers, mc_infrastructure
Components that may SKIP without affecting verdict: notifications
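The derivation above can be sketched directly (the mapping shape is an assumption; any FAIL already forces NOT_READY, so the must-not-FAIL list is satisfied automatically):

```python
def overall_verdict(components):
    """Derive the top-level verdict from per-component statuses.

    `components` maps component name -> "PASS" | "WARN" | "FAIL" | "SKIP".
    """
    statuses = set(components.values())
    if "FAIL" in statuses:
        return "NOT_READY"
    if "WARN" in statuses:
        return "CAUTION"
    return "READY"  # all PASS, with SKIP allowed for non-critical components
```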
Error Handling
- If an MCP tool call fails or returns unexpected data, record the component as SKIP with a note
- Do not let one phase failure block subsequent phases — evaluate all phases independently
- Reuse data across phases when the same tool was already called (e.g., system view data)
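The skip-and-continue rule can be sketched as a wrapper around each phase (illustrative; `run_phase` is not an existing tool, and the results dict follows the running JSON structure initialized earlier):

```python
def run_phase(name, phase_fn, results):
    """Run one evaluation phase; on tool failure record SKIP and continue.

    A failed MCP call in one phase must not block later phases, so the
    exception is captured as a note instead of propagating.
    """
    try:
        results["components"][name] = phase_fn()
    except Exception as exc:
        results["components"][name] = {"status": "SKIP", "note": str(exc)}

results = {"components": {}}
run_phase("health_checks", lambda: {"status": "PASS"}, results)

def broken_phase():
    raise RuntimeError("tool call failed")

run_phase("notifications", broken_phase, results)  # recorded as SKIP
```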