Agent Skills: Monitoring and Alerting

Monitoring plan for Solana apps: RPC health, tx success, program errors, and liquidity signals. Use to set up dashboards and alerts.

Category: Uncategorized
ID: sanctifiedops/solana-skills/monitoring-and-alerting

Install this agent skill locally:

pnpm dlx add-skill https://github.com/SanctifiedOps/solana-skills/tree/HEAD/skills/infra/monitoring-and-alerting

Skill Files

Browse the full folder contents for monitoring-and-alerting.

skills/infra/monitoring-and-alerting/SKILL.md

Skill Metadata

Name
monitoring-and-alerting
Description
Monitoring plan for Solana apps: RPC health, tx success, program errors, and liquidity signals. Use to set up dashboards and alerts.

Monitoring and Alerting

Role framing: You are an observability lead. Your goal is to ensure early detection of issues across RPC, programs, and markets.

Initial Assessment

  • What components exist? (frontend, backend, programs, bots, LPs)
  • SLOs for latency/success? On-call structure?
  • Tools available (Grafana, Datadog, Helius webhooks, Sentry)?

Core Principles

  • Measure what users feel: tx success, latency, wallet connect, pool health.
  • Separate signal from noise: actionable alerts only.
  • Include on-chain + off-chain metrics.
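
"Measure what users feel" can be made concrete with two small aggregations over confirmation results. This is a minimal sketch: the `TxResult` shape and sample values are hypothetical; feed in real results from your submission pipeline.

```typescript
// One record per submitted transaction, as observed by the user-facing path.
interface TxResult {
  signature: string;
  confirmed: boolean; // did the tx reach the desired commitment?
  latencyMs: number;  // submit -> confirmation latency
}

// Fraction of transactions that confirmed; the headline "tx success" metric.
function successRate(results: TxResult[]): number {
  if (results.length === 0) return 1; // no traffic: report healthy, alert on volume separately
  const ok = results.filter((r) => r.confirmed).length;
  return ok / results.length;
}

// p95 confirmation latency: closer to what users feel than the average.
function p95Latency(results: TxResult[]): number {
  if (results.length === 0) return 0;
  const sorted = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(0.95 * sorted.length));
  return sorted[idx];
}
```

Emit both as gauges from the service that submits transactions, not from the RPC side, so the numbers reflect the full user journey.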

Workflow

  1. Metrics selection
    • RPC: latency, error rates, slot lag.
    • Tx pipeline: submit latency, confirmation time, failure codes.
    • Program: error counts by code, compute units used.
    • Market: price, liquidity depth, volume, holder concentration.
  2. Instrumentation
    • Add logs with error codes; emit metrics from services/bots.
    • Subscribe to webhooks for program logs/events.
  3. Dashboards
    • Build views for user journeys (connect, sign, swap/mint) and infra (RPC health).
  4. Alerts
    • Set thresholds and runbooks (e.g., tx fail rate >3% over 5m -> switch RPC).
    • Pager paths with severity levels.
  5. Testing
    • Fire drill alerts; validate runbooks; ensure contacts current.
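
The alert rule in step 4 (tx fail rate >3% over 5m -> switch RPC) can be sketched as a sliding-window evaluation. The event shape and thresholds below are illustrative assumptions; wire the decision to your actual RPC failover runbook.

```typescript
// One record per transaction submission attempt.
interface TxEvent {
  timestampMs: number;
  failed: boolean;
}

const WINDOW_MS = 5 * 60 * 1000;      // 5-minute evaluation window
const FAIL_RATE_THRESHOLD = 0.03;     // alert when fail rate exceeds 3%

// Returns true when the failure rate inside the window breaches the
// threshold, i.e. when the "switch RPC" runbook should fire.
function shouldFailover(events: TxEvent[], nowMs: number): boolean {
  const inWindow = events.filter((e) => nowMs - e.timestampMs <= WINDOW_MS);
  if (inWindow.length === 0) return false; // no data is a separate alert
  const failRate = inWindow.filter((e) => e.failed).length / inWindow.length;
  return failRate > FAIL_RATE_THRESHOLD;
}
```

A duration-scoped window like this avoids paging on a single failed transaction while still catching sustained degradation; pair it with a minimum-sample guard in production so low traffic does not trip the rule.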

Templates / Playbooks

  • Alert table: metric | threshold | duration | action | owner.
  • Standard runbook entries for RPC failover, blockhash errors, LP imbalance.
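
The alert table above (metric | threshold | duration | action | owner) can live as code-reviewed config rather than a wiki page, so every alert provably has an action and an owner. The field names and entries below are hypothetical examples, not prescribed values.

```typescript
// One row of the alert table, kept in version control.
interface AlertRule {
  metric: string;
  threshold: number;
  durationMin: number; // how long the breach must persist before firing
  action: string;      // link or name of the runbook to execute
  owner: string;       // team or rotation that gets paged
}

const alertRules: AlertRule[] = [
  { metric: "tx_fail_rate",   threshold: 0.03,  durationMin: 5,  action: "runbook: switch RPC",     owner: "infra" },
  { metric: "rpc_slot_lag",   threshold: 50,    durationMin: 2,  action: "runbook: RPC failover",   owner: "infra" },
  { metric: "pool_depth_usd", threshold: 10000, durationMin: 10, action: "runbook: LP imbalance",   owner: "trading" },
];

// Cheap validation: no rule may ship without an action and an owner.
function validateRules(rules: AlertRule[]): string[] {
  return rules
    .filter((r) => r.action.trim() === "" || r.owner.trim() === "")
    .map((r) => r.metric);
}
```

Running `validateRules` in CI is one way to enforce the quality bar below ("each alert has runbook and owner") mechanically.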

Common Failure Modes + Debugging

  • Alert fatigue: too many low-priority alerts; prune aggressively and raise thresholds.
  • Missing program error visibility: add msg! logging with error codes in the program, then parse those logs off-chain.
  • Slot lag misread because providers report different slots; monitor lag per provider.
  • No runbook means slow response; write one and link it from the alert.
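
For the program-error-visibility failure mode, the off-chain half is log parsing. Solana runtimes report program failures with lines of the form "custom program error: 0x…"; the sketch below turns those into countable codes (Anchor-style custom errors start at 6000, i.e. 0x1770). Treat it as a starting point, not a complete parser.

```typescript
// Extract custom program error codes from a transaction's log messages so
// program failures can be counted as metrics, bucketed by error code.
function extractErrorCodes(logs: string[]): number[] {
  const codes: number[] = [];
  const re = /custom program error: (0x[0-9a-fA-F]+)/;
  for (const line of logs) {
    const m = line.match(re);
    if (m) codes.push(parseInt(m[1], 16));
  }
  return codes;
}
```

Feed it the `logMessages` from failed transactions (e.g. from a webhook payload or `getTransaction` response) and increment a counter per code; a spike in one code is far more actionable than a generic failure count.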

Quality Bar / Validation

  • Dashboards live with top metrics; alerts tested.
  • Each alert has runbook and owner.
  • On-call rotation known; contact methods tested.

Output Format

Provide monitoring plan: metrics list, dashboards needed, alert thresholds with runbooks, and ownership map.

Examples

  • Simple: Dashboard for tx success + RPC latency; alert to Slack on error spike; runbook to switch RPC.
  • Complex: Full stack including program log parsing, pool depth alerts, holder concentration tracking; PagerDuty rotation with quarterly drills.