Agent Skills: Observability & Monitoring Skill

Structured logging, metrics, distributed tracing, and alerting strategies

Category: Uncategorized
ID: aiskillstore/marketplace/observability-monitoring

Install this agent skill locally:

pnpm dlx add-skill https://github.com/aiskillstore/marketplace/tree/HEAD/skills/ariegoldkin/observability-monitoring

skills/ariegoldkin/observability-monitoring/SKILL.md

Skill Metadata

Name: observability-monitoring
Description: Structured logging, metrics, distributed tracing, and alerting strategies

Observability & Monitoring Skill

Frameworks and templates for implementing observability: structured logging, metrics, distributed tracing, and alerting.

When to Use

  • Setting up application monitoring
  • Implementing structured logging
  • Adding metrics and dashboards
  • Configuring distributed tracing
  • Creating alerting rules
  • Debugging production issues

Three Pillars of Observability

┌─────────────────┬─────────────────┬─────────────────┐
│     LOGS        │     METRICS     │     TRACES      │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened   │ How the system  │ How requests    │
│ at a specific   │ performs        │ flow through    │
│ point in time   │ over time       │ services        │
└─────────────────┴─────────────────┴─────────────────┘

Structured Logging

Log Levels

| Level | Use Case |
|-------|----------|
| ERROR | Unhandled exceptions, failed operations |
| WARN | Deprecated API, retry attempts |
| INFO | Business events, successful operations |
| DEBUG | Development troubleshooting |
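A logger typically drops entries below a configured minimum level. The following is a minimal sketch of that filter, assuming the four levels in the table are ordered from most to least severe. `LEVEL_PRIORITY` and `shouldLog` are hypothetical names for illustration, not part of Winston or of this skill's templates.

```typescript
// Levels from the table, ordered by severity (lower number = more severe).
type LogLevel = 'ERROR' | 'WARN' | 'INFO' | 'DEBUG';

const LEVEL_PRIORITY: Record<LogLevel, number> = {
  ERROR: 0,
  WARN: 1,
  INFO: 2,
  DEBUG: 3,
};

// Returns true when an entry at `level` should be emitted
// given the configured minimum level.
function shouldLog(level: LogLevel, minLevel: LogLevel): boolean {
  return LEVEL_PRIORITY[level] <= LEVEL_PRIORITY[minLevel];
}
```

With a minimum level of INFO, ERROR, WARN, and INFO entries pass and DEBUG entries are dropped, which is a common production default.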

Best Practice

// Good: Structured with context
logger.info('User action completed', {
  action: 'purchase',
  userId: user.id,
  orderId: order.id,
  duration_ms: 150
});

// Bad: String interpolation
logger.info(`User ${user.id} completed purchase`);

See templates/structured-logging.ts for Winston setup and request middleware
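To illustrate why the structured form is easier to query, here is a small dependency-free sketch that serializes a message and its context into one JSON line. It is not the Winston setup from the template; `formatLogEntry` and `LogContext` are hypothetical names, and the field names `timestamp`, `level`, and `message` are assumptions.

```typescript
// Hypothetical helper: builds one JSON log line from a message and a context object.
type LogContext = Record<string, string | number | boolean | null>;

function formatLogEntry(
  level: 'error' | 'warn' | 'info' | 'debug',
  message: string,
  context: LogContext = {},
  now: Date = new Date(),
): string {
  // Context is spread first so callers cannot overwrite the reserved fields.
  return JSON.stringify({
    ...context,
    timestamp: now.toISOString(),
    level,
    message,
  });
}

// Example values are made up for illustration.
const line = formatLogEntry(
  'info',
  'User action completed',
  { action: 'purchase', userId: 'u-123', orderId: 'o-456', duration_ms: 150 },
  new Date('2024-01-01T00:00:00.000Z'),
);
```

Because every field is a separate JSON key, a log backend can filter on `action` or aggregate `duration_ms` without parsing free text.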

Metrics Collection

RED Method (Rate, Errors, Duration)

Essential metrics for any service:

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Request latency distribution
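The three RED values can be derived from a window of raw request samples. The sketch below assumes an in-memory array of samples and a nearest-rank percentile. In practice a metrics library computes these from counters and histograms, and the `RequestSample` shape and function names are hypothetical.

```typescript
// One observed request; the field names are illustrative.
interface RequestSample {
  durationMs: number;
  failed: boolean;
}

interface RedSummary {
  ratePerSec: number;   // Rate: requests per second
  errorsPerSec: number; // Errors: failed requests per second
  p50Ms: number;        // Duration: median latency
  p95Ms: number;        // Duration: 95th percentile latency
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)] ?? 0;
}

// Summarizes all samples observed during a window of `windowSec` seconds.
function summarizeRed(samples: RequestSample[], windowSec: number): RedSummary {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const failures = samples.filter((s) => s.failed).length;
  return {
    ratePerSec: samples.length / windowSec,
    errorsPerSec: failures / windowSec,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
  };
}
```

Duration is reported as percentiles rather than a mean because a few slow requests can hide behind a healthy-looking average.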

Prometheus Buckets

// HTTP request latency (bucket upper bounds in seconds)
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency (bucket upper bounds in seconds)
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]

See templates/prometheus-metrics.ts for full metrics configuration
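Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts all observations. The sketch below reproduces that counting by hand to show how the bounds above behave. It assumes the values are in seconds, and `cumulativeBucketCounts` is a hypothetical helper rather than a client-library call.

```typescript
// Upper bounds in seconds, matching the HTTP latency buckets listed in this section.
const HTTP_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5];

// counts[i] is the number of observations <= bounds[i];
// the final entry is the +Inf bucket, which counts every observation.
function cumulativeBucketCounts(observations: number[], bounds: number[]): number[] {
  const counts: number[] = [...bounds.map(() => 0), 0];
  for (const value of observations) {
    bounds.forEach((bound, i) => {
      if (value <= bound) counts[i] = (counts[i] ?? 0) + 1;
    });
    counts[bounds.length] = (counts[bounds.length] ?? 0) + 1; // +Inf bucket
  }
  return counts;
}

// Six made-up request latencies, in seconds.
const bucketCounts = cumulativeBucketCounts([0.005, 0.03, 0.2, 0.7, 3, 10], HTTP_BUCKETS);
// bucketCounts is [1, 2, 2, 3, 4, 4, 5, 6]
```

Observations above the largest bound are only visible in the `+Inf` bucket, so the top bound should sit above the latency an alert threshold needs to resolve.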

Distributed Tracing

OpenTelemetry Setup

Auto-instrument common libraries:

  • Express/HTTP
  • PostgreSQL
  • Redis

Manual Spans

tracer.startActiveSpan('processOrder', async (span) => {
  try {
    span.setAttribute('order.id', orderId);
    // ... work
  } finally {
    span.end(); // end the span even if the work throws
  }
});

See templates/opentelemetry-tracing.ts for full setup
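A trace can follow a request across services because each outgoing call carries the trace context. OpenTelemetry's default propagation uses the W3C `traceparent` header, formatted as `version-traceid-parentid-flags`. The sketch below parses and formats that header by hand to show what is propagated. It handles only version `00`, and the type and function names are hypothetical; real services should rely on the OpenTelemetry propagators.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the caller's span)
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers or unsupported versions.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const traceId = match[1] ?? '';
  const spanId = match[2] ?? '';
  const flags = match[3] ?? '00';
  // All-zero IDs are invalid in the W3C Trace Context format.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}
```

A downstream service keeps the incoming `traceId`, creates a new span ID for its own work, and records the incoming `spanId` as the parent.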

Alerting Strategy

Severity Levels

| Level | Response Time | Examples |
|-------|---------------|----------|
| Critical (P1) | < 15 min | Service down, data loss |
| High (P2) | < 1 hour | Major feature broken |
| Medium (P3) | < 4 hours | Increased error rate |
| Low (P4) | Next day | Warnings |

Key Alerts

| Alert | Condition | Severity |
|-------|-----------|----------|
| ServiceDown | up == 0 for 1m | Critical |
| HighErrorRate | 5xx > 5% for 5m | Critical |
| HighLatency | p95 > 2s for 5m | High |
| LowCacheHitRate | < 70% for 10m | Medium |

See templates/alerting-rules.yml for Prometheus alerting rules
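Each condition in the Key Alerts table includes a "for" duration, which means the condition must hold continuously for that long before the alert fires. The sketch below is a simplified model of that behavior over a list of periodic evaluations. The `Evaluation` shape and `alertState` function are hypothetical and do not reproduce Prometheus internals.

```typescript
// One evaluation of an alert expression at a point in time.
interface Evaluation {
  timestampSec: number;
  conditionMet: boolean;
}

type AlertState = 'inactive' | 'pending' | 'firing';

// Returns the alert state after the last evaluation. The alert fires only once
// the condition has held continuously for at least `forSec` seconds.
function alertState(evaluations: Evaluation[], forSec: number): AlertState {
  let activeSince: number | null = null;
  let state: AlertState = 'inactive';
  for (const e of evaluations) {
    if (!e.conditionMet) {
      // Any failed evaluation resets the timer.
      activeSince = null;
      state = 'inactive';
      continue;
    }
    if (activeSince === null) activeSince = e.timestampSec;
    state = e.timestampSec - activeSince >= forSec ? 'firing' : 'pending';
  }
  return state;
}
```

The "for" window is what filters out brief spikes: a single bad evaluation leaves the alert pending, and one healthy evaluation resets it.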

Health Checks

Kubernetes Probes

| Probe | Purpose | Endpoint |
|-------|---------|----------|
| Liveness | Is app running? | /health |
| Readiness | Ready for traffic? | /ready |
| Startup | Finished starting? | /startup |

Readiness Response

{
  "status": "healthy|degraded|unhealthy",
  "checks": {
    "database": { "status": "pass", "latency_ms": 5 },
    "redis": { "status": "pass", "latency_ms": 2 }
  },
  "version": "1.0.0",
  "uptime": 3600
}

See templates/health-checks.ts for implementation
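The top-level `status` in the readiness response can be derived from the individual dependency checks. The sketch below shows one possible rule. The response above only shows `pass`, so the `warn` and `fail` values, the mapping to HTTP status codes, and the function names are all assumptions for illustration.

```typescript
// 'warn' and 'fail' are assumed values; the example response only shows 'pass'.
type CheckStatus = 'pass' | 'warn' | 'fail';
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

interface CheckResult {
  status: CheckStatus;
  latency_ms: number;
}

// Any failing check makes the service unhealthy; any warning makes it degraded.
function overallStatus(checks: Record<string, CheckResult>): OverallStatus {
  const statuses = Object.values(checks).map((c) => c.status);
  if (statuses.some((s) => s === 'fail')) return 'unhealthy';
  if (statuses.some((s) => s === 'warn')) return 'degraded';
  return 'healthy';
}

// Kubernetes HTTP probes treat status codes from 200 to 399 as success,
// so only 'unhealthy' takes the pod out of rotation in this sketch.
function httpStatusFor(status: OverallStatus): number {
  return status === 'unhealthy' ? 503 : 200;
}
```

Returning 200 for `degraded` keeps the pod serving traffic while the body still surfaces the problem to dashboards. Whether a degraded dependency should fail readiness is a per-service decision.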

Observability Checklist

Implementation

  • [ ] JSON structured logging
  • [ ] Request correlation IDs
  • [ ] RED metrics (Rate, Errors, Duration)
  • [ ] Business metrics
  • [ ] Distributed tracing
  • [ ] Health check endpoints

Alerting

  • [ ] Service outage alerts
  • [ ] Error rate thresholds
  • [ ] Latency thresholds
  • [ ] Resource utilization alerts

Dashboards

  • [ ] Service overview
  • [ ] Error analysis
  • [ ] Performance metrics

Extended Thinking Triggers

Use Opus 4.5 extended thinking for:

  • Incident investigation - Correlating logs, metrics, traces
  • Alert tuning - Reducing noise, catching real issues
  • Architecture decisions - Choosing monitoring solutions
  • Performance debugging - Cross-service latency analysis

Templates Reference

| Template | Purpose |
|----------|---------|
| structured-logging.ts | Winston logger with request middleware |
| prometheus-metrics.ts | HTTP, DB, cache metrics with middleware |
| opentelemetry-tracing.ts | Distributed tracing setup |
| alerting-rules.yml | Prometheus alerting rules |
| health-checks.ts | Liveness, readiness, startup probes |