Monitor Skill | Agent Skills

Monitor

Observe deployed application health and gather feedback to inform future development.

Purpose

Monitoring completes the feedback loop:

Validate success criteria from intent
Collect performance and health metrics
Gather user feedback
Identify issues for the backlog
Ensure deployed features work as expected

Input

Expect from orchestrator:

Deployed application URL/endpoints
Success criteria from intent
Metrics to track
Monitoring duration/frequency

Process

1. Health monitoring

Check application health continuously:

health_checks:
  - name: "API availability"
    endpoint: "/health"
    method: "GET"
    expected_status: 200
    interval: "60s"
    timeout: "5s"

  - name: "Database connectivity"
    endpoint: "/health/db"
    method: "GET"
    expected_status: 200
    interval: "60s"

  - name: "External dependencies"
    endpoint: "/health/deps"
    method: "GET"
    expected_status: 200
    interval: "300s"

2. Performance metrics

Collect performance data:

metrics:
  - name: "response_time_p50"
    source: "apm"
    threshold: 200  # ms
    alert_above: 500

  - name: "response_time_p99"
    source: "apm"
    threshold: 1000  # ms
    alert_above: 2000

  - name: "error_rate"
    source: "logs"
    threshold: 0.01  # 1%
    alert_above: 0.05

  - name: "throughput"
    source: "apm"
    baseline: 100  # req/s
    alert_below: 50

  - name: "cpu_usage"
    source: "infrastructure"
    threshold: 70  # %
    alert_above: 90

  - name: "memory_usage"
    source: "infrastructure"
    threshold: 80  # %
    alert_above: 95

3. Success criteria validation

Verify intent success criteria are met:

success_validation:
  - criterion: "Users can register successfully"
    metric: "registration_success_rate"
    target: ">= 99%"
    actual: "99.5%"
    status: "pass"

  - criterion: "Page loads under 2 seconds"
    metric: "page_load_time_p95"
    target: "< 2000ms"
    actual: "1850ms"
    status: "pass"

  - criterion: "Zero critical errors in first 24h"
    metric: "critical_error_count"
    target: "= 0"
    actual: "0"
    status: "pass"

4. Error tracking

Monitor for errors and exceptions:

error_tracking:
  sources:
    - "application_logs"
    - "error_reporting_service"  # Sentry, Bugsnag, etc.
    - "browser_console"

  categories:
    - severity: "critical"
      pattern: "uncaught exception|fatal error|500"
      action: "immediate_alert"

    - severity: "error"
      pattern: "error|failed|exception"
      action: "aggregate_and_review"

    - severity: "warning"
      pattern: "warning|deprecated|slow"
      action: "log_for_review"

5. User feedback collection

Gather feedback from various sources:

feedback_sources:
  - source: "support_tickets"
    integration: "zendesk"
    filter: "tag:new-feature"

  - source: "user_surveys"
    integration: "typeform"
    trigger: "post_feature_use"

  - source: "analytics_events"
    integration: "mixpanel"
    events: ["feature_used", "feature_abandoned", "error_encountered"]

  - source: "social_mentions"
    integration: "twitter_api"
    keywords: ["@ourapp", "ourapp bug"]

6. Analyse and categorise

Process collected data:

analysis:
  - type: "error_pattern"
    description: "Timeout errors spike during peak hours"
    frequency: "12 occurrences/hour"
    impact: "medium"
    recommendation: "Investigate database connection pooling"

  - type: "user_friction"
    description: "Users abandoning registration at email step"
    metric: "30% drop-off"
    impact: "high"
    recommendation: "Review email validation UX"

  - type: "performance_regression"
    description: "API response time increased 40% after deploy"
    metric: "p99: 800ms → 1120ms"
    impact: "medium"
    recommendation: "Profile recent changes"

7. Generate backlog items

Create stories from feedback:

backlog_additions:
  - id: "US-015"
    title: "Optimise database queries for peak load"
    source: "monitor:error_pattern"
    description: |
      As a user,
      I want fast response times during peak hours,
      So that I can complete tasks without delays.
    priority: 2
    evidence: "Timeout errors during 9-11am, 2-4pm"

  - id: "US-016"
    title: "Improve email validation feedback"
    source: "monitor:user_friction"
    description: |
      As a new user,
      I want clear feedback when my email is invalid,
      So that I can correct it and complete registration.
    priority: 1
    evidence: "30% drop-off at email step, support tickets about 'email not accepted'"

Output

monitor:
  period:
    start: "2024-01-15T10:00:00Z"
    end: "2024-01-16T10:00:00Z"
    duration: "24h"

  status: "healthy | degraded | unhealthy"

  health:
    uptime: "99.95%"
    incidents: 0
    health_checks:
      passed: 1440
      failed: 0

  performance:
    response_time_p50: "145ms"
    response_time_p99: "890ms"
    error_rate: "0.02%"
    throughput: "125 req/s"

  success_criteria:
    met: 3
    total: 3
    details:
      - criterion: "Users can register"
        status: "pass"
      - criterion: "Page loads < 2s"
        status: "pass"
      - criterion: "Zero critical errors"
        status: "pass"

  errors:
    critical: 0
    error: 12
    warning: 45
    top_errors:
      - message: "Connection timeout"
        count: 8
        first_seen: "2024-01-15T14:23:00Z"

  feedback:
    support_tickets: 3
    user_surveys: 15
    nps_score: 42
    sentiment: "positive"

  backlog_additions:
    - id: "US-015"
      title: "Optimise database queries"
      priority: 2
    - id: "US-016"
      title: "Improve email validation"
      priority: 1

  recommendations:
    - "Monitor database connection pool during peak hours"
    - "Consider adding email format hints to registration form"
    - "Review timeout thresholds for external API calls"

Update manifest:

releases:
  - version: "1.2.3"
    monitoring:
      status: "healthy"
      period: "2024-01-15 to 2024-01-16"
      success_criteria_met: true

backlog:
  # New items added from monitoring
  - id: "US-015"
    source: "monitor"
    # ...

Monitoring duration

monitoring_schedule:
  post_deploy:
    duration: "24h"
    intensity: "high"
    checks_interval: "1m"

  steady_state:
    duration: "ongoing"
    intensity: "normal"
    checks_interval: "5m"

  alerts:
    critical: "immediate"
    error: "within 15m"
    warning: "daily digest"

Alert escalation

escalation:
  level_1:
    condition: "health_check_failed"
    wait: "5m"
    action: "slack_notification"

  level_2:
    condition: "health_check_failed for 15m"
    action: "pagerduty_alert"

  level_3:
    condition: "health_check_failed for 30m"
    action: "phone_call_oncall"

Tips

Set baselines before comparing metrics
Alert on symptoms, not just causes
Avoid alert fatigue with appropriate thresholds
Correlate metrics across services
Keep monitoring configs in version control
Review and tune thresholds regularly
Document what "normal" looks like

Agent Skills: Monitor

Install this agent skill to your local

Skill Files