Agent Skills: Infrastructure Monitoring

Set up comprehensive infrastructure monitoring with Prometheus, Grafana, and alerting systems for metrics, health checks, and performance tracking.

UncategorizedID: aj-geddes/useful-ai-prompts/infrastructure-monitoring

Install this agent skill to your local

pnpm dlx add-skill https://github.com/aj-geddes/useful-ai-prompts/tree/HEAD/skills/infrastructure-monitoring

Skill Files

Browse the full folder contents for infrastructure-monitoring.

Download Skill

Loading file tree…

skills/infrastructure-monitoring/SKILL.md

Skill Metadata

Name
infrastructure-monitoring
Description
>

Infrastructure Monitoring

Table of Contents

Overview

Implement comprehensive infrastructure monitoring to track system health, performance metrics, and resource utilization with alerting and visualization across your entire stack.

When to Use

  • Real-time performance monitoring
  • Capacity planning and trends
  • Incident detection and alerting
  • Service health tracking
  • Resource utilization analysis
  • Performance troubleshooting
  • Compliance and audit trails
  • Historical data analysis

Quick Start

Minimal working example:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: "infrastructure-monitor"
    environment: "production"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Rule files
rule_files:
  - "alerts.yml"
  - "rules.yml"

scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
// ... (see reference guides for full implementation)

Reference Guides

Detailed implementations in the references/ directory:

| Guide | Contents | |---|---| | Prometheus Configuration | Prometheus Configuration | | Alert Rules | Alert Rules | | Alertmanager Configuration | Alertmanager Configuration | | Grafana Dashboard | Grafana Dashboard | | Monitoring Deployment | Monitoring Deployment |

Best Practices

✅ DO

  • Follow established patterns and conventions
  • Write clean, maintainable code
  • Add appropriate documentation
  • Test thoroughly before deploying

❌ DON'T

  • Skip testing or validation
  • Ignore error handling
  • Hard-code configuration values