AWS CloudWatch Skill

Set up comprehensive monitoring and alerting for AWS resources.

Quick Reference

| Attribute | Value |
|-----------|-------|
| AWS Service | CloudWatch |
| Complexity | Medium |
| Est. Time | 15-30 min |
| Prerequisites | Resources to monitor |

Parameters

Required

| Parameter | Type | Description | Validation |
|-----------|------|-------------|------------|
| namespace | string | Metric namespace | AWS/* or custom |
| metric_name | string | Metric name | Valid metric |
| resource_id | string | Resource identifier | Valid ARN or ID |

Optional

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| period | int | 300 | Length of each evaluation period (seconds) |
| statistic | string | Average | Average, Sum, Minimum, Maximum, SampleCount, or a percentile such as p99 |
| threshold | float | varies | Alert threshold |
| evaluation_periods | int | 3 | Number of consecutive periods evaluated |
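The parameter tables above can be captured in a small data class that fills in the documented defaults and flags obviously bad input. This is a minimal sketch: `AlarmParams` and `validate` are hypothetical names, and the period check assumes standard-resolution metrics (multiples of 60 seconds), so high-resolution periods of 10 or 30 seconds would be rejected.

```python
from dataclasses import dataclass
from typing import List

# Statistic names accepted besides percentiles such as p99.
STANDARD_STATISTICS = {"Average", "Sum", "Minimum", "Maximum", "SampleCount"}


@dataclass
class AlarmParams:
    # Required parameters.
    namespace: str
    metric_name: str
    resource_id: str
    threshold: float  # the table lists "varies", so the caller must supply it
    # Optional parameters with the defaults from the table.
    period: int = 300
    statistic: str = "Average"
    evaluation_periods: int = 3

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means nothing was flagged."""
        problems = []
        for field_name in ("namespace", "metric_name", "resource_id"):
            if not getattr(self, field_name):
                problems.append(f"{field_name} is empty")
        # Assumption: standard-resolution metrics only.
        if self.period <= 0 or self.period % 60 != 0:
            problems.append("period should be a positive multiple of 60 seconds")
        # Percentiles look like p99 or p99.9.
        digits = self.statistic[1:].replace(".", "", 1)
        is_percentile = self.statistic.startswith("p") and digits.isdigit()
        if self.statistic not in STANDARD_STATISTICS and not is_percentile:
            problems.append(f"unknown statistic: {self.statistic}")
        if self.evaluation_periods < 1:
            problems.append("evaluation_periods must be at least 1")
        return problems
```

Calling `AlarmParams("AWS/EC2", "CPUUtilization", "i-1234567890abcdef0", threshold=80).validate()` returns an empty list, and the defaults for `period`, `statistic`, and `evaluation_periods` come from the Optional table.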

Essential Alarms

EC2 Alarms

- name: HighCPU
  metric: CPUUtilization
  threshold: 80
  period: 300
  evaluation_periods: 3

- name: StatusCheckFailed
  metric: StatusCheckFailed
  threshold: 1
  period: 60
  evaluation_periods: 2

ECS Alarms

- name: HighCPU
  metric: CPUUtilization
  threshold: 80

- name: HighMemory
  metric: MemoryUtilization
  threshold: 85

- name: RunningTaskCount
  metric: RunningTaskCount  # Container Insights metric (namespace ECS/ContainerInsights)
  threshold: 1
  comparison: LessThan

RDS Alarms

- name: HighCPU
  metric: CPUUtilization
  threshold: 80

- name: LowFreeStorage
  metric: FreeStorageSpace
  threshold: 10737418240  # 10 GiB, in bytes
  comparison: LessThan

- name: HighConnections
  metric: DatabaseConnections
  threshold: 100
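The alarm lists above omit fields that `put_metric_alarm` needs, such as the comparison operator and the dimensions. The sketch below shows one way to expand an entry into full keyword arguments. The helper name `alarm_kwargs`, the `prod` name prefix, the `GreaterThan` default, and the `my-database` identifier are all assumptions for illustration.

```python
# Maps the short `comparison` values used in the lists to CloudWatch operator names.
COMPARISON_OPERATORS = {
    "GreaterThan": "GreaterThanThreshold",
    "LessThan": "LessThanThreshold",
}


def alarm_kwargs(spec, namespace, dimensions, prefix="prod"):
    """Expand one alarm entry into keyword arguments for put_metric_alarm."""
    # Assumption: entries without `comparison` alarm when the metric is too high.
    comparison = spec.get("comparison", "GreaterThan")
    return {
        "AlarmName": f"{prefix}-{spec['name']}",
        "Namespace": namespace,
        "MetricName": spec["metric"],
        "Dimensions": dimensions,
        "Statistic": "Average",
        "Period": spec.get("period", 300),
        "EvaluationPeriods": spec.get("evaluation_periods", 3),
        "Threshold": float(spec["threshold"]),
        "ComparisonOperator": COMPARISON_OPERATORS[comparison],
    }


# The RDS entries from the list above, written as plain dictionaries.
rds_alarms = [
    {"name": "HighCPU", "metric": "CPUUtilization", "threshold": 80},
    {"name": "LowFreeStorage", "metric": "FreeStorageSpace",
     "threshold": 10737418240, "comparison": "LessThan"},
    {"name": "HighConnections", "metric": "DatabaseConnections", "threshold": 100},
]

# "my-database" is a placeholder DB instance identifier.
dimensions = [{"Name": "DBInstanceIdentifier", "Value": "my-database"}]
alarm_requests = [alarm_kwargs(spec, "AWS/RDS", dimensions) for spec in rds_alarms]
```

Each dictionary in `alarm_requests` can then be passed to a boto3 CloudWatch client as `cloudwatch.put_metric_alarm(**request)`.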

Implementation

Create Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name prod-ec2-high-cpu \
  --alarm-description "EC2 CPU > 80% for 15 minutes" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --treat-missing-data notBreaching
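Once an alarm like the one above exists, one way to check that its SNS actions are wired up is to force it into the ALARM state for a moment. The sketch below uses the `set_alarm_state` call of the boto3 CloudWatch client. The function name `trigger_test_notification` and the reason text are illustrative, and `client` is assumed to be `boto3.client("cloudwatch")`.

```python
def trigger_test_notification(client, alarm_name):
    """Force an alarm into ALARM so that its alarm actions run once.

    CloudWatch recomputes the real state at the next evaluation, so the
    forced state is temporary.
    """
    client.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="ALARM",
        StateReason="Manual test of alarm actions",
    )
```

For the alarm created above this would be called as `trigger_test_notification(boto3.client("cloudwatch"), "prod-ec2-high-cpu")`. Subscribers of the `alerts` topic should then receive a notification.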

Dashboard Template

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "EC2 CPU Utilization",
        "metrics": [
          ["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "ECS Service Memory",
        "metrics": [
          ["AWS/ECS", "MemoryUtilization", "ClusterName", "my-cluster", "ServiceName", "my-service"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1"
      }
    }
  ]
}
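A dashboard body like the template above can also be generated in code, which keeps the widgets consistent as the list grows. This is a minimal sketch: `metric_widget` and `dashboard_body` are hypothetical helpers, and `my-cluster` is a placeholder, included because service-level AWS/ECS metrics are keyed by both cluster and service.

```python
import json


def metric_widget(title, metric, region="us-east-1", period=300, stat="Average"):
    """Build one metric widget; `metric` is [namespace, name, dim_name, dim_value, ...]."""
    return {
        "type": "metric",
        "properties": {
            "title": title,
            "metrics": [metric],
            "period": period,
            "stat": stat,
            "region": region,
        },
    }


def dashboard_body(widgets):
    """Serialize widgets into the JSON string expected as DashboardBody."""
    return json.dumps({"widgets": widgets})


body = dashboard_body([
    metric_widget("EC2 CPU Utilization",
                  ["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]),
    metric_widget("ECS Service Memory",
                  ["AWS/ECS", "MemoryUtilization",
                   "ClusterName", "my-cluster", "ServiceName", "my-service"]),
])
```

The resulting string can be uploaded with `boto3.client("cloudwatch").put_dashboard(DashboardName="prod-overview", DashboardBody=body)`, where `prod-overview` is an example name.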

Custom Metrics

import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish custom metric
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'RequestLatency',
            'Dimensions': [
                {'Name': 'Service', 'Value': 'API'},
                {'Name': 'Environment', 'Value': 'prod'}
            ],
            'Value': 150.5,
            'Unit': 'Milliseconds'
        }
    ]
)
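When an application emits many data points, sending them in batches reduces the number of API calls. The sketch below assumes a boto3 CloudWatch client is passed in as `client`. The `publish_metrics` and `chunked` names are illustrative, and the default batch size of 20 is a conservative assumption, not the service limit, which has changed over time.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def publish_metrics(client, namespace, metric_data, batch_size=20):
    """Send metric_data in batches and return the number of requests made."""
    calls = 0
    for batch in chunked(metric_data, batch_size):
        client.put_metric_data(Namespace=namespace, MetricData=batch)
        calls += 1
    return calls
```

A call such as `publish_metrics(cloudwatch, "MyApp", data)` with 45 entries in `data` would issue three `put_metric_data` requests of 20, 20, and 5 entries.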

Log Insights Queries

Error Rate

fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m)

Latency Analysis

fields @timestamp, latency
| stats avg(latency) as avg_latency,
        pct(latency, 95) as p95_latency,
        pct(latency, 99) as p99_latency
  by bin(1h)

Top Errors

fields @timestamp, @message
| filter @message like /Exception|Error/
| parse @message /(?<error_type>\w+Exception)/
| stats count() as count by error_type
| sort count desc
| limit 10
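The queries above can be run from code as well as from the console. The sketch below uses the `start_query` and `get_query_results` calls of the boto3 CloudWatch Logs client and polls until the query finishes. The `run_insights_query` name, the polling interval, and the poll limit are assumptions, and `logs` is expected to be `boto3.client("logs")`.

```python
import time


def run_insights_query(logs, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it reaches a terminal status."""
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,  # epoch seconds
        endTime=end,      # epoch seconds
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs; flatten to dicts.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")
```

For the Top Errors query, each returned row would be a dictionary with `error_type` and `count` keys. The log group name is whatever the application writes to and is not specified here.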

Troubleshooting

Common Issues

| Symptom | Cause | Solution |
|---------|-------|----------|
| No data | Metric not emitting | Check CloudWatch Agent |
| Alarm stuck | Insufficient data | Check the --treat-missing-data setting |
| Dashboard empty | Wrong namespace | Verify metric source |
| High costs | Too many metrics | Use metric filters |
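For the "Alarm stuck" row, it can help to list every alarm currently in INSUFFICIENT_DATA together with its missing-data setting. The sketch below uses `describe_alarms` with a `StateValue` filter and follows `NextToken` by hand. `stuck_alarms` is a hypothetical helper, and `client` is assumed to be a boto3 CloudWatch client.

```python
def stuck_alarms(client):
    """Return (alarm name, treat-missing-data setting) for alarms in INSUFFICIENT_DATA."""
    found = []
    kwargs = {"StateValue": "INSUFFICIENT_DATA"}
    while True:
        page = client.describe_alarms(**kwargs)
        for alarm in page.get("MetricAlarms", []):
            # An alarm created without the setting behaves as "missing".
            setting = alarm.get("TreatMissingData", "missing")
            found.append((alarm["AlarmName"], setting))
        token = page.get("NextToken")
        if not token:
            return found
        kwargs["NextToken"] = token
```

Alarms reported with `missing` stay in INSUFFICIENT_DATA whenever the metric stops reporting, which is often the explanation for an alarm that never changes state.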

Debug Checklist

[ ] CloudWatch Agent installed and running?
[ ] IAM role allows cloudwatch:PutMetricData?
[ ] Correct namespace and dimensions?
[ ] Metric has data in expected period?
[ ] Alarm threshold reasonable?
[ ] SNS topic has subscriptions?
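The "Metric has data in expected period?" item can be checked programmatically. The sketch below asks `get_metric_statistics` for sample counts over a recent window and reports whether anything came back. The `metric_has_data` name and the 60-minute default window are assumptions, and `client` is expected to be a boto3 CloudWatch client.

```python
from datetime import datetime, timedelta, timezone


def metric_has_data(client, namespace, metric_name, dimensions,
                    minutes=60, period=300):
    """Return True if the metric reported at least one datapoint in the window."""
    end = datetime.now(timezone.utc)
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=dimensions,
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=period,
        Statistics=["SampleCount"],
    )
    return len(response["Datapoints"]) > 0
```

A `False` result with the namespace and dimensions copied from the alarm usually points at the earlier checklist items: the agent, the IAM permissions, or a dimension mismatch.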

Test Template

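import boto3

# Integration test: needs AWS credentials and a default region.
cw = boto3.client('cloudwatch')

def test_cloudwatch_alarm():
    # Arrange
    alarm_name = "test-alarm"

    try:
        # Act: create a throwaway alarm (no actions attached, so nothing is notified)
        cw.put_metric_alarm(
            AlarmName=alarm_name,
            MetricName='CPUUtilization',
            Namespace='AWS/EC2',
            Statistic='Average',
            Period=300,
            EvaluationPeriods=1,
            Threshold=80,
            ComparisonOperator='GreaterThanThreshold'
        )

        # Assert: the alarm can be found by name
        response = cw.describe_alarms(AlarmNames=[alarm_name])
        assert len(response['MetricAlarms']) == 1
    finally:
        # Cleanup: runs even if the assertion fails
        cw.delete_alarms(AlarmNames=[alarm_name])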

Assets

assets/alarm-config.yaml - Common alarm configurations
