CoreWeave Observability
GPU Metrics (DCGM Exporter)
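CoreWeave Kubernetes Service (CKS) clusters ship with the NVIDIA DCGM exporter pre-installed, so per-GPU metrics are available to Prometheus without extra setup. Key metrics: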
| Metric | Description |
|--------|-------------|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization % |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MiB) |
| DCGM_FI_DEV_FB_FREE | GPU framebuffer memory free (MiB) |
| DCGM_FI_DEV_POWER_USAGE | Power consumption (W) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
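To make the metrics above concrete, here is a minimal Python sketch that parses exporter output in the Prometheus text exposition format and derives a per-GPU memory fraction. The sample text, the `gpu` and `Hostname` label values, and the helper names `parse_metrics` and `memory_fraction` are illustrative assumptions, not output captured from a CKS cluster.

```python
import re

# Hypothetical sample of exporter output; label names and values are illustrative.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 12
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 78000
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="node-a"} 2000
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 97
DCGM_FI_DEV_FB_USED{gpu="1",Hostname="node-a"} 40000
DCGM_FI_DEV_FB_FREE{gpu="1",Hostname="node-a"} 40000
"""

# Matches lines of the form: metric_name{label="value",...} 123
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return {gpu_id: {metric_name: value}} for every labelled sample line."""
    gpus = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels")))
        gpu = labels.get("gpu", "unknown")
        gpus.setdefault(gpu, {})[match.group("name")] = float(match.group("value"))
    return gpus


def memory_fraction(sample):
    """Fraction of framebuffer memory in use: used / (used + free)."""
    used = sample["DCGM_FI_DEV_FB_USED"]
    free = sample["DCGM_FI_DEV_FB_FREE"]
    return used / (used + free)
```

Calling `parse_metrics(SAMPLE)` and passing one GPU's entry to `memory_fraction` yields the used share of that GPU's framebuffer, computed from the used and free metrics in the table.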
Prometheus Alert Rules
groups:
  - name: coreweave-gpu
    rules:
      - alert: GPUUtilizationLow
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization below 20% for 30min -- consider scaling down"
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "GPU memory >95% -- risk of OOM"
      - alert: InferencePodDown
        expr: kube_deployment_status_replicas_available{deployment=~".*inference.*"} == 0
        for: 2m
        labels: { severity: critical }
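Before loading rules like these, it can help to run each `expr` as an instant query and see whether it returns any series. A comparison such as `< 20` filters the result, so a non-empty result means the condition currently holds and the `for:` timer would start. The sketch below uses the Prometheus HTTP API endpoint `/api/v1/query`. The `PROMETHEUS_URL` address and the helper names are assumptions for illustration; substitute whatever endpoint your cluster exposes.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical address; replace with the Prometheus endpoint for your cluster.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(base_url, expr, timeout=10):
    """Run an instant query; an empty list means the condition does not hold."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


def condition_holds(base_url, expr):
    """True if the alert expression currently returns at least one series."""
    return len(run_query(base_url, expr)) > 0
```

For example, `condition_holds(PROMETHEUS_URL, "avg(DCGM_FI_DEV_GPU_UTIL) < 20")` checks the first rule's expression once. It does not model the `for:` duration, which Prometheus tracks across repeated rule evaluations.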
Next Steps
For incident response, see coreweave-incident-runbook.