CoreWeave Incident Runbook
Triage Steps
# 1. Check pod status
kubectl get pods -l app=inference -o wide
# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20
# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide
# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi
Common Incidents
Inference Service Down
- Check pod status and events
- If OOMKilled: raise the pod memory limit or reduce batch size; for CUDA out-of-memory errors in the logs, reduce batch size or move to a larger GPU class
- If ImagePullBackOff: check registry credentials
- If Pending: check GPU quota and availability
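The pod-status check above can be scripted as a quick triage loop; a minimal sketch, assuming pods are labeled `app=inference` as in the triage commands:

```shell
# Sketch: print each inference pod's phase and last termination reason,
# so OOMKilled / ImagePullBackOff cases stand out at a glance.
# Assumes pods are labeled app=inference, as in the triage steps.
for pod in $(kubectl get pods -l app=inference -o name); do
  phase=$(kubectl get "$pod" -o jsonpath='{.status.phase}')
  reason=$(kubectl get "$pod" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
  echo "$pod phase=$phase lastTermination=${reason:-none}"
done
```

For `ImagePullBackOff` and `Pending` pods, `kubectl describe pod <name>` shows the underlying event (bad credentials, unschedulable GPU request, exhausted quota).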
GPU Node Failure
- Pods will be rescheduled automatically once the node goes NotReady and the eviction timeout expires (about 5 minutes by default)
- If no capacity: scale down non-critical workloads
- Contact CoreWeave support for extended outages
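When replacement GPU capacity is short, the scale-down step might look like the following sketch; the node and deployment names here (`gpu-node-01`, `batch-eval`) are placeholders, not actual cluster names:

```shell
# Cordon the failed node so nothing new is scheduled onto it
# ("gpu-node-01" is a placeholder node name).
kubectl cordon gpu-node-01

# Free GPU capacity by scaling a non-critical workload to zero
# ("batch-eval" is a hypothetical deployment name; substitute your own).
kubectl scale deployment/batch-eval --replicas=0

# Confirm the inference pods were rescheduled onto healthy nodes.
kubectl get pods -l app=inference -o wide
```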
Model Loading Failure
- Check HuggingFace token secret exists
- Verify the model repository name (org/model-name) is spelled correctly
- Check PVC has sufficient storage
- Review container logs for download errors
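These checks can be run directly; a sketch, assuming the token lives in a secret named `hf-token` (the secret name is an assumption, substitute your own):

```shell
# 1. Confirm the HuggingFace token secret exists ("hf-token" is an assumed name).
kubectl get secret hf-token

# 2. Check that the model-cache PVC is Bound and large enough for the weights.
kubectl get pvc

# 3. Scan recent container logs for download failures
#    (401 usually means a bad token, 404 a misspelled model name).
kubectl logs -l app=inference --tail=100 | grep -iE 'error|denied|401|404'
```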
Rollback
kubectl rollout undo deployment/inference
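After the undo, it is worth watching the rollout complete and confirming which revision is live; a sketch:

```shell
# Block until the rolled-back pods are ready, or fail after the timeout.
kubectl rollout status deployment/inference --timeout=5m

# Inspect revision history to confirm which revision is now active.
kubectl rollout history deployment/inference
```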
Next Steps
For data handling, see coreweave-data-handling.