CoreWeave Webhooks & Events
Kubernetes Event Monitoring
# Watch GPU pod events
# Comma-joined field selectors are ANDed, so three reason= terms match nothing; filter with grep instead
kubectl get events --watch | grep -E 'Scheduled|Pulled|Failed'
# Monitor GPU utilization via exec
kubectl exec -it deployment/inference -- nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5
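Filtering events by several reasons can also be done client-side in Python, which is easier to extend than a shell pipeline. This is a minimal sketch, assuming `kubectl` is on the PATH and configured for the cluster. The helper names `filter_events` and `fetch_events` and the `WATCHED_REASONS` set are hypothetical, chosen for illustration.

```python
import json
import subprocess

# Reasons of interest; an assumed set, adjust to your workloads.
WATCHED_REASONS = {"Scheduled", "Pulled", "Failed"}


def filter_events(event_list: dict, reasons: set[str]) -> list[str]:
    """Return one summary line per event whose reason is in `reasons`."""
    lines = []
    for ev in event_list.get("items", []):
        if ev.get("reason") in reasons:
            obj = ev.get("involvedObject", {})
            lines.append(
                f'{ev["reason"]}: {obj.get("kind", "?")}/{obj.get("name", "?")} '
                f'- {ev.get("message", "")}'
            )
    return lines


def fetch_events(namespace: str = "default") -> dict:
    """Fetch events in a namespace as parsed JSON via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "events", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,  # raise if kubectl fails
    )
    return json.loads(result.stdout)
```

A typical call would be `filter_events(fetch_events("default"), WATCHED_REASONS)`, printing each returned line. Unlike `--watch`, this takes a one-time snapshot, so it would need to be run periodically.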
Prometheus GPU Metrics
# DCGM exporter for GPU metrics (pre-installed on CKS)
# Key metrics:
# DCGM_FI_DEV_GPU_UTIL - GPU utilization (%)
# DCGM_FI_DEV_FB_USED - GPU framebuffer memory used (MiB)
# DCGM_FI_DEV_POWER_USAGE - Power draw (W)
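These metrics can be read programmatically through the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The sketch below is illustrative only: the Prometheus base URL is whatever your cluster exposes, the function names are hypothetical, and the `gpu` label name is an assumption about what the exporter emits. Keying by `gpu` alone would collide across nodes, so a multi-node setup would need an additional label in the key.

```python
import json
import urllib.parse
import urllib.request


def parse_instant_vector(payload: dict) -> dict[str, float]:
    """Map each series' `gpu` label to its current sample value."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    readings = {}
    for series in payload["data"]["result"]:
        # Label name `gpu` is an assumption; check the labels your exporter emits.
        gpu = series["metric"].get("gpu", "unknown")
        # Instant-vector samples are [unix_timestamp, "value-as-string"].
        readings[gpu] = float(series["value"][1])
    return readings


def query_gpu_metric(prom_url: str, metric: str = "DCGM_FI_DEV_GPU_UTIL") -> dict[str, float]:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{prom_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

Parsing is kept separate from the HTTP call so the response handling can be checked without a running Prometheus server.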
Slack Alert Integration
import json
import subprocess

import requests


def check_inference_health(deployment: str, slack_url: str) -> None:
    """Post a Slack alert when a deployment has fewer ready replicas than desired."""
    result = subprocess.run(
        ["kubectl", "get", "deployment", deployment, "-o", "json"],
        capture_output=True, text=True, check=True,  # raise if kubectl fails
    )
    deploy = json.loads(result.stdout)
    # readyReplicas is omitted from status when no replicas are ready
    ready = deploy["status"].get("readyReplicas", 0)
    desired = deploy["spec"]["replicas"]
    if ready < desired:
        requests.post(slack_url, json={
            "text": f"CoreWeave: {deployment} has {ready}/{desired} replicas ready"
        }, timeout=10)
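Running a check like this on a schedule would post the same message on every run while a deployment stays degraded. One way to limit noise is to alert only when the state changes. The following is a minimal sketch of that idea; the `ReplicaAlertState` class is hypothetical and not part of any CoreWeave or Slack API.

```python
from typing import Optional


class ReplicaAlertState:
    """Tracks the last reported state so repeated checks do not re-alert."""

    def __init__(self) -> None:
        self._degraded = False

    def message_for(self, deployment: str, ready: int, desired: int) -> Optional[str]:
        """Return alert text on a state change, or None if nothing changed."""
        degraded = ready < desired
        if degraded == self._degraded:
            return None
        self._degraded = degraded
        if degraded:
            return f"CoreWeave: {deployment} has {ready}/{desired} replicas ready"
        return f"CoreWeave: {deployment} recovered ({ready}/{desired} replicas ready)"
```

A non-None result would be sent as the `text` field of the Slack payload. Note the trade-off: a change from 1/3 to 2/3 ready replicas produces no new message, because the deployment is still degraded.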
Resources
Next Steps
For performance optimization, see coreweave-performance-tuning.