CoreWeave Observability Skill

CoreWeave Observability

Overview

CoreWeave runs GPU-intensive workloads on Kubernetes where hardware failures, memory exhaustion, and underutilization directly impact cost and reliability. Observability must cover DCGM GPU metrics, Kubernetes pod health, inference latency, and job completion rates. Proactive monitoring prevents wasted spend on idle GPUs and catches OOM conditions before they cascade.

Key Metrics

| Metric | Type | Target | Alert Threshold | |--------|------|--------|-----------------| | GPU utilization | Gauge | > 60% | < 20% for 30m | | GPU memory usage | Gauge | < 85% | > 95% for 5m | | Inference latency p99 | Histogram | < 200ms | > 500ms | | Job completion rate | Counter | > 99% | < 95% per hour | | Pod restart count | Counter | 0 | > 3 in 15m | | Node GPU temperature | Gauge | < 80C | > 85C for 10m |

Instrumentation

async function trackInference(model: string, fn: () => Promise<any>) {
  const start = Date.now();
  try {
    const result = await fn();
    metrics.record('coreweave.inference.latency', Date.now() - start, { model, status: 'ok' });
    metrics.increment('coreweave.inference.completed', { model });
    return result;
  } catch (err) {
    metrics.increment('coreweave.inference.errors', { model, error: err.code });
    throw err;
  }
}

Health Check Dashboard

async function coreweaveHealth(): Promise<Record<string, string>> {
  const gpu = await queryPrometheus('avg(DCGM_FI_DEV_GPU_UTIL)');
  const mem = await queryPrometheus('avg(DCGM_FI_DEV_FB_USED/(DCGM_FI_DEV_FB_USED+DCGM_FI_DEV_FB_FREE))');
  const pods = await queryPrometheus('kube_deployment_status_replicas_available{namespace="inference"}');
  return {
    gpu_utilization: gpu > 20 ? 'healthy' : 'underutilized',
    gpu_memory: mem < 0.9 ? 'healthy' : 'critical',
    inference_pods: pods > 0 ? 'healthy' : 'down',
  };
}

Alerting Rules

const alerts = [
  { metric: 'DCGM_FI_DEV_GPU_UTIL', condition: 'avg < 20', window: '30m', severity: 'warning' },
  { metric: 'gpu_memory_pct', condition: '> 0.95', window: '5m', severity: 'critical' },
  { metric: 'inference_latency_p99', condition: '> 500ms', window: '10m', severity: 'warning' },
  { metric: 'pod_restart_count', condition: '> 3', window: '15m', severity: 'critical' },
];

Structured Logging

function logGpuEvent(event: string, node: string, data: Record<string, any>) {
  console.log(JSON.stringify({
    service: 'coreweave', event, node,
    gpu_model: data.gpu_model, utilization: data.util,
    memory_pct: data.memPct, temperature: data.temp,
    timestamp: new Date().toISOString(),
  }));
}

Error Handling

| Signal | Meaning | Action | |--------|---------|--------| | GPU util < 20% sustained | Idle GPUs burning cost | Scale down or reassign workload | | GPU memory > 95% | OOM imminent | Reduce batch size or add nodes | | Pod CrashLoopBackOff | Driver or config failure | Check DCGM logs, restart node | | Inference latency spike | Contention or throttling | Review GPU temp and queue depth | | Node NotReady | Hardware or network issue | Cordon node, migrate pods |

Resources

CoreWeave Observability

Next Steps

For incident response, see coreweave-incident-runbook.

Agent Skills: CoreWeave Observability

Install this agent skill to your local

Skill Files