Agent Skills: CoreWeave Observability


ID: jeremylongshore/claude-code-plugins-plus-skills/coreweave-observability

Install this agent skill locally:

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/coreweave-pack/skills/coreweave-observability


plugins/saas-packs/coreweave-pack/skills/coreweave-observability/SKILL.md

Skill Metadata

Name
coreweave-observability

CoreWeave Observability

GPU Metrics (DCGM Exporter)

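CoreWeave Kubernetes Service (CKS) clusters come with the NVIDIA DCGM exporter pre-installed, so per-GPU metrics are available to Prometheus without extra setup. Key metrics: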

| Metric | Description |
|--------|-------------|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MiB) |
| DCGM_FI_DEV_FB_FREE | GPU framebuffer memory free (MiB) |
| DCGM_FI_DEV_POWER_USAGE | Power consumption (W) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
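To make the memory metrics above concrete, here is a minimal Python sketch that reads DCGM exporter output in the Prometheus text format and computes the used-memory fraction per GPU. The sample payload, the `gpu` label used as the key, and all function names are assumptions for illustration; real exporter output carries more labels, and the simple regex does not handle label values that contain `}`.

```python
import re
from collections import defaultdict

# Hypothetical sample of DCGM exporter output (values are made up).
SAMPLE = """\
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0"} 60000
DCGM_FI_DEV_FB_USED{gpu="1"} 78000
DCGM_FI_DEV_FB_FREE{gpu="0"} 20000
DCGM_FI_DEV_FB_FREE{gpu="1"} 2000
"""

# Matches `name{labels} value`; a simplification of the full text format.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
GPU_RE = re.compile(r'(?:^|,)gpu="([^"]*)"')


def parse_gpu_metrics(text):
    """Return {gpu_id: {metric_name: value}} for lines carrying a gpu label."""
    metrics = defaultdict(dict)
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue
        gpu = GPU_RE.search(match.group("labels"))
        if gpu:
            metrics[gpu.group(1)][match.group("name")] = float(match.group("value"))
    return dict(metrics)


def memory_fraction(sample):
    """Used / (used + free), or None if either metric is missing."""
    used = sample.get("DCGM_FI_DEV_FB_USED")
    free = sample.get("DCGM_FI_DEV_FB_FREE")
    if used is None or free is None or used + free == 0:
        return None
    return used / (used + free)


def gpus_over_threshold(text, threshold=0.95):
    """List GPU ids whose used-memory fraction exceeds the threshold."""
    over = []
    for gpu, sample in parse_gpu_metrics(text).items():
        fraction = memory_fraction(sample)
        if fraction is not None and fraction > threshold:
            over.append(gpu)
    return sorted(over)


if __name__ == "__main__":
    print(gpus_over_threshold(SAMPLE))
```

With the sample payload, GPU 0 sits at 75% and GPU 1 at 97.5%, so only GPU 1 is reported against a 0.95 threshold.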

Prometheus Alert Rules

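groups:
  - name: coreweave-gpu
    rules:
      # Fires when utilization averaged across ALL scraped GPUs stays low.
      # Add "by (Hostname)" or similar to the avg() for a per-node signal.
      - alert: GPUUtilizationLow
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization below 20% for 30min -- consider scaling down"

      # Per-GPU check: used / (used + free) framebuffer memory.
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "GPU memory >95% -- risk of OOM"

      # Requires kube-state-metrics; matches deployments with "inference" in the name.
      - alert: InferencePodDown
        expr: kube_deployment_status_replicas_available{deployment=~".*inference.*"} == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Inference deployment has 0 available replicas for 2min"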
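
Before loading rules like these, it can help to run an alert expression ad hoc and see whether it currently matches. The sketch below uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) with only the Python standard library. The base URL `http://prometheus.example:9090` and the function names are hypothetical; substitute whatever endpoint your cluster exposes.

```python
import json
import urllib.parse
import urllib.request

# Same expression as the GPUUtilizationLow rule.
LOW_UTIL_EXPR = "avg(DCGM_FI_DEV_GPU_UTIL) < 20"


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(payload):
    """Turn an instant-vector response into [(labels, value), ...]."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got %s" % data["resultType"])
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(base_url, expr, timeout=10):
    """Execute the query over HTTP and return the parsed samples."""
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))


if __name__ == "__main__":
    # Hypothetical endpoint; replace with your Prometheus address.
    samples = run_query("http://prometheus.example:9090", LOW_UTIL_EXPR)
    print("condition currently true" if samples else "condition not met", samples)
```

A comparison expression such as `< 20` filters the result, so an empty list means the condition is not met right now. This checks only the instantaneous condition; the `for:` duration is applied by Prometheus when it evaluates the rule.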

Resources

Next Steps

For incident response, see coreweave-incident-runbook.