CoreWeave Production Checklist
Inference Services
- [ ] GPU type and count validated for model size
- [ ] Autoscaling configured (KServe or HPA)
- [ ] Health and readiness probes set
- [ ] Resource requests AND limits specified
- [ ] Node affinity targeting correct GPU class
- [ ] `minReplicas >= 1` for production (no cold starts)
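Several of the items above can be captured in a single Deployment spec. A minimal sketch follows; the name, image, ports, node-label key, and GPU class value are illustrative assumptions, so substitute the labels your cluster actually exposes:

```yaml
# Hypothetical inference Deployment covering the checklist items above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-inference                  # assumed name
spec:
  replicas: 2                         # keep >= 1 in production to avoid cold starts
  selector:
    matchLabels:
      app: my-inference
  template:
    metadata:
      labels:
        app: my-inference
    spec:
      affinity:
        nodeAffinity:                 # pin pods to the intended GPU class
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class   # label key is an assumption
                    operator: In
                    values: ["A100_PCIE_80GB"]
      containers:
        - name: server
          image: registry.example.com/inference:1.2.3   # assumed trusted registry
          resources:
            requests:                 # requests AND limits both set
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 32Gi
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: 48Gi
          readinessProbe:             # gate traffic until the model is loaded
            httpGet: { path: /ready, port: 8080 }
            initialDelaySeconds: 30
          livenessProbe:              # restart a wedged server
            httpGet: { path: /health, port: 8080 }
            periodSeconds: 15
```

An HPA or KServe autoscaler can then target this Deployment; the key point is that the floor of the autoscaler matches the `minReplicas >= 1` rule above.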
Storage
- [ ] Model weights in PVC (not downloaded at startup)
- [ ] Checkpoints saved to persistent storage
- [ ] Storage class appropriate (SSD for inference, HDD for archival)
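A PVC for model weights might look like the sketch below. The storage class name is an assumption; list the classes actually available with `kubectl get storageclass` and pick an SSD-backed one for inference:

```yaml
# Hypothetical PVC holding pre-downloaded model weights.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadOnlyMany"]     # many inference pods can mount the same weights
  storageClassName: shared-nvme     # assumed SSD class; use an HDD class for archival checkpoints
  resources:
    requests:
      storage: 200Gi
```

Mounting this volume read-only into the inference pods makes startup a mount operation rather than a multi-gigabyte download.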
Security
- [ ] Secrets for model tokens and registry access
- [ ] Network policies applied
- [ ] Container images from trusted registries
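A common starting point for the network-policy item is to allow ingress to the model server only from an explicit client. A sketch, assuming the pod labels used earlier (`app: my-inference`) and a hypothetical `api-gateway` label:

```yaml
# Hypothetical policy: only the gateway may reach the inference pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-ingress
spec:
  podSelector:
    matchLabels:
      app: my-inference             # assumed pod label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway      # assumed client label
      ports:
        - port: 8080
```

Model tokens and registry credentials belong in `Secret` objects referenced via `envFrom` or `imagePullSecrets`, never in the pod spec or image.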
Monitoring
- [ ] GPU utilization metrics collected
- [ ] Inference latency and throughput tracked
- [ ] Alert on pod restarts and OOM events
- [ ] Log aggregation configured
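If the cluster runs the Prometheus operator, the restart/OOM alert can be expressed as a PrometheusRule. The two metric names are standard kube-state-metrics series; the pod-name pattern, thresholds, and severities are assumptions to tune:

```yaml
# Hypothetical alert rules for pod restarts and OOM kills.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-alerts
spec:
  groups:
    - name: inference
      rules:
        - alert: InferencePodRestarting
          expr: increase(kube_pod_container_status_restarts_total{pod=~"my-inference.*"}[15m]) > 2
          for: 5m
          labels:
            severity: warning
        - alert: InferencePodOOMKilled
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled", pod=~"my-inference.*"} == 1
          labels:
            severity: critical
```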
Rollback
To roll back a bad release and watch the rollback complete:

```shell
kubectl rollout undo deployment/my-inference
kubectl rollout status deployment/my-inference
```
Next Steps
For upgrades, see coreweave-upgrade-migration.