CoreWeave Reference Architecture
Architecture Diagram
              ┌─────────────────────┐
              │    Load Balancer    │
              │    (Ingress/LB)     │
              └──────────┬──────────┘
                         │
         ┌───────────────┼────────────────┐
         │               │                │
┌────────▼──────┐ ┌──────▼────────┐ ┌─────▼───────┐
│ Model A       │ │ Model B       │ │ Model C     │
│ (vLLM, A100)  │ │ (TGI, H100)   │ │ (SD, L40)   │
│ 2 replicas    │ │ 1 replica     │ │ 3 replicas  │
└───────────────┘ └───────────────┘ └─────────────┘
         │               │                │
┌────────▼───────────────▼────────────────▼───────┐
│              Shared Storage (PVC)               │
│           Models / Checkpoints / Data           │
└─────────────────────────────────────────────────┘
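The diagram can be expressed as one Deployment per model, each mounting the same PVC. The sketch below builds those manifests as plain dictionaries and renders them as a JSON `List`, which `kubectl apply -f` accepts. Replica counts, frameworks, and GPU types come from the diagram. The resource names, image references, PVC name, mount path, and node label key and values are placeholders chosen for illustration, not values taken from this architecture. Sharing one PVC across several pods generally requires an access mode that allows multiple mounts (for example `ReadWriteMany`).

```python
import json

# Replica counts, frameworks, and GPU types mirror the diagram.
# Everything else (names, images, label key/values) is a placeholder.
MODELS = [
    {"name": "model-a", "framework": "vllm", "gpu": "A100", "replicas": 2},
    {"name": "model-b", "framework": "tgi", "gpu": "H100", "replicas": 1},
    {"name": "model-c", "framework": "sd", "gpu": "L40", "replicas": 3},
]

SHARED_PVC = "shared-models"            # hypothetical PVC name
GPU_LABEL_KEY = "gpu.nvidia.com/class"  # assumed node label key; verify on your cluster


def deployment(model: dict) -> dict:
    """Build an apps/v1 Deployment for one model server."""
    labels = {"app": model["name"]}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": model["name"], "labels": labels},
        "spec": {
            "replicas": model["replicas"],
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Pin pods to a GPU type (label value is a placeholder).
                    "nodeSelector": {GPU_LABEL_KEY: model["gpu"]},
                    "containers": [{
                        "name": "server",
                        "image": f"example.invalid/{model['framework']}:latest",
                        "resources": {"limits": {"nvidia.com/gpu": "1"}},
                        "volumeMounts": [{"name": "models", "mountPath": "/models"}],
                    }],
                    # Every Deployment mounts the same claim.
                    "volumes": [{
                        "name": "models",
                        "persistentVolumeClaim": {"claimName": SHARED_PVC},
                    }],
                },
            },
        },
    }


def render_all() -> str:
    """Return all Deployments wrapped in a single JSON List object."""
    items = [deployment(m) for m in MODELS]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)


if __name__ == "__main__":
    print(render_all())
```

Piping the output to `kubectl apply -f -` would create all three Deployments in one call, assuming the PVC already exists in the target namespace.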
Project Structure
ml-platform/
├── k8s/
│   ├── base/                 # Shared templates
│   ├── models/
│   │   ├── llama-8b/         # Per-model manifests
│   │   ├── llama-70b/
│   │   └── stable-diffusion/
│   └── infra/
│       ├── storage.yaml      # PVCs
│       ├── secrets.yaml      # Model tokens
│       └── monitoring.yaml   # Prometheus rules
├── containers/
│   ├── vllm/Dockerfile
│   └── custom-server/Dockerfile
├── scripts/
│   ├── deploy.sh
│   └── benchmark.sh
└── monitoring/
    ├── grafana-dashboards/
    └── alert-rules.yaml
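One way to use this layout is to apply the `infra/` manifests (storage, secrets, monitoring) before any per-model directory, so PVCs and tokens exist when the model pods start. The following is a minimal sketch of that ordering. It is not the contents of `scripts/deploy.sh`, and the function name `apply_commands` is hypothetical. It only builds the `kubectl apply -f <dir>` commands and leaves execution to the caller.

```python
from pathlib import Path


def apply_commands(repo_root: Path) -> list[list[str]]:
    """Return kubectl commands: infra first, then each model directory by name."""
    k8s = repo_root / "k8s"
    targets = [k8s / "infra"]
    models_dir = k8s / "models"
    if models_dir.is_dir():
        # One directory per model, e.g. llama-8b/, llama-70b/.
        targets += sorted(p for p in models_dir.iterdir() if p.is_dir())
    # Skip anything that is missing rather than emitting a failing command.
    return [["kubectl", "apply", "-f", str(t)] for t in targets if t.is_dir()]


if __name__ == "__main__":
    for cmd in apply_commands(Path(".")):
        print(" ".join(cmd))
```

Each command can be handed to `subprocess.run(cmd, check=True)` once the printed order looks right.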
Key Design Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Serving framework | vLLM | Continuous batching, PagedAttention |
| GPU type (production) | A100 80GB | Best price/performance for inference |
| Storage | Shared PVC (SSD) | Fast model loading across replicas |
| Autoscaling | KServe + Knative | Native scale-to-zero support |
| Container registry | GHCR | GitHub integration, free for public images |
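To illustrate the autoscaling row, the sketch below builds a KServe `InferenceService` that runs a custom vLLM container and sets `minReplicas` to 0, which permits scale-to-zero when KServe runs in its Knative-backed serverless mode. The service name, image reference, model path, and replica ceiling are hypothetical. The container arguments assume the image's entrypoint is vLLM's OpenAI-compatible server, which may not match `containers/vllm/Dockerfile`.

```python
import json


def inference_service(name: str, image: str, model_path: str, max_replicas: int = 2) -> dict:
    """Build a KServe InferenceService for a custom vLLM container."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "minReplicas": 0,              # allow scale-to-zero when idle
                "maxReplicas": max_replicas,   # illustrative upper bound
                "containers": [{
                    "name": "kserve-container",
                    "image": image,
                    # Assumes the entrypoint accepts vLLM server flags.
                    "args": ["--model", model_path, "--port", "8080"],
                    "ports": [{"containerPort": 8080, "protocol": "TCP"}],
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }],
            }
        },
    }


if __name__ == "__main__":
    svc = inference_service(
        name="llama-8b",
        image="ghcr.io/example/vllm:latest",  # placeholder image
        model_path="/models/llama-8b",        # placeholder path on the shared PVC
    )
    print(json.dumps(svc, indent=2))
```

Scale-to-zero trades idle GPU cost for cold-start latency, since a large model has to be reloaded from storage on the first request after scale-down.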
Next Steps
For multi-environment setup, see coreweave-multi-env-setup.