# CoreWeave Performance Tuning

## GPU Selection by Workload
| Workload | Recommended GPU | Why |
|----------|----------------|-----|
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8x H100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8x H100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
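On CoreWeave, a specific GPU class is typically pinned with Kubernetes node affinity. A minimal sketch, assuming the `gpu.nvidia.com/class` node label and the `A100_PCIE_80GB` class value — verify the exact label and class names available in your cluster:

```yaml
# Pin a pod to A100 80GB nodes via node affinity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu.nvidia.com/class
          operator: In
          values:
          - A100_PCIE_80GB
```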
## Inference Optimization
```yaml
# Continuous batching with vLLM
containers:
- name: vllm
  args:
  - "--model=meta-llama/Llama-3.1-8B-Instruct"
  - "--max-num-batched-tokens=8192"
  - "--max-num-seqs=256"
  - "--gpu-memory-utilization=0.90"
  - "--enable-prefix-caching"
  - "--dtype=float16"
```
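For the vLLM container above to actually be scheduled on a GPU node, it also needs a GPU resource request. A minimal sketch using the standard NVIDIA device-plugin resource name:

```yaml
# GPU request for the vLLM container (extended resources must have
# equal request and limit, so a limit alone is sufficient)
resources:
  limits:
    nvidia.com/gpu: 1
```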
## Autoscaling Tuning
```yaml
# HPA based on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"
```
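The HPA above reads `DCGM_FI_DEV_GPU_UTIL` through the custom metrics API, which requires an adapter that exposes the DCGM exporter's Prometheus metric per pod. A sketch of a prometheus-adapter rule, assuming the DCGM exporter attaches `exported_namespace`/`exported_pod` labels (label names vary by exporter configuration, so check your metric series):

```yaml
# prometheus-adapter rule exposing GPU utilization as a per-pod custom metric
rules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
  resources:
    overrides:
      exported_namespace: {resource: "namespace"}
      exported_pod: {resource: "pod"}
  metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```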
## Performance Benchmarks
| Metric | A100 80GB | H100 80GB |
|--------|-----------|-----------|
| Llama-8B throughput (tokens/s) | ~2,000 | ~4,500 |
| Llama-70B throughput (tokens/s, 4-GPU tensor parallel) | ~200 | ~500 |
| Cold start (vLLM) | 30-60 s | 20-40 s |
## Next Steps
For cost optimization, see `coreweave-cost-tuning`.