# CoreWeave Performance Tuning

## GPU Selection by Workload
| Workload | Recommended GPU | Why |
|----------|----------------|-----|
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8x H100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8x H100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
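On CoreWeave, a specific GPU class is typically pinned with Kubernetes node affinity. A minimal sketch, assuming the `gpu.nvidia.com/class` node label and the `A100_PCIE_80GB` class value — verify the exact label and class names available in your cluster:

```yaml
# Pin a pod to A100 80GB nodes via node affinity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu.nvidia.com/class
          operator: In
          values:
          - A100_PCIE_80GB
```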
## Inference Optimization
```yaml
# Continuous batching with vLLM
containers:
- name: vllm
  args:
  - "--model=meta-llama/Llama-3.1-8B-Instruct"
  - "--max-num-batched-tokens=8192"
  - "--max-num-seqs=256"
  - "--gpu-memory-utilization=0.90"
  - "--enable-prefix-caching"
  - "--dtype=float16"
```
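For the vLLM container above to actually be scheduled on a GPU node, it also needs a GPU resource request. A minimal sketch using the standard NVIDIA device-plugin resource name:

```yaml
# GPU request for the vLLM container (extended resources must have
# equal request and limit, so a limit alone is sufficient)
resources:
  limits:
    nvidia.com/gpu: 1
```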
## Autoscaling Tuning
```yaml
# HPA based on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"
```
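The HPA above reads `DCGM_FI_DEV_GPU_UTIL` through the custom metrics API, which requires an adapter that exposes the DCGM exporter's Prometheus metric per pod. A sketch of a prometheus-adapter rule, assuming the DCGM exporter attaches `exported_namespace`/`exported_pod` labels (label names vary by exporter configuration, so check your metric series):

```yaml
# prometheus-adapter rule exposing GPU utilization as a per-pod custom metric
rules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
  resources:
    overrides:
      exported_namespace: {resource: "namespace"}
      exported_pod: {resource: "pod"}
  metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```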
## Performance Benchmarks
| Metric | A100 80GB | H100 80GB |
|--------|-----------|-----------|
| Llama-8B throughput (tokens/s) | ~2,000 | ~4,500 |
| Llama-70B throughput (tokens/s, 4-GPU tensor parallel) | ~200 | ~500 |
| Cold start (vLLM) | 30-60 s | 20-40 s |
## Next Steps
For cost optimization, see `coreweave-cost-tuning`.