CoreWeave Core Workflow: KServe Inference
Overview
Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.
Prerequisites
- Completed
coreweave-install-authsetup - KServe available on your CKS cluster
- Model stored in S3, GCS, or HuggingFace
Instructions
Step 1: Deploy an InferenceService
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-inference
annotations:
autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
autoscaling.knative.dev/metric: "concurrency"
autoscaling.knative.dev/target: "1"
autoscaling.knative.dev/minScale: "1"
autoscaling.knative.dev/maxScale: "5"
spec:
predictor:
minReplicas: 1
maxReplicas: 5
containers:
- name: kserve-container
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--port"
- "8080"
ports:
- containerPort: 8080
protocol: TCP
resources:
limits:
nvidia.com/gpu: "1"
memory: 48Gi
cpu: "8"
requests:
nvidia.com/gpu: "1"
memory: 32Gi
cpu: "4"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_PCIE_80GB"]
kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w
Step 2: Scale-to-Zero Configuration
# For dev/staging -- scale down to zero when idle
metadata:
annotations:
autoscaling.knative.dev/minScale: "0" # Scale to zero
autoscaling.knative.dev/maxScale: "3"
autoscaling.knative.dev/scaleDownDelay: "5m"
Step 3: Test the Endpoint
# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
-o jsonpath='{.status.url}')
curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| InferenceService not ready | GPU not available | Check node capacity and affinity |
| Scale-to-zero cold start | First request after idle | Set minScale: 1 for production |
| Model loading timeout | Large model download | Pre-cache model in PVC |
| OOMKilled | Model too large | Use multi-GPU or quantized model |
Resources
Next Steps
For GPU training workloads, see coreweave-core-workflow-b.