CoreWeave Core Workflow: KServe Inference Skill

CoreWeave Core Workflow: KServe Inference

Overview

Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.

Prerequisites

Completed coreweave-install-auth setup
KServe available on your CKS cluster
Model stored in S3, GCS, or HuggingFace

Instructions

Step 1: Deploy an InferenceService

# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-inference
  annotations:
    autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
    autoscaling.knative.dev/metric: "concurrency"
    autoscaling.knative.dev/target: "1"
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/maxScale: "5"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--port"
          - "8080"
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 48Gi
            cpu: "8"
          requests:
            nvidia.com/gpu: "1"
            memory: 32Gi
            cpu: "4"
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: token
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gpu.nvidia.com/class
                  operator: In
                  values: ["A100_PCIE_80GB"]

kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w

Step 2: Scale-to-Zero Configuration

# For dev/staging -- scale down to zero when idle
metadata:
  annotations:
    autoscaling.knative.dev/minScale: "0"    # Scale to zero
    autoscaling.knative.dev/maxScale: "3"
    autoscaling.knative.dev/scaleDownDelay: "5m"

Step 3: Test the Endpoint

# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
  -o jsonpath='{.status.url}')

curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

Error Handling

| Error | Cause | Solution | |-------|-------|----------| | InferenceService not ready | GPU not available | Check node capacity and affinity | | Scale-to-zero cold start | First request after idle | Set minScale: 1 for production | | Model loading timeout | Large model download | Pre-cache model in PVC | | OOMKilled | Model too large | Use multi-GPU or quantized model |

Resources

Next Steps

For GPU training workloads, see coreweave-core-workflow-b.

Agent Skills: CoreWeave Core Workflow: KServe Inference

Install this agent skill to your local

Skill Files