vLLM Model Serving and Inference
Quick Start
Docker (CPU)
docker run --rm -p 8000:8000 \
--shm-size=4g \
--cap-add SYS_NICE \
--security-opt seccomp=unconfined \
-e VLLM_CPU_KVCACHE_SPACE=4 \
<vllm-cpu-image> \
--model <model-name> \
--dtype float32
# Access: http://localhost:8000
Docker (GPU)
docker run --rm -p 8000:8000 \
--gpus all \
--shm-size=4g \
<vllm-gpu-image> \
--model <model-name>
# Access: http://localhost:8000
Docker Deployment
1. Assess Hardware Requirements
| Hardware | Minimum Memory | Recommended |
|----------|----------------|-------------|
| CPU (RAM) | 2x model size | 4x model size |
| GPU (VRAM) | Model size + 2GB | Model size + 4GB |
- Check model documentation for specific requirements
- Consider quantized variants to reduce the memory footprint (a rough sizing sketch follows this list)
- Allocate 50-100GB storage for model downloads
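A back-of-the-envelope, weights-only estimate can help sanity-check the table above. The values below are hypothetical placeholders; KV cache and activations add further overhead on top of this figure.
PARAMS_B=7           # hypothetical: parameter count in billions
BYTES_PER_PARAM=2    # 2 = fp16/bf16, 4 = fp32, 1 = 8-bit quantized
echo "~$((PARAMS_B * BYTES_PER_PARAM)) GB for weights alone"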
2. Pull the Container Image
# CPU image (check vLLM docs for latest tag)
docker pull <vllm-cpu-image>
# GPU image (check vLLM docs for latest tag)
docker pull <vllm-gpu-image>
Notes:
- Use CPU-specific images for CPU inference
- Use CUDA-enabled images matching your GPU architecture
- Verify CPU instruction set compatibility (AVX512, AVX2); a quick check is shown below
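One way to confirm instruction set support on Linux (flag names as reported by the kernel):
grep -o 'avx512f\|avx2' /proc/cpuinfo | sort -u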
3. Configure and Run
CPU Deployment:
docker run --rm \
--shm-size=4g \
--cap-add SYS_NICE \
--security-opt seccomp=unconfined \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=4 \
-e VLLM_CPU_OMP_THREADS_BIND=0-7 \
<vllm-cpu-image> \
--model <model-name> \
--dtype float32 \
--max-model-len 2048
GPU Deployment:
docker run --rm \
--gpus all \
--shm-size=4g \
-p 8000:8000 \
<vllm-gpu-image> \
--model <model-name> \
--dtype auto \
--max-model-len 4096
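Model download and warmup can take several minutes on first start; following the container logs shows progress. A sketch, assuming a single running container from this image:
docker logs -f $(docker ps -q --filter ancestor=<vllm-gpu-image>)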
4. Verify Deployment
# Check health
curl http://localhost:8000/health
# List models
curl http://localhost:8000/v1/models
# Test inference
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 10}'
5. Update
docker pull <vllm-image>
docker stop <container-id>
# Re-run with the same parameters
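A minimal end-to-end update flow, assuming the container was started detached with a fixed name (hypothetical name vllm; --rm removes it on stop):
docker pull <vllm-image>
docker stop vllm        # --rm cleans up the stopped container
docker run --rm -d --name vllm --gpus all --shm-size=4g -p 8000:8000 \
  <vllm-image> \
  --model <model-name>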
Cloud VM Deployment
1. Provision Infrastructure
# Create security group with rules:
# - TCP 22 (SSH)
# - TCP 8000 (API)
# Launch instance with:
# - Sufficient RAM/VRAM for model
# - Docker pre-installed (or install manually)
# - 50-100GB root volume
# - Public IP for external access
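As one concrete illustration, opening the API port with the AWS CLI could look like this (hypothetical provider choice; placeholder IDs):
# Restrict access to a trusted address rather than 0.0.0.0/0 where possible
aws ec2 authorize-security-group-ingress \
  --group-id <sg-id> \
  --protocol tcp --port 8000 \
  --cidr <your-ip>/32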
2. Connect and Deploy
ssh -i <key-file> <user>@<instance-ip>
# Install Docker if not present
# Pull and run vLLM container (see Docker Deployment section)
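If Docker is not preinstalled, the upstream convenience script is one common route on Linux (review the script before running it):
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh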
3. Verify External Access
# From local machine
curl http://<instance-ip>:8000/health
curl http://<instance-ip>:8000/v1/models
4. Cleanup
# Stop container
docker stop <container-id>
# Terminate the instance to stop incurring charges
# Delete associated resources (volumes, security groups) if they were temporary
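Continuing the hypothetical AWS example from step 1, teardown could look like:
docker stop <container-id>                                # on the instance
aws ec2 terminate-instances --instance-ids <instance-id>  # from your workstation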
Configuration Reference
Environment Variables
| Variable | Purpose | Example |
|----------|---------|---------|
| VLLM_CPU_KVCACHE_SPACE | KV cache size in GB (CPU) | 4 |
| VLLM_CPU_OMP_THREADS_BIND | CPU core binding (CPU) | 0-7 |
| CUDA_VISIBLE_DEVICES | GPU device selection | 0,1 |
| HF_TOKEN | HuggingFace authentication | hf_xxx |
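For example, a single-GPU run that authenticates to Hugging Face for a gated model might combine these variables (sketch; placeholder token):
docker run --rm --gpus all --shm-size=4g -p 8000:8000 \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e HF_TOKEN=<hf-token> \
  <vllm-gpu-image> \
  --model <model-name>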
Docker Flags
| Flag | Purpose |
|------|---------|
| --shm-size=4g | Shared memory for IPC |
| --cap-add SYS_NICE | NUMA optimization (CPU) |
| --security-opt seccomp=unconfined | Memory policy syscalls (CPU) |
| --gpus all | GPU access |
| -p 8000:8000 | Port mapping |
vLLM Arguments
| Argument | Purpose | Example |
|----------|---------|---------|
| --model | Model name/path | <model-name> |
| --dtype | Data type | float32, auto, bfloat16 |
| --max-model-len | Max context length | 2048 |
| --tensor-parallel-size | Multi-GPU parallelism | 2 |
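A two-GPU tensor-parallel launch is a small extension of the GPU command above (assumes two visible GPUs):
docker run --rm --gpus all --shm-size=4g -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --tensor-parallel-size 2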
API Endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| /health | GET | Health check |
| /v1/models | GET | List available models |
| /v1/completions | POST | Text completion |
| /v1/chat/completions | POST | Chat completion |
| /metrics | GET | Prometheus metrics |
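The metrics endpoint serves Prometheus text format. A quick spot check (recent vLLM releases prefix metric names with vllm:, but verify against your version):
curl -s http://localhost:8000/metrics | grep '^vllm' | head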
Production Checklist
- [ ] Verify model fits in available memory
- [ ] Configure appropriate data type for hardware
- [ ] Set up firewall/security group rules
- [ ] Test API endpoints before production use
- [ ] Configure monitoring (Prometheus metrics)
- [ ] Set up health check alerts
- [ ] Document model and configuration used
- [ ] Plan for model updates and rollbacks
Troubleshooting
| Issue | Solution |
|-------|----------|
| Container exits immediately | Increase RAM or use a smaller model |
| Slow inference (CPU) | Verify OMP thread binding configuration |
| Connection refused externally | Check firewall/security group rules |
| Model download fails | Set HF_TOKEN for gated models |
| Out of memory during inference | Reduce --max-model-len or batch size |
| Port already in use | Change the host port mapping |
| Warmup takes too long | Normal for large models (1-5 min) |