vLLM Model Serving and Inference
Quick Start
Docker (CPU)
docker run --rm -p 8000:8000 \
--shm-size=4g \
--cap-add SYS_NICE \
--security-opt seccomp=unconfined \
-e VLLM_CPU_KVCACHE_SPACE=4 \
<vllm-cpu-image> \
--model <model-name> \
--dtype float32
# Access: http://localhost:8000
Docker (GPU)
docker run --rm -p 8000:8000 \
--gpus all \
--shm-size=4g \
<vllm-gpu-image> \
--model <model-name>
# Access: http://localhost:8000
Docker Deployment
1. Assess Hardware Requirements
| Hardware | Minimum Memory | Recommended |
|----------|----------------|-------------|
| CPU (RAM) | 2x model size | 4x model size |
| GPU (VRAM) | Model size + 2GB | Model size + 4GB |
- Check model documentation for specific requirements
- Consider quantized variants to reduce the memory footprint (a rough sizing sketch follows this list)
- Allocate 50-100GB storage for model downloads
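A back-of-the-envelope, weights-only estimate can help sanity-check the table above. The values below are hypothetical placeholders; KV cache and activations add further overhead on top of this figure.
PARAMS_B=7           # hypothetical: parameter count in billions
BYTES_PER_PARAM=2    # 2 = fp16/bf16, 4 = fp32, 1 = 8-bit quantized
echo "~$((PARAMS_B * BYTES_PER_PARAM)) GB for weights alone"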
2. Pull the Container Image
# CPU image (check vLLM docs for latest tag)
docker pull <vllm-cpu-image>
# GPU image (check vLLM docs for latest tag)
docker pull <vllm-gpu-image>
Notes:
- Use CPU-specific images for CPU inference
- Use CUDA-enabled images matching your GPU architecture
- Verify CPU instruction set compatibility (AVX512, AVX2); a quick check is shown below
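One way to confirm instruction set support on Linux (flag names as reported by the kernel):
grep -o 'avx512f\|avx2' /proc/cpuinfo | sort -u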
3. Configure and Run
CPU Deployment:
docker run --rm \
--shm-size=4g \
--cap-add SYS_NICE \
--security-opt seccomp=unconfined \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=4 \
-e VLLM_CPU_OMP_THREADS_BIND=0-7 \
<vllm-cpu-image> \
--model <model-name> \
--dtype float32 \
--max-model-len 2048
GPU Deployment:
docker run --rm \
--gpus all \
--shm-size=4g \
-p 8000:8000 \
<vllm-gpu-image> \
--model <model-name> \
--dtype auto \
--max-model-len 4096
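Model download and warmup can take several minutes on first start; following the container logs shows progress. A sketch, assuming a single running container from this image:
docker logs -f $(docker ps -q --filter ancestor=<vllm-gpu-image>)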
4. Verify Deployment
# Check health
curl http://localhost:8000/health
# List models
curl http://localhost:8000/v1/models
# Test inference
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 10}'
5. Update
docker pull <vllm-image>
docker stop <container-id>
# Re-run with the same parameters
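A minimal end-to-end update flow, assuming the container was started detached with a fixed name (hypothetical name vllm; --rm removes it on stop):
docker pull <vllm-image>
docker stop vllm        # --rm cleans up the stopped container
docker run --rm -d --name vllm --gpus all --shm-size=4g -p 8000:8000 \
  <vllm-image> \
  --model <model-name>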
Cloud VM Deployment
1. Provision Infrastructure
# Create security group with rules:
# - TCP 22 (SSH)
# - TCP 8000 (API)
# Launch instance with:
# - Sufficient RAM/VRAM for model
# - Docker pre-installed (or install manually)
# - 50-100GB root volume
# - Public IP for external access
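As one concrete illustration, opening the API port with the AWS CLI could look like this (hypothetical provider choice; placeholder IDs):
# Restrict access to a trusted address rather than 0.0.0.0/0 where possible
aws ec2 authorize-security-group-ingress \
  --group-id <sg-id> \
  --protocol tcp --port 8000 \
  --cidr <your-ip>/32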
2. Connect and Deploy
ssh -i <key-file> <user>@<instance-ip>
# Install Docker if not present
# Pull and run vLLM container (see Docker Deployment section)
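If Docker is not preinstalled, the upstream convenience script is one common route on Linux (review the script before running it):
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh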
3. Verify External Access
# From local machine
curl http://<instance-ip>:8000/health
curl http://<instance-ip>:8000/v1/models
4. Cleanup
# Stop container
docker stop <container-id>
# Terminate the instance to stop incurring charges
# Delete associated resources (volumes, security groups) if they were temporary
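Continuing the hypothetical AWS example from step 1, teardown could look like:
docker stop <container-id>                                # on the instance
aws ec2 terminate-instances --instance-ids <instance-id>  # from your workstation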
Configuration Reference
Environment Variables
| Variable | Purpose | Example |
|----------|---------|---------|
| VLLM_CPU_KVCACHE_SPACE | KV cache size in GB (CPU) | 4 |
| VLLM_CPU_OMP_THREADS_BIND | CPU core binding (CPU) | 0-7 |
| CUDA_VISIBLE_DEVICES | GPU device selection | 0,1 |
| HF_TOKEN | HuggingFace authentication | hf_xxx |
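For example, a single-GPU run that authenticates to Hugging Face for a gated model might combine these variables (sketch; placeholder token):
docker run --rm --gpus all --shm-size=4g -p 8000:8000 \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e HF_TOKEN=<hf-token> \
  <vllm-gpu-image> \
  --model <model-name>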
Docker Flags
| Flag | Purpose |
|------|---------|
| --shm-size=4g | Shared memory for IPC |
| --cap-add SYS_NICE | NUMA optimization (CPU) |
| --security-opt seccomp=unconfined | Memory policy syscalls (CPU) |
| --gpus all | GPU access |
| -p 8000:8000 | Port mapping |
vLLM Arguments
| Argument | Purpose | Example |
|----------|---------|---------|
| --model | Model name/path | <model-name> |
| --dtype | Data type | float32, auto, bfloat16 |
| --max-model-len | Max context length | 2048 |
| --tensor-parallel-size | Multi-GPU parallelism | 2 |
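A two-GPU tensor-parallel launch is a small extension of the GPU command above (assumes two visible GPUs):
docker run --rm --gpus all --shm-size=4g -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --tensor-parallel-size 2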
API Endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| /health | GET | Health check |
| /v1/models | GET | List available models |
| /v1/completions | POST | Text completion |
| /v1/chat/completions | POST | Chat completion |
| /metrics | GET | Prometheus metrics |
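The metrics endpoint serves Prometheus text format. A quick spot check (recent vLLM releases prefix metric names with vllm:, but verify against your version):
curl -s http://localhost:8000/metrics | grep '^vllm' | head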
Production Checklist
- [ ] Verify model fits in available memory
- [ ] Configure appropriate data type for hardware
- [ ] Set up firewall/security group rules
- [ ] Test API endpoints before production use
- [ ] Configure monitoring (Prometheus metrics)
- [ ] Set up health check alerts
- [ ] Document model and configuration used
- [ ] Plan for model updates and rollbacks
Troubleshooting
| Issue | Solution |
|-------|----------|
| Container exits immediately | Increase RAM or use a smaller model |
| Slow inference (CPU) | Verify OMP thread binding configuration |
| Connection refused externally | Check firewall/security group rules |
| Model download fails | Set HF_TOKEN for gated models |
| Out of memory during inference | Reduce --max-model-len or batch size |
| Port already in use | Change the host port mapping |
| Warmup takes too long | Normal for large models (1-5 min) |