GPU Debugger
Turn errors into solutions.
This skill helps debug failed GPU CLI runs: OOM errors, sync failures, connectivity issues, model loading problems, and more.
When to Use This Skill
| Problem | This Skill Helps With |
|---------|----------------------|
| "CUDA out of memory" | OOM diagnosis and fixes |
| "Connection refused" | Connectivity troubleshooting |
| "Sync failed" | File sync debugging |
| "Pod won't start" | Provisioning issues |
| "Model won't load" | Model loading errors |
| "Command exited with error" | Exit code analysis |
| "My run is hanging" | Stuck process diagnosis |
Debugging Workflow
Error occurs
│
▼
┌─────────────────────────┐
│ 1. Collect information │
│ - Error message │
│ - Daemon logs │
│ - Exit code │
│ - VRAM usage │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ 2. Identify error type │
│ - OOM │
│ - Network │
│ - Model │
│ - Sync │
│ - Permission │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ 3. Apply fix │
│ - Config change │
│ - Code change │
│ - Retry │
└─────────────────────────┘
Information Collection Commands
Check Daemon Logs
# Last 50 log lines
gpu daemon logs --tail 50
# Full logs since last restart
gpu daemon logs
# Follow logs in real-time
gpu daemon logs --follow
Check Pod Status
# Current pod status
gpu status
# Pod details
gpu pods list
Check Job History
# Recent jobs
gpu jobs list
# Specific job details
gpu jobs show <job-id>
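These checks can also be run in a single pass. A minimal sketch that just shells out to the CLI commands shown above and prints their output:
# Minimal debug-info collector: runs the CLI checks above in sequence
import subprocess

CHECKS = [
    ("Daemon logs (last 50 lines)", ["gpu", "daemon", "logs", "--tail", "50"]),
    ("Pod status", ["gpu", "status"]),
    ("Recent jobs", ["gpu", "jobs", "list"]),
]

for label, cmd in CHECKS:
    print(f"===== {label} =====")
    subprocess.run(cmd)  # the command's stdout/stderr goes straight to the console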
Common Errors and Solutions
1. CUDA Out of Memory (OOM)
Error messages:
CUDA out of memory. Tried to allocate X GiB
RuntimeError: CUDA error: out of memory
torch.cuda.OutOfMemoryError
Diagnosis:
# In your script, check VRAM usage
import torch
print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"VRAM reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
Solutions, ordered roughly by effort:
| Solution | VRAM Savings | Effort |
|----------|--------------|--------|
| Reduce batch size | ~Linear | Easy |
| Enable gradient checkpointing | ~40% | Easy |
| Use FP16/BF16 | ~50% | Easy |
| Use INT8 quantization | ~50% | Medium |
| Use INT4 quantization | ~75% | Medium |
| Enable CPU offloading | Variable | Easy |
| Use larger GPU | Solves it | $$ |
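The savings in the table follow from bytes per parameter: FP32 weights take 4 bytes, FP16/BF16 take 2, INT8 takes 1, and INT4 roughly 0.5. A rough weight-only estimate (activations, gradients, optimizer states, and KV cache add more), assuming an 8-billion-parameter model:
# Weight-only VRAM estimate; 8B parameters is an example, substitute your model size
params = 8e9
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}
for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: ~{params * nbytes / 1e9:.0f} GB for weights")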
Quick fixes:
# Reduce batch size
BATCH_SIZE = 1 # Start small, increase until OOM
# Enable gradient checkpointing
model.gradient_checkpointing_enable()
# Use FP16
model = model.half()
# Use INT4 quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
# CPU offloading (for diffusers)
pipe.enable_model_cpu_offload()
# Clear cache between batches
torch.cuda.empty_cache()
Config fix:
{
// Upgrade to larger GPU
"gpu_type": "A100 PCIe 80GB", // Instead of RTX 4090
"min_vram": 80
}
2. Connection Refused / Timeout
Error messages:
Connection refused
Connection timed out
SSH connection failed
Failed to connect to daemon
Diagnosis:
# Check daemon status
gpu daemon status
# Check if daemon is running
ps aux | grep gpud
# Check daemon logs
gpu daemon logs --tail 20
Solutions:
| Cause | Solution |
|-------|----------|
| Daemon not running | gpu daemon start |
| Daemon crashed | gpu daemon restart |
| Wrong socket | Check GPU_DAEMON_SOCKET env var |
| Port conflict | Kill conflicting process |
Restart daemon:
gpu daemon stop
gpu daemon start
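If a restart does not help and the table above points at the socket, check whether GPU_DAEMON_SOCKET is set and whether the socket file it names actually exists. A sketch that only reports what is set:
# Check the daemon socket environment variable and socket file
import os

sock = os.environ.get("GPU_DAEMON_SOCKET")
print("GPU_DAEMON_SOCKET:", sock or "<not set>")
if sock:
    print("socket file exists:", os.path.exists(sock))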
3. Pod Won't Start / Provisioning Failed
Error messages:
Failed to create pod
No GPUs available
Insufficient resources
Provisioning timeout
Diagnosis:
# Check available GPUs
gpu machines list
# Check specific GPU availability
gpu machines list --gpu "RTX 4090"
Solutions:
| Cause | Solution |
|-------|----------|
| GPU type unavailable | Try different GPU type |
| Region full | Remove region constraint |
| Price too low | Increase max_price |
| Volume region mismatch | Use volume's region |
Config fixes (these are alternatives; apply one at a time):
{
// Use min_vram instead of exact GPU
"gpu_type": null,
"min_vram": 24, // Any GPU with 24GB+
// Or try different GPU
"gpu_type": "RTX A6000", // Alternative to RTX 4090
// Or relax region constraint
"region": null, // Any region
// Or increase price tolerance
"max_price": 2.0 // Allow up to $2/hr
}
4. Sync Errors
Error messages:
rsync error
Sync failed
File not found
Permission denied during sync
Diagnosis:
# Check sync status
gpu sync status
# Check .gitignore
cat .gitignore
# Check outputs config
cat gpu.jsonc | grep outputs
Solutions:
| Cause | Solution |
|-------|----------|
| File too large | Add to .gitignore |
| Permission issue | Check file permissions |
| Path not in outputs | Add to outputs config |
| Disk full on pod | Increase workspace size |
Config fixes:
{
// Ensure outputs are configured
"outputs": ["output/", "results/", "models/"],
// Exclude large files
"exclude_outputs": ["*.tmp", "*.log", "checkpoints/"],
// Increase storage
"workspace_size_gb": 100
}
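When sync breaks on size, it helps to find unusually large files before deciding what to add to .gitignore or exclude_outputs. A sketch (the 100 MB threshold is an arbitrary choice):
# List files larger than 100 MB under the current project directory
from pathlib import Path

THRESHOLD = 100 * 1024 * 1024  # 100 MB; adjust to taste
for path in Path(".").rglob("*"):
    if path.is_file() and path.stat().st_size > THRESHOLD:
        print(f"{path}  {path.stat().st_size / 1e6:.0f} MB")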
5. Model Loading Errors
Error messages:
Model not found
Could not load model
Safetensors error
HuggingFace rate limit
Diagnosis:
# Check if model is downloading
# Look for download progress in job output
# Check HuggingFace cache on pod
gpu run ls -la ~/.cache/huggingface/hub/
Solutions:
| Cause | Solution |
|-------|----------|
| Model not downloaded | Add to download spec |
| Wrong model path | Fix path in code |
| HF rate limit | Set HF_TOKEN |
| Network issue | Retry with timeout |
| Gated model | Accept license on HF |
Config fixes:
{
// Pre-download models
"download": [
{ "strategy": "hf", "source": "meta-llama/Llama-3.1-8B-Instruct", "timeout": 7200 }
],
// Set HF token in environment
"environment": {
"shell": {
"steps": [
{ "run": "echo 'HF_TOKEN=your_token' >> ~/.bashrc" }
]
}
}
}
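For gated or rate-limited models the token also has to reach the Python process. A minimal sketch using huggingface_hub, assuming HF_TOKEN is exported as configured above:
# Authenticate Hugging Face downloads from the HF_TOKEN environment variable
import os
from huggingface_hub import login

token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)  # enables gated-model downloads and higher rate limits
else:
    print("HF_TOKEN is not set; gated models and heavy downloads may fail")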
6. Process Hanging / Stuck
Symptoms:
- No output for a long time
- Process doesn't exit
- gpu status shows the job running forever
Diagnosis:
# Check if process is actually running
gpu run ps aux | grep python
# Check for infinite loops in logs
gpu jobs logs <job-id> --tail 100
# Check VRAM (might be swapping)
gpu run nvidia-smi
Solutions:
| Cause | Solution |
|-------|----------|
| Infinite loop | Fix code logic |
| Waiting for input | Make script non-interactive |
| VRAM thrashing | Reduce memory usage |
| Deadlock | Add timeout |
| Network wait | Add timeout to requests |
Code fixes:
# Make model loading faster and less likely to stall or exhaust CPU RAM
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True  # Prevent OOM during loading
)
# Add timeout to HTTP requests
import requests
response = requests.get(url, timeout=30)
# Add progress bars to see activity
from tqdm import tqdm
for item in tqdm(items):
    process(item)
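For steps that can genuinely deadlock, a coarse watchdog turns a silent hang into a clear error. A sketch using SIGALRM (Unix-only; run_step and the 600-second budget are placeholders):
# Abort with an exception if a single step takes longer than expected
import signal

def _on_timeout(signum, frame):
    raise TimeoutError("step exceeded 600 s - likely hung")

signal.signal(signal.SIGALRM, _on_timeout)
signal.alarm(600)   # arm the watchdog
run_step()          # your potentially hanging work (placeholder)
signal.alarm(0)     # disarm once the step finishes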
7. Exit Code Errors
Common exit codes:
| Code | Meaning | Common Cause |
|------|---------|--------------|
| 0 | Success | - |
| 1 | General error | Script exception |
| 2 | Misuse of command | Bad arguments |
| 126 | Permission denied | Script not executable |
| 127 | Command not found | Missing binary |
| 137 | Killed (OOM) | Out of memory |
| 139 | Segfault | Bad memory access |
| 143 | Terminated | Killed by signal |
Diagnosis:
# Check last job exit code
gpu jobs list --limit 1
# Get full output including error
gpu jobs logs <job-id>
Solutions:
| Exit Code | Solution |
|-----------|----------|
| 1 | Check Python traceback, fix exception |
| 126 | chmod +x script.sh |
| 127 | Install missing package |
| 137 | Reduce memory usage, bigger GPU |
| 139 | Update PyTorch/CUDA versions |
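Codes above 128 mean the process was killed by a signal: the exit code is 128 plus the signal number, which is why 137, 139, and 143 correspond to SIGKILL, SIGSEGV, and SIGTERM. A quick decoder:
# Decode signal-based exit codes (exit code = 128 + signal number)
import signal

for code in (137, 139, 143):
    sig = signal.Signals(code - 128)
    print(f"{code} = 128 + {sig.value} ({sig.name})")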
8. CUDA Version Mismatch
Error messages:
CUDA error: no kernel image is available
CUDA version mismatch
Torch not compiled with CUDA enabled
Diagnosis:
# Check CUDA version on pod
gpu run nvcc --version
gpu run nvidia-smi | head -3
# Check PyTorch CUDA version
gpu run python -c "import torch; print(torch.version.cuda)"
Solutions:
{
// Use a known-good base image
"environment": {
"base_image": "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
}
}
Or in requirements.txt:
# Install PyTorch with specific CUDA version
--extra-index-url https://download.pytorch.org/whl/cu124
torch==2.4.0
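After switching the base image or reinstalling PyTorch, it is worth verifying the result in one place. A small check to run on the pod:
# Confirm the installed PyTorch build can actually see the GPU
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0),
          "capability:", torch.cuda.get_device_capability(0))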
Debug Script Template
Add this to your projects for better error info:
#!/usr/bin/env python3
"""Wrapper script with debugging info."""
import sys
import traceback
import torch
def print_system_info():
    """Print system info for debugging."""
    print("=" * 50)
    print("SYSTEM INFO")
    print("=" * 50)
    print(f"Python: {sys.version}")
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"VRAM total: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print("=" * 50)

def main():
    # Your actual code here
    pass

if __name__ == "__main__":
    print_system_info()
    try:
        main()
    except Exception as e:
        print("\n" + "=" * 50)
        print("ERROR OCCURRED")
        print("=" * 50)
        print(f"Error type: {type(e).__name__}")
        print(f"Error message: {str(e)}")
        print("\nFull traceback:")
        traceback.print_exc()
        # Print memory info for OOM debugging
        if torch.cuda.is_available():
            print(f"\nVRAM at error: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
        sys.exit(1)
Quick Reference: Error → Solution
| Error Contains | Likely Cause | Quick Fix |
|----------------|--------------|-----------|
| CUDA out of memory | OOM | Reduce batch size, use quantization |
| Connection refused | Daemon down | gpu daemon restart |
| No GPUs available | Supply shortage | Try different GPU type |
| Model not found | Not downloaded | Add to download spec |
| Permission denied | File permissions | chmod +x or check path |
| Killed | OOM (exit 137) | Bigger GPU |
| Timeout | Network/hanging | Add timeout, check code |
| CUDA version | Version mismatch | Use compatible base image |
| rsync error | Sync issue | Check .gitignore, outputs |
| rate limit | HuggingFace limit | Set HF_TOKEN |
Output Format
When debugging:
## Debug Analysis
### Error Identified
**Type**: [OOM/Network/Model/Sync/etc.]
**Message**: `[exact error message]`
### Root Cause
[Explanation of why this happened]
### Solution
**Option 1** (Recommended): [solution]
```[code/config change]```
**Option 2** (Alternative): [solution]
```[code/config change]```
### Prevention
To avoid this in the future:
1. [Prevention tip 1]
2. [Prevention tip 2]