Agent Skills: GPU Debugger

Debug failed GPU CLI runs. Analyze error messages, diagnose OOM errors, fix sync issues, troubleshoot connectivity, and resolve common problems. Turn cryptic errors into actionable fixes.

Category: Uncategorized
ID: gpu-cli/gpu/gpu-debugger

Skill Files

skills/gpu-debugger/SKILL.md

Skill Metadata

Name
gpu-debugger
Description
"Debug failed GPU CLI runs. Analyze error messages, diagnose OOM errors, fix sync issues, troubleshoot connectivity, and resolve common problems. Turn cryptic errors into actionable fixes."

GPU Debugger

Turn errors into solutions.

This skill helps debug failed GPU CLI runs: OOM errors, sync failures, connectivity issues, model loading problems, and more.

When to Use This Skill

| Problem | This Skill Helps With |
|---------|-----------------------|
| "CUDA out of memory" | OOM diagnosis and fixes |
| "Connection refused" | Connectivity troubleshooting |
| "Sync failed" | File sync debugging |
| "Pod won't start" | Provisioning issues |
| "Model won't load" | Model loading errors |
| "Command exited with error" | Exit code analysis |
| "My run is hanging" | Stuck process diagnosis |

Debugging Workflow

Error occurs
     │
     ▼
┌─────────────────────────┐
│ 1. Collect information  │
│    - Error message      │
│    - Daemon logs        │
│    - Exit code          │
│    - VRAM usage         │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ 2. Identify error type  │
│    - OOM                │
│    - Network            │
│    - Model              │
│    - Sync               │
│    - Permission         │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ 3. Apply fix            │
│    - Config change      │
│    - Code change        │
│    - Retry              │
└─────────────────────────┘

Information Collection Commands

Check Daemon Logs

# Last 50 log lines
gpu daemon logs --tail 50

# Full logs since last restart
gpu daemon logs

# Follow logs in real-time
gpu daemon logs --follow

Check Pod Status

# Current pod status
gpu status

# Pod details
gpu pods list

Check Job History

# Recent jobs
gpu jobs list

# Specific job details
gpu jobs show <job-id>

Common Errors and Solutions

1. CUDA Out of Memory (OOM)

Error messages:

CUDA out of memory. Tried to allocate X GiB
RuntimeError: CUDA error: out of memory
torch.cuda.OutOfMemoryError

Diagnosis:

# In your script, check VRAM usage
import torch
print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"VRAM reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Solutions by severity:

| Solution | VRAM Savings | Effort |
|----------|--------------|--------|
| Reduce batch size | ~Linear | Easy |
| Enable gradient checkpointing | ~40% | Easy |
| Use FP16/BF16 | ~50% | Easy |
| Use INT8 quantization | ~50% | Medium |
| Use INT4 quantization | ~75% | Medium |
| Enable CPU offloading | Variable | Easy |
| Use larger GPU | Solves it | $$ |

Quick fixes:

# Reduce batch size
BATCH_SIZE = 1  # Start small, then increase until just before you hit OOM

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Use FP16
model = model.half()

# Use INT4 quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# CPU offloading (for diffusers)
pipe.enable_model_cpu_offload()

# Clear cache between batches
torch.cuda.empty_cache()

Config fix:

{
  // Upgrade to larger GPU
  "gpu_type": "A100 PCIe 80GB",  // Instead of RTX 4090
  "min_vram": 80
}
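
If the workload tolerates it, you can also catch the OOM and retry at a smaller batch size. A minimal sketch, where `run_batch` stands in for your own training or inference step (not part of any library):

```python
import torch

def run_with_fallback(run_batch, batch_size=8):
    """Call run_batch, halving the batch size on CUDA OOM until it fits."""
    while True:
        try:
            return run_batch(batch_size)   # run_batch: your own train/inference step
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()       # Release cached blocks before retrying
            if batch_size == 1:
                raise                      # Does not fit even at batch_size=1
            batch_size //= 2
            print(f"OOM, retrying with batch_size={batch_size}")
```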

2. Connection Refused / Timeout

Error messages:

Connection refused
Connection timed out
SSH connection failed
Failed to connect to daemon

Diagnosis:

# Check daemon status
gpu daemon status

# Check if daemon is running
ps aux | grep gpud

# Check daemon logs
gpu daemon logs --tail 20

Solutions:

| Cause | Solution |
|-------|----------|
| Daemon not running | `gpu daemon start` |
| Daemon crashed | `gpu daemon restart` |
| Wrong socket | Check `GPU_DAEMON_SOCKET` env var |
| Port conflict | Kill conflicting process |

Restart daemon:

gpu daemon stop
gpu daemon start
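
If the daemon appears to be running but the CLI still cannot reach it, check which socket the CLI is pointing at. A minimal sketch, assuming the daemon listens on the Unix socket named in `GPU_DAEMON_SOCKET`:

```python
import os
from pathlib import Path

socket_path = os.environ.get("GPU_DAEMON_SOCKET", "")

if not socket_path:
    print("GPU_DAEMON_SOCKET is not set; the CLI falls back to its default socket")
elif not Path(socket_path).exists():
    print(f"Socket path does not exist: {socket_path} (is the daemon running?)")
else:
    print(f"Daemon socket present at {socket_path}")
```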

3. Pod Won't Start / Provisioning Failed

Error messages:

Failed to create pod
No GPUs available
Insufficient resources
Provisioning timeout

Diagnosis:

# Check available GPUs
gpu machines list

# Check specific GPU availability
gpu machines list --gpu "RTX 4090"

Solutions:

| Cause | Solution |
|-------|----------|
| GPU type unavailable | Try different GPU type |
| Region full | Remove region constraint |
| Price too low | Increase `max_price` |
| Volume region mismatch | Use volume's region |

Config fixes:

{
  // Use min_vram instead of exact GPU
  "gpu_type": null,
  "min_vram": 24,  // Any GPU with 24GB+

  // Or try different GPU
  "gpu_type": "RTX A6000",  // Alternative to RTX 4090

  // Or relax region constraint
  "region": null,  // Any region

  // Or increase price tolerance
  "max_price": 2.0  // Allow up to $2/hr
}

4. Sync Errors

Error messages:

rsync error
Sync failed
File not found
Permission denied during sync

Diagnosis:

# Check sync status
gpu sync status

# Check .gitignore
cat .gitignore

# Check outputs config
cat gpu.jsonc | grep outputs

Solutions:

| Cause | Solution |
|-------|----------|
| File too large | Add to `.gitignore` |
| Permission issue | Check file permissions |
| Path not in outputs | Add to `outputs` config |
| Disk full on pod | Increase workspace size |

Config fixes:

{
  // Ensure outputs are configured
  "outputs": ["output/", "results/", "models/"],

  // Exclude large files
  "exclude_outputs": ["*.tmp", "*.log", "checkpoints/"],

  // Increase storage
  "workspace_size_gb": 100
}
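
To spot oversized files before they break a sync, a quick local scan helps. A minimal sketch (the 100 MB threshold is just an example); anything it lists is a candidate for `.gitignore` or `exclude_outputs`:

```python
import os

THRESHOLD_MB = 100  # Example threshold; tune for your sync limits

# Walk the project and list files larger than the threshold
for root, _dirs, files in os.walk("."):
    if ".git" in root.split(os.sep):
        continue  # Skip git internals
    for name in files:
        path = os.path.join(root, name)
        try:
            size_mb = os.path.getsize(path) / 1e6
        except OSError:
            continue  # Broken symlink or unreadable file
        if size_mb > THRESHOLD_MB:
            print(f"{size_mb:8.1f} MB  {path}")
```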

5. Model Loading Errors

Error messages:

Model not found
Could not load model
Safetensors error
HuggingFace rate limit

Diagnosis:

# Check if model is downloading
# Look for download progress in job output

# Check HuggingFace cache on pod
gpu run ls -la ~/.cache/huggingface/hub/

Solutions:

| Cause | Solution |
|-------|----------|
| Model not downloaded | Add to download spec |
| Wrong model path | Fix path in code |
| HF rate limit | Set `HF_TOKEN` |
| Network issue | Retry with timeout |
| Gated model | Accept license on HF |

Config fixes:

{
  // Pre-download models
  "download": [
    { "strategy": "hf", "source": "meta-llama/Llama-3.1-8B-Instruct", "timeout": 7200 }
  ],

  // Set HF token in environment
  "environment": {
    "shell": {
      "steps": [
        { "run": "echo 'HF_TOKEN=your_token' >> ~/.bashrc" }
      ]
    }
  }
}
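
For gated or rate-limited models you can also pass the token directly in code. A minimal sketch using the standard transformers `token` argument, assuming `HF_TOKEN` is set in the environment and the model's license has been accepted:

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
hf_token = os.environ.get("HF_TOKEN")  # Token for gated/rate-limited downloads

tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_token, device_map="auto")
```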

6. Process Hanging / Stuck

Symptoms:

  • No output for a long time
  • Process doesn't exit
  • gpu status shows job running forever

Diagnosis:

# Check if process is actually running
gpu run ps aux | grep python

# Check for infinite loops in logs
gpu jobs logs <job-id> --tail 100

# Check VRAM (might be swapping)
gpu run nvidia-smi

Solutions:

| Cause | Solution |
|-------|----------|
| Infinite loop | Fix code logic |
| Waiting for input | Make script non-interactive |
| VRAM thrashing | Reduce memory usage |
| Deadlock | Add timeout |
| Network wait | Add timeout to requests |

Code fixes:

# Load the model with lower peak memory so loading does not stall under memory pressure
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True  # Reduce peak CPU RAM usage while loading
)

# Add timeout to HTTP requests
import requests
response = requests.get(url, timeout=30)

# Add progress bars to see activity
from tqdm import tqdm
for item in tqdm(items):
    process(item)
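
To diagnose a hang you cannot reproduce locally, the standard library's faulthandler module can dump every thread's stack after a timeout, which shows exactly where the process is stuck:

```python
import faulthandler

# If the run is still alive after 30 minutes, dump every thread's traceback to stderr
# (repeat=True keeps dumping every 30 minutes so long runs stay observable)
faulthandler.dump_traceback_later(30 * 60, repeat=True)

# ... your workload ...

# Cancel the watchdog once the critical section has finished
faulthandler.cancel_dump_traceback_later()
```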

7. Exit Code Errors

Common exit codes:

| Code | Meaning | Common Cause |
|------|---------|--------------|
| 0 | Success | - |
| 1 | General error | Script exception |
| 2 | Misuse of command | Bad arguments |
| 126 | Permission denied | Script not executable |
| 127 | Command not found | Missing binary |
| 137 | Killed (OOM) | Out of memory |
| 139 | Segfault | Bad memory access |
| 143 | Terminated | Killed by signal |
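
Codes above 128 are 128 plus the number of the signal that killed the process, which is how 137, 139, and 143 map to SIGKILL, SIGSEGV, and SIGTERM. A quick check:

```python
import signal

print(128 + signal.SIGKILL)   # 137: killed, often by the kernel OOM killer
print(128 + signal.SIGSEGV)   # 139: segmentation fault
print(128 + signal.SIGTERM)   # 143: terminated
```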

Diagnosis:

# Check last job exit code
gpu jobs list --limit 1

# Get full output including error
gpu jobs logs <job-id>

Solutions:

| Exit Code | Solution |
|-----------|----------|
| 1 | Check Python traceback, fix exception |
| 126 | `chmod +x script.sh` |
| 127 | Install missing package |
| 137 | Reduce memory usage, bigger GPU |
| 139 | Update PyTorch/CUDA versions |

8. CUDA Version Mismatch

Error messages:

CUDA error: no kernel image is available
CUDA version mismatch
Torch not compiled with CUDA enabled

Diagnosis:

# Check CUDA version on pod
gpu run nvcc --version
gpu run nvidia-smi | head -3

# Check PyTorch CUDA version
gpu run python -c "import torch; print(torch.version.cuda)"

Solutions:

{
  // Use a known-good base image
  "environment": {
    "base_image": "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
  }
}

Or in requirements.txt:

# Install PyTorch with specific CUDA version
--extra-index-url https://download.pytorch.org/whl/cu124
torch==2.4.0
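
The "no kernel image is available" error usually means the installed PyTorch wheel was not built for the GPU's compute capability. You can compare the two directly with standard torch APIs:

```python
import torch

print("Torch built for CUDA:", torch.version.cuda)
print("GPU compute capability:", torch.cuda.get_device_capability(0))  # e.g. (8, 9) for RTX 4090
print("Kernels compiled for:", torch.cuda.get_arch_list())             # e.g. ['sm_80', 'sm_86', ...]
```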

Debug Script Template

Add this to your projects for better error info:

#!/usr/bin/env python3
"""Wrapper script with debugging info."""

import sys
import traceback
import torch

def print_system_info():
    """Print system info for debugging."""
    print("=" * 50)
    print("SYSTEM INFO")
    print("=" * 50)
    print(f"Python: {sys.version}")
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"VRAM total: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print("=" * 50)

def main():
    # Your actual code here
    pass

if __name__ == "__main__":
    print_system_info()
    try:
        main()
    except Exception as e:
        print("\n" + "=" * 50)
        print("ERROR OCCURRED")
        print("=" * 50)
        print(f"Error type: {type(e).__name__}")
        print(f"Error message: {str(e)}")
        print("\nFull traceback:")
        traceback.print_exc()

        # Print memory info for OOM debugging
        if torch.cuda.is_available():
            print(f"\nVRAM at error: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

        sys.exit(1)

Quick Reference: Error → Solution

| Error Contains | Likely Cause | Quick Fix |
|----------------|--------------|-----------|
| CUDA out of memory | OOM | Reduce batch size, use quantization |
| Connection refused | Daemon down | `gpu daemon restart` |
| No GPUs available | Supply shortage | Try different GPU type |
| Model not found | Not downloaded | Add to download spec |
| Permission denied | File permissions | `chmod +x` or check path |
| Killed | OOM (exit 137) | Bigger GPU |
| Timeout | Network/hanging | Add timeout, check code |
| CUDA version | Version mismatch | Use compatible base image |
| rsync error | Sync issue | Check `.gitignore`, outputs |
| rate limit | HuggingFace limit | Set `HF_TOKEN` |

Output Format

When debugging:

## Debug Analysis

### Error Identified

**Type**: [OOM/Network/Model/Sync/etc.]
**Message**: `[exact error message]`

### Root Cause

[Explanation of why this happened]

### Solution

**Option 1** (Recommended): [solution]
```[code/config change]```

**Option 2** (Alternative): [solution]
```[code/config change]```

### Prevention

To avoid this in the future:
1. [Prevention tip 1]
2. [Prevention tip 2]