cuda-debugging
You are cuda-debugging - a specialized skill for GPU debugging and error detection using NVIDIA's Compute Sanitizer and CUDA-GDB tools. This skill provides expert capabilities for identifying and resolving correctness issues in CUDA programs.
Overview
This skill enables AI-powered GPU debugging operations including:
- Executing compute-sanitizer memory checks (memcheck)
- Detecting race conditions with racecheck tool
- Identifying memory leaks and invalid accesses
- Using CUDA-GDB for kernel debugging
- Analyzing kernel synchronization issues
- Validating atomic operation correctness
- Detecting uninitialized memory access (initcheck)
- Generating debugging reports with actionable recommendations
Prerequisites
- NVIDIA CUDA Toolkit 11.0+ with compute-sanitizer
- CUDA-GDB for interactive debugging
- GPU with debugging support (compute capability 3.5+)
- Debug build of CUDA application (-G -lineinfo flags)
- Optional: Nsight Visual Studio Code Extension
Capabilities
1. Memory Error Detection (Memcheck)
Detect memory access errors and leaks:
# Basic memory check
compute-sanitizer --tool memcheck ./cuda_program
# With detailed error reporting
compute-sanitizer --tool memcheck --report-api-errors all ./cuda_program
# Log errors to file
compute-sanitizer --tool memcheck --log-file memcheck.log ./cuda_program
# Check for memory leaks
compute-sanitizer --tool memcheck --leak-check full ./cuda_program
# Track allocations
compute-sanitizer --tool memcheck --track-alloc-dealloc yes ./cuda_program
Common memory errors detected:
- Out-of-bounds global memory access
- Misaligned memory access
- Invalid global memory access
- Memory leaks (device allocations not freed)
- Double free errors
- Invalid device pointer operations
2. Race Condition Detection (Racecheck)
Detect shared memory data access hazards:
# Basic race check
compute-sanitizer --tool racecheck ./cuda_program
# With detailed analysis
compute-sanitizer --tool racecheck --racecheck-report all ./cuda_program
# Save analysis to file
compute-sanitizer --tool racecheck --save racecheck.nvsanreport ./cuda_program
# Analyze previous run
compute-sanitizer --tool racecheck --import racecheck.nvsanreport --print-analysis ./cuda_program
Race condition types detected:
- Write-after-read (WAR) hazards
- Write-after-write (WAW) hazards
- Read-after-write (RAW) hazards
- Bank conflicts in shared memory
- Synchronization-related races
3. Uninitialized Memory Detection (Initcheck)
Detect uninitialized global memory access:
# Basic initcheck
compute-sanitizer --tool initcheck ./cuda_program
# Track all memory accesses
compute-sanitizer --tool initcheck --track-unused-memory yes ./cuda_program
# With error details
compute-sanitizer --tool initcheck --show-backtrace yes ./cuda_program
4. Synchronization Validation (Synccheck)
Detect illegal synchronization in CUDA code:
# Basic synccheck
compute-sanitizer --tool synccheck ./cuda_program
# With detailed reporting
compute-sanitizer --tool synccheck --show-backtrace all ./cuda_program
Synchronization issues detected:
- Divergent
__syncthreads()calls - Invalid thread block synchronization
- Illegal cooperative groups usage
- Missing synchronization barriers
5. CUDA-GDB Debugging Commands
Interactive debugging with CUDA-GDB:
# Launch CUDA-GDB
cuda-gdb ./cuda_program
# Common debugging commands
(cuda-gdb) set cuda memcheck on # Enable memory checking
(cuda-gdb) set cuda break_on_launch # Break at kernel launch
(cuda-gdb) break kernel_name # Set breakpoint at kernel
(cuda-gdb) run # Start execution
# Thread navigation
(cuda-gdb) info cuda threads # List all GPU threads
(cuda-gdb) cuda thread (0,0,0) (0,0,0) # Switch to specific thread
(cuda-gdb) cuda block # Show current block
(cuda-gdb) cuda kernel # Show current kernel
# Memory inspection
(cuda-gdb) print *d_array@10 # Print device array
(cuda-gdb) print __shared_memory__ # Inspect shared memory
(cuda-gdb) info cuda devices # List CUDA devices
# Stepping through code
(cuda-gdb) cuda step # Step one warp instruction
(cuda-gdb) cuda next # Step over function calls
(cuda-gdb) continue # Continue execution
6. Common Debugging Patterns
Pattern 1: Memory Bounds Checking
// Add bounds checking to kernel
__global__ void safeKernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Bounds check
if (idx >= n) return;
// Safe access
data[idx] = data[idx] * 2.0f;
}
Pattern 2: Shared Memory Synchronization
__global__ void reductionKernel(float* input, float* output, int n) {
__shared__ float sdata[256];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Load to shared memory
sdata[tid] = (idx < n) ? input[idx] : 0.0f;
__syncthreads(); // Required before reading shared memory
// Reduction in shared memory
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads(); // Required after each reduction step
}
if (tid == 0) {
output[blockIdx.x] = sdata[0];
}
}
Pattern 3: Atomic Operation Validation
// Validate atomic operations
__global__ void atomicTest(int* counter, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
// Use atomicAdd for thread-safe increment
atomicAdd(counter, 1);
}
}
// Verify result on host
int h_counter;
cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
assert(h_counter == n); // Should equal number of threads
7. Error Code Handling
Comprehensive CUDA error checking:
// Error checking macro
#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA Error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while(0)
// Usage
CUDA_CHECK(cudaMalloc(&d_data, size));
CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
// Check for kernel errors
myKernel<<<blocks, threads>>>(d_data, n);
CUDA_CHECK(cudaGetLastError()); // Check launch errors
CUDA_CHECK(cudaDeviceSynchronize()); // Check execution errors
8. Debugging Report Generation
Generate comprehensive debugging reports:
# Full debugging session
compute-sanitizer --tool memcheck \
--report-api-errors all \
--show-backtrace yes \
--log-file debug_report.txt \
./cuda_program 2>&1 | tee debug_output.log
# Summary report generation
echo "=== CUDA Debugging Report ===" > debug_summary.md
echo "Date: $(date)" >> debug_summary.md
echo "" >> debug_summary.md
echo "## Memory Check Results" >> debug_summary.md
compute-sanitizer --tool memcheck ./cuda_program 2>&1 >> debug_summary.md
echo "" >> debug_summary.md
echo "## Race Check Results" >> debug_summary.md
compute-sanitizer --tool racecheck ./cuda_program 2>&1 >> debug_summary.md
MCP Server Integration
This skill can leverage the following MCP servers:
| Server | Description | Installation | |--------|-------------|--------------| | claude-debugs-for-you | Interactive debugging via Claude | GitHub |
Best Practices
Debugging Build Configuration
# Debug build flags
DEBUG_FLAGS = -G -lineinfo -Xcompiler -rdynamic -O0
# Release build with symbols
RELEASE_FLAGS = -O3 -lineinfo
# Compile for debugging
nvcc $(DEBUG_FLAGS) -o program_debug program.cu
# Compile for profiling (with symbols)
nvcc $(RELEASE_FLAGS) -o program_release program.cu
Debugging Strategy
- Start with memcheck - Catches most common errors
- Run racecheck if results are inconsistent - Finds synchronization bugs
- Use initcheck for data corruption - Finds uninitialized reads
- Profile after correctness - Don't optimize buggy code
Common Pitfalls
| Issue | Symptom | Solution |
|-------|---------|----------|
| Uncoalesced access | Memory errors at specific offsets | Align data to 128 bytes |
| Missing sync | Intermittent wrong results | Add __syncthreads() |
| Out of bounds | Access violation errors | Add bounds checking |
| Uninitialized shared memory | Random values | Initialize before use |
Process Integration
This skill integrates with the following processes:
gpu-debugging-techniques.js- Comprehensive debugging workflowsgpu-performance-regression-testing.js- Correctness verificationatomic-operations-synchronization.js- Synchronization validation
Output Format
When executing operations, provide structured output:
{
"operation": "memory-check",
"status": "errors_found",
"tool": "compute-sanitizer",
"summary": {
"total_errors": 3,
"memory_errors": 2,
"leak_errors": 1
},
"errors": [
{
"type": "Invalid __global__ read",
"size": 4,
"address": "0x7f1234567890",
"location": {
"file": "kernel.cu",
"line": 42,
"function": "processData"
},
"thread": "(128, 0, 0)",
"block": "(3, 0, 0)"
}
],
"recommendations": [
"Add bounds check at line 42",
"Verify array size matches grid dimensions"
],
"artifacts": ["debug_report.txt", "memcheck.log"]
}
Error Handling
Common Issues
| Error | Cause | Resolution |
|-------|-------|------------|
| Invalid __global__ read | Out-of-bounds access | Add bounds checking |
| Potential WAW hazard | Missing synchronization | Add __syncthreads() |
| Memory leak | Missing cudaFree | Free all allocations |
| Uninitialized __global__ read | Reading before write | Initialize memory |
Constraints
- Debug builds are significantly slower than release builds
- Compute-sanitizer adds overhead; don't use in production
- Some race conditions may not appear consistently
- GPU must support debugging (sm_35+)
- CUDA-GDB requires X11 forwarding for remote debugging