HPC Python

Python parallelism, PyTorch DDP, and CPU/GPU latency hiding patterns.

Extension Files

| File | Content |
|------|---------|
| parallel-python.md | Threading vs multiprocessing vs asyncio decision tree, GIL rules, CUDA+fork safety |
| latency-hiding.md | CUDA streams, double buffering, compute/comm overlap, CUDAGraphs, async checkpoint |
| pytorch-ddp.md | DDP internals, gradient buckets, common bugs, mixed precision, DistributedSampler |
| preload-caching.md | Three-level caching: L1 file (disk/memmap/shm), L2 function (lru_cache/dedup/index), L3 variable (buffer/GPU cache/warmup/KV cache) |
| torch-compile.md | torch.compile modes, graph break diagnosis/fixes, reading generated Triton as starting point for hand-tuning |
| benchmarking.md | Correct GPU timing (CUDA events, torch.utils.benchmark.Timer), warmup, common pitfalls, what to measure |
| dataloader.md | DataLoader params (num_workers/pin_memory/prefetch_factor), dataset patterns, data formats, collation, worker issues |

Quick Decision Tree

What is the bottleneck?
├─ I/O bound         → threading (ThreadPoolExecutor) or asyncio
├─ CPU bound         → multiprocessing (mp.Pool, fork BEFORE CUDA!)
├─ GPU bound         → batch inputs, don't parallelize
├─ Mixed CPU→GPU     → pipeline + CUDA streams (see latency-hiding.md)
└─ DDP communication → tune bucket_cap_mb, use model.no_sync()
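
The first two branches of the tree can be sketched with the standard library alone. This is a minimal illustration, not a benchmark: `read_size`, `count_primes`, `run_io_bound`, and `run_cpu_bound` are hypothetical names invented for the example, and the worker counts are arbitrary.

```python
import math
import multiprocessing as mp
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_size(path):
    """I/O-bound task: the GIL is released while the file read blocks."""
    return len(Path(path).read_bytes())


def count_primes(limit):
    """CPU-bound task: pure-Python arithmetic holds the GIL throughout."""
    return sum(
        1
        for n in range(2, limit)
        if all(n % d for d in range(2, math.isqrt(n) + 1))
    )


def run_io_bound(paths, workers=8):
    # Threads are enough here: each worker spends its time waiting on I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_size, paths))


def run_cpu_bound(limits, workers=2):
    # Separate processes sidestep the GIL. If the parent will later use
    # CUDA, create this pool before the first torch.cuda call.
    with mp.Pool(processes=workers) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(4):
            p = Path(tmp) / f"chunk_{i}.bin"
            p.write_bytes(b"x" * (i + 1) * 1024)
            paths.append(p)
        print(run_io_bound(paths))
    print(run_cpu_bound([1_000, 2_000]))
```

The `if __name__ == "__main__":` guard matters for the process pool: under the spawn start method, worker processes re-import the main module, and unguarded pool creation would recurse.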

Critical Rules

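mp.Pool BEFORE CUDA: Create the multiprocessing pool before any torch.cuda call. The CUDA runtime does not support fork, so a child forked after CUDA initialization can fail or hang; use the spawn start method if workers need CUDA
Never .item() in loops: Each call forces a host-device sync. Accumulate on GPU, transfer final result only
pin_memory=True: Required for non_blocking=True host-to-device transfers to actually be async
DDP no_sync(): Use during gradient accumulation steps to avoid redundant AllReduce
find_unused_parameters=True: Enable only when some parameters receive no gradient in an iteration; it adds an autograd graph traversal to every step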
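
As a rough sketch of the `.item()` and `pin_memory` rules together, the loop below keeps the running loss on the device and syncs with the host once at the end. `to_device` and `mean_loss` are hypothetical helpers, the model and data are toy placeholders, and the code falls back to CPU when CUDA is unavailable, where the pinning path is skipped.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def to_device(batch, device):
    """Copy a CPU tensor to `device`, asynchronously when CUDA is present."""
    if device.type == "cuda":
        # non_blocking=True only overlaps with host work if the source
        # tensor lives in pinned (page-locked) memory.
        return batch.pin_memory().to(device, non_blocking=True)
    return batch.to(device)


def mean_loss(model, batches, device):
    """Average per-batch loss with a single host sync at the end."""
    total = torch.zeros((), device=device)  # running sum stays on the device
    with torch.no_grad():
        for x, y in batches:
            x, y = to_device(x, device), to_device(y, device)
            total += torch.nn.functional.mse_loss(model(x), y)
    # The only .item() call: one device-to-host sync for the whole loop.
    return (total / len(batches)).item()


torch.manual_seed(0)
model = torch.nn.Linear(4, 1).to(device)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]
print(mean_loss(model, batches, device))
```

In a real input pipeline the pinning would normally be done by `DataLoader(pin_memory=True)` in the background rather than by calling `pin_memory()` per batch, which itself costs a synchronous copy.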

Review Checklist (Python)

□ No .cpu()/.item()/.numpy() in hot loops?
□ DataLoader: num_workers > 0, pin_memory=True?
□ mp.Pool created before CUDA init?
□ Threading used only for I/O, not CPU-bound work?
□ DDP gradient accumulation uses no_sync()?
□ CUDA streams used for transfer/compute overlap?
□ Pre-allocated buffers reused (not created per iteration)?
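
The gradient accumulation item can be sketched as follows. `accumulate_and_step` is a hypothetical helper, not part of PyTorch; it assumes equal-sized micro-batches and uses `no_sync()` only when the model exposes it (as `DistributedDataParallel` does), so the same code also runs on a plain module in a single process.

```python
import contextlib

import torch


def accumulate_and_step(model, optimizer, micro_batches):
    """Take one optimizer step over several micro-batches.

    If `model` is wrapped in DistributedDataParallel, every backward except
    the last runs under no_sync(), so gradients are all-reduced only once.
    """
    optimizer.zero_grad(set_to_none=True)
    last = len(micro_batches) - 1
    for i, (x, y) in enumerate(micro_batches):
        # Plain modules have no no_sync(); fall back to a no-op context.
        skip_sync = i < last and hasattr(model, "no_sync")
        ctx = model.no_sync() if skip_sync else contextlib.nullcontext()
        with ctx:  # forward and backward both run inside the context
            loss = torch.nn.functional.mse_loss(model(x), y)
            # Scale so the summed gradients equal the mean over micro-batches.
            (loss / len(micro_batches)).backward()
    optimizer.step()


torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
accumulate_and_step(model, optimizer, micro_batches)
```

The final micro-batch is deliberately left outside `no_sync()`: its backward pass is the one that triggers the all-reduce of the accumulated gradients.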
