HPC Python
Python parallelism, PyTorch DDP, and CPU/GPU latency hiding patterns.
Extension Files
| File | Content |
|------|---------|
| parallel-python.md | Threading vs multiprocessing vs asyncio decision tree, GIL rules, CUDA+fork safety |
| latency-hiding.md | CUDA streams, double buffering, compute/comm overlap, CUDAGraphs, async checkpoint |
| pytorch-ddp.md | DDP internals, gradient buckets, common bugs, mixed precision, DistributedSampler |
| preload-caching.md | Three-level caching: L1 file (disk/memmap/shm), L2 function (lru_cache/dedup/index), L3 variable (buffer/GPU cache/warmup/KV cache) |
| torch-compile.md | torch.compile modes, graph break diagnosis/fixes, reading generated Triton as starting point for hand-tuning |
| benchmarking.md | Correct GPU timing (CUDA events, torch.utils.benchmark.Timer), warmup, common pitfalls, what to measure |
| dataloader.md | DataLoader params (num_workers/pin_memory/prefetch_factor), dataset patterns, data formats, collation, worker issues |
Quick Decision Tree
What is the bottleneck?
├─ I/O bound → threading (ThreadPoolExecutor) or asyncio
├─ CPU bound → multiprocessing (mp.Pool, fork BEFORE CUDA!)
├─ GPU bound → batch inputs, don't parallelize
├─ Mixed CPU→GPU → pipeline + CUDA streams (see latency-hiding.md)
└─ DDP communication → tune bucket_cap_mb, use model.no_sync()
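As an illustration of the I/O-bound branch, here is a minimal sketch that fans blocking reads out over a ThreadPoolExecutor. The `fake_read` helper and the `shard_*.bin` names are hypothetical stand-ins (not part of the extension files); `time.sleep` simulates a blocking read, which releases the GIL the same way real file or network I/O does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read(path):
    """Hypothetical stand-in for a blocking read; sleep releases the GIL."""
    time.sleep(0.05)
    return len(path)


# Hypothetical shard names, used only to give each task distinct input.
paths = [f"shard_{i}.bin" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns results in input order, even if tasks finish out of order.
    sizes = list(pool.map(fake_read, paths))
elapsed = time.perf_counter() - start

print(f"read {len(sizes)} shards in {elapsed:.2f}s")
```

Because the threads spend their time waiting rather than executing Python bytecode, the waits overlap. The same structure would not help CPU-bound work, which is why the tree sends that case to multiprocessing instead.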
Critical Rules
- mp.Pool BEFORE CUDA: Create the multiprocessing pool before any torch.cuda call (fork + CUDA = deadlock)
- Never .item() in loops: Accumulate on GPU, transfer the final result only
- pin_memory=True: Required for non_blocking=True transfers to actually be async
- DDP no_sync(): Use during gradient accumulation steps to avoid redundant AllReduce
- find_unused_parameters=True: Only when needed; it's expensive
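The .item() and pin_memory rules can be sketched together. This is a minimal illustration under stated assumptions, not a reference implementation: the `make_batches` and `mean_loss` helpers, the tiny `nn.Linear` model, and the synthetic data are all hypothetical, and the code falls back to CPU when no GPU is present (in which case pinning is skipped and `non_blocking=True` has no effect).

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def make_batches(n_batches=4, batch_size=8, features=4):
    """Build synthetic CPU batches; pin them only when a GPU is present."""
    batches = []
    for _ in range(n_batches):
        x = torch.randn(batch_size, features)
        y = torch.randn(batch_size, 1)
        if device.type == "cuda":
            # Page-locked host memory is what lets non_blocking copies be async.
            x, y = x.pin_memory(), y.pin_memory()
        batches.append((x, y))
    return batches


def mean_loss(model, loss_fn, batches):
    """Average loss over batches with a single host sync at the end."""
    total = torch.zeros((), device=device)  # running sum stays on the device
    with torch.no_grad():
        for x_cpu, y_cpu in batches:
            x = x_cpu.to(device, non_blocking=True)
            y = y_cpu.to(device, non_blocking=True)
            total += loss_fn(model(x), y)  # no .item() inside the loop
    return (total / len(batches)).item()  # one device-to-host transfer


torch.manual_seed(0)
model = nn.Linear(4, 1).to(device)
batches = make_batches()
avg = mean_loss(model, nn.MSELoss(), batches)
print(f"mean loss: {avg:.4f}")
```

Calling .item() on every step would force the host to wait for the GPU each iteration. Keeping the running sum as a device tensor defers that wait to the single call at the end.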
Review Checklist (Python)
□ No .cpu()/.item()/.numpy() in hot loops?
□ DataLoader: num_workers > 0, pin_memory=True?
□ mp.Pool created before CUDA init?
□ Threading used only for I/O, not CPU-bound work?
□ DDP gradient accumulation uses no_sync()?
□ CUDA streams used for transfer/compute overlap?
□ Pre-allocated buffers reused (not created per iteration)?
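For the no_sync() item, a gradient accumulation step might look like the sketch below. The `accumulate_step` helper is hypothetical. To stay runnable in a single process it uses a plain `nn.Linear`, which has no `no_sync()`, and falls back to a no-op context; with a model wrapped in `DistributedDataParallel`, the same loop would skip AllReduce on every micro-batch except the last.

```python
import contextlib

import torch
from torch import nn


def accumulate_step(model, optimizer, loss_fn, micro_batches):
    """One optimizer step over several micro-batches (hypothetical helper)."""
    optimizer.zero_grad(set_to_none=True)
    n = len(micro_batches)
    for i, (x, y) in enumerate(micro_batches):
        is_last = i == n - 1
        # DDP models expose no_sync(); plain modules get a no-op context.
        if hasattr(model, "no_sync") and not is_last:
            ctx = model.no_sync()
        else:
            ctx = contextlib.nullcontext()
        with ctx:
            # Scale so the accumulated gradient matches the full-batch mean.
            loss = loss_fn(model(x), y) / n
            loss.backward()
    optimizer.step()


torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]

weight_before = model.weight.detach().clone()
accumulate_step(model, optimizer, nn.MSELoss(), micro_batches)
```

The final micro-batch runs outside no_sync() on purpose: that backward pass is the one that triggers the gradient synchronization for everything accumulated so far.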