
Agent Skills with tag: cuda

6 skills match this tag. Use tags to discover related Agent Skills and explore similar workflows.

gpu-workflow-creator

Transform natural language requests into complete GPU CLI workflows. The ultimate skill for Mac users who want to run NVIDIA GPU workloads without configuration complexity. Describe what you want, get a working project.

cuda, gpu-acceleration, command-line, macos
gpu-cli
0

gpu-debugger

Debug failed GPU CLI runs. Analyze error messages, diagnose OOM errors, fix sync issues, troubleshoot connectivity, and resolve common problems. Turn cryptic errors into actionable fixes.

cuda, gpu-acceleration, terminal, error-handling
gpu-cli
0

cuda

CUDA kernel development, debugging, and performance optimization for Claude Code. Use when writing, debugging, or optimizing CUDA code, GPU kernels, or parallel algorithms. Covers non-interactive profiling with nsys/ncu, debugging with cuda-gdb/compute-sanitizer, binary inspection with cuobjdump, and performance analysis workflows. Triggers on CUDA, GPU programming, kernel optimization, nsys, ncu, cuda-gdb, compute-sanitizer, PTX, GPU profiling, parallel performance.

cuda, gpu-programming, performance-optimization, debugging
technillogue
4

funsloth-local

Training manager for local GPU training: validates CUDA, manages GPU selection, monitors progress, and handles checkpoints.

cuda, gpu-acceleration, monitoring, resource-allocation
chrisvoncsefalvay
4

qlora

Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory. Builds on the lora skill.

large-language-models, quantization, lora, fine-tuning
itsmostafa
10
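The VRAM math behind the qlora entry is easy to check. A back-of-the-envelope sketch covering the frozen base weights only (activations, gradients, and the LoRA adapters' optimizer state add more on top):

```python
def weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Memory for the base model weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return n_params * bits_per_weight / 8 / 2**30

# A 7B model's frozen weights: fp16 vs. 4-bit quantization.
print(f"7B at fp16:  {weight_gib(7e9, 16):.1f} GiB")   # ~13 GiB: already over a 12 GiB card
print(f"7B at 4-bit: {weight_gib(7e9, 4):.1f} GiB")    # ~3.3 GiB: fits consumer GPUs
```

This is why standard LoRA (fp16 base weights) can still exceed memory where QLoRA's 4-bit base fits.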

at-dispatch-v2

Convert PyTorch AT_DISPATCH macros to AT_DISPATCH_V2 format in ATen C++ code. Use when porting AT_DISPATCH_ALL_TYPES_AND*, AT_DISPATCH_FLOATING_TYPES*, or other dispatch macros to the new v2 API. For ATen kernel files, CUDA kernels, and native operator implementations.

pytorch, c++, cuda, aten
pytorch
96,344 26,418
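The conversion the at-dispatch-v2 skill performs can be sketched mechanically. A minimal Python sketch, assuming the V2 shape from ATen's Dispatch_v2.h (lambda wrapped in AT_WRAP, type list flattened after it); it handles only one legacy macro, where the real skill covers many variants:

```python
import re

# Matches one legacy dispatch macro: AT_DISPATCH_ALL_TYPES_AND2(t1, t2, dtype, name, body);
PATTERN = re.compile(
    r"AT_DISPATCH_ALL_TYPES_AND2\(\s*"
    r"(?P<t1>\w+),\s*(?P<t2>\w+),\s*"   # the two extra scalar types
    r"(?P<dtype>[^,]+),\s*"             # runtime dtype expression
    r"(?P<name>\"[^\"]*\"),\s*"         # kernel name string
    r"(?P<body>.*)\)\s*;",              # the dispatch lambda
    re.DOTALL,
)

def to_dispatch_v2(src: str) -> str:
    """Rewrite legacy dispatch macros into the assumed AT_DISPATCH_V2 form."""
    def repl(m: re.Match) -> str:
        return (
            f"AT_DISPATCH_V2({m.group('dtype')}, {m.group('name')}, "
            f"AT_WRAP({m.group('body')}), "
            f"AT_EXPAND(AT_ALL_TYPES), {m.group('t1')}, {m.group('t2')});"
        )
    return PATTERN.sub(repl, src)

old = ('AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBool, iter.dtype(), "fill_cuda", '
      '[&] { fill_kernel<scalar_t>(iter); });')
print(to_dispatch_v2(old))
```

A regex is enough for this illustration; dispatch bodies with nested parenthesized arguments would need a real C++-aware rewrite, which is what the skill automates.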