ray-train
Distributed training orchestration across clusters. Scales PyTorch, TensorFlow, and Hugging Face training from a laptop to thousands of nodes, with built-in fault tolerance, elastic scaling, and hyperparameter tuning via Ray Tune. Use when training large models across multiple machines or running distributed hyperparameter sweeps.
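A minimal sketch of how a Ray Train job is typically launched, assuming Ray 2.x and PyTorch; the toy model, batch, and hyperparameters below are placeholders, not part of the skill itself:

```python
import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Runs on every worker; Ray sets up the process group and device placement.
    model = ray.train.torch.prepare_model(nn.Linear(10, 1))  # wraps in DDP
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        x = torch.randn(32, 10)        # placeholder batch
        loss = model(x).pow(2).mean()  # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```

The same loop scales from a laptop to a multi-node cluster without code changes; only the `ScalingConfig` (worker count, GPU use) differs.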
training-llms-megatron
Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models >1B parameters, when maximum GPU efficiency matters (47% MFU on H100), or when tensor/pipeline/sequence/context/expert parallelism is required. Production-ready framework used for Nemotron, LLaMA, and DeepSeek.
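A hedged sketch of how those parallelism dimensions are wired up through Megatron-Core's `parallel_state` module, assuming a recent megatron-core release that exposes the context- and expert-parallel keyword arguments and a process group launched via torchrun; the sizes shown are illustrative:

```python
import torch
from megatron.core import parallel_state

# torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR; NCCL backend for GPUs.
torch.distributed.init_process_group(backend="nccl")

# World size must be divisible by the product of the model-parallel sizes;
# e.g. 8 * 4 * 2 = 64 GPUs per data-parallel replica here.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=8,    # shard each layer's matmuls across 8 GPUs
    pipeline_model_parallel_size=4,  # split the layer stack into 4 stages
    context_parallel_size=2,         # shard long sequences across 2 GPUs
    expert_model_parallel_size=1,    # MoE expert parallelism (1 = disabled)
)
```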
training-pipelines
Master training pipelines: orchestration, distributed training, and hyperparameter tuning.
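To make the hyperparameter-tuning piece concrete, here is a minimal sweep sketch using Ray Tune (named above); the objective function and search space are placeholders, and the exact module path of the report call varies slightly across Ray versions:

```python
from ray import train, tune

def objective(config):
    # Placeholder objective: pretend lower scores are better.
    score = (config["lr"] - 0.01) ** 2
    train.report({"score": score})  # ray.train.report in Ray >= 2.7

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(metric="score", mode="min", num_samples=10),
)
results = tuner.fit()
print(results.get_best_result().config)
```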
funsloth-runpod
Training manager for RunPod GPU instances: configure pods, launch training, monitor progress, and retrieve checkpoints.
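A hedged sketch of the pod lifecycle such a manager would drive, assuming the runpod-python SDK; the image tag and GPU type string are illustrative values, not defaults of the skill:

```python
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]

# Launch a GPU pod from a training image (tag and GPU type are examples).
pod = runpod.create_pod(
    name="llm-finetune",
    image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel",
    gpu_type_id="NVIDIA A100 80GB PCIe",
    gpu_count=1,
)
print("pod id:", pod["id"])

# List running pods to monitor state, then tear down when training finishes.
for p in runpod.get_pods():
    print(p["id"], p.get("desiredStatus"))
runpod.terminate_pod(pod["id"])
```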