Agent Skills: Distributed Training

Use when training models across multiple GPUs or nodes, handling large models that don't fit in memory, or optimizing training throughput. Covers DDP, FSDP, DeepSpeed ZeRO, model/data parallelism, and gradient checkpointing.

Category: Uncategorized
ID: omer-metin/skills-for-antigravity/distributed-training

Install this agent skill locally:

pnpm dlx add-skill https://github.com/omer-metin/skills-for-antigravity/tree/HEAD/skills/distributed-training

Skill Files

Browse the full folder contents for distributed-training.

skills/distributed-training/SKILL.md

Skill Metadata

Name
distributed-training
Description
Use when training models across multiple GPUs or nodes, handling large models that don't fit in memory, or optimizing training throughput. Covers DDP, FSDP, DeepSpeed ZeRO, model/data parallelism, and gradient checkpointing.
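As a quick orientation to the first technique the description lists, here is a minimal DistributedDataParallel sketch. It is a single-process, CPU-only `gloo` setup chosen so it runs anywhere; a real multi-GPU job would use the `nccl` backend and launch one process per GPU via `torchrun`. The toy model and port number are illustrative choices, not anything prescribed by this skill:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process demo: rank 0 in a world of size 1, CPU "gloo" backend.
# (Address/port are arbitrary demo values.)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 1)  # hypothetical toy model
ddp_model = DDP(model)          # wraps the model; gradients are synchronized across ranks

out = ddp_model(torch.randn(4, 10))
out.sum().backward()            # all-reduce of gradients happens during backward

dist.destroy_process_group()
```

With more than one process, each rank would feed a distinct data shard (typically via `DistributedSampler`), and the wrapped model keeps replicas in sync automatically.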

Distributed Training

Identity

Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

  • For Creation: Always consult references/patterns.md. This file dictates how things should be built; prefer its patterns over generic approaches whenever a specific pattern exists.
  • For Diagnosis: Always consult references/sharp_edges.md. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
  • For Review: Always consult references/validations.md. This contains the strict rules and constraints. Use it to validate user inputs objectively.

Note: If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
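To illustrate the last technique the skill description lists, here is a minimal gradient-checkpointing sketch. The deep stack of linear blocks is a hypothetical toy model, and it runs on CPU; checkpointing trades compute for memory by discarding intermediate activations in the forward pass and recomputing them during backward:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical deep model: 8 small linear blocks.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(8)]
)

x = torch.randn(4, 64, requires_grad=True)

# Split the 8 blocks into 2 segments: only segment-boundary activations are
# stored; activations inside each segment are recomputed on backward.
out = checkpoint_sequential(model, segments=2, input=x, use_reentrant=False)
out.sum().backward()
```

The memory savings grow with model depth; the cost is roughly one extra forward pass per checkpointed segment during backward.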