LLM Training
Frameworks and techniques for training and finetuning large language models.
Framework Comparison
| Framework | Best For | Multi-GPU | Memory Efficient |
|-----------|----------|-----------|------------------|
| Accelerate | Simple distributed | Yes | Basic |
| DeepSpeed | Large models, ZeRO | Yes | Excellent |
| PyTorch Lightning | Clean training loops | Yes | Good |
| Ray Train | Scalable, multi-node | Yes | Good |
| TRL | RLHF, reward modeling | Yes | Good |
| Unsloth | Fast LoRA finetuning | Limited | Excellent |
Accelerate (HuggingFace)
Minimal wrapper for distributed training. Run `accelerate config` once for interactive setup.
Key concept: Wrap the model, optimizer, and dataloader with `accelerator.prepare()`, then call `accelerator.backward(loss)` instead of `loss.backward()`.
DeepSpeed (Large Models)
Microsoft's optimization library for training massive models.
ZeRO Stages:
- Stage 1: Optimizer states partitioned across GPUs
- Stage 2: + Gradients partitioned
- Stage 3: + Parameters partitioned (for largest models, 100B+)
Key concept: Configured via a JSON file; higher stages give larger memory savings at the cost of more communication overhead.
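A sketch of what a ZeRO Stage 2 config might look like, built as a Python dict and written out as the JSON file DeepSpeed expects. The key names follow the DeepSpeed config schema; the values are illustrative, not recommendations.

```python
import json

# Illustrative ZeRO Stage 2 configuration (optimizer states + gradients partitioned)
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,            # 1: optimizer states, 2: + gradients, 3: + parameters
        "overlap_comm": True,  # overlap gradient communication with computation
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The resulting file is passed to the trainer (e.g. `deepspeed --deepspeed_config ds_config.json ...`); moving to Stage 3 is a one-line change to `"stage": 3`.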
TRL (RLHF/DPO)
HuggingFace library for alignment training: supervised finetuning, reward modeling, and reinforcement learning from human feedback.
Training types:
- SFT (Supervised Finetuning): Standard instruction tuning
- DPO (Direct Preference Optimization): Simpler than RLHF, uses preference pairs
- PPO: Classic RLHF with reward model
Key concept: DPO is often preferred over PPO because it is simpler to run and needs no separate reward model, only chosen/rejected response pairs.
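The DPO loss itself is simple enough to compute by hand. The toy function below evaluates it for a single preference pair using made-up log-probabilities; it shows why no reward model is needed, since the policy and a frozen reference model score the chosen/rejected responses directly.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (all inputs are sequence log-probs)."""
    # Implicit "reward" of a response = beta * (policy logp - reference logp);
    # the loss pushes the chosen reward above the rejected one.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy favors the chosen response more than the reference does -> low loss
low = dpo_loss(logp_chosen=-10.0, logp_rejected=-30.0,
               ref_chosen=-12.0, ref_rejected=-25.0)

# Policy favors the rejected response -> high loss
high = dpo_loss(logp_chosen=-30.0, logp_rejected=-10.0,
                ref_chosen=-25.0, ref_rejected=-12.0)
```

In TRL this computation is handled internally; training data just needs `prompt`/`chosen`/`rejected` fields.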
Unsloth (Fast LoRA)
Optimized LoRA finetuning, advertised as roughly 2x faster with about 60% less memory.
Key concept: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.
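Back-of-envelope arithmetic for why LoRA-style finetuning (as used by Unsloth) is so memory-efficient: the full weight update on a `d_out × d_in` matrix is replaced by two low-rank factors of rank `r`. The shapes below are illustrative, loosely modeled on a 7B-class attention projection.

```python
# Illustrative shapes: a square projection matrix and a common LoRA rank
d_in, d_out, r = 4096, 4096, 16

full_update_params = d_in * d_out   # full finetuning updates every weight
lora_params = r * (d_in + d_out)    # LoRA trains A (r x d_in) and B (d_out x r)

fraction = lora_params / full_update_params  # tiny fraction of the full matrix
```

Since optimizer states (e.g. Adam moments) scale with the number of *trainable* parameters, this fraction translates directly into optimizer-memory savings.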
Memory Optimization Techniques
| Technique | Memory Savings | Trade-off |
|-----------|---------------|-----------|
| Gradient checkpointing | ~30-50% | Slower training |
| Mixed precision (fp16/bf16) | ~50% | Minor precision loss |
| 4-bit quantization (QLoRA) | ~75% | Some quality loss |
| Flash Attention | ~20-40% | Requires compatible GPU |
| Gradient accumulation | None directly (larger effective batch at fixed memory) | More steps per update |
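Gradient accumulation costs no extra memory because averaging per-micro-batch gradients before a single optimizer step reproduces the full-batch gradient exactly. A toy demonstration with invented per-example gradient values:

```python
import math

per_example_grads = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]  # invented values
micro = 2                                   # micro-batch size that fits in memory
accum_steps = len(per_example_grads) // micro  # 3 steps -> effective batch of 6

# Accumulate: each micro-batch's mean gradient, scaled by 1/accum_steps
acc = 0.0
for s in range(accum_steps):
    batch = per_example_grads[s * micro:(s + 1) * micro]
    acc += (sum(batch) / len(batch)) / accum_steps

# Full-batch mean gradient computed in one shot
full = sum(per_example_grads) / len(per_example_grads)

assert math.isclose(acc, full)  # identical result, one sixth the activation memory
```

In practice this is the `1/accum_steps` loss scaling that frameworks apply automatically when you set `gradient_accumulation_steps`.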
Decision Guide
| Scenario | Recommendation |
|----------|----------------|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |
Resources
- Accelerate: https://huggingface.co/docs/accelerate
- DeepSpeed: https://www.deepspeed.ai/
- TRL: https://huggingface.co/docs/trl
- Unsloth: https://github.com/unslothai/unsloth