Overview
Unsloth enables training on extreme context lengths (89K+ on a single 80GB GPU) by using manually derived Triton kernels for RoPE and attention. It cuts memory usage by a further 30% compared to Flash Attention 2, allowing up to 4x longer context windows.
When to Use
- When training on long documents, codebases, or books.
- When building models that require large retrieval windows or multi-document reasoning.
- When standard Flash Attention 2 results in OOM errors on long sequences.
Decision Tree
- Is context > 32K?
  - Yes: Set `use_gradient_checkpointing = 'unsloth'` (mandatory for stability; see the helper sketch after this tree).
- Are you seeing quality degradation on long context?
  - Yes: Ensure your dataset includes samples with long-range dependencies and adjust the RoPE base frequency.
- Using an A100/H100 80GB?
  - Yes: You can push context lengths toward the 89K tier.
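A minimal sketch of the first branch as code; the helper name and the boolean fallback are assumptions for illustration, not part of Unsloth's API:

```python
# Hypothetical helper (not part of Unsloth's API): choose the value to
# pass as use_gradient_checkpointing based on the target context length.
def pick_checkpointing_mode(max_seq_length: int):
    if max_seq_length > 32_768:
        return "unsloth"  # mandatory for stability beyond 32K context
    return True  # standard gradient checkpointing suffices for shorter runs
```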
Workflows
Setting Up Extreme Context Training
- Load the model with a high `max_seq_length` (e.g., 65536+).
- Ensure `use_gradient_checkpointing = 'unsloth'` is passed to `get_peft_model`.
- Use high-VRAM GPUs (A100/H100 80GB) to enable the highest context tiers; a minimal setup sketch follows.
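A minimal setup sketch under the assumptions above; the model name, LoRA rank, and target modules are placeholder choices, not prescriptions:

```python
from unsloth import FastLanguageModel

max_seq_length = 65536  # high context tier; needs an 80GB-class GPU

# Model checkpoint and 4-bit quantization are placeholder choices.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# 'unsloth' checkpointing is mandatory for stability beyond 32K context.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)
```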
RoPE Scaling Configuration
- Set `max_seq_length` in `from_pretrained`; Unsloth automatically adjusts the RoPE base frequency internally.
- Include samples with long dependencies in the dataset to prevent performance degradation.
- Increase batch size or gradient accumulation to ensure sufficient tokens per step for stable long-range learning (see the trainer sketch below).
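A sketch of the training-step configuration, assuming the model and tokenizer from the setup above and a `dataset` containing long-dependency samples; argument placement varies across TRL versions (newer releases move `max_seq_length` and `dataset_text_field` into `SFTConfig`):

```python
from trl import SFTTrainer
from transformers import TrainingArguments

# Sketch only: model, tokenizer, and dataset come from the setup above.
# Effective tokens per step ~ batch_size * grad_accum * avg_seq_len;
# keep it high enough for stable long-range learning.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,           # assumed to include long-dependency samples
    dataset_text_field="text",       # field name is an assumption
    max_seq_length=65536,            # Unsloth rescales RoPE past the native window
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # raises tokens per optimizer step
        learning_rate=2e-4,
        max_steps=100,
        output_dir="outputs",
    ),
)
```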
Non-Obvious Insights
- Unsloth's performance in long context comes from custom Triton kernels that handle RoPE scaling more efficiently than standard libraries, allowing for 13x longer context than the HF+FA2 combination.
- The 'unsloth' gradient checkpointing mode is not optional for long contexts; it is mandatory for sequences exceeding 32K, otherwise activation memory will exhaust the GPU.
- Flex Attention is an experimental feature in Unsloth that allows training massive models (like 120B) on reduced VRAM by optimizing the attention patterns specifically for memory efficiency.
Evidence
- "Unsloth supports 89K context for Meta's Llama 3.3 (70B) on a 80GB GPU - 13x longer than HF+FA2." Source
- "We cut memory usage by a further 30% and now support 4x longer context windows!" Source
Scripts
- `scripts/unsloth-long-context_tool.py`: Script to initialize models with specific RoPE and context length settings.
- `scripts/unsloth-long-context_tool.js`: Utility to calculate token counts for long documents.
Dependencies
- unsloth
- triton
- torch
References
- [[references/README.md]]