Agent Skills: unsloth-quantization

Utilizing Dynamic 4-bit quantization, FP8 training, and 8-bit optimizers to minimize VRAM usage without sacrificing accuracy. Triggers: quantization, dynamic 4-bit, fp8, bitsandbytes, adamw_8bit, qat.

Uncategorized
ID: cuba6112/skillfactory/unsloth-quantization

Install this agent skill to your local environment:

pnpm dlx add-skill https://github.com/cuba6112/skillfactory/tree/HEAD/skills/unsloth-quantization

Skill Files

Browse the full folder contents for unsloth-quantization.

skills/unsloth-quantization/SKILL.md

Skill Metadata

Name
unsloth-quantization
Description
Utilizing Dynamic 4-bit quantization, FP8 training, and 8-bit optimizers to minimize VRAM usage without sacrificing accuracy. Triggers: quantization, dynamic 4-bit, fp8, bitsandbytes, adamw_8bit, qat.

Overview

Unsloth utilizes advanced quantization techniques to reduce the memory footprint of LLM fine-tuning. This includes "Dynamic 4-bit" loading (protecting sensitive layers), FP8 training for modern GPUs, and the use of 8-bit optimizers to save gigabytes of VRAM.

When to Use

  • When training on GPUs with limited VRAM (e.g., 8GB, 12GB, or 16GB).
  • When aiming for the fastest possible training speeds on H100 or RTX 40 series GPUs.
  • When trying to balance model size and reasoning performance.

Decision Tree

  1. Is your GPU RTX 40 series or newer (Ada/Hopper)?
    • Yes: Use FP8 Dynamic for 2x faster training.
    • No: Use BF16/FP16.
  2. Running out of VRAM?
    • Yes: Ensure load_in_4bit=True and use adamw_8bit optimizer.
  3. Is accuracy dropping significantly?
    • Yes: Use "Dynamic" variants that protect the first and last layers.
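
A minimal sketch of the decision tree above, using only PyTorch device queries. The 16 GB low-VRAM threshold and the returned field names are illustrative assumptions, not part of the skill; compute capability 8.9+ corresponds to Ada Lovelace (RTX 40 series) and Hopper (H100), which have FP8 support.

```python
import torch

def suggest_quantization_settings(low_vram_threshold_gb: float = 16.0) -> dict:
    """Rough heuristic mirroring the decision tree above (illustrative only)."""
    major, minor = torch.cuda.get_device_capability(0)
    compute_capability = major + minor / 10
    # Decision 1: FP8 needs Ada Lovelace (8.9) or Hopper (9.0+); otherwise BF16/FP16.
    if compute_capability >= 8.9:
        precision = "fp8"
    elif torch.cuda.is_bf16_supported():
        precision = "bf16"
    else:
        precision = "fp16"

    total_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    low_vram = total_vram_gb <= low_vram_threshold_gb
    # Decision 2: 4-bit weights and an 8-bit optimizer when VRAM is tight.
    # Decision 3 (accuracy drops) is handled by choosing an "unsloth-bnb-4bit" Dynamic checkpoint.
    return {
        "precision": precision,
        "load_in_4bit": low_vram,
        "optim": "adamw_8bit" if low_vram else "adamw_torch",
    }

if __name__ == "__main__":
    print(suggest_quantization_settings())
```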

Workflows

FP8 Training Configuration

  1. Select a model variant ending in '-FP8-Dynamic'.
  2. Configure the trainer to use the FP8 backend (available for H100 and Ada Lovelace architectures).
  3. Verify the speedup, which typically reaches 2x compared to standard BF16 (see the sketch below).
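
A hedged sketch of this workflow. The checkpoint name below is illustrative (any variant ending in '-FP8-Dynamic' from the Unsloth hub), and the trainer-side FP8 backend configuration depends on your Unsloth/Accelerate version, so treat this as a starting point rather than a verified recipe.

```python
from unsloth import FastLanguageModel

# Step 1: pick a checkpoint whose name ends in "-FP8-Dynamic" (name below is an assumption).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct-FP8-Dynamic",  # illustrative variant name
    max_seq_length=2048,
    load_in_4bit=False,  # FP8 weights already reduce memory; do not stack BnB 4-bit on top
)

# Step 2: the FP8 compute path is only available on Hopper (H100) and Ada Lovelace GPUs
# (compute capability >= 8.9); on older hardware fall back to BF16/FP16 per the decision tree.

# Step 3: benchmark a few training steps against a BF16 run to confirm the expected ~2x speedup.
```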

VRAM-Constrained Training Setup

  1. Set load_in_4bit=True and use_gradient_checkpointing='unsloth'.
  2. Apply 'adamw_8bit' optimizer in TrainingArguments.
  3. Set per_device_train_batch_size to 1 and maximize gradient_accumulation_steps (see the sketch below).
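
A minimal sketch of these three steps, assuming the common Unsloth + TRL SFTTrainer setup. The model name, LoRA rank, dataset, and accumulation steps are illustrative, and the SFTTrainer signature varies slightly across TRL versions.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Step 1: 4-bit base weights plus Unsloth's offloaded gradient checkpointing.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Steps 2-3: 8-bit optimizer, batch size 1, and large gradient accumulation.
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size of 16
    optim="adamw_8bit",
    num_train_epochs=1,
    output_dir="outputs",
)

dataset = load_dataset("imdb", split="train[:1%]")  # illustrative dataset with a "text" column
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=args,
)
trainer.train()
```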

Non-Obvious Insights

  • Unsloth's "Dynamic 4-bit" (unsloth-bnb-4bit) differs from standard BNB 4-bit by protecting the first and last layers, which are critical for reasoning and output quality.
  • Quantization-Aware Training (QAT) is implicitly supported through Unsloth's specialized kernels, allowing the model to adapt to the lower precision during the LoRA update process.
  • Switching to the adamw_8bit optimizer can save up to 2GB of VRAM on optimizer states alone for a 7B parameter model.
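
For the first insight, the difference shows up only in the checkpoint name: Unsloth's Dynamic variants carry an "unsloth-bnb-4bit" suffix. The exact model names below are illustrative; check the Unsloth model hub for the variant matching your base model.

```python
from unsloth import FastLanguageModel

# Dynamic 4-bit ("unsloth-bnb-4bit") keeps accuracy-critical layers in higher precision;
# plain "-bnb-4bit" checkpoints quantize uniformly. Names below are illustrative.
DYNAMIC_4BIT = "unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit"
STANDARD_4BIT = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=DYNAMIC_4BIT,
    max_seq_length=2048,
    load_in_4bit=True,
)
```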

Evidence

  • "load_in_4bit = True – Enables 4-bit quantization, reducing memory use 4× for fine-tuning." Source
  • "FP8 Dynamic offers slightly faster training and lower VRAM usage than FP8 Block." Source

Scripts

  • scripts/unsloth-quantization_tool.py: Script to check GPU compatibility and suggest quantization settings.
  • scripts/unsloth-quantization_tool.js: Memory usage estimator for different quantization levels.
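
As a rough guide to what such an estimator computes, weight-only memory scales with bytes per parameter. This is not the bundled script, just a back-of-the-envelope approximation that ignores activations, optimizer states, and the KV cache.

```python
def estimate_weight_vram_gb(num_params_billion: float, quant: str) -> float:
    """Weight-only VRAM estimate in GiB; excludes activations, optimizer states, KV cache."""
    bytes_per_param = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "4bit": 0.5}
    return num_params_billion * 1e9 * bytes_per_param[quant] / 1024**3

for level in ("bf16", "fp8", "int8", "4bit"):
    print(f"7B weights at {level:>4}: ~{estimate_weight_vram_gb(7, level):.1f} GiB")
```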

Dependencies

  • unsloth
  • bitsandbytes
  • torch

References

  • [[references/README.md]]