Overview
Unsloth-grpo enables training of reasoning models using Group Relative Policy Optimization (GRPO). This technique replaces traditional PPO Reward and Value models with group statistics, achieving 8x memory savings and allowing long-context RL training on limited VRAM.
When to Use
- When building DeepSeek-R1 style reasoning models.
- When performing Reinforcement Learning with Verifiable Rewards (RLVR) for math or code.
- When training models with long context lengths (e.g., 20K tokens) on single GPUs.
Decision Tree
- Is your model size < 1.5B?
- Yes: Model may struggle with consistent thinking tokens; consider 1.5B-8B.
- Is the reward verifiable (e.g., math answer)?
- Yes: Use RLVR with regex-based reward functions.
- Are you training on a single GPU with long context?
- Yes: Use
GRPOTrainerto benefit from the 8x memory reduction.
- Yes: Use
Workflows
- Converting to Reasoning LLM: Load a base model with
fast_inference = True, define a reward function for<thought>and<answer>tags, and train withGRPOTrainer. - Implementing Verifiable Rewards (RLVR): Create a correctness function using regex to extract answers and assign rewards (e.g., 2.0) or penalties (-1.0) based on ground truth.
- Speed Optimization with FP8: Select an FP8 Dynamic model and use
optim = 'adamw_8bit'to further reduce memory during the RL rollout phase.
Non-Obvious Insights
- GRPO eliminates the need for both the Reward Model and the Value Model, relying purely on group generation statistics for policy updates.
- Unsloth's GRPO implementation allows training 20K context reasoning models with only ~9.8GB of additional VRAM.
- For stable training,
num_generationsshould be set to 8 per prompt to provide sufficient statistical variance for the reward calculation.
Evidence
- "Unsloth shaves 8x memory usage for long context GRPO, so we need only an extra 9.8GB in extra VRAM for 20K context lengths!" Source
- "Introducing Long-context Reasoning (GRPO) in Unsloth. Train your own reasoning model with just 5GB VRAM." Source
Scripts
scripts/unsloth-grpo_tool.py: Template for GRPOTrainer and verifiable reward functions.scripts/unsloth-grpo_tool.js: Node.js utility for monitoring RL training metrics.
Dependencies
unslothtrlvllm(recommended for faster generation in rollouts)