NanoGPT Training
Overview
This guide covers training GPT-2 scale models (~124M parameters) efficiently on a single GPU. It provides:
- GPT-124M Architecture: Standard transformer with RoPE and modern optimizations
- Tokenized Datasets: Loading pre-tokenized shards from the HuggingFace Hub or from local files
- Modern Optimizers: Muon optimizer with Newton-Schulz orthogonalization (see the sketch after this list)
- Mixed Precision: bfloat16 training on A100 for roughly a 2x speedup over float32
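The Newton-Schulz step at the heart of Muon can be summarized in a few lines. This is a minimal sketch: the quintic coefficients and the bfloat16 cast follow the modded-nanogpt reference implementation, while the surrounding optimizer state (momentum, learning-rate scaling) is omitted.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix with a quintic
    Newton-Schulz iteration (the core of the Muon optimizer).
    Coefficients follow the modded-nanogpt reference; illustrative only."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:                        # iterate in the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)             # scale so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```

In the modded-nanogpt setup this is applied to the momentum-averaged gradients of 2D hidden weight matrices; embeddings, the LM head, and 1D parameters are typically left to AdamW.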
Training options:
- Baseline GPT: Standard residual connections (a minimal block sketch follows this list)
- Experimental residual variants: Optional alternative residual schemes aimed at stability or efficiency
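For reference, the baseline path is the usual pre-norm residual block. A minimal sketch, where `attn` and `mlp` stand for whatever attention and MLP modules the model uses (passed in here to keep the example self-contained):

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block with standard residual connections."""
    def __init__(self, n_embd: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = attn     # causal self-attention module
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = mlp       # position-wise MLP module

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # residual around attention
        x = x + self.mlp(self.ln_2(x))    # residual around the MLP
        return x
```

The experimental variants change how these two residual additions are formed; the sketch shows only the baseline.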
Quick Reference
| Topic | Reference |
|-------|-----------|
| Model Architecture | GPT Architecture |
| Data Loading | Tokenized Data |
| Optimizers | Optimizers |
| Training Loop | Training Loop |
| Hyperparameters | Hyperparameters |
Installation
pip install torch einops numpy huggingface_hub
Minimal Example
import modal

app = modal.App("gpt-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "einops", "numpy", "huggingface_hub"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        dropout: float = 0.0
        bias: bool = False

    # Download data, build model, train
    # ... (see references for full implementation)
    final_loss = None  # set to the final validation loss once the training loop runs
    return {"final_loss": final_loss}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
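To expand the "download data" step, here is a minimal sketch of pulling one pre-tokenized shard from the Hub and sampling batches from it. It assumes shards are `.npy` files of uint16 GPT-2 token ids (the layout used by the build-nanogpt FineWeb pipeline); the shard filename in the usage comment is hypothetical and must be replaced with a real file from the dataset repo listed under External Resources.

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download

def load_token_shard(repo_id: str, filename: str) -> torch.Tensor:
    """Download one pre-tokenized shard and return a 1D tensor of token ids.
    Assumes uint16 .npy shards (the GPT-2 vocab fits in uint16)."""
    path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    tokens = np.load(path).astype(np.int64)   # widen for embedding lookup
    return torch.from_numpy(tokens)

def get_batch(tokens: torch.Tensor, batch_size: int, block_size: int):
    """Sample a random batch of (input, target) sequences from a shard."""
    ix = torch.randint(len(tokens) - block_size - 1, (batch_size,))
    x = torch.stack([tokens[i : i + block_size] for i in ix])
    y = torch.stack([tokens[i + 1 : i + 1 + block_size] for i in ix])
    return x, y

# Hypothetical usage; replace the filename with an actual shard in the dataset repo.
# shard = load_token_shard(
#     "karpathy/fineweb-edu-100B-gpt2-token-shards", "shard_00000.npy"
# )
# x, y = get_batch(shard, batch_size=8, block_size=1024)
```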
Common Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler  # newer PyTorch prefers torch.amp; GradScaler is only needed for float16
from dataclasses import dataclass
from einops import rearrange, repeat, reduce
import numpy as np
import math
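With bfloat16 on an A100, the forward and backward pass run under autocast and no GradScaler is needed (loss scaling only matters for float16). A minimal sketch of one training step, assuming `model` returns logits of shape (B, T, vocab_size) and `optimizer` is already constructed:

```python
import torch

def train_step(model, optimizer, x, y, device="cuda"):
    """One mixed-precision training step with bfloat16 autocast.
    No GradScaler: bfloat16 has the same exponent range as float32."""
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                                   # (B, T, vocab_size)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        )
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item(), grad_norm.item()
```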
When to Use What
| Scenario | Approach |
|----------|----------|
| Standard GPT training | Use baseline model with standard residuals |
| Stability experiments | Try alternative residual variants or extra streams |
| Small experiments | Use T4/A10G GPU |
| Full training | Use A100 with bfloat16 |
| Custom data | Modify the dataset loader class |
| Different model size | Adjust GPTConfig parameters |
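For the "small experiments" and "different model size" rows, the knobs are just the GPTConfig fields from the Minimal Example above. The values below are illustrative placeholders, not tuned recommendations:

```python
# Downsized config for quick runs on a T4/A10G.
# GPTConfig is the dataclass from the Minimal Example; values are illustrative.
small_config = GPTConfig(
    block_size=512,
    n_layer=6,
    n_head=6,
    n_embd=384,   # keep n_embd divisible by n_head
)
```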
Metrics to Monitor
| Metric | Typical Signal | Notes |
|--------|----------------|-------|
| Validation loss | Steady decrease | Absolute value depends on dataset/tokenizer |
| Grad norm | Moderate, stable range | Large spikes indicate instability |
| Training stability | Smooth curves | Frequent spikes suggest LR/batch issues |
| Throughput | Consistent tokens/sec | Use for comparing configs |
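Throughput is the most directly comparable of these metrics. A minimal sketch of measuring tokens/sec around one step (`train_step` refers to the mixed-precision sketch above and is otherwise a placeholder name):

```python
import time
import torch

def timed_step(model, optimizer, x, y):
    """Run one step and report tokens/sec. Synchronize so GPU time is counted."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    loss, grad_norm = train_step(model, optimizer, x, y)  # sketch from above
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    tokens_per_sec = x.numel() / dt   # batch_size * block_size tokens per step
    return loss, grad_norm, tokens_per_sec
```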
External Resources
- nanoGPT: https://github.com/karpathy/nanoGPT
- build-nanogpt: https://github.com/karpathy/build-nanogpt
- modded-nanogpt: https://github.com/KellerJordan/modded-nanogpt
- FineWeb-Edu token shards: https://huggingface.co/datasets/karpathy/fineweb-edu-100B-gpt2-token-shards