# HuggingFace Transformers
Access thousands of pre-trained models for NLP, vision, audio, and multimodal tasks.
## When to Use
- Quick inference with pipelines
- Text generation, classification, QA, NER
- Image classification, object detection
- Fine-tuning on custom datasets
- Loading pre-trained models from HuggingFace Hub
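A minimal sketch of the pipeline API (the task name alone selects a default checkpoint; the printed output is illustrative):

```python
from transformers import pipeline

# The task string is enough; the library picks a default model
classifier = pipeline("text-classification")
print(classifier("Transformers makes NLP approachable."))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]
```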
## Pipeline Tasks
### NLP Tasks
| Task | Pipeline Name | Output |
|------|---------------|--------|
| Text Generation | text-generation | Completed text |
| Classification | text-classification | Label + confidence |
| Question Answering | question-answering | Answer span |
| Summarization | summarization | Shorter text |
| Translation | translation_en_to_fr | Translated text |
| NER | ner | Entity spans + types |
| Fill Mask | fill-mask | Predicted tokens |
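The pipeline names above map directly to `pipeline(...)` calls. A sketch of a few of them (default checkpoints assumed; `aggregation_strategy="simple"` merges sub-word NER pieces into whole entities):

```python
from transformers import pipeline

generator = pipeline("text-generation")
qa = pipeline("question-answering")
ner = pipeline("ner", aggregation_strategy="simple")  # merge sub-word tokens

print(generator("Once upon a time", max_new_tokens=30)[0]["generated_text"])
print(qa(question="Where is the Eiffel Tower?",
         context="The Eiffel Tower is in Paris."))
print(ner("Hugging Face was founded in New York City."))
```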
### Vision Tasks
| Task | Pipeline Name | Output |
|------|---------------|--------|
| Image Classification | image-classification | Label + confidence |
| Object Detection | object-detection | Bounding boxes |
| Image Segmentation | image-segmentation | Pixel masks |
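Vision pipelines accept local paths, URLs, or PIL images. A sketch using a sample COCO image URL (assumes Pillow is installed and network access is available):

```python
from transformers import pipeline

detector = pipeline("object-detection")
results = detector("http://images.cocodataset.org/val2017/000000039769.jpg")
for r in results:
    print(r["label"], round(r["score"], 3), r["box"])  # label, confidence, box
```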
### Audio Tasks
| Task | Pipeline Name | Output |
|------|---------------|--------|
| Speech Recognition | automatic-speech-recognition | Transcribed text |
| Audio Classification | audio-classification | Label + confidence |
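A speech-recognition sketch; `openai/whisper-tiny` is one small example checkpoint, `sample.wav` is a placeholder path, and decoding audio files typically requires ffmpeg:

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("sample.wav")["text"])  # "sample.wav" is a placeholder path
```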
## Model Loading Patterns
### Auto Classes
| Class | Use Case |
|-------|----------|
| AutoModel | Base model (embeddings) |
| AutoModelForCausalLM | Text generation (GPT-style) |
| AutoModelForSeq2SeqLM | Encoder-decoder (T5, BART) |
| AutoModelForSequenceClassification | Classification head |
| AutoModelForTokenClassification | NER, POS tagging |
| AutoModelForQuestionAnswering | Extractive QA |
Key concept: Always use Auto classes unless you need a specific architecture; they infer the correct model class from the checkpoint's configuration automatically.
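A typical loading pattern with Auto classes (`gpt2` is used only as a small, widely available example checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example checkpoint; any causal LM on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```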
## Generation Parameters
| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| max_new_tokens | Output length | 50-500 |
| temperature | Randomness (lower = more deterministic) | 0.1-1.0 |
| top_p | Nucleus sampling threshold | 0.9-0.95 |
| top_k | Limit vocabulary per step | 50 |
| num_beams | Beam search width (typically with sampling off) | 4-8 |
| repetition_penalty | Discourage repetition | 1.1-1.3 |
Key concept: Higher temperature = more creative but less coherent. For factual tasks, use low temperature (0.1-0.3).
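How these parameters come together in a `generate()` call, again sketched with `gpt2` (sampling parameters only take effect with `do_sample=True`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The key to good decoding is", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,           # required for temperature/top_p/top_k
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```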
## Memory Management
### Device Placement Options
| Option | When to Use |
|--------|-------------|
| device_map="auto" | Let library decide GPU allocation |
| device_map="cuda:0" | Specific GPU |
| device_map="cpu" | CPU only |
### Quantization Options
| Method | Memory Reduction | Quality Impact |
|--------|------------------|----------------|
| 8-bit | ~50% | Minimal |
| 4-bit | ~75% | Small for most tasks |
| GPTQ | ~75% | Requires calibration |
| AWQ | ~75% | Activation-aware |
Key concept: Use torch_dtype="auto" to automatically use the model's native precision (often bfloat16).
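A 4-bit loading sketch with bitsandbytes (assumes a CUDA GPU plus the `bitsandbytes` and `accelerate` packages; `mistralai/Mistral-7B-v0.1` is purely an example 7B checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",     # example checkpoint
    quantization_config=bnb_config,
    device_map="auto",               # spread layers across available devices
)
```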
## Fine-Tuning Concepts
### Trainer Arguments
| Argument | Purpose | Typical Value |
|----------|---------|---------------|
| num_train_epochs | Training passes | 3-5 |
| per_device_train_batch_size | Samples per GPU | 8-32 |
| learning_rate | Step size | 2e-5 for fine-tuning |
| weight_decay | Regularization | 0.01 |
| warmup_ratio | LR warmup | 0.1 |
| evaluation_strategy | When to eval | "epoch" or "steps" |
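A Trainer sketch wiring up the arguments above (`model`, `train_dataset`, and `eval_dataset` are assumed to be defined elsewhere; recent transformers releases rename `evaluation_strategy` to `eval_strategy`):

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",  # eval_strategy in newer versions
)
trainer = Trainer(
    model=model,                  # assumed defined above
    args=args,
    train_dataset=train_dataset,  # assumed datasets
    eval_dataset=eval_dataset,
)
trainer.train()
```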
### Fine-Tuning Strategies
| Strategy | Memory | Quality | Use Case |
|----------|--------|---------|----------|
| Full fine-tuning | High | Best | Small models, enough data |
| LoRA | Low | Good | Large models, limited GPU |
| QLoRA | Very Low | Good | 7B+ models on consumer GPU |
| Prefix tuning | Low | Moderate | When you can't modify weights |
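A LoRA sketch using the separate `peft` library (`target_modules` names are model-specific; `q_proj`/`v_proj` match many decoder-only LLMs, and `model` is assumed loaded already):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # model-specific layer names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # `model` loaded earlier
model.print_trainable_parameters()          # usually <1% of weights train
```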
## Tokenization Concepts
| Parameter | Purpose |
|-----------|---------|
| padding | Make sequences same length |
| truncation | Cut sequences to max_length |
| max_length | Maximum tokens (model-specific) |
| return_tensors | Output format ("pt", "tf", "np") |
Key concept: Always use the tokenizer that matches the model—different models use different vocabularies.
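How these parameters combine in a batched call (`bert-base-uncased` used only as a familiar example; 512 is its context limit):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["short text", "a somewhat longer piece of text"],
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # cut anything beyond max_length
    max_length=512,       # model-specific limit
    return_tensors="pt",  # PyTorch tensors
)
print(batch["input_ids"].shape)  # (2, padded_length)
```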
## Best Practices
| Practice | Why |
|----------|-----|
| Use pipelines for inference | Handles preprocessing automatically |
| Use device_map="auto" | Optimal GPU memory distribution |
| Batch inputs | Better throughput |
| Use quantization for large models | Run 7B+ on consumer GPUs |
| Match tokenizer to model | Vocabularies differ between models |
| Use Trainer for fine-tuning | Built-in best practices |
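Batching through a pipeline, as the table recommends (default checkpoint assumed; passing a list plus `batch_size` batches inputs through the model):

```python
from transformers import pipeline

classifier = pipeline("text-classification")
texts = ["great product", "terrible service", "it was fine"]
for result in classifier(texts, batch_size=8):  # batched under the hood
    print(result)
```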
## Resources
- Docs: https://huggingface.co/docs/transformers
- Model Hub: https://huggingface.co/models
- Course: https://huggingface.co/course