spaCy NLP
Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.
Scope
In Scope:
- spaCy 3.x installation and text processing
- TextCategorizer training for document classification
- Production deployment and optimization patterns
Out of Scope (use other tools/skills):
- Training custom NER models (different workflow)
- spaCy 2.x (deprecated, incompatible with 3.x)
- Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
- Custom tokenizers or language models
Quick Start
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)
# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)
Installation
Standard Setup
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
Model Selection
| Model | Size | Speed | Use Case |
|-------|------|-------|----------|
| en_core_web_sm | 12 MB | Fastest | Prototyping, speed-critical |
| en_core_web_md | 40 MB | Fast | General use with word vectors |
| en_core_web_lg | 560 MB | Fast | Semantic similarity tasks |
| en_core_web_trf | 438 MB | Slow | Maximum accuracy (GPU) |
Verify Installation
import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")
For detailed installation options (conda, GPU, transformers): See references/installation.md
Text Processing
Basic Pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")
# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")
Named Entity Recognition
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple Inc." ORG, "Steve Jobs" PERSON
For entity types, filtering, and span details: See references/basic-usage.md
Batch Processing (Critical for Production)
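# WRONG - slow: one pipeline call per text, no batching
for text in texts:
    doc = nlp(text)  # Don't do this
# CORRECT - fast: texts are batched internally
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)  # process() stands in for your own handling code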
# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
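When each text carries an ID or other metadata, nlp.pipe can pass it through with as_tuples=True, so results stay matched to their source records. A minimal sketch, assuming a blank English pipeline (tokenizer only) so that it runs without a downloaded model; the records list and its id field are hypothetical:

```python
import spacy

# Assumption: a blank pipeline keeps the sketch runnable without a model
# download; in practice use spacy.load("en_core_web_sm") or your own model.
nlp = spacy.blank("en")

# Hypothetical input: (text, metadata) pairs.
records = [
    ("Quarterly revenue exceeded expectations", {"id": 1}),
    ("Kubernetes deployment manifest updated", {"id": 2}),
]

results = []
# as_tuples=True yields (doc, context) pairs in the same order as the input.
for doc, meta in nlp.pipe(records, as_tuples=True, batch_size=50):
    results.append({"id": meta["id"], "n_tokens": len(doc)})

print(results)
```

The same loop works unchanged with a trained pipeline; only the nlp object differs.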
Disable Unused Components
# Only need NER - disable the rest for 2x speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
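Components can also be switched off temporarily rather than at load time, which helps when one loaded pipeline serves several tasks. The sketch below uses nlp.select_pipes as a context manager. It assumes a blank pipeline with a sentencizer standing in for the trained components, so that no model download is needed:

```python
import spacy

# Assumption: blank pipeline + sentencizer stands in for a trained pipeline.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

print(nlp.pipe_names)  # names of the currently active components

# Disable a component for the duration of the block only.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    doc = nlp("First sentence. Second sentence.")

# The component is restored automatically when the block exits.
names_after = list(nlp.pipe_names)
print(names_inside, names_after)
```

With a trained pipeline, the list passed to disable would name components such as "parser" or "lemmatizer" instead.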
For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md
Training Classifiers
Train custom text classifiers with TextCategorizer.
Workflow Overview
- Prepare data → Run scripts/prepare_training_data.py
- Generate config → Run scripts/generate_config.py or use assets/config_textcat.cfg
- Validate → python -m spacy debug data config.cfg (catches issues before training)
- Train → python -m spacy train config.cfg --output ./output
- Evaluate → Run scripts/evaluate_model.py
- Use → nlp = spacy.load("./output/model-best")
Data Format
Training data uses spaCy's DocBin format. Example input (JSON):
[
{"text": "Quarterly revenue exceeded expectations", "label": "Business"},
{"text": "Fixed null pointer exception in parser", "label": "Programming"},
{"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]
Convert with script:
python scripts/prepare_training_data.py \
--input data.json \
--output-train train.spacy \
--output-dev dev.spacy \
--split 0.8
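For readers who want to see what such a conversion involves, here is a rough sketch of turning the JSON records above into train and dev DocBin objects. It is not the contents of scripts/prepare_training_data.py; the function name build_docbins, its parameters, and the shuffling strategy are assumptions made for illustration:

```python
import random

import spacy
from spacy.tokens import DocBin


def build_docbins(records, split=0.8, seed=0):
    """Split records into train/dev DocBins with one-hot category scores."""
    labels = sorted({rec["label"] for rec in records})
    nlp = spacy.blank("en")  # only the tokenizer is needed here

    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split)

    bins = []
    for subset in (shuffled[:cut], shuffled[cut:]):
        doc_bin = DocBin()
        for rec in subset:
            doc = nlp.make_doc(rec["text"])
            # TextCategorizer expects a score for every label on every doc.
            doc.cats = {label: float(label == rec["label"]) for label in labels}
            doc_bin.add(doc)
        bins.append(doc_bin)
    return bins[0], bins[1], labels


records = [
    {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
    {"text": "Fixed null pointer exception in parser", "label": "Programming"},
    {"text": "Kubernetes deployment manifest updated", "label": "DevOps"},
]
train_db, dev_db, labels = build_docbins(records, split=0.8)
print(len(train_db), len(dev_db), labels)
# To write the files: train_db.to_disk("train.spacy"); dev_db.to_disk("dev.spacy")
```

A real dataset would usually need a stratified split so that every label appears in both files; the simple shuffle here does not guarantee that.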
Training Command
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"
# Or use template
cp assets/config_textcat.cfg config.cfg
# Train
python -m spacy train config.cfg --output ./output
# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0
Using Trained Model
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")
predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}") # DevOps: 94.2%
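Taking the highest-scoring label always yields a prediction, even when the model is unsure. One option is to fall back to a placeholder label below a confidence threshold. The helper below is a sketch that works on any doc.cats-style dictionary; the name top_category, the 0.5 threshold, and the "Unknown" fallback are illustrative assumptions, not part of spaCy:

```python
def top_category(cats, threshold=0.5, fallback="Unknown"):
    """Return (label, score), or (fallback, score) if the best score is too low."""
    if not cats:
        return fallback, 0.0
    label = max(cats, key=cats.get)
    score = cats[label]
    if score < threshold:
        return fallback, score
    return label, score


# Plain dictionaries in the same shape as doc.cats.
print(top_category({"DevOps": 0.942, "Business": 0.03, "Programming": 0.028}))
print(top_category({"DevOps": 0.40, "Business": 0.35, "Programming": 0.25}))
```

A suitable threshold depends on the model and the cost of a wrong label, so it is best chosen from dev-set results.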
For detailed training guide: See references/text-classification.md
Troubleshooting
Model Not Found (E050)
OSError: [E050] Can't find model 'en_core_web_sm'
Fix:
python -m spacy download en_core_web_sm
Alternative (avoids path issues):
import en_core_web_sm
nlp = en_core_web_sm.load()
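To fail with a clearer message than E050, a script can check whether the model package is installed before loading it. A small sketch using spacy.util.is_package; the helper name load_installed_model is hypothetical:

```python
import spacy
from spacy.util import is_package


def load_installed_model(name):
    """Return the loaded pipeline, or None if the package is not installed."""
    if not is_package(name):
        print(f"Model '{name}' is not installed. Try: python -m spacy download {name}")
        return None
    return spacy.load(name)


nlp = load_installed_model("en_core_web_sm")
```

Note that is_package only covers models installed as Python packages; a model stored in a directory is loaded by path and needs a separate existence check.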
Memory Issues
Symptoms: OOM errors, slow processing
Fixes:
# 1. Disable unused components
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])
# 2. Process in chunks
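for chunk in chunk_text(large_text, max_length=100000):  # chunk_text is a user-defined helper
    doc = nlp(chunk)
# 3. Use memory zones (spaCy 3.8+)
# Docs created inside the zone must not be used after the block exits
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)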
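The chunk_text helper used in fix 2 is not part of spaCy. One possible implementation, given here as a sketch, splits on whitespace where it can so that words are not cut in half, and falls back to a hard cut when a window contains no whitespace:

```python
def chunk_text(text, max_length=100_000):
    """Yield pieces of text of at most max_length characters.

    Splits after the last space or newline inside each window when one
    exists; otherwise cuts at max_length.
    """
    start = 0
    total = len(text)
    while start < total:
        end = min(start + max_length, total)
        if end < total:
            # Look for the last whitespace character inside the window.
            split = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split > start:
                end = split + 1  # keep the whitespace with the left piece
        yield text[start:end]
        start = end


pieces = list(chunk_text("aaa bbb ccc", max_length=5))
print(pieces)
```

Splitting on whitespace can still break a sentence in two, which may affect the parser and NER near chunk boundaries; splitting on paragraph breaks is safer when the input has them.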
GPU Not Working
import spacy
# Must call BEFORE loading model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")
nlp = spacy.load("en_core_web_trf")  # Loads on GPU if one was found, otherwise on CPU
Version Compatibility
spaCy 2.x models do not work with spaCy 3.x. Check compatibility:
python -m spacy validate
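The same check can be made from Python for a single pipeline by comparing the installed spaCy version against the version range recorded in the pipeline's metadata. A sketch, assuming the pipeline's meta contains a spacy_version entry (trained pipelines record one); the helper name check_pipeline_version is hypothetical, and a blank pipeline is used only so the example runs without a model download:

```python
import spacy
from spacy.util import is_compatible_version


def check_pipeline_version(nlp):
    """Return (required_range, compatible) for a loaded pipeline."""
    required = nlp.meta.get("spacy_version", "")
    if not required:
        return required, None  # no version information recorded
    return required, is_compatible_version(spacy.__version__, required)


nlp = spacy.blank("en")  # stand-in for spacy.load("en_core_web_sm")
required, compatible = check_pipeline_version(nlp)
print(f"spaCy {spacy.__version__}, pipeline requires {required!r}: {compatible}")
```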
For more troubleshooting: See references/troubleshooting.md
Production Deployment
Package Model
python -m spacy package ./output/model-best ./packages \
--name my_classifier \
--version 1.0.0
pip install ./packages/en_my_classifier-1.0.0/
FastAPI Server
Use the production template:
python scripts/serve_model.py --model ./output/model-best --port 8000
Or customize from template:
from fastapi import FastAPI
import spacy
app = FastAPI()
nlp = spacy.load("en_my_classifier")
@app.post("/classify")
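async def classify(text: str):  # a bare str parameter is read from the query string
    # memory_zone requires spaCy 3.8+; copy results out before the zone exits,
    # because the Doc must not be used afterwards
    with nlp.memory_zone():
        doc = nlp(text)
        scores = dict(doc.cats)
    return {
        "category": max(scores, key=scores.get),
        "scores": scores,
    }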
Performance Optimization
| Technique | Speedup | When to Use |
|-----------|---------|-------------|
| Disable components | 2-3x | Don't need all annotations |
| nlp.pipe() | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
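The speedups in the table depend heavily on the model, text length, and hardware, so it is worth measuring on your own data. A minimal timing sketch, assuming a blank pipeline so it runs without a model download (with only a tokenizer the gap between the two approaches will be small); the time_it helper and the sample texts are illustrative:

```python
import time

import spacy


def time_it(fn):
    """Return the wall-clock seconds taken by a zero-argument callable."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# Assumption: swap in spacy.load("en_core_web_sm") and real texts to get
# numbers that reflect your workload.
nlp = spacy.blank("en")
texts = ["Deploy the application to the Kubernetes cluster."] * 500

loop_s = time_it(lambda: [nlp(text) for text in texts])
pipe_s = time_it(lambda: list(nlp.pipe(texts, batch_size=50)))

print(f"loop: {loop_s:.3f}s  pipe: {pipe_s:.3f}s")
```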
For evaluation metrics and hyperparameter tuning: See references/production.md
Scripts Reference
| Script | Purpose | Usage |
|--------|---------|-------|
| prepare_training_data.py | Convert JSON to DocBin | python scripts/prepare_training_data.py --input data.json |
| generate_config.py | Create training config | python scripts/generate_config.py --categories "A,B,C" |
| evaluate_model.py | Detailed metrics | python scripts/evaluate_model.py --model ./output/model-best |
| serve_model.py | FastAPI server | python scripts/serve_model.py --model ./model --port 8000 |
Assets Reference
| Asset | Purpose | Usage |
|-------|---------|-------|
| config_textcat.cfg | Base training config | Copy and customize for your labels |
| training_data_template.json | Data format example | Reference for preparing your data |