spaCy NLP Skill | Agent Skills

spaCy NLP

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.

Quick Start
Installation
Text Processing
Training Classifiers
Troubleshooting
Production Deployment

Scope

In Scope:

spaCy 3.x installation and text processing
TextCategorizer training for document classification
Production deployment and optimization patterns

Out of Scope (use other tools/skills):

Training custom NER models (different workflow)
spaCy 2.x (deprecated, incompatible with 3.x)
Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
Custom tokenizers or language models

Quick Start

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)

Installation

Standard Setup

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

Model Selection

| Model | Size | Speed | Use Case |
|-------|------|-------|----------|
| en_core_web_sm | 12 MB | Fastest | Prototyping, speed-critical |
| en_core_web_md | 40 MB | Fast | General use with word vectors |
| en_core_web_lg | 560 MB | Fast | Semantic similarity tasks |
| en_core_web_trf | 438 MB | Slow | Maximum accuracy (GPU) |

Verify Installation

import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")

For detailed installation options (conda, GPU, transformers): See references/installation.md

Text Processing

Basic Pipeline

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")

Named Entity Recognition

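# Use a text that actually mentions entities (the "bats" sentence has none)
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple Inc." ORG, "Steve Jobs" PERSON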

For entity types, filtering, and span details: See references/basic-usage.md

Batch Processing (Critical for Production)

# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this

# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)

# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
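
In production, each text usually comes with metadata (an ID, a source) that has to stay attached to its result. A minimal sketch of doing that with nlp.pipe and as_tuples=True follows. The records and their "id" field are hypothetical, and a blank English pipeline stands in for a trained model so the example runs without a download:

```python
import spacy

# A blank English pipeline (tokenizer only) keeps the sketch runnable without
# a model download; swap in spacy.load("en_core_web_sm") for real annotations.
nlp = spacy.blank("en")

# Hypothetical records: (text, context) pairs, where context carries an ID.
records = [
    ("Quarterly revenue exceeded expectations", {"id": 1}),
    ("Fixed null pointer exception in parser", {"id": 2}),
    ("Kubernetes deployment manifest updated", {"id": 3}),
]

# as_tuples=True makes nlp.pipe yield (doc, context) pairs in input order.
token_counts = {}
for doc, context in nlp.pipe(records, as_tuples=True, batch_size=50):
    token_counts[context["id"]] = len(doc)

print(token_counts)
```

Because results come back in input order with their context, there is no need to zip docs against a separate list of IDs.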

Disable Unused Components

# Only need NER - disable the rest (roughly 2x faster; measure on your own data)
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
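
Disabling at load time applies for the life of the pipeline. When a component is only unneeded for part of a job, nlp.select_pipes can switch it off temporarily. A minimal sketch follows. It uses a blank pipeline with a sentencizer as a stand-in (an assumption, chosen so it runs without a model download):

```python
import spacy

# Stand-in pipeline: tokenizer plus a rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Inside the block the component is skipped; it is restored on exit.
with nlp.select_pipes(disable=["sentencizer"]):
    inside = list(nlp.pipe_names)

after = list(nlp.pipe_names)

print("inside:", inside)
print("after:", after)
```

With a full model, the same pattern would take names such as "parser" or "lemmatizer" in the disable list.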

For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md

Training Classifiers

Train custom text classifiers with TextCategorizer.

Workflow Overview

1. Prepare data → Run scripts/prepare_training_data.py
2. Generate config → Run scripts/generate_config.py or use assets/config_textcat.cfg
3. Validate → python -m spacy debug data config.cfg (catches issues before training)
4. Train → python -m spacy train config.cfg --output ./output
5. Evaluate → Run scripts/evaluate_model.py
6. Use → nlp = spacy.load("./output/model-best")

Data Format

spaCy trains from binary DocBin (.spacy) files. The conversion script takes JSON input in this shape:

[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]

Convert with script:

python scripts/prepare_training_data.py \
  --input data.json \
  --output-train train.spacy \
  --output-dev dev.spacy \
  --split 0.8
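
The contents of scripts/prepare_training_data.py are not shown here. The sketch below only illustrates what a JSON-to-DocBin conversion for single-label text classification could look like. The function names records_to_docbin and split_records are hypothetical, and the one-hot doc.cats encoding assumes exclusive classes:

```python
import json
import random

import spacy
from spacy.tokens import DocBin


def records_to_docbin(records, labels, nlp):
    """Build a DocBin whose docs carry one-hot doc.cats (hypothetical helper)."""
    doc_bin = DocBin()
    for record in records:
        doc = nlp.make_doc(record["text"])  # tokenize only, no pipeline run
        doc.cats = {label: float(label == record["label"]) for label in labels}
        doc_bin.add(doc)
    return doc_bin


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle deterministically, then split into (train, dev)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = json.loads("""[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]""")

labels = sorted({record["label"] for record in records})
nlp = spacy.blank("en")

train_records, dev_records = split_records(records, train_fraction=0.8)
train_bin = records_to_docbin(train_records, labels, nlp)
dev_bin = records_to_docbin(dev_records, labels, nlp)

# To write the files the training command expects:
# train_bin.to_disk("train.spacy")
# dev_bin.to_disk("dev.spacy")
print(len(train_bin), len(dev_bin))
```

Every doc lists every label, with 1.0 for the true class and 0.0 for the rest, which is the form an exclusive-class text categorizer expects.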

Training Command

# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use template
cp assets/config_textcat.cfg config.cfg

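# Train (append --paths.train train.spacy --paths.dev dev.spacy
# if config.cfg does not already set them under [paths])
python -m spacy train config.cfg --output ./output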

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0

Using Trained Model

nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")
predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # e.g. DevOps: 94.2%
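
Taking the top score always returns a label, even when the model is unsure. One common refinement is a confidence threshold with a fallback label. The sketch below works on any doc.cats-style dictionary. The helper name pick_label, the 0.5 threshold, and the "unknown" fallback are illustrative assumptions, not part of spaCy:

```python
def pick_label(cats, threshold=0.5, fallback="unknown"):
    """Return (label, score) for the top category, or the fallback label
    when the top score is below the threshold."""
    if not cats:
        return fallback, 0.0
    label = max(cats, key=cats.get)
    score = cats[label]
    if score < threshold:
        return fallback, score
    return label, score


# Hypothetical scores in the same shape as doc.cats
confident = {"Business": 0.03, "Programming": 0.05, "DevOps": 0.92}
uncertain = {"Business": 0.40, "Programming": 0.35, "DevOps": 0.25}

print(pick_label(confident))
print(pick_label(uncertain))
```

A suitable threshold depends on the model and the cost of a wrong label, so it is worth tuning against the dev set rather than fixing it at 0.5.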

For detailed training guide: See references/text-classification.md

Troubleshooting

Model Not Found (E050)

OSError: [E050] Can't find model 'en_core_web_sm'

Fix:

python -m spacy download en_core_web_sm

Alternative (avoids path issues):

import en_core_web_sm
nlp = en_core_web_sm.load()
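
Pip-installed spaCy models are ordinary Python packages, so their presence can be checked before loading to give a clearer message than E050. A minimal sketch using only the standard library follows. The helper name model_installed is hypothetical, and the check does not cover models loaded from a directory path:

```python
import importlib.util


def model_installed(package_name):
    """True if a top-level package with this name is importable."""
    return importlib.util.find_spec(package_name) is not None


name = "en_core_web_sm"
if model_installed(name):
    print(f"{name} is installed")
else:
    print(f"{name} is missing. Run: python -m spacy download {name}")
```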

Memory Issues

Symptoms: OOM errors, slow processing

Fixes:

# 1. Exclude unused components (exclude= skips loading them entirely,
#    whereas disable= loads them but does not run them)
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks (chunk_text is a user-defined helper, not part of spaCy)
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)

# 3. Use memory zones (spaCy 3.8+); docs created inside the zone
#    must not be used after the block exits
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
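
The chunk_text helper used in fix 2 is not part of spaCy. One possible implementation is sketched below. It assumes paragraphs are separated by blank lines and falls back to a hard character split for any single paragraph longer than max_length, which may cut mid-word:

```python
def chunk_text(text, max_length=100_000):
    """Yield pieces of text no longer than max_length characters,
    breaking at blank-line paragraph boundaries where possible."""
    current = ""
    for paragraph in text.split("\n\n"):
        # A single oversized paragraph is hard-split at max_length characters.
        while len(paragraph) > max_length:
            if current:
                yield current
                current = ""
            yield paragraph[:max_length]
            paragraph = paragraph[max_length:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_length:
            # Adding this paragraph would overflow: emit what we have so far.
            yield current
            current = paragraph
        else:
            current = candidate
    if current:
        yield current


chunks = list(chunk_text("aaaa\n\nbbbb\n\ncccc", max_length=10))
print(chunks)
```

spaCy refuses texts longer than nlp.max_length (1,000,000 characters by default), so chunking well below that limit also avoids that error.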

GPU Not Working

import spacy

# Must call BEFORE loading model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Loads on GPU if prefer_gpu() returned True

Version Compatibility

spaCy 2.x models do not work with spaCy 3.x. Check compatibility:

python -m spacy validate

For more troubleshooting: See references/troubleshooting.md

Production Deployment

Package Model

mkdir -p ./packages  # the output directory must already exist
python -m spacy package ./output/model-best ./packages \
  --name my_classifier \
  --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/

FastAPI Server

Use the production template:

python scripts/serve_model.py --model ./output/model-best --port 8000

Or customize from template:

from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")

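@app.post("/classify")
async def classify(text: str):
    # Note: a bare `text: str` parameter is read from the query string;
    # declare a Pydantic model instead to accept the text in a JSON body.
    # memory_zone() requires spaCy 3.8+; copy results out before it exits.
    with nlp.memory_zone():
        doc = nlp(text)
        scores = dict(doc.cats)
    return {
        "category": max(scores, key=scores.get),
        "scores": scores
    }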

Performance Optimization

| Technique | Speedup | When to Use |
|-----------|---------|-------------|
| Disable components | 2-3x | Don't need all annotations |
| nlp.pipe() | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
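
The speedups in this table vary with model, hardware, and text length, so it helps to measure on your own data. A minimal timing sketch follows. The helper name texts_per_second is hypothetical, and a plain string split stands in for the pipeline so the example runs without spaCy:

```python
import time


def texts_per_second(process, texts, repeats=3):
    """Run process(texts) several times and return the best throughput."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(texts)
        best = min(best, time.perf_counter() - start)
    return len(texts) / best if best > 0 else float("inf")


# Stand-in workload. With a loaded pipeline, compare for example:
#   lambda ts: [nlp(t) for t in ts]
#   lambda ts: list(nlp.pipe(ts, batch_size=50))
texts = ["Quarterly revenue exceeded expectations"] * 1000
rate = texts_per_second(lambda ts: [t.split() for t in ts], texts)
print(f"{rate:,.0f} texts/sec")
```

Taking the best of several runs reduces noise from warm-up and other processes on the machine.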

For evaluation metrics and hyperparameter tuning: See references/production.md

Scripts Reference

| Script | Purpose | Usage |
|--------|---------|-------|
| prepare_training_data.py | Convert JSON to DocBin | python scripts/prepare_training_data.py --input data.json |
| generate_config.py | Create training config | python scripts/generate_config.py --categories "A,B,C" |
| evaluate_model.py | Detailed metrics | python scripts/evaluate_model.py --model ./output/model-best |
| serve_model.py | FastAPI server | python scripts/serve_model.py --model ./model --port 8000 |

Assets Reference

| Asset | Purpose | Usage |
|-------|---------|-------|
| config_textcat.cfg | Base training config | Copy and customize for your labels |
| training_data_template.json | Data format example | Reference for preparing your data |
