sentencepiece
Language-independent tokenizer that treats text as a raw Unicode stream. Supports the BPE and Unigram algorithms. Fast (about 50k sentences/sec), lightweight (about 6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, and mBART. Trains directly on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
text-summarizer
Generate extractive summaries from long text documents. Control summary length, extract key sentences, and process multiple documents.
named-entity-extractor
Extract named entities (people, organizations, locations, dates) from text using NLP. Use for document analysis, information extraction, or data enrichment.
language-detector
Detect the language of text with confidence scores. Supports 50+ languages and batch text classification.
nlp-basics
Process and analyze text using modern NLP techniques: preprocessing, embeddings, and transformers.