Document Rag Pipeline
Overview
This skill creates a complete Retrieval-Augmented Generation (RAG) system from a folder of documents. It handles:
- Regular PDF text extraction
- OCR for scanned/image-based PDFs
- DRM-protected file detection
- Text chunking with overlap
- Vector embedding generation
- SQLite storage with full-text search
- Semantic similarity search
Quick Start
# Install dependencies
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
# Build knowledge base
python build_knowledge_base.py /path/to/documents --embed
# Search documents
python build_knowledge_base.py /path/to/documents --search "your query"
When to Use
- Building searchable knowledge bases from document folders
- Processing technical standards libraries (API, ISO, ASME, etc.)
- Creating semantic search over engineering documents
- OCR processing of scanned historical documents
- Any collection of PDFs needing intelligent search
Prerequisites
System Dependencies
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng poppler-utils
# macOS
brew install tesseract poppler
# Verify Tesseract
tesseract --version # Should show 5.x
Python Dependencies
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
Or with UV:
uv pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
Related Skills
pdf-text-extractor- Just text extractionsemantic-search-setup- Just embeddings/searchrag-system-builder- Add LLM Q&A layerknowledge-base-builder- Simpler document catalog
Version History
- 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
- 1.0.0 (2024-10-15): Initial release with OCR support, chunking, vector embeddings, semantic search