Agent Skills: Document RAG Pipeline

Build complete document knowledge bases with PDF text extraction, OCR

Uncategorized
ID: vamseeachanta/workspace-hub/document-rag-pipeline

Install this agent skill locally:

pnpm dlx add-skill https://github.com/vamseeachanta/workspace-hub/tree/HEAD/.claude/skills/data/documents/document-rag-pipeline

.claude/skills/data/documents/document-rag-pipeline/SKILL.md

Skill Metadata

Name: document-rag-pipeline
Description: Build complete document knowledge bases with PDF text extraction, OCR

Document RAG Pipeline

Overview

This skill creates a complete Retrieval-Augmented Generation (RAG) system from a folder of documents. It handles:

  • Regular PDF text extraction
  • OCR for scanned/image-based PDFs
  • DRM-protected file detection
  • Text chunking with overlap
  • Vector embedding generation
  • SQLite storage with full-text search
  • Semantic similarity search
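
To make the "text chunking with overlap" step concrete, here is a minimal sketch. It assumes fixed-size character windows; the function name `chunk_text` and the defaults `chunk_size=1000` and `overlap=200` are hypothetical and are not taken from `build_knowledge_base.py`, which may chunk by words, sentences, or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows; consecutive windows share `overlap` characters.

    Hypothetical helper for illustration only, not the skill's actual chunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.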

Quick Start

# Install dependencies
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm

# Build knowledge base
python build_knowledge_base.py /path/to/documents --embed

# Search documents
python build_knowledge_base.py /path/to/documents --search "your query"
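
As a rough illustration of what a semantic search step generally does, the sketch below ranks chunk vectors by cosine similarity to a query vector. This is an assumption about the general technique, not a description of how `--search` is implemented. The names `cosine_similarity` and `rank_chunks` are hypothetical, and the two-dimensional vectors are toy placeholders; in a real pipeline the vectors would typically come from a sentence-transformers model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return (chunk_index, score) pairs for the top_k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy embeddings, purely illustrative.
    chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    print(rank_chunks([1.0, 0.0], chunks, top_k=2))
```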

When to Use

  • Building searchable knowledge bases from document folders
  • Processing technical standards libraries (API, ISO, ASME, etc.)
  • Creating semantic search over engineering documents
  • OCR processing of scanned historical documents
  • Any collection of PDFs needing intelligent search

Prerequisites

System Dependencies

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng poppler-utils

# macOS
brew install tesseract poppler

# Verify Tesseract
tesseract --version  # Should show 5.x
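
The same check can be run from Python before starting a long OCR job. This is a minimal sketch using only the standard library; the helper name `tesseract_version` is hypothetical and is not part of the skill.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if Tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "Tesseract not found; install it before running OCR.")
```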

Python Dependencies

pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm

Or with UV:

uv pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
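
Several of these packages are imported under a different name than the one used to install them (PyMuPDF as `fitz`, Pillow as `PIL`, sentence-transformers as `sentence_transformers`). The sketch below checks that each one is importable. The helper `missing_packages` and the `IMPORT_NAMES` table are hypothetical conveniences, not part of the skill.

```python
import importlib.util

# pip package name -> top-level module name used in `import` statements
IMPORT_NAMES = {
    "PyMuPDF": "fitz",
    "pytesseract": "pytesseract",
    "Pillow": "PIL",
    "sentence-transformers": "sentence_transformers",
    "numpy": "numpy",
    "tqdm": "tqdm",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the pip names of packages whose module cannot be found in this environment."""
    return [
        package
        for package, module in import_names.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All Python dependencies are importable.")
```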

Related Skills

  • pdf-text-extractor - Just text extraction
  • semantic-search-setup - Just embeddings/search
  • rag-system-builder - Add LLM Q&A layer
  • knowledge-base-builder - Simpler document catalog

Version History

  • 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
  • 1.0.0 (2024-10-15): Initial release with OCR support, chunking, vector embeddings, semantic search

Sub-Skills
