Agent Skills: Pdf

Comprehensive PDF manipulation toolkit. For batch/bulk extraction (1K+ PDFs),

UncategorizedID: vamseeachanta/workspace-hub/pdf

Install this agent skill to your local

pnpm dlx add-skill https://github.com/vamseeachanta/workspace-hub/tree/HEAD/.claude/skills/data/documents/pdf

Skill Files

Browse the full folder contents for pdf.

Download Skill

Loading file tree…

.claude/skills/data/documents/pdf/SKILL.md

Skill Metadata

Name
pdf
Description
Comprehensive PDF manipulation toolkit. For batch/bulk extraction (1K+ PDFs),

Pdf

Overview

This skill enables comprehensive PDF operations through Python libraries and command-line tools. Use it for reading, creating, modifying, and analyzing PDF documents.

Quick Start

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)

Tool Selection (WRK-1277 + WRK-1302 + WRK-1303 Learnings)

Scenario → Tool Mapping

| Scenario | Tool | Why | |----------|------|-----| | Batch extraction (1K+ PDFs) | pdftotext (poppler) via subprocess | Proven at 297K scale; reliable timeout via SIGTERM; subprocess isolation | | Single-doc understanding | OpenAI Codex PDF→Markdown | Best quality; too expensive for bulk | | Single-doc text extraction | PyMuPDF (fitz) | Fast, good API, in-process | | Readability classification | pypdfium2 | Replaces pdfplumber for page sampling; no D-state hangs; Apache-2.0 license | | Table extraction | pdfplumber (single doc only) | Best table detection; DO NOT use in multiprocessing pools | | Structured markdown (tables+equations) | Docling (targeted use only) | MIT license; 1731 table rows from 6 docs; ~310s/doc on CPU | | LLM/RAG markdown | pymupdf4llm (monitor only) | 0.12s/doc, good markdown; AGPL license blocks adoption |

Quality & Completeness Index (measured on dev-primary)

Scores: text completeness (% of content captured vs best-in-class), structure preservation, and batch viability. Based on WRK-1302 (243 PDFs) and WRK-1303 (6 PDFs).

| Tool | Text Completeness | Structure | Tables | Equations | Speed | Batch Safe | License | |------|:-:|:-:|:-:|:-:|:-:|:-:|:-:| | pdftotext (baseline) | 100% | none | none | unicode only | 0.02s | yes | GPL-2 | | pypdfium2 | 86% | none | none | unicode only | 0.02s | no (thread-unsafe) | Apache-2.0 | | pdfplumber | ~95% | partial | good (69-93%) | none | 0.10s | no (D-state) | MIT | | Docling | 117% | full (md) | good (md rows) | unicode+context | 310s | no (CPU bound) | MIT | | PyMuPDF (fitz) | ~98% | partial | basic | unicode only | 0.01s | yes | AGPL | | pymupdf4llm | ~100% | full (md) | good (md) | unicode+context | 0.12s | untested | AGPL | | Codex API | ~100% | full (md) | excellent | LaTeX | ~2s | no (API cost) | proprietary |

Column definitions:

  • Text Completeness: chars extracted vs pdftotext baseline (WRK-1302: 243 PDFs, WRK-1303: 6 PDFs)
  • Structure: none = raw text | partial = some layout | full = headings, lists, sections
  • Tables: none | basic = cell text only | good = rows+cols preserved | excellent = multi-span
  • Equations: unicode only = captures symbols | unicode+context = in structured output | LaTeX = formula markup
  • Batch Safe: can run in ProcessPoolExecutor on NFS/NTFS without hangs or crashes

WARNING: pdfplumber hangs in kernel D-state (disk sleep) on NTFS and NFS mounts. SIGALRM cannot interrupt kernel I/O. Use pdftotext via subprocess.run(timeout=N) for any batch/parallel work — the subprocess can be killed reliably on timeout.

Benchmarks: scripts/data/doc_intelligence/benchmark_pdf_tools.py (WRK-1302), scripts/data/doc_intelligence/benchmark_docling.py (WRK-1303)

When to Use

  • Batch PDF processing - Use pdftotext (poppler) via subprocess for bulk extraction
  • Converting PDFs to Markdown - Use OpenAI Codex for intelligent conversion (single docs)
  • Extracting text and metadata from PDF files
  • Merging multiple PDFs into a single document
  • Splitting large PDFs into individual pages
  • Adding watermarks or annotations to PDFs
  • Password-protecting or decrypting PDFs
  • Extracting images from PDF documents
  • OCR processing for scanned documents
  • Creating new PDFs with reportlab
  • Extracting tables from structured PDFs

Version History

  • 1.2.2 (2026-01-04): Fixed P2 issue - added parents=True to all mkdir() calls to handle nested output paths; prevents FileNotFoundError when creating directories with non-existent parent paths
  • 1.2.1 (2026-01-04): Fixed CLI tool missing imports - added complete standalone script with all required imports (openai, pypdf, logging) and function definitions; resolved P1 issue from Codex review
  • 1.2.0 (2026-01-04): MAJOR UPDATE - Added OpenAI Codex integration for PDF-to-Markdown conversion as recommended first step for all PDF processing; includes batch conversion, chunking for large files, cost-effective options, and complete CLI tool
  • 1.1.0 (2026-01-02): Added Quick Start, When to Use, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
  • 1.0.0 (2024-10-15): Initial release with pypdf, pdfplumber, reportlab, CLI tools

Sub-Skills