PDF Utilities Skill
Overview
This skill provides comprehensive PDF processing capabilities including reading PDFs with multiple library backends (tabula, camelot, PyPDF2), extracting tables to DataFrames, editing/extracting page ranges, handling fillable forms, and managing PDF comments. All operations are driven by YAML configuration.
Key Components
ReadPDF Class (read_pdf.py)
Multi-backend PDF reading with table extraction:
read_pdf(cfg, file_index)- Route to appropriate backend based on configfrom_pdf_tabula(cfg, file_index)- Extract tables using tabula-pyfrom_pdf_camelot(cfg, file_index)- Extract tables using camelotfrom_pdf_PyPDF2(cfg, file_index)- Read PDF pages using PyPDF2
EditPDF Class (edit_pdf.py)
PDF page manipulation and extraction:
edit_pdf(cfg, file_index)- Process PDF files based on configurationfrom_pdf_PyPDF2(cfg, file_index)- Extract page ranges to new PDF filesprocess_cfg_files(cfg)- Process multiple PDF files from config
Additional Modules
fillable_pdf.py- Handle fillable PDF forms (fill fields, extract data)pdf_comments.py- Add, read, and manipulate PDF annotationspdf_reports.py- Generate PDF reports from data
Usage Patterns
Table Extraction Configuration
pdf:
io: pdf_read
reader: tabula # or camelot, PyPDF2
files:
- path: "input.pdf"
pages: [1, 2, 3]
area: [0, 0, 100, 100] # Optional: specific region
Page Extraction Configuration
pdf:
io: pdf_edit
files:
- path: "source.pdf"
output: "extracted_pages.pdf"
page_start: 1
page_end: 5
Common Workflows
- Table Extraction: PDF → tabula/camelot → DataFrame → CSV/Excel
- Page Extraction: Multi-page PDF → Extract range → New PDF
- Form Processing: Fillable PDF → Fill fields → Save completed form
- Report Generation: DataFrame → Generate styled PDF report
Module Location
- Read:
src/assetutilities/modules/pdf_utilities/read_pdf.py - Edit:
src/assetutilities/modules/pdf_utilities/edit_pdf.py - Forms:
src/assetutilities/modules/pdf_utilities/fillable_pdf.py - Comments:
src/assetutilities/modules/pdf_utilities/pdf_comments.py - Reports:
src/assetutilities/modules/pdf_utilities/pdf_reports.py
Dependencies
- PyPDF2 (PDF reading and manipulation)
- tabula-py (table extraction with Java backend)
- camelot-py (table extraction)
- reportlab (PDF generation, optional)