Document Processing Guide Skill

Work with office documents: PDF, Excel, Word, and PowerPoint.

Format Overview

| Format | Extension | Structure | Best For |
|--------|-----------|-----------|----------|
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |

Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
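
A minimal sketch of that idea, using only Python's standard `zipfile` module. The helper names `list_parts` and `read_part` are hypothetical, chosen for illustration:

```python
import zipfile


def list_parts(path):
    """Return the names of the parts stored inside an XLSX, DOCX, or PPTX file."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_part(path, part_name):
    """Return one part (for example "word/document.xml") as decoded text."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part_name).decode("utf-8")
```

Calling `list_parts` on a DOCX file should show entries such as `word/document.xml`, which `read_part` can then return as raw XML.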

PDF Processing

PDF Tools

| Task | Best Tool |
|------|-----------|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |

Common Operations

| Operation | Approach |
|-----------|----------|
| Merge | Loop through files, add pages to writer |
| Split | Create new writer per page |
| Extract tables | Use pdfplumber, convert to DataFrame |
| Rotate | Call .rotate(degrees) on page |
| Encrypt | Use writer's .encrypt() method |
| OCR | Convert to images, run pytesseract |

Excel Processing

Excel Tools

| Task | Best Tool |
|------|-----------|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |

Critical Rule: Use Formulas

| Approach | Result |
|----------|--------|
| Wrong: Calculate in Python, write value | Static number, breaks when data changes |
| Right: Write Excel formula | Dynamic, recalculates automatically |
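
As a small illustration of the rule, the sketch below writes a `SUM` formula with openpyxl instead of a total computed in Python. It assumes openpyxl is installed; the function name `write_totals` and the column layout are hypothetical:

```python
from openpyxl import Workbook


def write_totals(path, amounts):
    """Write amounts to column A and a live SUM formula in the row below them."""
    workbook = Workbook()
    sheet = workbook.active

    # Input values that a user may later change in Excel.
    for row, amount in enumerate(amounts, start=1):
        sheet.cell(row=row, column=1, value=amount)

    # Wrong: sheet.cell(..., value=sum(amounts)) would store a static number.
    # Right: store the formula text so the spreadsheet application evaluates it.
    total_row = len(amounts) + 1
    sheet.cell(row=total_row, column=1, value=f"=SUM(A1:A{len(amounts)})")

    workbook.save(path)
    return sheet.cell(row=total_row, column=1).coordinate
```

openpyxl stores the formula string without evaluating it, so the computed total only appears once the file has been opened and recalculated in a spreadsheet application.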

Financial Model Standards

| Convention | Meaning |
|------------|---------|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
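
One possible way to apply these conventions with openpyxl styles is sketched below. The exact hex colors, the sheet names, and the function `build_styled_model` are assumptions; the guide names the colors but not their values:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

# Assumed hex values for the named colors.
BLUE_INPUT = Font(color="0000FF")     # hardcoded inputs
BLACK_FORMULA = Font(color="000000")  # formulas
GREEN_LINK = Font(color="008000")     # links to other sheets
YELLOW_FLAG = PatternFill(fill_type="solid", start_color="FFFF00", end_color="FFFF00")


def build_styled_model():
    """Return a workbook whose cells follow the color conventions above."""
    workbook = Workbook()
    model = workbook.active
    model.title = "Model"
    inputs = workbook.create_sheet("Inputs")
    inputs["A1"] = 1000

    model["A1"] = 0.05            # hardcoded input
    model["A1"].font = BLUE_INPUT
    model["A2"] = "=A1*2"         # formula on the same sheet
    model["A2"].font = BLACK_FORMULA
    model["A3"] = "=Inputs!A1"    # link to another sheet
    model["A3"].font = GREEN_LINK
    model["A4"].fill = YELLOW_FLAG  # placeholder that needs attention
    return workbook
```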

Common Formula Errors

| Error | Cause |
|-------|-------|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
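
A rough sketch of how a worksheet could be scanned for these errors with openpyxl. The helper `find_formula_errors` is hypothetical, and it only sees error values when the workbook was loaded with `data_only=True` after a spreadsheet application has calculated and saved the file:

```python
from openpyxl import load_workbook

ERROR_CAUSES = {
    "#REF!": "Invalid cell reference",
    "#DIV/0!": "Division by zero",
    "#VALUE!": "Wrong data type",
    "#NAME?": "Unknown function name",
}


def find_formula_errors(sheet):
    """Return (coordinate, error, cause) for each cell holding a listed error."""
    found = []
    for row in sheet.iter_rows():
        for cell in row:
            value = cell.value
            if isinstance(value, str) and value in ERROR_CAUSES:
                found.append((cell.coordinate, value, ERROR_CAUSES[value]))
    return found


def scan_file(path):
    """Load cached values (not formulas) and scan the active sheet."""
    workbook = load_workbook(path, data_only=True)
    return find_formula_errors(workbook.active)
```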

Word Processing

Word Tools

| Task | Best Tool |
|------|-----------|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |

Document Structure

| File | Contains |
|------|----------|
| word/document.xml | Main content |
| word/comments.xml | Comments |
| word/media/ | Images |
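
To illustrate, the main content can be read straight from `word/document.xml` with the standard library. This is a simplified sketch: the function name `docx_paragraphs` is hypothetical, and it collects only `<w:t>` text, so tracked deletions (`<w:delText>`) are skipped:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in DOCX files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the text of each <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```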

Tracked Changes (Redlining)

| Element | XML Tag |
|---------|---------|
| Deletion | `<w:del><w:r><w:delText>...</w:delText></w:r></w:del>` |
| Insertion | `<w:ins><w:r><w:t>...</w:t></w:r></w:ins>` |

Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
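
A hedged sketch of what a tracked replacement might look like when generated from Python. The function `tracked_replacement` and its parameters are hypothetical. In WordprocessingML the text elements sit inside a run (`<w:r>`), and each `w:id` should be unique within the document:

```python
from xml.sax.saxutils import escape, quoteattr


def tracked_replacement(old_text, new_text, author, date, first_id=1):
    """Return XML that marks old_text as deleted and new_text as inserted.

    date is an ISO 8601 timestamp such as "2024-01-15T09:30:00Z".
    The result uses the w: prefix, so it must be placed inside a paragraph
    of a document that already declares that namespace.
    """
    attrs = f"w:author={quoteattr(author)} w:date={quoteattr(date)}"
    deletion = (
        f'<w:del w:id="{first_id}" {attrs}>'
        f'<w:r><w:delText xml:space="preserve">{escape(old_text)}</w:delText></w:r>'
        "</w:del>"
    )
    insertion = (
        f'<w:ins w:id="{first_id + 1}" {attrs}>'
        f'<w:r><w:t xml:space="preserve">{escape(new_text)}</w:t></w:r>'
        "</w:ins>"
    )
    return deletion + insertion
```

Because the original text stays in the file as a deletion, a reviewer can accept or reject the change in Word, which is the audit trail that direct text replacement loses.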

PowerPoint Processing

PowerPoint Tools

| Task | Best Tool |
|------|-----------|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |

Slide Structure

| Path | Contains |
|------|----------|
| ppt/slides/slide{N}.xml | Slide content |
| ppt/notesSlides/ | Speaker notes |
| ppt/slideMasters/ | Master templates |
| ppt/media/ | Images |
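
Given that layout, slide text can be pulled out with the standard library alone. The sketch below uses a hypothetical helper, `slide_texts`, and orders slides by the number in the file name, which may differ from the presentation order recorded in `ppt/presentation.xml`:

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; slide text lives in <a:t> elements.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_texts(path):
    """Return {slide_number: [text runs]} ordered by slide file number."""
    result = {}
    with zipfile.ZipFile(path) as archive:
        numbered = []
        for name in archive.namelist():
            match = SLIDE_PART.fullmatch(name)
            if match:
                numbered.append((int(match.group(1)), name))
        # Sort numerically so slide10.xml comes after slide2.xml.
        for number, name in sorted(numbered):
            root = ET.fromstring(archive.read(name))
            result[number] = [node.text or "" for node in root.iter(f"{{{A_NS}}}t")]
    return result
```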

Design Principles

| Principle | Guideline |
|-----------|-----------|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |

Converting Between Formats

| Conversion | Tool |
|------------|------|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Appropriate extractor |
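
The first three conversions could be driven from Python along these lines. This is a sketch under assumptions: the tools are installed and on `PATH`, the LibreOffice binary is named `soffice` (some systems use `libreoffice`), and the function names and the 150 dpi default are hypothetical:

```python
import subprocess


def to_pdf_command(source, out_dir):
    """LibreOffice headless: convert an office document to PDF in out_dir."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, source]


def pdf_to_png_command(pdf_path, prefix, dpi=150):
    """pdftoppm: render each page to a PNG file whose name starts with prefix."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, prefix]


def docx_to_markdown_command(docx_path, md_path):
    """pandoc: convert a DOCX file to Markdown."""
    return ["pandoc", docx_path, "-t", "markdown", "-o", md_path]


def run(command):
    """Run a command list, raising CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Building each command as a list and passing it to `subprocess.run` without a shell avoids quoting problems with file names that contain spaces.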

Best Practices

| Practice | Why |
|----------|-----|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |

Common Packages

| Language | Packages |
|----------|----------|
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |
