Document Processing Guide
Work with office documents: PDF, Excel, Word, and PowerPoint.
Format Overview
| Format | Extension | Structure | Best For |
|--------|-----------|-----------|----------|
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |
Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
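As a minimal sketch of that idea, the snippet below lists the parts inside an Office package using only the Python standard library. The helper name `list_parts` and the tiny in-memory archive are made up for illustration; with a real file you would pass its path instead.

```python
import io
import zipfile


def list_parts(source):
    """Return the sorted names of all parts inside a ZIP-based Office file.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


# Build a tiny stand-in archive in memory (illustrative, not a valid DOCX).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("word/document.xml", "<document/>")

parts = list_parts(buffer)
print(parts)
```

The same call works unchanged on .xlsx and .pptx files, since all three formats share the ZIP container.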
PDF Processing
PDF Tools
| Task | Best Tool |
|------|-----------|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |
Common Operations
| Operation | Approach |
|-----------|----------|
| Merge | Loop through files, add pages to writer |
| Split | Create new writer per page |
| Extract tables | Use pdfplumber, convert to DataFrame |
| Rotate | Call .rotate(degrees) on page (multiples of 90 only) |
| Encrypt | Use writer's .encrypt() method |
| OCR | Convert to images, run pytesseract |
Excel Processing
Excel Tools
| Task | Best Tool |
|------|-----------|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |
Critical Rule: Use Formulas
| Approach | Result |
|----------|--------|
| Wrong: Calculate in Python, write value | Static number, breaks when data changes |
| Right: Write Excel formula | Dynamic, recalculates automatically |
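A small sketch of the rule with openpyxl, assuming a made-up three-row budget: the total cell stores a formula string rather than a number computed in Python. Note that openpyxl only stores the formula; the spreadsheet application calculates the result when the file is opened.

```python
import io

from openpyxl import Workbook

workbook = Workbook()
sheet = workbook.active
sheet.append(["Item", "Amount"])
# Sample rows are illustrative only.
for item, amount in [("Rent", 1200), ("Utilities", 150), ("Internet", 60)]:
    sheet.append([item, amount])

last_data_row = sheet.max_row      # header + 3 data rows
total_row = last_data_row + 1

sheet.cell(row=total_row, column=1, value="Total")
# Right: write the formula so the total follows later edits to B2:B4.
sheet.cell(row=total_row, column=2, value=f"=SUM(B2:B{last_data_row})")
# Wrong (for comparison): value=1200 + 150 + 60 would freeze the number.

# Save to memory here; pass a file name such as "budget.xlsx" in practice.
buffer = io.BytesIO()
workbook.save(buffer)
print(sheet.cell(row=total_row, column=2).value)
```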
Financial Model Standards
| Convention | Meaning |
|------------|---------|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
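One way to apply these conventions with openpyxl is sketched below. The exact hex values are an assumption (the convention names colors, not RGB codes), and the sheet names and cell contents are invented for the example.

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

# Assumed RGB values for the named colors.
BLUE_INPUT = Font(color="0000FF")       # hardcoded inputs
BLACK_FORMULA = Font(color="000000")    # formulas
GREEN_LINK = Font(color="008000")       # links to other sheets
YELLOW_FLAG = PatternFill(fill_type="solid", start_color="FFFF00")

workbook = Workbook()
model = workbook.active
model.title = "Model"
inputs = workbook.create_sheet("Inputs")

inputs["A1"] = 0.05                     # hardcoded growth rate
inputs["A1"].font = BLUE_INPUT

model["A1"] = "=Inputs!A1"              # link to another sheet
model["A1"].font = GREEN_LINK

model["A2"] = "=A1*1000"                # formula on the same sheet
model["A2"].font = BLACK_FORMULA

model["A3"] = "Confirm growth rate"     # cell that needs attention
model["A3"].fill = YELLOW_FLAG
```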
Common Formula Errors
| Error | Cause |
|-------|-------|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
Word Processing
Word Tools
| Task | Best Tool |
|------|-----------|
| Text extraction | pandoc |
| Create new | python-docx or docx-js (npm: docx) |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |
Document Structure
| File | Contains |
|------|----------|
| word/document.xml | Main content |
| word/comments.xml | Comments |
| word/media/ | Images |
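As a rough illustration of reading `word/document.xml` directly, the sketch below pulls paragraph text out with the standard library. The function name `paragraph_texts` and the minimal sample document are made up for the example; a real .docx contains more parts and richer markup, and nested content such as text boxes may need extra handling.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace (the "w:" prefix).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(source):
    """Return the text of each <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text_nodes = paragraph.iter(f"{{{W_NS}}}t")
        texts.append("".join(node.text or "" for node in text_nodes))
    return texts


# Minimal stand-in package holding only the main document part.
sample_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", sample_xml)

print(paragraph_texts(buffer))
```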
Tracked Changes (Redlining)
| Element | XML Tag |
|---------|---------|
| Deletion | <w:del><w:r><w:delText>...</w:delText></w:r></w:del> |
| Insertion | <w:ins><w:r><w:t>...</w:t></w:r></w:ins> |
Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
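A hedged sketch of what building such markup might look like with the standard library follows. In WordprocessingML the changed text sits inside a run (`<w:r>`) within the `<w:ins>` or `<w:del>` wrapper, and each wrapper carries `w:id`, `w:author`, and `w:date` attributes. The helper name `tracked_change` and the sample author, date, and text are invented; splicing the elements into an existing paragraph is left out.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # serialize with the usual "w:" prefix


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def tracked_change(kind, text, author, date, change_id):
    """Build a <w:ins> or <w:del> element wrapping one run of text."""
    if kind not in ("ins", "del"):
        raise ValueError("kind must be 'ins' or 'del'")
    wrapper = ET.Element(
        w(kind),
        {w("id"): str(change_id), w("author"): author, w("date"): date},
    )
    run = ET.SubElement(wrapper, w("r"))
    # Deleted text uses <w:delText>; inserted text uses <w:t>.
    text_tag = "t" if kind == "ins" else "delText"
    ET.SubElement(run, w(text_tag)).text = text
    return wrapper


# Record "30 days" being replaced by "60 days" (sample values).
deletion = tracked_change("del", "30 days", "Editor", "2024-01-01T00:00:00Z", 1)
insertion = tracked_change("ins", "60 days", "Editor", "2024-01-01T00:00:00Z", 2)
print(ET.tostring(deletion, encoding="unicode"))
print(ET.tostring(insertion, encoding="unicode"))
```

Keeping the deletion next to the insertion, instead of overwriting the run, is what lets Word show the change as a redline that a reviewer can accept or reject.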
PowerPoint Processing
PowerPoint Tools
| Task | Best Tool |
|------|-----------|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |
Slide Structure
| Path | Contains |
|------|----------|
| ppt/slides/slide{N}.xml | Slide content |
| ppt/notesSlides/ | Speaker notes |
| ppt/slideMasters/ | Master templates |
| ppt/media/ | Images |
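To make the layout concrete, here is a small standard-library sketch that finds the slide parts in a .pptx and sorts them by number. The function name `slide_parts` and the stand-in archive are hypothetical.

```python
import io
import re
import zipfile

SLIDE_PATH = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def slide_parts(source):
    """Return slide part names ordered by their numeric suffix."""
    with zipfile.ZipFile(source) as package:
        names = package.namelist()
    numbered = []
    for name in names:
        match = SLIDE_PATH.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    # Sort numerically so slide10 comes after slide2, not before it.
    return [name for _, name in sorted(numbered)]


# Stand-in package with a few empty parts (not a valid presentation).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    for name in (
        "ppt/slides/slide10.xml",
        "ppt/slides/slide2.xml",
        "ppt/slides/slide1.xml",
        "ppt/notesSlides/notesSlide1.xml",
        "ppt/media/image1.png",
    ):
        package.writestr(name, "")

print(slide_parts(buffer))
```

File numbers do not necessarily match the on-screen slide order, which is defined in `ppt/presentation.xml`, so treat this ordering as a convenience only.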
Design Principles
| Principle | Guideline |
|-----------|-----------|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |
Converting Between Formats
| Conversion | Tool |
|------------|------|
| Office formats → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Format-specific extractor (pdfplumber, pandoc, markitdown) |
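The conversions above are typically driven from the command line. The sketch below only builds the argument lists and leaves execution to a separate `run` helper, so it can be inspected without the tools installed. The function names are hypothetical, and the LibreOffice binary is assumed to be on the PATH as `soffice` (some systems name it `libreoffice`).

```python
import subprocess


def office_to_pdf_command(source, out_dir):
    """LibreOffice headless: convert an office document to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def pdf_to_png_command(pdf, prefix, dpi=150):
    """pdftoppm: render each page to <prefix>-<page number>.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def docx_to_markdown_command(docx, output):
    """pandoc: convert a Word document to Markdown."""
    return ["pandoc", str(docx), "-t", "markdown", "-o", str(output)]


def run(command):
    """Execute a command list, raising CalledProcessError on failure."""
    subprocess.run(command, check=True)


# File names below are placeholders.
print(office_to_pdf_command("report.docx", "out"))
print(pdf_to_png_command("out/report.pdf", "out/page"))
print(docx_to_markdown_command("report.docx", "report.md"))
```

Passing arguments as a list avoids shell quoting problems with file names that contain spaces.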
Best Practices
| Practice | Why |
|----------|-----|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |
Common Packages
| Language | Packages |
|----------|----------|
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |