DOCX Processing
Workflow Decision Tree
- Reading/Analyzing: Use text extraction or raw XML access
- Creating New Document: Use docx-js (JavaScript)
- Editing Existing: Use OOXML editing or redlining workflow
Reading Content
Text Extraction with Pandoc
# Convert to markdown with tracked changes
pandoc --track-changes=all file.docx -o output.md
Raw XML Access
# Unpack document
unzip document.docx -d unpacked/
# Key files:
# word/document.xml - Main content
# word/comments.xml - Comments
# word/media/ - Images
Creating New Documents (docx-js)
import { Document, Paragraph, TextRun, Packer } from 'docx';
import fs from 'fs';
const doc = new Document({
sections: [{
children: [
new Paragraph({
children: [
new TextRun({ text: "Hello ", bold: true }),
new TextRun({ text: "World", italics: true })
]
})
]
}]
});
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync('document.docx', buffer);
Editing Existing Documents
Simple Edits
- Unpack:
unzip doc.docx -d unpacked/ - Edit
word/document.xml - Repack:
cd unpacked && zip -r ../edited.docx .
Tracked Changes (Redlining)
For professional documents, use tracked changes:
<!-- Deletion -->
<w:del w:author="Author" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>old text</w:delText></w:r>
</w:del>
<!-- Insertion -->
<w:ins w:author="Author" w:date="2025-01-01T00:00:00Z">
<w:r><w:t>new text</w:t></w:r>
</w:ins>
Converting to Images
# DOCX to PDF
soffice --headless --convert-to pdf document.docx
# PDF to images
pdftoppm -jpeg -r 150 document.pdf page
Best Practices
- Use Pandoc for text extraction
- Use docx-js for creating new documents
- For legal/business docs, always use tracked changes
- Preserve original RSIDs when editing