DOCX Creation, Editing, and Analysis Skill

Overview

A .docx file is essentially a ZIP archive containing XML files that you can read or edit.
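
As a rough illustration of the ZIP-plus-XML layout, the sketch below uses only Python's standard zipfile module. It builds a tiny in-memory archive with two parts named like those in a real .docx and lists them. The helper names (build_minimal_archive, list_parts) are hypothetical, and the archive is a stand-in for demonstration, not a file Word would open.

```python
import io
import zipfile


def build_minimal_archive() -> bytes:
    """Build a tiny in-memory ZIP laid out like a .docx (illustrative only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Placeholder XML bodies; a real document has far richer content.
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of the parts stored inside a .docx-style archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


parts = list_parts(build_minimal_archive())
print(parts)
```

For a file on disk, zipfile.ZipFile("path-to-file.docx") can be opened the same way.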

Workflow Decision Tree

Reading/Analyzing Content

Use text extraction or raw XML access

Creating New Document

Use docx-js workflow

Editing Existing Document

Your own document + simple changes: Basic OOXML editing
Someone else's document: Redlining workflow (recommended)
Legal, academic, business docs: Redlining workflow (required)

Reading Content

Text Extraction

Convert to markdown using pandoc:

# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md

Raw XML Access

Needed for: comments, complex formatting, document structure, embedded media.

# Unpack a file
python ooxml/scripts/unpack.py <input.docx> <output_dir>

Key file structures:

word/document.xml - Main document contents
word/comments.xml - Comments referenced in document.xml
word/media/ - Embedded images and media files
Tracked changes use <w:ins> and <w:del> tags
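
To make the tracked-change tags concrete, here is a minimal sketch that pulls inserted and deleted text out of a document.xml fragment. The sample XML is hand-written for illustration, and tracked_changes is a hypothetical helper. It uses the standard library parser so it runs anywhere; for untrusted files, defusedxml (listed under Dependencies) would be the safer choice.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written sample: "30" is deleted and "60" is inserted.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">The term is </w:t></w:r>
      <w:del w:id="1" w:author="Reviewer"><w:r><w:delText>30</w:delText></w:r></w:del>
      <w:ins w:id="2" w:author="Reviewer"><w:r><w:t>60</w:t></w:r></w:ins>
      <w:r><w:t xml:space="preserve"> days.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def tracked_changes(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (inserted_texts, deleted_texts) found in a document.xml string."""
    root = ET.fromstring(xml_text)
    # Inserted runs keep their text in <w:t>; deleted runs use <w:delText>.
    inserted = [
        "".join(t.text or "" for t in ins.iterfind(".//w:t", NS))
        for ins in root.iterfind(".//w:ins", NS)
    ]
    deleted = [
        "".join(t.text or "" for t in d.iterfind(".//w:delText", NS))
        for d in root.iterfind(".//w:del", NS)
    ]
    return inserted, deleted


inserted, deleted = tracked_changes(SAMPLE)
print(inserted, deleted)
```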

Creating New Documents

Use docx-js (JavaScript/TypeScript):

Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components
Export as .docx using Packer.toBuffer()

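import * as fs from "fs";
import { Document, Paragraph, TextRun, Packer } from "docx";

// A document is a list of sections; each section holds block-level children.
const doc = new Document({
  sections: [{
    properties: {},
    children: [
      new Paragraph({
        children: [new TextRun("Hello World")],
      }),
    ],
  }],
});

// Serialize to a Buffer, then write it to disk.
// Top-level await requires an ES module; "hello.docx" is an illustrative filename.
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("hello.docx", buffer);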

Editing Existing Documents

Use the Document library (Python):

Unpack: python ooxml/scripts/unpack.py <input.docx> <output_dir>
Create and run a Python script using the Document library
Pack: python ooxml/scripts/pack.py <unpacked_dir> <output.docx>

Redlining Workflow

For document review with tracked changes:

Principle: Minimal, Precise Edits. Mark only the text that actually changes. Break each replacement into: [unchanged text] + [deletion] + [insertion] + [unchanged text]
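
One way to find the four segments is to strip the common leading and trailing words from the old and new text. The sketch below does that at word granularity. split_replacement is a hypothetical helper, not part of the Document library, and it handles only a single contiguous change per sentence.

```python
import re


def split_replacement(old: str, new: str) -> tuple[str, str, str, str]:
    """Split a replacement into (unchanged_before, deletion, insertion, unchanged_after)."""
    # Tokenize into alternating runs of whitespace and non-whitespace so that
    # joining the tokens reproduces the original string exactly.
    old_tokens = re.findall(r"\s+|\S+", old)
    new_tokens = re.findall(r"\s+|\S+", new)

    # Length of the shared leading token run.
    prefix = 0
    limit = min(len(old_tokens), len(new_tokens))
    while prefix < limit and old_tokens[prefix] == new_tokens[prefix]:
        prefix += 1

    # Length of the shared trailing token run, never overlapping the prefix.
    suffix = 0
    while (
        suffix < limit - prefix
        and old_tokens[len(old_tokens) - 1 - suffix] == new_tokens[len(new_tokens) - 1 - suffix]
    ):
        suffix += 1

    old_end = len(old_tokens) - suffix
    new_end = len(new_tokens) - suffix
    return (
        "".join(old_tokens[:prefix]),
        "".join(old_tokens[prefix:old_end]),
        "".join(new_tokens[prefix:new_end]),
        "".join(old_tokens[old_end:]),
    )


segments = split_replacement("The term is 30 days.", "The term is 60 days.")
print(segments)  # ('The term is ', '30', '60', ' days.')
```

Only the middle two segments would be wrapped in deletion and insertion markup; the outer two stay as ordinary runs.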

Workflow

Get markdown representation:

pandoc --track-changes=all path-to-file.docx -o current.md

Identify the changes and group them into batches of 3-10 changes each

Unpack the document:

python ooxml/scripts/unpack.py <input.docx> <output_dir>

Implement changes in batches using Document library

Pack the document:

python ooxml/scripts/pack.py <output_dir> reviewed-document.docx

Final verification:

pandoc --track-changes=all reviewed-document.docx -o verification.md
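
The final verification can be partly automated by scanning verification.md for tracked-change spans. The sketch below assumes pandoc writes them in its bracketed-span form, such as [text]{.insertion author="..." date="..."}; that output shape is an assumption worth checking against your pandoc version. The sample string is hand-written, and summarize_tracked_changes is a hypothetical helper.

```python
import re

# Assumed span shape: [text]{.insertion ...} or [text]{.deletion ...}
SPAN = re.compile(r"\[([^\]]*)\]\{\.(insertion|deletion)\b[^}]*\}")


def summarize_tracked_changes(markdown: str) -> dict[str, list[str]]:
    """Collect the text of each insertion and deletion span in a markdown string."""
    summary: dict[str, list[str]] = {"insertion": [], "deletion": []}
    for text, kind in SPAN.findall(markdown):
        summary[kind].append(text)
    return summary


# Hand-written sample standing in for the contents of verification.md.
sample = (
    'The term is [30]{.deletion author="Reviewer" date="2024-01-01T00:00:00Z"}'
    '[60]{.insertion author="Reviewer" date="2024-01-01T00:00:00Z"} days.'
)
summary = summarize_tracked_changes(sample)
print(summary)
```

Comparing this summary against the planned batches shows whether every intended change landed and nothing extra was marked.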

Converting to Images

# Convert DOCX to PDF
soffice --headless --convert-to pdf document.docx

# Convert PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page

Dependencies

pandoc: Text extraction
docx: Creating new documents (npm)
LibreOffice: PDF conversion
Poppler: PDF to image conversion
defusedxml: Secure XML parsing
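
A quick way to see which of these are available is sketched below. It looks for the executables used in the commands above (pandoc, soffice, pdftoppm) on PATH and checks whether defusedxml is importable. check_dependencies is a hypothetical helper, and the npm docx package is not covered because it is installed per project.

```python
import importlib.util
import shutil

# Dependency name -> executable used in the commands in this document.
CLI_TOOLS = {
    "pandoc": "pandoc",
    "LibreOffice": "soffice",
    "Poppler": "pdftoppm",
}


def check_dependencies() -> dict[str, bool]:
    """Report which dependencies appear to be available in this environment."""
    status = {name: shutil.which(exe) is not None for name, exe in CLI_TOOLS.items()}
    # defusedxml is a Python package, so test for an importable module instead.
    status["defusedxml"] = importlib.util.find_spec("defusedxml") is not None
    return status


status = check_dependencies()
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```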
