Document Format Conversion Skill

Document Format Conversion

Convert various document formats to Markdown for knowledge base onboarding.

Supported Formats

| Format | Processing Method |
|--------|------------------|
| DOCX | Pandoc conversion, preserve formatting and images |
| DOC | LibreOffice → DOCX → Pandoc |
| PDF Electronic | PyMuPDF4LLM fast conversion |
| PDF Scanned | PaddleOCR-VL online OCR |
| PPTX | pptx2md professional conversion |
| PPT | LibreOffice → PPTX → pptx2md |
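The routing in the table can be pictured as a lookup keyed on the file extension. This is only an illustrative sketch: how smart_convert.py actually dispatches is not shown here, and the names PIPELINES and pipeline_for are hypothetical.

```python
from pathlib import Path

# Hypothetical lookup mirroring the table above; tool names are labels only.
PIPELINES = {
    ".docx": ["Pandoc"],
    ".doc": ["LibreOffice", "Pandoc"],
    # Electronic vs. scanned PDFs are told apart by the script itself.
    ".pdf": ["PyMuPDF4LLM or PaddleOCR-VL"],
    ".pptx": ["pptx2md"],
    ".ppt": ["LibreOffice", "pptx2md"],
}


def pipeline_for(filename: str) -> list[str]:
    """Return the ordered tool chain for a filename, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in PIPELINES:
        raise ValueError(f"unsupported format: {suffix or filename}")
    return PIPELINES[suffix]
```

For example, pipeline_for("report.DOC") returns the two-step chain, since legacy formats go through LibreOffice first.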

Usage

python .claude/skills/document-conversion/scripts/smart_convert.py \
    <temp_path> \
    --original-name "<original_filename>" \
    --json-output

Parameters:

<temp_path>: Temporary file path (e.g. /tmp/kb_upload_xxx.pptx)
--original-name: Required. The original filename, used to generate the correct image directory name
--json-output: Output the result as JSON

Output Format

{
  "success": true,
  "markdown_file": "/path/to/output.md",
  "images_dir": "original_filename_images",
  "image_count": 5,
  "input_file": "/path/to/input.pptx"
}

Processing Flow

Run the conversion command (always pass --original-name and --json-output)
Parse the JSON output and check the success field
If success is false, report the error and stop
If success is true, record the generated file path and the image directory
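The flow above could be sketched in Python roughly as follows. It assumes the script prints only the JSON result to stdout, which is not stated here; the helper names build_command, parse_result, and convert are hypothetical.

```python
import json
import subprocess

SCRIPT = ".claude/skills/document-conversion/scripts/smart_convert.py"


def build_command(temp_path: str, original_name: str) -> list[str]:
    """Assemble the documented command line as an argument list."""
    return [
        "python", SCRIPT, temp_path,
        "--original-name", original_name,
        "--json-output",
    ]


def parse_result(stdout: str) -> dict:
    """Parse the JSON result and raise if the conversion failed."""
    result = json.loads(stdout)
    if not result.get("success"):
        # The name of any error field is not documented, so report the whole payload.
        raise RuntimeError(f"conversion failed: {result}")
    return result


def convert(temp_path: str, original_name: str) -> tuple[str, str]:
    """Run the conversion and return (markdown_file, images_dir)."""
    proc = subprocess.run(
        build_command(temp_path, original_name),
        capture_output=True,
        text=True,
    )
    # Assumption: stdout holds nothing but the JSON document.
    result = parse_result(proc.stdout)
    return result["markdown_file"], result["images_dir"]
```

Passing the arguments as a list avoids shell quoting problems when the original filename contains spaces or non-ASCII characters.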

Important Notes

The image directory is named after the original filename (e.g. 培训资料_images/)
Omitting --original-name produces incorrect image reference paths
The PDF type is detected automatically; scanned PDFs are processed more slowly (tens of seconds to minutes)
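To see why the original filename matters, here is a small sketch of the naming rule. It assumes the directory name is the filename without its extension plus an _images suffix, as the example suggests; the script's exact rule is not documented here, and expected_images_dir is a hypothetical helper.

```python
from pathlib import Path


def expected_images_dir(filename: str) -> str:
    """Guess the image directory name: filename stem plus '_images' (assumed rule)."""
    return f"{Path(filename).stem}_images"


# Named after the original upload, the directory is meaningful to readers.
original = expected_images_dir("培训资料.pptx")

# Named after the temporary path, it would carry the temp file's name instead.
temporary = expected_images_dir("/tmp/kb_upload_xxx.pptx")
```

Under that assumption, image links in the Markdown would point at a directory named after the temporary file unless --original-name is supplied.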

Format Details

For detailed processing instructions for each format, see FORMATS.md.
