Document Format Conversion
Convert various document formats to Markdown for knowledge base onboarding.
Supported Formats
| Format | Processing Method |
|--------|------------------|
| DOCX | Pandoc conversion, preserving formatting and images |
| DOC | LibreOffice → DOCX → Pandoc |
| PDF (electronic) | PyMuPDF4LLM fast conversion |
| PDF (scanned) | PaddleOCR-VL online OCR |
| PPTX | pptx2md conversion |
| PPT | LibreOffice → PPTX → pptx2md |
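The table can be read as a lookup from file type to a chain of tools. The sketch below only restates it as data. `PIPELINES` and `pipeline_for` are hypothetical names, and the `scanned` flag stands in for whatever PDF detection the script performs, which is not described here.

```python
from pathlib import Path

# Tool chains per extension, transcribed from the table above.
# The entries are labels only; nothing here invokes a tool.
PIPELINES = {
    ".docx": ["Pandoc"],
    ".doc": ["LibreOffice", "Pandoc"],
    ".pptx": ["pptx2md"],
    ".ppt": ["LibreOffice", "pptx2md"],
}


def pipeline_for(filename: str, scanned: bool = False) -> list[str]:
    """Return the tool chain for a file (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # Scanned PDFs need OCR; electronic PDFs are converted directly.
        return ["PaddleOCR-VL"] if scanned else ["PyMuPDF4LLM"]
    if ext not in PIPELINES:
        raise ValueError(f"unsupported format: {ext or filename}")
    return PIPELINES[ext]
```

For example, `pipeline_for("report.doc")` yields the two-step LibreOffice then Pandoc chain, while an unlisted extension raises `ValueError`.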
Usage
python .claude/skills/document-conversion/scripts/smart_convert.py \
<temp_path> \
--original-name "<original_filename>" \
--json-output
Parameters:
- `<temp_path>`: Temporary file path (e.g. `/tmp/kb_upload_xxx.pptx`)
- `--original-name`: Required. The original filename, used to generate the correct image directory name
- `--json-output`: Output the result as JSON
Output Format
{
"success": true,
"markdown_file": "/path/to/output.md",
"images_dir": "original_filename_images",
"image_count": 5,
"input_file": "/path/to/input.pptx"
}
Processing Flow
- Execute the conversion command (must use `--original-name` and `--json-output`)
- Parse the JSON output and check the `success` field
- If `success: false`, report the error and stop
- If `success: true`, record the generated file path and image directory
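The steps above can be sketched in Python roughly as follows. This is a minimal sketch under two assumptions not confirmed by this document: that the script prints the JSON result alone on stdout, and that a failed result may carry an `error` field. `build_command`, `parse_result`, and `convert` are hypothetical helper names.

```python
import json
import subprocess

SCRIPT = ".claude/skills/document-conversion/scripts/smart_convert.py"


def build_command(temp_path: str, original_name: str) -> list[str]:
    """Assemble the command with both required flags."""
    return [
        "python", SCRIPT, temp_path,
        "--original-name", original_name,
        "--json-output",
    ]


def parse_result(stdout: str) -> dict:
    """Parse the JSON result; raise if the conversion reported failure."""
    result = json.loads(stdout)
    if not result.get("success"):
        # "error" is an assumed field name; the failure payload is not documented here.
        raise RuntimeError(result.get("error", "conversion failed"))
    return result


def convert(temp_path: str, original_name: str) -> tuple[str, str]:
    """Run the conversion and return (markdown_file, images_dir)."""
    proc = subprocess.run(
        build_command(temp_path, original_name),
        capture_output=True,
        text=True,
    )
    # Assumes stdout contains only the JSON document.
    result = parse_result(proc.stdout)
    return result["markdown_file"], result["images_dir"]
```

Keeping command assembly and result parsing separate from the `subprocess.run` call makes each step easy to check without invoking the real script.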
Important Notes
- The image directory is named after the original filename (e.g. `培训资料_images/`)
- Omitting `--original-name` causes incorrect image reference paths
- PDF type is detected automatically; scanned PDFs are slower to process (tens of seconds to minutes)
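To illustrate why the original filename matters, here is a small sketch of the naming rule implied by the examples in this document (filename stem plus `_images`). The exact rule inside `smart_convert.py` is not shown here, so `images_dir_for` is a hypothetical helper and the derivation is an assumption.

```python
from pathlib import PurePath


def images_dir_for(original_name: str) -> str:
    """Derive the image directory name from a filename (assumed rule: stem + '_images')."""
    return f"{PurePath(original_name).stem}_images"


# With the original name, the directory matches what the Markdown should reference.
print(images_dir_for("培训资料.pptx"))
# With only the temporary upload name, the directory would follow the temp file instead.
print(images_dir_for("kb_upload_xxx.pptx"))
```

Under this assumption, a directory derived from the temporary upload name would not match the document's real name, which is consistent with the broken image references described above.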
Format Details
For detailed processing instructions for each format, see FORMATS.md.