Agent Skills: Document Format Conversion

Convert DOC/DOCX/PDF/PPT/PPTX documents to Markdown format. Automatically detect PDF type (electronic/scanned), extract images to separate directory. Use this Skill when administrator onboards non-Markdown documents. Trigger condition: Onboard DOC/DOCX/PDF/PPT/PPTX format files.

UncategorizedID: Harryoung/efka/document-conversion

Repository

HarryoungLicense: Apache-2.0
9117

Install this agent skill to your local

pnpm dlx add-skill https://github.com/Harryoung/efka/tree/HEAD/skills/document-conversion

Skill Files

Browse the full folder contents for document-conversion.

Download Skill

Loading file tree…

skills/document-conversion/SKILL.md

Skill Metadata

Name
document-conversion
Description
Convert DOC/DOCX/PDF/PPT/PPTX documents to Markdown format. Automatically detect PDF type (electronic/scanned), extract images to separate directory. Use this Skill when administrator onboards non-Markdown documents. Trigger condition: Onboard DOC/DOCX/PDF/PPT/PPTX format files.

Document Format Conversion

Convert various document formats to Markdown for knowledge base onboarding.

Supported Formats

| Format | Processing Method | |--------|------------------| | DOCX | Pandoc conversion, preserve formatting and images | | DOC | LibreOffice → DOCX → Pandoc | | PDF Electronic | PyMuPDF4LLM fast conversion | | PDF Scanned | PaddleOCR-VL online OCR | | PPTX | pptx2md professional conversion | | PPT | LibreOffice → PPTX → pptx2md |

Usage

python .claude/skills/document-conversion/scripts/smart_convert.py \
    <temp_path> \
    --original-name "<original_filename>" \
    --json-output

Parameters:

  • <temp_path>: Temporary file path (e.g. /tmp/kb_upload_xxx.pptx)
  • --original-name: Must pass original filename, used to generate correct image directory name
  • --json-output: Output JSON format result

Output Format

{
  "success": true,
  "markdown_file": "/path/to/output.md",
  "images_dir": "original_filename_images",
  "image_count": 5,
  "input_file": "/path/to/input.pptx"
}

Processing Flow

  1. Execute conversion command (must use --original-name and --json-output)
  2. Parse JSON output, check success field
  3. If success: false, report error and end
  4. If success: true, record generated file path and image directory

Important Notes

  • Image directory uses original filename naming (e.g. 培训资料_images/)
  • Not passing --original-name will cause incorrect image reference paths
  • PDF type is automatically detected, scanned version processing is slower (tens of seconds to minutes)

Format Details

Detailed processing instructions for each format, see FORMATS.md