Agent Skills: Document to Markdown Conversion

Convert documents (PDF, Word, Excel, PowerPoint, images, HTML) to Markdown using microsoft/markitdown. Use for document analysis, content extraction, preprocessing for LLMs, or batch document conversion. Supports images with OCR/LLM descriptions, audio transcription, and ZIP archives.

UncategorizedID: rysweet/amplihack/markitdown

Install this agent skill to your local

pnpm dlx add-skill https://github.com/rysweet/amplihack/tree/HEAD/.claude/skills/markitdown

Skill Files

Browse the full folder contents for markitdown.

Download Skill

Loading file tree…

.claude/skills/markitdown/SKILL.md

Skill Metadata

Name
markitdown
Description
Convert documents (PDF, Word, Excel, PowerPoint, images, HTML) to Markdown using microsoft/markitdown. Use for document analysis, content extraction, preprocessing for LLMs, or batch document conversion. Supports images with OCR/LLM descriptions, audio transcription, and ZIP archives.

Document to Markdown Conversion

Overview

Convert various document formats to clean Markdown using Microsoft's MarkItDown tool. Optimized for LLM processing, content extraction, and document analysis workflows.

Supported Formats: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx/.xls), Images (with OCR/LLM), HTML, Audio (with transcription), CSV, JSON, XML, ZIP archives, EPubs

Quick Start

Basic Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

Command Line

# Convert single file
markitdown document.pdf > output.md
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown

πŸ”’ Security Considerations

Before using in production:

  • βœ… Validate file types (MIME, not extension)
  • βœ… Limit file sizes (prevent DoS)
  • βœ… Sanitize file paths (prevent traversal)
  • βœ… Protect API keys (never hardcode)
  • βœ… Consider data privacy (external services)

See patterns.md for implementation details.

API Key Security

❌ NEVER:

  • Hardcode keys in code
  • Commit .env files to git
  • Log environment variables

βœ… ALWAYS:

  • Use environment variables: export OPENAI_API_KEY="sk-..." # pragma: allowlist secret
  • Use secret management (AWS Secrets Manager, Azure Key Vault)
  • Rotate keys regularly

Common Patterns

PDF Documents

# Basic PDF conversion
md = MarkItDown()
result = md.convert("report.pdf")

# With Azure Document Intelligence (better quality)
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("report.pdf")

Office Documents

# Word documents - preserves structure
result = md.convert("document.docx")

# Excel - converts tables to markdown tables
result = md.convert("spreadsheet.xlsx")

# PowerPoint - extracts slide content
result = md.convert("presentation.pptx")

Images with Descriptions

# βœ… SECURE: Using environment variables for API keys
import os
from openai import OpenAI

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not set")

client = OpenAI(api_key=api_key)
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.jpg")  # Gets AI-generated description

Batch Processing

from pathlib import Path

md = MarkItDown()
documents = Path(".").glob("*.pdf")

for doc in documents:
    result = md.convert(str(doc))
    output_path = doc.with_suffix(".md")
    output_path.write_text(result.text_content)

Installation

# Full installation (all features)
pip install 'markitdown[all]'

# Selective features
pip install 'markitdown[pdf, docx, pptx]'

Requirements: Python 3.10 or higher

Key Features

  • Structure Preservation: Maintains headings, lists, tables, links
  • Plugin System: Extend with custom converters
  • Docker Support: Containerized deployments
  • MCP Integration: Model Context Protocol server for LLM apps

When to Read Supporting Files

  • reference.md - Read when you need:

    • Complete API reference and all configuration options
    • Azure Document Intelligence integration details
    • Plugin development guide
    • Docker and MCP server setup
    • Troubleshooting and error handling
  • examples.md - Read when you need:

    • Working examples for specific file types
    • Batch processing workflows
    • Error handling patterns
    • Integration with existing pipelines
  • patterns.md - Read when you need:

    • Production deployment patterns
    • Performance optimization strategies
    • Security considerations
    • Anti-patterns to avoid

Quick Reference

| File Type | Use Case | Command | | ---------- | ----------------- | ----------------------------------------------------------- | | PDF | Reports, papers | md.convert("file.pdf") | | Word | Documents | md.convert("file.docx") | | Excel | Data tables | md.convert("file.xlsx") | | PowerPoint | Presentations | md.convert("file.pptx") | | Images | Diagrams with OCR | md = MarkItDown(llm_client=client); md.convert("img.jpg") | | HTML | Web pages | md.convert("page.html") | | ZIP | Archives | md.convert("archive.zip") - processes contents |

⚠️ Common Mistakes to Avoid

Anti-Pattern 1: Hardcoded API Keys

# ❌ NEVER DO THIS
md = MarkItDown(llm_client=OpenAI(api_key="sk-hardcoded-key"))

# βœ… ALWAYS DO THIS
api_key = os.getenv("OPENAI_API_KEY")
md = MarkItDown(llm_client=OpenAI(api_key=api_key))

Anti-Pattern 2: Unvalidated File Paths

# ❌ Vulnerable to path traversal
user_input = "../../../etc/passwd"
md.convert(user_input)

# βœ… Validate and sanitize
from pathlib import Path
safe_path = Path(user_input).resolve()
if not safe_path.is_relative_to(allowed_dir):
    raise ValueError("Invalid path")
md.convert(str(safe_path))

Anti-Pattern 3: Ignoring File Size Limits

# ❌ Can cause DoS
md.convert("huge_file.pdf")  # No size check

# βœ… Check size first
max_size = 50 * 1024 * 1024  # 50MB
if Path("file.pdf").stat().st_size > max_size:
    raise ValueError("File too large")

Common Issues

Import Error: Ensure Python >= 3.10 and markitdown installed Missing Dependencies: Install with pip install 'markitdown[all]' Image Descriptions Not Working: Requires LLM client (OpenAI or compatible)

For detailed troubleshooting, see reference.md.