Marker PDF-to-Markdown Converter
Convert PDFs to Markdown while preserving LaTeX formulas and document structure. Uses the marker_single CLI from the marker-pdf package.
Dependencies
- marker_single on PATH (pip install marker-pdf if missing)
- Python 3.10+ (available in the task image)
Quick Start
from scripts.marker_to_markdown import pdf_to_markdown
markdown_text = pdf_to_markdown("paper.pdf")
print(markdown_text)
Python API
- pdf_to_markdown(pdf_path, *, timeout=600, cleanup=True) -> str
  - Runs marker_single --output_format markdown --disable_image_extraction
  - cleanup=True: use a temp directory and delete after reading the Markdown
  - cleanup=False: keep outputs in <pdf_stem>_marker/ next to the PDF
  - Exceptions: FileNotFoundError if the PDF is missing, RuntimeError for marker failures, TimeoutError if it exceeds the timeout
- Tips: bump timeout for large PDFs; set cleanup=False to inspect intermediate files
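The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/marker_to_markdown.py: the function name pdf_to_markdown_sketch is hypothetical, the --output_dir flag is an assumption not stated in this document, and the real script may locate and read its output differently.

```python
import subprocess
import tempfile
from pathlib import Path


def pdf_to_markdown_sketch(pdf_path, *, timeout=600, cleanup=True):
    """Illustrative wrapper around marker_single (hypothetical, not the real script)."""
    pdf = Path(pdf_path)
    if not pdf.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf}")

    # cleanup=True  -> throwaway temp directory
    # cleanup=False -> <pdf_stem>_marker/ next to the PDF
    tmp = tempfile.TemporaryDirectory() if cleanup else None
    out_dir = Path(tmp.name) if tmp else pdf.with_name(f"{pdf.stem}_marker")
    out_dir.mkdir(parents=True, exist_ok=True)

    cmd = [
        "marker_single", str(pdf),
        "--output_format", "markdown",
        "--disable_image_extraction",
        "--output_dir", str(out_dir),  # assumption: flag not named in this document
    ]
    try:
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired as exc:
            raise TimeoutError(f"marker_single exceeded {timeout}s") from exc
        except FileNotFoundError as exc:
            # Raised by subprocess when the executable itself is missing.
            raise RuntimeError("marker_single not found on PATH") from exc
        if proc.returncode != 0:
            raise RuntimeError(f"marker_single failed: {proc.stderr.strip()}")

        # Simplification: take the first .md file found under the output folder.
        md_files = sorted(out_dir.rglob("*.md"))
        if not md_files:
            raise RuntimeError("marker_single produced no Markdown output")
        return md_files[0].read_text(encoding="utf-8")
    finally:
        if tmp is not None:
            tmp.cleanup()
```

Checking for the PDF before launching the CLI keeps the FileNotFoundError case distinct from a missing executable, which is reported as a RuntimeError in this sketch.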
Command-Line Usage
# Basic conversion (prints markdown to stdout)
python scripts/marker_to_markdown.py paper.pdf
# Keep temporary files
python scripts/marker_to_markdown.py paper.pdf --keep-temp
# Custom timeout
python scripts/marker_to_markdown.py paper.pdf --timeout 600
Output Locations
- cleanup=True: outputs stored in a temporary directory and removed automatically
- cleanup=False: outputs saved to <pdf_stem>_marker/; markdown lives at <pdf_stem>_marker/<pdf_stem>/<pdf_stem>.md when present (otherwise the first .md file is used)
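The lookup rule above (preferred path first, otherwise the first .md file) might look like the sketch below. The helper name find_markdown is hypothetical, and sorting the fallback candidates is an assumption made here so the result is deterministic; the actual script may order them differently.

```python
from pathlib import Path


def find_markdown(out_dir, pdf_stem):
    """Return the Markdown path under out_dir, or None if no .md file exists."""
    out_dir = Path(out_dir)

    # Preferred location: <out_dir>/<pdf_stem>/<pdf_stem>.md
    preferred = out_dir / pdf_stem / f"{pdf_stem}.md"
    if preferred.is_file():
        return preferred

    # Fallback: first .md file anywhere under out_dir (sorted: an assumption).
    candidates = sorted(out_dir.rglob("*.md"))
    return candidates[0] if candidates else None
```

For example, find_markdown("paper_marker", "paper") would check paper_marker/paper/paper.md before falling back to any other .md file in that folder.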
Troubleshooting
- marker_single not found: install marker-pdf or ensure the CLI is on PATH
- No Markdown output: re-run with --keep-temp / cleanup=False and check stdout/stderr saved in the output folder
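A quick way to confirm the first case before converting anything is to check PATH from Python. This is a small sketch using the standard library's shutil.which; the helper name cli_available is hypothetical and not part of the script.

```python
import shutil


def cli_available(name="marker_single"):
    """Return True when an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if not cli_available():
    print("marker_single not found; try: pip install marker-pdf")
```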