Academic PDF → GitHub GFM Conversion
A battle-tested workflow for converting academic/research PDF papers into GitHub-renderable GFM markdown with inline figures, mathematically correct LaTeX, and validated output.
Battle-tested on: López de Prado (2026) "How to Use the Sharpe Ratio" — 51 pages, 82 equations, 8 figures.
Quick Start (3 Steps)
# Step 1: Extract prose (best structure preservation)
uv run --python 3.13 --with pymupdf4llm python3 -c "
import pymupdf4llm
md = pymupdf4llm.to_markdown('paper.pdf')
open('paper-raw.md', 'w').write(md)
"
# Step 2: Extract images
uv run --python 3.13 --with pymupdf python3 references/extract-images.py paper.pdf
# Step 3: Validate math before pushing
node references/validate-math.mjs paper.md
CRITICAL: Detect PDF Type First
This determines the entire workflow. Getting it wrong wastes hours.
Type A — Word-Generated PDF (Most Modern Academic Papers)
Signs: Embedded fonts, copyable text, Unicode math chars when you copy-paste (∑, π, α, β, γ, →)
Math encoding: Math is Unicode text in PDF stream — NOT images, NOT glyph maps
Consequence: OCR tools like marker-pdf cannot extract LaTeX — they see text like "γ₄" not \gamma_4. They may return empty output or crash silently.
Required approach:
- Use
pymupdf4llmfor prose extraction - Manually transcribe all equations from PDF screenshots — there is no shortcut
- Read each formula visually, write LaTeX by hand
How to confirm: Run marker-pdf — if output is empty or has zero math content, it's Type A.
Type B — LaTeX-Generated PDF
Signs: Computer Modern fonts, precise mathematical spacing, arxiv.org source available
Math encoding: Glyph-mapped — structure is partially extractable
Approach: pymupdf4llm or pdftotext for text. If arxiv source exists, extract directly from .tex (vastly preferred over PDF conversion).
Type C — Scanned/Image PDF
Signs: All pages are raster images, zero copyable text
Approach: OCR pipeline — marker-pdf is best option, or tesseract
Tool Comparison
| Tool | Best For | Install | Key Limitation |
| ------------- | ------------------------------- | --------------------------------- | ----------------------------------------------- |
| pymupdf4llm | Type A/B prose (best structure) | uv run --with pymupdf4llm | Math as Unicode, not LaTeX |
| pdftotext | Quick plain text | brew install poppler | Loses table structure |
| markitdown | Alternative prose | uv run --with 'markitdown[pdf]' | Slight over-spacing; same math limit |
| marker-pdf | Type C scanned only | pip install marker-pdf | Fails silently on Type A (Unicode text bug) |
Never trust marker-pdf output on Type A/B PDFs — the apparent "success" with empty math sections is the failure mode.
Image Extraction
Save references/extract-images.py:
import fitz, os, sys
doc = fitz.open(sys.argv[1])
os.makedirs("references/media", exist_ok=True)
saved = []
for page_num in range(len(doc)):
for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
xref = img[0]
base_image = doc.extract_image(xref)
img_bytes = base_image["image"]
if len(img_bytes) < 2048: # skip icons/logos/watermarks/rules
continue
ext = base_image["ext"]
fname = f"fig-p{page_num+1:02d}-{img_idx+1:02d}.{ext}"
with open(f"references/media/{fname}", "wb") as f:
f.write(img_bytes)
saved.append((page_num+1, fname, base_image.get("width"), base_image.get("height")))
print(f"Saved: {fname} ({len(img_bytes)//1024}KB, {base_image.get('width')}×{base_image.get('height')})")
doc.close()
print(f"\n{len(saved)} images saved to references/media/")
Naming: fig-p{page:02d}-{idx:02d}.{ext} — page number in name for easy location matching.
Size filter: Skip < 2 KB (captures icons, watermarks, horizontal rules). Review everything ≥ 2 KB — some are decorative but most are figures.
Insert in markdown:

Place immediately after the nearest section heading or the paragraph that references the figure.
GitHub GFM Math Rendering Rules
The $$ vs ```math ``` Decision — Root Cause
GitHub's Markdown pre-processor runs BEFORE the math renderer. It treats \\ as an escaped backslash and collapses it to \. This breaks LaTeX line breaks in display math.
The rule is simple:
| Equation type | Use | Reason |
| ------------------------------------------------------- | --------------- | ------------------------------------------ |
| Single-line display | $$...$$ | No \\ → pre-processor safe |
| Multi-line (contains \\, \begin{aligned}, matrices) | ```math ``` | Pre-processor does NOT process code fences |
| Inline | $...$ | Standard |
# BROKEN on GitHub — \\ stripped by pre-processor:
$$
\begin{aligned}
a &= b + c \\
d &= e + f
\end{aligned}
$$
# CORRECT on GitHub:
```math
\begin{aligned}
a &= b + c \\
d &= e + f
\end{aligned}
```
### Display Block Formatting Rules
- `$$` must be on its **own line** — not `$$formula$$` on one line
- **Blank line required** before AND after every `$$` block
- **Blank line required between consecutive** `$$` blocks
- These rules do NOT apply to `` ```math ``` `` blocks
### Supported/Unsupported LaTeX
See [references/github-math-support-table.md](./references/github-math-support-table.md) for the full table.
**Key things to avoid**:
| Command | Problem | Fix |
|---------|---------|-----|
| `\begin{align}` | ❌ Not supported by GitHub | Use `\begin{aligned}` |
| `\boxed{}` | ⚠️ Can cause raw LaTeX passthrough | Remove or use bold text |
| `\operatorname{}` | ⚠️ Active GitHub bug, inconsistent | Use `\text{}` or `\mathrm{}` |
| `\newcommand` | ❌ Was briefly available, then pulled | Expand all macros inline |
| `x^_y` | Superscript immediately before subscript | Write `x^{*}_{i}` with braces |
### Common Gotchas
- `\\[8pt]` vertical spacing inside `$$` → eaten by pre-processor → move to `` ```math ``` ``
- `\frac{1}{T}:\left(` → spurious colon after fraction → remove colon
- Pearson vs excess kurtosis: most finance formulas need Pearson (γ₄ = 3 for Gaussian), not excess. **Always document the kurtosis convention in the formula comment.**
- `\begin{pmatrix}` with `\\` → must use `` ```math ``` ``
- `\begin{cases}` with multiple rows → must use `` ```math ``` ``
---
## GitLab: No Workarounds Needed
**Empirically verified 2026-03-15** on GitLab CE 18.9.2. Confirmed by Comrak source code analysis.
GitLab uses the **Comrak** Rust parser with `math_dollars: true`. When Comrak encounters `$$`, it calls `handle_dollars` which slices the raw input buffer directly and stores it as a `NodeMath` AST node — CommonMark's backslash handler is never invoked on math content. The raw LaTeX is passed to KaTeX via `<span data-math-style="display/inline">` unchanged.
**Every GitHub workaround is unnecessary on GitLab:**
| GitHub problem | GitHub fix required | GitLab |
|----------------|---------------------|--------|
| `\\` in `$$` stripped → broken multiline | Use ` ```math ``` ` | `$$` works with `\\` |
| `\left\{` → `\left{` (delimiter error) | Use `\left\lbrace` | `\left\{` works |
| `\{...\}` set notation → invisible braces | Use `\lbrace...\rbrace` | `\{...\}` works |
| `\,` in `$$` → literal comma | Remove `\,` | `\,` works |
| `\,` in inline `$` → literal comma | Remove `\,` | `\,` works |
On GitLab you can write standard LaTeX without any platform-specific workarounds. If you're targeting GitLab (or hosting your own GitLab CE), skip all the `\lbrace`/`\rbrace` substitutions and ` ```math ``` ` conversions — plain `$$` with standard LaTeX is correct.
### GitLab.com Has a Hard 50-Span Per-Page Limit
**GitLab.com (SaaS) enforces a limit of 50 total math spans per page** (display + inline combined). After the 50th span, all subsequent equations silently fall back to raw LaTeX text. This limit exists to prevent DoS attacks and cannot be overridden on GitLab.com.
| Document math density | gitlab.com | Self-hosted CE |
|---|---|---|
| ≤ 50 total spans | ✅ Renders fully | ✅ |
| 51–100 spans | ⚠️ Partial render | ✅ |
| 100+ spans (academic papers) | ❌ Most equations raw text | ✅ Disable with `math_rendering_limits_enabled: false` |
**Validated on**: Sharpe ratio paper (341 spans) — breaks at span 51 on gitlab.com, renders fully on local CE.
**The W6 check in `validate-math.mjs`** warns when a file exceeds the limit.
**Summary: which platform to use**:
- **GitHub.com**: No math span limit. Use `\lbrace`/`\rbrace` workarounds (handled by `--fix`).
- **Self-hosted GitLab CE**: No limit (disable math_rendering_limits_enabled). No workarounds needed.
- **GitLab.com**: Only suitable for documents with ≤ 50 math spans.
### Self-hosting GitLab CE for Math-Heavy Documents
GitLab CE is free and runs on a single machine. On a 61 GB workstation with slim config:
- Memory footprint: ~3 GB (`puma['worker_processes'] = 2`, `sidekiq['concurrency'] = 5`, monitoring disabled)
- Push mirroring to GitHub: free on CE (syncs within 5 min)
- `glab` CLI: first-party, comparable to `gh`
```yaml
# docker-compose.yml — slim GitLab CE
services:
gitlab:
image: gitlab/gitlab-ce:latest
restart: unless-stopped
environment:
GITLAB_OMNIBUS_CONFIG: |
external_url 'http://YOUR_IP:8929'
puma['worker_processes'] = 2
sidekiq['concurrency'] = 5
prometheus_monitoring['enable'] = false
alertmanager['enable'] = false
node_exporter['enable'] = false
redis_exporter['enable'] = false
postgres_exporter['enable'] = false
gitlab_exporter['enable'] = false
ports: ["8929:8929", "8922:22"]
volumes:
- /srv/gitlab/config:/etc/gitlab
- /srv/gitlab/logs:/var/log/gitlab
- /srv/gitlab/data:/var/opt/gitlab
```
---
## Validation Pipeline
### Step 1: Install KaTeX Validator
```bash
bun add -g katex # Bun-first per project policy
# or: npm install -g katex
Step 2: Run Before Every Push
# Validate only (exit 1 on errors)
node references/validate-math.mjs your-file.md
# Validate + auto-fix correctable issues
node references/validate-math.mjs your-file.md --fix
The script is at references/validate-math.mjs. It runs two layers:
Layer 1 — KaTeX syntax: parse errors in $, $$, ```math ``` blocks
Layer 2 — GFM structural (issues KaTeX passes but GitHub breaks):
| Code | Severity | Issue | Auto-fix |
| ---- | -------- | --------------------------------------------------------------------------------------------- | -------------------------------------- |
| E0 | Error | \! \, \; \{ \} in $$ block — pre-processor strips backslash → parse error cascade | ✅ spacing removed; \{→\lbrace |
| E0b | Warning | \{ \} \, in inline $...$ — invisible braces or literal commas in prose | ✅ → \lbrace/\rbrace; \, removed |
| E1 | Error | $$ block with \\ — GitHub pre-processor strips backslashes | ✅ → ```math ``` |
| E2 | Error | Consecutive $$ blocks without blank line — orphaned delimiter cascade | ✅ add blank line |
| W1 | Warning | Bare ^* in $$ or $ block — markdown italic pairing eats the * | ✅ → ^{\ast} |
| W2 | Warning | \begin{align} — not supported on GitHub | ✗ manual |
| W3 | Warning | \boxed{} — can cause raw LaTeX passthrough | ✗ manual |
| W4 | Warning | \operatorname{} — inconsistent GitHub support | ✗ manual |
E0 is the most dangerous: a single failing $$ block exposes its $$ delimiters as literal text, creating an orphaned $ that shifts ALL subsequent inline $...$ pairings. One broken equation takes down the entire document.
\{/\} trap: In $$ blocks, \left\{ becomes \left{ (invalid KaTeX delimiter → "Missing or unrecognized delimiter") and \{...\} set notation becomes invisible grouping. Fix: use \lbrace/\rbrace (letter-based, CommonMark-immune). This affects every equation using set notation like \{\hat{SR}_k\} or \min_T\left\{...\right\}.
Exits code 1 on errors (CI-friendly). Warnings do not block CI but should be reviewed.
Local Preview Tools
# GitHub-accurate hot-reload preview
bun add -g @hyrious/gfm
gfm your-file.md --serve
# Offline binary (gh extension)
gh extension install thiagokokada/gh-gfm-preview
gh gfm-preview your-file.md
VS Code extensions:
shd101wyy.markdown-preview-enhanced— closest to GitHub renderingbierner.markdown-preview-github-styles— GitHub CSS styling
Multi-Agent Adversarial Equation Validation
For papers with 10+ equations, use this multi-agent pattern:
Phase 1 — Parallel Extraction
- Agent A: Extract prose with pymupdf4llm, transcribe math from PDF screenshots
- Agent B: Extract and categorize all images
Phase 2 — Parallel Validation
- Agent C: Validate equations against reference implementation (if code/repo exists)
- Agent D: Numerical spot-checks — compute paper's exhibit values, compare
Phase 3 — Discrepancy Handling
- For each discrepancy: write
/tmp/paper-discrepancy/eq-{N}.md - Spawn resolver agents to search online for authoritative third-party sources
- Authority rule: Paper is tentatively more authoritative than code implementation; a third independent source breaks ties
Phase 4 — Guarded Application
- Apply only HIGH-confidence fixes to the markdown
- For MEDIUM-confidence: spawn an independent audit agent before touching the file
- Document all discrepancies even if not fixed — future readers need to know
Anti-Patterns
| Anti-pattern | Why it fails | Fix |
| ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| \!\left( or \, in $$ blocks | GH pre-processor strips \!→! before KaTeX — !\left( crashes KaTeX, cascades all | Remove \! \, \; (spacing only) — or use ```math ``` |
| \left\{ or \{...\} in $$/$ blocks | \{→{ (CommonMark escape), so \left\{→\left{ = "Missing delimiter" error, and \{x\} renders without visible braces | Replace with \left\lbrace, \right\rbrace, \lbrace, \rbrace |
| $$\begin{aligned}...\\...\end{aligned}$$ | \\ stripped by GH pre-processor | Use ```math ``` |
| Trusting marker-pdf on Word PDFs | Returns no output or zero math (Unicode bug) | Read as screenshots, transcribe manually |
| \begin{align} in display math | Not supported by GitHub | Replace with \begin{aligned} |
| \operatorname{Cov} | Active GH bug — sometimes renders raw | Use \text{Cov} or \mathrm{Cov} |
| KaTeX validation only, no ```math ``` conversion | KaTeX passes but GH pre-processor still breaks \\ | Also convert ALL multi-line blocks |
| \boxed{} for highlighting | Can cause raw LaTeX passthrough on GitHub | Use bold text or a blockquote callout |
| Excess kurtosis in formulas expecting Pearson | Silent ~50% underestimate in variance formulas | Always document convention; use scipy.stats.kurtosis(fisher=False) |
| Consecutive $$ blocks without blank lines | GitHub collapses them into one broken block | Add blank line between each block |
| Running validation AFTER pushing | Bugs visible in public repo | Validate locally before every push (--fix auto-corrects E0/E1/E2) |
References
| File | Purpose | | ------------------------------------------------------------------------- | -------------------------------------- | | validate-math.mjs | KaTeX batch validator for GFM files | | pdf-type-detection.md | Detailed guide to detecting PDF type | | github-math-support-table.md | Full supported/unsupported LaTeX table |
Related Skills
| Skill | Relationship |
| --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| pandoc-pdf-generation | Opposite direction: markdown → PDF |
| documentation-standards | GFM formatting standards |
| quant-research:opendeviation-eval-metrics | Worked example: references/how-to-use-the-sharpe-ratio-2026.md |