Academic PDF → GitHub GFM Conversion Skill

Academic PDF → GitHub GFM Conversion

A battle-tested workflow for converting academic/research PDF papers into GitHub-renderable GFM markdown with inline figures, mathematically correct LaTeX, and validated output.

Battle-tested on: López de Prado (2026) "How to Use the Sharpe Ratio" — 51 pages, 82 equations, 8 figures.

Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

Quick Start (3 Steps)

# Step 1: Extract prose (best structure preservation)
uv run --python 3.13 --with pymupdf4llm python3 -c "
import pymupdf4llm
md = pymupdf4llm.to_markdown('paper.pdf')
open('paper-raw.md', 'w').write(md)
"

# Step 2: Extract images
uv run --python 3.13 --with pymupdf python3 references/extract-images.py paper.pdf

# Step 3: Validate math before pushing
node references/validate-math.mjs paper.md

CRITICAL: Detect PDF Type First

This determines the entire workflow. Getting it wrong wastes hours.

Type A — Word-Generated PDF (Most Modern Academic Papers)

Signs: Embedded fonts, copyable text, Unicode math chars when you copy-paste (∑, π, α, β, γ, →)

Math encoding: Math is Unicode text in PDF stream — NOT images, NOT glyph maps

Consequence: OCR tools like marker-pdf cannot extract LaTeX — they see text like "γ₄" not \gamma_4. They may return empty output or crash silently.

Required approach:

Use pymupdf4llm for prose extraction
Manually transcribe all equations from PDF screenshots — there is no shortcut
Read each formula visually, write LaTeX by hand

How to confirm: Run marker-pdf — if output is empty or has zero math content, it's Type A.

Type B — LaTeX-Generated PDF

Signs: Computer Modern fonts, precise mathematical spacing, arxiv.org source available

Math encoding: Glyph-mapped — structure is partially extractable

Approach: pymupdf4llm or pdftotext for text. If arxiv source exists, extract directly from .tex (vastly preferred over PDF conversion).

Type C — Scanned/Image PDF

Signs: All pages are raster images, zero copyable text

Approach: OCR pipeline — marker-pdf is best option, or tesseract

Tool Comparison

| Tool | Best For | Install | Key Limitation | | ------------- | ------------------------------- | --------------------------------- | ----------------------------------------------- | | pymupdf4llm | Type A/B prose (best structure) | uv run --with pymupdf4llm | Math as Unicode, not LaTeX | | pdftotext | Quick plain text | brew install poppler | Loses table structure | | markitdown | Alternative prose | uv run --with 'markitdown[pdf]' | Slight over-spacing; same math limit | | marker-pdf | Type C scanned only | pip install marker-pdf | Fails silently on Type A (Unicode text bug) |

Never trust marker-pdf output on Type A/B PDFs — the apparent "success" with empty math sections is the failure mode.

Image Extraction

Save references/extract-images.py:

import fitz, os, sys

doc = fitz.open(sys.argv[1])
os.makedirs("references/media", exist_ok=True)
saved = []
for page_num in range(len(doc)):
    for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
        xref = img[0]
        base_image = doc.extract_image(xref)
        img_bytes = base_image["image"]
        if len(img_bytes) < 2048:   # skip icons/logos/watermarks/rules
            continue
        ext = base_image["ext"]
        fname = f"fig-p{page_num+1:02d}-{img_idx+1:02d}.{ext}"
        with open(f"references/media/{fname}", "wb") as f:
            f.write(img_bytes)
        saved.append((page_num+1, fname, base_image.get("width"), base_image.get("height")))
        print(f"Saved: {fname} ({len(img_bytes)//1024}KB, {base_image.get('width')}×{base_image.get('height')})")
doc.close()
print(f"\n{len(saved)} images saved to references/media/")

Naming: fig-p{page:02d}-{idx:02d}.{ext} — page number in name for easy location matching.

Size filter: Skip < 2 KB (captures icons, watermarks, horizontal rules). Review everything ≥ 2 KB — some are decorative but most are figures.

Insert in markdown:

![Figure 1: Variance of Sharpe ratio estimates](./media/fig-p12-01.png)

Place immediately after the nearest section heading or the paragraph that references the figure.

GitHub GFM Math Rendering Rules

The `$$` vs ```math ``` Decision — Root Cause

GitHub's Markdown pre-processor runs BEFORE the math renderer. It treats \\ as an escaped backslash and collapses it to \. This breaks LaTeX line breaks in display math.

The rule is simple:

| Equation type | Use | Reason | | ------------------------------------------------------- | --------------- | ------------------------------------------ | | Single-line display | $$...$$ | No \\ → pre-processor safe | | Multi-line (contains \\, \begin{aligned}, matrices) | ```math ``` | Pre-processor does NOT process code fences | | Inline | $...$ | Standard |

# BROKEN on GitHub — \\ stripped by pre-processor:

$$
\begin{aligned}
a &= b + c \\
d &= e + f
\end{aligned}
$$

# CORRECT on GitHub:

```math
\begin{aligned}
a &= b + c \\
d &= e + f
\end{aligned}
```


### Display Block Formatting Rules

- `$$` must be on its **own line** — not `$$formula$$` on one line
- **Blank line required** before AND after every `$$` block
- **Blank line required between consecutive** `$$` blocks
- These rules do NOT apply to `` ```math ``` `` blocks

### Supported/Unsupported LaTeX

See [references/github-math-support-table.md](./references/github-math-support-table.md) for the full table.

**Key things to avoid**:

| Command | Problem | Fix |
|---------|---------|-----|
| `\begin{align}` | ❌ Not supported by GitHub | Use `\begin{aligned}` |
| `\boxed{}` | ⚠️ Can cause raw LaTeX passthrough | Remove or use bold text |
| `\operatorname{}` | ⚠️ Active GitHub bug, inconsistent | Use `\text{}` or `\mathrm{}` |
| `\newcommand` | ❌ Was briefly available, then pulled | Expand all macros inline |
| `x^_y` | Superscript immediately before subscript | Write `x^{*}_{i}` with braces |

### Common Gotchas

- `\\[8pt]` vertical spacing inside `$$` → eaten by pre-processor → move to `` ```math ``` ``
- `\frac{1}{T}:\left(` → spurious colon after fraction → remove colon
- Pearson vs excess kurtosis: most finance formulas need Pearson (γ₄ = 3 for Gaussian), not excess. **Always document the kurtosis convention in the formula comment.**
- `\begin{pmatrix}` with `\\` → must use `` ```math ``` ``
- `\begin{cases}` with multiple rows → must use `` ```math ``` ``

---

## GitLab: No Workarounds Needed

**Empirically verified 2026-03-15** on GitLab CE 18.9.2. Confirmed by Comrak source code analysis.

GitLab uses the **Comrak** Rust parser with `math_dollars: true`. When Comrak encounters `$$`, it calls `handle_dollars` which slices the raw input buffer directly and stores it as a `NodeMath` AST node — CommonMark's backslash handler is never invoked on math content. The raw LaTeX is passed to KaTeX via `<span data-math-style="display/inline">` unchanged.

**Every GitHub workaround is unnecessary on GitLab:**

| GitHub problem | GitHub fix required | GitLab |
|----------------|---------------------|--------|
| `\\` in `$$` stripped → broken multiline | Use ` ```math ``` ` | `$$` works with `\\` |
| `\left\{` → `\left{` (delimiter error) | Use `\left\lbrace` | `\left\{` works |
| `\{...\}` set notation → invisible braces | Use `\lbrace...\rbrace` | `\{...\}` works |
| `\,` in `$$` → literal comma | Remove `\,` | `\,` works |
| `\,` in inline `$` → literal comma | Remove `\,` | `\,` works |

On GitLab you can write standard LaTeX without any platform-specific workarounds. If you're targeting GitLab (or hosting your own GitLab CE), skip all the `\lbrace`/`\rbrace` substitutions and ` ```math ``` ` conversions — plain `$$` with standard LaTeX is correct.

### GitLab.com Has a Hard 50-Span Per-Page Limit

**GitLab.com (SaaS) enforces a limit of 50 total math spans per page** (display + inline combined). After the 50th span, all subsequent equations silently fall back to raw LaTeX text. This limit exists to prevent DoS attacks and cannot be overridden on GitLab.com.

| Document math density | gitlab.com | Self-hosted CE |
|---|---|---|
| ≤ 50 total spans | ✅ Renders fully | ✅ |
| 51–100 spans | ⚠️ Partial render | ✅ |
| 100+ spans (academic papers) | ❌ Most equations raw text | ✅ Disable with `math_rendering_limits_enabled: false` |

**Validated on**: Sharpe ratio paper (341 spans) — breaks at span 51 on gitlab.com, renders fully on local CE.

**The W6 check in `validate-math.mjs`** warns when a file exceeds the limit.

**Summary: which platform to use**:
- **GitHub.com**: No math span limit. Use `\lbrace`/`\rbrace` workarounds (handled by `--fix`).
- **Self-hosted GitLab CE**: No limit (disable math_rendering_limits_enabled). No workarounds needed.
- **GitLab.com**: Only suitable for documents with ≤ 50 math spans.

### Self-hosting GitLab CE for Math-Heavy Documents

GitLab CE is free and runs on a single machine. On a 61 GB workstation with slim config:
- Memory footprint: ~3 GB (`puma['worker_processes'] = 2`, `sidekiq['concurrency'] = 5`, monitoring disabled)
- Push mirroring to GitHub: free on CE (syncs within 5 min)
- `glab` CLI: first-party, comparable to `gh`

```yaml
# docker-compose.yml — slim GitLab CE
services:
  gitlab:
    image: gitlab/gitlab-ce:latest
    restart: unless-stopped
    environment:
      GITLAB_OMNIBUS_CONFIG: |
        external_url 'http://YOUR_IP:8929'
        puma['worker_processes'] = 2
        sidekiq['concurrency'] = 5
        prometheus_monitoring['enable'] = false
        alertmanager['enable'] = false
        node_exporter['enable'] = false
        redis_exporter['enable'] = false
        postgres_exporter['enable'] = false
        gitlab_exporter['enable'] = false
    ports: ["8929:8929", "8922:22"]
    volumes:
      - /srv/gitlab/config:/etc/gitlab
      - /srv/gitlab/logs:/var/log/gitlab
      - /srv/gitlab/data:/var/opt/gitlab
```

---

## Validation Pipeline

### Step 1: Install KaTeX Validator

```bash
bun add -g katex   # Bun-first per project policy
# or: npm install -g katex

Step 2: Run Before Every Push

# Validate only (exit 1 on errors)
node references/validate-math.mjs your-file.md

# Validate + auto-fix correctable issues
node references/validate-math.mjs your-file.md --fix

The script is at references/validate-math.mjs. It runs two layers:

Layer 1 — KaTeX syntax: parse errors in $, $$, ```math ``` blocks Layer 2 — GFM structural (issues KaTeX passes but GitHub breaks):

| Code | Severity | Issue | Auto-fix | | ---- | -------- | --------------------------------------------------------------------------------------------- | -------------------------------------- | | E0 | Error | \! \, \; \{ \} in $$ block — pre-processor strips backslash → parse error cascade | ✅ spacing removed; \{→\lbrace | | E0b | Warning | \{ \} \, in inline $...$ — invisible braces or literal commas in prose | ✅ → \lbrace/\rbrace; \, removed | | E1 | Error | $$ block with \\ — GitHub pre-processor strips backslashes | ✅ → ```math ``` | | E2 | Error | Consecutive $$ blocks without blank line — orphaned delimiter cascade | ✅ add blank line | | W1 | Warning | Bare ^* in $$ or $ block — markdown italic pairing eats the * | ✅ → ^{\ast} | | W2 | Warning | \begin{align} — not supported on GitHub | ✗ manual | | W3 | Warning | \boxed{} — can cause raw LaTeX passthrough | ✗ manual | | W4 | Warning | \operatorname{} — inconsistent GitHub support | ✗ manual |

E0 is the most dangerous: a single failing $$ block exposes its $$ delimiters as literal text, creating an orphaned $ that shifts ALL subsequent inline $...$ pairings. One broken equation takes down the entire document.

\{/\} trap: In $$ blocks, \left\{ becomes \left{ (invalid KaTeX delimiter → "Missing or unrecognized delimiter") and \{...\} set notation becomes invisible grouping. Fix: use \lbrace/\rbrace (letter-based, CommonMark-immune). This affects every equation using set notation like \{\hat{SR}_k\} or \min_T\left\{...\right\}.

Exits code 1 on errors (CI-friendly). Warnings do not block CI but should be reviewed.

Local Preview Tools

# GitHub-accurate hot-reload preview
bun add -g @hyrious/gfm
gfm your-file.md --serve

# Offline binary (gh extension)
gh extension install thiagokokada/gh-gfm-preview
gh gfm-preview your-file.md

VS Code extensions:

shd101wyy.markdown-preview-enhanced — closest to GitHub rendering
bierner.markdown-preview-github-styles — GitHub CSS styling

Multi-Agent Adversarial Equation Validation

For papers with 10+ equations, use this multi-agent pattern:

Phase 1 — Parallel Extraction

Agent A: Extract prose with pymupdf4llm, transcribe math from PDF screenshots
Agent B: Extract and categorize all images

Phase 2 — Parallel Validation

Agent C: Validate equations against reference implementation (if code/repo exists)
Agent D: Numerical spot-checks — compute paper's exhibit values, compare

Phase 3 — Discrepancy Handling

For each discrepancy: write /tmp/paper-discrepancy/eq-{N}.md
Spawn resolver agents to search online for authoritative third-party sources
Authority rule: Paper is tentatively more authoritative than code implementation; a third independent source breaks ties

Phase 4 — Guarded Application

Apply only HIGH-confidence fixes to the markdown
For MEDIUM-confidence: spawn an independent audit agent before touching the file
Document all discrepancies even if not fixed — future readers need to know

Anti-Patterns

| Anti-pattern | Why it fails | Fix | | ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- | | \!\left( or \, in $$ blocks | GH pre-processor strips \!→! before KaTeX — !\left( crashes KaTeX, cascades all | Remove \! \, \; (spacing only) — or use ```math ``` | | \left\{ or \{...\} in $$/$ blocks | \{→{ (CommonMark escape), so \left\{→\left{ = "Missing delimiter" error, and \{x\} renders without visible braces | Replace with \left\lbrace, \right\rbrace, \lbrace, \rbrace | | $$\begin{aligned}...\\...\end{aligned}$$ | \\ stripped by GH pre-processor | Use ```math ``` | | Trusting marker-pdf on Word PDFs | Returns no output or zero math (Unicode bug) | Read as screenshots, transcribe manually | | \begin{align} in display math | Not supported by GitHub | Replace with \begin{aligned} | | \operatorname{Cov} | Active GH bug — sometimes renders raw | Use \text{Cov} or \mathrm{Cov} | | KaTeX validation only, no ```math ``` conversion | KaTeX passes but GH pre-processor still breaks \\ | Also convert ALL multi-line blocks | | \boxed{} for highlighting | Can cause raw LaTeX passthrough on GitHub | Use bold text or a blockquote callout | | Excess kurtosis in formulas expecting Pearson | Silent ~50% underestimate in variance formulas | Always document convention; use scipy.stats.kurtosis(fisher=False) | | Consecutive $$ blocks without blank lines | GitHub collapses them into one broken block | Add blank line between each block | | Running validation AFTER pushing | Bugs visible in public repo | Validate locally before every push (--fix auto-corrects E0/E1/E2) |

References

| File | Purpose | | ------------------------------------------------------------------------- | -------------------------------------- | | validate-math.mjs | KaTeX batch validator for GFM files | | pdf-type-detection.md | Detailed guide to detecting PDF type | | github-math-support-table.md | Full supported/unsupported LaTeX table |

Related Skills

| Skill | Relationship | | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- | | pandoc-pdf-generation | Opposite direction: markdown → PDF | | documentation-standards | GFM formatting standards | | quant-research:opendeviation-eval-metrics | Worked example: references/how-to-use-the-sharpe-ratio-2026.md |

Post-Execution Reflection

After this skill completes, reflect before closing the task:

Locate yourself. — Find this SKILL.md's canonical path before editing.
What failed? — Fix the instruction that caused it.
What worked better than expected? — Promote to recommended practice.
What drifted? — Fix any script, reference, or dependency that no longer matches reality.
Log it. — Evolution-log entry with trigger, fix, and evidence.

Do NOT defer. The next invocation inherits whatever you leave behind.

Agent Skills: Academic PDF → GitHub GFM Conversion

Install this agent skill to your local

Skill Files