PDF to EPUB Converter Skill

PDF to EPUB Converter

Convert PDF documents to high-quality EPUB files with automatic chapter detection, image optimization, and footnote hyperlinking.

Quick Start

# Run conversion (from skill directory)
python -m scripts.convert input.pdf output.epub

# Validate result
python -m scripts.validate input.pdf output.epub

Workflow Overview

The conversion follows a 3-phase process with an optional 4th phase for adaptation:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  1. ANALYZE     │ ──► │  2. CONVERT     │ ──► │  3. VALIDATE    │
│  - PDF structure│     │  - Apply config │     │  - Check quality│
│  - Generate cfg │     │  - Build EPUB   │     │  - Report issues│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                        │
                                                        ▼
                                              ┌─────────────────┐
                                              │  4. ADAPT       │
                                              │  (if needed)    │
                                              │  - Tune config  │
                                              │  - Modify code  │
                                              └─────────────────┘

Phase 1: Analyze

Before converting, analyze the PDF to determine the best configuration:

# Open PDF and examine structure
import fitz  # pymupdf
doc = fitz.open("input.pdf")

# Check for:
# 1. Number of pages
# 2. Presence of images
# 3. Multi-column layout (compare text block x-coordinates)
# 4. Footnotes/endnotes (numbers in margins or at page bottom)
# 5. Font sizes (for heading detection thresholds)

Generate initial config based on analysis:

Fiction book: Use default y_sort reading order
Academic paper: Enable xy_cut for columns
Magazine: Enable image optimization, use xy_cut

Ask user to confirm the proposed configuration before proceeding.

Phase 2: Convert

Run the conversion with the generated config:

from conversion.converter import Converter
from conversion.models import ConversionConfig

config = ConversionConfig(
    page_ranges=PageRanges(skip=[1, 2], content=(3, -3)),
    exclude_regions=ExcludeRegions(top=0.05, bottom=0.05),
    reading_order_strategy="y_sort",  # or "xy_cut" for columns
    image_optimization=ImageOptimizationConfig(enabled=True),
)

converter = Converter(strategy="simple")
result = converter.convert(pdf_path, epub_path, config)

# Check confidence
if result.reading_order_confidence < 0.7:
    print("Warning: Low confidence in reading order")

Phase 3: Validate

Always validate the conversion result:

from validation.completeness_checker import CompletenessChecker
from validation.order_checker import OrderChecker

# Check text completeness
completeness = CompletenessChecker().check(pdf_text, epub_text)
print(f"Completeness: {completeness.score:.1%}")  # Should be > 95%

# Check reading order
order = OrderChecker().check(pdf_chunks, epub_chunks)
print(f"Order score: {order.score:.1%}")  # Should be > 80%

Quality gates:

Completeness < 95%: Text is being lost
Order score < 80%: Reading order is wrong

Phase 4: Adapt (if validation fails)

See Decision Tree below.

Decision Tree: When Things Go Wrong

Validation failed?
│
├─► Text loss > 5%?
│   ├─► Check exclude_regions (headers/footers being cut?)
│   │   → Try: exclude_regions.top: 0.03 (reduce from 0.05)
│   ├─► Check page_ranges (skipping too many pages?)
│   │   → Try: page_ranges.skip: [] (don't skip any)
│   └─► Still failing? → See reference/troubleshooting.md#text-loss
│
├─► Wrong reading order?
│   ├─► PDF has columns?
│   │   → Try: reading_order_strategy: "xy_cut"
│   ├─► Columns detected but wrong?
│   │   → Try: multi_column.threshold: 0.3 (more sensitive)
│   └─► Still failing? → See reference/troubleshooting.md#order
│
├─► Headings not detected?
│   ├─► Headings only slightly larger than body?
│   │   → Try: heading_detection.font_size_threshold: 1.1
│   └─► Custom font patterns?
│   │   → May need to modify structure_classifier.py (ADAPTABLE)
│
├─► Footnotes not linking?
│   ├─► Non-standard format (not [1] or (1))?
│   │   → Add pattern to FootnoteDetector.PATTERNS
│   └─► See reference/troubleshooting.md#footnotes
│
└─► Other issue?
    └─► See reference/troubleshooting.md

Three-Layer Architecture

The codebase is organized into three layers with different modification policies:

Layer 1: FROZEN (Do Not Modify)

These files implement fixed specifications or deterministic algorithms:

| File | Reason | |------|--------| | core/epub_builder.py | EPUB3 spec is fixed | | core/text_segmenter.py | Validation depends on identical chunking | | validation/* | Metrics must be reproducible |

Never modify these files unless there's a fundamental bug.

Layer 2: CONFIGURABLE (Try Config First)

Before changing code, try adjusting configuration:

ConversionConfig:
├── page_ranges         # Which pages to process
├── exclude_regions     # Margins to ignore (headers/footers)
├── multi_column        # Column detection settings
├── reading_order_strategy  # "y_sort" or "xy_cut"
├── heading_detection   # Font size thresholds
├── footnote_processing # Footnote patterns
├── image_optimization  # Compression settings
└── metadata           # Title, author, language

See reference/config-tuning.md for all parameters.

Layer 3: ADAPTABLE (Can Modify If Config Fails)

These files contain heuristics that may need tuning for specific PDFs:

| File | What You Can Modify | |------|---------------------| | conversion/strategies/* | Create new strategy subclass | | detectors/structure_classifier.py | Heading detection heuristics | | detectors/reading_order/* | Add custom sorter algorithm | | detectors/footnote_detector.py | Add new footnote patterns |

See reference/code-adaptation.md for guidelines.

Project Structure

<skill-directory>/
├── SKILL.md                     # This file
├── requirements.txt             # Python dependencies
├── core/                        # FROZEN: Core algorithms
│   ├── epub_builder.py          # EPUB3 file creation
│   ├── pdf_extractor.py         # PDF text/image extraction
│   ├── text_segmenter.py        # Deterministic chunking
│   └── image_optimizer.py       # Image compression
│
├── conversion/                  # Main conversion logic
│   ├── converter.py             # Orchestrator
│   ├── models.py                # Data classes & configs
│   ├── strategies/              # ADAPTABLE: Conversion strategies
│   │   ├── base_strategy.py     # Template method pattern
│   │   └── simple_strategy.py
│   └── detectors/               # ADAPTABLE: Detection heuristics
│       ├── structure_classifier.py
│       ├── reading_order/
│       ├── footnote_detector.py
│       └── endnote_formatter.py
│
├── validation/                  # FROZEN: Quality checking
│   ├── completeness_checker.py
│   └── order_checker.py
│
├── scripts/                     # CLI entry points
│   ├── analyze.py
│   ├── convert.py
│   └── validate.py
│
├── reference/                   # Documentation
│   ├── workflow.md
│   ├── architecture.md
│   ├── troubleshooting.md
│   ├── config-tuning.md
│   └── code-adaptation.md
│
└── examples/                    # Example configurations
    ├── fiction-simple.json
    ├── academic-multicol.json
    └── magazine-images.json

Example Configurations

Fiction Book (simple layout)

{
  "page_ranges": {"skip": [1, 2], "content": [3, -3]},
  "exclude_regions": {"top": 0.05, "bottom": 0.05},
  "reading_order_strategy": "y_sort",
  "heading_detection": {"font_size_threshold": 1.2}
}

Academic Paper (2-column)

{
  "page_ranges": {"skip": [1], "content": [2, -1]},
  "exclude_regions": {"top": 0.08, "bottom": 0.08},
  "reading_order_strategy": "xy_cut",
  "multi_column": {"enabled": true, "threshold": 0.4}
}

Magazine (images + columns)

{
  "reading_order_strategy": "xy_cut",
  "multi_column": {"enabled": true, "column_count": 2},
  "image_optimization": {
    "enabled": true,
    "max_width": 800,
    "jpeg_quality": 75
  }
}

Reference Documentation

For detailed information, see:

Workflow Details - Complete phase-by-phase guide
Architecture - Three-layer system explanation
Troubleshooting - Common problems and solutions
Config Tuning - All configuration parameters
Code Adaptation - When and how to modify code

Common Commands

# Full conversion with validation (from skill directory)
python -m scripts.convert input.pdf output.epub && \
python -m scripts.validate input.pdf output.epub

# Analyze PDF structure
python -m scripts.analyze input.pdf

Quality Metrics

After conversion, always check:

| Metric | Good | Warning | Bad | |--------|------|---------|-----| | Text completeness | > 98% | 95-98% | < 95% | | Reading order | > 90% | 80-90% | < 80% | | Confidence | > 0.8 | 0.6-0.8 | < 0.6 |

If any metric is in "Warning" or "Bad" range, follow the Decision Tree above.

Agent Skills: PDF to EPUB Converter

Skill Files