Agent Skills: OCRmyPDF — Batch Processing Guide

OCRmyPDF batch processing skill — process multiple PDFs, Docker automation, shell scripting, and CI/CD integration. Use when the user needs to OCR many PDFs, set up automated OCR pipelines, or integrate OCR into workflows.

UncategorizedID: teachingai/full-stack-skills/ocrmypdf-batch

Install this agent skill to your local

pnpm dlx add-skill https://github.com/partme-ai/full-stack-skills/tree/HEAD/skills/ocrmypdf-skills/ocrmypdf-batch

Skill Files

Browse the full folder contents for ocrmypdf-batch.

Download Skill

Loading file tree…

skills/ocrmypdf-skills/ocrmypdf-batch/SKILL.md

Skill Metadata

Name
ocrmypdf-batch
Description
OCRmyPDF batch processing skill — process multiple PDFs, Docker automation, shell scripting, and CI/CD integration. Use when the user needs to OCR many PDFs, set up automated OCR pipelines, or integrate OCR into workflows.

OCRmyPDF — Batch Processing Guide

Overview

OCRmyPDF supports batch processing through shell scripting, Docker, and CI/CD integration for automated OCR pipelines.

For core OCR functionality, see the ocrmypdf skill. For image processing, see ocrmypdf-image. For optimization, see ocrmypdf-optimize.

Shell Loop

Basic batch

# Process all PDFs in directory
for f in *.pdf; do
    ocrmypdf "$f" "output/$f"
done

Parallel processing

# Use GNU parallel for faster processing
parallel ocrmypdf {} output/{/} ::: *.pdf

# Limit to 4 concurrent jobs
parallel -j 4 ocrmypdf {} output/{/} ::: *.pdf

Recursive batch

# Process all PDFs in directory tree
find . -name "*.pdf" -exec ocrmypdf {} output/{/} \;

Docker

Official image

# Pull image
docker pull jbarlow83/ocrmypdf

# Basic usage
docker run --rm \
    -v $(pwd):/data \
    jbarlow83/ocrmypdf \
    input.pdf output.pdf

Batch with Docker

# Process all PDFs
docker run --rm \
    -v $(pwd):/data \
    jbar65t83/ocrmypdf \
    ocrmypdf /data/input/*.pdf /data/output/

Docker Compose

version: '3'
services:
  ocrmypdf:
    image: jbarlow83/ocrmypdf
    volumes:
      - ./input:/data/input
      - ./output:/data/output
    command: sh -c "for f in /data/input/*.pdf; do ocrmypdf \"$f\" \"/data/output/$(basename $f)\"; done"

GitHub Actions

name: OCR PDFs
on: [push]
jobs:
  ocr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run OCR
        run: |
          docker run --rm \
            -v ${{ github.workspace }}:/data \
            jbarlow83/ocrmypdf \
            sh -c "for f in /data/*.pdf; do ocrmypdf \"$f\" \"/data/output/$(basename $f)\"; done"

CI/CD Examples

GitLab CI

ocr:
  image: jbarlow83/ocrmypdf
  script:
    - mkdir -p output
    - for f in *.pdf; do ocrmypdf "$f" "output/$f"; done
  artifacts:
    paths:
      - output/

Shell script template

#!/bin/bash
INPUT_DIR="input"
OUTPUT_DIR="output"
LANG="eng+chi_sim"

mkdir -p "$OUTPUT_DIR"

for pdf in "$INPUT_DIR"/*.pdf; do
    filename=$(basename "$pdf")
    echo "Processing: $filename"
    ocrmypdf -l "$LANG" --deskew --remove-bordering "$pdf" "$OUTPUT_DIR/$filename"
    echo "Done: $filename"
done

echo "Batch OCR complete!"

Error Handling

# Continue on error, log failures
for f in *.pdf; do
    if ! ocrmypdf "$f" "output/$f" 2>&1; then
        echo "FAILED: $f" >> failed.log
    fi
done

Performance Tips

  • Use --jobs N for multi-core processing
  • Use --output-type pdf (not pdfa) for faster processing when archival not needed
  • Pre-process images with --deskew and --clean to reduce file size
  • Use Docker layer caching in CI/CD for faster rebuilds

Quick Reference

| Task | Command | |------|---------| | Sequential batch | for f in *.pdf; do ocrmypdf "$f" out/"$f"; done | | Parallel batch | parallel ocrmypdf {} out/{/} ::: *.pdf | | Docker basic | docker run -v $(pwd):/data jbarlow83/ocrmypdf in.pdf out.pdf | | Recursive | find . -name "*.pdf" -exec ocrmypdf {} out/{/} \; |

Troubleshooting

  • Permission denied: Ensure output directory is writable.
  • Memory issues: Process in smaller batches or use --jobs 1.
  • Docker path issues: Use absolute paths with -v.