News Articles Rename
Purpose
Newspaper articles saved as PDFs or images typically arrive with unhelpful filenames like
Image 2026-02-25 15-52-38.pdf. This skill extracts the main headline from each file using
OCR and renames it to Article Title.pdf — making the News folder instantly browsable.
Target Folder
The default target folder is:
Vivien (PA)/News/
Supported File Types
Process any file with these extensions: .pdf, .png, .jpg, .jpeg
Skip hidden files (starting with .) and any file that doesn't match these extensions.
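The filter described above can be sketched as follows. This is a minimal illustration, not the bundled script's actual code: the names `SUPPORTED_EXTENSIONS`, `is_supported` and `list_supported_files` are hypothetical, and matching extensions case-insensitively (so `.JPEG` is accepted) is an assumption.

```python
from pathlib import Path

# Extensions listed in this document; lower-cased for comparison.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def is_supported(path: Path) -> bool:
    """Return True for non-hidden files with a supported extension."""
    if path.name.startswith("."):
        return False  # hidden file, e.g. .DS_Store
    # Assumption: extension matching is case-insensitive.
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


def list_supported_files(folder: Path) -> list[Path]:
    """List the files in `folder` (non-recursive) that should be processed."""
    return sorted(p for p in folder.iterdir() if p.is_file() and is_supported(p))
```

For example, `is_supported(Path(".hidden.pdf"))` is false because the name starts with a dot, even though the extension matches.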
How It Works
Run the bundled script scripts/rename_articles.py, which handles the full pipeline:
python3 <skill-path>/scripts/rename_articles.py "<news-folder-path>"
The script will:
- Scan the folder for all supported files
- For each PDF, render the first page as an image (300 DPI); image files are used as they are
- Run Tesseract OCR on the image
- Identify the headline using heuristics (skip metadata, collect first substantial text block)
- Apply common OCR corrections (e.g. "Al" → "AI")
- Sanitise the headline for use as a filename
- Rename the file, handling duplicates by appending a number
- Print a summary table of old → new filenames
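The correction, sanitising and duplicate-handling steps could look roughly like the sketch below. It is an illustration under stated assumptions rather than the script's real implementation: the function names, the set of characters treated as invalid, the 120-character limit and the ` (2)` numbering format are all hypothetical. Only the "Al" → "AI" correction comes from this document.

```python
import re
from pathlib import Path

# Only the "Al" -> "AI" rule is named in this document; any other
# entries would be assumptions.
OCR_CORRECTIONS = {r"\bAl\b": "AI"}

# Assumption: characters that are unsafe in filenames on common systems.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def apply_ocr_corrections(text: str) -> str:
    """Fix known OCR misreads using whole-word regex replacements."""
    for pattern, replacement in OCR_CORRECTIONS.items():
        text = re.sub(pattern, replacement, text)
    return text


def sanitise_headline(headline: str, max_length: int = 120) -> str:
    """Turn a headline into a filename-safe stem."""
    cleaned = INVALID_CHARS.sub(" ", headline)
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return cleaned[:max_length].rstrip(" .")


def unique_target(folder: Path, stem: str, suffix: str) -> Path:
    """Return a path in `folder` that does not exist yet, appending a number if needed."""
    candidate = folder / f"{stem}{suffix}"
    counter = 2
    while candidate.exists():
        candidate = folder / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate
```

A whole-word rule such as `\bAl\b` leaves words like "Also" alone, but it would also rewrite the name "Al", so a correction table like this trades a few false positives for fewer misread headlines.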
The script also handles mounted filesystem lock issues automatically by copying files to a temp directory for OCR processing when direct reads fail.
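One way to picture this fallback is a small wrapper that retries on a temporary copy. The sketch assumes that a locked file surfaces as an `OSError` on direct read; `read_with_fallback` and its `process` callback are hypothetical names, and the bundled script may detect and handle the condition differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_fallback(path: Path, process: Callable[[Path], T]) -> T:
    """Run process(path); if that raises OSError, retry on a temporary copy."""
    try:
        return process(path)
    except OSError:
        # Assumption: copying the file contents succeeds even when the
        # direct read failed.
        with tempfile.TemporaryDirectory() as tmp:
            local_copy = Path(tmp) / path.name
            shutil.copyfile(path, local_copy)
            return process(local_copy)
```

The temporary directory is removed as soon as `process` returns, so `process` should return the extracted text rather than a path inside that directory.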
After Running
Present the results as a clear summary table showing what was renamed:
| # | Original Filename | New Filename | Status |
|---|---|---|---|
| 1 | Image 2026-02-25 15-52-38.pdf | Headline Goes Here.pdf | ✅ Renamed |
| 2 | Image 2026-02-25 15-53-03.pdf | Another Article.pdf | ✅ Renamed |
| 3 | some-file.png | some-file.png | ⚠️ No title found |
Flag any files that couldn't be processed and explain why. Note that minor OCR artefacts in headlines (e.g. misread characters) are expected from Tesseract — only flag files where no headline could be extracted at all.
Important Notes
- Always process every supported file in the folder. Do not leave any supported file out.
- If OCR is uncertain about a headline, prefer keeping the original name over guessing wrong.
- The script handles both single-page and multi-page PDFs — only the first page is used for title extraction.
- For image files (.png, .jpg, .jpeg), OCR is run directly on the image.
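The rule of keeping the original name when no reliable headline is found can be sketched as below. The heuristic is a stand-in, since the script's actual heuristics are not shown in this document: the `METADATA` patterns, the 15-character threshold and the function names are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical patterns for lines that look like metadata, not headline text.
METADATA = re.compile(
    r"^(https?://|www\.|page \d+|\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})",
    re.IGNORECASE,
)


def pick_headline(ocr_text: str, min_chars: int = 15) -> Optional[str]:
    """Return the first text block of at least min_chars characters, else None."""
    block: list[str] = []
    # The trailing empty string flushes the final block.
    for raw in ocr_text.splitlines() + [""]:
        line = raw.strip()
        if line and not METADATA.match(line):
            block.append(line)
            continue
        # A blank or metadata line ends the current block.
        candidate = " ".join(block)
        if len(candidate) >= min_chars:
            return candidate
        block = []
    return None


def choose_stem(original_stem: str, ocr_text: str) -> str:
    """Prefer the extracted headline; otherwise keep the original name."""
    return pick_headline(ocr_text) or original_stem
```

With this sketch, OCR output that contains only short fragments such as a section label or a page number yields `None`, and `choose_stem` returns the original stem unchanged.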