Agent Skills: Article Extraction Skill

Extract clean article content from web pages, removing ads and clutter for reading and archiving

Category: Uncategorized
ID: ljchg12-hue/dotfiles/article-extraction

Install this agent skill locally:

pnpm dlx add-skill https://github.com/ljchg12-hue/dotfiles/tree/HEAD/skills/article-extraction


skills/article-extraction/SKILL.md

Skill Metadata

Name: article-extraction
Description: Extract clean article content from web pages, removing ads and clutter for reading and archiving

Article Extraction Skill

Extract clean article text from web pages, removing ads, navigation, and clutter.

When to Use

  • Content archiving
  • Research collection
  • Reading list management
  • Content analysis

Core Capabilities

  • Main content extraction
  • Metadata extraction (title, author, date)
  • Image extraction
  • Clean HTML/Markdown output
  • Multi-page article handling
  • Paywall bypass (where legal)
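To make the first two capabilities concrete, here is a minimal sketch of main content and metadata extraction using only the Python standard library. It is a deliberately naive heuristic, not how Readability, newspaper3k, or Trafilatura work internally: the `SKIP_TAGS` set, the `NaiveArticleParser` class, and the `extract_naive` function are hypothetical names introduced for illustration, and the sketch only recovers the title, the author meta tag, and paragraph text.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as clutter (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "header", "form"}


class NaiveArticleParser(HTMLParser):
    """Collects <title>, <meta name="author">, and <p> text outside clutter tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.author = ""
        self.paragraphs = []
        self._skip_depth = 0   # > 0 while inside any clutter tag
        self._in_title = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "author":
            self.author = attrs.get("content") or ""
        elif tag == "p" and self._skip_depth == 0:
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p and self._skip_depth == 0:
            self._buf.append(data)


def extract_naive(html):
    """Return a dict with the page title, author, and paragraph text."""
    parser = NaiveArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "author": parser.author,
        "text": "\n\n".join(parser.paragraphs),
    }
```

Real pages rarely mark up their content this cleanly, which is why the dedicated libraries listed under Tools score candidate blocks instead of relying on a fixed tag list.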

Tools

# Readability (Node.js); outside a browser it needs a DOM implementation such as jsdom
npm install @mozilla/readability jsdom

# newspaper3k (Python)
pip install newspaper3k
python -c "from newspaper import Article; a = Article('URL'); a.download(); a.parse(); print(a.text)"

# Trafilatura (Python)
pip install trafilatura
trafilatura -u "URL"
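Since no single extractor handles every site, one option is to try several in order and keep the first non-empty result. The sketch below assumes the HTML has already been downloaded. `extract_first` and `trafilatura_extractor` are hypothetical helper names; the only library call used is `trafilatura.extract`, which returns the main text or `None`. The import is done lazily so the module loads even when Trafilatura is not installed.

```python
from typing import Callable, Iterable, Optional

# An extractor takes raw HTML and returns the article text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def trafilatura_extractor(html: str) -> Optional[str]:
    """Extract main text with Trafilatura (requires `pip install trafilatura`)."""
    import trafilatura  # lazy import: only needed when this extractor is used

    return trafilatura.extract(html)


def extract_first(html: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Run extractors in order and return the first non-empty, stripped result."""
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            # Broad catch is intentional in this sketch: a failing or missing
            # extractor should not stop the remaining ones from running.
            continue
        if text and text.strip():
            return text.strip()
    return None
```

A call such as `extract_first(html, [trafilatura_extractor, my_fallback])` would try Trafilatura first and then `my_fallback`, where `my_fallback` stands for any function with the same signature.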

Best Practices

  • Respect robots.txt
  • Cache extracted content
  • Preserve attribution
  • Handle different CMS formats
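The first three practices can be sketched with the standard library alone. The example below checks a URL against robots.txt rules with `urllib.robotparser`, then stores extracted text in a JSON file that keeps the source URL for attribution. The function names, the `article-extractor` user agent string, and the cache layout (one file per URL, named by its SHA-256 hash) are assumptions made for illustration; the robots.txt content is passed in as a string so the sketch does not depend on network access.

```python
import hashlib
import json
import time
from pathlib import Path
from urllib import robotparser


def is_allowed(robots_txt: str, url: str, user_agent: str = "article-extractor") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.json"


def save_article(cache_dir: Path, url: str, title: str, text: str) -> Path:
    """Write the extracted article, with its source URL, to the cache."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "url": url,                 # kept for attribution
        "title": title,
        "text": text,
        "fetched_at": time.time(),  # seconds since the epoch
    }
    path = cache_path(cache_dir, url)
    path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return path


def load_article(cache_dir: Path, url: str):
    """Return the cached record for url, or None if it was never saved."""
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Checking `load_article` before fetching avoids repeated requests to the same page, and storing `fetched_at` leaves room for an expiry policy later.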

Resources

  • Readability: https://github.com/mozilla/readability
  • newspaper3k: https://github.com/codelucas/newspaper
  • Trafilatura: https://github.com/adbar/trafilatura