Agent Skills: Scrape Milan Jovanovic Blog Posts

Scrape new articles from Milan Jovanovic's blog (November 2025+). Optimized - pre-filters from listing page, only scrapes new articles.

Uncategorized | ID: melodic-software/claude-code-plugins/scrape-posts

Install this agent skill locally:

pnpm dlx add-skill https://github.com/melodic-software/claude-code-plugins/tree/HEAD/plugins/milan-jovanovic/skills/scrape-posts

plugins/milan-jovanovic/skills/scrape-posts/SKILL.md

Scrape Milan Jovanovic Blog Posts

Scrape new articles from Milan Jovanovic's .NET blog with optimized pre-filtering. Parses dates from listing page to avoid unnecessary per-article scraping.

Arguments

  • --force: Re-scrape all articles, including already indexed ones (content hashes are compared so unchanged articles are not rewritten)
  • --since YYYY-MM-DD: Custom date filter (default: 2025-11-01)
  • --limit N: Limit number of articles (for testing)
  • --dry-run: Preview what would be scraped without saving
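
The flags above could be modeled as follows. This is a hypothetical sketch only: the skill is invoked as a slash command, and nothing in the source says its arguments are parsed with argparse. The build_parser name is an assumption introduced for illustration.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="scrape-posts")
    parser.add_argument("--force", action="store_true",
                        help="re-scrape all articles, skipping unchanged content")
    # date.fromisoformat accepts YYYY-MM-DD and raises ValueError otherwise,
    # which argparse turns into a usage error.
    parser.add_argument("--since", type=date.fromisoformat,
                        default=date(2025, 11, 1),
                        help="custom date filter (YYYY-MM-DD)")
    parser.add_argument("--limit", type=int, default=None,
                        help="limit number of articles (for testing)")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview without saving")
    return parser


# Example: equivalent of "--since 2025-12-01 --limit 3 --dry-run"
args = build_parser().parse_args(
    ["--since", "2025-12-01", "--limit", "3", "--dry-run"]
)
```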

Optimized Workflow

Step 1: Invoke Skill

Invoke the milan-jovanovic:milan-jovanovic-blog skill to load context and access scripts.

Step 2: Pre-Filter from Listing Page (OPTIMIZATION)

Key efficiency optimization: Parse dates from listing page BEFORE scraping individual articles.

  1. Scrape the blog listing page using firecrawl_scrape:

    URL: https://www.milanjovanovic.tech/blog
    Format: markdown
    
  2. Save listing content to temp file (e.g., .claude/temp/milan-listing.md)

  3. Run pre-filter script to identify articles needing scraping:

    # Normal mode - only new articles
    python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --since 2025-11-01
    
    # Force mode - include existing for re-check
    python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --force --since 2025-11-01
    
  4. Parse JSON output to get to_scrape list. If empty, skip to Step 5 (no scraping needed).
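
A minimal sketch of item 4, assuming the script's JSON output has a top-level to_scrape list whose entries carry in_index and, for indexed articles, content_hash (these names appear in this document). The url field, the sample values, and the split_to_scrape helper are assumptions made for illustration; the real output shape may differ.

```python
import json


def split_to_scrape(prefilter_json: str):
    """Split the pre-filter output into new articles and force re-checks."""
    report = json.loads(prefilter_json)
    to_scrape = report.get("to_scrape", [])
    new_articles = [a for a in to_scrape if not a.get("in_index", False)]
    recheck = [a for a in to_scrape if a.get("in_index", False)]
    return new_articles, recheck


# Hypothetical sample output; "url" and the hash value are made up.
sample = json.dumps({
    "to_scrape": [
        {"url": "https://www.milanjovanovic.tech/blog/example-new-post",
         "in_index": False},
        {"url": "https://www.milanjovanovic.tech/blog/example-old-post",
         "in_index": True, "content_hash": "abc123"},
    ]
})

new_articles, recheck = split_to_scrape(sample)
if not new_articles and not recheck:
    print("Nothing to scrape, skipping to Step 5")
```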

Step 3: Scrape Only Needed Articles

For each article in to_scrape:

  1. For articles with in_index: false (new):

    • Scrape full article with firecrawl_scrape
    • Extract publication date from metadata
    • Clean promotional content
    • Save to canonical/milanjovanovic-tech/blog/{slug}.md
  2. For articles with in_index: true (force mode re-check):

    • Scrape full article with firecrawl_scrape
    • Clean promotional content
    • Generate content hash
    • Compare to content_hash from pre-filter output
    • If unchanged, skip writing (log as "skipped - unchanged")
    • If changed, save updated content
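
The force-mode comparison could look roughly like this. The source does not say which hash algorithm or normalization the scripts use, so SHA-256 over newline-normalized text is an assumption, as are the content_hash and decide_write helper names.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """Hash cleaned article text (assumed: SHA-256, normalized newlines)."""
    normalized = text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def decide_write(cleaned: str, indexed_hash: Optional[str]) -> str:
    """Return the action for one article after scraping and cleanup."""
    if indexed_hash is None:
        return "write"  # new article, nothing to compare against
    if content_hash(cleaned) == indexed_hash:
        return "skipped - unchanged"
    return "write"  # content changed, save the update


previous = content_hash("# Title\r\nBody\n")
print(decide_write("# Title\nBody", previous))         # skipped - unchanged
print(decide_write("# Title\nBody edited", previous))  # write
```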

Step 4: Update Index

After scraping completes:

python scripts/management/refresh_index.py

Step 5: Report Statistics

Report:

  • Articles found on listing page
  • Articles needing scraping (new + force re-check)
  • Articles skipped (already indexed, not in force mode)
  • Articles skipped (unchanged content hash, force mode)
  • Articles filtered (before cutoff date)
  • Any errors
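
One way to collect these figures during a run is a small counter object. The ScrapeStats class and its field names are hypothetical and only mirror the bullet list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScrapeStats:
    """Hypothetical container for the Step 5 report."""
    found: int = 0
    to_scrape: int = 0
    skipped_indexed: int = 0
    skipped_unchanged: int = 0
    filtered_before_cutoff: int = 0
    errors: List[str] = field(default_factory=list)

    def summary(self) -> str:
        lines = [
            f"Articles found on listing page: {self.found}",
            f"Articles needing scraping: {self.to_scrape}",
            f"Skipped (already indexed): {self.skipped_indexed}",
            f"Skipped (unchanged content hash): {self.skipped_unchanged}",
            f"Filtered (before cutoff date): {self.filtered_before_cutoff}",
            f"Errors: {len(self.errors)}",
        ]
        # List each error message under the totals.
        lines.extend(f"  - {message}" for message in self.errors)
        return "\n".join(lines)


stats = ScrapeStats(found=10, to_scrape=1, skipped_indexed=6,
                    filtered_before_cutoff=3)
stats.errors.append("example-post: request timed out")
print(stats.summary())
```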

Content Cleanup Patterns

The scraper removes these promotional patterns:

Footer patterns (stop processing):

  • "Whenever you're ready, there are X ways I can help you"
  • "Become a Better .NET Software Engineer"
  • "Hi, I'm Milan"

Sponsor patterns (remove section):

  • AuthKit/WorkOS mentions
  • "Sponsor this newsletter" links
  • Incident response sponsor content

Inline patterns (remove):

  • Reading time ("5 min read")
  • "Manage read history" links
  • Empty image placeholders
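
A minimal sketch of the footer and inline rules, under stated assumptions: the regexes are approximations written for this example (the script's actual patterns are not shown in this document), an empty image placeholder is assumed to be a Markdown image with no target, and sponsor sections are left out because their boundaries are not described here.

```python
import re

# Footer markers: everything from the first match onward is dropped.
FOOTER_MARKERS = [
    re.compile(r"Whenever you['’]re ready, there are \w+ ways I can help you"),
    re.compile(r"Become a Better \.NET Software Engineer"),
    re.compile(r"Hi, I['’]m Milan"),
]
# Inline patterns (assumed forms).
READING_TIME = re.compile(r"^\s*\d+\s+min read\s*$")
EMPTY_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*\)")


def clean_article(markdown: str) -> str:
    kept = []
    for line in markdown.splitlines():
        if any(p.search(line) for p in FOOTER_MARKERS):
            break  # footer reached: stop processing
        if READING_TIME.match(line):
            continue  # drop standalone reading-time lines
        kept.append(EMPTY_IMAGE.sub("", line))
    return "\n".join(kept).rstrip()


raw = "# Title\n5 min read\nBody text ![]()\n\nHi, I'm Milan\nPromo text"
cleaned = clean_article(raw)
```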

Efficiency Gains

| Scenario | Without Optimization | With Optimization |
|----------|----------------------|-------------------|
| No new articles | 10+ firecrawl requests | 1-2 requests |
| 1 new article | 10+ firecrawl requests | 2-3 requests |
| Force (unchanged) | 10+ requests | 10+ requests but skips writes |

Why this matters: Firecrawl has API costs and rate limits. In normal mode, pre-filtering saves roughly 80-90% of requests when there are no new articles. In force mode it does not reduce requests, only unnecessary writes.
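
As a quick check of the 80-90% figure, the arithmetic for the "No new articles" row can be worked through as below. The request counts (10 without pre-filtering, 1 or 2 with it) are the table's lower bounds used for illustration, not measurements.

```python
def request_savings(before: int, after: int) -> float:
    """Percentage of requests avoided by pre-filtering."""
    return 100 * (before - after) / before


best_case = request_savings(10, 1)   # 90.0
worst_case = request_savings(10, 2)  # 80.0
print(f"Savings with no new articles: {worst_case:.0f}-{best_case:.0f}%")
```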

Example Usage

/milan-jovanovic:scrape-posts
/milan-jovanovic:scrape-posts --limit 3 --dry-run
/milan-jovanovic:scrape-posts --force
/milan-jovanovic:scrape-posts --since 2025-12-01

Troubleshooting

Firecrawl Not Available

If firecrawl MCP is not connected, the command will fail. Ensure the firecrawl MCP server is configured and running.

Date Parsing Issues

If a date on the listing page can't be parsed, the script logs the article under the no_date category. These articles are skipped unless you provide a specific URL.
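
The idea of sorting unparseable entries into no_date can be sketched like this. The listing's real date format and the script's regexes are not shown in this document, so the "Month D, YYYY" pattern is an assumption, and every name except no_date is hypothetical.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}
# Assumed format: "November 15, 2025" or "Nov 15, 2025".
DATE_RE = re.compile(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2}),\s+(\d{4})\b")


def parse_listing_date(text: str):
    match = DATE_RE.search(text)
    if not match:
        return None
    month = MONTHS.get(match.group(1).lower())
    if month is None:
        return None
    try:
        return date(int(match.group(3)), month, int(match.group(2)))
    except ValueError:  # e.g. "Feb 31, 2025"
        return None


def categorize(entries, since: date):
    """entries: iterable of (title, raw_date_text) pairs."""
    buckets = {"candidates": [], "before_cutoff": [], "no_date": []}
    for title, raw in entries:
        parsed = parse_listing_date(raw)
        if parsed is None:
            buckets["no_date"].append(title)
        elif parsed < since:
            buckets["before_cutoff"].append(title)
        else:
            buckets["candidates"].append(title)
    return buckets


result = categorize(
    [("A", "November 15, 2025"), ("B", "Oct 3, 2025"), ("C", "coming soon")],
    since=date(2025, 11, 1),
)
```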

Pre-Filter Shows 0 Articles

If check_new_articles.py shows 0 articles to scrape:

  • All articles are already indexed (use --force to re-check)
  • All articles are before the cutoff date (adjust --since)
  • Listing page format changed (check regex patterns in script)