Agent Skills: Website Crawler for Whorl

ID: Uzay-G/whorl/website-crawler

skills/website-crawler/SKILL.md

Skill Metadata

Name: website-crawler
Description: Crawl and ingest websites into whorl. Use when scraping a personal site, blog, or extracting web content for the knowledge base.

Website Crawler for Whorl

Crawl websites and ingest content into your whorl knowledge base.

Prerequisites

Install trafilatura if not already available:

pip install trafilatura
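
To sanity-check the install from Python (the version lookup is just a convenience, not part of the skill itself):

from importlib.metadata import version

# Raises PackageNotFoundError if trafilatura is not installed
print(version("trafilatura"))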

Single Page

Extract a single page and save to whorl docs:

# Extract content as markdown
trafilatura -u "https://example.com/page" --markdown > ~/.whorl/docs/page-name.md
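
If you prefer to stay in Python, the same single-page extraction works through the library API; a minimal sketch assuming the same example URL and ~/.whorl/docs target:

from pathlib import Path
import trafilatura

url = "https://example.com/page"
downloaded = trafilatura.fetch_url(url)  # returns None if the fetch fails

if downloaded:
    # Same markdown output as the CLI --markdown flag
    content = trafilatura.extract(downloaded, output_format="markdown")
    if content:
        out = Path.home() / ".whorl" / "docs" / "page-name.md"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(content)
        print(f"Saved to {out}")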

Or with metadata in frontmatter:

URL="https://example.com/page"
SLUG=$(echo "$URL" | sed 's|https\?://||; s|/|_|g; s|_$||')
OUTPUT=~/.whorl/docs/"$SLUG".md

# Fetch and extract
CONTENT=$(trafilatura -u "$URL" --markdown)
TITLE=$(trafilatura -u "$URL" --json | python3 -c "import sys,json; print(json.load(sys.stdin).get('title','Untitled'))" 2>/dev/null || echo "Untitled")

# Write with frontmatter
cat > "$OUTPUT" << EOF
---
title: "$TITLE"
source_url: $URL
fetched_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)
---

$CONTENT
EOF

echo "Saved to $OUTPUT"

Crawl Entire Site

Crawl up to 30 pages from a site:

trafilatura --crawl "https://example.com" --markdown -o ~/.whorl/docs/site-name/

Or with sitemap:

trafilatura --sitemap "https://example.com/sitemap.xml" --markdown -o ~/.whorl/docs/site-name/
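
The sitemap route can also be driven from Python; a rough sketch using trafilatura's sitemaps module, where sitemap_search collects page URLs from the site's sitemaps:

from trafilatura import fetch_url, extract, sitemaps

# Discover page URLs from the sitemap(s), then extract each one
urls = sitemaps.sitemap_search("https://example.com")

for url in urls[:30]:  # rough cap, similar to the CLI crawl
    downloaded = fetch_url(url)
    if not downloaded:
        continue
    content = extract(downloaded, output_format="markdown")
    if content:
        print(url, len(content))  # write to ~/.whorl/docs as in the script below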

Crawl with Custom Limit

For more control, use Python:

from pathlib import Path
from datetime import datetime, timezone
import trafilatura
from trafilatura.spider import focused_crawler

WHORL_DOCS = Path.home() / ".whorl" / "docs"
site_dir = WHORL_DOCS / "my-site"
site_dir.mkdir(parents=True, exist_ok=True)

# focused_crawler returns (urls still to visit, all discovered URLs)
to_visit, known_urls = focused_crawler("https://example.com", max_seen_urls=50)

for url in sorted(known_urls):
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        continue

    content = trafilatura.extract(downloaded, output_format='markdown')
    metadata = trafilatura.extract_metadata(downloaded)

    if not content:
        continue

    # Generate filename from URL
    slug = url.split("//")[-1].replace("/", "_").rstrip("_")[:80]
    filepath = site_dir / f"{slug}.md"

    # Write with frontmatter
    title = metadata.title if metadata and metadata.title else "Untitled"
    frontmatter = f"""---
title: "{title}"
source_url: {url}
fetched_at: {datetime.now(timezone.utc).isoformat()}
---

"""
    filepath.write_text(frontmatter + content)
    print(f"+ {filepath.name}")

After Crawling

Run whorl sync to process new documents with ingestion agents:

whorl sync

Or if running locally without auth:

curl -X POST http://localhost:8000/api/sync
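
From Python, the same unauthenticated local sync call is a plain POST; a sketch assuming the localhost:8000 endpoint shown above:

import urllib.request

# Trigger ingestion on a locally running whorl instance (no auth), same as the curl call
req = urllib.request.Request("http://localhost:8000/api/sync", data=b"", method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())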

Tips

  • Rate limiting: trafilatura respects robots.txt and has built-in politeness
  • Deduplication: whorl's hash index will detect duplicate content
  • Binary files: PDFs and images should be downloaded separately with curl -O (see the sketch after this list)
  • Large sites: Use max_seen_urls to limit scope, or target specific sitemaps
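
For the binary-file case, a small Python stand-in for curl -O; the ~/.whorl/docs destination is an assumption, so point it wherever your setup keeps binaries:

import urllib.request
from pathlib import Path

# Download a PDF next to the crawled docs (destination directory is an assumption)
pdf_url = "https://example.com/paper.pdf"
dest = Path.home() / ".whorl" / "docs" / pdf_url.rsplit("/", 1)[-1]
dest.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(pdf_url, dest)
print(f"Saved {dest}")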