Agent Skills: Documentation Scraper with slurp-ai

Use when needing to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze sitemap first before scraping.

ID: ratacat/claude-skills/documentation-scraper

Documentation Scraper with slurp-ai

Overview

slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; its output is FOR AI consumption.

CRITICAL: Run Outside Sandbox

All commands in this skill MUST be run outside the sandbox. Use dangerouslyDisableSandbox: true for all Bash commands including:

  • which slurp (installation check)
  • node analyze-sitemap.js (sitemap analysis)
  • slurp (scraping)
  • File inspection commands (wc, head, cat, etc.)

The sandbox blocks network access and file operations required for web scraping.

Pre-Flight: Check Installation

Before scraping, verify slurp-ai is installed:

which slurp || echo "NOT INSTALLED"

If not installed, ask the user to run:

npm install -g slurp-ai

Requires: Node.js v20+

Do NOT proceed with scraping until slurp-ai is confirmed installed.
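
A minimal pre-flight sketch combining both checks (assumes a POSIX shell with node on PATH):

# Verify slurp-ai and the Node.js version in one pass
if ! command -v slurp >/dev/null 2>&1; then
  echo "slurp-ai NOT INSTALLED: ask the user to run npm install -g slurp-ai"
else
  node --version   # should report v20 or later
fi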

Commands

| Command | Purpose |
|---------|---------|
| slurp <url> | Fetch and compile in one step |
| slurp fetch <url> [version] | Download docs to partials only |
| slurp compile | Compile partials into single file |
| slurp read <package> [version] | Read local documentation |

Output: Creates slurp_compiled/compiled_docs.md from partials in slurp_partials/.
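
For example, the fetch and compile commands above support a two-step workflow where partials are downloaded first and compiled separately (a sketch; the URL is illustrative):

# Download partials only, then compile them into a single markdown file
slurp fetch https://docs.example.com/docs/
slurp compile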

CRITICAL: Analyze Sitemap First

Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your --base-path and --max decisions.

Step 1: Run Sitemap Analysis

Use the included analyze-sitemap.js script:

node analyze-sitemap.js https://docs.example.com

This outputs:

  • Total page count (informs --max)
  • URLs grouped by section (informs --base-path)
  • Suggested slurp commands with appropriate flags
  • Sample URLs to understand naming patterns
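
If the script is not in the current directory, a rough manual pass gives similar counts (a sketch assuming the site publishes a flat sitemap.xml rather than a sitemap index; the URL is illustrative):

# Count sitemap URLs by top-level section using curl and standard tools
curl -s https://docs.example.com/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's/<loc>//' -e 's/<\/loc>//' \
  | awk -F/ '{print "/" $4}' \
  | sort | uniq -c | sort -rn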

Step 2: Interpret the Output

Example output:

πŸ“Š Total URLs in sitemap: 247

πŸ“ URLs by top-level section:
   /docs                          182 pages
   /api                            45 pages
   /blog                           20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/     (67 pages)
   https://docs.example.com/docs/reference/  (52 pages)
   https://docs.example.com/api/             (45 pages)

πŸ’‘ Recommended slurp commands:

   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80

Step 3: Choose Scope Based on Analysis

| Sitemap Shows | Action |
|---------------|--------|
| < 50 pages total | Scrape entire site: slurp <url> --max 60 |
| 50-200 pages | Scope to relevant section with --base-path |
| 200+ pages | Must scope down - pick specific subsection |
| No sitemap found | Start with --max 30, inspect partials, adjust (see the sketch below) |
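
For the "No sitemap found" row, the iterative approach looks roughly like this (a sketch; the URL is illustrative):

# Start small, inspect what came back, then widen --max as needed
slurp https://docs.example.com/guide/ --max 30
ls slurp_partials | wc -l     # how many pages were actually fetched?
ls slurp_partials | head -20  # do the filenames cover the pages you need?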

Step 4: Frame the Slurp Command

With sitemap data, you can now set accurate parameters:

# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
  --base-path https://docs.example.com/docs/api/ \
  --max 55

Key insight: the starting URL is where crawling begins; the base path filters which links get followed. They can differ (useful when the base path itself returns 404).

Common Scraping Patterns

Library Documentation (versioned)

# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/

# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn

API Reference Only

slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/

Full Documentation Site

slurp https://docs.example.com/
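
Per the scoping table above, only scrape a full site when the sitemap shows it is small, and raise --max past the default 20 so the crawl is not cut off (a sketch; the page budget is illustrative):

# Small site (< 50 pages in its sitemap): crawl everything with headroom on --max
slurp https://docs.example.com/ --max 60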

CLI Options

| Flag | Default | Purpose |
|------|---------|---------|
| --max <n> | 20 | Maximum pages to scrape |
| --concurrency <n> | 5 | Parallel page requests |
| --headless <bool> | true | Use headless browser |
| --base-path <url> | start URL | Filter links to this prefix |
| --output <dir> | ./slurp_partials | Output directory for partials |
| --retry-count <n> | 3 | Retries for failed requests |
| --retry-delay <ms> | 1000 | Delay between retries |
| --yes | - | Skip confirmation prompts |

Compile Options

| Flag | Default | Purpose |
|------|---------|---------|
| --input <dir> | ./slurp_partials | Input directory |
| --output <file> | ./slurp_compiled/compiled_docs.md | Output file |
| --preserve-metadata | true | Keep metadata blocks |
| --remove-navigation | true | Strip nav elements |
| --remove-duplicates | true | Eliminate duplicates |
| --exclude <json> | - | JSON array of regex patterns to exclude |
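
For example, existing partials can be recompiled with unwanted pages filtered out (a sketch; the exclude patterns are illustrative):

# Recompile partials, excluding changelog and blog pages by regex
slurp compile --input ./slurp_partials --output ./slurp_compiled/compiled_docs.md \
  --exclude '["changelog", "/blog/"]'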

When to Disable Headless Mode

Use --headless false for:

  • Static HTML documentation sites
  • Faster scraping when JS rendering not needed

Default is headless (true) - works for most modern doc sites including SPAs.
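
A typical invocation for a static site (a sketch; the URL and page budget are illustrative):

# Static HTML docs: skip the headless browser for a faster crawl
slurp https://docs.example.com/legacy/ --headless false --max 40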

Output Structure

slurp_partials/              # Intermediate files
  ├── page1.md
  └── page2.md
slurp_compiled/              # Final output
  └── compiled_docs.md       # Compiled result

Quick Reference

# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80

# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes

# 4. Check output
head -n 100 slurp_compiled/compiled_docs.md

Common Issues

| Problem | Cause | Solution |
|---------|-------|----------|
| Wrong --max value | Guessing page count | Run analyze-sitemap.js first |
| Too few pages scraped | --max limit (default 20) | Set --max based on sitemap analysis |
| Missing content | JS not rendering | Ensure --headless true (default) |
| Crawl stuck/slow | Rate limiting | Reduce concurrency, e.g. --concurrency 3 (see the sketch below) |
| Duplicate sections | Similar content | Use --remove-duplicates (default) |
| Wrong pages included | Base path too broad | Use sitemap to find correct --base-path |
| Prompts blocking automation | Interactive mode | Add --yes flag |
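
For the rate-limiting row, a gentler crawl might look like this (a sketch; the flag values are illustrative starting points):

# Back off a rate-limited site: fewer parallel requests, more retries, longer delays
slurp https://docs.example.com/docs/ --concurrency 3 --retry-count 5 --retry-delay 3000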

Post-Scrape Usage

The output markdown is designed for AI context injection:

# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
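
If the compiled file is too large for the context budget, standard tools can carve out just the sections you need (a sketch; the line range is hypothetical):

# List top-level sections with line numbers, then extract only the relevant range
grep -n "^# " slurp_compiled/compiled_docs.md
sed -n '120,480p' slurp_compiled/compiled_docs.md > docs_excerpt.md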

When NOT to Use

  • API specs in OpenAPI/Swagger: Use dedicated parsers instead
  • GitHub READMEs: Fetch directly via raw.githubusercontent.com (see the sketch below)
  • npm package docs: Often better to read source + README
  • Frequently updated docs: Consider caching strategy
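
For the GitHub README case above, a direct fetch is simpler (a sketch; owner, repo, and branch are hypothetical placeholders):

# Fetch a raw README directly instead of crawling GitHub's HTML pages
curl -s https://raw.githubusercontent.com/someorg/somelib/main/README.md -o somelib_README.md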