Agent Skills: Web Scraping Skill v3.0

Intelligent web scraper with content extraction, multiple output formats, and error handling

UncategorizedID: aiskillstore/marketplace/web-scrape

Install this agent skill to your local

pnpm dlx add-skill https://github.com/aiskillstore/marketplace/tree/HEAD/skills/21pounder/web-scrape

Skill Files

Browse the full folder contents for web-scrape.

Download Skill

Loading file tree…

skills/21pounder/web-scrape/SKILL.md

Skill Metadata

Name
web-scrape
Description
Intelligent web scraper with content extraction, multiple output formats, and error handling

Web Scraping Skill v3.0

Usage

/web-scrape <url> [options]

Options:

  • --format=markdown|json|text - Output format (default: markdown)
  • --full - Include full page content (skip smart extraction)
  • --screenshot - Also save a screenshot
  • --scroll - Scroll to load dynamic content (infinite scroll pages)

Examples:

/web-scrape https://example.com/article
/web-scrape https://news.site.com/story --format=json
/web-scrape https://spa-app.com/page --scroll --screenshot

Execution Flow

Phase 1: Navigate and Load

1. mcp__playwright__browser_navigate
   url: "<target URL>"

2. mcp__playwright__browser_wait_for
   time: 2  (allow initial render)

If --scroll option: Execute scroll sequence to trigger lazy loading:

3. mcp__playwright__browser_evaluate
   function: "async () => {
     for (let i = 0; i < 3; i++) {
       window.scrollTo(0, document.body.scrollHeight);
       await new Promise(r => setTimeout(r, 1000));
     }
     window.scrollTo(0, 0);
   }"

Phase 2: Capture Content

4. mcp__playwright__browser_snapshot
   → Returns full accessibility tree with all text content

If --screenshot option:

5. mcp__playwright__browser_take_screenshot
   filename: "scraped_<domain>_<timestamp>.png"
   fullPage: true

Phase 3: Close Browser

6. mcp__playwright__browser_close

Smart Content Extraction

After getting the snapshot, apply intelligent extraction:

Step 1: Identify Content Type

| Page Type | Indicators | Extraction Strategy | |-----------|------------|---------------------| | Article/Blog | <article>, long paragraphs, date/author | Extract main article body | | Product Page | Price, "Add to Cart", specs | Extract title, price, description, specs | | Documentation | Code blocks, headings hierarchy | Preserve structure and code | | List/Search | Repeated item patterns | Extract as structured list | | Landing Page | Hero section, CTAs | Extract key messaging |

Step 2: Filter Noise

ALWAYS REMOVE these elements from output:

  • Navigation menus and breadcrumbs
  • Footer content (copyright, links)
  • Sidebars (ads, related articles, social links)
  • Cookie banners and popups
  • Comments section (unless specifically requested)
  • Share buttons and social widgets
  • Login/signup prompts

Step 3: Structure the Content

For Articles:

# [Title]

**Source:** [URL]
**Date:** [if available]
**Author:** [if available]

---

[Main content in clean markdown]

For Product Pages:

# [Product Name]

**Price:** [price]
**Availability:** [in stock/out of stock]

## Description
[product description]

## Specifications
| Spec | Value |
|------|-------|
| ... | ... |

Output Formats

Markdown (default)

Clean, readable markdown with proper headings, lists, and formatting.

JSON

{
  "url": "https://...",
  "title": "Page Title",
  "type": "article|product|docs|list",
  "content": {
    "main": "...",
    "metadata": {}
  },
  "extracted_at": "ISO timestamp"
}

Text

Plain text with minimal formatting, suitable for further processing.


Error Handling

Navigation Errors

| Error | Detection | Action | |-------|-----------|--------| | Timeout | Page doesn't load in 30s | Report error, suggest retry | | 404 Not Found | "404" in title/content | Report "Page not found" | | 403 Forbidden | "403", "Access Denied" | Report access restriction | | CAPTCHA | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed | | Paywall | "subscribe", "premium content" | Extract visible content, note paywall |

Recovery Actions

If page load fails:
1. Report the specific error to user
2. Suggest: "Try again?" or "Different URL?"
3. Close browser cleanly

If content is blocked:
1. Report what was detected (CAPTCHA/paywall/geo-block)
2. Extract any available preview content
3. Suggest alternatives if applicable

Advanced Scenarios

Single Page Applications (SPA)

1. Navigate to URL
2. Wait longer (3-5 seconds) for JS hydration
3. Use browser_wait_for with specific text if known
4. Then snapshot

Infinite Scroll Pages

1. Navigate
2. Execute scroll loop (see Phase 1)
3. Snapshot after scrolling completes

Pages with Click-to-Reveal Content

1. Snapshot first to identify clickable elements
2. Use browser_click on "Read more" / "Show all" buttons
3. Wait briefly
4. Snapshot again for full content

Multi-page Articles

1. Scrape first page
2. Identify "Next" or pagination links
3. Ask user: "Article has X pages. Scrape all?"
4. If yes, iterate through pages and combine

Performance Guidelines

| Metric | Target | How | |--------|--------|-----| | Speed | < 15 seconds | Minimal waits, parallel where possible | | Token Usage | < 5000 tokens | Smart extraction, not full DOM | | Reliability | > 95% success | Proper error handling |


Security Notes

  • Never execute arbitrary JavaScript from the page
  • Don't follow redirects to suspicious domains
  • Don't submit forms or click login buttons
  • Don't scrape pages that require authentication (unless user provides credentials flow)
  • Respect robots.txt when mentioned by user

Quick Reference

Minimum viable scrape (4 tool calls):

1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close

Full-featured scrape (with scroll + screenshot):

1. browser_navigate
2. browser_wait_for
3. browser_evaluate (scroll)
4. browser_snapshot
5. browser_take_screenshot
6. browser_close

Remember: The goal is to deliver clean, useful content to the user, not raw HTML/DOM dumps.