Scraping Skill | Agent Skills

Scraping

Web scraping using nu-shell and browser tools for data extraction.

Prerequisites

nu-shell installed (nu)
query web plugin installed (for HTML scraping): nu -c "plugin add query web"
Browser extension enabled (for dynamic content): Enable the browser extension in your agent configuration

Common Tasks

Fetching Web Pages

Use http get to retrieve HTML content:

# Simple GET request
nu -c 'http get https://example.com'

# With headers
nu -c 'http get -H [User-Agent "My Scraper"] https://example.com'

HTML Parsing and Data Extraction

Use the query web plugin to parse HTML and extract data using CSS selectors:

# Extract text from elements
nu -c 'http get https://example.com | query web -q "h1, h2" | str trim'

# Extract attributes
nu -c 'http get https://example.com | query web -a href "a"'

# Parse tables as structured data
nu -c 'http get https://example.com/table-page | query web --as-table ["Column1" "Column2"]'

Browser-Based Scraping for Dynamic Content

For websites requiring JavaScript execution or complex DOM interactions, use browser automation tools.

# Start browser
start-browser

# Navigate to page
navigate-browser --url https://example.com

# Extract data with JavaScript evaluation
evaluate-javascript --code "Array.from(document.querySelectorAll('selector')).map(e => e.textContent)"

# Screenshot for visual inspection
take-screenshot

# Query HTML fragments
query-html-elements --selector ".content"

API Interactions

For JSON APIs, use http get and parse with from json:

# GET JSON API
nu -c 'http get https://api.example.com/data | from json'

# POST requests
nu -c 'http post https://api.example.com/submit -t application/json {key: value}'

Handling Authentication

# Basic auth
nu -c 'http get -u username:password https://api.example.com'

# Bearer token
nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com'

# Custom headers
nu -c 'http get -H [X-API-Key "YOUR_KEY" User-Agent "Scraper"] https://api.example.com'

Rate Limiting and Delays

# Add delays between requests
nu -c '$urls | each { |url| http get $url; sleep 1sec }'

Parallel Processing

# Scrape multiple pages in parallel
nu -c '$urls | par-each { |url| http get $url | query web -q ".data" }'

One-liner Examples

Basic HTML Scraping

# Extract all h1 titles
nu -c 'http get https://example.com | query web -q "h1"'

# Get all links
nu -c 'http get https://example.com | query web -a href "a"'

# Scrape product prices
nu -c 'http get https://store.example.com | query web -q ".price"'

HTML Scraping Example: Hacker News

# Scrape HN front page titles and URLs
nu -c 'http get https://news.ycombinator.com/ | query web -q ".titleline a" | get text | zip (http get https://news.ycombinator.com/ | query web -a href ".titleline a" | get href) | each { |pair| echo $"($pair.0) - ($pair.1)" }'

For static sites like HN, use http get directly. Reserve browser tools for dynamic content requiring JavaScript execution.

GitHub Stars Scraper

# Get star count for a repo
nu -c 'http get https://api.github.com/repos/nushell/nushell | get stargazers_count'

API Data Extraction

# Fetch JSON and extract fields
nu -c 'http get https://api.example.com/users | from json | get -i 0.name'

API Authentication

# Bearer token
nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com/data'

# API key
nu -c 'http get -H [X-API-Key "YOUR_API_KEY"] https://api.example.com/data'

# Basic auth
nu -c 'http get -u username:password https://api.example.com/protected'

Related Skills

nu-shell: Core nu-shell scripting patterns and commands.

Related Tools

start-browser: Start Cromite browser via Puppeteer.
navigate-browser: Navigate to a URL in the browser.
evaluate-javascript: Evaluate JavaScript code in the active browser tab.
take-screenshot: Take a screenshot of the active browser tab.
query-html-elements: Extract HTML elements by CSS selector.
list-browser-tabs: List all open browser tabs with their titles and URLs.
close-tab: Close a browser tab by index or title.
switch-tab: Switch to a specific tab by index.
refresh-tab: Refresh the current tab.
current-url: Get the URL of the current active tab.
page-title: Get the title of the current active tab.
wait-for-element: Wait for a CSS selector to appear on the page.
click-element: Click on an element by CSS selector.
type-text: Type text into an input field.
extract-text: Extract text content from elements by CSS selector.
search-web: Perform web searches and extract information from search results.

Agent Skills: Scraping

Install this agent skill to your local

Skill Files