## Mental Model
Data acquisition converts unstructured web content into structured data. Choose the tool based on page complexity: JS-heavy pages → chrome-devtools MCP; static pages → Python requests.
## Tool Selection
| Page Type | Tool | When to Use |
|-----------|------|-------------|
| Dynamic (JS-rendered, SPAs) | chrome-devtools MCP | React/Vue apps, infinite scroll, login gates |
| Static HTML | Python requests | Blogs, news sites, simple pages |
| Complex/reusable logic | Python script | Multi-step scraping, rate limiting, proxies |
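For the static-HTML row, extraction is a fetch-then-parse step. A minimal stdlib-only sketch is below; in the skill's environment you would fetch with `requests` and parse with `beautifulsoup4` instead, and the inline HTML stands in for a fetched page:

```python
from html.parser import HTMLParser

# Stand-in for a fetched static page (real use: requests.get(url).text).
HTML = """
<html><body>
  <h2 class="title">First headline</h2>
  <h2 class="title">Second headline</h2>
</body></html>
"""

class TitleParser(HTMLParser):
    """Collects the text inside <h2 class="title"> tags."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

parser = TitleParser()
parser.feed(HTML)
print(parser.titles)  # ['First headline', 'Second headline']
```

With beautifulsoup4 the class-based selection collapses to one line (`soup.select("h2.title")`), which is why the setup script installs it.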
## Anti-Patterns (NEVER)
- Don't scrape without checking robots.txt
- Don't overload servers (default: 1 req/sec)
- Don't scrape personal data without consent
- Don't use Chinese characters in output filenames (ASCII only)
- Don't omit a User-Agent header that identifies your bot
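The first three rules can be enforced in code with the stdlib's `urllib.robotparser` plus a delay between requests. A sketch follows; the robots.txt content, bot name, and URLs are made up for illustration, and the actual HTTP call is left as a comment:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity; send this as the User-Agent header on every request.
USER_AGENT = "data-acquisition-bot/1.0"

robots = RobotFileParser()
# Real use: robots.set_url("https://example.com/robots.txt"); robots.read()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def polite_fetch(urls, delay=1.0):
    """Yield only robots.txt-allowed URLs, pausing `delay` seconds between requests."""
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # skip disallowed paths instead of fetching them
        # requests.get(url, headers={"User-Agent": USER_AGENT}) would go here
        yield url
        time.sleep(delay)  # default rate limit: 1 req/sec

allowed = list(polite_fetch(
    ["https://example.com/articles/1", "https://example.com/private/x"],
    delay=0,
))
print(allowed)  # ['https://example.com/articles/1']
```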
## Output Format
- JSON: Nested/hierarchical data
- CSV: Tabular data
- Filename: `{source}_{timestamp}.{ext}` (ASCII only, e.g., `news_20250115.csv`)
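The naming convention can be sketched as a small helper that forces the source label to ASCII (satisfying the filename rule above) and stamps the UTC date. The helper name and the `YYYYMMDD` timestamp granularity are assumptions for illustration:

```python
import re
from datetime import datetime, timezone

def output_filename(source: str, ext: str) -> str:
    """Build an ASCII-only filename like news_20250115.csv (UTC date)."""
    # Collapse any non-alphanumeric run (including non-ASCII) to "_".
    ascii_source = re.sub(r"[^A-Za-z0-9]+", "_", source).strip("_").lower()
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    return f"{ascii_source}_{timestamp}.{ext}"

name = output_filename("News Site!", "csv")
print(name)  # e.g. news_site_20250115.csv (date varies)
```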
## Workflow
- Ask: What data? Which sites? How much?
- Select tool based on page type
- Extract and save structured data
- Deliver file path to user or pass to data-analysis
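The "extract and save" step reduces to picking the format by shape: flat records → CSV, nested records → JSON. A stdlib sketch with dummy records is below; with pandas installed, `DataFrame(records).to_csv(path)` does the same job:

```python
import csv
import io
import json

# Dummy extracted records standing in for real scraped data.
records = [
    {"title": "First headline", "url": "https://example.com/a"},
    {"title": "Second headline", "url": "https://example.com/b"},
]

def to_csv(rows):
    """Serialize flat dict records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = to_csv(records)          # tabular shape -> CSV
json_text = json.dumps(records, indent=2)  # nested shape -> JSON
print(csv_text.splitlines()[0])  # title,url
```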
## Python Environment
Auto-initialize the virtual environment if needed, then execute:

```bash
cd skills/data-base
if [ ! -f ".venv/bin/python" ]; then
  echo "Creating Python environment..."
  ./setup.sh
fi
.venv/bin/python your_script.py
```
The setup script auto-installs requests, beautifulsoup4, pandas, and related web-scraping tools.
## References (load on demand)
For detailed APIs and templates, load: `references/REFERENCE.md`, `references/templates.md`