# Clio-Style Clustering Pipeline

Build an end-to-end semantic clustering analysis from any text data source, with interactive visualization.
## What This Skill Does

This skill guides you through building a complete clustering pipeline:
- **Data Sourcing** - Identify APIs/methods to fetch data, build tests to verify access
- **Scraping** - Collect data with proper pagination and rate limiting
- **Embedding** - Generate embeddings using OpenAI's text-embedding-3-large
- **Clustering** - Hierarchical HDBSCAN clustering with UMAP projection
- **Labeling** - LLM-powered cluster naming and description
- **Visualization** - Interactive React/D3 explorer with drill-down
## Quick Start

When the user describes a data source (e.g., "GitHub issues from facebook/react"), follow these steps:

### Phase 1: Data Source Discovery
First, identify how to access the data:
- **Research the API** - Use web search to find official API documentation
- **Identify authentication** - What tokens/keys are needed?
- **Find pagination patterns** - How does the API handle large datasets?
- **Determine rate limits** - What are the constraints?
See `data-sourcing.md` for common patterns (GitHub, Slack, etc.).
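Many of these APIs (GitHub among them) advertise pagination through the RFC 5988 `Link` response header. As a minimal, self-contained sketch of what that pattern looks like, a parser for the header might be:

```python
import re

def parse_link_header(link_header: str) -> dict:
    """Parse an RFC 5988 Link header (as sent by GitHub and others)
    into a {rel: url} dict, e.g. {'next': ..., 'last': ...}."""
    links = {}
    for part in link_header.split(","):
        match = re.search(r'<([^>]+)>;\s*rel="([^"]+)"', part)
        if match:
            url, rel = match.group(1), match.group(2)
            links[rel] = url
    return links
```

Note that with `requests` the same information is already exposed as `response.links`; the sketch is mainly useful for understanding what the header contains.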
### Phase 2: Build & Test Data Fetcher

**IMPORTANT:** Write tests BEFORE building the full scraper.
```python
# test_fetcher.py - Verify API access works
import os
import requests

def test_api_access():
    """Verify we can access the API."""
    # Adapt this for your specific data source
    token = os.environ.get('API_TOKEN')
    assert token, "API_TOKEN not set"
    response = requests.get(
        'https://api.example.com/endpoint',
        headers={'Authorization': f'Bearer {token}'},
    )
    assert response.status_code == 200
    data = response.json()
    assert len(data) > 0, "No data returned"
    print(f"Successfully fetched {len(data)} items")

if __name__ == '__main__':
    test_api_access()
```
Run the test: `python test_fetcher.py`

Only proceed to the full scraper once the test passes.
### Phase 3: Build the Scraper

Create a scraper that:
- Handles pagination efficiently
- Respects rate limits
- Stores data and scraping progress in SQLite, so an interrupted run can resume where it left off
See `data-sourcing.md` for the database schema and scraper template.
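The core loop can be sketched independently of any particular API. Here `fetch_page` is a hypothetical callable standing in for your API client: it takes an opaque cursor and returns a page of items plus the next cursor (`None` when exhausted). The table names mirror the project structure but are assumptions, not the actual schema from `data-sourcing.md`:

```python
import sqlite3
import time

def scrape(fetch_page, db_path="data/items.db", delay=0.0):
    """Resumable scrape loop: persists each page and the pagination
    cursor in SQLite, so a crash or Ctrl-C resumes where it left off."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, body TEXT)")
    con.execute("CREATE TABLE IF NOT EXISTS progress (key TEXT PRIMARY KEY, cursor TEXT)")
    row = con.execute("SELECT cursor FROM progress WHERE key = 'scrape'").fetchone()
    cursor, total = (row[0] if row else None), 0
    while True:
        items, cursor = fetch_page(cursor)
        con.executemany(
            "INSERT OR REPLACE INTO items VALUES (?, ?)",
            [(item["id"], item["body"]) for item in items],
        )
        con.execute("INSERT OR REPLACE INTO progress VALUES ('scrape', ?)", (cursor,))
        con.commit()  # commit per page so progress survives interruption
        total += len(items)
        if cursor is None:
            break
        time.sleep(delay)  # crude rate limiting between pages
    con.close()
    return total
```

Committing once per page (rather than once at the end) is what makes the run resumable: whatever pages were fetched before a crash are already durable on disk.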
### Phase 4: Generate Embeddings & Cluster

Use the clustering pipeline to:
- Generate embeddings with OpenAI
- Run hierarchical HDBSCAN clustering
- Project to 2D with UMAP
- Label clusters with LLM
See `clustering-reference.md` for the complete implementation.
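The embedding step is mostly a batching exercise: the OpenAI embeddings endpoint accepts many inputs per request, so grouping texts cuts round trips substantially. A sketch, assuming an `openai.OpenAI` client is passed in (the batch size of 100 is a conservative choice, not a documented limit):

```python
def batched(texts, batch_size=100):
    """Yield lists of at most batch_size texts."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def embed_all(client, texts, model="text-embedding-3-large"):
    """Embed all texts in batches; returns one vector per input text."""
    vectors = []
    for batch in batched(texts):
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```

Clustering then runs on the resulting matrix, e.g. `hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)` followed by `umap.UMAP(n_components=2).fit_transform(X)` for the 2D projection (parameters here are illustrative); see `clustering-reference.md` for the full version.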
### Phase 5: Build Visualization

Set up the interactive visualization:
- Export data to JSON
- Create Next.js app with D3 visualization
- Add hierarchical drill-down view
See `visualization-setup.md` for setup instructions. The `components/` directory contains ready-to-copy React components.
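The export step just has to drop JSON where the Next.js app reads it (`visualizer/public/data/`). A minimal sketch; the record fields shown are illustrative assumptions and should match whatever `lib/types.ts` actually defines:

```python
import json
from pathlib import Path

def export_json(items, clusters, out_dir="visualizer/public/data"):
    """Write items.json and clusters.json for the visualizer.
    items:    e.g. [{"id": ..., "x": ..., "y": ..., "cluster_id": ...}]
    clusters: e.g. [{"id": ..., "label": ..., "parent_id": ...}]
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "items.json").write_text(json.dumps(items))
    (out / "clusters.json").write_text(json.dumps(clusters))
```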
## Project Structure

When complete, the project should look like:
```
project/
├── data/
│   └── items.db              # SQLite database
├── pipeline/
│   ├── __init__.py
│   ├── db.py                 # Database operations
│   ├── scraper.py            # Data fetcher
│   ├── embed.py              # Embedding generation
│   ├── cluster.py            # HDBSCAN clustering
│   ├── describe.py           # LLM labeling
│   └── export.py             # JSON export
├── visualizer/
│   ├── app/
│   │   ├── page.tsx
│   │   └── layout.tsx
│   ├── components/
│   │   ├── HierarchicalView.tsx
│   │   ├── ScatterPlot.tsx
│   │   └── ...
│   ├── lib/
│   │   ├── types.ts
│   │   ├── data.ts
│   │   └── utils.ts
│   └── public/data/
│       ├── items.json
│       └── clusters.json
├── test_fetcher.py           # API access tests
├── requirements.txt
└── README.md
```
## Dependencies

### Python (for pipeline)
```
openai>=1.0
instructor>=1.0
requests>=2.31
hdbscan>=0.8.33
umap-learn>=0.5
scikit-learn>=1.3
numpy>=1.24
rich>=13.0
```
### Node.js (for visualization)
```json
{
  "dependencies": {
    "next": "14.2.0",
    "react": "^18.2.0",
    "d3": "^7.8.5",
    "framer-motion": "^11.0.0",
    "tailwindcss": "^3.4.1"
  }
}
```
## Environment Variables

```
OPENAI_API_KEY=sk-...   # Required for embeddings and labeling
# Plus whatever auth your data source needs:
GITHUB_TOKEN=ghp_...    # For GitHub
SLACK_TOKEN=xoxb-...    # For Slack
# etc.
```
## Running the Pipeline

```bash
# 1. Test API access
python test_fetcher.py

# 2. Scrape data
python -m pipeline.scraper

# 3. Generate embeddings
python -m pipeline.embed

# 4. Cluster
python -m pipeline.cluster

# 5. Label clusters with LLM
python -m pipeline.describe

# 6. Export for visualization
python -m pipeline.export

# 7. Run visualizer
cd visualizer && npm run dev
```
## Key Design Decisions
- **SQLite for storage** - Simple, portable, supports resumability
- **HDBSCAN over K-means** - Finds natural clusters and handles noise, without choosing k up front
- **3-level hierarchy** - Coarse (L1) -> Medium (L2) -> Fine (L3)
- **UMAP for projection** - Faster than t-SNE and preserves more of the global structure
- **text-embedding-3-large** - OpenAI's strongest embedding model for semantic similarity
- **Next.js + D3** - Fast, interactive visualization with SSR support
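One way to see how a multi-level hierarchy can fall out of a single fine clustering: compute a centroid per fine cluster, then group the centroids. The sketch below uses a deliberately naive nearest-seed assignment so it stays self-contained; the actual pipeline in `clustering-reference.md` would cluster the centroids properly (e.g. with HDBSCAN or agglomerative clustering), and `coarsen` is a hypothetical helper name:

```python
import numpy as np

def coarsen(labels, vectors, n_coarse):
    """Toy sketch: map fine cluster ids to coarser groups by comparing
    their centroids. Naive nearest-seed assignment, not the real method."""
    ids = sorted(set(labels) - {-1})  # -1 is HDBSCAN's noise label
    centroids = np.array([vectors[labels == i].mean(axis=0) for i in ids])
    seeds = centroids[:n_coarse]  # naive: first n_coarse centroids as seeds
    dists = ((centroids[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)
    return {fine: int(coarse) for fine, coarse in zip(ids, dists.argmin(axis=1))}
```

Applying this twice (fine -> medium, medium -> coarse) yields the three-level L1/L2/L3 structure the visualizer drills through.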
## Detailed Documentation

- [Data Sourcing Patterns](data-sourcing.md) - API patterns, auth, pagination
- [Clustering Implementation](clustering-reference.md) - Embedding, HDBSCAN, UMAP code
- [Visualization Setup](visualization-setup.md) - Next.js app and components