Announce: "Using ds-plan (Phase 2) to profile data and create task breakdown."
Planning (Data Profiling + Task Breakdown)
Profile the data and create an analysis plan based on the spec.
Requires .claude/SPEC.md from /ds first.
<EXTREMELY-IMPORTANT>
SPEC MUST EXIST BEFORE PLANNING. This is not negotiable.
Before exploring data or creating tasks, you MUST have:
- .claude/SPEC.md with objectives and constraints
- Clear success criteria
- User-approved spec
If .claude/SPEC.md doesn't exist, run /ds first.
</EXTREMELY-IMPORTANT>
Rationalization Table - STOP If You Think:
| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "Data looks clean, profiling unnecessary" | Your data is never clean | PROFILE to discover issues |
| "I can profile as I go" | You'll miss systemic issues | PROFILE comprehensively NOW |
| "Quick .head() is enough" | Your head hides tail problems | RUN full profiling checklist |
| "Missing values won't affect my analysis" | They always do | DOCUMENT and plan handling |
| "I'll handle data issues during analysis" | Your issues will derail your analysis | FIX data issues FIRST |
| "User didn't mention data quality" | They assume YOU'LL check | QUALITY check is YOUR job |
| "Profiling takes too long" | Your skipping it costs days later | INVEST time now |
Honesty Framing
Creating an analysis plan without profiling the data is LYING about understanding the data.
You cannot plan analysis steps without knowing:
- Your data's shape and types
- Your missing value patterns
- Your data quality issues
- Your cleaning requirements
Profiling costs you minutes. Your wrong plan costs hours of rework and incorrect results.
No Pause After Completion
After writing .claude/PLAN.md, IMMEDIATELY invoke:
Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")
DO NOT:
- Ask "should I proceed with implementation?"
- Summarize the plan
- Wait for user confirmation (they approved SPEC already)
- Write status updates
The workflow phases are SEQUENTIAL. Complete the plan → immediately start implementation.
What Plan Does
| DO | DON'T |
|----|-------|
| Read .claude/SPEC.md | Skip brainstorm phase |
| Profile data (shape, types, stats) | Skip to analysis |
| Identify data quality issues | Ignore missing/duplicate data |
| Create ordered task list | Write final analysis code |
| Write .claude/PLAN.md | Make completion claims |
Brainstorm answers: WHAT and WHY.
Plan answers: HOW and DATA QUALITY.
Process
1. Verify Spec Exists
cat .claude/SPEC.md # verify-spec: read SPEC file to confirm it exists
If missing, stop and run /ds first.
2. Data Profiling
For multiple data sources: Profile in parallel using background Task agents.
Single Data Source (Direct Profiling)
MANDATORY profiling steps:
import pandas as pd

df = pd.read_csv("path/to/source.csv")  # placeholder path: load the source named in the spec

# Basic structure
df.shape # (rows, columns)
df.dtypes # Column types
df.head(10) # Sample data
df.tail(5) # End of data
# Summary statistics
df.describe() # Numeric summaries
df.describe(include='object') # Categorical summaries
df.info() # Memory, non-null counts
# Data quality checks
df.isnull().sum() # Missing values per column
df.duplicated().sum() # Duplicate rows
df[col].value_counts() # Distribution of categories
# For time series
df[date_col].min(), df[date_col].max() # Date range
df.groupby(date_col).size() # Records per period
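To turn those checks into something that pastes straight into the Column Summary table in PLAN.md, a small helper like the sketch below can work; the function name and report columns are illustrative, not a required interface.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary for PLAN.md (illustrative helper, not a required API)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null_pct": (df.notnull().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })

# Usage (hypothetical path; .to_markdown() needs the tabulate package):
# df = pd.read_csv("data/source1.csv")
# print(profile_columns(df).to_markdown())
```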
Multiple Data Sources (Parallel Profiling)
<EXTREMELY-IMPORTANT>
**Pattern from oh-my-opencode: Launch ALL profiling agents in a SINGLE message.** Use run_in_background: true for parallel execution.
When profiling 2+ data sources, launch agents in parallel:
</EXTREMELY-IMPORTANT>
# PARALLEL + BACKGROUND: All Task calls in ONE message
Task(
subagent_type="general-purpose",
description="Profile dataset 1",
run_in_background=true,
prompt="""
Profile this dataset and return a data quality report.
Dataset: /path/to/dataset1.csv
Required checks:
1. Shape: rows x columns
2. Data types: df.dtypes
3. Missing values: df.isnull().sum()
4. Duplicates: df.duplicated().sum()
5. Summary statistics: df.describe()
6. Unique value counts for categorical columns
7. Date range if time series
8. Memory usage: df.info()
Output format:
- Markdown table with column summary
- List of data quality issues found
- Recommendations for cleaning
Tools denied: Write, Edit, NotebookEdit (read-only profiling)
""")
Task(
subagent_type="general-purpose",
description="Profile dataset 2",
run_in_background=true,
prompt="""
[Same template for dataset 2]
""")
Task(
subagent_type="general-purpose",
description="Profile dataset 3",
run_in_background=true,
prompt="""
[Same template for dataset 3]
""")
After launching agents:
- Continue to other work (don't wait)
- Check status with the /tasks command
- Collect results with TaskOutput when ready
# Collect profiling results
TaskOutput(task_id="task-abc123", block=true, timeout=30000)
TaskOutput(task_id="task-def456", block=true, timeout=30000)
TaskOutput(task_id="task-ghi789", block=true, timeout=30000)
Benefits:
- 3x faster profiling for 3 datasets
- Each agent focused on single source
- Results consolidated in main chat
3. Identify Data Quality Issues
CRITICAL: Document ALL issues before proceeding:
| Check | What to Look For |
|-------|------------------|
| Missing values | Null counts, patterns of missingness |
| Duplicates | Exact duplicates, key-based duplicates |
| Outliers | Extreme values, impossible values |
| Type issues | Strings in numeric columns, date parsing |
| Cardinality | Unexpected unique values |
| Distribution | Skewness, unexpected patterns |
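The checks in this table map onto a few pandas one-liners; the column names and thresholds below are illustrative, and `df` is the frame loaded during profiling.

```python
import pandas as pd

# Missing-value patterns: which columns tend to be null together?
null_pattern_counts = df.isnull().value_counts().head(10)

# Key-based duplicates (key columns are hypothetical)
dup_on_key = df.duplicated(subset=["id", "date"]).sum()

# Crude outlier screen on a numeric column (IQR multiplier is illustrative)
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 3 * iqr) | (df["amount"] > q3 + 3 * iqr)]

# Type issues: values that exist but fail numeric parsing in a supposedly numeric column
bad_numeric = pd.to_numeric(df["price"], errors="coerce").isnull() & df["price"].notnull()
```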
4. ETL Strategy Assessment (Conditional)
Triggers when profiling reveals ANY of:
- Total rows > 1M across all sources
- Multiple data sources requiring joins/merges
- Data sourced from remote databases (WRDS, SQL servers, APIs)
If triggered, assess these three dimensions before creating the task breakdown:
A. Filter Push-Down Strategy
The anti-pattern: Pull entire tables into memory, then filter in pandas/R/SAS.
AskUserQuestion(questions=[{
"question": "Where should filtering happen for this data?",
"header": "Filtering",
"options": [
{"label": "Database-level (Recommended)", "description": "SQL WHERE clauses filter at source. Only matching rows transfer. Required for >1M row tables."},
{"label": "Application-level", "description": "Pull full dataset, filter in code. Only acceptable for small tables (<100K rows) or when database access is read-once."},
{"label": "Hybrid", "description": "Coarse filter at database (date range, key columns), fine filter in code (complex logic, cross-table conditions)."}
],
"multiSelect": false
}])
Document in PLAN.md: For each data source, specify WHERE the filtering happens and WHY.
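As a concrete contrast, here is a minimal sketch, assuming a hypothetical SQLAlchemy connection and a hypothetical `trades` table: the push-down version transfers only matching rows, while the commented anti-pattern pays for the full table before filtering.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/db")  # hypothetical connection string

# Anti-pattern: pull everything, then filter in pandas
# df = pd.read_sql("SELECT * FROM trades", engine)
# df = df[(df["trade_date"] >= "2020-01-01") & (df["asset_type"] == "equity")]

# Push-down: let the database do the coarse filtering
query = """
    SELECT trade_id, trade_date, asset_type, price, volume
    FROM trades
    WHERE trade_date >= '2020-01-01'
      AND asset_type = 'equity'
"""
df = pd.read_sql(query, engine)

# Fine-grained, cross-column logic can still happen in pandas afterwards
df = df[df["price"] * df["volume"] > 1_000_000]
```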
B. Parallelism Assessment
The anti-pattern: Process years/groups sequentially when they're embarrassingly parallel.
Identify parallelizable dimensions from profiling:
- Time: year-by-year, month-by-month processing
- Groups: firm-by-firm, sector-by-sector processing
- Sources: independent data sources profiled/cleaned in parallel
AskUserQuestion(questions=[{
"question": "How should parallelizable tasks be executed?",
"header": "Parallelism",
"options": [
{"label": "Background Task agents (Recommended)", "description": "Spawn parallel Task agents for independent groups/years. Best for in-session work with Claude."},
{"label": "SGE array jobs", "description": "Submit as array jobs to grid scheduler. Best for WRDS/HPC cluster workloads."},
{"label": "Sequential", "description": "Process one at a time. Only when tasks have dependencies or parallelism isn't worth the overhead."}
],
"multiSelect": false
}])
Document in PLAN.md: For each task, note if it can be parallelized, on what dimension, and the chosen execution method.
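When the work runs inside a single script rather than Task agents or SGE, the same year-by-year split can still be parallelized. The sketch below is illustrative only, with hypothetical file paths and a hypothetical `process_year` cleaning step.

```python
from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def process_year(year: int) -> pd.DataFrame:
    """Clean one year of data independently (paths and logic are illustrative)."""
    df = pd.read_parquet(f"data/raw/trades_{year}.parquet")
    df = df.dropna(subset=["price"])
    return df.groupby("firm_id")["price"].mean().rename(str(year)).to_frame()

if __name__ == "__main__":
    years = range(2003, 2024)
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_year, years))   # independent years run in parallel
    combined = pd.concat(results, axis=1)               # align on firm_id index
    combined.to_parquet("data/intermediate/yearly_means.parquet")
```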
C. Intermediate Result Caching
The anti-pattern: Re-read and re-process the same large source file in every task.
If multiple tasks read from the same large source:
- Task 1 reads and cleans the source → saves intermediate result
- Tasks 2-N read from the intermediate result, not the raw source
AskUserQuestion(questions=[{
"question": "What format should be used for intermediate results?",
"header": "Cache format",
"options": [
{"label": "Parquet (Recommended)", "description": "Columnar, compressed, preserves dtypes. Best for tabular data. ~10x smaller than CSV."},
{"label": "CSV", "description": "Universal, human-readable. Use when downstream tools require CSV or data is small."},
{"label": "SQLite", "description": "Queryable intermediate storage. Best when downstream tasks need filtered reads from the same intermediate."},
{"label": "No caching needed", "description": "Each task reads from a different source, or sources are small enough to re-read."}
],
"multiSelect": false
}])
Document in PLAN.md: Data flow diagram showing which tasks produce intermediates, which consume them, and the storage format.
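A minimal sketch of the cache-once pattern, with illustrative paths: the first caller pays for the raw read and cleaning once; every later task reads the parquet intermediate instead of the raw source.

```python
from pathlib import Path
import pandas as pd

RAW = Path("data/raw/source1.csv")                       # hypothetical large source
CACHE = Path("data/intermediate/source1_clean.parquet")  # intermediate result

def load_clean_source1() -> pd.DataFrame:
    if CACHE.exists():
        return pd.read_parquet(CACHE)                # Tasks 2-N hit this cheap branch
    df = pd.read_csv(RAW, parse_dates=["date"])      # expensive: done once (Task 1)
    df = df.drop_duplicates().dropna(subset=["id"])  # cleaning steps are illustrative
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(CACHE, index=False)
    return df
```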
D. Incremental Scale-Up Strategy
The anti-pattern: Submit the full batch (21K documents, 50M rows, $500 API call) without testing at small scale first. One bad schema, wrong prompt, or misconfigured parameter = entire batch wasted.
This is TDD for ETL: fail at 10 items in minutes, not at 21,000 items in hours.
Triggers when ANY task involves:
- External API batch processing (Gemini, OpenAI, Bedrock, etc.)
- Irreversible operations (database writes, file transformations)
- Operations costing > $10 or > 30 minutes at full scale
- Processing > 500 items through any external service
For each expensive task, ask the user how to scale up:
AskUserQuestion(questions=[{
"question": "How should we scale up testing for this batch/ETL operation?",
"header": "Scale-up",
"options": [
{"label": "Full scale-up (Recommended)", "description": "4 stages: 10 → 100 → 1,000 → full. Maximum safety for large batches (>5,000 items)."},
{"label": "Standard scale-up", "description": "3 stages: 10 → 100 → full. Good for medium batches (500-5,000 items)."},
{"label": "Minimal scale-up", "description": "2 stages: 10 → full. Quick validation for small batches (<500 items) or low-cost operations."},
{"label": "Custom stages", "description": "Define custom batch sizes and gate criteria for this specific pipeline."}
],
"multiSelect": false
}])
Then define the plan:
- Set stage sizes based on user choice and total items
- Define gate criteria per stage — what must be true before scaling up:
- Output schema/format matches expectations (non-empty, correct structure)
- Success rate above threshold (≥90% for test, ≥95% for intermediate)
- Spot-check: manually inspect N outputs for quality/correctness
- Cost/time extrapolation is acceptable for next stage
- Document in PLAN.md: Scale-up testing plan table for each expensive task.
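A minimal sketch of a staged runner, assuming a placeholder `process_batch` callable and an illustrative success-rate gate; the real gate criteria come from the scale-up plan defined above.

```python
def run_scaled(items, process_batch, stages=(10, 100, 1000, None), min_success=0.95):
    """Run a batch pipeline in increasing stages, stopping at the first failed gate.

    `process_batch` is a placeholder: it should return a list of (item, output) pairs.
    """
    for size in stages:
        batch = items if size is None else items[:size]
        results = process_batch(batch)
        ok = [out for _, out in results if out]          # gate 1: non-empty output
        success_rate = len(ok) / max(len(batch), 1)
        print(f"stage={size or 'full'}: {success_rate:.1%} success on {len(batch)} items")
        if success_rate < min_success:
            raise RuntimeError(f"Gate failed at stage {size}: fix before scaling up")
        # gate 2 (manual): spot-check a few outputs before the next, larger stage
```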
ETL Strategy Section for PLAN.md
## ETL Strategy
<!-- Include this section when data > 1M rows or multiple sources -->
### Filter Strategy
| Source | Rows | Filter Location | Filter Columns | Justification |
|--------|------|-----------------|----------------|---------------|
| source1 | 5M | Database (SQL WHERE) | date, type | Too large for full pull |
| source2 | 50K | Application (pandas) | — | Small enough for full load |
### Parallelism Plan
| Task | Parallelizable? | Dimension | Method |
|------|----------------|-----------|--------|
| Task 1 | Yes | By year (2003-2023) | Background Task agents / SGE array |
| Task 2 | No | — | Sequential (depends on Task 1 output) |
### Data Flow
source1.csv → [Task 1: Clean] → clean_source1.parquet → [Task 2: Merge]
source2.csv → [Task 1: Clean] → clean_source2.parquet ↗
→ [Task 3: Analyze] → results
<EXTREMELY-IMPORTANT>
### ETL Rationalization Table - STOP If You Think:
| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "I'll just read the whole table, it's easier" | 50M rows × 200 columns = OOM crash or 30-minute wait | Filter at source with SQL WHERE |
| "Sequential processing is simpler to write" | 20 years × 5 minutes = 100 minutes vs 5 minutes parallel | Use background agents or SGE arrays |
| "I'll re-read the source in each task" | Re-parsing 5GB CSV five times wastes hours | Save intermediate parquet after first read |
| "Filtering in pandas is more flexible" | Pandas loads ALL rows before filtering: you've already paid the cost | Push coarse filters to database, fine filters to pandas |
| "The data isn't that big" | You just profiled it. Check the row count. If >1M, it IS that big. | Follow the ETL strategy, don't guess |
| "I'll optimize later if it's slow" | Later never comes. The pipeline runs once and everyone moves on. | Design efficient ETL NOW |
| "I'll just run the full batch, it's faster" | One bad schema = 21K wasted requests. One wrong prompt = hours of queue time for garbage. | Test at 10 first. Always. |
| "The API validates my input anyway" | APIs validate format, not correctness. Empty responses are "successful." | Verify output content, not just HTTP 200 |
| "Testing a small batch takes too long to set up" | Setting up a 10-item test takes 5 minutes. Resubmitting 21K items takes hours. | Build the test batch into the pipeline |
</EXTREMELY-IMPORTANT>
5. Identify Implementation Language
Before creating tasks, determine the implementation language for ETL and analysis:
AskUserQuestion(questions=[{
"question": "What language will be used for data processing / ETL?",
"header": "Language",
"options": [
{"label": "Python (Recommended)", "description": "pandas/polars in notebooks or scripts. Default for most analysis."},
{"label": "SAS", "description": "SAS on WRDS grid (qsas/qsub). For large-scale WRDS ETL with hash merges and SGE parallelism."},
{"label": "R", "description": "R scripts or notebooks. For statistical modeling."},
{"label": "Mixed", "description": "SAS for ETL, Python/R for analysis. Common for WRDS pipelines."}
],
"multiSelect": false
}])
If SAS or Mixed is selected:
- Record `Implementation Language: SAS` (or `Mixed: SAS ETL + Python analysis`) in the PLAN.md header
- Load WRDS SAS enforcement: Read("${CLAUDE_PLUGIN_ROOT}/skills/wrds/references/sas-etl.md")
- All SAS tasks in the plan MUST include performance annotations:
  - Merge strategy: hash or sort-merge (with justification if sort-merge)
  - WHERE pattern: range-based date literals (document that no function-wrapped filters are used)
  - Parallelism: SGE array or sequential (with justification if sequential)
- Add a `## SAS Performance Constraints` section to PLAN.md (see template below)
6. Create Task Breakdown
Break analysis into ordered tasks:
- Each task should produce visible output
- Order by data dependencies
- Include data cleaning tasks FIRST
7. Write Plan Doc
Write to .claude/PLAN.md:
# Analysis Plan: [Analysis Name]
> **For Claude:** REQUIRED SUB-SKILL: Use `Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")` to implement this plan with output-first verification.
>
> **Delegation:** Main chat orchestrates, Task agents implement. Use `Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-delegate/SKILL.md")` for subagent templates.
## Spec Reference
See: .claude/SPEC.md
## Data Profile
### Source 1: [name]
- Location: [path/connection]
- Shape: [rows] x [columns]
- Date range: [start] to [end]
- Key columns: [list]
#### Column Summary
| Column | Type | Non-null | Unique | Notes |
|--------|------|----------|--------|-------|
| col1 | int64 | 100% | 50 | Primary key |
| col2 | object | 95% | 10 | Category |
#### Data Quality Issues
- [ ] Missing: col2 has 5% nulls - [strategy: drop/impute/flag]
- [ ] Duplicates: 100 duplicate rows on [key] - [strategy]
- [ ] Outliers: col3 has values > 1000 - [strategy]
### Source 2: [name]
[Same structure]
## Task Breakdown
### Task 1: Data Cleaning (required first)
- Handle missing values in col2
- Remove duplicates
- Fix data types
- Output: Clean DataFrame, log of rows removed
### Task 2: [Analysis Step]
- Input: Clean DataFrame
- Process: [description]
- Output: [specific output to verify]
- Dependencies: Task 1
### Task 3: [Next Step]
[Same structure]
## Output Verification Plan
For each task, define what output proves completion:
- Task 1: "X rows cleaned, Y rows dropped"
- Task 2: "Visualization showing [pattern]"
- Task 3: "Model accuracy >= 0.8"
## ETL Strategy
<!-- Include when any source > 1M rows or multiple sources require joins -->
### Filter Strategy
| Source | Rows | Filter Location | Filter Columns | Justification |
|--------|------|-----------------|----------------|---------------|
### Parallelism Plan
| Task | Parallelizable? | Dimension | Method |
|------|----------------|-----------|--------|
### Data Flow
[source] → [task] → [intermediate] → [task] → [output]
### Scale-Up Testing Plan
<!-- Include when any task involves batch APIs, irreversible operations, or >500 items through external services -->
| Task | Total Items | Stage 1 (test) | Stage 2 | Stage 3 | Gate Criteria |
|------|-------------|-----------------|---------|---------|---------------|
| Batch extraction | 21,000 | 10 | 100 | 1,000 | ≥95% success, schema valid, non-empty responses |
| DB write | 5M rows | 100 | 1,000 | — | No constraint violations, row counts match |
## Implementation Language
[Python / SAS / R / Mixed]
<!-- If SAS or Mixed, include this section: -->
## SAS Performance Constraints
> **For Claude:** REQUIRED: Load `Read("${CLAUDE_PLUGIN_ROOT}/skills/wrds/references/sas-etl.md")` before writing ANY SAS code.
> Validate ALL SAS code against the SAS Code Validation Checklist in the WRDS skill.
### Per-Task SAS Annotations
| Task | Merge Strategy | WHERE Pattern | Parallelism |
|------|---------------|---------------|-------------|
| Task 1 | Hash (lookup < 500K rows) | BETWEEN date literals | SGE array by year |
| Task 2 | Sort-merge (both tables > 50M) | No date filter | Sequential (single output) |
## Reproducibility Requirements
- Random seed: [value if needed]
- Package versions: [key packages]
- Data snapshot: [date/version]
Red Flags - STOP If You're About To:
| Action | Why It's Wrong | Do Instead |
|--------|----------------|------------|
| Skip data profiling | Your data issues will break your analysis | Always profile first |
| Ignore missing values | You'll corrupt your results | Document and plan handling |
| Start analysis immediately | You haven't characterized your data | Complete profiling |
| Assume your data is clean | Never assume, you must verify | Run quality checks |
| Pull entire tables without WHERE clauses | OOM on large data, wastes time/memory | Filter at database level for >1M row sources |
| Process years sequentially | Embarrassingly parallel = free speedup | Use background agents or SGE arrays |
| Re-read same source in multiple tasks | Redundant I/O multiplies runtime | Save intermediate results after first read |
| Submit full batch without test batch | One bad schema/prompt = entire batch wasted | Plan scale-up testing stages for expensive operations |
Output
Complete the plan when:
- Read and understand .claude/SPEC.md
- Profile all data sources (shape, types, stats)
- Document data quality issues
- Define cleaning strategy for each issue
- Assess ETL strategy (if data > 1M rows or multiple sources)
- Order tasks by dependency
- Define output verification criteria
- Write .claude/PLAN.md
- Confirm ready for implementation
Phase Complete
REQUIRED SUB-SKILL: After completing plan, IMMEDIATELY invoke:
Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")