Agent Skills: Data Wrangler


ID: georgekhananaev/claude-skills-vault/data-wrangler

Install this agent skill locally:

pnpm dlx add-skill https://github.com/georgekhananaev/claude-skills-vault/tree/HEAD/.claude/skills/data-wrangler

Skill Files


.claude/skills/data-wrangler/SKILL.md

Skill Metadata

Name
data-wrangler
Description

Data Wrangler

Manipulate tabular data (CSV, Excel, JSON, Parquet, TSV) w/ pandas-powered scripts. Two scripts cover all operations: data_wrangler.py for data ops, excel_toolkit.py for Excel-specific features.

When to Use

  • User asks to read, edit, filter, sort, or transform CSV/Excel/JSON/Parquet/TSV files
  • User asks to merge/join datasets, deduplicate, fill missing values, or validate data
  • User asks to create Excel workbooks w/ formatting, dropdowns, freeze panes, or multi-sheet
  • User asks to pivot, unpivot, group-by, aggregate, sample, or split datasets
  • User asks to add computed columns, rename columns, cast types, or apply formulas
  • User asks to convert between data formats (CSV -> Excel, JSON -> Parquet, etc.)
  • User asks to inspect/profile data structure, types, nulls, stats

Prerequisites

# Required
pip install pandas openpyxl

# Optional (per feature)
pip install pyarrow          # Parquet support
pip install xlrd             # Legacy .xls read
pip install pandasql         # SQL queries on DataFrames
pip install fastparquet      # Alternative Parquet engine

Quick Routing

| Task | Script | Command |
|------|--------|---------|
| Inspect/profile data | data_wrangler.py | inspect |
| Filter rows | data_wrangler.py | filter --where "expr" |
| Sort by columns | data_wrangler.py | sort --by Col --desc |
| Group & aggregate | data_wrangler.py | group --by Col --agg "Col:func" |
| Merge/join files | data_wrangler.py | merge f2 --on Key --how left |
| Pivot / unpivot | data_wrangler.py | pivot --index/--unpivot |
| Remove duplicates | data_wrangler.py | dedupe --subset "Col" |
| Fill missing values | data_wrangler.py | fill --column Col --strategy mean |
| Drop cols/rows | data_wrangler.py | drop --columns "A,B" |
| Rename columns | data_wrangler.py | rename --map "old:new" |
| Cast types | data_wrangler.py | cast --column Col --dtype datetime |
| Computed columns | data_wrangler.py | derive --formula "New = A + B" |
| Random sample | data_wrangler.py | sample --n 100 |
| Split by values | data_wrangler.py | split --by Region |
| Validate rules | data_wrangler.py | validate --rules rules.json |
| Apply formulas | data_wrangler.py | formula --expr "C=A+B" |
| Convert formats | data_wrangler.py | convert -o data.xlsx |
| SQL queries | data_wrangler.py | query --sql "SELECT..." |
| List Excel sheets | excel_toolkit.py | sheets |
| Extract sheet | excel_toolkit.py | extract --sheet Sales -o sales.csv |
| Combine -> xlsx | excel_toolkit.py | combine *.csv -o combined.xlsx |
| Format headers | excel_toolkit.py | format --header-style bold,blue --autowidth |
| Freeze panes | excel_toolkit.py | freeze --at B2 |
| Auto-filter | excel_toolkit.py | autofilter |
| Dropdown validation | excel_toolkit.py | validate --column Status --values "Open,Closed" |
| Protect sheet | excel_toolkit.py | protect --password secret |
| Create workbook | excel_toolkit.py | create --columns "Name,Age" -o template.xlsx |

Usage Patterns

Data Operations (data_wrangler.py)

All operations follow: python3 scripts/data_wrangler.py <op> <input> [options] [-o output]

# Inspect
python3 data_wrangler.py inspect sales.csv
python3 data_wrangler.py inspect data.xlsx --sheet "Q1 Sales" --nrows 1000

# Filter
python3 data_wrangler.py filter data.csv --where "Revenue > 10000" -o high_rev.csv
python3 data_wrangler.py filter data.csv --where 'Status == "active" and Age >= 25' -o active.csv

# Sort
python3 data_wrangler.py sort data.csv --by "Revenue,Name" --desc -o sorted.csv

# Group + Aggregate
python3 data_wrangler.py group data.csv --by Department --agg "Salary:mean,Salary:count,Revenue:sum" -o summary.csv

# Merge
python3 data_wrangler.py merge orders.csv customers.csv --on CustomerID --how left -o joined.csv

# Pivot
python3 data_wrangler.py pivot data.csv --index Name --columns Month --values Sales --aggfunc sum -o pivoted.csv

# Unpivot (melt)
python3 data_wrangler.py pivot wide.csv --index ID --unpivot --var-name Metric --value-name Value -o long.csv

# Deduplicate
python3 data_wrangler.py dedupe data.csv --subset "Email" --keep first -o clean.csv

# Fill nulls
python3 data_wrangler.py fill data.csv --column "Revenue,Profit" --strategy mean -o filled.csv

# Drop columns
python3 data_wrangler.py drop data.csv --columns "TempCol,Notes" -o trimmed.csv
python3 data_wrangler.py drop data.csv --null-threshold 0.5 -o cleaned.csv

# Rename
python3 data_wrangler.py rename data.csv --map "old_name:new_name,col2:Column2" -o renamed.csv
python3 data_wrangler.py rename data.csv --snake -o snake_case.csv

# Cast types
python3 data_wrangler.py cast data.csv --column Date --dtype datetime --date-format "%Y-%m-%d" -o typed.csv

# Computed columns
python3 data_wrangler.py derive data.csv --formula "Profit = Revenue - Cost" -o enriched.csv

# Sample
python3 data_wrangler.py sample large.csv --n 500 --seed 42 -o sample.csv

# Split by value
python3 data_wrangler.py split data.csv --by Region --output-dir ./by_region/

# Validate
python3 data_wrangler.py validate data.csv --rules validation_rules.json -o report.json

# Formula
python3 data_wrangler.py formula data.xlsx --expr "Total=Price*Quantity" -o calculated.xlsx

# Convert
python3 data_wrangler.py convert data.csv -o data.xlsx
python3 data_wrangler.py convert data.xlsx -o data.json
python3 data_wrangler.py convert data.json -o data.parquet

# SQL query
python3 data_wrangler.py query data.csv --sql "SELECT Name, AVG(Salary) FROM df WHERE Dept='Eng' GROUP BY Name"
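As a point of reference, the derive/formula commands above presumably map to pandas DataFrame.eval; this is a minimal sketch with illustrative column names, not the script's actual implementation:

```python
import pandas as pd

# "Profit = Revenue - Cost" expressed as a pandas eval assignment
df = pd.DataFrame({"Revenue": [100, 250], "Cost": [40, 90]})
df = df.eval("Profit = Revenue - Cost")
print(df["Profit"].tolist())  # [60, 160]
```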

Excel Operations (excel_toolkit.py)

All operations follow: python3 scripts/excel_toolkit.py <op> <input> [options] [-o output]

# List sheets
python3 excel_toolkit.py sheets workbook.xlsx

# Extract sheet
python3 excel_toolkit.py extract workbook.xlsx --sheet "Sales Q1" -o sales_q1.csv

# Combine multiple files into multi-sheet xlsx
python3 excel_toolkit.py combine sales.csv inventory.csv orders.csv -o report.xlsx

# Format
python3 excel_toolkit.py format data.xlsx --header-style bold,blue --autowidth --zebra -o styled.xlsx

# Freeze panes
python3 excel_toolkit.py freeze data.xlsx --at B2 -o frozen.xlsx

# Auto-filter
python3 excel_toolkit.py autofilter data.xlsx -o filtered.xlsx

# Dropdown validation
python3 excel_toolkit.py validate data.xlsx --column Status --values "Open,Closed,Pending" -o validated.xlsx

# Protect
python3 excel_toolkit.py protect data.xlsx --password mypass -o protected.xlsx

# Create template
python3 excel_toolkit.py create --columns "Name,Email,Department,Start Date,Salary" -o template.xlsx
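These commands wrap openpyxl (the skill's required Excel dependency). The snippet below is a rough sketch of the equivalent direct calls for freeze and autofilter — the file name and data are illustrative, not the toolkit's actual code:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["Name", "Age"])          # header row
ws.append(["Ada", 36])
ws.freeze_panes = "B2"              # freeze row 1 and column A (freeze --at B2)
ws.auto_filter.ref = ws.dimensions  # filter over the used range (autofilter)
wb.save("demo.xlsx")
```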

Validation Rules Format

Create a JSON rules file for validate:

{
  "rules": [
    {"column": "Email", "type": "not_null"},
    {"column": "Email", "type": "pattern", "regex": "^[^@]+@[^@]+\\.[^@]+$"},
    {"column": "ID", "type": "unique"},
    {"column": "Age", "type": "range", "min": 0, "max": 150},
    {"column": "Status", "type": "enum", "values": ["active", "inactive", "pending"]}
  ]
}

Rule types: not_null, unique, range (min/max), pattern (regex), enum (allowed values).
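As a sketch of what a rule engine like this could do with pandas — `check_rule` is a hypothetical helper, not the script's code — each rule type reduces to a boolean mask of failing rows:

```python
import pandas as pd

def check_rule(df, rule):
    """Return the number of rows failing one rule (hypothetical sketch)."""
    col = df[rule["column"]]
    t = rule["type"]
    if t == "not_null":
        bad = col.isna()
    elif t == "unique":
        bad = col.duplicated(keep=False)          # flag every duplicated row
    elif t == "range":
        bad = (col < rule["min"]) | (col > rule["max"])
    elif t == "pattern":
        bad = ~col.astype(str).str.match(rule["regex"], na=False)
    elif t == "enum":
        bad = ~col.isin(rule["values"])
    else:
        raise ValueError(f"unknown rule type: {t}")
    return int(bad.sum())

df = pd.DataFrame({"Age": [25, 200, 40], "ID": [1, 1, 2]})
print(check_rule(df, {"column": "Age", "type": "range", "min": 0, "max": 150}))  # 1
print(check_rule(df, {"column": "ID", "type": "unique"}))                        # 2
```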

Fill Strategies

| Strategy | Behavior |
|----------|----------|
| mean | Fill w/ column mean (numeric) |
| median | Fill w/ column median (numeric) |
| mode | Fill w/ most frequent value |
| zero | Fill w/ 0 |
| empty | Fill w/ empty string |
| ffill | Forward fill (carry last value) |
| bfill | Backward fill |
| drop | Drop rows w/ nulls in column |
| value:<v> | Fill w/ specific value |
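Illustrative pandas equivalents for a few of these strategies (a sketch, not the script's code):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None])

print(s.fillna(s.mean()).tolist())  # mean  -> [1.0, 2.0, 3.0, 2.0]
print(s.fillna(0).tolist())         # zero  -> [1.0, 0.0, 3.0, 0.0]
print(s.ffill().tolist())           # ffill -> [1.0, 1.0, 3.0, 3.0]
print(s.bfill().tolist())           # bfill -> trailing null stays (nothing after it)
print(s.dropna().tolist())          # drop  -> [1.0, 3.0]
```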

Supported Formats

| Format | Read | Write | Dependency |
|--------|------|-------|------------|
| CSV | Y | Y | (builtin) |
| TSV | Y | Y | (builtin) |
| XLSX | Y | Y | openpyxl |
| XLS | Y | N | xlrd |
| JSON | Y | Y | (builtin) |
| JSONL | Y | Y | (builtin) |
| Parquet | Y | Y | pyarrow |
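A plausible sketch of how a convert command could dispatch on file extension with pandas (`READERS` and `convert` are hypothetical names, not the script's internals):

```python
from pathlib import Path

import pandas as pd

READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p: pd.read_csv(p, sep="\t"),
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,       # needs openpyxl
    ".parquet": pd.read_parquet,  # needs pyarrow
}

def convert(src, dst):
    df = READERS[Path(src).suffix](src)
    ext = Path(dst).suffix
    if ext == ".csv":
        df.to_csv(dst, index=False)
    elif ext == ".xlsx":
        df.to_excel(dst, index=False)
    elif ext == ".json":
        df.to_json(dst, orient="records")
    elif ext == ".parquet":
        df.to_parquet(dst)
    else:
        raise ValueError(f"unsupported output format: {ext}")
```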

Integration w/ file-converter

Pipeline data between skills:

# 1. Convert YAML -> CSV (file-converter), then wrangle
python3 .claude/skills/file-converter/scripts/csv_json_yaml.py data.yaml data.csv
python3 .claude/skills/data-wrangler/scripts/data_wrangler.py filter data.csv --where "Status == 'active'" -o filtered.csv

# 2. Wrangle, then convert to PDF report
python3 data_wrangler.py group data.csv --by Dept --agg "Salary:mean,count" -o summary.csv
# (Use file-converter to render summary as markdown -> PDF)

# 3. Excel -> JSON -> YAML pipeline
python3 data_wrangler.py convert data.xlsx -o data.json
python3 .claude/skills/file-converter/scripts/csv_json_yaml.py data.json data.yaml

Pandas Query Syntax Reference

Filter expressions use pandas query syntax:

| Pattern | Example |
|---------|---------|
| Comparison | Age > 30, Revenue >= 10000 |
| Equality | Status == "active", Region != "East" |
| String contains | Name.str.contains("Smith") |
| Multiple conditions | Age > 25 and Status == "active" |
| OR conditions | Region == "East" or Region == "West" |
| IN list | Status in ["active", "pending"] |
| NOT IN | Status not in ["closed", "archived"] |
| Null check | Revenue.notna(), Email.isna() |
| Between | Age >= 18 and Age <= 65 |
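These expressions are presumably evaluated with pandas DataFrame.query; a quick sketch (note that `.str` methods require `engine="python"` in a query):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 30, 41], "Status": ["active", "closed", "active"]})

print(df.query('Age > 25 and Status == "active"')["Age"].tolist())            # [41]
print(df.query('Status in ["active", "pending"]')["Age"].tolist())            # [22, 41]
print(df.query('Status.str.contains("act")', engine="python")["Age"].tolist())  # [22, 41]
```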

Aggregation Functions

Available for group --agg and pivot --aggfunc:

sum, mean, median, min, max, count, std, var, first, last, nunique

Spec format: "Column:function" — multiple: "Salary:mean,Salary:count,Revenue:sum"
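A sketch of how a "Column:function" spec could be parsed into a pandas named aggregation (`parse_agg` is a hypothetical helper, not the script's code):

```python
import pandas as pd

def parse_agg(spec):
    # "Salary:mean,Salary:count" -> [("Salary", "mean"), ("Salary", "count")]
    return [tuple(item.split(":")) for item in spec.split(",")]

df = pd.DataFrame({"Dept": ["Eng", "Eng", "Ops"], "Salary": [10, 20, 30]})
out = df.groupby("Dept").agg(
    **{f"{col}_{fn}": (col, fn) for col, fn in parse_agg("Salary:mean,Salary:count")}
)
print(out)  # one row per Dept, columns Salary_mean and Salary_count
```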