CSV Query Skill | Agent Skills

CSV Query

Query tabular data files using SQL via the Polars-powered sqlp command.

Cowork note: If relative paths don't resolve, call mcp__qsv__qsv_get_working_dir and mcp__qsv__qsv_set_working_dir to sync the working directory.

Decision Tree

Is the query simple (single column filter, basic select)?

Yes -> Consider select + search for simpler operations
No -> Use sqlp for full SQL support

Does the query involve joins, GROUP BY, window functions, or complex expressions?

Yes -> Use sqlp (Polars SQL engine)

Is the CSV file very large (> 10MB)?

Yes -> Consider converting to Parquet with mcp__qsv__qsv_to_parquet for faster repeated queries. Note: sqlp can also query CSV files of any size directly.

Steps

Prepare the file: Run mcp__qsv__qsv_index and mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true to create index and stats cache.
Read the stats cache: Read <FILESTEM>.stats.csv (e.g., data.stats.csv for data.csv) to understand column metadata before writing SQL. This is the most important step for writing efficient queries.
Run frequency on key columns: For columns you plan to GROUP BY, filter on, or join on, run mcp__qsv__qsv_frequency to see actual value distributions. This reveals the best filter values and whether a GROUP BY will produce a manageable result set.
Write and run SQL: Use mcp__qsv__qsv_sqlp with the SQL query informed by stats and frequency data. The table name in SQL is the filename stem (e.g., data.csv -> SELECT * FROM data). For Parquet files, use read_parquet('data.parquet') as the table source instead.
Refine if needed: Check results and adjust the query.

Using Stats to Write Better SQL

After reading the .stats.csv cache, use these columns to inform your SQL:

| Stats Column | How to Use in SQL | |-------------|-------------------| | type | Use correct casts and comparisons — don't quote integers, use date functions for Date/DateTime columns | | min / max | Write precise WHERE clauses using actual data range (e.g., WHERE price BETWEEN 10.5 AND 999.99 instead of arbitrary bounds) | | cardinality | Estimate GROUP BY result size — low cardinality (< 100) is fast; high cardinality (> 10K) may need LIMIT or a different approach | | nullcount | Only add COALESCE or IS NOT NULL where nullcount > 0 — skip null handling for columns with zero nulls | | sort_order | Skip ORDER BY if data is already sorted on that column (sort_order = "Ascending"/"Descending") | | mean / stddev | Write outlier filters: WHERE col BETWEEN mean - 3*stddev AND mean + 3*stddev | | median / q1 / q3 | For skewed data (when mean and median diverge), use quartile-based ranges: WHERE col BETWEEN q1 AND q3 instead of mean ± stddev | | skewness | If skewness > 1 or < -1, prefer median/quartile-based filters over mean-based ones | | cv | High CV (> 100%) signals high relative variability — add LIMIT to GROUP BY queries and consider binning continuous values | | outliers_percentage | If > 5%, consider excluding outliers before aggregation: WHERE col BETWEEN lower_inner_fence AND upper_inner_fence | | sparsity | Columns with sparsity > 0.5 are mostly null — avoid using them as join keys or GROUP BY columns |

Using Frequency for Filter Values

Run mcp__qsv__qsv_frequency with select: "col", limit: 20 before writing WHERE clauses on categorical columns:

Pick selective filters: If frequency shows "active" has 90% of rows, filtering on WHERE status = 'active' is wasteful — filter on the rare values instead
Validate expected values: If you plan WHERE category IN ('A','B','C'), check frequency first to confirm those values exist and see if you're missing any
Avoid GROUP BY on high-cardinality columns: If frequency shows thousands of unique values, GROUP BY will produce a huge result — add LIMIT or aggregate differently

SQL Syntax Guide

The sqlp command uses Polars SQL dialect:

-- Basic select
SELECT col1, col2 FROM data WHERE col1 > 100

-- Aggregation
SELECT category, COUNT(*) as cnt, AVG(price) as avg_price
FROM data GROUP BY category ORDER BY cnt DESC

-- Window functions
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rank
FROM employees

-- String operations
SELECT * FROM data WHERE col1 LIKE '%pattern%'

-- Date operations
SELECT *, EXTRACT(YEAR FROM date_col) as year FROM data

-- Multiple files (join)
SELECT a.*, b.name FROM file1 a JOIN file2 b ON a.id = b.id

-- CASE expressions
SELECT *, CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END as tier FROM data

Table Naming Convention

File: sales_2024.csv -> Table: sales_2024
File: my-data.csv -> Table: "my-data" (quote if contains special chars)
Multiple files: each file is a separate table

Notes

sqlp uses the Polars engine - some PostgreSQL-specific syntax may not be supported
For very complex queries that fail, suggest DuckDB as an alternative
The stats cache helps Polars choose optimal data types for columns
Results go to stdout by default; use --output file.csv for large result sets
Column names are case-sensitive in SQL queries
Use LIMIT to preview large result sets before running full queries
sqlp can query multiple CSV files in a single SQL statement (useful for joins)

Agent Skills: CSV Query

Install this agent skill to your local

Skill Files