CSV Query
Query tabular data files using SQL via the Polars-powered sqlp command.
Cowork note: If relative paths don't resolve, call
mcp__qsv__qsv_get_working_dirandmcp__qsv__qsv_set_working_dirto sync the working directory.
Decision Tree
Is the query simple (single column filter, basic select)?
- Yes -> Consider
select+searchfor simpler operations - No -> Use
sqlpfor full SQL support
Does the query involve joins, GROUP BY, window functions, or complex expressions?
- Yes -> Use
sqlp(Polars SQL engine)
Is the CSV file very large (> 10MB)?
- Yes -> Consider converting to Parquet with
mcp__qsv__qsv_to_parquetfor faster repeated queries. Note:sqlpcan also query CSV files of any size directly.
Steps
-
Prepare the file: Run
mcp__qsv__qsv_indexandmcp__qsv__qsv_statswithcardinality: true, stats_jsonl: trueto create index and stats cache. -
Read the stats cache: Read
<FILESTEM>.stats.csv(e.g.,data.stats.csvfordata.csv) to understand column metadata before writing SQL. This is the most important step for writing efficient queries. -
Run frequency on key columns: For columns you plan to GROUP BY, filter on, or join on, run
mcp__qsv__qsv_frequencyto see actual value distributions. This reveals the best filter values and whether a GROUP BY will produce a manageable result set. -
Write and run SQL: Use
mcp__qsv__qsv_sqlpwith the SQL query informed by stats and frequency data. The table name in SQL is the filename stem (e.g.,data.csv->SELECT * FROM data). For Parquet files, useread_parquet('data.parquet')as the table source instead. -
Refine if needed: Check results and adjust the query.
Using Stats to Write Better SQL
After reading the .stats.csv cache, use these columns to inform your SQL:
| Stats Column | How to Use in SQL |
|-------------|-------------------|
| type | Use correct casts and comparisons — don't quote integers, use date functions for Date/DateTime columns |
| min / max | Write precise WHERE clauses using actual data range (e.g., WHERE price BETWEEN 10.5 AND 999.99 instead of arbitrary bounds) |
| cardinality | Estimate GROUP BY result size — low cardinality (< 100) is fast; high cardinality (> 10K) may need LIMIT or a different approach |
| nullcount | Only add COALESCE or IS NOT NULL where nullcount > 0 — skip null handling for columns with zero nulls |
| sort_order | Skip ORDER BY if data is already sorted on that column (sort_order = "Ascending"/"Descending") |
| mean / stddev | Write outlier filters: WHERE col BETWEEN mean - 3*stddev AND mean + 3*stddev |
| median / q1 / q3 | For skewed data (when mean and median diverge), use quartile-based ranges: WHERE col BETWEEN q1 AND q3 instead of mean ± stddev |
| skewness | If skewness > 1 or < -1, prefer median/quartile-based filters over mean-based ones |
| cv | High CV (> 100%) signals high relative variability — add LIMIT to GROUP BY queries and consider binning continuous values |
| outliers_percentage | If > 5%, consider excluding outliers before aggregation: WHERE col BETWEEN lower_inner_fence AND upper_inner_fence |
| sparsity | Columns with sparsity > 0.5 are mostly null — avoid using them as join keys or GROUP BY columns |
Using Frequency for Filter Values
Run mcp__qsv__qsv_frequency with select: "col", limit: 20 before writing WHERE clauses on categorical columns:
- Pick selective filters: If
frequencyshows "active" has 90% of rows, filtering onWHERE status = 'active'is wasteful — filter on the rare values instead - Validate expected values: If you plan
WHERE category IN ('A','B','C'), check frequency first to confirm those values exist and see if you're missing any - Avoid GROUP BY on high-cardinality columns: If frequency shows thousands of unique values, GROUP BY will produce a huge result — add LIMIT or aggregate differently
SQL Syntax Guide
The sqlp command uses Polars SQL dialect:
-- Basic select
SELECT col1, col2 FROM data WHERE col1 > 100
-- Aggregation
SELECT category, COUNT(*) as cnt, AVG(price) as avg_price
FROM data GROUP BY category ORDER BY cnt DESC
-- Window functions
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rank
FROM employees
-- String operations
SELECT * FROM data WHERE col1 LIKE '%pattern%'
-- Date operations
SELECT *, EXTRACT(YEAR FROM date_col) as year FROM data
-- Multiple files (join)
SELECT a.*, b.name FROM file1 a JOIN file2 b ON a.id = b.id
-- CASE expressions
SELECT *, CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END as tier FROM data
Table Naming Convention
- File:
sales_2024.csv-> Table:sales_2024 - File:
my-data.csv-> Table:"my-data"(quote if contains special chars) - Multiple files: each file is a separate table
Notes
sqlpuses the Polars engine - some PostgreSQL-specific syntax may not be supported- For very complex queries that fail, suggest DuckDB as an alternative
- The stats cache helps Polars choose optimal data types for columns
- Results go to stdout by default; use
--output file.csvfor large result sets - Column names are case-sensitive in SQL queries
- Use
LIMITto preview large result sets before running full queries sqlpcan query multiple CSV files in a single SQL statement (useful for joins)