Agent Skills: CSV Query

Run SQL queries against CSV/TSV/Excel files using Polars SQL engine

dataID: kilo-org/kilo-marketplace/csv-query

Install this agent skill to your local

pnpm dlx add-skill https://github.com/Kilo-Org/kilo-marketplace/tree/HEAD/skills/csv-query

Skill Files

Browse the full folder contents for csv-query.

Download Skill

Loading file tree…

skills/csv-query/SKILL.md

Skill Metadata

Name
csv-query
Description
Run SQL queries against CSV/TSV/Excel files using Polars SQL engine

CSV Query

Query tabular data files using SQL via the Polars-powered sqlp command.

Cowork note: If relative paths don't resolve, call mcp__qsv__qsv_get_working_dir and mcp__qsv__qsv_set_working_dir to sync the working directory.

Decision Tree

Is the query simple (single column filter, basic select)?

  • Yes -> Consider select + search for simpler operations
  • No -> Use sqlp for full SQL support

Does the query involve joins, GROUP BY, window functions, or complex expressions?

  • Yes -> Use sqlp (Polars SQL engine)

Is the CSV file very large (> 10MB)?

  • Yes -> Consider converting to Parquet with mcp__qsv__qsv_to_parquet for faster repeated queries. Note: sqlp can also query CSV files of any size directly.

Steps

  1. Prepare the file: Run mcp__qsv__qsv_index and mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true to create index and stats cache.

  2. Read the stats cache: Read <FILESTEM>.stats.csv (e.g., data.stats.csv for data.csv) to understand column metadata before writing SQL. This is the most important step for writing efficient queries.

  3. Run frequency on key columns: For columns you plan to GROUP BY, filter on, or join on, run mcp__qsv__qsv_frequency to see actual value distributions. This reveals the best filter values and whether a GROUP BY will produce a manageable result set.

  4. Write and run SQL: Use mcp__qsv__qsv_sqlp with the SQL query informed by stats and frequency data. The table name in SQL is the filename stem (e.g., data.csv -> SELECT * FROM data). For Parquet files, use read_parquet('data.parquet') as the table source instead.

  5. Refine if needed: Check results and adjust the query.

Using Stats to Write Better SQL

After reading the .stats.csv cache, use these columns to inform your SQL:

| Stats Column | How to Use in SQL | |-------------|-------------------| | type | Use correct casts and comparisons — don't quote integers, use date functions for Date/DateTime columns | | min / max | Write precise WHERE clauses using actual data range (e.g., WHERE price BETWEEN 10.5 AND 999.99 instead of arbitrary bounds) | | cardinality | Estimate GROUP BY result size — low cardinality (< 100) is fast; high cardinality (> 10K) may need LIMIT or a different approach | | nullcount | Only add COALESCE or IS NOT NULL where nullcount > 0 — skip null handling for columns with zero nulls | | sort_order | Skip ORDER BY if data is already sorted on that column (sort_order = "Ascending"/"Descending") | | mean / stddev | Write outlier filters: WHERE col BETWEEN mean - 3*stddev AND mean + 3*stddev | | median / q1 / q3 | For skewed data (when mean and median diverge), use quartile-based ranges: WHERE col BETWEEN q1 AND q3 instead of mean ± stddev | | skewness | If skewness > 1 or < -1, prefer median/quartile-based filters over mean-based ones | | cv | High CV (> 100%) signals high relative variability — add LIMIT to GROUP BY queries and consider binning continuous values | | outliers_percentage | If > 5%, consider excluding outliers before aggregation: WHERE col BETWEEN lower_inner_fence AND upper_inner_fence | | sparsity | Columns with sparsity > 0.5 are mostly null — avoid using them as join keys or GROUP BY columns |

Using Frequency for Filter Values

Run mcp__qsv__qsv_frequency with select: "col", limit: 20 before writing WHERE clauses on categorical columns:

  • Pick selective filters: If frequency shows "active" has 90% of rows, filtering on WHERE status = 'active' is wasteful — filter on the rare values instead
  • Validate expected values: If you plan WHERE category IN ('A','B','C'), check frequency first to confirm those values exist and see if you're missing any
  • Avoid GROUP BY on high-cardinality columns: If frequency shows thousands of unique values, GROUP BY will produce a huge result — add LIMIT or aggregate differently

SQL Syntax Guide

The sqlp command uses Polars SQL dialect:

-- Basic select
SELECT col1, col2 FROM data WHERE col1 > 100

-- Aggregation
SELECT category, COUNT(*) as cnt, AVG(price) as avg_price
FROM data GROUP BY category ORDER BY cnt DESC

-- Window functions
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rank
FROM employees

-- String operations
SELECT * FROM data WHERE col1 LIKE '%pattern%'

-- Date operations
SELECT *, EXTRACT(YEAR FROM date_col) as year FROM data

-- Multiple files (join)
SELECT a.*, b.name FROM file1 a JOIN file2 b ON a.id = b.id

-- CASE expressions
SELECT *, CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END as tier FROM data

Table Naming Convention

  • File: sales_2024.csv -> Table: sales_2024
  • File: my-data.csv -> Table: "my-data" (quote if contains special chars)
  • Multiple files: each file is a separate table

Notes

  • sqlp uses the Polars engine - some PostgreSQL-specific syntax may not be supported
  • For very complex queries that fail, suggest DuckDB as an alternative
  • The stats cache helps Polars choose optimal data types for columns
  • Results go to stdout by default; use --output file.csv for large result sets
  • Column names are case-sensitive in SQL queries
  • Use LIMIT to preview large result sets before running full queries
  • sqlp can query multiple CSV files in a single SQL statement (useful for joins)