Polars Fast DataFrame Library
Lightning-fast DataFrame library with lazy evaluation and parallel execution.
When to Use
- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config
Lazy vs Eager Evaluation
| Mode | Function | Executes | Use Case |
|------|----------|----------|----------|
| Eager | read_csv() | Immediately | Small data, exploration |
| Lazy | scan_csv() | On .collect() | Large data, pipelines |
Key concept: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
Core Operations
Data Selection
| Operation | Purpose |
|-----------|---------|
| select() | Choose columns |
| filter() | Choose rows by condition |
| with_columns() | Add/modify columns |
| drop() | Remove columns |
| head(n) / tail(n) | First/last n rows |
Aggregation
| Operation | Purpose |
|-----------|---------|
| group_by().agg() | Group and aggregate |
| pivot() | Reshape wide |
| melt() | Reshape long |
| unique() | Distinct values |
Joins
| Join Type | Description | |-----------|-------------| | inner | Matching rows only | | left | All left + matching right | | outer | All rows from both | | cross | Cartesian product | | semi | Left rows with match | | anti | Left rows without match |
Expression API
Key concept: Polars uses expressions (pl.col()) instead of indexing. Expressions are lazily evaluated and optimized.
Common Expressions
| Expression | Purpose |
|------------|---------|
| pl.col("name") | Reference column |
| pl.lit(value) | Literal value |
| pl.all() | All columns |
| pl.exclude(...) | All except |
Expression Methods
| Category | Methods |
|----------|---------|
| Aggregation | .sum(), .mean(), .min(), .max(), .count() |
| String | .str.contains(), .str.replace(), .str.to_lowercase() |
| DateTime | .dt.year(), .dt.month(), .dt.day() |
| Conditional | .when().then().otherwise() |
| Window | .over(), .rolling_mean(), .shift() |
Pandas Migration
| Pandas | Polars |
|--------|--------|
| df['col'] | df.select('col') |
| df[df['col'] > 5] | df.filter(pl.col('col') > 5) |
| df['new'] = df['col'] * 2 | df.with_columns((pl.col('col') * 2).alias('new')) |
| df.groupby('col').mean() | df.group_by('col').agg(pl.all().mean()) |
| df.apply(func) | df.map_rows(func) (avoid if possible) |
Key concept: Polars prefers explicit operations over implicit indexing. Use .alias() to name computed columns.
File I/O
| Format | Read | Write | Notes |
|--------|------|-------|-------|
| CSV | read_csv() / scan_csv() | write_csv() | Human readable |
| Parquet | read_parquet() / scan_parquet() | write_parquet() | Fast, compressed |
| JSON | read_json() / scan_ndjson() | write_json() | Newline-delimited |
| IPC/Arrow | read_ipc() / scan_ipc() | write_ipc() | Zero-copy |
Key concept: Use Parquet for performance. Use scan_* for large files to enable lazy optimization.
Performance Tips
| Tip | Why | |-----|-----| | Use lazy mode | Query optimization | | Use Parquet | Column-oriented, compressed | | Select columns early | Projection pushdown | | Filter early | Predicate pushdown | | Avoid Python UDFs | Breaks parallelism | | Use expressions | Vectorized operations | | Set dtypes on read | Avoid inference overhead |
vs Alternatives
| Tool | Best For | Limitations | |------|----------|-------------| | Polars | 1-100GB, speed critical | Must fit in RAM | | Pandas | Small data, ecosystem | Slow, memory hungry | | Dask | Larger than RAM | More complex API | | Spark | Cluster computing | Infrastructure overhead | | DuckDB | SQL interface | Different API style |
Resources
- Docs: https://pola.rs/
- User Guide: https://docs.pola.rs/user-guide/
- Cookbook: https://docs.pola.rs/user-guide/misc/cookbook/