Dask Parallel and Distributed Computing
Scale pandas/NumPy workflows beyond memory and across clusters.
When to Use
- Datasets exceed available RAM
- Need to parallelize pandas or NumPy operations
- Processing multiple files efficiently (CSVs, Parquet)
- Building custom parallel workflows
- Distributing workloads across multiple cores/machines
Dask Collections
| Collection | Like | Use Case |
|------------|------|----------|
| DataFrame | pandas | Tabular data, CSV/Parquet |
| Array | NumPy | Numerical arrays, matrices |
| Bag | list | Unstructured data, JSON logs |
| Delayed | Custom | Arbitrary Python functions |
Key concept: All collections are lazy—computation happens only when you call .compute().
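A minimal sketch of the four collections (file paths are hypothetical); note that only the .compute() calls trigger work:

```python
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
from dask import delayed

# Each constructor returns a lazy collection; nothing is read or computed yet.
ddf = dd.read_parquet("data/events.parquet")  # DataFrame: pandas-like (hypothetical path)
arr = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))  # Array: NumPy-like
bag = db.read_text("logs/*.json")             # Bag: list-like (hypothetical path)
task = delayed(sum)([1, 2, 3])                # Delayed: wraps any Python call

# Computation happens only here
print(arr.mean().compute())
print(task.compute())
```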
Lazy Evaluation
| Function | Behavior | Use |
|----------|----------|-----|
| dd.read_csv() | Lazy load | Large CSVs |
| dd.read_parquet() | Lazy load | Large Parquet |
| Operations | Build graph | Chain transforms |
| .compute() | Execute | Get final result |
Key concept: Dask builds a task graph of operations, optimizes it, then executes in parallel. Call .compute() once at the end, not after every operation.
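For example, a sketch of a lazy chain (column names are hypothetical); each step only extends the task graph:

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")                    # lazy: nothing is read yet
positive = ddf[ddf["amount"] > 0]                  # lazy: extends the task graph
means = positive.groupby("user")["amount"].mean()  # still lazy

result = means.compute()  # graph is optimized and executed in parallel here
```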
Schedulers
| Scheduler | Best For | Start |
|-----------|----------|-------|
| threaded | NumPy/pandas (releases the GIL) | Default |
| processes | Pure Python (GIL bound) | scheduler='processes' |
| synchronous | Debugging | scheduler='synchronous' |
| distributed | Monitoring, scaling, clusters | Client() |
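A sketch of per-call and global scheduler selection:

```python
import dask
import dask.array as da

x = da.random.random((5_000, 5_000), chunks=(1_000, 1_000))

x.sum().compute(scheduler="threads")      # default for arrays/dataframes
x.sum().compute(scheduler="processes")    # pure-Python, GIL-bound work
x.sum().compute(scheduler="synchronous")  # single-threaded, easiest to debug

dask.config.set(scheduler="processes")    # or set the default globally
```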
Distributed Scheduler
| Feature | Benefit |
|---------|---------|
| Dashboard | Real-time progress monitoring |
| Cluster scaling | Add/remove workers |
| Fault tolerance | Retry failed tasks |
| Worker resources | Memory management |
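A minimal sketch of starting a local distributed cluster (the worker counts are illustrative, not recommendations):

```python
from dask.distributed import Client

client = Client(n_workers=4, memory_limit="4GB")  # local cluster; args are illustrative
print(client.dashboard_link)                      # URL of the real-time task dashboard

# From here, .compute() on any collection uses the distributed scheduler.
client.close()
```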
Chunking Concepts
DataFrame Partitions
| Concept | Description |
|---------|-------------|
| Partition | Subset of rows (like a mini DataFrame) |
| npartitions | Number of partitions |
| divisions | Index boundaries between partitions |
Array Chunks
| Concept | Description |
|---------|-------------|
| Chunk | Subset of array (n-dimensional block) |
| chunks | Tuple of chunk sizes per dimension |
| Optimal size | ~100 MB per chunk |
Key concept: Chunk size is critical. Too small = scheduling overhead. Too large = memory issues. Target ~100 MB.
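For instance, a sketch of steering toward that target (paths hypothetical; the repartition divisor is arbitrary):

```python
import dask.array as da
import dask.dataframe as dd

# Arrays: "auto" lets Dask pick chunk sizes near its configured target (~128 MB by default)
x = da.random.random((100_000, 10_000), chunks="auto")
x = x.rechunk("auto")  # rechunk an existing array the same way

# DataFrames: consolidate many small partitions into fewer, larger ones
ddf = dd.read_csv("data/*.csv")
ddf = ddf.repartition(npartitions=max(1, ddf.npartitions // 4))
```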
DataFrame Operations
Supported (parallel)
| Category | Operations |
|----------|------------|
| Selection | filter, loc, column selection |
| Aggregation | groupby, sum, mean, count |
| Transforms | apply (row-wise), map_partitions |
| Joins | merge, join (shuffles data) |
| I/O | read_csv, read_parquet, to_parquet |
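A sketch chaining these supported operations lazily (paths and column names are hypothetical):

```python
import dask.dataframe as dd

sales = dd.read_parquet("sales/")                  # lazy I/O
stores = dd.read_parquet("stores/")

recent = sales[sales["year"] >= 2023]              # selection
totals = recent.groupby("region")["amount"].sum()  # parallel aggregation
joined = recent.merge(stores, on="store_id")       # join (triggers a shuffle)

print(totals.compute())
joined.to_parquet("out/")                          # parallel write
```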
Avoid or Use Carefully
| Operation | Issue | Alternative |
|-----------|-------|-------------|
| iterrows | Kills parallelism | map_partitions |
| apply(axis=1) | Slow | map_partitions |
| Repeated compute() | Inefficient | Single compute() at end |
| sort_values | Expensive shuffle | Avoid if possible |
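For example, replacing a row-wise apply with a per-partition function (add_total and its columns are hypothetical):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("data/")

# Slow: ddf.apply(func, axis=1) calls Python once per row.
# Fast: run a vectorized pandas function once per partition instead.
def add_total(df):             # receives a plain pandas DataFrame
    df = df.copy()
    df["total"] = df["amount"] * 1.05
    return df

result = ddf.map_partitions(add_total).compute()
```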
Common Patterns
ETL Pipeline
- dd.read_* (lazy load)
- Chain filters and transforms
- Single .compute() or .to_parquet() at the end (sketch below)
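A minimal ETL sketch following these steps (paths and column names are hypothetical):

```python
import dask.dataframe as dd

ddf = dd.read_csv("raw/*.csv")                   # lazy load
ddf = ddf[ddf["status"] == "ok"]                 # filter
ddf["revenue"] = ddf["price"] * ddf["quantity"]  # transform
ddf.to_parquet("clean/")                         # single trigger runs the whole graph
```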
Multi-File Processing
| Pattern | Description |
|---------|-------------|
| Glob patterns | dd.read_csv('data/*.csv') |
| Partition per file | Natural parallelism |
| Output partitioned | to_parquet('output/') |
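A sketch of the glob-to-Parquet pattern (paths hypothetical):

```python
import dask.dataframe as dd

# Glob pattern: one partition per matching file gives natural parallelism
ddf = dd.read_csv("data/2024-*.csv")
print(ddf.npartitions)

# Write a partitioned Parquet dataset (one file per partition)
ddf.to_parquet("output/", write_index=False)
```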
Custom Operations
| Method | Use Case |
|--------|----------|
| map_partitions | Apply function to each partition |
| map_blocks | Apply function to each array block |
| delayed | Wrap arbitrary Python functions |
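For example, a delayed sketch that parallelizes plain Python functions (file names are hypothetical):

```python
import dask
from dask import delayed

@delayed
def load(path):                # any plain Python function can be wrapped
    with open(path) as f:
        return f.read()

@delayed
def count_words(text):
    return len(text.split())

tasks = [count_words(load(p)) for p in ["a.txt", "b.txt", "c.txt"]]
totals = dask.compute(*tasks)  # builds one graph; runs all tasks in parallel
```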
Best Practices
| Practice | Why |
|----------|-----|
| Don't load locally first | Let Dask handle loading |
| Single compute() at end | Avoid redundant computation |
| Use Parquet | Faster than CSV, columnar |
| Match partition to files | One partition per file |
| Check task graph size | len(ddf.__dask_graph__()) < 100k |
| Use distributed for debugging | Dashboard shows progress |
Common Pitfalls
| Pitfall | Solution |
|---------|----------|
| Loading with pandas first | Use dd.read_* directly |
| compute() in loops | Collect all, single compute() |
| Too many partitions | Repartition to ~100 MB each |
| Memory errors | Reduce chunk size, add workers |
| Slow shuffles | Avoid sorts/joins when possible |
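For instance, the compute-in-loops pitfall and its fix (ddf, the group column, and the group values are hypothetical):

```python
import dask
import dask.dataframe as dd

ddf = dd.read_parquet("data/")

# Anti-pattern: .compute() in a loop re-executes shared work each time
# sums = [ddf[ddf["group"] == g]["amount"].sum().compute() for g in ["a", "b", "c"]]

# Better: build every lazy result, then materialize them together
lazy = [ddf[ddf["group"] == g]["amount"].sum() for g in ["a", "b", "c"]]
sums = dask.compute(*lazy)  # one optimized pass over the data
```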
vs Alternatives
| Tool | Best For | Trade-off |
|------|----------|-----------|
| Dask | Scale pandas/NumPy, clusters | Setup complexity |
| Polars | Fast in-memory | Must fit in RAM |
| Vaex | Out-of-core single machine | Limited operations |
| Spark | Enterprise, SQL-heavy | Infrastructure |
Resources
- Docs: https://docs.dask.org/
- Best Practices: https://docs.dask.org/en/stable/best-practices.html
- Examples: https://examples.dask.org/