Exploratory Data Analysis (EDA)
Analyze tabular datasets to understand distributions, data quality, and patterns.
When to Use
- Understanding a new dataset before modeling
- Checking data quality (missing values, outliers, duplicates)
- Analyzing target variable distribution
- Identifying class imbalance
- Generating summary statistics
Analysis Process
1. Connect to data: verify access and inspect the schema
2. Analyze the target variable first: understand class balance
3. Check each column: distribution, missing data, cardinality
4. Document findings: save reports for reproducibility
Available Analyses
| Analysis | Description |
|----------|-------------|
| Column Distribution | Value counts, percentages, cardinality assessment |
| Missing Data | Null counts, patterns (MCAR/MAR/MNAR) |
| Class Balance | Imbalance detection for classification targets |
| Summary Stats | Count, unique, nulls per column |
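
As an illustration of the Summary Stats row, here is a minimal sketch that computes count, unique, and nulls per column using only the standard library. The names `summarize_columns` and `is_missing` are hypothetical. Treating `None` and blank strings as missing, and reporting `count` as the number of non-missing values, are assumptions made for this example rather than part of the reference above.

```python
def is_missing(value):
    """Assumption for this sketch: None and blank strings count as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def summarize_columns(rows):
    """Return {column: {"count", "unique", "nulls"}} for a list of dict rows.

    "count" is the number of non-missing values (an assumption).
    """
    working = {}
    for row in rows:
        for column, value in row.items():
            stats = working.setdefault(
                column, {"count": 0, "nulls": 0, "values": set()}
            )
            if is_missing(value):
                stats["nulls"] += 1
            else:
                stats["count"] += 1
                stats["values"].add(value)
    return {
        column: {
            "count": s["count"],
            "unique": len(s["values"]),
            "nulls": s["nulls"],
        }
        for column, s in working.items()
    }


# Hypothetical sample data, for illustration only.
rows = [
    {"id": 1, "color": "red", "size": "M"},
    {"id": 2, "color": "blue", "size": None},
    {"id": 3, "color": "red", "size": ""},
]
stats = summarize_columns(rows)
```
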
Column Distribution Analysis
For detailed analysis methodology and output format:
Quick Reference
Cardinality Levels:

| Level | Criteria | Action |
|-------|----------|--------|
| Low | ≤10 unique | Good for categorical encoding |
| Medium | 11-100 or <1% of rows | May need encoding strategy |
| High | >100 and <50% of rows | Consider grouping/binning |
| Very High | >50% of rows | Likely identifier, exclude |
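
The cardinality levels can be sketched as a small classifier. The criteria overlap in places (for example, 15 unique values in a 20-row table is both "11-100" and ">50% of rows"), so the order of checks below is an assumption: Low first, then Very High, then Medium, with High as the fallback. A ratio of exactly 50% is not covered by the table and is treated here as High. The function name `classify_cardinality` is hypothetical.

```python
def classify_cardinality(n_unique, n_rows):
    """Map a unique-value count to a cardinality level.

    Precedence between overlapping criteria is an assumption of this sketch.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    ratio = n_unique / n_rows
    if n_unique <= 10:
        return "Low"
    if ratio > 0.5:
        # More than half the rows are distinct: likely an identifier.
        return "Very High"
    if n_unique <= 100 or ratio < 0.01:
        return "Medium"
    return "High"


level = classify_cardinality(200, 1000)  # 200 distinct values in 1000 rows
```
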
Missing Data Thresholds:

| Percentage | Assessment |
|------------|------------|
| 0% | No missing data |
| <1% | Minimal - safe to drop or impute |
| 1-5% | Some - consider imputation strategy |
| >5% | Significant - investigate pattern |
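
A minimal sketch of the missing-data thresholds, assuming the caller already has the missing and total row counts. The function name `assess_missing` and the shortened labels are hypothetical. The 1% and 5% boundaries are taken as inclusive in the "1-5%" band.

```python
def assess_missing(n_missing, n_rows):
    """Return (percentage, label) following the threshold table."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    pct = 100.0 * n_missing / n_rows
    if n_missing == 0:
        label = "No missing data"
    elif pct < 1:
        label = "Minimal"
    elif pct <= 5:
        label = "Some"
    else:
        label = "Significant"
    return pct, label


pct, label = assess_missing(6, 200)  # 6 missing values out of 200 rows
```
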
Class Imbalance:
- >80% in top class: Imbalance detected
- >95% in top class: Extreme imbalance
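
The class imbalance check can be sketched with `collections.Counter`. This assumes the thresholds are strict lower bounds on the top class's share (more than 80%, more than 95%). The function name `assess_class_balance` and the "No imbalance flagged" label are hypothetical.

```python
from collections import Counter


def assess_class_balance(labels):
    """Return (top_class, share_of_top_class, verdict) for a list of labels."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    top_class, top_count = counts.most_common(1)[0]
    share = top_count / len(labels)
    if share > 0.95:
        verdict = "Extreme imbalance"
    elif share > 0.80:
        verdict = "Imbalance detected"
    else:
        verdict = "No imbalance flagged"
    return top_class, share, verdict


# Hypothetical target column: 85 negatives, 15 positives.
top_class, share, verdict = assess_class_balance(["neg"] * 85 + ["pos"] * 15)
```
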
Output Format
# Column Distribution: {column_name}
- **source**: path/to/data
- **column**: column_name
## Summary
- Total rows: N
- Null/missing: N (X%)
- Unique values: N
- Cardinality: Low|Medium|High|Very High
## Distribution
| Value | Count | Percentage | Cumulative |
|-------|-------|------------|------------|
## Observations
- Auto-generated insights
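
One way to produce a report in the format above is sketched below. The function name `render_column_report` is hypothetical, and the cardinality label is passed in by the caller. Percentages are computed against total rows including missing ones, and only `None` counts as missing; both are assumptions. The Observations section is left out because the source does not specify how insights are generated.

```python
from collections import Counter


def render_column_report(values, column_name, source, cardinality):
    """Render the summary and distribution sections as Markdown text."""
    total = len(values)
    if total == 0:
        raise ValueError("values must not be empty")
    present = [v for v in values if v is not None]
    n_missing = total - len(present)
    counts = Counter(present)
    lines = [
        f"# Column Distribution: {column_name}",
        f"- **source**: {source}",
        f"- **column**: {column_name}",
        "## Summary",
        f"- Total rows: {total}",
        f"- Null/missing: {n_missing} ({100 * n_missing / total:.1f}%)",
        f"- Unique values: {len(counts)}",
        f"- Cardinality: {cardinality}",
        "## Distribution",
        "| Value | Count | Percentage | Cumulative |",
        "|-------|-------|------------|------------|",
    ]
    cumulative = 0.0
    # most_common() yields values from most to least frequent.
    for value, count in counts.most_common():
        pct = 100 * count / total
        cumulative += pct
        lines.append(f"| {value} | {count} | {pct:.1f}% | {cumulative:.1f}% |")
    return "\n".join(lines)


# Hypothetical column values and source path, for illustration only.
report = render_column_report(
    ["red", "red", "blue", None], "color", "path/to/data", "Low"
)
```
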
Best Practices
- Start with schema inspection before deep analysis
- Check target variable first for classification tasks
- Missing data may not be random - investigate patterns
- Save reports for reproducibility
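
As a first step in investigating missing-data patterns, one simple check compares the missing rate of a column across the groups of another column. The sketch below uses the hypothetical name `missing_rate_by_group` and treats only `None` as missing. A large gap between groups suggests that missingness depends on an observed column, which is consistent with MAR rather than MCAR. It cannot by itself rule out MNAR.

```python
from collections import defaultdict


def missing_rate_by_group(rows, column, group_column):
    """Return {group: fraction of rows in that group where `column` is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row.get(group_column)
        totals[group] += 1
        if row.get(column) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical sample data, for illustration only.
rows = [
    {"plan": "free", "income": None},
    {"plan": "free", "income": None},
    {"plan": "free", "income": 30},
    {"plan": "free", "income": 41},
    {"plan": "paid", "income": 50},
    {"plan": "paid", "income": 60},
]
rates = missing_rate_by_group(rows, "income", "plan")
```
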