Data Profiling
Systematic approach to understanding a dataset's structure, quality, and analytical potential before diving into analysis.
Structural Reconnaissance
Table-Level Inventory
Start every new dataset encounter by answering these questions:
- What is the total row and column count?
- What does each row represent (the grain)?
- Which column or columns uniquely identify a row?
- When was the most recent data loaded?
- What is the earliest date in the dataset?
Classifying Columns by Role
Assign every column to one of these categories:
- Key: Primary keys, foreign keys, entity identifiers
- Attribute: Categorical fields used for grouping or filtering (region, status, plan_type)
- Measure: Numeric fields intended for aggregation (revenue, duration, score)
- Timestamp: Date or datetime fields (created_at, processed_on, event_date)
- Free-text: Unstructured strings (descriptions, comments, names)
- Flag: Binary true/false indicators
- Nested: JSON blobs, arrays, or embedded structures
Column-Level Inspection
Universal Checks (All Column Types)
- Null count and null percentage
- Count of distinct values and cardinality ratio (distinct / total rows)
- Top 5-10 most frequent values with their occurrence counts
- Bottom 5 least frequent values (useful for spotting anomalies)
Numeric Measures
Minimum, maximum, mean, median
Standard deviation
Percentile ladder: p1, p5, p25, p75, p95, p99
Count of zeros
Count of negatives (flag if unexpected)
Text and Categorical Fields
Shortest length, longest length, average length
Count of empty strings
Format regularity (do values follow a consistent pattern?)
Case consistency (uniform upper, uniform lower, or mixed?)
Count of values with leading or trailing whitespace
Date and Timestamp Fields
Earliest date, latest date
Count of nulls
Count of future dates (flag if the domain forbids them)
Distribution across months or weeks
Gaps in expected daily/weekly cadence
Boolean Fields
True count, false count, null count
True proportion
Discovering Relationships
After examining columns individually, look for connections:
- Foreign key candidates: ID columns whose values likely reference another table
- Hierarchical dimensions: Columns that nest naturally (country > region > city)
- Correlated measures: Numeric columns that trend together
- Computed columns: Fields that appear derived from other columns in the same table
- Duplicated information: Columns that carry identical or near-identical content
Evaluating Data Quality
Completeness Ratings
Assign each column a tier:
- Full (>99% populated): No action needed
- Mostly full (95-99%): Investigate the missing values
- Gaps present (80-95%): Determine whether the gaps are systematic and whether the column is still usable
- Sparse (<80%): Likely unusable without imputation or supplemental data
Consistency Checks
Scan for:
- Value normalization failures: The same concept spelled multiple ways ("USA", "US", "United States", "us")
- Type mismatches: Numerics stored as text, dates in inconsistent formats
- Broken references: Foreign key values with no matching parent record
- Rule violations: Negative quantities, end dates preceding start dates, percentages outside 0-100
- Cross-column contradictions: status = 'shipped' while ship_date is null
Accuracy Warning Signs
Patterns that suggest the data may be unreliable:
- Sentinel values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
- Suspiciously dominant defaults: One value appearing far more often than expected
- Stale records: The updated_at column shows no recent activity in an active system
- Physically impossible values: Ages above 150, dates decades in the future, negative durations
- Rounding artifacts: Disproportionate clustering on multiples of 5 or 10 (implies estimation)
Freshness Assessment
- What is the timestamp of the most recent row?
- How often should this table refresh?
- Is there measurable lag between event occurrence and warehouse arrival?
- Are there missing days or hours in the time series?
Recognizing Patterns
Distribution Shapes
When profiling a numeric column, classify its shape:
- Bell-shaped: Mean and median nearly equal; symmetric tails
- Right-skewed: Long right tail with a few very large values (typical for revenue, session length)
- Left-skewed: Long left tail with a few very small values (less common)
- Bimodal: Two distinct peaks (suggests two merged populations)
- Power-law: A handful of enormous values dominating; many near-zero values (user engagement metrics)
- Uniform: Roughly flat across the range (often synthetic or randomly generated)
Time-Based Patterns
For any temporal data, investigate:
- Trend: Persistent upward or downward drift
- Seasonality: Recurring cycles — weekly, monthly, quarterly, annual
- Weekday effects: Systematic weekday vs. weekend differences
- Holiday impacts: Spikes or dips around known holidays
- Level shifts: Abrupt, sustained changes in the baseline
- Isolated anomalies: Single data points that break the prevailing pattern
Segment Discovery
Surface natural groupings by:
- Identifying categorical columns with 3 to 20 distinct values
- Comparing metric distributions across each segment value
- Highlighting segments whose behavior diverges meaningfully from the overall
- Checking whether apparent segments contain meaningful sub-segments
Correlation Scanning
Across numeric columns:
- Build a pairwise correlation matrix
- Highlight strong associations (|r| > 0.7)
- Remember that correlation never establishes causation — note this explicitly
- Probe for non-linear relationships (quadratic, logarithmic) that Pearson r would miss
Documenting What You Find
Dataset Summary Template
## Table: [schema.table_name]
**Purpose**: [What this table captures]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Approximate Rows**: [count, as of date]
**Refresh Cadence**: [real-time / hourly / daily / weekly]
**Responsible Team**: [owner]
### Important Columns
| Column | Type | Meaning | Sample Values | Notes |
|--------|------|---------|---------------|-------|
| user_id | STRING | Unique user handle | "usr_abc123" | References users.id |
| event_type | STRING | Action category | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | USD transaction amount | 29.99, 149.00 | Null for non-purchases |
| created_at | TIMESTAMP | Event occurrence time | 2024-01-15 14:23:01 | Partition column |
### Join Paths
- Links to `users` via `user_id`
- Links to `products` via `product_id`
- Parent of `event_details` (one-to-many on event_id)
### Known Caveats
- [Document any quality issues]
- [Note analytical gotchas]
### Typical Query Use Cases
- [List common analytical patterns against this table]
Schema Discovery Queries
When working directly against a warehouse, use these patterns:
-- Enumerate tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
-- Inspect column metadata (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'target_table'
ORDER BY ordinal_position;
-- Rank tables by storage footprint (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
-- Row count per table (general approach)
-- Execute individually: SELECT COUNT(*) FROM table_name
Tracing Data Lineage
When navigating an unfamiliar warehouse:
- Begin at the consumption layer — identify which tables power reports and dashboards
- Follow dependencies upstream: what feeds those tables?
- Map the raw / staging / mart architecture
- Track where transformations enrich, filter, or roll up the data
- Record any points where data is joined with external sources