Statistical Methods
A practitioner's guide to applying statistics in data analysis, from summarizing distributions through testing hypotheses and spotting analytical traps.
Summarizing Numeric Data
Choosing a Center Metric
| Data Characteristic | Recommended Measure | Rationale |
|---|---|---|
| Symmetric, outlier-free | Mean | Most efficient estimator when the data are roughly normal |
| Asymmetric or outlier-heavy | Median | Unaffected by extreme values |
| Non-numeric or ranked | Mode | Only measure defined for unordered categories (ranked data also supports the median) |
| Business KPIs like revenue per user | Both mean and median | The gap between them reveals skewness |
Guideline: For any business metric, present the mean alongside the median. When they differ substantially, the distribution is skewed and the mean by itself will mislead.
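A minimal sketch of this guideline, using only Python's standard library. The `center_summary` helper and the revenue figures are hypothetical, chosen to show how a single large account pulls the mean away from the median.

```python
from statistics import mean, median

def center_summary(values):
    """Return the mean, the median, and the relative gap between them."""
    m = mean(values)
    med = median(values)
    # Relative gap: how far the mean sits from the median, as a share of the median.
    gap = (m - med) / med if med != 0 else float("nan")
    return {"mean": m, "median": med, "relative_gap": gap}

# Hypothetical revenue-per-user sample with one very large account.
revenue = [10, 12, 11, 13, 12, 500]
summary = center_summary(revenue)
print(summary)  # mean 93, median 12: the mean alone would mislead
```

What counts as a "substantial" gap is a judgment call that depends on the metric; the relative gap is just one convenient way to express it.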
Quantifying Variability
- Standard deviation: Typical distance from the mean; best suited to bell-shaped data.
- IQR (interquartile range): Gap between the 25th and 75th percentiles; resistant to extreme values.
- Coefficient of variation: Standard deviation divided by the mean; enables apples-to-apples variability comparison across different scales.
- Range: Maximum minus minimum; gives a quick but outlier-sensitive view of data spread.
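The four measures above can be computed side by side. This is a sketch with the standard library; `spread_summary` and the sample values are illustrative assumptions, and the quartiles use linear interpolation between observed values (`method="inclusive"`), which is one of several common conventions.

```python
from statistics import mean, stdev, quantiles

def spread_summary(values):
    """Standard deviation, IQR, coefficient of variation, and range for one sample."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    avg = mean(values)
    return {
        "std": sd,
        "iqr": q3 - q1,
        "cv": sd / avg if avg != 0 else float("nan"),
        "range": max(values) - min(values),
    }

# Hypothetical sample.
values = [2, 4, 4, 4, 5, 5, 7, 9]
spread = spread_summary(values)
print(spread)
```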
Telling the Story with Percentiles
Go beyond averages by reporting a percentile ladder:
- p1: Floor of the distribution (bottom 1%)
- p5: Lower boundary of typical values
- p25: First quartile
- p50: Median, the representative observation
- p75: Third quartile
- p90: Top 10% threshold (heavy users, premium tier)
- p95: Upper boundary of typical values
- p99: Extreme top 1%
Sample insight: "Half of all sessions last under 4.2 minutes, yet the top decile exceeds 22 minutes, which pushes the average to 7.8 minutes."
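One way to build the ladder with the standard library is sketched below. `percentile_ladder` is a hypothetical helper, and the input is a synthetic series (0 through 100) picked so that each percentile equals its own label.

```python
from statistics import quantiles

def percentile_ladder(values, points=(1, 5, 25, 50, 75, 90, 95, 99)):
    """Return the requested percentiles as a {'p50': value, ...} mapping."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(values, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] for p in points}

# Synthetic data: 101 evenly spaced values, 0 through 100.
durations = list(range(101))
ladder = percentile_ladder(durations)
for label, value in ladder.items():
    print(f"{label}: {value}")
```

On real data the same call works unchanged; only the interpolation method is a choice worth stating in the write-up.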
Characterizing Distributions
For every numeric column, document:
- Shape: Gaussian, right-tailed, left-tailed, bimodal, uniform, heavy-tailed
- Center: Mean vs. median and the magnitude of their difference
- Spread: Standard deviation or IQR as appropriate
- Extremes: Count and severity of outliers
- Boundaries: Natural limits such as zero floors or 100% ceilings
Trend Analysis and Projection
Smoothing Noisy Time Series
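```python
# Assumes df is a pandas DataFrame with one row per day and a numeric 'metric' column.
# min_periods=1 keeps the earliest rows, which then average fewer than `window` points.
# Weekly smoother: useful for daily data with weekday/weekend cycles
df['smooth_7'] = df['metric'].rolling(window=7, min_periods=1).mean()
# Four-week smoother: irons out both weekly and monthly rhythms
df['smooth_28'] = df['metric'].rolling(window=28, min_periods=1).mean()
```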
Period Comparisons
- Week-over-week: Same weekday, one week apart
- Month-over-month: Calendar month versus prior calendar month
- Year-over-year: The gold standard for businesses with seasonal patterns
- Same-calendar-day: Matches the exact date from the prior year
Measuring Growth
Simple rate: (current - prior) / prior
CAGR: (final / initial) ^ (1 / n_years) - 1
Log rate: ln(current / prior) # adds up across periods; better behaved for volatile series
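The three formulas translate directly into Python. The function names below are illustrative, and the figures are made up; the last lines show the property that makes log rates convenient, namely that they add up across consecutive periods.

```python
import math

def simple_growth(current, prior):
    """(current - prior) / prior"""
    return (current - prior) / prior

def cagr(final, initial, n_years):
    """Compound annual growth rate over n_years."""
    return (final / initial) ** (1 / n_years) - 1

def log_growth(current, prior):
    """Natural log of the ratio between two periods."""
    return math.log(current / prior)

print(simple_growth(120, 100))   # 20% growth
print(cagr(121, 100, 2))         # about 10% per year

# Log rates for consecutive periods sum to the log rate for the whole span.
two_steps = log_growth(110, 100) + log_growth(121, 110)
one_step = log_growth(121, 100)
print(math.isclose(two_steps, one_step))
```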
Spotting Seasonal Cycles
- Visually inspect the raw series first
- Aggregate by day-of-week to surface weekly rhythms
- Aggregate by calendar month to surface annual rhythms
- Always use year-over-year or matched-period comparisons to separate trend from seasonality
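As a sketch of the day-of-week aggregation step, the snippet below averages a daily series by weekday using only the standard library. The `mean_by_weekday` helper and the traffic numbers are hypothetical; the synthetic series is built with a deliberate weekend dip so the pattern is easy to see.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def mean_by_weekday(observations):
    """Average a daily series by day of week. `observations` is a list of (date, value) pairs."""
    buckets = defaultdict(list)
    for day, value in observations:
        buckets[WEEKDAYS[day.weekday()]].append(value)
    return {name: mean(buckets[name]) for name in WEEKDAYS if name in buckets}

# Hypothetical four weeks of daily traffic: 100 on weekdays, 60 on weekends.
start = date(2024, 1, 1)
observations = []
for offset in range(28):
    day = start + timedelta(days=offset)
    observations.append((day, 60 if day.weekday() >= 5 else 100))

weekday_profile = mean_by_weekday(observations)
print(weekday_profile)
```

Grouping by `day.month` instead of `day.weekday()` gives the calendar-month view, provided the series spans more than one year.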
Lightweight Forecasting Approaches
For analysts who need quick projections rather than full modeling:
- Naive: Forecast equals the most recent observation. Serves as the minimum-viable baseline.
- Seasonal naive: Forecast equals the value from the same period in the prior cycle.
- Linear extrapolation: Fit a straight line to recent history. Only appropriate when the trend is clearly linear.
- Trailing average: Use a rolling mean as the projected value.
Always express forecasts as ranges, not point estimates:
- Good: "Next month should bring 10,000 to 12,000 registrations based on the trailing quarter"
- Misleading: "Next month will yield exactly 11,234 registrations"
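The naive, seasonal naive, and trailing-average approaches fit in a few lines. This is a sketch under stated assumptions: the function names and registration counts are hypothetical, and the band of plus or minus one standard deviation is a rough heuristic for expressing a range, not a formal prediction interval.

```python
from statistics import mean, stdev

def naive_forecast(history):
    """Forecast equals the most recent observation."""
    return history[-1]

def seasonal_naive_forecast(history, season_length):
    """Forecast equals the value one full cycle before the period being forecast."""
    return history[-season_length]

def trailing_average_range(history, window):
    """Rolling mean of the last `window` points with a rough +/- one standard deviation band."""
    recent = history[-window:]
    center = mean(recent)
    spread = stdev(recent)  # heuristic band, not a formal prediction interval
    return center - spread, center, center + spread

# Hypothetical monthly registrations.
registrations = [9500, 10200, 9800, 10000, 11000, 12000]
low, mid, high = trailing_average_range(registrations, window=3)
print(f"Next month: roughly {low:,.0f} to {high:,.0f} registrations")
```

Comparing any richer model against `naive_forecast` on held-out periods is a quick check on whether the extra complexity earns its keep.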
Hand off to a specialist when the pattern is non-linear, multiple seasonal cycles overlap, external drivers (ad spend, holidays) matter, or when forecast precision drives resource decisions.
Detecting and Handling Outliers
Identification Techniques
Z-score approach (assumes approximate normality):
```python
# Assumes df is a pandas DataFrame with a numeric 'val' column
z = (df['val'] - df['val'].mean()) / df['val'].std()
outliers = df[z.abs() > 3]  # beyond 3 standard deviations
```
IQR fence approach (works regardless of distribution shape):
```python
q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)
iqr = q3 - q1
lo = q1 - 1.5 * iqr
hi = q3 + 1.5 * iqr
outliers = df[(df['val'] < lo) | (df['val'] > hi)]
```
Percentile cutoff approach (most straightforward, but it flags roughly 2% of rows by construction):
```python
outliers = df[(df['val'] < df['val'].quantile(0.01)) |
              (df['val'] > df['val'].quantile(0.99))]
```
What to Do with Outliers
Never strip outliers automatically. Follow this decision process:
- Diagnose: Is this a recording error, a legitimately extreme observation, or a sign of a separate population?
- Errors: Correct or exclude (e.g., negative ages, epoch-zero timestamps)
- Legitimate extremes: Retain but switch to robust summaries (median, IQR)
- Distinct populations: Analyze separately (e.g., enterprise accounts vs. self-serve)
Document every exclusion: "We set aside 47 records (0.3% of the dataset) with order values above $50K; these bulk enterprise transactions are covered in a separate section."
Detecting Anomalies in Time Series
- Establish an expected baseline (rolling average or year-ago value)
- Compute the residual: actual minus expected
- Flag residuals exceeding 2-3 standard deviations of historical residuals
- Differentiate one-off spikes (point anomalies) from lasting shifts (change points)
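The steps above can be sketched as a small standard-library function. Everything named here (`flag_anomalies`, `min_history`, the sample series) is an illustrative assumption. The sketch uses a trailing mean as the baseline and keeps flagged points out of both the baseline and the residual history, so a single spike does not distort what follows.

```python
from statistics import mean, stdev

def flag_anomalies(series, window=7, threshold=3.0, min_history=4):
    """Return indices whose residual from a trailing-mean baseline exceeds
    `threshold` standard deviations of the earlier, unflagged residuals."""
    clean = list(series[:window])  # values trusted for the baseline
    residuals = []
    flagged = []
    for i in range(window, len(series)):
        expected = mean(clean[-window:])
        residual = series[i] - expected
        if len(residuals) >= min_history:
            sd = stdev(residuals)
            if sd > 0 and abs(residual) > threshold * sd:
                flagged.append(i)
                continue  # keep the anomaly out of the baseline and residual history
        clean.append(series[i])
        residuals.append(residual)
    return flagged

# Hypothetical series with a single spike at index 8.
series = [100, 102, 100, 102, 100, 102, 100, 102, 140, 100, 102, 100, 102]
anomalies = flag_anomalies(series, window=4)
print(anomalies)
```

Because flagged values never enter the baseline, a lasting shift shows up as a long run of consecutive flags, which is one practical way to tell a change point from a one-off spike.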
Hypothesis Testing Essentials
When It Applies
Use formal testing whenever you need to distinguish a real signal from random noise:
- Evaluating A/B experiment results
- Measuring the impact of a product change (before vs. after)
- Comparing metrics across customer segments
Step-by-Step Process
- State the null (H0): No difference exists (default position)
- State the alternative (H1): A difference exists
- Set the significance threshold (alpha): 0.05 is standard (5% false-positive tolerance)
- Calculate the test statistic and p-value
- Decide: p < alpha means sufficient evidence to reject H0; otherwise you fail to reject H0, which is not proof that no difference exists
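As one concrete instance of this process, here is a sketch of a two-sided two-proportion z-test for comparing conversion rates, written with the standard library. The function name and the experiment counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for H0: both groups share the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical A/B result: 200 of 2,000 convert in A, 260 of 2,000 in B.
alpha = 0.05
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```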
Selecting the Right Test
| Question | Appropriate Test | Conditions |
|---|---|---|
| Two group means differ? | Independent samples t-test | Roughly normal, two groups |
| Two conversion rates differ? | Proportions z-test | Binary outcomes |
| Same entities measured twice? | Paired t-test | Pre/post on identical subjects |
| Three or more group means? | ANOVA | Multiple variants or segments |
| Non-normal data, two groups? | Mann-Whitney U | Skewed or ordinal metrics |
| Two categorical variables related? | Chi-squared test | Frequency table data |
Beyond p-values: Practical Impact
A statistically significant result means only that a difference this large would be unlikely if no true effect existed. It does not guarantee the effect matters in practice. Always accompany test results with:
- Effect magnitude: "Variant B lifted conversion by 0.3 percentage points"
- Confidence interval: The plausible range of the true effect
- Business translation: Revenue, user, or efficiency implications
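A sketch of reporting effect magnitude with a confidence interval for a difference in conversion rates. The helper name and the counts are hypothetical, and the interval is the simple normal-approximation (Wald) form, which is reasonable for large samples but only one of several available constructions.

```python
from math import sqrt
from statistics import NormalDist

def diff_in_proportions_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Point estimate and normal-approximation CI for p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z_crit * se, diff + z_crit * se

# Hypothetical large experiment: 5.0% vs 5.3% conversion, one million users per arm.
diff, ci_low, ci_high = diff_in_proportions_ci(50_000, 1_000_000, 53_000, 1_000_000)
print(f"Lift: {diff * 100:.2f} points "
      f"(95% CI {ci_low * 100:.2f} to {ci_high * 100:.2f} points)")
```

With samples this large the interval excludes zero, yet whether a lift of about 0.3 points is worth shipping is still a business question.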
Sample Size Awareness
- Small samples yield unreliable conclusions even when p-values look good
- Proportions require roughly 30 or more events per group for baseline reliability
- Detecting subtle effects (e.g., a 1-point conversion shift) can demand thousands of observations per arm
- When data is limited, say so: "With 200 observations per group, effects smaller than X% would likely go undetected"
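To make the "thousands of observations per arm" point concrete, the sketch below uses a common normal-approximation formula for the per-group sample size of a two-proportion comparison. The function name, the 80% power target, and the baseline rates are assumptions for illustration; dedicated power-analysis tools may give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate observations needed per group to detect p_base vs p_target
    with a two-sided test at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_target - p_base) ** 2)

# Hypothetical baseline conversion of 10%.
n_small_shift = sample_size_per_arm(0.10, 0.11)  # 1-point shift
n_large_shift = sample_size_per_arm(0.10, 0.15)  # 5-point shift
print(n_small_shift, n_large_shift)
```

Under these assumptions the 1-point shift needs on the order of fifteen thousand observations per arm, while the 5-point shift needs only several hundred.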
Guarding Against Statistical Pitfalls
Correlation vs. Causation
Whenever a correlation surfaces, explicitly evaluate:
- Reverse direction: Perhaps B drives A rather than A driving B
- Hidden third factor: Some unmeasured variable C could be behind both
- Coincidence: Compare enough variable pairs and some will correlate by chance alone
Safe phrasing: "Users who adopt feature X exhibit 30% higher retention"
Unsafe phrasing: "Feature X causes 30% higher retention" (requires experimental evidence)
The Multiple Testing Trap
Running many tests inflates false positives:
- At alpha = 0.05, testing 20 metrics yields about one spurious hit on average, even when no real effect exists
- If you explored numerous segments before finding the "interesting" one, acknowledge that
- Apply Bonferroni correction (alpha / number of tests) or transparently report total tests conducted
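A short sketch of the arithmetic behind these points. The helper names and p-values are hypothetical, and the family-wise error formula assumes the tests are independent.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the corrected threshold and which p-values clear it."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** n_tests

# Hypothetical p-values from five segment comparisons.
p_values = [0.001, 0.012, 0.030, 0.040, 0.200]
threshold, significant = bonferroni(p_values)
print(threshold, significant)

# With 20 independent tests at alpha = 0.05, a false positive is more likely than not.
fwer_20 = familywise_error_rate(20)
print(round(fwer_20, 3))
```

Four of the five p-values fall below 0.05, but only one survives the corrected threshold of 0.01.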
Simpson's Paradox
An overall trend can invert when you break the data into subgroups:
- Verify that aggregate conclusions hold within each key segment
- Classic scenario: total conversion rises while every segment's conversion falls, because traffic shifted toward a naturally higher-converting segment
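The classic scenario is easy to reproduce with made-up numbers. In the sketch below, the segment names and counts are hypothetical: both segments convert worse after the change, yet the overall rate rises because traffic moved toward the higher-converting segment.

```python
def rate(conversions, visits):
    return conversions / visits

def overall_rate(segments):
    """Pooled conversion rate across all segments."""
    total_conversions = sum(c for c, _ in segments.values())
    total_visits = sum(v for _, v in segments.values())
    return total_conversions / total_visits

# Hypothetical (conversions, visits) per segment, before and after a traffic shift.
before = {"returning": (200, 1000), "new": (450, 9000)}
after = {"returning": (900, 5000), "new": (200, 5000)}

for name in before:
    print(name, rate(*before[name]), "->", rate(*after[name]))
print("overall", overall_rate(before), "->", overall_rate(after))
```

Returning visitors drop from 20% to 18% and new visitors from 5% to 4%, while the pooled rate climbs from 6.5% to 11%.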
Survivorship Bias
Your dataset only contains entities that persisted long enough to be recorded:
- Studying current users ignores everyone who already left
- Profiling winning products overlooks the failures
- Routinely ask: "Who is absent from this data, and would including them change the conclusion?"
Ecological Fallacy
Group-level patterns may not describe individuals:
- "Nations with higher X tend to have higher Y" does not mean the same holds per person
- Resist applying aggregate statistics to individual-level predictions
Illusory Precision
Overly specific numbers suggest unjustified confidence:
- "Churn will be 4.73% next quarter" implies an accuracy that rarely exists
- Prefer honest ranges: "Churn is likely between 4% and 6%"
- Round to the level of certainty you actually possess