Statistical Analysis
Overview
Conduct hypothesis tests (t-tests, ANOVA, chi-square), regression, correlation, and Bayesian analyses with systematic assumption checking, effect sizes, and APA-style reporting. The goal is an analysis a reviewer could not tear apart: the right test, verified assumptions, honest effect sizes, and a complete write-up.
When to Use This Skill
Use this skill when:
- Conducting statistical hypothesis tests (t-tests, ANOVA, chi-square, non-parametric)
- Performing regression or correlation analyses
- Running Bayesian statistical analyses
- Checking statistical assumptions and diagnostics
- Calculating effect sizes and conducting power analyses
- Reporting statistical results in APA format
- Analyzing experimental or observational data for research
Installation
Use uv to install the libraries used in this skill. Pin versions in production; unpinned installs are fine for exploration.
# Core frequentist stack (Python 3.10+; 3.12+ recommended for latest SciPy/ArviZ)
uv pip install "pingouin>=0.6" "scipy>=1.11" "statsmodels>=0.14.6" pandas matplotlib seaborn
# Bayesian modeling (PyMC 5 + ArviZ)
uv pip install "pymc>=5.0" "arviz>=1.0"
Compatibility notes (verified against pingouin 0.6.1, statsmodels 0.14.6, arviz 1.2, 2026):
- Pingouin 0.6.0 renamed output columns to remove special characters:
p_val,cohen_d,CI95,p_unc(previouslyp-val,cohen-d,CI95%,p-uncin 0.5.x). Examples below use the current names; if stuck on 0.5.x, use the hyphenated forms. - statsmodels + SciPy: use
statsmodels>=0.14.6withscipy>=1.11to avoid_lazywhereimport errors on SciPy 1.16+. - ArviZ 1.x:
az.summary()now defaults to 89% intervals (eti89columns) and the width parameter isci_prob(nothdi_prob). To report a conventional 95% credible interval, passaz.summary(trace, ci_prob=0.95). - One-sided Bayes Factors are gone from Pingouin:
pg.ttest(..., alternative='greater')silently drops theBF10column, andpg.bayesfactor_ttestraises on one-sided alternatives. For one-sided Bayesian tests, use PyMC directly (compute the posterior probability of the directional hypothesis) or JASP/R's BayesFactor.
For model-specific APIs (OLS, GLM, ARIMA), see the statsmodels skill. For PyMC workflows, see the pymc skill.
Analysis Workflow
Every sound analysis follows the same arc. Skipping steps is how analyses end up retracted, so work through them in order and say what you did at each one.
- Frame the question before touching the data. State the hypothesis, the outcome and predictor variables, and the design (independent vs. paired, number of groups). Commit to a planned test now — choosing the test after peeking at results is p-hacking, even when done innocently.
- Inspect the data. Per group: n, mean, SD, median, missing values. Plot the raw data (histograms or box plots) before any test. Unequal group sizes, missingness, floor/ceiling effects, and outliers all change what test is appropriate — surface them to the user rather than silently working around them.
- Select the test using the quick reference below, or
references/test_selection_guide.mdfor designs beyond the basics (counts, time-to-event, reliability, factorial). - Check assumptions with
scripts/assumption_checks.py. If an assumption fails, switch to the remedial test (table below) and report both the plan and the change. - Run the test and always compute the effect size alongside it — a p-value says an effect exists; the effect size says whether anyone should care.
- Report using the APA templates below, including descriptives, exact statistics, effect sizes with CIs, and the assumption checks performed.
If the user only needs one step (e.g., "how many participants do I need?"), jump straight to that section — but still confirm the design assumptions the calculation rests on.
Test Selection Guide
Quick Reference: Choosing the Right Test
Use references/test_selection_guide.md for comprehensive guidance (counts, survival, reliability, factorial designs). Quick reference:
Comparing Two Groups:
- Independent, continuous, normal → Independent t-test
- Independent, continuous, non-normal → Mann-Whitney U test
- Paired, continuous, normal → Paired t-test
- Paired, continuous, non-normal → Wilcoxon signed-rank test
- Binary outcome → Chi-square or Fisher's exact test
Comparing 3+ Groups:
- Independent, continuous, normal → One-way ANOVA
- Independent, continuous, non-normal → Kruskal-Wallis test
- Paired, continuous, normal → Repeated measures ANOVA
- Paired, continuous, non-normal → Friedman test
Relationships:
- Two continuous variables → Pearson (normal) or Spearman correlation (non-normal)
- Continuous outcome with predictor(s) → Linear regression
- Binary outcome with predictor(s) → Logistic regression
Bayesian Alternatives:
All tests have Bayesian versions providing direct probability statements about hypotheses, Bayes Factors quantifying evidence, and the ability to support the null. See references/bayesian_statistics.md.
Assumption Checking
Always check assumptions before interpreting test results, and report the checks — reviewers look for them.
Use the bundled scripts/assumption_checks.py module. Run Python from the skill directory (skills/statistical-analysis/) or add scripts/ to sys.path:
from assumption_checks import comprehensive_assumption_check
# Outliers + normality (per group) + homogeneity of variance, with plots
results = comprehensive_assumption_check(
data=df,
value_col='score',
group_col='group', # Optional: for group comparisons
alpha=0.05
)
For targeted checks, import individual functions:
from assumption_checks import (
check_normality, # Shapiro-Wilk + Q-Q plot + histogram
check_normality_per_group,
check_homogeneity_of_variance, # Levene's test + box plots
check_linearity, # scatter + residual plot for simple regression
check_regression_diagnostics, # full OLS diagnostics (see Regression below)
detect_outliers # IQR or z-score methods
)
result = check_normality(data=df['score'], name='Test Score', alpha=0.05, plot=True)
print(result['interpretation'])
print(result['recommendation'])
What to Do When Assumptions Are Violated
Normality violated:
- Mild violation + n > 30 per group → Proceed with parametric test (robust)
- Moderate violation → Use non-parametric alternative
- Severe violation → Transform data or use non-parametric test
Homogeneity of variance violated:
- For t-test → Use Welch's t-test (
pg.ttestapplies it automatically withcorrection='auto') - For ANOVA → Use Welch's ANOVA (
pg.welch_anova) or Brown-Forsythe - For regression → Use robust standard errors or weighted least squares
Linearity violated (regression):
- Add polynomial terms, transform variables, or use non-linear models / GAM
Formal tests get oversensitive as n grows: for n ≥ 100, weigh the Q-Q plot more heavily than the Shapiro-Wilk p-value. See references/assumptions_and_diagnostics.md for comprehensive guidance.
Running Statistical Tests
Primary libraries:
- pingouin: user-friendly tests that return effect sizes by default — prefer it for standard tests
- scipy.stats: core statistical tests
- statsmodels: regression, diagnostics, power analysis
- pymc + arviz: Bayesian modeling and diagnostics
T-Test with Complete Reporting
import pingouin as pg
# correction='auto' applies Welch's correction when variances are unequal
result = pg.ttest(group_a, group_b, correction='auto')
# Pingouin >= 0.6 column names
t_stat = result['T'].values[0]
df = result['dof'].values[0]
p_value = result['p_val'].values[0]
cohens_d = result['cohen_d'].values[0]
ci_lower, ci_upper = result['CI95'].values[0] # CI for the mean difference
print(f"t({df:.0f}) = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")
ANOVA with Post-Hoc Tests
import pingouin as pg
aov = pg.anova(dv='score', between='group', data=df, detailed=True)
print(aov)
# Effect size: partial eta-squared
eta_p2 = aov['np2'].values[0]
# If significant, conduct post-hoc tests (Tukey HSD controls family-wise error)
if aov['p_unc'].values[0] < 0.05:
posthoc = pg.pairwise_tukey(dv='score', between='group', data=df)
print(posthoc) # includes Hedges' g per pair
Linear Regression with Diagnostics
import statsmodels.api as sm
from assumption_checks import check_regression_diagnostics
X = sm.add_constant(X_predictors) # Add intercept
model = sm.OLS(y, X).fit()
print(model.summary())
# 4-panel residual plot + Shapiro-Wilk, Breusch-Pagan, Durbin-Watson, VIF
diag = check_regression_diagnostics(model)
print(diag['interpretation'])
print(diag['vif'])
# If heteroscedasticity was flagged, report robust standard errors instead
robust = model.get_robustcov_results('HC3')
Bayesian T-Test
import pymc as pm
import arviz as az
import numpy as np
with pm.Model() as model:
# Priors
mu1 = pm.Normal('mu_group1', mu=0, sigma=10)
mu2 = pm.Normal('mu_group2', mu=0, sigma=10)
sigma = pm.HalfNormal('sigma', sigma=10)
# Likelihood
y1 = pm.Normal('y1', mu=mu1, sigma=sigma, observed=group_a)
y2 = pm.Normal('y2', mu=mu2, sigma=sigma, observed=group_b)
# Derived quantity
diff = pm.Deterministic('difference', mu1 - mu2)
trace = pm.sample(2000, tune=1000)
# ArviZ 1.x defaults to 89% intervals; request 95% explicitly for reporting
print(az.summary(trace, var_names=['difference'], ci_prob=0.95))
# Direct probability statement (this is what one-sided questions become)
prob_greater = np.mean(trace.posterior['difference'].values > 0)
print(f"P(mu1 > mu2 | data) = {prob_greater:.3f}")
# ArviZ 1.x removed az.plot_posterior; use plot_dist (on 0.x, plot_posterior still works)
az.plot_dist(trace, var_names=['difference'], ci_prob=0.95)
Scale priors to the data (e.g., sigma=10 suits outcomes with SD near 10; use the observed SD as a guide) and state the priors in the report.
Effect Sizes
Effect sizes quantify magnitude; p-values only indicate existence. Report one for every test. See references/effect_sizes_and_power.md for the full guide.
Quick Reference: Common Effect Sizes
| Test | Effect Size | Small | Medium | Large | |------|-------------|-------|--------|-------| | T-test | Cohen's d | 0.20 | 0.50 | 0.80 | | ANOVA | η²_p | 0.01 | 0.06 | 0.14 | | Correlation | r | 0.10 | 0.30 | 0.50 | | Regression | R² | 0.02 | 0.13 | 0.26 | | Chi-square | Cramér's V | 0.07 | 0.21 | 0.35 |
Benchmarks are conventions, not laws — a "small" effect can matter enormously (drug side effects) and a "large" one can be trivial. Interpret in context.
Calculating Effect Sizes
Pingouin returns effect sizes with its tests (cohen_d from pg.ttest, np2 from pg.anova, hedges from pg.pairwise_tukey; r from pg.corr is already an effect size).
Confidence Intervals for Effect Sizes
Report a CI for the effect size to show its precision. Use pg.compute_esci (note: pg.compute_effsize_from_t returns only the point estimate — it does not return a CI):
import pingouin as pg
d = pg.compute_effsize(group_a, group_b, eftype='cohen')
ci_lower, ci_upper = pg.compute_esci(stat=d, nx=len(group_a), ny=len(group_b),
eftype='cohen', confidence=0.95)
print(f"d = {d:.2f}, 95% CI [{ci_lower:.2f}, {ci_upper:.2f}]")
Power Analysis
A Priori Power Analysis (Study Planning)
Determine required sample size before data collection:
from statsmodels.stats.power import tt_ind_solve_power, FTestAnovaPower
# T-test: What n per group is needed to detect d = 0.5?
n_required = tt_ind_solve_power(
effect_size=0.5,
alpha=0.05,
power=0.80,
ratio=1.0,
alternative='two-sided'
)
print(f"Required n per group: {n_required:.0f}")
# One-way ANOVA: What n is needed to detect Cohen's f = 0.25?
# Notes: the parameter is k_groups; effect_size is Cohen's f (f = sqrt(eta2/(1-eta2)));
# and solve_power returns the TOTAL sample size, not n per group.
import math
anova_power = FTestAnovaPower()
n_total = anova_power.solve_power(
effect_size=0.25,
k_groups=3,
alpha=0.05,
power=0.80
)
print(f"Required total N: {math.ceil(n_total)} ({math.ceil(n_total / 3)} per group)")
Sensitivity Analysis (Post-Study)
Determine what effect size the study could detect:
# With n=50 per group, what effect could we detect at 80% power?
detectable_d = tt_ind_solve_power(
effect_size=None, # Solve for this
nobs1=50,
alpha=0.05,
power=0.80,
ratio=1.0,
alternative='two-sided'
)
print(f"Study could detect d >= {detectable_d:.2f}")
Note: Post-hoc "observed power" (computing power from the observed effect) is circular and misleading — it is a deterministic function of the p-value. If a study is done and someone asks about power, run a sensitivity analysis instead.
See references/effect_sizes_and_power.md for detailed guidance.
Reporting Results
Follow references/reporting_standards.md for APA style. Every report needs:
- Descriptive statistics: M, SD, n for all groups/variables
- Test statistics: Test name, statistic, df, exact p-value (
p = .034, notp < .05; usep < .001only below .001) - Effect sizes: With confidence intervals
- Assumption checks: Which tests were run, results, and actions taken
- All planned analyses: Including non-significant findings — omitting them is cherry-picking
Example Report Templates
Independent T-Test
Group A (n = 48, M = 75.2, SD = 8.5) scored significantly higher than
Group B (n = 52, M = 68.3, SD = 9.2), t(98) = 3.82, p < .001, d = 0.77,
95% CI [0.36, 1.18], two-tailed. Assumptions of normality (Shapiro-Wilk:
Group A W = 0.97, p = .18; Group B W = 0.96, p = .12) and homogeneity
of variance (Levene's F(1, 98) = 1.23, p = .27) were satisfied.
One-Way ANOVA
A one-way ANOVA revealed a significant main effect of treatment condition
on test scores, F(2, 147) = 8.45, p < .001, η²_p = .10. Post hoc
comparisons using Tukey's HSD indicated that Condition A (M = 78.2,
SD = 7.3) scored significantly higher than Condition B (M = 71.5,
SD = 8.1, p = .002, d = 0.87) and Condition C (M = 70.1, SD = 7.9,
p < .001, d = 1.07). Conditions B and C did not differ significantly
(p = .52, d = 0.18).
Multiple Regression
Multiple linear regression was conducted to predict exam scores from
study hours, prior GPA, and attendance. The overall model was significant,
F(3, 146) = 45.2, p < .001, R² = .48, adjusted R² = .47. Study hours
(B = 1.80, SE = 0.31, β = .35, t = 5.78, p < .001, 95% CI [1.18, 2.42])
and prior GPA (B = 8.52, SE = 1.95, β = .28, t = 4.37, p < .001,
95% CI [4.66, 12.38]) were significant predictors, while attendance was
not (B = 0.15, SE = 0.12, β = .08, t = 1.25, p = .21, 95% CI [-0.09, 0.39]).
Multicollinearity was not a concern (all VIF < 1.5).
Bayesian Analysis
A Bayesian independent samples t-test was conducted using weakly
informative priors (Normal(0, 10) for group means). The posterior
distribution indicated that Group A scored higher than Group B
(M_diff = 6.8, 95% credible interval [3.2, 10.4]), with a 99.8%
posterior probability that Group A's mean exceeded Group B's mean.
Convergence diagnostics were satisfactory (all R-hat < 1.01, ESS > 1000).
If a non-parametric test was used, report medians rather than means, the U/W/H statistic, and a rank-based effect size (e.g., rank-biserial correlation, returned by pg.mwu as RBC).
Bayesian Statistics
Consider Bayesian approaches when:
- You have prior information to incorporate
- You want direct probability statements about hypotheses ("there is a 95% probability the effect lies in this interval")
- Sample size is small or data collection is sequential (no correction needed for optional stopping)
- You need to quantify evidence for the null hypothesis
- The model is complex (hierarchical structure, missing data)
See references/bayesian_statistics.md for prior specification, Bayes Factors, credible intervals, hierarchical models, and convergence checking (R-hat < 1.01, sufficient ESS, posterior predictive checks).
Bundled Resources
References (references/)
- test_selection_guide.md: Decision tree covering group comparisons, relationships, counts, time-to-event, agreement/reliability, and categorical analysis
- assumptions_and_diagnostics.md: Detailed guidance on checking and handling assumption violations
- effect_sizes_and_power.md: Calculating, interpreting, and reporting effect sizes; power analysis
- bayesian_statistics.md: Priors, Bayes Factors, credible intervals, hierarchical models, diagnostics
- reporting_standards.md: APA-style reporting guidelines with worked examples
Scripts (scripts/)
- assumption_checks.py: Automated assumption checking with visualizations
comprehensive_assumption_check(): outliers + normality + variance homogeneity in one callcheck_normality(),check_normality_per_group(): Shapiro-Wilk with Q-Q plotscheck_homogeneity_of_variance(): Levene's test with box plotscheck_regression_diagnostics(): 4-panel residual plots + Shapiro-Wilk, Breusch-Pagan, Durbin-Watson, VIF for fitted OLS modelscheck_linearity(),detect_outliers()
Statistical Integrity
These are the practices that keep an analysis defensible. They matter because the most common statistical failures are not computational errors — they are silent flexibility (testing until something works) and selective reporting.
- Distinguish confirmatory from exploratory. State the planned analysis before running it; label anything discovered along the way as exploratory.
- Don't shop for significance. If the planned test is non-significant, that is the result. Trying alternative tests, subgroups, or outlier-removal schemes until p < .05 invalidates the p-value.
- Correct for multiple comparisons when running families of tests (Tukey HSD for post-hoc ANOVA; Holm or Benjamini-Hochberg FDR for other families) and say which correction was used.
- A non-significant result is not evidence of no effect. With small n, the study may simply have been underpowered — run a sensitivity analysis, or use a Bayesian analysis / equivalence test to actually quantify support for the null.
- Statistical significance is not practical importance. With large n, trivial effects reach p < .001. Lead the interpretation with the effect size.
- Understand missing data before dropping rows. Listwise deletion is only safe when data are missing completely at random; otherwise consider multiple imputation and say what was done.
- Make it reproducible. Set random seeds, report library versions for simulation-based methods, and keep the analysis in a runnable script.