Agent Skills: Statistical Analysis

Guided statistical analysis with test selection and reporting. Use when you need help choosing appropriate tests for your data, assumption checking, power analysis, and APA-formatted results. Best for academic research reporting, test selection guidance. For implementing specific models programmatically use statsmodels.

UncategorizedID: K-Dense-AI/claude-scientific-skills/statistical-analysis

Install this agent skill to your local

pnpm dlx add-skill https://github.com/K-Dense-AI/claude-scientific-skills/tree/HEAD/skills/statistical-analysis

Skill Files

Browse the full folder contents for statistical-analysis.

Download Skill

Loading file tree…

skills/statistical-analysis/SKILL.md

Skill Metadata

Name
statistical-analysis
Description
Guided statistical analysis for research data - test selection, assumption checking, effect sizes, power analysis, Bayesian alternatives, and APA-formatted reporting. Use whenever a user wants to compare groups, test a hypothesis, analyze experimental or survey data, check statistical assumptions, compute required sample sizes, or write up results - even if they never name a specific test. Covers t-tests, ANOVA, chi-square, correlation, regression, non-parametric and Bayesian methods. For low-level model APIs, see the statsmodels and pymc skills.

Statistical Analysis

Overview

Conduct hypothesis tests (t-tests, ANOVA, chi-square), regression, correlation, and Bayesian analyses with systematic assumption checking, effect sizes, and APA-style reporting. The goal is an analysis a reviewer could not tear apart: the right test, verified assumptions, honest effect sizes, and a complete write-up.

When to Use This Skill

Use this skill when:

  • Conducting statistical hypothesis tests (t-tests, ANOVA, chi-square, non-parametric)
  • Performing regression or correlation analyses
  • Running Bayesian statistical analyses
  • Checking statistical assumptions and diagnostics
  • Calculating effect sizes and conducting power analyses
  • Reporting statistical results in APA format
  • Analyzing experimental or observational data for research

Installation

Use uv to install the libraries used in this skill. Pin versions in production; unpinned installs are fine for exploration.

# Core frequentist stack (Python 3.10+; 3.12+ recommended for latest SciPy/ArviZ)
uv pip install "pingouin>=0.6" "scipy>=1.11" "statsmodels>=0.14.6" pandas matplotlib seaborn

# Bayesian modeling (PyMC 5 + ArviZ)
uv pip install "pymc>=5.0" "arviz>=1.0"

Compatibility notes (verified against pingouin 0.6.1, statsmodels 0.14.6, arviz 1.2, 2026):

  • Pingouin 0.6.0 renamed output columns to remove special characters: p_val, cohen_d, CI95, p_unc (previously p-val, cohen-d, CI95%, p-unc in 0.5.x). Examples below use the current names; if stuck on 0.5.x, use the hyphenated forms.
  • statsmodels + SciPy: use statsmodels>=0.14.6 with scipy>=1.11 to avoid _lazywhere import errors on SciPy 1.16+.
  • ArviZ 1.x: az.summary() now defaults to 89% intervals (eti89 columns) and the width parameter is ci_prob (not hdi_prob). To report a conventional 95% credible interval, pass az.summary(trace, ci_prob=0.95).
  • One-sided Bayes Factors are gone from Pingouin: pg.ttest(..., alternative='greater') silently drops the BF10 column, and pg.bayesfactor_ttest raises on one-sided alternatives. For one-sided Bayesian tests, use PyMC directly (compute the posterior probability of the directional hypothesis) or JASP/R's BayesFactor.

For model-specific APIs (OLS, GLM, ARIMA), see the statsmodels skill. For PyMC workflows, see the pymc skill.


Analysis Workflow

Every sound analysis follows the same arc. Skipping steps is how analyses end up retracted, so work through them in order and say what you did at each one.

  1. Frame the question before touching the data. State the hypothesis, the outcome and predictor variables, and the design (independent vs. paired, number of groups). Commit to a planned test now — choosing the test after peeking at results is p-hacking, even when done innocently.
  2. Inspect the data. Per group: n, mean, SD, median, missing values. Plot the raw data (histograms or box plots) before any test. Unequal group sizes, missingness, floor/ceiling effects, and outliers all change what test is appropriate — surface them to the user rather than silently working around them.
  3. Select the test using the quick reference below, or references/test_selection_guide.md for designs beyond the basics (counts, time-to-event, reliability, factorial).
  4. Check assumptions with scripts/assumption_checks.py. If an assumption fails, switch to the remedial test (table below) and report both the plan and the change.
  5. Run the test and always compute the effect size alongside it — a p-value says an effect exists; the effect size says whether anyone should care.
  6. Report using the APA templates below, including descriptives, exact statistics, effect sizes with CIs, and the assumption checks performed.

If the user only needs one step (e.g., "how many participants do I need?"), jump straight to that section — but still confirm the design assumptions the calculation rests on.


Test Selection Guide

Quick Reference: Choosing the Right Test

Use references/test_selection_guide.md for comprehensive guidance (counts, survival, reliability, factorial designs). Quick reference:

Comparing Two Groups:

  • Independent, continuous, normal → Independent t-test
  • Independent, continuous, non-normal → Mann-Whitney U test
  • Paired, continuous, normal → Paired t-test
  • Paired, continuous, non-normal → Wilcoxon signed-rank test
  • Binary outcome → Chi-square or Fisher's exact test

Comparing 3+ Groups:

  • Independent, continuous, normal → One-way ANOVA
  • Independent, continuous, non-normal → Kruskal-Wallis test
  • Paired, continuous, normal → Repeated measures ANOVA
  • Paired, continuous, non-normal → Friedman test

Relationships:

  • Two continuous variables → Pearson (normal) or Spearman correlation (non-normal)
  • Continuous outcome with predictor(s) → Linear regression
  • Binary outcome with predictor(s) → Logistic regression

Bayesian Alternatives: All tests have Bayesian versions providing direct probability statements about hypotheses, Bayes Factors quantifying evidence, and the ability to support the null. See references/bayesian_statistics.md.


Assumption Checking

Always check assumptions before interpreting test results, and report the checks — reviewers look for them.

Use the bundled scripts/assumption_checks.py module. Run Python from the skill directory (skills/statistical-analysis/) or add scripts/ to sys.path:

from assumption_checks import comprehensive_assumption_check

# Outliers + normality (per group) + homogeneity of variance, with plots
results = comprehensive_assumption_check(
    data=df,
    value_col='score',
    group_col='group',  # Optional: for group comparisons
    alpha=0.05
)

For targeted checks, import individual functions:

from assumption_checks import (
    check_normality,                # Shapiro-Wilk + Q-Q plot + histogram
    check_normality_per_group,
    check_homogeneity_of_variance,  # Levene's test + box plots
    check_linearity,                # scatter + residual plot for simple regression
    check_regression_diagnostics,   # full OLS diagnostics (see Regression below)
    detect_outliers                 # IQR or z-score methods
)

result = check_normality(data=df['score'], name='Test Score', alpha=0.05, plot=True)
print(result['interpretation'])
print(result['recommendation'])

What to Do When Assumptions Are Violated

Normality violated:

  • Mild violation + n > 30 per group → Proceed with parametric test (robust)
  • Moderate violation → Use non-parametric alternative
  • Severe violation → Transform data or use non-parametric test

Homogeneity of variance violated:

  • For t-test → Use Welch's t-test (pg.ttest applies it automatically with correction='auto')
  • For ANOVA → Use Welch's ANOVA (pg.welch_anova) or Brown-Forsythe
  • For regression → Use robust standard errors or weighted least squares

Linearity violated (regression):

  • Add polynomial terms, transform variables, or use non-linear models / GAM

Formal tests get oversensitive as n grows: for n ≥ 100, weigh the Q-Q plot more heavily than the Shapiro-Wilk p-value. See references/assumptions_and_diagnostics.md for comprehensive guidance.


Running Statistical Tests

Primary libraries:

  • pingouin: user-friendly tests that return effect sizes by default — prefer it for standard tests
  • scipy.stats: core statistical tests
  • statsmodels: regression, diagnostics, power analysis
  • pymc + arviz: Bayesian modeling and diagnostics

T-Test with Complete Reporting

import pingouin as pg

# correction='auto' applies Welch's correction when variances are unequal
result = pg.ttest(group_a, group_b, correction='auto')

# Pingouin >= 0.6 column names
t_stat = result['T'].values[0]
df = result['dof'].values[0]
p_value = result['p_val'].values[0]
cohens_d = result['cohen_d'].values[0]
ci_lower, ci_upper = result['CI95'].values[0]  # CI for the mean difference

print(f"t({df:.0f}) = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")

ANOVA with Post-Hoc Tests

import pingouin as pg

aov = pg.anova(dv='score', between='group', data=df, detailed=True)
print(aov)

# Effect size: partial eta-squared
eta_p2 = aov['np2'].values[0]

# If significant, conduct post-hoc tests (Tukey HSD controls family-wise error)
if aov['p_unc'].values[0] < 0.05:
    posthoc = pg.pairwise_tukey(dv='score', between='group', data=df)
    print(posthoc)  # includes Hedges' g per pair

Linear Regression with Diagnostics

import statsmodels.api as sm
from assumption_checks import check_regression_diagnostics

X = sm.add_constant(X_predictors)  # Add intercept
model = sm.OLS(y, X).fit()
print(model.summary())

# 4-panel residual plot + Shapiro-Wilk, Breusch-Pagan, Durbin-Watson, VIF
diag = check_regression_diagnostics(model)
print(diag['interpretation'])
print(diag['vif'])

# If heteroscedasticity was flagged, report robust standard errors instead
robust = model.get_robustcov_results('HC3')

Bayesian T-Test

import pymc as pm
import arviz as az
import numpy as np

with pm.Model() as model:
    # Priors
    mu1 = pm.Normal('mu_group1', mu=0, sigma=10)
    mu2 = pm.Normal('mu_group2', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)

    # Likelihood
    y1 = pm.Normal('y1', mu=mu1, sigma=sigma, observed=group_a)
    y2 = pm.Normal('y2', mu=mu2, sigma=sigma, observed=group_b)

    # Derived quantity
    diff = pm.Deterministic('difference', mu1 - mu2)

    trace = pm.sample(2000, tune=1000)

# ArviZ 1.x defaults to 89% intervals; request 95% explicitly for reporting
print(az.summary(trace, var_names=['difference'], ci_prob=0.95))

# Direct probability statement (this is what one-sided questions become)
prob_greater = np.mean(trace.posterior['difference'].values > 0)
print(f"P(mu1 > mu2 | data) = {prob_greater:.3f}")

# ArviZ 1.x removed az.plot_posterior; use plot_dist (on 0.x, plot_posterior still works)
az.plot_dist(trace, var_names=['difference'], ci_prob=0.95)

Scale priors to the data (e.g., sigma=10 suits outcomes with SD near 10; use the observed SD as a guide) and state the priors in the report.


Effect Sizes

Effect sizes quantify magnitude; p-values only indicate existence. Report one for every test. See references/effect_sizes_and_power.md for the full guide.

Quick Reference: Common Effect Sizes

| Test | Effect Size | Small | Medium | Large | |------|-------------|-------|--------|-------| | T-test | Cohen's d | 0.20 | 0.50 | 0.80 | | ANOVA | η²_p | 0.01 | 0.06 | 0.14 | | Correlation | r | 0.10 | 0.30 | 0.50 | | Regression | R² | 0.02 | 0.13 | 0.26 | | Chi-square | Cramér's V | 0.07 | 0.21 | 0.35 |

Benchmarks are conventions, not laws — a "small" effect can matter enormously (drug side effects) and a "large" one can be trivial. Interpret in context.

Calculating Effect Sizes

Pingouin returns effect sizes with its tests (cohen_d from pg.ttest, np2 from pg.anova, hedges from pg.pairwise_tukey; r from pg.corr is already an effect size).

Confidence Intervals for Effect Sizes

Report a CI for the effect size to show its precision. Use pg.compute_esci (note: pg.compute_effsize_from_t returns only the point estimate — it does not return a CI):

import pingouin as pg

d = pg.compute_effsize(group_a, group_b, eftype='cohen')
ci_lower, ci_upper = pg.compute_esci(stat=d, nx=len(group_a), ny=len(group_b),
                                     eftype='cohen', confidence=0.95)
print(f"d = {d:.2f}, 95% CI [{ci_lower:.2f}, {ci_upper:.2f}]")

Power Analysis

A Priori Power Analysis (Study Planning)

Determine required sample size before data collection:

from statsmodels.stats.power import tt_ind_solve_power, FTestAnovaPower

# T-test: What n per group is needed to detect d = 0.5?
n_required = tt_ind_solve_power(
    effect_size=0.5,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Required n per group: {n_required:.0f}")

# One-way ANOVA: What n is needed to detect Cohen's f = 0.25?
# Notes: the parameter is k_groups; effect_size is Cohen's f (f = sqrt(eta2/(1-eta2)));
# and solve_power returns the TOTAL sample size, not n per group.
import math
anova_power = FTestAnovaPower()
n_total = anova_power.solve_power(
    effect_size=0.25,
    k_groups=3,
    alpha=0.05,
    power=0.80
)
print(f"Required total N: {math.ceil(n_total)} ({math.ceil(n_total / 3)} per group)")

Sensitivity Analysis (Post-Study)

Determine what effect size the study could detect:

# With n=50 per group, what effect could we detect at 80% power?
detectable_d = tt_ind_solve_power(
    effect_size=None,  # Solve for this
    nobs1=50,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Study could detect d >= {detectable_d:.2f}")

Note: Post-hoc "observed power" (computing power from the observed effect) is circular and misleading — it is a deterministic function of the p-value. If a study is done and someone asks about power, run a sensitivity analysis instead.

See references/effect_sizes_and_power.md for detailed guidance.


Reporting Results

Follow references/reporting_standards.md for APA style. Every report needs:

  1. Descriptive statistics: M, SD, n for all groups/variables
  2. Test statistics: Test name, statistic, df, exact p-value (p = .034, not p < .05; use p < .001 only below .001)
  3. Effect sizes: With confidence intervals
  4. Assumption checks: Which tests were run, results, and actions taken
  5. All planned analyses: Including non-significant findings — omitting them is cherry-picking

Example Report Templates

Independent T-Test

Group A (n = 48, M = 75.2, SD = 8.5) scored significantly higher than
Group B (n = 52, M = 68.3, SD = 9.2), t(98) = 3.82, p < .001, d = 0.77,
95% CI [0.36, 1.18], two-tailed. Assumptions of normality (Shapiro-Wilk:
Group A W = 0.97, p = .18; Group B W = 0.96, p = .12) and homogeneity
of variance (Levene's F(1, 98) = 1.23, p = .27) were satisfied.

One-Way ANOVA

A one-way ANOVA revealed a significant main effect of treatment condition
on test scores, F(2, 147) = 8.45, p < .001, η²_p = .10. Post hoc
comparisons using Tukey's HSD indicated that Condition A (M = 78.2,
SD = 7.3) scored significantly higher than Condition B (M = 71.5,
SD = 8.1, p = .002, d = 0.87) and Condition C (M = 70.1, SD = 7.9,
p < .001, d = 1.07). Conditions B and C did not differ significantly
(p = .52, d = 0.18).

Multiple Regression

Multiple linear regression was conducted to predict exam scores from
study hours, prior GPA, and attendance. The overall model was significant,
F(3, 146) = 45.2, p < .001, R² = .48, adjusted R² = .47. Study hours
(B = 1.80, SE = 0.31, β = .35, t = 5.78, p < .001, 95% CI [1.18, 2.42])
and prior GPA (B = 8.52, SE = 1.95, β = .28, t = 4.37, p < .001,
95% CI [4.66, 12.38]) were significant predictors, while attendance was
not (B = 0.15, SE = 0.12, β = .08, t = 1.25, p = .21, 95% CI [-0.09, 0.39]).
Multicollinearity was not a concern (all VIF < 1.5).

Bayesian Analysis

A Bayesian independent samples t-test was conducted using weakly
informative priors (Normal(0, 10) for group means). The posterior
distribution indicated that Group A scored higher than Group B
(M_diff = 6.8, 95% credible interval [3.2, 10.4]), with a 99.8%
posterior probability that Group A's mean exceeded Group B's mean.
Convergence diagnostics were satisfactory (all R-hat < 1.01, ESS > 1000).

If a non-parametric test was used, report medians rather than means, the U/W/H statistic, and a rank-based effect size (e.g., rank-biserial correlation, returned by pg.mwu as RBC).


Bayesian Statistics

Consider Bayesian approaches when:

  • You have prior information to incorporate
  • You want direct probability statements about hypotheses ("there is a 95% probability the effect lies in this interval")
  • Sample size is small or data collection is sequential (no correction needed for optional stopping)
  • You need to quantify evidence for the null hypothesis
  • The model is complex (hierarchical structure, missing data)

See references/bayesian_statistics.md for prior specification, Bayes Factors, credible intervals, hierarchical models, and convergence checking (R-hat < 1.01, sufficient ESS, posterior predictive checks).


Bundled Resources

References (references/)

  • test_selection_guide.md: Decision tree covering group comparisons, relationships, counts, time-to-event, agreement/reliability, and categorical analysis
  • assumptions_and_diagnostics.md: Detailed guidance on checking and handling assumption violations
  • effect_sizes_and_power.md: Calculating, interpreting, and reporting effect sizes; power analysis
  • bayesian_statistics.md: Priors, Bayes Factors, credible intervals, hierarchical models, diagnostics
  • reporting_standards.md: APA-style reporting guidelines with worked examples

Scripts (scripts/)

  • assumption_checks.py: Automated assumption checking with visualizations
    • comprehensive_assumption_check(): outliers + normality + variance homogeneity in one call
    • check_normality(), check_normality_per_group(): Shapiro-Wilk with Q-Q plots
    • check_homogeneity_of_variance(): Levene's test with box plots
    • check_regression_diagnostics(): 4-panel residual plots + Shapiro-Wilk, Breusch-Pagan, Durbin-Watson, VIF for fitted OLS models
    • check_linearity(), detect_outliers()

Statistical Integrity

These are the practices that keep an analysis defensible. They matter because the most common statistical failures are not computational errors — they are silent flexibility (testing until something works) and selective reporting.

  1. Distinguish confirmatory from exploratory. State the planned analysis before running it; label anything discovered along the way as exploratory.
  2. Don't shop for significance. If the planned test is non-significant, that is the result. Trying alternative tests, subgroups, or outlier-removal schemes until p < .05 invalidates the p-value.
  3. Correct for multiple comparisons when running families of tests (Tukey HSD for post-hoc ANOVA; Holm or Benjamini-Hochberg FDR for other families) and say which correction was used.
  4. A non-significant result is not evidence of no effect. With small n, the study may simply have been underpowered — run a sensitivity analysis, or use a Bayesian analysis / equivalence test to actually quantify support for the null.
  5. Statistical significance is not practical importance. With large n, trivial effects reach p < .001. Lead the interpretation with the effect size.
  6. Understand missing data before dropping rows. Listwise deletion is only safe when data are missing completely at random; otherwise consider multiple imputation and say what was done.
  7. Make it reproducible. Set random seeds, report library versions for simulation-based methods, and keep the analysis in a runnable script.