Statistics Verifier Skill

Structured frameworks for verifying statistical claims, validating research methodology, and detecting analytical errors and biases.

Statistical Claim Verification Checklist

Rapid Claim Assessment

CLAIM VERIFICATION PROTOCOL:

1. SOURCE CHECK
   - Who made the claim?
   - What is their expertise and incentive?
   - Where was it published (peer-reviewed, preprint, press release)?
   - Is the original data or study accessible?

2. METHODOLOGY CHECK
   - What type of study (RCT, observational, survey, meta-analysis)?
   - What was the sample size and population?
   - What was the measurement method?
   - Is the statistical test appropriate for the data type?

3. NUMBER SENSE CHECK
   - Does the claim pass a basic plausibility test?
   - Are units and denominators clearly stated?
   - Absolute vs relative numbers — which is being used?
   - Is the base rate provided for context?

4. REPLICATION CHECK
   - Have other studies found similar results?
   - Are the findings consistent across populations?
   - Has anyone attempted and failed to replicate?

5. CONCLUSION CHECK
   - Does the conclusion follow from the data?
   - Are alternative explanations addressed?
   - Is the scope of the claim proportional to the evidence?

Claim Red Flags

| Red Flag | What It Means | Action |
| --- | --- | --- |
| No sample size given | Cannot assess reliability | Request or estimate N |
| Only relative risk reported | May hide small absolute effect | Calculate absolute difference |
| "Up to X%" framing | Cherry-picked best case | Ask for median or mean |
| No confidence interval | Precision unknown | Treat with skepticism |
| Correlation stated as causation | Confounders likely ignored | Check study design |
| Self-selected sample | Selection bias likely | Note limitation |
| Composite endpoint | May mask weak individual results | Decompose the endpoint |
| Subgroup analysis highlighted | Likely post-hoc fishing | Require pre-registration |
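
For the "only relative risk reported" flag, the absolute difference can be recomputed from event counts when they are available. The following is a minimal sketch; the function name `risk_summary` and the example counts are hypothetical and chosen only for illustration.

```python
def risk_summary(events_treated, n_treated, events_control, n_control):
    """Return relative and absolute effect measures for a two-arm comparison."""
    if n_treated <= 0 or n_control <= 0 or events_control == 0:
        raise ValueError("need positive group sizes and at least one control event")
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    arr = risk_c - risk_t  # absolute risk reduction
    return {
        "risk_treated": risk_t,
        "risk_control": risk_c,
        "relative_risk": risk_t / risk_c,
        "relative_risk_reduction": arr / risk_c,
        "absolute_risk_reduction": arr,
        # Number needed to treat; infinite when there is no absolute difference.
        "nnt": (1 / arr) if arr != 0 else float("inf"),
    }

# Hypothetical counts: 10 events in 1,000 treated vs 20 events in 1,000 controls.
for name, value in risk_summary(10, 1000, 20, 1000).items():
    print(f"{name}: {value:.4g}")
```

With these made-up counts, a "50% reduction" headline corresponds to a one-percentage-point absolute difference, or roughly 100 people treated per event avoided.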

Common Statistical Errors

Error Detection Framework

CATEGORY 1: DESIGN ERRORS
- Sampling bias (convenience, voluntary response, survivorship)
- Confounding variables not controlled
- Insufficient sample size (underpowered study)
- No control group or inappropriate comparator
- Measurement instrument not validated

CATEGORY 2: ANALYSIS ERRORS
- Multiple comparisons without correction (p-hacking)
- Treating ordinal data as interval
- Assuming normality without checking
- Ignoring missing data mechanisms (MCAR vs MAR vs MNAR)
- Using parametric tests when their assumptions are violated

CATEGORY 3: INTERPRETATION ERRORS
- Confusing statistical significance with practical significance
- Interpreting non-significant result as "no effect"
- Ecological fallacy (group-level applied to individuals)
- Simpson's paradox not checked
- Ignoring effect size and confidence intervals

CATEGORY 4: REPORTING ERRORS
- Selective reporting of favorable results
- Omitting negative or null findings
- Misleading axis scales in visualizations
- Presenting percentages without base numbers
- Switching between absolute and relative metrics
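
To see why uncorrected multiple comparisons matter, the chance of at least one false positive can be computed directly. This sketch assumes the tests are independent, which real analyses often are not; the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20):
    print(k, round(familywise_error_rate(0.05, k), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

Under this assumption, twenty uncorrected tests at alpha = 0.05 give roughly a 64% chance of at least one spurious "significant" result.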

Error Severity Assessment

| Error Type | Severity | Impact on Conclusion |
| --- | --- | --- |
| P-hacking / HARKing | Critical | Invalidates findings |
| Selection bias | Critical | Fundamentally flawed sample |
| Confounding not addressed | High | Alternative explanations remain |
| Wrong statistical test | High | Results may be artifactual |
| Multiple comparisons uncorrected | High | Inflated false positive rate |
| Small sample without power analysis | Medium | May miss real effects |
| Missing confidence intervals | Medium | Cannot judge precision |
| Misleading visualization | Medium | Misrepresents magnitude |
| Minor rounding errors | Low | Minimal impact |

Significance Testing Framework

Test Selection Guide

CHOOSING THE RIGHT TEST:

DATA TYPE → COMPARISON → TEST

Continuous + 2 groups + independent → Independent t-test (or Mann-Whitney)
Continuous + 2 groups + paired     → Paired t-test (or Wilcoxon signed-rank)
Continuous + 3+ groups + independent → One-way ANOVA (or Kruskal-Wallis)
Continuous + 3+ groups + repeated    → Repeated-measures ANOVA (or Friedman)
Continuous + 2+ factors            → Two-way / factorial ANOVA
Continuous + continuous             → Pearson correlation (or Spearman)

Categorical + 2 groups             → Chi-square test (or Fisher's exact)
Categorical + ordered              → Cochran-Armitage trend test
Binary outcome + predictors        → Logistic regression

Time-to-event + groups             → Log-rank test / Cox regression
Count data                          → Poisson regression
Proportion + large sample           → Z-test for proportions
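
As one worked instance from the guide above, the z-test for two proportions can be sketched with the standard library alone. The function name and the counts are hypothetical; the test assumes independent groups and samples large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # pooled estimate under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 130 of 1,000 vs 100 of 1,000 successes.
z, p = two_proportion_z_test(130, 1000, 100, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

For small cell counts the guide's alternative (Fisher's exact test) is the safer choice; this sketch does not cover it.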

P-Value Interpretation Guide

P-VALUE CONTEXT:

p-value = P(data at least this extreme | null hypothesis is true)

COMMON MISINTERPRETATIONS:
  p = 0.03 does NOT mean:
  - "There is a 3% chance the result is due to chance"
  - "There is a 97% probability the hypothesis is true"
  - "The effect is large or important"
  - "The study will replicate"

  p = 0.03 DOES mean:
  - If the null hypothesis were true, data at least this extreme
    would occur about 3% of the time.

THRESHOLDS (conventional, not absolute):
  p < 0.001  — strong evidence against null
  p < 0.01   — moderate evidence against null
  p < 0.05   — conventional threshold (context-dependent)
  p > 0.05   — insufficient evidence to reject null
                (NOT evidence of no effect)

ALWAYS COMPLEMENT WITH:
  - Effect size (Cohen's d, odds ratio, etc.)
  - Confidence interval (range of plausible values)
  - Practical significance (is the effect meaningful?)
  - Study power (could it have detected a real effect?)
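
The definition above can be made concrete with a permutation test, where the p-value is the share of random label shufflings that produce a gap at least as large as the one observed. The sketch below pairs it with Cohen's d so that significance and effect size are reported together. Function names and the two samples are hypothetical.

```python
import random
from math import sqrt
from statistics import stdev

def _mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # shuffling labels simulates the null hypothesis
        diff = abs(_mean(pooled[:len(a)]) - _mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_shuffles + 1)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (_mean(a) - _mean(b)) / sqrt(pooled_var)

# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.9]
group_b = [4.8, 4.6, 5.0, 5.2, 4.7, 5.1, 4.9, 4.5]
print("p =", permutation_p_value(group_a, group_b))
print("d =", round(cohens_d(group_a, group_b), 2))
```

The p-value says how surprising the gap would be under the null; only the effect size says how large the gap is.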

Multiple Comparisons Correction

| Method | When to Use | Conservativeness |
| --- | --- | --- |
| Bonferroni | Few comparisons, need strong control | Very conservative |
| Holm-Bonferroni | Moderate comparisons, step-down | Less conservative |
| Benjamini-Hochberg | Many comparisons (FDR control) | Liberal |
| Tukey's HSD | All pairwise comparisons after ANOVA | Moderate |
| Dunnett's | Multiple treatments vs one control | Moderate |
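
The first three corrections in the table are simple enough to sketch directly. The functions below return a reject/keep decision per p-value; the names and the example p-values are hypothetical, and the Benjamini-Hochberg version assumes independent or positively dependent tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Step-down Holm-Bonferroni: compare sorted p-values to alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR control: reject all p-values up to the largest passing rank."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

p_values = [0.001, 0.012, 0.014, 0.039, 0.2]  # hypothetical results
print(sum(bonferroni(p_values)))          # 1 rejection
print(sum(holm(p_values)))                # 3 rejections
print(sum(benjamini_hochberg(p_values)))  # 4 rejections
```

The rejection counts follow the table's ordering from most to least conservative.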

Sample Size Validation

Quick Reference Table

MINIMUM SAMPLE SIZE GUIDELINES:

Survey (population estimate):
  ±3% margin, 95% CI → n ≈ 1,067
  ±5% margin, 95% CI → n ≈ 385
  ±10% margin, 95% CI → n ≈ 97

A/B Test (detecting 5% relative lift, two-sided alpha=0.05, 80% power):
  Baseline 10% conversion → n ≈ 57,800 per group
  Baseline 5% conversion  → n ≈ 122,000 per group
  Baseline 2% conversion  → n ≈ 315,000 per group

Clinical trial (medium effect d=0.5):
  Two-group comparison, 80% power → n ≈ 64 per group
  Two-group comparison, 90% power → n ≈ 86 per group

Correlation (detecting r=0.3):
  80% power, alpha=0.05 → n ≈ 85
  90% power, alpha=0.05 → n ≈ 113
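
Reference figures like these can be sanity-checked with the usual normal-approximation formulas. The sketch below assumes two-sided tests, equal group sizes, and worst-case p = 0.5 for surveys; the function names are hypothetical. It rounds up, so results can differ by one from rounded table values.

```python
from math import atanh, ceil
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function

def n_survey(margin, confidence=0.95, p=0.5):
    """Respondents needed for a given margin of error on a proportion."""
    z = Z(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n to detect p1 vs p2 (unpooled variance)."""
    z = Z(1 - alpha / 2) + Z(power)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

def n_two_means(d, alpha=0.05, power=0.80):
    """Per-group n to detect a standardized mean difference d."""
    z = Z(1 - alpha / 2) + Z(power)
    return ceil(2 * z ** 2 / d ** 2)

def n_correlation(r, alpha=0.05, power=0.80):
    """Total n to detect correlation r, via the Fisher z transformation."""
    z = Z(1 - alpha / 2) + Z(power)
    return ceil((z / atanh(r)) ** 2 + 3)

print(n_survey(0.05))                   # 385
print(n_two_means(0.5))                 # 63
print(n_correlation(0.3))               # 85
print(n_two_proportions(0.10, 0.105))   # 5% relative lift from a 10% baseline
```

The normal approximation for two means gives 63 and 85 per group at 80% and 90% power. Exact t-based calculations run slightly higher, which is where figures such as 64 and 86 come from.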

Power Analysis Checklist

| Parameter | Must Specify | Source |
| --- | --- | --- |
| Alpha (Type I error rate) | Yes | Convention (usually 0.05) |
| Power (1 - Type II error) | Yes | Usually 0.80 or 0.90 |
| Effect size | Yes | Prior research or MCID |
| Variance / SD | Yes | Pilot data or literature |
| Sample size | Calculated | Output of power analysis |
| Attrition rate | Recommended | Inflate N by expected dropout |
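
The attrition row is a one-line calculation, sketched here under the assumption that dropout is a single expected proportion. The function name is hypothetical.

```python
from math import ceil

def inflate_for_attrition(n_required, dropout_rate):
    """Enrollment target so that n_required participants remain after dropout."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return ceil(n_required / (1 - dropout_rate))

print(inflate_for_attrition(86, 0.15))  # 102
```

Dividing by (1 - dropout) rather than multiplying by (1 + dropout) matters: 86 x 1.15 is about 99, which would leave fewer than 86 completers after 15% dropout.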

Correlation vs Causation Checklist

Bradford Hill Criteria for Causation

DOES CORRELATION IMPLY CAUSATION? CHECK:

1. STRENGTH           Is the association large?
                      Larger effects harder to explain away.

2. CONSISTENCY        Replicated across settings, populations?
                      Multiple studies, same finding.

3. SPECIFICITY        Is X linked specifically to Y (not everything)?
                      Less useful for multifactorial diseases.

4. TEMPORALITY        Does X precede Y in time?
                      REQUIRED — cause must come before effect.

5. BIOLOGICAL GRADIENT  Does more X produce more Y (dose-response)?
                        Strong support for causation.

6. PLAUSIBILITY       Is there a credible mechanism?
                      Based on current knowledge.

7. COHERENCE          Consistent with known biology/theory?
                      No conflict with established facts.

8. EXPERIMENT         Does removing X reduce Y?
                      Strongest evidence (RCT).

9. ANALOGY            Similar exposures cause similar effects?
                      Weakest criterion, supporting only.

VERDICT:
  Criteria 1-3 met + Temporality → Suggestive of causation
  Criteria 1-6 met + Experiment  → Strong evidence of causation
  Only correlation observed      → Association only, cannot infer cause

Common Third-Variable Confounders

| Observed Association | Likely Confounder |
| --- | --- |
| Ice cream sales and drowning | Warm weather (season) |
| Shoe size and reading ability | Age |
| Hospital visits and death rate | Illness severity |
| Organic food and health | Socioeconomic status |
| Screen time and depression | Social isolation, sleep |
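
A small simulation shows how a third variable can produce a correlation that vanishes once it is adjusted for. The data below are synthetic and generated only for illustration; variable names mirror the first row of the table, and the linear partial-correlation formula is an assumption about the form of the confounding.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly adjusting both for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = random.Random(42)  # fixed seed for reproducibility
n = 5000
temperature = [rng.gauss(0, 1) for _ in range(n)]        # the confounder
ice_cream = [t + rng.gauss(0, 1) for t in temperature]   # driven by temperature
drownings = [t + rng.gauss(0, 1) for t in temperature]   # also driven by temperature

print("raw r:", round(pearson(ice_cream, drownings), 2))
print("adjusted r:", round(partial_corr(ice_cream, drownings, temperature), 2))
```

By construction the raw correlation is near 0.5 while the adjusted one is near zero, even though neither simulated variable affects the other.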

Survey Methodology Review

Survey Quality Assessment

SURVEY METHODOLOGY CHECKLIST:

SAMPLING:
- [ ] Probability sampling method described?
- [ ] Sampling frame defined and appropriate?
- [ ] Response rate reported (rule of thumb: >60% mail, >80% in-person)?
- [ ] Non-response bias assessed?

QUESTIONNAIRE:
- [ ] Questions validated or adapted from validated instruments?
- [ ] Leading or double-barreled questions absent?
- [ ] Response options balanced and exhaustive?
- [ ] Pilot tested with target population?

ADMINISTRATION:
- [ ] Mode (online, phone, in-person) appropriate?
- [ ] Anonymity/confidentiality assured?
- [ ] Informed consent obtained?
- [ ] Social desirability bias mitigated?

ANALYSIS:
- [ ] Weighting applied for non-response or oversampling?
- [ ] Margin of error and confidence level reported?
- [ ] Subgroup analyses pre-specified (not exploratory)?
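
When a survey reports n but no margin of error, it can be approximated. This sketch assumes simple random sampling and uses Kish's approximation for the effective sample size under weighting; function names and weights are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.95, design_effect=1.0):
    """Half-width of the confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(design_effect * p * (1 - p) / n)

def kish_effective_n(weights):
    """Effective sample size after weighting: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

print(round(margin_of_error(1067), 3))   # 0.03
weights = [1, 1, 3, 3]                   # hypothetical survey weights
print(kish_effective_n(weights))         # 3.2 effective respondents out of 4
```

Unequal weights shrink the effective sample, so a weighted survey's true margin of error is wider than the unweighted n suggests.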

Data Visualization Integrity Checks

Chart Audit Checklist

| Check | What to Look For | Fail Condition |
| --- | --- | --- |
| Y-axis starts at zero (bar charts) | Truncated axis exaggerates differences | Axis starts above zero without clear label |
| Consistent scale | Both axes have proportional increments | Non-linear scale without explanation |
| Area proportional to data | Bubble/icon size matches values | Area misrepresents magnitude |
| Time axis evenly spaced | Equal intervals between data points | Uneven spacing compresses/expands trends |
| Appropriate chart type | Data type matches visualization | Pie chart with 20+ categories |
| Context provided | Benchmarks, comparisons, baselines | Single data point with no reference |
| Source cited | Data origin traceable | No source attribution |
| Dual axes used responsibly | Two Y-axes can create false correlations | Arbitrary scaling implies relationship |

Misleading Visualization Patterns

WATCH FOR THESE TRICKS:

1. TRUNCATED AXIS
   Small differences look dramatic when baseline removed.
   FIX: Always check if y-axis starts at zero for bar charts.

2. CHERRY-PICKED TIME WINDOW
   Start/end dates chosen to show desired trend.
   FIX: Ask for longer time series with consistent intervals.

3. 3D EFFECTS
   Perspective distortion makes sizes unequal.
   FIX: Use flat 2D charts for accurate comparison.

4. DUAL AXIS MANIPULATION
   Two y-axes scaled to create apparent correlation.
   FIX: Normalize data or use separate panels.

5. CUMULATIVE VS DAILY
   Cumulative charts always go up — hides declining rates.
   FIX: Show rate of change alongside cumulative.
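
Two of these tricks can be quantified when the underlying numbers can be read off the chart. The sketch below measures how much a truncated axis exaggerates a bar-height comparison, and recovers per-period values from a cumulative series. Function names and figures are hypothetical.

```python
def drawn_ratio(a, b, axis_start=0.0):
    """Ratio of bar heights as drawn when the axis begins at axis_start."""
    if b == axis_start:
        raise ValueError("second bar has zero drawn height")
    return (a - axis_start) / (b - axis_start)

def daily_from_cumulative(cumulative):
    """Recover per-period increments from a running total."""
    if not cumulative:
        return []
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

# Bars for 52 and 50 on an axis that starts at 48 (hypothetical values).
print(drawn_ratio(52, 50, axis_start=48))  # 2.0: one bar looks twice as tall
print(drawn_ratio(52, 50))                 # 1.04: the actual ratio

# A cumulative series that keeps rising while the per-period count falls.
print(daily_from_cumulative([100, 180, 240, 280, 300]))  # [100, 80, 60, 40, 20]
```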

Bias Detection Framework

Cognitive Biases in Data Analysis

BIAS DETECTION CHECKLIST:

CONFIRMATION BIAS
- Are they only presenting data that supports their hypothesis?
- Were negative results reported?
- Was the analysis plan pre-registered?

ANCHORING BIAS
- Is the first number presented influencing interpretation of later data?
- Are comparisons made to appropriate benchmarks?

SURVIVORSHIP BIAS
- Are only successful cases included (ignoring failures)?
- Is the denominator complete (not just survivors)?

AVAILABILITY BIAS
- Are dramatic or recent events overweighted?
- Is systematic data used rather than anecdotal evidence?

PUBLICATION BIAS
- Is there funnel plot asymmetry (meta-analyses)?
- Are null results published or only significant ones?

TEXAS SHARPSHOOTER FALLACY
- Were clusters or patterns found after looking at data?
- Was the hypothesis formed before or after seeing results?

Bias Severity Matrix

| Bias | Detection Method | Mitigation |
| --- | --- | --- |
| Selection bias | Compare sample to population demographics | Probability sampling, weighting |
| Measurement bias | Check instrument validity and calibration | Validated instruments, blinding |
| Reporting bias | Look for asymmetric funnel plots | Pre-registration, open data |
| Recall bias | Compare to objective records | Prospective data collection |
| Observer bias | Check if assessors were blinded | Double-blind design |
| Attrition bias | Compare completers vs dropouts | Intention-to-treat analysis |

Reproducibility Checklist

Study Reproducibility Assessment

REPRODUCIBILITY REQUIREMENTS:

DATA AVAILABILITY:
- [ ] Raw data accessible (repository, supplement, on request)?
- [ ] Data dictionary / codebook provided?
- [ ] Data collection protocol documented?

CODE / ANALYSIS:
- [ ] Analysis code shared (GitHub, OSF, supplement)?
- [ ] Software versions and packages specified?
- [ ] Random seeds set for reproducible computation?
- [ ] Pipeline documented end-to-end?

METHODOLOGY:
- [ ] Study pre-registered (OSF, ClinicalTrials.gov)?
- [ ] Deviations from protocol documented?
- [ ] All outcome measures reported (not just significant ones)?
- [ ] Sensitivity analyses included?

REPORTING:
- [ ] Follows reporting guidelines (CONSORT, STROBE, PRISMA)?
- [ ] Effect sizes and confidence intervals reported?
- [ ] Power analysis or sample size justification provided?
- [ ] Limitations section thorough and honest?
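
The seed and version items in the checklist can be illustrated with a short sketch: a seeded analysis that returns the same result on every run, plus a provenance record saved alongside it. The function names, the seed value, and the toy analysis are hypothetical.

```python
import json
import platform
import random
import sys

def run_analysis(seed):
    """Toy analysis: mean of 1,000 simulated draws from a seeded generator."""
    rng = random.Random(seed)  # local generator avoids hidden global state
    sample = [rng.gauss(0, 1) for _ in range(1000)]
    return sum(sample) / len(sample)

def provenance(seed):
    """Minimal record of what is needed to rerun the analysis."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
    }

SEED = 20240101  # arbitrary; what matters is that it is recorded
assert run_analysis(SEED) == run_analysis(SEED)  # same seed, same result
print(json.dumps(provenance(SEED), indent=2))
```

A real pipeline would also record package versions, for example from a lock file; this sketch covers only the interpreter and seed.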

Reporting Standards by Study Type

| Study Type | Guideline | Key Elements |
| --- | --- | --- |
| Randomized trial | CONSORT | Flow diagram, ITT analysis, blinding |
| Observational study | STROBE | Selection criteria, confounders, missing data |
| Systematic review | PRISMA | Search strategy, inclusion criteria, risk of bias |
| Diagnostic accuracy | STARD | Index test, reference standard, flow diagram |
| Qualitative research | COREQ | Research team, study design, data analysis |
| Prediction model | TRIPOD | Model development, validation, performance |

Quick Verification Workflow

FAST VERIFICATION (5 minutes):

1. Read the claim carefully — what exactly is being stated?
2. Check: source, sample size, study type
3. Ask: absolute or relative? What is the base rate?
4. Check: confidence interval or margin of error given?
5. Search: has this been replicated independently?

VERDICT CATEGORIES:
  VERIFIED    — multiple strong sources, robust methodology
  PLAUSIBLE   — reasonable evidence, some limitations
  UNCERTAIN   — mixed evidence, methodology concerns
  MISLEADING  — technically true but presented deceptively
  FALSE       — contradicted by strong evidence
  UNVERIFIABLE — cannot assess with available information
