Agent Skills: Analysis QA

Quality-check a data analysis before sharing — verify joins, aggregations, denominators, time ranges, and metric definitions. Detect pitfalls like survivorship bias, average-of-averages, join explosion, timezone mismatches, incomplete periods, and selection bias. Includes documentation templates for reproducible analyses.

UncategorizedID: vm0-ai/vm0-skills/analysis-qa

Install this agent skill to your local

pnpm dlx add-skill https://github.com/vm0-ai/vm0-skills/tree/HEAD/analysis-qa

Skill Files

Browse the full folder contents for analysis-qa.

Download Skill

Loading file tree…

analysis-qa/SKILL.md

Skill Metadata

Name
analysis-qa
Description
Quality-check a data analysis before sharing — verify joins, aggregations, denominators, time ranges, and metric definitions. Detect pitfalls like survivorship bias, average-of-averages, join explosion, timezone mismatches, incomplete periods, and selection bias. Includes documentation templates for reproducible analyses.

Analysis QA

Pre-delivery review process for catching errors, validating results, and ensuring analyses are reproducible and trustworthy.

Review Checklist

Work through every section below before presenting findings to stakeholders.

Data Foundation

  • [ ] Correct sources: Confirmed that the tables and datasets used are the appropriate ones for this question
  • [ ] Freshness: Data recency is sufficient; the "data as of" date is noted
  • [ ] Coverage: No unexpected time gaps or missing segments in the dataset
  • [ ] Null treatment: Null rates in critical columns have been reviewed; nulls are excluded, filled, or explicitly flagged
  • [ ] Duplicate control: Row counts confirm no double-counting from faulty joins or repeated source records
  • [ ] Filter accuracy: Every WHERE clause and filter condition has been verified; nothing is accidentally excluded or included

Computation Integrity

  • [ ] GROUP BY correctness: All non-aggregated columns appear in GROUP BY; the aggregation grain matches the analytical question
  • [ ] Denominator validity: Rates and percentages use the intended base population; division by zero is prevented
  • [ ] Temporal alignment: Compared periods span equal durations; partial periods are either excluded or called out
  • [ ] Join behavior: JOIN types are intentional (INNER vs. LEFT); many-to-many relationships have not silently inflated totals
  • [ ] Metric fidelity: Calculated metrics align with how the business defines them; any deviations are documented
  • [ ] Additive consistency: Sub-totals sum to the reported total where expected; non-additive cases (overlap, double-counting) are explained

Plausibility Assessment

  • [ ] Order of magnitude: Key figures fall within a believable range; revenue is non-negative; percentages stay within 0-100%
  • [ ] Trend coherence: Time series show no unexplained jumps or drops
  • [ ] External agreement: Headline numbers align with dashboards, finance reports, or earlier analyses
  • [ ] Ballpark math: Total revenue roughly equals per-user revenue times user count, etc.
  • [ ] Boundary behavior: Results make sense for edge cases — a single day, a single user, a single category

Presentation Quality

  • [ ] Accurate visuals: Bar charts begin at zero; axes have labels; scales are consistent across panels
  • [ ] Clean formatting: Numbers use appropriate precision, consistent currency/percent notation, and thousands separators
  • [ ] Descriptive titles: Headings convey the insight, not just the metric name; date ranges are included
  • [ ] Transparent caveats: Limitations and assumptions are stated up front
  • [ ] Reproducibility: Another analyst could recreate the work from the provided documentation

Recognizing Common Mistakes

Inflated Counts from Many-to-Many Joins

What goes wrong: Joining two tables with a many-to-many relationship silently multiplies rows, blowing up counts and sums.

Detection method:

-- Compare row counts before and after the join
SELECT COUNT(*) FROM orders;          -- 1,000
SELECT COUNT(*) FROM orders o
JOIN line_items li ON o.id = li.order_id;  -- 3,500 (unexpected inflation)

Prevention:

  • Always compare pre-join and post-join row counts
  • Verify the actual cardinality of the join relationship
  • Use COUNT(DISTINCT o.id) to count entities accurately through multi-row joins

Survivorship Bias

What goes wrong: The analysis only covers entities that still exist, ignoring those that were removed, churned, or failed.

Typical scenarios:

  • Studying behavior of "active users" while ignoring everyone who left
  • Benchmarking against "companies on our platform" while skipping those who evaluated and moved on
  • Analyzing traits of "successful" cases without any "unsuccessful" comparison group

Prevention: Before drawing conclusions, ask: "Who is absent from this dataset, and would their presence change the story?"

Partial Period Comparisons

What goes wrong: A month, week, or quarter that is still in progress gets compared to a completed one.

Typical scenarios:

  • "January revenue is $500K vs. December's $800K" when January is only half over
  • "Signups are down this week" when checked on Tuesday against a full prior week

Prevention: Restrict comparisons to completed periods, or normalize by matching the same number of elapsed days.

Shifting Denominators

What goes wrong: The population used as a denominator changes between periods, making rate comparisons invalid.

Typical scenarios:

  • Conversion rate appears to improve because the definition of "eligible visitor" was narrowed
  • Churn rate shifts because "active user" was redefined mid-analysis

Prevention: Lock in consistent definitions across every period being compared. Flag any definition changes.

Averaging Pre-Computed Averages

What goes wrong: Taking the mean of group-level averages ignores differences in group size, producing an incorrect overall figure.

Illustration:

  • Segment A: 100 customers, $50 average order
  • Segment B: 10 customers, $200 average order
  • Incorrect overall average: ($50 + $200) / 2 = $125
  • Correct weighted average: (100 * $50 + 10 * $200) / 110 = $63.64

Prevention: Always compute averages from individual records. Never take the mean of already-aggregated means.

Timezone Inconsistencies

What goes wrong: Different source systems record timestamps in different zones, causing misaligned daily rollups and join mismatches.

Typical scenarios:

  • Backend events logged in UTC while the reporting layer uses US Pacific
  • Two tables that define "today" with different cutoff hours

Prevention: Convert all timestamps to a single reference zone (UTC is the safest default) before any analysis. State the timezone in the deliverable.

Circular Segmentation

What goes wrong: Segments are defined using the very outcome being measured, creating tautological findings.

Typical scenarios:

  • "Users who finished onboarding retain better" — finishing onboarding is itself a retention signal
  • "Power users drive more revenue" — revenue generation is what made them power users

Prevention: Base segment definitions on characteristics measured before the outcome period, not on the outcome itself.

Sanity-Checking Results

Quick Magnitude Tests

| Metric Category | Validation Approach | |---|---| | User counts | Cross-reference against known DAU/MAU benchmarks | | Revenue totals | Compare to known ARR or recent financial reports | | Conversion rates | Must be 0-100%; compare to dashboard baselines | | Growth rates | Is 50%+ month-over-month realistic, or does it signal a data problem? | | Averages | Given the distribution, does this number feel right? | | Segment shares | Do percentage breakdowns sum to approximately 100%? |

Cross-Validation Approaches

  1. Dual calculation: Derive the same metric via two independent query paths and confirm they match
  2. Record-level spot checks: Select a handful of specific entities and manually trace their numbers end to end
  3. Benchmark comparison: Verify against published dashboards, finance systems, or prior analysis outputs
  4. Arithmetic reversal: If total revenue is X and there are N users, does X / N approximate the reported per-user figure?
  5. Micro-slice testing: Filter to a single day, user, or category and confirm the micro-result is sensible

Signals That Demand Investigation

  • Any metric swinging more than 50% period-over-period without a clear explanation
  • Totals or sums that land on suspiciously round numbers (possible filter or default-value artifact)
  • Rates pegged at exactly 0% or 100% (may indicate missing data rather than perfect outcomes)
  • Results that confirm the hypothesis too neatly (real data is almost always messy)
  • Identical values appearing across different time periods or segments (suggests a dimension is being ignored)

Ensuring Reproducibility

Analysis Write-Up Template

Every substantial analysis should ship with this documentation:

## Analysis: [Title]

### Business Question
[The precise question this work answers]

### Sources
- Table: [schema.table_name] (snapshot date: [date])
- Table: [schema.other_table] (snapshot date: [date])
- External file: [filename] (origin: [description])

### Metric and Segment Definitions
- [Metric A]: [Precise calculation formula]
- [Segment X]: [Exact inclusion/exclusion criteria]
- [Time window]: [Start] through [end], [timezone]

### Analytical Approach
1. [First step and its purpose]
2. [Second step]
3. [Third step]

### Assumptions and Known Limitations
- [Assumption and why it holds]
- [Limitation and its potential effect on conclusions]

### Results
1. [Finding with supporting evidence]
2. [Finding with supporting evidence]

### Queries
[All SQL and code used, annotated with comments]

### Warnings for the Reader
- [Anything the audience should weigh before acting on these results]

Annotating Analytical Code

For SQL or Python that others may reuse:

"""
Title: Monthly Cohort Retention
Author: [Name]
Created: [Date]
Sources: events, users
Last cross-checked: [Date] — matched dashboard within 2%

Objective:
    Build monthly retention cohorts anchored on each user's first event date.

Assumptions:
    - "Active" = at least one recorded event in the calendar month
    - Internal and test accounts excluded (user_type != 'internal')
    - All timestamps normalized to UTC

Output:
    Retention grid: rows are cohort months, columns are months since first event.
    Cell values are retention percentages (0-100).
"""

Maintaining an Audit Trail

  • Store all queries and scripts in version control or a shared knowledge base
  • Record the exact data snapshot date used for each analysis run
  • When refreshing a recurring analysis, document what changed and why
  • Link current results to prior versions so trends in the analysis itself are traceable