Data Investigation Skill

Data Investigation

Use this skill to produce investigations that are fast, correct, reproducible, and communicate a clear conclusion rather than a pile of charts.

Purpose

Every investigation should be answerable in one sentence before the first SQL query is written.

Phase 1: Frame Before Querying

1. Write the one-sentence answer first

Before writing any SQL, write the sentence the conclusion is expected to be.

Example: The cohort size gap is a definition problem rather than a product behavior problem.

If the sentence cannot be written, the question is not yet understood.

2. Classify the investigation type

| Type | Trigger | Approach | |---|---|---| | Gap analysis | Why do A and B not match? | Establish the gap, localize it, explain it | | Root cause | Why did this metric change? | Confirm real, isolate segment, align timing, validate mechanism | | Hypothesis test | Is X causing Y? | Define what must be true, test sub-claims, confirm or reject | | Feasibility check | Is this number trustworthy? | Check grain, joins, nulls, definition overlap |

3. State 2-3 competing hypotheses before querying

Never investigate with a single hypothesis. That creates confirmation bias.

Order hypotheses by plausibility and note which one is currently expected to be correct and why.

Phase 2: Build Queries In Escalating Specificity

1. Establish first, explain second

Step 1 always confirms the anomaly is real and measures its magnitude. Do not jump to cause until the effect is confirmed.

select
    <time_bucket>,
    <source_a_count> as metric_a,
    <source_b_count> as metric_b,
    <source_a_count> - <source_b_count> as gap,
    round(100.0 * (<source_a_count> - <source_b_count>) / nullif(<source_a_count>, 0), 1) as pct_gap
from ...
order by 1

2. Localize by breaking one dimension at a time

After confirming the gap, break it down along one dimension per step:

By time: when did it appear?
By segment: who is affected?
By signal/source: which path is missing?

3. Align timing with upstream changes

Once the anomaly start date is known, check:

git commits to relevant models, ETL jobs, or application code
schema migrations or source changes
metric definition changes
external events such as launches, campaigns, or pricing changes

A credible root cause must explain why the metric changed when it did.

4. Validate the mechanism

After identifying a candidate cause, confirm it mechanically:

Does the affected population match the predicted population?
What does the counterfactual look like?
Does a second independent signal support the explanation?

Do not stop at correlation.

5. Quantify each hypothesis before concluding

For every candidate explanation, produce a number.

Examples:

H1 accounts for 1,240 users
H2 accounts for 1,050 users

If a hypothesis cannot be quantified, it is not yet validated.

Phase 3: SQL Hygiene Rules

Use stable time bounds for investigations

Investigation SQL should be reproducible. Favor fixed bounds over rolling windows unless the analysis is intentionally operational and evergreen.

-- Prefer fixed investigation scope
where created_at >= '2026-02-16'
  and created_at < '2026-04-04'

Add explicit grain comments in important CTEs

-- grain: one row per user
with users as (
    ...
)

Prefer named comparison outputs over raw aggregates

Bad:

select count(*) from ...

Better:

select
    count(*) as users_in_scope,
    count(distinct org_id) as orgs_in_scope
from ...

Keep diagnostic queries small and discardable

Diagnostic SQL exists to answer one question. Do not build giant reusable queries too early.

Phase 4: Communication Rules

Lead with the answer

The first sentence of the written output should state the conclusion.

Bad:

I looked at several possible explanations for the discrepancy.

Good:

The discrepancy is caused by internal users being excluded from the dashboard query but included in the warehouse baseline.

Separate evidence from interpretation

Use a structure like:

Conclusion
Evidence
Why alternative hypotheses were rejected
Recommended next action

Include the limiting assumption

Every investigation should state the biggest assumption that could change the answer.

Example:

This conclusion assumes backend event timestamps are complete for the affected week; if ingestion was delayed, the gap may be overstated.

Phase 5: Output Standard

Every completed investigation should leave behind:

One-sentence answer
Final supporting SQL or notebook
A note on rejected hypotheses
Any follow-up question or unresolved ambiguity

Common Failure Modes

Treating a metric definition dispute as a product issue
Comparing sources with different grain or freshness
Accepting the first plausible explanation without quantification
Writing one giant query too early
Ending with charts instead of an answer

Related Skills

Use the answering-natural-language-questions-with-dbt workflow when the goal is answering a business question rather than debugging why analysis or metrics disagree.

Agent Skills: Data Investigation

Install this agent skill to your local

Skill Files