Model Requirements Gatherer
Composable: Works before query-writer (provides requirements), references metadata-manager patterns (DAG/scheduling), and uses bigquery-etl-core conventions.
When to use:
- Before writing SQL for new models
- Before modifying existing queries in bqetl (to understand impact and gather change requirements)
- When gathering requirements from data scientists/stakeholders
- When planning data model architecture
Overview
This skill guides requirements gathering for building and modifying BigQuery tables. It:
- For new models: Runs a structured requirements interview in which the user provides source tables, grain, and business logic
- For existing queries: Reads the current model from local files, identifies downstream dependencies, and gathers requirements for the change
- Validates dependencies: Uses local schema/metadata files first, then Glean Dictionary, then DataHub as last resort
- Warns about downstream impact: When modifying existing tables, identifies and warns about affected downstream models
- Suggests useful columns: Based on requirements and available source data
- Produces dual outputs: Requirements doc (for stakeholders) + condensed working notes (for SQL development)
When to Use This Skill
Use this skill when:
- Starting a new data model from scratch
- A data scientist or stakeholder requests a new table/view
- User asks to modify/edit an existing query in bqetl
- Planning architecture for a new ETL pipeline
- You need to understand downstream impact before making changes
Do NOT use this skill for:
- Ad-hoc analysis queries (not part of bqetl)
- Questions about what data exists (that's exploration, not construction)
Discovery Priority for Construction
When validating dependencies or checking schemas:
1. Local /sql directory - ALWAYS check first (schema.yaml, metadata.yaml, query.sql files)
2. Glean Dictionary - for _live/_stable tables via https://dictionary.telemetry.mozilla.org/
3. DataHub - last resort, or to validate what you've found
Key principle: User provides source tables. You validate they exist and understand downstream impact.
Requirements Interview Process
Step 1: Understand the Business Need
Ask the user these key questions (don't ask all at once - follow conversational flow):
Core questions:
- What business activity does this model represent? (e.g., "install attribution", "user retention", "revenue reporting")
- Who will use this model and what decisions will they make with it? (data scientists, dashboards, ML models)
- What are 3-5 key questions this model should answer? (be specific, e.g., "How many installs per campaign per day?")
Capture:
- Business purpose in 1-2 sentences
- Primary stakeholders/consumers
- Core use cases and questions
Step 2: Define the Grain
This is THE most critical decision - everything else flows from grain.
Ask: "What should one row represent?"
Good examples:
- "One row per client per day"
- "One row per install event"
- "One row per search interaction"
- "One row per subscription, aggregated monthly"
Bad/vague examples:
- "User data" (what grain? per day? per event? lifetime?)
- "Search metrics" (per query? per day? per client?)
If grain is unclear, help clarify:
- "Will this be daily aggregates or event-level data?"
- "Should this track changes over time or snapshot state?"
- "How will downstream consumers typically filter/join this?"
Capture:
- Exact grain statement
- Time dimension (daily, hourly, event-level, lifetime)
Step 3: Get Source Tables from User
Ask: "What source tables should this model use?"
User provides table names - they tell you what upstream models/tables to use.
For Each Source Table
Capture from user:
- Full table name (project.dataset.table)
- Why this table is needed (what fields/data it provides)
- Join keys (how it connects to other sources)
- Any filter logic (WHERE clauses, date ranges)
Validate Source Tables Exist
After user provides source tables, validate they exist using discovery priority:
1. Check the local /sql directory: `find sql/ -path "*/<dataset>/<table>"`
   - Read schema.yaml and metadata.yaml if found
   - Understand available columns for later column suggestions
2. For _live/_stable tables, check the Glean Dictionary:
   - URL: https://dictionary.telemetry.mozilla.org/apps/<app_id>/tables/<table_name>
   - Use WebFetch to get the schema
3. If not found in local files or Glean, use DataHub as a last resort:
   - Validate the table exists
   - Get the schema for join key validation
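A minimal bash sketch of this validation flow, run from the root of a bigquery-etl checkout; the dataset, table, and app IDs are placeholders:

```bash
# Sketch of the discovery priority for one user-provided source table.
# DATASET/TABLE are placeholders; adjust to the table the user named.
DATASET="adjust_derived"
TABLE="installs_v1"

# 1. Local sql/ directory first: schema.yaml and metadata.yaml give columns and ownership.
find sql/ -path "*/${DATASET}/${TABLE}/*" \( -name "schema.yaml" -o -name "metadata.yaml" \)

# 2. For _live/_stable tables, check the Glean Dictionary (fetched with WebFetch in practice):
echo "https://dictionary.telemetry.mozilla.org/apps/<app_id>/tables/${TABLE}"

# 3. Last resort: DataHub, via the metadata-manager skill's scripts
#    (see the datahub_lineage.py command later in this skill).
```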
Capture:
- Table description (from metadata.yaml or Glean Dictionary)
- Owner (from metadata.yaml)
- Available columns (for later column suggestions)
Step 4: Understand Transformations
Ask: "What transformations or calculations are needed?"
Listen for:
- Aggregations (SUM, COUNT, AVG, etc.)
- Filtering logic (active users, specific countries, etc.)
- Derived fields (calculations, CASE statements, etc.)
- Deduplication logic
- Window functions
Capture:
- High-level transformation logic (don't write SQL yet!)
- Key derived fields and their business definitions
- Any complex business rules or edge cases
Step 5: Determine Scheduling and Partitioning
Ask scheduling questions:
- How fresh does this data need to be? (daily, hourly, real-time)
- How much history do we need? (30 days, 1 year, all time)
- Will this run incrementally or full refresh? (usually incremental for large tables)
Infer partitioning:
- Daily grain → daily partitioning
- Hourly grain → hourly partitioning
- Event-level → usually daily partitioning on submission_date
Reference metadata-manager patterns:
- Check similar tables in same product area for DAG patterns
- Suggest existing DAGs when appropriate
Capture:
- Refresh frequency
- Partitioning strategy
- Retention policy
- Candidate DAG (or note: "new DAG needed")
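For orientation, a hedged sketch of how these answers typically map onto a table's metadata.yaml; the field names follow common bigquery-etl conventions but are illustrative and should be confirmed against the metadata-manager skill:

```bash
# Illustrative only - confirm exact fields with the metadata-manager skill.
cat <<'EOF' > /tmp/metadata.yaml.example
friendly_name: Install Attribution
description: One row per install, attributed to an Adjust campaign.
owners:
  - example@mozilla.com      # placeholder owner
labels:
  incremental: true          # incremental daily refresh rather than full rebuild
scheduling:
  dag_name: bqetl_example    # candidate DAG, or a new DAG if none fits
bigquery:
  time_partitioning:
    type: day                # daily grain -> daily partitioning
    field: submission_date
EOF
```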
Step 6: Identify Dependencies and Blockers
Ask: "Are there any blockers or dependencies we should know about?"
Common blockers:
- Upstream tables don't exist yet
- Need access to specific datasets
- Waiting for new metrics to be added to Glean schema
- Need to coordinate with another team
Capture:
- Upstream dependencies (what must exist first)
- Downstream consumers (what will use this)
- Blockers (with owners/timelines if known)
After Gathering Requirements: Validate and Suggest
After gathering full requirements from user:
1. Check Downstream Dependencies (if modifying existing table)
CRITICAL: If user is modifying an existing query, warn about downstream impact.
- Check local files first - look for tables that reference this one in their query.sql:
  `grep -r "project.dataset.table_name" sql/ --include="query.sql"`
- Check metadata.yaml files for explicit dependencies in `depends_on` fields
- If needed, use DataHub lineage to validate:
  `python .claude/skills/metadata-manager/scripts/datahub_lineage.py <table_name> --direction downstream`
Then warn the user: "⚠️ This table is used by X downstream models: [list]. Changes may impact these."
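A minimal sketch that combines these checks into a single impact summary; the table name is a placeholder, and the DataHub step is only needed when local references look incomplete:

```bash
# Count and list downstream references before changing an existing table.
TABLE="project.dataset.table_name"   # placeholder

DOWNSTREAM=$(grep -rl "${TABLE}" sql/ --include="query.sql" || true)
COUNT=$(printf '%s' "${DOWNSTREAM}" | grep -c "query.sql" || true)

echo "⚠️  ${TABLE} is referenced by ${COUNT} downstream queries:"
echo "${DOWNSTREAM}"

# Fall back to DataHub lineage if local references look incomplete:
# python .claude/skills/metadata-manager/scripts/datahub_lineage.py "${TABLE}" --direction downstream
```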
2. Suggest Useful Columns
Based on requirements and available source columns:
- Review source table schemas (from local files, Glean Dictionary, or DataHub)
- Identify useful columns that match the stated requirements
- Suggest fields for:
- Join keys
- Dimensions for grouping (country, channel, version, etc.)
- Metrics for calculations (counts, sums, etc.)
- Timestamps for partitioning/filtering
Example: "Based on your requirements to track installs by campaign, these columns from adjust_derived.installs_v1 would be useful: campaign_id, campaign_name, install_date, install_count"
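A small sketch for grounding those suggestions in the actual source schema; the path is illustrative, and yq is optional:

```bash
# List column names from a source table's schema.yaml before suggesting fields.
# Path is a placeholder; bigquery-etl schema.yaml files list columns under `fields`.
SCHEMA="sql/moz-fx-data-shared-prod/adjust_derived/installs_v1/schema.yaml"

if command -v yq >/dev/null 2>&1; then
  yq '.fields[].name' "${SCHEMA}"
else
  grep -E "^[[:space:]]*- name:" "${SCHEMA}"
fi
```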
Output Format
Generate TWO artifacts:
1. Requirements Document (Full)
Use the template from assets/requirements_template.md and include these sections:
- Overview (business purpose, 1-2 sentences)
- Grain (exact row definition)
- Source Data (table with join keys, filters, notes)
- Field Summary (key derived fields and their business logic)
- Questions This Model Should Answer (3-5 specific questions)
- Dependencies (upstream, downstream, blockers)
- Validation & QA (row count checks, null thresholds, freshness)
- Owner & Status (primary owner, reviewers, target date)
Optional sections (include if relevant):
- Useful Links (if user provided notebooks, dashboards, prior work)
- Lineage Diagram (if DataHub lineage available)
- Data Governance (if PII, retention policies, or classification discussed)
Minimize busywork:
- Skip sections not relevant to this model
- Keep descriptions concise
- Focus on what helps with SQL development
2. Working Notes (Condensed)
Create a condensed version for SQL development:
## <Model Name> - Working Notes
**Grain:** <one row per ...>
**Source Tables:**
- `project.dataset.table1` - <why needed>
- `project.dataset.table2` - <why needed>
**Key Transformations:**
- <transformation 1>
- <transformation 2>
**Derived Fields:**
- `field_name` - <business logic>
**Partitioning:** <daily/hourly on field>
**Scheduling:** <DAG name or "new DAG needed">
**Blockers:** <any blockers or "None">
Integration with Other Skills
Before query-writer
This skill runs before query-writer to gather requirements. After generating requirements doc and working notes:
- Ask user: "Ready to start writing SQL based on these requirements?"
- If yes, pass working notes to query-writer skill
- query-writer uses requirements as context for SQL development
References metadata-manager
Use metadata-manager patterns for:
- Schema validation (priority: /sql → Glean Dictionary → DataHub)
- Partitioning strategies (daily vs hourly)
- DAG configuration patterns
Uses datahub_lineage.py (when needed)
For impact analysis and validation:
- Check downstream dependencies when modifying existing tables
- Validate upstream dependencies if user is unsure
- Last resort after checking local files and Glean Dictionary
Best Practices
For Requirements Interviews
- Don't ask all questions at once - follow conversational flow
- Clarify vague grain - grain is the foundation, must be precise
- User provides source tables - don't suggest alternatives unless they're unsure
- Focus on business logic - not SQL syntax yet
- Capture decisions and rationale - why we chose X over Y
For Modifying Existing Queries
- ALWAYS check downstream dependencies first - warn about impact
- Read existing query.sql - understand current logic before planning changes
- Check for schema changes - schema changes require coordination with downstream consumers
- Document impact analysis - include in requirements doc
For Validation and Column Suggestions
- Follow discovery priority - local files → Glean Dictionary → DataHub
- Validate sources exist - don't assume tables are available
- Review source schemas - understand available columns before suggesting
- Match requirements to columns - suggest fields that directly support stated needs
For Output Quality
- Make requirements doc useful for SQL development - not just paperwork
- Working notes should be quick reference - condense to essentials
- Skip irrelevant sections - don't force every template field
- Include downstream impact - critical for modification scenarios
Key Commands Reference
# Check downstream dependencies when modifying existing table
grep -r "project.dataset.table_name" sql/ --include="query.sql"
# Find schema for source table
find sql/ -path "*/<dataset>/<table>/schema.yaml"
# Check metadata.yaml for dependencies
find sql/ -path "*/<dataset>/<table>/metadata.yaml"
# Validate downstream impact with DataHub (if needed)
python .claude/skills/metadata-manager/scripts/datahub_lineage.py <table_name> --direction downstream
# Find tables by keyword
find sql/ -path "*/*keyword*" -name "schema.yaml"
Common Scenarios
Scenario 1: New Model - Install Attribution
User says: "I need to track install attribution from Adjust using adjust_derived.installs_v1"
Workflow:
- Gather requirements: Grain (one row per install), business questions, transformations needed
- User provides source: adjust_derived.installs_v1
- Validate source exists: Check local schema.yaml or DataHub
- Review source columns: campaign_id, campaign_name, install_date, install_count
- Suggest useful columns: Based on requirements about attribution
- Output: Requirements doc + working notes for query-writer
Scenario 2: Modify Existing Query
User says: "Add country dimension to telemetry_derived.clients_daily_v1"
Workflow:
- Read existing query.sql to understand current logic
- Check downstream dependencies: `grep -r "clients_daily_v1" sql/ --include="query.sql"`
- Warn about impact: "⚠️ This table is used by 47 downstream models. Schema change will require coordination."
- Gather requirements: Why country is needed, how it affects downstream
- Validate country exists in source: Check normalized_country column availability
- Output: Requirements doc with impact analysis + working notes
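For the column-availability step in this workflow, a quick sketch for checking whether the column already appears in the current query or schema before tracing it back to the source tables; paths assume the standard bigquery-etl layout and may differ in practice:

```bash
# Is normalized_country already referenced in the existing query or schema?
# Paths are illustrative placeholders.
DIR="sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v1"

grep -n "normalized_country" "${DIR}/query.sql" || echo "not referenced in the current query"
grep -n "normalized_country" "${DIR}/schema.yaml" || echo "not in the current schema"
```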
Scenario 3: Understanding Upstream Dependencies
User says: "Create revenue dashboard model" but unsure what upstream tables exist
Workflow:
- Gather what user knows: Grain, business questions, general data needs
- User says: "I think there's a revenue table somewhere?"
- Help find it: `find sql/ -path "*/revenue*" -name "schema.yaml"`
- Show options: revenue_derived.revenue_data_v1 found
- User confirms: "Yes, use that one"
- Continue with validation and column suggestions
Reference Files
- `assets/requirements_template.md` - Full requirements document template
- See metadata-manager skill for schema discovery patterns