# Query Writer
**Composable:** Works with bigquery-etl-core (for conventions), metadata-manager (for schemas), and sql-test-generator (for tests).

**When to use:** Writing or updating SQL queries (`query.sql`) or Python ETL scripts (`query.py`) following Mozilla BigQuery ETL conventions.
## Overview
Generate and update SQL queries and Python ETL scripts for the bigquery-etl repository following Mozilla's conventions. This skill handles:
- Writing new `query.sql` or `query.py` files
- Updating existing queries
- MANDATORY: Checking for and updating existing tests whenever queries are modified
- Coordinating downstream updates to schemas (via metadata-manager) and tests (via sql-test-generator)
## 🚨 REQUIRED READING - Start Here
BEFORE writing any query, READ these reference files to understand patterns:
- **SQL Conventions:** READ `references/sql_formatting_conventions.md`
  - Mozilla's formatting standards
  - Naming conventions
  - Code organization
- **Common Patterns:** READ `references/common_query_patterns.md`
  - Standard query structures for different use cases
  - When to use CTEs vs subqueries (a brief sketch follows this list)
  - Aggregation patterns
- **Partitioning:** READ `references/partitioning_patterns.md`
  - Incremental vs full refresh
  - Date partitioning requirements
  - Parameter usage
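To illustrate the CTE guidance above, here is a minimal sketch; the column names are illustrative only, and `assets/cte_query_example.sql` remains the template to copy:

```sql
-- CTEs name each logical step, which reads better than nested
-- subqueries. Column names here are illustrative only.
WITH events_filtered AS (
  SELECT
    client_id,
    event_category,
  FROM
    `moz-fx-data-shared-prod.telemetry.events`
  WHERE
    submission_date = @submission_date
)
SELECT
  @submission_date AS submission_date,
  event_category,
  COUNT(DISTINCT client_id) AS n_clients,
FROM
  events_filtered
GROUP BY
  event_category
```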
## 📋 Templates - Copy These Structures
When writing queries, READ and COPY from these template files:
**For SQL Queries:**
- Basic aggregation? → READ `assets/basic_query_example.sql`
- Need CTEs? → READ `assets/cte_query_example.sql`
- Joining tables? → READ `assets/join_example.sql`
- UNNESTing arrays? → READ `assets/unnest_example.sql` (a minimal sketch follows this list)
- User-level aggregation? → READ `assets/user_aggregation_example.sql`

**For Python Queries:**
- API calls or complex logic? → READ `assets/python_query_template.py`
- Also READ `references/python_queries.md` for Python-specific patterns
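To give a feel for the UNNEST pattern referenced above, a minimal sketch follows; `experiments` is a hypothetical repeated key/value field used purely for illustration, and `assets/unnest_example.sql` is the template to copy:

```sql
-- Flatten a REPEATED field into one row per array element.
-- `experiments` is a hypothetical repeated key/value field.
SELECT
  submission_date,
  client_id,
  experiment.key AS experiment_id,
FROM
  `moz-fx-data-shared-prod.telemetry.events`
CROSS JOIN
  UNNEST(experiments) AS experiment
WHERE
  submission_date = @submission_date
```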
## Schema and Description Lookups for Query Construction
When writing queries, you need both schemas (column names/types) and descriptions. These come from different sources:
**Schema Priority (Column Names & Types):**
- FIRST: Check local `schema.yaml` files in the `sql/` directory
- LAST RESORT: Use DataHub when the schema is not available locally

**Description Priority:**
- FIRST: Check the Glean Dictionary for `_live`/`_stable` tables (https://dictionary.telemetry.mozilla.org/)
- SECOND: Check local `metadata.yaml` files in the `sql/` directory
- LAST RESORT: Use DataHub when descriptions are not available

IMPORTANT: The Glean Dictionary provides descriptions ONLY, not schemas. For schemas of `_live`/`_stable` tables, use local `schema.yaml` files or DataHub.

When to use each source:
- Local `/sql` files: For schemas and metadata of any derived tables in bqetl (most common)
- Glean Dictionary: For descriptions of raw ingestion tables (`_live`, `_stable`) ONLY
- DataHub: Last resort for schemas when not available locally, or for descriptions when not in Glean/local files
## 🚨 Configuration Standards
CRITICAL: Only use documented patterns and configurations!
When writing queries, ONLY use query patterns, SQL conventions, and partitioning configurations that are documented in:
- Reference files in this skill (`references/` directory)
- Example queries in this skill (`assets/` directory)
- Existing patterns found in the `/sql` directory
DO NOT:
- Invent new query patterns or configurations that aren't documented
- Assume BigQuery features work the same as other SQL dialects
- Use undocumented partitioning or clustering configurations
ALWAYS reference existing patterns in the `/sql` directory to see how similar queries are structured.
Typical workflow for this skill:
- Gather requirements - Use the model-requirements skill if needed to understand what to build
- Write query - Use this skill (query-writer) following Mozilla conventions
- Format query - Run `./bqetl format <path>` to ensure proper SQL formatting
- Validate query - Run `./bqetl query validate <path>` to check SQL syntax and conventions
- Generate schema/metadata - Use the metadata-manager skill to create schema.yaml and metadata.yaml (ONLY if validation passes)
- 🚨 MANDATORY: Check for and update tests - ALWAYS look for existing tests and update/create them (see the Test Management section below)
## Quick Start
### SQL or Python?
Use `query.sql` for:
- Standard data transformations (95% of tables)
- Aggregations and GROUP BY operations
- Joins, CTEs, window functions
- Standard BigQuery operations
Use `query.py` for:
- API calls to external services
- Complex pandas transformations
- Multi-project queries or INFORMATION_SCHEMA operations
- Custom business logic clearer in Python
### Basic SQL Structure
```sql
-- Brief comment explaining the query's purpose
SELECT
  submission_date,
  sample_id,
  client_id,
  COUNT(*) AS n_total_events,
FROM
  `moz-fx-data-shared-prod.telemetry.events`
WHERE
  submission_date = @submission_date
GROUP BY
  submission_date,
  sample_id,
  client_id
```
Key conventions:
- Uppercase SQL keywords
- 2-space indentation
- Each field on its own line
- Always filter on `submission_date = @submission_date` for incremental queries
### Partitioning Requirements
Incremental queries (most common):
- Accept `@submission_date` parameter
- Output `submission_date` column matching the parameter
- Filter: `WHERE submission_date = @submission_date`

Full refresh queries (a minimal sketch follows):
- No `@submission_date` parameter
- Set `date_partition_parameter: null` in metadata.yaml
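For contrast with the incremental example above, here is a minimal full-refresh sketch; the table and columns are illustrative only:

```sql
-- Full refresh: no @submission_date parameter; the whole table is
-- recomputed on each run. Pair with date_partition_parameter: null
-- in metadata.yaml. Table and columns are illustrative only.
SELECT
  code AS country_code,
  name AS country_name,
FROM
  `moz-fx-data-shared-prod.static.country_codes_v1`
```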
### SQL Formatting
ALWAYS format SQL queries using the bqetl formatter:
```bash
# Format a specific query file
./bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/events_daily_v1/query.sql

# Format an entire query directory
./bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/events_daily_v1/

# Check if formatting is correct without modifying files
./bqetl format --check sql/moz-fx-data-shared-prod/telemetry_derived/events_daily_v1/
```
Why formatting matters:
- Ensures consistent code style across the repository
- Makes queries easier to read and review
- Required for CI/CD pipeline to pass
- Automatically handles indentation, keyword casing, and line breaks
When to format:
- Immediately after writing or modifying any SQL query
- Before running `./bqetl query validate`
- Before committing changes to git
Note: The `./bqetl query validate` command includes formatting checks, but it's better to run `./bqetl format` first to automatically fix any formatting issues rather than just checking for them.
## Assets (Examples)
The `/assets` directory contains complete query examples:
- `basic_query_example.sql` - Simple aggregation pattern
- `cte_query_example.sql` - Using CTEs for complex logic
- `user_aggregation_example.sql` - User-level metrics
- `join_example.sql` - Standard JOIN with partition filters
- `unnest_example.sql` - UNNEST for repeated fields
- `python_query_template.py` - Python query structure
## References (Detailed Documentation)
The `/references` directory contains detailed guides:
- `sql_formatting_conventions.md` - Formatting rules, UDF usage, header comments
- `partitioning_patterns.md` - Incremental vs full refresh patterns
- `jinja_templating.md` - Jinja functions and date handling (a brief sketch follows this list)
- `common_query_patterns.md` - Event processing, JOINs, performance tips
- `python_queries.md` - When to use Python, common patterns, best practices
- `external_documentation.md` - Links to official docs and example queries
- `test_update_workflow.md` - Workflow for updating existing queries and coordinating test updates
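As one example of what `jinja_templating.md` covers: bigquery-etl queries can embed Jinja, and one documented helper is `is_init()`, which is true during initialization runs. A minimal sketch, assuming that helper and an illustrative backfill start date:

```sql
SELECT
  submission_date,
  client_id,
FROM
  `moz-fx-data-shared-prod.telemetry.events`
WHERE
  {% if is_init() %}
    -- Initialization run: process the full history (date is illustrative).
    submission_date >= '2020-01-01'
  {% else %}
    -- Regular incremental run.
    submission_date = @submission_date
  {% endif %}
```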
## DataHub Usage (CRITICAL for Token Efficiency)
BEFORE using any DataHub MCP tools (`mcp__datahub-cloud__*`), you MUST:
- READ `../bigquery-etl-core/references/datahub_best_practices.md` - Token-efficient query patterns and priority order
- Always prefer local files (`schema.yaml`, `metadata.yaml`) over DataHub queries
- Always check the Glean Dictionary for `_live`/`_stable` tables before using DataHub
Use DataHub ONLY for:
- Schema definitions when not available in the `/sql` directory
- Field descriptions when missing (after checking the Glean Dictionary → `/sql` hierarchy)
- Lineage/dependencies that can't be inferred from bqetl (telemetry sources, syndicated datasets)
- Syndicated datasets (directories without `query.sql`/`query.py`/`view.sql` - usually from dev teams' Postgres databases)
When using DataHub:
- Extract ONLY essential fields (column names/types) from DataHub responses
- Use the search → get_entity pattern with limited results
- Check `metadata.yaml` for syndicated datasets first
## 🚨 Test Management - MANDATORY FOR ALL QUERY UPDATES
CRITICAL: ALWAYS check for and update existing tests when modifying queries!
### When Updating Existing Queries
BEFORE making changes:
- Check for existing tests: `ls tests/sql/<project>/<dataset>/<table>/`
- Note current source tables: `grep -E "FROM|JOIN" sql/<project>/<dataset>/<table>/query.sql`
AFTER making changes:
- Update schema (if output structure changed): `./bqetl query schema update <dataset>.<table>`
- 🚨 MANDATORY: Update test fixtures for ALL existing tests:
  - New source table added (JOIN, FROM)? → MUST use the sql-test-generator skill to add fixtures to ALL test directories
  - Source table removed? → Delete its fixture files from all test directories
  - Output schema changed? → Update `expect.ndjson`/`expect.yaml` in all test directories
  - Query logic changed? → Review and update test input data and expected outputs
- Run tests: `pytest tests/sql/<project>/<dataset>/<table>/ -v`
- Fix any failures - Update fixtures and expectations until tests pass
### When Creating New Queries
After writing `query.sql`:
- MUST create unit tests using sql-test-generator skill
- Tests validate query logic before deployment
- Required for CI/CD pipeline
### Common Test Update Scenarios
Added a new field based on existing source data:
- Update input data in `<source_table>.ndjson` files to include the field
- Update expected output in `expect.ndjson` files with expected values for the new field
- Update test schema files if needed (`.schema.json` files)
Added a JOIN to a new table:
- MUST use sql-test-generator skill to create fixtures for the new table in ALL test directories
- Prevents tests from querying production data (which causes failures)
- Ensures proper data types and schema structure
Changed aggregation logic:
- Update expected values in `expect.ndjson` to match the new calculation logic
- May need to adjust input test data to create meaningful test cases
For the detailed workflow, see `references/test_update_workflow.md`.
## Data Quality Checks vs Unit Tests
IMPORTANT DISTINCTION:
Data Quality Checks (Bigeye):
- Use bigconfig-generator skill to create Bigeye monitoring configurations
- Bigeye monitors production data for quality issues (anomalies, nulls, freshness, etc.)
- `checks.sql` files are DEPRECATED and should NOT be used
- All data quality monitoring should be done via Bigeye
Unit Tests (sql-test-generator):
- Use sql-test-generator skill to create unit test fixtures
- Unit tests validate query logic during development
- Tests run on small, synthetic fixtures (not production data)
- Ensures queries work correctly before deployment
- MANDATORY: Must be updated whenever queries are modified
## Integration with Other Skills
query-writer works in coordination with other skills:
### Works with bigquery-etl-core
- References core skill for project structure and naming conventions
- Uses common patterns, mozfun UDF references, and metric discovery
### Works with metadata-manager
- After writing and validating queries: Use metadata-manager to generate/update schema.yaml
- Creates/updates metadata.yaml with scheduling and ownership
### Works with sql-test-generator
- After writing and validating queries: Use sql-test-generator to create unit test fixtures
- Prevents tests from querying production data by ensuring complete fixture coverage
### Works with bigconfig-generator
- For data quality monitoring: Use bigconfig-generator to create Bigeye configurations
- Monitors production tables for data quality issues (NOT `checks.sql`)
## Typical Workflow
Creating new queries:
- Use query-writer to write `query.sql` or `query.py`
- Format the query: Run `./bqetl format sql/<project>/<dataset>/<table>` to apply Mozilla SQL formatting standards
- Validate the query: Run `./bqetl query validate sql/<project>/<dataset>/<table>` to check for syntax errors and conventions
- Invoke metadata-manager to generate schema.yaml and metadata.yaml (ONLY if validation passes)
- Invoke sql-test-generator to create unit test fixtures
- Run unit tests and validate the query works correctly
Updating existing queries:
- 🚨 FIRST: Check for existing tests: Run `ls tests/sql/<project>/<dataset>/<table>/` to see what tests exist
- Use query-writer to modify `query.sql` or `query.py`
- Format the query: Run `./bqetl format sql/<project>/<dataset>/<table>` to apply formatting
- Validate the query: Run `./bqetl query validate sql/<project>/<dataset>/<table>` to ensure changes are valid
- Invoke metadata-manager to update schema.yaml if output structure changed (ONLY if validation passes)
- 🚨 MANDATORY: Update tests:
  - If a new source table was added (JOIN/FROM): MUST invoke sql-test-generator to add fixtures to ALL test directories
  - If a source table was removed: Delete fixture files from all test directories
  - If the output schema changed: Update `expect.ndjson` in all test directories
  - If query logic changed: Update test input data and expectations
- Run tests: `pytest tests/sql/<project>/<dataset>/<table>/ -v` and fix any failures
- See `references/test_update_workflow.md` for the complete test update workflow and checklist
## Performance Essentials
- Filter on partition columns: `WHERE submission_date = @submission_date`
- Avoid `SELECT *` - list only needed columns
- Filter before JOINs to reduce shuffling
- Use `sample_id` for testing: `WHERE sample_id = 0` (1% sample)
- Use approximate functions: `approx_count_distinct()` when exact counts not needed (see the sketch below)
For detailed optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
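A minimal development-time sketch combining several of these tips; the column choices are illustrative:

```sql
-- Restrict to a 1% sample and use an approximate distinct count
-- to keep the scan cheap while iterating.
SELECT
  submission_date,
  APPROX_COUNT_DISTINCT(client_id) AS approx_n_clients,
FROM
  `moz-fx-data-shared-prod.telemetry.events`
WHERE
  submission_date = @submission_date
  AND sample_id = 0  -- 1% sample for testing; remove before production
GROUP BY
  submission_date
```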
## External Documentation
- Creating derived datasets: https://mozilla.github.io/bigquery-etl/cookbooks/creating_a_derived_dataset/
- Recommended practices: https://mozilla.github.io/bigquery-etl/reference/recommended_practices/
- Common workflows: https://mozilla.github.io/bigquery-etl/cookbooks/common_workflows/
- Query optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
## Scripts
The `/scripts` directory is reserved for helper scripts (currently empty).