# Airflow DAG Failure Investigation

You help users investigate and debug Mozilla Airflow DAG failures by fetching logs, identifying root causes, and suggesting fixes.
## Helper Scripts

### Discovering failures

Use `get-triage-data` from the airflow-triage skill to discover failures:

```bash
../airflow-triage/scripts/get-triage-data             # Last 24 hours
../airflow-triage/scripts/get-triage-data --since 3d  # Last 3 days
```
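If the user has already narrowed things to a pipeline area, ordinary shell filtering can trim the output; a minimal sketch, assuming the script emits one failure per line (check its actual output format):

```bash
# Show only bqetl_* failures from the last 3 days (assumes line-oriented output)
../airflow-triage/scripts/get-triage-data --since 3d | grep "bqetl_"
```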
### fetch-task-log

Fetch and explore task logs from GCS (`gs://airflow-remote-logs-prod-prod`):

```bash
# List recent runs for a DAG
scripts/fetch-task-log <dag_id> --list-runs

# List tasks in a specific run
scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>

# Fetch a task log
scripts/fetch-task-log <dag_id> <task_id> <run_id>

# Fetch only the last N lines
scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100
```
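A typical drill-down chains these three calls. The DAG, task, and run ID below are hypothetical placeholders; real run IDs follow Airflow's `scheduled__<logical_date>` convention:

```bash
# 1. Find the failing run (hypothetical DAG name)
scripts/fetch-task-log bqetl_example --list-runs

# 2. List the tasks in that run
scripts/fetch-task-log bqetl_example --list-tasks \
  --run-id "scheduled__2026-04-12T02:00:00+00:00"

# 3. Pull the tail of the failing task's log, where the error usually is
scripts/fetch-task-log bqetl_example telemetry_derived__example__v1 \
  "scheduled__2026-04-12T02:00:00+00:00" --tail 100
```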
## Related Repositories

When investigating failures, check these repos (all checked out locally):

- `bigquery-etl` - Query definitions, metadata.yaml, DAG generation
- `private-bigquery-etl` - Confidential ETL code
- `telemetry-airflow` - DAGs, operators, GKEPodOperator
- `dataservices-infra` - Infrastructure (GKE, Helm, logging config)
## Where DAGs Are Defined

Most DAGs are auto-generated from bigquery-etl. The task ID tells you where to find the source.

Task ID Pattern: `<dataset>__<table>__<version>`

Example task ID: `telemetry_derived__clients_daily__v6`

Source query location (note that the directory name joins the table and version as `<table>_<version>`):

```
bigquery-etl/sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
├── query.sql      # The SQL query
├── metadata.yaml  # Scheduling config, owner, tags
└── schema.yaml    # Table schema
```

For the example above:

```
bigquery-etl/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/
```
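The mapping from task ID to source directory can be derived mechanically; a minimal sketch in bash, using the example task ID above:

```bash
# Split <dataset>__<table>__<version> and print the source directory
task_id="telemetry_derived__clients_daily__v6"
dataset="${task_id%%__*}"    # telemetry_derived
rest="${task_id#*__}"        # clients_daily__v6
table="${rest%%__*}"         # clients_daily
version="${rest##*__}"       # v6
echo "bigquery-etl/sql/moz-fx-data-shared-prod/${dataset}/${table}_${version}/"
```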
DAG ID Pattern: `bqetl_<name>`

DAGs starting with `bqetl_` are auto-generated. The DAG configuration is in `bigquery-etl/dags.yaml`.
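To pull one DAG's entry out of `dags.yaml` without opening the whole file, a plain grep works; a sketch where `bqetl_example` is a hypothetical DAG ID and `-A 15` is a rough guess at entry length:

```bash
# Print a DAG's configuration block (top-level keys in dags.yaml are DAG IDs)
grep -n -A 15 "^bqetl_example:" bigquery-etl/dags.yaml
```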
### Non-bqetl DAGs

DAGs not starting with `bqetl_` are manually defined in:

```
telemetry-airflow/dags/<dag_name>.py
```
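When the filename doesn't obviously match the DAG ID, searching the dags directory for the ID usually finds the defining file; a minimal sketch:

```bash
# Find which file defines a given DAG ID
grep -rn "<dag_id>" telemetry-airflow/dags/ --include="*.py"
```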
### Private/Confidential DAGs

Some DAGs are in private-bigquery-etl with the same structure (including the `<table>_<version>` directory naming):

```
private-bigquery-etl/sql/<project>/<dataset>/<table>_<version>/
```
## GCP Projects & Namespaces

Airflow runs across two GCP projects:

| Project | Purpose | Namespace |
|---------|---------|-----------|
| moz-fx-dataservices-high-prod | Airflow workers, scheduler | telemetry-airflow-prod |
| moz-fx-data-airflow-gke-prod | GKEPodOperator jobs (queries, scripts) | default |
## Cloud Logging (Fallback)

Start with GCS logs via fetch-task-log. Fall back to Cloud Logging if you suspect infrastructure issues or if GCS logs are missing/incomplete.

| Aspect | GCS (fetch-task-log) | Cloud Logging |
|--------|----------------------|---------------|
| Content | Complete Airflow task logs (same as UI) | Raw container stdout/stderr |
| Retention | 360 days | 30 days |
| Best for | Task failures (SQL errors, exceptions) | Pod-level issues (OOM kills, scheduling failures) |
Airflow scheduler/worker logs:

```bash
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="telemetry-airflow-prod" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-dataservices-high-prod \
  --limit=200
```

GKEPodOperator job logs (query execution errors):

```bash
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-data-airflow-gke-prod \
  --limit=200
```
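If matching on the DAG ID alone is too noisy, the filter can be narrowed by severity and time window; a sketch using standard Cloud Logging filter syntax and gcloud's `--freshness` flag:

```bash
# Only errors from the last 2 days in the GKEPodOperator project
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND severity>=ERROR AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-data-airflow-gke-prod \
  --freshness=2d \
  --limit=200
```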
## Useful Links

- Airflow UI: https://workflow.telemetry.mozilla.org/home
  - DAG detail: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid
  - Task logs: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid?dag_run_id=<run_id>&task_id=<task_id>
- Grafana dashboards:
  - Airflow overview: https://earthangel-b40313e5.influxcloud.net/d/airflow-overview
  - Task duration: https://earthangel-b40313e5.influxcloud.net/d/airflow-task-duration
- Bugzilla (airflow-triage bugs): buglist
- Runbooks: search Confluence for `<dag_id>` or the pipeline area (e.g. "main_summary runbook"). If a Confluence MCP tool is available, use it to search directly. Otherwise, suggest the user check:
  - https://mozilla-hub.atlassian.net/wiki/search?text=<dag_id>
  - https://mozilla-hub.atlassian.net/wiki/search?text=airflow+runbook
## Bugzilla Search
Before investigating, search Bugzilla for existing tickets related to the failing DAG. This avoids duplicate work and surfaces prior context.
Use `scripts/search-bugzilla`; it is pinned to bugzilla.mozilla.org and defaults to the `[airflow-triage]` whiteboard filter:

```bash
scripts/search-bugzilla <dag_id>                    # Open [airflow-triage] bugs matching DAG name
scripts/search-bugzilla <dag_id> --status resolved  # Recently resolved matches
scripts/search-bugzilla <dag_id> --status all       # Both
scripts/search-bugzilla "<task_id>" --limit 10      # Narrow by task name
scripts/search-bugzilla --all "<keywords>"          # Drop the whiteboard filter
scripts/search-bugzilla <dag_id> --json             # Structured output
```
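For machine-readable triage, the `--json` output can feed `jq`; a sketch that assumes the output is an array of objects with `id` and `summary` fields (check the script's actual schema first):

```bash
# List bug IDs and summaries (field names here are assumptions)
scripts/search-bugzilla <dag_id> --json | jq -r '.[] | "\(.id)\t\(.summary)"'
```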
To read full context (change history + comments) for a bug ID the script returned, call `mcp__plugin_mozdata_moz__get_bugzilla_bug` with the bug ID. That goes through the Mozilla-hosted MCP server, so no direct network access is needed from the skill.
If a match is found:

- Link to the existing bug: https://bugzilla.mozilla.org/show_bug.cgi?id=<bug_id>
- Note whether it's a known/ongoing issue
- Include any context from the bug summary/comments in your analysis
## GitHub PR Investigation

When investigating a failure, search for recent PRs that may have introduced the issue. Check these repositories:

| Repo | What it contains |
|------|------------------|
| mozilla/bigquery-etl | Query definitions, metadata.yaml, DAG generation |
| mozilla/private-bigquery-etl | Confidential ETL code |
| mozilla/telemetry-airflow | DAGs, operators, GKEPodOperator |
| mozilla/lookml-generator | LookML generation from bigquery-etl |
| mozilla/probe-scraper | Probe/metric definitions scraping |
### How to search

Use the `gh` CLI to find PRs merged around the time of the failure. Focus on files related to the failing DAG/task:

```bash
# Search for recently merged PRs touching a path
gh pr list --repo mozilla/bigquery-etl --state merged --limit 10 \
  --search "merged:>2026-04-10" --json number,title,mergedAt,url

# Search PRs mentioning a DAG or table name (the merge-date filter is --merged-at)
gh search prs --repo mozilla/bigquery-etl --merged-at ">2026-04-10" "<dag_id or table_name>"

# View a specific PR's changed files
gh pr view --repo mozilla/bigquery-etl <PR_NUMBER> --json files,title,body
```
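To inspect a suspect PR's actual changes rather than just its file list, `gh pr diff` prints the full diff; grep can narrow it to the failing task's directory (the path below is illustrative):

```bash
# Full diff of a suspect PR
gh pr diff --repo mozilla/bigquery-etl <PR_NUMBER>

# Only hunks near the failing query's directory (illustrative path)
gh pr diff --repo mozilla/bigquery-etl <PR_NUMBER> | \
  grep -B 2 -A 5 "sql/moz-fx-data-shared-prod/telemetry_derived/"
```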
### What to look for

- PRs merged shortly before the first failure timestamp
- Changes to the failing task's query, metadata, or schema
- Changes to shared dependencies (UDFs, views, upstream tables)
- Changes to DAG definitions or operator configuration
- Infrastructure changes (Helm, Docker, Airflow version bumps)
If a suspect PR is found, include it in the investigation report with a link and a note on which changed files are relevant.
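One way to tie merges to the failure timeline is to compute the merge window from the first failure date; a hypothetical sketch (GNU `date`; on macOS use `date -v-2d`), with the date and repo as placeholders:

```bash
# Given the first failure date (from fetch-task-log --list-runs),
# list PRs merged in the two days leading up to it
fail_date="2026-04-12"
since=$(date -d "$fail_date - 2 days" +%F)
gh pr list --repo mozilla/bigquery-etl --state merged \
  --search "merged:${since}..${fail_date}" \
  --json number,title,mergedAt,url
```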
## Investigation Workflow

If the user provides a DAG name, skip straight to step 2. Only discover failures when no DAG name is given.

1. Run `../airflow-triage/scripts/get-triage-data` to discover failures (skip if the DAG name is already known)
2. Search Bugzilla for existing tickets matching the DAG/task name using `scripts/search-bugzilla`. If a promising bug ID comes back, hydrate it with `mcp__plugin_mozdata_moz__get_bugzilla_bug` for history and comments. Link it and note prior context before continuing.
3. Run `scripts/fetch-task-log <dag_id> --list-runs` to find recent runs
4. Run `scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>` to list tasks in the failing run
5. Run `scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100` to get the error
6. Identify the root cause from the logs
7. Look at the query/script in bigquery-etl or telemetry-airflow
8. Search GitHub for recently merged PRs that may have introduced the issue (use the `gh` CLI; see "GitHub PR Investigation" above)
9. Suggest a fix
## Response Format

When reporting findings, use this structure:

1. Summary
   - DAG name and failure time
   - Key error message quoted from logs
   - Existing Bugzilla bug (if found), with link
2. Links
   - Airflow UI: link to the DAG grid view (fill in the dag_id in the URL template above)
   - Grafana: link to the relevant dashboard if the failure relates to performance/duration
   - Bugzilla: link to the existing bug if one was found, or note that none exists
   - Runbook: link to the Confluence search for the DAG name, or note if a known runbook exists
3. Root Cause Analysis
   - Identify the root cause (SQL error, timeout, OOM, dependency failure, etc.)
   - Link to the relevant source file in bigquery-etl or telemetry-airflow
4. Confidence Assessment

   Rate your confidence and flag what needs human investigation:

   | Level | Meaning | Example |
   |-------|---------|---------|
   | High confidence | Root cause is clear from logs and source code | SQL syntax error with exact line, permission denied on a specific table |
   | Medium confidence | Likely cause identified but could not fully verify | Timeout that could be query performance or upstream delay |
   | Low confidence | Symptoms observed but root cause is unclear | Intermittent failure with no clear error, infra-level issue |
Always be explicit:
- "I'm high confidence this is caused by X because the log shows Y"
- "I'm medium confidence — the error suggests X but I couldn't verify Z. An engineer should check: [specific thing to check]"
- "I'm low confidence on root cause. The logs show X but this could be several things. An engineer should investigate: [list of specific areas]"
5. Suggested Fix or Next Steps
   - Suggest a concrete fix if confident
   - If not confident, list specific things an engineer should investigate (not generic advice; point to exact logs, tables, configs, or services to check)