Agent Skills: Airflow DAG Failure Investigation

ID: akkomar/mozdata-claude-plugin/airflow-debugging

Install this agent skill locally:

pnpm dlx add-skill https://github.com/akkomar/mozdata-claude-plugin/tree/HEAD/skills/airflow-debugging

skills/airflow-debugging/SKILL.md

Skill Metadata

Name: airflow-debugging

Airflow DAG Failure Investigation

You help users investigate and debug Mozilla Airflow DAG failures by fetching logs, identifying root causes, and suggesting fixes.

Helper Scripts

Discovering failures

Use get-triage-data from the airflow-triage skill to discover failures:

../airflow-triage/scripts/get-triage-data              # Last 24 hours
../airflow-triage/scripts/get-triage-data --since 3d   # Last 3 days

fetch-task-log

Fetch and explore task logs from GCS (gs://airflow-remote-logs-prod-prod).

# List recent runs for a DAG
scripts/fetch-task-log <dag_id> --list-runs

# List tasks in a specific run
scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>

# Fetch a task log
scripts/fetch-task-log <dag_id> <task_id> <run_id>

# Fetch only the last N lines
scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100
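To surface error lines in a long log, the output can be piped through grep. This is a sketch that assumes fetch-task-log writes the log to stdout; the DAG, task, and run IDs below are examples:

# Pull likely error lines out of a long task log (IDs are examples)
scripts/fetch-task-log bqetl_main_summary telemetry_derived__clients_daily__v6 \
  scheduled__2026-04-11T02:00:00+00:00 | grep -iE "error|exception|traceback"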

Related Repositories

When investigating failures, check these repos (all checked out locally):

  • bigquery-etl - Query definitions, metadata.yaml, DAG generation
  • private-bigquery-etl - Confidential ETL code
  • telemetry-airflow - DAGs, operators, GKEPodOperator
  • dataservices-infra - Infrastructure (GKE, Helm, logging config)

Where DAGs Are Defined

Most DAGs are auto-generated from bigquery-etl. The task ID tells you where to find the source.

Task ID Pattern: <dataset>__<table>__<version>

Example task ID: telemetry_derived__clients_daily__v6

Source query location:

bigquery-etl/sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
├── query.sql          # The SQL query
├── metadata.yaml      # Scheduling config, owner, tags
└── schema.yaml        # Table schema

For the example above:

bigquery-etl/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/
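The mapping from task ID to source directory can be scripted. A minimal shell sketch, assuming the <dataset>__<table>__<version> convention holds for the task at hand:

# Derive the bigquery-etl source directory from a task ID
# (assumes the <dataset>__<table>__<version> naming convention)
task_id="telemetry_derived__clients_daily__v6"
dataset="${task_id%%__*}"      # telemetry_derived
rest="${task_id#*__}"          # clients_daily__v6
table="${rest%%__*}"           # clients_daily
version="${rest##*__}"         # v6
echo "bigquery-etl/sql/moz-fx-data-shared-prod/${dataset}/${table}_${version}/"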

DAG ID Pattern: bqetl_<name>

DAGs starting with bqetl_ are auto-generated. The DAG configuration is in bigquery-etl/dags.yaml.
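To inspect a bqetl DAG's configuration (schedule, owner, retries), grep its entry out of dags.yaml. A sketch: the DAG name is an example, and the entry may span more lines than shown, so adjust -A as needed:

# Show the dags.yaml entry for a DAG (DAG name is an example)
grep -A 10 "bqetl_main_summary:" bigquery-etl/dags.yaml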

Non-bqetl DAGs

DAGs not starting with bqetl_ are manually defined in:

telemetry-airflow/dags/<dag_name>.py
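If the filename doesn't match the DAG ID, a simple grep can locate the defining file, assuming the DAG ID appears literally in the source:

# Find which file defines a DAG (assumes the dag_id string appears literally)
grep -rln "<dag_id>" telemetry-airflow/dags/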

Private/Confidential DAGs

Some DAGs are in private-bigquery-etl with the same structure:

private-bigquery-etl/sql/<project>/<dataset>/<table>/

GCP Projects & Namespaces

Airflow runs across two GCP projects:

| Project | Purpose | Namespace |
|---------|---------|-----------|
| moz-fx-dataservices-high-prod | Airflow workers, scheduler | telemetry-airflow-prod |
| moz-fx-data-airflow-gke-prod | GKEPodOperator jobs (queries, scripts) | default |

Cloud Logging (Fallback)

Start with GCS logs via fetch-task-log. Fall back to Cloud Logging if you suspect infrastructure issues or if GCS logs are missing/incomplete.

| Aspect | GCS (fetch-task-log) | Cloud Logging |
|--------|----------------------|---------------|
| Content | Complete Airflow task logs (same as UI) | Raw container stdout/stderr |
| Retention | 360 days | 30 days |
| Best for | Task failures (SQL errors, exceptions) | Pod-level issues (OOM kills, scheduling failures) |

Airflow scheduler/worker logs:

gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="telemetry-airflow-prod" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-dataservices-high-prod \
  --limit=200

GKEPodOperator job logs (query execution errors):

gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-data-airflow-gke-prod \
  --limit=200
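When the task ID is known, filtering on the pod name can be tighter than grepping the text payload. This sketch assumes GKEPodOperator pod names embed the task ID, which may not hold for every operator:

gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND resource.labels.pod_name=~"<task_id>"' \
  --project=moz-fx-data-airflow-gke-prod \
  --limit=200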

Useful Links

  • Airflow UI: https://workflow.telemetry.mozilla.org/home
    • DAG detail: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid
    • Task logs: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid?dag_run_id=<run_id>&task_id=<task_id>
  • Grafana dashboards:
    • Airflow overview: https://earthangel-b40313e5.influxcloud.net/d/airflow-overview
    • Task duration: https://earthangel-b40313e5.influxcloud.net/d/airflow-task-duration
  • Bugzilla (airflow-triage bugs): buglist
  • Runbooks: Search Confluence for <dag_id> or the pipeline area (e.g. "main_summary runbook"). If a Confluence MCP tool is available, use it to search directly. Otherwise, suggest the user check:
    • https://mozilla-hub.atlassian.net/wiki/search?text=<dag_id>
    • https://mozilla-hub.atlassian.net/wiki/search?text=airflow+runbook

Bugzilla Search

Before investigating, search Bugzilla for existing tickets related to the failing DAG. This avoids duplicate work and surfaces prior context.

Use scripts/search-bugzilla — it is pinned to bugzilla.mozilla.org and defaults to the [airflow-triage] whiteboard filter:

scripts/search-bugzilla <dag_id>                        # Open [airflow-triage] bugs matching DAG name
scripts/search-bugzilla <dag_id> --status resolved      # Recently resolved matches
scripts/search-bugzilla <dag_id> --status all           # Both
scripts/search-bugzilla "<task_id>" --limit 10          # Narrow by task name
scripts/search-bugzilla --all "<keywords>"              # Drop the whiteboard filter
scripts/search-bugzilla <dag_id> --json                 # Structured output

To read full context (change history + comments) for a bug ID the script returned, call mcp__plugin_mozdata_moz__get_bugzilla_bug with the bug ID. That goes through the Mozilla-hosted MCP server — no direct network access needed from the skill.

If a match is found:

  • Link to the existing bug: https://bugzilla.mozilla.org/show_bug.cgi?id=<bug_id>
  • Note whether it's a known/ongoing issue
  • Include any context from the bug summary/comments in your analysis

GitHub PR Investigation

When investigating a failure, search for recent PRs that may have introduced the issue. Check these repositories:

| Repo | What it contains |
|------|------------------|
| mozilla/bigquery-etl | Query definitions, metadata.yaml, DAG generation |
| mozilla/private-bigquery-etl | Confidential ETL code |
| mozilla/telemetry-airflow | DAGs, operators, GKEPodOperator |
| mozilla/lookml-generator | LookML generation from bigquery-etl |
| mozilla/probe-scraper | Probe/metric definitions scraping |

How to search

Use the gh CLI to find PRs merged around the time of the failure. Focus on files related to the failing DAG/task:

# Search for recently merged PRs touching a path
gh pr list --repo mozilla/bigquery-etl --state merged --limit 10 \
  --search "merged:>2026-04-10" --json number,title,mergedAt,url

# Search PRs mentioning a DAG or table name
gh search prs --repo mozilla/bigquery-etl --merged --merged-at ">2026-04-10" "<dag_id or table_name>"

# View a specific PR's changed files
gh pr view --repo mozilla/bigquery-etl <PR_NUMBER> --json files,title,body

What to look for

  • PRs merged shortly before the first failure timestamp
  • Changes to the failing task's query, metadata, or schema
  • Changes to shared dependencies (UDFs, views, upstream tables)
  • Changes to DAG definitions or operator configuration
  • Infrastructure changes (Helm, Docker, Airflow version bumps)

If a suspect PR is found, include it in the investigation report with a link and a note on which changed files are relevant.
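For example, to bound the search to the window just before the first failure (the table name and dates below are hypothetical):

# PRs mentioning the failing table, merged in the two days before the first failure
gh search prs --repo mozilla/bigquery-etl --merged \
  --merged-at "2026-04-10..2026-04-12" \
  "clients_daily" --json number,title,url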

Investigation Workflow

If the user provides a DAG name, skip straight to step 2. Only discover failures when no DAG name is given.

  1. Run ../airflow-triage/scripts/get-triage-data to discover failures (skip if DAG name is already known)
  2. Search Bugzilla for existing tickets matching the DAG/task name using scripts/search-bugzilla. If a promising bug ID comes back, hydrate it with mcp__plugin_mozdata_moz__get_bugzilla_bug for history and comments. Link it and note prior context before continuing.
  3. Run scripts/fetch-task-log <dag_id> --list-runs to find recent runs
  4. Run scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id> to list tasks in the failing run
  5. Run scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100 to get the error
  6. Identify root cause from the logs
  7. Look at the query/script in bigquery-etl or telemetry-airflow
  8. Search GitHub for recently merged PRs that may have introduced the issue (use gh CLI — see "GitHub PR Investigation" above)
  9. Suggest a fix
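Chained together, steps 3-5 might look like this (DAG, task, and run IDs are hypothetical):

# List runs, list tasks in the failing run, then tail the failing task's log
scripts/fetch-task-log bqetl_main_summary --list-runs
scripts/fetch-task-log bqetl_main_summary --list-tasks --run-id scheduled__2026-04-11T02:00:00+00:00
scripts/fetch-task-log bqetl_main_summary telemetry_derived__clients_daily__v6 \
  scheduled__2026-04-11T02:00:00+00:00 --tail 100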

Response Format

When reporting findings, use this structure:

1. Summary

  • DAG name and failure time
  • Key error message quoted from logs
  • Existing Bugzilla bug (if found), with link

2. Links

  • Airflow UI: link to the DAG grid view (fill in the dag_id in the URL template above)
  • Grafana: link to relevant dashboard if the failure relates to performance/duration
  • Bugzilla: link to existing bug if one was found, or note that none exists
  • Runbook: link to Confluence search for the DAG name, or note if a known runbook exists

3. Root Cause Analysis

  • Identify the root cause (SQL error, timeout, OOM, dependency failure, etc.)
  • Link to the relevant source file in bigquery-etl or telemetry-airflow

4. Confidence Assessment

Rate your confidence and flag what needs human investigation:

| Level | Meaning | Example |
|-------|---------|---------|
| High confidence | Root cause is clear from logs and source code | SQL syntax error with exact line, permission denied on a specific table |
| Medium confidence | Likely cause identified but could not fully verify | Timeout that could be query performance or upstream delay |
| Low confidence | Symptoms observed but root cause is unclear | Intermittent failure with no clear error, infra-level issue |

Always be explicit:

  • "I'm high confidence this is caused by X because the log shows Y"
  • "I'm medium confidence — the error suggests X but I couldn't verify Z. An engineer should check: [specific thing to check]"
  • "I'm low confidence on root cause. The logs show X but this could be several things. An engineer should investigate: [list of specific areas]"

5. Suggested Fix or Next Steps

  • Suggest a concrete fix if confident
  • If not confident, list specific things an engineer should investigate (not generic advice — point to exact logs, tables, configs, or services to check)