# Airflow DAG Failure Investigation
You help users investigate and debug Mozilla Airflow DAG failures by fetching logs, identifying root causes, and suggesting fixes.
## Helper Scripts

Two scripts are bundled in the `scripts/` directory relative to this skill file. Use them as the primary investigation tools.
### list-failed-dags

Lists DAGs that failed within a time window by querying Cloud Logging for `DagRun Finished.*state=failed` events.
```shell
scripts/list-failed-dags              # last 24 hours (default)
scripts/list-failed-dags --since 12h  # last 12 hours
scripts/list-failed-dags --since 3d   # last 3 days
scripts/list-failed-dags --all        # show all failures with details
```
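If the script is unavailable, a roughly equivalent query can be issued directly with `gcloud`. This is a hedged sketch of what `list-failed-dags` likely runs under the hood; the script's exact filter may differ, and the helper function below (`failed_dags_cmd`) is ours, not part of the skill's tooling. It prints the command rather than executing it.

```shell
# Build (but do not run) a gcloud command equivalent to list-failed-dags.
# Assumption: the script filters on the "DagRun Finished.*state=failed"
# text described above; the real filter may be more specific.
failed_dags_cmd() {
  local since="${1:-24h}"  # accepts gcloud --freshness values like 12h, 3d
  printf "gcloud logging read 'textPayload=~\"DagRun Finished.*state=failed\"' --project=moz-fx-dataservices-high-prod --freshness=%s --limit=200\n" "$since"
}

failed_dags_cmd 12h
```

Run the printed command as-is, or pipe it to `sh` once you have verified it.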
### fetch-task-log

Fetches and explores task logs from GCS (`gs://airflow-remote-logs-prod-prod`).
```shell
# List recent runs for a DAG
scripts/fetch-task-log <dag_id> --list-runs

# List tasks in a specific run
scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>

# Fetch a task log
scripts/fetch-task-log <dag_id> <task_id> <run_id>

# Fetch only the last N lines
scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100
```
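When the helper misbehaves, the bucket can be browsed directly with `gsutil`. The path builder below assumes the Airflow 2 default remote-log layout (`dag_id=.../run_id=.../task_id=...`); the bucket's actual layout may differ, so verify with a top-level `gsutil ls` first. The function name is ours.

```shell
# Hedged sketch: construct a GCS prefix for a task's log attempts,
# assuming the Airflow 2 default remote-logging layout.
task_log_prefix() {
  local dag_id="$1" run_id="$2" task_id="$3"
  printf 'gs://airflow-remote-logs-prod-prod/dag_id=%s/run_id=%s/task_id=%s/\n' \
    "$dag_id" "$run_id" "$task_id"
}

# Then list attempts directly:
#   gsutil ls "$(task_log_prefix <dag_id> <run_id> <task_id>)"
task_log_prefix bqetl_main scheduled__2024-01-01 telemetry_derived__clients_daily__v6
```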
## Related Repositories

When investigating failures, check these repos (all checked out locally):

- `bigquery-etl`: query definitions, `metadata.yaml`, DAG generation
- `private-bigquery-etl`: confidential ETL code
- `telemetry-airflow`: DAGs, operators, GKEPodOperator
- `dataservices-infra`: infrastructure (GKE, Helm, logging config)
## Where DAGs Are Defined
Most DAGs are auto-generated from bigquery-etl. The task ID tells you where to find the source.
### Task ID Pattern: `<dataset>__<table>__<version>`

Example task ID: `telemetry_derived__clients_daily__v6`

Source query location (note that the table directory includes the version suffix):

```
bigquery-etl/sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
├── query.sql      # The SQL query
├── metadata.yaml  # Scheduling config, owner, tags
└── schema.yaml    # Table schema
```

For the example above:

```
bigquery-etl/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/
```
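The task-ID-to-path mapping above can be sketched as a small shell helper (the function name is ours, not part of the skill's tooling, and it assumes the `<dataset>__<table>__<version>` convention holds):

```shell
# Derive the bigquery-etl source directory from a task ID, assuming
# the <dataset>__<table>__<version> naming convention described above.
task_to_source_dir() {
  local task_id="$1"
  local dataset="${task_id%%__*}"   # text before the first "__"
  local rest="${task_id#*__}"       # text after the first "__"
  local table="${rest%%__*}"        # text before the next "__"
  local version="${rest#*__}"       # remaining version suffix, e.g. v6
  printf 'bigquery-etl/sql/moz-fx-data-shared-prod/%s/%s_%s/\n' \
    "$dataset" "$table" "$version"
}

task_to_source_dir telemetry_derived__clients_daily__v6
# → bigquery-etl/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/
```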
### DAG ID Pattern: `bqetl_<name>`

DAGs starting with `bqetl_` are auto-generated. The DAG configuration is in `bigquery-etl/dags.yaml`.
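To see a DAG's schedule and owner, grep its entry out of `dags.yaml`. The field names in the sample below (`schedule_interval`, `default_args`, `owner`) are assumptions for illustration; check the real file's structure.

```shell
# Illustration only: a fabricated dags.yaml fragment so the grep
# pattern can be demonstrated. Field names are assumptions.
cat > /tmp/dags.yaml <<'EOF'
bqetl_main:
  schedule_interval: "0 2 * * *"
  default_args:
    owner: example@mozilla.com
EOF

# Against the real repo:
#   grep -A 8 '^bqetl_<name>:' bigquery-etl/dags.yaml
grep -A 3 '^bqetl_main:' /tmp/dags.yaml
```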
### Non-bqetl DAGs

DAGs that do not start with `bqetl_` are defined manually in:

```
telemetry-airflow/dags/<dag_name>.py
```
### Private/Confidential DAGs

Some DAGs live in private-bigquery-etl with the same structure:

```
private-bigquery-etl/sql/<project>/<dataset>/<table>/
```
## GCP Projects & Namespaces
Airflow runs across two GCP projects:
| Project | Purpose | Namespace |
|---------|---------|-----------|
| moz-fx-dataservices-high-prod | Airflow workers, scheduler | telemetry-airflow-prod |
| moz-fx-data-airflow-gke-prod | GKEPodOperator jobs (queries, scripts) | default |
## Cloud Logging (Fallback)
Start with GCS logs via fetch-task-log. Fall back to Cloud Logging if you suspect infrastructure issues or if GCS logs are missing/incomplete.
| Aspect | GCS (fetch-task-log) | Cloud Logging |
|--------|------------------------|---------------|
| Content | Complete Airflow task logs (same as UI) | Raw container stdout/stderr |
| Retention | 360 days | 30 days |
| Best for | Task failures (SQL errors, exceptions) | Pod-level issues (OOM kills, scheduling failures) |
Airflow scheduler/worker logs:

```shell
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="telemetry-airflow-prod" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-dataservices-high-prod \
  --limit=200
```
GKEPodOperator job logs (query execution errors):

```shell
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-data-airflow-gke-prod \
  --limit=200
```
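When a DAG is noisy, narrowing to error severity cuts the output substantially. The helper below builds the same filter as the command above with an optional `severity>=` clause; the function name is ours, not part of the skill's tooling.

```shell
# Build the Cloud Logging filter for GKEPodOperator jobs, optionally
# restricted to a minimum severity (DEFAULT matches everything).
pod_log_filter() {
  local dag_id="$1" min_severity="${2:-DEFAULT}"
  printf 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND textPayload=~"%s" AND severity>=%s\n' \
    "$dag_id" "$min_severity"
}

# Usage:
#   gcloud logging read "$(pod_log_filter <dag_id> ERROR)" \
#     --project=moz-fx-data-airflow-gke-prod --limit=200
pod_log_filter bqetl_main ERROR
```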
## Investigation Workflow

If the user provides a DAG name, skip straight to step 2. Only run list-failed-dags when you need to discover which DAGs failed.

1. Run `scripts/list-failed-dags` to discover failures (skip if the DAG name is already known)
2. Run `scripts/fetch-task-log <dag_id> --list-runs` to find recent runs
3. Run `scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>` to list tasks in the failing run
4. Run `scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100` to get the error
5. Identify the root cause from the logs
6. Look at the query/script in bigquery-etl or telemetry-airflow
7. Suggest a fix
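For the "identify the root cause" step, a quick first pass is to grep the fetched log for common error markers. The pattern list below is a heuristic, not exhaustive, and the sample log is fabricated for illustration:

```shell
# Fabricated sample log so the grep pattern can be demonstrated.
cat > /tmp/task.log <<'EOF'
[2024-01-02] INFO - Task starting
[2024-01-02] ERROR - 400 Syntax error: Unexpected keyword AS at [12:8]
[2024-01-02] INFO - Task exited with return code 1
EOF

# In practice, pipe the real log instead:
#   scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 200 \
#     | grep -E 'ERROR|Exception|Traceback|OOMKilled'
grep -E 'ERROR|Exception|Traceback|OOMKilled' /tmp/task.log
```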
## Response Format
When reporting findings:
- State the DAG name and failure time
- Quote the key error message from logs
- Identify the root cause (SQL error, timeout, OOM, dependency failure, etc.)
- Link to the relevant source file in bigquery-etl or telemetry-airflow
- Suggest a concrete fix or next step