Glue Diagnostics Skill | Agent Skills

Glue Diagnostics

When to use

Any AWS Glue investigation where the console alone is insufficient — job failures, OOM errors, Spark crashes, crawler schema misdetection, connection timeouts, Data Catalog drift, DPU under/over-provisioning, data skew, bookmark corruption, or Glue Studio generation errors.

Investigation workflow

Step 1 — Collect and triage

aws glue get-job --name <job-name>
aws glue get-job-run --job-name <job-name> --run-id <run-id>
aws glue batch-get-jobs --job-names <job1> <job2>
aws glue get-crawler --name <crawler-name>
aws glue get-connection --name <connection-name>
aws logs filter-log-events --log-group-name /aws-glue/jobs/logs-v2 --log-stream-name-prefix <run-id>

Step 2 — Domain deep dive

aws glue get-job-runs --job-name <job-name> --max-results 10
aws glue get-crawler-metrics --crawler-name-list <crawler-name>
aws glue get-databases
aws glue get-tables --database-name <db-name>
aws glue get-partitions --database-name <db-name> --table-name <table-name>
aws glue get-job-bookmark --job-name <job-name>
aws cloudwatch get-metric-statistics --namespace Glue --metric-name glue.driver.aggregate.bytesRead --dimensions Name=JobName,Value=<job-name> --start-time <iso> --end-time <iso> --period 300 --statistics Sum

Read references/glue-guardrails.md before concluding on any Glue issue.

Tool quick reference

| Tool / API | When to use | |------------|-------------| | glue get-job | Job configuration, Glue version, DPU, worker type | | glue get-job-run | Specific run status, error message, execution time | | glue batch-get-jobs | Retrieve multiple job configs at once | | glue get-job-runs | Job run history, failure patterns | | glue get-crawler | Crawler config, targets, schedule, schema change policy | | glue get-crawler-metrics | Crawler runtime stats, tables created/updated | | glue get-connection | JDBC/network connection config, VPC, subnet | | glue get-databases / get-tables | Data Catalog metadata, schema definitions | | glue get-partitions | Partition metadata, partition keys, storage location | | glue get-job-bookmark | Bookmark state for incremental processing | | logs filter-log-events | Glue job CloudWatch logs for Spark errors | | cloudwatch get-metric-statistics | Glue job metrics (bytes read/written, DPU usage) |

Gotchas: AWS Glue

DPU sizing matters: G.1X (1 DPU per worker, 16 GB memory), G.2X (2 DPU, 32 GB), G.4X (4 DPU, 64 GB), G.8X (8 DPU, 128 GB). Under-provisioning causes OOM; over-provisioning wastes cost.
Spark executor OOM vs driver OOM: executor OOM means data partitions are too large (repartition or increase worker type). Driver OOM means too much data collected to the driver (avoid collect(), reduce broadcast join size).
Job bookmarks track processed data for incremental loads. Bookmarks only work with S3 sources using job.init()/job.commit(). Resetting bookmarks reprocesses all data.
Crawler schema evolution: crawlers can add new columns but may not handle type changes gracefully. Schema change policy (UPDATE_IN_DATABASE vs LOG) controls behavior.
Glue connections for JDBC require VPC, subnet, and security group configuration. The subnet must have a NAT gateway or VPC endpoints for Glue service access.
Glue Data Catalog vs Hive metastore: Glue Data Catalog is the default metastore for Glue jobs. External Hive metastore requires explicit configuration and network connectivity.
Glue Studio visual editor has limitations: complex transformations may require custom code nodes. Not all PySpark/Scala operations are available as visual transforms.
Spark UI is available for Glue 2.0+ jobs via the Glue console. It provides DAG visualization, stage details, and executor metrics for debugging performance issues.
Job timeout defaults to 48 hours (2880 minutes). Long-running jobs may silently consume DPUs. Always set an explicit timeout.
Glue version compatibility: Glue 2.0 (Spark 2.4), Glue 3.0 (Spark 3.1), Glue 4.0 (Spark 3.3). Library availability and behavior differ across versions.
Partition management: too many small partitions cause excessive S3 LIST calls. Too few large partitions cause OOM. Aim for 128 MB–512 MB per partition.
S3 eventual consistency impact: S3 provides strong read-after-write consistency since December 2020, but Glue Data Catalog partition metadata updates may still lag behind S3 changes.

Worker type comparison

| Worker Type | DPU | Memory | vCPU | Use Case | |-------------|-----|--------|------|----------| | G.1X | 1 | 16 GB | 4 | Standard ETL, small-medium datasets | | G.2X | 2 | 32 GB | 8 | Memory-intensive transforms, large joins | | G.4X | 4 | 64 GB | 16 | ML transforms, very large datasets | | G.8X | 8 | 128 GB | 32 | Massive datasets, complex aggregations | | G.025X | 0.25 | 2 GB | 2 | Python shell jobs only | | Z.2X | 2 | 32 GB | 8 | Ray jobs (Glue 4.0+) |

Glue version comparison

| Version | Spark | Python | Key Features | |---------|-------|--------|-------------| | Glue 2.0 | 2.4 | 3.7 | Spark UI, no startup overhead | | Glue 3.0 | 3.1 | 3.7 | Optimized shuffle, auto-scaling | | Glue 4.0 | 3.3 | 3.10 | Ray support, Python 3.10, improved performance |

Anti-hallucination rules

Always cite specific job run error messages, crawler metrics, or CloudWatch log entries as evidence.
Never assume OOM is always executor-side. Check whether the error is on the driver or executor — the fix is different.
Job bookmarks only work with supported sources (S3, JDBC) and require job.init()/job.commit() calls. Never claim bookmarks work automatically with all sources.
Crawler schema changes depend on the SchemaChangePolicy. Never assume crawlers automatically update table schemas.
Glue connections require VPC networking. Never suggest JDBC connections work without proper VPC, subnet, and security group configuration.
Spend no more than 2 minutes on any single hypothesis. Pivot if inconclusive.

28 runbooks

| Category | IDs | Covers | |----------|-----|--------| | A — Jobs | A1–A4 | Job failures, timeout, OOM, Spark errors | | B — Crawlers | B1–B3 | Crawler failures, schema detection, partition issues | | C — Connections | C1–C3 | JDBC connection failures, VPC/subnet, S3 endpoint | | D — Data Catalog | D1–D2 | Catalog sync issues, schema evolution | | E — Performance | E1–E3 | DPU sizing, shuffle issues, data skew | | F — ETL | F1–F3 | Transformation errors, bookmark issues, data quality | | G — Security | G1–G2 | IAM permissions, encryption | | H — Glue Studio | H1–H2 | Visual editor errors, job generation | | Z — Catch-All | Z1 | General troubleshooting |

Agent Skills: Glue Diagnostics

Install this agent skill to your local

Skill Files