Senior Data Engineer
Core Capabilities
- Batch Pipeline Orchestration - Design and implement production-ready ETL/ELT pipelines with Airflow, intelligent dependency resolution, retry logic, and comprehensive monitoring
- Real-Time Streaming - Build event-driven streaming pipelines with Kafka, Flink, Kinesis, and Spark Streaming with exactly-once semantics and sub-second latency
- Data Quality Management - Comprehensive batch and streaming data quality validation covering completeness, accuracy, consistency, timeliness, and validity
- Streaming Quality Monitoring - Track consumer lag, data freshness, schema drift, throughput, and dead letter queue rates for streaming pipelines
- Performance Optimization - Analyze and optimize pipeline performance with query optimization, Spark tuning, and cost analysis recommendations
Key Workflows
Workflow 1: Build ETL Pipeline
Time: 2-4 hours
Steps:
- Design the pipeline architecture using the Lambda, Kappa, or Medallion pattern
- Configure the YAML pipeline definition with sources, transformations, and targets (see the sketch after this workflow)
- Generate the Airflow DAG with pipeline_orchestrator.py
- Define data quality validation rules
- Deploy and configure monitoring/alerting
Expected Output: Production-ready ETL pipeline with 99%+ success rate, automated quality checks, and comprehensive monitoring
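The schema that pipeline_orchestrator.py actually accepts is documented in tools.md; purely to illustrate the kind of definition involved, a batch pipeline config might look roughly like this (field names are assumptions, not the tool's real contract):
# pipeline_config.yaml - illustrative sketch only; see tools.md for the supported schema
pipeline:
  name: sales_etl_pipeline
  schedule: "0 * * * *"            # hourly
  sources:
    - name: sales_db
      type: postgresql
      table: public.sales_transactions
      load_mode: incremental       # watermark on updated_at
  transformations:
    - name: clean_sales
      type: sql
      query: sql/clean_sales.sql
  targets:
    - name: warehouse
      type: snowflake
      table: analytics.fct_sales
      write_mode: merge
  quality:
    rules_file: rules/sales_validation.yaml
    fail_below_score: 0.95
  alerts:
    on_failure: ["slack:#data-eng"]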
Workflow 2: Build Real-Time Streaming Pipeline
Time: 3-5 days
Steps:
- Select the streaming architecture (Kappa vs. Lambda) based on requirements
- Configure the streaming pipeline YAML (sources, processing, sinks, quality); see the sketch after this workflow
- Generate Kafka configurations with kafka_config_generator.py
- Generate Flink/Spark job scaffolding with stream_processor.py
- Deploy and monitor with streaming_quality_validator.py
Expected Output: Streaming pipeline processing 10K+ events/sec with P99 latency < 1s, exactly-once delivery, and real-time quality monitoring
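The streaming schema likewise lives in tools.md; the sketch below only shows the shape of such a config (source topic, processing engine, sink, quality thresholds) with invented field names:
# streaming_config.yaml - illustrative sketch only; see tools.md for the supported schema
pipeline:
  name: user_events_stream
  architecture: kappa
source:
  type: kafka
  topic: user-events
  consumer_group: events-processor
  schema_registry: http://schema-registry:8081
processing:
  engine: flink                    # or spark-streaming
  parallelism: 12
  window: { type: tumbling, size: 1m }
  delivery: exactly-once
sink:
  type: kafka
  topic: processed-events
quality:
  max_consumer_lag: 10000
  max_end_to_end_latency_ms: 1000
  max_dlq_rate: 0.01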
World-class data engineering for production-grade data systems, scalable pipelines, and enterprise data platforms.
Overview
This skill provides comprehensive expertise in data engineering fundamentals through advanced production patterns. From designing medallion architectures to implementing real-time streaming pipelines, it covers the full spectrum of modern data engineering including ETL/ELT design, data quality frameworks, pipeline orchestration, and DataOps practices.
What This Skill Provides:
- Production-ready pipeline templates (Airflow, Spark, dbt)
- Comprehensive data quality validation framework
- Performance optimization and cost analysis tools
- Data architecture patterns (Lambda, Kappa, Medallion)
- Complete DataOps CI/CD workflows
Best For:
- Building scalable data pipelines for enterprise systems
- Implementing data quality and governance frameworks
- Optimizing ETL performance and cloud costs
- Designing modern data architectures (lake, warehouse, lakehouse)
- Production ML/AI data infrastructure
Quick Start
Pipeline Orchestration
# Generate Airflow DAG from configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/
# Validate pipeline configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --validate
# Use incremental load template
python scripts/pipeline_orchestrator.py --template incremental --output dags/
Data Quality Validation
# Validate CSV file with quality checks
python scripts/data_quality_validator.py --input data/sales.csv --output report.html
# Validate database table with custom rules
python scripts/data_quality_validator.py \
--connection postgresql://user:pass@host/db \
--table sales_transactions \
--rules rules/sales_validation.yaml \
--threshold 0.95
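The supported rule syntax is described in tools.md; a rules file in this spirit (column names and check keywords are illustrative) declares per-column checks plus an overall pass threshold:
# rules/sales_validation.yaml - illustrative rules; see tools.md for the real syntax
table: sales_transactions
threshold: 0.95                    # minimum overall quality score
rules:
  - column: order_id
    checks: [not_null, unique]
  - column: amount
    checks:
      - not_null
      - range: { min: 0, max: 1000000 }
  - column: order_date
    checks:
      - not_null
      - freshness: { max_lag_hours: 2 }
  - column: customer_email
    checks:
      - matches: "^[^@]+@[^@]+\\.[^@]+$"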
Performance Optimization
# Analyze pipeline performance and get recommendations
python scripts/etl_performance_optimizer.py \
--airflow-db postgresql://host/airflow \
--dag-id sales_etl_pipeline \
--days 30 \
--optimize
# Analyze Spark job performance
python scripts/etl_performance_optimizer.py \
--spark-history-server http://spark-history:18080 \
--app-id app-20250115-001
Real-Time Streaming
# Validate streaming pipeline configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate
# Generate Kafka topic and client configurations
python scripts/kafka_config_generator.py \
--topic user-events \
--partitions 12 \
--replication 3 \
--output kafka/topics/
# Generate exactly-once producer configuration
python scripts/kafka_config_generator.py \
--producer \
--profile exactly-once \
--output kafka/producer.properties
# Generate Flink job scaffolding
python scripts/stream_processor.py \
--config streaming_config.yaml \
--mode flink \
--generate \
--output flink-jobs/
# Monitor streaming quality
python scripts/streaming_quality_validator.py \
--lag --consumer-group events-processor --threshold 10000 \
--freshness --topic processed-events --max-latency-ms 5000 \
--output streaming-health-report.html
Core Workflows
1. Building Production Data Pipelines
Steps:
- Design Architecture: Choose pattern (Lambda, Kappa, Medallion) based on requirements
- Configure Pipeline: Create YAML configuration with sources, transformations, targets
- Generate DAG: python scripts/pipeline_orchestrator.py --config config.yaml
- Add Quality Checks: Define validation rules for data quality
- Deploy & Monitor: Deploy to Airflow, configure alerts, track metrics
Pipeline Patterns: See frameworks.md for Lambda Architecture, Kappa Architecture, Medallion Architecture (Bronze/Silver/Gold), and Microservices Data patterns.
Templates: See templates.md for complete Airflow DAG templates, Spark job templates, dbt models, and Docker configurations.
2. Data Quality Management
Steps:
- Define Rules: Create validation rules covering completeness, accuracy, consistency
- Run Validation: python scripts/data_quality_validator.py --rules rules.yaml
- Review Results: Analyze quality scores and failed checks
- Integrate CI/CD: Add validation to the pipeline deployment process (a CI example follows below)
- Monitor Trends: Track quality scores over time
Quality Framework: See frameworks.md for complete Data Quality Framework covering all dimensions (completeness, accuracy, consistency, timeliness, validity).
Validation Templates: See templates.md for validation configuration examples and Python API usage.
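How the CI/CD hook is wired up depends on the platform; as one possible shape (GitHub Actions shown, with assumed paths and secret names), the validator can run on every pull request and fail the build when the score drops below the threshold:
# .github/workflows/data-quality.yml - illustrative CI gate, not part of this skill's scripts
name: data-quality
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt   # assumes dependencies are listed here
      - name: Run data quality checks
        run: |
          python scripts/data_quality_validator.py \
            --connection "${{ secrets.WAREHOUSE_URL }}" \
            --table sales_transactions \
            --rules rules/sales_validation.yaml \
            --threshold 0.95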
3. Data Modeling & Transformation
Steps:
- Choose Modeling Approach: Dimensional (Kimball), Data Vault 2.0, or One Big Table
- Design Schema: Define fact tables, dimensions, and relationships
- Implement with dbt: Create staging, intermediate, and mart models
- Handle SCD: Implement slowly changing dimension logic (Type 1/2/3)
- Test & Deploy: Run dbt tests, generate documentation, and deploy (a schema test sketch follows below)
Modeling Patterns: See frameworks.md for Dimensional Modeling (Kimball), Data Vault 2.0, One Big Table (OBT), and SCD implementations.
dbt Templates: See templates.md for complete dbt model templates including staging, intermediate, fact tables, and SCD Type 2 logic.
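The full model SQL, including the SCD Type 2 logic, is in templates.md; for the test-and-deploy step, standard dbt schema.yml declarations cover the baseline checks (model and column names here are illustrative):
# models/marts/schema.yml - standard dbt test syntax; model/column names are examples
version: 2
models:
  - name: fct_sales
    description: "Sales fact table at order-line grain"
    columns:
      - name: order_line_id
        tests: [unique, not_null]
      - name: customer_key
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_key
  - name: dim_customer
    description: "SCD Type 2 customer dimension"
    columns:
      - name: customer_key
        tests: [unique, not_null]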
4. Performance Optimization
Steps:
- Profile Pipeline: Run performance analyzer on recent pipeline executions
- Identify Bottlenecks: Review execution time breakdown and slow tasks
- Apply Optimizations: Implement recommendations (partitioning, indexing, batching)
- Tune Spark Jobs: Optimize memory, parallelism, and shuffle settings (example settings below)
- Measure Impact: Compare before/after metrics, track cost savings
Optimization Strategies: See frameworks.md for performance best practices including partitioning strategies, query optimization, and Spark tuning.
Analysis Tools: See tools.md for complete documentation on etl_performance_optimizer.py with query analysis and Spark tuning.
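The property names below are standard Spark settings; the values are only starting points, and in practice they belong in spark-defaults.conf or SparkSession config rather than a YAML file - the mapping is used here purely for readability:
# Common Spark tuning knobs (values are illustrative starting points, not recommendations)
spark.sql.shuffle.partitions: 400              # aim for roughly 128 MB per shuffle partition
spark.sql.adaptive.enabled: true               # AQE coalesces and skew-splits partitions at runtime
spark.executor.memory: 8g
spark.executor.cores: 4
spark.dynamicAllocation.enabled: true
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.sql.files.maxPartitionBytes: 134217728   # 128 MB input splits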
5. Building Real-Time Streaming Pipelines
Steps:
- Architecture Selection: Choose Kappa (streaming-only) or Lambda (batch + streaming) architecture
- Configure Pipeline: Create YAML config with sources, processing engine, sinks, quality thresholds
- Generate Kafka Configs: python scripts/kafka_config_generator.py --topic events --partitions 12
- Generate Job Scaffolding: python scripts/stream_processor.py --mode flink --generate
- Deploy Infrastructure: Use Docker Compose for local development and Kubernetes for production (see the Compose sketch after this section)
- Monitor Quality: python scripts/streaming_quality_validator.py --lag --freshness --throughput
Streaming Patterns: See frameworks.md for stateful processing, stream joins, windowing, exactly-once semantics, and CDC patterns.
Templates: See templates.md for Flink DataStream jobs, Kafka Streams applications, PyFlink templates, and Docker Compose configurations.
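templates.md carries the complete Docker Compose stack (Kafka, Flink, Schema Registry); a stripped-down sketch of the same idea looks roughly like this, with image tags and environment variables to be verified against upstream documentation:
# docker-compose.yaml - minimal local streaming stack (illustrative; see templates.md for the full version)
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  flink-jobmanager:
    image: flink:1.18
    command: jobmanager
    ports: ["8081:8081"]
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: flink-jobmanager
  flink-taskmanager:
    image: flink:1.18
    command: taskmanager
    depends_on: [flink-jobmanager]
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: flink-jobmanager
        taskmanager.numberOfTaskSlots: 4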
Python Tools
pipeline_orchestrator.py
Automated Airflow DAG generation with intelligent dependency resolution and monitoring.
Key Features:
- Generate production-ready DAGs from YAML configuration
- Automatic task dependency resolution
- Built-in retry logic and error handling
- Multi-source support (PostgreSQL, S3, BigQuery, Snowflake)
- Integrated quality checks and alerting
Usage:
# Basic DAG generation
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/
# With validation
python scripts/pipeline_orchestrator.py --config config.yaml --validate
# From template
python scripts/pipeline_orchestrator.py --template incremental --output dags/
Complete Documentation: See tools.md for full configuration options, templates, and integration examples.
data_quality_validator.py
Comprehensive data quality validation framework with automated checks and reporting.
Capabilities:
- Multi-dimensional validation (completeness, accuracy, consistency, timeliness, validity)
- Great Expectations integration
- Custom business rule validation
- HTML/PDF report generation
- Anomaly detection
- Historical trend tracking
Usage:
# Validate with custom rules
python scripts/data_quality_validator.py \
--input data/sales.csv \
--rules rules/sales_validation.yaml \
--output report.html
# Database table validation
python scripts/data_quality_validator.py \
--connection postgresql://host/db \
--table sales_transactions \
--threshold 0.95
Complete Documentation: See tools.md for rule configuration, API usage, and integration patterns.
etl_performance_optimizer.py
Pipeline performance analysis with actionable optimization recommendations.
Capabilities:
- Airflow DAG execution profiling
- Bottleneck detection and analysis
- SQL query optimization suggestions
- Spark job tuning recommendations
- Cost analysis and optimization
- Historical performance trending
Usage:
# Analyze Airflow DAG
python scripts/etl_performance_optimizer.py \
--airflow-db postgresql://host/airflow \
--dag-id sales_etl_pipeline \
--days 30 \
--optimize
# Spark job analysis
python scripts/etl_performance_optimizer.py \
--spark-history-server http://spark-history:18080 \
--app-id app-20250115-001
Complete Documentation: See tools.md for profiling options, optimization strategies, and cost analysis.
stream_processor.py
Streaming pipeline configuration generator and validator for Kafka, Flink, and Kinesis.
Capabilities:
- Multi-platform support (Kafka, Flink, Kinesis, Spark Streaming)
- Configuration validation with best practice checks
- Flink/Spark job scaffolding generation
- Kafka topic configuration generation
- Docker Compose for local streaming stacks
- Exactly-once semantics configuration
Usage:
# Validate configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate
# Generate Kafka configurations
python scripts/stream_processor.py --config streaming_config.yaml --mode kafka --generate
# Generate Flink job scaffolding
python scripts/stream_processor.py --config streaming_config.yaml --mode flink --generate --output flink-jobs/
# Generate Docker Compose for local development
python scripts/stream_processor.py --config streaming_config.yaml --mode docker --generate
Complete Documentation: See tools.md for configuration format, validation checks, and generated outputs.
streaming_quality_validator.py
Real-time streaming data quality monitoring with comprehensive health scoring.
Capabilities:
- Consumer lag monitoring with thresholds
- Data freshness validation (P50/P95/P99 latency)
- Schema drift detection
- Throughput analysis (events/sec, bytes/sec)
- Dead letter queue rate monitoring
- Overall quality scoring with recommendations
- Prometheus metrics export
Usage:
# Monitor consumer lag
python scripts/streaming_quality_validator.py \
--lag --consumer-group events-processor --threshold 10000
# Monitor data freshness
python scripts/streaming_quality_validator.py \
--freshness --topic processed-events --max-latency-ms 5000
# Full quality validation
python scripts/streaming_quality_validator.py \
--lag --freshness --throughput --dlq \
--output streaming-health-report.html
Complete Documentation: See tools.md for all monitoring dimensions and integration patterns.
kafka_config_generator.py
Production-grade Kafka configuration generator with performance and security profiles.
Capabilities:
- Topic configuration (partitions, replication, retention, compaction)
- Producer profiles (high-throughput, exactly-once, low-latency, ordered)
- Consumer profiles (exactly-once, high-throughput, batch)
- Kafka Streams configuration with state store tuning
- Security configuration (SASL/PLAIN, SASL/SCRAM, mTLS)
- Kafka Connect source/sink configurations
- Multiple output formats (properties, YAML, JSON)
Usage:
# Generate topic configuration
python scripts/kafka_config_generator.py \
--topic user-events --partitions 12 --replication 3 --retention-hours 168
# Generate exactly-once producer
python scripts/kafka_config_generator.py \
--producer --profile exactly-once --transactional-id producer-001
# Generate Kafka Streams config
python scripts/kafka_config_generator.py \
--streams --application-id events-processor --exactly-once
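The generator's exact output is not reproduced here, but an exactly-once producer profile necessarily centers on the standard Kafka client settings below (shown as a YAML mapping; the tool also emits .properties and JSON):
# Core Kafka settings behind an exactly-once profile (standard client configs; values illustrative)
producer:
  enable.idempotence: true                     # broker de-duplicates retried batches
  acks: all                                    # wait for all in-sync replicas
  transactional.id: producer-001               # enables atomic writes across partitions
  max.in.flight.requests.per.connection: 5     # must stay <= 5 to preserve ordering with idempotence
consumer:
  isolation.level: read_committed              # skip records from aborted transactions
  enable.auto.commit: false                    # commit offsets inside the transaction instead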
Complete Documentation: See tools.md for all profiles, security options, and Connect configurations.
Reference Documentation
Frameworks (frameworks.md)
Comprehensive data engineering frameworks and patterns:
- Architecture Patterns: Lambda, Kappa, Medallion, Microservices data architecture
- Data Modeling: Dimensional (Kimball), Data Vault 2.0, One Big Table
- ETL/ELT Patterns: Full load, incremental load, CDC, SCD, idempotent pipelines
- Data Quality: Complete framework covering all quality dimensions
- DataOps: CI/CD for data pipelines, testing strategies, monitoring
- Orchestration: Airflow DAG patterns, backfill strategies
- Real-Time Streaming: Stateful processing, stream joins, windowing strategies, exactly-once semantics, event time processing, watermarks, backpressure, Apache Flink patterns, AWS Kinesis patterns, CDC for streaming
- Governance: Data catalog, lineage tracking, access control
Templates (templates.md)
Production-ready code templates and examples:
- Airflow DAGs: Complete ETL DAG, incremental load, dynamic task generation
- Spark Jobs: Batch processing, streaming, optimized configurations
- dbt Models: Staging, intermediate, fact tables, dimensions with SCD Type 2
- SQL Patterns: Incremental merge (upsert), deduplication, date spine, window functions
- Python Pipelines: Data quality validation class, retry decorators, error handling
- Real-Time Streaming: Apache Flink DataStream jobs (Java), Kafka Streams applications, PyFlink jobs, AWS Kinesis consumers, Docker Compose for streaming stack
- Kafka Configs: Producer/consumer properties templates, topic configurations, security configurations
- Docker: Dockerfiles for data pipelines, Docker Compose for local development including streaming stack (Kafka, Flink, Schema Registry)
- Configuration: dbt project config, Spark configuration, Airflow variables, streaming pipeline YAML
- Testing: pytest fixtures, integration tests, data quality tests
Tools (tools.md)
Python automation tool documentation:
- pipeline_orchestrator.py: Complete usage guide, configuration format, DAG templates
- data_quality_validator.py: Validation rules, dimension checks, Great Expectations integration
- etl_performance_optimizer.py: Performance analysis, query optimization, Spark tuning
- stream_processor.py: Streaming pipeline configuration, validation, job scaffolding generation
- streaming_quality_validator.py: Consumer lag, data freshness, schema drift, throughput monitoring
- kafka_config_generator.py: Topic, producer, consumer, Kafka Streams, and Connect configurations
- Integration Patterns: Airflow, dbt, CI/CD, monitoring systems, Prometheus
- Best Practices: Configuration management, error handling, performance, monitoring, streaming quality
Tech Stack
Core Technologies:
- Languages: Python 3.8+, SQL, Scala (Spark), Java (Flink)
- Orchestration: Apache Airflow, Prefect, Dagster
- Batch Processing: Apache Spark, dbt, Pandas
- Stream Processing: Apache Kafka, Apache Flink, Kafka Streams, Spark Structured Streaming, AWS Kinesis
- Storage: PostgreSQL, BigQuery, Snowflake, Redshift, S3, GCS
- Schema Management: Confluent Schema Registry, AWS Glue Schema Registry
- Containerization: Docker, Kubernetes
- Monitoring: Datadog, Prometheus, Grafana, Kafka UI
Data Platforms:
- Cloud Data Warehouses: Snowflake, BigQuery, Redshift
- Data Lakes: Delta Lake, Apache Iceberg, Apache Hudi
- Streaming Platforms: Apache Kafka, AWS Kinesis, Google Pub/Sub, Azure Event Hubs
- Stream Processing Engines: Apache Flink, Kafka Streams, Spark Structured Streaming
- Workflow: Airflow, Prefect, Dagster
Integration Points
This skill integrates with:
- Orchestration: Airflow, Prefect, Dagster for workflow management
- Transformation: dbt for SQL transformations and testing
- Quality: Great Expectations for data validation
- Monitoring: Datadog, Prometheus for pipeline monitoring
- BI Tools: Looker, Tableau, Power BI for analytics
- ML Platforms: MLflow, Kubeflow for ML pipeline integration
- Version Control: Git for pipeline code and configuration
See tools.md for detailed integration patterns and examples.
Best Practices
Pipeline Design:
- Idempotent operations for safe reruns
- Incremental processing where possible
- Clear data lineage and documentation
- Comprehensive error handling
- Automated recovery mechanisms
Data Quality:
- Define quality rules early
- Validate at every pipeline stage
- Automate quality monitoring
- Track quality trends over time
- Block bad data from downstream
Performance:
- Partition large tables by date/region
- Use columnar formats (Parquet, ORC)
- Leverage predicate pushdown
- Optimize for your query patterns
- Monitor and tune regularly
Operations:
- Version control everything
- Automate testing and deployment
- Implement comprehensive monitoring
- Document runbooks for incidents
- Regular performance reviews
Performance Targets
Batch Pipeline Execution:
- P50 latency: < 5 minutes (hourly pipelines)
- P95 latency: < 15 minutes
- Success rate: > 99%
- Data freshness: < 1 hour behind source
Streaming Pipeline Execution:
- Throughput: 10K+ events/second sustained
- End-to-end latency: P99 < 1 second
- Consumer lag: < 10K records behind
- Exactly-once delivery: Zero duplicates or losses
Data Quality (Batch):
- Quality score: > 95%
- Completeness: > 99%
- Timeliness: < 2 hours data lag
- Zero critical failures
Streaming Quality:
- Data freshness: P95 < 5 minutes from event generation
- Late data rate: < 5% outside watermark window
- Dead letter queue rate: < 1%
- Schema compatibility: 100% backward/forward compatible changes
Cost Efficiency:
- Cost per GB processed: < $0.10
- Cloud cost trend: Stable or decreasing
- Resource utilization: > 70%
Resources
- Frameworks Guide: references/frameworks.md
- Code Templates: references/templates.md
- Tool Documentation: references/tools.md
- Python Scripts: scripts/ directory
Version: 2.0.0
Last Updated: December 16, 2025
Documentation Structure: Progressive disclosure with comprehensive references
Streaming Enhancement: Task #8 - Real-time streaming capabilities added