Data Engineering Skill
Quick Reference
| Role | Focus | Timeline | Entry From | |------|-------|----------|------------| | Data Engineer | Pipelines, Infra | 12-24 mo | Backend Dev | | ML Engineer | Models, Features | 12-24 mo | Data Scientist | | AI Engineer | LLMs, Agents | 6-12 mo | Any Developer |
Learning Paths
Data Engineer
[1] SQL Mastery (4-6 wk)
│ └─ Window functions, CTEs, optimization
│
▼
[2] Python for Data (4-6 wk)
│ └─ Pandas, file formats, scripting
│
▼
[3] ETL/ELT Pipelines (6-8 wk)
│ └─ Extract, transform, load patterns
│
▼
[4] Big Data: Spark (8-12 wk)
│ └─ PySpark, DataFrames, partitioning
│
▼
[5] Data Warehouse (4-6 wk)
│ └─ Star schema, dbt, Snowflake/BQ
│
▼
[6] Orchestration (4-6 wk)
└─ Airflow/Prefect, scheduling, monitoring
2025 Stack: Python + Spark + Airflow + dbt + Snowflake/BigQuery
ML Engineer
[1] Python + NumPy (4-6 wk)
│
▼
[2] Math Foundations (6-8 wk)
│ └─ Linear algebra, calculus, statistics
│
▼
[3] Classical ML (8-12 wk)
│ └─ scikit-learn, XGBoost, evaluation
│
▼
[4] Deep Learning (8-12 wk)
│ └─ PyTorch, CNNs, Transformers
│
▼
[5] MLOps (6-8 wk)
└─ MLflow, model serving, monitoring
2025 Stack: Python + PyTorch + scikit-learn + MLflow + W&B
AI Engineer (2025 Hot Path)
[1] LLM Fundamentals (2-3 wk)
│ └─ Tokens, embeddings, context windows
│
▼
[2] Prompt Engineering (2-3 wk)
│ └─ Few-shot, CoT, structured output
│
▼
[3] RAG Systems (3-4 wk)
│ └─ Embeddings, vector DBs, retrieval
│
▼
[4] AI Agents (4-6 wk)
│ └─ Tool calling, agent loops, memory
│
▼
[5] Production Deploy (ongoing)
└─ Evaluation, guardrails, monitoring
2025 Stack: Python + LangChain/LlamaIndex + OpenAI/Anthropic + ChromaDB
2025 Tool Matrix
Data Processing
| Tool | Scale | Use Case | |------|-------|----------| | Pandas | <10GB | Prototyping, small data | | Polars | <100GB | Fast local processing | | Spark | >100GB | Distributed processing | | dbt | Any | Transformations, testing |
ML Frameworks
| Framework | Best For | Complexity | |-----------|----------|------------| | scikit-learn | Classical ML | Low | | XGBoost | Tabular data | Low | | PyTorch | Research, flexibility | Medium | | TensorFlow | Production, mobile | Medium |
LLM/AI Tools
| Tool | Use Case | |------|----------| | LangChain | LLM orchestration | | LlamaIndex | RAG systems | | Claude/OpenAI | LLM APIs | | ChromaDB | Vector storage |
Algorithm Reference
Classical ML
| Type | Algorithms | |------|------------| | Regression | Linear, Ridge, Lasso, ElasticNet | | Classification | Logistic, SVM, Decision Tree | | Ensemble | Random Forest, XGBoost, LightGBM | | Clustering | K-Means, DBSCAN, Hierarchical |
Deep Learning
| Architecture | Use Case | |--------------|----------| | CNN | Images, vision | | RNN/LSTM | Sequences | | Transformer | NLP, LLMs | | Diffusion | Image generation |
AI Agent Architecture (2025)
┌─────────────────────────────────────────┐
│ AGENTIC LOOP │
├─────────────────────────────────────────┤
│ PERCEIVE → REASON → ACT → REFLECT │
│ │ │ │ │ │
│ │ │ │ └─► Loop │
│ │ │ └─► Execute tools│
│ │ └─► LLM decides action │
│ └─► Gather context, observations │
└─────────────────────────────────────────┘
Design Patterns (Anthropic 2025):
• Prompt Chaining - Sequential fixed steps
• Routing - Classify and dispatch
• Parallelization - Concurrent subtasks
• Orchestrator-Workers - Central delegation
• Evaluator-Optimizer - Generate + critique
Troubleshooting
Which path to choose?
├─► Love building infrastructure? → Data Engineer
├─► Love algorithms/math? → ML Engineer
├─► Want fastest AI entry? → AI Engineer
└─► Uncertain? → Start with Python + SQL
Model not performing well?
├─► Data quality issues? → Clean data first
├─► Feature engineering? → Create better features
├─► Wrong algorithm? → Try different models
├─► Overfitting? → More data, regularization
└─► Hyperparameters? → Grid/random search
LLM giving bad answers?
├─► Prompt too vague? → Be more specific
├─► Missing context? → Add relevant info
├─► Hallucinating? → Use RAG, verify facts
└─► Wrong tool? → Improve tool descriptions
Common Failure Modes
| Symptom | Root Cause | Recovery | |---------|------------|----------| | Model fails in prod | Data drift | Monitor distributions | | Pipeline always late | Unoptimized queries | Profile, partition | | RAG finds wrong docs | Bad chunking | Tune chunk size, overlap | | Agent loops forever | No exit condition | Add max iterations |
Portfolio Projects
Data Engineering
- ETL Pipeline (Airflow + dbt)
- Real-time Streaming (Kafka + Spark)
- Data Warehouse Design
ML Engineering
- Classification Model (scikit-learn)
- Deep Learning Model (PyTorch)
- ML Pipeline (MLflow)
AI Engineering
- RAG Chatbot (LangChain + ChromaDB)
- AI Agent with Tools
- Multi-Agent System
Next Actions
Specify your target role for a detailed learning plan.