ML System Design
This skill provides frameworks for designing production machine learning systems, from data pipelines to model serving.
When to Use This Skill
Keywords: ML pipeline, machine learning system, feature store, model training, model serving, ML infrastructure, MLOps, A/B testing ML, feature engineering, model deployment
Use this skill when:
- Designing end-to-end ML systems for production
- Planning feature store architecture
- Designing model training pipelines
- Planning model serving infrastructure
- Preparing for ML system design interviews
- Evaluating ML platform tools and frameworks
ML System Architecture Overview
The ML System Lifecycle
┌─────────────────────────────────────────────────────────────────────────┐
│ ML SYSTEM LIFECYCLE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Data │──▶│ Feature │──▶│ Model │──▶│ Model │──▶│ Monitor│ │
│ │ Ingestion│ │ Pipeline │ │ Training │ │ Serving │ │ & Eval │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Data │ │ Feature │ │ Model │ │ Inference│ │ Metrics│ │
│ │ Lake │ │ Store │ │ Registry │ │ Cache │ │ Store │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Key Components
| Component | Purpose | Examples |
| --------- | ------- | -------- |
| Data Ingestion | Collect raw data from sources | Kafka, Kinesis, Pub/Sub |
| Feature Pipeline | Transform raw data into features | Spark, Flink, dbt |
| Feature Store | Store and serve features | Feast, Tecton, Vertex AI |
| Model Training | Train and validate models | SageMaker, Vertex AI, Kubeflow |
| Model Registry | Version and track models | MLflow, Weights & Biases |
| Model Serving | Serve predictions | TensorFlow Serving, Triton, vLLM |
| Monitoring | Track model performance | Evidently, WhyLabs, Arize |
Feature Store Architecture
Why Feature Stores?
Problems without a feature store:
- Training-serving skew (features computed differently offline vs. online)
- Duplicate feature computation across teams
- No feature versioning or lineage
- Slow feature experimentation
Feature Store Components
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE STORE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ OFFLINE STORE │ │ ONLINE STORE │ │
│ │ │ │ │ │
│ │ - Historical data │ │ - Low-latency │ │
│ │ - Training queries │ ────▶ │ - Point lookups │ │
│ │ - Batch features │ sync │ - Real-time serving│ │
│ │ │ │ │ │
│ │ (Data Warehouse) │ │ (Redis, DynamoDB) │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ FEATURE REGISTRY ││
│ │ - Feature definitions - Version control ││
│ │ - Data lineage - Access control ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
Feature Types
| Type | Computation | Storage | Example |
| ---- | ----------- | ------- | ------- |
| Batch | Scheduled (hourly/daily) | Offline → Online | User purchase count (30 days) |
| Streaming | Real-time event processing | Direct to online | Items in cart (current) |
| On-demand | Request-time computation | Not stored | Distance to nearest store |
Training-Serving Consistency
TRAINING (Historical):
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Historical │───▶│ Point-in-Time│───▶│ Training │
│ Events │ │ Join │ │ Dataset │
└──────────────┘ └──────────────┘ └──────────────┘
│
Uses feature
definitions
│
SERVING (Real-time): ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Online │───▶│ Same Feature │───▶│ Prediction │
│ Store │ │ Definitions │ │ Request │
└──────────────┘ └──────────────┘ └──────────────┘
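The point-in-time join above can be sketched in plain Python: for each training event, attach the latest feature value whose timestamp is at or before the event time, so no future information leaks into the training set. The data layout and function name here are illustrative, not a specific feature-store API.

```python
from bisect import bisect_right

def point_in_time_join(events, feature_history):
    """For each (entity_id, event_ts) pair, attach the most recent
    feature value observed at or before event_ts, avoiding leakage
    of future information into the training set."""
    # feature_history: {entity_id: list of (ts, value), sorted by ts}
    rows = []
    for entity_id, event_ts in events:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        idx = bisect_right(timestamps, event_ts)
        value = history[idx - 1][1] if idx > 0 else None  # None = not yet known
        rows.append((entity_id, event_ts, value))
    return rows

history = {"u1": [(100, 3), (200, 5)]}
events = [("u1", 150), ("u1", 250), ("u1", 50)]
print(point_in_time_join(events, history))
# [('u1', 150, 3), ('u1', 250, 5), ('u1', 50, None)]
```

The serving path then reads the *current* value from the online store using the same feature definition, which is what keeps training and serving consistent.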
Model Training Infrastructure
Training Pipeline Components
┌───────────────────────────────────────────────────────────────────────┐
│ TRAINING PIPELINE │
├───────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Data │──▶│ Feature │──▶│ Model │──▶│ Model │ │
│ │ Loader │ │ Transform│ │ Train │ │ Validate │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Experiment │ │ Hyperparam │ │ Checkpoint │ │ Model │ │
│ │ Tracking │ │ Tuning │ │ Storage │ │ Registry │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────┘
Training Infrastructure Patterns
| Pattern | Use Case | Tools |
| ------- | -------- | ----- |
| Single-node | Small datasets, quick experiments | Jupyter, local GPU |
| Distributed data-parallel | Large datasets, same model | Horovod, PyTorch DDP |
| Model-parallel | Large models that don't fit in memory | DeepSpeed, FSDP, Megatron |
| Hyperparameter tuning | Automated model optimization | Optuna, Ray Tune |
Experiment Tracking
Track for reproducibility:
| What to Track | Why |
| ------------- | --- |
| Hyperparameters | Reproduce training runs |
| Metrics | Compare model performance |
| Artifacts | Model files, datasets |
| Code version | Git commit hash |
| Environment | Docker image, dependencies |
| Data version | Dataset hash or snapshot |
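A minimal, tool-agnostic sketch of the record such trackers persist per run. Real trackers like MLflow store the same categories of fields; the schema and function name here are illustrative:

```python
import hashlib
import json
import platform
import sys

def make_run_record(params, metrics, git_commit, dataset_bytes):
    """Capture everything needed to reproduce a training run."""
    return {
        "params": params,                      # hyperparameters
        "metrics": metrics,                    # final evaluation metrics
        "code_version": git_commit,            # git commit hash
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        # content hash of the training data serves as a data version
        "data_version": hashlib.sha256(dataset_bytes).hexdigest()[:12],
    }

record = make_run_record(
    params={"lr": 3e-4, "batch_size": 32},
    metrics={"auc": 0.91},
    git_commit="a1b2c3d",
    dataset_bytes=b"...training data snapshot...",
)
print(json.dumps(record, indent=2))
```

Hashing the dataset rather than recording a file path is what makes the record robust to data being moved or silently overwritten.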
Model Serving Architecture
Serving Patterns
| Pattern | Latency | Throughput | Use Case |
| ------- | ------- | ---------- | -------- |
| Online (REST/gRPC) | Low (<100ms) | Medium | Real-time predictions |
| Batch | High (hours) | Very high | Bulk scoring |
| Streaming | Medium | High | Event-driven predictions |
| Embedded | Very low | Varies | Edge/mobile inference |
Online Serving Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ MODEL SERVING SYSTEM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Clients │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Load Balancer│ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ API Gateway │ │
│ │ - Authentication - Rate limiting - Request validation │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌───────────────────────┼───────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Model A │ │ Model B │ │ Model C │ │
│ │ (v1.2) │ │ (v2.0) │ │ (v1.0) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │ │
│ └───────────────────────┼───────────────────────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Feature Store │ │
│ │ (Online) │ │
│ └────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Latency Optimization
| Technique | Latency Impact | Trade-off |
| --------- | -------------- | --------- |
| Batching | Reduces per-request cost | Adds queueing delay while a batch fills |
| Caching | 10-100x faster on hits | May serve stale predictions |
| Quantization | 2-4x faster | Slight accuracy loss |
| Distillation | Variable | Training overhead |
| GPU inference | 10-100x faster | Cost increase |
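The caching trade-off can be made concrete with a small TTL cache keyed on the input features: hits return immediately but may be up to `ttl` seconds stale. The class name and stand-in model are illustrative:

```python
import time

class PredictionCache:
    """Cache model outputs for identical inputs; entries expire after ttl seconds."""

    def __init__(self, predict_fn, ttl=60.0):
        self.predict_fn = predict_fn
        self.ttl = ttl
        self._store = {}  # key -> (expires_at, prediction)

    def predict(self, features):
        key = tuple(features)
        now = time.monotonic()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]                     # cached, possibly stale
        pred = self.predict_fn(features)      # cache miss: run the model
        self._store[key] = (now + self.ttl, pred)
        return pred

calls = []
def slow_model(features):
    calls.append(features)
    return sum(features)  # stand-in for real inference

cache = PredictionCache(slow_model, ttl=60.0)
cache.predict([1, 2]); cache.predict([1, 2]); cache.predict([3])
print(len(calls))  # 2 -- the repeated request was served from cache
```

The `ttl` value is the knob: longer TTLs raise the hit rate and lower latency, at the cost of serving predictions computed on older features.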
A/B Testing ML Models
Experiment Design
┌─────────────────────────────────────────────────────────────────────┐
│ A/B TESTING ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Traffic │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Experiment Assignment │ ◀─────── Experiment Config │
│ │ - User bucketing │ - Allocation % │
│ │ - Feature flags │ - Target segments │
│ └──────────┬───────────┘ - Guardrails │
│ │ │
│ ┌────────┴────────┐ │
│ ▼ ▼ │
│ ┌────────┐ ┌────────┐ │
│ │Control │ │Treatment│ │
│ │Model A │ │Model B │ │
│ └────┬───┘ └────┬───┘ │
│ │ │ │
│ └────────┬───────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Metrics Logger │ │
│ └────────┬───────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Statistical │ ─────▶ Decision: Ship / Iterate / Kill │
│ │ Analysis │ │
│ └────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
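The user-bucketing step above is typically a salted hash, so assignment is deterministic (no storage needed) and independent across experiments. A minimal sketch, with illustrative names:

```python
import hashlib

def assign_variant(user_id, experiment_id, treatment_pct=50):
    """Deterministically bucket a user: the same user + experiment always
    lands in the same variant, and different experiments are independent
    because experiment_id salts the hash."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

print(assign_variant("user-42", "ranker-v2"))
```

Salting with the experiment ID matters: without it, the same users would always land in treatment across every experiment, correlating results between unrelated tests.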
Metrics to Track
| Metric Type | Examples | Purpose |
| ----------- | -------- | ------- |
| Model metrics | AUC, RMSE, precision/recall | Model quality |
| Business metrics | CTR, conversion, revenue | Business impact |
| Guardrail metrics | Latency, error rate, engagement | Prevent regressions |
| Segment metrics | Metrics by user segment | Detect heterogeneous effects |
Statistical Considerations
- Sample size: Calculate power before experiment
- Duration: Account for novelty effects and time patterns
- Multiple testing: Adjust for multiple metrics (Bonferroni, FDR)
- Early stopping: Use sequential testing methods
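The sample-size point can be made concrete with the standard two-proportion approximation, n ≈ 2(z_α/2 + z_β)² · p(1−p) / δ² per variant. The z defaults below assume α = 0.05 and 80% power:

```python
def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            z_alpha=1.96, z_beta=0.84):
    """Approximate users needed per variant to detect an absolute lift
    of min_detectable_effect at alpha=0.05 with 80% power."""
    p = baseline_rate
    delta = min_detectable_effect
    return int(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)

# Detecting a 0.5pp absolute lift on a 5% baseline CTR needs ~30k users per arm
print(sample_size_per_variant(0.05, 0.005))
```

The δ² in the denominator is why halving the minimum detectable effect quadruples the required sample, which in turn drives experiment duration.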
Model Monitoring
What to Monitor
| Category | Metrics | Alert Threshold |
| -------- | ------- | --------------- |
| Data quality | Missing values, schema drift | >1% change |
| Feature drift | Distribution shift (PSI, KL) | PSI >0.2 |
| Prediction drift | Output distribution shift | Depends on use case |
| Model performance | Accuracy, AUC (when labels available) | >5% degradation |
| Operational | Latency, throughput, errors | SLO violations |
Drift Detection
┌─────────────────────────────────────────────────────────────────────┐
│ DRIFT DETECTION PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Training Data Production Data │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Reference │ │ Current │ │
│ │ Distribution │ │ Distribution │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ └──────────────┬──────────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Statistical Test │ │
│ │ - PSI (Population Stability Index) │
│ │ - KS Test │
│ │ - Chi-squared │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Drift Score │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌───────────┼───────────┐ │
│ ▼ ▼ ▼ │
│ No Drift Warning Critical │
│ (< 0.1) (0.1-0.2) (> 0.2) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Continue Investigate Retrain │
│ │
└─────────────────────────────────────────────────────────────────────┘
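The PSI computation in the pipeline above fits in a few lines: bin both samples, compare bin proportions, and sum (current − reference) × ln(current / reference). The binning scheme and epsilon here are one common choice, not the only one:

```python
import math

def psi(reference, current, bins=10):
    """Population Stability Index between a reference (training) sample
    and a current (production) sample, using equal-width bins over the
    combined range. Rule of thumb: <0.1 stable, 0.1-0.2 warning, >0.2 critical."""
    lo, hi = min(reference + current), max(reference + current)
    width = (hi - lo) / bins or 1.0

    def distribution(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # small epsilon avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = distribution(reference), distribution(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

same = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in same]
print(round(psi(same, same), 3))   # ~0.0: no drift
print(psi(same, shifted) > 0.2)    # True: critical drift
```

In production the reference distribution is frozen at training time, so drift scores compare each monitoring window against the data the model actually learned from.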
Common ML System Design Patterns
Pattern 1: Recommendation System
Components needed:
- Candidate Generation (retrieve 100s-1000s)
- Ranking Model (score and sort)
- Feature Store (user features, item features)
- Real-time personalization (recent behavior)
- A/B testing infrastructure
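The two-stage flow implied by the first two components can be sketched as: a cheap similarity search narrows the catalog to candidates, then a more expensive ranker scores only those survivors. Everything here (embeddings, the `dot`-product stand-in ranker) is illustrative:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recommend(user_emb, item_embs, ranker, retrieve_k=3, top_n=2):
    """Two-stage recommendation: embedding retrieval narrows the catalog,
    then a (stand-in) ranking model scores the candidates."""
    # Stage 1: candidate generation -- cheap similarity over the full catalog
    candidates = sorted(item_embs, key=lambda i: dot(user_emb, item_embs[i]),
                        reverse=True)[:retrieve_k]
    # Stage 2: ranking -- the expensive model runs only on the candidates
    scored = sorted(candidates, key=lambda i: ranker(user_emb, item_embs[i]),
                    reverse=True)
    return scored[:top_n]

items = {"a": [1, 0], "b": [0.9, 0.1], "c": [0, 1], "d": [0.5, 0.5]}
user = [1, 0.2]
# Stand-in ranker; in production this is a learned model over rich features
print(recommend(user, items, ranker=dot))  # ['a', 'b']
```

The split exists because the ranker is too expensive to run over millions of items; retrieval trades a little recall for the latency budget.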
Pattern 2: Fraud Detection
Components needed:
- Real-time feature computation
- Low-latency model serving (<50ms)
- High recall focus (can't miss fraud)
- Explainability for compliance
- Human-in-the-loop review
- Feedback loop for labels
Pattern 3: Search Ranking
Components needed:
- Two-stage ranking (retrieval + ranking)
- Feature store for query/document features
- Low latency (<200ms end-to-end)
- Learning to rank models
- Click-through rate prediction
- A/B testing with interleaving
Estimation for ML Systems
Training Infrastructure
Training time estimation:
- Dataset size: 100M examples
- Model: Transformer (100M params)
- GPU: A100 (80GB, 312 TFLOPS)
- Batch size: 32
- Steps per epoch: 100M / 32 ≈ 3.1M
- Time per step: ~100 ms
- Total time: 3.1M × 100 ms ≈ 86 hours on a single GPU (one epoch)
- With 8 GPUs (data-parallel): ~11 hours
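The arithmetic above as a reusable helper, assuming ideal data-parallel scaling (real systems lose some fraction to communication overhead):

```python
def training_hours(examples, batch_size, step_seconds, num_gpus=1):
    """Back-of-envelope wall-clock time for one epoch of training,
    assuming ideal data-parallel scaling across GPUs."""
    steps = examples / batch_size          # e.g. 100M / 32 ≈ 3.1M steps
    return steps * step_seconds / 3600 / num_gpus

print(round(training_hours(100_000_000, 32, 0.100), 1))              # ~86.8 h, 1 GPU
print(round(training_hours(100_000_000, 32, 0.100, num_gpus=8), 1))  # ~10.9 h, 8 GPUs
```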
Serving Infrastructure
Inference estimation:
- QPS: 10,000
- Model latency: 20ms
- Batch size: 1 (real-time)
- GPU utilization: 50% (latency constraint)
- Requests per GPU per second: (1 / 20 ms) × 50% = 25
- GPUs needed: 10,000 / 25 = 400
- With batching (batch of 8): ~100 GPUs (a ~4x reduction, not 8x, since per-batch latency grows)
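The serving estimate as a helper. The `batch_latency_factor` parameter (an assumption for illustration) models how much a full batch inflates per-call latency, which is why batching yields a ~4x rather than 8x reduction here:

```python
import math

def gpus_needed(qps, model_latency_s, batch_size=1, utilization=0.5,
                batch_latency_factor=1.0):
    """GPUs required to sustain qps at a given utilization ceiling.
    batch_latency_factor: how much a full batch inflates per-call latency
    (1.0 = no inflation)."""
    batch_latency = model_latency_s * batch_latency_factor
    throughput_per_gpu = batch_size / batch_latency * utilization
    return math.ceil(qps / throughput_per_gpu)

print(gpus_needed(10_000, 0.020))  # 400 GPUs unbatched
# Batch of 8, assuming per-batch latency doubles: ~4x (not 8x) fewer GPUs
print(gpus_needed(10_000, 0.020, batch_size=8, batch_latency_factor=2.0))  # 100
```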
Related Skills
- llm-serving-patterns - LLM-specific serving and optimization
- rag-architecture - Retrieval-Augmented Generation patterns
- vector-databases - Vector search and embeddings
- ml-inference-optimization - Latency and cost optimization
- estimation-techniques - Back-of-envelope calculations
- quality-attributes-taxonomy - NFR definitions
Related Commands
- /sd:ml-pipeline <problem> - Design ML system interactively
- /sd:estimate <scenario> - Capacity calculations
Related Agents
- ml-systems-designer - Design ML architectures
- ml-interviewer - Mock ML system design interviews
Version History
- v1.0.0 (2025-12-26): Initial release
Last Updated
Date: 2025-12-26 Model: claude-opus-4-5-20251101