Feature Stores Skill
Learn: Build production feature stores for ML systems.
Skill Overview
| Attribute | Value |
|-----------|-------|
| Bonded Agent | 03-data-pipelines |
| Difficulty | Intermediate to Advanced |
| Duration | 35 hours |
| Prerequisites | mlops-basics |
Learning Objectives
- Understand feature store architecture
- Implement features with Feast
- Validate data quality with Great Expectations
- Serve features online and offline
- Version datasets with DVC
Topics Covered
Module 1: Feature Store Architecture (8 hours)
Components:
┌─────────────────────────────────────────────────────────────┐
│                 FEATURE STORE ARCHITECTURE                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Offline   │    │   Feature   │    │   Online    │      │
│  │    Store    │───▶│  Registry   │◀───│    Store    │      │
│  │  (Parquet)  │    │  (Metadata) │    │   (Redis)   │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│         │                  │                  │             │
│         ▼                  ▼                  ▼             │
│    [Training]         [Discovery]        [Inference]        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Exercises:
- [ ] Design feature store for e-commerce use case
- [ ] Compare Feast vs Tecton vs Hopsworks
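To make the three components in the diagram concrete, here is a minimal in-memory sketch. It is not how Feast, Tecton, or Hopsworks are implemented; `ToyFeatureStore` and all of its method names are hypothetical and exist only to show how an offline history, a metadata registry, and an online latest-value store relate to each other.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ToyFeatureStore:
    """Hypothetical in-memory stand-in for the three components."""

    registry: dict = field(default_factory=dict)  # feature name -> metadata (discovery)
    offline: list = field(default_factory=list)   # full history of rows (training)
    online: dict = field(default_factory=dict)    # latest row per entity (inference)

    def register(self, name: str, dtype: str, description: str) -> None:
        """Record feature metadata so the feature can be discovered."""
        self.registry[name] = {"dtype": dtype, "description": description}

    def ingest(self, entity_id: int, ts: datetime, values: dict) -> None:
        """Append a timestamped row to the offline history."""
        unknown = set(values) - set(self.registry)
        if unknown:
            raise KeyError(f"unregistered features: {sorted(unknown)}")
        self.offline.append({"entity_id": entity_id, "ts": ts, **values})

    def materialize(self) -> None:
        """Copy the most recent row per entity from offline to online."""
        for row in sorted(self.offline, key=lambda r: r["ts"]):
            self.online[row["entity_id"]] = row  # later rows overwrite earlier ones

    def get_online(self, entity_id: int, names: list) -> dict:
        """Low-latency lookup of the latest values for one entity."""
        row = self.online[entity_id]
        return {name: row[name] for name in names}


store = ToyFeatureStore()
store.register("total_purchases", "float32", "Lifetime purchase count")
store.ingest(1, datetime(2024, 1, 1), {"total_purchases": 3.0})
store.ingest(1, datetime(2024, 1, 5), {"total_purchases": 4.0})
store.materialize()
latest = store.get_online(1, ["total_purchases"])
print(latest)  # {'total_purchases': 4.0}
```

The offline list keeps both rows, which is what training needs, while the online dictionary holds only the newest row per entity, which is what inference needs.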
Module 2: Feast Implementation (12 hours)
Feature Definition Example:
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# Batch source for the feature view
# (the path and timestamp column below are placeholders, not real project files)
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
)

# Entity definition
customer = Entity(
    name="customer_id",
    value_type=ValueType.INT64,
    description="Customer identifier",
)

# Feature view
customer_features = FeatureView(
    name="customer_features",
    entities=[customer],  # pass the Entity object, not its name
    ttl=timedelta(days=7),
    schema=[
        Field(name="total_purchases", dtype=Float32),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,
    source=customer_stats_source,
)
Exercises:
- [ ] Set up Feast repository locally
- [ ] Create entity and feature views
- [ ] Materialize features to online store
- [ ] Retrieve features for training and inference
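Retrieving features for training relies on a point-in-time join: each training row may only see feature values that existed at its event timestamp and that are no older than the feature view's `ttl`. The sketch below illustrates that idea in plain Python. It is an assumption-laden simplification, not Feast's implementation, and `point_in_time_lookup` is a hypothetical helper name.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, entity_id, as_of, ttl):
    """Return the latest value at or before `as_of`, or None if missing or older than `ttl`.

    feature_rows: iterable of (entity_id, event_timestamp, value) tuples.
    """
    history = sorted(
        ((ts, value) for eid, ts, value in feature_rows if eid == entity_id),
        key=lambda pair: pair[0],
    )
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of) - 1  # last row with ts <= as_of
    if idx < 0:
        return None  # no value existed yet; using a later one would leak the future
    ts, value = history[idx]
    if as_of - ts > ttl:
        return None  # value exists but is older than the TTL window
    return value


rows = [
    (1, datetime(2024, 1, 1), 100.0),
    (1, datetime(2024, 1, 10), 120.0),
    (2, datetime(2024, 1, 3), 50.0),
]
ttl = timedelta(days=7)

fresh = point_in_time_lookup(rows, 1, datetime(2024, 1, 5), ttl)    # 100.0
expired = point_in_time_lookup(rows, 1, datetime(2024, 1, 9), ttl)  # None, 8 days old
newer = point_in_time_lookup(rows, 1, datetime(2024, 1, 12), ttl)   # 120.0
```

Note that the lookup on 2024-01-05 returns 100.0 even though a newer value (120.0) exists in the data, because that value was recorded after the event being labelled.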
Module 3: Data Validation (8 hours)
Great Expectations Setup:
import great_expectations as gx

# Get a data context (GX 1.x API)
context = gx.get_context()

# Create validation suite
suite = context.suites.add(gx.ExpectationSuite(name="ml_data_validation"))

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(
        column="target",
        mostly=0.99,  # pass if at least 99% of rows are non-null
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnMeanToBeBetween(
        column="feature_a",
        min_value=0.0,
        max_value=100.0,
    )
)
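The `mostly=0.99` argument turns a strict rule into a tolerance: the expectation passes when at least 99% of rows satisfy it. The following plain-Python sketch shows that threshold logic only. It does not call Great Expectations, and `not_null_fraction` and `passes_not_null` are hypothetical helper names.

```python
import math


def not_null_fraction(values):
    """Fraction of entries that are neither None nor NaN."""
    if not values:
        return 1.0  # treat an empty column as vacuously valid

    def is_null(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return sum(not is_null(v) for v in values) / len(values)


def passes_not_null(values, mostly=1.0):
    """True when the non-null fraction meets the `mostly` threshold."""
    return not_null_fraction(values) >= mostly


one_null = [None] + [1] * 199             # 199/200 = 99.5% non-null
three_nulls = [None, None, None] + [1] * 197  # 197/200 = 98.5% non-null

ok = passes_not_null(one_null, mostly=0.99)       # True
bad = passes_not_null(three_nulls, mostly=0.99)   # False
strict = passes_not_null(one_null)                # False, a single null fails mostly=1.0
```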
Module 4: Data Versioning (7 hours)
DVC Workflow:
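# Initialize DVC (inside an existing Git repository)
dvc init

# Add data to tracking
dvc add data/training_data.parquet

# Commit the generated pointer file and tag the version
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Track training data with DVC"
git tag v1.0.0

# Push to remote storage (requires a configured DVC remote)
dvc push

# Checkout specific version
git checkout v1.0.0
dvc checkout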
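The reason `git checkout` followed by `dvc checkout` can restore an older dataset is that Git versions only a small pointer file containing a content hash, while the data itself lives in DVC's cache or remote. The sketch below illustrates the hashing idea, assuming the MD5 checksum that DVC records in `.dvc` files. It does not use DVC, and `file_md5` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "training_data.parquet"  # stand-in file, not real Parquet

    data_file.write_bytes(b"version 1")
    hash_v1 = file_md5(data_file)

    data_file.write_bytes(b"version 2")
    hash_v2 = file_md5(data_file)

# Different content yields a different hash, so each Git commit can point
# at exactly one version of the data.
print(hash_v1 != hash_v2)  # True
```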
Code Templates
Template: Feature Engineering Pipeline
# templates/feature_pipeline.py
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeaturePipeline(BaseEstimator, TransformerMixin):
    """Production feature engineering pipeline."""

    def __init__(self, config: dict):
        self.config = config
        self.feature_names = []

    def fit(self, X: pd.DataFrame, y=None):
        """Learn feature statistics."""
        numeric = X.select_dtypes(include=["number"])
        # Remember which columns were seen so transform uses the same set
        self.numeric_columns_ = list(numeric.columns)
        self.means = numeric.mean()
        # Replace a zero std with 1.0 so constant columns do not divide by zero
        self.stds = numeric.std().replace(0, 1.0)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Apply feature transformations."""
        X = X.copy()

        # Numerical normalization, using only the columns seen during fit
        for col in self.numeric_columns_:
            X[f"{col}_normalized"] = (X[col] - self.means[col]) / self.stds[col]

        # Temporal features
        for col in self.config.get("datetime_columns", []):
            timestamps = pd.to_datetime(X[col])
            X[f"{col}_hour"] = timestamps.dt.hour
            X[f"{col}_dow"] = timestamps.dt.dayofweek

        self.feature_names = list(X.columns)
        return X
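The split between `fit` and `transform` matters because the statistics learned at training time must be reused unchanged at serving time. The dependency-free sketch below shows that pattern: learn once, serialize, then apply to a single serving request. `fit_stats` and `apply_stats` are hypothetical helpers, and JSON is just one assumed way to persist the statistics.

```python
import json
import statistics


def fit_stats(rows, columns):
    """Learn mean and sample standard deviation per column from training rows."""
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        stats[col] = {"mean": statistics.fmean(values), "std": std or 1.0}
    return stats


def apply_stats(row, stats):
    """Normalize one row with previously learned statistics."""
    out = dict(row)
    for col, s in stats.items():
        out[f"{col}_normalized"] = (row[col] - s["mean"]) / s["std"]
    return out


# Training time: learn and persist the statistics
train_rows = [{"amount": 10.0}, {"amount": 20.0}, {"amount": 30.0}]
stats = fit_stats(train_rows, ["amount"])  # mean 20.0, std 10.0
serialized = json.dumps(stats)

# Serving time: load the same statistics rather than recomputing them
serving_stats = json.loads(serialized)
served = apply_stats({"amount": 40.0}, serving_stats)
print(served["amount_normalized"])  # 2.0
```

Recomputing the mean and standard deviation from serving traffic instead would silently change the scale of every feature, which is one common source of training-serving skew.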
Troubleshooting Guide
| Issue | Cause | Solution |
|-------|-------|----------|
| Slow feature serving | Online store bottleneck | Scale Redis, add caching |
| Training-serving skew | Different transformations | Use unified feature pipeline |
| Stale features | Materialization lag | Increase refresh frequency |
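Stale features are easier to fix once they are measured. As a rough sketch, assuming you can obtain the last update timestamp per entity from your online store, a freshness check might look like this. `stale_entities` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_entities(last_updated, now, max_age):
    """Return entity IDs whose newest online value is older than `max_age`."""
    return sorted(eid for eid, ts in last_updated.items() if now - ts > max_age)


now = datetime(2024, 12, 1, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "customer_1": now - timedelta(minutes=5),
    "customer_2": now - timedelta(hours=3),
    "customer_3": now - timedelta(days=2),
}

stale = stale_entities(last_updated, now, max_age=timedelta(hours=1))
print(stale)  # ['customer_2', 'customer_3']
```

If a check like this flags many entities, the materialization schedule is probably running less often than the freshness the model needs.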
Resources
- Feast Documentation
- Great Expectations Docs
- DVC Documentation
- [See: training-pipelines] - Use features in training
Version History
| Version | Date | Changes |
|---------|------|---------|
| 2.0.0 | 2024-12 | Production-grade with Feast examples |
| 1.0.0 | 2024-11 | Initial release |