Machine Learning & Feature Engineering for Fantasy Football
Overview
Provide expert guidance on building ML-based player projection models using research-backed feature engineering patterns, appropriate model selection, and sports-specific validation strategies. Apply domain expertise to help design features, choose models, avoid common pitfalls, and create interpretable predictions.
When to Use This Skill
Trigger this skill for queries involving:
- Feature engineering: "What features should I include?" "How do I create age curve features?" "What are good opportunity metrics?"
- Model selection: "Which ML model should I use?" "Random Forest or XGBoost?" "When to use regularized regression?"
- Validation strategies: "How do I validate sports models?" "What's wrong with standard cross-validation?" "How to avoid data leakage?"
- Sports-specific challenges: "How to handle small sample sizes?" "How to model position differences?" "Handling regime changes?"
- Feature selection: "How to reduce 109 stats to key features?" "Lasso vs Ridge?" "How to handle multicollinearity?"
- Model interpretability: "How to explain predictions?" "What features matter most?" "SHAP values for fantasy?"
Note: For dynasty strategy questions (player valuation, trade analysis, roster construction), use ff-dynasty-strategy. For statistical methods (regression types, simulations, GAMs), use ff-statistical-methods.
Core Capabilities
1. Feature Engineering
Core Principle: Feature engineering is more important than model selection for sports predictions.
Key Feature Categories:
Age Curves
- Marcel system: 3-year weighted average + age adjustment + regression to mean
- Position-specific peaks: RB 23-26, WR 26-28, QB 28-33, TE 26-29
- Implementation: age_factor = 1 - (age - peak_age) * 0.003 for the decline phase
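The decline-phase formula above can be sketched as a small helper; the peak ages and the 0.003 rate are the illustrative values from this section, not calibrated constants:

```python
def age_factor(age: float, peak_age: float, decline_rate: float = 0.003) -> float:
    """Multiplicative age adjustment: 1.0 up through the peak, linear decline after."""
    if age <= peak_age:
        return 1.0
    return 1.0 - (age - peak_age) * decline_rate

# e.g. a 30-year-old RB with peak_age=26 -> 1 - 4 * 0.003 = ~0.988
factor = age_factor(30, 26)
```

Multiply the factor against the Marcel-style weighted baseline to get the age-adjusted projection.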
Opportunity Metrics
- Target share, snap share, weighted opportunities (carries + targets×1.5)
- Points per opportunity (efficiency measure)
- Volume is king: opportunity metrics predict better than TDs
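The weighted-opportunity and points-per-opportunity features can be sketched in pandas; the column names and the 1.5x target weight follow this section, and the numbers are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "B"],
    "carries": [180, 40],
    "targets": [30, 110],
    "team_targets": [520, 520],
    "fantasy_points": [190.0, 210.0],
})

# Targets count ~1.5x a carry (roughly their PPR value)
df["weighted_opps"] = df["carries"] + df["targets"] * 1.5
df["target_share"] = df["targets"] / df["team_targets"]
df["pts_per_opp"] = df["fantasy_points"] / df["weighted_opps"]
```

Player A ends up with 225 weighted opportunities (180 + 30 * 1.5), so volume and efficiency become separate, comparable columns.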
Efficiency Statistics
- Yards per route run (YPRR), yards per carry (YPC)
- Yards after contact (YAC), catch rate
- Warning: Noisy with small samples, use rolling averages
Interaction Terms
- QB quality × target share (receiver production context)
- Opponent strength adjustments
- Game script (leading = rushing, trailing = passing)
- ~40% of team performance from synergy effects
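An interaction term is just the product of the two raw features; here qb_epa_per_play is a hypothetical QB-quality proxy, chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "B"],
    "target_share": [0.25, 0.25],
    "qb_epa_per_play": [0.15, -0.05],  # hypothetical QB-quality proxy
})

# Identical target shares read very differently once QB quality interacts
df["qb_x_target_share"] = df["qb_epa_per_play"] * df["target_share"]
```

The model can now learn that a 25% target share is worth more behind a good quarterback than a bad one.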
Rolling Averages
- Last 3 games, last 5 games, season-long
- Trend features: recent form vs established baseline
- Lag features: last game, same opponent last season
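Rolling and lag features can be sketched in pandas; note the shift(1) so the current game never leaks into its own feature:

```python
import pandas as pd

games = pd.DataFrame({
    "player": ["A"] * 5,
    "week": [1, 2, 3, 4, 5],
    "points": [10.0, 20.0, 15.0, 25.0, 5.0],
})

# Rolling mean of the *previous* 3 games; shift(1) keeps the current
# game out of its own feature (no leakage)
games["roll3"] = (
    games.groupby("player")["points"]
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
games["last_game"] = games.groupby("player")["points"].shift(1)
```

Week 4's roll3 is the mean of weeks 1-3 (15.0); week 1's is NaN because no history exists yet, which is exactly what the model should see.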
Reference: references/feature_engineering.md for formulas, implementation patterns, and common mistakes.
2. Model Selection
Decision Framework:
Primary Goal?
├─ Interpretability → Linear/Ridge/Lasso Regression
└─ Performance
├─ Small (<1000) → Ridge/Lasso/Elastic Net
├─ Medium (1K-10K) → Random Forest or XGBoost
└─ Large (>10K) → XGBoost/LightGBM or Ensemble
Model Types:
Linear Regression: Baseline, interpretability, small samples
Regularized Regression: High-dimensional data, multicollinearity, automatic feature selection (Lasso)
Random Forest: Medium data, robustness, feature importance
XGBoost/LightGBM: Best single-model performance, handles missing values
Ensemble: Combine Ridge + RF + XGBoost (weighted 1:2:2), often 2-5% improvement
Position-Specific Modeling: Train separate models per position (RB features ≠ WR features)
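Position-specific modeling can be as simple as keeping one fitted model per position; this sketch uses a single illustrative feature and synthetic data, whereas a real pipeline would give each position its own feature set:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "position": rng.choice(["RB", "WR"], size=200),
    "opportunities": rng.uniform(0, 25, size=200),
})
df["points"] = df["opportunities"] * 0.8 + rng.normal(0, 2, size=200)

# One model per position; each position would get different features
models = {}
for pos, grp in df.groupby("position"):
    models[pos] = RandomForestRegressor(n_estimators=50, random_state=0).fit(
        grp[["opportunities"]], grp["points"]
    )

rb_pred = models["RB"].predict(pd.DataFrame({"opportunities": [15.0]}))
```

At inference time you route each player to models[position], which keeps RB-only features out of the WR model entirely.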
Reference: references/model_selection.md for detailed comparisons, hyperparameters, and implementation.
3. Validation Strategies
Critical Rule: NEVER use standard cross-validation with shuffle=True
❌ Wrong: KFold(n_splits=5, shuffle=True) → Data leakage!
✅ Correct: TimeSeriesSplit(n_splits=5) → Train on past, test on future
Time-Series Split: Always predict future from past data
Appropriate Metrics:
- MAE (Mean Absolute Error): Most interpretable
- RMSE: Penalizes large errors
- R²: Proportion of variance explained
Nested Cross-Validation: Outer loop for evaluation, inner loop for hyperparameter tuning
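The correct split can be sketched on synthetic data (170 rows standing in for chronologically ordered player-games); TimeSeriesSplit guarantees every test fold is later than its training fold:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
X = rng.normal(size=(170, 4))  # rows must already be in chronological order
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 1, 170)

tscv = TimeSeriesSplit(n_splits=5)
maes = []
for train_idx, test_idx in tscv.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
```

Each fold trains on everything before the test window, so the reported MAE reflects the true forecasting task.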
Reference: references/validation_strategies.md for detailed workflows and common mistakes.
4. Sports-Specific Challenges
Small Sample Sizes: NFL = 17 games/season → Use regularization (Ridge/Lasso)
Position-Specific Modeling: Separate models per position with different feature sets
Regime Changes: Weight recent seasons heavier, use sliding window validation
Data Leakage Prevention: Only use data available at prediction time, time-series validation
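One way to weight recent seasons heavier is exponential decay on sample weights; the two-season half-life here is illustrative, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import Ridge

seasons = np.array([2020, 2021, 2022, 2023, 2024]).repeat(17)  # 5 seasons x 17 games
rng = np.random.default_rng(2)
X = rng.normal(size=(85, 3))
y = X[:, 0] * 2 + rng.normal(0, 1, 85)

# Most recent season gets weight 1.0; weight halves every `half_life` seasons
half_life = 2.0
age = seasons.max() - seasons
weights = 0.5 ** (age / half_life)

model = Ridge(alpha=1.0).fit(X, y, sample_weight=weights)
```

A 2020 game then contributes a quarter of a 2024 game, which softens the impact of pre-regime-change data without discarding it.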
Reference: references/model_selection.md sections on sports-specific considerations.
Workflow: Building a Player Projection Model
Step 1: Feature Engineering
- Start with raw stats (yards, TDs, targets, snaps)
- Create opportunity metrics (target share, snap %)
- Add efficiency features (YPRR, YPC)
- Generate rolling averages (3-game, 5-game)
- Include age curves and interaction terms
- Use assets/player_projection_model_template.py as a starting point
Step 2: Feature Selection
- Check correlation (remove highly correlated features)
- Use Lasso for automatic selection
- SHAP values for importance
- Domain knowledge: prioritize opportunity > efficiency > TDs
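Lasso-based selection can be sketched on synthetic features (the column names are illustrative); standardizing first matters because the L1 penalty is scale-sensitive:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = pd.DataFrame(
    rng.normal(size=(300, 6)),
    columns=["tgt_share", "snap_pct", "yprr", "ypc", "td_rate", "noise"],
)
y = 5 * X["tgt_share"] + 2 * X["snap_pct"] + rng.normal(0, 1, 300)

# Standardize so the L1 penalty treats all features on the same scale
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)
selected = X.columns[lasso.coef_ != 0].tolist()
```

Features whose coefficients Lasso shrinks to exactly zero drop out, leaving the opportunity-style signals that actually drive the target.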
Step 3: Model Selection
- Establish baseline (linear regression or Marcel)
- Try regularized model (Elastic Net)
- Test tree-based (Random Forest, then XGBoost)
- Position-specific models
- Ensemble top 2-3 models
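The weighted ensemble from the steps above can be sketched as a weighted average of predictions; XGBoost is omitted to keep the example scikit-learn-only, so a 1:2 Ridge/RF blend stands in for the 1:2:2 Ridge/RF/XGBoost weighting:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -0.5, 0.0, 2.0]) + rng.normal(0, 1, 200)
X_tr, X_te, y_tr = X[:150], X[150:], y[:150]  # chronological split

ridge = Ridge().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Weighted average of model predictions (1:2 here, 1:2:2 with XGBoost added)
pred = (1 * ridge.predict(X_te) + 2 * rf.predict(X_te)) / 3
```

The linear model anchors the blend with a stable, low-variance signal while the tree model contributes nonlinear structure.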
Step 4: Validation
- Hold out most recent season as final test
- TimeSeriesSplit on training data
- Nested CV for hyperparameter tuning
- Evaluate MAE by position
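Nested cross-validation with time-series splits in both loops can be sketched as follows; the alpha grid is illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(7)
X = rng.normal(size=(170, 4))
y = X[:, 0] * 2 + rng.normal(0, 1, 170)

outer = TimeSeriesSplit(n_splits=4)
maes = []
for tr, te in outer.split(X):
    # Inner time-series CV tunes alpha on the training fold only,
    # so the outer test fold never influences hyperparameter choice
    inner = TimeSeriesSplit(n_splits=3)
    search = GridSearchCV(
        Ridge(),
        {"alpha": [0.1, 1.0, 10.0]},
        cv=inner,
        scoring="neg_mean_absolute_error",
    )
    search.fit(X[tr], y[tr])
    maes.append(mean_absolute_error(y[te], search.predict(X[te])))
```

The outer-loop MAEs are the honest performance estimate; the inner loop exists only to pick hyperparameters.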
Step 5: Interpretability
- SHAP values for feature importance
- Partial dependence plots for age curves
- Validate on new season
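SHAP requires the shap package; as a dependency-light stand-in that gives a comparable global importance ranking, scikit-learn's permutation_importance can be sketched on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the score drop; bigger drop
# means the model leaned on that feature more
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
```

For per-player explanations (why this projection for this player), SHAP values remain the better tool; permutation importance only ranks features globally.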
Identifying Data Requirements
For Player Projection Models:
- Historical performance (3+ years for aging curves)
- Opportunity metrics (targets, snaps, routes run, carries)
- Efficiency stats (YPRR, YPC, catch rate)
- Contextual data (opponent strength, QB quality, game script)
- Position and age
For Feature Engineering:
- Player-level: Stats, age, position, career year
- Team-level: Total targets, snaps, carries (for share calculations)
- Game-level: Score differential, home/away, opponent defense rank
- Season-level: Rule changes, schedule strength
Integrating with Other Skills
Complement with ff-dynasty-strategy when:
- Need domain knowledge for feature selection (aging curves, TD regression)
- Interpreting model outputs (sell-high candidates)
- Understanding position-specific patterns
Complement with ff-statistical-methods when:
- Choosing regression type (OLS vs Lasso vs GAMs)
- Running Monte Carlo simulations using predictions
- Performing variance analysis
Best Practices
Feature Engineering Over Model Complexity - Well-engineered features make simple models outperform complex ones
Always Use Time-Series Validation - Standard CV can inflate apparent performance by 15-20%
Position-Specific Models - RB features ≠ WR features ≠ QB features
Regularization for Small Samples - NFL has limited games (17/season)
Prioritize Interpretability - SHAP values for explainability, start simple
References
- references/feature_engineering.md - Age curves, opportunity metrics, efficiency stats, interaction terms
- references/model_selection.md - Decision framework, model types, hyperparameters
- references/validation_strategies.md - Time-series splits, nested CV, metrics
Assets
- assets/player_projection_model_template.py - Python template for building player projection models