ML Fundamentals Skill
Master the building blocks of machine learning: from raw data to trained models.
Quick Start
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# 1. Load and split data
# (X is the feature matrix and y the label vector, assumed already loaded)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 2. Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
# 3. Train and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.4f}")
Key Topics
1. Data Preprocessing
| Step | Purpose | Implementation |
|------|---------|----------------|
| Missing Values | Handle NaN/None | SimpleImputer(strategy='median') |
| Scaling | Normalize ranges | StandardScaler() or MinMaxScaler() |
| Encoding | Convert categories | OneHotEncoder() or OrdinalEncoder() for features; LabelEncoder() for the target only |
| Outliers | Remove extremes | IQR method or Z-score |
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define column types
numeric_features = ['age', 'income', 'score']
categorical_features = ['gender', 'city', 'category']
# Create preprocessor: each sub-pipeline is applied only to its own columns
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])
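The table lists the IQR method for outliers, but the preprocessor above does not cover it. The following is a minimal sketch of one common form of that rule, assuming the conventional 1.5 × IQR fences; the helper name `iqr_mask` and the sample `income` values are made up for illustration. In practice the fences would be computed on the training split only.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Return a boolean mask that is True for values inside the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values >= lower) & (values <= upper)

# Hypothetical income column with one extreme value
income = np.array([32.0, 35.0, 36.0, 38.0, 40.0, 41.0, 43.0, 45.0, 400.0])
mask = iqr_mask(income)
filtered = income[mask]  # rows outside the fences are dropped
```

The same mask can be applied to the label vector so that features and labels stay aligned.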
2. Feature Engineering
| Technique | Use Case | Example |
|-----------|----------|---------|
| Polynomial | Non-linear relationships | PolynomialFeatures(degree=2) |
| Binning | Discretize continuous | KBinsDiscretizer(n_bins=5) |
| Log Transform | Right-skewed data | np.log1p(x) |
| Interaction | Feature combinations | x1 * x2 |
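As a rough illustration of the four techniques in the table, the sketch below applies each one to a small synthetic matrix. The data, the variable names, and the choice of `strategy='uniform'` (equal-width bins) are assumptions made for this example only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer

# Synthetic positive-valued data: 50 samples, 2 features (x1, x2)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1.0, 100.0, size=(50, 2))

# Polynomial: columns are x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)

# Binning: replace each value with the index of its equal-width bin (0..4)
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X_demo)

# Log transform: log(1 + x) compresses the right tail
X_log = np.log1p(X_demo)

# Interaction: element-wise product of two features
interaction = X_demo[:, 0] * X_demo[:, 1]
```

Note that `PolynomialFeatures(degree=2)` already includes the pairwise interaction term, so a manual `x1 * x2` column is only needed when the squared terms are not wanted.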
3. Model Evaluation
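from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
# `model` is any unfitted estimator or pipeline, e.g. the Quick Start pipeline
# Cross-validation on the training split only (each fold fits a fresh clone)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
# Detailed report: cross_val_score does not fit `model` itself, so fit it first
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))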
4. Cross-Validation Strategies
| Strategy | When to Use |
|----------|-------------|
| KFold | Standard, balanced data |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| GroupKFold | Grouped samples |
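To make the differences between these splitters concrete, here is a small sketch on toy data. The arrays `X_demo`, `y_demo`, and `groups` are invented for illustration; only the splitter classes come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, GroupKFold

X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0] * 9 + [1] * 3)   # imbalanced labels: 9 vs 3
groups = np.repeat([0, 1, 2, 3], 3)    # e.g. 4 subjects with 3 rows each

# StratifiedKFold keeps the class ratio in every fold
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
minority_per_fold = [
    int(y_demo[test_idx].sum()) for _, test_idx in skf.split(X_demo, y_demo)
]

# TimeSeriesSplit always trains on the past and tests on the future
tss = TimeSeriesSplit(n_splits=3)
time_ordered = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in tss.split(X_demo)
)

# GroupKFold never puts the same group in both train and test
gkf = GroupKFold(n_splits=4)
group_disjoint = all(
    set(groups[train_idx]).isdisjoint(groups[test_idx])
    for train_idx, test_idx in gkf.split(X_demo, y_demo, groups=groups)
)
```

Any of these splitter objects can be passed as the `cv` argument of `cross_val_score` in place of an integer.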
Best Practices
DO
- Split data BEFORE fitting any preprocessing step
- Use pipelines for reproducibility
- Stratify splits for classification
- Log all preprocessing parameters
- Version your feature engineering code
DON'T
- Don't fit on test data
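- Don't let test-set or target information leak into features or preprocessing
- Don't rely on accuracy alone for imbalanced data; prefer F1, precision/recall, or balanced accuracy
- Don't hard-code hyperparameters; keep them in configuration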
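Two of the points above can be checked directly in code. The sketch below uses a synthetic imbalanced dataset (roughly 90/10), with the dataset and variable names assumed for illustration. It shows that a scaler inside a pipeline learns its statistics from the training split only, and that a classifier that always predicts the majority class still gets high accuracy while its F1 score for the minority class is zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: about 90% class 0, 10% class 1
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted as part of the pipeline, on the training split only
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
scaler = pipeline.named_steps['scaler']
uses_train_stats = np.allclose(scaler.mean_, X_train.mean(axis=0))

# Baseline that always predicts the majority class
majority_pred = np.zeros_like(y_test)
baseline_accuracy = accuracy_score(y_test, majority_pred)
baseline_f1 = f1_score(y_test, majority_pred, zero_division=0)
```

Comparing `baseline_accuracy` with `baseline_f1` is a quick sanity check before trusting an accuracy figure on skewed data.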
Exercises
Exercise 1: Basic Pipeline
# TODO: Create a pipeline that:
# 1. Imputes missing values
# 2. Scales features
# 3. Trains a logistic regression
Exercise 2: Cross-Validation
# TODO: Implement 5-fold stratified CV
# and report mean and std of F1 score
Unit Test Template
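import pytest
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def create_preprocessing_pipeline():
    """Placeholder factory for the pipeline under test; swap in your own."""
    return Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])

def test_preprocessing_pipeline():
    """Test preprocessing handles missing values."""
    X, y = make_classification(n_samples=100, n_features=10, random_state=42)
    X[0, 0] = np.nan  # Introduce missing value
    pipeline = create_preprocessing_pipeline()
    X_transformed = pipeline.fit_transform(X)
    assert not np.isnan(X_transformed).any()
    assert X_transformed.shape[0] == X.shape[0]

def test_no_data_leakage():
    """Verify preprocessing doesn't leak test data."""
    X, _ = make_classification(n_samples=100, n_features=10, random_state=42)
    X_train, X_test = X[:80], X[80:]
    pipeline = create_preprocessing_pipeline()
    pipeline.fit(X_train)
    X_test_transformed = pipeline.transform(X_test)
    assert X_test_transformed.shape == X_test.shape
    # Check that the scaler statistics come from the training split only
    assert np.allclose(pipeline.named_steps['scaler'].mean_, X_train.mean(axis=0))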
Troubleshooting
| Problem | Cause | Solution |
|---------|-------|----------|
| NaN in prediction | Missing imputer | Add SimpleImputer to pipeline |
| Shape mismatch | Inconsistent features | Use ColumnTransformer |
| Memory error | Too many one-hot features | Use max_categories or hashing |
| CV score far above test score | Data leakage | Fit preprocessing inside the pipeline, after the split |
Related Resources
- Agent: 01-ml-fundamentals
- Next Skill: supervised-learning
- Docs: scikit-learn User Guide
Version: 1.4.0 | Status: Production Ready