# Supervised Learning Skill
Build, tune, and evaluate classification and regression models.
## Quick Start

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Split data (X, y are your feature matrix and labels),
# stratifying to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate: cross-validate on the training set, then score the held-out test set
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
```
## Key Topics

### 1. Classification Algorithms

| Algorithm | Best For | Training Complexity |
|-----------|----------|---------------------|
| Logistic Regression | Baseline, interpretable | O(n·d) |
| Random Forest | Tabular data, general purpose | O(n·d·trees) |
| XGBoost | Competitions, top accuracy | O(n·d·trees) |
| SVM | High-dimensional, small datasets | O(n²) |
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

classifiers = {
    'lr': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'rf': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
    'xgb': XGBClassifier(n_estimators=100, eval_metric='logloss')
}
```
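A quick first pass over these candidates with cross-validation (a sketch, reusing `X_train`/`y_train` from the Quick Start split):

```python
from sklearn.model_selection import cross_val_score

# Assumes X_train, y_train from the Quick Start
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1_weighted')
    print(f"{name}: F1 = {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
```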
### 2. Regression Algorithms

| Algorithm | Best For | Key Parameter |
|-----------|----------|---------------|
| Ridge | Multicollinearity | alpha |
| Lasso | Feature selection | alpha |
| Random Forest | Non-linear relationships | n_estimators |
| XGBoost | Best accuracy | learning_rate |
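A matching set of regression candidates, mirroring the classifier dictionary above (a sketch; the `alpha` and tree settings are starting points, not tuned values):

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Candidate regressors keyed the same way as the classifiers dict
regressors = {
    'ridge': Ridge(alpha=1.0),
    'lasso': Lasso(alpha=0.1),
    'rf': RandomForestRegressor(n_estimators=100, random_state=42),
    'xgb': XGBRegressor(n_estimators=100, learning_rate=0.1)
}
```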
### 3. Hyperparameter Tuning

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Sample integer hyperparameters from ranges rather than a fixed grid
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=50,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
```
### 4. Handling Class Imbalance
| Technique | Implementation |
|-----------|----------------|
| Class Weights | class_weight='balanced' |
| SMOTE | imblearn.over_sampling.SMOTE() |
| Threshold Tuning | Adjust prediction threshold (sketch after the SMOTE example) |
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# imblearn's Pipeline applies SMOTE during fit only, never at prediction time
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier())
])
```
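Threshold tuning, the third technique in the table above, leaves the model unchanged and moves the decision boundary instead. A minimal sketch, assuming a fitted binary classifier `model` and a validation split `X_val`/`y_val`:

```python
import numpy as np
from sklearn.metrics import f1_score

# Score the positive class, then pick the threshold that maximizes validation F1
proba = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.1, 0.9, 81)
best_t = max(thresholds, key=lambda t: f1_score(y_val, (proba >= t).astype(int)))

y_pred = (proba >= best_t).astype(int)
print(f"Best threshold: {best_t:.2f}")
```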
### 5. Model Comparison

```python
from sklearn.model_selection import cross_validate
import pandas as pd

def compare_models(models, X, y, cv=5):
    results = []
    for name, model in models.items():
        cv_results = cross_validate(
            model, X, y, cv=cv,
            scoring=['accuracy', 'f1_weighted', 'roc_auc_ovr_weighted'],
            return_train_score=True
        )
        results.append({
            'model': name,
            'train_acc': cv_results['train_accuracy'].mean(),
            'test_acc': cv_results['test_accuracy'].mean(),
            'test_f1': cv_results['test_f1_weighted'].mean(),
            'test_auc': cv_results['test_roc_auc_ovr_weighted'].mean()
        })
    return pd.DataFrame(results).round(4)
```
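Called with the classifier dictionary from Topic 1, this returns one row per model; a large gap between `train_acc` and `test_acc` flags overfitting:

```python
# Assumes the classifiers dict and the Quick Start training split
comparison = compare_models(classifiers, X_train, y_train)
print(comparison.sort_values('test_f1', ascending=False))
```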
## Best Practices

### DO
- Start with a simple baseline
- Use stratified splits for classification
- Log all hyperparameters
- Check for overfitting (train vs test gap)
- Use early stopping for boosting (see the sketch below)
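A minimal early-stopping sketch for XGBoost. Note the API is version-dependent: recent xgboost releases take `early_stopping_rounds` in the constructor, while older ones accept it as a `fit` keyword.

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Carve a validation set out of the training data for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

model = XGBClassifier(
    n_estimators=1000,           # upper bound; early stopping picks the real count
    learning_rate=0.05,
    early_stopping_rounds=20,    # stop after 20 rounds without validation improvement
    eval_metric='logloss'
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(f"Best iteration: {model.best_iteration}")
```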
### DON'T
- Don't tune on test set
- Don't ignore class imbalance
- Don't skip feature importance analysis
- Don't use accuracy for imbalanced data (see the metrics sketch below)
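Instead of accuracy, report per-class metrics: `classification_report` covers precision, recall, and F1 per class, and `balanced_accuracy_score` averages recall across classes:

```python
from sklearn.metrics import balanced_accuracy_score, classification_report

# Assumes a fitted model and the Quick Start test split
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")
```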
## Exercises

### Exercise 1: Model Selection

```python
# TODO: Compare 3 different classifiers using cross-validation
# Report F1 score for each
```

### Exercise 2: Hyperparameter Tuning

```python
# TODO: Use RandomizedSearchCV to tune XGBoost
# Find optimal n_estimators, max_depth, learning_rate
```
## Unit Test Template

```python
import pytest
from sklearn.datasets import make_classification

# get_classifier / get_balanced_classifier are placeholders for the
# model factories under test; see the sketch below this template.

def test_classifier_trains():
    """Test classifier can fit and predict."""
    X, y = make_classification(n_samples=100, random_state=42)
    model = get_classifier()
    model.fit(X[:80], y[:80])
    predictions = model.predict(X[80:])
    assert len(predictions) == 20
    assert set(predictions).issubset({0, 1})

def test_handles_imbalance():
    """Test model handles imbalanced classes."""
    # Fixed random_state keeps this test deterministic
    X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=42)
    model = get_balanced_classifier()
    model.fit(X, y)
    predictions = model.predict(X)
    # A balanced model should predict both classes, not only the majority
    assert len(set(predictions)) == 2
```
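A minimal pair of factories that makes the template runnable (hypothetical implementations; swap in the models you actually want under test):

```python
from sklearn.ensemble import RandomForestClassifier

def get_classifier():
    # Default candidate model for the template tests
    return RandomForestClassifier(n_estimators=100, random_state=42)

def get_balanced_classifier():
    # Same model with class weighting for the imbalance test
    return RandomForestClassifier(
        n_estimators=100, class_weight='balanced', random_state=42
    )
```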
## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Overfitting | Model too complex | Reduce depth, add regularization |
| Underfitting | Model too simple | Increase model complexity |
| Class imbalance | Skewed data | Use SMOTE or class weights |
| Slow training | Large dataset | Use LightGBM, reduce n_estimators |
## Related Resources

- Agent: 02-supervised-learning
- Previous: ml-fundamentals
- Next: clustering
Version: 1.4.0 | Status: Production Ready