Agent Skills: Supervised Learning Skill

Build production-ready classification and regression models with hyperparameter tuning

Tags: supervised-learning, classification, regression, hyperparameter-tuning, model-training, machine-learning
ID: pluginagentmarketplace/custom-plugin-machine-learning/supervised-learning

Skill Files



skills/supervised-learning/SKILL.md

Skill Metadata

Name
supervised-learning
Description
Build production-ready classification and regression models with hyperparameter tuning

Supervised Learning Skill

Build, tune, and evaluate classification and regression models.

Quick Start

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Split data (assumes X and y are already loaded as features and labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

Key Topics

1. Classification Algorithms

| Algorithm | Best For | Complexity |
|-----------|----------|------------|
| Logistic Regression | Baseline, interpretable | O(n·d) |
| Random Forest | Tabular, general | O(n·d·trees) |
| XGBoost | Competitions, accuracy | O(n·d·trees) |
| SVM | High-dim, small data | O(n²) |

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

classifiers = {
    'lr': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'rf': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    'xgb': XGBClassifier(n_estimators=100, eval_metric='logloss', random_state=42)
}

2. Regression Algorithms

| Algorithm | Best For | Key Param |
|-----------|----------|-----------|
| Ridge | Multicollinearity | alpha |
| Lasso | Feature selection | alpha |
| Random Forest | Non-linear | n_estimators |
| XGBoost | Best accuracy | learning_rate |
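
A minimal sketch mirroring the classifier dictionary above, using the estimators from this table (the alpha values are illustrative starting points, and X_train/y_train are assumed to hold a regression target here):

from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

regressors = {
    'ridge': Ridge(alpha=1.0),   # L2 penalty, stable under multicollinearity
    'lasso': Lasso(alpha=0.1),   # L1 penalty drives weak features to zero
    'rf': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Compare by cross-validated RMSE (the scorer is negated, so flip the sign)
for name, reg in regressors.items():
    rmse = -cross_val_score(reg, X_train, y_train, cv=5,
                            scoring='neg_root_mean_squared_error').mean()
    print(f"{name}: RMSE = {rmse:.4f}")

XGBRegressor follows the same pattern as the classifier in section 1.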

3. Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=50,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")

4. Handling Class Imbalance

| Technique | Implementation |
|-----------|----------------|
| Class Weights | class_weight='balanced' |
| SMOTE | imblearn.over_sampling.SMOTE() |
| Threshold Tuning | Adjust prediction threshold |

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
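
For the threshold-tuning row, a hedged sketch: fit on the training data, then sweep candidate thresholds over predicted probabilities on a validation split (X_val/y_val are assumptions, split off from the training data so the test set stays untouched):

import numpy as np
from sklearn.metrics import f1_score

pipeline.fit(X_train, y_train)
proba = pipeline.predict_proba(X_val)[:, 1]  # positive-class probabilities

# Pick the threshold that maximizes F1 on the validation split
thresholds = np.arange(0.05, 0.95, 0.05)
best_t = max(thresholds, key=lambda t: f1_score(y_val, (proba >= t).astype(int)))
print(f"Best threshold: {best_t:.2f}")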

5. Model Comparison

from sklearn.model_selection import cross_validate
import pandas as pd

def compare_models(models, X, y, cv=5):
    results = []
    for name, model in models.items():
        cv_results = cross_validate(
            model, X, y, cv=cv,
            scoring=['accuracy', 'f1_weighted', 'roc_auc_ovr_weighted'],
            return_train_score=True
        )
        results.append({
            'model': name,
            'train_acc': cv_results['train_accuracy'].mean(),
            'test_acc': cv_results['test_accuracy'].mean(),
            'test_f1': cv_results['test_f1_weighted'].mean(),
            'test_auc': cv_results['test_roc_auc_ovr_weighted'].mean()
        })
    return pd.DataFrame(results).round(4)
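
Assuming the classifiers dictionary from section 1 (all three expose predict_proba, which the ROC-AUC scorer requires), the comparison reads:

results = compare_models(classifiers, X_train, y_train)
print(results.sort_values('test_f1', ascending=False))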

Best Practices

DO

  • Start with a simple baseline
  • Use stratified splits for classification
  • Log all hyperparameters
  • Check for overfitting (train vs test gap)
  • Use early stopping for boosting
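
For the early-stopping item, a minimal sketch with XGBoost (in xgboost >= 2.0, early_stopping_rounds is a constructor argument; X_val/y_val are assumptions, carved out of the training data):

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

model = XGBClassifier(n_estimators=500, learning_rate=0.1,
                      early_stopping_rounds=20, eval_metric='logloss')
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(f"Stopped at iteration: {model.best_iteration}")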

DON'T

  • Don't tune on test set
  • Don't ignore class imbalance
  • Don't skip feature importance analysis
  • Don't use accuracy for imbalanced data

Exercises

Exercise 1: Model Selection

# TODO: Compare 3 different classifiers using cross-validation
# Report F1 score for each

Exercise 2: Hyperparameter Tuning

# TODO: Use RandomizedSearchCV to tune XGBoost
# Find optimal n_estimators, max_depth, learning_rate

Unit Test Template

import pytest
from sklearn.datasets import make_classification

def test_classifier_trains():
    """Test classifier can fit and predict."""
    X, y = make_classification(n_samples=100, random_state=42)
    model = get_classifier()

    model.fit(X[:80], y[:80])
    predictions = model.predict(X[80:])

    assert len(predictions) == 20
    assert set(predictions).issubset({0, 1})

def test_handles_imbalance():
    """Test model handles imbalanced classes."""
    X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=42)
    model = get_balanced_classifier()

    model.fit(X, y)
    predictions = model.predict(X)

    # Should predict both classes
    assert len(set(predictions)) == 2
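
The template assumes get_classifier and get_balanced_classifier factories exist elsewhere in the project; a minimal stub so the tests run:

from sklearn.ensemble import RandomForestClassifier

def get_classifier():
    return RandomForestClassifier(n_estimators=50, random_state=42)

def get_balanced_classifier():
    return RandomForestClassifier(n_estimators=50, class_weight='balanced',
                                  random_state=42)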

Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Overfitting | Model too complex | Reduce depth, add regularization |
| Underfitting | Model too simple | Increase complexity |
| Class imbalance | Skewed data | Use SMOTE or class weights |
| Slow training | Large data | Use LightGBM, reduce estimators |
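
For the slow-training row, a hedged LightGBM drop-in (LGBMClassifier follows the scikit-learn estimator API, so it slots into the pipelines above):

from lightgbm import LGBMClassifier

model = LGBMClassifier(n_estimators=100, num_leaves=31,
                       n_jobs=-1, random_state=42)
model.fit(X_train, y_train)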

Related Resources

  • Agent: 02-supervised-learning
  • Previous: ml-fundamentals
  • Next: clustering

Version: 1.4.0 | Status: Production Ready