Agent Skills: ML Fundamentals Skill

Master machine learning foundations - algorithms, preprocessing, feature engineering, and evaluation

Tags: algorithms, data-preprocessing, feature-engineering, model-evaluation, ml-fundamentals, machine-learning
ID: pluginagentmarketplace/custom-plugin-machine-learning/ml-fundamentals


skills/ml-fundamentals/SKILL.md


ML Fundamentals Skill

Master the building blocks of machine learning: from raw data to trained models.

Quick Start

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# 1. Split data (X, y assumed already loaded as features and labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 3. Train and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.4f}")

Key Topics

1. Data Preprocessing

| Step | Purpose | Implementation |
|------|---------|----------------|
| Missing Values | Handle NaN/None | SimpleImputer(strategy='median') |
| Scaling | Normalize ranges | StandardScaler() or MinMaxScaler() |
| Encoding | Convert categories | OneHotEncoder() or OrdinalEncoder() (LabelEncoder is for targets only) |
| Outliers | Remove extremes | IQR method or Z-score |

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column types
numeric_features = ['age', 'income', 'score']
categorical_features = ['gender', 'city', 'category']

# Create preprocessor
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])

2. Feature Engineering

| Technique | Use Case | Example |
|-----------|----------|---------|
| Polynomial | Non-linear relationships | PolynomialFeatures(degree=2) |
| Binning | Discretize continuous | KBinsDiscretizer(n_bins=5) |
| Log Transform | Right-skewed data | np.log1p(x) |
| Interaction | Feature combinations | x1 * x2 |
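A minimal sketch of the techniques above on toy data (the feature values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer

# Toy data: second column is right-skewed
X = np.array([[1.0, 10.0], [2.0, 100.0], [3.0, 1000.0], [4.0, 10000.0]])

# Polynomial + interaction terms: adds x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # shape (4, 5): x1, x2, x1^2, x1*x2, x2^2

# Log transform for the right-skewed second column
X_log = np.log1p(X[:, 1])

# Binning a continuous feature into ordinal buckets
binner = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X[:, [0]])
```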

3. Model Evaluation

from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Cross-validation (model: any estimator, e.g. the pipeline from Quick Start)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Detailed per-class report (after fitting model on the training split)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

4. Cross-Validation Strategies

| Strategy | When to Use |
|----------|-------------|
| KFold | Standard, balanced data |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| GroupKFold | Grouped samples |
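A small sketch contrasting two of these splitters on toy data (labels and sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced toy labels

# StratifiedKFold preserves the class ratio in every fold
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # one positive sample per fold

# TimeSeriesSplit: training indices always precede test indices
tss = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()
```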

Best Practices

DO

  • Split data BEFORE any preprocessing
  • Use pipelines for reproducibility
  • Stratify splits for classification
  • Log all preprocessing parameters
  • Version your feature engineering code

DON'T

  • Don't fit on test data
  • Don't ignore data leakage
  • Don't use accuracy for imbalanced data
  • Don't hard-code parameters
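The two leakage rules above ("split before preprocessing", "don't fit on test data") can be sketched side by side; the data here is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# WRONG: fitting the scaler on all rows leaks test statistics into training
# X_scaled = StandardScaler().fit_transform(X)  # then split -> leakage

# RIGHT: split first, fit the scaler on the training split only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)    # statistics from train only
X_test_scaled = scaler.transform(X_test)  # reuse train mean/std
```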

Exercises

Exercise 1: Basic Pipeline

# TODO: Create a pipeline that:
# 1. Imputes missing values
# 2. Scales features
# 3. Trains a logistic regression
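One possible solution sketch (the estimator choices and names like `pipe` are illustrative, not the only answer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X[0, 0] = np.nan  # inject a missing value

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # 1. impute missing values
    ('scaler', StandardScaler()),                   # 2. scale features
    ('clf', LogisticRegression(max_iter=1000)),     # 3. logistic regression
])
pipe.fit(X, y)
print(f"Train accuracy: {pipe.score(X, y):.3f}")
```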

Exercise 2: Cross-Validation

# TODO: Implement 5-fold stratified CV
# and report mean and std of F1 score
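One possible solution sketch (the dataset and estimator are placeholders for your own):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data: ~80% of samples in the majority class
X, y = make_classification(n_samples=200, n_features=5,
                           weights=[0.8], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X, y, cv=cv, scoring='f1_weighted')
print(f"F1 (weighted): {scores.mean():.4f} +/- {scores.std():.4f}")
```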

Unit Test Template

import pytest
import numpy as np
from sklearn.datasets import make_classification

def test_preprocessing_pipeline():
    """Test that preprocessing handles missing values."""
    X, y = make_classification(n_samples=100, n_features=10, random_state=42)
    X[0, 0] = np.nan  # introduce a missing value

    # create_preprocessing_pipeline() is the factory under test (defined by you)
    pipeline = create_preprocessing_pipeline()
    X_transformed = pipeline.fit_transform(X)

    assert not np.isnan(X_transformed).any()
    assert X_transformed.shape[0] == X.shape[0]

def test_no_data_leakage():
    """Verify preprocessing statistics come from training data only."""
    X, _ = make_classification(n_samples=100, n_features=10, random_state=42)
    X_train, X_test = X[:80], X[80:]

    pipeline = create_preprocessing_pipeline()  # factory under test
    pipeline.fit(X_train)
    X_test_transformed = pipeline.transform(X_test)

    # The transform must reuse statistics learned from the training split
    # (assumes the pipeline exposes a StandardScaler step named 'scaler')
    scaler = pipeline.named_steps['scaler']
    np.testing.assert_allclose(scaler.mean_, X_train.mean(axis=0))

Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| NaN in prediction | Missing imputer | Add SimpleImputer to pipeline |
| Shape mismatch | Inconsistent features | Use ColumnTransformer |
| Memory error | Too many one-hot features | Use max_categories or feature hashing |
| High CV variance | Data leakage | Check preprocessing order |

Version: 1.4.0 | Status: Production Ready