Agent Skills: Model Serving Skill

Master model serving: inference optimization, scaling, deployment, and edge serving.




Model Serving Skill

Learn to deploy ML models for production inference, and to optimize them with quantization, batching, and auto-scaling.

Skill Overview

| Attribute | Value |
|-----------|-------|
| Bonded Agent | 05-model-serving |
| Difficulty | Intermediate to Advanced |
| Duration | 35 hours |
| Prerequisites | mlops-basics, training-pipelines |


Learning Objectives

  1. Deploy models with BentoML and Triton
  2. Optimize inference with quantization and ONNX
  3. Configure auto-scaling policies
  4. Implement batch and streaming inference
  5. Deploy to edge devices

Topics Covered

Module 1: Serving Platforms (8 hours)

Platform Comparison:

| Platform | Multi-framework | Dynamic Batching | Kubernetes |
|----------|-----------------|------------------|------------|
| TorchServe | PyTorch only | ✅ | ✅ |
| Triton | ✅ | ✅ | ✅ |
| BentoML | ✅ | ✅ | ✅ |
| Seldon | ✅ | ⚠️ | ✅ |


Module 2: BentoML Deployment (10 hours)

Service Definition:

```python
import bentoml
import numpy as np
import torch


@bentoml.service(resources={"gpu": 1, "memory": "4Gi"})
class ModelService:
    def __init__(self):
        # Load the latest model version from the local BentoML model store
        self.model = bentoml.pytorch.load_model("model:latest")
        self.model.eval()

    @bentoml.api(route="/predict")
    async def predict(self, input_array: np.ndarray) -> dict:
        with torch.no_grad():
            # Models expect tensors; cast to float32 to match typical weights
            predictions = self.model(torch.from_numpy(input_array).float())
        return {"predictions": predictions.tolist()}
```

Exercises:

  • [ ] Create BentoML service for your model
  • [ ] Containerize and deploy to Kubernetes
  • [ ] Configure traffic management

Module 3: Inference Optimization (10 hours)

Optimization Techniques:

```python
import torch

# Assumes `model` (an nn.Module) and `sample_input` (a sample tensor) exist.

# 1. Dynamic quantization: convert Linear layers to INT8 at load time
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# 2. ONNX export: trace the model with a sample input
torch.onnx.export(model, sample_input, "model.onnx")

# 3. TensorRT conversion for NVIDIA GPUs, e.g. via the trtexec CLI:
#    trtexec --onnx=model.onnx --saveEngine=model.plan
```
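
After the export above, the ONNX model can be served framework-free with ONNX Runtime. A minimal sketch, assuming the `onnxruntime` package and the `model.onnx` file produced above; the dummy input shape is a placeholder that must match the exported graph:

```python
import numpy as np
import onnxruntime as ort

# Load the exported graph and run a single inference on CPU
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 16).astype(np.float32)  # shape must match the export
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```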

Expected speedups (typical; actual gains are hardware- and model-dependent):

| Technique | Speedup | Accuracy Impact |
|-----------|---------|-----------------|
| FP16 | 2-3x | <1% |
| INT8 | 3-4x | 1-2% |
| TensorRT | 5-10x | <1% |
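
The FP16 row usually requires nothing more than a cast on a GPU. A sketch, assuming `model` and `sample_input` from the snippet above and an available CUDA device:

```python
import torch

# FP16 ("half precision") inference: cast weights and inputs on a CUDA device
model_fp16 = model.half().cuda().eval()
with torch.no_grad():
    output = model_fp16(sample_input.half().cuda())
```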


Module 4: Scaling & Monitoring (7 hours)

Kubernetes HPA:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2        # keep warm replicas to avoid cold starts
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
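
To verify the policy reacts under load, the HPA status can be read programmatically. A sketch using the official Kubernetes Python client, assuming a local kubeconfig and the `model-hpa` object above living in the `default` namespace:

```python
from kubernetes import client, config

# Read current vs. desired replica counts for the HPA defined above
config.load_kube_config()
hpa = client.AutoscalingV2Api().read_namespaced_horizontal_pod_autoscaler(
    name="model-hpa", namespace="default"
)
print(f"current={hpa.status.current_replicas} desired={hpa.status.desired_replicas}")
```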

Code Templates

Template: Production Serving

```python
# templates/serving.py
from fastapi import FastAPI
import numpy as np
import torch

app = FastAPI()


class ProductionServer:
    def __init__(self, model_path: str):
        # TorchScript models load without the original Python class definition
        self.model = torch.jit.load(model_path)
        self.model.eval()

    def predict(self, inputs: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            # np.array defaults to float64; most models expect float32
            tensor = torch.from_numpy(inputs).float()
            outputs = self.model(tensor)
        return outputs.numpy()


server = ProductionServer("model.pt")


@app.post("/predict")
async def predict(data: dict):
    inputs = np.array(data["inputs"])
    predictions = server.predict(inputs)
    return {"predictions": predictions.tolist()}
```

Troubleshooting Guide

| Issue | Cause | Solution |
|-------|-------|----------|
| High latency | No optimization applied | Apply quantization, batching (see the sketch below) |
| Cold starts | Serverless scale-to-zero | Pre-warming, minimum replicas |
| OOM | Model too large for available memory | Optimize the model, reduce batch size |
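
For the latency row, server-side batching amortizes per-request overhead when traffic is concurrent. A minimal, framework-agnostic micro-batching sketch (hypothetical names throughout): buffer requests for a short window, run one batched forward pass, and fan the results back out. `predict_fn` would be a batched predictor such as `ProductionServer.predict` above:

```python
import asyncio
import numpy as np


class MicroBatcher:
    """Buffer concurrent requests, run one batched forward pass, fan out results."""

    def __init__(self, predict_fn, max_batch: int = 32, max_wait_ms: float = 10.0):
        self.predict_fn = predict_fn  # batched predict: (N, ...) -> (N, ...)
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, x: np.ndarray) -> np.ndarray:
        # Called per request; resolves when the batch containing x has run
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self) -> None:
        # Start once at app startup, e.g. asyncio.create_task(batcher.run())
        while True:
            items = [await self.queue.get()]  # block until the first request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(items) < self.max_batch:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            preds = self.predict_fn(np.stack([x for x, _ in items]))
            for (_, fut), pred in zip(items, preds):
                fut.set_result(pred)
```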


Version History

| Version | Date | Changes |
|---------|------|---------|
| 2.0.0 | 2024-12 | Production-grade with optimization |
| 1.0.0 | 2024-11 | Initial release |