Agent Skills: ML/CV Specialist

Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features.

ID: alirezarezvani/claude-cto-team/ml-cv-specialist

skills/ml-cv-specialist/SKILL.md

ML/CV Specialist

Provides specialized guidance for machine learning and computer vision system design, model selection, and production deployment.

When to Use

  • Selecting ML models for specific use cases
  • Designing training and inference pipelines
  • Optimizing ML system performance and cost
  • Evaluating build vs. API for ML capabilities
  • Planning data pipelines for ML workloads

ML System Design Framework

Model Selection Decision Tree

Use Case Identified
    │
    ├─► Text/Language Tasks
    │   ├─► Classification → BERT, DistilBERT, or API (OpenAI, Claude)
    │   ├─► Generation → GPT-4, Claude, Llama (self-hosted)
    │   ├─► Embeddings → OpenAI Ada, sentence-transformers
    │   └─► Search/RAG → Vector DB + Embeddings + LLM
    │
    ├─► Computer Vision Tasks
    │   ├─► Classification → ResNet, EfficientNet, ViT
    │   ├─► Object Detection → YOLOv8, DETR, Faster R-CNN
    │   ├─► Segmentation → SAM, Mask R-CNN, U-Net
    │   ├─► OCR → Tesseract, PaddleOCR, Cloud Vision API
    │   └─► Face Recognition → InsightFace, DeepFace
    │
    ├─► Audio Tasks
    │   ├─► Speech-to-Text → Whisper, DeepSpeech, Cloud APIs
    │   ├─► Text-to-Speech → ElevenLabs, Coqui TTS
    │   └─► Audio Classification → PANNs, AudioSet models
    │
    └─► Structured Data
        ├─► Tabular → XGBoost, LightGBM, CatBoost
        ├─► Time Series → Prophet, ARIMA, Transformer-based
        └─► Recommendations → Two-tower, matrix factorization

API vs. Self-Hosted Decision

When to Use APIs

| Factor | API Preferred | Self-Hosted Preferred |
|--------|---------------|----------------------|
| Volume | < 10K requests/month | > 100K requests/month |
| Latency | > 500ms acceptable | < 100ms required |
| Customization | General use case | Domain-specific fine-tuning |
| Data Privacy | Non-sensitive data | PII, HIPAA, financial |
| Team Expertise | No ML engineers | ML team available |
| Budget | Predictable per-call costs | High volume justifies infra |

Cost Comparison Framework

## API Costs (Example: OpenAI GPT-4)
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
- Average request: 500 input + 200 output tokens
- Cost per request: $0.027
- 100K requests/month: $2,700

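## Self-Hosted Costs (Example: Llama 70B)
- GPU instance: $3/hour (A100 40GB; at FP16 a 70B model needs ~140GB for weights, so assume a quantized model or multiple GPUs)
- Throughput: ~50 requests/minute = 3K/hour
- Cost per request: $0.001 (at full utilization)
- 100K requests/month: ~$100 GPU time (about 33 hours) + $500 engineering time
- Always-on instance: 720 hours x $3 = $2,160/month regardless of volume

## Break-even Analysis
- < 50K requests/month: API likely cheaper
- > 50K requests/month: Self-hosted may be cheaper
- The threshold is approximate: it moves with GPU utilization (pay-per-use vs. always-on)
- Factor in: engineering time, ops burden, model quality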
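The arithmetic above can be reproduced with a short calculation. This is a minimal sketch using only the example figures from this section; the function names are illustrative, and the $500 engineering figure is treated as a fixed monthly cost (an assumption). With pay-per-use GPU billing these figures break even near 19K requests/month, and with an always-on GPU near 98K, which is why the 50K threshold should be read as a rough midpoint.

```python
def api_cost_per_request(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Per-request cost for a token-priced API."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


def self_hosted_monthly_cost(requests, gpu_hourly, requests_per_hour, fixed_monthly):
    """GPU cost for the hours actually used, plus a fixed monthly cost."""
    gpu_hours = requests / requests_per_hour
    return gpu_hours * gpu_hourly + fixed_monthly


def break_even_requests(api_per_request, gpu_hourly, requests_per_hour, fixed_monthly):
    """Monthly volume where API and pay-per-use self-hosting cost the same."""
    self_hosted_per_request = gpu_hourly / requests_per_hour
    return fixed_monthly / (api_per_request - self_hosted_per_request)


# Example figures from this section
api = api_cost_per_request(500, 200, 0.03, 0.06)
api_100k = api * 100_000
self_hosted_100k = self_hosted_monthly_cost(100_000, 3.0, 3000, 500)
pay_per_use_break_even = break_even_requests(api, 3.0, 3000, 500)

# Always-on GPU: 720 hours/month billed regardless of volume
always_on_break_even = (720 * 3.0 + 500) / api

print(f"API per request:         ${api:.3f}")
print(f"API at 100K:             ${api_100k:,.0f}")
print(f"Self-hosted at 100K:     ${self_hosted_100k:,.0f}")
print(f"Break-even (pay-per-use): {pay_per_use_break_even:,.0f} requests")
print(f"Break-even (always-on):   {always_on_break_even:,.0f} requests")
```

Swap in current prices and measured throughput before using the result for a real decision.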

Training Pipeline Architecture

Standard ML Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    DATA LAYER                                │
├─────────────────────────────────────────────────────────────┤
│  Data Sources → ETL → Feature Store → Training Data         │
│  (S3, DBs)     (Airflow)  (Feast)     (Versioned)          │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  TRAINING LAYER                              │
├─────────────────────────────────────────────────────────────┤
│  Experiment Tracking → Training Jobs → Model Registry       │
│  (MLflow, W&B)         (SageMaker)    (MLflow, S3)         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  SERVING LAYER                               │
├─────────────────────────────────────────────────────────────┤
│  Model Server → Load Balancer → Monitoring                  │
│  (TorchServe)   (K8s/ELB)      (Prometheus)                │
└─────────────────────────────────────────────────────────────┘

Component Selection Guide

| Component | Options | Recommendation |
|-----------|---------|----------------|
| Feature Store | Feast, Tecton, SageMaker | Feast (open source), Tecton (enterprise) |
| Experiment Tracking | MLflow, Weights & Biases, Neptune | MLflow (free), W&B (best UX) |
| Training Orchestration | Kubeflow, SageMaker, Vertex AI | SageMaker (AWS), Vertex (GCP) |
| Model Registry | MLflow, SageMaker, custom S3 | MLflow (standard) |
| Model Serving | TorchServe, TFServing, Triton | Triton (multi-framework) |


Inference Architecture Patterns

Pattern 1: Synchronous API

Best for: Low-latency requirements, simple integration

Client → API Gateway → Model Server → Response
                           │
                      Load Balancer
                           │
                    ┌──────┴──────┐
                    │             │
                Model Pod    Model Pod

Latency targets:

  • P50: < 100ms
  • P95: < 300ms
  • P99: < 500ms
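One way to check measured latencies against these targets is a nearest-rank percentile over a window of samples. This is a minimal sketch; the function names and report shape are illustrative, and production systems usually read these percentiles from a metrics backend rather than computing them in application code.

```python
# Targets from the list above, in milliseconds
TARGETS_MS = {50: 100, 95: 300, 99: 500}


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_latency(samples_ms, targets=TARGETS_MS):
    """Compare observed percentiles with their targets."""
    report = {}
    for pct, limit_ms in targets.items():
        observed = percentile(samples_ms, pct)
        report[f"p{pct}"] = {
            "observed_ms": observed,
            "target_ms": limit_ms,
            "ok": observed < limit_ms,
        }
    return report


healthy = check_latency(list(range(1, 101)))      # 1ms .. 100ms
slow_tail = check_latency([10] * 90 + [1000] * 10)  # 10% of requests take 1s
```

The second sample set shows why P95 and P99 matter: its median looks healthy while the tail misses both targets.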

Pattern 2: Asynchronous Processing

Best for: Long-running inference, batch processing

Client → API → Queue (SQS) → Worker → Result Store → Webhook/Poll
                                          │
                                     S3/Redis

Use when:

  • Inference > 5 seconds
  • Batch processing required
  • Variable load patterns
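The submit/poll flow can be sketched in-process. In this hedged example an `asyncio.Queue` stands in for SQS and a dict stands in for the S3/Redis result store; `slow_inference` and `AsyncInferenceService` are hypothetical names, not part of any library.

```python
import asyncio
import uuid


async def slow_inference(payload):
    """Stand-in for a long-running model call (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"label": "ok", "input": payload}


class AsyncInferenceService:
    def __init__(self):
        self.queue = asyncio.Queue()  # stands in for SQS
        self.results = {}             # stands in for S3/Redis

    async def submit(self, payload):
        """Enqueue a job and return its ID immediately."""
        job_id = str(uuid.uuid4())
        self.results[job_id] = {"status": "pending"}
        await self.queue.put((job_id, payload))
        return job_id

    def poll(self, job_id):
        """Return the current job state without blocking."""
        return self.results.get(job_id, {"status": "unknown"})

    async def worker(self):
        """Pull jobs off the queue and store their results."""
        while True:
            job_id, payload = await self.queue.get()
            try:
                output = await slow_inference(payload)
                self.results[job_id] = {"status": "done", "output": output}
            except Exception as exc:
                self.results[job_id] = {"status": "failed", "error": str(exc)}
            finally:
                self.queue.task_done()


async def demo():
    service = AsyncInferenceService()
    worker = asyncio.create_task(service.worker())
    job_id = await service.submit("frame-001")
    first = service.poll(job_id)   # polled before the worker finishes
    await service.queue.join()     # wait until the job is processed
    worker.cancel()
    return first, service.poll(job_id)


first_state, final_state = asyncio.run(demo())
```

The client never waits on inference itself: it gets a job ID at once and either polls or receives a webhook when the result lands in the store.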

Pattern 3: Edge Inference

Best for: Privacy, offline capability, ultra-low latency

┌─────────────────────────────────────────┐
│              EDGE DEVICE                 │
│  ┌─────────┐    ┌─────────────────────┐ │
│  │ Camera  │───▶│ Optimized Model     │ │
│  └─────────┘    │ (ONNX, TFLite)      │ │
│                 └─────────────────────┘ │
│                          │              │
│                     Local Result        │
└─────────────────────────────────────────┘
                           │
                    Sync to Cloud
                    (non-blocking)

Model optimization for edge:

  • Quantization (INT8): 4x smaller than FP32, typically 2-3x faster on hardware with INT8 support
  • Pruning: 50-90% sparsity possible
  • Distillation: Smaller model, similar accuracy
  • ONNX/TFLite: Optimized runtime
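The size reduction from INT8 quantization comes from storing each weight in 1 byte instead of 4. The sketch below shows symmetric per-tensor quantization in plain Python to make the mechanism concrete; it is an illustration under simplifying assumptions, not what ONNX Runtime or TFLite do internally (they add calibration and per-channel scales, among other things).

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0, 0.91, -0.77]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
print(f"scale={scale:.5f}, max_error={max_error:.5f}, size ratio={fp32_bytes // int8_bytes}x")
```

The accuracy cost depends on how the rounding error interacts with the model, so quantized models should be re-evaluated on a validation set.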

Computer Vision Pipeline Design

Real-Time Video Processing

Camera Stream → Frame Extraction → Preprocessing → Model → Postprocessing → Output
     │              │                   │            │           │
   RTSP/         1-30 FPS           Resize,      Batch or    NMS, tracking,
   WebRTC                           normalize    single       annotation

Performance optimization:

  • Process every Nth frame (skip frames)
  • Resize to model input size early
  • Batch frames when latency allows
  • Use GPU preprocessing (NVIDIA DALI)
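Frame skipping and batching can be expressed as two small generators. This is a minimal sketch that works on any iterable of frames; the helper names are illustrative, and a real pipeline would wrap a decoder such as OpenCV or FFmpeg around it.

```python
def every_nth(frames, n):
    """Yield (index, frame) for every n-th frame, starting with the first."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for index, frame in enumerate(frames):
        if index % n == 0:
            yield index, frame


def in_batches(items, batch_size):
    """Group items into lists of batch_size; the last batch may be smaller."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# 30 placeholder frames (one second at 30 FPS), keep every 5th, batch by 4
frames = [f"frame-{i}" for i in range(30)]
sampled = list(every_nth(frames, 5))
batches = list(in_batches(sampled, 4))
```

Sampling every 5th frame of a 30 FPS stream gives the model 6 frames per second; whether that is enough depends on how fast the objects of interest move.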

Object Detection System

## Pipeline Components

1. **Input Processing**
   - Video decode: FFmpeg, OpenCV
   - Frame buffer: Ring buffer for temporal context
   - Preprocessing: NVIDIA DALI (GPU), OpenCV (CPU)

2. **Detection**
   - Model: YOLOv8 (speed), DETR (accuracy)
   - Batch size: 1-8 depending on latency requirements
   - Confidence threshold: 0.5-0.7 typical

3. **Post-processing**
   - NMS (Non-Maximum Suppression)
   - Tracking: SORT, DeepSORT, ByteTrack
   - Smoothing: Kalman filter for stable boxes

4. **Output**
   - Annotations: Bounding boxes, labels, confidence
   - Events: Trigger on detection (webhook, queue)
   - Storage: Frame + metadata to S3/DB
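The confidence threshold and NMS steps from the pipeline can be sketched in plain Python. This is a class-agnostic greedy NMS for illustration, assuming boxes in `(x1, y1, x2, y2)` format; detection libraries ship optimized, per-class versions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_threshold=0.5, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs; returns kept pairs, best score first."""
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),    # best box for object A
    ((1, 1, 11, 11), 0.8),    # overlapping duplicate of A
    ((50, 50, 60, 60), 0.7),  # separate object B
    ((0, 0, 10, 10), 0.3),    # below the confidence threshold
]
kept = nms(detections)
```

Here the duplicate of object A is suppressed because its IoU with the best box is 81/119, about 0.68, which exceeds the 0.5 threshold.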

LLM Integration Patterns

RAG (Retrieval-Augmented Generation)

User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
                              │
                         Vector DB
                       (Pinecone, Weaviate,
                        Chroma, pgvector)

Vector DB Selection:

| Database | Best For | Limitations |
|----------|----------|-------------|
| Pinecone | Managed, scale | Cost at scale |
| Weaviate | Self-hosted, features | Operational overhead |
| Chroma | Simple, local dev | Not for production scale |
| pgvector | PostgreSQL users | Performance at >1M vectors |
| Qdrant | Performance | Newer, smaller community |
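The retrieval half of the RAG flow can be sketched without any external service. In this example a bag-of-words counter stands in for a real embedding model and an in-memory list stands in for the vector DB; both are assumptions made to keep the sketch self-contained, and all function names are illustrative.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_docs):
    """Assemble the prompt that would be sent to the LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


DOCS = [
    "YOLOv8 is a fast object detection model",
    "Whisper transcribes speech to text",
    "pgvector adds vector search to PostgreSQL",
]

query = "which model does object detection"
top = retrieve(query, DOCS, k=1)
prompt = build_prompt(query, top)
```

In a real system `embed` calls an embedding model and `retrieve` becomes a vector DB query; the prompt assembly step stays essentially the same.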

LLM Serving Architecture

┌─────────────────────────────────────────────────────────────┐
│                    API GATEWAY                               │
│  Rate limiting, auth, request routing                       │
└─────────────────────────────────────────────────────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
              ▼             ▼             ▼
         ┌────────┐   ┌────────┐   ┌────────┐
         │ GPT-4  │   │ Claude │   │ Local  │
         │  API   │   │  API   │   │ Llama  │
         └────────┘   └────────┘   └────────┘
                            │
                    Model Router
              (cost/latency/capability)

Multi-model strategy:

  • Simple queries → Cheaper model (GPT-3.5, Haiku)
  • Complex reasoning → Expensive model (GPT-4, Opus)
  • Sensitive data → Self-hosted (Llama, Mistral)
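The routing rules can be sketched as a small function. This is a hypothetical router: the `Request` fields and tier names are illustrative, and how a request gets flagged as sensitive or complex (a classifier, heuristics, caller hints) is left open. The sketch assumes the sensitivity rule wins when rules conflict.

```python
from dataclasses import dataclass

# Example models per tier, taken from the list above
TIER_MODELS = {
    "cheap": ["GPT-3.5", "Haiku"],
    "expensive": ["GPT-4", "Opus"],
    "self-hosted": ["Llama", "Mistral"],
}


@dataclass
class Request:
    prompt: str
    contains_sensitive_data: bool = False
    needs_complex_reasoning: bool = False


def route(request):
    """Pick a model tier; sensitive data always stays on self-hosted models."""
    if request.contains_sensitive_data:
        return "self-hosted"
    if request.needs_complex_reasoning:
        return "expensive"
    return "cheap"


simple = route(Request("Summarize this paragraph"))
complex_ = route(Request("Plan a multi-step migration", needs_complex_reasoning=True))
private = route(Request("Review this patient record",
                        contains_sensitive_data=True,
                        needs_complex_reasoning=True))
```

Checking sensitivity first means a complex request with private data is never sent to an external API, at the cost of possibly weaker reasoning.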

Performance Optimization

GPU Memory Optimization

| Technique | Memory Reduction (vs. FP32) | Speed Impact |
|-----------|-----------------------------|--------------|
| FP16 (Half Precision) | 50% | Neutral to faster |
| INT8 Quantization | 75% | 10-20% slower (weight-only; can be faster with native INT8 kernels) |
| INT4 Quantization | 87.5% | 20-40% slower |
| Gradient Checkpointing | 60-80% (activation memory, training only) | 20-30% slower |
| Model Sharding | Distributed | Communication overhead |
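The precision rows follow directly from bytes per parameter. The sketch below estimates weight memory and checks whether a model fits on a GPU; the 20% overhead margin for activations and KV cache is an assumption chosen for illustration, and real overhead varies with batch size and context length.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions, precision, gpu_memory_gb, overhead_fraction=0.2):
    """True if weights plus an assumed overhead margin fit in GPU memory."""
    needed = weight_memory_gb(params_billions, precision) * (1 + overhead_fraction)
    return needed <= gpu_memory_gb


def reduction_vs_fp32(precision):
    """Fractional memory saving relative to FP32 weights."""
    return 1 - BYTES_PER_PARAM[precision] / BYTES_PER_PARAM["fp32"]


for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {reduction_vs_fp32(precision):.1%} smaller, "
          f"70B weights = {weight_memory_gb(70, precision):.0f} GB")
```

For example, a 70B model at FP16 needs 140 GB for weights alone, so it only fits on 80 GB or smaller GPUs after quantization or sharding.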

Batching Strategies

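# Dynamic batching sketch: flush when the batch is full or after max_wait_ms
import asyncio

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model  # must expose: async predict_batch(list) -> list
        self.queue = []     # pending (request, future) pairs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms
        self.timer = None   # flush task for the current partial batch

    async def add_request(self, request):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((request, future))

        if len(self.queue) >= self.max_batch:
            await self.process_batch()  # batch is full: flush now
        elif self.timer is None:
            # First request of a new batch starts the timeout
            self.timer = asyncio.create_task(self._flush_after_wait())
        return await future  # each caller receives only its own result

    async def _flush_after_wait(self):
        await asyncio.sleep(self.max_wait / 1000)
        self.timer = None
        await self.process_batch()

    async def process_batch(self):
        batch = self.queue[:self.max_batch]
        self.queue = self.queue[self.max_batch:]
        if not batch:
            return
        try:
            results = await self.model.predict_batch([req for req, _ in batch])
        except Exception as exc:
            for _, future in batch:
                future.set_exception(exc)  # propagate the failure to every caller
            return
        for (_, future), result in zip(batch, results):
            future.set_result(result)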

Model Monitoring

Key Metrics to Track

| Metric | What It Measures | Alert Threshold |
|--------|------------------|-----------------|
| Latency (P95) | Response time | > 2x baseline |
| Throughput | Requests/second | < 80% capacity |
| Error Rate | Failed predictions | > 1% |
| Model Drift | Distribution shift | PSI > 0.2 |
| Data Quality | Input anomalies | > 5% anomalies |

Drift Detection

Training Distribution ──┐
                        ├──► Statistical Test ──► Alert
Production Distribution ─┘
                         (PSI, KS test, JS divergence)

Population Stability Index (PSI):

  • PSI < 0.1: No significant change
  • 0.1 < PSI < 0.2: Moderate change, monitor
  • PSI > 0.2: Significant change, investigate
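PSI compares the binned proportions of a feature in training and production data: for each bin, take (actual - expected) * ln(actual / expected) and sum over bins. The sketch below is a minimal pure-Python version, assuming equal-width bins derived from the training data and a small epsilon to avoid division by zero in empty bins; both choices are assumptions, and quantile bins are a common alternative.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(count / len(values), eps) for count in counts]

    expected_pct = proportions(expected)
    actual_pct = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected_pct, actual_pct))


def drift_status(value):
    """Map a PSI value to the bands listed above."""
    if value < 0.1:
        return "no significant change"
    if value < 0.2:
        return "moderate change, monitor"
    return "significant change, investigate"


training = list(range(100))
same = psi(training, training)
shifted = psi(training, [v + 50 for v in training])
```

Identical samples give a PSI of zero, while shifting every value by half the training range pushes PSI far above 0.2.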

Quick Reference Tables

Model Selection by Use Case

| Use Case | Recommended Model | Latency | Cost |
|----------|-------------------|---------|------|
| Text Classification | DistilBERT | 10ms | Low |
| Text Generation | GPT-4 / Claude | 1-5s | Medium |
| Image Classification | EfficientNet-B0 | 5ms | Low |
| Object Detection | YOLOv8-n | 10ms | Low |
| Object Detection (Accurate) | YOLOv8-x | 50ms | Medium |
| Semantic Segmentation | SAM | 100ms | Medium |
| Speech-to-Text | Whisper-base | Real-time | Low |
| Embeddings | text-embedding-ada-002 | 50ms | Low |

Infrastructure Sizing

| Scale | GPU | Model Size | Throughput |
|-------|-----|------------|------------|
| Development | T4 (16GB) | < 7B params | 10-50 req/s |
| Production Small | A10G (24GB) | < 13B params | 50-100 req/s |
| Production Medium | A100 (40GB) | < 70B params | 100-500 req/s |
| Production Large | A100 (80GB) x 2+ | > 70B params | 500+ req/s |

Model sizes near the upper limits assume quantized weights: at FP16 (2 bytes per parameter), weights alone take about 14GB for 7B parameters and 140GB for 70B.

