Entry point: /faion-net - invoke this skill for automatic routing to the appropriate domain.
ML Ops Skill
Communication: User's language. Code: English.
Purpose
Handles ML model operations. Covers fine-tuning, evaluation, cost management, and observability.
Context Discovery
Auto-Investigation
Check these project signals before asking questions:
| Signal | Where to Check | What to Look For |
|--------|----------------|------------------|
| Dependencies | requirements.txt | transformers, peft, openai, tiktoken, langsmith |
| Training data | /data, /datasets | JSONL files for fine-tuning |
| Logs/metrics | Grep for "langsmith", "wandb", "mlflow" | Existing observability tools |
| Cost tracking | Grep for "tiktoken", "count_tokens" | Token counting implementation |
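A minimal sketch of this auto-investigation step, assuming a conventional project layout; the directory names, file patterns, and keyword lists are illustrative assumptions, not part of the skill contract.

```python
from pathlib import Path

# Keywords signalling existing ML tooling (illustrative, extend as needed)
DEPENDENCY_SIGNALS = ["transformers", "peft", "openai", "tiktoken", "langsmith"]
OBSERVABILITY_SIGNALS = ["langsmith", "wandb", "mlflow"]

def discover_project_signals(root: str = ".") -> dict:
    """Collect the signals from the table above before asking discovery questions."""
    root_path = Path(root)
    signals = {"dependencies": [], "training_data": [], "observability": []}

    # Dependencies: scan requirements.txt for known ML packages
    req = root_path / "requirements.txt"
    if req.exists():
        text = req.read_text().lower()
        signals["dependencies"] = [d for d in DEPENDENCY_SIGNALS if d in text]

    # Training data: look for JSONL files in common dataset directories
    for folder in ("data", "datasets"):
        signals["training_data"] += [str(p) for p in (root_path / folder).glob("*.jsonl")]

    # Observability: grep source files for known tool names
    for py_file in root_path.rglob("*.py"):
        content = py_file.read_text(errors="ignore").lower()
        signals["observability"] += [t for t in OBSERVABILITY_SIGNALS if t in content]

    signals["observability"] = sorted(set(signals["observability"]))
    return signals
```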
Discovery Questions
question: "What ML operation are you working on?"
header: "Operation Type"
multiSelect: false
options:
- label: "Fine-tuning LLM"
description: "Custom model training (OpenAI API, LoRA, QLoRA)"
- label: "Model evaluation"
description: "Benchmark performance, LLM-as-judge"
- label: "Cost optimization"
description: "Reduce API costs, prompt caching, batching"
- label: "Observability/monitoring"
description: "Track LLM usage, traces, performance"
question: "For fine-tuning: dataset size and approach?"
header: "Fine-tuning Strategy"
multiSelect: false
options:
- label: "<100 examples - use few-shot prompting instead"
description: "Too small for fine-tuning, improve prompts"
- label: "100-1000 examples - OpenAI fine-tuning"
description: "Use OpenAI API fine-tuning endpoint"
- label: ">1000 examples - LoRA/QLoRA"
description: "Efficient parameter fine-tuning"
- label: "Not fine-tuning"
description: "Skip this question"
question: "Which observability tools?"
header: "Monitoring Stack"
multiSelect: true
options:
- label: "LangSmith (recommended)"
description: "LangChain native tracing"
- label: "Langfuse (open-source)"
description: "Self-hosted observability"
- label: "Custom logging"
description: "Build custom tracking"
- label: "None yet"
description: "Starting from scratch"
Scope
| Area | Coverage |
|------|----------|
| Fine-tuning | LoRA, QLoRA, OpenAI fine-tuning, datasets |
| Evaluation | Metrics, benchmarks, frameworks |
| Cost Optimization | Token management, caching, batch APIs |
| Observability | LLM monitoring, tracing, logging |
Quick Start
| Task | Files |
|------|-------|
| Fine-tune OpenAI | fine-tuning-openai-basics.md → fine-tuning-openai-production.md |
| Fine-tune LoRA | lora-qlora.md → finetuning-basics.md |
| Cost optimization | llm-cost-basics.md → cost-reduction-strategies.md |
| Evaluation | evaluation-metrics.md → evaluation-framework.md |
| Observability | llm-observability.md → llm-observability-stack-2026.md |
Methodologies (14)
Fine-tuning (5):
- finetuning-basics: Fundamentals, when to fine-tune
- finetuning-datasets: Data preparation, quality
- fine-tuning-openai-basics: OpenAI API fine-tuning
- fine-tuning-openai-production: Production deployment
- lora-qlora: Efficient fine-tuning, parameter selection
Evaluation (3):
- evaluation-metrics: Accuracy, F1, perplexity, task metrics
- evaluation-framework: LLM-as-judge, human eval
- evaluation-benchmarks: MMLU, HumanEval, industry benchmarks
Cost Optimization (2):
- llm-cost-basics: Token counting, pricing models
- cost-reduction-strategies: Caching, compression, batching
Observability (4):
- llm-observability: Fundamentals, why monitor
- llm-observability-stack: Tools selection
- llm-observability-stack-2026: Latest tools (LangSmith, Langfuse)
- llm-management-observability: End-to-end management
Code Examples
OpenAI Fine-tuning
```python
import time

from openai import OpenAI

client = OpenAI()

# Upload training data (JSONL with one chat-format example per line)
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3}
)

# Poll until the job reaches a terminal state
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.status, job.fine_tuned_model)
```
LoRA Fine-tuning
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model (gated on Hugging Face; requires an accepted license and access token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```
Cost Tracking
```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def estimate_cost(prompt: str, completion: str, model: str = "gpt-4o") -> float:
    """Rough cost estimate in USD; verify current pricing before relying on it."""
    prompt_tokens = count_tokens(prompt, model)
    completion_tokens = count_tokens(completion, model)
    # GPT-4o list pricing assumed: $5 per 1M input tokens, $15 per 1M output tokens
    prompt_cost = prompt_tokens * 0.000005
    completion_cost = completion_tokens * 0.000015
    return prompt_cost + completion_cost
```
LLM Observability with LangSmith
```python
from langsmith import traceable

@traceable
def rag_pipeline(query: str) -> str:
    # retrieve() and generate() are placeholders for your own retrieval and LLM calls
    docs = retrieve(query)            # Retrieval
    response = generate(query, docs)  # Generation
    return response
```
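LLM-as-Judge Evaluation
The evaluation methodologies above mention LLM-as-judge scoring; here is a minimal sketch assuming an OpenAI model is used as the judge. The rubric, judge model, and score parsing are illustrative assumptions, not a prescribed framework.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the answer to the question on a 1-5 scale for factual accuracy.
Question: {question}
Answer: {answer}
Respond with only the number."""

def judge_answer(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    # Temperature 0 keeps the judge's scores as deterministic as possible
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Example: score one (question, answer) pair produced by the model under test
score = judge_answer("What is the capital of France?", "Paris is the capital of France.")
```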
Fine-tuning Decision Matrix
| Scenario | Approach |
|----------|----------|
| Small dataset (<100 examples) | Few-shot prompting |
| Medium dataset (100-1000) | OpenAI fine-tuning |
| Large dataset (>1000) | LoRA/QLoRA |
| Custom behavior | Fine-tuning |
| New knowledge | RAG (not fine-tuning) |
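A compact translation of the matrix into code, useful for automatic routing; the function name, goal labels, and thresholds mirror the table above and are heuristics, not hard rules.

```python
def choose_adaptation_approach(num_examples: int, goal: str = "custom_behavior") -> str:
    """Map the decision matrix above to a recommended approach."""
    if goal == "new_knowledge":
        return "RAG"  # retrieval, not fine-tuning, for injecting fresh facts
    if num_examples < 100:
        return "few-shot prompting"
    if num_examples <= 1000:
        return "OpenAI fine-tuning"
    return "LoRA/QLoRA"
```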
Cost Reduction Strategies
| Strategy | Savings | Trade-off |
|----------|---------|-----------|
| Prompt caching | 90% on cached tokens | Cold start cost |
| Batch API | 50% | Up to 24h latency |
| Smaller models | 80%+ | Lower quality |
| Context pruning | Variable | May lose context |
| Output limits | Variable | Truncated responses |
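The Batch API row trades latency for roughly half the cost. A minimal sketch using the OpenAI Batch API follows; the file names, model choice, and request contents are illustrative assumptions.

```python
import json

from openai import OpenAI

client = OpenAI()

# Each line of the input file is one request; batch requests are priced ~50% lower
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": q}]},
    }
    for i, q in enumerate(["Summarize document A", "Summarize document B"])
]

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# Upload the batch file and submit the job; results arrive within the 24h window
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```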
Evaluation Frameworks
| Framework | Use Case |
|-----------|----------|
| LangSmith | Production monitoring, traces |
| Langfuse | Open-source observability |
| PromptLayer | Prompt versioning |
| Weights & Biases | Experiment tracking |
Related Skills
| Skill | Relationship |
|-------|--------------|
| faion-llm-integration | Provides APIs to optimize |
| faion-rag-engineer | RAG evaluation |
| faion-devops-engineer | Model deployment |
ML Ops v1.0 | 14 methodologies