Agent Skills: AI/LLM Development

ID: phrazzld/claude-config/ai-llm-development

Skill Metadata

  • Path: skills/ai-llm-development/SKILL.md
  • Name: ai-llm-development
  • Description: AI/LLM Development

Core Philosophy

Context Engineering > Prompt Engineering: Optimize the entire LLM configuration, not just the prompt wording.

Simplicity First: 80% of use cases need a single LLM call, not a multi-agent system.

Currency Over Memory: Models are deprecated within 6-12 months. Learn to find current ones via leaderboards instead of memorizing names.

Empiricism: Benchmarks guide; YOUR data decides. Test the top 3-5 models with your own prompts.

RESEARCH FIRST PROTOCOL

CRITICAL: Your training data is ALWAYS stale for LLM work. The field changes weekly.

Before ANY LLM-Related Action

  1. Identify what you're assuming: Model capabilities? API syntax? Best practices?
  2. Research using live tools (in order of preference):
    • WebSearch: "latest [model/provider] models"
    • Exa MCP: Get current documentation and examples
    • Gemini CLI: Verify against latest information with web grounding
  3. Verify your assumptions: Don't trust training data for:
    • Model names and versions (new models release monthly)
    • API syntax and parameters (providers update frequently)
    • Best practices and recommendations (evolve constantly)
    • Pricing and limits (change without notice)
    • Deprecation status (models removed regularly)

Research Query Templates

Model Selection:

  • "latest [provider] models"
  • "[model-name] release date and capabilities"
  • "is [model-name] deprecated or superseded"
  • "[provider] newest models announced"

API Syntax:

  • "[provider] API documentation [specific-feature]"
  • "[sdk-name] current version and usage"
  • "OpenRouter model ID format current"

Best Practices:

  • "[task] LLM best practices latest"
  • "current recommendations for [architecture pattern]"
  • "[framework] latest patterns and examples"

Red Flags That Trigger Mandatory Research

❌ Making assumptions about version numbers (a higher number, e.g. 3.0 vs 2.5, doesn't guarantee a newer or better model)
❌ Changing model defaults without verification
❌ Assuming API syntax from training data
❌ Selecting models based on memory of their capabilities
❌ Following "best practices" without checking whether they are still current
❌ Any action based on "I think..." or "probably..." for LLM topics

Research Before Action Checklist

Before committing any LLM-related change:

  • [ ] Searched for latest information on involved models/APIs
  • [ ] Verified current state vs. training data assumptions
  • [ ] Checked provider documentation for API syntax
  • [ ] Confirmed model is not deprecated or superseded
  • [ ] Validated best practices are still current
  • [ ] Tested configuration syntax in provider console/playground

Mantra: "When in doubt about LLM tech, RESEARCH. When certain about LLM tech, STILL RESEARCH."

Decision Trees

Model Selection

Task type → Find relevant benchmark → Check leaderboards → Test top 3 empirically
Coding: SWE-bench | Reasoning: GPQA | General: Arena Elo

See: references/model-selection.md

Architecture Complexity

1. Single LLM Call (start here - 80% stop here)
2. Sequential Calls (workflows)
3. LLM + Tools (function calling)
4. Agentic System (LLM controls flow)
5. Multi-Agent (only if truly needed)

See: references/architecture-patterns.md
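
Most work stops at rung 1. Below is a minimal sketch of a single-call setup with the Vercel AI SDK (the stack default later in this document); the provider and model ID are illustrative assumptions, so research current models before adopting them.

```typescript
// Rung 1: a single LLM call via the Vercel AI SDK.
// The model ID below is illustrative only; verify current models first.
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

export async function summarize(article: string): Promise<string> {
  const { text } = await generateText({
    model: openai('gpt-4o-mini'),
    system: 'You are a concise technical summarizer.',
    prompt: `Summarize the following article in 3 bullet points:\n\n${article}`,
  });
  return text;
}
```

Only climb the ladder when a measured failure of the simpler rung forces you to.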

Vector Storage

<1M vectors → Postgres pgvector or Convex
1-50M vectors → Postgres with pgvectorscale
>50M + <10ms p99 → Dedicated (Qdrant, Weaviate)
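
For the common <1M case, here is a minimal pgvector query sketch. It assumes a documents(id, content, embedding vector(1536)) table and an OpenAI embedding model; both are illustrative assumptions, not part of this skill's defaults.

```typescript
// Semantic search against a Postgres pgvector table.
// Table schema and embedding model are assumptions for illustration.
import { openai } from '@ai-sdk/openai';
import { embed } from 'ai';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function searchDocuments(query: string, limit = 5) {
  // Embed the query with the same model used to index the documents.
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: query,
  });

  // pgvector's <=> operator is cosine distance; lower means more similar.
  const { rows } = await pool.query(
    `SELECT id, content, embedding <=> $1 AS distance
       FROM documents
      ORDER BY embedding <=> $1
      LIMIT $2`,
    [JSON.stringify(embedding), limit],
  );
  return rows;
}
```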

Key Optimizations

  • Prompt Caching: 60-90% cost reduction. Static content first.
  • Structured Outputs: Native JSON Schema. Zero parsing failures.
  • Model Routing: Simple→cheap model, Complex→expensive model.
  • Hybrid RAG: Vector + keyword search = 15-25% better than pure vector.

See: references/prompt-engineering.md, references/production-checklist.md
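
Structured outputs are usually the easiest win of the four. A minimal sketch with the Vercel AI SDK and Zod follows; the ticket-triage schema and model ID are illustrative assumptions.

```typescript
// Structured output: the response is validated against a schema,
// so there is no hand-rolled JSON parsing left to fail.
import { openai } from '@ai-sdk/openai';
import { generateObject } from 'ai';
import { z } from 'zod';

// Hypothetical ticket-triage schema, for illustration only.
const TicketSchema = z.object({
  category: z.enum(['bug', 'feature', 'question']),
  priority: z.enum(['low', 'medium', 'high']),
  summary: z.string(),
});

export async function classifyTicket(ticket: string) {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'), // illustrative; verify current models
    schema: TicketSchema,
    prompt: `Classify this support ticket:\n\n${ticket}`,
  });
  return object; // typed as z.infer<typeof TicketSchema>
}
```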

Stack Defaults (TypeScript/Next.js)

  • SDK: Vercel AI SDK (streaming, React hooks, provider-agnostic)
  • Provider: OpenRouter (400+ models, easy A/B testing, fallbacks)
  • Vectors: Postgres pgvector (95% of use cases, $20-50/month)
  • Observability: Langfuse (self-hostable, generous free tier)
  • Evaluation: Promptfoo (CI/CD integration, security testing)
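
The SDK and provider defaults compose cleanly because OpenRouter exposes an OpenAI-compatible endpoint. A minimal streaming sketch follows; the model ID is an illustrative assumption, so check current IDs on openrouter.ai before use.

```typescript
// Stream a completion through OpenRouter using the Vercel AI SDK.
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

// OpenRouter is OpenAI-compatible, so the OpenAI provider works with a custom baseURL.
const openrouter = createOpenAI({
  baseURL: 'https://openrouter.ai/api/v1',
  apiKey: process.env.OPENROUTER_API_KEY,
});

export async function answer(prompt: string): Promise<void> {
  const result = await streamText({
    model: openrouter('anthropic/claude-3.5-sonnet'), // illustrative model ID
    prompt,
  });
  // Print tokens as they arrive.
  for await (const chunk of result.textStream) {
    process.stdout.write(chunk);
  }
}
```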

Quality Infrastructure

Production-grade LLM apps need:

  1. Model Gateway (OpenRouter, LiteLLM)

    • Multi-provider access
    • Fallback chains
    • Cost routing
    • See: llm-gateway-routing skill
  2. Evaluation & Testing (Promptfoo)

    • Regression testing in CI/CD
    • Security scanning (red team)
    • Quality gates
    • See: llm-evaluation skill
  3. Production Observability (Langfuse)

    • Full trace debugging
    • Cost tracking
    • Latency monitoring
    • See: langfuse-observability skill
  4. Quality Audit

    • Run the /llm-gates command to audit your LLM infrastructure
    • Identifies gaps in routing, testing, observability, security, and cost

Quick Setup

```bash
# Evaluation (Promptfoo)
npx promptfoo@latest init
npx promptfoo@latest eval

# Observability (Langfuse)
pnpm add langfuse
# Sign up at langfuse.com, add keys to .env

# Gateway (OpenRouter)
# Sign up at openrouter.ai, add OPENROUTER_API_KEY to .env
```
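
After the Langfuse install, the smallest useful integration is a trace wrapped around each call. A minimal sketch with the Langfuse TypeScript SDK follows; the trace and generation names are arbitrary, and keys are read from the environment.

```typescript
// Wrap an LLM call in a Langfuse trace so input, output, and latency are recorded.
import { Langfuse } from 'langfuse';

const langfuse = new Langfuse(); // reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from env

export async function tracedCall(
  prompt: string,
  runModel: (p: string) => Promise<string>, // your LLM call, e.g. a generateText wrapper
): Promise<string> {
  const trace = langfuse.trace({ name: 'answer-question' });
  const generation = trace.generation({ name: 'llm-call', input: prompt });

  const output = await runModel(prompt);

  generation.end({ output });
  await langfuse.flushAsync(); // ensure events are sent before the process exits
  return output;
}
```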

Quality Gate Standards

| Stage | Checks | Time Budget |
|-------|--------|-------------|
| Pre-commit | Prompt validation, secrets scan | < 5s |
| Pre-push | Regression suite, cost estimate | < 15s |
| CI/CD | Full eval, security scan, A/B comparison | < 5 min |
| Production | Traces, cost alerts, error monitoring | Continuous |

Scripts

  • scripts/validate_llm_config.py <dir> - Scan for LLM anti-patterns

References

  • references/model-selection.md - Leaderboards, search strategies, red flags
  • references/prompt-engineering.md - Caching, structured outputs, CoT, model-specific styles
  • references/architecture-patterns.md - Complexity ladder, RAG, tool use, caching
  • references/production-checklist.md - Cost, errors, security, observability, evaluation

Related Skills

  • llm-evaluation - Promptfoo setup, CI/CD integration, security testing
  • llm-gateway-routing - OpenRouter, LiteLLM, routing strategies
  • langfuse-observability - Tracing, cost tracking, production debugging

Related Commands

  • /llm-gates - Audit LLM infrastructure quality across 5 pillars
  • /observe - General observability audit (includes LLM section)

Live Research Tools

Use these BEFORE relying on training data:

  • WebSearch: Latest model releases, deprecations, best practices
  • Exa MCP (mcp__exa__web_search_exa): Current documentation and code examples
  • Gemini CLI (gemini): Sophisticated reasoning with Google Search grounding
  • Provider Playgrounds: OpenRouter, Google AI Studio, Anthropic Console

Research Flow:

  1. WebSearch for latest information
  2. Exa MCP for documentation and examples
  3. Gemini CLI for complex verification and comparison
  4. Provider playground for syntax testing