Agent Skills: AI/LLM Development

ID: phrazzld/claude-config/ai-llm-development

Skill Metadata

  • Path: skills/ai-llm-development/SKILL.md
  • Name: ai-llm-development
  • Description: AI/LLM Development

Core Philosophy

Context Engineering > Prompt Engineering: Optimize the entire LLM configuration, not just the prompt wording.

Simplicity First: 80% of use cases need a single LLM call, not a multi-agent system.

Currency Over Memory: Models are deprecated within 6-12 months. Learn to find current ones via leaderboards instead of memorizing names.

Empiricism: Benchmarks guide; YOUR data decides. Test the top 3-5 models with your own prompts.

RESEARCH FIRST PROTOCOL

CRITICAL: Your training data is ALWAYS stale for LLM work. The field changes weekly.

Before ANY LLM-Related Action

  1. Identify what you're assuming: Model capabilities? API syntax? Best practices?
  2. Research using live tools (in order of preference):
    • WebSearch: "latest [model/provider] models"
    • Exa MCP: Get current documentation and examples
    • Gemini CLI: Verify against latest information with web grounding
  3. Verify your assumptions: Don't trust training data for:
    • Model names and versions (new models release monthly)
    • API syntax and parameters (providers update frequently)
    • Best practices and recommendations (evolve constantly)
    • Pricing and limits (change without notice)
    • Deprecation status (models removed regularly)

Research Query Templates

Model Selection:

  • "latest [provider] models"
  • "[model-name] release date and capabilities"
  • "is [model-name] deprecated or superseded"
  • "[provider] newest models announced"

API Syntax:

  • "[provider] API documentation [specific-feature]"
  • "[sdk-name] current version and usage"
  • "OpenRouter model ID format current"

Best Practices:

  • "[task] LLM best practices latest"
  • "current recommendations for [architecture pattern]"
  • "[framework] latest patterns and examples"

Red Flags That Trigger Mandatory Research

❌ Making assumptions about version numbers (a higher number, e.g. 3.0 vs 2.5, doesn't guarantee a newer or better model)
❌ Changing model defaults without verification
❌ Assuming API syntax from training data
❌ Selecting models based on memory of their capabilities
❌ Following "best practices" without checking whether they are still current
❌ Any action based on "I think..." or "probably..." for LLM topics

Research Before Action Checklist

Before committing any LLM-related change:

  • [ ] Searched for latest information on involved models/APIs
  • [ ] Verified current state vs. training data assumptions
  • [ ] Checked provider documentation for API syntax
  • [ ] Confirmed model is not deprecated or superseded
  • [ ] Validated best practices are still current
  • [ ] Tested configuration syntax in provider console/playground

Mantra: "When in doubt about LLM tech, RESEARCH. When certain about LLM tech, STILL RESEARCH."

Decision Trees

Model Selection

Task type → Find relevant benchmark → Check leaderboards → Test top 3 empirically
Coding: SWE-bench | Reasoning: GPQA | General: Arena Elo

See: references/model-selection.md

Architecture Complexity

1. Single LLM Call (start here - 80% stop here)
2. Sequential Calls (workflows)
3. LLM + Tools (function calling)
4. Agentic System (LLM controls flow)
5. Multi-Agent (only if truly needed)

See: references/architecture-patterns.md
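
Most work stops at rung 1. Below is a minimal sketch of a single-call setup with the Vercel AI SDK (the stack default later in this document); the provider and model ID are illustrative assumptions, so research current models before adopting them.

```typescript
// Rung 1: a single LLM call via the Vercel AI SDK.
// The model ID below is illustrative only; verify current models first.
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

export async function summarize(article: string): Promise<string> {
  const { text } = await generateText({
    model: openai('gpt-4o-mini'),
    system: 'You are a concise technical summarizer.',
    prompt: `Summarize the following article in 3 bullet points:\n\n${article}`,
  });
  return text;
}
```

Only climb the ladder when a measured failure of the simpler rung forces you to.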

Vector Storage

<1M vectors → Postgres pgvector or Convex
1-50M vectors → Postgres with pgvectorscale
>50M + <10ms p99 → Dedicated (Qdrant, Weaviate)
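
For the common <1M case, here is a minimal pgvector query sketch. It assumes a documents(id, content, embedding vector(1536)) table and an OpenAI embedding model; both are illustrative assumptions, not part of this skill's defaults.

```typescript
// Semantic search against a Postgres pgvector table.
// Table schema and embedding model are assumptions for illustration.
import { openai } from '@ai-sdk/openai';
import { embed } from 'ai';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function searchDocuments(query: string, limit = 5) {
  // Embed the query with the same model used to index the documents.
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: query,
  });

  // pgvector's <=> operator is cosine distance; lower means more similar.
  const { rows } = await pool.query(
    `SELECT id, content, embedding <=> $1 AS distance
       FROM documents
      ORDER BY embedding <=> $1
      LIMIT $2`,
    [JSON.stringify(embedding), limit],
  );
  return rows;
}
```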

Key Optimizations

  • Prompt Caching: 60-90% cost reduction. Static content first.
  • Structured Outputs: Native JSON Schema. Zero parsing failures.
  • Model Routing: Simple→cheap model, Complex→expensive model.
  • Hybrid RAG: Vector + keyword search = 15-25% better than pure vector.

See: references/prompt-engineering.md, references/production-checklist.md
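
Structured outputs are usually the easiest win of the four. A minimal sketch with the Vercel AI SDK and Zod follows; the ticket-triage schema and model ID are illustrative assumptions.

```typescript
// Structured output: the response is validated against a schema,
// so there is no hand-rolled JSON parsing left to fail.
import { openai } from '@ai-sdk/openai';
import { generateObject } from 'ai';
import { z } from 'zod';

// Hypothetical ticket-triage schema, for illustration only.
const TicketSchema = z.object({
  category: z.enum(['bug', 'feature', 'question']),
  priority: z.enum(['low', 'medium', 'high']),
  summary: z.string(),
});

export async function classifyTicket(ticket: string) {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'), // illustrative; verify current models
    schema: TicketSchema,
    prompt: `Classify this support ticket:\n\n${ticket}`,
  });
  return object; // typed as z.infer<typeof TicketSchema>
}
```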

Stack Defaults (TypeScript/Next.js)

  • SDK: Vercel AI SDK (streaming, React hooks, provider-agnostic)
  • Provider: OpenRouter (400+ models, easy A/B testing, fallbacks)
  • Vectors: Postgres pgvector (95% of use cases, $20-50/month)
  • Observability: Langfuse (self-hostable, generous free tier)
  • Evaluation: Promptfoo (CI/CD integration, security testing)
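
The SDK and provider defaults compose cleanly because OpenRouter exposes an OpenAI-compatible endpoint. A minimal streaming sketch follows; the model ID is an illustrative assumption, so check current IDs on openrouter.ai before use.

```typescript
// Stream a completion through OpenRouter using the Vercel AI SDK.
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

// OpenRouter is OpenAI-compatible, so the OpenAI provider works with a custom baseURL.
const openrouter = createOpenAI({
  baseURL: 'https://openrouter.ai/api/v1',
  apiKey: process.env.OPENROUTER_API_KEY,
});

export async function answer(prompt: string): Promise<void> {
  const result = await streamText({
    model: openrouter('anthropic/claude-3.5-sonnet'), // illustrative model ID
    prompt,
  });
  // Print tokens as they arrive.
  for await (const chunk of result.textStream) {
    process.stdout.write(chunk);
  }
}
```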

Quality Infrastructure

Production-grade LLM apps need:

  1. Model Gateway (OpenRouter, LiteLLM)

    • Multi-provider access
    • Fallback chains
    • Cost routing
    • See: llm-gateway-routing skill
  2. Evaluation & Testing (Promptfoo)

    • Regression testing in CI/CD
    • Security scanning (red team)
    • Quality gates
    • See: llm-evaluation skill
  3. Production Observability (Langfuse)

    • Full trace debugging
    • Cost tracking
    • Latency monitoring
    • See: langfuse-observability skill
  4. Quality Audit

    • Run the /llm-gates command to audit your LLM infrastructure
    • Identifies gaps in routing, testing, observability, security, and cost

Quick Setup

```bash
# Evaluation (Promptfoo)
npx promptfoo@latest init
npx promptfoo@latest eval

# Observability (Langfuse)
pnpm add langfuse
# Sign up at langfuse.com, add keys to .env

# Gateway (OpenRouter)
# Sign up at openrouter.ai, add OPENROUTER_API_KEY to .env
```
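
After the Langfuse install, the smallest useful integration is a trace wrapped around each call. A minimal sketch with the Langfuse TypeScript SDK follows; the trace and generation names are arbitrary, and keys are read from the environment.

```typescript
// Wrap an LLM call in a Langfuse trace so input, output, and latency are recorded.
import { Langfuse } from 'langfuse';

const langfuse = new Langfuse(); // reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from env

export async function tracedCall(
  prompt: string,
  runModel: (p: string) => Promise<string>, // your LLM call, e.g. a generateText wrapper
): Promise<string> {
  const trace = langfuse.trace({ name: 'answer-question' });
  const generation = trace.generation({ name: 'llm-call', input: prompt });

  const output = await runModel(prompt);

  generation.end({ output });
  await langfuse.flushAsync(); // ensure events are sent before the process exits
  return output;
}
```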

Quality Gate Standards

| Stage | Checks | Time Budget |
|-------|--------|-------------|
| Pre-commit | Prompt validation, secrets scan | < 5s |
| Pre-push | Regression suite, cost estimate | < 15s |
| CI/CD | Full eval, security scan, A/B comparison | < 5 min |
| Production | Traces, cost alerts, error monitoring | Continuous |

Scripts

  • scripts/validate_llm_config.py <dir> - Scan for LLM anti-patterns

References

  • references/model-selection.md - Leaderboards, search strategies, red flags
  • references/prompt-engineering.md - Caching, structured outputs, CoT, model-specific styles
  • references/architecture-patterns.md - Complexity ladder, RAG, tool use, caching
  • references/production-checklist.md - Cost, errors, security, observability, evaluation

Related Skills

  • llm-evaluation - Promptfoo setup, CI/CD integration, security testing
  • llm-gateway-routing - OpenRouter, LiteLLM, routing strategies
  • langfuse-observability - Tracing, cost tracking, production debugging

Related Commands

  • /llm-gates - Audit LLM infrastructure quality across 5 pillars
  • /observe - General observability audit (includes LLM section)

Live Research Tools

Use these BEFORE relying on training data:

  • WebSearch: Latest model releases, deprecations, best practices
  • Exa MCP (mcp__exa__web_search_exa): Current documentation and code examples
  • Gemini CLI (gemini): Sophisticated reasoning with Google Search grounding
  • Provider Playgrounds: OpenRouter, Google AI Studio, Anthropic Console

Research Flow:

  1. WebSearch for latest information
  2. Exa MCP for documentation and examples
  3. Gemini CLI for complex verification and comparison
  4. Provider playground for syntax testing