Agent Skills: Phoenix Evals

Build and run evaluators for AI/LLM applications using Phoenix.

ID: arize-ai/phoenix/phoenix-evals

Repository

Arize-ai · License: NOASSERTION
9,073 stars · 779 forks

Install this agent skill locally:

pnpm dlx add-skill https://github.com/Arize-ai/phoenix/tree/HEAD/skills/phoenix-evals

Skill Files

Browse the full folder contents for phoenix-evals.

skills/phoenix-evals/SKILL.md

Skill Metadata

Name: phoenix-evals
Description: Build and run evaluators for AI/LLM applications using Phoenix.

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

Quick Reference

| Task | Files |
| ---- | ----- |
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-{python\|typescript} |
| Build LLM evaluator | evaluators-llm-{python\|typescript}, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment | experiments-running-{python\|typescript} |
| Create dataset | experiments-datasets-{python\|typescript} |
| Generate synthetic data | experiments-synthetic-{python\|typescript} |
| Validate evaluator accuracy | validation, validation-evaluators-{python\|typescript} |
| Sample traces for review | observe-sampling-{python\|typescript} |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |
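
For example, batch-evaluating a DataFrame with a pre-built LLM evaluator looks roughly like the sketch below. This is a minimal sketch assuming the `phoenix.evals` Python API (`llm_classify`, `OpenAIModel`, and the bundled hallucination template) and an `OPENAI_API_KEY` in the environment; the setup-python and evaluators-pre-built files are the authoritative reference.

```python
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row needs the columns the template references
# (input, reference, output for the hallucination template).
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source AI observability platform."],
        "output": ["Phoenix is a closed-source database."],
    }
)

rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())  # binary rails: factual / hallucinated
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,               # constrain the judge to the allowed labels
    provide_explanation=True,  # keep the judge's reasoning for error analysis
)
print(results[["label", "explanation"]])
```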

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}
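
The "code first" step in this workflow is often just a deterministic check, with no judge model involved. The following is an illustrative pure-Python sketch (the function name and pass/fail contract are assumptions, not a Phoenix API):

```python
import re

def contains_citation(output: str) -> dict:
    """Code evaluator: pass if the answer cites at least one source.

    Deterministic, fast, and free to run -- exhaust checks like this
    before reaching for an LLM judge. Binary pass/fail, not a 1-5 score.
    """
    # Hypothetical failure mode found during error analysis:
    # answers must cite a source like [1] or (source: ...).
    cited = bool(re.search(r"\[\d+\]|\(source:", output))
    return {"label": "pass" if cited else "fail", "score": float(cited)}

print(contains_citation("Phoenix traces LLM calls [1]."))  # {'label': 'pass', 'score': 1.0}
```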

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
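
For the retrieval leg, a plain code metric such as precision@k is usually sufficient; only faithfulness needs an LLM judge. A hypothetical sketch (names assumed, not a Phoenix API):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Code evaluator for retrieval: fraction of the top-k retrieved
    documents that are actually relevant. Deterministic -- no judge needed."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(doc in relevant_ids for doc in top_k) / len(top_k)

# 2 of the top 3 retrieved documents are relevant -> 0.666...
print(precision_at_k(["d1", "d7", "d3"], {"d1", "d3"}, k=3))
```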

Production: production-overview → production-guardrails → production-continuous

Rule Categories

| Prefix | Description |
| ------ | ----------- |
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |

Key Principles

| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
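
The ">80% TPR/TNR" bar means checking the judge's labels against human labels before trusting it. A minimal pure-Python sketch (label names are assumptions; see the validation-evaluators-* files for the real workflow):

```python
def tpr_tnr(judge_labels: list[str], human_labels: list[str]) -> tuple[float, float]:
    """True-positive and true-negative rates of an LLM judge against human
    ground truth. Both should clear ~0.8 before the judge is trusted."""
    pairs = list(zip(judge_labels, human_labels))
    # Treat "fail" as the positive class: the failures the judge should catch.
    positives = [j for j, h in pairs if h == "fail"]
    negatives = [j for j, h in pairs if h == "pass"]
    tpr = sum(j == "fail" for j in positives) / len(positives) if positives else 0.0
    tnr = sum(j == "pass" for j in negatives) / len(negatives) if negatives else 0.0
    return tpr, tnr

tpr, tnr = tpr_tnr(
    judge_labels=["fail", "pass", "fail", "pass"],
    human_labels=["fail", "pass", "pass", "pass"],
)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # TPR=1.00 TNR=0.67 -> judge not yet trustworthy
```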