llm-evaluation
model-evaluation
Evaluates machine learning models for performance, fairness, and reliability using appropriate metrics and validation techniques. Trigger keywords: model evaluation, metrics, accuracy, precision, recall, F1, ROC, AUC, cross-validation, ML testing.
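For illustration, a minimal sketch of cross-validated metric reporting, assuming scikit-learn and a binary classification task (neither is prescribed by the skill; the data and model here are placeholders):

```python
# Minimal sketch: cross-validated accuracy, precision, recall, F1, and ROC AUC.
# Synthetic data and LogisticRegression stand in for the user's dataset and model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation over the core classification metrics.
scores = cross_validate(
    model, X, y, cv=5,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)

for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
    values = scores[f"test_{metric}"]
    print(f"{metric}: mean={values.mean():.3f} std={values.std():.3f}")
```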
healthbench_evaluation
annotate
Create flexible annotation workflows for AI applications. Contains common tools to explore raw AI agent logs/transcripts, extract relevant evaluation data, and create LLM-as-a-judge evaluators.
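For illustration, a minimal sketch of the transcript-to-judge flow; the helper names (extract_eval_data, build_judge_prompt), the message format, and the rubric are assumptions, not the skill's actual tooling:

```python
from typing import Any

def extract_eval_data(transcript: list[dict[str, Any]]) -> dict[str, str]:
    """Pull the fields a judge needs from a raw agent transcript (assumed message format)."""
    user_turns = [m["content"] for m in transcript if m.get("role") == "user"]
    assistant_turns = [m["content"] for m in transcript if m.get("role") == "assistant"]
    return {
        "task": user_turns[0] if user_turns else "",
        "final_answer": assistant_turns[-1] if assistant_turns else "",
    }

def build_judge_prompt(sample: dict[str, str], rubric: str) -> str:
    """Render an LLM-as-a-judge prompt; the model call itself is left to the caller."""
    return (
        f"Rubric:\n{rubric}\n\n"
        f"Task:\n{sample['task']}\n\n"
        f"Answer to grade:\n{sample['final_answer']}\n\n"
        'Return JSON: {"score": 1-5, "rationale": "..."}'
    )

# Illustrative transcript standing in for a real agent log.
transcript = [
    {"role": "user", "content": "Summarize the incident report."},
    {"role": "assistant", "content": "Two services failed over; no data was lost."},
]
print(build_judge_prompt(extract_eval_data(transcript), rubric="Answer is correct and grounded."))
```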
decision-critic
Invoke IMMEDIATELY via the Python script to stress-test decisions and reasoning. Do NOT analyze first; the script orchestrates the critique workflow.
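For illustration only, a hypothetical invocation; the actual script path and flags are defined by the skill and are not shown here:

```python
# Hypothetical invocation: "decision_critic.py" and "--decision" are placeholders,
# not the skill's real CLI. The point is that the script is run directly, before any analysis.
import subprocess

subprocess.run(
    ["python", "decision_critic.py", "--decision", "Migrate the service to the new queue"],
    check=True,
)
```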
advanced-evaluation
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
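For illustration, a minimal sketch of pairwise comparison with position-bias mitigation by order swapping; judge_call is a placeholder for the actual judge model call:

```python
def judge_call(prompt: str) -> str:
    """Placeholder judge. Replace with a real LLM call; must return 'A' or 'B'."""
    return "A"

def pairwise_compare(question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with the answer order swapped; only a consistent verdict counts."""
    prompt = (
        "Question: {q}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    first = judge_call(prompt.format(q=question, a=answer_1, b=answer_2))
    second = judge_call(prompt.format(q=question, a=answer_2, b=answer_1))

    # Map the swapped round back to the original answer labels.
    second_unswapped = "A" if second == "B" else "B"
    if first == second_unswapped:
        return first   # consistent verdict across both orderings
    return "tie"       # disagreement indicates position bias; treat as a tie

print(pairwise_compare("What is 2+2?", "4", "5"))
```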