Back to tags
Tag

Agent Skills with tag: llm-evaluation

10 skills match this tag. Use tags to discover related Agent Skills and explore similar workflows.

evaluating-llms-harness

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

llm-evaluationbenchmarkingacademic-benchmarkshuggingface
ovachiever
ovachiever
81

llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

llm-evaluationbenchmarkingautomated-metricshuman-feedback
ovachiever
ovachiever
81

evaluation-metrics

LLM evaluation frameworks, benchmarks, and quality metrics for production systems.

llm-evaluationevaluation-frameworkbenchmarksquality-metrics
pluginagentmarketplace
pluginagentmarketplace
1

prompt-engineering

Expert skill for prompt engineering and task routing/orchestration. Covers secure prompt construction, injection prevention, multi-step task orchestration, and LLM output validation for JARVIS AI assistant.

prompt-engineeringagent-orchestrationinjection-attacksllm-evaluation
martinholovsky
martinholovsky
92

llm-judge

LLM-as-judge methodology for comparing code implementations across repositories. Scores implementations on functionality, security, test quality, overengineering, and dead code using weighted rubrics. Used by /beagle:llm-judge command.

code-reviewquality-metricsllm-evaluationllm-integration
existential-birds
existential-birds
61

annotate

Create flexible annotation workflows for AI applications. Contains common tools to explore raw ai agent logs/transcripts, extract out relevant evaluation data, and llm-as-a-judge creation.

evaluation-pipelineslogsllm-evaluationchat-analysis
haizelabs
haizelabs
151

anti-slop

Comprehensive toolkit for detecting and eliminating "AI slop" - generic, low-quality AI-generated patterns in natural language, code, and design. Use when reviewing or improving content quality, preventing generic AI patterns, cleaning up existing content, or enforcing quality standards in writing, code, or design work.

review-checkpointsllm-evaluationquality-managementcode-cleanup
rand
rand
487

promptfoo-evaluation

Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".

llm-evaluationevaluation-frameworkrubric-creation
daymade
daymade
15713

llm-patterns

AI-first application patterns, LLM testing, prompt management

llmarchitecture-patternsprompt-engineeringllm-evaluation
alinaqi
alinaqi
28724

advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

machine-learningllm-evaluationevaluation-pipelinesrubric-creation
muratcankoylan
muratcankoylan
5,808463