
Agent Skills with tag: evaluation-framework

13 skills match this tag. Use tags to discover related Agent Skills and explore similar workflows.

scholar-evaluation

Systematic framework for evaluating scholarly and research work based on the ScholarEval methodology. This skill should be used when assessing research papers, evaluating literature reviews, scoring research methodologies, analyzing scientific writing quality, or applying structured evaluation criteria to academic work. Provides comprehensive assessment across multiple dimensions including problem formulation, literature review, methodology, data collection, analysis, results interpretation, and scholarly writing quality.

scholarly-assessment, literature-review, methodology-evaluation, scientific-writing
ovachiever
81

cognitive-baseline-eval

Execute the Joseph Cognitive Baseline v2.1 (JCB-v2.1) 5-Scenario Test Suite to quantify AI alignment, friction maintenance, and protocol adherence.

evaluation-framework, AI-alignment, protocol-adherence, scenario-testing
starwreckntx
1

evaluation-metrics

LLM evaluation frameworks, benchmarks, and quality metrics for production systems.

llm-evaluation, evaluation-framework, benchmarks, quality-metrics
pluginagentmarketplace
1

llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

llm, evaluation-framework, automated-metrics, human-feedback
camoneart
4
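
To make the llm-evaluation entry above concrete, here is a minimal sketch of the automated-metrics side: fixed test cases, a keyword-coverage metric, and an aggregate score. Every name in it (TestCase, keyword_coverage, run_model) is a placeholder chosen for illustration, not something defined by the skill itself.

```python
# Minimal sketch of an automated-metrics harness for LLM outputs.
# All names here are hypothetical; the llm-evaluation skill may structure this differently.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestCase:
    prompt: str
    expected_keywords: List[str]


def keyword_coverage(output: str, expected: List[str]) -> float:
    """Fraction of expected keywords found in the model output."""
    if not expected:
        return 1.0
    hits = sum(1 for kw in expected if kw.lower() in output.lower())
    return hits / len(expected)


def evaluate(run_model: Callable[[str], str], cases: List[TestCase]) -> float:
    """Run every test case and return the mean keyword-coverage score."""
    scores = [keyword_coverage(run_model(c.prompt), c.expected_keywords) for c in cases]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    cases = [TestCase("Name two automated LLM metrics.", ["exact match", "bleu"])]
    # Stub model call for demonstration; swap in a real LLM client here.
    print(evaluate(lambda p: "Common metrics include exact match and BLEU.", cases))
```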

dspy-evaluation-suite

Comprehensive evaluation metrics and testing framework for DSPy programs

dspy, evaluation-framework, metrics, test-framework
OmidZamani
131
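
For readers new to DSPy evaluation, the sketch below shows its usual shape: a metric function with the (example, pred, trace=None) signature and DSPy's Evaluate helper run over a dev set. The dev set, program, and model name are toy placeholders, and the dspy-evaluation-suite skill may wrap these pieces differently.

```python
# Hedged sketch of evaluating a DSPy program; the dev set and program are toys.
import dspy
from dspy.evaluate import Evaluate

# Requires an LM and credentials; the model name below is only an example.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))


def exact_match(example, pred, trace=None):
    """DSPy-style metric: compare the gold answer with the predicted answer."""
    return example.answer.strip().lower() == pred.answer.strip().lower()


devset = [dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question")]
program = dspy.Predict("question -> answer")  # minimal single-step program

evaluator = Evaluate(devset=devset, metric=exact_match, display_progress=True)
print(evaluator(program))
```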

quality-auditor

Comprehensive quality auditing and evaluation of tools, frameworks, and systems against industry best practices with detailed scoring across 12 critical dimensions

quality-gates, evaluation-framework, best-practices, scoring
daffy0208
55

llm-evaluation


llm, evaluation-pipelines, evaluation-framework, quality-metrics
phrazzld
21

Decision Frameworks

Decision-making methodologies, scoring frameworks, and planning strategies for Group 2 agents in a four-tier architecture

multi-agent-systems, evaluation-framework, agent-decision-making, planning-strategies
bejranonda
1111

evaluation-rubrics

Use when you need explicit quality criteria and scoring scales to evaluate work consistently, compare alternatives objectively, set acceptance thresholds, reduce subjective bias, or when the user mentions rubric, scoring criteria, quality standards, evaluation framework, inter-rater reliability, or grading/assessing work.

rubric-creation, evaluation-framework, quality-standards, acceptance-criteria
lyndonkl
82
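
As a rough illustration of what explicit criteria, scoring scales, weights, and an acceptance threshold can look like in practice, here is a small hypothetical example; the evaluation-rubrics skill defines its own criteria and format.

```python
# Hypothetical rubric: weighted criteria on a 1-5 scale plus an acceptance threshold.
from dataclasses import dataclass
from typing import Dict


@dataclass
class Criterion:
    name: str
    weight: float      # relative importance; weights should sum to 1.0
    description: str   # what a top score on this criterion means


RUBRIC = [
    Criterion("correctness", 0.5, "Claims are accurate and verifiable."),
    Criterion("clarity", 0.3, "Writing is unambiguous and well organized."),
    Criterion("completeness", 0.2, "All required sections are present."),
]
ACCEPTANCE_THRESHOLD = 4.0  # weighted average (1-5) needed to pass


def weighted_score(scores: Dict[str, int]) -> float:
    """Combine per-criterion scores (1-5) into a single weighted average."""
    return sum(c.weight * scores[c.name] for c in RUBRIC)


scores = {"correctness": 5, "clarity": 4, "completeness": 3}
total = weighted_score(scores)
print(total, "PASS" if total >= ACCEPTANCE_THRESHOLD else "FAIL")
```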

prompt-engineer

Use when designing prompts for LLMs, optimizing model performance, building evaluation frameworks, or implementing advanced prompting techniques like chain-of-thought, few-shot learning, or structured outputs.

prompt-engineering, prompt-generation, prompt-refinement, llm
Jeffallan
245

validation-report-generator

Generate structured 8-section validation reports with a verdict (GOOD/BAD/NEEDS MAJOR WORK), strengths, critical flaws, blindspots, and a concrete path forward. Use after strategic-cto-mentor has completed validation analysis and needs to produce the final deliverable.

reporting-guidelines, evaluation-framework, feedback
alirezarezvani
4110
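
A report like the one described above is easy to picture as a small data structure. The sketch below is only a guess at the shape: the entry names the verdict, strengths, critical flaws, blindspots, and path forward, while the remaining sections (summary, risks, open questions) are assumptions added to reach eight.

```python
# Hypothetical shape of an 8-section validation report; sections not named in the
# entry (summary, risks, open_questions) are assumptions for illustration only.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Verdict(str, Enum):
    GOOD = "GOOD"
    BAD = "BAD"
    NEEDS_MAJOR_WORK = "NEEDS MAJOR WORK"


@dataclass
class ValidationReport:
    verdict: Verdict
    summary: str
    strengths: List[str] = field(default_factory=list)
    critical_flaws: List[str] = field(default_factory=list)
    blindspots: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)
    path_forward: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text, one section at a time."""
        lines = [f"Verdict: {self.verdict.value}", f"Summary: {self.summary}"]
        for title, items in [("Strengths", self.strengths),
                             ("Critical flaws", self.critical_flaws),
                             ("Blindspots", self.blindspots),
                             ("Risks", self.risks),
                             ("Open questions", self.open_questions),
                             ("Path forward", self.path_forward)]:
            lines.append(title + ":")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


print(ValidationReport(Verdict.GOOD, "Solid plan with minor gaps.",
                       strengths=["Clear scope"], path_forward=["Add load tests"]).render())
```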

promptfoo-evaluation

Configures and runs LLM evaluation using the Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".

llm-evaluation, evaluation-framework, rubric-creation
daymade
15713
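
The promptfoo-evaluation entry mentions Python custom assertions. Promptfoo's documented convention is a file-based assertion exposing a get_assert(output, context) function that returns a pass/score/reason result; the sketch below follows that convention, but verify the context fields and return shape against the Promptfoo version you run.

```python
# assert_concise.py - a Promptfoo-style Python assertion, referenced from
# promptfooconfig.yaml as e.g. `type: python, value: file://assert_concise.py`.
# Follows Promptfoo's documented get_assert(output, context) hook; details may
# vary between versions, so treat this as a sketch.

def get_assert(output: str, context) -> dict:
    """Pass if the model output is non-empty and under 50 words."""
    word_count = len(output.split())
    passed = 0 < word_count <= 50
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"Output has {word_count} words (limit 50).",
    }
```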

evaluation

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates for agent pipelines.

autonomous-agent, agent-testing, evaluation-framework, quality-gates
muratcankoylan
5,808 · 463
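
To give a sense of the LLM-as-judge, multi-dimensional evaluation this last entry refers to, here is a minimal sketch: a judge prompt that requests per-dimension scores as JSON, parsed into a dict. The call_judge argument is a stand-in for whatever model client the evaluation skill actually uses, and the dimensions are examples only.

```python
# Minimal LLM-as-judge sketch for multi-dimensional scoring.
# call_judge is a placeholder for a real model client (OpenAI, Anthropic, etc.).
import json

DIMENSIONS = ["helpfulness", "correctness", "safety"]

JUDGE_PROMPT = """Rate the RESPONSE to the TASK on each dimension from 1 (poor) to 5 (excellent).
Return only JSON like {{"helpfulness": 3, "correctness": 4, "safety": 5}}.

TASK: {task}
RESPONSE: {response}"""


def judge(task: str, response: str, call_judge) -> dict:
    """Ask a judge model for per-dimension scores and parse the JSON reply."""
    raw = call_judge(JUDGE_PROMPT.format(task=task, response=response))
    scores = json.loads(raw)
    return {dim: int(scores[dim]) for dim in DIMENSIONS}


# Stubbed judge call for demonstration only.
fake_reply = '{"helpfulness": 4, "correctness": 5, "safety": 5}'
print(judge("Summarize the report.", "The report covers Q3 revenue...", lambda _: fake_reply))
```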