Agent Skills: 1-Minute Codebase Evaluation

Fast codebase evaluation using Codex CLI. Scans the code to extract the repo tree and source files, then runs parallel metric evaluations (impact, technical, creativity, presentation, prompt_design) via codex exec. Automatically submits results to the TopVibeCoder ranking API and tracks progress over time. Use when evaluating LLM apps, hackathon projects, or code quality.

ID: topvibecoder/eval/1-min-eval

Install this agent skill locally:

pnpm dlx add-skill https://github.com/topvibecoder/eval/tree/HEAD/.codex/skills/1-min-eval

.codex/skills/1-min-eval/SKILL.md

Skill Metadata

Name
1-min-eval
Description
Fast codebase evaluation using Codex CLI (`codex exec`). Scans the code to extract the repo tree and source files, then runs parallel metric evaluations (impact, technical, creativity, presentation, prompt_design). By default, results are submitted to the TopVibeCoder ranking API so you automatically get a rank.

1-Minute Codebase Evaluation

Fast, parallel evaluation of codebases using Codex CLI with structured metrics.

Features

  • Smart Scanning: Automatically skips .codex/, node_modules/, .git/, hidden dotfiles, and previous .evals/ results
  • Parallel Evaluation: Runs multiple metrics concurrently for speed
  • Auto Ranking: Submits to TopVibeCoder API and gets your rank
  • Progress Tracking: Saves ranking history to track improvements over time
  • Detailed Reports: Generates comprehensive markdown reports with citations
  • Terminal Bar Chart: Visual score display with Unicode block characters
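
The skip rules named in the Smart Scanning bullet could be expressed roughly as follows. This is a minimal sketch, not the logic in scan_codebase.py; the names should_skip and SKIP_DIRS are hypothetical, and treating every dot-prefixed path component as hidden is an assumption.

```python
from pathlib import PurePosixPath

# Hypothetical skip set; the directory names come from the feature list above.
SKIP_DIRS = {"node_modules", ".codex", ".git", ".evals"}


def should_skip(relative_path: str) -> bool:
    """Return True if any path component is a skipped directory or a dotfile."""
    parts = PurePosixPath(relative_path).parts
    return any(part in SKIP_DIRS or part.startswith(".") for part in parts)


for candidate in ["src/app.py", "node_modules/react/index.js", ".env"]:
    print(candidate, "skip" if should_skip(candidate) else "scan")
```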

Quick Start

# Evaluate current directory (by default)
.codex/skills/1-min-eval/scripts/run_eval.sh .

# Evaluate and automatically fetch rank (default behavior)
.codex/skills/1-min-eval/scripts/run_eval.sh /path/to/project

# Evaluate with specific metrics
.codex/skills/1-min-eval/scripts/run_eval.sh /path/to/project --metrics impact,technical

# Full evaluation with all metrics (DO NOT use this by default)
.codex/skills/1-min-eval/scripts/run_eval.sh /path/to/project --all-metrics

How It Works

  1. Scan: scan_codebase.py extracts repo tree and source code with line numbers
  2. Evaluate: Runs parallel codex exec calls for each metric
  3. Aggregate: Combines JSON results into a final report
  4. Visualize: Displays terminal bar chart with scores
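
The evaluate and aggregate steps above can be pictured with a small thread-pool sketch. It makes no model calls: evaluate_metric is a hypothetical stand-in for one codex exec invocation, and run_pipeline is an illustrative name, not a function from the skill.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_metric(metric: str, codebase: str) -> dict:
    """Hypothetical stand-in for one `codex exec` call; returns a dummy result."""
    return {"metric": metric, "score": 0.0, "chars_seen": len(codebase)}


def run_pipeline(codebase: str, metrics: list[str], parallel: int = 4) -> dict:
    """Evaluate every metric concurrently, then aggregate results by metric name."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map preserves input order even though calls run concurrently.
        results = list(pool.map(lambda m: evaluate_metric(m, codebase), metrics))
    return {result["metric"]: result for result in results}


report = run_pipeline("print('hello')", ["impact", "technical", "creativity"])
print(sorted(report))
```

Threads suit this shape of work because each evaluation would mostly wait on an external process rather than use the CPU.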

Example Output

After evaluation completes, you'll see a visual bar chart:

==================================================
📊 Evaluation Scores
==================================================
  presentation    6.25 | ████████████░░░░░░░░
  impact          5.25 | ██████████░░░░░░░░░░
  technical       1.75 | ███░░░░░░░░░░░░░░░░░
  creativity      0.50 | █░░░░░░░░░░░░░░░░░░░
  prompt_design   0.00 | ░░░░░░░░░░░░░░░░░░░░
==================================================
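
A chart like the one above can be produced with a few lines of Python. This is a sketch inferred from the sample output, not the skill's own renderer: it assumes a 20-character bar whose filled length is rounded down, and the function names are hypothetical.

```python
def render_bar(score: float, width: int = 20, max_score: float = 10.0) -> str:
    """Render a score as filled and empty block characters, rounding down."""
    filled = int(score * width / max_score)
    filled = max(0, min(width, filled))  # clamp out-of-range scores
    return "█" * filled + "░" * (width - filled)


def render_chart(scores: dict[str, float]) -> str:
    """Render one line per metric, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(
        f"  {metric:<15} {score:.2f} | {render_bar(score)}"
        for metric, score in ordered
    )


print(render_chart({"impact": 5.25, "presentation": 6.25, "technical": 1.75}))
```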

Available Metrics

| Metric | Description |
|--------|-------------|
| impact | Real-world problem solving, usable experience |
| technical | Architecture, robustness, LLM integration |
| creativity | Originality, novel LLM usage |
| presentation | UX clarity, onboarding, demo quality |
| prompt_design | Prompt structure, staging, constraints |
| security | Secure coding, auth, dependency hygiene |
| completion | Description-to-code alignment |
| monetization | Business potential analysis |

Scoring Scale (0.00-10.00)

| Range | Meaning |
|-------|---------|
| 0.00-2.50 | Barely functional, major gaps |
| 2.51-4.50 | Minimal implementation, weak |
| 4.51-6.50 | Working but basic, clear gaps |
| 6.51-8.50 | Solid implementation, good quality |
| 8.51-10.00 | Excellent, production-ready |
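
To interpret a score programmatically, the scale can be encoded as a list of upper bounds. The sketch below is illustrative only: BANDS and describe_score are hypothetical names, and the skill itself may not expose such a helper.

```python
# Upper bound of each band and its meaning, copied from the scoring scale.
BANDS = [
    (2.50, "Barely functional, major gaps"),
    (4.50, "Minimal implementation, weak"),
    (6.50, "Working but basic, clear gaps"),
    (8.50, "Solid implementation, good quality"),
    (10.00, "Excellent, production-ready"),
]


def describe_score(score: float) -> str:
    """Return the meaning of the band that contains the given score."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score must be between 0.00 and 10.00, got {score}")
    for upper_bound, meaning in BANDS:
        if score <= upper_bound:
            return meaning
    raise AssertionError("unreachable: the last band ends at 10.00")


print(describe_score(6.25))
```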

Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| EVAL_PARALLEL | 4 | Number of parallel evaluations |
| EVAL_TIMEOUT | 300 | Timeout per metric (seconds) |
| EVAL_MAX_CHARS | 300000 | Max chars to include |
| EVAL_MODEL | gpt-5.2 | Model to use for evaluation |
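
For readers scripting around the skill, the variables and defaults in the table could be read as shown below. This is a sketch under the assumption that they are ordinary environment variables; EvalConfig and load_config are hypothetical names, and run_eval.sh may parse them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EvalConfig:
    """Defaults mirror the configuration table."""
    parallel: int = 4
    timeout: int = 300
    max_chars: int = 300_000
    model: str = "gpt-5.2"


def load_config(env: Optional[Mapping[str, str]] = None) -> EvalConfig:
    """Build a config from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    defaults = EvalConfig()
    return EvalConfig(
        parallel=int(env.get("EVAL_PARALLEL", defaults.parallel)),
        timeout=int(env.get("EVAL_TIMEOUT", defaults.timeout)),
        max_chars=int(env.get("EVAL_MAX_CHARS", defaults.max_chars)),
        model=env.get("EVAL_MODEL", defaults.model),
    )


print(load_config({"EVAL_PARALLEL": "8"}))
```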

Ranking & Progress Tracking

After evaluation, results are automatically submitted to the TopVibeCoder ranking API to get:

  • Overall rank and percentile
  • Per-metric rankings
  • Comparison with nearby apps
  • Historical progress tracking

Rankings are saved to ranking_history.jsonl in the output directory, allowing you to track improvements over time.
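
Because the history file is JSON Lines, each evaluation is one JSON object per line, so progress can be inspected with a short script. The schema is not documented here, so the rank field used below is purely an assumption, as are the function names.

```python
import json
from typing import Optional


def parse_history(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text into a list of entries, ignoring blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def rank_change(entries: list[dict], key: str = "rank") -> Optional[int]:
    """Return first rank minus latest rank (positive means you moved up).

    The `rank` key is a hypothetical field name, not a documented one.
    """
    ranks = [entry[key] for entry in entries if key in entry]
    if len(ranks) < 2:
        return None
    return ranks[0] - ranks[-1]


sample = '{"rank": 42}\n\n{"rank": 30}\n'  # made-up data for illustration
print(rank_change(parse_history(sample)))
```

With a real run, the text would come from reading ranking_history.jsonl in the output directory.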

Note: Requests to the ranking API are sent with browser-like headers so that they pass Cloudflare protection and submissions go through reliably.

Ranking submission is always enabled.

Output Structure

Results are saved to .evals/<timestamp>_<project>/ (hidden directory):

  • codebase.md - Scanned source code
  • codebase.json - Structured metadata
  • prompts/ - Generated evaluation prompts
  • results/ - JSON results per metric
  • logs/ - Execution logs
  • report.md - Aggregated markdown report with ranking
  • ranking_history.jsonl - Historical ranking data (one entry per evaluation)
  • .evals/history.jsonl - Rolling local history for progress charts

Note: Evaluation results are saved to a hidden .evals/ directory to keep your workspace clean. Add .evals/ to your .gitignore if you don't want to commit evaluation results.

Manual Usage

You can also run components individually:

# 1. Scan codebase
python3 .codex/skills/1-min-eval/scripts/scan_codebase.py ./project \
    --output /tmp/code.md --max-chars 0

# 2. Run single metric evaluation
cat /tmp/code.md | codex exec -m "$EVAL_MODEL" -

# 3. Aggregate results
python3 .codex/skills/1-min-eval/scripts/aggregate.py \
    --input-dir ./results --output ./report.md

Adding Custom Metadata

Create metadata.json in the project root:

{
  "name": "My App",
  "description": "An AI-powered tool that...",
  "author": "Your Name"
}
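
A consumer of this file might load it along the following lines. This is only a sketch: load_metadata is a hypothetical helper, and both the fallback to the directory name and the dropping of unknown keys are assumptions rather than documented behavior.

```python
import json
from pathlib import Path


def load_metadata(project_root: str) -> dict:
    """Read metadata.json from the project root, with assumed fallbacks."""
    root = Path(project_root)
    # Assumed fallback: use the directory name when no metadata.json exists.
    metadata = {"name": root.resolve().name, "description": "", "author": ""}
    meta_file = root / "metadata.json"
    if meta_file.is_file():
        loaded = json.loads(meta_file.read_text(encoding="utf-8"))
        # Keep only the three keys shown in the example above.
        metadata.update({k: v for k, v in loaded.items() if k in metadata})
    return metadata


print(load_metadata("."))
```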

Tips

  1. Large codebases: Use --max-chars 500000 for more context
  2. Debugging: Add --verbose to see detailed output
  3. Resume: Results are cached; re-running skips completed metrics
  4. Single metric: Use --metrics impact for a quick test
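
The resume behavior in tip 3 can be pictured as a check for existing result files before each metric runs. The sketch below assumes one results/<metric>.json file per metric, which is a guess at the layout; pending_metrics is a hypothetical helper.

```python
from pathlib import Path


def pending_metrics(results_dir: str, metrics: list[str]) -> list[str]:
    """Return the metrics that have no cached JSON result yet."""
    directory = Path(results_dir)
    if not directory.is_dir():
        return list(metrics)  # nothing cached, so everything is pending
    # Assumed naming: each finished metric leaves behind <metric>.json.
    completed = {path.stem for path in directory.glob("*.json")}
    return [metric for metric in metrics if metric not in completed]


print(pending_metrics("./results", ["impact", "technical"]))
```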