# 1-Minute Codebase Evaluation
Fast, parallel evaluation of codebases using Codex CLI with structured metrics.
## Features

- ✅ Smart Scanning: Automatically skips `.codex/`, `node_modules/`, `.git/`, hidden dotfiles, and previous `.evals/` results
- ✅ Parallel Evaluation: Runs multiple metrics concurrently for speed
- ✅ Auto Ranking: Submits to TopVibeCoder API and gets your rank
- ✅ Progress Tracking: Saves ranking history to track improvements over time
- ✅ Detailed Reports: Generates comprehensive markdown reports with citations
- ✅ Terminal Bar Chart: Visual score display with Unicode block characters
## Quick Start

```bash
# Evaluate the current directory (the default)
.codex/skills/1-min-eval/scripts/run_eval.sh .

# Evaluate and automatically fetch rank (default behavior)
.codex/skills/1-min-eval/scripts/run_eval.sh /path/to/project

# Evaluate with specific metrics
.codex/skills/1-min-eval/scripts/run_eval.sh /path/to/project --metrics impact,technical

# Full evaluation with all metrics (DO NOT use this by default)
.codex/skills/1-min-eval/scripts/run_eval.sh /path/to/project --all-metrics
```
## How It Works

- Scan: `scan_codebase.py` extracts the repo tree and source code with line numbers
- Evaluate: Runs parallel `codex exec` calls, one per metric
- Aggregate: Combines JSON results into a final report
- Visualize: Displays a terminal bar chart with scores
## Example Output

After evaluation completes, you'll see a visual bar chart:

```
==================================================
📊 Evaluation Scores
==================================================
presentation   6.25 | ████████████░░░░░░░░
impact         5.25 | ██████████░░░░░░░░░░
technical      1.75 | ███░░░░░░░░░░░░░░░░░
creativity     0.50 | █░░░░░░░░░░░░░░░░░░░
prompt_design  0.00 | ░░░░░░░░░░░░░░░░░░░░
==================================================
```
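A chart like this can be rendered in a few lines: each score out of 10 maps onto a fixed-width bar of filled and empty Unicode blocks. This is a minimal sketch of the idea, not the tool's actual rendering code.

```python
def bar(score, width=20, max_score=10.0):
    """Render a score as a fixed-width bar of filled/empty block characters."""
    filled = int(score / max_score * width)
    return "█" * filled + "░" * (width - filled)

for name, score in [("presentation", 6.25), ("impact", 5.25), ("technical", 1.75)]:
    print(f"{name:<14}{score:5.2f} | {bar(score)}")
```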
## Available Metrics

| Metric | Description |
|--------|-------------|
| impact | Real-world problem solving, usable experience |
| technical | Architecture, robustness, LLM integration |
| creativity | Originality, novel LLM usage |
| presentation | UX clarity, onboarding, demo quality |
| prompt_design | Prompt structure, staging, constraints |
| security | Secure coding, auth, dependency hygiene |
| completion | Description-to-code alignment |
| monetization | Business potential analysis |
## Scoring Scale (0.00-10.00)

| Range | Meaning |
|-------|---------|
| 0.00-2.50 | Barely functional, major gaps |
| 2.51-4.50 | Minimal implementation, weak |
| 4.51-6.50 | Working but basic, clear gaps |
| 6.51-8.50 | Solid implementation, good quality |
| 8.51-10.00 | Excellent, production-ready |
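The bands above can be expressed as a small lookup helper. This is a hypothetical convenience function mirroring the table, not part of the tool itself.

```python
def score_band(score):
    """Map a 0.00-10.00 score to its band label from the scoring table."""
    bands = [
        (2.50, "Barely functional, major gaps"),
        (4.50, "Minimal implementation, weak"),
        (6.50, "Working but basic, clear gaps"),
        (8.50, "Solid implementation, good quality"),
        (10.00, "Excellent, production-ready"),
    ]
    for upper, label in bands:
        if score <= upper:
            return label
    raise ValueError(f"score out of range: {score}")

print(score_band(6.25))  # Working but basic, clear gaps
```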
## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `EVAL_PARALLEL` | 4 | Number of parallel evaluations |
| `EVAL_TIMEOUT` | 300 | Timeout per metric (seconds) |
| `EVAL_MAX_CHARS` | 300000 | Max chars to include |
| `EVAL_MODEL` | gpt-5.2 | Model to use for evaluation |
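One plausible way a script applies such defaults is shell parameter expansion; this is a sketch of the pattern, not the actual contents of `run_eval.sh`. Override any variable by exporting it before the run.

```shell
# Apply documented defaults only where the variable is not already set:
: "${EVAL_PARALLEL:=4}"
: "${EVAL_TIMEOUT:=300}"
: "${EVAL_MAX_CHARS:=300000}"
echo "parallel=$EVAL_PARALLEL timeout=${EVAL_TIMEOUT}s max_chars=$EVAL_MAX_CHARS"
```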
## Ranking & Progress Tracking
After evaluation, results are automatically submitted to the TopVibeCoder ranking API to get:
- Overall rank and percentile
- Per-metric rankings
- Comparison with nearby apps
- Historical progress tracking
Rankings are saved to `ranking_history.jsonl` in the output directory, allowing you to track improvements over time.
Note: The ranking API uses browser-like headers to bypass Cloudflare protection, ensuring reliable submissions.
Ranking submission is always enabled.
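Because the history is plain JSONL (one JSON object per line), it is easy to read back for your own progress charts. The field names below (`timestamp`, `overall_score`) are assumptions for illustration, not the confirmed schema.

```python
import json
from pathlib import Path

def load_history(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

demo = Path("demo_history.jsonl")  # stands in for .evals/<run>/ranking_history.jsonl
demo.write_text(
    '{"timestamp": "2024-01-01", "overall_score": 4.50}\n'
    '{"timestamp": "2024-02-01", "overall_score": 6.25}\n'
)
history = load_history(demo)
print([entry["overall_score"] for entry in history])  # [4.5, 6.25]
demo.unlink()
```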
## Output Structure

Results are saved to `.evals/<timestamp>_<project>/` (a hidden directory):

- `codebase.md` - Scanned source code
- `codebase.json` - Structured metadata
- `prompts/` - Generated evaluation prompts
- `results/` - JSON results per metric
- `logs/` - Execution logs
- `report.md` - Aggregated markdown report with ranking
- `ranking_history.jsonl` - Historical ranking data (one entry per evaluation)
- `.evals/history.jsonl` - Rolling local history for progress charts
Note: Evaluation results are saved to a hidden `.evals/` directory to keep your workspace clean. Add `.evals/` to your `.gitignore` if you don't want to commit evaluation results.
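A one-liner for that, which only appends the entry when it is not already present:

```shell
# Ignore evaluation artifacts; grep -qx checks for an exact existing line first.
grep -qx '.evals/' .gitignore 2>/dev/null || echo '.evals/' >> .gitignore
```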
## Manual Usage

You can also run components individually:

```bash
# 1. Scan codebase
python3 .codex/skills/1-min-eval/scripts/scan_codebase.py ./project \
  --output /tmp/code.md --max-chars 0

# 2. Run single metric evaluation
cat /tmp/code.md | codex exec -m "$EVAL_MODEL" -

# 3. Aggregate results
python3 .codex/skills/1-min-eval/scripts/aggregate.py \
  --input-dir ./results --output ./report.md
```
## Adding Custom Metadata

Create `metadata.json` in the project root:

```json
{
  "name": "My App",
  "description": "An AI-powered tool that...",
  "author": "Your Name"
}
```
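A quick sanity check that the file parses and carries the keys from the example above can save a failed run. The required key set here simply mirrors that example; it is not a formal schema from the tool.

```python
import json
from pathlib import Path

def check_metadata(path):
    """Load metadata.json and verify the example's keys are present."""
    metadata = json.loads(Path(path).read_text())
    missing = {"name", "description", "author"} - metadata.keys()
    if missing:
        raise ValueError(f"metadata.json missing keys: {sorted(missing)}")
    return metadata

# Demo file standing in for a real project's metadata.json:
Path("metadata.json").write_text(
    '{"name": "My App", "description": "An AI-powered tool that...", "author": "Your Name"}'
)
meta = check_metadata("metadata.json")
print(meta["name"])  # My App
Path("metadata.json").unlink()
```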
## Tips

- Large codebases: Use `--max-chars 500000` for more context
- Debugging: Add `--verbose` to see detailed output
- Resume: Results are cached; re-runs skip completed metrics
- Single metric: Use `--metrics impact` for a quick test