coderace 0.9.0.tar.gz → 1.3.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {coderace-0.9.0 → coderace-1.3.0}/CHANGELOG.md +46 -0
- coderace-1.3.0/DONE.txt +70 -0
- {coderace-0.9.0 → coderace-1.3.0}/PKG-INFO +242 -1
- {coderace-0.9.0 → coderace-1.3.0}/README.md +241 -0
- coderace-1.3.0/all-day-build-contract-context-eval.md +120 -0
- coderace-1.3.0/all-day-build-contract-model-selection.md +121 -0
- coderace-1.3.0/all-day-build-contract-race-mode.md +183 -0
- coderace-1.3.0/all-day-build-contract-v1.0-statistical.md +184 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/__init__.py +1 -1
- coderace-1.3.0/coderace/adapters/__init__.py +77 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/aider.py +11 -4
- {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/base.py +13 -3
- {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/claude.py +12 -6
- {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/codex.py +12 -5
- {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/gemini.py +12 -8
- {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/opencode.py +12 -8
- {coderace-0.9.0 → coderace-1.3.0}/coderace/benchmark.py +174 -81
- {coderace-0.9.0 → coderace-1.3.0}/coderace/benchmark_report.py +372 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/cli.py +264 -28
- {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/benchmark.py +124 -10
- coderace-1.3.0/coderace/commands/context_eval.py +143 -0
- coderace-1.3.0/coderace/commands/race.py +626 -0
- coderace-1.3.0/coderace/context_eval.py +282 -0
- coderace-1.3.0/coderace/context_eval_report.py +275 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/dashboard.py +97 -1
- coderace-1.3.0/coderace/elo.py +94 -0
- coderace-1.3.0/coderace/export.py +115 -0
- coderace-1.3.0/coderace/statistics.py +206 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/store.py +56 -3
- {coderace-0.9.0 → coderace-1.3.0}/coderace/types.py +3 -1
- coderace-1.3.0/examples/context-eval-demo.sh +65 -0
- coderace-1.3.0/examples/model-selection.yaml +30 -0
- coderace-1.3.0/progress-log.md +70 -0
- {coderace-0.9.0 → coderace-1.3.0}/pyproject.toml +1 -1
- coderace-1.3.0/tests/test_benchmark_trials.py +205 -0
- coderace-1.3.0/tests/test_benchmark_v1_integration.py +164 -0
- coderace-1.3.0/tests/test_context_eval.py +532 -0
- coderace-1.3.0/tests/test_context_eval_dashboard.py +198 -0
- coderace-1.3.0/tests/test_elo.py +259 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_examples.py +3 -2
- coderace-1.3.0/tests/test_export.py +173 -0
- coderace-1.3.0/tests/test_model_selection_d1_d2.py +200 -0
- coderace-1.3.0/tests/test_model_selection_d3.py +164 -0
- coderace-1.3.0/tests/test_model_selection_d4.py +186 -0
- coderace-1.3.0/tests/test_race.py +883 -0
- coderace-1.3.0/tests/test_statistics.py +227 -0
- coderace-0.9.0/coderace/adapters/__init__.py +0 -26
- coderace-0.9.0/progress-log.md +0 -444
- {coderace-0.9.0 → coderace-1.3.0}/.github/workflows/publish.yml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/.gitignore +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/LICENSE +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/action.yml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-benchmark.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-builtin-tasks.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-ci-integration.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-cost-tracking.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-dashboard.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-leaderboard.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-v0.2.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-v090-tasks.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-verification-tests.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/benchmark-results/fibonacci-2026-02-27.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/benchmark-results/fibonacci-v2-2026-02-27.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/benchmark-results/hard-tasks-2026-02-27.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/benchmark-results/multi-task-2026-02-27.md +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/benchmark_stats.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/__init__.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/binary-search-tree.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/cli-args-parser.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/csv-analyzer.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/data-pipeline.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/diff-algorithm.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/expression-evaluator.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/fibonacci.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/file-watcher.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/http-server.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/json-parser.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/lru-cache.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/markdown-to-html.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/regex-engine.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/state-machine.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/task-scheduler.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/url-router.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/__init__.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/dashboard.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/diff.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/history.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/leaderboard.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/results.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/tasks.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/cost.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/git_ops.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/html_report.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/publish.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/reporter.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/scorer.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/stats.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/coderace/task.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/demo-race.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/examples/add-type-hints.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/examples/ci-race-on-pr.yml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/examples/example-task.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/examples/fix-edge-case.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/examples/write-tests.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/scripts/ci-run.sh +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/scripts/format-comment.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tasks/markdown-table.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tasks/parse-duration.yaml +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/__init__.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/conftest.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_adapters.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_benchmark.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_builtins.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_cli.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_cli_store_integration.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_cost.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_cost_config.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_cost_integration.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_dashboard.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_dashboard_cli.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_diff.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_format_comment.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_full_workflow.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_git_ops.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_history.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_html_report.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_leaderboard.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_markdown_results.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_publish.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_reporter.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_scorer.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_stats.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_store.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_task.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_tasks_cli.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/tests/test_verification_integration.py +0 -0
- {coderace-0.9.0 → coderace-1.3.0}/uv.lock +0 -0
{coderace-0.9.0 → coderace-1.3.0}/CHANGELOG.md

@@ -1,5 +1,51 @@
 # Changelog
 
+## [1.3.0] - 2026-03-05
+
+### Added
+- **Model selection**: Per-agent model override via `agent:model` syntax in `--agents` / `--agent` flags
+  - Example: `coderace run task.yaml --agent codex:gpt-5.4 --agent codex:gpt-5.3-codex`
+  - Example: `coderace benchmark --agents claude:opus-4-6,claude:sonnet-4-6`
+- `BaseAdapter.__init__(model=None)`: all adapters accept optional model at construction
+- `BaseAdapter.build_command(task, model=None)`: model parameter flows to CLI flag
+- `parse_agent_spec()`, `make_display_name()`, `instantiate_adapter()` in `coderace.adapters`
+- All adapters (codex, claude, aider, gemini, opencode) append `--model <name>` when specified
+- Benchmark and race commands handle model-specific agents; display names flow to results, store, ELO, dashboard
+- Task YAML: `agents` list accepts `agent:model` entries (e.g. `- codex:gpt-5.4`)
+
+### Changed
+- `AgentResult.agent` is now the display name (`codex (gpt-5.4)`) when a model is specified
+- ELO ratings, leaderboard, and dashboard automatically track model variants as separate entries
+- Branch names sanitized to be git-compatible (colons replaced with dashes)
+
+## [1.2.0] - 2026-03-03
+
+### Added
+
+- **`coderace race` command** - New first-to-pass race mode with early-stop semantics. Agents run in parallel worktrees and the race ends when the first winner is found.
+- **Live race UI** - Rich `Live` panel with per-agent status and timers:
+  - `🔨 coding...`
+  - `🧪 testing...`
+  - `✅ WINNER!`
+  - `❌ failed`
+  - `⏰ timed out`
+  - `🛑 stopped`
+- **Winner announcement and runner-up delta** - Prints race winner and optional runner-up timing delta after the live panel closes.
+- **Race result persistence (JSON fallback)** - Saves race summaries to `.coderace/race-results.json` including `race_id`, winner metadata, participant statuses, exit codes, and wall times. Supports `--no-save`.
+- **Race test suite** - Added 21 race-focused tests (`tests/test_race.py`) covering winner logic, cancellation, timeout/no-winner paths, verification modes, live updates, serialization, and Ctrl+C cleanup.
+
+## [1.0.0] - 2026-02-28
+
+### Added
+
+- **Benchmark trials mode** — `coderace benchmark --trials N` now runs each `(task, agent)` pair repeatedly and stores each trial with `trial_number` in SQLite.
+- **Statistical benchmarking module** — New `coderace/statistics.py` computes per-pair and per-agent aggregates: mean/stddev, 95% confidence intervals, pass rate, consistency, win rate, cost efficiency, and reliability.
+- **Persistent ELO ratings** — New `coderace/elo.py` plus `elo_ratings` store table. Ratings update automatically after each benchmark using pairwise task outcomes and persist across runs.
+- **`coderace ratings` command** — View persistent ELO rankings, output as JSON (`--json`), and reset all ratings (`--reset`).
+- **Standardized benchmark export** — `coderace benchmark --export <path>` writes shareable JSON with run metadata, system info, per-trial details, aggregate stats, and current ELO ratings.
+- **Enhanced benchmark report rendering** — Multi-trial reports now show statistical columns (`mean +/- stddev`, CI, consistency, reliability) and include ELO ratings in terminal/markdown/html output.
+- **Integration and edge-case coverage for v1.0 flow** — Added tests for full `--trials 3` benchmark + export + ELO pipeline and edge cases (single trial/agent/task and always-failing agent).
+
 ## [0.7.0] - 2026-02-26
 
 ### Added
coderace-1.3.0/DONE.txt (ADDED)

@@ -0,0 +1,70 @@
+Context Eval Build Contract: COMPLETE
+======================================
+
+Date: 2026-03-02
+All deliverables (D1-D4) implemented and validated.
+
+## What Was Built
+
+### D1: context-eval CLI Command (core + CLI)
+- `coderace/context_eval.py`: Core A/B evaluation engine
+  - ContextEvalResult and TrialResult data classes
+  - Context file backup/restore/placement/removal for baseline vs treatment
+  - KNOWN_CONTEXT_FILES list (CLAUDE.md, AGENTS.md, .cursorrules, etc.)
+  - run_context_eval() orchestrator running N trials per condition
+- `coderace/commands/context_eval.py`: CLI subcommand with:
+  --context-file PATH, --task PATH, --benchmark, --agents, --trials N,
+  --output PATH, --task-dir PATH
+- Full input validation (missing files, invalid agents, trials < 2, etc.)
+
+### D2: Statistical Comparison Report
+- `coderace/context_eval_report.py`: Statistical analysis and rendering
+  - Delta with 95% CI using Welch's t-test
+  - Cohen's d effect size
+  - Per-agent summary: baseline vs treatment pass rates and scores
+  - Per-task breakdown: which tasks improved, which degraded
+  - Summary verdict: "improved", "degraded", or "no significant improvement"
+  - Rich terminal table output
+  - JSON output format
+
+### D3: Dashboard Integration
+- Extended `coderace/dashboard.py` with context-eval A/B section:
+  - Bar chart: baseline vs treatment scores per agent
+  - Delta table with CI (95%) and effect size
+  - Verdict display
+  - CSS for A/B visualization (.ab-baseline, .ab-treatment, .positive, .negative)
+- Added --context-eval PATH flag to `coderace dashboard` command
+
+### D4: Documentation + Examples
+- README.md: Added "Context Evaluation" and "Measuring Context Engineering Impact" sections
+  with usage examples, output format, CLI flags table, and effect size interpretation guide
+- examples/context-eval-demo.sh: Executable demo script
+- Clear help text on `coderace context-eval --help` and `coderace --help`
+
+## Test Results
+
+- 58 new tests added (41 for D1+D2, 17 for D3)
+- All 505 tests pass (447 original + 58 new)
+- No regressions in existing test suite
+
+## Commits
+
+1. feat(context-eval): add context-eval command with A/B statistical comparison (D1+D2)
+2. feat(context-eval): add dashboard A/B comparison section (D3)
+3. docs(context-eval): add README section, examples, and interpretation guide (D4)
+
+## Files Created/Modified
+
+New files:
+- coderace/context_eval.py
+- coderace/commands/context_eval.py
+- coderace/context_eval_report.py
+- tests/test_context_eval.py
+- tests/test_context_eval_dashboard.py
+- examples/context-eval-demo.sh
+
+Modified files:
+- coderace/cli.py (registered context-eval subcommand + --context-eval dashboard flag)
+- coderace/dashboard.py (added A/B comparison section)
+- README.md (added context-eval documentation)
+- progress-log.md (added D1-D4 progress entries)
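D1's baseline-vs-treatment mechanics come down to temporarily placing the context file in the task workspace for treatment trials and keeping it out (while preserving any copy the repo already has) for baseline trials. A minimal sketch of that idea follows — a hypothetical helper for illustration, not the actual `coderace/context_eval.py` code:

```python
import shutil
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def context_file_condition(workdir: Path, context_file: Path, treatment: bool):
    """Place the evaluated context file in workdir for treatment trials; keep it absent for baseline trials."""
    target = workdir / context_file.name
    backup = target.with_name(target.name + ".coderace-backup")
    had_existing = target.exists()
    if had_existing:
        shutil.move(target, backup)            # back up whatever the repo already had
    if treatment:
        shutil.copyfile(context_file, target)  # treatment condition: evaluated file is present
    try:
        yield
    finally:
        if target.exists():
            target.unlink()                    # remove whatever we placed
        if had_existing:
            shutil.move(backup, target)        # restore the original file
```

Each of the N trials per condition would then run the agent inside this manager with `treatment=False` or `treatment=True`.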
{coderace-0.9.0 → coderace-1.3.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: coderace
-Version: 0.9.0
+Version: 1.3.0
 Summary: Race coding agents against each other on real tasks
 Project-URL: Homepage, https://github.com/mikiships/coderace
 Project-URL: Repository, https://github.com/mikiships/coderace
@@ -30,6 +30,11 @@ Description-Content-Type: text/markdown
 
 # coderace
 
+[](https://pypi.org/project/coderace/)
+[](#install)
+[](#)
+[](#license)
+
 Stop reading blog comparisons. Race coding agents against each other on real tasks in *your* repo with *your* code.
 
 Every week there's a new "Claude Code vs Codex vs Cursor" post. They test on toy problems with cherry-picked examples. coderace gives you automated, reproducible, scored comparisons on the tasks you actually care about.
@@ -340,6 +345,41 @@ Keys can be agent names (`claude`, `codex`, `aider`, `gemini`, `opencode`) or mo
 
 Pricing is easy to update: the table lives in `coderace/cost.py` as a plain dict.
 
+## Model Selection
+
+Compare different models of the same agent head-to-head using the `agent:model` syntax:
+
+```bash
+# Compare two Codex models on the same task
+coderace run task.yaml --agent codex:gpt-5.4 --agent codex:gpt-5.3-codex
+
+# Mix agents and models
+coderace run task.yaml --agent codex:gpt-5.4 --agent claude:opus-4-6 --agent claude:sonnet-4-6
+
+# Benchmark multiple model variants across built-in tasks
+coderace benchmark --agents codex:gpt-5.4,codex:gpt-5.3-codex,claude:opus-4-6
+
+# Race with model variants (parallel)
+coderace race task.yaml
+```
+
+In task YAML files:
+
+```yaml
+agents:
+  - codex:gpt-5.4
+  - codex:gpt-5.3-codex
+  - claude:opus-4-6
+  - claude:sonnet-4-6
+```
+
+**How it works:**
+- `agent:model` splits on the first colon: `codex:gpt-5.4` → agent `codex`, model `gpt-5.4`
+- The model is passed via `--model <name>` to the underlying CLI
+- Results display as `codex (gpt-5.4)` vs `codex (gpt-5.3-codex)` for easy comparison
+- ELO ratings, leaderboard, and dashboard track each model variant separately
+- The same agent can appear multiple times with different models in one run
+
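The `agent:model` behavior described in this Model Selection section (split on the first colon, then build a `codex (gpt-5.4)`-style display name) is small enough to sketch directly. The changelog names `parse_agent_spec()` and `make_display_name()` in `coderace.adapters`; the signatures below are assumptions for illustration only:

```python
def parse_agent_spec(spec: str) -> tuple[str, str | None]:
    """Split an --agent / --agents entry on the first colon: 'codex:gpt-5.4' -> ('codex', 'gpt-5.4')."""
    agent, sep, model = spec.partition(":")
    return agent, (model if sep else None)

def make_display_name(agent: str, model: str | None) -> str:
    """Display name used in results, the store, ELO, and the dashboard."""
    return f"{agent} ({model})" if model else agent

print(parse_agent_spec("codex:gpt-5.4"))      # ('codex', 'gpt-5.4')
print(make_display_name("codex", "gpt-5.4"))  # codex (gpt-5.4)
print(make_display_name("claude", None))      # claude
```

Splitting on only the first colon keeps plain agent names parsing cleanly with `model=None`, while model names remain free to contain further punctuation.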
 
 ## Leaderboard & History
 
 Every `coderace run` automatically saves results to a local SQLite database (`~/.coderace/results.db`). Two new commands aggregate this data.
@@ -471,6 +511,37 @@ coderace run task.yaml --parallel
 
 Sequential mode (default) runs agents one at a time on the same repo.
 
+## Race Mode
+
+Use `coderace race` for first-to-pass execution. Unlike `coderace run --parallel`, race mode stops as soon as one agent passes the win condition:
+
+- If verification is configured, winner = first agent that passes verification.
+- If verification is not configured, winner = first agent that exits cleanly.
+- Remaining agents are stopped after a short graceful shutdown window.
+
+```bash
+coderace race task.yaml --agent claude --agent codex
+```
+
+Example terminal output:
+
+```text
+🏁 coderace race - fix-auth-bug
+Running 3 agents in parallel...
+
+Agent    Status          Time
+claude   🔨 coding...    0:00:23
+codex    🧪 testing...   0:00:31
+aider    🛑 stopped      0:00:18
+
+🏆 Winner: codex - completed in 1:23 (first to pass verification)
+Runner-up: claude - finished 0:12 later
+```
+
+When to use each mode:
+- Use `coderace race` when you want the fastest successful patch and can stop early.
+- Use `coderace run --parallel` when you want full scoring across all agents before deciding.
+
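Race mode's early-stop rule is essentially "start all agents, take the first result that satisfies the win condition, stop the rest". The sketch below shows that pattern with `concurrent.futures`; it is illustrative only — the shipped `coderace/commands/race.py` additionally manages per-agent worktrees, the Rich live panel, timeouts, and the graceful shutdown window:

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_agent(name: str) -> tuple[str, bool]:
    """Stand-in for running one agent in its own worktree and checking the win condition."""
    time.sleep(random.uniform(0.1, 0.5))   # simulated coding + testing time
    return name, random.random() > 0.3     # simulated verification outcome

def race(agents: list[str]) -> str | None:
    """Return the first agent whose result passes, cancelling everyone still pending."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        pending = {pool.submit(run_agent, a) for a in agents}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                name, passed = fut.result()
                if passed:
                    for p in pending:
                        p.cancel()          # real code would also stop the agent processes
                    return name
    return None                             # no winner: everyone failed or timed out

print(race(["claude", "codex", "aider"]))
```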
 
 ## Why coderace?
 
 **Blog posts compare models. coderace compares agents on your work.**
@@ -606,9 +677,15 @@ coderace benchmark --agents claude --difficulty easy,medium
 # Dry-run: see what would run without executing
 coderace benchmark --agents claude,codex --dry-run
 
+# Statistical mode: run repeated trials per pair
+coderace benchmark --agents claude,codex --tasks fibonacci,json-parser --trials 5
+
 # Save report to file
 coderace benchmark --agents claude,codex --output report.md
 coderace benchmark --agents claude,codex --output report.html
+
+# Export standardized JSON (shareable benchmark artifact)
+coderace benchmark --agents claude,codex --trials 5 --export benchmark.json
 ```
 
 ### Example Terminal Output
@@ -652,7 +729,171 @@ coderace benchmark show bench-20260227-143022
 | `--difficulty` | Filter by difficulty: `easy`, `medium`, `hard` | all |
 | `--timeout` | Per-task timeout in seconds | `300` |
 | `--parallel N` | Run N agents in parallel | `1` (sequential) |
+| `--trials N` | Repeat each `(task, agent)` pair N times | `1` |
 | `--dry-run` | List combinations without running | `false` |
 | `--format` | Output format: `terminal`, `markdown`, `html` | `terminal` |
 | `--output` | Save report to file | — |
+| `--export` | Write standardized benchmark JSON file | — |
 | `--no-save` | Skip saving results to the store | `false` |
+
+### Statistical Reports (`--trials > 1`)
+
+When `--trials` is greater than 1, benchmark reports switch to statistical mode:
+
+- Task cells show `mean score +/- stddev` (plus mean wall time)
+- Report includes `CI (95%)`, `Consistency`, and `Reliability` columns
+- Summary includes per-agent mean score, confidence interval, win rate, and reliability
+- ELO ratings are rendered at the bottom of terminal/markdown/html reports
+
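The statistical columns described above are plain per-pair aggregates over the N trials. As a rough illustration (not the project's actual `coderace/statistics.py` implementation), a mean with a normal-approximation 95% confidence interval looks like this:

```python
import statistics

def mean_ci_95(scores: list[float]) -> tuple[float, float, tuple[float, float]]:
    """Mean, sample stddev, and an approximate 95% CI for one (task, agent) pair's trial scores."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, 0.0, (mean, mean)
    stdev = statistics.stdev(scores)              # sample standard deviation
    margin = 1.96 * stdev / len(scores) ** 0.5    # normal-approximation 95% margin
    return mean, stdev, (mean - margin, mean + margin)

# e.g. five trial scores for one (task, agent) pair
print(mean_ci_95([91.0, 85.0, 88.0, 84.0, 89.5]))
```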
+### ELO Ratings
+
+Every benchmark run updates persistent ELO ratings across all benchmark history.
+
+```bash
+# Show ratings
+coderace ratings
+
+# JSON output
+coderace ratings --json
+
+# Reset all ratings to 1500
+coderace ratings --reset
+```
+
+ELO rules:
+- Initial rating: `1500`
+- K-factor: `32`
+- Each task is treated as a round-robin set of pairwise matches
+- Winner per pair is based on higher mean trial score (draw when within 1 point)
+
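Those rules correspond to the standard Elo update. A minimal sketch of one pairwise update with K = 32 and a 1500 starting rating (illustrative; the shipped `coderace/elo.py` may differ in detail):

```python
K = 32

def expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a is 1.0 for a win, 0.5 for a draw (mean scores within 1 point), 0.0 for a loss."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))

# claude beats codex on one task; both start at the initial rating of 1500
print(update(1500, 1500, 1.0))  # -> (1516.0, 1484.0)
```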
+### Export Format (`--export`)
+
+`coderace benchmark --export benchmark.json` writes a standardized JSON artifact:
+
+```json
+{
+  "coderace_version": "1.0.0",
+  "benchmark_id": "bench-20260228-133000",
+  "timestamp": "2026-02-28T13:30:00Z",
+  "system": { "os": "...", "python": "...", "cpu": "..." },
+  "config": { "trials": 5, "timeout": 300, "tasks": ["..."], "agents": ["..."] },
+  "results": [
+    {
+      "task": "fibonacci",
+      "agent": "claude",
+      "trials": 5,
+      "mean_score": 87.5,
+      "stddev_score": 3.2,
+      "ci_95": [83.1, 91.9],
+      "mean_time": 45.2,
+      "mean_cost": 0.03,
+      "pass_rate": 1.0,
+      "consistency_score": 0.96,
+      "per_trial": []
+    }
+  ],
+  "elo_ratings": { "claude": 1523, "codex": 1488 },
+  "summary": {}
+}
+```
+
+## Context Evaluation
+
+The `coderace context-eval` command measures whether a context file (CLAUDE.md, AGENTS.md, .cursorrules, etc.) actually improves agent performance. It runs A/B trials — baseline (no context file) vs treatment (with context file) — and produces statistical comparisons.
+
+```bash
+# Evaluate whether CLAUDE.md improves claude's performance on a task
+coderace context-eval --context-file CLAUDE.md --task fix-auth-bug.yaml --agents claude --trials 5
+
+# Evaluate across all built-in benchmark tasks
+coderace context-eval --context-file CLAUDE.md --benchmark --agents claude,codex
+
+# Save results as JSON
+coderace context-eval --context-file CLAUDE.md --task task.yaml --agents claude --output results.json
+
+# Use a custom task directory
+coderace context-eval --context-file CLAUDE.md --benchmark --task-dir ./my-tasks --agents claude
+```
+
+### How It Works
+
+For each agent × task combination:
+1. Run N trials **without** the context file (baseline condition)
+2. Run N trials **with** the context file placed in the task directory (treatment condition)
+3. Compare pass rates, mean scores, and compute statistical significance
+
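Step 3's comparison is described in this release as a delta with a 95% CI from Welch's t-test plus a Cohen's d effect size. The sketch below shows both statistics in simplified form — it uses a normal critical value (1.96) rather than the Welch–Satterthwaite t quantile, and it is not the project's `context_eval_report.py` code:

```python
import math
import statistics

def welch_delta_ci_95(baseline: list[float], treatment: list[float]) -> tuple[float, tuple[float, float]]:
    """Treatment-minus-baseline mean difference with an approximate 95% CI (Welch standard error, z = 1.96)."""
    delta = statistics.mean(treatment) - statistics.mean(baseline)
    se = math.sqrt(statistics.variance(baseline) / len(baseline)
                   + statistics.variance(treatment) / len(treatment))
    return delta, (delta - 1.96 * se, delta + 1.96 * se)

def cohens_d(baseline: list[float], treatment: list[float]) -> float:
    """Effect size: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(baseline), len(treatment)
    pooled = math.sqrt(((n1 - 1) * statistics.variance(baseline)
                        + (n2 - 1) * statistics.variance(treatment)) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(baseline)) / pooled

baseline = [50.0, 55.0, 60.0]
treatment = [78.0, 81.0, 84.0]
print(welch_delta_ci_95(baseline, treatment))   # delta and its CI
print(cohens_d(baseline, treatment))            # effect size
```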
+### Output
+
+The terminal report shows:
+- **Per-agent summary**: baseline vs treatment pass rates and scores, delta with 95% CI, Cohen's d effect size
+- **Per-task breakdown**: which tasks improved, which degraded
+- **Verdict**: whether the context file significantly improved performance
+
+```
+┌────────┬────────────────────┬─────────────────────┬────────────────┬─────────────────┬────────┬──────────────┬─────────────┐
+│ Agent  │ Baseline Pass Rate │ Treatment Pass Rate │ Baseline Score │ Treatment Score │ Delta  │ CI (95%)     │ Effect Size │
+├────────┼────────────────────┼─────────────────────┼────────────────┼─────────────────┼────────┼──────────────┼─────────────┤
+│ claude │ 67%                │ 100%                │ 55.0           │ 81.0            │ +26.0  │ [10.5, 41.5] │ 2.10        │
+│ codex  │ 33%                │ 67%                 │ 45.0           │ 70.0            │ +25.0  │ [8.0, 42.0]  │ 1.80        │
+└────────┴────────────────────┴─────────────────────┴────────────────┴─────────────────┴────────┴──────────────┴─────────────┘
+
+Context file improved performance by +25.5 points (CI: [12.0, 39.0])
+```
+
+### Context-Eval CLI Flags
+
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--context-file` | Path to the context file to evaluate (required) | — |
+| `--task` | Path to a single task YAML | — |
+| `--benchmark` | Run against built-in benchmark tasks | `false` |
+| `--agents` | Comma-separated agent names (required) | — |
+| `--trials` | Trials per condition (min: 2) | `3` |
+| `--output` | Save JSON results to file | — |
+| `--task-dir` | Custom task directory for benchmark mode | — |
+
+### Dashboard Integration
+
+Include context-eval results in the HTML dashboard:
+
+```bash
+# Run context-eval and save JSON
+coderace context-eval --context-file CLAUDE.md --task task.yaml --agents claude --output eval.json
+
+# Generate dashboard with A/B comparison section
+coderace dashboard --context-eval eval.json
+```
+
+## Measuring Context Engineering Impact
+
+Context engineering — crafting CLAUDE.md, AGENTS.md, .cursorrules, and similar files — is becoming a core developer skill. But until now, there was no way to empirically measure whether your context files actually help.
+
+**The problem:** You write a CLAUDE.md with coding conventions, architectural guidelines, and project-specific instructions. But does it actually make agents produce better code? Or is it cargo-cult configuration?
+
+**The solution:** `coderace context-eval` gives you data:
+
+1. **Write your context file** (e.g., CLAUDE.md with project conventions)
+2. **Run A/B evaluation** against real coding tasks
+3. **Get statistical evidence** of improvement (or lack thereof)
+
+```bash
+# Iterate on your context file with data
+coderace context-eval --context-file CLAUDE.md --benchmark --agents claude --trials 5
+
+# Compare different context files
+coderace context-eval --context-file v1-claude.md --task task.yaml --agents claude --output v1.json
+coderace context-eval --context-file v2-claude.md --task task.yaml --agents claude --output v2.json
+```
+
+**Interpreting results:**
+- **Effect size > 0.8**: Large improvement — your context file is helping significantly
+- **Effect size 0.2–0.8**: Moderate improvement — some benefit, room to iterate
+- **Effect size < 0.2**: Negligible — your context file isn't making a measurable difference
+- **CI crosses zero**: Not statistically significant — need more trials or a better context file
+
+## See Also
+
+- **[agentmd](https://github.com/mikiships/agentmd)** — Generate and score context files (CLAUDE.md, AGENTS.md, .cursorrules) for AI coding agents. Pair with coderace: generate context with agentmd, measure agent performance with coderace, iterate with data instead of vibes.
+- **[agentlint](https://github.com/mikiships/agentlint)** — Lint AI agent git diffs for risky patterns (scope drift, secret leaks, test regression). Static analysis, no LLM required.
+
+Measure (coderace) → Optimize (agentmd) → Guard (agentlint).
{coderace-0.9.0 → coderace-1.3.0}/README.md

@@ -1,5 +1,10 @@
@@ -310,6 +315,41 @@ Keys can be agent names (`claude`, `codex`, `aider`, `gemini`, `opencode`) or mo
@@ -441,6 +481,37 @@ coderace run task.yaml --parallel
@@ -576,9 +647,15 @@ coderace benchmark --agents claude --difficulty easy,medium
@@ -622,7 +699,171 @@ coderace benchmark show bench-20260227-143022

The added content in these README.md hunks (badges, Model Selection, Race Mode, benchmark trials/export examples, Context Evaluation, Measuring Context Engineering Impact, See Also) is identical to the corresponding sections of the PKG-INFO diff above; only the hunk line numbers differ, since PKG-INFO embeds the README as the package's long description.