coderace 0.9.0__tar.gz → 1.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (137)
  1. {coderace-0.9.0 → coderace-1.3.0}/CHANGELOG.md +46 -0
  2. coderace-1.3.0/DONE.txt +70 -0
  3. {coderace-0.9.0 → coderace-1.3.0}/PKG-INFO +242 -1
  4. {coderace-0.9.0 → coderace-1.3.0}/README.md +241 -0
  5. coderace-1.3.0/all-day-build-contract-context-eval.md +120 -0
  6. coderace-1.3.0/all-day-build-contract-model-selection.md +121 -0
  7. coderace-1.3.0/all-day-build-contract-race-mode.md +183 -0
  8. coderace-1.3.0/all-day-build-contract-v1.0-statistical.md +184 -0
  9. {coderace-0.9.0 → coderace-1.3.0}/coderace/__init__.py +1 -1
  10. coderace-1.3.0/coderace/adapters/__init__.py +77 -0
  11. {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/aider.py +11 -4
  12. {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/base.py +13 -3
  13. {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/claude.py +12 -6
  14. {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/codex.py +12 -5
  15. {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/gemini.py +12 -8
  16. {coderace-0.9.0 → coderace-1.3.0}/coderace/adapters/opencode.py +12 -8
  17. {coderace-0.9.0 → coderace-1.3.0}/coderace/benchmark.py +174 -81
  18. {coderace-0.9.0 → coderace-1.3.0}/coderace/benchmark_report.py +372 -0
  19. {coderace-0.9.0 → coderace-1.3.0}/coderace/cli.py +264 -28
  20. {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/benchmark.py +124 -10
  21. coderace-1.3.0/coderace/commands/context_eval.py +143 -0
  22. coderace-1.3.0/coderace/commands/race.py +626 -0
  23. coderace-1.3.0/coderace/context_eval.py +282 -0
  24. coderace-1.3.0/coderace/context_eval_report.py +275 -0
  25. {coderace-0.9.0 → coderace-1.3.0}/coderace/dashboard.py +97 -1
  26. coderace-1.3.0/coderace/elo.py +94 -0
  27. coderace-1.3.0/coderace/export.py +115 -0
  28. coderace-1.3.0/coderace/statistics.py +206 -0
  29. {coderace-0.9.0 → coderace-1.3.0}/coderace/store.py +56 -3
  30. {coderace-0.9.0 → coderace-1.3.0}/coderace/types.py +3 -1
  31. coderace-1.3.0/examples/context-eval-demo.sh +65 -0
  32. coderace-1.3.0/examples/model-selection.yaml +30 -0
  33. coderace-1.3.0/progress-log.md +70 -0
  34. {coderace-0.9.0 → coderace-1.3.0}/pyproject.toml +1 -1
  35. coderace-1.3.0/tests/test_benchmark_trials.py +205 -0
  36. coderace-1.3.0/tests/test_benchmark_v1_integration.py +164 -0
  37. coderace-1.3.0/tests/test_context_eval.py +532 -0
  38. coderace-1.3.0/tests/test_context_eval_dashboard.py +198 -0
  39. coderace-1.3.0/tests/test_elo.py +259 -0
  40. {coderace-0.9.0 → coderace-1.3.0}/tests/test_examples.py +3 -2
  41. coderace-1.3.0/tests/test_export.py +173 -0
  42. coderace-1.3.0/tests/test_model_selection_d1_d2.py +200 -0
  43. coderace-1.3.0/tests/test_model_selection_d3.py +164 -0
  44. coderace-1.3.0/tests/test_model_selection_d4.py +186 -0
  45. coderace-1.3.0/tests/test_race.py +883 -0
  46. coderace-1.3.0/tests/test_statistics.py +227 -0
  47. coderace-0.9.0/coderace/adapters/__init__.py +0 -26
  48. coderace-0.9.0/progress-log.md +0 -444
  49. {coderace-0.9.0 → coderace-1.3.0}/.github/workflows/publish.yml +0 -0
  50. {coderace-0.9.0 → coderace-1.3.0}/.gitignore +0 -0
  51. {coderace-0.9.0 → coderace-1.3.0}/LICENSE +0 -0
  52. {coderace-0.9.0 → coderace-1.3.0}/action.yml +0 -0
  53. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-benchmark.md +0 -0
  54. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-builtin-tasks.md +0 -0
  55. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-ci-integration.md +0 -0
  56. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-cost-tracking.md +0 -0
  57. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-dashboard.md +0 -0
  58. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-leaderboard.md +0 -0
  59. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-v0.2.md +0 -0
  60. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-v090-tasks.md +0 -0
  61. {coderace-0.9.0 → coderace-1.3.0}/all-day-build-contract-verification-tests.md +0 -0
  62. {coderace-0.9.0 → coderace-1.3.0}/benchmark-results/fibonacci-2026-02-27.md +0 -0
  63. {coderace-0.9.0 → coderace-1.3.0}/benchmark-results/fibonacci-v2-2026-02-27.md +0 -0
  64. {coderace-0.9.0 → coderace-1.3.0}/benchmark-results/hard-tasks-2026-02-27.md +0 -0
  65. {coderace-0.9.0 → coderace-1.3.0}/benchmark-results/multi-task-2026-02-27.md +0 -0
  66. {coderace-0.9.0 → coderace-1.3.0}/coderace/benchmark_stats.py +0 -0
  67. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/__init__.py +0 -0
  68. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/binary-search-tree.yaml +0 -0
  69. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/cli-args-parser.yaml +0 -0
  70. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/csv-analyzer.yaml +0 -0
  71. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/data-pipeline.yaml +0 -0
  72. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/diff-algorithm.yaml +0 -0
  73. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/expression-evaluator.yaml +0 -0
  74. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/fibonacci.yaml +0 -0
  75. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/file-watcher.yaml +0 -0
  76. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/http-server.yaml +0 -0
  77. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/json-parser.yaml +0 -0
  78. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/lru-cache.yaml +0 -0
  79. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/markdown-to-html.yaml +0 -0
  80. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/regex-engine.yaml +0 -0
  81. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/state-machine.yaml +0 -0
  82. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/task-scheduler.yaml +0 -0
  83. {coderace-0.9.0 → coderace-1.3.0}/coderace/builtins/tasks/url-router.yaml +0 -0
  84. {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/__init__.py +0 -0
  85. {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/dashboard.py +0 -0
  86. {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/diff.py +0 -0
  87. {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/history.py +0 -0
  88. {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/leaderboard.py +0 -0
  89. {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/results.py +0 -0
  90. {coderace-0.9.0 → coderace-1.3.0}/coderace/commands/tasks.py +0 -0
  91. {coderace-0.9.0 → coderace-1.3.0}/coderace/cost.py +0 -0
  92. {coderace-0.9.0 → coderace-1.3.0}/coderace/git_ops.py +0 -0
  93. {coderace-0.9.0 → coderace-1.3.0}/coderace/html_report.py +0 -0
  94. {coderace-0.9.0 → coderace-1.3.0}/coderace/publish.py +0 -0
  95. {coderace-0.9.0 → coderace-1.3.0}/coderace/reporter.py +0 -0
  96. {coderace-0.9.0 → coderace-1.3.0}/coderace/scorer.py +0 -0
  97. {coderace-0.9.0 → coderace-1.3.0}/coderace/stats.py +0 -0
  98. {coderace-0.9.0 → coderace-1.3.0}/coderace/task.py +0 -0
  99. {coderace-0.9.0 → coderace-1.3.0}/demo-race.yaml +0 -0
  100. {coderace-0.9.0 → coderace-1.3.0}/examples/add-type-hints.yaml +0 -0
  101. {coderace-0.9.0 → coderace-1.3.0}/examples/ci-race-on-pr.yml +0 -0
  102. {coderace-0.9.0 → coderace-1.3.0}/examples/example-task.yaml +0 -0
  103. {coderace-0.9.0 → coderace-1.3.0}/examples/fix-edge-case.yaml +0 -0
  104. {coderace-0.9.0 → coderace-1.3.0}/examples/write-tests.yaml +0 -0
  105. {coderace-0.9.0 → coderace-1.3.0}/scripts/ci-run.sh +0 -0
  106. {coderace-0.9.0 → coderace-1.3.0}/scripts/format-comment.py +0 -0
  107. {coderace-0.9.0 → coderace-1.3.0}/tasks/markdown-table.yaml +0 -0
  108. {coderace-0.9.0 → coderace-1.3.0}/tasks/parse-duration.yaml +0 -0
  109. {coderace-0.9.0 → coderace-1.3.0}/tests/__init__.py +0 -0
  110. {coderace-0.9.0 → coderace-1.3.0}/tests/conftest.py +0 -0
  111. {coderace-0.9.0 → coderace-1.3.0}/tests/test_adapters.py +0 -0
  112. {coderace-0.9.0 → coderace-1.3.0}/tests/test_benchmark.py +0 -0
  113. {coderace-0.9.0 → coderace-1.3.0}/tests/test_builtins.py +0 -0
  114. {coderace-0.9.0 → coderace-1.3.0}/tests/test_cli.py +0 -0
  115. {coderace-0.9.0 → coderace-1.3.0}/tests/test_cli_store_integration.py +0 -0
  116. {coderace-0.9.0 → coderace-1.3.0}/tests/test_cost.py +0 -0
  117. {coderace-0.9.0 → coderace-1.3.0}/tests/test_cost_config.py +0 -0
  118. {coderace-0.9.0 → coderace-1.3.0}/tests/test_cost_integration.py +0 -0
  119. {coderace-0.9.0 → coderace-1.3.0}/tests/test_dashboard.py +0 -0
  120. {coderace-0.9.0 → coderace-1.3.0}/tests/test_dashboard_cli.py +0 -0
  121. {coderace-0.9.0 → coderace-1.3.0}/tests/test_diff.py +0 -0
  122. {coderace-0.9.0 → coderace-1.3.0}/tests/test_format_comment.py +0 -0
  123. {coderace-0.9.0 → coderace-1.3.0}/tests/test_full_workflow.py +0 -0
  124. {coderace-0.9.0 → coderace-1.3.0}/tests/test_git_ops.py +0 -0
  125. {coderace-0.9.0 → coderace-1.3.0}/tests/test_history.py +0 -0
  126. {coderace-0.9.0 → coderace-1.3.0}/tests/test_html_report.py +0 -0
  127. {coderace-0.9.0 → coderace-1.3.0}/tests/test_leaderboard.py +0 -0
  128. {coderace-0.9.0 → coderace-1.3.0}/tests/test_markdown_results.py +0 -0
  129. {coderace-0.9.0 → coderace-1.3.0}/tests/test_publish.py +0 -0
  130. {coderace-0.9.0 → coderace-1.3.0}/tests/test_reporter.py +0 -0
  131. {coderace-0.9.0 → coderace-1.3.0}/tests/test_scorer.py +0 -0
  132. {coderace-0.9.0 → coderace-1.3.0}/tests/test_stats.py +0 -0
  133. {coderace-0.9.0 → coderace-1.3.0}/tests/test_store.py +0 -0
  134. {coderace-0.9.0 → coderace-1.3.0}/tests/test_task.py +0 -0
  135. {coderace-0.9.0 → coderace-1.3.0}/tests/test_tasks_cli.py +0 -0
  136. {coderace-0.9.0 → coderace-1.3.0}/tests/test_verification_integration.py +0 -0
  137. {coderace-0.9.0 → coderace-1.3.0}/uv.lock +0 -0
@@ -1,5 +1,51 @@
  # Changelog

+ ## [1.3.0] - 2026-03-05
+
+ ### Added
+ - **Model selection**: Per-agent model override via `agent:model` syntax in `--agents` / `--agent` flags
+   - Example: `coderace run task.yaml --agent codex:gpt-5.4 --agent codex:gpt-5.3-codex`
+   - Example: `coderace benchmark --agents claude:opus-4-6,claude:sonnet-4-6`
+ - `BaseAdapter.__init__(model=None)`: all adapters accept optional model at construction
+ - `BaseAdapter.build_command(task, model=None)`: model parameter flows to CLI flag
+ - `parse_agent_spec()`, `make_display_name()`, `instantiate_adapter()` in `coderace.adapters`
+ - All adapters (codex, claude, aider, gemini, opencode) append `--model <name>` when specified
+ - Benchmark and race commands handle model-specific agents; display names flow to results, store, ELO, dashboard
+ - Task YAML: `agents` list accepts `agent:model` entries (e.g. `- codex:gpt-5.4`)
+
+ ### Changed
+ - `AgentResult.agent` is now the display name (`codex (gpt-5.4)`) when a model is specified
+ - ELO ratings, leaderboard, and dashboard automatically track model variants as separate entries
+ - Branch names sanitized to be git-compatible (colons replaced with dashes)
+
+ ## [1.2.0] - 2026-03-03
+
+ ### Added
+
+ - **`coderace race` command** - New first-to-pass race mode with early-stop semantics. Agents run in parallel worktrees and the race ends when the first winner is found.
+ - **Live race UI** - Rich `Live` panel with per-agent status and timers:
+   - `🔨 coding...`
+   - `🧪 testing...`
+   - `✅ WINNER!`
+   - `❌ failed`
+   - `⏰ timed out`
+   - `🛑 stopped`
+ - **Winner announcement and runner-up delta** - Prints race winner and optional runner-up timing delta after the live panel closes.
+ - **Race result persistence (JSON fallback)** - Saves race summaries to `.coderace/race-results.json` including `race_id`, winner metadata, participant statuses, exit codes, and wall times. Supports `--no-save`.
+ - **Race test suite** - Added 21 race-focused tests (`tests/test_race.py`) covering winner logic, cancellation, timeout/no-winner paths, verification modes, live updates, serialization, and Ctrl+C cleanup.
+
+ ## [1.0.0] - 2026-02-28
+
+ ### Added
+
+ - **Benchmark trials mode** — `coderace benchmark --trials N` now runs each `(task, agent)` pair repeatedly and stores each trial with `trial_number` in SQLite.
+ - **Statistical benchmarking module** — New `coderace/statistics.py` computes per-pair and per-agent aggregates: mean/stddev, 95% confidence intervals, pass rate, consistency, win rate, cost efficiency, and reliability.
+ - **Persistent ELO ratings** — New `coderace/elo.py` plus `elo_ratings` store table. Ratings update automatically after each benchmark using pairwise task outcomes and persist across runs.
+ - **`coderace ratings` command** — View persistent ELO rankings, output as JSON (`--json`), and reset all ratings (`--reset`).
+ - **Standardized benchmark export** — `coderace benchmark --export <path>` writes shareable JSON with run metadata, system info, per-trial details, aggregate stats, and current ELO ratings.
+ - **Enhanced benchmark report rendering** — Multi-trial reports now show statistical columns (`mean +/- stddev`, CI, consistency, reliability) and include ELO ratings in terminal/markdown/html output.
+ - **Integration and edge-case coverage for v1.0 flow** — Added tests for the full `--trials 3` benchmark + export + ELO pipeline and edge cases (single trial/agent/task and an always-failing agent).
+
  ## [0.7.0] - 2026-02-26

  ### Added
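The 1.3.0 entries above name `BaseAdapter.__init__(model=None)` and `BaseAdapter.build_command(task, model=None)` and say adapters append `--model <name>` when a model is given. The sketch below illustrates that plumbing under stated assumptions; the method bodies and the `CodexAdapter` invocation are hypothetical, not the package's actual adapter code.

```python
# Sketch only: model plumbing implied by the 1.3.0 changelog entries.
# The real adapters in coderace/adapters differ in detail.
class BaseAdapter:
    name = "base"

    def __init__(self, model: str | None = None):
        self.model = model

    def base_command(self, task: str) -> list[str]:
        raise NotImplementedError

    def build_command(self, task: str, model: str | None = None) -> list[str]:
        cmd = self.base_command(task)
        effective = model or self.model
        if effective:                      # append --model <name> when specified
            cmd += ["--model", effective]
        return cmd


class CodexAdapter(BaseAdapter):
    name = "codex"

    def base_command(self, task: str) -> list[str]:
        # Hypothetical invocation; the real adapter builds the full codex CLI call.
        return ["codex", "exec", task]


print(CodexAdapter(model="gpt-5.4").build_command("Fix the auth bug"))
# ['codex', 'exec', 'Fix the auth bug', '--model', 'gpt-5.4']
```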
@@ -0,0 +1,70 @@
+ Context Eval Build Contract: COMPLETE
+ ======================================
+
+ Date: 2026-03-02
+ All deliverables (D1-D4) implemented and validated.
+
+ ## What Was Built
+
+ ### D1: context-eval CLI Command (core + CLI)
+ - `coderace/context_eval.py`: Core A/B evaluation engine
+   - ContextEvalResult and TrialResult data classes
+   - Context file backup/restore/placement/removal for baseline vs treatment
+   - KNOWN_CONTEXT_FILES list (CLAUDE.md, AGENTS.md, .cursorrules, etc.)
+   - run_context_eval() orchestrator running N trials per condition
+ - `coderace/commands/context_eval.py`: CLI subcommand with:
+   --context-file PATH, --task PATH, --benchmark, --agents, --trials N,
+   --output PATH, --task-dir PATH
+ - Full input validation (missing files, invalid agents, trials < 2, etc.)
+
+ ### D2: Statistical Comparison Report
+ - `coderace/context_eval_report.py`: Statistical analysis and rendering
+   - Delta with 95% CI using Welch's t-test
+   - Cohen's d effect size
+   - Per-agent summary: baseline vs treatment pass rates and scores
+   - Per-task breakdown: which tasks improved, which degraded
+   - Summary verdict: "improved", "degraded", or "no significant improvement"
+   - Rich terminal table output
+   - JSON output format
+
+ ### D3: Dashboard Integration
+ - Extended `coderace/dashboard.py` with context-eval A/B section:
+   - Bar chart: baseline vs treatment scores per agent
+   - Delta table with CI (95%) and effect size
+   - Verdict display
+ - CSS for A/B visualization (.ab-baseline, .ab-treatment, .positive, .negative)
+ - Added --context-eval PATH flag to `coderace dashboard` command
+
+ ### D4: Documentation + Examples
+ - README.md: Added "Context Evaluation" and "Measuring Context Engineering Impact" sections
+   with usage examples, output format, CLI flags table, and effect size interpretation guide
+ - examples/context-eval-demo.sh: Executable demo script
+ - Clear help text on `coderace context-eval --help` and `coderace --help`
+
+ ## Test Results
+
+ - 58 new tests added (41 for D1+D2, 17 for D3)
+ - All 505 tests pass (447 original + 58 new)
+ - No regressions in existing test suite
+
+ ## Commits
+
+ 1. feat(context-eval): add context-eval command with A/B statistical comparison (D1+D2)
+ 2. feat(context-eval): add dashboard A/B comparison section (D3)
+ 3. docs(context-eval): add README section, examples, and interpretation guide (D4)
+
+ ## Files Created/Modified
+
+ New files:
+ - coderace/context_eval.py
+ - coderace/commands/context_eval.py
+ - coderace/context_eval_report.py
+ - tests/test_context_eval.py
+ - tests/test_context_eval_dashboard.py
+ - examples/context-eval-demo.sh
+
+ Modified files:
+ - coderace/cli.py (registered context-eval subcommand + --context-eval dashboard flag)
+ - coderace/dashboard.py (added A/B comparison section)
+ - README.md (added context-eval documentation)
+ - progress-log.md (added D1-D4 progress entries)
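D2 above reports a delta with a 95% CI from Welch's t-test and a Cohen's d effect size. The sketch below is a self-contained illustration of those two statistics given two lists of per-trial scores; it uses scipy only for the Student-t quantile and is not the package's actual `context_eval_report.py`.

```python
# Illustrative only: Welch's-t confidence interval and Cohen's d for two samples.
import math
from statistics import mean, stdev

from scipy import stats  # assumption: used here only for the t quantile


def compare(baseline: list[float], treatment: list[float]) -> dict:
    n1, n2 = len(baseline), len(treatment)
    m1, m2 = mean(baseline), mean(treatment)
    v1, v2 = stdev(baseline) ** 2, stdev(treatment) ** 2
    delta = m2 - m1

    # Welch's t: unequal-variance standard error and Welch-Satterthwaite df
    se = math.sqrt(v1 / n1 + v2 / n2)
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    t_crit = stats.t.ppf(0.975, df)        # two-sided 95% interval
    ci = (delta - t_crit * se, delta + t_crit * se)

    # Cohen's d with a pooled standard deviation
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = delta / pooled_sd if pooled_sd else 0.0

    return {"delta": delta, "ci_95": ci, "effect_size": d}


print(compare([55.0, 60.0, 50.0], [80.0, 82.0, 81.0]))
```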
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: coderace
- Version: 0.9.0
+ Version: 1.3.0
  Summary: Race coding agents against each other on real tasks
  Project-URL: Homepage, https://github.com/mikiships/coderace
  Project-URL: Repository, https://github.com/mikiships/coderace
@@ -1,5 +1,10 @@
  # coderace

+ [![PyPI](https://img.shields.io/pypi/v/coderace)](https://pypi.org/project/coderace/)
+ [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](#install)
+ [![Tests](https://img.shields.io/badge/tests-526%20passing-brightgreen)](#)
+ [![License](https://img.shields.io/badge/license-MIT-lightgrey)](#license)
+
  Stop reading blog comparisons. Race coding agents against each other on real tasks in *your* repo with *your* code.

  Every week there's a new "Claude Code vs Codex vs Cursor" post. They test on toy problems with cherry-picked examples. coderace gives you automated, reproducible, scored comparisons on the tasks you actually care about.
@@ -310,6 +315,41 @@ Keys can be agent names (`claude`, `codex`, `aider`, `gemini`, `opencode`) or mo

  Pricing is easy to update: the table lives in `coderace/cost.py` as a plain dict.

+ ## Model Selection
+
+ Compare different models of the same agent head-to-head using the `agent:model` syntax:
+
+ ```bash
+ # Compare two Codex models on the same task
+ coderace run task.yaml --agent codex:gpt-5.4 --agent codex:gpt-5.3-codex
+
+ # Mix agents and models
+ coderace run task.yaml --agent codex:gpt-5.4 --agent claude:opus-4-6 --agent claude:sonnet-4-6
+
+ # Benchmark multiple model variants across built-in tasks
+ coderace benchmark --agents codex:gpt-5.4,codex:gpt-5.3-codex,claude:opus-4-6
+
+ # Race model variants against each other (parallel)
+ coderace race task.yaml --agent codex:gpt-5.4 --agent codex:gpt-5.3-codex
+ ```
+
+ In task YAML files:
+
+ ```yaml
+ agents:
+ - codex:gpt-5.4
+ - codex:gpt-5.3-codex
+ - claude:opus-4-6
+ - claude:sonnet-4-6
+ ```
+
+ **How it works:**
+ - `agent:model` splits on the first colon: `codex:gpt-5.4` → agent `codex`, model `gpt-5.4` (see the sketch after this list)
+ - The model is passed via `--model <name>` to the underlying CLI
+ - Results display as `codex (gpt-5.4)` vs `codex (gpt-5.3-codex)` for easy comparison
+ - ELO ratings, leaderboard, and dashboard track each model variant separately
+ - The same agent can appear multiple times with different models in one run
+
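A minimal sketch of the split and display-name behavior described above. `parse_agent_spec()` and `make_display_name()` are helper names listed in the 1.3.0 changelog, but the bodies here, and the `coderace/` branch prefix, are illustrative assumptions rather than the package's actual code.

```python
# Sketch only: mirrors the documented behavior (split on the first colon,
# "agent (model)" display names, git-safe branch names with colons replaced).

def parse_agent_spec(spec: str) -> tuple[str, str | None]:
    """'codex:gpt-5.4' -> ('codex', 'gpt-5.4'); 'claude' -> ('claude', None)."""
    agent, sep, model = spec.partition(":")
    return agent, (model or None) if sep else None


def make_display_name(agent: str, model: str | None) -> str:
    return f"{agent} ({model})" if model else agent


def branch_name(spec: str) -> str:
    # Branch prefix is hypothetical; the colon-to-dash substitution is the
    # documented git-compatibility fix.
    return f"coderace/{spec.replace(':', '-')}"


assert parse_agent_spec("codex:gpt-5.4") == ("codex", "gpt-5.4")
assert make_display_name("codex", "gpt-5.4") == "codex (gpt-5.4)"
assert branch_name("codex:gpt-5.4") == "coderace/codex-gpt-5.4"
```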
  ## Leaderboard & History

  Every `coderace run` automatically saves results to a local SQLite database (`~/.coderace/results.db`). Two new commands aggregate this data.
@@ -441,6 +481,37 @@ coderace run task.yaml --parallel

  Sequential mode (default) runs agents one at a time on the same repo.

+ ## Race Mode
+
+ Use `coderace race` for first-to-pass execution. Unlike `coderace run --parallel`, race mode stops as soon as one agent meets the win condition:
+
+ - If verification is configured, the winner is the first agent that passes verification.
+ - If verification is not configured, the winner is the first agent that exits cleanly.
+ - Remaining agents are stopped after a short graceful-shutdown window (a sketch of this early-stop flow follows at the end of this section).
+
+ ```bash
+ coderace race task.yaml --agent claude --agent codex --agent aider
+ ```
+
+ Example terminal output:
+
+ ```text
+ 🏁 coderace race - fix-auth-bug
+ Running 3 agents in parallel...
+
+ Agent    Status          Time
+ claude   🔨 coding...    0:00:23
+ codex    🧪 testing...   0:00:31
+ aider    🛑 stopped      0:00:18
+
+ 🏆 Winner: codex - completed in 1:23 (first to pass verification)
+ Runner-up: claude - finished 0:12 later
+ ```
+
+ When to use each mode:
+ - Use `coderace race` when you want the fastest successful patch and can stop early.
+ - Use `coderace run --parallel` when you want full scoring across all agents before deciding.
+
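A minimal sketch of the first-to-pass, early-stop flow described above, using plain asyncio. The real `coderace/commands/race.py` runs actual agent CLIs in parallel worktrees with timeouts and a live UI; `run_agent` below is a stand-in, not the package's API.

```python
# Sketch only: start all agents, take the first one that passes, cancel the rest.
import asyncio
import random


async def run_agent(name: str) -> tuple[str, bool]:
    """Stand-in for: run the agent, then verification; True means it passed."""
    await asyncio.sleep(random.uniform(0.1, 0.5))
    return name, random.random() > 0.3


async def race(agents: list[str]) -> str | None:
    pending = {asyncio.create_task(run_agent(a)) for a in agents}
    winner = None
    while pending and winner is None:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            name, passed = task.result()
            if passed:                 # first agent to meet the win condition
                winner = name
                break
    for task in pending:               # stop the remaining agents
        task.cancel()
    return winner


print(asyncio.run(race(["claude", "codex", "aider"])))
```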
  ## Why coderace?

  **Blog posts compare models. coderace compares agents on your work.**
@@ -576,9 +647,15 @@ coderace benchmark --agents claude --difficulty easy,medium
  # Dry-run: see what would run without executing
  coderace benchmark --agents claude,codex --dry-run

+ # Statistical mode: run repeated trials per pair
+ coderace benchmark --agents claude,codex --tasks fibonacci,json-parser --trials 5
+
  # Save report to file
  coderace benchmark --agents claude,codex --output report.md
  coderace benchmark --agents claude,codex --output report.html
+
+ # Export standardized JSON (shareable benchmark artifact)
+ coderace benchmark --agents claude,codex --trials 5 --export benchmark.json
  ```

  ### Example Terminal Output
@@ -622,7 +699,171 @@ coderace benchmark show bench-20260227-143022
  | `--difficulty` | Filter by difficulty: `easy`, `medium`, `hard` | all |
  | `--timeout` | Per-task timeout in seconds | `300` |
  | `--parallel N` | Run N agents in parallel | `1` (sequential) |
+ | `--trials N` | Repeat each `(task, agent)` pair N times | `1` |
  | `--dry-run` | List combinations without running | `false` |
  | `--format` | Output format: `terminal`, `markdown`, `html` | `terminal` |
  | `--output` | Save report to file | — |
+ | `--export` | Write standardized benchmark JSON file | — |
  | `--no-save` | Skip saving results to the store | `false` |
+
+ ### Statistical Reports (`--trials > 1`)
+
+ When `--trials` is greater than 1, benchmark reports switch to statistical mode (a sketch of the underlying aggregates follows this list):
+
+ - Task cells show `mean score +/- stddev` (plus mean wall time)
+ - Report includes `CI (95%)`, `Consistency`, and `Reliability` columns
+ - Summary includes per-agent mean score, confidence interval, win rate, and reliability
+ - ELO ratings are rendered at the bottom of terminal/markdown/html reports
+
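As a rough illustration of where the `mean +/- stddev` and `CI (95%)` columns come from, here is a minimal per-(task, agent) aggregation assuming a normal approximation over trial scores; the actual formulas live in `coderace/statistics.py` and may differ.

```python
# Illustrative aggregation only, not the package's statistics module.
from statistics import mean, stdev


def aggregate(trial_scores: list[float]) -> dict:
    n = len(trial_scores)
    m = mean(trial_scores)
    sd = stdev(trial_scores) if n > 1 else 0.0
    margin = 1.96 * sd / n ** 0.5        # normal-approximation 95% interval
    return {"mean_score": m, "stddev_score": sd, "ci_95": (m - margin, m + margin)}


print(aggregate([85.0, 90.0, 87.5, 86.0, 89.0]))
```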
+ ### ELO Ratings
+
+ Every benchmark run updates persistent ELO ratings across all benchmark history.
+
+ ```bash
+ # Show ratings
+ coderace ratings
+
+ # JSON output
+ coderace ratings --json
+
+ # Reset all ratings to 1500
+ coderace ratings --reset
+ ```
+
+ ELO rules:
+ - Initial rating: `1500`
+ - K-factor: `32`
+ - Each task is treated as a round-robin set of pairwise matches
+ - Winner per pair is based on the higher mean trial score (a draw when within 1 point); a sketch of the update follows below
+
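A minimal sketch of the standard pairwise Elo update implied by these rules (initial 1500, K = 32, a draw counted as half a win). The real `coderace/elo.py` may differ in how matches are scheduled and persisted.

```python
# Sketch only: textbook Elo update with the parameters listed above.
K = 32


def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))


def outcome(mean_a: float, mean_b: float) -> float:
    # Draw when mean trial scores are within 1 point, per the rules above.
    if abs(mean_a - mean_b) <= 1.0:
        return 0.5
    return 1.0 if mean_a > mean_b else 0.0


claude, codex = 1500.0, 1500.0
claude, codex = update(claude, codex, outcome(87.5, 82.0))
print(round(claude), round(codex))  # 1516 1484
```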
+ ### Export Format (`--export`)
+
+ `coderace benchmark --export benchmark.json` writes a standardized JSON artifact:
+
+ ```json
+ {
+   "coderace_version": "1.0.0",
+   "benchmark_id": "bench-20260228-133000",
+   "timestamp": "2026-02-28T13:30:00Z",
+   "system": { "os": "...", "python": "...", "cpu": "..." },
+   "config": { "trials": 5, "timeout": 300, "tasks": ["..."], "agents": ["..."] },
+   "results": [
+     {
+       "task": "fibonacci",
+       "agent": "claude",
+       "trials": 5,
+       "mean_score": 87.5,
+       "stddev_score": 3.2,
+       "ci_95": [83.1, 91.9],
+       "mean_time": 45.2,
+       "mean_cost": 0.03,
+       "pass_rate": 1.0,
+       "consistency_score": 0.96,
+       "per_trial": []
+     }
+   ],
+   "elo_ratings": { "claude": 1523, "codex": 1488 },
+   "summary": {}
+ }
+ ```
+
+ ## Context Evaluation
+
+ The `coderace context-eval` command measures whether a context file (CLAUDE.md, AGENTS.md, .cursorrules, etc.) actually improves agent performance. It runs A/B trials — baseline (no context file) vs treatment (with context file) — and produces statistical comparisons.
+
+ ```bash
+ # Evaluate whether CLAUDE.md improves claude's performance on a task
+ coderace context-eval --context-file CLAUDE.md --task fix-auth-bug.yaml --agents claude --trials 5
+
+ # Evaluate across all built-in benchmark tasks
+ coderace context-eval --context-file CLAUDE.md --benchmark --agents claude,codex
+
+ # Save results as JSON
+ coderace context-eval --context-file CLAUDE.md --task task.yaml --agents claude --output results.json
+
+ # Use a custom task directory
+ coderace context-eval --context-file CLAUDE.md --benchmark --task-dir ./my-tasks --agents claude
+ ```
+
+ ### How It Works
+
+ For each agent × task combination:
+ 1. Run N trials **without** the context file (baseline condition)
+ 2. Run N trials **with** the context file placed in the task directory (treatment condition)
+ 3. Compare pass rates and mean scores, and compute statistical significance (a sketch of this loop follows below)
+
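A rough sketch of that baseline-vs-treatment loop. The helper names here are hypothetical stand-ins, and backup/restore of a pre-existing context file is omitted; the actual orchestrator is `run_context_eval()` in `coderace/context_eval.py`.

```python
# Illustrative A/B loop only; run_trial is a stand-in, not the package's API.
import shutil
from pathlib import Path


def run_trial(agent: str, task_dir: Path) -> float:
    """Stand-in for running the agent on the task and scoring the result."""
    return 0.0


def evaluate(context_file: Path, task_dir: Path, agent: str, trials: int) -> dict[str, list[float]]:
    scores: dict[str, list[float]] = {"baseline": [], "treatment": []}
    placed = task_dir / context_file.name
    for condition in ("baseline", "treatment"):
        if condition == "treatment":
            shutil.copy(context_file, placed)          # place the context file
        try:
            for _ in range(trials):
                scores[condition].append(run_trial(agent, task_dir))
        finally:
            placed.unlink(missing_ok=True)             # remove it again
    return scores
```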
+ ### Output
+
+ The terminal report shows:
+ - **Per-agent summary**: baseline vs treatment pass rates and scores, delta with 95% CI, Cohen's d effect size
+ - **Per-task breakdown**: which tasks improved, which degraded
+ - **Verdict**: whether the context file significantly improved performance
+
+ ```
+ ┌────────┬────────────────────┬─────────────────────┬────────────────┬─────────────────┬───────┬──────────────┬─────────────┐
+ │ Agent  │ Baseline Pass Rate │ Treatment Pass Rate │ Baseline Score │ Treatment Score │ Delta │ CI (95%)     │ Effect Size │
+ ├────────┼────────────────────┼─────────────────────┼────────────────┼─────────────────┼───────┼──────────────┼─────────────┤
+ │ claude │ 67%                │ 100%                │ 55.0           │ 81.0            │ +26.0 │ [10.5, 41.5] │ 2.10        │
+ │ codex  │ 33%                │ 67%                 │ 45.0           │ 70.0            │ +25.0 │ [8.0, 42.0]  │ 1.80        │
+ └────────┴────────────────────┴─────────────────────┴────────────────┴─────────────────┴───────┴──────────────┴─────────────┘
+
+ Context file improved performance by +25.5 points (CI: [12.0, 39.0])
+ ```
+
+ ### Context-Eval CLI Flags
+
+ | Flag | Description | Default |
+ |------|-------------|---------|
+ | `--context-file` | Path to the context file to evaluate (required) | — |
+ | `--task` | Path to a single task YAML | — |
+ | `--benchmark` | Run against built-in benchmark tasks | `false` |
+ | `--agents` | Comma-separated agent names (required) | — |
+ | `--trials` | Trials per condition (min: 2) | `3` |
+ | `--output` | Save JSON results to file | — |
+ | `--task-dir` | Custom task directory for benchmark mode | — |
+
+ ### Dashboard Integration
+
+ Include context-eval results in the HTML dashboard:
+
+ ```bash
+ # Run context-eval and save JSON
+ coderace context-eval --context-file CLAUDE.md --task task.yaml --agents claude --output eval.json
+
+ # Generate dashboard with A/B comparison section
+ coderace dashboard --context-eval eval.json
+ ```
+
+ ## Measuring Context Engineering Impact
+
+ Context engineering — crafting CLAUDE.md, AGENTS.md, .cursorrules, and similar files — is becoming a core developer skill. But until now, there was no way to empirically measure whether your context files actually help.
+
+ **The problem:** You write a CLAUDE.md with coding conventions, architectural guidelines, and project-specific instructions. But does it actually make agents produce better code? Or is it cargo-cult configuration?
+
+ **The solution:** `coderace context-eval` gives you data:
+
+ 1. **Write your context file** (e.g., CLAUDE.md with project conventions)
+ 2. **Run A/B evaluation** against real coding tasks
+ 3. **Get statistical evidence** of improvement (or lack thereof)
+
+ ```bash
+ # Iterate on your context file with data
+ coderace context-eval --context-file CLAUDE.md --benchmark --agents claude --trials 5
+
+ # Compare different context files
+ coderace context-eval --context-file v1-claude.md --task task.yaml --agents claude --output v1.json
+ coderace context-eval --context-file v2-claude.md --task task.yaml --agents claude --output v2.json
+ ```
+
+ **Interpreting results:**
+ - **Effect size > 0.8**: Large improvement — your context file is helping significantly
+ - **Effect size 0.2–0.8**: Moderate improvement — some benefit, room to iterate
+ - **Effect size < 0.2**: Negligible — your context file isn't making a measurable difference
+ - **CI crosses zero**: Not statistically significant — need more trials or a better context file
+
+ ## See Also
+
+ - **[agentmd](https://github.com/mikiships/agentmd)** — Generate and score context files (CLAUDE.md, AGENTS.md, .cursorrules) for AI coding agents. Pair with coderace: generate context with agentmd, measure agent performance with coderace, iterate with data instead of vibes.
+ - **[agentlint](https://github.com/mikiships/agentlint)** — Lint AI agent git diffs for risky patterns (scope drift, secret leaks, test regression). Static analysis, no LLM required.
+
+ Measure (coderace) → Optimize (agentmd) → Guard (agentlint).