coderace 0.9.0__tar.gz → 1.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (131)
  1. {coderace-0.9.0 → coderace-1.2.0}/CHANGELOG.md +28 -0
  2. coderace-1.2.0/DONE.txt +70 -0
  3. {coderace-0.9.0 → coderace-1.2.0}/PKG-INFO +199 -1
  4. {coderace-0.9.0 → coderace-1.2.0}/README.md +198 -0
  5. coderace-1.2.0/all-day-build-contract-context-eval.md +120 -0
  6. coderace-1.2.0/all-day-build-contract-race-mode.md +183 -0
  7. coderace-1.2.0/all-day-build-contract-v1.0-statistical.md +184 -0
  8. {coderace-0.9.0 → coderace-1.2.0}/coderace/__init__.py +1 -1
  9. {coderace-0.9.0 → coderace-1.2.0}/coderace/benchmark.py +158 -71
  10. {coderace-0.9.0 → coderace-1.2.0}/coderace/benchmark_report.py +372 -0
  11. {coderace-0.9.0 → coderace-1.2.0}/coderace/cli.py +230 -5
  12. {coderace-0.9.0 → coderace-1.2.0}/coderace/commands/benchmark.py +124 -10
  13. coderace-1.2.0/coderace/commands/context_eval.py +143 -0
  14. coderace-1.2.0/coderace/commands/race.py +626 -0
  15. coderace-1.2.0/coderace/context_eval.py +282 -0
  16. coderace-1.2.0/coderace/context_eval_report.py +275 -0
  17. {coderace-0.9.0 → coderace-1.2.0}/coderace/dashboard.py +97 -1
  18. coderace-1.2.0/coderace/elo.py +94 -0
  19. coderace-1.2.0/coderace/export.py +115 -0
  20. coderace-1.2.0/coderace/statistics.py +206 -0
  21. {coderace-0.9.0 → coderace-1.2.0}/coderace/store.py +56 -3
  22. coderace-1.2.0/examples/context-eval-demo.sh +65 -0
  23. coderace-1.2.0/progress-log.md +918 -0
  24. {coderace-0.9.0 → coderace-1.2.0}/pyproject.toml +1 -1
  25. coderace-1.2.0/tests/test_benchmark_trials.py +205 -0
  26. coderace-1.2.0/tests/test_benchmark_v1_integration.py +164 -0
  27. coderace-1.2.0/tests/test_context_eval.py +532 -0
  28. coderace-1.2.0/tests/test_context_eval_dashboard.py +198 -0
  29. coderace-1.2.0/tests/test_elo.py +259 -0
  30. coderace-1.2.0/tests/test_export.py +173 -0
  31. coderace-1.2.0/tests/test_race.py +883 -0
  32. coderace-1.2.0/tests/test_statistics.py +227 -0
  33. coderace-0.9.0/progress-log.md +0 -444
  34. {coderace-0.9.0 → coderace-1.2.0}/.github/workflows/publish.yml +0 -0
  35. {coderace-0.9.0 → coderace-1.2.0}/.gitignore +0 -0
  36. {coderace-0.9.0 → coderace-1.2.0}/LICENSE +0 -0
  37. {coderace-0.9.0 → coderace-1.2.0}/action.yml +0 -0
  38. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-benchmark.md +0 -0
  39. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-builtin-tasks.md +0 -0
  40. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-ci-integration.md +0 -0
  41. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-cost-tracking.md +0 -0
  42. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-dashboard.md +0 -0
  43. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-leaderboard.md +0 -0
  44. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-v0.2.md +0 -0
  45. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-v090-tasks.md +0 -0
  46. {coderace-0.9.0 → coderace-1.2.0}/all-day-build-contract-verification-tests.md +0 -0
  47. {coderace-0.9.0 → coderace-1.2.0}/benchmark-results/fibonacci-2026-02-27.md +0 -0
  48. {coderace-0.9.0 → coderace-1.2.0}/benchmark-results/fibonacci-v2-2026-02-27.md +0 -0
  49. {coderace-0.9.0 → coderace-1.2.0}/benchmark-results/hard-tasks-2026-02-27.md +0 -0
  50. {coderace-0.9.0 → coderace-1.2.0}/benchmark-results/multi-task-2026-02-27.md +0 -0
  51. {coderace-0.9.0 → coderace-1.2.0}/coderace/adapters/__init__.py +0 -0
  52. {coderace-0.9.0 → coderace-1.2.0}/coderace/adapters/aider.py +0 -0
  53. {coderace-0.9.0 → coderace-1.2.0}/coderace/adapters/base.py +0 -0
  54. {coderace-0.9.0 → coderace-1.2.0}/coderace/adapters/claude.py +0 -0
  55. {coderace-0.9.0 → coderace-1.2.0}/coderace/adapters/codex.py +0 -0
  56. {coderace-0.9.0 → coderace-1.2.0}/coderace/adapters/gemini.py +0 -0
  57. {coderace-0.9.0 → coderace-1.2.0}/coderace/adapters/opencode.py +0 -0
  58. {coderace-0.9.0 → coderace-1.2.0}/coderace/benchmark_stats.py +0 -0
  59. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/__init__.py +0 -0
  60. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/binary-search-tree.yaml +0 -0
  61. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/cli-args-parser.yaml +0 -0
  62. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/csv-analyzer.yaml +0 -0
  63. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/data-pipeline.yaml +0 -0
  64. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/diff-algorithm.yaml +0 -0
  65. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/expression-evaluator.yaml +0 -0
  66. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/fibonacci.yaml +0 -0
  67. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/file-watcher.yaml +0 -0
  68. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/http-server.yaml +0 -0
  69. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/json-parser.yaml +0 -0
  70. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/lru-cache.yaml +0 -0
  71. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/markdown-to-html.yaml +0 -0
  72. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/regex-engine.yaml +0 -0
  73. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/state-machine.yaml +0 -0
  74. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/task-scheduler.yaml +0 -0
  75. {coderace-0.9.0 → coderace-1.2.0}/coderace/builtins/tasks/url-router.yaml +0 -0
  76. {coderace-0.9.0 → coderace-1.2.0}/coderace/commands/__init__.py +0 -0
  77. {coderace-0.9.0 → coderace-1.2.0}/coderace/commands/dashboard.py +0 -0
  78. {coderace-0.9.0 → coderace-1.2.0}/coderace/commands/diff.py +0 -0
  79. {coderace-0.9.0 → coderace-1.2.0}/coderace/commands/history.py +0 -0
  80. {coderace-0.9.0 → coderace-1.2.0}/coderace/commands/leaderboard.py +0 -0
  81. {coderace-0.9.0 → coderace-1.2.0}/coderace/commands/results.py +0 -0
  82. {coderace-0.9.0 → coderace-1.2.0}/coderace/commands/tasks.py +0 -0
  83. {coderace-0.9.0 → coderace-1.2.0}/coderace/cost.py +0 -0
  84. {coderace-0.9.0 → coderace-1.2.0}/coderace/git_ops.py +0 -0
  85. {coderace-0.9.0 → coderace-1.2.0}/coderace/html_report.py +0 -0
  86. {coderace-0.9.0 → coderace-1.2.0}/coderace/publish.py +0 -0
  87. {coderace-0.9.0 → coderace-1.2.0}/coderace/reporter.py +0 -0
  88. {coderace-0.9.0 → coderace-1.2.0}/coderace/scorer.py +0 -0
  89. {coderace-0.9.0 → coderace-1.2.0}/coderace/stats.py +0 -0
  90. {coderace-0.9.0 → coderace-1.2.0}/coderace/task.py +0 -0
  91. {coderace-0.9.0 → coderace-1.2.0}/coderace/types.py +0 -0
  92. {coderace-0.9.0 → coderace-1.2.0}/demo-race.yaml +0 -0
  93. {coderace-0.9.0 → coderace-1.2.0}/examples/add-type-hints.yaml +0 -0
  94. {coderace-0.9.0 → coderace-1.2.0}/examples/ci-race-on-pr.yml +0 -0
  95. {coderace-0.9.0 → coderace-1.2.0}/examples/example-task.yaml +0 -0
  96. {coderace-0.9.0 → coderace-1.2.0}/examples/fix-edge-case.yaml +0 -0
  97. {coderace-0.9.0 → coderace-1.2.0}/examples/write-tests.yaml +0 -0
  98. {coderace-0.9.0 → coderace-1.2.0}/scripts/ci-run.sh +0 -0
  99. {coderace-0.9.0 → coderace-1.2.0}/scripts/format-comment.py +0 -0
  100. {coderace-0.9.0 → coderace-1.2.0}/tasks/markdown-table.yaml +0 -0
  101. {coderace-0.9.0 → coderace-1.2.0}/tasks/parse-duration.yaml +0 -0
  102. {coderace-0.9.0 → coderace-1.2.0}/tests/__init__.py +0 -0
  103. {coderace-0.9.0 → coderace-1.2.0}/tests/conftest.py +0 -0
  104. {coderace-0.9.0 → coderace-1.2.0}/tests/test_adapters.py +0 -0
  105. {coderace-0.9.0 → coderace-1.2.0}/tests/test_benchmark.py +0 -0
  106. {coderace-0.9.0 → coderace-1.2.0}/tests/test_builtins.py +0 -0
  107. {coderace-0.9.0 → coderace-1.2.0}/tests/test_cli.py +0 -0
  108. {coderace-0.9.0 → coderace-1.2.0}/tests/test_cli_store_integration.py +0 -0
  109. {coderace-0.9.0 → coderace-1.2.0}/tests/test_cost.py +0 -0
  110. {coderace-0.9.0 → coderace-1.2.0}/tests/test_cost_config.py +0 -0
  111. {coderace-0.9.0 → coderace-1.2.0}/tests/test_cost_integration.py +0 -0
  112. {coderace-0.9.0 → coderace-1.2.0}/tests/test_dashboard.py +0 -0
  113. {coderace-0.9.0 → coderace-1.2.0}/tests/test_dashboard_cli.py +0 -0
  114. {coderace-0.9.0 → coderace-1.2.0}/tests/test_diff.py +0 -0
  115. {coderace-0.9.0 → coderace-1.2.0}/tests/test_examples.py +0 -0
  116. {coderace-0.9.0 → coderace-1.2.0}/tests/test_format_comment.py +0 -0
  117. {coderace-0.9.0 → coderace-1.2.0}/tests/test_full_workflow.py +0 -0
  118. {coderace-0.9.0 → coderace-1.2.0}/tests/test_git_ops.py +0 -0
  119. {coderace-0.9.0 → coderace-1.2.0}/tests/test_history.py +0 -0
  120. {coderace-0.9.0 → coderace-1.2.0}/tests/test_html_report.py +0 -0
  121. {coderace-0.9.0 → coderace-1.2.0}/tests/test_leaderboard.py +0 -0
  122. {coderace-0.9.0 → coderace-1.2.0}/tests/test_markdown_results.py +0 -0
  123. {coderace-0.9.0 → coderace-1.2.0}/tests/test_publish.py +0 -0
  124. {coderace-0.9.0 → coderace-1.2.0}/tests/test_reporter.py +0 -0
  125. {coderace-0.9.0 → coderace-1.2.0}/tests/test_scorer.py +0 -0
  126. {coderace-0.9.0 → coderace-1.2.0}/tests/test_stats.py +0 -0
  127. {coderace-0.9.0 → coderace-1.2.0}/tests/test_store.py +0 -0
  128. {coderace-0.9.0 → coderace-1.2.0}/tests/test_task.py +0 -0
  129. {coderace-0.9.0 → coderace-1.2.0}/tests/test_tasks_cli.py +0 -0
  130. {coderace-0.9.0 → coderace-1.2.0}/tests/test_verification_integration.py +0 -0
  131. {coderace-0.9.0 → coderace-1.2.0}/uv.lock +0 -0
@@ -1,5 +1,33 @@
  # Changelog

+ ## [1.2.0] - 2026-03-03
+
+ ### Added
+
+ - **`coderace race` command** - New first-to-pass race mode with early-stop semantics. Agents run in parallel worktrees and the race ends when the first winner is found.
+ - **Live race UI** - Rich `Live` panel with per-agent status and timers:
+   - `🔨 coding...`
+   - `🧪 testing...`
+   - `✅ WINNER!`
+   - `❌ failed`
+   - `⏰ timed out`
+   - `🛑 stopped`
+ - **Winner announcement and runner-up delta** - Prints race winner and optional runner-up timing delta after the live panel closes.
+ - **Race result persistence (JSON fallback)** - Saves race summaries to `.coderace/race-results.json` including `race_id`, winner metadata, participant statuses, exit codes, and wall times. Supports `--no-save`.
+ - **Race test suite** - Added 21 race-focused tests (`tests/test_race.py`) covering winner logic, cancellation, timeout/no-winner paths, verification modes, live updates, serialization, and Ctrl+C cleanup.
+
+ ## [1.0.0] - 2026-02-28
+
+ ### Added
+
+ - **Benchmark trials mode** — `coderace benchmark --trials N` now runs each `(task, agent)` pair repeatedly and stores each trial with `trial_number` in SQLite.
+ - **Statistical benchmarking module** — New `coderace/statistics.py` computes per-pair and per-agent aggregates: mean/stddev, 95% confidence intervals, pass rate, consistency, win rate, cost efficiency, and reliability.
+ - **Persistent ELO ratings** — New `coderace/elo.py` plus `elo_ratings` store table. Ratings update automatically after each benchmark using pairwise task outcomes and persist across runs.
+ - **`coderace ratings` command** — View persistent ELO rankings, output as JSON (`--json`), and reset all ratings (`--reset`).
+ - **Standardized benchmark export** — `coderace benchmark --export <path>` writes shareable JSON with run metadata, system info, per-trial details, aggregate stats, and current ELO ratings.
+ - **Enhanced benchmark report rendering** — Multi-trial reports now show statistical columns (`mean +/- stddev`, CI, consistency, reliability) and include ELO ratings in terminal/markdown/html output.
+ - **Integration and edge-case coverage for v1.0 flow** — Added tests for full `--trials 3` benchmark + export + ELO pipeline and edge cases (single trial/agent/task and always-failing agent).
+
  ## [0.7.0] - 2026-02-26

  ### Added
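
The 1.0.0 entries above credit `coderace/statistics.py` with per-pair mean/stddev and 95% confidence intervals. The sketch below shows the standard formula those terms refer to; it is illustrative only, not the package's actual code, and it uses the normal approximation (1.96) rather than a t quantile, which would be more accurate for a handful of trials.

```python
# Illustrative only: the textbook formula behind "mean/stddev, 95% confidence
# intervals" in the changelog, not coderace's actual statistics.py.
from math import sqrt
from statistics import mean, stdev

def ci_95(scores: list[float]) -> tuple[float, float]:
    """95% confidence interval for the mean trial score (normal approximation)."""
    m = mean(scores)
    if len(scores) < 2:
        return (m, m)  # a single trial gives no spread estimate
    half_width = 1.96 * stdev(scores) / sqrt(len(scores))
    return (m - half_width, m + half_width)

print(ci_95([84.0, 87.0, 90.0, 88.0, 86.0]))  # ≈ (85.04, 88.96)
```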
@@ -0,0 +1,70 @@
+ Context Eval Build Contract: COMPLETE
+ ======================================
+
+ Date: 2026-03-02
+ All deliverables (D1-D4) implemented and validated.
+
+ ## What Was Built
+
+ ### D1: context-eval CLI Command (core + CLI)
+ - `coderace/context_eval.py`: Core A/B evaluation engine
+   - ContextEvalResult and TrialResult data classes
+   - Context file backup/restore/placement/removal for baseline vs treatment
+   - KNOWN_CONTEXT_FILES list (CLAUDE.md, AGENTS.md, .cursorrules, etc.)
+   - run_context_eval() orchestrator running N trials per condition
+ - `coderace/commands/context_eval.py`: CLI subcommand with:
+   --context-file PATH, --task PATH, --benchmark, --agents, --trials N,
+   --output PATH, --task-dir PATH
+ - Full input validation (missing files, invalid agents, trials < 2, etc.)
+
+ ### D2: Statistical Comparison Report
+ - `coderace/context_eval_report.py`: Statistical analysis and rendering
+   - Delta with 95% CI using Welch's t-test
+   - Cohen's d effect size
+   - Per-agent summary: baseline vs treatment pass rates and scores
+   - Per-task breakdown: which tasks improved, which degraded
+   - Summary verdict: "improved", "degraded", or "no significant improvement"
+   - Rich terminal table output
+   - JSON output format
+
+ ### D3: Dashboard Integration
+ - Extended `coderace/dashboard.py` with context-eval A/B section:
+   - Bar chart: baseline vs treatment scores per agent
+   - Delta table with CI (95%) and effect size
+   - Verdict display
+ - CSS for A/B visualization (.ab-baseline, .ab-treatment, .positive, .negative)
+ - Added --context-eval PATH flag to `coderace dashboard` command
+
+ ### D4: Documentation + Examples
+ - README.md: Added "Context Evaluation" and "Measuring Context Engineering Impact" sections
+   with usage examples, output format, CLI flags table, and effect size interpretation guide
+ - examples/context-eval-demo.sh: Executable demo script
+ - Clear help text on `coderace context-eval --help` and `coderace --help`
+
+ ## Test Results
+
+ - 58 new tests added (41 for D1+D2, 17 for D3)
+ - All 505 tests pass (447 original + 58 new)
+ - No regressions in existing test suite
+
+ ## Commits
+
+ 1. feat(context-eval): add context-eval command with A/B statistical comparison (D1+D2)
+ 2. feat(context-eval): add dashboard A/B comparison section (D3)
+ 3. docs(context-eval): add README section, examples, and interpretation guide (D4)
+
+ ## Files Created/Modified
+
+ New files:
+ - coderace/context_eval.py
+ - coderace/commands/context_eval.py
+ - coderace/context_eval_report.py
+ - tests/test_context_eval.py
+ - tests/test_context_eval_dashboard.py
+ - examples/context-eval-demo.sh
+
+ Modified files:
+ - coderace/cli.py (registered context-eval subcommand + --context-eval dashboard flag)
+ - coderace/dashboard.py (added A/B comparison section)
+ - README.md (added context-eval documentation)
+ - progress-log.md (added D1-D4 progress entries)
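
The D1 notes above describe backing up, placing, and restoring context files around treatment trials. The following is a hypothetical sketch of that pattern; the helper name and backup suffix are invented here for illustration, and `coderace/context_eval.py` may implement it differently.

```python
# Hypothetical sketch of the backup/place/restore pattern DONE.txt describes.
import shutil
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def placed_context_file(context_file: Path, task_dir: Path):
    """Place the file under evaluation for a treatment trial, then clean up,
    restoring any same-named file that was already in the task directory."""
    target = task_dir / context_file.name
    backup = target.with_name(target.name + ".coderace-bak")
    had_existing = target.exists()
    if had_existing:
        shutil.move(str(target), str(backup))      # back up the pre-existing file
    shutil.copy2(context_file, target)             # place the file under evaluation
    try:
        yield target
    finally:
        target.unlink(missing_ok=True)             # remove the evaluated file
        if had_existing:
            shutil.move(str(backup), str(target))  # restore the original
```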
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: coderace
- Version: 0.9.0
+ Version: 1.2.0
  Summary: Race coding agents against each other on real tasks
  Project-URL: Homepage, https://github.com/mikiships/coderace
  Project-URL: Repository, https://github.com/mikiships/coderace
@@ -471,6 +471,37 @@ coderace run task.yaml --parallel

  Sequential mode (default) runs agents one at a time on the same repo.

+ ## Race Mode
+
+ Use `coderace race` for first-to-pass execution. Unlike `coderace run --parallel`, race mode stops as soon as one agent passes the win condition:
+
+ - If verification is configured, winner = first agent that passes verification.
+ - If verification is not configured, winner = first agent that exits cleanly.
+ - Remaining agents are stopped after a short graceful shutdown window.
+
+ ```bash
+ coderace race task.yaml --agent claude --agent codex
+ ```
+
+ Example terminal output:
+
+ ```text
+ 🏁 coderace race - fix-auth-bug
+ Running 3 agents in parallel...
+
+ Agent    Status          Time
+ claude   🔨 coding...    0:00:23
+ codex    🧪 testing...   0:00:31
+ aider    🛑 stopped      0:00:18
+
+ 🏆 Winner: codex - completed in 1:23 (first to pass verification)
+ Runner-up: claude - finished 0:12 later
+ ```
+
+ When to use each mode:
+ - Use `coderace race` when you want the fastest successful patch and can stop early.
+ - Use `coderace run --parallel` when you want full scoring across all agents before deciding.
+
  ## Why coderace?

  **Blog posts compare models. coderace compares agents on your work.**
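
A note on the first-to-pass semantics documented in this hunk: the core pattern is to wait for the first clean finisher and cancel the rest. The `asyncio` sketch below illustrates that idea only; it is not coderace's implementation, which runs real agent processes in parallel worktrees.

```python
# Concept sketch of first-to-pass racing: wait for the first task that
# finishes without error, cancel everything still pending.
import asyncio

async def run_agent(name: str, seconds: float, passes: bool) -> str:
    """Stand-in for one agent coding and verifying in its own worktree."""
    await asyncio.sleep(seconds)
    if not passes:
        raise RuntimeError(f"{name} failed verification")
    return name

async def race(tasks: set[asyncio.Task]) -> str | None:
    pending = tasks
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for finished in done:
            if finished.exception() is None:   # first clean pass wins
                for other in pending:
                    other.cancel()             # stop the remaining agents
                return finished.result()
    return None                                # no agent passed

async def main():
    tasks = {
        asyncio.create_task(run_agent("codex", 0.9, False)),
        asyncio.create_task(run_agent("claude", 1.2, True)),
    }
    print("winner:", await race(tasks))

asyncio.run(main())
```

Note that a failing finisher does not end the race here; the loop keeps waiting, which matches the documented behavior that the winner is the first agent to pass, not merely the first to finish.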
@@ -606,9 +637,15 @@ coderace benchmark --agents claude --difficulty easy,medium
  # Dry-run: see what would run without executing
  coderace benchmark --agents claude,codex --dry-run

+ # Statistical mode: run repeated trials per pair
+ coderace benchmark --agents claude,codex --tasks fibonacci,json-parser --trials 5
+
  # Save report to file
  coderace benchmark --agents claude,codex --output report.md
  coderace benchmark --agents claude,codex --output report.html
+
+ # Export standardized JSON (shareable benchmark artifact)
+ coderace benchmark --agents claude,codex --trials 5 --export benchmark.json
  ```

  ### Example Terminal Output
@@ -652,7 +689,168 @@ coderace benchmark show bench-20260227-143022
  | `--difficulty` | Filter by difficulty: `easy`, `medium`, `hard` | all |
  | `--timeout` | Per-task timeout in seconds | `300` |
  | `--parallel N` | Run N agents in parallel | `1` (sequential) |
+ | `--trials N` | Repeat each `(task, agent)` pair N times | `1` |
  | `--dry-run` | List combinations without running | `false` |
  | `--format` | Output format: `terminal`, `markdown`, `html` | `terminal` |
  | `--output` | Save report to file | — |
+ | `--export` | Write standardized benchmark JSON file | — |
  | `--no-save` | Skip saving results to the store | `false` |
+
+ ### Statistical Reports (`--trials > 1`)
+
+ When `--trials` is greater than 1, benchmark reports switch to statistical mode:
+
+ - Task cells show `mean score +/- stddev` (plus mean wall time)
+ - Report includes `CI (95%)`, `Consistency`, and `Reliability` columns
+ - Summary includes per-agent mean score, confidence interval, win rate, and reliability
+ - ELO ratings are rendered at the bottom of terminal/markdown/html reports
+
+ ### ELO Ratings
+
+ Every benchmark run updates persistent ELO ratings across all benchmark history.
+
+ ```bash
+ # Show ratings
+ coderace ratings
+
+ # JSON output
+ coderace ratings --json
+
+ # Reset all ratings to 1500
+ coderace ratings --reset
+ ```
+
+ ELO rules:
+ - Initial rating: `1500`
+ - K-factor: `32`
+ - Each task is treated as a round-robin set of pairwise matches
+ - Winner per pair is based on higher mean trial score (draw when within 1 point)
+
+ ### Export Format (`--export`)
+
+ `coderace benchmark --export benchmark.json` writes a standardized JSON artifact:
+
+ ```json
+ {
+   "coderace_version": "1.0.0",
+   "benchmark_id": "bench-20260228-133000",
+   "timestamp": "2026-02-28T13:30:00Z",
+   "system": { "os": "...", "python": "...", "cpu": "..." },
+   "config": { "trials": 5, "timeout": 300, "tasks": ["..."], "agents": ["..."] },
+   "results": [
+     {
+       "task": "fibonacci",
+       "agent": "claude",
+       "trials": 5,
+       "mean_score": 87.5,
+       "stddev_score": 3.2,
+       "ci_95": [83.1, 91.9],
+       "mean_time": 45.2,
+       "mean_cost": 0.03,
+       "pass_rate": 1.0,
+       "consistency_score": 0.96,
+       "per_trial": []
+     }
+   ],
+   "elo_ratings": { "claude": 1523, "codex": 1488 },
+   "summary": {}
+ }
+ ```
+
+ ## Context Evaluation
+
+ The `coderace context-eval` command measures whether a context file (CLAUDE.md, AGENTS.md, .cursorrules, etc.) actually improves agent performance. It runs A/B trials — baseline (no context file) vs treatment (with context file) — and produces statistical comparisons.
+
+ ```bash
+ # Evaluate whether CLAUDE.md improves claude's performance on a task
+ coderace context-eval --context-file CLAUDE.md --task fix-auth-bug.yaml --agents claude --trials 5
+
+ # Evaluate across all built-in benchmark tasks
+ coderace context-eval --context-file CLAUDE.md --benchmark --agents claude,codex
+
+ # Save results as JSON
+ coderace context-eval --context-file CLAUDE.md --task task.yaml --agents claude --output results.json
+
+ # Use a custom task directory
+ coderace context-eval --context-file CLAUDE.md --benchmark --task-dir ./my-tasks --agents claude
+ ```
+
+ ### How It Works
+
+ For each agent × task combination:
+ 1. Run N trials **without** the context file (baseline condition)
+ 2. Run N trials **with** the context file placed in the task directory (treatment condition)
+ 3. Compare pass rates, mean scores, and compute statistical significance
+
+ ### Output
+
+ The terminal report shows:
+ - **Per-agent summary**: baseline vs treatment pass rates and scores, delta with 95% CI, Cohen's d effect size
+ - **Per-task breakdown**: which tasks improved, which degraded
+ - **Verdict**: whether the context file significantly improved performance
+
+ ```
+ ┌────────┬────────────────────┬─────────────────────┬────────────────┬─────────────────┬───────┬──────────────┬─────────────┐
+ │ Agent  │ Baseline Pass Rate │ Treatment Pass Rate │ Baseline Score │ Treatment Score │ Delta │ CI (95%)     │ Effect Size │
+ ├────────┼────────────────────┼─────────────────────┼────────────────┼─────────────────┼───────┼──────────────┼─────────────┤
+ │ claude │ 67%                │ 100%                │ 55.0           │ 81.0            │ +26.0 │ [10.5, 41.5] │ 2.10        │
+ │ codex  │ 33%                │ 67%                 │ 45.0           │ 70.0            │ +25.0 │ [8.0, 42.0]  │ 1.80        │
+ └────────┴────────────────────┴─────────────────────┴────────────────┴─────────────────┴───────┴──────────────┴─────────────┘
+
+ Context file improved performance by +25.5 points (CI: [12.0, 39.0])
+ ```
+
+ ### Context-Eval CLI Flags
+
+ | Flag | Description | Default |
+ |------|-------------|---------|
+ | `--context-file` | Path to the context file to evaluate (required) | — |
+ | `--task` | Path to a single task YAML | — |
+ | `--benchmark` | Run against built-in benchmark tasks | `false` |
+ | `--agents` | Comma-separated agent names (required) | — |
+ | `--trials` | Trials per condition (min: 2) | `3` |
+ | `--output` | Save JSON results to file | — |
+ | `--task-dir` | Custom task directory for benchmark mode | — |
+
+ ### Dashboard Integration
+
+ Include context-eval results in the HTML dashboard:
+
+ ```bash
+ # Run context-eval and save JSON
+ coderace context-eval --context-file CLAUDE.md --task task.yaml --agents claude --output eval.json
+
+ # Generate dashboard with A/B comparison section
+ coderace dashboard --context-eval eval.json
+ ```
+
+ ## Measuring Context Engineering Impact
+
+ Context engineering — crafting CLAUDE.md, AGENTS.md, .cursorrules, and similar files — is becoming a core developer skill. But until now, there was no way to empirically measure whether your context files actually help.
+
+ **The problem:** You write a CLAUDE.md with coding conventions, architectural guidelines, and project-specific instructions. But does it actually make agents produce better code? Or is it cargo-cult configuration?
+
+ **The solution:** `coderace context-eval` gives you data:
+
+ 1. **Write your context file** (e.g., CLAUDE.md with project conventions)
+ 2. **Run A/B evaluation** against real coding tasks
+ 3. **Get statistical evidence** of improvement (or lack thereof)
+
+ ```bash
+ # Iterate on your context file with data
+ coderace context-eval --context-file CLAUDE.md --benchmark --agents claude --trials 5
+
+ # Compare different context files
+ coderace context-eval --context-file v1-claude.md --task task.yaml --agents claude --output v1.json
+ coderace context-eval --context-file v2-claude.md --task task.yaml --agents claude --output v2.json
+ ```
+
+ **Interpreting results:**
+ - **Effect size > 0.8**: Large improvement — your context file is helping significantly
+ - **Effect size 0.2–0.8**: Moderate improvement — some benefit, room to iterate
+ - **Effect size < 0.2**: Negligible — your context file isn't making a measurable difference
+ - **CI crosses zero**: Not statistically significant — need more trials or a better context file
+
+ ## See Also
+
+ - **[agentmd](https://github.com/mikiships/agentmd)** — Generate and score context files (CLAUDE.md, AGENTS.md, .cursorrules) for AI coding agents. Pair with coderace: generate context with agentmd, measure agent performance with coderace, iterate with data instead of vibes.
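
The ELO rules listed in this hunk pin down the update completely: initial rating 1500, K-factor 32, pairwise matches per task, and a draw when mean scores are within 1 point. A minimal sketch under those documented rules (an illustration, not `coderace/elo.py` itself):

```python
# Standard Elo update under the documented rules: K = 32, draw within 1 point.
K = 32

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, score_b: float) -> tuple[float, float]:
    """Update two ratings from one pairwise match on a task."""
    if abs(score_a - score_b) <= 1.0:
        outcome = 0.5                      # draw: mean scores within 1 point
    else:
        outcome = 1.0 if score_a > score_b else 0.0
    r_a_new = r_a + K * (outcome - expected(r_a, r_b))
    r_b_new = r_b + K * ((1.0 - outcome) - expected(r_b, r_a))
    return r_a_new, r_b_new

print(update(1500, 1500, 87.5, 62.0))  # → (1516.0, 1484.0)
```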
@@ -441,6 +441,37 @@ coderace run task.yaml --parallel

  Sequential mode (default) runs agents one at a time on the same repo.

+ ## Race Mode
+
+ Use `coderace race` for first-to-pass execution. Unlike `coderace run --parallel`, race mode stops as soon as one agent passes the win condition:
+
+ - If verification is configured, winner = first agent that passes verification.
+ - If verification is not configured, winner = first agent that exits cleanly.
+ - Remaining agents are stopped after a short graceful shutdown window.
+
+ ```bash
+ coderace race task.yaml --agent claude --agent codex
+ ```
+
+ Example terminal output:
+
+ ```text
+ 🏁 coderace race - fix-auth-bug
+ Running 3 agents in parallel...
+
+ Agent    Status          Time
+ claude   🔨 coding...    0:00:23
+ codex    🧪 testing...   0:00:31
+ aider    🛑 stopped      0:00:18
+
+ 🏆 Winner: codex - completed in 1:23 (first to pass verification)
+ Runner-up: claude - finished 0:12 later
+ ```
+
+ When to use each mode:
+ - Use `coderace race` when you want the fastest successful patch and can stop early.
+ - Use `coderace run --parallel` when you want full scoring across all agents before deciding.
+
  ## Why coderace?

  **Blog posts compare models. coderace compares agents on your work.**
@@ -576,9 +607,15 @@ coderace benchmark --agents claude --difficulty easy,medium
  # Dry-run: see what would run without executing
  coderace benchmark --agents claude,codex --dry-run

+ # Statistical mode: run repeated trials per pair
+ coderace benchmark --agents claude,codex --tasks fibonacci,json-parser --trials 5
+
  # Save report to file
  coderace benchmark --agents claude,codex --output report.md
  coderace benchmark --agents claude,codex --output report.html
+
+ # Export standardized JSON (shareable benchmark artifact)
+ coderace benchmark --agents claude,codex --trials 5 --export benchmark.json
  ```

  ### Example Terminal Output
@@ -622,7 +659,168 @@ coderace benchmark show bench-20260227-143022
  | `--difficulty` | Filter by difficulty: `easy`, `medium`, `hard` | all |
  | `--timeout` | Per-task timeout in seconds | `300` |
  | `--parallel N` | Run N agents in parallel | `1` (sequential) |
+ | `--trials N` | Repeat each `(task, agent)` pair N times | `1` |
  | `--dry-run` | List combinations without running | `false` |
  | `--format` | Output format: `terminal`, `markdown`, `html` | `terminal` |
  | `--output` | Save report to file | — |
+ | `--export` | Write standardized benchmark JSON file | — |
  | `--no-save` | Skip saving results to the store | `false` |
+
+ ### Statistical Reports (`--trials > 1`)
+
+ When `--trials` is greater than 1, benchmark reports switch to statistical mode:
+
+ - Task cells show `mean score +/- stddev` (plus mean wall time)
+ - Report includes `CI (95%)`, `Consistency`, and `Reliability` columns
+ - Summary includes per-agent mean score, confidence interval, win rate, and reliability
+ - ELO ratings are rendered at the bottom of terminal/markdown/html reports
+
+ ### ELO Ratings
+
+ Every benchmark run updates persistent ELO ratings across all benchmark history.
+
+ ```bash
+ # Show ratings
+ coderace ratings
+
+ # JSON output
+ coderace ratings --json
+
+ # Reset all ratings to 1500
+ coderace ratings --reset
+ ```
+
+ ELO rules:
+ - Initial rating: `1500`
+ - K-factor: `32`
+ - Each task is treated as a round-robin set of pairwise matches
+ - Winner per pair is based on higher mean trial score (draw when within 1 point)
+
+ ### Export Format (`--export`)
+
+ `coderace benchmark --export benchmark.json` writes a standardized JSON artifact:
+
+ ```json
+ {
+   "coderace_version": "1.0.0",
+   "benchmark_id": "bench-20260228-133000",
+   "timestamp": "2026-02-28T13:30:00Z",
+   "system": { "os": "...", "python": "...", "cpu": "..." },
+   "config": { "trials": 5, "timeout": 300, "tasks": ["..."], "agents": ["..."] },
+   "results": [
+     {
+       "task": "fibonacci",
+       "agent": "claude",
+       "trials": 5,
+       "mean_score": 87.5,
+       "stddev_score": 3.2,
+       "ci_95": [83.1, 91.9],
+       "mean_time": 45.2,
+       "mean_cost": 0.03,
+       "pass_rate": 1.0,
+       "consistency_score": 0.96,
+       "per_trial": []
+     }
+   ],
+   "elo_ratings": { "claude": 1523, "codex": 1488 },
+   "summary": {}
+ }
+ ```
+
+ ## Context Evaluation
+
+ The `coderace context-eval` command measures whether a context file (CLAUDE.md, AGENTS.md, .cursorrules, etc.) actually improves agent performance. It runs A/B trials — baseline (no context file) vs treatment (with context file) — and produces statistical comparisons.
+
+ ```bash
+ # Evaluate whether CLAUDE.md improves claude's performance on a task
+ coderace context-eval --context-file CLAUDE.md --task fix-auth-bug.yaml --agents claude --trials 5
+
+ # Evaluate across all built-in benchmark tasks
+ coderace context-eval --context-file CLAUDE.md --benchmark --agents claude,codex
+
+ # Save results as JSON
+ coderace context-eval --context-file CLAUDE.md --task task.yaml --agents claude --output results.json
+
+ # Use a custom task directory
+ coderace context-eval --context-file CLAUDE.md --benchmark --task-dir ./my-tasks --agents claude
+ ```
+
+ ### How It Works
+
+ For each agent × task combination:
+ 1. Run N trials **without** the context file (baseline condition)
+ 2. Run N trials **with** the context file placed in the task directory (treatment condition)
+ 3. Compare pass rates, mean scores, and compute statistical significance
+
+ ### Output
+
+ The terminal report shows:
+ - **Per-agent summary**: baseline vs treatment pass rates and scores, delta with 95% CI, Cohen's d effect size
+ - **Per-task breakdown**: which tasks improved, which degraded
+ - **Verdict**: whether the context file significantly improved performance
+
+ ```
+ ┌────────┬────────────────────┬─────────────────────┬────────────────┬─────────────────┬───────┬──────────────┬─────────────┐
+ │ Agent  │ Baseline Pass Rate │ Treatment Pass Rate │ Baseline Score │ Treatment Score │ Delta │ CI (95%)     │ Effect Size │
+ ├────────┼────────────────────┼─────────────────────┼────────────────┼─────────────────┼───────┼──────────────┼─────────────┤
+ │ claude │ 67%                │ 100%                │ 55.0           │ 81.0            │ +26.0 │ [10.5, 41.5] │ 2.10        │
+ │ codex  │ 33%                │ 67%                 │ 45.0           │ 70.0            │ +25.0 │ [8.0, 42.0]  │ 1.80        │
+ └────────┴────────────────────┴─────────────────────┴────────────────┴─────────────────┴───────┴──────────────┴─────────────┘
+
+ Context file improved performance by +25.5 points (CI: [12.0, 39.0])
+ ```
+
+ ### Context-Eval CLI Flags
+
+ | Flag | Description | Default |
+ |------|-------------|---------|
+ | `--context-file` | Path to the context file to evaluate (required) | — |
+ | `--task` | Path to a single task YAML | — |
+ | `--benchmark` | Run against built-in benchmark tasks | `false` |
+ | `--agents` | Comma-separated agent names (required) | — |
+ | `--trials` | Trials per condition (min: 2) | `3` |
+ | `--output` | Save JSON results to file | — |
+ | `--task-dir` | Custom task directory for benchmark mode | — |
+
+ ### Dashboard Integration
+
+ Include context-eval results in the HTML dashboard:
+
+ ```bash
+ # Run context-eval and save JSON
+ coderace context-eval --context-file CLAUDE.md --task task.yaml --agents claude --output eval.json
+
+ # Generate dashboard with A/B comparison section
+ coderace dashboard --context-eval eval.json
+ ```
+
+ ## Measuring Context Engineering Impact
+
+ Context engineering — crafting CLAUDE.md, AGENTS.md, .cursorrules, and similar files — is becoming a core developer skill. But until now, there was no way to empirically measure whether your context files actually help.
+
+ **The problem:** You write a CLAUDE.md with coding conventions, architectural guidelines, and project-specific instructions. But does it actually make agents produce better code? Or is it cargo-cult configuration?
+
+ **The solution:** `coderace context-eval` gives you data:
+
+ 1. **Write your context file** (e.g., CLAUDE.md with project conventions)
+ 2. **Run A/B evaluation** against real coding tasks
+ 3. **Get statistical evidence** of improvement (or lack thereof)
+
+ ```bash
+ # Iterate on your context file with data
+ coderace context-eval --context-file CLAUDE.md --benchmark --agents claude --trials 5
+
+ # Compare different context files
+ coderace context-eval --context-file v1-claude.md --task task.yaml --agents claude --output v1.json
+ coderace context-eval --context-file v2-claude.md --task task.yaml --agents claude --output v2.json
+ ```
+
+ **Interpreting results:**
+ - **Effect size > 0.8**: Large improvement — your context file is helping significantly
+ - **Effect size 0.2–0.8**: Moderate improvement — some benefit, room to iterate
+ - **Effect size < 0.2**: Negligible — your context file isn't making a measurable difference
+ - **CI crosses zero**: Not statistically significant — need more trials or a better context file
+
+ ## See Also
+
+ - **[agentmd](https://github.com/mikiships/agentmd)** — Generate and score context files (CLAUDE.md, AGENTS.md, .cursorrules) for AI coding agents. Pair with coderace: generate context with agentmd, measure agent performance with coderace, iterate with data instead of vibes.
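
This hunk names Welch's t-test for the delta CI and Cohen's d for the effect size. The sketch below shows both statistics in a simplified form: it uses the normal approximation instead of a proper t quantile, assumes at least two trials per condition (the documented `--trials` minimum), and is not the package's `context_eval_report.py`.

```python
# Simplified forms of the two statistics the report describes.
from math import sqrt
from statistics import mean, stdev

def welch_ci_95(baseline: list[float], treatment: list[float]) -> tuple[float, float]:
    """Approximate 95% CI for (treatment mean - baseline mean), unequal variances."""
    delta = mean(treatment) - mean(baseline)
    se = sqrt(stdev(treatment) ** 2 / len(treatment) + stdev(baseline) ** 2 / len(baseline))
    return (delta - 1.96 * se, delta + 1.96 * se)  # normal approx of the t quantile

def cohens_d(baseline: list[float], treatment: list[float]) -> float:
    """Effect size: mean difference scaled by the pooled standard deviation."""
    n1, n2 = len(baseline), len(treatment)
    pooled = sqrt(((n1 - 1) * stdev(baseline) ** 2 + (n2 - 1) * stdev(treatment) ** 2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(baseline)) / pooled

print(welch_ci_95([50.0, 55.0, 60.0], [78.0, 81.0, 84.0]))  # ≈ (19.4, 32.6)
```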
@@ -0,0 +1,120 @@
+ # All-Day Build Contract: Context Eval
+
+ Status: In Progress
+ Date: 2026-03-02
+ Owner: Codex execution pass
+ Scope type: Deliverable-gated (no hour promises)
+
+ ## 1. Objective
+
+ Add a `coderace context-eval` command that measures whether context files (CLAUDE.md, AGENTS.md, .cursorrules, etc.) actually improve agent performance on coding tasks. Runs A/B trials: baseline (no context file) vs treatment (with context file), produces statistical comparison with confidence intervals. This is the first tool that lets developers empirically measure context engineering impact.
+
+ This contract is considered complete only when every deliverable and validation gate below is satisfied.
+
+ ## 2. Non-Negotiable Build Rules
+
+ 1. No time-based completion claims.
+ 2. Completion is allowed only when all checklist items are checked.
+ 3. Full test suite must pass at the end.
+ 4. New features must ship with docs and report addendum updates in the same pass.
+ 5. CLI outputs must be deterministic and schema-backed where specified.
+ 6. Never modify files outside the project directory.
+ 7. Commit after each completed deliverable (not at the end).
+ 8. If stuck on same issue for 3 attempts, stop and write a blocker report.
+ 9. Do NOT refactor, restyle, or "improve" code outside the deliverables.
+ 10. Read existing tests and docs before writing new code.
+
+ ## 3. Feature Deliverables
+
+ ### D1. `context-eval` CLI Command (core + CLI)
+
+ Add a new `context-eval` subcommand to the coderace CLI that:
+ - Accepts `--context-file PATH` (the file to evaluate, e.g., CLAUDE.md)
+ - Accepts `--task PATH` (a single task YAML) or `--benchmark` (run built-in tasks)
+ - Accepts `--agents AGENT1,AGENT2` (which agents to test, default: all configured)
+ - Accepts `--trials N` (number of trials per condition, default: 3, min: 2)
+ - Accepts `--output PATH` (optional JSON output path)
+ - Accepts `--task-dir PATH` (optional, custom task directory for benchmark mode)
+
+ Workflow:
+ 1. For each agent × task combination:
+    a. Run N trials WITHOUT the context file present (baseline condition)
+    b. Run N trials WITH the context file placed in the task working directory (treatment condition)
+ 2. Collect pass/fail + timing for each trial
+ 3. Produce comparison report
+
+ The context file placement: copy the file to the task's working directory before treatment trials, remove it after. If a context file already exists in the task dir, back it up and restore after.
+
+ Required files:
+ - `coderace/context_eval.py` (core logic)
+ - `coderace/cli.py` (add subcommand)
+
+ - [ ] CLI argument parsing with validation
+ - [ ] Baseline runner (strips context files from task dir)
+ - [ ] Treatment runner (places context file in task dir)
+ - [ ] Backup/restore logic for pre-existing context files
+ - [ ] Integration with existing `run` infrastructure (reuse task loading, agent execution, scoring)
+ - [ ] Tests for D1
+
+ ### D2. Statistical Comparison Report
+
+ After running both conditions, produce a comparison report that includes:
+ - Per-agent: baseline pass rate vs treatment pass rate
+ - Delta (treatment - baseline) with 95% confidence interval
+ - Effect size (Cohen's d or similar)
+ - Per-task breakdown: which tasks improved, which degraded
+ - Summary verdict: "Context file improved performance by X% (CI: [lo, hi])" or "No significant improvement detected"
+
+ Output formats:
+ - Rich terminal table (default, using existing table formatting)
+ - JSON (with `--output`)
+
+ Reuse existing statistical infrastructure from `benchmark` command where possible (confidence intervals, etc.).
+
+ - [ ] Statistical comparison logic (delta, CI, effect size)
+ - [ ] Terminal table report
+ - [ ] JSON output
+ - [ ] Summary verdict logic
+ - [ ] Tests for D2
+
+ ### D3. Dashboard Integration
+
+ Extend the existing `dashboard` command to include context-eval results when available:
+ - New section in HTML dashboard showing A/B comparison
+ - Bar chart: baseline vs treatment pass rates per agent
+ - Delta chart with confidence intervals
+
+ Reuse existing dashboard infrastructure (Jinja templates, chart generation).
+
+ - [ ] Dashboard data model for context-eval results
+ - [ ] HTML template section for A/B comparison
+ - [ ] Charts (bar chart + delta chart)
+ - [ ] Tests for D3
+
+ ### D4. Documentation + Examples
+
+ - [ ] Update README.md with context-eval section (usage, examples, interpretation guide)
+ - [ ] Add example in `examples/` directory: `context-eval-demo.yaml` or script
+ - [ ] Update `coderace --help` / `coderace context-eval --help` with clear descriptions
+ - [ ] Add a section on "Measuring Context Engineering Impact" to README
+
+ ## 4. Test Requirements
+
+ - [ ] Unit tests for context file placement/removal/backup logic
+ - [ ] Unit tests for statistical comparison (known inputs → known outputs)
+ - [ ] Integration test: mock agent that passes more with context vs without
+ - [ ] Edge cases: no agents configured, context file doesn't exist, task dir already has context file, trials=1 (should error), 0 tasks selected
+ - [ ] All existing tests (447) must still pass
+
+ ## 5. Reports
+
+ - Write progress to `progress-log.md` after each deliverable
+ - Include: what was built, what tests pass, what's next, any blockers
+ - Final summary when all deliverables done or stopped
+
+ ## 6. Stop Conditions
+
+ - All deliverables checked and all tests passing → DONE
+ - 3 consecutive failed attempts on same issue → STOP, write blocker report
+ - Scope creep detected (new requirements discovered) → STOP, report what's new
+ - All tests passing but deliverables remain → continue to next deliverable