PyPI - coderace - Versions diffs - 1.2.0__tar.gz → 1.3.0__tar.gz - Mend

coderace 1.2.0tar.gz → 1.3.0tar.gz


Files changed (137)

{coderace-1.2.0 → coderace-1.3.0}/CHANGELOG.md RENAMED

@@ -1,5 +1,23 @@
 # Changelog
+## [1.3.0] - 2026-03-05
+### Added
+- **Model selection**: Per-agent model override via `agent:model` syntax in `--agents` / `--agent` flags
+  - Example: `coderace run task.yaml --agent codex:gpt-5.4 --agent codex:gpt-5.3-codex`
+  - Example: `coderace benchmark --agents claude:opus-4-6,claude:sonnet-4-6`
+- `BaseAdapter.__init__(model=None)`: all adapters accept optional model at construction
+- `BaseAdapter.build_command(task, model=None)`: model parameter flows to CLI flag
+- `parse_agent_spec()`, `make_display_name()`, `instantiate_adapter()` in `coderace.adapters`
+- All adapters (codex, claude, aider, gemini, opencode) append `--model <name>` when specified
+- Benchmark and race commands handle model-specific agents; display names flow to results, store, ELO, dashboard
+- Task YAML: `agents` list accepts `agent:model` entries (e.g. `- codex:gpt-5.4`)
+### Changed
+- `AgentResult.agent` is now the display name (`codex (gpt-5.4)`) when a model is specified
+- ELO ratings, leaderboard, and dashboard automatically track model variants as separate entries
+- Branch names sanitized to be git-compatible (colons replaced with dashes)
 ## [1.2.0] - 2026-03-03
 ### Added
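The changelog entry above says branch names are made git-compatible by replacing colons with dashes, but the code that does this is not part of the hunks shown here. The following is only a minimal sketch of that idea; the `sanitize_branch_name` helper and the `coderace/` prefix are hypothetical and are not taken from the package.

```python
def sanitize_branch_name(agent_spec: str, prefix: str = "coderace/") -> str:
    """Hypothetical helper: turn an agent spec into a git-friendly branch name.

    Git does not allow ':' in ref names, so a spec such as 'codex:gpt-5.4'
    cannot be used as a branch name until the colon is replaced.
    """
    return prefix + agent_spec.strip().replace(":", "-")


print(sanitize_branch_name("codex:gpt-5.4"))  # coderace/codex-gpt-5.4
print(sanitize_branch_name("claude"))         # coderace/claude
```

A spec without a model passes through unchanged apart from the prefix, so existing branch names would stay the same under this scheme.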

{coderace-1.2.0 → coderace-1.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: coderace
-Version: 1.2.0
+Version: 1.3.0
 Summary: Race coding agents against each other on real tasks
 Project-URL: Homepage, https://github.com/mikiships/coderace
 Project-URL: Repository, https://github.com/mikiships/coderace
@@ -30,6 +30,11 @@ Description-Content-Type: text/markdown
 # coderace
+[![PyPI](https://img.shields.io/pypi/v/coderace)](https://pypi.org/project/coderace/)
+[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](#install)
+[![Tests](https://img.shields.io/badge/tests-526%20passing-brightgreen)](#)
+[![License](https://img.shields.io/badge/license-MIT-lightgrey)](#license)
 Stop reading blog comparisons. Race coding agents against each other on real tasks in *your* repo with *your* code.
 Every week there's a new "Claude Code vs Codex vs Cursor" post. They test on toy problems with cherry-picked examples. coderace gives you automated, reproducible, scored comparisons on the tasks you actually care about.
@@ -340,6 +345,41 @@ Keys can be agent names (`claude`, `codex`, `aider`, `gemini`, `opencode`) or mo
 Pricing is easy to update: the table lives in `coderace/cost.py` as a plain dict.
+## Model Selection
+Compare different models of the same agent head-to-head using the `agent:model` syntax:
+```bash
+# Compare two Codex models on the same task
+coderace run task.yaml --agent codex:gpt-5.4 --agent codex:gpt-5.3-codex
+# Mix agents and models
+coderace run task.yaml --agent codex:gpt-5.4 --agent claude:opus-4-6 --agent claude:sonnet-4-6
+# Benchmark multiple model variants across built-in tasks
+coderace benchmark --agents codex:gpt-5.4,codex:gpt-5.3-codex,claude:opus-4-6
+# Race with model variants (parallel)
+coderace race task.yaml
+```
+In task YAML files:
+```yaml
+agents:
+  - codex:gpt-5.4
+  - codex:gpt-5.3-codex
+  - claude:opus-4-6
+  - claude:sonnet-4-6
+```
+**How it works:**
+- `agent:model` splits on the first colon: `codex:gpt-5.4` → agent `codex`, model `gpt-5.4`
+- The model is passed via `--model <name>` to the underlying CLI
+- Results display as `codex (gpt-5.4)` vs `codex (gpt-5.3-codex)` for easy comparison
+- ELO ratings, leaderboard, and dashboard track each model variant separately
+- The same agent can appear multiple times with different models in one run
 ## Leaderboard & History
 Every `coderace run` automatically saves results to a local SQLite database (`~/.coderace/results.db`). Two new commands aggregate this data.
@@ -854,3 +894,6 @@ coderace context-eval --context-file v2-claude.md --task task.yaml --agents clau
 ## See Also
 - **[agentmd](https://github.com/mikiships/agentmd)** — Generate and score context files (CLAUDE.md, AGENTS.md, .cursorrules) for AI coding agents. Pair with coderace: generate context with agentmd, measure agent performance with coderace, iterate with data instead of vibes.
+- **[agentlint](https://github.com/mikiships/agentlint)** — Lint AI agent git diffs for risky patterns (scope drift, secret leaks, test regression). Static analysis, no LLM required.
+Measure (coderace) → Optimize (agentmd) → Guard (agentlint).

{coderace-1.2.0 → coderace-1.3.0}/README.md RENAMED

@@ -1,5 +1,10 @@
 # coderace
+[![PyPI](https://img.shields.io/pypi/v/coderace)](https://pypi.org/project/coderace/)
+[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](#install)
+[![Tests](https://img.shields.io/badge/tests-526%20passing-brightgreen)](#)
+[![License](https://img.shields.io/badge/license-MIT-lightgrey)](#license)
 Stop reading blog comparisons. Race coding agents against each other on real tasks in *your* repo with *your* code.
 Every week there's a new "Claude Code vs Codex vs Cursor" post. They test on toy problems with cherry-picked examples. coderace gives you automated, reproducible, scored comparisons on the tasks you actually care about.
@@ -310,6 +315,41 @@ Keys can be agent names (`claude`, `codex`, `aider`, `gemini`, `opencode`) or mo
 Pricing is easy to update: the table lives in `coderace/cost.py` as a plain dict.
+## Model Selection
+Compare different models of the same agent head-to-head using the `agent:model` syntax:
+```bash
+# Compare two Codex models on the same task
+coderace run task.yaml --agent codex:gpt-5.4 --agent codex:gpt-5.3-codex
+# Mix agents and models
+coderace run task.yaml --agent codex:gpt-5.4 --agent claude:opus-4-6 --agent claude:sonnet-4-6
+# Benchmark multiple model variants across built-in tasks
+coderace benchmark --agents codex:gpt-5.4,codex:gpt-5.3-codex,claude:opus-4-6
+# Race with model variants (parallel)
+coderace race task.yaml
+```
+In task YAML files:
+```yaml
+agents:
+  - codex:gpt-5.4
+  - codex:gpt-5.3-codex
+  - claude:opus-4-6
+  - claude:sonnet-4-6
+```
+**How it works:**
+- `agent:model` splits on the first colon: `codex:gpt-5.4` → agent `codex`, model `gpt-5.4`
+- The model is passed via `--model <name>` to the underlying CLI
+- Results display as `codex (gpt-5.4)` vs `codex (gpt-5.3-codex)` for easy comparison
+- ELO ratings, leaderboard, and dashboard track each model variant separately
+- The same agent can appear multiple times with different models in one run
 ## Leaderboard & History
 Every `coderace run` automatically saves results to a local SQLite database (`~/.coderace/results.db`). Two new commands aggregate this data.
@@ -824,3 +864,6 @@ coderace context-eval --context-file v2-claude.md --task task.yaml --agents clau
 ## See Also
 - **[agentmd](https://github.com/mikiships/agentmd)** — Generate and score context files (CLAUDE.md, AGENTS.md, .cursorrules) for AI coding agents. Pair with coderace: generate context with agentmd, measure agent performance with coderace, iterate with data instead of vibes.
+- **[agentlint](https://github.com/mikiships/agentlint)** — Lint AI agent git diffs for risky patterns (scope drift, secret leaks, test regression). Static analysis, no LLM required.
+Measure (coderace) → Optimize (agentmd) → Guard (agentlint).
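The Model Selection section in the README hunk above describes comma-separated `--agents` values in which each entry splits on its first colon. The CLI parsing itself is not shown in this diff, so the sketch below only illustrates how such a value could be expanded into (agent, model) pairs and display names; `split_agents_flag` and `display_name` are hypothetical names.

```python
from typing import Optional


def split_agents_flag(value: str) -> list[tuple[str, Optional[str]]]:
    """Hypothetical: expand 'a:m1,a:m2,b' into [(agent, model_or_None), ...]."""
    pairs = []
    for raw in value.split(","):
        raw = raw.strip()
        if not raw:
            continue  # tolerate stray commas
        agent, _, model = raw.partition(":")  # splits on the first colon only
        pairs.append((agent.strip(), model.strip() or None))
    return pairs


def display_name(agent: str, model: Optional[str]) -> str:
    """Format used in the README: 'codex (gpt-5.4)' or plain 'codex'."""
    return f"{agent} ({model})" if model else agent


for agent, model in split_agents_flag("codex:gpt-5.4,codex:gpt-5.3-codex,claude"):
    print(display_name(agent, model))
```

Because the same agent can appear more than once, the pairs are kept in a list rather than a dict keyed by agent name.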

coderace-1.3.0/all-day-build-contract-model-selection.md ADDED

@@ -0,0 +1,121 @@
+# All-Day Build Contract: Model Selection for Adapters
+Status: In Progress
+Date: 2026-03-05
+Owner: Codex execution pass
+Scope type: Deliverable-gated (no hour promises)
+## 1. Objective
+Add per-agent model selection to coderace so users can benchmark different models within the same agent CLI. For example: `coderace run task.yaml --agents codex:gpt-5.4,codex:gpt-5.3-codex,claude:opus-4-6,claude:sonnet-4-6` to compare models head-to-head on the same tasks.
+This enables the "which model is actually best for coding" benchmark content that vibes-based blog posts can't provide.
+This contract is considered complete only when every deliverable and validation gate below is satisfied.
+## 2. Non-Negotiable Build Rules
+1. No time-based completion claims.
+2. Completion is allowed only when all checklist items are checked.
+3. Full test suite must pass at the end.
+4. New features must ship with docs and report addendum updates in the same pass.
+5. CLI outputs must be deterministic and schema-backed where specified.
+6. Never modify files outside the project directory.
+7. Commit after each completed deliverable (not at the end).
+8. If stuck on same issue for 3 attempts, stop and write a blocker report.
+9. Do NOT refactor, restyle, or "improve" code outside the deliverables.
+10. Read existing tests and docs before writing new code.
+## 3. Feature Deliverables
+### D1. Base Adapter Model Support (core)
+Add optional `model` parameter to BaseAdapter so subclasses can receive a model override.
+Required files:
+- `coderace/adapters/base.py`
+- [ ] Add `model: Optional[str] = None` to `__init__` (or as class attribute)
+- [ ] Pass `model` through to `build_command` signature: `build_command(self, task_description: str, model: Optional[str] = None) -> list[str]`
+- [ ] Update `run()` to pass model to `build_command`
+- [ ] Update `parse_cost` calls to use the model override when provided
+- [ ] Tests for D1
+### D2. Codex and Claude Adapter Model Flags
+Update the two main adapters to pass `--model` when a model is specified.
+Required files:
+- `coderace/adapters/codex.py`
+- `coderace/adapters/claude.py`
+- [ ] CodexAdapter.build_command: append `--model`, model_name when model is not None
+- [ ] ClaudeAdapter.build_command: append `--model`, model_name when model is not None
+- [ ] Update parse_cost to use the provided model name for accurate pricing
+- [ ] Also update aider.py, gemini.py, opencode.py adapters if they support model flags (check their --help)
+- [ ] Tests for D2
+### D3. Agent:Model CLI Syntax
+Parse `agent:model` syntax in the CLI so users can specify models per agent.
+Required files:
+- `coderace/cli.py` (or wherever `--agents` is parsed)
+- `coderace/adapters/__init__.py` (adapter registry/factory)
+The syntax: `--agents codex:gpt-5.4,claude:opus-4-6`
+- If no `:model` suffix, use the adapter's default (current behavior)
+- If `:model` suffix, pass it through to the adapter
+- The same agent can appear multiple times with different models
+- Agent display name in results should include the model: `codex (gpt-5.4)` vs `codex (gpt-5.3-codex)`
+- [ ] Parse `agent:model` in CLI --agents flag
+- [ ] Support duplicate agents with different models in the same run
+- [ ] Display agent+model in result tables and reports
+- [ ] Works with `run`, `benchmark`, and `race` commands
+- [ ] Tests for D3
+### D4. Benchmark and Race Command Integration
+Ensure `benchmark` and `race` commands correctly handle model-specific agents.
+Required files:
+- `coderace/benchmark.py`
+- `coderace/commands/` (race command if separate)
+- `coderace/store.py` (results storage)
+- [ ] Benchmark results store agent+model as the identifier (not just agent name)
+- [ ] ELO ratings track agent+model combinations separately
+- [ ] Leaderboard shows model variants as separate entries
+- [ ] Dashboard HTML includes model information
+- [ ] Tests for D4
+### D5. Documentation and Version Bump
+- [ ] Update README.md with model selection examples
+- [ ] Add model selection section to examples/
+- [ ] Update CHANGELOG.md
+- [ ] Bump version to 1.3.0 in pyproject.toml
+- [ ] All existing 526 tests still pass
+- [ ] New tests bring total to 550+
+## 4. Test Requirements
+- [ ] Unit tests for each adapter with model override
+- [ ] Unit tests for agent:model parsing
+- [ ] Integration test: dry-run benchmark with model variants
+- [ ] Edge cases: invalid model name, empty model, agent without model support
+- [ ] All existing 526 tests must still pass
+## 5. Reports
+- Write progress to `progress-log.md` after each deliverable
+- Include: what was built, what tests pass, what's next, any blockers
+- Final summary when all deliverables done or stopped
+## 6. Stop Conditions
+- All deliverables checked and all tests passing -> DONE
+- 3 consecutive failed attempts on same issue -> STOP, write blocker report
+- Scope creep detected (new requirements discovered) -> STOP, report what's new
+- All tests passing but deliverables remain -> continue to next deliverable
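The Test Requirements in the contract above call for edge cases such as an empty model. As a rough illustration of what those checks might look like, the sketch below copies the body of the `parse_agent_spec` function that this release adds in `coderace/adapters/__init__.py` and runs a few specs through it. The `EDGE_CASES` table is an assumption about which inputs are worth checking, not the project's actual test suite.

```python
from typing import Optional


def parse_agent_spec(spec: str) -> tuple[str, Optional[str]]:
    # Same logic as the function added in coderace/adapters/__init__.py.
    if ":" in spec:
        agent_name, model = spec.split(":", 1)
        return agent_name.strip(), model.strip() or None
    return spec.strip(), None


# Illustrative inputs only; the real tests may cover different cases.
EDGE_CASES = {
    "codex": ("codex", None),
    "codex:": ("codex", None),             # empty model falls back to None
    "codex:   ": ("codex", None),          # whitespace-only model, same result
    " claude : opus-4-6 ": ("claude", "opus-4-6"),
    "a:b:c": ("a", "b:c"),                 # only the first colon splits
    ":gpt-5.4": ("", "gpt-5.4"),           # empty agent name is not rejected here
}

for spec, expected in EDGE_CASES.items():
    assert parse_agent_spec(spec) == expected, spec
print("all edge cases behave as listed")
```

The last case shows that a missing agent name is not caught by the parser itself; it would only surface later as a failed registry lookup.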

{coderace-1.2.0 → coderace-1.3.0}/coderace/__init__.py RENAMED

@@ -1,3 +1,3 @@
 """coderace - Race coding agents against each other on real tasks."""
-__version__ = "1.2.0"
+__version__ = "1.3.0"

coderace-1.3.0/coderace/adapters/__init__.py ADDED

@@ -0,0 +1,77 @@
+"""Agent adapters for coderace."""
+from __future__ import annotations
+from typing import Optional
+from coderace.adapters.aider import AiderAdapter
+from coderace.adapters.base import BaseAdapter
+from coderace.adapters.claude import ClaudeAdapter
+from coderace.adapters.codex import CodexAdapter
+from coderace.adapters.gemini import GeminiAdapter
+from coderace.adapters.opencode import OpenCodeAdapter
+ADAPTERS: dict[str, type[BaseAdapter]] = {
+    "claude": ClaudeAdapter,
+    "codex": CodexAdapter,
+    "aider": AiderAdapter,
+    "gemini": GeminiAdapter,
+    "opencode": OpenCodeAdapter,
+}
+def parse_agent_spec(spec: str) -> tuple[str, Optional[str]]:
+    """Parse an agent spec string into (agent_name, model_or_None).
+    Examples:
+        "codex"           -> ("codex", None)
+        "codex:gpt-5.4"  -> ("codex", "gpt-5.4")
+        "claude:opus-4-6" -> ("claude", "opus-4-6")
+    """
+    if ":" in spec:
+        agent_name, model = spec.split(":", 1)
+        return agent_name.strip(), model.strip() or None
+    return spec.strip(), None
+def make_display_name(agent_name: str, model: Optional[str]) -> str:
+    """Return display name for agent+model combo.
+    Examples:
+        ("codex", None)       -> "codex"
+        ("codex", "gpt-5.4") -> "codex (gpt-5.4)"
+    """
+    if model:
+        return f"{agent_name} ({model})"
+    return agent_name
+def instantiate_adapter(spec: str) -> BaseAdapter:
+    """Instantiate an adapter from an agent spec string (e.g. 'codex:gpt-5.4').
+    The returned adapter has:
+    - adapter.model set to the parsed model (or None)
+    - adapter.name set to the display name (e.g. 'codex (gpt-5.4)')
+    Raises KeyError if the agent name is not in ADAPTERS.
+    """
+    agent_name, model = parse_agent_spec(spec)
+    adapter_cls = ADAPTERS[agent_name]
+    adapter = adapter_cls(model=model)
+    # Override the instance name to be the display name
+    adapter.name = make_display_name(agent_name, model)
+    return adapter
+__all__ = [
+    "ADAPTERS",
+    "BaseAdapter",
+    "ClaudeAdapter",
+    "CodexAdapter",
+    "AiderAdapter",
+    "GeminiAdapter",
+    "OpenCodeAdapter",
+    "parse_agent_spec",
+    "make_display_name",
+    "instantiate_adapter",
+]
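Two properties of `instantiate_adapter` above are easy to miss: the display name is set on the instance, so the class-level `name` stays untouched, and an unknown agent surfaces as a `KeyError`. The sketch below reproduces that pattern with a stand-in class so it runs without coderace installed; `StubAdapter`, `REGISTRY`, and `instantiate` are hypothetical names.

```python
from typing import Optional


class StubAdapter:
    """Stand-in for a real adapter; only the fields used here are modelled."""

    name = "stub"

    def __init__(self, model: Optional[str] = None) -> None:
        self.model = model


REGISTRY = {"stub": StubAdapter}


def instantiate(spec: str) -> StubAdapter:
    agent_name, _, model = spec.partition(":")
    agent_name, model = agent_name.strip(), model.strip() or None
    adapter = REGISTRY[agent_name](model=model)  # KeyError for unknown agents
    # Instance attribute shadows the class attribute; the class is not mutated.
    adapter.name = f"{agent_name} ({model})" if model else agent_name
    return adapter


fast = instantiate("stub:fast")
slow = instantiate("stub:slow")
print(fast.name, "|", slow.name, "|", StubAdapter.name)

try:
    instantiate("unknown:model")
except KeyError as exc:
    print("unknown agent:", exc)
```

Since each call returns a fresh instance with its own `name`, two variants of the same agent can coexist in one run without overwriting each other's labels.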

{coderace-1.2.0 → coderace-1.3.0}/coderace/adapters/aider.py RENAMED

@@ -7,27 +7,34 @@ from typing import Optional
 from coderace.adapters.base import BaseAdapter
 from coderace.cost import CostResult, parse_aider_cost
+DEFAULT_AIDER_MODEL = "aider-default"
 class AiderAdapter(BaseAdapter):
     """Adapter for Aider coding assistant."""
     name = "aider"
-    def build_command(self, task_description: str) -> list[str]:
-        return [
+    def build_command(self, task_description: str, model: Optional[str] = None) -> list[str]:
+        cmd = [
             "aider",
             "--message",
             task_description,
             "--yes",
             "--no-auto-commits",
         ]
+        effective_model = model or self.model
+        if effective_model:
+            cmd += ["--model", effective_model]
+        return cmd
     def parse_cost(
         self,
         stdout: str,
         stderr: str,
-        model_name: str = "aider-default",
+        model_name: str = "",
         custom_pricing: dict[str, tuple[float, float]] | None = None,
     ) -> Optional[CostResult]:
         """Parse cost data from Aider output."""
-        return parse_aider_cost(stdout, stderr, model_name, custom_pricing)
+        effective_model = model_name or self.model or DEFAULT_AIDER_MODEL
+        return parse_aider_cost(stdout, stderr, effective_model, custom_pricing)
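The new `parse_cost` body above resolves the pricing model through an `or` chain: the explicit `model_name` argument first, then the adapter's own `model`, then the module default. A small sketch of that precedence follows; `resolve_cost_model` is a hypothetical helper written for illustration, since the package inlines the expression.

```python
from typing import Optional

DEFAULT_AIDER_MODEL = "aider-default"  # value taken from the diff above


def resolve_cost_model(
    model_name: str,
    instance_model: Optional[str],
    default: str = DEFAULT_AIDER_MODEL,
) -> str:
    """Hypothetical helper mirroring `model_name or self.model or DEFAULT`."""
    return model_name or instance_model or default


print(resolve_cost_model("", None))           # aider-default
print(resolve_cost_model("", "gpt-5.4"))      # gpt-5.4
print(resolve_cost_model("sonnet", "gpt-5.4"))  # sonnet
```

Changing the parameter default from `"aider-default"` to `""` is what lets the chain fall through: an empty string is falsy, so an omitted argument no longer masks the adapter's model.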

{coderace-1.2.0 → coderace-1.3.0}/coderace/adapters/base.py RENAMED

@@ -17,8 +17,12 @@ class BaseAdapter(ABC):
     name: str = "base"
+    def __init__(self, model: Optional[str] = None) -> None:
+        """Initialize adapter with optional model override."""
+        self.model = model
     @abstractmethod
-    def build_command(self, task_description: str) -> list[str]:
+    def build_command(self, task_description: str, model: Optional[str] = None) -> list[str]:
         """Build the CLI command to invoke this agent."""
         ...
@@ -44,7 +48,8 @@ class BaseAdapter(ABC):
         custom_pricing: dict[str, tuple[float, float]] | None = None,
     ) -> AgentResult:
         """Run the agent on a task and capture results."""
-        cmd = self.build_command(task_description)
+        model = self.model
+        cmd = self.build_command(task_description, model=model)
         start = time.monotonic()
         timed_out = False
@@ -76,7 +81,12 @@ class BaseAdapter(ABC):
         cost_result: Optional[CostResult] = None
         if not no_cost:
             try:
-                cost_result = self.parse_cost(stdout, stderr, custom_pricing=custom_pricing)
+                cost_result = self.parse_cost(
+                    stdout,
+                    stderr,
+                    model_name=model or "",
+                    custom_pricing=custom_pricing,
+                )
             except Exception:
                 pass
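The `base.py` hunks above thread one value through three places: `__init__` stores the model, `run()` hands it to `build_command`, and the same value (or an empty string) goes to `parse_cost`. The sketch below traces that flow without spawning a process. `SketchBase`, `EchoAdapter`, and the `plan` method are hypothetical stand-ins, not part of coderace.

```python
from abc import ABC, abstractmethod
from typing import Optional


class SketchBase(ABC):
    def __init__(self, model: Optional[str] = None) -> None:
        self.model = model

    @abstractmethod
    def build_command(
        self, task_description: str, model: Optional[str] = None
    ) -> list[str]:
        ...

    def plan(self, task_description: str) -> tuple[list[str], str]:
        """Stand-in for run(): return the command and the cost-model name."""
        model = self.model
        cmd = self.build_command(task_description, model=model)
        return cmd, (model or "")  # "" lets parse_cost pick its own default


class EchoAdapter(SketchBase):
    def build_command(
        self, task_description: str, model: Optional[str] = None
    ) -> list[str]:
        cmd = ["echo"]
        effective_model = model or self.model
        if effective_model:
            cmd += ["--model", effective_model]
        cmd.append(task_description)
        return cmd


print(EchoAdapter("m1").plan("hi"))
print(EchoAdapter().plan("hi"))
```

Passing `model` explicitly and also falling back to `self.model` inside `build_command` means a direct call to `build_command(task)` on a model-configured adapter still produces the flag.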

{coderace-1.2.0 → coderace-1.3.0}/coderace/adapters/claude.py RENAMED

@@ -7,29 +7,35 @@ from typing import Optional
 from coderace.adapters.base import BaseAdapter
 from coderace.cost import CostResult, parse_claude_cost
+DEFAULT_CLAUDE_MODEL = "claude-sonnet-4-6"
 class ClaudeAdapter(BaseAdapter):
     """Adapter for Claude Code CLI."""
     name = "claude"
-    def build_command(self, task_description: str) -> list[str]:
-        return [
+    def build_command(self, task_description: str, model: Optional[str] = None) -> list[str]:
+        cmd = [
             "claude",
             "--print",
             "--output-format",
             "json",
             "--dangerously-skip-permissions",
-            "-p",
-            task_description,
         ]
+        effective_model = model or self.model
+        if effective_model:
+            cmd += ["--model", effective_model]
+        cmd += ["-p", task_description]
+        return cmd
     def parse_cost(
         self,
         stdout: str,
         stderr: str,
-        model_name: str = "claude-sonnet-4-6",
+        model_name: str = "",
         custom_pricing: dict[str, tuple[float, float]] | None = None,
     ) -> Optional[CostResult]:
         """Parse cost data from Claude Code output."""
-        return parse_claude_cost(stdout, stderr, model_name, custom_pricing)
+        effective_model = model_name or self.model or DEFAULT_CLAUDE_MODEL
+        return parse_claude_cost(stdout, stderr, effective_model, custom_pricing)
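In the `claude.py` change above, `-p` and the task description move out of the initial list so the optional `--model` pair can be placed before them. The sketch below rebuilds the same argument list as a free function to make the resulting order visible. `claude_command` is a hypothetical name, and whether the real `claude` CLI accepts these arguments in this order is not verified here.

```python
from typing import Optional


def claude_command(task_description: str, model: Optional[str] = None) -> list[str]:
    """Rebuild the argument list from the diff above as a plain function."""
    cmd = [
        "claude",
        "--print",
        "--output-format",
        "json",
        "--dangerously-skip-permissions",
    ]
    if model:
        cmd += ["--model", model]  # inserted before the prompt arguments
    cmd += ["-p", task_description]
    return cmd


print(claude_command("fix the bug"))
print(claude_command("fix the bug", model="opus-4-6"))
```

Without a model the list is identical to the 1.2.0 command, so existing invocations should behave as before.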

{coderace-1.2.0 → coderace-1.3.0}/coderace/adapters/codex.py RENAMED

@@ -7,26 +7,33 @@ from typing import Optional
 from coderace.adapters.base import BaseAdapter
 from coderace.cost import CostResult, parse_codex_cost
+DEFAULT_CODEX_MODEL = "gpt-5.3-codex"
 class CodexAdapter(BaseAdapter):
     """Adapter for OpenAI Codex CLI."""
     name = "codex"
-    def build_command(self, task_description: str) -> list[str]:
-        return [
+    def build_command(self, task_description: str, model: Optional[str] = None) -> list[str]:
+        cmd = [
             "codex",
             "exec",
             "--full-auto",
-            task_description,
         ]
+        effective_model = model or self.model
+        if effective_model:
+            cmd += ["--model", effective_model]
+        cmd.append(task_description)
+        return cmd
     def parse_cost(
         self,
         stdout: str,
         stderr: str,
-        model_name: str = "gpt-5.3-codex",
+        model_name: str = "",
         custom_pricing: dict[str, tuple[float, float]] | None = None,
     ) -> Optional[CostResult]:
         """Parse cost data from Codex CLI output."""
-        return parse_codex_cost(stdout, stderr, model_name, custom_pricing)
+        effective_model = model_name or self.model or DEFAULT_CODEX_MODEL
+        return parse_codex_cost(stdout, stderr, effective_model, custom_pricing)

{coderace-1.2.0 → coderace-1.3.0}/coderace/adapters/gemini.py RENAMED

@@ -7,25 +7,29 @@ from typing import Optional
 from coderace.adapters.base import BaseAdapter
 from coderace.cost import CostResult, parse_gemini_cost
+DEFAULT_GEMINI_MODEL = "gemini-2.5-pro"
 class GeminiAdapter(BaseAdapter):
     """Adapter for Google Gemini CLI."""
     name = "gemini"
-    def build_command(self, task_description: str) -> list[str]:
-        return [
-            "gemini",
-            "-p",
-            task_description,
-        ]
+    def build_command(self, task_description: str, model: Optional[str] = None) -> list[str]:
+        cmd = ["gemini"]
+        effective_model = model or self.model
+        if effective_model:
+            cmd += ["--model", effective_model]
+        cmd += ["-p", task_description]
+        return cmd
     def parse_cost(
         self,
         stdout: str,
         stderr: str,
-        model_name: str = "gemini-2.5-pro",
+        model_name: str = "",
         custom_pricing: dict[str, tuple[float, float]] | None = None,
     ) -> Optional[CostResult]:
         """Parse cost data from Gemini CLI output."""
-        return parse_gemini_cost(stdout, stderr, model_name, custom_pricing)
+        effective_model = model_name or self.model or DEFAULT_GEMINI_MODEL
+        return parse_gemini_cost(stdout, stderr, effective_model, custom_pricing)

{coderace-1.2.0 → coderace-1.3.0}/coderace/adapters/opencode.py RENAMED

@@ -7,25 +7,29 @@ from typing import Optional
 from coderace.adapters.base import BaseAdapter
 from coderace.cost import CostResult, parse_opencode_cost
+DEFAULT_OPENCODE_MODEL = "opencode-default"
 class OpenCodeAdapter(BaseAdapter):
     """Adapter for OpenCode CLI (terminal-first AI coding agent)."""
     name = "opencode"
-    def build_command(self, task_description: str) -> list[str]:
-        return [
-            "opencode",
-            "run",
-            task_description,
-        ]
+    def build_command(self, task_description: str, model: Optional[str] = None) -> list[str]:
+        cmd = ["opencode", "run"]
+        effective_model = model or self.model
+        if effective_model:
+            cmd += ["--model", effective_model]
+        cmd.append(task_description)
+        return cmd
     def parse_cost(
         self,
         stdout: str,
         stderr: str,
-        model_name: str = "opencode-default",
+        model_name: str = "",
         custom_pricing: dict[str, tuple[float, float]] | None = None,
     ) -> Optional[CostResult]:
         """Parse cost data from OpenCode output."""
-        return parse_opencode_cost(stdout, stderr, model_name, custom_pricing)
+        effective_model = model_name or self.model or DEFAULT_OPENCODE_MODEL
+        return parse_opencode_cost(stdout, stderr, effective_model, custom_pricing)
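All five adapters return the command as a list of arguments, with the task description as a single element. When checking what a model override changes, it can help to render that list as a shell-style string. The sketch below does this with the standard library's `shlex.join`; the `preview` helper and the `some-model` value are hypothetical, and nothing like it appears in the diff.

```python
import shlex


def preview(cmd: list[str]) -> str:
    """Hypothetical debugging aid: render an argv list as one quoted string."""
    return shlex.join(cmd)


# Argument order follows OpenCodeAdapter.build_command in the diff above.
cmd = ["opencode", "run", "--model", "some-model", "fix the failing test"]
print(preview(cmd))  # opencode run --model some-model 'fix the failing test'
```

The quoting in the output is for display only; the adapters pass a list, so a task description containing spaces stays one argument.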
