devlyn-cli 1.14.0 → 2.0.0

Files changed (148)
  1. package/AGENTS.md +104 -0
  2. package/CLAUDE.md +112 -119
  3. package/README.md +43 -125
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +272 -0
  5. package/benchmark/auto-resolve/README.md +114 -0
  6. package/benchmark/auto-resolve/RUBRIC.md +162 -0
  7. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +30 -0
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json +68 -0
  9. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json +10 -0
  10. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh +4 -0
  11. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +45 -0
  12. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/task.txt +8 -0
  13. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +54 -0
  14. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected-pair-plan-registry.json +170 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json +84 -0
  16. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json +21 -0
  17. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-fail.json +214 -0
  18. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-pass.json +223 -0
  19. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/setup.sh +5 -0
  20. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +56 -0
  21. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/task.txt +14 -0
  22. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +28 -0
  23. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected-pair-plan-registry.json +162 -0
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +65 -0
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/metadata.json +19 -0
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/setup.sh +4 -0
  27. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +56 -0
  28. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/task.txt +9 -0
  29. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +40 -0
  30. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/expected.json +57 -0
  31. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/metadata.json +10 -0
  32. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/setup.sh +6 -0
  33. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +49 -0
  34. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/task.txt +9 -0
  35. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/expected.json +65 -0
  37. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/setup.sh +55 -0
  39. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +49 -0
  40. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/task.txt +7 -0
  41. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +38 -0
  42. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/expected.json +77 -0
  43. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/metadata.json +10 -0
  44. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/setup.sh +4 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +49 -0
  46. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/task.txt +10 -0
  47. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +50 -0
  48. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/expected.json +76 -0
  49. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/metadata.json +10 -0
  50. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/setup.sh +36 -0
  51. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +46 -0
  52. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/task.txt +7 -0
  53. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +50 -0
  54. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/expected.json +63 -0
  55. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/setup.sh +4 -0
  57. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +48 -0
  58. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/task.txt +1 -0
  59. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +93 -0
  60. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/expected.json +74 -0
  61. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/metadata.json +10 -0
  62. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/setup.sh +28 -0
  63. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +62 -0
  64. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/task.txt +5 -0
  65. package/benchmark/auto-resolve/fixtures/SCHEMA.md +130 -0
  66. package/benchmark/auto-resolve/fixtures/test-repo/README.md +27 -0
  67. package/benchmark/auto-resolve/fixtures/test-repo/bin/cli.js +63 -0
  68. package/benchmark/auto-resolve/fixtures/test-repo/package-lock.json +823 -0
  69. package/benchmark/auto-resolve/fixtures/test-repo/package.json +22 -0
  70. package/benchmark/auto-resolve/fixtures/test-repo/playwright.config.js +17 -0
  71. package/benchmark/auto-resolve/fixtures/test-repo/server/index.js +37 -0
  72. package/benchmark/auto-resolve/fixtures/test-repo/tests/cli.test.js +25 -0
  73. package/benchmark/auto-resolve/fixtures/test-repo/tests/server.test.js +58 -0
  74. package/benchmark/auto-resolve/fixtures/test-repo/web/index.html +37 -0
  75. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +174 -0
  76. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +256 -0
  77. package/benchmark/auto-resolve/scripts/compile-report.py +331 -0
  78. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +552 -0
  79. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +430 -0
  80. package/benchmark/auto-resolve/scripts/judge.sh +359 -0
  81. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +260 -0
  82. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +274 -0
  83. package/benchmark/auto-resolve/scripts/oracle-test-fidelity.py +328 -0
  84. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +401 -0
  85. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +468 -0
  86. package/benchmark/auto-resolve/scripts/run-fixture.sh +691 -0
  87. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +234 -0
  88. package/benchmark/auto-resolve/scripts/run-suite.sh +214 -0
  89. package/benchmark/auto-resolve/scripts/ship-gate.py +222 -0
  90. package/bin/devlyn.js +129 -17
  91. package/config/skills/_shared/adapters/README.md +64 -0
  92. package/config/skills/_shared/adapters/gpt-5-5.md +29 -0
  93. package/config/skills/_shared/adapters/opus-4-7.md +29 -0
  94. package/config/skills/_shared/archive_run.py +130 -0
  95. package/config/skills/_shared/codex-config.md +54 -0
  96. package/config/skills/_shared/codex-monitored.sh +141 -0
  97. package/config/skills/_shared/engine-preflight.md +35 -0
  98. package/config/skills/_shared/expected.schema.json +93 -0
  99. package/config/skills/_shared/pair-plan-schema.md +298 -0
  100. package/config/skills/_shared/runtime-principles.md +110 -0
  101. package/config/skills/_shared/spec-verify-check.py +519 -0
  102. package/config/skills/devlyn:ideate/SKILL.md +99 -481
  103. package/config/skills/devlyn:ideate/references/elicitation.md +97 -0
  104. package/config/skills/devlyn:ideate/references/from-spec-mode.md +54 -0
  105. package/config/skills/devlyn:ideate/references/project-mode.md +76 -0
  106. package/config/skills/devlyn:ideate/references/spec-template.md +102 -0
  107. package/config/skills/devlyn:resolve/SKILL.md +172 -184
  108. package/config/skills/devlyn:resolve/references/free-form-mode.md +68 -0
  109. package/config/skills/devlyn:resolve/references/phases/build-gate.md +45 -0
  110. package/config/skills/devlyn:resolve/references/phases/cleanup.md +39 -0
  111. package/config/skills/devlyn:resolve/references/phases/implement.md +42 -0
  112. package/config/skills/devlyn:resolve/references/phases/plan.md +42 -0
  113. package/config/skills/devlyn:resolve/references/phases/verify.md +69 -0
  114. package/config/skills/devlyn:resolve/references/state-schema.md +106 -0
  115. package/{config/skills → optional-skills}/devlyn:design-system/SKILL.md +1 -0
  116. package/optional-skills/devlyn:reap/SKILL.md +105 -0
  117. package/optional-skills/devlyn:reap/scripts/reap.sh +129 -0
  118. package/optional-skills/devlyn:reap/scripts/scan.sh +116 -0
  119. package/{config/skills → optional-skills}/devlyn:team-design-ui/SKILL.md +5 -0
  120. package/package.json +16 -2
  121. package/scripts/lint-skills.sh +431 -0
  122. package/config/skills/devlyn:auto-resolve/SKILL.md +0 -602
  123. package/config/skills/devlyn:auto-resolve/references/build-gate.md +0 -116
  124. package/config/skills/devlyn:auto-resolve/references/engine-routing.md +0 -204
  125. package/config/skills/devlyn:browser-validate/SKILL.md +0 -164
  126. package/config/skills/devlyn:browser-validate/references/flow-testing.md +0 -118
  127. package/config/skills/devlyn:browser-validate/references/tier1-chrome.md +0 -137
  128. package/config/skills/devlyn:browser-validate/references/tier2-playwright.md +0 -195
  129. package/config/skills/devlyn:browser-validate/references/tier3-curl.md +0 -57
  130. package/config/skills/devlyn:clean/SKILL.md +0 -285
  131. package/config/skills/devlyn:design-ui/SKILL.md +0 -351
  132. package/config/skills/devlyn:discover-product/SKILL.md +0 -124
  133. package/config/skills/devlyn:evaluate/SKILL.md +0 -564
  134. package/config/skills/devlyn:feature-spec/SKILL.md +0 -630
  135. package/config/skills/devlyn:ideate/references/challenge-rubric.md +0 -122
  136. package/config/skills/devlyn:ideate/references/templates/item-spec.md +0 -90
  137. package/config/skills/devlyn:implement-ui/SKILL.md +0 -466
  138. package/config/skills/devlyn:preflight/SKILL.md +0 -370
  139. package/config/skills/devlyn:preflight/references/auditors/browser-auditor.md +0 -32
  140. package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +0 -90
  141. package/config/skills/devlyn:preflight/references/auditors/docs-auditor.md +0 -38
  142. package/config/skills/devlyn:product-spec/SKILL.md +0 -603
  143. package/config/skills/devlyn:recommend-features/SKILL.md +0 -286
  144. package/config/skills/devlyn:review/SKILL.md +0 -161
  145. package/config/skills/devlyn:team-resolve/SKILL.md +0 -631
  146. package/config/skills/devlyn:team-review/SKILL.md +0 -493
  147. package/config/skills/devlyn:update-docs/SKILL.md +0 -463
  148. package/config/skills/workflow-routing/SKILL.md +0 -73
package/README.md CHANGED
@@ -27,124 +27,71 @@ If devlyn-cli saved you time, [give it a star](https://github.com/fysoul17/devly
27
27
  npx devlyn-cli
28
28
  ```
29
29
 
30
- That's it. The interactive installer handles everything. Run it again anytime to update.
30
+ That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md`. Run it again anytime to update.
31
31
 
32
32
  ---
33
33
 
34
- ## How It Works — Three Steps, Full Cycle
34
+ ## How It Works — Two Skills, Full Cycle
35
35
 
36
- devlyn-cli turns Claude Code into an autonomous development pipeline. The core loop is simple:
36
+ devlyn-cli turns Claude Code into a hands-free development pipeline. The product surface is two skills:
37
37
 
38
38
  ```
39
- ideate → auto-resolve → preflight → fix gaps → ship
39
+ ideate (optional) → resolve → ship
40
40
  ```
41
41
 
42
- ### Step 1 — Plan with `/devlyn:ideate`
42
+ ### Step 1 (optional) — Plan with `/devlyn:ideate`
43
43
 
44
- Turn a raw idea into structured, implementation-ready specs.
44
+ Turn a raw idea into a verifiable spec — single-feature, multi-feature, or "normalize this external doc".
45
45
 
46
46
  ```
47
47
  /devlyn:ideate "I want to build a habit tracking app with AI nudges"
48
48
  ```
49
49
 
50
- This produces three documents through interactive brainstorming:
50
+ Default mode produces a `docs/specs/<id>-<slug>/spec.md` plus `spec.expected.json` (mechanical verification block) that `/devlyn:resolve --spec` consumes directly. Modes:
51
51
 
52
- | Document | What It Contains |
52
+ | Mode | When to use |
53
53
  |---|---|
54
- | `docs/VISION.md` | North star, principles, anti-goals |
55
- | `docs/ROADMAP.md` | Phased roadmap with links to each spec |
56
- | `docs/roadmap/phase-N/*.md` | Self-contained spec per feature ready for auto-resolve |
54
+ | `default` | One feature, AI drives focused Q&A |
55
+ | `--quick` | One-line goal → assume-and-confirm spec, single-turn (autonomous-pipeline-safe) |
56
+ | `--from-spec <path>` | You already wrote a spec; ideate normalizes + lints it |
57
+ | `--project` | Multi-feature project: emits `plan.md` index + N child specs |
57
58
 
58
- Need to add features later? Run ideate again; it expands the existing roadmap.
59
+ Skip ideate entirely if you have a spec or just want to describe the work — `/devlyn:resolve` accepts free-form goals too.
59
60
 
60
- ### Step 2 — Build with `/devlyn:auto-resolve`
61
+ ### Step 2 — Resolve with `/devlyn:resolve`
61
62
 
62
- Point it at a spec (or just describe what you want) and walk away.
63
+ Hands-free pipeline for any coding task — bug fix, feature, refactor, debug, modify, PR review. Pass a spec, a free-form goal, or a diff to verify.
63
64
 
64
65
  ```
65
- /devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-1/1.1-user-auth.md"
66
+ /devlyn:resolve "fix the login bug" # free-form
67
+ /devlyn:resolve --spec docs/specs/2026-05-04-auth/spec.md # spec mode
68
+ /devlyn:resolve --verify-only <diff-or-PR-ref> --spec <path> # verify-only
66
69
  ```
67
70
 
68
- It runs a **10-phase pipeline** autonomously:
71
+ Internal phases run sequentially with file-based handoff via `.devlyn/pipeline.state.json`:
69
72
 
70
73
  ```
71
- Build → Build Gate → Browser Test → Evaluate → Fix Loop → Simplify → Review → Security → Clean → Docs
74
+ PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent, findings-only)
72
75
  ```
73
76
 
74
- - Each phase runs as a separate agent with fresh context
75
- - Git checkpoints at every phase for safe rollback
76
- - **Build Gate** runs your project's real compilers, typecheckers, and linters — catches type errors, cross-package drift, and Docker build failures that tests alone miss. Auto-detects project type (Next.js, Rust, Go, Solidity, Expo, Swift, and more) and Dockerfiles.
77
- - Browser validation tests your feature end-to-end (clicks, forms, verification)
78
- - Evaluation grades against done-criteria — if it fails, auto-fix and re-evaluate
77
+ - **PLAN** is the heaviest phase by design: it formalizes invariants from the spec/goal and the file list to touch.
78
+ - **BUILD_GATE** runs your project's real compilers, typecheckers, linters, and `python3 .claude/skills/_shared/spec-verify-check.py` (verification commands literal-match). Auto-detects Next.js, Rust, Go, Solidity, Expo, Swift, and Dockerfiles. Browser flows route through Chrome MCP → Playwright → curl tier.
79
+ - **VERIFY** runs in a fresh subagent context with no code-mutation tools: findings only, structurally independent.
80
+ - Git checkpoints at every phase for safe rollback. Fix-loop budget shared across BUILD_GATE and VERIFY (`--max-rounds N`, default 4).
79
81
 
80
- Skip phases you don't need: `--skip-browser`, `--skip-review`, `--skip-clean`, `--skip-docs`, `--skip-build-gate`, `--max-rounds 6`
81
- Customize the build gate: `--build-gate strict` (warnings = errors), `--build-gate no-docker` (skip Docker builds for speed)
82
- Use dual-model routing: `--engine auto` (Codex builds, Claude evaluates — see below)
82
+ Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--perf` (per-phase timing).
83
83
 
84
- ### Step 3: Verify with `/devlyn:preflight`
84
+ ### Engine selection: Claude solo by default
85
85
 
86
- After implementing all roadmap items, run a final alignment check:
86
+ `--engine claude` (default) is the canonical surface. Every phase routes to Claude.
87
87
 
88
- ```
89
- /devlyn:preflight
90
- ```
91
-
92
- Reads every commitment from your vision, roadmap, and item specs, then audits the codebase evidence-based. Catches what you missed:
93
-
94
- | Category | What It Finds |
95
- |---|---|
96
- | `MISSING` | In roadmap but not implemented |
97
- | `INCOMPLETE` | Started but unfinished |
98
- | `DIVERGENT` | Implemented differently than spec |
99
- | `BROKEN` | Has a bug preventing it from working |
100
- | `STALE_DOC` | Docs don't match current code |
101
-
102
- Confirmed gaps become new roadmap items — feed them back into auto-resolve. Use `--autofix` to do this automatically, or `--phase 2` to check only one phase.
103
-
104
- ### Bonus — Intelligent Model Routing with `--engine`
105
-
106
- Install the Codex MCP server during setup, then:
88
+ `--engine codex` routes IMPLEMENT to Codex; `--engine auto` opts into the experimental dual-engine routing where applicable. Both are research-only at HEAD: iter-0020 closed Codex BUILD/IMPLEMENT below the quality floor on the 9-fixture suite (L2 vs L1 = −3.6, 3/8 gated fixtures cleared the +5 margin floor — release-readiness FAIL); iter-0033g + iter-0034 closed PLAN-pair as research-only with explicit unblock conditions (container/sandbox infra OR production telemetry capturing positive evidence of subagent introspection). Install the Codex CLI (https://platform.openai.com/docs/codex) and pass the flag explicitly to opt in:
107
89
 
108
90
  ```
109
- /devlyn:auto-resolve "fix the auth bug" --engine auto
91
+ /devlyn:resolve "fix the auth bug" --engine auto # experimental, research-only
110
92
  ```
111
93
 
112
- **`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.7 or GPT-5.4) validated through A/B testing, not just benchmarks.
113
-
114
- > `--engine auto` (default, recommended) · `--engine codex` (force Codex for build) · `--engine claude` (Claude only)
115
-
116
- Works across the full pipeline:
117
-
118
- ```
119
- /devlyn:auto-resolve "implement feature" --engine auto
120
- /devlyn:ideate "plan new project" --engine auto
121
- /devlyn:preflight --engine auto
122
- ```
123
-
124
- <details>
125
- <summary><strong>How routing works</strong> — A/B tested on 6 roles, 11 integration tests</summary>
126
-
127
- **Pipeline phases** — builder and critic are always different models (GAN dynamic):
128
-
129
- | Phase | Model | Why |
130
- |---|---|---|
131
- | Build (implementation) | **Codex GPT-5.4** | SWE-bench Pro +11.7pp for hard coding tasks |
132
- | Evaluate | **Claude** | Long-context (MRCR +28pp) for full-diff grading |
133
- | Fix Loop | **Codex GPT-5.4** | Same advantage as Build |
134
- | Challenge | **Claude** | Fresh skeptical review needs different model family |
135
- | Browser Validate | **Claude** | Chrome MCP session-bound |
136
-
137
- **Team roles** — each of 21 roles routes to the best model:
138
-
139
- | Engine | Roles | Examples |
140
- |---|---|---|
141
- | Claude (11) | Analysis, design, architecture | root-cause-analyst, architecture-reviewer, ux-designer, product-analyst |
142
- | Codex (4) | Code generation, performance | implementation-planner, test-engineer, performance-engineer |
143
- | Dual (6) | Both models find unique issues | security-auditor, quality-reviewer, api-designer |
144
-
145
- **Key finding**: Benchmark predictions were only 33% accurate. 4 of 6 A/B-tested roles needed routing changes after real testing — proving that benchmarks alone are insufficient for optimal routing.
146
-
147
- </details>
94
+ If Codex is absent when `--engine auto` or `--engine codex` is requested, the harness silently downgrades to `--engine claude` and emits a banner in the final report.
148
95
 
149
96
  <details>
150
97
  <summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
@@ -169,7 +116,7 @@ Works across the full pipeline:
169
116
  Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
170
117
 
171
118
  - **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
172
- - **`--with-codex` consolidated into `--engine auto`** — auto now covers BUILD/FIX + team roles + ideate CHALLENGE critic (broader than `--with-codex both` ever was). Legacy flag still accepted with a graceful handoff.
119
+ - **`--with-codex` consolidated into `--engine auto`** — auto covers BUILD/FIX + team roles + ideate CHALLENGE critic. Legacy flag still accepted with a graceful handoff. *(Note: post iter-0020 close-out, `--engine auto` is experimental research-only; default is `--engine claude`.)*
173
120
  - **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
174
121
  - **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
175
122
 
@@ -177,47 +124,16 @@ Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against A
177
124
 
178
125
  ---
179
126
 
180
- ## Manual Commands
181
-
182
- When you want step-by-step control instead of the full pipeline.
183
-
184
- ### Debugging & Resolution
185
-
186
- | Command | Use When |
187
- |---|---|
188
- | `/devlyn:resolve` | Simple bugs (1-2 files) |
189
- | `/devlyn:team-resolve` | Complex issues — spawns root-cause analyst, test engineer, security auditor |
190
- | `/devlyn:browser-validate` | Test a web feature in a real browser (Chrome MCP → Playwright → curl fallback) |
127
+ ## Optional Power-User Skills
191
128
 
192
- ### Code Review & Quality
129
+ Two creative skills have moved to `optional-skills/` — install them via the interactive installer when you need them.
193
130
 
194
131
  | Command | Use When |
195
132
  |---|---|
196
- | `/devlyn:review` | Solo review: security, quality, best practices checklist |
197
- | `/devlyn:team-review` | Multi-reviewer team: security, testing, performance, product perspectives |
198
- | `/devlyn:evaluate` | Grade work against done-criteria with calibrated skepticism |
199
- | `/devlyn:clean` | Remove dead code, unused deps, complexity hotspots |
200
-
201
- ### UI Design Pipeline
202
-
203
- | Step | Command | What It Does |
204
- |---|---|---|
205
- | 1 | `/devlyn:design-ui` | Generate 5 distinct style explorations |
206
- | 2 | `/devlyn:design-system` | Extract design tokens from chosen style |
207
- | 3 | `/devlyn:implement-ui` | Team builds it — component architect, UX, accessibility, responsive, visual QA |
133
+ | `/devlyn:design-system` | Extract exact design tokens (colors, type scale, spacing) from a chosen UI style |
134
+ | `/devlyn:team-design-ui` | Multi-perspective design team generates 5 distinct UI style explorations |
208
135
 
209
- > Use `/devlyn:team-design-ui` for step 1 with a full creative team.
210
-
211
- ### Planning & Docs
212
-
213
- | Command | What It Does |
214
- |---|---|
215
- | `/devlyn:preflight` | Verify codebase matches vision/roadmap — gap analysis with evidence |
216
- | `/devlyn:product-spec` | Generate or update product specs |
217
- | `/devlyn:feature-spec` | Turn product spec → implementable feature spec |
218
- | `/devlyn:discover-product` | Scan codebase → auto-generate product docs |
219
- | `/devlyn:recommend-features` | Prioritize top 5 features to build next |
220
- | `/devlyn:update-docs` | Sync all docs with current codebase |
136
+ > Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). These were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
221
137
 
222
138
  ---
223
139
 
@@ -231,7 +147,6 @@ These activate automatically — no commands needed. They shape how Claude think
231
147
  | `code-review-standards` | Reviews — severity framework, approval criteria |
232
148
  | `ui-implementation-standards` | UI work — design fidelity, accessibility, responsiveness |
233
149
  | `code-health-standards` | Maintenance — dead code prevention, complexity thresholds |
234
- | `workflow-routing` | Any task — guides you to the right command |
235
150
 
236
151
  ---
237
152
 
@@ -253,6 +168,9 @@ Selected during install. Run `npx devlyn-cli` again to add more.
253
168
  | `dokkit` | Document template filling for DOCX/HWPX |
254
169
  | `devlyn:pencil-pull` | Pull Pencil designs into code |
255
170
  | `devlyn:pencil-push` | Push codebase UI to Pencil canvas |
171
+ | `devlyn:reap` | Safely reap orphaned MCP / codex / Superset child processes |
172
+ | `devlyn:design-system` | Extract design tokens from a chosen UI style for exact reproduction |
173
+ | `devlyn:team-design-ui` | 5 distinct UI style explorations from a full design team |
256
174
 
257
175
  </details>
258
176
 
@@ -274,8 +192,9 @@ Selected during install. Run `npx devlyn-cli` again to add more.
274
192
 
275
193
  | Server | Description |
276
194
  |---|---|
277
- | `codex-cli` | Codex MCP server: enables `--engine auto/codex` intelligent model routing |
278
- | `playwright` | Playwright MCP — powers browser-validate Tier 2 |
195
+ | `playwright` | Playwright MCP — powers `/devlyn:resolve` BUILD_GATE browser tier (Chrome MCP → Playwright → curl fallback) |
196
+
197
+ > `--engine auto/codex` uses the local `codex` CLI binary, not MCP. Install from https://platform.openai.com/docs/codex; the harness silently downgrades to `--engine claude` if the CLI is missing.
279
198
 
280
199
  </details>
281
200
 
@@ -290,9 +209,8 @@ Selected during install. Run `npx devlyn-cli` again to add more.
290
209
 
291
210
  ## Contributing
292
211
 
293
- - **Add a command** — `.md` file in `config/commands/`
294
212
  - **Add a skill** — directory in `config/skills/` with `SKILL.md`
295
- - **Add optional skill** — add to `optional-skills/` and `OPTIONAL_ADDONS`
213
+ - **Add optional skill** — add to `optional-skills/` and `OPTIONAL_ADDONS` in [`bin/devlyn.js`](bin/devlyn.js)
296
214
  - **Suggest a pack** — PR to the pack list
297
215
 
298
216
  ## Star History
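
Reviewer note on the README hunks above: the new text says that when `--engine auto` or `--engine codex` is requested and the local `codex` CLI is missing, the harness downgrades to `--engine claude` and reports a banner. The sketch below illustrates that rule only. It is not taken from the package: `resolve_engine`, its `which` parameter, and the banner wording are hypothetical names chosen for illustration, and the lookup assumes the binary is discovered on `PATH` via Python's `shutil.which`.

```python
import shutil

# Engine names as listed in the README ("--engine claude|codex|auto").
VALID_ENGINES = ("claude", "codex", "auto")


def resolve_engine(requested="claude", which=shutil.which):
    """Return (effective_engine, banner); banner is None when nothing changed.

    `which` is injectable so the lookup can be faked in tests. Assumption:
    the real harness may detect the Codex CLI differently.
    """
    if requested not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {requested!r}")
    # Claude-only runs never need the Codex binary.
    if requested == "claude" or which("codex"):
        return requested, None
    # Codex was requested but is not installed: fall back and say so.
    banner = (
        f"--engine {requested} requested but the codex CLI was not found; "
        "ran with --engine claude instead"
    )
    return "claude", banner
```

Returning the banner alongside the engine keeps the downgrade visible in the final report instead of hiding it in a log line.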
@@ -0,0 +1,272 @@
1
+ # Benchmark Suite Design — v1
2
+
3
+ **Outer goal**: see [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md) — the harness composes frontier LLMs into a hands-free pipeline that delivers engineer-quality software for users who do not know context engineering, with each composition layer (L0 bare → L1 solo harness → L2 pair harness) justifying its added cost on quality AND wall-time efficiency. This benchmark is the measurement instrument for that contract.
4
+
5
+ **Purpose.** Replace ad-hoc A/B benchmarking with a permanent, comprehensive,
6
+ one-command suite that gates every future harness change with a ship/rollback
7
+ decision. Any prompt edit, phase reorder, new native skill, or model upgrade
8
+ can be validated by running the suite and reading the numbers.
9
+
10
+ **Arm structure (current vs planned).** Today the suite runs `variant` (L2: Claude + Codex pair) vs `bare` (L0). The L1 (solo harness on a single LLM) arm is queued for iter-0020 — until then the benchmark cannot directly verify the L1 contract, only the L0 ↔ L2 delta. Single-LLM users (Opus alone, GPT-5.5 alone) are first-class per the North Star, so this gap is a release-blocker for them, not a future enhancement.
11
+
12
+ **Non-goals.** Publishable-research statistical rigor. Not a regression test
13
+ library for the product code — those live elsewhere. Not a substitute for
14
+ production telemetry — just enough signal for ship decisions.
15
+
16
+ ---
17
+
18
+ ## Principles
19
+
20
+ 1. **One command.** `npx devlyn-cli benchmark` runs everything and prints a
21
+ verdict. No manual fixture setup.
22
+ 2. **Novice-proof.** The suite exercises the same paths a first-time user
23
+ hits — including an end-to-end `ideate → auto-resolve → preflight` fixture.
24
+ 3. **LLM-upgrade friendly.** Rubric, fixture semantics, and thresholds stay
25
+ stable; scores and margins float up as models improve. Nothing is
26
+ hardcoded to a specific model version.
27
+ 4. **Karpathy.** No fixture earns its place unless it tests a distinct
28
+ failure mode. Tooling stays boring. History plumbing is simple.
29
+ 5. **Ship gate is numbers, not vibes.** Concrete thresholds in RUBRIC.md.
30
+
31
+ ---
32
+
33
+ ## Directory Layout
34
+
35
+ ```
36
+ benchmark/auto-resolve/
37
+ ├── BENCHMARK-DESIGN.md # this file
38
+ ├── README.md # how to run, interpret, extend
39
+ ├── RUBRIC.md # stable judge rubric + ship gates
40
+
41
+ ├── fixtures/
42
+ │ ├── SCHEMA.md # fixture file format
43
+ │ ├── test-repo/ # bootstrap Node project (shared base)
44
+ │ │ ├── bin/cli.js
45
+ │ │ ├── server/index.js
46
+ │ │ ├── web/page.html
47
+ │ │ ├── tests/
48
+ │ │ ├── playwright.config.js
49
+ │ │ └── package.json
50
+ │ │
51
+ │ ├── F1-cli-trivial-flag/
52
+ │ ├── F2-cli-medium-subcommand/
53
+ │ ├── F3-backend-contract-risk/
54
+ │ ├── F4-web-browser-design/
55
+ │ ├── F5-fix-loop-red-green/
56
+ │ ├── F6-dep-audit-native-module/
57
+ │ ├── F7-out-of-scope-trap/
58
+ │ ├── F8-known-limit-ambiguous/
59
+ │ └── F9-e2e-ideate-to-resolve/
60
+
61
+ ├── scripts/
62
+ │ ├── run-suite.sh # single entry — runs all fixtures × 2 arms + judge + report
63
+ │ ├── run-fixture.sh # one fixture, one arm
64
+ │ ├── judge.sh # Codex blind judge (model-agnostic)
65
+ │ ├── compile-report.py # aggregate into report.md + summary.json
66
+ │ └── ship-gate.py # apply thresholds, return ship/rollback verdict
67
+
68
+ ├── results/ # per-run artifacts (overwritten)
69
+ │ └── <run-id>/
70
+ │   ├── <fixture>/
71
+ │   │ ├── variant/{input.md, transcript.txt, diff.patch, verify.json, timing.json}
72
+ │   │ └── bare/{same}
73
+ │   ├── <fixture>/judge.json
74
+ │   ├── report.md
75
+ │   └── summary.json
76
+ │
77
+ └── history/
78
+   ├── runs/ # append-only immutable records
79
+   │ └── 2026-04-23T120000Z-v3.6.json
80
+   ├── latest.json # pointer to most recent run
81
+   └── baselines/
82
+     └── shipped.json # last blessed version, used for regression check
83
+ ```
84
+
85
+ ---
86
+
87
+ ## Fixture Schema
88
+
89
+ Every fixture is a directory with these files (see `fixtures/SCHEMA.md`):
90
+
91
+ | File | Purpose |
+ |------|---------|
+ | `metadata.json` | id, category, difficulty, timeout, required tools, intent block |
+ | `spec.md` | pipeline-arm input (auto-resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
+ | `task.txt` | bare-arm input (same intent, natural-language framing) |
+ | `expected.json` | machine-readable acceptance criteria + forbidden patterns + verification commands |
+ | `NOTES.md` | why this fixture exists, the specific failure mode it tests |
+ | `setup.sh` | deterministic starting state; applied to a fresh copy of `test-repo/` |
99
+
100
+ **Drift prevention**: `spec.md` and `task.txt` both derive from the same
101
+ `intent` block in `metadata.json`. A lint step in CI verifies they stay
102
+ consistent.
103
+
104
+ ---
105
+
106
+ ## The 9 Fixtures
107
+
108
+ Category coverage matrix (rows = fixtures, columns = categories):
+
+ | Fixture | Trivial | Medium | High-risk | Stress | Edge | E2E |
+ |---------|---------|--------|-----------|--------|------|-----|
+ | F1-cli-trivial-flag | ✓ | | | | | |
+ | F2-cli-medium-subcommand | | ✓ | | | | |
+ | F3-backend-contract-risk | | | ✓ | | | |
+ | F4-web-browser-design | | | | ✓ (browser-validate) | | |
+ | F5-fix-loop-red-green | | | | ✓ (FIX LOOP) | | |
+ | F6-dep-audit-native-module | | | | ✓ (CRITIC security dep audit) | | |
+ | F7-out-of-scope-trap | | | | ✓ (scope discipline) | | |
+ | F8-known-limit-ambiguous | | | | | ✓ (documents where pipeline may lose) | |
+ | F9-e2e-ideate-to-resolve | | | | | | ✓ (novice full-flow) |
121
+
122
+ **F9 is load-bearing** for the "novice user types `/devlyn:ideate`" promise.
123
+ Input is a vague idea; pipeline arm runs ideate → auto-resolve on every
124
+ generated spec → preflight; bare arm runs a direct prompt. Judge compares
125
+ the final usable artifact set (code + docs + roadmap state).
126
+
127
+ ---
128
+
129
+ ## Single-Command Invocation
130
+
131
+ ### User experience
132
+
133
+ ```bash
+ npx devlyn-cli benchmark                              # n=1 smoke, all fixtures
+ npx devlyn-cli benchmark --n 3                        # higher confidence for ship decisions
+ npx devlyn-cli benchmark F2 F5                        # specific fixtures only
+ npx devlyn-cli benchmark --judge-only --run-id <id>   # re-judge without re-running
+ ```
139
+
140
+ Output on completion:
141
+
142
+ ```
+ Benchmark Suite Run - 2026-04-23T12:00Z (v3.6)
+ Judge: codex CLI flagship, xhigh, blind (model recorded in run history)
+
+ Fixture                      Variant  Bare  Margin  Verdict
+ F1-cli-trivial-flag               95    88      +7  PASS
+ F2-cli-medium-subcommand          92    81     +11  PASS
+ F3-backend-contract-risk          89    72     +17  PASS
+ F4-web-browser-design             87    79      +8  PASS
+ F5-fix-loop-red-green             91    65     +26  PASS
+ F6-dep-audit-native-module        88    70     +18  PASS
+ F7-out-of-scope-trap              94    73     +21  PASS
+ F8-known-limit-ambiguous          78    79      -1  EXPECTED (known-limit)
+ F9-e2e-ideate-to-resolve          90    68     +22  PASS
+ ---------------------------------------------------------
+ Suite average variant score: 89.3
+ Suite average bare score:    75.0
+ Suite average margin:        +14.3 (ship floor: +5)
+ Hard-floor violations:       0
+ Regression vs shipped:       n/a (first run of v3.6)
+ SHIP-GATE VERDICT: ✅ PASS
+ ```
164
+
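The suite-level figures in the sample report can be reproduced from the per-fixture rows. The following is a minimal sketch, not the actual `compile-report.py`; the `SCORES` mapping and the `summarize` helper are hypothetical names used only for illustration, and the numbers are copied from the sample output.

```python
# Hypothetical sketch: recompute the suite averages shown in the sample report.
# Scores are copied from the sample table: fixture -> (variant, bare).
SCORES = {
    "F1-cli-trivial-flag": (95, 88),
    "F2-cli-medium-subcommand": (92, 81),
    "F3-backend-contract-risk": (89, 72),
    "F4-web-browser-design": (87, 79),
    "F5-fix-loop-red-green": (91, 65),
    "F6-dep-audit-native-module": (88, 70),
    "F7-out-of-scope-trap": (94, 73),
    "F8-known-limit-ambiguous": (78, 79),
    "F9-e2e-ideate-to-resolve": (90, 68),
}


def summarize(scores):
    """Return mean variant score, mean bare score, and mean margin (variant - bare)."""
    count = len(scores)
    variant = sum(v for v, _ in scores.values()) / count
    bare = sum(b for _, b in scores.values()) / count
    return {
        "variant": round(variant, 1),
        "bare": round(bare, 1),
        "margin": round(variant - bare, 1),
    }


summary = summarize(SCORES)
print(summary)
```

Because the margin is a difference of means, it equals the mean of the per-fixture margins, so either aggregation order gives the same suite margin.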
165
+ ### Runner orchestration
166
+
167
+ `run-suite.sh`:
168
+
169
+ 1. Generate run-id `<ISO>-<sha>-<branch>`
170
+ 2. For each fixture × each arm (variant, bare): parallelizable via `xargs -P`
171
+ - `run-fixture.sh --fixture FX --arm variant` → writes `results/<run-id>/FX/variant/*`
172
+ 3. For each fixture: `judge.sh FX <run-id>` → writes `results/<run-id>/FX/judge.json`
173
+ 4. `compile-report.py <run-id>` → writes `report.md` + `summary.json`
174
+ 5. `ship-gate.py <run-id>` → exit 0 (PASS) / 1 (FAIL). Prints verdict to stdout.
175
+ 6. If PASS and `--bless` flag: copy `summary.json` → `history/baselines/shipped.json`
176
+ 7. Always: append `history/runs/<run-id>.json` + update `latest.json`
177
+
178
+ ### `run-fixture.sh` contract
179
+
180
+ - Creates fresh temp copy of `test-repo/` at `/tmp/bench-<run-id>-<fixture>-<arm>/`
181
+ - Applies `setup.sh` if present
182
+ - Copies `spec.md` (variant) or `task.txt` (bare) as the prompt
183
+ - Invokes Claude/auto-resolve (variant) or bare Claude (bare) via isolated Agent
184
+ - Captures: `diff.patch`, `changed-files.txt`, `transcript.txt`, `timing.json`
185
+ - Runs `expected.json::verification_commands`, writes pass/fail per command to `verify.json`
186
+ - Writes `result.json` with aggregate: exit code, duration, files changed, verification score
187
+
188
+ ### `judge.sh` contract
189
+
190
+ - Reads `results/<run-id>/<fixture>/{variant,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
191
+ - Builds a blind prompt: labels arms A and B randomly per fixture (seed recorded)
192
+ - Invokes `codex exec` (current flagship — no model hardcode) with RUBRIC.md
193
+ - Writes `judge.json`: per-axis scores, winner, margin, critical findings, disqualifiers
194
+ - Idempotent: re-running overwrites the same `judge.json`
195
+
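The blind-labeling step in the `judge.sh` contract can be sketched as follows. This is an illustrative Python sketch under assumptions, not the script's actual logic: the function names, the way the seed is combined with the fixture id, and the shape of the returned record are all hypothetical.

```python
import random


def assign_blind_labels(fixture_id, seed):
    """Map the labels A and B to the two arms, reproducibly for a given seed.

    ASSUMPTION: the seed is mixed with the fixture id so each fixture gets
    its own independent ordering; the real judge may derive it differently.
    """
    rng = random.Random(f"{seed}:{fixture_id}")
    arms = ["variant", "bare"]
    rng.shuffle(arms)
    # The seed is returned so it can be recorded next to the verdict.
    return {"seed": seed, "A": arms[0], "B": arms[1]}


def unblind(labels, winner_label):
    """Translate the judge's answer ("A" or "B") back to an arm name."""
    return labels[winner_label]


labels = assign_blind_labels("F2-cli-medium-subcommand", seed=1234)
print(labels, unblind(labels, "A"))
```

Recording the seed keeps a re-judge reproducible: the same seed and fixture id yield the same A/B assignment.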
196
+ ---
197
+
198
+ ## LLM-Upgrade Resilience
199
+
200
+ Three mechanisms:
201
+
202
+ 1. **No hardcoded models.** Judge invocation is `codex exec` without `-m`; it
203
+ inherits whichever flagship the CLI currently ships. Same for agents —
204
+ they run against whatever Claude Code session-model the caller has.
205
+ Model provenance is captured in `result.json` per run.
206
+
207
+ 2. **Margin as primary signal, absolute score as secondary.** When models
208
+ improve, both arms get better. Margin (variant − bare) is model-invariant
209
+ — it measures **what the harness adds beyond bare**. Ship gates are
210
+ defined on margin (`>= +5`) and regression (`-3 or worse`), not absolute
211
+ score.
212
+
213
+ 3. **Fixture difficulty gradient.** F1 (trivial) is expected to saturate near
214
+ 100 quickly as models improve — that's fine, it still catches catastrophic
215
+ regressions. F5/F9 (stress/E2E) have enough depth that even a near-perfect
216
+ model won't 100-zero bare. If any fixture saturates (both arms > 95 for
217
+ two consecutive versions), we replace it with a harder one and document
218
+ the swap in `history/runs/<ts>-fixture-rotation.json`.
219
+
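The saturation rule in item 3 (both arms above 95 for two consecutive versions) is simple enough to express directly. A minimal sketch, assuming a hypothetical per-fixture history of `(variant, bare)` score pairs ordered oldest first; the function name and input shape are not from the suite itself.

```python
def is_saturated(history, threshold=95, streak=2):
    """Return True when both arms exceeded `threshold` in each of the last `streak` versions.

    `history` is a list of (variant_score, bare_score) tuples, oldest first,
    one entry per version (hypothetical input shape).
    """
    if len(history) < streak:
        return False
    return all(v > threshold and b > threshold for v, b in history[-streak:])


# Example: F1 saturates once both arms stay above 95 twice in a row.
print(is_saturated([(90, 80), (97, 96), (98, 97)]))
print(is_saturated([(97, 96), (98, 95)]))
```

The comparison is strict, matching the "> 95" wording, so a bare score of exactly 95 does not count toward saturation.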
220
+ ---
221
+
222
+ ## Ship Gates (from RUBRIC.md)
223
+
224
+ Hard floors (any single failure blocks ship):
225
+
226
+ - **No silent-catch / fabricated verification / skipped required test in variant.** Judge flags this as disqualifier.
227
+ - **Variant may not regress on any fixture by more than 5 points** versus the previous shipped version (per-fixture regression floor).
228
+ - **At least 7 of 9 fixtures** must have margin ≥ +5 (suite coverage).
229
+ - **F9 (E2E) must PASS** — novice-flow contract.
230
+
231
+ Soft gates (trigger rollback discussion):
232
+
233
+ - Suite average margin drop > 3 vs last shipped.
234
+ - Any fixture with margin ≤ 0 that previously had margin > +5.
235
+ - Critical-finding catch-rate decrease vs last shipped variant (not vs bare — bare is the opponent, not the regression baseline).
236
+
237
+ Known-limit exception:
238
+
239
+ - F8 is explicitly allowed to tie or lose (margin in [-3, +3]). Its job is to
240
+ document honesty, not to beat bare.
241
+
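The hard floors above can be checked mechanically. The sketch below is one possible reading, not the actual `ship-gate.py`: `FixtureResult` and `hard_floor_violations` are hypothetical names, the regression floor is assumed to compare variant scores across versions, and a fixture is assumed to "PASS" when its margin is at least +5 with no disqualifier.

```python
from dataclasses import dataclass


@dataclass
class FixtureResult:
    fixture: str
    variant: int
    bare: int
    disqualified: bool = False  # judge flagged a disqualifier on the variant arm

    @property
    def margin(self):
        return self.variant - self.bare


def hard_floor_violations(results, shipped_variant=None, e2e="F9-e2e-ideate-to-resolve"):
    """Return a list of hard-floor violations; an empty list means no block."""
    shipped_variant = shipped_variant or {}
    violations = []
    for r in results:
        if r.disqualified:
            violations.append(f"{r.fixture}: variant disqualifier")
        previous = shipped_variant.get(r.fixture)
        # ASSUMPTION: regression is measured on the variant score vs the shipped baseline.
        if previous is not None and r.variant - previous < -5:
            violations.append(f"{r.fixture}: regression worse than -5")
    if sum(r.margin >= 5 for r in results) < 7:
        violations.append("suite coverage: fewer than 7 fixtures with margin >= +5")
    # ASSUMPTION: "PASS" for F9 means margin >= +5 and no disqualifier.
    f9 = next((r for r in results if r.fixture == e2e), None)
    if f9 is None or f9.disqualified or f9.margin < 5:
        violations.append(f"{e2e}: must PASS")
    return violations


SAMPLE = [
    FixtureResult("F1-cli-trivial-flag", 95, 88),
    FixtureResult("F2-cli-medium-subcommand", 92, 81),
    FixtureResult("F3-backend-contract-risk", 89, 72),
    FixtureResult("F4-web-browser-design", 87, 79),
    FixtureResult("F5-fix-loop-red-green", 91, 65),
    FixtureResult("F6-dep-audit-native-module", 88, 70),
    FixtureResult("F7-out-of-scope-trap", 94, 73),
    FixtureResult("F8-known-limit-ambiguous", 78, 79),
    FixtureResult("F9-e2e-ideate-to-resolve", 90, 68),
]
print(hard_floor_violations(SAMPLE))
```

Returning the list of violations rather than a bare boolean keeps the verdict explainable in `report.md`; the caller would map an empty list to exit code 0.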
242
+ ---
243
+
244
+ ## Karpathy Check
245
+
246
+ Where over-engineering lurks:
247
+
248
+ - ❌ **Automatic history mutation during development.** Add append-only
249
+ history AFTER the suite format stabilizes (one version after initial ship).
250
+ - ❌ **Statistical tooling beyond mean/median/margin.** n=1-3 doesn't need
251
+ t-tests.
252
+ - ❌ **Auto-generated fixture cards / dashboards.** Plain `report.md` is enough.
253
+ - ✅ **Keep scripts under 100 lines each** unless they're doing concrete,
254
+ repeated work the user would do by hand.
255
+
256
+ If the suite tooling grows past ~800 total lines, prune aggressively before
257
+ adding anything.
258
+
259
+ ---
260
+
261
+ ## Open Questions (to be answered before first full ship-gate run)
262
+
263
+ 1. Where does the `benchmark` subcommand live? Inside `bin/devlyn.js` or as
263
+ a standalone `benchmark/auto-resolve/scripts/run-suite.sh` invoked via `npm
265
+ run`? **Proposal**: both — `bin/devlyn.js benchmark` is the advertised
266
+ entry, which shells out to the script.
267
+ 2. Parallel run safety — can we run 9 fixtures × 2 arms concurrently without
268
+ rate-limit / lockfile conflicts? **Proposal**: default sequential with
269
+ `--parallel N` flag. Default `N=1` for safety; the user can opt in.
270
+ 3. Token accounting — Claude Code doesn't expose subagent totals reliably.
271
+ **Proposal**: capture wall time as primary efficiency metric; token
272
+ estimate as best-effort secondary. Do not gate ship on token math alone.
@@ -0,0 +1,114 @@
1
+ # devlyn-cli auto-resolve Benchmark Suite
2
+
3
+ One-command A/B benchmark that gates every harness change with a ship/rollback decision.
4
+
5
+ ## Quick start
6
+
7
+ ```bash
+ npx devlyn-cli benchmark                              # n=1 smoke, all fixtures × 2 arms, judge, report, ship-gate
+ npx devlyn-cli benchmark --n 3                        # higher confidence for ship decisions
+ npx devlyn-cli benchmark F2                           # specific fixture only
+ npx devlyn-cli benchmark --dry-run                    # validate suite wiring without model invocation
+ npx devlyn-cli benchmark --bless                      # if ship-gate PASSes, promote this run as the shipped baseline
+ npx devlyn-cli benchmark --judge-only --run-id <ID>   # re-judge an existing run's artifacts
+ ```
15
+
16
+ Exit code 0 = PASS, 1 = FAIL.
17
+
18
+ ## What it does
19
+
20
+ 1. For every fixture × arm (`variant` / `bare`):
21
+ - Prepare a fresh temp copy of `fixtures/test-repo/`.
22
+ - Commit baseline + apply `setup.sh` + commit bench scaffolding.
23
+ - Invoke the arm via an isolated `claude -p` subprocess.
24
+ - Capture `diff.patch`, `transcript.txt`, `timing.json`, run `expected.json::verification_commands`.
25
+ 2. For every fixture, invoke `codex exec` as a blind judge (`A`/`B` randomized per fixture) using the 4-axis rubric in `RUBRIC.md`.
26
+ 3. Aggregate into `results/<run-id>/report.md` + `summary.json`.
27
+ 4. Apply ship-gate thresholds (`scripts/ship-gate.py`). Print verdict.
28
+ 5. Append immutable record to `history/runs/<run-id>.json`.
29
+
30
+ ## Directory layout
31
+
32
+ ```
+ benchmark/auto-resolve/
+ ├── BENCHMARK-DESIGN.md          # full design rationale
+ ├── README.md                    # this file
+ ├── RUBRIC.md                    # 4-axis scoring + ship gates
+
+ ├── fixtures/
+ │   ├── SCHEMA.md                # fixture file format
+ │   ├── test-repo/               # bootstrap Node project, base for all arms
+ │   ├── F2-cli-medium-subcommand/
+ │   └── F1,F3-F9/                # add per Stage 2-3
+
+ ├── scripts/
+ │   ├── run-suite.sh             # single entry, called by `npx devlyn-cli benchmark`
+ │   ├── run-fixture.sh           # one fixture × one arm, self-contained
+ │   ├── judge.sh                 # Codex blind judge for one fixture
+ │   ├── compile-report.py        # aggregates into report.md + summary.json
+ │   └── ship-gate.py             # applies thresholds + writes history record
+
+ ├── results/<run-id>/            # per-run artifacts (overwritten)
+ └── history/
+     ├── runs/                    # append-only, one JSON per run
+     ├── latest.json              # pointer to most recent run
+     └── baselines/shipped.json   # last blessed version, used for regression floor
+ ```
57
+
58
+ ## Prerequisites
59
+
60
+ - `claude` CLI on PATH (Claude Code, used to invoke each arm).
61
+ - `codex` CLI on PATH (used by the blind judge). Install from https://platform.openai.com/docs/codex.
62
+ - `python3`, `node`, `git`, `timeout`.
63
+
64
+ ## Adding a fixture
65
+
66
+ Follow `fixtures/SCHEMA.md`. Six files per fixture: `metadata.json`, `spec.md`, `task.txt`, `expected.json`, `NOTES.md`, `setup.sh`. Common workflow:
67
+
68
+ 1. Copy an existing fixture directory as a template.
69
+ 2. Rewrite `metadata.json::intent` with the new task's plain-language intent.
70
+ 3. Write `spec.md` (auto-resolve-ready) and `task.txt` (plain prompt) both derived from the intent.
71
+ 4. Fill `expected.json` with concrete verification commands and forbidden patterns.
72
+ 5. Document purpose + failure mode in `NOTES.md`.
73
+ 6. Add `setup.sh` if the task needs the base `test-repo` modified before either arm starts.
74
+
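Before running a new fixture, the file checklist above can be verified with a few lines of Python. This is a hedged sketch rather than part of the suite: `fixture_problems` and `REQUIRED_FILES` are hypothetical names, and `setup.sh` is treated as optional because step 6 adds it only when the base repo needs modification.

```python
import json
from pathlib import Path

# setup.sh is left out on purpose: step 6 adds it only when needed.
REQUIRED_FILES = ("metadata.json", "spec.md", "task.txt", "expected.json", "NOTES.md")


def fixture_problems(fixture_dir):
    """Return human-readable problems found in a fixture directory (empty list = OK)."""
    root = Path(fixture_dir)
    problems = [f"missing {name}" for name in REQUIRED_FILES if not (root / name).is_file()]

    meta_path = root / "metadata.json"
    if meta_path.is_file():
        try:
            metadata = json.loads(meta_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("metadata.json is not valid JSON")
        else:
            # spec.md and task.txt are both derived from this block.
            if not isinstance(metadata, dict) or "intent" not in metadata:
                problems.append("metadata.json has no intent block")
    return problems
```

A call such as `fixture_problems("fixtures/F2-cli-medium-subcommand")` would return an empty list for a complete fixture; the path here is only an example.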
75
+ ## LLM-upgrade resilience
76
+
77
+ - **No model hardcoding.** Judge runs `codex exec` without `-m`, inheriting whichever flagship the CLI currently ships. Each run captures `_judge_model` for historical provenance.
78
+ - **Margin-based gates.** Ship thresholds use margin (variant − bare), not absolute score. Both arms improve together as models improve; the harness-added value measured by margin stays meaningful.
79
+ - **Saturation rotation.** When both arms exceed 95 on a fixture for two shipped versions, rotate it (see `RUBRIC.md::Fixture Rotation Policy`).
80
+
81
+ ## Ship gates (summary — see `RUBRIC.md` for full spec)
82
+
83
+ Hard floors (any one fails → block):
84
+
85
+ - Zero variant disqualifiers (silent catch, fabricated verification, extra deps beyond `max_deps_added`, etc.).
86
+ - `F9-e2e-ideate-to-resolve` must PASS (novice-flow contract).
87
+ - ≥ 7 of 9 gated fixtures have margin ≥ +5.
88
+ - No per-fixture regression worse than −5 vs last shipped baseline.
89
+
90
+ Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margin, critical-finding catch-rate regression vs last shipped variant.
91
+
92
+ ## Running the full suite (real)
93
+
94
+ A full real benchmark takes roughly 2-3 minutes per arm for simple fixtures and up to 15 minutes per arm for strict-route fixtures. A full n=1 run of 9 fixtures × 2 arms can take from 30 minutes to 2 hours, depending on the routes taken.
95
+
96
+ ```bash
+ # Smoke run before ship decisions
+ npx devlyn-cli benchmark
+
+ # Ship-decision run
+ npx devlyn-cli benchmark --n 3 --label v3.7 --bless
+ ```
103
+
104
+ ## Dry-run
105
+
106
+ `--dry-run` skips model invocation. It still:
107
+
108
+ - Prepares each fresh work dir.
109
+ - Writes arm-specific prompts.
110
+ - Commits the baseline.
111
+ - Applies `setup.sh`.
112
+ - Runs verification commands (which will mostly fail since no implementation was added).
113
+
114
+ Use it to sanity-check new fixtures or runner changes before burning model tokens.