devlyn-cli 2.2.2 → 2.3.1
- package/AGENTS.md +2 -2
- package/CLAUDE.md +4 -4
- package/README.md +85 -34
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
- package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
- package/benchmark/auto-resolve/README.md +307 -44
- package/benchmark/auto-resolve/RUBRIC.md +23 -14
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
- package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
- package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
- package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
- package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
- package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
- package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
- package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
- package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
- package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
- package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
- package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
- package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
- package/benchmark/auto-resolve/scripts/judge.sh +153 -26
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
- package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
- package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
- package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
- package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
- package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
- package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
- package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
- package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
- package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
- package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
- package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
- package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
- package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
- package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
- package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
- package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
- package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
- package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
- package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
- package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
- package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
- package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
- package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
- package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
- package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
- package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
- package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
- package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
- package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
- package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
- package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
- package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
- package/bin/devlyn.js +221 -17
- package/config/skills/_shared/adapters/README.md +3 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
- package/config/skills/_shared/adapters/opus-4-7.md +9 -1
- package/config/skills/_shared/archive_run.py +78 -6
- package/config/skills/_shared/codex-config.md +5 -4
- package/config/skills/_shared/codex-monitored.sh +46 -1
- package/config/skills/_shared/collect-codex-findings.py +20 -5
- package/config/skills/_shared/engine-preflight.md +17 -13
- package/config/skills/_shared/runtime-principles.md +6 -9
- package/config/skills/_shared/spec-verify-check.py +2664 -107
- package/config/skills/_shared/verify-merge-findings.py +1369 -19
- package/config/skills/devlyn:design-ui/SKILL.md +364 -0
- package/config/skills/devlyn:ideate/SKILL.md +7 -4
- package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
- package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
- package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
- package/config/skills/devlyn:resolve/SKILL.md +78 -26
- package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
- package/config/skills/devlyn:resolve/references/phases/implement.md +1 -1
- package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
- package/config/skills/devlyn:resolve/references/phases/verify.md +80 -29
- package/config/skills/devlyn:resolve/references/state-schema.md +9 -4
- package/package.json +47 -2
- package/scripts/lint-fixtures.sh +349 -0
- package/scripts/lint-shadow-fixtures.sh +58 -0
- package/scripts/lint-skills.sh +3645 -95
package/AGENTS.md
CHANGED
@@ -28,7 +28,7 @@ ideate (optional) -> resolve -> ship
 
 - `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project` (multi-feature).
 - `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <ref> --spec <path>`. Phases run inline: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh-subagent, findings-only).
--
+- `/devlyn:design-ui` — required creative UI exploration surface. Optional companion skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and install only when the user opts in.
 
 Each skill's `SKILL.md` is the source of truth for flags and workflow. Do not duplicate.
 
@@ -73,7 +73,7 @@ No silent fallbacks.
 - Fallbacks allowed only when widely accepted and harmless (CSS fallback fonts, CDN failover, image placeholders).
 - Silent `catch` blocks are bugs.
 - Logging is not user-visible error handling.
--
+- No engine-availability fallback is permitted for pair/risk-probe routes: if required Codex or Claude is unavailable, emit `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` with setup guidance. `--no-pair` and `--no-risk-probes` are explicit user opt-outs, not fallbacks.
 
 ## Evidence Over Claim
 
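The no-fallback rule added to `AGENTS.md` above can be illustrated with a minimal Python sketch. It assumes the engines are detected by looking for `codex` and `claude` executables on `PATH`; the `ENGINE_BINARIES` table and both function names are hypothetical and are not taken from the package. Only the `BLOCKED:<engine>-unavailable` strings come from the diff.

```python
import shutil

# Assumption: each engine is detected via an executable of the same name on PATH.
ENGINE_BINARIES = {"codex": "codex", "claude": "claude"}


def engine_status(engine, which=shutil.which):
    """Return "OK" if the engine's CLI is found, else a BLOCKED verdict."""
    binary = ENGINE_BINARIES[engine]
    if which(binary) is None:
        return f"BLOCKED:{engine}-unavailable"
    return "OK"


def route_pair_verify(required_engines, which=shutil.which):
    """Stop at the first missing engine instead of silently running solo."""
    for engine in required_engines:
        status = engine_status(engine, which)
        if status != "OK":
            return status
    return "OK"


if __name__ == "__main__":
    # Prints "OK" or a BLOCKED verdict depending on what is installed locally.
    print(route_pair_verify(["claude", "codex"]))
```

The `which` parameter is injectable so the check can be exercised without either CLI installed. The point of the sketch is that a missing engine produces a verdict the caller must handle, never a quiet downgrade to another route.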
package/CLAUDE.md
CHANGED
@@ -24,7 +24,7 @@ The runtime sub-agent contract below (Subtractive-first / Goal-locked / No-worka
 
 ## Quick Start
 
-Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED. **Both default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has
+Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED; `/devlyn:design-ui` is also REQUIRED as the creative UI exploration surface. **Both pipeline skills default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has conditional-default pair-JUDGE when its `SKILL.md` trigger policy fires. Pass `--engine auto` or `--engine codex` explicitly to opt into the broader research path. If a selected or conditionally required engine is unavailable, the run stops with `BLOCKED:<engine>-unavailable` and setup guidance.
 
 1. `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project`.
 2. `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <diff> --spec <path>`. Phases: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent, findings-only).
@@ -123,7 +123,7 @@ No `any`, no `@ts-ignore`, no silent `catch`, no hardcoded values, no helper scr
 
 **Permitted exceptions** (explicitly carved out):
 - CSS fallback fonts, CDN failover, image placeholders — widely-accepted best practices.
--
+- No engine-availability fallback is permitted for `/devlyn:resolve` pair/risk-probe routes. If Codex or Claude is required and unavailable, the run stops with `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` plus setup guidance. `--no-pair` / `--no-risk-probes` are explicit user opt-outs, not fallbacks.
 <!-- runtime-principles:section=no-workaround:end -->
 
 ### Evidence over claim
@@ -141,7 +141,7 @@ A finding without one of these forms is excluded. Vague findings produce vague f
 
 ## Codex invocation
 
-When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine codex
+When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine codex`, `--engine auto`, or conditional VERIFY pair/risk-probe routing), the wrapper-form contract lives in `config/skills/_shared/codex-config.md` (or `.claude/skills/_shared/codex-config.md` once installed). Omit `-m <model>` — the CLI's current flagship is used automatically. MCP is not in the loop. If Codex is required and unavailable, stop with `BLOCKED:codex-unavailable` and setup guidance.
 
 ## Working Mode
 
@@ -152,7 +152,7 @@ When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine cod
 
 ## Skill Boundary Policy
 
-Post iter-0034 Phase 4 cutover (2026-05-04) the runtime
+Post iter-0034 Phase 4 cutover (2026-05-04) the runtime pipeline surface is two skills — `/devlyn:resolve` and `/devlyn:ideate` — plus the required creative UI exploration surface `/devlyn:design-ui`. `/devlyn:resolve` runs PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY inline; verification, cleanup, and security review (delegated to the native `security-review` Claude Code skill from BUILD_GATE) all live inside the pipeline. There are no standalone `/devlyn:review`, `/devlyn:evaluate`, `/devlyn:team-resolve`, etc. surfaces to delegate to — those skills were folded into resolve's phases or removed in iter-0034. Optional creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and are user-invoked only; resolve never delegates to them.
 
 Browser validation routes through `_shared/browser-runner.sh` (Chrome MCP → Playwright → curl tier) directly from BUILD_GATE — there is no separate `/devlyn:browser-validate` skill at HEAD.
 
package/README.md
CHANGED
@@ -27,13 +27,13 @@ If devlyn-cli saved you time, [give it a star](https://github.com/fysoul17/devly
 npx devlyn-cli
 ```
 
-That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` AND `/devlyn:resolve` + `/devlyn:ideate` skills into `~/.codex/skills/` so the same slash commands work inside Codex too. Run it again anytime to update.
+That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` AND `/devlyn:resolve` + `/devlyn:ideate` + `/devlyn:design-ui` skills into `~/.codex/skills/` so the same slash commands work inside Codex too. Run it again anytime to update.
 
 ---
 
 ## How It Works — Two Skills, Full Cycle
 
-devlyn-cli turns Claude Code into a hands-free development pipeline. The
+devlyn-cli turns Claude Code into a hands-free development pipeline. The pipeline surface is two skills, with `/devlyn:design-ui` installed as the required creative UI surface:
 
 ```
 ideate (optional) → resolve → ship
@@ -79,11 +79,25 @@ PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent
 - **VERIFY** runs in a fresh subagent context with no code-mutation tools — findings only, structurally independent.
 - Git checkpoints at every phase for safe rollback. Fix-loop budget shared across BUILD_GATE and VERIFY (`--max-rounds N`, default 4).
 
-Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--perf` (per-phase timing).
+Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--no-pair` (intentional solo VERIFY), `--risk-probes` / `--no-risk-probes`, `--perf` (per-phase timing).
+`--pair-verify` and `--no-pair` are mutually exclusive; using both stops with `BLOCKED:invalid-flags`.
 
-
+Free-form goals that ask for benchmark evidence, pair-evidence, risk-probe
+measurement, `solo<pair` proof, or solo-headroom work must include an
+actionable `solo-headroom hypothesis` naming the visible behavior `solo_claude`
+is expected to miss plus a backticked observable command; the backticked line
+itself must contain `miss` and be framed as the command/observable that exposes it. Without that,
+`/devlyn:resolve` stops with `BLOCKED:solo-headroom-hypothesis-required` and
+points you to `/devlyn:ideate` instead of inventing a weak hypothesis.
+Free-form goals that add or run a new unmeasured benchmark, shadow fixture,
+golden fixture, risk-probe, or pair-evidence candidate must also include
+`solo ceiling avoidance`, mention `solo_claude`, and name the concrete
+difference from rejected or solo-saturated controls such as `S2`-`S6`; without
+that, `/devlyn:resolve` stops with `BLOCKED:solo-ceiling-avoidance-required`.
 
-
+### Engine selection — Claude implementation, conditional pair VERIFY
+
+`--engine claude` (default) is the canonical implementation surface for PLAN, IMPLEMENT, BUILD_GATE, and CLEANUP. VERIFY/JUDGE conditionally runs pair mode for verify-only runs, high-risk specs, risk probes, mechanical warnings, coverage gaps, or explicit `--pair-verify`.
 
 `--engine codex` routes IMPLEMENT to Codex; `--engine auto` opts into the experimental dual-engine routing where applicable. Both are research-only at HEAD: iter-0020 closed Codex BUILD/IMPLEMENT below the quality floor on the 9-fixture suite (L2 vs L1 = −3.6, 3/8 gated fixtures cleared the +5 margin floor — release-readiness FAIL); iter-0033g + iter-0034 closed PLAN-pair as research-only with explicit unblock conditions (container/sandbox infra OR production telemetry capturing positive evidence of subagent introspection). Install the Codex CLI (https://platform.openai.com/docs/codex) and pass the flag explicitly to opt in:
 
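One way to read the flag rules in this README hunk is as a small decision function. The sketch below is an illustration in Python, not the harness's implementation: `resolve_verify_mode` and the trigger names are hypothetical, and treating `--no-pair` as overriding the conditional triggers is an assumption drawn from the phrase "intentional solo VERIFY".

```python
# Hypothetical trigger names mirroring the conditions listed in the README text.
TRIGGERS = (
    "verify_only",
    "high_risk_spec",
    "risk_probes",
    "mechanical_warnings",
    "coverage_gaps",
)


def resolve_verify_mode(flags, triggers):
    """Return "pair", "solo", or a BLOCKED verdict for the VERIFY phase.

    flags:    iterable of CLI flags as typed by the user.
    triggers: mapping of trigger name -> bool (missing keys count as False).
    """
    flags = set(flags)
    # The two flags are documented as mutually exclusive.
    if "--pair-verify" in flags and "--no-pair" in flags:
        return "BLOCKED:invalid-flags"
    # Assumption: an explicit opt-out wins over any conditional trigger.
    if "--no-pair" in flags:
        return "solo"
    if "--pair-verify" in flags:
        return "pair"
    # Otherwise pair mode runs only when some listed condition fires.
    fired = any(triggers.get(name, False) for name in TRIGGERS)
    return "pair" if fired else "solo"


if __name__ == "__main__":
    print(resolve_verify_mode(["--pair-verify", "--no-pair"], {}))
    print(resolve_verify_mode([], {"high_risk_spec": True}))
    print(resolve_verify_mode(["--no-pair"], {"high_risk_spec": True}))
```

Returning the verdict as a string keeps the invalid-flag case on the same path as the other `BLOCKED:` outcomes, so a caller can print it alongside setup guidance.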
@@ -91,49 +105,86 @@ Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-g
 /devlyn:resolve "fix the auth bug" --engine auto # experimental, research-only
 ```
 
-If Codex is absent when
-
-<details>
-<summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
-
-`/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
+If Codex or Claude is absent when explicitly selected or conditionally required, the harness stops with `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` and prints setup guidance. Use `--no-pair` only when intentionally accepting solo VERIFY; use `--no-risk-probes` only when intentionally disabling automatic high-risk probes.
 
-
-- **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
-- **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
-- **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
-- **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
-- **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
-- **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
-- **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
-- **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
+### Benchmark score runs
 
-
-
-<details>
-<summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
+Use the benchmark CLI when a change claims `solo_claude < pair`. The score-focused runners print the run id, startup gate lines, blind-judge score tables, fixture pair margins, average pair margin, wall-time ratio, and failure reasons:
 
-
+```bash
+npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
+npx devlyn-cli benchmark recent
+npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md
+npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
+npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
+npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json
+npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
+```
 
-
-
--
-
+`benchmark recent` prints a compact, wrap-safe snapshot of the current local
+pair evidence: status counts, pair-lift aggregates, and one card per passing
+pair-evidence fixture. It intentionally avoids wide Markdown tables, so the
|
127
|
+
same output stays readable in narrow terminals, PR comments, and release notes.
|
|
128
|
+
`benchmark frontier` also prints a stdout score summary for existing complete pair
|
|
129
|
+
evidence rows, including pair arm, trigger reasons, average/minimum pair margin,
|
|
130
|
+
and wall ratio, plus row-level verdicts even when `--out-json` or `--out-md`
|
|
131
|
+
writes an artifact. Markdown frontier artifacts include a `Triggers` column.
|
|
132
|
+
Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
|
|
133
|
+
and include a Markdown `Hypothesis trigger` column, so strict regenerated
|
|
134
|
+
evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
|
|
135
|
+
`benchmark audit` is the provider-free release/handoff guard: it writes
|
|
136
|
+
`audit.json` with the frontier summary, artifact map, and compact trigger-backed verdict-bearing `pair_evidence_rows`
|
|
137
|
+
(each row carries `pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`), runs the frontier with
|
|
138
|
+
`--fail-on-unmeasured`, requires at least four fixtures with passing pair evidence,
|
|
139
|
+
revalidates frontier `verdict: PASS`, zero unmeasured candidates, `pair_mode: true`,
|
|
140
|
+
the default 5-point pair margin, and 3x pair/solo wall ratio, then
|
|
141
|
+
audits failed headroom results. The audit stdout also prints
|
|
142
|
+
`headroom_rejections=...`, `pair_evidence_quality=...`,
|
|
143
|
+
`pair_trigger_reasons=...`, `pair_evidence_hypotheses=...`, and
|
|
144
|
+
`pair_evidence_hypothesis_triggers=...` handoff rows, plus
|
|
145
|
+
`pair_trigger_historical_aliases=...` when archived evidence includes legacy
|
|
146
|
+
trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
|
|
147
|
+
hypotheses have not yet propagated into trigger reasons, with the rejected-fixture
|
|
148
|
+
coverage counts plus actual minimum pair margin, maximum pair/solo wall ratio,
|
|
149
|
+
and canonical trigger reason coverage plus row-match status.
|
|
150
|
+
The compact evidence row count must match the frontier evidence count,
|
|
151
|
+
`checks.frontier_stdout` records summary, aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts,
|
|
152
|
+
`checks.headroom_rejections` records child verdict plus unrecorded/unsupported counts,
|
|
153
|
+
`checks.pair_evidence_quality` records the same quality thresholds from the compact rows,
|
|
154
|
+
`checks.pair_trigger_reasons` records canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status,
|
|
155
|
+
`checks.pair_evidence_hypotheses` records documented/total pair-evidence hypothesis row counts,
|
|
156
|
+
and `checks.pair_evidence_hypothesis_triggers` records whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons plus fixture-level gap details
|
|
157
|
+
so incomplete or low-quality local score artifacts cannot inflate the claim.
|
|
158
|
+
Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
|
|
159
|
+
archived-evidence WARN rows into release-blocking FAIL rows for newly
|
|
160
|
+
regenerated pair evidence.
|
|
122
161
|
|
|
123
|
-
|
|
162
|
+
```bash
|
|
163
|
+
npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
|
|
164
|
+
```
|
|
124
165
|
|
|
125
|
-
|
|
166
|
+
Historical trigger aliases are reported only for archived artifact review;
|
|
167
|
+
current pair-evidence gates fail historical-only or unknown trigger reasons and
|
|
168
|
+
require at least one canonical `pair_trigger.reasons` entry.
|
|
169
|
+
`benchmark audit-headroom` fails if an active failed headroom fixture is missing
|
|
170
|
+
from both the rejected registry and passing pair evidence.
|
|
171
|
+
Headroom runs use the current claim gate: `bare <= 60`, `solo_claude <= 80`,
|
|
172
|
+
and the default 5-point `bare`/`solo_claude` headroom margins before spending a pair arm.
|
|
173
|
+
Add `--dry-run` to either score runner to validate args, fixture ids, minimum
|
|
174
|
+
fixture count, and the replay command without running arms or judges. Dry-runs
|
|
175
|
+
and lint prove wiring only; real score claims must cite the run id and fixture
|
|
176
|
+
ids.
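
The audit thresholds described above (a 5-point pair margin, a 3x pair/solo wall ratio, and at least four fixtures with passing pair evidence) can be sketched as follows. This is a minimal illustration, not the CLI's implementation: the `PairRow` type, its field names, and the `>=` / `<=` comparison directions are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PairRow:
    """Hypothetical pair-evidence row; field names are illustrative only."""
    fixture: str
    solo_score: float
    pair_score: float
    solo_wall_s: float
    pair_wall_s: float


def pair_row_passes(row, min_margin=5.0, max_wall_ratio=3.0):
    """True when pair beats solo by the margin without exceeding the wall ratio."""
    if row.solo_wall_s <= 0:
        return False  # a ratio cannot be computed, so treat the row as failing
    margin = row.pair_score - row.solo_score
    ratio = row.pair_wall_s / row.solo_wall_s
    return margin >= min_margin and ratio <= max_wall_ratio


def audit_rows(rows, min_fixtures=4):
    """Return (ok, passing_fixture_ids) for a list of pair-evidence rows."""
    passing = [row.fixture for row in rows if pair_row_passes(row)]
    return len(passing) >= min_fixtures, passing
```

A row that clears the margin but runs longer than three times the solo wall time fails in this sketch, which mirrors the stated intent that low-quality local artifacts should not inflate the claim.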
|
|
126
177
|
|
|
127
178
|
## Optional Power-User Skills
|
|
128
179
|
|
|
129
|
-
Two creative skills
|
|
180
|
+
Two creative companion skills live in `optional-skills/` — install them via the interactive installer when you need them.
|
|
130
181
|
|
|
131
182
|
| Command | Use When |
|
|
132
183
|
|---|---|
|
|
133
184
|
| `/devlyn:design-system` | Extract exact design tokens (colors, type scale, spacing) from a chosen UI style |
|
|
134
185
|
| `/devlyn:team-design-ui` | Multi-perspective design team generates 5 distinct UI style explorations |
|
|
135
186
|
|
|
136
|
-
> Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui).
|
|
187
|
+
> Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). Most were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:design-ui` is now installed as a required creative UI surface. Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
|
|
137
188
|
|
|
138
189
|
---
|
|
139
190
|
|
|
@@ -194,7 +245,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.
|
|
|
194
245
|
|---|---|
|
|
195
246
|
| `playwright` | Playwright MCP — powers `/devlyn:resolve` BUILD_GATE browser tier (Chrome MCP → Playwright → curl fallback) |
|
|
196
247
|
|
|
197
|
-
> `--engine auto/codex`
|
|
248
|
+
> `--engine auto/codex` and conditional VERIFY pair mode use the local `codex` CLI binary, not MCP. Install from https://platform.openai.com/docs/codex, run the current Codex auth/login flow, verify `codex --version`, then rerun.
|
|
198
249
|
|
|
199
250
|
</details>
|
|
200
251
|
|
|
@@ -2,12 +2,18 @@
|
|
|
2
2
|
|
|
3
3
|
**Outer goal**: see [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md) — the harness composes frontier LLMs into a hands-free pipeline that delivers engineer-quality software for users who do not know context engineering, with each composition layer (L0 bare → L1 solo harness → L2 pair harness) justifying its added cost on quality AND wall-time efficiency. This benchmark is the measurement instrument for that contract.
|
|
4
4
|
|
|
5
|
-
**Purpose.** Replace ad-hoc
|
|
5
|
+
**Purpose.** Replace ad-hoc harness benchmarking with a permanent, comprehensive,
|
|
6
6
|
one-command suite that gates every future harness change with a ship/rollback
|
|
7
7
|
decision. Any prompt edit, phase reorder, new native skill, or model upgrade
|
|
8
8
|
can be validated by running the suite and reading the numbers.
|
|
9
9
|
|
|
10
|
-
**Arm structure
|
|
10
|
+
**Arm structure.** Current full-pipeline evidence uses three arms: `bare` (L0),
|
|
11
|
+
`solo_claude` (L1 solo harness), and an L2 pair arm (`variant` in the smoke
|
|
12
|
+
suite, or a focused pair arm such as `l2_risk_probes` in pair-candidate runs).
|
|
13
|
+
Pair claims are headroom-gated: counted fixtures must leave room above solo
|
|
14
|
+
(`bare <= 60`, `solo_claude <= 80`, default 5-point `bare`/`solo_claude` headroom margins),
|
|
15
|
+
the pair arm must actually run, and blind judging must show pair above solo by
|
|
16
|
+
the configured margin.
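
As a rough sketch of the headroom-gated pair claim described above: a fixture counts only if `bare` and `solo_claude` leave room (`bare <= 60`, `solo_claude <= 80`), the pair arm actually ran, and pair beats solo by the configured margin. The function name and the reason strings are hypothetical. The separate `bare`/`solo_claude` headroom margins are deliberately left out, because this section does not spell out how they are applied.

```python
def pair_claim_failures(bare, solo, pair, pair_ran,
                        bare_max=60, solo_max=80, min_pair_margin=5):
    """Return a list of reasons the fixture cannot count toward a pair claim.

    An empty list means the fixture counts. All names are illustrative.
    """
    reasons = []
    if bare > bare_max:
        reasons.append("bare score leaves no headroom")
    if solo > solo_max:
        reasons.append("solo_claude score leaves no headroom")
    if not pair_ran:
        reasons.append("pair arm did not run")
    elif pair is None or pair - solo < min_pair_margin:
        reasons.append("pair margin below configured minimum")
    return reasons
```

Returning reasons rather than a boolean is a design choice for the example; it matches the score runners' habit of printing failure reasons alongside verdicts.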
|
|
11
17
|
|
|
12
18
|
**Non-goals.** Publishable-research statistical rigor. Not a regression test
|
|
13
19
|
library for the product code — those live elsewhere. Not a substitute for
|
|
@@ -20,7 +26,7 @@ production telemetry — just enough signal for ship decisions.
|
|
|
20
26
|
1. **One command.** `npx devlyn-cli benchmark` runs everything and prints a
|
|
21
27
|
verdict. No manual fixture setup.
|
|
22
28
|
2. **Novice-proof.** The suite exercises the same paths a first-time user
|
|
23
|
-
hits — including an end-to-end `ideate →
|
|
29
|
+
hits — including an end-to-end `ideate → resolve` fixture.
|
|
24
30
|
3. **LLM-upgrade friendly.** Rubric, fixture semantics, and thresholds stay
|
|
25
31
|
stable; scores and margins float up as models improve. Nothing is
|
|
26
32
|
hardcoded to a specific model version.
|
|
@@ -56,10 +62,11 @@ benchmark/auto-resolve/
|
|
|
56
62
|
│ ├── F6-dep-audit-native-module/
|
|
57
63
|
│ ├── F7-out-of-scope-trap/
|
|
58
64
|
│ ├── F8-known-limit-ambiguous/
|
|
59
|
-
│
|
|
65
|
+
│ ├── F9-e2e-ideate-to-resolve/
|
|
66
|
+
│ └── F10+ extensions for headroom, full-pipeline pair, and frozen VERIFY
|
|
60
67
|
│
|
|
61
68
|
├── scripts/
|
|
62
|
-
│ ├── run-suite.sh #
|
|
69
|
+
│ ├── run-suite.sh # smoke entry — runs fixture arms + judge + report
|
|
63
70
|
│ ├── run-fixture.sh # one fixture, one arm
|
|
64
71
|
│ ├── judge.sh # Codex blind judge (model-agnostic)
|
|
65
72
|
│ ├── compile-report.py # aggregate into report.md + summary.json
|
|
@@ -68,8 +75,9 @@ benchmark/auto-resolve/
|
|
|
68
75
|
├── results/ # per-run artifacts (overwritten)
|
|
69
76
|
│ └── <run-id>/
|
|
70
77
|
│ ├── <fixture>/
|
|
71
|
-
│ │ ├──
|
|
72
|
-
│ │
|
|
78
|
+
│ │ ├── bare/{input.md, transcript.txt, diff.patch, result.json}
|
|
79
|
+
│ │ ├── solo_claude/{same}
|
|
80
|
+
│ │ └── variant or l2_risk_probes/{same}
|
|
73
81
|
│ ├── <fixture>/judge.json
|
|
74
82
|
│ ├── report.md
|
|
75
83
|
│ └── summary.json
|
|
@@ -91,7 +99,7 @@ Every fixture is a directory with these files (see `fixtures/SCHEMA.md`):
|
|
|
91
99
|
| File | Purpose |
|
|
92
100
|
|------|---------|
|
|
93
101
|
| `metadata.json` | id, category, difficulty, timeout, required tools, intent block |
|
|
94
|
-
| `spec.md` | pipeline-arm input (
|
|
102
|
+
| `spec.md` | pipeline-arm input (resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
|
|
95
103
|
| `task.txt` | bare-arm input (same intent, natural-language framing) |
|
|
96
104
|
| `expected.json` | machine-readable acceptance criteria + forbidden patterns + verification commands |
|
|
97
105
|
| `NOTES.md` | why this fixture exists, the specific failure mode it tests |
|
|
@@ -103,9 +111,13 @@ consistent.
|
|
|
103
111
|
|
|
104
112
|
---
|
|
105
113
|
|
|
106
|
-
##
|
|
114
|
+
## Core Fixtures And Extensions
|
|
107
115
|
|
|
108
|
-
|
|
116
|
+
The original v3.6 matrix covered F1-F9. Later fixtures extend the same schema
|
|
117
|
+
for headroom, full-pipeline pair, and frozen VERIFY evidence.
|
|
118
|
+
|
|
119
|
+
Category coverage matrix for the original core set (rows = concerns, columns =
|
|
120
|
+
fixtures):
|
|
109
121
|
|
|
110
122
|
| Fixture | Trivial | Medium | High-risk | Stress | Edge | E2E |
|
|
111
123
|
|---------|---------|--------|-----------|--------|------|-----|
|
|
@@ -120,9 +132,9 @@ Category coverage matrix (rows = concerns, columns = fixtures):
|
|
|
120
132
|
| F9-e2e-ideate-to-resolve | | | | | | ✓ (novice full-flow) |
|
|
121
133
|
|
|
122
134
|
**F9 is load-bearing** for the "novice user types `/devlyn:ideate`" promise.
|
|
123
|
-
Input is a vague idea; pipeline
|
|
124
|
-
|
|
125
|
-
|
|
135
|
+
Input is a vague idea; the pipeline path turns it into a spec with ideate and
|
|
136
|
+
then resolves that spec. Bare arm runs a direct prompt. Judge compares the final
|
|
137
|
+
usable artifact set.
|
|
126
138
|
|
|
127
139
|
---
|
|
128
140
|
|
|
@@ -132,7 +144,6 @@ the final usable artifact set (code + docs + roadmap state).
|
|
|
132
144
|
|
|
133
145
|
```bash
|
|
134
146
|
npx devlyn-cli benchmark # n=1 smoke, all fixtures
|
|
135
|
-
npx devlyn-cli benchmark --n 3 # higher confidence for ship decisions
|
|
136
147
|
npx devlyn-cli benchmark F2 F5 # specific fixtures only
|
|
137
148
|
npx devlyn-cli benchmark --judge-only --run-id <id> # re-judge without re-running
|
|
138
149
|
```
|
|
@@ -143,20 +154,21 @@ Output on completion:
|
|
|
143
154
|
Benchmark Suite Run — 2026-04-23T12:00Z (v3.6)
|
|
144
155
|
Judge: codex CLI flagship, xhigh, blind (model recorded in run history)
|
|
145
156
|
|
|
146
|
-
Fixture
|
|
147
|
-
F1-cli-trivial-flag
|
|
148
|
-
F2-cli-medium-subcommand
|
|
149
|
-
F3-backend-contract-risk
|
|
150
|
-
F4-web-browser-design
|
|
151
|
-
F5-fix-loop-red-green
|
|
152
|
-
F6-dep-audit-native-module
|
|
153
|
-
F7-out-of-scope-trap
|
|
154
|
-
F8-known-limit-ambiguous
|
|
155
|
-
F9-e2e-ideate-to-resolve
|
|
157
|
+
Fixture variant (L2) solo_claude (L1) bare (L0) variant-solo_claude Verdict
|
|
158
|
+
F1-cli-trivial-flag 95 92 88 +3 PASS
|
|
159
|
+
F2-cli-medium-subcommand 92 86 81 +6 PASS
|
|
160
|
+
F3-backend-contract-risk 89 80 72 +9 PASS
|
|
161
|
+
F4-web-browser-design 87 83 79 +4 PASS
|
|
162
|
+
F5-fix-loop-red-green 91 78 65 +13 PASS
|
|
163
|
+
F6-dep-audit-native-module 88 82 70 +6 PASS
|
|
164
|
+
F7-out-of-scope-trap 94 85 73 +9 PASS
|
|
165
|
+
F8-known-limit-ambiguous 78 79 79 -1 EXPECTED (known-limit)
|
|
166
|
+
F9-e2e-ideate-to-resolve 90 84 68 +6 PASS
|
|
156
167
|
---------------------------------------------------------
|
|
157
|
-
Suite average variant score:
|
|
158
|
-
Suite average
|
|
159
|
-
Suite average
|
|
168
|
+
Suite average variant (L2) score: 89.3
|
|
169
|
+
Suite average solo_claude (L1) score: 83.2
|
|
170
|
+
Suite average bare (L0) score: 75.0
|
|
171
|
+
Suite average variant-solo_claude margin: +6.1 (pair-evidence floor: +5 on eligible fixtures)
|
|
160
172
|
Hard-floor violations: 0
|
|
161
173
|
Regression vs shipped: n/a (first run of v3.6)
|
|
162
174
|
SHIP-GATE VERDICT: ✅ PASS
|
|
@@ -167,7 +179,7 @@ SHIP-GATE VERDICT: ✅ PASS
|
|
|
167
179
|
`run-suite.sh`:
|
|
168
180
|
|
|
169
181
|
1. Generate run-id `<ISO>-<sha>-<branch>`
|
|
170
|
-
2. For each fixture × each arm (variant, bare): parallelizable via `xargs -P`
|
|
182
|
+
2. For each fixture × each arm (`variant`/L2, `solo_claude`/L1, `bare`/L0): parallelizable via `xargs -P`
|
|
171
183
|
- `run-fixture.sh --fixture FX --arm variant` → writes `results/<run-id>/FX/variant/*`
|
|
172
184
|
3. For each fixture: `judge.sh FX <run-id>` → writes `results/<run-id>/FX/judge.json`
|
|
173
185
|
4. `compile-report.py <run-id>` → writes `report.md` + `summary.json`
|
|
@@ -179,17 +191,17 @@ SHIP-GATE VERDICT: ✅ PASS
|
|
|
179
191
|
|
|
180
192
|
- Creates fresh temp copy of `test-repo/` at `/tmp/bench-<run-id>-<fixture>-<arm>/`
|
|
181
193
|
- Applies `setup.sh` if present
|
|
182
|
-
- Copies `spec.md`
|
|
183
|
-
- Invokes
|
|
194
|
+
- Copies `spec.md` for `variant`/`solo_claude` or `task.txt` for `bare` as the prompt
|
|
195
|
+
- Invokes `/devlyn:resolve --spec` for `variant`, `/devlyn:resolve --spec --engine claude --no-pair --no-risk-probes` for `solo_claude`, or bare Claude for `bare` via isolated Agent
|
|
184
196
|
- Captures: `diff.patch`, `changed-files.txt`, `transcript.txt`, `timing.json`
|
|
185
197
|
- Runs `expected.json::verification_commands`, writes pass/fail per command to `verify.json`
|
|
186
198
|
- Writes `result.json` with aggregate: exit code, duration, files changed, verification score
|
|
187
199
|
|
|
188
200
|
### `judge.sh` contract
|
|
189
201
|
|
|
190
|
-
- Reads `results/<run-id>/<fixture>/{variant,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
|
|
202
|
+
- Reads `results/<run-id>/<fixture>/{variant,solo_claude,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
|
|
191
203
|
- Builds a blind prompt: labels arms A and B randomly per fixture (seed recorded)
|
|
192
|
-
- Invokes
|
|
204
|
+
- Invokes isolated Codex (current flagship — no model hardcode) with RUBRIC.md
|
|
193
205
|
- Writes `judge.json`: per-axis scores, winner, margin, critical findings, disqualifiers
|
|
194
206
|
- Idempotent: re-running overwrites the same `judge.json`
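
The blind-labelling step in the contract above (random labels per fixture, seed recorded) might look like the sketch below. It assumes the seed is stored next to the judge output so the mapping can be reproduced; the function name is hypothetical, and `judge.sh` itself may do this differently.

```python
import random


def blind_labels(arms, seed):
    """Map neutral labels (A, B, ...) to arm names using a recorded seed.

    The same seed always yields the same mapping, so a later re-judge
    can reconstruct which arm sat behind which label.
    """
    rng = random.Random(seed)   # isolated RNG, does not touch global state
    shuffled = list(arms)       # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))
```

Using a dedicated `random.Random(seed)` instance keeps the assignment deterministic regardless of any other randomness in the process.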
|
|
195
207
|
|
|
@@ -199,23 +211,27 @@ SHIP-GATE VERDICT: ✅ PASS
|
|
|
199
211
|
|
|
200
212
|
Three mechanisms:
|
|
201
213
|
|
|
202
|
-
1. **No hardcoded models.** Judge invocation
|
|
203
|
-
|
|
204
|
-
|
|
205
|
-
|
|
214
|
+
1. **No hardcoded models.** Judge invocation omits `-m`, so it inherits
|
|
215
|
+
whichever flagship the CLI currently ships. The blind judge is isolated from
|
|
216
|
+
user config/rules/hooks so local agent instructions cannot contaminate the
|
|
217
|
+
judgment. Same for agents — they run against whatever Claude Code
|
|
218
|
+
session-model the caller has. Model provenance is captured in `result.json`
|
|
219
|
+
per run.
|
|
206
220
|
|
|
207
221
|
2. **Margin as primary signal, absolute score as secondary.** When models
|
|
208
|
-
improve,
|
|
209
|
-
|
|
222
|
+
improve, all arms tend to get better. Pairwise margins remain the stable
|
|
223
|
+
signal: `solo_claude`-`bare` (L1-L0) measures solo harness value,
|
|
224
|
+
pair-`solo_claude` (L2-L1) measures pair value on eligible fixtures, and
|
|
225
|
+
`variant`-`bare` (L2-L0) remains the legacy suite signal. Ship gates are
|
|
210
226
|
defined on margin (`>= +5`) and regression (`-3 or worse`), not absolute
|
|
211
227
|
score.
|
|
212
228
|
|
|
213
229
|
3. **Fixture difficulty gradient.** F1 (trivial) is expected to saturate near
|
|
214
230
|
100 quickly as models improve — that's fine, it still catches catastrophic
|
|
215
231
|
regressions. F5/F9 (stress/E2E) have enough depth that even a near-perfect
|
|
216
|
-
model won't 100-zero bare. If any fixture saturates (
|
|
217
|
-
two consecutive versions), we replace it with a harder one and
|
|
218
|
-
the swap in `history/runs/<ts>-fixture-rotation.json`.
|
|
232
|
+
model won't 100-zero bare. If any fixture saturates (all compared gated arms
|
|
233
|
+
above 95 for two consecutive versions), we replace it with a harder one and
|
|
234
|
+
document the swap in `history/runs/<ts>-fixture-rotation.json`.
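
The saturation rule above can be expressed as a small check over per-version scores. This is a sketch under assumptions: the history is taken to be a list of `{arm: score}` dictionaries ordered oldest first, and that layout is invented for the example rather than read from `history/runs/`.

```python
def is_saturated(history, threshold=95.0, consecutive=2):
    """True when every compared arm scored above `threshold` in each of the
    last `consecutive` versions.

    `history` is a list of {arm_name: score} dicts, oldest first
    (a hypothetical layout used only for this illustration).
    """
    if len(history) < consecutive:
        return False
    recent = history[-consecutive:]
    return all(
        bool(scores) and all(score > threshold for score in scores.values())
        for scores in recent
    )
```

A fixture flagged by this check would be a candidate for rotation; the swap itself still needs to be documented by hand.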
|
|
219
235
|
|
|
220
236
|
---
|
|
221
237
|
|
|
@@ -225,14 +241,15 @@ Hard floors (any single failure blocks ship):
|
|
|
225
241
|
|
|
226
242
|
- **No silent-catch / fabricated verification / skipped required test in variant.** Judge flags this as disqualifier.
|
|
227
243
|
- **Variant may not lose any fixture by more than −5** versus previous shipped version (per-fixture regression floor).
|
|
228
|
-
- **At least 7
|
|
244
|
+
- **At least 7 gated, headroom-available fixtures** must have margin ≥ +5
|
|
245
|
+
(suite coverage).
|
|
229
246
|
- **F9 (E2E) must PASS** — novice-flow contract.
|
|
230
247
|
|
|
231
248
|
Soft gates (trigger rollback discussion):
|
|
232
249
|
|
|
233
250
|
- Suite average margin drop > 3 vs last shipped.
|
|
234
251
|
- Any fixture with margin ≤ 0 that previously had margin > +5.
|
|
235
|
-
- Critical-finding catch-rate decrease vs last shipped
|
|
252
|
+
- Critical-finding catch-rate decrease vs the last shipped comparable arm.
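
The visible hard floors can be sketched as a single pass over per-fixture rows. The row keys (`fixture`, `variant`, `margin`, `disqualified`, `verdict`, `headroom_available`) and the function name are assumptions for illustration, not the schema of `summary.json`, and the soft gates are intentionally omitted.

```python
def hard_floor_violations(rows, previous_variant,
                          min_passing=7, min_margin=5.0, max_drop=5.0):
    """Return human-readable hard-floor violations; an empty list means none.

    rows: list of dicts with hypothetical keys described in the lead-in.
    previous_variant: {fixture_id: variant score in the last shipped version}.
    """
    violations = []
    passing = 0
    for row in rows:
        fid = row["fixture"]
        if row.get("disqualified"):
            violations.append(f"{fid}: judge flagged a disqualifier")
        prev = previous_variant.get(fid)
        if prev is not None and row["variant"] - prev < -max_drop:
            violations.append(f"{fid}: regression worse than -{max_drop:g}")
        if row.get("headroom_available", True) and row["margin"] >= min_margin:
            passing += 1
        if fid.startswith("F9-") and row.get("verdict") != "PASS":
            violations.append(f"{fid}: E2E fixture did not PASS")
    if passing < min_passing:
        violations.append(
            f"only {passing} fixtures reached margin >= +{min_margin:g}"
        )
    return violations
```

Because any single hard-floor failure blocks ship, a caller would treat a non-empty result as a failed gate and print the list as the explanation.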
|
|
236
253
|
|
|
237
254
|
Known-limit exception:
|
|
238
255
|
|
|
@@ -264,7 +281,7 @@ adding anything.
|
|
|
264
281
|
standalone `benchmark/auto-resolve/scripts/run-suite.sh` invoked via `npm
|
|
265
282
|
run`? **Proposal**: both — `bin/devlyn.js benchmark` is the advertised
|
|
266
283
|
entry, which shells out to the script.
|
|
267
|
-
2. Parallel run safety — can we run
|
|
284
|
+
2. Parallel run safety — can we run the selected fixture set × 3 arms concurrently without
|
|
268
285
|
rate-limit / lockfile conflicts? **Proposal**: default sequential with
|
|
269
286
|
`--parallel N` flag. Default `N=1` for safety; the user can opt in.
|
|
270
287
|
3. Token accounting — Claude Code doesn't expose subagent totals reliably.
|