devlyn-cli 2.3.0 → 2.3.2
- package/AGENTS.md +1 -1
- package/CLAUDE.md +2 -2
- package/README.md +82 -29
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
- package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
- package/benchmark/auto-resolve/README.md +307 -44
- package/benchmark/auto-resolve/RUBRIC.md +23 -14
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
- package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
- package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
- package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
- package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
- package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
- package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
- package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
- package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
- package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
- package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
- package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
- package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
- package/benchmark/auto-resolve/scripts/judge.sh +153 -26
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
- package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
- package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
- package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
- package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
- package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
- package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
- package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
- package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
- package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
- package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
- package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
- package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
- package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
- package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
- package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
- package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
- package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
- package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
- package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
- package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
- package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
- package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
- package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
- package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
- package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
- package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
- package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
- package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
- package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
- package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
- package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
- package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
- package/bin/devlyn.js +211 -18
- package/config/skills/_shared/adapters/README.md +3 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
- package/config/skills/_shared/adapters/opus-4-7.md +9 -1
- package/config/skills/_shared/archive_run.py +78 -6
- package/config/skills/_shared/codex-config.md +3 -2
- package/config/skills/_shared/codex-monitored.sh +46 -1
- package/config/skills/_shared/collect-codex-findings.py +20 -5
- package/config/skills/_shared/engine-preflight.md +1 -1
- package/config/skills/_shared/runtime-principles.md +5 -8
- package/config/skills/_shared/spec-verify-check.py +2664 -107
- package/config/skills/_shared/verify-merge-findings.py +1369 -19
- package/config/skills/devlyn:ideate/SKILL.md +7 -4
- package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
- package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
- package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
- package/config/skills/devlyn:resolve/SKILL.md +49 -18
- package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
- package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
- package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
- package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
- package/package.json +47 -2
- package/scripts/lint-fixtures.sh +349 -0
- package/scripts/lint-shadow-fixtures.sh +58 -0
- package/scripts/lint-skills.sh +3642 -92
- /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
package/config/skills/devlyn:ideate/SKILL.md
@@ -82,7 +82,7 @@ The elicitation agent:
 5. Stops when the structural lint passes AND user confirms, or 8 turns elapsed.

 Structural lint (inline check, no script needed):
-- Frontmatter has `id`, `title`, `kind`, `status: planned`.
+- Frontmatter has `id`, `title`, `kind`, `status: planned`, `complexity`.
 - `## Context` non-empty (≥ 1 sentence).
 - `## Requirements` has ≥ 1 `- [ ]` bullet.
 - `## Out of Scope` present (may list "none" if truly nothing).
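The structural lint in the hunk above is described as an inline check with no script. Purely as an illustration, the same four rules could be sketched in Python as below. The function names, the `---`-delimited frontmatter layout, and the problem messages are assumptions made for this sketch and are not taken from the package.

```python
import re

# Frontmatter keys named by the updated lint rule.
REQUIRED_KEYS = ("id", "title", "kind", "status", "complexity")


def section_body(spec_text, title):
    """Return the text under `## <title>` up to the next H2, or None if absent."""
    pattern = rf"^## {re.escape(title)}[ \t]*\n(.*?)(?=^## |\Z)"
    found = re.search(pattern, spec_text, re.DOTALL | re.MULTILINE)
    return None if found is None else found.group(1)


def structural_lint(spec_text):
    """Return a list of problems; an empty list means the lint passes."""
    problems = []
    fields = {}
    # Assumption: frontmatter is a leading block fenced by '---' lines.
    front = re.match(r"\A---\n(.*?)\n---\n", spec_text, re.DOTALL)
    if front:
        for line in front.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    for key in REQUIRED_KEYS:
        if key not in fields:
            problems.append(f"frontmatter missing `{key}`")
    if "status" in fields and fields["status"] != "planned":
        problems.append("frontmatter `status` must be `planned`")
    context = section_body(spec_text, "Context")
    if not (context and context.strip()):
        problems.append("`## Context` is empty or missing")
    requirements = section_body(spec_text, "Requirements")
    if not (requirements and re.search(r"^\s*- \[ \] ", requirements, re.MULTILINE)):
        problems.append("`## Requirements` needs at least one `- [ ]` bullet")
    if section_body(spec_text, "Out of Scope") is None:
        problems.append("`## Out of Scope` is missing")
    return problems
```

A spec that lacks only the new `complexity` key would yield a single problem entry, which is the behavior change this hunk introduces.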
@@ -91,8 +91,9 @@ Structural lint (inline check, no script needed):
 After lint passes:
 1. Write `<spec-dir>/<id>-<slug>/spec.md` (the spec).
 2. Generate `<spec-dir>/<id>-<slug>/spec.expected.json` from the spec's `## Verification` block + any `forbidden_patterns` / `required_files` / `forbidden_files` / `max_deps_added` the conversation surfaced.
-3. Run `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path>` to validate the verification carrier shape. If exit 2, fix the carrier and re-run.
-4.
+3. Run `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path>` to validate the verification carrier shape, supported `complexity` frontmatter, and any present actionable solo-headroom hypothesis; if the spec uses a legacy inline `## Verification` JSON carrier, any solo-headroom hypothesis command must match that carrier's `verification_commands[].cmd`. If exit 2, fix the carrier/frontmatter/hypothesis and re-run.
+4. Run `python3 .claude/skills/_shared/spec-verify-check.py --check-expected <expected-path>` to validate sibling `spec.expected.json` against `_shared/expected.schema.json` plus sibling spec `complexity` frontmatter and any present actionable solo-headroom hypothesis; if the spec has a solo-headroom hypothesis, its observable command must match `spec.expected.json.verification_commands[].cmd`. If exit 2, fix the JSON/frontmatter/hypothesis and re-run.
+5. Print: `spec ready — /devlyn:resolve --spec <spec-path>`.

 ## PHASE 1Q: QUICK MODE

@@ -103,6 +104,7 @@ Single-turn assume-and-confirm. Prompt body: see `references/elicitation.md` §
 3. User responds with "go" / "fix X" / "no, different".
 4. On "go": write spec + spec.expected.json + lint + announce.
 5. On "fix X": apply correction, re-show, ask again. Maximum 3 correction rounds before escalating to default mode.
+6. Exception: for benchmark, risk-probe, or pair-evidence goals, do not infer a solo-headroom hypothesis. Ask for the actionable hypothesis first; if unavailable, exit with `spec not ready — solo-headroom hypothesis required`. For new unmeasured benchmark, shadow-fixture, golden-fixture, risk-probe, or pair-evidence candidates, also do not infer solo ceiling avoidance; ask for the concrete difference from rejected or solo-saturated controls such as `S2`-`S6`, and exit with `spec not ready — solo ceiling avoidance required` if unavailable.

 ## PHASE 1F: FROM-SPEC MODE

@@ -114,7 +116,8 @@ Prompt body: `references/from-spec-mode.md`.
 4. Apply structural fixes only — do NOT reshape Requirements / Out-of-Scope content. The user's substantive intent is preserved.
 5. Generate `spec.expected.json` if absent (best-effort from `## Verification` block).
 6. Write the normalized spec back to `<spec-dir>/<id>-<slug>/` (preserves original at `<path>` untouched unless user passes `--in-place`).
-7.
+7. Run both lint checks: `--check <spec-path>` and `--check-expected <expected-path>`.
+8. Lint pass → announce. Lint fail → surface the unfixable issue and exit non-zero. If the source is a pair-evidence candidate without an actionable solo-headroom hypothesis, the announcement must say `pair-evidence not ready` instead of implying measurement readiness.

 ## PHASE 1P: PROJECT MODE

package/config/skills/devlyn:ideate/references/elicitation.md
@@ -30,7 +30,37 @@ For most coding tasks, the under-specified blanks are:
 3. **Failure shape**: what happens on bad input? Exit code, error message format, fallback behavior (silent vs visible)?
 4. **Scope boundary**: which files are in-scope, which are out-of-scope? "Don't touch the auth module" is a boundary worth surfacing.
 5. **Constraints**: dependency policy (new deps allowed?), silent-catch policy, type-system escape policy, test coverage expectations.
-6. **
+6. **Complexity signal**: set spec frontmatter `complexity` to `high` when
+the spec needs a compound scenario crossing state mutation with ordering,
+idempotency, auth/error priority, rollback/failure handling, or exact output
+shape. This is a downstream VERIFY pair-trigger signal, not a vague
+difficulty label.
+7. **Verification**: how does the user know it worked? Pick the smallest concrete check.
+If the goal combines state mutation with ordering/priority, idempotency,
+auth/error priority, or exact output shape, ask for one concrete compound
+scenario that exercises the interaction end-to-end instead of accepting only
+isolated happy-path checks.
+8. **Pair-candidate headroom**: when the user is creating a benchmark, risk
+probe, or pair-evidence candidate, ask for one solo-headroom hypothesis in
+actionable form: the spec must literally contain `solo-headroom hypothesis`,
+`solo_claude`, `miss`, and a backticked observable command while naming the
+visible behavior a capable `solo_claude` baseline should miss; the backticked
+line itself must contain `miss` and be framed as the command/observable that exposes it. If the
+answer is only "the task is hard", rework the candidate before spending provider
+calls. Do not write a benchmark/risk-probe/pair-evidence spec until this
+hypothesis is actionable; if the user cannot provide it, stop with
+`spec not ready — solo-headroom hypothesis required` and ask them to return
+with the visible behavior `solo_claude` is expected to miss.
+9. **Solo ceiling avoidance**: for a new unmeasured benchmark, shadow-fixture,
+golden-fixture, risk-probe, or pair-evidence candidate, ask how this candidate
+differs from rejected or solo-saturated controls such as `S2`-`S6`. The note
+must literally contain `solo ceiling avoidance`, mention `solo_claude`, and
+name the concrete difference expected to preserve `solo_claude` headroom.
+Benchmark fixture directories put this in `NOTES.md` as
+`## Solo ceiling avoidance`; ordinary specs keep it in `## Verification`
+next to the solo-headroom hypothesis. Do not write or measure the candidate
+if this answer is missing; stop with
+`spec not ready — solo ceiling avoidance required`.

 Walk through these in roughly this order. Skip the ones already clear from the user's initial text.
 </missing_decisions_to_surface>
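The "actionable form" contract for a solo-headroom hypothesis (item 8 above) is a set of literal-text requirements, so it lends itself to a small check. The sketch below is one possible reading, not the logic of `spec-verify-check.py`: it treats "the backticked line itself must contain `miss`" as "some line carrying a backticked command also contains the substring `miss`". The function name, the `NON_COMMAND_TOKENS` set, and the example command are hypothetical.

```python
import re

# Literals the spec text must contain, per the elicitation rule.
REQUIRED_LITERALS = ("solo-headroom hypothesis", "solo_claude", "miss")

# Assumption: backticked spans equal to these tokens are labels, not commands.
NON_COMMAND_TOKENS = {"solo-headroom hypothesis", "solo_claude", "miss"}


def hypothesis_problems(text):
    """Return reasons the hypothesis is not actionable; empty list means it passes."""
    problems = [f"missing literal `{lit}`" for lit in REQUIRED_LITERALS if lit not in text]
    # Collect lines that carry at least one backticked span that looks like a command.
    command_lines = []
    for line in text.splitlines():
        spans = re.findall(r"`([^`]+)`", line)
        if any(span not in NON_COMMAND_TOKENS for span in spans):
            command_lines.append(line)
    if not command_lines:
        problems.append("no backticked observable command")
    elif not any("miss" in line for line in command_lines):
        # Plain substring test: words such as "missing" also satisfy it.
        problems.append("no backticked line contains `miss`")
    return problems
```

An answer such as "the task is hard" fails every requirement at once, which matches the instruction to rework the candidate before spending provider calls.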
@@ -56,14 +86,18 @@ When you're about to ask the user a question, look at the draft first — if the

 <lint>
 Before declaring the spec ready, verify structurally:
-- Frontmatter has `id`, `title`, `kind`, `status: planned`.
+- Frontmatter has `id`, `title`, `kind`, `status: planned`, `complexity`.
 - All 5 H2 sections present (`## Context`, `## Requirements`, `## Constraints`, `## Out of Scope`, `## Verification`).
 - Requirements ≥ 1 bullet.
 - Verification has either ≥ 1 named command OR the explicit pure-design escape phrase.

 If the lint fails, fix the missing piece (ask one focused question if needed) before announcing.

-After lint passes,
+After lint passes, run both mechanical checks:
+1. `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path>` validates the spec's verification carrier shape, supported `complexity` frontmatter, and any present actionable solo-headroom hypothesis; if the spec uses a legacy inline `## Verification` JSON carrier, any solo-headroom hypothesis command must match that carrier's `verification_commands[].cmd`.
+2. `python3 .claude/skills/_shared/spec-verify-check.py --check-expected <expected-path>` validates sibling `spec.expected.json` against `_shared/expected.schema.json` plus sibling spec `complexity` frontmatter and any present actionable solo-headroom hypothesis; if the spec has a solo-headroom hypothesis, its observable command must match `spec.expected.json.verification_commands[].cmd`.
+
+If either exits 2: read the stderr message, fix the malformed carrier or JSON, and re-run the failed command. Both commands must exit 0 before ready.
 </lint>

 <output>
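The two mechanical checks and the "both must exit 0" rule in the hunk above could be driven from a small wrapper. The following is a minimal sketch, assuming the checker path and the `--check` / `--check-expected` flags exactly as quoted in the diff; the wrapper function, its `run` parameter, and the message format are hypothetical and exist only so the logic can be exercised without the real script.

```python
import subprocess

# Script path as quoted in the diff; adjust if the skills directory lives elsewhere.
CHECKER = ".claude/skills/_shared/spec-verify-check.py"


def run_mechanical_checks(spec_path, expected_path, run=subprocess.run):
    """Run both checks and return (ready, messages).

    `ready` is True only when both commands exit 0. `run` is injectable so the
    wrapper can be tested with a stand-in for subprocess.run.
    """
    commands = [
        ["python3", CHECKER, "--check", spec_path],
        ["python3", CHECKER, "--check-expected", expected_path],
    ]
    messages = []
    for cmd in commands:
        result = run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Surface stderr so the malformed carrier or JSON can be fixed and the
            # failed command re-run.
            messages.append(f"{cmd[2]} exited {result.returncode}: {result.stderr.strip()}")
    return (not messages, messages)
```

Running both commands even when the first fails is a design choice of this sketch: it reports every failure in one pass, while the re-run of a failed command is left to the caller.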
@@ -83,9 +117,21 @@ When `--quick` is set:
 1. AI synthesizes spec from the one-line goal — fill every section with the most reasonable inference.
 2. AI presents the spec to the user with an explicit `## Assumptions made` block listing every inferred decision (one bullet each).
 3. User responds with "go" / "fix X to be Y" / "no, different".
-4. On "go": write the spec + spec.expected.json, run lint, announce.
+4. On "go": write the spec + spec.expected.json, run both lint checks, announce.
 5. On "fix X": apply correction, re-present, ask again. Maximum 3 correction rounds before escalating to default mode.

+Exception: quick mode must not infer a solo-headroom hypothesis for benchmark,
+risk-probe, or pair-evidence goals. If the one-line goal lacks the actionable
+`solo-headroom hypothesis` / `solo_claude` / `miss` / backticked-command
+contract, ask exactly one focused follow-up for that hypothesis before showing a
+draft; if the user cannot provide it, exit with
+`spec not ready — solo-headroom hypothesis required`. For a new unmeasured
+benchmark, shadow-fixture, golden-fixture, risk-probe, or pair-evidence
+candidate, quick mode also must not infer the `solo ceiling avoidance` note; ask
+for the concrete difference from rejected or solo-saturated controls such as
+`S2`-`S6`, and exit with `spec not ready — solo ceiling avoidance required` if
+the user cannot provide it.
+
 Quick mode trades thoroughness for speed. Use it for trivial-medium tasks where the user has a clear-enough goal that one round of inference + correction is sufficient.

 ## Anti-patterns
package/config/skills/devlyn:ideate/references/from-spec-mode.md
@@ -14,7 +14,7 @@ The user already wrote a spec (or has one from elsewhere — a teammate, a previ

 <allowed_changes>
 You may:
-1. Add missing frontmatter fields (id from filename, kind=feature default, status=planned).
+1. Add missing frontmatter fields (id from filename, kind=feature default, status=planned, complexity=medium default; set complexity=high only when preserved Requirements clearly combine state/order/failure/output-shape risks).
 2. Rename non-canonical section headings to canonical (`## Goals` → `## Requirements`, `## Notes` ignored unless they clearly belong in Constraints).
 3. Add a missing `## Out of Scope` section with `- (no explicit non-goals provided by author)`.
 4. Add a missing `## Verification` section if Requirements imply observable runtime checks — best-effort one-command-per-Requirement, then surface to user for review.
@@ -37,14 +37,36 @@ You must NOT:
|
|
|
37
37
|
3. For each missing/malformed piece: apply the smallest allowed fix.
|
|
38
38
|
4. Write the normalized spec. Default location: `<spec-dir>/<id>-<slug>/spec.md`. With `--in-place` flag: write to `<path>` directly (overwrites the original).
|
|
39
39
|
5. Generate or fix `spec.expected.json` per the rules above. Same dir as the spec.
|
|
40
|
-
6. Run `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path>` to validate.
|
|
41
|
-
7.
|
|
40
|
+
6. Run `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path>` to validate the spec carrier and supported `complexity` frontmatter; if the spec uses a legacy inline `## Verification` JSON carrier, any solo-headroom hypothesis command must match that carrier's `verification_commands[].cmd`.
|
|
41
|
+
7. Run `python3 .claude/skills/_shared/spec-verify-check.py --check-expected <expected-path>` to validate sibling `spec.expected.json` plus sibling spec `complexity` frontmatter and any present solo-headroom hypothesis command against `spec.expected.json.verification_commands[].cmd`.
|
|
42
|
+
8. If lint still fails after allowed fixes (e.g. Requirements section is empty in the source), surface the issue and exit non-zero — do NOT invent Requirements.
|
|
43
|
+
9. If the preserved Requirements combine state mutation with ordering/priority,
|
|
44
|
+
idempotency, auth/error priority, or exact output shape but the Verification
|
|
45
|
+
section lacks a compound end-to-end scenario, do not rewrite the author's
|
|
46
|
+
content. Add a final warning that `/devlyn:resolve` may need default-mode
|
|
47
|
+
ideation or a stronger Verification section before pair-relevant risks are
|
|
48
|
+
measurable.
|
|
49
|
+
10. If the source is a benchmark, risk probe, or pair-evidence candidate and it
|
|
50
|
+
lacks an actionable solo-headroom hypothesis, do not invent one. Add a final
|
|
51
|
+
warning that the candidate may be solo-saturated until Context or
|
|
52
|
+
Verification literally contains `solo-headroom hypothesis`, `solo_claude`,
|
|
53
|
+
`miss`, and a backticked observable command while naming the visible
|
|
54
|
+
behavior a capable `solo_claude` baseline is expected to miss; the
|
|
55
|
+
backticked line itself must contain `miss` and be framed as the
|
|
56
|
+
command/observable that exposes it. Do not call the normalized spec pair-evidence ready.
|
|
57
|
+
11. If the source is a new unmeasured benchmark, shadow-fixture, golden-fixture,
|
|
58
|
+
risk-probe, or pair-evidence candidate and it lacks a solo ceiling avoidance
|
|
59
|
+
note, do not invent one. Add a final warning that the candidate may replay
|
|
60
|
+
rejected or solo-saturated controls until Context, Verification, or fixture
|
|
61
|
+
`NOTES.md` literally contains `solo ceiling avoidance`, mentions
|
|
62
|
+
`solo_claude`, and names a concrete difference from rejected controls such
|
|
63
|
+
as `S2`-`S6`. Do not call the normalized spec pair-evidence ready.
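
The literal-content requirements in steps 10 and 11 could be approximated with a check like the one below. This is a minimal sketch, not the logic of `spec-verify-check.py`; both function names are hypothetical, and reading "the backticked line itself must contain `miss`" as "a single line holds both a backticked span and the word `miss`" is an assumption. Whether the note names a concrete difference from rejected controls is a judgment call the sketch does not attempt.

```python
import re


def has_actionable_headroom_hypothesis(text: str) -> bool:
    """Hypothetical check for the literal tokens named in step 10."""
    required = ("solo-headroom hypothesis", "solo_claude", "miss")
    if not all(token in text for token in required):
        return False
    # Assumption: the observable command sits in backticks on a line
    # that also contains the word "miss".
    for line in text.splitlines():
        if "miss" in line and re.search(r"`[^`]+`", line):
            return True
    return False


def has_solo_ceiling_avoidance(text: str) -> bool:
    """Hypothetical check for the literal tokens named in step 11."""
    return "solo ceiling avoidance" in text and "solo_claude" in text
```

A hypothesis that only says "the task is hard" fails the first check because none of the literal tokens are present.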
|
|
42
64
|
</flow>
|
|
43
65
|
|
|
44
66
|
<output>
|
|
45
67
|
Same as default mode: `<spec-dir>/<id>-<slug>/spec.md` + `<spec-dir>/<id>-<slug>/spec.expected.json`.
|
|
46
68
|
|
|
47
|
-
Final announcement: `spec normalized — /devlyn:resolve --spec <spec-path>`. If the spec was lint-passing with no changes needed, announce: `spec already canonical — /devlyn:resolve --spec <spec-path>`.
|
|
69
|
+
Final announcement: `spec normalized — /devlyn:resolve --spec <spec-path>`. If the spec was lint-passing with no changes needed, announce: `spec already canonical — /devlyn:resolve --spec <spec-path>`. If step 9 applies, append: `warning: Verification may need one compound end-to-end scenario before pair-relevant risks are measurable`. If step 10 applies, append: `pair-evidence not ready — Pair-candidate headroom is unproven until the spec states a solo-headroom hypothesis`. If step 11 applies, append: `pair-evidence not ready — Pair-candidate headroom is unproven until the spec states solo ceiling avoidance`.
|
|
48
70
|
|
|
49
71
|
If lint failed unfixably: print the specific failure, exit non-zero. Do not write a partial output.
|
|
50
72
|
</output>
|
|
@@ -11,6 +11,25 @@ The user wants to build a project, not a single feature. Your job is to elicit t
|
|
|
11
11
|
2. Ask the same question categories as default mode (input/output/failure/scope/constraints/verification) but at the project level first, then drill into each feature.
|
|
12
12
|
3. Decompose into 3-7 features. Fewer = the project is actually one big feature; recommend default mode. More = the project is too large; recommend splitting into separate ideate runs.
|
|
13
13
|
4. Each feature must be independently shippable: a feature whose verification depends on another feature's runtime behavior is a dependency, not a feature.
|
|
14
|
+
5. When a feature combines state mutation with ordering/priority, idempotency,
|
|
15
|
+
auth/error priority, or exact output shape, its per-feature Verification must
|
|
16
|
+
include one compound end-to-end scenario; do not hide the interaction in
|
|
17
|
+
project-level prose.
|
|
18
|
+
6. When a feature is intended as a benchmark, risk probe, or pair-evidence
|
|
19
|
+
candidate, its per-feature Verification must include a solo-headroom
|
|
20
|
+
hypothesis. The feature spec must literally contain
|
|
21
|
+
`solo-headroom hypothesis`, `solo_claude`, `miss`, and a backticked
|
|
22
|
+
observable command while naming the visible behavior a capable
|
|
23
|
+
`solo_claude` baseline is expected to miss; the backticked line itself must
|
|
24
|
+
contain `miss` and be framed as the command/observable that exposes it. Do not defer that to
|
|
25
|
+
project-level prose, and rework the feature spec if the hypothesis is only
|
|
26
|
+
"the task is hard".
|
|
27
|
+
7. When a feature is a new unmeasured benchmark, shadow-fixture, golden-fixture,
|
|
28
|
+
risk-probe, or pair-evidence candidate, its per-feature Verification must also include a solo ceiling avoidance note. The feature spec must literally
|
|
29
|
+
contain `solo ceiling avoidance`, mention `solo_claude`, and name a concrete
|
|
30
|
+
difference from rejected or solo-saturated controls such as `S2`-`S6`. Do not
|
|
31
|
+
defer that to project-level prose; benchmark fixture directories mirror the
|
|
32
|
+
same note in `NOTES.md` as `## Solo ceiling avoidance`.
|
|
14
33
|
</conversation_rules>
|
|
15
34
|
|
|
16
35
|
<decomposition_rules>
|
|
@@ -63,7 +82,7 @@ Anything binding all features (e.g. "no new top-level dependencies", "all CLI ou
|
|
|
63
82
|
- `<spec-dir>/<id-N>/spec.md` for each feature (per `references/spec-template.md`).
|
|
64
83
|
- `<spec-dir>/<id-N>/spec.expected.json` for each feature (per `_shared/expected.schema.json`).
|
|
65
84
|
|
|
66
|
-
Each per-feature spec is structurally lint-validated using `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path
|
|
85
|
+
Each per-feature spec is structurally lint-validated using `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path>`, including supported `complexity` frontmatter and any present actionable solo-headroom hypothesis, and each sibling expected contract plus sibling spec `complexity` frontmatter and any present actionable solo-headroom hypothesis is validated using `python3 .claude/skills/_shared/spec-verify-check.py --check-expected <expected-path>`; if the spec has a solo-headroom hypothesis, its observable command must match `spec.expected.json.verification_commands[].cmd`.
|
|
67
86
|
|
|
68
87
|
Final announcement: `project ready — N specs at <spec-dir>/. Start with /devlyn:resolve --spec <first-spec-path>`.
|
|
69
88
|
</output>
|
|
@@ -10,6 +10,7 @@ id: "<spec-id>" # kebab-case, unique per spec-dir; auto-generated if us
|
|
|
10
10
|
title: "<short title>" # one line, descriptive
|
|
11
11
|
kind: feature # feature | spike | prototype
|
|
12
12
|
status: planned # planned → in_progress → done. ideate writes "planned"; resolve's CLEANUP flips to "done".
|
|
13
|
+
complexity: medium # trivial | medium | high. Use high when Verification needs compound state/order/failure checks.
|
|
13
14
|
depends_on: [] # list of spec ids this depends on (empty for standalone). project mode populates this.
|
|
14
15
|
---
|
|
15
16
|
```
|
|
@@ -78,7 +79,11 @@ If all Requirements are pure-design (no observable runtime check), the body of t
|
|
|
78
79
|
|
|
79
80
|
## Sibling file: `spec.expected.json`
|
|
80
81
|
|
|
81
|
-
Schema: `_shared/expected.schema.json`. Required when Requirements have observable checks; optional when all Requirements are pure-design.
|
|
82
|
+
Schema: `_shared/expected.schema.json`. Required when Requirements have observable checks; optional when all Requirements are pure-design. Validate before announcing ready:
|
|
83
|
+
|
|
84
|
+
```bash
|
|
85
|
+
python3 .claude/skills/_shared/spec-verify-check.py --check-expected <expected-path>
|
|
86
|
+
```
|
|
82
87
|
|
|
83
88
|
Generated by ideate from the conversation:
|
|
84
89
|
- `verification_commands` ← parsed from `## Verification` body + any commands the conversation surfaced.
|
|
@@ -99,4 +104,8 @@ Substantive (ideate's job during elicitation):
|
|
|
99
104
|
- Each Constraint has a reasoning clause.
|
|
100
105
|
- Out of Scope explicitly enumerates non-goals.
|
|
101
106
|
- Verification commands actually verify the Requirement they map to (no "looks plausible" verification).
|
|
107
|
+
- Frontmatter `complexity` reflects verification shape: `high` when the spec combines state mutation with ordering/priority, idempotency, auth/error priority, rollback/failure handling, or exact output shape; `medium` for ordinary multi-step work; `trivial` only for a single localized behavior.
|
|
108
|
+
- When Requirements combine state mutation, ordering/priority, idempotency, auth/error priority, or exact output shape, Verification includes at least one compound scenario that exercises the interaction end-to-end; isolated happy paths are insufficient.
|
|
109
|
+
- For benchmark, risk-probe, or pair-evidence specs, include a solo-headroom hypothesis inside `## Verification`: the artifact must literally contain `solo-headroom hypothesis`, `solo_claude`, `miss`, and a backticked observable command while naming the visible behavior a capable `solo_claude` baseline is expected to miss; the backticked line itself must contain `miss`, be framed as the command/observable that exposes it, and match a `spec.expected.json.verification_commands[].cmd` entry. For regenerated pair evidence, this hypothesis is the source for VERIFY's canonical `spec.solo_headroom_hypothesis` trigger reason and must pass `benchmark audit --require-hypothesis-trigger`. If the hypothesis is only "the task is hard", rework the spec before measurement.
|
|
110
|
+
- For new unmeasured benchmark, shadow-fixture, golden-fixture, risk-probe, or pair-evidence candidates, include a solo ceiling avoidance note before measurement: the artifact must literally contain `solo ceiling avoidance`, mention `solo_claude`, and name a concrete difference from rejected or solo-saturated controls such as `S2`-`S6`. If the note cannot say why the candidate should preserve `solo_claude` headroom, rework the candidate instead of spending provider calls. Benchmark fixture directories put this in `NOTES.md` as `## Solo ceiling avoidance`; ordinary specs keep the note in `## Verification` next to the solo-headroom hypothesis.
|
|
102
111
|
- Spec text is plain language — no jargon walls, no "for future flexibility" hedging.
|
|
@@ -40,7 +40,7 @@ Each phase routes to an engine and prepends the per-engine adapter header from `
|
|
|
40
40
|
Three input shapes:
|
|
41
41
|
|
|
42
42
|
1. **Free-form**: `/devlyn:resolve "fix the login bug"`. PHASE 0 runs the complexity classifier and either proceeds with an internal mini-spec (trivial), drafts focused questions for in-prompt resolution (medium), or escalates to `/devlyn:ideate` (large/ambiguous). No mid-pipeline prompts in any branch.
|
|
43
|
-
2. **Spec**: `/devlyn:resolve --spec docs/roadmap/phase-N/X.md`. Spec is read-only.
|
|
43
|
+
2. **Spec**: `/devlyn:resolve --spec docs/roadmap/phase-N/X.md`. Spec is read-only. Stage verification commands from sibling `spec.expected.json`; if absent, use the legacy `## Verification` JSON block.
|
|
44
44
|
3. **Verify-only**: `/devlyn:resolve --verify-only <diff-or-PR-ref> --spec <path>`. Skips PHASE 1-4. Runs PHASE 5 (VERIFY) on the supplied diff against the spec.
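
The carrier choice described for Spec mode (sibling `spec.expected.json` first, legacy inline block otherwise) can be sketched as follows. This is illustrative only: the function names are hypothetical, parsing of the legacy `## Verification` JSON block is left out, and treating "must match" as exact string equality on `cmd` is an assumption.

```python
import json
from pathlib import Path


def load_verification_commands(spec_path):
    """Return (carrier, commands); commands is None for the legacy carrier."""
    expected = Path(spec_path).with_name("spec.expected.json")
    if expected.is_file():
        data = json.loads(expected.read_text(encoding="utf-8"))
        return "expected", data.get("verification_commands", [])
    # Legacy path: the caller would parse the inline `## Verification` block.
    return "legacy", None


def hypothesis_command_matches(hypothesis_cmd, commands):
    """Assumption: a match means equality with some verification_commands[].cmd."""
    return any(entry.get("cmd") == hypothesis_cmd for entry in commands)
```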
|
|
45
45
|
</modes>
|
|
46
46
|
|
|
@@ -67,14 +67,16 @@ Once `state.implement_passed_sha` is non-null (PHASE 2 returned and produced a d
|
|
|
67
67
|
|
|
68
68
|
2. Engine pre-flight: follow `_shared/engine-preflight.md`. If a required engine is unavailable, halt with a BLOCKED verdict and setup instructions instead of downgrading.
|
|
69
69
|
|
|
70
|
-
|
|
70
|
+
`--pair-verify` and `--no-pair` are mutually exclusive; if both are present, stop with `BLOCKED:invalid-flags`.
|
|
71
|
+
|
|
72
|
+
3. Initialize `.devlyn/pipeline.state.json` per `references/state-schema.md`. Set `state.run_id`, `started_at`, `engine`, `pair_verify: true` only when `--pair-verify` was passed and `false` otherwise, `base_ref.{branch, sha}`, `rounds.{max_rounds, global: 0}`, `bypasses`, empty `phases`, empty `criteria`, and `risk_profile: { high_risk: false, reasons: [], risk_probes_enabled: false, pair_default_enabled: true }`. `risk_profile` is strict typed state: keep it an object, keep the three flags as JSON booleans, and keep `reasons` as a string array; never serialize booleans as strings.
|
|
71
73
|
|
|
72
74
|
4. **Mode-specific init**:
|
|
73
|
-
- **Free-form**: read `references/free-form-mode.md`. Run the complexity classifier deterministically (rules over keyword density / file count / spec-shape signals). Set `state.complexity ∈ {trivial, medium, large}`. Trivial: write internal mini-spec to `.devlyn/criteria.generated.md` and proceed. Medium: synthesize a minimal spec from the goal + add 1-2 context anchors from the codebase, write to `.devlyn/criteria.generated.md`, proceed. Large: log `recommend: /devlyn:ideate first` in the final report and either halt (default) or proceed with assumed defaults if `--continue-on-large` flag set
|
|
74
|
-
- **Spec**: validate spec exists
|
|
75
|
+
- **Free-form**: read `references/free-form-mode.md`. Run the complexity classifier deterministically (rules over keyword density / file count / spec-shape signals, plus pair-evidence intent). Set `state.complexity ∈ {trivial, medium, large}`. Trivial: write internal mini-spec to `.devlyn/criteria.generated.md` and proceed. Medium: synthesize a minimal spec from the goal + add 1-2 context anchors from the codebase, write to `.devlyn/criteria.generated.md`, proceed. Every free-form branch that writes criteria must set `state.source.type = "generated"`, `state.source.criteria_path = ".devlyn/criteria.generated.md"`, and `state.source.criteria_sha256` from the raw file bytes. Large: log `recommend: /devlyn:ideate first` in the final report and either halt (default) or proceed with assumed defaults if `--continue-on-large` flag set, except pair-evidence intent without an actionable solo-headroom hypothesis must halt with `BLOCKED:solo-headroom-hypothesis-required`, and unmeasured pair-candidate intent without solo ceiling avoidance must halt with `BLOCKED:solo-ceiling-avoidance-required`.
|
|
76
|
+
- **Spec**: validate spec exists. If sibling `spec.expected.json` exists, run `--check-expected <expected-path>` to validate both the expected contract, sibling spec `complexity` frontmatter, and any present actionable solo-headroom hypothesis; if the spec has a solo-headroom hypothesis, its observable command must match `spec.expected.json.verification_commands[].cmd`. Then stage `.devlyn/spec-verify.json` from `verification_commands`. Otherwise run `--check <spec-path>` to validate the legacy inline carrier plus supported `complexity` frontmatter and any present actionable solo-headroom hypothesis; if the spec uses an inline `## Verification` JSON carrier, any solo-headroom hypothesis command must match that carrier's `verification_commands[].cmd`. Then stage from the legacy inline carrier. Compute `state.source.spec_sha256`.
|
|
75
77
|
- **Verify-only**: skip to PHASE 5 with `state.source.spec_path` set, the supplied diff captured at `.devlyn/external-diff.patch`.
|
|
76
78
|
|
|
77
|
-
5. Compute `state.risk_profile` from the user goal plus spec/criteria text. Mark `high_risk: true` when the work touches any of: auth/authz, permissions, security, token/session, payment/money/billing/invoice/pricing/tax/ledger, persistence/data mutation/deletion/migration, idempotency/replay/duplicate, API/webhook/raw-body/signature, allocation/scheduling/inventory/rollback/transaction, or explicit error-priority/output-shape contracts. If high-risk and `--no-risk-probes` is absent, set `risk_probes_enabled: true`; explicit `--risk-probes` also sets it true. If `--no-pair` is present, set `pair_default_enabled: false`.
|
|
79
|
+
5. Compute `state.risk_profile` from the user goal plus spec/criteria text. Mark `high_risk: true` when the work touches any of: auth/authz, permissions, security, token/session, payment/money/billing/invoice/pricing/tax/ledger, persistence/data mutation/deletion/migration, idempotency/replay/duplicate, API/webhook/raw-body/signature, allocation/scheduling/inventory/rollback/transaction, or explicit error-priority/output-shape contracts. If high-risk and `--no-risk-probes` is absent, set `risk_probes_enabled: true`; explicit `--risk-probes` also sets it true. If `--no-pair` is present, set `pair_default_enabled: false`. Add concise string reasons for the classification, but do not use reasons as substitutes for the boolean fields.
|
|
78
80
|
|
|
79
81
|
6. Announce one line: `resolve starting — run <run_id> — engine <engine> — mode <mode> — complexity <complexity-or-na> — pair <conditional|disabled> — risk_probes <on|off>`.
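
The flag handling and strict typing of `risk_profile` in steps 2, 3, and 5 can be sketched as below. This is a hedged illustration, not the orchestrator's code: the function names are hypothetical, the `high_risk` decision is passed in rather than derived from spec text, and letting `--risk-probes` win when both probe flags are present is an assumption the source does not state.

```python
def compute_risk_profile(high_risk, reasons, flags):
    """Build risk_profile from a high-risk decision and the CLI flags."""
    if {"--pair-verify", "--no-pair"} <= set(flags):
        raise ValueError("BLOCKED:invalid-flags")
    probes = (high_risk and "--no-risk-probes" not in flags) or "--risk-probes" in flags
    return {
        "high_risk": bool(high_risk),
        "reasons": [str(r) for r in reasons],
        "risk_probes_enabled": bool(probes),
        "pair_default_enabled": "--no-pair" not in flags,
    }


def is_well_typed(profile):
    """True only when the three flags are real booleans and reasons is a string array."""
    if not isinstance(profile, dict):
        return False
    keys = ("high_risk", "risk_probes_enabled", "pair_default_enabled")
    flags_ok = all(type(profile.get(k)) is bool for k in keys)
    reasons = profile.get("reasons")
    return flags_ok and isinstance(reasons, list) and all(isinstance(r, str) for r in reasons)
```

Using `type(...) is bool` rather than truthiness rejects values such as the string `"true"`, which is the failure the strict-typing rule guards against.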
|
|
80
82
|
|
|
@@ -115,14 +117,41 @@ a JSON object keyed by tag, with marker arrays as values; a top-level array or
|
|
|
115
117
|
tag-only probe is malformed. `ordering_inversion` must include
|
|
116
118
|
`input_order_would_choose_wrong_winner` and `asserts_processing_order_result`;
|
|
117
119
|
`prior_consumption` must include `same_resource_consumed_first` and
|
|
118
|
-
`later_entity_fails_or_reroutes`; `stdout_stderr_contract`
|
|
119
|
-
|
|
120
|
+
`later_entity_fails_or_reroutes`; `stdout_stderr_contract` must include
|
|
121
|
+
`asserts_named_stream_output`; `error_contract` must include
|
|
122
|
+
`asserts_error_payload_or_stderr` and `asserts_nonzero_or_exit_2`.
|
|
123
|
+
`http_error_contract` must include `asserts_http_error_status` and
|
|
124
|
+
`asserts_error_payload_body`.
|
|
125
|
+
`auth_signature_contract` must include `asserts_signature_over_exact_bytes` and
|
|
126
|
+
`asserts_tampered_or_missing_signature_rejected`; `idempotency_replay` must
|
|
127
|
+
include `first_delivery_then_duplicate` and
|
|
128
|
+
`duplicate_id_rejected_regardless_of_body`; `concurrent_state_consistency` must
|
|
129
|
+
include `overlapping_mutations_exercised`,
|
|
130
|
+
`all_successful_responses_reflected`, and `distinct_identifiers_asserted`;
|
|
131
|
+
`atomic_batch_state` must include `mixed_valid_invalid_batch`,
|
|
132
|
+
`asserts_store_unchanged_after_failure`, and
|
|
133
|
+
`asserts_success_order_and_distinct_ids`.
|
|
134
|
+
When visible text names exact keys, fields, row shapes, JSON objects, response
|
|
135
|
+
bodies, stdout/stderr objects, or exact error bodies, `shape_contract` must
|
|
136
|
+
include `uses_visible_input_key_names`, `asserts_visible_output_key_names`, and
|
|
137
|
+
`asserts_no_unexpected_output_keys`; exact JSON error objects/bodies must also
|
|
138
|
+
include `asserts_exact_error_object`. Cart/pricing success probes should use
|
|
120
139
|
`shape_contract` unless they satisfy the `ordering_inversion` markers. The probe
|
|
121
140
|
command must not reference external network URLs; use only worktree-local or
|
|
122
141
|
localhost resources.
|
|
123
142
|
For high-complexity specs with multiple behavior bullets, at least one probe
|
|
124
143
|
must be compound: it must exercise two or more visible verification bullets in a
|
|
125
144
|
single command. Empty output is invalid when `--risk-probes` is set.
|
|
145
|
+
When the visible spec includes a solo-headroom hypothesis, the first probe must
|
|
146
|
+
exercise that hypothesis with the visible command/input shape and full
|
|
147
|
+
observable assertion; its `cmd` must contain the hypothesis's backticked
|
|
148
|
+
observable command, and its `derived_from` must reference the hypothesis bullet,
|
|
149
|
+
so deterministic validation can prove the probe targets the stated expected
|
|
150
|
+
`solo_claude` miss. Otherwise the probe set is too weak for pair-evidence work.
|
|
151
|
+
The same actionable solo-headroom hypothesis is a VERIFY pair-trigger reason,
|
|
152
|
+
so a candidate spec that explicitly predicts a `solo_claude` miss cannot finish
|
|
153
|
+
on solo VERIFY alone unless `--no-pair` was explicitly set or an earlier
|
|
154
|
+
verdict-binding blocker already decides the run.
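
The per-tag marker requirements listed above lend themselves to a lookup table. The sketch below is an assumption-laden illustration rather than the validator shipped in `spec-verify-check.py`: the function name is hypothetical, it receives the tag-to-markers object directly, and it omits the conditional `asserts_exact_error_object` rule for exact JSON error bodies.

```python
REQUIRED_MARKERS = {
    "ordering_inversion": {"input_order_would_choose_wrong_winner", "asserts_processing_order_result"},
    "prior_consumption": {"same_resource_consumed_first", "later_entity_fails_or_reroutes"},
    "stdout_stderr_contract": {"asserts_named_stream_output"},
    "error_contract": {"asserts_error_payload_or_stderr", "asserts_nonzero_or_exit_2"},
    "http_error_contract": {"asserts_http_error_status", "asserts_error_payload_body"},
    "auth_signature_contract": {"asserts_signature_over_exact_bytes", "asserts_tampered_or_missing_signature_rejected"},
    "idempotency_replay": {"first_delivery_then_duplicate", "duplicate_id_rejected_regardless_of_body"},
    "concurrent_state_consistency": {"overlapping_mutations_exercised", "all_successful_responses_reflected", "distinct_identifiers_asserted"},
    "atomic_batch_state": {"mixed_valid_invalid_batch", "asserts_store_unchanged_after_failure", "asserts_success_order_and_distinct_ids"},
    "shape_contract": {"uses_visible_input_key_names", "asserts_visible_output_key_names", "asserts_no_unexpected_output_keys"},
}


def missing_markers(tag_markers):
    """Map each tag to the required markers it lacks; raise on malformed input."""
    if not isinstance(tag_markers, dict):
        raise ValueError("malformed: expected an object keyed by tag")
    problems = {}
    for tag, markers in tag_markers.items():
        if not isinstance(markers, list):
            raise ValueError(f"malformed: markers for {tag!r} must be an array")
        absent = REQUIRED_MARKERS.get(tag, set()) - set(markers)
        if absent:
            problems[tag] = sorted(absent)
    return problems
```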
|
|
126
155
|
|
|
127
156
|
State write: `phases.probe_derive.{started_at, verdict, completed_at, duration_ms, artifacts}`.
|
|
128
157
|
|
|
@@ -130,7 +159,9 @@ Invocation contract when OTHER engine is Codex:
|
|
|
130
159
|
|
|
131
160
|
- Invoke Codex only through the monitored wrapper path in `CODEX_MONITORED_PATH`,
|
|
132
161
|
or `.claude/skills/_shared/codex-monitored.sh` when the env var is absent:
|
|
133
|
-
`bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=high "<probe prompt>"`.
|
|
162
|
+
`CODEX_MONITORED_ISOLATED=1 bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=high "<probe prompt>"`.
|
|
163
|
+
Isolation keeps user config, AGENTS.md, pyx-memory, hooks, and project rules
|
|
164
|
+
from adding hidden context, tool calls, or transcript side effects.
|
|
134
165
|
- Do not run `codex`, `codex exec`, `/Users/.../codex`, or a plugin-provided
|
|
135
166
|
Codex binary directly. A raw Codex child can outlive the phase and makes the
|
|
136
167
|
benchmark run invalid even if `.devlyn/risk-probes.jsonl` is written.
|
|
@@ -166,8 +197,8 @@ Skip in verify-only mode OR when `build-gate` in `state.bypasses`. Deterministic
|
|
|
166
197
|
Spawn Claude `Agent` (`mode: "bypassPermissions"`) with prompt body `references/phases/build-gate.md`. The agent:
|
|
167
198
|
1. Detects language/framework via project files (`package.json`, `pyproject.toml`, etc.).
|
|
168
199
|
2. Runs language-specific gates (tsc / lint / test).
|
|
169
|
-
3. Always runs `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` (verification_commands literal-match plus `.devlyn/risk-probes.jsonl` when present).
|
|
170
|
-
4. If
|
|
200
|
+
3. Always runs `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` (verification_commands literal-match plus `.devlyn/risk-probes.jsonl` when present). If `state.risk_profile.risk_probes_enabled == true`, the script requires `.devlyn/risk-probes.jsonl`; a missing file is a CRITICAL mechanical blocker, not a silent solo run.
|
|
201
|
+
4. If diff touches web-surface files: run the browser tier with the repo's available toolchain (for example Playwright or curl).
|
|
171
202
|
5. Emits `.devlyn/build_gate.findings.jsonl` + `.devlyn/build_gate.log.md`.
|
|
172
203
|
|
|
173
204
|
State write: `phases.build_gate.{started_at, verdict, completed_at, duration_ms, artifacts}`.
|
|
@@ -199,18 +230,18 @@ Independent quality layer. **Spawned with empty conversation context** — no ca
|
|
|
199
230
|
|
|
200
231
|
Two sub-phases:
|
|
201
232
|
|
|
202
|
-
1. **MECHANICAL** (deterministic): re-run `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` against the post-CLEANUP code (independent of BUILD_GATE's earlier run).
|
|
233
|
+
1. **MECHANICAL** (deterministic): re-run `SPEC_VERIFY_PHASE=verify_mechanical SPEC_VERIFY_FINDINGS_FILE=verify-mechanical.findings.jsonl SPEC_VERIFY_FINDING_PREFIX=VERIFY-MECH python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` against the post-CLEANUP code (independent of BUILD_GATE's earlier run). If `state.risk_profile.risk_probes_enabled == true`, missing `.devlyn/risk-probes.jsonl` is a CRITICAL mechanical blocker. This emits `.devlyn/verify-mechanical.findings.jsonl` for `verify-merge-findings.py`.
|
|
203
234
|
|
|
204
|
-
2. **JUDGE** (fresh-context Agent): grade the diff against the spec on rubric axes (spec compliance, scope, quality, consistency). Split each Requirement into binding clauses and trace code-order counterexamples; a passing verifier proves only the case it exercises, not neighboring `once` / `regardless` / `duplicate` / auth-order / rollback invariants. Respect scope qualifiers such as `inside a warehouse`, `per resource`, `for this line`, and `after validation`; do not widen a scoped clause into a global invariant, and compose multiple ordering rules in the stated order. For stateful flows, explicitly trace failed-operation rollback and the next entity's state before hunting broader edge cases. For high-complexity specs, construct at least one interaction counterexample that combines ordering/priority with failure handling and state mutation, then execute at least one such scenario through the repo's existing CLI/API/test runner without leaving tracked files behind; one-axis examples and pure mental tracing are insufficient. Default engine = same as IMPLEMENT (solo). Pair-mode (cross-model JUDGE) is eligible only
|
|
235
|
+
2. **JUDGE** (fresh-context Agent): grade the diff against the spec on rubric axes (spec compliance, scope, quality, consistency). Split each Requirement into binding clauses and trace code-order counterexamples; a passing verifier proves only the case it exercises, not neighboring `once` / `regardless` / `duplicate` / auth-order / rollback invariants. Respect scope qualifiers such as `inside a warehouse`, `per resource`, `for this line`, and `after validation`; do not widen a scoped clause into a global invariant, and compose multiple ordering rules in the stated order. For stateful flows, explicitly trace failed-operation rollback and the next entity's state before hunting broader edge cases. For high-complexity specs, construct at least one interaction counterexample that combines ordering/priority with failure handling and state mutation, then execute at least one such scenario through the repo's existing CLI/API/test runner without leaving tracked files behind; one-axis examples and pure mental tracing are insufficient. Default engine = same as IMPLEMENT (solo). Pair-mode (cross-model JUDGE) is eligible only after MECHANICAL and the primary JUDGE have no verdict-binding findings; deterministic blockers and primary JUDGE blockers already decide the verdict and route to the fix loop. Pair-mode fires when eligible and:
|
|
205
236
|
- `--pair-verify` flag set, OR
|
|
206
237
|
- `state.mode == "verify-only"`, OR
|
|
207
238
|
- `state.risk_profile.high_risk == true`, OR
|
|
208
239
|
- `.devlyn/risk-probes.jsonl` exists or `state.risk_profile.risk_probes_enabled == true`, OR
|
|
209
|
-
- spec frontmatter has `complexity: high
|
|
210
|
-
- MECHANICAL emits findings flagged `severity: warning` (not
|
|
240
|
+
- spec frontmatter has `complexity: high` (legacy/external spec `complexity: large` is accepted for compatibility; new specs use `high`), OR current free-form `state.complexity` is `"large"` (legacy `"high"` state is accepted only for archived runs), OR
|
|
241
|
+
- MECHANICAL or the primary JUDGE emits findings flagged `severity: warning` (not verdict-binding — those route to fix loop directly), OR
|
|
211
242
|
- `state.verify.coverage_failed == true` (judge could not exercise a required spec axis from available evidence).
|
|
212
243
|
|
|
213
|
-
|
|
244
|
+
After MECHANICAL and the primary JUDGE finish, compute `pair_trigger = { eligible, reasons[], skipped_reason }`, write it into `state.phases.verify`, and then spawn the second OTHER-engine judge when eligible. If `eligible == true`, `reasons` must be non-empty, include every applicable canonical reason, and every reason must be one of these canonical values: `mode.verify-only`, `mode.pair-verify`, `complexity.high`, `complexity.large`, `spec.complexity.high`, `spec.complexity.large`, `spec.solo_headroom_hypothesis`, `risk.high`, `risk_probes.enabled`, `risk_probes.present`, `coverage.failed`, `mechanical.warning`, or `judge.warning`; `skipped_reason` must be null; and you MUST spawn the second OTHER-engine judge. If `eligible == false`, `reasons` must be empty and `skipped_reason` must be a string or null. Contradictory, incomplete, or unknown trigger state is a VERIFY contract violation, not advisory metadata; `verify-merge-findings.py` blocks malformed trigger state. Pair reasons derive `risk.high` and `risk_probes.enabled` from `state.risk_profile`; malformed `risk_profile` is also a VERIFY contract violation because it can hide a required pair decision.
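
The `pair_trigger` contract can be sketched as a structural check. This is a hedged sketch, not the behavior of `verify-merge-findings.py`: the function name is hypothetical, and it covers only the shape rules, since confirming that `reasons` includes every applicable canonical reason would need the rest of the pipeline state.

```python
CANONICAL_REASONS = {
    "mode.verify-only", "mode.pair-verify",
    "complexity.high", "complexity.large",
    "spec.complexity.high", "spec.complexity.large",
    "spec.solo_headroom_hypothesis",
    "risk.high", "risk_probes.enabled", "risk_probes.present",
    "coverage.failed", "mechanical.warning", "judge.warning",
}


def pair_trigger_errors(trigger):
    """Return a list of contract violations; an empty list means well-formed."""
    eligible = trigger.get("eligible")
    reasons = trigger.get("reasons")
    skipped = trigger.get("skipped_reason")
    if type(eligible) is not bool:
        return ["eligible must be a boolean"]
    if not isinstance(reasons, list):
        return ["reasons must be an array"]
    errors = []
    if eligible:
        if not reasons:
            errors.append("eligible trigger needs at least one reason")
        unknown = [r for r in reasons if not isinstance(r, str) or r not in CANONICAL_REASONS]
        if unknown:
            errors.append(f"non-canonical reasons: {unknown}")
        if skipped is not None:
            errors.append("skipped_reason must be null when eligible")
    else:
        if reasons:
            errors.append("reasons must be empty when not eligible")
        if skipped is not None and not isinstance(skipped, str):
            errors.append("skipped_reason must be a string or null")
    return errors
```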
|
|
214
245
|
|
|
215
246
|
The `--engine` flag never suppresses this rule. Explicit `--engine claude`
|
|
216
247
|
means "Claude is the primary judge"; it does not mean "do not run Codex as the
|
|
@@ -219,7 +250,7 @@ trigger are deterministic MECHANICAL HIGH/CRITICAL blockers or an explicit
|
|
|
219
250
|
`--no-pair`. Engine unavailability is a `BLOCKED:<engine>-unavailable` verdict,
|
|
220
251
|
not a skip reason.
|
|
221
252
|
|
|
222
|
-
Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. For cart/pricing specs where visible verification combines duplicate items, line promotions, tax, coupon, and shipping, the success-path probe must include interleaved duplicates plus taxable and non-taxable items and assert full output rows. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. 
When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a
|
|
253
|
+
Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. If the spec includes a solo-headroom hypothesis, one of those targeted probes must exercise that hypothesis with the visible command/input shape and full externally visible result, using the hypothesis's backticked observable command as its command anchor before adding bounded input variations. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. When the spec names exact keys, row shapes, JSON object shape, or an exact error body, pair-JUDGE must compare parsed key sets/deep equality so aliased keys, missing keys, and extra keys are verdict-binding failures, and it must construct inputs with the spec's visible key names. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. 
For cart/pricing specs where visible verification combines duplicate items, line promotions, tax, coupon, and shipping, the success-path probe must include interleaved duplicates plus taxable and non-taxable items and assert full output rows. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL or the primary JUDGE has a verdict-binding finding, skip the second judge and record `pair_judge: null`; the fix loop needs the blocker, not duplicate review.
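
The parsed key-set/deep-equality comparison described above can be sketched as follows. This is a minimal illustration, not the harness's own checker: the function name `shape_mismatches` and the sample keys `accepted`, `scheduled`, and `rejected` as a top-level object are assumptions made for the example.

```python
import json


def shape_mismatches(actual_text: str, expected: dict) -> list[str]:
    """Compare parsed output against an exact expected object.

    Hypothetical helper: reports missing keys, unexpected keys, and any
    remaining deep-equality difference (for example an aliased nested key).
    """
    actual = json.loads(actual_text)
    if not isinstance(actual, dict):
        return ["top-level value is not an object"]
    problems = []
    missing = expected.keys() - actual.keys()
    extra = actual.keys() - expected.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    # Top-level keys match, so any remaining difference is nested.
    if not problems and actual != expected:
        problems.append("values differ")
    return problems


expected = {"accepted": [{"id": "a"}], "rejected": []}
aliased = '{"scheduled": [{"id": "a"}], "rejected": []}'

# A substring check passes on the aliased output; the parsed check does not.
substring_ok = '"rejected"' in aliased
parsed_problems = shape_mismatches(aliased, expected)
```

A substring probe would accept `aliased` because `"rejected"` is present, which is why the text requires parsed comparison.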
|
|
223
254
|
|
|
224
255
|
If pair-mode is triggered and the OTHER engine is unavailable, do not downgrade
|
|
225
256
|
or skip the required judge. Set VERIFY to `BLOCKED:<engine>-unavailable`, preserve the
|
|
@@ -231,7 +262,7 @@ failed availability check evidence, and print setup guidance:
|
|
|
231
262
|
- Claude: install/configure Claude Code, run `claude --version` when available,
|
|
232
263
|
confirm the host can spawn Claude agents, then rerun.
|
|
233
264
|
|
|
234
|
-
Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** Codex pair-JUDGE is read-only: invoke `codex-monitored.sh`
|
|
265
|
+
Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** Codex pair-JUDGE is read-only: invoke `codex-monitored.sh` with `CODEX_MONITORED_ISOLATED=1` and `-c model_reasoning_effort=medium`, no `tail`/`head`/`grep` pipes, direct stdout/stderr capture, JSONL findings on stdout, and orchestrator-written `.devlyn/verify.pair.findings.jsonl`. Isolation blocks user config, AGENTS.md, pyx-memory, hooks, and project rules from hidden context/tool/transcript side effects. If stdout is captured as `.devlyn/codex-judge.stdout`, run `python3 .claude/skills/_shared/collect-codex-findings.py` before merge; raw stdout is diagnostic only. If stdout contains findings or a non-PASS summary while `.devlyn/verify.pair.findings.jsonl` is empty, `verify-merge-findings.py` blocks VERIFY for `verify.pair.emission-contract`. Do not ask Codex to `apply_patch` or edit `.devlyn`. After primary and pair findings are written, run `python3 .claude/skills/_shared/verify-merge-findings.py --write-state`. Branch only on the merged `state.phases.verify.verdict`; a HIGH/CRITICAL finding from either judge must mechanically become `NEEDS_WORK`. Never write `.devlyn/verify-merged.findings.jsonl` or `.devlyn/verify-merge.summary.json` by hand; `verify-merge-findings.py` is their only writer. State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
|
|
235
266
|
|
|
236
267
|
Branch:
|
|
237
268
|
- `PASS` → PHASE 6.
|
|
@@ -244,7 +275,7 @@ State write: `phases.final_report.started_at` at the top of this phase.
|
|
|
244
275
|
|
|
245
276
|
1. **Terminal verdict** — derive from `state.phases.{plan, implement, build_gate, cleanup, verify}.verdict` per the precedence rules in `references/state-schema.md#terminal-verdict`. Verify-only mode short-circuits to `state.phases.verify.verdict`.
|
|
246
277
|
|
|
247
|
-
2. **Render report** — sections: header (run_id, engine, mode, verdict, wall-time), per-phase summary, pair/risk-probe status, findings table (verify findings only — post-IMPLEMENT phases are findings-only), follow-up notes (any `--continue-on-large` assumptions, any `--no-pair` / `--no-risk-probes` opt-out,
|
|
278
|
+
2. **Render report** — sections: header (run_id, engine, mode, verdict, wall-time), per-phase summary, pair/risk-probe status, findings table (verify findings only — post-IMPLEMENT phases are findings-only), follow-up notes (any `--continue-on-large` assumptions, any `--no-pair` / `--no-risk-probes` opt-out, any engine setup guidance after BLOCKED, `/devlyn:ideate` guidance after `BLOCKED:solo-headroom-hypothesis-required` that asks for the visible behavior `solo_claude` is expected to miss, and `/devlyn:ideate` guidance after `BLOCKED:solo-ceiling-avoidance-required` that asks for the concrete difference from rejected or solo-saturated controls such as `S2`-`S6`).
|
|
248
279
|
|
|
249
280
|
3. State write: `phases.final_report.{verdict, completed_at, duration_ms}` BEFORE archive runs (archive prune logic skips runs whose `final_report.verdict` is null).
|
|
250
281
|
|
|
@@ -13,6 +13,14 @@ Compute these signals from the goal text + project state:
|
|
|
13
13
|
3. **verb_class** — primary verb of the goal: `fix | add | refactor | debug | review | rewrite | migrate | ...`.
|
|
14
14
|
4. **codebase_size** — `git ls-files | wc -l`. Coarse buckets: `<50` / `<500` / `≥500`.
|
|
15
15
|
5. **has_failing_test** — does the goal mention a specific failing test or include a stack trace?
|
|
16
|
+
6. **pair_evidence_intent** — does the goal ask for benchmark evidence, pair-evidence, risk-probe measurement, solo<pair proof, or solo-headroom work?
|
|
17
|
+
7. **has_actionable_solo_headroom** — does the goal itself include the actionable contract: literal `solo-headroom hypothesis`, `solo_claude`, `miss`, and a backticked observable command line that itself contains `miss` and is framed as the command/observable that exposes it?
|
|
18
|
+
8. **unmeasured_pair_candidate_intent** — does the goal ask to add, create,
|
|
19
|
+
promote, or run a new unmeasured benchmark, shadow fixture, golden fixture,
|
|
20
|
+
risk-probe, or pair-evidence candidate?
|
|
21
|
+
9. **has_solo_ceiling_avoidance** — does the goal itself include the literal
|
|
22
|
+
phrase `solo ceiling avoidance`, mention `solo_claude`, and name a concrete
|
|
23
|
+
difference from rejected or solo-saturated controls such as `S2`-`S6`?
|
|
16
24
|
|
|
17
25
|
### Trivial branch
|
|
18
26
|
|
|
@@ -46,10 +54,14 @@ Conditions (any one):
|
|
|
46
54
|
- `file_scope_signals > 10` OR zero signals (vague enough that the classifier cannot pick scope).
|
|
47
55
|
- `verb_class ∈ {rewrite, migrate}` and scope is multi-subsystem.
|
|
48
56
|
- The goal mentions a new feature whose surface area requires design decisions the harness cannot make from a one-shot prompt.
|
|
57
|
+
- `pair_evidence_intent == true` and `has_actionable_solo_headroom == false`.
|
|
58
|
+
- `unmeasured_pair_candidate_intent == true` and `has_solo_ceiling_avoidance == false`.
|
|
49
59
|
|
|
50
60
|
Action: log `recommend: /devlyn:ideate first` in `.devlyn/criteria.generated.md` plus the final report. Two policies:
|
|
51
61
|
- Default: halt with terminal verdict `BLOCKED:large-needs-ideation`.
|
|
52
62
|
- `--continue-on-large` flag: synthesize a best-effort spec from the goal with explicit "assumptions made" block; proceed to PHASE 1; the final report flags every assumption for user review.
|
|
63
|
+
- Exception: if the large classification came from pair-evidence intent without an actionable solo-headroom hypothesis, halt with `BLOCKED:solo-headroom-hypothesis-required` even when `--continue-on-large` is set. Do not invent a hypothesis; recommend `/devlyn:ideate` so the user can supply the visible behavior `solo_claude` is expected to miss.
|
|
64
|
+
- Exception: if the large classification came from unmeasured pair-candidate intent without solo ceiling avoidance, halt with `BLOCKED:solo-ceiling-avoidance-required` even when `--continue-on-large` is set. Do not invent the note; recommend `/devlyn:ideate` so the user can supply the concrete difference from rejected or solo-saturated controls such as `S2`-`S6`.
|
|
53
65
|
|
|
54
66
|
## Anti-pattern: drift to LLM judgment
|
|
55
67
|
|
|
@@ -63,6 +75,9 @@ The internal mini-spec written for trivial / medium / `--continue-on-large` path
|
|
|
63
75
|
|
|
64
76
|
- `## Requirements` non-empty, each bullet testable (CLI command, test command, observable file change).
|
|
65
77
|
- `## Verification` non-empty if the goal implies any runnable acceptance check. Empty Verification is allowed only when all Requirements are pure-design (e.g. "follow existing pattern X").
|
|
78
|
+
- If a free-form goal includes pair-evidence intent and already includes an actionable solo-headroom hypothesis, preserve that literal hypothesis in `.devlyn/criteria.generated.md` closely enough that VERIFY can still detect `solo-headroom hypothesis`, `solo_claude`, `miss`, and the backticked observable command line that itself contains `miss`, emit the canonical `spec.solo_headroom_hypothesis` pair trigger reason, and satisfy regenerated-evidence checks such as `benchmark audit --require-hypothesis-trigger`.
|
|
79
|
+
- If a free-form goal includes unmeasured pair-candidate intent and already includes solo ceiling avoidance, preserve that literal note in `.devlyn/criteria.generated.md` closely enough that reviewers can still see `solo ceiling avoidance`, `solo_claude`, and the concrete difference from rejected or solo-saturated controls such as `S2`-`S6`.
|
|
66
80
|
- Free-form mode mini-specs are written to `.devlyn/criteria.generated.md` (not to a roadmap path); this is a run-scoped artifact, not a documented spec.
|
|
81
|
+
- After writing `.devlyn/criteria.generated.md`, set `state.source.type = "generated"`, `state.source.spec_path = null`, `state.source.spec_sha256 = null`, `state.source.criteria_path = ".devlyn/criteria.generated.md"`, and `state.source.criteria_sha256` to the raw-byte SHA-256 of the generated criteria file. Downstream PLAN/IMPLEMENT/VERIFY phases and `spec-verify-check.py --include-risk-probes` depend on this pointer; do not rely on the file existing by convention.
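
The state pointer described in the last bullet can be sketched as below. This is a minimal illustration under stated assumptions: the helper names are hypothetical, and encoding the digest as a lowercase hex string is an assumption, since the text only says "raw-byte SHA-256".

```python
import hashlib
from pathlib import Path


def sha256_hex(raw: bytes) -> str:
    """SHA-256 over the exact bytes, hex-encoded (encoding is an assumption)."""
    return hashlib.sha256(raw).hexdigest()


def generated_source_state(criteria_path: str = ".devlyn/criteria.generated.md") -> dict:
    """Build the `state.source` fields for a generated criteria file."""
    # Read bytes, not text, so newline or encoding normalization cannot
    # change the digest.
    raw = Path(criteria_path).read_bytes()
    return {
        "type": "generated",
        "spec_path": None,
        "spec_sha256": None,
        "criteria_path": criteria_path,
        "criteria_sha256": sha256_hex(raw),
    }
```

Hashing the raw bytes rather than a decoded string keeps the digest stable across platforms with different default encodings or line endings.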
|
|
67
82
|
|
|
68
83
|
PLAN reads the mini-spec the same way it reads a real spec. The downstream pipeline cannot tell the difference.
|
|
@@ -22,8 +22,8 @@ Run in this order; each emits findings into `.devlyn/build_gate.findings.jsonl`:
|
|
|
22
22
|
1. **Type check** (TypeScript / mypy / etc.). Each error → one finding, severity `HIGH`, rule `correctness.type-check`.
|
|
23
23
|
2. **Lint** (eslint / ruff / clippy / etc.). Each error → finding, severity `MEDIUM`, rule `quality.lint`. Warnings stay LOW unless the spec elevates them.
|
|
24
24
|
3. **Test suite** (npm test / pytest / go test / cargo test). Each failing test → finding, severity `HIGH`, rule `correctness.test-failure`. Include the failing test's file:line and the assertion.
|
|
25
|
-
4. **Spec literal verification + risk probes**: `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes`. The script
|
|
26
|
-
5. **Browser** (only when
|
|
25
|
+
4. **Spec literal verification + risk probes**: `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes`. The script self-stages from sibling `spec.expected.json` next to `state.source.spec_path`, or the legacy inline carrier when the sibling is absent; benchmark-prestaged `.devlyn/spec-verify.json` still wins. It appends `.devlyn/risk-probes.jsonl` when present, and requires that file when `state.risk_profile.risk_probes_enabled == true`. Malformed `state.risk_profile` is also CRITICAL because it can hide enabled risk probes. Command or risk-probe mismatch → CRITICAL finding. Missing required risk probes, missing/malformed generated carrier, or malformed sibling expected file → `correctness.spec-verify-malformed` CRITICAL.
|
|
26
|
+
5. **Browser** (only when diff touches `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `page.*`, `layout.*`, `route.*`, `*.css`, `*.html`): start the dev server and run the repo's existing browser checks, or a minimal curl/HTML check when no browser test harness exists. Each failed check → finding, severity `HIGH`, rule `correctness.browser-flow-failed`.
|
|
27
27
|
|
|
28
28
|
Append all findings; do not stop on the first failure.
|
|
29
29
|
</gates>
|
|
@@ -25,7 +25,23 @@ Read the visible `## Verification` section. Emit 1 to 3 executable probes
|
|
|
25
25
|
that cover the highest-risk bullets whose failure would change observable
|
|
26
26
|
behavior. Prefer bullets that combine ordering/priority, rollback/state
|
|
27
27
|
mutation, idempotency, auth/error priority, stdout/stderr, or exact output
|
|
28
|
-
shape.
|
|
28
|
+
shape. Treat CLI/process errors and HTTP error responses as different contracts:
|
|
29
|
+
CLI errors must prove exit/stderr behavior, while HTTP errors must prove the
|
|
30
|
+
status code and response body. When the visible verification text names concurrent or near-concurrent
|
|
31
|
+
mutations, the probe must overlap the operations and assert the complete
|
|
32
|
+
externally-visible state, not just that every request returned a success code.
|
|
33
|
+
When a batch/import operation must be all-or-nothing, the probe must exercise a
|
|
34
|
+
mixed valid/invalid batch and prove the externally-visible state is unchanged
|
|
35
|
+
after the failure.
|
|
36
|
+
|
|
37
|
+
If the visible spec includes a solo-headroom hypothesis, the first probe must
|
|
38
|
+
target that hypothesis: use the visible command/input shape it names, exercise
|
|
39
|
+
the behavior the spec says `solo_claude` is expected to miss, and assert the
|
|
40
|
+
full observable result. The emitted probe `cmd` must contain the hypothesis's
|
|
41
|
+
backticked observable command so `.devlyn/risk-probes.jsonl` can be validated
|
|
42
|
+
mechanically, and `derived_from` must be an exact substring of that hypothesis
|
|
43
|
+
bullet. Do not replace the hypothesis with a neighboring easier edge case, and
|
|
44
|
+
do not cite hidden or benchmark-only verifier files.
|
|
29
45
|
|
|
30
46
|
For high-complexity specs with two or more behavior bullets, at least one probe
|
|
31
47
|
must be compound: one command must exercise two or more visible verification
|
|
@@ -101,7 +117,9 @@ Rules:
|
|
|
101
117
|
- `tags` is required. Use only these shape tags:
|
|
102
118
|
`ordering_inversion`, `boundary_overlap`, `prior_consumption`,
|
|
103
119
|
`rollback_state`, `positive_remaining`, `stdout_stderr_contract`,
|
|
104
|
-
`error_contract`, `
|
|
120
|
+
`error_contract`, `http_error_contract`, `auth_signature_contract`,
|
|
121
|
+
`idempotency_replay`, `concurrent_state_consistency`,
|
|
122
|
+
`atomic_batch_state`, `shape_contract`.
|
|
105
123
|
- `tag_evidence` is required and must be a JSON object keyed by tag, never a
|
|
106
124
|
top-level array. For these tags, include every listed evidence marker in the
|
|
107
125
|
tag's array and make the command actually exercise it:
|
|
@@ -119,6 +137,26 @@ Rules:
|
|
|
119
137
|
`later_entity_uses_released_state`.
|
|
120
138
|
- `positive_remaining`: `asserts_full_remaining_state`,
|
|
121
139
|
`zero_quantity_rows_absent`.
|
|
140
|
+
- `stdout_stderr_contract`: `asserts_named_stream_output`.
|
|
141
|
+
- `error_contract`: `asserts_error_payload_or_stderr`,
|
|
142
|
+
`asserts_nonzero_or_exit_2`.
|
|
143
|
+
- `http_error_contract`: `asserts_http_error_status`,
|
|
144
|
+
`asserts_error_payload_body`.
|
|
145
|
+
- `auth_signature_contract`: `asserts_signature_over_exact_bytes`,
|
|
146
|
+
`asserts_tampered_or_missing_signature_rejected`.
|
|
147
|
+
- `idempotency_replay`: `first_delivery_then_duplicate`,
|
|
148
|
+
`duplicate_id_rejected_regardless_of_body`.
|
|
149
|
+
- `concurrent_state_consistency`: `overlapping_mutations_exercised`,
|
|
150
|
+
`all_successful_responses_reflected`, `distinct_identifiers_asserted`.
|
|
151
|
+
- `atomic_batch_state`: `mixed_valid_invalid_batch`,
|
|
152
|
+
`asserts_store_unchanged_after_failure`,
|
|
153
|
+
`asserts_success_order_and_distinct_ids`.
|
|
154
|
+
- `shape_contract` when the visible text names exact keys, fields, row
|
|
155
|
+
shapes, JSON objects, response bodies, stdout/stderr objects, or exact error
|
|
156
|
+
bodies: `uses_visible_input_key_names`,
|
|
157
|
+
`asserts_visible_output_key_names`, `asserts_no_unexpected_output_keys`.
|
|
158
|
+
If it names an exact JSON error object/body, also include
|
|
159
|
+
`asserts_exact_error_object`.
|
|
122
160
|
Tags not listed here may use an empty evidence list or be omitted from
|
|
123
161
|
`tag_evidence`.
|
|
124
162
|
- `cmd` must not reference `BENCH_FIXTURE_DIR`, `verifiers/`, benchmark fixture
|
|
@@ -128,6 +166,10 @@ Rules:
|
|
|
128
166
|
- Match the spec's visible input and output key names literally; do not invent
|
|
129
167
|
aliases such as `stock` for `lots`, `order_id` for `id`, or `warehouse_id`
|
|
130
168
|
for `warehouse`.
|
|
169
|
+
- When a verification bullet names exact keys, fields, row shapes, JSON object
|
|
170
|
+
shape, or an exact error body, the probe must use `shape_contract` and assert
|
|
171
|
+
exact key sets with parsed JSON/deep equality. A substring check is too weak:
|
|
172
|
+
the command must fail on aliased keys, missing keys, and extra keys.
|
|
131
173
|
- For cart/pricing specs whose visible verification covers duplicate combining,
|
|
132
174
|
multiple line-promotion types, tax, coupon, and shipping, the compound success
|
|
133
175
|
probe must include interleaved duplicate SKUs plus taxable and non-taxable
|
|
@@ -148,6 +190,10 @@ Rules:
|
|
|
148
190
|
full test suite.
|
|
149
191
|
- Coverage over cleverness: mirror the verification bullet literally before
|
|
150
192
|
inventing an edge case.
|
|
193
|
+
- If the spec includes a solo-headroom hypothesis and the emitted probes do not
|
|
194
|
+
exercise the stated `solo_claude` miss with a `cmd` containing the hypothesis's
|
|
195
|
+
backticked observable command and `derived_from` pointing at the hypothesis
|
|
196
|
+
bullet, the artifact is too weak for pair-evidence work.
|
|
151
197
|
- If a probe passes while an implementation processes entities in input order
|
|
152
198
|
instead of the required priority/order, or emits extra zero-value state rows,
|
|
153
199
|
the probe is too weak.
|
|
@@ -175,6 +221,32 @@ Rules:
|
|
|
175
221
|
- If `remaining` state appears in the visible contract, at least one probe must
|
|
176
222
|
carry `positive_remaining` and assert that zero-quantity/zero-value rows are
|
|
177
223
|
absent unless the visible spec explicitly requires them.
|
|
224
|
+
- If webhook signatures, raw-body signatures, HMAC, or `X-Signature` appear in
|
|
225
|
+
the visible contract, at least one probe must carry
|
|
226
|
+
`auth_signature_contract` and prove exact-byte signature verification plus a
|
|
227
|
+
tampered or missing signature rejection.
|
|
228
|
+
- If replay, duplicate delivery, same id, already-seen ids, or idempotency
|
|
229
|
+
appear in the visible contract, at least one probe must carry
|
|
230
|
+
`idempotency_replay` and cover first delivery followed by duplicate rejection,
|
|
231
|
+
including the case where the duplicate body would otherwise fail validation
|
|
232
|
+
when the spec says duplicate wins.
|
|
233
|
+
- If an HTTP status error such as `400`, `401`, `409`, or `422` appears with a
|
|
234
|
+
JSON error body, error object, or named error field, at least one probe must
|
|
235
|
+
carry `http_error_contract` and assert both the exact status code and parsed
|
|
236
|
+
response body. Do not use CLI `error_contract` for these HTTP-only checks
|
|
237
|
+
unless the visible text also names process exit or stderr behavior.
|
|
238
|
+
- If concurrent, close-together, simultaneous, parallel, race, lost update, or
|
|
239
|
+
many-at-once mutation semantics appear in the visible contract, at least one
|
|
240
|
+
probe must carry `concurrent_state_consistency`. It must trigger overlapping
|
|
241
|
+
mutations, then compare every successful response against the final
|
|
242
|
+
externally-visible state and assert the identifiers are distinct. Do not use
|
|
243
|
+
this tag for ordinary batch success cases that only require distinct ids.
|
|
244
|
+
- If a batch/import contract says one valid plus one invalid item fails while
|
|
245
|
+
the later state remains the same as before, or otherwise says all-or-nothing /
|
|
246
|
+
no partial updates / 0 inserts on failure, at least one probe must carry
|
|
247
|
+
`atomic_batch_state`. It must execute a mixed valid/invalid batch, assert the
|
|
248
|
+
store/list is unchanged after failure, and include an all-valid success case
|
|
249
|
+
proving order and distinct ids.
|
|
178
250
|
</quality_bar>
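
The `tags` / `tag_evidence` rules in this hunk lend themselves to a mechanical check. The sketch below is an illustration only: the function name and the `REQUIRED_EVIDENCE` table are hypothetical, the table covers just three of the tags, and the marker strings are copied from the rules above.

```python
REQUIRED_EVIDENCE = {
    "http_error_contract": {
        "asserts_http_error_status",
        "asserts_error_payload_body",
    },
    "idempotency_replay": {
        "first_delivery_then_duplicate",
        "duplicate_id_rejected_regardless_of_body",
    },
    "atomic_batch_state": {
        "mixed_valid_invalid_batch",
        "asserts_store_unchanged_after_failure",
        "asserts_success_order_and_distinct_ids",
    },
}


def tag_evidence_problems(probe: dict) -> list[str]:
    """List rule violations for one probe record (hypothetical validator)."""
    tags = probe.get("tags")
    evidence = probe.get("tag_evidence")
    if not tags:
        return ["tags is required"]
    # tag_evidence must be an object keyed by tag, never a top-level array.
    if not isinstance(evidence, dict):
        return ["tag_evidence must be a JSON object keyed by tag"]
    problems = []
    for tag in tags:
        required = REQUIRED_EVIDENCE.get(tag, set())
        missing = required - set(evidence.get(tag, []))
        if missing:
            problems.append(f"{tag}: missing {sorted(missing)}")
    return problems
```

Tags without an entry in the table pass with an empty or omitted evidence list, mirroring the rule for tags not listed.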
|
|
179
251
|
|
|
180
252
|
<runtime_principles>
|