devlyn-cli 2.3.0 → 2.3.1
- package/AGENTS.md +1 -1
- package/CLAUDE.md +2 -2
- package/README.md +80 -29
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
- package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
- package/benchmark/auto-resolve/README.md +307 -44
- package/benchmark/auto-resolve/RUBRIC.md +23 -14
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
- package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
- package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
- package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
- package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
- package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
- package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
- package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
- package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
- package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
- package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
- package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
- package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
- package/benchmark/auto-resolve/scripts/judge.sh +153 -26
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
- package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
- package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
- package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
- package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
- package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
- package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
- package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
- package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
- package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
- package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
- package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
- package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
- package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
- package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
- package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
- package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
- package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
- package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
- package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
- package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
- package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
- package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
- package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
- package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
- package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
- package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
- package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
- package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
- package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
- package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
- package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
- package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
- package/bin/devlyn.js +210 -17
- package/config/skills/_shared/adapters/README.md +3 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
- package/config/skills/_shared/adapters/opus-4-7.md +9 -1
- package/config/skills/_shared/archive_run.py +78 -6
- package/config/skills/_shared/codex-config.md +3 -2
- package/config/skills/_shared/codex-monitored.sh +46 -1
- package/config/skills/_shared/collect-codex-findings.py +20 -5
- package/config/skills/_shared/engine-preflight.md +1 -1
- package/config/skills/_shared/runtime-principles.md +5 -8
- package/config/skills/_shared/spec-verify-check.py +2664 -107
- package/config/skills/_shared/verify-merge-findings.py +1369 -19
- package/config/skills/devlyn:ideate/SKILL.md +7 -4
- package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
- package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
- package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
- package/config/skills/devlyn:resolve/SKILL.md +49 -18
- package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
- package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
- package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
- package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
- package/package.json +47 -2
- package/scripts/lint-fixtures.sh +349 -0
- package/scripts/lint-shadow-fixtures.sh +58 -0
- package/scripts/lint-skills.sh +3642 -92
- /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
@@ -0,0 +1,35 @@
+# F30 - Notes
+
+## Failure Mode
+
+This fixture detects payment-style implementations that pass simple happy-path
+tests while double-applying duplicate operations, letting rejected operations
+consume available credit, failing to release holds on capture/release, or
+mutating input files while computing settlement state.
+
+## Pipeline Phases
+
+It stresses IMPLEMENT and VERIFY. The visible spec names the state transition
+rules, idempotency output shape, and exact account summary; hidden verifiers
+combine those rules so a one-axis implementation is not enough.
+
+## Why Existing Fixtures Do Not Cover This
+
+F16 covers quote math, F23 covers warehouse allocation rollback, F25 covers cart
+promotions, and F28 return authorization was rejected after corrected scoring
+showed solo saturation. None focus on duplicate operation idempotency plus
+credit-hold mutation, capture/release transitions, and validation immutability.
+
+## Retirement Criteria
+
+Retire or rotate this fixture if both `solo_claude` and the selected pair arm
+score near the ceiling for two shipped versions, or if another fixture covers
+idempotent financial hold mutation with clearer pair headroom.
+
+## Headroom Status
+
+Retired after headroom run `20260511-f30-headroom-v1`: bare 33 /
+solo_claude 98, headroom FAIL because `solo_claude score 98 > 80`.
+
+Do not count F30 as pair-lift evidence. Rework the visible contract or hidden
+verifiers before spending pair arms on this idea again.
@@ -0,0 +1,12 @@
+# F30 retired
+
+Retired from the active golden suite after headroom run
+`20260511-f30-headroom-v1`.
+
+Reason: `solo_claude` scored 98, exceeding the headroom ceiling of 80, while
+bare scored 33. The fixture is useful as a record of an idempotent hold
+settlement candidate that proved too explicit for solo, but it is not
+pair-lift evidence.
+
+Future use: rework the visible contract or hidden verifiers so the task creates
+a fair pair-risk-probe gap without hiding requirements from the spec.
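The retirement reason in the note above reduces to a threshold comparison. A minimal sketch of that check follows, assuming a hypothetical `headroomCheck` helper and treating the ceiling of 80 as the only criterion; the package's own headroom gate script is not shown in these hunks and may apply further rules.

```javascript
'use strict';

// Hypothetical helper for illustration; not the package's headroom gate.
// Assumption: a fixture keeps pair headroom only while the solo arm's score
// stays at or below the ceiling (80 in the retirement note above).
function headroomCheck(scores, ceiling = 80) {
  const { bare, solo_claude: solo } = scores;
  if (solo > ceiling) {
    return { pass: false, reason: `solo_claude score ${solo} > ${ceiling}`, bare };
  }
  return { pass: true, reason: null, bare };
}

// Scores reported for headroom run 20260511-f30-headroom-v1.
const f30 = headroomCheck({ bare: 33, solo_claude: 98 });
console.log(JSON.stringify(f30));
// prints {"pass":false,"reason":"solo_claude score 98 > 80","bare":33}
```

Under this reading, the bare score of 33 does not influence the outcome: the solo score alone exceeds the ceiling, which is why the fixture is not counted as pair-lift evidence.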
@@ -0,0 +1,58 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node --test tests/cli.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": ["not ok "]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/mixed-idempotent-settlement.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "A failed authorization does not reserve credit and does not block a later valid authorization.",
+        "Capture removes the active hold and increases `balance_cents`.",
+        "Release removes the active hold without changing `balance_cents`.",
+        "Duplicate operation ids do not mutate state and report the original status."
+      ]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/rejection-boundaries.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Reusing an active `hold_id` is rejected as `\"duplicate_hold\"`.",
+        "Capture or release with a wrong amount is rejected as `\"amount_mismatch\"` and does not mutate state.",
+        "Invalid unknown-account input exits `2`, prints one JSON error object to stderr, and prints no stdout.",
+        "The input file contents are unchanged after settlement."
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|''|\\{\\})",
+      "description": "silent catch returning fallback in credit hold settlement path",
+      "files": ["bin/cli.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch block",
+      "files": ["bin/cli.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "/\\*\\s*eslint-disable",
+      "description": "eslint-disable without scoped justification",
+      "files": ["bin/cli.js"],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": ["bin/cli.js", "tests/cli.test.js"],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": ["bin/cli.js", "tests/cli.test.js"]
+}
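To make the three `forbidden_patterns` entries above concrete, the sketch below applies them to small code samples. The regular expressions are the JSON strings with their escaping removed; the `scanForbidden` helper, the sample snippets, and the choice to compile each pattern as a flagless JavaScript `RegExp` are assumptions for illustration, since the harness that consumes this file is not part of the hunk.

```javascript
'use strict';

// Patterns copied from the `forbidden_patterns` entries, after JSON unescaping.
// Assumption: each is applied as a plain JavaScript RegExp with no flags.
const forbidden = [
  {
    name: 'silent catch fallback',
    re: /catch\s*\([^)]*\)\s*\{[^}]*return\s+(null|undefined|''|\{\})/,
  },
  { name: 'empty catch block', re: /catch\s*\([^)]*\)\s*\{\s*\}/ },
  { name: 'eslint-disable', re: /\/\*\s*eslint-disable/ },
];

// Hypothetical scanner: returns the name of every pattern found in `source`.
function scanForbidden(source) {
  return forbidden.filter(({ re }) => re.test(source)).map(({ name }) => name);
}

// Illustrative snippets, not taken from the package.
const samples = {
  swallow: 'try { load(); } catch (err) { return null; }',
  empty: 'try { load(); } catch (err) { }',
  rethrow: "try { load(); } catch (err) { throw new Error('read failed'); }",
};

for (const [label, code] of Object.entries(samples)) {
  console.log(label, JSON.stringify(scanForbidden(code)));
}
```

The `swallow` sample trips only the first pattern, `empty` trips only the second, and `rethrow` trips none, which matches the spec constraint that read failures must surface rather than be caught silently.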
@@ -0,0 +1,10 @@
+{
+  "id": "F30-cli-credit-hold-settlement",
+  "category": "high-risk",
+  "difficulty": "high",
+  "timeout_seconds": 1500,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add a bench-cli settle-holds command that reads account credit holds and operations, applies authorization/capture/release with idempotency and rollback, and emits exact cents-based settlement state without mutating input."
+}
@@ -0,0 +1,73 @@
+---
+id: "F30-cli-credit-hold-settlement"
+title: "Credit hold settlement"
+status: planned
+complexity: high
+depends-on: []
+---
+
+# F30 Credit hold settlement
+
+## Context
+
+`bench-cli` currently has greeting and version commands only. The task:
+add a `settle-holds` command that reads account credit holds and operations,
+applies authorization/capture/release with idempotency and rollback, and emits
+exact cents-based settlement state without mutating input.
+
+Credit holds feed payment and ledger workflows, so duplicate operations,
+failed operations, and available-credit calculations must be deterministic.
+
+## Requirements
+
+- [ ] `bench-cli settle-holds --input <path>` reads JSON shaped as `{ "accounts": Array<Account>, "operations": Array<Operation> }`.
+- [ ] Each account has `{ "id": string, "balance_cents": number, "credit_limit_cents": number }`.
+- [ ] Each operation has `{ "id": string, "account_id": string, "type": "authorize" | "capture" | "release", "hold_id": string, "amount_cents": number }`.
+- [ ] Validate before settlement: ids and hold ids must be non-empty strings, account ids must be unique, balances and credit limits must be non-negative integers, amount cents must be positive integers, operation types must be one of the allowed strings, and every operation account must exist.
+- [ ] Invalid input exits `2`, writes exactly one JSON error object to stderr, and writes nothing to stdout.
+- [ ] Business rejections do not exit non-zero and do not mutate settlement state.
+- [ ] Process operations in input order.
+- [ ] An `authorize` operation creates one active hold when `credit_limit_cents - balance_cents - active_hold_cents >= amount_cents`; otherwise it is rejected with reason `"insufficient_credit"`.
+- [ ] An `authorize` operation for a `hold_id` that is already active is rejected with reason `"duplicate_hold"`.
+- [ ] A `capture` operation requires an active hold for the same account and exactly the requested amount available on that hold; otherwise it is rejected with reason `"unknown_hold"` or `"amount_mismatch"`.
+- [ ] An approved `capture` increases `balance_cents` by `amount_cents` and removes the active hold.
+- [ ] A `release` operation requires an active hold for the same account and exactly the requested amount available on that hold; otherwise it is rejected with reason `"unknown_hold"` or `"amount_mismatch"`.
+- [ ] An approved `release` removes the active hold without changing `balance_cents`.
+- [ ] Duplicate operation ids are idempotent: the first occurrence is processed normally; each later operation with the same `id` must not mutate state and emits `{ "id": string, "status": "duplicate", "original_status": "approved" | "rejected" }`.
+- [ ] Approved rows have keys `id`, `status`, `type`, `account_id`, `hold_id`, `amount_cents`.
+- [ ] Rejected rows have keys `id`, `status`, `reason`.
+- [ ] Duplicate rows have keys `id`, `status`, `original_status`.
+- [ ] On success, write exactly one JSON object to stdout and no stderr. Keys: `results`, `accounts`.
+- [ ] `results` is ordered by input operation order.
+- [ ] `accounts` is sorted by account id. Each account row has keys `id`, `balance_cents`, `active_hold_cents`, `available_cents`.
+- [ ] The command must not modify the input file.
+- [ ] `tests/cli.test.js` is updated. Existing tests still pass AND at least two new tests cover one mixed approved/rejected/duplicate settlement and one validation failure.
+
+## Constraints
+
+- **No new npm dependencies.**
+- **No floating-money output.** All public amounts are integer cents.
+- **No silent catches.** Invalid input and file-read failures must surface as JSON errors with exit `2`.
+- **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
+- **Touch only `bin/cli.js` and `tests/cli.test.js`.**
+
+## Out of Scope
+
+- Payment processor calls.
+- Persistence beyond stdout.
+- Partial captures or partial releases.
+- Currencies, interest, fees, or statement generation.
+- Touching `server/`, `web/`, or `tests/server.test.js`.
+
+## Verification
+
+- `node --test tests/cli.test.js` exits 0.
+- A failed authorization does not reserve credit and does not block a later valid authorization.
+- Capture removes the active hold and increases `balance_cents`.
+- Release removes the active hold without changing `balance_cents`.
+- Duplicate operation ids do not mutate state and report the original status.
+- Reusing an active `hold_id` is rejected as `"duplicate_hold"`.
+- Capture or release with a wrong amount is rejected as `"amount_mismatch"` and does not mutate state.
+- Invalid unknown-account input exits `2`, prints one JSON error object to stderr, and prints no stdout.
+- The input file contents are unchanged after settlement.
+- `git diff --stat` shows only `bin/cli.js` and `tests/cli.test.js` touched.
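The settlement rules in the Requirements list above can be sketched as a single pass over the operations. This is an illustration under stated assumptions, not the fixture's reference solution: `settleHolds` is a hypothetical name, the input is assumed to be already validated, active holds are tracked per account, and `duplicate_hold` is checked before `insufficient_credit` (the spec does not fix that order).

```javascript
'use strict';

// Sketch only: assumes `input` already passed validation.
function settleHolds(input) {
  // Copy account state into fresh objects so the caller's input is never mutated.
  const accounts = new Map(input.accounts.map((a) => [a.id, {
    balance: a.balance_cents,
    limit: a.credit_limit_cents,
    holds: new Map(), // hold_id -> amount_cents (assumption: scoped per account)
  }]));
  const heldCents = (acct) => [...acct.holds.values()].reduce((sum, c) => sum + c, 0);
  const seen = new Map(); // operation id -> 'approved' | 'rejected'
  const results = [];

  for (const op of input.operations) {
    // Later occurrences of an operation id never touch state.
    if (seen.has(op.id)) {
      results.push({ id: op.id, status: 'duplicate', original_status: seen.get(op.id) });
      continue;
    }
    const acct = accounts.get(op.account_id);
    let reason = null;

    if (op.type === 'authorize') {
      if (acct.holds.has(op.hold_id)) reason = 'duplicate_hold';
      else if (acct.limit - acct.balance - heldCents(acct) < op.amount_cents) reason = 'insufficient_credit';
      else acct.holds.set(op.hold_id, op.amount_cents);
    } else {
      // capture and release share the same preconditions.
      if (!acct.holds.has(op.hold_id)) reason = 'unknown_hold';
      else if (acct.holds.get(op.hold_id) !== op.amount_cents) reason = 'amount_mismatch';
      else {
        acct.holds.delete(op.hold_id);
        if (op.type === 'capture') acct.balance += op.amount_cents;
      }
    }

    seen.set(op.id, reason ? 'rejected' : 'approved');
    results.push(reason
      ? { id: op.id, status: 'rejected', reason }
      : {
        id: op.id,
        status: 'approved',
        type: op.type,
        account_id: op.account_id,
        hold_id: op.hold_id,
        amount_cents: op.amount_cents,
      });
  }

  // Account rows sorted by id, with available credit derived from the same formula.
  const summary = [...accounts.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([id, acct]) => ({
      id,
      balance_cents: acct.balance,
      active_hold_cents: heldCents(acct),
      available_cents: acct.limit - acct.balance - heldCents(acct),
    }));
  return { results, accounts: summary };
}

const demo = settleHolds({
  accounts: [{ id: 'acct-a', balance_cents: 2000, credit_limit_cents: 10000 }],
  operations: [
    { id: 'op-1', account_id: 'acct-a', type: 'authorize', hold_id: 'h-1', amount_cents: 5000 },
    { id: 'op-1', account_id: 'acct-a', type: 'authorize', hold_id: 'h-1', amount_cents: 5000 },
  ],
});
console.log(JSON.stringify(demo, null, 2));
```

In the demo, the second `op-1` is reported as a duplicate of an approved operation rather than as `duplicate_hold`, because the operation-id check runs before any hold logic. Recording every first occurrence in `seen`, including rejected ones, is what lets later duplicates report `original_status: "rejected"`.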
@@ -0,0 +1,17 @@
+Add a bench-cli settle-holds command that reads account credit holds and
+operations, applies authorization/capture/release with idempotency and rollback,
+and emits exact cents-based settlement state without mutating input.
+
+The command should read `--input <path>` JSON with accounts and operations.
+Validate the input before settlement. Use integer cents only, write exactly one
+JSON object to stdout on success, and write exactly one JSON error object to
+stderr with exit code 2 for invalid input. Do not add npm dependencies.
+
+Business rejections should stay on exit code 0. Process operations in input
+order. Authorizations create active holds only when available credit is enough.
+Captures and releases require an active hold for the same account and exact
+amount. Failed operations must not mutate state. Duplicate operation ids are
+idempotent: only the first occurrence mutates state, and later duplicates report
+the original approved/rejected status.
+
+Update the CLI tests. Touch only `bin/cli.js` and `tests/cli.test.js`.
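The "validate the input before settlement" step in the task text above could look like the following sketch. All names here (`validate`, `failInvalid`, the `error` key, and the message wording) are hypothetical; the task only fixes the exit code of 2 and the requirement that exactly one JSON error object goes to stderr with nothing on stdout.

```javascript
'use strict';

// Hypothetical validator: returns an error message, or null when input is valid.
const TYPES = new Set(['authorize', 'capture', 'release']);
const isId = (v) => typeof v === 'string' && v.length > 0;
const isNonNegInt = (v) => Number.isInteger(v) && v >= 0;
const isPosInt = (v) => Number.isInteger(v) && v > 0;

function validate(input) {
  if (!input || !Array.isArray(input.accounts) || !Array.isArray(input.operations)) {
    return 'input must have accounts and operations arrays';
  }
  const ids = new Set();
  for (const a of input.accounts) {
    if (!a || !isId(a.id)) return 'account id must be a non-empty string';
    if (ids.has(a.id)) return `duplicate account id: ${a.id}`;
    if (!isNonNegInt(a.balance_cents) || !isNonNegInt(a.credit_limit_cents)) {
      return `account ${a.id} has an invalid balance or credit limit`;
    }
    ids.add(a.id);
  }
  for (const op of input.operations) {
    if (!op || !isId(op.id) || !isId(op.hold_id)) {
      return 'operation id and hold_id must be non-empty strings';
    }
    if (!TYPES.has(op.type)) return `operation ${op.id} has an unknown type`;
    if (!isPosInt(op.amount_cents)) return `operation ${op.id} has an invalid amount`;
    if (!ids.has(op.account_id)) return `operation ${op.id} references an unknown account`;
  }
  return null;
}

// Reports a failure in the shape the task describes: one JSON object on
// stderr, nothing on stdout, exit code 2. The `error` key is an assumption.
function failInvalid(message) {
  process.stderr.write(`${JSON.stringify({ error: message })}\n`);
  process.exitCode = 2;
}

// Example: an operation that points at an account that does not exist.
const problem = validate({
  accounts: [{ id: 'acct-a', balance_cents: 0, credit_limit_cents: 1000 }],
  operations: [
    { id: 'op-1', account_id: 'acct-x', type: 'authorize', hold_id: 'h-1', amount_cents: 100 },
  ],
});
console.log(problem);
```

A CLI entry point would call `failInvalid(problem)` and return before any settlement work whenever `validate` returns a message, which keeps validation failures separate from business rejections that stay on exit code 0.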
@@ -0,0 +1,53 @@
+'use strict';
+
+const assert = require('node:assert');
+const { execFileSync } = require('node:child_process');
+const fs = require('node:fs');
+const os = require('node:os');
+const path = require('node:path');
+
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+const tmp = fs.mkdtempSync(path.join(os.tmpdir(), 'f30-mixed-'));
+const input = path.join(tmp, 'holds.json');
+
+fs.writeFileSync(input, JSON.stringify({
+  accounts: [
+    { id: 'acct-a', balance_cents: 2000, credit_limit_cents: 10000 },
+    { id: 'acct-b', balance_cents: 0, credit_limit_cents: 5000 }
+  ],
+  operations: [
+    { id: 'op-auth-1', account_id: 'acct-a', type: 'authorize', hold_id: 'h-1', amount_cents: 5000 },
+    { id: 'op-too-large', account_id: 'acct-a', type: 'authorize', hold_id: 'h-2', amount_cents: 4000 },
+    { id: 'op-release', account_id: 'acct-a', type: 'release', hold_id: 'h-1', amount_cents: 5000 },
+    { id: 'op-auth-3', account_id: 'acct-a', type: 'authorize', hold_id: 'h-3', amount_cents: 3000 },
+    { id: 'op-capture', account_id: 'acct-a', type: 'capture', hold_id: 'h-3', amount_cents: 3000 },
+    { id: 'op-capture', account_id: 'acct-a', type: 'capture', hold_id: 'h-3', amount_cents: 3000 },
+    { id: 'op-auth-b', account_id: 'acct-b', type: 'authorize', hold_id: 'h-b', amount_cents: 5000 }
+  ]
+}), 'utf8');
+
+const stdout = execFileSync('node', [cli, 'settle-holds', '--input', input], {
+  cwd: work,
+  encoding: 'utf8',
+  stdio: ['ignore', 'pipe', 'pipe']
+});
+const parsed = JSON.parse(stdout);
+
+assert.deepStrictEqual(parsed, {
+  results: [
+    { id: 'op-auth-1', status: 'approved', type: 'authorize', account_id: 'acct-a', hold_id: 'h-1', amount_cents: 5000 },
+    { id: 'op-too-large', status: 'rejected', reason: 'insufficient_credit' },
+    { id: 'op-release', status: 'approved', type: 'release', account_id: 'acct-a', hold_id: 'h-1', amount_cents: 5000 },
+    { id: 'op-auth-3', status: 'approved', type: 'authorize', account_id: 'acct-a', hold_id: 'h-3', amount_cents: 3000 },
+    { id: 'op-capture', status: 'approved', type: 'capture', account_id: 'acct-a', hold_id: 'h-3', amount_cents: 3000 },
+    { id: 'op-capture', status: 'duplicate', original_status: 'approved' },
+    { id: 'op-auth-b', status: 'approved', type: 'authorize', account_id: 'acct-b', hold_id: 'h-b', amount_cents: 5000 }
+  ],
+  accounts: [
+    { id: 'acct-a', balance_cents: 5000, active_hold_cents: 0, available_cents: 5000 },
+    { id: 'acct-b', balance_cents: 0, active_hold_cents: 5000, available_cents: 0 }
+  ]
+});
+
+console.log(JSON.stringify({ ok: true }));
@@ -0,0 +1,74 @@
+'use strict';
+
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const fs = require('node:fs');
+const os = require('node:os');
+const path = require('node:path');
+
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+
+function runPayload(label, payload) {
+  const tmp = fs.mkdtempSync(path.join(os.tmpdir(), `f30-${label}-`));
+  const input = path.join(tmp, 'holds.json');
+  const original = JSON.stringify(payload, null, 2);
+  fs.writeFileSync(input, original, 'utf8');
+  const result = spawnSync('node', [cli, 'settle-holds', '--input', input], {
+    cwd: work,
+    encoding: 'utf8'
+  });
+  assert.strictEqual(fs.readFileSync(input, 'utf8'), original, `${label}: input mutated`);
+  return result;
+}
+
+const boundary = runPayload('boundary', {
+  accounts: [
+    { id: 'acct-a', balance_cents: 1000, credit_limit_cents: 7000 }
+  ],
+  operations: [
+    { id: 'op-auth', account_id: 'acct-a', type: 'authorize', hold_id: 'h-1', amount_cents: 3000 },
+    { id: 'op-dupe-hold', account_id: 'acct-a', type: 'authorize', hold_id: 'h-1', amount_cents: 1000 },
+    { id: 'op-bad-capture', account_id: 'acct-a', type: 'capture', hold_id: 'h-1', amount_cents: 2000 },
+    { id: 'op-release', account_id: 'acct-a', type: 'release', hold_id: 'h-1', amount_cents: 3000 },
+    { id: 'op-bad-release', account_id: 'acct-a', type: 'release', hold_id: 'h-1', amount_cents: 3000 },
+    { id: 'op-after', account_id: 'acct-a', type: 'authorize', hold_id: 'h-2', amount_cents: 6000 },
+    { id: 'op-dupe-reject', account_id: 'acct-a', type: 'authorize', hold_id: 'h-3', amount_cents: 1 },
+    { id: 'op-dupe-reject', account_id: 'acct-a', type: 'authorize', hold_id: 'h-4', amount_cents: 1 }
+  ]
+});
+
+assert.strictEqual(boundary.status, 0);
+assert.strictEqual(boundary.stderr, '');
+assert.deepStrictEqual(JSON.parse(boundary.stdout), {
+  results: [
+    { id: 'op-auth', status: 'approved', type: 'authorize', account_id: 'acct-a', hold_id: 'h-1', amount_cents: 3000 },
+    { id: 'op-dupe-hold', status: 'rejected', reason: 'duplicate_hold' },
+    { id: 'op-bad-capture', status: 'rejected', reason: 'amount_mismatch' },
+    { id: 'op-release', status: 'approved', type: 'release', account_id: 'acct-a', hold_id: 'h-1', amount_cents: 3000 },
+    { id: 'op-bad-release', status: 'rejected', reason: 'unknown_hold' },
+    { id: 'op-after', status: 'approved', type: 'authorize', account_id: 'acct-a', hold_id: 'h-2', amount_cents: 6000 },
+    { id: 'op-dupe-reject', status: 'rejected', reason: 'insufficient_credit' },
+    { id: 'op-dupe-reject', status: 'duplicate', original_status: 'rejected' }
+  ],
+  accounts: [
+    { id: 'acct-a', balance_cents: 1000, active_hold_cents: 6000, available_cents: 0 }
+  ]
+});
+
+const invalid = runPayload('invalid', {
+  accounts: [
+    { id: 'acct-a', balance_cents: 0, credit_limit_cents: 1000 }
+  ],
+  operations: [
+    { id: 'op-missing-account', account_id: 'acct-missing', type: 'authorize', hold_id: 'h-1', amount_cents: 100 }
+  ]
+});
+
+assert.strictEqual(invalid.status, 2);
+assert.strictEqual(invalid.stdout, '');
+const err = JSON.parse(invalid.stderr);
+assert.strictEqual(typeof err.error, 'string');
+assert.notStrictEqual(err.error.length, 0);
+
+console.log(JSON.stringify({ ok: true }));
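The invalid-input path exercised above (exit code 2, empty stdout, one JSON object with a non-empty string `error` on stderr) could be produced by a validation step that runs before any settlement. The following is a rough sketch under stated assumptions: `validateHoldsInput` and `toOutcome` are hypothetical names, the specific rules (non-empty string ids, positive integer amounts, the three known operation types) are guesses beyond what the spec states, and only the unknown-account case is confirmed by the fixture.

```javascript
'use strict';

// Hypothetical validator: returns an error message, or null when the payload looks acceptable.
function validateHoldsInput(payload) {
  if (payload === null || typeof payload !== 'object') return 'payload must be an object';
  if (!Array.isArray(payload.accounts) || !Array.isArray(payload.operations)) {
    return 'accounts and operations must be arrays';
  }
  const accountIds = new Set();
  for (const a of payload.accounts) {
    if (!a || typeof a.id !== 'string' || a.id === '') return 'account id must be a non-empty string';
    // "Integer cents only": reject floats, strings, NaN and unsafe magnitudes.
    if (!Number.isSafeInteger(a.balance_cents) || !Number.isSafeInteger(a.credit_limit_cents)) {
      return `account ${a.id}: cents fields must be integers`;
    }
    accountIds.add(a.id);
  }
  const knownTypes = new Set(['authorize', 'capture', 'release']);
  for (const op of payload.operations) {
    if (!op || typeof op.id !== 'string' || op.id === '') return 'operation id must be a non-empty string';
    if (!knownTypes.has(op.type)) return `operation ${op.id}: unknown type`;
    if (!accountIds.has(op.account_id)) return `operation ${op.id}: unknown account`;
    // Assumption: amounts must be strictly positive; the spec only says "integer cents".
    if (!Number.isSafeInteger(op.amount_cents) || op.amount_cents <= 0) {
      return `operation ${op.id}: amount_cents must be a positive integer`;
    }
  }
  return null;
}

// Maps validation plus settlement onto the process contract without touching the real process.
// `settle` is whatever settlement function the CLI uses (assumed to be pure).
function toOutcome(payload, settle) {
  const error = validateHoldsInput(payload);
  if (error) return { status: 2, stdout: '', stderr: JSON.stringify({ error }) + '\n' };
  return { status: 0, stdout: JSON.stringify(settle(payload)) + '\n', stderr: '' };
}

const rejected = toOutcome({
  accounts: [{ id: 'acct-a', balance_cents: 0, credit_limit_cents: 1000 }],
  operations: [
    { id: 'op-missing-account', account_id: 'acct-missing', type: 'authorize', hold_id: 'h-1', amount_cents: 100 }
  ]
}, () => ({}));
console.log(rejected.status, rejected.stderr.trim());
```

Returning an outcome object keeps the exit-code decision in one place; a real entry point would presumably write `stdout` and `stderr` to the matching streams and set `process.exitCode` from `status`.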
@@ -0,0 +1,60 @@
+# F9 — Notes
+
+## Purpose
+
+**Load-bearing for the novice-user contract.** The suite ship-gate requires
+F9 to pass (margin ≥ +5) on every shipped version. If F9 fails, the "type
+`/devlyn:ideate` and ship worldclass software" promise is not being met.
+
+## What the variant arm does
+
+A novice-simulating prompt (`task.txt` is identical to what the user typed)
+is delivered to a fresh Claude session. The session has our skills installed.
+The pipeline arm is expected to:
+
+1. Recognize this is a vague idea, not a spec → invoke `/devlyn:ideate`.
+2. Ideate produces `docs/VISION.md`, `docs/ROADMAP.md`, and
+   `docs/roadmap/phase-1/1.1-gitstats.md` (or similar).
+3. Run `/devlyn:auto-resolve` on the generated spec.
+4. Run `/devlyn:preflight` for verification.
+
+The variant arm's prompt explicitly instructs this chain so we're not
+relying on Claude to invent it. That's fair because the novice contract is
+about the TOOLS being available + discoverable; the user in this benchmark
+is already primed to use them.
+
+## What the bare arm does
+
+Same raw task delivered as a direct prompt. Bare implements `gitstats`
+using its own judgment. Bare does NOT produce VISION/ROADMAP documents
+(and isn't expected to).
+
+## Why margin ≥ +5 is required
+
+The pipeline's whole value prop is that it trades some bare-case tokens for
+quality uplift on novice flows. If this fixture can't show ≥ +5 margin,
+we're paying pipeline cost without delivering on the novice promise.
+
+## Scoring notes
+
+- The variant's `docs/VISION.md` + `ROADMAP.md` + spec files ARE part of
+  the judge's evaluation. The judge sees the full product (code + docs +
+  roadmap state), not just the diff to `bin/cli.js`.
+- Bare doesn't produce roadmap files, so bare's judge payload is
+  code+test only. This asymmetry is INTENTIONAL — the fixture tests
+  total-output quality, not per-file quality.
+
+## Failure modes detected
+
+- **Pipeline skips ideate.** Variant goes straight to auto-resolve with a
+  vague spec → downstream implementation is weak. Caught by judge:
+  `docs/roadmap/` files missing.
+- **Bare over-engineers.** Without a skeleton, bare builds too much,
+  touches wrong files, adds deps. Caught by spec constraints.
+- **Pipeline ships "done" but preflight was a no-op.** If `.devlyn/PREFLIGHT-REPORT.md` exists but shows no commitment audit, something is broken upstream.
+
+## Rotation trigger
+
+F9 is the last fixture we rotate — it's the anchor. If it saturates
+(variant consistently > 95), the whole suite needs a harder novice-flow
+anchor before we retire this one.
@@ -0,0 +1,29 @@
+# RETIRED — F9-e2e-ideate-to-preflight
+
+**Retired**: 2026-04-30 (iter-0033a)
+**Replaced by**: `benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/`
+**Source SHA**: `8d4d57f` (commit before the 2-skill-contract rename).
+
+## Why retired
+
+The 2-skill redesign (Phases 1-3, iter-0029 / 0031 / 0032) replaced
+`/devlyn:ideate` (greenfield) and folded `/devlyn:preflight` into
+`/devlyn:resolve`'s VERIFY phase. The OLD F9 fixture's contract assumed
+the 3-skill chain (`/devlyn:ideate` → `/devlyn:auto-resolve` →
+`/devlyn:preflight`), which is unobtainable at HEAD post-iter-0032
+because OLD ideate was deleted.
+
+iter-0033a redesigned F9 to match the shipped 2-skill contract.
+
+## When to consult this archive
+
+- Replaying a regression suspected from the OLD chain.
+- Migrating a pre-2026-04-30 historical run record back to readable shape.
+- Auditing what changed when the new fixture's measurements diverge from
+  pre-redesign baselines.
+
+## What lives here
+
+The exact file contents of the F9 fixture as of `8d4d57f` (the last commit
+before the rename). DO NOT use this directory as a live fixture — it is
+not picked up by `run-suite.sh`. Restore-and-run requires a manual copy.
@@ -0,0 +1,73 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node bin/cli.js gitstats",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Commits:",
+        "Last commit:"
+      ],
+      "stdout_not_contains": [
+        "Error:"
+      ]
+    },
+    {
+      "cmd": "node bin/cli.js gitstats --json",
+      "exit_code": 0,
+      "stdout_contains": [
+        "{",
+        "commits",
+        "authors"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "cd /tmp && node -e 'const { spawnSync } = require(\"child_process\"); const p = process.env.BENCH_WORKDIR || process.cwd(); console.log(spawnSync(\"node\", [p + \"/bin/cli.js\", \"gitstats\"], { encoding: \"utf8\", cwd: \"/tmp\" }).status)'",
+      "exit_code": 0,
+      "stdout_contains": [
+        "2"
+      ],
+      "stdout_not_contains": [
+        "0"
+      ]
+    },
+    {
+      "cmd": "node bin/cli.js hello",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Hello, world!"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node --test tests/",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": []
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch",
+      "files": [
+        "bin/cli.js"
+      ],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": [
+    "bin/cli.js"
+  ],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "tier_a_waivers": [
+    "docs/VISION.md",
+    "docs/ROADMAP.md",
+    "docs/roadmap/**"
+  ],
+  "spec_output_files": [
+    "bin/**",
+    "tests/**"
+  ]
+}
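To see what the `forbidden_patterns` entry above would and would not flag, the pattern can be tried against a few catch blocks. This is an illustrative sketch only: it assumes the harness applies the pattern as an ordinary JavaScript regular expression with no flags, which this diff does not confirm, and the sample strings are made up.

```javascript
'use strict';

// The JSON string "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}" decodes to this regular expression source.
const emptyCatch = new RegExp('catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}');

// Made-up snippets covering the interesting cases.
const samples = [
  'try { run(); } catch (err) {}',                      // empty body
  'try { run(); } catch (err) {\n  \n}',                // whitespace-only body
  'try { run(); } catch (err) { console.error(err); }', // handled
  'try { run(); } catch {}',                            // optional catch binding, no parentheses
  'try { run(); } catch (err) { /* ignored */ }'        // comment-only body
];

for (const source of samples) {
  console.log(emptyCatch.test(source), JSON.stringify(source));
}
```

The first two samples match and the third does not. The last two also do not match, because the pattern requires a parenthesised binding and a body containing only whitespace, so a binding-less `catch {}` or a comment-only body would pass this particular check.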
@@ -0,0 +1,10 @@
+{
+  "id": "F9-e2e-ideate-to-preflight",
+  "category": "e2e",
+  "difficulty": "high",
+  "timeout_seconds": 3600,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "End-to-end novice flow: from a vague idea ('git stats CLI for the current repo') the variant must run /devlyn:ideate → /devlyn:auto-resolve → /devlyn:preflight to produce Vision/Roadmap + implemented code + preflight sign-off. The bare arm receives the same vague idea as a direct prompt. This fixture gates the novice-user contract."
+}
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+# F9 setup — seed a few synthetic commits with different authors so the
+# `gitstats` subcommand's "top 3 authors by commit count" requirement is
+# meaningfully exercised. Without this, every commit author is the runner's
+# default and the ranking test is a no-op.
+set -e
+
+commit_as() {
+  local name="$1" email="$2" file="$3" message="$4"
+  echo "$(date +%s%N) $name" >> "$file"
+  git add "$file"
+  git -c user.name="$name" -c user.email="$email" commit -q -m "$message"
+}
+
+mkdir -p .bench-seed
+
+commit_as "Alpha Author" "alpha@bench.test" .bench-seed/log "seed: alpha 1"
+commit_as "Alpha Author" "alpha@bench.test" .bench-seed/log "seed: alpha 2"
+commit_as "Alpha Author" "alpha@bench.test" .bench-seed/log "seed: alpha 3"
+commit_as "Alpha Author" "alpha@bench.test" .bench-seed/log "seed: alpha 4"
+commit_as "Beta Author" "beta@bench.test" .bench-seed/log "seed: beta 1"
+commit_as "Beta Author" "beta@bench.test" .bench-seed/log "seed: beta 2"
+commit_as "Beta Author" "beta@bench.test" .bench-seed/log "seed: beta 3"
+commit_as "Gamma Author" "gamma@bench.test" .bench-seed/log "seed: gamma 1"
+commit_as "Gamma Author" "gamma@bench.test" .bench-seed/log "seed: gamma 2"
+commit_as "Delta Author" "delta@bench.test" .bench-seed/log "seed: delta 1"
+
+echo "F9 setup: seeded 10 commits across 4 authors (Alpha 4 / Beta 3 / Gamma 2 / Delta 1)"
@@ -0,0 +1,58 @@
+---
+id: "F9-e2e-ideate-to-preflight"
+title: "End-to-end: idea → shipped CLI feature"
+status: planned
+complexity: high
+depends-on: []
+---
+
+# F9 End-to-End Novice Flow
+
+## Context
+
+A first-time user has a vague idea:
+
+> "I want a CLI subcommand that shows basic stats about the current git repo — commit count, last commit date, top 3 authors. Call it `gitstats`."
+
+The variant arm is expected to use the pipeline: `/devlyn:ideate` to
+produce a VISION/ROADMAP + spec, then `/devlyn:auto-resolve` to implement
+per spec, then `/devlyn:preflight` to verify. The bare arm receives the
+same idea as a direct prompt and implements it without the pipeline.
+
+This fixture is the suite's most important gate for the "novice user contract":
+a first-time user typing `/devlyn:ideate` should land at working, well-structured software.
+
+## Requirements
+
+- [ ] A new `gitstats` subcommand exists in `bin/cli.js`.
+- [ ] `node bin/cli.js gitstats` (run inside a git repo) prints:
+  - Line 1: commit count (e.g., `Commits: 42`).
+  - Line 2: last commit ISO date (e.g., `Last commit: 2026-04-23T12:00:00Z`).
+  - Lines 3-5: top 3 authors by commit count, format `<rank>. <name> <count>`.
+- [ ] Run outside a git repo → stderr message `Error: not a git repository` and exit 2.
+- [ ] `node bin/cli.js gitstats --json` emits valid JSON with the same data.
+- [ ] Existing subcommands (`hello`, `version`) unchanged.
+- [ ] Add at least one test.
+- [ ] For variant: a `docs/VISION.md`, `docs/ROADMAP.md`, and a `docs/roadmap/phase-1/` spec file must exist after the run (evidence the ideate stage happened).
+
+## Constraints
+
+- **No new npm dependencies.** Use `child_process` to shell out to `git`.
+- **No silent catches.**
+- **Non-git-repo handling.** Do not assume the user is always in a repo.
+
+- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
+
+## Out of Scope
+
+- Parsing commit messages, tags, branches.
+- Remote API calls.
+- Touching `server/` or `web/`.
+
+## Verification
+
+- Inside this worktree (which IS a git repo): `node bin/cli.js gitstats` exits 0 and prints at least 5 lines of summary.
+- `node bin/cli.js gitstats --json | node -e 'const d=JSON.parse(require("fs").readFileSync(0,"utf8")); console.log(typeof d.commits)'` prints `number`.
+- `cd /tmp && node <worktree>/bin/cli.js gitstats` (from outside a repo — use the worktree's absolute path) exits 2.
+- For variant: `test -f docs/VISION.md && test -f docs/ROADMAP.md && ls docs/roadmap/phase-1/*.md | head -1`.
+- `node --test tests/` passes.
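The output rules in the Requirements above can be separated from the `git` call, which makes the ranking logic easy to test. The sketch below is one possible shape, not the fixture's reference solution: `summarizeLog` and `formatText` are hypothetical names, the input is assumed to be the text printed by `git log --format=%an%x09%cI` (author name, a tab, strict ISO 8601 committer date, newest commit first), the `last_commit` JSON key is an assumption, and ties between authors are broken by name because the spec does not say how to break them.

```javascript
'use strict';

// Hypothetical pure helper. `logText` is assumed to be well-formed output of
//   git log --format=%an%x09%cI
// with one "<author>\t<iso-date>" line per commit, newest first.
function summarizeLog(logText) {
  const rows = logText.split('\n').filter((line) => line !== '').map((line) => {
    const tab = line.lastIndexOf('\t');
    return { author: line.slice(0, tab), date: line.slice(tab + 1) };
  });

  // Count commits per author name.
  const counts = new Map();
  for (const { author } of rows) counts.set(author, (counts.get(author) || 0) + 1);

  const authors = [...counts.entries()]
    .map(([name, count]) => ({ name, count }))
    // Highest count first; ties broken by name (an assumption, the spec is silent on ties).
    .sort((a, b) => b.count - a.count || (a.name < b.name ? -1 : a.name > b.name ? 1 : 0))
    .slice(0, 3);

  return {
    commits: rows.length,
    last_commit: rows.length > 0 ? rows[0].date : null, // key name is an assumption
    authors
  };
}

// Renders the plain-text form: commit count, last commit date, then up to three ranked authors.
function formatText(stats) {
  return [
    `Commits: ${stats.commits}`,
    `Last commit: ${stats.last_commit}`,
    ...stats.authors.map((a, i) => `${i + 1}. ${a.name} ${a.count}`)
  ].join('\n');
}

const sampleLog = [
  'Delta Author\t2026-04-23T12:00:03+00:00',
  'Gamma Author\t2026-04-23T12:00:02+00:00',
  'Alpha Author\t2026-04-23T12:00:01+00:00',
  'Alpha Author\t2026-04-23T12:00:00+00:00'
].join('\n') + '\n';

console.log(formatText(summarizeLog(sampleLog)));
console.log(JSON.stringify(summarizeLog(sampleLog)));
```

In the CLI itself, `logText` could come from `execFileSync('git', ['log', '--format=%an%x09%cI'], { encoding: 'utf8' })`, with a failing `git` invocation mapped to the `Error: not a git repository` message on stderr and exit code 2 rather than swallowed in an empty catch.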
@@ -0,0 +1,5 @@
+I want a CLI subcommand that shows basic stats about the current git repo — commit count, last commit date, top 3 authors. Call it `gitstats`.
+
+Should work inside this repo when I run `node bin/cli.js gitstats`, and fail cleanly if I'm not in a git repo. A `--json` flag for machine-readable output would be useful too.
+
+Keep the existing `hello` and `version` subcommands working. Add a test. No new npm dependencies.