devlyn-cli 2.3.0 → 2.3.2
- package/AGENTS.md +1 -1
- package/CLAUDE.md +2 -2
- package/README.md +82 -29
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
- package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
- package/benchmark/auto-resolve/README.md +307 -44
- package/benchmark/auto-resolve/RUBRIC.md +23 -14
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
- package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
- package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
- package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
- package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
- package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
- package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
- package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
- package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
- package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
- package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
- package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
- package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
- package/benchmark/auto-resolve/scripts/judge.sh +153 -26
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
- package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
- package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
- package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
- package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
- package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
- package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
- package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
- package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
- package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
- package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
- package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
- package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
- package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
- package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
- package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
- package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
- package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
- package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
- package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
- package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
- package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
- package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
- package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
- package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
- package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
- package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
- package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
- package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
- package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
- package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
- package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
- package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
- package/bin/devlyn.js +211 -18
- package/config/skills/_shared/adapters/README.md +3 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
- package/config/skills/_shared/adapters/opus-4-7.md +9 -1
- package/config/skills/_shared/archive_run.py +78 -6
- package/config/skills/_shared/codex-config.md +3 -2
- package/config/skills/_shared/codex-monitored.sh +46 -1
- package/config/skills/_shared/collect-codex-findings.py +20 -5
- package/config/skills/_shared/engine-preflight.md +1 -1
- package/config/skills/_shared/runtime-principles.md +5 -8
- package/config/skills/_shared/spec-verify-check.py +2664 -107
- package/config/skills/_shared/verify-merge-findings.py +1369 -19
- package/config/skills/devlyn:ideate/SKILL.md +7 -4
- package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
- package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
- package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
- package/config/skills/devlyn:resolve/SKILL.md +49 -18
- package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
- package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
- package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
- package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
- package/package.json +47 -2
- package/scripts/lint-fixtures.sh +349 -0
- package/scripts/lint-shadow-fixtures.sh +58 -0
- package/scripts/lint-skills.sh +3642 -92
- /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
@@ -0,0 +1,56 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node --test tests/cli.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": ["not ok "]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Process refund requests globally by `priority` descending, then `requested_on` ascending, then original input order ascending.",
+        "A refund rejects with reason `window_expired` when `requested_on` is more than `refund_window_days` after `purchased_on`.",
+        "A refund accepts only when the order's remaining refundable cents is at least the requested `cents`.",
+        "A rejected refund with reason `over_refund` must not change that order's remaining refundable cents.",
+        "For each accepted refund, decrement that order's remaining refundable cents by the requested `cents`.",
+        "For each accepted refund, compute `fee_cents` as the category policy's `restocking_fee_cents` capped at the requested `cents`, and compute `net_cents = cents - fee_cents`.",
+        "`approved` is ordered in processing order. Each row has keys `id`, `order`, `refund_cents`, `fee_cents`, and `net_cents`.",
+        "`rejected` is ordered in the original input refund order. Each row has keys `id`, `reason`.",
+        "`orders` is ordered by order id ascending. Each row has keys `id` and `remaining_refundable_cents`.",
+        "On success, write exactly one JSON object to stdout and no stderr. Keys: `approved`, `rejected`, `orders`."
+      ]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/duplicate-refund-error.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Before settling any refund, duplicate refund ids are invalid input: exit `2`, write exactly one JSON error object `{ \"error\": \"duplicate_refund_id\", \"id\": string }` to stderr, and write no stdout."
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|''|\\{\\}|\\[\\])",
+      "description": "silent catch returning fallback in settle-refunds path",
+      "files": ["bin/cli.js", "tests/cli.test.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch block",
+      "files": ["bin/cli.js", "tests/cli.test.js"],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": ["bin/cli.js", "tests/cli.test.js"],
+  "forbidden_files": [],
+  "tier_a_waivers": [],
+  "spec_output_files": ["bin/cli.js", "tests/cli.test.js"],
+  "max_deps_added": 0
+}
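The two `forbidden_patterns` entries above are regular expressions stored as JSON strings. The sketch below shows what each one would flag once the JSON escaping is removed. The helper name `findForbidden` and the sample snippets are hypothetical, and the benchmark's actual scanner may apply the patterns differently (for example per file or with other flags).

```javascript
'use strict';

// Patterns copied from the `forbidden_patterns` entries, with the JSON
// string escaping removed so they can be written as regex literals.
const SILENT_FALLBACK = /catch\s*\([^)]*\)\s*\{[^}]*return\s+(null|undefined|''|\{\}|\[\])/;
const EMPTY_CATCH = /catch\s*\([^)]*\)\s*\{\s*\}/;

// Hypothetical helper (an assumption, not the benchmark's scanner):
// returns the descriptions of every pattern that matches the source text.
function findForbidden(source) {
  const hits = [];
  if (SILENT_FALLBACK.test(source)) hits.push('silent catch returning fallback');
  if (EMPTY_CATCH.test(source)) hits.push('empty catch block');
  return hits;
}

// Illustrative snippets, treated as plain text rather than executed.
const samples = [
  'try { JSON.parse(x); } catch (e) { return null; }',
  'try { f(); } catch (e) {}',
  'try { f(); } catch (e) { console.error(e); process.exit(2); }',
];

for (const sample of samples) {
  console.log(JSON.stringify(findForbidden(sample)), '<-', sample);
}
```

Both patterns require a parenthesized catch parameter, so a parameterless `catch { }` block would not be matched by either expression as written.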
@@ -0,0 +1,10 @@
+{
+  "id": "S6-cli-refund-window-ledger",
+  "category": "high-risk",
+  "difficulty": "high",
+  "timeout_seconds": 900,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add a refund ledger CLI command that applies category refund windows, priority-ordered refund requests, cumulative per-order refundable balances, duplicate refund rejection, and exact JSON output shape."
+}
@@ -0,0 +1,59 @@
+---
+id: "S6-cli-refund-window-ledger"
+title: "Add refund window ledger command"
+status: planned
+complexity: high
+depends-on: []
+---
+
+# S6 Add Refund Window Ledger Command
+
+## Context
+
+Finance operations needs a deterministic CLI command that settles refund
+requests against original orders. The command must combine category refund
+windows, priority ordering, cumulative per-order refundable balances, duplicate
+id rejection, and exact machine-readable output.
+
+## Requirements
+
+- [ ] Add `settle-refunds` to `bin/cli.js`.
+- [ ] Accept `--policies <json>` as a JSON object whose keys are category names and whose values have keys `refund_window_days` and `restocking_fee_cents`.
+- [ ] Accept `--orders <json>` as a JSON array of order objects. Each order has keys `id`, `category`, `paid_cents`, `purchased_on`, and `fulfilled`.
+- [ ] Accept `--refunds <json>` as a JSON array of refund request objects. Each refund has keys `id`, `order`, `cents`, `priority`, and `requested_on`.
+- [ ] Before settling any refund, duplicate refund ids are invalid input: exit `2`, write exactly one JSON error object `{ "error": "duplicate_refund_id", "id": string }` to stderr, and write no stdout.
+- [ ] Process refund requests globally by `priority` descending, then `requested_on` ascending, then original input order ascending.
+- [ ] A refund rejects with reason `unknown_order` when the order does not exist.
+- [ ] A refund rejects with reason `unfulfilled_order` when the order exists but `fulfilled` is not `true`.
+- [ ] A refund rejects with reason `unknown_policy` when the order category has no policy.
+- [ ] A refund rejects with reason `window_expired` when `requested_on` is more than `refund_window_days` after `purchased_on`.
+- [ ] A refund accepts only when the order's remaining refundable cents is at least the requested `cents`.
+- [ ] A rejected refund with reason `over_refund` must not change that order's remaining refundable cents.
+- [ ] For each accepted refund, decrement that order's remaining refundable cents by the requested `cents`.
+- [ ] For each accepted refund, compute `fee_cents` as the category policy's `restocking_fee_cents` capped at the requested `cents`, and compute `net_cents = cents - fee_cents`.
+- [ ] `approved` is ordered in processing order. Each row has keys `id`, `order`, `refund_cents`, `fee_cents`, and `net_cents`.
+- [ ] `rejected` is ordered in the original input refund order. Each row has keys `id`, `reason`.
+- [ ] `orders` is ordered by order id ascending. Each row has keys `id` and `remaining_refundable_cents`.
+- [ ] On success, write exactly one JSON object to stdout and no stderr. Keys: `approved`, `rejected`, `orders`.
+
+## Constraints
+
+- Use only Node.js built-ins; add no npm dependencies.
+- Touch only `bin/cli.js` and `tests/cli.test.js`.
+- Do not silently catch JSON parse or validation errors. Surface invalid input as a user-visible error with nonzero exit.
+- Do not persist refund balances between command invocations.
+- All public money amounts are integer cents.
+
+## Out of Scope
+
+- Reading input from files.
+- Taxes, payment gateway calls, currency conversion, or store-credit issuance.
+- Partial approval of a single refund request.
+- Changing `hello`, `version`, server routes, or package metadata.
+
+## Verification
+
+- `node --test tests/cli.test.js` passes.
+- `node "$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js"` prints `{"ok":true}`.
+- `node "$BENCH_FIXTURE_DIR/verifiers/duplicate-refund-error.js"` prints `{"ok":true}`.
+- Solo-headroom hypothesis: solo_claude is expected to miss cumulative remaining refundable cents or original-order rejected rows under priority-ordered refund settlement; observable command `node "$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js"` exposes the miss.
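The settlement rules in the spec above can be sketched as a pure function. This is an illustrative reading, not the fixture's reference solution: the names `settleRefunds` and `daysBetween` are hypothetical, the starting balance is assumed to be `paid_cents`, the rejection checks are assumed to apply in the order the spec lists them, and CLI wiring (argument parsing, exit code `2`, stderr output) is left out.

```javascript
'use strict';

const DAY_MS = 24 * 60 * 60 * 1000;
const cmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Hypothetical helper: whole days between two YYYY-MM-DD dates (parsed as UTC).
function daysBetween(from, to) {
  return Math.round((Date.parse(to) - Date.parse(from)) / DAY_MS);
}

// Hypothetical pure core for `settle-refunds`; returns either
// { ok: false, error } for duplicate ids or { ok: true, output }.
function settleRefunds(policies, orders, refunds) {
  // Duplicate ids are invalid input and are detected before any settlement.
  const seen = new Set();
  for (const r of refunds) {
    if (seen.has(r.id)) {
      return { ok: false, error: { error: 'duplicate_refund_id', id: r.id } };
    }
    seen.add(r.id);
  }

  const orderById = new Map(orders.map((o) => [o.id, o]));
  // Assumption: each order starts with its full paid_cents refundable.
  const remaining = new Map(orders.map((o) => [o.id, o.paid_cents]));

  // Priority descending, then requested_on ascending, then input order.
  const queue = refunds
    .map((r, index) => ({ r, index }))
    .sort((a, b) =>
      b.r.priority - a.r.priority ||
      cmp(a.r.requested_on, b.r.requested_on) ||
      a.index - b.index);

  const approved = [];
  const reasons = new Map();
  for (const { r } of queue) {
    const order = orderById.get(r.order);
    const policy = order && Object.hasOwn(policies, order.category)
      ? policies[order.category]
      : undefined;
    if (!order) reasons.set(r.id, 'unknown_order');
    else if (order.fulfilled !== true) reasons.set(r.id, 'unfulfilled_order');
    else if (!policy) reasons.set(r.id, 'unknown_policy');
    else if (daysBetween(order.purchased_on, r.requested_on) > policy.refund_window_days) {
      reasons.set(r.id, 'window_expired');
    } else if (remaining.get(order.id) < r.cents) {
      reasons.set(r.id, 'over_refund'); // balance is left untouched
    } else {
      remaining.set(order.id, remaining.get(order.id) - r.cents);
      const fee = Math.min(policy.restocking_fee_cents, r.cents);
      approved.push({
        id: r.id, order: order.id, refund_cents: r.cents,
        fee_cents: fee, net_cents: r.cents - fee,
      });
    }
  }

  return {
    ok: true,
    output: {
      approved, // already in processing order
      rejected: refunds // original input order
        .filter((r) => reasons.has(r.id))
        .map((r) => ({ id: r.id, reason: reasons.get(r.id) })),
      orders: [...orders]
        .sort((a, b) => cmp(a.id, b.id))
        .map((o) => ({ id: o.id, remaining_refundable_cents: remaining.get(o.id) })),
    },
  };
}

// Usage: two requests compete for one 1000-cent order.
const demo = settleRefunds(
  { electronics: { refund_window_days: 30, restocking_fee_cents: 150 } },
  [{ id: 'ord-a', category: 'electronics', paid_cents: 1000, purchased_on: '2026-01-01', fulfilled: true }],
  [
    { id: 'low', order: 'ord-a', cents: 500, priority: 1, requested_on: '2026-01-08' },
    { id: 'high', order: 'ord-a', cents: 800, priority: 10, requested_on: '2026-01-09' },
  ]);
console.log(JSON.stringify(demo.output));
```

In the usage call, the higher-priority 800-cent request settles first, so the earlier 500-cent request is rejected as `over_refund` and 200 cents remain on the order. Keeping the core free of I/O is a design choice for this sketch only; it makes the ordering and balance rules easy to check in isolation.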
@@ -0,0 +1 @@
+Add a `settle-refunds` command to bench-cli. It must read policies, orders, and refund requests from JSON CLI arguments, process refund requests by priority, maintain per-order remaining refundable cents, reject duplicates before processing, and emit exact JSON output.
@@ -0,0 +1,41 @@
+'use strict';
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const path = require('node:path');
+
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+
+const policies = JSON.stringify({
+  apparel: { refund_window_days: 45, restocking_fee_cents: 25 }
+});
+const orders = JSON.stringify([
+  { id: 'ord-a', category: 'apparel', paid_cents: 600, purchased_on: '2026-01-10', fulfilled: true }
+]);
+const refunds = JSON.stringify([
+  { id: 'dup', order: 'ord-a', cents: 100, priority: 2, requested_on: '2026-01-11' },
+  { id: 'dup', order: 'ord-a', cents: 100, priority: 1, requested_on: '2026-01-12' }
+]);
+
+const result = spawnSync('node', [
+  cli,
+  'settle-refunds',
+  '--policies',
+  policies,
+  '--orders',
+  orders,
+  '--refunds',
+  refunds
+], {
+  cwd: work,
+  encoding: 'utf8'
+});
+
+assert.strictEqual(result.status, 2, result.stdout || result.stderr);
+assert.strictEqual(result.stdout, '');
+assert.deepStrictEqual(JSON.parse(result.stderr), {
+  error: 'duplicate_refund_id',
+  id: 'dup'
+});
+
+console.log(JSON.stringify({ ok: true }));
@@ -0,0 +1,65 @@
+'use strict';
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const path = require('node:path');
+
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+
+const policies = JSON.stringify({
+  electronics: { refund_window_days: 30, restocking_fee_cents: 150 },
+  apparel: { refund_window_days: 45, restocking_fee_cents: 25 }
+});
+const orders = JSON.stringify([
+  { id: 'ord-a', category: 'electronics', paid_cents: 1000, purchased_on: '2026-01-01', fulfilled: true },
+  { id: 'ord-b', category: 'apparel', paid_cents: 600, purchased_on: '2026-01-10', fulfilled: true },
+  { id: 'ord-c', category: 'electronics', paid_cents: 400, purchased_on: '2025-12-01', fulfilled: true },
+  { id: 'ord-d', category: 'apparel', paid_cents: 500, purchased_on: '2026-01-15', fulfilled: false }
+]);
+const refunds = JSON.stringify([
+  { id: 'low-a', order: 'ord-a', cents: 500, priority: 1, requested_on: '2026-01-08' },
+  { id: 'expired-c', order: 'ord-c', cents: 100, priority: 9, requested_on: '2026-02-01' },
+  { id: 'high-a', order: 'ord-a', cents: 800, priority: 10, requested_on: '2026-01-09' },
+  { id: 'unknown', order: 'missing', cents: 50, priority: 8, requested_on: '2026-01-09' },
+  { id: 'unfulfilled', order: 'ord-d', cents: 50, priority: 7, requested_on: '2026-01-20' },
+  { id: 'apparel-ok', order: 'ord-b', cents: 300, priority: 6, requested_on: '2026-01-20' }
+]);
+
+const result = spawnSync('node', [
+  cli,
+  'settle-refunds',
+  '--policies',
+  policies,
+  '--orders',
+  orders,
+  '--refunds',
+  refunds
+], {
+  cwd: work,
+  encoding: 'utf8'
+});
+
+assert.strictEqual(result.status, 0, result.stderr || result.stdout);
+assert.strictEqual(result.stderr, '');
+const parsed = JSON.parse(result.stdout);
+
+assert.deepStrictEqual(parsed, {
+  approved: [
+    { id: 'high-a', order: 'ord-a', refund_cents: 800, fee_cents: 150, net_cents: 650 },
+    { id: 'apparel-ok', order: 'ord-b', refund_cents: 300, fee_cents: 25, net_cents: 275 }
+  ],
+  rejected: [
+    { id: 'low-a', reason: 'over_refund' },
+    { id: 'expired-c', reason: 'window_expired' },
+    { id: 'unknown', reason: 'unknown_order' },
+    { id: 'unfulfilled', reason: 'unfulfilled_order' }
+  ],
+  orders: [
+    { id: 'ord-a', remaining_refundable_cents: 200 },
+    { id: 'ord-b', remaining_refundable_cents: 300 },
+    { id: 'ord-c', remaining_refundable_cents: 400 },
+    { id: 'ord-d', remaining_refundable_cents: 500 }
+  ]
+});
+
+console.log(JSON.stringify({ ok: true }));
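The verifier above exercises the refund window and the restocking fee only away from their boundaries: the expired request is 62 days past a 30-day window, and both fees are smaller than the requested amounts. The sketch below, using hypothetical helpers `isWindowExpired` and `splitRefund`, shows how the two spec rules would behave at the edges under a literal reading ("more than `refund_window_days`" and "capped at the requested `cents`"). It assumes `YYYY-MM-DD` dates compared as whole UTC days.

```javascript
'use strict';

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: expired only when strictly more than windowDays
// whole days separate the purchase date from the request date.
function isWindowExpired(purchasedOn, requestedOn, windowDays) {
  const elapsed = Math.round((Date.parse(requestedOn) - Date.parse(purchasedOn)) / DAY_MS);
  return elapsed > windowDays;
}

// Hypothetical helper: the fee never exceeds the requested refund,
// so net_cents cannot go negative.
function splitRefund(cents, restockingFeeCents) {
  const fee_cents = Math.min(restockingFeeCents, cents);
  return { fee_cents, net_cents: cents - fee_cents };
}

// Day 30 of a 30-day window is still inside the window; day 31 is not.
console.log(isWindowExpired('2026-01-01', '2026-01-31', 30));
console.log(isWindowExpired('2026-01-01', '2026-02-01', 30));

// A 100-cent refund against a 150-cent fee is fully consumed by the fee.
console.log(splitRefund(100, 150));
```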
package/bin/devlyn.js CHANGED
@@ -22,7 +22,7 @@ const CLI_TARGETS = {
 // Codex auto-loads skills from ~/.codex/skills/ (user-global). Same
 // SKILL.md format as Claude Code; descriptions must stay ≤1024 chars.
 skillsDir: path.join(os.homedir(), '.codex', 'skills'),
-skillsToInstall: ['devlyn:resolve', 'devlyn:ideate', '_shared'],
+skillsToInstall: ['devlyn:resolve', 'devlyn:ideate', 'devlyn:design-ui', '_shared'],
 detect: () => fs.existsSync(path.join(process.cwd(), 'AGENTS.md')) || fs.existsSync(path.join(process.cwd(), '.codex')),
 },
 gemini: {
@@ -183,7 +183,6 @@ const OPTIONAL_ADDONS = [
 { name: 'devlyn:pencil-push', desc: 'Push codebase UI to Pencil canvas for design sync', type: 'local' },
 { name: 'devlyn:reap', desc: 'Safely reap orphaned MCP / codex / Superset child processes left behind by long Claude sessions', type: 'local' },
 { name: 'devlyn:design-system', desc: 'Extract design tokens from a chosen UI style for exact reproduction (creative power-user)', type: 'local' },
-{ name: 'devlyn:design-ui', desc: 'N (default 5) distinct UI style explorations from a single Lead Designer (creative power-user)', type: 'local' },
 { name: 'devlyn:team-design-ui', desc: '5 distinct UI style explorations from a full design team (creative power-user)', type: 'local' },
 // External skill packs (installed via npx skills add)
 { name: 'vercel-labs/agent-skills', desc: 'React, Next.js, React Native best practices', type: 'external' },
@@ -194,7 +193,7 @@ const OPTIONAL_ADDONS = [
 // MCP servers (installed via claude mcp add)
 // Note: the Codex integration uses the local `codex` CLI binary (not MCP).
 // Install the CLI separately per https://platform.openai.com/docs/codex — the
-//
+// pair/risk-probe routes fail closed when Codex is required but unavailable.
 { name: 'playwright', desc: 'Playwright MCP for browser testing — powers /devlyn:resolve BUILD_GATE browser tier', type: 'mcp', command: 'npx -y @anthropic-ai/mcp-playwright' },
 ];

@@ -524,7 +523,7 @@ function detectOtherCLIs() {
   return detected;
 }

-// Install
+// Install devlyn:resolve + devlyn:ideate + devlyn:design-ui + _shared skills into a CLI's
 // global skills directory (e.g. ~/.codex/skills/). Returns count of skills
 // copied. Skipped silently for CLIs without a skillsDir (e.g. cursor, copilot
 // at the time of writing — they don't have an analogous skill-loader).
@@ -608,11 +607,11 @@ function installAgentsForCLI(cliKey) {
 }

 // If this CLI also supports a global skill-loader (currently Codex), install
-//
-//
+// devlyn:resolve + devlyn:ideate + devlyn:design-ui + _shared. Codex invokes
+// these as skills (for example `$devlyn:resolve`), not Claude slash commands.
 const skillsCopied = installSkillsForCLI(cliKey);
 if (skillsCopied > 0) {
-  log(` → ${skillsCopied} skill${skillsCopied > 1 ? 's' : ''} installed (devlyn:resolve / devlyn:ideate / _shared)`, 'dim');
+  log(` → ${skillsCopied} skill${skillsCopied > 1 ? 's' : ''} installed (devlyn:resolve / devlyn:ideate / devlyn:design-ui / _shared)`, 'dim');
 }

 return true;
@@ -689,7 +688,7 @@ async function init(skipPrompts = false) {
 }
 }
 if (!settings.env) settings.env = {};
-// Auto-allow pipeline state directory and common git commands so
+// Auto-allow pipeline state directory and common git commands so resolve doesn't prompt
 if (!settings.permissions) settings.permissions = {};
 if (!settings.permissions.allow) settings.permissions.allow = [];
 const pipelinePermissions = [
@@ -762,7 +761,7 @@ async function init(skipPrompts = false) {
 if (cli.configDir) {
   desc = `Install agents into ${cli.configDir}/`;
 } else if (cli.skillsDir) {
-  desc = `Install ${cli.instructionsFile} +
+  desc = `Install ${cli.instructionsFile} + devlyn:resolve/devlyn:ideate/devlyn:design-ui skills (~/.codex/skills/; use $devlyn:* in Codex)`;
 } else {
   desc = `Install ${cli.instructionsFile}`;
 }
@@ -777,7 +776,7 @@ async function init(skipPrompts = false) {
   log(` ✅ Agent instructions installed for ${agentsInstalled} CLI${agentsInstalled !== 1 ? 's' : ''}`, 'green');
 } else {
   log('💡 No additional CLI instructions selected', 'dim');
-  log(' Run `npx devlyn-cli agents codex` later to install Codex AGENTS.md +
+  log(' Run `npx devlyn-cli agents codex` later to install Codex AGENTS.md + devlyn skills', 'dim');
 }

 // Ask about optional addons (local skills + external packs)
@@ -808,8 +807,14 @@ function showHelp() {
 log(' npx devlyn-cli -y Install without prompts');
 log(' npx devlyn-cli agents Install agents for detected CLIs');
 log(' npx devlyn-cli agents all Install agents for all supported CLIs');
-log(' npx devlyn-cli benchmark Run the
-log(' npx devlyn-cli benchmark
+log(' npx devlyn-cli benchmark Run the resolve benchmark suite');
+log(' npx devlyn-cli benchmark recent Show compact recent benchmark results');
+log(' npx devlyn-cli benchmark frontier Show pair candidate frontier scores/triggers without providers');
+log(' npx devlyn-cli benchmark audit Audit pair evidence readiness');
+log(' npx devlyn-cli benchmark audit-headroom Audit failed headroom results');
+log(' npx devlyn-cli benchmark headroom <fixtures...> Score bare vs solo_claude headroom');
+log(' npx devlyn-cli benchmark pair <fixtures...> Score solo_claude vs pair path');
+log(' npx devlyn-cli benchmark --bless If ship-gate passes, promote baseline');
 log(' npx devlyn-cli benchmark --dry-run Validate suite setup without model invocation');
 log(' npx devlyn-cli --help Show this help\n');
 log('Optional skills (select during install):', 'green');
@@ -831,6 +836,170 @@ function showHelp() {
   log('');
 }

+function showBenchmarkHelp() {
+  log('Usage:', 'green');
+  log(' npx devlyn-cli benchmark [suite] [options] [fixtures...]');
+  log(' npx devlyn-cli benchmark recent [options]');
+  log(' npx devlyn-cli benchmark frontier [options]');
+  log(' npx devlyn-cli benchmark audit [options]');
+  log(' npx devlyn-cli benchmark audit-headroom [options]');
+  log(' npx devlyn-cli benchmark headroom [options] <fixtures...>');
+  log(' npx devlyn-cli benchmark pair [options] <fixtures...>');
+  log('');
+  log('Score-focused runs:', 'green');
+  log(' recent Show compact, wrap-safe recent benchmark results');
+  log(' frontier Show active rejected/evidence/unmeasured pair candidates, scores, and triggers without providers');
+  log(' audit Fail on unmeasured pair candidates and invalid headroom rejections');
+  log(' Prints frontier score rows plus headroom and pair quality handoff rows');
+  log(' audit-headroom Fail on active failed or unsupported headroom rejections');
+  log(' headroom Score bare vs solo_claude before spending the pair arm');
+  log(' pair Score solo_claude vs the selected pair path and print gate tables');
+  log('');
+  log('Shadow suite:', 'green');
+  log(' npx devlyn-cli benchmark suite --suite shadow --dry-run');
+  log(' Lists shadow tasks only; use headroom/pair with explicit S* ids for real measurement');
+  log('');
+  log('Examples:', 'green');
+  log(' npx devlyn-cli benchmark --dry-run F1-cli-trivial-flag');
+  log(' npx devlyn-cli benchmark recent');
+  log(' npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md');
+  log(' npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md');
+  log(' npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit');
+  log(' npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json');
+  log(' npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+  log(' npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+  log('');
+}
+
+function showBenchmarkModeHelp(mode) {
+  if (mode === 'recent') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark recent [options]');
+    log('');
+    log('Options:', 'green');
+    log(' --out-json PATH');
+    log(' --out-md PATH');
+    log(' --fixtures-root PATH');
+    log(' --registry PATH');
+    log(' --results-root PATH');
+    log(' --max-width N default: 92');
+    log(' --min-pair-margin N default: 5');
+    log(' --max-pair-solo-wall-ratio N default: 3');
+    log('');
+    log('Output:', 'green');
+    log(' Prints compact, wrap-safe benchmark status and pair-evidence cards without wide tables');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark recent');
+    log(' npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md');
+    log('');
+    return;
+  }
+  if (mode === 'frontier') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark frontier [options]');
+    log('');
+    log('Options:', 'green');
+    log(' --out-json PATH');
+    log(' --out-md PATH');
+    log(' --fixtures-root PATH');
+    log(' --registry PATH');
+    log(' --results-root PATH');
+    log(' --min-pair-margin N default: 5');
+    log(' --max-pair-solo-wall-ratio N default: 3');
+    log(' --fail-on-unmeasured');
+    log('');
+    log('Output:', 'green');
+    log(' Prints pair evidence score rows with trigger reasons; --out-md includes a Triggers column');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md');
+    log('');
+    return;
+  }
+  if (mode === 'audit-headroom') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark audit-headroom [options]');
+    log('');
+    log('Options:', 'green');
+    log(' --out-json PATH');
+    log(' --fixtures-root PATH');
+    log(' --registry PATH');
+    log(' --results-root PATH');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json');
+    log('');
+    return;
+  }
+  if (mode === 'audit') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark audit [options]');
+    log('');
+    log('Options:', 'green');
+    log(' --out-dir PATH');
+    log(' --fixtures-root PATH');
+    log(' --registry PATH');
+    log(' --results-root PATH');
+    log(' --min-pair-evidence N default: 4');
+    log(' --min-pair-margin N default: 5');
+    log(' --max-pair-solo-wall-ratio N default: 3');
+    log(' --require-hypothesis-trigger');
+    log('');
+    log('Output:', 'green');
+    log(' Prints frontier score rows plus headroom_rejections=PASS/FAIL, pair_evidence_quality=PASS/FAIL, pair_trigger_reasons=PASS/FAIL, pair_evidence_hypotheses=PASS/FAIL, pair_evidence_hypothesis_triggers=PASS/WARN/FAIL, historical-alias, and hypothesis-trigger gap handoff rows');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit');
+    log(' npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict');
+    log('');
+    return;
+  }
+  if (mode === 'headroom') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark headroom [options] <fixtures...>');
+    log('');
+    log('Options:', 'green');
+    log(' --run-id ID');
+    log(' --bare-max N default: 60');
+    log(' --solo-max N default: 80');
+    log(' --min-bare-headroom N default: 5');
+    log(' --min-solo-headroom N default: 5');
+    log(' --min-fixtures N default: 2; use 3 for F16/F23/F25 proof reruns; audit requires 4 passing evidence rows');
+    log(' --allow-rejected-fixtures active-fixture diagnostics only');
+    log(' --dry-run validate args/fixtures and print replay command only');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+    log('');
+    return;
+  }
+  if (mode === 'pair') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark pair [options] <fixtures...>');
+    log('');
+    log('Options:', 'green');
+    log(' --run-id ID');
+    log(' --bare-max N');
+    log(' --solo-max N');
+    log(' --min-bare-headroom N default: 5');
+    log(' --min-solo-headroom N default: 5');
+    log(' --min-fixtures N default: 2; use 3 for F16/F23/F25 proof reruns; audit requires 4 passing evidence rows');
+    log(' --min-pair-margin N default: 5');
+    log(' --max-pair-solo-wall-ratio N default: 3');
+    log(' --pair-arm ARM default: l2_risk_probes; l2_gated is diagnostic');
+    log(' --reuse-calibrated-from RUN_ID');
+    log(' --allow-rejected-fixtures active-fixture diagnostics only');
+    log(' --dry-run validate args/fixtures and print replay command only');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+    log('');
+    return;
+  }
+  showBenchmarkHelp();
+}
+
 // Main
 const args = process.argv.slice(2);
 const command = args[0];
@@ -850,16 +1019,40 @@ switch (command) {
   break;
 case 'benchmark':
 case 'bench': {
-
-
-
+  const benchmarkScripts = {
+    suite: 'run-suite.sh',
+    recent: 'recent-benchmark-summary.py',
+    frontier: 'pair-candidate-frontier.py',
+    audit: 'audit-pair-evidence.py',
+    'audit-headroom': 'audit-headroom-rejections.py',
+    headroom: 'run-headroom-candidate.sh',
+    pair: 'run-full-pipeline-pair-candidate.sh',
+  };
+  let forwardedArgs = args.slice(1);
+  if (forwardedArgs[0] === '--help' || forwardedArgs[0] === '-h') {
+    showBenchmarkHelp();
+    break;
+  }
+  let benchmarkMode = 'suite';
+  if (forwardedArgs[0] === 'suite' || forwardedArgs[0] === 'recent' || forwardedArgs[0] === 'frontier' || forwardedArgs[0] === 'audit' || forwardedArgs[0] === 'audit-headroom' || forwardedArgs[0] === 'headroom' || forwardedArgs[0] === 'pair') {
+    benchmarkMode = forwardedArgs[0];
+    forwardedArgs = forwardedArgs.slice(1);
+  }
+  if (forwardedArgs[0] === '--help' || forwardedArgs[0] === '-h') {
+    showBenchmarkModeHelp(benchmarkMode);
+    break;
+  }
+  const runnerName = benchmarkScripts[benchmarkMode];
+  const runner = path.join(__dirname, '..', 'benchmark', 'auto-resolve', 'scripts', runnerName);
+  if (!fs.existsSync(runner)) {
     log('❌ Benchmark suite runner missing — is this a clean devlyn-cli checkout?', 'yellow');
-    log(` Expected: ${
+    log(` Expected: ${runner}`, 'dim');
     process.exit(1);
   }
   const { spawnSync } = require('child_process');
-  const
-  const
+  const env = { ...process.env, DEVLYN_BENCHMARK_CLI_SUBCOMMAND: benchmarkMode };
+  const executable = (benchmarkMode === 'recent' || benchmarkMode === 'frontier' || benchmarkMode === 'audit' || benchmarkMode === 'audit-headroom') ? 'python3' : 'bash';
+  const res = spawnSync(executable, [runner, ...forwardedArgs], { stdio: 'inherit', env });
   process.exit(res.status ?? 1);
   break;
 }
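The new `benchmark` case resolves an optional mode word, handles `--help` either before or after it, and picks an interpreter per script. The sketch below restates that routing as a pure function so the precedence is easier to follow. `planBenchmark` and `isHelpFlag` are hypothetical names, not part of devlyn-cli, and choosing the interpreter by file extension is a simplification of the diff's explicit mode list (the two agree for the table shown in the hunk).

```javascript
// Mode-to-script table copied from the hunk above.
const BENCHMARK_SCRIPTS = {
  suite: 'run-suite.sh',
  recent: 'recent-benchmark-summary.py',
  frontier: 'pair-candidate-frontier.py',
  audit: 'audit-pair-evidence.py',
  'audit-headroom': 'audit-headroom-rejections.py',
  headroom: 'run-headroom-candidate.sh',
  pair: 'run-full-pipeline-pair-candidate.sh',
};

const isHelpFlag = (arg) => arg === '--help' || arg === '-h';

// Hypothetical helper: returns a plan object instead of spawning a process.
function planBenchmark(argv) {
  let rest = argv.slice();
  // `benchmark --help` shows the general benchmark help.
  if (isHelpFlag(rest[0])) return { action: 'help' };
  // No mode word means the default suite runner; all args are forwarded.
  let mode = 'suite';
  if (Object.prototype.hasOwnProperty.call(BENCHMARK_SCRIPTS, rest[0])) {
    mode = rest[0];
    rest = rest.slice(1);
  }
  // `benchmark <mode> --help` shows the mode-specific help.
  if (isHelpFlag(rest[0])) return { action: 'mode-help', mode };
  const script = BENCHMARK_SCRIPTS[mode];
  // Assumption: .py scripts run under python3 and everything else under bash.
  const executable = script.endsWith('.py') ? 'python3' : 'bash';
  return { action: 'run', mode, script, executable, args: rest };
}

console.log(planBenchmark(['pair', '--min-fixtures', '3', 'F16-cli-quote-tax-rules']));
```

One consequence of this ordering, visible in the diff as well: a help flag is only recognized in the first position after `benchmark` or after the mode word, so `benchmark pair --min-fixtures 3 --help` forwards `--help` to the runner script rather than printing the Node-side help.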
@@ -30,6 +30,9 @@ Verbosity, formatting, length conventions specific to this model.
 ## Tool-use posture
 When to use tools, when to reason, parallel/sequential preferences.

+## Effort and autonomy
+Optional. Model-specific guidance for effort levels or autonomous-vs-interactive runs when the vendor guide calls this out.
+
 ## Validation pattern
 How this model verifies its work — mechanical-first vs self-check, etc.

@@ -8,7 +8,7 @@ You are GPT-5.5 by OpenAI. OpenAI's prompt-guidance for this model governs your

 ## Output discipline

-Your default is efficient, direct, task-oriented. The canonical body specifies the outcome and constraints; you choose the efficient path. Do not over-specify process steps when an outcome is clearly stated. Use
+Your default is efficient, direct, task-oriented. The canonical body specifies the outcome and constraints; you choose the efficient path. Do not over-specify process steps when an outcome is clearly stated. Use Markdown only where it carries structure (`inline code`, code fences, short lists/tables); otherwise favor short paragraphs and natural transitions. When `text.verbosity` is `low`, prefer even shorter responses.

 ## Tool-use posture

@@ -26,4 +26,8 @@ The official guide warns explicitly about carrying over instructions from older
 2. **Don't over-specify process when the destination is clear.** If the canonical body names the outcome, choose the path; do not narrate every step.
 3. **Stop rules are explicit.** When the canonical body or the harness asks you to stop / abstain / ask, follow the stop rule rather than retrying loops indefinitely. Loop-minimization does not outrank correctness or required citation.

+## Prompt-maintenance cue
+
+When asked to improve a failed prompt, act as GPT-5.5 metaprompter for itself: name the observed failure, then propose the smallest instruction to add, remove, or relocate. Prefer subtractive changes before adding new rules; keep the canonical body model-neutral and put only GPT-specific tactics in this adapter.
+
 Do not narrate internal deliberation. State results and decisions directly.
@@ -10,10 +10,18 @@ You are Claude Opus 4.7 by Anthropic. Anthropic's prompt-engineering guide for t

 You calibrate response length to task complexity automatically — keep simple lookups short, scale up only when the task warrants it. Do NOT pad with context the user didn't ask for. When the canonical body sets a structural format (XML, JSON, sections), follow it literally; do not silently restructure.

+## Examples and structure
+
+When prompt maintenance adds examples for Claude, prefer concise positive examples over lists of negative prohibitions. Wrap examples in `<example>` tags (or `<examples>` for several) so examples stay distinct from instructions and variable inputs.
+
 ## Tool-use posture

 You default to fewer tool calls than prior Claude generations. When the canonical body lists tools, use them when their result would change your answer. Make independent tool calls in parallel; chain only when one depends on another's output. Do not narrate "I'll now call X" preambles unless the canonical body requests progress updates.

+## Effort and autonomy
+
+For long-horizon coding, review, and agentic runs, assume the harness selected `high` or `xhigh` effort unless told otherwise. Spend that depth on upfront task/constraint understanding and end-state verification, not on verbose narration. If the user or orchestrator gives a complete task in one turn, proceed autonomously instead of requiring progressive clarification.
+
 ## Validation pattern

 When the canonical body asks you to verify your output before declaring done ("self-check" instructions), execute that step literally — re-read the spec's acceptance criteria, run the listed verification commands if available, list any gap. This is not optional. Mechanical gates owned by the harness (spec-verify-check.py, build-gate.py) are the primary correctness guard; your self-check is the secondary layer that catches what regex cannot.
@@ -22,7 +30,7 @@ When the canonical body asks you to verify your output before declaring done ("s

 You interpret instructions more literally than prior Claude versions. The official guide is explicit about three failure modes:

-1. **Review-prompt self-filtering**: when the canonical body asks for findings, report every issue you find — including low-severity and low-confidence ones
+1. **Review-prompt self-filtering**: when the canonical body asks for findings, report every issue you find — including low-severity and low-confidence ones; do not filter for importance or confidence. The harness has a separate filter step.
 2. **Subagent over-spawning**: do NOT spawn a subagent for work you can complete in a single response. Spawn only when the canonical body explicitly requests it OR when fanning out across independent items.
 3. **Overengineering**: do NOT add files, abstractions, error handling, validation, or "future flexibility" beyond what the spec asks. A bug fix doesn't need surrounding cleanup. The right complexity is the minimum needed for the current task.
