devlyn-cli 2.3.0 → 2.3.2
- package/AGENTS.md +1 -1
- package/CLAUDE.md +2 -2
- package/README.md +82 -29
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
- package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
- package/benchmark/auto-resolve/README.md +307 -44
- package/benchmark/auto-resolve/RUBRIC.md +23 -14
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
- package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
- package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
- package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
- package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
- package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
- package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
- package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
- package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
- package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
- package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
- package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
- package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
- package/benchmark/auto-resolve/scripts/judge.sh +153 -26
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
- package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
- package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
- package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
- package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
- package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
- package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
- package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
- package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
- package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
- package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
- package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
- package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
- package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
- package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
- package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
- package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
- package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
- package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
- package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
- package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
- package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
- package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
- package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
- package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
- package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
- package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
- package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
- package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
- package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
- package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
- package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
- package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
- package/bin/devlyn.js +211 -18
- package/config/skills/_shared/adapters/README.md +3 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
- package/config/skills/_shared/adapters/opus-4-7.md +9 -1
- package/config/skills/_shared/archive_run.py +78 -6
- package/config/skills/_shared/codex-config.md +3 -2
- package/config/skills/_shared/codex-monitored.sh +46 -1
- package/config/skills/_shared/collect-codex-findings.py +20 -5
- package/config/skills/_shared/engine-preflight.md +1 -1
- package/config/skills/_shared/runtime-principles.md +5 -8
- package/config/skills/_shared/spec-verify-check.py +2664 -107
- package/config/skills/_shared/verify-merge-findings.py +1369 -19
- package/config/skills/devlyn:ideate/SKILL.md +7 -4
- package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
- package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
- package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
- package/config/skills/devlyn:resolve/SKILL.md +49 -18
- package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
- package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
- package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
- package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
- package/package.json +47 -2
- package/scripts/lint-fixtures.sh +349 -0
- package/scripts/lint-shadow-fixtures.sh +58 -0
- package/scripts/lint-skills.sh +3642 -92
- /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
@@ -0,0 +1,82 @@
+{
+  "run_id": "20260510-f16-f23-f25-combined-proof",
+  "rule": "headroom candidates only; bare headroom >= 5; solo_claude headroom >= 5; l2_risk_probes must be evidence-clean, pair_mode true, pair_trigger eligible with a canonical reason, and beat solo_claude by the configured margin",
+  "verdict": "PASS",
+  "fixtures_total": 3,
+  "fixtures_passed": 3,
+  "min_fixtures": 3,
+  "fixture_count_ok": true,
+  "bare_max": 60,
+  "solo_max": 80,
+  "min_bare_headroom_required": 5,
+  "min_solo_headroom_required": 5,
+  "min_pair_margin": 5,
+  "pair_arm": "l2_risk_probes",
+  "max_pair_solo_wall_ratio": 3.0,
+  "require_hypothesis_trigger": false,
+  "max_observed_pair_solo_wall_ratio": 2.2506234413965087,
+  "avg_pair_margin": 25.333333333333332,
+  "avg_pair_solo_wall_ratio": 1.725768446785212,
+  "rows": [
+    {
+      "fixture": "F16-cli-quote-tax-rules",
+      "status": "PASS",
+      "bare_score": 50,
+      "bare_headroom": 10,
+      "solo_score": 75,
+      "solo_headroom": 5,
+      "pair_score": 96,
+      "pair_margin": 21,
+      "pair_mode": true,
+      "pair_trigger_eligible": true,
+      "pair_trigger_reasons": [
+        "complexity.high",
+        "spec.solo_headroom_hypothesis"
+      ],
+      "pair_trigger_has_canonical_reason": true,
+      "pair_trigger_has_hypothesis_reason": true,
+      "pair_solo_wall_ratio": 1.2805280528052805,
+      "reason": ""
+    },
+    {
+      "fixture": "F23-cli-fulfillment-wave",
+      "status": "PASS",
+      "bare_score": 33,
+      "bare_headroom": 27,
+      "solo_score": 66,
+      "solo_headroom": 14,
+      "pair_score": 97,
+      "pair_margin": 31,
+      "pair_mode": true,
+      "pair_trigger_eligible": true,
+      "pair_trigger_reasons": [
+        "complexity.high",
+        "spec.solo_headroom_hypothesis"
+      ],
+      "pair_trigger_has_canonical_reason": true,
+      "pair_trigger_has_hypothesis_reason": true,
+      "pair_solo_wall_ratio": 2.2506234413965087,
+      "reason": ""
+    },
+    {
+      "fixture": "F25-cli-cart-promotion-rules",
+      "status": "PASS",
+      "bare_score": 25,
+      "bare_headroom": 35,
+      "solo_score": 75,
+      "solo_headroom": 5,
+      "pair_score": 99,
+      "pair_margin": 24,
+      "pair_mode": true,
+      "pair_trigger_eligible": true,
+      "pair_trigger_reasons": [
+        "complexity.high",
+        "spec.solo_headroom_hypothesis"
+      ],
+      "pair_trigger_has_canonical_reason": true,
+      "pair_trigger_has_hypothesis_reason": true,
+      "pair_solo_wall_ratio": 1.646153846153846,
+      "reason": ""
+    }
+  ]
+}
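
Reviewer sketch for the 82-line gate artifact above: the derived fields in each row appear to follow simple arithmetic (`bare_headroom = bare_max - bare_score`, `solo_headroom = solo_max - solo_score`, `pair_margin = pair_score - solo_score`). That relationship is inferred from the numbers in the artifact, not taken from `full-pipeline-pair-gate.py`. The helper name `check_row` and its parameters are hypothetical, and the sketch covers only the numeric thresholds; the evidence-clean, `pair_mode`, and trigger-reason conditions in the rule are not modeled.

```python
# Illustrative re-check of the numeric thresholds in the gate rule.
# Assumption: headroom = max - score and margin = pair - solo, as inferred
# from the artifact values. This is NOT the package's gate implementation.

ROWS = [
    {"fixture": "F16-cli-quote-tax-rules", "bare_score": 50, "solo_score": 75,
     "pair_score": 96, "pair_solo_wall_ratio": 1.2805280528052805},
    {"fixture": "F23-cli-fulfillment-wave", "bare_score": 33, "solo_score": 66,
     "pair_score": 97, "pair_solo_wall_ratio": 2.2506234413965087},
    {"fixture": "F25-cli-cart-promotion-rules", "bare_score": 25, "solo_score": 75,
     "pair_score": 99, "pair_solo_wall_ratio": 1.646153846153846},
]


def check_row(row, bare_max=60, solo_max=80, min_headroom=5,
              min_margin=5, max_wall_ratio=3.0):
    """Recompute the derived fields for one row and test the thresholds."""
    bare_headroom = bare_max - row["bare_score"]
    solo_headroom = solo_max - row["solo_score"]
    pair_margin = row["pair_score"] - row["solo_score"]
    ok = (
        bare_headroom >= min_headroom
        and solo_headroom >= min_headroom
        and pair_margin >= min_margin
        and row["pair_solo_wall_ratio"] <= max_wall_ratio
    )
    return {
        "fixture": row["fixture"],
        "bare_headroom": bare_headroom,
        "solo_headroom": solo_headroom,
        "pair_margin": pair_margin,
        "ok": ok,
    }


if __name__ == "__main__":
    results = [check_row(r) for r in ROWS]
    for r in results:
        print(r)
    # Aggregates corresponding to avg_pair_margin and
    # max_observed_pair_solo_wall_ratio in the artifact.
    print(sum(r["pair_margin"] for r in results) / len(results))
    print(max(r["pair_solo_wall_ratio"] for r in ROWS))
```

Comparing the recomputed values against the stored `bare_headroom`, `solo_headroom`, and `pair_margin` fields is a quick way to spot a hand-edited or stale artifact during review.
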
@@ -0,0 +1,18 @@
+# Full-Pipeline Pair Gate - 20260510-f16-f23-f25-combined-proof
+
+Verdict: **PASS**
+
+Fixtures passed: 3/3 (minimum required: 3)
+
+Rule: at least 3 fixtures; bare <= 60; bare headroom >= 5; solo_claude <= 80; solo_claude headroom >= 5; l2_risk_probes evidence-clean; pair_mode true; pair_trigger eligible with canonical reason; l2_risk_probes - solo_claude >= 5.
+Average pair margin: +25.3
+Allowed pair/solo wall ratio: 3.00x
+Maximum observed pair/solo wall ratio: 2.25x
+Average pair/solo wall ratio: 1.73x
+Hypothesis trigger required: false
+
+| Fixture | Bare | Bare headroom | Solo_claude | Solo_claude headroom | Pair | Margin | Pair mode | Hypothesis trigger | Triggers | Wall ratio | Status | Reason |
+|---|---:|---:|---:|---:|---:|---:|---|---|---|---:|---|---|
+| F16-cli-quote-tax-rules | 50 | 10 | 75 | 5 | 96 | +21 | true | true | complexity.high,spec.solo_headroom_hypothesis | 1.28x | PASS | |
+| F23-cli-fulfillment-wave | 33 | 27 | 66 | 14 | 97 | +31 | true | true | complexity.high,spec.solo_headroom_hypothesis | 2.25x | PASS | |
+| F25-cli-cart-promotion-rules | 25 | 35 | 75 | 5 | 99 | +24 | true | true | complexity.high,spec.solo_headroom_hypothesis | 1.65x | PASS | |
@@ -0,0 +1,46 @@
+{
+  "run_id": "20260510-f16-f23-f25-combined-proof",
+  "rule": "at least 3 candidate fixtures; each must satisfy bare <= 60 with headroom >= 5, solo_claude <= 80 with headroom >= 5, with both baseline arms evidence-complete",
+  "verdict": "PASS",
+  "fixtures_total": 3,
+  "fixtures_passed": 3,
+  "min_fixtures": 3,
+  "bare_max": 60,
+  "solo_max": 80,
+  "min_bare_headroom_required": 5,
+  "min_solo_headroom_required": 5,
+  "fixture_count_ok": true,
+  "avg_bare_headroom": 24.0,
+  "min_bare_headroom": 10,
+  "avg_solo_headroom": 8.0,
+  "min_solo_headroom": 5,
+  "rows": [
+    {
+      "fixture": "F16-cli-quote-tax-rules",
+      "status": "PASS",
+      "bare_score": 50,
+      "solo_score": 75,
+      "bare_headroom": 10,
+      "solo_headroom": 5,
+      "reason": ""
+    },
+    {
+      "fixture": "F23-cli-fulfillment-wave",
+      "status": "PASS",
+      "bare_score": 33,
+      "solo_score": 66,
+      "bare_headroom": 27,
+      "solo_headroom": 14,
+      "reason": ""
+    },
+    {
+      "fixture": "F25-cli-cart-promotion-rules",
+      "status": "PASS",
+      "bare_score": 25,
+      "solo_score": 75,
+      "bare_headroom": 35,
+      "solo_headroom": 5,
+      "reason": ""
+    }
+  ]
+}
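
Reviewer sketch for the 46-line headroom artifact above: the aggregate fields (`avg_bare_headroom`, `min_bare_headroom`, `avg_solo_headroom`, `min_solo_headroom`, `fixtures_passed`) can be cross-checked from the per-fixture scores. The function name `summarize` and the `SCORES` mapping are hypothetical, and the headroom formula (`max - score`) is an assumption inferred from the artifact values rather than read from `headroom-gate.py`. The evidence-completeness condition in the rule is not modeled.

```python
# Illustrative recomputation of the headroom aggregates.
# Assumption: headroom = configured max - observed score.
# This is NOT the package's headroom-gate implementation.

BARE_MAX = 60
SOLO_MAX = 80
MIN_HEADROOM = 5

# fixture -> (bare_score, solo_score), copied from the artifact rows
SCORES = {
    "F16-cli-quote-tax-rules": (50, 75),
    "F23-cli-fulfillment-wave": (33, 66),
    "F25-cli-cart-promotion-rules": (25, 75),
}


def summarize(scores):
    """Return the aggregate headroom fields for a set of fixtures."""
    bare = [BARE_MAX - b for b, _ in scores.values()]
    solo = [SOLO_MAX - s for _, s in scores.values()]
    passed = sum(
        1 for b, s in zip(bare, solo)
        if b >= MIN_HEADROOM and s >= MIN_HEADROOM
    )
    return {
        "fixtures_total": len(scores),
        "fixtures_passed": passed,
        "avg_bare_headroom": sum(bare) / len(bare),
        "min_bare_headroom": min(bare),
        "avg_solo_headroom": sum(solo) / len(solo),
        "min_solo_headroom": min(solo),
    }


if __name__ == "__main__":
    for key, value in summarize(SCORES).items():
        print(f"{key}={value}")
```

Two of the three fixtures sit exactly on the 5-point `solo_claude` headroom floor, which the minimum field exposes more clearly than the average does.
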
@@ -0,0 +1,17 @@
+# Headroom Gate — 20260510-f16-f23-f25-combined-proof
+
+Verdict: **PASS**
+
+Fixtures passed: 3/3 (minimum required: 3)
+
+Rule: at least 3 fixtures; bare <= 60 with headroom >= 5, solo_claude <= 80 with headroom >= 5, both baseline arms evidence-complete.
+Average bare headroom: 24.0
+Minimum bare headroom: 10
+Average solo_claude headroom: 8.0
+Minimum solo_claude headroom: 5
+
+| Fixture | Bare | Bare headroom | Solo_claude | Solo_claude headroom | Status | Reason |
+|---|---:|---:|---:|---:|---|---|
+| F16-cli-quote-tax-rules | 50 | 10 | 75 | 5 | PASS | |
+| F23-cli-fulfillment-wave | 33 | 27 | 66 | 14 | PASS | |
+| F25-cli-cart-promotion-rules | 25 | 35 | 75 | 5 | PASS | |
@@ -0,0 +1,303 @@
|
|
|
1
|
+
# Running Real Pair/Solo Benchmarks
|
|
2
|
+
|
|
3
|
+
This document is for benchmark runs that spend real model calls and produce
|
|
4
|
+
judge scores. Use it when a change claims `solo_claude < pair`.
|
|
5
|
+
|
|
6
|
+
For wiring checks that must not invoke providers, use `npx devlyn-cli benchmark
|
|
7
|
+
--dry-run` or the shell tests listed in `README.md`.
|
|
8
|
+
|
|
9
|
+
## Current Score Harness
|
|
10
|
+
|
|
11
|
+
The current full-pipeline comparison has three evidence arms:
|
|
12
|
+
|
|
13
|
+
| Arm | Meaning |
|
|
14
|
+
|---|---|
|
|
15
|
+
| `bare` | control without the devlyn skills |
|
|
16
|
+
| `solo_claude` | Claude-only `/devlyn:resolve` path |
|
|
17
|
+
| `l2_risk_probes` | current measured pair path: Claude implement plus Codex-derived risk probes / pair VERIFY |
|
|
18
|
+
|
|
19
|
+
`l2_gated` is diagnostic replay only. `l2_forced` is retired and rejected by the
|
|
20
|
+
runner because it leaks pair-awareness before IMPLEMENT.
|
|
21
|
+
|
|
22
|
+
The score artifacts that matter are:
|
|
23
|
+
|
|
24
|
+
- `benchmark/auto-resolve/results/<run-id>/<fixture>/judge.json`
|
|
25
|
+
- `benchmark/auto-resolve/results/<run-id>/<fixture>/<arm>/result.json`
|
|
26
|
+
- `benchmark/auto-resolve/results/<run-id>/<fixture>/<arm>/verify.json`
|
|
27
|
+
- `benchmark/auto-resolve/results/<run-id>/full-pipeline-pair-gate.md`
|
|
28
|
+
- `benchmark/auto-resolve/results/<run-id>/full-pipeline-pair-gate.json`
|
|
29
|
+
|
|
30
|
+
Do not treat a score as evidence if the matching arm has a deterministic
|
|
31
|
+
failure, judge disqualifier, missing `diff.patch`, blocked resolve verdict,
|
|
32
|
+
failed verify score, provider invocation failure, or an invalid judge axis cell.
|
|
33
|
+
The matching arms must also appear in `judge.json` `_blind_mapping`; a
|
|
34
|
+
`scores_by_arm` value without the blind slot mapping is not score evidence.
|
|
35
|
+
|
|
36
|
+
## Headroom First
|
|
37
|
+
|
|
38
|
+
Before spending new provider calls, check the active frontier:
|
|
39
|
+
|
|
40
|
+
```bash
|
|
41
|
+
python3 benchmark/auto-resolve/scripts/pair-candidate-frontier.py \
|
|
42
|
+
--out-md /tmp/devlyn-pair-frontier.md
|
|
43
|
+
npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
Only `candidate_unmeasured` fixtures need fresh headroom. Fixtures marked
|
|
47
|
+
`pair_evidence_passed` already have local passing full-pipeline complete pair evidence rows,
|
|
48
|
+
and fixtures marked `rejected` need rework before pair arms. The frontier command
|
|
49
|
+
prints existing complete `bare`, `solo_claude`, `pair`, margin, wall ratio, and run id rows to
|
|
50
|
+
stdout, plus average/minimum pair margin and wall ratio, even when `--out-md`
|
|
51
|
+
or `--out-json` writes an artifact.
|
|
52
|
+
Gate-3 pair-eligible manifests carry both `rejected_excluded` and
|
|
53
|
+
`rejected_excluded_reasons`, so excluded solo-ceiling controls keep their
|
|
54
|
+
registry reason inside the manifest artifact.
|
|
55
|
+
After a headroom failure, run
|
|
56
|
+
`npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json`
|
|
57
|
+
which invokes `audit-headroom-rejections.py` to ensure no active failed fixture
|
|
58
|
+
remains outside both the rejected registry and passing pair evidence, and that
|
|
59
|
+
each active rejected-registry reason is backed by a matching local headroom
|
|
60
|
+
artifact unless it is an explicit calibration/known-limit fixture.
|
|
61
|
+
For release/handoff checks, add `--fail-on-unmeasured` to the frontier command
|
|
62
|
+
to fail when active pair candidates still need headroom measurement.
|
|
63
|
+
Or run the composite provider-free guard:
|
|
64
|
+
|
|
65
|
+
```bash
|
|
66
|
+
npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
|
|
67
|
+
npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
It invokes `pair-candidate-frontier.py --fail-on-unmeasured` and
|
|
71
|
+
`audit-headroom-rejections.py`, writes `audit.json` with the frontier summary, artifact map,
|
|
72
|
+
`frontier.json`, `frontier.stdout`, `frontier.stderr`,
|
|
73
|
+
and compact trigger-backed verdict-bearing `pair_evidence_rows` (each row carries
|
|
74
|
+
`pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`), plus both child JSON reports and child stdout/stderr logs, and prints the existing complete pair score rows
with pair arm, verdict, and trigger reasons from the frontier step. By default it revalidates frontier `verdict: PASS`, zero unmeasured candidates,
requires at least four active fixtures with passing pair evidence, and revalidates `pair_mode: true`,
the default 5-point pair margin, and 3x pair/solo wall ratio. The audit stdout
also prints `headroom_rejections=...`, `pair_evidence_quality=...`,
`pair_trigger_reasons=...`, `pair_evidence_hypotheses=...`, and `pair_evidence_hypothesis_triggers=...` handoff rows, plus
`pair_trigger_historical_aliases=...` when archived evidence includes legacy
trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
hypotheses have not yet propagated into trigger reasons, with rejected-fixture
coverage counts, actual minimum pair margin, maximum pair/solo wall ratio, and
canonical trigger reason coverage plus row-match status. The compact evidence row count must match the frontier evidence count, so incomplete local score artifacts cannot inflate
the claim. The `checks` entries record:

- `checks.frontier_stdout`: summary, aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts
- `checks.headroom_rejections`: child verdict plus unrecorded/unsupported counts
- `checks.pair_evidence_quality`: the same quality thresholds from the compact rows
- `checks.pair_trigger_reasons`: canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status for handoff review
- `checks.pair_evidence_hypotheses`: documented/total pair-evidence hypothesis row counts
- `checks.pair_evidence_hypothesis_triggers`: whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons, plus fixture-level gap details

The markdown frontier
artifact includes the overall verdict plus row-level verdict, pair-arm, and trigger-reason columns.
Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
and include a Markdown `Hypothesis trigger` column, so strict regenerated
evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
archived-evidence WARN rows into release-blocking FAIL rows for newly
regenerated pair evidence.
Historical trigger aliases are reported only for archived artifact review;
current pair-evidence gates fail historical-only or unknown trigger reasons and
require at least one canonical `pair_trigger.reasons` entry.
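
The audit defaults described above can be pictured as a small row filter. This is a minimal sketch, not the audit's actual implementation: the `PairRow` fields, the `audit_rows` name, and the message strings are hypothetical. Only the defaults (four passing fixtures, 5-point pair margin, 3x wall ratio, and WARN versus FAIL for hypothesis-trigger gaps) come from the text.

```python
from dataclasses import dataclass


@dataclass
class PairRow:
    """One compact pair-evidence row. Field names are illustrative only."""
    fixture: str
    solo_score: int
    pair_score: int
    wall_ratio: float                 # pair wall time divided by solo wall time
    pair_mode: bool = True
    has_canonical_reason: bool = True
    hypothesis_documented: bool = True
    has_hypothesis_reason: bool = True


def audit_rows(rows, min_fixtures=4, min_margin=5, max_wall_ratio=3.0,
               require_hypothesis_trigger=False):
    """Return (verdict, failures, warnings) for a list of PairRow values."""
    failures, warnings, passing = [], [], 0
    for row in rows:
        problems = []
        if not row.pair_mode:
            problems.append("pair_mode is not true")
        if row.pair_score - row.solo_score < min_margin:
            problems.append("pair margin below minimum")
        if row.wall_ratio > max_wall_ratio:
            problems.append("pair/solo wall ratio above maximum")
        if not row.has_canonical_reason:
            problems.append("no canonical trigger reason")
        # A documented hypothesis that never became a trigger reason is a gap:
        # a warning for archived evidence, a failure in strict mode.
        if row.hypothesis_documented and not row.has_hypothesis_reason:
            if require_hypothesis_trigger:
                problems.append("hypothesis-trigger gap")
            else:
                warnings.append(f"{row.fixture}: hypothesis-trigger gap")
        if problems:
            failures.extend(f"{row.fixture}: {p}" for p in problems)
        else:
            passing += 1
    if passing < min_fixtures:
        failures.append(f"{passing} passing fixtures, need {min_fixtures}")
    return ("FAIL" if failures else "PASS", failures, warnings)
```

In this sketch the same gap row yields a warning by default and a failure when `require_hypothesis_trigger=True`, mirroring the WARN-to-FAIL switch that `--require-hypothesis-trigger` is described as making.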

Pair lift is not measurable when `bare` or `solo_claude` is already near the
ceiling. Calibrate candidate fixtures first:

```bash
bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh \
  --bare-max 60 \
  --solo-max 80 \
  --min-fixtures 3 \
  F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
```

Equivalent CLI entrypoint:

```bash
npx devlyn-cli benchmark headroom \
  --bare-max 60 \
  --solo-max 80 \
  --min-fixtures 3 \
  F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
```

The runner prints a startup `Gate:` line, the replay `Command:`, and the
headroom markdown report with `bare`/`solo_claude` scores and remaining headroom against
the configured thresholds, including average and minimum headroom for the
candidate set plus fixture pass count. When launched through
`npx devlyn-cli benchmark headroom`, the replay command uses that same package
CLI path. Count a fixture only when `headroom-gate.py` reports
evidence-complete `bare <= 60` and `solo_claude <= 80` with the default minimum 5-point `bare`/`solo_claude` headroom margin. Add `--dry-run` only to validate args,
fixture ids, minimum fixture count, and the replay command; it does not produce
scores. When showing scores, include `bare` headroom and `solo_claude` headroom. A real
headroom run explicitly reports whether the candidate set was accepted or rejected.
Known rejected or ceiling-saturated fixtures are refused by default; use
`--allow-rejected-fixtures` only for diagnostics of still-active rejected
fixtures, not for new pair-evidence candidate selection. Retired fixtures are
preserved for historical artifact replay and are not rerun by the pair-candidate
runners.
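
The counting rule above can be sketched as a single check. This is one reading of the rule, not the logic of `headroom-gate.py` itself: it assumes headroom means the ceiling minus the score and that the 5-point margin applies to both arms. The function name, return shape, and reason strings are hypothetical.

```python
def check_headroom(bare, solo_claude, bare_max=60, solo_max=80,
                   min_headroom=5, evidence_complete=True):
    """Return headroom figures and the reasons a fixture would not count."""
    bare_headroom = bare_max - bare
    solo_headroom = solo_max - solo_claude
    reasons = []
    if not evidence_complete:
        reasons.append("evidence incomplete")
    for arm, room in (("bare", bare_headroom), ("solo_claude", solo_headroom)):
        if room < 0:
            reasons.append(f"{arm} above ceiling")
        elif room < min_headroom:
            reasons.append(f"{arm} headroom below {min_headroom}-point margin")
    return {
        "counted": not reasons,
        "bare_headroom": bare_headroom,
        "solo_headroom": solo_headroom,
        "reasons": reasons,
    }
```

With the F21 scores this README reports (`bare 33`, `solo_claude 66`), the sketch returns headroom of 27 and 14 and counts the fixture. With the F9 scores (bare 60, solo_claude 90), it flags zero `bare` headroom and a `solo_claude` score above the ceiling.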

## Full Pair Measurement

Run the selected pair arm only after headroom passes:

```bash
bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
  --min-fixtures 3 \
  --max-pair-solo-wall-ratio 3 \
  F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
```

Equivalent CLI entrypoint:

```bash
npx devlyn-cli benchmark pair \
  --min-fixtures 3 \
  --max-pair-solo-wall-ratio 3 \
  F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
```

For prompt-only pair changes, reuse an evidence-complete calibration run to avoid
re-spending `bare` and `solo_claude`:

```bash
bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
  --run-id <new-run-id> \
  --reuse-calibrated-from <prior-headroom-run-id> \
  --min-fixtures 3 \
  --max-pair-solo-wall-ratio 3 \
  F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
```

The runner prints startup `Headroom:` and `Pair:` lines, the replay `Command:`,
and the final pair gate report with fixture pass count and average pair margin.
If headroom fails, it reports that the pair arm was not executed. If the final
pair gate fails, it reports that pair evidence was rejected. On success, it
reports that the selected pair arm is executing and then that pair evidence was
accepted. When launched through `npx devlyn-cli benchmark pair`, the replay
command uses that same package CLI path. The pair runner and full-pipeline gate
use the default 3x pair/solo wall ratio unless `--max-pair-solo-wall-ratio` is
overridden for diagnostics. The full-pipeline gate report separates the allowed pair/solo wall ratio from the maximum observed pair/solo wall ratio, records `require_hypothesis_trigger` in JSON, and includes a Markdown `Hypothesis trigger` column. Add
`--dry-run` only to validate args, fixture ids, minimum fixture count, and the
replay command; it does not produce scores. Known rejected or ceiling-saturated
fixtures are refused by default here too; use
`--allow-rejected-fixtures` only for diagnostics of still-active
rejected fixtures. Retired fixtures remain historical replay artifacts and are
not rerun by this candidate runner.
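
The ordering of the three outcomes above (not executed, rejected, accepted) can be sketched as follows. This is an illustrative control flow under stated assumptions, not the runner's code: `run_pair_candidate`, its callable parameters, the status strings, and the messages are all hypothetical.

```python
from typing import Callable, List, Tuple


def run_pair_candidate(headroom_passed: bool,
                       run_pair_arm: Callable[[], None],
                       pair_gate_passed: Callable[[], bool]) -> Tuple[str, List[str]]:
    """Headroom first, pair arm second, pair gate last."""
    if not headroom_passed:
        # The pair arm is never spent on a fixture set that failed headroom.
        return "not_executed", ["headroom failed: pair arm was not executed"]
    messages = ["headroom passed: executing selected pair arm"]
    run_pair_arm()
    if pair_gate_passed():
        messages.append("pair evidence accepted")
        return "accepted", messages
    messages.append("pair evidence rejected")
    return "rejected", messages
```

The point of the shape is that a headroom failure short-circuits before any pair cost is incurred, so "not executed" and "rejected" stay distinct outcomes.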

When showing a real run, report at minimum:

- run id
- fixture id
- fixtures passed / total and `--min-fixtures`
- startup `Headroom:` / `Pair:` gate lines
- `bare`, `solo_claude`, and `l2_risk_probes` scores
- pair minus `solo_claude` margin
- average pair margin for the counted set
- `pair_mode`
- pair trigger eligibility, trigger reasons, canonical-trigger coverage, and `spec.solo_headroom_hypothesis` coverage when the fixture spec has an actionable solo-headroom hypothesis
- pair/solo wall-time ratio
- gate verdict and failure reasons, if any

Example reporting shape:

```text
Run: <run-id>
Fixture      Bare  Solo_claude  Pair  Pair-Solo_claude  Pair mode  Wall pair/solo  Verdict
<fixture-a>  42    65           86    +21               true       1.44x           PASS
<fixture-b>  31    58           82    +24               true       1.48x           PASS
```

Do not summarize a real run as "pair improved" unless the gate passed or the
failure reason is explicitly shown next to the scores.
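
The reporting shape above can be produced from per-fixture records with a few lines of formatting. This is a sketch only: the dictionary keys, the `format_report` name, and the way a failure reason is appended to the verdict are assumptions, not output of the actual gate.

```python
def format_report(run_id, rows):
    """Render pair-gate rows as an aligned plain-text table."""
    header = ("Fixture", "Bare", "Solo_claude", "Pair", "Pair-Solo_claude",
              "Pair mode", "Wall pair/solo", "Verdict")
    table = [header]
    for row in rows:
        margin = row["pair"] - row["solo_claude"]
        verdict = row["verdict"]
        if verdict != "PASS":
            # Keep the failure reason on the same line as the scores.
            verdict = f"{verdict} ({row.get('reason', 'reason not recorded')})"
        table.append((
            row["fixture"], str(row["bare"]), str(row["solo_claude"]),
            str(row["pair"]), f"{margin:+d}", str(row["pair_mode"]).lower(),
            f"{row['wall_ratio']:.2f}x", verdict,
        ))
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    lines = [f"Run: {run_id}"]
    for r in table:
        lines.append("  ".join(c.ljust(w) for c, w in zip(r, widths)).rstrip())
    return "\n".join(lines)
```

Computing the margin from the scores, rather than accepting it as input, keeps the `Pair-Solo_claude` column from drifting out of sync with the `Solo_claude` and `Pair` columns.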

## Existing Evidence

The current measured pair arm is `l2_risk_probes`.

- `20260510-f16-f23-f25-combined-proof` passed the F16/F23/F25 gate with pair
  margins `+21`, `+31`, and `+24`; average pair margin was `+25.3`; average
  pair/solo wall ratio was `1.73x`.
- `20260509-f16-f25-combined-cartprobe-v2` also passes the current gate for
  the F16/F25 subset with pair margins `+21` and `+24`; average pair margin was
  `+22.5`; average pair/solo wall ratio was `1.46x`.
- `20260511-f21-current-riskprobes-v1` passed focused F21 evidence with
  `bare 33`, `solo_claude 66`, `l2_risk_probes 99`, margin `+33`, pair mode
  true, and pair/solo wall ratio `1.47x`; it is counted by `benchmark audit` as the fourth passing pair-evidence row.
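
The averages and the fixture count in the list above can be re-derived from the quoted margins. The snippet below is only an arithmetic check on the numbers as written; `average_margin` is a hypothetical helper, not part of the benchmark tooling.

```python
from statistics import mean


def average_margin(margins):
    """Format the mean pair margin the way the evidence list reports it."""
    return f"{mean(margins):+.1f}"


print(average_margin([21, 31, 24]))  # +25.3 for the F16/F23/F25 run
print(average_margin([21, 24]))      # +22.5 for the F16/F25 subset
print(f"{99 - 66:+d}")               # +33, the F21 pair minus solo_claude margin

# Distinct fixtures with passing pair evidence across the three runs.
passing = {"F16", "F23", "F25"} | {"F16", "F25"} | {"F21"}
print(len(passing))                  # 4
```

The final count of four is why the F21 run matters: it is the row that brings the passing set up to the audit's default minimum of four fixtures.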

F22 and F26 are not pair-lift evidence right now because existing headroom runs
put `solo_claude` near the ceiling.

- F27 is also rejected in its first headroom smoke:
  `20260511-f27-headroom-smoke-061401` measured bare 33 / solo_claude 94, with bare
  verification passing only 1 of 3 commands. Rework or rotate F27 before spending
  a pair arm on it.
- F28 is rejected as pair-lift evidence: earlier unstable runs
  `20260511-f28-headroom-smoke-085307` and `20260511-f28-pair-smoke-091021` were
  superseded after a hidden-oracle bug was found. The oracle had expected a
  defective item to bypass expiration, which the visible spec does not require.
  After re-verifying the same provider diffs against the corrected oracle,
  `20260511-f28-policy-oraclefix-reverified-pair` scored bare 50 / solo_claude 98 /
  `l2_risk_probes` 96, margin -2, and failed headroom. Rework or rotate F28 before
  spending more pair arms.
- F30 is also rejected: `20260511-f30-headroom-v1` scored bare 33 / solo_claude 98, so
  it failed the `solo_claude` headroom precondition before any pair arm should be spent.
- F15 is also rejected: `20260511-f15-concurrency-headroom` scored bare 99 /
  solo_claude 94, so it failed both headroom preconditions and should stay a frozen-diff
  review control unless reworked.
- F3 is also rejected after tightening its HTTP error-body verifier:
  `20260511-f3-http-error-headroom` scored bare 97 / solo_claude 99,
  so it failed both headroom preconditions.
- F2 medium CLI is rejected by `20260512-f2-medium-headroom`: bare 83 /
  solo_claude 95, so both baseline scores exceed headroom ceilings.
- F4 web browser design is rejected by `20260512-f4-web-headroom`: bare 70 /
  solo_claude 92 with bare disqualifiers, so it needs rework before pair arms.
- F5 fix-loop is rejected by `20260512-f5-fixloop-headroom`: bare 99 /
  solo_claude 99, with `bare` and `solo_claude` each passing 5/5 verification commands.
- F6 dep-audit checksum is rejected by `20260512-f6-checksum-headroom`: bare 97 /
  solo_claude 96, with `bare` and `solo_claude` each passing 6/6 verification commands.
- F7 scope discipline is rejected by `20260512-f7-scope-headroom`: bare 99 /
  solo_claude 100, with `bare` and `solo_claude` each passing 6/6 verification commands.
- F9 ideate-to-resolve remains the novice-flow anchor but is rejected as pair
  evidence by `20260512-f9-e2e-headroom`: bare 60 / solo_claude 90 with bare
  headroom 0 and a bare judge disqualifier, despite passing F9 artifact checks.
  Rework it before spending pair arms.
- F1 and F8 are rejected by design as calibration/known-limit controls, not
  pair-lift evidence candidates.
- F10/F11 are also rejected by `20260507-f10-f11-tier1-full-pipeline`: F10 scored
  bare 75 / solo_claude 94, and F11 scored bare 98 / solo_claude 97.
- F12 webhook signature/replay is rejected by `20260511-f12-webhook-headroom`:
  bare 85 / solo_claude 99.
- F31 seat rebalance is rejected by `20260512-f31-seat-rebalance-headroom`: bare
  33 / solo_claude 98, with bare judge/result/verify disqualifiers and
  `solo_claude` passing 3/3 verification commands.
- F32 subscription renewal is rejected by
  `20260512-f32-subscription-renewal-headroom`: bare 33 / solo_claude 98, so it should
  not receive a pair arm unless reworked.
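
The rejections above are easier to scan when grouped by which ceiling each fixture exceeded. The sketch below uses only the scores quoted in this section and assumes the ceilings from the calibration command (`--bare-max 60`, `--solo-max 80`) applied to every run. F22 and F26 (no scores quoted) and F1 and F8 (rejected by design) are left out. The names `REJECTED` and `exceeded_ceilings` are illustrative.

```python
# Headroom scores as quoted above: fixture -> (bare, solo_claude).
REJECTED = {
    "F2": (83, 95), "F3": (97, 99), "F4": (70, 92), "F5": (99, 99),
    "F6": (97, 96), "F7": (99, 100), "F9": (60, 90), "F10": (75, 94),
    "F11": (98, 97), "F12": (85, 99), "F15": (99, 94), "F27": (33, 94),
    "F28": (50, 98), "F30": (33, 98), "F31": (33, 98), "F32": (33, 98),
}


def exceeded_ceilings(bare, solo_claude, bare_max=60, solo_max=80):
    """List the arms whose score is strictly above its ceiling."""
    exceeded = []
    if bare > bare_max:
        exceeded.append("bare")
    if solo_claude > solo_max:
        exceeded.append("solo_claude")
    return exceeded


by_reason = {}
for fixture, (bare, solo) in REJECTED.items():
    key = "+".join(exceeded_ceilings(bare, solo)) or "none"
    by_reason.setdefault(key, []).append(fixture)

for reason, fixtures in by_reason.items():
    print(f"{reason}: {', '.join(fixtures)}")
```

Every listed fixture exceeds the `solo_claude` ceiling, and ten of them also exceed the `bare` ceiling. F9 sits exactly at `bare` 60, so under this plain-ceiling reading it is grouped as `solo_claude` only, even though the text also notes its zero `bare` headroom.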

## Smoke Suite

The top-level benchmark command still exists for broad suite health:

```bash
npx devlyn-cli benchmark
npx devlyn-cli benchmark --judge-only --run-id <ID>
```

This path runs `variant`, `solo_claude`, and `bare` across fixtures, judges
them, compiles `summary.json`, and applies `ship-gate.py`. It is useful for
regression floors and fixture hygiene. For new `solo_claude < pair` claims,
prefer the headroom plus full-pipeline pair gate above because it names the
selected pair arm and enforces `pair_mode`.

## Runtime Perf Artifacts

Every `/devlyn:resolve` run can also archive state into
`.devlyn/runs/<run_id>/pipeline.state.json`. Use those artifacts for wall-time
and phase diagnostics, not as score evidence by themselves.

```bash
for f in .devlyn/runs/*/pipeline.state.json; do
  jq '{run_id, engine: .engine, phases: .phases, risk_profile: .risk_profile}' "$f"
done
```

When `--perf` data is present, include it as secondary cost evidence. If token
counts are absent in the environment, say so; do not infer token savings from
wall-time alone.
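
Where `jq` is not available, the same projection over archived state files can be sketched with the Python standard library. The four keys are the ones the `jq` filter in this section selects. The `summarize_runs` name is hypothetical, and nothing is assumed about the shape of `phases` or `risk_profile` beyond their presence as top-level keys.

```python
import json
from pathlib import Path

# Top-level keys selected by the jq filter in this section.
FIELDS = ("run_id", "engine", "phases", "risk_profile")


def summarize_runs(root=".devlyn/runs"):
    """Return one dict per archived run, keeping only the diagnostic fields."""
    summaries = []
    for state_file in sorted(Path(root).glob("*/pipeline.state.json")):
        state = json.loads(state_file.read_text(encoding="utf-8"))
        # Missing keys become None, matching jq's null for absent fields.
        summaries.append({key: state.get(key) for key in FIELDS})
    return summaries
```

Calling `summarize_runs()` from the repository root and printing each entry with `json.dumps(entry, indent=2)` gives output comparable to the shell loop. As with the loop, treat the result as wall-time and phase diagnostics rather than score evidence.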

## Honest Reporting Rules

- Real score claims must cite the run id and fixture ids.
- A fixture counts only when all measured arms have complete artifacts.
- Headroom failures are not pair failures; they mean the fixture cannot measure
  lift.
- Provider-limit or invocation failures make the affected fixture non-evidence.
- Wall-time ratios are cost signals, not quality scores.
- Dry-runs, lint, and shell tests prove wiring only. They are not benchmark
  scores.
|