devlyn-cli 2.2.0 → 2.2.2
- package/benchmark/auto-resolve/README.md +7 -4
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +0 -1
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +0 -3
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +0 -11
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +0 -10
- package/benchmark/auto-resolve/scripts/run-fixture.sh +23 -3
- package/config/skills/_shared/spec-verify-check.py +51 -0
- package/config/skills/devlyn:resolve/SKILL.md +12 -5
- package/config/skills/devlyn:resolve/references/phases/probe-derive.md +24 -5
- package/package.json +1 -1
- package/scripts/lint-skills.sh +39 -22
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/NOTES.md +0 -24
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/expected.json +0 -66
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/metadata.json +0 -10
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/setup.sh +0 -22
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/spec.md +0 -62
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/task.txt +0 -9
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/verifiers/exact-success.js +0 -48
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/verifiers/insufficient-balance.js +0 -36
- package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/verifiers/rules-source.js +0 -55
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/NOTES.md +0 -20
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/expected.json +0 -66
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/metadata.json +0 -10
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/setup.sh +0 -23
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/spec.md +0 -66
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/task.txt +0 -11
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/verifiers/exact-success.js +0 -44
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/verifiers/rules-source.js +0 -58
- package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/verifiers/unavailable-inventory.js +0 -35
@@ -109,7 +109,8 @@ bash benchmark/auto-resolve/scripts/test-headroom-gate.sh
 ```
 
 After a full-pipeline pair run has the calibrated arms (`bare`,
-`solo_claude`, `l2_gated`) plus a blind `judge.json`, gate
+`solo_claude`, `l2_gated` or `l2_risk_probes`) plus a blind `judge.json`, gate
+it separately:
 
 ```bash
 bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
@@ -143,10 +144,12 @@ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
 ```
 
 This is the full-pipeline claim gate: each counted fixture must satisfy the
-headroom precondition (`bare <= 60`, `solo_claude <= 80`), the
+headroom precondition (`bare <= 60`, `solo_claude <= 80`), the selected pair arm
 must be clean, `pair_mode` must be true in the captured resolve state, and the
-blind judge must score
-`solo_claude`.
+blind judge must score the pair arm at least `--min-pair-margin` above
+`solo_claude`. `l2_risk_probes` is the current measured pair arm for the
+F16/F25 gate: `20260509-f16-f25-combined-cartprobe-v2` passed with margins +21
+and +24, average pair/solo wall ratio 1.46x. When changing this gate, run:
 
 ```bash
 bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
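The gate conditions listed in this README hunk can be sketched in Python. This is only an illustration under stated assumptions: the `FixtureScores` fields, the `passes_pair_gate` name, and the default margin of 5 are hypothetical. The real check lives in `full-pipeline-pair-gate.py`, which this diff does not show.

```python
from dataclasses import dataclass


@dataclass
class FixtureScores:
    """Hypothetical per-fixture record; every field name is an assumption."""
    bare: int          # blind-judge score for the `bare` arm
    solo_claude: int   # blind-judge score for the `solo_claude` arm
    pair: int          # blind-judge score for the selected pair arm
    pair_clean: bool   # selected pair arm finished clean
    pair_mode: bool    # `pair_mode` captured in the resolve state


def passes_pair_gate(s: FixtureScores, min_pair_margin: int = 5) -> bool:
    """Return True when a fixture meets every condition the README lists."""
    headroom_ok = s.bare <= 60 and s.solo_claude <= 80
    margin_ok = s.pair - s.solo_claude >= min_pair_margin
    return headroom_ok and s.pair_clean and s.pair_mode and margin_ok


# A fixture with a +21 margin passes; one without headroom does not.
ok = passes_pair_gate(FixtureScores(50, 70, 91, True, True))
no_headroom = passes_pair_gate(FixtureScores(65, 70, 91, True, True))
```

The headroom check runs independently of the margin check, so a fixture whose `bare` arm already scores above 60 is rejected no matter how large the pair margin is.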
@@ -28,7 +28,6 @@ This is a low-risk edit used to calibrate trivial-tier fixture difficulty.
 - **No silent catches.** If an unknown flag is passed, exit 1 with an informative message (same pattern as the existing `--name` handler).
 - **Surgical diff.** Only touch `bin/cli.js` and `tests/cli.test.js`. Do not reformat unrelated code.
 
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -35,7 +35,6 @@ multiple POSTs arrive close together — no duplicate ids, no lost writes.
 - **No silent catches.** Any `try/catch` in the write path must surface failure as `500` with a clear body, not return a fake success.
 - **No hardcoded ids.** Existing baseline ids (1, 2) remain valid; new ids must not collide with any past or present id.
 - **No breaking change** to `GET /items` shape or `GET /items/:id` semantics.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -34,7 +34,6 @@ and the stored list is left exactly as it was before the request.
 - **No silent catches.**
 - **No partial updates.** A batch with N items must produce either N inserts or 0 inserts.
 - **No breaking change** to existing `GET /items` and `GET /items/:id`.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -37,7 +37,6 @@ no trailing newline).
 - **No silent catches.** Errors in the verification path surface as `500` with a clear body.
 - **Use `crypto.timingSafeEqual` for the signature comparison.** A non-constant-time `===` between hex strings leaks information about the true MAC byte-by-byte.
 - **No breaking change** to existing `/items`, `/items/:id`, `/health`.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -35,7 +35,6 @@ The implementation persists state to `data/items.json` and exposes:
 - **No new npm dependencies.** Fix using Express + Node built-ins only.
 - **No silent catches.** Errors surface with explicit status + body, not by returning a fake-success.
 - **Touch only `server/index.js` and `tests/server.test.js`.** Do not modify `data/items.json` shape, `tests/cli.test.js`, or anything outside the server.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -40,7 +40,6 @@ the output must be machine-readable.
 - **No floating-money output.** All public amounts are integer cents.
 - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -37,7 +37,6 @@ inside the CLI itself.
 - **HOME guard.** If `process.env.HOME` is undefined or empty, emit a clear FAIL line ("HOME environment variable is not set") and exit 1.
 - **EACCES handling.** If `readdirSync` fails with EACCES, emit a permission-specific message quoting the offending path. Do not silently return an empty list.
 
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -42,7 +42,6 @@ failure reasons must be deterministic.
 - **No mutation of the input file.**
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
 - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -44,7 +44,6 @@ once, and duplicate ids must not silently corrupt balances.
 - **No mutation of the input file.**
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
 - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -50,7 +50,6 @@ orders must be deterministic.
 - **No mutation of the input file.**
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
 - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -43,7 +43,6 @@ and stdout must stay machine-readable.
 - **No floating-money output.** All public amounts are integer cents.
 - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -45,7 +45,6 @@ public amount must be integer cents.
 - **No floating-money output.** All public amounts are integer cents.
 - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -37,7 +37,6 @@ so existing assertions continue to pass alongside new paging assertions.
 - **No breaking change to `/items/:id`.** The per-item route must keep its current contract (the fixture explicitly does NOT paginate single-item lookups).
 - **Backward-compat note**: clients that previously read `response.items` MUST still get the array at the same key inside the new envelope.
 
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -31,7 +31,6 @@ and italicized — using only the page's own CSS/JS.
 - **No inline JS frameworks.** Stick to the vanilla pattern already in `index.html`.
 - **Accessibility.** Both buttons must have accessible names equal to their visible labels; `#whisper` adds `aria-label="whisper"` only if its visible text differs (it doesn't, so leave it off).
 
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -31,7 +31,6 @@ Implement it so every test passes.
 - **Do not modify `tests/count.test.js`.** If a test looks wrong, that's a signal to revisit the implementation, not the test.
 - **No silent catches.** Errors reading stdin must surface with a clear message (not suppressed).
 
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -30,7 +30,6 @@ already provides everything needed; no external dependency is warranted.
 - **Stream-friendly.** Large files should not be read fully into memory. Use a hash stream (`crypto.createHash('sha256')` + pipe from `fs.createReadStream`).
 - **No silent catches.** File I/O errors must surface with an informative message and the appropriate exit code.
 
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -27,7 +27,6 @@ version without string manipulation. Add a `--format json` flag that makes
 - **Touch only `bin/cli.js` (`version` handler + argument parsing) and `tests/cli.test.js` (new test).** Do NOT modify the `hello` subcommand or any other file.
 - **No silent catches.** Unknown `--format` values must surface an error.
 
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -42,9 +42,6 @@ inside `/devlyn:resolve` (no separate preflight skill in the 2-skill design).
 - **No silent catches.**
 - **Non-git-repo handling.** Do not assume the user is always in a repo.
 
-- **Lifecycle note.** The harness's CLEANUP/VERIFY phases may flip this
-spec's frontmatter `status` after implementation completes — that is
-benchmark lifecycle bookkeeping, not a scope violation.
 
 ## Out of Scope
 
@@ -154,20 +154,9 @@ def analyze(work_dir, scaffold_sha, waivers, fixture_id=None):
     findings = []
     entries = git_diff_status(scaffold_sha, work_dir)
 
-    # Structural exemption: every benchmark fixture has its own spec at
-    # docs/roadmap/phase-*/<fixture_id>.md, and auto-resolve's DOCS phase
-    # Job 1 legitimately flips its frontmatter status. That flip is a
-    # skill feature, not a scope violation — always exempt regardless of
-    # per-fixture waivers.
-    own_spec_globs = []
-    if fixture_id:
-        own_spec_globs.append(f"docs/roadmap/phase-*/{fixture_id}.md")
-
     for status, path in entries:
         if is_waived(path, waivers):
             continue
-        if is_waived(path, own_spec_globs):
-            continue
 
         # Lockfile deletion — only when file existed at scaffold.
         if status == "D" and os.path.basename(path) in LOCKFILE_NAMES:
@@ -173,22 +173,12 @@ def analyze(work_dir_str: str, scaffold_sha: str, tier_c_globs, waivers,
 
     reachable = bfs_trace(seeds, work_dir)
 
-    # Structural exemption: the fixture's own spec file at
-    # docs/roadmap/phase-*/<fixture_id>.md is always authorized — DOCS
-    # phase Job 1 flips its frontmatter status by design. Kept in sync
-    # with oracle-scope-tier-a.py.
-    own_spec_globs = []
-    if fixture_id:
-        own_spec_globs.append(f"docs/roadmap/phase-*/{fixture_id}.md")
-
     findings = []
     for path in sorted(touched):
         if match_any(path, tier_c_globs):
             continue
         if match_any(path, waivers):
             continue
-        if match_any(path, own_spec_globs):
-            continue
         if path in reachable:
             depth, via = reachable[path]
             findings.append({
@@ -595,8 +595,7 @@ fi
 (cd "$WORK_DIR" \
   && git diff "$SCAFFOLD_SHA" --name-only) > "$RESULT_DIR/changed-files.txt" 2>&1 || true
 
-# Deterministic oracles
-# Findings-only at this stage; scoring integration is step 5.
+# Deterministic oracles. Hard/flag findings are merged into verify.json below.
 python3 "$BENCH_ROOT/scripts/oracle-test-fidelity.py" \
   --work "$WORK_DIR" --scaffold "$SCAFFOLD_SHA" \
   > "$RESULT_DIR/oracle-test-fidelity.json" 2>/dev/null || \
@@ -670,7 +669,8 @@ verify_env["BENCH_FIXTURE_DIR"] = os.path.dirname(os.path.abspath(sys.argv[1]))
 
 verify = {"commands": [], "forbidden_pattern_hits": [], "deps_added": 0,
           "max_deps_added": expected.get("max_deps_added", 0),
-          "missing_required_files": [], "forbidden_files_present": []
+          "missing_required_files": [], "forbidden_files_present": [],
+          "oracle_findings": [], "oracle_disqualifier": False}
 
 for vc in expected.get("verification_commands", []):
     try:
@@ -766,11 +766,29 @@ verify["commands_passed"] = passed
 verify["commands_total"] = total
 verify["verify_score"] = (passed / total) if total else 1.0
 
+for oracle_file in (
+    "oracle-scope-tier-a.json",
+    "oracle-scope-tier-b.json",
+    "oracle-test-fidelity.json",
+):
+    try:
+        data = json.load(open(os.path.join(result_dir, oracle_file)))
+    except Exception:
+        continue
+    oracle_name = data.get("oracle") or oracle_file.removesuffix(".json")
+    for finding in data.get("findings", []) or []:
+        item = dict(finding)
+        item["oracle"] = oracle_name
+        verify["oracle_findings"].append(item)
+        if item.get("severity") in ("disqualifier", "hard", "flag"):
+            verify["oracle_disqualifier"] = True
+
 verify["disqualifier"] = (
     any(h["severity"] == "disqualifier" for h in verify["forbidden_pattern_hits"])
     or verify["deps_added"] > verify["max_deps_added"]
     or bool(verify["missing_required_files"])
     or bool(verify["forbidden_files_present"])
+    or verify["oracle_disqualifier"]
 )
 
 json.dump(verify, open(os.path.join(result_dir, "verify.json"), "w"), indent=2)
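The merge step added in this hunk can be tried in isolation with the sketch below. It follows the loop in the diff, but the wrapper name `merge_oracle_findings`, the `demo` helper, and the `path` field in the sample finding are hypothetical. It also narrows the caught exceptions to `OSError` and `json.JSONDecodeError`, whereas the released code catches `Exception`.

```python
import json
import os
import tempfile

# Severities that the diff treats as disqualifying.
BLOCKING_SEVERITIES = ("disqualifier", "hard", "flag")


def merge_oracle_findings(result_dir: str, oracle_files: tuple) -> dict:
    """Collect findings from oracle JSON files into a verify-style dict."""
    verify = {"oracle_findings": [], "oracle_disqualifier": False}
    for oracle_file in oracle_files:
        try:
            with open(os.path.join(result_dir, oracle_file)) as fh:
                data = json.load(fh)
        except (OSError, json.JSONDecodeError):
            continue  # a missing or unreadable oracle output is skipped
        oracle_name = data.get("oracle") or oracle_file.removesuffix(".json")
        for finding in data.get("findings") or []:
            item = dict(finding)          # copy so the source dict is untouched
            item["oracle"] = oracle_name  # record which oracle produced it
            verify["oracle_findings"].append(item)
            if item.get("severity") in BLOCKING_SEVERITIES:
                verify["oracle_disqualifier"] = True
    return verify


def demo() -> dict:
    """Write one sample oracle file and merge it; the second file is absent."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = {"findings": [{"severity": "hard", "path": "README.md"}]}
        with open(os.path.join(tmp, "oracle-scope-tier-a.json"), "w") as fh:
            json.dump(sample, fh)
        return merge_oracle_findings(
            tmp, ("oracle-scope-tier-a.json", "oracle-test-fidelity.json"))
```

Because the sample file has no `oracle` key, the oracle name falls back to the file name without its `.json` suffix. `str.removesuffix` requires Python 3.9 or later.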
@@ -861,6 +879,8 @@ result = {
     "arm": arm,
     "run_id": run_id,
     "disqualifier": verify.get("disqualifier", False),
+    "oracle_disqualifier": verify.get("oracle_disqualifier", False),
+    "oracle_findings_count": len(verify.get("oracle_findings", [])),
     "verify_score": verify.get("verify_score", 0.0),
     "commands_passed": verify.get("commands_passed", 0),
     "commands_total": verify.get("commands_total", 0),
@@ -77,6 +77,14 @@ JSON_FENCE_RE = re.compile(r'(?ms)^```json[ \t]*\n(.*?)\n```[ \t]*$')
 FORBIDDEN_RISK_PROBE_CMD_RE = re.compile(
     r'BENCH_FIXTURE_DIR|benchmark/auto-resolve/fixtures|/verifiers/|verifiers/'
 )
+EXTERNAL_URL_RE = re.compile(r"https?://([^/\s\"']+)", re.IGNORECASE)
+LOCAL_URL_HOSTS = {
+    'localhost',
+    '127.0.0.1',
+    '0.0.0.0',
+    '[::1]',
+    '::1',
+}
 RISK_PROBE_TAGS = {
     "ordering_inversion",
     "boundary_overlap",
@@ -131,6 +139,15 @@ def extract_verification_text(text: str) -> str:
     return section.group(1) if section else ""
 
 
+def external_url_hosts(text: str) -> list[str]:
+    hosts: list[str] = []
+    for match in EXTERNAL_URL_RE.finditer(text or ''):
+        host = match.group(1).split('@')[-1].split(':')[0].lower()
+        if host not in LOCAL_URL_HOSTS and host not in hosts:
+            hosts.append(host)
+    return hosts
+
+
 def validate_shape(data) -> str | None:
     """Return None if shape matches the canonical verification_commands
     schema; else a human-readable error string.
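To see how the new `external_url_hosts` helper behaves on typical probe commands, the following self-contained sketch copies the regex, host set, and function from the hunks above and runs them on a few sample commands. The commands themselves are made up for illustration.

```python
import re

# Copied from the diff: capture the authority part of any http(s) URL.
EXTERNAL_URL_RE = re.compile(r"https?://([^/\s\"']+)", re.IGNORECASE)
LOCAL_URL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "[::1]", "::1"}


def external_url_hosts(text: str) -> list[str]:
    """Return the distinct non-local hosts referenced by URLs in `text`."""
    hosts: list[str] = []
    for match in EXTERNAL_URL_RE.finditer(text or ""):
        # Drop any userinfo before "@", then cut at the first ":".
        host = match.group(1).split("@")[-1].split(":")[0].lower()
        if host not in LOCAL_URL_HOSTS and host not in hosts:
            hosts.append(host)
    return hosts


local = external_url_hosts("curl -s http://localhost:3000/items")
remote = external_url_hosts(
    "curl https://API.example.com/v1 && curl https://api.example.com/v2")
with_user = external_url_hosts("curl http://user@evil.test:8080/x")
ipv6 = external_url_hosts("curl http://[::1]:3000/")
```

Here `local` is empty, `remote` is `['api.example.com']` (lowercased and deduplicated), and `with_user` is `['evil.test']`. The bracketed IPv6 literal is cut at its first colon, so `ipv6` is `['[']` and does not match the `[::1]` entry in `LOCAL_URL_HOSTS`. A probe that targets `http://[::1]:…` would therefore be reported as external by this logic.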
@@ -189,6 +206,12 @@ def validate_risk_probe(probe: object, index: int, verification_text: str) -> st
             f"risk-probes[{index}].cmd references hidden fixture/verifier paths; "
             "risk probes must derive from visible spec text only"
         )
+    external_hosts = external_url_hosts(cmd)
+    if external_hosts:
+        return (
+            f"risk-probes[{index}].cmd references external URL(s): "
+            f"{', '.join(external_hosts)}; use only worktree-local or localhost resources"
+        )
     if len(cmd) > 4000:
         return f"risk-probes[{index}].cmd exceeds 4000 characters"
     tags = probe.get("tags")
@@ -197,6 +220,15 @@ def validate_risk_probe(probe: object, index: int, verification_text: str) -> st
     unknown_tags = sorted(set(tags) - RISK_PROBE_TAGS)
     if unknown_tags:
         return f"risk-probes[{index}].tags contains unknown tag(s): {', '.join(unknown_tags)}"
+    if "error_contract" in tags and not re.search(
+        r'invalid|stderr|json[ -]?error|error object|exit[ `]*2',
+        derived_from,
+        re.IGNORECASE,
+    ):
+        return (
+            f"risk-probes[{index}].derived_from for error_contract must name "
+            "an invalid-input, stderr, JSON-error, or exit-2 verification bullet"
+        )
     evidence = probe.get("tag_evidence")
     if not isinstance(evidence, dict):
         return f"risk-probes[{index}].tag_evidence must be an object"
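The `error_contract` check added here is a single case-insensitive regex search over `derived_from`. The sketch below isolates that pattern so its accepted and rejected inputs are easy to see. The names `ERROR_CONTRACT_RE` and `names_error_bullet` are hypothetical; only the pattern is taken from the diff.

```python
import re

# Pattern copied from the diff; the constant name is an assumption.
ERROR_CONTRACT_RE = re.compile(
    r"invalid|stderr|json[ -]?error|error object|exit[ `]*2",
    re.IGNORECASE,
)


def names_error_bullet(derived_from: str) -> bool:
    """True when the text looks like an error-path verification bullet."""
    return ERROR_CONTRACT_RE.search(derived_from) is not None


unrelated = names_error_bullet("probe must pass visible marker.")
spec_style = names_error_bullet(
    "emit a visible JSON error to stderr and exit `2`")
plain_exit = names_error_bullet("Exit 2 on malformed input")
plural_exit = names_error_bullet("exits with code 2")
```

The first string is the `derived_from` value the new self-test expects to be rejected, and it does not match. The two error-path bullets match. The last string does not match either: the `exit[ `]*2` alternative allows only spaces and backticks between `exit` and `2`, and none of the other alternatives appear in it.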
@@ -449,6 +481,25 @@ def run_self_test() -> int:
     (devlyn / "risk-probes.jsonl").write_text(json.dumps({
         "id": "P3",
         "derived_from": "probe must pass visible marker.",
+        "cmd": "printf bad-error-derived-from",
+        "exit_code": 0,
+        "tags": ["error_contract"],
+        "tag_evidence": {"error_contract": []},
+    }) + "\n")
+    bad_error_ref = subprocess.run(
+        [sys.executable, script_path, "--validate-risk-probes"],
+        cwd=work,
+        env=env,
+        capture_output=True,
+        text=True,
+    )
+    if bad_error_ref.returncode == 0:
+        print("error_contract with unrelated derived_from was accepted", file=sys.stderr)
+        return 1
+
+    (devlyn / "risk-probes.jsonl").write_text(json.dumps({
+        "id": "P4",
+        "derived_from": "probe must pass visible marker.",
         "cmd": "printf weak-boundary",
         "exit_code": 0,
         "tags": ["boundary_overlap"],
@@ -103,9 +103,16 @@ fixture/verifier paths, previous findings, and harness docs unless excerpted.
 Output: `.devlyn/risk-probes.jsonl`, 1 to 3 JSONL entries. Each entry must be
 one verification command shape plus `id`, `derived_from`, `tags`, and
 `tag_evidence`, where `derived_from` is an exact substring of the visible
-`## Verification`
-
-malformed.
+`## Verification` bullet the command directly exercises. `tag_evidence` must be
+a JSON object keyed by tag, with marker arrays as values; a top-level array or
+tag-only probe is malformed. `ordering_inversion` must include
+`input_order_would_choose_wrong_winner` and `asserts_processing_order_result`;
+`prior_consumption` must include `same_resource_consumed_first` and
+`later_entity_fails_or_reroutes`; `stdout_stderr_contract` and `shape_contract`
+do not require marker strings. Cart/pricing success probes should use
+`shape_contract` unless they satisfy the `ordering_inversion` markers. The probe
+command must not reference external network URLs; use only worktree-local or
+localhost resources.
 For high-complexity specs with multiple behavior bullets, at least one probe
 must be compound: it must exercise two or more visible verification bullets in a
 single command. Empty output is invalid when `--risk-probes` is set.
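As a rough illustration of the entry shape described in this hunk, the sketch below builds one `risk-probes.jsonl` line and checks its `tag_evidence` against the marker rules quoted above. Every value in `probe` is invented, and `check_tag_evidence` is a hypothetical helper, not the validator in `spec-verify-check.py`. It assumes that every tag, including tags with no required markers, still needs an array value.

```python
import json
from typing import Optional

# Required markers per tag, as listed in the added documentation text.
REQUIRED_MARKERS = {
    "ordering_inversion": {
        "input_order_would_choose_wrong_winner",
        "asserts_processing_order_result",
    },
    "prior_consumption": {
        "same_resource_consumed_first",
        "later_entity_fails_or_reroutes",
    },
    "stdout_stderr_contract": set(),
    "shape_contract": set(),
}


def check_tag_evidence(probe: dict) -> Optional[str]:
    """Return an error string, or None when the evidence shape looks valid."""
    evidence = probe.get("tag_evidence")
    if not isinstance(evidence, dict):
        return "tag_evidence must be an object keyed by tag"
    for tag in probe.get("tags", []):
        markers = evidence.get(tag)
        if not isinstance(markers, list):
            return f"tag_evidence[{tag!r}] must be a marker array"
        missing = REQUIRED_MARKERS.get(tag, set()) - set(markers)
        if missing:
            return f"{tag} is missing marker(s): {', '.join(sorted(missing))}"
    return None


# Hypothetical probe; the command and spec bullet are made up.
probe = {
    "id": "P1",
    "derived_from": "higher-priority orders are scheduled first",
    "cmd": "node bin/cli.js schedule orders.json",
    "exit_code": 0,
    "tags": ["ordering_inversion"],
    "tag_evidence": {
        "ordering_inversion": [
            "input_order_would_choose_wrong_winner",
            "asserts_processing_order_result",
        ],
    },
}
jsonl_line = json.dumps(probe)  # one line of .devlyn/risk-probes.jsonl
```

A probe whose `tag_evidence` is a top-level array, or which lists a tag without a matching evidence key, fails the first or second check, in line with the "malformed" cases named in the text.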
@@ -116,7 +123,7 @@ Invocation contract when OTHER engine is Codex:
 
 - Invoke Codex only through the monitored wrapper path in `CODEX_MONITORED_PATH`,
 or `.claude/skills/_shared/codex-monitored.sh` when the env var is absent:
-`bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=
+`bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=high "<probe prompt>"`.
 - Do not run `codex`, `codex exec`, `/Users/.../codex`, or a plugin-provided
 Codex binary directly. A raw Codex child can outlive the phase and makes the
 benchmark run invalid even if `.devlyn/risk-probes.jsonl` is written.
@@ -201,7 +208,7 @@ second pair judge." The only valid skip reasons after a non-empty eligible
|
|
|
201
208
|
trigger are deterministic MECHANICAL HIGH/CRITICAL blockers or Codex
|
|
202
209
|
unavailability proven by the invocation layer.
|
|
203
210
|
|
|
204
|
-
Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. 
Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a HIGH/CRITICAL finding, skip the second judge and record `pair_judge: null`; the fix loop needs the deterministic finding, not duplicate review.
|
|
211
|
+
Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. For cart/pricing specs where visible verification combines duplicate items, line promotions, tax, coupon, and shipping, the success-path probe must include interleaved duplicates plus taxable and non-taxable items and assert full output rows. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. 
When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a HIGH/CRITICAL finding, skip the second judge and record `pair_judge: null`; the fix loop needs the deterministic finding, not duplicate review.
|
|
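The mandatory dominance-loss probe described in the added line can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions: the `schedule` function, the `id`/`priority`/`units` keys, and the single shared `capacity` are hypothetical stand-ins, not anything from devlyn-cli or its fixtures. It only shows the input shape the rule asks for: an earlier low-priority entity that would win under input order, a later higher-priority entity that consumes the resource, and a failing middle entity that must leave later state intact.

```python
def schedule(requests: list[dict], capacity: int) -> dict:
    """Toy priority scheduler (hypothetical): highest priority first, ties by input order."""
    order = sorted(range(len(requests)), key=lambda i: (-requests[i]["priority"], i))
    accepted, rejected = [], []
    remaining = capacity
    for i in order:
        req = requests[i]
        if req["units"] <= remaining:
            remaining -= req["units"]
            accepted.append(req["id"])
        else:
            # A rejected entity must not change the remaining capacity.
            rejected.append(req["id"])
    return {"accepted": accepted, "rejected": rejected, "remaining": remaining}


# Probe input: A comes first and would succeed alone or under input order,
# C arrives later with higher priority, B is the blocked middle entity,
# and D depends on state being uncorrupted after B fails.
probe_input = [
    {"id": "A", "priority": 1, "units": 3},
    {"id": "B", "priority": 5, "units": 6},
    {"id": "C", "priority": 9, "units": 3},
    {"id": "D", "priority": 0, "units": 2},
]
result = schedule(probe_input, capacity=5)
```

Comparing the whole `result` object, rather than one property, is what separates a priority-ordered implementation from an input-ordered one here: input-order processing would accept `A` and `D` instead of `C` and `D`.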
205
212
|
|
|
206
213
|
Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** Codex pair-JUDGE is read-only: invoke `codex-monitored.sh` directly with `-c model_reasoning_effort=medium` for this bounded two-probe review, without piping to `tail`/`head`/`grep`, capture stdout/stderr by direct tool capture or file redirection, require JSONL findings on stdout, and have the orchestrator write `.devlyn/verify.pair.findings.jsonl`. If stdout is first captured as `.devlyn/codex-judge.stdout`, run `python3 .claude/skills/_shared/collect-codex-findings.py` before merge; that script is the deterministic boundary writer for `.devlyn/verify.pair.findings.jsonl`. Raw stdout remains diagnostic only: if stdout contains findings or a non-PASS summary while `.devlyn/verify.pair.findings.jsonl` is empty, `verify-merge-findings.py` blocks VERIFY for `verify.pair.emission-contract`. Do not ask Codex to `apply_patch` or edit `.devlyn`. After primary and pair findings are written, run `python3 .claude/skills/_shared/verify-merge-findings.py --write-state`. Branch only on the merged `state.phases.verify.verdict`; a HIGH/CRITICAL finding from either judge must mechanically become `NEEDS_WORK`. Never write `.devlyn/verify-merged.findings.jsonl` or `.devlyn/verify-merge.summary.json` by hand; `verify-merge-findings.py` is their only writer. State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
|
|
207
214
|
|
|
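The merge rule quoted in this hunk can be sketched in Python. This is an illustration only, not the logic of `verify-merge-findings.py`: the finding keys `severity`, `confidence`, and `behavioral_regression` are assumed names, and `PASS` is assumed as the label for the non-blocking verdict.

```python
from typing import Iterable

BINDING_SEVERITIES = {"HIGH", "CRITICAL"}


def is_verdict_binding(finding: dict) -> bool:
    """Return True when a single finding should force NEEDS_WORK."""
    severity = str(finding.get("severity", "")).upper()
    if severity in BINDING_SEVERITIES:
        return True
    # MEDIUM binds only when it is high-confidence AND names a concrete regression.
    return (
        severity == "MEDIUM"
        and str(finding.get("confidence", "")).lower() == "high"
        and bool(finding.get("behavioral_regression", False))
    )


def merge_verdict(primary: Iterable[dict], pair: Iterable[dict]) -> str:
    """Merge both judges' findings; either judge alone can block."""
    findings = list(primary) + list(pair)
    if any(is_verdict_binding(f) for f in findings):
        return "NEEDS_WORK"
    # Disagreement on lower-severity advisory findings does not change the verdict.
    return "PASS"
```

In this sketch, disagreement between the two judges never needs special handling, because a binding finding from either side is sufficient on its own.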
@@ -82,7 +82,8 @@ observable success, not by internal reasoning.
|
|
|
82
82
|
|
|
83
83
|
Each probe must run entirely from the worktree with standard shell/Node/Python
|
|
84
84
|
tools already present in the repo. Use inline temp-file scripts when needed.
|
|
85
|
-
Leave no tracked files behind.
|
|
85
|
+
Leave no tracked files behind. Probe commands must not call external network
|
|
86
|
+
APIs or write to external memory/telemetry services.
|
|
86
87
|
</task>
|
|
87
88
|
|
|
88
89
|
<output>
|
|
@@ -94,13 +95,20 @@ Write `.devlyn/risk-probes.jsonl`. Each line is one JSON object:
|
|
|
94
95
|
|
|
95
96
|
Rules:
|
|
96
97
|
- `derived_from` must be an exact substring of the visible `## Verification`
|
|
97
|
-
|
|
98
|
+
bullet that the command directly exercises. For `error_contract`, use the
|
|
99
|
+
invalid-input/stderr/JSON-error/exit-2 bullet, not a generic test-runner
|
|
100
|
+
bullet.
|
|
98
101
|
- `tags` is required. Use only these shape tags:
|
|
99
102
|
`ordering_inversion`, `boundary_overlap`, `prior_consumption`,
|
|
100
103
|
`rollback_state`, `positive_remaining`, `stdout_stderr_contract`,
|
|
101
104
|
`error_contract`, `shape_contract`.
|
|
102
|
-
- `tag_evidence` is required
|
|
103
|
-
|
|
105
|
+
- `tag_evidence` is required and must be a JSON object keyed by tag, never a
|
|
106
|
+
top-level array. For these tags, include every listed evidence marker in the
|
|
107
|
+
tag's array and make the command actually exercise it:
|
|
108
|
+
- Do not emit a shape tag unless the visible `## Verification` text names that
|
|
109
|
+
kind of risk and the command exercises it. In particular, `boundary_overlap`
|
|
110
|
+
is only for visible blocked-interval/window/overlap boundary semantics; do not
|
|
111
|
+
use it for inventory, warehouse, or generic resource constraints.
|
|
104
112
|
- `ordering_inversion`: `input_order_would_choose_wrong_winner`,
|
|
105
113
|
`asserts_processing_order_result`.
|
|
106
114
|
- `boundary_overlap`: `starts_at_blocked_start`, `ends_at_blocked_end`,
|
|
@@ -114,7 +122,18 @@ Rules:
|
|
|
114
122
|
Tags not listed here may use an empty evidence list or be omitted from
|
|
115
123
|
`tag_evidence`.
|
|
116
124
|
- `cmd` must not reference `BENCH_FIXTURE_DIR`, `verifiers/`, benchmark fixture
|
|
117
|
-
paths, hidden oracle files, or files outside the worktree.
|
|
125
|
+
paths, hidden oracle files, external URLs, or files outside the worktree.
|
|
126
|
+
Localhost URLs are allowed only when the visible verification command needs a
|
|
127
|
+
local server.
|
|
128
|
+
- Match the spec's visible input and output key names literally; do not invent
|
|
129
|
+
aliases such as `stock` for `lots`, `order_id` for `id`, or `warehouse_id`
|
|
130
|
+
for `warehouse`.
|
|
131
|
+
- For cart/pricing specs whose visible verification covers duplicate combining,
|
|
132
|
+
multiple line-promotion types, tax, coupon, and shipping, the compound success
|
|
133
|
+
probe must include interleaved duplicate SKUs plus taxable and non-taxable
|
|
134
|
+
items, then assert the full output object and item rows. Use `shape_contract`
|
|
135
|
+
for this probe unless the command also proves the required
|
|
136
|
+
`ordering_inversion` evidence markers.
|
|
118
137
|
- Empty output is invalid when this phase is enabled. If no bounded executable
|
|
119
138
|
probe can be derived, write one JSONL object whose command exits nonzero and
|
|
120
139
|
whose `derived_from` names the blocking verification bullet; BUILD_GATE will
|
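Several of the rules in the two hunks above are mechanical enough to check with a short script. The following is a rough sketch, not the project's own checker: `probe_errors` is a hypothetical helper, the evidence table lists only the `ordering_inversion` markers visible in this diff, and the path and URL checks are deliberately simplified. The field names `derived_from`, `tags`, `tag_evidence`, and `cmd` come from the rules themselves.

```python
import json
import re

ALLOWED_TAGS = {
    "ordering_inversion", "boundary_overlap", "prior_consumption",
    "rollback_state", "positive_remaining", "stdout_stderr_contract",
    "error_contract", "shape_contract",
}
# Only the markers shown in this diff; other tags are not modelled here.
REQUIRED_EVIDENCE = {
    "ordering_inversion": {
        "input_order_would_choose_wrong_winner",
        "asserts_processing_order_result",
    },
}
FORBIDDEN_CMD_PARTS = ("BENCH_FIXTURE_DIR", "verifiers/")
URL_HOST_RE = re.compile(r"https?://([^/\s:]+)")
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def probe_errors(line: str, verification_text: str) -> list[str]:
    """Return rule violations for one JSONL line (an empty list means none found)."""
    probe = json.loads(line)
    if not isinstance(probe, dict):
        return ["line is not a JSON object"]
    errors = []

    derived = probe.get("derived_from")
    if not isinstance(derived, str) or not derived or derived not in verification_text:
        errors.append("derived_from must be an exact substring of the ## Verification text")

    tags = probe.get("tags")
    if not isinstance(tags, list) or not tags:
        errors.append("tags is required")
        tags = []
    for tag in tags:
        if not isinstance(tag, str) or tag not in ALLOWED_TAGS:
            errors.append(f"unknown shape tag: {tag!r}")

    evidence = probe.get("tag_evidence")
    if not isinstance(evidence, dict):
        errors.append("tag_evidence must be a JSON object keyed by tag")
        evidence = {}
    for tag, markers in REQUIRED_EVIDENCE.items():
        if tag in tags:
            got = evidence.get(tag)
            present = {m for m in got if isinstance(m, str)} if isinstance(got, list) else set()
            for marker in sorted(markers - present):
                errors.append(f"{tag} is missing evidence marker {marker}")

    cmd = probe.get("cmd")
    cmd = cmd if isinstance(cmd, str) else ""
    for part in FORBIDDEN_CMD_PARTS:
        if part in cmd:
            errors.append(f"cmd must not reference {part}")
    for host in URL_HOST_RE.findall(cmd):
        if host not in LOCAL_HOSTS:
            errors.append(f"cmd references external URL host {host}")
    return errors
```

The sketch cannot tell whether a localhost URL is actually needed by the visible verification command, or whether `cmd` really exercises the tagged risk; those parts of the rules still need a reviewer or a richer checker.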
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "devlyn-cli",
|
|
3
|
-
"version": "2.2.
|
|
3
|
+
"version": "2.2.2",
|
|
4
4
|
"description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
|
|
5
5
|
"homepage": "https://github.com/fysoul17/devlyn-cli#readme",
|
|
6
6
|
"bin": {
|
package/scripts/lint-skills.sh
CHANGED
|
@@ -24,20 +24,21 @@ bad() { printf ' %s✗%s %s\n' "$red" "$reset" "$1"; fail=1; }
|
|
|
24
24
|
section "Check 1: No mcp__codex-cli__ outside _shared / archive"
|
|
25
25
|
# Legal places: config/skills/_shared/codex-config.md (explicitly says "MCP is not used"),
|
|
26
26
|
# archival snapshots, and tests.
|
|
27
|
-
offenders=$(grep -
|
|
27
|
+
offenders=$(git grep -Il -- 'mcp__codex-cli__' -- \
|
|
28
28
|
config/skills \
|
|
29
29
|
benchmark \
|
|
30
30
|
README.md \
|
|
31
31
|
CLAUDE.md \
|
|
32
|
-
bin/
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
32
|
+
bin/ \
|
|
33
|
+
':!config/skills/_shared/codex-config.md' \
|
|
34
|
+
':!config/skills/roadmap-archival-workspace/**' \
|
|
35
|
+
':!config/skills/devlyn:auto-resolve-workspace/**' \
|
|
36
|
+
':!config/skills/devlyn:ideate-workspace/**' \
|
|
37
|
+
':!config/skills/preflight-workspace/**' \
|
|
38
|
+
':!benchmark/auto-resolve/external/**' \
|
|
39
|
+
':!benchmark/auto-resolve/results/**' \
|
|
40
|
+
':!benchmark/auto-resolve/PILOT-RESULTS*' \
|
|
41
|
+
2>/dev/null || true)
|
|
41
42
|
if [ -z "$offenders" ]; then
|
|
42
43
|
ok "no MCP references in managed files"
|
|
43
44
|
else
|
|
@@ -48,15 +49,20 @@ fi
|
|
|
48
49
|
# 2. No "Requires Codex MCP" prose.
|
|
49
50
|
# ---------------------------------------------------------------------------
|
|
50
51
|
section "Check 2: No 'Requires Codex MCP' prose"
|
|
51
|
-
offenders=$(grep -
|
|
52
|
-
config/skills
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
52
|
+
offenders=$(git grep -Il -- 'Requires Codex MCP\|Codex MCP server\|Codex MCP available\|Codex MCP disconnected' -- \
|
|
53
|
+
config/skills \
|
|
54
|
+
benchmark \
|
|
55
|
+
README.md \
|
|
56
|
+
CLAUDE.md \
|
|
57
|
+
bin/ \
|
|
58
|
+
':!config/skills/roadmap-archival-workspace/**' \
|
|
59
|
+
':!config/skills/devlyn:auto-resolve-workspace/**' \
|
|
60
|
+
':!config/skills/devlyn:ideate-workspace/**' \
|
|
61
|
+
':!config/skills/preflight-workspace/**' \
|
|
62
|
+
':!benchmark/auto-resolve/external/**' \
|
|
63
|
+
':!benchmark/auto-resolve/results/**' \
|
|
64
|
+
':!benchmark/auto-resolve/PILOT-RESULTS*' \
|
|
65
|
+
2>/dev/null || true)
|
|
60
66
|
if [ -z "$offenders" ]; then
|
|
61
67
|
ok "no Codex MCP prose"
|
|
62
68
|
else
|
|
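Checks 1 and 2 now rely on `git grep` with exclusion pathspecs instead of post-filtering. The same pattern can be sketched in Python for readers who want to reuse it outside the shell script. This is an assumption-laden illustration: `build_git_grep_args` and `find_offenders` are hypothetical helpers, and the sketch passes the pattern with `-e` and spells exclusions in the long `:(exclude)` form rather than the script's `:!` shorthand.

```python
import subprocess


def build_git_grep_args(pattern: str, include: list[str], exclude: list[str]) -> list[str]:
    """Build a `git grep` argv that lists matching tracked, non-binary files."""
    args = ["git", "grep", "-I", "-l", "-e", pattern, "--", *include]
    # Exclusions are ordinary pathspecs carrying the `exclude` magic signature.
    args += [f":(exclude){path}" for path in exclude]
    return args


def find_offenders(pattern: str, include: list[str], exclude: list[str], cwd: str = ".") -> list[str]:
    """Run the search; a non-zero exit (for example, no matches) yields an empty list."""
    result = subprocess.run(
        build_git_grep_args(pattern, include, exclude),
        cwd=cwd,
        capture_output=True,
        text=True,
        check=False,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Because `git grep` searches tracked files by default, untracked scratch output never shows up as an offender, which is one practical difference from a plain recursive `grep`.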
@@ -203,6 +209,16 @@ else
|
|
|
203
209
|
bad "spec-verify-check.py risk-probe self-test failed"
|
|
204
210
|
fi
|
|
205
211
|
|
|
212
|
+
section "Check 6e: All-or-nothing probes prove mutable rollback"
|
|
213
|
+
probe_doc="config/skills/devlyn:resolve/references/phases/probe-derive.md"
|
|
214
|
+
if grep -Fq "pre-rejected by a whole-order availability shortcut" "$probe_doc" \
|
|
215
|
+
&& grep -Fq "must allocate a scarce" "$probe_doc" \
|
|
216
|
+
&& grep -Fq "must request the same scarce first-line SKU" "$probe_doc"; then
|
|
217
|
+
ok "all-or-nothing probe contract preserves mutable rollback evidence"
|
|
218
|
+
else
|
|
219
|
+
bad "$probe_doc — missing mutable rollback probe contract"
|
|
220
|
+
fi
|
|
221
|
+
|
|
206
222
|
# ---------------------------------------------------------------------------
|
|
207
223
|
# 8. CRITIC security sub-pass must be native, not Dual.
|
|
208
224
|
# Catches the specific drift where a section updates but a cross-reference doesn't.
|
|
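Check 6e passes only when all three literal phrases are present in the probe document. A small Python equivalent, shown here as a sketch, reports which phrase is missing instead of a single pass/fail. The helper names are hypothetical; the phrases and the document path are the ones in the added lines.

```python
from pathlib import Path

REQUIRED_PHRASES = (
    "pre-rejected by a whole-order availability shortcut",
    "must allocate a scarce",
    "must request the same scarce first-line SKU",
)


def missing_phrases(text: str, phrases: tuple[str, ...] = REQUIRED_PHRASES) -> list[str]:
    """Return the required literal phrases that do not occur in the text."""
    return [phrase for phrase in phrases if phrase not in text]


def check_probe_doc(path: str) -> bool:
    """True when the document at `path` contains every required phrase."""
    return not missing_phrases(Path(path).read_text(encoding="utf-8"))
```

A call such as `check_probe_doc("config/skills/devlyn:resolve/references/phases/probe-derive.md")` would mirror the shell check, assuming it is run from the repository root.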
@@ -431,14 +447,15 @@ fi
|
|
|
431
447
|
# version lives) pass while genuine stale references fail. Excluded scopes:
|
|
432
448
|
# benchmark/auto-resolve/results/ (historical run artifacts, frozen) and
|
|
433
449
|
# scripts/lint-skills.sh itself (carries the pattern in this check).
|
|
434
|
-
stale=$(grep -
|
|
450
|
+
stale=$(git grep -In -- 'F9-e2e-ideate-to-preflight' -- \
|
|
435
451
|
config/skills \
|
|
436
452
|
benchmark \
|
|
437
453
|
scripts \
|
|
438
454
|
CLAUDE.md \
|
|
439
|
-
README.md
|
|
455
|
+
README.md \
|
|
456
|
+
':!benchmark/auto-resolve/results/**' \
|
|
457
|
+
2>/dev/null \
|
|
440
458
|
| grep -v '^benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/' \
|
|
441
|
-
| grep -v '^benchmark/auto-resolve/results/' \
|
|
442
459
|
| grep -v '^scripts/lint-skills\.sh:' \
|
|
443
460
|
| grep -v 'fixtures/retired/F9-e2e-ideate-to-preflight' \
|
|
444
461
|
|| true)
|
|
@@ -1,24 +0,0 @@
|
|
|
1
|
-
# F27 CLI gift card redemption
|
|
2
|
-
|
|
3
|
-
## Why this fixture exists
|
|
4
|
-
|
|
5
|
-
F16 showed a valid full-pipeline pair lift when the solo arm implemented the
|
|
6
|
-
happy path but missed the exact validation-error contract. F25 was rejected
|
|
7
|
-
after an oracle correction made solo pass. F26 was rejected because solo reached
|
|
8
|
-
the ceiling.
|
|
9
|
-
|
|
10
|
-
F27 keeps the useful F16 shape but removes checkout tax complexity: success is
|
|
11
|
-
straight integer aggregation, while the risk is the exact failure object after
|
|
12
|
-
combining duplicate card redemption rows before balance validation.
|
|
13
|
-
|
|
14
|
-
## Pair expectation
|
|
15
|
-
|
|
16
|
-
PLAN must preserve the order of aggregation before validation. IMPLEMENT must
|
|
17
|
-
read `data/gift-cards.json` and keep all public amounts in integer cents.
|
|
18
|
-
VERIFY should construct an adversarial request where two individually valid
|
|
19
|
-
redemptions for the same card become invalid only after combination.
|
|
20
|
-
|
|
21
|
-
## Isolation
|
|
22
|
-
|
|
23
|
-
F16 covers quote tax rules. F27 covers non-persistent balance redemption and
|
|
24
|
-
exact validation shape after duplicate aggregation.
|
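The adversarial case this removed note describes, two individually valid redemptions for the same card that become invalid only after combination, can be sketched as follows. Everything here is hypothetical: the `redeem` function, the `card` and `amount_cents` keys, and the shape of the failure object are illustrative and are not the fixture's actual contract or the contents of `data/gift-cards.json`.

```python
from collections import defaultdict


def redeem(balances: dict[str, int], redemptions: list[dict]) -> dict:
    """Aggregate rows per card first, then validate against the balance (integer cents)."""
    totals: dict[str, int] = defaultdict(int)
    for row in redemptions:
        totals[row["card"]] += row["amount_cents"]
    for card, total in totals.items():
        balance = balances.get(card, 0)
        if total > balance:
            # Validation sees the combined amount, not the individual rows.
            return {"ok": False, "card": card, "requested_cents": total, "balance_cents": balance}
    return {"ok": True, "redeemed_cents": sum(totals.values())}


balances = {"GC1": 1000}
# Each row alone fits the balance; together they exceed it.
outcome = redeem(balances, [
    {"card": "GC1", "amount_cents": 600},
    {"card": "GC1", "amount_cents": 600},
])
```

An implementation that validates each row before combining duplicates would accept both rows here, which is the ordering mistake the note asks PLAN and VERIFY to guard against.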