devlyn-cli 2.3.0 → 2.3.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +1 -1
- package/CLAUDE.md +2 -2
- package/README.md +82 -29
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
- package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
- package/benchmark/auto-resolve/README.md +307 -44
- package/benchmark/auto-resolve/RUBRIC.md +23 -14
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
- package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
- package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
- package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
- package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
- package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
- package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
- package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
- package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
- package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
- package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
- package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
- package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
- package/benchmark/auto-resolve/scripts/judge.sh +153 -26
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
- package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
- package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
- package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
- package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
- package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
- package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
- package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
- package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
- package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
- package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
- package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
- package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
- package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
- package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
- package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
- package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
- package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
- package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
- package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
- package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
- package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
- package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
- package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
- package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
- package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
- package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
- package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
- package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
- package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
- package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
- package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
- package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
- package/bin/devlyn.js +211 -18
- package/config/skills/_shared/adapters/README.md +3 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
- package/config/skills/_shared/adapters/opus-4-7.md +9 -1
- package/config/skills/_shared/archive_run.py +78 -6
- package/config/skills/_shared/codex-config.md +3 -2
- package/config/skills/_shared/codex-monitored.sh +46 -1
- package/config/skills/_shared/collect-codex-findings.py +20 -5
- package/config/skills/_shared/engine-preflight.md +1 -1
- package/config/skills/_shared/runtime-principles.md +5 -8
- package/config/skills/_shared/spec-verify-check.py +2664 -107
- package/config/skills/_shared/verify-merge-findings.py +1369 -19
- package/config/skills/devlyn:ideate/SKILL.md +7 -4
- package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
- package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
- package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
- package/config/skills/devlyn:resolve/SKILL.md +49 -18
- package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
- package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
- package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
- package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
- package/package.json +47 -2
- package/scripts/lint-fixtures.sh +349 -0
- package/scripts/lint-shadow-fixtures.sh +58 -0
- package/scripts/lint-skills.sh +3642 -92
- /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-"""Archive
+"""Archive devlyn:resolve run artifacts per references/pipeline-state.md#archive-contract.
 
 Usage:
     python3 scripts/archive_run.py [--devlyn-dir .devlyn]
@@ -16,8 +16,10 @@ from __future__ import annotations
 import argparse
 import json
 import pathlib
+import re
 import shutil
 import sys
+import tempfile
 
 
 PER_RUN_PATTERNS = (
@@ -57,18 +59,30 @@ PER_RUN_PATTERNS = (
     "codex-judge.*",
 )
 
+SAFE_RUN_ID_RE = re.compile(r"^[A-Za-z0-9_.-]+$")
+
+
+def reject_json_constant(token: str) -> None:
+    raise ValueError(f"invalid JSON numeric constant: {token}")
+
+
+def loads_strict_json(text: str):
+    return json.loads(text, parse_constant=reject_json_constant)
+
 
 def read_run_id(devlyn: pathlib.Path) -> str:
     state_path = devlyn / "pipeline.state.json"
     if not state_path.is_file():
         raise SystemExit(f"error: {state_path} not found")
     try:
-        state =
-    except
+        state = loads_strict_json(state_path.read_text(encoding="utf-8"))
+    except ValueError as e:
         raise SystemExit(f"error: {state_path} is not valid JSON: {e}")
     run_id = state.get("run_id")
-    if not run_id:
+    if not isinstance(run_id, str) or not run_id:
         raise SystemExit(f"error: {state_path} has no run_id")
+    if not SAFE_RUN_ID_RE.fullmatch(run_id):
+        raise SystemExit(f"error: {state_path} run_id must match [A-Za-z0-9_.-]+")
     return run_id
 
 
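The hunk above combines two checks: JSON parsing that refuses the non-standard constants `NaN`, `Infinity`, and `-Infinity`, and an allow-list pattern for `run_id`. The following is a minimal standalone sketch of that combination, not the package's code. `parse_run_id` and `is_rejected` are hypothetical helper names, and the sketch raises `ValueError` where the real script raises `SystemExit`.

```python
import json
import re

SAFE_RUN_ID_RE = re.compile(r"^[A-Za-z0-9_.-]+$")


def reject_json_constant(token: str) -> None:
    # json.loads calls this hook for the bare tokens NaN, Infinity and -Infinity.
    raise ValueError(f"invalid JSON numeric constant: {token}")


def parse_run_id(text: str) -> str:
    """Hypothetical helper: return a validated run_id or raise ValueError."""
    state = json.loads(text, parse_constant=reject_json_constant)
    run_id = state.get("run_id") if isinstance(state, dict) else None
    if not isinstance(run_id, str) or not run_id:
        raise ValueError("no run_id")
    if not SAFE_RUN_ID_RE.fullmatch(run_id):
        raise ValueError("run_id must match [A-Za-z0-9_.-]+")
    return run_id


def is_rejected(text: str) -> bool:
    """Hypothetical helper: True when parse_run_id refuses the input."""
    try:
        parse_run_id(text)
    except ValueError:
        return True
    return False


print(parse_run_id('{"run_id": "run-1"}'))      # run-1
print(is_rejected('{"run_id": "../escape"}'))   # True: "/" is outside the allow-list
print(is_rejected('{"run_id": NaN}'))           # True: rejected by the parse_constant hook
print(is_rejected('{"run_id": ".."}'))          # False: dots alone satisfy the pattern
```

The last line shows that the pattern by itself also matches `.` and `..`. Whether that matters depends on how the caller joins the archive path, which this sketch does not model.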
@@ -91,8 +105,8 @@ def prune(runs_dir: pathlib.Path, keep: int = 10) -> int:
         if not state_file.is_file():
             continue
         try:
-            s =
-        except
+            s = loads_strict_json(state_file.read_text(encoding="utf-8"))
+        except ValueError:
             # Can't decide flight-state safely; skip (never prune)
             continue
         verdict = s.get("phases", {}).get("final_report", {}).get("verdict")
@@ -109,11 +123,69 @@ def prune(runs_dir: pathlib.Path, keep: int = 10) -> int:
     return pruned
 
 
+def self_test() -> int:
+    with tempfile.TemporaryDirectory() as tmp:
+        devlyn = pathlib.Path(tmp) / ".devlyn"
+        devlyn.mkdir()
+        (devlyn / "pipeline.state.json").write_text(
+            json.dumps({
+                "run_id": "run-1",
+                "phases": {"final_report": {"verdict": "PASS"}},
+            }) + "\n",
+            encoding="utf-8",
+        )
+        for name in (
+            "risk-probes.jsonl",
+            "verify.pair.findings.jsonl",
+            "verify-merge.summary.json",
+            "codex-judge.stdout",
+            "codex-judge.summary.json",
+        ):
+            (devlyn / name).write_text("{}\n", encoding="utf-8")
+        run_id = read_run_id(devlyn)
+        assert run_id == "run-1", run_id
+        moved = move_artifacts(devlyn, devlyn / "runs" / run_id)
+        assert moved >= 6, moved
+        for name in (
+            "pipeline.state.json",
+            "risk-probes.jsonl",
+            "verify.pair.findings.jsonl",
+            "verify-merge.summary.json",
+            "codex-judge.stdout",
+            "codex-judge.summary.json",
+        ):
+            assert (devlyn / "runs" / run_id / name).is_file(), name
+
+        bad = pathlib.Path(tmp) / "bad"
+        bad.mkdir()
+        (bad / "pipeline.state.json").write_text('{"run_id": "../escape"}\n', encoding="utf-8")
+        try:
+            read_run_id(bad)
+        except SystemExit as exc:
+            assert "run_id must match" in str(exc)
+        else:
+            raise AssertionError("unsafe archive run_id was accepted")
+
+        nan = pathlib.Path(tmp) / "nan"
+        nan.mkdir()
+        (nan / "pipeline.state.json").write_text('{"run_id": NaN}\n', encoding="utf-8")
+        try:
+            read_run_id(nan)
+        except SystemExit as exc:
+            assert "invalid JSON numeric constant: NaN" in str(exc)
+        else:
+            raise AssertionError("NaN archive run_id was accepted")
+    return 0
+
+
 def main() -> int:
     ap = argparse.ArgumentParser(description=__doc__.splitlines()[0])
     ap.add_argument("--devlyn-dir", default=".devlyn")
     ap.add_argument("--keep", type=int, default=10, help="keep N most recent completed runs")
+    ap.add_argument("--self-test", action="store_true")
     args = ap.parse_args()
+    if args.self_test:
+        return self_test()
 
     devlyn = pathlib.Path(args.devlyn_dir)
     if not devlyn.is_dir():
@@ -6,10 +6,10 @@ Single source of truth for how every skill calls Codex. **MCP is not used.** Ski
 
 All long-running Codex calls go through `codex-monitored.sh` — a thin wrapper that closes stdin (codex 0.124.0 hangs when both stdin is open and a prompt arg is given), streams Codex stdout fully (no `tail -n` truncation), and prints a `[codex-monitored] heartbeat` line every 30s so the outer `claude -p` byte-watchdog stays fed during long reasoning gaps. The wrapper passes its arguments through verbatim to the underlying CLI, so the canonical flag set is unchanged from a raw call — only the launcher differs.
 
-**Read-only critique / adversarial review / debate** (
+**Read-only critique / adversarial review / debate** (`/devlyn:resolve` VERIFY pair-mode, plus any future ideate read-only critique). Security review stays native to Claude Code BUILD_GATE. Codex returns findings on stdout; the orchestrator writes files.
 
 ```bash
-bash .claude/skills/_shared/codex-monitored.sh \
+CODEX_MONITORED_ISOLATED=1 bash .claude/skills/_shared/codex-monitored.sh \
   -C <project-root> \
   -s read-only \
   -c model_reasoning_effort=xhigh \
@@ -31,6 +31,7 @@ Notes:
 - `-s read-only` / `--full-auto` — sandbox policy. `--full-auto` = `-s workspace-write` with auto-approval of sandboxed commands.
 - `-c model_reasoning_effort=xhigh` — config override for reasoning depth. Required for deep critique; skills may choose `high` or `medium` when thoroughness doesn't warrant xhigh.
 - **Omit `-m <model>`** — Codex CLI uses its configured flagship (currently `gpt-5.5`, automatically whatever ships next). This is the zero-touch mechanism. Only name `-m` when a role explicitly needs a different model (e.g., `gpt-5.3-codex` for SWE-bench-heavy coding tasks, `gpt-5.3-codex-spark` for speed).
+- `CODEX_MONITORED_ISOLATED=1` — required for bounded read-only critique/probe/judge calls. The wrapper adds `--ignore-user-config --ignore-rules --ephemeral --disable codex_hooks --disable hooks` so user config, AGENTS.md, pyx-memory, hooks, and project rules cannot add hidden context, tool calls, or transcript side effects. Do not set it for workspace-write implementation phases.
 - Raw `codex exec ...` invocations are **forbidden** in skill prompts. The benchmark variant arm runs a PATH shim (`scripts/codex-shim/codex`) that transparently re-routes any raw `codex exec` to the wrapper as a safety net, but skills should always emit the wrapper form directly so the orchestrator's first-attempt has the right shape. Two prior iterations (iter-0006 universal foreground ban, iter-0008 prompt-level kill-shape contract) failed because the orchestrator picked starvation-prone shapes (`codex exec ... 2>&1 | tail -200`) from its own pattern prior — the wrapper plus the shim is the runtime binding layer those iters lacked. See `autoresearch/iterations/0009-wrapper-and-hook.md`.
 
 ## Availability check
@@ -47,6 +47,10 @@
 # CODEX_REAL_BIN when set, else `codex`.
 # Set this when the shim has put us first
 # on PATH.
+# CODEX_MONITORED_ISOLATED — set non-empty for bounded read-only
+# probe/judge calls that must ignore
+# user config, project rules, session
+# persistence, and hook side effects.
 # CODEX_MONITORED_ALLOW_PIPED — set non-empty to skip the pipe-stdout
 # refusal. Reserved for tests; don't use
 # in skill prompts.
@@ -70,6 +74,44 @@ TIMEOUT_SEC="${CODEX_MONITORED_TIMEOUT_SEC:-0}"
 CODEX_BIN="${CODEX_BIN:-${CODEX_REAL_BIN:-codex}}"
 START=$(date +%s)
 TIMEOUT_FLAG=""
+CODEX_ARGS=("$@")
+
+require_nonnegative_int() {
+  local name="$1"
+  local value="$2"
+  case "$value" in
+    ''|*[!0-9]*)
+      printf '[codex-monitored] error: %s must be a non-negative integer (got %s)\n' \
+        "$name" "$value" >&2
+      exit 64
+      ;;
+  esac
+}
+
+require_positive_int() {
+  local name="$1"
+  local value="$2"
+  require_nonnegative_int "$name" "$value"
+  if [ "$value" -le 0 ]; then
+    printf '[codex-monitored] error: %s must be > 0 (got %s)\n' \
+      "$name" "$value" >&2
+    exit 64
+  fi
+}
+
+require_positive_int CODEX_MONITORED_HEARTBEAT "$HEARTBEAT_SEC"
+require_nonnegative_int CODEX_MONITORED_TIMEOUT_SEC "$TIMEOUT_SEC"
+
+if [ -n "${CODEX_MONITORED_ISOLATED:-}" ]; then
+  CODEX_ARGS=(
+    --ignore-user-config
+    --ignore-rules
+    --ephemeral
+    --disable codex_hooks
+    --disable hooks
+    "${CODEX_ARGS[@]}"
+  )
+fi
 
 # --- Pipe-stdout refusal (iter-0009 R2 finding #1) -------------------------
 # `[ -p /dev/stdout ]` is the POSIX test for "is fd 1 a FIFO/pipe". Verified
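The wrapper logic added in this hunk validates the two numeric settings and then, when `CODEX_MONITORED_ISOLATED` is non-empty, prepends the isolation flags to the caller's arguments. A rough Python restatement of that logic follows, for readers who find the shell `case` pattern hard to scan. The function names mirror the shell functions, but `build_codex_args` and the `env` dictionary are illustrative assumptions, and the sketch raises `ValueError` where the script exits with status 64.

```python
import re

# Flags copied from the hunk above; "--disable NAME" is two separate argv words.
ISOLATION_FLAGS = [
    "--ignore-user-config",
    "--ignore-rules",
    "--ephemeral",
    "--disable", "codex_hooks",
    "--disable", "hooks",
]


def require_nonnegative_int(name: str, value: str) -> int:
    # Mirrors the shell pattern ''|*[!0-9]*: reject empty input and any non-digit.
    if not re.fullmatch(r"[0-9]+", value):
        raise ValueError(f"{name} must be a non-negative integer (got {value})")
    return int(value)


def require_positive_int(name: str, value: str) -> int:
    number = require_nonnegative_int(name, value)
    if number <= 0:
        raise ValueError(f"{name} must be > 0 (got {value})")
    return number


def build_codex_args(args: list[str], env: dict[str, str]) -> list[str]:
    # Any non-empty value enables isolation, matching [ -n "${VAR:-}" ].
    if env.get("CODEX_MONITORED_ISOLATED", ""):
        return ISOLATION_FLAGS + list(args)
    return list(args)


print(build_codex_args(["-s", "read-only"], {"CODEX_MONITORED_ISOLATED": "1"}))
print(build_codex_args(["-s", "read-only"], {}))
```

Because the test is only for a non-empty value, setting `CODEX_MONITORED_ISOLATED=0` would still enable isolation. Leaving the variable unset or empty is what turns it off.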
@@ -169,10 +211,13 @@ trap cleanup EXIT
 
 printf '[codex-monitored] start: ts=%s heartbeat=%ds timeout=%ss bin=%s\n' \
   "$(date -u +%FT%TZ)" "$HEARTBEAT_SEC" "$TIMEOUT_SEC" "$CODEX_BIN" >&2
+if [ -n "${CODEX_MONITORED_ISOLATED:-}" ]; then
+  printf '[codex-monitored] isolated=1\n' >&2
+fi
 
 # Launch codex with stdin closed; output streams directly to OUR stdout/stderr.
 set -m
-"$CODEX_BIN" exec "
+"$CODEX_BIN" exec "${CODEX_ARGS[@]}" < /dev/null &
 CODEX_PID=$!
 set +m
 printf '[codex-monitored] codex pid=%d\n' "$CODEX_PID" >&2
@@ -14,6 +14,14 @@ from typing import Any
 FINDING_SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO"}
 
 
+def reject_json_constant(token: str) -> None:
+    raise ValueError(f"invalid JSON numeric constant: {token}")
+
+
+def loads_strict_json(text: str) -> Any:
+    return json.loads(text, parse_constant=reject_json_constant)
+
+
 def atomic_write(path: pathlib.Path, text: str) -> None:
     path.parent.mkdir(parents=True, exist_ok=True)
     with tempfile.NamedTemporaryFile(
@@ -34,8 +42,8 @@ def collect(stdout_path: pathlib.Path) -> tuple[list[dict[str, Any]], dict[str,
             continue
         if raw.startswith("# SUMMARY "):
             try:
-                item =
-            except
+                item = loads_strict_json(raw.removeprefix("# SUMMARY ").strip())
+            except ValueError as exc:
                 raise SystemExit(f"error: invalid SUMMARY JSON at {stdout_path}:{line_no}: {exc}")
             if not isinstance(item, dict):
                 raise SystemExit(f"error: SUMMARY is not an object at {stdout_path}:{line_no}")
@@ -44,8 +52,8 @@ def collect(stdout_path: pathlib.Path) -> tuple[list[dict[str, Any]], dict[str,
         if raw.startswith("#"):
             continue
         try:
-            item =
-        except
+            item = loads_strict_json(raw)
+        except ValueError as exc:
             raise SystemExit(f"error: invalid JSONL at {stdout_path}:{line_no}: {exc}")
         if not isinstance(item, dict):
             raise SystemExit(f"error: JSONL item is not an object at {stdout_path}:{line_no}")
@@ -74,7 +82,14 @@ def self_test() -> int:
         findings, summary = collect(stdout_path)
         write_outputs(findings, summary, out_path, summary_path)
         assert out_path.read_text(encoding="utf-8").count("\n") == 1
-        assert
+        assert loads_strict_json(summary_path.read_text(encoding="utf-8"))["verdict"] == "NEEDS_WORK"
+        stdout_path.write_text('{"id":"nan","severity":NaN}\n', encoding="utf-8")
+        try:
+            collect(stdout_path)
+        except SystemExit as exc:
+            assert "invalid JSON numeric constant: NaN" in str(exc)
+        else:
+            raise AssertionError("NaN Codex stdout finding must not normalize")
         stdout_path.write_text("", encoding="utf-8")
         try:
             collect(stdout_path)
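The `collect` hunks above show three line shapes in Codex stdout: a `# SUMMARY ` line carrying a JSON object, other `#` comment lines that are skipped, and plain JSONL finding objects. The sketch below walks a string through that dispatch with the same strict JSON hook. It is an approximation under stated assumptions: `collect_lines` is a hypothetical name, blank-line skipping and "last summary wins" are guesses about the elided parts of the real function, and errors surface as `ValueError` rather than `SystemExit`.

```python
import json
from typing import Any


def reject_json_constant(token: str) -> None:
    raise ValueError(f"invalid JSON numeric constant: {token}")


def collect_lines(text: str) -> tuple[list[dict[str, Any]], dict[str, Any]]:
    """Hypothetical in-memory variant of collect(): returns (findings, summary)."""
    findings: list[dict[str, Any]] = []
    summary: dict[str, Any] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        raw = line.strip()
        if not raw:
            continue  # assumption: blank lines are ignored
        if raw.startswith("# SUMMARY "):
            payload = raw.removeprefix("# SUMMARY ").strip()
            item = json.loads(payload, parse_constant=reject_json_constant)
            if not isinstance(item, dict):
                raise ValueError(f"SUMMARY is not an object at line {line_no}")
            summary = item  # assumption: the last SUMMARY line wins
            continue
        if raw.startswith("#"):
            continue  # any other comment line is skipped
        item = json.loads(raw, parse_constant=reject_json_constant)
        if not isinstance(item, dict):
            raise ValueError(f"JSONL item is not an object at line {line_no}")
        findings.append(item)
    return findings, summary


sample = "\n".join([
    "# free-form note from the model",
    '{"id": "f1", "severity": "HIGH"}',
    '# SUMMARY {"verdict": "NEEDS_WORK"}',
])
findings, summary = collect_lines(sample)
print(len(findings), summary["verdict"])
```

The `# SUMMARY ` check has to run before the generic `#` check, as it does in the diff. Otherwise the summary line would be discarded as a comment.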
@@ -20,7 +20,7 @@ When a run or phase requires Claude, before spawning that phase:
 
 Never prompt the user mid-pipeline. Missing required engines are explicit BLOCKED states, not silent fallbacks.
 
-Per-skill defaults: `/devlyn:resolve`
+Per-skill defaults: `/devlyn:resolve` uses Claude for PLAN/IMPLEMENT; VERIFY may invoke the OTHER engine when its pair-JUDGE trigger fires. `/devlyn:ideate` defaults to Claude; `--engine` selects the elicitation/normalization adapter, not an automatic cross-model challenge phase. Any future ideate read-only critique must follow `_shared/codex-config.md` isolation rules. Each SKILL.md flag block is source of truth for that skill's default.
 
 ## What a skill must report after a BLOCKED engine check
 
@@ -1,6 +1,6 @@
 # Runtime principles — sub-agent contract
 
-The runtime contract every sub-agent inside `/devlyn:resolve` (PLAN / IMPLEMENT / BUILD_GATE / CLEANUP / VERIFY) and `/devlyn:ideate`
+The runtime contract every sub-agent inside `/devlyn:resolve` (PLAN / IMPLEMENT / BUILD_GATE / CLEANUP / VERIFY) and `/devlyn:ideate` must satisfy. Source of truth for sub-agent behavior on user tasks. NOT for autoresearch-loop / harness-developer concerns (see `autoresearch/PRINCIPLES.md`).
 
 The four sections below mirror the corresponding CLAUDE.md sections (Subtractive-first editing, Goal-locked execution, No-workaround discipline, Evidence over claim). Each section is wrapped in `<!-- runtime-principles:section=NAME:begin -->` / `:end -->` markers in BOTH this file and CLAUDE.md; lint Check 12 (added in iter-0019.A Step 5) extracts each named block from both files and diffs to detect drift.
 
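The context paragraph in this hunk describes a drift check: each named block is extracted from two files by its begin/end markers and the pairs are compared. The actual lint check is not shown in this diff, so the following is only an illustrative sketch of the idea. It takes the `section=NAME` marker syntax from that paragraph at face value, and `extract_sections`, `drifted_sections`, and the sample section names are all hypothetical.

```python
import re

# Marker syntax as quoted in the paragraph above; the name pattern is an assumption.
SECTION_RE = re.compile(
    r"<!-- runtime-principles:section=(?P<name>[\w-]+):begin -->\n"
    r"(?P<body>.*?)"
    r"<!-- runtime-principles:section=(?P=name):end -->",
    re.DOTALL,
)


def extract_sections(text: str) -> dict[str, str]:
    """Map each section name to the text between its begin and end markers."""
    return {m.group("name"): m.group("body") for m in SECTION_RE.finditer(text)}


def drifted_sections(first: str, second: str) -> list[str]:
    """Names whose bodies differ, or that exist in only one of the two texts."""
    a, b = extract_sections(first), extract_sections(second)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


def wrap(name: str, body: str) -> str:
    return (
        f"<!-- runtime-principles:section={name}:begin -->\n"
        f"{body}\n"
        f"<!-- runtime-principles:section={name}:end -->\n"
    )


doc_a = wrap("goal-locked", "Stay on the stated goal.") + wrap("evidence", "Cite a file and line.")
doc_b = wrap("goal-locked", "Stay on the stated goal.") + wrap("evidence", "Cite a file.")
print(drifted_sections(doc_a, doc_b))  # ['evidence']
```

The backreference `(?P=name)` ties each end marker to its own begin marker, so text lying between two different sections is never captured as a body.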
@@ -97,14 +97,11 @@ A finding without one of these forms is excluded. Vague findings produce vague f
 <!-- runtime-principles:contract:end -->
 
 <!-- runtime-principles:consumption:begin -->
-## Consumption
+## Consumption
 
 **Consumers**:
-- `
-- `
+- `devlyn:resolve/SKILL.md` `<harness_principles>` points here as the contract source. Phase prompt bodies inline or reference the operational excerpt needed for each phase.
+- `devlyn:ideate/SKILL.md` consumes this file for spec-shaping and conversation discipline through its own `<harness_principles>` block.
 
-**Codex routing**:
-
-**Non-consumers**:
-- `ideate/SKILL.md` does NOT consume this file. Ideate is planning-layer; its CHALLENGE rubric (`references/challenge-rubric.md`) covers analogous concerns at planning scope, with deliberate one-shot Codex critic discipline.
+**Codex routing**: Codex-routed phases must inline the contract excerpt directly into the prompt body. Bounded read-only Codex critique, probe, or judge calls must also follow `_shared/codex-config.md` isolation rules.
 <!-- runtime-principles:consumption:end -->