@oneie/claude 0.1.0 → 0.2.1
- package/commands/do.md +62 -7
- package/package.json +1 -1
- package/scripts/do-analyze.sh +38 -28
- package/scripts/do-auto.sh +117 -0
- package/scripts/do-recon-cache.sh +50 -0
- package/scripts/do-smoke.sh +7 -5
- package/scripts/w1-recon.ts +139 -0
- package/scripts/w4-rubric.ts +142 -0
package/commands/do.md
CHANGED
|
@@ -31,8 +31,11 @@ Comprehensive and fast are not in tension: the **tier decides which gates run**,
|
|
|
31
31
|
|---|---|---|
|
|
32
32
|
| Tier prune | Step 1 | PATCH walks `code` only; a gate never runs above the phases it guards |
|
|
33
33
|
| Tool ladder | every decision | bash → Haiku → Sonnet → Opus; stop at the first that decides |
|
|
34
|
+
| **Context reset** | cycle boundary | multi-cycle plans run each cycle in a fresh subprocess (`do-auto.sh`) — prior-cycle chatter (~15–25k/cycle) never accumulates; ~90–150k saved on a 6-cycle plan |
|
|
34
35
|
| **TRIVIAL fast-path** | BUILD | **0 agent spawns** — read ≤3 files inline, edit, bash verify, inline rubric, 1 learnings line |
|
|
35
|
-
| Recon cache | W1 |
|
|
36
|
+
| Recon cache | W1 | `do-recon-cache.sh check` — sha-keyed local store, <14d → **0 tokens**, skip all spawns (saves ~12k/recon) |
|
|
37
|
+
| W1 prefix cache | W1 miss | `w1-recon.ts` caches the ~5,400-token rules+agent prefix across Haiku calls (~4,900 saved/repeat) |
|
|
38
|
+
| W4 rubric cache | W4 | `w4-rubric.ts` caches the ~10,900-token rubric+spec block across 6 Haiku calls — 1 write + 5 reads (~46k saved/run) |
|
|
36
39
|
| Verify-only-changed | W3→W4 | dependency cone from `git diff`; full verify only at cycle close |
|
|
37
40
|
| Cross-cycle pre-warm | W4 close | one fire-and-forget Haiku pre-reads the next cycle's W1 targets |
|
|
38
41
|
| Recon cap | W1 | 400-word receipt; high-signal slices → `.w2-spec.json`, not the raw dump |
|
|
@@ -72,9 +75,10 @@ C2·C3·C4 are siblings → their W1/W2 run concurrently and their W3a edits mer
|
|
|
72
75
|
| Input | Treat as | Entry |
|
|
73
76
|
|---|---|---|
|
|
74
77
|
| bare text (`"add usage billing"`) | **IDEA** (default) | walk the spine from the top |
|
|
75
|
-
| `<slug>` or a `plans/*.md` path | existing work | resolve its slug, walk from the first gap |
|
|
76
|
-
| `--auto` | run all cycles of a todo continuously | BUILD engine, trust-aware |
|
|
78
|
+
| `<slug>` or a `plans/*.md` path | existing work | resolve its slug; ≥2 incomplete cycles → auto context-isolated loop, else walk from the first gap inline |
|
|
77
79
|
| `--wave N` | force one wave of a todo | BUILD engine |
|
|
80
|
+
| `--next-cycle` | *(internal)* run one batch then exit — the loop's recursion guard, set by `do-auto.sh`; never typed by a human | BUILD engine |
|
|
81
|
+
| `--auto` | *(legacy alias)* same as a bare multi-cycle `/do <slug>` — the loop is automatic now | BUILD engine, trust-aware |
|
|
78
82
|
|
|
79
83
|
Derive the **slug** from the idea: kebab-case, 2–4 words (`"add usage billing"` → `usage-billing`). Every artifact on the spine is keyed by this slug.
|
|
80
84
|
|
|
@@ -148,7 +152,40 @@ After the spine: **LEARN / close** — write one `learnings.md` entry (slug, tie
|
|
|
148
152
|
|
|
149
153
|
**Plan-outcome kill-switch (zero LLM).** Re-run the plan's `outcome:` command every cycle close. Exit 0 → the goal is met: remaining cycles enter justify-or-drop (default drop), `--auto` halts. Over the tier ceiling → `warn` + justify. Weak rubric 3× → halt, the *plan* is wrong, not the code. Goal-drift (outcome fails 3× with green rubrics) → force a re-plan.
|
|
150
154
|
|
|
151
|
-
|
|
155
|
+
**One command, complexity-sized.** `/do <anything>` is the only thing a human types — an idea, a slug, or a `plans/<slug>-todo.md` path. The tier sizes the work and the spine prunes itself: a typo is edited and verified inline (one cycle, no loop); a feature writes its promise, spec, todo, tests, and docs and then runs every cycle. The human never picks the mode, never types a flag, never runs a second command.
|
|
156
|
+
|
|
157
|
+
**Context isolation is automatic for multi-cycle plans.** When `/do` resolves to a todo with **≥2 incomplete cycles**, it does not run them inline — it hands the loop to the internal engine so every cycle gets a **fresh context**:
|
|
158
|
+
|
|
159
|
+
```bash
|
|
160
|
+
# /do runs this itself — the user never types it
|
|
161
|
+
bun .claude/scripts/do-auto.sh <slug>
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
`do-auto.sh` loops: each iteration spawns a clean `/do <slug> --next-cycle` that runs exactly one batch (W1→W4), ticks its boxes, writes state, and exits. Because each cycle starts from near-zero context, prior-cycle recon/edit/verify chatter never accumulates. The main session just kicks off the loop and reports the close.
|
|
165
|
+
|
|
166
|
+
A **single** incomplete cycle (or a PATCH/FIX) runs inline — there's nothing to isolate from, and spawning a subprocess would cost more than it saves.
|
|
167
|
+
|
|
168
|
+
**`--next-cycle` is internal.** It means "run one batch, tick boxes, exit — do **not** loop." Only `do-auto.sh` passes it; it is the recursion guard that keeps a fresh `/do` from re-entering the loop. A human never types it.
|
|
169
|
+
|
|
170
|
+
State that survives the reset (all on disk — the loop is stateless in memory):
|
|
171
|
+
- `plans/<slug>-todo.md` — checked boxes are the progress bar; the next `/do --next-cycle` reads them and skips completed cycles automatically
|
|
172
|
+
- `.do-trust.json` — trust level and consecutive score history (drives auto-continue vs halt)
|
|
173
|
+
- `.w4-improvements.json` — open items that become mandatory W1 targets next cycle
|
|
174
|
+
- `.w2-spec.json` / `.w3-receipts.json` — write-once state for soft-resume within a cycle
|
|
175
|
+
|
|
176
|
+
What is intentionally discarded each reset: prior-cycle conversation and agent output prose. The checkboxes captured everything that matters; the prose was scaffolding.
|
|
177
|
+
|
|
178
|
+
**Trust gates the loop (reads `.do-trust.json`).** `trusted` (composite ≥ 0.85 × 3+) → next cycle fires immediately. `standard` (0.65–0.85) → continue. `cautious` (< 0.65 × 2) → `do-auto.sh` halts and tells the human to re-run `/do <slug>` once the issue is fixed. Absent file → `standard`.
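The thresholds above can be read as a small classifier over recent composite scores. The sketch below is one interpretation, not the package's implementation: `trust_level` is a hypothetical helper, it takes scores oldest-first as arguments, and it assumes that "× 3+" and "× 2" mean the most recent consecutive scores. The real `.do-trust.json` schema is not shown in this diff beyond its `level` field.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify trust from composite scores (oldest first).
# Assumption: "x 3+" / "x 2" refer to the most recent consecutive scores.
trust_level() {
  [ "$#" -eq 0 ] && { echo "standard"; return; }   # no history -> standard
  printf '%s\n' "$@" | awk '
    { s[NR] = $1 + 0 }
    END {
      hi = 0; lo = 0
      # length of the trailing run of high (>= 0.85) and low (< 0.65) scores
      for (i = NR; i >= 1 && s[i] >= 0.85; i--) hi++
      for (i = NR; i >= 1 && s[i] <  0.65; i--) lo++
      if (hi >= 3)      print "trusted"
      else if (lo >= 2) print "cautious"
      else              print "standard"
    }'
}

trust_level 0.9 0.88 0.91   # three high scores in a row
trust_level 0.9 0.6 0.5     # two low scores in a row
trust_level 0.7 0.8         # neither run is long enough
```

Under this reading a single low score never halts the loop; only a second consecutive one does, which matches the "× 2" qualifier in the text.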
|
|
179
|
+
|
|
180
|
+
**Token math for a 6-cycle COMPLEX plan:**
|
|
181
|
+
|
|
182
|
+
| Mode | Context per cycle | Cumulative context |
|
|
183
|
+
|---|---|---|
|
|
184
|
+
| inline (everything in one session) | +15–25k per cycle | ~90–150k by cycle 6 |
|
|
185
|
+
| context-isolated (the default) | ~1–2k (fresh invocation) | ~1–2k every cycle |
|
|
186
|
+
| **Saving** | — | **~90–150k tokens** |
|
|
187
|
+
|
|
188
|
+
This compounds with the W1 cache and W4 rubric cache: a cycle 6 run with context isolation + both SDK caches costs roughly **1/10th** of the same cycle run inline in a long session.
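The table's arithmetic is cycles multiplied by per-cycle residue. A minimal sketch, assuming the source's 15–25k estimate applies uniformly to every cycle (`inline_context_k` is a hypothetical name):

```bash
#!/usr/bin/env bash
# Hypothetical helper: cumulative inline context (thousands of tokens)
# after N cycles, given a per-cycle residue in thousands.
inline_context_k() {
  local cycles="$1" per_cycle_k="$2"
  echo $(( cycles * per_cycle_k ))
}

low=$(inline_context_k 6 15)    # lower bound of the source's estimate
high=$(inline_context_k 6 25)   # upper bound of the source's estimate
echo "inline by cycle 6: ~${low}-${high}k tokens"
```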
|
|
152
189
|
|
|
153
190
|
---
|
|
154
191
|
|
|
@@ -165,7 +202,19 @@ echo "$RESOLVED" | jq -se 'all(.doc_only)' >/dev/null \
|
|
|
165
202
|
TSC=$(bunx tsc --noEmit 2>&1 | grep -c "error TS" || echo 0) # write .w0-baseline.json {tscErrors, loc, tests}
|
|
166
203
|
```
|
|
167
204
|
|
|
168
|
-
**W1 — Recon.** Read `.w4-improvements.json` open items first — they're mandatory recon targets.
|
|
205
|
+
**W1 — Recon.** Read `.w4-improvements.json` open items first — they're mandatory recon targets.
|
|
206
|
+
|
|
207
|
+
**Cache check (zero tokens):**
|
|
208
|
+
```bash
|
|
209
|
+
.claude/scripts/do-recon-cache.sh check $CYCLE_TARGET_PATHS 2>/dev/null && RECON_CACHE_HIT=true || true
|
|
210
|
+
```
|
|
211
|
+
Hit → load findings from stdout, skip all agent spawns. Also run `do-recon-cache.sh prune` once per session to evict entries older than 14 days.
|
|
212
|
+
|
|
213
|
+
Miss → ≤5 files → read inline. ≥6 → run the SDK script (prompt-cached rules + agent-prompt block, writes result to `.w1-cache/`):
|
|
214
|
+
```bash
|
|
215
|
+
bun .claude/scripts/w1-recon.ts --targets "$FILES" --mode RECON
|
|
216
|
+
```
|
|
217
|
+
Script exits 2 if `ANTHROPIC_API_KEY` is absent → fall back to spawning `w1-recon` agent (Haiku · low) for ALL files in ONE message. Either path: skip `relevance_score < 0.4`; all-filtered → halt and broaden (zero-findings guard). Receipt capped at 400 words; persist high-signal slices into `.w2-spec.json`. *(The `w1-recon` agent carries the two-track existing-code + primitive-inventory contract.)*
|
|
169
218
|
|
|
170
219
|
**W2 — Decide (never delegated — Opus · high).** Write the goal/deliverable/UX gate (3 sentences) first. Then: reconcile names against `dictionary.md`; run the **compress check** before any new primitive (name 3 existing primitives that compose it → `compose` removes it from the diff, `new` needs a one-line justification + same-diff doc edit). The pre-mortem + trade-offs were already captured at the `spec` stop (`template-spec.md`) — carry the failure modes forward as test cases, don't redo them. Classify each W1 finding Act / Keep / Defer. Output diff specs + write `.w2-spec.json` (+ `.w2-doc-plan.json` if a doc trigger fires). *(The `w2-decide` agent carries the compose-target table — which canonical doc to check per primitive type — plus context-triggers for surgical doc injection.)*
|
|
171
220
|
|
|
@@ -178,7 +227,11 @@ $(cycle.demo.command) # the goal gate — exit 0 = pass
|
|
|
178
227
|
# doc-sync gate (reads .w2-doc-plan.json): stale-name=0, links ok, contract mtime current
|
|
179
228
|
DELTA_TSC=$((TSC_NOW - TSC_BASELINE)) # hard gate: ≤ 0, no new type errors ever
|
|
180
229
|
```
|
|
181
|
-
Rubric (inline for TRIVIAL/SIMPLE; for COMPLEX
|
|
230
|
+
Rubric (inline for TRIVIAL/SIMPLE; for COMPLEX — run the SDK script first (6 Haiku in parallel, rubric + spec block cached across all calls):
|
|
231
|
+
```bash
|
|
232
|
+
bun .claude/scripts/w4-rubric.ts --files "$(git diff HEAD --name-only | tr '\n' ',')"
|
|
233
|
+
```
|
|
234
|
+
Script exits 2 if key absent → fall back to spawning 6 Haiku agents · medium. Either path runs alongside `pr-review-toolkit` (code-reviewer, silent-failure-hunter, type-design-analyzer) + `find-bugs` + `accessibility` on UI):
|
|
182
235
|
```
|
|
183
236
|
composite = 0.35·goal-fit + 0.20·security + 0.20·stability + 0.15·simplicity + 0.10·speed
|
|
184
237
|
gate: composite ≥ 0.65 AND goal-fit ≥ 0.50 (hard) AND no adversarial > 0.5 AND delta_tsc ≤ 0
|
|
@@ -206,12 +259,14 @@ PATCH clears {1,2,3,8}. FEATURE clears all eight.
|
|
|
206
259
|
- **Closed loop** — every cycle ends in `mark`/`warn`, never a silent return.
|
|
207
260
|
- **Cheapest tool that decides wins** — a phase that needs no LLM call is the best kind.
|
|
208
261
|
- **W4 max 3 loops**, then halt and report.
|
|
262
|
+
- **One command.** A human types `/do <anything>` and nothing else. Complexity sizing, spine pruning, and (for multi-cycle plans) the context-isolated loop are all automatic — never a flag, never a second command.
|
|
263
|
+
- **Every cycle gets a fresh context.** A multi-cycle `/do` runs each cycle in a clean subprocess via the internal loop. Prior-cycle conversation is noise; the todo checkboxes are the only state that carries forward.
|
|
209
264
|
|
|
210
265
|
---
|
|
211
266
|
|
|
212
267
|
## Available to /do — the toolbox
|
|
213
268
|
|
|
214
|
-
**Scripts** (`.claude/scripts/`): `do-tier.sh` (tier + pruned spine + classifier + ceiling) · `do-folder.sh` (folder-aware verify/build) · `do-survey.sh` (reuse verdict) · `do-reconcile.sh` (substrate dim/verb/dead-name gate) · `do-analyze.sh` (
|
|
269
|
+
**Scripts** (`.claude/scripts/`): `do-auto.sh` (*internal* — the context-isolated loop `/do` drives for multi-cycle plans) · `do-tier.sh` (tier + pruned spine + classifier + ceiling) · `do-folder.sh` (folder-aware verify/build) · `do-survey.sh` (reuse verdict) · `do-reconcile.sh` (substrate dim/verb/dead-name gate) · `do-analyze.sh` (deliverable↔cycle coverage gate) · `do-prove.sh` (surface-detect proof) · `do-smoke.sh` (deterministic outcome) · `w1-recon.ts` (prompt-cached recon) · `w4-rubric.ts` (cached parallel rubric).
|
|
215
270
|
|
|
216
271
|
**Templates**: `text/template-frame.md` (promise) · `plans/template-spec.md` (design + pre-mortem + decisions) · `plans/template-todo.md` (plan + parallel budget + testing policy) · `plans/agent-template.md` (agent definition).
|
|
217
272
|
|
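The W4 composite formula and gate quoted in the `do.md` hunks above can be checked mechanically. The sketch below is an illustration under assumptions: `rubric_gate` is a hypothetical function, scores are passed as positional arguments in the order listed in the comment, and `adversarial` is treated as a single maximum score.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply the composite formula and gate from do.md.
# Args: goal_fit security stability simplicity speed adversarial_max delta_tsc
rubric_gate() {
  awk -v g="$1" -v se="$2" -v st="$3" -v si="$4" -v sp="$5" -v adv="$6" -v dt="$7" '
    BEGIN {
      c = 0.35*g + 0.20*se + 0.20*st + 0.15*si + 0.10*sp
      # every clause of the gate must hold
      pass = (c >= 0.65 && g >= 0.50 && adv <= 0.5 && dt <= 0)
      printf "composite=%.3f gate=%s\n", c, (pass ? "pass" : "fail")
      exit (pass ? 0 : 1)
    }'
}

rubric_gate 0.8 0.9 0.7 0.6 0.5 0.2 0 || true
```

Note that a high composite alone is not enough: a goal-fit below 0.50 or any new type error fails the gate regardless of the other dimensions.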
package/package.json
CHANGED
package/scripts/do-analyze.sh
CHANGED
|
@@ -1,7 +1,15 @@
|
|
|
1
1
|
#!/usr/bin/env bash
|
|
2
2
|
# do-analyze.sh — ANALYZE coverage gate (ex-Spec-Kit /analyze), read-only. Run at the
|
|
3
|
-
# todo→code boundary, BEFORE build spends worktree tokens.
|
|
4
|
-
#
|
|
3
|
+
# todo→code boundary, BEFORE build spends worktree tokens. Verifies the plan against the
|
|
4
|
+
# current template-todo.md contract: a `deliverables:` frontmatter block, and per-cycle
|
|
5
|
+
# `**Deliverable:**` + (`demo:` | `**Cycle outcome:**`) lines.
|
|
6
|
+
#
|
|
7
|
+
# CRITICAL (exit 1, blocks build):
|
|
8
|
+
# - deliverables: frontmatter is empty → plan ships nothing
|
|
9
|
+
# - a ## C# cycle has no **Deliverable:** line → cycle ships nothing (coverage hole)
|
|
10
|
+
# HIGH (warn, does not block):
|
|
11
|
+
# - a cycle has no demo:/Cycle outcome line → AC→test gap (no planned verification)
|
|
12
|
+
# Never edits.
|
|
5
13
|
# Usage: do-analyze.sh <todo.md>
|
|
6
14
|
set -euo pipefail
|
|
7
15
|
todo="${1:?usage: do-analyze.sh <todo.md>}"
|
|
@@ -9,34 +17,36 @@ todo="${1:?usage: do-analyze.sh <todo.md>}"
|
|
|
9
17
|
|
|
10
18
|
crit=0; high=0
|
|
11
19
|
|
|
12
|
-
# 1.
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
+
# 1. plan ships something: deliverables: frontmatter has >=1 real entry (- kind: ...)
|
|
21
|
+
ndel=$(awk '
|
|
22
|
+
/^deliverables:/ {f=1; next}
|
|
23
|
+
f && /^[a-z_]+:/ {f=0}
|
|
24
|
+
f && /^[[:space:]]*-[[:space:]]/ {n++}
|
|
25
|
+
END {print n+0}
|
|
26
|
+
' "$todo")
|
|
27
|
+
if [ "$ndel" -eq 0 ]; then
|
|
28
|
+
echo "CRITICAL: deliverables: frontmatter is empty — plan ships nothing"; crit=1
|
|
29
|
+
fi
|
|
20
30
|
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
31
|
+
# 2. every cycle ships a deliverable AND has a planned test. Walk each ## C# section
|
|
32
|
+
# (header to the next ## header — ### W-waves and **bold** lines stay inside).
|
|
33
|
+
ncyc=0; nodeliv=0; notest=0
|
|
34
|
+
for c in $(grep -oE '^## C[0-9]+' "$todo" | grep -oE 'C[0-9]+'); do
|
|
35
|
+
ncyc=$((ncyc+1))
|
|
36
|
+
block=$(awk -v c="$c" '
|
|
37
|
+
$0 ~ "^## "c"( |$)" {f=1; print; next}
|
|
38
|
+
f && /^## / {exit}
|
|
39
|
+
f {print}
|
|
40
|
+
' "$todo")
|
|
41
|
+
printf '%s' "$block" | grep -qE '^\*\*Deliverable' || {
|
|
42
|
+
echo "CRITICAL: $c has no **Deliverable:** line (cycle ships nothing — coverage hole)"; crit=1; nodeliv=$((nodeliv+1))
|
|
43
|
+
}
|
|
44
|
+
printf '%s' "$block" | grep -qiE 'demo:|^\*\*Cycle outcome' || {
|
|
45
|
+
echo "HIGH: $c has no planned test (demo: or **Cycle outcome:** line)"; high=$((high+1)); notest=$((notest+1))
|
|
46
|
+
}
|
|
37
47
|
done
|
|
38
48
|
|
|
39
49
|
echo "----"
|
|
40
|
-
echo "coverage: $ndel deliverables / $ncyc cycles ·
|
|
50
|
+
echo "coverage: $ndel deliverables / $ncyc cycles · cycles-without-deliverable=$nodeliv · cycles-without-test=$notest · high=$high"
|
|
41
51
|
if [ "$crit" -ne 0 ]; then echo "ANALYZE: CRITICAL — fix coverage before build"; exit 1; fi
|
|
42
|
-
echo "ANALYZE: pass (
|
|
52
|
+
echo "ANALYZE: pass (every cycle ships + is testable)"; exit 0
|
package/scripts/do-auto.sh
|
@@ -0,0 +1,117 @@
|
|
|
1
|
+
#!/usr/bin/env bash
|
|
2
|
+
# do-auto.sh — INTERNAL context-isolated loop for multi-cycle /do plans.
|
|
3
|
+
#
|
|
4
|
+
# Not a user command. `/do <slug>` invokes this itself when the resolved todo has
|
|
5
|
+
# >=2 incomplete cycles. Each cycle runs in a FRESH `claude -p "/do <slug> --next-cycle"`
|
|
6
|
+
# subprocess, resetting the context window to near-zero. State passes entirely through
|
|
7
|
+
# disk: the todo file (checked boxes), .do-trust.json (trust), .w4-improvements.json.
|
|
8
|
+
#
|
|
9
|
+
# Why it exists: run inline, each closed cycle leaves ~15-25k tokens of recon/edit/verify
|
|
10
|
+
# chatter in context that the next cycle has no use for — ~90-150k tokens of noise by
|
|
11
|
+
# cycle 6. Fresh-per-cycle eliminates it. The todo checkboxes are the only carried state.
|
|
12
|
+
#
|
|
13
|
+
# Internal usage (driven by do.md, not a human):
|
|
14
|
+
# bun .claude/scripts/do-auto.sh <slug> [--max-cycles N] [--dry-run]
|
|
15
|
+
set -euo pipefail
|
|
16
|
+
|
|
17
|
+
SLUG=""
|
|
18
|
+
MAX_CYCLES=30
|
|
19
|
+
DRY_RUN=false
|
|
20
|
+
|
|
21
|
+
while [ "$#" -gt 0 ]; do
|
|
22
|
+
case "$1" in
|
|
23
|
+
--max-cycles) MAX_CYCLES="$2"; shift 2 ;;
|
|
24
|
+
--dry-run) DRY_RUN=true; shift ;;
|
|
25
|
+
-*) echo "unknown flag: $1" >&2; exit 1 ;;
|
|
26
|
+
*) SLUG="$1"; shift ;;
|
|
27
|
+
esac
|
|
28
|
+
done
|
|
29
|
+
|
|
30
|
+
[ -z "$SLUG" ] && { echo "usage: do-auto.sh <slug> [--max-cycles N] [--dry-run]" >&2; exit 1; }
|
|
31
|
+
|
|
32
|
+
# Validate slug is safe (kebab-case only) before it touches any shell string.
|
|
33
|
+
# Prevents command injection if the slug ever contains metacharacters.
|
|
34
|
+
if ! printf '%s' "$SLUG" | grep -qE '^[a-zA-Z0-9][a-zA-Z0-9_-]*$'; then
|
|
35
|
+
echo "[do-auto] unsafe slug (must be alphanumeric + hyphens/underscores): $SLUG" >&2; exit 1
|
|
36
|
+
fi
|
|
37
|
+
|
|
38
|
+
# Trust boundary: this script runs in the developer's own workspace, invoked by /do
|
|
39
|
+
# which resolves the slug from a plans/ file the developer controls. The spawned
|
|
40
|
+
# claude subprocess uses --dangerously-skip-permissions because it is non-interactive
|
|
41
|
+
# — it cannot prompt the human for tool approvals. The workspace is the isolation
|
|
42
|
+
# boundary. Do not expose this script to untrusted input or run it in shared environments.
|
|
43
|
+
TODO="plans/${SLUG}-todo.md"
|
|
44
|
+
TRUST=".do-trust.json"
|
|
45
|
+
|
|
46
|
+
[ -f "$TODO" ] || { echo "[do-auto] $TODO not found — run /do $SLUG first" >&2; exit 1; }
|
|
47
|
+
|
|
48
|
+
_remaining() {
|
|
49
|
+
# Count cycles in the Status section that are open ([ ]) or in-flight ([~]) — i.e. not [x].
|
|
50
|
+
# Pattern starts with '\[' (not '-') so grep never mistakes it for an option flag.
|
|
51
|
+
grep -cE '\[[ ~]\] C[0-9]+' "$TODO" 2>/dev/null || echo 0
|
|
52
|
+
}
|
|
53
|
+
|
|
54
|
+
_trust() {
|
|
55
|
+
jq -r '.level // "standard"' "$TRUST" 2>/dev/null || echo "standard"
|
|
56
|
+
}
|
|
57
|
+
|
|
58
|
+
_plan_done() {
|
|
59
|
+
# Plan close item is ticked. -e marks the pattern explicitly (dash-safe).
|
|
60
|
+
grep -qe '\[x\].*Plan outcome command exits 0' "$TODO" 2>/dev/null
|
|
61
|
+
}
|
|
62
|
+
|
|
63
|
+
i=0
|
|
64
|
+
prev_remaining=-1
|
|
65
|
+
stall=0
|
|
66
|
+
while [ "$i" -lt "$MAX_CYCLES" ]; do
|
|
67
|
+
if _plan_done; then
|
|
68
|
+
echo "[do-auto] plan complete — outcome line ticked"
|
|
69
|
+
break
|
|
70
|
+
fi
|
|
71
|
+
|
|
72
|
+
remaining=$(_remaining)
|
|
73
|
+
if [ "$remaining" -eq 0 ]; then
|
|
74
|
+
echo "[do-auto] no open cycles — plan complete"
|
|
75
|
+
break
|
|
76
|
+
fi
|
|
77
|
+
|
|
78
|
+
# Stall guard: a cycle that ticks no box means the subprocess halted/errored.
|
|
79
|
+
# Retrying the identical state forever just burns subprocesses — halt after 2 stalls.
|
|
80
|
+
if [ "$remaining" -eq "$prev_remaining" ]; then
|
|
81
|
+
stall=$((stall + 1))
|
|
82
|
+
if [ "$stall" -ge 2 ]; then
|
|
83
|
+
echo "[do-auto] no progress for 2 iterations (${remaining} cycle(s) stuck) — halting." >&2
|
|
84
|
+
echo "[do-auto] The last cycle ticked no checkbox. Inspect $TODO, fix the blocker, then re-run /do $SLUG." >&2
|
|
85
|
+
exit 1
|
|
86
|
+
fi
|
|
87
|
+
else
|
|
88
|
+
stall=0
|
|
89
|
+
fi
|
|
90
|
+
prev_remaining=$remaining
|
|
91
|
+
|
|
92
|
+
trust=$(_trust)
|
|
93
|
+
if [ "$trust" = "cautious" ]; then
|
|
94
|
+
echo "[do-auto] trust=cautious — halting. Fix the issue, then re-run /do $SLUG."
|
|
95
|
+
exit 1
|
|
96
|
+
fi
|
|
97
|
+
|
|
98
|
+
i=$((i + 1))
|
|
99
|
+
echo ""
|
|
100
|
+
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
|
|
101
|
+
echo "[do-auto] iteration ${i} — ${remaining} cycle(s) remaining — trust=${trust}"
|
|
102
|
+
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
|
|
103
|
+
echo ""
|
|
104
|
+
|
|
105
|
+
if $DRY_RUN; then
|
|
106
|
+
echo "[do-auto] DRY RUN: would invoke — claude --dangerously-skip-permissions -p \"/do $SLUG --next-cycle\""
|
|
107
|
+
break
|
|
108
|
+
fi
|
|
109
|
+
|
|
110
|
+
# Fresh context: no --resume, no --continue. Each cycle is a clean slate.
|
|
111
|
+
claude --dangerously-skip-permissions -p "/do $SLUG --next-cycle"
|
|
112
|
+
done
|
|
113
|
+
|
|
114
|
+
if [ "$i" -ge "$MAX_CYCLES" ]; then
|
|
115
|
+
echo "[do-auto] hit --max-cycles $MAX_CYCLES — halting"
|
|
116
|
+
exit 1
|
|
117
|
+
fi
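One behaviour worth checking in `_remaining`: `grep -c` already prints `0` when nothing matches, and it also exits non-zero in that case, so the `|| echo 0` fallback appends a second `0` to the captured value. The sketch below reproduces this on a throwaway file and shows one possible alternative (`|| true`). Both function names and the sample file are illustrative, and whether the shipped script reaches this path depends on how the todo's boxes are written.

```bash
#!/usr/bin/env bash
# Illustrative only: compare two ways of counting open cycle boxes.
todo=$(mktemp)
printf -- '- [x] C1 done\n- [x] C2 done\n' > "$todo"   # no open cycles left

count_with_echo() { grep -cE '\[[ ~]\] C[0-9]+' "$todo" 2>/dev/null || echo 0; }
count_with_true() { grep -cE '\[[ ~]\] C[0-9]+' "$todo" 2>/dev/null || true; }

a=$(count_with_echo)   # grep prints 0 and exits 1, then echo adds another 0
b=$(count_with_true)   # grep's own 0 is the only output

printf 'with echo: %s line(s)\n' "$(printf '%s\n' "$a" | wc -l | tr -d ' ')"
printf 'with true: %s line(s)\n' "$(printf '%s\n' "$b" | wc -l | tr -d ' ')"
rm -f "$todo"
```

A two-line value makes a later `[ "$remaining" -eq 0 ]` test report an error instead of succeeding, so the "no open cycles" branch would be skipped in that situation.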
|
package/scripts/do-recon-cache.sh
|
@@ -0,0 +1,50 @@
|
|
|
1
|
+
#!/usr/bin/env bash
|
|
2
|
+
# do-recon-cache.sh — sha-keyed W1 recon cache, 14-day TTL, local file store
|
|
3
|
+
#
|
|
4
|
+
# check <file>... → if ALL files are cached and fresh, print findings to stdout, exit 0
|
|
5
|
+
# else exit 1 (caller proceeds to agent spawn)
|
|
6
|
+
# write <sha> <file> → store findings file under CACHE_DIR/<sha>.json
|
|
7
|
+
# prune → delete entries older than MAX_AGE_DAYS
|
|
8
|
+
set -euo pipefail
|
|
9
|
+
|
|
10
|
+
CACHE_DIR="${CACHE_DIR:-.w1-cache}"
|
|
11
|
+
MAX_AGE_DAYS=14
|
|
12
|
+
cmd="${1:-}"; shift || true
|
|
13
|
+
|
|
14
|
+
_sha() { git hash-object "$1" 2>/dev/null || sha256sum "$1" | cut -c1-40; }
|
|
15
|
+
|
|
16
|
+
case "$cmd" in
|
|
17
|
+
check)
|
|
18
|
+
[ "$#" -eq 0 ] && exit 1
|
|
19
|
+
mkdir -p "$CACHE_DIR"
|
|
20
|
+
all_findings=""
|
|
21
|
+
for path in "$@"; do
|
|
22
|
+
[ -f "$path" ] || continue
|
|
23
|
+
sha=$(_sha "$path")
|
|
24
|
+
cache_file="$CACHE_DIR/${sha}.json"
|
|
25
|
+
[ -f "$cache_file" ] || exit 1
|
|
26
|
+
# stale?
|
|
27
|
+
if find "$cache_file" -mtime "+${MAX_AGE_DAYS}" -print 2>/dev/null | grep -q .; then
|
|
28
|
+
rm -f "$cache_file"
|
|
29
|
+
exit 1
|
|
30
|
+
fi
|
|
31
|
+
all_findings="${all_findings}$(cat "$cache_file")"$'\n\n'
|
|
32
|
+
done
|
|
33
|
+
printf '%s' "$all_findings"
|
|
34
|
+
exit 0
|
|
35
|
+
;;
|
|
36
|
+
write)
|
|
37
|
+
sha="$1"; src="$2"
|
|
38
|
+
mkdir -p "$CACHE_DIR"
|
|
39
|
+
cp "$src" "$CACHE_DIR/${sha}.json"
|
|
40
|
+
;;
|
|
41
|
+
prune)
|
|
42
|
+
[ -d "$CACHE_DIR" ] || exit 0
|
|
43
|
+
count=$(find "$CACHE_DIR" -name '*.json' -mtime "+${MAX_AGE_DAYS}" -print -delete | wc -l)
|
|
44
|
+
echo "pruned ${count} stale entries from $CACHE_DIR"
|
|
45
|
+
;;
|
|
46
|
+
*)
|
|
47
|
+
echo "usage: do-recon-cache.sh check <file>... | write <sha> <file> | prune" >&2
|
|
48
|
+
exit 1
|
|
49
|
+
;;
|
|
50
|
+
esac
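The cache key is the content hash of each target file, so editing a file changes its key and turns the next `check` into a miss. A minimal round-trip sketch follows. It reimplements only the keying idea in a temp directory rather than calling the shipped script, and `cache_key`, `cache_put`, and `cache_get` are hypothetical names.

```bash
#!/usr/bin/env bash
# Illustrative sketch of sha-keyed caching; not the package's script.
work=$(mktemp -d)
cache="$work/cache"; mkdir -p "$cache"

# Content hash: git hash-object when available, else a truncated sha256.
cache_key() { git hash-object "$1" 2>/dev/null || sha256sum "$1" | cut -c1-40; }
cache_put() { printf '%s' "$2" > "$cache/$(cache_key "$1").json"; }
cache_get() {
  local f
  f="$cache/$(cache_key "$1").json"
  [ -f "$f" ] && cat "$f"        # non-zero exit signals a miss
}

target="$work/foo.ts"
echo 'export const x = 1;' > "$target"
cache_put "$target" '{"finding":"x is exported"}'

first=$(cache_get "$target") && echo "hit: $first"

echo 'export const x = 2;' > "$target"     # content change -> new key
if cache_get "$target" >/dev/null; then after="hit"; else after="miss"; fi
echo "after edit: $after"
rm -rf "$work"
```

Keying on content rather than path or mtime means a reverted file becomes a hit again, at the cost of leaving orphaned entries behind until `prune` removes them by age.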
|
package/scripts/do-smoke.sh
CHANGED
|
@@ -27,10 +27,12 @@ echo "== C9 do-survey =="
|
|
|
27
27
|
"$S/do-survey.sh" zxqwfoobar 2>/dev/null | grep -q 'VERDICT: build' && ok "novel → build" || no "novel"
|
|
28
28
|
"$S/do-survey.sh" signal 2>/dev/null | grep -qE 'VERDICT: (extend|expose)' && ok "existing → extend/expose" || no "existing"
|
|
29
29
|
|
|
30
|
-
echo "== C14 do-analyze =="
|
|
31
|
-
"$S/do-analyze.sh" plans/
|
|
32
|
-
tmp=$(mktemp); printf 'deliverables:\n -
|
|
33
|
-
"$S/do-analyze.sh" "$tmp" >/dev/null 2>&1 && no "
|
|
30
|
+
echo "== C14 do-analyze (current template format: deliverables: + **Deliverable:** + demo/Cycle outcome) =="
|
|
31
|
+
"$S/do-analyze.sh" plans/tools-router-todo.md >/dev/null 2>&1 && ok "real plan → every cycle ships + testable" || no "real plan coverage"
|
|
32
|
+
tmp=$(mktemp); printf 'deliverables:\n - api: foo.ts — does X (C1)\nsource_of_truth:\n## C1 — y\n(no deliverable line)\n' > "$tmp"
|
|
33
|
+
"$S/do-analyze.sh" "$tmp" >/dev/null 2>&1 && no "cycle w/o deliverable should fail" || ok "cycle ships nothing → CRITICAL exit 1"; rm -f "$tmp"
|
|
34
|
+
tmp=$(mktemp); printf 'deliverables:\n - api: foo.ts — does X (C1)\nsource_of_truth:\n## C1 — y\n**Deliverable:** `foo.ts`\n**Cycle outcome:** vitest passes\n' > "$tmp"
|
|
35
|
+
"$S/do-analyze.sh" "$tmp" >/dev/null 2>&1 && ok "well-formed mini-plan → pass" || no "well-formed should pass"; rm -f "$tmp"
|
|
34
36
|
|
|
35
37
|
echo "== C11 do-prove =="
|
|
36
38
|
pv=$("$S/do-prove.sh" one.ie/web/src/components/X.tsx 2>/dev/null); echo "$pv" | grep -q 'surface: frontend' && ok "frontend → /browser" || no "frontend"
|
|
@@ -46,7 +48,7 @@ echo '{"receiver":"cost:cycle","data":{"tokens":{"input":1},"model":"sonnet","co
|
|
|
46
48
|
echo "== /do-loop removed (single front door) =="
|
|
47
49
|
# scan commands+agents only (the engine surface); the 'no separate' note is the lone allowed mention
|
|
48
50
|
if grep -rn '/do-loop' .claude/commands .claude/agents 2>/dev/null | grep -vq 'no separate'; then no "/do-loop still referenced"; else ok "/do-loop gone (only the 'no separate' note)"; fi
|
|
49
|
-
[ -f .claude/commands/do
|
|
51
|
+
[ -f .claude/commands/do.md ] && ok "do.md is the spec (single front door)" || no "do.md missing"
|
|
50
52
|
|
|
51
53
|
echo "== C5 seed lifecycle (one.ie/web vitest) =="
|
|
52
54
|
if command -v bunx >/dev/null 2>&1; then
|
package/scripts/w1-recon.ts
|
@@ -0,0 +1,139 @@
|
|
|
1
|
+
#!/usr/bin/env bun
|
|
2
|
+
/**
|
|
3
|
+
* w1-recon.ts — W1 recon with Anthropic SDK prompt caching
|
|
4
|
+
*
|
|
5
|
+
* Caches the static rules + agent-prompt block so repeated W1 calls on the
|
|
6
|
+
* same codebase skip re-tokenizing those layers (~1k tokens / call).
|
|
7
|
+
* Also writes findings to .w1-cache/ (sha-keyed) so the next cycle is free.
|
|
8
|
+
*
|
|
9
|
+
* Usage:
|
|
10
|
+
* bun .claude/scripts/w1-recon.ts --targets <file,...> [--mode RECON|SURVEY|INVESTIGATE]
|
|
11
|
+
*
|
|
12
|
+
* Falls back gracefully: if ANTHROPIC_API_KEY is missing, exits 2 (caller uses Agent spawn).
|
|
13
|
+
*/
|
|
14
|
+
|
|
15
|
+
import Anthropic from "@anthropic-ai/sdk";
|
|
16
|
+
import { readFileSync, existsSync, mkdirSync, writeFileSync } from "fs";
|
|
17
|
+
import { execFileSync } from "child_process";
|
|
18
|
+
import { join } from "path";
|
|
19
|
+
|
|
20
|
+
const SCRIPTS_DIR = new URL(".", import.meta.url).pathname;
|
|
21
|
+
const AGENTS_DIR = join(SCRIPTS_DIR, "../agents");
|
|
22
|
+
const RULES_DIR = join(SCRIPTS_DIR, "../rules");
|
|
23
|
+
const CACHE_DIR = ".w1-cache";
|
|
24
|
+
|
|
25
|
+
if (!process.env.ANTHROPIC_API_KEY) {
|
|
26
|
+
process.stderr.write("[w1-recon] no ANTHROPIC_API_KEY — falling back to agent spawn\n");
|
|
27
|
+
process.exit(2);
|
|
28
|
+
}
|
|
29
|
+
|
|
30
|
+
function parseArgs() {
|
|
31
|
+
const args = process.argv.slice(2);
|
|
32
|
+
const out = { targets: [] as string[], mode: "RECON" };
|
|
33
|
+
for (let i = 0; i < args.length; i++) {
|
|
34
|
+
if (args[i] === "--targets" && args[i + 1]) out.targets = args[++i].split(",").filter(Boolean);
|
|
35
|
+
if (args[i] === "--mode" && args[i + 1]) out.mode = args[++i].toUpperCase();
|
|
36
|
+
}
|
|
37
|
+
return out;
|
|
38
|
+
}
|
|
39
|
+
|
|
40
|
+
// Recon is read-only mapping — it needs the locked-rule vocabulary (engine) and
|
|
41
|
+
// doc conventions, NOT the UI rules (astro/react/ui/design) that guide W3 edits.
|
|
42
|
+
// Curating here cuts ~7k tokens of irrelevant context from every recon call AND
|
|
43
|
+
// keeps the merged static block comfortably above Haiku's 4,096-token cache floor.
|
|
44
|
+
const W1_RULES = ["engine.md", "documentation.md"];
|
|
45
|
+
|
|
46
|
+
function loadRules(): string {
|
|
47
|
+
return W1_RULES
|
|
48
|
+
.map(f => join(RULES_DIR, f))
|
|
49
|
+
.filter(existsSync)
|
|
50
|
+
.map(p => readFileSync(p, "utf8"))
|
|
51
|
+
.join("\n\n---\n\n");
|
|
52
|
+
}
|
|
53
|
+
|
|
54
|
+
function loadAgentPrompt(): string {
|
|
55
|
+
const path = join(AGENTS_DIR, "w1-recon.md");
|
|
56
|
+
if (!existsSync(path)) return "";
|
|
57
|
+
return readFileSync(path, "utf8").replace(/^---[\s\S]*?---\n/, "").trim();
|
|
58
|
+
}
|
|
59
|
+
|
|
60
|
+
function fileSha(path: string): string {
|
|
61
|
+
try { return execFileSync("git", ["hash-object", path], { encoding: "utf8" }).trim(); }
|
|
62
|
+
catch { return ""; }
|
|
63
|
+
}
|
|
64
|
+
|
|
65
|
+
function checkCache(shas: string[]): string | null {
|
|
66
|
+
if (!existsSync(CACHE_DIR) || shas.some(s => !s)) return null;
|
|
67
|
+
const entries: string[] = [];
|
|
68
|
+
for (const sha of shas) {
|
|
69
|
+
const p = join(CACHE_DIR, `${sha}.json`);
|
|
70
|
+
if (!existsSync(p)) return null;
|
|
71
|
+
entries.push(readFileSync(p, "utf8"));
|
|
72
|
+
}
|
|
73
|
+
return entries.join("\n\n");
|
|
74
|
+
}
|
|
75
|
+
|
|
76
|
+
function writeCache(sha: string, findings: string) {
|
|
77
|
+
mkdirSync(CACHE_DIR, { recursive: true });
|
|
78
|
+
writeFileSync(join(CACHE_DIR, `${sha}.json`), findings);
|
|
79
|
+
}
|
|
80
|
+
|
|
81
|
+
async function main() {
|
|
82
|
+
const { targets, mode } = parseArgs();
|
|
83
|
+
if (!targets.length) { process.stderr.write("--targets required\n"); process.exit(1); }
|
|
84
|
+
|
|
85
|
+
const shas = targets.map(fileSha);
|
|
86
|
+
const hit = checkCache(shas);
|
|
87
|
+
if (hit) {
|
|
88
|
+
process.stderr.write(`[w1-recon] cache hit (${targets.length} file(s))\n`);
|
|
89
|
+
process.stdout.write(hit);
|
|
90
|
+
return;
|
|
91
|
+
}
|
|
92
|
+
|
|
93
|
+
const client = new Anthropic();
|
|
94
|
+
const rulesText = loadRules();
|
|
95
|
+
const agentPrompt = loadAgentPrompt();
|
|
96
|
+
|
|
97
|
+
const targetContents = targets
|
|
98
|
+
.map(t => {
|
|
99
|
+
try { return `### ${t}\n\`\`\`\n${readFileSync(t, "utf8").slice(0, 4000)}\n\`\`\``; }
|
|
100
|
+
catch { return `### ${t}\n[not found]`; }
|
|
101
|
+
})
|
|
102
|
+
.join("\n\n");
|
|
103
|
+
|
|
104
|
+
const improvements = existsSync(".w4-improvements.json")
|
|
105
|
+
? `\n\n## Open improvements\n\`\`\`json\n${readFileSync(".w4-improvements.json", "utf8")}\n\`\`\``
|
|
106
|
+
: "";
|
|
107
|
+
|
|
108
|
+
// ONE merged cached block. Rules + agent prompt are both static, so a single
|
|
109
|
+
// breakpoint at the end caches the whole prefix. Splitting them would leave the
|
|
110
|
+
// ~1.4k-token agent block below Haiku's 4,096 cache floor — it would never cache.
|
|
111
|
+
const staticPrefix = [rulesText && `# Rules\n\n${rulesText}`, agentPrompt]
|
|
112
|
+
.filter(Boolean)
|
|
113
|
+
.join("\n\n---\n\n");
|
|
114
|
+
const system: Anthropic.TextBlockParam[] = [
|
|
115
|
+
{ type: "text", text: staticPrefix, cache_control: { type: "ephemeral" } } as any,
|
|
116
|
+
];
|
|
117
|
+
|
|
118
|
+
const res = await client.messages.create({
|
|
119
|
+
model: "claude-haiku-4-5-20251001",
|
|
120
|
+
max_tokens: 1024,
|
|
121
|
+
system: system as any,
|
|
122
|
+
messages: [{
|
|
123
|
+
role: "user",
|
|
124
|
+
content: `Mode: ${mode}\n\nTarget files:\n${targetContents}${improvements}\n\nProduce the findings report. Under 400 words. End with the W1 receipt line.`,
|
|
125
|
+
}],
|
|
126
|
+
});
|
|
127
|
+
|
|
128
|
+
const findings = res.content.filter(b => b.type === "text").map(b => (b as Anthropic.TextBlock).text).join("\n");
|
|
129
|
+
|
|
130
|
+
// Write to sha-keyed cache
|
|
131
|
+
shas.filter(Boolean).forEach(sha => writeCache(sha, findings));
|
|
132
|
+
|
|
133
|
+
const u = res.usage as any;
|
|
134
|
+
process.stderr.write(`[w1-recon] tokens in=${u.input_tokens} cache_write=${u.cache_creation_input_tokens ?? 0} cache_read=${u.cache_read_input_tokens ?? 0}\n`);
|
|
135
|
+
|
|
136
|
+
process.stdout.write(findings);
|
|
137
|
+
}
|
|
138
|
+
|
|
139
|
+
main().catch(e => { process.stderr.write(`[w1-recon] error: ${e}\n`); process.exit(1); });
|
|
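The single-breakpoint prefix and the cache usage log in `w1-recon.ts` above can be sketched in isolation. This is a minimal illustration and not the package's code: `buildCachedSystem`, `describeCacheUsage`, and the two local types are hypothetical names introduced here so the block runs without the SDK. Only `cache_control`, `cache_creation_input_tokens`, and `cache_read_input_tokens` come from the script itself.

```typescript
// Hypothetical local types mirroring the fields w1-recon.ts writes and reads.
type CachedTextBlock = {
  type: "text";
  text: string;
  cache_control: { type: "ephemeral" };
};

type UsageLike = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Merge all static parts into ONE block so a single breakpoint covers the whole prefix.
// Empty parts are dropped, matching the .filter(Boolean) step in the script.
function buildCachedSystem(
  parts: Array<string | null | undefined | false>,
): CachedTextBlock[] {
  const text = parts
    .filter((p): p is string => Boolean(p))
    .join("\n\n---\n\n");
  return [{ type: "text", text, cache_control: { type: "ephemeral" } }];
}

// Classify one response's usage: "read" if prefix tokens came from the cache,
// "write" if the prefix was stored on this call, "none" otherwise.
function describeCacheUsage(u: UsageLike): "read" | "write" | "none" {
  if ((u.cache_read_input_tokens ?? 0) > 0) return "read";
  if ((u.cache_creation_input_tokens ?? 0) > 0) return "write";
  return "none";
}

const system = buildCachedSystem(["# Rules\n\nbe terse", "", "agent prompt"]);
console.log(
  system.length,
  describeCacheUsage({ input_tokens: 40, cache_read_input_tokens: 5400 }),
);
```

A result of `"none"` on a repeat call would suggest the merged prefix fell below the model's minimum cacheable length or changed between calls, which is the situation the merged-block comment in the script is guarding against.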
@@ -0,0 +1,142 @@
|
|
|
1
|
+
#!/usr/bin/env bun
|
|
2
|
+
/**
|
|
3
|
+
* w4-rubric.ts — W4 rubric scoring with Anthropic SDK prompt caching
|
|
4
|
+
*
|
|
5
|
+
* Runs 6 Haiku calls in parallel (goal-fit, security, stability, simplicity,
|
|
6
|
+
* speed, adversarial). The static rubric + spec block is cached — each call
|
|
7
|
+
* pays only for its own per-dimension instruction; the diff and file snippets are cached too.
|
|
8
|
+
*
|
|
9
|
+
* Usage:
|
|
10
|
+
* bun .claude/scripts/w4-rubric.ts --files <file,...>
|
|
11
|
+
*
|
|
12
|
+
* Output: JSON to stdout — { composite, gate, dimensions, adversarial }
|
|
13
|
+
* Falls back: exits 2 if ANTHROPIC_API_KEY missing (caller spawns agents).
|
|
14
|
+
*/
|
|
15
|
+
|
|
16
|
+
import Anthropic from "@anthropic-ai/sdk";
|
|
17
|
+
import { readFileSync, existsSync } from "fs";
|
|
18
|
+
import { execFileSync } from "child_process";
|
|
19
|
+
import { join } from "path";
|
|
20
|
+
|
|
21
|
+
if (!process.env.ANTHROPIC_API_KEY) {
|
|
22
|
+
process.stderr.write("[w4-rubric] no ANTHROPIC_API_KEY — falling back to agent spawn\n");
|
|
23
|
+
process.exit(2);
|
|
24
|
+
}
|
|
25
|
+
|
|
26
|
+
const SCRIPTS_DIR = new URL(".", import.meta.url).pathname;
|
|
27
|
+
const RUBRICS_PATH = join(SCRIPTS_DIR, "../../plans/rubrics.md");
|
|
28
|
+
const SPEC_PATH = ".w2-spec.json";
|
|
29
|
+
|
|
30
|
+
const DIMS = [
|
|
31
|
+
{ key: "goal-fit", weight: 0.35, instruction: "Score goal-fit: does the diff advance the plan outcome? Read Goal/deliverable from the spec. 0 = no movement, 1 = fully delivers." },
|
|
32
|
+
// These are grep-pattern strings sent to the LLM for it to search — not code being executed.
|
|
33
|
+
{ key: "security", weight: 0.20, instruction: "Score security: check for hardcoded secrets, calls to eval, dangerouslySetInnerHTML, missing Zod at API boundaries, wildcard CORS, TypeQL string concat. 1 = all greps return 0." },
|
|
34
|
+
{ key: "stability", weight: 0.20, instruction: "Score stability: check new `any`, @ts-ignore without comment, silent returns, retired names (knowledge|connections|people|node|scent|alarm|trail|colony). 1 = all zero." },
|
|
35
|
+
{ key: "simplicity", weight: 0.15, instruction: "Score simplicity: focused single-purpose files, functions under 20 lines, no backwards-compat shims or WHAT comments. 1 = tight, no ceremony." },
|
|
36
|
+
{ key: "speed", weight: 0.10, instruction: "Score speed: does the diff increase bundle size or build time? Any new client:load where client:idle suffices? 1 = no regressions." },
|
|
37
|
+
{ key: "adversarial", weight: 0, instruction: "Adversarial check: is there any finding that should block this cycle? If yes, name it briefly. If nothing blocks, return null for 'finding'." },
|
|
38
|
+
] as const;
|
|
39
|
+
|
|
40
|
+
function loadRubrics(): string {
|
|
41
|
+
if (existsSync(RUBRICS_PATH)) return readFileSync(RUBRICS_PATH, "utf8");
|
|
42
|
+
return "Score each dimension 0.0–1.0. Explain why in one sentence. Name the specific gap in 'improve', or 'clean' if 1.0.";
|
|
43
|
+
}
|
|
44
|
+
|
|
45
|
+
function loadSpec(): string {
|
|
46
|
+
if (existsSync(SPEC_PATH)) return readFileSync(SPEC_PATH, "utf8").slice(0, 3000);
|
|
47
|
+
return "{}";
|
|
48
|
+
}
|
|
49
|
+
|
|
50
|
+
function diffStat(): string {
|
|
51
|
+
try { return execFileSync("git", ["diff", "HEAD", "--stat"], { encoding: "utf8" }).slice(0, 1000); }
|
|
52
|
+
catch { return "(no diff)"; }
|
|
53
|
+
}
|
|
54
|
+
|
|
55
|
+
function fileSnippets(files: string[]): string {
|
|
56
|
+
return files
|
|
57
|
+
.map(f => {
|
|
58
|
+
try { return `### ${f}\n\`\`\`\n${readFileSync(f, "utf8").slice(0, 1500)}\n\`\`\``; }
|
|
59
|
+
catch { return `### ${f}\n[not found]`; }
|
|
60
|
+
})
|
|
61
|
+
.join("\n\n");
|
|
62
|
+
}
|
|
63
|
+
|
|
64
|
+
type DimResult = { key: string; score: number; why: string; improve: string; finding?: string | null };
|
|
65
|
+
|
|
66
|
+
async function scoreDim(
|
|
67
|
+
client: Anthropic,
|
|
68
|
+
cachedSystem: Anthropic.TextBlockParam[],
|
|
69
|
+
dim: typeof DIMS[number],
|
|
70
|
+
): Promise<DimResult> {
|
|
71
|
+
const res = await client.messages.create({
|
|
72
|
+
model: "claude-haiku-4-5-20251001",
|
|
73
|
+
max_tokens: 256,
|
|
74
|
+
system: cachedSystem as any,
|
|
75
|
+
messages: [{
|
|
76
|
+
role: "user",
|
|
77
|
+
content: `${dim.instruction}
|
|
78
|
+
|
|
79
|
+
The rubric, cycle spec, diff summary, and touched files are in the system context above.
|
|
80
|
+
|
|
81
|
+
Return compact JSON only:
|
|
82
|
+
${dim.key === "adversarial"
|
|
83
|
+
? '{"finding": "<what blocks or null>"}'
|
|
84
|
+
: '{"score": 0.0–1.0, "why": "<one sentence>", "improve": "<gap or clean>"}'}`,
|
|
85
|
+
}],
|
|
86
|
+
});
|
|
87
|
+
|
|
88
|
+
const text = res.content.filter(b => b.type === "text").map(b => (b as Anthropic.TextBlock).text).join("");
|
|
89
|
+
try {
|
|
90
|
+
const match = text.match(/\{[\s\S]*?\}/);
|
|
91
|
+
if (match) return { key: dim.key, score: 0.5, why: "", improve: "", ...JSON.parse(match[0]) };
|
|
92
|
+
} catch {}
|
|
93
|
+
return { key: dim.key, score: 0.5, why: "parse error", improve: text.slice(0, 80) };
|
|
94
|
+
}
|
|
95
|
+
|
|
96
|
+
async function main() {
|
|
97
|
+
const args = process.argv.slice(2);
|
|
98
|
+
const filesIdx = args.indexOf("--files");
|
|
99
|
+
const files = filesIdx >= 0 ? (args[filesIdx + 1] ?? "").split(",").filter(Boolean) : []; // tolerate a trailing --files with no value
|
|
100
|
+
|
|
101
|
+
const client = new Anthropic();
|
|
102
|
+
const rubricText = loadRubrics();
|
|
103
|
+
const spec = loadSpec();
|
|
104
|
+
const diff = diffStat();
|
|
105
|
+
const snippets = fileSnippets(files);
|
|
106
|
+
|
|
107
|
+
// ONE cached block — rubric + spec + diff + snippets are byte-identical across all
|
|
108
|
+
// 6 parallel dimension calls. Caching them here (instead of repeating diff+snippets
|
|
109
|
+
// in each user message) aims for 1 cache write + 5 cache reads instead of 6× re-tokenization (simultaneous calls may each pay a write).
|
|
110
|
+
const cachedSystem: Anthropic.TextBlockParam[] = [{
|
|
111
|
+
type: "text",
|
|
112
|
+
text: `# Rubric Reference\n\n${rubricText}\n\n# Cycle Spec\n\`\`\`json\n${spec}\n\`\`\`\n\n# Diff summary\n${diff}\n\n# Touched files\n${snippets}`,
|
|
113
|
+
cache_control: { type: "ephemeral" },
|
|
114
|
+
} as any];
|
|
115
|
+
|
|
116
|
+
const results = await Promise.all(
|
|
117
|
+
DIMS.map(dim => scoreDim(client, cachedSystem, dim))
|
|
118
|
+
);
|
|
119
|
+
|
|
120
|
+
const scored = results.filter(r => r.key !== "adversarial");
|
|
121
|
+
const composite = scored.reduce((sum, r) => {
|
|
122
|
+
const w = DIMS.find(d => d.key === r.key)!.weight;
|
|
123
|
+
return sum + r.score * w;
|
|
124
|
+
}, 0);
|
|
125
|
+
|
|
126
|
+
const goalFit = scored.find(r => r.key === "goal-fit")?.score ?? 0;
|
|
127
|
+
const adversarial = results.find(r => r.key === "adversarial");
|
|
128
|
+
|
|
129
|
+
const output = {
|
|
130
|
+
composite: Math.round(composite * 100) / 100,
|
|
131
|
+
gate: composite >= 0.65 && goalFit >= 0.50,
|
|
132
|
+
dimensions: Object.fromEntries(
|
|
133
|
+
scored.map(r => [r.key, { score: r.score, why: r.why, improve: r.improve }])
|
|
134
|
+
),
|
|
135
|
+
adversarial: adversarial?.finding ?? null,
|
|
136
|
+
};
|
|
137
|
+
|
|
138
|
+
process.stderr.write(`[w4-rubric] composite=${output.composite} gate=${output.gate}\n`);
|
|
139
|
+
process.stdout.write(JSON.stringify(output, null, 2));
|
|
140
|
+
}
|
|
141
|
+
|
|
142
|
+
main().catch(e => { process.stderr.write(`[w4-rubric] error: ${e}\n`); process.exit(1); });
|
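The weighted composite and two-part gate at the end of `main()` can be checked by hand with a small pure function. This is a sketch under stated assumptions: `computeGate` and `WEIGHTS` are hypothetical names, the weights and thresholds are copied from the `DIMS` table and `gate` expression above, and treating a missing dimension as 0 is a choice made here (the script itself falls back to 0.5 when a reply cannot be parsed).

```typescript
// Weights copied from the DIMS table in w4-rubric.ts; adversarial carries no weight.
const WEIGHTS: Record<string, number> = {
  "goal-fit": 0.35,
  security: 0.2,
  stability: 0.2,
  simplicity: 0.15,
  speed: 0.1,
};

type Scores = Record<string, number>;

// Weighted sum of the scored dimensions, then the gate:
// composite >= 0.65 AND goal-fit >= 0.50.
// Assumption of this sketch: a missing dimension counts as 0.
function computeGate(scores: Scores): { composite: number; gate: boolean } {
  const composite = Object.entries(WEIGHTS).reduce(
    (sum, [key, w]) => sum + (scores[key] ?? 0) * w,
    0,
  );
  const goalFit = scores["goal-fit"] ?? 0;
  return {
    // Rounded for display only; the gate compares the unrounded value, as the script does.
    composite: Math.round(composite * 100) / 100,
    gate: composite >= 0.65 && goalFit >= 0.5,
  };
}

// Strong everywhere except goal-fit: the composite clears 0.65 but the gate still fails.
const offTarget = computeGate({
  "goal-fit": 0.4,
  security: 1,
  stability: 1,
  simplicity: 1,
  speed: 1,
});
console.log(offTarget);
```

Here the composite is 0.35 × 0.4 + 0.20 + 0.20 + 0.15 + 0.10 = 0.79, yet `gate` is false because goal-fit sits below 0.50. The second condition exists so that a clean but off-target diff cannot pass on hygiene scores alone.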