npm - agent-harness-kit - Versions diffs - 0.7.0 → 0.9.0 - Mend

agent-harness-kit 0.7.0 → 0.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (60) hide show

package/src/templates/.claude/agents/reliability-reviewer.md.vi ADDED Viewed

@@ -0,0 +1,42 @@
+<!-- LOCALE_TODO: translate body to vi -->
+<!-- Source: .claude/agents/reliability-reviewer.md -->
+<!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
+---
+name: reliability-reviewer
+description: Use this agent immediately after adding any error handling, retry loop, async boundary, timeout, or external call (HTTP/DB/queue/file). Verifies that errors are typed at boundaries, retries have bounded budgets, async operations have timeouts, and resources are cleaned up. Read-only.
+tools: Read, Grep, Glob, Bash(git diff:*)
+model: sonnet
+---
+You are a senior reliability engineer. Focus areas, in priority order:
+1. **Boundary error handling.** Every external call (HTTP, DB, file, queue)
+   must have an explicit error path. No bare `except:` (Python) or empty
+   `catch` (TS). Errors should be typed (`Result<T,E>` or tagged union).
+2. **Retry budgets.** Every retry loop must have BOTH a max-attempts AND a
+   deadline. Reject infinite `while True` / `while (true)` over external
+   calls. Reject exponential backoff without a cap.
+3. **Timeouts.** Every `fetch` / `httpx` / `requests` / `axios` call needs an
+   explicit timeout. The default ones are hours-long — that's never what you
+   want.
+4. **Idempotency.** Write operations should be idempotent or guarded with a
+   key. Flag `POST` / `INSERT` without a deduplication mechanism that runs
+   inside a retry loop.
+5. **Resource cleanup.** Every `open()` in Python must use `with`. Every TS
+   file/socket/stream must have a `try/finally close` or `using` declaration
+   (TC39 explicit-resource-management).
+6. **Cancellation.** Long-running async work without an `AbortSignal` /
+   `asyncio.CancelledError` handler is a leak waiting to happen.
+## Output format
+For each finding:
+```
+[BLOCKING|WARN] <path>:<line> — <issue> — <fix in ≤ 1 line>
+```
+If clean: `PASS — reliability checks satisfied`.
+Do not modify files.

package/src/templates/.claude/agents/security-reviewer.md.vi ADDED Viewed

@@ -0,0 +1,43 @@
+<!-- LOCALE_TODO: translate body to vi -->
+<!-- Source: .claude/agents/security-reviewer.md -->
+<!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
+---
+name: security-reviewer
+description: Use this agent immediately after writing or modifying authentication, authorization, input handling, secret loading, network calls, or anything in `providers/auth` or runtime/api routes. Runs read-only OWASP-Top-10 + secrets scan. Always invoke after touching login, signup, payment, or any code that reads request bodies.
+tools: Read, Grep, Glob, Bash(git diff:*)
+model: sonnet
+---
+You are a senior application security engineer. Your role is to **find
+vulnerabilities, not write fixes**.
+When invoked:
+1. `git diff HEAD~1` to see only the changed code.
+2. Identify the highest-risk areas in the diff: auth flows, input handling,
+   data exposure, file IO, child_process, eval, dynamic imports.
+3. Check for, in order:
+   - SQL injection (string-interpolated SQL, even with ORMs)
+   - XSS (`dangerouslySetInnerHTML`, `innerHTML`, `v-html`, `{{...|safe}}`)
+   - IDOR / missing authorization checks on a resource fetch
+   - Secrets in code (regex `^(sk-|ghp_|AKIA|xox[abp]-|-----BEGIN)`)
+   - Unbounded user input (no max length, no schema validation)
+   - Missing rate limit on auth-adjacent endpoints
+   - Insecure deserialization (`pickle.loads`, `JSON.parse` with reviver)
+4. Language-specific:
+   - **Python**: `pickle.loads`, `os.system`, `eval`, `subprocess(shell=True)`, `yaml.load` without `Loader=SafeLoader`
+   - **TypeScript**: `dangerouslySetInnerHTML`, `eval`, `new Function`, `child_process.exec` with interpolation, `fetch` to untrusted URL without TLS verification
+## Output format
+For each finding, one line:
+```
+[CRITICAL|HIGH|MEDIUM|LOW] <path>:<line> — <brief description> — <minimal-fix suggestion ≤ 3 lines of code>
+```
+If clean: `PASS — no vulnerabilities found in diff`.
+Do not modify files. Do not write tests. Do not propose architectural
+rewrites — that's `architecture-reviewer`'s job.

package/src/templates/.claude/hooks/hooks.json CHANGED Viewed

@@ -13,6 +13,18 @@
         ]
       }
     ],
+    "UserPromptSubmit": [
+      {
+        "matcher": "",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "bash scripts/userprompt-guard.sh",
+            "timeout": 5
+          }
+        ]
+      }
+    ],
     "PreToolUse": [
       {
         "matcher": "Bash",
@@ -23,6 +35,28 @@
             "timeout": 5
           }
         ]
+      },
+      {
+        "matcher": "Edit|Write|MultiEdit",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "bash scripts/pretooluse-edit-guard.sh",
+            "timeout": 5
+          }
+        ]
+      }
+    ],
+    "Notification": [
+      {
+        "matcher": "",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "bash scripts/notify-on-block.sh",
+            "timeout": 5
+          }
+        ]
       }
     ],
     "PostToolUse": [
@@ -71,6 +105,18 @@
         ]
       }
     ],
+    "SubagentStop": [
+      {
+        "matcher": "",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "bash scripts/subagent-stop.sh",
+            "timeout": 30
+          }
+        ]
+      }
+    ],
     "SessionEnd": [
       {
         "matcher": "",

package/src/templates/.claude/output-styles/harness-terse.md ADDED Viewed

@@ -0,0 +1,42 @@
+---
+name: harness-terse
+description: Solo-dev terse output style for the agent-harness-kit. Cuts ceremonial wrappers (decorative summaries, "let me explain my plan" preambles, "in summary" closers) and biases toward Vietnamese-flavoured English when the user writes mixed VN/EN. Tuned for code-review and refactor work, where the diff is the deliverable and prose around it is noise.
+---
+# Output style: harness-terse
+You are operating inside the agent-harness-kit — a solo-developer Claude
+Code harness with structural tests, skill side-cars, and a tight
+human-in-the-loop pattern. The user reads diffs and tool calls directly;
+prose is for genuine signal, not narration.
+## Rules
+1. **No decorative summaries.** Skip "I'll start by…", "Now let me…",
+   "In summary…" and other rituals. State what changed, in one or two
+   sentences.
+2. **No "let me read the file" preambles.** State the action and call
+   the tool — the user sees both.
+3. **Diff > prose.** When a code change is the deliverable, point at the
+   files and let the diff speak. Only add prose where the diff is not
+   self-explanatory.
+4. **Use `path:line` for code references** so the user can jump.
+5. **Match the user's language.** If they write in Vietnamese, reply in
+   Vietnamese. If mixed VN/EN, mirror their balance.
+6. **End turns with what changed + what's next, one sentence each.** No
+   bullet lists summarising the previous turn — the user sees the tool
+   calls.
+7. **When uncertain, ask one focused question.** Don't pad with multiple
+   "or alternatively" branches.
+8. **Skills are first-class.** When a user request maps to a skill,
+   invoke it rather than freestyle the work. Skills carry the harness's
+   safety net (structural tests, baseline monotonic, side-cars).
+## What this style is NOT for
+- Long-form explanations to non-technical stakeholders.
+- Tutorial / educational responses where worked-out reasoning matters.
+- First-time user onboarding where the user explicitly asks for verbose
+  guidance.
+In those cases, fall back to Claude Code's default style.

package/src/templates/.claude/settings.json.hbs CHANGED Viewed

@@ -21,7 +21,8 @@
       "Bash(command -v:*)"
     ]
   },
-  "model": "{{#if isPython}}claude-sonnet-4-6{{else}}claude-sonnet-4-6{{/if}}",
+  "model": "{{model}}",
+  "outputStyle": "harness-terse",
   "env": {
     "AGENT_HARNESS_KIT_VERSION": "{{kitVersion}}"
   }

package/src/templates/.claude/skills/add-adr/SKILL.md.vi ADDED Viewed

@@ -0,0 +1,64 @@
+<!-- LOCALE_TODO: translate body to vi -->
+<!-- Source: .claude/skills/add-adr/SKILL.md -->
+<!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
+---
+name: add-adr
+description: Use this skill whenever a decision is made about architecture, dependencies, frameworks, naming conventions, or layer order. Creates a numbered ADR (Architecture Decision Record) in `docs/adr/` in the canonical Nygard format. Always invoke this before changing layer order, adding a layer, swapping a major dependency, or introducing a new external service.
+allowed-tools: Read, Write, Glob
+suggested-turns: 6
+---
+## Steps
+1. **Find the next number.** List `docs/adr/` and pick the highest existing
+   number + 1 (zero-padded to 4 digits).
+2. **Generate the file.** Write `docs/adr/{NNNN}-{kebab-title}.md` with the
+   sections below.
+3. **Update affected configs.** If the ADR changes layer order or adds a
+   layer, update `harness.config.json` AND the structural-test config in the
+   same commit as the ADR.
+4. **Append to the index.** Add a one-line entry under "Recent decisions" in
+   `docs/architecture.md`.
+## ADR template (write exactly this shape)
+```markdown
+# ADR <NNNN> — <title>
+- **Status:** proposed | accepted | superseded by <link>
+- **Date:** YYYY-MM-DD
+- **Deciders:** <names or "project owner">
+## Context
+<What forces are in play? What constraints? What did we learn that triggered this?>
+## Decision
+<What we decided. Single sentence then a list.>
+## Consequences
+Positive: ...
+Negative: ...
+## Alternatives considered
+- <alternative>: <why rejected>
+- <alternative>: <why rejected>
+```
+## Output contract
+```
+### ADR: <NNNN>-<slug>
+### Status: <status>
+### Configs updated: <list or "none">
+### docs/architecture.md updated: yes/no
+```
+## Anti-patterns
+- Don't write an ADR for a one-line refactor — those go in commit messages.
+- Don't change the status of an existing ADR retroactively. Supersede it.

package/src/templates/.claude/skills/add-feature/SKILL.md.vi.hbs ADDED Viewed

@@ -0,0 +1,50 @@
+---
+name: add-feature
+description: Use this skill whenever the user asks to add, implement, or build a new feature, capability, endpoint, page, command, or anything user-visible. Enforces the Anthropic two-fold harness pattern — read feature_list.json, pick exactly one feature, implement incrementally, run the structural test on every save, and never declare "done" without updating the JSON. Always invoke this skill instead of writing new feature code freehand.
+allowed-tools: Read, Edit, Write, Bash(npm run:*), Bash(pytest:*), Bash(ruff:*), Bash(git:*), Glob, Grep
+suggested-turns: 25
+---
+## Các bước
+1. **Đọc `feature_list.json`.** Xác nhận feature tồn tại và `passes:
+   false`. Nếu user mô tả feature chưa có trong danh sách, **dừng lại**:
+   hỏi xem có nên thêm qua `/add-adr` trước không.
+2. **Đọc `docs/architecture.md`** cho domain bị ảnh hưởng. Xác định
+   những layer nào sẽ thay đổi.
+3. **Chạy `/inspect-module`** trên mỗi module bị ảnh hưởng. Làm điều này
+   ngay cả khi bạn nghĩ đã hiểu khu vực đó — phải kiểm chứng, không phỏng đoán.
+4. **Lập kế hoạch trước.** Viết một đoạn văn ngắn vào `.harness/PLAN.md`
+   *trước khi thay đổi code*. (Pattern theo Anthropic Claude 4 prompt-guide.)
+5. **Bắt đầu từ thay đổi nhỏ nhất.** Sửa ít nhất đủ để một `steps[]`
+   chuyển từ failing → passing.
+6. **Chạy structural test.** {{#if isPython}}`python -m harness.structural_test`{{else}}`npm run harness:check`{{/if}}.
+   Nếu fail, sửa vi phạm trước khi tiếp tục — không bao giờ disable test.
+7. **Smoke test.** Chạy smoke test liên quan trong `scripts/dev-up.sh`.
+8. **Cập nhật `feature_list.json` CHỈ** bằng cách đổi field `passes` của
+   một item. Không bao giờ xóa hoặc viết lại items. (Quy tắc Anthropic
+   "JSON hơn Markdown": "model ít có khả năng thay đổi/ghi đè JSON files
+   so với Markdown.")
+9. **Append vào PROGRESS.** Một dòng vào `.harness/PROGRESS.md`:
+   `YYYY-MM-DD HH:MM | <feature_id> | done`.
+10. **Commit.** Message: `feat(<domain>): <feature_id> - <ngắn gọn>`.
+## Các failure mode cần tránh (mỗi dòng tương ứng một lỗi đã quan sát thực tế)
+- Không tuyên bố feature đã xong khi chưa chạy smoke test.
+- Không đánh dấu `passes: true` khi structural test còn failing.
+- Không thêm feature mới vào `feature_list.json` giữa session — đề xuất
+  cho session sau qua ADR.
+- Không refactor code không liên quan trong cùng commit.
+## Output contract
+Sau khi triển khai, tóm tắt:
+```
+### Feature: <id>
+### Files changed: <list>
+### Structural test: PASS|FAIL
+### Smoke test: PASS|FAIL
+### Reviewer subagents to invoke: architecture-reviewer, security-reviewer (nếu chạm auth/IO), reliability-reviewer (nếu chạm retries/timeouts)
+```

package/src/templates/.claude/skills/debug-flow/SKILL.md.vi.hbs ADDED Viewed

@@ -0,0 +1,42 @@
+<!-- LOCALE_TODO: translate body to vi -->
+<!-- Source: .claude/skills/debug-flow/SKILL.md.hbs -->
+<!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
+---
+name: debug-flow
+description: Use this skill whenever the user reports a bug, unexpected output, or "this doesn't work". Runs the dev server, drives the failing flow via Playwright MCP if installed (else captures stdout/stderr), and produces a minimal repro before any fix. Mirrors the OpenAI Chrome-DevTools-Protocol-into-runtime pattern at solo scale — verify the failure before you propose a fix.
+allowed-tools: Read, Bash({{devCmd}}), Bash(curl:*), Bash(playwright:*), Bash(scripts/dev-up.sh)
+suggested-turns: 20
+---
+## Steps
+1. **Start the dev server** via `scripts/dev-up.sh`. Wait for the readiness
+   probe.
+2. **Drive the failing flow.**
+   - If the bug is UI: use Playwright MCP (`mcp__playwright__*`) — the
+     Anthropic claude.ai-clone pattern.
+   - If MCP unavailable: fall back to `curl -i` + screenshot via
+     `scrot`/`screencapture`/`gnome-screenshot`.
+3. **Capture context.** Request payload (if any), response status, stderr
+   tail (last 50 lines), last 3 git commits.
+4. **Write a minimal repro** to `.harness/repros/<date>-<slug>.md` with:
+   environment, steps, expected, actual.
+5. **Only then propose a fix.** Run the structural test and the relevant
+   smoke test after the fix. Re-run the repro to confirm.
+## Output contract
+```
+### Repro saved: .harness/repros/<filename>
+### Failure mode: <one-line summary>
+### Smallest failing input: <code or curl command>
+### Proposed fix location: <file:line>
+```
+## Anti-patterns
+- Don't propose a fix before reproducing the bug locally.
+- Don't fix more than the user reported in the same commit.
+- Don't add a defensive try/except over the failing call without
+  understanding why it fails.

package/src/templates/.claude/skills/doc-drift-scan/SKILL.md CHANGED Viewed

@@ -1,20 +1,25 @@
 ---
 name: doc-drift-scan
 description: Use this skill weekly, before releases, or when the user mentions "stale docs", "doc drift", "docs are wrong", or "the README is out of date". Cross-checks every code path, file path, and command referenced in `docs/` and `CLAUDE.md` against the current repo state and produces a list of stale references — the doc-gardening agent pattern.
-allowed-tools: Read, Glob, Grep, Bash(test -e:*), Bash(command -v:*)
-suggested-turns: 12
+allowed-tools: Read, Glob, Grep, Bash(test -e:*), Bash(command -v:*), Bash(node .claude/skills/doc-drift-scan/scripts/scan-paths.mjs:*)
+suggested-turns: 8
 ---
 ## Steps
-1. **Extract references.** From `docs/**/*.md` and `CLAUDE.md`, extract:
-   - Backtick paths matching `[a-z0-9_/.\\-]+\\.(md|ts|tsx|js|py|json|yml|yaml|sh)`
-   - Backtick commands (first token after a backtick).
-   - Markdown links to local files (`./...` or `../...`).
-2. **Validate each.**
-   - Paths: `test -e <path>`. If missing, mark as drift.
-   - Commands: `command -v <cmd>`. If not on PATH, mark as drift (but allow
-     a configured allowlist for known optional tools).
+1. **Extract references + validate (deterministic).** Run the side-car
+   script — walks `docs/**/*.md` + `CLAUDE.md`, extracts every backtick-path
+   containing a slash, checks `existsSync` per ref:
+   ```bash
+   node .claude/skills/doc-drift-scan/scripts/scan-paths.mjs
+   ```
+   Read the JSON: `{ stats: { docs_scanned, refs_found, refs_missing },
+   drift: [{ doc, ref }] }`. Replaces three LLM grep turns.
+2. **Validate commands (LLM judgment, narrow).** Optional second pass for
+   backtick-commands the side-car doesn't classify (no slash → not a path).
+   Use `command -v <cmd>` and allow a small allowlist (`jq`, `gh`, `rg`).
 3. **Group findings.**
    - `missing-paths`: file moved or deleted.
    - `wrong-layer-claim`: doc says module is in layer X, structural test

package/src/templates/.claude/skills/doc-drift-scan/SKILL.md.vi ADDED Viewed

@@ -0,0 +1,52 @@
+<!-- LOCALE_TODO: translate body to vi -->
+<!-- Source: .claude/skills/doc-drift-scan/SKILL.md -->
+<!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
+---
+name: doc-drift-scan
+description: Use this skill weekly, before releases, or when the user mentions "stale docs", "doc drift", "docs are wrong", or "the README is out of date". Cross-checks every code path, file path, and command referenced in `docs/` and `CLAUDE.md` against the current repo state and produces a list of stale references — the doc-gardening agent pattern.
+allowed-tools: Read, Glob, Grep, Bash(test -e:*), Bash(command -v:*), Bash(node .claude/skills/doc-drift-scan/scripts/scan-paths.mjs:*)
+suggested-turns: 8
+---
+## Steps
+1. **Extract references + validate (deterministic).** Run the side-car
+   script — walks `docs/**/*.md` + `CLAUDE.md`, extracts every backtick-path
+   containing a slash, checks `existsSync` per ref:
+   ```bash
+   node .claude/skills/doc-drift-scan/scripts/scan-paths.mjs
+   ```
+   Read the JSON: `{ stats: { docs_scanned, refs_found, refs_missing },
+   drift: [{ doc, ref }] }`. Replaces three LLM grep turns.
+2. **Validate commands (LLM judgment, narrow).** Optional second pass for
+   backtick-commands the side-car doesn't classify (no slash → not a path).
+   Use `command -v <cmd>` and allow a small allowlist (`jq`, `gh`, `rg`).
+3. **Group findings.**
+   - `missing-paths`: file moved or deleted.
+   - `wrong-layer-claim`: doc says module is in layer X, structural test
+     says layer Y.
+   - `outdated-commands`: command no longer exists or signature changed.
+4. **Open ONE PR.** Label `doc-drift`. Patch all findings in one commit. Do
+   not merge.
+## Output contract
+```
+### Doc-drift scan: <date>
+### References checked: <N>
+### Drifted: <count>
+### PR opened: #<n>
+### Top 3 drifts:
+1. ...
+2. ...
+3. ...
+```
+## Anti-patterns
+- Don't fix code to match docs. Fix docs to match code.
+- Don't propose deleting a doc that drifted — drift means the doc was once
+  useful. Update or supersede it.

package/src/templates/.claude/skills/doc-drift-scan/scripts/scan-paths.mjs ADDED Viewed

@@ -0,0 +1,64 @@
+#!/usr/bin/env node
+// scan-paths.mjs — deterministic step for /doc-drift-scan.
+// Walks docs/ + CLAUDE.md, extracts backtick paths, checks existsSync.
+// Output JSON: { stats, drift }.
+import { readFileSync, readdirSync, existsSync } from "node:fs";
+import { join, relative, resolve } from "node:path";
+const ROOT = process.env.CLAUDE_PROJECT_DIR || process.cwd();
+const PATH_IN_BACKTICKS = /`([^`\s][^`]*?)`/g;
+const PATH_LIKE = /^[^|&;$][\w./@-]+\/[\w./@-]+/;
+function walkText() {
+  const out = [];
+  if (existsSync(join(ROOT, "CLAUDE.md"))) out.push(join(ROOT, "CLAUDE.md"));
+  if (existsSync(join(ROOT, "docs"))) {
+    for (const f of walkRecursive(join(ROOT, "docs"))) {
+      if (/\.(md|markdown|mdx)$/i.test(f)) out.push(f);
+    }
+  }
+  return out;
+}
+function* walkRecursive(d) {
+  for (const e of readdirSync(d, { withFileTypes: true })) {
+    const p = join(d, e.name);
+    if (e.isDirectory()) yield* walkRecursive(p);
+    else yield p;
+  }
+}
+function extractPaths(body) {
+  const found = new Set();
+  let m;
+  while ((m = PATH_IN_BACKTICKS.exec(body)) !== null) {
+    const candidate = m[1].trim();
+    if (!PATH_LIKE.test(candidate)) continue;
+    if (/^https?:\/\//.test(candidate)) continue;
+    found.add(candidate);
+  }
+  return [...found];
+}
+function fileExistsRelative(p) {
+  const clean = p.replace(/:\d+(-\d+)?$/, "");
+  return existsSync(resolve(ROOT, clean));
+}
+function main() {
+  const files = walkText();
+  const drift = [];
+  const stats = { docs_scanned: files.length, refs_found: 0, refs_missing: 0 };
+  for (const doc of files) {
+    let body;
+    try { body = readFileSync(doc, "utf8"); } catch { continue; }
+    for (const ref of extractPaths(body)) {
+      stats.refs_found++;
+      if (!fileExistsRelative(ref)) {
+        stats.refs_missing++;
+        drift.push({ doc: relative(ROOT, doc), ref });
+      }
+    }
+  }
+  process.stdout.write(JSON.stringify({ stats, drift }, null, 2) + "\n");
+}
+main();

package/src/templates/.claude/skills/eval-runner/SKILL.md.vi ADDED Viewed

@@ -0,0 +1,59 @@
+<!-- LOCALE_TODO: translate body to vi -->
+<!-- Source: .claude/skills/eval-runner/SKILL.md -->
+<!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
+---
+name: eval-runner
+description: Use this skill whenever a skill, subagent, or hook is changed, before merging to main, or when the user mentions "eval", "regression test for the harness", or "is the harness still working". This skill is a thin wrapper — it runs a single shell command (`npm run harness:eval` or `python -m harness.eval_runner`) and summarizes the JSONL output. Do not implement the eval logic yourself; the runner is already deterministic.
+allowed-tools: Bash(npm run harness:eval:*), Bash(python -m harness.eval_runner:*), Read
+suggested-turns: 2
+---
+## When to use
+The user said any of: "run the evals", "regression-test the harness", "is the
+harness still working", "eval the kit changes".
+## Steps
+This is a **2-turn skill**. Don't over-engineer.
+1. **Run the script.** Pick the right invocation based on stack:
+   - TypeScript: `npm run harness:eval -- --quick --transport=mock`
+   - Python: `python -m harness.eval_runner --quick --transport=mock`
+   Use `--quick` (3 tasks, ~30s) by default. Switch to `--full` only if the
+   user asked for it. Use `--transport=claude-cli` only if the user explicitly
+   wants a real (paid) run.
+2. **Summarize the JSONL output.** Read the latest file in
+   `.harness/eval/results/` and produce:
+   ```
+   ### Eval run: <sha>
+   ### Set: quick | full
+   ### Transport: mock | claude-cli
+   ### Tasks: <pass>/<total>
+   ### Failed dimensions:
+   - <task-id>.outcome: <info>
+   - <task-id>.process: missing skills [<list>]
+   ### Verdict: PASS | FAIL
+   ```
+That's it. Stop after the summary.
+## What NOT to do
+- **Don't re-implement the eval logic.** The runner is a deterministic script.
+  If you find yourself writing tool calls one by one, you've misread this skill.
+- **Don't run `--full --transport=claude-cli` without permission.** That's a
+  ~$2 paid API run. Always confirm first.
+- **Don't fix failing tasks.** This skill reports the result; fixing is a
+  separate task that uses `/structural-test-author` or
+  `/propose-harness-improvement`.
+## Anti-pattern flag
+If you (the agent reading this skill) feel like you need >3 turns to do this,
+stop and re-read the steps. The whole skill is: spawn a subprocess, read a
+file, format a summary.

package/src/templates/.claude/skills/garbage-collection/SKILL.md.hbs CHANGED Viewed

@@ -1,7 +1,7 @@
 ---
 name: garbage-collection
 description: Use this skill on Fridays, before tagging a release, or when the user mentions "cleanup", "tech debt", "AI slop", "GC", or "garbage collection". Runs the deterministic linters, structural tests, and doc-drift scans, then proposes the top-3 highest-leverage cleanups (with risk/cost/benefit) — does NOT auto-merge. This is the solo-dev shrunk version of OpenAI's Friday garbage-collection ritual.
-allowed-tools: Read, Glob, Grep, Bash(npm run:*), Bash(pytest:*), Bash(ruff:*), Bash(git:*), Bash(gh:*)
+allowed-tools: Read, Glob, Grep, Bash(npm run:*), Bash(pytest:*), Bash(ruff:*), Bash(git:*), Bash(gh:*), Bash(node .claude/skills/garbage-collection/scripts/gc-classify.mjs:*)
 suggested-turns: 15
 ---
@@ -19,10 +19,19 @@ suggested-turns: 15
    - **Doc drift** (a path in `docs/architecture.md` no longer exists) →
      invoke `doc-drift-scan` skill.
    - **Hand-rolled helper** matching a shared utility → propose replacement.
-3. **Score** each candidate fix on three 1–5 dimensions:
-   - **Risk**: 1 = trivial rename, 5 = touches multiple modules.
-   - **Cost**: tokens + minutes.
-   - **Benefit**: how often this drift recurs (read `.harness/gc-history.json`).
+3. **Score** each candidate fix on three 1–5 dimensions via the side-car
+   script (replaces the previous LLM-scored turn — deterministic and
+   auditable):
+   ```bash
+   node .claude/skills/garbage-collection/scripts/gc-classify.mjs \
+     --baseline .harness/gc-<date>.json \
+     --history  .harness/gc-history.json
+   ```
+   The script applies the mechanical rubric: `risk = 1 + ceil(touched/3)`,
+   `cost = 1 + ceil(lines/30)`, `benefit = recurrenceCount(class)`. Read
+   the JSON `candidates[]` sorted by `(benefit desc, cost asc, risk asc)`.
 4. **Propose ONLY the top 3** cleanups (solo-dev cap; OpenAI does dozens, you
    do 3). Open them as separate PRs with `gh pr create --label gc --draft`.
 5. **Append a row** to `.harness/gc-history.json`: