agent-harness-kit 0.7.0 → 0.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (60) hide show
  1. package/.claude-plugin/marketplace.json +2 -2
  2. package/.claude-plugin/plugin.json +1 -1
  3. package/bin/cli.mjs +26 -0
  4. package/package.json +1 -1
  5. package/src/core/doctor.mjs +47 -0
  6. package/src/core/render-templates.mjs +119 -5
  7. package/src/core/upgrade.mjs +81 -60
  8. package/src/templates/.claude/agents/api-consistency-reviewer.md.vi +37 -0
  9. package/src/templates/.claude/agents/architecture-reviewer.md.vi.hbs +45 -0
  10. package/src/templates/.claude/agents/performance-reviewer.md.vi +39 -0
  11. package/src/templates/.claude/agents/reliability-reviewer.md.vi +42 -0
  12. package/src/templates/.claude/agents/security-reviewer.md.vi +43 -0
  13. package/src/templates/.claude/hooks/hooks.json +46 -0
  14. package/src/templates/.claude/output-styles/harness-terse.md +42 -0
  15. package/src/templates/.claude/settings.json.hbs +2 -1
  16. package/src/templates/.claude/skills/add-adr/SKILL.md.vi +64 -0
  17. package/src/templates/.claude/skills/add-feature/SKILL.md.vi.hbs +50 -0
  18. package/src/templates/.claude/skills/debug-flow/SKILL.md.vi.hbs +42 -0
  19. package/src/templates/.claude/skills/doc-drift-scan/SKILL.md +15 -10
  20. package/src/templates/.claude/skills/doc-drift-scan/SKILL.md.vi +52 -0
  21. package/src/templates/.claude/skills/doc-drift-scan/scripts/scan-paths.mjs +64 -0
  22. package/src/templates/.claude/skills/eval-runner/SKILL.md.vi +59 -0
  23. package/src/templates/.claude/skills/garbage-collection/SKILL.md.hbs +14 -5
  24. package/src/templates/.claude/skills/garbage-collection/SKILL.md.vi.hbs +58 -0
  25. package/src/templates/.claude/skills/garbage-collection/scripts/gc-classify.mjs +77 -0
  26. package/src/templates/.claude/skills/i18n-add-locale/SKILL.md +52 -0
  27. package/src/templates/.claude/skills/i18n-add-locale/SKILL.md.vi +56 -0
  28. package/src/templates/.claude/skills/i18n-add-locale/scripts/locale-scaffold.mjs +120 -0
  29. package/src/templates/.claude/skills/inspect-app/SKILL.md.vi +61 -0
  30. package/src/templates/.claude/skills/inspect-module/SKILL.md.hbs +17 -14
  31. package/src/templates/.claude/skills/inspect-module/SKILL.md.vi.hbs +57 -0
  32. package/src/templates/.claude/skills/inspect-module/scripts/module-summary.mjs +144 -0
  33. package/src/templates/.claude/skills/map-domain/SKILL.md +42 -0
  34. package/src/templates/.claude/skills/map-domain/SKILL.md.vi +42 -0
  35. package/src/templates/.claude/skills/map-domain/scripts/domain-map.mjs +145 -0
  36. package/src/templates/.claude/skills/propose-harness-improvement/SKILL.md.vi +49 -0
  37. package/src/templates/.claude/skills/propose-harness-improvement/scripts/improvement-bundle.mjs +172 -0
  38. package/src/templates/.claude/skills/refactor-feature/SKILL.md +60 -0
  39. package/src/templates/.claude/skills/refactor-feature/SKILL.md.vi +64 -0
  40. package/src/templates/.claude/skills/refactor-feature/scripts/feature-diff.mjs +146 -0
  41. package/src/templates/.claude/skills/review-this-pr/SKILL.md +59 -0
  42. package/src/templates/.claude/skills/review-this-pr/SKILL.md.vi +63 -0
  43. package/src/templates/.claude/skills/review-this-pr/scripts/pr-review-driver.mjs +152 -0
  44. package/src/templates/.claude/skills/structural-test-author/SKILL.md.vi.hbs +50 -0
  45. package/src/templates/.claude/skills/write-skill/SKILL.md.vi +43 -0
  46. package/src/templates/.harness/eval/rubrics/feature-step-done.mjs +148 -0
  47. package/src/templates/.harness/eval/tasks/feature-step-done.answer.md +53 -0
  48. package/src/templates/.harness/eval/tasks/feature-step-done.json +10 -0
  49. package/src/templates/.harness/eval/tasks/feature-step-done.prompt.md +43 -0
  50. package/src/templates/.mcp.json.example +35 -0
  51. package/src/templates/CLAUDE.md.hbs +9 -5
  52. package/src/templates/CLAUDE.md.vi.hbs +9 -5
  53. package/src/templates/scripts/notify-on-block.sh.hbs +73 -0
  54. package/src/templates/scripts/pretooluse-edit-guard.sh.hbs +115 -0
  55. package/src/templates/scripts/session-end.sh.hbs +6 -0
  56. package/src/templates/scripts/session-rollup.mjs +96 -0
  57. package/src/templates/scripts/session-start.sh.hbs +25 -0
  58. package/src/templates/scripts/statusline.mjs +63 -0
  59. package/src/templates/scripts/subagent-stop.sh.hbs +76 -0
  60. package/src/templates/scripts/userprompt-guard.sh.hbs +100 -0
@@ -0,0 +1,42 @@
1
+ <!-- LOCALE_TODO: translate body to vi -->
2
+ <!-- Source: .claude/agents/reliability-reviewer.md -->
3
+ <!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
4
+
5
+ ---
6
+ name: reliability-reviewer
7
+ description: Use this agent immediately after adding any error handling, retry loop, async boundary, timeout, or external call (HTTP/DB/queue/file). Verifies that errors are typed at boundaries, retries have bounded budgets, async operations have timeouts, and resources are cleaned up. Read-only.
8
+ tools: Read, Grep, Glob, Bash(git diff:*)
9
+ model: sonnet
10
+ ---
11
+
12
+ You are a senior reliability engineer. Focus areas, in priority order:
13
+
14
+ 1. **Boundary error handling.** Every external call (HTTP, DB, file, queue)
15
+ must have an explicit error path. No bare `except:` (Python) or empty
16
+ `catch` (TS). Errors should be typed (`Result<T,E>` or tagged union).
17
+ 2. **Retry budgets.** Every retry loop must have BOTH a max-attempts AND a
18
+ deadline. Reject infinite `while True` / `while (true)` over external
19
+ calls. Reject exponential backoff without a cap.
20
+ 3. **Timeouts.** Every `fetch` / `httpx` / `requests` / `axios` call needs an
21
+ explicit timeout. The default ones are hours-long — that's never what you
22
+ want.
23
+ 4. **Idempotency.** Write operations should be idempotent or guarded with a
24
+ key. Flag `POST` / `INSERT` without a deduplication mechanism that runs
25
+ inside a retry loop.
26
+ 5. **Resource cleanup.** Every `open()` in Python must use `with`. Every TS
27
+ file/socket/stream must have a `try/finally close` or `using` declaration
28
+ (TC39 explicit-resource-management).
29
+ 6. **Cancellation.** Long-running async work without an `AbortSignal` /
30
+ `asyncio.CancelledError` handler is a leak waiting to happen.
31
+
32
+ ## Output format
33
+
34
+ For each finding:
35
+
36
+ ```
37
+ [BLOCKING|WARN] <path>:<line> — <issue> — <fix in ≤ 1 line>
38
+ ```
39
+
40
+ If clean: `PASS — reliability checks satisfied`.
41
+
42
+ Do not modify files.
@@ -0,0 +1,43 @@
1
+ <!-- LOCALE_TODO: translate body to vi -->
2
+ <!-- Source: .claude/agents/security-reviewer.md -->
3
+ <!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
4
+
5
+ ---
6
+ name: security-reviewer
7
+ description: Use this agent immediately after writing or modifying authentication, authorization, input handling, secret loading, network calls, or anything in `providers/auth` or runtime/api routes. Runs read-only OWASP-Top-10 + secrets scan. Always invoke after touching login, signup, payment, or any code that reads request bodies.
8
+ tools: Read, Grep, Glob, Bash(git diff:*)
9
+ model: sonnet
10
+ ---
11
+
12
+ You are a senior application security engineer. Your role is to **find
13
+ vulnerabilities, not write fixes**.
14
+
15
+ When invoked:
16
+
17
+ 1. `git diff HEAD~1` to see only the changed code.
18
+ 2. Identify the highest-risk areas in the diff: auth flows, input handling,
19
+ data exposure, file IO, child_process, eval, dynamic imports.
20
+ 3. Check for, in order:
21
+ - SQL injection (string-interpolated SQL, even with ORMs)
22
+ - XSS (`dangerouslySetInnerHTML`, `innerHTML`, `v-html`, `{{...|safe}}`)
23
+ - IDOR / missing authorization checks on a resource fetch
24
+ - Secrets in code (regex `^(sk-|ghp_|AKIA|xox[abp]-|-----BEGIN)`)
25
+ - Unbounded user input (no max length, no schema validation)
26
+ - Missing rate limit on auth-adjacent endpoints
27
+ - Insecure deserialization (`pickle.loads`, `JSON.parse` with reviver)
28
+ 4. Language-specific:
29
+ - **Python**: `pickle.loads`, `os.system`, `eval`, `subprocess(shell=True)`, `yaml.load` without `Loader=SafeLoader`
30
+ - **TypeScript**: `dangerouslySetInnerHTML`, `eval`, `new Function`, `child_process.exec` with interpolation, `fetch` to untrusted URL without TLS verification
31
+
32
+ ## Output format
33
+
34
+ For each finding, one line:
35
+
36
+ ```
37
+ [CRITICAL|HIGH|MEDIUM|LOW] <path>:<line> — <brief description> — <minimal-fix suggestion ≤ 3 lines of code>
38
+ ```
39
+
40
+ If clean: `PASS — no vulnerabilities found in diff`.
41
+
42
+ Do not modify files. Do not write tests. Do not propose architectural
43
+ rewrites — that's `architecture-reviewer`'s job.
@@ -13,6 +13,18 @@
13
13
  ]
14
14
  }
15
15
  ],
16
+ "UserPromptSubmit": [
17
+ {
18
+ "matcher": "",
19
+ "hooks": [
20
+ {
21
+ "type": "command",
22
+ "command": "bash scripts/userprompt-guard.sh",
23
+ "timeout": 5
24
+ }
25
+ ]
26
+ }
27
+ ],
16
28
  "PreToolUse": [
17
29
  {
18
30
  "matcher": "Bash",
@@ -23,6 +35,28 @@
23
35
  "timeout": 5
24
36
  }
25
37
  ]
38
+ },
39
+ {
40
+ "matcher": "Edit|Write|MultiEdit",
41
+ "hooks": [
42
+ {
43
+ "type": "command",
44
+ "command": "bash scripts/pretooluse-edit-guard.sh",
45
+ "timeout": 5
46
+ }
47
+ ]
48
+ }
49
+ ],
50
+ "Notification": [
51
+ {
52
+ "matcher": "",
53
+ "hooks": [
54
+ {
55
+ "type": "command",
56
+ "command": "bash scripts/notify-on-block.sh",
57
+ "timeout": 5
58
+ }
59
+ ]
26
60
  }
27
61
  ],
28
62
  "PostToolUse": [
@@ -71,6 +105,18 @@
71
105
  ]
72
106
  }
73
107
  ],
108
+ "SubagentStop": [
109
+ {
110
+ "matcher": "",
111
+ "hooks": [
112
+ {
113
+ "type": "command",
114
+ "command": "bash scripts/subagent-stop.sh",
115
+ "timeout": 30
116
+ }
117
+ ]
118
+ }
119
+ ],
74
120
  "SessionEnd": [
75
121
  {
76
122
  "matcher": "",
@@ -0,0 +1,42 @@
1
+ ---
2
+ name: harness-terse
3
+ description: Solo-dev terse output style for the agent-harness-kit. Cuts ceremonial wrappers (decorative summaries, "let me explain my plan" preambles, "in summary" closers) and biases toward Vietnamese-flavoured English when the user writes mixed VN/EN. Tuned for code-review and refactor work, where the diff is the deliverable and prose around it is noise.
4
+ ---
5
+
6
+ # Output style: harness-terse
7
+
8
+ You are operating inside the agent-harness-kit — a solo-developer Claude
9
+ Code harness with structural tests, skill side-cars, and a tight
10
+ human-in-the-loop pattern. The user reads diffs and tool calls directly;
11
+ prose is for genuine signal, not narration.
12
+
13
+ ## Rules
14
+
15
+ 1. **No decorative summaries.** Skip "I'll start by…", "Now let me…",
16
+ "In summary…" and other rituals. State what changed, in one or two
17
+ sentences.
18
+ 2. **No "let me read the file" preambles.** State the action and call
19
+ the tool — the user sees both.
20
+ 3. **Diff > prose.** When a code change is the deliverable, point at the
21
+ files and let the diff speak. Only add prose where the diff is not
22
+ self-explanatory.
23
+ 4. **Use `path:line` for code references** so the user can jump.
24
+ 5. **Match the user's language.** If they write in Vietnamese, reply in
25
+ Vietnamese. If mixed VN/EN, mirror their balance.
26
+ 6. **End turns with what changed + what's next, one sentence each.** No
27
+ bullet lists summarising the previous turn — the user sees the tool
28
+ calls.
29
+ 7. **When uncertain, ask one focused question.** Don't pad with multiple
30
+ "or alternatively" branches.
31
+ 8. **Skills are first-class.** When a user request maps to a skill,
32
+ invoke it rather than freestyle the work. Skills carry the harness's
33
+ safety net (structural tests, baseline monotonic, side-cars).
34
+
35
+ ## What this style is NOT for
36
+
37
+ - Long-form explanations to non-technical stakeholders.
38
+ - Tutorial / educational responses where worked-out reasoning matters.
39
+ - First-time user onboarding where the user explicitly asks for verbose
40
+ guidance.
41
+
42
+ In those cases, fall back to Claude Code's default style.
@@ -21,7 +21,8 @@
21
21
  "Bash(command -v:*)"
22
22
  ]
23
23
  },
24
- "model": "{{#if isPython}}claude-sonnet-4-6{{else}}claude-sonnet-4-6{{/if}}",
24
+ "model": "{{model}}",
25
+ "outputStyle": "harness-terse",
25
26
  "env": {
26
27
  "AGENT_HARNESS_KIT_VERSION": "{{kitVersion}}"
27
28
  }
@@ -0,0 +1,64 @@
1
+ <!-- LOCALE_TODO: translate body to vi -->
2
+ <!-- Source: .claude/skills/add-adr/SKILL.md -->
3
+ <!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
4
+
5
+ ---
6
+ name: add-adr
7
+ description: Use this skill whenever a decision is made about architecture, dependencies, frameworks, naming conventions, or layer order. Creates a numbered ADR (Architecture Decision Record) in `docs/adr/` in the canonical Nygard format. Always invoke this before changing layer order, adding a layer, swapping a major dependency, or introducing a new external service.
8
+ allowed-tools: Read, Write, Glob
9
+ suggested-turns: 6
10
+ ---
11
+
12
+ ## Steps
13
+
14
+ 1. **Find the next number.** List `docs/adr/` and pick the highest existing
15
+ number + 1 (zero-padded to 4 digits).
16
+ 2. **Generate the file.** Write `docs/adr/{NNNN}-{kebab-title}.md` with the
17
+ sections below.
18
+ 3. **Update affected configs.** If the ADR changes layer order or adds a
19
+ layer, update `harness.config.json` AND the structural-test config in the
20
+ same commit as the ADR.
21
+ 4. **Append to the index.** Add a one-line entry under "Recent decisions" in
22
+ `docs/architecture.md`.
23
+
24
+ ## ADR template (write exactly this shape)
25
+
26
+ ```markdown
27
+ # ADR <NNNN> — <title>
28
+
29
+ - **Status:** proposed | accepted | superseded by <link>
30
+ - **Date:** YYYY-MM-DD
31
+ - **Deciders:** <names or "project owner">
32
+
33
+ ## Context
34
+
35
+ <What forces are in play? What constraints? What did we learn that triggered this?>
36
+
37
+ ## Decision
38
+
39
+ <What we decided. Single sentence then a list.>
40
+
41
+ ## Consequences
42
+
43
+ Positive: ...
44
+ Negative: ...
45
+
46
+ ## Alternatives considered
47
+
48
+ - <alternative>: <why rejected>
49
+ - <alternative>: <why rejected>
50
+ ```
51
+
52
+ ## Output contract
53
+
54
+ ```
55
+ ### ADR: <NNNN>-<slug>
56
+ ### Status: <status>
57
+ ### Configs updated: <list or "none">
58
+ ### docs/architecture.md updated: yes/no
59
+ ```
60
+
61
+ ## Anti-patterns
62
+
63
+ - Don't write an ADR for a one-line refactor — those go in commit messages.
64
+ - Don't change the status of an existing ADR retroactively. Supersede it.
@@ -0,0 +1,50 @@
1
+ ---
2
+ name: add-feature
3
+ description: Use this skill whenever the user asks to add, implement, or build a new feature, capability, endpoint, page, command, or anything user-visible. Enforces the Anthropic two-fold harness pattern — read feature_list.json, pick exactly one feature, implement incrementally, run the structural test on every save, and never declare "done" without updating the JSON. Always invoke this skill instead of writing new feature code freehand.
4
+ allowed-tools: Read, Edit, Write, Bash(npm run:*), Bash(pytest:*), Bash(ruff:*), Bash(git:*), Glob, Grep
5
+ suggested-turns: 25
6
+ ---
7
+
8
+ ## Các bước
9
+
10
+ 1. **Đọc `feature_list.json`.** Xác nhận feature tồn tại và `passes:
11
+ false`. Nếu user mô tả feature chưa có trong danh sách, **dừng lại**:
12
+ hỏi xem có nên thêm qua `/add-adr` trước không.
13
+ 2. **Đọc `docs/architecture.md`** cho domain bị ảnh hưởng. Xác định
14
+ những layer nào sẽ thay đổi.
15
+ 3. **Chạy `/inspect-module`** trên mỗi module bị ảnh hưởng. Làm điều này
16
+ ngay cả khi bạn nghĩ đã hiểu khu vực đó — phải kiểm chứng, không phỏng đoán.
17
+ 4. **Lập kế hoạch trước.** Viết một đoạn văn ngắn vào `.harness/PLAN.md`
18
+ *trước khi thay đổi code*. (Pattern theo Anthropic Claude 4 prompt-guide.)
19
+ 5. **Bắt đầu từ thay đổi nhỏ nhất.** Sửa ít nhất đủ để một `steps[]`
20
+ chuyển từ failing → passing.
21
+ 6. **Chạy structural test.** {{#if isPython}}`python -m harness.structural_test`{{else}}`npm run harness:check`{{/if}}.
22
+ Nếu fail, sửa vi phạm trước khi tiếp tục — không bao giờ disable test.
23
+ 7. **Smoke test.** Chạy smoke test liên quan trong `scripts/dev-up.sh`.
24
+ 8. **Cập nhật `feature_list.json` CHỈ** bằng cách đổi field `passes` của
25
+ một item. Không bao giờ xóa hoặc viết lại items. (Quy tắc Anthropic
26
+ "JSON hơn Markdown": "model ít có khả năng thay đổi/ghi đè JSON files
27
+ so với Markdown.")
28
+ 9. **Append vào PROGRESS.** Một dòng vào `.harness/PROGRESS.md`:
29
+ `YYYY-MM-DD HH:MM | <feature_id> | done`.
30
+ 10. **Commit.** Message: `feat(<domain>): <feature_id> - <ngắn gọn>`.
31
+
32
+ ## Các failure mode cần tránh (mỗi dòng tương ứng một lỗi đã quan sát thực tế)
33
+
34
+ - Không tuyên bố feature đã xong khi chưa chạy smoke test.
35
+ - Không đánh dấu `passes: true` khi structural test còn failing.
36
+ - Không thêm feature mới vào `feature_list.json` giữa session — đề xuất
37
+ cho session sau qua ADR.
38
+ - Không refactor code không liên quan trong cùng commit.
39
+
40
+ ## Output contract
41
+
42
+ Sau khi triển khai, tóm tắt:
43
+
44
+ ```
45
+ ### Feature: <id>
46
+ ### Files changed: <list>
47
+ ### Structural test: PASS|FAIL
48
+ ### Smoke test: PASS|FAIL
49
+ ### Reviewer subagents to invoke: architecture-reviewer, security-reviewer (nếu chạm auth/IO), reliability-reviewer (nếu chạm retries/timeouts)
50
+ ```
@@ -0,0 +1,42 @@
1
+ <!-- LOCALE_TODO: translate body to vi -->
2
+ <!-- Source: .claude/skills/debug-flow/SKILL.md.hbs -->
3
+ <!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
4
+
5
+ ---
6
+ name: debug-flow
7
+ description: Use this skill whenever the user reports a bug, unexpected output, or "this doesn't work". Runs the dev server, drives the failing flow via Playwright MCP if installed (else captures stdout/stderr), and produces a minimal repro before any fix. Mirrors the OpenAI Chrome-DevTools-Protocol-into-runtime pattern at solo scale — verify the failure before you propose a fix.
8
+ allowed-tools: Read, Bash({{devCmd}}), Bash(curl:*), Bash(playwright:*), Bash(scripts/dev-up.sh)
9
+ suggested-turns: 20
10
+ ---
11
+
12
+ ## Steps
13
+
14
+ 1. **Start the dev server** via `scripts/dev-up.sh`. Wait for the readiness
15
+ probe.
16
+ 2. **Drive the failing flow.**
17
+ - If the bug is UI: use Playwright MCP (`mcp__playwright__*`) — the
18
+ Anthropic claude.ai-clone pattern.
19
+ - If MCP unavailable: fall back to `curl -i` + screenshot via
20
+ `scrot`/`screencapture`/`gnome-screenshot`.
21
+ 3. **Capture context.** Request payload (if any), response status, stderr
22
+ tail (last 50 lines), last 3 git commits.
23
+ 4. **Write a minimal repro** to `.harness/repros/<date>-<slug>.md` with:
24
+ environment, steps, expected, actual.
25
+ 5. **Only then propose a fix.** Run the structural test and the relevant
26
+ smoke test after the fix. Re-run the repro to confirm.
27
+
28
+ ## Output contract
29
+
30
+ ```
31
+ ### Repro saved: .harness/repros/<filename>
32
+ ### Failure mode: <one-line summary>
33
+ ### Smallest failing input: <code or curl command>
34
+ ### Proposed fix location: <file:line>
35
+ ```
36
+
37
+ ## Anti-patterns
38
+
39
+ - Don't propose a fix before reproducing the bug locally.
40
+ - Don't fix more than the user reported in the same commit.
41
+ - Don't add a defensive try/except over the failing call without
42
+ understanding why it fails.
@@ -1,20 +1,25 @@
1
1
  ---
2
2
  name: doc-drift-scan
3
3
  description: Use this skill weekly, before releases, or when the user mentions "stale docs", "doc drift", "docs are wrong", or "the README is out of date". Cross-checks every code path, file path, and command referenced in `docs/` and `CLAUDE.md` against the current repo state and produces a list of stale references — the doc-gardening agent pattern.
4
- allowed-tools: Read, Glob, Grep, Bash(test -e:*), Bash(command -v:*)
5
- suggested-turns: 12
4
+ allowed-tools: Read, Glob, Grep, Bash(test -e:*), Bash(command -v:*), Bash(node .claude/skills/doc-drift-scan/scripts/scan-paths.mjs:*)
5
+ suggested-turns: 8
6
6
  ---
7
7
 
8
8
  ## Steps
9
9
 
10
- 1. **Extract references.** From `docs/**/*.md` and `CLAUDE.md`, extract:
11
- - Backtick paths matching `[a-z0-9_/.\\-]+\\.(md|ts|tsx|js|py|json|yml|yaml|sh)`
12
- - Backtick commands (first token after a backtick).
13
- - Markdown links to local files (`./...` or `../...`).
14
- 2. **Validate each.**
15
- - Paths: `test -e <path>`. If missing, mark as drift.
16
- - Commands: `command -v <cmd>`. If not on PATH, mark as drift (but allow
17
- a configured allowlist for known optional tools).
10
+ 1. **Extract references + validate (deterministic).** Run the side-car
11
+ script walks `docs/**/*.md` + `CLAUDE.md`, extracts every backtick-path
12
+ containing a slash, checks `existsSync` per ref:
13
+
14
+ ```bash
15
+ node .claude/skills/doc-drift-scan/scripts/scan-paths.mjs
16
+ ```
17
+
18
+ Read the JSON: `{ stats: { docs_scanned, refs_found, refs_missing },
19
+ drift: [{ doc, ref }] }`. Replaces three LLM grep turns.
20
+ 2. **Validate commands (LLM judgment, narrow).** Optional second pass for
21
+ backtick-commands the side-car doesn't classify (no slash → not a path).
22
+ Use `command -v <cmd>` and allow a small allowlist (`jq`, `gh`, `rg`).
18
23
  3. **Group findings.**
19
24
  - `missing-paths`: file moved or deleted.
20
25
  - `wrong-layer-claim`: doc says module is in layer X, structural test
@@ -0,0 +1,52 @@
1
+ <!-- LOCALE_TODO: translate body to vi -->
2
+ <!-- Source: .claude/skills/doc-drift-scan/SKILL.md -->
3
+ <!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
4
+
5
+ ---
6
+ name: doc-drift-scan
7
+ description: Use this skill weekly, before releases, or when the user mentions "stale docs", "doc drift", "docs are wrong", or "the README is out of date". Cross-checks every code path, file path, and command referenced in `docs/` and `CLAUDE.md` against the current repo state and produces a list of stale references — the doc-gardening agent pattern.
8
+ allowed-tools: Read, Glob, Grep, Bash(test -e:*), Bash(command -v:*), Bash(node .claude/skills/doc-drift-scan/scripts/scan-paths.mjs:*)
9
+ suggested-turns: 8
10
+ ---
11
+
12
+ ## Steps
13
+
14
+ 1. **Extract references + validate (deterministic).** Run the side-car
15
+ script — walks `docs/**/*.md` + `CLAUDE.md`, extracts every backtick-path
16
+ containing a slash, checks `existsSync` per ref:
17
+
18
+ ```bash
19
+ node .claude/skills/doc-drift-scan/scripts/scan-paths.mjs
20
+ ```
21
+
22
+ Read the JSON: `{ stats: { docs_scanned, refs_found, refs_missing },
23
+ drift: [{ doc, ref }] }`. Replaces three LLM grep turns.
24
+ 2. **Validate commands (LLM judgment, narrow).** Optional second pass for
25
+ backtick-commands the side-car doesn't classify (no slash → not a path).
26
+ Use `command -v <cmd>` and allow a small allowlist (`jq`, `gh`, `rg`).
27
+ 3. **Group findings.**
28
+ - `missing-paths`: file moved or deleted.
29
+ - `wrong-layer-claim`: doc says module is in layer X, structural test
30
+ says layer Y.
31
+ - `outdated-commands`: command no longer exists or signature changed.
32
+ 4. **Open ONE PR.** Label `doc-drift`. Patch all findings in one commit. Do
33
+ not merge.
34
+
35
+ ## Output contract
36
+
37
+ ```
38
+ ### Doc-drift scan: <date>
39
+ ### References checked: <N>
40
+ ### Drifted: <count>
41
+ ### PR opened: #<n>
42
+ ### Top 3 drifts:
43
+ 1. ...
44
+ 2. ...
45
+ 3. ...
46
+ ```
47
+
48
+ ## Anti-patterns
49
+
50
+ - Don't fix code to match docs. Fix docs to match code.
51
+ - Don't propose deleting a doc that drifted — drift means the doc was once
52
+ useful. Update or supersede it.
@@ -0,0 +1,64 @@
1
+ #!/usr/bin/env node
2
+ // scan-paths.mjs — deterministic step for /doc-drift-scan.
3
+ // Walks docs/ + CLAUDE.md, extracts backtick paths, checks existsSync.
4
+ // Output JSON: { stats, drift }.
5
+
6
+ import { readFileSync, readdirSync, existsSync } from "node:fs";
7
+ import { join, relative, resolve } from "node:path";
8
+
9
+ const ROOT = process.env.CLAUDE_PROJECT_DIR || process.cwd();
10
+ const PATH_IN_BACKTICKS = /`([^`\s][^`]*?)`/g;
11
+ const PATH_LIKE = /^[^|&;$][\w./@-]+\/[\w./@-]+/;
12
+
13
+ function walkText() {
14
+ const out = [];
15
+ if (existsSync(join(ROOT, "CLAUDE.md"))) out.push(join(ROOT, "CLAUDE.md"));
16
+ if (existsSync(join(ROOT, "docs"))) {
17
+ for (const f of walkRecursive(join(ROOT, "docs"))) {
18
+ if (/\.(md|markdown|mdx)$/i.test(f)) out.push(f);
19
+ }
20
+ }
21
+ return out;
22
+ }
23
+ function* walkRecursive(d) {
24
+ for (const e of readdirSync(d, { withFileTypes: true })) {
25
+ const p = join(d, e.name);
26
+ if (e.isDirectory()) yield* walkRecursive(p);
27
+ else yield p;
28
+ }
29
+ }
30
+ function extractPaths(body) {
31
+ const found = new Set();
32
+ let m;
33
+ while ((m = PATH_IN_BACKTICKS.exec(body)) !== null) {
34
+ const candidate = m[1].trim();
35
+ if (!PATH_LIKE.test(candidate)) continue;
36
+ if (/^https?:\/\//.test(candidate)) continue;
37
+ found.add(candidate);
38
+ }
39
+ return [...found];
40
+ }
41
+ function fileExistsRelative(p) {
42
+ const clean = p.replace(/:\d+(-\d+)?$/, "");
43
+ return existsSync(resolve(ROOT, clean));
44
+ }
45
+
46
+ function main() {
47
+ const files = walkText();
48
+ const drift = [];
49
+ const stats = { docs_scanned: files.length, refs_found: 0, refs_missing: 0 };
50
+ for (const doc of files) {
51
+ let body;
52
+ try { body = readFileSync(doc, "utf8"); } catch { continue; }
53
+ for (const ref of extractPaths(body)) {
54
+ stats.refs_found++;
55
+ if (!fileExistsRelative(ref)) {
56
+ stats.refs_missing++;
57
+ drift.push({ doc: relative(ROOT, doc), ref });
58
+ }
59
+ }
60
+ }
61
+ process.stdout.write(JSON.stringify({ stats, drift }, null, 2) + "\n");
62
+ }
63
+
64
+ main();
@@ -0,0 +1,59 @@
1
+ <!-- LOCALE_TODO: translate body to vi -->
2
+ <!-- Source: .claude/skills/eval-runner/SKILL.md -->
3
+ <!-- Edit only the markdown body — keep frontmatter verbatim so the kit's renderer + Claude Code parse it identically across locales. -->
4
+
5
+ ---
6
+ name: eval-runner
7
+ description: Use this skill whenever a skill, subagent, or hook is changed, before merging to main, or when the user mentions "eval", "regression test for the harness", or "is the harness still working". This skill is a thin wrapper — it runs a single shell command (`npm run harness:eval` or `python -m harness.eval_runner`) and summarizes the JSONL output. Do not implement the eval logic yourself; the runner is already deterministic.
8
+ allowed-tools: Bash(npm run harness:eval:*), Bash(python -m harness.eval_runner:*), Read
9
+ suggested-turns: 2
10
+ ---
11
+
12
+ ## When to use
13
+
14
+ The user said any of: "run the evals", "regression-test the harness", "is the
15
+ harness still working", "eval the kit changes".
16
+
17
+ ## Steps
18
+
19
+ This is a **2-turn skill**. Don't over-engineer.
20
+
21
+ 1. **Run the script.** Pick the right invocation based on stack:
22
+ - TypeScript: `npm run harness:eval -- --quick --transport=mock`
23
+ - Python: `python -m harness.eval_runner --quick --transport=mock`
24
+
25
+ Use `--quick` (3 tasks, ~30s) by default. Switch to `--full` only if the
26
+ user asked for it. Use `--transport=claude-cli` only if the user explicitly
27
+ wants a real (paid) run.
28
+
29
+ 2. **Summarize the JSONL output.** Read the latest file in
30
+ `.harness/eval/results/` and produce:
31
+
32
+ ```
33
+ ### Eval run: <sha>
34
+ ### Set: quick | full
35
+ ### Transport: mock | claude-cli
36
+ ### Tasks: <pass>/<total>
37
+ ### Failed dimensions:
38
+ - <task-id>.outcome: <info>
39
+ - <task-id>.process: missing skills [<list>]
40
+ ### Verdict: PASS | FAIL
41
+ ```
42
+
43
+ That's it. Stop after the summary.
44
+
45
+ ## What NOT to do
46
+
47
+ - **Don't re-implement the eval logic.** The runner is a deterministic script.
48
+ If you find yourself writing tool calls one by one, you've misread this skill.
49
+ - **Don't run `--full --transport=claude-cli` without permission.** That's a
50
+ ~$2 paid API run. Always confirm first.
51
+ - **Don't fix failing tasks.** This skill reports the result; fixing is a
52
+ separate task that uses `/structural-test-author` or
53
+ `/propose-harness-improvement`.
54
+
55
+ ## Anti-pattern flag
56
+
57
+ If you (the agent reading this skill) feel like you need >3 turns to do this,
58
+ stop and re-read the steps. The whole skill is: spawn a subprocess, read a
59
+ file, format a summary.
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: garbage-collection
3
3
  description: Use this skill on Fridays, before tagging a release, or when the user mentions "cleanup", "tech debt", "AI slop", "GC", or "garbage collection". Runs the deterministic linters, structural tests, and doc-drift scans, then proposes the top-3 highest-leverage cleanups (with risk/cost/benefit) — does NOT auto-merge. This is the solo-dev shrunk version of OpenAI's Friday garbage-collection ritual.
4
- allowed-tools: Read, Glob, Grep, Bash(npm run:*), Bash(pytest:*), Bash(ruff:*), Bash(git:*), Bash(gh:*)
4
+ allowed-tools: Read, Glob, Grep, Bash(npm run:*), Bash(pytest:*), Bash(ruff:*), Bash(git:*), Bash(gh:*), Bash(node .claude/skills/garbage-collection/scripts/gc-classify.mjs:*)
5
5
  suggested-turns: 15
6
6
  ---
7
7
 
@@ -19,10 +19,19 @@ suggested-turns: 15
19
19
  - **Doc drift** (a path in `docs/architecture.md` no longer exists) →
20
20
  invoke `doc-drift-scan` skill.
21
21
  - **Hand-rolled helper** matching a shared utility → propose replacement.
22
- 3. **Score** each candidate fix on three 1–5 dimensions:
23
- - **Risk**: 1 = trivial rename, 5 = touches multiple modules.
24
- - **Cost**: tokens + minutes.
25
- - **Benefit**: how often this drift recurs (read `.harness/gc-history.json`).
22
+ 3. **Score** each candidate fix on three 1–5 dimensions via the side-car
23
+ script (replaces the previous LLM-scored turn deterministic and
24
+ auditable):
25
+
26
+ ```bash
27
+ node .claude/skills/garbage-collection/scripts/gc-classify.mjs \
28
+ --baseline .harness/gc-<date>.json \
29
+ --history .harness/gc-history.json
30
+ ```
31
+
32
+ The script applies the mechanical rubric: `risk = 1 + ceil(touched/3)`,
33
+ `cost = 1 + ceil(lines/30)`, `benefit = recurrenceCount(class)`. Read
34
+ the JSON `candidates[]` sorted by `(benefit desc, cost asc, risk asc)`.
26
35
  4. **Propose ONLY the top 3** cleanups (solo-dev cap; OpenAI does dozens, you
27
36
  do 3). Open them as separate PRs with `gh pr create --label gc --draft`.
28
37
  5. **Append a row** to `.harness/gc-history.json`: