forge-workflow 0.0.1

Files changed (105)
  1. package/.claude/commands/dev.md +314 -0
  2. package/.claude/commands/plan.md +389 -0
  3. package/.claude/commands/premerge.md +179 -0
  4. package/.claude/commands/research.md +42 -0
  5. package/.claude/commands/review.md +442 -0
  6. package/.claude/commands/rollback.md +721 -0
  7. package/.claude/commands/ship.md +134 -0
  8. package/.claude/commands/sonarcloud.md +152 -0
  9. package/.claude/commands/status.md +77 -0
  10. package/.claude/commands/validate.md +237 -0
  11. package/.claude/commands/verify.md +221 -0
  12. package/.claude/rules/greptile-review-process.md +285 -0
  13. package/.claude/rules/workflow.md +105 -0
  14. package/.claude/scripts/greptile-resolve.sh +526 -0
  15. package/.claude/scripts/load-env.sh +32 -0
  16. package/.forge/hooks/check-tdd.js +240 -0
  17. package/.github/PLUGIN_TEMPLATE.json +32 -0
  18. package/.mcp.json.example +12 -0
  19. package/AGENTS.md +169 -0
  20. package/CLAUDE.md +99 -0
  21. package/LICENSE +21 -0
  22. package/README.md +414 -0
  23. package/bin/forge-cmd.js +313 -0
  24. package/bin/forge-validate.js +303 -0
  25. package/bin/forge.js +4228 -0
  26. package/docs/AGENT_INSTALL_PROMPT.md +342 -0
  27. package/docs/ENHANCED_ONBOARDING.md +602 -0
  28. package/docs/EXAMPLES.md +482 -0
  29. package/docs/GREPTILE_SETUP.md +400 -0
  30. package/docs/MANUAL_REVIEW_GUIDE.md +106 -0
  31. package/docs/ROADMAP.md +359 -0
  32. package/docs/SETUP.md +632 -0
  33. package/docs/TOOLCHAIN.md +849 -0
  34. package/docs/VALIDATION.md +363 -0
  35. package/docs/WORKFLOW.md +400 -0
  36. package/docs/planning/PROGRESS.md +396 -0
  37. package/docs/plans/.gitkeep +0 -0
  38. package/docs/plans/2026-02-27-forge-test-suite-v2-decisions.md +21 -0
  39. package/docs/plans/2026-02-27-forge-test-suite-v2-design.md +362 -0
  40. package/docs/plans/2026-02-27-forge-test-suite-v2-tasks.md +343 -0
  41. package/docs/plans/2026-03-02-superpowers-gaps-decisions.md +26 -0
  42. package/docs/plans/2026-03-02-superpowers-gaps-design.md +239 -0
  43. package/docs/plans/2026-03-02-superpowers-gaps-tasks.md +260 -0
  44. package/docs/plans/2026-03-04-agent-command-parity-design.md +163 -0
  45. package/docs/plans/2026-03-04-verify-worktree-cleanup-decisions.md +7 -0
  46. package/docs/plans/2026-03-04-verify-worktree-cleanup-design.md +165 -0
  47. package/docs/plans/2026-03-05-forge-uto-decisions.md +6 -0
  48. package/docs/plans/2026-03-05-forge-uto-design.md +116 -0
  49. package/docs/plans/2026-03-05-forge-uto-tasks.md +244 -0
  50. package/docs/plans/2026-03-10-command-creator-and-eval-decisions.md +52 -0
  51. package/docs/plans/2026-03-10-command-creator-and-eval-design.md +350 -0
  52. package/docs/plans/2026-03-10-command-creator-and-eval-tasks.md +426 -0
  53. package/docs/plans/2026-03-10-stale-workflow-refs-decisions.md +8 -0
  54. package/docs/plans/2026-03-10-stale-workflow-refs-design.md +80 -0
  55. package/docs/plans/2026-03-10-stale-workflow-refs-tasks.md +90 -0
  56. package/docs/plans/2026-03-14-beads-plan-context-decisions.md +9 -0
  57. package/docs/plans/2026-03-14-beads-plan-context-design.md +171 -0
  58. package/docs/plans/2026-03-14-beads-plan-context-tasks.md +160 -0
  59. package/docs/plans/2026-03-14-skill-eval-loop-decisions.md +33 -0
  60. package/docs/plans/2026-03-14-skill-eval-loop-design.md +118 -0
  61. package/docs/plans/2026-03-14-skill-eval-loop-results.md +78 -0
  62. package/docs/plans/2026-03-14-skill-eval-loop-tasks.md +160 -0
  63. package/docs/plans/2026-03-15-agent-command-parity-v2-decisions.md +11 -0
  64. package/docs/plans/2026-03-15-agent-command-parity-v2-design.md +145 -0
  65. package/docs/plans/2026-03-15-agent-command-parity-v2-tasks.md +211 -0
  66. package/docs/research/TEMPLATE.md +292 -0
  67. package/docs/research/advanced-testing.md +297 -0
  68. package/docs/research/agent-permissions.md +167 -0
  69. package/docs/research/dependency-chain.md +328 -0
  70. package/docs/research/forge-workflow-v2.md +550 -0
  71. package/docs/research/plugin-architecture.md +772 -0
  72. package/docs/research/pr4-cli-automation.md +326 -0
  73. package/docs/research/premerge-verify-restructure.md +205 -0
  74. package/docs/research/skills-restructure.md +508 -0
  75. package/docs/research/sonarcloud-perfection-plan.md +166 -0
  76. package/docs/research/sonarcloud-quality-gate.md +184 -0
  77. package/docs/research/superpowers-integration.md +403 -0
  78. package/docs/research/superpowers.md +319 -0
  79. package/docs/research/test-environment.md +519 -0
  80. package/install.sh +1062 -0
  81. package/lefthook.yml +39 -0
  82. package/lib/agents/README.md +198 -0
  83. package/lib/agents/claude.plugin.json +28 -0
  84. package/lib/agents/cline.plugin.json +22 -0
  85. package/lib/agents/codex.plugin.json +19 -0
  86. package/lib/agents/copilot.plugin.json +24 -0
  87. package/lib/agents/cursor.plugin.json +25 -0
  88. package/lib/agents/kilocode.plugin.json +22 -0
  89. package/lib/agents/opencode.plugin.json +20 -0
  90. package/lib/agents/roo.plugin.json +23 -0
  91. package/lib/agents-config.js +2112 -0
  92. package/lib/commands/dev.js +513 -0
  93. package/lib/commands/plan.js +696 -0
  94. package/lib/commands/recommend.js +119 -0
  95. package/lib/commands/ship.js +377 -0
  96. package/lib/commands/status.js +378 -0
  97. package/lib/commands/validate.js +602 -0
  98. package/lib/context-merge.js +359 -0
  99. package/lib/plugin-catalog.js +360 -0
  100. package/lib/plugin-manager.js +166 -0
  101. package/lib/plugin-recommender.js +141 -0
  102. package/lib/project-discovery.js +491 -0
  103. package/lib/setup.js +118 -0
  104. package/lib/workflow-profiles.js +203 -0
  105. package/package.json +115 -0
@@ -0,0 +1,426 @@
1
+ # Task List: Command Creator & Eval
2
+
3
+ - **Feature**: command-creator-and-eval
4
+ - **Date**: 2026-03-10
5
+ - **Beads**: forge-jfw (PR-A), forge-agp (PR-B), forge-1jx (PR-C)
6
+ - **Branch**: feat/command-creator-and-eval
7
+ - **Worktree**: .worktrees/command-creator-and-eval
8
+ - **Design doc**: docs/plans/2026-03-10-command-creator-and-eval-design.md
9
+ - **Baseline**: 1160 pass, 5 fail (pre-existing chalk errors in skills package), 31 skip
10
+
11
+ ---
12
+
13
+ ## PR-A: Static Command Validator + Sync Infrastructure (forge-jfw)
14
+
15
+ Ship order: **FIRST** (no dependencies, highest ROI)
16
+
17
+ ---
18
+
19
+ ### Task 1: Dead reference detection tests (RED)
20
+
21
+ **File(s)**: `test/structural/command-files.test.js`
22
+
23
+ **What to implement**: Add a new `describe` block "dead reference checks" that reads every `.claude/commands/*.md` file and checks for known stale references:
24
+ - `openspec` (removed tool)
25
+ - `/merge` (renamed to `/premerge`)
26
+ - `/check` when used as a stage name (renamed to `/validate`)
27
+ - `docs/planning/PROGRESS.md` (removed file)
28
+ - `9-stage` or `nine stage` (now 7 stages)
29
+
30
+ **TDD steps**:
31
+ 1. Write test: `test/structural/command-files.test.js` — new describe block with 5 regex patterns, assert none match
32
+ 2. Run test: confirm it FAILS (known: `/status` has `openspec list`, `/rollback` has 9-stage ref)
33
+ 3. Note: do NOT fix the commands here — that's forge-ctc's job. Tests document the problem.
34
+ 4. Mark failing tests with `.todo()` so CI stays green until forge-ctc lands
35
+ 5. Commit: `test: add dead reference detection for command files`
36
+
37
+ **Expected output**: 5 new `.todo()` tests that will pass once forge-ctc lands.
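
The five checks in Task 1 could be expressed as a small pattern table. This is a minimal sketch only: the names `DEAD_REFERENCE_PATTERNS` and `findDeadReferences` and the exact regexes are assumptions for illustration, not the contents of `test/structural/command-files.test.js`.

```javascript
// Hypothetical pattern table derived from the five stale references in Task 1.
// The regexes are assumptions; tune them against the real command files.
const DEAD_REFERENCE_PATTERNS = [
  { name: 'openspec', regex: /\bopenspec\b/i },
  // Require a non-path character before the slash so `/premerge` and
  // paths such as `docs/merge` are not flagged.
  { name: '/merge', regex: /(^|[^\w/-])\/merge\b/ },
  { name: '/check', regex: /(^|[^\w/-])\/check\b/ },
  { name: 'PROGRESS.md', regex: /docs\/planning\/PROGRESS\.md/ },
  { name: '9-stage', regex: /\b(9-stage|nine[- ]stage)\b/i },
];

// Returns the names of every pattern that matches the given command text,
// in table order. An empty array means the file is clean.
function findDeadReferences(text) {
  return DEAD_REFERENCE_PATTERNS
    .filter(({ regex }) => regex.test(text))
    .map(({ name }) => name);
}
```

A test could then read each command file and assert that `findDeadReferences(content)` is empty, which keeps the pattern list in one place if more stale terms are added later.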
38
+
39
+ ---
40
+
41
+ ### Task 2: Cross-command contract tests (RED → GREEN)
42
+
43
+ **File(s)**: `test/structural/command-contracts.test.js` (new file)
44
+
45
+ **What to implement**: Verify that commands reference each other correctly:
46
+ - `/plan` output mentions `docs/plans/YYYY-MM-DD-<slug>-tasks.md` → `/dev` input expects this pattern
47
+ - `/plan` mentions `docs/plans/YYYY-MM-DD-<slug>-design.md` → `/ship` references design doc
48
+ - `/dev` mentions `bun test` or `TEST_COMMAND` → `/validate` runs tests
49
+ - `/ship` mentions `gh pr create` → `/review` mentions PR
50
+ - All 7 workflow commands reference the correct stage numbers (plan=1, dev=2, validate=3, ship=4, review=5, premerge=6, verify=7)
51
+
52
+ **TDD steps**:
53
+ 1. Write test: new file `test/structural/command-contracts.test.js` — 5 contract assertions
54
+ 2. Run test: confirm passes (these contracts should already hold)
55
+ 3. If any fail: document which contract is broken, mark as `.todo()`
56
+ 4. Commit: `test: add cross-command contract tests`
57
+
58
+ **Expected output**: 5+ passing contract tests.
59
+
60
+ ---
61
+
62
+ ### Task 3: Sync script — frontmatter parser utility
63
+
64
+ **File(s)**: `scripts/sync-commands.js`
65
+
66
+ **What to implement**: A utility module that:
67
+ - Reads a `.claude/commands/*.md` file
68
+ - Extracts YAML frontmatter (between `---` markers)
69
+ - Returns `{ frontmatter: object, body: string }`
70
+ - Can reconstruct a file with different frontmatter: `buildFile(newFrontmatter, body)`
71
+
72
+ **TDD steps**:
73
+ 1. Write test: `test/scripts/sync-commands.test.js` — parse frontmatter from sample command, rebuild with different frontmatter
74
+ 2. Run test: confirm fails (module doesn't exist)
75
+ 3. Implement: `scripts/sync-commands.js` with `parseFrontmatter()` and `buildFile()` functions
76
+ 4. Run test: confirm passes
77
+ 5. Commit: `feat: add frontmatter parser for command sync`
78
+
79
+ **Expected output**: `parseFrontmatter('---\ndescription: X\n---\nbody')` → `{ frontmatter: { description: 'X' }, body: 'body' }`
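
One way the two functions named in Task 3 might look is sketched below. It assumes frontmatter is limited to flat `key: value` lines, which is a simplification; nested YAML or quoted values would need a real YAML parser.

```javascript
// Minimal sketch: handles only flat `key: value` frontmatter (an assumption).
function parseFrontmatter(text) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(text);
  if (!match) return { frontmatter: {}, body: text };
  const frontmatter = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(':');
    if (idx === -1) continue; // skip lines that are not key: value pairs
    frontmatter[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  // Everything after the closing marker is the body.
  return { frontmatter, body: text.slice(match[0].length) };
}

// Rebuilds a file from a frontmatter object and a body string.
// An empty frontmatter object produces the body alone, with no markers.
function buildFile(frontmatter, body) {
  const keys = Object.keys(frontmatter);
  if (keys.length === 0) return body;
  const lines = keys.map((key) => `${key}: ${frontmatter[key]}`);
  return `---\n${lines.join('\n')}\n---\n${body}`;
}
```

Returning the body unchanged when the frontmatter object is empty makes the "strip all frontmatter" case for agents such as Cursor and Cline a call to `buildFile({}, body)`.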
80
+
81
+ ---
82
+
83
+ ### Task 4: Sync script — adapter transforms per agent
84
+
85
+ **File(s)**: `scripts/sync-commands.js`
86
+
87
+ **What to implement**: Add an `AGENT_ADAPTERS` config object that maps each agent to its:
88
+ - Target directory
89
+ - File extension
90
+ - Frontmatter transform function (strip, keep, add fields)
91
+
92
+ Read agent capabilities from `lib/agents/*.plugin.json` to determine which agents to sync.
93
+
94
+ Agents (8 total, 2 tiers):
95
+
96
+ **Tier 1 — Full workflow (commands + hooks + MCP):**
97
+ - Claude Code: no-op (canonical)
98
+ - Cursor: strip all frontmatter, output to `.cursor/skills/<name>/`
99
+ - Cline: strip all frontmatter, output to `.clinerules/workflows/`
100
+ - OpenCode: keep `description`, output to `.opencode/commands/`
101
+ - GitHub Copilot: add `name`, `description`, `tools:`; change ext to `.prompt.md`, output to `.github/prompts/`
102
+
103
+ **Tier 2 — Partial (commands + MCP, no hooks):**
104
+ - Kilo Code: keep `description`, add `mode: code`, output to `.kilocode/workflows/`
105
+ - Roo Code: keep `description`, add `mode: code`, output to `.roo/commands/`
106
+ - Codex: special case (combined SKILL.md file), output to `.codex/skills/<name>/`
107
+
108
+ **TDD steps**:
109
+ 1. Write test: `test/scripts/sync-commands.test.js` — for each agent, assert transform produces correct frontmatter and extension
110
+ 2. Run test: confirm fails
111
+ 3. Implement: adapter transforms in `scripts/sync-commands.js`
112
+ 4. Run test: confirm passes
113
+ 5. Commit: `feat: add agent adapter transforms for command sync`
114
+
115
+ **Expected output**: `adaptForAgent('cursor', { description: 'X' }, 'body')` → `{ content: 'body', filename: 'plan.md', dir: '.cursor/skills/plan/' }`
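
The adapter table might be shaped roughly as follows. This sketch covers three of the eight agents and takes its directories and frontmatter rules from the tier list in this task. The fourth `name` parameter and the `dir`/`ext`/`transform` field names are assumptions for illustration.

```javascript
// Hypothetical adapter table for a subset of agents. Directories and
// frontmatter rules follow the tier list; field names are assumed.
const AGENT_ADAPTERS = {
  cursor: {
    dir: (name) => `.cursor/skills/${name}/`,
    ext: '.md',
    transform: () => ({}), // strip all frontmatter
  },
  opencode: {
    dir: () => '.opencode/commands/',
    ext: '.md',
    transform: (fm) => ({ description: fm.description }),
  },
  kilocode: {
    dir: () => '.kilocode/workflows/',
    ext: '.md',
    transform: (fm) => ({ description: fm.description, mode: 'code' }),
  },
};

// `name` (the command name, e.g. 'plan') is an assumed extra parameter.
function adaptForAgent(agent, frontmatter, body, name) {
  const adapter = AGENT_ADAPTERS[agent];
  if (!adapter) throw new Error(`Unknown agent: ${agent}`);
  const fm = adapter.transform(frontmatter);
  const header = Object.entries(fm).map(([k, v]) => `${k}: ${v}`).join('\n');
  // Emit frontmatter markers only when at least one field survives.
  const content = header ? `---\n${header}\n---\n${body}` : body;
  return { content, filename: `${name}${adapter.ext}`, dir: adapter.dir(name) };
}
```

Keeping each agent's rules in a data table rather than in branching code means adding a ninth agent is a new entry, and the per-agent tests can iterate over `Object.keys(AGENT_ADAPTERS)`.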
116
+
117
+ ---
118
+
119
+ ### Task 5: Sync script — CLI entry point (`sync-commands` command)
120
+
121
+ **File(s)**: `scripts/sync-commands.js`
122
+
123
+ **What to implement**: Add CLI entry point that:
124
+ - Reads all `.claude/commands/*.md` files
125
+ - For each agent with `commands: true` in plugin.json: generates adapted files
126
+ - Writes to agent-specific directories
127
+ - `--dry-run` flag: prints what would be written without writing
128
+ - `--check` flag: compares existing files, exits non-zero if out of sync
129
+ - Warns before overwriting files that have been manually modified (content hash check)
130
+
131
+ **TDD steps**:
132
+ 1. Write test: `test/scripts/sync-commands.test.js` — test `--check` mode against a mock filesystem with one in-sync and one out-of-sync agent
133
+ 2. Run test: confirm fails
134
+ 3. Implement: CLI entry point with `--dry-run` and `--check` flags
135
+ 4. Run test: confirm passes
136
+ 5. Commit: `feat: add sync-commands CLI with --dry-run and --check flags`
137
+
138
+ **Expected output**: `node scripts/sync-commands.js --dry-run` prints list of files to generate. `--check` exits 0 when in sync.
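
The comparison behind `--check` can be kept separate from file I/O so it is easy to test against a mock filesystem. In this sketch, `findDrift` and `exitCodeFor` are hypothetical names, and both arguments are assumed to be plain objects mapping relative paths to file contents.

```javascript
// Sketch of the `--check` comparison. `expected` holds what the sync script
// would generate; `actual` holds what is currently on disk (both assumed to
// be plain objects of path -> content).
function findDrift(expected, actual) {
  const drift = [];
  for (const [path, content] of Object.entries(expected)) {
    if (!Object.prototype.hasOwnProperty.call(actual, path)) {
      drift.push({ path, reason: 'missing' });
    } else if (actual[path] !== content) {
      drift.push({ path, reason: 'modified' });
    }
  }
  return drift;
}

// Maps the drift list to a process exit code: 0 when in sync, 1 otherwise.
function exitCodeFor(drift) {
  return drift.length === 0 ? 0 : 1;
}
```

The CLI entry point would then build `expected` from the adapters, read `actual` from disk, print each drift entry, and call `process.exit(exitCodeFor(drift))`.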
139
+
140
+ ---
141
+
142
+ ### Task 6: Sync drift test integration
143
+
144
+ **File(s)**: `test/structural/command-sync.test.js` (new file)
145
+
146
+ **What to implement**: A test that runs the sync script in `--check` mode and asserts it passes. This catches sync drift in CI.
147
+
148
+ **TDD steps**:
149
+ 1. Write test: `test/structural/command-sync.test.js` — spawns `node scripts/sync-commands.js --check`, asserts exit code 0
150
+ 2. Run test: confirm fails (no agent dirs exist yet)
151
+ 3. Run `node scripts/sync-commands.js` to generate all agent dirs
152
+ 4. Run test: confirm passes
153
+ 5. Commit: `test: add sync drift detection test`
154
+
155
+ **Expected output**: Test passes when all agent command files match canonical source.
156
+
157
+ ---
158
+
159
+ ### Task 7: agnix evaluation + integration
160
+
161
+ **File(s)**: `package.json` (devDependency), `test/structural/agnix-lint.test.js` (new)
162
+
163
+ **What to implement**: Evaluate agnix (`npx agnix .`) against the repo. If it provides value beyond our custom tests:
164
+ - Add as devDependency
165
+ - Create test that runs `npx agnix . --format json` and asserts 0 errors
166
+ - Document which agnix rules overlap with our custom tests (to avoid duplication)
167
+
168
+ If agnix is not useful (too many false positives, doesn't cover Forge-specific checks): skip and document why.
169
+
170
+ **TDD steps**:
171
+ 1. Run `npx agnix . --format json` manually, review output
172
+ 2. If useful: write test, add devDep, implement
173
+ 3. If not useful: document findings in design doc, skip
174
+ 4. Commit: `feat: integrate agnix multi-agent linter` or `docs: skip agnix — findings documented`
175
+
176
+ **Expected output**: Decision documented. If integrated, `bun test` includes agnix validation.
177
+
178
+ ---
179
+ ### Task 8: Add blast-radius search to /plan command (prevention)
+
+ **File(s)**: `.claude/commands/plan.md`
+
+ **What to implement**: Add a "Blast-radius search" subsection after the existing DRY check in Phase 2. Fires when a feature involves removing, renaming, or replacing something. Directly prevents the gap that caused PR #54 incomplete Antigravity removal.
+
+ Add after DRY check section:
+ - Title: `### Blast-radius search (mandatory for remove/rename/replace features)`
+ - Steps: grep the entire codebase for the thing being removed, add cleanup tasks for every match
+ - Flag matches in unexpected packages explicitly
+
+ Also add condition 4 to Phase 2 exit HARD-GATE:
+ - `4. If feature involves removal/rename: blast-radius search completed, all references in task list`
+
+ **TDD steps**:
+ 1. Write test: extend `test/structural/command-files.test.js` — assert plan.md contains "blast-radius"
+ 2. Run test: confirm fails
+ 3. Implement: edit `.claude/commands/plan.md`
+ 4. Run test: confirm passes
+ 5. Commit: `feat: add blast-radius search to /plan Phase 2`
+
+ **Expected output**: /plan now requires blast-radius grep for removal/rename features.
+
+ ---
180
+
181
+ ## PR-B: Command Behavioral Eval + Improvement Loop (forge-agp)
182
+
183
+ Ship order: **SECOND** (depends on PR-A)
184
+
185
+ ---
186
+
187
+ ### Task 9: Grader agent for command evaluation
188
+
189
+ **File(s)**: `.claude/agents/command-grader.md` (new)
190
+
191
+ **What to implement**: Adapt skill-creator's `agents/grader.md` for command evaluation. The grader receives:
192
+ - Command name (e.g., `/status`)
193
+ - Execution transcript (stream-json output)
194
+ - List of assertions (e.g., "lists beads issues", "shows current branch")
195
+
196
+ Returns: `grading.json` with `{ text, passed, evidence }` per assertion.
197
+
198
+ Key differences from skill-creator grader:
199
+ - Evaluates multi-turn transcripts (not single-skill invocations)
200
+ - HARD-GATE assertion type: "agent stopped when gate condition unmet"
201
+ - Contract assertion type: "output contains file X that next command expects"
202
+
203
+ **TDD steps**:
204
+ 1. Write test: `test/eval/command-grader.test.js` — mock transcript + assertions, verify grading output format
205
+ 2. Run test: confirm fails
206
+ 3. Implement: `.claude/agents/command-grader.md` with assertion evaluation instructions
207
+ 4. Run test: confirm passes (format validation only — actual grading requires claude CLI)
208
+ 5. Commit: `feat: add command-grader agent for behavioral eval`
209
+
210
+ **Expected output**: Agent file exists with grading instructions. Format test passes.
211
+
212
+ ---
213
+
214
+ ### Task 10: Eval set definitions for /status and /validate
215
+
216
+ **File(s)**: `eval/commands/status.eval.json`, `eval/commands/validate.eval.json` (new)
217
+
218
+ **What to implement**: Define eval sets for the two simplest commands:
219
+
220
+ `status.eval.json`:
221
+ ```json
222
+ [
223
+ {
224
+ "scenario": "clean_repo_with_beads",
225
+ "prompt": "/status",
226
+ "assertions": ["shows current branch", "lists beads issues or says no issues", "shows recent commits"],
227
+ "max_turns": 5
228
+ }
229
+ ]
230
+ ```
231
+
232
+ `validate.eval.json`:
233
+ ```json
234
+ [
235
+ {
236
+ "scenario": "all_passing",
237
+ "prompt": "/validate",
238
+ "assertions": ["runs tests", "reports test results", "checks lint"],
239
+ "max_turns": 10
240
+ },
241
+ {
242
+ "scenario": "failing_tests",
243
+ "setup": "break a test file",
244
+ "prompt": "/validate",
245
+ "assertions": ["reports test failures", "does NOT declare all checks passed"],
246
+ "max_turns": 10
247
+ }
248
+ ]
249
+ ```
250
+
251
+ **TDD steps**:
252
+ 1. Write test: `test/eval/eval-schema.test.js` — validate eval JSON files match expected schema
253
+ 2. Run test: confirm fails (files don't exist)
254
+ 3. Create eval JSON files
255
+ 4. Run test: confirm passes
256
+ 5. Commit: `feat: add eval definitions for /status and /validate commands`
257
+
258
+ **Expected output**: Valid eval JSON files with 3+ scenarios total.
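
The schema test in this task could delegate to a validator along these lines. The field names come from the JSON examples above. Treating `setup` as optional and requiring prompts to start with `/` are assumptions, and `validateEvalSet` is a hypothetical name.

```javascript
// Returns a list of human-readable schema errors; an empty list means valid.
function validateEvalSet(evalSet) {
  if (!Array.isArray(evalSet)) return ['eval set must be an array'];
  const errors = [];
  evalSet.forEach((s, i) => {
    if (s === null || typeof s !== 'object') {
      errors.push(`[${i}] scenario entry must be an object`);
      return;
    }
    if (typeof s.scenario !== 'string' || s.scenario === '') {
      errors.push(`[${i}] scenario must be a non-empty string`);
    }
    // Assumption: command evals always invoke a slash command.
    if (typeof s.prompt !== 'string' || !s.prompt.startsWith('/')) {
      errors.push(`[${i}] prompt must be a slash command`);
    }
    const assertionsOk = Array.isArray(s.assertions) &&
      s.assertions.length > 0 &&
      s.assertions.every((a) => typeof a === 'string');
    if (!assertionsOk) errors.push(`[${i}] assertions must be a non-empty string array`);
    if (!Number.isInteger(s.max_turns) || s.max_turns < 1) {
      errors.push(`[${i}] max_turns must be a positive integer`);
    }
    // Assumption: `setup` is optional but must be a string when present.
    if ('setup' in s && typeof s.setup !== 'string') {
      errors.push(`[${i}] setup must be a string when present`);
    }
  });
  return errors;
}
```

Returning every error rather than throwing on the first one lets the test print a complete report for a malformed eval file.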
259
+
260
+ ---
261
+
262
+ ### Task 11: Eval runner script
263
+
264
+ **File(s)**: `scripts/run-command-eval.js` (new)
265
+
266
+ **What to implement**: Script that:
267
+ 1. Reads an eval JSON file
268
+ 2. For each scenario: creates a disposable worktree (or uses `claude --worktree`)
269
+ 3. Runs `claude -p "<prompt>" --output-format stream-json --no-session-persistence --max-turns N`
270
+ 4. Strips `CLAUDECODE` env var for nested invocation
271
+ 5. Captures full transcript
272
+ 6. Passes transcript + assertions to command-grader agent
273
+ 7. Collects grading results
274
+ 8. Prints summary: X/Y assertions passed
275
+ 9. Uses threading-based reader (not `select.select()`) for Windows compatibility
276
+ 10. Cleans up worktrees on completion
277
+
278
+ **TDD steps**:
279
+ 1. Write test: `test/eval/run-command-eval.test.js` — test transcript parsing logic (mock subprocess, don't actually run claude)
280
+ 2. Run test: confirm fails
281
+ 3. Implement: `scripts/run-command-eval.js`
282
+ 4. Run test: confirm passes
283
+ 5. Manual test: `node scripts/run-command-eval.js eval/commands/status.eval.json` (requires claude CLI)
284
+ 6. Commit: `feat: add command eval runner with Windows-compatible streaming`
285
+
286
+ **Expected output**: Script runs, captures transcript, grades assertions, prints results.
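
The transcript-parsing logic that the unit test targets could be isolated from the subprocess as shown below. This sketch assumes the transcript arrives as newline-delimited JSON in arbitrary chunks. `createLineParser` is a hypothetical helper, and no particular event shape from the CLI is assumed.

```javascript
// Buffers arbitrary text chunks and emits one parsed JSON object per line.
// Assumption: the stream is newline-delimited JSON.
function createLineParser(onEvent) {
  let buffer = '';
  return {
    // Call with each chunk of subprocess stdout, already decoded to a string.
    push(chunk) {
      buffer += chunk;
      const lines = buffer.split('\n');
      buffer = lines.pop(); // keep the trailing partial line for the next chunk
      for (const line of lines) {
        if (line.trim() === '') continue;
        onEvent(JSON.parse(line));
      }
    },
    // Call once when the stream closes to flush a final unterminated line.
    end() {
      if (buffer.trim() !== '') onEvent(JSON.parse(buffer));
      buffer = '';
    },
  };
}
```

In the runner, the `push` method would be wired to the child process's stdout `data` events and `end` to its `close` event, so the test can feed mock chunks without spawning anything.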
287
+
288
+ ---
289
+
290
+ ### Task 12: Command improvement script (Scope C)
291
+
292
+ **File(s)**: `scripts/improve-command.js` (new)
293
+
294
+ **What to implement**: Adapted from skill-creator's `improve_description.py`:
295
+ 1. Takes a command name and eval results (with failures)
296
+ 2. Reads the canonical command file
297
+ 3. Calls Claude API with extended thinking to analyze failures and propose a rewrite
298
+ 4. Shows diff between current and proposed command
299
+ 5. **User approval gate**: prints diff, asks for confirmation before writing
300
+ 6. If approved: writes updated command, re-runs eval, shows before/after comparison
301
+ 7. Logs full transcript to `.forge/eval-logs/` (gitignored)
302
+
303
+ **TDD steps**:
304
+ 1. Write test: `test/eval/improve-command.test.js` — test diff generation and approval gate logic (mock API calls)
305
+ 2. Run test: confirm fails
306
+ 3. Implement: `scripts/improve-command.js`
307
+ 4. Run test: confirm passes
308
+ 5. Commit: `feat: add command improvement script with user approval gate`
309
+
310
+ **Expected output**: Script proposes command rewrite, shows diff, waits for approval.
311
+
312
+ ---
313
+
314
+ ## PR-C: Skill Optimization via Eval Loop (forge-1jx)
315
+
316
+ Ship order: **PARALLEL with PR-A** (no dependencies)
317
+
318
+ ---
319
+
320
+ ### Task 13: Skill eval set definitions
321
+
322
+ **File(s)**: `eval/skills/*.eval.json` (6 new files, one per skill)
323
+
324
+ **What to implement**: For each skill in `skills/`, create an eval JSON with:
325
+ - 3 should-trigger queries (realistic user prompts that should activate the skill)
326
+ - 2 should-not-trigger queries (prompts that are superficially similar but shouldn't trigger)
327
+
328
+ Skills: `parallel-web-search`, `parallel-deep-research`, `parallel-web-extract`, `parallel-data-enrichment`, `citation-standards`, `sonarcloud-analysis`
329
+
330
+ **TDD steps**:
331
+ 1. Write test: `test/eval/skill-eval-schema.test.js` — validate all eval JSONs have correct format with both trigger types
332
+ 2. Run test: confirm fails (files don't exist)
333
+ 3. Create eval JSON files for all 6 skills
334
+ 4. Run test: confirm passes
335
+ 5. Commit: `feat: add eval definitions for all 6 skills`
336
+
337
+ **Expected output**: 6 eval JSON files, 30 total queries (5 per skill).
338
+
339
+ ---
340
+
341
+ ### Task 14: Skill eval runner (adapt skill-creator pattern)
342
+
343
+ **File(s)**: `scripts/run-skill-eval.js` (new)
344
+
345
+ **What to implement**: Adapted from skill-creator's `run_eval.py` but in JS for consistency:
346
+ 1. Reads a skill eval JSON
347
+ 2. For each query: runs `claude -p "<query>" --output-format stream-json --verbose --include-partial-messages --no-session-persistence --max-turns 1`
348
+ 3. Detects if the Skill tool was invoked with the correct skill name
349
+ 4. Early termination: if any non-Skill/Read tool called first → not triggered
350
+ 5. Runs each query 3 times for reliability (threshold: ≥2/3 = triggered)
351
+ 6. Reports trigger accuracy: true positives, false positives, true negatives, false negatives
352
+ 7. Windows compatible (threading reader)
353
+
354
+ **TDD steps**:
355
+ 1. Write test: `test/eval/run-skill-eval.test.js` — test trigger detection logic with mock stream-json events
356
+ 2. Run test: confirm fails
357
+ 3. Implement: `scripts/run-skill-eval.js`
358
+ 4. Run test: confirm passes
359
+ 5. Manual test: `node scripts/run-skill-eval.js eval/skills/parallel-web-search.eval.json`
360
+ 6. Commit: `feat: add skill eval runner with trigger detection`
361
+
362
+ **Expected output**: Script reports trigger accuracy per skill.
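
Steps 5 and 6 above (the 2-of-3 vote and the accuracy report) can be sketched as two pure functions. The result shape `{ should_trigger, runs }` and the function names are assumptions for illustration.

```javascript
// `runs` is an array of booleans, one per repeated run of the same query.
// A query counts as triggered when at least `threshold` runs triggered.
function isTriggered(runs, threshold = 2) {
  return runs.filter(Boolean).length >= threshold;
}

// Each result is assumed to look like { should_trigger: boolean, runs: boolean[] }.
function scoreTriggers(results) {
  const counts = { truePositives: 0, falsePositives: 0, trueNegatives: 0, falseNegatives: 0 };
  for (const { should_trigger, runs } of results) {
    const triggered = isTriggered(runs);
    if (should_trigger && triggered) counts.truePositives += 1;
    else if (should_trigger) counts.falseNegatives += 1;
    else if (triggered) counts.falsePositives += 1;
    else counts.trueNegatives += 1;
  }
  return counts;
}
```

Because both functions work on plain booleans, the unit test can cover every quadrant of the confusion matrix without mocking stream-json events at all.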
363
+
364
+ ---
365
+
366
+ ### Task 15: Skill improvement loop with train/test split
367
+
368
+ **File(s)**: `scripts/improve-skill.js` (new)
369
+
370
+ **What to implement**: Adapted from skill-creator's `run_loop.py`:
371
+ 1. Splits eval set 60/40 train/test (stratified by should_trigger)
372
+ 2. Runs eval on full set
373
+ 3. If train score < 100%: calls Claude API with extended thinking to propose new description
374
+ 4. Re-runs eval with new description
375
+ 5. Selects best by **test** score (not train) to prevent overfitting
376
+ 6. Max 5 iterations
377
+ 7. Before/after benchmark comparison
378
+ 8. User approval gate before writing new description
379
+
380
+ **TDD steps**:
381
+ 1. Write test: `test/eval/improve-skill.test.js` — test train/test split logic, best-selection logic (mock API)
382
+ 2. Run test: confirm fails
383
+ 3. Implement: `scripts/improve-skill.js`
384
+ 4. Run test: confirm passes
385
+ 5. Commit: `feat: add skill improvement loop with train/test split`
386
+
387
+ **Expected output**: Script iterates, selects best description, shows before/after comparison.
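
The split and best-selection logic that the test covers might look like this. The sketch splits deterministically in input order, an assumption made to keep it testable; the real script may shuffle first. The names `stratifiedSplit`, `selectBest`, and `testScore` are hypothetical.

```javascript
// Splits items into train/test while preserving the should_trigger ratio.
// No shuffling is done here (assumption), so results are deterministic.
function stratifiedSplit(items, trainRatio = 0.6) {
  const train = [];
  const test = [];
  for (const flag of [true, false]) {
    const group = items.filter((item) => item.should_trigger === flag);
    const cut = Math.round(group.length * trainRatio);
    train.push(...group.slice(0, cut));
    test.push(...group.slice(cut));
  }
  return { train, test };
}

// Picks the candidate with the highest test score; ties keep the earliest.
// Throws on an empty array because reduce has no initial value.
function selectBest(candidates) {
  return candidates.reduce((best, c) => (c.testScore > best.testScore ? c : best));
}
```

With the 3 should-trigger and 2 should-not-trigger queries per skill from Task 13, this yields 3 training items and 2 test items, with both classes present in each half.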
388
+
389
+ ---
390
+
391
+ ## Task Ordering
392
+
393
+ **Foundational first:**
394
+ 1. Task 1 (dead refs) — extends existing test file
395
+ 2. Task 2 (contracts) — new test file, no implementation
396
+ 3. Task 3 (frontmatter parser) — utility for sync script
397
+ 4. Task 4 (adapter transforms) — builds on Task 3
398
+
399
+ **Feature logic:**
400
+ 5. Task 5 (sync CLI) — builds on Tasks 3-4
401
+ 6. Task 6 (sync drift test) — integrates Task 5 into CI
402
+ 7. Task 7 (agnix eval) — independent evaluation
403
+
404
+ **PR-A (continued):**
405
+ 8. Task 8 (blast-radius /plan update)
406
+
407
+ **PR-B (after PR-A ships):**
408
+ 9. Task 9 (grader agent)
409
+ 10. Task 10 (eval definitions)
410
+ 11. Task 11 (eval runner)
411
+ 12. Task 12 (improvement script)
412
+
413
+ **PR-C (parallel):**
414
+ 13. Task 13 (skill eval defs)
415
+ 14. Task 14 (skill eval runner)
416
+ 15. Task 15 (skill improvement loop)
417
+
418
+ ---
419
+
420
+ ## Notes
421
+
422
+ - Tasks 1-8 = PR-A (forge-jfw) — ship first
423
+ - Tasks 9-12 = PR-B (forge-agp) — ship after PR-A
424
+ - Tasks 13-15 = PR-C (forge-1jx) — ship in parallel with PR-A
425
+ - Baseline failures (5 chalk errors in skills package) are pre-existing and unrelated
426
+ - Task 7 (agnix) is exploratory — may be skipped if not useful
@@ -0,0 +1,8 @@
1
+ # Decisions Log: stale-workflow-refs
2
+
3
+ **Beads**: forge-ctc
4
+ **Branch**: feat/stale-workflow-refs
5
+
6
+ ---
7
+
8
+ (No decision gates fired — all changes were fully specified in the design doc)
@@ -0,0 +1,80 @@
1
+ # Design: Clean up stale workflow refs in agent commands
2
+
3
+ - **Feature**: stale-workflow-refs
4
+ - **Date**: 2026-03-10
5
+ - **Status**: approved
6
+ - **Beads**: forge-ctc
7
+
8
+ ## Purpose
9
+
10
+ Three `.claude/commands/` files reference removed tools (openspec), orphaned files (PROGRESS.md), and a dropped workflow stage (/research). This causes confusion when agents execute these commands and hit nonexistent resources. Additionally, `/premerge` has no CHANGELOG.md maintenance step, so the changelog has fallen behind (last updated 2026-02-03).
11
+
12
+ ## Success Criteria
13
+
14
+ 1. `status.md` — no references to openspec, PROGRESS.md, or /research; replaced with Beads equivalents
15
+ 2. `rollback.md` — workflow flow shows correct 7-stage pipeline (no /research)
16
+ 3. `premerge.md` — PROGRESS.md reference replaced with Beads equivalent; CHANGELOG.md update step added
17
+ 4. All workflow flow diagrams in touched files match: `/status → /plan → /dev → /validate → /ship → /review → /premerge → /verify`
18
+ 5. No functional/code changes — docs-only PR
19
+
20
+ ## Out of Scope
21
+
22
+ - Documentation link checker (tracked separately — Beads issue to be created)
23
+ - Fixing stale refs in files outside `.claude/commands/` (e.g., package.json, QUICKSTART.md, docs/EXAMPLES.md)
24
+ - Changes to `research.md` (already a proper legacy alias redirect)
25
+
26
+ ## Approach Selected
27
+
28
+ **Approach A: Minimal fix** — Update only the 3 command files to fix stale refs and add CHANGELOG step. Docs-only, no code changes.
29
+
30
+ Rationale: This is a docs cleanup task. Link checker infrastructure is a separate feature with its own Beads issue.
31
+
32
+ ## Constraints
33
+
34
+ - Docs-only — no source code, no tests, no new dependencies
35
+ - Must preserve the existing structure/format of each command file
36
+ - CHANGELOG step in premerge should use Keep a Changelog format (already established in CHANGELOG.md)
37
+
38
+ ## Edge Cases
39
+
40
+ 1. **Stale ref found in a file we're already editing**: Fix inline, document in commit message
41
+ 2. **CHANGELOG.md format**: Follow existing Keep a Changelog format already in the file
42
+ 3. **Beads commands in status.md**: Use real `bd` commands that actually work (`bd list`, `bd stats`)
43
+
44
+ ## Ambiguity Policy
45
+
46
+ Fix inline and document in commit. Low-risk for docs-only changes.
47
+
48
+ ## Technical Research
49
+
50
+ ### Stale Reference Inventory
51
+
52
+ | File | Line | Stale Reference | Replacement |
53
+ |------|------|----------------|-------------|
54
+ | status.md | 21 | `cat docs/planning/PROGRESS.md` | `bd list --status completed --limit 5` |
55
+ | status.md | 33 | `openspec list --active` | Remove (no replacement needed) |
56
+ | status.md | 45 | `openspec list --archived --limit 3` | Remove |
57
+ | status.md | 69 | `Next: /research <feature-name>` | `Next: /plan <feature-name>` |
58
+ | status.md | 74 | `Run /research <feature-name>` | `Run /plan <feature-name>` |
59
+ | rollback.md | 309 | `/status → /research → /plan → ...` | `/status → /plan → /dev → ...` |
60
+ | rollback.md | 334 | `/research payment-integration` | `/plan payment-integration` |
61
+ | premerge.md | 49 | `docs/planning/PROGRESS.md` | Replace with CHANGELOG.md step |
62
+ | premerge.md | 135 | `PROGRESS.md: Feature entry added` | `CHANGELOG.md: Entry added` |
63
+
64
+ ### OWASP Top 10 Analysis
65
+
66
+ Not applicable — docs-only changes with no code, no user input, no authentication, no data storage.
67
+
68
+ ### TDD Test Scenarios
69
+
70
+ 1. **Happy path**: All 3 files updated, grep for stale terms returns 0 matches (excluding research.md legacy alias)
71
+ 2. **Workflow consistency**: All workflow diagrams in touched files show identical 7-stage flow
72
+ 3. **CHANGELOG format**: New premerge step references Keep a Changelog format consistent with existing CHANGELOG.md
73
+
74
+ ### DRY Check
75
+
76
+ No existing "replace stale workflow refs" logic exists. This is a manual docs edit — no abstraction needed.
77
+
78
+ ## Related Work
79
+
80
+ - **Link checker** (new Beads issue): Local Lefthook pre-push hook preferred over GitHub Action, to catch broken internal markdown links before they hit PRs. Reference workflow from user's other repo saved in issue description.
@@ -0,0 +1,90 @@
1
+ # Tasks: Clean up stale workflow refs in agent commands
2
+
3
+ **Beads**: forge-ctc
4
+ **Branch**: feat/stale-workflow-refs
5
+ **Design**: docs/plans/2026-03-10-stale-workflow-refs-design.md
6
+
7
+ ---
8
+
9
+ ## Task 1: Fix status.md — remove openspec, PROGRESS.md, /research
10
+
11
+ **File(s)**: `.claude/commands/status.md`
12
+
13
+ **What to implement**:
14
+ - Line 21: Replace `cat docs/planning/PROGRESS.md` with `bd stats` and `bd list --status completed --limit 5`
15
+ - Lines 33-34: Remove `openspec list --active` block entirely
16
+ - Lines 44-46: Remove `openspec list --archived --limit 3` block entirely
17
+ - Line 69: Change `Next: /research <feature-name>` → `Next: /plan <feature-name>`
18
+ - Line 74: Change `Run /research <feature-name>` → `Run /plan <feature-name>`
19
+ - Update example output to reflect Beads-only tracking (no OpenSpec)
20
+ - Update "Next Steps" section to reference `/plan` not `/research`
21
+
22
+ **TDD steps**:
23
+ 1. Run: `grep -c 'openspec\|PROGRESS\.md\|/research' .claude/commands/status.md` → expect 5+ matches
24
+ 2. Make edits
25
+ 3. Run: `grep -c 'openspec\|PROGRESS\.md\|/research' .claude/commands/status.md` → expect 0 matches
26
+ 4. Verify workflow flow (if present) matches 7-stage
27
+ 5. Commit: `docs: fix stale refs in status.md — remove openspec, PROGRESS.md, /research`
28
+
29
+ **Expected output**: status.md references only Beads (`bd`) for tracking, `/plan` for next steps
30
+
31
+ ---
32
+
33
+ ## Task 2: Fix rollback.md — update workflow flow diagrams
34
+
35
+ **File(s)**: `.claude/commands/rollback.md`
36
+
37
+ **What to implement**:
38
+ - Line 309: Change `/status → /research → /plan → /dev → /validate → /ship → /review → /premerge → /verify` → `/status → /plan → /dev → /validate → /ship → /review → /premerge → /verify`
39
+ - Line 314: Same fix for the recovery workflow line if it has /research
40
+ - Line 334: Change `/research payment-integration` → `/plan payment-integration`
41
+ - Check for any other stale refs in the file
42
+
43
+ **TDD steps**:
44
+ 1. Run: `grep -c '/research' .claude/commands/rollback.md` → expect 2+ matches
45
+ 2. Make edits
46
+ 3. Run: `grep -c '/research' .claude/commands/rollback.md` → expect 0 matches
47
+ 4. Verify all workflow flows show correct 7-stage
48
+ 5. Commit: `docs: fix stale workflow refs in rollback.md — remove /research stage`
49
+
50
+ **Expected output**: All workflow diagrams in rollback.md show 7-stage pipeline
51
+
52
+ ---
53
+
54
+ ## Task 3: Fix premerge.md — replace PROGRESS.md with CHANGELOG.md step
55
+
56
+ **File(s)**: `.claude/commands/premerge.md`
57
+
58
+ **What to implement**:
59
+ - Line 49: Replace `docs/planning/PROGRESS.md` section with CHANGELOG.md update step:
60
+ - Add entry under correct version heading using Keep a Changelog format
61
+ - Categories: Added, Changed, Fixed, Removed (match existing CHANGELOG.md style)
62
+ - Include: feature name, PR number, Beads ID
63
+ - Line 135: Update example output to show CHANGELOG.md instead of PROGRESS.md
64
+ - Keep the note about `docs/planning/` being gitignored if PROGRESS.md section is fully replaced
65
+
66
+ **TDD steps**:
67
+ 1. Run: `grep -c 'PROGRESS\.md' .claude/commands/premerge.md` → expect 2 matches
68
+ 2. Make edits
69
+ 3. Run: `grep -c 'PROGRESS\.md' .claude/commands/premerge.md` → expect 0 matches
70
+ 4. Run: `grep -c 'CHANGELOG' .claude/commands/premerge.md` → expect 1+ matches
71
+ 5. Commit: `docs: replace PROGRESS.md with CHANGELOG.md step in premerge`
72
+
73
+ **Expected output**: premerge.md instructs agents to update CHANGELOG.md before merge handoff
74
+
75
+ ---
76
+
77
+ ## Task 4: Final verification — grep for all stale terms across touched files
78
+
79
+ **File(s)**: All 3 files
80
+
81
+ **What to implement**:
82
+ - Run grep for `openspec`, `PROGRESS.md`, `/research` across all `.claude/commands/` (excluding `research.md` legacy alias)
83
+ - Verify 0 matches
84
+ - Run grep for consistent workflow flow in all touched files
85
+
86
+ **TDD steps**:
87
+ 1. Run: `grep -l 'openspec\|PROGRESS\.md' .claude/commands/status.md .claude/commands/rollback.md .claude/commands/premerge.md` → expect no files listed
88
+ 2. Run: `grep '/research' .claude/commands/status.md .claude/commands/rollback.md .claude/commands/premerge.md` → expect 0 matches
89
+ 3. Verify each file's workflow diagram (if present) matches the canonical 7-stage flow
90
+ 4. No commit needed — verification only
@@ -0,0 +1,9 @@
1
+ # Decisions Log: beads-plan-context
2
+
3
+ **Feature**: beads-plan-context
4
+ **Branch**: feat/beads-plan-context
5
+ **Beads**: forge-bmy
6
+
7
+ ---
8
+
9
+ <!-- Decisions will be logged below as they arise during /dev -->