forge-workflow 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (105) hide show
  1. package/.claude/commands/dev.md +314 -0
  2. package/.claude/commands/plan.md +389 -0
  3. package/.claude/commands/premerge.md +179 -0
  4. package/.claude/commands/research.md +42 -0
  5. package/.claude/commands/review.md +442 -0
  6. package/.claude/commands/rollback.md +721 -0
  7. package/.claude/commands/ship.md +134 -0
  8. package/.claude/commands/sonarcloud.md +152 -0
  9. package/.claude/commands/status.md +77 -0
  10. package/.claude/commands/validate.md +237 -0
  11. package/.claude/commands/verify.md +221 -0
  12. package/.claude/rules/greptile-review-process.md +285 -0
  13. package/.claude/rules/workflow.md +105 -0
  14. package/.claude/scripts/greptile-resolve.sh +526 -0
  15. package/.claude/scripts/load-env.sh +32 -0
  16. package/.forge/hooks/check-tdd.js +240 -0
  17. package/.github/PLUGIN_TEMPLATE.json +32 -0
  18. package/.mcp.json.example +12 -0
  19. package/AGENTS.md +169 -0
  20. package/CLAUDE.md +99 -0
  21. package/LICENSE +21 -0
  22. package/README.md +414 -0
  23. package/bin/forge-cmd.js +313 -0
  24. package/bin/forge-validate.js +303 -0
  25. package/bin/forge.js +4228 -0
  26. package/docs/AGENT_INSTALL_PROMPT.md +342 -0
  27. package/docs/ENHANCED_ONBOARDING.md +602 -0
  28. package/docs/EXAMPLES.md +482 -0
  29. package/docs/GREPTILE_SETUP.md +400 -0
  30. package/docs/MANUAL_REVIEW_GUIDE.md +106 -0
  31. package/docs/ROADMAP.md +359 -0
  32. package/docs/SETUP.md +632 -0
  33. package/docs/TOOLCHAIN.md +849 -0
  34. package/docs/VALIDATION.md +363 -0
  35. package/docs/WORKFLOW.md +400 -0
  36. package/docs/planning/PROGRESS.md +396 -0
  37. package/docs/plans/.gitkeep +0 -0
  38. package/docs/plans/2026-02-27-forge-test-suite-v2-decisions.md +21 -0
  39. package/docs/plans/2026-02-27-forge-test-suite-v2-design.md +362 -0
  40. package/docs/plans/2026-02-27-forge-test-suite-v2-tasks.md +343 -0
  41. package/docs/plans/2026-03-02-superpowers-gaps-decisions.md +26 -0
  42. package/docs/plans/2026-03-02-superpowers-gaps-design.md +239 -0
  43. package/docs/plans/2026-03-02-superpowers-gaps-tasks.md +260 -0
  44. package/docs/plans/2026-03-04-agent-command-parity-design.md +163 -0
  45. package/docs/plans/2026-03-04-verify-worktree-cleanup-decisions.md +7 -0
  46. package/docs/plans/2026-03-04-verify-worktree-cleanup-design.md +165 -0
  47. package/docs/plans/2026-03-05-forge-uto-decisions.md +6 -0
  48. package/docs/plans/2026-03-05-forge-uto-design.md +116 -0
  49. package/docs/plans/2026-03-05-forge-uto-tasks.md +244 -0
  50. package/docs/plans/2026-03-10-command-creator-and-eval-decisions.md +52 -0
  51. package/docs/plans/2026-03-10-command-creator-and-eval-design.md +350 -0
  52. package/docs/plans/2026-03-10-command-creator-and-eval-tasks.md +426 -0
  53. package/docs/plans/2026-03-10-stale-workflow-refs-decisions.md +8 -0
  54. package/docs/plans/2026-03-10-stale-workflow-refs-design.md +80 -0
  55. package/docs/plans/2026-03-10-stale-workflow-refs-tasks.md +90 -0
  56. package/docs/plans/2026-03-14-beads-plan-context-decisions.md +9 -0
  57. package/docs/plans/2026-03-14-beads-plan-context-design.md +171 -0
  58. package/docs/plans/2026-03-14-beads-plan-context-tasks.md +160 -0
  59. package/docs/plans/2026-03-14-skill-eval-loop-decisions.md +33 -0
  60. package/docs/plans/2026-03-14-skill-eval-loop-design.md +118 -0
  61. package/docs/plans/2026-03-14-skill-eval-loop-results.md +78 -0
  62. package/docs/plans/2026-03-14-skill-eval-loop-tasks.md +160 -0
  63. package/docs/plans/2026-03-15-agent-command-parity-v2-decisions.md +11 -0
  64. package/docs/plans/2026-03-15-agent-command-parity-v2-design.md +145 -0
  65. package/docs/plans/2026-03-15-agent-command-parity-v2-tasks.md +211 -0
  66. package/docs/research/TEMPLATE.md +292 -0
  67. package/docs/research/advanced-testing.md +297 -0
  68. package/docs/research/agent-permissions.md +167 -0
  69. package/docs/research/dependency-chain.md +328 -0
  70. package/docs/research/forge-workflow-v2.md +550 -0
  71. package/docs/research/plugin-architecture.md +772 -0
  72. package/docs/research/pr4-cli-automation.md +326 -0
  73. package/docs/research/premerge-verify-restructure.md +205 -0
  74. package/docs/research/skills-restructure.md +508 -0
  75. package/docs/research/sonarcloud-perfection-plan.md +166 -0
  76. package/docs/research/sonarcloud-quality-gate.md +184 -0
  77. package/docs/research/superpowers-integration.md +403 -0
  78. package/docs/research/superpowers.md +319 -0
  79. package/docs/research/test-environment.md +519 -0
  80. package/install.sh +1062 -0
  81. package/lefthook.yml +39 -0
  82. package/lib/agents/README.md +198 -0
  83. package/lib/agents/claude.plugin.json +28 -0
  84. package/lib/agents/cline.plugin.json +22 -0
  85. package/lib/agents/codex.plugin.json +19 -0
  86. package/lib/agents/copilot.plugin.json +24 -0
  87. package/lib/agents/cursor.plugin.json +25 -0
  88. package/lib/agents/kilocode.plugin.json +22 -0
  89. package/lib/agents/opencode.plugin.json +20 -0
  90. package/lib/agents/roo.plugin.json +23 -0
  91. package/lib/agents-config.js +2112 -0
  92. package/lib/commands/dev.js +513 -0
  93. package/lib/commands/plan.js +696 -0
  94. package/lib/commands/recommend.js +119 -0
  95. package/lib/commands/ship.js +377 -0
  96. package/lib/commands/status.js +378 -0
  97. package/lib/commands/validate.js +602 -0
  98. package/lib/context-merge.js +359 -0
  99. package/lib/plugin-catalog.js +360 -0
  100. package/lib/plugin-manager.js +166 -0
  101. package/lib/plugin-recommender.js +141 -0
  102. package/lib/project-discovery.js +491 -0
  103. package/lib/setup.js +118 -0
  104. package/lib/workflow-profiles.js +203 -0
  105. package/package.json +115 -0
@@ -0,0 +1,343 @@
1
+ # Task List: Forge Test Suite v2
2
+
3
+ **Feature**: forge-test-suite-v2
4
+ **Beads**: forge-5vf
5
+ **Branch**: feat/forge-test-suite-v2
6
+ **Design doc**: docs/plans/2026-02-27-forge-test-suite-v2-design.md
7
+ **Baseline**: 107/107 tests passing
8
+
9
+ ---
10
+
11
+ ## Ordering Rationale
12
+
13
+ 1. **Delete stale code first** (Tasks 1-2) — removes dead exports so new tests can't accidentally pass by testing removed code
14
+ 2. **Unit tests for lib functions** (Tasks 3-4) — deterministic, fast, bun:test migration
15
+ 3. **Structural command-file tests** (Task 6) — no lib dependency, can run independently
16
+ 4. **commitlint script tests** (Task 5) — isolated, no lib dependency
17
+ 5. **gh-aw behavioral workflow** (Tasks 8-10) — CI-only, built last after unit tests confirm baseline
18
+
19
+ ---
20
+
21
+ ## Task 1: Audit and delete stale lib exports
22
+
23
+ **File(s)**: `lib/commands/research.js`, `lib/commands/plan.js`
24
+
25
+ **What to implement**:
26
+ Grep entire codebase for imports/requires of `lib/commands/research.js` and the OpenSpec functions in `lib/commands/plan.js` (`createOpenSpecProposal`, `formatProposalPRBody`, `createProposalPR`). If zero usages found outside test files, delete `lib/commands/research.js` entirely and remove the three OpenSpec functions from `lib/commands/plan.js`. Update `package.json` exports if needed.
27
+
28
+ **TDD steps**:
29
+ 1. Write test: none — this is a deletion task. Run `grep -r "require.*commands/research" . --include="*.js" --exclude-dir=node_modules --exclude-dir=test` and `grep -r "createOpenSpecProposal\|formatProposalPRBody\|createProposalPR" . --include="*.js" --exclude-dir=node_modules --exclude-dir=test` first.
30
+ 2. Confirm zero usages outside test files
31
+ 3. Delete `lib/commands/research.js`
32
+ 4. Remove OpenSpec functions from `lib/commands/plan.js`
33
+ 5. Run `bun test` — if any test now fails with "Cannot find module", that test was using the deleted code and must also be deleted in Task 2
34
+ 6. Commit: `refactor: delete stale research lib and OpenSpec functions`
35
+
36
+ **Expected output**: `bun test` still passes all non-stale tests. Zero references to deleted exports in non-test files.
37
+
38
+ ---
39
+
40
+ ## Task 2: Delete stale test files
41
+
42
+ **File(s)**: `test/commands/research.test.js`, `test/commands/plan.test.js` (OpenSpec tests only)
43
+
44
+ **What to implement**:
45
+ Delete `test/commands/research.test.js` entirely — it tests `lib/commands/research.js` which no longer exists after Task 1. In `test/commands/plan.test.js`, remove the test blocks for `createOpenSpecProposal`, `formatProposalPRBody`, and `createProposalPR`. Keep `detectScope`, `createBeadsIssue`, `createFeatureBranch`, `extractDesignDecisions` — these are still valid. Migrate kept tests from `node:test` to `bun:test` import syntax in the same step.
46
+
47
+ **TDD steps**:
48
+ 1. Write test: none — deletion task
49
+ 2. Delete `test/commands/research.test.js`
50
+ 3. Remove OpenSpec test blocks from `test/commands/plan.test.js`
51
+ 4. Migrate remaining `plan.test.js` imports: `require('node:test')` → `import { describe, test } from "bun:test"`, `require('node:assert/strict')` → `import { expect } from "bun:test"`
52
+ 5. Run `bun test test/commands/plan.test.js` — must pass
53
+ 6. Commit: `refactor: delete stale research tests and OpenSpec test blocks`
54
+
55
+ **Expected output**: `test/commands/research.test.js` does not exist. `test/commands/plan.test.js` has no OpenSpec references.
56
+
57
+ ---
58
+
59
+ ## Task 3: Add Phase 1/2/3 coverage to plan.test.js
60
+
61
+ **File(s)**: `test/commands/plan.test.js`, `lib/commands/plan.js`
62
+
63
+ **What to implement**:
64
+ Add test coverage for the new `/plan` workflow mechanics. Tests use `bun:test` with `mock.module` for `node:child_process` and `node:fs`. Add mock.module declarations at the top of the file BEFORE any lib import. Cover:
65
+
66
+ - `validateDesignDoc(content)` — returns `{ valid: true, sections: [...] }` for complete doc; `{ valid: false, missing: ['OWASP'] }` for missing OWASP section
67
+ - `validateDesignDoc` minimum content length check — OWASP section < 200 chars → invalid
68
+ - `validateDesignDoc` placeholder detection — doc containing "[describe" → invalid
69
+ - `validateTaskList(content)` — returns `{ valid: true, taskCount: N }` when ≥3 tasks with TDD steps; `{ valid: false, reason: '...' }` when < 50% of tasks have RED/GREEN/REFACTOR
70
+ - `readResearchDoc` now reads from `docs/plans/` (not `docs/research/`) — assert correct path
71
+ - `createFeatureBranch` with `--strategic` flag — assert proposal branch naming `feat/<slug>-proposal`
72
+
73
+ **TDD steps**:
74
+ 1. Write test: `describe("validateDesignDoc")` block with 5 cases (happy path, missing OWASP, short OWASP, placeholder, missing HARD-GATE)
75
+ 2. Run: confirm RED — `validateDesignDoc is not a function`
76
+ 3. Implement `validateDesignDoc(content)` in `lib/commands/plan.js`
77
+ 4. Run: confirm GREEN
78
+ 5. Write test: `describe("validateTaskList")` block with 3 cases (≥3 tasks all with TDD, ≥3 tasks only 30% with TDD → invalid, < 3 tasks → invalid)
79
+ 6. Run: confirm RED
80
+ 7. Implement `validateTaskList(content)` in `lib/commands/plan.js`
81
+ 8. Run: confirm GREEN
82
+ 9. Write test: `readResearchDoc` path assertion — mock `fs.existsSync` to capture the path argument, assert it includes `docs/plans/`
83
+ 10. Run: confirm GREEN or RED depending on current path in lib
84
+ 11. Fix path in lib if needed
85
+ 12. Commit: `test: add Phase 1/2/3 coverage to plan.test.js` then `feat: add validateDesignDoc and validateTaskList`
86
+
87
+ **Expected output**: All new tests pass. `validateDesignDoc` and `validateTaskList` exported from `lib/commands/plan.js`.
88
+
89
+ ---
90
+
91
+ ## Task 4: Add decision gate + subagent tests to dev.test.js
92
+
93
+ **File(s)**: `test/commands/dev.test.js`, `lib/commands/dev.js`
94
+
95
+ **What to implement**:
96
+ Migrate `test/commands/dev.test.js` from `node:test` to `bun:test`. Add `mock.module` for `node:child_process` at top before lib import. Add coverage for:
97
+
98
+ - `evaluateDecisionGate(score)` — score 0-3 → `{ route: 'PROCEED' }`, score 4-7 → `{ route: 'SPEC-REVIEWER' }`, score 8+ → `{ route: 'BLOCKED' }`
99
+ - `orderReviewers(task)` — always returns spec compliance reviewer BEFORE code quality reviewer (spec-before-quality HARD-GATE)
100
+ - `dispatchImplementer(task, designDoc)` — mock the subprocess call, assert it receives full task text (not just task number), assert it receives relevant design doc sections
101
+ - `dispatchImplementer` with missing task list → returns `{ success: false, error: 'task-list-not-found' }`
102
+
103
+ **TDD steps**:
104
+ 1. Migrate existing `dev.test.js` imports to `bun:test` (same pattern as Task 2)
105
+ 2. Add `mock.module("node:child_process", ...)` at top
106
+ 3. Write test: `describe("evaluateDecisionGate")` — 3 score ranges, 3 boundary cases (0, 3, 4, 7, 8, 15)
107
+ 4. Run: confirm RED — `evaluateDecisionGate is not a function`
108
+ 5. Implement `evaluateDecisionGate(score)` in `lib/commands/dev.js`
109
+ 6. Run: confirm GREEN
110
+ 7. Write test: `describe("orderReviewers")` — assert spec reviewer index < quality reviewer index in returned array
111
+ 8. Run: confirm RED or GREEN (may already exist)
112
+ 9. Implement or fix `orderReviewers` if needed
113
+ 10. Write test: `describe("dispatchImplementer")` — mock `execFileSync`, assert call args contain full task text
114
+ 11. Run: confirm RED → implement → GREEN
115
+ 12. Commit: `test: add decision gate and subagent dispatch tests` then `feat: implement evaluateDecisionGate and orderReviewers`
116
+
117
+ **Expected output**: Decision gate routing tested at all 6 boundary values. Spec-before-quality ordering verified. Subagent dispatch mock asserts correct arguments.
118
+
119
+ ---
120
+
121
+ ## Task 5: Add commitlint script tests
122
+
123
+ **File(s)**: `test/scripts/commitlint.test.js` (new file), `scripts/commitlint.js`
124
+
125
+ **What to implement**:
126
+ Create `test/scripts/commitlint.test.js`. Test the cross-platform commitlint runner. Use `bun:test` with `mock.module` for `node:child_process` and `node:fs`.
127
+
128
+ Cover:
129
+ - `getCommitlintRunner()` — `bun.lock` exists → returns `'bunx'`; no `bun.lock` → returns `'npx'`
130
+ - `getCommitlintRunner()` on Windows (`process.platform === 'win32'`) → shell option is `true`
131
+ - Missing commit message file argument → process exits with code 1 + error message
132
+ - Exit code propagation — if spawnSync returns `{ status: 1 }`, script exits with 1
133
+ - Exit code propagation — if spawnSync returns `{ status: 0 }`, script exits with 0
134
+
135
+ **TDD steps**:
136
+ 1. Write test file with 5 test cases listed above
137
+ 2. Run: confirm RED — `test/scripts/commitlint.test.js` doesn't exist yet, or functions not exported
138
+ 3. Refactor `scripts/commitlint.js` to export `getCommitlintRunner()` for testability (extract from inline logic), keep `if (require.main === module)` guard for CLI usage
139
+ 4. Run: confirm GREEN
140
+ 5. Commit: `test: add commitlint script tests` then `refactor: extract getCommitlintRunner for testability`
141
+
142
+ **Expected output**: 5 tests passing for `scripts/commitlint.js`. Function exported without breaking lefthook hook behavior.
143
+
144
+ ---
145
+
146
+ ## Task 6: Add structural command-file tests
147
+
148
+ **File(s)**: `test/commands/plan-structure.test.js` (new file), `test/commands/dev-structure.test.js` (new file)
149
+
150
+ **What to implement**:
151
+ Using the same pattern as `test/ci-workflow.test.js` (reads a file and asserts content structure), create two test files that read `.claude/commands/plan.md` and `.claude/commands/dev.md` and assert required structural elements exist:
152
+
153
+ **plan-structure.test.js asserts:**
154
+ - `<!-- WORKFLOW-SYNC:START -->` and `<!-- WORKFLOW-SYNC:END -->` markers present
155
+ - Phase 1 header `## Phase 1` exists
156
+ - Phase 2 header `## Phase 2` exists
157
+ - Phase 3 header `## Phase 3` exists
158
+ - `<HARD-GATE: Phase 1 exit>` block exists
159
+ - `<HARD-GATE: Phase 2 exit>` block exists
160
+ - `<HARD-GATE: /plan exit>` block exists
161
+ - `Skill("parallel-web-search")` call present in Phase 2
162
+ - `git worktree add` command present in Phase 3
163
+ - `docs/plans/` path present (not `docs/research/`)
164
+ - `docs/plans/YYYY-MM-DD-<slug>-tasks.md` task list path format present
165
+
166
+ **dev-structure.test.js asserts:**
167
+ - `<HARD-GATE: /dev entry>` block exists
168
+ - Spec compliance reviewer step present (text: "Spec compliance reviewer" or "spec-before-quality")
169
+ - Code quality reviewer step present AFTER spec reviewer
170
+ - Decision gate scoring documented (text: "PROCEED", "SPEC-REVIEWER", "BLOCKED" all present)
171
+ - `docs/plans/` path present for task list reading
172
+ - `decisions.md` path present
173
+
174
+ **TDD steps**:
175
+ 1. Write `test/commands/plan-structure.test.js` with all assertions (using `fs.readFileSync` + `includes()`, same pattern as ci-workflow.test.js)
176
+ 2. Run: confirm which assertions pass/fail against current `.claude/commands/plan.md`
177
+ 3. Fix any missing markers in `.claude/commands/plan.md` (add WORKFLOW-SYNC markers if missing)
178
+ 4. Run: GREEN for plan-structure
179
+ 5. Write `test/commands/dev-structure.test.js`
180
+ 6. Run: confirm which pass/fail
181
+ 7. Fix any missing markers in `.claude/commands/dev.md`
182
+ 8. Run: GREEN for dev-structure
183
+ 9. Commit: `test: add structural command-file tests for plan and dev`
184
+
185
+ **Expected output**: Both structural test files pass. `.claude/commands/plan.md` and `dev.md` contain all required structural markers.
186
+
187
+ ---
188
+
189
+ ## Task 7: Update test script in package.json
190
+
191
+ **File(s)**: `package.json`
192
+
193
+ **What to implement**:
194
+ Add the new test files to the `bun test` script in `package.json` so they run in CI:
195
+ - `test/commands/plan.test.js`
196
+ - `test/commands/dev.test.js`
197
+ - `test/commands/plan-structure.test.js`
198
+ - `test/commands/dev-structure.test.js`
199
+ - `test/scripts/commitlint.test.js`
200
+
201
+ Also verify `test/commands/check.test.js`, `test/commands/ship.test.js`, `test/commands/status.test.js` are already included or add them.
202
+
203
+ Run full `bun test` with the updated script to confirm all tests pass.
204
+
205
+ **TDD steps**:
206
+ 1. Read current `package.json` test script
207
+ 2. Add new test file paths
208
+ 3. Run `bun test <all files>` — confirm all pass
209
+ 4. Commit: `chore: add new test files to bun test script`
210
+
211
+ **Expected output**: `bun test` (using package.json script) runs all test files and all pass.
212
+
213
+ ---
214
+
215
+ ## Task 8: Create gh-aw behavioral workflow markdown
216
+
217
+ **File(s)**: `.github/workflows/behavioral-test.md` (new file), `.github/workflows/detect-command-file-changes.yml` (new file)
218
+
219
+ **What to implement**:
220
+ Create the gh-aw behavioral test workflow in markdown format. Two files:
221
+
222
+ **`detect-command-file-changes.yml`** — lightweight standard GitHub Actions YAML that triggers when `.claude/commands/plan.md`, `.claude/commands/dev.md`, or `AGENTS.md` changes on push to master. Its completion triggers the behavioral test via that workflow's `workflow_run` event.
223
+
224
+ **`.github/workflows/behavioral-test.md`** — gh-aw markdown workflow:
225
+
226
+ ```yaml
227
+ ---
228
+ name: forge-workflow-behavioral-test
229
+ description: "Tests that a real AI agent correctly follows the Forge /plan workflow"
230
+ on:
231
+ - schedule: "0 3 * * SUN"
232
+ - workflow_dispatch
233
+ - workflow_run:
234
+ workflows: ["detect-command-file-changes.yml"]
235
+ types: [completed]
236
+ permissions:
237
+ contents: read
238
+ actions: read
239
+ secrets:
240
+ ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
241
+ OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
242
+ engine:
243
+ type: claude
244
+ model: claude-sonnet-4-6
245
+ max-turns: 20
246
+ tools:
247
+ - bash
248
+ - edit
249
+ - github:
250
+ toolsets: [repos, actions]
251
+ ---
252
+ ```
253
+
254
+ Markdown body instructs the agent to:
255
+ 1. Create a temp directory as synthetic test repo
256
+ 2. Run `/plan` on 3-4 rotating test prompts
257
+ 3. Assert artifacts exist (design doc, task list)
258
+ 4. Save Q&A transcript to temp file
259
+ 5. Run judge evaluation (curl to OpenRouter with MiniMax M2.5)
260
+ 6. Parse judge JSON output
261
+ 7. Apply 3-layer scoring (blockers → dimensions → band)
262
+ 8. Append score to `.github/behavioral-test-scores.json`
263
+ 9. If FAIL → exit non-zero (fails the workflow run)
264
+ 10. If INCONCLUSIVE (API error) → exit 0 with warning comment
265
+ 11. Cleanup temp directory
266
+
267
+ **TDD steps**:
268
+ 1. Write `detect-command-file-changes.yml` (standard YAML, no gh-aw)
269
+ 2. Write `.github/workflows/behavioral-test.md` with frontmatter + markdown body
270
+ 3. Run `gh aw compile .github/workflows/behavioral-test.md` to generate `.lock.yml`
271
+ 4. Verify `.lock.yml` was created and is valid YAML
272
+ 5. Commit: `feat: add gh-aw behavioral test workflow`
273
+
274
+ **Expected output**: Both files exist. `.github/workflows/behavioral-test.lock.yml` generated and committed.
275
+
276
+ ---
277
+
278
+ ## Task 9: Create judge scoring script
279
+
280
+ **File(s)**: `scripts/behavioral-judge.sh` (new file)
281
+
282
+ **What to implement**:
283
+ Bash script called by the behavioral test workflow to run the judge evaluation. Takes design doc path + task list path + Q&A transcript path as args. Calls OpenRouter MiniMax M2.5, parses response, applies 3-layer scoring, returns JSON result.
284
+
285
+ Covers all 16 loophole fixes:
286
+ - Layer 1 blocker checks (existence, content length, placeholder detection, timestamp recency, majority TDD threshold)
287
+ - Layer 2 weighted scoring (security ×3, TDD ×3, design ×2, structural ×1, max 45)
288
+ - Layer 3 trend comparison (read previous score from `.github/behavioral-test-scores.json`, compare per-dimension)
289
+ - INCONCLUSIVE on API errors (429/5xx)
290
+ - Calibration mode flag (first 4 runs don't enforce FAIL gate)
291
+ - MiniMax M2.5 with MiniMax K2.5 fallback
292
+
293
+ **TDD steps**:
294
+ 1. Write `test/scripts/behavioral-judge.test.js` with mocked OpenRouter responses covering:
295
+ - All Layer 1 blockers fire correctly
296
+ - Weighted scoring math (security 4/5 × 3 = 12, etc.)
297
+ - INCONCLUSIVE on 429
298
+ - Calibration mode: score below threshold but result is still PASS (with warning)
299
+ - Trend alert: current score 20, previous was 29 → ≥8 point drop → alert
300
+ 2. Run: confirm RED
301
+ 3. Implement `scripts/behavioral-judge.sh` (bash) + export testable functions to a `lib/behavioral-judge.js` wrapper for unit testing
302
+ 4. Run: confirm GREEN
303
+ 5. Commit: `test: add behavioral judge scoring tests` then `feat: implement behavioral judge scoring script`
304
+
305
+ **Expected output**: Judge script handles all 16 loophole scenarios correctly. INCONCLUSIVE does not cause FAIL.
306
+
307
+ ---
308
+
309
+ ## Task 10: Add CI sync check for .lock.yml
310
+
311
+ **File(s)**: `.github/workflows/test.yml`
312
+
313
+ **What to implement**:
314
+ Add a job to `test.yml` that verifies `.github/workflows/behavioral-test.lock.yml` is in sync with `.github/workflows/behavioral-test.md`. On every PR and push, runs `gh aw compile --dry-run` and diffs output against committed `.lock.yml`. Fails if they diverge.
315
+
316
+ Also add: `test/workflows/behavioral-test-sync.test.js` that asserts the `.lock.yml` exists and is non-empty (structural sanity check without requiring gh-aw CLI in unit test environment).
317
+
318
+ **TDD steps**:
319
+ 1. Write `test/workflows/behavioral-test-sync.test.js` — assert `.github/workflows/behavioral-test.lock.yml` exists and `behavioral-test.md` exists
320
+ 2. Run: RED (files don't exist yet from Task 8)
321
+ 3. After Task 8 creates the files, rerun: GREEN
322
+ 4. Add sync check job to `.github/workflows/test.yml`
323
+ 5. Run `gh pr checks` to verify new job appears
324
+ 6. Commit: `test: add behavioral test lock file sync check`
325
+
326
+ **Expected output**: CI fails if `.lock.yml` is out of sync with `.md`. Structural test confirms both files exist.
327
+
328
+ ---
329
+
330
+ ## Parallelization Map
331
+
332
+ ```
333
+ Sequential (must run in order):
334
+ Task 1 (delete stale lib) → Task 2 (delete stale tests) → Task 3-5 (unit tests)
335
+
336
+ Parallel after Task 2:
337
+ Track A: Tasks 3, 4, 5 (unit tests — independent of each other)
338
+ Track B: Task 6 (structural tests — no lib dependency)
339
+ Track C: Task 7 (package.json — wait for Tasks 3-6 to know file names)
340
+
341
+ Sequential after all unit/structural tests:
342
+ Task 8 → Task 9 → Task 10 (behavioral workflow — builds on stable unit test foundation)
343
+ ```
@@ -0,0 +1,26 @@
1
+ # Decisions Log: superpowers-gaps
2
+
3
+ **Feature**: superpowers-gaps
4
+ **Branch**: feat/superpowers-gaps
5
+ **Dev session started**: 2026-03-02
6
+ **Design doc**: `docs/plans/2026-03-02-superpowers-gaps-design.md`
7
+ **Ambiguity policy**: Follow /dev decision gate (7-dimension scoring). Low-impact → proceed + document. High-impact → pause and ask.
8
+
9
+ ---
10
+
11
+ ## /dev Summary
12
+
13
+ **Completed**: 2026-03-02
14
+ **Tasks**: 6 (0a, 0b, 1, 2, 3, 4)
15
+ **Decision gates fired**: 0 (plan quality: Excellent — all ambiguity resolved in Phase 1 Q&A)
16
+ **Final test result**: 1227 pass, 31 skip, 0 fail (1258 total across 72 files)
17
+
18
+ ### Post-implementation fix (final code review finding)
19
+
20
+ **Issue 1**: Duplicate 4-phase debug section in `validate.md` (copy-paste artifact from Task 4 implementation). Removed in commit `5baddcc`.
21
+
22
+ **Issue 2**: Incomplete `/check` → `/validate` rename — `bin/forge.js`, `lib/workflow-profiles.js`, `lib/agents-config.js`, `lib/commands/status.js`, `README.md`, `QUICKSTART.md`, `GEMINI.md`, and 6 test files still referenced `/check`. All updated in commit `5baddcc`.
23
+
24
+ No decision gates were fired during implementation. All ambiguity was resolved upfront in Phase 1 Q&A.
25
+
26
+ **Status**: All decisions RESOLVED. Ready for /validate.
@@ -0,0 +1,239 @@
1
+ # Design Doc: superpowers-gaps
2
+
3
+ **Feature**: superpowers-gaps
4
+ **Date**: 2026-03-02
5
+ **Status**: Phase 3 complete — ready for /dev
6
+ **Branch**: feat/superpowers-gaps
7
+ **Beads**: forge-6od (in_progress)
8
+
9
+ ---
10
+
11
+ ## Purpose
12
+
13
+ Fill 5 workflow gaps identified in the OBRA/Superpowers integration research (`docs/research/superpowers.md`, `docs/research/superpowers-integration.md`, beads `forge-6od`):
14
+
15
+ 1. **Worktree isolation** — `/plan` had no entry gate; planning could run on any branch, contaminating unrelated feature branches (discovered when superpowers-gaps commits leaked into forge-test-suite-v2 history)
16
+ 2. **YAGNI enforcement** — No gate in `/plan` Phase 3 prevents over-scoped tasks
17
+ 3. **DRY enforcement** — No gate in `/plan` Phase 2 checks for existing implementations before planning new ones
18
+ 4. **Verification-before-completion** — `/dev` task completion and `/check` don't require end-to-end verification, only unit test passage
19
+ 5. **Systematic debugging** — No structured investigation workflow when validation fails
20
+
21
+ These gaps mean: planning commits bleed into wrong branches (isolation), code gets planned that already exists (DRY), tasks get created that aren't in the design (YAGNI), and validation failures get "fixed" without root-cause investigation (debug).
22
+
23
+ ---
24
+
25
+ ## Success Criteria
26
+
27
+ 1. ✅ `/plan` has a HARD-GATE at entry that checks the current branch, stops if not on master, then creates `feat/<slug>` + `.worktrees/<slug>` before any Phase 1 work begins (commit `86eaec8`)
28
+ 2. ✅ `/plan` Phase 3 branch creation explicitly uses `git checkout master` as base, not the current branch (commit `9b31bd9`)
29
+ 3. `/plan` Phase 2 includes an explicit DRY check step that searches for existing implementations before finalizing approach
30
+ 4. `/plan` Phase 3 task-writing includes a YAGNI filter: each task must map to a requirement in the design doc; tasks without a design doc anchor are flagged
31
+ 5. `/dev` task completion HARD-GATE requires actual behavior verification (run the feature/function, not just unit tests) before marking a task done
32
+ 6. `/check` is renamed to `/validate` and upgraded: failure path triggers automatic 4-phase systematic debug mode (Reproduce → Root-cause → Fix → Verify) with HARD-GATE: no fix without completed root-cause phase
33
+ 7. AGENTS.md, `docs/WORKFLOW.md`, and workflow table updated to reflect `/validate` naming and new capabilities
34
+ 8. All existing tests pass after changes
35
+
36
+ ---
37
+
38
+ ## Out of Scope
39
+
40
+ - Separate `/debug` command — debug mode is embedded inside `/validate`
41
+ - New review subagent for YAGNI/DRY at code-writing time — enforcement is at planning stage
42
+ - Changes to `/dev` subagent architecture (spec → quality review stays 2-stage; scope compliance is handled in `/plan` pre-work)
43
+ - Changing how Beads integrates with existing commands
44
+ - Any changes to `/ship`, `/review`, `/premerge`, `/verify`
45
+
46
+ ---
47
+
48
+ ## Approach Selected: A+B Hybrid (inline gates + automatic review)
49
+
50
+ **Why not A alone (inline gates only)**: User explicitly wants best quality and automatic process evaluation. Inline planning gates catch scope creep at planning time, but they provide no enforcement at validation time — the automatic 4-phase debug mode in `/validate` covers that gap.
51
+
52
+ **Why not B alone (new subagent per task)**: Adding a scope compliance reviewer as a 3rd subagent per task in `/dev` would slow every task. YAGNI/DRY enforcement is better done at planning time (before any code is written) rather than per-task.
53
+
54
+ **The hybrid**:
55
+ - **Pre-code enforcement** (planning): DRY check in Phase 2, YAGNI filter in Phase 3 — catch problems before code is written
56
+ - **Post-code enforcement** (validation): Verification HARD-GATE in `/dev` task completion, automatic debug mode in `/validate` — catch problems before shipping
57
+
58
+ ---
59
+
60
+ ## Implementation Plan (High-Level)
61
+
62
+ ### Change 1: DRY gate in `/plan` Phase 2
63
+ **File**: `.claude/commands/plan.md`
64
+ **Where**: Phase 2, codebase exploration section
65
+ **What**: Add an explicit step: before finalizing the approach, search the codebase for existing implementations that could be reused or extended. Document what was found and whether the new work extends existing code or starts fresh.
66
+
67
+ ### Change 2: YAGNI filter in `/plan` Phase 3
68
+ **File**: `.claude/commands/plan.md`
69
+ **Where**: Phase 3, Step 5 (task list creation)
70
+ **What**: Add a YAGNI filter step after initial task drafting: for each task, confirm it maps to a specific requirement in the design doc. Tasks without a clear design doc anchor must be either (a) traced back to a requirement or (b) removed. Present any removed tasks to the user as "out of scope" before finalizing.
71
+
72
+ ### Change 3: Verification HARD-GATE in `/dev` task completion
73
+ **File**: `.claude/commands/dev.md`
74
+ **Where**: Task completion HARD-GATE (currently at line ~178)
75
+ **What**: Upgrade the completion gate to require: in addition to tests passing, run the actual implemented function/feature and observe real output. This is the "verification-before-completion" pattern from Superpowers — tests can pass but behavior can still be wrong.
76
+
77
+ ### Change 4: Rename `/check` to `/validate` + add debug mode
78
+ **Files**:
79
+ - `.claude/commands/check.md` → rename to `.claude/commands/validate.md`
80
+ - All references to `/check` in AGENTS.md, `docs/WORKFLOW.md`, `docs/plans/`, `.claude/rules/workflow.md`
81
+ **What**:
82
+ - Rename the command file
83
+ - Add failure path: when any validation step fails, automatically enter 4-phase debug mode:
84
+ - **Phase D1: Reproduce** — confirm failure is deterministic, get exact error output
85
+ - **Phase D2: Root-cause trace** — trace the failure to its actual source (not symptoms)
86
+ - **Phase D3: Fix** — minimal targeted fix for the root cause
87
+ - **Phase D4: Verify** — re-run full validation, confirm fix works end-to-end
88
+ - HARD-GATE: No fix commit without completing Phase D2 (root-cause confirmed in writing)
89
+ - After fix, automatically re-run validation from the beginning
90
+
91
+ ---
92
+
93
+ ## Constraints
94
+
95
+ - **Additive only**: No restructuring of existing phases, no removing steps
96
+ - **Lean gates**: YAGNI filter = checklist, not a new phase. DRY check = one search step, not a research loop. Gates should add ~2-3 lines of instruction, not new procedures.
97
+ - **No new ceremony**: Debug mode in `/validate` activates only on failure. Passing runs are unchanged.
98
+ - **Ambiguity policy**: Follow existing `/dev` decision gate (7-dimension scoring). Low-impact spec gaps → agent makes reasonable choice, documents in decisions file. High-impact gaps → pause and ask.
99
+
100
+ ---
101
+
102
+ ## Edge Cases (from Q&A)
103
+
104
+ 1. **YAGNI filter removes all tasks**: If every task is flagged as out-of-scope, the design doc needs more requirements. Present this as "design doc doesn't cover all tasks — needs amendment" rather than error.
105
+ 2. **DRY check finds partial match**: If codebase has something 80% similar, document it as "extend existing" in the approach — don't create a net-new implementation.
106
+ 3. **Debug mode loops**: If the Phase D3 fix doesn't pass Phase D4 verification, re-enter Phase D1 with more specific reproduction steps. Max 3 debug cycles before surfacing to user with full context.
107
+ 4. **Validation passes on re-run after fix, but fix is wrong**: Phase D4 requires not just "tests pass" but "behavior is correct" — run actual feature, not just tests.
108
+
109
+ ---
110
+
111
+ ## Ambiguity Policy
112
+
113
+ Follow existing `/dev` decision gate (7-dimension scoring system):
114
+ - Score ≤ threshold: Agent makes reasonable choice, documents in `docs/plans/YYYY-MM-DD-superpowers-gaps-decisions.md`
115
+ - Score > threshold: Pause and ask user
116
+
117
+ Phase 1 Q&A pre-resolved all major design questions. Remaining ambiguity should be rare.
118
+
119
+ ---
120
+
121
+ ## Technical Research
122
+
123
+ ### YAGNI/DRY Enforcement — Key Findings
124
+
125
+ **Critical discovery**: Superpowers `writing-plans/SKILL.md` only contains "DRY. YAGNI. TDD." as aspirational bullet points — no actual gates or enforcement mechanisms. Our approach (proper HARD-GATE wording) is stronger than Superpowers' implementation.
126
+
127
+ **Effective YAGNI gate wording** (from Claude Code system prompts, Cursor rules, community research):
128
+ - "Do not add features, refactor, or improve beyond what was asked."
129
+ - "Only make changes that are directly requested or clearly necessary."
130
+ - "YAGNI: No speculative implementation." (applied during GREEN phase)
131
+
132
+ **Effective DRY gate wording**:
133
+ - "Check if logic already exists before writing new code." (Cursor rules)
134
+ - "Before creating new code, search the codebase for existing implementations" + explicit grep/glob tool calls
135
+
136
+ **Critical gotcha**: Aspirational lists ("DRY. YAGNI.") are ignored under pressure. Effective enforcement requires imperative gate language AND explicit search commands (not just "check"). Agents hallucinate that nothing equivalent exists if not forced to search with tools.
137
+
138
+ ### Verification-Before-Completion — Key Findings
139
+
140
+ From `superpowers:verification-before-completion/SKILL.md`:
141
+
142
+ **Iron Law**: `NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE`
143
+
144
+ **5-step gate**:
145
+ 1. IDENTIFY: What command proves this claim?
146
+ 2. RUN: Execute the FULL command (fresh, complete)
147
+ 3. READ: Full output, check exit code, count failures
148
+ 4. VERIFY: Does output confirm the claim?
149
+ 5. ONLY THEN: Make the claim
150
+
151
+ **Enforcement**: "Skip any step = lying, not verifying"
152
+
153
+ **Common failures table** (forbidden substitutes):
154
+ - "Tests pass" ← `"Previous run", "should pass"` is not evidence
155
+ - "Bug fixed" ← `"Code changed, assumed fixed"` is not evidence
156
+ - "Requirements met" ← `"Tests passing"` alone is not sufficient
157
+
158
+ **Red Flags — STOP**: Using "should", "probably", "seems to"; expressing satisfaction ("Great!", "Done!") before verification; trusting agent success reports.
159
+
160
+ ### Systematic Debugging — Key Findings
161
+
162
+ From `superpowers:systematic-debugging/SKILL.md`:
163
+
164
+ **Iron Law**: `NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST`
165
+
166
+ **4-phase structure** (MUST complete each before proceeding):
167
+ - Phase 1: Root Cause Investigation (reproduce, trace data flow)
168
+ - Phase 2: Pattern Analysis (find working examples, compare references)
169
+ - Phase 3: Hypothesis and Testing (form SINGLE hypothesis, test MINIMALLY)
170
+ - Phase 4: Implementation (failing test FIRST, ONE change at a time)
171
+
172
+ **3-fix architectural HARD-GATE**: If >= 3 fix attempts fail → STOP, question architecture. Do NOT attempt Fix #4.
173
+
174
+ **Red Flags — STOP** (return to Phase 1):
175
+ - "Quick fix for now, investigate later"
176
+ - "It's probably X, let me fix that"
177
+ - "I don't fully understand but this might work"
178
+
179
+ **Key principle**: "Fix at source, not at symptom." Seeing symptoms ≠ understanding root cause.
180
+
181
+ ### Rename Scope (/check → /validate)
182
+
183
+ **Files affected**: 25 files, ~70+ instances
184
+ - Command file: `.claude/commands/check.md` → `.claude/commands/validate.md`
185
+ - Implementation: `lib/commands/check.js` → `lib/commands/validate.js`
186
+ - Test file: `test/commands/check.test.js` → `test/commands/validate.test.js`
187
+ - Stage references in all command docs (dev.md, plan.md, ship.md, review.md, premerge.md, verify.md, research.md, rollback.md)
188
+ - Docs: AGENTS.md, docs/WORKFLOW.md, docs/TOOLCHAIN.md, docs/VALIDATION.md, docs/EXAMPLES.md, docs/README-v1.3.md, docs/ROADMAP.md, docs/MANUAL_REVIEW_GUIDE.md, docs/ENHANCED_ONBOARDING.md
189
+ - GitHub: .github/CONTRIBUTING.md, .github/pull_request_template.md, .github/agentic-workflows/behavioral-test.md
190
+ - Rules: .claude/rules/workflow.md
191
+
192
+ **Strategy**: Batch sed replacement across all files for `/check` → `/validate`, then manually update:
193
+ - File renames (check.md → validate.md, check.js → validate.js, check.test.js → validate.test.js)
194
+ - `<HARD-GATE: /check exit>` tag names
195
+ - Function names in check.js that reference "check" semantically
196
+
197
+ ### OWASP Analysis
198
+
199
+ All changes are to `.md` instruction files and `.js` command implementations. No security surface: no user input, no cryptography, no access control, no external service calls.
200
+
201
+ Risk: Near-zero. No OWASP categories apply to this change type.
202
+
203
+ ### TDD Test Scenarios
204
+
205
+ **Test 1 (Happy path — YAGNI filter)**:
206
+ - Input: plan Phase 3 with 5 tasks, 3 mapped to design doc, 2 not mapped
207
+ - Expected: `extractTasksFromDesign()` returns flagged tasks list: 2 tasks with `yagniFlag: true`
208
+ - Test file: `test/commands/plan.phases.test.js`
209
+
210
+ **Test 2 (Happy path — /validate rename)**:
211
+ - Input: `executeValidate({ skip: ['lint', 'security', 'tests'] })`
212
+ - Expected: returns `{ success: boolean, checks: object, summary: string }` (same shape as check)
213
+ - Test file: `test/commands/validate.test.js`
214
+
215
+ **Test 3 (Verification gate — no completion without evidence)**:
216
+ - Input: `validateCompletion({ claimed: 'tests pass', evidence: null })`
217
+ - Expected: throws or returns `{ valid: false, reason: 'No fresh run evidence provided' }`
218
+ - Test file: `test/commands/validate.test.js`
219
+
220
+ **Test 4 (Edge case — all tasks flagged as YAGNI)**:
221
+ - Input: plan Phase 3 with design doc that has no matching tasks
222
+ - Expected: returns `{ allFlagged: true, message: 'Design doc doesn\'t cover all tasks — needs amendment' }`
223
+ - Test file: `test/commands/plan.phases.test.js`
224
+
225
+ **Test 5 (Debug mode — 3-fix architectural gate)**:
226
+ - Input: `debugMode({ fixAttempts: 3, error: 'test failure' })`
227
+ - Expected: returns `{ escalate: true, message: 'STOP: 3+ fixes attempted. Question architecture before Fix #4.' }`
228
+ - Test file: `test/commands/validate.test.js`
229
+
230
+ ---
231
+
232
+ ## Sources
233
+
234
+ - `docs/research/superpowers.md` — Full Superpowers analysis, 14 skills, HARD-GATE pattern
235
+ - `docs/research/superpowers-integration.md` — 5 integration options, decision matrix, recommended path
236
+ - `forge-6od` — Confirmed gaps list with primary sources
237
+ - `.claude/commands/plan.md` — Current plan command state (HARD-GATE blocks confirmed present)
238
+ - `.claude/commands/dev.md` — Current dev command state (two-stage review confirmed present)
239
+ - `.claude/commands/check.md` — Current check command state (verification-before-completion confirmed missing)