npm - @haposoft/cafekit - Versions diffs - 0.8.7 → 0.8.9 - Mend

@haposoft/cafekit 0.8.7 → 0.8.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

package/package.json +1 -1
package/src/claude/CLAUDE.md +2 -2
package/src/claude/agents/code-auditor.md +9 -1
package/src/claude/agents/inspector.md +24 -1
package/src/claude/agents/spec-maker.md +26 -22
package/src/claude/agents/test-runner.md +27 -5
package/src/claude/hooks/spec-state.cjs +2 -1
package/src/claude/migration-manifest.json +2 -1
package/src/claude/rules/workflow.md +5 -4
package/src/claude/scripts/validate-spec-output.cjs +271 -0
package/src/claude/skills/code-review/references/spec-compliance-review.md +1 -1
package/src/claude/skills/develop/SKILL.md +43 -12
package/src/claude/skills/develop/references/quality-gate.md +43 -40
package/src/claude/skills/specs/SKILL.md +32 -27
package/src/claude/skills/specs/references/review.md +2 -2
package/src/claude/skills/specs/rules/tasks-generation.md +35 -9
package/src/claude/skills/specs/templates/task.md +43 -33
package/src/claude/skills/sync/SKILL.md +2 -2
package/src/claude/skills/sync/references/sync-protocols.md +5 -5
package/src/claude/skills/test/SKILL.md +32 -1

package/src/claude/skills/develop/SKILL.md CHANGED

@@ -45,7 +45,7 @@ DO NOT write implementation code until an approved spec exists.
 <DEFINITION-OF-DONE>
 A task is NOT done because code compiles or a placeholder renders.
-A task is done only when the task file's Completion Criteria AND Task Test Plan & Verification Evidence section are satisfied with real execution proof. Existing specs may use legacy `Verification & Evidence`; treat that as the same contract.
+A task is done only when the task file's Completion Criteria AND Evidence section are satisfied with real execution proof. Existing specs may use `Task Test Plan & Verification Evidence` or legacy `Verification & Evidence`; treat those as the same contract.
 </DEFINITION-OF-DONE>
 <CONTRACT-FIDELITY>
@@ -53,6 +53,11 @@ If the spec/task explicitly names a framework, auth system, datastore, transport
 You MUST NOT silently replace it with a simpler custom substitute ("for MVP", "placeholder", "temporary auth", "in-memory until later") unless the spec itself is updated first.
 </CONTRACT-FIDELITY>
+<SCOPE-FIDELITY>
+The approved `scope_lock`, requirements, design contracts, and active task packet are the implementation contract.
+You MUST implement all scoped behavior for the active task, MUST NOT add out-of-scope behavior, and MUST NOT mark work done while required surfaces exist only as orphaned files, unmounted UI, unregistered routes, uncalled loaders, or placeholder wiring.
+</SCOPE-FIDELITY>
 ## Anti-Rationalization Protocol
 | Thought (Excuse) | Reality (Rule) |
@@ -66,12 +71,14 @@ You MUST NOT silently replace it with a simpler custom substitute ("for MVP", "p
 flowchart TD
     A["/hapo:develop <feature>"] --> B[Step 1: Load Spec]
     B -->|Missing| Z[Stop: Run /hapo:specs]
-    B -->|Ready| C[Step 2: Scout Codebase (inspector)]
+    B -->|Ready| C[Step 2: Task-Aware Scout (inspector)]
     C --> D[Step 3: Implement Code (god-developer)]
-    D --> E[Step 4: Quality Gate: Test + Review + Evidence]
-    E -->|Fail (code-auditor)| D
+    D --> E[Step 4: Quality Gate: Test + Spec Review + Code Review + Evidence]
+    E -->|Fail| D
     E -->|Pass| F[Step 5: State Sync + Incremental Docs Sync]
-    F --> G[Report Completion]
+    F --> H{More tasks?}
+    H -->|Yes| B
+    H -->|No| G[Final Integration Scout + Report Completion]
 ```
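The new `More tasks?` branch turns Step 5 into a loop: in Full-Spec Mode the agent re-reads `task_registry` and picks the next unblocked pending task. The selection step could be sketched in JavaScript (the language of the package's `.cjs` scripts). This is an illustration, not cafekit code. It assumes each registry entry has a `status` and a hypothetical `dependencies` array of task paths, because the diff does not show the full registry schema.

```javascript
// Illustrative sketch: the registry shape is assumed, not cafekit's real schema.
// Keys are task file paths; `dependencies` is a hypothetical list of prerequisite paths.

/**
 * Return the first pending task whose dependencies are all "done",
 * or null when nothing is runnable (everything finished, or the rest is blocked).
 */
function nextUnblockedTask(taskRegistry) {
  // Lexical order follows task-R{N}-{SEQ} flow for R0-R9; R10+ would need a numeric sort.
  const paths = Object.keys(taskRegistry).sort();
  for (const path of paths) {
    const task = taskRegistry[path];
    if (task.status !== "pending") continue;
    const deps = task.dependencies || [];
    if (deps.every((dep) => taskRegistry[dep]?.status === "done")) return path;
  }
  return null;
}

/** True once every registered task is done: time for the Final Integration Scout. */
function allTasksDone(taskRegistry) {
  return Object.values(taskRegistry).every((task) => task.status === "done");
}

// Usage with hypothetical task paths.
const registry = {
  "tasks/task-R0-01-project-scaffolding.md": { status: "done" },
  "tasks/task-R1-01-todo-model.md": {
    status: "pending",
    dependencies: ["tasks/task-R0-01-project-scaffolding.md"],
  },
  "tasks/task-R1-02-todo-ui.md": {
    status: "pending",
    dependencies: ["tasks/task-R1-01-todo-model.md"],
  },
};
console.log(nextUnblockedTask(registry)); // tasks/task-R1-01-todo-model.md
console.log(allTasksDone(registry)); // false
```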
 ### Step 1: Initialize & Load Spec
@@ -85,7 +92,7 @@ flowchart TD
   - Objective + Constraints
   - Related Files
   - Completion Criteria
-  - Task Test Plan & Verification Evidence (or legacy Verification & Evidence)
+  - Evidence (or `Task Test Plan & Verification Evidence` / legacy `Verification & Evidence`)
   - Exact executable verification commands named in the task
   - Requirement IDs referenced by the task
   - Named technologies, frameworks, protocols, and data stores that the task/spec explicitly requires
@@ -94,11 +101,26 @@ flowchart TD
 - Before coding, set the active task(s) to `in_progress` in both markdown and `spec.json.task_registry`, or route through `/hapo:sync` if the runtime expects the sync protocol.
 ### Step 2: Scout (Codebase Inspection)
-- **Mandatory:** Call agent `Agent(subagent_type="inspector", ...)` to scan the overall codebase structure (e.g., where components live, where utils are). Avoid wandering into forbidden zones. Use the legacy `Task` tool only in runtimes that have not renamed the subagent tool yet.
+- **Mandatory per task:** Call agent `Agent(subagent_type="inspector", ...)` before implementing EVERY active task. This is task-aware scouting, not a one-time global scan.
+- The inspector prompt MUST include:
+  - Active task file path and extracted task packet
+  - Requirement IDs and `scope_lock`
+  - Relevant `design.md` contracts/invariants
+  - Prior completed task outputs from `spec.json.task_registry`
+  - Related Files from the active task
+- Inspector MUST report:
+  - Real runtime entrypoints/callers affected by the task (`App.tsx`, routes, CLI command, worker registration, manifest, API consumer, etc.)
+  - Existing integration points and adjacent code patterns to follow
+  - Prior task outputs this task must consume or preserve
+  - Blast-radius touchpoints and dependent files that can regress
+  - Reachability risks: orphan components, unmounted UI, unregistered routes, uncalled services/loaders, unused providers, disconnected reducers/actions
+  - Exact files likely safe to modify and any files outside `Related Files` that require a justified scope escape
+- If the inspector cannot identify the entrypoint/caller for a runtime-facing task, STOP and route back to spec correction or ask the user. Do not guess.
 ### Step 3: Implement Code
 - Act as `god-developer` OR directly write code, executing tasks specified in the loaded Markdown file(s) sequentially.
 - **Important:** You may create and modify files directly, but must faithfully follow the design from the Spec.
+- You MUST use the Step 2 scout report as implementation context. If code reality contradicts the task packet, stop and reconcile the spec before coding.
 - Progress tracking: Temporarily change `[ ]` to `[/]` in Spec files while coding is in progress. Do NOT mark `[x]` before Step 4 passes.
 - **Task Boundary Protocol (CRITICAL):**
   - Default editable scope is `Related Files` from the task packet.
@@ -112,18 +134,22 @@ flowchart TD
 - **Named Technology Rule:** If the task/spec explicitly requires a named dependency or runtime choice (for example Better Auth, Hono, Next.js proxy routes, Redis, Drizzle, S3), you MUST implement that choice or stop. Do not swap it for a custom/in-memory/local substitute and still call the task complete.
 - **Cross-Service Reality Rule:** If a task spans multiple processes or runtimes (web ↔ API, worker ↔ DB, extension ↔ backend), you MUST prove the integration uses shared real state or a real contract boundary. Process-local placeholders on both sides do not count as completion.
 - **Placeholder Completion Rule:** You MAY scaffold future files only when the active task truly needs them to compile, but placeholder route handlers, in-memory stores, or fake adapters MUST NOT be used as evidence that the current task's behavior works end-to-end.
+- **Reachability Rule:** Runtime-facing work is incomplete until it is reachable from the real entrypoint/caller named in the task evidence or Step 2 scout report. Creating a component/service/route/provider/reducer without importing, mounting, registering, or invoking it is not implementation.
+- **Prior Output Consumption Rule:** If this task depends on previous task outputs, verify those outputs are consumed through real code paths. If a prior output is unused and this task is responsible for wiring it, wire it now; if a later task owns the wiring, keep the current task pending unless that deferral is named in the active task evidence.
 ### Step 4: Self-Healing (Quality Gate Auto-Fix)
 The moment you finish coding, DO NOT proceed further. Switch to `references/quality-gate.md` and run the automatic review loop.
-**Mantra:** All feedback from code-auditor must be addressed thoroughly: Score >= 9.5 & Zero Critical issues.
+**Mantra:** Scope/spec compliance first, code quality second. All feedback from code-auditor must be addressed thoroughly: Score >= 9.5 & Zero Critical issues.
 - Passing Step 4 requires ALL of the following:
-  1. Automated verification passes, including preflight compile/typecheck/build health and every exact command named in the task's `Task Test Plan & Verification Evidence` section (or legacy `Verification & Evidence`)
-  2. Code review passes
-  3. Task evidence passes (artifacts/runtime surfaces/negative-path checks from the task file are proven)
+  1. Automated verification passes, including preflight compile/typecheck/build health and every exact command named in the task's `Evidence` section (or `Task Test Plan & Verification Evidence` / legacy `Verification & Evidence`)
+  2. Spec compliance review passes: every scoped requirement and active task criterion is implemented, with no extras and no omissions
+  3. Code quality review passes
+  4. Task evidence passes (artifacts/runtime surfaces/reachability/negative-path checks from the task file are proven)
 - `PRECHECK_FAIL` outranks `NO_TESTS`. If compile/typecheck/build fails, the task is FAIL even when no test suite exists yet.
 - `NO_TESTS` is NOT equivalent to PASS. If the task explicitly requires a test command or automated test proof, `NO_TESTS` is a FAIL or BLOCKED outcome until the requirement is satisfied or the spec is corrected.
 - If build/test passes but task evidence is missing, the task is still FAIL.
+- If runtime-facing work is orphaned, unmounted, unregistered, uncalled, or unreachable from the declared entrypoint/caller, the task is still FAIL.
 - If the implementation silently replaced a named contract choice or relies on cross-service process-local stand-ins, the task is still FAIL.
 - Only escalate to the user after 3 consecutive failed review rounds.
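The Step 4 pass conditions above form a precedence order. `PRECHECK_FAIL` comes first, then the test outcome, and then the remaining proofs, any one of which fails the task on its own. The sketch below shows that ordering. Its input field names are hypothetical, because the skill states these rules in prose rather than code.

```javascript
// Illustrative sketch of Step 4 precedence; field names are hypothetical.
function classifyGate(result) {
  // Compile/typecheck/build failure outranks everything, including NO_TESTS.
  if (!result.preflightOk) return "PRECHECK_FAIL";

  // NO_TESTS never passes by itself; when tests were required it is a failure
  // (the skill also allows reporting this case as BLOCKED).
  if (result.testOutcome === "NO_TESTS" && result.taskRequiresTests) return "FAIL (tests required)";
  if (result.testOutcome === "FAIL") return "FAIL (tests)";

  // Green automation is not enough: each remaining proof can fail the task alone.
  const checks = [
    ["evidence", result.evidenceOk],
    ["reachability", result.reachable],
    ["spec compliance", result.specPass],
    ["contract fidelity", !result.silentSubstitution],
    ["code quality", result.reviewScore >= 9.5 && result.criticalCount === 0],
  ];
  const failed = checks.filter(([, ok]) => !ok).map(([name]) => name);
  return failed.length === 0 ? "PASS" : `FAIL (${failed.join(", ")})`;
}

console.log(classifyGate({ preflightOk: false, testOutcome: "NO_TESTS" }));
// PRECHECK_FAIL
console.log(classifyGate({
  preflightOk: true, testOutcome: "PASS", evidenceOk: true, reachable: false,
  specPass: true, silentSubstitution: false, reviewScore: 9.6, criticalCount: 0,
}));
// FAIL (reachability)
```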
@@ -135,7 +161,7 @@ The moment you finish coding, DO NOT proceed further. Switch to `references/qual
   - `spec.json.task_registry[path].status = "done"`
   - `completed_at` + `last_updated_at`
   - synchronized top-level `updated_at`
-  - a human-readable verification receipt inside the task's `Task Test Plan & Verification Evidence` section showing which commands ran, their outcomes, and what proof was observed
+  - a human-readable verification receipt inside the task's `Evidence` section showing which commands ran, their outcomes, and what proof was observed
 - Verification receipts with `PRECHECK_FAIL`, `FAIL`, `UNVERIFIED`, or an explicit note that the implementation intentionally simplified a named contract MUST NOT be synchronized as `done`.
 - After syncing the active task, run a **Task Closeout Docs Checkpoint**
 - Task Closeout Docs Checkpoint:
@@ -147,6 +173,11 @@ The moment you finish coding, DO NOT proceed further. Switch to `references/qual
 - Task-level docs sync happens after every verified completed task, but actual edits still depend on `Docs impact`.
 - In **Specific-Task Mode**, STOP after sync and report the result.
 - In **Full-Spec Mode**, only after sync may you re-read `task_registry`, pick the next unblocked pending task, and repeat from Step 1 for that task.
+- When no pending tasks remain, run a **Final Integration Scout** before reporting completion:
+  - Trace runtime entrypoints from `main`/route/CLI/worker/manifest/API consumer through the scoped feature surfaces.
+  - Compare reachable behavior against `scope_lock`, `requirements.md`, `design.md`, and all task Completion Criteria.
+  - FAIL completion if any scoped surface is missing, any created runtime-facing artifact is orphaned, or spec progress/registry says done while evidence is missing.
+  - Only then set top-level progress to `code_done` / next phase.
 ---
 ## Attached References

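The Reachability Rule and the Final Integration Scout both come down to a graph question: can an entrypoint reach every runtime-facing artifact the task created? The sketch below answers that question. It assumes the import, mount, and registration edges have already been extracted into an adjacency map. Extracting those edges from real source is the hard part and is not shown, and all file paths are hypothetical.

```javascript
// Illustrative sketch: `graph` maps each file to the files it imports, mounts,
// registers, or invokes. Building that graph is assumed; this only walks it.

/** Breadth-first walk collecting every file reachable from the entrypoints. */
function reachableFrom(entrypoints, graph) {
  const seen = new Set();
  const queue = [...entrypoints];
  while (queue.length > 0) {
    const file = queue.shift();
    if (seen.has(file)) continue;
    seen.add(file);
    for (const next of graph[file] || []) queue.push(next);
  }
  return seen;
}

/** Files created by the task that no entrypoint can reach: the "orphans". */
function findOrphans(createdFiles, entrypoints, graph) {
  const reachable = reachableFrom(entrypoints, graph);
  return createdFiles.filter((file) => !reachable.has(file));
}

// Usage with hypothetical paths: TodoList.tsx exists but nothing imports it.
const graph = {
  "src/main.tsx": ["src/App.tsx"],
  "src/App.tsx": ["src/routes.ts"],
  "src/routes.ts": ["src/pages/TodoPage.tsx"],
};
const created = ["src/pages/TodoPage.tsx", "src/components/TodoList.tsx"];
console.log(findOrphans(created, ["src/main.tsx"], graph));
// [ 'src/components/TodoList.tsx' ]
```

A static walk like this can miss dynamic registration, such as routes loaded by string name. It therefore complements the runtime reachability proof the gate asks for rather than replacing it.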
package/src/claude/skills/develop/references/quality-gate.md CHANGED

@@ -1,23 +1,27 @@
-# Quality Gate — Parallel Test + Review Loop
+# Quality Gate — Task Evidence + Two-Stage Review Loop
 This is the critical checkpoint protecting codebase quality at Step 4 of `hapo:develop`.
 Runs AUTOMATICALLY. Only escalates to user after 3 consecutive failures or a critical block.
-Green tests are NOT enough. The gate requires three proofs:
+Green tests are NOT enough. The gate requires four proofs:
 1. Automated verification (typecheck/test/build)
-2. Code/spec review
-3. Task evidence (completion criteria + runtime/artifact proof from the task file)
+2. Spec compliance review (scope/task/design adherence)
+3. Code quality review
+4. Task evidence (completion criteria + runtime/artifact/reachability proof from the task file)
 ## Automation Semantics
-- If the task names exact commands in `Task Test Plan & Verification Evidence` (or legacy `Verification & Evidence`), those exact commands are mandatory and must run before any fallback repo defaults.
+- If the task names exact commands in `Evidence` (or `Task Test Plan & Verification Evidence` / legacy `Verification & Evidence`), those exact commands are mandatory and must run before any fallback repo defaults.
 - Preflight compile/typecheck/build health is mandatory. If compile/typecheck/build fails before tests are meaningful, the gate result is `PRECHECK_FAIL`, not `NO_TESTS`.
 - `NO_TESTS` is never an automatic PASS.
 - `NO_TESTS` is acceptable only when the task does **not** require a dedicated test suite command and every other required automated command/evidence item passes.
 - If the task explicitly requires tests and the repo has no such test command or suite, the task is FAIL or BLOCKED, not done.
+- If the task kind implies a concrete test type, the gate must enforce it: unit tests for logic/regression, component or integration tests for stateful UI or cross-module wiring, E2E/UI-flow checks for complete user workflows, visual/responsive checks for layout/theme work, accessibility checks for interactive UI, and smoke checks for scaffold/config. Performance/security checks are mandatory only when specified by requirement/risk/boundary.
 - Named frameworks, auth systems, transports, datastores, and runtime boundaries in the task/spec are contractual. Silent substitutions are review failures, not acceptable implementation trade-offs.
 - Multi-process or multi-runtime flows must prove shared real state or a real boundary contract. Matching in-memory placeholders on both sides do not count as working integration.
+- Scope fidelity is mandatory: missing scoped behavior, extra unapproved behavior, or task output that exists only as orphaned/unreachable code is a review failure even when build/tests pass.
+- Runtime-facing artifacts must be reachable from the real entrypoint/caller named by the task or the task-aware scout report.
-## Parallel Quality Cycle
+## Quality Cycle
 Maximum retry counter: **3 attempts**. Exceeding 3 triggers a collapse warning.
@@ -26,69 +30,68 @@ Variable: retry_count = 0
 Before START_LOOP:
   - Read the active task file(s)
-  - Extract Related Files, Completion Criteria, Task Test Plan & Verification Evidence (or legacy Verification & Evidence)
+  - Extract Related Files, Completion Criteria, Evidence (or Task Test Plan & Verification Evidence / legacy Verification & Evidence)
   - Extract the exact executable verification commands in declaration order
   - Extract relevant design contracts/invariants for the touched area
+  - Extract scope_lock, requirement IDs, runtime entrypoints/callers, and reachability proof obligations
   - If any of these are missing or too vague to verify, FAIL immediately and route back to spec correction
 START_LOOP:
   ---------------------------------------------------------------
-  PARALLEL GATE: Spawn BOTH agents simultaneously
+  STAGE A: Test + SPEC COMPLIANCE review
   ---------------------------------------------------------------
   → Agent(subagent_type="test-runner",
-        prompt="Run task-aware verification for the recently implemented code. Read the active task file(s) and execute the exact verification commands named there first, in order. Preflight compile/typecheck/build failures must be reported as PRECHECK_FAIL and take precedence over NO_TESTS. After that, run any additional repo-level typecheck/test/build checks needed for confidence. Inspect named artifacts/runtime outputs. For multi-service tasks, verify the flow does not rely on process-local stand-ins masquerading as shared state. Return PASS only if automated checks and task evidence both pass. Mark anything unexecuted as UNVERIFIED. Treat NO_TESTS as non-passing unless the task did not require a dedicated test suite.",
+        prompt="Run task-aware verification for the recently implemented code. Read the active task file(s) and execute the exact verification commands named there first, in order. Preflight compile/typecheck/build failures must be reported as PRECHECK_FAIL and take precedence over NO_TESTS. After that, run any additional repo-level typecheck/test/build checks needed for confidence. Inspect named artifacts/runtime outputs and prove runtime reachability from declared entrypoints/callers. For multi-service tasks, verify the flow does not rely on process-local stand-ins masquerading as shared state. Return PASS only if automated checks and task evidence both pass. Mark anything unexecuted as UNVERIFIED. Treat NO_TESTS as non-passing unless the task did not require a dedicated test suite.",
         description="Test [feature]")
   → Agent(subagent_type="code-auditor",
-        prompt="Review all recently written code against the active task file(s), referenced requirements, and design contracts. Missing deliverables, placeholder-only wiring, missing runtime entrypoints, overscope edits outside the task packet, silent replacement of named technologies/contracts, or fake cross-service proof via process-local state are Critical even if build/tests pass. Check security, logic, architecture, YAGNI/KISS/DRY. Return score (X/10), critical count, warning list, and evidence gaps.",
-        description="Review [feature]")
+        prompt="SPEC COMPLIANCE REVIEW ONLY. Do not trust the implementer's report. Read the active task file(s), scope_lock, referenced requirements, design contracts, task-aware scout report, and actual code. Verify line by line that every scoped requirement and completion criterion is implemented, nothing out-of-scope was added, and every runtime-facing artifact is reachable from the declared entrypoint/caller or explicitly deferred to a named later task. Missing deliverables, placeholder-only wiring, orphan components/services, unmounted UI, unregistered routes, uncalled loaders, missing runtime entrypoints, overscope edits outside the task packet, silent replacement of named technologies/contracts, or fake cross-service proof via process-local state are Critical even if build/tests pass. Return SPEC_PASS or SPEC_FAIL, critical count, file:line findings, and evidence gaps.",
+        description="Spec review [feature]")
   Wait for BOTH to return results.
-  ---------------------------------------------------------------
-  COMBINE RESULTS
-  ---------------------------------------------------------------
-  CASE 1 — PRECHECK_FAIL OR Automated FAIL OR required command missing OR Evidence FAIL / UNVERIFIED OR NO_TESTS when tests were required:
+  CASE 1 — PRECHECK_FAIL OR Automated FAIL OR required command missing OR Evidence FAIL / UNVERIFIED OR Reachability FAIL / SPEC_FAIL OR NO_TESTS when tests were required:
     - Increment retry_count++
     - If retry_count >= 3:
         → COLLAPSE! AskUserQuestion: "Quality gate cannot prove this task is complete! User intervention required!"
     - If retry_count < 3:
-        → Return to Step 3 (god-developer). Fix the failing checks or missing evidence first.
-        → GOTO START_LOOP (re-run BOTH test + review)
+        → Return to Step 3 (god-developer). Fix the failing checks, spec gaps, or missing evidence first.
+        → GOTO START_LOOP
+  CASE 2 — Test PASS + Evidence PASS + SPEC_PASS:
+    → Proceed to STAGE B code quality review.
-  CASE 2 — Test PASS + Evidence PASS + Review FAIL (Score < 9.5 OR Critical > 0):
+STAGE B:
+  ---------------------------------------------------------------
+  CODE QUALITY REVIEW (only after spec compliance passes)
+  ---------------------------------------------------------------
+  → Agent(subagent_type="code-auditor",
+        prompt="CODE QUALITY REVIEW. Spec compliance already passed. Review recently written code for security, logic correctness, architecture, YAGNI/KISS/DRY, maintainability, tests, and project conventions. Also re-check that no recent edits broke dependents found by the task-aware scout report. Return score (X/10), critical count, warning list, and concrete file:line findings.",
+        description="Quality review [feature]")
+  CASE 3 — Code quality review FAIL (Score < 9.5 OR Critical > 0):
     - Increment retry_count++
     - If retry_count >= 3:
         → COLLAPSE! AskUserQuestion: "Code does not meet minimum standards! User intervention required!"
     - If retry_count < 3:
-        → Fix each review issue from warning log.
-        → GOTO REVIEW_ONLY (skip re-test only if the fixes cannot affect automated evidence; otherwise rerun full loop)
+        → Fix each review issue.
+        → GOTO START_LOOP unless the fix is prose-only and cannot affect evidence; otherwise re-run Stage B.
-  CASE 3 — Test PASS + Evidence PASS + Review PASS (Score >= 9.5 AND Critical = 0):
+  CASE 4 — Test PASS + Evidence PASS + SPEC_PASS + Code quality review PASS (Score >= 9.5 AND Critical = 0):
     → PASS! Auto-approved.
-    → PROCEED to completion report with a verification receipt summarizing exact commands executed, artifact/runtime proof, and review result.
-REVIEW_ONLY:
-  ---------------------------------------------------------------
-  Re-run ONLY code-auditor (tests already passed and no new evidence-producing code changed)
-  ---------------------------------------------------------------
-  → Agent(subagent_type="code-auditor", ...)
-  IF Score >= 9.5 AND Critical = 0 → PASS!
-  IF Score < 9.5 OR Critical > 0:
-    - retry_count++
-    - If retry_count >= 3 → COLLAPSE
-    - Else → fix issues, GOTO REVIEW_ONLY
+    → PROCEED to completion report with a verification receipt summarizing exact commands executed, artifact/runtime/reachability proof, spec review result, and code quality review result.
 ```
 ## Critical Issue Definitions
 - **Security:** XSS vulnerabilities, SQL injection, leaked env tokens/secrets.
-- **Performance:** Bottlenecks, O(n³) algorithms, unbounded loops over DB calls.
+- **Performance:** Bottlenecks, O(n^3) algorithms, unbounded loops over DB calls.
 - **Architecture:** Breaking MVC boundaries, cross-module coupling, convention violations.
 - **Principles:** YAGNI violations, KISS violations, DRY violations (excessive code duplication).
 - **Evidence / Done-Criteria Drift:** Missing required artifacts, placeholder-only wiring, missing entrypoints, unproven completion criteria, or runtime contract mismatches.
-- **Overscope Delivery Drift:** Implementing later-task deliverables or editing out-of-scope files without direct justification for the active task.
+- **Reachability Failure:** Orphan components/services/hooks/routes/workers/commands/providers/reducers, unmounted UI, unregistered routes, uncalled data loaders, unused providers, disconnected actions, or any runtime-facing artifact that cannot be reached from the declared entrypoint/caller.
+- **Scope Drift:** Scoped acceptance criteria omitted, behavior added outside `scope_lock`, or a task marked complete while part of its approved requirement remains unwired.
+- **Overscope Delivery Drift:** Implementing later-task deliverables or editing out-of-scope files without direct justification for the active task packet.
 - **Contract Substitution Drift:** Replacing a named framework/auth/transport/datastore/runtime boundary with a custom simplification without a spec amendment.
 - **Cross-Service Reality Failure:** Claiming end-to-end behavior across web/api/worker/extension boundaries while state only exists in local process memory or placeholder adapters.
@@ -96,8 +99,8 @@ REVIEW_ONLY:
 Must log the Quality Gate result to the terminal for user visibility:
-- **Quick Pass:** `✓ Step 4 Quality Gate: Test PASS + Evidence PASS + Review 9.5/10 - Auto-Approved`
-- **Hard-Won Pass:** `✓ Step 4 Quality Gate: Failed 2 rounds → Test PASS + Evidence PASS + Review 9.6/10`
+- **Quick Pass:** `✓ Step 4 Quality Gate: Test PASS + Evidence PASS + Spec PASS + Review 9.5/10 - Auto-Approved`
+- **Hard-Won Pass:** `✓ Step 4 Quality Gate: Failed 2 rounds → Test PASS + Evidence PASS + Spec PASS + Review 9.6/10`
 - **Preflight Fail:** `[x] Step 4 Quality Gate: PRECHECK_FAIL → compile/typecheck/build failed before tests mattered`
-- **Fix Needed:** `[~] Step 4 Quality Gate: Tests/evidence failed → returned to god-developer`
+- **Fix Needed:** `[~] Step 4 Quality Gate: Tests/spec/evidence failed → returned to god-developer`
 - **Awaiting Rescue:** `[!] Step 4 Quality Gate: Failed 3 rounds! Awaiting user intervention...`
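The revised gate replaces the parallel test+review pass with two ordered stages that share a single `retry_count`. The control-flow sketch below illustrates this. `stageA`, `stageB`, and `fix` are hypothetical callbacks that stand in for the test-runner and code-auditor agents and for the god-developer fix step.

```javascript
// Illustrative control-flow sketch; the callbacks and their return shapes are hypothetical.
const MAX_RETRIES = 3;

function runQualityGate({ stageA, stageB, fix }) {
  let retryCount = 0;
  for (;;) {
    // STAGE A: tests, task evidence, and spec compliance must all pass.
    const a = stageA();
    if (!(a.testPass && a.evidencePass && a.specPass)) {
      retryCount += 1;
      if (retryCount >= MAX_RETRIES) return { status: "COLLAPSE", retryCount };
      fix("stageA", a);
      continue; // GOTO START_LOOP
    }
    // STAGE B: code quality review runs only after spec compliance passed.
    const b = stageB();
    if (b.score >= 9.5 && b.critical === 0) return { status: "PASS", retryCount };
    retryCount += 1;
    if (retryCount >= MAX_RETRIES) return { status: "COLLAPSE", retryCount };
    fix("stageB", b);
    // Simplification: every quality fix restarts at Stage A. The gate also
    // allows re-running only Stage B when a fix is prose-only.
  }
}

// Usage: Stage A fails once, then everything passes.
let attempts = 0;
const result = runQualityGate({
  stageA: () => {
    attempts += 1;
    return { testPass: attempts > 1, evidencePass: true, specPass: true };
  },
  stageB: () => ({ score: 9.6, critical: 0 }),
  fix: (stage) => console.log(`fixing after ${stage} failure`),
});
console.log(result); // { status: 'PASS', retryCount: 1 }
```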

package/src/claude/skills/specs/SKILL.md CHANGED

@@ -80,6 +80,9 @@ Forbidden generated artifacts:
 - Do NOT create shorthand task files such as `tasks/task-R0-1.md`, `tasks/task-R1-1.md`, or `tasks/R0-1-<slug>.md`.
 - The template file name is never the output file name. `templates/spec-state.json` is only the schema source for generated `spec.json`.
 - Task hydration is session/task-state synchronization only; it MUST NOT be written as a markdown artifact.
+- Before marking a spec ready, run the deterministic validator:
+  - `node .claude/scripts/validate-spec-output.cjs specs/<feature>`
+  - Any validator failure blocks `ready_for_implementation = true`.
 ### Writing Style
 - Concise, prefer bullet lists
@@ -266,7 +269,10 @@ Load: `references/scope-inquiry.md`
 - Load `rules/tasks-parallel-analysis.md` for parallel markers (default: enabled)
 - Each task file follows template `templates/task.md`
 - `Related Files` and test plans must inherit paths, contracts, and test targets from the codebase scout. If exact files/tests cannot be named for an enhancement, run targeted inspect before generating tasks.
-- Each task file MUST include `Completion Criteria` and `Task Test Plan & Verification Evidence` sections detailed enough that a downstream quality gate can prove the task is truly done.
+- Each task file MUST include `Completion Criteria` and `Evidence` sections detailed enough that a downstream quality gate can prove the task is truly done. Existing specs may use `Task Test Plan & Verification Evidence` or legacy `Verification & Evidence`.
+- Each task's `Evidence` MUST choose the right proof type for the touched surface: unit for pure logic, component/integration for UI or state wiring, E2E/UI flow for complete user workflows, visual/responsive checks for style/layout work, accessibility checks for interactive UI, smoke checks for scaffold/config, regression checks for bug fixes, and performance/security checks only when the requirement or risk calls for them.
+- Every task MUST preserve the approved `scope_lock`: implement all scoped acceptance criteria for its requirement, avoid out-of-scope features, and record any intentional deferral as a named later task rather than implicit omission.
+- For UI/app/runtime features, generate a final integration/reachability task or final section that names the real runtime entrypoint and proves prior task outputs are imported, mounted, registered, invoked, or otherwise reachable.
 - Build `spec.json.task_registry` alongside `task_files`. For each task file, register at minimum:
   - `id`
   - `title`
@@ -278,11 +284,11 @@ Load: `references/scope-inquiry.md`
   - `last_updated_at`
 - Update `spec.json` phase + task metadata
-#### Requirement-Driven Task Grouping (MANDATORY)
-Tasks MUST be organized **by requirement**, NOT by technical concern. Each requirement from `requirements.md` gets its own cluster of task files.
+#### Requirement-Covered Task Grouping (MANDATORY)
+Tasks MUST be organized by implementation flow while preserving explicit requirement coverage. Foundation work uses `R0`; feature work uses `R1+`.
 **Naming convention:** `tasks/task-R{N}-{SEQ}-<slug>.md`
-- `R{N}` = requirement number (e.g., R1, R2, R3...)
+- `R{N}` = foundation or implementation cluster (R0 foundation, R1+ feature work)
 - `{SEQ}` = sequential number within that requirement (01, 02, 03...)
 - `<slug>` = descriptive kebab-case name
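The naming convention can be checked mechanically. The sketch below expresses it as a regular expression. It takes the two-digit `SEQ` from the Finalization Audit rule and reads "kebab-case" as lowercase alphanumeric words joined by single hyphens. That reading is an interpretation, not a documented rule.

```javascript
// Illustrative sketch of tasks/task-R{N}-{SEQ}-<slug>.md with a two-digit SEQ.
const TASK_FILE_RE = /^tasks\/task-R(\d+)-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$/;

/** Parse a task path into its parts, or return null if it breaks the convention. */
function parseTaskPath(path) {
  const m = TASK_FILE_RE.exec(path);
  if (!m) return null;
  return { cluster: Number(m[1]), seq: Number(m[2]), slug: m[3] };
}

console.log(parseTaskPath("tasks/task-R0-01-project-scaffolding.md"));
// { cluster: 0, seq: 1, slug: 'project-scaffolding' }
console.log(parseTaskPath("tasks/task-R0-1.md")); // null (forbidden shorthand)
console.log(parseTaskPath("tasks/R0-1-setup.md")); // null
```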
@@ -301,11 +307,11 @@ tasks/
 ```
 **Splitting rules:**
-- Each requirement → 1 or more task files (split by sub-scope within the requirement)
-- A task file MUST serve exactly 1 primary requirement (cross-cutting references allowed as secondary)
-- If a requirement has only 1 natural task, create 1 file (no forced splitting)
-- If a requirement has many acceptance criteria spanning different concerns → split into multiple task files
-- After generating all tasks: verify **every requirement ID** appears as primary in at least one task file — gaps = failure
+- Split by real implementation dependency chain first: model/schema -> service -> API -> UI -> integration.
+- A task file MAY cover multiple requirement IDs when one code change naturally satisfies them.
+- A requirement MAY be covered by multiple task files when it spans layers.
+- Do not create all tasks under `R0`; `R0` is only shared foundation/setup.
+- After generating all tasks: verify **every requirement ID** appears in at least one task file's `## Requirements` section — gaps = failure.
 - **Legacy Protection:** If the `research.md` identified existing codebase files or tests that will be broken (Blast Radius), you MUST generate explicitly tasked files (e.g., `task-R5-01-update-legacy-tests.md`) to fix those breakages. Do not leave broken tests out of scope.
 **Dependency ordering:** Tasks within the same requirement are ordered by natural implementation flow. Cross-requirement dependencies use `Dependencies:` field referencing other task file names.
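The revised splitting rules drop the one-primary-requirement constraint. Instead, every requirement ID must appear in at least one task file's `## Requirements` section. A minimal coverage check might look like the sketch below. It assumes IDs are dotted numbers such as `1.2` (the diff does not fix an ID format) and that the section ends at the next `## ` heading.

```javascript
// Illustrative sketch; the requirement ID shape below is an assumption.
const ID_RE = /\b\d+(?:\.\d+)*\b/g; // e.g. "1", "2.3"

/** Text between "## Requirements" and the next "## " heading (or end of file). */
function requirementsSection(markdown) {
  const lines = markdown.split("\n");
  const start = lines.findIndex((line) => line.trim() === "## Requirements");
  if (start === -1) return "";
  const rest = lines.slice(start + 1);
  const end = rest.findIndex((line) => line.startsWith("## "));
  return (end === -1 ? rest : rest.slice(0, end)).join("\n");
}

/** Requirement IDs that no task file lists in its Requirements section. */
function uncoveredRequirements(requirementIds, taskMarkdowns) {
  const covered = new Set();
  for (const md of taskMarkdowns) {
    for (const id of requirementsSection(md).match(ID_RE) || []) covered.add(id);
  }
  return requirementIds.filter((id) => !covered.has(id));
}

const tasks = [
  "# Task\n## Requirements\n- 1.1 user can add a todo\n- 1.2 empty titles rejected\n## Steps\n- 3.1 not a requirement\n",
  "# Task\n## Requirements\n- 2.1 list persists across reloads\n",
];
console.log(uncoveredRequirements(["1.1", "1.2", "2.1", "2.2"], tasks)); // [ '2.2' ]
```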
@@ -314,23 +320,17 @@ tasks/
 Each task file MUST be **self-contained and implementation-ready** — detailed enough for a junior developer or AI coding agent to execute without guessing.
 **Structure per task file:**
-1. **Objective** — 1-2 sentence objective (WHAT, not HOW)
-2. **Implementation Steps** — Hierarchical breakdown:
-   - Major steps (`- [ ] 1. ...`) group by cohesion
-   - Sub-tasks (`- [ ] 1.1 ...`) are specific actionable items (1-3 hours each)
-   - Detail bullets under each sub-task describe:
-     - Business logic and behavior to implement
-     - Edge cases and constraints
-     - Validation rules
-   - `_Requirements: X.X_` at the END of every sub-task — **no exceptions**
-3. **Test coverage** — Last major step in every task must cover unit + integration tests
-4. **Related Files** — Table with exact paths, action type, and descriptions
-5. **Completion Criteria** — Observable, testable criteria (checkbox format)
-6. **Risk Assessment** — Table with risk, severity, mitigation
+1. **Context** — why this task exists, current state, target outcome, relevant exact files.
+2. **Steps** — concise implementation checklist with business intent and code-level detail.
+3. **Requirements** — list requirement IDs and acceptance criteria covered by this task.
+4. **Related Files** — table with exact paths, action type, and descriptions when paths are known; otherwise run scout first.
+5. **Completion Criteria** — observable, testable criteria.
+6. **Evidence** — automated command(s), artifact/runtime proof, negative-path proof, and runtime reachability proof.
+7. **Risk Assessment** — table with risk, severity, mitigation.
 **Parallel markers:** Append `(P)` to tasks that can run concurrently (no data dependency, no shared files, no prerequisite approval from another task). Tasks serving DIFFERENT requirements are often parallelizable.
-**FORBIDDEN:** Task files with only 3-5 top-level checkboxes and no sub-task breakdown. This level of detail is INSUFFICIENT for implementation.
+**FORBIDDEN:** Task files with only vague checkboxes and no exact files, requirements, or evidence. Compact is good; vague is invalid.
 ### Step 8: Task Hydration
 Load: `references/task-hydration.md`
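As a rough illustration, the seven-part task structure can be checked by reading level-2 headings. This is a hypothetical sketch and is stricter than the Step 9.5 FAIL list, which only requires `Completion Criteria` and `Evidence`. It assumes each part is a `## ` heading and maps the older evidence headings onto `Evidence`.

```javascript
// Hypothetical shape check for one task file, following the seven-part structure.
const GOLDEN_ORDER = [
  "Context", "Steps", "Requirements", "Related Files",
  "Completion Criteria", "Evidence", "Risk Assessment",
];
// Older evidence headings are treated as "Evidence", as the spec text allows.
const HEADING_ALIASES = new Map([
  ["Task Test Plan & Verification Evidence", "Evidence"],
  ["Verification & Evidence", "Evidence"],
]);

function checkTaskShape(markdown) {
  const headings = markdown
    .split("\n")
    .filter((line) => /^##\s/.test(line)) // level-2 headings only
    .map((line) => line.replace(/^##\s+/, "").trim())
    .map((heading) => HEADING_ALIASES.get(heading) ?? heading)
    .filter((heading) => GOLDEN_ORDER.includes(heading)); // ignore extra sections

  const errors = [];
  for (const name of GOLDEN_ORDER) {
    if (!headings.includes(name)) errors.push(`missing section: ${name}`);
  }
  // The sections that are present must appear once each, in golden order.
  const expected = GOLDEN_ORDER.filter((name) => headings.includes(name));
  if (headings.join("|") !== expected.join("|")) {
    errors.push("sections duplicated or out of order");
  }
  return errors;
}
```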
@@ -354,6 +354,7 @@ Load: `references/review.md` + `rules/design-review.md`
 ### Step 9.5: Finalization Audit (MANDATORY)
 - Re-scan the `tasks/` directory and rebuild `spec.json.task_files` from the real filesystem (sorted, relative paths)
 - Rebuild `spec.json.task_registry` from the real filesystem if it is missing, stale, or missing keys. Preserve task status fields when the path still matches.
+- Run `node .claude/scripts/validate-spec-output.cjs specs/<feature>` and treat any non-zero exit as a blocking failure.
 - FAIL if any task file exists on disk but is missing from `task_files`
 - FAIL if any path in `task_files` does not exist on disk
 - FAIL if any task file exists on disk but is missing from `task_registry`
@@ -361,8 +362,10 @@ Load: `references/review.md` + `rules/design-review.md`
 - FAIL if any task file path does not match `tasks/task-R{N}-{SEQ}-<slug>.md` with two-digit `SEQ` (for example `tasks/task-R0-01-project-scaffolding.md`)
 - FAIL if a newly generated non-trivial spec lacks a `research.md` Evidence Summary with codebase scout result, external research result or skip rationale, selected decision, rejected alternatives, and downstream task/test implications.
 - FAIL if any requirement or NFR mapping uses non-numeric labels (`NFR-1`, `SEC-1`, etc.)
-- FAIL if a task lacks `Completion Criteria` or `Task Test Plan & Verification Evidence` (legacy `Verification & Evidence` is accepted only for pre-existing task files)
-- FAIL if accepted validation decisions exist in reports but are not reflected in the implementation-facing sections of affected artifacts (`Objective`, `Constraints`, `Implementation Steps`, `Completion Criteria`, `Task Test Plan & Verification Evidence`, canonical contracts, or requirements text).
+- FAIL if a task lacks `Completion Criteria` or `Evidence` (existing `Task Test Plan & Verification Evidence` or legacy `Verification & Evidence` is accepted)
+- FAIL if a task creates runtime-facing artifacts but neither proves reachability from an entrypoint/caller nor names a later integration task responsible for wiring them.
+- FAIL if a UI/app/runtime spec has multiple user-facing task outputs but no final integration/reachability task or final integration section.
+- FAIL if accepted validation decisions exist in reports but are not reflected in the implementation-facing sections of affected artifacts (`Context`, `Steps`, `Requirements`, `Completion Criteria`, `Evidence`, canonical contracts, or requirements text).
 - FAIL if the spec scope/provider was switched away from Anthropic/Claude but `requirements.md`, `design.md`, or `tasks/*.md` still contain stale provider-specific strings such as `Claude API`, `Haiku`, or `haiku_reachable`. `research.md` is the only allowed place for historical cost comparisons.
 - FAIL if privacy/delete-data work lacks a single canonical deletion policy. The design MUST explicitly choose either:
   1. hard-delete with no re-registration lock, or
@@ -445,7 +448,7 @@ specs/
     ├── requirements.md        # Technical requirements (EARS format)
     ├── research.md            # Research notes
     ├── design.md              # Architectural design
-    ├── tasks/                 # Grouped by requirement (R1, R2, R3...)
+    ├── tasks/                 # Foundation + implementation clusters (R0, R1, R2...)
     │   ├── task-R0-01-foundation.md
     │   ├── task-R1-01-<slug>.md
     │   ├── task-R1-02-<slug>.md
@@ -500,13 +503,14 @@ Before finalizing any specification, assert all the following:
 - [ ] **Requirements traceability** matrix present in design.md
 - [ ] **Canonical Contracts & Invariants** filled for auth/transport/persistence/artifact-sensitive work
 - [ ] **Every task file** maps to at least 1 valid in-scope requirement ID
-- [ ] **Every task file** includes `Task Test Plan & Verification Evidence` with executable or inspectable proof
+- [ ] **Every task file** includes `Evidence` with executable or inspectable proof
 - [ ] **State Machine Blueprint:** design.md contains Mermaid diagrams for non-trivial flows
 - [ ] **Dependency graph complete**: no task can start before its blockers are listed
 - [ ] **Risk matrix filled**: likelihood × impact, with mitigation for High items
 - [ ] **Test strategy defined**: what gets unit tested, integration tested, e2e validated
 - [ ] **task_files inventory synced**: no missing or orphaned task references
 - [ ] **task_registry synced**: every task file has exactly one machine-state entry with valid status + dependencies
+- [ ] **deterministic validator passed**: `node .claude/scripts/validate-spec-output.cjs specs/<feature>`
 - [ ] **Validation gate consistent**: validation_recommended and validation.status agree with spec risk
 - [ ] **Provider wording clean**: no stale vendor/provider strings outside allowed research context
 - [ ] **spec.json fully updated**: phase, current_phase, progress, timestamps, approvals, design_context
@@ -533,6 +537,7 @@ Before finalizing any specification, assert all the following:
 - `design.md` — Design document template
 - `research.md` — Research log template
 - `task.md` — Template for individual task file
+- `.claude/scripts/validate-spec-output.cjs` — Deterministic validator for generated spec artifacts
 ### Rules (`rules/`)
 - `ears-format.md` — EARS requirements standard
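Several Step 9.5 inventory checks reduce to set comparisons between the filesystem and `spec.json`. The sketch below is hypothetical: the function name, the input shapes, and the kebab-case slug pattern are assumptions, and it takes the directory listing as an argument so it stays pure. It is not the shipped validator, whose exact rules are not shown in this diff.

```javascript
// Hypothetical Step 9.5 inventory audit. Pure function: the caller supplies the
// task paths found on disk, relative to specs/<feature>/ (e.g. "tasks/task-R1-01-login.md").
// The kebab-case slug pattern is an assumption; the spec only fixes R{N} and two-digit SEQ.
const TASK_PATH = /^tasks\/task-R\d+-\d{2}-[a-z0-9]+(?:-[a-z0-9]+)*\.md$/;

function auditTaskInventory(diskPaths, spec) {
  const failures = [];
  const listed = new Set(spec.task_files ?? []);
  const registry = spec.task_registry ?? {}; // assumed: object keyed by relative path
  const onDisk = new Set(diskPaths);

  for (const p of diskPaths) {
    if (!listed.has(p)) failures.push(`on disk but missing from task_files: ${p}`);
    if (!Object.prototype.hasOwnProperty.call(registry, p)) {
      failures.push(`on disk but missing from task_registry: ${p}`);
    }
    if (!TASK_PATH.test(p)) failures.push(`bad task filename: ${p}`);
  }
  for (const p of listed) {
    if (!onDisk.has(p)) failures.push(`in task_files but not on disk: ${p}`);
  }
  // Rebuilt inventory: sorted relative paths taken from the real listing.
  return { failures, rebuiltTaskFiles: [...diskPaths].sort() };
}
```

A caller would pass the real listing (for example built from `fs.readdirSync` on the `tasks/` directory, prefixed with `tasks/`) and treat any entry in `failures` as blocking, in the same spirit as treating a non-zero exit from `validate-spec-output.cjs` as blocking.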

package/src/claude/skills/specs/references/review.md CHANGED

@@ -42,8 +42,8 @@ These rules override any self-reasoning or optimization the system may attempt:
 4. **Apply YAGNI to fixes.** When user says "configure later" or "decide later", add a single note to the task file. Do NOT generate multiple concrete implementations (e.g., 4 provider files when user only asked for abstraction).
 5. **No false completion.** You MUST NOT set `validation.status = "completed"` or `ready_for_implementation = true` until a reconciliation audit proves the accepted findings and validation decisions are reflected in the physical spec artifacts.
 6. **Provider drift is a real defect.** If the scope changed away from Claude/Anthropic, stale strings like `Claude API`, `Haiku`, or `haiku_reachable` in `requirements.md`, `design.md`, or `tasks/*.md` are validation failures. `research.md` may mention them only as historical comparison.
-7. **Implementation-facing propagation is mandatory.** A decision that affects implementation is NOT considered applied if it only appears in `Risk Assessment`, `validate-log.md`, or `red-team-report.md`. It must update at least one of: `requirements.md`, `Canonical Contracts & Invariants`, `Objective`, `Constraints`, `Implementation Steps`, `Completion Criteria`, or `Task Test Plan & Verification Evidence`.
-8. **CafeKit command dialect only.** Validation output MUST use `/hapo:develop <feature>` as the implementation handoff.
+7. **Implementation-facing propagation is mandatory.** A decision that affects implementation is NOT considered applied if it only appears in `Risk Assessment`, `validate-log.md`, or `red-team-report.md`. It must update at least one of: `requirements.md`, `Canonical Contracts & Invariants`, `Context`, `Steps`, `Requirements`, `Completion Criteria`, or `Evidence`.
+8. **CafeKit command dialect only.** Validation output MUST use `/hapo:develop <feature>` as the implementation handoff. Never mention `/sdd:execute-spec`, `/sdd:*`, `/work`, `/code`, `/specs <feature> --approve`, `/hapo:specs <feature> --approve`, or non-CafeKit aliases.
 9. **CafeKit task filename convention only.** Task files MUST use `tasks/task-R{N}-{SEQ}-<slug>.md` with two-digit `SEQ` (for example `tasks/task-R0-01-project-scaffolding.md`). Files like `tasks/R0-1-project-scaffolding.md` are legacy/foreign format; rename them and update `spec.json.task_files`, `spec.json.task_registry`, and dependency references before passing validation.
 ---
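Rule 9's rename of legacy task files can be sketched as a pure path rewrite applied to `task_files`, the `task_registry` keys, and registry dependency lists. Assumptions: legacy names always follow `R<N>-<SEQ>-<slug>.md`, registry entries store dependencies as a `dependencies` array of relative paths, and both function names are hypothetical. `Dependencies:` lines inside the task file bodies are not rewritten by this sketch.

```javascript
// Hypothetical helper for rule 9: rewrite legacy task paths to the CafeKit convention.
// Assumes legacy names look like "tasks/R<N>-<SEQ>-<slug>.md".
const LEGACY_TASK = /^tasks\/R(\d+)-(\d+)-([^/]+\.md)$/;

function canonicalTaskPath(path) {
  const match = LEGACY_TASK.exec(path);
  if (!match) return path; // already canonical, or not a legacy task file
  const [, n, seq, slugFile] = match;
  return `tasks/task-R${n}-${seq.padStart(2, "0")}-${slugFile}`;
}

// Returns an updated copy of spec.json plus the [from, to] moves to apply on disk.
function renameLegacyTasks(spec) {
  const moves = spec.task_files
    .map((from) => [from, canonicalTaskPath(from)])
    .filter(([from, to]) => from !== to);

  const taskRegistry = {};
  for (const [path, entry] of Object.entries(spec.task_registry ?? {})) {
    const updated = { ...entry }; // keep status and other machine-state fields
    if (Array.isArray(entry.dependencies)) {
      updated.dependencies = entry.dependencies.map(canonicalTaskPath);
    }
    taskRegistry[canonicalTaskPath(path)] = updated;
  }

  return {
    spec: {
      ...spec,
      task_files: spec.task_files.map(canonicalTaskPath).sort(),
      task_registry: taskRegistry,
    },
    moves,
  };
}
```

Returning the moves instead of renaming files directly keeps the `spec.json` update and the filesystem change reviewable as one step.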

package/src/claude/skills/specs/rules/tasks-generation.md CHANGED

@@ -28,6 +28,7 @@ Detail bullets must include:
 **Every task must**:
 - Build on previous outputs (no orphaned code)
 - Connect to the overall system (no hanging features)
+- Stay inside the approved `scope_lock` and requirement IDs; do not add unapproved features or silently drop scoped behavior
 - Progress incrementally (no big jumps in complexity)
 - Validate core functionality early in sequence
 - Respect architecture boundaries defined in design.md (Architecture Pattern & Boundary Map)
@@ -37,6 +38,10 @@ Detail bullets must include:
 - Use major task summaries sparingly—omit detail bullets if the work is fully captured by child tasks.
 **End with integration tasks** to wire everything together.
+- For UI/app/runtime workflows, the final integration task MUST name the real entrypoint (`App.tsx`, route, command, worker, extension manifest, API route, etc.) and verify every user-visible surface from the requirements is reachable from that entrypoint.
+- Components, services, routes, commands, workers, providers, and data loaders created by earlier tasks MUST be consumed by a later integration task or explicitly marked as internal support in `design.md`; orphaned deliverables are invalid.
+- Prefer compact, implementation-ready task prose over large boilerplate. The golden shape is: `Context` -> `Steps` -> `Requirements` -> `Related Files` -> `Completion Criteria` -> `Evidence` -> `Risk Assessment`.
+- A compact task is valid when it names exact files/contracts, maps requirements, and gives executable evidence. Do not expand it into nested filler just to satisfy a template.
 ### 3. Flexible Task Sizing
@@ -80,6 +85,7 @@ Grouping tasks vertically by requirement carries the risk of "siloed" or fragmen
 1. **Foundation First (The R0 Concept)**: Extract shared infrastructure, core database migrations, authentication wrappers, and base UI layouts into foundational tasks running before feature work. If these aren't explicitly in requirements, classify them as `task-R0-XX-foundation.md` or map them to the most logical architectural requirement. All parallel feature tasks MUST depend on these foundation tasks.
 2. **Shared Interfaces (Horizontal Contracts)**: Sub-tasks that touch shared cross-requirement architecture (like registering a new page in a global `router.ts` or adding a column to a shared table) MUST explicitly reference the shared contract defined in `design.md`.
 3. **Integration Enforcers**: If R1 and R2 interact (e.g., R2 UI displays data fetched by R1 backend), the later task MUST have a sub-task explicitly dedicated to "Wiring/Integrating with [Previous Feature] output".
+4. **Final Runtime Integration**: For any feature that has a user-facing screen, public route, CLI command, background worker, browser extension surface, or API flow, create a final integration task (or a final integration section in the last dependent task) that proves the whole scoped feature works from its runtime entrypoint. This task MUST fail if prior-task outputs exist but are not imported, mounted, registered, or invoked.
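The orphaned-deliverable rule can be approximated by checking that every artifact a task creates is consumed by some later task. This sketch assumes a hypothetical data model in which tasks arrive in dependency order and declare `creates` and `consumes` lists; real detection would have to inspect imports, mounts, and registrations in the code itself. The `entrypoints` option stands for the runtime entrypoint named by the final integration task, and `internalSupport` for items `design.md` marks as internal support.

```javascript
// Hypothetical orphan check. Each task declares what it creates and what it consumes
// (imports, mounts, registers, or invokes); tasks are listed in dependency order.
function findOrphanedDeliverables(tasks, { entrypoints = [], internalSupport = [] } = {}) {
  const orphans = [];
  tasks.forEach((task, index) => {
    const laterTasks = tasks.slice(index + 1);
    for (const artifact of task.creates) {
      const consumedLater = laterTasks.some((t) => t.consumes.includes(artifact));
      // The runtime root and explicitly internal support items are exempt.
      const exempt = entrypoints.includes(artifact) || internalSupport.includes(artifact);
      if (!consumedLater && !exempt) orphans.push({ task: task.id, artifact });
    }
  });
  return orphans;
}
```

In this model, a non-empty result is the situation in which the final integration task is expected to fail.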
 ### 3d. Spike Tasks for Complex/Uncertain Areas (MANDATORY)
@@ -136,31 +142,51 @@ Every task file MUST contain the Risk Assessment table, even if no risks are ide
 - Never mark implementation work or integration-critical verification as optional—reserve `*` for auxiliary/deferrable test coverage that can be revisited post-MVP.
 - Never mark auth, permissions, privacy, data deletion, migration, schema, or contract verification work as optional.
-### Mandatory Task Test Plan & Verification Evidence
+### Mandatory Evidence Section
-Every new task file MUST include a `## Task Test Plan & Verification Evidence` section. Existing specs may still use the legacy `## Verification & Evidence` heading; readers and sync tools must support both.
+Every new task file MUST include a `## Evidence` section. Existing specs may still use the v0.8 heading `## Task Test Plan & Verification Evidence` or the legacy `## Verification & Evidence` heading; readers and sync tools must support all three.
-That section is the task-level test plan and MUST contain:
+That section is the task-level test plan and proof checklist. It MUST contain:
 1. **Automated proof** — exact command(s) for typecheck, tests, build, or explicit `N/A`
 2. **Artifact/runtime proof** — exact files, routes, UI surfaces, generated outputs, or persisted state to inspect
 3. **Contract/negative-path proof** — at least one contract-preserving check for unauthorized, invalid, missing-permission, rollback, or failure-path behavior when relevant
+4. **Reachability proof** — when the task creates a runtime-facing artifact, name the upstream entrypoint or caller that reaches it; if reachability is deferred, name the exact later integration task responsible
 Rules:
 - If the task produces a build artifact or generated file, name the exact artifact path to inspect.
 - If the task wires entrypoints (popup, content script, route, worker, CLI command), name the exact runtime surface that must exist after implementation.
+- If the task creates a UI component, service, hook, reducer, route handler, worker, command, or data loader, the evidence MUST prove it is either reachable from the declared runtime surface or intentionally internal support for a named later task.
 - If verification depends on environment or manual setup, document the blocker explicitly instead of implying success.
 - Build success alone is NEVER enough evidence for a completed task.
 - For provider-sensitive work, use provider-neutral wording unless the scope lock explicitly names a vendor.
 - For delete-data/privacy work, task text MUST match the single deletion/retention policy chosen in `design.md`. Mixed policies are invalid.
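The following is a minimal sketch of checking an Evidence section. It assumes each proof is a bullet containing its label (the labels come from the numbered list above) and that the caller already knows whether the task is runtime-facing. It accepts all three heading variants. It is stricter than the "when relevant" wording for negative-path proof: every label must appear, with an explicit `N/A` where a proof does not apply. The function and constant names are hypothetical.

```javascript
// Hypothetical Evidence-section check. Accepts the current heading and both older ones.
const EVIDENCE_HEADINGS = [
  "Evidence",
  "Task Test Plan & Verification Evidence",
  "Verification & Evidence",
];
const BASE_PROOFS = ["Automated proof", "Artifact/runtime proof", "Contract/negative-path proof"];

// Assumes each proof is a bullet that contains its label, e.g. "- **Automated proof**: npm test".
function checkEvidence(markdown, { runtimeFacing = false } = {}) {
  const lines = markdown.split("\n");
  const start = lines.findIndex((line) =>
    EVIDENCE_HEADINGS.some((heading) => line.trim() === `## ${heading}`));
  if (start === -1) return ["missing Evidence section"];

  // The section ends at the next level-2 heading (or at the end of the file).
  const next = lines.findIndex((line, i) => i > start && /^##\s/.test(line));
  const body = lines.slice(start + 1, next === -1 ? lines.length : next).join("\n");

  // Reachability proof is only required for runtime-facing artifacts.
  const required = runtimeFacing ? [...BASE_PROOFS, "Reachability proof"] : BASE_PROOFS;
  return required.filter((label) => !body.includes(label)).map((label) => `missing ${label}`);
}
```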
+### Test Type Selection
+Choose verification by task risk and touched surface. Do not force every task to include every test type, but do not omit the test type that proves the task's actual behavior.
+| Task kind | Required / expected proof |
+|---|---|
+| Pure logic, data transform, parser, sorting, filtering, validator | Unit test plus negative-path case |
+| Stateful UI component or user interaction | Component test or integration test; add runtime UI check if the component must be mounted |
+| Cross-module state, API, persistence, provider, or service boundary | Integration test that proves real contract/state handoff |
+| User-facing workflow across screens/components | E2E or UI flow verification after the vertical slice exists |
+| Layout, theme, responsive, visual style | Runtime/visual viewport checks; screenshot proof when practical |
+| Keyboard/focus/form/modal/table interaction | Accessibility check for focus, labels, roles, and keyboard behavior |
+| Scaffolding/config/release plumbing | Smoke checks: typecheck/build/test/dev-server or equivalent |
+| Bug fix/regression | Regression test reproducing the old failure, then passing |
+| Performance/security-sensitive requirement or touched surface | Performance/security check only when specified by requirements, design risk, or changed boundary |
+`hapo:specs` writes the expected proof into each task. `hapo:develop` executes the task-local proof before marking the task done. `hapo:test` runs the broader system pass after implementation or for a requested feature scope.
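One way to read the Test Type Selection table is as a lookup keyed by task kind, where a task that touches several kinds owes the union of their proofs. The kind keys and proof strings below are hypothetical shorthand for the table rows. The conditional parts of the table (the extra runtime check for mounted components, and perf/security checks only when specified) are left to the caller.

```javascript
// Hypothetical encoding of the Test Type Selection table. Keys are shorthand for rows.
const EXPECTED_PROOF = new Map([
  ["pure-logic", ["unit test", "negative-path case"]],
  ["stateful-ui", ["component or integration test"]],
  ["cross-module", ["integration test of the real contract/state handoff"]],
  ["user-workflow", ["e2e or UI flow verification"]],
  ["visual", ["runtime/visual viewport check"]],
  ["interaction", ["accessibility check"]],
  ["scaffolding", ["smoke check"]],
  ["bug-fix", ["regression test"]],
  ["perf-security", ["performance/security check"]],
]);

// A task touching several kinds owes the union of their proofs (duplicates removed).
function expectedProof(kinds) {
  const proofs = new Set();
  for (const kind of kinds) {
    const rowProofs = EXPECTED_PROOF.get(kind);
    if (!rowProofs) throw new Error(`unknown task kind: ${kind}`);
    rowProofs.forEach((proof) => proofs.add(proof));
  }
  return [...proofs];
}
```

Throwing on an unknown kind keeps a misclassified task from silently getting no expected proof.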
 ## Task Hierarchy Rules
 ### Maximum 2 Levels
-- **Level 1**: Major tasks (1, 2, 3, 4...)
-- **Level 2**: Sub-tasks (1.1, 1.2, 2.1, 2.2...)
-- **No deeper nesting** (no 1.1.1)
-- If a major task would contain only a single actionable item, collapse the structure and promote the sub-task to the major level (e.g., replace `1.1` with `1.`).
-- When a major task exists purely as a container, keep the checkbox description concise and avoid duplicating detailed bullets—reserve specifics for its sub-tasks.
+- Prefer one actionable checkbox per real implementation step.
+- Use sub-tasks (`1.1`, `1.2`) only when a step has multiple separately verifiable units.
+- **No deeper nesting** (no `1.1.1`).
+- If a major task would contain only a single actionable item, collapse the structure and promote the sub-task to the major level.
+- When a major task exists purely as a container, keep the checkbox description concise and avoid duplicating detailed bullets.
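The two-level limit can be checked by parsing checkbox numbering. This sketch assumes steps are written as `- [ ] 1.` or `- [ ] 1.2` checkboxes (also `- [x]`), as in the removed template text; the regex and function name are hypothetical.

```javascript
// Hypothetical nesting check: report step numbers deeper than two levels (e.g. "1.1.1").
// Matches lines like "- [ ] 1. Scaffold" or "  - [x] 1.2 Add config".
const STEP = /^\s*- \[[ xX]\] (\d+(?:\.\d+)*)\.?\s/;

function findNestingViolations(markdown) {
  return markdown
    .split("\n")
    .map((line) => STEP.exec(line))
    .filter((match) => match && match[1].split(".").length > 2)
    .map((match) => match[1]);
}
```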
 ### Sequential Numbering
 - Major tasks MUST increment: 1, 2, 3, 4, 5...
@@ -210,6 +236,6 @@ Rules:
 - If gaps found: Return to requirements or design phase
 - No requirement should be left without corresponding tasks
-Use `N.M`-style numeric requirement IDs where `N` is the top-level requirement number from requirements.md (for example, Requirement 1 → 1.1, 1.2; Requirement 2 → 2.1, 2.2), and `M` is a local index within that requirement group.
+Use the requirement ID style already present in `requirements.md` (`R1`, `REQ-01`, or `N.M`). The task filename cluster (`task-R1-01-*`) does not have to mirror every requirement ID exactly, but every requirement MUST be listed in at least one task's `## Requirements` section.
 Document any intentionally deferred requirements with rationale.
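Following the existing ID style requires knowing which style `requirements.md` uses. The sketch below tests for the three styles named above. Reporting a mix of styles as a warning is an assumption, the function name is hypothetical, and the `N.M` pattern will also match unrelated decimals such as version numbers.

```javascript
// Hypothetical detector for the requirement ID style already used in requirements.md.
const ID_STYLES = [
  ["REQ-NN", /\bREQ-\d+\b/],
  ["RN", /\bR\d+\b/],
  ["N.M", /\b\d+\.\d+\b/], // also matches decimals such as version numbers
];

function detectRequirementIdStyle(requirementsMd) {
  const found = ID_STYLES
    .filter(([, pattern]) => pattern.test(requirementsMd))
    .map(([name]) => name);
  if (found.length === 0) return { style: null, warning: "no requirement IDs found" };
  if (found.length > 1) return { style: found[0], warning: `mixed ID styles: ${found.join(", ")}` };
  return { style: found[0], warning: null };
}
```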