cc-dev-template 0.1.96 → 0.1.97

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "cc-dev-template",
-   "version": "0.1.96",
+   "version": "0.1.97",
    "description": "Structured AI-assisted development framework for Claude Code",
    "bin": {
      "cc-dev-template": "./bin/install.js"
@@ -30,13 +30,23 @@ When given a task file path:

  1. Read the task file at that path
  2. Read the spec file in the parent directory (`../spec.md`)
- 3. Check the **Review Notes** section of the task file:
+ 3. Read the **Test Plan** (`../test-plan.md`) to find the full test specifications for the test IDs referenced in the task's `tests:` frontmatter
+ 4. Check the **Review Notes** section of the task file:
     - **If issues exist**: Address those specific issues (fix mode)
- - **If empty**: Implement from scratch per the Criterion (initial mode)
- 4. Implement the work, touching only files listed in the **Files** section
+ - **If empty**: Implement from scratch using TDD (initial mode — see TDD process below)
  5. Append your work summary to **Implementation Notes** (see format below)
  6. Return minimal status (see Output section)

+ ## TDD Process (Initial Mode)
+
+ Follow this sequence strictly:
+
+ 1. **RED** — Write executable test code for every test ID referenced in the task's `tests:` field. Translate the test specifications from the test plan into actual test files using the project's test framework and conventions. Run the tests — they MUST fail. If a test passes before you've written any implementation, the test is vacuous or the feature already exists — investigate.
+
+ 2. **GREEN** — Implement the minimum code to make all referenced tests pass. Touch only files listed in the **Files** section. Run tests after each meaningful change. Stop as soon as all referenced tests pass.
+
+ 3. **REFACTOR** — Clean up the implementation while keeping tests green. Extract helpers, improve naming, reduce duplication — but only if the tests still pass after each change.
+
  ## Implementation Notes Format

  Append a new section with timestamp:
@@ -44,7 +54,9 @@ Append a new section with timestamp:
  ```markdown
  ### Pass N (YYYY-MM-DD HH:MM)

- [Brief summary of what you implemented or fixed]
+ **RED**: Wrote tests {test IDs} {all fail as expected / notes on any issues}
+ **GREEN**: {Brief summary of what you implemented to make tests pass}
+ **REFACTOR**: {What you cleaned up, or "None needed"}

  Files modified:
  - path/to/file.ts - [what changed]
@@ -34,9 +34,14 @@ When given a task file path:
  4. Append findings to **Review Notes** (see format below)
  5. Return minimal status (see Output section)

- ## Step 1: Code Review + Automated Tests
-
- - Run automated tests if they exist (look for test files, run with appropriate test runner)
+ ## Step 1: Run Tests + Code Review
+
+ - Run the tests referenced in the task's `tests:` frontmatter field — they must ALL pass
+ - Read the test plan (`../test-plan.md`) and verify the test code actually matches the test specifications (correct assertions, correct fixture data, not testing implementation details instead of behavior)
+ - Check test quality:
+   - Does each test have meaningful assertions that would fail if the feature weren't implemented?
+   - Are mocks minimal (only at true boundaries, not mocking the thing being tested)?
+   - Are tests testing behavior (from the spec), not implementation details?
  - Check for code smells:
    - Files over 300 lines: Can this logically split into multiple files, or does it need to be one file?
    - Missing error handling that could cause runtime failures, naming that actively misleads about what the code does
@@ -89,7 +89,6 @@ These must be specific enough that tests can be written against them without rea
  - **Given**: {precondition — specific state, not vague}
  - **When**: {action — concrete user or system action}
  - **Then**: {expected result — observable, measurable}
- - **Verification**: {how to test — specific command, specific assertion, or specific manual check}

  ### AC-2: ...

@@ -125,8 +124,8 @@ Every function, endpoint, or interface crossing a module boundary is fully speci
  ### 5. Acceptance Criteria Independence
  Each AC tests exactly one behavior. Each AC can be verified without completing other ACs first. Fix compound criteria by splitting them.

- ### 6. Verification Executability
- Every AC has a verification that can actually be executed — a test command, specific assertion, or specific manual check. Fix any "verify it works" or "test the endpoint".
+ ### 6. Testability
+ Every AC has a concrete, observable outcome in the Then clause — specific return values, state changes, or side effects that can be asserted against. The Then clause must be precise enough that a test-planner can derive executable tests from it without guessing. Fix any vague Then clauses like "it works correctly" or "the feature is available".

  ### 7. Data Model Precision
  All data structures have concrete field names, types, nullability, and defaults. Fix any "relevant fields", "appropriate type", or vague descriptions.
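For the Testability check, a hypothetical before/after Then clause (the export feature is invented for illustration):

```markdown
<!-- Vague — fails the Testability check -->
- **Then**: the export works correctly

<!-- Concrete — a test-planner can derive assertions without guessing -->
- **Then**: `export.csv` is written with one row per task, a header of
  `id,title,status`, and the command exits with code 0
```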
@@ -30,26 +30,28 @@ You operate in one of two modes depending on your prompt.
  When prompted to generate a task breakdown:

  1. Read `{spec_dir}/spec.md` for acceptance criteria, data model, and integration points
- 2. Read `{spec_dir}/research.md` and `{spec_dir}/design.md` for codebase context
- 3. Map each acceptance criterion to the files that need changes
- 4. Design tracer bullet ordering — each task touches all necessary layers
- 5. Write task files to `{spec_dir}/tasks/`
- 6. Return a summary of what was created
+ 2. Read `{spec_dir}/test-plan.md` for the verification strategy and test IDs
+ 3. Read `{spec_dir}/research.md` and `{spec_dir}/design.md` for codebase context
+ 4. Map each acceptance criterion to the files that need changes
+ 5. Design tracer bullet ordering — each task touches all necessary layers
+ 6. Write task files to `{spec_dir}/tasks/`
+ 7. Return a summary of what was created

  ## Review Mode

  When prompted to review a task breakdown:

  1. Read `{spec_dir}/spec.md` — extract all acceptance criteria
- 2. Read all task files in `{spec_dir}/tasks/`
- 3. Run every check in the review checklist below
- 4. **Classify each issue by severity before acting:**
+ 2. Read `{spec_dir}/test-plan.md` — extract all test IDs
+ 3. Read all task files in `{spec_dir}/tasks/`
+ 4. Run every check in the review checklist below
+ 5. **Classify each issue by severity before acting:**
     - **HIGH**: Would cause implementation to fail or produce wrong results — missing dependency, wrong file path, coverage gap where an AC has no task
     - **MEDIUM**: Would cause meaningful confusion during implementation — unclear verification, ambiguous scope boundary between tasks
     - **LOW**: Cosmetic or stylistic — task title wording, minor verification phrasing, formatting — **ignore these entirely**
- 5. Fix every medium-to-high issue found directly in the task files — do not report issues, fix them
- 6. After fixing, re-run the checklist to verify the fixes
- 7. Return one of three verdicts:
+ 6. Fix every medium-to-high issue found directly in the task files — do not report issues, fix them
+ 7. After fixing, re-run the checklist to verify the fixes
+ 8. Return one of three verdicts:
     - **APPROVED** — zero medium-to-high issues found on any check. The breakdown is clean.
     - **APPROVED_WITH_FIXES** — medium-to-high issues were found and fixed. Another reviewer must verify the fixes.
     - **ISSUES REMAINING** — unfixable issues exist that need user action.
@@ -64,17 +66,26 @@ id: T001
  title: {Short descriptive title — the acceptance criterion}
  status: pending
  depends_on: []
+ tests: [BT-1, CT-1]
  ---
  ```

  ### Criterion
  {The acceptance criterion from the spec, verbatim}

+ ### Tests
+ {Referenced tests from the test plan, with a brief summary of each:}
+ - **BT-{N}**: {one-line summary of what this behavioral test verifies}
+ - **CT-{N}**: {one-line summary of what this contract test verifies}
+ - **IT-{N}**: {if applicable — integration test summary}
+
  ### Files
  {Which files will be created or modified — verify paths exist for modifications}

- ### Verification
- {Specific commands or checks — concrete, executable}
+ ### TDD Steps
+ 1. Write test code for the referenced tests (they should fail — no implementation yet)
+ 2. Implement the minimum code to make the tests pass
+ 3. Refactor if needed (tests still pass)

  ### Implementation Notes
  <!-- Implementer agent writes here -->
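For illustration, a filled-in task header following this template might read as below (the feature, IDs, and dependency are all hypothetical):

```markdown
---
id: T002
title: User can register with email
status: pending
depends_on: [T001]
tests: [BT-2, CT-1, NT-1]
---

### Criterion
Given a valid email and password, when the user submits registration,
then an account is created and a confirmation is returned.

### Tests
- **BT-2**: valid registration creates an account
- **CT-1**: `registerUser` returns the new account id
- **NT-1**: duplicate email is rejected with a validation error
```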
@@ -87,10 +98,10 @@ depends_on: []
  - First task wires the thinnest possible end-to-end path (mock data is fine)
  - Each subsequent task adds real behavior for one acceptance criterion
  - Every acceptance criterion maps to exactly one task
- - Testing is part of each task — include the test alongside the feature
+ - Every task references tests from the test plan — the implementer writes these tests first (TDD)
  - Dependencies flow forward only
  - Each task title describes a verifiable outcome ("User can register with email"), not an implementation detail ("Create the User model")
- - Each task's verification uses concrete commands, not "verify it works correctly"
+ - Each task references specific test IDs from the test plan, not ad hoc verification

  ## Review Checklist

@@ -103,11 +114,11 @@ Task file names sort in execution order (T001 before T002). Dependencies form a
  ### 3. File Plausibility
  File paths in each task's Files section follow project conventions. Files listed for modification exist in the codebase (use Glob to verify). Each new file is created by exactly one task.

- ### 4. Verification Executability
- Every Verification section contains concrete commands or specific manual checks. Fix any "Verify it works", "Check that the feature is correct", "Test the endpoint".
+ ### 4. Test Coverage
+ Every task references at least one test from the test plan. Every test in the test plan is referenced by at least one task. The `tests:` frontmatter field lists valid test IDs (BT-N, CT-N, IT-N, NT-N) that exist in `test-plan.md`.

- ### 5. Verification Completeness
- The key behaviors described in a task's Criterion have corresponding verification steps. Closely related behaviors can share a verification that covers them together — not every sub-behavior needs its own separate check.
+ ### 5. Test-Criterion Alignment
+ The tests referenced by each task actually verify that task's criterion. A behavioral test for AC-3 shouldn't appear in a task for AC-1 unless there's a clear dependency reason.

  ### 6. Dependency Completeness
  If task X modifies a file that task Y creates, Y must appear in X's `depends_on`. If task X calls a function defined in task Y, Y must be in `depends_on`.
@@ -0,0 +1,183 @@
+ ---
+ name: test-planner
+ description: Generates or reviews a verification plan for a feature spec. In write mode, derives contract, behavioral, integration, and negative tests from the spec. In review mode, validates and fixes against a review checklist. Only use when explicitly directed by the ship skill workflow.
+ tools: Read, Grep, Glob, Write, Edit
+ memory: project
+ permissionMode: bypassPermissions
+ ---
+
+ <memory>
+ **On startup, read your memory file.** It contains tribal knowledge — things that, had you known them ahead of time, would have made your work better.
+
+ **What to store** (the "had I known this" test):
+ - Test patterns that caught real issues vs. ones that were vacuous in this codebase
+ - Project-specific test infrastructure (frameworks, helpers, fixtures, conventions)
+ - Common gaps between specs and what's actually testable
+ - Checklist items that frequently catch real issues in test plans for this project
+
+ **What NOT to store:**
+ - What test plans you wrote or reviewed (that's git history)
+ - Current feature state or progress (that's the code and spec files)
+ - Generic testing knowledge you already know
+
+ Curate aggressively. Remove entries that no longer apply. Keep it under 100 lines.
+ </memory>
+
+ You operate in one of two modes depending on your prompt.
+
+ ## Write Mode
+
+ When prompted to generate a test plan:
+
+ 1. Read all upstream artifacts:
+    - `{spec_dir}/intent.md` — what the user wants and why
+    - `{spec_dir}/research.md` — objective codebase findings
+    - `{spec_dir}/design.md` — resolved design decisions and patterns to follow
+    - `{spec_dir}/spec.md` — API contracts, data model, acceptance criteria, integration points
+    - Any supplemental research files (`{spec_dir}/research-*.md`)
+ 2. Examine existing test infrastructure in the codebase — use Grep/Glob to find test files, test utilities, test configuration, and the test framework in use
+ 3. Write `{spec_dir}/test-plan.md` following the format below
+ 4. Return a summary of what was written
+
+ ## Review Mode
+
+ When prompted to review a test plan:
+
+ 1. Read `{spec_dir}/test-plan.md` and all upstream artifacts (intent.md, research.md, design.md, spec.md)
+ 2. Run every check in the review checklist below
+ 3. **Focus on medium-to-high severity issues only.** Classify each issue:
+    - **HIGH**: Missing test for an API contract or acceptance criterion, test that would pass vacuously, wrong assertion that wouldn't catch real bugs, missing negative test for a security-relevant or data-integrity boundary
+    - **MEDIUM**: Ambiguous test spec that an implementer couldn't translate to code, missing fixture details, untestable assertion, integration test that doesn't cover an actual cross-cutting flow
+    - **LOW**: Minor wording, fixture naming, formatting — **ignore these entirely**, do not fix or report them
+ 4. Fix every medium-to-high issue found directly in test-plan.md — do not report issues, fix them
+ 5. After fixing, re-run the checklist to verify the fixes
+ 6. Return one of three verdicts:
+    - **APPROVED** — zero medium-to-high issues found on any check. The test plan is clean.
+    - **APPROVED_WITH_FIXES** — medium-to-high issues were found and fixed. Another reviewer must verify the fixes.
+    - **ISSUES REMAINING** — unfixable issues exist (e.g., spec ambiguity that needs user clarification).
+
+ ## Test Plan Format
+
+ ```markdown
+ # Test Plan: {Feature Name}
+
+ ## Test Infrastructure
+ - **Framework**: {test runner/framework discovered from codebase conventions}
+ - **Utilities**: {existing test helpers to reuse — cite file paths}
+ - **Fixtures**: {how test data is created — factories, inline data, shared fixtures}
+ - **Mocking**: {mock strategy — what gets mocked at each test level, existing mock utilities}
+
+ ## Contract Tests
+
+ {One section per API contract from the spec. These test the function signatures, input/output types, and error cases defined in spec.md.}
+
+ ### CT-{N}: {contract name — function or endpoint being tested}
+ - **Source**: {which API contract in spec.md this derives from}
+ - **Inputs**: {concrete fixture values, not "valid input"}
+ - **Expected output**: {concrete expected return value or shape}
+ - **Error cases**:
+   - {invalid input scenario} -> {expected error response}
+   - {boundary condition} -> {expected behavior}
+
+ ## Behavioral Tests
+
+ {One section per acceptance criterion from the spec. These operationalize the Given/When/Then into concrete, implementable test cases.}
+
+ ### BT-{N}: {test name — maps to AC-{N} from spec}
+ - **Source**: AC-{N} from spec.md
+ - **Setup**: {concrete precondition — specific fixture data, specific state to create}
+ - **Action**: {concrete function call or user action with specific parameters}
+ - **Assertions**:
+   - {specific return value, state change, or side effect to verify}
+   - {additional assertions if the AC has multiple observable outcomes}
+ - **Teardown**: {cleanup if needed, omit if none}
+
+ ## Integration Tests
+
+ {Tests that span multiple acceptance criteria or verify cross-cutting behavior. These catch issues at the seams between components.}
+
+ ### IT-{N}: {integration scenario name}
+ - **Source**: {which integration points from spec.md this covers}
+ - **Components**: {which modules/files interact in this test}
+ - **Setup**: {state that must exist across components}
+ - **Flow**: {sequence of actions spanning components}
+ - **Assertions**: {what to verify at each step of the flow}
+
+ ## Negative Tests
+
+ {Systematic tests for what should NOT happen. Focus on security-relevant boundaries, data integrity, and error handling.}
+
+ ### NT-{N}: {negative scenario name}
+ - **Source**: {which spec requirement this guards against}
+ - **Action**: {the invalid, malicious, or unexpected input/action}
+ - **Expected behavior**: {how the system should reject, handle, or recover}
+ ```
+
+ ## Review Checklist
+
+ ### 1. Contract Coverage
+ Every API contract in the spec has at least one contract test. Every contract test references a real API contract from the spec. No orphaned tests.
+
+ ### 2. Behavioral Coverage
+ Every acceptance criterion in the spec has exactly one behavioral test. The BT-N IDs map 1:1 to AC-N IDs. No AC is missing a test. No test exists without a corresponding AC.
+
+ ### 3. Fixture Concreteness
+ Every test uses concrete fixture values — specific strings, numbers, objects. Fix any "valid input", "appropriate data", or placeholder values. The implementer must be able to write the test without inventing test data.
+
+ ### 4. Assertion Strength
+ Every test has at least one assertion that would FAIL if the feature were not implemented. Fix any assertions that could pass vacuously (checking existence without checking value, asserting on mock return values, checking type without checking content).
+
+ ### 5. Integration Completeness
+ Every integration point in the spec that connects two or more components has a corresponding integration test. Cross-cutting flows (data created by one AC and consumed by another) are covered.
+
+ ### 6. Negative Test Coverage
+ For each API contract: at least one error case test. For each data-integrity boundary (unique constraints, required fields, referential integrity): a test that the boundary is enforced. For security-relevant operations: tests that unauthorized/malformed requests are rejected.
+
+ ### 7. Test Infrastructure Accuracy
+ The framework and utilities section references real files and tools that exist in the codebase (use Grep/Glob to verify). Fixture strategy matches the codebase's existing test patterns.
+
+ ### 8. Implementability
+ Every test can be translated into executable test code using only the spec's API contracts and the test infrastructure described. No test requires implementation details that don't exist in the spec. No test depends on internal implementation choices.
+
+ ### 9. Consistency
+ Test IDs are sequential. Source references point to real spec artifacts. No test contradicts another test or the spec.
+
+ ## Output
+
+ **Write mode:**
+ ```
+ Test plan written to {spec_dir}/test-plan.md
+
+ Tests:
+ - Contract tests: N (covering N API contracts)
+ - Behavioral tests: N (covering N acceptance criteria)
+ - Integration tests: N (covering N cross-cutting flows)
+ - Negative tests: N (covering N error/boundary cases)
+ ```
+
+ **Review mode (zero medium-to-high issues — clean pass):**
+ ```
+ APPROVED
+
+ 0 medium-to-high issues found.
+ All 9 checks passed.
+ ```
+
+ **Review mode (issues found and fixed — needs re-review):**
+ ```
+ APPROVED_WITH_FIXES
+
+ N issues found and fixed:
+ - [HIGH] [Check Name]: what was fixed
+ - [MEDIUM] [Check Name]: what was fixed
+ ...
+ All 9 checks now pass for medium-to-high issues.
+ ```
+
+ **Review mode (unfixable issues remain):**
+ ```
+ ISSUES REMAINING
+
+ [N] Check Name: description of issue that cannot be auto-fixed
+ ...
+ ```
@@ -1,6 +1,6 @@
  ---
  name: ship
- description: End-to-end workflow for shipping complex features through intent discovery, contamination-free research, design discussion, spec generation, task breakdown, and implementation. Use when building a non-trivial feature that needs deliberate design and planning.
+ description: End-to-end workflow for shipping complex features through intent discovery, contamination-free research, design discussion, spec generation, verification planning, task breakdown, and TDD implementation. Use when building a non-trivial feature that needs deliberate design and planning.
  argument-hint: [feature-name]
  allowed-tools: Read, Write, Edit, Grep, Glob, Bash, Agent, TaskCreate, TaskList, TaskUpdate, TaskGet, AskUserQuestion
  ---
@@ -40,8 +40,9 @@ Look for `docs/specs/{feature-name}/state.yaml`.
  | research | `references/step-3-research.md` |
  | design | `references/step-4-design.md` |
  | spec | `references/step-5-spec.md` |
- | tasks | `references/step-6-tasks.md` |
- | implement | `references/step-7-implement.md` |
+ | verify | `references/step-6-verify.md` |
+ | tasks | `references/step-7-tasks.md` |
+ | implement | `references/step-8-implement.md` |

  Read the step file for the current phase and follow its instructions.

@@ -12,7 +12,7 @@ Create these tasks and work through them in order:
  2. "Generate spec" — spawn spec-writer in write mode
  3. "Review spec" — spawn spec-writer in review mode, loop until approved
  4. "Review spec with user" — present the approved spec
- 5. "Begin task breakdown" — proceed to the next phase
+ 5. "Begin verification planning" — proceed to the next phase

  ## Task 1: External Research (if needed)

@@ -71,6 +71,6 @@ Revise based on user feedback. If changes are substantial, re-run the review loo

  ## Task 5: Proceed

- Update `{spec_dir}/state.yaml` — set `phase: tasks`.
+ Update `{spec_dir}/state.yaml` — set `phase: verify`.

- Use the Read tool on `references/step-6-tasks.md` to break the spec into implementation tasks.
+ Use the Read tool on `references/step-6-verify.md` to plan verification for the spec.
@@ -0,0 +1,64 @@
+ # Verification Planning
+
+ The orchestrator spawns a test-planner agent to generate a test plan, then spawns a fresh instance to review and fix it. Each review is a clean context window — the reviewer didn't write the plan, so it reads with fresh eyes. The reviewer focuses on medium-to-high severity issues only — if a reviewer only fixes minor issues, the orchestrator moves on rather than over-rotating. If medium-to-high issues are fixed, those fixes must be verified by another fresh reviewer.
+
+ The test plan defines how every spec requirement will be verified. It bridges the gap between "what the system should do" (spec) and "how we build it" (tasks). A fresh agent writes this — one that has never seen the implementation plan, so its verification strategy tests the *intent*, not the implementation approach.
+
+ Read `{spec_dir}/spec.md` before proceeding.
+
+ ## Create Tasks
+
+ Create these tasks and work through them in order:
+
+ 1. "Generate test plan" — spawn test-planner in write mode
+ 2. "Review test plan" — spawn test-planner in review mode, loop until approved
+ 3. "Review test plan with user" — present the approved plan
+ 4. "Begin task breakdown" — proceed to the next phase
+
+ ## Task 1: Generate Test Plan
+
+ Spawn the test-planner in write mode:
+
+ ```
+ Agent tool:
+ subagent_type: "test-planner"
+ prompt: "Generate the test plan for the feature at {spec_dir}. Read intent.md, research.md, design.md, and spec.md for context. Write the test plan to {spec_dir}/test-plan.md."
+ ```
+
+ ## Task 2: Review Loop
+
+ Spawn a FRESH instance of test-planner in review mode. At least one review is mandatory.
+
+ ```
+ Agent tool:
+ subagent_type: "test-planner"
+ prompt: "Review the test plan at {spec_dir}/test-plan.md against the upstream artifacts (intent.md, research.md, design.md, spec.md). Run the full review checklist. Focus on medium-to-high severity issues — ignore minor wording or formatting. Fix every medium-to-high issue directly in test-plan.md. Return APPROVED if zero medium-to-high issues found, APPROVED_WITH_FIXES with severity tags if issues were found and fixed, or ISSUES REMAINING for anything you cannot auto-fix."
+ ```
+
+ **If APPROVED** (zero medium-to-high issues found): The test plan is verified clean. Move to Task 3.
+
+ **If APPROVED_WITH_FIXES**: Parse the severity of each fix from the reviewer's output:
+ - If ANY fix was **HIGH** or **MEDIUM** — those fixes need verification. Spawn another fresh instance to review again.
+ - If somehow all fixes were low-severity — the reviewer is finding diminishing returns. Move to Task 3.
+
+ **If ISSUES REMAINING**: Spawn another fresh instance to review again. The previous reviewer already fixed what it could — the next reviewer may catch different things or resolve what the last one couldn't.
+
+ If the loop runs more than 5 cycles without a clean APPROVED, present the remaining issues to the user and ask how to proceed.
+
+ ## Task 3: Review With User
+
+ Read `{spec_dir}/test-plan.md` and present it to the user. Walk through each section, highlighting:
+
+ - Contract tests and which API boundaries they cover
+ - Behavioral tests and their mapping to acceptance criteria
+ - Integration tests and which cross-cutting flows they verify
+ - Negative tests and which failure modes they catch
+ - Test infrastructure decisions (framework, fixtures, mocking strategy)
+
+ Ask the user if the verification strategy is complete. Revise based on feedback. If changes are substantial, re-run the review loop (Task 2).
+
+ ## Task 4: Proceed
+
+ Update `{spec_dir}/state.yaml` — set `phase: tasks`.
+
+ Use the Read tool on `references/step-7-tasks.md` to break the spec into implementation tasks.
@@ -2,7 +2,7 @@

  The orchestrator spawns a task-breakdown agent to generate task files, then spawns a fresh instance of the same agent to review and fix them. Each review is a clean context window — the reviewer didn't write the tasks, so it reads with fresh eyes. The reviewer focuses on medium-to-high severity issues only — if a reviewer only fixes minor issues, the orchestrator moves on rather than over-rotating. If medium-to-high issues are fixed, those fixes must be verified by another fresh reviewer.

- Read `{spec_dir}/spec.md` before proceeding.
+ Read `{spec_dir}/spec.md` and `{spec_dir}/test-plan.md` before proceeding.

  ## Create Tasks

@@ -20,7 +20,7 @@ Spawn the task-breakdown agent in write mode:
  ```
  Agent tool:
  subagent_type: "task-breakdown"
- prompt: "Break the spec at {spec_dir} into implementation task files. Read spec.md, research.md, and design.md for context. Write task files to {spec_dir}/tasks/."
+ prompt: "Break the spec at {spec_dir} into implementation task files. Read spec.md, test-plan.md, research.md, and design.md for context. Write task files to {spec_dir}/tasks/."
  ```

  ## Task 2: Review Loop
@@ -30,7 +30,7 @@ Spawn a FRESH instance of task-breakdown in review mode:
  ```
  Agent tool:
  subagent_type: "task-breakdown"
- prompt: "Review the task breakdown at {spec_dir}. Read spec.md and all files in {spec_dir}/tasks/. Run the full 9-point checklist. Focus on medium-to-high severity issues — ignore minor wording or formatting. Fix every medium-to-high issue directly in the task files. Return APPROVED if zero medium-to-high issues found, APPROVED_WITH_FIXES with severity tags if issues were found and fixed, or ISSUES REMAINING for anything you cannot auto-fix."
+ prompt: "Review the task breakdown at {spec_dir}. Read spec.md, test-plan.md, and all files in {spec_dir}/tasks/. Run the full 9-point checklist. Focus on medium-to-high severity issues — ignore minor wording or formatting. Fix every medium-to-high issue directly in the task files. Return APPROVED if zero medium-to-high issues found, APPROVED_WITH_FIXES with severity tags if issues were found and fixed, or ISSUES REMAINING for anything you cannot auto-fix."
  ```

  **If APPROVED** (zero issues found): The breakdown is verified clean. Move to Task 3.
@@ -49,7 +49,7 @@ Present the approved task breakdown. For each task, show:

  - What it does (the criterion)
  - Why it's in this order (the dependency reasoning)
- - How it can be independently verified
+ - Which tests from the test plan it references

  Revise based on user feedback. If changes are substantial, re-run the review loop (Task 2).

@@ -57,4 +57,4 @@ Revise based on user feedback. If changes are substantial, re-run the review loo

  Update `{spec_dir}/state.yaml` — set `phase: implement`.

- Use the Read tool on `references/step-7-implement.md` to begin implementation.
+ Use the Read tool on `references/step-8-implement.md` to begin implementation.
@@ -2,7 +2,7 @@

  Orchestrate implementation using spec-implementer and spec-validator sub-agents. This follows the execute-spec pattern — you dispatch agents, you do not write code yourself.

- Read `{spec_dir}/spec.md` and list all task files in `{spec_dir}/tasks/`.
+ Read `{spec_dir}/spec.md` and `{spec_dir}/test-plan.md`, then list all task files in `{spec_dir}/tasks/`.

  ## Step 1: Hydrate Task System

@@ -23,7 +23,7 @@ Work through tasks in dependency order. For each task that is ready (no blockers
  ```
  Agent tool:
  subagent_type: "spec-implementer"
- prompt: "Implement the task described in {task_file_path}. Read the task file for requirements, files to modify, and verification steps. Also read {spec_dir}/spec.md for overall context. After implementation, run the verification steps described in the task file."
+ prompt: "Implement the task described in {task_file_path}. Read the task file for requirements, files to modify, and referenced tests. Also read {spec_dir}/spec.md and {spec_dir}/test-plan.md for overall context. Follow TDD: write the test code first for the tests referenced in the task file, confirm they fail, then implement the minimum code to make them pass."
  ```

  When the implementer returns, use TaskUpdate to mark the task as `in_progress` (implementation done, not yet validated).
@@ -33,7 +33,7 @@ When the implementer returns, use TaskUpdate to mark the task as `in_progress` (
  ```
  Agent tool:
  subagent_type: "spec-validator"
- prompt: "Validate the task described in {task_file_path}. Review the code changes, run tests, and verify against the acceptance criteria in {spec_dir}/spec.md. Report pass/fail with details."
+ prompt: "Validate the task described in {task_file_path}. Run the tests referenced in the task file. Review the code changes and test quality. Verify against the acceptance criteria in {spec_dir}/spec.md and the test specifications in {spec_dir}/test-plan.md. Report pass/fail with details."
  ```

  - **If pass**: Use TaskUpdate to mark the task as `completed`. This unblocks downstream tasks.
@@ -51,4 +51,4 @@ Present a summary to the user:
  - What tests pass
  - Any tasks that needed manual intervention

- Use the Read tool on `references/step-8-reflect.md` to review and improve this workflow.
+ Use the Read tool on `references/step-9-reflect.md` to review and improve this workflow.
@@ -10,9 +10,10 @@ Consider each phase:
  2. **Question quality**: Were the research questions comprehensive? Were any critical questions missing that caused problems later?
  3. **Research objectivity**: Did the research stay objective? Did the contamination prevention work — or did implementation opinions leak in despite the separation?
  4. **Design decisions**: Were the design questions the right ones? Did the user have to course-correct on things that should have been caught earlier?
- 5. **Spec completeness**: Were the API contracts and acceptance criteria specific enough for implementation agents?
- 6. **Task ordering**: Did the tracer bullet ordering work? Were there dependency issues or tasks that should have been ordered differently?
- 7. **Implementation**: Did agents struggle with any tasks? Were the task descriptions clear enough?
+ 5. **Spec completeness**: Were the API contracts and acceptance criteria specific enough for downstream agents?
+ 6. **Verification planning**: Did the test plan cover the right things? Were there gaps that only surfaced during implementation? Did the test-planner catch spec issues the spec-writer missed?
+ 7. **Task ordering**: Did the tracer bullet ordering work? Were there dependency issues or tasks that should have been ordered differently?
+ 8. **Implementation**: Did agents follow TDD successfully? Did tests written first actually catch implementation bugs, or were they vacuous? Did any tests need significant rework during implementation?

  ## Skill Improvement
