cclaw-cli 8.2.0 → 8.4.0

package/README.md CHANGED
@@ -1,6 +1,6 @@
  # cclaw
 
- **cclaw is a lightweight harness-first flow toolkit for coding agents.** Three slash commands. Five hops (`Detect → Triage → Dispatch → Pause → Compound/Ship`). Four stages (`plan → build → review → ship`, where **build IS a TDD cycle**: RED → GREEN → REFACTOR). Six on-demand specialists, all running as isolated sub-agents. Three Acceptance-Criteria modes (`inline` / `soft` / `strict`) so trivial edits do not pay the price of risky migrations. A deep content layer of skills, templates, runbooks, patterns, examples, and recovery playbooks wrapped around a runtime under 1 KLOC — so Claude Code, Cursor, OpenCode, or Codex can move from idea to shipped change with a clear plan, the right amount of ceremony, and almost no orchestrator bloat.
+ **cclaw is a lightweight harness-first flow toolkit for coding agents.** Three slash commands. Six hops (`Detect → Triage → Pre-flight → Dispatch → Pause → Compound/Ship`). Four stages (`plan → build → review → ship`, where **build IS a TDD cycle**: RED → GREEN → REFACTOR). Six on-demand specialists, all running as isolated sub-agents and emitting a calibrated `Confidence: high | medium | low` signal. Three Acceptance-Criteria modes (`inline` / `soft` / `strict`) so trivial edits do not pay the price of risky migrations. A five-axis review (`correctness · readability · architecture · security · performance`) with a five-tier severity vocabulary, a strict-mode adversarial pre-mortem before ship, and a source-driven mode that grounds framework code in current docs. A deep content layer of skills, templates, runbooks, patterns, examples, and recovery playbooks wrapped around a runtime under 1 KLOC — so Claude Code, Cursor, OpenCode, or Codex can move from idea to shipped change with a clear plan, the right amount of ceremony, and almost no orchestrator bloat.
 
  ```text
  idea
@@ -15,7 +15,15 @@
 
  ┌────────────────────────────────────────────────────┐
  │ Hop 2: Triage — auto-classify task, │
- │ recommend path + acMode, user accepts or overrides
+ │ recommend path + acMode, runMode (step/auto)
+ └─────────┬──────────────────────────────────────────┘
+
+
+ ┌────────────────────────────────────────────────────┐
+ │ Hop 2.5: Pre-flight — surface 3-7 assumptions │
+ │ (stack, conventions, defaults, out-of-scope); │
+ │ user confirms; persisted to triage.assumptions. │
+ │ skipped on inline + on resume │
  └─────────┬──────────────────────────────────────────┘
 
  trivial │ small-medium │ large-risky
@@ -24,7 +32,7 @@
  ▼ ▼ ▼
  edit + commit plan → build → review → ship brainstorm? → architect? → plan → build → review → ship
  (no plan) each stage in a fresh sub-agent each stage in a fresh sub-agent, parallel-build allowed
- │ │
+ │ │ five-axis review · adversarial pre-mortem
  └─────────┬────────────┘
 
  compound (auto, gated by quality)
@@ -33,7 +41,28 @@
  active flows → shipped/<slug>/
  ```
 
- Three slash commands (`/cc`, `/cc-cancel`, `/cc-idea`). Four stages (`plan → build → review → ship`). Six specialists, all on-demand, all running as sub-agents. Fifteen skills including the always-on `triage-gate`, `flow-resume`, `tdd-cycle`, `conversation-language`, and `anti-slop`. Ten templates including `plan-soft.md` and `build-soft.md` for the soft-mode path. Four runbooks. Eight reference patterns. Three research playbooks. Five recovery playbooks. Eight worked examples. Two mandatory gates in strict mode (AC traceability + TDD phase chain); soft mode keeps both as advisory; inline mode skips both.
+ Three slash commands (`/cc`, `/cc-cancel`, `/cc-idea`). Four stages (`plan → build → review → ship`). Six specialists, all on-demand, all running as sub-agents, all emitting `Confidence: high | medium | low`. Seventeen skills including the always-on `triage-gate`, `flow-resume`, `pre-flight-assumptions`, `tdd-cycle`, `conversation-language`, `anti-slop`, and the strict-mode-default `source-driven`. Ten templates including `plan-soft.md` and `build-soft.md` for the soft-mode path. Four runbooks. Eight reference patterns. Three research playbooks. Five recovery playbooks. Eight worked examples. Two mandatory gates in strict mode (AC traceability + TDD phase chain); soft mode keeps both as advisory; inline mode skips both.
+
+ ## What changed in 8.4
+
+ 8.4 is a non-breaking content + behaviour patch on top of 8.3, picking up seven things three reference skill libraries do that cclaw 8.3 didn't.
+
+ - **Confidence calibration in slim summaries.** Every specialist emits `Confidence: high | medium | low`. The orchestrator's Hop 4 — *Pause* — treats `Confidence: low` as a **hard gate in both `step` and `auto` modes**: it pauses, refuses to chain, and offers `expand <stage>` (re-dispatch with a richer envelope), `show`, `override`, or `cancel`.
+ - **Pre-flight assumptions (Hop 2.5).** A new orchestrator hop runs after triage, before the first specialist dispatch, on every fresh non-inline flow. It surfaces 3-7 numbered assumptions (stack + version, repo conventions, architecture defaults, out-of-scope items) using the harness's structured ask, persists them to `triage.assumptions` (string array), and makes them immutable for the lifetime of the flow. Both `planner` and `architect` read them verbatim before authoring; a decision that would break an assumption surfaces as a feasibility blocker, not a silent override.
+ - **Five-axis review.** The reviewer's `code` mode now mandates five axes — `correctness`, `readability`, `architecture`, `security`, `performance` — every iteration. Findings carry `axis` and a five-tier `severity: critical | required | consider | nit | fyi`. Ship gates: `strict` blocks on any open `critical` or `required`; `soft` blocks only on `critical`. Legacy `block | warn | info` ledgers are migrated forward by the reviewer prompt.
+ - **Source-driven mode.** A new always-on skill `source-driven.md` instructs `architect` and `planner` (and indirectly `slice-builder`) to detect stack + versions, fetch the version-pinned official doc page, implement against documented patterns, and cite URLs in `decisions.md` and code comments. Default in **strict mode for framework-specific work**, opt-in for `soft`. Integrates with the `user-context7` MCP tool when available, falls back to `WebFetch`. When docs are unreachable: write `UNVERIFIED — implementing against training memory` next to the affected line.
+ - **Adversarial pre-mortem before ship (strict only).** Hop 5 — *Ship + Compound* — now dispatches `reviewer` mode=`adversarial` **in parallel** with `reviewer` mode=`release`. The adversarial reviewer picks the most pessimistic plausible reading and writes `flows/<slug>/pre-mortem.md` listing 3-7 likely failure modes (data-loss, race, regression, blast-radius, rollback-impossibility, accidental-scope, hidden-coupling). Uncovered risks become `required`/`critical` findings, escalating the ship gate.
+ - **Cross-flow learning in the planner.** The planner reads `.cclaw/knowledge.jsonl` at every dispatch and surfaces 1-3 relevant prior entries — lessons captured by `compound` from past shipped slugs — in a new `## Prior lessons` section in `plan.md`, citing `learnings/<slug>.md`. Filtering: surface-area overlap, tag overlap, recency.
+ - **Test-impact-aware GREEN.** The `tdd-cycle.md` skill's GREEN phase now distinguishes a fast inner loop (affected-test pattern) from a safe outer loop (full project suite). REFACTOR still always runs the full suite. Mandatory gate `green_two_stage_suite` is added to `commit-helper.mjs --phase=green` guidance.
+
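The ship-gate rule from the five-axis review bullet above (`strict` blocks on any open `critical` or `required` finding, `soft` blocks only on `critical`) can be read as a small filter. The following is a minimal sketch, not cclaw's implementation: the `Finding` shape, the `BLOCKING` table, and the `shipBlockers` function are hypothetical names, and only the axis, severity, and mode vocabularies come from the README text.

```typescript
// Vocabulary taken from the README; everything else here is a hypothetical illustration.
type Axis = "correctness" | "readability" | "architecture" | "security" | "performance";
type Severity = "critical" | "required" | "consider" | "nit" | "fyi";
type GatedMode = "strict" | "soft";

// Hypothetical finding shape: the README only says findings carry `axis` and `severity`.
interface Finding {
  axis: Axis;
  severity: Severity;
  open: boolean; // assumption: a finding is either still open or already resolved
}

// Severities that block ship in each mode, as described in the 8.4 notes.
const BLOCKING: Record<GatedMode, readonly Severity[]> = {
  strict: ["critical", "required"],
  soft: ["critical"],
};

// Returns the open findings that would block ship in the given mode.
function shipBlockers(mode: GatedMode, findings: readonly Finding[]): Finding[] {
  return findings.filter((f) => f.open && BLOCKING[mode].includes(f.severity));
}
```

Under this reading, an empty result means the review gate does not block ship; a resolved `critical` finding no longer counts because only open findings are considered.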
+ ## What changed in 8.3
+
+ 8.3 is a non-breaking content + UX patch on top of 8.2.
+
+ - **Triage as a structured ask, not a code block.** The orchestrator now uses the harness's structured question tool (`AskUserQuestion` / `AskQuestion` / `prompt`) to render the triage. Two questions, in order: pick the path, then pick the run mode. The fenced form remains as a fallback only.
+ - **Run mode: `step` (default) vs `auto`.** `step` pauses after every stage and waits for `continue` (8.2 behaviour). `auto` chains plan → build → review → ship without pausing; stops only on block findings, cap-reached, security findings, or before `ship`. New optional field `triage.runMode` in `flow-state.json`.
+ - **Explicit parallel-build fan-out in Hop 3.** The `/cc` body now carries a full ASCII fan-out diagram for the strict-mode parallel-build path — `git worktree` per slice, max 5 slices, one `slice-builder` sub-agent per slice, integration reviewer, merge sequence. The skill `parallel-build.md` already had this; the orchestrator now sees it at the dispatch site.
+ - **TDD cycle deepening.** Four new sections in `tdd-cycle.md`: vertical slicing / tracer bullets, stop-the-line rule, Prove-It pattern for bug fixes, writing-good-tests rules (state-not-interactions, DAMP over DRY, real-over-mock, test pyramid). Three new antipatterns: A-13 horizontal slicing, A-14 pushing past a failing test, A-15 mocking what should not be mocked.
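The `step` versus `auto` pause behaviour described in the 8.3 notes, combined with the 8.4 rule that `Confidence: low` is a hard gate in both modes, can be sketched as a single predicate. This is an illustrative reading under stated assumptions: `StageResult`, its field names, and `shouldPause` are hypothetical and do not come from the package; only the mode names, stage names, and stop conditions are taken from the README text.

```typescript
type RunMode = "step" | "auto";
type Stage = "plan" | "build" | "review" | "ship";
type Confidence = "high" | "medium" | "low";

// Hypothetical summary of a finished stage; field names are assumptions.
interface StageResult {
  stage: Stage;
  confidence: Confidence;
  hasBlockFinding: boolean;
  capReached: boolean;
  hasSecurityFinding: boolean;
}

const STAGE_ORDER: readonly Stage[] = ["plan", "build", "review", "ship"];

// Decides whether the orchestrator should pause after a stage instead of chaining.
function shouldPause(mode: RunMode, result: StageResult): boolean {
  // 8.4: low confidence pauses in both modes.
  if (result.confidence === "low") return true;
  // 8.3: step pauses after every stage and waits for `continue`.
  if (mode === "step") return true;
  // 8.3: auto stops only on block findings, cap-reached, security findings, or before ship.
  const next = STAGE_ORDER[STAGE_ORDER.indexOf(result.stage) + 1];
  return (
    result.hasBlockFinding ||
    result.capReached ||
    result.hasSecurityFinding ||
    next === "ship"
  );
}
```

Keeping the confidence check first makes the hard gate independent of the run mode, which matches the wording that it applies in both `step` and `auto`.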
 
  ## What changed in 8.2
 
@@ -1,4 +1,4 @@
- export declare const CCLAW_VERSION = "8.2.0";
+ export declare const CCLAW_VERSION = "8.4.0";
  export declare const RUNTIME_ROOT = ".cclaw";
  export declare const STATE_REL_PATH = ".cclaw/state";
  export declare const HOOKS_REL_PATH = ".cclaw/hooks";
package/dist/constants.js CHANGED
@@ -1,4 +1,4 @@
- export const CCLAW_VERSION = "8.2.0";
+ export const CCLAW_VERSION = "8.4.0";
  export const RUNTIME_ROOT = ".cclaw";
  export const STATE_REL_PATH = `${RUNTIME_ROOT}/state`;
  export const HOOKS_REL_PATH = `${RUNTIME_ROOT}/hooks`;
@@ -1 +1 @@
- export declare const ANTIPATTERNS = "# .cclaw/lib/antipatterns.md\n\nPatterns we have seen fail. Each entry is a short symptom, the underlying mistake, and the corrective action. The orchestrator and specialists open this file when a smell is detected; the reviewer cites entries as findings when applicable.\n\n## A-1 \u2014 \"Just one more AC\"\n\n**Symptom.** A plan starts with 4 AC and ends with 11. Most of the additions appeared during build.\n\n**Underlying mistake.** Scope is being expanded mid-flight without going back to plan-stage.\n\n**Correction.** When build encounters new work, surface it as a follow-up in `.cclaw/ideas.md` or a fresh slug. If the new work is genuinely required to satisfy an existing AC, that AC was wrong; cancel the slug and re-plan with a tighter AC set.\n\n## A-2 \u2014 TDD phase integrity broken\n\n**Symptom (any of):**\n\n- Build commits land for AC-N with `--phase=green` but no `--phase=red` recorded earlier.\n- AC has RED + GREEN commits but no `--phase=refactor` (skipped or applied) entry in flow-state.\n- A `--phase=red` commit touches `src/`, `lib/`, or `app/` \u2014 production code slipped into RED.\n- Tests for AC-N appear in a separate commit a few minutes after the AC-N implementation lands.\n\n**Underlying mistake.** The TDD cycle was treated as ceremony, not as the contract. The cycle exists so the failing test encodes the AC; skipping or scrambling phases produces an audit trail that nobody can trust.\n\n**Correction.** `commit-helper.mjs` enforces RED \u2192 GREEN \u2192 REFACTOR per AC. Write a failing test first and commit under `--phase=red` (test files only). Implement the smallest production change that turns it green; commit under `--phase=green`. Either commit a refactor under `--phase=refactor` or skip it explicitly with `--phase=refactor --skipped --message=\"refactor(AC-N) skipped: <reason>\"`. 
The reviewer cites this entry whenever the chain is incomplete.\n\n## A-3 \u2014 Work outside the AC\n\n**Symptom (any of):**\n\n- A small AC commit also restructures an unrelated module.\n- A commit produced by `commit-helper.mjs` contains files that are unrelated to the AC.\n- `git add -A` appears in shell history inside `/cc`.\n\n**Underlying mistake.** Slice-builder absorbed unrelated edits or silently expanded scope. The AC commit no longer maps cleanly to the AC.\n\n**Correction.** Stage AC-related files explicitly: `git add <path>` per file, or `git add -p` to pick hunks. Never `git add -A` inside `/cc`. If a refactor really must happen, capture it as a follow-up; if it really blocks the AC, cancel the slug and re-plan as a refactor + behaviour-change pair.\n\n## A-4 \u2014 AC that mirror sub-tasks\n\n**Symptom.** AC read like \"implement the helper\", \"wire the helper\", \"test the helper\".\n\n**Underlying mistake.** AC are outcomes, not sub-tasks. Outcomes survive refactors; sub-tasks do not.\n\n**Correction.** Rewrite AC as observable outcomes. The helper is an implementation detail, not an AC.\n\n## A-5 \u2014 Over-careful brainstormer\n\n**Symptom.** Brainstormer produces three pages of Context for a small task; planner is then unable to size the work.\n\n**Underlying mistake.** Brainstormer ignored the routing class. Trivial / small-medium tasks should have a one-paragraph Context, not a Frame + Scope + Alternatives sweep.\n\n**Correction.** Brainstormer reads the routing class first and short-circuits when the task is small. Three sentences of Context is enough for AC-1.\n\n## A-6 \u2014 \"I already looked\"\n\n**Symptom.** Reviewer reports a \"clear\" decision without a Five Failure Modes pass.\n\n**Underlying mistake.** The Five Failure Modes pass is the artifact. Skipping it because \"I already looked\" produces no audit trail.\n\n**Correction.** Reviewer always emits the Five Failure Modes block. Each item gets yes / no with citation when yes. 
A \"no\" with no thinking attached is fine; an absent block is not.\n\n## A-7 \u2014 Shipping with a pending AC\n\n**Symptom.** `runCompoundAndShip()` is invoked while flow-state has at least one AC with `status: pending`.\n\n**Underlying mistake.** The agent expected the orchestrator to \"figure it out\" and complete the AC silently.\n\n**Correction.** The AC traceability gate refuses ship. Either complete the AC (slice-builder) or cancel the slug (`/cc-cancel`) and re-plan with the smaller AC set. There is no override.\n\n## A-8 \u2014 Re-creating a shipped slug instead of refining\n\n**Symptom.** A new `/cc` invocation produces a slug whose plan is 80% identical to a slug already in `.cclaw/flows/shipped/`.\n\n**Underlying mistake.** Existing-plan detection was skipped or its output was ignored.\n\n**Correction.** Existing-plan detection is mandatory at the start of every `/cc`. When a shipped match is offered, the user picks **refine shipped** or **new unrelated**, not \"ignore the match\".\n\n## A-9 \u2014 Editing shipped artifacts\n\n**Symptom.** A shipped slug's `plan.md` is edited weeks after ship.\n\n**Underlying mistake.** Shipped artifacts are immutable. Editing them invalidates the knowledge index and breaks refinement chains.\n\n**Correction.** Open a refinement slug. The new slug carries `refines: <old-slug>` and contains the corrections. The old slug stays as it shipped.\n\n## A-10 \u2014 Force-push during ship\n\n**Symptom.** `git push --force` appears in shell history during ship.\n\n**Underlying mistake.** Force-push rewrites the SHAs that flow-state and the AC traceability block reference. The chain breaks silently; nothing in the runtime detects it.\n\n**Correction.** Refuse `git push --force` inside `/cc` unless the user explicitly requested it twice (initial request + confirmation). 
After the force-push, every recorded SHA in the slug must be re-verified by hand and updated.\n\n## A-11 \u2014 Hidden security surface\n\n**Symptom.** A slug ships without `security_flag: true` even though the diff added a new auth-adjacent code path.\n\n**Underlying mistake.** The author judged \"this is mostly UI\" and skipped the security checklist.\n\n**Correction.** `security_flag` is set whenever the diff touches authn / authz / secrets / supply chain / data exposure, even when the change feels small. The cost of a spurious security flag is a few minutes; the cost of a missed one is a CVE.\n\n## A-12 \u2014 Single test green, didn't run the suite\n\n**Symptom.** `flows/<slug>/build.md` GREEN evidence column shows `npm test path/to/single.test` only; full-suite run is missing.\n\n**Underlying mistake.** A passing single test is not GREEN. Production change can break adjacent tests; without running the suite, the AC is shipped on a regression.\n\n**Correction.** GREEN evidence must be the **full relevant suite** for the affected module(s), not the single test. The reviewer cites this as a block finding.\n";
+ export declare const ANTIPATTERNS = "# .cclaw/lib/antipatterns.md\n\nPatterns we have seen fail. Each entry is a short symptom, the underlying mistake, and the corrective action. The orchestrator and specialists open this file when a smell is detected; the reviewer cites entries as findings when applicable.\n\n## A-1 \u2014 \"Just one more AC\"\n\n**Symptom.** A plan starts with 4 AC and ends with 11. Most of the additions appeared during build.\n\n**Underlying mistake.** Scope is being expanded mid-flight without going back to plan-stage.\n\n**Correction.** When build encounters new work, surface it as a follow-up in `.cclaw/ideas.md` or a fresh slug. If the new work is genuinely required to satisfy an existing AC, that AC was wrong; cancel the slug and re-plan with a tighter AC set.\n\n## A-2 \u2014 TDD phase integrity broken\n\n**Symptom (any of):**\n\n- Build commits land for AC-N with `--phase=green` but no `--phase=red` recorded earlier.\n- AC has RED + GREEN commits but no `--phase=refactor` (skipped or applied) entry in flow-state.\n- A `--phase=red` commit touches `src/`, `lib/`, or `app/` \u2014 production code slipped into RED.\n- Tests for AC-N appear in a separate commit a few minutes after the AC-N implementation lands.\n\n**Underlying mistake.** The TDD cycle was treated as ceremony, not as the contract. The cycle exists so the failing test encodes the AC; skipping or scrambling phases produces an audit trail that nobody can trust.\n\n**Correction.** `commit-helper.mjs` enforces RED \u2192 GREEN \u2192 REFACTOR per AC. Write a failing test first and commit under `--phase=red` (test files only). Implement the smallest production change that turns it green; commit under `--phase=green`. Either commit a refactor under `--phase=refactor` or skip it explicitly with `--phase=refactor --skipped --message=\"refactor(AC-N) skipped: <reason>\"`. 
The reviewer cites this entry whenever the chain is incomplete.\n\n## A-3 \u2014 Work outside the AC\n\n**Symptom (any of):**\n\n- A small AC commit also restructures an unrelated module.\n- A commit produced by `commit-helper.mjs` contains files that are unrelated to the AC.\n- `git add -A` appears in shell history inside `/cc`.\n\n**Underlying mistake.** Slice-builder absorbed unrelated edits or silently expanded scope. The AC commit no longer maps cleanly to the AC.\n\n**Correction.** Stage AC-related files explicitly: `git add <path>` per file, or `git add -p` to pick hunks. Never `git add -A` inside `/cc`. If a refactor really must happen, capture it as a follow-up; if it really blocks the AC, cancel the slug and re-plan as a refactor + behaviour-change pair.\n\n## A-4 \u2014 AC that mirror sub-tasks\n\n**Symptom.** AC read like \"implement the helper\", \"wire the helper\", \"test the helper\".\n\n**Underlying mistake.** AC are outcomes, not sub-tasks. Outcomes survive refactors; sub-tasks do not.\n\n**Correction.** Rewrite AC as observable outcomes. The helper is an implementation detail, not an AC.\n\n## A-5 \u2014 Over-careful brainstormer\n\n**Symptom.** Brainstormer produces three pages of Context for a small task; planner is then unable to size the work.\n\n**Underlying mistake.** Brainstormer ignored the routing class. Trivial / small-medium tasks should have a one-paragraph Context, not a Frame + Scope + Alternatives sweep.\n\n**Correction.** Brainstormer reads the routing class first and short-circuits when the task is small. Three sentences of Context is enough for AC-1.\n\n## A-6 \u2014 \"I already looked\"\n\n**Symptom.** Reviewer reports a \"clear\" decision without a Five Failure Modes pass.\n\n**Underlying mistake.** The Five Failure Modes pass is the artifact. Skipping it because \"I already looked\" produces no audit trail.\n\n**Correction.** Reviewer always emits the Five Failure Modes block. Each item gets yes / no with citation when yes. 
A \"no\" with no thinking attached is fine; an absent block is not.\n\n## A-7 \u2014 Shipping with a pending AC\n\n**Symptom.** `runCompoundAndShip()` is invoked while flow-state has at least one AC with `status: pending`.\n\n**Underlying mistake.** The agent expected the orchestrator to \"figure it out\" and complete the AC silently.\n\n**Correction.** The AC traceability gate refuses ship. Either complete the AC (slice-builder) or cancel the slug (`/cc-cancel`) and re-plan with the smaller AC set. There is no override.\n\n## A-8 \u2014 Re-creating a shipped slug instead of refining\n\n**Symptom.** A new `/cc` invocation produces a slug whose plan is 80% identical to a slug already in `.cclaw/flows/shipped/`.\n\n**Underlying mistake.** Existing-plan detection was skipped or its output was ignored.\n\n**Correction.** Existing-plan detection is mandatory at the start of every `/cc`. When a shipped match is offered, the user picks **refine shipped** or **new unrelated**, not \"ignore the match\".\n\n## A-9 \u2014 Editing shipped artifacts\n\n**Symptom.** A shipped slug's `plan.md` is edited weeks after ship.\n\n**Underlying mistake.** Shipped artifacts are immutable. Editing them invalidates the knowledge index and breaks refinement chains.\n\n**Correction.** Open a refinement slug. The new slug carries `refines: <old-slug>` and contains the corrections. The old slug stays as it shipped.\n\n## A-10 \u2014 Force-push during ship\n\n**Symptom.** `git push --force` appears in shell history during ship.\n\n**Underlying mistake.** Force-push rewrites the SHAs that flow-state and the AC traceability block reference. The chain breaks silently; nothing in the runtime detects it.\n\n**Correction.** Refuse `git push --force` inside `/cc` unless the user explicitly requested it twice (initial request + confirmation). 
After the force-push, every recorded SHA in the slug must be re-verified by hand and updated.\n\n## A-11 \u2014 Hidden security surface\n\n**Symptom.** A slug ships without `security_flag: true` even though the diff added a new auth-adjacent code path.\n\n**Underlying mistake.** The author judged \"this is mostly UI\" and skipped the security checklist.\n\n**Correction.** `security_flag` is set whenever the diff touches authn / authz / secrets / supply chain / data exposure, even when the change feels small. The cost of a spurious security flag is a few minutes; the cost of a missed one is a CVE.\n\n## A-12 \u2014 Single test green, didn't run the suite\n\n**Symptom.** `flows/<slug>/build.md` GREEN evidence column shows `npm test path/to/single.test` only; full-suite run is missing.\n\n**Underlying mistake.** A passing single test is not GREEN. Production change can break adjacent tests; without running the suite, the AC is shipped on a regression.\n\n**Correction.** GREEN evidence must be the **full relevant suite** for the affected module(s), not the single test. The reviewer cites this as a block finding.\n\n## A-13 \u2014 Horizontal slicing (RED-batch then GREEN-batch)\n\n**Symptom.** `flows/<slug>/build.md` shows AC-1 RED, AC-2 RED, AC-3 RED committed in a row, then AC-1 GREEN, AC-2 GREEN, AC-3 GREEN. Or the slice-builder describes the build as \"tests written, now I'll implement\".\n\n**Underlying mistake.** Writing all RED tests before any GREEN code means the tests describe the behaviour you *guessed* before you saw the real interface. Tests written this way pass when behaviour breaks (because they test the imagined shape) and fail when behaviour is fine (because the real shape diverged from the imagination). They get rewritten during the next refactor.\n\n**Correction.** One test \u2192 one implementation \u2192 repeat. Each cycle informs the next. The AC-2 test is shaped by what the AC-1 implementation revealed about the real interface. 
`commit-helper.mjs --phase=red` for AC-2 will refuse if AC-1's chain isn't closed yet \u2014 that is the rail. See the Vertical Slicing section in `tdd-cycle.md`.\n\n## A-14 \u2014 Pushing past a failing test\n\n**Symptom.** Build log shows a flaky or unexpected failure on AC-2, then continues into AC-3 with \"I'll come back to AC-2 later\". Or a hook rejection silently retried with a slightly different commit message.\n\n**Underlying mistake.** Errors compound. AC-3 is built on the invariants AC-2 was supposed to establish. If AC-2's RED failed for the wrong reason, you are debugging a stack of broken assumptions, and every cycle past that point makes the diagnosis harder.\n\n**Correction.** Stop the line. Preserve the failure (command + 1\u20133 lines of output verbatim), reproduce in isolation, root-cause to a concrete file:line, fix once, re-run the full relevant suite, then resume the cycle. If the root cause cannot be identified in three attempts, surface a blocker to the orchestrator \u2014 do not \"make it work\" by removing the test or weakening the assertion.\n\n## A-15 \u2014 Mocking what should not be mocked\n\n**Symptom.** A database query test mocks the driver and asserts on `db.query` call shape; the test is green and the actual query never runs in production. Or a service test mocks every collaborator and only verifies which methods were called, in which order.\n\n**Underlying mistake.** Mocking a dependency you control couples the test to the implementation. The test reads green even when the SQL is wrong, the migration is missing, the column is misspelled. Real bugs live in those gaps. 
Interaction-based assertions (`expect(x).toHaveBeenCalledWith(...)`) break on every refactor and provide weaker confidence than state-based assertions on the outcome.\n\n**Correction.** Use a real test database (or an in-memory fake of the same shape) and assert on the **outcome** \u2014 the row that was inserted, the response from the query, the observable side effect \u2014 not on the call. Reach for mocks only for things genuinely outside your control: third-party APIs, time, randomness, the network. Real > Fake (in-memory) > Stub (canned data) > Mock (interaction).\n";
@@ -106,4 +106,28 @@ Patterns we have seen fail. Each entry is a short symptom, the underlying mistak
  **Underlying mistake.** A passing single test is not GREEN. Production change can break adjacent tests; without running the suite, the AC is shipped on a regression.
 
  **Correction.** GREEN evidence must be the **full relevant suite** for the affected module(s), not the single test. The reviewer cites this as a block finding.
+
+ ## A-13 — Horizontal slicing (RED-batch then GREEN-batch)
+
+ **Symptom.** \`flows/<slug>/build.md\` shows AC-1 RED, AC-2 RED, AC-3 RED committed in a row, then AC-1 GREEN, AC-2 GREEN, AC-3 GREEN. Or the slice-builder describes the build as "tests written, now I'll implement".
+
+ **Underlying mistake.** Writing all RED tests before any GREEN code means the tests describe the behaviour you *guessed* before you saw the real interface. Tests written this way pass when behaviour breaks (because they test the imagined shape) and fail when behaviour is fine (because the real shape diverged from the imagination). They get rewritten during the next refactor.
+
+ **Correction.** One test → one implementation → repeat. Each cycle informs the next. The AC-2 test is shaped by what the AC-1 implementation revealed about the real interface. \`commit-helper.mjs --phase=red\` for AC-2 will refuse if AC-1's chain isn't closed yet — that is the rail. See the Vertical Slicing section in \`tdd-cycle.md\`.
+
+ ## A-14 — Pushing past a failing test
+
+ **Symptom.** Build log shows a flaky or unexpected failure on AC-2, then continues into AC-3 with "I'll come back to AC-2 later". Or a hook rejection silently retried with a slightly different commit message.
+
+ **Underlying mistake.** Errors compound. AC-3 is built on the invariants AC-2 was supposed to establish. If AC-2's RED failed for the wrong reason, you are debugging a stack of broken assumptions, and every cycle past that point makes the diagnosis harder.
+
+ **Correction.** Stop the line. Preserve the failure (command + 1–3 lines of output verbatim), reproduce in isolation, root-cause to a concrete file:line, fix once, re-run the full relevant suite, then resume the cycle. If the root cause cannot be identified in three attempts, surface a blocker to the orchestrator — do not "make it work" by removing the test or weakening the assertion.
+
+ ## A-15 — Mocking what should not be mocked
+
+ **Symptom.** A database query test mocks the driver and asserts on \`db.query\` call shape; the test is green and the actual query never runs in production. Or a service test mocks every collaborator and only verifies which methods were called, in which order.
+
+ **Underlying mistake.** Mocking a dependency you control couples the test to the implementation. The test reads green even when the SQL is wrong, the migration is missing, the column is misspelled. Real bugs live in those gaps. Interaction-based assertions (\`expect(x).toHaveBeenCalledWith(...)\`) break on every refactor and provide weaker confidence than state-based assertions on the outcome.
+
+ **Correction.** Use a real test database (or an in-memory fake of the same shape) and assert on the **outcome** — the row that was inserted, the response from the query, the observable side effect — not on the call. Reach for mocks only for things genuinely outside your control: third-party APIs, time, randomness, the network. Real > Fake (in-memory) > Stub (canned data) > Mock (interaction).
  `;
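The A-15 correction added in the hunk above (assert on the outcome, using an in-memory fake of the same shape rather than a mock) can be illustrated with a small sketch. Every name in it (`UserStore`, `InMemoryUserStore`, `register`) is hypothetical and chosen only for the example; nothing here is part of cclaw.

```typescript
// Hypothetical domain types for illustration only.
interface User {
  id: number;
  email: string;
}

// The dependency the code under test talks to.
interface UserStore {
  insert(email: string): User;
  findByEmail(email: string): User | undefined;
}

// In-memory fake with the same shape as a real store: it actually keeps rows.
class InMemoryUserStore implements UserStore {
  private rows: User[] = [];

  insert(email: string): User {
    const user: User = { id: this.rows.length + 1, email };
    this.rows.push(user);
    return user;
  }

  findByEmail(email: string): User | undefined {
    return this.rows.find((u) => u.email === email);
  }
}

// Code under test: normalizes the email and refuses duplicates.
function register(store: UserStore, rawEmail: string): User {
  const email = rawEmail.trim().toLowerCase();
  if (store.findByEmail(email)) throw new Error(`duplicate: ${email}`);
  return store.insert(email);
}

// State-based check: look at what ended up in the store,
// not at which methods were called or in which order.
const store = new InMemoryUserStore();
register(store, "  Ada@Example.com ");
const saved = store.findByEmail("ada@example.com");
if (saved?.email !== "ada@example.com") {
  throw new Error("user was not stored with a normalized email");
}
```

A check written this way keeps passing if `register` is refactored to call the store differently, as long as the stored row is the same, which is the property the antipattern entry argues for.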