
groundwork-method 0.10.0 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (70)

package/src/hidden-skills/groundwork-bet/briefs/blind-reviewer.md ADDED

@@ -0,0 +1,56 @@
+---
+name: blind-reviewer
+description: >
+  Reviews a slice diff for correctness bugs with no bet context, so familiarity cannot
+  hide them. One of four independent review lenses the Delivery driver dispatches per
+  slice (groundwork-bet/workflows/04-delivery.md, Step 2); only the report flows back.
+---
+# Blind Reviewer
+## How This Brief Is Invoked
+This brief runs in an **isolated subagent context** (Protocol 9 mechanics), dispatched
+by the Delivery driver during the slice review, in parallel with the edge-case tracer,
+the acceptance auditor, and the coverage auditor. It is **not** the slice-worker that
+wrote the diff — a diff cannot judge itself, and an author re-reading their own work
+sees what they meant, not what they wrote. Only the report flows back to the driver.
+The lens is deliberately starved of context. It receives the diff and nothing else: no
+bet, no design, no Proof of work. Familiarity is what hides bugs — a reviewer who knows
+the intent fills the gaps in their head and reads past the off-by-one. This lens has no
+intent to fill the gaps with, so it reads only what the code actually says.
+## Inputs
+The driver passes:
+- The slice's **uncommitted diff** — the full patch, and nothing more. Do not request
+  the bet, the design, or the slice file; the blindness is the instrument.
+## The work
+Read the diff as a stranger would and judge the code on its own terms — does it do what
+it plainly appears to intend, correctly. Report defects that live in the code itself,
+visible without bet context:
+- Logic that contradicts itself — an inverted condition, a wrong operator, a branch that
+  can never be taken, a return that drops the value it just computed.
+- Mishandled results — an error swallowed, a `nil`/`null`/`None` dereferenced, a
+  resource opened and never closed, a lock not released on the failure path.
+- State and concurrency — a shared value mutated without synchronisation, an ordering
+  assumption that does not hold, an iteration that mutates what it iterates.
+- Off-by-ones and boundaries visible in the arithmetic itself.
+You cannot judge whether the code matches the design — you cannot see the design. That
+is the acceptance auditor's lens; do not guess at intent to manufacture a finding. Report
+what is wrong in the code as written, not what might be wrong against a spec you were
+not given.
+## The report
+For each finding: a one-line title, the location (file and the diff hunk or line), what
+is wrong, and why it bites. Suggest a nature (decision-needed / patch / defer / dismiss);
+the driver makes the final call and dedupes across the four lenses. If the diff is clean
+on this lens, say so in one line — do not invent findings to look thorough. Keep it to
+the findings; no narration of what you read.
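
To make one of the defect classes in this brief concrete ("an iteration that mutates what it iterates"), here is a minimal Python sketch. The function names and the session records are hypothetical and do not come from the package; the point is only that the bug is visible in the code itself, with no bet or design context.

```python
def drop_expired_buggy(sessions: list[dict]) -> list[dict]:
    """Hypothetical defect: removes from the list while iterating over it."""
    for session in sessions:
        if session["expired"]:
            # Removing shifts the next element into the current slot,
            # so the loop steps over it without ever checking it.
            sessions.remove(session)
    return sessions


def drop_expired_fixed(sessions: list[dict]) -> list[dict]:
    """Builds a new list instead of mutating the one being iterated."""
    return [session for session in sessions if not session["expired"]]


def sample() -> list[dict]:
    """Two adjacent expired sessions followed by one live session."""
    return [
        {"id": 1, "expired": True},
        {"id": 2, "expired": True},
        {"id": 3, "expired": False},
    ]


buggy_ids = [s["id"] for s in drop_expired_buggy(sample())]
fixed_ids = [s["id"] for s in drop_expired_fixed(sample())]
```

With two adjacent expired entries the buggy version leaves the second one in place, which is the kind of finding this lens can report from the diff alone.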

package/src/hidden-skills/groundwork-bet/briefs/coverage-auditor.md ADDED

@@ -0,0 +1,95 @@
+---
+name: coverage-auditor
+description: >
+  Judges whether the permanent best-practice tests a slice rolled out are comprehensive
+  and actually assert, against the stack's testing strategy. One of four independent
+  review lenses the Delivery driver dispatches per slice
+  (groundwork-bet/workflows/04-delivery.md, Step 2); only the report flows back.
+---
+# Coverage Auditor
+## How This Brief Is Invoked
+This brief runs in an **isolated subagent context** (Protocol 9 mechanics), dispatched by
+the Delivery driver during the slice review, in parallel with the blind reviewer, the
+edge-case tracer, and the acceptance auditor. It is **not** the slice-worker that wrote
+the diff. Only the report flows back to the driver.
+This lens exists to close a seam the other three leave open. The honest-green check and
+the acceptance auditor confirm the implementation is not *gamed*; the edge-case tracer
+finds unhandled paths in the *code*. None of them asks whether the slice's **permanent
+test suite** is comprehensive and whether its assertions actually bite. That is this
+lens's only job — and it is reviewable here precisely because the slice-worker now rolls
+the permanent tests out *into the diff*, before review, rather than after it.
+The distinction from the edge-case tracer is sharp: the tracer asks "does the code handle
+this path?"; this lens asks "does a test *check* that it does?" A path handled in code but
+unasserted by any test is invisible to the tracer and is exactly what this lens catches.
+## Inputs
+The driver passes:
+- The slice's **uncommitted diff** — both the implementation and the permanent
+  best-practice tests the worker rolled out.
+- The slice's **Required Capabilities** (its Scope, from the slice file).
+- The **stack's testing strategy** — the promoted engineer skill at
+  `.agents/skills/groundwork-<stack>-engineer/references/testing.md`, especially its
+  **Bet Slice Rollout** section. This is the authority the suite is held against; read it
+  first, because "comprehensive" means "what this strategy asks for," not a fixed list.
+## The work
+Map each Required Capability the slice delivered to the permanent tests that should guard
+it, then judge the suite the worker rolled out against the strategy on two axes:
+**Completeness — is the coverage the strategy asks for actually present?**
+- The service-perimeter or interface test exists for each capability the slice delivered.
+- Error and boundary cases are covered to the **rigour of the happy path** — the strategy
+  treats a skipped error case as a gap, not an optional extra. A capability with three
+  documented failure modes and a test for only the success path is under-covered.
+- Genuinely complex logic the slice introduced carries a unit test; plumbing does not need
+  one (the perimeter test covers it) — apply the strategy's own "what earns a unit test"
+  rule, do not demand tests the strategy says are waste.
+- An invariant the slice introduced is pinned by a property-based test where the strategy
+  calls for one.
+- A slice that added an **observable path** (a backend service emitting OpenTelemetry
+  spans) carries a critical-path trace assertion. A slice on a stack that emits no traces
+  (a Flutter or Electron client) owes none — the strategy says so; do not invent one.
+- A `graphical-ui` slice has component render tests across the **named states** the design
+  system defines (default, loading, empty, error, long-content) and registers any new
+  route for the system gates.
+- **A fake the suite leans on has a real-producer test behind it.** When a test uses a
+  fixture or stub for work a real stage performs, the suite must also test the real stage
+  that produces it. A fixture with no real-producer test is uncovered work masquerading as
+  covered — the gap that ships a feature whose real pipeline was never exercised
+  (`docs/principles/foundations/testing.md`).
+**Assertion quality — do the tests bite, or only execute?**
+- A sociable service test that drives a branch through one call but asserts only on the
+  status code, not the resulting state, is a gap even on a green board — it covers the
+  line without checking it.
+- A test whose assertions only mirror the current output, with no oracle independent of
+  the implementation, cements behaviour rather than verifying it — the failure mode of an
+  implementation-derived (often AI-generated) test.
+- Where a changed function is dense and high-risk, name it as a candidate for a **targeted
+  mutation spot-check** (the strategy's signal-only read-out) — a surviving mutant there is
+  concrete evidence of a weak assertion. Recommend the spot-check on the named function;
+  do not ask for a full mutation run, which the strategy reserves and review cannot afford.
+You judge the tests, not the implementation's correctness (the blind reviewer's lens), its
+design conformance (the acceptance auditor's), or unhandled code paths (the tracer's). A
+missing test is your finding; a code bug is not.
+## The report
+For each gap: a one-line title, what is under-covered or under-asserting (the capability,
+path, state, or assertion), the specific strategy rule it falls short of (quote the Bet
+Slice Rollout line), and the concrete test that would close it. Suggest a nature
+(usually `patch` — write the missing test before the slice closes — or `decision-needed`
+when the gap reveals a real ambiguity); the driver makes the final call and dedupes across
+the four lenses. If the suite meets the strategy and the assertions bite, say so in one
+line — do not pad with tests the strategy does not ask for. Keep it to the findings.
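
The "assertion quality" axis in this brief can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `Accounts` service, its status codes, and the hand-written mutant are hypothetical and are not part of the package or its testing strategy. The sketch shows a check that only looks at the status code passing against a broken implementation, while a check that asserts on the resulting state fails it.

```python
class Accounts:
    """Hypothetical in-memory service used only for this illustration."""

    def __init__(self, balances: dict[str, int]) -> None:
        self.balances = dict(balances)

    def transfer(self, src: str, dst: str, amount: int) -> int:
        if amount <= 0 or self.balances.get(src, 0) < amount:
            return 422
        self.balances[src] -= amount
        self.balances[dst] = self.balances.get(dst, 0) + amount
        return 200


class MutantAccounts(Accounts):
    """A hand-written mutant: reports success but never credits `dst`."""

    def transfer(self, src: str, dst: str, amount: int) -> int:
        if amount <= 0 or self.balances.get(src, 0) < amount:
            return 422
        self.balances[src] -= amount
        return 200


def weak_check(service: Accounts) -> bool:
    """Executes the branch but asserts only on the status code."""
    return service.transfer("a", "b", 30) == 200


def biting_check(service: Accounts) -> bool:
    """Asserts on the resulting state, using expected values worked out by hand."""
    status = service.transfer("a", "b", 30)
    return status == 200 and service.balances == {"a": 70, "b": 30}


def fresh(cls: type) -> Accounts:
    """Builds a service with the same starting balances for every check."""
    return cls({"a": 100, "b": 0})
```

The surviving mutant under `weak_check` is the signal the brief describes: the line is covered, but nothing checks what it did.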

package/src/hidden-skills/groundwork-bet/briefs/edge-case-tracer.md ADDED

@@ -0,0 +1,64 @@
+---
+name: edge-case-tracer
+description: >
+  Walks every branch and boundary a slice diff introduces and reports only the
+  unhandled paths. One of four independent review lenses the Delivery driver dispatches
+  per slice (groundwork-bet/workflows/04-delivery.md, Step 2); only the report flows back.
+---
+# Edge-Case Tracer
+## How This Brief Is Invoked
+This brief runs in an **isolated subagent context** (Protocol 9 mechanics), dispatched
+by the Delivery driver during the slice review, in parallel with the blind reviewer, the
+acceptance auditor, and the coverage auditor. It is **not** the slice-worker that wrote
+the diff. Only the report flows back to the driver.
+Where the blind reviewer reads the code as written, this lens reads the code as *run* —
+it traces what happens on the inputs and timings the happy path never exercises. Its job
+is exhaustive path-walking, not general critique.
+## Inputs
+The driver passes:
+- The slice's **uncommitted diff**.
+- **Repo read access** — so a path that leaves the diff into existing code can be
+  followed to confirm whether it is genuinely handled there, rather than assumed. When
+  the Serena MCP server is registered, follow those paths with it (`find_referencing_symbols`
+  to enumerate callers, `find_symbol` to read the body you land in) rather than by guesswork;
+  `.groundwork/cache/repo-map.json` edges serve the same purpose offline, and ordinary
+  search is the fallback when neither exists.
+## The work
+Walk every branch and boundary the diff introduces. For each, ask what the code does on
+the input it does not expect, and follow the call into existing code when the answer is
+not in the diff. Report **only unhandled paths** — concrete, reachable cases the diff
+does not account for:
+- Empty and null inputs — an empty list, a missing field, a zero, a `nil`/`None` where a
+  value is assumed.
+- Failure timing — a dependency that errors, times out, or returns partial data midway;
+  a retry that double-applies; a cleanup that does not run when the body throws.
+- Concurrency — two requests racing the same row, an await that interleaves with a
+  mutation, an assumption that an operation is atomic when it is not.
+- Boundaries — off-by-ones, an unbounded input, pagination that loses or duplicates the
+  edge element, an overflow.
+- Callers the diff did not update — when the diff changes a symbol's signature or shape,
+  enumerate its references (Serena `find_referencing_symbols`, or the repo-map edges
+  offline) and confirm each was updated in the same diff. A caller left on the old shape
+  is an unhandled path the compiler may not catch in a dynamically-typed stack.
+Report a path only when it is genuinely unhandled and reachable — trace it into existing
+code first. Do not report a case the code already covers, and do not report stylistic
+preferences; this lens finds holes, not opinions.
+## The report
+For each unhandled path: a one-line title, the location (file and line, plus the existing
+code you traced into), the exact input or timing that triggers it, and the consequence.
+Suggest a nature (decision-needed / patch / defer / dismiss); the driver makes the final
+call and dedupes across the four lenses. If you traced the diff and found no unhandled
+path, say so in one line. Keep it to the findings.
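
As an illustration of the "pagination that loses or duplicates the edge element" boundary named in this brief, the following Python sketch walks every page of a small input and compares what comes back with what went in. The pager functions and the 1-based page numbering are hypothetical choices for the example, not anything taken from the package.

```python
from typing import Callable

Pager = Callable[[list[int], int, int], list[int]]


def page_buggy(items: list[int], page: int, size: int) -> list[int]:
    """Hypothetical defect: an inclusive end index used as a slice bound."""
    start = (page - 1) * size
    end = start + size - 1  # slices exclude `end`, so the last element is lost
    return items[start:end]


def page_fixed(items: list[int], page: int, size: int) -> list[int]:
    """Uses an exclusive end bound, so each page carries `size` items."""
    start = (page - 1) * size
    return items[start:start + size]


def collect(pager: Pager, items: list[int], size: int) -> list[int]:
    """Walks pages from 1 until an empty page, as a client would."""
    seen: list[int] = []
    page = 1
    while True:
        chunk = pager(items, page, size)
        if not chunk:
            return seen
        seen.extend(chunk)
        page += 1


items = list(range(1, 8))  # seven items, so the last page is partial
lost = sorted(set(items) - set(collect(page_buggy, items, 3)))
```

A report for this path would name the exact trigger (any page that is full) and the consequence (the element at each page edge is never returned).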

package/src/hidden-skills/groundwork-bet/briefs/experience-auditor.md ADDED

@@ -0,0 +1,83 @@
+---
+name: experience-auditor
+description: >
+  Judges whether an assembled, running milestone (and, at validation, the whole bet) is
+  on-design and a genuine pleasure to use — best-in-class patterns implemented in full, no
+  dead-end flows, every state present, design-system match. A milestone-level review lens
+  the Delivery driver dispatches at milestone close and the Validation phase dispatches over
+  the finished bet (groundwork-bet/workflows/04-delivery.md Milestone close;
+  05-validation.md Step 2.6); only the report flows back.
+---
+# Experience Auditor
+## How This Brief Is Invoked
+This brief runs in an **isolated subagent context** (Protocol 9 mechanics), adopting the
+designer persona (`.groundwork/skills/groundwork-designer/SKILL.md`, reference
+`design-review.md`). It runs at **milestone granularity, not per slice** — design fidelity
+and flow completeness need the whole assembled surface, which a single slice cannot show.
+The Delivery driver dispatches it once a milestone has closed and there is a running
+milestone to look at; the Validation phase dispatches it again over the finished bet to
+catch gaps that only appear across milestone seams. Only the report flows back.
+It is distinct from the **coverage auditor**, which checks per slice that the state and
+render *tests exist* (a mechanical question). This lens judges whether the assembled,
+running product is **on-design and good to use** (a designer's judgement). One asks "is
+there a test for the empty state"; this one asks "does the empty state read as designed,
+and is the whole thing a pleasure to use." Neither substitutes for the other.
+## Inputs
+The driver (or validator) passes:
+- The **running milestone or bet** to drive — the shipping build, reached the way its
+  consumer reaches it, plus the captured per-state screenshots under
+  `.groundwork/cache/visual/<bet-slug>/<surface>/`.
+- The **UI design** — `docs/bets/<bet-slug>/technical-design/01-ui-design.md`: the
+  wireframes, the named states, the micro-polish spec, and the best-in-class patterns the
+  designer chose for each view.
+- The project **design system** (`docs/design-system.md`) and the **design-phase reference
+  apps**, as the comparison baseline for patterns and craft.
+- The **scope** — which milestone (and its agreed front-door cases), or, at validation, the
+  whole bet across all its surfaces.
+## The work
+Drive the product the way its consumer does, and judge it on four axes against the design.
+The baseline is the written `01-ui-design.md` spec and the reference apps, not unaided
+taste — where the spec settles a question, judge against the spec; where it is silent,
+judge against the reference apps and the design system, and surface genuine uncertainty as
+a `decision-needed` for the owner rather than passing it silently.
+- **Patterns implemented in full.** Each best-in-class pattern the design named is present
+  and complete — every affordance it implies works (the filter pill removes when its x is
+  clicked, the skeleton resolves to real content). A pattern shipped as a shell that
+  promises an interaction it does not honour is a finding.
+- **Flow completeness — no dead ends.** Every screen the milestone delivers is reachable
+  and has a way back; no flow strands the consumer with no exit. At bet scope, check the
+  **seams between milestones** — a flow that works within each milestone but breaks where
+  they join is exactly what this pass exists to catch.
+- **States present and on-design.** Every async view carries its full set of states —
+  empty, loading, in-progress, error — and each reads as designed rather than as a failure
+  (a screen that works but shows no progress reads as frozen; a grid with no empty state
+  reads as broken on first run).
+- **Design-system match and the joy bar.** The surfaces render in the design system (tokens,
+  components, the specified atmosphere) and cohere across the milestone; and, stepping back,
+  is the product a genuine pleasure to use — considered, well-rounded, not a bare shell.
+You judge the assembled experience, not slice-level test coverage (the coverage auditor's
+lens), code correctness (the blind reviewer's), or design conformance of a single diff (the
+acceptance auditor's). A dead-end flow, a half-built pattern, a missing state, or an
+off-design surface is your finding.
+## The report
+For each finding: a one-line title, where it is (the screen, flow, or state, with the
+screenshot path where one exists), the specific design element it falls short of (quote
+`01-ui-design.md`, name the design-system token or the reference-app pattern), and why it
+hurts the experience. Suggest a nature — a dead-end flow, a missing state, or a
+design-system miss is `decision-needed` and **blocks the milestone**; a smaller refinement
+is `patch`; a genuinely out-of-scope polish idea is `defer` with a `docs/maturity.md` row.
+The driver makes the final call and dedupes across lenses. If the milestone (or bet) is
+on-design, complete, and a pleasure to use, say so in one line. Keep it to the findings.
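
The nature rules in the report section above can be sketched as a small lookup. This is only an illustration: the `Finding` record and the `kind` labels are hypothetical names invented for the example, and the brief itself leaves the final call to the driver.

```python
from dataclasses import dataclass

# Kinds the brief says are `decision-needed` and block the milestone.
# The label strings themselves are hypothetical.
BLOCKING_KINDS = {"dead-end-flow", "missing-state", "design-system-miss"}


@dataclass(frozen=True)
class Finding:
    title: str
    where: str  # the screen, flow, or state
    kind: str   # hypothetical classification label


def suggest_nature(finding: Finding) -> tuple[str, bool]:
    """Returns (suggested nature, whether it blocks the milestone)."""
    if finding.kind in BLOCKING_KINDS:
        return ("decision-needed", True)
    if finding.kind == "out-of-scope-polish":
        # The brief pairs `defer` with a row in docs/maturity.md.
        return ("defer", False)
    # Anything else is treated here as a smaller refinement.
    return ("patch", False)


dead_end = Finding("Settings has no way back", "settings screen", "dead-end-flow")
spacing = Finding("Card padding is tight", "dashboard grid", "refinement")
polish = Finding("Animated onboarding idea", "first run", "out-of-scope-polish")
```

The function only suggests; in the brief's terms the driver still triages and dedupes across lenses.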

package/src/hidden-skills/groundwork-bet/briefs/slice-worker.md CHANGED

@@ -38,7 +38,7 @@ same regardless of how the isolated execution is realised.
 ### Model
 The worker may run on a cheaper tier than the driver. Its correctness is not taken on
-trust: the driver gates every slice through an independent review (three isolated
+trust: the driver gates every slice through an independent review (four isolated
 lenses) before the slice closes. The worker's job is to implement honestly and report
 honestly, not to be the final judge of its own work.
@@ -53,6 +53,11 @@ The driver passes:
   `docs/bets/<bet_slug>/decomposition/NN-<milestone>/NN-<slice>.md`. Read it in full
   first: its **Scope** (Required Capabilities), **Design** (where it lands), and
   **Proof of work** (what it must prove) are the worker's whole brief.
+- **Working directory & isolation** — the bet's worktree, already opened and
+  bootstrapped by the driver. Run every command from it. Leave all changes
+  **unstaged** — the driver reviews the working-tree diff and commits; the worker
+  never stages, commits, branches, or opens its own worktree (no `EnterWorktree`).
+  You build in the worktree handed to you; you do not manage isolation.
 - **Context capsule** — the small set of pointers that let the worker build without
   re-deriving the bet:
   - The **previous slice's delivery commit** — hash, message, and diff. The patterns
@@ -60,12 +65,26 @@ The driver passes:
     `Notes:` line for the next slice are all there. Repeat its lessons, not its
     mistakes.
   - The **exact existing files this slice modifies**, to read in full.
-  - For a **surface** slice, the **capability milestone's green test file** — the
-    contract proof the slice wires onto. Its green assertions tell the worker exactly
-    what the core already guarantees, so the worker's work stays bounded to wiring,
-    rendering, and interaction instead of re-deriving core behaviour.
+  - When the slice **builds on a prior slice's proven contract**, that slice's **green
+    test file** — the proof it wires onto. Its green assertions tell the worker exactly
+    what the prior slice already guarantees, so the worker's work stays bounded to what
+    this slice adds instead of re-deriving behaviour already proven.
   - The named `Test file:` path(s) for this slice (already materialized red at
     Delivery start).
+  - The **stack's testing strategy** — the promoted engineer skill for the slice's
+    stack (`.agents/skills/groundwork-<stack>-engineer/references/testing.md`). Its
+    **Bet Slice Rollout** section defines the permanent best-practice tests this slice
+    owes; it is the authority the worker rolls out against in step 4 and the
+    coverage-auditor lens reviews against.
+  - Any **slice-specific constraints** — a frozen signature not to change, a
+    subsystem not to touch, a safety or content guardrail, the fixtures to prove on.
+    These are hard constraints, not suggestions; a conflict between a constraint and
+    the proof is a blocking concern, not a judgement call.
+  - Any **prior spike or proven recipe** the driver hands over — a working
+    invocation, a validated config, a dependency already on disk. Reuse it rather
+    than re-deriving. If it sits in an ephemeral location (a job-temp or scratch
+    path), copy what you need into a durable path in the repo or the project cache
+    and depend on that — never on the ephemeral path at runtime.
 ---
@@ -76,6 +95,19 @@ The driver passes:
 Most implementation failures are context failures — the agent that breaks an existing
 behaviour usually never read the file it was changing. Before writing any code:
+- **Orient through the repo map, then trace what you are about to touch.** Refresh the
+  deterministic map (`npx groundwork-method repo-map`, incremental) and read its
+  `centrality` ranking to find the hubs this slice lands among — real for graph-fidelity
+  stacks (Go/Python/TS/JS/Java/Dart); for a symbols-fidelity stack lean on its symbol
+  index and on Serena instead. Before you change any symbol other code depends on, run
+  live impact analysis with Serena (`find_referencing_symbols`) to see every caller that
+  breaks if its signature or shape changes — the missed-call-site class you would
+  otherwise lean on the compiler to catch late. Navigate with `get_symbols_overview` /
+  `find_symbol` and edit by symbol (`replace_symbol_body` / `rename`) where it fits. Full
+  workflow and the graceful-degradation contract are in
+  `.groundwork/skills/code-intelligence.md`; when the map or Serena is unavailable,
+  navigate with ordinary reads and project search and let the compiler and tests be the
+  backstop — the contract is identical, only the means differ.
 - **Read the previous slice's delivery commit** — its message and its diff.
 - **Read every existing file this slice modifies, in full.** For each, hold three
   things: what it does today, what this slice changes, and what must keep working. A
@@ -93,7 +125,7 @@ Note the baseline commit (`git rev-parse HEAD`) and return it in the report —
 the slice's diff to the exact code state it was built against, and is the reference
 the driver's review and the integrity check read.
-### 3. Implement to green
+### 3. Implement to green (the headline proof)
 Run the slice's bet-progress tests (`tests/bets/<bet_slug>/test_slice_<n>_*`) — red,
 because the implementation does not exist. Implement until they pass, staying inside
@@ -120,31 +152,62 @@ quietly substitute a mock and move on — **stop and report it as a blocking con
 green and satisfy the API and data design. Stay within this slice. Do not refactor
 unrelated subsystems or reach into other slices' work.
-### 4. Mechanical self-reconcile (first pass, not the gate)
-A green suite proves nothing if the sealed prose was quietly altered or the code was
+### 4. Roll out the permanent best-practice tests
+The headline proof is green; now write the coverage that stays. The bet-progress
+tests prove the slice's capability once and are archived at bet close — the permanent
+best-practice tests are what guard the slice against regression for the life of the
+codebase, and they ship in *this* slice's diff so the review judges them alongside the
+implementation that they are meant to pin.
+What the slice owes is defined by the stack's testing strategy — the promoted engineer
+skill at `.agents/skills/groundwork-<stack>-engineer/references/testing.md`,
+specifically its **Bet Slice Rollout** section. Read it and roll out what this slice
+earns: the service-perimeter or interface test for each capability the slice
+delivered, unit tests for any genuinely complex logic it introduced, a property-based
+test for any invariant, a critical-path trace assertion where it added an observable
+path (a backend service that emits OpenTelemetry spans), and — for a `graphical-ui`
+slice — component render tests across the states the design system names (default,
+loading, empty, error, long-content) plus registering any new route in
+`tests/system/routes.json`. These live in the service repos and `tests/system/`, never
+in `tests/bets/`. Run them green before reporting.
+Match the depth to the slice's risk, not a fixed count — the strategy names which tier
+carries each assertion, and a sociable service test that executes a branch without
+asserting on it is a gap even when the suite is green. The independent coverage-auditor
+lens holds this suite against the same strategy, so an under-covered error case or a
+missing trace assertion surfaces in review: write the suite the strategy asks for, not
+the minimum that compiles.
+### 5. Mechanical self-reconcile (first pass, not the gate)
+A green suite proves nothing if the approved prose was quietly altered or the code was
 gamed to pass. Run two cheap checks and **report their result** — they are the
 worker's honest first pass, not the authoritative gate (the driver's independent
 review is that):
-- **Prose integrity.** The approved contract — the decomposition tree and technical
-  design — is sealed at the `bet/<bet_slug>/approved` tag.
-  `git diff bet/<bet_slug>/approved.. -- docs/bets/<bet_slug>/decomposition/ docs/bets/<bet_slug>/technical-design/`
-  must show no change. The worker never edits that prose; if a proof looks wrong, that
+- **Prose integrity.** The approved contract is the decomposition tree and technical
+  design.
+  `git status --short -- docs/bets/<bet_slug>/decomposition/ docs/bets/<bet_slug>/technical-design/`
+  must show no change — the worker never edits that prose. If a proof looks wrong, that
   is a blocking concern, not an edit.
-- **Honest green.** The implementation must satisfy the proof for the right reason. A
-  return value hardcoded to the test's expected output, an input special-cased to the
-  fixture, a `if TEST_MODE`-style branch, or a mocked-out unit of real work is a
-  defect even though the suite is green — *a weak suite that generated code passes is
-  worse than no suite* (`docs/principles/foundations/testing.md`). Flag any of these
-  in the report rather than leaving them for the review to find.
-### 5. Do not commit
-The worker implements to green and stops. It does **not** commit, roll out permanent
-best-practice tests, or close the slice — those are the driver's, after its
-independent review and triage. Leave the working tree with the slice's changes
-unstaged; return the report.
+- **Honest green.** The implementation must satisfy the proof for the right reason,
+  against the real product. A return value hardcoded to the test's expected output, an
+  input special-cased to the fixture, a `if TEST_MODE`-style branch, a mocked-out unit
+  of real work, or a fixture standing in for a real pipeline stage that nothing else
+  produces is a defect even though the suite is green — *a weak suite that generated
+  code passes is worse than no suite* (`docs/principles/foundations/testing.md`). If a
+  fake the slice leans on has no real test behind it, or the proof runs against a test
+  target rather than the shipping build, flag it. Surface any of these in the report
+  rather than leaving them for the review to find.
+### 6. Do not commit
+The worker implements to green, rolls out the permanent tests, and stops. It does
+**not** stage, commit, or close the slice — those are the driver's, after its
+independent review and triage. Leave the working tree with all of the slice's changes
+unstaged — the implementation, the bet-progress tests turned green, and the permanent
+best-practice tests — and return the report.
 ---
@@ -158,6 +221,9 @@ triage, and close the slice:
 SLICE: <n> <slice-slug>  (service: <owner-service>, surface: <core|slug>)
 BASELINE: <git rev-parse HEAD at open>
 SUITE: green | red — <one line: which slice tests pass; if still red, why>
+COVERAGE: <the permanent best-practice tests rolled out, by kind — service/interface,
+  unit, property, trace; for graphical-ui, the component states covered. Note any the
+  strategy asks for that this slice does not owe, and why.>
 FILES:
 - added: <path>, ...
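
The prose-integrity check in step 5 of the revised brief runs `git status --short` over the decomposition and technical-design directories and expects no output. A minimal Python sketch of that check follows. The helper names are hypothetical, and the parsing assumes the plain short format (two status characters, a space, then the path); paths that git quotes, and rename entries, would need extra handling.

```python
import subprocess


def changed_paths(status_output: str) -> list[str]:
    """Extracts the paths listed in `git status --short` output."""
    paths = []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        # Short format: two status characters, one space, then the path.
        paths.append(line[3:])
    return paths


def prose_untouched(status_output: str) -> bool:
    """True when the status output lists no changed path."""
    return not changed_paths(status_output)


def run_prose_check(bet_slug: str) -> list[str]:
    """Runs the check from the brief in the current working directory.

    `bet_slug` fills the `<bet_slug>` placeholder from the brief; the
    caller is assumed to be inside the bet's worktree.
    """
    base = f"docs/bets/{bet_slug}"
    result = subprocess.run(
        ["git", "status", "--short", "--",
         f"{base}/decomposition/", f"{base}/technical-design/"],
        capture_output=True,
        text=True,
        check=True,
    )
    return changed_paths(result.stdout)


sample_output = (
    " M docs/bets/demo/decomposition/01-first/01-slice.md\n"
    "?? docs/bets/demo/technical-design/notes.md\n"
)
```

An empty list from `run_prose_check` corresponds to the "must show no change" condition; any listed path is something the worker would report as a blocking concern rather than edit.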

package/src/hidden-skills/groundwork-bet/instructions.md CHANGED

@@ -21,8 +21,8 @@ Each phase establishes one thing the next phase depends on:
 - **Discovery** establishes the *what* and the *why*. It produces the pitch — the problem, the appetite, the solution sketch, the success signal, and the explicit no-gos. Without it, design has nothing to anchor against.
 - **Design Foundations** establishes the *contract*. It produces the technical design — UI design first (one subsection per in-scope surface), then the headless core beneath it: data flows and business logic, API design, and schema & data design, surface-neutral. Without a locked design, decomposition produces milestones and tests that contradict each other.
-- **Decomposition** establishes *the order of work and the proof*. With the design locked, it authors the full **milestone ladder** — demonstrable states ordered to front-load risk, each carrying an **un-mockable** headline Proof-of-work proof (a stub or double cannot satisfy it; this is the success signal made executable) — but slices only the **first milestone**; later rungs are sliced on arrival in Delivery, from what the milestones before them taught. Capability milestones prove at the contract, surface milestones in each surface's medium; every shape traces to the prose API/data design. No test code is written here. The user reviews proof by proof and approves; the approved prose is committed and **tagged** (`bet/<slug>/approved`) — a **ratchet seal**, not a one-time freeze: it covers the ladder plus the first milestone's slices now, and advances additively as Delivery opens (or adds) each later milestone. The tests are generated red at Delivery start from this approved prose. Without this Proof of Work, delivery has no proof to satisfy and no sequence to follow.
-- **Delivery** materializes the red board from the approved prose, then drives it green as an orchestration. The agent is the *driver* — it holds the thin spine (the board, the milestone order, the granularity the user chose, the triage judgement) and dispatches a fresh slice-worker subagent per slice (`briefs/slice-worker.md`) so the heavy implementation context stays disposable. It reviews each worker's diff through independent lenses, commits the slice, and at every milestone boundary runs a postmortem that confirms the milestone honestly proved its intent, re-checks the remaining ladder, and authors the next milestone's slices from what this one taught (introducing a new rung when the ladder is missing one, within appetite) — the seal ratcheting forward as it goes, course-correcting through the Amendment Protocol or Change Navigation when a sealed proof or the design is wrong, never editing sealed prose around. Delivery is offered at three cadences — slice by slice, milestone by milestone (default), or whole bet — which set where the driver pauses; hard stops pause regardless. The suite is the board (`./dev bet status`, derived not stored) and each slice's commit is its record, so progress is visible and resumable. As each slice completes, permanent best-practice tests are rolled out, and a per-slice reconciliation checks that the sealed prose has not silently moved. Without the Decomposition contract, every design question becomes a mid-implementation conversation made under coding pressure.
+- **Decomposition** establishes *the order of work and the proof*. With the design locked, it authors the full **milestone ladder** — thin, user-visible steps ordered to front-load risk, each carrying a headline Proof-of-work proof that **drives the real product through its real front door** (no stub, double, or scripted stand-in can satisfy it; any fake it leans on needs a real test behind it; this is the success signal made executable) — but slices only the **first milestone**; later rungs are sliced on arrival in Delivery, from what the milestones before them taught. Every milestone names a consumer who observes its outcome at their real surface; the first user-visible milestone lands the design system in the running app; the ladder must sum to a complete, well-rounded experience. Every shape traces to the prose API/data design. No test code is written here. The user reviews proof by proof and approves; the approved prose is committed as the recorded baseline, and changing *what a milestone proves* afterwards is an owner-approved amendment recorded in git history with a reason — steering how slices break down is free. The tests are generated red at Delivery start from this approved prose. Without this Proof of Work, delivery has no proof to satisfy and no sequence to follow.
+- **Delivery** materializes the red board from the approved prose, then drives it green as an orchestration. The agent is the *driver* — it holds the thin spine (the board, the milestone order, the granularity the user chose, the triage judgement) and dispatches a fresh slice-worker subagent per slice (`briefs/slice-worker.md`) so the heavy implementation context stays disposable. It reviews each worker's diff through independent lenses, commits the slice, proves each milestone at the front door (folding in the visual checks and a polish pass, with the experience-auditor lens judging the assembled milestone), and at every milestone boundary runs a postmortem that confirms the milestone honestly proved its intent, re-checks the remaining ladder, and authors the next milestone's slices from what this one taught (introducing a new rung when the ladder is missing one, within appetite) — recording each authored rung as it goes, course-correcting through the Amendment Protocol or Change Navigation when an approved proof or the design is wrong, never editing approved prose around without a recorded amendment. Delivery is offered at three cadences — slice by slice, milestone by milestone (default), or whole bet — which set where the driver pauses; hard stops pause regardless. The suite is the board (`./dev bet status`, derived not stored) and each slice's commit is its record, so progress is visible and resumable. As each slice completes, permanent best-practice tests are rolled out, and a per-slice reconciliation checks that the approved prose has not silently moved. Without the Decomposition contract, every design question becomes a mid-implementation conversation made under coding pressure.
 - **Validation** confirms the delivered bet behaves as designed, captures each touched service's served contract into the canonical `docs/architecture/api/` record, writes the bet's capability-ledger rows (when the project keeps a surface registry), archives the **whole bet** (docs and tests), runs the bet retrospective, and folds what the bet learned back into upstream documents for every subsequent bet.
 The lifecycle is sequential because each phase's output is the next phase's input. The order is structural, not procedural — gating design before decomposition is not a rule to follow but the only way the artifacts compose.
@@ -37,8 +37,8 @@ Each phase runs in its own workflow file because each demands a different mode.
 |---|---|---|---|
 | 1. Discovery | `workflows/01-discovery.md` | `discovery` | `docs/bets/<slug>/pitch.md` |
 | 2. Design Foundations | `workflows/02-design.md` | `design` | `docs/bets/<slug>/technical-design/` (`01-ui-design.md`, `02-data-flows.md`, `03-api-design.md`, `04-data-design.md`) |
-| 3. Decomposition | `workflows/03-decomposition.md` | `decomposition` | `docs/bets/<slug>/decomposition/` prose tree — full milestone ladder + first milestone sliced, approved and sealed (ratchet tag `bet/<slug>/approved`) |
-| 4. Delivery | `workflows/04-delivery.md` (driver) + `briefs/slice-worker.md` (per-slice subagent) | `delivery` | Red board materialized from the approved prose, then driven green milestone by milestone — slice-workers implement, the driver reviews/commits, and at each milestone boundary a postmortem course-corrects and opens the next milestone (authoring its slices, ratcheting the seal); each slice committed as its record |
+| 3. Decomposition | `workflows/03-decomposition.md` | `decomposition` | `docs/bets/<slug>/decomposition/` prose tree — full milestone ladder + first milestone sliced, approved and committed as the recorded baseline |
+| 4. Delivery | `workflows/04-delivery.md` (driver) + `briefs/slice-worker.md` (per-slice subagent) | `delivery` | Red board materialized from the approved prose, then driven green milestone by milestone — slice-workers implement, the driver reviews/commits, and at each milestone boundary a postmortem course-corrects and opens the next milestone (authoring and recording its slices); each slice committed as its record |
 | 5. Validation | `workflows/05-validation.md` | `validation` → `delivered` | Canonical `docs/architecture/api/` captured from running code; retrospective; whole bet archived |
 The pitch's frontmatter `status` field tracks where the bet sits in the lifecycle. Status transitions on entry to each phase and is the routing signal that lets a fresh context pick up the bet at the right place.
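A minimal sketch of that routing, assuming the pitch opens with a frontmatter block fenced by `---` lines and holds a plain `status: <value>` field. The status-to-workflow mapping is taken from the table above; the function name `route_bet` and the regex-based frontmatter read are hypothetical and are not part of the package.

```python
import re
from typing import Optional

# Status -> workflow file, copied from the phase table above.
WORKFLOW_BY_STATUS = {
    "discovery": "workflows/01-discovery.md",
    "design": "workflows/02-design.md",
    "decomposition": "workflows/03-decomposition.md",
    "delivery": "workflows/04-delivery.md",
    "validation": "workflows/05-validation.md",
}


def route_bet(pitch_text: str) -> Optional[str]:
    """Return the workflow file for the pitch's status, or None once delivered.

    Hypothetical helper: the frontmatter layout is an assumption.
    """
    # Assumed layout: the file starts with a block between two '---' lines.
    frontmatter = re.match(r"\A---\n(.*?)\n---", pitch_text, re.DOTALL)
    if frontmatter is None:
        raise ValueError("pitch has no frontmatter block")
    status = re.search(r"^status:\s*(\S+)\s*$", frontmatter.group(1), re.MULTILINE)
    if status is None:
        raise ValueError("frontmatter has no status field")
    value = status.group(1)
    if value == "delivered":
        return None  # terminal status: the bet has no further phase to enter
    return WORKFLOW_BY_STATUS[value]  # an unknown status raises KeyError
```

A fresh context would read `docs/bets/<slug>/pitch.md`, pass its text to the helper, and open the returned workflow file; an unrecognised status fails loudly rather than guessing a phase.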

package/src/hidden-skills/groundwork-bet/templates/bet-progress-test.md CHANGED

@@ -8,20 +8,14 @@
 Bet-progress tests are **temporary, black-box proof-of-work** materialized from the approved prose before any implementation exists. Each one renders the Proof-of-work prose of a milestone or slice — the proof the user already reviewed and approved in the decomposition tree — into a runnable red stub. At Delivery start the board is materialized for the whole **milestone ladder** plus the **first milestone's slices**; each later milestone's slice stubs are added when Delivery opens that milestone (its slices are authored then). So the board shows progress at milestone granularity before the later rungs are sliced. They assert what the milestone's consumer would observe if the feature were complete. Red means the work is not done. Green means it is proven. Running the suite is the bet's live progress board.
-A milestone's proof follows its type:
-**Capability-level tests** prove a capability milestone — they hit the contract directly, end-to-end against the running services over HTTP (or against the embedded core's public API), with no surface in the loop. Every business rule the bet introduces is proven here, exactly once.
-**Interface-level tests** prove a surface milestone — they assert what that surface's users observe, in that surface's medium:
-- `graphical-ui` — a browser-driven test that navigates to the feature and verifies what the user sees
+**A milestone test drives the real product through its consumer's front door.** It exercises the shipping build the way the milestone's consumer would, on the real pipeline and real data, and asserts what the consumer observes — in their surface's medium:
+- `graphical-ui` — a browser-driven (or platform-driven) test that navigates to the feature and verifies what the user sees on the running app
 - `cli` — a test that invokes the command and verifies the output, exit code, or side-effect
 - `agentic-protocol` — a test that sends a protocol request and verifies the response structure
-Interface-level tests resolve their target surface through the surfaces fixture (slug → entry point) and are bounded to wiring, rendering, and interaction. **They never re-prove core logic** — a business rule already proven at the contract re-asserted in a surface test is a review finding, because it multiplies the test pyramid by the surface count for nothing. When an interface test goes red against green capability tests, the failure is in the surface's adapter layer by elimination.
-When the project has no surface registry, the two layers pair up per milestone exactly as they always have: an interface-level test in the project's single medium, plus an API-level system test that localizes failures.
+The milestone test resolves its consumer's surface through the surfaces fixture (slug → entry point). It is the front-door proof: it runs the whole path end to end, not a back-channel against an internal contract with no consumer in the loop. Proving a backend contract directly is real work, but it belongs to a *slice* test, not a milestone test — the milestone proves the consumer's outcome.
-Slice tests prove the vertical capability a slice contributes toward its parent milestone. They are **informed by and bounded by** the parent milestone's tests — a slice test proves a specific capability; the milestone test proves the consumer-visible outcome those capabilities enable. A slice's `surface` field names the discipline that applies: `core` slices prove contract behaviour; surface slices prove wiring in their surface's medium.
+Slice tests prove the vertical capability a slice contributes toward its parent milestone, at that slice's service edge. They are **informed by and bounded by** the parent milestone's front-door proof — a slice test proves a specific capability; the milestone test proves the consumer-visible outcome those capabilities enable. A slice that builds a screen proves the screen renders and behaves through the pattern it implements in full.
 ---
@@ -69,18 +63,18 @@ tests/bets/_archive/<bet-slug>/
 Bet-progress tests reuse the shared fixtures from `tests/conftest.py`:
 - `cluster` — boots and health-checks all services; provides the running topology
-- `api_client` — an HTTP client configured with the discovered service base URLs; capability-level tests use this to hit the contract directly
+- `api_client` — an HTTP client configured with the discovered service base URLs; slice tests use this to exercise a service edge directly
 - `pure_state_reset` — truncates all service data stores before each test (autouse)
-- `surfaces` — the mapping from registry slug to that surface's entry point (base URL for a web surface, binary path for a CLI, protocol endpoint for an agentic surface); interface-level tests resolve their target surface here
+- `surfaces` — the mapping from registry slug to that surface's entry point (base URL for a web surface, binary path for a CLI, protocol endpoint for an agentic surface); milestone front-door tests resolve their consumer's surface here
 - `frontend_base_url` — the legacy alias for the single graphical surface's base URL; present when exactly one graphical surface exists, for suites written before the surfaces fixture
 Declare the fixtures you need as test-function parameters; pytest resolves them from the parent conftest automatically.
-For interface-level tests against a `graphical-ui` surface, the `page` fixture (from pytest-playwright) drives a real browser. For `cli` surfaces, use `subprocess` or `pexpect` to invoke the binary directly.
+For a front-door test against a `graphical-ui` surface, the `page` fixture (from pytest-playwright) drives a real browser. For `cli` surfaces, use `subprocess` or `pexpect` to invoke the binary directly.
 ## Capturing screenshots for the visual verification loop
-For `graphical-ui` interface tests, capture a screenshot of each key state of the screen under test — default, hover, focus, empty, loading, error — written to `.groundwork/cache/visual/<bet-slug>/<surface>/<state>.png` (create the directory first). These are the states where "renders broken" and "looks unfinished" both live, and the captures are what the delivery agent reads (Tier 2 inspection) and the validation fidelity critique grades (Tier 3). Capture after the state is reached and assertions pass — the screenshot records the proven state, it does not replace the assertion. Capture nothing for `cli` and `agentic-protocol` surfaces; their observable output is text, asserted directly.
+For `graphical-ui` interface tests, capture a screenshot of each key state of the screen under test — default, hover, focus, empty, loading, error — written to `.groundwork/cache/visual/<bet-slug>/<surface>/<state>.png` (create the directory first). These are the states where "renders broken" and "looks unfinished" both live, and the captures are what the delivery agent reads (Tier 2 inspection) and the experience-auditor review judges at milestone close and over the whole bet. Capture after the state is reached and assertions pass — the screenshot records the proven state, it does not replace the assertion. Capture nothing for `cli` and `agentic-protocol` surfaces; their observable output is text, asserted directly.
 **What capture sees, and what it does not.** A static screenshot verifies render correctness, coherence, and composition. It does *not* see motion (easing, durations, press physics) or perceived latency — both committed in the design system — so those stay behaviour-tested, asserted on timing and state, never on a frame. Do not treat a captured screen as proof of an animation or a latency budget.
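The capture location described above can be sketched as a small path helper. This is an illustration only: `capture_path` and `KEY_STATES` are hypothetical names, and rejecting states outside the six listed ones is an assumption rather than something the template requires.

```python
from pathlib import Path

# The key states named in the template text.
KEY_STATES = ("default", "hover", "focus", "empty", "loading", "error")


def capture_path(
    bet_slug: str,
    surface: str,
    state: str,
    root: Path = Path(".groundwork/cache/visual"),
) -> Path:
    """Build <root>/<bet-slug>/<surface>/<state>.png and create its directory."""
    if state not in KEY_STATES:
        # Assumption: only the listed key states are captured.
        raise ValueError(f"unknown capture state: {state!r}")
    path = root / bet_slug / surface / f"{state}.png"
    # The template says to create the directory first.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

In a pytest-playwright test this would typically be used once the state is reached and its assertions have passed, for example `page.screenshot(path=capture_path("my-bet", "web", "hover"))`, where the slug and surface values are placeholders.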
@@ -96,18 +90,13 @@ When the implementation does not exist yet, a test stub must be **explicitly red
 - Go: `t.Fatal("bet-progress test not yet implemented — <describe target state>")`
 - TypeScript: `throw new Error("bet-progress test not yet implemented — <describe target state>")`
-Comment the stub with what it will eventually assert, so the Delivery agent knows exactly what to implement. For a capability milestone stub:
-```
-# Contract: [the end-to-end behaviour the core should prove — request, response, persisted effect]
-```
-For a surface milestone stub:
+Comment the stub with what it will eventually assert, so the Delivery agent knows exactly what to implement. For a milestone stub, name the consumer's front-door outcome:
 ```
-# Surface <slug>: [what the user should observe in this surface's medium — wiring and rendering only]
+# Front door (<consumer> via <surface>): [what the consumer observes when they drive the real product — the action, what they see, on real data]
 ```
-For an untyped milestone (no surface registry), comment both layers:
+For a slice stub, name the capability at its service edge:
 ```
-# Layer 1 — interface: [what the user should observe in the product]
-# Layer 2 — API: [what the service should return end-to-end over HTTP]
+# Slice capability: [the behaviour at this slice's edge the milestone's front-door proof builds on]
 ```
 ---
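As a sketch, the two stub comments in the new version might render in Python as follows. The Python form of the explicit-red line is not visible in this hunk, so a plain `AssertionError` is assumed here. The function names are hypothetical, and under pytest they would carry a `test_` prefix so the runner collects them and injects the `surfaces` and `api_client` fixtures.

```python
# Assumed Python rendering of the milestone and slice stub comments.
NOT_IMPLEMENTED = "bet-progress test not yet implemented"


def milestone_front_door_stub(surfaces):
    # Front door (<consumer> via <surface>): [what the consumer observes when
    # they drive the real product: the action, what they see, on real data]
    raise AssertionError(f"{NOT_IMPLEMENTED}: <describe target state>")


def slice_capability_stub(api_client):
    # Slice capability: [the behaviour at this slice's edge the milestone's
    # front-door proof builds on]
    raise AssertionError(f"{NOT_IMPLEMENTED}: <describe target state>")
```

Both stubs fail unconditionally, so the board stays red until a slice-worker replaces the body with the assertion the comment describes.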
@@ -115,12 +104,12 @@ For an untyped milestone (no surface registry), comment both layers:
 ## What makes a good bet-progress test
 A bet-progress test is good when:
-- It asserts a **falsifiable, consumer-visible outcome** — at the contract or in a surface's medium, never an internal state
+- It asserts a **falsifiable, consumer-visible outcome** — what the consumer observes at their front door, never an internal state
 - It would fail if the feature shipped incomplete
-- For a **milestone headline**, it is **un-mockable** — it exercises the real dependency that makes the milestone meaningful (the live model, the real service, the actual store), so a stub, mock, or hardcoded return cannot turn it green. A milestone proof a double can satisfy proves plumbing, not the milestone (`workflows/03-decomposition.md`)
+- For a **milestone headline**, it **drives the real product through the real front door** — the consumer's action runs the shipping build on the real pipeline and real data, so a stub, mock, scripted driver, or hardcoded return cannot turn it green. A milestone proof a double can satisfy proves plumbing, not the milestone (`workflows/03-decomposition.md`). Seeded inputs are fine; faking the work in the middle is not
+- **Any fake it leans on has a real test behind it** — if the proof uses a fixture for work a real stage should do, another test exercises the real producer; a fixture nothing real ever generates is a green light wired to nothing (`docs/principles/foundations/testing.md`)
 - It would pass without any special knowledge of how the feature is implemented internally
-- It proves each business rule exactly once — at the contract; surface tests assert only wiring, rendering, and interaction
 - It is a **headline proof, not a permutation** — it proves the milestone's outcome or the slice's capability, not every input variant or error code
 - A reviewer can read it and confirm it matches the milestone's acceptance criteria and Proof-of-work prose in its `decomposition/NN-<milestone-slug>/index.md`
-**Keep the suite lean.** Bet-progress tests render the high-impact proofs the user reviewed and signed as prose — a milestone's consumer-visible outcome and each slice's vertical capability, typically one to three assertions apiece. If an assertion is an edge case, a permutation, or an error variant rather than the headline outcome, it does not belong here; it is part of the slice's permanent best-practice tests, written when the slice is built in Delivery (`workflows/04-delivery.md` Step 5). A wall of assertions is unreviewable upstream and front-loads coverage the delivery suite is meant to carry.
+**Keep the suite lean.** Bet-progress tests render the high-impact proofs the user reviewed and signed as prose — a milestone's consumer-visible outcome and each slice's vertical capability, typically one to three assertions apiece. If an assertion is an edge case, a permutation, or an error variant rather than the headline outcome, it does not belong here; it is part of the slice's permanent best-practice tests, written when the slice is built in Delivery (`workflows/04-delivery.md`, the Slice Loop). A wall of assertions is unreviewable upstream and front-loads coverage the delivery suite is meant to carry.
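For a `cli` surface, a lean front-door proof might look like the sketch below: invoke the binary as the consumer would and assert only the headline outcome (output and exit code). The helper name `run_cli` is hypothetical. The stand-in command exists solely so the sketch runs on its own; a real milestone test would take the binary path from the `surfaces` fixture and drive the shipping build.

```python
import subprocess
import sys


def run_cli(binary_args, *cli_args, timeout=30):
    """Invoke a CLI surface and capture its stdout, stderr, and exit code."""
    return subprocess.run(
        [*binary_args, *cli_args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Stand-in for the real binary, for illustration only (not a front-door proof).
STAND_IN = [sys.executable, "-c", "import sys; print('hello ' + sys.argv[1])"]

result = run_cli(STAND_IN, "world")
# Headline outcome only: one exit-code check and one output check.
assert result.returncode == 0
assert result.stdout.strip() == "hello world"
```

Error variants and input permutations are deliberately absent; per the text above they belong to the slice's permanent best-practice tests.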

package/src/hidden-skills/groundwork-bet/templates/change-proposal.md CHANGED

@@ -35,4 +35,4 @@
 ## Routing
-[Minor: edits applied on approval, mutated docs re-reviewed, affected proofs amended through the Amendment Protocol (edit the slice/milestone prose, re-tag `bet/<slug>/approved`, then change the code), slice resumed. Structural: bet reverts to Design Foundations with this proposal as input; delivered slices that survive the change are listed here.]
+[Minor: edits applied on approval, mutated docs re-reviewed, affected proofs amended through the Amendment Protocol (edit the slice/milestone prose, commit it beside the decomposition with a reason, then change the code), slice resumed. Structural: bet reverts to Design Foundations with this proposal as input; delivered slices that survive the change are listed here.]