npm - groundwork-method - Versions diffs - 0.10.0 → 0.11.0 - Mend

groundwork-method 0.10.0 → 0.11.0

This diff shows the content of publicly available package versions that have been released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.

Files changed (70)

package/CHANGELOG.md CHANGED

@@ -8,6 +8,48 @@ automatically when it detects a version jump.
 ## [Unreleased]
+## [0.11.0] - 2026-06-27
+### Fixed (the update report no longer reads as if it deletes authored skills, 2026-06-27)
+`groundwork update` computed its `.agents/skills/` diff against the package's shipped skills alone, so any skill authored beside the framework ones — a promoted engineer skill, a hand-written one — showed up as a red `-` removal line, reading as "your skill is being deleted." The install never touched those skills (it scopes its cleanup to framework-owned names), so the report contradicted the behavior, and the phantom removals also inflated the change total enough to suppress the calm "Already up to date" message on a run that changed nothing real. The report now reads ownership from the same set the installer uses (`ownedRegisteredSkillNames`): authored skills never appear in the diff, and a run whose only difference is an authored skill reports up-to-date.
+- [no-migration] Reporting-only change in the CLI; no project artifact shape, install behavior, or migration change. Authored skills were always preserved on disk — only the printed diff was wrong.
+### Changed (a bet lands as working, usable software — proven at the front door, 2026-06-27)
+A live test run exposed a structural hole: a bet closed all-green — UI tests plus a full package suite — yet the shipped app could not do its core job, because every proof faked the real work (a scripted driver, a fake worker path, hand-written thumbnails, a pre-loaded fake library) and no test ever drove the real product on real data the way a user would. The method *told* it to build this way: decomposition split milestones into a "capability" kind proven headless behind a fake and a "surface" kind wired later, even allowing a bet to finish headless. This release resets the core so a bet cannot be called done unless the real product works, looks right, and is usable.
+The milestone model collapses to one shape. A **milestone** is a thin, user-visible step proven by driving the shipping build through its real front door, on the real pipeline and real data, for a named consumer (a person at a screen, a developer calling an SDK, an operator reading a dashboard, a system calling the API — a pure-API product's front door is the API). A **slice** is a vertical cut through one service that builds toward a milestone, slices run in sequence each on the last, and the design system lands in the running app at the first user-visible milestone. The "un-mockable" rule is rewritten as drive-the-real-product-through-the-front-door plus **a fake a test leans on needs a real test behind it** (seeded inputs are fine; faking the work in the middle is the violation), and the ladder must sum to a complete, well-rounded experience — a dead-end screen or a silent-progress view is a *missing milestone*. Delivery proves each milestone at the front door (folding the visual tiers in and adding a polish pass), extends honest-green with the fake-behind-real and shipping-build checks, and Validation's success signal is the owner driving the real shipping product through the agreed front-door cases.
+Integrity moves from a heavyweight seal to a **lightweight recorded amendment trail**: the approved decomposition commit is the baseline, steering how slices break down is free, and changing *what a milestone proves* is an owner-approved amendment recorded in git history with a reason — the prose-integrity reconciliation reads that trail instead of a ratcheting `bet/<slug>/approved` tag (retired). The design phase gains a **proof-of-concept step** to de-risk unknowns (throwaway code, learning written into the technical design, a POC result never standing in for a milestone proof) and a best-in-class-pattern discipline (chosen per UX problem, implemented in full, accumulated into the design system). A new **experience-auditor** review lens (the designer persona) judges the assembled, running milestone and the whole bet for patterns-in-full, no dead ends, present states, design-system match, and the joy-to-use bar — distinct from the per-slice coverage auditor. The manifesto gains the front-door proof (belief #4 rewritten) and a UX-first-class belief (#9); `foundations/testing.md` gains the front-door-proof level above the honeycomb; `design/usability-and-ux.md` names the checkable floor (no dead ends, full async states) and the judged ceiling.
+The visual gate no longer fails silently on a surface it cannot test: when a graphical surface's platform has no runnable UI check, the `system-test-runner` generator emits a **fail-closed placeholder check** that fails with the named gap (pointing to `NATIVE-CHECK-CONTRACT.md`) instead of silently deleting the checks, and the web render-smoke floor gains a no-dead-end navigation assertion. The actual native UI checks (Flutter/Electron/native) are the named follow-on the placeholder turns green.
+- [no-migration] The bet workflows, templates, review briefs, principle docs, and engineer-skill references are framework-owned and clean-replace on update; the `system-test-runner` change only affects newly generated test suites. No project artifact shape, CLI, or migration change. In-flight bets carrying the retired `bet/<slug>/approved` tag are unaffected — the tag simply stops being read; reconciliation falls back to the git history of the decomposition prose.
+### Changed (engineer skills brought to full topic parity, and the canon→skill review chain healed, 2026-06-26)
+The five engineer skills are brought to deliberate coverage parity against the principle canon, and the structural gap that let them drift is closed. On testing, property-based testing and fuzzing (the canon's input-generation principle) are now first-class in the Go, Next.js, Flutter, and Electron testing references — `rapid` + `go test -fuzz`, `fast-check` + Schemathesis, `glados`, and policy-module fuzzing respectively — and the behaviour-naming template is explicit on every surface. Beyond testing, every cross-cutting concern is now consciously covered or marked not-applicable-with-reason on every stack: dedicated `security.md` (Python/Next.js/Flutter), `documentation.md` (Go/Flutter/Electron), `performance-and-reliability.md` (Flutter/Electron), and `observability.md` (Next.js/Flutter/Electron, client-adapted — distributed tracing stays at the capability core, the client carries crash/RUM telemetry); Next.js accessibility is promoted from a `ux-principles.md` section to its own reference. The per-stack principle docs that ship to projects (`docs/principles/stack/{go,python,flutter}/testing.md`) are refreshed to carry the current canon (honeycomb, in-memory span exporter, mutation, input-generation). Repo-map + Serena orientation is wired into each skill's mandatory first-action flow (Required First Checks / first How-to-Use step), not left as skippable advisory prose.
+The recurring-drift root cause is fixed. Each engineer skill's `sync-anchor.md` now pins the central-canon files it distils (testing, observability, security, documentation, performance, reliability, code-craft, accessibility — as applicable per stack) in addition to the per-stack idiom docs. Previously the skills pinned only the per-stack files, so a central-canon change forced zero engineer-skill review; now a canon edit fails `./dev test contracts` for every skill that embeds it, forcing the skill review and a per-stack reconcile in the same commit.
+- [no-migration] Engineer skills and their references are framework-owned and promoted into scaffolded projects' `.agents/skills/` per language (clean-replaced on update); the refreshed per-stack principle docs ship through their generators. No project artifact shape, CLI, or migration change.
+### Changed (delivery subagents orient through the repo map and Serena, 2026-06-26)
+The code-intelligence layer (the deterministic repo map + the Serena MCP server) was prescribed for the engineer persona but never threaded into the delivery orchestration capsules, so a slice-worker or reviewer used it only by chance — a live run found the delivery subagents navigating and verifying with ordinary reads and the compiler while an empty `.serena/` and no `repo-map.json` sat in the bet worktree. This closes the seam. The **slice-worker** brief now opens its context-assembly step by orienting through the map (refresh, read `centrality` for the hubs it lands among — qualified honestly, since centrality is real only for graph-fidelity stacks and a symbols-fidelity stack leans on the symbol index plus Serena) and running live impact analysis with `find_referencing_symbols` before changing any depended-on symbol. The **edge-case tracer** lens — the one review lens that already follows paths out of the diff into existing code — gains a reference-graph pass that enumerates a changed symbol's callers (Serena, or the repo-map edges offline) to catch the caller a dynamically-typed stack's compiler would miss; the blind reviewer stays deliberately diff-only so its blindness remains a distinct instrument. The **Delivery worktree bootstrap** now builds the map in the worktree and documents the `--project .` caveat (Serena resolves to the session root, so it is best-effort in a worktree — lean on the freshly-built map). Every addition carries the existing graceful-degradation contract: when the map or Serena is absent, fall back to ordinary reads and let the compiler and tests be the backstop.
+- [no-migration] The bet workflow, review briefs, and `code-intelligence.md` are framework-owned and clean-replace on update. No project artifact shape, CLI, or migration change — and no change to the repo-map generator or Serena registration; the tooling already existed, the delivery capsules simply now point workers at it.
+### Changed (delivery rolls out reviewed, comprehensive test coverage, 2026-06-26)
+The bet loop gated the *honesty* of a slice's headline proof but left the seam to comprehensive coverage ungated: permanent best-practice tests were rolled out *after* the slice review, so no lens ever judged them, and "the project's testing strategy" they were rolled out against named nothing. This closes that seam end to end. The slice-worker now rolls out the permanent best-practice tests as part of its own job (a new step), so they land in the reviewed diff; a fourth review lens — the **coverage auditor** — holds that suite against the stack's engineer-skill testing strategy for completeness (error/boundary matrix to happy-path rigour, complex-logic unit tests, critical-path trace assertions, named `graphical-ui` states) and assertion quality (a sociable test that executes a branch without asserting is a gap; a surviving mutant on changed high-risk code is the evidence). The four review lenses (blind reviewer, edge-case tracer, acceptance auditor, coverage auditor) are now first-class briefs the driver dispatches by path.
+The testing-principles canon is raised to current best practice and the five engineer skills realigned to it: the honeycomb is named as a stack-appropriate heuristic (the trophy for the frontend) anchored on independent-oracles-and-reproducibility, trace-first/ODD is made concrete with in-memory span-exporter assertions and helpers, mutation testing is positioned as the honeycomb's signal-only assertion-quality read-out (degrading gracefully where tooling is weak), and a new input-generation principle covers property-based testing, fuzzing, Schemathesis, and deterministic simulation. Flutter and Electron gain the Bet Slice Rollout sections they were missing.
+- [no-migration] The bet workflow, review briefs, review checklist, and engineer skills are framework-owned and clean-replace on update; the `src/docs/principles/` testing and observability canon propagate through `groundwork update`'s Tier-2 refresh/reconcile path. No project artifact shape, CLI, or migration change — existing bets in flight finish under the lens count they started with.
 ### Fixed (`update` prints a scannable change summary instead of dumping the whole changelog, 2026-06-26)
 A version-jump `groundwork update` dumped every line of every changelog entry in the range verbatim — for a 0.9.0 → 0.10.0 jump that is a multi-screen wall of prose. Worse, the "Migration required" list matched *any* line containing the `[migration]` token, so a prose sentence that merely mentions the token in backticks ("Changelog `[migration]` lines now reference registry ids") leaked into the list as a stray, prefix-stripped bullet ("Changelog `` lines now reference registry ids"). The renderer (`bin/groundwork.js`) now prints one bullet per change — `Category — headline`, derived from each `### Category (headline, date)` section header — and surfaces only genuine `- [migration]` bullets, with the token prefix stripped and the registry id preserved. `update --full` restores the verbatim entry dump for anyone who wants it, and a footer points at `CHANGELOG.md` / `--full` for detail.
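
The summary rule described in the last changelog entry above can be sketched as follows. This is a minimal illustration, not the code in `bin/groundwork.js`: the function name `summarizeChangelog`, both regular expressions, and the sample input in the usage note are assumptions made for the example.

```javascript
// Hypothetical sketch of the scannable-summary rule; names and regexes are assumptions.

// Matches a section header of the form "### Category (headline, YYYY-MM-DD)".
// The greedy headline group tolerates commas and backticks inside the headline.
const HEADER = /^### (\w+) \((.*), (\d{4}-\d{2}-\d{2})\)\s*$/;

// Matches only a genuine bullet that starts with the "[migration]" token.
// A prose line that merely mentions the token, or a "- [no-migration]" bullet,
// does not match because the pattern is anchored at the start of the line.
const MIGRATION = /^- \[migration\]\s*(.*)$/;

// Separator between category and headline (written as an escape on purpose).
const SEP = ' \u2014 ';

function summarizeChangelog(text) {
  const changes = [];
  const migrations = [];
  for (const line of text.split('\n')) {
    const header = line.match(HEADER);
    if (header) {
      changes.push(`${header[1]}${SEP}${header[2]}`);
      continue;
    }
    const migration = line.match(MIGRATION);
    if (migration) {
      // Token prefix stripped; whatever follows (e.g. a registry id) is kept.
      migrations.push(migration[1]);
    }
  }
  return { changes, migrations };
}
```

Under these assumptions, a prose sentence that quotes the token in backticks contributes nothing to `migrations`, which is the leak the entry describes fixing.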

package/bin/groundwork.js CHANGED

@@ -740,27 +740,96 @@ function writeUpgradeBrief(p, items, stamped) {
   return brief.items.filter((i) => i.status === 'pending').length;
 }
-// Clean-copy the two skill trees. Removing first prevents deprecated skills from lingering.
-// Throws on copy failure — callers abort rather than report success over a partial install.
-function installSkillTrees(p) {
-  for (const [src, dest, label] of [
-    [p.sourceSkillsDir, p.targetSkillsDir, 'Registered skills'],
-    [p.sourceHiddenSkillsDir, p.targetHiddenSkillsDir, 'Hidden methodology skills'],
-  ]) {
-    if (fs.existsSync(dest)) {
-      try {
-        fs.rmSync(dest, { recursive: true, force: true });
-      } catch (err) {
-        c.warn(`Failed to clean ${label.toLowerCase()} dir: ${err.message}`);
-      }
+// Top-level skill names under .agents/skills/ that the framework owns, per the manifest:
+// the first path segment of every tier-1 .agents/skills/<name>/... entry. Used to prune
+// framework skills that a past version shipped but the current one dropped. Returns [] for
+// a null/empty manifest (older or adopted installs) — in that case only currently-shipped
+// names are pruned, which can leave a since-removed framework skill lingering once. That is
+// strictly safer than the alternative: deleting a project-authored skill we don't recognize.
+function frameworkSkillNamesFromManifest(manifest) {
+  const names = new Set();
+  const files = (manifest && manifest.files) || {};
+  for (const [rel, entry] of Object.entries(files)) {
+    if (!entry || entry.tier !== 1) continue;
+    const m = rel.replace(/\\/g, '/').match(/^\.agents\/skills\/([^/]+)\//);
+    if (m) names.add(m[1]);
+  }
+  return [...names];
+}
+// The top-level .agents/skills/ entries the framework owns — the ONLY ones an update may
+// remove. Owned = names the framework ships in EITHER tree (registered src/skills/ ∪ hidden
+// src/hidden-skills/ — a hidden skill must never linger here after the .groundwork/skills
+// relocation) ∪ tier-1 skills the manifest remembers (to prune a registered skill a past
+// version shipped and this one dropped). Promoted engineer skills (separate src/engineer-skills/
+// tree) and any project-authored skill match none of these, so they are preserved. Both the
+// installer and the update report read ownership from here, so the report can never claim a
+// removal the install won't perform. Pass the already-loaded manifest to avoid re-reading it.
+function ownedRegisteredSkillNames(p, manifest) {
+  const shipped = fs.existsSync(p.sourceSkillsDir) ? fs.readdirSync(p.sourceSkillsDir) : [];
+  const hidden = fs.existsSync(p.sourceHiddenSkillsDir) ? fs.readdirSync(p.sourceHiddenSkillsDir) : [];
+  const m = manifest === undefined ? readManifest(p) : manifest;
+  return new Set([...shipped, ...hidden, ...frameworkSkillNamesFromManifest(m)]);
+}
+// The registered-tree diff, scoped to what installRegisteredSkills actually does: removals are
+// limited to files under framework-owned skill names. An authored skill (e.g. groundwork-swift-
+// engineer) lives outside that set, so the install leaves it untouched — and the report must not
+// flag its files as removed, which would read as "your skill is being deleted" when it isn't.
+function diffRegisteredSkills(p, manifest) {
+  const diff = diffDirs(p.sourceSkillsDir, p.targetSkillsDir);
+  const owned = ownedRegisteredSkillNames(p, manifest);
+  diff.removed = diff.removed.filter((f) => owned.has(f.split(path.sep)[0]));
+  return diff;
+}
+// .agents/skills/ is a SHARED directory: framework skills sit beside engineer skills the
+// scaffold promotes (tracked in manifest.generated) and any project-authored skills. We must
+// remove ONLY framework-owned top-level entries — never the whole tree — so an update can't
+// delete an authored skill (see ownedRegisteredSkillNames). Throws on copy failure — callers
+// abort over a partial install.
+function installRegisteredSkills(p) {
+  const dest = p.targetSkillsDir;
+  const shipped = fs.existsSync(p.sourceSkillsDir) ? fs.readdirSync(p.sourceSkillsDir) : [];
+  const owned = ownedRegisteredSkillNames(p);
+  fs.mkdirSync(dest, { recursive: true });
+  for (const name of owned) {
+    fs.rmSync(path.join(dest, name), { recursive: true, force: true });
+  }
+  if (shipped.length) {
+    try {
+      execSync(`cp -R "${p.sourceSkillsDir}/"* "${dest}/"`);
+    } catch (err) {
+      throw new Error(`Failed to install registered skills: ${err.message}`);
     }
-    fs.mkdirSync(dest, { recursive: true });
+  }
+}
+// The hidden methodology tree is exclusively framework-owned — nothing else is allowed to
+// live there — so a wholesale clean-replace is safe and prunes deprecated skills. Throws on
+// copy failure for the same reason as installRegisteredSkills.
+function installHiddenSkills(p) {
+  const dest = p.targetHiddenSkillsDir;
+  if (fs.existsSync(dest)) {
     try {
-      execSync(`cp -R "${src}/"* "${dest}/"`);
+      fs.rmSync(dest, { recursive: true, force: true });
     } catch (err) {
-      throw new Error(`Failed to install ${label.toLowerCase()}: ${err.message}`);
+      c.warn(`Failed to clean hidden methodology skills dir: ${err.message}`);
     }
   }
+  fs.mkdirSync(dest, { recursive: true });
+  try {
+    execSync(`cp -R "${p.sourceHiddenSkillsDir}/"* "${dest}/"`);
+  } catch (err) {
+    throw new Error(`Failed to install hidden methodology skills: ${err.message}`);
+  }
+}
+// Install both skill trees. The registered tree (.agents/skills/) is cleaned per-skill so
+// promoted engineer skills and project-authored skills survive; the hidden tree is clean-replaced.
+function installSkillTrees(p) {
+  installRegisteredSkills(p);
+  installHiddenSkills(p);
 }
 // generators.json ships with repo-relative factory/schema paths; resolve them against the
@@ -1191,7 +1260,7 @@ function updateGroundWork(flags = {}) {
   // Classify everything before touching anything, so the summary (and --dry-run)
   // reflects exactly what a real run performs.
-  const skillsDiff = diffDirs(p.sourceSkillsDir, p.targetSkillsDir);
+  const skillsDiff = diffRegisteredSkills(p, manifest);
   const hiddenDiff = diffDirs(p.sourceHiddenSkillsDir, p.targetHiddenSkillsDir);
   const generatorsConfig = buildGeneratorsConfig();
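
The ownership scoping introduced in this file can be illustrated without touching the filesystem. The sketch below is an assumption-laden simplification, not the shipped code: it takes directory listings as plain arrays instead of calling `fs.readdirSync`, and the helper names `ownedNames` and `scopeRemovals` are hypothetical. The manifest shape (`files[rel].tier`) and the path regex follow the diff above.

```javascript
// Hypothetical, filesystem-free sketch of the ownership-scoped removal filter.
const path = require('node:path');

// Top-level skill names the manifest records as tier-1 under .agents/skills/.
function frameworkSkillNamesFromManifest(manifest) {
  const names = new Set();
  const files = (manifest && manifest.files) || {};
  for (const [rel, entry] of Object.entries(files)) {
    if (!entry || entry.tier !== 1) continue;
    const m = rel.replace(/\\/g, '/').match(/^\.agents\/skills\/([^/]+)\//);
    if (m) names.add(m[1]);
  }
  return [...names];
}

// Owned = shipped registered skills, plus hidden skills, plus manifest tier-1 skills.
// `shipped` and `hidden` stand in for the two source directory listings.
function ownedNames(shipped, hidden, manifest) {
  return new Set([...shipped, ...hidden, ...frameworkSkillNamesFromManifest(manifest)]);
}

// Keep only removals whose first path segment is a framework-owned skill name,
// so files of an authored skill never show up as removed.
function scopeRemovals(removed, owned) {
  return removed.filter((f) => owned.has(f.split(path.sep)[0]));
}
```

With a manifest that remembers a dropped framework skill as tier 1, that skill's files stay in the removal list while an authored skill's files are filtered out, which is the behavior the changelog entry describes.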

package/dist/src/generators/system-test-runner/generator.js CHANGED

@@ -70,6 +70,30 @@ function parseSurfaces(raw) {
         ident: s.slug.replace(/-/g, '_'),
     }));
 }
+/** A fail-closed pytest stub for a graphical surface with no runnable UI check.
+ *  It fails (never skips) so the surface's UI proof is an honest red until a
+ *  platform check is implemented per NATIVE-CHECK-CONTRACT.md. Deleting it to go
+ *  green is the silent-skip this exists to prevent. */
+function uiCheckPlaceholder(s) {
+    return `import pytest
+# AUTO-GENERATED fail-closed placeholder — do not delete to go green.
+# Surface "${s.slug}" (test medium "${s.medium}") is a surface GroundWork has no
+# UI check runner for. A milestone cannot be proven on a surface nothing checks,
+# so this placeholder FAILS until a platform UI check is implemented for it per
+# src/generators/system-test-runner/NATIVE-CHECK-CONTRACT.md (render,
+# navigation / no dead ends, the named states, design-system token match).
+def test_${s.ident}_ui_check_not_implemented():
+    pytest.fail(
+        "No UI check runner for surface '${s.slug}' (medium '${s.medium}'). "
+        "Implement a platform UI check per "
+        "system-test-runner/NATIVE-CHECK-CONTRACT.md, then replace this "
+        "fail-closed placeholder. A graphical surface must not ship unverified."
+    )
+`;
+}
 async function systemTestRunnerGenerator(tree, options) {
     const projectPrefix = options.projectPrefix || 'groundwork';
     const interfaceMedium = options.interfaceMedium || 'graphical-ui';
@@ -84,6 +108,19 @@ async function systemTestRunnerGenerator(tree, options) {
     // it as a subprocess through the app's Nx target.
     const flutterSurfaces = (surfaces ?? []).filter((s) => s.medium === 'flutter-integration');
     const electronSurfaces = (surfaces ?? []).filter((s) => s.medium === 'playwright-electron');
+    // Every test medium GroundWork knows how to run a check for. A graphical
+    // surface registered with a medium outside this set has no UI check runner —
+    // and a milestone cannot be proven on a surface nothing checks. We refuse to
+    // silently leave it unverified: each such surface gets a fail-closed
+    // placeholder check (below) naming the gap, never a silent no-op.
+    const KNOWN_MEDIA = new Set([
+        'playwright',
+        'subprocess-cli',
+        'protocol-client',
+        'flutter-integration',
+        'playwright-electron',
+    ]);
+    const unsupportedSurfaces = (surfaces ?? []).filter((s) => !KNOWN_MEDIA.has(s.medium));
     // Playwright structure follows graphical surfaces: any playwright surface in
     // registry mode, the graphical-ui value in single-medium mode. pexpect ships
     // alongside the subprocess runners so interactive (REPL) CLI flows are testable.
@@ -106,10 +143,13 @@ async function systemTestRunnerGenerator(tree, options) {
         includePexpect,
         tmpl: ''
     });
-    // Playwright structure ships only with a graphical surface: the page-object
-    // package, the axe-core a11y smoke, and the render-smoke gate depend on
-    // pytest-playwright, which the pyproject template declares only when
-    // includePlaywright is set.
+    // Playwright structure ships only with a graphical web surface: the
+    // page-object package, the axe-core a11y smoke, and the render-smoke gate
+    // depend on pytest-playwright, which the pyproject template declares only when
+    // includePlaywright is set. Removing the web-specific gates here is correct —
+    // they genuinely cannot run without a web surface. What must never happen is a
+    // graphical surface left with no check at all; the placeholder below closes
+    // that gap fail-closed instead of silently.
     if (!includePlaywright) {
         tree.delete('tests/system/pages');
         tree.delete('tests/system/test_a11y_smoke.py');
@@ -118,6 +158,14 @@ async function systemTestRunnerGenerator(tree, options) {
         tree.delete('tests/system/test_visual_regression.py');
         tree.delete('tests/system/test_token_conformance.py');
     }
+    // Fail-closed: a graphical surface whose test medium GroundWork cannot run a
+    // check for gets a placeholder that FAILS naming the gap, never a silent skip.
+    // The scaffold still generates and its other tests run; this surface's UI
+    // proof is an honest red until a platform check is implemented per
+    // NATIVE-CHECK-CONTRACT.md — the follow-on that turns it green.
+    for (const s of unsupportedSurfaces) {
+        tree.write(`tests/system/test_${s.ident}_ui_check_missing.py`, uiCheckPlaceholder(s));
+    }
     await (0, devkit_1.formatFiles)(tree);
     (0, provenance_1.recordGeneratorProvenance)(tree, 'system-test-runner', options);
 }
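
To make the fail-closed rule concrete, the sketch below computes which placeholder files the generator would write for a given surface list. It is a simplified stand-in, assuming surfaces are plain `{ slug, medium }` objects; the helper name `placeholderPaths` and the medium `xcuitest` in the usage note are hypothetical. The medium set, the `ident` derivation, and the file-name pattern follow the diff above.

```javascript
// Hypothetical sketch: which surfaces get a fail-closed placeholder, and where.

// The test media the generator knows how to run a check for (from the diff).
const KNOWN_MEDIA = new Set([
  'playwright',
  'subprocess-cli',
  'protocol-client',
  'flutter-integration',
  'playwright-electron',
]);

// Returns the placeholder test paths for every surface whose medium has no runner.
// A null or undefined surface list yields no placeholders.
function placeholderPaths(surfaces) {
  return (surfaces ?? [])
    .filter((s) => !KNOWN_MEDIA.has(s.medium))
    .map((s) => {
      // Same slug-to-identifier rule as parseSurfaces: hyphens become underscores.
      const ident = s.slug.replace(/-/g, '_');
      return `tests/system/test_${ident}_ui_check_missing.py`;
    });
}
```

For example, a surface `{ slug: 'mac-app', medium: 'xcuitest' }` would map to `tests/system/test_mac_app_ui_check_missing.py`, while a `playwright` surface produces no placeholder.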

package/dist/src/generators/system-test-runner/generator.js.map CHANGED

	@@ -1 +1 @@
1	- {"version":3,"file":"generator.js","sourceRoot":"","sources":["../../../../src/generators/system-test-runner/generator.ts"],"names":[],"mappings":";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;~~AA2EA~~,~~8DAgEC~~;~~AA3ID~~,uCAIoB;AACpB,2CAA6B;AAC7B,qDAAiE;AA6BjE,SAAS,aAAa,CACpB,GAAuC;IAEvC,IAAI,GAAG,KAAK,SAAS,IAAI,GAAG,KAAK,IAAI,IAAI,GAAG,KAAK,EAAE,EAAE,CAAC;QACpD,OAAO,IAAI,CAAC;IACd,CAAC;IACD,IAAI,KAAc,CAAC;IACnB,IAAI,OAAO,GAAG,KAAK,QAAQ,EAAE,CAAC;QAC5B,IAAI,CAAC;YACH,KAAK,GAAG,IAAI,CAAC,KAAK,CAAC,GAAG,CAAC,CAAC;QAC1B,CAAC;QAAC,OAAO,CAAC,EAAE,CAAC;YACX,MAAM,IAAI,KAAK,CAAC,iCAAkC,CAAW,CAAC,OAAO,EAAE,CAAC,CAAC;QAC3E,CAAC;IACH,CAAC;SAAM,CAAC;QACN,KAAK,GAAG,GAAG,CAAC;IACd,CAAC;IACD,IAAI,CAAC,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,IAAI,KAAK,CAAC,MAAM,KAAK,CAAC,EAAE,CAAC;QAChD,MAAM,IAAI,KAAK,CACb,6EAA6E,CAC9E,CAAC;IACJ,CAAC;IACD,KAAK,MAAM,IAAI,IAAI,KAAsB,EAAE,CAAC;QAC1C,IACE,CAAC,IAAI;YACL,OAAO,IAAI,CAAC,IAAI,KAAK,QAAQ;YAC7B,IAAI,CAAC,IAAI,CAAC,MAAM,KAAK,CAAC;YACtB,OAAO,IAAI,CAAC,MAAM,KAAK,QAAQ;YAC/B,IAAI,CAAC,MAAM,CAAC,MAAM,KAAK,CAAC,EACxB,CAAC;YACD,MAAM,IAAI,KAAK,CACb,qDAAqD,IAAI,CAAC,SAAS,CAAC,IAAI,CAAC,EAAE,CAC5E,CAAC;QACJ,CAAC;IACH,CAAC;IACD,OAAQ,KAAuB,CAAC,GAAG,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC;QAC1C,GAAG,CAAC;QACJ,KAAK,EAAE,CAAC,CAAC,IAAI,CAAC,OAAO,CAAC,IAAI,EAAE,GAAG,CAAC;KACjC,CAAC,CAAC,CAAC;AACN,CAAC;AAEM,KAAK,UAAU,yBAAyB,CAC7C,IAAU,EACV,OAAwC;IAExC,MAAM,aAAa,GAAG,OAAO,CAAC,aAAa,IAAI,YAAY,CAAC;IAC5D,MAAM,eAAe,GAAG,OAAO,CAAC,eAAe,IAAI,cAAc,CAAC;IAClE,0EAA0E;IAC1E,+DAA+D;IAC/D,MAAM,QAAQ,GAAG,aAAa,CAAC,OAAO,CAAC,QAAQ,CAAC,CAAC;IAEjD,MAAM,iBAAiB,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,YAAY,CAAC,CAAC;IACpF,MAAM,WAAW,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,gBAAgB,CAAC,CAAC;IAClF,MAAM,gBAAgB,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,iBAAiB,CAAC,CAAC;IACxF,uEAAuE;IACvE,4EAA4E;IAC5E,kDAAkD;IAClD,MAAM,eAAe,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAA
C,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,qBAAqB,CAAC,CAAC;IAC3F,MAAM,gBAAgB,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,qBAAqB,CAAC,CAAC;IAE5F,6EAA6E;IAC7E,6EAA6E;IAC7E,iFAAiF;IACjF,MAAM,iBAAiB,GAAG,QAAQ;QAChC,CAAC,CAAC,iBAAiB,CAAC,MAAM,GAAG,CAAC;QAC9B,CAAC,CAAC,eAAe,KAAK,cAAc,CAAC;IACvC,MAAM,cAAc,GAAG,WAAW,CAAC,MAAM,GAAG,CAAC,CAAC;IAE9C,yCAAyC;IACzC,IAAA,sBAAa,EACX,IAAI,EACJ,IAAI,CAAC,IAAI,CAAC,SAAS,EAAE,IAAI,EAAE,IAAI,EAAE,IAAI,EAAE,IAAI,EAAE,KAAK,EAAE,YAAY,EAAE,oBAAoB,EAAE,OAAO,CAAC,EAChG,GAAG,EACH;QACE,GAAG,OAAO;QACV,aAAa;QACb,eAAe;QACf,QAAQ;QACR,iBAAiB;QACjB,WAAW;QACX,gBAAgB;QAChB,eAAe;QACf,gBAAgB;QAChB,iBAAiB;QACjB,cAAc;QACd,IAAI,EAAE,EAAE;KACT,CACF,CAAC;IAEF,~~4EAA4E~~;~~IAC5E~~,~~wEAAwE~~;~~IACxE~~,~~qEAAqE~~;~~IACrE~~,~~4BAA4B~~;~~IAC5B~~,IAAI,CAAC,iBAAiB,EAAE,CAAC;QACvB,IAAI,CAAC,MAAM,CAAC,oBAAoB,CAAC,CAAC;QAClC,IAAI,CAAC,MAAM,CAAC,iCAAiC,CAAC,CAAC;QAC/C,IAAI,CAAC,MAAM,CAAC,mCAAmC,CAAC,CAAC;QACjD,IAAI,CAAC,MAAM,CAAC,sCAAsC,CAAC,CAAC;QACpD,IAAI,CAAC,MAAM,CAAC,wCAAwC,CAAC,CAAC;QACtD,IAAI,CAAC,MAAM,CAAC,wCAAwC,CAAC,CAAC;IACxD,CAAC;IAED,MAAM,IAAA,oBAAW,EAAC,IAAI,CAAC,CAAC;IAExB,IAAA,sCAAyB,EAAC,IAAI,EAAE,oBAAoB,EAAE,OAA6C,CAAC,CAAC;AACvG,CAAC;AAED,kBAAe,yBAAyB,CAAC"}
1	+ {"version":3,"file":"generator.js","sourceRoot":"","sources":["../../../../src/generators/system-test-runner/generator.ts"],"names":[],"mappings":";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;AAoGA,8DA6FC;AAjMD,uCAIoB;AACpB,2CAA6B;AAC7B,qDAAiE;AA6BjE,SAAS,aAAa,CACpB,GAAuC;IAEvC,IAAI,GAAG,KAAK,SAAS,IAAI,GAAG,KAAK,IAAI,IAAI,GAAG,KAAK,EAAE,EAAE,CAAC;QACpD,OAAO,IAAI,CAAC;IACd,CAAC;IACD,IAAI,KAAc,CAAC;IACnB,IAAI,OAAO,GAAG,KAAK,QAAQ,EAAE,CAAC;QAC5B,IAAI,CAAC;YACH,KAAK,GAAG,IAAI,CAAC,KAAK,CAAC,GAAG,CAAC,CAAC;QAC1B,CAAC;QAAC,OAAO,CAAC,EAAE,CAAC;YACX,MAAM,IAAI,KAAK,CAAC,iCAAkC,CAAW,CAAC,OAAO,EAAE,CAAC,CAAC;QAC3E,CAAC;IACH,CAAC;SAAM,CAAC;QACN,KAAK,GAAG,GAAG,CAAC;IACd,CAAC;IACD,IAAI,CAAC,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,IAAI,KAAK,CAAC,MAAM,KAAK,CAAC,EAAE,CAAC;QAChD,MAAM,IAAI,KAAK,CACb,6EAA6E,CAC9E,CAAC;IACJ,CAAC;IACD,KAAK,MAAM,IAAI,IAAI,KAAsB,EAAE,CAAC;QAC1C,IACE,CAAC,IAAI;YACL,OAAO,IAAI,CAAC,IAAI,KAAK,QAAQ;YAC7B,IAAI,CAAC,IAAI,CAAC,MAAM,KAAK,CAAC;YACtB,OAAO,IAAI,CAAC,MAAM,KAAK,QAAQ;YAC/B,IAAI,CAAC,MAAM,CAAC,MAAM,KAAK,CAAC,EACxB,CAAC;YACD,MAAM,IAAI,KAAK,CACb,qDAAqD,IAAI,CAAC,SAAS,CAAC,IAAI,CAAC,EAAE,CAC5E,CAAC;QACJ,CAAC;IACH,CAAC;IACD,OAAQ,KAAuB,CAAC,GAAG,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC;QAC1C,GAAG,CAAC;QACJ,KAAK,EAAE,CAAC,CAAC,IAAI,CAAC,OAAO,CAAC,IAAI,EAAE,GAAG,CAAC;KACjC,CAAC,CAAC,CAAC;AACN,CAAC;AAED;;;uDAGuD;AACvD,SAAS,kBAAkB,CAAC,CAAsB;IAChD,OAAO;;;aAGI,CAAC,CAAC,IAAI,mBAAmB,CAAC,CAAC,MAAM;;;;;;;WAOnC,CAAC,CAAC,KAAK;;2CAEyB,CAAC,CAAC,IAAI,cAAc,CAAC,CAAC,MAAM;;;;;CAKtE,CAAC;AACF,CAAC;AAEM,KAAK,UAAU,yBAAyB,CAC7C,IAAU,EACV,OAAwC;IAExC,MAAM,aAAa,GAAG,OAAO,CAAC,aAAa,IAAI,YAAY,CAAC;IAC5D,MAAM,eAAe,GAAG,OAAO,CAAC,eAAe,IAAI,cAAc,CAAC;IAClE,0EAA0E;IAC1E,+DAA+D;IAC/D,MAAM,QAAQ,GAAG,aAAa,CAAC,OAAO,CAAC,QAAQ,CAAC,CAAC;IAEjD,MAAM,iBAAiB,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,YAAY,CAAC,CAAC;IACpF,MAAM,WAAW,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,gBAAgB,CAAC,CAAC;IAClF,MAAM,gBAAgB,GAAG,CAAC,QAAQ
,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,iBAAiB,CAAC,CAAC;IACxF,uEAAuE;IACvE,4EAA4E;IAC5E,kDAAkD;IAClD,MAAM,eAAe,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,qBAAqB,CAAC,CAAC;IAC3F,MAAM,gBAAgB,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,KAAK,qBAAqB,CAAC,CAAC;IAE5F,yEAAyE;IACzE,6EAA6E;IAC7E,6EAA6E;IAC7E,qEAAqE;IACrE,kEAAkE;IAClE,MAAM,WAAW,GAAG,IAAI,GAAG,CAAC;QAC1B,YAAY;QACZ,gBAAgB;QAChB,iBAAiB;QACjB,qBAAqB;QACrB,qBAAqB;KACtB,CAAC,CAAC;IACH,MAAM,mBAAmB,GAAG,CAAC,QAAQ,IAAI,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,WAAW,CAAC,GAAG,CAAC,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC;IAEvF,6EAA6E;IAC7E,6EAA6E;IAC7E,iFAAiF;IACjF,MAAM,iBAAiB,GAAG,QAAQ;QAChC,CAAC,CAAC,iBAAiB,CAAC,MAAM,GAAG,CAAC;QAC9B,CAAC,CAAC,eAAe,KAAK,cAAc,CAAC;IACvC,MAAM,cAAc,GAAG,WAAW,CAAC,MAAM,GAAG,CAAC,CAAC;IAE9C,yCAAyC;IACzC,IAAA,sBAAa,EACX,IAAI,EACJ,IAAI,CAAC,IAAI,CAAC,SAAS,EAAE,IAAI,EAAE,IAAI,EAAE,IAAI,EAAE,IAAI,EAAE,KAAK,EAAE,YAAY,EAAE,oBAAoB,EAAE,OAAO,CAAC,EAChG,GAAG,EACH;QACE,GAAG,OAAO;QACV,aAAa;QACb,eAAe;QACf,QAAQ;QACR,iBAAiB;QACjB,WAAW;QACX,gBAAgB;QAChB,eAAe;QACf,gBAAgB;QAChB,iBAAiB;QACjB,cAAc;QACd,IAAI,EAAE,EAAE;KACT,CACF,CAAC;IAEF,oEAAoE;IACpE,0EAA0E;IAC1E,+EAA+E;IAC/E,8EAA8E;IAC9E,+EAA+E;IAC/E,4EAA4E;IAC5E,4CAA4C;IAC5C,IAAI,CAAC,iBAAiB,EAAE,CAAC;QACvB,IAAI,CAAC,MAAM,CAAC,oBAAoB,CAAC,CAAC;QAClC,IAAI,CAAC,MAAM,CAAC,iCAAiC,CAAC,CAAC;QAC/C,IAAI,CAAC,MAAM,CAAC,mCAAmC,CAAC,CAAC;QACjD,IAAI,CAAC,MAAM,CAAC,sCAAsC,CAAC,CAAC;QACpD,IAAI,CAAC,MAAM,CAAC,wCAAwC,CAAC,CAAC;QACtD,IAAI,CAAC,MAAM,CAAC,wCAAwC,CAAC,CAAC;IACxD,CAAC;IAED,6EAA6E;IAC7E,+EAA+E;IAC/E,0EAA0E;IAC1E,mEAAmE;IACnE,gEAAgE;IAChE,KAAK,MAAM,CAAC,IAAI,mBAAmB,EAAE,CAAC;QACpC,IAAI,CAAC,KAAK,CACR,qBAAqB,CAAC,CAAC,KAAK,sBAAsB,EAClD,kBAAkB,CAAC,CAAC,CAAC,CACtB,CAAC;IACJ,CAAC;IAED,MAAM,IAAA,oBAAW,EAAC,IAAI,CAAC,CAAC;IAExB,IAAA,sCAAyB,EAAC,IAAI,EAAE,oBAAoB,EAAE,OAA6C,CAAC,CAAC;AACvG,CAAC;AAED,kBAAe,yBAAyB,CAAC"}

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "groundwork-method",
-  "version": "0.10.0",
+  "version": "0.11.0",
   "description": "An installable delivery system for AI-driven software development: facilitated discovery to canonical docs, generators to a booted monorepo, and a contract-gated bet delivery loop.",
   "bin": {
     "groundwork": "./bin/groundwork.js"

package/src/docs/principles/design/usability-and-ux.md CHANGED

@@ -44,6 +44,14 @@ The cheapest error to recover from is the one that cannot occur, so we design er
 Users forage by scent: they choose links and buttons by the payoff each label predicts, so labels are descriptive and specific and match the content they lead to — never "click here," "learn more," or a bare "submit." Every screen answers the wayfinding questions — where am I, how did I get here, where can I go, how do I get back — through an active navigation state, a clear title, breadcrumbs in deep hierarchies, and consistent, persistent navigation. Disorientation is a silent driver of abandonment.
+### 8. Usable has a floor you can check and a ceiling you judge
+"Usable" splits into two halves that need different instruments. The **floor is checkable**: every screen is reachable and has a way back, so no flow dead-ends; every asynchronous view carries its full set of states — empty, loading, in-progress, error — not just the happy one a demo hits. These are verifiable, and their absence is a defect a review catches: a screen that works but shows no progress reads as frozen, and a grid with no empty state reads as broken on first run. The **ceiling is judged**: whether the screens cohere, whether the product is a pleasure to use, whether it feels considered. That judgment is made by eye, the way a designer reviews work, against the design system and the experience the product is reaching for. Hold the floor as a gate and the ceiling as a bar — clear the first, then keep raising the second.
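One way to make the checkable half of the floor mechanical is to model an asynchronous view as a discriminated union, so the compiler rejects a renderer that skips a state. This is a minimal sketch under assumptions: `AsyncViewState`, `describeView`, and the message strings are hypothetical names chosen for illustration and are not part of groundwork-method.

```typescript
// Hypothetical model of one asynchronous view: each state the floor
// requires (loading, in-progress, empty, error) is its own variant.
type AsyncViewState<T> =
  | { kind: "loading" }
  | { kind: "in-progress"; done: number; total: number }
  | { kind: "empty" }
  | { kind: "error"; message: string }
  | { kind: "ready"; items: T[] };

// Returns the text a view would show for a given state. The `never`
// assignment in the default branch makes compilation fail if a variant
// is added to the union without a matching case here.
function describeView<T>(state: AsyncViewState<T>): string {
  switch (state.kind) {
    case "loading":
      return "Loading";
    case "in-progress":
      return `Processing ${state.done} of ${state.total}`;
    case "empty":
      return "Nothing here yet";
    case "error":
      return `Something went wrong: ${state.message}`;
    case "ready":
      return `${state.items.length} items`;
    default: {
      const unreachable: never = state;
      throw new Error(`unhandled state: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The type only covers the "every state is accounted for" half of the floor. Reachability and a way back still need a navigation-level check, and the ceiling remains a judgment made by eye.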
+### 9. Solve UX problems with the patterns the best products use now, implemented fully
+For a recurring UX problem there is usually a current best-in-class solution the leading products have converged on: the removable filter pill with its clear affordance, the skeleton frame that holds layout while content loads, modern search and pagination. Reaching for these gives a result that is forward-leaning and familiar at once, because the leaders made them the standard. The discipline is to implement the pattern **completely**, every affordance it implies: a filter pill that shows but does not remove is a worse experience than no pill, because it promises an interaction it does not honour. Draw on what modern products already do and on the project's own design references, then turn the chosen pattern into a real component in the design system so the next screen inherits it rather than re-inventing a thinner version.
 ## How we apply this
 - [Interaction & Motion](interaction-and-motion.md) — the state, feedback, and perceived-performance decisions usability depends on.
@@ -59,6 +67,9 @@ Users forage by scent: they choose links and buttons by the payoff each label pr
 - **Confirmation-dialog overuse.** "Are you sure?" on routine reversible actions, training the reflexive click that defeats the dialog's purpose.
 - **Asking for the known.** Requiring data the system already has or could derive.
 - **Scentless labels.** "Click here" / "Learn more" / generic "Next" that predict nothing about their destination.
+- **Dead-end screens.** A view a user can reach but not leave, or a flow with no way back to where they came from.
+- **Happy-path-only states.** An async view that renders when data arrives but shows nothing while it loads, nothing when it is empty, and nothing when it fails — so working software looks frozen and a first run looks broken.
+- **The half-built pattern.** A recognised pattern shipped as a shell — a filter pill that does not remove, a skeleton that never resolves — promising an affordance it does not deliver.
 ## Further reading

package/src/docs/principles/foundations/testing.md CHANGED

@@ -2,13 +2,13 @@
 title: Testing
 description: Continuous Risk Assurance — testing the system, not the mock of the system.
 status: active
-last_reviewed: 2026-06-19
+last_reviewed: 2026-06-26
 ---
 # Testing
 ## TL;DR
-Tests are risk-weighted assertions about production behaviour — not boxes ticked for coverage. We favour high-fidelity service tests over solitary unit tests, run dependencies we own as real ephemeral containers rather than mocking them, contract-test the ones we don't, and treat observability signals as first-class assertions. The measure of a suite is whether its assertions actually catch faults — not its line-coverage number.
+Tests are risk-weighted assertions about production behaviour — not boxes ticked for coverage. We favour high-fidelity service tests over solitary unit tests, run dependencies we own as real ephemeral containers rather than mocking them, contract-test the ones we don't, and treat observability signals as first-class assertions. Above the honeycomb sits one more level: a proof that drives the real shipping build through its front door on the real pipeline, because parts that each pass in isolation can still assemble into a product that does nothing — and a fake a test leans on needs a real test behind it. The measure of a suite is whether its assertions actually catch faults — not its line-coverage number. The invariant under all of it: a test that captures whatever the system currently does is worthless unless something *independent* of the implementation asserts that behaviour is correct. Independent oracles and reproducible failures are the spine; the distribution shape is a detail teams over-argue.
 ## Why this matters
@@ -20,7 +20,9 @@ This matters more, not less, as code generation gets cheaper. When an agent can
 ### 1. Favour service tests over solitary unit tests
-The "sociable" service test is our foundational unit of validation. We test from the API entry point through to real, ephemeral database containers. In a service-oriented codebase the interesting bugs live at the boundaries — HTTP serialisation, SQL query correctness, transaction semantics, event emission — and those are exactly what solitary unit tests mock away. This is the *test honeycomb* shape popularised by Spotify's engineering teams: a fat middle of integrated service tests, a thin layer of solitary unit tests, and a few end-to-end checks on top — not the classic Mike Cohn pyramid that pushes most weight onto isolated units.
+Our default shape is the **test honeycomb**, popularised by Spotify's engineering teams: a fat middle of integrated, "sociable" service tests, a thin layer of solitary unit tests, and a few end-to-end checks on top — not the classic Mike Cohn pyramid that pushes most weight onto isolated units. We test from the API entry point through to real, ephemeral database containers, because in a service-oriented codebase the interesting bugs live at the boundaries — HTTP serialisation, SQL query correctness, transaction semantics, event emission — exactly what solitary unit tests mock away.
+The honeycomb is a stack-appropriate heuristic, not a law. No empirical study ranks the pyramid, honeycomb, and trophy by defect detection — they are practitioner shapes for different interaction surfaces (service-to-service for the honeycomb, component-interaction for Kent Dodds's frontend trophy), and the word "integration" means something far cheaper in one than the other. What the evidence does support is that test *quality* outweighs distribution: a suite of fast, reliable, expressive tests that fail only for useful reasons beats any ratio of tests that don't. So pick the shape that fits the stack — the honeycomb for our backends, the trophy for a frontend — and spend the saved argument on making each test bite.
 The honest tension: service tests buy fidelity at the cost of speed and diagnostic precision. A solitary unit test that fails names the broken function; a service test that fails tells you "the create-order flow is broken" and leaves you to find where. And a slow, flaky service layer is corrosive — teams that can't trust or tolerate it quietly retreat to mocking everything, which is the exact failure this principle exists to prevent. So fidelity is not a licence to be slow: keep service tests parallelisable, keep fixtures cheap, and treat suite latency as a first-class defect.
@@ -39,7 +41,9 @@ Decision rule: emulate the data and serialisation boundaries you own; contract-t
 ### 3. Observability is a test surface
-OpenTelemetry instrumentation is a design-time concern, not an afterthought. System tests assert that traces are unbroken end-to-end: a missing span, a lost TraceID, or a broken parent-child relationship is a test failure, not an instrumentation TODO. The boundary between "test" and "monitor" dissolves — both ask whether the system is behaving as we claim. The payoff is double-counted: the same instrumentation that proves correctness in CI is what lets you debug the incident in production.
+OpenTelemetry instrumentation is a design-time concern, not an afterthought — sketch the trace a feature should produce before writing the handler (the observability-driven development stance, [Observability](../quality/observability.md) principle 5). System tests then assert that traces are unbroken end-to-end: a missing span, a lost TraceID, or a broken parent-child relationship is a test failure, not an instrumentation TODO. The boundary between "test" and "monitor" dissolves — both ask whether the system is behaving as we claim. The payoff is double-counted: the same instrumentation that proves correctness in CI is what lets you debug the incident in production.
+The mechanism is an **in-memory span exporter**: register one in the test process, exercise the system, and assert on the finished spans — the DB span exists with the attributes a dashboard query depends on, the spans emit in the expected order, the TraceID propagates across a service hop. This is a built-in capability of every OTel SDK, and it is the durable approach now that the dedicated trace-based-testing tools (Tracetest, Malabi) have gone dormant. Assert on what the contract promises and let the rest float (the over-assertion trap is real — see [Observability](../quality/observability.md) principle 6). "Trace coverage" as a *metric* — a line-or-branch-coverage equivalent for spans — is still aspirational research, not a number to gate on; the proven practice is traces-as-assertions, not a coverage percentage.
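The traces-as-assertions pattern described above can be sketched without any dependency. Everything here is an assumption made for illustration: `InMemoryExporter`, `withSpan`, `createOrder`, and the span fields are hand-rolled stand-ins, not the OpenTelemetry SDK's API. A real test would register the SDK's own in-memory exporter instead.

```typescript
// Hypothetical stand-ins for an SDK's finished span and in-memory exporter,
// kept dependency-free so the assertion pattern is visible on its own.
interface FinishedSpan {
  name: string;
  traceId: string;
  spanId: string;
  parentSpanId: string | undefined;
  attributes: Record<string, string>;
}

interface SpanContext {
  traceId: string;
  spanId: string;
}

class InMemoryExporter {
  private readonly spans: FinishedSpan[] = [];
  export(span: FinishedSpan): void {
    this.spans.push(span);
  }
  getFinishedSpans(): readonly FinishedSpan[] {
    return this.spans;
  }
}

let nextId = 0;
const newId = (): string => `id-${++nextId}`;

// Runs `work` inside a span and exports the span once the work finishes.
// A child inherits the parent's traceId and records the parent's spanId.
function withSpan<T>(
  exporter: InMemoryExporter,
  name: string,
  attributes: Record<string, string>,
  parent: SpanContext | undefined,
  work: (ctx: SpanContext) => T,
): T {
  const ctx: SpanContext = { traceId: parent ? parent.traceId : newId(), spanId: newId() };
  try {
    return work(ctx);
  } finally {
    exporter.export({
      name,
      attributes,
      traceId: ctx.traceId,
      spanId: ctx.spanId,
      parentSpanId: parent ? parent.spanId : undefined,
    });
  }
}

// The system under test: a handler that performs one database call.
function createOrder(exporter: InMemoryExporter): void {
  withSpan(exporter, "POST /orders", {}, undefined, (http) => {
    withSpan(exporter, "db.insert", { "db.system": "postgresql" }, http, () => undefined);
  });
}

// The test: exercise the system, then look up only the spans the contract names.
const exporter = new InMemoryExporter();
createOrder(exporter);
const finished = exporter.getFinishedSpans();
const root = finished.filter((s) => s.name === "POST /orders")[0];
const db = finished.filter((s) => s.name === "db.insert")[0];
```

The assertions that follow would check that `db` exists, that it shares `root.traceId`, that its `parentSpanId` equals `root.spanId`, and that the one attribute a dashboard query depends on is present. The exact span tree and the remaining attributes are left to float.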
 ### 4. Name tests by behaviour, not implementation
@@ -49,7 +53,9 @@ A test name must let an on-call engineer form a hypothesis from the failure log
 Coverage percentages are meaningless without proof that the assertions catch real faults — a suite can execute every line and assert nothing. We score modules on Impact × Complexity × Change-frequency before deciding test depth: high-risk modules earn live system tests and chaos experiments; low-risk modules need only small tests and static analysis. Equal depth everywhere is wasted effort.
-The honest measure of whether assertions bite is **mutation testing** (PIT, Stryker, or equivalent): inject deliberate faults and confirm a test fails. A surviving mutant is a line you cover but do not actually check. Mutation testing is expensive — its naive cost is the suite run times the number of mutants — so don't run it across the whole tree. Run it on the high-risk modules the matrix flags, and on changed code in CI, where it doubles as a quality gate on new tests (the use Meta reported for its LLM-assisted mutation work in 2025). Use it as a periodic read-out of assertion quality, never as a blanket gate.
+The honest measure of whether assertions bite is **mutation testing** (PIT, Stryker, mutmut, or equivalent): inject deliberate faults and confirm a test fails. A surviving mutant is a line you cover but do not actually check. This is the honeycomb's natural complement: a fat sociable service test drives a huge number of branches through one HTTP call, and it is easy for it to *execute* them all while only asserting on the response body; mutation testing is the one instrument that proves the suite checks what it runs rather than merely exercising it. It correlates with real fault detection better than coverage does, though that advantage disappears once you control for suite size, so treat it as a quality read-out, not a bug-finding proxy.
+Mutation testing is expensive — its naive cost is the suite run times the number of mutants — so never run it across the whole tree and never make it a blanket gate. Run it on the high-risk modules the matrix flags and on changed code only, the model Google operates at scale: incremental, mutate-the-diff, surfaced in review. Tooling maturity is uneven and the guidance degrades gracefully with it — Stryker (JS/TS), PIT (JVM), and mutmut/cosmic-ray (Python) are production-grade; Go's options are pre-1.0 and slow, so there it stays a hand-run spot check, not an expectation. The same read-out is the antidote to AI-generated tests, whose oracles are derived from the current implementation and so cement existing bugs as expected behaviour: generate the test, mutate the code under it, and feed any surviving mutant back as the missing assertion — the assurance filter that turns a coverage-inflating suite into one that bites.
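The idea of a surviving mutant can be shown with a hand-written example. This is a sketch under assumptions, not how Stryker, PIT, or mutmut work internally: the pricing rule, the single hand-made mutant, and the names `original`, `mutant`, `weakSuite`, `strongSuite`, and `kills` are all hypothetical.

```typescript
type Pricing = (total: number) => number;

// Original rule: orders of 100 or more get 10% off.
const original: Pricing = (total) => (total >= 100 ? total - total / 10 : total);
// Hand-written mutant: the boundary operator `>=` is changed to `>`.
const mutant: Pricing = (total) => (total > 100 ? total - total / 10 : total);

// A suite is modelled as a function that returns true when every assertion passes.
type Suite = (impl: Pricing) => boolean;

// Executes both branches, so line coverage is complete, but never checks the boundary.
const weakSuite: Suite = (impl) => impl(50) === 50 && impl(200) === 180;
// Adds the one boundary case whose result the mutant changes.
const strongSuite: Suite = (impl) => weakSuite(impl) && impl(100) === 90;

// A mutant is killed when a suite that passes on the original fails on the mutant.
function kills(suite: Suite): boolean {
  return suite(original) && !suite(mutant);
}
```

Here `kills(weakSuite)` is `false`, because the mutant survives a suite that covers every line, and `kills(strongSuite)` is `true`. The missing assertion is exactly the input on which the mutant and the original disagree, which is the feedback loop the paragraph above describes for generated tests.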
 ### 6. Tests are part of the change, not after it
@@ -57,6 +63,20 @@ A feature PR without tests is incomplete, and we review the test with the same r
 This is a discipline about *what ships together*, not a mandate to write tests first. Test-first (TDD) is a powerful design tool — it forces you to use your own interface before committing to it — but it is a tool, not a law, and the "Is TDD Dead?" exchange between Kent Beck, Martin Fowler, and DHH named the real cost: dogmatic test-first can induce *design damage*, contorting code with needless indirection purely to make it mockable. Hold both signals. If a change resists testing, that usually means the design is wrong — fix the code. But if the *only* way to test it is to shatter a cohesive unit into layers of indirection nothing else needs, the test is making the demand, and the design was right. Write the test with the change; let it pressure the design; don't let it deform the design.
+### 7. Generate the inputs you can't enumerate
+Example-based tests check the cases you thought of; the bugs live in the cases you didn't. Where the input space is large and a property holds across all of it — a round-trip (`decode ∘ encode = id`), a parser that must never panic, a calculation with an algebraic invariant, a state machine whose transitions must preserve a constraint — assert the property and let the framework generate and shrink counterexamples (Hypothesis, fast-check, jqwik, rapid). This is the highest-leverage complement to the dense-logic unit tests of principle 1: one property covers an infinity of examples, and in practice most caught faults surface on a single generated input, so it earns its keep cheaply. The cost is authoring — a meaningful property needs domain insight and a generator — so reach for it where invariants are real, not everywhere.
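A round-trip property of the kind named above can be sketched with a hand-rolled generator. The run-length codec, the seeded generator, and the names `encode`, `decode`, `makeRandom`, `randomString`, and `findCounterexample` are assumptions made for this illustration. A framework such as fast-check or Hypothesis would also shrink a failing input, which this sketch does not do.

```typescript
interface Run {
  char: string;
  count: number;
}

// Run-length encodes a string: "aaab" becomes [{a,3},{b,1}].
function encode(input: string): Run[] {
  const runs: Run[] = [];
  for (let i = 0; i < input.length; i++) {
    const char = input.charAt(i);
    const last = runs[runs.length - 1];
    if (last !== undefined && last.char === char) {
      last.count += 1;
    } else {
      runs.push({ char, count: 1 });
    }
  }
  return runs;
}

// Inverse of encode: expands each run back into repeated characters.
function decode(runs: Run[]): string {
  let out = "";
  for (const run of runs) {
    for (let i = 0; i < run.count; i++) out += run.char;
  }
  return out;
}

// Small seeded linear congruential generator, so any failure reproduces from its seed.
function makeRandom(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Generates a string of 0 to 19 characters over a two-letter alphabet,
// which makes long runs and run boundaries both likely.
function randomString(next: () => number): string {
  const length = Math.floor(next() * 20);
  let out = "";
  for (let i = 0; i < length; i++) out += next() < 0.5 ? "a" : "b";
  return out;
}

// Checks decode(encode(x)) === x on `runs` generated inputs and returns
// the first counterexample found, or null if the property held every time.
function findCounterexample(seed: number, runs: number): string | null {
  const next = makeRandom(seed);
  for (let i = 0; i < runs; i++) {
    const input = randomString(next);
    if (decode(encode(input)) !== input) return input;
  }
  return null;
}
```

A call such as `findCounterexample(42, 500)` checks the property on 500 generated strings. The small alphabet is a deliberate generator choice: it concentrates inputs on the run boundaries where an off-by-one in the codec would show up.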
+The same generator-driven idea spans two more surfaces. At the service boundary, **Schemathesis** derives a semantics-aware fuzzer straight from an OpenAPI/GraphQL spec and is the bridge between contract testing and property-based testing — it finds materially more defects than example-based API tests for the cost of pointing it at the schema. At the byte boundary, coverage-guided **fuzzing** (`go test -fuzz`, cargo-fuzz/libFuzzer) is first-class for parsers and decoders, and a failing input is saved as a permanent regression seed. For stateful or distributed cores where ordering and failure timing are the risk, deterministic simulation testing (Antithesis, FoundationDB/TigerBeetle-style seeded simulators) is the frontier worth knowing — every bug reproduces from `seed + commit` — but its setup cost is real, so treat it as a deliberate investment for the system's hardest core, not a default.
+### 8. Prove the whole product at the front door
+The honeycomb proves the parts. One level sits above it: a proof that drives the **real shipping build** — the packaged, embedded artifact a user actually launches — through its **real front door**, on the **real pipeline**, end to end, the way a user's action travels. A service test that proves an engine behind a harness and a UI test that drives screens against a scripted stand-in can both pass while the assembled product does nothing, because the wiring between them was nobody's test. The front-door proof is the one that fails when the real thing is unwired, and it is what "done" means for a feature a user touches.
+This is where **a fake needs a real test behind it** becomes load-bearing. Every stub, fixture, or seeded file a test leans on is a claim that something real produces that value, and the claim is honest only when another test exercises the real producer. A media library whose tests write fixture thumbnails passes green while the shipping grid renders blank — nothing in the suite ever generated a real thumbnail, so the fixture stood in for a stage that did not exist. Seeded inputs are not the violation: handing the real pipeline a known fixture folder tests the pipeline on controlled data. Replacing the pipeline with a script that emits the expected output is the violation. The line is whether the work in the middle runs for real.
+Non-functional outcomes a user feels — latency, throughput, memory headroom — are proven the same way. A number measured against an early prototype decays the moment the design that produced it changes; it has to be re-proven on the shipping path, not carried forward as a one-time measurement.
 ## How we apply this
 - [Observability](../quality/observability.md) — the OTel-first stance that makes traces-as-assertions possible.
@@ -70,6 +90,8 @@ This is a discipline about *what ships together*, not a mandate to write tests f
 - **Snapshot tests as a default.** Snapshots are a brittle, noisy substitute for behavioural assertions, and "update snapshots" becomes a reflex that launders bugs into the baseline. Acceptable only when the artefact is genuinely opaque (a rendered email, a serialised response).
 - **Coverage-gated CI.** "95% line coverage required" is a metric gamed without reducing real risk. Use coverage as a read-out, mutation score as the quality signal, never line coverage as the gate.
 - **Shared staging environments as the integration test.** Staging has no hermetic guarantees, no reproducibility, no determinism. It is a deployment target, not a test bed.
+- **Proving the engine, shipping the product.** A headless proof that the core behaves behind a harness is a slice of confidence, not the product. Until a test drives the assembled, shipping build through the front door on the real pipeline, "it works" is unproven where a user stands.
+- **A fake with no real test behind it.** A fixture or stub that nothing real ever produces is a green light wired to nothing. Every fake is a debt; the real test that exercises the producer is how it gets paid.
 - **"It's hard to test, so we didn't."** That is a signal the code is badly designed. Fix the code.
 ## Further reading
@@ -79,4 +101,8 @@ This is a discipline about *what ships together*, not a mandate to write tests f
 - *Growing Object-Oriented Software, Guided by Tests*, Freeman & Pryce — the canonical treatment of outside-in service testing.
 - *xUnit Test Patterns*, Gerard Meszaros — the vocabulary we use for test doubles, fixtures, and strategies.
 - *Is TDD Dead?*, Beck, Fowler & Heinemeier Hansson — the conversation that maps the contested zone between test-first discipline and test-induced design damage.
-- "UnitTest" and "Testing Pyramid", Martin Fowler (martinfowler.com) — the sociable-vs-solitary distinction and the shape trade-offs.
+- "UnitTest", "TestPyramid", and "On the Diverse and Fantastical Shapes of Testing", Martin Fowler (martinfowler.com) — the sociable-vs-solitary distinction, the shape trade-offs, and Justin Searls's argument that the shape debate is a distraction from test quality.
+- "Testing of Microservices", Spotify Engineering — the honeycomb shape and the integrated-vs-integration-test distinction it rests on.
+- "Practical Mutation Testing at Scale: A View from Google" — the changed-code-only, surfaced-in-review model that makes mutation testing affordable.
+- "A Next Step Beyond Test-Driven Development", Honeycomb.io (Charity Majors) — observability-driven development and testing in production.
+- *Deriving Semantics-Aware Fuzzers from Web API Schemas* (Schemathesis, ICSE 2022) — the empirical case for spec-driven property fuzzing at the service boundary.

package/src/docs/principles/index.md CHANGED

@@ -16,8 +16,9 @@ Software engineering is the discipline of managing complexity and optimising for
 1. **Complexity is the enemy; clarity is the goal.** We choose simple designs, simple tools, and simple processes — and we accept the cost of doing so. Speculative abstraction, premature generalisation, and fear of deletion all compound into the kind of complexity that slows teams down.
 2. **Contracts are the single source of truth.** API specifications, event schemas, and database definitions are authoritative. Clients, tests, documentation, and UIs are derived from them. When a spec is wrong, everything downstream is wrong — and that is the correct failure mode, because one visible error beats silent drift across hand-maintained artefacts.
 3. **Reliability is designed in, not patched in.** We build for failure from the first commit: idempotency at the API boundary, graceful degradation at the edges, backpressure when downstream systems slow, and observability as a design-time concern rather than an afterthought.
-4. **We test the system, not the mock of the system.** Tests that run against real databases, real message brokers, and real HTTP stacks catch the bugs that mocked tests hide. Emulation beats mocking wherever the dependency can run in a container.
+4. **We prove software by using the real thing the way its user does.** A feature is proven when a test drives the shipping build through its real front door, on the real pipeline, the way the user's action actually travels — and the user is whoever observes the outcome, a person at a screen or a caller of an API. Tests that run against real databases, real message brokers, and real HTTP stacks catch the bugs that mocked tests hide, and any fake a test leans on needs a real test behind it. Parts that each pass behind a harness can still assemble into a product that does nothing; the front-door proof is the one that catches it. See [Testing](foundations/testing.md).
 5. **A pure core, swappable edges, and one obvious place for everything.** Every service is a pure decision-making core wrapped in a thin shell that does I/O; concrete dependencies plug in behind abstractions the core owns and stay swappable, with no implementation detail leaking inward. The structure is opinionated, so neither a human reading the code nor an agent writing it ever has to guess where a thing belongs. See [How We Structure Code](system-design/code-structure.md).
 6. **Documentation is a product, not a by-product.** This documentation is versioned, reviewed, and shipped with the same discipline as code. It serves humans and AI agents, and the structures that help one help the other.
 7. **Architectural decisions are recorded and governed.** We capture each significant decision with the context, assumptions, and trade-offs that shaped it, then govern it — an owner, a review trigger, and supersession rather than silent edits when it changes. The record is immutable so the trail of *why* survives; the decision stays open to re-evaluation when its assumptions break. Re-deciding is healthy engineering; re-deciding without recording it is how teams lose their memory. See [Architecture Decisions](system-design/architecture-decisions.md).
 8. **AI agents are first-class engineers.** They read our docs, write our code, review our diffs, and run our tooling. We design our codebase, our conventions, and this documentation so an agent can operate at the same level of quality as a senior engineer.
+9. **Software is made to be used, so it lands fully formed.** A feature is finished when it works, looks right, and is a genuine pleasure to use — reachable, complete, with no dead ends and every state accounted for. Function, form, and experience are one bar, not a core that ships and polish that waits. When code generation is cheap, the considered touch that makes a product feel cared-for is cheap too, so the bar is high. See [Usability and UX](design/usability-and-ux.md).

package/src/docs/principles/quality/observability.md CHANGED

@@ -2,7 +2,7 @@
 title: Observability
 description: OpenTelemetry-first design, SLOs, error budgets, and trace-driven development.
 status: active
-last_reviewed: 2026-06-19
+last_reviewed: 2026-06-26
 ---
 # Observability
@@ -40,7 +40,7 @@ When building a new feature, we sketch the trace it should produce *before* we w
 ### 6. Assert on telemetry in tests
-System tests assert that traces are unbroken end-to-end — a missing span on a critical path is a test failure ([Testing](../foundations/testing.md)). This makes the instrumentation part of the contract rather than an optional decoration, so it cannot silently rot. The failure mode to avoid is over-asserting: a test that pins the exact span tree and every attribute is coupled to implementation detail and will break on every harmless refactor, training the team to delete the assertion rather than trust it. So assert on what the contract actually promises — the spans that must exist on the user journey, that the trace stays connected across service hops, and the attributes a dashboard or SLO query depends on — and let the rest float.
+System tests assert that traces are unbroken end-to-end — a missing span on a critical path is a test failure ([Testing](../foundations/testing.md)). This makes the instrumentation part of the contract rather than an optional decoration, so it cannot silently rot. The mechanism is an in-memory span exporter registered in the test process: exercise the system, then assert on the finished spans. It is a built-in of every OTel SDK and the durable approach now that the dedicated trace-test tools (Tracetest, Malabi) have gone dormant. The failure mode to avoid is over-asserting: a test that pins the exact span tree and every attribute is coupled to implementation detail and will break on every harmless refactor, training the team to delete the assertion rather than trust it. So assert on what the contract actually promises — the spans that must exist on the user journey, that the trace stays connected across service hops, and the attributes a dashboard or SLO query depends on — and let the rest float.
 ### 7. Logs are structured, sampled, and contextual

package/src/engineer-skills/groundwork-electron-engineer/SKILL.md CHANGED

@@ -41,7 +41,7 @@ GroundWork gives you a deterministic **repo map** (`npx groundwork-method repo-m
 ## How to Use This Skill
-Match the user's task to the smallest relevant reference set. Most tasks touch one or two references.
+**Orient first.** On any non-trivial task, refresh the repo map (`npx groundwork-method repo-map`), read its `centrality` ranking to find the hubs, and navigate them with Serena before reading widely (see Code intelligence above) — this is the first step, not optional; fall back to ordinary reads only when those tools are unavailable. Then match the user's task to the smallest relevant reference set. Most tasks touch one or two references.
 | Topic | Reference | Load When |
 |-------|-----------|-----------|
@@ -51,6 +51,9 @@ Match the user's task to the smallest relevant reference set. Most tasks touch o
 | Packaging & Updates | `references/packaging-and-updates.md` | Forge config, makers, fuses at package time, code signing, notarization, auto-update routes. |
 | Testing & Smoke | `references/testing-and-smoke.md` | The three test tiers, Playwright `_electron` patterns, xvfb CI, skip-with-reason guards. |
 | Theming & Tokens | `references/theming-and-tokens.md` | The generated brand.css, Tailwind token mapping, nativeTheme sync, dark mode on desktop. |
+| Performance & Reliability | `references/performance-and-reliability.md` | Main never blocks, renderer/bundle perf, IPC batching, long-lived-window memory, cold boot, gateway resilience. |
+| Observability | `references/observability.md` | Two-process crash reporting, structured logs across IPC, app/update telemetry, PII discipline. |
+| Documentation | `references/documentation.md` | The typed IPC contract as documentation, TSDoc on the bridge, process-boundary why-comments. |
 ## Shared with the web stack — deferrals
@@ -73,6 +76,8 @@ When the workspace has no web surface, the same canon is available as `docs/prin
 - **Release/packaging work** → Load `references/packaging-and-updates.md`. Fuses and signing live in the pipeline; signing material never enters the repo.
 - **Test work** → Load `references/testing-and-smoke.md`. Pick the cheapest tier that can carry the assertion; the smoke stays thin.
 - **Electron upgrade** → Load `references/security.md` (currency window). Treat a skipped support window as a security finding.
+- **Performance / responsiveness work** → Load `references/performance-and-reliability.md`. Main never blocks; SLOs and load shedding live in the core.
+- **Crash reporting / telemetry** → Load `references/observability.md`. Both processes report; distributed tracing lives at the services the app calls.
 ## Safety Gates