npm - groundwork-method - Versions diffs - 0.10.0 → 0.11.0 - Mend

groundwork-method 0.10.0 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (70)

package/src/generators/electron-app/files/tests/smoke/app.spec.ts.template CHANGED

@@ -4,27 +4,35 @@ import { expect, test } from '@playwright/test';
 import { _electron as electron } from 'playwright';
 import type { AppStatus, CoreHealth } from '../../src/shared/ipc';
-// Boot-tier smoke: the real built app launches, renders, and completes one
-// IPC round-trip. This is the agent-closable loop the stack was chosen for —
-// keep it thin (happy path only); rules are proven in the unit tiers and at
-// the capability core's contracts. Runs under xvfb on Linux CI via the smoke
-// target; tool/electron_exec.sh builds first and skips-with-reason when the
-// Electron binary or a display server is unavailable.
+// Native UI check (the Electron side of NATIVE-CHECK-CONTRACT): the real built
+// app launches and is driven through the contract dimensions — render, the
+// named async states, and design-system token match — on the real binary. This
+// is also the agent-closable boot loop the stack was chosen for. Keep it thin;
+// rules are proven in the unit tiers and at the capability core's contracts,
+// and the milestone's front-door bet-progress proof drives the real pipeline.
+// Navigation / no-dead-ends is exempt while the app is single-screen; a bet
+// that adds screens drives between them and back here. Runs under xvfb on Linux
+// CI via the smoke target; tool/electron_exec.sh builds first and skips-with-
+// reason when the Electron binary or a display server is unavailable.
 const MAIN_ENTRY = path.join(__dirname, '..', '..', 'out', 'main', 'index.js');
-test('boots, renders, and answers an IPC round-trip', async () => {
+function requireBuild(): void {
   if (!fs.existsSync(MAIN_ENTRY)) {
     throw new Error(
       'out/main/index.js not found — run the build first (the smoke target does: npx nx run <%= fileName %>:smoke)',
     );
   }
+}
+test('boots, renders on the design system, and answers an IPC round-trip', async () => {
+  requireBuild();
   const app = await electron.launch({ args: [MAIN_ENTRY] });
   try {
     const page = await app.firstWindow();
-    // The window booted and rendered app content.
+    // Render: the window booted and rendered app content.
     await expect(page).toHaveTitle('<%= name %>');
     await expect(page.getByRole('heading', { name: '<%= name %>' })).toBeVisible();
@@ -59,6 +67,37 @@ test('boots, renders, and answers an IPC round-trip', async () => {
       /^(light|dark)$/,
     );
+    // Token match: the design-system tokens resolved and landed in the render,
+    // rather than degrading to an unstyled default. The brand custom properties
+    // resolve on :root, and the heading paints with the projected primary token
+    // (text-primary → --color-primary → --gw-primary), not the browser default.
+    const tokens = await page.evaluate(() => {
+      // The renderer DOM globals live in the browser context; type them through
+      // a cast so the spec checks under the node tsconfig (no DOM lib), the same
+      // idiom the bridged-api evaluate above uses.
+      const dom = globalThis as unknown as {
+        getComputedStyle: (el: unknown) => {
+          getPropertyValue(prop: string): string;
+          color: string;
+        };
+        document: {
+          documentElement: unknown;
+          querySelector(sel: string): unknown;
+        };
+      };
+      const root = dom.getComputedStyle(dom.document.documentElement);
+      const heading = dom.document.querySelector('h1');
+      return {
+        primary: root.getPropertyValue('--gw-primary').trim(),
+        surface: root.getPropertyValue('--gw-surface').trim(),
+        headingColor: heading ? dom.getComputedStyle(heading).color : '',
+      };
+    });
+    expect(tokens.primary.length).toBeGreaterThan(0);
+    expect(tokens.surface.length).toBeGreaterThan(0);
+    expect(tokens.headingColor).not.toBe('');
+    expect(tokens.headingColor).not.toBe('rgb(0, 0, 0)');
     // Main-process assertion via the driver's main-process hook.
     const isPackaged = await app.evaluate(({ app: electronApp }) => electronApp.isPackaged);
     expect(isPackaged).toBe(false);
@@ -66,3 +105,29 @@ test('boots, renders, and answers an IPC round-trip', async () => {
     await app.close();
   }
 });
+test('renders the unreachable core as a state, not a crash', async () => {
+  requireBuild();
+  // Named state, driven deterministically on the real binary: point the
+  // core-access seam at an unreachable address so the probe fails. The app must
+  // still render and show the user a designed "unreachable" state — a working
+  // app that cannot reach the core never shows a blank window or a crash.
+  const app = await electron.launch({
+    args: [MAIN_ENTRY],
+    env: { ...process.env, API_BASE_URL: 'http://127.0.0.1:1' },
+  });
+  try {
+    const page = await app.firstWindow();
+    // The shell still booted and rendered — the failure is a state, not a crash.
+    await expect(page.getByRole('heading', { name: '<%= name %>' })).toBeVisible();
+    // The core probe resolves to unreachable, and the renderer shows its state.
+    await expect(page.getByTestId('core-status')).toHaveText(
+      /Workspace core unreachable/,
+    );
+  } finally {
+    await app.close();
+  }
+});
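The token-match assertion in this spec only rules out `rgb(0, 0, 0)`. A heading painted in any other colour that is not the brand primary would still pass. A stricter variant could compare the heading's computed colour against `--gw-primary` itself. The sketch below is in Python and uses hypothetical helper names. It assumes the token resolves to a three- or six-digit hex value, which the diff does not show. A token authored in `oklch()` or `hsl()` would need a different conversion. It also assumes the opaque `rgb(r, g, b)` string form the spec already matches against.

```python
def hex_to_css_rgb(value: str) -> str:
    """Convert '#RGB' or '#RRGGBB' into the 'rgb(r, g, b)' string form that
    getComputedStyle reports for an opaque sRGB colour (assumption: hex token)."""
    digits = value.strip().removeprefix("#")
    if len(digits) == 3:
        digits = "".join(ch * 2 for ch in digits)  # '#fff' -> 'ffffff'
    if len(digits) != 6:
        raise ValueError(f"not a 3- or 6-digit hex colour: {value!r}")
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgb({r}, {g}, {b})"


def heading_uses_primary(tokens: dict[str, str]) -> bool:
    """Hypothetical stricter check over the dict the spec's evaluate() returns:
    the heading must paint with the primary token, not merely 'not black'."""
    return tokens["headingColor"] == hex_to_css_rgb(tokens["primary"])
```

For example, `hex_to_css_rgb("#3b82f6")` yields `"rgb(59, 130, 246)"`. If the token format is not fixed, the existing "not black" assertion is the safer floor.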

package/src/generators/flutter-app/docs/principles/stack/flutter/testing.md CHANGED

@@ -8,7 +8,7 @@ last_reviewed: 2026-06-12
 ## TL;DR
-Three tiers: pure-Dart unit tests with fakes, widget tests as the bulk of coverage, and `integration_test` happy paths on a headless Android emulator as the CI-canonical loop. Patrol enters only when a flow crosses the Flutter/OS boundary. Goldens run via alchemist. Surface tests assert wiring, rendering, and interaction — they never re-prove business logic already proven at the core's contract.
+Three tiers: pure-Dart unit tests with fakes, widget tests as the bulk of coverage, and `integration_test` happy paths on a headless Android emulator as the CI-canonical loop. Patrol enters only when a flow crosses the Flutter/OS boundary. Goldens run via alchemist. Surface tests assert wiring, rendering, and interaction — they never re-prove business logic already proven at the core's contract. This is the Flutter idiom of the framework testing canon (`docs/principles/foundations/testing.md`) — widget tests are the honeycomb's fat middle, unit tests the thin solitary layer — and the canon wins on any disagreement.
 ## Why this matters
@@ -50,7 +50,9 @@ A small set of happy-path flows through the real app binary — launch, sign in
 **Android is the CI gate; iOS is a local-only lane.** The headless Android emulator is cheap, scriptable, and agent-drivable; iOS simulators need macOS runners and routinely need hands — putting them in the gate trades the agent-closable loop for platform symmetry the wiring proof does not need. iOS-specific verification happens locally or on device farms (Firebase Test Lab, Codemagic) as an explicit, non-gating lane.
-Keep this tier thin. Every integration test is minutes of emulator time; if a widget test can carry the assertion, it does.
+This tier is the Flutter side of the **native UI check contract** (`src/generators/system-test-runner/NATIVE-CHECK-CONTRACT.md`) — the `system-test-runner` drives it as the surface's visual gate, so it carries the contract's dimensions on the *real binary*: it renders without a blank or crash frame, drives the **named async states** (the unreachable/error path renders its designed card, not a red screen), and confirms a **design-system token landed** in the render (the status icon resolves the projected `StatusColors` token, not a flat default). Navigation / no-dead-ends is exempt while the app is single-screen; a bet that adds screens drives between them and back here. Each state is reached by faking the repository at the provider seam — you cannot summon a loading or unreachable state from a real backend on demand — while the milestone's front-door bet-progress proof drives the real gateway end to end.
+Keep this tier thin. Every integration test is minutes of emulator time; if a widget test can carry the assertion, it does — the integration tier carries only what needs the real binary: render, the states, and token conformance.
 ### Patrol — only across the Flutter/OS boundary
@@ -60,6 +62,16 @@ Keep this tier thin. Every integration test is minutes of emulator time; if a wi
 Goldens guard design-system-level components (the token-projected theme made visible). Use **alchemist**, with its platform-test vs CI-test split — CI variants render text as blocks, killing the cross-platform font flakiness that made goldens a deletion candidate. `golden_toolkit` is discontinued — legacy; migrate, do not adopt. Golden scope is the component library, not full screens: screen-level goldens churn on every copy change and teach the team to rubber-stamp diffs.
+## Assertion-quality read-outs (and their stack limits)
+The canon's quality read-outs are stack-dependent, and a client is honest about which apply:
+- **Mutation testing** (the canon's read-out for whether assertions bite) has **no production-grade Dart tool** — the existing packages are experimental and unmaintained. The discipline it enforces is carried by review instead: fakes over mocks, assertions on semantics and visible text, and error/empty-state coverage. A hand-run experimental tool is a spot check on a dense pure-Dart algorithm, never a gate.
+- **Property-based testing** applies narrowly: a dense pure-Dart unit with a real invariant — a mapper round-trip, a formatter, a validator that must never throw — can state the property with `glados` (the niche Dart option). It does not apply to widget trees.
+- **Trace assertions and service-boundary fuzzing** (Schemathesis, coverage-guided fuzzing) are **N/A on the surface** — a Flutter client emits no OpenTelemetry spans, and the gateway's contract is fuzzed once at the capability core. Re-running either from the surface duplicates a proof that already exists.
+Name tests by observable behaviour, not the widget under test: `'renaming updates the profile header'`, not `'ProfileView test'`.
 ## Legacy
 `flutter_driver` (long deprecated for `integration_test`); `golden_toolkit`; Appium-first Flutter testing (a mixed-stack-org accommodation, not Flutter-native practice).
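The property-based bullet in this doc names "a validator that must never throw" as a good fit. The shape of that property is simple: generate many arbitrary inputs, run the unit, and assert only the invariant. The sketch below shows it in Python with a hand-rolled random generator rather than `glados`, whose API is not shown here. `validate_display_name` is a hypothetical unit. A real property framework would also shrink a failing input to a minimal counterexample, and this sketch does not attempt that.

```python
import random
import string
from typing import Callable, Optional


def validate_display_name(raw: str) -> Optional[str]:
    """Hypothetical unit under test: returns an error message, or None if valid."""
    name = raw.strip()
    if not name:
        return "Name is required"
    if len(name) > 40:
        return "Name is too long"
    if not name.isprintable():
        return "Name contains control characters"
    return None


def check_never_raises(unit: Callable[[str], object], trials: int = 1_000, seed: int = 0) -> None:
    """Property: `unit` returns normally for every generated string.

    Inputs come from a deliberately hostile alphabet (whitespace, control
    characters, a zero-width space, non-ASCII). A fixed seed keeps a failure
    reproducible; unlike a real property framework, nothing is shrunk."""
    rng = random.Random(seed)
    alphabet = string.printable + "\x00\u200b\u00e9\U0001f600"
    for _ in range(trials):
        sample = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 60)))
        try:
            unit(sample)
        except Exception as exc:  # the invariant: no input may raise
            raise AssertionError(f"{unit.__name__} raised on {sample!r}") from exc
```

`check_never_raises(validate_display_name)` returns quietly. A unit that assumes ASCII, such as `lambda s: s.encode("ascii")`, is caught as soon as a generated sample contains a non-ASCII character.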

package/src/generators/flutter-app/files/integration_test/app_test.dart.template CHANGED

@@ -3,28 +3,62 @@ import 'package:flutter_test/flutter_test.dart';
 import 'package:integration_test/integration_test.dart';
 import 'package:<%= pubspecName %>/app.dart';
 import 'package:<%= pubspecName %>/data/repositories/status_repository.dart';
+import 'package:<%= pubspecName %>/domain/models/health_status.dart';
+import 'package:<%= pubspecName %>/ui/core/theme/app_theme.dart';
 import '../test/fakes/fake_status_repository.dart';
-/// Boot-tier smoke: the real app binary launches and renders home on a
-/// headless Android emulator — the CI-canonical loop. Keep this tier thin
-/// (happy paths only); emulator minutes are the most expensive test currency
-/// in this stack. The gateway is faked so the smoke proves boot + rendering,
-/// not network reachability (docs/principles/stack/flutter/testing.md).
+/// Native UI check (the Flutter side of NATIVE-CHECK-CONTRACT): the real app
+/// binary launches on a headless emulator and is driven through the contract
+/// dimensions — render, the named async states, and design-system token match.
+/// Navigation / no-dead-ends is exempt while the app is single-screen; a bet
+/// that adds screens extends this harness to drive between them and back.
+///
+/// Keep it thin — emulator minutes are the most expensive test currency in this
+/// stack. The repository is faked at the provider seam so each state is reached
+/// deterministically (you cannot summon a loading or unreachable state from a
+/// real backend on demand); the milestone's front-door proof is what drives the
+/// real gateway end to end (docs/principles/stack/flutter/testing.md).
 void main() {
   IntegrationTestWidgetsFlutterBinding.ensureInitialized();
-  testWidgets('the app boots to the home view', (tester) async {
-    await tester.pumpWidget(
-      ProviderScope(
-        overrides: [
-          statusRepositoryProvider.overrideWithValue(FakeStatusRepository()),
-        ],
+  Widget bootWith(StatusRepository repo) => ProviderScope(
+        overrides: [statusRepositoryProvider.overrideWithValue(repo)],
         child: const App(),
+      );
+  testWidgets('renders the reachable state on the design system',
+      (tester) async {
+    await tester.pumpWidget(bootWith(FakeStatusRepository()));
+    await tester.pumpAndSettle();
+    // Render: the home view reached its data state — not a blank or error frame.
+    expect(find.text('Wired to the workspace gateway'), findsOneWidget);
+    // Token match: the status icon resolves the projected design-system token
+    // (StatusColors.success), in whichever theme the platform launched — proof
+    // the design system landed in the real render rather than a flat default.
+    final icon = tester.widget<Icon>(find.byIcon(Icons.check_circle_outline));
+    expect(
+      icon.color,
+      anyOf(
+        buildLightTheme().extension<StatusColors>()!.success,
+        buildDarkTheme().extension<StatusColors>()!.success,
       ),
     );
+  });
+  testWidgets('renders the unreachable state as a state, not a crash',
+      (tester) async {
+    await tester.pumpWidget(
+      bootWith(FakeStatusRepository(result: const HealthStatus.unreachable())),
+    );
     await tester.pumpAndSettle();
-    expect(find.text('Wired to the workspace gateway'), findsOneWidget);
+    // Named state: the unreachable path renders its designed card with no
+    // uncaught exception — a working app that cannot reach the core still shows
+    // the user a state, never a blank or a red screen of death.
+    expect(find.text('Gateway unreachable'), findsOneWidget);
+    expect(tester.takeException(), isNull);
   });
 }
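The harness reaches each named state by overriding the repository at the provider seam. The same idea can be reduced to plain Python. This shows why a fake makes every state reachable on demand, including ones a real backend only produces by accident. `HealthStatus`, `FakeStatusRepository`, and `status_text` are hypothetical stand-ins. Only the two display strings come from the template, and the loading copy is an assumption. The contract lists empty, loading, in-progress, and error states, which is broader than the two states this template drives.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Protocol


class HealthStatus(Enum):
    REACHABLE = "reachable"
    UNREACHABLE = "unreachable"


class StatusRepository(Protocol):
    def current(self) -> Optional[HealthStatus]: ...


@dataclass
class FakeStatusRepository:
    """Stands in for the real repository at the injection seam.
    `result=None` models a probe that has not answered yet (loading)."""
    result: Optional[HealthStatus] = HealthStatus.REACHABLE

    def current(self) -> Optional[HealthStatus]:
        return self.result


def status_text(repo: StatusRepository) -> str:
    """Hypothetical view logic: every state maps to designed copy, none to a crash."""
    status = repo.current()
    if status is None:
        return "Checking the workspace gateway"
    if status is HealthStatus.UNREACHABLE:
        return "Gateway unreachable"
    return "Wired to the workspace gateway"
```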

package/src/generators/go-microservice/docs/principles/stack/go/testing.md CHANGED

@@ -8,7 +8,7 @@ last_reviewed: 2026-05-26
 ## TL;DR
-Service perimeter tests are our default: real Postgres via Testcontainers, real HTTP via `httptest`, no mocks inside the service boundary. Unit tests exist for isolated, complex logic only. The test pyramid does not apply to services that can spin up a real database in two seconds.
+Service perimeter tests are our default: real Postgres via Testcontainers, real HTTP via `httptest`, no mocks inside the service boundary. Unit tests exist for isolated, complex logic only. The default shape is the **test honeycomb** — a fat middle of sociable service tests, a thin solitary-unit layer, a few end-to-end checks — not the pyramid, which assumes a database in a test is expensive. This is the Go idiom of the framework testing canon (`docs/principles/foundations/testing.md`); the canon is the parent principle and wins on any disagreement.
 ## Why this matters
@@ -143,6 +143,22 @@ go test ./internal/meetings/...
 go test -v -run TestCreateMeeting ./internal/meetings/...
 ```
+## Trace assertions — observability is a test surface
+A critical-path request must emit an unbroken trace; a missing span is a test failure, not an instrumentation TODO. The mechanism is the OTel SDK's **in-memory span exporter** — register one in the test process, exercise the handler, and assert on the finished spans (the entry span exists, the work span is its descendant, the attributes a dashboard query depends on are present). Use `go.opentelemetry.io/otel/sdk/trace/tracetest` (`NewInMemoryExporter`), no external tooling. Assert what the contract promises and let the rest float — pinning the whole span tree couples the test to implementation.
+## Mutation testing — the assertion-quality read-out
+A fat service-perimeter test drives many branches through one HTTP call and can *execute* them all while only asserting on the response body. Mutation testing is the instrument that proves the suite checks what it runs: inject a fault, confirm a test fails; a surviving mutant is a covered-but-unchecked line. It is a **signal, never a gate**. Go's tooling is immature (`gremlins` is pre-1.0 and slow; `go-mutesting` is unmaintained), so here it stays a hand-run spot check on a dense package under active change, not a CI expectation.
+## Generate the inputs you can't enumerate
+Example-based tests check the cases you thought of; the bugs live in the cases you didn't. For dense, boundary-poor logic with a real invariant — a round-trip, a parser that must never panic, a calculation with an algebraic law — state the property and let the framework generate and shrink counterexamples (`pgregory.net/rapid`, or stdlib `testing/quick`). At the byte boundary, native `go test -fuzz` is first-class for parsers and decoders, and a failing input is saved under `testdata/fuzz/` as a permanent regression seed. Reach for these where invariants are real, not everywhere.
+## Naming
+Name a test by behaviour, not implementation: `[Function] should [expected outcome] when [condition]`. `TestCreateItem_Success` conveys nothing the dashboard does not already show; a failing name should let an on-call engineer form a hypothesis without opening the file.
 ## Anti-patterns
 - **Mocking the database.** The whole point is to test the repository against a real schema.
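The mutation-testing section in this doc turns on one idea: a mutant that no test kills marks a line the suite runs but never checks. The toy below illustrates this in Python with hypothetical names and no mutation tool. It applies one boundary mutation by hand, the kind of `<=` to `<` swap that mutation tools generate automatically. A test that stays far from the boundary lets the mutant survive, and an exact-capacity case kills it.

```python
def can_book(attendees: int, capacity: int) -> bool:
    """Original: a meeting fits if attendees do not exceed capacity."""
    return attendees <= capacity


def can_book_mutant(attendees: int, capacity: int) -> bool:
    """Hand-applied boundary mutant: '<=' swapped for '<'."""
    return attendees < capacity


def weak_suite(fn) -> bool:
    """Executes the function but only probes far from the boundary."""
    return fn(3, 10) is True and fn(12, 10) is False


def boundary_suite(fn) -> bool:
    """Adds the exact-capacity case, where original and mutant disagree."""
    return weak_suite(fn) and fn(10, 10) is True


def mutant_survives(suite) -> bool:
    """A mutant survives when the suite passes on the original AND on the mutant."""
    return suite(can_book) and suite(can_book_mutant)
```

Both suites run both branches of `can_book`. Only `boundary_suite` checks the line, which is the gap a surviving mutant points at.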

package/src/generators/python-microservice/docs/principles/stack/python/testing.md CHANGED

@@ -27,6 +27,8 @@ The **honeycomb model** inverts the priority:
 This gives us a suite where a passing run is a meaningful signal. The bugs we care about — boundary mismatches, SQL correctness, serialisation errors, provider contract violations — are caught before they reach production.
+This is the Python idiom of the framework testing canon (`docs/principles/foundations/testing.md`); the canon is the parent principle and wins on any disagreement.
 ---
 ## The Three Tiers
@@ -256,6 +258,45 @@ A failing test name should give an on-call engineer enough information to form a
 ---
+## Trace Assertions — observability is a test surface
+A critical-path request must emit an unbroken trace; a missing span is a test failure, not an instrumentation TODO. Register an OTel **`InMemorySpanExporter`** in the test process, exercise the endpoint, and assert on the finished spans — the entry span exists, the trace stays connected across the service hop, and the attributes a dashboard query depends on are present.
+```python
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import SimpleSpanProcessor
+from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
+@pytest.fixture
+def spans():
+    exporter = InMemorySpanExporter()
+    provider = TracerProvider()
+    provider.add_span_processor(SimpleSpanProcessor(exporter))
+    # install provider for the test, exercise the system, then:
+    return exporter
+async def test_transcribe_emits_connected_trace(client, spans):
+    await client.post("/transcribe", json={"audio_url": "gs://b/f.mp3"})
+    names = [s.name for s in spans.get_finished_spans()]
+    assert "POST /transcribe" in names
+```
+Assert the spans the contract promises and let the rest float — pinning the whole tree couples the test to implementation.
+---
+## Mutation Testing — the assertion-quality read-out
+A service test drives many branches through one HTTP call and can *execute* them all while only asserting on the response body. Mutation testing proves the suite checks what it runs: inject a fault, confirm a test fails; a surviving mutant is a covered-but-unchecked line. Python's tooling is production-grade — **`mutmut`** and **`cosmic-ray`** — but it is expensive, so it is a **signal, never a gate**: run it incrementally on the high-risk modules the risk matrix flags and on changed code only. A surviving mutant on changed high-risk code is the missing assertion to add — the same read-out catches AI-generated tests whose oracle was lifted from the implementation. (Note: `mutmut` 3 no longer mutates module-level code.)
+---
+## Generate the Inputs You Can't Enumerate
+Example-based tests check the cases you thought of; the bugs live in the cases you didn't. For pure logic with a real invariant — a round-trip (`decode ∘ encode = id`), a parser that must never raise, a domain calculation with an algebraic law — state the property and let **`Hypothesis`** generate and shrink counterexamples. One property covers an infinity of examples, and most caught faults surface on a single generated input. At the API boundary, **Schemathesis** derives a semantics-aware fuzzer straight from the OpenAPI schema and finds materially more defects than example-based API tests for the cost of pointing it at the spec. Reach for these where invariants are real; the authoring cost (a meaningful property needs a generator) is why they are not everywhere.
+---
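The fixture in the trace-assertion section leaves provider installation as a comment. In the OpenTelemetry Python SDK, `trace.set_tracer_provider` takes effect only once per process. A common arrangement is therefore a session-scoped provider, with the `InMemorySpanExporter` cleared (`exporter.clear()`) between tests. The "work span is a descendant of the entry span" assertion does not depend on the SDK. Once the finished spans are reduced to name, span id, and parent id, it becomes a walk up the parent chain. The `Span` record below is a hypothetical stand-in for the SDK's finished-span objects, which carry the same information in `context.span_id` and `parent`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical minimal view of a finished span."""
    name: str
    span_id: int
    parent_id: Optional[int] = None


def is_descendant(spans: list[Span], child_name: str, ancestor_name: str) -> bool:
    """True when `child_name` sits under `ancestor_name` in an unbroken
    parent chain; a missing link anywhere breaks the chain."""
    by_id = {s.span_id: s for s in spans}
    by_name = {s.name: s for s in spans}
    node, ancestor = by_name.get(child_name), by_name.get(ancestor_name)
    if node is None or ancestor is None:
        return False
    seen: set[int] = set()
    while node.parent_id is not None and node.span_id not in seen:
        seen.add(node.span_id)  # guards against a malformed, cyclic input
        parent = by_id.get(node.parent_id)
        if parent is None:  # parent span never finished or never exported
            return False
        if parent.span_id == ancestor.span_id:
            return True
        node = parent
    return False
```

Asserting only the links the contract promises, rather than the exact tree, keeps the test from pinning implementation detail.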
 ## Running Tests
 ```bash

package/src/generators/system-test-runner/NATIVE-CHECK-CONTRACT.md ADDED

@@ -0,0 +1,20 @@
+# Native UI Check Contract
+This is the contract a platform's UI check must satisfy to verify a `graphical-ui` surface. It exists because a graphical surface that ships with no UI check is unverified — and a check that cannot run must **block the milestone, not silently skip it**. When the `system-test-runner` generator meets a graphical surface whose test medium it has no runner for, it emits a fail-closed placeholder (`tests/system/test_<surface>_ui_check_missing.py`) that fails with the named gap. Implementing a real check to this contract is what turns that red green.
+GroundWork ships **conforming** checks for three mediums today: `playwright` (web, via the runner's own `tests/system/` suite) and `flutter-integration` / `playwright-electron` (the Flutter `integration_test` and Electron `_electron` smoke harnesses the app scaffolds ship — each driven by the `system-test-runner` and each carrying the contract's dimensions on the real binary: render, the named async states, and design-system token match; navigation is exempt while those scaffolds are single-screen). A graphical surface on any other platform — a native iOS/SwiftUI app, an Android-native app, a desktop-native shell — needs a check built to the contract below, registered under a new test medium the generator recognises; until then the surface gets the fail-closed placeholder.
+## What a native UI check must cover
+A conforming check drives the **running, shipping build** of the surface — the artifact a user actually launches, not a test target that runs code the shipping build omits — and verifies, across the surface's key screens:
+1. **Render.** Each key screen renders without a blank frame, an unstyled fallback, an error-boundary/crash overlay, or an uncaught exception. The screen a user reaches shows what it is supposed to show.
+2. **Navigation — no dead ends.** Every screen the surface reaches has a way back or onward; no flow strands the user with no exit. The check drives the real navigation between the surface's screens and confirms each is reachable and leavable. A single-screen surface is exempt (there is nowhere to strand the user) — the same guard the web render-smoke applies when there is only one route; a surface that adds a second screen owes the navigation assertion.
+3. **The named states.** For every asynchronous view, the check exercises its full set of states — empty, loading, in-progress, error — and confirms each renders as designed rather than as a frozen or broken screen. A view that only renders when data arrives is incomplete.
+4. **Design-system match.** The surface renders in the project's design system — the specified tokens (colour, type, spacing, elevation, motion) resolve and land, rather than degrading to a flat platform default.
+## How it integrates
+- Register the new test medium so `system-test-runner`'s `KNOWN_MEDIA` set recognises it, and wire the runner fixture to drive the platform's harness (the pattern `flutter-integration` and `playwright-electron` already follow: the surface ships its harness, the fixture drives it through the app's build/test command as a subprocess).
+- The check is part of the permanent system suite (`tests/system/`), run on every milestone close and at validation — the same fail-closed gate the web `render_smoke` / `a11y_smoke` / `token_conformance` checks run under.
+- Until a platform's check exists, the placeholder stands in and fails. That is the correct state: the milestone cannot be declared proven on a surface nothing checks.
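Contract item 2 can be read as two properties of the surface's screen graph. Every screen must be reachable from the entry screen, and every reachable screen must have at least one exit. A harness that records the transitions it drove could check both with a breadth-first walk. The sketch below is in Python with a hypothetical function and graph. A "back" transition counts as an exit like any other, and a self-link does not.

```python
from collections import deque


def navigation_gaps(edges: dict[str, set[str]], entry: str) -> dict[str, set[str]]:
    """Return the screens that violate 'no dead ends' in a screen graph.

    `edges` maps each screen to the screens it can move to (onward or back).
    A single-screen surface is exempt, matching the contract's carve-out."""
    screens = set(edges) | {t for targets in edges.values() for t in targets}
    if len(screens) <= 1:
        return {"unreachable": set(), "dead_ends": set()}
    seen, queue = {entry}, deque([entry])
    while queue:  # breadth-first walk from the entry screen
        for nxt in edges.get(queue.popleft(), set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {
        "unreachable": screens - seen,
        # a screen whose only edges point back at itself still strands the user
        "dead_ends": {s for s in seen if not (edges.get(s, set()) - {s})},
    }
```

For `{"home": {"library"}, "library": set(), "settings": {"home"}}`, the walk reports `settings` as unreachable and `library` as a dead end. That is the "you can reach the library but never leave it" case the render-smoke comment names.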

package/src/generators/system-test-runner/files/tests/system/test_render_smoke.py.template CHANGED

@@ -111,6 +111,22 @@ def _render_smoke(page: Page, surface_slug: str, base_url: str | None) -> None:
                     f"({metrics['nodes']} nodes, {metrics['text']} chars of text)"
                 )
+                # 6. Not a dead end. When the app has more than one route, every
+                #    screen offers a same-origin way onward or back, so a user is
+                #    never stranded (the "you can reach the library but never leave
+                #    it" class). Single-route apps are exempt.
+                if len(ROUTES) > 1:
+                    nav_links = page.evaluate(
+                        "() => Array.from(document.querySelectorAll('a[href]'))"
+                        ".filter(a => { const h = a.getAttribute('href') || '';"
+                        " return h.startsWith('/') || h.startsWith(location.origin);"
+                        " }).length"
+                    )
+                    assert nav_links > 0, (
+                        f"{ctx}: dead-end screen — no in-app navigation link to "
+                        f"leave this route"
+                    )
                 # On pass, persist the screenshot for Tiers 2-3 to read.
                 out = _VISUAL_DIR / surface_slug
                 out.mkdir(parents=True, exist_ok=True)
@@ -216,6 +232,20 @@ def _render_smoke(page: Page, surface_slug: str) -> None:
                     f"({metrics['nodes']} nodes, {metrics['text']} chars of text)"
                 )
+                # Not a dead end: a multi-route surface offers a same-origin way
+                # off every screen, so a user is never stranded.
+                if len(ROUTES) > 1:
+                    nav_links = page.evaluate(
+                        "() => Array.from(document.querySelectorAll('a[href]'))"
+                        ".filter(a => { const h = a.getAttribute('href') || '';"
+                        " return h.startsWith('/') || h.startsWith(location.origin);"
+                        " }).length"
+                    )
+                    assert nav_links > 0, (
+                        f"{ctx}: dead-end screen — no in-app navigation link to "
+                        f"leave this route"
+                    )
                 out = _VISUAL_DIR / surface_slug
                 out.mkdir(parents=True, exist_ok=True)
                 slug = route.strip("/").replace("/", "_") or "root"
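The dead-end check in this template counts an `href` as in-app when it starts with `/` or with `location.origin`. Three cases slip past that string-prefix rule:

- A protocol-relative link such as `//other.example/x` starts with `/` yet points off-site.
- `https://app.example.evil.net/` starts with the origin `https://app.example` but is a different host.
- A link back to the current route is counted as a way out, even though it does not leave the screen.

Resolving each `href` against the page URL and then comparing scheme, host, and path avoids all three. The sketch below uses only the standard-library `urllib.parse`. `exits_from` is a hypothetical helper that would take the raw `href` values collected from the page. Comparing `netloc` strings treats an explicit default port as a different origin, which is a simplification.

```python
from urllib.parse import urljoin, urlsplit


def exits_from(page_url: str, hrefs: list[str]) -> list[str]:
    """Return the hrefs that lead to a different route on the same origin.

    Each href is resolved against the page URL first, so protocol-relative
    links ('//host/x') and look-alike hosts are judged by their real origin."""
    here = urlsplit(page_url)
    exits = []
    for href in hrefs:
        target = urlsplit(urljoin(page_url, href))
        same_origin = (target.scheme, target.netloc) == (here.scheme, here.netloc)
        if same_origin and target.path != here.path:  # '#top' and self-links stay put
            exits.append(href)
    return exits
```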

package/src/generators/system-test-runner/generator.ts CHANGED

@@ -73,6 +73,31 @@ function parseSurfaces(
   }));
 }
+/** A fail-closed pytest stub for a graphical surface with no runnable UI check.
+ *  It fails (never skips) so the surface's UI proof is an honest red until a
+ *  platform check is implemented per NATIVE-CHECK-CONTRACT.md. Deleting it to go
+ *  green is the silent-skip this exists to prevent. */
+function uiCheckPlaceholder(s: SurfaceTemplateSpec): string {
+  return `import pytest
+# AUTO-GENERATED fail-closed placeholder — do not delete to go green.
+# Surface "${s.slug}" (test medium "${s.medium}") is a surface GroundWork has no
+# UI check runner for. A milestone cannot be proven on a surface nothing checks,
+# so this placeholder FAILS until a platform UI check is implemented for it per
+# src/generators/system-test-runner/NATIVE-CHECK-CONTRACT.md (render,
+# navigation / no dead ends, the named states, design-system token match).
+def test_${s.ident}_ui_check_not_implemented():
+    pytest.fail(
+        "No UI check runner for surface '${s.slug}' (medium '${s.medium}'). "
+        "Implement a platform UI check per "
+        "system-test-runner/NATIVE-CHECK-CONTRACT.md, then replace this "
+        "fail-closed placeholder. A graphical surface must not ship unverified."
+    )
+`;
+}
 export async function systemTestRunnerGenerator(
   tree: Tree,
   options: SystemTestRunnerGeneratorSchema
@@ -92,6 +117,20 @@ export async function systemTestRunnerGenerator(
   const flutterSurfaces = (surfaces ?? []).filter((s) => s.medium === 'flutter-integration');
   const electronSurfaces = (surfaces ?? []).filter((s) => s.medium === 'playwright-electron');
+  // Every test medium GroundWork knows how to run a check for. A graphical
+  // surface registered with a medium outside this set has no UI check runner —
+  // and a milestone cannot be proven on a surface nothing checks. We refuse to
+  // silently leave it unverified: each such surface gets a fail-closed
+  // placeholder check (below) naming the gap, never a silent no-op.
+  const KNOWN_MEDIA = new Set([
+    'playwright',
+    'subprocess-cli',
+    'protocol-client',
+    'flutter-integration',
+    'playwright-electron',
+  ]);
+  const unsupportedSurfaces = (surfaces ?? []).filter((s) => !KNOWN_MEDIA.has(s.medium));
   // Playwright structure follows graphical surfaces: any playwright surface in
   // registry mode, the graphical-ui value in single-medium mode. pexpect ships
   // alongside the subprocess runners so interactive (REPL) CLI flows are testable.
@@ -121,10 +160,13 @@ export async function systemTestRunnerGenerator(
     }
   );
-  // Playwright structure ships only with a graphical surface: the page-object
-  // package, the axe-core a11y smoke, and the render-smoke gate depend on
-  // pytest-playwright, which the pyproject template declares only when
-  // includePlaywright is set.
+  // Playwright structure ships only with a graphical web surface: the
+  // page-object package, the axe-core a11y smoke, and the render-smoke gate
+  // depend on pytest-playwright, which the pyproject template declares only when
+  // includePlaywright is set. Removing the web-specific gates here is correct —
+  // they genuinely cannot run without a web surface. What must never happen is a
+  // graphical surface left with no check at all; the placeholder below closes
+  // that gap fail-closed instead of silently.
   if (!includePlaywright) {
     tree.delete('tests/system/pages');
     tree.delete('tests/system/test_a11y_smoke.py');
@@ -134,6 +176,18 @@ export async function systemTestRunnerGenerator(
     tree.delete('tests/system/test_token_conformance.py');
   }
+  // Fail-closed: a graphical surface whose test medium GroundWork cannot run a
+  // check for gets a placeholder that FAILS naming the gap, never a silent skip.
+  // The scaffold still generates and its other tests run; this surface's UI
+  // proof is an honest red until a platform check is implemented per
+  // NATIVE-CHECK-CONTRACT.md — the follow-on that turns it green.
+  for (const s of unsupportedSurfaces) {
+    tree.write(
+      `tests/system/test_${s.ident}_ui_check_missing.py`,
+      uiCheckPlaceholder(s)
+    );
+  }
   await formatFiles(tree);
   recordGeneratorProvenance(tree, 'system-test-runner', options as unknown as Record<string, unknown>);
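The placeholder selection in `systemTestRunnerGenerator` is small enough to restate. As written, it filters every surface with an unrecognised medium, graphical or not. One detail is left to code this diff does not show. The placeholder's test function is named `test_${s.ident}_ui_check_not_implemented`, so `ident` must already be a valid Python identifier. Otherwise the generated file fails at collection with a syntax error instead of the named gap. The Python sketch below mirrors the `KNOWN_MEDIA` filter and adds that guard. `Surface`, `placeholder_paths`, and the `xcuitest` medium used in the checks are hypothetical. Raising on a bad `ident` is an assumption about the desired behaviour.

```python
from dataclasses import dataclass

KNOWN_MEDIA = frozenset({
    "playwright",
    "subprocess-cli",
    "protocol-client",
    "flutter-integration",
    "playwright-electron",
})


@dataclass(frozen=True)
class Surface:
    slug: str    # e.g. 'ios-app', quoted in the failure message
    ident: str   # e.g. 'ios_app', spliced into the test function name
    medium: str


def placeholder_paths(surfaces: list[Surface]) -> list[str]:
    """Paths of the fail-closed placeholders to write, one per surface whose
    medium has no runner (every such surface, graphical or not)."""
    paths = []
    for s in surfaces:
        if s.medium in KNOWN_MEDIA:
            continue
        if not s.ident.isidentifier():
            # 'def test_ios-app_ui_check_not_implemented()' would not parse
            raise ValueError(f"surface {s.slug!r}: ident {s.ident!r} is not a Python identifier")
        paths.append(f"tests/system/test_{s.ident}_ui_check_missing.py")
    return paths
```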

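The fail-closed loop above writes one file per unsupported surface, but this diff does not show the body of `uiCheckPlaceholder`. The following is a minimal sketch of what such a placeholder could emit. It assumes a hypothetical `UnsupportedSurface` shape with an `ident` that is a valid Python identifier and a `medium` label. The function name `uiCheckPlaceholderSketch` is illustrative and is not the package's helper. The key property is that the generated pytest module calls `pytest.fail` with a message naming the gap, rather than skipping.

```typescript
// Hypothetical surface shape; the real type in groundwork-method is not shown in this diff.
interface UnsupportedSurface {
  ident: string;  // assumed to be a valid Python identifier, e.g. "desktop"
  medium: string; // the test medium with no GroundWork check, e.g. "native-macos"
}

// Sketch: build a pytest module whose single test FAILS with a message that
// names the missing check, so the suite is an honest red instead of a silent skip.
function uiCheckPlaceholderSketch(s: UnsupportedSurface): string {
  const reason =
    `No UI check exists for surface '${s.ident}' (medium: ${s.medium}). ` +
    `Implement a platform check per NATIVE-CHECK-CONTRACT.md.`;
  return [
    `"""Fail-closed placeholder for the '${s.ident}' surface UI check."""`,
    `import pytest`,
    ``,
    ``,
    `def test_${s.ident}_ui_check_missing() -> None:`,
    // JSON.stringify yields a double-quoted literal that Python also accepts for plain ASCII text.
    `    pytest.fail(${JSON.stringify(reason)})`,
    ``,
  ].join("\n");
}
```

Failing instead of skipping keeps the gap visible in CI output. A skip would report green and hide it.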
package/src/generators/workspace-dev-cli/cli-src/dist/dev-bundle.js CHANGED

@@ -877,7 +877,7 @@ var path6 = __toESM(require("path"));
 // src/generators/workspace-dev-cli/cli-src/src/util/version.ts
 var fs6 = __toESM(require("fs"));
 var path5 = __toESM(require("path"));
-var DEV_CLI_VERSION = "0.10.0";
+var DEV_CLI_VERSION = "0.11.0";
 function stampedFrameworkVersion() {
   try {
     const state = JSON.parse(

package/src/hidden-skills/code-intelligence.md CHANGED

@@ -127,3 +127,9 @@ with ordinary file edits. No repo map for a language — not built in and not ye
 `unmapped` list): either enable it (above) or infer the missing structure from targeted reads
 (entry points, manifests, imports) in the same shape. The downstream contract is identical — only
 the means differ. Say so rather than implying structural coverage you did not have.
+In a **git worktree** (e.g. a bet under delivery): the map cache is per-working-tree, so build it
+in the worktree before relying on it (`npx groundwork-method repo-map`). Serena is registered with
+`--project .`, which resolves to the session root rather than the worktree path — treat its symbol
+tools as best-effort there and lean on the freshly-built map, falling back to ordinary reads under
+the same contract.

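The worktree note above depends on knowing whether the session is running inside a linked worktree. The following is a hedged sketch of one heuristic, not anything groundwork-method ships. In a linked git worktree, `.git` is a regular file containing a `gitdir:` pointer into `<main>/.git/worktrees/<name>`. Submodules also use a `gitdir:` file, but theirs points into `.git/modules/...`, so the sketch checks the path shape. The function name `isLinkedWorktree` is hypothetical.

```typescript
import * as fs from "fs";
import * as path from "path";

// Heuristic: true when `dir` is the root of a linked git worktree.
// The main working tree has a `.git` directory; a linked worktree has a `.git`
// file whose gitdir points under `.git/worktrees/<name>`.
function isLinkedWorktree(dir: string): boolean {
  const dotGit = path.join(dir, ".git");
  let stat: fs.Stats;
  try {
    stat = fs.statSync(dotGit);
  } catch {
    return false; // no .git entry here: not the root of a working tree
  }
  if (!stat.isFile()) return false; // a directory means the main working tree
  const pointer = fs.readFileSync(dotGit, "utf8").trim();
  if (!pointer.startsWith("gitdir:")) return false;
  const gitdir = pointer.slice("gitdir:".length).trim();
  // Submodules point into `.git/modules/...`; only worktrees match this shape.
  return /[\\/]worktrees[\\/][^\\/]+$/.test(gitdir);
}
```

When this returns true, the guidance above applies: build the map in that worktree (`npx groundwork-method repo-map`) before relying on it, and treat Serena's symbol tools as best-effort.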
package/src/hidden-skills/groundwork-architect/SKILL.md CHANGED

@@ -28,7 +28,7 @@ Durable architectural guidance lives in `references/`. This skill decides what t
   2. Contracts are the single source of truth — specs are authored, clients and tests derived.
   3. Reliability and security are designed in from the first boundary, never patched on.
   4. Core-and-edges structure: dependencies point inward toward a core that imports nothing concrete.
-  5. We test the system, not the mock of it — boundaries are chosen to be testable against real things.
+  5. We prove software by using the real thing the way its user does — boundaries are chosen so each is provable against real dependencies through the front door, not behind a mock.
   6. Decisions are recorded and governed — context, assumptions, and trade-offs, with an owner and a review trigger — so they can be re-evaluated when their assumptions break. The record is immutable; the decision is not.
   7. Agents are first-class consumers — every interface is designed to be machine-consumable.

package/src/hidden-skills/groundwork-architect/sync-anchor.md CHANGED

@@ -17,7 +17,7 @@ a review of the matching reference so the distillation never drifts.
 | src/docs/principles/system-design/data-engineering.md | fd0df432fc96d51c52e6ad87bd0159fa7eac7840e669fbb4174a2b6a68ae331d | 2026-06-19 |
 | src/docs/principles/quality/reliability.md | 9c9788504e0963458667d2727c3fc2359776108be593a2efc6603f6470002252 | 2026-06-19 |
 | src/docs/principles/quality/performance.md | 18b6d3391c57d97342068f9f1da732b24de4221489d0459bb6ad8900fac0a02e | 2026-06-19 |
-| src/docs/principles/quality/observability.md | d38ac0eb660fdcebf2532f5955f371e10ada7030c6eda58e360f40e1b82b439c | 2026-06-19 |
+| src/docs/principles/quality/observability.md | 8aa60e213ba03e989c93263153e3a1ac10b2336f6d0360c394f473660d565a0b | 2026-06-26 |
 | src/docs/principles/quality/security.md | 61157d97677142737ec537954dc5aaad7a04012cc8a3dcc855e2d324287fdc64 | 2026-06-19 |
 | src/docs/principles/quality/privacy.md | d84f6bed50169b40daeb2a0ec7082dbd12d91d3abfa304b169cb9eb3fab494fb | 2026-06-19 |
 | src/docs/principles/delivery/platform.md | 3cbf6c13298bf1c148278ae26acdbc2601a06615ff8d85cdb0de3b41c008c626 | 2026-06-19 |
@@ -31,4 +31,4 @@ a review of the matching reference so the distillation never drifts.
 | src/docs/principles/system-design/surface-architecture.md | 724e2183433b0db8d54466deffc0be877d847cdb6b61f0da9060491907151b91 | 2026-06-19 |
 | src/docs/principles/system-design/identity-and-access.md | 18c99f755a37bec69de595a9784171c88639845c13c2f5a8497b55e40c3a5edf | 2026-06-19 |
 | src/docs/principles/system-design/durable-execution.md | e4faad5864bcbecb80c79983be6a941fee652f2f78b38701dd8bd2dda47c3ec3 | 2026-06-19 |
-| src/docs/principles/index.md | 768a6702488641666b785e1aaa694414b4544d97ee098488d447c3c59b20b096 | 2026-06-19 |
+| src/docs/principles/index.md | 86e957ef6437b4ef551a67cb66f1e30aef971716636181bce5f5996f701323c6 | 2026-06-27 |

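The sync-anchor table pairs each reference file with a 64-hex-digit hash and a review date. The hunks above change the hash and date for `observability.md` and `index.md`. This sketch assumes the hash is a SHA-256 of the file's content, which is consistent with the 64-hex-digit length but is not confirmed by the diff. It shows how a drift check could flag rows whose file no longer matches. The names `parseAnchorRows` and `driftedRows` are hypothetical.

```typescript
import { createHash } from "crypto";

interface AnchorRow {
  file: string;
  sha256: string;   // assumed: SHA-256 hex digest of the reference file
  reviewed: string; // date of the last review, YYYY-MM-DD
}

// Parse `| path | hash | date |` rows; header and separator rows do not match.
function parseAnchorRows(markdown: string): AnchorRow[] {
  const rows: AnchorRow[] = [];
  const rowPattern = /^\|\s*(\S+)\s*\|\s*([0-9a-f]{64})\s*\|\s*(\d{4}-\d{2}-\d{2})\s*\|\s*$/;
  for (const line of markdown.split("\n")) {
    const m = line.match(rowPattern);
    if (m) rows.push({ file: m[1], sha256: m[2], reviewed: m[3] });
  }
  return rows;
}

function sha256Hex(content: string | Buffer): string {
  return createHash("sha256").update(content).digest("hex");
}

// Rows whose reference no longer hashes to the recorded value: each one needs
// a review of the matching distillation before its hash and date are bumped.
function driftedRows(rows: AnchorRow[], read: (file: string) => string | Buffer): AnchorRow[] {
  return rows.filter((r) => sha256Hex(read(r.file)) !== r.sha256);
}
```

In practice `read` would be something like `(f) => fs.readFileSync(f)`. It is injected here so the check stays independent of where the references live.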
package/src/hidden-skills/groundwork-bet/briefs/acceptance-auditor.md ADDED

@@ -0,0 +1,68 @@
+---
+name: acceptance-auditor
+description: >
+  Verifies a slice diff does what the design says and nothing more, and does it
+  honestly. One of four independent review lenses the Delivery driver dispatches per
+  slice (groundwork-bet/workflows/04-delivery.md, Step 2); only the report flows back.
+---
+# Acceptance Auditor
+## How This Brief Is Invoked
+This brief runs in an **isolated subagent context** (Protocol 9 mechanics), dispatched
+by the Delivery driver during the slice review, in parallel with the blind reviewer, the
+edge-case tracer, and the coverage auditor. It is **not** the slice-worker that wrote the
+diff. Only the report flows back to the driver.
+This is the only lens that holds the diff against the approved design. It judges
+**conformance and honesty**: does the implementation deliver the slice's Required
+Capabilities, only those, and for the right reason — not by gaming the test.
+## Inputs
+The driver passes:
+- The slice's **uncommitted diff**.
+- The slice's **Required Capabilities** (its Scope, from the slice file under
+  `docs/bets/<bet-slug>/decomposition/`).
+- The prose **API and data design** — `technical-design/03-api-design.md` and
+  `04-data-design.md` — the shapes the implementation must match.
+## The work
+Verify the implementation does what the design says **and nothing more**, honestly:
+- **Conformance.** Each Required Capability is delivered, and the service's generated
+  contract (OpenAPI/AsyncAPI/proto, captured from the running code) matches the prose
+  shapes — field names, types, status codes, error shapes.
+- **Nothing more.** An undeclared endpoint, a field beyond the design, a behaviour the
+  slice was not asked for is scope creep — a finding even when it works. Scope that
+  exceeds the design is risk the review did not sign off on.
+- **Honesty.** The implementation must satisfy its proof for the right reason, against the
+  real product. A return value hardcoded to the test's expected output, an input
+  special-cased to the fixture, an `if TEST_MODE`-style branch, a real unit of work mocked
+  out where the proof meant the real thing, or an error case the design names but the code
+  silently skips — each is a finding even though the suite is green. A weak implementation that
+  a green suite passes is worse than none.
+- **A fake needs a real test behind it.** When the diff (or its test) leans on a fixture,
+  stub, or fake file for work a real stage should do, some test must exercise the real
+  producer. A fixture nothing real ever generates — a hand-written thumbnail no pipeline
+  stage produces, a seeded record no code path writes — is a green light wired to nothing,
+  and a finding (`docs/principles/foundations/testing.md`).
+- **Proven against the shipping build.** Where the slice contributes to a milestone's
+  front-door proof, the work it adds must live in the artifact the consumer actually
+  launches — the packaged app, the embedded worker — not only in a test target that runs
+  code the shipping build never includes.
+You judge against the design, not against general taste — a correctness bug with no
+design angle belongs to the blind reviewer, an unhandled edge to the tracer, a thin test
+suite to the coverage auditor. Stay on conformance and honesty.
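The honesty check is easiest to see with a pair of implementations that the same single-fixture test would pass. In this hypothetical example, the design requires `formatPrice(cents)` to render non-negative integer cents as `"$12.34"`. The first version special-cases the fixture value, which is the "return value hardcoded to the test's expected output" pattern the brief calls a finding. The second version implements the rule. Both names are illustrative only.

```typescript
// Gamed: passes a test that only checks 1234, delivers nothing else.
function formatPriceGamed(cents: number): string {
  if (cents === 1234) return "$12.34";
  return "";
}

// Honest: implements the rule the design names, for any non-negative integer cents.
function formatPriceHonest(cents: number): string {
  const dollars = Math.floor(cents / 100);
  const remainder = (cents % 100).toString().padStart(2, "0");
  return `$${dollars}.${remainder}`;
}
```

A suite with only the `1234` fixture is green for both versions. A second input, such as `5`, separates them. That is why the auditor reads the implementation against the design rather than trusting the green run.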
+## The report
+For each finding: a one-line title, the location (file and line), the specific Required
+Capability or design shape it violates (quote the prose), and why it matters. Suggest a
+nature (decision-needed / patch / defer / dismiss); the driver makes the final call and
+dedupes across the four lenses. If the diff conforms and is honest, say so in one line.
+Keep it to the findings.
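The report fields listed above map naturally onto a small record type. The brief says only that the driver "dedupes across the four lenses", not how. This sketch assumes one plausible rule: findings at the same file and line with the same title, compared case-insensitively, collapse into one finding that remembers every lens that raised it. The `Finding` type, the lens identifiers, and `dedupe` are all hypothetical.

```typescript
type Nature = "decision-needed" | "patch" | "defer" | "dismiss";
type Lens = "blind-reviewer" | "edge-case-tracer" | "coverage-auditor" | "acceptance-auditor";

interface Finding {
  lens: Lens;
  title: string;          // one line
  file: string;
  line: number;
  violates?: string;      // quoted Required Capability or design shape (acceptance lens)
  why: string;
  suggestedNature: Nature; // a suggestion only; the driver makes the final call
}

type MergedFinding = Finding & { lenses: Lens[] };

// Assumed dedupe key: location plus normalised title. Keeps first-seen order.
function dedupe(findings: Finding[]): MergedFinding[] {
  const byKey = new Map<string, MergedFinding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}:${f.title.trim().toLowerCase()}`;
    const existing = byKey.get(key);
    if (existing) {
      if (!existing.lenses.includes(f.lens)) existing.lenses.push(f.lens);
    } else {
      byKey.set(key, { ...f, lenses: [f.lens] });
    }
  }
  return Array.from(byKey.values());
}
```

Recording which lenses agreed is a deliberate choice. A finding that both the acceptance auditor and the blind reviewer raise independently probably deserves more weight than one raised by a single lens.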