npm - groundwork-method - Versions diffs - 0.0.1 → 0.11.0 - Mend

groundwork-method 0.0.1 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (647)

package/src/docs/principles/quality/accessibility.md ADDED

@@ -0,0 +1,88 @@
+---
+title: Accessibility
+description: WCAG 2.2 AA, keyboard-first design, screen-reader flows, and inclusive UX as a baseline, not a stretch goal — and, since the EAA, a legal floor.
+status: active
+last_reviewed: 2026-06-19
+---
+# Accessibility
+## TL;DR
+Every user interface we ship meets WCAG 2.2 AA as a baseline. Keyboard, screen reader, and visual assistive technology are first-class targets, not after-launch polish. A feature that does not work for a keyboard user or a screen-reader user is not finished.
+## Why this matters
+Accessibility is not a niche concern — a significant fraction of users rely on assistive technology at some point, and almost everyone hits a situational version of it (a broken arm, glare on a phone, a noisy room). Three forces make it non-negotiable:
+- **The moral case.** Equal access is a baseline, not a feature flag.
+- **The legal case.** Accessibility is now mandated, not merely encouraged. The EU's European Accessibility Act took effect on 28 June 2025; its harmonized standard, EN 301 549, currently incorporates WCAG 2.1 AA (with 2.2 in progress) for products and digital services placed on the EU market. In the US, the ADA continues to drive thousands of web-accessibility suits a year. "Inaccessible" is a compliance defect with a price tag.
+- **The quality case.** The constraints accessibility imposes — clear hierarchy, visible focus, semantic structure, predictable navigation — produce better software for *every* user. An accessible interface is almost always also a clearer, calmer interface.
+The reason to be deliberate about this is that the default is failure. In WebAIM's 2025 audit of the top one million home pages, 94.8% had detectable WCAG A/AA failures — roughly 51 errors per page. Accessibility does not happen by accident; it is engineered in or it is absent.
+## Our principles
+### 1. WCAG 2.2 AA is the floor, not the ceiling
+We conform to WCAG 2.2 AA for every page, every component, every release. Falling below AA is a bug, not a trade-off we make.
+But "aim for AAA everywhere" is the wrong correction, and the W3C says so directly: Level AAA is *not recommended as a general policy for entire sites*, because some AAA criteria are impossible to satisfy for some content (sign-language interpretation for all audio; a lower-secondary reading level for technical reference). A blanket-AAA target is one you are guaranteed to miss, which trains the team to treat the standard as aspirational rather than binding. The decision rule: **AA across the board, non-negotiable; specific AAA criteria adopted where a journey is critical and the criterion is actually achievable** — enhanced 7:1 contrast (1.4.6), visible location/breadcrumbs (2.4.8), context-sensitive help (3.3.5), no surprise session timeouts (2.2.6). Targeting 2.2 AA keeps us ahead of the EAA's current legal baseline, not behind it. WCAG 3.0 remains an early W3C Working Draft with a different conformance model — track it, but do not architect around it.
+### 2. Keyboard first
+Every interactive element is reachable and usable with the keyboard. Tab order follows reading order, focus is always visible, and there are no keyboard traps. The design test is simple: can a power user — or a user who cannot use a pointer — complete every journey without touching the mouse? Composite widgets (menus, grids, tab sets) follow the ARIA Authoring Practices keyboard model: a single tab stop into the widget, then arrow-key navigation inside it via roving `tabindex`, so a 30-item menu is one tab stop, not thirty.
+### 3. Screen readers see what sighted users see
+Semantic HTML first; ARIA only when HTML cannot express the semantics. The first rule of ARIA use is to prefer a native element whenever one exists, and the ARIA Authoring Practices Guide adds the blunter corollary that *no ARIA is better than bad ARIA*: in WebAIM's million-page survey, pages using ARIA averaged 41% more detected errors than pages without it, because a misapplied `role` silently overrides the native semantics that already worked. A native `<button>`, `<nav>`, or `<label>` is correct by construction; an ARIA reimplementation is correct only if you also wire up every state and key handler by hand.
+When you do name something, follow the accessible-name priority: associate visible text first (`aria-labelledby` pointing at on-screen copy), and reach for `aria-label` only when there is no visible text to reference. Headings form an outline, landmarks mark regions, form fields carry programmatic labels, images carry meaningful alt text. A screen reader should produce a narrative that matches what a sighted user sees — not a richer or poorer version of it.
+### 4. Contrast is measured, not eyeballed
+Low-contrast text is the single most common accessibility failure on the web — present on roughly 79% of the top million home pages and the largest single share of all detected errors. It is also the most preventable. Body text meets 4.5:1, large text 3:1 (SC 1.4.3); UI component boundaries and meaningful graphics meet 3:1 (SC 1.4.11). These ratios are verified by tooling against the actual rendered colours, not judged by eye on a designer's calibrated monitor in a dark room. Brand palettes are checked against contrast at design time; a colour pair that fails AA is a palette bug, not a creative choice to defend.
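As a minimal sketch of what "measured, not eyeballed" means in practice, the check below applies the relative-luminance and contrast-ratio formulas from the WCAG definitions. The function names and the `#rrggbb` input format are illustrative assumptions, not the API of any particular tool.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG relative-luminance formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a colour written as '#rrggbb'."""
    h = hex_colour.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    l1, l2 = relative_luminance(foreground), relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa_text(foreground: str, background: str, large: bool = False) -> bool:
    """SC 1.4.3 thresholds: 4.5:1 for body text, 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large else 4.5)
```

Mid-grey `#777777` on white comes out just under 4.5:1 with this calculation, which is exactly the kind of near miss that is easy to wave through by eye.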
+### 5. Colour is never the only signal
+A red error, a green success, a blue link — each one carries a second, non-colour cue: a label, an icon, an underline, a position. Colour-blind users exist, and colour-only signalling excludes them (SC 1.4.1). This is distinct from contrast: a chart can have perfect contrast and still be unreadable if its only key is "the red line versus the green line."
+### 6. Motion is optional
+Animations respect `prefers-reduced-motion`. Large-scale parallax and aggressive transitions are used sparingly; for users with vestibular conditions, unrequested motion is not decoration, it is an accessibility failure. The reduced-motion path is a real design, not a disabled one — it still communicates state change, just without the movement that triggers nausea.
+### 7. Live regions are used sparingly and correctly
+Real-time updates are announced via `aria-live` when they matter to the user's understanding. But over-announcement is as harmful as silence: a region that fires on every keystroke or background poll teaches the user to tune out the announcements that matter. Use `aria-live="polite"` for status that can wait, reserve `assertive` for genuine interruptions (errors, time-critical alerts), and announce the meaningful delta, not the whole region.
+### 8. Testing is multi-layered
+Automated checks (axe, Lighthouse) run in CI on every build as a gate. But know their ceiling. Deque's analysis of ~2,000 real audits found automated tooling fully covers about 57% of issues *by volume* — and that figure is flattered by colour contrast alone, one high-frequency criterion. Measured by share of WCAG success criteria, automated coverage is closer to a third. Tools cannot judge whether alt text is *meaningful*, whether focus order makes *sense*, or whether a live-region announcement is useful or noise. So the gate has three layers: automated checks in CI, a manual keyboard walk on every new journey, and a screen-reader walkthrough on major features. Test against the combinations users actually run — NVDA with Firefox or Chrome, VoiceOver with Safari, and JAWS for enterprise audiences — because behaviour differs across them. Tools catch the mechanical; humans catch the semantic.
+### 9. Accessibility is reviewed like code
+Accessibility issues are tracked, owned, and closed the same way any other bug is. The backlog does not accumulate a "we will get to the a11y later" queue — that queue grows forever. Every PR author is expected to include the accessibility check in their definition-of-done.
+## How we apply this
+- [Performance](performance.md) — related budgets that compound with accessibility.
+## Anti-patterns we reject
+- **Placeholder text as label.** The placeholder disappears when the field is filled; the label is gone. Users who come back to check the field see nothing. Use a visible label.
+- **`<div>` as button.** A `div` with an `onClick` is invisible to keyboard, screen reader, and user agent. Use `<button>`.
+- **Cramped tap targets.** WCAG 2.2 AA (SC 2.5.8) sets the floor at 24×24 CSS px, or equivalent spacing between smaller targets. That is a floor, not a goal: touch surfaces should aim for ~44×44 (Apple's HIG and the AAA SC 2.5.5), because fingertips are wide and motor-impaired users miss small targets at far higher error rates.
+- **Focus-removal for aesthetics.** `outline: none` without a replacement focus style breaks keyboard navigation entirely. Use `:focus-visible` to style a clear indicator, not to delete one.
+- **Accessibility overlay widgets.** Bolt-on "accessibility" scripts (accessiBe and its peers) do not make a site conformant, and frequently fight the user's own assistive technology. They are a liability, not a shield: over a thousand sites running an overlay were sued in 2024, settlements routinely require *removing* the widget, and the FTC fined accessiBe $1M in 2025 for deceptive accessibility claims. Fix the markup; do not paper over it.
+- **"We will add a11y in v2."** v2 will not have it either. Build it in.
+- **Modals without focus management.** Trap focus inside the modal, return focus to the trigger when it closes, mark it up with `role="dialog"` and `aria-modal="true"`, and give it an accessible name. Otherwise keyboard users are lost behind it.
+## Further reading
+- *WCAG 2.2* ([w3.org/WAI/WCAG22](https://www.w3.org/WAI/WCAG22)) and *Understanding Conformance* ([w3.org/WAI/WCAG22/Understanding/conformance](https://www.w3.org/WAI/WCAG22/Understanding/conformance)) — the normative standard and the rationale for why AAA is not a blanket policy.
+- *ARIA Authoring Practices Guide* ([w3.org/WAI/ARIA/apg](https://www.w3.org/WAI/ARIA/apg)) and *Using ARIA* ([w3.org/TR/using-aria](https://www.w3.org/TR/using-aria)) — the reference for every ARIA pattern and the five rules of ARIA use.
+- *The WebAIM Million* ([webaim.org/projects/million](https://webaim.org/projects/million)) — the annual reality check on what actually fails on the open web.
+- *European Accessibility Act / EN 301 549* — the EU legal baseline; the harmonized standard that maps the law to WCAG.
+- *Inclusive Components*, Heydon Pickering — the canonical pattern language for accessible UI components.
+- *Accessibility for Everyone*, Laura Kalbag — the short introduction for engineers who need to learn the landscape quickly.

package/src/docs/principles/quality/observability.md ADDED

@@ -0,0 +1,84 @@
+---
+title: Observability
+description: OpenTelemetry-first design, SLOs, error budgets, and trace-driven development.
+status: active
+last_reviewed: 2026-06-26
+---
+# Observability
+## TL;DR
+Observability is a design property, not a monitoring bolt-on. We instrument every service with OpenTelemetry from day one, build dashboards from the instrumentation, and use traces as both a debugging tool and a first-class test assertion. If a system is behaving strangely and we cannot see why in our data, the instrumentation — not the guessing — is what we fix.
+## Why this matters
+The difference between a team that can ship with confidence and one that cannot is, most of the time, a difference in what they can see. Observability gives a team three things: the ability to know whether the system is healthy, the ability to localise a fault when it is not, and the ability to explain what happened after the fact. Without those, every deploy is a gamble and every incident is a fresh investigation. With them, the team moves faster and sleeps better.
+## Our principles
+### 1. OpenTelemetry is the common language
+Every service emits traces, metrics, and logs through OpenTelemetry SDKs to a single collector. Vendor lock-in stays at the collector boundary, not inside application code: switching backends is a collector configuration change, not an application rewrite.
+### 2. Traces are the primary signal
+Given a choice between adding a metric and enriching a trace, we enrich the trace: traces preserve causality, metrics aggregate it away, and for a request that crosses half a dozen services causality is the difference between a diagnosable incident and a guessing game.
+But "traces over metrics" is not absolute, and a thoughtful operator will push back. You cannot keep every trace — at scale they are sampled, and a sampled signal is a weak base for an alert that must fire on a single bad request. Metrics are cheap, always-on, and the right substrate for the things you can enumerate in advance. So the decision rule: the health signals you alert on (the RED/USE rates) live as metrics, carrying exemplars back to the traces that produced them; the open-ended question — *why is this slow, and for whom* — lives in traces and wide events. Sample to control cost, but sample at the tail so errors and slow outliers survive the cut (principle 7).
+### 3. The "three pillars" are one pillar
+Logs, metrics, and traces are not independent data — they are different projections of the same events. A log line includes its trace ID; a metric includes the dimensions that let you pivot back to traces; an exemplar on a metric points directly at the trace that produced it. If a team has three disconnected telemetry systems, it has no observability — only three bills and three places to look. The pillars are a storage and query detail; the unit of truth is the event, and every signal must link back to it.
+### 4. Dashboards derive from SLOs
+Every dashboard starts with the user-journey SLO it supports ([Reliability](reliability.md)). Latency percentiles, error rates, saturation, and traffic (the RED/USE layers) then fill in the detail. Dashboards assembled by adding "interesting-looking" graphs drift into uselessness; dashboards derived from SLOs stay useful.
+### 5. Trace-driven development
+When building a new feature, we sketch the trace it should produce *before* we write the handler. What spans must exist? What attributes must each span carry? What parent-child relationships are required? The instrumentation design shapes the code, not the other way around. This makes it essentially impossible to ship a feature that is unobservable.
+### 6. Assert on telemetry in tests
+System tests assert that traces are unbroken end-to-end — a missing span on a critical path is a test failure ([Testing](../foundations/testing.md)). This makes the instrumentation part of the contract rather than an optional decoration, so it cannot silently rot. The mechanism is an in-memory span exporter registered in the test process: exercise the system, then assert on the finished spans. It is a built-in of every OTel SDK and the durable approach now that the dedicated trace-test tools (Tracetest, Malabi) have gone dormant. The failure mode to avoid is over-asserting: a test that pins the exact span tree and every attribute is coupled to implementation detail and will break on every harmless refactor, training the team to delete the assertion rather than trust it. So assert on what the contract actually promises — the spans that must exist on the user journey, that the trace stays connected across service hops, and the attributes a dashboard or SLO query depends on — and let the rest float.
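The "assert on the contract, let the rest float" rule above can be sketched as a small checker. This is a dependency-free illustration: the span shape (dicts with `name`, `span_id`, `parent_id`, `attributes`) and the function name `check_trace` are assumptions made for the example. In a real test the data would come from the finished spans collected by the SDK's in-memory exporter.

```python
def check_trace(spans, required_names, required_attrs):
    """Return a list of contract violations for one finished trace.

    `spans` uses a hypothetical simplified shape: dicts with 'name',
    'span_id', 'parent_id' (None for the root) and 'attributes'.
    """
    problems = []
    names = {s["name"] for s in spans}
    ids = {s["span_id"] for s in spans}

    # 1. The spans the user journey promises must exist.
    for name in sorted(required_names - names):
        problems.append(f"missing span: {name}")

    # 2. The trace must stay connected: one root, no orphaned parent links.
    roots = [s for s in spans if s["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected 1 root span, found {len(roots)}")
    for s in spans:
        if s["parent_id"] is not None and s["parent_id"] not in ids:
            problems.append(f"broken parent link on: {s['name']}")

    # 3. Pin only the attributes a dashboard or SLO query depends on.
    for s in spans:
        for key in required_attrs.get(s["name"], ()):
            if key not in s["attributes"]:
                problems.append(f"{s['name']} missing attribute: {key}")
    return problems
```

Extra spans and extra attributes pass untouched, so a refactor that adds an internal span does not break the test; only a missing promised span, a severed parent link, or a dropped query attribute does.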
+### 7. Logs are structured, sampled, and contextual
+Every log line is structured (JSON), carries its trace ID, and is emitted at a severity that the team has actually agreed on. We sample aggressively at debug and info — nobody needs every log line in production — and we never sample errors away. Traces obey the same logic with the timing reversed: prefer tail-based sampling, where the keep-or-drop decision is made after the trace completes, so every error and slow outlier is retained instead of being dropped at random the way head-based sampling does. Tail sampling costs more to operate (every span of a trace must be buffered to one place); where that cost is not justified, fall back to head-based sampling with errors force-kept. Unstructured log lines are not logs; they are a different kind of noise.
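A minimal sketch of the tail-based keep-or-drop decision described above, assuming the whole trace has already been buffered in one place. The span shape (`error`, `duration_ms`), the thresholds, and the function name are hypothetical; a real deployment would normally configure this in the collector rather than hand-roll it.

```python
import hashlib


def keep_trace(trace_id, spans, slow_ms=1000.0, baseline_rate=0.05):
    """Decide after the trace completes whether to retain it.

    `spans` is a hypothetical list of dicts with 'error' (bool) and
    'duration_ms' (float). Errors and slow outliers are always kept;
    everything else is kept at `baseline_rate`.
    """
    if any(s["error"] for s in spans):
        return True
    if max((s["duration_ms"] for s in spans), default=0.0) >= slow_ms:
        return True
    # Hash the trace ID into [0, 1) so the same trace always gets the same
    # decision, regardless of which process evaluates it.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 2**32
    return bucket < baseline_rate
```

The contrast with head-based sampling is that the first two checks are impossible at the start of a request, because the error or the slow span has not happened yet.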
+### 8. Cardinality is a design choice
+High-cardinality attributes (per-user, per-tenant, per-session) are valuable for debugging but expensive in storage. We tag deliberately — high cardinality on traces where it is queryable, lower cardinality on metrics where it multiplies by every time window. Runaway cardinality is one of the most expensive mistakes a team can make in observability; it is a design call, not a default.
+### 9. Wide events, and instrument by default
+We lean toward "observability 2.0" — arbitrarily-wide, high-cardinality structured events queried after the fact — over pre-aggregated metrics that fix the question in advance. The honest caveat: a single wide-event store is a north star, not a free lunch. Wide events cost more to ingest and store than the metrics they would replace, and folding every signal into one backend trades correlation power for a sharper lock-in. The escape is a columnar store and an open wire format (OpenTelemetry) so the data stays portable and the bill tracks query value rather than vendor pricing — and pre-aggregated metrics keep their place for the cheap, always-on signals of principle 2.
+We also auto-instrument: kernel-level eBPF (OpenTelemetry OBI) and continuous profiling correlated to trace IDs give broad telemetry with no code change. But eBPF sees the wire, not the intent — it cannot name a business operation, attach a domain attribute, or reliably propagate trace context through compiled or encrypted paths, and OBI today is Linux-only with no logs signal. So the division of labour is fixed: auto-instrumentation for breadth and coverage, hand-instrumentation for the domain spans and attributes only we can name.
+### 10. AI systems are observed through GenAI conventions
+A model in the system is instrumented with the OTel GenAI semantic conventions: token usage (cost and latency track tokens, not requests, and prompt-cache hits are tracked separately), prompt/response capture, agent and MCP tool-call spans, and eval traces — with failed production traces promoted into the eval set so the suite grows from real behaviour. A model call logged as an opaque string is unobservable. Two caveats keep this honest: the GenAI conventions are still experimental as of 2026, so pin the semconv version and use the stability opt-in rather than assuming attribute names are frozen; and full prompt/response capture is a PII and storage liability — capture it deliberately, redact at the edge, and sample the payloads rather than the spans.
+## How we apply this
+- [Reliability](reliability.md) — the SLO layer built on top of this telemetry.
+- [Testing](../foundations/testing.md) — how we assert on traces in system tests.
+- [Performance](performance.md) — the latency work that depends on good tracing.
+## Anti-patterns we reject
+- **Pillar-at-a-time adoption.** "We'll add metrics now, traces later." You will not.
+- **Vendor SDKs in application code.** Application code imports OpenTelemetry; the collector talks to the vendor.
+- **Dashboards without SLOs.** Pretty charts without a question they are answering.
+- **Logs-as-debugger.** Using `printf` style logging to trace a single bug. Write a test; add a span.
+- **Print-statement-style `Debug` in production.** If every deploy adds ten debug logs and the next removes twelve, we are missing structure.
+- **Cardinality explosions.** Putting a UUID in a Prometheus label. The bill and the query planner will both remember.
+## Further reading
+- *Observability Engineering*, Majors, Fong-Jones, Miranda — the canonical text on traces-first observability.
+- *Distributed Systems Observability*, Cindy Sridharan — the short, sharp introduction.
+- Charity Majors, *Observability 1.0 vs 2.0* — the wide-events thesis and the honest argument about its cost, on charity.wtf and the Honeycomb blog.
+- *The OpenTelemetry specification* ([opentelemetry.io/docs/specs](https://opentelemetry.io/docs/specs)) — worth reading the high-level overview at least once; see also the GenAI semantic conventions for LLM instrumentation.
+- *Systems Performance*, Brendan Gregg — the canonical reference for the "USE method" (utilisation, saturation, errors).

package/src/docs/principles/quality/performance.md ADDED

@@ -0,0 +1,84 @@
+---
+title: Performance
+description: Latency budgets, tail latency, backpressure, and load shedding.
+status: active
+last_reviewed: 2026-06-19
+---
+# Performance
+## TL;DR
+Performance is not "fast enough" — it is a budget, spent deliberately across every hop of a user interaction and enforced in CI. We optimise for tail latency, we design backpressure into real-time flows, and we measure the things users feel, not the things developers find convenient.
+## Why this matters
+Users notice latency before they notice almost anything else. A response that renders in 800ms feels instant; at 3000ms it feels broken. The difference is not a factor of four in effort — it is a difference of whether the team thought about latency as a design constraint or as a post-hoc tuning problem. Performance handled as an afterthought is invariably more expensive than performance designed in from the start.
+## Our principles
+### 1. Latency is a budget, allocated top-down
+Every user-facing operation starts with a latency budget at the edge — say, 500ms — and that budget is allocated to downstream hops. If one fetch has 300ms and another join has 150ms, the handler has 50ms of its own work. When a hop overruns its allocation, somebody else's budget gets squeezed. The budgeting view makes trade-offs explicit. A budget written once and never checked is fiction: reconcile the allocation against measured per-hop latency, and when the numbers don't add up, the budget is wrong or the architecture is — decide which before you ship.
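The allocation and the reconciliation step can be sketched in a few lines. This assumes the hops run sequentially, so their allocations simply add up; the hop names and function names are illustrative only.

```python
def own_budget_ms(edge_budget_ms, hop_allocations):
    """What is left for the handler's own work after downstream hops are allocated."""
    return edge_budget_ms - sum(hop_allocations.values())


def overruns(hop_allocations, measured_ms):
    """Hops whose measured latency exceeds their allocation, with the excess in ms.

    `measured_ms` is assumed to hold the same percentile the budget was written for.
    """
    return {
        hop: measured_ms.get(hop, 0.0) - allocated
        for hop, allocated in hop_allocations.items()
        if measured_ms.get(hop, 0.0) > allocated
    }
```

With the numbers from the paragraph above, `own_budget_ms(500, {"fetch": 300, "join": 150})` leaves 50ms for the handler, and a non-empty result from `overruns` is the signal that either the budget or the architecture needs revisiting.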
+### 2. Measure tail latency, not average
+p50 tells you about capacity and the typical case; it tells you almost nothing about the experience that drives your reputation. Users remember the slow request, and *which* percentile is the slow request is set by fan-out, not taste. Dean and Barroso's *The Tail at Scale* makes the arithmetic unavoidable: a request that touches 100 backends, each with a 1-in-100 chance of exceeding its p99, will overrun that latency 63% of the time end-to-end (1 − 0.99¹⁰⁰). At fan-out, a leaf service's p99.9 becomes the user's effective median.
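The fan-out arithmetic is short enough to check directly. The sketch below assumes leaf latencies are independent, which is the same simplification the 63% figure rests on; the function name is illustrative.

```python
def p_any_slow(fan_out, leaf_percentile):
    """Probability that at least one of `fan_out` independent leaf calls
    exceeds the latency at `leaf_percentile` (given as a fraction, e.g. 0.99)."""
    return 1 - leaf_percentile ** fan_out


print(round(p_any_slow(100, 0.99), 3))   # 100 leaves, each measured at p99
print(round(p_any_slow(100, 0.999), 3))  # same fan-out, leaves measured at p99.9
```

The first line reproduces the 63% figure. The second shows that even at p99.9 per leaf, roughly one request in ten still waits on at least one slow backend.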
+So the budget percentile is a decision, not a default: target p99 for a single-hop interaction, p99.9 or higher for a high-fan-out request. Measure with coordinated omission in mind, because naive load-test clients silently drop the slow samples that matter most (Gil Tene). And the tail is attackable directly, not only by tuning. Hedged requests (issue a duplicate once the first has been outstanding longer than the p95, then take whichever returns first) are the standard example: in Dean and Barroso's BigTable benchmark, hedging after a 10ms delay cut the 99.9th-percentile latency from 1,800ms to 74ms while sending roughly 2% more requests.
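A minimal sketch of a hedged request using `asyncio`, assuming the call is idempotent and safe to issue twice. `make_call` is a hypothetical zero-argument coroutine function standing in for the real backend call, and `hedge_after_s` would in practice be derived from the observed p95.

```python
import asyncio


async def hedged(make_call, hedge_after_s):
    """Run make_call(); if it has not replied within hedge_after_s,
    race a duplicate against it and return whichever finishes first."""
    primary = asyncio.ensure_future(make_call())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        # Fast path: the common case never sends a second request.
        return primary.result()

    backup = asyncio.ensure_future(make_call())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the loser
    # If the first finisher raised, the exception propagates; a fuller
    # version might wait for the other attempt instead.
    return next(iter(done)).result()
```

Because the duplicate is only sent once the first attempt is already slower than most requests, the extra backend load stays small while the slowest responses are replaced by a second draw from the latency distribution.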
+### 3. Pre-compute, cache, and denormalise deliberately
+When a read is hot, we pre-compute. When a computation is stable, we cache. When a join is expensive, we denormalise. Each of these trades complexity for latency; each of them earns its keep with data, not with intuition. Speculative caching is how cache-invalidation bugs become the biggest source of data incidents.
+### 4. Backpressure is designed in, not hoped for
+Every producer has a bounded queue and a defined behaviour when the queue fills: shed, coalesce, block ([Real-Time](../system-design/real-time.md)). "It works fine in load tests" is not a backpressure strategy.
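A minimal sketch of a bounded queue with two of the three full-queue behaviours named above: coalescing updates for a key that is already waiting, and shedding anything else once the bound is hit. The class and method names are assumptions for illustration. The third behaviour, blocking, is what the standard library's `queue.Queue(maxsize=...)` provides through `put()`.

```python
from collections import OrderedDict


class CoalescingQueue:
    """Bounded queue keyed by entity: newer values overwrite queued ones,
    and new keys are shed once `max_keys` entries are waiting."""

    def __init__(self, max_keys):
        self._items = OrderedDict()
        self._max = max_keys
        self.shed_count = 0  # export this as a metric; silent shedding is a bug

    def offer(self, key, value):
        """Try to enqueue; returns False when the item was shed."""
        if key in self._items:
            self._items[key] = value  # coalesce: keep queue position, replace payload
            return True
        if len(self._items) >= self._max:
            self.shed_count += 1
            return False
        self._items[key] = value
        return True

    def take(self):
        """Remove and return the oldest (key, value) pair."""
        return self._items.popitem(last=False)
```

Coalescing suits state-like streams (latest price, latest cursor position) where only the newest value matters; event-like streams where every item counts need shedding or blocking instead.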
+### 5. Load shedding protects the system from itself
+When the system is saturated, the right behaviour is not to try harder — it is to serve fewer requests well, because trying harder is exactly how an overload turns into a cascading failure (Google SRE). Requests carry a criticality assigned at the edge — Netflix's CRITICAL / DEGRADED / BEST_EFFORT / BULK taxonomy is a sound template — and we shed from the bottom up: prefetch and background work long before user-initiated requests.
+The shed *trigger* is adaptive, not a hand-tuned RPS or CPU threshold that is stale the day load patterns shift. An adaptive concurrency limit that watches the latency gradient finds the saturation point on its own and tracks it as the system changes. And shedding has a softer sibling: graceful degradation reduces the work *per* request — serve cached data, drop personalisation, fall back to a cheaper ranking — before it drops requests entirely. Shedding is a designed degradation mode, not an accident.
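Bottom-up shedding can be sketched as an admission check in which each criticality class may use only a share of the current concurrency limit. The class names follow the taxonomy cited above; the share values are hypothetical, and `concurrency_limit` is assumed to be supplied by an adaptive limiter rather than hard-coded.

```python
from enum import IntEnum


class Criticality(IntEnum):
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


# Hypothetical shares: lower classes are cut off earlier as load rises.
LIMIT_SHARE = {
    Criticality.CRITICAL: 1.00,
    Criticality.DEGRADED: 0.90,
    Criticality.BEST_EFFORT: 0.75,
    Criticality.BULK: 0.50,
}


def admit(criticality, in_flight, concurrency_limit):
    """Admit a request only while in-flight work is below its class's share
    of the limit. The limit itself is expected to move as the limiter adapts."""
    return in_flight < concurrency_limit * LIMIT_SHARE[criticality]
```

At 60 requests in flight against a limit of 100, bulk work is already being refused while every user-initiated class is still admitted, which is the ordering the principle asks for.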
+### 6. Hot paths have no allocations to spare
+For the hottest inner loops — real-time processing, per-request ingestion at high throughput — we write allocation-aware code. Every allocation is a GC pause in waiting, and at high rate the pauses become the latency. The discipline is scoped, not universal: it applies to the paths a profiler has shown to be hot, and applying it everywhere is the over-optimisation it warns against. Most code does not need it; the hot paths demand it.
+### 7. Profile before you optimise
+Two truths usually pitched as opposites. Tuning existing code without a profile is waste — the "obvious" bottleneck is almost always wrong, and Knuth's "premature optimization is the root of all evil," read in full, says forget small efficiencies 97% of the time *and do not pass up the critical 3%*. So every non-trivial tuning effort starts with a profile, taken in production-representative conditions; profiles from developer laptops lie.
+But a profiler only ever tells you where the time goes in the design you already have. It will never tell you to pick a better data structure, flatten an allocation-heavy layout, or kill an N+1 access pattern — and those design-time choices dominate the result, are cheap on the first pass, and are expensive to retrofit. "We'll profile it later" is the standard excuse for skipping them. Decision rule: choose data models, access patterns, and algorithmic complexity with performance in mind up front; reach for the profiler to direct local tuning, never to license thoughtless design.
+### 8. Budgets are enforced in CI
+Performance regressions that slip in once slip in a hundred times, and automation is cheaper than vigilance — so budgets live in CI against committed thresholds, and a PR that regresses one needs an explicit, reviewed waiver. But *what* you gate on matters more than *that* you gate. Shared CI runners are noisy, and a wall-clock microbenchmark that cries wolf on every PR trains engineers to ignore it — worse than no gate at all. So gate hard on the metrics that are deterministic regardless of the runner: bundle size, query count per request, allocation counts, Lighthouse scores. Treat wall-clock timings as a tracked trend with relative thresholds and statistical comparison, or run them on dedicated hardware — never as a hard pass/fail on a shared runner.
+### 9. Place compute deliberately, and price the tokens
+*Where* code runs is a design axis, not only *how much*: the edge for latency-sensitive, cacheable, geo-distributed work (proximity flattens the tail); WebAssembly as the edge/FaaS/plugin compute unit; containers for stateful or heavy work — most systems blend all three. Caching is multi-tier (client, CDN/edge, service, store) with an explicit hit-ratio target, and autoscaling is event-driven with real scale-to-zero (KEDA/Karpenter), not CPU-only HPA. For a model-in-the-loop path, latency and cost track **tokens, not requests** — the levers are model routing, semantic caching at the gateway, prompt/KV caching to cut time-to-first-token, and streaming so the user sees output before generation completes.
+## How we apply this
+- [Observability](observability.md) — the measurement surface for latency work.
+- [Reliability](reliability.md) — the SLO discipline that makes performance budgets enforceable.
+- [Real-Time](../system-design/real-time.md) — the streaming-specific patterns we apply.
+## Anti-patterns we reject
+- **Optimising on hunch.** No profile, no tuning — and no "we'll profile it later" as cover for an unconsidered data model.
+- **"It is fast on my laptop."** Dev latency is not production latency. Measure in the environment that matters.
+- **Average-as-metric.** Reporting only the mean or p50 hides the tail that defines your reputation. Pick the percentile your fan-out demands.
+- **Unbounded queues.** A queue without a max is a latency bomb.
+- **Cache invalidation left to the reader.** If the cache can serve stale data under a defined circumstance, that circumstance is documented. Otherwise it is a bug.
+- **Flaky perf gates.** A wall-clock benchmark gated on a noisy shared runner teaches the team to rubber-stamp red. Gate on deterministic metrics; track the noisy ones.
+- **"We will fix performance later."** If you ship slow, users will remember slow.
+## Further reading
+- *Systems Performance*, Brendan Gregg — the canonical reference; read the USE and RED chapters first.
+- *High Performance Browser Networking*, Ilya Grigorik — the frontend-and-network half of the story.
+- *Latency Numbers Every Programmer Should Know* (Jeff Dean) — calibrate your intuition.
+- Gil Tene, "How NOT to Measure Latency" — the talk on coordinated omission and why naive latency measurements lie.
+- Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM, 2013) — the fan-out arithmetic and the hedged-request pattern.
+- *Google SRE Book*, "Handling Overload" and "Addressing Cascading Failures" — criticality, client-side throttling, and load shedding done right.

package/src/docs/principles/quality/privacy.md ADDED

@@ -0,0 +1,92 @@
+---
+title: Privacy
+description: Data minimisation, GDPR, PII handling, deletion, model training, and data residency for platforms that handle sensitive user data.
+status: active
+last_reviewed: 2026-06-19
+---
+# Privacy
+## TL;DR
+We only collect what we need, keep it only as long as we need it, expose it only where it is needed, and let users see, correct, and remove their own data on demand. Privacy is a design input, not a compliance appendage.
+## Why this matters
+A privacy failure is not a regulatory inconvenience — it is a direct breach of user trust. When a platform handles sensitive user data, remediation is punishingly expensive. Privacy has to be thought about at design time, because once the data exists in the wrong shape or the wrong place, it cannot easily be undone. Some of it — anything that has been backed up, copied to a warehouse, or trained into a model — cannot be undone at all without a deliberate plan made before collection.
+## Our principles
+### 1. Collect the minimum
+For every field we capture, we ask: do we actually need this to deliver the user's outcome? Data minimisation reduces both privacy risk and operational complexity. "We might find it useful later" is not a sufficient reason to collect a field.
+The honest tension is with measurement and ML, where more raw, per-user, fine-grained data always *looks* more useful. The decision rule: collect at the grain the outcome requires, and no finer. For product analytics, prefer aggregates, derived metrics, and event counts over raw identifiable records; where the analysis genuinely needs distributions, coarsen or add differential-privacy noise rather than retaining the raw PII. "It is for analytics" lowers the bar for *nobody* — the same necessity test applies.
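As a rough illustration of "add differential-privacy noise rather than retaining the raw PII", the sketch below applies the Laplace mechanism to a single counting query, where one user changes the count by at most 1. The function name is an assumption, and this is a teaching sketch only: production use would call for a vetted library and explicit tracking of the privacy budget across queries.

```python
import random


def noisy_count(true_count, epsilon, rng=random):
    """Laplace mechanism for a counting query (sensitivity 1).

    Smaller `epsilon` means more noise and stronger privacy.
    """
    scale = 1.0 / epsilon
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with that scale, centred on zero.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise
```

Each individual answer is perturbed, but the noise averages out over many reports, so aggregate trends survive while any single user's presence is masked.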
+### 2. Retain for a bounded time
+Every category of data has an explicit retention policy set at collection time, justified by a purpose and a lawful basis. Expired data is deleted by automation, not by someone's manual Tuesday-afternoon cleanup. "We keep it forever" is never a category.
+Some data legitimately needs a longer clock — security and audit logs, fraud signals, financial records with statutory retention, and records under legal hold. That is not a licence to keep everything: a legal hold suspends deletion for *named* records tied to a specific matter; it does not turn the whole database immutable. The decision rule: each category gets its own period and its own justification, and the longest clock applies only to the records that actually earn it.
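One way to picture the per-category clock and the narrow scope of a legal hold is sketched below. The category names, periods, and function signature are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    category: str
    period: timedelta
    justification: str  # purpose and lawful basis, recorded at collection time


# Hypothetical categories and periods, for illustration only.
POLICIES = {
    "session_events": RetentionPolicy("session_events", timedelta(days=90), "product analytics"),
    "audit_log": RetentionPolicy("audit_log", timedelta(days=730), "security investigation"),
}


def is_due_for_deletion(category, collected_on, today, record_id, held_record_ids):
    """True when the record's own clock has run out and no legal hold names it."""
    if record_id in held_record_ids:  # a hold suspends deletion for named records only
        return False
    policy = POLICIES[category]  # a KeyError here means a category without a policy
    return today >= collected_on + policy.period
```

The hold is a set of record IDs rather than a flag on the table, which keeps it from quietly freezing everything else in the same category.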
+### 3. Access is scoped and audited
+Every internal access to user data is authenticated, authorised, and logged. Engineers cannot browse production data casually; support staff cannot read sensitive records without a clear business reason and an auditable access record. Unsupervised access is a policy failure waiting to be discovered.
+### 4. Users see, control, and remove their data — including the copies
+Data subject rights — access, rectification, portability, deletion — are first-class features, not regulatory bolt-ons. A deletion request flows through the same plumbing as retention expiry: structured, automated, and verifiable.
+"Delete everywhere" is harder than it sounds, and this is where most real systems fail. User data is spread across the primary store, read replicas, caches, search indices, analytics warehouses, and backup snapshots, and an immutable or append-only backup tier cannot be surgically edited row by row. The EDPB's 2025 coordinated enforcement action on the right to erasure found exactly this: controllers that never propagate deletion into backups, and controllers that let a restore silently resurrect deleted data. The decision rule:
+- **Live and queryable tiers** (primary, replicas, caches, indices, warehouse) erase synchronously and verifiably on request.
+- **Immutable backup tiers** either get **crypto-shredding** — encrypt each subject's data under a per-subject key and destroy the key, rendering the ciphertext unrecoverable without touching the snapshot — or rely on a **documented, bounded rotation window** during which any restore is guaranteed to re-apply pending deletions before the data becomes reachable again.
+A deletion that quietly leaves "just this one copy" — most often in a backup or a warehouse export — is a promise broken. So is calling something deleted when it has merely been weakly de-identified.
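The "restore re-applies pending deletions" option can be sketched as a deletion ledger that lives outside the backup tier and is replayed on every restore. The `DeletionLedger` and `restore` names are assumptions for illustration; a real restore path would work on database snapshots rather than dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class DeletionLedger:
    """Record of erased subject IDs, kept outside the immutable backup tier."""
    erased: set = field(default_factory=set)

    def record(self, subject_id: str) -> None:
        self.erased.add(subject_id)


def restore(snapshot: dict, ledger: DeletionLedger) -> dict:
    """Rebuild the live store from a snapshot, dropping erased subjects
    before any of the restored data becomes reachable."""
    return {sid: row for sid, row in snapshot.items() if sid not in ledger.erased}
```

The ledger should itself hold only opaque subject IDs, and its entries can expire once every snapshot that predates the deletion has rotated out.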
+### 5. Design for data residency
+Where data lives matters, both for regulation and for user expectation, and it is a design input to storage and pipeline choices — not an afterthought discovered during procurement.
+It is also widely misunderstood. The GDPR is **not** a data-localisation law: it does not require EU personal data to physically stay on EU soil. It bars *transfers* to a third country unless a lawful mechanism backs them — an adequacy decision (the EU–US Data Privacy Framework is one, declared adequate in 2023 and extended to the EEA), Standard Contractual Clauses, or Binding Corporate Rules. The decision rule: separate a genuine **localisation mandate** (a sectoral or sovereignty law that says the bytes must remain in-country — these are real but specific) from the far more common **transfer-safeguard requirement** (data may leave if a mechanism plus access controls are in place). Pick the storage region from the strictest obligation that actually applies to the data, not from folklore about "EU data can never leave the EU."
+### 6. PII is handled distinctly from content
+Email addresses, names, IP addresses: PII gets shorter retention and tighter access controls, and is explicitly not co-located with content where we can help it. Treating all data the same turns the problems of the most sensitive fields into the problems of every field.
+The mechanism is separation: hold identifiers in a dedicated vault and reference them elsewhere by an opaque token (pseudonymisation). This shrinks the blast radius of any single store and makes crypto-shredding tractable — destroy the vault entry and the tokens dangle. But pseudonymisation is a risk-reduction tool, not an exit: under the GDPR, pseudonymised data is *still personal data* and stays fully in scope. Only true anonymisation leaves scope, and anonymisation is harder than it looks — coarse de-identification often remains re-identifiable by linkage. Do not let a tokenisation layer convince anyone the obligations have disappeared.
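A minimal in-memory sketch of the vault-and-token separation described above. The `IdentifierVault` class and its method names are hypothetical; a real vault would be a separately secured, audited service with encryption at rest.

```python
import secrets


class IdentifierVault:
    """Maps opaque tokens to identifiers; the only store that holds the PII."""

    def __init__(self):
        self._by_token = {}

    def tokenize(self, identifier: str) -> str:
        token = secrets.token_urlsafe(16)  # random, so it reveals nothing about the input
        self._by_token[token] = identifier
        return token

    def resolve(self, token: str):
        """Return the identifier, or None once the vault entry is gone."""
        return self._by_token.get(token)

    def forget(self, token: str) -> None:
        """Destroy the vault entry; tokens held elsewhere now dangle."""
        self._by_token.pop(token, None)
```

Content stores keep only the token. While the vault entry exists the token is still personal data, which is exactly the pseudonymisation caveat above.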
+### 7. Model training is a lawful, transparent, and near-irreversible decision
+User data is used to train or evaluate models only on a lawful, recorded, and defensible basis. Consent is the cleanest basis, but it is often infeasible to obtain at the scale and retroactivity model training demands — a point the EDPB's Opinion 28/2024 on AI models makes directly. Where consent is not workable, **legitimate interest** can be a valid basis *if* it survives the three-step necessity-and-balancing test, the use is disclosed plainly, and users have a real, honoured opt-out. What is never defensible is silent training, or assuming consent because "everyone does."
+Treat the choice to train as **near-permanent**. A model can memorise and regurgitate its training data, and the EDPB has confirmed a trained model is not automatically anonymous. Machine unlearning does not reliably take it back: exact unlearning means retraining from scratch, and approximate methods are unproven and can degrade the model. The decision rule: pick and record the lawful basis *before* training, and never feed a model anything you could not defend keeping forever — because, in practice, training is keeping it forever.
+### 8. Privacy reviews happen before launch
+Every feature that touches user data has a privacy review before it ships — the same rhythm as a security review, often in the same meeting. The reviewer asks the specific questions a regulator or an investigative journalist would, and the answers go on the record. Where the processing is high-risk — large-scale sensitive data, profiling, or novel use of personal data — that review *is* a Data Protection Impact Assessment, which the GDPR makes mandatory rather than optional. "We will do the privacy review after launch" is a commitment that never gets honoured.
+## How we apply this
+- [Data Engineering](../system-design/data-engineering.md) — retention and contract discipline.
+- [Security](security.md) — the perimeter that privacy relies on.
+- [Postgres](../stack/postgres.md) — retention enforced at the storage layer.
+## Anti-patterns we reject
+- **"Privacy is the lawyers' job."** By the time the lawyers are involved, the damage is done. Privacy is an engineering discipline.
+- **Retention by default to forever.** Growing tables nobody cleans are ticking privacy incidents.
+- **Deletion that stops at the live database.** If the backup or the warehouse still has the row, the deletion did not happen. Plan the backup story before you promise erasure.
+- **Anonymisation theatre.** Calling weakly de-identified data "anonymous" or "deleted" when relinking is feasible — flagged repeatedly in EDPB enforcement — is a breach dressed as compliance.
+- **Development data scraped from production.** A dev environment with a sample of real user data is a breach waiting to be noticed.
+- **Analytics as a free pass.** "It is for analytics" is not a sufficient justification for collecting a piece of PII. The same bar applies.
+- **PII in logs.** Trace and log data routinely outlives the systems that produced it. PII does not belong there.
+- **Silent model training.** Training on user data without a recorded lawful basis, plain disclosure, and a real opt-out is not made acceptable by a sentence buried in a ToS.
+## Further reading
+- *GDPR* text and ICO guidance — the canonical European framework.
+- *CCPA/CPRA* — the Californian counterpart.
+- *Privacy by Design*, Ann Cavoukian — the foundational essay on baking privacy into architecture.
+- *Data Protection Impact Assessments* (ICO) — the practical model we use for privacy reviews.
+- EDPB *Opinion 28/2024* on data protection in AI models — lawful basis, legitimate interest, and model anonymity.
+- EDPB *2025 Coordinated Enforcement Framework report on the right to erasure* — backups, restores, and anonymisation-as-deletion.

package/src/docs/principles/quality/reliability.md ADDED

@@ -0,0 +1,89 @@
+---
+title: Reliability
+description: SRE fundamentals, graceful degradation, circuit breakers, and the design patterns that keep systems up under load and failure.
+status: active
+last_reviewed: 2026-06-19
+---
+# Reliability
+## TL;DR
+Reliability is not a feature we add after the system is built. It is a design property we pay for up front, measured in error budgets, defended by graceful-degradation patterns, and rehearsed through deliberate failure injection. Every significant service owns an SLO and lives inside the error budget it implies.
+## Why this matters
+Users do not experience "uptime percentages"; they experience "the thing I needed did not work just now." Reliability is the discipline of keeping that second experience rare enough that users learn to trust the platform. In a real-time product, unreliability compounds: a dropped request becomes a failed operation, a failed operation becomes a broken user journey. The cost of a small reliability failure is rarely proportional to its scope.
+## Our principles
+### 1. SLOs, not uptime percentages
+Every significant service defines a Service Level Objective — a per-endpoint or per-user-journey target with a latency and a success-rate component, measured over a rolling window. "99.9% uptime" is not an SLO; "p95 `POST /resource` < 300ms over 30 days, 99.5% success" is. SLOs are the measurement surface for everything else on this page.
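To make the example SLO concrete, here is a small sketch that checks one window of request records against a latency and a success-rate target. The record shape, the nearest-rank percentile, and the function names are assumptions; real SLIs usually come from the metrics backend rather than raw lists.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n, 1-based
    return ordered[rank - 1]


def meets_slo(requests, latency_target_ms=300, success_target=0.995):
    """requests: non-empty list of (latency_ms, succeeded) tuples
    for one endpoint over one rolling window."""
    latencies = [latency for latency, _ in requests]
    success_rate = sum(1 for _, ok in requests if ok) / len(requests)
    return p95(latencies) < latency_target_ms and success_rate >= success_target
```

Both components must hold: a window that is fast but fails too often misses the SLO, and so does one that succeeds but is slow at the tail.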
+### 2. Error budgets govern velocity
+The budget implied by the SLO — the allowed volume of "bad" events — is a spendable resource. Teams spending below budget ship riskier changes and run experiments; teams that exhaust it pause feature work and pay down reliability debt. This inversion — reliability as a gate on velocity rather than a tax on top of it — is what makes SLOs operationally real.
+The honest failure mode is enforcement. A budget that any team can override under deadline pressure is a dashboard, not a policy. The gate only bites if it is pre-negotiated in writing before the budget is spent: product, engineering, and on-call agree the consequence of exhaustion, name the person empowered to declare and lift a freeze, and define a small, explicit set of "silver bullet" exceptions for genuinely business-critical launches. Decision rule: if you cannot name who enforces the freeze and what the exceptions are, you do not have an error-budget policy — you have an SLO with a graph next to it.
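The arithmetic behind the budget is short enough to show. This sketch assumes an event-based SLO (good events over total events) and hypothetical function names; the freeze decision itself remains a human policy, as described above.

```python
def error_budget(total_events: int, bad_events: int, success_target: float):
    """Return (allowed_bad, remaining) for one SLO window."""
    allowed_bad = (1.0 - success_target) * total_events
    return allowed_bad, allowed_bad - bad_events


def freeze_required(total_events: int, bad_events: int, success_target: float) -> bool:
    """True once the window's budget is fully spent."""
    _, remaining = error_budget(total_events, bad_events, success_target)
    return remaining <= 0
```

For a 99.5% target over one million requests the budget is about 5,000 bad events; the pre-negotiated policy decides who acts when `freeze_required` turns true.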
+### 3. Graceful degradation is a design, not a hope
+Every user-facing feature has a defined behaviour when its downstream fails. A view without synthesis data still renders: the panel shows a "not yet ready" state. A pipeline without a model client enqueues the work and returns the result when it can. Degradation is decided at design time and implemented alongside the happy path, never "we will figure out what to show later."
+### 4. Timeouts, retries, and load shedding are the defaults; circuit breakers are not automatic
+Every outbound call has a timeout. Every retry has a bounded policy with full jitter. These are non-negotiable and set in a shared library so a new service inherits them ([Integration Patterns](../system-design/integration-patterns.md)); opting out requires a written reason.
+The contested part is what sits on top. Retries amplify: a request retried at every layer of a call chain produces attempts equal to the *product* of the per-layer counts, so a small downstream blip becomes a self-inflicted DDoS. Two rules contain this. Retry at exactly one layer of the stack, not at every hop. And cap retries with a shared retry budget — a token bucket where successes refill and retries spend, so retries stop automatically once a downstream is failing (this is how gRPC's retry throttling and the AWS SDK adaptive-retry mode work). Per-call backoff alone does not bound aggregate load; the budget does.
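A hedged sketch of the two rules: a shared retry budget plus full-jitter backoff at a single layer. The class, the constants, and the `call_with_retries` helper are illustrative assumptions and do not reproduce the exact algorithm of gRPC or any AWS SDK.

```python
import random
import time


class RetryBudget:
    """Token bucket shared by all calls to one downstream.

    Each retry spends a whole token; each success refills a fraction of one.
    """

    def __init__(self, max_tokens: float = 10.0, refill_per_success: float = 0.1):
        self.max_tokens = max_tokens
        self.refill_per_success = refill_per_success
        self.tokens = max_tokens

    def on_success(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.refill_per_success)

    def try_spend(self) -> bool:
        """Spend a token and return True if a retry is allowed right now."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: uniform between 0 and the capped exponential backoff."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, budget: RetryBudget, max_attempts: int = 3, sleep=time.sleep):
    """Run `operation`, retrying only while the shared budget has tokens."""
    for attempt in range(max_attempts):
        try:
            result = operation()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # out of attempts or out of budget: fail fast
            sleep(full_jitter_delay(attempt))
        else:
            budget.on_success()
            return result
```

The amplification this guards against is plain multiplication: three attempts at each of three layers is up to `3 ** 3 = 27` attempts against the bottom service for one user request. With the budget drained, every caller gets a single attempt until successes refill it.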
+Circuit breakers are widely prescribed as the default backstop. They are not automatic here. A binary client-side breaker, estimated locally by each of many small or short-lived clients, trips on noisy local samples and can make a partial outage worse by cutting off capacity that was still serving — Marc Brooker's simulations show distributed breakers tripping far too early. Prefer the token-bucket / adaptive throttle, which degrades smoothly instead of snapping fully open. Reach for a real circuit breaker when a downstream fails *slowly* (the failure mode is timeout exhaustion, not fast error responses) or when a cheap local fallback exists — and tune its thresholds against measured traffic, never the library defaults.
+Decision rule: timeout always; retry at one layer with jitter and a shared budget; reach for a circuit breaker only against slow/hanging dependencies or where a cheap fallback exists; and treat server-side load shedding as the backstop you actually trust, because it protects the server regardless of whether every client is well-behaved.
+### 5. Isolate blast radius
+A single tenant, a single user, or a single noisy consumer must not be able to degrade the experience for everyone else. We isolate by quota (per-tenant rate limits), by resource (dedicated queues for hot workloads), and by bulkhead (separate worker pools for separate work types). The design question is always: "if this goes bad, who else is affected?" — and the answer we aim for is "only the thing that went bad."
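Isolation by quota can be sketched as one token bucket per tenant. The `TenantRateLimiter` name, its parameters, and the injectable clock are assumptions for illustration; the sketch is single-threaded and keeps state in memory.

```python
import time


class TenantRateLimiter:
    """One token bucket per tenant, so a noisy tenant exhausts only its own quota."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._tokens = {}  # tenant_id -> tokens currently available
        self._last = {}    # tenant_id -> time of the last refill

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        elapsed = now - self._last.get(tenant_id, now)
        self._last[tenant_id] = now
        tokens = min(self.burst, self._tokens.get(tenant_id, self.burst) + elapsed * self.rate)
        allowed = tokens >= 1.0
        self._tokens[tenant_id] = tokens - 1.0 if allowed else tokens
        return allowed
```

Because each tenant has its own bucket, the answer to "who else is affected?" when one tenant floods the API is nobody: only that tenant's requests are rejected.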
+### 6. Rehearse failure
+Chaos engineering is a practice, not an event. We inject failures — killed pods, degraded networks, slow databases — to surface the reliability assumptions we are making without knowing it. The point is not to "test if chaos works"; it is to find the dependency we forgot was load-bearing before an incident finds it for us.
+Where you inject is a real trade-off, not a slogan. Production is where the signal lives — staging differs in traffic shape, data volume, and dependency topology, so a system that passes every staging experiment can still fall over in production, and a clean staging run buys false confidence. But you do not earn production chaos for free: the precondition is observability good enough to see the blast as it lands and an automated stop that aborts the experiment the moment a real SLO starts to burn. Decision rule: start in staging to shake out the obvious, but treat the experiment as incomplete until it has run in production behind a bounded blast radius and an automatic abort. If you cannot safely abort, you are not ready to inject.
+### 7. Alerts fire on user impact, not on mechanism
+We alert when users are affected — SLO burn rate, error-rate spikes on user journeys — not when a server has 80% CPU. Pages that fire on mechanism without user impact teach on-call to ignore pages, which is how a real incident gets missed.
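A sketch of what "alert on SLO burn rate" means in practice, assuming an event-based SLO. The 14.4 threshold is the example value the *Site Reliability Workbook* gives for a one-hour window paired with a five-minute window on a 30-day SLO; the function names and the tuple shape are assumptions.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def should_page(long_window, short_window, slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window shows the burn is significant; the short window shows it
    is still happening. Each window is a (bad, total) tuple.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

Nothing here looks at CPU or memory: the page fires only when real requests are failing fast enough to threaten the budget.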
+### 8. Every incident teaches a specific lesson
+Post-incident, we write a blameless postmortem that names the specific reliability assumption the incident invalidated and proposes the specific change that would have caught it. We do not write "be more careful" as an action item. We do not write "add more monitoring" without specifying the signal. The goal is one concrete, closable ticket per incident, enforceable and measurable.
+### 9. Cells, living SLOs, and semantic failure
+Blast-radius isolation generalises at scale to **cell-based architecture** — independent cells, each serving a slice of users, so a failure is contained to one cell rather than the fleet. SLOs are hypotheses reviewed against burn (multi-window, multi-burn-rate alerting), not contracts carved once and forgotten. And a model in the loop fails differently: a wrong answer returns 200 OK — valid, on time, and confidently incorrect — so latency and error-rate SLIs miss it entirely. AI features therefore carry a **per-SLI accuracy/consistency budget** distinct from latency, and the model provider is treated as the least-reliable dependency in the chain, with a defined degraded behaviour for when it is slow, wrong, or down.
+## How we apply this
+- [Observability](observability.md) — the measurement layer that makes SLOs possible.
+- [Performance](performance.md) — the tail-latency discipline that sits inside reliability.
+- [Integration Patterns](../system-design/integration-patterns.md) — the concrete patterns (timeouts, circuit breakers) we apply.
+## Anti-patterns we reject
+- **"99.999% uptime" as a target.** Five-nines for a non-core service is a reckless budget. Set an SLO the team can defend.
+- **Retries without policies.** Retry-forever is a self-inflicted DDoS.
+- **Retries at every layer.** Retrying at each hop multiplies one user request into a retry storm. Retry at one layer, and budget it.
+- **Circuit breakers as a reflex.** A binary breaker on library defaults, copied into every client, trips on noise and can deepen a partial outage. Earn it, tune it against real traffic, and prefer adaptive throttling.
+- **Mechanism alerts.** Paging on CPU, memory, or disk without tying it to a user-impact signal. Noise.
+- **"It has not failed yet."** The absence of an observed failure is not evidence that no failure mode exists. Rehearse.
+- **Postmortems that blame humans.** A system that depends on everyone being perfect will fail. The action item is the system fix, not the person lecture.
+- **SLOs nobody tracks.** An SLO without a dashboard and a burn-rate alert is theatre.
+## Further reading
+- *Site Reliability Engineering*, Beyer et al. (the Google SRE book) — the canonical text for SLOs, error budgets, and the operational stance; the "Addressing Cascading Failures" and "Handling Overload" chapters cover retry budgets and client-side throttling.
+- *The Site Reliability Workbook* — the practical companion to the SRE book; more actionable, including the error-budget policy template.
+- *Release It!*, Michael Nygard — the stability-patterns bible (timeouts, bulkheads, the original circuit breaker).
+- *Chaos Engineering*, Rosenthal & Jones — the current state of rehearsed-failure practice.
+- *Amazon Builders' Library* — "Timeouts, retries, and backoff with jitter" (Marc Brooker) and "Using load shedding to avoid overload" (David Yanacek): the load-and-overload patterns this page leans on.
+- Marc Brooker, ["Fixing retries with token buckets and circuit breakers"](https://brooker.co.za/blog/2022/02/28/retries.html) — why distributed circuit breakers misfire and what to reach for instead.

package/src/docs/principles/quality/security.md ADDED

@@ -0,0 +1,78 @@
+---
+title: Security
+description: Zero-trust, threat modeling, SLSA supply-chain integrity, and the secure SDLC.
+status: active
+last_reviewed: 2026-06-19
+---
+# Security
+## TL;DR
+Security is every engineer's job, every day. We treat every service as untrusted, every dependency as a supply-chain risk, every input as hostile, and every secret as already-compromised unless we can prove otherwise. The goal is not zero risk — it is a system that stays standing when any single control fails.
+## Why this matters
+When a platform handles sensitive user data, a security incident is not an inconvenience — it is a breach of the trust users place in the system. Security is the baseline that every other quality concern rests on. A system that is reliable but exploitable is not reliable.
+## Our principles
+### 1. Zero trust between services
+Services authenticate each other on every request. No "internal" network is trusted implicitly; every call carries an identity, every identity is authorised per operation. The concrete mechanism is **workload identity** — short-lived, auto-rotated credentials and mTLS established at the platform layer with no secret in application code; machine identity is the new perimeter. The breach-resistance argument is simple — if an attacker pivots into one service, they do not inherit the blast radius of the entire system. The mechanism scales to the system: SPIFFE/SPIRE issuing auto-rotated SVIDs is the full-control answer, but a managed mesh or signed service tokens from a standard IdP buy most of the breach-resistance for a fraction of the operating cost. The non-negotiable is that identity travels with every call and is verified there — not which issuer mints it. Choose by blast radius, not by fashion.
+### 2. Threat model the change, not just the product
+Every significant change asks the security question before the design is signed off: who could misuse this, and how? A new endpoint, a new data field, a new integration — each gets a five-minute threat conversation. This is cheap upfront and catches most of the issues that would otherwise be found in a pen test or, worse, in production.
+### 3. Secrets are managed, rotated, and audited
+No secret lives in source. The hierarchy is eliminate, then shorten, then rotate. The best secret is no secret: wherever principle 1's workload identity or OIDC federation reaches, there is no static credential to leak. Where a credential is unavoidable, prefer **dynamic, short-lived** secrets — minted per session with a TTL in minutes — over a long-lived value on a rotation calendar. Scheduled rotation of a static secret is closer to theatre than control: an attacker abuses a leaked credential in minutes, not at the next quarterly cycle, so a 90-day rotation bounds nothing that matters. Reserve scheduled rotation for the static credentials that genuinely cannot be made ephemeral. Whatever survives lives in a secret manager, is fetched at runtime, and has every access audited — so the damage window is bounded by the TTL, not by a calendar.
+### 4. Input is hostile; validate at the boundary
+Every piece of input at a trust boundary is validated: request bodies, webhook payloads, message queue events, model outputs. Inside the trust boundary we trust our own types and do not repeat the checks ([Code Craft](../foundations/code-craft.md)). The discipline is that the boundary is explicit and every crossing is scrutinised.
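A minimal sketch of "validate once at the boundary, then trust the type", using a webhook payload as the crossing. The field names and event kinds are hypothetical; a real service would likely use a schema library rather than hand-written checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebhookEvent:
    event_id: str
    kind: str
    amount_cents: int


ALLOWED_KINDS = {"payment.succeeded", "payment.failed"}  # hypothetical event kinds


def parse_webhook(payload: object) -> WebhookEvent:
    """Validate an untrusted payload at the boundary and return a trusted type."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be an object")
    event_id = payload.get("event_id")
    kind = payload.get("kind")
    amount = payload.get("amount_cents")
    if not isinstance(event_id, str) or not event_id:
        raise ValueError("event_id must be a non-empty string")
    if not isinstance(kind, str) or kind not in ALLOWED_KINDS:
        raise ValueError("unknown event kind")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(amount, bool) or not isinstance(amount, int) or amount < 0:
        raise ValueError("amount_cents must be a non-negative integer")
    return WebhookEvent(event_id, kind, amount)
```

Code behind the boundary accepts `WebhookEvent`, never the raw dictionary, so the checks are not repeated and cannot be skipped.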
+### 5. Supply chain is part of our attack surface
+Every third-party dependency is a potential exploit vector. We pin versions, review new dependencies before adoption, and scan on every build. Beyond the SBOM (what is inside) we emit **provenance** (where it came from): artifacts are signed with Sigstore/cosign and ship signed build attestations expressed as SLSA build levels. The target is SLSA Build L3 (a hardened, isolated build platform that signs its own provenance) for anything we publish, and at least L1 provenance on everything built internally — L3 is what makes provenance non-forgeable, so it is the level worth paying for. A dependency added without review is a back door added without review.
+### 6. Least privilege by default
+Every service, every database role, every cloud identity starts with the minimum permissions it needs and is extended only on evidence. "Give it admin and fix it later" is a decision whose "later" never arrives. IAM policies, database roles, and credential scopes are reviewed in the same way code is reviewed.
+### 7. Auth is boring technology
+We do not invent auth. Proven auth providers handle user authentication — OIDC for federation, passkeys/WebAuthn as the phishing-resistant default rather than passwords plus OTP; service-to-service auth uses short-lived tokens from a standard identity provider; session storage follows the OWASP guidance for the context. Exotic auth is how a team learns about auth vulnerabilities the hard way.
+### 8. Detect and respond, not just prevent
+Assume prevention will sometimes fail. We log security-relevant events, alert on suspicious patterns, and run incident-response tabletops so the team knows what to do when something happens. Detection that arrives after the incident is cleaned up is not detection.
+### 9. The model is an attack surface
+A model in the system widens the threat model in ways classic AppSec misses. **Prompt injection** has led the OWASP LLM risks since the list began and is structural, not a bug awaiting a patch: the model mixes instructions and data in one channel, and the injection arrives indirectly through retrieved content, tool outputs, and other agents (it propagates across co-running agents). Treat it as unsolved: no method blocks it 100% of the time, and a guardrail advertising 95% is handing the other 5% to a motivated attacker. So we contain rather than cure, and the containment is architectural. The design-time decision rule is the **lethal trifecta** (Willison) / **Agents Rule of Two** (Meta): an agent acting autonomously may hold at most two of {processes untrusted input, accesses private data or sensitive systems, can change state or communicate externally}. An agent that needs all three does not run unsupervised; it gets a human in the loop, or a fresh and reliably-validated context, before it acts. Underneath that rule: give non-human actors their own identity and per-action tool authorization, treat a tool/MCP catalogue as an execution surface to threat-model rather than an API, and remember that output validation alone is not a defence. Excessive agency is the risk; constraining that agency is the architectural control.
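The at-most-two rule is simple enough to enforce as a design-time check. The sketch below assumes a hypothetical `AgentCapabilities` declaration per agent; how those three flags are derived from a real agent's tools and data access is the hard part and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    processes_untrusted_input: bool
    accesses_private_data: bool
    changes_state_or_communicates: bool


def may_run_unsupervised(caps: AgentCapabilities) -> bool:
    """An autonomous agent may hold at most two of the three properties."""
    held = sum([
        caps.processes_untrusted_input,
        caps.accesses_private_data,
        caps.changes_state_or_communicates,
    ])
    return held <= 2
```

An agent for which this returns `False` is routed to a human approval step or a fresh, validated context before it acts.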
+## How we apply this
+- [Privacy](privacy.md) — the handling of regulated data sits inside the security perimeter.
+- [Reliability](reliability.md) — stability and security share a lot of failure-mode vocabulary.
+- [API Design](../system-design/api-design.md) — signed webhooks, idempotency keys, and structured errors that do not leak internals.
+## Anti-patterns we reject
+- **Internal network = trusted.** This is the assumption every modern breach exploits.
+- **Secrets in environment files checked into Git.** Use the secret manager. Always.
+- **"It is an internal tool, we can skip auth."** Internal tools are an attacker's favourite foothold.
+- **Dependencies pulled in on intuition.** A package with 12 stars, no maintainer, and a vague promise is a supply-chain risk.
+- **Exotic auth.** Custom JWT handling, custom session cookies, custom MFA flows. Use the standard, battle-tested thing.
+- **"The WAF will catch it."** A web application firewall is a last layer. Primary defence is correct code.
+## Further reading
+- *The Tangled Web*, Michal Zalewski — the canonical tour of web-security oddness.
+- *The Web Application Hacker's Handbook*, Stuttard & Pinto — read once to know what you are defending against.
+- *OWASP Top 10* — the catalogue of vulnerabilities every web engineer must know.
+- *SLSA Framework* ([slsa.dev](https://slsa.dev)) — the supply-chain integrity ladder.
+- *Zero Trust Architecture*, NIST SP 800-207 — the canonical definition.
+- *OWASP Top 10 for LLM Applications (2025)*: prompt injection leads the list (LLM01), and excessive agency (LLM06) is on it too.
+- *The lethal trifecta* (Simon Willison) and *Agents Rule of Two* (Meta) — the design rules that bound agent authority.