npm - groundwork-method - Versions diffs - 0.0.1 → 0.10.0 - Mend

groundwork-method 0.0.1 → 0.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (629)

package/src/docs/principles/foundations/product-engineering.md ADDED

@@ -0,0 +1,90 @@
+---
+title: Product Engineering
+description: Engineering in service of user outcomes — shaped work, appetite-based planning, and the refusal to ship the wrong thing faster.
+status: active
+last_reviewed: 2026-06-19
+---
+# Product Engineering
+## TL;DR
+We are product engineers before we are coders. Our job is to move outcomes — not to ship tickets. Work is shaped before it is scheduled, scheduled against a fixed appetite rather than an estimate, and judged by the behaviour it changes rather than the volume of code it produces.
+## Why this matters
+The dominant failure mode of engineering teams is not technical debt — it is building the wrong thing well. John Cutler's "feature factory" and Melissa Perri's "build trap" name the same trap: a team optimises cycle time and output velocity until the product surface grows faster than the value it delivers, and "shipped" quietly replaces "worked" as the definition of done.
+AI sharpens this; it does not soften it. When generating code is nearly free, the binding constraint moves from *can we build it* to *should we, and did it work*. The 2024 and 2025 DORA reports found that AI adoption raised individual throughput but correlated with **lower delivery stability** (larger batch sizes, more code in flight, more ways to be wrong) because friction "doesn't vanish so much as move: from manual grind to deciding and verifying." Product engineering is the discipline that holds the line where the friction now lives: the unit of work is an outcome, the unit of planning is an appetite, and the test of a change is whether someone (a user, an operator, the next engineer) can feel the difference.
+## Our principles
+### 1. Outcomes over outputs
+An "output" is a feature shipped, a ticket closed, a migration completed. An "outcome" is, in Josh Seiden's phrase, a change in behaviour that drives results — what a user can now do, how fast they can do it, how reliably the system holds them up. We plan around outcomes and let outputs be whatever shape delivers them.
+The honest qualifier: outcomes are not always user-visible, and treating "no user-facing change this sprint" as failed work is wrong. Security patching, a load-bearing refactor that unblocks the next three features, paving a platform path that removes friction for internal developers — these move real outcomes (a class of incident disappears, an operator sleeps, a team ships faster) without a single end user noticing. Platform teams are measured this way on purpose: by adoption and friction removed, not deliverables counted.
+So the failure mode is not "invisible to users." It is **output that traces to nothing**: work whose only justification is the ticket it closes. Decision rule: before work is scheduled, name the behaviour change and who experiences it — end user, internal developer, or operator. If no one can name it, the work is unjustified, not merely unshippable.
+### 2. Shape work before scheduling it
+No work enters a cycle without being *shaped*: the problem stated in user terms, the rough solution sketched, the boundaries drawn to exclude rabbit holes. Shaped work is expensive upfront and cheap downstream. Unshaped work is the single biggest source of mid-cycle drift, scope creep, and late discovery that the whole approach was wrong.
+Shaping is bounded on both sides. Too vague and the team inherits the unsolved problem; too concrete and it is waterfall wearing a friendlier name — a finished design handed down, with no room for the people building it to make the hundred small calls only visible from inside the code. Shape Up's altitude is deliberate: concrete enough to bound the work, abstract enough to leave the build to the builders.
+You can only shape what you understand. Decision rule: shape when you know the problem well enough to bound the solution; when you do not — novel domain, unproven technical approach — the move is a time-boxed spike to *buy* that understanding, not a confident shape built on guesses. AI has made a throwaway prototype cheap enough that "shape by building a spike and discarding it" is now often faster than shaping on paper.
+### 3. Appetite, not estimate
+We set an *appetite* — a statement of how much a problem is worth solving, judged by opportunity cost — and design a solution that fits inside it. If it cannot fit, we reduce scope or reject the work. This inverts the usual flow: an estimate starts with a fixed solution and ends with a number; an appetite starts with the number and ends with a solution. It forces "what is the best version of this we can deliver for what it is worth?" and it kills the tendency of work to expand to fill the time available.
+We denominate appetite in worth, not effort, and not by default in calendar time. AI changes execution time unpredictably: METR's 2025 trial measured a 19% *slowdown* for experienced developers working in codebases they already knew well, while speedups on unfamiliar ground can be large. A fixed "two weeks" therefore anchors on the axis that just got cheap and noisy.
+Appetite does not abolish estimation everywhere, and pretending it does is its own failure. A partner-integration deadline, a compliance date, a contractual SLA: these demand a real estimate and a real date, and the appetite must respect them as constraints. Decision rule: appetite governs discretionary product bets, which make up most of the portfolio; estimate where a hard external date or dependency exists, and feed that estimate in as a boundary. The error is mixing them up, either by estimating discretionary work or by setting a soft "appetite" for an obligation that has a date attached. How big a bet is remains a separate question, answered by its *stakes*, meaning what is at risk if we are wrong; see [prioritization-and-appetite](prioritization-and-appetite.md).
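The appetite-versus-estimate decision rule can be sketched as a small classifier. This is an illustrative sketch only: the `Bet` record, its fields, and `planning_mode` are hypothetical names introduced here, not part of groundwork-method.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Bet:
    """Hypothetical record for a piece of candidate work."""
    name: str
    hard_external_date: Optional[date] = None  # partner deadline, compliance date, SLA
    has_external_dependency: bool = False


def planning_mode(bet: Bet) -> str:
    """Return "estimate" when a hard date or dependency exists, else "appetite"."""
    if bet.hard_external_date is not None or bet.has_external_dependency:
        return "estimate"
    return "appetite"


# Example data is made up for illustration.
discretionary = Bet("smarter onboarding checklist")
obligation = Bet("audit-log export", hard_external_date=date(2026, 9, 1))

print(planning_mode(discretionary))  # appetite
print(planning_mode(obligation))     # estimate
```

When the mode is "estimate", the resulting date would then be fed back in as a boundary on the appetite rather than replacing it.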
+### 4. Kill your darlings
+If a feature is not moving an outcome, we remove it. Deletion is the most under-used tool in a product engineer's kit. Every line of code, every doc page, every dashboard tile, every CLI flag that does not pay its maintenance cost is a candidate for the cut. A smaller, sharper product is cheaper to operate and easier for the next engineer to understand.
+Removal has its own cost, and the test is the *net* one. For anything with external surface, Hyrum's Law holds: with enough users, every observable behaviour is depended on by someone, so a hard cut breaks callers and burns trust faster than the cruft ever cost you. Decision rule: internal-only cruft, just delete it; anything users observe or script against goes through deprecate → measure usage → remove, and stays if the migration cost outweighs the carrying cost. The discipline is to default to deletion and make *keeping* earn its place — not to delete blind.
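As a rough sketch of the removal decision rule above, assuming both costs can be put in the same (arbitrary) unit. The `Candidate` record and `removal_plan` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical description of something considered for removal."""
    name: str
    externally_observable: bool  # do users see it or script against it?
    carrying_cost: float         # ongoing cost of keeping it, arbitrary units
    migration_cost: float        # cost callers pay if it is removed, same units


def removal_plan(c: Candidate) -> str:
    """Default to deletion; keeping has to earn its place."""
    if not c.externally_observable:
        return "delete"
    if c.migration_cost > c.carrying_cost:
        return "keep"
    return "deprecate -> measure usage -> remove"


# Example data is made up for illustration.
print(removal_plan(Candidate("unused internal helper", False, 1.0, 0.0)))
# delete
print(removal_plan(Candidate("legacy CLI flag", True, 2.0, 10.0)))
# keep
print(removal_plan(Candidate("old dashboard tile", True, 5.0, 1.0)))
# deprecate -> measure usage -> remove
```

In practice the migration cost is often only known after usage has been measured, so the "keep" branch may be reached from inside the deprecation path rather than before it.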
+### 5. Instrument what you ship
+We decide the signal *before* we ship — event, threshold, success criterion — and we check it after release. A feature whose effect no one watches is a feature no one owns.
+Instrumentation is not the same as quantification, and conflating them produces dashboards that decorate rather than inform. Some outcomes resist a clean number — trust, perceived quality, a rare catastrophic failure avoided. For those the signal is qualitative (interview themes, support-ticket clusters) or a tripwire (a counter-metric that fires when you have made something worse), not another tile. So the real bar is not "measurable" — it is *owned and falsifiable*. Decision rule: before shipping, name the signal **and** the evidence that would make you reverse course. If you cannot say what would change your mind, you are not measuring, you are decorating. And measure the outcome, not the act of shipping — the 2025 DORA finding is that individual throughput gains evaporate at the org level unless they are tied back to a business result. More dashboards is not more insight; one honest counter-metric beats ten vanity lines.
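One way to make "owned and falsifiable" checkable before release is sketched below. The `ShipSignal` record, its field names, and the tripwire tolerance are assumptions made for illustration; the source does not prescribe a data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShipSignal:
    """Hypothetical pre-ship record: what we watch and what would reverse us."""
    outcome: str                      # the behaviour change we expect
    signal: str                       # event, threshold, or qualitative source
    owner: Optional[str]              # who checks it after release
    reversal_evidence: Optional[str]  # what result would make us walk it back


def is_owned_and_falsifiable(s: ShipSignal) -> bool:
    """True only when someone owns the signal and a reversal condition is named."""
    has_owner = bool(s.owner and s.owner.strip())
    has_reversal = bool(s.reversal_evidence and s.reversal_evidence.strip())
    return has_owner and has_reversal


def tripwire_fired(baseline: float, observed: float, tolerance: float) -> bool:
    """Counter-metric tripwire: fires when the observed value exceeds the
    baseline by more than the tolerated fraction (higher is assumed worse)."""
    return observed > baseline * (1 + tolerance)


# Example data is made up for illustration.
decorating = ShipSignal("faster checkout", "checkout_completed event", "payments team", None)
measuring = ShipSignal(
    "faster checkout",
    "checkout_completed event",
    "payments team",
    "support tickets about failed payments rise more than 20%",
)

print(is_owned_and_falsifiable(decorating))  # False
print(is_owned_and_falsifiable(measuring))   # True
print(tripwire_fired(baseline=100, observed=125, tolerance=0.2))  # True
```

A qualitative signal fits the same record: `signal` can name interview themes or ticket clusters, as long as the reversal evidence is still written down.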
+## The product discipline
+This page is the spine of a wider product corpus — the discipline of moving outcomes, expanded into its working parts:
+- [Continuous Discovery](continuous-discovery.md) — mapping the problem space as a weekly habit, before choosing a solution.
+- [Product Risks](product-risks.md) — the four risks (value, usability, feasibility, viability) a bet must clear, and who owns each.
+- [Success Metrics](success-metrics.md) — designing the measure of an outcome: North Star, leading indicators, counter-metrics.
+- [Requirements & Specs](requirements-and-specs.md) — turning validated needs into testable, evidence-grounded statements.
+- [Prioritization & Appetite](prioritization-and-appetite.md) — the portfolio view: choosing and sequencing bets by opportunity cost.
+- [AI-Native Product](../ai-native/ai-native-product.md) — product practice for probabilistic systems: evals, the outcome envelope, the three cost layers.
+## How we apply this
+- [Progressive Delivery](../delivery/progressive-delivery.md) — canaries and flags are the mechanism by which we measure outcomes safely.
+- [Observability](../quality/observability.md) — the signal layer that makes outcome-based engineering possible.
+- [Decisions](../../decisions/) — the record of shaping decisions that cost us real time.
+## Anti-patterns we reject
+- **Velocity-as-KPI.** Story points per sprint measure nothing about user outcomes. Optimising for it corrupts the team — and with AI inflating raw output, it corrupts faster.
+- **Estimate-driven planning.** Estimates anchor on how long the team thinks work will take, not on how much it is worth. We use appetites for discretionary work, and reserve estimates for hard external dates.
+- **"Build it and they will come."** Launching without a signal — and without naming what would make you walk it back — means no one owns the outcome.
+- **Technical-debt-for-its-own-sake projects.** Refactors with no payoff anyone can name are a smell. Tie them to the outcome they enable — faster delivery, fewer incidents, lower carrying cost — and that outcome is the justification.
+- **Big-design-up-front in a shaping costume.** A fully specified solution handed down with no room for the builders is waterfall, whatever the cycle is called.
+## Further reading
+- *Shape Up*, Ryan Singer — the canonical treatment of shaped work and fixed appetites.
+- *Inspired*, Marty Cagan — the product-engineering triad and its implications for how teams are built.
+- *Escaping the Build Trap*, Melissa Perri — why feature-factory metrics corrupt outcomes.
+- *Outcomes Over Output*, Josh Seiden — the working definition of an outcome as a change in behaviour.
+- "12 Signs You're Working in a Feature Factory," John Cutler — the field guide to the failure mode this discipline resists.
+- *State of DevOps* (DORA), 2024 and 2025 reports — the evidence that AI raises throughput while pressuring stability, and that gains must be tied to outcomes to count.
+- "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," METR (2025) — why execution time under AI is unpredictable, not uniformly faster.

package/src/docs/principles/foundations/product-risks.md ADDED

@@ -0,0 +1,89 @@
+---
+title: Product Risks
+description: The four risks every bet must clear before delivery — value, usability, feasibility, viability — and the discipline of killing the riskiest assumption first.
+status: active
+last_reviewed: 2026-06-19
+---
+# Product Risks
+## TL;DR
+Before we commit to building something, we ask whether it can fail in four distinct ways: will users **want** it (value), can they **figure it out** (usability), can we **build** it well (feasibility), and does it **work for the business** (viability). Discovery exists to kill these risks before delivery starts — cheaply, by testing the riskiest assumption first, rather than expensively, by shipping and finding out. Each risk has a clear owner, and that ownership is how the product, design, and engineering disciplines divide the work without gaps.
+## Why this matters
+Most failed features did not fail because they were built badly. They failed because nobody asked the right "could this not work?" question early enough. A team that only asks "can we build it?" ships things that work perfectly and that no one uses. A team that only asks "will users like it?" ships things that delight in a prototype and collapse under real load or real economics. The four-risk frame is a checklist against the specific blind spots each discipline has on its own — and running it during discovery means the miss surfaces at week two, on a sketch, instead of at launch, in production.
+## Our principles
+### 1. Four risks, named explicitly
+Every significant bet faces four categories of uncertainty:
+- **Value** — will customers choose to use or buy this? Does it solve a problem they actually feel? This is the risk most features die of, and the easiest to wave away with "of course they'll want it."
+- **Usability** — can users figure out how to use it? Will they understand it well enough to get the value that is theoretically there?
+- **Feasibility** — can we build it with the time, skills, and technology we have? Does the architecture support it, and can we operate it reliably?
+- **Viability** — does it work for the *business*? Legal, security, cost, support load, brand, and the commercial model all sit here. A feature can be desirable, usable, and feasible and still be a mistake to ship.
+Naming all four forces the question each discipline is prone to skip.
+A fifth question lives *inside* viability and deserves its own name: **should we build it at all?** — the ethical risk. Cagan files ethics under viability deliberately, then warns that it is the one viability concern with no natural stakeholder: legal owns legal, finance owns cost, security owns security, but no one is paid to ask whether the feature is good for the user even when it is good for the metrics. That is exactly why it is the most reliably dropped. Name it on purpose, or it goes unowned.
+### 2. Discovery exists to kill risk before delivery
+The purpose of discovery is not to produce a specification — it is to retire risk. Delivery should begin only once the four risks are low enough that building is the cheapest remaining way to learn. Every assumption we can test with a conversation, a prototype, a spike, or a back-of-envelope cost model is one we should *not* test by shipping. Discovery is the cheap place to be wrong.
+This is not a phase that finishes before delivery opens. Discovery and delivery run continuously and in parallel, not as sequential gates — the [dual-track](continuous-discovery.md) shape. "Risks low enough" is a judgement made one load-bearing assumption at a time, not a sign-off on the whole bet; a healthy team is retiring risk on the next bet while it ships the last one. The discipline is that *the specific assumption a piece of delivery rests on* is cleared before that piece is built — not that all discovery everywhere completes before any code is written.
+### 3. Test the riskiest assumption first
+Not all risk is equal, and the order matters. We surface the assumptions a bet rests on, rank them by how likely they are to be wrong and how much damage a wrong answer does, and test the riskiest one first. If the bet is going to die, kill it on the assumption most likely to kill it — before sinking effort into the assumptions that were never in doubt. Spending discovery on the comfortable questions while the load-bearing one goes untested is how teams feel busy and learn nothing.
+"Riskiest" is the product of two axes, not one: how *uncertain* the assumption is (how little evidence we have either way) and how *load-bearing* it is (how much of the bet collapses if it is wrong). An assumption that is uncertain but cheap to be wrong about can wait; an assumption everyone is confident in but that would sink the bet if it failed still deserves a fast check, because confidence is not evidence. Rank on uncertainty × consequence, and test top-down.
+### 4. Each risk has an owner
+Risk without an owner is risk nobody clears. The accountability splits cleanly across the disciplines:
+| Risk | Owner | Discipline |
+|---|---|---|
+| **Value** | Product | accountable for the outcome |
+| **Viability** | Product | accountable for the outcome |
+| **Usability** | Design | accountable for the experience |
+| **Feasibility** | Engineering / Architecture | accountable for delivery |
+Product owns value and viability because both are judgements about whether the outcome is worth pursuing. Design owns usability because it owns the experience. Engineering and architecture own feasibility because they own what is buildable and operable. The owner of a risk is the person who must produce the evidence that it is cleared.
+Ownership is accountability for the evidence, not a solo assignment. Discovery is done by the trio — product, design, engineering — working the same problem together; an engineer's feasibility spike routinely surfaces a value insight, and a designer's prototype routinely exposes a feasibility wall. The owner is simply who answers for the risk when it is asked about. Viability makes the distinction sharp: product is accountable, but the evidence comes from legal, security, and finance, so the owner orchestrates the answer rather than producing it alone. The failure mode is not collaboration — it is when no single name answers for a given risk, so each is everyone's job and therefore no one's.
+### 5. Match the discovery action to the risk
+Each risk is tested differently, and using the wrong instrument wastes the discovery. Value is tested with user evidence — demand signals, interviews, a fake-door, a willingness-to-pay probe, or a live in-production experiment when the change is reversible, flagged, and measured against a control. Usability is tested with prototypes and observed sessions. Feasibility is tested with a spike, a proof of concept, or an architecture review. Viability is tested by walking the decision past the constraints that bound it — cost model, security posture, legal boundary, support load. A "usability test" that was actually meant to probe value answers the wrong question convincingly.
+The instrument must fit the *stakes* as well as the risk. A randomized production experiment with a control group and a metric chosen in advance is a legitimate — often the cheapest — value test for a reversible change. A full launch to everyone with no hypothesis and no control is not a test; it is a bet you have already placed. The difference between the two is not "production or not" — it is whether there is a way to read the result and a way back.
+### 6. Low stakes earn a lighter pass
+The frame scales to the **stakes**, the bet's size axis defined in [prioritization-and-appetite](prioritization-and-appetite.md) §2: blast radius × reversibility × the human review the work demands. A small-blast-radius, reversible change does not need a four-risk discovery; it needs a quick gut-check and a willingness to undo it. The full, evidence-backed pass is for bets that are hard to reverse, wide in blast radius, or load-bearing for the product. Note that stakes is not effort: a low-effort change to a one-way door is high-stakes and earns the full pass, even when it is fast to build. Running heavy discovery on genuinely low-stakes work is its own failure mode; the discipline is proportionality, not ceremony.
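A minimal sketch of scaling discovery to stakes follows. The 1 to 3 scales, the threshold of 8, and the function name are assumptions invented for this example; the authoritative definition of stakes lives in the prioritization-and-appetite page, which is not shown here.

```python
def discovery_depth(blast_radius: int, irreversibility: int, review_demand: int) -> str:
    """Each axis is scored 1 (low) to 3 (high). Effort is deliberately not an input."""
    for value in (blast_radius, irreversibility, review_demand):
        if not 1 <= value <= 3:
            raise ValueError("each axis must be scored 1, 2, or 3")

    score = blast_radius * irreversibility * review_demand

    # A one-way door or a wide blast radius earns the full pass on its own,
    # however cheap the change is to build.
    if irreversibility == 3 or blast_radius == 3 or score >= 8:
        return "full four-risk pass"
    return "gut-check"


print(discovery_depth(1, 1, 1))  # gut-check
print(discovery_depth(1, 3, 1))  # full four-risk pass
print(discovery_depth(2, 2, 2))  # full four-risk pass
```

Keeping effort out of the signature is the point of the sketch: a change that is fast to build can still be a one-way door.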
+## How we apply this
+- The riskiest-assumption-first ordering is the engine of [continuous discovery](continuous-discovery.md) — the opportunity-solution tree's leaves *are* the assumptions this frame ranks and tests.
+- Feasibility risk is where product hands off to the [architecture discipline](../system-design/code-structure.md) and the engineer skills; the value/viability judgement stays with product.
+- A bet's [appetite](prioritization-and-appetite.md) is set against the risk it carries — a high-value, high-uncertainty bet earns a discovery spike before its delivery appetite is fixed.
+- AI-heavy bets stress specific corners of the frame and need the matching evidence early. Feasibility now includes model non-determinism and an evaluation harness, not just "can we call the API." Viability includes per-call inference economics — a feature can be desirable, usable, and feasible and still lose money on every request — plus unsettled data, copyright, and privacy exposure. Value includes whether users trust the output enough to act on it. Probe these in discovery with a quick eval and a cost-per-action model before the appetite is fixed; a demo that ignores tail-case output and unit economics has cleared none of the real risk.
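The cost-per-action model mentioned in the last bullet can be as small as the sketch below. All prices, token counts, and the revenue figure are placeholder numbers chosen for the example, and the function names are hypothetical; real values would come from the provider's pricing and the product's own data.

```python
def cost_per_action(
    calls_per_action: int,
    input_tokens: int,
    output_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Inference cost of one user-visible action, in the same currency as the prices."""
    per_call = (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k
    return calls_per_action * per_call


def margin_per_action(revenue_per_action: float, cost: float) -> float:
    """Negative means the feature loses money on every request."""
    return revenue_per_action - cost


# Placeholder figures for illustration only.
cost = cost_per_action(
    calls_per_action=3,
    input_tokens=2000,
    output_tokens=500,
    price_in_per_1k=0.003,
    price_out_per_1k=0.015,
)
margin = margin_per_action(revenue_per_action=0.03, cost=cost)

print(f"cost per action:   {cost:.4f}")    # cost per action:   0.0405
print(f"margin per action: {margin:.4f}")  # margin per action: -0.0105
```

A negative margin at this stage is a viability finding, and it is cheaper to learn it from a spreadsheet-sized model than from a launch.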
+## Anti-patterns we reject
+- **The feasibility-only filter.** "Can we build it?" as the only question asked. Produces things that work and that nobody wanted.
+- **Validation theatre.** Discovery run to confirm a decision already made, testing the safe assumptions and skipping the one that could kill the bet.
+- **Unowned risk.** Four risks and nobody accountable for clearing any specific one — so each is everybody's job and therefore no one's.
+- **Shipping to learn, unrigorously.** Using a full production launch as the first test of a high-stakes value question — "we'll see if people use it" — with no hypothesis, no control, and no cheap way back. That is hoping, not learning, and production is the most expensive place to hear no. A reversible, instrumented experiment is the opposite and is welcome; the anti-pattern is the irreversible, unmeasured bet, not learning in production itself.
+- **The forgotten viability risk.** A desirable, usable, feasible feature that quietly triples support load, breaks a compliance boundary, costs more to run than it earns, or is good for the metrics and bad for the user. Viability — ethics most of all — is the risk teams most often never name.
+## Further reading
+- *Inspired* and *Transformed*, Marty Cagan — the four big risks, the discovery techniques that retire them, and the case for treating ethical risk as the unowned corner of viability.
+- *The Four Big Risks*, Silicon Valley Product Group — the concise canonical statement of the taxonomy and its ownership.
+- *Continuous Discovery Habits*, Teresa Torres — assumption mapping and testing the riskiest assumption first.
+- *Updating the Product Risk Taxonomy for the Generative AI Era*, Viget — how each risk shifts for LLM-powered products.

package/src/docs/principles/foundations/requirements-and-specs.md ADDED

@@ -0,0 +1,80 @@
+---
+title: Requirements & Specs
+description: Evidence-grounded, testable specification — jobs-to-be-done, user journeys, stable-ID requirements, acceptance criteria matched to their form, and explicit non-goals.
+status: active
+last_reviewed: 2026-06-19
+---
+# Requirements & Specs
+## TL;DR
+A requirement is a claim about what a user needs to accomplish, grounded in evidence and stated precisely enough to be tested. We frame needs as **jobs to be done**, walk them as concrete **user journeys**, pin each requirement to a **stable ID** so downstream artifacts can reference it, and write acceptance criteria in whatever form makes "done" unambiguous and verifiable. The specification is a living, evidence-backed record of decisions — and the source of truth a builder, human or agent, works from — not a template filled in to look thorough.
+## Why this matters
+Requirements are where product thinking becomes something an engineer, an agent, or a test can act on — and where it most often goes wrong. A spec that lists features instead of jobs builds the wrong thing precisely. A spec with vague acceptance criteria ("the system should handle errors gracefully") cannot drive a test or settle an argument about whether the work is finished. A spec produced by filling a template rather than by understanding the user reads as complete and is hollow. Precise, testable, evidence-grounded requirements are the contract between knowing what to build and building it.
+This matters more, not less, as agents do the building. A model does pattern completion, not mind reading: a vague spec is not refused, it is answered — with a thousand silent assumptions the model invents to fill the gaps. The precision the spec withholds is the precision the build makes up. In an agent-led codebase the spec is read as literally as code, which is the discipline behind spec-driven development (the loop popularised by tools like GitHub's Spec Kit, AWS Kiro, and BMAD): need → spec → plan → tasks → code, with the spec as the artifact every later stage resolves against.
+## Our principles
+### 1. Requirements describe jobs, not features
+We state what the user is trying to accomplish — the **job to be done** — before naming any feature that serves it. A job is the progress a user is trying to make in a situation ("when I finish a task, I want to know it actually completed, so I can move on without checking back"), with its functional, emotional, and social dimensions. Features are solutions to jobs; leading with the feature skips the step where we check the solution actually fits the job. The job is stable; the feature that serves it is negotiable.
+### 2. Walk the journey, do not list the screens
+A user journey is a narrative with structure: a named persona in a context, the state they enter from, the concrete path of steps they take, the moment value is delivered and how they know, and the state they are left in. Walking the journey end to end surfaces the gaps a feature list hides — the empty state, the error halfway through, the second-time-through shortcut. We describe journeys with enough texture that a reader can picture the shape of the interaction, not just enumerate its steps.
+### 3. Stable IDs make requirements referenceable
+Every functional requirement carries a stable, globally unique ID (`FR-1`, `FR-2`, …) assigned once and never reused. The ID is what lets a design doc, an architecture decision, a test, and an acceptance criterion all point at the *same* requirement without ambiguity, and what lets a coverage map prove every requirement is accounted for downstream. Requirements identified only by prose drift apart the moment two documents describe the same thing in different words.
+What does not earn its place is the heavyweight traceability matrix — a hand-maintained, bidirectional grid linking every requirement to every artifact — which rots faster than it informs and is the ceremony agile rightly walked away from. The ID is cheap; the discipline is to reference it, and to let tooling, not a clerk, maintain the links. The payoff scales with how literally the spec is consumed: in a regulated domain that must evidence coverage, or an agent-led codebase where a model resolves `FR-7` against the spec the way it resolves a symbol against its definition, a stable ID is load-bearing; on a two-person throwaway prototype it is overhead. Carry the IDs; skip the matrix.
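"Let tooling, not a clerk, maintain the links" might look like the sketch below: a coverage map built by scanning artifacts for `FR-n` references. The function names and the sample file names are hypothetical, and the regular expression assumes IDs of the exact form `FR-` followed by digits.

```python
import re
from typing import Dict, List, Set

FR_ID = re.compile(r"\bFR-\d+\b")


def referenced_ids(text: str) -> Set[str]:
    """All requirement IDs mentioned anywhere in a piece of text."""
    return set(FR_ID.findall(text))


def coverage(spec_text: str, artifacts: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each requirement ID in the spec to the artifacts that mention it."""
    spec_ids = sorted(referenced_ids(spec_text), key=lambda fr: int(fr.split("-")[1]))
    result: Dict[str, List[str]] = {fr: [] for fr in spec_ids}
    for name, text in artifacts.items():
        for fr in referenced_ids(text):
            if fr in result:
                result[fr].append(name)
    return result


def uncovered(cov: Dict[str, List[str]]) -> List[str]:
    """Requirements that no downstream artifact references."""
    return [fr for fr, names in cov.items() if not names]


# Example data is made up for illustration.
spec = "FR-1: export orders as CSV.\nFR-2: filter by date.\nFR-3: email the export."
artifacts = {
    "design.md": "The export pipeline implements FR-1 and FR-2.",
    "test_orders.py": "# covers FR-1",
}

cov = coverage(spec, artifacts)
print(cov)             # {'FR-1': ['design.md', 'test_orders.py'], 'FR-2': ['design.md'], 'FR-3': []}
print(uncovered(cov))  # ['FR-3']
```

Because the map is derived on demand from the artifacts themselves, there is no separate matrix to keep in sync.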
+### 4. Acceptance criteria are testable — match the form to the criterion
+Acceptance criteria exist to make "done" unambiguous and verifiable. The form serves that goal; it is not the goal. We match the form to the criterion:
+- **Stateful behaviour and flows → Given/When/Then.** Given (precondition) / When (action) / Then (observable outcome), with And for extra conditions. The form forces you to name the starting state, the trigger, and the observable result — and a scenario you cannot fill in is usually one you do not yet understand.
+- **Invariants, validation, and business rules → a rules-based checklist.** "An order total is never negative"; "an email contains exactly one @". Forcing a flat rule into Given/When/Then adds ceremony and buries the rule; a bullet leaves less room to misread and is sharper against scope creep.
+- **Quality attributes → measurable thresholds.** Latency, throughput, accessibility, and error budgets are stated not as prose ("fast", "reliable") but as numbers: "p95 search latency under 200 ms at 1k RPS." A threshold is the only non-functional criterion a test can fail.
+The over-certain version of this principle — "anything that isn't Given/When/Then isn't concrete enough" — is wrong, and it pushes teams into the BDD trap: a parallel Gherkin-plus-step-definition layer maintained for its own sake, brittle and expensive, where the prose outlives the value. The criterion is the contract; the automation is downstream of it. Whatever the form, every criterion is independently verifiable, covers the edge and error cases — not just the happy path — and "done" means every one passes, nothing softer.
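The three forms above can each be expressed as a plain check, without a Gherkin layer. The sketch below is one possible shape: the `Cart` class is a hypothetical system under test, and the nearest-rank p95 is one common percentile definition chosen for simplicity, not something the source specifies.

```python
class Cart:
    """Hypothetical system under test."""

    def __init__(self):
        self.items = []

    def add(self, price, qty=1):
        if price < 0 or qty < 1:
            raise ValueError("price must be >= 0 and qty >= 1")
        self.items.append((price, qty))

    def total(self):
        return sum(price * qty for price, qty in self.items)


def check_flow():
    # Given an empty cart
    cart = Cart()
    # When the user adds two units priced at 5
    cart.add(5, qty=2)
    # Then the total is 10
    assert cart.total() == 10


def check_rule_total_never_negative():
    # Rule: an order total is never negative.
    cart = Cart()
    try:
        cart.add(-1)
    except ValueError:
        pass
    else:
        raise AssertionError("negative price was accepted")
    assert cart.total() >= 0


def p95(samples_ms):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def check_latency_threshold(samples_ms, limit_ms=200):
    # Threshold: p95 latency under the limit.
    assert p95(samples_ms) < limit_ms


check_flow()
check_rule_total_never_negative()
check_latency_threshold([100] * 19 + [500])  # one slow outlier in 20 still passes
print("all criteria pass")
```

Each check can fail on its own, which is what "independently verifiable" asks for; the comments carry the Given/When/Then wording only where the criterion is a flow.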
+### 5. Non-goals are part of the specification
+What a requirement explicitly does *not* cover is as load-bearing as what it does. We state non-goals and out-of-scope boundaries directly, with the reason and where the excluded thing belongs instead. The natural extensions a reader would assume — the adjacent feature, the obvious generalisation — are exactly what must be named as excluded, or scope creeps one reasonable assumption at a time. An explicit boundary is what makes the scope honest.
+### 6. The spec is a living record, not a template fill
+A specification earns its sections; it does not fill them to look complete. We add the sections the product needs, drop the ones that do not apply, and keep the document current as decisions change — surfacing assumptions explicitly (`[ASSUMPTION]`) so they can be confirmed rather than buried. A PRD generated by walking a template top to bottom, padding every heading, is the artifact this principle exists to prevent: it reads thorough and conveys nothing that was not already obvious.
+Living also means reconciled. When the code and the spec disagree, one of them is a defect — and a spec left to drift is worse than no spec in an agent-led codebase, because the agent trusts it literally and builds to the lie. Keeping the spec true to the system is part of the build, not paperwork after it.
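Surfacing `[ASSUMPTION]` markers so they can be confirmed rather than buried can be automated with a few lines. This is a sketch under the assumption that the marker appears literally in the spec text; the function name and the sample spec are hypothetical.

```python
from typing import List, Tuple

MARKER = "[ASSUMPTION]"


def open_assumptions(spec_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line text) for every line carrying the marker."""
    found = []
    for number, line in enumerate(spec_text.splitlines(), start=1):
        if MARKER in line:
            found.append((number, line.strip()))
    return found


# Example data is made up for illustration.
spec = """FR-1: export orders as CSV.
[ASSUMPTION] Exports never exceed 10,000 rows.
FR-2: filter by date.
  [ASSUMPTION] Dates are stored in UTC.
"""

for number, text in open_assumptions(spec):
    print(number, text)
# 2 [ASSUMPTION] Exports never exceed 10,000 rows.
# 4 [ASSUMPTION] Dates are stored in UTC.
```

A check like this could run as part of the build, so that unconfirmed assumptions stay visible as the spec changes.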
+### 7. Requirements are grounded in evidence
+Every requirement traces to a reason it exists — a user need observed in discovery, a job confirmed in a conversation, a problem with evidence behind it. A requirement that traces only to someone's preference is a candidate to cut. Grounding requirements in evidence is what connects the spec back to [continuous discovery](continuous-discovery.md): the spec is where validated needs become buildable statements, not where new unvalidated ones get smuggled in.
+## How we apply this
+- Requirements emerge from validated needs — the jobs and opportunities surfaced in [continuous discovery](continuous-discovery.md), not from a brainstorm of features.
+- Each requirement names its [success metric](success-metrics.md) where one applies, so the spec carries its own definition of whether it worked.
+- The spec is the source of truth the build runs on, not a document read once and abandoned: need → spec → plan → tasks → code, with each later stage derived from the spec rather than re-inventing it. When an agent or engineer needs a decision the spec does not make, that gap is a spec defect to fix, not an assumption to bury in code.
+- Stable-ID requirements and form-matched acceptance criteria are what let the [architecture discipline](../system-design/api-design.md) derive contracts and tests from the spec rather than re-interpreting prose — requirements and contracts share the same source-of-truth discipline.
+## Anti-patterns we reject
+- **Template-fill PRDs.** Every heading padded to look complete, conveying nothing the team did not already know. The template is a checklist, not the thinking.
+- **Feature lists masquerading as requirements.** Solutions enumerated with no job behind them, so nobody can tell whether they fit the need.
+- **Untestable acceptance criteria.** "Works well," "handles errors gracefully," "is intuitive" — none can pass or fail a test, so none can settle whether the work is done.
+- **Form over substance.** Cramming every criterion into Given/When/Then — or maintaining a Gherkin-and-step-definition layer for its own sake — when a rule-list or a numeric threshold would be sharper. The ritual is not the rigour.
+- **Requirements without IDs.** Prose-only requirements that two documents describe differently and that no coverage map can track.
+- **Silent scope.** No non-goals stated, so every reasonable adjacent assumption is fair game and scope grows without a decision.
+- **Spec rot.** A spec that no longer matches the system, trusted literally by the next agent that reads it.
+## Further reading
+- *Competing Against Luck*, Clayton Christensen — the jobs-to-be-done framework in depth.
+- *User Story Mapping*, Jeff Patton — journeys and stories as the structure of a specification.
+- *Specification by Example*, Gojko Adzic — acceptance criteria as the bridge from requirement to test, and when scenarios earn their keep.

package/src/docs/principles/foundations/success-metrics.md ADDED

@@ -0,0 +1,66 @@
+---
+title: Success Metrics
+description: Designing the measure of an outcome — North Star and inputs, leading vs lagging, counter-metrics, and deciding the signal before you ship.
+status: active
+last_reviewed: 2026-06-19
+---
+# Success Metrics
+## TL;DR
+A feature that is not measured does not exist as an outcome. We design the measure before we build the thing: a small number of metrics that represent real user value, paired with the counter-metrics that stop us from gaming them, and chosen so a *no* answer is as informative as a *yes*. Metric design is a product skill distinct from the telemetry that implements it — deciding *what* to measure and *what target* means success is the hard part; the dashboard is the easy part.
+## Why this matters
+Teams measure what is easy to count and then optimise their way into the wrong product. Signups, page views, story points shipped — vanity and output metrics feel like progress while the actual outcome stagnates or regresses. Worse, a single metric pursued without a counterbalance reliably produces a degraded product: optimise engagement and you get dark patterns; optimise speed and you get a product that does the wrong thing faster. Designing the measure well — before launch, with the counter-metrics in place — is what turns "we shipped it" into "we know whether it worked." The measure is part of the design, not a reporting afterthought.
+## Our principles
+### 1. Decide the signal before you ship
+The success signal is a design decision made *before* the work starts, not a question asked after launch. Before building, we name the metric that will move, the direction, and the rough magnitude that would count as success. Deciding it upfront does two things: it forces honesty about whether the feature has a theory of impact at all, and it pre-commits us to a verdict so we cannot rationalise any result as a win after the fact. If we cannot name how we would measure it, we do not yet understand the outcome well enough to build it.
+### 2. A North Star, supported by inputs
+We anchor on a **North Star** that captures the core value the product delivers to *users* — the one number that, if it moves the right way sustainably, means the product is winning. It must be a value metric, not an activity or revenue proxy. Engagement North Stars (sessions, time-on-site) optimise for the product's interest over the user's and decay into dark patterns; revenue North Stars measure extraction, not value delivered, and can climb while the product rots. Beneath the North Star sit a handful of **input metrics**: the leading indicators a team can actually move week to week, whose causal link to the North Star is earned by trial and error, not assumed. Amazon's *Working Backwards* calls these *controllable input metrics* and steers by them precisely because output metrics like revenue report too late and too diffusely to act on.
+The single North Star is genuinely contested, and the objection is fair: one number cannot represent a two-sided marketplace, a multi-product portfolio, or segments with materially different value. Forced onto those, a single metric either flattens real trade-offs or hums along green while the business bleeds — the North Star is never a substitute for business viability. The answer is not a wall of dashboards. **Decision rule:** a focused product with one dominant value loop gets one North Star. A marketplace, platform, or portfolio gets a North Star *strategy* — a one-sentence statement of the value being created — plus a small constellation (roughly one metric per side or segment) that together evidence it. Either way the count stays small and every metric is acted on. One lighthouse's worth of focus, a few levers — not literally one number when the product has two sides.
+### 3. Distinguish leading from lagging
+Lagging metrics (retention, revenue, churn) confirm whether value landed but report too late to steer by. Leading metrics (activation, first-week usage of a key feature, time-to-value) move early and predict the lagging ones. We instrument both and act on the leading ones — a team that can only see lagging metrics is driving by the rear-view mirror. The skill is choosing leading indicators that genuinely predict the outcome rather than merely correlating with activity.
+### 4. Counter-metrics are as load-bearing as primaries
+Every primary metric we optimise gets a **counter-metric** that guards against winning it the wrong way. This is Goodhart's Law made operational: when a measure becomes a target it ceases to be a good measure, because people optimise the number rather than the value behind it — and the more weight the metric carries, the harder it gets gamed (Campbell's Law). Optimising for time-on-task? Counter with task-completion, so we do not reward confusion. Optimising for adoption? Counter with retention, so we do not reward a one-time spike. The counter-metric — what the experimentation world calls a *guardrail* — names the most likely way the primary gets gamed and makes that failure visible. A primary metric without a counter-metric is an invitation to optimise the product into a corner.
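+The adoption-versus-retention pairing above can be sketched as a single judgement that reads both numbers together. This is an illustration under stated assumptions: the `Reading` and `Guardrail` shapes, the 5% tolerance, and the sample values are all hypothetical.
```typescript
// Hypothetical shapes and tolerance; the numbers below are made up for illustration.
interface Reading {
  before: number;
  after: number;
}

interface Guardrail {
  name: string;
  reading: Reading;
  maxRelativeDrop: number; // largest drop in the counter-metric we accept
}

type Outcome = "win" | "gamed" | "no-effect";

// A primary gain only counts as a win if the counter-metric held.
function judge(primary: Reading, guardrail: Guardrail): Outcome {
  if (primary.after <= primary.before) return "no-effect";
  const { before, after } = guardrail.reading;
  const relativeDrop = (before - after) / before;
  return relativeDrop > guardrail.maxRelativeDrop ? "gamed" : "win";
}

const adoption: Reading = { before: 0.2, after: 0.35 };
const retention: Guardrail = {
  name: "week-4 retention",
  reading: { before: 0.4, after: 0.28 },
  maxRelativeDrop: 0.05,
};
const outcome = judge(adoption, retention);
```
+With these sample values adoption rose but retention fell well past the tolerance, so the result is reported as gamed rather than as a win. Reading the two metrics in one function is the point: the primary cannot be celebrated without the guardrail being checked.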
+### 5. The metric must produce a falsifiable verdict
+A good success metric is specific enough that a *no* is as informative as a *yes*. "Users are happier" cannot be falsified; "support tickets citing the confusion drop by at least half within 30 days" can. We reject vague sentiment and abstract aggregates ("engagement improves") in favour of signals tied to a concrete user behaviour and a threshold. The test of a metric is whether a disappointing result would actually change our minds.
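+One way to make the verdict falsifiable in practice is to write the target down as data before release and let a function, not a meeting, read the result. The sketch below uses the support-ticket example from this principle; the `Target` shape, the baseline of 120 tickets, and the observed values are hypothetical.
```typescript
// Hypothetical shape for a pre-registered target; field names are illustrative.
interface Target {
  metric: string;
  baseline: number; // value measured before the change
  direction: "up" | "down"; // which way success moves the metric
  minRelativeChange: number; // e.g. 0.5 = at least a 50% move
  windowDays: number; // how long after release the verdict is read
}

type Verdict = "success" | "failure";

// Relative move in the desired direction; negative if the metric went the wrong way.
function relativeChange(target: Target, observed: number): number {
  const delta = (observed - target.baseline) / target.baseline;
  return target.direction === "up" ? delta : -delta;
}

// Binary by construction: a result short of the threshold is a "no".
function verdict(target: Target, observed: number): Verdict {
  return relativeChange(target, observed) >= target.minRelativeChange
    ? "success"
    : "failure";
}

// Made-up numbers for illustration.
const confusionTickets: Target = {
  metric: "support tickets citing the confusion",
  baseline: 120,
  direction: "down",
  minRelativeChange: 0.5,
  windowDays: 30,
};

const clearDrop = verdict(confusionTickets, 54); // a 55% drop
const nearMiss = verdict(confusionTickets, 70); // roughly a 42% drop
```
+The near miss is the interesting case: a 42% drop is a real improvement, and it still reads as a failure against the committed threshold, which is what keeps the result from being rationalised into a win after the fact.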
+### 6. Match the rigour to the stakes
+Metric design scales with the bet. A load-bearing product decision earns a North Star, instrumented inputs, and a pre-registered target. A small change earns a single observable signal and a glance after release. Demanding a full metric tree for every minor feature is as much a failure as shipping a major bet with no measure at all — the discipline is proportion.
+## How we apply this
+- The success signal a metric defines is the falsifiable outcome a bet's hypothesis commits to — the same signal named in [continuous discovery](continuous-discovery.md) and carried verbatim into the pitch.
+- Metrics are the measure of [product engineering's](product-engineering.md) "instrument everything you ship" — this page is the *design* of the signal; [observability](../quality/observability.md) is the telemetry layer that *captures* it.
+- For AI features, product metrics pair with model-quality metrics — see [AI-native product](../ai-native/ai-native-product.md) for the dual-metric discipline.
+## Anti-patterns we reject
+- **Vanity metrics.** Totals that only ever go up — cumulative signups, total page views — and say nothing about whether the product delivers ongoing value.
+- **Single-metric tyranny.** One number optimised without a counter-metric, which reliably degrades the product along the axis nobody is watching.
+- **Output as outcome.** Counting features shipped or story points burned as if delivery were the goal. Output is the cost, not the result.
+- **The retrospective metric.** Deciding how to measure success only after launch, when any result can be spun into a win.
+- **The unfalsifiable goal.** "Improve the experience" with no behaviour, no threshold, and therefore no possible disconfirmation.
+## Further reading
+- *Escaping the Build Trap*, Melissa Perri — why output metrics corrupt product teams and how outcome metrics fix it.
+- *Lean Analytics*, Croll & Yoskovitz — the One Metric That Matters and choosing it by stage.
+- Amplitude's *North Star Playbook* — the North Star and its input metrics as an operating model.
+- *Working Backwards*, Colin Bryar & Bill Carr — controllable input metrics vs. output metrics, Amazon's operating model for steering by leading indicators.
+- Goodhart's Law (Charles Goodhart) and Campbell's Law (Donald Campbell) — the foundations of why a measure-as-target gets gamed, and why counter-metrics are non-optional.
+- Ravi Mehta, *Your product team doesn't need a North Star Metric* — the case for a North Star strategy over a single number when one metric cannot capture the value.

package/src/docs/principles/foundations/testing.md ADDED

@@ -0,0 +1,82 @@
+---
+title: Testing
+description: Continuous Risk Assurance — testing the system, not the mock of the system.
+status: active
+last_reviewed: 2026-06-19
+---
+# Testing
+## TL;DR
+Tests are risk-weighted assertions about production behaviour — not boxes ticked for coverage. We favour high-fidelity service tests over solitary unit tests, run dependencies we own as real ephemeral containers rather than mocking them, contract-test the ones we don't, and treat observability signals as first-class assertions. The measure of a suite is whether its assertions actually catch faults — not its line-coverage number.
+## Why this matters
+The dominant failure mode of a test suite in 2026 is not that it is too small — it is that it passes while production breaks. Mocked dependencies drift from their real counterparts, unit tests assert on implementation rather than behaviour, and green CI gives a false sense of security. *Continuous Risk Assurance* is our name for the discipline that replaces "coverage as a target" with "risk as the thing we actually measure."
+This matters more, not less, as code generation gets cheaper. When an agent can produce a plausible implementation in seconds, the bottleneck moves from writing code to *trusting* it. The test suite becomes the executable specification that constrains generated code — the thing that says what "correct" means when the author is a model and the reviewer is short on time. A weak suite that generated code passes is worse than no suite, because it manufactures confidence.
+## Our principles
+### 1. Favour service tests over solitary unit tests
+The "sociable" service test is our foundational unit of validation. We test from the API entry point through to real, ephemeral database containers. In a service-oriented codebase the interesting bugs live at the boundaries — HTTP serialisation, SQL query correctness, transaction semantics, event emission — and those are exactly what solitary unit tests mock away. This is the *test honeycomb* shape popularised by Spotify's engineering teams: a fat middle of integrated service tests, a thin layer of solitary unit tests, and a few end-to-end checks on top — not the classic Mike Cohn pyramid that pushes most weight onto isolated units.
+The honest tension: service tests buy fidelity at the cost of speed and diagnostic precision. A solitary unit test that fails names the broken function; a service test that fails tells you "the create-order flow is broken" and leaves you to find where. And a slow, flaky service layer is corrosive — teams that can't trust or tolerate it quietly retreat to mocking everything, which is the exact failure this principle exists to prevent. So fidelity is not a licence to be slow: keep service tests parallelisable, keep fixtures cheap, and treat suite latency as a first-class defect.
+Decision rule: reach for a solitary unit test when the logic is **algorithmically dense and boundary-poor** — a parser, a pricing calculator, a state machine, a validator — where the combinatorics are the risk and a container adds only latency. Reach for a service test when the risk lives in the **wiring** — serialisation, persistence, queries, events, auth. When a service-test failure is routinely hard to localise, that is a signal to factor out the dense core and unit-test it directly, not to mock the boundary.
+### 2. Run real dependencies you own; contract the ones you don't
+For a dependency you own and deploy — Postgres, your message broker, object storage — run the real thing in an ephemeral container (Testcontainers or equivalent). In-memory fakes miss the bugs that actually escape to production: schema and migration mismatches, serialisation edge cases, transaction and isolation behaviour, query-planner surprises. Pin the image to the version you run in production — never `latest`, which turns an upstream release into a flaky build. Reset state between tests for determinism, and share a container across a suite rather than per-test so startup cost doesn't dominate the run.
+But "emulate everything" is a false absolute, and applied carelessly it wrecks the feedback loop — full brokers and databases spun up for tests that exercise none of their behaviour buy nothing but minutes. Two cases break the rule:
+- **Third-party services you do not control** (a payments API, a SaaS provider) usually cannot be containerised faithfully, and a hand-written mock of them is the worst of both worlds — it encodes *your belief* about their behaviour, which is precisely what drifts. Verify against a **contract** instead: a consumer-driven contract (Pact) or a recorded/replayed interaction captured from the real provider, plus a small, periodically-run live suite against a sandbox to detect drift. Pact's leverage is weaker here because you can't compel an external provider to verify your contract, so treat the contract as a drift detector, not proof.
+- **Pure logic with no real I/O.** If the unit under test has no genuine dependency, don't invent one to stand a container up behind. Test it directly.
+Decision rule: emulate the data and serialisation boundaries you own; contract-test the boundaries you don't; mock only at a seam you fully control and only when the real thing adds latency without adding risk. A mock that stands in for a database is almost always the wrong call (see anti-patterns); a recorded contract for a remote API you can't run is often the right one.
+### 3. Observability is a test surface
+OpenTelemetry instrumentation is a design-time concern, not an afterthought. System tests assert that traces are unbroken end-to-end: a missing span, a lost TraceID, or a broken parent-child relationship is a test failure, not an instrumentation TODO. The boundary between "test" and "monitor" dissolves: both ask whether the system is behaving as we claim. The payoff is twofold: the same instrumentation that proves correctness in CI is what lets you debug the incident in production.
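+A trace-continuity assertion can be sketched as a pure check over the spans a test collected. The `SpanRecord` shape below is a hypothetical minimal record, not the OpenTelemetry SDK's own type, and how the spans are captured (for example through an in-memory exporter) is left out.
```typescript
// Hypothetical minimal span record; not the OpenTelemetry SDK's own type.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // absent on the root span
  name: string;
}

// Returns human-readable problems; an empty array means the trace is unbroken.
function traceProblems(spans: SpanRecord[]): string[] {
  if (spans.length === 0) return ["no spans recorded"];
  const problems: string[] = [];

  // Every span must belong to the same trace.
  const traceIds = spans
    .map((s) => s.traceId)
    .filter((id, i, all) => all.indexOf(id) === i);
  if (traceIds.length > 1) {
    problems.push(`trace split across ${traceIds.length} trace IDs`);
  }

  // Exactly one root span.
  const roots = spans.filter((s) => s.parentSpanId === undefined);
  if (roots.length !== 1) {
    problems.push(`expected 1 root span, found ${roots.length}`);
  }

  // Every non-root span must point at a span that was actually recorded.
  const known: Record<string, boolean> = {};
  for (const s of spans) known[s.spanId] = true;
  for (const s of spans) {
    if (s.parentSpanId !== undefined && !known[s.parentSpanId]) {
      problems.push(`span "${s.name}" has a missing parent ${s.parentSpanId}`);
    }
  }
  return problems;
}

// Made-up spans for a create-order flow.
const healthy: SpanRecord[] = [
  { traceId: "t1", spanId: "a1", name: "POST /orders" },
  { traceId: "t1", spanId: "b2", parentSpanId: "a1", name: "INSERT orders" },
  { traceId: "t1", spanId: "c3", parentSpanId: "a1", name: "publish order.created" },
];
const broken: SpanRecord[] = [
  healthy[0],
  healthy[1],
  { traceId: "t2", spanId: "c3", parentSpanId: "lost", name: "publish order.created" },
];
```
+A system test would fail when `traceProblems` returns anything, and the returned messages name which span lost its context. In the `broken` sample the publish step started a new trace and lost its parent, which is the typical signature of context not being propagated across an async boundary.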
+### 4. Name tests by behaviour, not implementation
+A test name must let an on-call engineer form a hypothesis from the failure log alone, without opening the test file. The default form — `[Unit] should [expected outcome] when [condition]` — encodes that intent, and names like `TestCreateItem_Success` are banned because they convey nothing beyond what the dashboard already shows. The format serves the goal; the goal is the rule. A name that states behaviour and condition in another shape is fine. A name that follows the template but says nothing specific (`should work when called`) is not.
+### 5. Risk-based depth, and prove the assertions bite
+Coverage percentages are meaningless without proof that the assertions catch real faults — a suite can execute every line and assert nothing. We score modules on Impact × Complexity × Change-frequency before deciding test depth: high-risk modules earn live system tests and chaos experiments; low-risk modules need only small tests and static analysis. Equal depth everywhere is wasted effort.
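+The scoring step can be sketched as below. Only the three factors come from this page; the 1-5 scales, the cut-offs of 60 and 20, the middle tier, and the sample modules are assumptions that a team would calibrate against its own codebase.
```typescript
// Hypothetical 1-5 scales and thresholds: the page names the three factors
// but not the scale or the cut-offs.
type Score = 1 | 2 | 3 | 4 | 5;

interface ModuleRisk {
  name: string;
  impact: Score; // blast radius if it fails
  complexity: Score; // how hard the logic is to reason about
  changeFrequency: Score; // how often it is touched
}

type Depth =
  | "system tests + chaos"
  | "service tests"
  | "small tests + static analysis";

// Product of the three factors, ranging from 1 to 125.
function riskScore(m: ModuleRisk): number {
  return m.impact * m.complexity * m.changeFrequency;
}

// Map the score to a test depth; the middle tier is an assumed default.
function testDepth(m: ModuleRisk): Depth {
  const score = riskScore(m);
  if (score >= 60) return "system tests + chaos";
  if (score >= 20) return "service tests";
  return "small tests + static analysis";
}

// Made-up modules for illustration.
const payments: ModuleRisk = { name: "payments", impact: 5, complexity: 4, changeFrequency: 4 };
const settingsPage: ModuleRisk = { name: "settings-page", impact: 2, complexity: 1, changeFrequency: 2 };
```
+Multiplying rather than adding means a module that scores low on any one factor drops sharply, which matches the intent: a complex module nobody touches and nothing depends on does not earn chaos experiments.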
+The honest measure of whether assertions bite is **mutation testing** (PIT, Stryker, or equivalent): inject deliberate faults and confirm a test fails. A surviving mutant is a line you cover but do not actually check. Mutation testing is expensive — its naive cost is the suite run times the number of mutants — so don't run it across the whole tree. Run it on the high-risk modules the matrix flags, and on changed code in CI, where it doubles as a quality gate on new tests (the use Meta reported for its LLM-assisted mutation work in 2025). Use it as a periodic read-out of assertion quality, never as a blanket gate.
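+A hand-rolled illustration of a surviving mutant follows. Real tools such as PIT or Stryker generate and run the mutants automatically; this sketch writes one by hand, with a made-up free-shipping rule, only to show why full line coverage can still leave a fault undetected.
```typescript
type Impl = (total: number) => boolean;

// Code under test (hypothetical rule): orders of 100 or more ship free.
const original: Impl = (total) => total >= 100;

// A typical mutant: the boundary operator flipped from >= to >.
const mutant: Impl = (total) => total > 100;

// Weak suite: executes every line but never probes the boundary.
const weakSuite = (impl: Impl): boolean =>
  impl(150) === true && impl(50) === false;

// Strong suite: adds the boundary case the mutant gets wrong.
const strongSuite = (impl: Impl): boolean =>
  weakSuite(impl) && impl(100) === true;

// A mutant is killed when the suite passes on the original and fails on the mutant.
const kills = (suite: (impl: Impl) => boolean): boolean =>
  suite(original) && !suite(mutant);

const weakKills = kills(weakSuite); // the mutant survives the weak suite
const strongKills = kills(strongSuite); // the boundary case kills it
```
+Both suites give 100% line coverage of `original`, yet only the one that asserts on the boundary value notices the flipped operator. That gap between "executed" and "checked" is what a mutation score measures.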
+### 6. Tests are part of the change, not after it
+A feature PR without tests is incomplete, and we review the test with the same rigour as the code. Tests deferred to a "follow-up PR" compete with the next feature and usually lose, so the work isn't done until the verification ships with it. The exceptions are honest and narrow: a spike or throwaway prototype whose purpose is to be deleted does not need tests — but the moment it becomes the implementation, it does.
+This is a discipline about *what ships together*, not a mandate to write tests first. Test-first (TDD) is a powerful design tool — it forces you to use your own interface before committing to it — but it is a tool, not a law, and the "Is TDD Dead?" exchange between Kent Beck, Martin Fowler, and DHH named the real cost: dogmatic test-first can induce *design damage*, contorting code with needless indirection purely to make it mockable. Hold both signals. If a change resists testing, that usually means the design is wrong — fix the code. But if the *only* way to test it is to shatter a cohesive unit into layers of indirection nothing else needs, the test is making the demand, and the design was right. Write the test with the change; let it pressure the design; don't let it deform the design.
+## How we apply this
+- [Observability](../quality/observability.md) — the OTel-first stance that makes traces-as-assertions possible.
+- [Reliability](../quality/reliability.md) — how tests compose with chaos and load experiments.
+- [How We Structure Code](../system-design/code-structure.md) — the structural choice that makes tests cheap to write and fast to run.
+## Anti-patterns we reject
+- **Mocking the database.** A test that mocks the database asserts against your SQL-writing skill, not against database behaviour. Use an ephemeral container.
+- **Retrying flaky tests until green.** A test that passes on the third run is a failing test with a coin flip attached, and rerun-to-green trains the whole team to ignore red. Quarantine the flake out of the gating suite, file it, and fix the root cause — non-determinism, timing, shared state, test order. Quarantine is a triage state with a deadline, not a graveyard.
+- **Snapshot tests as a default.** Snapshots are a brittle, noisy substitute for behavioural assertions, and "update snapshots" becomes a reflex that launders bugs into the baseline. Acceptable only when the artefact is genuinely opaque (a rendered email, a serialised response).
+- **Coverage-gated CI.** "95% line coverage required" is a metric gamed without reducing real risk. Use coverage as a read-out, mutation score as the quality signal, never line coverage as the gate.
+- **Shared staging environments as the integration test.** Staging has no hermetic guarantees, no reproducibility, no determinism. It is a deployment target, not a test bed.
+- **"It's hard to test, so we didn't."** That is a signal the code is badly designed. Fix the code.
+## Further reading
+- *Accelerate*, Forsgren, Humble, Kim — the empirical case for continuous delivery and its testing discipline.
+- *Working Effectively with Legacy Code*, Michael Feathers — seams, test doubles, and when each is appropriate.
+- *Growing Object-Oriented Software, Guided by Tests*, Freeman & Pryce — the canonical treatment of outside-in service testing.
+- *xUnit Test Patterns*, Gerard Meszaros — the vocabulary we use for test doubles, fixtures, and strategies.
+- *Is TDD Dead?*, Beck, Fowler & Heinemeier Hansson — the conversation that maps the contested zone between test-first discipline and test-induced design damage.
+- "UnitTest" and "Testing Pyramid", Martin Fowler (martinfowler.com) — the sociable-vs-solitary distinction and the shape trade-offs.

package/src/docs/principles/index.md ADDED

@@ -0,0 +1,23 @@
+---
+title: Engineering Manifesto
+description: The core beliefs that shape how we build software — complexity, contracts, reliability, testing, architecture, documentation, decisions, and AI-native development.
+status: active
+last_reviewed: 2026-06-19
+---
+# Engineering Manifesto
+Software engineering is the discipline of managing complexity and optimising for change. A platform that processes high-volume asynchronous workloads and serves users in real time at scale must lean hard on a solid technical foundation, frictionless developer velocity, and a rigorous engineering culture.
+> [!IMPORTANT]
+> These principles are the shared vocabulary we use to decide what to build, how to build it, and what trade-offs we accept. Every page in this hub stands on its own and does not require context from any other document to be useful.
+## What we believe
+1. **Complexity is the enemy; clarity is the goal.** We choose simple designs, simple tools, and simple processes — and we accept the cost of doing so. Speculative abstraction, premature generalisation, and fear of deletion all compound into the kind of complexity that slows teams down.
+2. **Contracts are the single source of truth.** API specifications, event schemas, and database definitions are authoritative. Clients, tests, documentation, and UIs are derived from them. When a spec is wrong, everything downstream is wrong — and that is the correct failure mode, because one visible error beats silent drift across hand-maintained artefacts.
+3. **Reliability is designed in, not patched in.** We build for failure from the first commit: idempotency at the API boundary, graceful degradation at the edges, backpressure when downstream systems slow, and observability as a design-time concern rather than an afterthought.
+4. **We test the system, not the mock of the system.** Tests that run against real databases, real message brokers, and real HTTP stacks catch the bugs that mocked tests hide. Emulation beats mocking wherever the dependency can run in a container.
+5. **A pure core, swappable edges, and one obvious place for everything.** Every service is a pure decision-making core wrapped in a thin shell that does I/O; concrete dependencies plug in behind abstractions the core owns and stay swappable, with no implementation detail leaking inward. The structure is opinionated, so neither a human reading the code nor an agent writing it ever has to guess where a thing belongs. See [How We Structure Code](system-design/code-structure.md).
+6. **Documentation is a product, not a by-product.** This documentation is versioned, reviewed, and shipped with the same discipline as code. It serves humans and AI agents, and the structures that help one help the other.
+7. **Architectural decisions are recorded and governed.** We capture each significant decision with the context, assumptions, and trade-offs that shaped it, then govern it — an owner, a review trigger, and supersession rather than silent edits when it changes. The record is immutable so the trail of *why* survives; the decision stays open to re-evaluation when its assumptions break. Re-deciding is healthy engineering; re-deciding without recording it is how teams lose their memory. See [Architecture Decisions](system-design/architecture-decisions.md).
+8. **AI agents are first-class engineers.** They read our docs, write our code, review our diffs, and run our tooling. We design our codebase, our conventions, and this documentation so an agent can operate at the same level of quality as a senior engineer.

package/src/docs/principles/quality/accessibility.md ADDED

@@ -0,0 +1,88 @@
+---
+title: Accessibility
+description: WCAG 2.2 AA, keyboard-first design, screen-reader flows, and inclusive UX as a baseline, not a stretch goal — and, since the EAA, a legal floor.
+status: active
+last_reviewed: 2026-06-19
+---
+# Accessibility
+## TL;DR
+Every user interface we ship meets WCAG 2.2 AA as a baseline. Keyboard, screen reader, and visual assistive technology are first-class targets, not after-launch polish. A feature that does not work for a keyboard user or a screen-reader user is not finished.
+## Why this matters
+Accessibility is not a niche concern — a significant fraction of users rely on assistive technology at some point, and almost everyone hits a situational version of it (a broken arm, glare on a phone, a noisy room). Three forces make it non-negotiable:
+- **The moral case.** Equal access is a baseline, not a feature flag.
+- **The legal case.** Accessibility is now mandated, not merely encouraged. The EU's European Accessibility Act took effect on 28 June 2025; its harmonized standard, EN 301 549, currently incorporates WCAG 2.1 AA (with 2.2 in progress) for products and digital services placed on the EU market. In the US, the ADA continues to drive thousands of web-accessibility suits a year. "Inaccessible" is a compliance defect with a price tag.
+- **The quality case.** The constraints accessibility imposes — clear hierarchy, visible focus, semantic structure, predictable navigation — produce better software for *every* user. An accessible interface is almost always also a clearer, calmer interface.
+The reason to be deliberate about this is that the default is failure. In WebAIM's 2025 audit of the top one million home pages, 94.8% had detectable WCAG A/AA failures — roughly 51 errors per page. Accessibility does not happen by accident; it is engineered in or it is absent.
+## Our principles
+### 1. WCAG 2.2 AA is the floor, not the ceiling
+We conform to WCAG 2.2 AA for every page, every component, every release. Falling below AA is a bug, not a trade-off we make.
+But "aim for AAA everywhere" is the wrong correction, and the W3C says so directly: Level AAA is *not recommended as a general policy for entire sites*, because some AAA criteria are impossible to satisfy for some content (sign-language interpretation for all audio; a lower-secondary reading level for technical reference). A blanket-AAA target is one you are guaranteed to miss, which trains the team to treat the standard as aspirational rather than binding. The decision rule: **AA across the board, non-negotiable; specific AAA criteria adopted where a journey is critical and the criterion is actually achievable** — enhanced 7:1 contrast (1.4.6), visible location/breadcrumbs (2.4.8), context-sensitive help (3.3.5), no surprise session timeouts (2.2.6). Targeting 2.2 AA keeps us ahead of the EAA's current legal baseline, not behind it. WCAG 3.0 remains an early W3C Working Draft with a different conformance model — track it, but do not architect around it.
+### 2. Keyboard first
+Every interactive element is reachable and usable with the keyboard. Tab order follows reading order, focus is always visible, and there are no keyboard traps. The design test is simple: can a power user — or a user who cannot use a pointer — complete every journey without touching the mouse? Composite widgets (menus, grids, tab sets) follow the ARIA Authoring Practices keyboard model: a single tab stop into the widget, then arrow-key navigation inside it via roving `tabindex`, so a 30-item menu is one tab stop, not thirty.
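+The roving `tabindex` bookkeeping can be sketched without any DOM code. The function names are hypothetical, the key set assumes a vertical menu, and wrapping at the ends is a design choice rather than a requirement; in a real widget the key handler would also call `focus()` on the newly active item.
```typescript
type NavKey = "ArrowDown" | "ArrowUp" | "Home" | "End";

// Index that should hold focus after a key press, wrapping at both ends.
function nextActiveIndex(current: number, key: NavKey, itemCount: number): number {
  switch (key) {
    case "ArrowDown":
      return (current + 1) % itemCount;
    case "ArrowUp":
      return (current - 1 + itemCount) % itemCount;
    case "Home":
      return 0;
    case "End":
      return itemCount - 1;
  }
}

// Exactly one item carries tabindex 0 (the widget's single tab stop);
// every other item gets -1 so the Tab key skips it.
function tabIndexes(activeIndex: number, itemCount: number): number[] {
  const result: number[] = [];
  for (let i = 0; i < itemCount; i++) {
    result.push(i === activeIndex ? 0 : -1);
  }
  return result;
}

// A 30-item menu: pressing ArrowDown on the last item wraps to the first.
const wrapped = nextActiveIndex(29, "ArrowDown", 30);
const stops = tabIndexes(wrapped, 30).filter((t) => t === 0).length;
```
+However many items the menu holds, `tabIndexes` yields a single `0`, which is what keeps the whole widget to one tab stop.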
+### 3. Screen readers see what sighted users see
+Semantic HTML first; ARIA only when HTML cannot express the semantics. That is the first rule of ARIA use, and the ARIA Authoring Practices Guide states the corollary bluntly: *no ARIA is better than bad ARIA*. In WebAIM's million-page survey, pages using ARIA averaged 41% more detected errors than pages without it, because a misapplied `role` silently overrides the native semantics that already worked. A native `<button>`, `<nav>`, or `<label>` is correct by construction; an ARIA reimplementation is correct only if you also wire up every state and key handler by hand.
+When you do name something, follow the accessible-name priority: associate visible text first (`aria-labelledby` pointing at on-screen copy), and reach for `aria-label` only when there is no visible text to reference. Headings form an outline, landmarks mark regions, form fields carry programmatic labels, images carry meaningful alt text. A screen reader should produce a narrative that matches what a sighted user sees — not a richer or poorer version of it.
+### 4. Contrast is measured, not eyeballed
+Low-contrast text is the single most common accessibility failure on the web — present on roughly 79% of the top million home pages and the largest single share of all detected errors. It is also the most preventable. Body text meets 4.5:1, large text 3:1 (SC 1.4.3); UI component boundaries and meaningful graphics meet 3:1 (SC 1.4.11). These ratios are verified by tooling against the actual rendered colours, not judged by eye on a designer's calibrated monitor in a dark room. Brand palettes are checked against contrast at design time; a colour pair that fails AA is a palette bug, not a creative choice to defend.
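+The ratios above come from the WCAG 2.x relative-luminance formula, which is short enough to sketch directly. The function names and sample colours are illustrative; in practice an established checker (axe, or a design-tool plugin) would run this against the rendered colours.
```typescript
type Rgb = [number, number, number]; // 0-255 sRGB channels

// Linearise one sRGB channel per the WCAG 2.x relative-luminance definition.
function linearise(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Weighted sum of the linearised channels.
function relativeLuminance([r, g, b]: Rgb): number {
  return 0.2126 * linearise(r) + 0.7152 * linearise(g) + 0.0722 * linearise(b);
}

// (lighter + 0.05) / (darker + 0.05), so the result is always >= 1.
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Thresholds from SC 1.4.3: 4.5:1 for body text, 3:1 for large text.
function meetsAA(ratio: number, largeText: boolean): boolean {
  return ratio >= (largeText ? 3 : 4.5);
}

const white: Rgb = [255, 255, 255];
const black: Rgb = [0, 0, 0];
const midGrey: Rgb = [119, 119, 119]; // #777777

const maxContrast = contrastRatio(black, white); // about 21:1
const greyOnWhite = contrastRatio(midGrey, white); // just under 4.5:1
```
+`#777777` on white is a useful test case because it looks comfortably readable to most designers and still falls just short of 4.5:1 for body text while passing the 3:1 large-text threshold. That is the argument for measuring rather than eyeballing.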
+### 5. Colour is never the only signal
+A red error, a green success, a blue link — each one carries a second, non-colour cue: a label, an icon, an underline, a position. Colour-blind users exist, and colour-only signalling excludes them (SC 1.4.1). This is distinct from contrast: a chart can have perfect contrast and still be unreadable if its only key is "the red line versus the green line."
+### 6. Motion is optional
+Animations respect `prefers-reduced-motion`. Large-scale parallax and aggressive transitions are used sparingly; for users with vestibular conditions, unrequested motion is not decoration, it is an accessibility failure. The reduced-motion path is a real design, not a disabled one — it still communicates state change, just without the movement that triggers nausea.
+### 7. Live regions are used sparingly and correctly
+Real-time updates are announced via `aria-live` when they matter to the user's understanding. But over-announcement is as harmful as silence: a region that fires on every keystroke or background poll teaches the user to tune out the announcements that matter. Use `aria-live="polite"` for status that can wait, reserve `assertive` for genuine interruptions (errors, time-critical alerts), and announce the meaningful delta, not the whole region.
+### 8. Testing is multi-layered
+Automated checks (axe, Lighthouse) run in CI on every build as a gate. But know their ceiling. Deque's analysis of ~2,000 real audits found automated tooling fully covers about 57% of issues *by volume* — and that figure is flattered by colour contrast alone, one high-frequency criterion. Measured by share of WCAG success criteria, automated coverage is closer to a third. Tools cannot judge whether alt text is *meaningful*, whether focus order makes *sense*, or whether a live-region announcement is useful or noise. So the gate has three layers: automated checks in CI, a manual keyboard walk on every new journey, and a screen-reader walkthrough on major features. Test against the combinations users actually run — NVDA with Firefox or Chrome, VoiceOver with Safari, and JAWS for enterprise audiences — because behaviour differs across them. Tools catch the mechanical; humans catch the semantic.
+### 9. Accessibility is reviewed like code
+Accessibility issues are tracked, owned, and closed the same way any other bug is. The backlog does not accumulate a "we will get to the a11y later" queue — that queue grows forever. Every PR author is expected to include the accessibility check in their definition-of-done.
+## How we apply this
+- [Performance](performance.md) — related budgets that compound with accessibility.
+## Anti-patterns we reject
+- **Placeholder text as label.** The placeholder disappears when the field is filled; the label is gone. Users who come back to check the field see nothing. Use a visible label.
+- **`<div>` as button.** A `div` with an `onClick` cannot be reached with Tab, does not respond to Enter or Space, and exposes no button role to screen readers. Use `<button>`.
+- **Cramped tap targets.** WCAG 2.2 AA (SC 2.5.8) sets the floor at 24×24 CSS px, or equivalent spacing between smaller targets. That is a floor, not a goal: touch surfaces should aim for ~44×44 (Apple's HIG and the AAA SC 2.5.5), because fingertips are wide and motor-impaired users miss small targets at far higher error rates.
+- **Focus-removal for aesthetics.** `outline: none` without a replacement focus style breaks keyboard navigation entirely. Use `:focus-visible` to style a clear indicator, not to delete one.
+- **Accessibility overlay widgets.** Bolt-on "accessibility" scripts (accessiBe and its peers) do not make a site conformant, and frequently fight the user's own assistive technology. They are a liability, not a shield: over a thousand sites running an overlay were sued in 2024, settlements routinely require *removing* the widget, and the FTC ordered accessiBe to pay $1M in 2025 over deceptive accessibility claims. Fix the markup; do not paper over it.
+- **"We will add a11y in v2."** v2 will not have it either. Build it in.
+- **Modals without focus management.** Trap focus inside the modal, return focus to the trigger when it closes, mark it up with `role="dialog"` and `aria-modal="true"`, and give it an accessible name via `aria-labelledby` or `aria-label`. Otherwise keyboard users are lost behind it.
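For the modal item in the list above, the focus-trap step can be sketched as a small wrap-around calculation. `nextFocusIndex`, `trapTab`, and the `Focusable` interface are hypothetical names for illustration; a real handler would call this from a `keydown` listener for the Tab key and prevent the default behaviour.

```typescript
// Anything that can receive focus; DOM elements satisfy this shape.
interface Focusable {
  focus(): void;
}

// Given how many focusable elements the dialog contains, the index of the
// one that currently has focus, and whether Shift is held, return the index
// Tab should move to. Focus wraps at both ends so it never leaves the dialog.
function nextFocusIndex(count: number, current: number, shiftKey: boolean): number {
  if (count <= 0) return -1; // nothing focusable: caller should focus the dialog itself
  if (current < 0 || current >= count) {
    // Focus is not on a known item yet: enter from the matching end.
    return shiftKey ? count - 1 : 0;
  }
  const step = shiftKey ? -1 : 1;
  return (current + step + count) % count;
}

// Moves focus to the computed element and reports the new index.
function trapTab(items: Focusable[], activeIndex: number, shiftKey: boolean): number {
  const next = nextFocusIndex(items.length, activeIndex, shiftKey);
  items[next]?.focus();
  return next;
}
```

Returning focus to the trigger is the other half: store a reference to the element that opened the modal and call its `focus()` on close. The native `<dialog>` element opened with `showModal()` keeps focus inside the dialog in current browsers, so a hand-rolled trap like this is mainly relevant to custom dialog components.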
+## Further reading
+- *WCAG 2.2* ([w3.org/WAI/WCAG22](https://www.w3.org/WAI/WCAG22)) and *Understanding Conformance* ([w3.org/WAI/WCAG22/Understanding/conformance](https://www.w3.org/WAI/WCAG22/Understanding/conformance)) — the normative standard and the rationale for why AAA is not a blanket policy.
+- *ARIA Authoring Practices Guide* ([w3.org/WAI/ARIA/apg](https://www.w3.org/WAI/ARIA/apg)) and *Using ARIA* ([w3.org/TR/using-aria](https://www.w3.org/TR/using-aria)) — the reference for every ARIA pattern and the five rules of ARIA use.
+- *The WebAIM Million* ([webaim.org/projects/million](https://webaim.org/projects/million)) — the annual reality check on what actually fails on the open web.
+- *European Accessibility Act / EN 301 549* — the EU legal baseline; the harmonized standard that maps the law to WCAG.
+- *Inclusive Components*, Heydon Pickering — the canonical pattern language for accessible UI components.
+- *Accessibility for Everyone*, Laura Kalbag — the short introduction for engineers who need to learn the landscape quickly.