
groundwork-method 0.0.1 → 0.11.0


Files changed (647)

package/src/docs/principles/foundations/product-engineering.md ADDED

@@ -0,0 +1,90 @@
+---
+title: Product Engineering
+description: Engineering in service of user outcomes — shaped work, appetite-based planning, and the refusal to ship the wrong thing faster.
+status: active
+last_reviewed: 2026-06-19
+---
+# Product Engineering
+## TL;DR
+We are product engineers before we are coders. Our job is to move outcomes — not to ship tickets. Work is shaped before it is scheduled, scheduled against a fixed appetite rather than an estimate, and judged by the behaviour it changes rather than the volume of code it produces.
+## Why this matters
+The dominant failure mode of engineering teams is not technical debt — it is building the wrong thing well. John Cutler's "feature factory" and Melissa Perri's "build trap" name the same trap: a team optimises cycle time and output velocity until the product surface grows faster than the value it delivers, and "shipped" quietly replaces "worked" as the definition of done.
+AI sharpens this; it does not soften it. When generating code is nearly free, the binding constraint moves from *can we build it* to *should we, and did it work*. The 2024 and 2025 DORA reports found that AI adoption raised individual throughput but correlated with **lower delivery stability** (larger batch sizes, more code in flight, more ways to be wrong) because friction "doesn't vanish so much as move: from manual grind to deciding and verifying." Product engineering is the discipline that holds the line where the friction now lives: the unit of work is an outcome, the unit of planning is an appetite, and the test of a change is whether someone (a user, an operator, the next engineer) can feel the difference.
+## Our principles
+### 1. Outcomes over outputs
+An "output" is a feature shipped, a ticket closed, a migration completed. An "outcome" is, in Josh Seiden's phrase, a change in behaviour that drives results — what a user can now do, how fast they can do it, how reliably the system holds them up. We plan around outcomes and let outputs be whatever shape delivers them.
+The honest qualifier: outcomes are not always user-visible, and treating "no user-facing change this sprint" as failed work is wrong. Security patching, a load-bearing refactor that unblocks the next three features, paving a platform path that removes friction for internal developers — these move real outcomes (a class of incident disappears, an operator sleeps, a team ships faster) without a single end user noticing. Platform teams are measured this way on purpose: by adoption and friction removed, not deliverables counted.
+So the failure mode is not "invisible to users." It is **output that traces to nothing**: work whose only justification is the ticket it closes. Decision rule: before work is scheduled, name the behaviour change and who experiences it — end user, internal developer, or operator. If no one can name it, the work is unjustified, not merely unshippable.
+### 2. Shape work before scheduling it
+No work enters a cycle without being *shaped*: the problem stated in user terms, the rough solution sketched, the boundaries drawn to exclude rabbit holes. Shaped work is expensive upfront and cheap downstream. Unshaped work is the single biggest source of mid-cycle drift, scope creep, and late discovery that the whole approach was wrong.
+Shaping is bounded on both sides. Too vague and the team inherits the unsolved problem; too concrete and it is waterfall wearing a friendlier name — a finished design handed down, with no room for the people building it to make the hundred small calls only visible from inside the code. Shape Up's altitude is deliberate: concrete enough to bound the work, abstract enough to leave the build to the builders.
+You can only shape what you understand. Decision rule: shape when you know the problem well enough to bound the solution; when you do not — novel domain, unproven technical approach — the move is a time-boxed spike to *buy* that understanding, not a confident shape built on guesses. AI has made a throwaway prototype cheap enough that "shape by building a spike and discarding it" is now often faster than shaping on paper.
+### 3. Appetite, not estimate
+We set an *appetite* — a statement of how much a problem is worth solving, judged by opportunity cost — and design a solution that fits inside it. If it cannot fit, we reduce scope or reject the work. This inverts the usual flow: an estimate starts with a fixed solution and ends with a number; an appetite starts with the number and ends with a solution. It forces "what is the best version of this we can deliver for what it is worth?" and it kills the tendency of work to expand to fill the time available.
+We denominate appetite in worth, not effort, and not by default in calendar time. AI compresses execution unpredictably: sometimes a 19% *slowdown* on familiar code an expert already moves fast through (per METR's 2025 trial), sometimes a large speedup on unfamiliar ground. A fixed "two weeks" now anchors on the axis that just got cheap and noisy.
+Appetite does not abolish estimation everywhere, and pretending it does is its own failure. A partner-integration deadline, a compliance date, a contractual SLA — these demand a real estimate and a real date, and the appetite must respect them as constraints. Decision rule: appetite governs discretionary product bets, which is most of the portfolio; estimate where a hard external date or dependency exists, and feed that estimate in as a boundary. The error is mixing them up — estimating discretionary work, or setting a soft "appetite" for an obligation that has a date attached. How big a bet is, separately, is its *stakes* — what is at risk if we are wrong; see [prioritization-and-appetite](prioritization-and-appetite.md).
+### 4. Kill your darlings
+If a feature is not moving an outcome, we remove it. Deletion is the most under-used tool in a product engineer's kit. Every line of code, every doc page, every dashboard tile, every CLI flag that does not pay its maintenance cost is a candidate for the cut. A smaller, sharper product is cheaper to operate and easier for the next engineer to understand.
+Removal has its own cost, and the test is the *net* one. For anything with external surface, Hyrum's Law holds: with enough users, every observable behaviour is depended on by someone, so a hard cut breaks callers and burns trust faster than the cruft ever cost you. Decision rule: internal-only cruft, just delete it; anything users observe or script against goes through deprecate → measure usage → remove, and stays if the migration cost outweighs the carrying cost. The discipline is to default to deletion and make *keeping* earn its place — not to delete blind.
+### 5. Instrument what you ship
+We decide the signal *before* we ship — event, threshold, success criterion — and we check it after release. A feature whose effect no one watches is a feature no one owns.
+Instrumentation is not the same as quantification, and conflating them produces dashboards that decorate rather than inform. Some outcomes resist a clean number — trust, perceived quality, a rare catastrophic failure avoided. For those the signal is qualitative (interview themes, support-ticket clusters) or a tripwire (a counter-metric that fires when you have made something worse), not another tile. So the real bar is not "measurable" — it is *owned and falsifiable*. Decision rule: before shipping, name the signal **and** the evidence that would make you reverse course. If you cannot say what would change your mind, you are not measuring, you are decorating. And measure the outcome, not the act of shipping — the 2025 DORA finding is that individual throughput gains evaporate at the org level unless they are tied back to a business result. More dashboards is not more insight; one honest counter-metric beats ten vanity lines.
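One way to make "name the signal and the evidence that would make you reverse course" concrete is to record both before release and evaluate them together afterwards. The sketch below is illustrative only: `ShipSignal`, `review`, the metric names, and the thresholds are all hypothetical and not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShipSignal:
    """Decided before release. Every name and threshold here is hypothetical."""
    metric: str               # the outcome we expect to move
    success_at_least: float   # the success criterion for that outcome
    counter_metric: str       # what we might make worse
    reverse_if_above: float   # the tripwire: evidence that changes our mind


def review(signal: ShipSignal, observed: dict[str, float]) -> str:
    """Return 'reverse', 'keep', or 'investigate' for a post-release reading."""
    # The counter-metric is checked first: making something worse
    # outranks hitting the headline number.
    if observed[signal.counter_metric] > signal.reverse_if_above:
        return "reverse"
    if observed[signal.metric] >= signal.success_at_least:
        return "keep"
    return "investigate"


signal = ShipSignal(
    metric="export_completion_rate",
    success_at_least=0.80,
    counter_metric="support_tickets_per_1k_users",
    reverse_if_above=5.0,
)
decision = review(
    signal,
    {"export_completion_rate": 0.86, "support_tickets_per_1k_users": 7.2},
)
```

In this reading the headline metric clears its bar, but the tripwire fires, so the decision is to reverse. A qualitative signal (interview themes, ticket clusters) would not fit this shape and needs a human review instead.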
+## The product discipline
+This page is the spine of a wider product corpus — the discipline of moving outcomes, expanded into its working parts:
+- [Continuous Discovery](continuous-discovery.md) — mapping the problem space as a weekly habit, before choosing a solution.
+- [Product Risks](product-risks.md) — the four risks (value, usability, feasibility, viability) a bet must clear, and who owns each.
+- [Success Metrics](success-metrics.md) — designing the measure of an outcome: North Star, leading indicators, counter-metrics.
+- [Requirements & Specs](requirements-and-specs.md) — turning validated needs into testable, evidence-grounded statements.
+- [Prioritization & Appetite](prioritization-and-appetite.md) — the portfolio view: choosing and sequencing bets by opportunity cost.
+- [AI-Native Product](../ai-native/ai-native-product.md) — product practice for probabilistic systems: evals, the outcome envelope, the three cost layers.
+## How we apply this
+- [Progressive Delivery](../delivery/progressive-delivery.md) — canaries and flags are the mechanism by which we measure outcomes safely.
+- [Observability](../quality/observability.md) — the signal layer that makes outcome-based engineering possible.
+- [Decisions](../../decisions/) — the record of shaping decisions that cost us real time.
+## Anti-patterns we reject
+- **Velocity-as-KPI.** Story points per sprint measure nothing about user outcomes. Optimising for it corrupts the team — and with AI inflating raw output, it corrupts faster.
+- **Estimate-driven planning.** Estimates anchor on how long the team thinks work will take, not on how much it is worth. We use appetites for discretionary work, and reserve estimates for hard external dates.
+- **"Build it and they will come."** Launching without a signal — and without naming what would make you walk it back — means no one owns the outcome.
+- **Technical-debt-for-its-own-sake projects.** Refactors with no payoff anyone can name are a smell. Tie them to the outcome they enable — faster delivery, fewer incidents, lower carrying cost — and that outcome is the justification.
+- **Big-design-up-front in a shaping costume.** A fully specified solution handed down with no room for the builders is waterfall, whatever the cycle is called.
+## Further reading
+- *Shape Up*, Ryan Singer — the canonical treatment of shaped work and fixed appetites.
+- *Inspired*, Marty Cagan — the product-engineering triad and its implications for how teams are built.
+- *Escaping the Build Trap*, Melissa Perri — why feature-factory metrics corrupt outcomes.
+- *Outcomes Over Output*, Josh Seiden — the working definition of an outcome as a change in behaviour.
+- "12 Signs You're Working in a Feature Factory," John Cutler — the field guide to the failure mode this discipline resists.
+- *State of DevOps* (DORA), 2024 and 2025 reports — the evidence that AI raises throughput while pressuring stability, and that gains must be tied to outcomes to count.
+- "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," METR (2025) — why execution time under AI is unpredictable, not uniformly faster.

package/src/docs/principles/foundations/product-risks.md ADDED

@@ -0,0 +1,89 @@
+---
+title: Product Risks
+description: The four risks every bet must clear before delivery — value, usability, feasibility, viability — and the discipline of killing the riskiest assumption first.
+status: active
+last_reviewed: 2026-06-19
+---
+# Product Risks
+## TL;DR
+Before we commit to building something, we ask whether it can fail in four distinct ways: will users **want** it (value), can they **figure it out** (usability), can we **build** it well (feasibility), and does it **work for the business** (viability). Discovery exists to kill these risks before delivery starts — cheaply, by testing the riskiest assumption first, rather than expensively, by shipping and finding out. Each risk has a clear owner, and that ownership is how the product, design, and engineering disciplines divide the work without gaps.
+## Why this matters
+Most failed features did not fail because they were built badly. They failed because nobody asked the right "could this not work?" question early enough. A team that only asks "can we build it?" ships things that work perfectly and that no one uses. A team that only asks "will users like it?" ships things that delight in a prototype and collapse under real load or real economics. The four-risk frame is a checklist against the specific blind spots each discipline has on its own — and running it during discovery means the miss surfaces at week two, on a sketch, instead of at launch, in production.
+## Our principles
+### 1. Four risks, named explicitly
+Every significant bet faces four categories of uncertainty:
+- **Value** — will customers choose to use or buy this? Does it solve a problem they actually feel? This is the risk most features die of, and the easiest to wave away with "of course they'll want it."
+- **Usability** — can users figure out how to use it? Will they understand it well enough to get the value that is theoretically there?
+- **Feasibility** — can we build it with the time, skills, and technology we have? Does the architecture support it, and can we operate it reliably?
+- **Viability** — does it work for the *business*? Legal, security, cost, support load, brand, and the commercial model all sit here. A feature can be desirable, usable, and feasible and still be a mistake to ship.
+Naming all four forces the question each discipline is prone to skip.
+A fifth question lives *inside* viability and deserves its own name: **should we build it at all?** — the ethical risk. Cagan files ethics under viability deliberately, then warns that it is the one viability concern with no natural stakeholder: legal owns legal, finance owns cost, security owns security, but no one is paid to ask whether the feature is good for the user even when it is good for the metrics. That is exactly why it is the most reliably dropped. Name it on purpose, or it goes unowned.
+### 2. Discovery exists to kill risk before delivery
+The purpose of discovery is not to produce a specification — it is to retire risk. Delivery should begin only once the four risks are low enough that building is the cheapest remaining way to learn. Every assumption we can test with a conversation, a prototype, a spike, or a back-of-envelope cost model is one we should *not* test by shipping. Discovery is the cheap place to be wrong.
+This is not a phase that finishes before delivery opens. Discovery and delivery run continuously and in parallel, not as sequential gates — the [dual-track](continuous-discovery.md) shape. "Risks low enough" is a judgement made one load-bearing assumption at a time, not a sign-off on the whole bet; a healthy team is retiring risk on the next bet while it ships the last one. The discipline is that *the specific assumption a piece of delivery rests on* is cleared before that piece is built — not that all discovery everywhere completes before any code is written.
+### 3. Test the riskiest assumption first
+Not all risk is equal, and the order matters. We surface the assumptions a bet rests on, rank them by how likely they are to be wrong and how much damage a wrong answer does, and test the riskiest one first. If the bet is going to die, kill it on the assumption most likely to kill it — before sinking effort into the assumptions that were never in doubt. Spending discovery on the comfortable questions while the load-bearing one goes untested is how teams feel busy and learn nothing.
+"Riskiest" is the product of two axes, not one: how *uncertain* the assumption is (how little evidence we have either way) and how *load-bearing* it is (how much of the bet collapses if it is wrong). An assumption that is uncertain but cheap to be wrong about can wait; an assumption everyone is confident in but that would sink the bet if it failed still deserves a fast check, because confidence is not evidence. Rank on uncertainty × consequence, and test top-down.
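The uncertainty × consequence ranking can be sketched in a few lines. This is a minimal illustration, assuming each axis is scored on a hypothetical 1 to 5 scale; the `Assumption` class, the scale, and the example statements are invented for the sketch and are not defined by the document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    """One assumption a bet rests on; both scores are hypothetical 1-5 ratings."""
    statement: str
    uncertainty: int   # 1 = strong evidence either way, 5 = no evidence at all
    consequence: int   # 1 = cheap to be wrong, 5 = the bet collapses

    @property
    def risk(self) -> int:
        # "Riskiest" is the product of the two axes, not either one alone.
        return self.uncertainty * self.consequence


def rank_for_testing(assumptions: list[Assumption]) -> list[Assumption]:
    """Return assumptions riskiest-first; ties keep their input order."""
    return sorted(assumptions, key=lambda a: a.risk, reverse=True)


bet = [
    Assumption("Users will pay for export", uncertainty=4, consequence=5),
    Assumption("The API can return 10k rows", uncertainty=2, consequence=3),
    Assumption("Users can find the export button", uncertainty=3, consequence=2),
]
ranked = rank_for_testing(bet)
```

Score `uncertainty` by the evidence in hand, not by how confident the team feels: an assumption everyone believes but nobody has checked still rates high on that axis.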
+### 4. Each risk has an owner
+Risk without an owner is risk nobody clears. The accountability splits cleanly across the disciplines:
+| Risk | Owner | Discipline |
+|---|---|---|
+| **Value** | Product | accountable for the outcome |
+| **Viability** | Product | accountable for the outcome |
+| **Usability** | Design | accountable for the experience |
+| **Feasibility** | Engineering / Architecture | accountable for delivery |
+Product owns value and viability because both are judgements about whether the outcome is worth pursuing. Design owns usability because it owns the experience. Engineering and architecture own feasibility because they own what is buildable and operable. The owner of a risk is the person who must produce the evidence that it is cleared.
+Ownership is accountability for the evidence, not a solo assignment. Discovery is done by the trio — product, design, engineering — working the same problem together; an engineer's feasibility spike routinely surfaces a value insight, and a designer's prototype routinely exposes a feasibility wall. The owner is simply who answers for the risk when it is asked about. Viability makes the distinction sharp: product is accountable, but the evidence comes from legal, security, and finance, so the owner orchestrates the answer rather than producing it alone. The failure mode is not collaboration — it is when no single name answers for a given risk, so each is everyone's job and therefore no one's.
+### 5. Match the discovery action to the risk
+Each risk is tested differently, and using the wrong instrument wastes the discovery. Value is tested with user evidence — demand signals, interviews, a fake-door, a willingness-to-pay probe, or a live in-production experiment when the change is reversible, flagged, and measured against a control. Usability is tested with prototypes and observed sessions. Feasibility is tested with a spike, a proof of concept, or an architecture review. Viability is tested by walking the decision past the constraints that bound it — cost model, security posture, legal boundary, support load. A "usability test" that was actually meant to probe value answers the wrong question convincingly.
+The instrument must fit the *stakes* as well as the risk. A randomized production experiment with a control group and a metric chosen in advance is a legitimate — often the cheapest — value test for a reversible change. A full launch to everyone with no hypothesis and no control is not a test; it is a bet you have already placed. The difference between the two is not "production or not" — it is whether there is a way to read the result and a way back.
+### 6. Low stakes earn a lighter pass
+The frame scales to the **stakes** — blast radius × reversibility × the human review the work demands, the bet's size axis defined in [prioritization-and-appetite](prioritization-and-appetite.md) §2. A small-blast-radius, reversible change does not need a four-risk discovery — it needs a quick gut-check and a willingness to undo it. The full, evidence-backed pass is for bets that are hard to reverse, wide in blast radius, or load-bearing for the product. Note that stakes is not effort: a low-effort change to a one-way door is high-stakes and earns the full pass, even when it is fast to build. Running heavy discovery on genuinely low-stakes work is its own failure mode; the discipline is proportionality, not ceremony.
+## How we apply this
+- The riskiest-assumption-first ordering is the engine of [continuous discovery](continuous-discovery.md) — the opportunity-solution tree's leaves *are* the assumptions this frame ranks and tests.
+- Feasibility risk is where product hands off to the [architecture discipline](../system-design/code-structure.md) and the engineer skills; the value/viability judgement stays with product.
+- A bet's [appetite](prioritization-and-appetite.md) is set against the risk it carries — a high-value, high-uncertainty bet earns a discovery spike before its delivery appetite is fixed.
+- AI-heavy bets stress specific corners of the frame and need the matching evidence early. Feasibility now includes model non-determinism and an evaluation harness, not just "can we call the API." Viability includes per-call inference economics — a feature can be desirable, usable, and feasible and still lose money on every request — plus unsettled data, copyright, and privacy exposure. Value includes whether users trust the output enough to act on it. Probe these in discovery with a quick eval and a cost-per-action model before the appetite is fixed; a demo that ignores tail-case output and unit economics has cleared none of the real risk.
+## Anti-patterns we reject
+- **The feasibility-only filter.** "Can we build it?" as the only question asked. Produces things that work and that nobody wanted.
+- **Validation theatre.** Discovery run to confirm a decision already made, testing the safe assumptions and skipping the one that could kill the bet.
+- **Unowned risk.** Four risks and nobody accountable for clearing any specific one — so each is everybody's job and therefore no one's.
+- **Shipping to learn, unrigorously.** Using a full production launch as the first test of a high-stakes value question — "we'll see if people use it" — with no hypothesis, no control, and no cheap way back. That is hoping, not learning, and production is the most expensive place to hear no. A reversible, instrumented experiment is the opposite and is welcome; the anti-pattern is the irreversible, unmeasured bet, not learning in production itself.
+- **The forgotten viability risk.** A desirable, usable, feasible feature that quietly triples support load, breaks a compliance boundary, costs more to run than it earns, or is good for the metrics and bad for the user. Viability — ethics most of all — is the risk teams most often never name.
+## Further reading
+- *Inspired* and *Transformed*, Marty Cagan — the four big risks, the discovery techniques that retire them, and the case for treating ethical risk as the unowned corner of viability.
+- *The Four Big Risks*, Silicon Valley Product Group — the concise canonical statement of the taxonomy and its ownership.
+- *Continuous Discovery Habits*, Teresa Torres — assumption mapping and testing the riskiest assumption first.
+- *Updating the Product Risk Taxonomy for the Generative AI Era*, Viget — how each risk shifts for LLM-powered products.

package/src/docs/principles/foundations/requirements-and-specs.md ADDED

@@ -0,0 +1,80 @@
+---
+title: Requirements & Specs
+description: Evidence-grounded, testable specification — jobs-to-be-done, user journeys, stable-ID requirements, acceptance criteria matched to their form, and explicit non-goals.
+status: active
+last_reviewed: 2026-06-19
+---
+# Requirements & Specs
+## TL;DR
+A requirement is a claim about what a user needs to accomplish, grounded in evidence and stated precisely enough to be tested. We frame needs as **jobs to be done**, walk them as concrete **user journeys**, pin each requirement to a **stable ID** so downstream artifacts can reference it, and write acceptance criteria in whatever form makes "done" unambiguous and verifiable. The specification is a living, evidence-backed record of decisions — and the source of truth a builder, human or agent, works from — not a template filled in to look thorough.
+## Why this matters
+Requirements are where product thinking becomes something an engineer, an agent, or a test can act on — and where it most often goes wrong. A spec that lists features instead of jobs builds the wrong thing precisely. A spec with vague acceptance criteria ("the system should handle errors gracefully") cannot drive a test or settle an argument about whether the work is finished. A spec produced by filling a template rather than by understanding the user reads as complete and is hollow. Precise, testable, evidence-grounded requirements are the contract between knowing what to build and building it.
+This matters more, not less, as agents do the building. A model does pattern completion, not mind reading: a vague spec is not refused, it is answered — with a thousand silent assumptions the model invents to fill the gaps. The precision the spec withholds is the precision the build makes up. In an agent-led codebase the spec is read as literally as code, which is the discipline behind spec-driven development (the loop popularised by tools like GitHub's Spec Kit, AWS Kiro, and BMAD): need → spec → plan → tasks → code, with the spec as the artifact every later stage resolves against.
+## Our principles
+### 1. Requirements describe jobs, not features
+We state what the user is trying to accomplish — the **job to be done** — before naming any feature that serves it. A job is the progress a user is trying to make in a situation ("when I finish a task, I want to know it actually completed, so I can move on without checking back"), with its functional, emotional, and social dimensions. Features are solutions to jobs; leading with the feature skips the step where we check the solution actually fits the job. The job is stable; the feature that serves it is negotiable.
+### 2. Walk the journey, do not list the screens
+A user journey is a narrative with structure: a named persona in a context, the state they enter from, the concrete path of steps they take, the moment value is delivered and how they know, and the state they are left in. Walking the journey end to end surfaces the gaps a feature list hides — the empty state, the error halfway through, the second-time-through shortcut. We describe journeys with enough texture that a reader can picture the shape of the interaction, not just enumerate its steps.
+### 3. Stable IDs make requirements referenceable
+Every functional requirement carries a stable, globally unique ID (`FR-1`, `FR-2`, …) assigned once and never reused. The ID is what lets a design doc, an architecture decision, a test, and an acceptance criterion all point at the *same* requirement without ambiguity, and what lets a coverage map prove every requirement is accounted for downstream. Requirements identified only by prose drift apart the moment two documents describe the same thing in different words.
+What does not earn its place is the heavyweight traceability matrix — a hand-maintained, bidirectional grid linking every requirement to every artifact — which rots faster than it informs and is the ceremony agile rightly walked away from. The ID is cheap; the discipline is to reference it, and to let tooling, not a clerk, maintain the links. The payoff scales with how literally the spec is consumed: in a regulated domain that must evidence coverage, or an agent-led codebase where a model resolves `FR-7` against the spec the way it resolves a symbol against its definition, a stable ID is load-bearing; on a two-person throwaway prototype it is overhead. Carry the IDs; skip the matrix.
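"Let tooling, not a clerk, maintain the links" can be as small as a script that scans artifacts for requirement IDs and reports the ones nothing references. The sketch below assumes the `FR-<number>` pattern from the text; the function names, file paths, and sample contents are hypothetical.

```python
import re

# Matches IDs such as FR-1 or FR-42 as whole tokens.
FR_ID = re.compile(r"\bFR-\d+\b")


def requirement_ids(text: str) -> set[str]:
    """Collect every requirement ID mentioned in a piece of text."""
    return set(FR_ID.findall(text))


def uncovered_requirements(spec: str, artifacts: dict[str, str]) -> list[str]:
    """Return IDs declared in the spec that no artifact references."""
    declared = requirement_ids(spec)
    referenced: set[str] = set()
    for body in artifacts.values():
        referenced |= requirement_ids(body)
    # Sort numerically so FR-10 comes after FR-2.
    return sorted(declared - referenced, key=lambda rid: int(rid.split("-")[1]))


spec = "FR-1: user can export. FR-2: export finishes in 5s. FR-7: audit entry written."
artifacts = {
    "tests/test_export.py": "# covers FR-1\n",
    "docs/adr-003.md": "Decision driven by FR-7.",
}
missing = uncovered_requirements(spec, artifacts)
```

Here `missing` holds the one requirement with no downstream reference. The check is one-directional on purpose: it proves coverage without rebuilding the bidirectional matrix the paragraph above rejects.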
+### 4. Acceptance criteria are testable — match the form to the criterion
+Acceptance criteria exist to make "done" unambiguous and verifiable. The form serves that goal; it is not the goal. We match the form to the criterion:
+- **Stateful behaviour and flows → Given/When/Then.** Given (precondition) / When (action) / Then (observable outcome), with And for extra conditions. The form forces you to name the starting state, the trigger, and the observable result — and a scenario you cannot fill in is usually one you do not yet understand.
+- **Invariants, validation, and business rules → a rules-based checklist.** "An order total is never negative"; "an email contains exactly one @". Forcing a flat rule into Given/When/Then adds ceremony and buries the rule; a bullet leaves less room to misread and is sharper against scope creep.
+- **Quality attributes → measurable thresholds.** Latency, throughput, accessibility, error budgets are not prose ("fast", "reliable") but numbers: "p95 search latency under 200ms at 1k RPS." A threshold is the only non-functional criterion a test can fail.
+The over-certain version of this principle — "anything that isn't Given/When/Then isn't concrete enough" — is wrong, and it pushes teams into the BDD trap: a parallel Gherkin-plus-step-definition layer maintained for its own sake, brittle and expensive, where the prose outlives the value. The criterion is the contract; the automation is downstream of it. Whatever the form, every criterion is independently verifiable, covers the edge and error cases — not just the happy path — and "done" means every one passes, nothing softer.
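The three forms above can each be expressed as a plain check without a Gherkin layer. This is a rough sketch under stated assumptions: the `Cart` model and its scenario are hypothetical, the email rule and the 200 ms figure are taken from the examples in the text, and the nearest-rank percentile is one common definition among several.

```python
class Cart:
    """Hypothetical model used only to illustrate a stateful flow."""

    def __init__(self) -> None:
        self.items: dict[str, int] = {}

    def add(self, sku: str, qty: int = 1) -> None:
        self.items[sku] = self.items.get(sku, 0) + qty

    def remove(self, sku: str) -> None:
        self.items.pop(sku, None)


# Form 1: stateful flow, written as Given/When/Then.
def check_removing_last_item_empties_cart() -> None:
    cart = Cart()             # Given an empty cart
    cart.add("sku-1")         # And one item has been added
    cart.remove("sku-1")      # When the user removes that item
    assert cart.items == {}   # Then the cart is empty


# Form 2: invariant, written as a flat rule.
def email_has_exactly_one_at(email: str) -> bool:
    return email.count("@") == 1


# Form 3: quality attribute, written as a measurable threshold.
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def meets_latency_threshold(samples_ms: list[float], limit_ms: float = 200.0) -> bool:
    return p95(samples_ms) < limit_ms


check_removing_last_item_empties_cart()
```

Each check can fail on its own, which is the property the principle asks for; the load condition ("at 1k RPS") belongs to how the samples are collected and is not modelled here.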
+### 5. Non-goals are part of the specification
+What a requirement explicitly does *not* cover is as load-bearing as what it does. We state non-goals and out-of-scope boundaries directly, with the reason and where the excluded thing belongs instead. The natural extensions a reader would assume — the adjacent feature, the obvious generalisation — are exactly what must be named as excluded, or scope creeps one reasonable assumption at a time. An explicit boundary is what makes the scope honest.
+### 6. The spec is a living record, not a template fill
+A specification earns its sections; it does not fill them to look complete. We add the sections the product needs, drop the ones that do not apply, and keep the document current as decisions change — surfacing assumptions explicitly (`[ASSUMPTION]`) so they can be confirmed rather than buried. A PRD generated by walking a template top to bottom, padding every heading, is the artifact this principle exists to prevent: it reads thorough and conveys nothing that was not already obvious.
+Living also means reconciled. When the code and the spec disagree, one of them is a defect — and a spec left to drift is worse than no spec in an agent-led codebase, because the agent trusts it literally and builds to the lie. Keeping the spec true to the system is part of the build, not paperwork after it.
+### 7. Requirements are grounded in evidence
+Every requirement traces to a reason it exists — a user need observed in discovery, a job confirmed in a conversation, a problem with evidence behind it. A requirement that traces only to someone's preference is a candidate to cut. Grounding requirements in evidence is what connects the spec back to [continuous discovery](continuous-discovery.md): the spec is where validated needs become buildable statements, not where new unvalidated ones get smuggled in.
+## How we apply this
+- Requirements emerge from validated needs — the jobs and opportunities surfaced in [continuous discovery](continuous-discovery.md), not from a brainstorm of features.
+- Each requirement names its [success metric](success-metrics.md) where one applies, so the spec carries its own definition of whether it worked.
+- The spec is the source of truth the build runs on, not a document read once and abandoned: need → spec → plan → tasks → code, with each later stage derived from the spec rather than re-inventing it. When an agent or engineer needs a decision the spec does not make, that gap is a spec defect to fix, not an assumption to bury in code.
+- Stable-ID requirements and form-matched acceptance criteria are what let the [architecture discipline](../system-design/api-design.md) derive contracts and tests from the spec rather than re-interpreting prose — requirements and contracts share the same source-of-truth discipline.
+## Anti-patterns we reject
+- **Template-fill PRDs.** Every heading padded to look complete, conveying nothing the team did not already know. The template is a checklist, not the thinking.
+- **Feature lists masquerading as requirements.** Solutions enumerated with no job behind them, so nobody can tell whether they fit the need.
+- **Untestable acceptance criteria.** "Works well," "handles errors gracefully," "is intuitive" — none can pass or fail a test, so none can settle whether the work is done.
+- **Form over substance.** Cramming every criterion into Given/When/Then — or maintaining a Gherkin-and-step-definition layer for its own sake — when a rule-list or a numeric threshold would be sharper. The ritual is not the rigour.
+- **Requirements without IDs.** Prose-only requirements that two documents describe differently and that no coverage map can track.
+- **Silent scope.** No non-goals stated, so every reasonable adjacent assumption is fair game and scope grows without a decision.
+- **Spec rot.** A spec that no longer matches the system, trusted literally by the next agent that reads it.
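A minimal sketch of the coverage map that stable requirement IDs make possible. The `REQ-nnn` ID format, the requirement texts, and the convention of citing an ID in a test name are all assumptions made for illustration; the source does not prescribe any of them.

```python
import re

# Hypothetical requirements keyed by stable ID (the "REQ-nnn" format is assumed).
requirements = {
    "REQ-001": "Export completes within 30 seconds for 10,000 rows",
    "REQ-002": "Export is rejected for users without the exporter role",
    "REQ-003": "Export file uses UTF-8 with a header row",
}

# Hypothetical test names that cite the requirement they verify.
test_names = [
    "test_REQ_001_export_should_finish_within_budget_when_10k_rows",
    "test_REQ_002_export_should_be_rejected_when_role_is_missing",
]

ID_IN_TEST = re.compile(r"REQ_(\d{3})")


def uncovered(reqs: dict[str, str], tests: list[str]) -> list[str]:
    """Return the requirement IDs that no test name cites."""
    cited = {
        f"REQ-{match.group(1)}"
        for name in tests
        for match in ID_IN_TEST.finditer(name)
    }
    return sorted(set(reqs) - cited)


print(uncovered(requirements, test_names))
```

With prose-only requirements there is nothing to key this lookup on, which is the gap the "Requirements without IDs" anti-pattern describes.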
+## Further reading
+- *Competing Against Luck*, Clayton Christensen — the jobs-to-be-done framework in depth.
+- *User Story Mapping*, Jeff Patton — journeys and stories as the structure of a specification.
+- *Specification by Example*, Gojko Adzic — acceptance criteria as the bridge from requirement to test, and when scenarios earn their keep.

package/src/docs/principles/foundations/success-metrics.md ADDED

@@ -0,0 +1,66 @@
+---
+title: Success Metrics
+description: Designing the measure of an outcome — North Star and inputs, leading vs lagging, counter-metrics, and deciding the signal before you ship.
+status: active
+last_reviewed: 2026-06-19
+---
+# Success Metrics
+## TL;DR
+A feature that is not measured does not exist as an outcome. We design the measure before we build the thing: a small number of metrics that represent real user value, paired with the counter-metrics that stop us from gaming them, and chosen so a *no* answer is as informative as a *yes*. Metric design is a product skill distinct from the telemetry that implements it — deciding *what* to measure and *what target* means success is the hard part; the dashboard is the easy part.
+## Why this matters
+Teams measure what is easy to count and then optimise their way into the wrong product. Signups, page views, story points shipped — vanity and output metrics feel like progress while the actual outcome stagnates or regresses. Worse, a single metric pursued without a counterbalance reliably produces a degraded product: optimise engagement and you get dark patterns; optimise speed and you get a product that does the wrong thing faster. Designing the measure well — before launch, with the counter-metrics in place — is what turns "we shipped it" into "we know whether it worked." The measure is part of the design, not a reporting afterthought.
+## Our principles
+### 1. Decide the signal before you ship
+The success signal is a design decision made *before* the work starts, not a question asked after launch. Before building, we name the metric that will move, the direction, and the rough magnitude that would count as success. Deciding it upfront does two things: it forces honesty about whether the feature has a theory of impact at all, and it pre-commits us to a verdict so we cannot rationalise any result as a win after the fact. If we cannot name how we would measure it, we do not yet understand the outcome well enough to build it.
+### 2. A North Star, supported by inputs
+We anchor on a **North Star** that captures the core value the product delivers to *users* — the one number that, if it moves the right way sustainably, means the product is winning. It must be a value metric, not an activity or revenue proxy. Engagement North Stars (sessions, time-on-site) optimise for the product's interest over the user's and decay into dark patterns; revenue North Stars measure extraction, not value delivered, and can climb while the product rots. Beneath the North Star sit a handful of **input metrics**: the leading indicators a team can actually move week to week, whose causal link to the North Star is earned by trial and error, not assumed. Amazon's *Working Backwards* calls these *controllable input metrics* and steers by them precisely because output metrics like revenue report too late and too diffusely to act on.
+The single North Star is genuinely contested, and the objection is fair: one number cannot represent a two-sided marketplace, a multi-product portfolio, or segments with materially different value. Forced onto those, a single metric either flattens real trade-offs or hums along green while the business bleeds — the North Star is never a substitute for business viability. The answer is not a wall of dashboards. **Decision rule:** a focused product with one dominant value loop gets one North Star. A marketplace, platform, or portfolio gets a North Star *strategy* — a one-sentence statement of the value being created — plus a small constellation (roughly one metric per side or segment) that together evidence it. Either way the count stays small and every metric is acted on. One lighthouse's worth of focus, a few levers — not literally one number when the product has two sides.
+### 3. Distinguish leading from lagging
+Lagging metrics (retention, revenue, churn) confirm whether value landed but report too late to steer by. Leading metrics (activation, first-week usage of a key feature, time-to-value) move early and predict the lagging ones. We instrument both and act on the leading ones — a team that can only see lagging metrics is driving by the rear-view mirror. The skill is choosing leading indicators that genuinely predict the outcome rather than merely correlating with activity.
+### 4. Counter-metrics are as load-bearing as primaries
+Every primary metric we optimise gets a **counter-metric** that guards against winning it the wrong way. This is Goodhart's Law made operational: when a measure becomes a target it ceases to be a good measure, because people optimise the number rather than the value behind it — and the more weight the metric carries, the harder it gets gamed (Campbell's Law). Optimising for time-on-task? Counter with task-completion, so we do not reward confusion. Optimising for adoption? Counter with retention, so we do not reward a one-time spike. The counter-metric — what the experimentation world calls a *guardrail* — names the most likely way the primary gets gamed and makes that failure visible. A primary metric without a counter-metric is an invitation to optimise the product into a corner.
+### 5. The metric must produce a falsifiable verdict
+A good success metric is specific enough that a *no* is as informative as a *yes*. "Users are happier" cannot be falsified; "support tickets citing the confusion drop by at least half within 30 days" can. We reject vague sentiment and abstract aggregates ("engagement improves") in favour of signals tied to a concrete user behaviour and a threshold. The test of a metric is whether a disappointing result would actually change our minds.
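One way to picture a pre-registered, falsifiable target, using the support-ticket example from the paragraph above. This is a sketch under stated assumptions: the `Target` type, its field names, and the baseline and observed numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A success signal written down before the work starts (hypothetical shape)."""

    metric: str
    baseline: float
    max_ratio_of_baseline: float  # 0.5 means "drops by at least half"
    window_days: int


def verdict(target: Target, observed: float) -> bool:
    """Return True only if the observed value meets the pre-registered threshold."""
    return observed <= target.baseline * target.max_ratio_of_baseline


# Assumed numbers, for illustration only.
tickets = Target(
    metric="support tickets citing the confusion",
    baseline=120.0,
    max_ratio_of_baseline=0.5,
    window_days=30,
)

print(verdict(tickets, observed=55.0))  # 55 is at or below the threshold of 60
print(verdict(tickets, observed=80.0))  # 80 is above the threshold of 60
```

Because the threshold is fixed before launch, a result of 80 tickets is an unambiguous miss rather than something to reinterpret afterwards.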
+### 6. Match the rigour to the stakes
+Metric design scales with the bet. A load-bearing product decision earns a North Star, instrumented inputs, and a pre-registered target. A small change earns a single observable signal and a glance after release. Demanding a full metric tree for every minor feature is as much a failure as shipping a major bet with no measure at all — the discipline is proportion.
+## How we apply this
+- The success signal a metric defines is the falsifiable outcome a bet's hypothesis commits to — the same signal named in [continuous discovery](continuous-discovery.md) and carried verbatim into the pitch.
+- Metrics are the measure of [product engineering's](product-engineering.md) "instrument everything you ship" — this page is the *design* of the signal; [observability](../quality/observability.md) is the telemetry layer that *captures* it.
+- For AI features, product metrics pair with model-quality metrics — see [AI-native product](../ai-native/ai-native-product.md) for the dual-metric discipline.
+## Anti-patterns we reject
+- **Vanity metrics.** Totals that only ever go up — cumulative signups, total page views — and say nothing about whether the product delivers ongoing value.
+- **Single-metric tyranny.** One number optimised without a counter-metric, which reliably degrades the product along the axis nobody is watching.
+- **Output as outcome.** Counting features shipped or story points burned as if delivery were the goal. Output is the cost, not the result.
+- **The retrospective metric.** Deciding how to measure success only after launch, when any result can be spun into a win.
+- **The unfalsifiable goal.** "Improve the experience" with no behaviour, no threshold, and therefore no possible disconfirmation.
+## Further reading
+- *Escaping the Build Trap*, Melissa Perri — why output metrics corrupt product teams and how outcome metrics fix it.
+- *Lean Analytics*, Croll & Yoskovitz — the One Metric That Matters and choosing it by stage.
+- Amplitude's *North Star Playbook* — the North Star and its input metrics as an operating model.
+- *Working Backwards*, Colin Bryar & Bill Carr — controllable input metrics vs. output metrics, Amazon's operating model for steering by leading indicators.
+- Goodhart's Law (Charles Goodhart) and Campbell's Law (Donald Campbell) — the foundations of why a measure-as-target gets gamed, and why counter-metrics are non-optional.
+- Ravi Mehta, *Your product team doesn't need a North Star Metric* — the case for a North Star strategy over a single number when one metric cannot capture the value.

package/src/docs/principles/foundations/testing.md ADDED

@@ -0,0 +1,108 @@
+---
+title: Testing
+description: Continuous Risk Assurance — testing the system, not the mock of the system.
+status: active
+last_reviewed: 2026-06-26
+---
+# Testing
+## TL;DR
+Tests are risk-weighted assertions about production behaviour — not boxes ticked for coverage. We favour high-fidelity service tests over solitary unit tests, run dependencies we own as real ephemeral containers rather than mocking them, contract-test the ones we don't, and treat observability signals as first-class assertions. Above the honeycomb sits one more level: a proof that drives the real shipping build through its front door on the real pipeline, because parts that each pass in isolation can still assemble into a product that does nothing — and a fake a test leans on needs a real test behind it. The measure of a suite is whether its assertions actually catch faults — not its line-coverage number. The invariant under all of it: a test that captures whatever the system currently does is worthless unless something *independent* of the implementation asserts that behaviour is correct. Independent oracles and reproducible failures are the spine; the distribution shape is a detail teams over-argue.
+## Why this matters
+The dominant failure mode of a test suite in 2026 is not that it is too small — it is that it passes while production breaks. Mocked dependencies drift from their real counterparts, unit tests assert on implementation rather than behaviour, and green CI gives a false sense of security. *Continuous Risk Assurance* is our name for the discipline that replaces "coverage as a target" with "risk as the thing we actually measure."
+This matters more, not less, as code generation gets cheaper. When an agent can produce a plausible implementation in seconds, the bottleneck moves from writing code to *trusting* it. The test suite becomes the executable specification that constrains generated code — the thing that says what "correct" means when the author is a model and the reviewer is short on time. A weak suite that generated code passes is worse than no suite, because it manufactures confidence.
+## Our principles
+### 1. Favour service tests over solitary unit tests
+Our default shape is the **test honeycomb**, popularised by Spotify's engineering teams: a fat middle of integrated, "sociable" service tests, a thin layer of solitary unit tests, and a few end-to-end checks on top — not the classic Mike Cohn pyramid that pushes most weight onto isolated units. We test from the API entry point through to real, ephemeral database containers, because in a service-oriented codebase the interesting bugs live at the boundaries — HTTP serialisation, SQL query correctness, transaction semantics, event emission — exactly what solitary unit tests mock away.
+The honeycomb is a stack-appropriate heuristic, not a law. No empirical study ranks the pyramid, honeycomb, and trophy by defect detection — they are practitioner shapes for different interaction surfaces (service-to-service for the honeycomb, component-interaction for Kent Dodds's frontend trophy), and the word "integration" means something far cheaper in one than the other. What the evidence does support is that test *quality* outweighs distribution: a suite of fast, reliable, expressive tests that fail only for useful reasons beats any ratio of tests that don't. So pick the shape that fits the stack — the honeycomb for our backends, the trophy for a frontend — and spend the saved argument on making each test bite.
+The honest tension: service tests buy fidelity at the cost of speed and diagnostic precision. A solitary unit test that fails names the broken function; a service test that fails tells you "the create-order flow is broken" and leaves you to find where. And a slow, flaky service layer is corrosive — teams that can't trust or tolerate it quietly retreat to mocking everything, which is the exact failure this principle exists to prevent. So fidelity is not a licence to be slow: keep service tests parallelisable, keep fixtures cheap, and treat suite latency as a first-class defect.
+Decision rule: reach for a solitary unit test when the logic is **algorithmically dense and boundary-poor** — a parser, a pricing calculator, a state machine, a validator — where the combinatorics are the risk and a container adds only latency. Reach for a service test when the risk lives in the **wiring** — serialisation, persistence, queries, events, auth. When a service-test failure is routinely hard to localise, that is a signal to factor out the dense core and unit-test it directly, not to mock the boundary.
+### 2. Run real dependencies you own; contract the ones you don't
+For a dependency you own and deploy — Postgres, your message broker, object storage — run the real thing in an ephemeral container (Testcontainers or equivalent). In-memory fakes miss the bugs that actually escape to production: schema and migration mismatches, serialisation edge cases, transaction and isolation behaviour, query-planner surprises. Pin the image to the version you run in production — never `latest`, which turns an upstream release into a flaky build. Reset state between tests for determinism, and share a container across a suite rather than per-test so startup cost doesn't dominate the run.
+But "emulate everything" is a false absolute, and applied carelessly it wrecks the feedback loop — full brokers and databases spun up for tests that exercise none of their behaviour buy nothing but minutes. Two cases break the rule:
+- **Third-party services you do not control** (a payments API, a SaaS provider) usually cannot be containerised faithfully, and a hand-written mock of them is the worst of both worlds — it encodes *your belief* about their behaviour, which is precisely what drifts. Verify against a **contract** instead: a consumer-driven contract (Pact) or a recorded/replayed interaction captured from the real provider, plus a small, periodically-run live suite against a sandbox to detect drift. Pact's leverage is weaker here because you can't compel an external provider to verify your contract, so treat the contract as a drift detector, not proof.
+- **Pure logic with no real I/O.** If the unit under test has no genuine dependency, don't invent one to stand a container up behind. Test it directly.
+Decision rule: emulate the data and serialisation boundaries you own; contract-test the boundaries you don't; mock only at a seam you fully control and only when the real thing adds latency without adding risk. A mock that stands in for a database is almost always the wrong call (see anti-patterns); a recorded contract for a remote API you can't run is often the right one.
+### 3. Observability is a test surface
+OpenTelemetry instrumentation is a design-time concern, not an afterthought: sketch the trace a feature should produce before writing the handler (the observability-driven development stance, [Observability](../quality/observability.md) principle 5). System tests then assert that traces are unbroken end-to-end: a missing span, a lost TraceID, or a broken parent-child relationship is a test failure, not an instrumentation TODO. The boundary between "test" and "monitor" dissolves, because both ask whether the system is behaving as we claim. The payoff is twofold: the same instrumentation that proves correctness in CI is what lets you debug the incident in production.
+The mechanism is an **in-memory span exporter**: register one in the test process, exercise the system, and assert on the finished spans — the DB span exists with the attributes a dashboard query depends on, the spans emit in the expected order, the TraceID propagates across a service hop. This is a built-in capability of every OTel SDK, and it is the durable approach now that the dedicated trace-based-testing tools (Tracetest, Malabi) have gone dormant. Assert on what the contract promises and let the rest float (the over-assertion trap is real — see [Observability](../quality/observability.md) principle 6). "Trace coverage" as a *metric* — a line-or-branch-coverage equivalent for spans — is still aspirational research, not a number to gate on; the proven practice is traces-as-assertions, not a coverage percentage.
+### 4. Name tests by behaviour, not implementation
+A test name must let an on-call engineer form a hypothesis from the failure log alone, without opening the test file. The default form — `[Unit] should [expected outcome] when [condition]` — encodes that intent, and names like `TestCreateItem_Success` are banned because they convey nothing beyond what the dashboard already shows. The format serves the goal; the goal is the rule. A name that states behaviour and condition in another shape is fine. A name that follows the template but says nothing specific (`should work when called`) is not.
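A short sketch of the default naming form, written with Python's standard `unittest` module. The `Cart` class and its behaviour are hypothetical, invented only so the test names have something to describe.

```python
import unittest


class Cart:
    """Hypothetical unit under test, invented for this example."""

    def __init__(self) -> None:
        self.items = []

    def add(self, sku: str) -> None:
        self.items.append(sku)

    def checkout(self) -> int:
        if not self.items:
            raise ValueError("cannot check out an empty cart")
        return len(self.items)


class CartTest(unittest.TestCase):
    # Each name reads as "[Unit] should [expected outcome] when [condition]".
    def test_cart_should_reject_checkout_when_it_has_no_items(self) -> None:
        with self.assertRaises(ValueError):
            Cart().checkout()

    def test_cart_should_count_every_item_when_checked_out(self) -> None:
        cart = Cart()
        cart.add("sku-1")
        cart.add("sku-2")
        self.assertEqual(cart.checkout(), 2)


# Run the suite programmatically so the result object can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CartTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

If the first test fails, the log line alone says which behaviour broke and under what condition, which is the property a name like `TestCreateItem_Success` lacks.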
+### 5. Risk-based depth, and prove the assertions bite
+Coverage percentages are meaningless without proof that the assertions catch real faults — a suite can execute every line and assert nothing. We score modules on Impact × Complexity × Change-frequency before deciding test depth: high-risk modules earn live system tests and chaos experiments; low-risk modules need only small tests and static analysis. Equal depth everywhere is wasted effort.
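The scoring step above could be sketched as follows. The 1 to 3 scale, the cut-off of 12, and the module names are assumptions chosen to show the shape of the calculation, not values the source specifies.

```python
from dataclasses import dataclass

HIGH_RISK_CUTOFF = 12  # placeholder threshold; tune it to your own matrix


@dataclass(frozen=True)
class ModuleRisk:
    name: str
    impact: int            # 1 (low) to 3 (high); the scale is an assumption
    complexity: int        # same assumed scale
    change_frequency: int  # same assumed scale

    @property
    def score(self) -> int:
        # Impact x Complexity x Change-frequency, as described above.
        return self.impact * self.complexity * self.change_frequency


def depth_for(module: ModuleRisk) -> str:
    """Map a risk score to the test depth the principle describes."""
    if module.score >= HIGH_RISK_CUTOFF:
        return "live system tests and chaos experiments"
    return "small tests and static analysis"


# Hypothetical modules.
ledger = ModuleRisk("payments-ledger", impact=3, complexity=3, change_frequency=2)
banner = ModuleRisk("admin-banner", impact=1, complexity=1, change_frequency=2)

for module in (ledger, banner):
    print(module.name, module.score, depth_for(module))
```

A multiplicative score means a module that is trivial on any one axis stays low overall, which matches the intent that depth follows combined risk rather than any single factor.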
+The honest measure of whether assertions bite is **mutation testing** (PIT, Stryker, mutmut, or equivalent): inject deliberate faults and confirm a test fails. A surviving mutant is a line you cover but do not actually check. This is the honeycomb's natural complement: a fat sociable service test drives a huge number of branches through one HTTP call, and it is easy for it to *execute* them all while only asserting on the response body; mutation testing is the one instrument that proves the suite checks what it runs rather than merely exercising it. Mutation score correlates with real fault detection better than coverage does, but that advantage largely disappears once you control for suite size, so treat it as a quality read-out, not a bug-finding proxy.
+Mutation testing is expensive — its naive cost is the suite run times the number of mutants — so never run it across the whole tree and never make it a blanket gate. Run it on the high-risk modules the matrix flags and on changed code only, the model Google operates at scale: incremental, mutate-the-diff, surfaced in review. Tooling maturity is uneven and the guidance degrades gracefully with it — Stryker (JS/TS), PIT (JVM), and mutmut/cosmic-ray (Python) are production-grade; Go's options are pre-1.0 and slow, so there it stays a hand-run spot check, not an expectation. The same read-out is the antidote to AI-generated tests, whose oracles are derived from the current implementation and so cement existing bugs as expected behaviour: generate the test, mutate the code under it, and feed any surviving mutant back as the missing assertion — the assurance filter that turns a coverage-inflating suite into one that bites.
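To make "a surviving mutant" concrete, here is a hand-rolled illustration of the idea. It is not how PIT, Stryker, or mutmut work internally; the function, the mutant, and both suites are hypothetical and written by hand to show the difference between executing a line and checking it.

```python
def is_adult(age: int) -> bool:
    """Original implementation (hypothetical)."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Hand-written mutant: the >= operator has been changed to >."""
    return age > 18


def weak_suite(fn) -> bool:
    """Executes the comparison but never probes the boundary value."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary assertion the weak suite is missing."""
    return weak_suite(fn) and fn(18) is True


# Both suites pass against the original, so both report full line coverage.
assert weak_suite(is_adult) and strong_suite(is_adult)

# A suite "kills" the mutant when it fails against the mutated code.
weak_kills_mutant = not weak_suite(is_adult_mutant)
strong_kills_mutant = not strong_suite(is_adult_mutant)

print("weak suite kills mutant:", weak_kills_mutant)
print("strong suite kills mutant:", strong_kills_mutant)
```

The mutant survives the weak suite, and the surviving mutant points directly at the missing assertion: the behaviour at exactly 18.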
+### 6. Tests are part of the change, not after it
+A feature PR without tests is incomplete, and we review the test with the same rigour as the code. Tests deferred to a "follow-up PR" compete with the next feature and usually lose, so the work isn't done until the verification ships with it. The exceptions are honest and narrow: a spike or throwaway prototype whose purpose is to be deleted does not need tests — but the moment it becomes the implementation, it does.
+This is a discipline about *what ships together*, not a mandate to write tests first. Test-first (TDD) is a powerful design tool — it forces you to use your own interface before committing to it — but it is a tool, not a law, and the "Is TDD Dead?" exchange between Kent Beck, Martin Fowler, and DHH named the real cost: dogmatic test-first can induce *design damage*, contorting code with needless indirection purely to make it mockable. Hold both signals. If a change resists testing, that usually means the design is wrong — fix the code. But if the *only* way to test it is to shatter a cohesive unit into layers of indirection nothing else needs, the test is making the demand, and the design was right. Write the test with the change; let it pressure the design; don't let it deform the design.
+### 7. Generate the inputs you can't enumerate
+Example-based tests check the cases you thought of; the bugs live in the cases you didn't. Where the input space is large and a property holds across all of it — a round-trip (`decode ∘ encode = id`), a parser that must never panic, a calculation with an algebraic invariant, a state machine whose transitions must preserve a constraint — assert the property and let the framework generate and shrink counterexamples (Hypothesis, fast-check, jqwik, rapid). This is the highest-leverage complement to the dense-logic unit tests of principle 1: one property covers an infinity of examples, and in practice most caught faults surface on a single generated input, so it earns its keep cheaply. The cost is authoring — a meaningful property needs domain insight and a generator — so reach for it where invariants are real, not everywhere.
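The round-trip property named above can be sketched with nothing but the standard library. This is a hand-rolled stand-in: the frameworks listed (Hypothesis and the others) add smarter generation and shrinking of counterexamples, which this loop does not attempt. The `encode` and `decode` functions are hypothetical.

```python
import random


def encode(values: list[int]) -> str:
    """Hypothetical encoder: comma-separated integers."""
    return ",".join(str(v) for v in values)


def decode(text: str) -> list[int]:
    """Hypothetical decoder, the inverse of encode."""
    return [int(part) for part in text.split(",")] if text else []


def check_round_trip(trials: int = 500, seed: int = 0) -> int:
    """Assert decode(encode(x)) == x over many generated inputs.

    A fixed seed keeps any failure reproducible. Returns the number of
    inputs checked.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 20)
        values = [rng.randint(-10**6, 10**6) for _ in range(length)]
        assert decode(encode(values)) == values, f"round trip failed for {values}"
    return trials


checked = check_round_trip()
print(f"round-trip property held for {checked} generated inputs")
```

The generator covers empty lists and negative numbers without anyone having to think of them, which is the point of asserting the property instead of enumerating examples.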
+The same generator-driven idea spans two more surfaces. At the service boundary, **Schemathesis** derives a semantics-aware fuzzer straight from an OpenAPI/GraphQL spec and is the bridge between contract testing and property-based testing — it finds materially more defects than example-based API tests for the cost of pointing it at the schema. At the byte boundary, coverage-guided **fuzzing** (`go test -fuzz`, cargo-fuzz/libFuzzer) is first-class for parsers and decoders, and a failing input is saved as a permanent regression seed. For stateful or distributed cores where ordering and failure timing are the risk, deterministic simulation testing (Antithesis, FoundationDB/TigerBeetle-style seeded simulators) is the frontier worth knowing — every bug reproduces from `seed + commit` — but its setup cost is real, so treat it as a deliberate investment for the system's hardest core, not a default.
+### 8. Prove the whole product at the front door
+The honeycomb proves the parts. One level sits above it: a proof that drives the **real shipping build** — the packaged, embedded artifact a user actually launches — through its **real front door**, on the **real pipeline**, end to end, the way a user's action travels. A service test that proves an engine behind a harness and a UI test that drives screens against a scripted stand-in can both pass while the assembled product does nothing, because the wiring between them was nobody's test. The front-door proof is the one that fails when the real thing is unwired, and it is what "done" means for a feature a user touches.
+This is where **a fake needs a real test behind it** becomes load-bearing. Every stub, fixture, or seeded file a test leans on is a claim that something real produces that value, and the claim is honest only when another test exercises the real producer. A media library whose tests write fixture thumbnails passes green while the shipping grid renders blank — nothing in the suite ever generated a real thumbnail, so the fixture stood in for a stage that did not exist. Seeded inputs are not the violation: handing the real pipeline a known fixture folder tests the pipeline on controlled data. Replacing the pipeline with a script that emits the expected output is the violation. The line is whether the work in the middle runs for real.
+Non-functional outcomes a user feels — latency, throughput, memory headroom — are proven the same way. A number measured against an early prototype decays the moment the design that produced it changes; it has to be re-proven on the shipping path, not carried forward as a one-time measurement.
+## How we apply this
+- [Observability](../quality/observability.md) — the OTel-first stance that makes traces-as-assertions possible.
+- [Reliability](../quality/reliability.md) — how tests compose with chaos and load experiments.
+- [How We Structure Code](../system-design/code-structure.md) — the structural choice that makes tests cheap to write and fast to run.
+## Anti-patterns we reject
+- **Mocking the database.** A test that mocks the database asserts against your SQL-writing skill, not against database behaviour. Use an ephemeral container.
+- **Retrying flaky tests until green.** A test that passes on the third run is a failing test with a coin flip attached, and rerun-to-green trains the whole team to ignore red. Quarantine the flake out of the gating suite, file it, and fix the root cause — non-determinism, timing, shared state, test order. Quarantine is a triage state with a deadline, not a graveyard.
+- **Snapshot tests as a default.** Snapshots are a brittle, noisy substitute for behavioural assertions, and "update snapshots" becomes a reflex that launders bugs into the baseline. Acceptable only when the artefact is genuinely opaque (a rendered email, a serialised response).
+- **Coverage-gated CI.** "95% line coverage required" is a metric that gets gamed without reducing real risk. Use coverage as a read-out and mutation score as the quality signal; never use line coverage as the gate.
+- **Shared staging environments as the integration test.** Staging has no hermetic guarantees, no reproducibility, no determinism. It is a deployment target, not a test bed.
+- **Proving the engine, shipping the product.** A headless proof that the core behaves behind a harness is a slice of confidence, not the product. Until a test drives the assembled, shipping build through the front door on the real pipeline, "it works" is unproven where a user stands.
+- **A fake with no real test behind it.** A fixture or stub that nothing real ever produces is a green light wired to nothing. Every fake is a debt; the real test that exercises the producer is how it gets paid.
+- **"It's hard to test, so we didn't."** That is a signal the code is badly designed. Fix the code.
+## Further reading
+- *Accelerate*, Forsgren, Humble, Kim — the empirical case for continuous delivery and its testing discipline.
+- *Working Effectively with Legacy Code*, Michael Feathers — seams, test doubles, and when each is appropriate.
+- *Growing Object-Oriented Software, Guided by Tests*, Freeman & Pryce — the canonical treatment of outside-in service testing.
+- *xUnit Test Patterns*, Gerard Meszaros — the vocabulary we use for test doubles, fixtures, and strategies.
+- *Is TDD Dead?*, Beck, Fowler & Heinemeier Hansson — the conversation that maps the contested zone between test-first discipline and test-induced design damage.
+- "UnitTest", "TestPyramid", and "On the Diverse and Fantastical Shapes of Testing", Martin Fowler (martinfowler.com) — the sociable-vs-solitary distinction, the shape trade-offs, and Justin Searls's argument that the shape debate is a distraction from test quality.
+- "Testing of Microservices", Spotify Engineering — the honeycomb shape and the integrated-vs-integration-test distinction it rests on.
+- "Practical Mutation Testing at Scale: A View from Google" — the changed-code-only, surfaced-in-review model that makes mutation testing affordable.
+- "A Next Step Beyond Test-Driven Development", Honeycomb.io (Charity Majors) — observability-driven development and testing in production.
+- *Deriving Semantics-Aware Fuzzers from Web API Schemas* (Schemathesis, ICSE 2022) — the empirical case for spec-driven property fuzzing at the service boundary.

package/src/docs/principles/index.md ADDED

@@ -0,0 +1,24 @@
+---
+title: Engineering Manifesto
+description: The core beliefs that shape how we build software — complexity, contracts, reliability, testing, architecture, documentation, decisions, and AI-native development.
+status: active
+last_reviewed: 2026-06-19
+---
+# Engineering Manifesto
+Software engineering is the discipline of managing complexity and optimising for change. A platform that processes high-volume asynchronous workloads and serves users in real time at scale must lean hard on a solid technical foundation, frictionless developer velocity, and a rigorous engineering culture.
+> [!IMPORTANT]
+> These principles are the shared vocabulary we use to decide what to build, how to build it, and what trade-offs we accept. Every page in this hub stands on its own and does not require context from any other document to be useful.
+## What we believe
+1. **Complexity is the enemy; clarity is the goal.** We choose simple designs, simple tools, and simple processes — and we accept the cost of doing so. Speculative abstraction, premature generalisation, and fear of deletion all compound into the kind of complexity that slows teams down.
+2. **Contracts are the single source of truth.** API specifications, event schemas, and database definitions are authoritative. Clients, tests, documentation, and UIs are derived from them. When a spec is wrong, everything downstream is wrong — and that is the correct failure mode, because one visible error beats silent drift across hand-maintained artefacts.
+3. **Reliability is designed in, not patched in.** We build for failure from the first commit: idempotency at the API boundary, graceful degradation at the edges, backpressure when downstream systems slow, and observability as a design-time concern rather than an afterthought.
+4. **We prove software by using the real thing the way its user does.** A feature is proven when a test drives the shipping build through its real front door, on the real pipeline, the way the user's action actually travels — and the user is whoever observes the outcome, a person at a screen or a caller of an API. Tests that run against real databases, real message brokers, and real HTTP stacks catch the bugs that mocked tests hide, and any fake a test leans on needs a real test behind it. Parts that each pass behind a harness can still assemble into a product that does nothing; the front-door proof is the one that catches it. See [Testing](foundations/testing.md).
+5. **A pure core, swappable edges, and one obvious place for everything.** Every service is a pure decision-making core wrapped in a thin shell that does I/O; concrete dependencies plug in behind abstractions the core owns and stay swappable, with no implementation detail leaking inward. The structure is opinionated, so neither a human reading the code nor an agent writing it ever has to guess where a thing belongs. See [How We Structure Code](system-design/code-structure.md).
+6. **Documentation is a product, not a by-product.** This documentation is versioned, reviewed, and shipped with the same discipline as code. It serves humans and AI agents, and the structures that help one help the other.
+7. **Architectural decisions are recorded and governed.** We capture each significant decision with the context, assumptions, and trade-offs that shaped it, then govern it — an owner, a review trigger, and supersession rather than silent edits when it changes. The record is immutable so the trail of *why* survives; the decision stays open to re-evaluation when its assumptions break. Re-deciding is healthy engineering; re-deciding without recording it is how teams lose their memory. See [Architecture Decisions](system-design/architecture-decisions.md).
+8. **AI agents are first-class engineers.** They read our docs, write our code, review our diffs, and run our tooling. We design our codebase, our conventions, and this documentation so an agent can operate at the same level of quality as a senior engineer.
+9. **Software is made to be used, so it lands fully formed.** A feature is finished when it works, looks right, and is a genuine pleasure to use — reachable, complete, with no dead ends and every state accounted for. Function, form, and experience are one bar, not a core that ships and polish that waits. When code generation is cheap, the considered touch that makes a product feel cared-for is cheap too, so the bar is high. See [Usability and UX](design/usability-and-ux.md).