npm - groundwork-method - Versions diffs - 0.0.1 → 0.10.0 - Mend

groundwork-method 0.0.1 → 0.10.0

Files changed (629)

package/src/docs/principles/quality/observability.md ADDED

@@ -0,0 +1,84 @@
+---
+title: Observability
+description: OpenTelemetry-first design, SLOs, error budgets, and trace-driven development.
+status: active
+last_reviewed: 2026-06-19
+---
+# Observability
+## TL;DR
+Observability is a design property, not a monitoring bolt-on. We instrument every service with OpenTelemetry from day one, build dashboards from the instrumentation, and use traces as both a debugging tool and a first-class test assertion. If a system is behaving strangely and our data cannot show why, we fix the instrumentation rather than guess.
+## Why this matters
+The difference between a team that can ship with confidence and one that cannot is, most of the time, a difference in what they can see. Observability gives a team three things: the ability to know whether the system is healthy, the ability to localise a fault when it is not, and the ability to explain what happened after the fact. Without those, every deploy is a gamble and every incident is a fresh investigation. With them, the team moves faster and sleeps better.
+## Our principles
+### 1. OpenTelemetry is the common language
+Every service emits traces, metrics, and logs through OpenTelemetry SDKs to a single collector. Vendor coupling sits at the collector boundary, not inside application code, so switching backends is a collector configuration change, not an application rewrite.
+### 2. Traces are the primary signal
+Given a choice between adding a metric and enriching a trace, we enrich the trace: traces preserve causality, metrics aggregate it away, and for a request that crosses half a dozen services causality is the difference between a diagnosable incident and a guessing game.
+But "traces over metrics" is not absolute, and a thoughtful operator will push back. You cannot keep every trace — at scale they are sampled, and a sampled signal is a weak base for an alert that must fire on a single bad request. Metrics are cheap, always-on, and the right substrate for the things you can enumerate in advance. So the decision rule: the health signals you alert on (the RED/USE rates) live as metrics, carrying exemplars back to the traces that produced them; the open-ended question — *why is this slow, and for whom* — lives in traces and wide events. Sample to control cost, but sample at the tail so errors and slow outliers survive the cut (principle 7).
+### 3. The "three pillars" are one pillar
+Logs, metrics, and traces are not independent data — they are different projections of the same events. A log line includes its trace ID; a metric includes the dimensions that let you pivot back to traces; an exemplar on a metric points directly at the trace that produced it. If a team has three disconnected telemetry systems, it has no observability — only three bills and three places to look. The pillars are a storage and query detail; the unit of truth is the event, and every signal must link back to it.
+### 4. Dashboards derive from SLOs
+Every dashboard starts with the user-journey SLO it supports ([Reliability](reliability.md)). Latency percentiles, error rates, saturation, and traffic (the "RED/USE" layers) then fill in the detail. Dashboards assembled by adding "interesting-looking" graphs drift into uselessness; dashboards derived from SLOs stay useful.
+### 5. Trace-driven development
+When building a new feature, we sketch the trace it should produce *before* we write the handler. What spans must exist? What attributes must each span carry? What parent-child relationships are required? The instrumentation design shapes the code, not the other way around. This makes it essentially impossible to ship a feature that is unobservable.
+### 6. Assert on telemetry in tests
+System tests assert that traces are unbroken end-to-end — a missing span on a critical path is a test failure ([Testing](../foundations/testing.md)). This makes the instrumentation part of the contract rather than an optional decoration, so it cannot silently rot. The failure mode to avoid is over-asserting: a test that pins the exact span tree and every attribute is coupled to implementation detail and will break on every harmless refactor, training the team to delete the assertion rather than trust it. So assert on what the contract actually promises — the spans that must exist on the user journey, that the trace stays connected across service hops, and the attributes a dashboard or SLO query depends on — and let the rest float.
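A minimal sketch of a contract-level trace assertion, assuming spans have already been collected into plain dictionaries (for example from an in-memory exporter). The field names `name`, `span_id`, `parent_id`, and `attributes`, and the helper `assert_trace_contract`, are hypothetical and chosen for illustration only. The check covers required spans, connectivity, and query-critical attributes, and ignores everything else.

```python
def assert_trace_contract(spans, required_names, required_attrs):
    """Check only what the contract promises; extra spans and attributes are ignored."""
    names = {s["name"] for s in spans}
    missing = set(required_names) - names
    assert not missing, f"missing spans on critical path: {sorted(missing)}"

    # Connected: exactly one root, and every other span's parent is in the trace.
    ids = {s["span_id"] for s in spans}
    roots = [s for s in spans if s["parent_id"] is None]
    assert len(roots) == 1, f"expected one root span, found {len(roots)}"
    orphans = [s["name"] for s in spans
               if s["parent_id"] is not None and s["parent_id"] not in ids]
    assert not orphans, f"broken trace context at: {orphans}"

    # Only the attributes a dashboard or SLO query depends on are pinned.
    by_name = {s["name"]: s for s in spans}
    for name, keys in required_attrs.items():
        assert name in by_name, f"span {name} not found"
        absent = set(keys) - set(by_name[name]["attributes"])
        assert not absent, f"{name} lacks attributes: {sorted(absent)}"
```

Because the helper never compares the full span tree, a refactor that adds or renames an internal span leaves the test green, while a dropped critical span or a lost trace context fails it.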
+### 7. Logs are structured, sampled, and contextual
+Every log line is structured (JSON), carries its trace ID, and is emitted at a severity that the team has actually agreed on. We sample aggressively at debug and info, because nobody needs every log line in production, and we never sample errors away. Traces obey the same logic with the timing reversed: prefer tail-based sampling, where the keep-or-drop decision is made after the trace completes, so every error and slow outlier is retained rather than dropped at random, as head-based sampling would drop it. Tail sampling costs more to operate (every span of a trace must be buffered to one place); where that cost is not justified, fall back to head-based sampling with errors force-kept. Unstructured log lines are not logs; they are a different kind of noise.
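The tail-based keep-or-drop decision can be sketched as a small function, assuming the complete trace is available as a list of span dictionaries. The fields `error`, `start_ms`, and `end_ms`, the thresholds, and the name `keep_trace` are illustrative assumptions, not any collector's actual configuration.

```python
import random


def keep_trace(spans, slow_ms=1000.0, baseline_rate=0.01, rng=random.random):
    """Decide after the trace has completed, so the whole trace can be inspected."""
    if any(s.get("error") for s in spans):
        return True  # errors are never sampled away
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= slow_ms:
        return True  # slow outliers survive the cut
    # Healthy, fast traces: keep only a small random share to control cost.
    return rng() < baseline_rate
```

A head-based sampler would have to make the same call before the first span finishes, which is why it cannot tell an error or a slow trace apart from a healthy one.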
+### 8. Cardinality is a design choice
+High-cardinality attributes (per-user, per-tenant, per-session) are valuable for debugging but expensive in storage. We tag deliberately — high cardinality on traces where it is queryable, lower cardinality on metrics where it multiplies by every time window. Runaway cardinality is one of the most expensive mistakes a team can make in observability; it is a design call, not a default.
+### 9. Wide events, and instrument by default
+We lean toward "observability 2.0" — arbitrarily-wide, high-cardinality structured events queried after the fact — over pre-aggregated metrics that fix the question in advance. The honest caveat: a single wide-event store is a north star, not a free lunch. Wide events cost more to ingest and store than the metrics they would replace, and folding every signal into one backend trades correlation power for a sharper lock-in. The escape is a columnar store and an open wire format (OpenTelemetry) so the data stays portable and the bill tracks query value rather than vendor pricing — and pre-aggregated metrics keep their place for the cheap, always-on signals of principle 2.
+We also auto-instrument: kernel-level eBPF (OpenTelemetry OBI) and continuous profiling correlated to trace IDs give broad telemetry with no code change. But eBPF sees the wire, not the intent — it cannot name a business operation, attach a domain attribute, or reliably propagate trace context through compiled or encrypted paths, and OBI today is Linux-only with no logs signal. So the division of labour is fixed: auto-instrumentation for breadth and coverage, hand-instrumentation for the domain spans and attributes only we can name.
+### 10. AI systems are observed through GenAI conventions
+A model in the system is instrumented with the OTel GenAI semantic conventions: token usage (cost and latency track tokens, not requests, and prompt-cache hits are tracked separately), prompt/response capture, agent and MCP tool-call spans, and eval traces — with failed production traces promoted into the eval set so the suite grows from real behaviour. A model call logged as an opaque string is unobservable. Two caveats keep this honest: the GenAI conventions are still experimental as of 2026, so pin the semconv version and use the stability opt-in rather than assuming attribute names are frozen; and full prompt/response capture is a PII and storage liability — capture it deliberately, redact at the edge, and sample the payloads rather than the spans.
+## How we apply this
+- [Reliability](reliability.md) — the SLO layer built on top of this telemetry.
+- [Testing](../foundations/testing.md) — how we assert on traces in system tests.
+- [Performance](performance.md) — the latency work that depends on good tracing.
+## Anti-patterns we reject
+- **Pillar-at-a-time adoption.** "We'll add metrics now, traces later." You will not.
+- **Vendor SDKs in application code.** Application code imports OpenTelemetry; the collector talks to the vendor.
+- **Dashboards without SLOs.** Pretty charts without a question they are answering.
+- **Logs-as-debugger.** Using `printf`-style logging to trace a single bug. Write a test; add a span.
+- **Print-statement-style `Debug` in production.** If every deploy adds ten debug logs and the next removes twelve, we are missing structure.
+- **Cardinality explosions.** Putting a UUID in a Prometheus label. The bill and the query planner will both remember.
+## Further reading
+- *Observability Engineering*, Majors, Fong-Jones, Miranda — the canonical text on traces-first observability.
+- *Distributed Systems Observability*, Cindy Sridharan — the short, sharp introduction.
+- Charity Majors, *Observability 1.0 vs 2.0* — the wide-events thesis and the honest argument about its cost, on charity.wtf and the Honeycomb blog.
+- *The OpenTelemetry specification* ([opentelemetry.io/docs/specs](https://opentelemetry.io/docs/specs)) — worth reading the high-level overview at least once; see also the GenAI semantic conventions for LLM instrumentation.
+- *Systems Performance*, Brendan Gregg — the canonical reference for the "USE method" (utilisation, saturation, errors).

package/src/docs/principles/quality/performance.md ADDED

@@ -0,0 +1,84 @@
+---
+title: Performance
+description: Latency budgets, tail latency, backpressure, and load shedding.
+status: active
+last_reviewed: 2026-06-19
+---
+# Performance
+## TL;DR
+Performance is not "fast enough" — it is a budget, spent deliberately across every hop of a user interaction and enforced in CI. We optimise for tail latency, we design backpressure into real-time flows, and we measure the things users feel, not the things developers find convenient.
+## Why this matters
+Users notice latency before they notice almost anything else. A response that renders in 800ms feels responsive; at 3000ms it feels broken. The difference is not a factor of four in effort; it is whether the team treated latency as a design constraint or as a post-hoc tuning problem. Performance handled as an afterthought is almost always more expensive than performance designed in from the start.
+## Our principles
+### 1. Latency is a budget, allocated top-down
+Every user-facing operation starts with a latency budget at the edge (say, 500ms), and that budget is allocated to the downstream hops. If a fetch is allocated 300ms and a join that runs after it 150ms, the handler has 50ms left for its own work. When a hop overruns its allocation, somebody else's budget gets squeezed. The budgeting view makes trade-offs explicit. A budget written once and never checked is fiction: reconcile the allocation against measured per-hop latency, and when the numbers don't add up, either the budget is wrong or the architecture is. Decide which before you ship.
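Reconciling an allocation against measurements can be sketched as follows, assuming the hops run one after another so their latencies add. The function name `reconcile` and the hop names are hypothetical, and the measured figures are made-up inputs.

```python
def reconcile(edge_budget_ms, allocations, measured_ms):
    """Return (per-hop overruns, headroom left at the edge), assuming sequential hops."""
    overruns = {
        hop: measured_ms[hop] - limit
        for hop, limit in allocations.items()
        if measured_ms[hop] > limit
    }
    headroom = edge_budget_ms - sum(measured_ms[hop] for hop in allocations)
    return overruns, headroom


allocations = {"fetch": 300, "join": 150, "handler": 50}  # sums to the 500ms edge budget
measured = {"fetch": 340, "join": 120, "handler": 45}     # illustrative measurements
overruns, headroom = reconcile(500, allocations, measured)
```

Here `overruns` names the hop that broke its allocation and a negative `headroom` shows the edge budget is blown overall, which is the signal to revisit either the budget or the design.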
+### 2. Measure tail latency, not average
+p50 tells you about capacity and the typical case; it tells you almost nothing about the experience that drives your reputation. Users remember the slow request, and *which* percentile is the slow request is set by fan-out, not taste. Dean and Barroso's *The Tail at Scale* makes the arithmetic unavoidable: a request that touches 100 backends, each with an independent 1-in-100 chance of exceeding its p99, will overrun that latency about 63% of the time end-to-end (1 − 0.99¹⁰⁰). At a fan-out in the hundreds, a leaf service's p99.9 becomes the user's effective median.
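The fan-out arithmetic is easy to check directly. This sketch assumes backends are slow independently of one another, which real systems only approximate; the function names are illustrative.

```python
import math


def p_any_slow(fanout, p_slow):
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


def fanout_for_median(p_slow):
    """Smallest fan-out at which a slow leaf call hits at least half of requests."""
    return math.ceil(math.log(0.5) / math.log(1.0 - p_slow))
```

With a 1% slow rate per backend, `p_any_slow(100, 0.01)` is about 0.63, and `fanout_for_median(0.001)` shows that a leaf's p99.9 becomes the user's median at a fan-out of roughly 700.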
+So the budget percentile is a decision, not a default: target p99 for a single-hop interaction, p99.9 or higher for a high-fan-out request. Measure with coordinated omission in mind: naive load-test clients silently drop the slow samples that matter most (Gil Tene). And the tail is attackable directly, not only by tuning. Hedged requests (issue a duplicate once the p95 latency has elapsed, take the first to return) cut the 99.9th-percentile latency in Dean and Barroso's BigTable benchmark from 1800ms to 74ms for roughly 2% extra requests.
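A hedged request can be sketched with `asyncio`, assuming the call is idempotent and safe to issue twice. The helper name `hedged` and its parameters are hypothetical; `hedge_after` would be set from the measured p95 of the backend.

```python
import asyncio


async def hedged(make_call, hedge_after):
    """Run make_call(); if it is still pending after hedge_after seconds,
    start a second attempt and return whichever finishes first."""
    first = asyncio.create_task(make_call())
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()  # the common case: no duplicate was needed

    second = asyncio.create_task(make_call())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the loser
    return done.pop().result()
```

Because the duplicate is only sent for the slowest few percent of calls, the extra backend load stays small while the slow tail is cut off; cancelling the losing attempt keeps that overhead bounded.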
+### 3. Pre-compute, cache, and denormalise deliberately
+When a read is hot, we pre-compute. When a computation is stable, we cache. When a join is expensive, we denormalise. Each of these trades complexity for latency; each of them earns its keep with data, not with intuition. Speculative caching is how cache-invalidation bugs become the biggest source of data incidents.
+### 4. Backpressure is designed in, not hoped for
+Every producer has a bounded queue and a defined behaviour when the queue fills: shed, coalesce, or block ([Real-Time](../system-design/real-time.md)). "It works fine in load tests" is not a backpressure strategy.
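A minimal sketch of a bounded queue with two of those full-queue behaviours made explicit: updates for a key that is already queued are coalesced, and anything else is shed once the bound is reached. The class `BoundedQueue` and its methods are hypothetical; blocking is left out because it needs a threading or async primitive.

```python
from collections import OrderedDict


class BoundedQueue:
    """Keyed queue with a hard bound: coalesce on duplicate key, shed when full."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._items = OrderedDict()
        self.shed_count = 0  # expose shedding so it can be alerted on

    def offer(self, key, value):
        if key in self._items:
            self._items[key] = value  # coalesce: keep queue position, keep newest value
            return True
        if len(self._items) >= self.maxsize:
            self.shed_count += 1  # shed: reject instead of growing without bound
            return False
        self._items[key] = value
        return True

    def poll(self):
        """Return the oldest (key, value) pair, or None when empty."""
        if not self._items:
            return None
        return self._items.popitem(last=False)
```

The return value of `offer` forces the producer to handle rejection, and `shed_count` makes the shedding visible rather than silent.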
+### 5. Load shedding protects the system from itself
+When the system is saturated, the right behaviour is not to try harder — it is to serve fewer requests well, because trying harder is exactly how an overload turns into a cascading failure (Google SRE). Requests carry a criticality assigned at the edge — Netflix's CRITICAL / DEGRADED / BEST_EFFORT / BULK taxonomy is a sound template — and we shed from the bottom up: prefetch and background work long before user-initiated requests.
+The shed *trigger* is adaptive, not a hand-tuned RPS or CPU threshold that is stale the day load patterns shift. An adaptive concurrency limit that watches the latency gradient finds the saturation point on its own and tracks it as the system changes. And shedding has a softer sibling: graceful degradation reduces the work *per* request — serve cached data, drop personalisation, fall back to a cheaper ranking — before it drops requests entirely. Shedding is a designed degradation mode, not an accident.
+### 6. Hot paths have no allocations to spare
+For the hottest inner loops (real-time processing, per-request ingestion at high throughput) we write allocation-aware code. Every allocation is a GC pause waiting to happen, and at a high enough rate the pauses become the latency. The discipline is scoped, not universal: it applies to the paths a profiler has shown to be hot, and applying it everywhere is the over-optimisation it warns against. Most code does not need it; the hot paths demand it.
+### 7. Profile before you optimise
+Two truths usually pitched as opposites. Tuning existing code without a profile is waste — the "obvious" bottleneck is almost always wrong, and Knuth's "premature optimization is the root of all evil," read in full, says forget small efficiencies 97% of the time *and do not pass up the critical 3%*. So every non-trivial tuning effort starts with a profile, taken in production-representative conditions; profiles from developer laptops lie.
+But a profiler only ever tells you where the time goes in the design you already have. It will never tell you to pick a better data structure, flatten an allocation-heavy layout, or kill an N+1 access pattern — and those design-time choices dominate the result, are cheap on the first pass, and are expensive to retrofit. "We'll profile it later" is the standard excuse for skipping them. Decision rule: choose data models, access patterns, and algorithmic complexity with performance in mind up front; reach for the profiler to direct local tuning, never to license thoughtless design.
+### 8. Budgets are enforced in CI
+Performance regressions that slip in once slip in a hundred times, and automation is cheaper than vigilance — so budgets live in CI against committed thresholds, and a PR that regresses one needs an explicit, reviewed waiver. But *what* you gate on matters more than *that* you gate. Shared CI runners are noisy, and a wall-clock microbenchmark that cries wolf on every PR trains engineers to ignore it — worse than no gate at all. So gate hard on the metrics that are deterministic regardless of the runner: bundle size, query count per request, allocation counts, Lighthouse scores. Treat wall-clock timings as a tracked trend with relative thresholds and statistical comparison, or run them on dedicated hardware — never as a hard pass/fail on a shared runner.
+### 9. Place compute deliberately, and price the tokens
+*Where* code runs is a design axis, not only *how much*: the edge for latency-sensitive, cacheable, geo-distributed work (proximity flattens the tail); WebAssembly as the edge/FaaS/plugin compute unit; containers for stateful or heavy work — most systems blend all three. Caching is multi-tier (client, CDN/edge, service, store) with an explicit hit-ratio target, and autoscaling is event-driven with real scale-to-zero (KEDA/Karpenter), not CPU-only HPA. For a model-in-the-loop path, latency and cost track **tokens, not requests** — the levers are model routing, semantic caching at the gateway, prompt/KV caching to cut time-to-first-token, and streaming so the user sees output before generation completes.
+## How we apply this
+- [Observability](observability.md) — the measurement surface for latency work.
+- [Reliability](reliability.md) — the SLO discipline that makes performance budgets enforceable.
+- [Real-Time](../system-design/real-time.md) — the streaming-specific patterns we apply.
+## Anti-patterns we reject
+- **Optimising on hunch.** No profile, no tuning — and no "we'll profile it later" as cover for an unconsidered data model.
+- **"It is fast on my laptop."** Dev latency is not production latency. Measure in the environment that matters.
+- **Average-as-metric.** Reporting only the mean or p50 hides the tail that defines your reputation. Pick the percentile your fan-out demands.
+- **Unbounded queues.** A queue without a max is a latency bomb.
+- **Cache invalidation left to the reader.** If the cache can serve stale data under a defined circumstance, that circumstance is documented. Otherwise it is a bug.
+- **Flaky perf gates.** A wall-clock benchmark gated on a noisy shared runner teaches the team to rubber-stamp red. Gate on deterministic metrics; track the noisy ones.
+- **"We will fix performance later."** If you ship slow, users will remember slow.
+## Further reading
+- *Systems Performance*, Brendan Gregg — the canonical reference; start with the Methodologies chapter, which covers the USE method.
+- *High Performance Browser Networking*, Ilya Grigorik — the frontend-and-network half of the story.
+- *Latency Numbers Every Programmer Should Know* (Jeff Dean) — calibrate your intuition.
+- Gil Tene, "How NOT to Measure Latency" — the talk on coordinated omission and why naive latency measurements lie.
+- Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM, 2013) — the fan-out arithmetic and the hedged-request pattern.
+- *Google SRE Book*, "Handling Overload" and "Addressing Cascading Failures" — criticality, client-side throttling, and load shedding done right.

package/src/docs/principles/quality/privacy.md ADDED

@@ -0,0 +1,92 @@
+---
+title: Privacy
+description: Data minimisation, GDPR, PII handling, deletion, model training, and data residency for platforms that handle sensitive user data.
+status: active
+last_reviewed: 2026-06-19
+---
+# Privacy
+## TL;DR
+We only collect what we need, keep it only as long as we need it, expose it only where it is needed, and let users see, correct, and remove their own data on demand. Privacy is a design input, not a compliance appendage.
+## Why this matters
+A privacy failure is not a regulatory inconvenience — it is a direct breach of user trust. When a platform handles sensitive user data, remediation is punishingly expensive. Privacy has to be thought about at design time, because once the data exists in the wrong shape or the wrong place, it cannot easily be undone. Some of it — anything that has been backed up, copied to a warehouse, or trained into a model — cannot be undone at all without a deliberate plan made before collection.
+## Our principles
+### 1. Collect the minimum
+For every field we capture, we ask: do we actually need this to deliver the user's outcome? Data minimisation reduces both privacy risk and operational complexity. "We might find it useful later" is not a sufficient reason to collect a field.
+The honest tension is with measurement and ML, where more raw, per-user, fine-grained data always *looks* more useful. The decision rule: collect at the grain the outcome requires, and no finer. For product analytics, prefer aggregates, derived metrics, and event counts over raw identifiable records; where the analysis genuinely needs distributions, coarsen or add differential-privacy noise rather than retaining the raw PII. "It is for analytics" lowers the bar for *nobody* — the same necessity test applies.
+### 2. Retain for a bounded time
+Every category of data has an explicit retention policy set at collection time, justified by a purpose and a lawful basis. Expired data is deleted by scheduled, monitored automation, not by an ad hoc Tuesday-afternoon clean-up. "We keep it forever" is never a category.
+Some data legitimately needs a longer clock — security and audit logs, fraud signals, financial records with statutory retention, and records under legal hold. That is not a licence to keep everything: a legal hold suspends deletion for *named* records tied to a specific matter; it does not turn the whole database immutable. The decision rule: each category gets its own period and its own justification, and the longest clock applies only to the records that actually earn it.
+### 3. Access is scoped and audited
+Every internal access to user data is authenticated, authorised, and logged. Engineers cannot browse production data casually; support staff cannot read sensitive records without a clear business reason and an auditable access record. Unsupervised access is a policy failure waiting to be discovered.
+### 4. Users see, control, and remove their data — including the copies
+Data subject rights — access, rectification, portability, deletion — are first-class features, not regulatory bolt-ons. A deletion request flows through the same plumbing as retention expiry: structured, automated, and verifiable.
+"Delete everywhere" is harder than it sounds, and this is where most real systems fail. User data is spread across the primary store, read replicas, caches, search indices, analytics warehouses, and backup snapshots, and an immutable or append-only backup tier cannot be surgically edited row by row. The EDPB's 2025 coordinated enforcement action on the right to erasure found exactly this: controllers that never propagate deletion into backups, and controllers that let a restore silently resurrect deleted data. The decision rule:
+- **Live and queryable tiers** (primary, replicas, caches, indices, warehouse) erase synchronously and verifiably on request.
+- **Immutable backup tiers** either get **crypto-shredding** — encrypt each subject's data under a per-subject key and destroy the key, rendering the ciphertext unrecoverable without touching the snapshot — or rely on a **documented, bounded rotation window** during which any restore is guaranteed to re-apply pending deletions before the data becomes reachable again.
+A deletion that quietly leaves "just this one copy" — most often in a backup or a warehouse export — is a promise broken. So is calling something deleted when it has merely been weakly de-identified.
+### 5. Design for data residency
+Where data lives matters, both for regulation and for user expectation, and it is a design input to storage and pipeline choices — not an afterthought discovered during procurement.
+It is also widely misunderstood. The GDPR is **not** a data-localisation law: it does not require EU personal data to physically stay on EU soil. It bars *transfers* to a third country unless a lawful mechanism backs them — an adequacy decision (the EU–US Data Privacy Framework is one, declared adequate in 2023 and extended to the EEA), Standard Contractual Clauses, or Binding Corporate Rules. The decision rule: separate a genuine **localisation mandate** (a sectoral or sovereignty law that says the bytes must remain in-country — these are real but specific) from the far more common **transfer-safeguard requirement** (data may leave if a mechanism plus access controls are in place). Pick the storage region from the strictest obligation that actually applies to the data, not from folklore about "EU data can never leave the EU."
+### 6. PII is handled distinctly from content
+Email addresses, names, IPs: PII gets a shorter retention period and tighter access controls, and is kept apart from content wherever we can manage it. Treating all data the same turns the problems of the most sensitive fields into the problems of every field.
+The mechanism is separation: hold identifiers in a dedicated vault and reference them elsewhere by an opaque token (pseudonymisation). This shrinks the blast radius of any single store and makes crypto-shredding tractable — destroy the vault entry and the tokens dangle. But pseudonymisation is a risk-reduction tool, not an exit: under the GDPR, pseudonymised data is *still personal data* and stays fully in scope. Only true anonymisation leaves scope, and anonymisation is harder than it looks — coarse de-identification often remains re-identifiable by linkage. Do not let a tokenisation layer convince anyone the obligations have disappeared.
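The vault-and-token separation can be sketched in a few lines. This is an in-memory illustration only: the class `IdentityVault` and its methods are hypothetical, and a real vault would be a separately secured, access-logged store.

```python
import secrets


class IdentityVault:
    """Holds identifiers; every other store sees only opaque tokens."""

    def __init__(self):
        self._by_token = {}

    def tokenise(self, identifier):
        # Random token, not a hash: a hash of an email can be re-derived by guessing.
        token = secrets.token_urlsafe(16)
        self._by_token[token] = identifier
        return token

    def resolve(self, token):
        """Return the identifier, or None if the entry has been erased."""
        return self._by_token.get(token)

    def erase(self, token):
        # After this, tokens left in other stores dangle and resolve to nothing.
        self._by_token.pop(token, None)
```

Records that reference the token remain pseudonymised personal data for as long as the vault entry exists, so the vault narrows exposure without taking that data out of scope.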
+### 7. Model training is a lawful, transparent, and near-irreversible decision
+User data is used to train or evaluate models only on a lawful, recorded, and defensible basis. Consent is the cleanest basis, but it is often infeasible to obtain at the scale and retroactivity model training demands — a point the EDPB's Opinion 28/2024 on AI models makes directly. Where consent is not workable, **legitimate interest** can be a valid basis *if* it survives the three-step necessity-and-balancing test, the use is disclosed plainly, and users have a real, honoured opt-out. What is never defensible is silent training, or assuming consent because "everyone does."
+Treat the choice to train as **near-permanent**. A model can memorise and regurgitate its training data, and the EDPB has confirmed a trained model is not automatically anonymous. Machine unlearning does not reliably take it back: exact unlearning means retraining from scratch, and approximate methods are unproven and can degrade the model. The decision rule: pick and record the lawful basis *before* training, and never feed a model anything you could not defend keeping forever — because, in practice, training is keeping it forever.
+### 8. Privacy reviews happen before launch
+Every feature that touches user data has a privacy review before it ships — the same rhythm as a security review, often in the same meeting. The reviewer asks the specific questions a regulator or an investigative journalist would, and the answers go on the record. Where the processing is high-risk — large-scale sensitive data, profiling, or novel use of personal data — that review *is* a Data Protection Impact Assessment, which the GDPR makes mandatory rather than optional. "We will do the privacy review after launch" is a commitment that never gets honoured.
+## How we apply this
+- [Data Engineering](../system-design/data-engineering.md) — retention and contract discipline.
+- [Security](security.md) — the perimeter that privacy relies on.
+- [Postgres](../stack/postgres.md) — retention enforced at the storage layer.
+## Anti-patterns we reject
+- **"Privacy is the lawyers' job."** By the time the lawyers are involved, the damage is done. Privacy is an engineering discipline.
+- **Retention by default to forever.** Growing tables nobody cleans are ticking privacy incidents.
+- **Deletion that stops at the live database.** If the backup or the warehouse still has the row, the deletion did not happen. Plan the backup story before you promise erasure.
+- **Anonymisation theatre.** Calling weakly de-identified data "anonymous" or "deleted" when relinking is feasible — flagged repeatedly in EDPB enforcement — is a breach dressed as compliance.
+- **Development data scraped from production.** A dev environment with a sample of real user data is a breach waiting to be noticed.
+- **Analytics as a free pass.** "It is for analytics" is not a sufficient justification for collecting a piece of PII. The same bar applies.
+- **PII in logs.** Trace and log data routinely outlives the systems that produced it. PII does not belong there.
+- **Silent model training.** Training on user data without a recorded lawful basis, plain disclosure, and a real opt-out is not made acceptable by a sentence buried in a ToS.
+## Further reading
+- *GDPR* text and ICO guidance — the canonical European framework.
+- *CCPA/CPRA* — the Californian counterpart.
+- *Privacy by Design*, Ann Cavoukian — the foundational essay on baking privacy into architecture.
+- *Data Protection Impact Assessments* (ICO) — the practical model we use for privacy reviews.
+- EDPB *Opinion 28/2024* on data protection in AI models — lawful basis, legitimate interest, and model anonymity.
+- EDPB *2025 Coordinated Enforcement Framework report on the right to erasure* — backups, restores, and anonymisation-as-deletion.

package/src/docs/principles/quality/reliability.md ADDED

@@ -0,0 +1,89 @@
+---
+title: Reliability
+description: SRE fundamentals, graceful degradation, circuit breakers, and the design patterns that keep systems up under load and failure.
+status: active
+last_reviewed: 2026-06-19
+---
+# Reliability
+## TL;DR
+Reliability is not a feature we add after the system is built. It is a design property we pay for up front, measured in error budgets, defended by graceful-degradation patterns, and rehearsed through deliberate failure injection. Every significant service owns an SLO and lives inside the error budget it implies.
+## Why this matters
+Users do not experience "uptime percentages"; they experience "the thing I needed did not work just now." Reliability is the discipline of keeping that second experience rare enough that users learn to trust the platform. In a real-time product, unreliability compounds: a dropped request becomes a failed operation, a failed operation becomes a broken user journey. The cost of a small reliability failure is rarely proportional to its scope.
+## Our principles
+### 1. SLOs, not uptime percentages
+Every significant service defines a Service Level Objective — a per-endpoint or per-user-journey target with a latency and a success-rate component, measured over a rolling window. "99.9% uptime" is not an SLO; "p95 `POST /resource` < 300ms over 30 days, 99.5% success" is. SLOs are the measurement surface for everything else on this page.
+### 2. Error budgets govern velocity
+The budget implied by the SLO — the allowed volume of "bad" events — is a spendable resource. Teams spending below budget ship riskier changes and run experiments; teams that exhaust it pause feature work and pay down reliability debt. This inversion — reliability as a gate on velocity rather than a tax on top of it — is what makes SLOs operationally real.
+The honest failure mode is enforcement. A budget that any team can override under deadline pressure is a dashboard, not a policy. The gate only bites if it is pre-negotiated in writing before the budget is spent: product, engineering, and on-call agree the consequence of exhaustion, name the person empowered to declare and lift a freeze, and define a small, explicit set of "silver bullet" exceptions for genuinely business-critical launches. Decision rule: if you cannot name who enforces the freeze and what the exceptions are, you do not have an error-budget policy — you have an SLO with a graph next to it.
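The budget arithmetic behind that policy is short enough to write down. This sketch assumes an event-based SLO (good events over total events in the window); the function name `error_budget` and the traffic figures are illustrative.

```python
def error_budget(slo_success, total_events, bad_events):
    """Return (allowed bad events, remaining budget, fraction of budget burned)."""
    allowed = (1.0 - slo_success) * total_events
    remaining = allowed - bad_events
    burned = bad_events / allowed
    return allowed, remaining, burned


# A 99.5% success SLO over a window with 1,000,000 requests, 1,250 of them bad.
allowed, remaining, burned = error_budget(0.995, 1_000_000, 1_250)
```

In this example the window allows about 5,000 bad events and a quarter of the budget is spent; a `burned` value at or above 1.0 is the condition the pre-negotiated freeze would key off.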
+### 3. Graceful degradation is a design, not a hope
+Every user-facing feature has a defined behaviour when its downstream fails. A view without synthesis data still renders — the panel shows a "not yet ready" state. A pipeline without a model client enqueues and returns when it can. Degradation is decided at design time and implemented alongside the happy path, never "we will figure out what to show later."
+### 4. Timeouts, retries, and load shedding are the defaults; circuit breakers are not automatic
+Every outbound call has a timeout. Every retry has a bounded policy with full jitter. These are non-negotiable and set in a shared library so a new service inherits them ([Integration Patterns](../system-design/integration-patterns.md)); opting out requires a written reason.
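A bounded retry policy with full jitter might look like the sketch below. It assumes exponential backoff with a cap, and the function names, default values, and the choice of retryable exceptions are illustrative assumptions rather than the shared library's actual API.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                      rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based), in seconds.

    Full jitter: pick uniformly between 0 and the capped exponential
    ceiling, so clients that failed together do not retry together.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `fn` at most `max_attempts` times, sleeping with full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # policy exhausted: surface the failure
            sleep(full_jitter_delay(attempt, rng=rng))
    raise AssertionError("unreachable")
```

The timeout itself belongs inside `fn` (for example as the client's per-request timeout); this sketch only bounds how many times and how soon the call is repeated.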
+The contested part is what sits on top. Retries amplify: a request retried at every layer of a call chain produces attempts equal to the *product* of the per-layer counts, so a small downstream blip becomes a self-inflicted DDoS. Two rules contain this. Retry at exactly one layer of the stack, not at every hop. And cap retries with a shared retry budget — a token bucket where successes refill and retries spend, so retries stop automatically once a downstream is failing (this is how gRPC's retry throttling and the AWS SDK adaptive-retry mode work). Per-call backoff alone does not bound aggregate load; the budget does.
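Both halves of that argument fit in a short sketch: the multiplication that makes layered retries dangerous, and a token-bucket retry budget that stops retries once successes dry up. The class below is a simplified illustration with assumed parameter names and defaults; it is not the exact algorithm used by gRPC or the AWS SDK, and it is not thread-safe as written.

```python
import math
from typing import Sequence


def worst_case_attempts(attempts_per_layer: Sequence[int]) -> int:
    """Attempts reaching the bottom dependency when every layer retries."""
    return math.prod(attempts_per_layer)


class RetryBudget:
    """Token bucket shared by all calls to one downstream.

    Successes refill the bucket a little; each retry spends from it.
    When the downstream is failing there are no successes, so the
    bucket drains and retries stop on their own.
    """

    def __init__(self, max_tokens: float = 10.0, retry_cost: float = 1.0,
                 success_refill: float = 0.1) -> None:
        self.max_tokens = max_tokens
        self.retry_cost = retry_cost
        self.success_refill = success_refill
        self.tokens = max_tokens

    def record_success(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.success_refill)

    def try_spend_retry(self) -> bool:
        """Return True and spend a token if a retry is currently allowed."""
        if self.tokens < self.retry_cost:
            return False
        self.tokens -= self.retry_cost
        return True
```

Three layers that each make three attempts produce up to 27 calls against the lowest dependency for one user request, which is the case for retrying at a single layer. With the defaults above, a healthy downstream earns roughly one retry per ten successes, so retries stay a small fraction of total traffic.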
+Circuit breakers are widely prescribed as the default backstop. They are not automatic here. A binary client-side breaker, estimated locally by each of many small or short-lived clients, trips on noisy local samples and can make a partial outage worse by cutting off capacity that was still serving — Marc Brooker's simulations show distributed breakers tripping far too early. Prefer the token-bucket / adaptive throttle, which degrades smoothly instead of snapping fully open. Reach for a real circuit breaker when a downstream fails *slowly* (the failure mode is timeout exhaustion, not fast error responses) or when a cheap local fallback exists — and tune its thresholds against measured traffic, never the library defaults.
+Decision rule: timeout always; retry at one layer with jitter and a shared budget; reach for a circuit breaker only against slow/hanging dependencies or where a cheap fallback exists; and treat server-side load shedding as the backstop you actually trust, because it protects the server regardless of whether every client is well-behaved.
+### 5. Isolate blast radius
+A single tenant, a single user, or a single noisy consumer must not be able to degrade the experience for everyone else. We isolate by quota (per-tenant rate limits), by resource (dedicated queues for hot workloads), and by bulkhead (separate worker pools for separate work types). The design question is always: "if this goes bad, who else is affected?" — and the answer we aim for is "only the thing that went bad."
+### 6. Rehearse failure
+Chaos engineering is a practice, not an event. We inject failures — killed pods, degraded networks, slow databases — to surface the reliability assumptions we are making without knowing it. The point is not to "test if chaos works"; it is to find the dependency we forgot was load-bearing before an incident finds it for us.
+Where you inject is a real trade-off, not a slogan. Production is where the signal lives — staging differs in traffic shape, data volume, and dependency topology, so a system that passes every staging experiment can still fall over in production, and a clean staging run buys false confidence. But you do not earn production chaos for free: the precondition is observability good enough to see the blast as it lands and an automated stop that aborts the experiment the moment a real SLO starts to burn. Decision rule: start in staging to shake out the obvious, but treat the experiment as incomplete until it has run in production behind a bounded blast radius and an automatic abort. If you cannot safely abort, you are not ready to inject.
+### 7. Alerts fire on user impact, not on mechanism
+We alert when users are affected — SLO burn rate, error-rate spikes on user journeys — not when a server has 80% CPU. Pages that fire on mechanism without user impact teach on-call to ignore pages, which is how a real incident gets missed.
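A burn-rate page condition can be sketched as follows. The 14.4 threshold is the figure the SRE Workbook pairs with a one-hour long window and a five-minute short window on a 30-day SLO (2% of the budget burned in an hour); the function names and the way error ratios are supplied are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.

    A burn rate of 1.0 spends the whole budget exactly at the end of
    the SLO window; 14.4 spends 2% of a 30-day budget in one hour.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window shows the burn is significant; the short window
    shows it is still happening, so the page clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Note that nothing here looks at CPU, memory, or any other mechanism: the only inputs are the user-facing error ratios and the SLO they are measured against.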
+### 8. Every incident teaches a specific lesson
+Post-incident, we write a blameless postmortem that names the specific reliability assumption the incident invalidated and proposes the specific change that would have caught it. We do not write "be more careful" as an action item. We do not write "add more monitoring" without specifying the signal. The goal is one concrete, closable ticket per incident, enforceable and measurable.
+### 9. Cells, living SLOs, and semantic failure
+Blast-radius isolation generalises at scale to **cell-based architecture** — independent cells, each serving a slice of users, so a failure is contained to one cell rather than the fleet. SLOs are hypotheses reviewed against burn (multi-window, multi-burn-rate alerting), not contracts carved once and forgotten. And a model in the loop fails differently: a wrong answer returns 200 OK — valid, on time, and confidently incorrect — so latency and error-rate SLIs miss it entirely. AI features therefore carry a **per-SLI accuracy/consistency budget** distinct from latency, and the model provider is treated as the least-reliable dependency in the chain, with a defined degraded behaviour for when it is slow, wrong, or down.
+## How we apply this
+- [Observability](observability.md) — the measurement layer that makes SLOs possible.
+- [Performance](performance.md) — the tail-latency discipline that sits inside reliability.
+- [Integration Patterns](../system-design/integration-patterns.md) — the concrete patterns (timeouts, circuit breakers) we apply.
+## Anti-patterns we reject
+- **"99.999% uptime" as a target.** Five-nines for a non-core service is a reckless budget. Set an SLO the team can defend.
+- **Retries without policies.** Retry-forever is a self-inflicted DDoS.
+- **Retries at every layer.** Retrying at each hop multiplies one user request into a retry storm. Retry at one layer, and budget it.
+- **Circuit breakers as a reflex.** A binary breaker on library defaults, copied into every client, trips on noise and can deepen a partial outage. Earn it, tune it against real traffic, and prefer adaptive throttling.
+- **Mechanism alerts.** Paging on CPU, memory, or disk without tying it to a user-impact signal. Noise.
+- **"It has not failed yet."** A failure mode you have not observed is not a failure mode that does not exist. Rehearse.
+- **Postmortems that blame humans.** A system that depends on everyone being perfect will fail. The action item is the system fix, not the person lecture.
+- **SLOs nobody tracks.** An SLO without a dashboard and a burn-rate alert is theatre.
+## Further reading
+- *Site Reliability Engineering*, Beyer et al. (the Google SRE book) — the canonical text for SLOs, error budgets, and the operational stance; the "Addressing Cascading Failures" and "Handling Overload" chapters cover retry budgets and client-side throttling.
+- *The Site Reliability Workbook* — the practical companion to the SRE book; more actionable, including the error-budget policy template.
+- *Release It!*, Michael Nygard — the stability-patterns bible (timeouts, bulkheads, the original circuit breaker).
+- *Chaos Engineering*, Rosenthal & Jones — the current state of rehearsed-failure practice.
+- *Amazon Builders' Library* — "Timeouts, retries, and backoff with jitter" (Marc Brooker) and "Using load shedding to avoid overload" (David Yanacek): the load-and-overload patterns this page leans on.
+- Marc Brooker, ["Fixing retries with token buckets and circuit breakers"](https://brooker.co.za/blog/2022/02/28/retries.html) — why distributed circuit breakers misfire and what to reach for instead.

package/src/docs/principles/quality/security.md ADDED

@@ -0,0 +1,78 @@
+---
+title: Security
+description: Zero-trust, threat modeling, SLSA supply-chain integrity, and the secure SDLC.
+status: active
+last_reviewed: 2026-06-19
+---
+# Security
+## TL;DR
+Security is every engineer's job, every day. We treat every service as untrusted, every dependency as a supply-chain risk, every input as hostile, and every secret as already-compromised unless we can prove otherwise. The goal is not zero risk — it is a system that stays standing when any single control fails.
+## Why this matters
+When a platform handles sensitive user data, a security incident is not an inconvenience — it is a breach of the trust users place in the system. Security is the baseline that every other quality concern rests on. A system that is reliable but exploitable is not reliable.
+## Our principles
+### 1. Zero trust between services
+Services authenticate each other on every request. No "internal" network is trusted implicitly; every call carries an identity, every identity is authorised per operation. The concrete mechanism is **workload identity** — short-lived, auto-rotated credentials and mTLS established at the platform layer with no secret in application code; machine identity is the new perimeter. The breach-resistance argument is simple — if an attacker pivots into one service, they do not inherit the blast radius of the entire system. The mechanism scales to the system: SPIFFE/SPIRE issuing auto-rotated SVIDs is the full-control answer, but a managed mesh or signed service tokens from a standard IdP buy most of the breach-resistance for a fraction of the operating cost. The non-negotiable is that identity travels with every call and is verified there — not which issuer mints it. Choose by blast radius, not by fashion.
+### 2. Threat model the change, not just the product
+Every significant change asks the security question before the design is signed off: who could misuse this, and how? A new endpoint, a new data field, a new integration — each gets a five-minute threat conversation. This is cheap upfront and catches most of the issues that would otherwise be found in a pen test or, worse, in production.
+### 3. Secrets are managed, rotated, and audited
+No secret lives in source. The hierarchy is eliminate, then shorten, then rotate. The best secret is no secret: wherever principle 1's workload identity or OIDC federation reaches, there is no static credential to leak. Where a credential is unavoidable, prefer **dynamic, short-lived** secrets — minted per session with a TTL in minutes — over a long-lived value on a rotation calendar. Scheduled rotation of a static secret is closer to theatre than control: an attacker abuses a leaked credential in minutes, not at the next quarterly cycle, so a 90-day rotation bounds nothing that matters. Reserve scheduled rotation for the static credentials that genuinely cannot be made ephemeral. Whatever survives lives in a secret manager, is fetched at runtime, and has every access audited — so the damage window is bounded by the TTL, not by a calendar.
+### 4. Input is hostile; validate at the boundary
+Every piece of input at a trust boundary is validated: request bodies, webhook payloads, message queue events, model outputs. Inside the trust boundary we trust our own types and do not repeat the checks ([Code Craft](../foundations/code-craft.md)). The discipline is that the boundary is explicit and every crossing is scrutinised.
+### 5. Supply chain is part of our attack surface
+Every third-party dependency is a potential exploit vector. We pin versions, review new dependencies before adoption, and scan on every build. Beyond the SBOM (what is inside) we emit **provenance** (where it came from): artifacts are signed with Sigstore/cosign and ship signed build attestations expressed as SLSA build levels. The target is SLSA Build L3 (a hardened, isolated build platform that signs its own provenance) for anything we publish, and at least L1 provenance on everything built internally — L3 is what makes provenance non-forgeable, so it is the level worth paying for. A dependency added without review is a back door added without review.
+### 6. Least privilege by default
+Every service, every database role, every cloud identity starts with the minimum permissions it needs and is extended only on evidence. "Give it admin and fix it later" is a decision with a lifetime of never. IAM policies, database roles, and credential scopes are reviewed in the same way code is reviewed.
+### 7. Auth is boring technology
+We do not invent auth. Proven auth providers handle user authentication — OIDC for federation, passkeys/WebAuthn as the phishing-resistant default rather than passwords plus OTP; service-to-service auth uses short-lived tokens from a standard identity provider; session storage follows the OWASP guidance for the context. Exotic auth is how a team learns about auth vulnerabilities the hard way.
+### 8. Detect and respond, not just prevent
+Assume prevention will sometimes fail. We log security-relevant events, alert on suspicious patterns, and run incident-response tabletops so the team knows what to do when something happens. Detection that arrives after the incident is cleaned up is not detection.
+### 9. The model is an attack surface
+A model in the system widens the threat model in ways classic AppSec misses. **Prompt injection** has led the OWASP LLM risks since the list began and is structural, not a bug awaiting a patch: the model mixes instructions and data in one channel, and the injection arrives indirectly through retrieved content, tool outputs, and other agents (it propagates across co-running agents). Treat it as unsolved: there is no method that blocks it 100%, and a guardrail advertising 95% is handing the other 5% to a motivated attacker. So we contain rather than cure, and the containment is architectural. The design-time decision rule is the **lethal trifecta** (Willison) / **Agents Rule of Two** (Meta): an agent acting autonomously may hold at most two of {processes untrusted input, accesses private data or sensitive systems, can change state or communicate externally}. An agent that needs all three does not run unsupervised; it gets a human in the loop, or a fresh and reliably-validated context, before it acts. Underneath that rule: give non-human actors their own identity and per-action tool authorization, treat a tool/MCP catalogue as an execution surface to threat-model rather than an API, and remember that output validation alone is not a defence: excessive agency is the risk, and constraining that agency is the architectural control.
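The at-most-two rule is simple enough to enforce as a design-time check. The sketch below assumes each agent declares its capabilities up front; the class and field names are hypothetical and chosen only to mirror the three properties named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    processes_untrusted_input: bool
    accesses_private_data: bool          # or sensitive systems
    changes_state_or_communicates: bool  # writes, sends, calls out


def may_run_unsupervised(caps: AgentCapabilities) -> bool:
    """True if the agent holds at most two of the three risky properties."""
    held = sum((
        caps.processes_untrusted_input,
        caps.accesses_private_data,
        caps.changes_state_or_communicates,
    ))
    return held <= 2


# A summariser that reads the open web and private notes but cannot act.
reader = AgentCapabilities(True, True, False)
# The same agent once it is also allowed to send email.
sender = AgentCapabilities(True, True, True)
```

Here `reader` passes and `sender` does not; under the rule, `sender` needs a human in the loop or a fresh, validated context before it acts. A check like this guards the declared design only, so it complements per-action tool authorization rather than replacing it.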
+## How we apply this
+- [Privacy](privacy.md) — the handling of regulated data sits inside the security perimeter.
+- [Reliability](reliability.md) — stability and security share a lot of failure-mode vocabulary.
+- [API Design](../system-design/api-design.md) — signed webhooks, idempotency keys, and structured errors that do not leak internals.
+## Anti-patterns we reject
+- **Internal network = trusted.** This is the assumption every modern breach exploits.
+- **Secrets in environment variables checked into Git.** Use the secret manager. Always.
+- **"It is an internal tool, we can skip auth."** Internal tools are an attacker's favourite foothold.
+- **Dependencies pulled in on intuition.** A package with 12 stars, no maintainer, and a vague promise is a supply-chain risk.
+- **Exotic auth.** Custom JWT handling, custom session cookies, custom MFA flows. Use the standard, battle-tested thing.
+- **"The WAF will catch it."** A web application firewall is a last layer. Primary defence is correct code.
+## Further reading
+- *The Tangled Web*, Michal Zalewski — the canonical tour of web-security oddness.
+- *The Web Application Hacker's Handbook*, Stuttard & Pinto — read once to know what you are defending against.
+- *OWASP Top 10* — the catalogue of vulnerabilities every web engineer must know.
+- *SLSA Framework* ([slsa.dev](https://slsa.dev)) — the supply-chain integrity ladder.
+- *Zero Trust Architecture*, NIST SP 800-207 — the canonical definition.
+- *OWASP Top 10 for LLM Applications (2025)*: prompt injection leads the list (LLM01), and excessive agency is on it as well (LLM06).
+- *The lethal trifecta* (Simon Willison) and *Agents Rule of Two* (Meta) — the design rules that bound agent authority.

package/src/docs/principles/stack/postgres.md ADDED

@@ -0,0 +1,100 @@
+---
+title: Postgres
+description: Schema design, primary keys, JSONB, expand-contract migrations, indexing, connection pooling, queues, and pgvector as a production vector store.
+status: active
+last_reviewed: 2026-06-19
+---
+# Postgres
+## TL;DR
+Postgres is the canonical data store for every service that needs persistence. We design schemas explicitly, choose primary keys deliberately, migrate with the expand-contract pattern, index from evidence, pool connections, and use `pgvector` as our vector store. When the question is "which database?", the answer is Postgres unless we have a specific, written reason it cannot be.
+## Why this matters
+Every additional datastore in a system is a multiplier on operational complexity: another backup story, another failure mode, another skill profile to hire for, another surface to monitor. Postgres is a remarkable outlier — it does relational, JSONB document storage, full-text search, queueing, and vector similarity well enough that most workloads never need another engine. Committing to it as a default keeps the operational surface small and the engineers productive.
+## Our principles
+### 1. Schema design is a design document
+Every new table begins with a schema design: what does it represent, what identifies it, what are the invariants, what queries does it need to support, what retention does it live under. This is not a formality — schema shape is the contract that outlives any service that reads or writes the table ([Data Engineering](../system-design/data-engineering.md)). Push invariants into the schema, not just the application: `NOT NULL`, `CHECK`, `FOREIGN KEY`, and `UNIQUE` constraints are enforced by the one component every writer shares. An invariant that lives only in application code is an invariant that some other writer will violate.
+### 2. Prefer columns to JSONB for stable shape
+JSONB is powerful but it is not a replacement for column design. When a field is present on every row, queried often, or stable in meaning, it belongs in a column — columns get typed constraints, foreign keys, cheap statistics, and B-tree indexes the planner reasons about well. JSONB is the right call when the shape genuinely varies per row, is rarely filtered on, or is a bag of external metadata you store but do not own. When you do query inside JSONB, index it with GIN (or an expression index on the specific path you filter), and remember that you have traded away the constraint enforcement a column would have given you. The default is columns.
+### 3. Schema changes follow expand-contract; recovery is roll-forward
+Backwards-incompatible change is the source of migration outages, so we never do it in one step. Every change uses expand-contract (parallel change): **expand** the schema with the new, compatible shape; deploy code that writes both old and new; **backfill** existing rows in batched background jobs, never in the migration transaction; cut reads over to the new shape; then **contract** by dropping the old shape once nothing references it. "Migrations are additive" is the easy half of this — the discipline is sequencing the destructive contract step so it lands after every reader and writer has moved.
+Two rules make this safe in production, and both are non-obvious:
+- **Set `lock_timeout` (and a `statement_timeout`) on the migration connection.** A bare `ALTER TABLE` queues behind any in-flight query holding a conflicting lock, and every request arriving after it then queues behind the `ALTER` — one slow query becomes a full-table stall. A short `lock_timeout` (a few seconds) makes the migration fail fast and retry instead of cascading into an outage.
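+- **Build indexes `CONCURRENTLY`, and add constraints as `NOT VALID` then `VALIDATE` them in a separate step.** These avoid the long-held, write-blocking lock that the naive form takes: `SHARE` for a plain `CREATE INDEX`, and up to `ACCESS EXCLUSIVE` for an `ADD CONSTRAINT` that validates in place.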
+The contested zone is rollback. "Every migration has a pre-written down migration" sounds rigorous but is mostly theatre: in production, a down migration that reverses a data-bearing change either cannot run without losing data or has never been exercised under load. Our decision rule: **recovery in production is roll-forward**. You ship a new migration that corrects the problem, because expand-contract has kept the previous shape live and compatible the whole time. Keep a tested down path for the local and CI loop, and only for changes that are provably reversible without data loss. For tables too large for an in-place `ALTER`, reach for a tool built for the job (`pgroll`, `pg_osc`) rather than hand-rolling shadow tables.
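The backfill step of expand-contract can be sketched as a loop over primary-key ranges, one short transaction per batch. This is an illustration under assumptions: the `accounts` table and its `email` / `email_normalized` columns are hypothetical, and `conn` is assumed to be a psycopg-style connection (not in autocommit mode) whose cursors work as context managers.

```python
from typing import Iterator, Tuple

# Hypothetical backfill: populate the new column from the old one.
BACKFILL_SQL = (
    "UPDATE accounts SET email_normalized = lower(email) "
    "WHERE id BETWEEN %s AND %s AND email_normalized IS NULL"
)


def id_batches(min_id: int, max_id: int,
               batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (lo, hi) key ranges covering [min_id, max_id]."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1


def backfill(conn, min_id: int, max_id: int, batch_size: int = 5_000) -> None:
    """Run the backfill one small transaction at a time."""
    for lo, hi in id_batches(min_id, max_id, batch_size):
        with conn.cursor() as cur:
            # Fail fast instead of queueing behind a conflicting lock.
            cur.execute("SET LOCAL lock_timeout = '3s'")
            cur.execute(BACKFILL_SQL, (lo, hi))
        conn.commit()  # release row locks before the next batch
```

The `IS NULL` guard makes each batch safe to re-run, so a batch that hits the lock timeout can simply be retried. Batch size and timeout here are placeholders to tune against the real table.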
+### 4. Indexes are evidence-based
+Most indexes are justified by a query pattern backed by real production data — `pg_stat_user_indexes` and `pg_stat_statements` tell us which queries are hot and which indexes are paying their cost. Unused indexes cost write throughput and disk; we remove them. Speculative indexes "in case we need them later" are the opposite of the principle.
+The honest exception: some indexes are required at table creation, before any production traffic exists. Unique constraints are indexes you cannot defer. Postgres does not automatically index the referencing side of a foreign key, and an unindexed FK turns every parent delete, or update of the referenced key, into a full scan of the child, so index the referencing columns up front. Beyond that, reach for the specific index the query needs, not the generic one: partial indexes for queries that always carry the same filter, covering indexes (`INCLUDE`) to serve index-only scans, expression indexes for computed predicates. Always build them `CONCURRENTLY` on a live table.
+### 5. `pgvector` is our vector store — to a threshold we name
+Semantic search, embedding similarity, RAG retrieval — all of this runs on `pgvector` in the same Postgres cluster as relational data. The payoff is real and specific: vectors live in the same transaction as the rows they describe, so you filter, join, and keep them consistent without a second system to sync, back up, and reconcile.
+This is a default, not a law, and the dishonest version of it ignores scale. Vanilla `pgvector` with an HNSW index serves low-latency, high-recall queries comfortably into the low millions of vectors, while the index fits in RAM; performance degrades as the dataset outgrows memory. The decision rule by scale:
+- **Up to a few million vectors:** `pgvector` + HNSW. No argument.
+- **Tens of millions:** stay in Postgres but switch to `pgvectorscale` (StreamingDiskANN), which keeps the index on disk and holds high QPS at high recall well past where in-memory HNSW falls over.
+- **Hundreds of millions and beyond, or hard requirements `pgvector` does not serve** (extreme-scale sharding, specialized hybrid-filtering engines): that is the written reason to run a dedicated vector store. The data and the requirement will make the case; until they do, the second system is unbought complexity.
+### 6. Connection management is explicit, and the pooler is the real answer
+A Postgres backend is a full OS process with a meaningful memory footprint, so the server tops out at low thousands of connections regardless of how big the box is. The standard architecture is a transaction-mode pooler (PgBouncer or Supavisor) in front of the database: hundreds or thousands of client connections multiplexed onto a small set of server connections, each held only for the duration of a transaction. Size the server-side pool to the database's capacity (a small multiple of CPU cores), not to the number of application instances.
+Transaction mode is the right default but it forbids session-scoped state — session-level `SET`, advisory locks, and `LISTEN`/`NOTIFY` break across pooled transactions; isolate those on a session-mode connection. Every service still sets explicit per-connection limits, idle timeouts, and a `statement_timeout`. "Just use the defaults" is how Postgres gets hammered into `too many connections` under load. Postgres is a shared resource; treat it like one.
+### 7. Query patterns are reviewed
+Every new query is reviewed for plan shape, not just correctness. `EXPLAIN (ANALYZE, BUFFERS)` on representative data is part of the PR for any non-trivial query. N+1 queries, full-table scans, and unbounded `IN` lists are caught in review, not in production.
+### 8. Backups, retention, and disaster recovery are not afterthoughts
+Automated backups run with RPO and RTO targets that the business has signed off on. We test restores — a backup we have never restored is not a backup. Retention policies are set per table at creation time and aligned with the privacy policy ([Privacy](../quality/privacy.md)).
+### 9. Primary keys are a deliberate choice, never UUIDv4
+The default is a `bigint GENERATED ALWAYS AS IDENTITY` key: compact, sequential, cache-friendly, and ideal for internal tables that never leave the cluster. Choose a UUID instead when the key is generated by the client or across distributed nodes, exposed in URLs or public APIs (where a guessable sequential id leaks volume and ordering), or merged across systems that must not collide.
+When you do reach for a UUID, use **UUIDv7**, never UUIDv4. A v4 key is random, so inserts scatter across the B-tree, fragmenting the index and inflating write amplification and WAL. UUIDv7 is time-ordered: it keeps the global-uniqueness and distributed-generation benefits while restoring the sequential insert locality that makes `bigint` fast — close enough to identity-key performance that the gap stops being a reason to avoid it. Postgres 18+ ships a native `uuidv7()`; on earlier versions generate it in the application or via an extension. The remaining honest cost is size — 16 bytes versus 8 — which multiplies across every secondary index that carries the key, so do not pay it without one of the reasons above.
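For Postgres versions without a native `uuidv7()`, application-side generation can be sketched directly from the RFC 9562 bit layout: a 48-bit Unix millisecond timestamp, the version nibble, then random bits around the variant marker. The helper below is a minimal illustration rather than a vetted library; its `unix_ms` and `rand` parameters exist only to make it testable.

```python
import os
import time
import uuid
from typing import Optional


def uuid7(unix_ms: Optional[int] = None,
          rand: Optional[bytes] = None) -> uuid.UUID:
    """Build a UUIDv7: 48-bit ms timestamp, version 7, 74 random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    if rand is None:
        rand = os.urandom(10)  # 80 random bits; 74 of them are used
    rand_int = int.from_bytes(rand, "big")
    rand_a = (rand_int >> 62) & 0xFFF      # 12 bits after the version
    rand_b = rand_int & ((1 << 62) - 1)    # 62 bits after the variant

    value = (unix_ms & ((1 << 48) - 1)) << 80  # timestamp in the top bits
    value |= 0x7 << 76                         # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                        # RFC variant
    value |= rand_b
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, keys generated later sort later, which is what restores insert locality in the B-tree. Two keys minted within the same millisecond are ordered only by their random bits in this sketch; RFC 9562 describes optional counter schemes for stricter monotonicity.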
+## How we apply this
+- [Data Engineering](../system-design/data-engineering.md) — the broader treatment of data contracts.
+- [Privacy](../quality/privacy.md) — the rules that shape retention and residency.
+## Anti-patterns we reject
+- **JSONB-everything.** Not a schema; a confession of avoided design.
+- **UUIDv4 primary keys.** Random keys fragment the index and tax every write. Use `bigint` by default, UUIDv7 when you need a UUID.
+- **Indexes "just in case."** Every index is a write tax; justify it from a query or remove it — with the narrow exception of unique constraints and foreign-key indexes, which are required up front.
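+- **Migrations that lock a hot table.** `ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT` on a 10M-row table with no `lock_timeout`. Since Postgres 11 a constant default no longer rewrites the table, but the statement still needs an `ACCESS EXCLUSIVE` lock, and a volatile default still rewrites every row. Add the column nullable, backfill in batches, then tighten, and fail fast on a lock you cannot get.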
+- **Blind down migrations as a production safety net.** They are rarely exercised and often lossy. Expand-contract plus roll-forward is the real recovery story.
+- **Raw string interpolation into queries.** Parameterised queries, always. This is a security rule ([Security](../quality/security.md)) and a clarity rule.
+- **A second database "just because."** Adding Redis, DynamoDB, or a dedicated vector store without a specific, documented need Postgres cannot meet. Most of the time, Postgres can.
+### On using Postgres as a queue
+The reflexive "never use the database as a queue" is dated. `SELECT ... FOR UPDATE SKIP LOCKED` gives Postgres a correct, contention-free work queue, and for low-to-moderate throughput (roughly to the low thousands of jobs per second) a Postgres queue — raw `SKIP LOCKED`, or a mature layer like `pgmq` or Oban — is often the *right* call precisely because it honours the "no second datastore" principle: jobs are enqueued in the same transaction that creates the work, so you get exactly-once-with-the-write semantics for free, with one backup and one failure mode instead of two.
+The decision rule: **reach for a dedicated broker when the workload outgrows what a table does well** — sustained high throughput, fan-out to many consumers, streaming and replay, or strict ordered partitions (Kafka territory). And respect the one operational tax that is real: a `SKIP LOCKED` queue churns dead tuples, so it lives or dies by autovacuum — tune aggressive autovacuum on the queue table, or partition it, before load finds the bloat for you.
+## Further reading
+- *PostgreSQL: Up and Running*, Obe & Hsu — a practical, current reference.
+- *The Art of PostgreSQL*, Dimitri Fontaine — advanced patterns with a teaching bent.
+- *Designing Data-Intensive Applications*, Martin Kleppmann — the systems-level argument for relational-as-default.
+- *pgvector documentation* ([github.com/pgvector/pgvector](https://github.com/pgvector/pgvector)) — the canonical source for vector index strategies.