npm - groundwork-method - Versions diffs - 0.0.1 → 0.10.0 - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (629)

package/src/docs/principles/ai-native/agentic-systems.md ADDED

@@ -0,0 +1,78 @@
+---
+title: Agentic Systems
+description: Architecting systems where AI agents are first-class actors — topology, interop protocols, context and memory, durable execution, guardrails, and human oversight.
+status: active
+last_reviewed: 2026-06-19
+---
+# Agentic Systems
+## TL;DR
+When an AI agent is an actor in the system — planning, calling tools, and acting over many steps — it is a distributed system with a non-deterministic core, and it must be architected like one. We default to a single agent that owns its context and delegates only stateless, read-only fan-out; we treat the context window as the scarce resource; we make long-running agents durable so they resume rather than restart; and we put every agent behind guardrails, an identity, and a human review point sized to the stakes. Agentic capability is designed in, not prompted in.
+## Why this matters
+The gap between an agent demo and an agent in production is the same gap as between a script and a distributed system: retries, partial failure, shared-state contention, unbounded loops, and an adversary in the input. The teams that ship reliable agents are not the ones with the best prompts — they are the ones who recognised that an autonomous loop calling real tools is infrastructure, and gave it the structure infrastructure needs. Most agent failures are system failures, not model failures, and they are designed out at the architecture stage or paid for in production.
+## Our principles
+### 1. Single agent first; multi-agent only when isolation pays
+The default is one agent that owns the full context of a task. When work fans out, it spawns **stateless, read-only sub-agents** for isolated retrieval or analysis and folds their results back into its own context. The 2025 standoff between Cognition's *Don't Build Multi-Agents* and Anthropic's *multi-agent research system* read like a contradiction and was not: the dividing line is **read versus write**. Sub-agents that only read — breadth-first research, retrieval across disjoint sources, independent analysis that exceeds one context window — compound capability; agents that write interdependent decisions onto shared mutable state compound error (conflicting actions, lost context, dispersed framing). Cognition's own *Devin manages Devins* later shipped a coordinator over isolated workers once the work was cleanly separable — the same line drawn from the other side. So the decision rule: **fan out only when sub-tasks are independent and read-mostly, and keep exactly one agent holding the pen.** Fan-out also has a price — Anthropic measured its research swarm at roughly 15× the tokens of a single chat (single agents at ~4×) — so the separable sub-problem has to be worth that bill. Supervisor/worker and handoff topologies are tools for genuinely separable work, not a default; under an equal token budget a single well-structured agent usually beats a swarm.
+### 2. Interop is a protocol stack — adopt by maturity, hide behind ports
+Agents reach the world through standard protocols, not bespoke glue: **MCP** for tools and data (the model's hands), **A2A** for agent-to-agent delegation across a trust, vendor, or framework boundary (the model's colleagues), and **AG-UI** for the agent↔interface event stream (the model's face). These sit at different maturities and the field has not converged — MCP is effectively universal (adopted by every major provider, governed under the Linux Foundation's Agentic AI Foundation), while A2A reached a v1.0 only recently and a wider set of contenders is still shaking out. So **adopt by maturity**: take MCP now; reach for A2A only when you genuinely cross an org, vendor, or framework boundary — inside a single system a function call or the agents-as-tools pattern beats a network protocol and its failure modes. Whatever you adopt, keep it behind your own ports so an unsettled protocol stays replaceable — the same reason every other interface in the system is a contract.
+### 3. Context is the scarce resource; engineer it
+The contents of the context window are the single biggest lever on agent behaviour, and the window is finite. We curate it deliberately — the right system prompt, the right retrieved facts, the right tool results — and we manage its lifecycle with the right tool for the pressure: **compaction** (summarise and re-initialise as the window fills) when the history must round-trip, **offloading** to memory or the file system when state must persist but need not stay resident, and **sub-agent isolation** when a sub-task's tokens should never touch the main thread at all. We clear stale tool output and never dump "everything relevant" in — that both raises cost and *lowers* quality, because irrelevant tokens degrade retrieval inside the window. Context engineering is the core discipline now; prompt wording still matters (the compaction prompt itself must be tuned for recall), but it is one input to context engineering, not the whole game.
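A minimal sketch of the compaction step described above, under stated assumptions: `estimate_tokens` is a crude stand-in for a real tokenizer, and `summarize` is a hypothetical placeholder for a model call driven by a tuned compaction prompt. Neither name comes from the package.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly one token per four characters; use a real tokenizer in practice.
    return max(1, len(text) // 4)


def summarize(messages: list[Message]) -> str:
    # Hypothetical placeholder: a real system would call a model here.
    return "Summary of %d earlier messages" % len(messages)


def compact(history: list[Message], budget: int, keep_recent: int = 2) -> list[Message]:
    """Replace older messages with one summary once the window exceeds the budget."""
    total = sum(estimate_tokens(m.text) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history  # still under budget: leave the window untouched
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [Message("system", summarize(older))] + recent
```

Keeping the most recent turns verbatim while summarising the rest is one possible policy; offloading or sub-agent isolation would replace `summarize` with a write to external storage or a separate context.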
+### 4. Memory is a designed, tiered system
+An agent's memory is architecture, not an afterthought: **working memory** (the live context), **long-term memory** (durable facts and preferences, retrieved on demand), and **vector memory** (semantic recall over past interactions and knowledge). Each tier has an explicit write policy, retention, and retrieval path — and that write policy is a **trust boundary**, not just a cache rule: anything an agent persists can be poisoned once and replayed forever, which is why OWASP's 2026 Agentic list names memory and context poisoning as a distinct risk. Persist only validated, client-safe facts; keep secrets and PII out of recallable tiers. Memory left implicit becomes either amnesia or unbounded context growth.
+### 5. Long-running agents are durable
+An agent loop that runs for minutes or hours will be interrupted by a crash, a timeout, or a deploy. We build it on **durable execution** so it resumes from the last committed step instead of restarting and repeating side effects. Match the weight to the horizon: an in-process loop that finishes in minutes and tolerates a clean restart needs only a **checkpointer** (a LangGraph-style PostgresSaver or event-sourced state); a job that runs for hours, spans services, fires non-idempotent side effects, or pauses on a human for days needs a real **durable execution engine** (Temporal-style) with effectively-once workflow semantics. Side effects still need idempotency keys, because an engine of this kind may retry an individual step. Do not reach for the heaviest orchestrator by reflex, but do not hand-roll resume flags either. Durability moves the reliability guarantee out of the prompt and into the infrastructure, and it is what makes human-in-the-loop pauses and long tool calls safe.
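To make "resume from the last committed step" concrete, here is a toy file-backed sketch. It illustrates the idea only and is not a substitute for a real checkpointer or durable execution engine; `run_steps` and the JSON checkpoint layout are assumptions made for this example. A crash between a step's action and its commit would repeat that step on resume, which is exactly why non-idempotent side effects call for heavier machinery.

```python
import json
import os


def run_steps(steps, checkpoint_path):
    """Run (name, action) pairs in order, skipping names already committed to disk."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    executed = []
    for name, action in steps:
        if name in done:
            continue  # committed in an earlier run: do not repeat the side effect
        action()
        done.append(name)
        # Commit after every step: write a temp file, then atomically swap it in.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
        executed.append(name)
    return executed
```

Calling `run_steps` again with the same checkpoint path after a failure runs only the steps that never committed.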
+### 6. The input is adversarial; guardrails are architecture
+An agent mixes instructions and data in one channel, so **prompt injection** is a structural risk, not an edge case — and it arrives indirectly, through retrieved documents, tool outputs, and other agents (an injection in shared context propagates). There is no known complete fix: as of 2026 injection is mitigated in layers, not solved, so design for containment, not prevention. We validate at every trust boundary, constrain what each tool can do, mediate tool access and model traffic through a gateway control point, and treat a model output crossing into code or an action as untrusted until checked. The hard line: **no model-influenced instruction reaches an irreversible or high-privilege action without a deterministic, non-LLM check or a human gate in front of it** — an LLM cannot be the thing that decides whether to trust an LLM. Prompt injection has topped OWASP's LLM risks (LLM01) every year the list has existed, and the 2026 Agentic list extends the same logic to tool misuse and memory poisoning.
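The hard line above can be sketched as a deterministic gate that sits between the model's proposed tool call and its execution. The tool names and the three-way verdict are hypothetical, chosen only for illustration; the point is that no model is consulted when deciding whether to proceed.

```python
from dataclasses import dataclass

# Hypothetical tool names, for illustration only.
IRREVERSIBLE_TOOLS = {"send_payment", "delete_records"}
ALLOWED_TOOLS = {"search_docs", "read_file"} | IRREVERSIBLE_TOOLS


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict


def gate(call: ToolCall, human_approved: bool = False) -> str:
    """Return 'allow', 'needs_human', or 'deny' using only deterministic rules."""
    if call.tool not in ALLOWED_TOOLS:
        return "deny"  # unknown tool: never executed, whatever the model says
    if call.tool in IRREVERSIBLE_TOOLS and not human_approved:
        return "needs_human"  # irreversible action: pause for a human decision
    return "allow"
```

A real gate would likely also validate `args` against a schema and per-tool limits; this sketch checks only the tool name.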
+### 7. Least agency, with a human review point sized to the stakes
+An agent gets the minimum authority its task requires, and the riskier the action the tighter the leash. High-stakes actions pause at a **human approval gate** — implemented as a durable interrupt that resumes on decision, not a blocking call — and lower-stakes ones route by confidence. "Human in the loop" (approve before acting) and "human on the loop" (monitor and intervene) are distinct designs; we pick deliberately. An agent loop with no termination condition and no oversight is how an autonomous system becomes an autonomous incident.
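As a small illustration of the termination condition, the loop below gives the agent a fixed step budget and reports why it stopped. `run_bounded` and its return shape are assumptions made for this sketch, not an API from the package.

```python
from typing import Callable


def run_bounded(step: Callable[[int], bool], max_steps: int) -> tuple[str, int]:
    """Run step(i) until it returns True (task finished) or the budget is spent."""
    for i in range(max_steps):
        if step(i):
            return ("finished", i + 1)
    return ("budget_exhausted", max_steps)
```

A caller can treat `"budget_exhausted"` as a signal to escalate to a human rather than retrying silently.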
+### 8. Evals and traces are the reliability surface
+Agent behaviour is probabilistic, so we measure it like a system under test: trace every run (plan, tool calls, tokens, outcome), score it on the dimensions that matter (task completion, tool-call correctness, reasoning quality), run evals both offline in CI and online in production, and **promote failed production traces into the eval set** so the suite grows from real behaviour. An agent you cannot trace is an agent you cannot trust.
+## How we apply this
+The capability core stays headless and deterministic where it can; the agent is an adapter at the edge, like any other surface, reached through contracts and held to the same boundaries. Durable execution, identity, and the gateway control plane are shared infrastructure the agent rides, not bespoke per-agent code.
+- [Agent-Native Systems](agent-native-systems.md) — designing the interfaces agents consume.
+- [AI Engineering](ai-engineering.md) — the prompt/eval/context discipline underneath.
+- [Integration Patterns](../system-design/integration-patterns.md) — the distributed-systems patterns an agent loop inherits.
+## Anti-patterns we reject
+- **Naive multi-agent.** Parallel agents writing interdependent decisions onto shared mutable state with no shared framing. Conflicting outputs, lost context, compounding error. Default to one agent with stateless, read-only sub-agents.
+- **The over-stuffed context.** Pouring every possibly-relevant document into the window. It raises cost and *lowers* quality — curate and compact instead.
+- **Hand-rolled durability.** Re-implementing checkpointing and resume with ad-hoc state flags. Use a checkpointer or a durable execution engine, sized to the horizon.
+- **Output-only injection defence.** Guarding the model's output while trusting its retrieved inputs, tool results, and persisted memory. Injection comes in through the data and can be replayed from memory.
+- **LLM guarding the LLM.** Letting a model's own judgement be the only check before an irreversible action. Put a deterministic gate or a human in front.
+- **Unbounded agency.** A tool-wielding loop with no authority limit, no termination condition, and no human gate on consequential actions.
+- **Free-text parsing.** Regex-extracting structured results from prose. Use schema-constrained output / tool calling.
+- **Untraced agents.** Shipping an agent whose runs cannot be replayed, scored, or turned into eval cases.
+## Further reading
+- *Effective context engineering for AI agents*, Anthropic (2025) — context as the core discipline, with compaction and offloading.
+- *How we built our multi-agent research system*, Anthropic (2025) — when read-heavy fan-out pays, and the ~15× token cost.
+- *Don't Build Multi-Agents*, Cognition (2025) — the single-thread, single-context-owner counter-position.
+- *How and when to build multi-agent systems*, LangChain — the topology trade-offs.
+- *Building effective agents*, Anthropic — the canonical agent-pattern catalogue.
+- *OWASP Top 10 for LLM Applications* (2025) and *OWASP Top 10 for Agentic Applications* (2026) — prompt injection, excessive agency, and memory poisoning as the leading risks.
+- Durable execution engines (Temporal, LangGraph) — checkpointing and resumability as the production reliability pattern.

package/src/docs/principles/ai-native/ai-engineering.md ADDED

@@ -0,0 +1,100 @@
+---
+title: AI Engineering
+description: Prompt engineering, evaluations, agent design, RAG, and context engineering.
+status: active
+last_reviewed: 2026-06-19
+---
+# AI Engineering
+## TL;DR
+AI engineering is software engineering with a non-deterministic component in the loop. We treat prompts as code, evaluations as tests, context as a first-class design surface, and agents as distributed systems. The discipline is about making probabilistic systems behave predictably enough to ship.
+## Why this matters
+Every team that has tried to ship an AI feature has learned the same lesson the hard way: the part that feels like magic in a demo is the part that fails in unpredictable ways in production. The gap between "it works in the playground" and "it works for every user, every day" is where AI engineering happens. The discipline treats the non-determinism as an engineering problem — measurable, testable, and addressable — rather than as an inherent limitation to shrug at.
+## Our principles
+### 1. Prompts are code
+Prompts live in version control, are reviewed in the same PR as any other change, and are versioned against the model they were tuned for. A prompt is an artifact with a target: it is brittle across model versions, and the requirements it leaves *unstated* are exactly the ones that regress when you upgrade. Pin the model, and treat a model upgrade as a change that must clear the evals before it ships. "We tweaked the prompt in the dashboard" is how a team loses the ability to reason about its own AI behaviour.
+The contested part is *who* writes the prompt. For a high-volume, measurable task, the best prompt is rarely the one a human hand-tunes. Declarative frameworks (DSPy and its successors) compile a prompt against your eval set — selecting few-shot examples, rewriting instructions to maximize a metric — and beat hand-authoring once you have data to optimize against. The principle is not "humans write clever prompts." It is "the prompt is a versioned, tested artifact," whether a human or an optimizer produced it.
+Decision rule: hand-author while the task is exploratory or low-volume; move to a compiled/optimized prompt once you have an eval set worth optimizing against and the task runs often enough to pay for it. Either way, the prompt ships through review and is re-validated on every model change.
+### 2. Evals are tests
+Every meaningful AI behaviour has an eval: a scored comparison of model output against a reference. Evals run in CI; thresholds are committed; regressions block merge the same way unit-test failures do. Without evals, "did we make the model worse?" is unanswerable, which means every improvement is also a potential regression you will discover from users.
+But an eval is only as trustworthy as its grader, and the popular grader — an LLM judging another LLM's output — is itself non-deterministic and biased. LLM judges are systematically overconfident, favour longer and more authoritatively formatted answers, and agree with human raters far less than their fluency suggests (inter-rater agreement on hard tasks sits around Fleiss' κ ≈ 0.3). A judge you have not calibrated against human labels is a vibe with a number attached.
+Decision rule: grade with code wherever the output is checkable — schema, exact match, contains, numeric range — because those checks are deterministic and free. Reserve the LLM judge for genuinely subjective qualities, and before trusting it as a merge gate, measure its agreement with human labels on a held-out set and recalibrate when you change judge models. Set thresholds with a noise band: a one-point move inside the judge's own variance is not a regression, and blocking on it just trains the team to rerun CI until it passes.
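A sketch of what "grade with code" and "a noise band" might look like in practice. The function names and the example thresholds are illustrative assumptions; the noise band itself would have to be measured, for example by re-running the same eval several times.

```python
import json


def grade_exact(output: str, expected: str) -> bool:
    """Deterministic exact-match check, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def grade_json_keys(output: str, required: set[str]) -> bool:
    """Pass only if the output parses as a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


def is_regression(baseline: float, candidate: float, noise_band: float) -> bool:
    """Flag a regression only when the score drop exceeds the measured noise band."""
    return (baseline - candidate) > noise_band
```

With a noise band of 0.02, a drop from 0.90 to 0.89 does not block the merge, while a drop to 0.85 does.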
+### 3. Context is the interface
+The content of the context window — system prompt, few-shot examples, retrieved documents, tool outputs — is the single biggest lever on model behaviour. The goal is not the *most* relevant context; it is the smallest set of high-signal tokens that produces the behaviour you want. More tokens is not more help: as the window fills, recall degrades — *context rot*, a gradient rather than a cliff, rooted in the n² attention budget — so every token you add dilutes the ones that mattered.
+Design the context deliberately, measure its token budget, and prefer just-in-time retrieval — hand the model lightweight references (paths, IDs, queries) and let it pull the body through a tool when it needs it — over pre-loading everything you *might* need. For long-horizon work, compaction (summarize-and-restart, preserving decisions and open threads) and external note-taking keep the working set small. "Throw in everything relevant" is the anti-pattern that blows up the bill and *lowers* quality.
+### 4. Retrieval matters more than the model
+For a knowledge-grounded system, the retrieval layer sets the ceiling. A clever model with bad retrieval gives confident nonsense; a boring model with good retrieval gives boring, correct answers. Invest in retrieval quality — chunk boundaries, indexing, ranking, reranking — before you reach for a bigger model.
+The honest tension: long-context models and "just put it all in the prompt" make naive RAG look obsolete, and for a small or stable corpus, loading the documents directly is simpler and often better. But long context is neither free nor reliable at scale — it pays the context-rot tax, and for repeated queries over a large corpus, retrieval is cheaper and lower-latency by a wide margin. The field has not abandoned retrieval; it has moved past the naive 2023 top-k pipeline toward *agentic retrieval*, where the model issues, critiques, and refines its own searches as a loop.
+Decision rule: small or stable corpus that fits comfortably in context → load it directly and skip the retrieval stack. Large, changing, or cost-sensitive corpus → retrieve, and treat retrieval as a first-class system with its own evals (recall@k, not just end-to-end answer quality). Reach for agentic retrieval when a single query cannot express the information need — multi-hop questions, ambiguous asks, corpora that require exploration. Either way, what kills the system is bad retrieval, not a slightly weaker model.
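For reference, recall@k can be computed as below; the document IDs in the usage note are made up, and a real eval would average this over a labelled query set.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        raise ValueError("recall is undefined without at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

For a ranking `["d3", "d1", "d9", "d2"]` with relevant set `{"d1", "d2"}`, recall@2 is 0.5 and recall@4 is 1.0.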
+### 5. Model outputs are validated at the boundary
+Every model output that crosses into code is validated: shape, length, content, and expected enumerations. Parse failures are handled explicitly, never allowed to propagate. If you need a number, demand it in a structured schema; do not regex it out of prose.
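A minimal sketch of validation at the boundary, using only the standard library. The `Ticket` shape, the allowed priorities, and the numeric range are hypothetical; the pattern is that a parse or validation failure returns an explicit `None` instead of letting malformed data flow onward.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_PRIORITIES = {"low", "medium", "high"}  # hypothetical enumeration


@dataclass
class Ticket:
    priority: str
    estimate_hours: float


def parse_ticket(raw: str) -> Optional[Ticket]:
    """Validate a model's JSON output; return None rather than propagate bad data."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    priority = data.get("priority")
    hours = data.get("estimate_hours")
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(hours, bool) or not isinstance(hours, (int, float)):
        return None
    if not 0 < hours <= 1000:
        return None
    return Ticket(priority, float(hours))
```

The caller decides what a `None` means (retry, fallback, or escalation); the parser never guesses.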
+Validation is also a security boundary, not only a correctness one. Treat every model output — and every tool result the model reads — as untrusted input, because anything in the context window can carry an instruction. This is the *lethal trifecta* (Simon Willison, 2025): an agent that combines access to private data, exposure to untrusted content, and the ability to communicate externally can be steered by a poisoned document into exfiltrating secrets — and prompt injection has no reliable fix today. A model output flowing into business logic, a shell, or an HTTP call without validation is an injection vector waiting to fire.
+Decision rule: budget the trifecta. An agent acting without human review may hold at most two of {private data, untrusted input, external communication}; wanting all three means a human gate or a hard architectural break between the legs (Meta's "rule of two" framing).
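The budget can be expressed as a one-line check over an agent's declared capabilities. The capability labels are hypothetical strings chosen for this sketch; a real system would derive them from the agent's tool and data grants.

```python
TRIFECTA = frozenset({"private_data", "untrusted_input", "external_communication"})


def may_run_unattended(capabilities: set[str]) -> bool:
    """An unattended agent may hold at most two of the three trifecta legs."""
    return len(TRIFECTA & capabilities) <= 2
```

Capabilities outside the trifecta (for example a read-only calculator tool) do not count against the budget in this sketch.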
+### 6. Agents are distributed systems
+An agent loop (the model plans, the agent executes the action, the agent observes the result, the model re-plans) has all the problems of a distributed system: retries, idempotency, timeouts, and failure isolation. We apply the same patterns ([Integration Patterns](../system-design/integration-patterns.md)): bounded retries, circuit breakers, and an auditable history. The hardest agent failures are system failures, not model failures.
+The live design argument is single-agent versus multi-agent. Anthropic reports a multi-agent research system beating a single agent substantially on its internal eval; Cognition argues that parallel sub-agents making independent decisions on a shared problem produce conflicting, incoherent output. Both are right about different shapes of work, and the variable that decides it is the *isolation boundary*: how much each sub-agent must know about what the others are doing.
+Decision rule: split into sub-agents only when the work fans out into independent, read-mostly tasks that each return a small summary (research, search, parallel analysis). The win is a clean context window per task, not parallelism for its own sake. Keep it a single agent when steps depend on each other's decisions, because coordinating shared mutable state across agents is harder than the problem you started with. And price it in: by Anthropic's measurement, a single agent uses roughly 4× the tokens of a chat and a multi-agent system roughly 15×.
+### 7. Cost is part of the evaluation
+A configuration that is 10% better but 5× more expensive is not obviously better. Evals track quality, latency, *and* cost, and the ship decision weighs all three. This matters more, not less, as token prices fall: reasoning models and agent loops spend tokens by the multiple, so the cost of a feature is now dominated by how many times it calls the model, not the sticker price per token. Budget cost per *task*, not per call, and let the eval surface the configuration that is good enough at a price you can defend ([Cost Engineering](../delivery/cost-engineering.md)).
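A sketch of budgeting per task rather than per call, followed by picking the cheapest configuration that clears a quality bar. All prices, quality scores, and configuration names in the usage note are invented for illustration.

```python
def cost_per_task(calls, price_in_per_mtok, price_out_per_mtok):
    """Dollar cost of one task, given (input_tokens, output_tokens) for each model call."""
    total = 0.0
    for tokens_in, tokens_out in calls:
        total += tokens_in * price_in_per_mtok + tokens_out * price_out_per_mtok
    return total / 1_000_000  # prices are quoted per million tokens


def cheapest_good_enough(configs, min_quality, max_cost):
    """Name of the cheapest config meeting the quality bar and cost ceiling, else None."""
    eligible = [c for c in configs if c["quality"] >= min_quality and c["cost"] <= max_cost]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["cost"])["name"]
```

A task that makes four calls of 1,000 input and 500 output tokens, at an assumed $3 and $15 per million tokens, costs about $0.042, four times what a per-call estimate would suggest.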
+### 8. Human oversight is designed in
+For high-stakes AI outputs — content a user will act on, actions taken on their behalf — design the review point deliberately. The reviewer gets a summary calibrated to the decision, not a wall of raw output; the review UX is built alongside the AI feature, not retrofitted.
+Decision rule: place the human gate by stakes × reversibility. Cheap, reversible actions can run unattended with logging; expensive or irreversible ones — sending money, deleting data, messaging the outside world — get a gate, and per the trifecta budget above that gate is mandatory once an agent touches private data and the outside world at once. "Let the model do it" without a review loop is a promise the model will eventually break.
+## How we apply this
+- [Agent-Native Systems](agent-native-systems.md) — the flip side, making our interfaces consumable by agents.
+- [Observability](../quality/observability.md) — the trace surface for model calls.
+- [Testing](../foundations/testing.md) — the broader testing discipline evals sit inside.
+## Anti-patterns we reject
+- **"The model will figure it out."** Hope is not a design.
+- **Prompts as configuration.** Untracked prompts drift silently, and evals cannot catch drift they are not told about.
+- **Over-stuffed context windows.** Throwing the kitchen sink at the model is usually how quality *decreases*, not increases.
+- **An LLM judge you never checked against humans.** A confident grader that disagrees with people is worse than no grader — it automates the wrong call at scale.
+- **Trusting tool output as if it were your own code.** Everything in the context window is potentially adversarial input.
+- **Skipping evals "this once."** This once becomes always. Evals compound when you have them and compound against you when you do not.
+- **Agent loops without termination.** A loop without a clear exit condition is how a runaway agent becomes a runaway bill.
+- **Deterministic reasoning on top of probabilistic output.** If you need a number, ask for a number in a structured schema. Do not regex-extract it from prose.
+## Further reading
+- *Prompt Engineering Guide* ([promptingguide.ai](https://www.promptingguide.ai)) — the practitioner's summary of current patterns.
+- Anthropic, *Building Effective Agents* — the reference for agent architecture patterns, single- and multi-agent.
+- Anthropic, *Effective Context Engineering for AI Agents* — context rot, just-in-time retrieval, compaction, and sub-agent context isolation.
+- Shreya Shankar et al., *Who Validates the Validators?* (2024) — aligning LLM-as-judge evaluation with human preferences.
+- Simon Willison, *The Lethal Trifecta for AI Agents* (2025) — the security model for agents with data and tool access.
+- Gao et al., *Retrieval-Augmented Generation for Large Language Models: A Survey* — RAG ground truth.
+- *DSPy* (Stanford NLP) — declarative, compiled prompts as an alternative to hand-tuning.

package/src/docs/principles/ai-native/ai-native-product.md ADDED

@@ -0,0 +1,76 @@
+---
+title: AI-Native Product
+description: Product management for probabilistic systems — the continuous decision loop, evals as a product responsibility, dual success metrics, and the three AI cost layers.
+status: active
+last_reviewed: 2026-06-19
+---
+# AI-Native Product
+## TL;DR
+Building product on top of a probabilistic model changes the job. Success is no longer a feature that works the same way every time — it is an **outcome envelope** the model lands inside often enough, measured by **evals** the product team owns. We run product as a continuous decision loop fed by live signals rather than a staged lifecycle, we track **two** success metrics (product outcome *and* model quality), and we price work across **three cost layers** standard prioritization misses. AI capability is a product to be evaluated and steered, not a feature to be shipped and forgotten.
+## Why this matters
+A deterministic feature has a binary definition of done: it meets its spec or it does not. An AI feature does not — the same input can produce a good answer today and a poor one tomorrow, and "good" is a judgement across tone, relevance, and accuracy rather than a pass/fail. Product practice built for deterministic software breaks here in specific ways: acceptance criteria cannot be written as fixed assertions, a one-time launch metric misses the model drifting under you, and a prioritization framework that ignores inference and maintenance cost will greenlight a feature that is ruinous at scale. The teams shipping good AI product are not the ones with the most impressive demo — they are the ones who treat the model's behaviour as a measured, governed, continuously-steered product surface.
+## Our principles
+### 1. Own the outcome envelope, not the exact output
+For a probabilistic feature, product does not specify a single correct output — it defines the **envelope** of acceptable behaviour and the rate at which the model must land inside it. The spec shifts from "the system returns X" to "the system returns something that satisfies these properties, at least this often, and fails safely the rest of the time." Designing the envelope — what good looks like, what unacceptable looks like, what the fallback is when the model misses — is the core product decision of an AI feature.
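One way to turn an envelope into something measurable is a set of property checks plus a minimum pass rate. The checks and the rate in the usage note are hypothetical; product would define the real ones.

```python
from typing import Callable, Sequence


def envelope_rate(outputs: Sequence[str], checks: Sequence[Callable[[str], bool]]) -> float:
    """Fraction of outputs that satisfy every property check."""
    if not outputs:
        raise ValueError("need at least one output to measure a rate")
    inside = sum(1 for output in outputs if all(check(output) for check in checks))
    return inside / len(outputs)


def meets_envelope(outputs, checks, min_rate: float) -> bool:
    """True when the model lands inside the envelope at least min_rate of the time."""
    return envelope_rate(outputs, checks) >= min_rate
```

For example, with checks "non-empty" and "at most 200 characters", two acceptable outputs out of three give a rate of about 0.67, which passes a 0.6 bar and fails a 0.9 bar.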
+### 2. Evals are a first-class product responsibility — and only as honest as their calibration
+The quality of an AI feature is whatever its **evals** measure — which is exactly why a careless eval is dangerous: it reports a confident number while measuring the wrong thing. Product owns what "good" means: the dimensions that matter (task completion, correctness, tone, safety), the cases that must pass, and the bar for shipping.
+Build the suite from reality, not from intuition. Start with **error analysis** — read actual production traces, label the failures, and cluster them *before* writing a single automated check. Teams that skip straight to dashboards and LLM judges end up scoring noise. Then layer the measurement to the stakes of the decision: cheap deterministic checks for coverage, an **LLM-as-judge** for screening the fuzzy dimensions, and human review where correctness is load-bearing.
+Treat the judge itself as a measurement instrument that must be validated, not as ground truth. LLM judges carry documented biases — verbosity, position, and self-preference among them — and agreement with a small human-labeled set is a point estimate, not a guarantee that the judge holds up on the inputs you have not seen. Decision rule: align the judge against human labels on a held-out sample, re-validate it whenever you change the rubric or swap the underlying model, and **promote failed production cases into the suite** so it grows from where the product actually breaks. A team that cannot say how it measures its AI feature's quality has a demo, not a product — and a team that trusts an unvalidated judge has a demo wearing a dashboard.
+### 3. Track two success metrics, not one
+An AI feature succeeds on two axes at once and we instrument both: the **product outcome** (did users get value — engagement, retention, task success) and the **model quality** (precision, recall, acceptable-response rate, latency). These two headline axes sit on top of distinct layers that can each pass or fail independently — the model, the system that serves it (latency, cost, reliability), the product experience, and the business result — so when the headline numbers diverge, locate which layer the divergence lives in before reacting. Strong model scores with weak product outcomes means we are solving the wrong problem well; strong product outcomes with mediocre model scores means the feature tolerates imperfection better than we feared and we should stop over-investing in raw model quality. Watching only one axis hides the other's story. This is the [success-metrics](../foundations/success-metrics.md) discipline extended: the model quality metric is itself instrumented, with its own counter-metrics.
+### 4. Price the three cost layers — including the ones that recur
+AI work has a cost structure standard feature prioritization does not model, and ignoring it ships features that are unaffordable at scale:
+- **Development** — the one-time build, as with any feature.
+- **Inference at scale** — the per-call cost of running the model, paid on *every* use, forever. Do not anchor on today's token price: per-token cost has fallen roughly an order of magnitude a year, but agentic and reasoning workflows consume several times — sometimes orders of magnitude — more tokens per task, so total spend can climb even as unit prices collapse. Model token *volume* under realistic usage and reasoning depth, not the sticker price of a single call.
+- **Adaptation and maintenance** — the recurring cost of keeping quality up as the world moves. In the foundation-model era this is rarely literal retraining: it is model-version churn (providers deprecate and silently re-tune the model under you), prompt and context upkeep, eval maintenance, and the escalation ladder when prompting stops being enough. Decision rule: start with prompting, escalate to retrieval when the gap is missing facts, and reach for fine-tuning only when behaviour must change and the volume justifies it — each rung up adds standing cost, and fine-tuning can multiply inference cost several-fold.
+[Appetite](../foundations/prioritization-and-appetite.md) for an AI bet must account for all three; a framework that scores only build effort will systematically greenlight the wrong AI work.
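The inference-layer point is easy to check with arithmetic. The figures below are invented for illustration: the unit price falls tenfold while an agentic workflow uses twenty times the tokens per task, so monthly spend doubles.

```python
def monthly_spend(tasks: int, tokens_per_task: int, price_per_mtok: float) -> float:
    """Monthly inference spend in dollars, with the price quoted per million tokens."""
    return tasks * tokens_per_task * price_per_mtok / 1_000_000


# Hypothetical before/after figures, not measurements.
before = monthly_spend(tasks=100_000, tokens_per_task=2_000, price_per_mtok=10.0)
after = monthly_spend(tasks=100_000, tokens_per_task=40_000, price_per_mtok=1.0)
```

Here `before` is 2000.0 and `after` is 4000.0: the sticker price dropped, the bill did not.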
+### 5. Run product as a continuous decision loop
+AI shortens the distance between a question and an answer — prototypes are hours not weeks, experiments run continuously, and signals from analytics, support, and behaviour arrive in real time. We exploit this by running product as a continuously-running decision system rather than a staged plan: reassess opportunities as signal arrives, prototype to learn rather than to ship, and let the loop tighten decision latency. Hold the line between building to learn and building to earn — a prototype that proves a point is not a feature, and the cheapness of generating one makes it dangerously easy to let a throwaway leak into production unmeasured. The same shortening makes [continuous discovery](../foundations/continuous-discovery.md) cheaper and therefore more obligatory.
+### 6. Design for probabilistic experience and graceful failure
+Because the model will sometimes be wrong, the experience is part of the product's correctness, not a polish layer. We design for the miss: visible uncertainty where it matters, easy correction and override, a safe fallback when confidence is low, and a [human review point](../system-design/identity-and-access.md) sized to the stakes of the action. Match friction to consequence — confirmations and checkpoints for high-stakes or irreversible actions, near-zero friction for cheap, reversible ones. Be deliberate about surfacing a confidence number: a model's self-reported confidence is usually poorly calibrated, and a precise-looking "87%" the model cannot back up erodes trust faster than honest hedging. Decision rule: show a numeric confidence only when it is calibrated against ground truth; otherwise express uncertainty by offering alternatives and making correction trivial. A probabilistic feature with a UX that assumes the model is always right is a feature that fails loudly the first time it is wrong.
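One way to make "calibrated against ground truth" concrete is a binned calibration check over labelled examples. The sketch below computes an expected calibration error and gates the numeric display on it; the function names, the ten-bin default, and the 0.05 threshold are all assumptions for illustration, not values from this document.

```python
def expected_calibration_error(confidences, outcomes, bins=10):
    """Weighted mean gap between stated confidence and observed accuracy."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    buckets = [[] for _ in range(bins)]
    for conf, correct in zip(confidences, outcomes):
        index = min(int(conf * bins), bins - 1)  # conf == 1.0 goes in the last bin
        buckets[index].append((conf, correct))
    total = len(confidences)
    error = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, correct in bucket if correct) / len(bucket)
        error += (len(bucket) / total) * abs(avg_conf - accuracy)
    return error


def show_numeric_confidence(confidences, outcomes, max_error=0.05):
    """Hypothetical gate: display a number only if calibration is good enough."""
    return expected_calibration_error(confidences, outcomes) <= max_error


# A model that says 0.9 and is right 9 times in 10 is calibrated;
# one that says 0.9 and is right 5 times in 10 is not.
calibrated = show_numeric_confidence([0.9] * 10, [True] * 9 + [False])
overconfident = show_numeric_confidence([0.9] * 10, [True] * 5 + [False] * 5)
```

When the gate returns false, the decision rule above applies: drop the number and offer alternatives with easy correction instead.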
+### 7. Use AI to improve the product system, not to impress
+The test of an AI capability is not how striking a single output looks — it is whether it improves the product system: better evidence, faster learning, clearer trade-offs, fewer repeated explanations, stronger decisions. We judge AI features by their effect on the loop, not by the wow of a cherry-picked prompt. A demo that dazzles and degrades the product system is a net loss disguised as innovation.
+## How we apply this
+- The eval suite is to an AI feature what [observability](../quality/observability.md) is to a service — the measurement substrate that makes steering possible — and shares the [agentic-systems](agentic-systems.md) discipline of tracing and scoring every run.
+- Model quality, safety, and the human review point connect to [agent-native systems](agent-native-systems.md) and [AI engineering](ai-engineering.md) on the implementation side; product owns *what* to measure and *what bar* to hold, engineering owns *how*.
+- The outcome envelope and dual metrics are [success-metrics](../foundations/success-metrics.md) applied to a non-deterministic core; the three cost layers extend [prioritization and appetite](../foundations/prioritization-and-appetite.md).
+## Anti-patterns we reject
+- **Demo-driven product.** Shipping on the strength of an impressive prompt, with no evals, no quality bar, and no plan for the median case.
+- **Eval theater.** Trusting an LLM judge that was never validated against human labels, or a suite that grades style while the real failures go uncounted — confident numbers measuring the wrong thing.
+- **Ship-and-forget.** Launching an AI feature and never measuring its quality again, so model drift degrades the product invisibly.
+- **Single-metric AI.** Watching product engagement while ignoring model quality, or the reverse — missing the half of the story that explains the other.
+- **Build-cost-only pricing.** Greenlighting an AI feature on build effort alone, then discovering inference at scale, version churn, or fine-tuning upkeep costs more than the feature earns.
+- **Determinism cosplay.** Writing fixed pass/fail acceptance criteria for a probabilistic feature, designing a UX that assumes the model is never wrong, or showing a confidence score the model cannot actually back up.
+## Further reading
+- *AI Evals for Engineers & PMs* (Hamel Husain & Shreya Shankar) — error analysis, LLM-as-judge, and building eval suites as a product discipline.
+- *Building effective agents* and *Effective context engineering*, Anthropic — the engineering substrate product steers.
+- *AI Product Management* and *Product Discovery: build to learn vs. build to earn* (Marty Cagan / SVPG) — product judgement over probabilistic systems.

package/src/docs/principles/delivery/cost-engineering.md ADDED

@@ -0,0 +1,89 @@
+---
+title: Cost Engineering
+description: FinOps, cost-aware architecture, and the economics of autoscaling.
+status: active
+last_reviewed: 2026-06-19
+---
+# Cost Engineering
+## TL;DR
+Cost is a non-functional requirement with a dashboard and a dollar sign. Every significant architectural decision considers cost-per-user and cost-per-call; every service has a budget it lives inside; surprising spend is an incident. FinOps is how we stay honest about the economics of running what we build.
+## Why this matters
+Most teams discover cost too late — after a quarterly bill raises eyebrows in a meeting. By then, the decisions that drove the cost are in production, have consumers, and are expensive to reverse. Cost engineering is the discipline of making the economic consequences of decisions visible at the point of the decision. It turns cost from a finance concern into an engineering variable.
+## Our principles
+### 1. Cost is a first-class metric
+Cost-per-call, cost-per-user, cost-per-feature — all tracked alongside latency and error rate. A feature's success includes its unit economics, not just its engagement numbers. A team that does not know what its features cost cannot reason about trade-offs that matter.
+### 2. Budgets are set and defended
+Every significant service runs inside a cost budget. The budget is set at design time, reviewed monthly, and treated as a commitment. Exceeding budget triggers the same response as exceeding any other SLO: investigate, remediate, or explicitly negotiate an increase.
+### 3. Autoscaling is designed, not enabled
+Autoscaling is a tool with sharp edges. Aggressive autoscaling on a bursty workload can multiply cost without improving user experience; conservative autoscaling on a steady workload wastes headroom. Each scaling policy is tuned per workload with the production load profile in mind, not set to vendor defaults and left.
+The shape of the load picks the mechanism. Steady baseline with modest peaks → run a reserved floor and let the autoscaler add thin headroom on top. Spiky, event-driven, or queue-backed work → scale on the real driver (queue depth, request concurrency via KEDA or equivalent), not lagging CPU, and scale to zero when idle if the cold-start budget allows. Latency-sensitive request paths → never scale to zero; the cold start is paid by the user. Defaults are a starting hypothesis, not a setting.
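Scaling on the real driver can be sketched as a pure function of queue depth. This is an illustration of the idea only, not KEDA's actual algorithm; the function name, parameters, and the per-replica target are assumptions.

```python
import math


def desired_replicas(queue_depth: int,
                     target_per_replica: int,
                     max_replicas: int,
                     min_replicas: int = 0) -> int:
    """Replicas needed to keep each worker at or under its target backlog."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp between the floor (0 allows scale-to-zero) and the cost ceiling.
    return max(min_replicas, min(wanted, max_replicas))


print(desired_replicas(0, 50, 20))                   # idle queue
print(desired_replicas(120, 50, 20))                 # modest backlog
print(desired_replicas(5000, 50, 20))                # burst, capped by max_replicas
print(desired_replicas(0, 50, 20, min_replicas=2))   # latency-sensitive path
```

The `min_replicas` argument is where the latency rule lands: a request path that must never pay a cold start sets a non-zero floor, while queue-backed work leaves it at zero. The `max_replicas` cap is the budget guard against a burst multiplying cost.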
+### 4. Cheap queries beat fast queries — until staleness has a cost
+The fastest query is the one that does not run. We cache what we can, compute what we must, and denormalise when the read-to-write ratio justifies it. But "cheap" is not free: a cache buys read cost down with invalidation complexity, and the bugs born of stale or inconsistent reads can cost more than the queries they replaced. Denormalisation trades read cost for write amplification and a second copy of the truth to keep in sync.
+The decision rule is the read-to-write ratio weighed against the tolerance for staleness. Read-heavy data that tolerates seconds or minutes of lag (catalogues, dashboards, feeds) → cache or denormalise aggressively, with an explicit TTL and an invalidation path. Data where a stale read is a correctness or money bug (balances, inventory, auth) → pay for the live query and optimise the query itself. Never add a cache to mask a query you have not first tried to make cheap at the source.
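"An explicit TTL and an invalidation path" can be as small as the sketch below. The class and method names are hypothetical, and the injectable clock is an assumption made so the behaviour can be exercised without sleeping.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TtlCache:
    """Read-through cache with a fixed TTL and explicit invalidation."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable, load: Callable[[Hashable], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]              # fresh enough: no query runs
        value = load(key)                # expired or missing: pay for the query
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Call on every write to the source so readers do not wait out the TTL."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a read can be; `invalidate` is the path a writer uses to make that bound zero when it knows the value changed. Data where any staleness is a correctness or money bug should bypass a cache like this entirely.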
+### 5. Egress is expensive; plan for it
+Cloud provider egress is the most commonly underestimated line item in most bills. Inter-region chatter, chatty logs, and large payloads sent frequently all add up. We place data where its consumers are, batch where we can, and compress where it is cheap to do so.
+### 6. AI spend has the same discipline
+Every model call has a measured cost and a caching strategy. Prompts are versioned with token-count measurement; expensive prompts are justified by value. Output tokens are the dominant cost — major providers price them roughly 4-5× input — so a verbose model talking to itself is the silent budget killer, and trimming output earns more than trimming input. Prompt caching is the highest-leverage lever for any workload that resends a large, stable prefix (system prompt, repo, retrieved context): a cache hit bills a small fraction of the input rate, so structure prompts with the stable bytes first and the variable bytes last. "Just pass the whole context to the largest model" is how an AI feature becomes a cost incident.
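A per-call cost function makes the two levers above visible side by side. This is a sketch under stated assumptions: the rates are hypothetical placeholders (output priced at 5x input, a cached prefix at one tenth of input), not any provider's published prices, and the function name is invented for illustration.

```python
def call_cost_usd(prefix_tokens: int,
                  variable_tokens: int,
                  output_tokens: int,
                  *,
                  input_rate: float,
                  output_rate: float,
                  cached_rate: float,
                  prefix_cached: bool) -> float:
    """Cost of one call; all rates are USD per million tokens."""
    prefix_rate = cached_rate if prefix_cached else input_rate
    total = (prefix_tokens * prefix_rate
             + variable_tokens * input_rate
             + output_tokens * output_rate)
    return total / 1_000_000


# Hypothetical rates, for illustration only.
RATES = dict(input_rate=3.0, output_rate=15.0, cached_rate=0.3)

# Stable 20k-token prefix first, 500 variable tokens last, 1k tokens out.
cold = call_cost_usd(20_000, 500, 1_000, prefix_cached=False, **RATES)
warm = call_cost_usd(20_000, 500, 1_000, prefix_cached=True, **RATES)
terse = call_cost_usd(20_000, 500, 500, prefix_cached=True, **RATES)
```

Under these assumptions a cache hit on the prefix removes most of the input bill, after which output dominates the remainder, so halving the output is the next largest saving.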
+### 7. Rightsize before you commit
+Reserved instances and committed-use discounts save roughly 30-50% over on-demand for predictable baseline workloads — but the discount is only as good as the baseline it is bought against. Commit to an oversized or idle fleet and you have locked in waste at a discount. So the order is fixed: rightsize first, then commit. Rightsizing cleans the baseline — often returning 15-25% on its own — and only then is there a number worth committing to.
+The contested zone is how much to commit and for how long. Longer terms (three years) cut the most but lock hardest; a savings plan or reservation generally cannot be cancelled or resized mid-term. Cover the *verified floor* — the level the workload never drops below — and let variable and peak demand float on-demand, where elasticity is worth its premium.
+The decision rule: commit to the floor you are certain of, ladder the purchases (several small commitments across the year rather than one annual bet) so coverage tracks real growth, and bias longer terms to the stable core and shorter terms to the uncertain layer. Size to the minimum baseline, never to average or peak — basing a commitment on average usage is how a discount becomes a liability.
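The difference between sizing to the floor and sizing to the average shows up as paid-for capacity that sits idle. The sketch below uses a hypothetical usage series (instance-equivalents per sampling interval) and invented function names; it ignores pricing and term length and only illustrates the sizing rule.

```python
from statistics import mean


def verified_floor(usage):
    """The level the workload never dropped below in the observed window."""
    if not usage:
        raise ValueError("usage must not be empty")
    return min(usage)


def idle_commitment(usage, committed):
    """Committed capacity paid for but unused, summed over the window."""
    return sum(max(0, committed - used) for used in usage)


def ladder(total, tranches):
    """Split one commitment into equal purchases spread across the year."""
    return [total / tranches] * tranches


# Hypothetical, already-rightsized usage samples.
usage = [40, 42, 55, 90, 41, 60]

floor = verified_floor(usage)
waste_at_floor = idle_commitment(usage, floor)
waste_at_average = idle_commitment(usage, mean(usage))
purchases = ladder(floor, 4)
```

Committing to the floor leaves nothing idle by construction; committing to the average pays for unused capacity in every interval that falls below it.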
+### 8. FinOps is a practice, not an office
+Cost engineering is something every team does, not a team that does it on behalf of others. The central function provides tooling and visibility; the distributed decisions are made by the teams that built the spend.
+### 9. AI cost runs through a gateway; carbon is a cost too
+AI spend earns a dedicated control point: route model calls through an **AI gateway** that does model routing, semantic caching, fallback, and per-key budgets. The gateway is the difference between an experiment and a cost incident — it enforces the token discipline of Principle 6 at the edge, where every team's calls converge, rather than trusting each caller to do it alone.
+Carbon sits on the same ledger. Measure it with the Green Software Foundation's Software Carbon Intensity (SCI): the energy a workload draws, the carbon intensity of that energy, and embodied hardware emissions, divided by a functional unit (per call, per user). Express the target as a fitness function so it cannot regress silently.
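The SCI specification defines the score as SCI = ((E × I) + M) per R, where E is energy, I is grid carbon intensity, M is embodied emissions, and R is the functional unit. A minimal sketch of that formula and a fitness-function style check follows; the function names, the budget, and all input figures are hypothetical.

```python
def sci(energy_kwh: float,
        intensity_g_per_kwh: float,
        embodied_g: float,
        functional_units: float) -> float:
    """SCI = ((E * I) + M) per R, in gCO2eq per functional unit."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    return (energy_kwh * intensity_g_per_kwh + embodied_g) / functional_units


def within_carbon_budget(score: float, budget: float) -> bool:
    """Fitness-function check: fail the build when the score regresses."""
    return score <= budget


# Hypothetical month: 12 kWh drawn, 400 g/kWh grid, 1.2 kg embodied,
# amortised over 100,000 calls.
per_call = sci(12.0, 400.0, 1200.0, 100_000)
ok = within_carbon_budget(per_call, budget=0.08)
```

Run as part of CI or a scheduled job, a check like `within_carbon_budget` is what stops the target from regressing silently.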
+The contested zone is *how* you cut carbon, and the two levers are not equal. **Region-shifting** (placing or moving a workload onto a grid running cleaner energy) is usually the single largest lever; Microsoft's carbon-aware research finds the choice of region can dominate a workload's SCI. But it collides head-on with three other constraints: data-residency and privacy law may forbid the move, egress (Principle 5) can erase the saving, and distance adds user latency. **Time-shifting** (deferring flexible work to low-carbon windows using grid signals from WattTime or Electricity Maps, via tooling like the Green Software Foundation's Carbon Aware SDK) is safer but bounded; published studies find simple scheduling captures most of the available reduction and sophisticated policies add little on top.
+The decision rule follows the workload. Deferrable batch with no latency SLA (model training, reporting, async pipelines) → time-shift into clean windows. Stateless compute free of residency constraints → region-pin to a low-carbon region at deploy time, after checking the egress and latency bill. Latency-sensitive or residency-bound request paths → shift nothing; cut carbon by cutting waste, which is the same efficiency work that cuts cost.
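The "simple scheduling" that captures most of the time-shifting benefit can be as plain as choosing the cleanest contiguous window in a forecast. The sketch below assumes an hourly carbon-intensity forecast is already in hand as a list; it does not call WattTime, Electricity Maps, or the Carbon Aware SDK, and the function name and numbers are hypothetical.

```python
def cleanest_start(forecast, duration_hours):
    """Index of the start hour whose window has the lowest total intensity."""
    if duration_hours <= 0 or duration_hours > len(forecast):
        raise ValueError("duration must fit inside the forecast")
    window_total = sum(forecast[:duration_hours])
    best_start, best_total = 0, window_total
    for start in range(1, len(forecast) - duration_hours + 1):
        # Slide the window: add the hour entering, drop the hour leaving.
        window_total += forecast[start + duration_hours - 1] - forecast[start - 1]
        if window_total < best_total:
            best_start, best_total = start, window_total
    return best_start


# Hypothetical gCO2eq/kWh forecast for the next eight hours.
forecast = [450, 420, 300, 180, 160, 210, 380, 470]
start_hour = cleanest_start(forecast, duration_hours=2)
```

The length of the forecast passed in is the deadline: a deferrable batch job with no latency SLA can offer a long horizon, while a request path offers none and should not be shifted at all.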
+## How we apply this
+- [Observability](../quality/observability.md) — the measurement substrate for cost per unit.
+- [Platform](platform.md) — the shared infra that every team's cost sits on.
+- [Performance](../quality/performance.md) — cheap code is often also fast code.
+## Anti-patterns we reject
+- **"We will optimise cost later."** Later never comes; the architecture is what it is by then.
+- **Autoscale-and-forget.** Default autoscaling on a workload you have not profiled is how you get a thousand-dollar day.
+- **Commit before rightsizing.** A three-year commitment on an oversized fleet locks in waste at a discount, and the term cannot be unwound.
+- **Chatty logs forever.** Unstructured debug logs at volume are a non-trivial line on the bill.
+- **AI calls without budget.** Model spend without a measured cost-per-request grows silently until it surfaces as a surprise on the bill.
+- **"It's just pennies."** Pennies × N × daily = a real number. Track it.
+## Further reading
+- *Cloud FinOps*, Storment & Fuller — the canonical text on cross-functional cost management.
+- *AWS Well-Architected Framework — Cost Optimization pillar* — applicable beyond AWS, useful as a checklist.
+- *FinOps Foundation framework* ([finops.org](https://finops.org)) — the practitioner's handbook; see the Rate Optimization and committed-use-discount capabilities for the commitment-portfolio discipline behind Principle 7.
+- *Green Software Foundation — Software Carbon Intensity (SCI) specification & Carbon Aware SDK* ([greensoftware.foundation](https://greensoftware.foundation)) — how to measure carbon per functional unit and schedule work against grid signals.

package/src/docs/principles/delivery/day-2-operational-baseline.md ADDED

@@ -0,0 +1,57 @@
+---
+title: The Day-2 Operational Baseline
+description: The stack-agnostic bar every project clears to be operable, debuggable, and safe to change on day two — config validation, typed errors, a debug entry point, observability, graceful shutdown, a pure core, a fast test, and dev-CLI integration — plus the two rules that keep an off-script app honest.
+status: active
+last_reviewed: 2026-06-21
+---
+# The Day-2 Operational Baseline
+## TL;DR
+A project that boots on day one is not the same as a project a team can operate, debug, and change on day two. The gap between the two is a small, **stack-agnostic** set of properties — load-and-validate config, typed errors, a way to attach a debugger, telemetry, clean shutdown, a pure core, a test that runs in seconds, and integration with the project's dev CLI. A web service in Go, a native desktop app, and an embedded daemon owe every item on this list; they differ only in *how* each one is honoured, never in *whether*. This baseline is the bar a generated scaffold already clears, the checklist a forged stack is held to, and the work a first bet scopes in.
+## Why this matters
+GroundWork's value is a high-quality starting point. The paved-road generators bake this baseline in — clean architecture, a composition root, graceful shutdown, observability, a test harness — so a developer who runs one never has to think about it. The risk is everything *off* the paved road: a stack with no generator, or a generated project adapted past its template. The temptation there is to ship something that boots and call it done. A thing that boots but cannot be debugged, observed, or shut down cleanly is not a starting point — it is a liability handed over with a green checkmark.
+So the baseline is written stack-agnostic on purpose. It is the answer to "what does *good* mean when there is no template to copy?" — and the bar does not drop just because the road ran out.
+## The two rules
+These two rules are why the baseline exists, and they outrank convenience every time.
+### No empty capabilities
+Every affordance a project materialises must have real backing. A `./dev start` that starts nothing, a `/health` endpoint that always returns `ok`, a test medium with no surface behind it, a config flag nothing reads — each one reads as "covered" while covering nothing, and the next person trusts it. An inert capability is worse than an absent one, because absence is honest.
+The rule has a sharp edge for adapted tooling: when a shipped affordance does not fit the project (the classic case — a Docker-shaped `./dev start` in a project with no containers), the fix is to **adapt it to do something real or remove it**, never to leave it wired to nothing and never to build a parallel thing beside it. If a capability has no backing yet, say so plainly rather than shipping the hollow shell.
+### Off-script still lands well
+When the chosen stack has no paved path, the operational bar is unchanged. The baseline below is the contract: a native macOS app, a Rust daemon, and a Next.js frontend all owe config validation, typed errors, a debug entry point, telemetry, graceful shutdown, a pure core, a fast test, and dev-CLI integration. Each honours them in its own idiom (`os_log` is not OpenTelemetry, `lldb` is not Delve), but "this stack does it differently" is never "this stack skips it."
+## The baseline
+Each item states what the property is and why it earns its place. Most are universal; a few are conditional, and the condition is named. Where an item is genuinely not applicable, that is a valid answer — but it must be a *reasoned* answer recorded alongside the others, not a silent omission.
+1. **Configuration is loaded and validated at startup.** The process reads its configuration once, validates it, and refuses to boot with a clear message naming the missing or invalid value. *Why:* a process that starts with half a configuration fails later, deep inside a request or a job, where the cause is buried. Fail at the door, not in the dark.
+2. **Errors are typed and handled at the boundary.** The core raises meaningful, matchable errors; the thin shell maps them to the surface's vocabulary — an HTTP status, a process exit code, a user-facing dialog. No bare strings thrown as control flow, no failures swallowed into silence. *Why:* an error you cannot pattern-match is one you cannot handle, test, or alert on.
+3. **There is a debugging entry point.** A documented, one-command way to run the app under a debugger or with verbose diagnostic output, and logs a human can read at a glance. *Why:* the first thing a developer does on day two is reproduce a defect. If attaching a debugger is undiscovered territory, every investigation starts from cold — this is the single highest-leverage developer-experience affordance a seed can ship.
+4. **Observability is wired from the first commit.** Structured logs always; distributed traces and metrics where the target is networked or long-running. *Why:* you cannot operate what you cannot see, and retrofitting telemetry means re-touching every code path you already wrote. Observability is a design-time concern, not a later sprint. (Conditional: a pure local one-shot tool needs structured logs, not a tracing pipeline.) See [Observability](../quality/observability.md).
+5. **Shutdown is graceful.** The process traps termination signals, stops accepting new work, drains what is in flight, releases resources and long-lived connections, and exits cleanly. *Why:* a process killed mid-flight corrupts state and leaks connections — and the inner loop restarts the process constantly. (Conditional: a stateless one-shot command that holds no resources has nothing to drain; record that as the reason, not the omission.)
+6. **A pure core wrapped in a thin shell.** Decision logic carries no I/O; concrete dependencies sit behind abstractions the core owns, and no implementation detail leaks inward. *Why:* the core stays testable without infrastructure and swappable as the app grows, and there is one obvious place for every kind of code. See [How We Structure Code](../system-design/code-structure.md).
+7. **A test harness exists and runs in seconds.** The first test proves the wired-together app does something real — against the real dependency where it runs locally, not a mock of it — and the author or an agent runs it with one command. *Why:* verification, not generation, is now the inner loop's bottleneck; a seed with no fast, trustworthy test is a seed nobody can safely change. See [Testing](../foundations/testing.md).
+8. **The app is a first-class citizen of the dev CLI.** `start`, `stop`, `logs`, and `test` operate on it through the project's `./dev` CLI — registered as a managed service or runner, never a side process a developer starts by hand. *Why:* the golden path is one command to learn and one surface to improve; an app that lives outside it is friction every developer pays every day. See [Developer Experience](devex.md) and [Platform](platform.md).
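Items 1 and 2 of the list above can be sketched together: validate configuration at the door, raise a typed error from the core, and let the thin shell map it to the surface's vocabulary. The setting names, the `ConfigError` type, and the default port are hypothetical; exit code 78 follows the `sysexits.h` `EX_CONFIG` convention and is a choice, not a requirement of the baseline.

```python
from dataclasses import dataclass
from typing import Mapping


class ConfigError(Exception):
    """Typed, matchable error naming the offending setting."""

    def __init__(self, key: str, reason: str) -> None:
        super().__init__(f"config {key}: {reason}")
        self.key = key
        self.reason = reason


@dataclass(frozen=True)
class Config:
    database_url: str
    port: int


def load_config(env: Mapping[str, str]) -> Config:
    """Read once, validate, and fail with the name of the bad value."""
    url = env.get("DATABASE_URL", "")
    if not url:
        raise ConfigError("DATABASE_URL", "is required")
    raw_port = env.get("PORT", "8080")
    if not raw_port.isdecimal() or not 0 < int(raw_port) < 65536:
        raise ConfigError("PORT", f"must be 1-65535, got {raw_port!r}")
    return Config(database_url=url, port=int(raw_port))


def main(env: Mapping[str, str]) -> int:
    """Thin shell: map the typed error to a process exit code."""
    try:
        config = load_config(env)
    except ConfigError as err:
        print(f"refusing to start: {err}")
        return 78  # EX_CONFIG by sysexits.h convention
    print(f"listening on port {config.port}")
    return 0
```

In a real entry point the shell would call `main(os.environ)` and pass the result to `sys.exit`. Taking the environment as a parameter keeps the validation testable without touching process state, which is the pure-core property from item 6.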
+## How to apply it
+When scoping a new app — and especially a forged, off-script one — walk the baseline once and, for each item, record one of: *satisfied by the seed*, *scoped into a bet*, or *N/A because …*. The applicable, not-yet-built items become the first bets' work, so the project converges on the full baseline through the normal delivery loop rather than trying to land it all in the scaffold. The seed proves the shape; the delivery loop earns the depth.
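The walk described above produces a small record per baseline item, and the "reasoned answer, not a silent omission" rule is mechanical enough to check. The sketch below is one hypothetical shape for that record; the type names, item labels, and validation messages are assumptions, not a format this document prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SATISFIED_BY_SEED = "satisfied by the seed"
    SCOPED_INTO_BET = "scoped into a bet"
    NOT_APPLICABLE = "N/A"


@dataclass(frozen=True)
class BaselineEntry:
    item: str
    status: Status
    reason: str = ""


def problems(entries, required_items):
    """Flag silent omissions and N/A answers recorded without a reason."""
    found = []
    recorded = {entry.item for entry in entries}
    for item in required_items:
        if item not in recorded:
            found.append(f"{item}: no answer recorded")
    for entry in entries:
        if entry.status is Status.NOT_APPLICABLE and not entry.reason.strip():
            found.append(f"{entry.item}: N/A needs a reason")
    return found


def first_bet_work(entries):
    """Items that are applicable but not yet built."""
    return [e.item for e in entries if e.status is Status.SCOPED_INTO_BET]


# Hypothetical walk for a one-shot command-line tool.
walk = [
    BaselineEntry("config validation", Status.SATISFIED_BY_SEED),
    BaselineEntry("observability", Status.SCOPED_INTO_BET),
    BaselineEntry("graceful shutdown", Status.NOT_APPLICABLE),
]
issues = problems(walk, ["config validation", "observability",
                         "graceful shutdown", "fast test"])
```

Here `issues` reports the missing "fast test" answer and the unexplained N/A, while `first_bet_work(walk)` yields the items that flow into the first bets.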
+This document is the canon. The per-stack Day-2 checklists that scaffold and engineer skills carry are *elaborations* of this baseline in a specific idiom — when they and this page disagree, this page wins, and the checklist is the one to fix.

package/src/docs/principles/delivery/devex.md ADDED

@@ -0,0 +1,88 @@
+---
+title: Developer Experience
+description: Golden paths, paved roads, inner-loop optimisation, and a measurement stack — DORA for the system, DevEx for the human — that tells us whether the loop is healthy.
+status: active
+last_reviewed: 2026-06-19
+---
+# Developer Experience
+## TL;DR
+A team ships as fast as its feedback loop lets it. We invest deliberately in the inner loop — the seconds between a code change and the evidence that the change works — because every second saved there is paid back a thousand times over across the team. Your project's dev CLI is the golden path, the measurement stack (DORA for the delivery system, DevEx for the human in it) is how we tell whether the loop is healthy, and friction in the loop is an engineering bug.
+## Why this matters
+The single largest predictor of a team's output, over months and years, is the quality of its feedback loop. A team that sees the result of a change in five seconds ships more and ships better than a team that sees it in five minutes, not because the individuals are smarter, but because the loop of hypothesis-and-test can run up to sixty times as often.
+But the feedback loop is only one of three things that govern how it actually feels to build here. The research that named the field — *DevEx: What Actually Drives Productivity* (Noda, Storey, Forsgren & Greiler, ACM Queue, 2023) — isolates three dimensions: **feedback loops** (how fast and reliable the answers are), **cognitive load** (how much you must hold in your head to make a change), and **flow state** (whether you can stay in deep work without being yanked out of it). Optimising the loop while ignoring the other two buys a fast loop nobody can think straight inside. Developer experience is not a perk; it is an engineering lever, and it has more than one handle.
+## Our principles
+### 1. The inner loop is sacred
+The inner loop is the sequence from "I think this code will work" to "yes or no, here is the evidence." We invest in making this loop as short as it can be: incremental compilation, test selection, hot reload, one-command bootstrapping, fast linting. Every second shaved off the inner loop multiplies across every engineer, every day.
+The loop's centre of gravity has moved. When a coding agent can produce a plausible change in seconds, generation stops being the bottleneck and verification becomes it — the 2024 DORA report found AI adoption correlated with *lower* delivery throughput and stability, because machine-speed output floods a system built for human-speed review. So the inner loop we optimise is no longer "edit → compile"; it is "propose → prove." Fast, trustworthy local verification — a test suite the author and the agent can both run in seconds and believe — is now the highest-leverage second to shave. A loop that generates fast but verifies slowly is a regression dressed as progress.
+### 2. One entry point for local tasks — a facade, not a build system
+Every local task — start, stop, test, lint, migrate, deploy, generate — runs through a single dev CLI. One command to remember, one tool to teach a new engineer, one surface to improve. Proliferating ad-hoc scripts across `Makefile`, `package.json`, and `bin/` is how a developer experience becomes a treasure hunt.
+The value is the single discoverable surface, not custom machinery behind it. The dev CLI is a thin facade that delegates to the right standard tool — the monorepo task graph (Nx, Turborepo), a command runner (Task, just, Make), the package manager — never a bespoke build system reimplementing what those already do well. The failure mode is the opposite of fragmentation: a 4,000-line homegrown CLI that nobody but its author can change, with worse caching and worse error messages than the tools it wraps. Decision rule: wrap, don't reinvent. The CLI owns *discoverability and consistency*; the underlying tools own *execution*. If a subcommand contains real build logic rather than orchestration, that logic belongs in the task runner, not the wrapper.
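"Wrap, don't reinvent" can be sketched as a dispatcher that holds nothing but a routing table. This is an illustration, not the project's actual dev CLI: the subcommand names and the target names passed to Make, just, and Turborepo are hypothetical, and a real project would route to whichever tools it already uses.

```python
import subprocess

# Hypothetical routing table: the facade owns names, the tools own execution.
DELEGATES = {
    "test": ["make", "test"],
    "lint": ["just", "lint"],
    "build": ["npx", "turbo", "run", "build"],
}


def resolve(argv):
    """Translate `./dev <subcommand> [args...]` into the delegated command."""
    if not argv or argv[0] not in DELEGATES:
        known = ", ".join(sorted(DELEGATES))
        raise SystemExit(f"usage: ./dev <{known}> [args...]")
    return DELEGATES[argv[0]] + list(argv[1:])


def main(argv):
    """Run the delegated tool and pass its exit code straight through."""
    return subprocess.call(resolve(argv))


print(resolve(["test", "--verbose"]))
```

If an entry in the table ever needs more than a command line, that is the signal from the decision rule above: the logic belongs in the task runner, and the table keeps pointing at it.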
+### 3. Golden paths, not mandatory paths
+The golden path is the well-trodden, well-supported way to do a common task. It is the default, and it is the path new engineers and agents follow without thinking. Deviation is allowed when a task genuinely does not fit, but the deviator pays the cost of their own tooling. Golden paths concentrate investment; mandatory paths breed resentment and shadow tooling built to evade them.
+The trap is freezing the path. A golden path that stops absorbing the cases people actually hit becomes a mandatory path by neglect — everyone deviates, and the "default" is fiction. The path stays golden only if the escape hatches are watched: a deviation that recurs is a signal the path is too narrow, and the fix is to widen the path, not to scold the deviators.
+### 4. Measure the system with DORA, the human with DevEx — and never the individual
+The four DORA keys — deployment frequency, lead time for changes, change failure rate, time to recover — measure the health of the *delivery system*. A fifth, operational reliability, measures whether what you ship stays up. We track them, surface them, and treat a regression in any one as a signal to invest in the loop.
+But DORA measures throughput and stability; it is silent on whether the work is sustainable or the engineers are drowning. That is the DevEx layer — feedback loops, cognitive load, flow state — which the four keys cannot see. The lineage matters: DORA (2018) → SPACE (2021) → DevEx (2023) → DX Core 4 (Tacho & Noda, 2024), which folds all three into one balanced scorecard across speed, effectiveness, quality, and impact. Use the system metrics to find *where* delivery hurts and the human metrics to find *why*. Two non-negotiable guardrails: never reduce these to a single number, and never attribute any of them to an individual. The moment a metric becomes a performance target it gets gamed (Goodhart's law) — deploy frequency inflates with trivial commits, change-failure rate drops because nobody logs incidents. Metrics are instruments for the team to steer by, not a stick to measure people with.
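Three of the system-level keys fall out of a plain deploy log. The sketch below is a minimal, team-level calculation; the `Deploy` record and its fields are hypothetical, and it deliberately carries no author field so the numbers cannot be attributed to an individual.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool


def deployment_frequency(deploys, window_days):
    """Deploys per day over the observation window."""
    return len(deploys) / window_days


def median_lead_time(deploys):
    """Median time from commit to running in production."""
    return median(d.deployed_at - d.committed_at for d in deploys)


def change_failure_rate(deploys):
    """Share of deploys that led to a failure in production."""
    return sum(1 for d in deploys if d.caused_failure) / len(deploys)


# Hypothetical week of deploys with lead times of 1, 2, 3, and 4 hours.
base = datetime(2026, 1, 5, 9, 0)
log = [
    Deploy(base, base + timedelta(hours=hours), caused_failure=(hours == 4))
    for hours in (1, 2, 3, 4)
]
lead_time = median_lead_time(log)
failure_rate = change_failure_rate(log)
per_day = deployment_frequency(log, window_days=7)
```

These are instruments for the team to steer by, so the three values are reported side by side rather than folded into a single score.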
+### 5. Onboarding time-to-first-value is a measured target
+A new engineer should reach their first local contribution — "I changed something and I can see the change" — fast, and a new service should reach its first deploy early. We set explicit targets (a first contribution inside the first day, a first deploy inside the first week are good defaults) and we *measure against them* rather than assume them. The number is calibrated to the domain: a CRUD service and a system with deep regulatory or numerical complexity will not share a target, and pretending otherwise just makes the metric a lie. What is universal is the discipline — the target is written down, the actual time is observed, and a regression is treated as a bug in the onboarding system, not a failing of the new hire.
+### 6. Documentation is part of the loop
+A command you cannot find is a command you do not use. Every CLI subcommand has a reference entry, every golden path has a guide, every service has a handbook. Documentation lives next to the thing it documents and is generated from the source of truth wherever possible — `--help` output, schema, config — because prose that drifts from reality is worse than no prose: it actively misleads. The test is not "does a doc exist" but "can a new engineer, or an agent, find and trust it without asking a human."
+### 7. Match the production shapes that bite — not the whole topology
+The gaps that cause "it works on my machine" are specific: a different database engine or version, a message broker with different ordering and delivery semantics, an auth contract that behaves differently, a different container runtime. Close *those* — same engine, same contract, same semantics — and the class of bug disappears. Emulation over mocks ([Testing](../foundations/testing.md)) applies: emulate a dependency you own the contract with; mock only at the seam of a dependency you do not.
+Full production *shape* is not full production *scale or topology*, and chasing the latter locally is a losing trade — you cannot run a hundred-node cluster, real traffic, or every downstream service on a laptop, and the attempt produces a brittle, slow local stack that drifts anyway. When fidelity costs more than a laptop can pay, the answer is to move the environment, not fake it: ephemeral preview environments and cloud development environments (Codespaces, Gitpod, Coder) give real production shape on real infrastructure, at the price of network latency in the inner loop. Decision rule: reproduce locally the contracts and semantics that produce correctness bugs; reproduce in a remote or ephemeral environment the scale and topology that produce systemic bugs; do not try to do either in the wrong place.
+### 8. Cognitive load is the hidden tax
+The slowest part of a change is often not compiling or testing — it is the time spent figuring out *where* the change goes and *what else* it touches. Extraneous cognitive load (sprawling configuration, leaky abstractions, ten ways to do one thing, knowledge that lives only in someone's head) is a direct, compounding tax on every change, and it is invisible in delivery metrics until it surfaces as slow lead times and burnout. We treat it as a design constraint: consistent project shape, one obvious way to do common things, generated scaffolding so the structure is given rather than rediscovered, and ruthless deletion of the second and third way to do anything. Protecting flow is the same discipline applied to time — batched reviews, asynchronous defaults, and CI that does not demand babysitting keep engineers in deep work instead of context-switching out of it.
+### 9. Friction is filed as a bug
+If a process is painful, that pain is a bug. File it, prioritise it, fix it. "Everyone deals with it" is how chronic friction becomes chronic velocity loss. Whoever maintains the dev tooling owns that backlog the same way a product team owns its user-bug backlog — because the developers are the users, and the dev platform is the product.
+## How we apply this
+- [Platform](platform.md) — the broader internal platform the dev CLI is a part of.
+- [Progressive Delivery](progressive-delivery.md) — the outer loop the inner loop feeds into.
+## Anti-patterns we reject
+- **"Follow the README and read between the lines."** Onboarding that depends on tacit knowledge is not onboarding.
+- **Five CLIs for five tasks — or one CLI that reinvents the build.** One unified facade is the default. A second CLI earns its existence by solving a problem the first cannot; a homegrown build system hiding behind the facade is the same fragmentation wearing a disguise.
+- **Skip-the-test culture.** Fast-but-unreliable tests are worse than slow-reliable ones: a flaky suite teaches the team to ignore red, which is strictly worse than a slow suite they trust. The inner loop is made fast by honest investment, not by cheating — and a verification loop nobody trusts is no loop at all.
+- **DORA theatre.** Tracking the metric while not responding to it is worse than not tracking it. Ranking individuals by it is worse still.
+- **The frozen golden path.** A default that no longer fits the work people do is a mandatory path everyone routes around. Watch the deviations; widen the path.
+- **Ignoring friction.** If you find a sharp edge, file the ticket. Do not route around it silently.
+## Further reading
+- *Accelerate*, Forsgren, Humble, Kim — the empirical foundation for the DORA metrics.
+- *The DevOps Handbook*, Kim et al. — the full treatment of the inner-and-outer loop view.
+- *DevEx: What Actually Drives Productivity*, Noda, Storey, Forsgren & Greiler (ACM Queue, 2023) — the feedback-loops / cognitive-load / flow-state model.
+- *The DX Core 4* (Tacho & Noda, 2024) — the unified scorecard folding DORA, SPACE, and DevEx into one framework.
+- *Team Topologies*, Skelton & Pais — the organisational side of platform and golden paths.
+- *Developer Experience: Concept and Definition* (Fagerholm & Münch, 2012) — the academic framing that predates the modern DevEx term.