@vextlabs/theron-cli 0.2.1 → 0.4.0

Files changed (191)
  1. package/dist/api.d.ts +8 -0
  2. package/dist/api.js +3 -0
  3. package/dist/api.js.map +1 -1
  4. package/dist/auth.js +51 -1
  5. package/dist/auth.js.map +1 -1
  6. package/dist/banner.js +3 -2
  7. package/dist/banner.js.map +1 -1
  8. package/dist/checkpoints.d.ts +32 -0
  9. package/dist/checkpoints.js +61 -0
  10. package/dist/checkpoints.js.map +1 -0
  11. package/dist/index.js +61 -5
  12. package/dist/index.js.map +1 -1
  13. package/dist/input.d.ts +61 -0
  14. package/dist/input.js +574 -0
  15. package/dist/input.js.map +1 -0
  16. package/dist/profiles/index.js +5 -0
  17. package/dist/profiles/index.js.map +1 -1
  18. package/dist/profiles/methodologies/build_domains.d.ts +6 -0
  19. package/dist/profiles/methodologies/build_domains.js +170 -0
  20. package/dist/profiles/methodologies/build_domains.js.map +1 -0
  21. package/dist/profiles/methodologies/operate_domains.d.ts +8 -0
  22. package/dist/profiles/methodologies/operate_domains.js +1239 -0
  23. package/dist/profiles/methodologies/operate_domains.js.map +1 -0
  24. package/dist/profiles/methodologies/regulated_domains.d.ts +6 -0
  25. package/dist/profiles/methodologies/regulated_domains.js +153 -0
  26. package/dist/profiles/methodologies/regulated_domains.js.map +1 -0
  27. package/dist/profiles/methodologies/research_domains.d.ts +8 -0
  28. package/dist/profiles/methodologies/research_domains.js +179 -0
  29. package/dist/profiles/methodologies/research_domains.js.map +1 -0
  30. package/dist/profiles/methodologies/strategy_domains.d.ts +15 -0
  31. package/dist/profiles/methodologies/strategy_domains.js +193 -0
  32. package/dist/profiles/methodologies/strategy_domains.js.map +1 -0
  33. package/dist/profiles/seeds.js +241 -95
  34. package/dist/profiles/seeds.js.map +1 -1
  35. package/dist/receipt.d.ts +17 -0
  36. package/dist/receipt.js +46 -0
  37. package/dist/receipt.js.map +1 -0
  38. package/dist/render.d.ts +4 -1
  39. package/dist/render.js +95 -28
  40. package/dist/render.js.map +1 -1
  41. package/dist/repl.d.ts +8 -1
  42. package/dist/repl.js +420 -62
  43. package/dist/repl.js.map +1 -1
  44. package/dist/sessions.d.ts +14 -0
  45. package/dist/sessions.js +100 -0
  46. package/dist/sessions.js.map +1 -1
  47. package/dist/ship.d.ts +2 -0
  48. package/dist/ship.js +62 -0
  49. package/dist/ship.js.map +1 -0
  50. package/dist/skills/catalog.d.ts +13 -0
  51. package/dist/skills/catalog.js +86 -0
  52. package/dist/skills/catalog.js.map +1 -0
  53. package/dist/tools/bash.js +81 -14
  54. package/dist/tools/bash.js.map +1 -1
  55. package/dist/tools/edit.js +21 -1
  56. package/dist/tools/edit.js.map +1 -1
  57. package/dist/tools/glob.js +4 -1
  58. package/dist/tools/glob.js.map +1 -1
  59. package/dist/tools/grep.d.ts +5 -0
  60. package/dist/tools/grep.js +101 -2
  61. package/dist/tools/grep.js.map +1 -1
  62. package/dist/tools/index.d.ts +22 -0
  63. package/dist/tools/index.js +177 -41
  64. package/dist/tools/index.js.map +1 -1
  65. package/dist/tools/ls.d.ts +3 -0
  66. package/dist/tools/ls.js +23 -12
  67. package/dist/tools/ls.js.map +1 -1
  68. package/dist/tools/multiedit.d.ts +12 -0
  69. package/dist/tools/multiedit.js +79 -0
  70. package/dist/tools/multiedit.js.map +1 -0
  71. package/dist/tools/stoa.d.ts +1 -1
  72. package/dist/tools/stoa.js +7 -3
  73. package/dist/tools/stoa.js.map +1 -1
  74. package/dist/tools/task.d.ts +9 -0
  75. package/dist/tools/task.js +166 -0
  76. package/dist/tools/task.js.map +1 -0
  77. package/dist/tools/todowrite.d.ts +12 -0
  78. package/dist/tools/todowrite.js +38 -0
  79. package/dist/tools/todowrite.js.map +1 -0
  80. package/dist/tools/webfetch.d.ts +6 -0
  81. package/dist/tools/webfetch.js +98 -0
  82. package/dist/tools/webfetch.js.map +1 -0
  83. package/dist/tools/websearch.d.ts +7 -0
  84. package/dist/tools/websearch.js +83 -0
  85. package/dist/tools/websearch.js.map +1 -0
  86. package/dist/tools/write.js +17 -1
  87. package/dist/tools/write.js.map +1 -1
  88. package/dist/verifiers/calc_gate.d.ts +2 -0
  89. package/dist/verifiers/calc_gate.js +112 -0
  90. package/dist/verifiers/calc_gate.js.map +1 -0
  91. package/dist/verifiers/citation_gate.d.ts +2 -0
  92. package/dist/verifiers/citation_gate.js +130 -0
  93. package/dist/verifiers/citation_gate.js.map +1 -0
  94. package/dist/verifiers/confidence_marked.d.ts +2 -0
  95. package/dist/verifiers/confidence_marked.js +49 -0
  96. package/dist/verifiers/confidence_marked.js.map +1 -0
  97. package/dist/verifiers/disclaimer_gate.d.ts +2 -0
  98. package/dist/verifiers/disclaimer_gate.js +57 -0
  99. package/dist/verifiers/disclaimer_gate.js.map +1 -0
  100. package/dist/verifiers/evidence_gate.d.ts +2 -0
  101. package/dist/verifiers/evidence_gate.js +108 -0
  102. package/dist/verifiers/evidence_gate.js.map +1 -0
  103. package/dist/verifiers/index.d.ts +5 -0
  104. package/dist/verifiers/index.js +28 -7
  105. package/dist/verifiers/index.js.map +1 -1
  106. package/dist/verifiers/lint.js +4 -3
  107. package/dist/verifiers/lint.js.map +1 -1
  108. package/dist/verifiers/promoted_kernels.d.ts +8 -0
  109. package/dist/verifiers/promoted_kernels.js +190 -0
  110. package/dist/verifiers/promoted_kernels.js.map +1 -0
  111. package/dist/verifiers/source_gate.d.ts +2 -0
  112. package/dist/verifiers/source_gate.js +125 -0
  113. package/dist/verifiers/source_gate.js.map +1 -0
  114. package/dist/verifiers/test_smoke.js +30 -0
  115. package/dist/verifiers/test_smoke.js.map +1 -1
  116. package/dist/verifiers/types.d.ts +3 -0
  117. package/package.json +4 -2
  118. package/skills/README.md +123 -0
  119. package/skills/ab-test.md +89 -0
  120. package/skills/api-design.md +175 -0
  121. package/skills/architecture-design.md +185 -0
  122. package/skills/business-case.md +77 -0
  123. package/skills/causal-inference.md +77 -0
  124. package/skills/clinical-guideline.md +98 -0
  125. package/skills/code-review.md +98 -0
  126. package/skills/cold-outreach.md +268 -0
  127. package/skills/competitive-teardown.md +223 -0
  128. package/skills/component-spec.md +121 -0
  129. package/skills/content-calendar.md +280 -0
  130. package/skills/contract-review.md +155 -0
  131. package/skills/data-analysis.md +187 -0
  132. package/skills/debug.md +91 -0
  133. package/skills/design-audit.md +121 -0
  134. package/skills/differential-diagnosis.md +79 -0
  135. package/skills/discovery-call.md +206 -0
  136. package/skills/edit-pass.md +80 -0
  137. package/skills/engineering-calc.md +101 -0
  138. package/skills/estimate.md +70 -0
  139. package/skills/experiment-design.md +105 -0
  140. package/skills/fact-check.md +82 -0
  141. package/skills/financial-model.md +104 -0
  142. package/skills/grant-proposal.md +93 -0
  143. package/skills/harmony-analysis.md +93 -0
  144. package/skills/hypothesis-generation.md +99 -0
  145. package/skills/incident-response.md +134 -0
  146. package/skills/interview-loop.md +62 -0
  147. package/skills/job-scorecard.md +92 -0
  148. package/skills/kb-article.md +174 -0
  149. package/skills/launch-plan.md +85 -0
  150. package/skills/lease-review.md +93 -0
  151. package/skills/lesson-plan.md +198 -0
  152. package/skills/literature-review.md +69 -0
  153. package/skills/market-entry.md +137 -0
  154. package/skills/market-sizing.md +159 -0
  155. package/skills/meta-analysis.md +140 -0
  156. package/skills/migrate.md +117 -0
  157. package/skills/optimize.md +88 -0
  158. package/skills/options-strategy.md +166 -0
  159. package/skills/peer-review.md +96 -0
  160. package/skills/pentest-plan.md +193 -0
  161. package/skills/pitch-review.md +132 -0
  162. package/skills/plan.md +88 -0
  163. package/skills/policy-brief.md +124 -0
  164. package/skills/positioning.md +192 -0
  165. package/skills/postmortem.md +168 -0
  166. package/skills/prd.md +105 -0
  167. package/skills/prioritize.md +162 -0
  168. package/skills/proof.md +91 -0
  169. package/skills/property-underwrite.md +159 -0
  170. package/skills/recipe-develop.md +109 -0
  171. package/skills/red-team.md +142 -0
  172. package/skills/refactor.md +58 -0
  173. package/skills/reflection-session.md +115 -0
  174. package/skills/regulatory-compliance.md +136 -0
  175. package/skills/reproduce.md +87 -0
  176. package/skills/runbook.md +344 -0
  177. package/skills/security-audit.md +154 -0
  178. package/skills/seo-brief.md +201 -0
  179. package/skills/sql-query.md +161 -0
  180. package/skills/story-craft.md +163 -0
  181. package/skills/tdd.md +59 -0
  182. package/skills/term-sheet.md +298 -0
  183. package/skills/theory-of-change.md +88 -0
  184. package/skills/threat-model.md +104 -0
  185. package/skills/ticket-triage.md +200 -0
  186. package/skills/tolerance-analysis.md +149 -0
  187. package/skills/training-program.md +151 -0
  188. package/skills/translate.md +64 -0
  189. package/skills/unit-economics.md +238 -0
  190. package/skills/valuation.md +112 -0
  191. package/skills/write-tests.md +77 -0
@@ -0,0 +1,89 @@
1
+ ---
2
+ name: ab-test
3
+ description: Design and analyze a controlled A/B test end-to-end — covering metric + MDE definition, power/sample-size/duration, randomization unit, pre-registration, SRM + novelty checks, effect estimation with confidence intervals, multiple-comparison correction, and a go/no-go decision; invoke when a user wants to run or evaluate any controlled experiment on product, model, UI, pricing, ranking, or policy changes.
4
+ allowed-tools: Read, Bash, Write
5
+ ---
6
+
7
+ ═══ HARD RULES ═══
8
+
9
+ 1. NEVER peek and stop early. The moment you run the primary analysis is fixed at pre-registration; stopping because p < 0.05 appeared inflates the false-positive rate well above the nominal 5%: roughly 8% with two unplanned looks, about 14% with five, and close to 25% with twenty. If continuous monitoring is required, use a sequential/SPRT design with a pre-registered α-spending function.
10
+ 2. NEVER change the primary metric after data collection begins. Secondary metrics may be added to the exploratory tier but never promoted to primary post-hoc.
11
+ 3. NEVER pool variants that "look similar" post-hoc. Each planned pairwise comparison is pre-registered; unplanned merges are p-hacking.
12
+ 4. NEVER interpret "no statistical significance" as "no effect." Report the 95% CI width against the MDE to characterize precisely what was and was not ruled out.
13
+ 5. NEVER skip the Sample Ratio Mismatch (SRM) check. An SRM invalidates causal inference regardless of how compelling the metric lift looks.
14
+ 6. NEVER report a point estimate without its confidence interval and the achieved power at the observed effect size.
15
+ 7. NEVER run the experiment shorter than the pre-registered duration. Minimum is max(computed_duration, 14 days) to capture at least two full weekly seasonal cycles. The 14-day floor is firm; 7 days is not sufficient.
16
+
17
+ ═══ PHASE A — DEFINE THE QUESTION AND PRIMARY METRIC ═══
18
+
19
+ A1. Write a single falsifiable hypothesis: "Changing [X] will increase/decrease [metric] by at least [MDE] for [population] within [time window]."
20
+ A2. Select ONE primary metric that is (a) directly causally linked to the treatment, (b) measurable at the randomization unit without aggregation ambiguity, (c) has a known historical distribution (mean, variance, or proportion) from at least 4 weeks of pre-experiment data. Do not accept a composite "North Star" metric as primary unless it has a closed-form variance estimator — composites typically require the delta method or bootstrap.
21
+ A3. Enumerate secondary metrics: supporting metrics expected to move in the same direction as the primary, and guardrail metrics that must not degrade beyond a pre-specified absolute threshold. Guardrail violations trigger a hard NO-GO regardless of the primary metric result.
22
+ A4. Specify the Minimum Detectable Effect (MDE): the smallest effect that is business-meaningful, not the smallest the test can detect at current traffic. Express in absolute terms (e.g., +0.5 pp conversion) AND relative terms (e.g., +3.3% lift on a 15% base rate). If the team cannot articulate the MDE, halt here — no power calculation is meaningful without it.
23
+ A5. Document the null and alternative hypotheses formally. Default to two-sided unless a directional prior is pre-registered before any data is seen and the one-sided framing is explicitly justified (e.g., treatment can only improve latency, never harm it).
24
+
25
+ ═══ PHASE B — POWER, SAMPLE SIZE, AND DURATION ═══
26
+
27
+ B1. Choose significance level α (default 0.05, two-sided) and power 1−β (default 0.80; use 0.90 for high-stakes or irreversible rollout decisions where a missed effect is costly).
28
+ B2. Compute required sample size N per variant using the appropriate formula for the metric type:
29
+ - Proportion (conversion, CTR): n = 2 · (z_{α/2} + z_β)² · p̄(1−p̄) / δ² where p̄ = (p_control + p_treatment)/2, δ = MDE in absolute pp.
30
+ - Continuous (revenue, latency, session length): n = 2 · (z_{α/2} + z_β)² · σ² / δ² where σ² is the per-unit variance from historical data.
31
+ - Ratio metrics (revenue-per-user, clicks-per-impression): apply the delta method to estimate variance from historical data; do not use the naive ratio of aggregate means as the estimator.
32
+ B3. Compute duration: duration (days) = N_total / (daily_eligible_users × traffic_fraction). Apply the hard floor: required_duration = max(computed_duration, 14 days). Document the traffic ramp schedule and whether the 14-day floor or the computed duration is binding.
33
+ B4. If computed duration exceeds 8 weeks, revisit the design rather than relaxing α. Options in priority order: (1) increase the MDE to a still-meaningful but larger threshold and document the business rationale, (2) apply CUPED variance reduction (see Phase G) which can reduce N by 30–70% when pre-experiment correlation ρ > 0.4, (3) switch to a pre-registered sequential/SPRT design with an O'Brien-Fleming or Pocock α-spending function. Do not simply raise α to 0.10 to make the test shorter — that inflates false positives without reducing required N meaningfully for proportions.
34
+ B5. Document all power-analysis inputs: baseline rate or mean, variance source and historical window, MDE, α, β, traffic fraction, ramp schedule, and the N calculation tool or script used.
35
+
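The proportion formula in B2 and the duration rule in B3 can be sketched as follows. This is a minimal illustration, not part of the package: the function names are hypothetical, the z-values are the standard normal quantiles for the default α = 0.05 (two-sided) and power 0.80, and the traffic figures are made up.

```typescript
// Standard normal quantiles for the B1 defaults (passed in, not derived).
const Z_ALPHA_HALF = 1.959964; // z_{α/2} for α = 0.05, two-sided
const Z_BETA = 0.841621; // z_β for power 0.80

// B2, proportion metric: n = 2 · (z_{α/2} + z_β)² · p̄(1−p̄) / δ²
function sampleSizePerVariant(
  pControl: number,
  mdeAbs: number, // δ as an absolute difference (0.005 = 0.5 pp)
  zAlphaHalf: number = Z_ALPHA_HALF,
  zBeta: number = Z_BETA,
): number {
  const pTreatment = pControl + mdeAbs;
  const pBar = (pControl + pTreatment) / 2;
  const n = (2 * (zAlphaHalf + zBeta) ** 2 * pBar * (1 - pBar)) / mdeAbs ** 2;
  return Math.ceil(n);
}

// B3: duration = N_total / (daily_eligible_users × traffic_fraction), floored at 14 days.
function requiredDurationDays(
  nPerVariant: number,
  variantCount: number,
  dailyEligibleUsers: number,
  trafficFraction: number,
): number {
  const nTotal = nPerVariant * variantCount;
  const computed = Math.ceil(nTotal / (dailyEligibleUsers * trafficFraction));
  return Math.max(computed, 14);
}

// Illustrative inputs: 15% base rate, +0.5 pp MDE, 50,000 eligible users/day at 50% traffic.
const nPerVariant = sampleSizePerVariant(0.15, 0.005);
const durationDays = requiredDurationDays(nPerVariant, 2, 50000, 0.5);
console.log({ nPerVariant, durationDays });
```

With these inputs the computed duration is shorter than two weeks, so the 14-day floor is the binding constraint; B3 asks for that to be documented.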
36
+ ═══ PHASE C — RANDOMIZATION DESIGN ═══
37
+
38
+ C1. Choose the randomization unit: the entity that receives a consistent treatment for the experiment's full duration. For user-facing features use user_id; for infrastructure or request-level experiments use request_id only if there is strictly zero carry-over between requests (stateless path); for marketplace experiments use the side of the market that controls the outcome variable (supply or demand, not both simultaneously).
39
+ C2. Validate unit independence: if treatment on unit A changes outcomes for unit B (network effects, shared recommendation pools, leaderboards, shared cache), switch to cluster randomization and recompute N with the design effect: DEFF = 1 + (m−1)ρ, where m = mean cluster size and ρ = intra-cluster correlation estimated from historical data.
40
+ C3. Use a stable deterministic hashing function (e.g., SHA256 of experiment_id + unit_id modulo num_buckets) for assignment. Never use random(seed=timestamp) — it creates re-assignment on page reload or service restart. Isolate the experiment namespace: use a per-experiment salt (e.g., hash(experiment_id + unit_id)) so that bucket assignments across concurrent experiments are statistically independent.
41
+ C4. Stratify or block on major confounders (platform OS, country, user cohort, account age decile) when they explain >20% of primary metric variance; stratification reduces required N or increases effective power at no traffic cost.
42
+ C5. Holdout rule: units assigned at bucketing time remain in their assigned bucket for the full experiment window, even if they become ineligible mid-experiment (intent-to-treat analysis). Exclusions must be pre-registered, not reactive.
43
+
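The assignment rule in C3 can be sketched as follows. To keep the sketch dependency-free, it swaps in FNV-1a (32-bit) for the SHA256 named above; a production system would likely use a cryptographic hash. The function names, the bucket count, and the equal-split mapping are assumptions for illustration.

```typescript
// FNV-1a, 32-bit: a stable, deterministic hash used here only as a stand-in for SHA256.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

function assignVariant(
  experimentId: string,
  unitId: string,
  variants: readonly string[],
  numBuckets: number = 1000,
): string {
  // The experiment id is hashed together with the unit id, so it acts as a
  // per-experiment salt and different experiments bucket units differently.
  const bucket = fnv1a32(`${experimentId}:${unitId}`) % numBuckets;
  // Equal-width bucket ranges, one per variant.
  const index = Math.floor((bucket * variants.length) / numBuckets);
  return variants[index];
}

const arms = ["control", "treatment"] as const;
const firstCall = assignVariant("exp-checkout-v2", "user-42", arms);
const secondCall = assignVariant("exp-checkout-v2", "user-42", arms);
console.log(firstCall === secondCall); // same inputs always give the same arm
```

Because the assignment depends only on the two ids, a page reload or service restart cannot move a unit to a different arm.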
44
+ ═══ PHASE D — PRE-REGISTRATION ═══
45
+
46
+ D1. Before any treatment is exposed to users, write and lock a pre-registration document containing: hypothesis (A1), primary metric + formula, secondary metrics and guardrail thresholds, randomization unit and assignment mechanism, N per variant, planned duration and end date, analysis method (estimator + CI method), multiple-comparison correction method, and explicit decision criteria (what result → SHIP / ITERATE / KILL / EXTEND).
47
+ D2. Commit this document to version control with a timestamp before the experiment launches. The commit hash is the audit trail. Store in a shared experiment log directory indexed by experiment_id.
48
+ D3. Specify the analysis date and trigger explicitly: "Primary analysis will be run on [ISO date] after [N] unique units have been exposed for at least 24 hours, whichever is later."
49
+ D4. For sequential designs, pre-register the α-spending function (e.g., O'Brien-Fleming), all planned look times, and the maximum sample size at which the experiment is stopped regardless of result.
50
+
51
+ ═══ PHASE E — INSTRUMENTATION AND LAUNCH ═══
52
+
53
+ E1. Verify logging before any traffic is exposed: confirm that assignment events (unit_id, variant, timestamp, experiment_id) and all outcome events are being emitted, persisted, and joinable on the same key. Run a 24-hour shadow log with zero traffic to validate the pipeline end-to-end.
54
+ E2. Ramp gradually: launch at 1–5% of eligible traffic, hold for at least one hour, and check error rates and guardrail metrics against baseline before expanding. This ramp check is an operational health check — it is NOT an A/A significance test and must not be used to make an early ship/kill call.
55
+ E3. Monitor guardrail metrics daily via simple threshold alerts, not p-value gates. Monitoring the primary metric during the run is peeking and inflates false-positive rate (Rule 1).
56
+ E4. Log all external events that could confound results (marketing campaigns, infrastructure incidents, seasonality events, competitor product launches) in the shared experiment log with timestamps. These are required for the post-experiment validity review.
57
+
58
+ ═══ PHASE F — SRM AND VALIDITY CHECKS ═══
59
+
60
+ F1. Sample Ratio Mismatch (SRM): at analysis time, run a chi-squared test on observed assignment counts vs. expected counts under the target split. If p < 0.01, the experiment is causally invalid — do not analyze the primary metric until the SRM is explained and resolved. Common causes: bot or crawler traffic included in assignment, logging loss asymmetric across variants, sticky-bucketing failures, exposure logging before assignment completes, or assignment-to-logging lag differences.
61
+ F2. Pre-experiment A/A balance check: using only pre-experiment data, compare the primary metric baseline across the assigned variants (pseudo-randomize on historical units using the same hashing function). A significant imbalance (p < 0.05) on a pre-experiment window metric is a red flag for assignment mechanism failure, not a genuine treatment effect.
62
+ F3. Novelty and primacy effect check: segment results by experiment day and by user tenure (new vs. established users). If the treatment lift decays monotonically over the experiment window or is concentrated entirely in the first 3 days, the effect is likely novelty-driven. Require the effect to be present and non-trivial in the final third of the experiment window's calendar days before classifying it as durable.
63
+ F4. SUTVA (Stable Unit Treatment Value Assumption) check: compare outcomes of control units that interacted with treated units vs. control units that did not. A statistically significant difference signals interference violations; escalate to a cluster or switchback experiment design.
64
+
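The SRM check in F1 can be sketched for the two-variant case as follows. With two arms the chi-squared statistic has one degree of freedom, so its p-value equals 1 − erf(√(χ²/2)). The erf approximation (Abramowitz and Stegun 7.1.26, absolute error below 1.5e-7) is an assumption made to avoid a statistics library; it is accurate enough for the 0.01 threshold but not for very small p-values. All names are hypothetical.

```typescript
// Abramowitz-Stegun 7.1.26 approximation of the error function.
function erfApprox(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) *
    t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

interface SrmResult {
  chi2: number;
  pValue: number;
  srmDetected: boolean;
}

// Chi-squared goodness-of-fit on assignment counts for two variants (1 degree of freedom).
function srmCheck(
  nControl: number,
  nTreatment: number,
  expectedControlShare: number = 0.5,
): SrmResult {
  const total = nControl + nTreatment;
  const expControl = total * expectedControlShare;
  const expTreatment = total - expControl;
  const chi2 =
    (nControl - expControl) ** 2 / expControl +
    (nTreatment - expTreatment) ** 2 / expTreatment;
  const pValue = 1 - erfApprox(Math.sqrt(chi2 / 2));
  return { chi2, pValue, srmDetected: pValue < 0.01 }; // threshold from F1
}

const skewedSplit = srmCheck(50000, 51000);
const balancedSplit = srmCheck(50100, 49900);
console.log(skewedSplit.srmDetected, balancedSplit.srmDetected);
```

A 50,000 vs. 51,000 split looks small in relative terms, yet it fails the check; that is the situation F1 warns about.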
65
+ ═══ PHASE G — EFFECT ESTIMATION ═══
66
+
67
+ G1. Use the pre-registered estimator exclusively. For proportions: difference in proportions with normal approximation CI (use exact Fisher or Wilson CI if n < 1000 per variant). For continuous metrics: two-sample Welch t-test or OLS with pre-registered covariates. Do not switch estimators post-hoc.
68
+ G2. CUPED variance reduction (Controlled-experiment Using Pre-Experiment Data): θ̂_CUPED = (ȳ_T − ȳ_C) − θ·(x̄_T − x̄_C), where θ = Cov(Y, X) / Var(X) estimated from the CONTROL ARM ONLY (not pooled, to avoid contamination), and X is the pre-experiment value of the primary metric for the same unit. CUPED reduces variance; it does not change the point estimate in expectation, only narrows the CI. Apply only if pre-registered.
69
+ G3. Report all of the following: point estimate, 95% CI (lower, upper), two-sided p-value, achieved power at the observed effect size (to quantify what was ruled out), and the absolute business impact at current volume (e.g., "+0.4 pp CTR; at 30M daily eligible users = +120K additional clicks/day").
70
+ G4. For ratio metrics (revenue/user, clicks/impression): compute SE using the delta method — Var(R) ≈ (1/μ_D²)·[Var(N) + R²·Var(D) − 2R·Cov(N,D)] — or bootstrap with ≥1000 iterations. Never use the naive ratio of aggregate sums as the standard error input.
71
+ G5. The CI is the primary statistical output. The p-value is secondary and must always be reported alongside the CI, never in isolation.
72
+
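The CUPED estimator in G2 can be sketched as follows, with θ estimated from the control arm only, as the text specifies. The data shapes and names are hypothetical, and the tiny arrays exist only to make the arithmetic easy to follow; the sketch returns the point estimate and leaves the CI computation out.

```typescript
interface UnitObservation {
  y: number; // primary metric during the experiment
  x: number; // same metric for the same unit, measured before the experiment
}

const average = (values: number[]): number =>
  values.reduce((sum, v) => sum + v, 0) / values.length;

function cupedEffect(control: UnitObservation[], treatment: UnitObservation[]) {
  const xC = control.map((u) => u.x);
  const yC = control.map((u) => u.y);
  const meanXC = average(xC);
  const meanYC = average(yC);

  // θ = Cov(Y, X) / Var(X) on the control arm; the 1/(n−1) factors cancel.
  let cov = 0;
  let varX = 0;
  for (let i = 0; i < control.length; i++) {
    cov += (xC[i] - meanXC) * (yC[i] - meanYC);
    varX += (xC[i] - meanXC) ** 2;
  }
  const theta = cov / varX;

  const rawDiff = average(treatment.map((u) => u.y)) - meanYC;
  const preDiff = average(treatment.map((u) => u.x)) - meanXC;
  // θ̂_CUPED = (ȳ_T − ȳ_C) − θ·(x̄_T − x̄_C)
  return { theta, rawDiff, adjusted: rawDiff - theta * preDiff };
}

const controlArm = [1, 2, 3, 4].map((x) => ({ x, y: 2 * x }));
const treatmentArm = [2, 3, 4, 5].map((x) => ({ x, y: 2 * x + 1 }));
const cuped = cupedEffect(controlArm, treatmentArm);
console.log(cuped);
```

In this toy data the treatment arm happens to start from a higher pre-experiment baseline, so the raw difference overstates the effect and the adjustment removes the part explained by the baseline gap.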
73
+ ═══ PHASE H — MULTIPLE COMPARISONS ═══
74
+
75
+ H1. Apply a correction for all pre-registered simultaneous comparisons. Default: Benjamini-Hochberg (BH) FDR control at q = 0.05 when testing 3+ secondary metrics; BH assumes test statistics are independent or positively correlated (PRDS condition), which holds for most product metrics that move together. Use Bonferroni when the tests are planned AND there is a specific need to control the family-wise error rate (FWER) rather than FDR — Bonferroni is valid regardless of correlation structure but is conservative when tests are correlated.
76
+ H2. The primary metric is always tested at the full pre-registered α. Secondary metrics use the adjusted α output from the correction procedure. Guardrail metrics use pre-specified absolute thresholds, not adjusted p-value gates.
77
+ H3. Exploratory metrics (post-hoc segmentation, unexpected segments, drill-downs discovered after unblinding) are hypothesis-generating only. Label them explicitly as EXPLORATORY in all outputs. Never make a ship/kill decision on exploratory findings; register them as hypotheses for the next experiment.
78
+ H4. For multi-variant tests (A/B/C/…): compare each treatment variant to control using Dunnett's test (or BH on the set of treatment-vs-control contrasts), not all pairwise. This preserves FWER against the control comparison without the unnecessary power penalty of testing all pairs.
79
+
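The Benjamini-Hochberg step-up procedure named in H1 can be sketched as follows. The function name and the example p-values are illustrative; the input is assumed to be the p-values of the pre-registered secondary metrics only.

```typescript
// Returns, for each input p-value, whether its hypothesis is rejected at FDR level q.
function benjaminiHochberg(pValues: number[], q: number = 0.05): boolean[] {
  const m = pValues.length;
  const ranked = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);

  // Find the largest rank k (1-based) with p_(k) <= (k / m) · q.
  let cutoff = -1;
  ranked.forEach(({ p }, rank) => {
    if (p <= ((rank + 1) / m) * q) cutoff = rank;
  });

  // Reject every hypothesis ranked at or below the cutoff, even if its own
  // p-value missed its own threshold (this is what "step-up" means).
  const rejected = new Array<boolean>(m).fill(false);
  for (let r = 0; r <= cutoff; r++) rejected[ranked[r].index] = true;
  return rejected;
}

const secondaryPValues = [0.02, 0.2, 0.03, 0.02];
const bhDecisions = benjaminiHochberg(secondaryPValues);
console.log(bhDecisions);
```

In the example the smallest p-value (0.02) exceeds its own threshold of 0.0125, but it is still rejected because a higher-ranked p-value clears its threshold.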
80
+ ═══ PHASE I — DECISION ═══
81
+
82
+ I1. Apply the pre-registered decision rule (D1). Do not deviate based on visual dashboard inspection, stakeholder pressure, or proximity to a release deadline. The rule was set when the stakes were clear and the data was unseen — that is its entire value.
83
+ I2. SHIP if: primary metric effect size has CI fully above zero (or above the pre-specified positive threshold), no guardrail metric has violated its pre-specified threshold, SRM check passed, and novelty check passed.
84
+ I3. ITERATE if: effect is directional (point estimate positive) but the CI crosses zero AND the lower bound of the CI is above the negative MDE threshold (i.e., the treatment is not meaningfully harmful) AND there is a concrete, mechanistically grounded hypothesis for improving the treatment variant.
85
+ I4. KILL if: any guardrail metric violated its pre-registered threshold; SRM failed and cannot be explained by a known logging artifact; or the CI's upper bound is below the MDE (power analysis confirms a meaningful effect was ruled out, not merely not observed).
86
+ I5. EXTEND (one time only, pre-registered before unblinding) if: the required N was not reached due to lower-than-forecast traffic AND the pre-registered end date has not yet passed. Lock the new end date in version control before looking at any unblinded data. Never extend because the result is inconclusive after the pre-registered end date.
87
+ I6. Write a post-experiment record for every experiment: what was tested, what was found (including CIs), what was shipped, what was learned about the metric relationship and variance. Archive in the experiment log. Variance and baseline rate estimates from completed experiments are the most accurate inputs for future power calculations in the same product surface.
88
+
89
+ KEY PRINCIPLE: The experiment design is a contract — sign it before seeing the data, honor it without revision after.
@@ -0,0 +1,175 @@
1
+ ---
2
+ name: api-design
3
+ description: Design clean, evolvable APIs — start from consumer use cases, make illegal states unrepresentable, explicit errors, versioning and backward-compatibility.
4
+ allowed-tools: Read, Write, Edit, Grep, Glob
5
+ ---
6
+
7
+ ## Phase 0 — Grep for existing conventions FIRST
8
+ Before writing a single type, grep the repo for naming patterns, error shapes, pagination contracts, and auth patterns already in use. New APIs must match the grain of what exists.
9
+
10
+ ```
11
+ Grep for: "export type.*Result" | "export interface.*Response" | "export type.*Error"
12
+ Grep for: cursor | offset | page | limit (pagination convention)
13
+ Grep for: "401\|403\|404" in route handlers (error-response shape)
14
+ Grep for: "Bearer\|x-api-key\|Authorization" (auth header convention)
15
+ ```
16
+
17
+ Hard rule: if you invent a new naming or error convention, you own migrating everything to it. Prefer conforming.
18
+
19
+ ---
20
+
21
+ ## Phase 1 — Write the call site FIRST (design-by-example)
22
+ Before any implementation, write 3–5 concrete example usages as if you are the consumer:
23
+
24
+ ```ts
25
+ // Example 1: happy path
26
+ const result = await receipts.create({ agentId: "theron-cyber", claim: "...", sig: key })
27
+ if (!result.ok) return handleError(result.error) // typed discriminated union
28
+ console.log(result.value.receiptId)
29
+
30
+ // Example 2: list with pagination
31
+ const page = await receipts.list({ agentId, limit: 50, cursor: prev.nextCursor })
32
+
33
+ // Example 3: illegal state that must be rejected at the type level
34
+ receipts.create({ agentId: undefined }) // should not compile
35
+ ```
36
+
37
+ Ask: does this call site feel obvious? Is the intent readable? Are required vs optional params clear from types alone? Iterate here before writing any implementation.
38
+
39
+ ---
40
+
41
+ ## Phase 2 — Model the domain
42
+ 1. Name the RESOURCES (nouns): what are the durable entities? (Receipt, Agent, Run, Finding, Artifact)
43
+ 2. Name the OPERATIONS (verbs): create / get / list / update / delete / publish / verify — choose from a fixed vocabulary.
44
+ 3. State the INVARIANTS: what must always be true? (A receipt must have a valid sig. An agent must belong to an org. A run cannot be deleted while active.)
45
+ 4. Draw the ownership graph: who owns what? (Org → Agent → Run → Finding). This drives URL nesting depth and authz checks.
46
+
47
+ Hard rule: no more than two levels of nesting in a REST path (`/orgs/:orgId/agents/:agentId`). Deeper = flatten with a query param.
48
+
49
+ ---
50
+
51
+ ## Phase 3 — Choose the right shape
52
+ - **REST** for CRUD over resources with standard HTTP semantics (receipts, agents, runs). Use nouns, HTTP verbs (GET/POST/PATCH/DELETE), 200/201/204/400/401/403/404/409/422/500.
53
+ - **RPC / action endpoint** for operations that are not CRUD (POST `/runs/:id/cancel`, POST `/receipts/:id/verify`). Name them as verb-noun.
54
+ - **GraphQL** only if the consumer needs flexible field selection across many related resources in one round-trip. Overhead rarely justified for internal/CLI APIs.
55
+ - **Library API (TypeScript functions)** for SDK-internal logic, tool contracts, and anything called in-process. Prefer typed functions over string-keyed maps.
56
+
57
+ ---
58
+
59
+ ## Phase 4 — Make illegal states unrepresentable
60
+ 1. Use discriminated unions for result types — no mixed `{ data?, error? }` bags:
61
+ ```ts
62
+ type Result<T, E = ApiError> = { ok: true; value: T } | { ok: false; error: E }
63
+ ```
64
+ 2. Use branded types or `z.string().uuid()` to prevent mixing IDs:
65
+ ```ts
66
+ type ReceiptId = string & { readonly __brand: "ReceiptId" }
67
+ ```
68
+ 3. Encode state machines in the type (a `Run` in state `"completed"` has a `completedAt`; in `"pending"` it does not — use a union of state-shapes, not optional fields).
69
+ 4. Never use `any`. Never accept `unknown` without a Zod parse at the boundary.
70
+
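Item 3 above (state encoded as a union of state-shapes) can be sketched like this. The `"running"` state, the field names other than `completedAt`, and the helper names are assumptions added for illustration.

```typescript
type RunId = string & { readonly __brand: "RunId" };

// One shape per state: `completedAt` exists only on a completed run.
type Run =
  | { state: "pending"; id: RunId }
  | { state: "running"; id: RunId; startedAt: string }
  | { state: "completed"; id: RunId; startedAt: string; completedAt: string };

function summarizeRun(run: Run): string {
  switch (run.state) {
    case "pending":
      return `${run.id} is pending`;
    case "running":
      return `${run.id} running since ${run.startedAt}`;
    case "completed":
      // Accessing completedAt compiles only inside this branch.
      return `${run.id} completed at ${run.completedAt}`;
    default: {
      // Adding a new state without handling it becomes a compile error here.
      const unreachable: never = run;
      return unreachable;
    }
  }
}

// The cast is the single place where a raw string becomes a RunId.
const asRunId = (raw: string): RunId => raw as RunId;

const finishedRun: Run = {
  state: "completed",
  id: asRunId("run_1"),
  startedAt: "2025-01-01T00:00:00Z",
  completedAt: "2025-01-01T00:05:00Z",
};
console.log(summarizeRun(finishedRun));
```

Compared with optional fields, the union forces every consumer to check `state` before reading state-specific data.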
71
+ ---
72
+
73
+ ## Phase 5 — Explicit, typed errors (never silent)
74
+ Every error must carry `code` (stable machine-readable string) and `message` (human-readable). Validation errors add `fields` (per-field messages) and rate-limit errors add `retryAfter`.
75
+
76
+ ```ts
77
+ type ApiError =
78
+ | { code: "NOT_FOUND"; message: string }
79
+ | { code: "UNAUTHORIZED"; message: string }
80
+ | { code: "FORBIDDEN"; message: string }
81
+ | { code: "VALIDATION"; message: string; fields: Record<string, string> }
82
+ | { code: "CONFLICT"; message: string }
83
+ | { code: "RATE_LIMITED"; message: string; retryAfter: number }
84
+ | { code: "INTERNAL"; message: string }
85
+ ```
86
+
87
+ Hard rule: HTTP 500 must never leak stack traces or internal state to the client. Log internally, return an `INTERNAL` code externally.
88
+
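One way to apply the hard rule above is to map every `ApiError` to a status in a single exhaustive switch and to convert unexpected exceptions into a generic `INTERNAL` error at the boundary. The status mapping (422 for validation, 429 for rate limiting) and the function names are assumptions, not something the text prescribes.

```typescript
type ApiError =
  | { code: "NOT_FOUND"; message: string }
  | { code: "UNAUTHORIZED"; message: string }
  | { code: "FORBIDDEN"; message: string }
  | { code: "VALIDATION"; message: string; fields: Record<string, string> }
  | { code: "CONFLICT"; message: string }
  | { code: "RATE_LIMITED"; message: string; retryAfter: number }
  | { code: "INTERNAL"; message: string };

// Exhaustive over the union: adding a code without a status fails to compile.
function httpStatusFor(error: ApiError): number {
  switch (error.code) {
    case "NOT_FOUND":
      return 404;
    case "UNAUTHORIZED":
      return 401;
    case "FORBIDDEN":
      return 403;
    case "VALIDATION":
      return 422;
    case "CONFLICT":
      return 409;
    case "RATE_LIMITED":
      return 429;
    case "INTERNAL":
      return 500;
  }
}

// Full detail goes to the internal log; the client sees a fixed message only.
function toClientError(cause: unknown, logInternal: (detail: string) => void): ApiError {
  const detail = cause instanceof Error ? cause.stack ?? cause.message : String(cause);
  logInternal(detail);
  return { code: "INTERNAL", message: "Internal server error" };
}

const internalLog: string[] = [];
const clientError = toClientError(new Error("db timeout at 10.0.0.5"), (d) => {
  internalLog.push(d);
});
console.log(httpStatusFor(clientError), clientError.message);
```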
89
+ ---
90
+
91
+ ## Phase 6 — Collections: pagination, filtering, sorting
92
+ Use CURSOR-based pagination (not offset) for any collection that can grow:
93
+ ```ts
94
+ // Request
95
+ { limit?: number; cursor?: string; filter?: { agentId?: string; since?: string } }
96
+ // Response
97
+ { items: T[]; nextCursor: string | null; total?: number }
98
+ ```
99
+
100
+ - Default `limit` to something sane (20–50). Cap it (max 200).
101
+ - Cursors are opaque base64-encoded server state. Never expose internal IDs as cursors.
102
+ - Filtering params use the same field names as the resource fields — no surprise aliases.
103
+ - Sorting: `sort=createdAt:desc` as a single string param. Default sort must be documented.
104
+
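The limit and sort rules above can be sketched as two small parsing helpers. The default of 20, the lower bound of 1, the `asc` default direction, and the allow-list approach are assumptions chosen for the example; returning `null` stands for "respond with a `VALIDATION` error".

```typescript
const DEFAULT_LIMIT = 20;
const MAX_LIMIT = 200;

// Missing or non-finite input falls back to the default; everything else is clamped.
function resolveLimit(requested?: number): number {
  if (requested === undefined || !Number.isFinite(requested)) return DEFAULT_LIMIT;
  return Math.min(Math.max(Math.trunc(requested), 1), MAX_LIMIT);
}

interface SortSpec {
  field: string;
  direction: "asc" | "desc";
}

// Parses "field:direction" (for example "createdAt:desc") against an allow-list
// of resource field names, so no surprise aliases are accepted.
function parseSort(
  raw: string | undefined,
  allowedFields: readonly string[],
  fallback: SortSpec,
): SortSpec | null {
  if (raw === undefined) return fallback; // documented default sort
  const [field, direction = "asc", ...rest] = raw.split(":");
  if (rest.length > 0 || !allowedFields.includes(field)) return null;
  if (direction !== "asc" && direction !== "desc") return null;
  return { field, direction };
}

const defaultSort: SortSpec = { field: "createdAt", direction: "desc" };
const sortableFields = ["createdAt", "agentId"];
console.log(resolveLimit(1000), parseSort("createdAt:desc", sortableFields, defaultSort));
```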
105
+ ---
106
+
107
+ ## Phase 7 — Mutating endpoints: idempotency + safety
108
+ - POST endpoints that create resources must accept an optional `idempotencyKey` (client-generated UUID). Return the same response on replay within 24h.
109
+ - PATCH must be partial (only send what changes). PUT replaces the whole resource — avoid PUT unless you mean it.
110
+ - DELETE must be idempotent: deleting an already-deleted resource returns 204, not 404.
111
+ - Destructive operations (delete, purge, revoke) must require explicit confirmation param OR a two-step (request → confirm token → execute).
112
+
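The replay behavior for `idempotencyKey` can be sketched with an in-memory store. This is an assumption-heavy illustration: a real service would use shared storage and handle concurrent requests, the class name is hypothetical, and the clock is passed in so that the 24h window is easy to exercise.

```typescript
const REPLAY_WINDOW_MS = 24 * 60 * 60 * 1000;

class IdempotencyStore<T> {
  private readonly entries = new Map<string, { response: T; storedAt: number }>();

  // Runs `create` at most once per key within the replay window.
  run(key: string | undefined, nowMs: number, create: () => T): T {
    if (key === undefined) return create(); // the key is optional
    const hit = this.entries.get(key);
    if (hit !== undefined && nowMs - hit.storedAt < REPLAY_WINDOW_MS) {
      return hit.response; // replay: same response, no second side effect
    }
    const response = create();
    this.entries.set(key, { response, storedAt: nowMs });
    return response;
  }
}

let createCalls = 0;
const receiptStore = new IdempotencyStore<{ receiptId: string }>();
const createReceipt = () => ({ receiptId: `r_${++createCalls}` });

const firstResponse = receiptStore.run("key-1", 0, createReceipt);
const replayResponse = receiptStore.run("key-1", 60000, createReceipt);
console.log(firstResponse.receiptId === replayResponse.receiptId, createCalls);
```

A retried POST with the same key therefore cannot create a second resource while the window is open.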
113
+ ---
114
+
115
+ ## Phase 8 — Security: authz at the boundary
116
+ 1. Authenticate FIRST (parse + verify the bearer token/API key) before touching any DB or running any logic.
117
+ 2. Authorize at the resource level: after fetching the resource, assert the caller's org/role owns it. Never trust a URL param to scope access.
118
+ 3. Validate ALL inputs with Zod (or equivalent) at the route handler entry point. Reject before processing.
119
+ 4. Rate-limit at the gateway or middleware layer, not inside business logic.
120
+ 5. Never return a resource that belongs to a different org, even in error messages ("receipt not found" not "receipt belongs to another org").
121
+
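The ordering in items 1, 2 and 5 can be sketched as a single handler: authenticate, fetch, authorize against the resource's owner, and return the same "not found" for a missing receipt and for one owned by another org. The in-memory map, the token verifier, and all names are hypothetical stand-ins.

```typescript
interface Caller {
  orgId: string;
}

interface Receipt {
  receiptId: string;
  orgId: string;
  claim: string;
}

type GetReceiptResult =
  | { ok: true; value: Receipt }
  | { ok: false; error: { code: "UNAUTHORIZED" | "NOT_FOUND"; message: string } };

function getReceipt(
  token: string | undefined,
  receiptId: string,
  verifyToken: (token: string) => Caller | null,
  store: ReadonlyMap<string, Receipt>,
): GetReceiptResult {
  // 1. Authenticate before touching any data.
  const caller = token === undefined ? null : verifyToken(token);
  if (caller === null) {
    return { ok: false, error: { code: "UNAUTHORIZED", message: "invalid credentials" } };
  }
  // 2. Fetch, then authorize against the owner recorded on the resource itself.
  const receipt = store.get(receiptId);
  if (receipt === undefined || receipt.orgId !== caller.orgId) {
    // 5. Identical response for "missing" and "other org": existence is not leaked.
    return { ok: false, error: { code: "NOT_FOUND", message: "receipt not found" } };
  }
  return { ok: true, value: receipt };
}

const demoStore = new Map<string, Receipt>([
  ["r1", { receiptId: "r1", orgId: "orgA", claim: "example" }],
]);
const verifyDemoToken = (token: string): Caller | null =>
  token === "tokA" ? { orgId: "orgA" } : token === "tokB" ? { orgId: "orgB" } : null;

const ownRead = getReceipt("tokA", "r1", verifyDemoToken, demoStore);
const foreignRead = getReceipt("tokB", "r1", verifyDemoToken, demoStore);
const missingRead = getReceipt("tokB", "nope", verifyDemoToken, demoStore);
console.log(ownRead.ok, foreignRead.ok, missingRead.ok);
```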
122
+ ---
123
+
124
+ ## Phase 9 — Versioning + backward-compatibility
125
+ - URL version prefix (`/v1/`) for REST. Bump only on BREAKING changes.
126
+ - Additive changes (new optional field, new endpoint, new enum value) are NOT breaking — ship them freely.
127
+ - Breaking changes (remove field, rename field, change type, change semantics) require a new version OR a deprecation window (min 90 days, deprecation header on every response).
128
+ - Add `Deprecation: true` + `Sunset: <date>` response headers on deprecated endpoints.
129
+ - Keep a `CHANGELOG.md` entry for every change: `[added]`, `[deprecated]`, `[removed]`, `[breaking]`.
130
+ - Hard rule: never remove a field from a response without a deprecation period. Consumers break silently.
131
+
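The deprecation-window rule can be enforced where the headers are built. This is a sketch: `deprecationHeaders` is a hypothetical helper, the `Deprecation: true` value follows the text above, and `Sunset` is emitted as an HTTP-date via `toUTCString()`. The header value formats are worth checking against the current specifications before relying on them.

```typescript
const MIN_WINDOW_DAYS = 90;
const DAY_MS = 24 * 60 * 60 * 1000;

// Builds the response headers for a deprecated endpoint and refuses
// a sunset date that falls inside the minimum deprecation window.
function deprecationHeaders(announcedAt: Date, sunsetAt: Date): Record<string, string> {
  const windowDays = (sunsetAt.getTime() - announcedAt.getTime()) / DAY_MS;
  if (windowDays < MIN_WINDOW_DAYS) {
    throw new Error(`deprecation window is ${windowDays} days; minimum is ${MIN_WINDOW_DAYS}`);
  }
  return {
    Deprecation: "true",
    Sunset: sunsetAt.toUTCString(), // HTTP-date format, always in GMT
  };
}
```

Attaching these in one middleware, rather than per handler, is one way to make sure every response from a deprecated endpoint carries them.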
132
+ ---
133
+
134
+ ## Phase 10 — Minimal required params + least-surprise defaults
135
+ - Every required param must be impossible to omit (type-enforced, not runtime-guarded).
136
+ - Every optional param must have a documented default that is the overwhelmingly common case.
137
+ - If a caller must pass the same param on every call, it belongs in the client constructor, not the method signature.
138
+ - Boolean flags are a smell — prefer an enum that names the intent (`mode: "strict" | "lenient"` not `strict: boolean`).
139
+
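A small sketch of the last two points, assuming a hypothetical `ReceiptClient`: the per-call constant (`orgId`) moves into the constructor, and the boolean flag becomes a named mode. The `"strict"` default is an assumption made for the example.

```typescript
// An enum-like union names the intent; `strict: boolean` would not.
type ValidationMode = "strict" | "lenient";

interface ClientOptions {
  orgId: string;         // required: the type makes it impossible to omit
  mode?: ValidationMode; // optional: defaults to "strict" in this sketch
}

class ReceiptClient {
  private readonly orgId: string;
  private readonly mode: ValidationMode;

  constructor(options: ClientOptions) {
    this.orgId = options.orgId;
    this.mode = options.mode ?? "strict";
  }

  // Callers pass only what varies per call; orgId and mode come from the client.
  describeCreate(amountCents: number): string {
    return `org=${this.orgId} mode=${this.mode} amountCents=${amountCents}`;
  }
}
```

The call site then reads `client.describeCreate(500)` with no repeated org argument and no bare `true` to decode.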
140
+ ---
141
+
142
+ ## Phase 11 — Document the contract
143
+ For each endpoint/function, write inline:
144
+ 1. One-line purpose.
145
+ 2. All params: name, type, required/optional, default, constraints.
146
+ 3. Response shape with field semantics.
147
+ 4. All error codes that can be returned and when.
148
+ 5. One working example (request + response).
149
+
150
+ For REST APIs, emit an OpenAPI 3.1 spec (`openapi.yaml`) generated from the types — do not hand-write it separately from the code.
151
+
152
+ ---
153
+
154
+ ## Phase 12 — Validate against use cases before finalizing
155
+ Return to the example call sites from Phase 1. For each:
156
+ - Can it be expressed with the designed API without workarounds?
157
+ - Are there any surprise required params the consumer wouldn't know at call time?
158
+ - Does the error path give the consumer enough information to recover or display a useful message?
159
+ - Would a new engineer reading only the call site understand what it does?
160
+
161
+ If any answer is no, revise the design — not the examples.
162
+
163
+ ---
164
+
165
+ ## Hard rules summary
166
+ 1. Call site first. Implementation is a detail.
167
+ 2. Match existing conventions — grep before inventing.
168
+ 3. Discriminated Result unions. No mixed `data?/error?` bags.
169
+ 4. Illegal states must not compile.
170
+ 5. Errors are typed and machine-readable (`code` field).
171
+ 6. Cursor pagination for all collections.
172
+ 7. Authz at the boundary, on every request, on the actual resource.
173
+ 8. Additive change is free. Breaking change costs a version bump or 90-day deprecation.
174
+ 9. Never leak internals in error responses.
175
+ 10. Document the contract in the code, not in a separate doc that goes stale.
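Rules 3 to 5 can be illustrated with a discriminated Result union. The names here (`Result`, `ErrorCode`, `parseLimit`) are hypothetical and chosen for the example only.

```typescript
// Machine-readable error codes: consumers switch on `code`, not on message text.
type ErrorCode = "not_found" | "invalid_input" | "rate_limited";

// Exactly one of `data` or `error` exists, selected by the `ok` discriminant.
type Result<T> =
  | { ok: true; data: T }
  | { ok: false; error: { code: ErrorCode; message: string } };

function parseLimit(raw: string): Result<number> {
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 1) {
    return { ok: false, error: { code: "invalid_input", message: `limit must be a positive integer, got "${raw}"` } };
  }
  return { ok: true, data: n };
}

function describeResult(result: Result<number>): string {
  // The compiler only allows `data` in the ok branch and `error` in the other.
  return result.ok ? `limit=${result.data}` : `error=${result.error.code}`;
}
```

Accessing `result.data` without first checking `result.ok` is a compile error, which is the "illegal states must not compile" property in practice.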
@@ -0,0 +1,185 @@
1
+ ---
2
+ name: architecture-design
3
+ description: Design a system from the non-functional requirements + load numbers — components and boundaries, storage/consistency tradeoffs, failure modes, ADRs, and design-for-change.
4
+ allowed-tools: Read, Write, Grep, Glob
5
+ ---
6
+
7
+ ## Phase 0 — Read the repo before designing anything
8
+
9
+ 1. Grep for existing conventions: naming patterns, error shapes, transport layers, auth, DB schema, env var conventions.
10
+ 2. Identify what already exists that the new system can use or must integrate with. Designing in a vacuum produces orphan systems.
11
+ 3. State the single sentence the design must satisfy: "Support X doing Y with Z constraint." If you cannot write this sentence, stop and clarify.
12
+
13
+ ---
14
+
15
+ ## Phase 1 — Nail the requirements (functional + non-functional)
16
+
17
+ 4. List functional requirements as user-facing behaviors: "User can create a receipt. Receipt is verifiable offline. Agent can query its own run history."
18
+ 5. List the non-functional requirements that actually drive architecture — these are the ones architects miss:
19
+ - **Scale**: peak QPS read / peak QPS write / concurrent users / data volume at launch / at 12 months / at 36 months
20
+ - **Latency**: p50 / p99 targets per critical path (interactive vs. background)
21
+ - **Availability**: acceptable downtime per month (99.9% = 43 min/mo; 99.99% = 4 min/mo — know the number before picking topology)
22
+ - **Consistency**: can users see stale data? For how long? Can you lose writes? (CAP forces a choice — name it)
23
+ - **Durability**: what data can never be lost? (receipts, payments, audit logs are non-negotiable; ephemeral cache is fine to lose)
24
+ - **Security**: who are the threat actors? What must be encrypted at rest / in transit? What must be auditable?
25
+ - **Cost envelope**: $/month ceiling, cost sensitivity (bandwidth-heavy? storage-heavy? compute-heavy?)
26
+ - **Evolvability**: what is the most likely next feature? Design for that specific change, not for infinite unknowns
27
+ - **Compliance / regulatory**: GDPR, SOC2, HIPAA, export control — flag if any apply before choosing storage region or vendor
28
+ 6. Separate hard constraints (must-haves, deal-breakers) from soft preferences. Hard constraints are immovable; preferences are trade-off inputs.
29
+
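The availability figures quoted above come from simple arithmetic, sketched here. `downtimeBudgetMinutes` is a hypothetical helper and assumes a 30-day month.

```typescript
// Allowed downtime per month for a given availability target.
function downtimeBudgetMinutes(availabilityPercent: number, daysInMonth = 30): number {
  if (availabilityPercent < 0 || availabilityPercent > 100) {
    throw new RangeError("availabilityPercent must be between 0 and 100");
  }
  const minutesInMonth = daysInMonth * 24 * 60; // 43,200 for a 30-day month
  return minutesInMonth * (1 - availabilityPercent / 100);
}
```

With a 30-day month, 99.9% leaves about 43.2 minutes and 99.99% about 4.3 minutes, which matches the rounded numbers in the availability bullet.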
30
+ ---
31
+
32
+ ## Phase 2 — Estimate the load (architecture follows the numbers)
33
+
34
+ 7. Do back-of-envelope math. Write it down explicitly — guesses buried in prose are not reviewable:
35
+ ```
36
+ Users: 10K DAU at launch → 100K at 12mo
37
+ Write QPS: 10K DAU × 5 writes/day / 86400s ≈ 0.6 QPS → rounds to ~1 QPS (comfortable single Postgres)
38
+ Read QPS: 10× write ratio ≈ 6 QPS average (peak will be some multiple) → trivial, no read replica needed yet
39
+ Storage: 5KB per receipt × 10K users × 1 receipt/user/day × 365 days ≈ 18GB/year → single DB volume, no sharding
40
+ Growth trigger: at 100K users writes hit ~6 QPS sustained → revisit at that milestone
41
+ ```
42
+ 8. From the numbers, derive what the architecture DOES NOT need yet. Explicitly state what you are deferring and at what load threshold to revisit. This prevents premature optimization.
43
+ 9. Identify the ONE bottleneck most likely to be hit first. Design around it. Everything else is secondary.
44
+
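The back-of-envelope math in step 7 can be written down as a function so reviewers can change the inputs. This is a sketch: `estimateLoad` and its field names are hypothetical, the results are daily averages rather than peaks, and the storage line assumes one stored receipt per user per day, which is what the ~18GB/year figure implies.

```typescript
const SECONDS_PER_DAY = 86_400;

interface LoadInputs {
  dailyActiveUsers: number;
  writesPerUserPerDay: number;
  readsPerWrite: number;
  bytesPerRecord: number;
  recordsPerUserPerDay: number;
}

interface LoadEstimate {
  writeQps: number;         // daily average, not peak
  readQps: number;          // daily average, not peak
  storageGbPerYear: number; // decimal gigabytes
}

function estimateLoad(i: LoadInputs): LoadEstimate {
  const writeQps = (i.dailyActiveUsers * i.writesPerUserPerDay) / SECONDS_PER_DAY;
  const readQps = writeQps * i.readsPerWrite;
  const bytesPerYear = i.bytesPerRecord * i.dailyActiveUsers * i.recordsPerUserPerDay * 365;
  return { writeQps, readQps, storageGbPerYear: bytesPerYear / 1e9 };
}
```

Running it with the launch numbers (10K DAU, 5 writes/day, 10 reads per write, 5KB records) gives roughly 0.58 write QPS, 5.8 read QPS, and 18.25GB/year. Changing `dailyActiveUsers` to 100K reproduces the ~6 write QPS growth trigger.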
45
+ ---
46
+
47
+ ## Phase 3 — Identify core components, responsibilities, and boundaries
48
+
49
+ 10. List every major component as a box with ONE clear responsibility statement. If you need "and" to describe it, split it.
50
+ 11. Draw the dependency arrows. Dependencies must be acyclic at the component level. Cycles = wrong decomposition — break them with an interface or an event.
51
+ 12. Apply the test: could you swap out component X without touching Y? If not, the boundary is wrong.
52
+ 13. Separate stateless from stateful components. Stateless can be scaled horizontally and replaced trivially. Stateful components need replication, failover, and migration strategies — treat them with extra care.
53
+ 14. Name the INTERFACES between components (not implementations). The interface is the contract; the component is a detail.
54
+
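The acyclic-dependency rule in step 11 is mechanically checkable. The sketch below uses a depth-first search over a component map; `findCycle` and the component names in the usage note are hypothetical.

```typescript
// Maps each component to the components it depends on.
type DependencyGraph = Record<string, string[]>;

// Returns one dependency cycle as a path (first node repeated at the end),
// or null when the graph is acyclic.
function findCycle(graph: DependencyGraph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const path: string[] = [];

  function visit(node: string): string[] | null {
    const seen = state.get(node);
    if (seen === "done") return null;
    if (seen === "visiting") {
      // We walked back into the current path: that slice is the cycle.
      return [...path.slice(path.indexOf(node)), node];
    }
    state.set(node, "visiting");
    path.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle !== null) return cycle;
    }
    path.pop();
    state.set(node, "done");
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle !== null) return cycle;
  }
  return null;
}
```

For example, `{ api: ["domain"], domain: ["worker"], worker: ["api"] }` reports the path `api → domain → worker → api`, which points at the arrow to break with an interface or an event.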
55
+ Typical component taxonomy for reference (do not copy blindly — omit what does not apply):
56
+ - **Gateway / edge**: auth enforcement, rate limiting, routing, TLS termination
57
+ - **API layer**: request validation, orchestration, response shaping — no business logic
58
+ - **Domain services**: business rules, aggregates, invariant enforcement
59
+ - **Workers / queue consumers**: async background jobs, retries, idempotency
60
+ - **Data stores**: primary DB, cache, blob store, search index, event log
61
+ - **External integrations**: third-party APIs, webhooks — always behind an anti-corruption layer
62
+
63
+ ---
64
+
65
+ ## Phase 4 — Data model and storage choice
66
+
67
+ 15. Model the data around access patterns, not around entities. Ask "what queries will this system run?" before writing a schema.
68
+ 16. Identify the access patterns for each entity: point lookup by ID? range scan by time? full-text search? aggregation? Each pattern pulls toward a different storage type.
69
+ 17. Choose storage type by workload shape:
70
+ - **Relational (Postgres/Neon)**: transactions, foreign keys, flexible queries, moderate scale, strong consistency — default choice until proven wrong
71
+ - **Key-value (Redis/Upstash)**: ephemeral state, sessions, leases, rate-limit counters, sub-millisecond reads — never use as a primary store for durable data
72
+ - **Blob / object store (R2/S3)**: large binary objects, model weights, artifacts, media — never query inside blobs
73
+ - **Time-series / append-only log**: audit trails, event streams, telemetry — immutable by design
74
+ - **Search index (Postgres full-text / Elastic)**: keyword search, fuzzy match — derived, always reconstructible from the primary store
75
+ 18. State the CAP/consistency tradeoff explicitly:
76
+ - "We choose consistency over availability for receipts — a 503 is better than a corrupt receipt."
77
+ - "We choose availability over consistency for agent status — stale status for 30s is acceptable."
78
+ 19. Design the schema migration strategy upfront: can you add columns without downtime? (Yes in Postgres 11+ for add-column with a non-volatile default; a volatile default still rewrites the table.) Can you rename columns? (Not safely in one step; it requires a multi-step deploy.) Know the answer before you ship.
79
+ 20. Identify what data is owned by this system vs. what it reads from another system. Do not store data you do not own — subscribe to events or call the owning service.
80
+
81
+ ---
82
+
83
+ ## Phase 5 — Data flow and critical paths
84
+
85
+ 21. Trace the critical path for the top 3 user-facing operations end-to-end: every hop, every I/O, every transformation.
86
+ 22. Count the number of synchronous I/O calls on each critical path. Every added hop adds latency and a failure point. Minimize hops on the hot path.
87
+ 23. Identify which steps MUST be synchronous (user is waiting) vs. which can be async (fire-and-forget, queue, or eventual). Move as much as possible off the synchronous path.
88
+ 24. Draw the write path separately from the read path. They often have different consistency and latency requirements and can be optimized independently.
89
+ 25. Identify fan-out: does one write trigger writes to multiple downstream systems? Fan-out is a reliability multiplier — if any downstream fails, what happens?
90
+
91
+ ---
92
+
93
+ ## Phase 6 — Key tradeoffs — name them explicitly, pick with reasons
94
+
95
+ Do not hide tradeoffs in vague language. State each one as "Option A vs Option B — we pick A because X, accepting the downside of Y."
96
+
97
+ 26. **Sync vs. async**: sync = simpler, easier to reason about, couples latency; async = decoupled, resilient, harder to debug and trace.
98
+ 27. **Strong vs. eventual consistency**: strong = correct but slower and harder to scale; eventual = faster and more available but requires idempotency and conflict resolution everywhere.
99
+ 28. **Monolith vs. services**: monolith = simpler ops, refactor freely, one deploy, no network hops; services = independent deploy, isolated failure, harder to trace and test.
100
+ 29. **Normalize vs. denormalize**: normalized = consistent, smaller writes, expensive reads requiring joins; denormalized = fast reads, write amplification, consistency burden moves to application code.
101
+ 30. **Push vs. pull**: push (webhooks, events) = low latency, complex fan-out, receiver must be available; pull (polling) = simple, receiver controls load, higher latency.
102
+ 31. **Build vs. buy**: build = control, cost at scale, maintenance burden; buy = speed, vendor lock-in, ongoing cost.
103
+
104
+ ---
105
+
106
+ ## Phase 7 — Failure modes and degradation strategy
107
+
108
+ 32. For each component, ask: "What happens when this is unavailable?" Design the degraded behavior explicitly — do not leave it to chance.
109
+ 33. Apply bulkhead pattern: isolate failure domains so that a failure in component X does not cascade to Y. Separate thread pools, connection pools, and circuit breakers per downstream dependency.
110
+ 34. Define timeouts everywhere: every outbound network call must have a timeout. "Infinite wait" is not a valid strategy. Set timeouts based on the SLA of the caller, not the SLA of the callee.
111
+ 35. Define retry policy per operation type:
112
+ - Idempotent reads: retry with exponential backoff + jitter, max 3 attempts
113
+ - Non-idempotent writes: do NOT retry automatically — require idempotency key + deduplication at the receiver
114
+ - Background jobs: retry with backoff, dead-letter queue after N failures, alert on DLQ depth
115
+ 36. Idempotency is not optional for any write that can be retried. Every mutation endpoint must accept an idempotency key or be idempotent by nature (PUT, DELETE).
116
+ 37. Backpressure: if a downstream consumer is slower than the producer, the queue grows unbounded. Name the backpressure mechanism: queue depth limit, rate limiting at ingress, or consumer auto-scaling trigger.
117
+ 38. Define the runbook for the top 3 failure scenarios now, not after the first outage:
118
+ - DB primary goes down → failover to replica, expected downtime, data loss window
119
+ - Queue backs up → consumer scale-out trigger, alert threshold
120
+ - External API is unavailable → circuit open, cached response or graceful error to user
121
+
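The retry policy in step 35 can be sketched as two pure functions. The names (`backoffDelayMs`, `mayRetry`) and the default values are assumptions for illustration. The delay uses the "full jitter" variant of exponential backoff (a uniform random delay between 0 and the capped exponential ceiling), and the random source is injectable so the schedule can be tested.

```typescript
type OperationKind = "idempotent-read" | "non-idempotent-write" | "background-job";

const MAX_ATTEMPTS = 3; // "max 3 attempts" from the policy above

// Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 5_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Decides whether another attempt is allowed for this kind of operation.
function mayRetry(
  kind: OperationKind,
  attemptsSoFar: number,
  hasIdempotencyKey: boolean,
  maxJobAttempts = 5, // assumed N before a job goes to the dead-letter queue
): boolean {
  switch (kind) {
    case "idempotent-read":
      return attemptsSoFar < MAX_ATTEMPTS;
    case "non-idempotent-write":
      // Never retry blindly: only when the receiver can deduplicate by key.
      return hasIdempotencyKey && attemptsSoFar < MAX_ATTEMPTS;
    case "background-job":
      return attemptsSoFar < maxJobAttempts;
  }
}
```

Jitter spreads retries from many clients over time, so a recovering dependency is not hit by a synchronized wave of retries.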
122
+ ---
123
+
124
+ ## Phase 8 — Identify the riskiest assumption and de-risk it first
125
+
126
+ 39. List every assumption the design rests on. Rank by: (probability of being wrong) × (cost if wrong).
127
+ 40. The top-ranked assumption is the one to validate BEFORE writing a single line of production code. Build the smallest possible spike — a throwaway prototype, a load test, a proof-of-concept query — to confirm or refute it.
128
+ 41. Common high-risk assumptions: "this third-party API will have the latency we need," "Postgres can handle this query at 100× current load," "the queue consumer can keep up with peak write throughput," "users will do X not Y."
129
+ 42. Document the result of the de-risking experiment. If the assumption was wrong, revise the design before committing to an implementation.
130
+
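The ranking in step 39 is a one-line sort once the two estimates are written down. This is a sketch with hypothetical names (`DesignAssumption`, `rankAssumptions`); `costIfWrong` can be in any consistent unit, such as engineer-weeks.

```typescript
interface DesignAssumption {
  statement: string;
  probabilityWrong: number; // 0..1, your own estimate
  costIfWrong: number;      // any consistent unit, e.g. engineer-weeks
}

// Highest expected cost first; the first entry is the one to de-risk.
function rankAssumptions(assumptions: DesignAssumption[]): DesignAssumption[] {
  const risk = (a: DesignAssumption) => a.probabilityWrong * a.costIfWrong;
  return [...assumptions].sort((a, b) => risk(b) - risk(a));
}
```

A likely-but-cheap assumption can rank below an unlikely-but-catastrophic one, which is the point of multiplying the two factors rather than sorting by either alone.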
131
+ ---
132
+
133
+ ## Phase 9 — Architecture Decision Records (ADRs)
134
+
135
+ For every significant decision, write a micro-ADR inline or in `docs/adr/`:
136
+
137
+ ```
138
+ ## ADR-NNN: <title>
139
+ **Date**: YYYY-MM-DD
140
+ **Status**: Accepted | Superseded by ADR-NNN
141
+
142
+ **Context**: <the forces at play, the problem being solved, constraints>
143
+ **Decision**: <what we decided, stated precisely>
144
+ **Consequences**: <what becomes easier, what becomes harder, what we accept as a known cost>
145
+ ```
146
+
147
+ 43. Decisions that always need an ADR: storage engine choice, consistency model, sync vs. async, service boundary, auth mechanism, API versioning strategy, caching strategy.
148
+ 44. ADRs are immutable: when a decision changes, write a new ADR marked "Supersedes ADR-NNN" and change only the old ADR's Status to "Superseded by ADR-NNN." Never edit an accepted ADR to change its decision.
149
+
150
+ ---
151
+
152
+ ## Phase 10 — Design for change
153
+
154
+ 45. Ask for each component: "What is the most likely change in the next 6 months?" Then verify the component boundary isolates that change — a change inside it must not ripple outward.
155
+ 46. Likely-to-change things: pricing model, auth provider, storage backend, third-party API contract, business rules. Wrap each behind an interface or adapter — the rest of the system depends on the abstraction, not the implementation.
156
+ 47. Stable abstractions principle: depend on stable things (interfaces, domain concepts) not volatile things (vendor SDKs, specific DB query syntax).
157
+ 48. Feature flags are a design choice, not an afterthought. Any capability that may need to be toggled, rolled out gradually, or killed quickly must be flag-gated from day one.
158
+ 49. Schema migrations must be backward-compatible for at least one full deploy cycle: add before remove, nullable before required, multi-step rename (add new column → write to both and backfill → switch reads → drop old).
159
+
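Step 46 describes wrapping volatile dependencies behind an interface. A minimal sketch, assuming a hypothetical `BlobStore` abstraction with an in-memory adapter standing in for a vendor SDK:

```typescript
// The stable abstraction the rest of the system depends on.
interface BlobStore {
  put(key: string, data: string): void;
  get(key: string): string | null;
}

// One adapter per backend. A real adapter would wrap a vendor SDK here;
// this in-memory version only exists to keep the sketch runnable.
class InMemoryBlobStore implements BlobStore {
  private readonly blobs = new Map<string, string>();

  put(key: string, data: string): void {
    this.blobs.set(key, data);
  }

  get(key: string): string | null {
    return this.blobs.get(key) ?? null;
  }
}

// Domain code sees only the interface, so swapping the storage backend
// does not ripple into this function.
function archiveReceipt(store: BlobStore, receiptId: string, payload: string): string {
  const key = `receipts/${receiptId}`;
  store.put(key, payload);
  return key;
}
```

Switching storage vendors then means writing one new class that implements `BlobStore`, with no change to `archiveReceipt` or its callers.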
160
+ ---
161
+
162
+ ## Phase 11 — Anti-over-engineering check
163
+
164
+ 50. Before finalizing, apply these tests:
165
+ - Do you actually have the load that justifies sharding, caching, or a queue? If not, remove it.
166
+ - Does the number of services match the number of independent deploy/scale requirements? If not, merge.
167
+ - Is there a simpler data model that handles 90% of the use cases? Take it.
168
+ - Can this be a single Postgres table with an index instead of a separate service? Check seriously.
169
+ - Could a junior engineer understand and operate this system without you? If not, simplify.
170
+ 51. Write down what you deliberately chose NOT to build and why. This is as important as what you built — it prevents well-intentioned additions from violating the design.
171
+
172
+ ---
173
+
174
+ ## Hard rules
175
+
176
+ 1. Architecture follows the numbers. Estimate load before choosing topology.
177
+ 2. Name every tradeoff and pick with reasons. Hidden tradeoffs become bugs.
178
+ 3. Failure modes are first-class. Every component needs a degradation answer.
179
+ 4. Idempotency is not optional for any write that can fail and be retried.
180
+ 5. De-risk the riskiest assumption before writing production code.
181
+ 6. One ADR per significant decision, immutable after acceptance.
182
+ 7. Design for the specific likely change, not infinite unknowns.
183
+ 8. Avoid the scale you do not have. State the threshold that triggers revisiting.
184
+ 9. Stateful components are expensive — minimize their number and isolate them.
185
+ 10. Match existing repo conventions. A design that ignores the grain of the codebase becomes a maintenance island.
@@ -0,0 +1,77 @@
1
+ ---
2
+ name: business-case
3
+ description: "Build a rigorous business case for a decision or investment: size the problem and opportunity, model costs/benefits/risks across options including do-nothing, surface assumptions explicitly, and deliver a single prioritized recommendation with leading indicators and the strongest counterargument pre-empted."
4
+ allowed-tools: Read, WebSearch, Write
5
+ ---
6
+
7
+ ═══ HARD RULES ═══
8
+ 1. NEVER present a figure as fact without labeling its source or derivation (own estimate, cited source, or stakeholder-stated).
9
+ 2. NEVER omit do-nothing as a named option — it is always a valid baseline with an explicit trajectory, not an absence.
10
+ 3. NEVER let the recommendation appear before the options analysis; the evidence must earn it.
11
+ 4. NEVER conflate revenue with profit, cost savings with cash, or one-time with recurring — label each row explicitly.
12
+ 5. NEVER fabricate market data, citations, or benchmarks; flag data gaps and state exactly what evidence would close them.
13
+ 6. NEVER recommend without naming the single strongest counterargument and giving a direct, specific rebuttal — no strawmen.
14
+ 7. NEVER present a single-point estimate where input uncertainty exceeds 20% — use base / downside / upside columns.
15
+ 8. ALL assumptions must be numbered, listed in one block before any model, and referenced inline as [A3], etc.
16
+ 9. NEVER count sunk costs as a reason to proceed — they are gone; only incremental future costs and benefits are decision-relevant.
17
+ 10. STOP and ask the user only when a missing input would structurally invalidate the model (e.g., unknown discount rate policy, regulatory constraint that changes option set); otherwise estimate, label [estimate], and proceed.
18
+
19
+ ═══ PHASE A — PROBLEM DEFINITION & DECISION FRAMING ═══
20
+ A1. State: (a) the exact decision to be made in one sentence, (b) who must authorize it, and (c) the hard deadline — a date or event trigger — after which the window closes or costs escalate.
21
+ A2. Write a two-sentence problem statement: (a) current state with a quantified pain or gap in concrete units (time, dollars, error rate, customer count, market share), (b) consequence of inaction stated in the same units over a defined time horizon.
22
+ A3. Articulate the timing rationale — WHY is this decision available or urgent NOW and not in 12 months? Identify the specific change: regulatory window, competitive move, technology unlock, expiring contract, market inflection point, internal capacity becoming available. If the timing driver is weak, flag that urgency may be manufactured.
23
+ A4. Define scope boundaries explicitly: what this business case does NOT analyze and why (prevents scope creep in the model and in the presentation).
24
+ A5. Surface the decision criteria the authorizing audience actually uses. Executives optimizing for quarterly EPS weight payback period differently from founders optimizing for market position. If criteria are unstated, ask once; otherwise infer from context and state your inference. Common hard thresholds: IRR floor, payback cap (e.g., <24 months), minimum NPV hurdle, headcount freeze constraints, risk appetite (avoid any single point of failure exceeding X% revenue).
25
+ A6. Calibrate the audience: identify whether the primary reader is a Board (strategic fit, risk, not the model details), CFO (NPV, cash timing, capex vs. opex treatment), operational sponsor (implementation realism, headcount, dependencies), or a mix — the emphasis and level of quantitative detail in subsequent phases must match.
26
+
27
+ ═══ PHASE B — OPTIONS GENERATION ═══
28
+ B1. Generate a minimum of three options:
29
+ - Option 0 (Do Nothing): Describe the status quo TRAJECTORY — revenue decline, cost drift, competitive erosion — not just a frozen snapshot. Quantify where Option 0 lands at the end of the analysis horizon.
30
+ - Active options (minimum two): Apply the BUILD / BUY / PARTNER decision matrix below BEFORE naming options:
31
+ * BUILD if: the capability is a core differentiator, build cost < 2× buy cost over 3 years, and internal team has or can rapidly acquire the competency.
32
+ * BUY (acquire product/vendor) if: time-to-value is critical, the market has mature solutions with proven references, and switching cost is manageable.
33
+ * PARTNER / LICENSE if: the capability is non-core, volume does not justify owning the asset, or regulatory/IP risk makes ownership unattractive.
34
+ * Eliminate any of the three paths that fails a first-principles check and document why; do not silently drop it.
35
+ - Option MV (Minimum Viable): A reduced-scope entry into the leading active option. Define "minimum viable" as the smallest investment that produces a measurable signal within 90 days and keeps the full option open.
36
+ B2. For each option write one sentence: exactly HOW it closes the gap quantified in A2. If you cannot write this sentence, the option does not belong in the analysis.
37
+ B3. Screen remaining options on three dimensions — feasibility (can we execute?), strategic fit (does it reinforce or dilute our position?), time-to-value (when does the first benefit dollar appear?) — and eliminate options failing any dimension. Record the reason; do not silently remove.
38
+ B4. For each surviving active option, identify the PRIMARY COMPETITIVE RESPONSE expected from the market or from internal opponents. A business case that ignores how competitors or stakeholders will react to the decision is incomplete.
39
+
40
+ ═══ PHASE C — COST & BENEFIT QUANTIFICATION ═══
41
+ C1. List ALL numbered assumptions before building any model. Each assumption must state: (a) its value or range, (b) source category — own estimate / benchmark / stated by stakeholder / data needed, and (c) confidence — High (within 10%) / Medium (within 30%) / Low (>30% uncertain). Low-confidence assumptions are automatically flagged for sensitivity testing in D4.
42
+ C2. For each surviving option, build a cost/benefit table. These row categories are mandatory where material; add domain-specific rows as needed:
43
+ - One-time costs: capital expenditure, implementation labor (internal hours at fully-loaded rate), migration, integration, training, license setup, decommissioning of displaced systems.
44
+ - Recurring costs / run-rate delta vs. Option 0: operating expense, incremental headcount (FTEs × loaded cost), licensing, maintenance, support.
45
+ - Quantified benefits (each labeled as revenue uplift, cost avoidance, risk reduction, or productivity gain): state the mechanism, not just the number. "Reduces churn by 2pp" is better than "increases revenue by $X."
46
+ - Time horizon: minimum 3 years for operational decisions, 5 years for capital-intensive or strategic decisions. State the horizon and justify it.
47
+ - Discount rate: use the organization's WACC or hurdle rate if known; otherwise use 10% for well-established businesses, 15–20% for startups or high-uncertainty environments. State the rate and its source.
48
+ - NPV / 3-year net benefit per option.
49
+ - Break-even point in months.
50
+ C3. Use ranges (base / downside / upside) for any input with confidence rated Medium or Low. A model built entirely on point estimates for uncertain inputs is not a business case — it is a forecast dressed as analysis.
51
+ C4. Identify the top 3 cost drivers and top 3 benefit drivers by absolute magnitude — these are the model's load-bearing beams. Sensitivity analysis in Phase D will stress-test these specifically.
52
+ C5. Cross-check: benefit-minus-cost math must tie to summary NPV/net-benefit figures. Disclose any rounding. Confirm that one-time costs are not counted as recurring or vice versa.
53
+ C6. Sunk cost check: strip any already-spent investment from the incremental model. Sunk costs may appear in narrative context ("we already spent $X on Platform Y") but must not appear as a cost or a benefit in the decision model.
54
+
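The NPV and break-even rows in C2 reduce to two short functions. This is a sketch with hypothetical names (`npv`, `breakEvenPeriod`). Cash flows are net amounts per period with period 0 as the upfront investment, the rate must match the period length, and break-even here is undiscounted.

```typescript
// Net present value: sum of cashFlows[t] / (1 + rate)^t, with t = 0 undiscounted.
function npv(rate: number, cashFlows: number[]): number {
  return cashFlows.reduce((sum, cf, t) => sum + cf / (1 + rate) ** t, 0);
}

// First period in which cumulative (undiscounted) cash flow is non-negative,
// or null if the option never pays back within the horizon.
function breakEvenPeriod(cashFlows: number[]): number | null {
  let cumulative = 0;
  for (let t = 0; t < cashFlows.length; t++) {
    cumulative += cashFlows[t];
    if (cumulative >= 0) return t;
  }
  return null;
}
```

For an illustrative option costing 1,000 upfront and returning 500 per year for three years at a 10% rate, `npv` gives about 243.43 and `breakEvenPeriod` returns 2. Passing monthly cash flows and a monthly rate yields break-even in months, as C2 asks.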
55
+ ═══ PHASE D — RISK ANALYSIS ═══
56
+ D1. Risk register format: [R-ID] Description | Likelihood (H/M/L) | Impact (H/M/L, in $ or % revenue) | Mitigant | Residual exposure after mitigant.
57
+ D2. Classify each risk by type: execution (can we deliver?), market (will demand materialize?), regulatory/compliance, technical (will it work at scale?), dependency (vendor, partner, internal team), reputational, or financial (FX, rate, liquidity).
58
+ D3. Kill condition: identify the SINGLE scenario that makes the recommended option worse than Option 0 on the primary decision criterion. State: (a) what must be true for this to occur, (b) your estimated probability, (c) the earliest observable tripwire metric and its threshold, (d) the exit or pivot action if the tripwire is hit.
59
+ D4. Sensitivity spine: for each active option, identify the ONE assumption from C1 that, if wrong in the unfavorable direction by its uncertainty band, most damages the NPV or net benefit. Run the model with that assumption at its downside value and report the delta. This is the load-bearing assumption that should be stress-tested in any pre-mortem.
60
+ D5. Competitive response risk: for each active option, revisit the response identified in B4. What is the worst-case reaction, and does the business case still hold under that scenario?
61
+
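The sensitivity spine in D4 can be sketched as a loop that moves one assumption at a time to its downside value and records the change in the model output. The names (`RangedAssumption`, `sensitivitySpine`) and the toy model in the usage note are hypothetical.

```typescript
interface RangedAssumption {
  id: string;       // e.g. "A1", matching the numbered assumptions list
  base: number;
  downside: number; // value at the unfavorable end of its uncertainty band
}

// Any function from assumption values to NPV or net benefit.
type BenefitModel = (inputs: Record<string, number>) => number;

// Returns the assumption whose downside value damages the result most,
// together with the (negative) delta versus the base case.
function sensitivitySpine(
  assumptions: RangedAssumption[],
  model: BenefitModel,
): { id: string; delta: number } | null {
  const baseInputs: Record<string, number> = {};
  for (const a of assumptions) baseInputs[a.id] = a.base;
  const baseValue = model(baseInputs);

  let worst: { id: string; delta: number } | null = null;
  for (const a of assumptions) {
    const delta = model({ ...baseInputs, [a.id]: a.downside }) - baseValue;
    if (worst === null || delta < worst.delta) worst = { id: a.id, delta };
  }
  return worst;
}
```

With a toy model of customers × revenue per customer − cost, the function reports which single assumption to stress-test first, rather than leaving that choice to intuition.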
62
+ ═══ PHASE E — OPTIONS COMPARISON & RECOMMENDATION ═══
63
+ E1. Side-by-side comparison table (one row per option):
64
+ Option | NPV (base) | NPV (downside) | Payback (months) | Risk rating | Strategic fit (H/M/L) | Verdict
65
+ The table must be readable by an executive who has not read Phases A–D.
66
+ E2. Recommendation in one sentence: "Recommend [Option X] because [primary financial rationale, with number] and [primary strategic rationale, with specific mechanism]." Vague rationales ("best overall value") are not permitted.
67
+ E3. Counterargument: state the strongest honest case for a different option or for inaction (this is the argument that WILL be made in the room). Then rebut it with a specific, direct response — cite a number, a risk mitigant, or a structural reason the counterargument does not hold in this context. No dismissal without substance.
68
+ E4. Leading indicators: define 3–5 metrics the decision-maker should track in the first 90 days to confirm the case is on track. Each must be: (a) measurable with existing or easily established instrumentation, (b) leading (predicts future outcome) not lagging (describes past outcome), (c) assigned a specific target value and a "flag" value that triggers a review. Example: "Pilot conversion rate ≥ 18% by day 60 (flag if <12%)."
69
+ E5. Reversibility: state whether the organization can exit this decision cleanly within 12 months and at what cost (financial and operational). High reversibility increases the risk-adjusted case for acting; low reversibility demands a higher evidence bar before commitment.
70
+
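The target and flag values in E4 can be turned into a simple check. This is a sketch for higher-is-better metrics only; `LeadingIndicator`, `assessIndicator`, and the intermediate "watch" status are assumptions added for the example.

```typescript
interface LeadingIndicator {
  name: string;
  target: number;    // value that confirms the case is on track
  flagBelow: number; // value below which a review is triggered
}

type IndicatorStatus = "on-track" | "watch" | "review";

function assessIndicator(indicator: LeadingIndicator, observed: number): IndicatorStatus {
  if (observed >= indicator.target) return "on-track";
  if (observed < indicator.flagBelow) return "review";
  return "watch"; // between the flag value and the target
}
```

Using the example from E4 (target 18%, flag below 12%), an observed pilot conversion of 15% lands in "watch": short of target but not yet a trigger for review.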
71
+ ═══ PHASE F — ASSUMPTIONS REGISTER & NEXT STEPS ═══
72
+ F1. Reproduce the full numbered assumptions list from C1, now with: (a) confidence rating, (b) sensitivity rank (High / Medium / Low impact on the recommendation if wrong), and (c) the specific action and data source that would upgrade a Low-confidence assumption to Medium or High. Sort by sensitivity rank descending.
73
+ F2. Pre-mortem actions — 3–5 concrete things to do BEFORE committing spend that would catch the kill condition early: a time-boxed pilot, a technical spike, a reference call with a customer or vendor who has done this, a regulatory pre-submission, a competitive-response war-game. Each action must have an owner role, a timeline, and a pass/fail criterion.
74
+ F3. Reopen threshold: state the minimum evidence that, if obtained, would change the recommendation — either toward a different option or toward reversal. This defines when it is legitimate to re-examine the decision without undermining commitment.
75
+ F4. Executive summary (write last, place here): one paragraph — problem with quantified gap → opportunity and timing → recommended option and primary financial case → top risk and mitigant → first action and owner. Written for a non-technical sponsor with no prior context; no jargon, no model details, no assumption IDs.
76
+
77
+ KEY PRINCIPLE: A business case is only as strong as its weakest assumption. Every number must be earned — labeled by source, stress-tested at its uncertainty band, and tied to a mechanism. The recommendation does not lead; the evidence does. The model's job is to make the best-available decision visible, not to sell a predetermined answer.