@vextlabs/theron-cli 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (169) hide show
  1. package/dist/api.d.ts +8 -0
  2. package/dist/api.js +3 -0
  3. package/dist/api.js.map +1 -1
  4. package/dist/auth.js +51 -1
  5. package/dist/auth.js.map +1 -1
  6. package/dist/banner.js +3 -2
  7. package/dist/banner.js.map +1 -1
  8. package/dist/checkpoints.d.ts +32 -0
  9. package/dist/checkpoints.js +61 -0
  10. package/dist/checkpoints.js.map +1 -0
  11. package/dist/index.js +59 -4
  12. package/dist/index.js.map +1 -1
  13. package/dist/input.d.ts +61 -0
  14. package/dist/input.js +574 -0
  15. package/dist/input.js.map +1 -0
  16. package/dist/profiles/index.js +5 -0
  17. package/dist/profiles/index.js.map +1 -1
  18. package/dist/profiles/methodologies/operate_domains.d.ts +8 -0
  19. package/dist/profiles/methodologies/operate_domains.js +1239 -0
  20. package/dist/profiles/methodologies/operate_domains.js.map +1 -0
  21. package/dist/profiles/seeds.js +57 -36
  22. package/dist/profiles/seeds.js.map +1 -1
  23. package/dist/receipt.d.ts +17 -0
  24. package/dist/receipt.js +46 -0
  25. package/dist/receipt.js.map +1 -0
  26. package/dist/render.d.ts +4 -1
  27. package/dist/render.js +95 -28
  28. package/dist/render.js.map +1 -1
  29. package/dist/repl.d.ts +8 -1
  30. package/dist/repl.js +420 -62
  31. package/dist/repl.js.map +1 -1
  32. package/dist/sessions.d.ts +14 -0
  33. package/dist/sessions.js +100 -0
  34. package/dist/sessions.js.map +1 -1
  35. package/dist/ship.d.ts +2 -0
  36. package/dist/ship.js +62 -0
  37. package/dist/ship.js.map +1 -0
  38. package/dist/skills/catalog.d.ts +13 -0
  39. package/dist/skills/catalog.js +86 -0
  40. package/dist/skills/catalog.js.map +1 -0
  41. package/dist/tools/bash.js +81 -14
  42. package/dist/tools/bash.js.map +1 -1
  43. package/dist/tools/edit.js +21 -1
  44. package/dist/tools/edit.js.map +1 -1
  45. package/dist/tools/glob.js +4 -1
  46. package/dist/tools/glob.js.map +1 -1
  47. package/dist/tools/grep.d.ts +5 -0
  48. package/dist/tools/grep.js +101 -2
  49. package/dist/tools/grep.js.map +1 -1
  50. package/dist/tools/index.d.ts +22 -0
  51. package/dist/tools/index.js +177 -41
  52. package/dist/tools/index.js.map +1 -1
  53. package/dist/tools/ls.d.ts +3 -0
  54. package/dist/tools/ls.js +23 -12
  55. package/dist/tools/ls.js.map +1 -1
  56. package/dist/tools/multiedit.d.ts +12 -0
  57. package/dist/tools/multiedit.js +79 -0
  58. package/dist/tools/multiedit.js.map +1 -0
  59. package/dist/tools/stoa.d.ts +1 -1
  60. package/dist/tools/stoa.js +7 -3
  61. package/dist/tools/stoa.js.map +1 -1
  62. package/dist/tools/task.d.ts +9 -0
  63. package/dist/tools/task.js +166 -0
  64. package/dist/tools/task.js.map +1 -0
  65. package/dist/tools/todowrite.d.ts +12 -0
  66. package/dist/tools/todowrite.js +38 -0
  67. package/dist/tools/todowrite.js.map +1 -0
  68. package/dist/tools/webfetch.d.ts +6 -0
  69. package/dist/tools/webfetch.js +98 -0
  70. package/dist/tools/webfetch.js.map +1 -0
  71. package/dist/tools/websearch.d.ts +7 -0
  72. package/dist/tools/websearch.js +83 -0
  73. package/dist/tools/websearch.js.map +1 -0
  74. package/dist/tools/write.js +17 -1
  75. package/dist/tools/write.js.map +1 -1
  76. package/dist/verifiers/confidence_marked.d.ts +2 -0
  77. package/dist/verifiers/confidence_marked.js +49 -0
  78. package/dist/verifiers/confidence_marked.js.map +1 -0
  79. package/dist/verifiers/disclaimer_gate.d.ts +2 -0
  80. package/dist/verifiers/disclaimer_gate.js +57 -0
  81. package/dist/verifiers/disclaimer_gate.js.map +1 -0
  82. package/dist/verifiers/index.d.ts +5 -0
  83. package/dist/verifiers/index.js +20 -7
  84. package/dist/verifiers/index.js.map +1 -1
  85. package/dist/verifiers/lint.js +4 -3
  86. package/dist/verifiers/lint.js.map +1 -1
  87. package/dist/verifiers/promoted_kernels.d.ts +8 -0
  88. package/dist/verifiers/promoted_kernels.js +190 -0
  89. package/dist/verifiers/promoted_kernels.js.map +1 -0
  90. package/dist/verifiers/source_gate.js +2 -3
  91. package/dist/verifiers/source_gate.js.map +1 -1
  92. package/dist/verifiers/test_smoke.js +30 -0
  93. package/dist/verifiers/test_smoke.js.map +1 -1
  94. package/dist/verifiers/types.d.ts +3 -0
  95. package/package.json +4 -2
  96. package/skills/README.md +123 -0
  97. package/skills/ab-test.md +89 -0
  98. package/skills/api-design.md +175 -0
  99. package/skills/architecture-design.md +185 -0
  100. package/skills/business-case.md +77 -0
  101. package/skills/causal-inference.md +77 -0
  102. package/skills/clinical-guideline.md +98 -0
  103. package/skills/code-review.md +98 -0
  104. package/skills/cold-outreach.md +268 -0
  105. package/skills/competitive-teardown.md +223 -0
  106. package/skills/component-spec.md +121 -0
  107. package/skills/content-calendar.md +280 -0
  108. package/skills/contract-review.md +155 -0
  109. package/skills/data-analysis.md +187 -0
  110. package/skills/debug.md +91 -0
  111. package/skills/design-audit.md +121 -0
  112. package/skills/differential-diagnosis.md +79 -0
  113. package/skills/discovery-call.md +206 -0
  114. package/skills/edit-pass.md +80 -0
  115. package/skills/engineering-calc.md +101 -0
  116. package/skills/estimate.md +70 -0
  117. package/skills/experiment-design.md +105 -0
  118. package/skills/fact-check.md +82 -0
  119. package/skills/financial-model.md +104 -0
  120. package/skills/grant-proposal.md +93 -0
  121. package/skills/harmony-analysis.md +93 -0
  122. package/skills/hypothesis-generation.md +99 -0
  123. package/skills/incident-response.md +134 -0
  124. package/skills/interview-loop.md +62 -0
  125. package/skills/job-scorecard.md +92 -0
  126. package/skills/kb-article.md +174 -0
  127. package/skills/launch-plan.md +85 -0
  128. package/skills/lease-review.md +93 -0
  129. package/skills/lesson-plan.md +198 -0
  130. package/skills/literature-review.md +69 -0
  131. package/skills/market-entry.md +137 -0
  132. package/skills/market-sizing.md +159 -0
  133. package/skills/meta-analysis.md +140 -0
  134. package/skills/migrate.md +117 -0
  135. package/skills/optimize.md +88 -0
  136. package/skills/options-strategy.md +166 -0
  137. package/skills/peer-review.md +96 -0
  138. package/skills/pentest-plan.md +193 -0
  139. package/skills/pitch-review.md +132 -0
  140. package/skills/plan.md +88 -0
  141. package/skills/policy-brief.md +124 -0
  142. package/skills/positioning.md +192 -0
  143. package/skills/postmortem.md +168 -0
  144. package/skills/prd.md +105 -0
  145. package/skills/prioritize.md +162 -0
  146. package/skills/proof.md +91 -0
  147. package/skills/property-underwrite.md +159 -0
  148. package/skills/recipe-develop.md +109 -0
  149. package/skills/red-team.md +142 -0
  150. package/skills/refactor.md +58 -0
  151. package/skills/reflection-session.md +115 -0
  152. package/skills/regulatory-compliance.md +136 -0
  153. package/skills/reproduce.md +87 -0
  154. package/skills/runbook.md +344 -0
  155. package/skills/security-audit.md +154 -0
  156. package/skills/seo-brief.md +201 -0
  157. package/skills/sql-query.md +161 -0
  158. package/skills/story-craft.md +163 -0
  159. package/skills/tdd.md +59 -0
  160. package/skills/term-sheet.md +298 -0
  161. package/skills/theory-of-change.md +88 -0
  162. package/skills/threat-model.md +104 -0
  163. package/skills/ticket-triage.md +200 -0
  164. package/skills/tolerance-analysis.md +149 -0
  165. package/skills/training-program.md +151 -0
  166. package/skills/translate.md +64 -0
  167. package/skills/unit-economics.md +238 -0
  168. package/skills/valuation.md +112 -0
  169. package/skills/write-tests.md +77 -0
@@ -0,0 +1,123 @@
1
+ # Theron skill library
2
+
3
+ A curated library of **73** reusable **SKILL.md** playbooks — dense, senior-practitioner workflows the agent loads on demand. Invoke any skill with `/<name>` in the `theron` CLI or the JUWEL.ai VS Code panel; each is also auto-invoked when its `description` matches your task. Run `/skills` to browse them grouped by entrance.
4
+
5
+ **Extend or override:** drop a `<name>.md` in `~/.theron/skills/` (all your projects) or `<project>/.theron/skills/` (this project). A same-named file overrides the bundled one. Format:
6
+
7
+ ```markdown
8
+ ---
9
+ name: my-skill
10
+ description: one line used to decide when to auto-invoke
11
+ allowed-tools: Read, Grep, Bash
12
+ ---
13
+ <the playbook body>
14
+ ```
15
+
16
+ ---
17
+
18
+ ## Research & Science _(13)_
19
+
20
+ - **`/ab-test`** — Design and analyze a controlled A/B test end-to-end — covering metric + MDE definition, power/sample-size/duration, randomization unit, pre-registration, SRM + novelty checks, effect estimation with confidence intervals, multiple-comparison correction, and a go/no-go decision; invoke when a user wants to run or evaluate any controlled experiment on product, model, UI, pricing, ranking, or policy changes. _(tools: Read, Bash, Write)_
21
+ - **`/causal-inference`** — Identify causal effects, not correlations — define the estimand, draw the DAG, pick an identification strategy (RCT/IV/RDD/DiD/matching), and run falsification tests. _(tools: Read, Bash, Write, Grep)_
22
+ - **`/clinical-guideline`** — Summarize clinical practice guidelines for a disease/condition/intervention — extracting evidence grades (GRADE/strength), target populations, contraindications, and cross-body conflicts — with primary source citations and a mandatory educational-use disclaimer; invoke when the user asks what guidelines say about a clinical topic, drug, screening threshold, treatment protocol, or diagnostic criterion. _(tools: Read, WebSearch, WebFetch, Write)_
23
+ - **`/data-analysis`** — End-to-end data analysis — profile before modeling, EDA, method fit to the question, honest validation (leakage, baselines, CIs), and limitations. _(tools: Read, Bash, Write, Glob, Grep)_
24
+ - **`/differential-diagnosis`** — Build and rank a structured clinical differential diagnosis from a symptom/sign/lab constellation — covering likelihood, danger, discriminating features, red flags, confirmatory/exclusionary tests, and guideline citations — for EDUCATIONAL use only; always appends the not-a-substitute-for-a-physician disclaimer. _(tools: Read, WebSearch, WebFetch, Write)_
25
+ - **`/experiment-design`** — Design a rigorous, falsifiable experiment — hypothesis + null, controls for confounds, power analysis, pre-registered analysis plan, and validity threat mitigation. _(tools: Read, Write, WebSearch, Grep)_
26
+ - **`/fact-check`** — Adversarially verify claims — decompose into atomic claims, find primary sources, try to refute each, and rate with cited evidence. Never assert without a source. _(tools: WebSearch, WebFetch, Read, Grep)_
27
+ - **`/hypothesis-generation`** — Generate and rank falsifiable hypotheses — diverse candidates each with a concrete prediction and discriminating test, ranked by novelty × plausibility × testability. _(tools: Read, WebSearch, WebFetch, Write)_
28
+ - **`/literature-review`** — Systematic literature review — search strategy, screening, quality appraisal, thematic synthesis, and an honest gap analysis with primary-source citations. _(tools: WebSearch, WebFetch, Read, Write, Grep, TodoWrite)_
29
+ - **`/meta-analysis`** — Pool studies rigorously — common effect-size metric, fixed/random-effects choice, heterogeneity (I²/τ²), publication-bias checks, sensitivity, and GRADE certainty. _(tools: Read, Bash, WebSearch, WebFetch, Write)_
30
+ - **`/peer-review`** — Review a paper/manuscript like a journal referee — summarize the contribution, assess soundness/novelty/reproducibility, separate major from minor, end with a justified recommendation. _(tools: Read, WebSearch, WebFetch, Write)_
31
+ - **`/proof`** — Construct a rigorous proof — restate precisely, choose the strategy, justify every step by name, hunt the failure modes (circularity, edge cases, quantifier slips), then verify it. _(tools: Read, Write)_
32
+ - **`/reproduce`** — Reproduce a bug, paper result, or benchmark — pin the exact claim and environment, get a minimal case, reproduce baseline first, bisect to the trigger, report repro status honestly. _(tools: Read, Bash, WebFetch, Grep, Glob, Write)_
33
+
34
+ ## Coding & Engineering _(12)_
35
+
36
+ - **`/api-design`** — Design clean, evolvable APIs — start from consumer use cases, make illegal states unrepresentable, explicit errors, versioning and backward-compatibility. _(tools: Read, Write, Edit, Grep, Glob)_
37
+ - **`/architecture-design`** — Design a system from the non-functional requirements + load numbers — components and boundaries, storage/consistency tradeoffs, failure modes, ADRs, and design-for-change. _(tools: Read, Write, Grep, Glob)_
38
+ - **`/code-review`** — Review a diff like a senior engineer — correctness first, then security, tests, API/compat, perf, readability; every comment file:line + concrete fix + severity tag. _(tools: Read, Grep, Glob, Bash)_
39
+ - **`/debug`** — Root-cause debugging — reproduce first, read the error, hypothesize and test (not shotgun), bisect/instrument, fix the cause not the symptom, add a regression test. _(tools: Read, Bash, Grep, Glob, Edit)_
40
+ - **`/engineering-calc`** — Perform a rigorous engineering calculation (structural load/stress/buckling, member sizing, thermal analysis, pressure vessel, circuit power/impedance, fluid flow) — states givens with SI/imperial units, cites governing equation + its derivation assumptions and validity limits, solves step-by-step with dimensional tracking, applies code-required safety factor, bounds-checks against first-principles intuition, and flags where a licensed PE stamp is legally required before use. _(tools: Read, Bash, Write)_
41
+ - **`/migrate`** — Migrate a codebase safely — inventory all call sites first, use expand-migrate-contract, change in small reversible batches keeping the build green, verify semantics preserved. _(tools: Read, Edit, MultiEdit, Bash, Grep, Glob)_
42
+ - **`/optimize`** — Performance optimization — measure and profile first, fix the real bottleneck (algorithmic before micro), re-measure after each change, keep results correct. _(tools: Read, Edit, Bash, Grep, Glob)_
43
+ - **`/refactor`** — Behavior-preserving refactoring — pin behavior with tests first, one small transformation at a time, run tests after each, never mix refactor with behavior change. _(tools: Read, Edit, MultiEdit, Bash, Grep, Glob)_
44
+ - **`/sql-query`** — Write, review, or optimize a SQL query against a real schema — confirm table grain + column names first, produce a correct readable query, verify row-count expectations, then measure and apply structural optimizations (indexes, predicate pushdown, window functions, N+1 elimination). _(tools: Read, Bash, Grep, Write)_
45
+ - **`/tdd`** — Test-driven development — red/green/refactor with the real test framework; write the failing test first, see it fail, minimal code to green, refactor, full-suite before done. _(tools: Read, Edit, MultiEdit, Write, Bash, Grep, Glob)_
46
+ - **`/tolerance-analysis`** — Full tolerance stack-up and GD&T analysis workflow — identify the critical dimension chain, run worst-case and RSS/statistical analysis, allocate tolerances to part cost tiers, design the datum scheme, and flag the binding contributor; invoke whenever a mechanical assembly has a gap, interference, clearance, or fit requirement that must be verified before release or when geometric controls (flatness, position, runout, profile) are being defined or reviewed. _(tools: Read, Bash, Write)_
47
+ - **`/write-tests`** — Author comprehensive tests — cover happy path, boundaries, error paths, and invariants; test behavior not implementation; deterministic and isolated; verify the test can actually fail. _(tools: Read, Write, Edit, Bash, Grep, Glob)_
48
+
49
+ ## Security _(4)_
50
+
51
+ - **`/pentest-plan`** — Plan and execute an authorized penetration test: scope lock, attack-surface enumeration mapped to OWASP Top 10 + API/cloud-native vectors, likelihood×impact prioritization, evidence capture standards, and non-destructive guardrails — invoke when asked to pentest, assess, or audit a system for security vulnerabilities. _(tools: Read, Write, Grep, Bash)_
52
+ - **`/red-team`** — Adversarially review your own (or given) work — hunt for correctness bugs, edge cases, security holes, and false assumptions; severity-rate each with evidence and a fix. _(tools: Read, Grep, Glob, Bash)_
53
+ - **`/security-audit`** — Static security audit of a codebase — map the attack surface, review against OWASP/CWE, scan deps + secrets; every finding gets severity, file:line, exploit scenario, and fix. _(tools: Read, Grep, Glob, Bash)_
54
+ - **`/threat-model`** — Threat-model a system — assets + trust boundaries, STRIDE enumeration, attack trees, risk-rated threats mapped to mitigations, and explicit residual risk. _(tools: Read, Write, Grep, Glob)_
55
+
56
+ ## Finance & Investing _(8)_
57
+
58
+ - **`/financial-model`** — Build a driver-based 3-statement (P&L / balance sheet / cash flow) projection model with explicit assumptions, linked drivers, upside/base/downside scenarios, DCF/terminal-value valuation, and sanity-checks against stage-appropriate industry benchmarks — invoke whenever the task is financial forecasting, unit-economics analysis, fundraising model, DCF valuation, or scenario planning. _(tools: Read, Bash, Write)_
59
+ - **`/lease-review`** — Review a commercial lease end-to-end: extract and stress-test every economic term (base rent, escalations, CAM, TI, free rent, options), flag off-market risk provisions (assignment, default cure periods, exclusivity, co-tenancy, holdover), and deliver a redline priority list with effective-rent computation. _(tools: Read, Write)_
60
+ - **`/options-strategy`** — Educational options-strategy explainer: given a named strategy or market outlook, produce structure, payoff diagram (ASCII), breakevens, max gain/loss, Greeks, ideal market scenario, risk-first sizing guidance, and a mandatory not-investment-advice disclaimer. _(tools: Read, Bash, Write)_
61
+ - **`/pitch-review`** — Review a startup pitch deck with senior VC rigor — separates what is missing from what is wrong, benchmarks each dimension to stage norms, surfaces the 3 hardest investor objections, and identifies the single most important fix before the next meeting. _(tools: Read, Write)_
62
+ - **`/property-underwrite`** — Full-cycle real estate underwriting playbook — collect deal inputs, state all assumptions explicitly, compute NOI/cap rate/cash-on-cash/DSCR/IRR step by step, run a two-variable sensitivity table, and deliver a conservative go/no-go verdict with annotated kill conditions; invoke whenever a user provides a property address, asking price, rent roll, or deal memo and wants investment analysis. _(tools: Read, Bash, Write)_
63
+ - **`/term-sheet`** — Analyze a venture-capital or angel term sheet — decompose economics vs. control provisions, step-by-step dilution math (option-pool shuffle, SAFE conversion, liquidation waterfall, anti-dilution), flag founder-unfriendly deviations from NVCA/YC market standard, and surface negotiation priorities; outputs a structured red/yellow/green memo with a NOT LEGAL ADVICE footer. _(tools: Read, Bash, Write)_
64
+ - **`/unit-economics`** — Compute and stress-test SaaS/subscription unit economics — contribution margin, CAC, LTV (with churn and discount rate), LTV/CAC ratio, payback period, and cohort-level retention curves — from first principles; flags when assumptions break the model. _(tools: Read, Bash, Write)_
65
+ - **`/valuation`** — Value a private or public company via DCF (WACC build-up, explicit FCF, terminal value) and trading/transaction comparable multiples, then reconcile the two into a single defensible range with a sensitivity table on the two highest-impact drivers; invoked when the task is to price a business, assess a deal, build a pitch-book range, or stress-test an offer price. _(tools: Read, Bash, Write)_
66
+
67
+ ## Business, Product & Strategy _(10)_
68
+
69
+ - **`/business-case`** — Build a rigorous business case for a decision or investment: size the problem and opportunity, model costs/benefits/risks across options including do-nothing, surface assumptions explicitly, and deliver a single prioritized recommendation with leading indicators and the strongest counterargument pre-empted. _(tools: Read, WebSearch, Write)_
70
+ - **`/competitive-teardown`** — Tear down a named AI/SaaS competitor end-to-end — positioning, ICP, pricing/packaging, product strengths and gaps, GTM motion, moat, and the specific wedges where we win or lose — grounded in public evidence; outputs a ranked battlecard and a single recommended attack vector. _(tools: Read, WebSearch, WebFetch, Write)_
71
+ - **`/estimate`** — Principled Fermi estimation — decompose into estimable factors with ranges, propagate uncertainty, sanity-check against anchors, tighten the dominant factor, report a range not false precision. _(tools: Read, Write)_
72
+ - **`/launch-plan`** — Build a full GTM launch plan for a software product or feature: segment audience, structure alpha/beta/GA tiers, craft the core message + proof points, assign channel owners + metrics, produce an asset checklist, set a timeline, and generate a launch-day runbook — invoke when asked to plan a launch, go-to-market, release strategy, or channel plan. _(tools: Read, Write)_
73
+ - **`/market-entry`** — Analyze a new market entry decision end-to-end: market attractiveness, right-to-win, entry mode and sequencing, irreversible vs reversible commitments, strongest reasons NOT to enter, and a staged de-risking plan with explicit kill criteria. _(tools: Read, WebSearch, Bash, Write)_
74
+ - **`/market-sizing`** — Size a market TAM/SAM/SOM via dual top-down and bottom-up triangulation, reconcile the gap, stress-test on the dominant driver, and produce a realistic obtainable-share view — invoke whenever a user asks to size a market, estimate addressable revenue, or validate a market opportunity for a product, segment, or geography. _(tools: Read, Bash, Write)_
75
+ - **`/positioning`** — Position an offering against named competitive alternatives — extract unique attributes, translate them to customer-valued outcomes, identify the best-fit buyer segment, name the market frame, and produce a positioning statement plus the messaging hierarchy it implies; invoke when launching, repositioning, entering a new segment, or diagnosing why sales are slow. _(tools: Read, Write)_
76
+ - **`/prd`** — Write a complete Product Requirements Document — extract the real problem, define Theron users + evidence, set measurable goals tied to agentic/receipt/specialist value, scope must/should/could requirements, map key flows and edge cases through the STOA/CIP surface, and lock an explicit out-of-scope. _(tools: Read, Write)_
77
+ - **`/prioritize`** — Score and sequence a backlog of features, bugs, research, or capability-injection work using RICE/WSJF — show every formula step, expose hard dependencies, name deferrals and why, and output a waterlined cycle plan. _(tools: Read, Write)_
78
+ - **`/seo-brief`** — Research and structure an SEO content brief — map search intent, SERP features, entity clusters, E-E-A-T signals, secondary queries, title/meta/H1, recommended layout, and draft the featured-snippet opening to maximize organic reach for a target keyword. _(tools: WebSearch, WebFetch, Read, Write)_
79
+
80
+ ## Go-to-market & Support _(4)_
81
+
82
+ - **`/cold-outreach`** — Execute multi-touch, research-driven cold outreach sequences for AI infrastructure / agentic platforms; one clear CTA per touch, no spray-and-pray. _(tools: Read, Write)_
83
+ - **`/discovery-call`** — Design and run an enterprise B2B discovery call end-to-end — intent-setting open, current-state diagnosis with quantified impact, MEDDPICC qualification, trap and tell-me-more follow-ups, and a mutual next step — diagnosing before pitching and never prescribing before the pain is owned by the buyer. _(tools: Read, Write)_
84
+ - **`/kb-article`** — Write task-focused knowledge base articles — shape the title around one problem, structure in numbered steps with concrete callouts/screenshots, branch for failures, and close with related links. Skip corporate filler; ship text a practitioner skims in 3 minutes and trusts. _(tools: Read, Write)_
85
+ - **`/ticket-triage`** — Reproduce, severity-assess, and resolve theron-cli support tickets end-to-end; escalate only when root cause requires dev/founder intervention or the customer is Enterprise. _(tools: Read, Write, Bash)_
86
+
87
+ ## Legal, Policy & Nonprofit _(5)_
88
+
89
+ - **`/contract-review`** — Clause-by-clause contract redline — flags red-flag, missing, and one-sided terms by risk severity, cites governing market norms, and proposes specific replacement language; never invents statutes or cases. _(tools: Read, WebSearch, WebFetch, Write)_
90
+ - **`/grant-proposal`** — Draft, review, or strengthen a complete funder-mirrored grant proposal (need+data, logic model, measurable outcomes, M&E plan, organizational capacity, line-item budget, sustainability) for nonprofit, government, or philanthropic RFPs — invoke whenever a user needs to write, audit, or rebuild a grant application. _(tools: Read, WebSearch, WebFetch, Write)_
91
+ - **`/policy-brief`** — Draft a rigorous 2-page policy brief on a named policy question — problem framing with quantified status-quo baseline, do-nothing and active options scored on effectiveness/cost/equity/feasibility, WINNERS and LOSERS named explicitly, evidence tier labeled for every quantitative claim, statutory citations verified for every legal reference, and a single prioritized recommendation with a named rebuttal — use when asked to brief a policymaker, analyze a bill, evaluate a regulation, or compare policy options on any public-sector question. _(tools: Read, WebSearch, WebFetch, Write)_
92
+ - **`/regulatory-compliance`** — Map a product or process to a named regulation (GDPR/CPRA, HIPAA, SOC 2, FTC Act §5, PCI-DSS v4.0, EU AI Act, CCPA, etc.), enumerate every applicable obligation with exact cited section, distinguish control gaps from evidence gaps, rank remediation items by legal exposure and breach probability, surface cross-jurisdiction conflicts, and flag all items requiring licensed counsel — use when the user asks about compliance gaps, regulatory mapping, audit readiness, or legal risk for a specific product or workflow. _(tools: Read, WebSearch, WebFetch, Write)_
93
+ - **`/theory-of-change`** — Build a rigorous theory of change for a program, product, or initiative: map the long-term impact goal backward through causal preconditions to activities, make every causal arrow's assumptions explicit and testable, assign indicators per outcome, and cite the evidence base that validates each assumption — invoked when asked to develop, stress-test, or document a theory of change, logic model, impact pathway, or results framework. _(tools: Read, Write)_
94
+
95
+ ## Ops & People _(6)_
96
+
97
+ - **`/incident-response`** — Run a live production incident end-to-end — declare severity, assign IC/comms/ops roles, mitigate-before-diagnose to restore SLO, drive cadenced stakeholder comms, maintain a live timeline, bound every mitigation's blast radius, and hand off a postmortem-ready artifact. _(tools: Read, Write, Bash)_
98
+ - **`/interview-loop`** — Design and run a structured interview loop: map each stage to a competency, write legally-safe STAR behavioral questions + real work samples per stage, build a scored rubric, and run a calibrated debrief — same questions every candidate, bias minimized. _(tools: Read, Write)_
99
+ - **`/job-scorecard`** — Build a role scorecard (mission, 4-6 measurable 12-month outcomes, ranked competencies with behavioral anchors, must-have vs nice-to-have) before drafting any job description — triggered whenever a user asks to define, hire for, or evaluate a role. _(tools: Read, Write)_
100
+ - **`/plan`** — Decompose a complex task into a grounded, verifiable plan — investigate current state first, small steps with acceptance checks, riskiest unknowns first, surface before irreversible work. _(tools: Read, Grep, Glob, TodoWrite, WebSearch)_
101
+ - **`/postmortem`** — Write blameless, timeline-driven postmortem for production incidents; root cause via 5-whys + systemic analysis; action items with owners + dates _(tools: Read, Write)_
102
+ - **`/runbook`** — On-call incident runbook — triage severity/scope in <2 min, detect patterns (cascade vs. isolated), isolate blast radius, execute fixed-sequence recovery, document timeline, and verify the fix held. _(tools: Read, Write, Grep, Bash)_
103
+
104
+ ## Education & Wellbeing _(3)_
105
+
106
+ - **`/lesson-plan`** — Build a rigorous, topic-specific lesson from scratch — diagnose entry state, write Bloom-leveled objectives, sequence concrete-to-abstract, scaffold worked examples into faded practice, and preempt the misconceptions that derail learners of this exact topic. _(tools: Read, Write)_
107
+ - **`/reflection-session`** — Run a structured, CBT/ACT-informed EDUCATIONAL reflective session with a user who wants to process an emotion, thought pattern, stressor, or difficult experience — asking reflective questions, validating feelings, gently surfacing cognitive distortions, and closing with one concrete micro-action; escalates immediately to crisis resources on any risk signal. _(tools: Read, Write)_
108
+ - **`/training-program`** — Design a periodized, goal-specific resistance/conditioning training program: screen for contraindications, set phase structure, prescribe sets×reps×intensity (RPE/%1RM)+rest+tempo, supply technique cues+regressions, enforce deload weeks, and close with a mandatory physician disclaimer — invoke when a user requests a workout plan, training block, strength/hypertrophy/conditioning program, or progressive-overload schedule. _(tools: Read, Write)_
109
+
110
+ ## Creative, Design & Lifestyle _(8)_
111
+
112
+ - **`/component-spec`** — Produce a complete, build-ready UI component specification covering anatomy, prop/variant matrix, every interaction and data state, design tokens (color/space/type/motion/dark-mode), full accessibility contract (ARIA roles, keyboard, focus), responsive rules, automation hooks, breaking-change policy, and do/don't usage — output is the single source of truth a designer and engineer both build from without a meeting. _(tools: Read, Write)_
113
+ - **`/content-calendar`** — Engineer a multi-month content calendar for a digital product or research AI platform — define content pillars, a repeatable series arc, native platform format per channel, engagement metrics to track (saves/shares over vanity), realistic cadence, and sponsorship disclosure; invoke when launching a brand, scaling social presence, or productizing thought leadership at a tech company. _(tools: Read, Write)_
114
+ - **`/design-audit`** — Audit a UI component or user flow for interaction-state completeness (default/hover/focus/active/disabled/loading/error/empty), WCAG accessibility (contrast, focus order, labels, target size, keyboard), responsive behavior, design-token vs hardcoded-value discipline, and cross-component consistency with the design system; delivers a severity-graded findings table and a SHIP/CONDITIONAL/CLEAR verdict. _(tools: Read, Grep, Write)_
115
+ - **`/edit-pass`** — Run a senior seven-phase editorial pass on prose or docs — macro structure first, then clarity, concision, grammar, rhythm, and a signed change log — killing AI-isms/hedging/em-dash abuse while preserving the author's voice and never altering meaning. _(tools: Read, Edit, Write)_
116
+ - **`/harmony-analysis`** — Analyze the harmonic language of a piece or passage — key, Roman-numeral function, cadences, voice-leading, secondary dominants, modal interchange, enharmonic modulation, formal structure, and tension/release mechanics — producing a rigorous, notation-standard analytical report within the common-practice tonal tradition. _(tools: Read, Write)_
117
+ - **`/recipe-develop`** — Develop, stress-test, and document a recipe end-to-end — ratio-first formulation, dual-unit (grams + volume) ingredient list, yield/serving breakdown, phase-by-phase technique with temperature/time/sensory doneness cues, food-science rationale, substitution tradeoffs, food-safety critical points, and non-linear scaling rules — invoke whenever the user asks to create, fix, scale, or deeply explain a recipe. _(tools: Read, Write)_
118
+ - **`/story-craft`** — Craft short literary fiction — character desire + obstacle, scene-level cause-effect, sensory show-not-tell, escalating conflict to a turn, distinctive voice, no clichés — when user asks to write, revise, or critique a short story or flash fiction piece. _(tools: Read, Write)_
119
+ - **`/translate`** — Translate text/documents between human languages preserving meaning, register, tone, markup, and placeholders — localizing idioms/units/dates, flagging untranslatables with alternatives, and marking low-confidence renderings for review; never silently drop or invent content. _(tools: Read, Write)_
120
+
121
+ ---
122
+
123
+ _73 skills across 10 entrances. The same library ships in the `@vextlabs/theron-cli` package and the JUWEL.ai VS Code extension._
@@ -0,0 +1,89 @@
1
+ ---
2
+ name: ab-test
3
+ description: Design and analyze a controlled A/B test end-to-end — covering metric + MDE definition, power/sample-size/duration, randomization unit, pre-registration, SRM + novelty checks, effect estimation with confidence intervals, multiple-comparison correction, and a go/no-go decision; invoke when a user wants to run or evaluate any controlled experiment on product, model, UI, pricing, ranking, or policy changes.
4
+ allowed-tools: Read, Bash, Write
5
+ ---
6
+
7
+ ═══ HARD RULES ═══
8
+
9
+ 1. NEVER peek and stop early. The moment you run the primary analysis is fixed at pre-registration; stopping because p < 0.05 appeared inflates false-positive rate to 26%+ at just two unplanned looks. If continuous monitoring is required, use a sequential/SPRT design with a pre-registered α-spending function.
10
+ 2. NEVER change the primary metric after data collection begins. Secondary metrics may be added to the exploratory tier but never promoted to primary post-hoc.
11
+ 3. NEVER pool variants that "look similar" post-hoc. Each planned pairwise comparison is pre-registered; unplanned merges are p-hacking.
12
+ 4. NEVER interpret "no statistical significance" as "no effect." Report the 95% CI width against the MDE to characterize precisely what was and was not ruled out.
13
+ 5. NEVER skip the Sample Ratio Mismatch (SRM) check. An SRM invalidates causal inference regardless of how compelling the metric lift looks.
14
+ 6. NEVER report a point estimate without its confidence interval and the achieved power at the observed effect size.
15
+ 7. NEVER run the experiment shorter than the pre-registered duration. Minimum is max(computed_duration, 14 days) to capture at least two full weekly seasonal cycles. The 14-day floor is firm; 7 days is not sufficient.
16
+
17
+ ═══ PHASE A — DEFINE THE QUESTION AND PRIMARY METRIC ═══
18
+
19
+ A1. Write a single falsifiable hypothesis: "Changing [X] will increase/decrease [metric] by at least [MDE] for [population] within [time window]."
20
+ A2. Select ONE primary metric that is (a) directly causally linked to the treatment, (b) measurable at the randomization unit without aggregation ambiguity, (c) has a known historical distribution (mean, variance, or proportion) from at least 4 weeks of pre-experiment data. Do not accept a composite "North Star" metric as primary unless it has a closed-form variance estimator — composites typically require the delta method or bootstrap.
21
+ A3. Enumerate secondary metrics: supporting metrics expected to move in the same direction as the primary, and guardrail metrics that must not degrade beyond a pre-specified absolute threshold. Guardrail violations trigger a hard NO-GO regardless of the primary metric result.
22
+ A4. Specify the Minimum Detectable Effect (MDE): the smallest effect that is business-meaningful, not the smallest the test can detect at current traffic. Express in absolute terms (e.g., +0.5 pp conversion) AND relative terms (e.g., +3.3% lift on a 15% base rate). If the team cannot articulate the MDE, halt here — no power calculation is meaningful without it.
23
+ A5. Document the null and alternative hypotheses formally. Default to two-sided unless a directional prior is pre-registered before any data is seen and the one-sided framing is explicitly justified (e.g., treatment can only improve latency, never harm it).
24
+
25
+ ═══ PHASE B — POWER, SAMPLE SIZE, AND DURATION ═══
26
+
27
+ B1. Choose significance level α (default 0.05, two-sided) and power 1−β (default 0.80; use 0.90 for high-stakes or irreversible rollout decisions where a missed effect is costly).
28
+ B2. Compute required sample size N per variant using the appropriate formula for the metric type:
29
+ - Proportion (conversion, CTR): n = 2 · (z_{α/2} + z_β)² · p̄(1−p̄) / δ² where p̄ = (p_control + p_treatment)/2, δ = MDE in absolute pp.
30
+ - Continuous (revenue, latency, session length): n = 2 · (z_{α/2} + z_β)² · σ² / δ² where σ² is the per-unit variance from historical data.
31
+ - Ratio metrics (revenue-per-user, clicks-per-impression): apply the delta method to estimate variance from historical data; do not use the naive ratio of aggregate means as the estimator.
32
+ B3. Compute duration: duration (days) = N_total / (daily_eligible_users × traffic_fraction). Apply the hard floor: required_duration = max(computed_duration, 14 days). Document the traffic ramp schedule and whether the 14-day floor or the computed duration is binding.
33
+ B4. If computed duration exceeds 8 weeks, revisit the design rather than relaxing α. Options in priority order: (1) increase the MDE to a still-meaningful but larger threshold and document the business rationale, (2) apply CUPED variance reduction (see Phase G) which can reduce N by 30–70% when pre-experiment correlation ρ > 0.4, (3) switch to a pre-registered sequential/SPRT design with an O'Brien-Fleming or Pocock α-spending function. Do not simply raise α to 0.10 to make the test shorter — that inflates false positives without reducing required N meaningfully for proportions.
34
+ B5. Document all power-analysis inputs: baseline rate or mean, variance source and historical window, MDE, α, β, traffic fraction, ramp schedule, and the N calculation tool or script used.
35
+
36
+ ═══ PHASE C — RANDOMIZATION DESIGN ═══
37
+
38
+ C1. Choose the randomization unit: the entity that receives a consistent treatment for the experiment's full duration. For user-facing features use user_id; for infrastructure or request-level experiments use request_id only if there is strictly zero carry-over between requests (stateless path); for marketplace experiments use the side of the market that controls the outcome variable (supply or demand, not both simultaneously).
39
+ C2. Validate unit independence: if treatment on unit A changes outcomes for unit B (network effects, shared recommendation pools, leaderboards, shared cache), switch to cluster randomization and recompute N with the design effect: DEFF = 1 + (m−1)ρ, where m = mean cluster size and ρ = intra-cluster correlation estimated from historical data.
40
+ C3. Use a stable deterministic hashing function (e.g., SHA256 of experiment_id + unit_id modulo num_buckets) for assignment. Never use random(seed=timestamp) — it creates re-assignment on page reload or service restart. Isolate the experiment namespace: use a per-experiment salt (e.g., hash(experiment_id + unit_id)) so that bucket assignments across concurrent experiments are statistically independent.
41
+ C4. Stratify or block on major confounders (platform OS, country, user cohort, account age decile) when they explain >20% of primary metric variance; stratification reduces required N or increases effective power at no traffic cost.
42
+ C5. Holdout rule: units assigned at bucketing time remain in their assigned bucket for the full experiment window, even if they become ineligible mid-experiment (intent-to-treat analysis). Exclusions must be pre-registered, not reactive.
43
+
44
+ ═══ PHASE D — PRE-REGISTRATION ═══
45
+
46
+ D1. Before any treatment is exposed to users, write and lock a pre-registration document containing: hypothesis (A1), primary metric + formula, secondary metrics and guardrail thresholds, randomization unit and assignment mechanism, N per variant, planned duration and end date, analysis method (estimator + CI method), multiple-comparison correction method, and explicit decision criteria (what result → SHIP / ITERATE / KILL / EXTEND).
47
+ D2. Commit this document to version control with a timestamp before the experiment launches. The commit hash is the audit trail. Store in a shared experiment log directory indexed by experiment_id.
48
+ D3. Specify the analysis date and trigger explicitly: "Primary analysis will be run on [ISO date] after [N] unique units have been exposed for at least 24 hours, whichever is later."
49
+ D4. For sequential designs, pre-register the α-spending function (e.g., O'Brien-Fleming), all planned look times, and the maximum sample size at which the experiment is stopped regardless of result.
50
+
51
+ ═══ PHASE E — INSTRUMENTATION AND LAUNCH ═══
52
+
53
+ E1. Verify logging before any traffic is exposed: confirm that assignment events (unit_id, variant, timestamp, experiment_id) and all outcome events are being emitted, persisted, and joinable on the same key. Run a 24-hour shadow log with zero traffic to validate the pipeline end-to-end.
54
+ E2. Ramp gradually: launch at 1–5% of eligible traffic, hold for at least one hour, and check error rates and guardrail metrics against baseline before expanding. This ramp check is an operational health check — it is NOT an A/A significance test and must not be used to make an early ship/kill call.
55
+ E3. Monitor guardrail metrics daily via simple threshold alerts, not p-value gates. Monitoring the primary metric during the run is peeking and inflates false-positive rate (Rule 1).
56
+ E4. Log all external events that could confound results (marketing campaigns, infrastructure incidents, seasonality events, competitor product launches) in the shared experiment log with timestamps. These are required for the post-experiment validity review.
57
+
58
+ ═══ PHASE F — SRM AND VALIDITY CHECKS ═══
59
+
60
+ F1. Sample Ratio Mismatch (SRM): at analysis time, run a chi-squared test on observed assignment counts vs. expected counts under the target split. If p < 0.01, the experiment is causally invalid — do not analyze the primary metric until the SRM is explained and resolved. Common causes: bot or crawler traffic included in assignment, logging loss asymmetric across variants, sticky-bucketing failures, exposure logging before assignment completes, or assignment-to-logging lag differences.
61
+ F2. Pre-experiment A/A balance check: using only pre-experiment data, compare the primary metric baseline across the assigned variants (pseudo-randomize on historical units using the same hashing function). A significant imbalance (p < 0.05) on a pre-experiment window metric is a red flag for assignment mechanism failure, not a genuine treatment effect.
62
+ F3. Novelty and primacy effect check: segment results by experiment day and by user tenure (new vs. established users). If the treatment lift decays monotonically over the experiment window or is concentrated entirely in the first 3 days, the effect is likely novelty-driven. Require the effect to be present and non-trivial in the final third of the experiment window's calendar days before classifying it as durable.
63
+ F4. SUTVA (Stable Unit Treatment Value Assumption) check: compare outcomes of control units that interacted with treated units vs. control units that did not. A statistically significant difference signals interference violations; escalate to a cluster or switchback experiment design.
64
+
65
+ ═══ PHASE G — EFFECT ESTIMATION ═══
66
+
67
+ G1. Use the pre-registered estimator exclusively. For proportions: difference in proportions with normal approximation CI (use exact Fisher or Wilson CI if n < 1000 per variant). For continuous metrics: two-sample Welch t-test or OLS with pre-registered covariates. Do not switch estimators post-hoc.
68
+ G2. CUPED variance reduction (Controlled-experiment Using Pre-Experiment Data): θ̂_CUPED = (ȳ_T − ȳ_C) − θ·(x̄_T − x̄_C), where θ = Cov(Y, X) / Var(X) estimated from the CONTROL ARM ONLY (not pooled, to avoid contamination), and X is the pre-experiment value of the primary metric for the same unit. CUPED reduces variance; it does not change the point estimate in expectation, only narrows the CI. Apply only if pre-registered.
69
+ G3. Report all of the following: point estimate, 95% CI (lower, upper), two-sided p-value, achieved power at the observed effect size (to quantify what was ruled out), and the absolute business impact at current volume (e.g., "+0.4 pp CTR; at 30M daily eligible users = +120K additional clicks/day").
70
+ G4. For ratio metrics (revenue/user, clicks/impression): compute SE using the delta method — Var(R) ≈ (1/μ_D²)·[Var(N) + R²·Var(D) − 2R·Cov(N,D)] — or bootstrap with ≥1000 iterations. Never use the naive ratio of aggregate sums as the standard error input.
71
+ G5. The CI is the primary statistical output. The p-value is secondary and must always be reported alongside the CI, never in isolation.
72
+
73
+ ═══ PHASE H — MULTIPLE COMPARISONS ═══
74
+
75
+ H1. Apply a correction for all pre-registered simultaneous comparisons. Default: Benjamini-Hochberg (BH) FDR control at q = 0.05 when testing 3+ secondary metrics; BH assumes test statistics are independent or positively correlated (PRDS condition), which holds for most product metrics that move together. Use Bonferroni when the tests are planned AND there is a specific need to control the family-wise error rate (FWER) rather than FDR — Bonferroni is valid regardless of correlation structure but is conservative when tests are correlated.
76
+ H2. The primary metric is always tested at the full pre-registered α. Secondary metrics use the adjusted α output from the correction procedure. Guardrail metrics use pre-specified absolute thresholds, not adjusted p-value gates.
77
+ H3. Exploratory metrics (post-hoc segmentation, unexpected segments, drill-downs discovered after unblinding) are hypothesis-generating only. Label them explicitly as EXPLORATORY in all outputs. Never make a ship/kill decision on exploratory findings; register them as hypotheses for the next experiment.
78
+ H4. For multi-variant tests (A/B/C/…): compare each treatment variant to control using Dunnett's test (or BH on the set of treatment-vs-control contrasts), not all pairwise. This preserves FWER against the control comparison without the unnecessary power penalty of testing all pairs.
79
+
80
+ ═══ PHASE I — DECISION ═══
81
+
82
+ I1. Apply the pre-registered decision rule (D1). Do not deviate based on visual dashboard inspection, stakeholder pressure, or proximity to a release deadline. The rule was set when the stakes were clear and the data was unseen — that is its entire value.
83
+ I2. SHIP if: primary metric effect size has CI fully above zero (or above the pre-specified positive threshold), no guardrail metric has violated its pre-specified threshold, SRM check passed, and novelty check passed.
84
+ I3. ITERATE if: effect is directional (point estimate positive) but the CI crosses zero AND the lower bound of the CI is above the negative MDE threshold (i.e., the treatment is not meaningfully harmful) AND there is a concrete, mechanistically grounded hypothesis for improving the treatment variant.
85
+ I4. KILL if: any guardrail metric violated its pre-registered threshold; SRM failed and cannot be explained by a known logging artifact; or the CI's upper bound is below the MDE (power analysis confirms a meaningful effect was ruled out, not merely not observed).
86
+ I5. EXTEND (one time only, pre-registered before unblinding) if: the required N was not reached due to lower-than-forecast traffic AND the pre-registered end date has not yet passed. Lock the new end date in version control before looking at any unblinded data. Never extend because the result is inconclusive after the pre-registered end date.
87
+ I6. Write a post-experiment record for every experiment: what was tested, what was found (including CIs), what was shipped, what was learned about the metric relationship and variance. Archive in the experiment log. Variance and baseline rate estimates from completed experiments are the most accurate inputs for future power calculations in the same product surface.
88
+
89
+ KEY PRINCIPLE: The experiment design is a contract — sign it before seeing the data, honor it without revision after.
@@ -0,0 +1,175 @@
1
+ ---
2
+ name: api-design
3
+ description: Design clean, evolvable APIs — start from consumer use cases, make illegal states unrepresentable, explicit errors, versioning and backward-compatibility.
4
+ allowed-tools: Read, Write, Edit, Grep, Glob
5
+ ---
6
+
7
+ ## Phase 0 — Grep for existing conventions FIRST
8
+ Before writing a single type, grep the repo for naming patterns, error shapes, pagination contracts, and auth patterns already in use. New APIs must match the grain of what exists.
9
+
10
+ ```
11
+ Grep for: "export type.*Result" | "export interface.*Response" | "export type.*Error"
12
+ Grep for: cursor | offset | page | limit (pagination convention)
13
+ Grep for: "401\|403\|404" in route handlers (error-response shape)
14
+ Grep for: "Bearer\|x-api-key\|Authorization" (auth header convention)
15
+ ```
16
+
17
+ Hard rule: if you invent a new naming or error convention, you own migrating everything to it. Prefer conforming.
18
+
19
+ ---
20
+
21
+ ## Phase 1 — Write the call site FIRST (design-by-example)
22
+ Before any implementation, write 3–5 concrete example usages as if you are the consumer:
23
+
24
+ ```ts
25
+ // Example 1: happy path
26
+ const result = await receipts.create({ agentId: "theron-cyber", claim: "...", sig: key })
27
+ if (!result.ok) return handleError(result.error) // typed discriminated union
28
+ console.log(result.value.receiptId)
29
+
30
+ // Example 2: list with pagination
31
+ const page = await receipts.list({ agentId, limit: 50, cursor: prev.nextCursor })
32
+
33
+ // Example 3: illegal state that must be rejected at the type level
34
+ receipts.create({ agentId: undefined }) // should not compile
35
+ ```
36
+
37
+ Ask: does this call site feel obvious? Is the intent readable? Are required vs optional params clear from types alone? Iterate here before writing any implementation.
38
+
39
+ ---
40
+
41
+ ## Phase 2 — Model the domain
42
+ 1. Name the RESOURCES (nouns): what are the durable entities? (Receipt, Agent, Run, Finding, Artifact)
43
+ 2. Name the OPERATIONS (verbs): create / get / list / update / delete / publish / verify — choose from a fixed vocabulary.
44
+ 3. State the INVARIANTS: what must always be true? (A receipt must have a valid sig. An agent must belong to an org. A run cannot be deleted while active.)
45
+ 4. Draw the ownership graph: who owns what? (Org → Agent → Run → Finding). This drives URL nesting depth and authz checks.
46
+
47
+ Hard rule: no more than two levels of nesting in a REST path (`/orgs/:orgId/agents/:agentId`). Deeper = flatten with a query param.
48
+
49
+ ---
50
+
51
+ ## Phase 3 — Choose the right shape
52
+ - **REST** for CRUD over resources with standard HTTP semantics (receipts, agents, runs). Use nouns, HTTP verbs (GET/POST/PATCH/DELETE), 200/201/204/400/401/403/404/409/422/500.
53
+ - **RPC / action endpoint** for operations that are not CRUD (POST `/runs/:id/cancel`, POST `/receipts/:id/verify`). Name them as verb-noun.
54
+ - **GraphQL** only if the consumer needs flexible field selection across many related resources in one round-trip. Overhead rarely justified for internal/CLI APIs.
55
+ - **Library API (TypeScript functions)** for SDK-internal logic, tool contracts, and anything called in-process. Prefer typed functions over string-keyed maps.
56
+
57
+ ---
58
+
59
+ ## Phase 4 — Make illegal states unrepresentable
60
+ 1. Use discriminated unions for result types — no mixed `{ data?, error? }` bags:
61
+ ```ts
62
+ type Result<T, E = ApiError> = { ok: true; value: T } | { ok: false; error: E }
63
+ ```
64
+ 2. Use branded types or `z.string().uuid()` to prevent mixing IDs:
65
+ ```ts
66
+ type ReceiptId = string & { readonly __brand: "ReceiptId" }
67
+ ```
68
+ 3. Encode state machines in the type (a `Run` in state `"completed"` has a `completedAt`; in `"pending"` it does not — use a union of state-shapes, not optional fields).
69
+ 4. Never use `any`. Never accept `unknown` without a Zod parse at the boundary.
70
+
71
+ ---
72
+
73
+ ## Phase 5 — Explicit, typed errors (never silent)
74
+ Every error must carry: `code` (stable machine-readable string), `message` (human), optional `field` for validation errors, optional `retryAfter` for rate limits.
75
+
76
+ ```ts
77
+ type ApiError =
78
+ | { code: "NOT_FOUND"; message: string }
79
+ | { code: "UNAUTHORIZED"; message: string }
80
+ | { code: "FORBIDDEN"; message: string }
81
+ | { code: "VALIDATION"; message: string; fields: Record<string, string> }
82
+ | { code: "CONFLICT"; message: string }
83
+ | { code: "RATE_LIMITED"; message: string; retryAfter: number }
84
+ | { code: "INTERNAL"; message: string }
85
+ ```
86
+
87
+ Hard rule: HTTP 500 must never leak stack traces or internal state to the client. Log internally, return a `INTERNAL` code externally.
88
+
89
+ ---
90
+
91
+ ## Phase 6 — Collections: pagination, filtering, sorting
92
+ Use CURSOR-based pagination (not offset) for any collection that can grow:
93
+ ```ts
94
+ // Request
95
+ { limit?: number; cursor?: string; filter?: { agentId?: string; since?: string } }
96
+ // Response
97
+ { items: T[]; nextCursor: string | null; total?: number }
98
+ ```
99
+
100
+ - Default `limit` to something sane (20–50). Cap it (max 200).
101
+ - Cursors are opaque base64-encoded server state. Never expose internal IDs as cursors.
102
+ - Filtering params use the same field names as the resource fields — no surprise aliases.
103
+ - Sorting: `sort=createdAt:desc` as a single string param. Default sort must be documented.
104
+
105
+ ---
106
+
107
+ ## Phase 7 — Mutating endpoints: idempotency + safety
108
+ - POST endpoints that create resources must accept an optional `idempotencyKey` (client-generated UUID). Return the same response on replay within 24h.
109
+ - PATCH must be partial (only send what changes). PUT replaces the whole resource — avoid PUT unless you mean it.
110
+ - DELETE must be idempotent: deleting an already-deleted resource returns 204, not 404.
111
+ - Destructive operations (delete, purge, revoke) must require explicit confirmation param OR a two-step (request → confirm token → execute).
112
+
113
+ ---
114
+
115
+ ## Phase 8 — Security: authz at the boundary
116
+ 1. Authenticate FIRST (parse + verify the bearer token/API key) before touching any DB or running any logic.
117
+ 2. Authorize at the resource level: after fetching the resource, assert the caller's org/role owns it. Never trust a URL param to scope access.
118
+ 3. Validate ALL inputs with Zod (or equivalent) at the route handler entry point. Reject before processing.
119
+ 4. Rate-limit at the gateway or middleware layer, not inside business logic.
120
+ 5. Never return a resource that belongs to a different org, even in error messages ("receipt not found" not "receipt belongs to another org").
121
+
122
+ ---
123
+
124
+ ## Phase 9 — Versioning + backward-compatibility
125
+ - URL version prefix (`/v1/`) for REST. Bump only on BREAKING changes.
126
+ - Additive changes (new optional field, new endpoint, new enum value) are NOT breaking — ship them freely.
127
+ - Breaking changes (remove field, rename field, change type, change semantics) require a new version OR a deprecation window (min 90 days, deprecation header on every response).
128
+ - Add `Deprecation: true` + `Sunset: <date>` response headers on deprecated endpoints.
129
+ - Keep a `CHANGELOG.md` entry for every change: `[added]`, `[deprecated]`, `[removed]`, `[breaking]`.
130
+ - Hard rule: never remove a field from a response without a deprecation period. Consumers break silently.
131
+
132
+ ---
133
+
134
+ ## Phase 10 — Minimal required params + least-surprise defaults
135
+ - Every required param must be impossible to omit (type-enforced, not runtime-guarded).
136
+ - Every optional param must have a documented default that is the overwhelmingly common case.
137
+ - If a caller must pass the same param on every call, it belongs in the client constructor, not the method signature.
138
+ - Boolean flags are a smell — prefer an enum that names the intent (`mode: "strict" | "lenient"` not `strict: boolean`).
139
+
140
+ ---
141
+
142
+ ## Phase 11 — Document the contract
143
+ For each endpoint/function, write inline:
144
+ 1. One-line purpose.
145
+ 2. All params: name, type, required/optional, default, constraints.
146
+ 3. Response shape with field semantics.
147
+ 4. All error codes that can be returned and when.
148
+ 5. One working example (request + response).
149
+
150
+ For REST APIs, emit an OpenAPI 3.1 spec (`openapi.yaml`) generated from the types — do not hand-write it separately from the code.
151
+
152
+ ---
153
+
154
+ ## Phase 12 — Validate against use cases before finalizing
155
+ Return to the example call sites from Phase 1. For each:
156
+ - Can it be expressed with the designed API without workarounds?
157
+ - Are there any surprise required params the consumer wouldn't know at call time?
158
+ - Does the error path give the consumer enough information to recover or display a useful message?
159
+ - Would a new engineer reading only the call site understand what it does?
160
+
161
+ If any answer is no, revise the design — not the examples.
162
+
163
+ ---
164
+
165
+ ## Hard rules summary
166
+ 1. Call site first. Implementation is a detail.
167
+ 2. Match existing conventions — grep before inventing.
168
+ 3. Discriminated Result unions. No mixed `data?/error?` bags.
169
+ 4. Illegal states must not compile.
170
+ 5. Errors are typed and machine-readable (`code` field).
171
+ 6. Cursor pagination for all collections.
172
+ 7. Authz at the boundary, on every request, on the actual resource.
173
+ 8. Additive change is free. Breaking change costs a version bump or 90-day deprecation.
174
+ 9. Never leak internals in error responses.
175
+ 10. Document the contract in the code, not in a separate doc that goes stale.
@@ -0,0 +1,185 @@
1
+ ---
2
+ name: architecture-design
3
+ description: Design a system from the non-functional requirements + load numbers — components and boundaries, storage/consistency tradeoffs, failure modes, ADRs, and design-for-change.
4
+ allowed-tools: Read, Write, Grep, Glob
5
+ ---
6
+
7
+ ## Phase 0 — Read the repo before designing anything
8
+
9
+ 1. Grep for existing conventions: naming patterns, error shapes, transport layers, auth, DB schema, env var conventions.
10
+ 2. Identify what already exists that the new system can use or must integrate with. Designing in a vacuum produces orphan systems.
11
+ 3. State the single sentence the design must satisfy: "Support X doing Y with Z constraint." If you cannot write this sentence, stop and clarify.
12
+
13
+ ---
14
+
15
+ ## Phase 1 — Nail the requirements (functional + non-functional)
16
+
17
+ 4. List functional requirements as user-facing behaviors: "User can create a receipt. Receipt is verifiable offline. Agent can query its own run history."
18
+ 5. List the non-functional requirements that actually drive architecture — these are the ones architects miss:
19
+ - **Scale**: peak QPS read / peak QPS write / concurrent users / data volume at launch / at 12 months / at 36 months
20
+ - **Latency**: p50 / p99 targets per critical path (interactive vs. background)
21
+ - **Availability**: acceptable downtime per month (99.9% = 43 min/mo; 99.99% = 4 min/mo — know the number before picking topology)
22
+ - **Consistency**: can users see stale data? For how long? Can you lose writes? (CAP forces a choice — name it)
23
+ - **Durability**: what data can never be lost? (receipts, payments, audit logs are non-negotiable; ephemeral cache is fine to lose)
24
+ - **Security**: who are the threat actors? What must be encrypted at rest / in transit? What must be auditable?
25
+ - **Cost envelope**: $/month ceiling, cost sensitivity (bandwidth-heavy? storage-heavy? compute-heavy?)
26
+ - **Evolvability**: what is the most likely next feature? Design for that specific change, not for infinite unknowns
27
+ - **Compliance / regulatory**: GDPR, SOC2, HIPAA, export control — flag if any apply before choosing storage region or vendor
28
+ 6. Separate hard constraints (must-haves, deal-breakers) from soft preferences. Hard constraints are immovable; preferences are trade-off inputs.
29
+
30
+ ---
31
+
32
+ ## Phase 2 — Estimate the load (architecture follows the numbers)
33
+
34
+ 7. Do back-of-envelope math. Write it down explicitly — guesses buried in prose are not reviewable:
35
+ ```
36
+ Users: 10K DAU at launch → 100K at 12mo
37
+ Write QPS: 10K DAU × 5 writes/day / 86400s ≈ 0.6 QPS → rounds to ~1 QPS (comfortable single Postgres)
38
+ Read QPS: 10× write ratio = ~6 QPS peak → trivial, no read replica needed yet
39
+ Storage: 5KB per receipt × 10K users × 365 days = ~18GB/year → single DB volume, no sharding
40
+ Growth trigger: at 100K users writes hit ~6 QPS sustained → revisit at that milestone
41
+ ```
42
+ 8. From the numbers, derive what the architecture DOES NOT need yet. Explicitly state what you are deferring and at what load threshold to revisit. This prevents premature optimization.
43
+ 9. Identify the ONE bottleneck most likely to be hit first. Design around it. Everything else is secondary.
44
+
45
+ ---
46
+
47
+ ## Phase 3 — Identify core components, responsibilities, and boundaries
48
+
49
+ 10. List every major component as a box with ONE clear responsibility statement. If you need "and" to describe it, split it.
50
+ 11. Draw the dependency arrows. Dependencies must be acyclic at the component level. Cycles = wrong decomposition — break them with an interface or an event.
51
+ 12. Apply the test: could you swap out component X without touching Y? If not, the boundary is wrong.
52
+ 13. Separate stateless from stateful components. Stateless can be scaled horizontally and replaced trivially. Stateful components need replication, failover, and migration strategies — treat them with extra care.
53
+ 14. Name the INTERFACES between components (not implementations). The interface is the contract; the component is a detail.
54
+
55
+ Typical component taxonomy for reference (do not copy blindly — omit what does not apply):
56
+ - **Gateway / edge**: auth enforcement, rate limiting, routing, TLS termination
57
+ - **API layer**: request validation, orchestration, response shaping — no business logic
58
+ - **Domain services**: business rules, aggregates, invariant enforcement
59
+ - **Workers / queue consumers**: async background jobs, retries, idempotency
60
+ - **Data stores**: primary DB, cache, blob store, search index, event log
61
+ - **External integrations**: third-party APIs, webhooks — always behind an anti-corruption layer
62
+
63
+ ---
64
+
65
+ ## Phase 4 — Data model and storage choice
66
+
67
+ 15. Model the data around access patterns, not around entities. Ask "what queries will this system run?" before writing a schema.
68
+ 16. Identify the access patterns for each entity: point lookup by ID? range scan by time? full-text search? aggregation? Each pattern pulls toward a different storage type.
69
+ 17. Choose storage type by workload shape:
70
+ - **Relational (Postgres/Neon)**: transactions, foreign keys, flexible queries, moderate scale, strong consistency — default choice until proven wrong
71
+ - **Key-value (Redis/Upstash)**: ephemeral state, sessions, leases, rate-limit counters, sub-millisecond reads — never use as a primary store for durable data
72
+ - **Blob / object store (R2/S3)**: large binary objects, model weights, artifacts, media — never query inside blobs
73
+ - **Time-series / append-only log**: audit trails, event streams, telemetry — immutable by design
74
+ - **Search index (Postgres full-text / Elastic)**: keyword search, fuzzy match — derived, always reconstructible from the primary store
75
+ 18. State the CAP/consistency tradeoff explicitly:
76
+ - "We choose consistency over availability for receipts — a 503 is better than a corrupt receipt."
77
+ - "We choose availability over consistency for agent status — stale status for 30s is acceptable."
78
+ 19. Design the schema migrations strategy upfront: can you add columns without downtime? (Yes for Postgres add-column with default.) Can you rename columns? (No, requires a multi-step deploy.) Know the answer before you ship.
79
+ 20. Identify what data is owned by this system vs. what it reads from another system. Do not store data you do not own — subscribe to events or call the owning service.
80
+
81
+ ---
82
+
83
+ ## Phase 5 — Data flow and critical paths
84
+
85
+ 21. Trace the critical path for the top 3 user-facing operations end-to-end: every hop, every I/O, every transformation.
86
+ 22. Count the number of synchronous I/O calls on each critical path. Every added hop adds latency and a failure point. Minimize hops on the hot path.
87
+ 23. Identify which steps MUST be synchronous (user is waiting) vs. which can be async (fire-and-forget, queue, or eventual). Move as much as possible off the synchronous path.
88
+ 24. Draw the write path separately from the read path. They often have different consistency and latency requirements and can be optimized independently.
89
+ 25. Identify fan-out: does one write trigger writes to multiple downstream systems? Fan-out is a reliability multiplier — if any downstream fails, what happens?
90
+
91
+ ---
92
+
93
+ ## Phase 6 — Key tradeoffs — name them explicitly, pick with reasons
94
+
95
+ Do not hide tradeoffs in vague language. State each one as "Option A vs Option B — we pick A because X, accepting the downside of Y."
96
+
97
+ 26. **Sync vs. async**: sync = simpler, easier to reason about, couples latency; async = decoupled, resilient, harder to debug and trace.
98
+ 27. **Strong vs. eventual consistency**: strong = correct but slower and harder to scale; eventual = faster and more available but requires idempotency and conflict resolution everywhere.
99
+ 28. **Monolith vs. services**: monolith = simpler ops, refactor freely, one deploy, no network hops; services = independent deploy, isolated failure, harder to trace and test.
100
+ 29. **Normalize vs. denormalize**: normalized = consistent, smaller writes, expensive reads requiring joins; denormalized = fast reads, write amplification, consistency burden moves to application code.
101
+ 30. **Push vs. pull**: push (webhooks, events) = low latency, complex fan-out, receiver must be available; pull (polling) = simple, receiver controls load, higher latency.
102
+ 31. **Build vs. buy**: build = control, cost at scale, maintenance burden; buy = speed, vendor lock-in, ongoing cost.
103
+
104
+ ---
105
+
106
+ ## Phase 7 — Failure modes and degradation strategy
107
+
108
+ 32. For each component, ask: "What happens when this is unavailable?" Design the degraded behavior explicitly — do not leave it to chance.
109
+ 33. Apply bulkhead pattern: isolate failure domains so that a failure in component X does not cascade to Y. Separate thread pools, connection pools, and circuit breakers per downstream dependency.
110
+ 34. Define timeouts everywhere: every outbound network call must have a timeout. "Infinite wait" is not a valid strategy. Set timeouts based on the SLA of the caller, not the SLA of the callee.
111
+ 35. Define retry policy per operation type:
112
+ - Idempotent reads: retry with exponential backoff + jitter, max 3 attempts
113
+ - Non-idempotent writes: do NOT retry automatically — require idempotency key + deduplication at the receiver
114
+ - Background jobs: retry with backoff, dead-letter queue after N failures, alert on DLQ depth
115
+ 36. Idempotency is not optional for any write that can be retried. Every mutation endpoint must accept an idempotency key or be idempotent by nature (PUT, DELETE).
116
+ 37. Backpressure: if a downstream consumer is slower than the producer, the queue grows unbounded. Name the backpressure mechanism: queue depth limit, rate limiting at ingress, or consumer auto-scaling trigger.
117
+ 38. Define the runbook for the top 3 failure scenarios now, not after the first outage:
118
+ - DB primary goes down → failover to replica, expected downtime, data loss window
119
+ - Queue backs up → consumer scale-out trigger, alert threshold
120
+ - External API is unavailable → circuit open, cached response or graceful error to user
121
+
122
+ ---
123
+
124
+ ## Phase 8 — Identify the riskiest assumption and de-risk it first
125
+
126
+ 39. List every assumption the design rests on. Rank by: (probability of being wrong) × (cost if wrong).
127
+ 40. The top-ranked assumption is the one to validate BEFORE writing a single line of production code. Build the smallest possible spike — a throwaway prototype, a load test, a proof-of-concept query — to confirm or refute it.
128
+ 41. Common high-risk assumptions: "this third-party API will have the latency we need," "Postgres can handle this query at 100× current load," "the queue consumer can keep up with peak write throughput," "users will do X not Y."
129
+ 42. Document the result of the de-risking experiment. If the assumption was wrong, revise the design before committing to an implementation.
130
+
131
+ ---
132
+
133
+ ## Phase 9 — Architecture Decision Records (ADRs)
134
+
135
+ For every significant decision, write a micro-ADR inline or in `docs/adr/`:
136
+
137
+ ```
138
+ ## ADR-NNN: <title>
139
+ **Date**: YYYY-MM-DD
140
+ **Status**: Accepted | Superseded by ADR-NNN
141
+
142
+ **Context**: <the forces at play, the problem being solved, constraints>
143
+ **Decision**: <what we decided, stated precisely>
144
+ **Consequences**: <what becomes easier, what becomes harder, what we accept as a known cost>
145
+ ```
146
+
147
+ 43. Decisions that always need an ADR: storage engine choice, consistency model, sync vs. async, service boundary, auth mechanism, API versioning strategy, caching strategy.
148
+ 44. ADRs are immutable — when a decision changes, write a new ADR marked "Supersedes ADR-NNN." Never edit an accepted ADR to change its decision.
149
+
150
+ ---
151
+
152
+ ## Phase 10 — Design for change
153
+
154
+ 45. Ask for each component: "What is the most likely change in the next 6 months?" Then verify the component boundary isolates that change — a change inside it must not ripple outward.
155
+ 46. Likely-to-change things: pricing model, auth provider, storage backend, third-party API contract, business rules. Wrap each behind an interface or adapter — the rest of the system depends on the abstraction, not the implementation.
156
+ 47. Stable abstractions principle: depend on stable things (interfaces, domain concepts) not volatile things (vendor SDKs, specific DB query syntax).
157
+ 48. Feature flags are a design choice, not an afterthought. Any capability that may need to be toggled, rolled out gradually, or killed quickly must be flag-gated from day one.
158
+ 49. Schema migrations must be backward-compatible for at least one full deploy cycle: add before remove, nullable before required, two-phase rename (add new column → backfill → switch reads → drop old).
159
+
160
+ ---
161
+
162
+ ## Phase 11 — Anti-over-engineering check
163
+
164
+ 50. Before finalizing, apply these tests:
165
+ - Do you actually have the load that justifies sharding, caching, or a queue? If not, remove it.
166
+ - Does the number of services match the number of independent deploy/scale requirements? If not, merge.
167
+ - Is there a simpler data model that handles 90% of the use cases? Take it.
168
+ - Can this be a single Postgres table with an index instead of a separate service? Check seriously.
169
+ - Could a junior engineer understand and operate this system without you? If not, simplify.
170
+ 51. Write down what you deliberately chose NOT to build and why. This is as important as what you built — it prevents well-intentioned additions from violating the design.
171
+
172
+ ---
173
+
174
+ ## Hard rules
175
+
176
+ 1. Architecture follows the numbers. Estimate load before choosing topology.
177
+ 2. Name every tradeoff and pick with reasons. Hidden tradeoffs become bugs.
178
+ 3. Failure modes are first-class. Every component needs a degradation answer.
179
+ 4. Idempotency is not optional for any write that can fail and be retried.
180
+ 5. De-risk the riskiest assumption before writing production code.
181
+ 6. One ADR per significant decision, immutable after acceptance.
182
+ 7. Design for the specific likely change, not infinite unknowns.
183
+ 8. Avoid the scale you do not have. State the threshold that triggers revisiting.
184
+ 9. Stateful components are expensive — minimize their number and isolate them.
185
+ 10. Match existing repo conventions. A design that ignores the grain of the codebase becomes a maintenance island.