@vextlabs/theron-cli 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (169) hide show
  1. package/dist/api.d.ts +8 -0
  2. package/dist/api.js +3 -0
  3. package/dist/api.js.map +1 -1
  4. package/dist/auth.js +51 -1
  5. package/dist/auth.js.map +1 -1
  6. package/dist/banner.js +3 -2
  7. package/dist/banner.js.map +1 -1
  8. package/dist/checkpoints.d.ts +32 -0
  9. package/dist/checkpoints.js +61 -0
  10. package/dist/checkpoints.js.map +1 -0
  11. package/dist/index.js +59 -4
  12. package/dist/index.js.map +1 -1
  13. package/dist/input.d.ts +61 -0
  14. package/dist/input.js +574 -0
  15. package/dist/input.js.map +1 -0
  16. package/dist/profiles/index.js +5 -0
  17. package/dist/profiles/index.js.map +1 -1
  18. package/dist/profiles/methodologies/operate_domains.d.ts +8 -0
  19. package/dist/profiles/methodologies/operate_domains.js +1239 -0
  20. package/dist/profiles/methodologies/operate_domains.js.map +1 -0
  21. package/dist/profiles/seeds.js +57 -36
  22. package/dist/profiles/seeds.js.map +1 -1
  23. package/dist/receipt.d.ts +17 -0
  24. package/dist/receipt.js +46 -0
  25. package/dist/receipt.js.map +1 -0
  26. package/dist/render.d.ts +4 -1
  27. package/dist/render.js +95 -28
  28. package/dist/render.js.map +1 -1
  29. package/dist/repl.d.ts +8 -1
  30. package/dist/repl.js +420 -62
  31. package/dist/repl.js.map +1 -1
  32. package/dist/sessions.d.ts +14 -0
  33. package/dist/sessions.js +100 -0
  34. package/dist/sessions.js.map +1 -1
  35. package/dist/ship.d.ts +2 -0
  36. package/dist/ship.js +62 -0
  37. package/dist/ship.js.map +1 -0
  38. package/dist/skills/catalog.d.ts +13 -0
  39. package/dist/skills/catalog.js +86 -0
  40. package/dist/skills/catalog.js.map +1 -0
  41. package/dist/tools/bash.js +81 -14
  42. package/dist/tools/bash.js.map +1 -1
  43. package/dist/tools/edit.js +21 -1
  44. package/dist/tools/edit.js.map +1 -1
  45. package/dist/tools/glob.js +4 -1
  46. package/dist/tools/glob.js.map +1 -1
  47. package/dist/tools/grep.d.ts +5 -0
  48. package/dist/tools/grep.js +101 -2
  49. package/dist/tools/grep.js.map +1 -1
  50. package/dist/tools/index.d.ts +22 -0
  51. package/dist/tools/index.js +177 -41
  52. package/dist/tools/index.js.map +1 -1
  53. package/dist/tools/ls.d.ts +3 -0
  54. package/dist/tools/ls.js +23 -12
  55. package/dist/tools/ls.js.map +1 -1
  56. package/dist/tools/multiedit.d.ts +12 -0
  57. package/dist/tools/multiedit.js +79 -0
  58. package/dist/tools/multiedit.js.map +1 -0
  59. package/dist/tools/stoa.d.ts +1 -1
  60. package/dist/tools/stoa.js +7 -3
  61. package/dist/tools/stoa.js.map +1 -1
  62. package/dist/tools/task.d.ts +9 -0
  63. package/dist/tools/task.js +166 -0
  64. package/dist/tools/task.js.map +1 -0
  65. package/dist/tools/todowrite.d.ts +12 -0
  66. package/dist/tools/todowrite.js +38 -0
  67. package/dist/tools/todowrite.js.map +1 -0
  68. package/dist/tools/webfetch.d.ts +6 -0
  69. package/dist/tools/webfetch.js +98 -0
  70. package/dist/tools/webfetch.js.map +1 -0
  71. package/dist/tools/websearch.d.ts +7 -0
  72. package/dist/tools/websearch.js +83 -0
  73. package/dist/tools/websearch.js.map +1 -0
  74. package/dist/tools/write.js +17 -1
  75. package/dist/tools/write.js.map +1 -1
  76. package/dist/verifiers/confidence_marked.d.ts +2 -0
  77. package/dist/verifiers/confidence_marked.js +49 -0
  78. package/dist/verifiers/confidence_marked.js.map +1 -0
  79. package/dist/verifiers/disclaimer_gate.d.ts +2 -0
  80. package/dist/verifiers/disclaimer_gate.js +57 -0
  81. package/dist/verifiers/disclaimer_gate.js.map +1 -0
  82. package/dist/verifiers/index.d.ts +5 -0
  83. package/dist/verifiers/index.js +20 -7
  84. package/dist/verifiers/index.js.map +1 -1
  85. package/dist/verifiers/lint.js +4 -3
  86. package/dist/verifiers/lint.js.map +1 -1
  87. package/dist/verifiers/promoted_kernels.d.ts +8 -0
  88. package/dist/verifiers/promoted_kernels.js +190 -0
  89. package/dist/verifiers/promoted_kernels.js.map +1 -0
  90. package/dist/verifiers/source_gate.js +2 -3
  91. package/dist/verifiers/source_gate.js.map +1 -1
  92. package/dist/verifiers/test_smoke.js +30 -0
  93. package/dist/verifiers/test_smoke.js.map +1 -1
  94. package/dist/verifiers/types.d.ts +3 -0
  95. package/package.json +4 -2
  96. package/skills/README.md +123 -0
  97. package/skills/ab-test.md +89 -0
  98. package/skills/api-design.md +175 -0
  99. package/skills/architecture-design.md +185 -0
  100. package/skills/business-case.md +77 -0
  101. package/skills/causal-inference.md +77 -0
  102. package/skills/clinical-guideline.md +98 -0
  103. package/skills/code-review.md +98 -0
  104. package/skills/cold-outreach.md +268 -0
  105. package/skills/competitive-teardown.md +223 -0
  106. package/skills/component-spec.md +121 -0
  107. package/skills/content-calendar.md +280 -0
  108. package/skills/contract-review.md +155 -0
  109. package/skills/data-analysis.md +187 -0
  110. package/skills/debug.md +91 -0
  111. package/skills/design-audit.md +121 -0
  112. package/skills/differential-diagnosis.md +79 -0
  113. package/skills/discovery-call.md +206 -0
  114. package/skills/edit-pass.md +80 -0
  115. package/skills/engineering-calc.md +101 -0
  116. package/skills/estimate.md +70 -0
  117. package/skills/experiment-design.md +105 -0
  118. package/skills/fact-check.md +82 -0
  119. package/skills/financial-model.md +104 -0
  120. package/skills/grant-proposal.md +93 -0
  121. package/skills/harmony-analysis.md +93 -0
  122. package/skills/hypothesis-generation.md +99 -0
  123. package/skills/incident-response.md +134 -0
  124. package/skills/interview-loop.md +62 -0
  125. package/skills/job-scorecard.md +92 -0
  126. package/skills/kb-article.md +174 -0
  127. package/skills/launch-plan.md +85 -0
  128. package/skills/lease-review.md +93 -0
  129. package/skills/lesson-plan.md +198 -0
  130. package/skills/literature-review.md +69 -0
  131. package/skills/market-entry.md +137 -0
  132. package/skills/market-sizing.md +159 -0
  133. package/skills/meta-analysis.md +140 -0
  134. package/skills/migrate.md +117 -0
  135. package/skills/optimize.md +88 -0
  136. package/skills/options-strategy.md +166 -0
  137. package/skills/peer-review.md +96 -0
  138. package/skills/pentest-plan.md +193 -0
  139. package/skills/pitch-review.md +132 -0
  140. package/skills/plan.md +88 -0
  141. package/skills/policy-brief.md +124 -0
  142. package/skills/positioning.md +192 -0
  143. package/skills/postmortem.md +168 -0
  144. package/skills/prd.md +105 -0
  145. package/skills/prioritize.md +162 -0
  146. package/skills/proof.md +91 -0
  147. package/skills/property-underwrite.md +159 -0
  148. package/skills/recipe-develop.md +109 -0
  149. package/skills/red-team.md +142 -0
  150. package/skills/refactor.md +58 -0
  151. package/skills/reflection-session.md +115 -0
  152. package/skills/regulatory-compliance.md +136 -0
  153. package/skills/reproduce.md +87 -0
  154. package/skills/runbook.md +344 -0
  155. package/skills/security-audit.md +154 -0
  156. package/skills/seo-brief.md +201 -0
  157. package/skills/sql-query.md +161 -0
  158. package/skills/story-craft.md +163 -0
  159. package/skills/tdd.md +59 -0
  160. package/skills/term-sheet.md +298 -0
  161. package/skills/theory-of-change.md +88 -0
  162. package/skills/threat-model.md +104 -0
  163. package/skills/ticket-triage.md +200 -0
  164. package/skills/tolerance-analysis.md +149 -0
  165. package/skills/training-program.md +151 -0
  166. package/skills/translate.md +64 -0
  167. package/skills/unit-economics.md +238 -0
  168. package/skills/valuation.md +112 -0
  169. package/skills/write-tests.md +77 -0
@@ -0,0 +1,77 @@
1
+ ---
2
+ name: business-case
3
+ description: Build a rigorous business case for a decision or investment: size the problem and opportunity, model costs/benefits/risks across options including do-nothing, surface assumptions explicitly, and deliver a single prioritized recommendation with leading indicators and the strongest counterargument pre-empted.
4
+ allowed-tools: Read, WebSearch, Write
5
+ ---
6
+
7
+ ═══ HARD RULES ═══
8
+ 1. NEVER present a figure as fact without labeling its source or derivation (own estimate, cited source, or stakeholder-stated).
9
+ 2. NEVER omit do-nothing as a named option — it is always a valid baseline with an explicit trajectory, not an absence.
10
+ 3. NEVER let the recommendation appear before the options analysis; the evidence must earn it.
11
+ 4. NEVER conflate revenue with profit, cost savings with cash, or one-time with recurring — label each row explicitly.
12
+ 5. NEVER fabricate market data, citations, or benchmarks; flag data gaps and state exactly what evidence would close them.
13
+ 6. NEVER recommend without naming the single strongest counterargument and giving a direct, specific rebuttal — no strawmen.
14
+ 7. NEVER present a single-point estimate where input uncertainty exceeds 20% — use base / downside / upside columns.
15
+ 8. ALL assumptions must be numbered, listed in one block before any model, and referenced inline as [A3], etc.
16
+ 9. NEVER count sunk costs as a reason to proceed — they are gone; only incremental future costs and benefits are decision-relevant.
17
+ 10. STOP and ask the user only when a missing input would structurally invalidate the model (e.g., unknown discount rate policy, regulatory constraint that changes option set); otherwise estimate, label [estimate], and proceed.
18
+
19
+ ═══ PHASE A — PROBLEM DEFINITION & DECISION FRAMING ═══
20
+ A1. State: (a) the exact decision to be made in one sentence, (b) who must authorize it, and (c) the hard deadline — a date or event trigger — after which the window closes or costs escalate.
21
+ A2. Write a two-sentence problem statement: (a) current state with a quantified pain or gap in concrete units (time, dollars, error rate, customer count, market share), (b) consequence of inaction stated in the same units over a defined time horizon.
22
+ A3. Articulate the timing rationale — WHY is this decision available or urgent NOW and not in 12 months? Identify the specific change: regulatory window, competitive move, technology unlock, expiring contract, market inflection point, internal capacity becoming available. If the timing driver is weak, flag that urgency may be manufactured.
23
+ A4. Define scope boundaries explicitly: what this business case does NOT analyze and why (prevents scope creep in the model and in the presentation).
24
+ A5. Surface the decision criteria the authorizing audience actually uses. Executives optimizing for quarterly EPS weight payback period differently from founders optimizing for market position. If criteria are unstated, ask once; otherwise infer from context and state your inference. Common hard thresholds: IRR floor, payback cap (e.g., <24 months), minimum NPV hurdle, headcount freeze constraints, risk appetite (avoid any single point of failure exceeding X% revenue).
25
+ A6. Calibrate the audience: identify whether the primary reader is a Board (strategic fit, risk, not the model details), CFO (NPV, cash timing, capex vs. opex treatment), operational sponsor (implementation realism, headcount, dependencies), or a mix — the emphasis and level of quantitative detail in subsequent phases must match.
26
+
27
+ ═══ PHASE B — OPTIONS GENERATION ═══
28
+ B1. Generate a minimum of three options:
29
+ - Option 0 (Do Nothing): Describe the status quo TRAJECTORY — revenue decline, cost drift, competitive erosion — not just a frozen snapshot. Quantify where Option 0 lands at the end of the analysis horizon.
30
+ - Active options (minimum two): Apply the BUILD / BUY / PARTNER decision matrix below BEFORE naming options:
31
+ * BUILD if: the capability is a core differentiator, build cost < 2× buy cost over 3 years, and internal team has or can rapidly acquire the competency.
32
+ * BUY (acquire product/vendor) if: time-to-value is critical, the market has mature solutions with proven references, and switching cost is manageable.
33
+ * PARTNER / LICENSE if: the capability is non-core, volume does not justify owning the asset, or regulatory/IP risk makes ownership unattractive.
34
+ * Eliminate any quadrant that fails first-principles and document why — do not silently drop it.
35
+ - Option MV (Minimum Viable): A reduced-scope entry into the leading active option. Define "minimum viable" as the smallest investment that produces a measurable signal within 90 days and keeps the full option open.
36
+ B2. For each option write one sentence: exactly HOW it closes the gap quantified in A2. If you cannot write this sentence, the option does not belong in the analysis.
37
+ B3. Screen remaining options on three dimensions — feasibility (can we execute?), strategic fit (does it reinforce or dilute our position?), time-to-value (when does the first benefit dollar appear?) — and eliminate options failing any dimension. Record the reason; do not silently remove.
38
+ B4. For each surviving active option, identify the PRIMARY COMPETITIVE RESPONSE expected from the market or from internal opponents. A business case that ignores how competitors or stakeholders will react to the decision is incomplete.
39
+
40
+ ═══ PHASE C — COST & BENEFIT QUANTIFICATION ═══
41
+ C1. List ALL numbered assumptions before building any model. Each assumption must state: (a) its value or range, (b) source category — own estimate / benchmark / stated by stakeholder / data needed, and (c) confidence — High (within 10%) / Medium (within 30%) / Low (>30% uncertain). Low-confidence assumptions are automatically flagged for sensitivity testing in D4.
42
+ C2. For each surviving option, build a cost/benefit table. These row categories are mandatory where material; add domain-specific rows as needed:
43
+ - One-time costs: capital expenditure, implementation labor (internal hours at fully-loaded rate), migration, integration, training, license setup, decommissioning of displaced systems.
44
+ - Recurring costs / run-rate delta vs. Option 0: operating expense, incremental headcount (FTEs × loaded cost), licensing, maintenance, support.
45
+ - Quantified benefits (each labeled as revenue uplift, cost avoidance, risk reduction, or productivity gain): state the mechanism, not just the number. "Reduces churn by 2pp" is better than "increases revenue by $X."
46
+ - Time horizon: minimum 3 years for operational decisions, 5 years for capital-intensive or strategic decisions. State the horizon and justify it.
47
+ - Discount rate: use the organization's WACC or hurdle rate if known; otherwise use 10% for well-established businesses, 15–20% for startups or high-uncertainty environments. State the rate and its source.
48
+ - NPV / 3-year net benefit per option.
49
+ - Break-even point in months.
50
+ C3. Use ranges (base / downside / upside) for any input with confidence rated Medium or Low. A model built entirely on point estimates for uncertain inputs is not a business case — it is a forecast dressed as analysis.
51
+ C4. Identify the top 3 cost drivers and top 3 benefit drivers by absolute magnitude — these are the model's load-bearing beams. Sensitivity analysis in Phase D will stress-test these specifically.
52
+ C5. Cross-check: benefit-minus-cost math must tie to summary NPV/net-benefit figures. Disclose any rounding. Confirm that one-time costs are not counted as recurring or vice versa.
53
+ C6. Sunk cost check: strip any already-spent investment from the incremental model. Sunk costs may appear in narrative context ("we already spent $X on Platform Y") but must not appear as a cost or a benefit in the decision model.
54
+
55
+ ═══ PHASE D — RISK ANALYSIS ═══
56
+ D1. Risk register format: [R-ID] Description | Likelihood (H/M/L) | Impact (H/M/L, in $ or % revenue) | Mitigant | Residual exposure after mitigant.
57
+ D2. Classify each risk by type: execution (can we deliver?), market (will demand materialize?), regulatory/compliance, technical (will it work at scale?), dependency (vendor, partner, internal team), reputational, or financial (FX, rate, liquidity).
58
+ D3. Kill condition: identify the SINGLE scenario that makes the recommended option worse than Option 0 on the primary decision criterion. State: (a) what must be true for this to occur, (b) your estimated probability, (c) the earliest observable tripwire metric and its threshold, (d) the exit or pivot action if the tripwire is hit.
59
+ D4. Sensitivity spine: for each active option, identify the ONE assumption from C1 that, if wrong in the unfavorable direction by its uncertainty band, most damages the NPV or net benefit. Run the model with that assumption at its downside value and report the delta. This is the load-bearing assumption that should be stress-tested in any pre-mortem.
60
+ D5. Competitive response risk: for each active option, revisit the response identified in B4. What is the worst-case reaction, and does the business case still hold under that scenario?
61
+
62
+ ═══ PHASE E — OPTIONS COMPARISON & RECOMMENDATION ═══
63
+ E1. Side-by-side comparison table (one row per option):
64
+ Option | NPV (base) | NPV (downside) | Payback (months) | Risk rating | Strategic fit (H/M/L) | Verdict
65
+ The table must be readable by an executive who has not read Phases A–D.
66
+ E2. Recommendation in one sentence: "Recommend [Option X] because [primary financial rationale, with number] and [primary strategic rationale, with specific mechanism]." Vague rationales ("best overall value") are not permitted.
67
+ E3. Counterargument: state the strongest honest case for a different option or for inaction (this is the argument that WILL be made in the room). Then rebut it with a specific, direct response — cite a number, a risk mitigant, or a structural reason the counterargument does not hold in this context. No dismissal without substance.
68
+ E4. Leading indicators: define 3–5 metrics the decision-maker should track in the first 90 days to confirm the case is on track. Each must be: (a) measurable with existing or easily established instrumentation, (b) leading (predicts future outcome) not lagging (describes past outcome), (c) assigned a specific target value and a "flag" value that triggers a review. Example: "Pilot conversion rate ≥ 18% by day 60 (flag if <12%)."
69
+ E5. Reversibility: state whether the organization can exit this decision cleanly within 12 months and at what cost (financial and operational). High reversibility increases the risk-adjusted case for acting; low reversibility demands a higher evidence bar before commitment.
70
+
71
+ ═══ PHASE F — ASSUMPTIONS REGISTER & NEXT STEPS ═══
72
+ F1. Reproduce the full numbered assumptions list from C1, now with: (a) confidence rating, (b) sensitivity rank (High / Medium / Low impact on the recommendation if wrong), and (c) the specific action and data source that would upgrade a Low-confidence assumption to Medium or High. Sort by sensitivity rank descending.
73
+ F2. Pre-mortem actions — 3–5 concrete things to do BEFORE committing spend that would catch the kill condition early: a time-boxed pilot, a technical spike, a reference call with a customer or vendor who has done this, a regulatory pre-submission, a competitive-response war-game. Each action must have an owner role, a timeline, and a pass/fail criterion.
74
+ F3. Reopen threshold: state the minimum evidence that, if obtained, would change the recommendation — either toward a different option or toward reversal. This defines when it is legitimate to re-examine the decision without undermining commitment.
75
+ F4. Executive summary (write last, place here): one paragraph — problem with quantified gap → opportunity and timing → recommended option and primary financial case → top risk and mitigant → first action and owner. Written for a non-technical sponsor with no prior context; no jargon, no model details, no assumption IDs.
76
+
77
+ KEY PRINCIPLE: A business case is only as strong as its weakest assumption. Every number must be earned — labeled by source, stress-tested at its uncertainty band, and tied to a mechanism. The recommendation does not lead; the evidence does. The model's job is to make the best-available decision visible, not to sell a predetermined answer.
@@ -0,0 +1,77 @@
1
+ ---
2
+ name: causal-inference
3
+ description: Identify causal effects, not correlations — define the estimand, draw the DAG, pick an identification strategy (RCT/IV/RDD/DiD/matching), and run falsification tests.
4
+ allowed-tools: Read, Bash, Write, Grep
5
+ ---
6
+
7
+ ## Phase 0 — State the Estimand Before Touching Data
8
+
9
+ 1. Write the estimand in one sentence: target population (treated units / full population / subgroup), treatment contrast (T=1 vs T=0, or dose shift), outcome, and time horizon. ATE = E[Y(1)−Y(0)]; ATT = E[Y(1)−Y(0)|T=1]; CATE = E[Y(1)−Y(0)|X=x].
10
+ 2. Declare the unit of analysis (individual, firm, market, day) and the causal horizon (short-run vs long-run effects may differ).
11
+ 3. Write down what the ideal randomised experiment would look like. If you cannot describe it, you do not yet have a clear estimand.
12
+
13
+ ## Phase 1 — Draw the DAG (Do This on Paper First)
14
+
15
+ 4. List every variable: treatment T, outcome Y, pre-treatment covariates X, potential mediators M, potential colliders C, instruments Z.
16
+ 5. Draw directed edges encoding your domain knowledge. An arrow A→B means "A directly causes B holding everything else fixed."
17
+ 6. Mark confounders: common causes of T and Y that are NOT on the causal path T→...→Y.
18
+ 7. Mark mediators: variables on the causal path T→M→Y. Conditioning on M blocks the path and biases total-effect estimates — do NOT adjust for mediators when estimating total effect.
19
+ 8. Mark colliders: common effects of two causes (e.g. A→C←B). Conditioning on a collider OPENS a non-causal path — never condition on colliders.
20
+ 9. Apply the backdoor criterion: a valid adjustment set blocks every backdoor path (paths into T) without opening any new paths via colliders.
21
+ 10. Apply the frontdoor criterion only when all backdoor paths are unblockable but a measured mediator M captures the full T→Y path.
22
+ 11. Run `dagitty` (R) or `pgmpy` (Python) to enumerate all d-separation statements and verify your proposed adjustment set is valid.
23
+
24
+ ## Phase 2 — Choose the Identification Strategy
25
+
26
+ 12. **RCT**: If treatment was randomised, check balance (t-tests / standardised mean differences < 0.1). Estimate via OLS with pre-specified covariates (Lin estimator) for efficiency. Watch for non-compliance → use ITT; if compliance < 100% use IV with assignment as instrument.
27
+ 13. **Regression adjustment / backdoor**: Use when all confounders are measured. Run OLS or doubly-robust AIPW (combine propensity model + outcome model — consistent if either is correct). Use cross-fitting to avoid regularisation bias.
28
+ 14. **Instrumental Variables**: Instrument Z must satisfy (a) relevance: Cov(Z,T)≠0 — report first-stage F > 10 (Stock-Yogo); (b) exclusion: Z affects Y only through T — this is untestable, argue it structurally; (c) independence: Z ⊥ unmeasured confounders. Use 2SLS. LATE = effect on compliers only; report that scope honestly.
29
+ 15. **Regression Discontinuity**: Assignment determined by a running variable crossing a cutoff. Estimate local ATE at the cutoff only. Use local linear regression, not polynomials (overfitting). Optimal bandwidth via Imbens-Kalyanaraman or `rdrobust`. Check: (a) no manipulation of running variable (McCrary density test); (b) no jump in covariates at cutoff; (c) continuity of potential outcomes.
30
+ 16. **Difference-in-Differences**: Requires (a) parallel trends in pre-period — test visually and with event-study regression; (b) no anticipation — treatment effect is zero before treatment; (c) stable unit treatment values (SUTVA). Use staggered DiD via Callaway-Sant'Anna or Sun-Abraham estimators (not TWFE, which is biased under heterogeneous effects). Report pre-trend coefficients explicitly.
31
+ 17. **Matching / Propensity Score**: Estimate propensity score P(T=1|X) via logistic regression or gradient boosting. Match 1:1 nearest-neighbour with caliper 0.2×SD(logit(p)). Check post-match balance on ALL covariates. Prefer doubly-robust AIPW over PS-weighting alone (IPW is sensitive to extreme weights; trim weights at 99th percentile).
32
+ 18. **Synthetic Control**: Use when N=1 treated unit and many pre-periods. Construct a convex combination of control units that matches pre-treatment outcomes (and covariates). Report the ratio of post-treatment MSPE to pre-treatment MSPE; run permutation tests across control units for inference.
33
+
34
+ ## Phase 3 — Assumption Checks and Falsification Tests
35
+
36
+ 19. **Balance check**: For every adjustment strategy, report standardised mean differences before and after adjustment for all covariates. SMD < 0.1 is the threshold.
37
+ 20. **Pre-trend test (DiD/SC)**: Regress outcome on leads of treatment (event-study). Joint F-test of all pre-period coefficients = 0. Visualise the event study plot.
38
+ 21. **Placebo outcome test**: Apply the same estimator to an outcome that the treatment logically cannot affect. A significant effect signals model misspecification or confounding.
39
+ 22. **Negative control**: Find a negative control exposure (shares confounders with T but cannot cause Y) and a negative control outcome (shares confounders with Y but cannot be caused by T). Both should return null effects under a correct model.
40
+ 23. **Manipulation test (RDD)**: Run McCrary density test on the running variable at the cutoff. A discontinuity in density signals sorting.
41
+ 24. **Covariate continuity test (RDD)**: Regress each pre-treatment covariate on the running variable with the same bandwidth and estimator as the main spec. No covariate should jump at the cutoff.
42
+ 25. **Placebo cutoff test (RDD)**: Re-estimate the main spec at fake cutoffs away from the true cutoff. Effects should be null.
43
+ 26. **First-stage strength (IV)**: Kleibergen-Paap F-statistic > 10 for single instrument; use conditional F in the weak-instrument-robust Anderson-Rubin confidence set.
44
+ 27. **Exclusion restriction plausibility (IV)**: Argue structurally why Z→Y paths other than Z→T→Y are closed. Cannot be tested directly; report honest threats.
45
+ 28. **Overidentification test (IV)**: If you have multiple instruments, run Sargan-Hansen J-test. Rejection means at least one instrument violates exclusion.
46
+
47
+ ## Phase 4 — Estimation and Inference
48
+
49
+ 29. Use heteroskedasticity-robust (HC2/HC3) standard errors by default. Cluster at the level of treatment assignment. Never use non-clustered SEs when treatment is clustered.
50
+ 30. For DiD, use cluster-robust SEs at the group level. With few clusters (< 30), use wild cluster bootstrap.
51
+ 31. Report point estimate, 95% CI, and p-value. Report the effect size in natural units, not just statistical significance.
52
+ 32. For CATE estimation, use causal forests (`grf` in R or `econml` in Python). Report heterogeneity via best linear projection of CATE on covariates.
53
+ 33. Pre-register your specification. If you run multiple specs, report all of them and correct for multiple testing (Benjamini-Hochberg).
54
+
55
+ ## Phase 5 — Sensitivity to Unmeasured Confounding
56
+
57
+ 34. Compute the **E-value**: the minimum strength (on the risk ratio scale) an unmeasured confounder would need to have with both T and Y to fully explain away the observed effect. Report it alongside your main estimate. Formula: E = RR + sqrt(RR×(RR−1)) for RR > 1.
58
+ 35. Run **Rosenbaum sensitivity analysis** (for matched studies): find the gamma at which the inference breaks. Report the range of gamma consistent with your conclusions.
59
+ 36. For DiD, conduct **Rambachan-Roth sensitivity analysis**: how much would parallel trends need to fail (in units of the pre-trend) to reverse the sign?
60
+ 37. State explicitly which threats could reverse the sign of the effect and how plausible they are given domain knowledge.
61
+
62
+ ## Phase 6 — Reporting
63
+
64
+ 38. Report: estimand, design, identification assumptions, falsification results, point estimate + CI + E-value, and a one-sentence plain-English causal claim scoped to the identified population.
65
+ 39. Never write "X causes Y" without specifying: in what population, under what contrast, identified by what strategy, conditional on what assumptions holding.
66
+ 40. Separate what is identified (the causal claim your design licenses) from what is extrapolation (generalization beyond the support of your design).
67
+ 41. If the design cannot identify the estimand of interest, say so and propose what data or design would be needed. A credible null is more valuable than a spurious positive.
68
+
69
+ ## Hard Rules (Never Violate)
70
+
71
+ - NEVER condition on a post-treatment variable when estimating total effect — this opens collider bias or blocks causal paths.
72
+ - NEVER interpret a regression coefficient as causal without stating the identification assumption that licenses it.
73
+ - NEVER report TWFE DiD with staggered treatment timing without heterogeneity-robust estimators.
74
+ - NEVER use propensity score weighting without trimming extreme weights and checking overlap.
75
+ - NEVER claim IV identifies ATE — it identifies LATE (the complier subpopulation) unless you have strong monotonicity + homogeneity arguments.
76
+ - NEVER skip the falsification phase — one failed placebo invalidates the design.
77
+ - ALWAYS report the E-value; a p < 0.05 with E-value of 1.2 is not credible evidence of causation.
@@ -0,0 +1,98 @@
1
+ ---
2
+ name: clinical-guideline
3
+ description: Summarize clinical practice guidelines for a disease/condition/intervention — extracting evidence grades (GRADE/strength), target populations, contraindications, and cross-body conflicts — with primary source citations and a mandatory educational-use disclaimer; invoke when the user asks what guidelines say about a clinical topic, drug, screening threshold, treatment protocol, or diagnostic criterion.
4
+ allowed-tools: Read, WebSearch, WebFetch, Write
5
+ ---
6
+
7
+ ═══ HARD RULES ═══
8
+
9
+ 1. NEVER fabricate a guideline, recommendation strength, evidence grade, citation, trial name, drug dose, threshold value, or population qualifier — if you cannot verify it, label it [UNVERIFIED] and exclude it from the summary table.
10
+ 2. NEVER present this output as professional medical advice, a substitute for clinical judgment, or a basis for patient management decisions.
11
+ 3. ALWAYS close every output with the mandatory disclaimer block defined in Phase F — no exceptions, no truncation.
12
+ 4. Cite only PRIMARY guideline sources (issuing society + year + document title + URL/DOI where retrievable). Do NOT cite secondary summaries (UpToDate, Medscape, Wikipedia) as authoritative — use them only to triangulate, then trace back to the primary document.
13
+ 5. When recommendations conflict across bodies (e.g., USPSTF vs. ACS vs. ESC), surface the conflict explicitly — never silently adopt one and suppress the other.
14
+ 6. Distinguish CURRENT from SUPERSEDED guidance: flag any guideline older than 5 years as potentially outdated and note whether a newer version exists. When a newer version from the same body exists, use the newer version and note the supersession.
15
+ 7. Do not extrapolate: if a guideline applies to adults ≥18, do not infer pediatric applicability. Population scope must be stated exactly as the issuing body states it.
16
+ 8. Risk escalation: if the query involves acute/life-threatening presentations (acute MI, sepsis, anaphylaxis, stroke, suicidality, airway compromise), prepend the acute safety note from Phase A4 before any guideline content and do not omit it.
17
+ 9. Never invent GRADE labels. Use only the grading system the issuing body itself uses (GRADE, Oxford CEBM, ACC/AHA Class I–III/LOE A–C, SIGN 1++–4, USPSTF A–D/I, NCCN Category 1/2A/2B/3, etc.) — name the system used and display a brief key in your output header so every strength label is interpretable without external lookup.
18
+ 10. Living and interim guidelines (e.g., WHO rapid guidance, NICE evidence reviews marked "in development", CDC interim recommendations) must be labeled [LIVING — subject to update] and re-checked at the source for currency.
19
+ 11. When no guideline from any identified body addresses the specific topic: state "No current guideline identified from [bodies checked]" — do not synthesize de novo clinical advice as a substitute. Note relevant systematic reviews or consensus statements only if clearly distinguished from formal society guidelines.
20
+
21
+ ═══ PHASE A — SCOPE LOCK ═══
22
+
23
+ A1. Parse the user's query into: (a) clinical condition/topic, (b) intervention or decision node (screening / diagnosis / treatment / monitoring / prevention), (c) implied population (age, sex, comorbidities, pregnancy status, clinical setting — ICU vs. outpatient vs. community), (d) geographic/system scope (US, EU, UK, global, etc.). State each dimension explicitly before proceeding; flag any that are ambiguous and state your working assumption.
24
+ A2. Identify the 3–6 authoritative guideline-issuing bodies most relevant to the topic. Examples by domain:
25
+ - Cardiology: ACC/AHA, ESC, NICE, CCS, Heart Failure Society of America
26
+ - Oncology: NCCN, ESMO, ASCO, NICE (cancer pathways)
27
+ - Primary care/prevention: USPSTF, Canadian Task Force on Preventive Health Care, WHO
28
+ - Endocrinology: ADA, Endocrine Society, AACE
29
+ - Infectious disease: IDSA, ESCMID, CDC, WHO
30
+ - Psychiatry: APA, NICE, CANMAT, BAP
31
+ - Nephrology: KDIGO, ERA
32
+ List the bodies selected and the reason each is relevant to this query.
33
+ A3. For each body, state the grading taxonomy it uses and display a compact key in your output header:
34
+ - GRADE: Strong / Conditional (Weak), with evidence quality High/Moderate/Low/Very Low
35
+ - ACC/AHA: Class I (benefit>>risk) / IIa / IIb / III (no benefit or harm); LOE A (multiple RCTs/meta-analyses) / B-R / B-NR / C-LD / C-EO
36
+ - SIGN: 1++ (high-quality meta-analysis/SR/RCT with very low risk of bias) → 4 (expert opinion); A/B/C/D recommendation grades
37
+ - USPSTF: A (high certainty, substantial net benefit) / B / C / D (recommend against) / I (insufficient evidence)
38
+ - NCCN: Category 1 (high-level evidence, uniform consensus) / 2A / 2B / 3 (major disagreement)
39
+ - Other: name the system and its hierarchy
40
+ A4. Acute safety gate: if the topic involves a potential emergency (acute MI, sepsis, anaphylaxis, stroke, suicidality, airway compromise, DKA, massive hemorrhage), begin ALL output with:
41
+ "⚠ ACUTE SAFETY NOTE: The following is an educational summary only. For any acute or life-threatening presentation, activate emergency services and follow local emergency protocols immediately. No guideline summary substitutes for real-time clinical assessment."
42
+
43
+ ═══ PHASE B — GUIDELINE RETRIEVAL ═══
44
+
45
+ B1. Run targeted WebSearch queries for each issuing body from A2, e.g.: "[Society] [condition] guideline [year]", "[Society] [condition] clinical practice recommendation site:[society-domain].org". Prioritize the most recent published version; also search for in-progress living guidance.
46
+ B2. For each hit, WebFetch the primary document or its executive summary / key-recommendations table. Extract: (a) exact publication year and version number, (b) full document title, (c) issuing society, (d) DOI or canonical URL, (e) whether the document is a full guideline, focused update, rapid guidance, or consensus statement — these carry different epistemic weight.
47
+ B3. If a guideline is paywalled, retrieve the freely available executive summary, key-recommendations table, or press release from the same issuing body — note the access limitation and that only the accessible portion was reviewed.
48
+ B4. Record retrieval failures explicitly: "Could not verify [Body] guidance — primary document not publicly accessible at time of retrieval." Never fill a gap with inference or secondary-source extrapolation.
49
+ B5. Living/interim flag: if a document is marked as living, interim, or in development, tag it [LIVING] in every table row sourced from it. Note the expected next review date if stated.
50
+ B6. Supersession check: if an older version is found first, search explicitly for "[Society] [condition] guideline update [current year–1 year]" to confirm no newer version exists before using the older document.
51
+
52
+ ═══ PHASE C — RECOMMENDATION EXTRACTION ═══
53
+
54
+ C1. For each retrieved guideline, extract each discrete recommendation into a structured row:
55
+ | Recommendation | Grade/Class | LOE/Evidence Quality | Population Scope | Contraindications/Exceptions | Source (Society, Year) |
56
+ C2. Map grades faithfully to the issuing body's own system. Never convert between systems without explicit labeling (e.g., "GRADE Strong ≈ ACC/AHA Class I — rough analog only, systems are not equivalent").
57
+ C3. Capture population qualifiers at the finest granularity the guideline states: age range (e.g., 50–75 years), biological sex where specified, pregnancy/lactation status, renal function cutoff (eGFR threshold), hepatic function (Child-Pugh class), prior treatment history, validated risk-score thresholds (e.g., ASCVD 10-year risk ≥7.5%, CHA₂DS₂-VASc ≥2 in men/≥3 in women, FIB-4 ≥2.67).
58
+ C4. Extract absolute contraindications, relative contraindications, and documented exceptions for each recommendation separately. Never omit the contraindication column; enter "None documented in this guideline" if applicable.
59
+ C5. Note where a guideline is silent on a sub-population or scenario — silence is not endorsement and must not be treated as implicit permission.
60
+ C6. No-guideline fallback: if Phase B yields no guideline for a key decision node, enter a row: "No guideline identified | — | — | — | — | [bodies checked]" and do not substitute clinical opinion.
61
+
62
+ ═══ PHASE D — CONFLICT ANALYSIS ═══
63
+
64
+ D1. Compare recommendations across bodies side by side. Flag any conflict where: (a) grade or strength differs for the same intervention in the same population, (b) population thresholds differ (e.g., mammography start age 40 per ACS vs. 50 per USPSTF 2016 vs. 40 per USPSTF 2024), (c) one body recommends for while another explicitly recommends against, (d) dosing, monitoring intervals, or treatment duration differ.
65
+ D2. For each conflict, provide the documented reason if available: different underlying evidence base (name the key RCTs), different modeling assumptions (absolute vs. relative risk, QALYs vs. life-years), health-system context (resource constraints, test availability), different population endpoint definitions. Do not editorialize about which body is correct.
66
+ D3. Rate conflict severity:
67
+ - MAJOR: opposite directions — one body issues a recommendation FOR, another recommends AGAINST the same intervention in the same population
68
+ - MODERATE: same direction, meaningfully different thresholds, risk-score cutoffs, or recommendation strength (e.g., one strong/Class I, another conditional/Class IIb)
69
+ - MINOR: wording nuance, monitoring interval, or administrative detail only
70
+ D4. Where a landmark RCT or meta-analysis drives the divergence, name it (trial acronym + year + primary endpoint result), e.g., "SPRINT (2015): SBP <120 mmHg reduced primary CV events 25% vs. <140 target; drove ACC/AHA 2017 threshold shift." Never invent trial names or results.
71
+ D5. If the user's query references a specific trial by name, cross-check whether that trial's results have been incorporated into a subsequent guideline update — and state which bodies have or have not done so.
72
+
73
+ ═══ PHASE E — SYNTHESIS OUTPUT ═══
74
+
75
+ E1. Open with a one-paragraph plain-language summary of the consensus position (where one exists) and the principal areas of disagreement. If no consensus exists, state that explicitly rather than constructing a false median.
76
+ E2. Render the full recommendation table from Phase C, grouped by clinical decision node: Screening / Diagnosis / First-line treatment / Second-line / Escalation / Monitoring / Special populations.
77
+ E3. Render a conflict table from Phase D: Conflict description | Bodies involved | Direction | Severity (MAJOR/MODERATE/MINOR) | Known reason for divergence.
78
+ E4. Add a "Special Populations" section covering any guideline-specific guidance for: pregnancy and lactation, pediatrics (<18), older adults (≥75 or ≥65 as the body defines), chronic kidney disease (eGFR-stratified where available), hepatic impairment (Child-Pugh A/B/C where available), immunocompromised status, and rare/genetic subgroups — or state "No specific guidance identified in retrieved documents" for each category that applies.
79
+ E5. Add a "Gaps and Research Frontiers" section: note only where issuing bodies themselves acknowledge insufficient evidence (GRADE low/very-low, LOE C-EO, USPSTF I, SIGN 4) and where active trials or scheduled guideline updates are noted in the retrieved documents. Do not speculate beyond what the documents state.
80
+ E6. No-guideline conclusion: if Phase B found no guideline for the queried topic, conclude: "No formal clinical practice guideline from the bodies checked addresses [topic] at the specificity requested. Relevant systematic reviews or consensus statements, if found, are listed below and clearly distinguished from society guidelines. A clinician or specialist librarian should be consulted."
81
+ E7. List all primary sources in a numbered reference list: [N] Society. Title. Version/Year. URL/DOI. Guideline type (full / focused update / living / consensus statement). Retrieved: [date].
82
+
83
+ ═══ PHASE F — MANDATORY DISCLAIMER ═══
84
+
85
+ F1. Close EVERY output — without exception, without truncation — with this exact block (fill in bracketed fields):
86
+
87
+ ---
88
+ **EDUCATIONAL USE ONLY — NOT MEDICAL ADVICE**
89
+
90
+ This summary was generated by an AI agent for educational and informational purposes. It does not constitute professional medical advice, a clinical diagnosis, a treatment recommendation, or a substitute for the judgment of a licensed healthcare provider.
91
+
92
+ - Guidelines change. Verify currency at the issuing body's official website before applying any recommendation. Living guidelines may have been updated since this retrieval.
93
+ - Individual patient circumstances, comorbidities, contraindications, preferences, and local formulary/resource constraints must be evaluated by a qualified clinician.
94
+ - If you or someone you care for is experiencing a medical emergency, call emergency services (e.g., 911 in the US / 999 in the UK / 112 in the EU) immediately.
95
+ - Sources retrieved: [date of retrieval]. Issuing bodies checked: [list bodies from Phase A2].
96
+ ---
97
+
98
+ KEY PRINCIPLE: Surface what the evidence actually says — grades, populations, conflicts, gaps, and the grading system used — exactly as the issuing bodies state it, never smoother or more certain than the underlying evidence warrants. When guidelines are absent, say so; when they conflict, show both sides.
@@ -0,0 +1,98 @@
1
+ ---
2
+ name: code-review
3
+ description: Review a diff like a senior engineer — correctness first, then security, tests, API/compat, perf, readability; every comment file:line + concrete fix + severity tag.
4
+ allowed-tools: Read, Grep, Glob, Bash
5
+ ---
6
+
7
+ ## PHASE 0 — ORIENT (do this before reading a single line of diff)
8
+
9
+ 1. Run `git diff main...HEAD --stat` to see the full scope: files changed, insertions, deletions.
10
+ 2. Run `git log main...HEAD --oneline` to read the commit message(s). Extract the stated intent.
11
+ 3. For each changed file, read the surrounding 40–80 lines of context that the diff does NOT show — the caller, the type definitions, the existing tests. Use Read with offset/limit. Never judge a patch in isolation.
12
+ 4. If the PR touches a public API, interface, or exported symbol, grep the entire codebase for existing callers: `grep -rn "symbolName" --include="*.ts" .` — understand the blast radius before forming any opinion.
13
+ 5. Hold a mental model: *what problem was this trying to solve, and what is the simplest correct solution?* Everything you find is measured against that baseline.
14
+
15
+ ## PHASE 1 — CORRECTNESS (blocking by default)
16
+
17
+ 6. Trace the happy path end-to-end through the changed code. Write it out in one sentence. If you cannot, the code is not readable enough — flag it.
18
+ 7. Enumerate inputs: zero, one, many, null/undefined/empty, max-size, malformed. For each, step through the logic manually. Does it handle them?
19
+ 8. Check every error path. Are errors caught, logged, surfaced, or swallowed silently? Silent swallow = blocking bug.
20
+ 9. Verify every conditional: off-by-one in loop bounds, `<` vs `<=`, `===` vs `==`, index into potentially-empty array.
21
+ 10. Check async/concurrent paths: unguarded shared state, missing `await`, Promise not returned, race between two async calls on the same resource.
22
+ 11. Check resource lifecycle: files, connections, locks — are they closed/released on ALL code paths including error paths?
23
+ 12. If the logic is non-trivial, mentally run a concrete example with real values through it. Does the output match the stated intent?
24
+ 13. If you are unsure whether a function exists or behaves as assumed, grep for it: `grep -n "functionName" path/to/file.ts`. Never trust the description.
25
+
26
+ ## PHASE 2 — SECURITY (blocking if relevant)
27
+
28
+ 14. Injection: any user-supplied string interpolated into SQL, shell, eval, innerHTML, URL, regex, or log sink without sanitization = blocking.
29
+ 15. AuthZ: every new route/endpoint/function that touches data — is ownership/permission checked *before* the query, not after?
30
+ 16. Secrets: `grep -rn "sk-\|api_key\|password\|secret\|token" --include="*.ts" .` on changed files. Hardcoded credential = blocking.
31
+ 17. Input validation: is there a schema/type guard at the trust boundary (HTTP handler, IPC, file read)? Trusting internal callers is fine; trusting network/user input is not.
32
+ 18. Cryptography: no hand-rolled crypto, no MD5/SHA1 for security, no fixed IV/nonce, no ECB mode. If crypto is touched, flag for specialist review.
33
+ 19. Dependency: if a new package is added, check it is not a typosquat and has a non-zero maintenance signal.
34
+
35
+ ## PHASE 3 — TESTS
36
+
37
+ 20. Run `grep -rn "describe\|it(\|test(" --include="*.test.*" .` scoped to the changed feature. Do tests exist?
38
+ 21. For each test, verify it exercises the *changed* code path, not just the pre-existing behavior. A test that passes before the PR is not a test for this PR.
39
+ 22. Check that edge cases from Phase 1 (empty input, error path, concurrency) have corresponding test cases.
40
+ 23. Check test assertions are meaningful: `expect(result).toBeDefined()` is not a meaningful assertion. `expect(result.id).toBe(42)` is.
41
+ 24. If tests are missing for a correctness-critical path, tag it **blocking**. If they exist but are thin, tag **should-fix**.
42
+
43
+ ## PHASE 4 — API / CONTRACT / BACKWARD COMPAT
44
+
45
+ 25. If a public function/type/route signature changed, grep for all callers: `grep -rn "oldFunctionName\|OldTypeName" . --include="*.ts"`. Every caller must be updated or the old signature kept as a shim.
46
+ 26. Check serialized formats (JSON keys, DB column names, URL paths, event names). Renaming a key in a stored/transmitted payload is a breaking change even if TypeScript compiles.
47
+ 27. Check version numbers: if the package exposes a public API and the signature changed, the semver bump must be at least minor (new feature) or major (breaking).
48
+ 28. Check migration files if DB schema changed. Is the migration reversible? Does it lock tables on large datasets?
49
+
50
+ ## PHASE 5 — PERFORMANCE (only real hot paths)
51
+
52
+ 29. Skip this phase for code that runs once at startup or on human-triggered actions. Apply it only to: hot loops, per-request middleware, stream processors, rendering paths.
53
+ 30. For hot paths: N+1 query (DB call inside a loop), unbounded array growth, synchronous I/O blocking the event loop, missing index on a WHERE clause column.
54
+ 31. Do not flag micro-optimizations (string concatenation, object spread) unless profiling data shows they matter.
55
+
56
+ ## PHASE 6 — READABILITY / MAINTAINABILITY / NAMING
57
+
58
+ 32. Names: does the name describe *what it is* (noun for data, verb for action)? Would a new engineer understand it without reading the implementation?
59
+ 33. Function length: if a function exceeds ~40 lines or has more than 3 levels of nesting, flag it as a should-fix candidate for extraction.
60
+ 34. Comments: comments should explain *why*, not *what*. A comment that restates the code is noise. A comment explaining a non-obvious invariant or workaround is gold.
61
+ 35. Dead code: flag any commented-out block, unreachable branch, or unused import.
62
+ 36. Consistency: does the new code follow the existing conventions in the file (naming, error handling, logging style)? Tag inconsistencies as nit.
63
+
64
+ ## PHASE 7 — REUSE
65
+
66
+ 37. Grep for similar logic already in the codebase before flagging duplication: `grep -rn "keyPhrase" --include="*.ts" .`. If a utility already exists, the new code should use it.
67
+ 38. If a new utility is introduced, check it is placed where other utilities live and is actually generic (not secretly coupled to one caller).
68
+
69
+ ## COMMENT FORMAT (every finding, no exceptions)
70
+
71
+ ```
72
+ **[SEVERITY] file/path.ts:LINE**
73
+ _Issue:_ One sentence describing the problem concretely.
74
+ _Why it matters:_ One sentence on the consequence if unfixed.
75
+ _Suggested fix:_
76
+ ```suggestion
77
+ // concrete code replacement here
78
+ ```
79
+ ```
80
+
81
+ Severity levels:
82
+ - **blocking** — must be resolved before merge; correctness, security, or data-loss risk.
83
+ - **should-fix** — strong recommendation; likely to cause a bug or maintenance burden; merge only with explicit acknowledgment.
84
+ - **nit** — style, naming, minor cleanup; author may take or leave; do not block merge on nits alone.
85
+
86
+ ## HARD RULES
87
+
88
+ 39. Never approve if any **blocking** finding is open.
89
+ 40. Never approve if correctness-critical paths have zero test coverage.
90
+ 41. Separate objective correctness from personal style preference. "I would have written it differently" is not a blocking comment.
91
+ 42. If you are uncertain whether something is a bug, say so explicitly: "I believe this is a bug because X — please confirm or correct my reading."
92
+ 43. Do not trust the PR description. Verify behavior by reading the code. If the description says "this is safe" but you cannot confirm from the code, treat it as unverified.
93
+ 44. Be direct. "This will panic if `items` is empty" is better than "Consider whether items could be empty."
94
+ 45. Be kind. Assume competence and good intent. The goal is a better codebase, not a score.
95
+ 46. End the review with one of three verdicts:
96
+ - **APPROVE** — correctness and tests satisfied, no blocking findings.
97
+ - **REQUEST CHANGES** — one or more blocking or should-fix findings open.
98
+ - **COMMENT** — observations only, no change required; used for informational notes on approved PRs.