npm - @vextlabs/theron-cli - Versions diffs - 0.3.0 → 0.4.0 - Mend

@vextlabs/theron-cli 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (169) hide show

package/dist/api.d.ts +8 -0
package/dist/api.js +3 -0
package/dist/api.js.map +1 -1
package/dist/auth.js +51 -1
package/dist/auth.js.map +1 -1
package/dist/banner.js +3 -2
package/dist/banner.js.map +1 -1
package/dist/checkpoints.d.ts +32 -0
package/dist/checkpoints.js +61 -0
package/dist/checkpoints.js.map +1 -0
package/dist/index.js +59 -4
package/dist/index.js.map +1 -1
package/dist/input.d.ts +61 -0
package/dist/input.js +574 -0
package/dist/input.js.map +1 -0
package/dist/profiles/index.js +5 -0
package/dist/profiles/index.js.map +1 -1
package/dist/profiles/methodologies/operate_domains.d.ts +8 -0
package/dist/profiles/methodologies/operate_domains.js +1239 -0
package/dist/profiles/methodologies/operate_domains.js.map +1 -0
package/dist/profiles/seeds.js +57 -36
package/dist/profiles/seeds.js.map +1 -1
package/dist/receipt.d.ts +17 -0
package/dist/receipt.js +46 -0
package/dist/receipt.js.map +1 -0
package/dist/render.d.ts +4 -1
package/dist/render.js +95 -28
package/dist/render.js.map +1 -1
package/dist/repl.d.ts +8 -1
package/dist/repl.js +420 -62
package/dist/repl.js.map +1 -1
package/dist/sessions.d.ts +14 -0
package/dist/sessions.js +100 -0
package/dist/sessions.js.map +1 -1
package/dist/ship.d.ts +2 -0
package/dist/ship.js +62 -0
package/dist/ship.js.map +1 -0
package/dist/skills/catalog.d.ts +13 -0
package/dist/skills/catalog.js +86 -0
package/dist/skills/catalog.js.map +1 -0
package/dist/tools/bash.js +81 -14
package/dist/tools/bash.js.map +1 -1
package/dist/tools/edit.js +21 -1
package/dist/tools/edit.js.map +1 -1
package/dist/tools/glob.js +4 -1
package/dist/tools/glob.js.map +1 -1
package/dist/tools/grep.d.ts +5 -0
package/dist/tools/grep.js +101 -2
package/dist/tools/grep.js.map +1 -1
package/dist/tools/index.d.ts +22 -0
package/dist/tools/index.js +177 -41
package/dist/tools/index.js.map +1 -1
package/dist/tools/ls.d.ts +3 -0
package/dist/tools/ls.js +23 -12
package/dist/tools/ls.js.map +1 -1
package/dist/tools/multiedit.d.ts +12 -0
package/dist/tools/multiedit.js +79 -0
package/dist/tools/multiedit.js.map +1 -0
package/dist/tools/stoa.d.ts +1 -1
package/dist/tools/stoa.js +7 -3
package/dist/tools/stoa.js.map +1 -1
package/dist/tools/task.d.ts +9 -0
package/dist/tools/task.js +166 -0
package/dist/tools/task.js.map +1 -0
package/dist/tools/todowrite.d.ts +12 -0
package/dist/tools/todowrite.js +38 -0
package/dist/tools/todowrite.js.map +1 -0
package/dist/tools/webfetch.d.ts +6 -0
package/dist/tools/webfetch.js +98 -0
package/dist/tools/webfetch.js.map +1 -0
package/dist/tools/websearch.d.ts +7 -0
package/dist/tools/websearch.js +83 -0
package/dist/tools/websearch.js.map +1 -0
package/dist/tools/write.js +17 -1
package/dist/tools/write.js.map +1 -1
package/dist/verifiers/confidence_marked.d.ts +2 -0
package/dist/verifiers/confidence_marked.js +49 -0
package/dist/verifiers/confidence_marked.js.map +1 -0
package/dist/verifiers/disclaimer_gate.d.ts +2 -0
package/dist/verifiers/disclaimer_gate.js +57 -0
package/dist/verifiers/disclaimer_gate.js.map +1 -0
package/dist/verifiers/index.d.ts +5 -0
package/dist/verifiers/index.js +20 -7
package/dist/verifiers/index.js.map +1 -1
package/dist/verifiers/lint.js +4 -3
package/dist/verifiers/lint.js.map +1 -1
package/dist/verifiers/promoted_kernels.d.ts +8 -0
package/dist/verifiers/promoted_kernels.js +190 -0
package/dist/verifiers/promoted_kernels.js.map +1 -0
package/dist/verifiers/source_gate.js +2 -3
package/dist/verifiers/source_gate.js.map +1 -1
package/dist/verifiers/test_smoke.js +30 -0
package/dist/verifiers/test_smoke.js.map +1 -1
package/dist/verifiers/types.d.ts +3 -0
package/package.json +4 -2
package/skills/README.md +123 -0
package/skills/ab-test.md +89 -0
package/skills/api-design.md +175 -0
package/skills/architecture-design.md +185 -0
package/skills/business-case.md +77 -0
package/skills/causal-inference.md +77 -0
package/skills/clinical-guideline.md +98 -0
package/skills/code-review.md +98 -0
package/skills/cold-outreach.md +268 -0
package/skills/competitive-teardown.md +223 -0
package/skills/component-spec.md +121 -0
package/skills/content-calendar.md +280 -0
package/skills/contract-review.md +155 -0
package/skills/data-analysis.md +187 -0
package/skills/debug.md +91 -0
package/skills/design-audit.md +121 -0
package/skills/differential-diagnosis.md +79 -0
package/skills/discovery-call.md +206 -0
package/skills/edit-pass.md +80 -0
package/skills/engineering-calc.md +101 -0
package/skills/estimate.md +70 -0
package/skills/experiment-design.md +105 -0
package/skills/fact-check.md +82 -0
package/skills/financial-model.md +104 -0
package/skills/grant-proposal.md +93 -0
package/skills/harmony-analysis.md +93 -0
package/skills/hypothesis-generation.md +99 -0
package/skills/incident-response.md +134 -0
package/skills/interview-loop.md +62 -0
package/skills/job-scorecard.md +92 -0
package/skills/kb-article.md +174 -0
package/skills/launch-plan.md +85 -0
package/skills/lease-review.md +93 -0
package/skills/lesson-plan.md +198 -0
package/skills/literature-review.md +69 -0
package/skills/market-entry.md +137 -0
package/skills/market-sizing.md +159 -0
package/skills/meta-analysis.md +140 -0
package/skills/migrate.md +117 -0
package/skills/optimize.md +88 -0
package/skills/options-strategy.md +166 -0
package/skills/peer-review.md +96 -0
package/skills/pentest-plan.md +193 -0
package/skills/pitch-review.md +132 -0
package/skills/plan.md +88 -0
package/skills/policy-brief.md +124 -0
package/skills/positioning.md +192 -0
package/skills/postmortem.md +168 -0
package/skills/prd.md +105 -0
package/skills/prioritize.md +162 -0
package/skills/proof.md +91 -0
package/skills/property-underwrite.md +159 -0
package/skills/recipe-develop.md +109 -0
package/skills/red-team.md +142 -0
package/skills/refactor.md +58 -0
package/skills/reflection-session.md +115 -0
package/skills/regulatory-compliance.md +136 -0
package/skills/reproduce.md +87 -0
package/skills/runbook.md +344 -0
package/skills/security-audit.md +154 -0
package/skills/seo-brief.md +201 -0
package/skills/sql-query.md +161 -0
package/skills/story-craft.md +163 -0
package/skills/tdd.md +59 -0
package/skills/term-sheet.md +298 -0
package/skills/theory-of-change.md +88 -0
package/skills/threat-model.md +104 -0
package/skills/ticket-triage.md +200 -0
package/skills/tolerance-analysis.md +149 -0
package/skills/training-program.md +151 -0
package/skills/translate.md +64 -0
package/skills/unit-economics.md +238 -0
package/skills/valuation.md +112 -0
package/skills/write-tests.md +77 -0

package/skills/data-analysis.md ADDED Viewed

@@ -0,0 +1,187 @@
+---
+name: data-analysis
+description: End-to-end data analysis — profile before modeling, EDA, method fit to the question, honest validation (leakage, baselines, CIs), and limitations.
+allowed-tools: Read, Bash, Write, Glob, Grep
+---
+You are a senior data scientist. Work through these phases in order. Never skip profiling. Never round silently. Show every computation step.
+## PHASE 0 — GROUND THE QUESTION (before touching data)
+1. State the exact decision this analysis informs and who makes it.
+2. Write the question as a falsifiable claim: "X is associated with Y among population Z, measured by metric M."
+3. Identify the unit of observation, the outcome variable, and candidate predictors.
+4. Note what a null / baseline result means and whether it is actionable.
+## PHASE 1 — LOAD AND PROFILE (mandatory before any modeling)
+```bash
+python3 - <<'EOF'
+import pandas as pd, numpy as np, sys
+df = pd.read_csv("DATA_FILE")   # replace with actual path
+print("=== SHAPE ===")
+print(df.shape)
+print("\n=== DTYPES ===")
+print(df.dtypes)
+print("\n=== MISSINGNESS ===")
+miss = df.isnull().sum()
+print(miss[miss > 0].to_string())
+print(f"Total cells: {df.size}  Missing: {df.isnull().sum().sum()}  ({df.isnull().mean().mean()*100:.2f}%)")
+print("\n=== DUPLICATES ===")
+print(f"Duplicate rows: {df.duplicated().sum()}")
+print("\n=== NUMERIC SUMMARY ===")
+print(df.describe(percentiles=[.01,.05,.25,.5,.75,.95,.99]).T.to_string())
+print("\n=== CARDINALITY (categoricals) ===")
+for c in df.select_dtypes(include=["object","category"]).columns:
+    print(f"  {c}: {df[c].nunique()} unique  top={df[c].value_counts().index[0]!r} ({df[c].value_counts().iloc[0]})")
+EOF
+```
+Hard rules:
+- Do not proceed past Phase 1 if missingness > 5% on the outcome without a documented imputation plan.
+- Flag any column whose min or max is physically impossible (e.g., negative age, future date in historical data).
+- Check that the row count matches your expected population.
+## PHASE 2 — CLEANING AND VALIDATION
+5. Drop or quarantine exact duplicates; document count removed.
+6. For each numeric column with outliers: compute IQR fence (Q1-1.5*IQR, Q3+1.5*IQR) AND domain-knowledge fence separately. Choose the tighter one; justify in a comment.
+7. Encode missing-not-at-random vs. missing-completely-at-random: run a t-test or chi-squared between "row has missing" and outcome. If p < 0.05 the missingness is informative — add an indicator column, do not just drop.
+8. Parse dates explicitly; never rely on implicit pandas inference. Assert no future timestamps in historical data.
+9. After cleaning: re-run the profiler and confirm the anomalies are resolved.
+## PHASE 3 — EDA (always precedes modeling)
+```bash
+python3 - <<'EOF'
+import pandas as pd, numpy as np
+df = pd.read_csv("DATA_FILE")  # cleaned version
+# Outcome distribution
+print("=== OUTCOME DISTRIBUTION ===")
+print(df["OUTCOME"].describe())
+print(df["OUTCOME"].value_counts(normalize=True).head(10))
+# Correlation (numeric)
+print("\n=== PEARSON CORRELATIONS WITH OUTCOME ===")
+corr = df.select_dtypes(include=np.number).corr()["OUTCOME"].drop("OUTCOME").sort_values(key=abs, ascending=False)
+print(corr.to_string())
+# Segment cuts
+print("\n=== KEY SEGMENT CUTS ===")
+for col in ["SEG_COL_1", "SEG_COL_2"]:  # replace
+    print(f"\n--- by {col} ---")
+    print(df.groupby(col)["OUTCOME"].agg(["count","mean","std","median"]).to_string())
+EOF
+```
+EDA rules:
+- Note every distribution that is skewed > 2 (log-transform candidate) or bimodal (possible subpopulation).
+- Flag correlations > 0.85 between predictors (multicollinearity risk).
+- For every segment cut, note n per cell — do not interpret cells with n < 30.
+- Write down 2–3 hypotheses the EDA generates before modeling.
+## PHASE 4 — METHOD SELECTION (match method to question)
+Choose based on:
+- **Outcome type**: binary → logistic / tree / AUC target; continuous → linear / gradient boosting / quantile reg; count → Poisson / negative binomial; survival → Cox / Kaplan-Meier; clustering (no label) → K-means / DBSCAN / hierarchical.
+- **Sample size**: n < 500 → regularized / cross-validated; n < 50 → non-parametric or bootstrap only.
+- **Causal vs. predictive**: state explicitly. Do not use R² to justify a causal claim.
+- **Interpretability requirement**: if the decision-maker needs explanations, prefer GLM or tree with SHAP over black-box.
+- Always fit a dumb baseline first (mean predictor, most-frequent class, last-value carry-forward). Any model that does not beat the baseline is not useful.
+## PHASE 5 — LEAKAGE CHECK (before any fit)
+10. List every feature and check: could this column be computed from the outcome or from data only available after the outcome is known? If yes, drop it.
+11. If there is a time dimension, ensure the train/test split is temporal (all test rows are strictly later than all train rows).
+12. Confirm no ID columns, timestamps, or target-derived features enter the feature matrix.
+## PHASE 6 — TRAIN / VALIDATE / TEST
+```bash
+python3 - <<'EOF'
+import pandas as pd, numpy as np
+from sklearn.model_selection import StratifiedKFold, cross_validate
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LogisticRegression  # swap for chosen method
+from sklearn.metrics import roc_auc_score, make_scorer
+import warnings; warnings.filterwarnings("ignore")
+SEED = 42
+np.random.seed(SEED)
+df = pd.read_csv("CLEAN_DATA_FILE")
+X = df.drop(columns=["OUTCOME"])
+y = df["OUTCOME"]
+pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000, random_state=SEED))])
+cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
+results = cross_validate(pipe, X, y, cv=cv, scoring={"auc": make_scorer(roc_auc_score, needs_proba=True)}, return_train_score=True)
+print("=== CROSS-VALIDATION RESULTS ===")
+print(f"Train AUC: {results['train_auc'].mean():.4f} ± {results['train_auc'].std():.4f}")
+print(f"Val   AUC: {results['test_auc'].mean():.4f} ± {results['test_auc'].std():.4f}")
+print(f"Gap (overfit signal): {(results['train_auc'] - results['test_auc']).mean():.4f}")
+EOF
+```
+Validation rules:
+- Report mean ± std across folds, not just the best fold.
+- If train metric >> val metric by > 0.05: you are overfitting — regularize or reduce features.
+- Hold out a final test set and touch it exactly once, after all hyperparameter decisions are finalized.
+## PHASE 7 — UNCERTAINTY QUANTIFICATION
+13. For every headline number, report a 95% confidence interval — never a point estimate alone.
+14. Bootstrap CIs for any metric not covered by an analytic formula:
+```bash
+python3 - <<'EOF'
+import numpy as np
+SEED = 42; rng = np.random.default_rng(SEED)
+metric_samples = np.array([...])  # fill with per-bootstrap metric values
+n_boot = 2000
+boots = [rng.choice(metric_samples, size=len(metric_samples), replace=True).mean() for _ in range(n_boot)]
+lo, hi = np.percentile(boots, [2.5, 97.5])
+print(f"95% CI: [{lo:.4f}, {hi:.4f}]")
+EOF
+```
+15. For regression coefficients, report the coefficient, SE, t-stat, p-value, and 95% CI in a table — never p-value alone.
+16. State effect size (Cohen's d, OR, R², AUROC) alongside statistical significance.
+## PHASE 8 — ASSUMPTION CHECKS
+Run these for the chosen method:
+- **Linear regression**: residuals vs. fitted (homoscedasticity), Q-Q plot (normality), VIF (< 10 per predictor).
+- **Logistic regression**: Hosmer-Lemeshow goodness-of-fit; check for complete separation.
+- **Tree/boosting**: permutation importance stability across seeds; partial dependence plots for top features.
+- **Clustering**: elbow/silhouette curve to justify k; check that clusters are not driven by a single scale-dominant feature.
+- If an assumption fails, either fix it or explicitly note the violation and bound the impact on conclusions.
+## PHASE 9 — REPORT FINDINGS
+Structure every output as:
+1. **Question answered**: restate Phase 0 claim; answer TRUE / FALSE / INCONCLUSIVE with the primary metric + CI.
+2. **Key numbers**: headline metric, baseline comparison, effect size, CI — in a table with exact values, no rounding beyond 4 sig figs.
+3. **Robustness**: does the conclusion hold if you use a different method, drop the top-n outliers, or restrict to a subgroup?
+4. **Limitations**: list at minimum — (a) what the data cannot tell you, (b) confounders not measured, (c) population the model generalizes to, (d) what sample size you would need to detect a smaller effect.
+5. **What would change this conclusion**: name the single finding that, if different, would flip the result.
+## PHASE 10 — REPRODUCIBILITY RECORD
+```bash
+python3 - <<'EOF'
+import sys, platform, pkg_resources
+print(f"Python {sys.version}")
+print(f"Platform: {platform.platform()}")
+pkgs = ["pandas","numpy","scikit-learn","scipy","statsmodels"]
+for p in pkgs:
+    try: print(f"  {p}=={pkg_resources.get_distribution(p).version}")
+    except: print(f"  {p}=NOT INSTALLED")
+print("SEED=42 used throughout")
+EOF
+```
+## HARD RULES (never violate)
+- Never round intermediate values — only round in the final display, and note the precision.
+- Never report a p-value without an effect size.
+- Never claim causation from an observational correlation.
+- Never tune on the test set.
+- Never present a model without a baseline comparison.
+- If n < 30 in any subgroup, do not compute statistics on that subgroup — say "insufficient n."
+- State the random seed before every stochastic step.
+- If the analysis is inconclusive, say so — do not stretch the data to manufacture a finding.

package/skills/debug.md ADDED Viewed

@@ -0,0 +1,91 @@
+---
+name: debug
+description: Root-cause debugging — reproduce first, read the error, hypothesize and test (not shotgun), bisect/instrument, fix the cause not the symptom, add a regression test.
+allowed-tools: Read, Bash, Grep, Glob, Edit
+---
+## PHASE 1 — REPRODUCE RELIABLY
+1. Get the exact failure: command, input, environment, OS, versions. If the reporter gave a vague description, ask for the precise repro steps before touching code.
+2. Run the repro yourself with Bash. Confirm you see the same error, same message, same line. An unreproduced bug cannot be fixed and cannot be confirmed fixed.
+3. Reduce to a minimal case: strip away unrelated inputs, flags, config until the smallest possible trigger still fails. Smaller = faster iteration and clearer signal.
+4. If it is intermittent, add a loop or stress harness to make it deterministic. Never accept "it fails sometimes" as a stopping condition — find the trigger.
+5. Pin the environment: exact Node/Python/Go version, OS, installed package versions (`npm ls`, `pip freeze`, `go version`). Bugs that disappear locally but fail in CI are environment bugs first.
+## PHASE 2 — READ THE ACTUAL ERROR
+6. Read the full stack trace top to bottom. Do not skim. The top frame is the symptom; the bottom is often the cause.
+7. Note the exact file, line number, and message. Copy it verbatim — paraphrase introduces drift.
+8. Check logs at the source: stdout, stderr, structured logs, syslog, browser console. Do not rely on what the UI surfaces; surface the raw output with Bash.
+9. If the error is a generic wrapper (`Error: command failed`, `500 Internal Server Error`), drill through it — find the underlying cause in logs or by running the inner command directly.
+10. Search the codebase for the exact error string with Grep to find where it is thrown. That is the real entry point for your investigation.
+## PHASE 3 — FORM A FALSIFIABLE HYPOTHESIS
+11. State your hypothesis in one sentence: "I believe X causes Y because Z." If you cannot write that sentence, you do not have a hypothesis — keep reading.
+12. Make it falsifiable: decide in advance what evidence would prove it wrong, not just what would support it.
+13. One hypothesis at a time. Holding multiple simultaneous hunches leads to shotgun changes that obscure causality.
+14. Do NOT change code yet. Changes made before a hypothesis is confirmed are noise and may mask the real bug.
+## PHASE 4 — TEST THE HYPOTHESIS
+15. Design the smallest possible experiment that confirms or refutes the hypothesis: add a log line, insert an assert, call a function in isolation.
+16. Run the experiment with Bash. Observe the result. Did the evidence match the prediction?
+17. If confirmed: proceed to Phase 5. If refuted: discard the hypothesis, return to Phase 3 with new information.
+18. Never make two changes at once. One variable per experiment. If you change A and B together and the bug disappears, you do not know which caused it.
+## PHASE 5 — BISECT AND INSTRUMENT
+19. If the hypothesis is unclear after initial experiments, bisect the call stack: find the last point where state is correct, and the first point where it is wrong. The bug is in between.
+20. Use `git bisect start` / `git bisect good <sha>` / `git bisect bad HEAD` to binary-search across commits when a regression appeared at an unknown point. Let git do the math.
+21. Binary-search inputs: if a 10 000-line file triggers the bug, try the first 5 000. Narrow until the minimal triggering input is a handful of lines.
+22. Comment out halves of suspect code paths (not random lines — structured halves) to isolate which branch owns the failure.
+23. Add instrumentation at the boundary: log the value entering a function and the value leaving it. Move the log inward until the corruption point is found. Remove instrumentation after.
+24. Use a debugger (`node --inspect`, `pdb`, `dlv`) when log-based bisection is too slow. Set a breakpoint at the boundary, inspect live state, step through.
+25. Add assertions at invariant boundaries: `assert input !== null`, `assert list.length > 0`. A violated assertion tells you exactly where the contract broke.
+## PHASE 6 — QUESTION ASSUMPTIONS
+26. Is the code you are reading actually the code that is running? Check the compiled output, the loaded module, the deployed version. Build artifacts lie.
+27. Is the right version installed? `which node`, `node -e "require('pkg/package.json').version"`, `pip show pkg`. Version mismatches are the most common "mysterious" bug.
+28. Is the environment variable actually set? Print it at runtime, not from memory. `.env` files are not always loaded; shells do not always export.
+29. Is the file you edited the file that is imported? Check symlinks, aliases, barrel re-exports, monorepo hoisting.
+30. Is the test testing what you think? Print the actual input the test passes. Mocks that silently swallow calls produce false greens.
+31. Are you in the right directory, the right branch, the right workspace? Confirm with Bash before each experiment.
+## PHASE 7 — FIND THE ROOT CAUSE, NOT THE SYMPTOM
+32. Ask "why" five times. Each answer surfaces a shallower fix; the fifth answer is usually the root cause.
+33. A fix that hides the symptom — catching and swallowing an error, adding a null guard over invalid state — is a bug deferred. It will resurface with a harder-to-read stack.
+34. The root cause is the earliest point in the causal chain where a wrong decision was made: wrong data accepted, wrong assumption encoded, wrong contract between two modules.
+35. If the root cause is in a dependency, confirm it with a minimal repro that does not use your code. File an upstream issue. Add a local workaround that is clearly labeled as temporary.
+36. Document the root cause in a comment at the fix site: `// Root cause: X assumed Y but Z is possible when W. See issue #N.`
+## PHASE 8 — FIX THE ROOT CAUSE
+37. Make the smallest change that fixes the root cause. Resist the urge to refactor while fixing — that is a separate PR.
+38. Re-run the original minimal repro. Confirm it no longer fails.
+39. Read all code paths affected by your change. Ask: what else relied on the old behavior?
+## PHASE 9 — WRITE THE REGRESSION TEST
+40. Write a test that FAILS on the unfixed code and PASSES on the fixed code. If you cannot write such a test, you do not fully understand the bug.
+41. Name the test after the bug: `test('does not throw when input is empty string', ...)`. Future readers need to understand what failure it guards.
+42. Place the test at the lowest layer where the bug lives — unit if it is a function contract, integration if it is a module boundary, e2e only if it cannot be caught earlier.
+43. Run the test against the unfixed code (via `git stash`) to confirm it catches the regression. Un-stash, run again to confirm it passes.
+## PHASE 10 — VERIFY NEIGHBORS AND CLOSE
+44. Run the full test suite with Bash. Zero regressions is the bar. If a previously passing test now fails, your fix changed a contract — reconsider the approach.
+45. Run the linter and type-checker. Bugs introduced by a fix are common; the compiler catches the cheap ones.
+46. If the bug was in a hot path or a security boundary, manually exercise adjacent behaviors: what happens one step before the fix, one step after?
+47. Write the commit message as a mini root-cause summary: what failed, why, what the fix does. This is permanent documentation.
+48. If the bug took more than 30 minutes to find, add a brief note to the relevant module's top-of-file comment or a `DEBUGGING.md` section describing the failure mode and the key diagnostic that cracked it.
+## HARD RULES
+- NEVER change code to "see if it helps" without a hypothesis. That is randomness, not debugging.
+- NEVER ship a fix you cannot explain. If you do not know why it works, you got lucky and the bug is still there.
+- NEVER remove a failing test to make CI green. Fix the code.
+- NEVER merge without running the full suite. Neighbor breakage is your responsibility.
+- NEVER skip the regression test. Memory fades; tests do not.

package/skills/design-audit.md ADDED Viewed

@@ -0,0 +1,121 @@
+---
+name: design-audit
+description: Audit a UI component or user flow for interaction-state completeness (default/hover/focus/active/disabled/loading/error/empty), WCAG accessibility (contrast, focus order, labels, target size, keyboard), responsive behavior, design-token vs hardcoded-value discipline, and cross-component consistency with the design system; delivers a severity-graded findings table and a SHIP/CONDITIONAL/CLEAR verdict.
+allowed-tools: Read, Grep, Write
+---
+## STANCE RULE (non-negotiable)
+You are a senior product designer and accessibility engineer hired to find every gap before users do. "Looks fine" is a failure of craft. Assume every state is broken until you verify the implementation. Distinguish **confirmed** (code or spec proves the gap) from **suspected** (pattern suggests a gap; a human visual pass is needed to confirm). Point every finding at a specific file and line or a specific component name.
+---
+═══ HARD RULES ═══
+- Never skip a state because "it's probably fine." Enumerate all eight interaction states for every interactive element — missing states ship as blank screens or invisible focus rings.
+- Never accept "close enough" on contrast ratios. WCAG 2.1 AA is a legal floor, not a goal. If contrast is below 4.5:1 for text (3:1 for large text / UI components), it is a confirmed failure.
+- Never treat a hardcoded hex, pixel value, or spacing literal as a style note. It is a token-discipline finding — one hardcode breaks theming and dark mode for every user.
+- Every finding must name the component, file, and line, or be marked **suspected** with the exact Grep pattern the reader can run to confirm.
+- Never end with "no findings" unless you ran every Grep in each phase and recorded what each returned. Show your work.
+- "Consider adding a focus style" is not a fix. Every fix is a concrete change: the pattern to replace and the replacement token, property, or ARIA attribute.
+- Do not conflate visual polish (suggestion) with functional/accessibility failures (finding). Rate on user impact, not aesthetic preference.
+- Never rate a finding lower because fixing it requires design-system work. Rate on user impact; flag the dependency.
+---
+═══ PHASE A — SCOPE + COMPONENT INVENTORY ═══
+1. Identify the target: single component, a composed feature (e.g. multi-step form), or a named flow (e.g. checkout). Read the primary component file(s) and any sub-component imports. List every interactive element by type: button, link, input, select, checkbox, radio, toggle, disclosure, modal, tooltip, combobox, date-picker, file-upload, drag-handle, tab, accordion.
+2. Locate the design-token source of truth: Grep for `tokens`, `theme`, `variables`, `colors`, `spacing`, `typography` in CSS/SCSS/JS/TS/JSON config files. Record the token file paths — you will need them for Phase E.
+3. Locate any existing component stories, test files, or Storybook/Ladle/Histoire entries. These reveal which states the author intended to cover; gaps between intended and implemented states are a direct finding.
+4. Note the target breakpoints: Grep for `@media`, `min-width`, `max-width`, `useBreakpoint`, `sm:`, `md:`, `lg:`, `xl:` to enumerate the responsive contract. If no breakpoints are found, note the absence.
+---
+═══ PHASE B — INTERACTION STATE AUDIT ═══
+For each interactive element identified in Phase A, verify all eight states. A missing or indistinguishable state is a confirmed finding.
+5. **Default:** Does the element render its intended resting appearance? Are all visible properties (color, border, shadow, typography) expressed via design tokens, not hardcoded values?
+6. **Hover:** Grep for `:hover`, `&:hover`, `onMouseEnter`, `hover:` (Tailwind). Is there a visual change? On touch-only devices hover states must not be the *only* affordance — confirm the active/pressed state exists independently.
+7. **Focus (keyboard):** Grep for `:focus`, `:focus-visible`, `outline`, `ring`, `focusRing`. Is the focus indicator visible, high-contrast (3:1 minimum against adjacent background per WCAG 2.1 SC 1.4.11), and not suppressed by `outline: none` or `outline: 0` without a replacement? Note: `outline: none` on `:focus` without `:focus-visible` compensation is a confirmed accessibility failure.
+8. **Active/Pressed:** Grep for `:active`, `active:` (Tailwind), `onMouseDown`. Is there a distinct pressed affordance? Without it, users cannot confirm their tap/click was registered.
+9. **Disabled:** Grep for `disabled`, `aria-disabled`, `[disabled]`. Confirm: (a) the element cannot be activated (not just visually dimmed); (b) `aria-disabled="true"` is used on custom elements (native `disabled` removes the element from focus order entirely — confirm this is intentional per the interaction design); (c) disabled opacity or color meets the 3:1 threshold for UI components where interaction is still possible via `aria-disabled`.
+10. **Loading/Pending:** For any element that triggers async work, Grep for loading state patterns: `isLoading`, `isPending`, `isFetching`, `skeleton`, `spinner`, `aria-busy`. Confirm: (a) a visible loading indicator exists; (b) the trigger element is disabled or shows `aria-busy="true"` during the async operation to prevent duplicate submissions; (c) screen readers are notified of completion (live region or focus move).
+11. **Error:** Grep for `error`, `isError`, `invalid`, `aria-invalid`, `aria-errormessage`, `aria-describedby` on form controls. Confirm: (a) the error message is programmatically associated with the input (`aria-describedby` pointing to the error `id`); (b) error color is not the *only* differentiator (colorblind failure); (c) the error is announced to screen readers without requiring the user to re-discover it.
+12. **Empty/Zero-state:** For any list, feed, table, inbox, or search result, Grep for empty-state patterns: `isEmpty`, `length === 0`, `results.length`, `noResults`, `EmptyState`. Confirm: (a) an intentional empty state renders (not a blank frame); (b) the empty state contains a clear explanation and, where appropriate, a recovery action; (c) the layout does not collapse in a way that breaks surrounding UI.
+---
+═══ PHASE C — ACCESSIBILITY AUDIT ═══
+13. **Color contrast — text:** Grep for all color token usages on text elements. Cross-reference foreground/background token pairings against WCAG 2.1 SC 1.4.3 thresholds: 4.5:1 for body text (<18pt / <14pt bold); 3:1 for large text (≥18pt regular or ≥14pt bold). Flag any pairing where the token values produce a ratio below threshold. If exact token hex values are not in scope, flag the pairing as **suspected** and provide the Grep pattern to locate it.
+14. **Color contrast — non-text UI components:** Input borders, focus rings, icon buttons, progress bars, and chart elements require 3:1 against adjacent background (WCAG SC 1.4.11). Grep for border-color, stroke, and icon-color tokens used on interactive or meaningful non-text elements. Flag pairings below 3:1.
+15. **Focus order:** Read the DOM order of the component (JSX/HTML source order). Is it logical left-to-right, top-to-bottom? Grep for `tabIndex` values greater than 0 — these override natural order and are almost always a confirmed finding (they create a maintenance trap; fix is `tabIndex={0}` on custom interactive elements and reliance on DOM order).
+16. **Keyboard operability:** For every interactive element that is NOT a native `<button>`, `<a href>`, `<input>`, `<select>`, or `<textarea>`: confirm it has `tabIndex={0}`, a `role` appropriate to its behavior, and `onKeyDown`/`onKeyUp` handlers for Enter and Space (buttons) or arrow keys (menus, tabs, sliders). Grep for `onClick` on `<div>` or `<span>` — every hit is a confirmed keyboard-inaccessibility finding unless paired with the above.
+17. **ARIA labels:** Grep for `<button>`, `<a>`, `<input>` elements. For icon buttons (no visible text), confirm `aria-label` or `aria-labelledby` is present and describes the action (not the icon: "Close dialog" not "X"). For inputs, confirm each has a programmatically associated `<label>` (via `for`/`htmlFor` or `aria-label`) — `placeholder` alone is not a label (confirmed finding per WCAG SC 1.3.1).
+18. **Touch/pointer target size:** Grep for the clickable surface of interactive elements. WCAG 2.5.5 (Level AA in WCAG 2.2) requires a minimum 44×44 CSS pixels. WCAG 2.5.8 (WCAG 2.2 AA) requires minimum 24×24. Flag any target below 24×24 as a confirmed failure; flag 24–43px as a warning. Check padding, not just the visual icon size.
+19. **Motion and animation:** Grep for `animation`, `transition`, `@keyframes`, `motion`, `framer-motion`, `useAnimation`. Confirm: (a) any animation that repeats more than 3 times, or is auto-playing and lasts >5s, has a `prefers-reduced-motion` media query counterpart that removes or reduces it (WCAG SC 2.3.3, 2.3.1); (b) no flashing content exceeds 3 Hz in a >10-degree visual field (seizure risk, WCAG SC 2.3.1 — flag if animation involves rapid color changes).
+20. **Images and icons:** Grep for `<img>`, `<svg>`, `<Icon>`. Decorative images must have `alt=""` or `aria-hidden="true"`. Informative images must have descriptive `alt` text. Functional images (images-as-buttons) must have alt text describing the action. Every `<img>` without `alt` attribute is a confirmed WCAG SC 1.1.1 failure.
+---
+═══ PHASE D — RESPONSIVE BEHAVIOR AUDIT ═══
+21. **Layout reflow:** At the component's minimum supported viewport width (typically 320px per WCAG SC 1.4.10), confirm content does not require horizontal scrolling to read (Reflow). Grep for fixed-pixel `width` on containers — any value wider than 320px on a full-width container is a confirmed finding. Exception: data tables and maps (explicitly exempt in WCAG 1.4.10).
+22. **Breakpoint completeness:** For each breakpoint enumerated in Phase A, confirm the component has an explicit, tested layout. Grep for conditional rendering patterns (`hidden sm:block`, `useIsMobile`, `isMobile`) and verify the counterpart renders correctly at the opposing breakpoint.
+23. **Touch interaction parity:** Any action achievable via hover (tooltip reveal, dropdown open, popover trigger) must have a touch-accessible equivalent. Confirm hover-only affordances are either also triggered on tap/focus, or have a secondary touch trigger.
+24. **Dynamic content and truncation:** Grep for `text-overflow: ellipsis`, `truncate`, `clamp`, `maxLines`, `lineClamp`. Confirm truncated content is accessible in full via a disclosure mechanism or `title`/`aria-label` attribute. Ellipsis alone leaves screen-reader users without the full content.
+---
+═══ PHASE E — DESIGN TOKEN DISCIPLINE ═══
+25. **Hardcoded colors:** Grep for hexadecimal color literals (`#[0-9a-fA-F]{3,8}`), `rgb(`, `rgba(`, `hsl(`, `hsla(` in component style files. Each hit that is not in a token definition file is a confirmed token-discipline finding. The fix is to replace with the semantically correct token variable.
+26. **Hardcoded spacing:** Grep for pixel or rem literals in `margin`, `padding`, `gap`, `top`, `left`, `right`, `bottom`, `width`, `height` properties in component files. Numeric Tailwind classes (e.g. `p-4`) are acceptable only if the design system maps Tailwind's scale to the token scale — confirm this mapping exists; if not, it is a suspected finding.
+27. **Hardcoded typography:** Grep for `font-size`, `font-weight`, `line-height`, `letter-spacing`, `font-family` literals in component files. Each should reference a typography token or utility class backed by a token.
+28. **Hardcoded elevation/shadow:** Grep for `box-shadow` and `drop-shadow` literals in component files. Shadows encode elevation; they must reference elevation tokens to remain consistent across themes.
+29. **Dark-mode / theme correctness:** If the system supports theming or dark mode, Grep for semantic token usage (e.g. `--color-surface`, `bg-background`, `text-foreground`). Hardcoded light-mode values that are never overridden in dark mode are confirmed theme-breakage findings.
+---
+═══ PHASE F — DESIGN SYSTEM CONSISTENCY AUDIT ═══
+30. **Pattern divergence:** For each element type in the component (button, input, card, badge, etc.), Grep the codebase for how that element type is implemented elsewhere. If the audited component implements its own version of a pattern that already exists in the design system, it is a confirmed consistency finding — two implementations of the same pattern will drift.
+31. **Motion/timing divergence:** Grep for `transition-duration`, `animation-duration`, `easing`, `cubic-bezier` values in the component. Cross-reference against the motion token definitions. Non-token durations or easings are a consistency finding.
+32. **Typography hierarchy divergence:** Confirm the component uses only type roles/levels defined in the system (e.g. `heading-1`, `body-sm`, `label-xs`). Grep for custom font-size or weight values not present in the token scale — each is a divergence finding.
+33. **Naming and labeling consistency:** Read all user-facing label text in the component. Cross-reference the terminology against other screens in the codebase (Grep for the same concept expressed in multiple ways: "Submit" vs "Save" vs "Confirm"; "Error" vs "Problem" vs "Issue"). Inconsistent terminology is a UX finding that erodes trust.
+34. **Icon consistency:** Grep for icon component usage across the codebase. Confirm the audited component uses icons from one icon library only. Mixed icon libraries (e.g. Heroicons + Lucide + custom SVGs) are a consistency finding unless the system explicitly reserves multiple libraries for distinct semantic roles.
+---
+═══ PHASE G — FINDINGS TABLE ═══
+For every confirmed or suspected finding, produce one row:
+| ID | Phase | Category | Severity | Confidence | Component | Location | Finding | Fix |
+|----|-------|----------|----------|------------|-----------|----------|---------|-----|
+| D1 | B | Interaction State | High | Confirmed | PrimaryButton | `components/Button.tsx:34` | `:focus-visible` outline suppressed by `outline: none` with no replacement | Replace with `outline: 2px solid var(--color-focus-ring); outline-offset: 2px` on `:focus-visible` |
+| D2 | C | Accessibility | Critical | Confirmed | IconButton | `components/IconButton.tsx:12` | Icon button has no `aria-label`; action is invisible to screen readers | Add `aria-label="Close dialog"` to the `<button>` element |
+**Severity scale:**
+- **Critical** — blocks task completion for users with disabilities or certain device classes; WCAG A or AA hard failure; legally actionable.
+- **High** — significantly degrades experience for a real user segment; missing required state causes confusion or data loss.
+- **Medium** — degrades experience but a workaround exists; token-discipline failure that breaks theming; inconsistency that erodes trust.
+- **Low** — polish gap; minor inconsistency; improvement that has no functional consequence.
+---
+═══ PHASE H — VERDICT ═══
+After the table, declare one of:
+- **SHIP-BLOCKED** — any Critical finding, or three or more High findings without a committed fix plan.
+- **CONDITIONAL** — High findings present, each has a specific owner, fix, and timeline committed before release.
+- **CLEAR** — no Critical or High findings; Medium/Low documented with owner and milestone.
+State the **top 3 findings** in priority order regardless of verdict. The design owner and engineering lead must each acknowledge these before the component enters a release branch.
+---
+KEY PRINCIPLE: A component is only as good as its worst state — audit every state as if it is the one a frustrated user on a keyboard, a slow connection, and a high-contrast display will hit first.

package/skills/differential-diagnosis.md ADDED Viewed

@@ -0,0 +1,79 @@
+---
+name: differential-diagnosis
+description: Build and rank a structured clinical differential diagnosis from a symptom/sign/lab constellation — covering likelihood, danger, discriminating features, red flags, confirmatory/exclusionary tests, and guideline citations — for EDUCATIONAL use only; always appends the not-a-substitute-for-a-physician disclaimer.
+allowed-tools: Read, WebSearch, WebFetch, Write
+---
+═══ HARD RULES ═══
+1. EVERY output MUST close with the verbatim EDUCATIONAL DISCLAIMER block defined in Phase F. No exceptions.
+2. Never assert a diagnosis. Rank possibilities; state what would confirm or exclude each.
+3. Never fabricate drug doses, sensitivity/specificity figures, or guideline version numbers. If uncertain, write "per current [society] guidelines — verify the edition in use."
+4. Cannot-miss diagnoses (life-threatening if delayed) are listed FIRST in the danger column regardless of probability — never buried.
+5. Epidemiology matters: state the prior probability context (age, sex, risk factors) before ranking; a rare diagnosis in the right host is not rare.
+6. When any red-flag feature is present, elevate urgency language immediately — do not soften with hedging prose.
+7. Anchoring is malpractice: explicitly test the leading diagnosis against each alternative before finalizing the list.
+8. This skill is EDUCATIONAL. It does not replace history-taking, physical examination, imaging, or clinical judgment by a licensed practitioner.
+═══ PHASE A — CASE FRAMING ═══
+A1. Restate the presenting problem in one clinical sentence: [chief complaint] in a [age/sex/relevant host context] with [duration] and [severity modifiers].
+A2. Extract and list every anchor feature provided (symptom, sign, timeline, relevant history, current medications, allergies, recent travel/exposures, family history, vital signs, available labs/imaging).
+A3. Identify and flag any features that are ABSENT but clinically meaningful (pertinent negatives) — absence of fever, absence of trauma, etc.
+A4. Categorize the syndrome anatomically and physiologically: What organ system(s)? What pathophysiologic mechanism category fits best (structural / inflammatory / infectious / metabolic / vascular / neoplastic / toxic-pharmacologic / functional)?
+A5. State the baseline prior probability context: prevalence in this demographic, relevant risk-factor loading, seasonality or geography if applicable.
+═══ PHASE B — BUILD THE DIFFERENTIAL ═══
+B1. Generate a BROAD differential using the mnemonic VINDICATE-P (Vascular · Inflammatory/Immune · Neoplastic · Degenerative · Infectious · Congenital · Autoimmune/Allergic · Traumatic · Endocrine/Metabolic · Psychiatric/functional) applied to the syndrome — list every plausible category entry before culling.
+B2. Add organ-system-specific pattern overlays: if chest pain, include ACS, PE, aortic dissection, tension pneumothorax, esophageal rupture before any less dangerous cause; if altered mental status, include meningitis/encephalitis, hypoglycemia, intracranial bleed, toxidrome before functional causes. Tailor the overlay to the actual syndrome.
+B3. Cull obvious implausibilities (requires explicit reasoning: "excluded because X is incompatible with finding Y"), but err toward over-inclusion at this stage.
+B4. Output a WORKING DIFFERENTIAL LIST — at minimum 5 entries, ideally 6–10. For each entry record: [Name] | [ICD-10 concept] | [Mechanism in one clause].
+═══ PHASE C — RANK BY LIKELIHOOD AND DANGEROUSNESS ═══
+C1. Assign each differential entry TWO scores (1–5 scale, state reasoning):
+    - LIKELIHOOD given the complete clinical picture (5 = highly probable)
+    - DANGER if missed or delayed (5 = immediately life-threatening or causes irreversible harm)
+C2. Sort the display table by DANGER (descending) first, then LIKELIHOOD second. This forces cannot-miss diagnoses to the top regardless of probability.
+C3. Table format (Markdown):
+| # | Diagnosis | Likelihood (1–5) | Danger if Missed (1–5) | Key Supporting Features | Key Against Features |
+|---|-----------|-----------------|------------------------|------------------------|----------------------|
+C4. Separately call out CANNOT-MISS DIAGNOSES (danger = 5) in a bold labeled block even if likelihood is low — these must never be implicitly deprioritized by rank order.
+═══ PHASE D — DISCRIMINATING FEATURES AND WORKUP ═══
+D1. For each diagnosis in the table, state:
+    - ONE or TWO findings that, if present, would STRONGLY SUPPORT it (LR+ > 3 if known; state "high LR+" without fabricating a number if the exact figure is uncertain).
+    - ONE or TWO findings that, if present, would STRONGLY ARGUE AGAINST it (LR− < 0.3 if known; same caveat).
+D2. Propose a TIERED DIAGNOSTIC WORKUP:
+    - IMMEDIATE (do now, or in the next minutes-to-hours if unstable): e.g., ECG, glucose, O2 sat, CT without contrast, LP if meningismus present.
+    - URGENT (within hours to 24h if stable): e.g., CBC with differential, BMP, troponin serial, blood cultures × 2 before antibiotics, echo.
+    - SEMI-ELECTIVE (24h–7d, outpatient acceptable if safe): e.g., autoimmune panel, thyroid function, outpatient stress test, biopsy.
+D3. For each test, state WHAT RESULT would confirm vs. exclude which diagnosis. Never recommend a test without linking it to a specific diagnostic decision.
+D4. Cite the relevant guideline source by society and approximate year (e.g., "AHA/ACC 2021 chest pain guideline"; "IDSA 2017 CAP guideline") — verify current version before applying clinically.
+═══ PHASE E — RED FLAGS AND ESCALATION PROTOCOL ═══
+E1. List ALL red-flag features present in this case using a labeled RED FLAG block:
+    ⚠️ RED FLAGS IDENTIFIED: [feature] → [implication] → [action threshold]
+    If none identified from supplied data, write "No red flags identified from provided information — reassess if clinical picture evolves."
+E2. State explicit ESCALATION CRITERIA: the specific finding or trajectory change that mandates immediate emergency transfer, urgent specialist consultation, or hospital admission. Phrase as: "If [criterion], then [action] — do not defer."
+E3. Address SAFETY-NETTING: what the patient/caregiver should watch for and what prompts an immediate return. Even in an educational context, name these explicitly so the reader understands the stakes.
+E4. If the clinical picture is undifferentiated and dangerous possibilities cannot yet be excluded, state: "Treat for the most dangerous plausible diagnosis until excluded."
+═══ PHASE F — SYNTHESIS AND OUTPUT ═══
+F1. Write a CLINICAL SUMMARY PARAGRAPH (3–5 sentences): most likely diagnosis and rationale, the most dangerous diagnosis that must be excluded first and how, and the single highest-yield next step.
+F2. State the WORKING DIAGNOSIS for educational simulation (not a clinical conclusion): "For educational modeling purposes, the leading hypothesis is [X] — this requires confirmation by [test/finding] and exclusion of [Y] by [test/finding]."
+F3. Note any knowledge cutoff issues: if guidelines cited predate 2023, flag for user to verify current recommendations.
+F4. Append the following EDUCATIONAL DISCLAIMER verbatim — this block is MANDATORY on every output from this skill:
+---
+**EDUCATIONAL DISCLAIMER — NOT PROFESSIONAL MEDICAL ADVICE**
+This differential diagnosis was generated by an AI agent for EDUCATIONAL and INFORMATIONAL purposes only. It is not a clinical diagnosis, a treatment recommendation, or a substitute for the professional judgment of a licensed physician, nurse practitioner, or other qualified healthcare provider. Medical decision-making requires in-person history-taking, physical examination, validated diagnostic testing, and individualized clinical judgment that no AI tool can replicate. If you or someone you know has a medical concern, consult a qualified healthcare professional promptly. In an emergency, call your local emergency services (e.g., 911 in the US) immediately.
+---
+KEY PRINCIPLE: List what would kill first, then what is most likely — because a missed PE at 2% probability is more consequential than a confirmed GERD at 60%.