npm - sanook-cli - Versions diffs - 0.4.0 → 0.5.0 - Mend

sanook-cli 0.4.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (235) hide show

package/.env.example +19 -0
package/CHANGELOG.md +144 -0
package/README.md +153 -20
package/README.th.md +136 -0
package/dist/agentContext.js +4 -0
package/dist/approval.js +6 -0
package/dist/bin.js +394 -51
package/dist/brain.js +92 -59
package/dist/brand.js +47 -0
package/dist/checkpoint.js +37 -0
package/dist/commands.js +86 -6
package/dist/compaction.js +76 -5
package/dist/config.js +100 -12
package/dist/cost.js +60 -3
package/dist/doctor.js +92 -0
package/dist/gateway/auth.js +2 -2
package/dist/gateway/ledger.js +2 -2
package/dist/gateway/scheduler.js +1 -0
package/dist/gateway/serve.js +6 -4
package/dist/gateway/server.js +10 -2
package/dist/git.js +11 -2
package/dist/hooks.js +43 -17
package/dist/knowledge.js +48 -49
package/dist/loop.js +182 -66
package/dist/lsp/client.js +173 -0
package/dist/lsp/framing.js +56 -0
package/dist/lsp/index.js +138 -0
package/dist/lsp/servers.js +82 -0
package/dist/mcp-server.js +244 -0
package/dist/mcp.js +184 -29
package/dist/memory-store.js +559 -0
package/dist/memory.js +143 -29
package/dist/orchestrate.js +150 -0
package/dist/providers/codex.js +2 -2
package/dist/providers/keys.js +3 -2
package/dist/providers/registry.js +133 -1
package/dist/repomap.js +93 -0
package/dist/search/chunk.js +158 -0
package/dist/search/embed-store.js +187 -0
package/dist/search/engine.js +203 -0
package/dist/search/fuse.js +35 -0
package/dist/search/index-core.js +187 -0
package/dist/search/indexer.js +241 -0
package/dist/search/store.js +77 -0
package/dist/session.js +42 -8
package/dist/skill-install.js +10 -10
package/dist/skills.js +12 -9
package/dist/summarize.js +31 -0
package/dist/tools/bash.js +21 -2
package/dist/tools/diagnostics.js +41 -0
package/dist/tools/edit.js +29 -7
package/dist/tools/index.js +8 -1
package/dist/tools/list.js +7 -2
package/dist/tools/permission.js +90 -9
package/dist/tools/read.js +23 -4
package/dist/tools/remember.js +1 -1
package/dist/tools/sandbox.js +61 -0
package/dist/tools/search.js +105 -4
package/dist/tools/task.js +195 -29
package/dist/tools/timeout.js +35 -0
package/dist/tools/util.js +10 -0
package/dist/tools/write.js +6 -4
package/dist/trust.js +89 -0
package/dist/ui/app.js +218 -27
package/dist/ui/banner.js +4 -9
package/dist/ui/history.js +30 -0
package/dist/ui/mentions.js +44 -0
package/dist/ui/setup.js +6 -5
package/dist/ui/useEditor.js +83 -0
package/dist/update.js +114 -0
package/dist/worktree.js +173 -0
package/package.json +11 -5
package/scripts/postinstall.mjs +33 -0
package/second-brain/.agents/_Index.md +30 -0
package/second-brain/.agents/skills/_Index.md +30 -0
package/second-brain/.agents/workflows/_Index.md +30 -0
package/second-brain/AGENTS.md +4 -4
package/second-brain/Acceptance/_Index.md +30 -0
package/second-brain/Acceptance/golden-case-template.md +39 -0
package/second-brain/Areas/_Index.md +30 -0
package/second-brain/Bugs/System-OS/_Index.md +30 -0
package/second-brain/Bugs/_Index.md +30 -0
package/second-brain/CLAUDE.md +4 -1
package/second-brain/Checklists/_Index.md +30 -0
package/second-brain/Checklists/preflight-postflight-template.md +29 -0
package/second-brain/Distillations/_Index.md +30 -0
package/second-brain/Entities/_Index.md +30 -0
package/second-brain/Entities/entity-template.md +33 -0
package/second-brain/Evals/_Index.md +30 -0
package/second-brain/Evals/correction-pairs.md +24 -0
package/second-brain/Evals/failure-taxonomy.md +24 -0
package/second-brain/Evals/golden-set.md +25 -0
package/second-brain/Evals/quality-ledger.md +23 -0
package/second-brain/Evals/self-eval-rubric.md +23 -0
package/second-brain/GEMINI.md +4 -4
package/second-brain/Goals/_Index.md +30 -0
package/second-brain/Handoffs/_Index.md +30 -0
package/second-brain/Home.md +7 -0
package/second-brain/Intake/Raw Sources/_Index.md +30 -0
package/second-brain/Intake/_Index.md +30 -0
package/second-brain/Intake/_Quarantine/_Index.md +30 -0
package/second-brain/Learning/_Index.md +30 -0
package/second-brain/Playbooks/_Index.md +30 -0
package/second-brain/Playbooks/playbook-template.md +23 -0
package/second-brain/Projects/_Index.md +30 -0
package/second-brain/Prompts/_Index.md +30 -0
package/second-brain/README.md +2 -1
package/second-brain/Research/_Index.md +30 -0
package/second-brain/Retrospectives/_Index.md +30 -0
package/second-brain/Reviews/_Index.md +30 -0
package/second-brain/Runbooks/_Index.md +30 -0
package/second-brain/Runbooks/eval-loop.md +24 -0
package/second-brain/Sessions/_Index.md +30 -0
package/second-brain/Shared/AI-Context-Index.md +20 -0
package/second-brain/Shared/AI-Threads/_Index.md +30 -0
package/second-brain/Shared/Archive/_Index.md +30 -0
package/second-brain/Shared/Assets/_Index.md +30 -0
package/second-brain/Shared/Context-Packs/_Index.md +30 -0
package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
package/second-brain/Shared/Coordination/NOW.md +28 -0
package/second-brain/Shared/Coordination/_Index.md +30 -0
package/second-brain/Shared/Coordination/agent-registry.md +24 -0
package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
package/second-brain/Shared/Coordination/task-board.md +32 -0
package/second-brain/Shared/Core-Facts/_Index.md +30 -0
package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
package/second-brain/Shared/Glossary/_Index.md +30 -0
package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
package/second-brain/Shared/Operating-State/_Index.md +30 -0
package/second-brain/Shared/Prompting/_Index.md +30 -0
package/second-brain/Shared/Provenance/_Index.md +30 -0
package/second-brain/Shared/Rules/_Index.md +30 -0
package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
package/second-brain/Shared/Rules/rules-formatting.md +34 -0
package/second-brain/Shared/Scripts/_Index.md +30 -0
package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
package/second-brain/Shared/User-Memory/_Index.md +30 -0
package/second-brain/Shared/User-Persona/_Index.md +30 -0
package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
package/second-brain/Shared/Working-Memory/_Index.md +30 -0
package/second-brain/Shared/_Index.md +30 -0
package/second-brain/Shared/mcp-servers/_Index.md +30 -0
package/second-brain/Skills/_Index.md +30 -0
package/second-brain/Templates/_Index.md +30 -0
package/second-brain/Templates/bug.md +2 -0
package/second-brain/Templates/handoff.md +2 -0
package/second-brain/Templates/session.md +2 -0
package/second-brain/Tools/_Index.md +30 -0
package/second-brain/Traces/_Index.md +30 -0
package/second-brain/Vault Structure Map.md +33 -1
package/second-brain/copilot/_Index.md +30 -0
package/skills/audit-license-compliance/SKILL.md +117 -0
package/skills/author-codemod/SKILL.md +110 -0
package/skills/build-audit-logging/SKILL.md +112 -0
package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
package/skills/build-cli-tool/SKILL.md +108 -0
package/skills/build-data-table/SKILL.md +141 -0
package/skills/build-native-mobile-ui/SKILL.md +154 -0
package/skills/build-offline-first-sync/SKILL.md +118 -0
package/skills/build-realtime-channel/SKILL.md +122 -0
package/skills/build-vector-search/SKILL.md +131 -0
package/skills/compose-local-dev-stack/SKILL.md +149 -0
package/skills/configure-bundler-build/SKILL.md +166 -0
package/skills/configure-dns-tls/SKILL.md +142 -0
package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
package/skills/configure-security-headers-csp/SKILL.md +122 -0
package/skills/contract-testing/SKILL.md +140 -0
package/skills/datetime-timezone-correctness/SKILL.md +125 -0
package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
package/skills/debug-flaky-tests/SKILL.md +128 -0
package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
package/skills/deliver-webhooks/SKILL.md +116 -0
package/skills/design-api-pagination/SKILL.md +144 -0
package/skills/design-authorization-model/SKILL.md +119 -0
package/skills/design-backup-dr-recovery/SKILL.md +113 -0
package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
package/skills/design-multi-tenancy/SKILL.md +100 -0
package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
package/skills/design-relational-schema/SKILL.md +129 -0
package/skills/design-search-index-infra/SKILL.md +151 -0
package/skills/design-state-machine/SKILL.md +108 -0
package/skills/design-token-system/SKILL.md +109 -0
package/skills/distributed-locks-leases/SKILL.md +120 -0
package/skills/encrypt-sensitive-data/SKILL.md +148 -0
package/skills/feature-flags-rollout/SKILL.md +130 -0
package/skills/file-upload-object-storage/SKILL.md +107 -0
package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
package/skills/harden-llm-app-reliability/SKILL.md +126 -0
package/skills/i18n-localization-setup/SKILL.md +113 -0
package/skills/idempotency-keys/SKILL.md +107 -0
package/skills/implement-push-notifications/SKILL.md +142 -0
package/skills/ingest-webhook-secure/SKILL.md +120 -0
package/skills/integrate-oauth-oidc/SKILL.md +126 -0
package/skills/load-stress-test/SKILL.md +129 -0
package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
package/skills/model-nosql-data/SKILL.md +118 -0
package/skills/money-decimal-arithmetic/SKILL.md +123 -0
package/skills/monitor-ml-drift/SKILL.md +109 -0
package/skills/numeric-precision-units/SKILL.md +144 -0
package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
package/skills/optimize-react-rerenders/SKILL.md +124 -0
package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
package/skills/payments-billing-integration/SKILL.md +114 -0
package/skills/pin-toolchain-versions/SKILL.md +116 -0
package/skills/plan-strangler-migration/SKILL.md +95 -0
package/skills/property-based-testing/SKILL.md +108 -0
package/skills/publish-package-registry/SKILL.md +130 -0
package/skills/recover-git-state/SKILL.md +119 -0
package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
package/skills/resilience-timeouts-retries/SKILL.md +104 -0
package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
package/skills/rewrite-git-history/SKILL.md +109 -0
package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
package/skills/schema-evolution-compatibility/SKILL.md +121 -0
package/skills/send-transactional-email/SKILL.md +126 -0
package/skills/serve-deploy-ml-model/SKILL.md +107 -0
package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
package/skills/setup-devcontainer-env/SKILL.md +131 -0
package/skills/setup-lint-format-precommit/SKILL.md +140 -0
package/skills/setup-monorepo-tooling/SKILL.md +125 -0
package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
package/skills/structured-output-llm/SKILL.md +86 -0
package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
package/skills/test-data-factories/SKILL.md +158 -0
package/skills/threat-model-stride/SKILL.md +123 -0
package/skills/train-evaluate-ml-model/SKILL.md +109 -0
package/skills/unicode-text-correctness/SKILL.md +109 -0
package/skills/visual-regression-testing/SKILL.md +120 -0

package/skills/numeric-precision-units/SKILL.md ADDED Viewed

@@ -0,0 +1,144 @@
+---
+name: numeric-precision-units
+description: Prevents numeric-precision and units defects by enforcing epsilon/ULP/relative float comparison, Kahan/Welford stable accumulation, NaN/Inf and div-by-zero guards, checked/saturating integer arithmetic, lossless int64/decimal transport across JSON/JS/DB boundaries, and explicit unit-typed conversions with consistent rounding.
+when_to_use: Code does scientific/statistical math, accumulates many floats, compares floats with ==, converts units (metric/imperial, time, data sizes, angles), or moves large integers/decimals across JSON/JS/DB/language boundaries; or bugs involve flaky float equality, NaN/Inf, silent overflow, or lost int64→double precision. Distinct from money-decimal-arithmetic (monetary rounding/allocation correctness) and validate-data-quality (schema/null/range checks).
+---
+## When to Use
+Reach for this skill when the defect is about **the number itself** — its representation, precision, or unit — not its monetary rounding, schema, or business validity:
+- "These two floats are equal but `a == b` returns false" / flaky test on a computed total
+- "Summing a million values gives a different answer depending on order"
+- "Mean/variance is wrong / NaN on large or near-equal data"
+- "My int64 ID comes back rounded after a round-trip through JSON/JS"
+- "A big integer turned into `1.0000000000000002e18` in the browser / spreadsheet"
+- "Counter wrapped to a negative number" / `i32` overflow / cast truncated a value
+- "We mixed meters and feet / ms and seconds / radians and degrees / KB(1000) and KiB(1024)"
+- Division by a value that can be zero; `0.1 + 0.2 != 0.3`; signed-zero or `-0.0` surprises
+NOT this skill:
+- Monetary correctness — cents/`Decimal`, banker's rounding, splitting a charge so it sums exactly, FX → **money-decimal-arithmetic** (this skill keeps money *out* of binary float and intact across boundaries; that one does the rounding/allocation math)
+- Duration/clock precision, monotonic vs wall-clock, leap seconds, DST math → **datetime-timezone-correctness** (this skill stores time as an integer; that one interprets it)
+- Checking a field is present / in range / right type at ingest → **validate-data-quality**
+- Configuring the compiler/linter to forbid implicit numeric coercion (tsconfig, mypy, clippy) → **type-safety-strict**
+- Choosing `NUMERIC` vs `BIGINT` column types for a *schema migration* → **db-migration-safety**
+- Declaring the wire shape of a number in an API (string vs int64 in the contract) → **rest-graphql-contract**
+- Writing the test harness/property-test scaffolding itself → **write-tests**
+- Making a `SUM` aggregate query fast → **optimize-sql-query**
+## Steps
+1. **Decide the representation first — float is a default, not a law.**
+   | Domain | Use | Never |
+   |---|---|---|
+   | IDs, counts, timestamps (ns/ms) | integer (`int64`) | `double` (loses precision > 2^53) |
+   | Physical/scientific measurement | `float64` (`double`) | `float32` unless memory-bound and tolerance allows |
+   | Exact fractions / ratios | rational type or scaled integer | float |
+   | Probabilities, weights, signals | `float64` | `float32` |
+   | Money / currency | → defer to **money-decimal-arithmetic** | binary `float`/`double` |
+   Rule: if two values must compare *exactly equal*, they must not be binary floats.
+2. **Never `==` floats. Pick the tolerance by scale.** Absolute epsilon fails for large magnitudes; relative fails near zero. Use a combined check:
+   ```python
+   def close(a, b, rel=1e-9, abs_tol=1e-12):
+       if a == b:                       # exact / both inf same sign
+           return True
+       if math.isnan(a) or math.isnan(b):
+           return False                  # NaN is never close to anything
+       return abs(a - b) <= max(rel * max(abs(a), abs(b)), abs_tol)
+   ```
+   - Library defaults: Python `math.isclose(a, b, rel_tol=1e-9)`, NumPy `np.isclose`/`allclose`, Rust `approx::relative_eq!`, JS — write the above (no stdlib equivalent).
+   - **Near zero, relative tolerance collapses** — that's why `abs_tol` exists; set it to the smallest difference you consider zero in your domain.
+   - ULP comparison (`a` and `b` within N representable steps) only for low-level kernels where you control rounding mode — overkill for app code; use relative+abs.
+3. **Accumulate stably — order and algorithm change the answer.** Naive left-to-right `sum` accumulates rounding error ∝ n·ε and suffers **catastrophic cancellation** when subtracting near-equal large numbers.
+   - Summation: use **pairwise** (NumPy's `np.sum` already does this) or **Kahan/Neumaier compensated** summation for long running totals:
+     ```python
+     def kahan_sum(xs):
+         s = 0.0; c = 0.0          # c = running compensation
+         for x in xs:
+             y = x - c
+             t = s + y
+             c = (t - s) - y
+             s = t
+         return s
+     ```
+   - Mean/variance: **never** `sum(x²)/n - mean²` (cancellation → negative variance / NaN). Use **Welford's online** algorithm:
+     ```python
+     n = 0; mean = 0.0; M2 = 0.0
+     for x in data:
+         n += 1
+         d = x - mean; mean += d / n
+         M2 += d * (x - mean)
+     var = M2 / n            # population; M2/(n-1) for sample
+     ```
+   - Sort ascending by magnitude before summing wildly different scales if you can't use compensated summation.
+4. **Guard special values at the source, not three layers downstream.**
+   - Before any `/`: reject or branch on a zero/near-zero divisor (`if abs(d) < abs_tol: raise/return sentinel`). Float `x/0.0` yields `±inf`/`nan` *silently*; integer `/0` traps/UB.
+   - Treat `NaN`/`Inf` as poison: one `NaN` propagates through every subsequent op and **every comparison with it is false** (including `nan == nan`). Validate inputs with `math.isfinite(x)` at boundaries; assert finiteness on outputs.
+   - `-0.0 == 0.0` is true but `1/-0.0 == -inf` while `1/0.0 == +inf`; normalize with `x + 0.0` or `x == 0.0 ? 0.0 : x` when sign of zero leaks into results.
+   - `log`/`sqrt`/`acos` domain: clamp inputs (`acos(min(1, max(-1, x)))`) — rounding can push a value to `1.0000000002` and yield `NaN`.
+5. **Integer arithmetic: assume it overflows, prove it doesn't.** Fixed-width ints wrap (C/Go/Rust release/Java) or trap (Rust debug) or silently promote (Python/JS bignum-ish). Choose explicit semantics:
+   | Need | Rust | Go/C | Java | Generic |
+   |---|---|---|---|---|
+   | Detect overflow | `checked_add` → `Option` | compare after op / `bits.Add64` | `Math.addExact` (throws) | check before/after |
+   | Clamp at bound | `saturating_add` | manual `min`/`max` | manual | manual |
+   | Intentional wrap | `wrapping_add` | default `+` | default `+` | mask |
+   - **Narrowing casts lose data silently**: `(int32)bigLong`, `i64 as i32`, `Number → Int32`. Range-check before narrowing; never cast a length/ID down.
+   - Multiplication overflows long before addition — `a*b` for two `int32` near 2^16 already wraps; widen to `int64` *before* multiplying.
+   - `INT_MIN / -1` and `-INT_MIN` overflow; `abs(INT_MIN)` is still negative.
+6. **Cross-boundary precision: the 2^53 trap is the #1 silent corruption.** JSON has one number type; JS `Number` is `float64` with exact integers only up to `2^53-1` (9007199254740991). Any `int64` above that **loses its low bits** the instant a JS/JSON parser touches it — no error.
+   - **Send large ints and exact decimals as JSON strings.** Contract: `{"id": "9223372036854775807", "ratio": "12345.6789"}`. Parse to `BigInt`/`int64`/`Decimal` explicitly on each side.
+   - Guard in JS: `Number.isSafeInteger(x)` before trusting any integer; use `BigInt` + a string-aware parser (`json-bigint`) when you can't change the wire format.
+   - DB: `DOUBLE`/`FLOAT` columns are binary — IDs and exact decimals go in `NUMERIC(p,s)`/`DECIMAL`/`BIGINT`. Read them through a driver path that returns `Decimal`/string, not float (many drivers default to float for `NUMERIC` — configure it off).
+   - Language interop (protobuf/Thrift/FFI): `int64`→language `long`/`BigInt`, never `double`; protobuf JSON mapping *already* encodes `int64` as string — keep it.
+7. **Units: make the unit part of the type or the name — no bare numbers cross a function boundary.**
+   - Suffix every quantity: `timeout_ms`, `dist_m`, `angle_rad`, `size_bytes`, `temp_c`. A parameter named `timeout` is a bug waiting to happen.
+   - Convert through one explicit factor table; round **once, at the conversion**, to the target's precision — don't let conversions compound. Conventions that bite:
+     | Domain | Trap | Rule |
+     |---|---|---|
+     | Data size | KB=1000 vs KiB=1024 | use IEC (`KiB/MiB`) for binary; label which |
+     | Angle | trig functions take **radians** | convert `deg * π/180` at the edge |
+     | Temperature | C/F is **affine** (offset), not a ratio | `F = C*9/5 + 32`, never `*9/5` alone |
+     | Time | ms vs s vs ns | store epoch as int ns/ms; never float seconds |
+     | Imperial | 1 mi = 1609.344 m (exact) | keep full-precision factors, round at end |
+   - Dimensional consistency: only add/subtract same-unit quantities; multiply/divide changes the dimension (m/s · s = m). If a units library exists (`pint`, `uom`, `js-quantities`), use it; otherwise centralize factors in one module and unit-test round-trips.
+## Common Errors
+- **`if x == 0.1 + 0.2`** — false; `0.1+0.2 == 0.30000000000000004`. Use a tolerance compare (step 2).
+- **`abs(a-b) < 1e-9` as a universal epsilon** — passes for tiny numbers, fails for `1e12`. Scale tolerance relatively (step 2).
+- **`sum(sq)/n - mean**2` for variance** — catastrophic cancellation gives negative/NaN variance. Use Welford (step 3).
+- **`json.parse` of `{"id": 9007199254740993}` in JS** — silently becomes `...992`. Send IDs as strings; `Number.isSafeInteger` guards.
+- **Storing an int64 ID in `FLOAT`/`double`** — loses the low bits above 2^53 on round-trip. Use `BIGINT`/`NUMERIC` and an integer/string driver path (step 6).
+- **`(int) (a * b)` with two int32s** — overflows before the cast even runs. Widen to int64 before multiplying (step 5).
+- **`while (n != target)` on a float loop counter** — may never hit `target` exactly; loop forever. Iterate with an integer index, compute the float.
+- **`nan == nan` to detect NaN** — always false. Use `isnan`/`isFinite`; sort/min/max with NaN present is also undefined.
+- **`Math.sqrt(neg)` / `acos(1.0000001)`** — returns `NaN` from rounding overshoot. Clamp domain before the call (step 4).
+- **Passing `deg` to `Math.sin`** — silently wrong, no error. Sin takes radians; convert at the boundary.
+- **`°C → °F` as a pure scale (`*9/5`)** — drops the `+32` offset; temperature conversions are affine, not linear.
+- **Mixing 1000- and 1024-based sizes** — "5 GB" disk vs "5 GiB" RAM differ by ~7%. Label and use IEC binary units.
+- **Casting a `length`/`count`/`id` to a narrower int** — truncates above the bound with no error. Range-check or keep it wide.
+## Verify
+1. **Float equality:** every float comparison in the diff uses a tolerance helper (or compares a quantity that is provably integer/decimal). `grep -nE '==|!=' ` over float paths returns no bare float `==`.
+2. **Accumulation:** sum a 1e6-element array forwards vs reversed vs Kahan — Kahan matches a higher-precision (`Decimal`/`float128`) reference within tolerance; naive may not. Variance of near-equal large values is ≥ 0 and finite (Welford), not NaN.
+3. **Special values:** feed `0`, `-0.0`, `NaN`, `+Inf`, `-Inf`, and a near-zero divisor through each public function — none crash silently; divide-by-zero is rejected or returns a documented sentinel; outputs pass an `isfinite` assertion.
+4. **Integer bounds:** test at `MAX`, `MAX-1`, `MIN`, `0`, `MIN/-1` for every fixed-width arithmetic op — overflow is detected/saturated/intentionally-wrapped per the chosen semantics, never an undocumented wrap. Narrowing casts reject or clamp out-of-range input.
+5. **Boundary round-trip:** serialize the value `9223372036854775807` (`int64` max) and a `12345.6789` decimal to JSON, parse on the other side (especially JS) → byte-identical value restored. `Number.isSafeInteger` is checked on any JS integer path.
+6. **DB round-trip:** write `NUMERIC(38,9)` max-precision values and an int64-max ID, read back → equal as `Decimal`/integer/string (not float-coerced). No ID or exact-decimal column is `FLOAT`/`DOUBLE`.
+7. **Units:** round-trip every conversion (`m→ft→m`, `C→F→C`, `KiB→bytes→KiB`) returns the original within rounding tolerance; trig inputs are radians; affine conversions keep their offset; mismatched-unit add/subtract is impossible (typed) or covered by a failing-on-mix test.
+8. **Property tests:** generators include extremes (`±MAX`, `±0.0`, `NaN`, `Inf`, subnormals, near-epsilon pairs, overflow boundaries) — not just typical mid-range values.
+Done = no bare float `==`, no ID/exact-decimal in binary float, every cross-boundary int64/decimal survives a JSON/JS/DB round-trip bit-for-bit, all fixed-width integer ops have defined overflow behavior, divide-by-zero and NaN/Inf are guarded at the boundary, and every unit-bearing quantity is named/typed with its unit and round-trips through conversion within tolerance.

package/skills/optimize-llm-cost-latency/SKILL.md ADDED Viewed

@@ -0,0 +1,103 @@
+---
+name: optimize-llm-cost-latency
+description: Cuts LLM token cost and tail latency via context trimming, provider prompt caching on stable prefixes, model tiering/routing, semantic answer caching, batch APIs, and streaming, proving a measured before/after on cost-per-request and p50/p95 at equal output quality.
+when_to_use: An LLM feature is too slow or too expensive, or usage is scaling and the bill/latency matters. Distinct from prompt-engineering (output quality), rag-pipeline (retrieval quality), and harden-llm-app-reliability (timeouts/retries/fallback).
+---
+## When to Use
+Reach for this skill when the problem is **spend or speed**, not what the model says:
+- "Our LLM bill is $X/month and climbing — cut it"
+- "This call takes 8s; make it feel fast"
+- "Token usage per request is huge — trim it"
+- "We send the same 6k-token system prompt on every call"
+- "We re-answer near-identical questions all day"
+- "We have a nightly batch of 50k classifications — too slow/expensive"
+NOT this skill:
+- Making the *output* better/more correct/structured → prompt-engineering
+- Improving *what gets retrieved* into context (chunking/embeddings/reranking) → rag-pipeline
+- Timeouts, retries, fallback models, circuit breakers when a call fails → harden-llm-app-reliability
+- Blocking malicious instructions in inputs/tools → defend-llm-prompt-injection
+- Generic HTTP/data response caching unrelated to LLM token economics → caching-strategy
+- Cutting compute/storage/egress spend on the infra bill → cloud-cost-optimize
+## Steps
+1. **Measure first — no optimization without a baseline.** Log per request: `input_tokens`, `output_tokens`, model id, end-to-end latency, and time-to-first-token (TTFT). Get exact counts from the provider's usage object or a count-tokens endpoint — never estimate from `len(text)/4` for billing decisions. Compute `$/req` from current per-token prices and aggregate **p50/p95** (means hide the tail that users actually feel). You cannot claim a win without these numbers before and after.
+   ```python
+   # cost per request from the provider usage object (Anthropic-style)
+   u = resp.usage  # input_tokens / output_tokens / cache_creation_input_tokens / cache_read_input_tokens
+   cost = (u.input_tokens*IN + u.output_tokens*OUT
+           + u.cache_creation_input_tokens*CACHE_WRITE   # ~1.25x input
+           + u.cache_read_input_tokens*CACHE_READ) / 1e6 # ~0.1x input
+   ```
+2. **Apply levers in ROI order — cheapest, highest-impact first.** Do not jump to a fancy router before you've capped output and cached the prefix.
+   | Lever | Typical win | Effort | Use when |
+   |---|---|---|---|
+   | Cap `max_tokens` + stop sequences | 10–40% output cost | trivial | output runs longer than needed |
+   | Context diet (trim history, drop dead few-shot) | 20–60% input cost | low | long/growing prompts |
+   | **Prompt caching** (cache stable prefix) | **up to 90% on the cached portion**, lower TTFT | low | long fixed system prompt / RAG docs reused across calls |
+   | Streaming | TTFT 5–10x better (perceived) | low | user-facing chat/UI |
+   | Model tiering/routing | 50–95% on easy traffic | medium | mixed easy/hard requests |
+   | Semantic cache | ~100% on a cache hit | medium | repeated/near-duplicate queries |
+   | Batch API | ~50% on $ | medium | offline, non-urgent jobs |
+3. **Context diet — shrink input before you optimize how you send it.** Every input token is paid on every call. (a) **Cap history**: keep the last N turns; once over a token budget, **summarize** older turns into a compact running summary instead of carrying raw transcript. (b) **Cut dead few-shot examples**: drop each example and re-run the eval — keep only those that move accuracy. Most prompts carry 3–4 examples that earn nothing. (c) Strip boilerplate, redundant instructions, and verbose tool schemas. Re-measure tokens after each cut.
+4. **Prompt caching — usually the single biggest win.** Put the **stable, large** content first (system prompt, tool definitions, RAG documents, long instructions) and the **variable** part (the user's actual turn) last, then mark the boundary with the provider's cache control so the prefix is reused across requests. Order matters: the cache keys on an exact prefix match, so one moving token near the top busts the whole cache.
+   ```python
+   # Anthropic: cache_control breakpoint on the stable prefix; variable user msg stays uncached
+   system=[{"type":"text","text":BIG_STABLE_PROMPT,"cache_control":{"type":"ephemeral"}}]
+   # OpenAI: automatic prefix caching — just keep the prefix byte-identical and put it first
+   ```
+   Cache reads are ~10% of input price; writes ~25% more than input — so caching pays off once a prefix is reused even a handful of times within its TTL (~5 min ephemeral). Verify hits via `cache_read_input_tokens > 0`.
+5. **Model tiering + routing — send easy work to the cheap model.** Default the bulk of traffic to a small/fast model; escalate only when needed. Pick a **deterministic router** over an LLM-classifier router when you can (no extra call, no extra latency): route on cheap signals — input length, required output schema, presence of code/math, retrieval confidence, or an explicit difficulty flag. Use a tiny classifier model only when rules can't separate easy from hard. Always escalate on low-confidence or a validation failure rather than returning a bad cheap answer.
+   ```python
+   def route(req):
+       if req.tokens < 800 and not req.needs_reasoning: return SMALL   # ~1/10 the cost
+       if req.needs_long_reasoning or req.high_stakes:  return LARGE
+       return MID
+   # escalate: if small-model output fails schema/confidence check -> retry once on LARGE
+   ```
+6. **Semantic cache for repeated/near-duplicate queries.** Embed the normalized query, look it up in a vector store; on cosine similarity ≥ ~0.95 return the cached answer (0 LLM tokens). Set a conservative threshold (too low → wrong cached answers), scope the key by anything that changes the answer (user/tenant/locale/tool-version), and TTL it so stale facts expire. Bypass the cache for personalized or time-sensitive responses. This is *exact/near-exact answer reuse*, layered on top of prompt caching (which reuses the prefix, not the answer).
+7. **Batch the offline work.** Anything not blocking a user — nightly classification, backfills, evals, bulk summarization — goes through the provider **Batch API** (Anthropic Message Batches / OpenAI Batch) for ~50% off, accepting up to ~24h turnaround. Never batch interactive requests.
+8. **Stream to cut *perceived* latency.** For any user-facing surface, stream tokens (`stream=True` / SSE) so first text shows in a few hundred ms instead of after the full generation. This doesn't reduce cost or total wall-clock, but it collapses TTFT — often the only latency users perceive. Combine with caching: a cached prefix also lowers real TTFT.
+9. **Re-measure and prove equal quality.** Re-run the same metrics (step 1) after changes, and run your eval set to confirm output quality didn't regress (see Verify). A cost win that drops accuracy is not a win — report cost/latency **and** the quality delta together.
+## Common Errors
+- **Optimizing without a baseline.** "Feels faster/cheaper" is not a number. Capture p50/p95 and $/req before touching anything, or you can't prove (or trust) the result.
+- **Estimating tokens with `len/4`.** Fine for a rough guess, wrong for billing and for `max_tokens` budgeting. Use the provider usage object / count-tokens endpoint.
+- **Cache-busting the prefix.** Putting a timestamp, request id, randomized example order, or the user message *before* the stable block invalidates the cache every call. Stable content first, byte-identical; variable content last.
+- **Caching the wrong span.** Marking a tiny or rarely-reused prefix as cached pays the ~25% write premium for no reads. Cache large prefixes reused within the TTL; confirm with `cache_read_input_tokens`.
+- **Semantic-cache threshold too loose.** A 0.85 similarity match returns a confidently wrong answer for a different question. Start ≥0.95, log near-miss hits, and never cache personalized/time-sensitive answers.
+- **Unscoped cache key.** Caching by query text alone leaks one tenant/user/locale's answer to another. Include every dimension that changes the correct answer in the key.
+- **Router that always escalates.** A misconfigured or overcautious router sends everything to the big model — you pay more *and* added a routing hop. Verify the cheap-model hit rate; if <50% on easy traffic, the rules are wrong.
+- **Unbounded `max_tokens`.** Leaving it at the model max lets a runaway generation bill thousands of output tokens. Set it to the real ceiling and add stop sequences.
+- **Batching interactive traffic.** Batch APIs trade latency for price; a user waiting on a 24h-window response is a broken product. Batch only offline work.
+- **Streaming to a non-interactive consumer.** Streaming into code that just `.join()`s the whole thing adds complexity with zero benefit — and can hide errors that arrive mid-stream. Stream only where a human sees partial output.
+- **Trimming context until quality silently drops.** Aggressive history truncation or cutting a load-bearing few-shot example tanks accuracy. Gate every cut behind the eval (step 9).
+## Verify
+1. **Baseline captured:** before/after table exists with `input_tokens`, `output_tokens`, `$/req`, p50, p95, and TTFT — real measured numbers, not estimates.
+2. **Cost dropped:** post-change `$/req` is measurably lower (state the %), computed from the provider usage object at current prices.
+3. **Latency dropped:** p95 (and TTFT for user-facing paths) improved; report the actual ms, not "feels faster."
+4. **Cache is hot:** `cache_read_input_tokens > 0` on repeat requests, and the measured hit rate is reported. A prompt cache that never reads is dead config.
+5. **Router behaves:** the cheap model handles the majority of easy traffic (hit-rate stated), and escalation fires on low-confidence/validation-fail (one logged example each).
+6. **Semantic cache is safe:** a deliberately *different* query below threshold does **not** return a cached answer; key scoping (tenant/locale) prevents cross-leak.
+7. **Quality held:** the eval set scores within tolerance of baseline (state the delta) — the cost/latency win did not regress accuracy. If it did, the change is rejected.
+Done = a before/after table shows lower $/req and lower p50/p95 (with cache hit rate and cheap-model share stated), and the eval confirms output quality is unchanged at the new, cheaper, faster configuration.

package/skills/optimize-react-rerenders/SKILL.md ADDED Viewed

@@ -0,0 +1,124 @@
+---
+name: optimize-react-rerenders
+description: Eliminates wasted React re-renders by measuring first then fixing — profile with the React DevTools Profiler (flamegraph + "why did this render") and why-did-you-render to find the actual offenders, then apply the right fix: stable references (hoist constants, useCallback/useMemo only where a referentially-equal prop/dep actually matters), correct list keys (stable id, never index), React.memo with a custom comparator on genuinely-hot leaf components, context splitting + selector subscriptions (useSyncExternalStore / Zustand / use-context-selector) to stop whole-tree re-renders, derive-don't-store to kill redundant state, and list virtualization (TanStack Virtual) for long lists — while knowing when NOT to memo (cheap renders, unstable deps, and the React 19 Compiler which auto-memoizes and makes most manual memo dead weight).
+when_to_use: A React app re-renders too much — typing lags, a list re-renders every row on one change, the Profiler shows components rendering with unchanged props, or you're sprinkling useMemo/useCallback/React.memo and want to know what actually helps. Distinct from optimize-core-web-vitals (load/paint metrics — LCP/INP/CLS and asset/JS strategy, not render-count) and manage-client-server-state (data-fetching/caching architecture with TanStack Query; this skill fixes the render churn that fetched data triggers).
+---
+## When to Use
+Reach for this skill when the problem is **too many React renders / render churn**, not load time, not data fetching:
+- "Typing in this input lags / the whole form re-renders on every keystroke"
+- "Changing one row re-renders the entire list of 500 items"
+- "The Profiler shows components re-rendering even though their props didn't change"
+- "I added `useMemo`/`useCallback`/`React.memo` everywhere and it's not faster (or slower)"
+- "A context update re-renders half the tree"
+- "Should I memoize this?" / "Is this `useMemo` worth it?" / "We're on React 19 — do I still need this?"
+- "Long list scrolls slowly / mounts thousands of DOM nodes"
+NOT this skill:
+- Slow initial load, late paint, layout shift, or a red Lighthouse score (LCP/INP/CLS, image/font/JS-bundle strategy) → optimize-core-web-vitals (it owns paint/load metrics; this skill owns render *count*. Note: cutting re-renders during interaction also improves INP, but go there for the metric-driven workflow)
+- Data fetching, cache invalidation, optimistic updates, SSR hydration, or picking a store (TanStack Query / Zustand / Redux) → manage-client-server-state (it architects *where state lives and how it's fetched*; this skill stops the renders that state changes cause)
+- Building a new component's structure/props/a11y from scratch → build-react-component
+- A big interactive grid feature (sorting/filtering/column resize/selection) as a unit → build-data-table (it builds the table; this skill makes its rows stop re-rendering)
+- Backend/server query latency or a CPU profile of non-React code → performance-profiling
+- Using the browser/Chrome DevTools to debug a runtime bug (state, network, errors) generally → debug-frontend-browser (this skill is the render-perf specialization of profiling)
+## Steps
+1. **Measure before you touch anything — never memoize on a hunch.** Manual memoization is a tradeoff (extra comparisons + cache memory); applied blindly it often makes things *slower* and always makes them harder to read. Get evidence first:
+   | Tool | What it tells you | How |
+   |---|---|---|
+   | **React DevTools Profiler** | which components rendered, how long, **why** | record an interaction → flamegraph + ranked chart |
+   | **"Highlight updates when components render"** (DevTools → ⚙ → Components) | visual flash on every render — instant "this re-renders on every keystroke" signal | toggle on, interact |
+   | **`why-did-you-render`** | logs the *exact prop/state/hook that changed* (and whether it was a deep-equal-but-referentially-different value) | dev-only, see step 2 |
+   | **React 19 `<Profiler onRender>`** / `performance.measure` | programmatic render timings in tests/CI | wrap a subtree |
+   In the Profiler, enable **"Record why each component rendered"** (⚙ → Profiler). Re-renders show a reason: *props changed*, *hooks changed*, *parent rendered*, *context changed*. That reason picks the fix below — don't guess.
+2. **Wire up `why-did-you-render` to catch referential-equality bugs (dev only).** It surfaces the classic "props are deep-equal but a new object/array/function identity every render" case that React.memo can't catch.
+   ```js
+   // wdyr.js — import FIRST, before React renders anything
+   import React from 'react';
+   if (process.env.NODE_ENV === 'development') {
+     const wdyr = require('@welldone-software/why-did-you-render');
+     wdyr(React, { trackAllPureComponents: true, collapseGroups: true });
+   }
+   ```
+   A log like *"props.style changed: ({}) → ({}) (different objects that are equal by value)"* means: hoist the literal or memoize the reference — not wrap the child in `memo`.
+3. **If you're on React 19, turn on the React Compiler FIRST — it auto-memoizes and makes most manual memo dead weight.** The compiler (`babel-plugin-react-compiler`, also in the Next.js / Vite plugin) memoizes components and hook values automatically at build time, so `useMemo`/`useCallback`/`React.memo` become largely redundant. Before hand-tuning:
+   - Install and enable it; run `npx react-compiler-healthcheck` to see how many components are compatible (it skips components that break the Rules of React — those are your real bugs to fix).
+   - **Don't mix strategies blindly:** keep code Rules-of-React-clean (no mutation of props/state, no conditional hooks) so the compiler can optimize. New manual `useMemo`/`useCallback` you add on top is usually noise. Lean on `<StrictMode>` to surface impurity.
+   - The compiler does **not** fix algorithmic problems (huge lists, O(n²) renders) — you still need virtualization (step 9) and correct keys (step 7). Treat the rest of these steps as either pre-19 work or the things the compiler can't do.
+4. **Kill unstable references at the source — this is the #1 cause of "memo doesn't work".** `React.memo` and dep arrays compare by reference (`Object.is`). A new `{}`, `[]`, or arrow function created in render is a new identity every time, so it busts every downstream memo and effect. Fix the *source*, don't paper over it:
+   | Anti-pattern (new identity each render) | Fix |
+   |---|---|
+   | `<Child style={{ margin: 8 }} />` | hoist the object to a module-level `const` (it never changes) |
+   | `<Child onClick={() => doX(id)} />` passed to a memoized child | `useCallback(() => doX(id), [id])` |
+   | `const opts = { a, b }` then used in a dep array | `useMemo(() => ({ a, b }), [a, b])` |
+   | `data.filter(...)` computed inline each render into a memoized child | `useMemo(() => data.filter(...), [data])` |
+   | default prop `items = []` (new array each call) | hoist `const EMPTY = []` and default to it |
+   Static values (handlers with no closure deps, constant config) belong **outside the component** entirely — zero runtime cost.
+5. **Use `useCallback`/`useMemo` only where a referentially-equal value actually changes behavior — otherwise skip it.** They are not free: each stores a closure + dep array and runs an equality check every render. They earn their keep in exactly these cases — and nowhere else:
+   - The memoized value/callback is **a dependency of another hook** (`useEffect`, `useMemo`) where a changing identity would refire the effect.
+   - It's **passed as a prop to a `React.memo`'d child** (otherwise the child's memo is pointless).
+   - `useMemo` wraps a **genuinely expensive computation** (sort/filter of thousands, parse, heavy derive) — measure; "expensive" is rarely a `.map` over 20 items.
+   **Skip them when** the consumer isn't memoized, the deps change every render anyway (then the cache never hits — pure overhead), or the body is cheap. A `useCallback` whose result flows only into a non-memoized DOM `<button onClick>` does nothing useful.
+6. **`React.memo` only the genuinely-hot leaf, and give it the right comparator.** `memo` skips a re-render when props are shallow-equal. Apply it to a component that (a) renders often due to *parent* re-renders, (b) is expensive or numerous (list rows), and (c) gets **stable props** (you did step 4). Without stable props it's worse than nothing.
+   - Default shallow compare is correct most of the time. Provide a custom `areEqual(prev, next)` only for a known-shape prop where shallow misses (e.g. compare `prev.item.id === next.item.id && prev.selected === next.selected`) — and keep it cheaper than the render it saves.
+   - `memo` compares **props only** — it does **not** stop re-renders from internal `useState`/`useContext` changes. If the offender is context (step 8) or state, `memo` won't help.
+   ```jsx
+   const Row = memo(function Row({ item, onSelect }) { /* ... */ },
+     (a, b) => a.item.id === b.item.id && a.item.label === b.item.label && a.onSelect === b.onSelect);
+   ```
+7. **Fix list keys — wrong keys force remounts and break memoized rows.** Use a **stable, unique id** from the data (`item.id`), never the array index for any list that can reorder, insert, or delete. Index keys make React reuse the wrong DOM/state on reorder (lost input focus, wrong row highlighted) and defeat per-row `memo`. Don't use `Math.random()`/`uuid()` in render either — a new key every render remounts the row each time. Stable identity → React diffs in place → memoized rows actually skip.
+8. **Stop context from re-rendering the whole subtree — split it or use a selector.** Every consumer of a context re-renders whenever **any** field of the context value changes, even fields it doesn't read. Three escalating fixes:
+   | Technique | When | How |
+   |---|---|---|
+   | **Memoize the provider value** | value is `{}`/`[]` literal inline | `value={useMemo(() => ({ user, setUser }), [user])}` — without this *every* consumer re-renders every parent render |
+   | **Split contexts by change frequency** | one context mixes hot + cold data (e.g. `theme` + live `cursorPosition`) | separate providers; a component subscribes only to what it uses → high-frequency updates don't touch theme consumers |
+   | **Selector subscription** | consumers read different slices of a big store | `useSyncExternalStore` with a selector, `use-context-selector`, or a store lib (**Zustand**: `useStore(s => s.field)`, **Redux**: `useSelector`) — re-render only when the *selected* slice changes |
+   Splitting state-vs-dispatch contexts is a cheap classic win: a component that only dispatches never re-renders on state changes.
+9. **Virtualize long lists instead of memoizing 10,000 rows.** No amount of `memo` saves you from mounting thousands of DOM nodes. Render only the visible window (+overscan) with **TanStack Virtual** (`@tanstack/react-virtual`, headless, framework-agnostic) or `react-virtuoso`. It computes which rows intersect the viewport and absolutely-positions them; DOM stays ~constant regardless of dataset size. Combine with stable keys (step 7) and a memoized row. This is the fix when the *count* is the problem, not per-row work.
+10. **Derive, don't store — redundant state is redundant renders.** Anything computable from props/existing state during render should be **computed in render** (optionally `useMemo`'d), not held in its own `useState` synced via `useEffect`. The `useState`+`useEffect`-to-sync pattern adds an extra render every change and drifts out of sync. Likewise: lift state **down** (push it into the smallest component that needs it so updates don't re-render siblings), and split a god-component so a chatty piece of state isn't wired into a large subtree. Fewer state cells in fewer places = fewer renders.
+11. **Don't over-memoize — premature memoization has a real cost.** Each `memo`/`useMemo`/`useCallback` adds an equality check + retained closure + cognitive load, and a wrong dep array becomes a stale-closure bug. Rules of thumb: leave cheap components un-memoized; never wrap a component whose props are unstable (fix the props instead); never memoize purely to "be safe." On React 19, prefer the compiler over manual memo. Memoize when the Profiler shows a *measured* hot path — not before. Remove memos that the Profiler shows aren't being hit.
+## Common Errors
+- **`React.memo` with unstable props.** Memo'd child still re-renders because a parent passes a fresh `{}`/`() => {}` each render. Fix: stabilize the prop at the source (step 4) — `memo` without stable props is pure overhead.
+- **`useCallback`/`useMemo` feeding a non-memoized consumer.** A stable callback handed to a plain `<button>` or non-`memo` child changes nothing but adds overhead. Fix: only memoize values that cross into a `memo`'d child or another hook's deps.
+- **Dep array that changes every render.** The memo never caches (cache miss every time) — all cost, no benefit. Fix: stabilize the deps too, or drop the memo.
+- **Index as list key.** Reorders/inserts reuse the wrong DOM/state and break per-row memo; lost focus, wrong highlight. Fix: stable `item.id` key.
+- **`Math.random()`/`uuid()` key in render.** Remounts every row every render — the opposite of memoization. Fix: derive a stable id once.
+- **Inline object/array provider value.** `<Ctx.Provider value={{a,b}}>` re-renders every consumer on every parent render. Fix: `useMemo` the value, or split contexts.
+- **One fat context for hot + cold data.** A 60fps field re-renders theme/auth consumers. Fix: split by update frequency; use a selector subscription.
+- **`useState` mirrored from props via `useEffect`.** Extra render + drift. Fix: derive in render (`useMemo` if pricey).
+- **Memoizing everything by default.** Slower and unreadable; stale-closure bugs from wrong deps. Fix: memoize measured hot paths only; on React 19 let the compiler do it.
+- **Expecting `memo` to stop context/state re-renders.** `memo` compares props only. Fix: address the actual reason the Profiler reports (context → split/selector; state → derive/lift).
+## Verify
+1. **Profiler shows the offender gone:** record the same interaction before/after — the component that flashed/rendered "props changed (but equal)" or "parent rendered" no longer appears in the commit (or its render time drops). Keep the before flamegraph as proof.
+2. **`why-did-you-render` is silent on the fixed path:** no "different objects that are equal by value" logs for the props you stabilized.
+3. **Typing/interaction is smooth:** the input/list that lagged updates per keystroke without re-rendering unrelated siblings; "Highlight updates" flashes only the changed node.
+4. **One-row change → one row renders:** mutating a single list item re-renders that row only, not the whole list (visible in the Profiler ranked chart).
+5. **Context change is scoped:** updating a hot context field re-renders only its real consumers; theme/auth consumers stay dark.
+6. **Long list DOM is bounded:** the virtualized list mounts ~viewport+overscan rows, and node count stays roughly constant as the dataset grows from 100 → 10,000.
+7. **No memo is dead weight:** every remaining `memo`/`useMemo`/`useCallback` corresponds to a Profiler-confirmed hot path or a real dep/`memo`-prop boundary; the rest were removed. On React 19, `react-compiler-healthcheck` passes and manual memo is minimal.
+Done = the measured re-render(s) the Profiler flagged are eliminated by the matching fix (stable refs, correct keys, context split/selector, derive-don't-store, or virtualization), every remaining manual memo is justified by evidence (or replaced by the React 19 compiler), and the before/after Profiler traces prove the churn is gone — not added overhead.

package/skills/orchestrate-agent-workflow/SKILL.md ADDED Viewed

@@ -0,0 +1,100 @@
+---
+name: orchestrate-agent-workflow
+description: Designs reliable multi-step LLM agent loops — tool-call orchestration, state/memory between steps, explicit stop conditions, per-step verification, retries/replanning, subagent decomposition, and budget/approval gates — so an agent finishes long tasks without drifting or looping forever.
+when_to_use: Building an agent that plans and calls tools across multiple steps, not a single prompt-and-response. Distinct from agent-tool-mcp-builder (designing the individual tools/MCP server), prompt-engineering (single-prompt/structured-output design), and harden-llm-app-reliability (transport-level retry/timeout of one model call).
+---
+## When to Use
+Reach for this skill when the work spans **multiple model→tool→observe turns** under one goal:
+- "Build an agent that researches, then drafts, then files a PR / ticket"
+- "My agent loops forever / repeats the same tool call / never decides it's done"
+- "It drifts off-task halfway through and starts solving a different problem"
+- "Split this big task across subagents and stitch the results back together"
+- "Add a cost/step cap and a human-approval gate before it deletes / deploys / pays"
+NOT this skill:
+- Designing the tool schemas / error shapes / MCP server the agent calls → agent-tool-mcp-builder
+- Writing or hardening a single prompt or its JSON/function-call contract → prompt-engineering
+- Making *one* model call survive timeouts/429s/5xx (retry, backoff, circuit breaker) → harden-llm-app-reliability
+- Cutting token cost / latency of the model calls themselves (caching, model routing) → optimize-llm-cost-latency
+- Treating tool output / web content as untrusted input → defend-llm-prompt-injection
+## Steps
+1. **Pick the simplest topology that works — escalate only when forced.** Default down this ladder, not up.
+   | Pattern | Use when | Control flow | Failure mode to fear |
+   |---|---|---|---|
+   | **Single-agent tool loop** | One goal, <~15 steps, work fits one context | Model decides each next tool | Infinite loop / drift |
+   | **Code-orchestrated workflow (DAG)** | Steps + order are known *ahead of time* | Your code sequences; model fills each node | Rigid; can't adapt mid-run |
+   | **Multi-agent (planner + subagents)** | One context literally can't hold the work, or independent parallel branches | Orchestrator spawns sub-loops, merges summaries | Coordination cost, lost context at handoff |
+   Start with the loop. Move to a coded DAG the moment the step graph is fixed (cheaper, deterministic, testable). Reach for multi-agent **only** when a single context window can't hold the task — never for "it feels big."
+2. **Make the loop terminate — two independent kills.** Every loop needs BOTH a semantic stop and a hard cap. One alone is insufficient: the model's "I'm done" can be wrong, and a raw cap truncates good work.
+   ```python
+   MAX_STEPS, MAX_USD, t0 = 12, 0.50, time.time()
+   for step in range(MAX_STEPS):                 # hard cap (anti-infinite-loop)
+       msg = model(messages, tools)
+       if msg.stop_reason == "end_turn":         # semantic stop: model emits final, no tool call
+           return msg.text
+       result = run_tool(msg.tool_call)          # exactly one tool per turn keeps it auditable
+       messages += [msg, tool_result(result)]
+       if spent_usd() > MAX_USD or time.time()-t0 > 120:   # budget / wall-clock guard
+           return escalate("budget exceeded", messages)
+   return escalate("max steps reached", messages)   # NEVER silently return partial as success
+   ```
+   Also detect a **no-progress loop**: hash `(tool_name, args)`; if the same call repeats ≥2× with no state change, break and replan — don't wait for MAX_STEPS.
+3. **Carry state in an explicit scratchpad — not the raw message list.** The growing transcript is not memory; it's noise that costs tokens and dilutes the goal. Keep a small structured state object the loop reads/writes every turn:
+   ```json
+   {"goal": "<one immutable sentence>", "facts": [...], "open_questions": [...],
+    "done": ["fetched X", "parsed Y"], "next": "draft summary", "artifacts": {"pr_url": null}}
+   ```
+   Re-inject `goal` + `next` into the prompt **every step** (anti-drift re-grounding). When the transcript grows large, summarize old turns into `facts`/`done` and drop the raw tool dumps — keep state, discard chatter.
+4. **Verify each step's output before continuing — gate, don't trust.** After every tool result, check it actually advanced the goal *before* the model plans the next move. Cheapest sufficient check wins:
+   - Schema/shape check on tool output (parses? non-empty? expected fields?) — pure code, no model.
+   - Goal-relevance check: "does this result move us toward `goal`, or sideways?" If sideways → discard and re-ground, don't append it as progress.
+   - For generated artifacts (code, SQL, configs): run the real check (compile/lint/`pytest`/dry-run), not a model's self-assessment. A failing check feeds back into the loop as a tool result.
+5. **Retry then replan on tool failure — distinguish transient from logical.** Tool error ≠ retry-forever.
+   - **Transient** (timeout, 429, 5xx): bounded retry with backoff — but that's transport reliability; delegate it to harden-llm-app-reliability, don't re-implement here.
+   - **Logical** (bad args, "not found", validation reject): do **not** retry the identical call. Feed the error text back as an observation and let the model *replan* (fix args, pick another tool, or revise the plan). Cap replans (≤2) per subgoal; exceeding it → escalate, don't thrash.
+6. **Decompose to subagents only when one context can't hold the work — and return summaries, not dumps.** A subagent gets a *narrow* objective, its own fresh context and tool subset, and returns a **compact result** (the answer + key facts + artifact refs), never its raw transcript. The orchestrator merges summaries into parent state. This is the whole point: parallel/large work happens in child contexts so the parent stays lean. If you find yourself piping a subagent's full message history back up, you've defeated it.
+7. **Gate irreversible actions behind explicit approval; meter spend.** Classify each tool: read-only / reversible-write / **irreversible** (delete, deploy, send money, email customers, `DROP`). Irreversible tools require a human-approval checkpoint (or a strict policy allowlist) *before* execution — the loop pauses and surfaces the proposed action + args. Track cumulative tokens/$ per run (step 2's `MAX_USD`); a runaway agent is a billing incident. Emit a structured per-step trace (step #, tool, args, result-status, cost) so a stuck run is debuggable after the fact → build-audit-logging.
+8. **Prove it on a multi-step harness with machine-checkable success criteria — built before you ship.** A loop that works on one happy-path demo proves nothing. Write the harness first: define each scenario's *success oracle* in code (artifact exists / `pytest` green / expected end-state reached / final answer matches a regex), not a model's "looks good." Cover ≥3 scenarios — one happy path, one designed-to-fail (asserts escalation, never an infinite loop), one with a distractor in tool output (asserts the goal held). Run it in CI on every change to the loop, prompt, or tool set; a passing harness is the only evidence the orchestration is reliable. Spec the checks in the Verify section below.
+## Common Errors
+- **Stop condition = only "model says done".** The model declares victory early or never. Always pair the semantic stop with a hard `MAX_STEPS` cap.
+- **No max-step / max-cost cap.** One bad plan → infinite tool calls → runaway bill. Both caps are mandatory, and hitting them must escalate, not silently return a partial answer as if it succeeded.
+- **Treating the message list as memory.** Relying on the growing transcript means the goal drowns in tool noise and context blows up. Keep an explicit scratchpad; re-inject the goal every step.
+- **Never re-grounding → drift.** Without re-stating `goal` each turn, a long run quietly migrates to a neighboring task. Inject goal + next-action every step and discard off-goal results.
+- **Retrying a logical error unchanged.** Re-sending the same bad args on a validation failure just burns steps. Transient → backoff retry; logical → replan with the error fed back as an observation.
+- **Subagents returning raw transcripts.** Dumping a child's full history into the parent defeats decomposition and re-bloats the context you spun it off to avoid. Children return summaries + artifact refs only.
+- **Multi-agent when a loop would do.** Coordination overhead, lost context at handoffs, and harder debugging — for work one context could have held. Escalate topology only when forced (step 1).
+- **More than one tool call per turn without isolation.** Parallel tool calls in one turn make the trace ambiguous and ordering bugs invisible. Keep one tool per turn unless the calls are provably independent.
+- **No verification between steps.** Appending an empty/erroring/irrelevant tool result as "progress" compounds garbage across the run. Gate every result before it enters state.
+- **Irreversible action with no approval gate.** The agent deletes/deploys/pays on a hallucinated plan. Classify tools; pause for human approval on irreversible ones.
+- **No per-step trace.** When a run gets stuck you have nothing to debug. Emit structured step records (tool, args, status, cost) from day one.
+## Verify
+1. **Terminates on success:** a happy-path multi-step task reaches the semantic stop and returns the artifact *before* `MAX_STEPS` — caps are a safety net, not the normal exit.
+2. **Terminates on failure:** force an unsolvable goal → the run hits `MAX_STEPS`/`MAX_USD` and **escalates** (returns a clear "could not complete" + trace), never an infinite loop and never a partial dressed up as success.
+3. **No-progress break:** inject a tool that returns the same value forever → the loop detects the repeated `(tool,args)` and breaks/replans within ≤2 repeats, not at MAX_STEPS.
+4. **Anti-drift:** run a long task with a distractor in tool output → final answer still serves the original `goal`; the off-goal result was discarded, not built upon.
+5. **Replan on logical error:** make a tool reject the first args → the agent fixes args / switches tool and proceeds, and does **not** re-send the identical failing call.
+6. **Budget guard:** set `MAX_USD` low → the run halts and escalates when spend exceeds it; cumulative cost in the trace matches actual usage.
+7. **Approval gate:** an irreversible tool is never executed without the approval checkpoint firing first (assert it pauses and surfaces args).
+8. **Subagent isolation:** parent context size after a subagent task stays bounded — the child returned a summary, not its transcript (check token count, not just correctness).
+9. **Harness:** ≥3 multi-step scenarios with machine-checkable success criteria (artifact exists / check passes / expected state reached) run green in CI, including at least one designed-to-fail case from check 2.
+Done = on the multi-step harness the agent finishes happy-path tasks via the semantic stop, every failure/runaway path escalates within the step+budget caps (no infinite loops, no partial-as-success), each step is verified and re-grounded on the goal, irreversible actions are gated, and subagents return summaries — all evidenced by the per-step trace.