sanook-cli 0.4.0 → 0.5.1
- package/.env.example +19 -0
- package/CHANGELOG.md +173 -0
- package/README.md +153 -20
- package/README.th.md +136 -0
- package/dist/agentContext.js +4 -0
- package/dist/approval.js +6 -0
- package/dist/bin.js +405 -57
- package/dist/brain.js +92 -59
- package/dist/brand.js +47 -0
- package/dist/checkpoint.js +37 -0
- package/dist/commands.js +86 -6
- package/dist/compaction.js +76 -5
- package/dist/config.js +100 -12
- package/dist/cost.js +60 -3
- package/dist/doctor.js +92 -0
- package/dist/gateway/auth.js +2 -2
- package/dist/gateway/ledger.js +2 -2
- package/dist/gateway/scheduler.js +1 -0
- package/dist/gateway/serve.js +6 -4
- package/dist/gateway/server.js +10 -2
- package/dist/git.js +11 -2
- package/dist/hooks.js +43 -17
- package/dist/knowledge.js +48 -49
- package/dist/loop.js +182 -66
- package/dist/lsp/client.js +173 -0
- package/dist/lsp/framing.js +56 -0
- package/dist/lsp/index.js +138 -0
- package/dist/lsp/servers.js +82 -0
- package/dist/mcp-server.js +244 -0
- package/dist/mcp.js +184 -29
- package/dist/memory-store.js +559 -0
- package/dist/memory.js +143 -29
- package/dist/orchestrate.js +150 -0
- package/dist/providers/codex.js +21 -7
- package/dist/providers/keys.js +3 -2
- package/dist/providers/models.js +22 -6
- package/dist/providers/registry.js +155 -1
- package/dist/repomap.js +93 -0
- package/dist/search/chunk.js +158 -0
- package/dist/search/embed-store.js +187 -0
- package/dist/search/engine.js +203 -0
- package/dist/search/fuse.js +35 -0
- package/dist/search/index-core.js +187 -0
- package/dist/search/indexer.js +241 -0
- package/dist/search/store.js +77 -0
- package/dist/session.js +42 -8
- package/dist/skill-install.js +10 -10
- package/dist/skills.js +12 -9
- package/dist/summarize.js +31 -0
- package/dist/tools/bash.js +21 -2
- package/dist/tools/diagnostics.js +41 -0
- package/dist/tools/edit.js +29 -7
- package/dist/tools/index.js +8 -1
- package/dist/tools/list.js +7 -2
- package/dist/tools/permission.js +90 -9
- package/dist/tools/read.js +23 -4
- package/dist/tools/remember.js +1 -1
- package/dist/tools/sandbox.js +61 -0
- package/dist/tools/search.js +105 -4
- package/dist/tools/task.js +195 -29
- package/dist/tools/timeout.js +35 -0
- package/dist/tools/util.js +10 -0
- package/dist/tools/write.js +6 -4
- package/dist/trust.js +89 -0
- package/dist/ui/app.js +228 -31
- package/dist/ui/banner.js +4 -9
- package/dist/ui/brain-wizard.js +2 -2
- package/dist/ui/history.js +30 -0
- package/dist/ui/mentions.js +44 -0
- package/dist/ui/render.js +55 -15
- package/dist/ui/setup.js +97 -12
- package/dist/ui/useEditor.js +83 -0
- package/dist/update.js +114 -0
- package/dist/worktree.js +173 -0
- package/package.json +11 -5
- package/scripts/postinstall.mjs +33 -0
- package/second-brain/.agents/_Index.md +30 -0
- package/second-brain/.agents/skills/_Index.md +30 -0
- package/second-brain/.agents/workflows/_Index.md +30 -0
- package/second-brain/AGENTS.md +4 -4
- package/second-brain/Acceptance/_Index.md +30 -0
- package/second-brain/Acceptance/golden-case-template.md +39 -0
- package/second-brain/Areas/_Index.md +30 -0
- package/second-brain/Bugs/System-OS/_Index.md +30 -0
- package/second-brain/Bugs/_Index.md +30 -0
- package/second-brain/CLAUDE.md +4 -1
- package/second-brain/Checklists/_Index.md +30 -0
- package/second-brain/Checklists/preflight-postflight-template.md +29 -0
- package/second-brain/Distillations/_Index.md +30 -0
- package/second-brain/Entities/_Index.md +30 -0
- package/second-brain/Entities/entity-template.md +33 -0
- package/second-brain/Evals/_Index.md +30 -0
- package/second-brain/Evals/correction-pairs.md +24 -0
- package/second-brain/Evals/failure-taxonomy.md +24 -0
- package/second-brain/Evals/golden-set.md +25 -0
- package/second-brain/Evals/quality-ledger.md +23 -0
- package/second-brain/Evals/self-eval-rubric.md +23 -0
- package/second-brain/GEMINI.md +4 -4
- package/second-brain/Goals/_Index.md +30 -0
- package/second-brain/Handoffs/_Index.md +30 -0
- package/second-brain/Home.md +7 -0
- package/second-brain/Intake/Raw Sources/_Index.md +30 -0
- package/second-brain/Intake/_Index.md +30 -0
- package/second-brain/Intake/_Quarantine/_Index.md +30 -0
- package/second-brain/Learning/_Index.md +30 -0
- package/second-brain/Playbooks/_Index.md +30 -0
- package/second-brain/Playbooks/playbook-template.md +23 -0
- package/second-brain/Projects/_Index.md +30 -0
- package/second-brain/Prompts/_Index.md +30 -0
- package/second-brain/README.md +2 -1
- package/second-brain/Research/_Index.md +30 -0
- package/second-brain/Retrospectives/_Index.md +30 -0
- package/second-brain/Reviews/_Index.md +30 -0
- package/second-brain/Runbooks/_Index.md +30 -0
- package/second-brain/Runbooks/eval-loop.md +24 -0
- package/second-brain/Sessions/_Index.md +30 -0
- package/second-brain/Shared/AI-Context-Index.md +20 -0
- package/second-brain/Shared/AI-Threads/_Index.md +30 -0
- package/second-brain/Shared/Archive/_Index.md +30 -0
- package/second-brain/Shared/Assets/_Index.md +30 -0
- package/second-brain/Shared/Context-Packs/_Index.md +30 -0
- package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
- package/second-brain/Shared/Coordination/NOW.md +28 -0
- package/second-brain/Shared/Coordination/_Index.md +30 -0
- package/second-brain/Shared/Coordination/agent-registry.md +24 -0
- package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
- package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
- package/second-brain/Shared/Coordination/task-board.md +32 -0
- package/second-brain/Shared/Core-Facts/_Index.md +30 -0
- package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
- package/second-brain/Shared/Glossary/_Index.md +30 -0
- package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
- package/second-brain/Shared/Operating-State/_Index.md +30 -0
- package/second-brain/Shared/Prompting/_Index.md +30 -0
- package/second-brain/Shared/Provenance/_Index.md +30 -0
- package/second-brain/Shared/Rules/_Index.md +30 -0
- package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
- package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
- package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
- package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
- package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
- package/second-brain/Shared/Rules/rules-formatting.md +34 -0
- package/second-brain/Shared/Scripts/_Index.md +30 -0
- package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
- package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
- package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
- package/second-brain/Shared/User-Memory/_Index.md +30 -0
- package/second-brain/Shared/User-Persona/_Index.md +30 -0
- package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
- package/second-brain/Shared/Working-Memory/_Index.md +30 -0
- package/second-brain/Shared/_Index.md +30 -0
- package/second-brain/Shared/mcp-servers/_Index.md +30 -0
- package/second-brain/Skills/_Index.md +30 -0
- package/second-brain/Templates/_Index.md +30 -0
- package/second-brain/Templates/bug.md +2 -0
- package/second-brain/Templates/handoff.md +2 -0
- package/second-brain/Templates/session.md +2 -0
- package/second-brain/Tools/_Index.md +30 -0
- package/second-brain/Traces/_Index.md +30 -0
- package/second-brain/Vault Structure Map.md +33 -1
- package/second-brain/copilot/_Index.md +30 -0
- package/skills/audit-license-compliance/SKILL.md +117 -0
- package/skills/author-codemod/SKILL.md +110 -0
- package/skills/build-audit-logging/SKILL.md +112 -0
- package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
- package/skills/build-cli-tool/SKILL.md +108 -0
- package/skills/build-data-table/SKILL.md +141 -0
- package/skills/build-native-mobile-ui/SKILL.md +154 -0
- package/skills/build-offline-first-sync/SKILL.md +118 -0
- package/skills/build-realtime-channel/SKILL.md +122 -0
- package/skills/build-vector-search/SKILL.md +131 -0
- package/skills/compose-local-dev-stack/SKILL.md +149 -0
- package/skills/configure-bundler-build/SKILL.md +166 -0
- package/skills/configure-dns-tls/SKILL.md +142 -0
- package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
- package/skills/configure-security-headers-csp/SKILL.md +122 -0
- package/skills/contract-testing/SKILL.md +140 -0
- package/skills/datetime-timezone-correctness/SKILL.md +125 -0
- package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
- package/skills/debug-flaky-tests/SKILL.md +128 -0
- package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
- package/skills/deliver-webhooks/SKILL.md +116 -0
- package/skills/design-api-pagination/SKILL.md +144 -0
- package/skills/design-authorization-model/SKILL.md +119 -0
- package/skills/design-backup-dr-recovery/SKILL.md +113 -0
- package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
- package/skills/design-multi-tenancy/SKILL.md +100 -0
- package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
- package/skills/design-relational-schema/SKILL.md +129 -0
- package/skills/design-search-index-infra/SKILL.md +151 -0
- package/skills/design-state-machine/SKILL.md +108 -0
- package/skills/design-token-system/SKILL.md +109 -0
- package/skills/distributed-locks-leases/SKILL.md +120 -0
- package/skills/encrypt-sensitive-data/SKILL.md +148 -0
- package/skills/feature-flags-rollout/SKILL.md +130 -0
- package/skills/file-upload-object-storage/SKILL.md +107 -0
- package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
- package/skills/harden-llm-app-reliability/SKILL.md +126 -0
- package/skills/i18n-localization-setup/SKILL.md +113 -0
- package/skills/idempotency-keys/SKILL.md +107 -0
- package/skills/implement-push-notifications/SKILL.md +142 -0
- package/skills/ingest-webhook-secure/SKILL.md +120 -0
- package/skills/integrate-oauth-oidc/SKILL.md +126 -0
- package/skills/load-stress-test/SKILL.md +129 -0
- package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
- package/skills/model-nosql-data/SKILL.md +118 -0
- package/skills/money-decimal-arithmetic/SKILL.md +123 -0
- package/skills/monitor-ml-drift/SKILL.md +109 -0
- package/skills/numeric-precision-units/SKILL.md +144 -0
- package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
- package/skills/optimize-react-rerenders/SKILL.md +124 -0
- package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
- package/skills/payments-billing-integration/SKILL.md +114 -0
- package/skills/pin-toolchain-versions/SKILL.md +116 -0
- package/skills/plan-strangler-migration/SKILL.md +95 -0
- package/skills/property-based-testing/SKILL.md +108 -0
- package/skills/publish-package-registry/SKILL.md +130 -0
- package/skills/recover-git-state/SKILL.md +119 -0
- package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
- package/skills/resilience-timeouts-retries/SKILL.md +104 -0
- package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
- package/skills/rewrite-git-history/SKILL.md +109 -0
- package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
- package/skills/schema-evolution-compatibility/SKILL.md +121 -0
- package/skills/send-transactional-email/SKILL.md +126 -0
- package/skills/serve-deploy-ml-model/SKILL.md +107 -0
- package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
- package/skills/setup-devcontainer-env/SKILL.md +131 -0
- package/skills/setup-lint-format-precommit/SKILL.md +140 -0
- package/skills/setup-monorepo-tooling/SKILL.md +125 -0
- package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
- package/skills/structured-output-llm/SKILL.md +86 -0
- package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
- package/skills/test-data-factories/SKILL.md +158 -0
- package/skills/threat-model-stride/SKILL.md +123 -0
- package/skills/train-evaluate-ml-model/SKILL.md +109 -0
- package/skills/unicode-text-correctness/SKILL.md +109 -0
- package/skills/visual-regression-testing/SKILL.md +120 -0
|
@@ -0,0 +1,103 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: optimize-llm-cost-latency
|
|
3
|
+
description: Cuts LLM token cost and tail latency via context trimming, provider prompt caching on stable prefixes, model tiering/routing, semantic answer caching, batch APIs, and streaming, proving a measured before/after on cost-per-request and p50/p95 at equal output quality.
|
|
4
|
+
when_to_use: An LLM feature is too slow or too expensive, or usage is scaling and the bill/latency matters. Distinct from prompt-engineering (output quality), rag-pipeline (retrieval quality), and harden-llm-app-reliability (timeouts/retries/fallback).
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Reach for this skill when the problem is **spend or speed**, not what the model says:
|
|
10
|
+
|
|
11
|
+
- "Our LLM bill is $X/month and climbing — cut it"
|
|
12
|
+
- "This call takes 8s; make it feel fast"
|
|
13
|
+
- "Token usage per request is huge — trim it"
|
|
14
|
+
- "We send the same 6k-token system prompt on every call"
|
|
15
|
+
- "We re-answer near-identical questions all day"
|
|
16
|
+
- "We have a nightly batch of 50k classifications — too slow/expensive"
|
|
17
|
+
|
|
18
|
+
NOT this skill:
|
|
19
|
+
- Making the *output* better/more correct/structured → prompt-engineering
|
|
20
|
+
- Improving *what gets retrieved* into context (chunking/embeddings/reranking) → rag-pipeline
|
|
21
|
+
- Timeouts, retries, fallback models, circuit breakers when a call fails → harden-llm-app-reliability
|
|
22
|
+
- Blocking malicious instructions in inputs/tools → defend-llm-prompt-injection
|
|
23
|
+
- Generic HTTP/data response caching unrelated to LLM token economics → caching-strategy
|
|
24
|
+
- Cutting compute/storage/egress spend on the infra bill → cloud-cost-optimize
|
|
25
|
+
|
|
26
|
+
## Steps
|
|
27
|
+
|
|
28
|
+
1. **Measure first — no optimization without a baseline.** Log per request: `input_tokens`, `output_tokens`, model id, end-to-end latency, and time-to-first-token (TTFT). Get exact counts from the provider's usage object or a count-tokens endpoint — never estimate from `len(text)/4` for billing decisions. Compute `$/req` from current per-token prices and aggregate **p50/p95** (means hide the tail that users actually feel). You cannot claim a win without these numbers before and after.
|
|
29
|
+
|
|
30
|
+
```python
# Cost per request from the provider usage object (Anthropic-style).
# IN, OUT, CACHE_WRITE, CACHE_READ are your current prices in $ per million tokens.
u = resp.usage  # input_tokens / output_tokens / cache_creation_input_tokens / cache_read_input_tokens
cost = (u.input_tokens * IN + u.output_tokens * OUT
        + (u.cache_creation_input_tokens or 0) * CACHE_WRITE    # ~1.25x input; field may be None
        + (u.cache_read_input_tokens or 0) * CACHE_READ) / 1e6  # ~0.1x input; field may be None
```
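As a minimal sketch of the aggregation this step asks for, the snippet below computes p50/p95 from logged per-request latencies with the standard library only. The `percentile` helper and the sample values in `latencies_ms` are illustrative assumptions, and nearest-rank is only one of several valid percentile definitions.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical end-to-end latencies in milliseconds, one per logged request.
latencies_ms = [420, 380, 510, 450, 400, 390, 8200, 430, 470, 410]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
```

With one slow outlier in the sample, the mean lands far from the typical request while p50 and p95 separate "typical" from "tail", which is why the step asks for percentiles rather than averages.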
|
|
37
|
+
|
|
38
|
+
2. **Apply levers in ROI order — cheapest, highest-impact first.** Do not jump to a fancy router before you've capped output and cached the prefix.
|
|
39
|
+
|
|
40
|
+
| Lever | Typical win | Effort | Use when |
|---|---|---|---|
| Cap `max_tokens` + stop sequences | 10–40% output cost | trivial | output runs longer than needed |
| Context diet (trim history, drop dead few-shot) | 20–60% input cost | low | long/growing prompts |
| **Prompt caching** (cache stable prefix) | **up to 90% on the cached portion**, lower TTFT | low | long fixed system prompt / RAG docs reused across calls |
| Streaming | TTFT 5–10x better (perceived) | low | user-facing chat/UI |
| Model tiering/routing | 50–95% on easy traffic | medium | mixed easy/hard requests |
| Semantic cache | ~100% on a cache hit | medium | repeated/near-duplicate queries |
| Batch API | ~50% on $ | medium | offline, non-urgent jobs |
|
|
49
|
+
|
|
50
|
+
3. **Context diet — shrink input before you optimize how you send it.** Every input token is paid on every call. (a) **Cap history**: keep the last N turns; once over a token budget, **summarize** older turns into a compact running summary instead of carrying raw transcript. (b) **Cut dead few-shot examples**: drop each example and re-run the eval — keep only those that move accuracy. Most prompts carry 3–4 examples that earn nothing. (c) Strip boilerplate, redundant instructions, and verbose tool schemas. Re-measure tokens after each cut.
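The history cap in (a) could be sketched roughly as follows. `trim_history`, `count_tokens`, and `summarize` are hypothetical names: in practice `count_tokens` would wrap the provider's count-tokens endpoint and `summarize` would call a cheap model. The word-count stub below exists only to make the example runnable and is not a billing-grade count.

```python
def trim_history(turns, budget, count_tokens, summarize):
    """Keep the newest turns that fit in `budget` tokens; fold older turns into one summary."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()                        # restore chronological order
    older = turns[: len(turns) - len(kept)]
    if older:
        return [summarize(older)] + kept  # the summary's own tokens are not counted here
    return kept

# Stand-in callables for illustration only (assumptions, not real provider calls).
def stub_count(turn):
    return len(turn.split())

def stub_summary(older):
    return f"[summary of {len(older)} earlier turns]"

history = ["a b c", "d e", "f g h i", "j k"]
trimmed = trim_history(history, budget=6, count_tokens=stub_count, summarize=stub_summary)
```

Because the summary turn is added on top of the budget, a real implementation would probably reserve part of the budget for it and re-measure tokens afterwards, as the step advises.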
|
|
51
|
+
|
|
52
|
+
4. **Prompt caching — usually the single biggest win.** Put the **stable, large** content first (system prompt, tool definitions, RAG documents, long instructions) and the **variable** part (the user's actual turn) last, then mark the boundary with the provider's cache control so the prefix is reused across requests. Order matters: the cache keys on an exact prefix match, so one moving token near the top busts the whole cache.
|
|
53
|
+
|
|
54
|
+
```python
# Anthropic: cache_control breakpoint on the stable prefix; variable user msg stays uncached
system=[{"type":"text","text":BIG_STABLE_PROMPT,"cache_control":{"type":"ephemeral"}}]
# OpenAI: automatic prefix caching; just keep the prefix byte-identical and put it first
```
|
|
59
|
+
Cache reads cost ~10% of the input price and cache writes ~25% more than the input price, so caching pays off from the first reuse within the TTL (~5 min ephemeral): two calls cost roughly 1.25 + 0.10 = 1.35x one uncached send instead of 2x. Verify hits via `cache_read_input_tokens > 0`.
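To make the break-even concrete, here is a small worked calculation. It uses the approximate multipliers quoted above (1.25x write, 0.10x read), not actual provider prices, and assumes every reuse lands within the TTL; `relative_prefix_cost` is a hypothetical helper.

```python
def relative_prefix_cost(calls, write_mult=1.25, read_mult=0.10):
    """Cost of sending one prefix `calls` times, in units of one uncached send.

    Returns (uncached, cached): the first cached call pays the write premium,
    every later call within the TTL pays the read price.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    uncached = float(calls)
    cached = write_mult + read_mult * (calls - 1)
    return uncached, cached

for n in (1, 2, 10):
    uncached, cached = relative_prefix_cost(n)
    print(n, uncached, round(cached, 2))
```

Under these assumptions a prefix that is written once and never read costs more than not caching at all, while a single reuse already comes out ahead.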
|
|
60
|
+
|
|
61
|
+
5. **Model tiering + routing — send easy work to the cheap model.** Default the bulk of traffic to a small/fast model; escalate only when needed. Pick a **deterministic router** over an LLM-classifier router when you can (no extra call, no extra latency): route on cheap signals — input length, required output schema, presence of code/math, retrieval confidence, or an explicit difficulty flag. Use a tiny classifier model only when rules can't separate easy from hard. Always escalate on low-confidence or a validation failure rather than returning a bad cheap answer.
|
|
62
|
+
|
|
63
|
+
```python
def route(req):
    if req.tokens < 800 and not req.needs_reasoning: return SMALL  # ~1/10 the cost
    if req.needs_long_reasoning or req.high_stakes: return LARGE
    return MID
# escalate: if small-model output fails schema/confidence check -> retry once on LARGE
```
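The escalation rule in the last comment might look like the sketch below. Everything here is hypothetical: `call_model` stands for whatever function sends a prompt to a given tier, the tier names are placeholders, and the JSON shape (`label`, `confidence`) and the 0.7 threshold are assumptions chosen for illustration.

```python
import json

def is_valid(raw, required_keys=("label", "confidence"), min_confidence=0.7):
    """True when raw parses as a JSON object with the required keys and enough confidence."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return False
    if not isinstance(data, dict) or any(k not in data for k in required_keys):
        return False
    conf = data["confidence"]
    return isinstance(conf, (int, float)) and conf >= min_confidence

def answer_with_escalation(prompt, call_model, small="SMALL", large="LARGE"):
    """Try the cheap tier first; retry exactly once on the large tier if validation fails."""
    first = call_model(small, prompt)
    if is_valid(first):
        return small, first
    return large, call_model(large, prompt)

# Stub standing in for a real provider call (assumption for the example).
def fake_call(model, prompt):
    if model == "SMALL":
        return '{"label": "spam", "confidence": 0.4}'
    return '{"label": "ham", "confidence": 0.93}'

tier, out = answer_with_escalation("classify: hello", fake_call)
```

Returning the tier alongside the answer makes it straightforward to log the cheap-model hit rate that the Verify section asks for.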
|
|
70
|
+
|
|
71
|
+
6. **Semantic cache for repeated/near-duplicate queries.** Embed the normalized query, look it up in a vector store; on cosine similarity ≥ ~0.95 return the cached answer (0 LLM tokens). Set a conservative threshold (too low → wrong cached answers), scope the key by anything that changes the answer (user/tenant/locale/tool-version), and TTL it so stale facts expire. Bypass the cache for personalized or time-sensitive responses. This is *exact/near-exact answer reuse*, layered on top of prompt caching (which reuses the prefix, not the answer).
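A minimal in-memory sketch of such a cache follows. It substitutes a linear scan for the vector store, takes embeddings as plain lists of floats (producing them is out of scope), and the `SemanticCache` class with its parameter names is a hypothetical illustration rather than any library's API.

```python
import math
import time

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95, ttl_seconds=3600, clock=time.monotonic):
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # scope -> list of (embedding, answer, stored_at)

    def get(self, scope, embedding):
        """Return the best cached answer in this scope at or above the threshold, else None."""
        now = self.clock()
        live = [e for e in self._entries.get(scope, []) if now - e[2] < self.ttl]
        self._entries[scope] = live  # drop expired entries
        best = max(live, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

    def put(self, scope, embedding, answer):
        self._entries.setdefault(scope, []).append((list(embedding), answer, self.clock()))

cache = SemanticCache()
cache.put(("acme", "en"), [1.0, 0.0], "cached answer")
hit = cache.get(("acme", "en"), [0.99, 0.05])      # near-duplicate query, same scope
other_tenant = cache.get(("globex", "en"), [1.0, 0.0])
different_q = cache.get(("acme", "en"), [0.6, 0.8])
```

Making the scope (here a tenant/locale tuple) the outer lookup key means a query from another tenant never sees the entry, whatever the similarity.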
|
|
72
|
+
|
|
73
|
+
7. **Batch the offline work.** Anything not blocking a user — nightly classification, backfills, evals, bulk summarization — goes through the provider **Batch API** (Anthropic Message Batches / OpenAI Batch) for ~50% off, accepting up to ~24h turnaround. Never batch interactive requests.
|
|
74
|
+
|
|
75
|
+
8. **Stream to cut *perceived* latency.** For any user-facing surface, stream tokens (`stream=True` / SSE) so first text shows in a few hundred ms instead of after the full generation. This doesn't reduce cost or total wall-clock, but it collapses TTFT — often the only latency users perceive. Combine with caching: a cached prefix also lowers real TTFT.
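One way to measure the TTFT this step is about is to time the stream as it is consumed. The sketch assumes `chunks` is any iterator of text strings; real SDKs yield event objects, so extracting the text from them is left out. `consume_stream` and its parameters are hypothetical names.

```python
import time

def consume_stream(chunks, on_text=print, clock=time.perf_counter):
    """Forward each chunk as it arrives; return (full_text, ttft_seconds, total_seconds)."""
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None and chunk:
            ttft = clock() - start   # time until the first non-empty chunk
        parts.append(chunk)
        on_text(chunk)               # e.g. write to the UI or an SSE response
    total = clock() - start
    return "".join(parts), ttft, total

# Deterministic demo: a fake clock and a fake two-chunk stream.
ticks = iter([0.0, 0.2, 1.5])
shown = []
text, ttft, total = consume_stream(iter(["Hel", "lo"]), on_text=shown.append,
                                   clock=lambda: next(ticks))
```

Logging `ttft` and `total` separately shows the effect described above: streaming leaves `total` roughly unchanged while `ttft` is what the user experiences.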
|
|
76
|
+
|
|
77
|
+
9. **Re-measure and prove equal quality.** Re-run the same metrics (step 1) after changes, and run your eval set to confirm output quality didn't regress (see Verify). A cost win that drops accuracy is not a win — report cost/latency **and** the quality delta together.
|
|
78
|
+
|
|
79
|
+
## Common Errors
|
|
80
|
+
|
|
81
|
+
- **Optimizing without a baseline.** "Feels faster/cheaper" is not a number. Capture p50/p95 and $/req before touching anything, or you can't prove (or trust) the result.
|
|
82
|
+
- **Estimating tokens with `len/4`.** Fine for a rough guess, wrong for billing and for `max_tokens` budgeting. Use the provider usage object / count-tokens endpoint.
|
|
83
|
+
- **Cache-busting the prefix.** Putting a timestamp, request id, randomized example order, or the user message *before* the stable block invalidates the cache every call. Stable content first, byte-identical; variable content last.
|
|
84
|
+
- **Caching the wrong span.** Marking a tiny or rarely-reused prefix as cached pays the ~25% write premium for no reads. Cache large prefixes reused within the TTL; confirm with `cache_read_input_tokens`.
|
|
85
|
+
- **Semantic-cache threshold too loose.** A 0.85 similarity match returns a confidently wrong answer for a different question. Start ≥0.95, log near-miss hits, and never cache personalized/time-sensitive answers.
|
|
86
|
+
- **Unscoped cache key.** Caching by query text alone leaks one tenant/user/locale's answer to another. Include every dimension that changes the correct answer in the key.
|
|
87
|
+
- **Router that always escalates.** A misconfigured or overcautious router sends everything to the big model, so you pay more *and* add a routing hop. Verify the cheap-model hit rate; if <50% on easy traffic, the rules are wrong.
|
|
88
|
+
- **Unbounded `max_tokens`.** Leaving it at the model max lets a runaway generation bill thousands of output tokens. Set it to the real ceiling and add stop sequences.
|
|
89
|
+
- **Batching interactive traffic.** Batch APIs trade latency for price; a user waiting on a 24h-window response is a broken product. Batch only offline work.
|
|
90
|
+
- **Streaming to a non-interactive consumer.** Streaming into code that just `.join()`s the whole thing adds complexity with zero benefit — and can hide errors that arrive mid-stream. Stream only where a human sees partial output.
|
|
91
|
+
- **Trimming context until quality silently drops.** Aggressive history truncation or cutting a load-bearing few-shot example tanks accuracy. Gate every cut behind the eval (step 9).
|
|
92
|
+
|
|
93
|
+
## Verify
|
|
94
|
+
|
|
95
|
+
1. **Baseline captured:** before/after table exists with `input_tokens`, `output_tokens`, `$/req`, p50, p95, and TTFT — real measured numbers, not estimates.
|
|
96
|
+
2. **Cost dropped:** post-change `$/req` is measurably lower (state the %), computed from the provider usage object at current prices.
|
|
97
|
+
3. **Latency dropped:** p95 (and TTFT for user-facing paths) improved; report the actual ms, not "feels faster."
|
|
98
|
+
4. **Cache is hot:** `cache_read_input_tokens > 0` on repeat requests, and the measured hit rate is reported. A prompt cache that never reads is dead config.
|
|
99
|
+
5. **Router behaves:** the cheap model handles the majority of easy traffic (hit-rate stated), and escalation fires on low-confidence/validation-fail (one logged example each).
|
|
100
|
+
6. **Semantic cache is safe:** a deliberately *different* query below threshold does **not** return a cached answer; key scoping (tenant/locale) prevents cross-leak.
|
|
101
|
+
7. **Quality held:** the eval set scores within tolerance of baseline (state the delta) — the cost/latency win did not regress accuracy. If it did, the change is rejected.
|
|
102
|
+
|
|
103
|
+
Done = a before/after table shows lower $/req and lower p50/p95 (with cache hit rate and cheap-model share stated), and the eval confirms output quality is unchanged at the new, cheaper, faster configuration.
|
|
@@ -0,0 +1,124 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: optimize-react-rerenders
|
|
3
|
+
description: Eliminates wasted React re-renders by measuring first then fixing — profile with the React DevTools Profiler (flamegraph + "why did this render") and why-did-you-render to find the actual offenders, then apply the right fix: stable references (hoist constants, useCallback/useMemo only where a referentially-equal prop/dep actually matters), correct list keys (stable id, never index), React.memo with a custom comparator on genuinely-hot leaf components, context splitting + selector subscriptions (useSyncExternalStore / Zustand / use-context-selector) to stop whole-tree re-renders, derive-don't-store to kill redundant state, and list virtualization (TanStack Virtual) for long lists — while knowing when NOT to memo (cheap renders, unstable deps, and the React 19 Compiler which auto-memoizes and makes most manual memo dead weight).
|
|
4
|
+
when_to_use: A React app re-renders too much — typing lags, a list re-renders every row on one change, the Profiler shows components rendering with unchanged props, or you're sprinkling useMemo/useCallback/React.memo and want to know what actually helps. Distinct from optimize-core-web-vitals (load/paint metrics — LCP/INP/CLS and asset/JS strategy, not render-count) and manage-client-server-state (data-fetching/caching architecture with TanStack Query; this skill fixes the render churn that fetched data triggers).
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Reach for this skill when the problem is **too many React renders / render churn**, not load time, not data fetching:
|
|
10
|
+
|
|
11
|
+
- "Typing in this input lags / the whole form re-renders on every keystroke"
|
|
12
|
+
- "Changing one row re-renders the entire list of 500 items"
|
|
13
|
+
- "The Profiler shows components re-rendering even though their props didn't change"
|
|
14
|
+
- "I added `useMemo`/`useCallback`/`React.memo` everywhere and it's not faster (or slower)"
|
|
15
|
+
- "A context update re-renders half the tree"
|
|
16
|
+
- "Should I memoize this?" / "Is this `useMemo` worth it?" / "We're on React 19 — do I still need this?"
|
|
17
|
+
- "Long list scrolls slowly / mounts thousands of DOM nodes"
|
|
18
|
+
|
|
19
|
+
NOT this skill:
|
|
20
|
+
- Slow initial load, late paint, layout shift, or a red Lighthouse score (LCP/INP/CLS, image/font/JS-bundle strategy) → optimize-core-web-vitals (it owns paint/load metrics; this skill owns render *count*. Note: cutting re-renders during interaction also improves INP, but go there for the metric-driven workflow)
|
|
21
|
+
- Data fetching, cache invalidation, optimistic updates, SSR hydration, or picking a store (TanStack Query / Zustand / Redux) → manage-client-server-state (it architects *where state lives and how it's fetched*; this skill stops the renders that state changes cause)
|
|
22
|
+
- Building a new component's structure/props/a11y from scratch → build-react-component
|
|
23
|
+
- A big interactive grid feature (sorting/filtering/column resize/selection) as a unit → build-data-table (it builds the table; this skill makes its rows stop re-rendering)
|
|
24
|
+
- Backend/server query latency or a CPU profile of non-React code → performance-profiling
|
|
25
|
+
- Using the browser/Chrome DevTools to debug a runtime bug (state, network, errors) generally → debug-frontend-browser (this skill is the render-perf specialization of profiling)
|
|
26
|
+
|
|
27
|
+
## Steps
|
|
28
|
+
|
|
29
|
+
1. **Measure before you touch anything — never memoize on a hunch.** Manual memoization is a tradeoff (extra comparisons + cache memory); applied blindly it often makes things *slower* and always makes them harder to read. Get evidence first:
|
|
30
|
+
|
|
31
|
+
| Tool | What it tells you | How |
|---|---|---|
| **React DevTools Profiler** | which components rendered, how long, **why** | record an interaction → flamegraph + ranked chart |
| **"Highlight updates when components render"** (DevTools → ⚙ → Components) | visual flash on every render; an instant "this re-renders on every keystroke" signal | toggle on, interact |
| **`why-did-you-render`** | logs the *exact prop/state/hook that changed* (and whether it was a deep-equal-but-referentially-different value) | dev-only, see step 2 |
| **React 19 `<Profiler onRender>`** / `performance.measure` | programmatic render timings in tests/CI | wrap a subtree |
|
|
37
|
+
|
|
38
|
+
In the Profiler, enable **"Record why each component rendered"** (⚙ → Profiler). Re-renders show a reason: *props changed*, *hooks changed*, *parent rendered*, *context changed*. That reason picks the fix below — don't guess.
|
|
39
|
+
|
|
40
|
+
2. **Wire up `why-did-you-render` to catch referential-equality bugs (dev only).** It surfaces the classic "props are deep-equal but a new object/array/function identity every render" case that React.memo can't catch.
|
|
41
|
+
```js
// wdyr.js: import FIRST, before React renders anything
import React from 'react';
if (process.env.NODE_ENV === 'development') {
  const wdyr = require('@welldone-software/why-did-you-render');
  wdyr(React, { trackAllPureComponents: true, collapseGroups: true });
}
```
|
|
49
|
+
A log like *"props.style changed: ({}) → ({}) (different objects that are equal by value)"* means you should hoist the literal or memoize the reference, not wrap the child in `memo`.
|
|
50
|
+
|
|
51
|
+
3. **If you're on React 19, turn on the React Compiler FIRST — it auto-memoizes and makes most manual memo dead weight.** The compiler (`babel-plugin-react-compiler`, also in the Next.js / Vite plugin) memoizes components and hook values automatically at build time, so `useMemo`/`useCallback`/`React.memo` become largely redundant. Before hand-tuning:
|
|
52
|
+
- Install and enable it; run `npx react-compiler-healthcheck` to see how many components are compatible (it skips components that break the Rules of React — those are your real bugs to fix).
|
|
53
|
+
- **Don't mix strategies blindly:** keep code Rules-of-React-clean (no mutation of props/state, no conditional hooks) so the compiler can optimize. New manual `useMemo`/`useCallback` you add on top is usually noise. Lean on `<StrictMode>` to surface impurity.
|
|
54
|
+
- The compiler does **not** fix algorithmic problems (huge lists, O(n²) renders) — you still need virtualization (step 9) and correct keys (step 7). Treat the rest of these steps as either pre-19 work or the things the compiler can't do.
|
|
55
|
+
|
|
56
|
+
4. **Kill unstable references at the source — this is the #1 cause of "memo doesn't work".** `React.memo` and dep arrays compare by reference (`Object.is`). A new `{}`, `[]`, or arrow function created in render is a new identity every time, so it busts every downstream memo and effect. Fix the *source*, don't paper over it:
|
|
57
|
+
|
|
58
|
+
| Anti-pattern (new identity each render) | Fix |
|---|---|
| `<Child style={{ margin: 8 }} />` | hoist the object to a module-level `const` (it never changes) |
| `<Child onClick={() => doX(id)} />` passed to a memoized child | `useCallback(() => doX(id), [id])` |
| `const opts = { a, b }` then used in a dep array | `useMemo(() => ({ a, b }), [a, b])` |
| `data.filter(...)` computed inline each render into a memoized child | `useMemo(() => data.filter(...), [data])` |
| default prop `items = []` (new array each call) | hoist `const EMPTY = []` and default to it |
|
|
65
|
+
|
|
66
|
+
Static values (handlers with no closure deps, constant config) belong **outside the component** entirely — zero runtime cost.
|
|
67
|
+
|
|
68
|
+
5. **Use `useCallback`/`useMemo` only where a referentially-equal value actually changes behavior — otherwise skip it.** They are not free: each stores a closure + dep array and runs an equality check every render. They earn their keep in exactly these cases — and nowhere else:
|
|
69
|
+
- The memoized value/callback is **a dependency of another hook** (`useEffect`, `useMemo`) where a changing identity would refire the effect.
|
|
70
|
+
- It's **passed as a prop to a `React.memo`'d child** (otherwise the child's memo is pointless).
|
|
71
|
+
- `useMemo` wraps a **genuinely expensive computation** (sort/filter of thousands, parse, heavy derive) — measure; "expensive" is rarely a `.map` over 20 items.
|
|
72
|
+
|
|
73
|
+
**Skip them when** the consumer isn't memoized, the deps change every render anyway (then the cache never hits — pure overhead), or the body is cheap. A `useCallback` whose result flows only into a non-memoized DOM `<button onClick>` does nothing useful.
|
|
74
|
+
|
|
75
|
+
6. **`React.memo` only the genuinely-hot leaf, and give it the right comparator.** `memo` skips a re-render when props are shallow-equal. Apply it to a component that (a) renders often due to *parent* re-renders, (b) is expensive or numerous (list rows), and (c) gets **stable props** (you did step 4). Without stable props it's worse than nothing.
|
|
76
|
+
- Default shallow compare is correct most of the time. Provide a custom `areEqual(prev, next)` only for a known-shape prop where shallow misses (e.g. compare `prev.item.id === next.item.id && prev.selected === next.selected`) — and keep it cheaper than the render it saves.
|
|
77
|
+
- `memo` compares **props only** — it does **not** stop re-renders from internal `useState`/`useContext` changes. If the offender is context (step 8) or state, `memo` won't help.
|
|
78
|
+
```jsx
import { memo } from "react";
const Row = memo(function Row({ item, onSelect }) { /* ... */ },
  // Custom comparator: re-render only when the fields the row reads change.
  (a, b) => a.item.id === b.item.id && a.item.label === b.item.label && a.onSelect === b.onSelect);
```
|
|
82
|
+
|
|
83
|
+
7. **Fix list keys — wrong keys force remounts and break memoized rows.** Use a **stable, unique id** from the data (`item.id`), never the array index for any list that can reorder, insert, or delete. Index keys make React reuse the wrong DOM/state on reorder (lost input focus, wrong row highlighted) and defeat per-row `memo`. Don't use `Math.random()`/`uuid()` in render either — a new key every render remounts the row each time. Stable identity → React diffs in place → memoized rows actually skip.
|
|
84
|
+
|
|
85
|
+
8. **Stop context from re-rendering the whole subtree — split it or use a selector.** Every consumer of a context re-renders whenever **any** field of the context value changes, even fields it doesn't read. Three escalating fixes:
|
|
86
|
+
|
|
87
|
+
| Technique | When | How |
|---|---|---|
| **Memoize the provider value** | value is `{}`/`[]` literal inline | `value={useMemo(() => ({ user, setUser }), [user])}`; without this, *every* consumer re-renders on every parent render |
| **Split contexts by change frequency** | one context mixes hot + cold data (e.g. `theme` + live `cursorPosition`) | separate providers; a component subscribes only to what it uses → high-frequency updates don't touch theme consumers |
| **Selector subscription** | consumers read different slices of a big store | `useSyncExternalStore` with a selector, `use-context-selector`, or a store lib (**Zustand**: `useStore(s => s.field)`, **Redux**: `useSelector`); consumers re-render only when the *selected* slice changes |
|
|
92
|
+
|
|
93
|
+
Splitting state-vs-dispatch contexts is a cheap classic win: a component that only dispatches never re-renders on state changes.
|
|
94
|
+
|
|
95
|
+
9. **Virtualize long lists instead of memoizing 10,000 rows.** No amount of `memo` saves you from mounting thousands of DOM nodes. Render only the visible window (+overscan) with **TanStack Virtual** (`@tanstack/react-virtual`, headless, framework-agnostic) or `react-virtuoso`. It computes which rows intersect the viewport and absolutely-positions them; DOM stays ~constant regardless of dataset size. Combine with stable keys (step 7) and a memoized row. This is the fix when the *count* is the problem, not per-row work.
|
|
96
|
+
|
|
97
|
+
10. **Derive, don't store: redundant state means redundant renders.** Anything computable from props/existing state during render should be **computed in render** (optionally `useMemo`'d), not held in its own `useState` synced via `useEffect`. The `useState`+`useEffect`-to-sync pattern adds an extra render on every change and can drift out of sync. Likewise, push state **down** (colocate it in the smallest component that needs it, so updates don't re-render siblings), and split a god-component so a chatty piece of state isn't wired into a large subtree. Fewer state cells in fewer places = fewer renders.
|
|
98
|
+
|
|
99
|
+
11. **Don't over-memoize — premature memoization has a real cost.** Each `memo`/`useMemo`/`useCallback` adds an equality check + retained closure + cognitive load, and a wrong dep array becomes a stale-closure bug. Rules of thumb: leave cheap components un-memoized; never wrap a component whose props are unstable (fix the props instead); never memoize purely to "be safe." On React 19, prefer the compiler over manual memo. Memoize when the Profiler shows a *measured* hot path — not before. Remove memos that the Profiler shows aren't being hit.
|
|
100
|
+
|
|
101
|
+
## Common Errors
|
|
102
|
+
|
|
103
|
+
- **`React.memo` with unstable props.** Memo'd child still re-renders because a parent passes a fresh `{}`/`() => {}` each render. Fix: stabilize the prop at the source (step 4) — `memo` without stable props is pure overhead.
|
|
104
|
+
- **`useCallback`/`useMemo` feeding a non-memoized consumer.** A stable callback handed to a plain `<button>` or non-`memo` child changes nothing but adds overhead. Fix: only memoize values that cross into a `memo`'d child or another hook's deps.
|
|
105
|
+
- **Dep array that changes every render.** The memo never caches (cache miss every time) — all cost, no benefit. Fix: stabilize the deps too, or drop the memo.
|
|
106
|
+
- **Index as list key.** Reorders/inserts reuse the wrong DOM/state and break per-row memo; lost focus, wrong highlight. Fix: stable `item.id` key.
|
|
107
|
+
- **`Math.random()`/`uuid()` key in render.** Remounts every row every render — the opposite of memoization. Fix: derive a stable id once.
|
|
108
|
+
- **Inline object/array provider value.** `<Ctx.Provider value={{a,b}}>` re-renders every consumer on every parent render. Fix: `useMemo` the value, or split contexts.
|
|
109
|
+
- **One fat context for hot + cold data.** A 60fps field re-renders theme/auth consumers. Fix: split by update frequency; use a selector subscription.
|
|
110
|
+
- **`useState` mirrored from props via `useEffect`.** Extra render + drift. Fix: derive in render (`useMemo` if pricey).
|
|
111
|
+
- **Memoizing everything by default.** Slower and unreadable; stale-closure bugs from wrong deps. Fix: memoize measured hot paths only; on React 19 let the compiler do it.
|
|
112
|
+
- **Expecting `memo` to stop context/state re-renders.** `memo` compares props only. Fix: address the actual reason the Profiler reports (context → split/selector; state → derive/lift).
|
|
113
|
+
|
|
114
|
+
## Verify
|
|
115
|
+
|
|
116
|
+
1. **Profiler shows the offender gone:** record the same interaction before/after — the component that flashed/rendered "props changed (but equal)" or "parent rendered" no longer appears in the commit (or its render time drops). Keep the before flamegraph as proof.
|
|
117
|
+
2. **`why-did-you-render` is silent on the fixed path:** no "different objects that are equal by value" logs for the props you stabilized.
|
|
118
|
+
3. **Typing/interaction is smooth:** the input/list that lagged updates per keystroke without re-rendering unrelated siblings; "Highlight updates" flashes only the changed node.
|
|
119
|
+
4. **One-row change → one row renders:** mutating a single list item re-renders that row only, not the whole list (visible in the Profiler ranked chart).
|
|
120
|
+
5. **Context change is scoped:** updating a hot context field re-renders only its real consumers; theme/auth consumers stay dark.
|
|
121
|
+
6. **Long list DOM is bounded:** the virtualized list mounts ~viewport+overscan rows, and node count stays roughly constant as the dataset grows from 100 → 10,000.
|
|
122
|
+
7. **No memo is dead weight:** every remaining `memo`/`useMemo`/`useCallback` corresponds to a Profiler-confirmed hot path or a real dep/`memo`-prop boundary; the rest were removed. On React 19, `react-compiler-healthcheck` passes and manual memo is minimal.
|
|
123
|
+
|
|
124
|
+
Done = the measured re-render(s) the Profiler flagged are eliminated by the matching fix (stable refs, correct keys, context split/selector, derive-don't-store, or virtualization), every remaining manual memo is justified by evidence (or replaced by the React 19 compiler), and the before/after Profiler traces prove the churn is gone — not added overhead.
|
|
@@ -0,0 +1,100 @@
|
|
|
1
|
+
---
name: orchestrate-agent-workflow
description: Designs reliable multi-step LLM agent loops (tool-call orchestration, state/memory between steps, explicit stop conditions, per-step verification, retries/replanning, subagent decomposition, and budget/approval gates) so an agent finishes long tasks without drifting or looping forever.
when_to_use: Building an agent that plans and calls tools across multiple steps, not a single prompt-and-response. Distinct from agent-tool-mcp-builder (designing the individual tools/MCP server), prompt-engineering (single-prompt/structured-output design), and harden-llm-app-reliability (transport-level retry/timeout of one model call).
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Reach for this skill when the work spans **multiple model→tool→observe turns** under one goal:
|
|
10
|
+
|
|
11
|
+
- "Build an agent that researches, then drafts, then files a PR / ticket"
|
|
12
|
+
- "My agent loops forever / repeats the same tool call / never decides it's done"
|
|
13
|
+
- "It drifts off-task halfway through and starts solving a different problem"
|
|
14
|
+
- "Split this big task across subagents and stitch the results back together"
|
|
15
|
+
- "Add a cost/step cap and a human-approval gate before it deletes / deploys / pays"
|
|
16
|
+
|
|
17
|
+
NOT this skill:
|
|
18
|
+
- Designing the tool schemas / error shapes / MCP server the agent calls → agent-tool-mcp-builder
|
|
19
|
+
- Writing or hardening a single prompt or its JSON/function-call contract → prompt-engineering
|
|
20
|
+
- Making *one* model call survive timeouts/429s/5xx (retry, backoff, circuit breaker) → harden-llm-app-reliability
|
|
21
|
+
- Cutting token cost / latency of the model calls themselves (caching, model routing) → optimize-llm-cost-latency
|
|
22
|
+
- Treating tool output / web content as untrusted input → defend-llm-prompt-injection
|
|
23
|
+
|
|
24
|
+
## Steps
|
|
25
|
+
|
|
26
|
+
1. **Pick the simplest topology that works — escalate only when forced.** Default down this ladder, not up.
|
|
27
|
+
|
|
28
|
+
| Pattern | Use when | Control flow | Failure mode to fear |
|---|---|---|---|
| **Single-agent tool loop** | One goal, <~15 steps, work fits one context | Model decides each next tool | Infinite loop / drift |
| **Code-orchestrated workflow (DAG)** | Steps + order are known *ahead of time* | Your code sequences; model fills each node | Rigid; can't adapt mid-run |
| **Multi-agent (planner + subagents)** | One context literally can't hold the work, or independent parallel branches | Orchestrator spawns sub-loops, merges summaries | Coordination cost, lost context at handoff |
|
|
33
|
+
|
|
34
|
+
Start with the loop. Move to a coded DAG the moment the step graph is fixed (cheaper, deterministic, testable). Reach for multi-agent **only** when a single context window can't hold the task — never for "it feels big."
|
|
35
|
+
|
|
36
|
+
2. **Make the loop terminate — two independent kills.** Every loop needs BOTH a semantic stop and a hard cap. One alone is insufficient: the model's "I'm done" can be wrong, and a raw cap truncates good work.
|
|
37
|
+
|
|
38
|
+
```python
import time

MAX_STEPS, MAX_USD, MAX_SECONDS = 12, 0.50, 120

# model, run_tool, tool_result, spent_usd and escalate are supplied by your application.
def run_agent(messages, tools):
    t0 = time.time()
    for step in range(MAX_STEPS):                      # hard cap (anti-infinite-loop)
        msg = model(messages, tools)
        if msg.stop_reason == "end_turn":              # semantic stop: model emits final, no tool call
            return msg.text
        result = run_tool(msg.tool_call)               # exactly one tool per turn keeps it auditable
        messages += [msg, tool_result(result)]
        if spent_usd() > MAX_USD or time.time() - t0 > MAX_SECONDS:  # budget / wall-clock guard
            return escalate("budget exceeded", messages)
    return escalate("max steps reached", messages)     # NEVER silently return partial as success
```
|
|
50
|
+
Also detect a **no-progress loop**: hash `(tool_name, args)`; if the same call repeats ≥2× with no state change, break and replan — don't wait for MAX_STEPS.
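
The no-progress check can be sketched as follows. This is a minimal illustration, not a definitive implementation: `NoProgressDetector`, `call_fingerprint`, and the `state_version` counter (bumped by the loop whenever the scratchpad actually changes) are hypothetical names, and "repeats ≥2×" is interpreted here as "the identical call has been seen twice against unchanged state".

```python
import hashlib
import json


def call_fingerprint(tool_name: str, args: dict, state_version: int) -> str:
    """Hash the call together with a state version, so real progress resets the count."""
    payload = json.dumps([tool_name, args, state_version], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class NoProgressDetector:
    """Counts identical (tool, args) calls made while the state has not changed."""

    def __init__(self, max_identical_calls: int = 2):
        self.max_identical_calls = max_identical_calls
        self.counts: dict[str, int] = {}

    def is_stuck(self, tool_name: str, args: dict, state_version: int) -> bool:
        key = call_fingerprint(tool_name, args, state_version)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] >= self.max_identical_calls
```

The loop would call `is_stuck(...)` before each tool run and break to a replan when it returns `True`. The arguments are assumed to be JSON-serializable.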
|
|
51
|
+
|
|
52
|
+
3. **Carry state in an explicit scratchpad — not the raw message list.** The growing transcript is not memory; it's noise that costs tokens and dilutes the goal. Keep a small structured state object the loop reads/writes every turn:
|
|
53
|
+
```json
{"goal": "<one immutable sentence>", "facts": [...], "open_questions": [...],
 "done": ["fetched X", "parsed Y"], "next": "draft summary", "artifacts": {"pr_url": null}}
```
|
|
57
|
+
Re-inject `goal` + `next` into the prompt **every step** (anti-drift re-grounding). When the transcript grows large, summarize old turns into `facts`/`done` and drop the raw tool dumps — keep state, discard chatter.
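
One way to hold that state in code is sketched below. The `Scratchpad` class, the `grounding_prompt` format, and the per-message `summary` field are assumptions made for illustration; how old turns get summarized (by code or by a model call) is left open.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    goal: str                                   # one immutable sentence
    facts: list[str] = field(default_factory=list)
    done: list[str] = field(default_factory=list)
    next: str = ""

    def grounding_prompt(self) -> str:
        """Text to prepend on every step so the goal never scrolls out of view."""
        done = "; ".join(self.done) or "nothing yet"
        return f"GOAL: {self.goal}\nDONE: {done}\nNEXT: {self.next}"


def compact(messages: list[dict], state: Scratchpad, keep_last: int = 4) -> list[dict]:
    """Fold old turns into state.facts and keep only the most recent raw turns.

    Assumes keep_last >= 1 and that each old message may carry a one-line
    'summary' field (a hypothetical convention, not a provider API).
    """
    old, recent = messages[:-keep_last], messages[-keep_last:]
    for message in old:
        if message.get("summary"):
            state.facts.append(message["summary"])
    return recent
```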
|
|
58
|
+
|
|
59
|
+
4. **Verify each step's output before continuing — gate, don't trust.** After every tool result, check it actually advanced the goal *before* the model plans the next move. Cheapest sufficient check wins:
|
|
60
|
+
- Schema/shape check on tool output (parses? non-empty? expected fields?) — pure code, no model.
|
|
61
|
+
- Goal-relevance check: "does this result move us toward `goal`, or sideways?" If sideways → discard and re-ground, don't append it as progress.
|
|
62
|
+
- For generated artifacts (code, SQL, configs): run the real check (compile/lint/`pytest`/dry-run), not a model's self-assessment. A failing check feeds back into the loop as a tool result.
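
The first, pure-code check in the list above might look like this. It is a sketch under assumptions: tool results are taken to be dicts, and an `error` key is a hypothetical convention for a tool reporting failure.

```python
def gate_tool_result(result: object, required_fields: tuple[str, ...]) -> tuple[bool, str]:
    """Cheap structural gate on a tool result: pure code, no model call."""
    if not isinstance(result, dict):
        return False, "result is not a JSON object"
    if result.get("error"):
        return False, f"tool reported error: {result['error']}"
    # Treat None, empty string, empty list and empty dict as "no content".
    missing = [name for name in required_fields if result.get(name) in (None, "", [], {})]
    if missing:
        return False, f"missing or empty fields: {', '.join(missing)}"
    return True, "ok"
```

Only a result that passes the gate is written into the scratchpad; a failing reason string can be fed back to the model as the observation.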
|
|
63
|
+
|
|
64
|
+
5. **Retry then replan on tool failure — distinguish transient from logical.** Tool error ≠ retry-forever.
|
|
65
|
+
- **Transient** (timeout, 429, 5xx): bounded retry with backoff — but that's transport reliability; delegate it to harden-llm-app-reliability, don't re-implement here.
|
|
66
|
+
- **Logical** (bad args, "not found", validation reject): do **not** retry the identical call. Feed the error text back as an observation and let the model *replan* (fix args, pick another tool, or revise the plan). Cap replans (≤2) per subgoal; exceeding it → escalate, don't thrash.
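
The transient/logical split can be expressed as a small decision function. This sketch assumes tool failures surface an HTTP-style status code; the status set and the names `Action` and `next_action` are illustrative choices, not part of any library.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"        # transient: same call again, with backoff handled elsewhere
    REPLAN = "replan"      # logical: feed the error text back to the model
    ESCALATE = "escalate"  # replan budget for this subgoal is spent


TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}  # assumption: HTTP-style codes
MAX_REPLANS = 2


def next_action(status: int, replans_so_far: int) -> Action:
    """Decide what the loop does after a failed tool call."""
    if status in TRANSIENT_STATUSES:
        return Action.RETRY
    if replans_so_far >= MAX_REPLANS:
        return Action.ESCALATE
    return Action.REPLAN
```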
|
|
67
|
+
|
|
68
|
+
6. **Decompose to subagents only when one context can't hold the work — and return summaries, not dumps.** A subagent gets a *narrow* objective, its own fresh context and tool subset, and returns a **compact result** (the answer + key facts + artifact refs), never its raw transcript. The orchestrator merges summaries into parent state. This is the whole point: parallel/large work happens in child contexts so the parent stays lean. If you find yourself piping a subagent's full message history back up, you've defeated it.
|
|
69
|
+
|
|
70
|
+
7. **Gate irreversible actions behind explicit approval; meter spend.** Classify each tool: read-only / reversible-write / **irreversible** (delete, deploy, send money, email customers, `DROP`). Irreversible tools require a human-approval checkpoint (or a strict policy allowlist) *before* execution — the loop pauses and surfaces the proposed action + args. Track cumulative tokens/$ per run (step 2's `MAX_USD`); a runaway agent is a billing incident. Emit a structured per-step trace (step #, tool, args, result-status, cost) so a stuck run is debuggable after the fact → build-audit-logging.
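
A minimal sketch of the classification and approval checkpoint follows. The tool names, the `approve` callback (standing in for the human or policy check), and the trace record shape are all hypothetical; unknown tools default to the strictest class.

```python
from enum import Enum
from typing import Callable


class Risk(Enum):
    READ_ONLY = 1
    REVERSIBLE_WRITE = 2
    IRREVERSIBLE = 3


# Hypothetical tool names, for illustration only.
TOOL_RISK = {
    "search_docs": Risk.READ_ONLY,
    "create_branch": Risk.REVERSIBLE_WRITE,
    "delete_repo": Risk.IRREVERSIBLE,
}


def run_gated(tool: str, args: dict,
              execute: Callable[[str, dict], object],
              approve: Callable[[str, dict], bool],
              trace: list[dict]) -> object:
    """Run a tool, pausing for approval first when it is irreversible."""
    risk = TOOL_RISK.get(tool, Risk.IRREVERSIBLE)   # unknown tool: assume the worst
    if risk is Risk.IRREVERSIBLE and not approve(tool, args):
        trace.append({"tool": tool, "args": args, "status": "denied"})
        return None
    result = execute(tool, args)
    trace.append({"tool": tool, "args": args, "status": "ok"})
    return result
```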
|
|
71
|
+
|
|
72
|
+
8. **Prove it on a multi-step harness with machine-checkable success criteria, built before you ship.** A loop that works on one happy-path demo proves nothing. Write the harness first: define each scenario's *success oracle* in code (artifact exists / `pytest` green / expected end-state reached / final answer matches a regex), not a model's "looks good." Cover ≥3 scenarios: one happy path, one designed-to-fail (asserts escalation, never an infinite loop), and one with a distractor in tool output (asserts the goal held). Run it in CI on every change to the loop, prompt, or tool set; a passing harness is the only evidence the orchestration is reliable. The individual checks are specified in the Verify section below.
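
The harness idea can be reduced to a scenario plus a code-level oracle. The sketch below assumes each run returns an outcome dict; the `Scenario` class and the outcome fields used in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    run: Callable[[], dict]          # runs the agent, returns its outcome record
    oracle: Callable[[dict], bool]   # code-level success check, no model judgment


def run_harness(scenarios: list[Scenario]) -> dict[str, bool]:
    """Run every scenario and report pass/fail per scenario name."""
    return {s.name: bool(s.oracle(s.run())) for s in scenarios}
```

A designed-to-fail scenario passes when its oracle sees an escalation, for example `lambda outcome: outcome["status"] == "escalated"`, so a green harness covers the failure path as well as the happy path.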
|
|
73
|
+
|
|
74
|
+
## Common Errors
|
|
75
|
+
|
|
76
|
+
- **Stop condition = only "model says done".** The model declares victory early or never. Always pair the semantic stop with a hard `MAX_STEPS` cap.
|
|
77
|
+
- **No max-step / max-cost cap.** One bad plan → infinite tool calls → runaway bill. Both caps are mandatory, and hitting them must escalate, not silently return a partial answer as if it succeeded.
|
|
78
|
+
- **Treating the message list as memory.** Relying on the growing transcript means the goal drowns in tool noise and context blows up. Keep an explicit scratchpad; re-inject the goal every step.
|
|
79
|
+
- **Never re-grounding → drift.** Without re-stating `goal` each turn, a long run quietly migrates to a neighboring task. Inject goal + next-action every step and discard off-goal results.
|
|
80
|
+
- **Retrying a logical error unchanged.** Re-sending the same bad args on a validation failure just burns steps. Transient → backoff retry; logical → replan with the error fed back as an observation.
|
|
81
|
+
- **Subagents returning raw transcripts.** Dumping a child's full history into the parent defeats decomposition and re-bloats the context you spun it off to avoid. Children return summaries + artifact refs only.
|
|
82
|
+
- **Multi-agent when a loop would do.** Coordination overhead, lost context at handoffs, and harder debugging — for work one context could have held. Escalate topology only when forced (step 1).
|
|
83
|
+
- **More than one tool call per turn without isolation.** Parallel tool calls in one turn make the trace ambiguous and ordering bugs invisible. Keep one tool per turn unless the calls are provably independent.
|
|
84
|
+
- **No verification between steps.** Appending an empty/erroring/irrelevant tool result as "progress" compounds garbage across the run. Gate every result before it enters state.
|
|
85
|
+
- **Irreversible action with no approval gate.** The agent deletes/deploys/pays on a hallucinated plan. Classify tools; pause for human approval on irreversible ones.
|
|
86
|
+
- **No per-step trace.** When a run gets stuck you have nothing to debug. Emit structured step records (tool, args, status, cost) from day one.
|
|
87
|
+
|
|
88
|
+
## Verify
|
|
89
|
+
|
|
90
|
+
1. **Terminates on success:** a happy-path multi-step task reaches the semantic stop and returns the artifact *before* `MAX_STEPS` — caps are a safety net, not the normal exit.
|
|
91
|
+
2. **Terminates on failure:** force an unsolvable goal → the run hits `MAX_STEPS`/`MAX_USD` and **escalates** (returns a clear "could not complete" + trace), never an infinite loop and never a partial dressed up as success.
|
|
92
|
+
3. **No-progress break:** inject a tool that returns the same value forever → the loop detects the repeated `(tool,args)` and breaks/replans within ≤2 repeats, not at MAX_STEPS.
|
|
93
|
+
4. **Anti-drift:** run a long task with a distractor in tool output → final answer still serves the original `goal`; the off-goal result was discarded, not built upon.
|
|
94
|
+
5. **Replan on logical error:** make a tool reject the first args → the agent fixes args / switches tool and proceeds, and does **not** re-send the identical failing call.
|
|
95
|
+
6. **Budget guard:** set `MAX_USD` low → the run halts and escalates when spend exceeds it; cumulative cost in the trace matches actual usage.
|
|
96
|
+
7. **Approval gate:** an irreversible tool is never executed without the approval checkpoint firing first (assert it pauses and surfaces args).
|
|
97
|
+
8. **Subagent isolation:** parent context size after a subagent task stays bounded — the child returned a summary, not its transcript (check token count, not just correctness).
|
|
98
|
+
9. **Harness:** ≥3 multi-step scenarios with machine-checkable success criteria (artifact exists / check passes / expected state reached) run green in CI, including at least one designed-to-fail case from check 2.
|
|
99
|
+
|
|
100
|
+
Done = on the multi-step harness the agent finishes happy-path tasks via the semantic stop, every failure/runaway path escalates within the step+budget caps (no infinite loops, no partial-as-success), each step is verified and re-grounded on the goal, irreversible actions are gated, and subagents return summaries — all evidenced by the per-step trace.
|
|
@@ -0,0 +1,114 @@
|
|
|
1
|
+
---
name: payments-billing-integration
description: Integrates payment, subscription, and billing flows against a payment provider, covering hosted/PCI-offloaded checkout and payment-intent surfaces, idempotency-keyed money-mutating calls that survive retries, webhook-driven order/subscription state reconciliation keyed on stored provider event ids, subscription lifecycle (trial/upgrade/downgrade/proration/cancel/dunning), and an append-only ledger of charges, refunds, and credits that reconciles to the provider balance.
when_to_use: Integrating a payment provider (checkout, PaymentIntents, subscriptions), handling plan changes/proration/cancellations, processing payment webhooks, preventing double-charges, reconciling payment state, or implementing dunning. Distinct from ingest-webhook-secure (verifies the generic signature/replay/dedup of any inbound webhook; this skill drives billing state from those verified events), money-decimal-arithmetic (the rounding/allocation/FX math this skill calls into for totals), and auth-jwt-session (your users' identity, not a PSP charge).
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Reach for this when the request mutates money or subscription state through a payment provider (Stripe, Adyen, Braintree, PayPal):
|
|
10
|
+
|
|
11
|
+
- "Add a checkout / let users pay for X"
|
|
12
|
+
- "Set up subscription billing with monthly/annual plans and a free trial"
|
|
13
|
+
- "Handle upgrade/downgrade with proration" or "cancel at period end vs immediately"
|
|
14
|
+
- "We double-charged a customer / a retry created two charges" — make money calls idempotent
|
|
15
|
+
- "Order shows paid but the webhook said failed" — reconcile state from webhooks, not the redirect
|
|
16
|
+
- "Implement dunning / retry failed renewals / grace period before we revoke access"
|
|
17
|
+
- "Refund or partially refund, issue account credit, keep the ledger auditable"
|
|
18
|
+
|
|
19
|
+
NOT this skill:
|
|
20
|
+
- Verifying the raw signature, timestamp window, and replay/dedup of the inbound webhook itself → ingest-webhook-secure (this skill consumes an already-verified, deduped event and decides what billing state it changes)
|
|
21
|
+
- Rounding cents, splitting a charge across line items so it sums exactly, banker's rounding, FX triangulation → money-decimal-arithmetic (call into it; don't re-implement allocation here)
|
|
22
|
+
- Authenticating *your* logged-in user before they reach checkout → auth-jwt-session
|
|
23
|
+
- The background worker/queue that processes a handed-off event → message-queue-jobs
|
|
24
|
+
- Storing the PSP secret/webhook signing key → secrets-management
|
|
25
|
+
|
|
26
|
+
## Steps
|
|
27
|
+
|
|
28
|
+
1. **Never let raw card data or client-supplied amounts touch your server.** Use a hosted/PCI-offloaded surface so PAN never hits your backend — keeps you in SAQ-A, not SAQ-D.
|
|
29
|
+
|
|
30
|
+
| Need | Use | Why |
|---|---|---|
| Fastest, lowest PCI scope, one-off or sub | **Hosted checkout** (Stripe Checkout / Adyen Drop-in / PayPal Smart Buttons) | provider hosts the card form, redirects back |
| Custom in-page UI, still PCI-offloaded | **PaymentIntents + provider Elements/SDK** | card tokenized client-side; you only see a token + intent id |
| Recurring | **Subscriptions API** on a saved payment method | provider runs the renewal schedule + retries |
| You handle raw PAN | **don't** | SAQ-D, audits, liability; almost never justified |
|
|
36
|
+
|
|
37
|
+
Your **server is the source of truth for amount and currency**. Compute the price server-side from your catalog, create the intent server-side with that amount, and ignore any amount the client posts. A client that sends `amount=1` for a $100 cart must still be charged $100.
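
A minimal sketch of server-side pricing, assuming a hypothetical in-memory `CATALOG` keyed by SKU; a real catalog would live in a database, and tax/discount math would go through money-decimal-arithmetic.

```python
from typing import Optional

# Hypothetical catalog: SKU -> (unit price in minor units, currency).
CATALOG = {"tshirt": (2500, "usd"), "mug": (1200, "usd")}


def order_total_minor(cart: dict[str, int],
                      client_amount: Optional[int] = None) -> tuple[int, str]:
    """Price the cart from the server-side catalog.

    client_amount is accepted only to show that it is deliberately ignored.
    """
    if not cart or any(qty <= 0 for qty in cart.values()):
        raise ValueError("cart must contain positive quantities")
    currencies = {CATALOG[sku][1] for sku in cart}
    if len(currencies) != 1:
        raise ValueError("cart must use a single currency")
    total = sum(CATALOG[sku][0] * qty for sku, qty in cart.items())
    return total, currencies.pop()
```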
|
|
38
|
+
|
|
39
|
+
2. **Every money-mutating call carries an idempotency key, no exceptions.** Charges, captures, refunds, and subscription creates must be safe to retry after a timeout, because "request timed out" does NOT mean "charge didn't happen." Derive the key deterministically from your own intent (e.g. `charge:order_42:attempt_1`, where `attempt` should count logical payment attempts, not HTTP retries), persist it before the call, and reuse the *same* key on retry.
|
|
40
|
+
|
|
41
|
+
```python
import stripe  # assumes stripe.api_key is configured elsewhere

# Stripe: the idempotency key makes the create idempotent for 24h
intent = stripe.PaymentIntent.create(
    amount=order.total_minor,                # integer minor units, computed server-side
    currency=order.currency,                 # ISO 4217, lowercased for Stripe
    customer=order.customer_id,
    idempotency_key=f"pi:order:{order.id}",  # SAME key on every retry of THIS order
    metadata={"order_id": order.id},         # your id, so webhooks map back
)
```
|
|
51
|
+
Rules: a new key per *logical* operation, the same key across *retries* of that operation. Never reuse a key for a different amount (providers return a conflict/error). Generating a fresh UUID per HTTP attempt defeats the entire mechanism — that's how double-charges happen.
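
The key rules can be illustrated with a small helper. This is a sketch: the key format follows the example above, while `key_for` and the dict standing in for durable storage are assumptions (production code would persist the key in the database before calling the provider).

```python
def idempotency_key(operation: str, order_id: str, attempt: int = 1) -> str:
    """Same inputs give the same key.

    `attempt` counts logical attempts of the operation, never HTTP retries.
    """
    return f"{operation}:order_{order_id}:attempt_{attempt}"


def key_for(store: dict, operation: str, order_id: str, attempt: int = 1) -> str:
    """Persist the key before the provider call so a retry reuses it.

    `store` is a plain dict standing in for a durable table.
    """
    slot = (operation, order_id, attempt)
    return store.setdefault(slot, idempotency_key(operation, order_id, attempt))
```

Because the key is a pure function of the logical operation, a worker that crashes mid-request and restarts sends the same key, and the provider returns the original result instead of creating a second charge.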
|
|
52
|
+
|
|
53
|
+
3. **Drive durable state from verified webhooks, not the redirect.** The browser redirect/return is a UX hint only — the user may close the tab, the network may drop, or the 3DS challenge may resolve seconds later. Treat the synchronous result as "pending"; flip to `paid`/`active`/`failed` **only** when the matching verified webhook arrives.
|
|
54
|
+
|
|
55
|
+
- Inbound verification (signature over raw body, timestamp window, replay/seen-id dedup) is owned by **ingest-webhook-secure** — do that first.
|
|
56
|
+
- Store the **provider event id** (`evt_…`) in a `processed_events` table with a unique constraint; INSERT-or-skip makes re-delivery a no-op.
|
|
57
|
+
- Map by **your** id from `metadata` (set in step 2), not by position or amount.
|
|
58
|
+
- Return `2xx` fast so the provider stops retrying; do the heavy lifting async (hand to message-queue-jobs).
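
The INSERT-or-skip dedup from the bullets above can be sketched with SQLite from the standard library. The table layout is an assumption; other databases express the same idea with `ON CONFLICT DO NOTHING`.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the dedup table; the primary key enforces uniqueness of event ids."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True only for the first delivery of event_id; redeliveries return False."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)", (event_id,)
    )
    conn.commit()
    return cur.rowcount == 1
```

In a real handler the claim and the billing-state change would ideally share one transaction, so a crash between the two cannot leave an event marked processed but unapplied.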
|
|
59
|
+
|
|
60
|
+
4. **Make state transitions a guarded machine, tolerant of out-of-order delivery.** Webhooks arrive out of order and duplicated; a `payment_failed` can land after a later `payment_succeeded`. Never overwrite blindly — apply only forward transitions.
|
|
61
|
+
|
|
62
|
+
```
pending ──succeeded──▶ paid ──refunded──▶ refunded
   │                    ▲
   └──failed──▶ failed ─┘   (manual retry / new intent)
```
|
|
67
|
+
Guard: ignore a `failed` event for an intent already `paid` by a later event; key the decision on the event's intent status + your stored status, not arrival order. Use the event's own timestamp/sequence to drop stale ones.
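
The forward-only guard can be written as a transition table. The sketch below encodes the diagram above; the `ALLOWED` mapping and `apply_event` are illustrative names, and dropping stale events by timestamp is left out.

```python
# Forward transitions only, taken from the state diagram.
ALLOWED = {
    "pending": {"paid", "failed"},
    "failed": {"paid"},        # a manual retry / new intent can still succeed
    "paid": {"refunded"},
    "refunded": set(),
}


def apply_event(current: str, target: str) -> str:
    """Return the new status, or the unchanged current status if the move is not forward."""
    return target if target in ALLOWED.get(current, set()) else current
```

A late `failed` event for an order that is already `paid` therefore leaves the status untouched, regardless of arrival order.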
|
|
68
|
+
|
|
69
|
+
5. **Subscription lifecycle — pick defaults, don't hand-roll proration.**
|
|
70
|
+
|
|
71
|
+
| Change | Default behavior | How |
|---|---|---|
| Trial → paid | charge at trial end; webhook flips `trialing`→`active` | provider `trial_period_days` + `payment_behavior=default_incomplete` so a failed first charge stays `incomplete` instead of silently activating; gate access on the webhook-confirmed status |
| **Upgrade** (to pricier plan) | **immediate**, prorate, charge the difference now | swap price with `proration_behavior=create_prorations` and invoice now |
| **Downgrade** | **at period end** (avoid mid-cycle credit/refund churn) | schedule the price change for next period |
| Cancel | **at period end** by default (keep paid access they bought); offer immediate+refund only if asked | `cancel_at_period_end=true`; immediate = cancel now + prorated credit/refund |
| Quantity/seats | prorate immediately | update quantity, `create_prorations` |
|
|
78
|
+
|
|
79
|
+
Let the provider compute proration — it knows the exact second of the cycle. **Gate feature access on the subscription's webhook-confirmed status** (`active`/`trialing`/`past_due`), never on "they clicked upgrade."
|
|
80
|
+
|
|
81
|
+
6. **Dunning — let the provider retry, you handle the lifecycle.** On a failed renewal the provider enters smart-retries and moves the subscription to `past_due`. Subscribe to `invoice.payment_failed` (notify + start grace), `invoice.payment_succeeded` (recovered → `active`), and `subscription.deleted`/`unpaid` (retries exhausted → revoke). Default grace: keep access through `past_due`, revoke only on terminal `unpaid`/`canceled`. Don't build your own retry timer — you'll race the provider's.
|
|
82
|
+
|
|
83
|
+
7. **Ledger and invoice correctness — append-only, money math delegated.** Record every money event as an immutable ledger row (`charge`/`refund`/`credit`/`fee` with provider id, minor-unit integer amount, currency, timestamp); never UPDATE an amount in place — post a compensating row. Store amounts as integer minor units or `NUMERIC`, never float. **All splitting/rounding/tax/FX goes through money-decimal-arithmetic** so line items reconcile to the captured total exactly. Refunds reference the original charge and can't exceed it (track remaining refundable). Reconcile the ledger sum against the provider's balance/payout for each charge.
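
An append-only ledger with a remaining-refundable check might look like the following. It is a sketch under assumptions: the `LedgerRow` fields are illustrative, amounts are positive integers in minor units with `kind` carrying the direction, and a list stands in for an insert-only table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)          # frozen: a posted row is never edited in place
class LedgerRow:
    kind: str                    # "charge" | "refund" | "credit" | "fee"
    provider_id: str
    amount_minor: int            # positive integer, minor units
    currency: str
    charge_id: str               # the original charge this row relates to


def remaining_refundable(ledger: list[LedgerRow], charge_id: str) -> int:
    """Charged amount minus everything already refunded against that charge."""
    rows = [r for r in ledger if r.charge_id == charge_id]
    charged = sum(r.amount_minor for r in rows if r.kind == "charge")
    refunded = sum(r.amount_minor for r in rows if r.kind == "refund")
    return charged - refunded


def post_refund(ledger: list[LedgerRow], charge_id: str, provider_id: str,
                amount_minor: int, currency: str) -> None:
    """Append a refund row, rejecting amounts above what is still refundable."""
    if amount_minor <= 0 or amount_minor > remaining_refundable(ledger, charge_id):
        raise ValueError("refund exceeds remaining refundable amount")
    ledger.append(LedgerRow("refund", provider_id, amount_minor, currency, charge_id))
```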
|
|
84
|
+
|
|
85
|
+
8. **Periodically reconcile against the provider** — webhooks get missed (endpoint down, dropped delivery). Run a scheduled job that lists provider charges/subscriptions since the last cursor and repairs any local row that drifted (missing `paid`, stale `active`). The provider is authoritative; your DB is a cache that must converge.
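
The repair pass can be sketched as a comparison of two status maps. Fetching from the provider (listing charges/subscriptions since a cursor) is deliberately left out; `provider` here is assumed to be the already-fetched mapping of object id to status, and `local` stands in for your table.

```python
def reconcile(local: dict[str, str], provider: dict[str, str]) -> list[tuple[str, str, str]]:
    """Overwrite drifted local statuses with the provider's.

    Returns the repairs made as (object_id, old_status, new_status) tuples,
    using "missing" as old_status for rows the local side never recorded.
    """
    repairs = []
    for object_id, provider_status in provider.items():
        old = local.get(object_id, "missing")
        if old != provider_status:
            local[object_id] = provider_status
            repairs.append((object_id, old, provider_status))
    return repairs
```

Logging the returned repairs gives a measure of how many webhooks were missed between runs.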
|
|
86
|
+
|
|
87
|
+
## Common Errors
|
|
88
|
+
|
|
89
|
+
- **Acting on the redirect instead of the webhook.** User bounces before the success URL → order stuck `pending` though they paid; or the redirect fires before the charge settles → premature fulfillment. Fulfill on the verified webhook only.
|
|
90
|
+
- **Fresh idempotency key per HTTP retry.** A timeout retried with a new key creates a second charge. Key must be deterministic per logical operation and identical across retries.
|
|
91
|
+
- **Trusting the client amount/currency.** Always compute price server-side from your catalog; the client value is display-only and spoofable.
|
|
92
|
+
- **No `processed_events` dedup.** Providers deliver each event at-least-once; processing a redelivered `payment_succeeded` double-fulfills or double-credits. Unique-constrain the provider event id and skip on conflict.
|
|
93
|
+
- **Overwriting state on out-of-order events.** A late `payment_failed` clobbers a `paid` order. Apply forward-only transitions guarded by stored status + event status, not arrival order.
|
|
94
|
+
- **Hand-rolling proration math.** Off-by-cents and wrong on leap/short months. Let the provider prorate; if you must do money math, route it through money-decimal-arithmetic.
|
|
95
|
+
- **Granting access on the click, not the confirmed status.** Failed first charge on a trial → user gets the product free. Gate on webhook-confirmed `active`/`trialing`.
|
|
96
|
+
- **Floating-point money.** `0.1 + 0.2 != 0.3`; totals drift by a cent. Integer minor units or `NUMERIC` only — see money-decimal-arithmetic.
|
|
97
|
+
- **Refund without a remaining-refundable check.** Two partial refunds can exceed the charge or the provider rejects the second. Track refunded-so-far against the original charge.
|
|
98
|
+
- **Slow webhook handler.** Doing DB writes and emails synchronously exceeds the provider's timeout, so the provider retries and the retries pile up into a storm. Return `2xx` fast and process asynchronously.
|
|
99
|
+
- **Logging the full PAN / CVV / signing key.** PCI violation and secret leak. Never log card data; keep the webhook signing key in secrets-management.
|
|
100
|
+
- **Testing only the happy path.** Ship without simulating `card_declined`, `insufficient_funds`, 3DS challenge, expired card, or webhook redelivery and you'll discover them in production.
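Two of the errors above, missing `processed_events` dedup and overwriting state on out-of-order events, share one fix that can be sketched in a few lines. The code below is an illustration under stated assumptions: the `orders` table, the status names, their ranking in `STATUS_RANK`, and the `handle_event` helper are hypothetical, and signature verification is assumed to have already happened.

```python
import sqlite3

# Assumed ordering for this sketch: an event may only move an order to a
# strictly higher rank, so a late `failed` cannot clobber `paid`.
STATUS_RANK = {"pending": 0, "failed": 1, "paid": 2, "refunded": 3}


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        """
    )
    return db


def handle_event(db, event_id: str, order_id: str, new_status: str) -> str:
    """Apply a verified webhook event at most once.

    Returns 'duplicate', 'stale', or 'applied'.
    """
    with db:  # the dedup row and the state change commit or roll back together
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return "duplicate"  # the provider redelivered an event we already handled
        row = db.execute(
            "SELECT status FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
        if row is None:
            raise KeyError(order_id)  # rolls back the dedup row so a retry can succeed
        if STATUS_RANK[new_status] <= STATUS_RANK[row[0]]:
            return "stale"  # out-of-order event; keep the later state
        db.execute(
            "UPDATE orders SET status = ? WHERE id = ?", (new_status, order_id)
        )
        return "applied"
```

Because the primary key on `event_id` does the dedup, two concurrent deliveries of the same event cannot both pass the first statement. The rank table is the part most likely to need adjusting for a real state machine.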
|
|
101
|
+
|
|
102
|
+
## Verify
|
|
103
|
+
|
|
104
|
+
1. **Double-charge under retry:** create one PaymentIntent, fire the create twice with the **same** idempotency key (or kill the first mid-flight and retry) → exactly one charge on the provider dashboard, one ledger row.
|
|
105
|
+
2. **Redirect-independent fulfillment:** complete a test payment but **don't** follow the success redirect (close the tab) → the webhook still flips the order to `paid`. Then, in a separate run, follow the redirect but withhold the webhook → the order stays `pending` (proves you don't fulfill on redirect).
|
|
106
|
+
3. **Webhook dedup:** replay the same `evt_…` (provider "Resend" or `stripe trigger` + manual re-POST) → second delivery is a no-op; one fulfillment, one ledger entry.
|
|
107
|
+
4. **Out-of-order:** deliver `payment_succeeded` then a stale `payment_failed` for the same intent → final state stays `paid`.
|
|
108
|
+
5. **Lifecycle:** in provider test mode run trial→active, upgrade (immediate prorated charge appears), downgrade (applies next period), cancel-at-period-end (access persists to period end) — each confirmed by the corresponding webhook.
|
|
109
|
+
6. **Dunning:** force a renewal decline (`4000000000000341` / provider test card) → subscription `past_due`, grace access holds, then succeed a retry → back to `active`; exhaust retries → access revoked on terminal event.
|
|
110
|
+
7. **Failure paths:** simulate `card_declined`, `insufficient_funds`, expired card, and a 3DS-required card → each yields the correct user-facing error and no partial fulfillment.
|
|
111
|
+
8. **Ledger reconciliation:** for a charge + partial refund, the ledger sum equals the provider's net for that charge; a refund exceeding remaining refundable is rejected.
|
|
112
|
+
9. **Drift repair:** delete a local order row (simulate a missed webhook), run the reconcile job → it re-creates/repairs from the provider.
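Check 1 only passes if the idempotency key is derived from the logical operation rather than generated per HTTP attempt. One way to get that property is to hash stable identifiers. The helper below is a hypothetical sketch; its name, its parameters, and the choice of SHA-256 are assumptions, not something the provider requires.

```python
import hashlib


def idempotency_key(operation: str, order_id: str, scope: str = "v1") -> str:
    """Derive the same key for every retry of one logical operation.

    `scope` is a hypothetical knob: change it only when you intentionally
    want a new attempt (for example after the customer edits the cart).
    """
    material = f"{operation}:{order_id}:{scope}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

A retry after a timeout calls `idempotency_key("create_payment", order_id)` again and gets the identical string, so the provider can return the original result instead of creating a second charge.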
|
|
113
|
+
|
|
114
|
+
Done = under retried/duplicated/out-of-order webhooks there is exactly one charge and one fulfillment per order, durable state is driven only by verified+deduped webhooks (never the redirect), the full subscription lifecycle + dunning are exercised in provider test mode, the ledger reconciles to the provider's balance, and every failure path is tested.
|
|
@@ -0,0 +1,116 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: pin-toolchain-versions
|
|
3
|
+
description: Pins language/runtime/CLI versions for identical toolchains across machines and CI — a version manager (mise/asdf/Volta/nvm), exact .tool-versions/.mise.toml pins, engines + packageManager via corepack, frozen-lockfile installs, auto-switch on cd, and CI reading the same pin file.
|
|
4
|
+
when_to_use: Version drift across machines or CI ("wrong node/python version"), painful onboarding, or a "works locally, fails in CI" toolchain mismatch. NOT bumping library deps (dependency-upgrade), a containerized env (setup-devcontainer-env), or workspace task orchestration (setup-monorepo-tooling).
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Reach for this skill when the problem is **which version of the tool runs**, not which library version is installed:
|
|
10
|
+
|
|
11
|
+
- "CI uses node 18, my laptop has 20 — build passes locally, fails in CI"
|
|
12
|
+
- "New hire spent a day getting the right Python/Ruby/Go installed"
|
|
13
|
+
- "`pnpm install` produces a different lockfile on the CI runner"
|
|
14
|
+
- "Every machine must resolve the exact same node + package-manager version"
|
|
15
|
+
- A flaky build traced to a runtime/CLI version that differs by host
|
|
16
|
+
|
|
17
|
+
NOT this skill:
|
|
18
|
+
- Bumping a library/framework dependency and fixing the breakage → dependency-upgrade
|
|
19
|
+
- Reproducibility via a container/devcontainer image → setup-devcontainer-env
|
|
20
|
+
- Wiring up workspace task runners / package-manager workspaces → setup-monorepo-tooling
|
|
21
|
+
- Publishing the resulting package to a registry → publish-package-registry
|
|
22
|
+
|
|
23
|
+
## Steps
|
|
24
|
+
|
|
25
|
+
1. **Pick one manager and commit to it; never run two.** Two managers fighting over `PATH` shims is a common cause of "it switched back."
|
|
26
|
+
|
|
27
|
+
| Manager | Use when | Notes |
|---|---|---|
| **mise** | **Default.** Polyglot (node/python/go/ruby/rust/…), fast Rust shims, reads `.tool-versions` *and* `.mise.toml`, runs tasks + env | One tool for every language; drop-in upgrade path from asdf |
| asdf | Already standardized on it org-wide | Plugin-per-language, slower; mise reads its files unchanged |
| Volta | JS/TS-only repo, want the toolchain pinned *in package.json* | Pins node+pm under `"volta"`, no separate file |
| nvm | Minimal, node-only, can't install other tools | `.nvmrc` only, no pm pin, manual `nvm use` |
|
|
33
|
+
|
|
34
|
+
Default to **mise** unless the repo is JS-only and you specifically want package.json-embedded pins (Volta).
|
|
35
|
+
|
|
36
|
+
2. **Pin EXACT versions — never a range, `latest`, or `lts`.** A range re-introduces drift the moment a new patch ships. `.mise.toml`:
|
|
37
|
+
|
|
38
|
+
```toml
[tools]
node = "20.18.1"
python = "3.12.7"
pnpm = "9.15.0"
```
|
|
44
|
+
Or `.tool-versions` (asdf/mise compatible; asdf's default plugin name is `nodejs`, which mise also accepts in place of `node`):
|
|
45
|
+
```
node 20.18.1
python 3.12.7
pnpm 9.15.0
```
|
|
50
|
+
Full `MAJOR.MINOR.PATCH`. `node = "20"` or `"lts"` resolves differently on a machine that synced its index yesterday vs today — that defeats the purpose.
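The "full triple" rule is easy to enforce mechanically. The following is a small sketch, assuming a `.tool-versions` file with one `tool version` entry per line; the function name `inexact_pins` is hypothetical, and the strict `MAJOR.MINOR.PATCH` pattern would also flag legitimate suffixed versions (for example pre-releases), which you may want to allow.

```python
import re

# Exactly three dot-separated numeric components, nothing else.
EXACT = re.compile(r"^\d+\.\d+\.\d+$")


def inexact_pins(tool_versions_text: str) -> list[tuple[str, str]]:
    """Return (tool, version) pairs that are not a full MAJOR.MINOR.PATCH."""
    bad = []
    for raw in tool_versions_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        tool, *versions = line.split()
        # A tool with no version at all is reported with an empty string.
        for version in versions or [""]:
            if not EXACT.match(version):
                bad.append((tool, version))
    return bad
```

Run against the file contents in a pre-commit hook or CI step, an empty list means every pin is exact; anything else names the tool that would drift.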
|
|
51
|
+
|
|
52
|
+
3. **Pin the package manager too — runtime alone is not enough.** A pinned node with a floating pnpm still produces different lockfiles. Declare both in `package.json` and let corepack enforce it:
|
|
53
|
+
|
|
54
|
+
```jsonc
{
  "packageManager": "pnpm@9.15.0",   // corepack pins the exact pm
  "engines": { "node": "20.18.1", "pnpm": "9.15.0" }
}
```
|
|
60
|
+
`corepack enable` makes the `pnpm`/`yarn` shim resolve that exact version. Add `engine-strict=true` to `.npmrc` so an out-of-range node **errors** instead of warning. (Volta users: put `"volta": { "node": "20.18.1", "pnpm": "9.15.0" }` in package.json instead — it owns both.)
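Since `packageManager` and `engines` state the same version twice, they can disagree after a careless bump. Below is a minimal consistency check, offered as a sketch: the function name `package_json_problems` and the messages are hypothetical, and it expects plain JSON (a real `package.json` has no comments, unlike the annotated snippet above).

```python
import json
import re

EXACT = re.compile(r"^\d+\.\d+\.\d+$")


def package_json_problems(text: str) -> list[str]:
    """List pinning problems found in the contents of a package.json."""
    pkg = json.loads(text)
    problems = []

    name, _, version = pkg.get("packageManager", "").partition("@")
    version = version.split("+", 1)[0]  # ignore an optional "+sha..." hash suffix
    if not name or not EXACT.match(version):
        problems.append("packageManager must be '<name>@MAJOR.MINOR.PATCH'")

    engines = pkg.get("engines", {})
    if not EXACT.match(engines.get("node", "")):
        problems.append("engines.node must be an exact MAJOR.MINOR.PATCH")

    # If engines also names the package manager, it must agree with packageManager.
    if name and engines.get(name) not in (None, version):
        problems.append(f"engines.{name} disagrees with packageManager")
    return problems
```

An empty result means both fields are exact and agree; the check says nothing about whether corepack is actually enabled on the machine.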
|
|
61
|
+
|
|
62
|
+
4. **Commit the lockfile and install frozen in CI.** Pinned tools are wasted if installs still resolve fresh versions. Commit `pnpm-lock.yaml` / `package-lock.json` / `poetry.lock` / `Cargo.lock`, and in CI use the **frozen** install that fails on any drift, never re-resolves:
|
|
63
|
+
|
|
64
|
+
| PM | Local install | CI (must fail on drift) |
|---|---|---|
| pnpm | `pnpm install` | `pnpm install --frozen-lockfile` |
| npm | `npm install` | `npm ci` |
| yarn (berry) | `yarn install` | `yarn install --immutable` |
| poetry | `poetry install` | `poetry check --lock`, then `poetry install` (older Poetry: `poetry lock --check`) |
| cargo | `cargo build` | `cargo build --locked` |
|
|
71
|
+
|
|
72
|
+
5. **Auto-switch on `cd` so nobody runs the wrong version by hand.** Hook the manager into the shell once: append `eval "$(mise activate zsh)"` (or the `bash`/`fish` equivalent) to the rc file; nvm users add `.nvmrc` auto-use logic. Entering the repo then selects the pinned tools automatically, with no `nvm use` and no stale shell. Run `mise trust` once to allow the repo's config and `mise install` to materialize the versions.
|
|
73
|
+
|
|
74
|
+
6. **Make CI read the SAME pin file — never retype versions in YAML.** A hardcoded `node-version: 20` in the workflow is a second source of truth that silently drifts. GitHub Actions:
|
|
75
|
+
|
|
76
|
+
```yaml
# Option A: native setup reads the pin file directly
- uses: actions/setup-node@v4
  with:
    node-version-file: '.tool-versions'   # or .nvmrc / package.json
- uses: actions/setup-python@v5
  with:
    python-version-file: '.tool-versions'

# Option B (polyglot, simplest): let mise install everything
- uses: jdx/mise-action@v2   # reads .mise.toml / .tool-versions
```
|
|
88
|
+
`mise-action` is cleanest when you pin more than two languages: one step, reading the same file the dev uses.
|
|
89
|
+
|
|
90
|
+
7. **Go hermetic only when "same versions" isn't enough.** A version manager pins the tool but still links the host's system libraries (openssl, glibc), so two "same node" builds can still differ. For bit-for-bit reproducibility (security/compliance), use **Nix flakes** (`flake.nix` + `flake.lock`, `nix develop`) or **devbox** (`devbox.json`, mise-like UX over Nix). Reserve this for projects that genuinely need it: it is heavier than mise and has a steeper learning curve.
|
|
91
|
+
|
|
92
|
+
## Common Errors
|
|
93
|
+
|
|
94
|
+
- **Pinning a range or `latest`/`lts`.** `node = "20"` resolves to different patches over time and across machines. Always full `MAJOR.MINOR.PATCH`.
|
|
95
|
+
- **Pinning the runtime but not the package manager.** Floating pnpm/yarn/npm produces divergent lockfiles even on identical node. Pin via `packageManager` + corepack.
|
|
96
|
+
- **Two managers installed (nvm + mise, or asdf + Volta).** Their `PATH` shims clash and one wins nondeterministically per shell. Pick one; uninstall the other's shell hook.
|
|
97
|
+
- **`engine-strict` not set, so `engines` is just a warning.** By default npm only warns on an `engines` mismatch and installs anyway. Set `engine-strict=true` in `.npmrc` to make it a hard error.
|
|
98
|
+
- **CI hardcodes the version in YAML.** `node-version: 20` drifts from the repo's pin file the day someone bumps one and not the other. Use `node-version-file:` / `mise-action`.
|
|
99
|
+
- **Non-frozen install in CI.** Plain `npm install` / `pnpm install` re-resolves and can pick newer deps than the lockfile. Use `npm ci` / `--frozen-lockfile` / `--immutable` / `--locked`.
|
|
100
|
+
- **Forgot `corepack enable`.** Then the `pnpm` on PATH is whatever was globally installed, ignoring `packageManager`. Enable corepack locally and in CI before install.
|
|
101
|
+
- **Lockfile not committed (or in `.gitignore`).** Frozen install has nothing to enforce against. Commit every lockfile.
|
|
102
|
+
- **`.mise.toml` not trusted on a fresh clone.** mise will not load an untrusted config, so the pins are not applied until it is trusted. Run `mise trust` (or set `MISE_TRUSTED_CONFIG_PATHS`) in onboarding/CI.
|
|
103
|
+
- **Pinning the tool but using system libs.** A "same node version" build still differs if it links a different openssl. If that bites you, go hermetic (Nix/devbox), not just a version manager.
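The "CI hardcodes the version in YAML" error can be caught with a plain text scan of the workflow files. The sketch below is a heuristic rather than a YAML parser, and the function name `hardcoded_versions` is hypothetical. It flags `node-version:`, `python-version:`, `go-version:`, and `ruby-version:` inputs while leaving the `*-version-file:` form alone; a matrix expression such as `${{ matrix.node }}` would also be flagged and needs a human look.

```python
import re

# Matches e.g. "node-version: 20" but not "node-version-file: '.tool-versions'",
# because the colon must follow "version" directly.
HARDCODED = re.compile(
    r"^[ \t]*(?:node|python|go|ruby)-version:[ \t]*\S+", re.MULTILINE
)


def hardcoded_versions(workflow_text: str) -> list[str]:
    """Return every workflow line that types a tool version inline."""
    return [match.group(0).strip() for match in HARDCODED.finditer(workflow_text)]
```

Feeding it the text of each file under `.github/workflows/` and failing the build on a non-empty result keeps the pin file as the single source of truth.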
|
|
104
|
+
|
|
105
|
+
## Verify
|
|
106
|
+
|
|
107
|
+
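1. **Pin file is exact:** `grep -nE '[0-9]+\.[0-9]+\.[0-9]+' .mise.toml .tool-versions package.json` lists a full triple for every pinned tool. The grep prints only matching lines, so also read the pin entries themselves and confirm none is a bare major, `latest`, `lts`, or `*`; in `package.json`, only `engines`, `packageManager`, and `volta` count, since dependency versions match too.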
|
|
108
|
+
2. **Single source of truth:** the version in the pin file, `package.json` `engines`/`packageManager`, and the CI workflow all match — no hardcoded version in YAML that diverges from the file.
|
|
109
|
+
3. **Clean machine A:** fresh clone → `mise install && corepack enable` → `node -v`, `python -V`, `pnpm -v` print the pinned versions with zero manual selection.
|
|
110
|
+
4. **Clean machine B (or container):** repeat step 3 on a second OS/host → identical version strings.
|
|
111
|
+
5. **CI parity:** the CI job logs the same `node -v`/`pnpm -v` as the two machines (read the pin file via `setup-*` `version-file` or `mise-action`).
|
|
112
|
+
6. **Frozen install holds the line:** bump a dep without updating the lock → `npm ci` / `pnpm install --frozen-lockfile` **fails** (drift is rejected, not silently resolved).
|
|
113
|
+
7. **Auto-switch works:** `cd` out of the repo and back → the active `node`/`python` flips to the pinned versions with no manual command.
|
|
114
|
+
8. **engine-strict bites:** temporarily set a wrong node in the pin file → install **errors** on the mismatch instead of warning.
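Checks 3 to 5 compare version strings across hosts, and the tools decorate their output differently (`node -v` prints a leading `v`, `python -V` prints a `Python ` prefix). A small comparison helper makes the parity check less error-prone. This is a sketch with hypothetical names (`normalize`, `mismatches`); collecting the raw output from each machine is left to you.

```python
def normalize(reported: str) -> str:
    """Reduce tool output such as 'v20.18.1' or 'Python 3.12.7' to the bare version."""
    parts = reported.split()
    if not parts:
        return ""
    last = parts[-1]
    return last[1:] if last.startswith("v") else last


def mismatches(pins: dict[str, str], reported: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map each drifting tool to (pinned, actual); empty means full parity."""
    result = {}
    for tool, wanted in pins.items():
        actual = normalize(reported.get(tool, ""))
        if actual != wanted:
            result[tool] = (wanted, actual)
    return result
```

Calling `mismatches` once per machine and once for the CI log, with the same `pins` dictionary read from the pin file, turns "identical version strings" into a single equality check.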
|
|
115
|
+
|
|
116
|
+
Done = two clean machines and CI all print identical tool + package-manager versions resolved from one committed pin file, the frozen install fails on any lockfile drift, and switching is automatic on `cd`.
|