sanook-cli 0.4.0

Files changed (148)
  1. package/.env.example +23 -0
  2. package/CHANGELOG.md +38 -0
  3. package/LICENSE +201 -0
  4. package/README.md +239 -0
  5. package/dist/agentContext.js +2 -0
  6. package/dist/approval.js +78 -0
  7. package/dist/bin.js +461 -0
  8. package/dist/brain.js +186 -0
  9. package/dist/commands.js +66 -0
  10. package/dist/compaction.js +85 -0
  11. package/dist/config.js +101 -0
  12. package/dist/cost.js +59 -0
  13. package/dist/diff.js +36 -0
  14. package/dist/gateway/auth.js +32 -0
  15. package/dist/gateway/ledger.js +94 -0
  16. package/dist/gateway/lock.js +114 -0
  17. package/dist/gateway/schedule.js +74 -0
  18. package/dist/gateway/scheduler.js +87 -0
  19. package/dist/gateway/serve.js +57 -0
  20. package/dist/gateway/server.js +94 -0
  21. package/dist/gateway/telegram.js +115 -0
  22. package/dist/git.js +55 -0
  23. package/dist/hooks.js +104 -0
  24. package/dist/knowledge.js +68 -0
  25. package/dist/loop.js +169 -0
  26. package/dist/mcp.js +191 -0
  27. package/dist/memory.js +108 -0
  28. package/dist/providers/codex.js +86 -0
  29. package/dist/providers/keys.js +37 -0
  30. package/dist/providers/models.js +55 -0
  31. package/dist/providers/registry.js +241 -0
  32. package/dist/session.js +36 -0
  33. package/dist/skill-install.js +190 -0
  34. package/dist/skills.js +111 -0
  35. package/dist/tools/bash.js +26 -0
  36. package/dist/tools/edit.js +107 -0
  37. package/dist/tools/git.js +68 -0
  38. package/dist/tools/index.js +36 -0
  39. package/dist/tools/list.js +24 -0
  40. package/dist/tools/permission.js +30 -0
  41. package/dist/tools/read.js +18 -0
  42. package/dist/tools/recall.js +12 -0
  43. package/dist/tools/remember.js +14 -0
  44. package/dist/tools/schedule.js +61 -0
  45. package/dist/tools/search.js +54 -0
  46. package/dist/tools/skill.js +65 -0
  47. package/dist/tools/task.js +46 -0
  48. package/dist/tools/util.js +5 -0
  49. package/dist/tools/write.js +27 -0
  50. package/dist/ui/app.js +132 -0
  51. package/dist/ui/banner.js +20 -0
  52. package/dist/ui/brain-wizard.js +29 -0
  53. package/dist/ui/render.js +57 -0
  54. package/dist/ui/setup.js +46 -0
  55. package/package.json +77 -0
  56. package/second-brain/AGENTS.md +18 -0
  57. package/second-brain/CLAUDE.md +96 -0
  58. package/second-brain/Evals/retrieval-eval.md +30 -0
  59. package/second-brain/GEMINI.md +15 -0
  60. package/second-brain/Home.md +33 -0
  61. package/second-brain/README.md +29 -0
  62. package/second-brain/Runbooks/ingest-quarantine.md +27 -0
  63. package/second-brain/Runbooks/sleep-time-consolidation.md +26 -0
  64. package/second-brain/Shared/AI-Context-Index.md +52 -0
  65. package/second-brain/Shared/Core-Facts/protected-facts.md +21 -0
  66. package/second-brain/Shared/Decision-Memory/decision-log.md +24 -0
  67. package/second-brain/Shared/Memory-Inbox/memory-inbox.md +23 -0
  68. package/second-brain/Shared/Operating-State/current-state.md +30 -0
  69. package/second-brain/Shared/Provenance/ingest-log.md +27 -0
  70. package/second-brain/Shared/Rules/context-assembly-policy.md +28 -0
  71. package/second-brain/Shared/Rules/frontmatter-standard.md +33 -0
  72. package/second-brain/Shared/Rules/skills-admission.md +30 -0
  73. package/second-brain/Shared/User-Memory/user-preferences.md +25 -0
  74. package/second-brain/Templates/bug.md +22 -0
  75. package/second-brain/Templates/handoff.md +21 -0
  76. package/second-brain/Templates/project.md +24 -0
  77. package/second-brain/Templates/session.md +26 -0
  78. package/second-brain/USER.md +36 -0
  79. package/second-brain/Vault Structure Map.md +106 -0
  80. package/skills/agent-tool-mcp-builder/SKILL.md +88 -0
  81. package/skills/api-design-review/SKILL.md +70 -0
  82. package/skills/async-concurrency-correctness/SKILL.md +93 -0
  83. package/skills/audit-accessibility-wcag/SKILL.md +59 -0
  84. package/skills/audit-technical-seo/SKILL.md +62 -0
  85. package/skills/auth-jwt-session/SKILL.md +88 -0
  86. package/skills/brainstorm-design/SKILL.md +73 -0
  87. package/skills/build-etl-pipeline/SKILL.md +58 -0
  88. package/skills/build-form-validation/SKILL.md +103 -0
  89. package/skills/build-office-docs/SKILL.md +80 -0
  90. package/skills/build-react-component/SKILL.md +116 -0
  91. package/skills/build-spreadsheet/SKILL.md +106 -0
  92. package/skills/caching-strategy/SKILL.md +75 -0
  93. package/skills/cicd-pipeline-author/SKILL.md +65 -0
  94. package/skills/cloud-cost-optimize/SKILL.md +91 -0
  95. package/skills/code-comments/SKILL.md +52 -0
  96. package/skills/code-review/SKILL.md +61 -0
  97. package/skills/db-migration-safety/SKILL.md +67 -0
  98. package/skills/debug-frontend-browser/SKILL.md +58 -0
  99. package/skills/debug-root-cause/SKILL.md +54 -0
  100. package/skills/dependency-upgrade/SKILL.md +56 -0
  101. package/skills/deploy-release/SKILL.md +64 -0
  102. package/skills/diff-table-parity/SKILL.md +58 -0
  103. package/skills/dockerfile-optimize/SKILL.md +82 -0
  104. package/skills/error-message/SKILL.md +58 -0
  105. package/skills/estimate-work/SKILL.md +54 -0
  106. package/skills/explore-codebase/SKILL.md +73 -0
  107. package/skills/git-commit-pr/SKILL.md +65 -0
  108. package/skills/gitops-deploy-workflow/SKILL.md +97 -0
  109. package/skills/implement-from-design/SKILL.md +69 -0
  110. package/skills/incident-response-sre/SKILL.md +78 -0
  111. package/skills/k8s-debug-workload/SKILL.md +135 -0
  112. package/skills/k8s-manifest-review/SKILL.md +86 -0
  113. package/skills/llm-eval-harness/SKILL.md +63 -0
  114. package/skills/manage-client-server-state/SKILL.md +94 -0
  115. package/skills/mermaid-diagram/SKILL.md +61 -0
  116. package/skills/message-queue-jobs/SKILL.md +139 -0
  117. package/skills/naming-helper/SKILL.md +57 -0
  118. package/skills/observability-instrument/SKILL.md +113 -0
  119. package/skills/optimize-core-web-vitals/SKILL.md +75 -0
  120. package/skills/optimize-sql-query/SKILL.md +67 -0
  121. package/skills/performance-profiling/SKILL.md +65 -0
  122. package/skills/process-pdf/SKILL.md +107 -0
  123. package/skills/profile-dataset/SKILL.md +97 -0
  124. package/skills/prompt-engineering/SKILL.md +70 -0
  125. package/skills/rag-pipeline/SKILL.md +53 -0
  126. package/skills/rate-limiting/SKILL.md +96 -0
  127. package/skills/refactor-cleanup/SKILL.md +54 -0
  128. package/skills/regex-build/SKILL.md +72 -0
  129. package/skills/release-notes/SKILL.md +79 -0
  130. package/skills/rest-graphql-contract/SKILL.md +71 -0
  131. package/skills/scrape-structured-web-data/SKILL.md +61 -0
  132. package/skills/secrets-management/SKILL.md +96 -0
  133. package/skills/security-review/SKILL.md +62 -0
  134. package/skills/shell-script-robust/SKILL.md +71 -0
  135. package/skills/style-responsive-tailwind/SKILL.md +70 -0
  136. package/skills/terraform-plan-review/SKILL.md +95 -0
  137. package/skills/type-safety-strict/SKILL.md +82 -0
  138. package/skills/validate-data-quality/SKILL.md +62 -0
  139. package/skills/wrangle-tabular-data/SKILL.md +75 -0
  140. package/skills/write-adr/SKILL.md +75 -0
  141. package/skills/write-analytical-sql/SKILL.md +71 -0
  142. package/skills/write-data-viz/SKILL.md +58 -0
  143. package/skills/write-docs/SKILL.md +54 -0
  144. package/skills/write-plan/SKILL.md +59 -0
  145. package/skills/write-playwright-e2e/SKILL.md +86 -0
  146. package/skills/write-prd/SKILL.md +65 -0
  147. package/skills/write-rfc/SKILL.md +75 -0
  148. package/skills/write-tests/SKILL.md +50 -0
@@ -0,0 +1,78 @@
1
+ ---
2
+ name: incident-response-sre
3
+ description: "Drives live incident response and postmortems SRE-style: severity triage (P0–P3), log/metric/trace correlation to find what changed, safe mitigation, comms updates, and blameless postmortem with action items. Triggers during an active incident/outage, on-call triage, or writing a postmortem afterward."
4
+ when_to_use: an incident/outage is in progress, on-call needs to triage, a service is degraded, or writing a postmortem after the event
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ - An active incident is happening: errors spiking, latency up, partial/full outage, data not flowing, customers reporting breakage.
10
+ - On-call triage: an alert fired and you must decide severity + whether to page humans.
11
+ - A service is degraded but not down (elevated error rate, slow tail latency, queue backlog growing).
12
+ - After the incident is resolved and you need to write a blameless postmortem.
13
+
14
+ Do NOT use for routine bug fixing with no live blast radius, planned maintenance, or feature work — use normal debugging instead.
15
+
16
+ ## Steps
17
+
18
+ **Phase A — Stabilize first, diagnose second (mitigation before root cause).**
19
+
20
+ 1. **Declare + triage severity.** Pick one and state it explicitly with scope:
21
+ - **P0** — full outage / data loss / security breach / money-affecting. Mitigate now, page on-call.
22
+ - **P1** — major feature down or broad degradation, no full workaround.
23
+ - **P2** — partial degradation, workaround exists, limited users.
24
+ - **P3** — minor / cosmetic / single-user, no urgency.
25
+ Write impact in user terms: *what* is broken, *who/how many* affected, *since when*. If you can't quantify yet, say "scope unknown, treating as P1 until proven smaller" — never downgrade on a guess.
26
+
27
+ 2. **Build the "what changed" timeline.** Most incidents are caused by a recent change. List, with timestamps, the last 24–48h of:
28
+ - deploys / releases / rollbacks (check CI/CD history, git log, image tags)
29
+ - config / feature-flag / env-var changes
30
+ - infra changes (scaling, DB migration, cert rotation, DNS, dependency upgrade)
31
+ - traffic shifts (spike, new client, retry storm, upstream outage)
32
+ Overlay the incident start time against this. The change immediately preceding the symptom onset is your prime suspect.
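The overlay in step 2 can be sketched in plain shell, assuming the change events have already been collected as `EPOCH DESCRIPTION` lines. That input format, the function name `prime_suspect`, and the sample events are illustrative assumptions, not part of the skill.

```shell
# prime_suspect START_EPOCH
# Reads "EPOCH DESCRIPTION" lines on stdin (any order) and prints the
# latest change whose timestamp is at or before START_EPOCH.
prime_suspect() {
  awk -v start="$1" '
    ($1 + 0) <= (start + 0) && ($1 + 0) >= (best + 0) { best = $1; line = $0 }
    END { if (line != "") print line }
  '
}

# Hypothetical change log: a deploy, a flag flip, and a cert rotation.
printf '%s\n' \
  '1700000000 deploy api v41' \
  '1700003600 flag checkout_v2 on' \
  '1700009000 cert rotation' |
  prime_suspect 1700005000
# prints "1700003600 flag checkout_v2 on"
```

The output is only the prime suspect. The causal link still has to be confirmed in step 4.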
33
+
34
+ 3. **Correlate signals — logs + metrics + traces together, not in isolation.**
35
+ - **Metrics** tell you *where/when* (which service, which endpoint, error-rate vs latency vs saturation — the golden signals). Find the first metric that broke and its exact start time.
36
+ - **Logs** tell you *what* (the actual error string, stack trace, status code). Grep the failing service around the metric break time; read the FIRST errors, not the loudest.
37
+ - **Traces** tell you *which hop* in the request path failed (DB? upstream dep? timeout? auth?).
38
+ Pin all three to the same time window. A latency spike + DB connection-pool-exhausted logs + traces stalling at the DB call = one coherent story.
39
+
40
+ 4. **Hypothesis-driven correlation — don't oversimplify.** Form 2–3 candidate root causes from the timeline + signals. For each, state the one observation that would confirm or kill it, then check that observation. Beware: correlation ≠ cause (a deploy at the same time may be innocent); the loudest error may be a downstream *symptom*, not the source. Keep going until the signal chain is coherent end-to-end — stop only when remaining hypotheses are killed.
41
+
42
+ 5. **Mitigate with the safest reversible action FIRST — before any permanent fix.** Prefer, in order: roll back the suspect deploy/config → disable the suspect feature flag → scale up / add capacity → shed load / rate-limit / shut a non-critical path → failover. Reversible mitigation is allowed even before root cause is 100% confirmed if it's safe and likely to help. **Never** apply an irreversible action (drop/delete data, force-push, destructive migration) as a mitigation — escalate to a human first. Confirm recovery via the same metric that broke in step 3, not by assumption.
43
+
44
+ 6. **Post status comms.** Short, factual, no blame, on a cadence (e.g. every 30 min while P0/P1). Each update: current impact → what's being done → next update time. On resolution: "mitigated/resolved at <time>, root cause <known/under investigation>, monitoring." State unknowns plainly — don't speculate publicly.
45
+
46
+ **Phase B — After recovery: blameless postmortem.**
47
+
48
+ 7. **Write the postmortem** (only after the incident is mitigated/closed). Required sections:
49
+ - **Summary** — one paragraph: what broke, impact, duration.
50
+ - **Timeline** — UTC timestamps from detection → mitigation → resolution, including key diagnostic steps.
51
+ - **Impact** — quantified (users, requests, $, SLO/error-budget burn).
52
+ - **Root cause + contributing factors** — the trigger AND the conditions that let it become an incident (missing alert, no rollback path, silent dependency). Multiple factors usually, not one.
53
+ - **What went well / what was slow** — detection time, mitigation time, gaps in tooling/visibility.
54
+ - **Action items** — each is concrete, *owned*, *dated*, and prioritized (P0/P1...). Distinguish "prevent recurrence" from "detect faster" from "mitigate faster". No vague "be more careful".
55
+
56
+ 8. **Blameless language.** Attack the system, never the person. "The deploy step had no automated canary check" not "X deployed bad code". People act reasonably given the info they had; if a human could cause an outage with a normal action, the *system* lacked a guardrail.
57
+
58
+ 9. **Feed back into a runbook.** If this incident class can recur, capture the detection signal → mitigation steps → verification as a reusable runbook so next time is faster. Convert the highest-value action item into a guardrail (alert, canary, rollback automation, flag).
59
+
60
+ ## Common Errors
61
+
62
+ - **Chasing root cause while the system burns.** Mitigation (rollback/flag-off) comes before diagnosis. A reverted deploy that buys 30 min is worth more than the perfect RCA mid-outage.
63
+ - **Trusting the loudest error.** The error filling the logs is often a downstream *symptom* (cascading timeouts, retry amplification). Trace upstream to the first failure in time, not the most frequent.
64
+ - **"Deploy at the same time = the cause."** Temporal coincidence is a hypothesis, not proof. A config change, traffic spike, or expiring cert can hide behind an innocent deploy. Verify the causal link.
65
+ - **Premature severity downgrade.** "Probably just a few users" → it's the whole region. When scope is unknown, hold the higher severity until you've *measured* it smaller.
66
+ - **Irreversible "fix" under pressure.** Deleting rows / dropping a table / force-pushing / running a destructive migration to "clear the bad state" turns a recoverable outage into permanent data loss. Escalate to a human before anything irreversible.
67
+ - **Looking at one signal only.** Metrics-only = you see the *symptom* but not *why*. Logs-only = you miss the blast radius. Traces-only = you miss the trend. The story lives at the intersection, pinned to one time window.
68
+ - **Blameful postmortem.** Naming who-did-it makes people hide info next time and kills the learning. The action item is a missing guardrail, never a missing person.
69
+ - **Action items with no owner/date.** "Improve monitoring" never ships. Each item needs a name, a date, and a priority, or it's decoration.
70
+ - **Declaring "fixed" without watching recovery.** Mitigation applied ≠ recovered. Confirm on the exact metric that originally broke before standing down.
71
+
72
+ ## Verify
73
+
74
+ - **During incident:** the metric that first broke (step 3) is back to baseline AND held there for a sustained window — not a momentary dip. Error rate / latency / saturation confirmed normal on the dashboard, not assumed.
75
+ - **Mitigation safety:** every action taken was reversible, or a human explicitly approved any irreversible one. You can state how to undo each.
76
+ - **Severity correct in hindsight:** measured impact matches (or was higher than) the declared severity — you didn't under-call it.
77
+ - **Postmortem complete:** all required sections present; root cause has ≥1 contributing factor beyond the trigger; every action item has owner + date + priority; language is blameless (no person named as cause).
78
+ - **Loop closed:** at least one action item became a real guardrail (alert/canary/rollback/flag) or a runbook so the same incident is faster/prevented next time.
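The "held there for a sustained window" check from the first Verify item can be approximated as follows. This is a sketch that assumes the metric has been exported as one numeric sample per line, oldest first; the function name `recovered` and the threshold values are hypothetical.

```shell
# recovered THRESHOLD WINDOW
# Reads one numeric sample per line on stdin (oldest first).
# Exits 0 only if the last WINDOW samples are all <= THRESHOLD.
recovered() {
  awk -v max="$1" -v n="$2" '
    { v[NR] = $1 }
    END {
      if (NR < n) exit 1                 # not enough data to call it recovered
      for (i = NR - n + 1; i <= NR; i++)
        if ((v[i] + 0) > (max + 0)) exit 1
      exit 0
    }
  '
}

# Hypothetical error-rate samples (percent): spike, then three calm readings.
printf '%s\n' 9.1 7.4 0.6 0.4 0.5 | recovered 1.0 3 && echo "recovered"
# prints "recovered"
```

A single low sample does not pass: the whole trailing window must stay at or under the threshold, which is what separates recovery from a momentary dip.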
@@ -0,0 +1,135 @@
1
+ ---
2
+ name: k8s-debug-workload
3
+ description: Systematically diagnoses live Kubernetes workload failures — CrashLoopBackOff, ImagePullBackOff, OOMKilled, pending pods, failing probes — by gathering describe/logs/events/node status and isolating root cause. Triggers when a pod won't start, keeps restarting, or a deployment is stuck/unhealthy in a cluster.
4
+ when_to_use: pod in CrashLoopBackOff/ImagePullBackOff/OOMKilled/Pending, probes failing, rollout stuck, Service has no endpoints
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ A live K8s workload is broken and you have `kubectl` access. Specifically:
10
+
11
+ - Pod stuck `Pending`, `ContainerCreating`, `Init:*`, or `Terminating`
12
+ - Pod `CrashLoopBackOff`, `Error`, `OOMKilled`, or restart count climbing
13
+ - `ImagePullBackOff` / `ErrImagePull` / `InvalidImageName`
14
+ - Readiness/liveness/startup probe failing — pod `Running` but `0/1 READY`
15
+ - `kubectl rollout` stuck, deployment shows old + new ReplicaSet both alive
16
+ - Service has no endpoints / traffic 503s even though pods look up
17
+
18
+ Do NOT use for: writing new manifests from scratch, Helm chart authoring, cluster provisioning. This skill is for diagnosing something already deployed.
19
+
20
+ ## Steps
21
+
22
+ **0. Pin the target. Never debug "the cluster" — debug one object.**
23
+ ```
24
+ kubectl get pods -n <ns> -o wide # find the bad pod, note NODE + IP
25
+ kubectl get deploy,rs,pod -n <ns> -l app=<x> # see the ownership chain
26
+ ```
27
+ Always pass `-n <ns>` explicitly. Default namespace is a trap. Grab the exact pod name (`<deploy>-<rs-hash>-<rand>`) for everything below.
28
+
29
+ **1. Read the symptom from `STATUS` + `RESTARTS`, then branch. Do not guess — the status string already names the failure layer:**
30
+
31
+ | STATUS | Layer | Jump to |
32
+ |---|---|---|
33
+ | `Pending` | scheduling / quota | Step 4 |
34
+ | `ContainerCreating` stuck | volume / CNI / image | Step 2 + Step 4 |
35
+ | `ImagePullBackOff`/`ErrImagePull` | image / registry auth | Step 2 |
36
+ | `CrashLoopBackOff`/`Error` | container exits | Step 3 |
37
+ | `OOMKilled` (in STATUS, or under `Last State` in `describe`) | memory limit | Step 3 + Step 4 |
38
+ | `Running` but `0/N READY` | probe / app slow-start | Step 5 |
39
+ | Pod fine, Service 503 | endpoints / selector / port | Step 6 |
40
+
41
+ **2. Always start with `describe` — events are the highest-signal source and most people skip them.**
42
+ ```
43
+ kubectl describe pod <pod> -n <ns>
44
+ ```
45
+ Read the bottom `Events:` block first. It literally prints `Failed to pull image`, `0/3 nodes available: insufficient memory`, `Liveness probe failed: ...`. Then for image pulls:
46
+ - `ErrImagePull` + `not found` → wrong tag/repo. Check `Image:` field vs what was actually pushed.
47
+ - `ErrImagePull` + `unauthorized`/`denied` → missing/expired `imagePullSecrets`. Verify: `kubectl get sa <sa> -n <ns> -o jsonpath='{.imagePullSecrets}'` and that the secret exists + is type `kubernetes.io/dockerconfigjson`.
48
+ - `InvalidImageName` → typo / bad chars in image string.
49
+ - Stuck `ContainerCreating` with no pull error → volume mount or CNI, go to Step 4.
50
+
51
+ **3. For crashes, get BOTH current and previous logs — the crash output lives in `--previous`.**
52
+ ```
53
+ kubectl logs <pod> -n <ns> --previous --tail=200 # the run that died
54
+ kubectl logs <pod> -n <ns> --tail=100 # current attempt
55
+ kubectl logs <pod> -n <ns> -c <container> --previous # if multi-container/init
56
+ ```
57
+ Then decode the exit:
58
+ ```
59
+ kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
60
+ ```
61
+ Read `reason` + `exitCode`:
62
+ - `OOMKilled` → real memory cap hit. Go to Step 4 for limits. Bump `resources.limits.memory` or fix the leak — do NOT just raise the limit blindly if RSS grows unbounded.
63
+ - exit `137` → SIGKILL (OOM or failed liveness kill). Cross-check `Events` for `Liveness probe failed`.
64
+ - exit `143` → SIGTERM, usually probe-triggered restart or graceful shutdown loop.
65
+ - exit `1`/`2` + app stack trace in `--previous` → app bug / bad config. Check env vars + mounted ConfigMap/Secret are present: `kubectl exec <pod> -- env` won't work if crashing, so read the manifest's `envFrom`/`valueFrom` and confirm the referenced ConfigMap/Secret exists.
66
+ - exit `0` looping → app runs to completion then restarts; wrong `restartPolicy` or missing long-running command.
67
+ - CrashLoop with empty logs → fails before logging; check `command`/`args` override, entrypoint, or missing config file mount.
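The exit-code rules above follow the usual convention that a code above 128 means "killed by signal (code - 128)". A small helper can make that decoding mechanical. This is a sketch only: the function name `decode_exit` and the message wording are hypothetical.

```shell
# decode_exit CODE
# Maps a container exit code to the likely failure class from the list above.
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))                  # codes above 128 encode a signal number
    case $sig in
      9)  echo "exit $code: SIGKILL (OOM or liveness kill)" ;;
      15) echo "exit $code: SIGTERM (graceful stop or probe restart)" ;;
      *)  echo "exit $code: killed by signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "exit 0: ran to completion"
  else
    echo "exit $code: application error"
  fi
}

decode_exit 137
# prints "exit 137: SIGKILL (OOM or liveness kill)"
```

The helper only classifies. Telling an OOM kill from a liveness kill still needs the `reason` field and the `Events` block.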
68
+
69
+ **4. For `Pending`/OOM, check scheduling constraints and node pressure — the scheduler tells you exactly why in events.**
70
+ ```
71
+ kubectl describe pod <pod> -n <ns> | grep -A10 Events # "0/N nodes available: ..."
72
+ kubectl describe node <node> # Allocatable, Taints, pressure
73
+ kubectl top pod <pod> -n <ns>; kubectl top nodes # needs metrics-server
74
+ ```
75
+ Match the scheduler reason:
76
+ - `insufficient cpu/memory` → requests too high vs `Allocatable`, or nodes full. Lower `requests` or scale nodes.
77
+ - `node(s) had untolerated taint` → pod lacks matching `tolerations`. Check `Taints:` on node.
78
+ - `didn't match node affinity/selector` → `nodeSelector`/`affinity` points at labels no node has.
79
+ - `had volume node affinity conflict` → PV is zone-locked, pod scheduled to wrong zone.
80
+ - `pod has unbound immediate PersistentVolumeClaims` → PVC `Pending`: `kubectl get pvc -n <ns>` then `describe pvc` (usually no provisioner / no matching StorageClass).
81
+ - Node shows `MemoryPressure`/`DiskPressure True` → node-level, not pod-level; evictions incoming.
82
+
83
+ **5. For `Running` but not `READY`, the probe config is wrong far more often than the app.**
84
+ ```
85
+ kubectl describe pod <pod> -n <ns> | grep -iA3 -e Liveness -e Readiness -e Startup
86
+ kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].readinessProbe}'
87
+ ```
88
+ Check, in order:
89
+ - Probe `path`/`port` actually served by the app? Wrong port = permanent fail. Confirm against `containerPort` and what the app binds.
90
+ - `initialDelaySeconds` too short for slow boot → use a `startupProbe` instead of inflating liveness delay.
91
+ - Probe hits a path requiring auth → returns 401/403, counts as fail; use a dedicated unauthenticated `/healthz`.
92
+ - From inside (if it stays up): `kubectl exec <pod> -n <ns> -- wget -qO- localhost:<port><path>` to reproduce what kubelet sees.
93
+ - Liveness too aggressive → kills a healthy-but-busy pod, looks like CrashLoop (exit 137/143). Relax `timeoutSeconds`/`failureThreshold`.
94
+
95
+ **6. For "pod up but Service dead", verify the selector→endpoint chain — a 503 with healthy pods is almost always a selector or port mismatch.**
96
+ ```
97
+ kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc> # EMPTY = broken
98
+ kubectl get svc <svc> -n <ns> -o wide
99
+ kubectl describe svc <svc> -n <ns>
100
+ ```
101
+ - Empty endpoints → Service `selector` labels don't match pod labels (`kubectl get pod -n <ns> --show-labels`), OR pods aren't `READY` (go to Step 5; unready pods are excluded from endpoints).
102
+ - Endpoints present but traffic fails → Service `targetPort` ≠ container's listening port. Confirm `targetPort` maps to the real `containerPort`/app port.
103
+ - DNS: `kubectl run tmp --rm -it --image=busybox -n <ns> -- nslookup <svc>` to confirm resolution; check `kube-dns`/CoreDNS pods are up if it fails.
104
+ - `NetworkPolicy` silently dropping traffic: `kubectl get netpol -n <ns>` — a default-deny with no matching allow rule blackholes connections with no error.
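The selector check in the first bullet comes down to a subset test: every `key=value` pair in the Service selector must appear in the pod's label set. A minimal sketch, assuming both sides are passed as comma-separated `key=value` strings (the form `--show-labels` prints); the function name `selector_matches` is hypothetical.

```shell
# selector_matches SELECTOR POD_LABELS
# Both arguments are comma-separated key=value lists.
# Exits 0 if every selector pair is present among the pod labels.
selector_matches() {
  selector=$1
  labels=",$2,"                          # pad so each pair is comma-delimited
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case $labels in
      *",$pair,"*) ;;                    # pair found, keep checking
      *) IFS=$old_ifs; return 1 ;;       # one missing pair means no match
    esac
  done
  IFS=$old_ifs
  return 0
}

selector_matches 'app=web' 'app=web,pod-template-hash=5d4f' && echo "selector matches"
# prints "selector matches"
```

Extra pod labels such as `pod-template-hash` are harmless. Only a selector pair that is missing or has a different value leaves the endpoints empty.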
105
+
106
+ **7. Propose ONE narrow fix, name the exact field changed, and give a verify command. Don't shotgun multiple changes at once — you won't know which one worked.**
107
+
108
+ ## Common Errors
109
+
110
+ - **Skipping `--previous`** — `kubectl logs <pod>` on a crashlooping pod shows the *current* (already-restarted, possibly empty) attempt. The actual crash output is in `--previous`. This is the #1 missed clue.
111
+ - **Trusting `kubectl logs` when the pod never started** — Init containers and pre-entrypoint failures produce no app logs; the answer is in `describe` Events, not logs.
112
+ - **Raising the memory limit to "fix" OOMKilled** — if RSS grows unbounded it's a leak; a higher limit just delays the kill and masks the bug. Confirm steady-state vs leak via `kubectl top pod` over time before touching limits.
113
+ - **Confusing liveness vs readiness** — a failing *readiness* probe takes the pod out of Service endpoints (traffic stops) but does NOT restart it. A failing *liveness* probe *restarts* it (exit 137/143, looks like a crash). The fix differs: readiness = app not ready / wrong probe; liveness = probe too aggressive or app hung.
114
+ - **Forgetting unready pods are excluded from endpoints** — a "Service has no endpoints" bug is frequently a probe problem in disguise (Step 5), not a selector problem.
115
+ - **`top` returns error** — needs metrics-server installed; if absent, read `resources` requests/limits from the manifest and node `Allocatable` from `describe node` instead.
116
+ - **Editing the live pod instead of the controller** — `kubectl edit pod` changes get wiped on the next restart. Patch the Deployment/StatefulSet (`kubectl set resources` / `kubectl edit deploy`) so the change survives.
117
+ - **`ContainerCreating` blamed on the image** when it's actually a stuck PVC or CNI — check Events for `FailedMount`/`FailedAttachVolume` vs pull errors; they look similar from the outside.
118
+ - **Ignoring the ownership chain** — restarting a single pod when the Deployment template is the bug just respawns the same broken pod. Fix the template, then `rollout restart`.
119
+
120
+ ## Verify
121
+
122
+ After applying a fix, confirm with a runnable check — never declare done on "looks fixed":
123
+
124
+ ```
125
+ kubectl rollout status deploy/<name> -n <ns> --timeout=120s # waits, exits non-zero on fail
126
+ kubectl get pods -n <ns> -l app=<x> -w # watch RESTARTS stay 0, READY = N/N
127
+ ```
128
+ Then re-confirm the original symptom is gone:
129
+ - Crash fixed → `RESTARTS` stops climbing for ≥2 min and `lastState.terminated` is no longer present.
130
+ - Image fixed → STATUS leaves `ImagePullBackOff`, reaches `Running`.
131
+ - Pending fixed → pod has a `NODE` assigned in `-o wide`.
132
+ - Probe fixed → `READY` shows `N/N`, no new `probe failed` events.
133
+ - Service fixed → `kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc>` lists pod IPs, and a request from an in-cluster `busybox` pod succeeds.
134
+
135
+ If the verify command fails, do NOT relax the probe/assertion or bump a limit to force green — go back to the matching step and find the real cause.
@@ -0,0 +1,86 @@
1
+ ---
2
+ name: k8s-manifest-review
3
+ description: "Reviews and writes Kubernetes / Helm manifests for production-readiness: resource requests/limits, probes, security contexts, PodDisruptionBudgets, standard labels, and validation via kubeconform/conftest. Triggers when authoring or reviewing k8s YAML or Helm charts, or before applying manifests to a cluster."
4
+ when_to_use: writing/reviewing a Deployment/Service/Helm chart, about to apply manifests to a cluster, or checking best practices before merge
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Invoke this skill when the task involves any of:
10
+ - Authoring a new Deployment / StatefulSet / DaemonSet / Service / Ingress / Helm chart.
11
+ - Reviewing an existing k8s manifest or Helm chart before merge.
12
+ - Anything that will end in `kubectl apply`, `helm install/upgrade`, `kustomize build`, or an Argo/Flux sync.
13
+
14
+ Do NOT run live `kubectl apply` as part of review. Render and validate offline first. Treat the cluster as the last step the user triggers, not you.
15
+
16
+ ## Steps
17
+
18
+ Run these in order. For each gate, report PASS / FAIL / N-A with the exact field that is missing — never a vague "looks fine".
19
+
20
+ **1. Render + structural lint first (fail fast).**
21
+ - Plain YAML: `kubeconform -strict -summary -schema-location default -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' manifests/`. `-strict` rejects unknown fields (catches typos like `limts:`).
22
+ - Helm: render real values, do not lint the templates blind — `helm template <release> ./chart -f values.yaml | kubeconform -strict -summary -`. Run once per environment values file (dev/staging/prod) since prod often diverges.
23
+ - Kustomize: `kustomize build overlays/prod | kubeconform -strict -summary -`.
24
+ - Confirm `apiVersion` is GA where one exists (`apps/v1`, `networking.k8s.io/v1`, `policy/v1` for PDB — `policy/v1beta1` is removed in 1.25+).
25
+
26
+ **2. Resource requests + limits — every container, including initContainers and sidecars.**
27
+ - Require `resources.requests.cpu`, `requests.memory`, `limits.memory` on each container. Missing requests => `BestEffort`/`Burstable` => first to be OOM-killed/evicted under pressure.
28
+ - Set `limits.memory == requests.memory` (memory is incompressible; bursting then getting OOM-killed is worse than a hard cap).
29
+ - Prefer NO `limits.cpu` (or set it generously). A CPU limit throttles via CFS quota and adds tail latency even when the node is idle — request guarantees the floor.
30
+ - Flag any limit that is >4x its request (noisy-neighbor / scheduling-lie risk).
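The ">4x" rule in the last bullet is simple arithmetic once both quantities are in the same unit. A rough sketch for memory, assuming only whole-number `Mi` and `Gi` quantities (real Kubernetes quantities accept more suffixes); the helper names `to_mib` and `limit_ratio_ok` are hypothetical.

```shell
# to_mib QUANTITY: converts whole-number Mi/Gi quantities to MiB.
to_mib() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   return 1 ;;                     # unsupported suffix in this sketch
  esac
}

# limit_ratio_ok REQUEST LIMIT
# Exits 0 if LIMIT is at most 4x REQUEST, 1 if it exceeds 4x,
# 2 if either quantity could not be parsed.
limit_ratio_ok() {
  req=$(to_mib "$1") || return 2
  lim=$(to_mib "$2") || return 2
  [ "$lim" -le $(( req * 4 )) ]
}

limit_ratio_ok 128Mi 1Gi || echo "FAIL: memory limit is more than 4x the request"
# prints "FAIL: memory limit is more than 4x the request"
```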
31
+
32
+ **3. Probes — liveness, readiness, startup.**
33
+ - `readinessProbe`: required. Without it, traffic hits the pod before it can serve => 502s on rollout.
34
+ - `livenessProbe`: required, but must NOT point at the same deep dependency-checking endpoint as readiness. Liveness = "is this process wedged" (cheap, local). A liveness probe that checks the DB will cascade-restart every pod when the DB blips.
35
+ - `startupProbe`: required for slow-booting apps (JVM, migrations). Set `failureThreshold * periodSeconds` to cover worst-case boot, so liveness doesn't kill a pod mid-startup.
36
+ - Verify `initialDelaySeconds`/`timeoutSeconds`/`failureThreshold` are explicit, not relying on the 1s default timeout (too tight for most HTTP handlers).
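The `startupProbe` sizing rule is `failureThreshold * periodSeconds >= worst-case boot time`, so the smallest safe `failureThreshold` is the boot time divided by the period, rounded up. A sketch of that arithmetic; the function name and the sample boot time are assumptions for illustration.

```shell
# min_failure_threshold BOOT_SECONDS PERIOD_SECONDS
# Prints the smallest failureThreshold whose total budget
# (failureThreshold * periodSeconds) covers BOOT_SECONDS.
min_failure_threshold() {
  echo $(( ($1 + $2 - 1) / $2 ))         # integer ceiling division
}

# Hypothetical JVM service with a 95 s worst-case boot, probed every 10 s.
min_failure_threshold 95 10
# prints "10", giving a 100 s startup budget
```

In practice some headroom above the computed minimum is sensible, since worst-case boot time is usually an estimate.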
37
+
38
+ **4. securityContext + Pod Security Standards (target `restricted`).**
39
+ Pod-level + container-level:
40
+ - `runAsNonRoot: true` and an explicit `runAsUser` (non-zero).
41
+ - `allowPrivilegeEscalation: false`.
42
+ - `readOnlyRootFilesystem: true` (add an `emptyDir` writable mount for `/tmp` and any cache dirs the app actually needs).
43
+ - `capabilities.drop: ["ALL"]`; add back only what is provably needed (e.g. `NET_BIND_SERVICE`).
44
+ - `seccompProfile.type: RuntimeDefault` (required by PSS restricted).
45
+ - Reject `privileged: true`, `hostNetwork`, `hostPID`, `hostIPC`, and `hostPath` mounts unless explicitly justified in the PR.
46
+
47
+ **5. Availability — PDB + replicas + spread + rollout.**
48
+ - `replicas >= 2` for anything serving traffic (a single replica = guaranteed downtime on node drain/upgrade).
49
+ - A `PodDisruptionBudget` (`policy/v1`) with `minAvailable` or `maxUnavailable`. Gotcha: `minAvailable: 1` with `replicas: 1` makes the node un-drainable forever — use `maxUnavailable: 1` for replicas of 2-3.
50
+ - `topologySpreadConstraints` across `topology.kubernetes.io/zone` (and `kubernetes.io/hostname`) so all replicas don't land on one node/zone.
51
+ - `strategy.rollingUpdate` with sane `maxUnavailable`/`maxSurge`; a `Recreate` strategy on a serving Deployment means a hard outage window.
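The PDB gotcha above can be checked before apply: with `minAvailable`, the number of voluntary disruptions allowed is `replicas - minAvailable`, and zero means a drain can never evict the pod. A sketch assuming an integer `minAvailable` (not a percentage) and all replicas healthy; the function name `pdb_blocks_drain` is hypothetical.

```shell
# pdb_blocks_drain REPLICAS MIN_AVAILABLE
# Exits 0 if the budget leaves no room for any voluntary eviction.
pdb_blocks_drain() {
  allowed=$(( $1 - $2 ))                 # disruptions the PDB would permit
  [ "$allowed" -le 0 ]
}

pdb_blocks_drain 1 1 && echo "FAIL: PDB allows 0 disruptions, node drain will hang"
# prints "FAIL: PDB allows 0 disruptions, node drain will hang"
```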
52
+
53
+ **6. Standard labels + namespace.**
54
+ - Apply recommended `app.kubernetes.io/*` labels: `name`, `instance`, `version`, `component`, `part-of`, `managed-by`. Service selectors should target these, not ad-hoc keys.
55
+ - `metadata.namespace` set explicitly (never rely on the kubectl context default — a wrong-namespace apply is a silent prod incident).
56
+ - Every key/value in the Service and Deployment `selector` must be present on the pod template labels (the pod may carry extra labels), or the Service routes to zero endpoints (it will NOT error; it just black-holes traffic).
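
A selector check is an equality test on each selector key against the pod's labels; the pod may carry extra labels. A minimal sketch, with hypothetical names and sample label values, assuming equality-based selectors only (no `matchExpressions`):

```typescript
type Labels = Record<string, string>;

// True when every selector key/value is present on the pod template labels.
function selectorMatches(selector: Labels, podLabels: Labels): boolean {
  return Object.keys(selector).every((key) => podLabels[key] === selector[key]);
}

// Lists the selector keys that are missing or different on the pod, for the report.
function selectorDrift(selector: Labels, podLabels: Labels): string[] {
  return Object.keys(selector).filter((key) => podLabels[key] !== selector[key]);
}
```

`selectorDrift` returning a non-empty list is the offline equivalent of an empty `kubectl get endpoints` after deploy.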
57
+
58
+ **7. Policy gate (conftest / OPA).**
59
+ - Run org policy as code: `helm template ./chart -f values-prod.yaml | conftest test --policy policy/ -` (the trailing `-` makes conftest read the rendered manifests from stdin).
60
+ - Common deny rules to verify pass: no `:latest` image tags (pin a digest or semver), images from approved registries only, every workload has the gates from steps 2-6.
61
+
62
+ **8. Emit a checklist report.** One line per gate: `[PASS|FAIL] <gate> — <resource/container> — <missing field or fix>`. Group FAILs at the top. End with the exact commands to re-validate.
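
One way to produce the report format described in step 8 is sketched below. The `GateResult` type and function names are hypothetical; the separator is written as a Unicode escape so it matches the dash used in the line format above.

```typescript
type Status = "PASS" | "FAIL";

interface GateResult {
  status: Status;
  gate: string; // e.g. "securityContext"
  target: string; // resource/container
  detail: string; // missing field or fix
}

// Separator used by the report line format in step 8.
const SEP = " \u2014 ";

function formatLine(r: GateResult): string {
  return `[${r.status}] ${r.gate}${SEP}${r.target}${SEP}${r.detail}`;
}

// FAIL lines first (original order kept within each group), then the re-validation commands.
function renderReport(results: GateResult[], revalidate: string[]): string {
  const fails = results.filter((r) => r.status === "FAIL");
  const passes = results.filter((r) => r.status === "PASS");
  const lines = [...fails, ...passes].map(formatLine);
  return [...lines, "", "Re-validate with:", ...revalidate].join("\n");
}
```

Filtering into two groups instead of sorting keeps the original gate order inside each group without depending on sort stability.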
63
+
64
+ ## Common Errors
65
+
66
+ - **`helm lint` passing ≠ valid manifests.** `helm lint` checks template syntax, not the rendered Kubernetes objects. Always pipe `helm template` output through `kubeconform`.
67
+ - **kubeconform green but CRDs skipped.** With `-ignore-missing-schemas` and no `-schema-location` pointing at a CRD schema catalog, custom resources are skipped, not validated. Run with `-summary` and watch the `skipped` count.
68
+ - **Selector/label drift.** Editing pod template labels without updating the Service/Deployment `selector` leaves the workload running but with zero endpoints. No error is raised — only an empty `kubectl get endpoints`.
69
+ - **`readOnlyRootFilesystem: true` with no writable mount.** App crashes on first write to `/tmp`, logs, or cache. Always pair with an `emptyDir`.
70
+ - **`minAvailable: 1` + `replicas: 1`.** Blocks `kubectl drain`, cluster-autoscaler scale-down, and cluster upgrades indefinitely, because the only pod can never be evicted. Teams usually discover this in the middle of a maintenance window.
71
+ - **CPU limit causing latency.** Throttling shows as p99 spikes with the node nowhere near saturated. If you see CPU limits on a latency-sensitive service, flag it — remove the limit, keep the request.
72
+ - **`livenessProbe` pointed at a downstream dependency.** Turns a transient DB/cache outage into a restart storm across the whole fleet. Liveness must be local-only.
73
+ - **`policy/v1beta1` PodDisruptionBudget.** Removed in Kubernetes 1.25; use `policy/v1` (available since 1.21). The manifest renders fine offline against an old schema but fails on apply to a modern cluster.
74
+ - **One values file ≠ all environments.** Prod values often disable a probe, change replicas, or add a sidecar. Validate every env's rendered output, not just the default.
75
+
76
+ ## Verify
77
+
78
+ A manifest is production-ready only when ALL of these are true:
79
+ - `kubeconform -strict` passes for every rendered environment with zero `skipped` resources you care about.
80
+ - `conftest test` (or the org policy gate) returns zero failures.
81
+ - Every container (init + sidecar included) has requests, a memory limit, and the security context from step 4.
82
+ - Every serving workload has readiness + liveness probes, `replicas >= 2`, a `policy/v1` PDB, and topology spread.
83
+ - No `:latest` tags; images pinned to digest or semver.
84
+ - Service selector resolves to a non-empty endpoint set (every selector label is present on the pod template labels).
85
+
86
+ Report each as a checked line. If you cannot run a validator in this environment, say so explicitly and provide the exact command for the user to run — never claim a manifest is validated when it was only eyeballed.
@@ -0,0 +1,63 @@
1
+ ---
2
+ name: llm-eval-harness
3
+ description: Builds an evaluation harness for LLM/agent outputs using golden datasets, code-based scorers, and LLM-as-judge, run as a regression gate when prompts, models, or RAG configs change.
4
+ when_to_use: User wants to measure prompt/model/agent quality, stop regressions when changing a prompt or model, set up llm-as-judge or a golden dataset, or go beyond vibes-based testing. NOT for deterministic unit tests of plain code (use write-tests).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this when output quality is **non-deterministic** and "looks fine" is not good enough:
10
+
11
+ - Changing a prompt, model version, temperature, tool definitions, or RAG retrieval config and need proof it didn't regress.
12
+ - Setting up llm-as-judge, a golden dataset, or any CI gate that scores model/agent output.
13
+ - A bug report like "answers got worse after we switched models" — you need a number, not a vibe.
14
+
15
+ Do **NOT** use for deterministic code (a parser, a pure function, an API contract) — that is a normal unit test; use write-tests. The line: if the same input can produce different-but-valid outputs, it's an eval; if there's one correct output, it's a test.
16
+
17
+ ## Steps
18
+
19
+ 1. **Pick the unit of evaluation first.** Decide what one "case" is: a single prompt→completion, a full agent trajectory (tool calls + final answer), or a RAG turn (query + retrieved context + answer). Everything below is keyed to this unit. Don't mix units in one suite.
20
+
21
+ 2. **Build a lean golden dataset (start at 20–50 cases, not 1000).** Pull real cases from production logs/traces, not invented ones — sample across the actual input distribution. Each case is a row with: `id`, `input` (+ any `variables` like retrieved_context), `expected` (or `reference`), and `tags` (e.g. `edge`, `regression`, `pii`). Deliberately seed known edge cases and every past failure. Store as JSONL or CSV under `evals/data/` so diffs are reviewable in git.
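
A golden-dataset loader for the row shape described in step 2 could look like the sketch below. The `GoldenCase` interface follows the columns named above; the function name and the validation rules (string `id`/`input`/`expected`, array `tags`, unique ids) are assumptions for illustration.

```typescript
// Row shape from step 2; `variables` holds extras such as retrieved_context.
interface GoldenCase {
  id: string;
  input: string;
  variables?: Record<string, string>;
  expected: string;
  tags: string[];
}

// Parses JSONL text (one JSON object per line) and fails loudly on malformed rows.
function parseGoldenJsonl(text: string): GoldenCase[] {
  const seenIds: string[] = [];
  const cases: GoldenCase[] = [];
  text.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines
    const row = JSON.parse(line);
    for (const field of ["id", "input", "expected"]) {
      if (typeof row[field] !== "string") {
        throw new Error(`line ${index + 1}: "${field}" must be a string`);
      }
    }
    if (!Array.isArray(row.tags)) {
      throw new Error(`line ${index + 1}: "tags" must be an array`);
    }
    if (seenIds.indexOf(row.id) !== -1) {
      throw new Error(`line ${index + 1}: duplicate id "${row.id}"`);
    }
    seenIds.push(row.id);
    cases.push(row as GoldenCase);
  });
  return cases;
}
```

Rejecting duplicate ids at load time matters because later steps key baseline scores by `id`.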
22
+
23
+ 3. **Layer metrics cheapest-first; only escalate to a judge when needed.**
24
+ - **Code scorers (deterministic, free, run first):** exact match, regex/JSON-schema validation, "must contain / must NOT contain" substrings, valid-tool-was-called, refusal-detector, latency/cost budget, output parses as valid JSON. These catch the majority of regressions with zero flakiness.
25
+ - **Semantic match:** embedding cosine similarity vs. reference, with a calibrated threshold — use for "is this roughly the right answer" when wording varies.
26
+ - **LLM-as-judge (last resort, for fuzzy quality):** correctness, faithfulness-to-context, helpfulness, tone. Only build a judge for dimensions code can't check.
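
The two cheaper layers can be sketched in a few lines. The scorer names below are hypothetical, and the embedding vectors are assumed to come from whatever embedding model the project already uses; only the cosine formula itself is standard.

```typescript
// A code scorer is a deterministic predicate over the model output.
type Scorer = (output: string) => boolean;

const exactMatch = (expected: string): Scorer => (out) => out.trim() === expected.trim();
const mustContain = (needle: string): Scorer => (out) => out.indexOf(needle) !== -1;
const mustNotContain = (needle: string): Scorer => (out) => out.indexOf(needle) === -1;
const parsesAsJson: Scorer = (out) => {
  try {
    JSON.parse(out);
    return true;
  } catch (e) {
    return false;
  }
};

// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat a zero vector as "no similarity"
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Semantic match passes when similarity reaches a threshold calibrated on labeled cases.
function semanticMatch(a: number[], b: number[], threshold: number): boolean {
  return cosineSimilarity(a, b) >= threshold;
}
```

Running every code scorer before any embedding or judge call keeps the expensive layers off cases that have already failed.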
27
+
28
+ 4. **Make the judge rigorous, not vibes-in-a-trenchcoat.**
29
+ - Write a **rubric with discrete levels** (e.g. 1–5 or pass/borderline/fail) where each level has a concrete, observable definition. Force the judge to emit `reasoning` BEFORE `score` (reasoning-first output tends to improve agreement with human labels).
30
+ - **Pin the judge: fixed model id + `temperature=0`** (or near-0) so the gate is reproducible.
31
+ - **Prefer pairwise or reference-based grading over a lonely 1–10 score** — absolute scores drift; "is A better than reference B?" is far more stable. Mitigate position bias by randomizing A/B order.
32
+ - For faithfulness/RAG, the judge sees the answer + the retrieved context and rules ONLY on "is every claim supported by context" — not on world knowledge.
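
For the pairwise setup, A/B order has to vary across cases but stay fixed per case, or the gate stops being reproducible. The sketch below swaps true randomness for a deterministic assignment derived from the case id; that substitution is an assumption of this example, as are the names and the simple polynomial string hash.

```typescript
// Small deterministic string hash (polynomial rolling hash; not cryptographic).
function hashString(text: string): number {
  let hash = 7;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) % 1000000007;
  }
  return hash;
}

interface PairOrder {
  first: string;
  second: string;
  candidateIsFirst: boolean;
}

// The same case id always yields the same order, so reruns see identical judge prompts.
function orderPair(caseId: string, candidate: string, reference: string): PairOrder {
  const candidateIsFirst = hashString(caseId) % 2 === 0;
  return candidateIsFirst
    ? { first: candidate, second: reference, candidateIsFirst }
    : { first: reference, second: candidate, candidateIsFirst };
}

// Maps the judge's positional answer back to "did the candidate win?".
function candidatePreferred(judgeChoice: "first" | "second", order: PairOrder): boolean {
  return (judgeChoice === "first") === order.candidateIsFirst;
}
```

A swap-and-average variant would call the judge twice per case, once in each order, and count the candidate as preferred only when both verdicts agree.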
33
+
34
+ 5. **Calibrate the judge against humans before trusting it.** Hand-label 20–30 cases. Run the judge on the same cases and compute agreement (Cohen's κ or simple % match). If agreement is low, the rubric is ambiguous — tighten level definitions and re-run. Do not ship a judge you haven't calibrated; an uncalibrated judge is a random number generator with a PhD voice.
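
Cohen's κ corrects raw agreement for the agreement two raters would reach by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed match rate and p_e is the chance match rate computed from each rater's label frequencies. A minimal sketch with a hypothetical function name, assuming one label per case from each rater:

```typescript
// Cohen's kappa for two raters labeling the same cases with categorical labels.
function cohensKappa(human: string[], judge: string[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("both raters must label the same non-empty set of cases");
  }
  const n = human.length;
  const humanCounts: Record<string, number> = {};
  const judgeCounts: Record<string, number> = {};
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (human[i] === judge[i]) agreements++;
    humanCounts[human[i]] = (humanCounts[human[i]] ?? 0) + 1;
    judgeCounts[judge[i]] = (judgeCounts[judge[i]] ?? 0) + 1;
  }
  const observed = agreements / n; // p_o
  let expected = 0; // p_e: probability both raters pick the same label by chance
  for (const label of Object.keys(humanCounts)) {
    expected += (humanCounts[label] / n) * ((judgeCounts[label] ?? 0) / n);
  }
  if (expected === 1) return 1; // both raters used a single identical label throughout
  return (observed - expected) / (1 - expected);
}
```

Worked example: human labels `pass, pass, fail, fail` and judge labels `pass, pass, fail, pass` agree on 3 of 4 cases (p_o = 0.75); chance agreement is 0.5 × 0.75 + 0.5 × 0.25 = 0.5, so κ = 0.25 / 0.5 = 0.5, noticeably weaker than the raw 75% suggests.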
35
+
36
+ 6. **Wire variable mapping + per-case pass/fail.** Each case template maps dataset columns → prompt variables (e.g. `{{question}}`, `{{context}}`). Define per-case pass = all code scorers pass AND judge score ≥ threshold. Aggregate to a suite-level pass rate and per-tag breakdown.
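
Step 6 combines three small pieces: template rendering, the per-case pass rule, and aggregation. The sketch below is one possible shape; the `CaseResult` fields and function names are hypothetical, and the template syntax is limited to `{{name}}` placeholders with word-character names.

```typescript
// Fills {{name}} placeholders from dataset columns; a missing variable is a harness bug.
function renderTemplate(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => {
    if (!(name in variables)) throw new Error(`missing variable "${name}"`);
    return variables[name];
  });
}

interface CaseResult {
  id: string;
  tags: string[];
  codeScorers: boolean[]; // one entry per code scorer
  judgeScore: number;
}

interface TagStats {
  passed: number;
  total: number;
}

// Per-case pass = all code scorers pass AND judge score >= threshold.
function casePassed(r: CaseResult, judgeThreshold: number): boolean {
  return r.codeScorers.every((ok) => ok) && r.judgeScore >= judgeThreshold;
}

// Suite-level pass rate plus a per-tag breakdown.
function summarize(results: CaseResult[], judgeThreshold: number) {
  const byTag: Record<string, TagStats> = {};
  let passed = 0;
  for (const r of results) {
    const ok = casePassed(r, judgeThreshold);
    if (ok) passed++;
    for (const tag of r.tags) {
      const stats = byTag[tag] ?? (byTag[tag] = { passed: 0, total: 0 });
      stats.total++;
      if (ok) stats.passed++;
    }
  }
  return { passRate: results.length === 0 ? 0 : passed / results.length, byTag };
}
```

The per-tag breakdown is what makes a drop explainable: a suite at 90% overall can still be at 0% on the `pii` tag.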
37
+
38
+ 7. **Run as a deterministic gate in CI and on every prompt/model/config change.** Trigger on changes to prompt files, model id, or RAG config. The runner: load dataset → run candidate system per case → score → compare against the committed **baseline scores file** (`evals/baseline.json`). **Fail the build if pass-rate drops below threshold OR any case in the `regression` tag flips from pass→fail.** Print a per-case delta table (old score → new score) so the regression is obvious in the PR.
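
The comparison at the heart of the gate can be sketched as a pure function. The shape of the baseline file is an assumption here (a map from case id to verdict); the source only names the file `evals/baseline.json`, and the function and field names below are hypothetical.

```typescript
// Assumed baseline shape: case id -> verdict.
interface CaseVerdict {
  passed: boolean;
  score: number;
  tags: string[];
}
type Verdicts = Record<string, CaseVerdict>;

interface GateOutcome {
  ok: boolean;
  passRate: number;
  reasons: string[]; // why the build fails (empty when ok)
  deltas: string[]; // per-case "old -> new" lines for the PR
}

function compareToBaseline(baseline: Verdicts, candidate: Verdicts, minPassRate: number): GateOutcome {
  const ids = Object.keys(candidate);
  const passedCount = ids.filter((id) => candidate[id].passed).length;
  const passRate = ids.length === 0 ? 0 : passedCount / ids.length;
  const reasons: string[] = [];
  const deltas: string[] = [];

  if (passRate < minPassRate) {
    reasons.push(`pass rate ${passRate.toFixed(2)} is below threshold ${minPassRate.toFixed(2)}`);
  }
  for (const id of ids) {
    const before = baseline[id];
    const after = candidate[id];
    if (!before) {
      deltas.push(`${id}: (new) -> ${after.score}`);
      continue;
    }
    if (before.score !== after.score) deltas.push(`${id}: ${before.score} -> ${after.score}`);
    // Any regression-tagged case flipping pass -> fail fails the build on its own.
    if (before.passed && !after.passed && after.tags.indexOf("regression") !== -1) {
      reasons.push(`regression case ${id} flipped pass -> fail`);
    }
  }
  return { ok: reasons.length === 0, passRate, reasons, deltas };
}
```

A CI runner would load both files, call this function, print `deltas`, and exit non-zero when `ok` is false.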
39
+
40
+ 8. **Add random-sample probing to surface NEW failures.** The golden set only tests known cases. On a schedule, sample fresh production inputs, run them, and have the judge flag low-quality outputs for human review. This finds failure modes the static dataset can't.
41
+
42
+ 9. **Close the loop: every confirmed failure becomes a golden case.** When probing or production surfaces a bad output, add it to the dataset tagged `regression` with the corrected `expected`. The harness gets stronger over time and the same bug can never silently return.
43
+
44
+ ## Common Errors
45
+
46
+ - **Judge with `temperature > 0` → non-reproducible gate.** Two runs of the same diff give different verdicts and the gate becomes noise. Pin model + temp=0, and pin the model *version/snapshot* (a silently-updated judge model is itself a regression source).
47
+ - **Same model judges its own output → inflated, biased scores.** Use a different (often stronger) model as judge, or at minimum acknowledge the bias. Never let the system grade its own homework on quality dimensions.
48
+ - **Absolute 1–10 scoring drifts and clusters at 7–8.** Switch to pairwise (vs. reference) or a discrete rubric with anchored definitions. "Rate 1–10" without anchors is one of the most common causes of a useless judge.
49
+ - **Position/verbosity bias in pairwise judging.** Judges favor the first option and longer answers. Randomize order per case (and optionally swap-and-average); penalize verbosity explicitly in the rubric if needed.
50
+ - **Tiny or synthetic dataset → green evals, red production.** 5 cherry-picked cases prove nothing. Pull from real traffic and cover the actual input distribution, including the boring/edge tails.
51
+ - **Reaching for the LLM judge when a code scorer would do.** If the spec is "output must be valid JSON with field X" or "must not leak the system prompt," that's a regex/schema check — deterministic, free, flake-free. Don't pay a judge to check what `assert` can.
52
+ - **No committed baseline → "is this better?" is unanswerable.** Without `baseline.json` under version control you can detect a crash but not a regression. Commit baselines and update them deliberately (in a reviewed PR), never silently.
53
+ - **Judge marks a faithful RAG answer wrong because it "knows better."** A faithfulness judge must rule only on support-by-retrieved-context, not on its own world knowledge. State this explicitly in the judge prompt or you'll get false fails.
54
+ - **Flaky scorers fail the gate for the wrong reason.** Network timeouts / rate limits on judge calls look like quality regressions. Add retries + a clear "harness error" vs. "quality fail" distinction so a 429 doesn't block a ship.
55
+
56
+ ## Verify
57
+
58
+ - Run the full suite twice on the **same** candidate; the score must be **identical** (proves the gate is deterministic). If it isn't, a scorer/judge has hidden randomness.
59
+ - Confirm the gate actually bites: introduce a deliberately bad prompt and check the suite **fails** with a clear per-case delta pointing at the regressed cases.
60
+ - Confirm a known-good change passes and the baseline-comparison table renders old→new deltas.
61
+ - Spot-check 3–5 judge verdicts by hand against the rubric — the judge's `reasoning` should justify its `score`. If reasoning and score disagree, the rubric/prompt needs tightening.
62
+ - Verify CI is wired: the eval job triggers on prompt/model/RAG-config file changes and blocks merge on failure (not just warns).
63
+ - Confirm every `regression`-tagged case maps to a real past failure, and that the latest surfaced failure has been added back into the golden set.
@@ -0,0 +1,94 @@
1
+ ---
2
+ name: manage-client-server-state
3
+ description: Sets up server-state with TanStack Query (caching, mutations, optimistic updates, hydration) and picks the right client-state tool; used when wiring data fetching or untangling state.
4
+ when_to_use: When the user wires data fetching/caching, mentions TanStack Query / React Query, useEffect fetch waterfalls, optimistic UI, SSR hydration, or choosing a state manager (Zustand/Context/Redux).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the task involves any of:
10
+ - Wiring data fetching/caching, or the prompt mentions **TanStack Query / React Query** (`useQuery`, `useMutation`, `QueryClient`).
11
+ - A component fetches in `useEffect` and stuffs results into `useState` (manual loading/error flags, refetch-on-prop-change, fetch waterfalls).
12
+ - **Optimistic UI** — update the screen before the server confirms, then roll back on failure.
13
+ - **SSR / RSC hydration** in Next.js (App Router) — prefetch on the server, hand off to the client without a refetch flash.
14
+ - Choosing a **client-state** tool: Context vs Zustand vs Redux, or fixing prop drilling / over-rendering.
15
+
16
+ Skip / hand off if the task is purely **form validation** (use `build-form-validation`) or **component scaffolding/markup** (component skill). This skill is about *where state lives and how it syncs*, not UI structure.
17
+
18
+ First principle, stated before any code is written: **Server state ≠ client state.** Server state is async, owned by the backend, shared, and can go stale (lists, user profile, search results) → TanStack Query. Client state is synchronous, owned by the UI, local (modal open, active tab, form draft, theme) → `useState` / Zustand / Context. Most "state is a mess" bugs come from managing both kinds with the same tool.
19
+
20
+ ## Steps
21
+
22
+ 1. **Classify every piece of state first.** Grep the file/feature for `useState`, `useEffect`, `useContext`, `dispatch`. For each, ask: does the value originate from an API/DB? → server state, migrate to TanStack Query. Is it ephemeral UI? → leave as client state. Do NOT migrate UI toggles into Query, and do NOT keep fetched data in `useState`.
23
+
24
+ 2. **Provider + client setup (once per app).** Create one `QueryClient` and wrap the tree in `<QueryClientProvider>`. In Next.js App Router, the client MUST be created inside a `"use client"` component via `useState(() => new QueryClient())` (not a module-level singleton) so each request/user gets its own cache. Set sane defaults: `staleTime` 30–60s for most reads (0 means "always refetch on mount/focus" — usually not what you want), keep `gcTime` (v5; was `cacheTime`) default 5min unless memory-bound.
25
+
26
+ 3. **Replace `useEffect`+`fetch` with `useQuery`.** Pattern:
27
+ ```ts
28
+ const { data, isPending, isError, error } = useQuery({
29
+   queryKey: ['todos', { status, page }], // serializable, hierarchical
30
+   queryFn: ({ signal }) => fetchTodos({ status, page, signal }),
31
+   staleTime: 30_000,
32
+ })
33
+ ```
34
+ Delete the manual `loading`/`error`/`data` `useState` and the `useEffect`. Pass `signal` into fetch for auto-cancel. **queryKey strategy:** array form, broad→narrow (`['todos']` → `['todos', id]` → `['todos', { filters }]`). Every variable the `queryFn` reads MUST appear in the key — missing deps = stale data served for the wrong params. Centralize keys in a `queryKeys` factory object to avoid typos and make invalidation greppable.
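
A key factory along the lines described above might look like this. It is a sketch with hypothetical names (`todoKeys`, `TodoFilters`); it has no dependency on the query library because keys are plain arrays.

```typescript
interface TodoFilters {
  status: string;
  page: number;
}

// Root segment shared by every todo-related key.
const todosRoot = ["todos"] as const;

// Broad -> narrow: ['todos'] -> ['todos', id] -> ['todos', { filters }].
const todoKeys = {
  all: todosRoot,
  detail: (id: number) => [...todosRoot, id] as const,
  list: (filters: TodoFilters) => [...todosRoot, filters] as const,
};
```

Components then write `queryKey: todoKeys.list({ status, page })` and mutations invalidate with `queryKey: todoKeys.all`, so a search for `todoKeys.` finds every read and every invalidation.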
35
+
36
+ 4. **Mutations + invalidation.** Use `useMutation` for writes. After success, invalidate the affected reads so they refetch:
37
+ ```ts
38
+ const qc = useQueryClient()
39
+ useMutation({
40
+   mutationFn: updateTodo,
41
+   onSuccess: () => qc.invalidateQueries({ queryKey: ['todos'] }),
42
+ })
43
+ ```
44
+ `invalidateQueries({ queryKey: ['todos'] })` matches all keys *prefixed* by `['todos']` — that's the lever the hierarchical key design buys you. Prefer invalidation over manually hand-editing the cache unless you're doing optimistic updates.
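
To make the prefix behavior concrete, here is a simplified model of it. This is not the library's implementation: it compares key segments by their `JSON.stringify` output, which is an assumption of this sketch (and is sensitive to object key order), and the function names are hypothetical.

```typescript
type QueryKey = ReadonlyArray<unknown>;

// True when every segment of filterKey equals the segment at the same position in queryKey.
function isKeyPrefix(filterKey: QueryKey, queryKey: QueryKey): boolean {
  if (filterKey.length > queryKey.length) return false;
  return filterKey.every((part, i) => JSON.stringify(part) === JSON.stringify(queryKey[i]));
}

// Which cached keys a given invalidation filter would reach, under this simplified model.
function keysReached(filterKey: QueryKey, cachedKeys: QueryKey[]): QueryKey[] {
  return cachedKeys.filter((key) => isKeyPrefix(filterKey, key));
}
```

Under this model, a filter of `['todos']` reaches `['todos']`, `['todos', 1]`, and `['todos', { status: 'open' }]` but not `['users']`, while a filter of `['todos', 1]` reaches only the single detail query.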
45
+
46
+ 5. **Optimistic updates (only where latency is felt: toggles, likes, reorder).** The three-callback contract (`onMutate` / `onError` / `onSettled`):
47
+ ```ts
48
+ useMutation({
49
+   mutationFn: toggleTodo,
50
+   onMutate: async (next) => {
51
+     await qc.cancelQueries({ queryKey: ['todos'] }) // stop in-flight refetch clobbering us
52
+     const prev = qc.getQueryData(['todos']) // snapshot for rollback
53
+     qc.setQueryData(['todos'], (old) => applyOptimistic(old, next))
54
+     return { prev } // context passed to onError
55
+   },
56
+   onError: (_e, _next, ctx) => qc.setQueryData(['todos'], ctx?.prev), // rollback
57
+   onSettled: () => qc.invalidateQueries({ queryKey: ['todos'] }), // reconcile w/ server
58
+ })
59
+ ```
60
+ All three of `cancelQueries` / snapshot+rollback / `onSettled` invalidate are required — drop any one and you get flicker, lost rollback, or permanent drift from the server.
61
+
62
+ 6. **SSR / RSC hydration (Next.js App Router).** In the **server component**: create a request-scoped `QueryClient`, `await queryClient.prefetchQuery({ queryKey, queryFn })`, then render `<HydrationBoundary state={dehydrate(queryClient)}>` wrapping the client component. The client component calls `useQuery` with the **identical queryKey**, so it reads the dehydrated cache: no refetch (given the non-zero `staleTime` from step 2) and no loading flash. Mismatched keys between prefetch and `useQuery` cause a silent double-fetch on the client; this is one of the most common hydration bugs, so verify that both sides build the key from the same values.
63
+
64
+ 7. **Pick the client-state tool (the non-server state from step 1):**
65
+ | Need | Use |
66
+ |---|---|
67
+ | Low-frequency, rarely-changing (theme, locale, auth user object) | **Context** |
68
+ | Cross-cutting, frequently-updated, shared across distant components (cart, wizard, filters) | **Zustand** |
69
+ | Local to one subtree | keep `useState` / `useReducer`, lift only as far as needed |
70
+ | Complex shared logic + time-travel/devtools/middleware genuinely needed | **Redux Toolkit** (default to NOT this) |
71
+ Do **not** reach for Redux by reflex, and do **not** put server data in any of these — that's step 3's job.
72
+
73
+ 8. **Kill prop drilling + unnecessary re-renders.** If a prop is threaded through 3+ layers only to reach a leaf, move it to Context or Zustand. For Zustand, **always select narrowly**: `useStore((s) => s.cart.count)`, not `useStore((s) => s.cart)` (re-renders on any cart change) and never the whole store (re-renders on every store change). For Context, split into multiple providers (e.g. separate `ThemeContext` and `AuthContext`) so a change to one doesn't re-render consumers of the other; memoize the provider `value`.
74
+
75
+ ## Common Errors
76
+
77
+ - **`queryKey` missing a `queryFn` dependency** → wrong/stale data shown when params change. Every variable used in `queryFn` must be in the key.
78
+ - **Module-level `new QueryClient()` in Next.js** → cache leaks across requests/users on the server (one user sees another's data). Create it inside `useState`/per-request.
79
+ - **`v5` rename trap:** `cacheTime` → `gcTime`, `isLoading` → `isPending` (for "no data yet"), `useQuery` no longer takes positional args (options object only), callbacks `onSuccess/onError` removed from `useQuery` (still on `useMutation`). If you see those on a query, it's v4 code.
80
+ - **`staleTime: 0` (default) + refetchOnWindowFocus** → app hammers the API on every tab switch. Set a real `staleTime` for reads that don't need to be live.
81
+ - **Optimistic update without `cancelQueries`** → an in-flight background refetch resolves *after* your optimistic `setQueryData` and overwrites it → UI flickers back. Always `cancelQueries` in `onMutate`.
82
+ - **Hydration key mismatch** → server prefetches `['todos']`, client calls `useQuery({ queryKey: ['todos', filters], ... })` → no cache hit, refetch + flash. Keys must match exactly.
83
+ - **Subscribing to the whole Zustand store** → re-renders on unrelated state changes. Use a selector.
84
+ - **`invalidateQueries` with an over-narrow key** → sibling queries stay stale. Invalidate at the right prefix level.
85
+ - **Putting server data in Zustand/Context "to share it"** → you reimplement caching/invalidation/refetch badly. Let TanStack Query own server state; share the *query*, not a copy.
86
+
87
+ ## Verify
88
+
89
+ - Search the touched files: no `fetch(`/axios call inside `useEffect` writing to `useState` remains for server data; those are now `useQuery`/`useMutation`.
90
+ - TypeScript/build passes; no v4-only API names (`cacheTime`, `useQuery({ onSuccess })`) left.
91
+ - Open React Query Devtools: each screen's queries appear with sensible keys, correct `fresh`/`stale` status, and no duplicate keys fetching the same data.
92
+ - Mutate something: the relevant query auto-refetches (invalidation works). For optimistic paths, throttle the network and confirm UI updates instantly then either persists or **rolls back** on a forced 500.
93
+ - SSR pages: disable JS or check the Network tab — initial data is in the server HTML and the client does **not** refetch on mount (no loading flash, no duplicate request).
94
+ - Profile a noisy interaction (React DevTools Profiler): components not consuming the changed slice do **not** re-render after the Context/Zustand split + selectors.
@@ -0,0 +1,61 @@
1
+ ---
2
+ name: mermaid-diagram
3
+ description: Turn requirements, code, or a system description into validated Mermaid diagrams (flowchart, sequence, class, ER, state, C4, Gantt, mindmap, git graph) and verify they render via mermaid-cli before delivering.
4
+ when_to_use: User asks to diagram/visualize a flow, architecture, data model, state machine, or sequence; needs a chart embedded in docs/markdown; "draw me the flow of X". Skip when the user wants an image edited or a non-Mermaid format (e.g. raw SVG, PlantUML, Graphviz/DOT) — those are different tools.
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Trigger when the intent is "show this as a diagram" and the target is Mermaid (renders natively in GitHub, GitLab, Obsidian, VS Code, most docs). Map intent → diagram type up front; picking wrong wastes a render cycle:
10
+
11
+ | Intent / phrasing | Diagram type | Header keyword |
12
+ |---|---|---|
13
+ | Steps, branches, "the flow of X", decision logic | flowchart | `flowchart TD` / `LR` |
14
+ | Actors talking over time, API calls, request/response | sequence | `sequenceDiagram` |
15
+ | DB schema, tables + relations + cardinality | ER | `erDiagram` |
16
+ | OOP model, types, fields, inheritance | class | `classDiagram` |
17
+ | Lifecycle, status machine, "states of an order" | state | `stateDiagram-v2` |
18
+ | System / container / service boundaries | C4 | `C4Context` / `C4Container` |
19
+ | Timeline, project phases, dependencies-over-time | gantt | `gantt` |
20
+ | Hierarchy of ideas, brainstorm tree | mindmap | `mindmap` |
21
+ | Branch/merge history | git graph | `gitGraph` |
22
+
23
+ If unsure between flowchart and sequence: **does order-of-time + who-does-what matter?** yes → sequence, no → flowchart.
24
+
25
+ ## Steps
26
+
27
+ 1. **Read the source of truth, don't guess.** If diagramming code, open the actual files/functions and trace real control flow, call edges, or schema — not an assumed design. For ER/class, pull field names and FKs from the real schema/models.
28
+ 2. **Pick the type** from the table above and the direction. Default `flowchart TD` (top-down) for processes; `LR` (left-right) when there are many sequential steps (wide reads better than tall).
29
+ 3. **Write the Mermaid source to a temp file**, e.g. `/tmp/diagram.mmd` — do not hand-build it inline; you need a file to validate.
30
+ 4. **Validate by rendering** — this is non-negotiable, a diagram that doesn't parse is a failure:
31
+ ```bash
32
+ npx -y @mermaid-js/mermaid-cli@latest -i /tmp/diagram.mmd -o /tmp/diagram.svg
33
+ ```
34
+ - First run downloads the package; that's expected. If `mmdc` is already on PATH, use it directly: `mmdc -i in.mmd -o out.svg`.
35
+ - On headless/CI or if Chromium fails to launch, add a puppeteer config: write `{"args":["--no-sandbox"]}` to `/tmp/pp.json` and append `-p /tmp/pp.json`.
36
+ - SVG is the cheapest/fastest output for a syntax check. Use `-o out.png` only if the user wants a raster image.
37
+ 5. **If it fails, fix the root cause and re-render.** Read the parser error (it gives a line/token), correct the syntax, run step 4 again. Loop until it renders clean. Never "soften" by deleting the failing node or wrapping problem text — fix the actual syntax (see Common Errors).
38
+ 6. **Tighten for readability** before delivery: short node labels (offload detail to surrounding prose), `subgraph` to group related nodes, consistent edge labels, and a sane direction. A correct-but-unreadable diagram still fails the task.
39
+ 7. **Deliver as a fenced ```mermaid block** ready to paste into markdown/Obsidian. Do not deliver the SVG path as the answer unless the user explicitly asked for an image file — the fenced source is the product.
40
+
41
+ ## Common Errors
42
+
43
+ - **Parens/brackets/quotes inside a label break the parser.** `A[Fetch (cached)]` fails. Wrap the whole label in double quotes: `A["Fetch (cached)"]`. Same fix for `:`, `;`, `#`, `&`, `<`, `>`, `/`.
44
+ - **HTML in labels needs `<br/>` not `\n`.** Newlines inside a node use `A["Line one<br/>Line two"]`. A literal `\n` renders as text.
45
+ - **Reserved word `end`** in all lowercase as a node id breaks flowcharts. Capitalize it (`End`, `END`) or use a different id such as `e_end`.
46
+ - **Sequence diagrams: declare or imply participants consistently.** Declaring `participant A as Auth` and then writing `Auth` in an arrow (the alias, not the id) creates a second, implicit participant instead of reusing `A`. Reference the **id** in arrows; the alias is display-only.
47
+ - **ER cardinality syntax is strict:** `USER ||--o{ ORDER : places` (one-to-many). Common crash is using `1` / `*` instead of the `||`, `o{`, `}o`, `|{` tokens.
48
+ - **C4 diagrams need the right header** (`C4Context`, `C4Container`, `C4Component`) and `Rel(a, b, "label")` — they do NOT use flowchart arrow syntax. Don't mix the two grammars.
49
+ - **`gantt` dates must match the declared `dateFormat`** (default `YYYY-MM-DD`). Declare `dateFormat` explicitly and keep every date in that format; freeform dates fail.
50
+ - **`%%` is the comment marker, not `//` or `#`.** A stray `//` comment is parsed as a node and throws.
51
+ - **`graph` is the legacy keyword; prefer `flowchart`.** Both parse, but `flowchart` gets newer features (e.g. `&` chaining). Don't mix dialects in one block.
52
+ - **Indentation matters in `mindmap`** — it's whitespace-structured like YAML, not bracket-structured. Inconsistent indent = wrong tree.
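
When flowchart source is generated from code rather than typed by hand, the first three label rules above can be applied automatically. A minimal sketch with hypothetical helper names; it assumes Mermaid's `#quot;` entity code for a double quote inside a quoted label, and it still needs the render check from step 4.

```typescript
// Wraps a label in double quotes so parens, colons, etc. do not break the parser.
function quoteLabel(text: string): string {
  const escaped = text
    .replace(/"/g, "#quot;") // entity code for a literal double quote
    .replace(/\r?\n/g, "<br/>"); // line breaks inside a node use <br/>
  return `"${escaped}"`;
}

// Avoids the all-lowercase reserved word `end` as a node id.
function safeNodeId(id: string): string {
  return id === "end" ? "End" : id;
}

// Emits one rectangular flowchart node, e.g. A["Fetch (cached)"].
function flowNode(id: string, label: string): string {
  return `${safeNodeId(id)}[${quoteLabel(label)}]`;
}
```

For instance, `flowNode("A", "Fetch (cached)")` yields `A["Fetch (cached)"]`, the quoted form recommended in the first bullet.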
53
+
54
+ ## Verify
55
+
56
+ Done only when ALL hold:
57
+ - [ ] `mmdc`/`npx ... mermaid-cli` exited 0 on the final source (an actual render happened — not "it looks right").
58
+ - [ ] Diagram type matches the user's intent (flow vs sequence vs schema), not just "a diagram."
59
+ - [ ] Every node/edge in the diagram maps to something real in the source code/spec — no invented steps, no dropped branches.
60
+ - [ ] Labels are concise and the layout direction reads cleanly; groups use `subgraph` where it reduces crossing edges.
61
+ - [ ] Delivered as a copy-paste-ready ```mermaid fenced block (plus the image path only if an image was requested).