sanook-cli 0.4.0 → 0.5.0

Files changed (235)
  1. package/.env.example +19 -0
  2. package/CHANGELOG.md +144 -0
  3. package/README.md +153 -20
  4. package/README.th.md +136 -0
  5. package/dist/agentContext.js +4 -0
  6. package/dist/approval.js +6 -0
  7. package/dist/bin.js +394 -51
  8. package/dist/brain.js +92 -59
  9. package/dist/brand.js +47 -0
  10. package/dist/checkpoint.js +37 -0
  11. package/dist/commands.js +86 -6
  12. package/dist/compaction.js +76 -5
  13. package/dist/config.js +100 -12
  14. package/dist/cost.js +60 -3
  15. package/dist/doctor.js +92 -0
  16. package/dist/gateway/auth.js +2 -2
  17. package/dist/gateway/ledger.js +2 -2
  18. package/dist/gateway/scheduler.js +1 -0
  19. package/dist/gateway/serve.js +6 -4
  20. package/dist/gateway/server.js +10 -2
  21. package/dist/git.js +11 -2
  22. package/dist/hooks.js +43 -17
  23. package/dist/knowledge.js +48 -49
  24. package/dist/loop.js +182 -66
  25. package/dist/lsp/client.js +173 -0
  26. package/dist/lsp/framing.js +56 -0
  27. package/dist/lsp/index.js +138 -0
  28. package/dist/lsp/servers.js +82 -0
  29. package/dist/mcp-server.js +244 -0
  30. package/dist/mcp.js +184 -29
  31. package/dist/memory-store.js +559 -0
  32. package/dist/memory.js +143 -29
  33. package/dist/orchestrate.js +150 -0
  34. package/dist/providers/codex.js +2 -2
  35. package/dist/providers/keys.js +3 -2
  36. package/dist/providers/registry.js +133 -1
  37. package/dist/repomap.js +93 -0
  38. package/dist/search/chunk.js +158 -0
  39. package/dist/search/embed-store.js +187 -0
  40. package/dist/search/engine.js +203 -0
  41. package/dist/search/fuse.js +35 -0
  42. package/dist/search/index-core.js +187 -0
  43. package/dist/search/indexer.js +241 -0
  44. package/dist/search/store.js +77 -0
  45. package/dist/session.js +42 -8
  46. package/dist/skill-install.js +10 -10
  47. package/dist/skills.js +12 -9
  48. package/dist/summarize.js +31 -0
  49. package/dist/tools/bash.js +21 -2
  50. package/dist/tools/diagnostics.js +41 -0
  51. package/dist/tools/edit.js +29 -7
  52. package/dist/tools/index.js +8 -1
  53. package/dist/tools/list.js +7 -2
  54. package/dist/tools/permission.js +90 -9
  55. package/dist/tools/read.js +23 -4
  56. package/dist/tools/remember.js +1 -1
  57. package/dist/tools/sandbox.js +61 -0
  58. package/dist/tools/search.js +105 -4
  59. package/dist/tools/task.js +195 -29
  60. package/dist/tools/timeout.js +35 -0
  61. package/dist/tools/util.js +10 -0
  62. package/dist/tools/write.js +6 -4
  63. package/dist/trust.js +89 -0
  64. package/dist/ui/app.js +218 -27
  65. package/dist/ui/banner.js +4 -9
  66. package/dist/ui/history.js +30 -0
  67. package/dist/ui/mentions.js +44 -0
  68. package/dist/ui/setup.js +6 -5
  69. package/dist/ui/useEditor.js +83 -0
  70. package/dist/update.js +114 -0
  71. package/dist/worktree.js +173 -0
  72. package/package.json +11 -5
  73. package/scripts/postinstall.mjs +33 -0
  74. package/second-brain/.agents/_Index.md +30 -0
  75. package/second-brain/.agents/skills/_Index.md +30 -0
  76. package/second-brain/.agents/workflows/_Index.md +30 -0
  77. package/second-brain/AGENTS.md +4 -4
  78. package/second-brain/Acceptance/_Index.md +30 -0
  79. package/second-brain/Acceptance/golden-case-template.md +39 -0
  80. package/second-brain/Areas/_Index.md +30 -0
  81. package/second-brain/Bugs/System-OS/_Index.md +30 -0
  82. package/second-brain/Bugs/_Index.md +30 -0
  83. package/second-brain/CLAUDE.md +4 -1
  84. package/second-brain/Checklists/_Index.md +30 -0
  85. package/second-brain/Checklists/preflight-postflight-template.md +29 -0
  86. package/second-brain/Distillations/_Index.md +30 -0
  87. package/second-brain/Entities/_Index.md +30 -0
  88. package/second-brain/Entities/entity-template.md +33 -0
  89. package/second-brain/Evals/_Index.md +30 -0
  90. package/second-brain/Evals/correction-pairs.md +24 -0
  91. package/second-brain/Evals/failure-taxonomy.md +24 -0
  92. package/second-brain/Evals/golden-set.md +25 -0
  93. package/second-brain/Evals/quality-ledger.md +23 -0
  94. package/second-brain/Evals/self-eval-rubric.md +23 -0
  95. package/second-brain/GEMINI.md +4 -4
  96. package/second-brain/Goals/_Index.md +30 -0
  97. package/second-brain/Handoffs/_Index.md +30 -0
  98. package/second-brain/Home.md +7 -0
  99. package/second-brain/Intake/Raw Sources/_Index.md +30 -0
  100. package/second-brain/Intake/_Index.md +30 -0
  101. package/second-brain/Intake/_Quarantine/_Index.md +30 -0
  102. package/second-brain/Learning/_Index.md +30 -0
  103. package/second-brain/Playbooks/_Index.md +30 -0
  104. package/second-brain/Playbooks/playbook-template.md +23 -0
  105. package/second-brain/Projects/_Index.md +30 -0
  106. package/second-brain/Prompts/_Index.md +30 -0
  107. package/second-brain/README.md +2 -1
  108. package/second-brain/Research/_Index.md +30 -0
  109. package/second-brain/Retrospectives/_Index.md +30 -0
  110. package/second-brain/Reviews/_Index.md +30 -0
  111. package/second-brain/Runbooks/_Index.md +30 -0
  112. package/second-brain/Runbooks/eval-loop.md +24 -0
  113. package/second-brain/Sessions/_Index.md +30 -0
  114. package/second-brain/Shared/AI-Context-Index.md +20 -0
  115. package/second-brain/Shared/AI-Threads/_Index.md +30 -0
  116. package/second-brain/Shared/Archive/_Index.md +30 -0
  117. package/second-brain/Shared/Assets/_Index.md +30 -0
  118. package/second-brain/Shared/Context-Packs/_Index.md +30 -0
  119. package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
  120. package/second-brain/Shared/Coordination/NOW.md +28 -0
  121. package/second-brain/Shared/Coordination/_Index.md +30 -0
  122. package/second-brain/Shared/Coordination/agent-registry.md +24 -0
  123. package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
  124. package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
  125. package/second-brain/Shared/Coordination/task-board.md +32 -0
  126. package/second-brain/Shared/Core-Facts/_Index.md +30 -0
  127. package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
  128. package/second-brain/Shared/Glossary/_Index.md +30 -0
  129. package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
  130. package/second-brain/Shared/Operating-State/_Index.md +30 -0
  131. package/second-brain/Shared/Prompting/_Index.md +30 -0
  132. package/second-brain/Shared/Provenance/_Index.md +30 -0
  133. package/second-brain/Shared/Rules/_Index.md +30 -0
  134. package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
  135. package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
  136. package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
  137. package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
  138. package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
  139. package/second-brain/Shared/Rules/rules-formatting.md +34 -0
  140. package/second-brain/Shared/Scripts/_Index.md +30 -0
  141. package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
  142. package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
  143. package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
  144. package/second-brain/Shared/User-Memory/_Index.md +30 -0
  145. package/second-brain/Shared/User-Persona/_Index.md +30 -0
  146. package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
  147. package/second-brain/Shared/Working-Memory/_Index.md +30 -0
  148. package/second-brain/Shared/_Index.md +30 -0
  149. package/second-brain/Shared/mcp-servers/_Index.md +30 -0
  150. package/second-brain/Skills/_Index.md +30 -0
  151. package/second-brain/Templates/_Index.md +30 -0
  152. package/second-brain/Templates/bug.md +2 -0
  153. package/second-brain/Templates/handoff.md +2 -0
  154. package/second-brain/Templates/session.md +2 -0
  155. package/second-brain/Tools/_Index.md +30 -0
  156. package/second-brain/Traces/_Index.md +30 -0
  157. package/second-brain/Vault Structure Map.md +33 -1
  158. package/second-brain/copilot/_Index.md +30 -0
  159. package/skills/audit-license-compliance/SKILL.md +117 -0
  160. package/skills/author-codemod/SKILL.md +110 -0
  161. package/skills/build-audit-logging/SKILL.md +112 -0
  162. package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
  163. package/skills/build-cli-tool/SKILL.md +108 -0
  164. package/skills/build-data-table/SKILL.md +141 -0
  165. package/skills/build-native-mobile-ui/SKILL.md +154 -0
  166. package/skills/build-offline-first-sync/SKILL.md +118 -0
  167. package/skills/build-realtime-channel/SKILL.md +122 -0
  168. package/skills/build-vector-search/SKILL.md +131 -0
  169. package/skills/compose-local-dev-stack/SKILL.md +149 -0
  170. package/skills/configure-bundler-build/SKILL.md +166 -0
  171. package/skills/configure-dns-tls/SKILL.md +142 -0
  172. package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
  173. package/skills/configure-security-headers-csp/SKILL.md +122 -0
  174. package/skills/contract-testing/SKILL.md +140 -0
  175. package/skills/datetime-timezone-correctness/SKILL.md +125 -0
  176. package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
  177. package/skills/debug-flaky-tests/SKILL.md +128 -0
  178. package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
  179. package/skills/deliver-webhooks/SKILL.md +116 -0
  180. package/skills/design-api-pagination/SKILL.md +144 -0
  181. package/skills/design-authorization-model/SKILL.md +119 -0
  182. package/skills/design-backup-dr-recovery/SKILL.md +113 -0
  183. package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
  184. package/skills/design-multi-tenancy/SKILL.md +100 -0
  185. package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
  186. package/skills/design-relational-schema/SKILL.md +129 -0
  187. package/skills/design-search-index-infra/SKILL.md +151 -0
  188. package/skills/design-state-machine/SKILL.md +108 -0
  189. package/skills/design-token-system/SKILL.md +109 -0
  190. package/skills/distributed-locks-leases/SKILL.md +120 -0
  191. package/skills/encrypt-sensitive-data/SKILL.md +148 -0
  192. package/skills/feature-flags-rollout/SKILL.md +130 -0
  193. package/skills/file-upload-object-storage/SKILL.md +107 -0
  194. package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
  195. package/skills/harden-llm-app-reliability/SKILL.md +126 -0
  196. package/skills/i18n-localization-setup/SKILL.md +113 -0
  197. package/skills/idempotency-keys/SKILL.md +107 -0
  198. package/skills/implement-push-notifications/SKILL.md +142 -0
  199. package/skills/ingest-webhook-secure/SKILL.md +120 -0
  200. package/skills/integrate-oauth-oidc/SKILL.md +126 -0
  201. package/skills/load-stress-test/SKILL.md +129 -0
  202. package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
  203. package/skills/model-nosql-data/SKILL.md +118 -0
  204. package/skills/money-decimal-arithmetic/SKILL.md +123 -0
  205. package/skills/monitor-ml-drift/SKILL.md +109 -0
  206. package/skills/numeric-precision-units/SKILL.md +144 -0
  207. package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
  208. package/skills/optimize-react-rerenders/SKILL.md +124 -0
  209. package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
  210. package/skills/payments-billing-integration/SKILL.md +114 -0
  211. package/skills/pin-toolchain-versions/SKILL.md +116 -0
  212. package/skills/plan-strangler-migration/SKILL.md +95 -0
  213. package/skills/property-based-testing/SKILL.md +108 -0
  214. package/skills/publish-package-registry/SKILL.md +130 -0
  215. package/skills/recover-git-state/SKILL.md +119 -0
  216. package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
  217. package/skills/resilience-timeouts-retries/SKILL.md +104 -0
  218. package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
  219. package/skills/rewrite-git-history/SKILL.md +109 -0
  220. package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
  221. package/skills/schema-evolution-compatibility/SKILL.md +121 -0
  222. package/skills/send-transactional-email/SKILL.md +126 -0
  223. package/skills/serve-deploy-ml-model/SKILL.md +107 -0
  224. package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
  225. package/skills/setup-devcontainer-env/SKILL.md +131 -0
  226. package/skills/setup-lint-format-precommit/SKILL.md +140 -0
  227. package/skills/setup-monorepo-tooling/SKILL.md +125 -0
  228. package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
  229. package/skills/structured-output-llm/SKILL.md +86 -0
  230. package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
  231. package/skills/test-data-factories/SKILL.md +158 -0
  232. package/skills/threat-model-stride/SKILL.md +123 -0
  233. package/skills/train-evaluate-ml-model/SKILL.md +109 -0
  234. package/skills/unicode-text-correctness/SKILL.md +109 -0
  235. package/skills/visual-regression-testing/SKILL.md +120 -0
@@ -0,0 +1,123 @@
1
+ ---
2
+ name: threat-model-stride
3
+ description: Produces a design-level STRIDE threat model — decomposes the architecture into a data-flow diagram with trust boundaries, enumerates threats per element, rates them by likelihood × impact, and records mitigations and signed-off residual risk. Use before building or substantially changing a system that handles untrusted input, secrets, money, or PII.
4
+ when_to_use: A new service, public API, auth flow, multi-tenant boundary, or agent/tool surface is being designed and you need "what could go wrong here?" answered before code exists. Distinct from security-review (audits an already-written diff line by line) and write-rfc (proposes the design itself).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this when the question is **what an adversary could do to a design**, before the design is built:
10
+
11
+ - "Threat model this new payments/checkout service"
12
+ - "We're adding a multi-tenant boundary — where can one tenant reach another's data?"
13
+ - "New public API / webhook ingress / agent tool surface — enumerate the attack surface"
14
+ - "The security RFC needs a threats section and a residual-risk register"
15
+ - "What trust boundaries does this auth flow cross, and what crosses each one?"
16
+
17
+ NOT this skill:
18
+ - Auditing already-written code for injection/SSRF/secrets line by line → security-review (this skill works on a diagram, not a diff)
19
+ - Writing the design/proposal itself (motivation, alternatives, rollout) → write-rfc (threat-model is one section feeding it)
20
+ - Implementing the login/JWT/session controls a threat surfaces → auth-jwt-session
21
+ - Storing/rotating the secrets a threat targets → secrets-management
22
+ - Hardening one webhook endpoint's signature/replay handling → ingest-webhook-secure
23
+ - Responding to an attack happening **now** → incident-response-sre
24
+
25
+ ## Steps
26
+
27
+ 1. **Define scope, assets, and adversaries first — never enumerate threats against an unbounded system.** Write three lists before drawing anything:
28
+ - **Assets** — what you protect, by category: *confidentiality* (PII, secrets, tokens), *integrity* (balances, order state, audit log), *availability* (checkout, login). Name the concrete data, not "the database."
29
+ - **Adversaries** — pick from this set and state each one's starting position:
30
+
31
+ | Adversary | Starts with | Typically drives |
32
+ |---|---|---|
33
+ | Anonymous internet | Network reachability only | S, D, I (info disclosure via errors) |
34
+ | Authenticated user | Valid session, own tenant | E (priv-esc), tenant-boundary I/T |
35
+ | Malicious tenant (multi-tenant) | Valid account, own data | Cross-tenant I (read), T (write) |
36
+ | Insider / operator | Prod console, some creds | R (repudiation), I, E |
37
+ | Compromised dependency | Code execution in one process | S, T, E across the process boundary |
38
+ | Stolen credential / token | One leaked secret | S, blast-radius of that secret |
39
+
40
+ - **In/out of scope** — explicitly list what you will NOT model (e.g. "physical datacenter security: out; we trust the cloud provider's hypervisor"). Unstated scope = infinite scope.
41
+
42
+ 2. **Draw the data-flow diagram as validated Mermaid — four element types plus boundaries, no more.** DFD elements: **External Entity** (square — user, third-party API), **Process** (round — your service/lambda), **Data Store** (cylinder — DB, queue, bucket, cache), **Data Flow** (arrow, labeled with protocol + what data). A **trust boundary** is a dashed box crossing one or more flows where the privilege/trust level changes. The four boundaries to always look for: **network edge** (internet → DMZ), **authz** (unauthenticated → authenticated), **tenant** (tenant A → shared/tenant B), **process** (your code → third-party/dependency code).
43
+
44
+ ```mermaid
45
+ flowchart LR
46
+ subgraph edge["Network edge — untrusted"]
47
+ user["Browser (external entity)"]
48
+ end
49
+ subgraph trusted["Authenticated · single-tenant"]
50
+ api("API service")
51
+ worker("Async worker")
52
+ end
53
+ db[("Orders DB")]
54
+ pay["Stripe API (external entity)"]
55
+ user -->|"HTTPS · login creds"| api
56
+ api -->|"SQL · tenant_id scoped"| db
57
+ api -->|"enqueue · job payload"| worker
58
+ worker -->|"HTTPS · card token"| pay
59
+ ```
60
+
61
+ Shapes match the legend above: `[ ]` square = external entity, `( )` round = process, `[( )]` cylinder = data store, `-->|label|` = data flow. Each `subgraph` is a trust boundary. Mermaid draws subgraph borders solid by default, so add a `style <subgraph-id> stroke-dasharray: 5 5` line per boundary to get the dashed box. `db` and `pay` sit outside both boundaries on purpose; that is the point where trust changes.
62
+
63
+ Validate it before continuing: `npx -y @mermaid-js/mermaid-cli -i model.mmd -o model.svg` (or paste into mermaid.live). A diagram that doesn't render isn't a deliverable. If the system is large, model **one boundary-crossing flow per diagram** rather than one unreadable mega-graph.
64
+
65
+ 3. **Walk every flow that crosses a boundary and apply STRIDE per element.** Do not brainstorm freely — march the checklist. STRIDE maps to the property each threat violates:
66
+
67
+ | Letter | Threat | Violates | Ask at this element |
68
+ |---|---|---|---|
69
+ | **S** | Spoofing | Authentication | Can the caller forge who they are? (no/weak auth, replayable token) |
70
+ | **T** | Tampering | Integrity | Can data in transit or at rest be altered? (no TLS, no signature, mutable audit log) |
71
+ | **R** | Repudiation | Non-repudiation | Can an actor deny an action? (no/forgeable logs, shared accounts) |
72
+ | **I** | Information disclosure | Confidentiality | Can data leak? (verbose errors, missing authz check, IDOR, unencrypted store) |
73
+ | **D** | Denial of service | Availability | Can it be exhausted? (unbounded input, no rate limit, amplification) |
74
+ | **E** | Elevation of privilege | Authorization | Can a lower-privilege actor gain higher rights? (missing tenant scope, broken RBAC, injection → RCE) |
75
+
76
+ Apply the elements-affected rule to save time: **External entities** → S, R. **Processes** → all six. **Data stores** → T, I, D (and R if it's the audit log). **Data flows** → T, I, D. Record each threat as one row: `<element> | <STRIDE letter> | <concrete attack> | <adversary from step 1>`. Concrete means "authenticated user changes `tenant_id` in the path param and reads another tenant's orders (I via IDOR)", not "data could leak."
77
+
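The elements-affected rule in step 3 can be applied mechanically. The following is a minimal sketch, assuming each DFD element is recorded as a `(name, kind, is_audit_log)` tuple; the `APPLICABLE` table, the `checklist` helper, and the tuple layout are illustrative names, not part of the skill.

```python
# Hypothetical helper: expand DFD elements into the STRIDE letters to walk.
# The mapping mirrors the elements-affected rule from step 3.
APPLICABLE = {
    "external_entity": "SR",
    "process": "STRIDE",
    "data_store": "TID",
    "data_flow": "TID",
}


def checklist(elements):
    """Return (element name, STRIDE letter) pairs that still need a threat row."""
    rows = []
    for name, kind, is_audit_log in elements:
        letters = APPLICABLE[kind]
        # Audit-log stores also get Repudiation.
        if kind == "data_store" and is_audit_log:
            letters += "R"
        rows.extend((name, letter) for letter in letters)
    return rows


# Elements taken from the example DFD in step 2.
elements = [
    ("Browser", "external_entity", False),
    ("API service", "process", False),
    ("Orders DB", "data_store", False),
    ("api -> db (SQL)", "data_flow", False),
]
rows = checklist(elements)
for name, letter in rows:
    print(f"{name} | {letter} | <concrete attack> | <adversary from step 1>")
```

Each printed row is only a prompt; the concrete attack and the adversary still have to be written by hand.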
78
+ 4. **Rate each threat likelihood × impact, then rank.** Use a 3×3 so disagreements are cheap:
79
+
80
+ | | Impact: Low | Impact: Med | Impact: High |
81
+ |---|---|---|---|
82
+ | **Likelihood: High** | Medium | High | **Critical** |
83
+ | **Likelihood: Med** | Low | Medium | High |
84
+ | **Likelihood: Low** | Low | Low | Medium |
85
+
86
+ Likelihood = how exposed + how easy (anonymous-reachable + no skill = High; insider-only + needs prod creds = Low). Impact = blast radius on the asset (all-tenant PII dump = High; one user's display name = Low). Sort the threat table by rating, Critical first. Rate on the controls that *exist today*, never on ones you plan to build — planned controls earn their reduction in the disposition step (5), not here.
87
+
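The 3×3 matrix in step 4 is small enough to encode as a lookup. The sketch below assumes threats are plain dicts with `likelihood` and `impact` keys; `MATRIX`, `rate`, and `rank` are hypothetical names chosen for illustration.

```python
# Likelihood x impact lookup, transcribed from the 3x3 table in step 4.
MATRIX = {
    ("High", "Low"): "Medium", ("High", "Med"): "High", ("High", "High"): "Critical",
    ("Med", "Low"): "Low", ("Med", "Med"): "Medium", ("Med", "High"): "High",
    ("Low", "Low"): "Low", ("Low", "Med"): "Low", ("Low", "High"): "Medium",
}
SORT_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


def rate(likelihood, impact):
    """Return the rating for one threat; raises KeyError on an unknown level."""
    return MATRIX[(likelihood, impact)]


def rank(threats):
    """Return a new list with a 'rating' key added, sorted Critical-first."""
    rated = [{**t, "rating": rate(t["likelihood"], t["impact"])} for t in threats]
    return sorted(rated, key=lambda t: SORT_ORDER[t["rating"]])


threats = [
    {"id": "T1", "likelihood": "Low", "impact": "High"},
    {"id": "T2", "likelihood": "High", "impact": "High"},
    {"id": "T3", "likelihood": "Med", "impact": "Low"},
]
ranked = rank(threats)
for t in ranked:
    print(t["id"], t["rating"])
```

Because `sorted` is stable, threats with the same rating keep the order in which they were enumerated.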
88
+ 5. **Disposition every threat: exactly one of four, none left blank or merely "noted."**
89
+ - **Mitigate** — name the *specific* control (e.g. "scope every query by `tenant_id` from the session, never from the request; add a row-level-security policy as defense-in-depth"). A mitigation without a named control is not a mitigation.
90
+ - **Eliminate** — remove the feature/flow/data that creates the threat (don't store the PAN; tokenize at the edge so the card number never enters scope).
91
+ - **Transfer** — push to a party who owns it (offload card storage to a PCI-compliant processor; buy insurance). Note who now owns it.
92
+ - **Accept** — only with a named sign-off and an expiry. An accepted risk needs an owner, a date, and a re-review trigger; otherwise it's a silent gap.
93
+
94
+ Default bias: **Critical/High must be mitigated or eliminated before ship.** Medium may be accepted with sign-off. Low may be accepted by the team lead.
95
+
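The disposition rules in step 5 lend themselves to a lint pass over the threat table. This is a sketch under assumptions: the dict keys (`disposition`, `control`, `owner`, `expiry`, `review_trigger`, `rating`) are a hypothetical row schema, and the check encodes the default bias stated above rather than any mandatory policy.

```python
from datetime import date

DISPOSITIONS = {"mitigate", "eliminate", "transfer", "accept"}


def disposition_problems(threat, today):
    """Return a list of rule violations for one threat row (empty means OK)."""
    disposition = threat.get("disposition")
    if disposition not in DISPOSITIONS:
        return ["disposition must be one of mitigate/eliminate/transfer/accept"]

    problems = []
    if disposition == "mitigate" and not threat.get("control"):
        problems.append("mitigate needs a named control")
    if disposition == "transfer" and not threat.get("owner"):
        problems.append("transfer needs the party that now owns the risk")
    if disposition == "accept":
        # An accepted risk needs an owner, a date, and a re-review trigger.
        for field in ("owner", "expiry", "review_trigger"):
            if not threat.get(field):
                problems.append(f"accept needs {field}")
        expiry = threat.get("expiry")
        if expiry and expiry <= today:
            problems.append("acceptance has expired")
    # Default bias: Critical/High must be mitigated or eliminated before ship.
    if threat["rating"] in ("Critical", "High") and disposition not in ("mitigate", "eliminate"):
        problems.append("Critical/High must be mitigated or eliminated before ship")
    return problems


accepted = {
    "rating": "Medium", "disposition": "accept", "owner": "team-lead",
    "expiry": date(2030, 1, 1), "review_trigger": "new class of PII",
}
print(disposition_problems(accepted, today=date(2025, 1, 1)))
```

Passing `today` explicitly keeps the check deterministic, which matters if it runs in CI against a committed `threat-model.md` export.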
96
+ 6. **Map each mitigation to a real engineering task and link existing controls.** For every "mitigate," produce a tracked task (`SEC-123: enforce tenant_id from session in OrderRepository`) and point at the control that delivers it — frequently a sibling skill: rate limiting (D) → rate-limiting; auth/RBAC/IDOR (S/E) → auth-jwt-session; secret handling (S, I) → secrets-management; webhook signature/replay (S/T) → ingest-webhook-secure. Mark which controls **already exist** (TLS everywhere, WAF) vs **must be built** so the model doubles as a backlog.
97
+
98
+ 7. **Emit the deliverables — the model is the artifact, not the conversation.** Write a `threat-model.md` containing: scope/assets/adversaries (step 1), the validated DFD (step 2), the rated threat table with dispositions (steps 3–5), an explicit **abuse-cases** list (the top attacker stories: "as a malicious tenant I enumerate IDs to read others' invoices"), and a **residual-risk register** (every Accept row: threat, rating, owner, sign-off, expiry). Finish with **re-model triggers** — the events that invalidate this model and require a redo (new trust boundary, new external integration, auth model change, new class of PII, major arch change). A threat model with no expiry condition rots silently.
99
+
100
+ ## Common Errors
101
+
102
+ - **Listing threats with no diagram.** Without the DFD you miss the boundary-crossing flows that produce the real threats. Draw and validate the diagram first; enumerate per element second.
103
+ - **Missing trust boundaries entirely (or only drawing the network edge).** The expensive bugs live at the *authz* and *tenant* boundaries, not the firewall. Every place trust level changes gets a dashed box.
104
+ - **Vague threats: "data could be leaked."** Unactionable and unrateable. Write the concrete path: which actor, which element, which parameter, which STRIDE letter.
105
+ - **Skipping STRIDE letters because they "feel unlikely."** That's what rating is for — enumerate all applicable letters per element, then let likelihood × impact triage. Skipping at enumeration time hides the threat; skipping at rating time is a defensible decision.
106
+ - **Accepting risk with no owner/expiry.** "We'll accept that" in a meeting evaporates. An accepted risk is only accepted when it's in the register with a name, a date, and a re-review trigger.
107
+ - **Modeling the whole company.** Scope creep makes the model useless. Bound it to the one service/flow/change and explicitly list what's out of scope (step 1).
108
+ - **Rating on aspirational controls.** Rating a threat "Low" because of a mitigation you *plan* to build inflates safety. Rate on what exists today; the disposition step is where planned controls earn their reduction.
109
+ - **Mitigation = "add validation" / "we'll be careful."** Not a control. Name the mechanism (RLS policy, signed token with `aud` check, allowlist, rate limiter) and the task that builds it.
110
+ - **Treating it as one-and-done.** A model with no re-model triggers is stale the next time a boundary moves. List the triggers that force a redo.
111
+ - **Confusing this with a code audit.** STRIDE on a diagram finds design flaws (missing boundary, IDOR by design); it will not find a SQL-injection typo in line 88 — that's security-review on the diff.
112
+
113
+ ## Verify
114
+
115
+ 1. **Diagram renders:** `npx -y @mermaid-js/mermaid-cli -i model.mmd -o model.svg` exits 0 and the SVG shows every external entity, process, store, flow, and at least one dashed trust boundary.
116
+ 2. **Boundary coverage:** Every flow that crosses a trust boundary has at least one threat row. Pick any boundary-crossing arrow at random — it must appear in the threat table.
117
+ 3. **STRIDE coverage:** Each process element was evaluated against all six letters (each letter either has a threat row or an explicit "N/A — why"); stores and flows covered for T/I/D.
118
+ 4. **Every threat is concrete and rated:** No row reads "data could leak"; each names actor + element + attack, carries a likelihood × impact rating, and is sorted Critical-first.
119
+ 5. **Every threat is dispositioned:** Each row is exactly one of mitigate / eliminate / transfer / accept — zero "noted" or blank. Mitigations name a control and link a tracked task.
120
+ 6. **Residual register complete:** Every Accept appears in the residual-risk register with owner, sign-off, and expiry. No Critical/High is in Accept without explicit named sign-off.
121
+ 7. **Abuse cases + re-model triggers present:** The doc lists the top attacker stories and the events that invalidate the model.
122
+
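The boundary-coverage check (item 2) can be automated once the DFD is available as data. A minimal sketch, assuming nodes are mapped to the subgraph they sit in (`None` for nodes outside every boundary) and threat rows carry a `flow` key; `ZONE`, `FLOWS`, and both helpers are illustrative names, with the node ids borrowed from the example diagram in step 2.

```python
# Node -> trust zone, following the example DFD (None = outside every subgraph).
ZONE = {"user": "edge", "api": "trusted", "worker": "trusted", "db": None, "pay": None}
FLOWS = [("user", "api"), ("api", "db"), ("api", "worker"), ("worker", "pay")]


def crossing_flows(flows, zone):
    """Flows whose source and destination sit in different trust zones."""
    # Caveat: two nodes that are both outside every subgraph share zone None,
    # so a flow between them would not be reported by this simple comparison.
    return [(src, dst) for src, dst in flows if zone[src] != zone[dst]]


def uncovered(flows, zone, threat_rows):
    """Boundary-crossing flows that have no threat row yet."""
    covered = {row["flow"] for row in threat_rows}
    return [flow for flow in crossing_flows(flows, zone) if flow not in covered]


threat_rows = [{"flow": ("user", "api"), "stride": "S", "attack": "replayed session token"}]
print(uncovered(FLOWS, ZONE, threat_rows))
```

An empty result means every boundary-crossing arrow appears in the threat table at least once; it says nothing about whether the rows are concrete or rated.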
123
+ Done = the Mermaid DFD renders with explicit trust boundaries, every boundary-crossing flow has STRIDE-enumerated threats that are each rated and dispositioned, no Critical/High sits in Accept without named sign-off, and the doc ships a residual-risk register plus re-model triggers.
@@ -0,0 +1,109 @@
1
+ ---
2
+ name: train-evaluate-ml-model
3
+ description: Trains and evaluates a classic (non-LLM) ML model — business-aligned metric selection, leakage-safe train/validation/test splits, Pipeline-scoped feature engineering, baseline-first model selection, cross-validated hyperparameter tuning, bias/variance diagnosis, and experiment tracking — guarding against data leakage and overfitting.
4
+ when_to_use: Fitting and validating a classification, regression, ranking, forecasting, or clustering model on tabular/feature data. Distinct from profile-dataset (EDA only), wrangle-tabular-data (cleaning/shaping the feature table), serve-deploy-ml-model (deployment), monitor-ml-drift (post-deploy), and rag-pipeline/prompt-engineering (LLM work).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the request is to **fit and validate a model that predicts**, not to explore or clean data:
10
+
11
+ - "Train a classifier to predict churn / fraud / default and tell me if it's good"
12
+ - "Build a regression/forecasting model for demand/price and pick the best one"
13
+ - "My model gets 99% accuracy — is that real or leakage?"
14
+ - "Tune hyperparameters / cross-validate / compare XGBoost vs logistic regression"
15
+ - "Cluster these customers into segments"
16
+
17
+ NOT this skill:
18
+ - Summary stats, distributions, correlations, missingness *before* modeling → profile-dataset
19
+ - Cleaning, type coercion, joins, dedup, resampling to build the feature table → wrangle-tabular-data
20
+ - Asserting schema/range/null contracts on the data before training → validate-data-quality
21
+ - Packaging the trained model behind an API / batch job → serve-deploy-ml-model
22
+ - Watching a live model's inputs/outputs degrade over time → monitor-ml-drift
23
+ - Scoring or grading an LLM's outputs against a rubric/golden set → llm-eval-harness
24
+ - Getting an LLM to answer over a corpus, or designing prompts → rag-pipeline / prompt-engineering
25
+
26
+ ## Steps
27
+
28
+ 1. **Choose the metric from the business cost FIRST — never optimize bare accuracy on imbalanced data.** A 1%-positive fraud set scores 99% accuracy by predicting all-negative and catches zero fraud. Pick before you train:
29
+
30
+ | Task / situation | Metric | Why |
31
+ |---|---|---|
32
+ | Imbalanced classification, cost of missing a positive high | Recall @ fixed precision, or **PR-AUC** | accuracy & ROC-AUC look great while recall is ~0 |
33
+ | Imbalanced, ranking/threshold-free comparison | **PR-AUC** (not ROC-AUC) | ROC-AUC is optimistic under heavy imbalance |
34
+ | Balanced classification | F1 or ROC-AUC | symmetric cost |
35
+ | Asymmetric FP vs FN cost | Expected cost = `cFP·FP + cFN·FN`, tune threshold | maps directly to money |
36
+ | Regression, outliers matter | RMSE | penalizes large errors |
37
+ | Regression, robust to outliers | MAE (MAPE only when targets stay well away from zero) | business reads "off by X" |
38
+ | Ranking / recsys | NDCG@k, MAP@k | position-aware |
39
+ | Forecasting | MASE / sMAPE vs naive | scale-free, must beat seasonal-naive |
40
+ | Clustering (no labels) | silhouette + downstream business check | inertia alone is meaningless |
41
+
42
+ Fix the **decision threshold** from the cost matrix later — don't ship the default 0.5.
43
+
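To make the expected-cost row and the threshold advice concrete, here is a small sketch that scans thresholds and picks the cheapest one. It assumes binary labels, scores in [0, 1], and made-up costs (`c_fp=1`, `c_fn=10`); the function names and the 0.01 grid are illustrative choices.

```python
def expected_cost(y_true, scores, threshold, c_fp, c_fn):
    """Total cost = c_fp * FP + c_fn * FN when predicting positive at score >= threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return c_fp * fp + c_fn * fn


def best_threshold(y_true, scores, c_fp, c_fn):
    """Scan a 0.00..1.00 grid and return the first threshold with minimal cost."""
    grid = [i / 100 for i in range(101)]
    return min(grid, key=lambda t: expected_cost(y_true, scores, t, c_fp, c_fn))


# Toy validation scores; a missed positive costs 10x a false alarm (assumed).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

default_cost = expected_cost(y_true, scores, 0.5, c_fp=1, c_fn=10)
tuned = best_threshold(y_true, scores, c_fp=1, c_fn=10)
tuned_cost = expected_cost(y_true, scores, tuned, c_fp=1, c_fn=10)
print(default_cost, tuned, tuned_cost)
```

Pick the threshold on validation data only; the held-out test set should see the already-chosen value.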
44
+ 2. **Split BEFORE any feature engineering or fitting — this is the #1 leakage source.** Three sets: train / validation (or CV folds) / **held-out test touched once at the very end**. The split strategy is not optional — pick by data structure:
45
+ - **Random stratified** for i.i.d. rows: `train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)`.
46
+ - **Time-based** for any temporal data — train on past, test on future, never shuffle. Use `TimeSeriesSplit` for CV. A random split on time-series leaks the future and inflates every metric.
47
+ - **Group split** when rows share an entity (same user/patient/device across rows): `GroupKFold` / `StratifiedGroupKFold` so the same group never appears in both train and test.
48
+
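The invariants behind the time-based and group splits can be shown without any library. The sketch below is a hand-rolled illustration of what those strategies guarantee, not a replacement for scikit-learn's splitters; `group_holdout` and `time_holdout` are hypothetical helpers.

```python
import random


def group_holdout(groups, test_fraction=0.2, seed=42):
    """Split row indices so that no group appears in both train and test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


def time_holdout(timestamps, cutoff):
    """Train strictly on rows before the cutoff, test on rows at or after it."""
    train_idx = [i for i, t in enumerate(timestamps) if t < cutoff]
    test_idx = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train_idx, test_idx


groups = ["u1", "u1", "u2", "u3", "u3", "u4", "u5", "u5"]
train_idx, test_idx = group_holdout(groups)
overlap = {groups[i] for i in train_idx} & {groups[i] for i in test_idx}
print(train_idx, test_idx, overlap)
```

An empty `overlap` set is the property to assert in any pipeline that uses a group split, whichever splitter produces the indices.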
49
+ 3. **Engineer features inside a `Pipeline` fit only on train, then run a leakage audit.** Every transform that *learns* (imputation means, scaler stats, target/one-hot encoders, feature selection) must `.fit` on train and only `.transform` val/test — otherwise test statistics leak in. Wrap it:
50
+
51
+ ```python
52
+ from sklearn.pipeline import Pipeline
53
+ from sklearn.compose import ColumnTransformer
54
+ from sklearn.impute import SimpleImputer
55
+ from sklearn.preprocessing import StandardScaler, OneHotEncoder
56
+ from sklearn.model_selection import cross_val_score
57
+
58
+ pre = ColumnTransformer([
59
+ ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
60
+ ("sc", StandardScaler())]), num_cols),
61
+ ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
62
+ ])
63
+ pipe = Pipeline([("pre", pre), ("model", model)])
64
+ # CV refits `pre` per fold → no leakage across folds
65
+ cross_val_score(pipe, X_train, y_train, cv=cv, scoring="average_precision")
66
+ ```
67
+
68
+ **Leakage audit checklist** — a feature is leaky if it: (a) is derived from the target or post-outcome (e.g. `payment_received` predicting `will_pay`); (b) encodes future information unavailable at prediction time; (c) is an ID/timestamp that proxies the label; (d) was computed using full-dataset statistics before the split. If any single feature gives a near-perfect score, it's leakage, not skill.
69
+
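The "single feature gives a near-perfect score" check can be run feature by feature. This is a minimal sketch assuming binary labels and numeric features held in plain lists; the 0.99 cutoff and the helper names are assumptions, and the pairwise AUC is quadratic, so it suits a sample rather than a full training set.

```python
def single_feature_auc(y_true, values):
    """ROC-AUC of one numeric feature via pairwise comparison (ties count half)."""
    pos = [v for y, v in zip(y_true, values) if y == 1]
    neg = [v for y, v in zip(y_true, values) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def suspicious_features(y_true, features, cutoff=0.99):
    """Names of features that separate the classes almost perfectly on their own."""
    flagged = []
    for name, values in features.items():
        score = single_feature_auc(y_true, values)
        # A feature can be leaky in either direction, so check both.
        if max(score, 1 - score) >= cutoff:
            flagged.append(name)
    return flagged


# Toy training sample: `payment_received` mirrors the target exactly.
y_train = [0, 0, 1, 1, 0, 1]
features = {
    "payment_received": [0, 0, 1, 1, 0, 1],
    "tenure_months": [3, 8, 5, 2, 6, 7],
}
flagged = suspicious_features(y_train, features)
print(flagged)
```

A flagged feature is a prompt to trace where the column comes from, not proof of leakage by itself; run the check on training rows only.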
70
+ 4. **Establish a dumb baseline before any real model.** `DummyClassifier(strategy="most_frequent")` / `DummyRegressor(strategy="mean")`, then a linear baseline (`LogisticRegression` / `Ridge`). This is the bar every model must clear; a fancy model that barely beats majority-class isn't worth the complexity.
71
+
72
+ 5. **For tabular data, reach for gradient boosting before deep nets.** Default order: linear baseline → **gradient boosting** (`XGBoost` / `LightGBM` / `HistGradientBoostingClassifier`) → deep net only if GBM plateaus and you have ample data. GBMs win on heterogeneous tabular data, need little preprocessing, and train in minutes. Handle imbalance with `class_weight` / `scale_pos_weight` or threshold tuning — not blind SMOTE (and if you oversample, do it inside the CV fold only, never before the split).
73
+
74
+ 6. **Tune hyperparameters with CV, search smart not exhaustive.** `RandomizedSearchCV` or Optuna over a sensible space beats `GridSearchCV` on a huge grid. Always pass `scoring=` your step-1 metric (not accuracy) and use the same CV object as step 2. Key GBM knobs: `n_estimators` + `learning_rate` (trade off), `max_depth` / `num_leaves`, `min_child_samples`, `subsample`, `colsample_bytree`, plus `early_stopping_rounds` on a validation set.
75
+
76
+ 7. **Diagnose bias vs variance, then act.** Compare train vs validation score:
77
+ - Train high, val low (large gap) = **overfit/variance** → regularize, reduce depth/leaves, add data, drop features, stronger early stopping.
78
+ - Train and val both low = **underfit/bias** → richer model, better features, less regularization.
79
+ Plot a learning curve to decide whether more data would even help before collecting it.
80
+
81
+ 8. **Track every experiment — params, metric, data version, code version.** Log to MLflow / Weights & Biases (or a CSV at minimum): hyperparameters, all CV metrics with std, the data snapshot hash, git commit, and the random seed. An untracked best-run is unreproducible. Pin seeds (`random_state`) everywhere.
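For the "CSV at minimum" option, a small sketch of what one logged row could contain. The function names, column names, and file layout are assumptions; the git commit is passed in as a string rather than looked up.

```python
import csv
import hashlib
import json
import os

def data_hash(path):
    """SHA-256 of the data snapshot file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def log_run(log_path, params, cv_mean, cv_std, data_path, git_commit, seed):
    """Append one experiment row; write the header only when the file is new."""
    row = {
        "params": json.dumps(params, sort_keys=True),
        "cv_mean": cv_mean,
        "cv_std": cv_std,
        "data_sha256": data_hash(data_path),
        "git_commit": git_commit,
        "seed": seed,
    }
    is_new = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

A tracker such as MLflow or Weights & Biases records the same fields with less ceremony; the point is that all of them are captured for every run.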
82
+
83
+ 9. **Evaluate the held-out test set EXACTLY ONCE, at the end.** Report the step-1 metric with a confidence interval (bootstrap), the confusion matrix at your chosen threshold, and the gap vs baseline. Repeatedly peeking at test = overfitting to test by hand.
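A percentile-bootstrap sketch for the confidence interval mentioned above, using only the standard library. `bootstrap_ci` and the toy `recall` metric are illustrative names, and the defaults (`n_boot=2000`, 95% interval) are assumptions.

```python
import random

def recall(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample rows with replacement, recompute the metric."""
    rng = random.Random(seed)          # pinned seed so the interval is reproducible
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

y_true = [1, 0] * 20
y_pred = [1, 0, 1, 0, 0, 0, 1, 1] * 5
lo, hi = bootstrap_ci(y_true, y_pred, recall)
```

Report the point estimate together with `(lo, hi)`; a model "beats baseline" only if the baseline's score falls outside that interval.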
84
+
85
+ ## Common Errors
86
+
87
+ - **Splitting after feature engineering / scaling on the full dataset.** Test statistics bleed into train; metrics inflate, prod collapses. Split first, fit transforms on train only (use a `Pipeline`).
88
+ - **Random split on temporal data.** The model trains on future rows and "predicts" the past. Use a time-based split / `TimeSeriesSplit`.
89
+ - **Reporting accuracy on an imbalanced problem.** 99% accuracy with 0% recall is useless. Pick PR-AUC / recall-at-precision from the cost (step 1).
90
+ - **A feature that's too good.** One column driving a near-perfect score is almost always leakage (post-outcome field, ID proxy, target-derived). Audit and drop it.
91
+ - **Target encoding / imputation / feature selection computed before the split or outside the CV fold.** Subtle leakage that survives CV but not production. Fit them inside the pipeline, per fold.
92
+ - **SMOTE/oversampling applied before splitting.** Synthetic copies of test rows land in train. Resample inside the training fold only.
93
+ - **Tuning against the test set / peeking repeatedly.** You overfit to it manually. Tune on CV/validation; touch test once.
94
+ - **No baseline.** Without DummyClassifier/linear you can't tell if your model learned anything. Always beat the dumb baseline first.
95
+ - **`scoring=` left at default (accuracy/R²) during search.** The search optimizes the wrong thing. Pass your business metric to `cross_val_score` / `*SearchCV`.
96
+ - **Train/serve feature skew.** Features computed differently (or with different code/library versions) at training vs inference. Reuse the exact fitted pipeline artifact for serving.
97
+ - **Unpinned seeds / untracked runs.** Results aren't reproducible and "best model" can't be recovered. Pin `random_state`, log params+metrics+data+code version.
98
+
99
+ ## Verify
100
+
101
+ 1. **Beats baseline:** held-out test metric (step-1 metric) > the DummyClassifier/DummyRegressor and the linear baseline by a margin larger than the bootstrap CI. If it doesn't, there's no model.
102
+ 2. **Leakage check passes:** drop each top-importance feature individually — no single feature should collapse the score to chance-level; remove any post-outcome/target-derived/ID-proxy column; confirm all learned transforms were fit on train only.
103
+ 3. **Split integrity:** for temporal data, every test timestamp is strictly after every train timestamp; for grouped data, no group ID appears in two sets (assert programmatically).
104
+ 4. **Generalization gap sane:** train metric − validation metric is small and explained by your bias/variance call; test ≈ validation (a big test drop means tuning overfit to validation).
105
+ 5. **Metric matches business cost:** the reported metric and the chosen decision threshold come from step 1, not the library default.
106
+ 6. **No train/serve skew:** running the saved pipeline on a held-out row reproduces the exact training-time prediction (same features, same library versions).
107
+ 7. **Reproducible:** re-running with the logged seed + data snapshot + code commit yields the same metric (within float tolerance); the experiment tracker has params, CV metrics with std, data hash, and git commit.
108
+
109
+ Done = the held-out test metric (chosen for business cost, evaluated once) beats both baselines beyond its CI, the leakage and split-integrity checks pass, the generalization gap is explained, and the saved pipeline reproduces predictions with no train/serve skew under a pinned, tracked run.
@@ -0,0 +1,109 @@
1
+ ---
2
+ name: unicode-text-correctness
3
+ description: Implements and fixes correct text/Unicode handling — pinning UTF-8 end-to-end, detecting BOM/legacy charsets, NFC/NFD normalization, grapheme-aware length/slicing/truncation/reversal, locale-aware collation and full case-folding, and homoglyph/confusable/bidi spoofing defenses.
4
+ when_to_use: Code measures, slices, truncates, reverses, sorts, lowercases, or compares strings containing emoji, combining marks, or CJK; or bugs show mojibake, emoji counting as length 4, truncation splitting a character, equal-looking usernames comparing unequal, broken accented sorting, or double-encoding. Distinct from regex-build (pattern matching) and validate-data-quality (column-level rules, not character semantics).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this when the bug is about **what a character *is*** — its bytes, boundaries, identity, or order — not about pattern matching or business rules:
10
+
11
+ - "Emoji `👨‍👩‍👧` counts as length 8 / truncates to a broken `�` / reverses into garbage"
12
+ - "Twitter-style `120 chars` limit cuts a flag emoji or `é` in half"
13
+ - "Two usernames look identical but `==` says they differ" (or the reverse: a spoof passes)
14
+ - "Accented words sort after `z` / `ä` doesn't sort near `a`"
15
+ - "Text came in as `Ã©` / `â€™` / `ÃƒÂ©`: mojibake / double-encoding"
16
+ - "`.toLowerCase()` breaks Turkish `İ`, German `ß`, or fails to match `İstanbul`"
17
+ - "MySQL stores emoji as `????` / IDN domain `аpple.com` (Cyrillic а) phishes users"
18
+
19
+ NOT this skill:
20
+ - Writing/debugging a regex pattern (email/slug/`\d` over-matching) → **regex-build**
21
+ - Column-level assertions (no nulls/dupes, value ranges, freshness) → **validate-data-quality**
22
+ - Schema/charset migration mechanics (lock contention, rollback of an `ALTER`) → **db-migration-safety**
23
+ - Whether a confusable username is an actual attack you must *report* in a diff → **security-review** (this skill *builds* the defense; security-review *audits* for its absence)
24
+
25
+ ## Steps
26
+
27
+ 1. **Know the four length units — pick one deliberately, never let the language pick for you.** Most "Unicode bugs" are using the wrong unit.
28
+
29
+ | Unit | What it counts | `"é"` (NFD) | `"👨‍👩‍👧"` | Use for |
30
+ |---|---|---|---|---|
31
+ | Bytes | UTF-8 octets | 3 | 18 | storage size, network frames, DB byte limits |
32
+ | Code units | UTF-16 slots (JS `.length`, Java `char`) | 2 | 8 | **almost never: this is the trap** |
33
+ | Code points | Unicode scalars | 2 | 5 | normalization input, codepoint ranges |
34
+ | **Grapheme clusters** | user-perceived characters | **1** | **1** | length shown to users, truncation, cursor, slicing |
35
+
36
+ Default for any **user-facing** length, limit, slice, or reverse: **grapheme clusters**. JS `"👨‍👩‍👧".length === 8` and `[..."👨‍👩‍👧"].length === 5` are *both wrong* for "how many characters"; only a segmenter gives 1.
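The first three units can be checked directly in Python's standard library; this sketch builds the family emoji from explicit escapes (three emoji joined by two U+200D zero-width joiners) so nothing depends on how the editor stored the literal. Grapheme counting is deliberately absent because the stdlib has no segmenter.

```python
def length_units(s):
    """Byte, UTF-16 code-unit, and code-point counts for one string."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,  # what JS .length reports
        "code_points": len(s),                                # what Python len() reports
    }

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man ZWJ woman ZWJ girl
e_nfd = "e\u0301"                                      # e + combining acute accent

family_units = length_units(family)
e_units = length_units(e_nfd)
```

Both strings are a single user-perceived character, which none of these three numbers reports.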
37
+
38
+ 2. **Count and slice on grapheme boundaries — use a real segmenter, do not split on code points.** Built-ins:
39
+
40
+ ```js
41
+ const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
42
+ const graphemes = [...seg.segment(s)].map(x => x.segment);
43
+ const len = graphemes.length; // user-visible length
44
+ const head = graphemes.slice(0, 120).join(""); // truncate to 120 chars, never split
45
+ const reversed = graphemes.reverse().join(""); // reverse without scrambling 👨‍👩‍👧
46
+ ```
47
+ - Python: `regex` module `\X` (`regex.findall(r"\X", s)`) — stdlib `re`/`len()` give code points, not graphemes.
48
+ - Rust: `unicode-segmentation` `.graphemes(true)`. Go: `rivo/uniseg`. Swift: `String` is already grapheme-correct (`.count`).
49
+ - **Truncate, then re-append an ellipsis as its own grapheme**; if a byte cap (e.g. DB `VARCHAR(n)` is bytes) also applies, trim graphemes until `utf8Bytes(result) <= cap` — never cut at byte `n` directly.
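A sketch of the truncate-then-trim rule, assuming the input has already been split into grapheme clusters by a real segmenter (for example `regex.findall(r"\X", s)`). The function name and the choice to let the ellipsis occupy one of the `max_graphemes` slots are assumptions.

```python
def truncate_graphemes(graphemes, max_graphemes, byte_cap=None, ellipsis="\u2026"):
    """Truncate on grapheme boundaries, then trim further to satisfy a UTF-8 byte cap."""
    def utf8_len(text):
        return len(text.encode("utf-8"))

    whole = "".join(graphemes)
    fits_count = len(graphemes) <= max_graphemes
    fits_bytes = byte_cap is None or utf8_len(whole) <= byte_cap
    if fits_count and fits_bytes:
        return whole                       # nothing to cut, no ellipsis needed

    # Reserve one slot for the ellipsis, which is appended as its own grapheme.
    kept = list(graphemes[:max(max_graphemes - 1, 0)])
    if byte_cap is not None:
        if utf8_len(ellipsis) > byte_cap:
            return ""                      # cap too small even for the ellipsis
        while kept and utf8_len("".join(kept) + ellipsis) > byte_cap:
            kept.pop()                     # drop whole graphemes, never bytes
    return "".join(kept) + ellipsis

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
clusters = ["a", "e\u0301", family, "b", "c"]   # pre-segmented input
short = truncate_graphemes(clusters, 3)
capped = truncate_graphemes(clusters, 3, byte_cap=5)
```

Because whole clusters are popped, the result can never end in half a ZWJ sequence or a stranded combining mark.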
50
+
51
+ 3. **Normalize to NFC at every boundary you store, compare, hash, or index.** `"é"` has two encodings (NFC U+00E9 = 1 codepoint; NFD U+0065 U+0301 = 2). They render identically but are `!=` and hash differently. Rule:
52
+ - **NFC on input** (ingest/form submit/API request) — canonical, shortest, what the web expects.
53
+ - Compare/hash/dedup/`UNIQUE` index **only on NFC** strings — never store one form and look up another (macOS filesystem returns **NFD**; HTTP/most input is NFC → a path from disk won't match a stored key without normalizing both sides).
54
+ - Apply NFC **before** truncation (combining mark must ride with its base) and **before** case-folding.
55
+ ```js
56
+ const key = s.normalize("NFC"); // JS
57
+ ```
58
+ ```python
59
+ import unicodedata; key = unicodedata.normalize("NFC", s) # Python
60
+ ```
61
+ Use **NFKC** (compatibility) only for *identifiers/search keys* where you want `①`→`1`, `ﬁ`→`fi`, full-width `Ａ`→`A` folded together. It is lossy, so never NFKC user display text.
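A short check of both claims with `unicodedata`: NFC makes the two spellings of `café` equal, and NFKC folds compatibility characters in a way that cannot be undone. The helper names `storage_key` and `search_key` are hypothetical.

```python
import unicodedata

def storage_key(s):
    """Canonical form for storing, comparing, hashing, and unique indexes."""
    return unicodedata.normalize("NFC", s)

def search_key(s):
    """Lossy compatibility form: identifiers and search keys only, never display text."""
    return unicodedata.normalize("NFKC", s)

cafe_nfc = "caf\u00E9"     # precomposed e-acute
cafe_nfd = "cafe\u0301"    # e followed by combining acute

same_before = cafe_nfc == cafe_nfd
same_after = storage_key(cafe_nfc) == storage_key(cafe_nfd)
folded = [search_key(c) for c in ("\u2460", "\uFB01", "\uFF21", "\u00B2")]
```

`folded` shows why NFKC is unsuitable for display: the circled digit, the ligature, the full-width letter, and the superscript all lose their original form.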
62
+
63
+ 4. **Compare case-insensitively with full case-folding, not `lower()`; sort with a locale collator, not byte order.**
64
+ - Case-insensitive equality: full case-folding is the floor. Use `str.casefold()` in Python; Rust's `str::to_lowercase` is a lowercase mapping, not a case fold, so reach for a case-folding crate there (for example `caseless`). `.toLowerCase()`/`.toUpperCase()` is *not* enough: `"ß".casefold() == "ss"`, and Turkish `"İ"` vs `"i"` differ by locale. Never use `.toLowerCase()` for a *security or identity* comparison; NFC then fold both sides: compare `a.normalize("NFC")` folded vs `b.normalize("NFC")` folded.
65
+ - Sorting: byte/codepoint order puts `Z`(0x5A) before `a`(0x61) and accented letters after `z`. Use an **ICU/CLDR collator**: `new Intl.Collator("de", { sensitivity: "base" }).compare(a, b)` (JS), `PyICU.Collator` or `locale.strxfrm` (Python), `COLLATE "de-x-icu"` (Postgres). Pin the locale explicitly — the "right" order for `ä`/`ö` differs by language (German vs Swedish).
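A sketch of the "NFC, then fold" comparison in Python. The helper name `caseless_key` is hypothetical; the second NFC pass after folding is a precaution I am assuming is wanted, since folding can change which characters are present. This is locale-independent folding and does not implement Turkish dotted/dotless rules.

```python
import unicodedata

def caseless_key(s):
    """NFC, full case fold, then NFC again in case folding changed composition."""
    folded = unicodedata.normalize("NFC", s).casefold()
    return unicodedata.normalize("NFC", folded)

def caseless_equal(a, b):
    return caseless_key(a) == caseless_key(b)

lower_matches = "Stra\u00DFe".lower() == "STRASSE".lower()      # lower() keeps the sharp s
fold_matches = caseless_equal("Stra\u00DFe", "STRASSE")         # casefold() expands it to ss
forms_match = caseless_equal("CAF\u00C9", "cafe\u0301")         # NFC upper vs NFD lower
```

Sorting is a separate problem: it needs a locale collator (ICU/CLDR), which the standard library does not bundle.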
66
+
67
+ 5. **Defend identifiers (usernames, domains, package names) against confusables and mixed-script spoofing.** Equal-*looking* must mean equal-*compared*, and visually-deceptive must be rejected:
68
+ - **Skeleton/confusable check** (UTS #39): map each char to its prototype (`раypal`→`paypal`) via the Unicode confusables table (`confusable_homoglyphs`, ICU `uspoof`, `unicode-security` crate) and compare skeletons against existing identifiers.
69
+ - **Mixed-script reject:** allow a single script run per identifier (Latin *or* Cyrillic, not `аpple` mixing Cyrillic `а` + Latin); permit only known-safe combos (Latin+Han+Hiragana for JP). Reject whole-script confusables (all-Cyrillic `аррӏе`).
70
+ - **Strip/reject bidi overrides** `U+202A–202E`, `U+2066–2069`, and zero-width `U+200B/200C/200D/FEFF` in identifiers and filenames — `safe.txt‮gpj.exe` displays as `safe.txtexe.jpg` (Trojan Source). NFKC-fold identifiers before storing.
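A standard-library approximation of two of the checks above. The control-character set is taken from the ranges listed in this step; the script detection is a heuristic that reads the first word of each letter's Unicode name (an assumption standing in for real script properties), and it cannot catch whole-script confusables, which need the UTS #39 table.

```python
import unicodedata

# U+202A..202E, U+2066..2069, and the zero-width characters named in this step.
FORBIDDEN = (set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))
             | {0x200B, 0x200C, 0x200D, 0xFEFF})

def has_forbidden_controls(s):
    """True if the identifier contains a bidi control or zero-width character."""
    return any(ord(c) in FORBIDDEN for c in s)

def scripts_used(s):
    """Heuristic: first word of each letter's Unicode name, e.g. LATIN or CYRILLIC."""
    scripts = set()
    for c in s:
        if c.isalpha():
            name = unicodedata.name(c, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def is_single_script(s):
    return len(scripts_used(s)) <= 1

spoof = "\u0440\u0430ypal"                 # Cyrillic er + a, then Latin "ypal"
rlo_name = "safe.txt\u202Egpj.exe"         # right-to-left override before "gpj.exe"
```

A production check would use a maintained confusables implementation; this only shows the shape of the reject rules.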
71
+
72
+ 6. **Pin UTF-8 across storage and transport — no implicit charset anywhere.**
73
+ - DB: MySQL **`utf8mb4`** (the 3-byte `utf8` alias silently drops emoji → `????`); set table *and* connection charset + a `_unicode_ci`/`utf8mb4_0900_ai_ci` collation. Postgres: `ENCODING 'UTF8'` + ICU collation per UTF-8 column.
74
+ - HTTP: send `Content-Type: …; charset=utf-8`; read `charset` from the response header, fall back to BOM, then to the declared meta — never assume Latin-1.
75
+ - **BOM:** strip a leading `U+FEFF` on read (it corrupts JSON parse and the first field of CSV); do **not** emit a BOM in UTF-8 output unless a consumer (Excel CSV) demands it.
76
+ - **Legacy ingest:** detect with `chardet`/`charset-normalizer`/ICU `CharsetDetector`, decode once to Unicode, then work in UTF-8, and **never re-decode an already-decoded string** (the cause of `â€™` double-encoding mojibake).
77
+ - URLs/IDN: percent-encode the UTF-8 bytes of the path/query; convert IDN hostnames to **Punycode** (`xn--…`) for transport, but display the Unicode form *only after* the confusable check in step 5.
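The mojibake and BOM failure modes are easy to reproduce, which makes them easy to test. This sketch assumes the wrong decoder was Windows-1252; the helper names are hypothetical.

```python
import json

def misdecode(s):
    """Reproduce classic mojibake: UTF-8 bytes wrongly decoded as Windows-1252."""
    return s.encode("utf-8").decode("cp1252")

def repair(s):
    """Undo exactly one such mis-decode (only valid if that is what happened)."""
    return s.encode("cp1252").decode("utf-8")

def strip_bom(text):
    """Drop a single leading U+FEFF left behind by a plain UTF-8 decode."""
    return text[1:] if text.startswith("\ufeff") else text

e_acute_broken = misdecode("\u00E9")       # e-acute becomes two characters
quote_broken = misdecode("\u2019")         # right single quote becomes three

raw = b"\xef\xbb\xbf" + b'{"id": 1}'       # UTF-8 file with a BOM
with_bom = raw.decode("utf-8")             # BOM survives as U+FEFF
parsed = json.loads(strip_bom(with_bom))
also_clean = raw.decode("utf-8-sig")       # codec that strips the BOM on decode
```

`json.loads(with_bom)` would raise, which is the "BOM corrupts JSON parse" symptom described above.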
78
+
79
+ 7. **Lock the behavior with adversarial test strings** (see Verify) before declaring text handling correct.
80
+
81
+ ## Common Errors
82
+
83
+ - **Using `.length` (JS/Java UTF-16) as character count.** Counts code units → a single emoji = 2, a ZWJ sequence 8 or more, BMP CJK = 1. Fix: `Intl.Segmenter` graphemes for user counts.
84
+ - **Splitting on code points and calling it grapheme-safe.** `[...str]` keeps `é`(NFC) whole but shatters `👨‍👩‍👧` (5 codepoints) and a base+combining `e+◌́`. Fix: segment graphemes, not codepoints.
85
+ - **Byte-cap truncation (`s[:200]`, `substr`).** Cuts mid-codepoint → `�`, or splits a base from its combining mark / a ZWJ sequence. Fix: trim whole graphemes until under the byte cap.
86
+ - **Comparing/indexing without normalizing.** NFC `café` ≠ NFD `café`; one inserts, the other duplicates past a `UNIQUE` constraint. Fix: NFC both sides before `==`, hash, and the DB write.
87
+ - **`toLowerCase()` for identity/security checks.** Misses `ß`/`ss`, breaks Turkish `İ/ı`, locale-dependent. Fix: full case-fold (`casefold()`), NFC first.
88
+ - **Sorting by codepoint/byte.** `Z` before `a`, accents dumped after `z`, wrong per language. Fix: ICU/CLDR collator with an explicit locale.
89
+ - **MySQL `utf8` (3-byte alias).** Silently stores emoji/4-byte chars as `????` or errors. Fix: `utf8mb4` everywhere — column, table, connection.
90
+ - **Double-decoding / re-encoding.** Decoding an already-`str` value (or treating UTF-8 bytes as Latin-1 then re-encoding) → `Ã©`, `â€™`. Fix: decode exactly once at the boundary; keep Unicode internally.
91
+ - **Not stripping the BOM.** Leading `U+FEFF` breaks `JSON.parse`, makes the first CSV column key invisible. Fix: strip a leading `U+FEFF` on read.
92
+ - **Reversing a string by codepoint/char.** Scrambles emoji ZWJ sequences and detaches combining marks (`á` → `́a`). Fix: reverse grapheme clusters.
93
+ - **NFKC on display text.** Lossy: `²`→`2`, `ﬁ`→`fi`, full-width collapses. Fix: NFKC only for fold-keys/identifiers; store NFC for display.
94
+ - **Trusting Unicode display of IDN/filenames.** Bidi override + homoglyph spoofs the eye. Fix: render Punycode / run the confusable+bidi check before showing.
95
+
96
+ ## Verify
97
+
98
+ Test every text op against a fixed adversarial corpus, written with explicit escapes so the form is unambiguous. At minimum: `"a\u0301"` (a + combining acute, NFD `á`), `"\u00E1"` (NFC `á`), `"👨‍👩‍👧‍👦"` (ZWJ family), `"🇯🇵"` (regional-indicator flag), `"ẹ́"` (stacked combining marks), `"한국어"` (Hangul), `"Ｈｅｌｌｏ"` (full-width), `"раypal"` (mixed-script Cyrillic), `"safe\u202Etxt.exe"` (bidi override), `"\uFEFFhi"` (BOM), `"café"` in NFC and NFD.
99
+
100
+ 1. **Grapheme length:** the ZWJ family and a flag emoji each report length **1**; `"á"` reports **1**. Not 2, 4, or 7.
101
+ 2. **Truncation:** truncating the corpus to N graphemes never yields a `�`, never splits a ZWJ sequence, and never strands a combining mark; `utf8Bytes(result) <= byteCap` when a byte cap applies.
102
+ 3. **Reverse:** reversing `"👨‍👩‍👧"` returns it unchanged (single grapheme); reversing `"áb"` keeps `á` intact.
103
+ 4. **Normalization equality:** NFC `"café"` and NFD `"café"` compare **equal** and hash equal after `.normalize("NFC")`; inserting both into a table with a `UNIQUE(NFC)` key yields one row.
104
+ 5. **Case-fold:** `"ß"` matches `"SS"`/`"ss"` under full case-fold; `"İstanbul"` matches per Turkish locale and is *not* silently mangled in the default locale.
105
+ 6. **Collation:** sorting `["z","ä","a","Z"]` under `de` collator puts `ä` adjacent to `a` and is *not* codepoint order (`Z` before `a`).
106
+ 7. **Confusable/bidi:** `"раypal"` is flagged confusable with an existing `"paypal"` and mixed-script-rejected; the bidi-override string is rejected or its overrides stripped before storage/display.
107
+ 8. **Round-trip:** a string written to the DB (`utf8mb4`) and read back is byte-identical including emoji; a BOM-prefixed file parses with no phantom first key; an IDN host round-trips through Punycode and back.
108
+
109
+ Done = grapheme-unit length/slice/truncate/reverse are all correct on the ZWJ + flag + combining-mark corpus, NFC-normalized values compare/hash/dedup equal across forms, case-insensitive matching uses full case-folding and sorting uses a locale collator, confusable + mixed-script + bidi spoofs are rejected, and emoji round-trip cleanly through the `utf8mb4` store with no `????`/`�`/mojibake.
@@ -0,0 +1,120 @@
1
+ ---
2
+ name: visual-regression-testing
3
+ description: Catches unintended UI pixel changes by snapshotting rendered output and diffing against approved baselines — make snapshots deterministic (disable CSS animations/transitions/caret, mask dynamic regions like dates/avatars/ads, freeze the clock and seed randomness, preload+wait for fonts, pin viewport + deviceScaleFactor, force reduced-motion and a fixed color-scheme), generate per-browser/per-OS baselines (never share a Linux baseline with a dev's macOS), tune the diff threshold (maxDiffPixelRatio / anti-alias mode) instead of inflating it to hide flake, run baselines in ONE pinned container so subpixel/font rendering is identical, and wire a human review/approve flow (Playwright --update-snapshots, Chromatic/Percy approve UI) — at component level (isolated, fast) and page level (integration). Effectively a pixel contract: a diff is a question for a human, not an auto-pass.
4
+ when_to_use: You want to detect visual UI regressions — a CSS/refactor/dependency bump silently shifted layout/color/spacing, you're adding toHaveScreenshot/Chromatic/Percy/BackstopJS, baselines flake across machines, or you're tuning diff thresholds and the review/approve flow. Distinct from write-playwright-e2e (asserts functional behavior and DOM state, not pixels — this skill is the screenshot-diff layer) and audit-accessibility-wcag (WCAG conformance / contrast / semantics, not whether pixels changed).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the goal is **detecting unintended pixel/visual changes against an approved baseline**, not functional behavior or a11y conformance:
10
+
11
+ - "A CSS refactor / Tailwind upgrade / design-token change silently broke a layout somewhere"
12
+ - "Add visual regression / screenshot tests to this component library or these pages"
13
+ - "Set up Playwright `toHaveScreenshot`, Chromatic, Percy, or BackstopJS"
14
+ - "Snapshots flake — they pass on CI but fail on my Mac, or fail randomly"
15
+ - "Tune the diff threshold / mask the date+avatar regions / freeze animations"
16
+ - "Wire the baseline review-and-approve flow into PRs"
17
+
18
+ NOT this skill:
19
+ - Asserting a button click opens a modal, a form submits, navigation/DOM state, network mocking → write-playwright-e2e (functional E2E; this skill is the screenshot-diff layer that *also* runs on a stabilized page)
20
+ - WCAG conformance, contrast ratios, ARIA, keyboard/focus order, screen-reader semantics → audit-accessibility-wcag (correct *semantics*, not whether pixels match a baseline)
21
+ - A snapshot/screenshot test that's flaky for timing/ordering reasons → debug-flaky-tests (root-causing nondeterminism in general; this skill prescribes the *visual-specific* stabilizers)
22
+ - Structuring the test suite, fixtures, assertions for unit/integration tests → write-tests
23
+ - Driving a real browser to manually inspect/debug a rendering bug → debug-frontend-browser
24
+ - Catching LCP/CLS/perf regressions (layout shift as a metric, not a pixel diff) → optimize-core-web-vitals
25
+ - Defining the tokens (color/space/type scale) whose changes you're guarding → design-token-system
26
+
27
+ ## Steps
28
+
29
+ 1. **Pick the tier by what you own.** Each is a screenshot + perceptual diff against a stored baseline; they differ in where baselines live and review happens.
30
+
31
+ | Tool | Baseline storage | Review/approve | Best for |
32
+ |---|---|---|---|
33
+ | **Playwright `toHaveScreenshot`** | git (PNGs committed per project) | `--update-snapshots` + PR diff of `.png` | self-hosted, full control, free; you own the render env |
34
+ | **Chromatic** | cloud (Storybook) | hosted UI, per-story approve, branch baselines | Storybook component libs; turbosnap diffs only changed stories |
35
+ | **Percy (BrowserStack)** | cloud | hosted UI, approve per snapshot | cross-browser cloud render, framework-agnostic SDK |
36
+ | **BackstopJS** | git/local | `approve` CLI, HTML report | legacy/no-cloud, reference+test+report flow |
37
+
38
+ Default to **Playwright `toHaveScreenshot`** when you control the runner (commit baselines, run in a pinned container); reach for **Chromatic/Percy** when you can't pin a render env or want cross-browser cloud baselines without managing them.
39
+
40
+ 2. **Render env is the baseline — pin it or every diff is noise.** Font hinting and subpixel antialiasing differ across OS/GPU, so a macOS-generated PNG will *never* match a Linux CI PNG. Generate and verify baselines in **one** environment:
41
+ - Playwright: pin the Docker image to your exact version — `mcr.microsoft.com/playwright:v1.50.0-noble` — and run *baseline generation and CI in the same image*. Never commit a baseline produced on a dev's machine.
42
+ - Snapshot filenames already encode browser/OS (`button-chromium-linux.png`). Keep that suffix; do **not** force a single platform name to "share" baselines across OSes — generate one baseline per `(browser, platform)` you actually test.
43
+ - Run `npx playwright test --update-snapshots` only inside that image, either locally via `docker run` or in a dedicated CI "update baselines" job, so the bytes match CI.
44
+
45
+ 3. **Kill animation and motion before the shot.** A mid-transition frame is the #1 flake source.
46
+ ```ts
47
+ // playwright.config.ts
48
+ expect: { toHaveScreenshot: { animations: 'disabled', caret: 'hide', scale: 'css' } }
49
+ ```
50
+ `animations:'disabled'` fast-forwards finite CSS animations and transitions to their end state and cancels infinite animations back to their initial state; `caret:'hide'` removes the blinking text cursor; `scale:'css'` ignores DPR so HiDPI vs 1x render the same logical pixels. For motion that CSS can't reach, also inject:
51
+ ```ts
52
+ await page.emulateMedia({ reducedMotion: 'reduce', colorScheme: 'light' });
53
+ await page.addStyleTag({ content: `*,*::before,*::after{transition:none!important;animation:none!important;}` });
54
+ ```
55
+
56
+ 4. **Pin viewport + DPR + color-scheme deterministically.** Layout depends on width; rendering depends on DPR and scheme. Set them explicitly per project, never inherit the runner's screen:
57
+ ```ts
58
+ use: { viewport: { width: 1280, height: 720 }, deviceScaleFactor: 1, colorScheme: 'light' }
59
+ ```
60
+ Test responsive breakpoints as **separate named snapshots** (`card-mobile-375.png`, `card-desktop-1280.png`) — don't rely on a default window size. For full-page shots, set `fullPage: true` only when the page height is stable; otherwise prefer clipping a component.
61
+
62
+ 5. **Freeze time, randomness, and anything non-deterministic in content.** "Updated 3 minutes ago", `Math.random()` ids, and animated counters all churn pixels:
63
+ - Clock: Playwright `await page.clock.setFixedTime(new Date('2025-01-01T00:00:00Z'))` before navigation pins `Date.now()`/`new Date()` while timers keep running; use `page.clock.install` (then `page.clock.pauseAt`) when timers must be frozen too.
64
+ - Seed PRNGs / stub `Math.random` and `crypto.randomUUID` via `addInitScript` so generated ids/charts are stable.
65
+ - Stub network: route API calls to **fixtures** (deterministic data) — a live API means live data means flake. This is where it overlaps with write-playwright-e2e's mocking, but here the goal is *stable pixels*, not asserting a request.
66
+
67
+ 6. **Wait for the page to be visually settled — not just `load`.** Diff what's actually rendered:
68
+ - **Fonts:** a FOUT (fallback → web font swap) changes glyph metrics. `await page.evaluate(() => document.fonts.ready)` before the shot, and self-host/preload fonts so they're not network-flaky.
69
+ - **Lazy images / skeletons:** wait for the specific `<img>` `decode()`/`load`, or assert the skeleton is gone (`await expect(skeleton).toBeHidden()`), not a blanket `networkidle` (discouraged by Playwright and flaky).
70
+ - **Layout stability:** `await page.waitForFunction` on a render-complete signal, or `expect(locator).toHaveScreenshot()` which **auto-retries until two consecutive shots match** — lean on that built-in stabilization rather than `waitForTimeout`.
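The "retry until two consecutive shots match" idea can be sketched independently of any browser. This is an illustration of the stabilization loop only, not Playwright's implementation; `take_shot` is a hypothetical callable returning the screenshot bytes, and a real loop would also pause between attempts.

```python
def capture_when_stable(take_shot, max_attempts=10):
    """Return the first screenshot that equals the one taken just before it."""
    previous = take_shot()
    for _ in range(max_attempts - 1):
        current = take_shot()
        if current == previous:
            return current          # two consecutive identical frames: settled
        previous = current
    raise TimeoutError("screenshot never stabilized")

# Simulated page that is still animating for the first two frames.
frames = iter([b"frame-1", b"frame-2", b"settled", b"settled"])
stable = capture_when_stable(lambda: next(frames))
```

A fixed `waitForTimeout` guesses how long settling takes; a loop like this measures it.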
71
+
72
+ 7. **Mask the regions you can't make deterministic — don't widen the threshold to swallow them.** Ads, avatars, timestamps, maps, video, third-party embeds:
73
+ ```ts
74
+ await expect(page).toHaveScreenshot('dashboard.png', {
75
+ mask: [page.locator('.ad-slot'), page.locator('[data-testid="avatar"]')],
76
+ maskColor: '#FF00FF',
77
+ });
78
+ ```
79
+ Masking paints those areas a solid color in both baseline and actual, so they're excluded from the diff while the rest stays pixel-exact. This is strictly better than raising the global threshold, which blinds you to real regressions everywhere.
80
+
81
+ 8. **Tune the threshold tight; treat a loose threshold as a bug.** Two kinds of knob (pixel-count and per-pixel sensitivity); prefer the pixel-count kind:
82
+ - `maxDiffPixelRatio` (fraction of differing pixels, e.g. `0.01`) or `maxDiffPixels` (absolute count) — set as low as your env allows. Start at `0` and raise only to the floor that survives a no-change re-run.
83
+ - `threshold` (per-pixel color sensitivity, 0–1, default `0.2`) — handles antialias jitter; lowering it makes diffs *stricter*.
84
+ - **Anti-pattern:** bumping `maxDiffPixelRatio` to `0.1` to "stop flake." That hides a 9%-of-the-screen regression. Fix the nondeterminism (steps 3–6) instead; reserve a small ratio purely for subpixel antialiasing noise.
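The arithmetic behind the anti-pattern, as a toy model: images are grids of pixel values and any unequal pixel counts as different. That exact-equality rule is a simplification I am assuming for clarity; real tools first apply the per-pixel color `threshold`. The function names are hypothetical.

```python
def diff_pixel_ratio(baseline, actual):
    """Fraction of pixels that differ between two same-sized grids."""
    if len(baseline) != len(actual) or any(
            len(b) != len(a) for b, a in zip(baseline, actual)):
        raise ValueError("size mismatch: images must have identical dimensions")
    total = sum(len(row) for row in baseline)
    differing = sum(1 for row_b, row_a in zip(baseline, actual)
                    for pb, pa in zip(row_b, row_a) if pb != pa)
    return differing / total

def passes(baseline, actual, max_diff_pixel_ratio):
    return diff_pixel_ratio(baseline, actual) <= max_diff_pixel_ratio

white = [[255] * 10 for _ in range(10)]        # 10x10 baseline
regressed = [row[:] for row in white]
for x in range(9):
    regressed[0][x] = 0                        # 9 of 100 pixels changed

ratio = diff_pixel_ratio(white, regressed)
hidden_by_loose = passes(white, regressed, 0.1)    # loose ratio lets it through
caught_by_tight = not passes(white, regressed, 0.01)
```

The same 9% change is invisible at `0.1` and a failure at `0.01`, which is the whole argument for keeping the ratio near zero.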
85
+
86
+ 9. **Component vs page level — run both, weight toward component.** Component snapshots (Storybook + Chromatic, or Playwright `mount`/component testing) are isolated, fast, and pinpoint *which* component changed; a wall of full-page snapshots is slow and every page that embeds a changed header fails at once (noisy, hard to triage). Use a **pyramid**: many small component/story snapshots, a handful of critical full-page integration snapshots (login, checkout, dashboard). Snapshot **states**, not just the default: hover, focus, error, empty, loading, RTL, dark mode — each as its own baseline.
87
+
88
+ 10. **A diff is a question for a human — never auto-update on CI.** The review/approve flow is the whole point:
89
+ - **Failing build is correct behavior** when pixels change — the PR must show the diff image (Playwright attaches `expected/actual/diff` to the HTML report and `test-results/`; Chromatic/Percy link a hosted diff).
90
+ - Approve intentional changes deliberately: Playwright → run the dedicated `--update-snapshots` job and **commit the new PNGs in the same PR** (reviewers see the pixel diff in git); Chromatic/Percy → click *approve* which moves the branch baseline.
91
+ - **Never** run `--update-snapshots` automatically in the main test job or on every CI run — that auto-blesses regressions and the test becomes worthless. Updating baselines is a reviewed, intentional act.
92
+
93
+ 11. **Keep baselines healthy.** Commit PNGs via **Git LFS** (binary churn bloats history); delete stale baselines when a component is removed (orphan PNGs hide nothing and rot); regenerate the whole set deliberately after an intentional global change (font swap, token update) in a single isolated PR titled as such, so reviewers know the diff is wholesale, not a regression slipping through.
94
+
95
+ ## Common Errors
96
+
97
+ - **Baseline made on macOS, CI runs Linux.** Font/subpixel rendering differs → every snapshot "fails." Fix: generate and run in one pinned container image (`mcr.microsoft.com/playwright:vX.Y.Z-noble`); never commit a dev-machine baseline.
98
+ - **Animations/transitions not disabled.** Mid-flight frame captured → random diffs. Fix: `animations:'disabled'`, `caret:'hide'`, inject `transition/animation:none!important`, `emulateMedia({reducedMotion:'reduce'})`.
99
+ - **Web font swaps after the shot (FOUT).** Glyph metrics shift → text diffs. Fix: `await document.fonts.ready` + self-host/preload fonts.
100
+ - **Live time/random/data.** "2 min ago", uuids, live API → churns pixels. Fix: `page.clock.setFixedTime`, seed/stub `Math.random`/`randomUUID`, route APIs to fixtures.
101
+ - **Raising `maxDiffPixelRatio` to stop flake.** Hides real regressions across the whole frame. Fix: eliminate nondeterminism (steps 3–6) and *mask* dynamic regions; keep the threshold near zero.
102
+ - **`waitForTimeout`/`networkidle` instead of a render signal.** Flaky on slow CI, and both are discouraged in the Playwright docs. Fix: wait on `fonts.ready`, specific image `decode()`, or rely on `toHaveScreenshot`'s built-in retry-until-stable.
103
+ - **Forcing one platform name to share baselines.** A "shared" baseline matches no real env. Fix: one baseline per `(browser, platform)`; keep the OS suffix in the filename.
104
+ - **Auto-running `--update-snapshots` in CI.** Silently re-baselines regressions → the test never fails on a real change. Fix: dedicated, reviewed update job; commit PNGs in the PR.
105
+ - **Only the default/happy state snapshotted.** Hover/error/empty/dark/RTL regressions slip through. Fix: a baseline per meaningful state.
106
+ - **No DPR pin.** HiDPI runner doubles pixels vs 1x → size mismatch. Fix: `deviceScaleFactor:1` + `scale:'css'`.
107
+ - **Giant full-page snapshots only.** One header change fails 40 pages; slow, untriageable. Fix: component-level pyramid + a few critical page shots.
108
+ - **Baselines committed as raw blobs.** Binary churn bloats the repo. Fix: Git LFS; prune orphaned PNGs.
109
+
110
+ ## Verify
111
+
112
+ 1. **Determinism re-run:** run the suite twice back-to-back with **no code change** in the pinned CI image → zero diffs. Any nonzero diff on a clean re-run is leftover nondeterminism — fix it before trusting the suite.
113
+ 2. **Env parity:** generate a baseline in the container and run CI in the same container → match; confirm filenames carry the `(browser, platform)` suffix and no baseline was produced on a dev machine.
114
+ 3. **Real regression is caught:** deliberately change a color/padding/font-size by a few px → the relevant snapshot fails and the report shows a highlighted `diff.png`; the build goes red.
115
+ 4. **Masking works, threshold is tight:** a masked region (avatar/clock) churning its content produces **no** diff, while an unmasked 1% layout shift **does** fail — proving the threshold isn't swallowing real changes.
116
+ 5. **Stabilizers active:** animations disabled, `document.fonts.ready` awaited, clock fixed, randomness seeded, APIs stubbed to fixtures — grep the config/setup for each; a snapshot taken mid-animation or with a live `Date.now()` would fail check 1.
117
+ 6. **Approve flow is manual:** confirm no job runs `--update-snapshots`/auto-approve on the main path; an intentional change requires committing new PNGs (or clicking approve) in a reviewed PR, and that PR's diff shows the pixel change.
118
+ 7. **State coverage:** the critical components have baselines for hover/focus/error/empty/dark/RTL, not just default; responsive breakpoints are separate named snapshots.
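For the determinism re-run in check 1, one way to compare two runs is to hash every PNG each run produced. This sketch assumes each run's actual screenshots were copied into its own directory; the directory layout and function names are hypothetical.

```python
import hashlib
from pathlib import Path

def snapshot_digests(directory):
    """Map each PNG's relative path to the SHA-256 of its bytes."""
    root = Path(directory)
    return {p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*.png"))}

def changed_snapshots(run_a, run_b):
    """Names that differ, or exist in only one run, between two output directories."""
    a, b = snapshot_digests(run_a), snapshot_digests(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `changed_snapshots` on two no-change runs in the pinned image is the byte-stability the checklist asks for; any name in the list points at leftover nondeterminism.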
119
+
120
+ Done = snapshots are byte-stable on a clean re-run in one pinned render env, dynamic regions are masked (not threshold-inflated), per-`(browser,platform)` baselines live in version control via LFS, a real few-pixel change goes red with a visible diff, and every baseline update is a deliberate, reviewed human approval — never an automatic CI step.