npm - sanook-cli - Versions diffs - 0.4.0 - Mend

sanook-cli 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (148)

package/.env.example +23 -0
package/CHANGELOG.md +38 -0
package/LICENSE +201 -0
package/README.md +239 -0
package/dist/agentContext.js +2 -0
package/dist/approval.js +78 -0
package/dist/bin.js +461 -0
package/dist/brain.js +186 -0
package/dist/commands.js +66 -0
package/dist/compaction.js +85 -0
package/dist/config.js +101 -0
package/dist/cost.js +59 -0
package/dist/diff.js +36 -0
package/dist/gateway/auth.js +32 -0
package/dist/gateway/ledger.js +94 -0
package/dist/gateway/lock.js +114 -0
package/dist/gateway/schedule.js +74 -0
package/dist/gateway/scheduler.js +87 -0
package/dist/gateway/serve.js +57 -0
package/dist/gateway/server.js +94 -0
package/dist/gateway/telegram.js +115 -0
package/dist/git.js +55 -0
package/dist/hooks.js +104 -0
package/dist/knowledge.js +68 -0
package/dist/loop.js +169 -0
package/dist/mcp.js +191 -0
package/dist/memory.js +108 -0
package/dist/providers/codex.js +86 -0
package/dist/providers/keys.js +37 -0
package/dist/providers/models.js +55 -0
package/dist/providers/registry.js +241 -0
package/dist/session.js +36 -0
package/dist/skill-install.js +190 -0
package/dist/skills.js +111 -0
package/dist/tools/bash.js +26 -0
package/dist/tools/edit.js +107 -0
package/dist/tools/git.js +68 -0
package/dist/tools/index.js +36 -0
package/dist/tools/list.js +24 -0
package/dist/tools/permission.js +30 -0
package/dist/tools/read.js +18 -0
package/dist/tools/recall.js +12 -0
package/dist/tools/remember.js +14 -0
package/dist/tools/schedule.js +61 -0
package/dist/tools/search.js +54 -0
package/dist/tools/skill.js +65 -0
package/dist/tools/task.js +46 -0
package/dist/tools/util.js +5 -0
package/dist/tools/write.js +27 -0
package/dist/ui/app.js +132 -0
package/dist/ui/banner.js +20 -0
package/dist/ui/brain-wizard.js +29 -0
package/dist/ui/render.js +57 -0
package/dist/ui/setup.js +46 -0
package/package.json +77 -0
package/second-brain/AGENTS.md +18 -0
package/second-brain/CLAUDE.md +96 -0
package/second-brain/Evals/retrieval-eval.md +30 -0
package/second-brain/GEMINI.md +15 -0
package/second-brain/Home.md +33 -0
package/second-brain/README.md +29 -0
package/second-brain/Runbooks/ingest-quarantine.md +27 -0
package/second-brain/Runbooks/sleep-time-consolidation.md +26 -0
package/second-brain/Shared/AI-Context-Index.md +52 -0
package/second-brain/Shared/Core-Facts/protected-facts.md +21 -0
package/second-brain/Shared/Decision-Memory/decision-log.md +24 -0
package/second-brain/Shared/Memory-Inbox/memory-inbox.md +23 -0
package/second-brain/Shared/Operating-State/current-state.md +30 -0
package/second-brain/Shared/Provenance/ingest-log.md +27 -0
package/second-brain/Shared/Rules/context-assembly-policy.md +28 -0
package/second-brain/Shared/Rules/frontmatter-standard.md +33 -0
package/second-brain/Shared/Rules/skills-admission.md +30 -0
package/second-brain/Shared/User-Memory/user-preferences.md +25 -0
package/second-brain/Templates/bug.md +22 -0
package/second-brain/Templates/handoff.md +21 -0
package/second-brain/Templates/project.md +24 -0
package/second-brain/Templates/session.md +26 -0
package/second-brain/USER.md +36 -0
package/second-brain/Vault Structure Map.md +106 -0
package/skills/agent-tool-mcp-builder/SKILL.md +88 -0
package/skills/api-design-review/SKILL.md +70 -0
package/skills/async-concurrency-correctness/SKILL.md +93 -0
package/skills/audit-accessibility-wcag/SKILL.md +59 -0
package/skills/audit-technical-seo/SKILL.md +62 -0
package/skills/auth-jwt-session/SKILL.md +88 -0
package/skills/brainstorm-design/SKILL.md +73 -0
package/skills/build-etl-pipeline/SKILL.md +58 -0
package/skills/build-form-validation/SKILL.md +103 -0
package/skills/build-office-docs/SKILL.md +80 -0
package/skills/build-react-component/SKILL.md +116 -0
package/skills/build-spreadsheet/SKILL.md +106 -0
package/skills/caching-strategy/SKILL.md +75 -0
package/skills/cicd-pipeline-author/SKILL.md +65 -0
package/skills/cloud-cost-optimize/SKILL.md +91 -0
package/skills/code-comments/SKILL.md +52 -0
package/skills/code-review/SKILL.md +61 -0
package/skills/db-migration-safety/SKILL.md +67 -0
package/skills/debug-frontend-browser/SKILL.md +58 -0
package/skills/debug-root-cause/SKILL.md +54 -0
package/skills/dependency-upgrade/SKILL.md +56 -0
package/skills/deploy-release/SKILL.md +64 -0
package/skills/diff-table-parity/SKILL.md +58 -0
package/skills/dockerfile-optimize/SKILL.md +82 -0
package/skills/error-message/SKILL.md +58 -0
package/skills/estimate-work/SKILL.md +54 -0
package/skills/explore-codebase/SKILL.md +73 -0
package/skills/git-commit-pr/SKILL.md +65 -0
package/skills/gitops-deploy-workflow/SKILL.md +97 -0
package/skills/implement-from-design/SKILL.md +69 -0
package/skills/incident-response-sre/SKILL.md +78 -0
package/skills/k8s-debug-workload/SKILL.md +135 -0
package/skills/k8s-manifest-review/SKILL.md +86 -0
package/skills/llm-eval-harness/SKILL.md +63 -0
package/skills/manage-client-server-state/SKILL.md +94 -0
package/skills/mermaid-diagram/SKILL.md +61 -0
package/skills/message-queue-jobs/SKILL.md +139 -0
package/skills/naming-helper/SKILL.md +57 -0
package/skills/observability-instrument/SKILL.md +113 -0
package/skills/optimize-core-web-vitals/SKILL.md +75 -0
package/skills/optimize-sql-query/SKILL.md +67 -0
package/skills/performance-profiling/SKILL.md +65 -0
package/skills/process-pdf/SKILL.md +107 -0
package/skills/profile-dataset/SKILL.md +97 -0
package/skills/prompt-engineering/SKILL.md +70 -0
package/skills/rag-pipeline/SKILL.md +53 -0
package/skills/rate-limiting/SKILL.md +96 -0
package/skills/refactor-cleanup/SKILL.md +54 -0
package/skills/regex-build/SKILL.md +72 -0
package/skills/release-notes/SKILL.md +79 -0
package/skills/rest-graphql-contract/SKILL.md +71 -0
package/skills/scrape-structured-web-data/SKILL.md +61 -0
package/skills/secrets-management/SKILL.md +96 -0
package/skills/security-review/SKILL.md +62 -0
package/skills/shell-script-robust/SKILL.md +71 -0
package/skills/style-responsive-tailwind/SKILL.md +70 -0
package/skills/terraform-plan-review/SKILL.md +95 -0
package/skills/type-safety-strict/SKILL.md +82 -0
package/skills/validate-data-quality/SKILL.md +62 -0
package/skills/wrangle-tabular-data/SKILL.md +75 -0
package/skills/write-adr/SKILL.md +75 -0
package/skills/write-analytical-sql/SKILL.md +71 -0
package/skills/write-data-viz/SKILL.md +58 -0
package/skills/write-docs/SKILL.md +54 -0
package/skills/write-plan/SKILL.md +59 -0
package/skills/write-playwright-e2e/SKILL.md +86 -0
package/skills/write-prd/SKILL.md +65 -0
package/skills/write-rfc/SKILL.md +75 -0
package/skills/write-tests/SKILL.md +50 -0

package/skills/message-queue-jobs/SKILL.md ADDED

@@ -0,0 +1,139 @@
+---
+name: message-queue-jobs
+description: Builds async job and message-queue workflows (producers/consumers, idempotency, retries with backoff, dead-letter queues, exactly-once semantics) when offloading work or decoupling services.
+when_to_use: User wants background jobs, a queue/worker, event-driven processing, or to fix duplicate processing, lost messages, retry storms, or poison messages. Covers Redis/SQS/Kafka/RabbitMQ/Celery/BullMQ-style systems.
+---
+# Message Queue & Async Jobs
+## When to Use
+Reach for this skill when the task is one of:
+- **Offload slow work** — email, image/video processing, report generation, webhooks out → move it off the request path so the API responds fast.
+- **Decouple services** — service A emits an event, service B reacts, without a synchronous call chain.
+- **Fan-out / scheduled / batch** — one event triggers N workers, or work runs on a timer.
+- **Fix a broken queue** — symptoms map to root causes:
+| Symptom | Likely root cause | Section |
+|---|---|---|
+| Same side effect happens twice (double charge, double email) | At-least-once delivery + non-idempotent handler | Steps 1, 3 |
+| Messages vanish, work never runs | Ack-before-process, or no DLQ for failures | Steps 2, 4 |
+| Worker hammers a downstream into outage | Fixed-interval retries, no jitter, no backpressure | Steps 4, 5 |
+| One bad message blocks the whole partition/queue | No poison-message handling / no DLQ | Step 4 |
+| Queue depth climbs forever | Consumers slower than producers, no throttle/autoscale | Steps 5, 6 |
+If the work is fast, in-process, and never needs a retry, **do not add a queue** — it's complexity you'll pay for forever. Say so.
+## Steps
+### 1. Pick broker + delivery semantics first (this decides everything else)
+- Match broker to existing infra — don't add a new system if one is already running:
+  - **Redis-backed lib** (BullMQ / Celery+Redis / Sidekiq): simplest, low-latency jobs, single region. Good default for app-level background jobs.
+  - **Managed queue** (SQS-style): zero ops, built-in DLQ + visibility timeout. Default when on that cloud.
+  - **Log/stream** (Kafka-style): ordered, replayable, high-throughput, multi-consumer fan-out. Use for event sourcing / analytics, **not** for simple job offload.
+  - **AMQP broker** (RabbitMQ-style): rich routing (topic/fanout exchanges), per-message TTL.
+- **Assume at-least-once. Build for it.** True exactly-once delivery does not exist across a network. What you actually deliver is **at-least-once delivery + idempotent processing = effectively-once.** Kafka's "exactly-once" only holds for read→process→write *inside one Kafka transaction*; the moment a handler touches an external API or DB outside that transaction, you're back to at-least-once. Design idempotency (Step 3) regardless.
+- Decide ordering need now: most jobs don't need it. If they do, you need a partition/group key and it caps your parallelism (one in-flight per key).
+### 2. Design producer, consumer, and a versioned message schema
+- **Message = a typed envelope, not a blob.** Required fields:
+  ```
+  { id, type, version, occurred_at, idempotency_key, payload, attempts }
+  ```
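+- Put a **stable `idempotency_key`** in the message at produce time (e.g. derived from the business event: `order:1234:charge`). Do **not** use the broker's auto-generated message id: it is assigned per publish, so a producer that retries the same business event gets a new id, and delivery-level handles (such as a receipt handle) change on every redelivery.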
+- **Keep payloads small.** Send IDs/references, not large rows or files. Store blobs in object storage and pass the key. Big payloads blow past broker size limits (SQS 256KB, etc.) and slow every consumer.
+- **Schema versioning:** consumers must tolerate unknown fields and handle old `version` values during rolling deploys. Never reuse a field name with a new meaning — add a new field and bump `version`.
+- **Producer flush:** confirm the broker accepted the message (await the publish ack) before you report success upstream. Fire-and-forget producers silently drop messages on broker hiccups.
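The envelope rules above can be sketched as a small dataclass. This is a minimal illustration, not part of the skill: the class name `Envelope`, the helper `charge_order_message`, and the event type string are hypothetical, and JSON is assumed as the wire format.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Envelope:
    """Typed message envelope (hypothetical shape following the field list above)."""

    type: str
    version: int
    idempotency_key: str      # derived from the business event, stable across retries
    payload: dict[str, Any]   # IDs and references only, never large blobs
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    attempts: int = 0

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Envelope":
        data = json.loads(raw)
        known = set(cls.__dataclass_fields__)
        # Drop unknown fields so a newer producer does not break an older consumer.
        return cls(**{k: v for k, v in data.items() if k in known})


def charge_order_message(order_id: int) -> Envelope:
    """Build a charge request whose idempotency key depends only on the order."""
    return Envelope(
        type="order.charge_requested",  # hypothetical event name
        version=1,
        idempotency_key=f"order:{order_id}:charge",
        payload={"order_id": order_id},
    )
```

Two envelopes built for the same order get different `id` values but the same `idempotency_key`, which is the property the dedup step relies on.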
+### 3. Make handlers idempotent — this is the load-bearing step
+Retries are guaranteed, so a handler that runs twice must produce the same result as running once.
+- **Dedup table / dedup key (preferred for side-effects):** before doing work, `INSERT idempotency_key` into a table with a UNIQUE constraint. If the insert conflicts, the message was already processed → ack and skip. Do the insert and the side-effect in the **same transaction** where possible.
+- **Natural idempotency:** prefer operations that are safe to repeat — `UPSERT` / `SET x=v` over `INSERT` / `x = x + 1`. Use conditional writes (`UPDATE ... WHERE status='pending'`).
+- **External calls:** pass the `idempotency_key` to downstream APIs that support it (payment providers do). Wrap non-idempotent calls in your own dedup check.
+- **Set a TTL on dedup keys** longer than your max retry window (e.g. retries span 1h → keep keys ≥ 24h), then expire them so the table doesn't grow unbounded.
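The dedup-table approach above can be sketched with the standard-library `sqlite3` module. This is a minimal sketch under stated assumptions: the table names `processed` and `charges` and the function names are hypothetical, and the side effect is a local insert so that it can share one transaction with the dedup key.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a dedup table and a side-effect table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (order_id INTEGER, amount_cents INTEGER)")
    conn.commit()
    return conn


def handle_charge(
    conn: sqlite3.Connection,
    idempotency_key: str,
    order_id: int,
    amount_cents: int,
) -> bool:
    """Return True if the work ran, False if the message was a duplicate."""
    try:
        # One transaction: commits if both inserts succeed, rolls back otherwise,
        # so the dedup key is never stored without its side effect.
        with conn:
            conn.execute(
                "INSERT INTO processed (idempotency_key) VALUES (?)",
                (idempotency_key,),
            )
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The key already exists: the message was processed before, so ack and skip.
        return False
```

When the side effect is an external call rather than a local write, the two cannot share a transaction; in that case the key would be passed to the downstream API as described in the "External calls" bullet.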
+### 4. Retries with backoff + a dead-letter queue
+- **Exponential backoff WITH jitter.** Never fixed-interval, never un-jittered — synchronized retries cause a thundering herd that takes the downstream out a second time.
+  ```
+  delay = min(cap, base * 2 ** attempt)
+  delay = random_between(0, delay)   # full jitter
+  ```
+- **Bounded max attempts** (typically 3–6). On exceeding, route the message to a **dead-letter queue (DLQ)** — never drop it, never retry forever.
+- **Classify failures before retrying:**
+  - *Transient* (timeout, 5xx, throttle) → retry with backoff.
+  - *Permanent* (validation error, most 4xx other than 408/429, malformed payload = **poison message**) → DLQ immediately; retrying only wastes attempts.
+- **Poison messages:** one un-handleable message must not block the queue/partition behind it. With a DLQ this is automatic; with ordered Kafka partitions you must skip-and-DLQ explicitly or the partition stalls forever.
+- **Have a DLQ drain plan:** an alarm on DLQ count + a documented redrive (fix bug → replay DLQ back to main queue). A DLQ nobody watches is a silent data-loss bucket.
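The retry policy in this step could look roughly like the sketch below. The exception classes, the `dead_letters` list standing in for a real DLQ, and the default `base`, `cap`, and `max_attempts` values are assumptions for illustration only.

```python
import random
import time


class TransientError(Exception):
    """Timeout, 5xx, throttle: worth retrying."""


class PermanentError(Exception):
    """Validation failure or malformed payload: retrying cannot help."""


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0.0, ceiling)


def process_with_retries(handler, message, dead_letters, max_attempts=5, sleep=time.sleep):
    """Run handler(message); return True on success, False if it was dead-lettered."""
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except PermanentError:
            break  # poison message: skip the remaining attempts
        except TransientError:
            if attempt + 1 < max_attempts:
                sleep(full_jitter_delay(attempt))
    # Attempts exhausted or permanent failure: park the message, never drop it.
    dead_letters.append(message)
    return False
```

The `sleep` parameter is injectable so tests can record delays instead of waiting. In a real broker the delay would usually be expressed as a delayed redelivery rather than a blocking sleep in the worker.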
+### 5. Backpressure so consumers never overwhelm anything
+- **Bounded concurrency** per worker (`prefetch` / `concurrency` / `maxInFlight`). Default low (e.g. 5–10) and raise with evidence. Unbounded concurrency = OOM and downstream meltdown.
+- **Visibility timeout / ack deadline > p99 processing time.** Too short → the message redelivers *while you're still processing it* → duplicate work. Too long → slow recovery after a crash. Set it, and for long jobs heartbeat/extend it.
+- **Throttle on queue depth:** when depth or consumer lag exceeds a threshold, slow producers or scale consumers — don't just pile on.
+- **Rate-limit the downstream**, not just the queue (token bucket on the external API), so a sudden backlog drain doesn't exceed third-party quotas.
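Bounded concurrency can be sketched with an `asyncio.Semaphore`. This is a simplified illustration: `drain` and `max_in_flight` are hypothetical names, and a real consumer would normally rely on the broker client's own prefetch or concurrency setting instead.

```python
import asyncio


async def drain(messages, handler, max_in_flight: int = 5):
    """Process every message, with at most max_in_flight handlers running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(message):
        # Tasks beyond the limit wait here instead of hitting the downstream.
        async with semaphore:
            return await handler(message)

    # gather returns results in the same order as the input messages.
    return await asyncio.gather(*(run_one(m) for m in messages))
```

Note that this caps execution, not task creation: one task object still exists per message, so for very large backlogs the input itself should be pulled in bounded batches.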
+### 6. Observability — you can't operate a queue you can't see
+Emit and alarm on, at minimum:
+- **Queue depth** (backlog size) and **trend** — rising = consumers losing.
+- **Oldest-message age / consumer lag**: the real "are we behind?" signal (depth alone misleads when per-message processing time varies).
+- **DLQ count** — should be ~0; any nonzero needs eyes.
+- **Processing latency** (p50/p99) and **throughput** (msgs/sec).
+- **Retry rate** — a spike means a downstream is degrading; alarm before it becomes a DLQ flood.
+- **Trace context:** propagate a `trace_id` from producer through consumer so one logical operation is followable across the async boundary.
+### 7. Test failure injection before you ship
+Happy-path passing proves nothing here. Add tests that prove correctness under failure:
+1. **Duplicate delivery** — send the same message twice → assert the side-effect happens exactly once (proves Step 3).
+2. **Out-of-order / delayed** — deliver messages out of order → assert correctness if you claimed ordering, or assert order-independence if you didn't.
+3. **Poison message** — feed a malformed payload → assert it lands in the DLQ and the next good message still processes.
+4. **Crash mid-process** — kill the worker after the side-effect but **before ack** → restart → assert no duplicate (visibility timeout redelivers; idempotency must absorb it).
+5. **Retry exhaustion** — force a permanent failure → assert it stops at max attempts and DLQs, no infinite loop.
+## Common Errors
+- **Ack-before-process.** Acking on receive (or `auto-ack`/`enable.auto.commit=true` with default timing) loses the message if the worker dies mid-job. **Always ack/commit only after the work + its side-effects are durably committed.**
+- **Using the broker's message id as the idempotency key.** It is assigned per publish (a producer retry of the same event gets a new one), and delivery handles change on redelivery, so dedup can miss and you process duplicates anyway. Use a business-derived key set at produce time (Step 2).
+- **Believing "exactly-once delivery" exists.** Vendors sell at-least-once or at-most-once. "Exactly-once" is a *processing* property you build with idempotency, not a delivery guarantee you buy.
+- **Fixed-interval or un-jittered retries.** Turns one downstream blip into a synchronized retry storm. Always exponential + full jitter (Step 4).
+- **Visibility timeout shorter than processing time.** The message reappears and a second worker starts the same job → guaranteed duplicates that look like a "random" bug. Measure p99, set timeout above it, extend for long jobs.
+- **No DLQ.** Failures either retry forever (resource burn, head-of-line block) or get dropped (silent data loss). There is no good third option without a DLQ.
+- **Unbounded concurrency / prefetch.** A backlog drain spins up unlimited in-flight work → OOM or a downstream outage. Always cap (Step 5).
+- **Giant payloads in the message.** Hits broker size limits and bloats every consumer's memory. Send a reference; store the blob elsewhere.
+- **Producer doesn't await the publish ack.** Fire-and-forget drops messages on transient broker errors and you never know. Confirm acceptance before reporting success.
+- **Non-idempotent handler "fixed" by hoping retries won't happen.** They will. Idempotency is mandatory, not optional, under at-least-once.
+## Verify
+Before declaring done, confirm each — with evidence, not assertion:
+- [ ] Idempotency test passes: same message delivered 2x → side-effect occurs once (show the test + run output).
+- [ ] DLQ exists and is wired: poison message → lands in DLQ, queue keeps draining (show it).
+- [ ] Retries use exponential backoff **with jitter** and a finite max-attempts (show the config/code).
+- [ ] Ack/commit happens **after** the side-effect commits, not before (point to the line).
+- [ ] Visibility timeout / ack deadline > measured p99 processing time (state both numbers).
+- [ ] Concurrency/prefetch is bounded (show the value).
+- [ ] Metrics emitted for queue depth, oldest-message age, DLQ count, retry rate; alarm on DLQ > 0.
+- [ ] Crash-mid-process test: kill before ack → restart → no duplicate (show run).
+- [ ] Producer awaits publish ack; payload carries a stable `idempotency_key` and `version`.
+If any box can't be checked with a real run/config, it is **not** done — fix the root cause, don't loosen the test.
+## Related
+- `update-config` — to install a queue-health check or DLQ alarm as a recurring hook/job.

package/skills/naming-helper/SKILL.md ADDED

@@ -0,0 +1,57 @@
+---
+name: naming-helper
+description: Proposes and audits names for code identifiers, APIs, files, and config keys — generating consistent, intention-revealing candidates that follow the project's existing conventions (case style, domain vocabulary) and avoiding misleading or abbreviation-heavy names.
+when_to_use: User asks 'what should I call this', to rename for clarity, to name a function/variable/endpoint/flag/table, or to audit a diff for naming consistency before review.
+---
+## When to Use
+- "What should I call this?" / "Better name for X?" / "Rename this for clarity."
+- Naming a new function, variable, class, endpoint, CLI flag, env var, DB table/column, or config key.
+- Auditing a diff or file for naming consistency before code review.
+Skip when: the name is dictated by an external contract (HTTP spec header, third-party API field, framework magic name like `getServerSideProps`) — match the contract exactly, don't "improve" it.
+## Steps
+1. **Infer conventions before proposing.** Grep the surrounding scope, do NOT guess:
+   - Case style per kind — run e.g. `rg '\b(fn|def|func|function|const|let|var)\s+\w+' <dir>` and tally: are functions `camelCase`, `snake_case`, `PascalCase`? Booleans `is*/has*/should*`? Constants `SCREAMING_SNAKE`? Endpoints kebab or snake? Env vars `UPPER_SNAKE`?
+   - Domain vocabulary — does the codebase say `user` or `account`, `cancel` or `void`, `delete` or `remove`? Reuse the term already in use; don't introduce a synonym.
+   - Verb register for siblings — list the neighbors (`rg 'def (get|fetch|load|read|find)_' <module>`) and match the dominant verb instead of mixing.
+2. **Propose 2-3 candidates per item, each with a one-line rationale**, then mark the recommended one. Format:
+   ```
+   processData()  →
+     1. normalizeOrderRows()   ← recommend: says WHAT (normalize) + ON WHAT (order rows)
+     2. cleanOrders()          ← "clean" is vague, what does clean mean here?
+     3. transformOrderData()   ← "Data" is a noise word; "transform" hides intent
+   ```
+3. **Apply the quality bar** to filter candidates:
+   - Intention-revealing: name answers why it exists / what it returns, not how it's implemented.
+   - Searchable: no single-letter (except loop indices `i/j` in ≤3-line scopes) and no ambiguous 2-char names.
+   - Pronounceable, no encoded type (`strName`, `arrItems`, `bIsValid` — drop the prefix).
+   - No noise words: `Data`, `Info`, `Manager`, `Helper`, `Object`, `Stuff`, `tmp`, `do*`, `handle*` unless they add real meaning.
+   - No misleading terms: `userList` that's a `Set`/`Map`; `getX()` that mutates or does I/O; `isReady` that returns a count.
+   - Length scales with scope: tight loop var short; module-level export descriptive.
+4. **Consistency check across siblings.** Flag mixed vocabularies for the same concept:
+   - Verb drift: `getUser` + `fetchOrders` + `loadCart` in one module → pick one verb.
+   - Antonym pairs must match: `open/close`, `begin/end`, `add/remove`, `start/stop` — not `add/delete`.
+   - Singular/plural per cardinality: `user` returns one, `users` returns a collection.
+   - Same prefix family: if one flag is `--dry-run`, a sibling should be `--no-cache`, not `--skipCache`.
+5. **Audit mode** (when given a diff/file): scan only changed/added identifiers → output a table `current → suggested | reason`. **Names only: never change behavior, argument order, or types.** If a rename touches a public/exported symbol, list every call-site that must change and say so explicitly; do not silently break callers.
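The "detect, don't guess" step can be approximated by tallying case styles over identifiers that were already extracted (for example with the `rg` commands in step 1). The regular expressions and function names below are illustrative assumptions and will misclassify some edge cases such as acronym-heavy names.

```python
import re
from collections import Counter

# Checked in order; the first matching pattern wins.
CASE_PATTERNS = {
    "SCREAMING_SNAKE": re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$"),
    "snake_case": re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$"),
    "camelCase": re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$"),
    "PascalCase": re.compile(r"^[A-Z][a-z0-9]+([A-Z][a-z0-9]*)*$"),
}


def classify(name: str) -> str:
    """Return the case style of an identifier, or 'ambiguous' if none applies."""
    for style, pattern in CASE_PATTERNS.items():
        if pattern.match(name):
            return style
    # Single lowercase words such as "user" fit both snake_case and camelCase.
    return "ambiguous"


def dominant_style(names):
    """Return the most common unambiguous style among names, or None."""
    tally = Counter(classify(n) for n in names)
    tally.pop("ambiguous", None)
    return tally.most_common(1)[0][0] if tally else None
```

A candidate name would then be checked with `classify(candidate) == dominant_style(neighbors)` before it is proposed.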
+## Common Errors
+- **Inventing a convention instead of detecting one.** If you didn't grep, you don't know the case style — a `camelCase` suggestion in a `snake_case` file is an instant reject. Detect first.
+- **Renaming public API / serialized keys as if they're free.** A DB column, JSON field, env var, or exported function rename is a breaking change + (for columns) a migration. Flag it as breaking, don't bundle it into a "cleanup."
+- **Reserved words & collisions.** Check the target name isn't a language keyword (`class`, `type`, `enum`, `default`), a builtin (`id`, `list`, `dict`, `len`, `type`), or already taken in the same scope. A name that shadows a builtin is worse than the original.
+- **Over-shortening.** `cfg`, `usr`, `ctx`, `req/res` are fine ONLY if already idiomatic in this codebase; otherwise expand. Never abbreviate to save typing — searchability beats brevity.
+- **Boolean inversions.** Renaming `disabled` → `enabled` (or vice-versa) flips meaning at every call site. If you flip polarity, that's a behavior change — flag it, don't do it silently.
+- **Hungarian / type-encoded names** survive in legacy spots; match the file's existing pattern even if you dislike it — consistency within a file beats global purity.
+## Verify
+- Recommended name is in the file's detected case style (re-grep one neighbor to confirm).
+- Name reuses existing domain vocabulary, no new synonym introduced.
+- No collision: the name isn't already declared in scope and isn't a keyword/builtin.
+- Verb/antonym/cardinality matches its siblings.
+- For audit/rename output: every breaking rename (public symbol, serialized key, env var, column) is explicitly tagged as breaking with its call-sites/migration noted; no behavior, type, or signature was altered — names only.

package/skills/observability-instrument/SKILL.md ADDED

@@ -0,0 +1,113 @@
+---
+name: observability-instrument
+description: Adds production observability to a service — structured logging, RED/USE metrics, OpenTelemetry distributed tracing, plus Prometheus/Grafana dashboards and actionable SLO-based alerts. Triggers when instrumenting code, designing metrics/dashboards, defining SLI/SLO, or fixing noisy/missing alerts.
+when_to_use: Adding logging/metrics/tracing to a service, designing a dashboard, setting up SLIs/SLOs/alerts, or fixing noisy alerts and blind spots
+---
+## When to Use
+Use this skill when the task is any of the following:
+- Adding observability (logging/metrics/tracing) to a service that is still blind
+- Designing a new metric/dashboard, or deciding which signals to measure
+- Defining SLIs/SLOs and an error budget
+- Fixing alerts that are noisy (firing so often that people ignore them) or blind spots (an incident happened but no alert fired)
+- Debugging latency/errors spread across several services (requires distributed tracing)
+**Do not use when:** you are doing a one-off debug with temporary print/log statements, the service is a throwaway prototype, or a platform team already provides standard instrumentation (use theirs)
+## Steps
+Follow the steps in order. Each step can be verified for real; do not jump to dashboards before the metrics exist
+### 1. Choose signals: RED for request-driven services, USE for resources
+- **RED** (services that handle requests: API, RPC, consumer) → measure three things per endpoint/route:
+  - **Rate**: requests/sec
+  - **Errors**: failed requests/sec (emit it as its own count, not as a ratio; compute the ratio at query time)
+  - **Duration**: latency distribution (always a histogram, never an average)
+- **USE** (resources: CPU, memory, disk, queue, connection pool) → measure **Utilization, Saturation, Errors**
+- Start from the golden path: instrument the entry point (HTTP handler / message consumer) first, then go deeper into dependency calls (DB, cache, downstream HTTP)
+### 2. Structured logging: JSON plus a correlation/trace id on every log line
+- Emit logs as **JSON (one object per line)**, not printf-style strings, so they can be queried and filtered
+- Every log entry must carry the fields `timestamp` (RFC3339/ISO8601 UTC), `level`, `message`, `service`, `trace_id`, `span_id`
+- **Inject `trace_id` from context** into every log written inside a request scope, so you can jump from a log to its trace (logs↔traces correlation)
+- Use log levels meaningfully: `ERROR` = someone must look / `WARN` = degraded but still working / `INFO` = important state change (start/stop/deploy) / `DEBUG` = normally off in prod
+- **Never log PII or secrets** (passwords, tokens, full card numbers, full email addresses). Redact first and emit a filterable field instead (e.g. `user_id`, not the email)
+- Put context in fields, not string interpolation: `{"event":"order_failed","order_id":"123","reason":"timeout"}`, not `"order 123 failed: timeout"`
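The logging rules above can be sketched with the standard-library `logging` module. The class name `JsonFormatter` and the convention of passing structured context through a `fields` entry in `extra` are assumptions made for this example; in a real service the trace and span ids would come from the active tracing context rather than being passed by hand.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line with the required fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # Structured context first, so it cannot overwrite the core fields below.
            **getattr(record, "fields", {}),
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `logger.error("order_failed", extra={"trace_id": tid, "span_id": sid, "fields": {"order_id": "123", "reason": "timeout"}})` then produces a single filterable JSON line instead of an interpolated string.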
+### 3. OTel tracing: spans around each unit of work, with context propagation
+- Use the **OpenTelemetry SDK** (vendor-neutral): auto-instrumentation for HTTP/gRPC/DB clients first, manual spans only for important business logic
+- 1 span = 1 meaningful unit of work (handle a request, query the DB, call a downstream, process a batch)
+- **Propagate context across boundaries**: add the `traceparent` header (W3C Trace Context) to every outbound HTTP/RPC/message; the receiving side extracts it and continues the same trace
+- Add filterable span attributes: `http.route`, `http.status_code`, `db.system`, `messaging.destination`. **Do not add high-cardinality values** (user_id, request_id) as attributes if you plan to aggregate on them (put them in an event/log instead)
+- Mark span errors correctly: set status `ERROR` and record the exception, otherwise the trace looks green even though it failed
+- **Sampling**: head-based (e.g. 10%) for high traffic, but **always sample error/slow spans** (tail-based, if the backend supports it), otherwise the traces for an incident go missing
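To make the propagation rule concrete, the sketch below builds and parses a version `00` `traceparent` header by hand. It is only an illustration of the header layout (`version-traceid-parentid-flags`); the function names are hypothetical, and a real service would use the OpenTelemetry SDK's propagators rather than this code.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace id, 16 hex parent id, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span ids."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str):
    """Return a dict with trace_id, parent_id, sampled, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are reserved as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming: str) -> str:
    """Header for an outbound call: same trace id, fresh span id."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # nothing valid to continue, so start a new trace
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The point to notice is that the trace id survives every hop while the span id changes; a boundary that drops the header forces `new_traceparent()` and splits the trace.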
+### 4. Prometheus metrics: naming and labels that prevent a cardinality explosion
+- Use the right metric type: **Counter** (monotonic: requests, errors) / **Histogram** (latency, size; gives percentiles) / **Gauge** (queue depth, pool in-use)
+- Naming convention: `<namespace>_<subsystem>_<name>_<unit>` plus the standard suffixes: `_total` (counter), `_seconds` (duration), `_bytes` (size)
+  - Good: `http_requests_total`, `http_request_duration_seconds`, `db_pool_connections_in_use`
+- **Always use base units**: seconds not ms, bytes not MB
+- **🔴 Cardinality discipline (the gotcha that most often takes Prometheus down):**
+  - Labels must be bounded sets only: `method`, `route` (the template `/users/:id`, not `/users/42`), `status_code`, `status_class` (2xx/4xx/5xx)
+  - **Never use unbounded values as labels**: user_id, request_id, email, full URL with query, raw error message, timestamp
+  - Cardinality = the product of the number of values of every label → 3 labels with 10 routes × 5 methods × 6 status codes = 300 series per metric; adding user_id (1M users) multiplies that by up to 1M → OOM
+- Histogram buckets: set buckets so they cover the SLO threshold (e.g. SLO p99 < 300ms → there must be a bucket at 0.3), otherwise the SLO cannot be measured
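The cardinality arithmetic above is just a product, which a few lines make explicit. The function name `series_count` and the `buckets` parameter are illustrative; the estimate is a worst case that assumes every label combination actually occurs.

```python
from math import prod


def series_count(label_values: dict[str, int], buckets: int = 1) -> int:
    """Worst-case number of time series for one metric.

    label_values maps each label name to the number of distinct values it can take.
    For a histogram, pass the number of buckets, since each bucket is its own series.
    """
    return prod(label_values.values()) * buckets


bounded = {"route": 10, "method": 5, "status_code": 6}
with_user_id = {**bounded, "user_id": 1_000_000}
```

With the bounded label set the worst case is 300 series; adding a `user_id` label with a million values raises the worst case to 300 million, which is why unbounded labels are ruled out.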
+### 5. Grafana dashboard: RED panels plus SLIs per service
+- 1 dashboard per service, laid out top to bottom: **SLI summary → RED → dependencies → saturation**
+- RED panels (PromQL):
+  - Rate: `sum(rate(http_requests_total[5m])) by (route)`
+  - Errors: `sum(rate(http_requests_total{status_class="5xx"}[5m])) by (route)` and error ratio = errors/rate
+  - Duration: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))` with **p50/p95/p99 as separate lines; never use avg**
+- Use template variables (`$service`, `$route`) so the dashboard can be reused across services
+- Every panel must answer a real operational question. A panel nobody looks at during an incident should be deleted
+### 6. SLO + error budget → actionable alerts (multi-window burn rate)
+- Define the **SLI** (measurable from metrics that really exist): availability = `good_requests / total_requests`, latency = `% requests < threshold`
+- Set the **SLO** as a target (e.g. 99.9% over 30 days) → **error budget** = 0.1% = ~43 minutes of downtime per month
+- **Alert on the burn rate of the error budget, not on raw thresholds**. This is the key to preventing noise:
+  - **Multi-window, multi-burn-rate**: fast burn (e.g. 14.4× budget over a 1h window, confirmed by a 5m window) → page; slow burn (e.g. 3× over 6h) → ticket
+  - Two windows (long + short) prevent flapping: the long window confirms the budget is really burning, the short window stops the alert from lingering after recovery
+- **Every alert must be actionable**: it has a runbook link, says who is impacted, and has a clear action. An alert you cannot act on should be deleted
+- Separate **symptom-based** alerts (users are really hurting: high error rate, high latency → page) from **cause-based** ones (high CPU, full disk → ticket/warn). Page only on symptoms
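The error-budget and burn-rate arithmetic in this step can be written out directly. The function names are hypothetical and the 14.4 threshold is the example value from the text, not a recommendation; real alerts would be expressed as Prometheus rules over the SLI metrics.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed full-outage minutes in the SLO window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def budget_consumed(burn: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole budget spent if this burn rate lasts for the alert window."""
    return burn * alert_window_hours / (slo_window_days * 24)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Multi-window rule: both the long and the short window must exceed the threshold."""
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)
```

For a 99.9% SLO over 30 days the budget is about 43.2 minutes, and a 14.4× burn sustained for one hour spends 2% of it. Requiring the short window as well means the page clears soon after the error ratio recovers.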
+### 7. Verify: sample queries and real traces
+- See Common Errors and Verify below; run everything end-to-end before declaring done
+## Common Errors
+Real gotchas that keep recurring. Check before shipping:
+- **Cardinality bomb**: putting user_id/request_id/raw error text in a Prometheus label → series explode → Prometheus OOMs or slows down. Fix: keep labels to a bounded set, use the route template (`/users/:id`), and keep error details in traces/logs instead
+- **Average latency lies**: `avg(duration)` hides the tail; p99 800ms with avg 50ms means 1% of users are hurting while the dashboard is green. Always use percentiles via `histogram_quantile`
+- **Histogram buckets don't cover the SLO**: the SLO is p99<300ms but the buckets jump 0.1→0.5 → p99 is computed wrong because it is interpolated. Set buckets so one boundary sits exactly at the threshold
+- **Broken traces**: forgetting to propagate `traceparent` at a boundary (async job, message queue, new HTTP client) → the trace splits into several and the root cause can't be found. Check that every outbound call carries context
+- **Error span looks green**: catching an exception without setting span status to ERROR → the trace looks normal even though it failed. Call `span.recordException` and set the status
+- **`rate()` on a gauge / `sum` over `histogram_bucket` with the wrong `le` grouping**: `rate()` is for counters only; `histogram_quantile` needs `by (le, ...)` and must always keep `le`, otherwise it returns NaN
+- **Alerting on a raw threshold** (e.g. CPU>80%) → noisy, pages at 3 a.m. with no user hurting. Move to burn-rate/symptom-based alerts
+- **Counter without `_total`, duration in ms** → breaks convention, so queries/recording rules break. Always use base units + standard suffixes
+- **PII leaking into logs**: emitting email/token/PII into JSON logs → compliance/security incident. Redact before emitting
+- **Sampling eats error traces**: 10% head sampling drops 90% of an incident's traces. Always keep error/slow spans
+- **Double-counting from a dropped label**: changing the label set of an existing metric → recording rules/alerts miscompute during the transition. Version the metric, or migrate it together with the recording rules
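To make the bucket-boundary gotcha concrete, here is a simplified sketch of the linear interpolation that `histogram_quantile` performs on classic cumulative buckets. It is an approximation written for illustration (the function name mirrors PromQL, but this is not Prometheus code), and the request counts are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """buckets: (upper bound `le`, cumulative count), sorted by le, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le) or count == prev_count:
                return prev_le  # no finite width to interpolate over
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Hypothetical data: 1000 requests, 900 under 0.1s, the rest all under 0.3s.
coarse = [(0.1, 900), (0.5, 1000), (math.inf, 1000)]
with_slo_edge = [(0.1, 900), (0.3, 1000), (0.5, 1000), (math.inf, 1000)]

p99_coarse = histogram_quantile(0.99, coarse)        # about 0.46s: looks like an SLO breach
p99_edge = histogram_quantile(0.99, with_slo_edge)   # about 0.28s: within the 0.3s SLO
```

The traffic is identical in both cases; only the bucket layout changes, which is why a boundary exactly at the SLO threshold matters.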
+## Verify
+Before declaring done, every item must pass (with real evidence, not "probably works"):
+1. **Metrics are scrapeable**: `curl localhost:<port>/metrics` shows the instrumented metrics in the correct format (`_total`/`_seconds` suffixes, HELP/TYPE lines). Check the Prometheus target is `UP`
+2. **All three RED PromQL queries return values**: run the real rate/errors/duration queries and get numbers, not empty/NaN; `histogram_quantile` keeps `le`
+3. **Cardinality check**: `count({__name__="<metric>"})` shows a sensible series count (tens to thousands, not 100k+); fire test requests with several different ids → the series count must not grow with the ids
+4. **Trace is complete end-to-end**: send a request through ≥2 services and see a single trace linking all spans in the tracing backend; error case → span status = ERROR with the exception attached
+5. **Log↔trace correlation**: open that request's log and see a `trace_id` matching the trace; you can click through from the log to the trace
+6. **Alerts fire when they should**: trigger errors/latency beyond the SLO (load test / fault injection) → the burn-rate alert fires within the configured window; once the fault clears → the alert resolves (doesn't linger)
+7. **Alerts stay quiet when they should**: a short spike under the burn-rate threshold → no page (noise guard); confirm the short+long windows work
+8. **Dashboard is readable during an incident**: open the dashboard during fault injection and see the RED panels change clearly and point toward the root cause

package/skills/optimize-core-web-vitals/SKILL.md ADDED

@@ -0,0 +1,75 @@
+---
+name: optimize-core-web-vitals
+description: Diagnoses and fixes Core Web Vitals (LCP, INP, CLS) and Lighthouse failures via image/font/JS strategy in the browser; used when pages are slow, janky, or failing a Lighthouse/PageSpeed audit.
+when_to_use: When the user reports slow page load, layout shift, sluggish interactions, poor Lighthouse/PageSpeed scores, or asks to improve LCP/INP/CLS or Core Web Vitals — frontend/browser-side specifically. Not for backend/server query latency (that is performance-profiling).
+---
+## When to Use
+Use when a page is slow to render, shifts layout, or feels laggy on interaction — and the fix lives in the browser (HTML/CSS/JS/assets), not the server query path.
+| Symptom | Metric | This skill |
+|---|---|---|
+| Hero/main content paints late | LCP | yes |
+| Content jumps after load | CLS | yes |
+| Click/typing feels frozen | INP | yes |
+| Lighthouse/PageSpeed score red | all | yes |
+| API/DB response is the bottleneck | TTFB-backend | no → backend profiling |
+Targets (pass = 75th percentile): **LCP ≤ 2.5s · INP ≤ 200ms · CLS ≤ 0.1**. TTFB ≤ 800ms is a precondition for LCP — if TTFB alone blows the budget, stop and route to backend.
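As a rough sketch of what "pass = 75th percentile" means, the check below takes per-page-view field samples and compares the p75 of each metric with the targets above. The nearest-rank percentile and the names `THRESHOLDS`, `p75`, and `assess` are assumptions for illustration; real field tooling may compute percentiles differently.

```python
import math
from typing import Dict, List

# Targets from the text: LCP <= 2.5s, INP <= 200ms, CLS <= 0.1.
THRESHOLDS = {"lcp_ms": 2500.0, "inp_ms": 200.0, "cls": 0.1}


def p75(samples: List[float]) -> float:
    """75th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def assess(field_samples: Dict[str, List[float]]) -> Dict[str, bool]:
    """True per metric when its p75 is within the target."""
    return {name: p75(values) <= THRESHOLDS[name] for name, values in field_samples.items()}


# Hypothetical samples: one slow outlier does not fail LCP, but INP fails at p75.
report = assess({
    "lcp_ms": [1800, 2100, 2400, 4000],
    "inp_ms": [120, 150, 250, 300],
    "cls": [0.02, 0.05, 0.08, 0.3],
})
```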
+## Steps
+**1. Baseline before touching anything.** Run a trace + audit on the real URL so you have a before/after diff.
+- `chrome-devtools.performance_start_trace` (with `reload: true`, `autoStop: true`) → load → `performance_stop_trace`.
+- Or `chrome-devtools.lighthouse_audit` (categories `["performance"]`, mobile preset — mobile is what fails first).
+- Run `performance_analyze_insight` on the flagged insights (e.g. `LCPBreakdown`, `RenderBlocking`, `CLSCulprits`, `DocumentLatency`). Record the LCP element, INP target, CLS shifters, and TTFB. **Do not guess which image/script is the problem — read it from the trace.**
+**2. Identify the LCP element and attack its critical path.** LCP = TTFB + resource load delay + resource load duration + element render delay. Fix the dominant segment:
+- LCP is an image → add `fetchpriority="high"` to that one `<img>`, and `<link rel="preload" as="image" fetchpriority="high" imagesrcset=...>` in `<head>`. Remove `loading="lazy"` from it (lazy on the LCP image is a top regression).
+- Serve modern formats: AVIF then WebP fallback via `<picture>`; provide `srcset`/`sizes` so mobile downloads a small file, not the desktop original.
+- LCP is text → preload the font (step 4) and remove render-blocking CSS/JS in front of it.
+**3. Kill render-blocking resources.** From the `RenderBlocking` insight:
+- Inline critical above-the-fold CSS; load the rest with `media` swap or async. Defer non-critical CSS.
+- Add `defer` (or `type="module"`) to scripts; never use `async` on a script whose execution order matters for first paint, because `async` scripts run in whatever order they finish downloading.
+- Remove third-party/analytics tags from the critical path — load them after `load` or via `requestIdleCallback`.
+- Self-host or `preconnect` to required cross-origin asset domains.
+**4. Eliminate CLS — reserve space for everything that arrives late.**
+- Every `<img>`/`<video>`/`<iframe>` gets explicit `width`+`height` (or CSS `aspect-ratio`) so the box is reserved before the asset loads.
+- Fonts: `font-display: swap` + `<link rel="preload" as="font" type="font/woff2" crossorigin>` for the primary font; set `size-adjust`/`ascent-override` (or a matched fallback metric) to minimize the swap reflow.
+- Ads/embeds/dynamic banners: reserve a min-height container. Never inject DOM above existing content without space already held.
+- Avoid animating layout properties (top/left/height); use `transform`/`opacity`.
+**5. Fix INP — the slowest interaction, not just the first.** From the trace's interaction/long-task data:
+- Break long tasks (>50ms) with `scheduler.yield()` or chunked work; move heavy compute to a Web Worker.
+- `debounce`/`throttle` input handlers; isolate expensive work out of the event handler's synchronous path.
+- Reduce hydration cost: ship less JS to the client, use islands/partial hydration or server components so interactive regions hydrate independently instead of one giant blocking bundle.
+- Avoid forced synchronous layout (reading `offsetWidth`/`getBoundingClientRect` then writing styles in a loop — batch reads then writes).
+**6. Cut the JS budget.** Check `list_network_requests` for the heaviest scripts:
+- Code-split by route; dynamic-`import()` below-the-fold and on-interaction components (modals, carousels, charts).
+- Tree-shake; drop unused polyfills and duplicate library copies (check for two versions of the same dep in the bundle).
+- `loading="lazy"` + `IntersectionObserver` for below-fold images/iframes (but NOT the LCP element — see step 2).
+**7. Re-run the audit and diff.** Repeat step 1 on the same URL/preset. Compare LCP/INP/CLS numbers before vs after. Iterate until all three pass the target or the remaining gap is backend TTFB.
+## Common Errors
+- **Lazy-loading the LCP image.** `loading="lazy"` on the hero delays the very thing LCP measures. Lazy is for below-fold only.
+- **Trusting lab CLS = 0.** Lab loads fast and may not trigger font swap or late ads. Reproduce by throttling network/CPU in the trace, or check field data — most CLS comes from late-arriving fonts/ads/images, not the initial paint.
+- **`fetchpriority="high"` on everything.** Priority is relative; flag one LCP resource. Marking many demotes the signal to noise.
+- **`preload` without using it.** A preloaded font/image that is not used within a few seconds of page load triggers a console warning and wastes bandwidth. Preload only the LCP image and the first-paint font.
+- **Preloading a font without `crossorigin`.** Fonts are CORS-fetched; a `preload` missing `crossorigin` double-downloads the file.
+- **Optimizing the first interaction only.** INP reflects the slowest interaction of the whole visit (outliers aside), not the first one. Test scrolling, opening menus, and typing, not just the initial click.
+- **Desktop-only testing.** CWV failures are mobile-first (slow CPU, slow network). Always audit with the mobile preset and CPU/network throttling.
+- **Counting bytes saved, not metric moved.** "Saved 200KB" means nothing if it wasn't on the critical path. Verify the metric number changed, not the asset size.
+## Verify
+1. Re-run `lighthouse_audit` (mobile) **and** a `performance_start_trace`/`stop_trace` on the same URL as the baseline.
+2. Confirm: **LCP ≤ 2.5s, INP ≤ 200ms, CLS ≤ 0.1** — quote the actual before→after numbers, not "improved".
+3. `list_console_messages` shows no new errors or warnings, including unused-preload warnings.
+4. `list_network_requests` confirms LCP image is high priority + modern format, fonts are preloaded, and no surprise render-blocking script returned.
+5. If a target still fails and the remaining cost is TTFB/server, say so explicitly and route to backend — do not claim a browser-side fix that isn't there.

package/skills/optimize-sql-query/SKILL.md ADDED

@@ -0,0 +1,67 @@
+---
+name: optimize-sql-query
+description: Diagnoses slow SQL via EXPLAIN plans and recommends fixes — indexes, query rewrites, partition pruning, and join reordering — with measured before/after.
+when_to_use: When a SQL query is slow or expensive and the user wants it faster — analyze the execution plan, find the bottleneck (seq scan, bad join order, missing index, spilled sort), and propose index or rewrite fixes with verified improvement.
+---
+## When to Use
+A specific query is slow or costly and the user wants it faster. You have (or can get) the SQL text and a way to run `EXPLAIN` against a representative dataset.
+Do NOT use this skill when:
+- Authoring a new analytical query from scratch → use the SQL-authoring skill instead (this is perf, not authoring).
+- The "slowness" is connection/pool/lock contention, not the plan → that is an ops problem, not a query problem.
+- Row counts are tiny (< ~10k) on every table — a seq scan is fine; chasing a plan here is wasted effort.
+## Steps
+1. **Get the real plan, not the estimate.** Run `EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)` (Postgres) or the engine's analyze form. `EXPLAIN` alone gives the optimizer's *guess*; `ANALYZE` runs the query and gives actual rows/time. Always compare **estimated rows vs actual rows** per node: a 100x gap usually means stale statistics (run `ANALYZE <table>`), and every downstream join choice is suspect until it is fixed.
+2. **Find the cost driver, top of the plan downward.** Look for, in priority order:
+   - **Seq Scan on a large table** under a selective filter → missing index on the filter/join column.
+   - **Nested Loop where inner side runs N times** with high N → join-order or missing-index blowup; the planner expected few outer rows but got many (see step 1's row gap).
+   - **Sort / Hash that spills to disk** → look for `external merge Disk: NkB` or `Batches: >1`. The sort/hash exceeded `work_mem`. Fix by indexing to avoid the sort, reducing rows before the sort, or (last resort) raising `work_mem` for that session.
+   - **Partition scan with no pruning** → all partitions scanned. The predicate isn't on the partition key, or wraps it in a function/cast that defeats pruning.
+3. **Index recommendations — be specific about column order.**
+   - Composite index column order = **equality columns first, then the range/sort column, then included columns**. `WHERE a = ? AND b > ?` wants `(a, b)`, never `(b, a)`.
+   - A query filtering `a` and `c` cannot use `(a, b, c)` efficiently for `c` — `b` is a gap. Either reorder or make a separate index.
+   - **Covering index** (`INCLUDE` in Postgres, or trailing columns) turns an Index Scan + heap fetch into an Index-Only Scan — propose it when the query selects only a few columns beyond the predicate.
+   - Partial index (`WHERE status = 'active'`) when the query always filters on a low-cardinality flag — smaller, hotter, faster.
+4. **Rewrite when an index won't help.**
+   - **Make predicates sargable**: `WHERE date_col >= '2024-01-01'` not `WHERE date(date_col) = ...`; `col LIKE 'foo%'` (indexable) not `LIKE '%foo'` (not). Functions/casts on the indexed column kill index use — move the transform to the literal side.
+   - **Predicate pushdown**: filter inside the subquery/CTE, not after the join.
+   - In Postgres, a CTE may **materialize** and block predicate pushdown — add `MATERIALIZED`/`NOT MATERIALIZED` deliberately, or inline it.
+   - Replace a correlated self-join with a **window function** (`ROW_NUMBER`, `LAG`, running `SUM`) — usually one pass instead of N.
+   - Drop a redundant `DISTINCT` when a join key already guarantees uniqueness, or when `EXISTS` expresses the intent without deduplicating a large set.
+   - Prefer `EXISTS` over `IN (subquery)` when the subquery can short-circuit; prefer a join over `IN` when you need columns from both sides.
+5. **Warehouse / columnar engines need different levers.** On columnar warehouses, indexes barely exist; instead:
+   - Align the filter with the **partition/cluster key** so the engine prunes files/micro-partitions — check the plan's scanned-partitions/bytes-scanned, not just time.
+   - Watch **broadcast vs shuffle joins**: broadcast a small dimension table, shuffle when both sides are large. A shuffle of two huge tables on a skewed key is the classic blowup.
+   - Reduce columns scanned (columnar pays per column) — `SELECT *` is expensive here in a way it isn't on row stores.
+6. **Measure before/after on representative data and prove equivalence.**
+   - Capture baseline timing (median of a few warm runs, not one cold run) and the plan.
+   - Apply the fix, re-run `EXPLAIN ANALYZE`, capture the new timing/plan.
+   - Confirm the rewrite is **logically equivalent**: same row count and a checksum/ordered-diff of results on a sample, especially after touching `DISTINCT`, join type, or `NULL`-handling. A faster query that returns different rows is a bug, not an optimization.
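The equivalence check in step 6 can be sketched as a row count plus an order-independent checksum. SQLite (via Python's standard library) stands in for the real engine here, and `result_fingerprint`, `build_demo_db`, and the `customers`/`orders` schema are hypothetical names used only for this example.

```python
import hashlib
import sqlite3


def result_fingerprint(conn, sql, params=()):
    """Row count plus an order-independent checksum of the result set."""
    rows = conn.execute(sql, params).fetchall()
    # repr keeps NULL (None) distinct from '' and 0; sorting ignores row order.
    canonical = sorted(repr(row) for row in rows)
    digest = hashlib.sha256("\n".join(canonical).encode()).hexdigest()
    return len(rows), digest


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
        INSERT INTO customers VALUES (1, 'a'), (2, 'b'), (3, 'c');
        INSERT INTO orders VALUES (1, 1, 150), (2, 1, 200), (3, 2, 50), (4, 3, 120);
    """)
    return conn


ORIGINAL = """SELECT DISTINCT c.id, c.name FROM customers c
              JOIN orders o ON o.customer_id = c.id WHERE o.total > 100"""
REWRITE = """SELECT c.id, c.name FROM customers c
             WHERE EXISTS (SELECT 1 FROM orders o
                           WHERE o.customer_id = c.id AND o.total > 100)"""
# Dropping DISTINCT without switching to EXISTS duplicates customer 1.
BAD_REWRITE = """SELECT c.id, c.name FROM customers c
                 JOIN orders o ON o.customer_id = c.id WHERE o.total > 100"""

conn = build_demo_db()
same = result_fingerprint(conn, ORIGINAL) == result_fingerprint(conn, REWRITE)
broken = result_fingerprint(conn, ORIGINAL) == result_fingerprint(conn, BAD_REWRITE)
```

Sorting before hashing is a deliberate choice for queries without `ORDER BY`; if the query promises an order, hash the rows unsorted so an ordering regression is also caught.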
+## Common Errors
+- **Tuning to the cold cache.** First run reads from disk; second run is cached and 10x faster regardless of your change. Warm up, then take the median.
+- **Stale statistics misread as a bad query.** Huge estimated-vs-actual row gap → run `ANALYZE` first and re-plan *before* adding indexes; the "fix" may just be fresh stats.
+- **An index that helps one query slows every write.** Each index is maintained on every `INSERT`/`UPDATE`/`DELETE`. On write-heavy tables, weigh read gain against write cost; don't add a fourth near-duplicate index when reordering an existing one covers both.
+- **Function/cast silently disabling the index you just added.** `WHERE lower(email) = ?` won't use an index on `email` — needs an expression index on `lower(email)` or a sargable rewrite.
+- **Plan flips with data volume.** A plan that's optimal on today's 10k rows can go quadratic at 10M. Validate on production-scale row counts, or at least reason about how the chosen join (nested loop vs hash) scales.
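+- **`OR` across columns defeating indexes.** `WHERE a = ? OR b = ?` often can't use a single composite index. A `UNION` of two indexed lookups is frequently faster; use `UNION ALL` only if the second branch excludes rows already matched by the first, otherwise rows matching both predicates are returned twice.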
+- **Raising `work_mem`/session knobs as the "fix."** That hides the spill instead of removing it; it's a per-connection footgun under concurrency. Prefer reducing rows or indexing the sort away; bump memory only as a last, documented resort.
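The function-on-indexed-column gotcha can be observed directly in a query plan. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in for Postgres `EXPLAIN` (plan text differs between engines and versions); `plan` and the `users` table are hypothetical names for illustration.

```python
import sqlite3


def plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT);
    CREATE INDEX idx_users_email ON users (email);
""")

WRAPPED = "SELECT name FROM users WHERE lower(email) = 'a@example.com'"
PLAIN = "SELECT name FROM users WHERE email = 'a@example.com'"

# lower(email) hides the column from idx_users_email, so the table is scanned.
wrapped_before = plan(conn, WRAPPED)
# The bare column comparison can use the index.
plain_plan = plan(conn, PLAIN)

# An expression index on lower(email) makes the wrapped predicate indexable.
conn.execute("CREATE INDEX idx_users_lower_email ON users (lower(email))")
wrapped_after = plan(conn, WRAPPED)
```

Printing the three plans shows a full scan, then a search on `idx_users_email`, then a search on `idx_users_lower_email`, which mirrors the two fixes named above: an expression index or a sargable rewrite.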
+## Verify
+The optimization is done only when ALL hold:
+- New plan no longer shows the original cost driver (seq scan gone / nested loop replaced by hash / sort no longer spilling / partitions pruned).
+- Measured median runtime (or bytes-scanned on warehouses) improved on representative data — show the before/after numbers, not "should be faster."
+- Result set is provably identical: same row count + matching checksum/ordered sample vs the original query.
+- Write/maintenance cost of any new index was considered and noted for write-heavy tables.
+- Estimated rows now track actual rows within ~1 order of magnitude (no stale-stats landmine left behind).

package/skills/performance-profiling/SKILL.md ADDED

@@ -0,0 +1,65 @@
+---
+name: performance-profiling
+description: Diagnoses and fixes runtime performance problems — finds hotspots via profiling/measurement, then fixes N+1 queries, unnecessary allocations, blocking IO, and algorithmic complexity, proving the gain with before/after numbers. Use when something is measurably slow or resource-heavy.
+when_to_use: When an endpoint/job is slow, memory/CPU spikes, a query is slow, or a bundle is large, i.e. there is a measurable perf symptom
+---
+## When to Use
+Use when there is at least one **perf symptom that can be measured as a number**:
+- slow endpoint/job (high latency, p95/p99 spiking, timeouts)
+- memory/CPU spikes or OOM, frequent GC
+- slow DB queries, exhausted connection pool
+- large or slow bundle/build/startup
+Don't use this skill if there are no numbers yet confirming the slowness. Go measure a baseline first; otherwise it is premature optimization. If the problem is a "correctness bug" rather than "speed/resource" → use a different skill.
+**Iron rule: always measure before optimizing. Never optimize from a guess.**
+## Steps
+1. **Define a measurable target before starting.** Write it as a number, e.g. "p95 of `GET /orders` < 200ms" or "RSS < 512MB at 1k req/s". Specify the workload/dataset used to reproduce (input size, concurrency, env). If there is no stable repro, build one first (script/benchmark/load test), because noisy numbers can't be trusted.
+2. **Measure the baseline, then find the real hotspot with a profiler, not by reading code and guessing.** Pick the tool by layer:
+   - **CPU/wall time (app):** a profiler for that language, preferably a sampling one (e.g. `py-spy`, or the deterministic `cProfile`, for Python; `--prof`+flamegraph / `clinic flame` (Node); `pprof` (Go); async-profiler (JVM); `perf` + flamegraph). Read the **flame graph** for the widest frame (the one consuming the most wall/CPU time).
+   - **DB:** `EXPLAIN ANALYZE` on the suspect query; look at rows, Seq Scan, nested loops, actual time. Enable the slow query log / `pg_stat_statements` to find the queries with the highest total time.
+   - **N+1:** check the query count in the query log/APM. If the number of queries grows with the N of the result set, it is N+1.
+   - **Memory:** heap snapshot / heap profiler (`take_heapsnapshot`, `memory_profiler`, `pprof -alloc_space`, Go `pprof heap`); compare two snapshots to find objects that keep growing (leak) or allocation hotspots.
+   - **Async/blocking IO:** look for blocking calls in the hot path (sync IO in the event loop, lock contention, sequential awaits that should run in parallel).
+   - **Frontend/bundle:** a bundle analyzer to find large modules; Lighthouse/perf trace to find long tasks and layout thrash.
+3. **Rank candidates by impact and fix the top one first, one at a time.** Order by "% of time/resource that spot consumes" from the profiler. Always fix the spot taking 60% before the one taking 2%. **Make one change at a time** so the improvement can be attributed clearly; don't change five things at once and then not know which one helped.
+4. **Choose the fix by root cause (from expensive to cheap in terms of effort/maintainability):**
+   - **N+1 query** → batch (`IN (...)` / `WHERE id = ANY`), `JOIN`, or eager-load (include/preload/dataloader). The goal is fewer round-trips, not just faster individual queries.
+   - **Algorithmic** → O(n²)→O(n) with a hash map/set instead of a nested loop; O(n)→O(log n)/O(1) with an index/sorted structure. The gain grows with N, so this is often the best payoff.
+   - **Missing DB index** → add an index matching the predicate/sort/join (check with `EXPLAIN` that it is actually used after adding it). Watch the write overhead.
+   - **Allocation/GC** → reduce allocation in the hot loop (reuse buffers, cut intermediate copies/`map().filter().map()` chains on large arrays, stream instead of loading everything into memory).
+   - **Blocking/serial IO** → parallelize independent work (`Promise.all`/goroutines/async gather); move blocking calls out of the hot path/event loop.
+   - **Repeated expensive compute** → cache/memoize (with explicit invalidation), or precompute.
+   - **Over-fetching** → select only the columns used, paginate, lazy-load, defer.
+5. **Re-measure after the fix with the same repro/profiler and report before/after as numbers.** E.g. `p95 820ms → 140ms (-83%)`, `queries/req 312 → 3`, `RSS 1.2GB → 380MB`. If the numbers don't actually improve → revert that change (it wasn't the real hotspot) and go back to step 3.
+6. **Run the regression tests/lint until green.** Confirm that correctness wasn't traded for speed. If no test covers the behavior that was just changed → write one before merging.
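The N+1 fix from step 4 and the `queries/req` metric from step 5 can be sketched together. SQLite stands in for the real database, and `CountingDB`, `totals_n_plus_one`, `totals_batched`, and the `orders` schema are hypothetical names for illustration; a real app would read the count from its query log or APM.

```python
import sqlite3


class CountingDB:
    """Thin wrapper that counts round-trips, standing in for a query log."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def totals_n_plus_one(db, customer_ids):
    # One query per customer: the query count grows with N.
    out = {}
    for cid in customer_ids:
        rows = db.query(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?", (cid,))
        out[cid] = rows[0][0]
    return out


def totals_batched(db, customer_ids):
    # One query for all customers via IN (...), grouped in the database.
    placeholders = ",".join("?" for _ in customer_ids)
    rows = db.query(
        "SELECT customer_id, SUM(total) FROM orders "
        f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id",
        tuple(customer_ids))
    out = {cid: 0 for cid in customer_ids}  # customers with no orders stay at 0
    out.update(dict(rows))
    return out


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
        INSERT INTO orders VALUES (1, 1, 150), (2, 1, 200), (3, 2, 50);
    """)
    return CountingDB(conn)


slow_db, fast_db = make_db(), make_db()
slow_result = totals_n_plus_one(slow_db, [1, 2, 3])
fast_result = totals_batched(fast_db, [1, 2, 3])
```

Both versions return the same totals, which is the correctness half of the check; the round-trip count dropping from N to 1 is the before/after number to report.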
+## Common Errors / Gotchas
+- **Guessing the hotspot by reading code.** The spot that "looks slow" is often not where the time actually goes. Require profiler/EXPLAIN confirmation, every time.
+- **Measuring on a dev/debug build or a small dataset.** Debug mode, source maps, and a 10-row dataset give a profile that has little to do with prod. Measure on a build and data size close to prod.
+- **Cold start / JIT / cache warm-up distort the numbers.** Discard the first runs (warmup) and measure several runs, taking the median/p95 rather than a single noisy value.
+- **`EXPLAIN` (planner estimate) ≠ `EXPLAIN ANALYZE` (actual).** Look at actual time + actual rows; estimates can lie when stats are stale (`ANALYZE` the table first).
+- **The query got faster but throughput got worse**: adding an index speeds up reads but makes writes slower/larger. Look at the system-wide trade-off, not a single number.
+- **A cache without invalidation** is a correctness bug waiting to happen. Any cache must come with an answer to when it is invalidated.
+- **Micro-optimizing a spot that takes 1%.** Swapping `for` for `while` or shaving a `+` here and there outside the hotspot wastes time and makes the code harder to read, for no gain.
+- **Trading correctness for speed**: dropping assertions, skipping validation, unintentionally reading stale data. Strictly forbidden: fast but wrong is broken.
+- **N+1 hidden behind ORM lazy-load/serializers.** It is invisible in the code itself; you have to look at the real query log.
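A minimal sketch of the warmup-then-median measurement described above, using only the Python standard library. `bench` and `pct_change` are hypothetical helpers, and the default warmup and run counts are arbitrary assumptions to tune per workload.

```python
import math
import statistics
import time


def bench(fn, warmup=3, runs=11):
    """Discard warmup runs, then report median and p95 of the timed runs."""
    for _ in range(warmup):
        fn()  # warm caches/JIT; these timings are thrown away
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    ordered = sorted(samples)
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]  # nearest-rank percentile
    return {"median_s": statistics.median(samples), "p95_s": p95, "runs": runs}


def pct_change(before, after):
    """Relative change in percent; negative means the value went down."""
    return (after - before) / before * 100.0
```

For example, `round(pct_change(820, 140))` gives `-83`, matching the `p95 820ms → 140ms (-83%)` report format used in step 5.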
+## Verify
+Considered done only when every item holds:
+- [ ] There are **before/after numbers** from the same repro/profiler, and they meet the target set in step 1
+- [ ] The improvement comes from a real hotspot the profiler pointed to (you can explain why it helps, not just "tried it and it got faster")
+- [ ] Regression tests + lint are green; correctness was not traded away
+- [ ] No hidden trade-offs (write/memory/maintainability unintentionally worse), or if there are, they are stated clearly and acceptable
+- [ ] Measured on a build/dataset close to prod, not dev/toy data