sanook-cli 0.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.env.example +23 -0
- package/CHANGELOG.md +38 -0
- package/LICENSE +201 -0
- package/README.md +239 -0
- package/dist/agentContext.js +2 -0
- package/dist/approval.js +78 -0
- package/dist/bin.js +461 -0
- package/dist/brain.js +186 -0
- package/dist/commands.js +66 -0
- package/dist/compaction.js +85 -0
- package/dist/config.js +101 -0
- package/dist/cost.js +59 -0
- package/dist/diff.js +36 -0
- package/dist/gateway/auth.js +32 -0
- package/dist/gateway/ledger.js +94 -0
- package/dist/gateway/lock.js +114 -0
- package/dist/gateway/schedule.js +74 -0
- package/dist/gateway/scheduler.js +87 -0
- package/dist/gateway/serve.js +57 -0
- package/dist/gateway/server.js +94 -0
- package/dist/gateway/telegram.js +115 -0
- package/dist/git.js +55 -0
- package/dist/hooks.js +104 -0
- package/dist/knowledge.js +68 -0
- package/dist/loop.js +169 -0
- package/dist/mcp.js +191 -0
- package/dist/memory.js +108 -0
- package/dist/providers/codex.js +86 -0
- package/dist/providers/keys.js +37 -0
- package/dist/providers/models.js +55 -0
- package/dist/providers/registry.js +241 -0
- package/dist/session.js +36 -0
- package/dist/skill-install.js +190 -0
- package/dist/skills.js +111 -0
- package/dist/tools/bash.js +26 -0
- package/dist/tools/edit.js +107 -0
- package/dist/tools/git.js +68 -0
- package/dist/tools/index.js +36 -0
- package/dist/tools/list.js +24 -0
- package/dist/tools/permission.js +30 -0
- package/dist/tools/read.js +18 -0
- package/dist/tools/recall.js +12 -0
- package/dist/tools/remember.js +14 -0
- package/dist/tools/schedule.js +61 -0
- package/dist/tools/search.js +54 -0
- package/dist/tools/skill.js +65 -0
- package/dist/tools/task.js +46 -0
- package/dist/tools/util.js +5 -0
- package/dist/tools/write.js +27 -0
- package/dist/ui/app.js +132 -0
- package/dist/ui/banner.js +20 -0
- package/dist/ui/brain-wizard.js +29 -0
- package/dist/ui/render.js +57 -0
- package/dist/ui/setup.js +46 -0
- package/package.json +77 -0
- package/second-brain/AGENTS.md +18 -0
- package/second-brain/CLAUDE.md +96 -0
- package/second-brain/Evals/retrieval-eval.md +30 -0
- package/second-brain/GEMINI.md +15 -0
- package/second-brain/Home.md +33 -0
- package/second-brain/README.md +29 -0
- package/second-brain/Runbooks/ingest-quarantine.md +27 -0
- package/second-brain/Runbooks/sleep-time-consolidation.md +26 -0
- package/second-brain/Shared/AI-Context-Index.md +52 -0
- package/second-brain/Shared/Core-Facts/protected-facts.md +21 -0
- package/second-brain/Shared/Decision-Memory/decision-log.md +24 -0
- package/second-brain/Shared/Memory-Inbox/memory-inbox.md +23 -0
- package/second-brain/Shared/Operating-State/current-state.md +30 -0
- package/second-brain/Shared/Provenance/ingest-log.md +27 -0
- package/second-brain/Shared/Rules/context-assembly-policy.md +28 -0
- package/second-brain/Shared/Rules/frontmatter-standard.md +33 -0
- package/second-brain/Shared/Rules/skills-admission.md +30 -0
- package/second-brain/Shared/User-Memory/user-preferences.md +25 -0
- package/second-brain/Templates/bug.md +22 -0
- package/second-brain/Templates/handoff.md +21 -0
- package/second-brain/Templates/project.md +24 -0
- package/second-brain/Templates/session.md +26 -0
- package/second-brain/USER.md +36 -0
- package/second-brain/Vault Structure Map.md +106 -0
- package/skills/agent-tool-mcp-builder/SKILL.md +88 -0
- package/skills/api-design-review/SKILL.md +70 -0
- package/skills/async-concurrency-correctness/SKILL.md +93 -0
- package/skills/audit-accessibility-wcag/SKILL.md +59 -0
- package/skills/audit-technical-seo/SKILL.md +62 -0
- package/skills/auth-jwt-session/SKILL.md +88 -0
- package/skills/brainstorm-design/SKILL.md +73 -0
- package/skills/build-etl-pipeline/SKILL.md +58 -0
- package/skills/build-form-validation/SKILL.md +103 -0
- package/skills/build-office-docs/SKILL.md +80 -0
- package/skills/build-react-component/SKILL.md +116 -0
- package/skills/build-spreadsheet/SKILL.md +106 -0
- package/skills/caching-strategy/SKILL.md +75 -0
- package/skills/cicd-pipeline-author/SKILL.md +65 -0
- package/skills/cloud-cost-optimize/SKILL.md +91 -0
- package/skills/code-comments/SKILL.md +52 -0
- package/skills/code-review/SKILL.md +61 -0
- package/skills/db-migration-safety/SKILL.md +67 -0
- package/skills/debug-frontend-browser/SKILL.md +58 -0
- package/skills/debug-root-cause/SKILL.md +54 -0
- package/skills/dependency-upgrade/SKILL.md +56 -0
- package/skills/deploy-release/SKILL.md +64 -0
- package/skills/diff-table-parity/SKILL.md +58 -0
- package/skills/dockerfile-optimize/SKILL.md +82 -0
- package/skills/error-message/SKILL.md +58 -0
- package/skills/estimate-work/SKILL.md +54 -0
- package/skills/explore-codebase/SKILL.md +73 -0
- package/skills/git-commit-pr/SKILL.md +65 -0
- package/skills/gitops-deploy-workflow/SKILL.md +97 -0
- package/skills/implement-from-design/SKILL.md +69 -0
- package/skills/incident-response-sre/SKILL.md +78 -0
- package/skills/k8s-debug-workload/SKILL.md +135 -0
- package/skills/k8s-manifest-review/SKILL.md +86 -0
- package/skills/llm-eval-harness/SKILL.md +63 -0
- package/skills/manage-client-server-state/SKILL.md +94 -0
- package/skills/mermaid-diagram/SKILL.md +61 -0
- package/skills/message-queue-jobs/SKILL.md +139 -0
- package/skills/naming-helper/SKILL.md +57 -0
- package/skills/observability-instrument/SKILL.md +113 -0
- package/skills/optimize-core-web-vitals/SKILL.md +75 -0
- package/skills/optimize-sql-query/SKILL.md +67 -0
- package/skills/performance-profiling/SKILL.md +65 -0
- package/skills/process-pdf/SKILL.md +107 -0
- package/skills/profile-dataset/SKILL.md +97 -0
- package/skills/prompt-engineering/SKILL.md +70 -0
- package/skills/rag-pipeline/SKILL.md +53 -0
- package/skills/rate-limiting/SKILL.md +96 -0
- package/skills/refactor-cleanup/SKILL.md +54 -0
- package/skills/regex-build/SKILL.md +72 -0
- package/skills/release-notes/SKILL.md +79 -0
- package/skills/rest-graphql-contract/SKILL.md +71 -0
- package/skills/scrape-structured-web-data/SKILL.md +61 -0
- package/skills/secrets-management/SKILL.md +96 -0
- package/skills/security-review/SKILL.md +62 -0
- package/skills/shell-script-robust/SKILL.md +71 -0
- package/skills/style-responsive-tailwind/SKILL.md +70 -0
- package/skills/terraform-plan-review/SKILL.md +95 -0
- package/skills/type-safety-strict/SKILL.md +82 -0
- package/skills/validate-data-quality/SKILL.md +62 -0
- package/skills/wrangle-tabular-data/SKILL.md +75 -0
- package/skills/write-adr/SKILL.md +75 -0
- package/skills/write-analytical-sql/SKILL.md +71 -0
- package/skills/write-data-viz/SKILL.md +58 -0
- package/skills/write-docs/SKILL.md +54 -0
- package/skills/write-plan/SKILL.md +59 -0
- package/skills/write-playwright-e2e/SKILL.md +86 -0
- package/skills/write-prd/SKILL.md +65 -0
- package/skills/write-rfc/SKILL.md +75 -0
- package/skills/write-tests/SKILL.md +50 -0
|
@@ -0,0 +1,65 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: cicd-pipeline-author
|
|
3
|
+
description: Designs and hardens CI/CD pipelines across GitHub Actions, GitLab CI, Jenkins, and CircleCI — caching, matrix builds, least-privilege tokens, pinned actions, and OIDC instead of long-lived secrets. Triggers when writing or fixing a pipeline/workflow file, speeding up CI, or securing a build.
|
|
4
|
+
when_to_use: เขียน/แก้ .github/workflows, .gitlab-ci.yml, Jenkinsfile; CI ช้า/แพง; ต้อง harden pipeline หรือเปลี่ยนไป OIDC
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Use this skill when the task touches a pipeline definition file or its performance/security:
|
|
10
|
+
|
|
11
|
+
- Authoring or editing `.github/workflows/*.yml`, `.gitlab-ci.yml`, `Jenkinsfile`, or `.circleci/config.yml`.
|
|
12
|
+
- CI is slow or expensive — reduce wall-clock time and runner minutes.
|
|
13
|
+
- Hardening a build: removing long-lived secrets, pinning third-party actions, scoping token permissions.
|
|
14
|
+
- Migrating cloud auth from static keys to OIDC (federated identity).
|
|
15
|
+
- Splitting a monolithic pipeline into reusable workflows/templates.
|
|
16
|
+
|
|
17
|
+
Skip if the change is a one-line tweak whose diff you can state in a sentence (bump a runner image tag, fix a typo in a step name) — just make it.
|
|
18
|
+
|
|
19
|
+
## Steps
|
|
20
|
+
|
|
21
|
+
1. **Detect platform and read the existing pipeline first.** Match the file: `.github/workflows/*` → Actions, `.gitlab-ci.yml` → GitLab, `Jenkinsfile` → Jenkins, `.circleci/config.yml` → CircleCI. Never rewrite blind — read the current file, note language/build tool, where secrets are consumed, and which steps dominate runtime.
|
|
22
|
+
|
|
23
|
+
2. **Lay out stages as a DAG, not a straight line.** Target order: `lint+typecheck` → `build` → `test` → `security-scan` → `deploy` (gated). Run independent stages in parallel; only serialize on real data dependencies. In Actions use `needs:`; in GitLab `stage:` + `needs:` for out-of-order DAG; in CircleCI `requires:` under workflows.
|
|
24
|
+
|
|
25
|
+
3. **Cache the dependency layer, key it on the lockfile.** Cache key = hash of the lockfile, with a restore-key fallback for partial hits. Cache the package dir, not `node_modules`-style install output that must stay in sync with the runner.
|
|
26
|
+
- Actions: `actions/cache` with `key: deps-${{ runner.os }}-${{ hashFiles('**/lock*') }}` and `restore-keys: deps-${{ runner.os }}-`. For common ecosystems prefer the built-in cache (e.g. `setup-*` with `cache:` input) which keys the lockfile for you.
|
|
27
|
+
- GitLab: `cache: { key: { files: [lockfile] }, paths: [...], policy: pull-push }`; set `policy: pull` on jobs that only read.
|
|
28
|
+
- Never cache build artifacts that change every commit — that just thrashes the cache. Pass those between stages as artifacts instead.
|
|
29
|
+
|
|
30
|
+
4. **Add a matrix only where it earns parallelism.** Fan out across versions/OS that you actually support. Set `fail-fast: false` when you need the full failure map (e.g. compatibility testing); leave it `true` (default) to abort early and save minutes on normal PR runs. Shard a large test suite across N parallel jobs and merge results.
|
|
31
|
+
|
|
32
|
+
5. **Pin third-party actions/images by full commit SHA, not a tag.** Tags (`@v4`) are mutable and a supply-chain hole. Pin `uses: owner/action@<40-char-sha> # v4.1.2` and keep the version in a trailing comment so a bot can bump it. Pin container/runner images by digest where the platform allows. First-party `actions/*` may use a major tag if your policy permits, but SHA is the safe default.
|
|
33
|
+
|
|
34
|
+
6. **Scope token permissions to least privilege.** In Actions, set top-level `permissions: { contents: read }` (default-deny everything else) and widen per-job only for what that job needs (e.g. `id-token: write` for OIDC, `packages: write` to publish). In GitLab, prefer job tokens with limited scope over project access tokens. Never give a test job write access.
|
|
35
|
+
|
|
36
|
+
7. **Replace static cloud secrets with OIDC.** Stop storing long-lived `AWS_*`/`GCP_*`/`AZURE_*` keys. Have the pipeline mint a short-lived token via the provider's federated identity:
|
|
37
|
+
- Job needs `permissions: id-token: write` (Actions) or the platform's OIDC equivalent.
|
|
38
|
+
- Configure a trust policy on the cloud side scoped to **this repo + this branch/environment + this workflow** (don't trust `repo:*`).
|
|
39
|
+
- Use the provider's official auth action to exchange the OIDC token for temporary credentials. Remove the old secrets after the OIDC path is verified — not before.
|
|
40
|
+
|
|
41
|
+
8. **Factor shared logic into reusable workflows/templates.** Actions: reusable workflow called via `uses: ./.github/workflows/x.yml` or a composite action. GitLab: `include:` + `extends:` + `!reference`. CircleCI: orbs/commands. Jenkins: shared library. One source of truth beats copy-pasted YAML across services.
|
|
42
|
+
|
|
43
|
+
9. **Make failures fast and required.** Put cheap checks (lint, typecheck) before expensive ones so a bad PR fails in seconds. Add `timeout-minutes` to every job so a hung step can't burn an hour. Wire the gating jobs into branch protection / required status checks so deploy literally cannot run on a red build. Gate deploy on a protected environment with manual approval for prod.
|
|
44
|
+
|
|
45
|
+
10. **Validate locally and report the diff.** Lint the file, dry-run if possible (see Verify), then report **before/after wall-clock and estimated runner-minutes**, plus the security delta (secrets removed, permissions narrowed, actions pinned). A pipeline change without a measured result is unfinished.
|
|
46
|
+
|
|
47
|
+
## Common Errors
|
|
48
|
+
|
|
49
|
+
- **`permissions` block at top level silently kills the whole-workflow grant.** The moment you add any `permissions:` key, every unlisted scope becomes `none`. A push/comment step that worked before will now 403. Add back exactly what each job needs at job level.
|
|
50
|
+
- **Cache restored but the install step still runs full.** Usually the cache path is wrong, or the key changes every run (you hashed a file that mutates). Verify with the cache-hit output, and cache the package manager's store dir, not the project-local install dir.
|
|
51
|
+
- **OIDC `Error: Not authorized to perform sts:AssumeRoleWithWebIdentity` / audience mismatch.** The cloud-side trust condition doesn't match the token's `sub`/`aud`. Check the exact subject string format (it differs for branch vs tag vs environment vs pull_request) and that the audience matches what the auth step requests.
|
|
52
|
+
- **Pinned SHA points at a tag that was force-moved → action runs different code than reviewed.** Always re-verify the SHA belongs to the version in your comment; don't copy a SHA from an untrusted PR.
|
|
53
|
+
- **`fail-fast` cancels sibling matrix jobs and you lose the failure you needed.** For compatibility/flake debugging set `fail-fast: false`; for normal PRs keep it on to save minutes — they're opposite goals, pick deliberately.
|
|
54
|
+
- **Secrets passed into a reusable workflow vanish.** Reusable workflows don't inherit secrets implicitly — pass them with `secrets: inherit` or map each one explicitly.
|
|
55
|
+
- **`pull_request` from a fork can't read secrets / OIDC.** That's by design (untrusted code). Don't "fix" it by switching to `pull_request_target` with checkout of the PR head — that's a known privilege-escalation footgun. Run untrusted PR builds without secrets.
|
|
56
|
+
- **No `timeout-minutes` → a hung network call burns the full default timeout (often 6h) of paid minutes.** Set an explicit per-job timeout always.
|
|
57
|
+
- **Caching across untrusted PRs can poison the cache.** Scope/segregate caches so a fork PR can't write a cache a trusted branch later reads.
|
|
58
|
+
|
|
59
|
+
## Verify
|
|
60
|
+
|
|
61
|
+
- **Syntax/lint:** Actions → `actionlint` (catches expression and context errors a YAML linter misses). GitLab → CI Lint API/UI. CircleCI → `circleci config validate`. Jenkins → `jenkins-cli declarative-linter`. Any platform → a plain YAML linter as a floor.
|
|
62
|
+
- **Dry run where supported:** run the workflow locally (e.g. `act` for Actions) for fast iterate-on-failure before pushing; if the job has external/cloud deps that can't run locally, push to a throwaway branch and read the live run.
|
|
63
|
+
- **Security checklist passes:** no `secrets.*` for cloud creds remaining (OIDC instead); top-level `permissions` is read-only with per-job widening; every third-party action pinned to a full SHA; every job has `timeout-minutes`; deploy gated by required checks + protected environment.
|
|
64
|
+
- **Performance proven:** capture wall-clock and runner-minutes before vs after on a real run; confirm the cache reports a hit on the second run and parallel jobs actually overlap. Report the numbers — "should be faster" is not verification.
|
|
65
|
+
- **Branch protection enforced:** confirm the gating jobs are listed as required status checks so a red build blocks merge/deploy.
|
|
@@ -0,0 +1,91 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: cloud-cost-optimize
|
|
3
|
+
description: Performs FinOps cost optimization on AWS/GCP/Azure — right-sizing instances, spotting idle/orphaned resources, Savings Plans/Reserved/committed-use analysis, storage tiering, and cost anomaly investigation. Triggers when a cloud bill spikes, doing a cost review, or right-sizing infrastructure.
|
|
4
|
+
when_to_use: bill cloud พุ่ง, รีวิว cost/FinOps, right-size instance, หา resource ที่ idle/orphan, วิเคราะห์ cost anomaly
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
- Cloud bill jumped vs last month/period and you need to find *what* and *why* before paying it.
|
|
10
|
+
- Scheduled FinOps / cost review of an account, project, or subscription.
|
|
11
|
+
- Right-sizing a fleet (EC2 / Compute Engine / VM Scale Sets, RDS/Cloud SQL, EKS/GKE/AKS nodes).
|
|
12
|
+
- Hunting idle or orphaned resources draining money with zero traffic.
|
|
13
|
+
- Deciding commitment strategy: Savings Plans / Reserved Instances / Committed-Use Discounts vs Spot/Preemptible.
|
|
14
|
+
|
|
15
|
+
Skip if: you only need a one-off price quote for a single SKU (use the pricing calculator), or the spike is already explained by a known launch/migration.
|
|
16
|
+
|
|
17
|
+
## Steps
|
|
18
|
+
|
|
19
|
+
Detect the provider first — never assume. Check for `~/.aws/credentials` / `AWS_PROFILE`, `gcloud config list`, or `az account show`. Always pin scope (`--region`, `--project`, subscription) and a date window. Always start **read-only**; never mutate in the analysis phase.
|
|
20
|
+
|
|
21
|
+
### 1. Pull the cost breakdown by service + tag (find the biggest lever)
|
|
22
|
+
Don't eyeball the console — pull structured data grouped by service AND a cost-allocation dimension, so you know where the money actually is.
|
|
23
|
+
- **AWS:** `aws ce get-cost-and-usage --time-period Start=<YYYY-MM-01>,End=<YYYY-MM-01> --granularity MONTHLY --metrics UnblendedCost --group-by Type=DIMENSION,Key=SERVICE` then re-run grouped by `Type=TAG,Key=<env|team|app>` and by `Type=DIMENSION,Key=USAGE_TYPE` for the top service. Compare two consecutive months to isolate the delta.
|
|
24
|
+
- **GCP:** query the **BigQuery billing export** (`SELECT service.description, sku.description, SUM(cost) ... GROUP BY ... ORDER BY 3 DESC`). The console export lags and rounds; BQ is the source of truth. If no export configured, that's finding #1.
|
|
25
|
+
- **Azure:** `az costmanagement query --type ActualCost --timeframe MonthToDate --dataset-grouping name=ServiceName type=Dimension`.
|
|
26
|
+
- Output: top 5 services by spend + their MoM delta. Everything below focuses effort on these, not on the long tail.
|
|
27
|
+
|
|
28
|
+
### 2. Find idle + orphaned resources (fast money, low risk)
|
|
29
|
+
These cost money for nothing. Hit each class explicitly:
|
|
30
|
+
- **Unattached block storage:** AWS `aws ec2 describe-volumes --filters Name=status,Values=available` (status `available` = detached, still billed). GCP unattached PDs (`gcloud compute disks list` + cross-check users field empty). Azure `az disk list --query "[?diskState=='Unattached']"`.
|
|
31
|
+
- **Unassociated static IPs:** AWS EIPs not attached to a running instance are billed hourly — `aws ec2 describe-addresses --query "Addresses[?AssociationId==null]"`. Same for idle GCP static external IPs / Azure unassociated Public IPs.
|
|
32
|
+
- **Idle load balancers:** ELB/ALB/NLB with zero `RequestCount`/`ActiveFlowCount` over 14d in CloudWatch. An ALB with no targets or no traffic is pure waste.
|
|
33
|
+
- **Stale snapshots & old AMIs/images:** snapshots older than your retention policy, AMIs no launch-templated anywhere. These grow silently.
|
|
34
|
+
- **Over-provisioned managed clusters:** EKS/GKE/AKS node groups with low pod density — check `kubectl top nodes` and allocatable vs requests. Empty/cordoned nodes that never scaled down.
|
|
35
|
+
- **Forgotten dev/staging:** non-prod compute running 24/7. Tag-filter `env!=prod` and check for instances with no business reason to run nights/weekends → schedule stop or use auto-shutdown.
|
|
36
|
+
- **NAT Gateway / cross-AZ data:** often a hidden top-3 line item. Flag for step 5.
|
|
37
|
+
|
|
38
|
+
### 3. Right-size compute on REAL utilization (not nameplate)
|
|
39
|
+
Never resize on instance type alone — pull 14–30 days of actual metrics.
|
|
40
|
+
- **AWS:** start with **Compute Optimizer** (`aws compute-optimizer get-ec2-instance-recommendations`) — it already crunched p99 CPU/mem/network and gives a target type + projected savings. Validate against CloudWatch `CPUUtilization` p99 and **memory** (CloudWatch agent required — if mem isn't published, say so; CPU-only sizing is unsafe for memory-bound apps). Rule of thumb: sustained p99 CPU < 40% and p99 mem < 50% → downsize one step.
|
|
41
|
+
- **GCP:** use **Recommender** (`gcloud recommender recommendations list --recommender=google.compute.instance.MachineTypeRecommender`).
|
|
42
|
+
- **Azure:** **Advisor** right-size recommendations (`az advisor recommendation list --category Cost`).
|
|
43
|
+
- **Databases:** RDS/Cloud SQL/Azure SQL are frequently oversized — check CPU, connections, and IOPS p99. Move gp2→gp3 on RDS (cheaper, decoupled IOPS) as a near-zero-risk win.
|
|
44
|
+
- Prefer the same-family smaller size or a newer generation (e.g. m5→m6i/m7g Graviton) before exotic types — Graviton/ARM is often ~20% cheaper at equal perf if the workload is ARM-compatible.
|
|
45
|
+
|
|
46
|
+
### 4. Commitment analysis (Savings Plans / RI / CUD vs Spot)
|
|
47
|
+
Only after right-sizing — committing to oversized capacity locks in waste.
|
|
48
|
+
- Pull current commitment **coverage** and **utilization**: AWS `aws ce get-savings-plans-utilization` + `get-savings-plans-coverage` (and `get-reservation-coverage` for RDS/ElastiCache/Redshift/OpenSearch which use RIs, not SPs). Coverage low + steady baseline = buy more; utilization < 95% = you over-committed.
|
|
49
|
+
- Use the provider's **purchase recommendations** as the starting number: AWS `aws ce get-savings-plans-purchase-recommendation --term-in-years ONE_YEAR --payment-option NO_UPFRONT --lookback-period-in-days THIRTY_DAYS`. Default to **Compute Savings Plans** (flexible across family/region/Fargate/Lambda) over EC2 Instance SPs unless the fleet is truly static. 1yr no-upfront is the safe default; 3yr only for proven-stable baseline.
|
|
50
|
+
- **GCP:** CUDs — prefer **flexible/spend-based** CUDs over resource-based for changing fleets.
|
|
51
|
+
- **Spot/Preemptible:** for stateless, fault-tolerant, batch, or interruptible workloads → 60–90% off. Quantify what fraction of compute is spot-eligible; that's the cheapest capacity tier and should be exhausted before buying commitments for it.
|
|
52
|
+
- Output: recommended commitment $ + projected monthly savings + break-even, and explicitly the baseline (committed) vs burst (on-demand) vs interruptible (spot) split.
|
|
53
|
+
|
|
54
|
+
### 5. Storage tiering + data-transfer / egress
|
|
55
|
+
- **Object storage:** enable lifecycle/Intelligent-Tiering. S3 → Intelligent-Tiering or lifecycle to IA/Glacier by access age; GCS → Autoclass or Nearline/Coldline/Archive; Azure → Blob lifecycle Hot→Cool→Cold→Archive. Check for **no lifecycle policy at all** on large buckets — common silent leak.
|
|
56
|
+
- **Incomplete multipart uploads / old versions:** S3 hides cost in incomplete MPU and noncurrent versions — add a lifecycle rule to abort/expire them.
|
|
57
|
+
- **Egress / data transfer:** usually the most opaque line item. Cross-AZ chatter, NAT Gateway processing, inter-region replication, and internet egress. Mitigations: VPC/Gateway/PrivateLink endpoints to avoid NAT for AWS-service traffic, co-locate chatty services in one AZ, CloudFront/CDN in front of egress-heavy origins, and check that internal traffic isn't routing through public IPs.
|
|
58
|
+
- **Logs/metrics retention:** CloudWatch Logs / Cloud Logging / Log Analytics with infinite retention silently balloon — set retention and route cold logs to object storage.
|
|
59
|
+
|
|
60
|
+
### 6. Tagging, budgets, and anomaly alerts (prevent the next spike)
|
|
61
|
+
- **Tag coverage:** quantify % of spend that is **untagged** (allocation gap). Untagged spend = can't attribute = can't optimize. Recommend enforcing required tags (`env`, `owner`, `cost-center`) via SCP/Org Policy/Azure Policy.
|
|
62
|
+
- **Budgets:** AWS Budgets / GCP Budget alerts / Azure Cost Management budgets with alert thresholds at 50/80/100% to the owning team.
|
|
63
|
+
- **Anomaly detection:** AWS **Cost Anomaly Detection** (`aws ce get-anomalies` to investigate the current spike; create a monitor if none exists). GCP/Azure have equivalent anomaly alerts. For a live spike: filter anomalies to the window, get the root-cause dimension (service + usage type + linked account), and tie it back to step 1's delta.
|
|
64
|
+
|
|
65
|
+
### 7. Rank recommendations by effort vs impact
|
|
66
|
+
Produce a single prioritized table — do NOT dump a flat list. Columns: **Recommendation | Monthly $ saved (est.) | Effort (S/M/L) | Risk | Reversible? | Action/command**.
|
|
67
|
+
- Sort by impact-per-effort. Surface **quick wins** (delete orphans, gp2→gp3, add lifecycle, release EIPs) at the top — high $/low effort/reversible.
|
|
68
|
+
- Right-sizing and Graviton migration = medium effort, needs a perf validation step.
|
|
69
|
+
- Commitments = high-confidence savings but **financial lock-in** — flag explicitly and require human sign-off; never purchase autonomously.
|
|
70
|
+
- Give a total: "estimated $X/mo (Y%) recoverable, $Z of it as zero-risk quick wins."
|
|
71
|
+
|
|
72
|
+
## Common Errors
|
|
73
|
+
|
|
74
|
+
- **Right-sizing on CPU only.** If the CloudWatch agent / Ops Agent isn't installed, memory metrics don't exist — the provider tool fills gaps with assumptions. Downsizing a memory-bound app on CPU data alone causes OOM. State when mem data is missing.
|
|
75
|
+
- **`aws ce` returns near-zero / empty.** Cost Explorer must be **enabled** in the account and has ~24h data latency; first call after enabling shows nothing. Also CE is global (`us-east-1` endpoint) — region flags don't filter it. Tag-based grouping only works for **activated** cost-allocation tags (activation isn't retroactive).
|
|
76
|
+
- **Confusing detached vs in-use.** A volume `in-use`/disk attached to a **stopped** instance is still billed — "stopped" ≠ free for storage/EIP. Don't only filter `available`.
|
|
77
|
+
- **Blended vs unblended cost.** Use **UnblendedCost** for actuals; blended hides RI/SP allocation across an org and misleads per-account analysis.
|
|
78
|
+
- **GCP console export ≠ truth.** The CSV/console rounds and lags up to a day. Always query the BigQuery billing export for cents-accurate, SKU-level data.
|
|
79
|
+
- **Buying commitments before right-sizing.** Locks in your current (oversized) footprint for 1–3 years. Order matters: orphans → right-size → spot → *then* commit on the proven baseline.
|
|
80
|
+
- **Deleting "orphaned" snapshots that back an AMI / are a DR copy.** A snapshot can look unattached but be the backing store of a registered AMI or a cross-region DR copy. Check AMI block-device mappings and DR runbooks before deleting. Never bulk-delete snapshots in the analysis phase.
|
|
81
|
+
- **Reading egress as one number.** "Data transfer" bundles internet egress, cross-AZ, NAT processing, and inter-region — each has a different fix. Break it down by `USAGE_TYPE` before recommending.
|
|
82
|
+
- **Mutating during analysis.** This skill's job is to *recommend*. Don't terminate, resize, or purchase. All write actions are human-gated.
|
|
83
|
+
|
|
84
|
+
## Verify
|
|
85
|
+
|
|
86
|
+
- Every dollar figure traces to a real query/command output, not an estimate from memory — cite the command per recommendation.
|
|
87
|
+
- Savings estimates are **monthly** and net (post-discount), with the assumed commitment term/payment option stated.
|
|
88
|
+
- Quick wins are confirmed reversible (release EIP, delete unattached vol with a final snapshot, gp2→gp3) and separated from lock-in actions.
|
|
89
|
+
- The top spend services in the final report reconcile to the step-1 breakdown total (sums match the bill, ± rounding).
|
|
90
|
+
- No write/mutating command was executed; all destructive or financial actions are presented as proposed commands for human approval.
|
|
91
|
+
- Re-run the cost breakdown ~one billing cycle after changes land to confirm realized savings vs projected (close the loop).
|
|
@@ -0,0 +1,52 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: code-comments
|
|
3
|
+
description: Adds high-signal code comments and docstrings that explain WHY (intent, invariants, gotchas) rather than restating WHAT, in the language's idiomatic doc format (JSDoc/TSDoc, Google/NumPy Python docstrings, rustdoc, godoc) — without over-commenting.
|
|
4
|
+
when_to_use: User asks to comment/document a function, add docstrings, explain a tricky block, or prep code for handoff/review; legacy code with no comments. Not for prose docs (that's write-docs).
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
- User says "comment this", "add docstrings", "document this function/module/class", or "explain this tricky block".
|
|
10
|
+
- Prepping code for handoff or review where intent isn't obvious.
|
|
11
|
+
- Legacy code with zero comments where you must infer and record WHY.
|
|
12
|
+
- NOT for prose docs (README/guides/API pages) — that's `write-docs`.
|
|
13
|
+
- NOT a refactor — if logic looks wrong, flag it separately; this skill changes comments only.
|
|
14
|
+
|
|
15
|
+
## Steps
|
|
16
|
+
|
|
17
|
+
1. **Read the full unit first.** Read the whole function/module before writing a single comment. Trace data flow, error paths, and callers (grep for usages) so you comment real intent, not a guess. If intent is genuinely unrecoverable, write `// FIXME: intent unclear — <specific question>` instead of inventing a rationale.
|
|
18
|
+
|
|
19
|
+
2. **Mark only the non-obvious.** Tag lines/blocks where WHY ≠ WHAT: magic numbers, off-by-one guards, ret/ timeout / backoff constants, workarounds for upstream bugs, ordering that looks swappable but isn't, perf hacks, locking/concurrency assumptions, security checks. Skip anything a competent reader infers from the code itself.
|
|
20
|
+
|
|
21
|
+
3. **Write the docstring in the language's idiomatic format.** Use the existing convention in the file/repo if one is present (match it). Otherwise default to:
|
|
22
|
+
- **TS/JS** → TSDoc/JSDoc: one-line summary, then `@param`, `@returns`, `@throws`. Omit `@param` types in TS (the type is in the signature) — describe meaning/constraints/units instead.
|
|
23
|
+
- **Python** → match repo (Google or NumPy). Sections: summary, `Args:`/`Parameters`, `Returns:`, `Raises:`. Add `Examples:` (doctest-style) only for non-trivial contracts.
|
|
24
|
+
- **Rust** → rustdoc `///`: summary, `# Arguments` (when non-obvious), `# Errors`, `# Panics`, `# Safety` for `unsafe`. `# Examples` should be compilable.
|
|
25
|
+
- **Go** → godoc: comment starts with the identifier name (`// Foo does ...`), full sentences, period-terminated.
|
|
26
|
+
|
|
27
|
+
4. **Put the signal in the docstring, not the body.** Document: units (ms vs s, bytes vs KB), valid ranges, nullability, ownership/lifetime, side effects, thread-safety, idempotency, and what each error/exception actually means for the caller. These are the things callers can't see from the signature.
|
|
28
|
+
|
|
29
|
+
5. **Inline comments only for surprising WHY.** One line, above the code (not trailing, unless very short). Good: `// Retry 3x — upstream returns 503 on cold start`. Bad: `// loop over items`. Prefer `# TODO(context)` / `# HACK:` / `# SAFETY:` tags so they're greppable.
|
|
30
|
+
|
|
31
|
+
6. **Delete redundant and stale comments you encounter.** While editing, remove comments that restate code, are commented-out dead code, or contradict current behavior (stale comments are worse than none). Don't expand scope hunting the whole repo — just clean what's in your editing path.
|
|
32
|
+
|
|
33
|
+
7. **Keep the diff comment-only.** Zero logic, formatting, or whitespace changes beyond the comments themselves. The diff must be reviewable as "comments added/removed" and nothing else.
|
|
34
|
+
|
|
35
|
+
## Common Errors
|
|
36
|
+
|
|
37
|
+
- **Restating WHAT.** `i++ // increment i`, `return user // return the user`. If the comment is the code in English, delete it.
|
|
38
|
+
- **Inventing rationale.** Writing a confident "why" you didn't verify. If you can't confirm intent from code + callers, say so (`FIXME: intent unclear`) — a wrong comment is a future bug.
|
|
39
|
+
- **Over-commenting.** Docstring on every trivial getter/setter, comment on every line. Noise drowns the 3 comments that matter. Comment the surprising, skip the obvious.
|
|
40
|
+
- **Wrong format / mixing conventions.** JSDoc in a Python file, NumPy sections in a Google-style repo, `/** */` in Go. Match the file's existing style; if none, use the language default above.
|
|
41
|
+
- **Duplicating types in TS JSDoc.** `@param {string} name` when the signature already says `name: string` — drift risk when the type changes. Describe meaning, not type.
|
|
42
|
+
- **Touching logic.** "Improving" a line while commenting it. Any behavior change voids the comment-only contract — stop and split it out.
|
|
43
|
+
- **Stale-comment blindness.** Adding new docstrings while leaving an old comment that now lies. Reconcile or delete the old one.
|
|
44
|
+
- **Commenting the wrong layer.** Explaining a workaround in the body when the caller needs it — put caller-facing contracts (raises, side effects, units) in the docstring where callers actually read them.
|
|
45
|
+
|
|
46
|
+
## Verify
|
|
47
|
+
|
|
48
|
+
1. **Diff is comment-only** — `git diff` shows only added/removed comment lines and docstrings; no logic, no reordering, no reformatting. If logic changed, revert it.
|
|
49
|
+
2. **Still compiles / lints** — run the build or doc-comment linter (e.g. `tsc --noEmit`, `ruff`/`pydocstyle`, `cargo doc`, `go vet`). Malformed docstrings (bad `@param`, broken rustdoc `# Examples`) must pass.
|
|
50
|
+
3. **Doctests pass** if you added any runnable examples (`pytest --doctest-modules`, `cargo test --doc`).
|
|
51
|
+
4. **WHY-not-WHAT scan** — reread each added comment: does it tell the reader something the code doesn't already say? Delete any that fail.
|
|
52
|
+
5. **No invented facts** — every "because/to avoid/upstream bug" is traceable to the code or an explicit `FIXME`, not a guess.
|
|
@@ -0,0 +1,61 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: code-review
|
|
3
|
+
description: Reviews the current git diff (or a target PR/branch) for correctness bugs, logic errors, edge cases, and missing error handling, grouping findings by severity (Critical/Warning/Suggestion). Use after implementing a non-trivial change and before declaring it done, or when asked to review a PR.
|
|
4
|
+
when_to_use: After implementing a non-trivial change and before declaring it done; when asked to review a PR, branch, or diff; when a change touches correctness across multiple files. Skip for one-line edits whose effect is obvious (typo, rename, log string).
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
- Right after finishing a non-trivial implementation, **before** reporting it as done.
|
|
10
|
+
- When explicitly asked to review a PR, branch, or working-tree diff.
|
|
11
|
+
- When a change touches logic, control flow, data handling, or error paths across one or more files.
|
|
12
|
+
|
|
13
|
+
Skip it for changes whose full effect fits in one sentence (typo fix, rename, comment, log-string tweak) — reviewing those wastes a pass.
|
|
14
|
+
|
|
15
|
+
This skill hunts for **correctness defects only**. Style, naming, and refactor opportunities are out of scope; leave those to a cleanup/refactor pass. Never block a change on style.
|
|
16
|
+
|
|
17
|
+
## Steps
|
|
18
|
+
|
|
19
|
+
1. **Get the exact diff — never review the whole repo.** Pick the source:
|
|
20
|
+
- Working tree (uncommitted): `git diff` and `git diff --staged`
|
|
21
|
+
- Last commit: `git diff HEAD~1`
|
|
22
|
+
- Branch vs base: `git diff $(git merge-base HEAD <base>)...HEAD` (base is usually the default branch)
|
|
23
|
+
- A PR: `gh pr diff <number>` (or `gh pr diff <number> --patch`)
|
|
24
|
+
If unsure what to review, default to `git diff HEAD` plus staged changes.
|
|
25
|
+
|
|
26
|
+
2. **Lock scope to changed lines + their blast radius.** Read only the changed hunks and the immediate surrounding function/caller needed to judge correctness. If a changed function is called elsewhere, open just those call sites — do not audit unrelated files.
|
|
27
|
+
|
|
28
|
+
3. **Run the correctness checklist over each hunk.** For every changed function/branch, ask:
|
|
29
|
+
- **Null / undefined / None:** new value dereferenced without a guard? optional/map-lookup assumed present? empty string/array/`0`/`false` mishandled by a truthiness check?
|
|
30
|
+
- **Off-by-one / bounds:** loop `<` vs `<=`, slice/index ranges, first/last element, empty-collection case.
|
|
31
|
+
- **Async / concurrency:** missing `await`, unhandled rejected promise, shared state mutated without ordering guarantee, race between read and write, fire-and-forget that should block.
|
|
32
|
+
- **Error paths:** failures swallowed or only logged; thrown error left uncaught; partial failure leaves state half-updated; non-happy-path return value ignored.
|
|
33
|
+
- **Boundaries / inputs:** untrusted or empty input, very large input, negative/zero, unicode, timezone/locale, numeric overflow or float precision.
|
|
34
|
+
- **Resource leaks:** file/socket/db handle, lock, timer, subscription, or listener opened but not closed on all exit paths (including the error path).
|
|
35
|
+
- **Logic regressions:** inverted condition, wrong variable, changed default that alters existing behavior, broken invariant a caller depends on.
|
|
36
|
+
|
|
37
|
+
4. **Triage each finding by severity** and report with `path:line`, a one-line explanation of the failure mode, and a concrete suggested fix:
|
|
38
|
+
- **Critical** — will produce wrong results, crash, data loss, or a security hole on a realistic input. Must be fixed before done.
|
|
39
|
+
- **Warning** — plausible bug in an edge case, or correctness depends on an assumption that isn't enforced. Should be fixed or explicitly justified.
|
|
40
|
+
- **Suggestion** — minor robustness/clarity improvement that affects correctness only weakly. Optional.
|
|
41
|
+
Order findings most-severe-first. If a section is empty, say so (e.g. "Critical: none") rather than omitting it.
|
|
42
|
+
|
|
43
|
+
5. **Apply fixes (or hand them to the implementer), then re-review the new diff** — focus on whether each fix is correct and whether it introduced a new issue. **Do not declare done while any Critical remains unresolved.**
|
|
44
|
+
|
|
45
|
+
## Common Errors
|
|
46
|
+
|
|
47
|
+
- **Reviewing the whole repo instead of the diff.** Floods the report with pre-existing issues you didn't change and weren't asked about. Always diff first; comment only on changed lines and their direct callers.
|
|
48
|
+
- **Drowning in false positives.** Reporting "could be null" on a value the surrounding code already guarantees non-null. Read enough context to confirm the path is actually reachable before flagging; if you can't tell, mark it Warning, not Critical, and say what you assumed.
|
|
49
|
+
- **Nitpicking style as if it were a bug.** Renames, formatting, "prefer const", import order — out of scope here. Flagging them buries the real findings and trains the reader to ignore the report.
|
|
50
|
+
- **Severity inflation.** Marking everything Critical destroys the signal of the Critical tier. Reserve Critical for "wrong on a realistic input," not "theoretically possible if three unlikely things happen."
|
|
51
|
+
- **Reviewing intent instead of code.** Judge what the diff *does*, not what the description says it should do; the two diverging is itself a finding.
|
|
52
|
+
- **Stopping after the first pass.** A fix can introduce a fresh bug. The change isn't reviewed until the post-fix diff is clean.
|
|
53
|
+
|
|
54
|
+
## Verify
|
|
55
|
+
|
|
56
|
+
The review is complete when:
|
|
57
|
+
- The diff source is explicit and covers exactly the intended change set (and nothing unrelated).
|
|
58
|
+
- Every checklist category in step 3 was considered for each changed hunk (not silently skipped).
|
|
59
|
+
- Findings are grouped Critical / Warning / Suggestion, each with `path:line` + failure mode + suggested fix; empty tiers stated explicitly.
|
|
60
|
+
- Zero Critical findings remain open — every one is either fixed (and the fix re-reviewed) or has a written justification for why it's not a defect.
|
|
61
|
+
- The final reported diff differs from the initial one only by the applied fixes, and re-review of those fixes surfaced no new issues.
|
|
@@ -0,0 +1,67 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: db-migration-safety
|
|
3
|
+
description: Reviews and writes database migrations for safety — lock contention, blocking DDL on large tables, data-loss/destructive operations, missing indexes, and rollback plans. Use before running any schema change against a real or production database.
|
|
4
|
+
when_to_use: เขียน/รัน migration; เปลี่ยน schema; เพิ่ม/ลบ column/index/constraint บนตารางจริง
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Trigger this skill before writing or running ANY of these against a real/production DB:
|
|
10
|
+
|
|
11
|
+
- New migration file, or editing an existing one
|
|
12
|
+
- `ALTER TABLE`, `CREATE INDEX`, `DROP`, `ADD CONSTRAINT`, type changes, `SET NOT NULL`
|
|
13
|
+
- Backfilling / updating data across a large table (≥ ~100k rows, or unknown size)
|
|
14
|
+
- Renaming a column/table, or changing a column's type
|
|
15
|
+
- Any change reviewed by another engineer touching schema
|
|
16
|
+
|
|
17
|
+
Skip only for: brand-new empty tables, dev-only throwaway DBs, or pure read/query work with no DDL.
|
|
18
|
+
|
|
19
|
+
## Steps
|
|
20
|
+
|
|
21
|
+
1. **Classify every statement** as `additive` (safe), `lock-risky`, or `destructive`:
|
|
22
|
+
- additive: `ADD COLUMN` (nullable, no default OR constant default on PG11+), `CREATE INDEX CONCURRENTLY`, new table, add nullable FK without validation
|
|
23
|
+
- lock-risky: `CREATE INDEX` (non-concurrent), `ADD COLUMN ... NOT NULL` with volatile default, `ALTER COLUMN TYPE`, `ADD CONSTRAINT` (validating), `SET NOT NULL` on existing column
|
|
24
|
+
- destructive: `DROP TABLE/COLUMN`, `TRUNCATE`, type narrowing (e.g. `bigint`→`int`, `text`→`varchar(n)`), `DROP CONSTRAINT` relied on by app
|
|
25
|
+
Print this classification before doing anything else.
|
|
26
|
+
|
|
27
|
+
2. **Get the row count** before judging lock risk: `SELECT reltuples::bigint FROM pg_class WHERE relname = '<table>';` (fast estimate). Treat "unknown" as large. Lock duration scales with rows.
|
|
28
|
+
|
|
29
|
+
3. **Kill blocking index builds** — replace `CREATE INDEX` with `CREATE INDEX CONCURRENTLY` (Postgres) / `ALGORITHM=INPLACE, LOCK=NONE` (MySQL 8). CONCURRENTLY cannot run inside a transaction block, so it must be its own migration file with transaction wrapping disabled (e.g. `disable_ddl_transaction!` / no `BEGIN`).
|
|
30
|
+
|
|
31
|
+
4. **Batch every backfill** — never one `UPDATE` over the whole table (holds row locks + bloats WAL). Loop in chunks (1k–10k rows) keyed on PK, commit per batch, optional sleep between. Run the backfill as a SEPARATE step from the DDL that depends on it.
|
|
32
|
+
|
|
33
|
+
5. **Apply expand–contract for rename / type change** (3 deploys, never rename in place):
|
|
34
|
+
- Expand: add the new column/type. Deploy app that writes BOTH old + new.
|
|
35
|
+
- Migrate: backfill new column in batches; switch reads to new column.
|
|
36
|
+
- Contract: drop the old column in a LATER migration, after the new code is fully deployed and stable.
|
|
37
|
+
|
|
38
|
+
6. **Make NOT NULL safe** — don't `SET NOT NULL` directly on a big table (full scan under lock). Instead: `ADD CONSTRAINT ... CHECK (col IS NOT NULL) NOT VALID`, then `VALIDATE CONSTRAINT` (lighter lock), then optionally promote.
|
|
39
|
+
|
|
40
|
+
7. **Write the rollback** — every migration needs a working `down` / inverse. If the op is irreversible (dropped column = lost data), say so explicitly and require an explicit confirmation + backup before running.
|
|
41
|
+
|
|
42
|
+
8. **Dry-run on a copy** — run the migration against a clone/staging snapshot of prod data (real volume), time it, and check it completes without long locks. Report measured duration.
|
|
43
|
+
|
|
44
|
+
## Common Errors / Gotchas
|
|
45
|
+
|
|
46
|
+
- **`CREATE INDEX CONCURRENTLY` inside a transaction** → fails or silently degrades to a normal locking build. Each CONCURRENTLY index needs its own non-transactional migration. A failed CONCURRENTLY build leaves an `INVALID` index — drop it before retrying.
|
|
47
|
+
- **`ADD COLUMN ... DEFAULT <volatile>` rewrites the whole table** under an `ACCESS EXCLUSIVE` lock. Constant defaults are metadata-only on PG11+/MySQL8 and are safe; `now()`, `gen_random_uuid()`, sequences are NOT.
|
|
48
|
+
- **`ALTER TABLE` takes `ACCESS EXCLUSIVE`** which queues behind AND blocks all reads/writes — even a fast `ALTER` blocks the table while waiting for a slow query to finish (lock queue pile-up). Set a short `lock_timeout` (e.g. `SET lock_timeout = '3s'`) and retry, so the migration fails fast instead of freezing the table.
|
|
49
|
+
- **Adding a foreign key validates the whole referencing table** under lock. Use `ADD CONSTRAINT ... NOT VALID` then `VALIDATE CONSTRAINT` in a second step.
|
|
50
|
+
- **Type narrowing = silent data loss / errors** (`bigint`→`int` overflows, `varchar(n)` truncates, `timestamp`→`date` drops time). Block these unless data is proven to fit and a backup exists.
|
|
51
|
+
- **Dropping a column the running app still references** → errors until the new deploy lands. Drop only AFTER the code that stopped using it is fully deployed (contract phase).
|
|
52
|
+
- **Renaming a column in one shot** breaks the currently-running app version during deploy. Always expand–contract.
|
|
53
|
+
- **Long single-statement backfill** holds locks + inflates replication lag; replicas fall behind and reads time out.
|
|
54
|
+
|
|
55
|
+
## Verify
|
|
56
|
+
|
|
57
|
+
A migration is safe to run only when ALL hold:
|
|
58
|
+
|
|
59
|
+
- [ ] Every statement classified; no `destructive` op runs without explicit confirmation + a backup taken
|
|
60
|
+
- [ ] No non-concurrent index build on a non-trivial table; CONCURRENTLY indexes are in their own non-transactional file
|
|
61
|
+
- [ ] All backfills are batched with per-batch commits, separate from dependent DDL
|
|
62
|
+
- [ ] Rename/type-change uses expand–contract across separate deploys
|
|
63
|
+
- [ ] A working `down`/rollback exists, or irreversibility is flagged and confirmed
|
|
64
|
+
- [ ] `lock_timeout` set so the migration fails fast instead of hanging the table
|
|
65
|
+
- [ ] Dry-run on prod-sized copy completed; measured duration and lock impact reported
|
|
66
|
+
|
|
67
|
+
If you cannot verify lock duration on real data volume, do NOT run against production — escalate for a maintenance window or a tested staging run first.
|
|
@@ -0,0 +1,58 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: debug-frontend-browser
|
|
3
|
+
description: Diagnoses runtime UI bugs live in the browser — console/network errors, hydration mismatches, failed renders, CORS, broken interactions; used when a page misbehaves at runtime.
|
|
4
|
+
when_to_use: When the UI misbehaves at runtime — blank screen, hydration mismatch error, console/network errors, CORS failures, broken click/interaction, or 'works in dev not prod' — and needs live browser inspection.
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
The page is **already rendering (or failing to) in a real browser** and behaves wrong. Reach here when:
|
|
10
|
+
|
|
11
|
+
- Blank/white screen, partial render, or content flashes then disappears.
|
|
12
|
+
- Console error: `Hydration failed`, `Text content does not match server-rendered HTML`, `Cannot read properties of undefined`, etc.
|
|
13
|
+
- Network tab shows 4xx/5xx, a request stuck pending, or a CORS rejection.
|
|
14
|
+
- A click/submit/keypress does nothing or fires the wrong handler.
|
|
15
|
+
- "Works in dev, breaks in prod" (minified build, different env, SSR vs CSR mismatch).
|
|
16
|
+
|
|
17
|
+
Not for: build/compile failures (no browser yet), pure logic bugs reproducible in a unit test, or backend-only errors. Route those to general root-cause debugging. This skill is **browser-runtime UI** only and uses the `chrome-devtools` MCP to inspect a live page.
|
|
18
|
+
|
|
19
|
+
## Steps
|
|
20
|
+
|
|
21
|
+
1. **Get a live page.** `list_pages` to see open tabs; `select_page` the offending one, or `new_page` + `navigate_page` to the failing URL. Reproduce the exact state the bug needs (route, query params, logged-in session). If repro needs interaction, drive it: `click` / `fill` / `press_key`. Note: a fresh `new_page` has no auth cookies — reuse the existing tab when the bug is session-dependent.
|
|
22
|
+
|
|
23
|
+
2. **Drain the console first.** `list_console_messages` — read every `error` and `warning`, not just the top one. The *first* error is usually the root; later ones are fallout. Capture the full stack trace via `get_console_message` for the key error. Map minified frames back to source using the file:line in the trace (sourcemaps); if prod sourcemaps are absent, re-run the same flow against the dev/staging build to get readable frames.
|
|
24
|
+
|
|
25
|
+
3. **Check failed network requests.** `list_network_requests`, then `get_network_request` on anything non-2xx, pending, or that the failing component depends on. Inspect: status code, response body (real error message often lives here, not the console), request/response headers. For **CORS**: confirm the server returned `Access-Control-Allow-Origin` matching the page origin and, for preflight, that the `OPTIONS` request succeeded with the right `Access-Control-Allow-Methods/Headers`. CORS is a *server-config* fix — never "fix" it by disabling browser security or proxying around it silently.
|
|
26
|
+
|
|
27
|
+
4. **Hydration mismatch (Next.js / React).** The cause is server HTML ≠ first client render. Hunt for non-deterministic-in-render values: `Date.now()` / `new Date()` / `Math.random()`, `window`/`localStorage`/`navigator` read during render, `typeof window !== 'undefined'` branches, locale/timezone-dependent formatting, and invalid DOM nesting (`<div>` inside `<p>`, `<p>` inside `<p>`). Take `take_snapshot` to see the actual client DOM and compare against the SSR HTML (view the document response in `get_network_request`). Fix at source: gate client-only values behind `useEffect`/mounted-flag or `next/dynamic({ ssr: false })`. Use `suppressHydrationWarning` **only** for genuinely unavoidable per-render values (e.g. a timestamp) — it silences the warning, it does not fix a real divergence.
|
|
28
|
+
|
|
29
|
+
5. **Layout / styling bug.** `take_screenshot` for the visual, `take_snapshot` for the structured DOM + roles. `evaluate_script` to read computed styles on the culprit node (`getComputedStyle(el)`), bounding box, overflow, z-index, and whether the element is actually in the DOM vs `display:none`/zero-size. Distinguish "not rendered" (missing from snapshot) from "rendered but invisible" (present, hidden by CSS).
|
|
30
|
+
|
|
31
|
+
6. **Broken interaction / event wiring.** Confirm the handler is attached and the right element receives the event. `evaluate_script` to inspect listeners, check for an overlay intercepting clicks (`document.elementFromPoint(x,y)`), `pointer-events:none`, disabled state, or a stale closure capturing old state. Reproduce the click with `click` and re-read the console/network to see what (if anything) fired. For React state bugs, log or read the value at the moment of interaction rather than assuming.
|
|
32
|
+
|
|
33
|
+
7. **Confirm root cause, fix at source.** Tie the symptom to one concrete cause (a specific request, a specific render expression, a specific listener). Fix the source code — do **not** swallow the error with an empty `catch`, a blanket `try/catch`, or by hiding the failing UI.
|
|
34
|
+
|
|
35
|
+
8. **Verify live.** Reload via `navigate_page` (or `reload`), re-run the repro interaction, then re-check: `list_console_messages` clean of the original error, `list_network_requests` shows the request now 2xx, `take_screenshot` shows correct render. Show the before/after as evidence — do not declare fixed on code-read alone.
|
|
36
|
+
|
|
37
|
+
## Common Errors
|
|
38
|
+
|
|
39
|
+
- **Reading only the first console line.** The headline error is often downstream; scroll the full list and trace the earliest one.
|
|
40
|
+
- **Trusting the console over the network body.** A `500` shows a generic message in console but the real stack/error is in the response body — always `get_network_request` it.
|
|
41
|
+
- **"Fixing" CORS client-side.** Adding `mode: 'no-cors'`, disabling web security, or routing through a hack hides it. CORS is fixed on the server's response headers (or preflight) only.
|
|
42
|
+
- **Blaming hydration on the wrong line.** React points at the mismatched node, but the cause is *why* server and client disagree there — usually a non-deterministic value or a `window` read, not the JSX at that line.
|
|
43
|
+
- **`suppressHydrationWarning` as a cure.** It mutes the symptom. If the underlying data genuinely differs, you've shipped a silent inconsistency. Use it only for inherently dynamic content, never to paper over a real bug.
|
|
44
|
+
- **Fresh tab missing auth.** `new_page` starts with no session — a bug that needs login won't reproduce. Reuse the existing authenticated tab.
|
|
45
|
+
- **Minified prod stack with no sourcemap.** Frames like `a.b is not a function` are useless; reproduce against dev/staging to get real file:line.
|
|
46
|
+
- **Stale page state.** Inspecting after the error already cleared (e.g. an error boundary swapped the UI) shows the recovered state. Re-trigger the failure, then inspect immediately.
|
|
47
|
+
- **Race / timing.** Element or data not there yet — use `wait_for` on the target before asserting it's missing, instead of concluding it never renders.
|
|
48
|
+
|
|
49
|
+
## Verify
|
|
50
|
+
|
|
51
|
+
Fix is done only when all hold against the live page:
|
|
52
|
+
|
|
53
|
+
- [ ] Original console error/warning gone from `list_console_messages` after reload + repro.
|
|
54
|
+
- [ ] Previously-failed request now returns 2xx with a valid body (check `get_network_request`).
|
|
55
|
+
- [ ] `take_screenshot` shows the page rendering correctly through the exact repro steps.
|
|
56
|
+
- [ ] The broken interaction now fires the correct handler and produces the expected result.
|
|
57
|
+
- [ ] No new errors introduced, and the error is fixed at source — not suppressed, caught-and-ignored, or hidden.
|
|
58
|
+
- [ ] If it was a "dev-only / prod-only" bug, verified in the build that originally failed.
|
|
@@ -0,0 +1,54 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: debug-root-cause
|
|
3
|
+
description: Diagnoses a failing test, crash, exception, or wrong output by reproducing the failure, isolating the cause, and fixing at the root — never by suppressing the error or weakening assertions.
|
|
4
|
+
when_to_use: test/build fail, exception/crash, output ผิด, behavior ไม่ตรงคาด — มี failure ที่เกิดขึ้นแล้วและต้องหาว่าทำไม ก่อนจะแก้
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Invoke when a failure **already exists** and you need to find *why*:
|
|
10
|
+
|
|
11
|
+
- A test or build fails (red CI, failing assertion, compile error).
|
|
12
|
+
- An exception / crash / stack trace appears at runtime.
|
|
13
|
+
- Output is wrong, behavior doesn't match the spec, or a regression appeared.
|
|
14
|
+
|
|
15
|
+
Do NOT use for: greenfield feature work, "make it faster", or vague "improve this" — those aren't failures with a reproducible signal.
|
|
16
|
+
|
|
17
|
+
## Steps
|
|
18
|
+
|
|
19
|
+
1. **Capture the exact signal first.** Run the failing command yourself and copy the *verbatim* error: full message, stack trace, exit code, failing assertion, expected-vs-actual. Do not paraphrase or work from the user's summary — the real text usually names the file/line/value. If you can't reproduce it, you cannot fix it: stop and gather repro steps.
|
|
20
|
+
|
|
21
|
+
2. **Write a minimal failing test that reproduces it.** Before touching any production code, encode the bug as a test (or a tiny script) that FAILS for the right reason. Run it, confirm it goes red, confirm the failure message matches step 1. This is your oracle — without it, "fixed" is a guess. For a crash with no test harness, write the smallest standalone repro that triggers it.
|
|
22
|
+
|
|
23
|
+
3. **Isolate by narrowing, not guessing.**
|
|
24
|
+
- Shrink the input: delete half the data/config/steps; if it still fails, the removed half was irrelevant. Binary-search down to the smallest trigger.
|
|
25
|
+
- Bisect history: if it worked before, `git bisect` (or diff against last-good commit) to find the introducing change.
|
|
26
|
+
- Narrow the code path: comment out / short-circuit branches until the failure flips, then you've bracketed the cause.
|
|
27
|
+
|
|
28
|
+
4. **Trace the actual values.** At the suspected boundary, log or breakpoint the real runtime values (types, nulls, lengths, timestamps, the thing right before it breaks). Compare what you *assumed* vs what's *actually there*. The gap is almost always the bug.
|
|
29
|
+
|
|
30
|
+
5. **Form one hypothesis and state it explicitly.** "X is null because Y returns early when Z" — a single, falsifiable claim that explains the *entire* observed symptom (not just part of it). If your hypothesis only explains some of the evidence, it's wrong; keep tracing.
|
|
31
|
+
|
|
32
|
+
6. **Fix at the root and verify.** Change the cause, not the symptom. Re-run the step-2 test → it must now PASS. Re-run the full suite → no new failures. Keep the reproduction test in the codebase as a regression guard.
|
|
33
|
+
|
|
34
|
+
7. **Show evidence.** Paste the command(s) you ran and their output: red before, green after. "Fixed it" without the before/after output is not a finished fix.
|
|
35
|
+
|
|
36
|
+
## Common Errors / Gotchas
|
|
37
|
+
|
|
38
|
+
- **Patching the symptom.** Wrapping the call in `try/catch` that swallows the error, adding `?.`/`if (x) return` to dodge a null, or `// eslint-disable` does NOT fix anything — it hides the next failure and creates a silent prod bug. Find why `x` was null in the first place.
|
|
39
|
+
- **Weakening the test to go green.** Loosening an assertion, deleting a case, adding `.skip`, bumping a tolerance, or asserting the *buggy* output as "expected" defeats the entire point. Never edit the test to make a broken fix pass.
|
|
40
|
+
- **Fixing without reproducing.** "It's probably the cache" → blind edit → suite still red, now with extra noise. No repro = no fix; you can't confirm a cause you never isolated.
|
|
41
|
+
- **Stopping at the first plausible cause.** A hypothesis that explains *part* of the symptom (e.g. one of three failing cases) is incomplete. Account for all the evidence.
|
|
42
|
+
- **Usual root causes to check fast:** null/undefined, off-by-one / boundary, async timing & race (await missed, unresolved promise, ordering), shared/mutated state, type coercion (`==`, string vs number, truthy `0`/`""`), stale cache or memoization, env mismatch (versions, locale, timezone, missing env var), encoding, float precision.
|
|
43
|
+
- **Heisenbug warning:** if adding a log makes it pass, suspect timing/ordering/concurrency — the log changed scheduling. Don't conclude "can't reproduce", change tactic to deterministic timing.
|
|
44
|
+
|
|
45
|
+
## Verify
|
|
46
|
+
|
|
47
|
+
You are done only when ALL hold:
|
|
48
|
+
|
|
49
|
+
- The minimal repro test exists in the codebase and **fails on the old code, passes on the fixed code** (you ran both).
|
|
50
|
+
- The fix targets the cause named in your hypothesis — not a guard/catch/skip around the symptom.
|
|
51
|
+
- The full test suite passes with **no new failures** and no assertions were weakened.
|
|
52
|
+
- You pasted the before (red) and after (green) command output as evidence.
|
|
53
|
+
|
|
54
|
+
If you cannot produce a failing-then-passing test, you have not found the root cause — keep going, do not ship.
|
|
@@ -0,0 +1,56 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: dependency-upgrade
|
|
3
|
+
description: Upgrades and audits project dependencies safely — reads changelogs/breaking changes, bumps versions, fixes resulting breakage, and verifies with tests/build. Use when updating packages, resolving version conflicts, or patching a vulnerable dependency.
|
|
4
|
+
when_to_use: อัปเดต package; แก้ version conflict / lockfile; patch dependency ที่มี CVE; bump major version
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
- A package is outdated and you need it updated (new feature, bug fix, or just freshness).
|
|
10
|
+
- A security advisory / CVE affects a direct or transitive dependency and needs patching.
|
|
11
|
+
- A version conflict or broken lockfile blocks install (peer-dep mismatch, resolution error).
|
|
12
|
+
- A major-version bump is requested and the resulting breakage must be fixed and verified.
|
|
13
|
+
|
|
14
|
+
Do NOT use for a brand-new dependency add (that's a feature change, not an upgrade) or for unrelated build failures that have nothing to do with dependency versions.
|
|
15
|
+
|
|
16
|
+
## Steps
|
|
17
|
+
|
|
18
|
+
1. **Snapshot a clean baseline first.** Ensure the working tree is committed and the build/tests pass *before* touching anything — you need a known-green starting point to attribute breakage. If the baseline is already red, stop and report; don't upgrade on top of existing failures.
|
|
19
|
+
|
|
20
|
+
2. **Detect the ecosystem and gather facts** (run the pair for the lockfile present):
|
|
21
|
+
- npm/yarn/pnpm: `npm outdated` + `npm audit` (or `pnpm outdated` / `yarn npm audit`)
|
|
22
|
+
- Python: `pip list --outdated` + `pip-audit` (or `uv pip list --outdated`)
|
|
23
|
+
- Rust: `cargo outdated` + `cargo audit`
|
|
24
|
+
- Go: `go list -m -u all` + `govulncheck ./...`
|
|
25
|
+
|
|
26
|
+
3. **Build a categorized plan, do not bump blindly.** Split targets into three buckets and handle them as *separate commits*:
|
|
27
|
+
- **Security patches** (CVE-driven) — highest priority, smallest scope possible. Prefer the *minimum* version that clears the advisory.
|
|
28
|
+
- **Minor / patch upgrades** — safe to batch together in one group.
|
|
29
|
+
- **Major upgrades** — one package (or one tightly-coupled cluster) at a time, never batched with others.
|
|
30
|
+
For each target note: current → target version, bump type (patch/minor/major), and reason (security/feature/maintenance).
|
|
31
|
+
|
|
32
|
+
4. **For every major bump, read the breaking changes before editing code.** Open the package's CHANGELOG / release notes / migration guide for *each major version crossed* (e.g. 2→4 means reading 3.0 and 4.0 notes, not just 4.0). List the breaking changes that touch APIs this project actually uses — grep the codebase for the affected symbols to confirm exposure.
|
|
33
|
+
|
|
34
|
+
5. **Apply one bucket at a time and regenerate the lockfile.** Update the manifest version, then run the install that rewrites the lockfile (`npm install` / `pnpm install` / `cargo update -p <pkg>` / `pip install -U <pkg> && pip freeze`). Check transitive dependencies that moved too — a "minor" direct bump can pull a major transitive bump.
|
|
35
|
+
|
|
36
|
+
6. **Fix resulting breakage at the source.** Follow the migration guide; update call sites, renamed exports, changed config, removed options. Do **not** pin around the breakage or silence type/lint errors to go green — fix the actual incompatibility.
|
|
37
|
+
|
|
38
|
+
7. **Verify the bucket is green, then commit including the lockfile.** Run lint + build + full test suite. Only when green, commit (manifest + lockfile together). Repeat steps 5–7 for the next bucket. Keeping buckets in separate commits means any later regression is bisectable and individually revertible.
|
|
39
|
+
|
|
40
|
+
## Common Errors
|
|
41
|
+
|
|
42
|
+
- **Batching majors → unbisectable breakage.** Bumping several major versions in one shot makes it impossible to tell which one broke the build. Always isolate majors into their own commit.
|
|
43
|
+
- **Lockfile left stale or uncommitted.** Editing the manifest but not regenerating/committing the lockfile means CI and teammates install different versions. The lockfile change is part of the upgrade, not optional.
|
|
44
|
+
- **Missing transitive breaking changes.** A direct bump that looks minor can hoist a transitive dependency across a major boundary. Inspect the lockfile diff, not just the manifest diff.
|
|
45
|
+
- **Skipping intermediate changelogs.** Jumping 2→5 and reading only the 5.0 notes misses breaking changes introduced in 3.0/4.0. Read every major boundary crossed.
|
|
46
|
+
- **Peer-dependency cascade.** Upgrading one package (e.g. a framework) often forces matching majors of its plugins/adapters. Resolve the whole compatible set together or the install errors / silently downgrades.
|
|
47
|
+
- **`audit fix --force` blindly.** The `--force` flag will install breaking majors to clear an advisory; run it only after you've read the breaking changes, or prefer an explicit pinned safe version instead.
|
|
48
|
+
- **Faking green.** Loosening test assertions, adding `// @ts-ignore`, or downgrading back to dodge the work hides a real runtime break. Fix the root cause.
|
|
49
|
+
|
|
50
|
+
## Verify
|
|
51
|
+
|
|
52
|
+
- Install is clean and reproducible: a fresh `npm ci` / `pip install -r` / `cargo build --locked` succeeds with no resolution warnings.
|
|
53
|
+
- Lint, build, and the **full** test suite pass — not a subset — and you can show the command + exit code, not just "it works".
|
|
54
|
+
- `npm audit` / `pip-audit` / `cargo audit` reports zero remaining advisories at the severity you set out to fix (or documents why any are unfixable).
|
|
55
|
+
- The lockfile diff is committed and contains only the intended version moves; no unexplained transitive jumps.
|
|
56
|
+
- Each major upgrade is its own commit, so any single one can be reverted independently.
|
|
@@ -0,0 +1,64 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: deploy-release
|
|
3
|
+
description: Prepares and runs a safe deploy/release — pre-flight checks (tests/build green, env vars, migrations applied), versioning/tagging, rollout, and post-deploy smoke verification with a rollback path. Use when shipping a build to staging/production.
|
|
4
|
+
when_to_use: deploy ขึ้น staging/prod; ทำ release; cut a version tag; ก่อน/หลัง rollout
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Use this skill when shipping a build to **staging or production**: cutting a release, tagging a version, or rolling out a deployment. Do **not** use it for local dev runs, hotfix branches that never reach a shared environment, or read-only inspection of a running service.
|
|
10
|
+
|
|
11
|
+
Before doing anything, discover the repo's actual deploy mechanics — never assume. In order:
|
|
12
|
+
1. Read CI/CD config: `.github/workflows/`, `.gitlab-ci.yml`, `Jenkinsfile`, `.circleci/`.
|
|
13
|
+
2. Read deploy/infra files: `Dockerfile`, `docker-compose*.yml`, `k8s/`, `helm/`, `Procfile`, `fly.toml`, `vercel.json`, `serverless.yml`, `Makefile` (look for `deploy`/`release`/`migrate` targets).
|
|
14
|
+
3. Read the package manifest scripts: `package.json`, `pyproject.toml`, `Cargo.toml`, etc.
|
|
15
|
+
4. Check `CHANGELOG.md`, `VERSION`, and existing git tags (`git tag --sort=-v:refname | head`).
|
|
16
|
+
|
|
17
|
+
If the rollout strategy or rollback command cannot be determined from the repo, **stop and ask** — do not invent one.
|
|
18
|
+
|
|
19
|
+
## Steps
|
|
20
|
+
|
|
21
|
+
1. **Confirm target + branch.** Identify the environment (staging vs prod) and the exact commit/branch being shipped. Run `git status` (clean working tree) and `git rev-parse HEAD`. Refuse to deploy uncommitted changes or a detached/unexpected ref.
|
|
22
|
+
|
|
23
|
+
2. **Pre-flight — must all pass before proceeding. Stop on the first red.**
|
|
24
|
+
- Tests: run the repo's test command; require **exit code 0**. Do not weaken assertions or skip suites to get green.
|
|
25
|
+
- Build: run the production build command; require exit 0 and verify artifacts/output exist.
|
|
26
|
+
- Lint/typecheck: run if the repo defines it; require clean.
|
|
27
|
+
- Env/secrets: enumerate required vars from the repo (`.env.example`, config schema, CI secret list) and confirm each is set in the **target** environment. Never print secret values — check presence only.
|
|
28
|
+
- Migrations: detect pending DB migrations (e.g. `migrate status`, framework equivalent). Decide order per the repo's convention (usually migrate **before** rollout for additive changes; expand→migrate→contract for breaking schema). If migrations are destructive, surface them explicitly before running.
|
|
29
|
+
|
|
30
|
+
3. **Version + tag + changelog.** Bump version per repo convention (semver: patch/minor/major). Update `CHANGELOG.md` with the diff since the last tag. Commit, then create an **annotated** tag (`git tag -a vX.Y.Z -m "..."`). Push commit and tag.
|
|
31
|
+
|
|
32
|
+
4. **Rollback prep — BEFORE rollout, not after.** Capture and write down the exact rollback path: previous tag/image digest/release id, the precise revert command (e.g. redeploy prior image, `helm rollback`, platform "rollback" command), and any migration down-path. Do not start rollout until this is ready.
|
|
33
|
+
|
|
34
|
+
5. **Rollout** using the repo's strategy — do not improvise a different one:
|
|
35
|
+
- **Blue-green:** deploy to idle env, smoke-test it, then flip traffic.
|
|
36
|
+
- **Canary:** route a small % first, watch metrics/errors, then ramp.
|
|
37
|
+
- **Rolling:** deploy incrementally, watch each batch's health gate.
|
|
38
|
+
Trigger via the repo's documented command/CI pipeline.
|
|
39
|
+
|
|
40
|
+
6. **Post-deploy smoke test.** Hit the primary health endpoint and 1–2 critical user paths against the **live** target. Confirm expected status code AND response shape (not just 200). Check error rate / logs for a short window. For a UI deploy, load the page and screenshot.
|
|
41
|
+
|
|
42
|
+
7. **Decide.** Smoke green → record the deployed version + commit and close out. Smoke red → execute the rollback from step 4 immediately, then diagnose. Never leave a half-rolled-out failed deploy live.
|
|
43
|
+
|
|
44
|
+
## Common Errors
|
|
45
|
+
|
|
46
|
+
- **Shipping on red tests** — the single biggest gotcha. CI cache or a flaky suite makes it tempting to bypass. Don't. Re-run, fix root cause.
|
|
47
|
+
- **Forgotten migration** — code deployed expecting a column/table that isn't there → runtime 500s on first request. Always check pending migrations in pre-flight, and sequence migrate vs rollout for the schema-change type.
|
|
48
|
+
- **No smoke check** — assuming a successful deploy command means a working app. A clean rollout can still serve a crashing app (bad env var, failed boot). Always verify against the live endpoint.
|
|
49
|
+
- **Rollback prepared too late** — discovering the rollback command *after* prod is broken wastes the worst minutes. Capture it in step 4.
|
|
50
|
+
- **Env var set in CI but not runtime** (or in staging but not prod) — check presence in the actual target environment, not where you happen to be standing.
|
|
51
|
+
- **Lightweight tag instead of annotated** — `git tag vX` loses author/date/message; release tooling and `git describe` expect annotated tags.
|
|
52
|
+
- **Migration run twice / out of order** across blue and green nodes — gate migrations to run once, idempotently.
|
|
53
|
+
|
|
54
|
+
## Verify
|
|
55
|
+
|
|
56
|
+
Deploy is successful only when **all** hold, each backed by real evidence (output, status code, screenshot) — not assumption:
|
|
57
|
+
|
|
58
|
+
- Pre-flight: test + build commands exited 0 (show the output).
|
|
59
|
+
- Tag pushed: `git ls-remote --tags origin` shows the new annotated tag; `CHANGELOG.md` updated.
|
|
60
|
+
- Live smoke: health endpoint returns expected status **and** body shape; critical path works against the target host (paste the response / screenshot).
|
|
61
|
+
- Error rate / logs clean for the watch window after rollout.
|
|
62
|
+
- Rollback command is documented and was confirmed available before rollout.
|
|
63
|
+
|
|
64
|
+
If you cannot produce evidence for any line above, the deploy is **not** verified — do not declare done; roll back or hold.
|