@coralai/sps-cli 0.42.0 → 0.43.0

Files changed (109)
  1. package/README.md +34 -3
  2. package/dist/commands/projectInit.d.ts.map +1 -1
  3. package/dist/commands/projectInit.js +40 -53
  4. package/dist/commands/projectInit.js.map +1 -1
  5. package/dist/commands/skillCommand.d.ts +2 -0
  6. package/dist/commands/skillCommand.d.ts.map +1 -0
  7. package/dist/commands/skillCommand.js +235 -0
  8. package/dist/commands/skillCommand.js.map +1 -0
  9. package/dist/core/skillStore.d.ts +46 -0
  10. package/dist/core/skillStore.d.ts.map +1 -0
  11. package/dist/core/skillStore.js +197 -0
  12. package/dist/core/skillStore.js.map +1 -0
  13. package/dist/core/skillStore.test.d.ts +2 -0
  14. package/dist/core/skillStore.test.d.ts.map +1 -0
  15. package/dist/core/skillStore.test.js +190 -0
  16. package/dist/core/skillStore.test.js.map +1 -0
  17. package/dist/main.js +19 -17
  18. package/dist/main.js.map +1 -1
  19. package/package.json +1 -1
  20. package/skills/architecture-decision-records/SKILL.md +207 -0
  21. package/skills/backend/SKILL.md +62 -0
  22. package/skills/backend/references/api-design.md +168 -0
  23. package/skills/backend/references/caching.md +181 -0
  24. package/skills/backend/references/data-access.md +173 -0
  25. package/skills/backend/references/layering.md +181 -0
  26. package/skills/backend/references/observability.md +190 -0
  27. package/skills/backend/references/resilience.md +201 -0
  28. package/skills/backend/references/security.md +186 -0
  29. package/skills/backend-architect/SKILL.md +119 -0
  30. package/skills/code-reviewer/SKILL.md +143 -0
  31. package/skills/coding-standards/SKILL.md +60 -0
  32. package/skills/coding-standards/references/clean-code.md +258 -0
  33. package/skills/coding-standards/references/code-review.md +192 -0
  34. package/skills/coding-standards/references/commits-and-prs.md +226 -0
  35. package/skills/coding-standards/references/error-strategy.md +193 -0
  36. package/skills/coding-standards/references/naming.md +185 -0
  37. package/skills/coding-standards/references/tdd.md +171 -0
  38. package/skills/database/SKILL.md +53 -0
  39. package/skills/database/references/indexing.md +190 -0
  40. package/skills/database/references/migrations.md +199 -0
  41. package/skills/database/references/nosql.md +185 -0
  42. package/skills/database/references/queries.md +295 -0
  43. package/skills/database/references/scaling.md +203 -0
  44. package/skills/database/references/schema.md +191 -0
  45. package/skills/database-optimizer/SKILL.md +168 -0
  46. package/skills/debugging-workflow/SKILL.md +244 -0
  47. package/skills/devops/SKILL.md +55 -0
  48. package/skills/devops/references/ci-cd.md +204 -0
  49. package/skills/devops/references/containers.md +272 -0
  50. package/skills/devops/references/deploy.md +201 -0
  51. package/skills/devops/references/iac.md +252 -0
  52. package/skills/devops/references/observability.md +228 -0
  53. package/skills/devops/references/secrets.md +178 -0
  54. package/skills/devops-automator/SKILL.md +164 -0
  55. package/skills/frontend/SKILL.md +52 -0
  56. package/skills/frontend/references/accessibility.md +222 -0
  57. package/skills/frontend/references/components.md +206 -0
  58. package/skills/frontend/references/performance.md +219 -0
  59. package/skills/frontend/references/routing.md +209 -0
  60. package/skills/frontend/references/state.md +190 -0
  61. package/skills/frontend/references/testing.md +216 -0
  62. package/skills/frontend-developer/SKILL.md +115 -0
  63. package/skills/git-workflow/SKILL.md +355 -0
  64. package/skills/golang/SKILL.md +49 -0
  65. package/skills/golang/references/concurrency.md +284 -0
  66. package/skills/golang/references/errors.md +241 -0
  67. package/skills/golang/references/idioms.md +285 -0
  68. package/skills/golang/references/testing.md +238 -0
  69. package/skills/java/SKILL.md +50 -0
  70. package/skills/java/references/concurrency.md +194 -0
  71. package/skills/java/references/idioms.md +283 -0
  72. package/skills/java/references/testing.md +228 -0
  73. package/skills/kotlin/SKILL.md +47 -0
  74. package/skills/kotlin/references/coroutines.md +240 -0
  75. package/skills/kotlin/references/idioms.md +268 -0
  76. package/skills/kotlin/references/testing.md +219 -0
  77. package/skills/mobile/SKILL.md +50 -0
  78. package/skills/mobile/references/architecture.md +204 -0
  79. package/skills/mobile/references/navigation.md +158 -0
  80. package/skills/mobile/references/performance.md +152 -0
  81. package/skills/mobile/references/platform.md +166 -0
  82. package/skills/mobile/references/state-and-data.md +174 -0
  83. package/skills/python/SKILL.md +51 -0
  84. package/skills/python/THIRD_PARTY.md +14 -0
  85. package/skills/python/references/async.md +218 -0
  86. package/skills/python/references/error-handling.md +254 -0
  87. package/skills/python/references/idioms.md +279 -0
  88. package/skills/python/references/packaging.md +233 -0
  89. package/skills/python/references/testing.md +269 -0
  90. package/skills/python/references/typing.md +292 -0
  91. package/skills/qa-tester/SKILL.md +186 -0
  92. package/skills/rust/SKILL.md +50 -0
  93. package/skills/rust/references/async.md +224 -0
  94. package/skills/rust/references/errors.md +240 -0
  95. package/skills/rust/references/ownership.md +263 -0
  96. package/skills/rust/references/testing.md +274 -0
  97. package/skills/rust/references/traits.md +250 -0
  98. package/skills/security-engineer/SKILL.md +157 -0
  99. package/skills/swift/SKILL.md +48 -0
  100. package/skills/swift/references/concurrency.md +280 -0
  101. package/skills/swift/references/idioms.md +334 -0
  102. package/skills/swift/references/testing.md +229 -0
  103. package/skills/typescript/SKILL.md +51 -0
  104. package/skills/typescript/references/async.md +241 -0
  105. package/skills/typescript/references/errors.md +208 -0
  106. package/skills/typescript/references/idioms.md +246 -0
  107. package/skills/typescript/references/testing.md +225 -0
  108. package/skills/typescript/references/tooling.md +208 -0
  109. package/skills/typescript/references/types.md +259 -0
@@ -0,0 +1,204 @@
1
+ # CI / CD
2
+
3
+ Pipelines, caching, parallelism, artifacts, gates.
4
+
5
+ ## Pipeline stages — the standard shape
6
+
7
+ ```
8
+ ┌──────────┐ ┌───────┐ ┌──────┐ ┌──────┐ ┌──────────┐ ┌──────────┐
9
+ │ checkout │▶│ lint │▶│ test │▶│ build│▶│ scan/sign│▶│ deploy │
10
+ └──────────┘ └───────┘ └──────┘ └──────┘ └──────────┘ └──────────┘
11
+
12
+ └─► parallel jobs where possible
13
+ ```
14
+
15
+ Order matters: cheap-and-fast first (lint, typecheck). Expensive and slow last (E2E, image build). Failing lint should fail the pipeline in under a minute.
16
+
17
+ ## Keep CI fast
18
+
19
+ Target: **< 10 min end-to-end on a typical change**. Slow CI punishes every commit.
20
+
21
+ Levers:
22
+
23
+ - **Cache dependencies.** Lockfile as cache key. `actions/cache` / equivalent.
24
+ - **Parallelize independent jobs.** Lint + typecheck + unit tests can all run at once.
25
+ - **Shard tests.** A 10-minute test suite split into 4 shards = 2.5 min each.
26
+ - **Run integration / E2E on critical paths only**, or only on main.
27
+ - **Test only what changed** for monorepos. `nx affected`, `turbo run --filter`, `bazel query`.
28
+
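The sharding lever above can be sketched as follows. This is a minimal illustration, assuming tests are identified by file path and each CI job knows its own shard index; `shard_for` and `select_tests` are hypothetical names, not part of any CI tool.

```python
import hashlib


def shard_for(test_path: str, total_shards: int) -> int:
    """Map a test path to a shard index that is stable across runs and machines."""
    digest = hashlib.sha256(test_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_shards


def select_tests(all_tests: list[str], shard_index: int, total_shards: int) -> list[str]:
    """Return the subset of tests that this shard should run."""
    return [t for t in sorted(all_tests) if shard_for(t, total_shards) == shard_index]
```

Hash-based assignment balances shards by test count, not by duration. Runners that record timings can usually balance better, so treat this as the fallback shape rather than the ideal.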
29
+ ## Cache keys
30
+
31
+ ```yaml
32
+ # ✅ stable, invalidates only when deps change
33
+ key: "${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}"
34
+
35
+ # ❌ too narrow — cache misses every run
36
+ key: "${{ runner.os }}-node-${{ github.sha }}"
37
+
38
+ # ❌ too broad — may return incompatible cache
39
+ key: "${{ runner.os }}-node"
40
+ ```
41
+
42
+ Cache the right things:
43
+ - `node_modules` / `pip wheels` / `cargo target` / `gradle caches` — big wins.
44
+ - Build output (`.next`, `dist`) if subsequent jobs use it.
45
+ - Don't cache test reports or transient artifacts.
46
+
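To make the "stable key" idea concrete, here is a small sketch of a key derived from lockfile content. It is a stand-in for what `hashFiles` achieves, not its actual implementation; `cache_key` is a hypothetical helper.

```python
import hashlib


def cache_key(os_name: str, tool: str, lockfile_bytes: bytes) -> str:
    """Build a key that changes only when the OS, tool, or lockfile content changes."""
    lock_hash = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{os_name}-{tool}-{lock_hash}"
```

Two commits with the same lockfile produce the same key (cache hit); any dependency change produces a new key (clean miss), which is the behaviour the ✅ example relies on.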
47
+ ## Build artifacts, promote them
48
+
49
+ Build once. The same artifact flows dev → staging → prod.
50
+
51
+ ```
52
+ PR: build → test (no artifact published)
53
+ main: build → test → publish (publish image:sha)
54
+ deploy dev: pull image:sha → deploy
55
+ deploy staging: pull image:sha → deploy (same image)
56
+ deploy prod: pull image:sha → deploy (same image)
57
+ ```
58
+
59
+ Building again per environment re-runs tests and invites "it worked in staging" surprises when the new build differs (new dependency version, timestamp).
60
+
61
+ ## Artifact tagging
62
+
63
+ ```
64
+ image: myapp:sha-abc1234 # immutable, references a commit
65
+ image: myapp:v1.2.3 # semver release
66
+ image: myapp:main # mutable — latest main
67
+ image: myapp:latest # mutable — last push to whatever
68
+ ```
69
+
70
+ Deploy by immutable tag (`sha-abc1234` or `v1.2.3`). Mutable tags (`main`, `latest`) are convenient for humans but make rollbacks ambiguous.
71
+
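A deploy script could enforce this rule with a check along the following lines. This is a sketch that assumes the tag conventions shown above (`sha-<hex>` and `vMAJOR.MINOR.PATCH`); `is_immutable_ref` is a hypothetical helper.

```python
import re

# Assumed conventions, taken from the examples above: "sha-<hex>" and "vMAJOR.MINOR.PATCH".
_IMMUTABLE_TAG = re.compile(r"^(sha-[0-9a-f]{7,40}|v\d+\.\d+\.\d+)$")


def is_immutable_ref(image_ref: str) -> bool:
    """Return True if the image is pinned by digest or by an immutable-style tag."""
    if "@sha256:" in image_ref:
        return True
    # Look only at the last path segment so a registry port is not mistaken for a tag.
    name = image_ref.rsplit("/", 1)[-1]
    if ":" not in name:
        return False  # no tag means the implicit, mutable "latest"
    tag = name.rsplit(":", 1)[1]
    return bool(_IMMUTABLE_TAG.match(tag))
```

A tag that matches the pattern is immutable only by convention unless the registry itself forbids overwriting tags, so this check complements registry-side immutability rather than replacing it.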
72
+ ## Gates and approvals
73
+
74
+ Auto-deploy to dev. Require a check or approval before promoting staging → prod (or before deploying to any sensitive environment).
75
+
76
+ ```
77
+ merge to main
78
+ ▶ deploy dev (auto)
79
+ ▶ deploy staging (auto, smoke tests)
80
+ ▶ deploy prod (manual approval)
81
+ ```
82
+
83
+ The manual gate is the pause for asking "should this actually ship now?", for example during a release freeze or while waiting on a cross-team sync.
84
+
85
+ ## Required status checks
86
+
87
+ On the PR branch, block merge unless:
88
+ - Lint / typecheck pass
89
+ - Unit tests pass
90
+ - Coverage above threshold (if enforced)
91
+ - Review approval received
92
+
93
+ Configure in the VCS (GitHub branch protection, GitLab protected branches and merge request approval rules).
94
+
95
+ ## Secrets in CI
96
+
97
+ - **Never** store secrets in CI config files or env files checked into the repo.
98
+ - Use the CI platform's secret store (GitHub Actions secrets, GitLab CI/CD variables), scoped to an environment where the platform supports it.
99
+ - Scope per environment (`PROD_DB_URL`, not a shared one).
100
+ - Prefer short-lived credentials (OIDC) over long-lived keys.
101
+ ```
102
+ # GitHub Actions → AWS via OIDC, no AWS_ACCESS_KEY stored in GitHub
103
+ permissions: { id-token: write }
104
+ - uses: aws-actions/configure-aws-credentials@v4
105
+   with: { role-to-assume: arn:aws:iam::...:role/github-prod }
106
+ ```
107
+ - Mask secrets in logs (most CI tools do this automatically).
108
+
109
+ ## Supply-chain security
110
+
111
+ - **Pin** third-party actions / images by SHA, not version tag.
112
+ ```yaml
113
+ uses: actions/checkout@11bd71901bbe5b1630ceea73d27796261f9... # v4.2.2
114
+ ```
115
+ Tags are mutable; an attacker who takes over the repo can repoint a tag.
116
+ - **Dependency scanning**: Dependabot / Renovate for updates; Snyk / Trivy / Grype for vulnerabilities.
117
+ - **SBOM generation**: produce one per build, store it.
118
+ - **Image signing**: cosign + Sigstore; verify at deploy.
119
+
120
+ ## Matrix builds
121
+
122
+ For multi-version / multi-OS testing:
123
+
124
+ ```yaml
125
+ strategy:
126
+   matrix:
127
+     node: [18, 20, 22]
128
+     os: [ubuntu-latest, macos-latest]
129
+ ```
130
+
131
+ Keep matrices narrow — `3 × 2 = 6` jobs, not 30. CI-minutes add up.
132
+
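The job count is the Cartesian product of the axes, which is why matrices grow quickly. A short sketch (the `expand_matrix` helper is hypothetical) makes the arithmetic explicit:

```python
from itertools import product


def expand_matrix(matrix: dict[str, list]) -> list[dict]:
    """Expand a build matrix into one dict per job (the Cartesian product of all axes)."""
    axes = list(matrix)
    return [dict(zip(axes, combo)) for combo in product(*(matrix[a] for a in axes))]


jobs = expand_matrix({"node": [18, 20, 22], "os": ["ubuntu-latest", "macos-latest"]})
print(len(jobs))  # 6
```

Adding a third axis with five values would turn these 6 jobs into 30, so each new axis deserves a deliberate decision.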
133
+ ## Flaky tests — triage immediately
134
+
135
+ One flaky test poisons the signal.
136
+
137
+ - Tag the test as flaky, move to a separate job, investigate within a week.
138
+ - A test that fails intermittently is ALWAYS a bug: race condition, shared state, timing assumption. Don't accept "just retry".
139
+ - Quarantine + retry is a short-term fix only. Delete the test rather than leave it quarantined forever.
140
+
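One way to spot flaky tests before they erode trust is to look for mixed outcomes on identical code. The sketch below assumes CI history is available as `(test_name, commit_sha, passed)` tuples; that record shape and the `find_flaky_tests` name are assumptions for illustration.

```python
from collections import defaultdict


def find_flaky_tests(runs: list[tuple[str, str, bool]]) -> set[str]:
    """Return tests that both passed and failed on the same commit.

    Identical code with mixed outcomes points at nondeterminism
    (race, shared state, timing) rather than a real regression.
    """
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(passed)
    return {test for (test, _sha), seen in outcomes.items() if seen == {True, False}}
```

A test that fails on one commit and passes on the next is not flagged, because that pattern is what a genuine fix looks like.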
141
+ ## Pull-request vs. main pipelines
142
+
143
+ Different triggers, often different scopes:
144
+
145
+ | Trigger | Run |
146
+ |---|---|
147
+ | PR | Lint, typecheck, unit, key integration |
148
+ | PR (target main) | + E2E happy path |
149
+ | Merge to main | + build, publish artifact, deploy dev / staging |
150
+ | Tag / release | + prod deploy gate |
151
+ | Scheduled | + full E2E, perf tests, security scans |
152
+
153
+ Don't run everything on every PR. Keep PRs fast; save heavy tests for main.
154
+
155
+ ## Monorepo considerations
156
+
157
+ - **Change-aware testing**: don't rebuild / test the whole monorepo if only one package changed.
158
+ - **Project graph tools**: Turborepo, Nx, Bazel, Pants.
159
+ - **Shared cache**: remote cache (Turbo Cloud, Nx Cloud, BuildBuddy) pays for itself on larger teams.
160
+
161
+ ## Deploy previews
162
+
163
+ Ephemeral environments per PR:
164
+
165
+ ```
166
+ PR #123 → https://pr-123.preview.myapp.com
167
+ ```
168
+
169
+ Great for frontend, reasonable for APIs, expensive for heavy backends. Tear down on PR close.
170
+
171
+ Tools: Vercel / Netlify / Cloudflare for frontends; Render / Fly / Kubernetes preview envs / Garden / Uffizzi for full-stack.
172
+
173
+ ## Concurrency control
174
+
175
+ Don't let two prod deploys race:
176
+
177
+ ```yaml
178
+ concurrency:
179
+   group: deploy-prod
180
+   cancel-in-progress: false
181
+ ```
182
+
183
+ For PR previews, cancel old runs when a new commit arrives:
184
+
185
+ ```yaml
186
+ concurrency:
187
+   group: pr-${{ github.ref }}
188
+   cancel-in-progress: true
189
+ ```
190
+
191
+ ## Anti-patterns
192
+
193
+ | Anti-pattern | Fix |
194
+ |---|---|
195
+ | `|| true` to hide test failures | Fix or delete the test |
196
+ | `:latest` tag in prod deploy manifest | Immutable tag |
197
+ | Deploying code untested in staging | Dev → staging → prod, same artifact |
198
+ | Secrets via commits / CI log | Secret store, masked |
199
+ | 40-minute CI runs on every PR | Split; run heavy tests on main |
200
+ | Tests that share mutable state | Isolate / reset per test |
201
+ | Action pinned by tag only | Pin by SHA |
202
+ | Deploying from a dev laptop | CI-only deploy path |
203
+ | No automated rollback plan | See `deploy.md` |
204
+ | Ignoring flaky tests | Quarantine + fix within a week; don't normalize |
@@ -0,0 +1,272 @@
1
+ # Containers
2
+
3
+ Dockerfile, multi-stage, size, rootless, base images.
4
+
5
+ ## The goals
6
+
7
+ 1. **Small** — ship fewer bytes, fewer CVEs, faster pulls.
8
+ 2. **Reproducible** — same Dockerfile + same lockfile → same image.
9
+ 3. **Secure** — no root, no shell if possible, minimal deps, signed.
10
+ 4. **Fast to build** — layer caching aligned with change frequency.
11
+
12
+ ## Base image selection
13
+
14
+ ```
15
+ Prefer: distroless > alpine > slim > full distro
16
+ ```
17
+
18
+ | Base | Size | Trade-off |
19
+ |---|---|---|
20
+ | `gcr.io/distroless/static` | ~2 MB | No shell, no package manager. Static binaries only. |
21
+ | `gcr.io/distroless/base` | ~15 MB | libc, openssl, etc. Good for most compiled langs. |
22
+ | `alpine:3.20` | ~5 MB | musl libc (incompatible with some packages); apk package manager |
23
+ | `debian:12-slim` | ~75 MB | glibc; widest compatibility; still small |
24
+ | `debian:12` / `ubuntu:24.04` | 120–200 MB | Full distro; use only when you need dev tools at runtime |
25
+
26
+ Choose the smallest one that runs your workload. Distroless is ideal for production runtime.
27
+
28
+ ## Multi-stage builds
29
+
30
+ Separate build env from runtime env.
31
+
32
+ ```dockerfile
33
+ # syntax=docker/dockerfile:1
34
+
35
+ # --- Build stage
36
+ FROM node:20-alpine AS build
37
+ WORKDIR /app
38
+
39
+ # Cache deps separately from source
40
+ COPY package.json package-lock.json ./
41
+ RUN npm ci
42
+
43
+ COPY . .
44
+ RUN npm run build
45
+
46
+ # Prune dev deps after build
47
+ RUN npm prune --omit=dev
48
+
49
+ # --- Runtime stage
50
+ FROM gcr.io/distroless/nodejs20-debian12
51
+ WORKDIR /app
52
+ COPY --from=build /app/node_modules ./node_modules
53
+ COPY --from=build /app/dist ./dist
54
+ COPY --from=build /app/package.json ./
55
+ USER nonroot
56
+ EXPOSE 3000
57
+ CMD ["dist/server.js"]
58
+ ```
59
+
60
+ Benefits:
61
+ - Build tools don't ship to prod.
62
+ - Different base OS for build vs runtime (Alpine to build, distroless to run).
63
+ - Smaller image, smaller attack surface.
64
+
65
+ ## Layer order matters for caching
66
+
67
+ Stable layers first, volatile last.
68
+
69
+ ```dockerfile
70
+ # ✅ deps first: `npm ci` stays cached until the lockfile changes; source is copied last because it changes every commit
71
+ COPY package.json package-lock.json ./
72
+ RUN npm ci
73
+ COPY . .
74
+ RUN npm run build
75
+
76
+ # ❌ re-runs npm ci on every code change
77
+ COPY . .
78
+ RUN npm ci
79
+ RUN npm run build
80
+ ```
81
+
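The ordering rule follows from how layer caching chains: each layer's cache key depends on its parent, so the first changed layer invalidates everything after it. The model below is a deliberate simplification for illustration, not BuildKit's actual algorithm; `layer_keys` and `cached_layers` are hypothetical helpers.

```python
import hashlib


def layer_keys(steps: list[tuple[str, bytes]]) -> list[str]:
    """Compute one cache key per layer from (instruction, input content) pairs.

    Each key folds in the parent layer's key, so a change in one layer
    changes the key of every later layer.
    """
    keys = []
    parent = ""
    for instruction, inputs in steps:
        h = hashlib.sha256()
        h.update(parent.encode("utf-8"))
        h.update(instruction.encode("utf-8"))
        h.update(inputs)
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def cached_layers(before: list[str], after: list[str]) -> int:
    """Count how many leading layers can be reused between two builds."""
    count = 0
    for old, new in zip(before, after):
        if old != new:
            break
        count += 1
    return count
```

With manifests copied first, a source-only change still reuses the `COPY package*.json` and `npm ci` layers; with `COPY . .` first, the same change reuses nothing.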
82
+ ## Pin versions
83
+
84
+ ```dockerfile
85
+ FROM node:20.11.1-alpine3.19
86
+ # ^ a full version, not node:20 or node:latest. Or pin by digest for strictest reproducibility:
87
+ FROM node@sha256:5b57a...
88
+ ```
89
+
90
+ Tags are mutable; digests aren't. For prod, pin by digest; for dev, a tag is usually fine.
91
+
92
+ ## Don't run as root
93
+
94
+ ```dockerfile
95
+ # Debian-based
96
+ RUN groupadd --system app && useradd --system --gid app app
97
+ USER app
98
+
99
+ # Alpine
100
+ RUN addgroup -S app && adduser -S -G app app
101
+ USER app
102
+
103
+ # Distroless already provides a `nonroot` user
104
+ USER nonroot
105
+ ```
106
+
107
+ Root inside a container is not isolation. Principle of least privilege applies here too.
108
+
109
+ ## Don't bake secrets
110
+
111
+ ```dockerfile
112
+ # ❌ the value is baked into the image layer
113
+ ARG API_KEY
114
+ ENV API_KEY=$API_KEY
115
+
116
+ # ✅ pass at runtime (a shell command, not a Dockerfile instruction):
117
+ #   docker run -e API_KEY=... myapp
118
+ # or mount from a secret manager
119
+ ```
120
+
121
+ Secrets that land in a layer are visible to anyone with the image. Rotating is painful.
122
+
123
+ ## Build secrets (BuildKit)
124
+
125
+ For secrets needed **during build** only (e.g., private npm registry token):
126
+
127
+ ```dockerfile
128
+ # syntax=docker/dockerfile:1.4
129
+ RUN --mount=type=secret,id=npm_token \
130
+ NPM_TOKEN=$(cat /run/secrets/npm_token) npm ci
131
+ ```
132
+
133
+ The secret doesn't persist in the final image.
134
+
135
+ ## `.dockerignore`
136
+
137
+ Exclude the junk. Every extra file bloats the build context.
138
+
139
+ ```
140
+ .git
141
+ node_modules
142
+ **/__pycache__
143
+ **/*.log
144
+ .DS_Store
145
+ .idea/
146
+ .vscode/
147
+ .env
148
+ tests/
149
+ .venv
150
+ target/
151
+ coverage/
152
+ dist/
153
+ ```
154
+
155
+ A 2 GB build context on a 200 MB repo is a signal you need a `.dockerignore`.
156
+
157
+ ## HEALTHCHECK
158
+
159
+ Declare how to know the container is alive.
160
+
161
+ ```dockerfile
162
+ HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
163
+ CMD curl -fsS http://localhost:3000/health || exit 1
164
+ ```
165
+
166
+ Kubernetes ignores the Dockerfile `HEALTHCHECK` and uses its own liveness / readiness probes; set them in the manifest, not the Dockerfile. Docker Swarm and standalone Docker use `HEALTHCHECK`.
167
+
168
+ ## Signal handling
169
+
170
+ ```dockerfile
171
+ # ✅ node handles SIGTERM directly (no shell in between)
172
+ CMD ["node", "server.js"]
173
+
174
+ # ❌ shell form — shell gets SIGTERM, may not forward to node
175
+ CMD node server.js
176
+ ```
177
+
178
+ Exec form (array) runs the binary directly. Shell form runs `/bin/sh -c`, and the shell may not forward signals to the child process. Use exec form for the main process.
179
+
180
+ For apps that don't handle signals, use `tini` as init:
181
+
182
+ ```dockerfile
183
+ ENTRYPOINT ["/sbin/tini", "--"]
184
+ CMD ["./my-binary"]
185
+ ```
186
+
187
+ ## Logging
188
+
189
+ Write to stdout / stderr. The container runtime collects these.
190
+
191
+ ```dockerfile
192
+ # ❌ logs go into the container filesystem; lost when the container is replaced
193
+ CMD ["my-binary", "--log-file", "/var/log/app.log"]
194
+
195
+ # ✅ stdout (the app logs to stdout by default)
196
+ CMD ["my-binary"]
197
+ ```
198
+
199
+ ## Image scanning
200
+
201
+ Run a scanner in CI (Trivy, Grype, Snyk).
202
+
203
+ ```
204
+ trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:sha-abc123
205
+ # fails the build on HIGH/CRITICAL CVEs; add --ignore-unfixed to skip those with no fix available
206
+ ```
207
+
208
+ Scan at build, again on a schedule (CVEs are disclosed after you build).
209
+
210
+ ## Image signing (supply chain)
211
+
212
+ Sign images so the cluster can verify.
213
+
214
+ ```
215
+ cosign sign --key cosign.key myregistry/myapp:sha-abc123
216
+ ```
217
+
218
+ At deploy, policy (Kyverno, Gatekeeper, ECR policy, Sigstore policy-controller) verifies the signature before admission.
219
+
220
+ ## Don't install what you don't need
221
+
222
+ Every package installed is:
223
+ - Bytes on the wire
224
+ - Disk on the node
225
+ - A CVE waiting to be reported
226
+
227
+ ```dockerfile
228
+ # ❌
229
+ RUN apt-get update && apt-get install -y \
230
+ curl vim git build-essential python3 netcat ...
231
+
232
+ # ✅
233
+ RUN apt-get update && apt-get install --no-install-recommends -y \
234
+ ca-certificates && rm -rf /var/lib/apt/lists/*
235
+ ```
236
+
237
+ Clean apt lists, yum caches, pip caches in the same layer that installed them.
238
+
239
+ ## Distroless specifics
240
+
241
+ No shell. No `ls`, no `cat`, no `curl`. This is a feature.
242
+
243
+ - Healthcheck: use the binary itself (`myapp healthcheck`) or a Kubernetes probe over HTTP.
244
+ - Debugging: `kubectl exec` won't give you a shell. Use ephemeral debug containers (`kubectl debug`) or log more.
245
+
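For an image without `curl`, the "use the binary itself" approach can be as small as the sketch below. It assumes a runtime image that ships a Python interpreter and an app that serves a `/health` endpoint; both are assumptions, and `healthcheck` is a hypothetical helper.

```python
import http.client
import urllib.error
import urllib.request


def healthcheck(url: str, timeout: float = 3.0) -> int:
    """Return 0 if the endpoint answers with a 2xx status, 1 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return 1
```

The return value is meant to become the process exit code (for example via `sys.exit(healthcheck(...))` in the script's entry point), since both Docker's `HEALTHCHECK` and Kubernetes exec probes judge health by exit status.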
246
+ Downsides: ops is harder at first. Upside: drastically smaller attack surface.
247
+
248
+ ## Image size — typical targets
249
+
250
+ | App | Target size |
251
+ |---|---|
252
+ | Go / Rust static binary | 5–15 MB |
253
+ | Node.js app | 100–200 MB |
254
+ | Python app | 100–250 MB |
255
+ | Java app (with JRE) | 150–250 MB |
256
+
257
+ If your Node image is 1.5 GB, you shipped `node_modules/` twice, left dev deps, or forgot `--omit=dev`.
258
+
259
+ ## Anti-patterns
260
+
261
+ | Anti-pattern | Fix |
262
+ |---|---|
263
+ | `FROM ubuntu:latest` | Specific version; smaller base |
264
+ | `COPY . .` before `RUN npm ci` | Deps before source for cache |
265
+ | `apt-get install -y *` no cleanup | `rm -rf /var/lib/apt/lists/*` same layer |
266
+ | `RUN chmod ...` in 10 separate layers | Combine; each layer has size cost |
267
+ | Running as root | `USER` before CMD |
268
+ | Shell form CMD | Exec form, signals work |
269
+ | Logs to file inside container | stdout/stderr |
270
+ | `ADD` for local files | `COPY` — `ADD` also untars and downloads, surprising |
271
+ | Secrets in `ENV` / `ARG` | Mount at runtime |
272
+ | Every service uses a different base image | Standardize; fewer bases to scan and update |
@@ -0,0 +1,201 @@
1
+ # Deploy
2
+
3
+ Rolling, blue-green, canary, feature flags. Rollback plan always.
4
+
5
+ ## Deploy ≠ release
6
+
7
+ - **Deploy**: new code is running on the infra.
8
+ - **Release**: new code is serving user traffic.
9
+
10
+ Decoupling them (deploy first, flag on later) is how you ship safely. The deploy can be tested under real load without user impact; the release is a quick toggle.
11
+
12
+ ## Rolling update
13
+
14
+ Replace instances one / a few at a time. Default in Kubernetes, ECS, most orchestrators.
15
+
16
+ ```
17
+ v1 v1 v1 v1 (4 pods)
18
+ v1 v1 v1 v2 (replace one)
19
+ v1 v1 v2 v2 (replace next)
20
+ ... (eventually all v2)
21
+ ```
22
+
23
+ Parameters to tune:
24
+ - **maxSurge**: how many extra pods can exist during roll (e.g., +25%).
25
+ - **maxUnavailable**: how many pods can be missing (e.g., 0 for strict).
26
+ - **readinessProbe**: traffic waits until new pod reports ready.
27
+
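The effect of the two parameters on pod counts can be worked out ahead of time. The sketch below follows Kubernetes' documented rounding for percentage values (`maxSurge` rounds up, `maxUnavailable` rounds down); `rollout_bounds` is a hypothetical helper.

```python
import math


def rollout_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> tuple[int, int]:
    """Return (max_total_pods, min_available_pods) during a rolling update."""
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return replicas + surge, replicas - unavailable


print(rollout_bounds(4, 25, 0))    # (5, 4)
print(rollout_bounds(10, 25, 25))  # (13, 8)
```

With 4 replicas, `maxSurge: 25%` and `maxUnavailable: 0`, the roll proceeds one extra pod at a time and never dips below full capacity, at the cost of a slower rollout.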
28
+ Rollback: roll back to the previous replica set / task definition.
29
+
30
+ ## Blue-green
31
+
32
+ Two full environments. Cut over all traffic at once.
33
+
34
+ ```
35
+ blue (prod traffic) — current version
36
+ green (idle) — new version, warmed up
37
+
38
+ Cut over: point load balancer to green.
39
+ Rollback: point back to blue (instant).
40
+ ```
41
+
42
+ Pros: clean cutover, instant rollback.
43
+ Cons: 2× infra cost during the overlap window.
44
+
45
+ Use for:
46
+ - Critical releases where a drawn-out rolling update is risky.
47
+ - DB schema changes that coexist with both versions.
48
+
49
+ ## Canary
50
+
51
+ Send a small percentage of traffic to the new version; scale up if healthy, roll back if not.
52
+
53
+ ```
54
+ 100% v1
55
+ 5% v2, 95% v1 — watch metrics for 10 min
56
+ 25% v2, 75% v1 — watch
57
+ 50% v2, 50% v1
58
+ 100% v2
59
+ ```
60
+
61
+ Observation:
62
+ - Error rate on v2 vs. v1
63
+ - Latency p95/p99
64
+ - Business metrics (conversion, checkout success)
65
+ - Custom alarms (signup, payment)
66
+
67
+ Automated: Argo Rollouts, Flagger, AWS CodeDeploy canary. Manual also works — humans read dashboards and decide.
68
+
69
+ Rollback: pull the canary (route 100% back to v1).
70
+
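The canary progression above reduces to a loop with an early exit. In this sketch, `is_healthy` is a caller-supplied, hypothetical check that compares the new version's metrics with the old version's at a given traffic weight; shifting traffic and waiting are left out.

```python
from typing import Callable


def run_canary(steps: list[int], is_healthy: Callable[[int], bool]) -> tuple[str, int]:
    """Walk through canary weights; roll back on the first unhealthy step.

    `steps` are the percentages of traffic sent to the new version.
    Returns ("promoted", 100) or ("rolled_back", 0).
    """
    for weight in steps:
        # A real rollout would shift traffic here and observe metrics for a while.
        if not is_healthy(weight):
            return "rolled_back", 0
    return "promoted", 100
```

For example, `run_canary([5, 25, 50, 100], check)` promotes only if `check` passes at every step, and any failure routes all traffic back to the old version.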
71
+ ## Progressive delivery
72
+
73
+ Canary + automated analysis. The rollout controller evaluates metrics at each step and promotes / rolls back automatically.
74
+
75
+ ```yaml
76
+ # Argo Rollouts (sketch)
77
+ strategy:
78
+   canary:
79
+     steps:
80
+       - setWeight: 10
81
+       - pause: { duration: 5m }
82
+       - analysis: { templates: [{ templateName: error-rate }, { templateName: latency-p95 }] }
83
+       - setWeight: 50
84
+       - analysis: [...]
85
+       - setWeight: 100
86
+ ```
87
+
88
+ The analysis step is a check against a metric threshold. Fail → auto-rollback.
89
+
90
+ ## Feature flags
91
+
92
+ Ship code dark; flip for a percentage of users; roll back with a toggle.
93
+
94
+ ```
95
+ if feature_enabled('new_checkout', user):
96
+     new_checkout()
97
+ else:
98
+     old_checkout()
99
+ ```
100
+
101
+ Benefits:
102
+ - Deploy-release decoupling.
103
+ - A/B testing for correctness, not just design.
104
+ - Instant rollback without a redeploy.
105
+
106
+ Discipline:
107
+ - Every flag is temporary. Clean up after full rollout (or after hypothesis failure).
108
+ - Document owner + expiry for each flag. Stale flags accrete and become impossible to remove.
109
+ - Manage flags through a flag service (LaunchDarkly, ConfigCat, Unleash, Flagsmith, or home-grown).
110
+
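"Flip for a percentage of users" is commonly done with deterministic bucketing, so the same user always gets the same answer. The sketch below is one way to do that under the assumption that users have a stable string ID; `in_rollout` is a hypothetical helper, not the API of any flag service listed above.

```python
import hashlib


def in_rollout(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically enable a flag for roughly `rollout_percent` of users."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent
```

Because a user's bucket never changes, raising the percentage only adds users; nobody who already has the feature loses it mid-rollout. Including the flag name in the hash keeps different flags from always selecting the same users.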
111
+ ## Database migrations + deploys
112
+
113
+ See `database/migrations.md` — expand / contract is the safe dance. Key rules for deploy:
114
+
115
+ - **Migration BEFORE code that needs it.** Don't deploy v2 of the app before running its required migration.
116
+ - **No destructive migrations during peak hours.** Schedule windows; at least reduce blast radius.
117
+ - **New code tolerates old schema AND old code tolerates new schema** at the overlap.
118
+
119
+ ## Preflight checks
120
+
121
+ Before actually rolling out:
122
+ - **Smoke test** against staging with the exact artifact going to prod.
123
+ - **Load test** for major changes (perf regressions hide in staging noise).
124
+ - **Dependency audit** (new CVE in the image?).
125
+ - **Release notes** drafted; rollback plan documented.
126
+
127
+ ## During deploy
128
+
129
+ Monitor:
130
+ - **Deployment health**: readiness failures, crash loops.
131
+ - **Service health**: error rate, latency, saturation.
132
+ - **Downstream**: DB, cache, message broker metrics — did the new code change call patterns?
133
+ - **Business metrics**: signups / second, checkout completion.
134
+
135
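+ Alerting during a deploy is different: some jitter is normal while instances are being replaced, so tolerate brief noise during the rollout and tighten thresholds again in the post-deploy window.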
136
+
137
+ ## Rollback
138
+
139
+ Every deploy plan includes: **how do we get back?**
140
+
141
+ ```
142
+ If <condition>, roll back by <step 1>, <step 2>, ...
143
+ ```
144
+
145
+ Common "conditions":
146
+ - Error rate > 2× baseline for 5 min
147
+ - Latency p99 > 2× baseline
148
+ - Business metric (signup, checkout) drops > 20%
149
+ - Manual oncall decision
150
+
151
+ Rollback command should be one thing — not a checklist. Automate it.
152
+
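The example conditions can be encoded as a single decision function so that the trigger is unambiguous. This sketch assumes metrics arrive as periodic samples (for instance one per minute over the last five minutes); the `Metrics` shape and `should_roll_back` are hypothetical, and the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, e.g. 0.01
    latency_p99_ms: float
    signups_per_min: float  # stand-in for any business metric


def should_roll_back(baseline: Metrics, window: list[Metrics]) -> bool:
    """Apply the example rollback conditions to a window of recent samples."""
    if not window:
        return False
    latest = window[-1]
    # Error rate must stay above 2x baseline for the whole window, not just spike once.
    sustained_errors = all(m.error_rate > 2 * baseline.error_rate for m in window)
    slow = latest.latency_p99_ms > 2 * baseline.latency_p99_ms
    business_drop = latest.signups_per_min < 0.8 * baseline.signups_per_min
    return sustained_errors or slow or business_drop
```

The manual on-call decision stays outside the function on purpose: automation covers the measurable conditions, and a human can still trigger the same one-command rollback.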
153
+ ### What's rollback-safe?
154
+
155
+ | Change | Rollback |
156
+ |---|---|
157
+ | Code-only | Deploy previous artifact |
158
+ | Migration that's additive | Old code works; no DB action needed |
159
+ | Migration that removed a column | Restore from backup (painful); old code cannot run without the column, so defer drops to the contract phase |
160
+ | Feature flag on | Turn it off |
161
+ | Config change | Revert config |
162
+
163
+ Design changes so rolling back code is sufficient. That dictates the migration pattern (expand/contract).
164
+
165
+ ## Shutdown gracefully
166
+
167
+ On `SIGTERM`:
168
+ 1. Stop accepting new connections (readiness → unhealthy; LB removes pod).
169
+ 2. Finish in-flight requests (with a grace period — 30 s typical).
170
+ 3. Drain queues / finish current job.
171
+ 4. Close DB / external connections.
172
+ 5. Exit 0.
173
+
174
+ Without this, rolling updates drop requests and leave half-processed jobs.
175
+
176
+ Kubernetes: `terminationGracePeriodSeconds: 30` + a `preStop` hook + SIGTERM handling in app.
177
+
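Steps 1 to 3 of the shutdown sequence can be sketched for a queue-backed worker as follows. The `GracefulWorker` class is hypothetical; closing connections and exiting (steps 4 and 5) are left to the caller, and the 30-second default mirrors the typical grace period mentioned above.

```python
import queue
import signal
import threading
import time
from typing import Callable


class GracefulWorker:
    """Queue-backed worker that stops intake on SIGTERM and drains what is left."""

    def __init__(self) -> None:
        self.stopping = threading.Event()
        self.jobs = queue.Queue()
        self.results: list[object] = []

    def install_signal_handler(self) -> None:
        # Must be called from the main thread, as Python's signal module requires.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.stopping.set())

    def submit(self, job: Callable[[], object]) -> bool:
        """Step 1: refuse new work once shutdown has begun."""
        if self.stopping.is_set():
            return False
        self.jobs.put(job)
        return True

    def drain(self, grace_seconds: float = 30.0) -> bool:
        """Steps 2-3: finish queued work within the grace period.

        Returns True if the queue emptied before the deadline.
        """
        deadline = time.monotonic() + grace_seconds
        while time.monotonic() < deadline:
            try:
                job = self.jobs.get_nowait()
            except queue.Empty:
                return True
            self.results.append(job())
        return self.jobs.empty()
```

Keep the drain deadline shorter than `terminationGracePeriodSeconds`; once that period expires the orchestrator sends SIGKILL, which cannot be handled.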
178
+ ## Deploy cadence
179
+
180
+ - **Fast** (multiple / day, per team) — best for small changes, high automation, strong tests.
181
+ - **Batched** (weekly / fortnightly) — when risk of each release is high.
182
+
183
+ High-performing orgs deploy multiple times per day. The trick is making each deploy small.
184
+
185
+ ## "Release" vs. "deploy" for mobile
186
+
187
+ You can't roll back or hot-patch a mobile app version that users have already installed, but server-side flags can still control the installed app's behaviour. Design APIs so that server-side flags let you turn off client features without shipping a new app version.
188
+
189
+ ## Anti-patterns
190
+
191
+ | Anti-pattern | Fix |
192
+ |---|---|
193
+ | Deploying on Fridays without cause | Deploy midweek, quieter rollback |
194
+ | Rollback as "SSH in and run..." | Automated, one-command rollback |
195
+ | Waiting for the deploy to "look OK" by refreshing the app | Instrument; set specific metrics |
196
+ | Manual canary percentage math | Use the orchestrator's progressive rollout |
197
+ | Schema migration in the same step as code rollout | Pre-migrate; expand/contract |
198
+ | Long-lived feature flags | Set expiry; clean up |
199
+ | Sidecar fetching config at startup with no timeout | Fail fast; bounded retry |
200
+ | Skipping grace period on SIGTERM | Handle SIGTERM and drain within the grace period; otherwise every deploy drops requests |
201
+ | Deploys that don't produce an event in observability | Emit a deploy event so metric spikes can be correlated with deploys |