@coralai/sps-cli 0.42.0 → 0.43.0

Files changed (109)
  1. package/README.md +34 -3
  2. package/dist/commands/projectInit.d.ts.map +1 -1
  3. package/dist/commands/projectInit.js +40 -53
  4. package/dist/commands/projectInit.js.map +1 -1
  5. package/dist/commands/skillCommand.d.ts +2 -0
  6. package/dist/commands/skillCommand.d.ts.map +1 -0
  7. package/dist/commands/skillCommand.js +235 -0
  8. package/dist/commands/skillCommand.js.map +1 -0
  9. package/dist/core/skillStore.d.ts +46 -0
  10. package/dist/core/skillStore.d.ts.map +1 -0
  11. package/dist/core/skillStore.js +197 -0
  12. package/dist/core/skillStore.js.map +1 -0
  13. package/dist/core/skillStore.test.d.ts +2 -0
  14. package/dist/core/skillStore.test.d.ts.map +1 -0
  15. package/dist/core/skillStore.test.js +190 -0
  16. package/dist/core/skillStore.test.js.map +1 -0
  17. package/dist/main.js +19 -17
  18. package/dist/main.js.map +1 -1
  19. package/package.json +1 -1
  20. package/skills/architecture-decision-records/SKILL.md +207 -0
  21. package/skills/backend/SKILL.md +62 -0
  22. package/skills/backend/references/api-design.md +168 -0
  23. package/skills/backend/references/caching.md +181 -0
  24. package/skills/backend/references/data-access.md +173 -0
  25. package/skills/backend/references/layering.md +181 -0
  26. package/skills/backend/references/observability.md +190 -0
  27. package/skills/backend/references/resilience.md +201 -0
  28. package/skills/backend/references/security.md +186 -0
  29. package/skills/backend-architect/SKILL.md +119 -0
  30. package/skills/code-reviewer/SKILL.md +143 -0
  31. package/skills/coding-standards/SKILL.md +60 -0
  32. package/skills/coding-standards/references/clean-code.md +258 -0
  33. package/skills/coding-standards/references/code-review.md +192 -0
  34. package/skills/coding-standards/references/commits-and-prs.md +226 -0
  35. package/skills/coding-standards/references/error-strategy.md +193 -0
  36. package/skills/coding-standards/references/naming.md +185 -0
  37. package/skills/coding-standards/references/tdd.md +171 -0
  38. package/skills/database/SKILL.md +53 -0
  39. package/skills/database/references/indexing.md +190 -0
  40. package/skills/database/references/migrations.md +199 -0
  41. package/skills/database/references/nosql.md +185 -0
  42. package/skills/database/references/queries.md +295 -0
  43. package/skills/database/references/scaling.md +203 -0
  44. package/skills/database/references/schema.md +191 -0
  45. package/skills/database-optimizer/SKILL.md +168 -0
  46. package/skills/debugging-workflow/SKILL.md +244 -0
  47. package/skills/devops/SKILL.md +55 -0
  48. package/skills/devops/references/ci-cd.md +204 -0
  49. package/skills/devops/references/containers.md +272 -0
  50. package/skills/devops/references/deploy.md +201 -0
  51. package/skills/devops/references/iac.md +252 -0
  52. package/skills/devops/references/observability.md +228 -0
  53. package/skills/devops/references/secrets.md +178 -0
  54. package/skills/devops-automator/SKILL.md +164 -0
  55. package/skills/frontend/SKILL.md +52 -0
  56. package/skills/frontend/references/accessibility.md +222 -0
  57. package/skills/frontend/references/components.md +206 -0
  58. package/skills/frontend/references/performance.md +219 -0
  59. package/skills/frontend/references/routing.md +209 -0
  60. package/skills/frontend/references/state.md +190 -0
  61. package/skills/frontend/references/testing.md +216 -0
  62. package/skills/frontend-developer/SKILL.md +115 -0
  63. package/skills/git-workflow/SKILL.md +355 -0
  64. package/skills/golang/SKILL.md +49 -0
  65. package/skills/golang/references/concurrency.md +284 -0
  66. package/skills/golang/references/errors.md +241 -0
  67. package/skills/golang/references/idioms.md +285 -0
  68. package/skills/golang/references/testing.md +238 -0
  69. package/skills/java/SKILL.md +50 -0
  70. package/skills/java/references/concurrency.md +194 -0
  71. package/skills/java/references/idioms.md +283 -0
  72. package/skills/java/references/testing.md +228 -0
  73. package/skills/kotlin/SKILL.md +47 -0
  74. package/skills/kotlin/references/coroutines.md +240 -0
  75. package/skills/kotlin/references/idioms.md +268 -0
  76. package/skills/kotlin/references/testing.md +219 -0
  77. package/skills/mobile/SKILL.md +50 -0
  78. package/skills/mobile/references/architecture.md +204 -0
  79. package/skills/mobile/references/navigation.md +158 -0
  80. package/skills/mobile/references/performance.md +152 -0
  81. package/skills/mobile/references/platform.md +166 -0
  82. package/skills/mobile/references/state-and-data.md +174 -0
  83. package/skills/python/SKILL.md +51 -0
  84. package/skills/python/THIRD_PARTY.md +14 -0
  85. package/skills/python/references/async.md +218 -0
  86. package/skills/python/references/error-handling.md +254 -0
  87. package/skills/python/references/idioms.md +279 -0
  88. package/skills/python/references/packaging.md +233 -0
  89. package/skills/python/references/testing.md +269 -0
  90. package/skills/python/references/typing.md +292 -0
  91. package/skills/qa-tester/SKILL.md +186 -0
  92. package/skills/rust/SKILL.md +50 -0
  93. package/skills/rust/references/async.md +224 -0
  94. package/skills/rust/references/errors.md +240 -0
  95. package/skills/rust/references/ownership.md +263 -0
  96. package/skills/rust/references/testing.md +274 -0
  97. package/skills/rust/references/traits.md +250 -0
  98. package/skills/security-engineer/SKILL.md +157 -0
  99. package/skills/swift/SKILL.md +48 -0
  100. package/skills/swift/references/concurrency.md +280 -0
  101. package/skills/swift/references/idioms.md +334 -0
  102. package/skills/swift/references/testing.md +229 -0
  103. package/skills/typescript/SKILL.md +51 -0
  104. package/skills/typescript/references/async.md +241 -0
  105. package/skills/typescript/references/errors.md +208 -0
  106. package/skills/typescript/references/idioms.md +246 -0
  107. package/skills/typescript/references/testing.md +225 -0
  108. package/skills/typescript/references/tooling.md +208 -0
  109. package/skills/typescript/references/types.md +259 -0
@@ -0,0 +1,252 @@
1
+ # Infrastructure-as-Code
2
+
3
+ Terraform, Pulumi, CDK, CloudFormation. Patterns, not syntax.
4
+
5
+ ## Principles
6
+
7
+ 1. **Code, not console.** Every production resource is declared in code. Console use is for exploration only, not for shipping.
8
+ 2. **State is sacred.** Losing or corrupting state means reconstructing reality from the cloud. Store it remotely, lock it.
9
+ 3. **Plan before apply.** `terraform plan` (or equivalent) is a read-only dry run. Diff it; understand it; only then apply.
10
+ 4. **Modules for reuse, not for abstraction.** A module that no one else uses is just folders. Extract when you have two callers, not before.
11
+ 5. **Environments as instances of a config.** Same code, different variables. Dev, staging, prod shouldn't fork into three different trees.
12
+ 6. **Less is more.** The smallest resource set that meets requirements. Every extra resource is an extra blast radius.
13
+
14
+ ## Tool choice
15
+
16
+ | Tool | Strengths |
17
+ |---|---|
18
+ | **Terraform / OpenTofu** | Multi-cloud, huge provider ecosystem, mature |
19
+ | **Pulumi** | Real programming language; complex logic feels natural |
20
+ | **AWS CDK** | AWS-native, TypeScript / Python, feels like SDK |
21
+ | **CloudFormation** | AWS-native, declarative, tight integration |
22
+ | **Ansible** | Imperative config mgmt (VMs), not declarative infra |
23
+ | **Kubernetes manifests / Helm / Kustomize** | K8s-specific |
24
+ | **Crossplane / KRO** | K8s-native for cloud infra |
25
+
26
+ Pick one IaC tool per cloud estate. Using three for one codebase creates state duplication and drift.
27
+
28
+ ## State — remote, locked
29
+
30
+ Local state on a laptop is a failure mode waiting to happen. Remote backend + locking is non-negotiable for team work.
31
+
32
+ Terraform remote backends:
33
+ - **S3 + DynamoDB** (AWS) — classic, cheap.
34
+ - **Terraform Cloud / HCP Terraform** — managed, VCS integration.
35
+ - **GCS + lock** (GCP).
36
+ - **Azure Storage** (Azure).
37
+
38
+ Enable:
39
+ - Encryption at rest.
40
+ - Versioning (so a corrupt state can be rolled back).
41
+ - Restricted access (only CI + admins).
42
+
43
+ ## One state file vs. many
44
+
45
+ Split state when:
46
+ - **Blast radius**: a typo in one part shouldn't risk another (separate "network" from "apps").
47
+ - **Apply time**: a 40-minute plan is painful; split so changes in one area only plan that area.
48
+ - **Permissions**: different teams own different pieces.
49
+
50
+ Typical split:
51
+ ```
52
+ networking/ — VPCs, subnets, DNS
53
+ data/ — databases, caches, queues
54
+ platform/ — K8s clusters, service mesh
55
+ apps/service-a/
56
+ apps/service-b/
57
+ ```
58
+
59
+ Cross-state references via remote state data sources:
60
+
61
+ ```hcl
62
+ data "terraform_remote_state" "network" {
63
+ backend = "s3"
64
+ config = { bucket = "...", key = "networking.tfstate", region = "..." }
65
+ }
66
+
67
+ resource "aws_instance" "app" {
68
+ subnet_id = data.terraform_remote_state.network.outputs.subnet_id
69
+ }
70
+ ```
71
+
72
+ ## Modules
73
+
74
+ Reuse is an effect; don't modularize for the sake of it.
75
+
76
+ ```
77
+ modules/
78
+ ├── s3-bucket/ # simple, focused
79
+ ├── rds-postgres/ # multiple call sites; worth the abstraction
80
+ └── vpc/ # canonical
81
+ ```
82
+
83
+ Rules:
84
+ - Module inputs: the minimum variables that matter; sensible defaults for the rest.
85
+ - Module outputs: only what callers need.
86
+ - Version modules (git tag / registry version) if shared across repos.
87
+
88
+ Anti-pattern: "god module" with 50 inputs covering every hypothetical case. That's configuration disguised as code.
89
+
90
+ ## Workspaces vs. env-specific folders
91
+
92
+ ### Workspaces (Terraform)
93
+
94
+ Same code, different state per workspace.
95
+
96
+ ```
97
+ terraform workspace new dev
98
+ terraform workspace new prod
99
+ ```
100
+
101
+ OK for dev/prod parity when the topology is truly identical.
102
+
103
+ ### Folder-per-env (often preferred)
104
+
105
+ ```
106
+ environments/
107
+ ├── dev/
108
+ │ └── main.tfvars
109
+ ├── staging/
110
+ │ └── main.tfvars
111
+ └── prod/
112
+ └── main.tfvars
113
+ ```
114
+
115
+ Each env has its own state and vars file. Same underlying modules. Explicit, reviewable, allows env-specific overrides.
116
+
117
+ ## Review flow
118
+
119
+ Every IaC change is a PR. `terraform plan` output in the PR description or CI comment. Reviewer reads the plan and the code.
120
+
121
+ Automate it:
122
+
123
+ ```yaml
124
+ - run: terraform init
125
+ - run: terraform plan -out=plan.tfplan
126
+ - run: terraform show -no-color plan.tfplan > plan.txt
127
+ - uses: actions/github-script@v7
128
+   with:
129
+     script: |
130
+       const plan = require('fs').readFileSync('plan.txt', 'utf8');
131
+       await github.rest.issues.createComment({ ..., body: '```\n' + plan + '\n```' });
132
+ ```
133
+
134
+ Prod apply gated behind approval:
135
+
136
+ ```yaml
137
+ environment:
138
+   name: prod
139
+   url: https://...
140
+ ```
141
+
142
+ ## Drift
143
+
144
+ State says one thing; reality says another (someone clicked in the console, an external process modified a resource).
145
+
146
+ - `terraform plan` detects drift.
147
+ - Reconcile: revert console changes or adopt them into code.
148
+ - Policy: disable console write access for prod; force all changes through IaC.
149
+
150
+ Drift ignored becomes a permanent parallel state that diverges further every day.
151
+
152
+ ## Secret handling in IaC
153
+
154
+ - Secrets don't live in `.tfvars` committed to git.
155
+ - Pull from secret manager at apply time (`data "aws_secretsmanager_secret_version"`).
156
+ - Mark sensitive outputs and variables with `sensitive = true` so they are redacted from plan and apply output.
157
+ - `.tfstate` itself contains sensitive values — protect the backend.
158
+
159
+ ## Tagging strategy
160
+
161
+ Every cloud resource gets tags. Tags make cost allocation, search, and ownership unambiguous.
162
+
163
+ ```hcl
164
+ tags = { # nested inside provider "aws" { default_tags { ... } }
165
+ environment = var.env
166
+ service = var.service
167
+ owner_team = var.team
168
+ managed_by = "terraform"
169
+ repo = "github.com/.../infra"
170
+ }
171
+ ```
172
+
173
+ Enforce via policy (SCPs on AWS, Azure Policy, custom `terraform-compliance`).
174
+
175
+ ## Avoid `count` on resources likely to change order
176
+
177
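+ Terraform tracks resources by address. `aws_instance.web[0]` is not the same as `aws_instance.web[1]`. With `count`, inserting into the middle of the list shifts every later index, so Terraform plans to change or replace each resource after the insertion point.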
178
+
179
+ Prefer `for_each` over `count` — keyed by a string, stable under insertion.
180
+
181
+ ```hcl
182
+ # ❌ count — fragile if order changes
183
+ resource "aws_instance" "web" {
184
+ count = length(var.regions)
185
+ region = var.regions[count.index]
186
+ }
187
+
188
+ # ✅ for_each — keyed
189
+ resource "aws_instance" "web" {
190
+ for_each = toset(var.regions)
191
+ region = each.value
192
+ }
193
+ ```
194
+
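The address shift is easy to see in a small simulation. The sketch below is not Terraform itself; it only mimics how addresses are keyed (by list position for `count`, by value for `for_each = toset(...)`), and the helper names and region values are hypothetical.

```python
def count_addresses(name, values):
    """Address map as `count` would key it: by list position."""
    return {f"{name}[{i}]": v for i, v in enumerate(values)}


def for_each_addresses(name, values):
    """Address map as `for_each = toset(...)` would key it: by the value itself."""
    return {f'{name}["{v}"]': v for v in sorted(set(values))}


def changed(before, after):
    """Addresses whose bound value differs, appears, or disappears."""
    return sorted(a for a in before.keys() | after.keys() if before.get(a) != after.get(a))


old = ["us-east-1", "us-west-2"]
new = ["us-east-1", "eu-west-1", "us-west-2"]  # insert in the middle

count_diff = changed(count_addresses("aws_instance.web", old),
                     count_addresses("aws_instance.web", new))
for_each_diff = changed(for_each_addresses("aws_instance.web", old),
                        for_each_addresses("aws_instance.web", new))

print(count_diff)     # two addresses touched: the shifted one and the new tail
print(for_each_diff)  # only the newly added key
```

With positional keys, the existing `us-west-2` instance is affected even though nothing about it changed; with value keys, only the new region shows up in the diff.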
195
+ ## Lifecycle rules
196
+
197
+ ```hcl
198
+ lifecycle {
199
+ prevent_destroy = true # guard prod databases, S3 buckets with data
200
+ create_before_destroy = true # zero-downtime replacement where possible
201
+ ignore_changes = [tags["last_updated"]] # don't churn on external tag writes
202
+ }
203
+ ```
204
+
205
+ Set `prevent_destroy` on anything that cannot be recovered once deleted. Destroying it then requires deliberately removing the flag first.
206
+
207
+ ## Blast radius
208
+
209
+ Running `terraform destroy` in the wrong directory has wiped real infrastructure. Mitigations:
210
+ - Separate folders per env with distinct state backends.
211
+ - Prod destroy requires a separate pipeline or a specific role.
212
+ - Critical resources have `prevent_destroy`.
213
+
214
+ ## Policy as code
215
+
216
+ Test that the plan adheres to rules before apply.
217
+
218
+ - **OPA / Conftest** — general-purpose policy, plaintext rules.
219
+ - **HashiCorp Sentinel** (HCP Terraform) — policy as code.
220
+ - **Checkov, tfsec, terrascan** — pre-built security rules (public S3 buckets, encryption, etc.).
221
+
222
+ Typical guards:
223
+ - No public S3 buckets.
224
+ - No unencrypted RDS / EBS.
225
+ - No 0.0.0.0/0 ingress on non-frontend services.
226
+ - All resources have required tags.
227
+
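As an illustration of the "required tags" guard, here is a minimal sketch that inspects the JSON form of a plan (as produced by `terraform show -json plan.tfplan`). It assumes the `resource_changes[].address` and `change.after` fields of that format; the required tag set and the sample plan are hypothetical, and values only known after apply would not appear in `after`. In practice a tool such as Conftest or Checkov would normally do this job.

```python
REQUIRED_TAGS = {"environment", "service", "owner_team", "managed_by"}  # hypothetical policy


def missing_tags(plan: dict, required=REQUIRED_TAGS) -> dict:
    """Map resource address -> sorted list of required tags it lacks."""
    violations = {}
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after")
        if after is None:  # resource is being destroyed; nothing to check
            continue
        if "tags" not in after and "tags_all" not in after:
            continue  # assumed: this resource type does not support tags
        tags = after.get("tags_all") or after.get("tags") or {}
        missing = sorted(required - tags.keys())
        if missing:
            violations[rc["address"]] = missing
    return violations


# Hypothetical, heavily trimmed plan document for demonstration.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs",
         "change": {"after": {"tags": {"environment": "prod", "service": "api"}}}},
        {"address": "aws_instance.web",
         "change": {"after": {"tags_all": {"environment": "prod", "service": "api",
                                           "owner_team": "core", "managed_by": "terraform"}}}},
        {"address": "aws_db_instance.old", "change": {"after": None}},
    ]
}

violations = missing_tags(sample_plan)
print(violations)
```

A CI step could fail the job whenever `violations` is non-empty, before `apply` ever runs.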
228
+ ## Rollback
229
+
230
+ IaC rollback = revert the commit + apply. It works when the infra change is self-contained.
231
+
232
+ It does NOT work for:
233
+ - Data migrations (schema changes, in-place transformations).
234
+ - Resources that were deleted with associated data.
235
+ - Stateful upgrades where the backing engine doesn't support downgrade.
236
+
237
+ For those, plan forward-only fixes.
238
+
239
+ ## Anti-patterns
240
+
241
+ | Anti-pattern | Fix |
242
+ |---|---|
243
+ | Local state | Remote + locking |
244
+ | No module versioning for shared modules | Tag + version pin |
245
+ | Env-specific logic via `if env == "prod"` | Env-specific `.tfvars` / folder |
246
+ | Every change runs `apply` without review | PR + plan output |
247
+ | Sensitive outputs without `sensitive = true` | Mark them |
248
+ | Giant 5000-line main.tf | Split by resource group / module |
249
+ | `terraform taint` as a regular workflow | Fix the root cause; taint is a hack |
250
+ | `null_resource` + `local-exec` for everything | Find a proper provider |
251
+ | Hand-written policies enforced by discipline | Automate with OPA / tfsec |
252
+ | Cloud console changes that "just need to happen" | Update IaC; revert the console |
@@ -0,0 +1,228 @@
1
+ # Observability (Platform)
2
+
3
+ Log / metric / trace pipelines, alerting, on-call, runbooks. For app-level signal definition, see `backend/references/observability.md`; this file covers the platform plumbing.
4
+
5
+ ## The stack
6
+
7
+ ```
8
+ App ──stdout──▶ Collector (fluent-bit, otel-collector, vector)
9
+ ──metric──▶ Prometheus / Cloud Monitoring / Datadog / NewRelic
10
+ ──trace───▶ OpenTelemetry Collector ──▶ Jaeger / Tempo / DD APM
11
+
12
+ Alerting: Prometheus Alertmanager / Grafana / PagerDuty / OpsGenie
13
+ ```
14
+
15
+ Pick the right number of tools. Four different tools with overlapping coverage is a tax; one plus another specialist is usually enough.
16
+
17
+ ## Logs
18
+
19
+ ### Collection
20
+
21
+ - Container logs → stdout/stderr.
22
+ - Daemon on each node reads container logs (`fluent-bit`, `fluentd`, `vector`, cloud-native).
23
+ - Collector forwards to the backend (Elasticsearch, Loki, Datadog Logs, Cloud Logging).
24
+
25
+ Don't write logs to local files inside containers. Lost on pod restart; hard to collect.
26
+
27
+ ### Format
28
+
29
+ JSON. Every line a structured event. See `backend/references/observability.md` for field names.
30
+
31
+ ### Retention
32
+
33
+ Tier by age:
34
+ - **Hot**: 7–14 days, fast search.
35
+ - **Warm**: 30–90 days, slower but still searchable.
36
+ - **Archive**: 1+ year, S3 / cold storage, restore on demand.
37
+
38
+ Log volume grows with traffic; set retention per env (dev can be 3 days, prod 30). Otherwise, the bill does the planning for you.
39
+
40
+ ### Sensitive data
41
+
42
+ Redact at source — the app's logger, not the collector. Once a secret hits the pipeline it's harder to control.
43
+
44
+ Check your logs periodically for leaked PII / tokens. Add automated scanning rules to the pipeline (pattern matching for JWTs, credit card numbers).
45
+
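A scanning rule of this kind can be sketched in a few lines. The patterns below are illustrative assumptions, not a complete detector: the JWT pattern relies on the common `eyJ` prefix of a base64url-encoded JSON header, and the card pattern is narrowed with a Luhn checksum to cut false positives on ordinary numeric IDs.

```python
import re

# Three base64url segments, the first starting with "eyJ" (encoded '{"').
JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")
# 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_line(line: str) -> list[str]:
    """Return the kinds of suspected secrets / PII found in one log line."""
    findings = []
    if JWT_RE.search(line):
        findings.append("jwt")
    for match in CARD_RE.finditer(line):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_ok(digits):
            findings.append("card")
            break
    return findings
```

Running `scan_line` over a sample of each service's output on a schedule gives an early warning; it complements redaction at the source rather than replacing it.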
46
+ ## Metrics
47
+
48
+ ### Collection
49
+
50
+ - **Pull** (Prometheus) — scraper hits app endpoints.
51
+ - **Push** (StatsD, OTLP) — app pushes to a gateway / collector.
52
+
53
+ Pull scales well at moderate cluster sizes, gets fiddly at huge scale. Push is simpler at scale but loses some visibility: a target that stops pushing looks the same as one with nothing to report.
54
+
55
+ ### Standards
56
+
57
+ OpenTelemetry (OTel) is becoming the de-facto standard for metric + trace instrumentation. Instrument once with OTel SDKs; switch backends by changing the collector config.
58
+
59
+ ### Cardinality
60
+
61
+ Every unique combination of label values creates a new time series. High-cardinality labels (user_id, request_id) blow up storage and cost.
62
+
63
+ ```
64
+ # ✅ bounded
65
+ http_requests_total{service="api", route="/orders", method="POST", status="200"}
66
+
67
+ # ❌ unbounded
68
+ http_requests_total{service="api", user_id="u_01HX..."}
69
+ ```
70
+
71
+ The cloud will silently charge you for cardinality. Watch the count of series.
72
+
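The arithmetic behind the blow-up is a product: the worst-case number of series for one metric is the product of the distinct values each label can take. A minimal sketch, with hypothetical label counts:

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical counts of distinct values per label.
bounded = {"service": 1, "route": 40, "method": 5, "status": 12}
unbounded = {**bounded, "user_id": 100_000}

print(series_upper_bound(bounded))    # 2400
print(series_upper_bound(unbounded))  # 240000000
```

Real series counts are usually lower because not every combination occurs, but the upper bound shows why a single unbounded label dominates everything else.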
73
+ ### Four golden signals (per service)
74
+
75
+ 1. **Latency** — how long do requests take (p50/p95/p99)?
76
+ 2. **Traffic** — how many requests per second?
77
+ 3. **Errors** — rate of failed requests?
78
+ 4. **Saturation** — how full is it? (CPU, queue depth, connection pool)
79
+
80
+ Dashboards start here. Drill into specifics from the starting point.
81
+
82
+ ## Traces
83
+
84
+ OpenTelemetry instrumented endpoints + propagated context.
85
+
86
+ ```
87
+ Request ─▶ Service A [span] ─▶ Service B [span] ─▶ DB [span]
88
+ ```
89
+
90
+ Each span has timing, tags, events. Together they form the request timeline.
91
+
92
+ ### Sampling
93
+
94
+ Head-based (per request, decide at ingress):
95
+ - 1–10% typical.
96
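+ - Keeping 100% of errors is not possible with a pure head decision, because the outcome is unknown at ingress; that needs tail-based sampling or a vendor-side error sampler.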
97
+
98
+ Tail-based (sample after seeing the whole trace):
99
+ - Keep slow traces, error traces, unusual patterns.
100
+ - Needs a full collector layer (otel-collector).
101
+
102
+ Tracing overhead is real — don't trace 100% in prod without tail-based sampling.
103
+
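The two decision points can be sketched as follows. This is an illustration of the idea, not the OpenTelemetry SDK's sampler API: the span dictionaries, field names, and the 500 ms threshold are assumptions. The head decision sees only the trace ID, while the tail decision sees the finished spans.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head decision at ingress: hash the trace ID so every service agrees."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_keep(spans: list[dict], slow_ms: float = 500.0) -> bool:
    """Tail decision once the whole trace is buffered: keep errors and slow traces."""
    has_error = any(span.get("error") for span in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration >= slow_ms


fast_ok = [{"start_ms": 0, "end_ms": 40}, {"start_ms": 5, "end_ms": 30}]
fast_error = [{"start_ms": 0, "end_ms": 40, "error": True}]
slow_ok = [{"start_ms": 0, "end_ms": 900}]
```

Hashing the trace ID keeps the head decision consistent across services without coordination; the tail decision needs a collector that buffers all spans of a trace before deciding.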
104
+ ## Dashboards
105
+
106
+ ### Structure
107
+
108
+ One dashboard per service, standard layout:
109
+ - Overview: RED metrics (Rate, Errors, Duration).
110
+ - Saturation: CPU, memory, pool utilization.
111
+ - Dependencies: DB, cache, upstream services.
112
+ - Recent deploys marked as vertical annotations.
113
+
114
+ Links to runbook + logs + traces.
115
+
116
+ ### Don't build 50 dashboards
117
+
118
+ Most go stale within weeks. Focus on a small set that matters:
119
+ - One per critical service.
120
+ - One per SLO.
121
+ - A few investigative templates ("compare p99 before/after a given deploy").
122
+
123
+ ## Alerts
124
+
125
+ ### Principles
126
+
127
+ - **Alert on symptoms, not causes.** "Users can't check out" beats "CPU is 80%".
128
+ - **Every alert is actionable** — there's a specific thing the oncall does.
129
+ - **Every alert has a runbook** linked in the alert body.
130
+ - **Every alert has an owner** — the team / service that owns the fix.
131
+
132
+ ### Severity levels
133
+
134
+ - **P1 / SEV-1**: page the oncall; revenue / customer-facing impact.
135
+ - **P2 / SEV-2**: notify in team channel; degraded state.
136
+ - **P3 / SEV-3**: track as an issue; investigate next business day.
137
+
138
+ Only P1 should wake someone up. Too many P1s → pager fatigue → missed alerts.
139
+
140
+ ### Tuning
141
+
142
+ - Alert fires → wasn't actionable → either tune the threshold or delete it.
143
+ - Alert fires at 3am and auto-resolves at 3:15am with no action → wasn't actionable.
144
+ - Alert with "click dashboard, maybe it's fine" → wasn't actionable.
145
+
146
+ Audit monthly.
147
+
148
+ ## On-call
149
+
150
+ ### Rotation
151
+
152
+ - Weekly rotation typical; one primary + one secondary.
153
+ - Handoff meeting: what's ongoing, what's worrying.
154
+ - On-call participants must have access: can deploy, rollback, scale.
155
+
156
+ ### Triage flow
157
+
158
+ 1. **Acknowledge** — clock is ticking on MTTR.
159
+ 2. **Stop the bleeding** — rollback, scale up, disable a feature flag. Don't perfect-fix in the moment.
160
+ 3. **Gather context** — what changed recently? dashboards, logs, traces.
161
+ 4. **Escalate** — bring in the service owner if you're not them.
162
+ 5. **Post-incident** — see below.
163
+
164
+ ### Post-incident
165
+
166
+ Every P1 gets a postmortem. Blameless.
167
+
168
+ Template:
169
+ - **Summary** — one paragraph.
170
+ - **Impact** — who / how much / how long.
171
+ - **Timeline** — minute-by-minute of detection, response, resolution.
172
+ - **Root cause** — technical + process.
173
+ - **Action items** — specific, owned, dated.
174
+ - **Lessons** — what was surprising.
175
+
176
+ Track action items to completion. Unshipped postmortem actions are how the same incident happens twice.
177
+
178
+ ## SLO / error budget
179
+
180
+ Set SLOs (service-level objectives) that match user expectations. Derive error budget.
181
+
182
+ ```
183
+ SLO: 99.9% of /orders POST succeed in ≤ 500 ms
184
+ Budget: 0.1% × 30 days ≈ 43 min / month
185
+ ```
186
+
187
+ Burn rate:
188
+ - Slow burn — spend budget over weeks (minor quality erosion).
189
+ - Fast burn — exhaust weekly budget in a day (real problem).
190
+
191
+ Alert on burn rate, not just on individual failures. "We're burning budget 10× too fast" is actionable.
192
+
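The budget and burn-rate arithmetic can be written out directly. A minimal sketch, assuming a 30-day window and treating burn rate as the observed error ratio divided by the allowed error ratio; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole window's budget is gone at a constant burn rate."""
    return window_days / rate


budget = error_budget_minutes(0.999)  # about 43.2 minutes per 30 days
rate = burn_rate(0.01, 0.999)         # 1% errors against a 0.1% allowance: about 10x
print(round(budget, 1), round(rate, 1), round(days_to_exhaustion(rate), 1))
```

At a sustained 10× burn the monthly budget lasts about three days, which is why burn rate makes a better paging signal than any single failed request.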
193
+ ## Health endpoints
194
+
195
+ ```
196
+ /health/live — process alive
197
+ /health/ready — can serve traffic (DB reachable, cache reachable)
198
+ ```
199
+
200
+ Container orchestrators use both:
201
+ - Live failing → restart the container.
202
+ - Ready failing → take out of load balancer, leave alive.
203
+
204
+ Never put business logic in health checks. Keep them cheap and boring.
205
+
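A framework-agnostic sketch of the two endpoints is shown below. The check names and the response shape are assumptions; the point is that liveness does no dependency work at all, while readiness runs a few cheap probes and reports which ones failed.

```python
from typing import Callable


def _passes(check: Callable[[], bool]) -> bool:
    """A probe that raises counts as a failed probe."""
    try:
        return bool(check())
    except Exception:
        return False


def health_response(path: str, readiness_checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Return (HTTP status, JSON-serializable body) for a health request."""
    if path == "/health/live":
        return 200, {"status": "alive"}  # process is up; no dependency calls
    if path == "/health/ready":
        failed = sorted(name for name, check in readiness_checks.items() if not _passes(check))
        if failed:
            return 503, {"status": "not ready", "failed": failed}
        return 200, {"status": "ready"}
    return 404, {"status": "unknown path"}


# Hypothetical probes; real ones would ping the DB / cache with a short timeout.
checks = {"db": lambda: True, "cache": lambda: False}
print(health_response("/health/ready", checks))
```

Keeping dependency checks out of the liveness path avoids restart loops when a downstream dependency, rather than the process itself, is unhealthy.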
206
+ ## Cost visibility
207
+
208
+ Observability is expensive at scale. Monitor the bill:
209
+ - Logs ingested per service per day.
210
+ - Metric series count.
211
+ - Trace spans per second.
212
+
213
+ When one service exports 10× what others do — investigate. Usually debug logging left on, or a metric with a user-id label.
214
+
215
+ ## Anti-patterns
216
+
217
+ | Anti-pattern | Fix |
218
+ |---|---|
219
+ | Logs written to files in containers | stdout |
220
+ | Alerts on infrastructure without symptom mapping | Alert on service impact |
221
+ | Dashboards nobody reads | Delete unused; focus on core ones |
222
+ | Runbook-less alerts | Every alert links a runbook |
223
+ | Tracing 100% in prod without sampling | Head or tail sampling |
224
+ | Metric labels on request IDs | Use logs/traces for high-cardinality |
225
+ | "I'll set up monitoring later" | Observability before launch |
226
+ | Alert channel drowning in noise | Audit and tune |
227
+ | No oncall → whoever's free panics | Formal rotation, documented escalation |
228
+ | Postmortems as blame sessions | Blameless format; focus on systems and actions |
@@ -0,0 +1,178 @@
1
+ # Secrets
2
+
3
+ Storage, rotation, access, scanning. The part that goes wrong quietly.
4
+
5
+ ## The rules
6
+
7
+ 1. **Never in source control.** Ever. `.env` files stay in `.gitignore`; secrets come from a manager.
8
+ 2. **One secret, one owner.** Scoped per service, per env. The "shared creds" bucket is a leak waiting to happen.
9
+ 3. **Rotate on schedule AND on compromise.** Short lifetime = small window of exposure.
10
+ 4. **Least privilege.** A service key can read its own DB, not every DB.
11
+ 5. **Audit every read.** You should be able to tell who accessed prod secret X yesterday.
12
+ 6. **Encrypt at rest AND in transit.** Default in modern secret managers; verify.
13
+
14
+ ## Where to store them
15
+
16
+ | Tool | For |
17
+ |---|---|
18
+ | **AWS Secrets Manager / Parameter Store** | AWS; IAM-integrated |
19
+ | **GCP Secret Manager** | GCP; IAM-integrated |
20
+ | **Azure Key Vault** | Azure; RBAC-integrated |
21
+ | **HashiCorp Vault** | Cloud-agnostic; dynamic secrets, rich ACLs |
22
+ | **1Password / Bitwarden** | Human-held secrets, small teams |
23
+ | **Kubernetes Secrets** | K8s-native, lightweight; encrypt etcd |
24
+ | **Sealed Secrets / SOPS** | Encrypt secrets that live in git (controller decrypts) |
25
+
26
+ Pick one canonical store per cloud estate. Sprawl breeds drift — the same secret in three places rotates in one.
27
+
28
+ ## Access patterns
29
+
30
+ ### At deploy
31
+
32
+ - CI has scoped permission to fetch the secrets its job needs (e.g., `DEPLOY_ROLE_PROD`).
33
+ - Terraform pulls via `data` sources at apply (not baked into `.tfvars`).
34
+ - Kubernetes: use the cloud provider's secret CSI driver or External Secrets Operator.
35
+
36
+ ### At runtime
37
+
38
+ - Container pulls secrets from the manager on startup via the platform (IRSA on EKS, Workload Identity on GKE, Workload Identity on AKS).
39
+ - Or mount as files; the app reads from `/secrets/db_url`.
40
+ - Never bake into the image.
41
+
42
+ ### Locally
43
+
44
+ - Developers authenticate to the secret manager (SSO + session).
45
+ - CLI pulls secrets on demand: `op run --env-file .env.tpl -- npm start`.
46
+ - Don't ship shared `.env.dev` files on Slack.
47
+
48
+ ## Rotation
49
+
50
+ For every secret, know:
51
+ - Who rotates (automated job, human?).
52
+ - How often (14 d? 90 d?).
53
+ - How do consumers pick up the new value without downtime?
54
+
55
+ Rotation patterns:
56
+
57
+ ### Dual-secret window
58
+
59
+ 1. Create secret v2 alongside v1.
60
+ 2. Consumers accept both.
61
+ 3. Update producers to use v2.
62
+ 4. Disable v1 after grace period.
63
+
64
+ Zero downtime if all consumers support reading both.
65
+
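The four steps can be sketched with an HMAC signing key as the rotating secret. This is an illustrative assumption (the same flow applies to API keys or passwords), and the key values and function names are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Producer side: sign with whichever key is currently active."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(accepted_keys: list[bytes], message: bytes, signature: str) -> bool:
    """Consumer side: accept a signature made with any currently accepted key."""
    return any(hmac.compare_digest(sign(key, message), signature) for key in accepted_keys)


v1, v2 = b"old-key-v1", b"new-key-v2"  # hypothetical key material
msg = b"order:42"

# Steps 1-2: v2 exists and consumers accept both; producers still sign with v1.
during_window = verify([v1, v2], msg, sign(v1, msg))
# Step 3: producers switch to v2; consumers need no change.
after_switch = verify([v1, v2], msg, sign(v2, msg))
# Step 4: v1 is disabled after the grace period; old signatures stop verifying.
after_cutoff = verify([v2], msg, sign(v1, msg))
```

The ordering matters: consumers must accept v2 before any producer starts using it, and v1 is removed only once nothing signs with it.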
66
+ ### Break-glass
67
+
68
+ Some secrets (signing keys, root creds) rarely rotate. Document:
69
+ - Access procedure (break-glass requires 2-person approval).
70
+ - Rotation playbook.
71
+ - Who to notify.
72
+
73
+ ## Dynamic secrets
74
+
75
+ The gold standard. Vault generates short-lived DB credentials on request; they expire in minutes / hours.
76
+
77
+ ```
78
+ app → Vault: "give me DB creds for service X"
79
+ Vault → DB: CREATE USER temp_xyz WITH GRANT ...
80
+ Vault → app: { user: "temp_xyz", pass: "...", ttl: 1h }
81
+ # after 1h Vault revokes the user
82
+ ```
83
+
84
+ A leaked cred is useless after an hour. Requires upfront setup; worth it for sensitive DBs.
85
+
86
+ ## Pre-commit scanning
87
+
88
+ Catch secrets before they hit the repo.
89
+
90
+ - **gitleaks** / **detect-secrets** in pre-commit hook.
91
+ - **trufflehog** / **gitleaks** in CI, scanning history.
92
+ - If something lands by accident: rotate immediately, don't just `git revert`. History is forever.
93
+
94
+ ```yaml
95
+ # pre-commit
96
+ - repo: https://github.com/gitleaks/gitleaks
97
+   rev: v8.18.0
98
+   hooks: [{ id: gitleaks }]
99
+ ```
100
+
101
+ ## Git history cleanup — last resort
102
+
103
+ If a secret is in the history:
104
+ 1. **Rotate the secret immediately.** Treat as compromised.
105
+ 2. Cleaning history (`git filter-repo`, BFG) is partial — anyone with the old clone still has it.
106
+ 3. Force-push is disruptive; coordinate with team.
107
+ 4. Document the incident.
108
+
109
+ Assume: if it was pushed, someone scraped it already.
110
+
111
+ ## Encrypted configs in git (SOPS)
112
+
113
+ For teams that want encrypted secrets checked into the repo (mostly for smaller teams / K8s manifests):
114
+
115
+ ```
116
+ config.yaml: # SOPS-encrypted at rest
117
+ db:
118
+ url: ENC[AES256_GCM,data:abc123...]
119
+ ```
120
+
121
+ SOPS encrypts values with a key from a KMS (AWS KMS, GCP KMS) or a local key format such as age, and decrypts when the file is read at deploy time. Team members with access to that key can decrypt.
122
+
123
+ Rules:
124
+ - Encrypt secret fields only (`encrypted_regex: '^(password|token|url)$'` in `.sops.yaml`).
125
+ - Commit `.sops.yaml` describing the key.
126
+ - Revoke KMS key access when a team member leaves.
127
+
128
+ ## Secret sprawl — audit
129
+
130
+ Once a quarter, audit:
131
+ - How many secrets exist?
132
+ - Who / what has access to each?
133
+ - When was each last rotated?
134
+ - Any that are unused (zero accesses in 90 d)?
135
+
136
+ Cleanup: revoke unused keys. Deleted secrets can't leak.
137
+
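The quarterly questions translate into a small report over an inventory export. The sketch below assumes a list of records with `name`, `last_rotated`, and `last_accessed` fields; that shape is hypothetical, and real data would come from the secret manager's API and audit logs.

```python
from datetime import date


def audit_secrets(inventory, today, max_age_days=90, idle_days=90):
    """Return (names overdue for rotation, names with no access inside the idle window)."""
    overdue, unused = [], []
    for secret in inventory:
        if (today - secret["last_rotated"]).days > max_age_days:
            overdue.append(secret["name"])
        last_access = secret.get("last_accessed")
        if last_access is None or (today - last_access).days > idle_days:
            unused.append(secret["name"])
    return overdue, unused


# Hypothetical inventory for demonstration.
inventory = [
    {"name": "api/db_url", "last_rotated": date(2024, 12, 1), "last_accessed": date(2024, 12, 31)},
    {"name": "api/stripe_key", "last_rotated": date(2024, 6, 1), "last_accessed": date(2024, 12, 30)},
    {"name": "legacy/ftp_pass", "last_rotated": date(2024, 11, 15), "last_accessed": None},
]

overdue, unused = audit_secrets(inventory, today=date(2025, 1, 1))
print(overdue, unused)
```

Anything in `unused` is a candidate for revocation; anything in `overdue` needs a rotation owner.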
138
+ ## Non-secret configs
139
+
140
+ Not everything is a secret. Feature flags, service endpoints, log levels are config — check them into code (env-specific files), don't put them in the secret manager.
141
+
142
+ Mixing bloats the secret manager and trains people to ignore the "secret" marker.
143
+
144
+ ## Logging and secrets
145
+
146
+ - Redact at the logger — don't rely on calls to `log.info(token[:5] + '...')`.
147
+ - Structured loggers (Winston, Pino, Zap, slog) support redaction on field names.
148
+ - Test: grep logs for known secret prefixes. Zero hits.
149
+
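Redaction by field name can be sketched with the standard `logging` module. The sensitive key list and the `fields` attribute (passed via `extra=`) are assumptions made for this example; structured loggers in other ecosystems expose similar hooks under their own names.

```python
import json
import logging

SENSITIVE_KEYS = {"password", "token", "authorization", "api_key"}  # hypothetical list


def redact(value, sensitive=SENSITIVE_KEYS):
    """Recursively replace values whose key name is on the sensitive list."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k.lower() in sensitive else redact(v, sensitive)
                for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v, sensitive) for v in value]
    return value


class RedactingJsonFormatter(logging.Formatter):
    """Emit one JSON object per record, redacting structured fields by name."""

    def format(self, record: logging.LogRecord) -> str:
        fields = getattr(record, "fields", {})  # set via extra={"fields": {...}}
        return json.dumps({"level": record.levelname, "msg": record.getMessage(),
                           **redact(fields)})


handler = logging.StreamHandler()
handler.setFormatter(RedactingJsonFormatter())
logger = logging.getLogger("redaction-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("login", extra={"fields": {"user": "u1", "token": "abc123"}})
```

Because the redaction lives in the formatter, individual call sites cannot forget it; a secret embedded in the free-text message would still slip through, which is what the grep test above is for.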
150
+ ## Cloud IAM vs. shared secrets
151
+
152
+ Prefer IAM / workload identity over shared long-lived API keys.
153
+
154
+ ```
155
+ # ❌
156
+ AWS_ACCESS_KEY_ID=AKIA...
157
+ AWS_SECRET_ACCESS_KEY=...
158
+
159
+ # ✅
160
+ # Pod assumes a role via IRSA/Workload Identity; short-lived creds are issued and refreshed automatically.
161
+ ```
162
+
163
+ Same for DB access (RDS IAM auth), MQ (IAM policies), object storage. If the cloud supports workload identity, use it.
164
+
165
+ ## Anti-patterns
166
+
167
+ | Anti-pattern | Fix |
168
+ |---|---|
169
+ | Secrets in `.env` committed to git | `.gitignore` + secret manager |
170
+ | Same API key used by every service | Scope per service |
171
+ | Rotating by "reminding the team once a year" | Automate or track with a rotation job |
172
+ | Logging tokens for debugging | Redact; use opaque IDs |
173
+ | Long-lived cloud API keys in CI | OIDC + short-lived role assumption |
174
+ | Shared "admin" DB user per team | Per-user creds (even for humans), audit log |
175
+ | Decrypting secrets into an env var and forgetting them there | Let the app read from a file or secret mount |
176
+ | Secret "just this one time" pasted in a ticket | Rotate now; use a secure channel (short-lived link) |
177
+ | No alerts on secret access from unusual IPs / times | Enable Cloud audit logs + alert |
178
+ | Skipping pre-commit scanning "it slows me down" | It saves an incident |