@coralai/sps-cli 0.41.2 → 0.43.0
- package/README.md +34 -3
- package/dist/commands/cardAdd.d.ts +1 -1
- package/dist/commands/cardAdd.d.ts.map +1 -1
- package/dist/commands/cardAdd.js +16 -6
- package/dist/commands/cardAdd.js.map +1 -1
- package/dist/commands/cardDashboard.js +1 -1
- package/dist/commands/cardDashboard.js.map +1 -1
- package/dist/commands/doctor.d.ts +9 -0
- package/dist/commands/doctor.d.ts.map +1 -1
- package/dist/commands/doctor.js +3 -314
- package/dist/commands/doctor.js.map +1 -1
- package/dist/commands/hookCommand.d.ts.map +1 -1
- package/dist/commands/hookCommand.js +6 -7
- package/dist/commands/hookCommand.js.map +1 -1
- package/dist/commands/pmCommand.js +1 -1
- package/dist/commands/pmCommand.js.map +1 -1
- package/dist/commands/projectInit.d.ts.map +1 -1
- package/dist/commands/projectInit.js +60 -37
- package/dist/commands/projectInit.js.map +1 -1
- package/dist/commands/setup.d.ts.map +1 -1
- package/dist/commands/setup.js +3 -30
- package/dist/commands/setup.js.map +1 -1
- package/dist/commands/skillCommand.d.ts +2 -0
- package/dist/commands/skillCommand.d.ts.map +1 -0
- package/dist/commands/skillCommand.js +235 -0
- package/dist/commands/skillCommand.js.map +1 -0
- package/dist/commands/tick.js +1 -1
- package/dist/commands/tick.js.map +1 -1
- package/dist/core/checklist.d.ts +22 -0
- package/dist/core/checklist.d.ts.map +1 -0
- package/dist/core/checklist.js +38 -0
- package/dist/core/checklist.js.map +1 -0
- package/dist/core/checklist.test.d.ts +2 -0
- package/dist/core/checklist.test.d.ts.map +1 -0
- package/dist/core/checklist.test.js +74 -0
- package/dist/core/checklist.test.js.map +1 -0
- package/dist/core/config.d.ts +1 -1
- package/dist/core/config.d.ts.map +1 -1
- package/dist/core/config.js +1 -1
- package/dist/core/config.js.map +1 -1
- package/dist/core/config.test.js +7 -4
- package/dist/core/config.test.js.map +1 -1
- package/dist/core/context.d.ts +1 -1
- package/dist/core/context.d.ts.map +1 -1
- package/dist/core/skillStore.d.ts +46 -0
- package/dist/core/skillStore.d.ts.map +1 -0
- package/dist/core/skillStore.js +197 -0
- package/dist/core/skillStore.js.map +1 -0
- package/dist/core/skillStore.test.d.ts +2 -0
- package/dist/core/skillStore.test.d.ts.map +1 -0
- package/dist/core/skillStore.test.js +190 -0
- package/dist/core/skillStore.test.js.map +1 -0
- package/dist/engines/EventHandler.test.js +3 -3
- package/dist/engines/EventHandler.test.js.map +1 -1
- package/dist/engines/MonitorEngine.js +2 -2
- package/dist/engines/MonitorEngine.js.map +1 -1
- package/dist/engines/SchedulerEngine.js +1 -1
- package/dist/engines/SchedulerEngine.js.map +1 -1
- package/dist/engines/StageEngine.js +3 -3
- package/dist/engines/StageEngine.js.map +1 -1
- package/dist/engines/engine-pipeline-adapter.test.js +2 -2
- package/dist/engines/engine-pipeline-adapter.test.js.map +1 -1
- package/dist/interfaces/TaskBackend.d.ts +3 -1
- package/dist/interfaces/TaskBackend.d.ts.map +1 -1
- package/dist/main.js +19 -17
- package/dist/main.js.map +1 -1
- package/dist/models/types.d.ts +16 -1
- package/dist/models/types.d.ts.map +1 -1
- package/dist/providers/MarkdownTaskBackend.d.ts +2 -1
- package/dist/providers/MarkdownTaskBackend.d.ts.map +1 -1
- package/dist/providers/MarkdownTaskBackend.js +28 -5
- package/dist/providers/MarkdownTaskBackend.js.map +1 -1
- package/dist/providers/registry.d.ts.map +1 -1
- package/dist/providers/registry.js +5 -7
- package/dist/providers/registry.js.map +1 -1
- package/package.json +1 -1
- package/project-template/.claude/hooks/start.sh +44 -0
- package/project-template/.claude/settings.json +1 -1
- package/skills/architecture-decision-records/SKILL.md +207 -0
- package/skills/backend/SKILL.md +62 -0
- package/skills/backend/references/api-design.md +168 -0
- package/skills/backend/references/caching.md +181 -0
- package/skills/backend/references/data-access.md +173 -0
- package/skills/backend/references/layering.md +181 -0
- package/skills/backend/references/observability.md +190 -0
- package/skills/backend/references/resilience.md +201 -0
- package/skills/backend/references/security.md +186 -0
- package/skills/backend-architect/SKILL.md +119 -0
- package/skills/code-reviewer/SKILL.md +143 -0
- package/skills/coding-standards/SKILL.md +60 -0
- package/skills/coding-standards/references/clean-code.md +258 -0
- package/skills/coding-standards/references/code-review.md +192 -0
- package/skills/coding-standards/references/commits-and-prs.md +226 -0
- package/skills/coding-standards/references/error-strategy.md +193 -0
- package/skills/coding-standards/references/naming.md +185 -0
- package/skills/coding-standards/references/tdd.md +171 -0
- package/skills/database/SKILL.md +53 -0
- package/skills/database/references/indexing.md +190 -0
- package/skills/database/references/migrations.md +199 -0
- package/skills/database/references/nosql.md +185 -0
- package/skills/database/references/queries.md +295 -0
- package/skills/database/references/scaling.md +203 -0
- package/skills/database/references/schema.md +191 -0
- package/skills/database-optimizer/SKILL.md +168 -0
- package/skills/debugging-workflow/SKILL.md +244 -0
- package/skills/devops/SKILL.md +55 -0
- package/skills/devops/references/ci-cd.md +204 -0
- package/skills/devops/references/containers.md +272 -0
- package/skills/devops/references/deploy.md +201 -0
- package/skills/devops/references/iac.md +252 -0
- package/skills/devops/references/observability.md +228 -0
- package/skills/devops/references/secrets.md +178 -0
- package/skills/devops-automator/SKILL.md +164 -0
- package/skills/frontend/SKILL.md +52 -0
- package/skills/frontend/references/accessibility.md +222 -0
- package/skills/frontend/references/components.md +206 -0
- package/skills/frontend/references/performance.md +219 -0
- package/skills/frontend/references/routing.md +209 -0
- package/skills/frontend/references/state.md +190 -0
- package/skills/frontend/references/testing.md +216 -0
- package/skills/frontend-developer/SKILL.md +115 -0
- package/skills/git-workflow/SKILL.md +355 -0
- package/skills/golang/SKILL.md +49 -0
- package/skills/golang/references/concurrency.md +284 -0
- package/skills/golang/references/errors.md +241 -0
- package/skills/golang/references/idioms.md +285 -0
- package/skills/golang/references/testing.md +238 -0
- package/skills/java/SKILL.md +50 -0
- package/skills/java/references/concurrency.md +194 -0
- package/skills/java/references/idioms.md +283 -0
- package/skills/java/references/testing.md +228 -0
- package/skills/kotlin/SKILL.md +47 -0
- package/skills/kotlin/references/coroutines.md +240 -0
- package/skills/kotlin/references/idioms.md +268 -0
- package/skills/kotlin/references/testing.md +219 -0
- package/skills/mobile/SKILL.md +50 -0
- package/skills/mobile/references/architecture.md +204 -0
- package/skills/mobile/references/navigation.md +158 -0
- package/skills/mobile/references/performance.md +152 -0
- package/skills/mobile/references/platform.md +166 -0
- package/skills/mobile/references/state-and-data.md +174 -0
- package/skills/python/SKILL.md +51 -0
- package/skills/python/THIRD_PARTY.md +14 -0
- package/skills/python/references/async.md +218 -0
- package/skills/python/references/error-handling.md +254 -0
- package/skills/python/references/idioms.md +279 -0
- package/skills/python/references/packaging.md +233 -0
- package/skills/python/references/testing.md +269 -0
- package/skills/python/references/typing.md +292 -0
- package/skills/qa-tester/SKILL.md +186 -0
- package/skills/rust/SKILL.md +50 -0
- package/skills/rust/references/async.md +224 -0
- package/skills/rust/references/errors.md +240 -0
- package/skills/rust/references/ownership.md +263 -0
- package/skills/rust/references/testing.md +274 -0
- package/skills/rust/references/traits.md +250 -0
- package/skills/security-engineer/SKILL.md +157 -0
- package/skills/swift/SKILL.md +48 -0
- package/skills/swift/references/concurrency.md +280 -0
- package/skills/swift/references/idioms.md +334 -0
- package/skills/swift/references/testing.md +229 -0
- package/skills/typescript/SKILL.md +51 -0
- package/skills/typescript/references/async.md +241 -0
- package/skills/typescript/references/errors.md +208 -0
- package/skills/typescript/references/idioms.md +246 -0
- package/skills/typescript/references/testing.md +225 -0
- package/skills/typescript/references/tooling.md +208 -0
- package/skills/typescript/references/types.md +259 -0
|
@@ -0,0 +1,252 @@
|
|
|
1
|
+
# Infrastructure-as-Code
|
|
2
|
+
|
|
3
|
+
Terraform, Pulumi, CDK, CloudFormation. Patterns, not syntax.
|
|
4
|
+
|
|
5
|
+
## Principles
|
|
6
|
+
|
|
7
|
+
1. **Code, not console.** Every production resource is declared in code. Console use is for exploration only, not for shipping.
|
|
8
|
+
2. **State is sacred.** Losing or corrupting state means reconstructing reality from the cloud. Store it remotely, lock it.
|
|
9
|
+
3. **Plan before apply.** `terraform plan` (or equivalent) is a read-only dry run. Diff it; understand it; only then apply.
|
|
10
|
+
4. **Modules for reuse, not for abstraction.** A module that no one else uses is just folders. Extract when you have two callers, not before.
|
|
11
|
+
5. **Environments as instances of a config.** Same code, different variables. Dev, staging, prod shouldn't fork into three different trees.
|
|
12
|
+
6. **Less is more.** The smallest resource set that meets requirements. Every extra resource is an extra blast radius.
|
|
13
|
+
|
|
14
|
+
## Tool choice
|
|
15
|
+
|
|
16
|
+
| Tool | Strengths |
|
|
17
|
+
|---|---|
|
|
18
|
+
| **Terraform / OpenTofu** | Multi-cloud, huge provider ecosystem, mature |
|
|
19
|
+
| **Pulumi** | Real programming language; complex logic feels natural |
|
|
20
|
+
| **AWS CDK** | AWS-native, TypeScript / Python, feels like SDK |
|
|
21
|
+
| **CloudFormation** | AWS-native, declarative, tight integration |
|
|
22
|
+
| **Ansible** | Imperative config mgmt (VMs), not declarative infra |
|
|
23
|
+
| **Kubernetes manifests / Helm / Kustomize** | K8s-specific |
|
|
24
|
+
| **Crossplane / KRO** | K8s-native for cloud infra |
|
|
25
|
+
|
|
26
|
+
Pick one IaC tool per cloud estate. Using three for one codebase creates state duplication and drift.
|
|
27
|
+
|
|
28
|
+
## State — remote, locked
|
|
29
|
+
|
|
30
|
+
Local state on a laptop is a failure mode waiting to happen. Remote backend + locking is non-negotiable for team work.
|
|
31
|
+
|
|
32
|
+
Terraform remote backends:
|
|
33
|
+
- **S3 + DynamoDB** (AWS): classic, cheap; recent Terraform releases can also lock natively in S3 (`use_lockfile`) without DynamoDB.
|
|
34
|
+
- **Terraform Cloud / HCP Terraform** — managed, VCS integration.
|
|
35
|
+
- **GCS** (GCP): state locking is built into the backend.
|
|
36
|
+
- **Azure Storage** (Azure): locks via blob leases.
|
|
37
|
+
|
|
38
|
+
Enable:
|
|
39
|
+
- Encryption at rest.
|
|
40
|
+
- Versioning (so a corrupt state can be rolled back).
|
|
41
|
+
- Restricted access (only CI + admins).
|
|
42
|
+
|
|
43
|
+
## One state file vs. many
|
|
44
|
+
|
|
45
|
+
Split state when:
|
|
46
|
+
- **Blast radius**: a typo in one part shouldn't risk another (separate "network" from "apps").
|
|
47
|
+
- **Apply time**: a 40-minute plan is painful; split so changes in one area only plan that area.
|
|
48
|
+
- **Permissions**: different teams own different pieces.
|
|
49
|
+
|
|
50
|
+
Typical split:
|
|
51
|
+
```
|
|
52
|
+
networking/ — VPCs, subnets, DNS
|
|
53
|
+
data/ — databases, caches, queues
|
|
54
|
+
platform/ — K8s clusters, service mesh
|
|
55
|
+
apps/service-a/
|
|
56
|
+
apps/service-b/
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
Cross-state references via remote state data sources:
|
|
60
|
+
|
|
61
|
+
```hcl
data "terraform_remote_state" "network" {
  backend = "s3"
  config  = { bucket = "...", key = "networking.tfstate", region = "..." }
}

resource "aws_instance" "app" {
  subnet_id = data.terraform_remote_state.network.outputs.subnet_id
}
```
|
|
71
|
+
|
|
72
|
+
## Modules
|
|
73
|
+
|
|
74
|
+
Reuse is an effect; don't modularize for the sake of it.
|
|
75
|
+
|
|
76
|
+
```
|
|
77
|
+
modules/
|
|
78
|
+
├── s3-bucket/ # simple, focused
|
|
79
|
+
├── rds-postgres/ # multiple call sites; worth the abstraction
|
|
80
|
+
└── vpc/ # canonical
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
Rules:
|
|
84
|
+
- Module inputs: the minimum variables that matter; sensible defaults for the rest.
|
|
85
|
+
- Module outputs: only what callers need.
|
|
86
|
+
- Version modules (git tag / registry version) if shared across repos.
|
|
87
|
+
|
|
88
|
+
Anti-pattern: "god module" with 50 inputs covering every hypothetical case. That's configuration disguised as code.
|
|
89
|
+
|
|
90
|
+
## Workspaces vs. env-specific folders
|
|
91
|
+
|
|
92
|
+
### Workspaces (Terraform)
|
|
93
|
+
|
|
94
|
+
Same code, different state per workspace.
|
|
95
|
+
|
|
96
|
+
```
|
|
97
|
+
terraform workspace new dev
|
|
98
|
+
terraform workspace new prod
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
OK for dev/prod parity when the topology is truly identical.
|
|
102
|
+
|
|
103
|
+
### Folder-per-env (often preferred)
|
|
104
|
+
|
|
105
|
+
```
environments/
├── dev/
│   └── main.tfvars
├── staging/
│   └── main.tfvars
└── prod/
    └── main.tfvars
```
|
|
114
|
+
|
|
115
|
+
Each env has its own state and vars file. Same underlying modules. Explicit, reviewable, allows env-specific overrides.
|
|
116
|
+
|
|
117
|
+
## Review flow
|
|
118
|
+
|
|
119
|
+
Every IaC change is a PR. `terraform plan` output in the PR description or CI comment. Reviewer reads the plan and the code.
|
|
120
|
+
|
|
121
|
+
Automate it:
|
|
122
|
+
|
|
123
|
+
```yaml
- run: terraform init
- run: terraform plan -out=plan.tfplan
- run: terraform show -no-color plan.tfplan > plan.txt
- uses: actions/github-script@v7
  with:
    script: |
      // github-script does not predefine `fs`; require it explicitly
      const fs = require('fs');
      const plan = fs.readFileSync('plan.txt', 'utf8');
      // `...` stands for the owner / repo / issue_number fields, elided here
      await github.rest.issues.createComment({ ..., body: '```\n' + plan + '\n```' });
```
|
|
133
|
+
|
|
134
|
+
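Gate prod apply behind approval. In GitHub Actions, a job that targets an environment waits for the reviewers configured on that environment: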
|
|
135
|
+
|
|
136
|
+
```yaml
environment:
  name: prod
  url: https://...
```
|
|
141
|
+
|
|
142
|
+
## Drift
|
|
143
|
+
|
|
144
|
+
State says one thing; reality says another (someone clicked in the console, an external process modified a resource).
|
|
145
|
+
|
|
146
|
+
- `terraform plan` detects drift.
|
|
147
|
+
- Reconcile: revert console changes or adopt them into code.
|
|
148
|
+
- Policy: disable console write access for prod; force all changes through IaC.
|
|
149
|
+
|
|
150
|
+
Drift ignored becomes a permanent parallel state that diverges further every day.
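A scheduled drift check can be sketched around `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The function names, status labels, and the `workdir` parameter below are illustrative assumptions, not part of any existing tooling.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift_or_pending_changes"}


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a status label."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (hypothetical path) and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Run against the main branch right after a successful apply, an exit code of 2 points at drift rather than at unapplied code changes.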
|
|
151
|
+
|
|
152
|
+
## Secret handling in IaC
|
|
153
|
+
|
|
154
|
+
- Secrets don't live in `.tfvars` committed to git.
|
|
155
|
+
- Pull from secret manager at apply time (`data "aws_secretsmanager_secret_version"`).
|
|
156
|
+
- Output sensitive values with `sensitive = true` so they don't leak in logs.
|
|
157
|
+
- `.tfstate` itself contains sensitive values — protect the backend.
|
|
158
|
+
|
|
159
|
+
## Tagging strategy
|
|
160
|
+
|
|
161
|
+
Every cloud resource gets tags. Makes cost allocation, search, ownership unambiguous.
|
|
162
|
+
|
|
163
|
+
```hcl
# AWS provider: default_tags is a nested block on the provider,
# and its tags are applied to every taggable resource it creates
provider "aws" {
  default_tags {
    tags = {
      environment = var.env
      service     = var.service
      owner_team  = var.team
      managed_by  = "terraform"
      repo        = "github.com/.../infra"
    }
  }
}
```
|
|
172
|
+
|
|
173
|
+
Enforce via policy (SCPs on AWS, Azure Policy, custom `terraform-compliance`).
|
|
174
|
+
|
|
175
|
+
## Avoid `count` on resources likely to change order
|
|
176
|
+
|
|
177
|
+
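Terraform tracks resources by address. `aws_instance.web[0]` is not the same as `aws_instance.web[1]`. With `count`, inserting an item into the middle of the list shifts every later index, so Terraform plans changes or replacements for everything after the insertion point.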
|
|
178
|
+
|
|
179
|
+
Prefer `for_each` over `count` — keyed by a string, stable under insertion.
|
|
180
|
+
|
|
181
|
+
```hcl
# ❌ count: fragile if order changes
resource "aws_instance" "web" {
  count  = length(var.regions)
  region = var.regions[count.index]
}

# ✅ for_each: keyed
resource "aws_instance" "web" {
  for_each = toset(var.regions)
  region   = each.value
}
```
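The address shift can be illustrated outside Terraform with a small simulation. This is only a sketch of the bookkeeping, not of Terraform's planner; the helper names and the region values are made up for the example.

```python
def count_addresses(items: list[str]) -> dict[str, str]:
    """Addresses as `count` would assign them: keyed by list position."""
    return {f"aws_instance.web[{i}]": item for i, item in enumerate(items)}


def for_each_addresses(items: list[str]) -> dict[str, str]:
    """Addresses as `for_each = toset(...)` would assign them: keyed by value."""
    return {f'aws_instance.web["{item}"]': item for item in items}


def changed_addresses(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Addresses that are new, or whose bound value differs, after the edit."""
    return sorted(addr for addr, value in after.items() if before.get(addr) != value)


old = ["us-east-1", "us-west-2"]
new = ["us-east-1", "eu-west-1", "us-west-2"]  # insert in the middle

# count: index 1 now points at a different region and index 2 is new
print(changed_addresses(count_addresses(old), count_addresses(new)))
# for_each: only the inserted key is new
print(changed_addresses(for_each_addresses(old), for_each_addresses(new)))
```

With `count`, two addresses are affected by a single insertion; with `for_each`, only the new key is.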
|
|
194
|
+
|
|
195
|
+
## Lifecycle rules
|
|
196
|
+
|
|
197
|
+
```hcl
lifecycle {
  prevent_destroy       = true # guard prod databases, S3 buckets with data
  create_before_destroy = true # zero-downtime replacement where possible
  ignore_changes        = [tags["last_updated"]] # don't churn on external tag writes
}
```
|
|
204
|
+
|
|
205
|
+
Set `prevent_destroy` on anything that would be irrecoverable to delete. Destroying it then requires a deliberate override: removing the flag in code first.
|
|
206
|
+
|
|
207
|
+
## Blast radius
|
|
208
|
+
|
|
209
|
+
Running `terraform destroy` in the wrong directory has wiped real infrastructure. Mitigations:
|
|
210
|
+
- Separate folders per env with distinct state backends.
|
|
211
|
+
- Prod destroy requires a separate pipeline or a specific role.
|
|
212
|
+
- Critical resources have `prevent_destroy`.
|
|
213
|
+
|
|
214
|
+
## Policy as code
|
|
215
|
+
|
|
216
|
+
Test that the plan adheres to rules before apply.
|
|
217
|
+
|
|
218
|
+
- **OPA / Conftest**: general-purpose policy engine; rules are written in Rego.
|
|
219
|
+
- **Sentinel** (HCP Terraform / Terraform Enterprise): HashiCorp's policy-as-code framework.
|
|
220
|
+
- **Checkov, tfsec, terrascan** — pre-built security rules (public S3 buckets, encryption, etc.).
|
|
221
|
+
|
|
222
|
+
Typical guards:
|
|
223
|
+
- No public S3 buckets.
|
|
224
|
+
- No unencrypted RDS / EBS.
|
|
225
|
+
- No 0.0.0.0/0 ingress on non-frontend services.
|
|
226
|
+
- All resources have required tags.
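As a rough illustration of the last guard, the check below walks the JSON form of a plan (as produced by `terraform show -json plan.tfplan`) and reports resources that would be created or updated without the required tags. It is a minimal sketch, not a replacement for OPA or Checkov: the function name is hypothetical, and it assumes taggable resources expose a `tags` or `tags_all` attribute in the planned values.

```python
def missing_tag_violations(plan: dict, required: set[str]) -> list[str]:
    """Return 'address: missing ...' for planned creates/updates lacking required tags."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # deletes and no-ops are not checked
        after = change.get("after") or {}
        if "tags" not in after and "tags_all" not in after:
            continue  # assumption: resource type is not taggable
        tags = after.get("tags_all") or after.get("tags") or {}
        missing = sorted(required - set(tags))
        if missing:
            violations.append(f"{rc['address']}: missing {', '.join(missing)}")
    return violations
```

In CI this would typically run between `plan` and `apply`, failing the job when the returned list is non-empty.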
|
|
227
|
+
|
|
228
|
+
## Rollback
|
|
229
|
+
|
|
230
|
+
IaC rollback = revert the commit + apply. It works when the infra change is self-contained.
|
|
231
|
+
|
|
232
|
+
It does NOT work for:
|
|
233
|
+
- Data migrations (schema changes, in-place transformations).
|
|
234
|
+
- Resources that were deleted with associated data.
|
|
235
|
+
- Stateful upgrades where the backing engine doesn't support downgrade.
|
|
236
|
+
|
|
237
|
+
For those, plan forward-only fixes.
|
|
238
|
+
|
|
239
|
+
## Anti-patterns
|
|
240
|
+
|
|
241
|
+
| Anti-pattern | Fix |
|
|
242
|
+
|---|---|
|
|
243
|
+
| Local state | Remote + locking |
|
|
244
|
+
| No module versioning for shared modules | Tag + version pin |
|
|
245
|
+
| Env-specific logic via `if env == "prod"` | Env-specific `.tfvars` / folder |
|
|
246
|
+
| Every change runs `apply` without review | PR + plan output |
|
|
247
|
+
| Sensitive outputs without `sensitive = true` | Mark them |
|
|
248
|
+
| Giant 5000-line main.tf | Split by resource group / module |
|
|
249
|
+
| `terraform taint` as a regular workflow | Fix the root cause; taint is a hack (and deprecated in favor of `-replace`) |
|
|
250
|
+
| `null_resource` + `local-exec` for everything | Find a proper provider |
|
|
251
|
+
| Hand-written policies enforced by discipline | Automate with OPA / tfsec |
|
|
252
|
+
| Cloud console changes that "just need to happen" | Update IaC; revert the console |
|
|
@@ -0,0 +1,228 @@
|
|
|
1
|
+
# Observability (Platform)
|
|
2
|
+
|
|
3
|
+
Log / metric / trace pipelines, alerting, on-call, runbooks. For app-level signal definition, see `backend/references/observability.md`; this file covers the platform plumbing.
|
|
4
|
+
|
|
5
|
+
## The stack
|
|
6
|
+
|
|
7
|
+
```
App ──stdout──▶ Collector (fluent-bit, otel-collector, vector)
    ──metric──▶ Prometheus / Cloud Monitoring / Datadog / NewRelic
    ──trace───▶ OpenTelemetry Collector ──▶ Jaeger / Tempo / DD APM

Alerting: Prometheus Alertmanager / Grafana / PagerDuty / OpsGenie
```
|
|
14
|
+
|
|
15
|
+
Pick the right number of tools. Four tools with overlapping coverage is a tax; one primary platform plus one specialist is usually enough.
|
|
16
|
+
|
|
17
|
+
## Logs
|
|
18
|
+
|
|
19
|
+
### Collection
|
|
20
|
+
|
|
21
|
+
- Container logs → stdout/stderr.
|
|
22
|
+
- Daemon on each node reads container logs (`fluent-bit`, `fluentd`, `vector`, cloud-native).
|
|
23
|
+
- Collector forwards to the backend (Elasticsearch, Loki, Datadog Logs, Cloud Logging).
|
|
24
|
+
|
|
25
|
+
Don't write logs to local files inside containers. Lost on pod restart; hard to collect.
|
|
26
|
+
|
|
27
|
+
### Format
|
|
28
|
+
|
|
29
|
+
JSON. Every line a structured event. See `backend/references/observability.md` for field names.
|
|
30
|
+
|
|
31
|
+
### Retention
|
|
32
|
+
|
|
33
|
+
Tier by age:
|
|
34
|
+
- **Hot**: 7–14 days, fast search.
|
|
35
|
+
- **Warm**: 30–90 days, slower but still searchable.
|
|
36
|
+
- **Archive**: 1+ year, S3 / cold storage, restore on demand.
|
|
37
|
+
|
|
38
|
+
Log volume grows with traffic; set retention per env (dev can be 3 days, prod 30). Otherwise, the bill does the planning for you.
|
|
39
|
+
|
|
40
|
+
### Sensitive data
|
|
41
|
+
|
|
42
|
+
Redact at source — the app's logger, not the collector. Once a secret hits the pipeline it's harder to control.
|
|
43
|
+
|
|
44
|
+
Check your logs periodically for leaked PII / tokens. Add automated scanning rules (pattern matching for JWTs, credit card numbers) to the pipeline.
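A pattern-matching rule of this kind might look like the sketch below. The regular expressions and the function names are illustrative assumptions, deliberately simple; a production scanner would cover more token formats. The Luhn checksum is used only to reduce false positives on arbitrary digit runs.

```python
import re

# Three base64url segments separated by dots; JWT headers start with "eyJ".
JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]{5,}\.[A-Za-z0-9_-]{5,}\.[A-Za-z0-9_-]{5,}\b")
# 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_line(line: str) -> list[str]:
    """Return the kinds of suspected secrets / PII found in one log line."""
    findings = []
    if JWT_RE.search(line):
        findings.append("jwt")
    for match in CARD_RE.finditer(line):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            findings.append("card_number")
            break
    return findings
```

A collector-side rule would call `scan_line` on each event and either drop, mask, or flag the line; which of those is right depends on the pipeline.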
|
|
45
|
+
|
|
46
|
+
## Metrics
|
|
47
|
+
|
|
48
|
+
### Collection
|
|
49
|
+
|
|
50
|
+
- **Pull** (Prometheus) — scraper hits app endpoints.
|
|
51
|
+
- **Push** (StatsD, OTLP) — app pushes to a gateway / collector.
|
|
52
|
+
|
|
53
|
+
Pull scales well at moderate cluster sizes and gets fiddly at huge scale. Push is simpler at scale but loses the target-health signal that a failed scrape gives you for free.
|
|
54
|
+
|
|
55
|
+
### Standards
|
|
56
|
+
|
|
57
|
+
OpenTelemetry (OTel) is becoming the de-facto standard for metric + trace instrumentation. Instrument once with OTel SDKs; switch backends by changing the collector config.
|
|
58
|
+
|
|
59
|
+
### Cardinality
|
|
60
|
+
|
|
61
|
+
Every unique combination of label values creates a new time series. High-cardinality labels (user_id, request_id) blow up storage and cost.
|
|
62
|
+
|
|
63
|
+
```
|
|
64
|
+
# ✅ bounded
|
|
65
|
+
http_requests_total{service="api", route="/orders", method="POST", status="200"}
|
|
66
|
+
|
|
67
|
+
# ❌ unbounded
|
|
68
|
+
http_requests_total{service="api", user_id="u_01HX..."}
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
The cloud will silently charge you for cardinality. Watch the count of series.
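The worst-case series count for a metric is the product of the number of distinct values per label, which is why one unbounded label dominates everything else. A quick estimate, with made-up label counts:

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical distinct-value counts per label.
bounded = {"service": 1, "route": 40, "method": 5, "status": 8}
unbounded = {"service": 1, "user_id": 250_000}

print(max_series(bounded))    # 1600
print(max_series(unbounded))  # 250000
```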
|
|
72
|
+
|
|
73
|
+
### Four golden signals (per service)
|
|
74
|
+
|
|
75
|
+
1. **Latency** — how long do requests take (p50/p95/p99)?
|
|
76
|
+
2. **Traffic** — how many requests per second?
|
|
77
|
+
3. **Errors** — rate of failed requests?
|
|
78
|
+
4. **Saturation** — how full is it? (CPU, queue depth, connection pool)
|
|
79
|
+
|
|
80
|
+
Dashboards start here; drill into specifics from there.
|
|
81
|
+
|
|
82
|
+
## Traces
|
|
83
|
+
|
|
84
|
+
OpenTelemetry instrumented endpoints + propagated context.
|
|
85
|
+
|
|
86
|
+
```
|
|
87
|
+
Request ─▶ Service A [span] ─▶ Service B [span] ─▶ DB [span]
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
Each span has timing, tags, events. Together they form the request timeline.
|
|
91
|
+
|
|
92
|
+
### Sampling
|
|
93
|
+
|
|
94
|
+
Head-based (per request, decide at ingress):
|
|
95
|
+
- 1–10% typical.
|
|
96
|
+
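- The decision is made before the outcome is known, so it cannot target errors; keeping 100% of error traces needs tail-based sampling.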
|
|
97
|
+
|
|
98
|
+
Tail-based (sample after seeing the whole trace):
|
|
99
|
+
- Keep slow traces, error traces, unusual patterns.
|
|
100
|
+
- Needs a full collector layer (otel-collector).
|
|
101
|
+
|
|
102
|
+
Tracing overhead is real — don't trace 100% in prod without tail-based sampling.
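The tail-based policy described above (keep errors and slow traces, plus a small baseline of everything else) can be sketched as a single decision function. The `TraceSummary` type, the 500 ms threshold, and the 5% baseline are assumptions for illustration; in practice this logic lives in the collector's sampling configuration rather than in application code.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: TraceSummary, slow_ms: float = 500.0, baseline_rate: float = 0.05) -> bool:
    """Tail-based decision: made after the whole trace has been seen."""
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # Deterministic baseline sample: hash the trace ID into [0, 1) so every
    # collector instance makes the same decision for the same trace.
    digest = hashlib.sha256(trace.trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate
```

Hashing the trace ID instead of calling a random number generator keeps the decision reproducible when a trace is evaluated more than once.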
|
|
103
|
+
|
|
104
|
+
## Dashboards
|
|
105
|
+
|
|
106
|
+
### Structure
|
|
107
|
+
|
|
108
|
+
One dashboard per service, standard layout:
|
|
109
|
+
- Overview: RED metrics (Rate, Errors, Duration).
|
|
110
|
+
- Saturation: CPU, memory, pool utilization.
|
|
111
|
+
- Dependencies: DB, cache, upstream services.
|
|
112
|
+
- Recent deploys marked as vertical annotations.
|
|
113
|
+
|
|
114
|
+
Links to runbook + logs + traces.
|
|
115
|
+
|
|
116
|
+
### Don't build 50 dashboards
|
|
117
|
+
|
|
118
|
+
Most go stale within weeks. Focus on a small set that matters:
|
|
119
|
+
- One per critical service.
|
|
120
|
+
- One per SLO.
|
|
121
|
+
- A few investigative templates ("compare p99 before/after a given deploy").
|
|
122
|
+
|
|
123
|
+
## Alerts
|
|
124
|
+
|
|
125
|
+
### Principles
|
|
126
|
+
|
|
127
|
+
- **Alert on symptoms, not causes.** "Users can't check out" beats "CPU is 80%".
|
|
128
|
+
- **Every alert is actionable** — there's a specific thing the oncall does.
|
|
129
|
+
- **Every alert has a runbook** linked in the alert body.
|
|
130
|
+
- **Every alert has an owner** — the team / service that owns the fix.
|
|
131
|
+
|
|
132
|
+
### Severity levels
|
|
133
|
+
|
|
134
|
+
- **P1 / SEV-1**: page the oncall; revenue / customer-facing impact.
|
|
135
|
+
- **P2 / SEV-2**: notify in team channel; degraded state.
|
|
136
|
+
- **P3 / SEV-3**: track as an issue; investigate next business day.
|
|
137
|
+
|
|
138
|
+
Only P1 should wake someone up. Too many P1s → pager fatigue → missed alerts.
|
|
139
|
+
|
|
140
|
+
### Tuning
|
|
141
|
+
|
|
142
|
+
- Alert fires → wasn't actionable → either tune the threshold or delete it.
|
|
143
|
+
- Alert fires at 3am and auto-resolves at 3:15am with no action → wasn't actionable.
|
|
144
|
+
- Alert with "click dashboard, maybe it's fine" → wasn't actionable.
|
|
145
|
+
|
|
146
|
+
Audit monthly.
|
|
147
|
+
|
|
148
|
+
## On-call
|
|
149
|
+
|
|
150
|
+
### Rotation
|
|
151
|
+
|
|
152
|
+
- Weekly rotation typical; one primary + one secondary.
|
|
153
|
+
- Handoff meeting: what's ongoing, what's worrying.
|
|
154
|
+
- On-call participants must have access: can deploy, rollback, scale.
|
|
155
|
+
|
|
156
|
+
### Triage flow
|
|
157
|
+
|
|
158
|
+
1. **Acknowledge** — clock is ticking on MTTR.
|
|
159
|
+
2. **Stop the bleeding** — rollback, scale up, disable a feature flag. Don't perfect-fix in the moment.
|
|
160
|
+
3. **Gather context** — what changed recently? dashboards, logs, traces.
|
|
161
|
+
4. **Escalate** — bring in the service owner if you're not them.
|
|
162
|
+
5. **Post-incident** — see below.
|
|
163
|
+
|
|
164
|
+
### Post-incident
|
|
165
|
+
|
|
166
|
+
Every P1 gets a postmortem. Blameless.
|
|
167
|
+
|
|
168
|
+
Template:
|
|
169
|
+
- **Summary** — one paragraph.
|
|
170
|
+
- **Impact** — who / how much / how long.
|
|
171
|
+
- **Timeline** — minute-by-minute of detection, response, resolution.
|
|
172
|
+
- **Root cause** — technical + process.
|
|
173
|
+
- **Action items** — specific, owned, dated.
|
|
174
|
+
- **Lessons** — what was surprising.
|
|
175
|
+
|
|
176
|
+
Track action items to completion. Unshipped postmortem actions are how the same incident happens twice.
|
|
177
|
+
|
|
178
|
+
## SLO / error budget
|
|
179
|
+
|
|
180
|
+
Set SLOs (service-level objectives) that match user expectations. Derive error budget.
|
|
181
|
+
|
|
182
|
+
```
|
|
183
|
+
SLO: 99.9% of /orders POST succeed in ≤ 500 ms
|
|
184
|
+
Budget: 0.1% × 30 days ≈ 43 min / month
|
|
185
|
+
```
|
|
186
|
+
|
|
187
|
+
Burn rate:
|
|
188
|
+
- Slow burn — spend budget over weeks (minor quality erosion).
|
|
189
|
+
- Fast burn — exhaust weekly budget in a day (real problem).
|
|
190
|
+
|
|
191
|
+
Alert on burn rate, not just on individual failures. "We're burning budget 10× too fast" is actionable.
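The arithmetic behind the budget and the burn rate is short enough to write down. This sketch treats the SLO as a simple ratio; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window, treating the SLO as time-based."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent (1.0 = on budget)."""
    return observed_error_ratio / (1 - slo)


print(round(error_budget_minutes(0.999), 1))  # 43.2
print(round(burn_rate(0.01, 0.999), 1))       # 10.0
```

A 1% error ratio against a 99.9% SLO burns the budget ten times too fast: at that pace a 30-day budget is gone in about three days.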
|
|
192
|
+
|
|
193
|
+
## Health endpoints
|
|
194
|
+
|
|
195
|
+
```
|
|
196
|
+
/health/live — process alive
|
|
197
|
+
/health/ready — can serve traffic (DB reachable, cache reachable)
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
Container orchestrators use both:
|
|
201
|
+
- Live failing → restart the container.
|
|
202
|
+
- Ready failing → take out of load balancer, leave alive.
|
|
203
|
+
|
|
204
|
+
Never put business logic in health checks. Keep them cheap and boring.
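A minimal sketch of the two endpoints, using only the standard library. The dependency checks are passed in as callables and are hypothetical; real ones would be short-timeout pings to the DB and cache.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable


def health_status(path: str, dependency_checks: list[Callable[[], bool]]) -> int:
    """Return the HTTP status for a health path."""
    if path == "/health/live":
        return 200  # the process answered; nothing else is checked
    if path == "/health/ready":
        return 200 if all(check() for check in dependency_checks) else 503
    return 404


def make_handler(dependency_checks: list[Callable[[], bool]]):
    """Build a request handler class bound to the given readiness checks."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(health_status(self.path, dependency_checks))
            self.end_headers()

    return HealthHandler
```

Serving it takes one line with `http.server.HTTPServer`, for example `HTTPServer(("127.0.0.1", 8080), make_handler(checks)).serve_forever()`. Liveness deliberately ignores dependencies, so a database outage removes the pod from the load balancer without triggering a restart loop.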
|
|
205
|
+
|
|
206
|
+
## Cost visibility
|
|
207
|
+
|
|
208
|
+
Observability is expensive at scale. Monitor the bill:
|
|
209
|
+
- Logs ingested per service per day.
|
|
210
|
+
- Metric series count.
|
|
211
|
+
- Trace spans per second.
|
|
212
|
+
|
|
213
|
+
When one service exports 10× what others do — investigate. Usually debug logging left on, or a metric with a user-id label.
|
|
214
|
+
|
|
215
|
+
## Anti-patterns
|
|
216
|
+
|
|
217
|
+
| Anti-pattern | Fix |
|
|
218
|
+
|---|---|
|
|
219
|
+
| Logs written to files in containers | stdout |
|
|
220
|
+
| Alerts on infrastructure without symptom mapping | Alert on service impact |
|
|
221
|
+
| Dashboards nobody reads | Delete unused; focus on core ones |
|
|
222
|
+
| Runbook-less alerts | Every alert links a runbook |
|
|
223
|
+
| Tracing 100% in prod without sampling | Head or tail sampling |
|
|
224
|
+
| Metric labels on request IDs | Use logs/traces for high-cardinality |
|
|
225
|
+
| "I'll set up monitoring later" | Observability before launch |
|
|
226
|
+
| Alert channel drowning in noise | Audit and tune |
|
|
227
|
+
| No oncall → whoever's free panics | Formal rotation, documented escalation |
|
|
228
|
+
| Postmortems as blame sessions | Blameless format; focus on systems and actions |
|
|
@@ -0,0 +1,178 @@
|
|
|
1
|
+
# Secrets
|
|
2
|
+
|
|
3
|
+
Storage, rotation, access, scanning. The part that goes wrong quietly.
|
|
4
|
+
|
|
5
|
+
## The rules
|
|
6
|
+
|
|
7
|
+
1. **Never in source control.** Ever. `.env` files stay in `.gitignore`; secrets come from a manager.
|
|
8
|
+
2. **One secret, one owner.** Scoped per service, per env. The "shared creds" bucket is a leak waiting to happen.
|
|
9
|
+
3. **Rotate on schedule AND on compromise.** Short lifetime = small window of exposure.
|
|
10
|
+
4. **Least privilege.** A service key can read its own DB, not every DB.
|
|
11
|
+
5. **Audit every read.** You should be able to tell who accessed prod secret X yesterday.
|
|
12
|
+
6. **Encrypt at rest AND in transit.** Default in modern secret managers; verify.
|
|
13
|
+
|
|
14
|
+
## Where to store them
|
|
15
|
+
|
|
16
|
+
| Tool | For |
|
|
17
|
+
|---|---|
|
|
18
|
+
| **AWS Secrets Manager / Parameter Store** | AWS; IAM-integrated |
|
|
19
|
+
| **GCP Secret Manager** | GCP; IAM-integrated |
|
|
20
|
+
| **Azure Key Vault** | Azure; RBAC-integrated |
|
|
21
|
+
| **HashiCorp Vault** | Cloud-agnostic; dynamic secrets, rich ACLs |
|
|
22
|
+
| **1Password / Bitwarden** | Human-held secrets, small teams |
|
|
23
|
+
| **Kubernetes Secrets** | K8s-native, lightweight; encrypt etcd |
|
|
24
|
+
| **Sealed Secrets / SOPS** | Encrypt secrets that live in git (Sealed Secrets: in-cluster controller decrypts; SOPS: decrypted with a KMS / age / PGP key at deploy) |
|
|
25
|
+
|
|
26
|
+
Pick one canonical store per cloud estate. Sprawl breeds drift — the same secret in three places rotates in one.
|
|
27
|
+
|
|
28
|
+
## Access patterns
|
|
29
|
+
|
|
30
|
+
### At deploy
|
|
31
|
+
|
|
32
|
+
- CI has scoped permission to fetch the secrets its job needs (e.g., `DEPLOY_ROLE_PROD`).
|
|
33
|
+
- Terraform pulls via `data` sources at apply (not baked into `.tfvars`).
|
|
34
|
+
- Kubernetes: use the cloud provider's secret CSI driver or External Secrets Operator.
|
|
35
|
+
|
|
36
|
+
### At runtime
|
|
37
|
+
|
|
38
|
+
- Container pulls secrets from the manager on startup via the platform's workload identity (IRSA on EKS, Workload Identity on GKE, managed identities / Workload ID on AKS).
|
|
39
|
+
- Or mount as files; the app reads from `/secrets/db_url`.
|
|
40
|
+
- Never bake into the image.
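
As a minimal sketch of the file-mount pattern above: the app reads the secret from a path at startup instead of from an image layer or a hard-coded value. The `read_secret` helper and the `demo` function are hypothetical names for illustration; the `/secrets` base directory follows the path in the bullet above, and the temp directory only stands in for the platform's mount.

```python
import tempfile
from pathlib import Path


def read_secret(name: str, base_dir: str = "/secrets") -> str:
    """Read one secret from a mounted file and strip the trailing newline."""
    path = Path(base_dir) / name
    return path.read_text(encoding="utf-8").strip()


def demo() -> str:
    # Stand-in for the platform's secret mount: a temp dir with one file.
    with tempfile.TemporaryDirectory() as mount:
        secret_file = Path(mount) / "db_url"
        secret_file.write_text("postgres://db.internal/app\n", encoding="utf-8")
        return read_secret("db_url", base_dir=mount)
```

Reading at call time rather than caching at import means a remounted (rotated) file is picked up the next time the app asks for it.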
|
|
41
|
+
|
|
42
|
+
### Locally
|
|
43
|
+
|
|
44
|
+
- Developers authenticate to the secret manager (SSO + session).
|
|
45
|
+
- CLI pulls secrets on demand: `op run --env-file .env.tpl -- npm start`.
|
|
46
|
+
- Don't ship shared `.env.dev` files on Slack.
|
|
47
|
+
|
|
48
|
+
## Rotation
|
|
49
|
+
|
|
50
|
+
For every secret, know:
|
|
51
|
+
- Who rotates (automated job, human?).
|
|
52
|
+
- How often (14 d? 90 d?).
|
|
53
|
+
- How do consumers pick up the new value without downtime?
|
|
54
|
+
|
|
55
|
+
Rotation patterns:
|
|
56
|
+
|
|
57
|
+
### Dual-secret window
|
|
58
|
+
|
|
59
|
+
1. Create secret v2 alongside v1.
|
|
60
|
+
2. Consumers accept both.
|
|
61
|
+
3. Update producers to use v2.
|
|
62
|
+
4. Disable v1 after grace period.
|
|
63
|
+
|
|
64
|
+
Zero downtime if all consumers support reading both.
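
A rough sketch of step 2, assuming the secret is an HMAC signing key: during the grace period the consumer checks a signature against every key it currently accepts. The `sign` and `verify` helpers are illustrative names, not part of any tool mentioned here.

```python
import hashlib
import hmac
from typing import Sequence


def sign(key: bytes, message: bytes) -> str:
    """Producer side: sign a message with one key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, signature: str, accepted_keys: Sequence[bytes]) -> bool:
    """Consumer side: accept if any currently valid key produced the signature."""
    return any(
        hmac.compare_digest(sign(key, message), signature)
        for key in accepted_keys
    )
```

Once producers have moved to v2 and the grace period ends, dropping v1 from `accepted_keys` is what disables it.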
|
|
65
|
+
|
|
66
|
+
### Break-glass
|
|
67
|
+
|
|
68
|
+
Some secrets (signing keys, root creds) rarely rotate. Document:
|
|
69
|
+
- Access procedure (break-glass requires 2-person approval).
|
|
70
|
+
- Rotation playbook.
|
|
71
|
+
- Who to notify.
|
|
72
|
+
|
|
73
|
+
## Dynamic secrets
|
|
74
|
+
|
|
75
|
+
The gold standard. Vault generates short-lived DB credentials on request; they expire in minutes / hours.
|
|
76
|
+
|
|
77
|
+
```
|
|
78
|
+
app → Vault: "give me DB creds for service X"
|
|
79
|
+
Vault → DB: CREATE USER temp_xyz WITH GRANT ...
|
|
80
|
+
Vault → app: { user: "temp_xyz", pass: "...", ttl: 1h }
|
|
81
|
+
# after 1h Vault revokes the user
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
A leaked cred is useless after an hour. Requires upfront setup; worth it for sensitive DBs.
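
On the application side, short-lived credentials mean the app has to re-request them before they expire. The following is a sketch under assumptions: `fetch` is a hypothetical callable standing in for the request to Vault, and `Lease` and `LeaseCache` are illustrative names rather than a Vault client API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    user: str
    password: str
    expires_at: float  # same time base as the clock passed to LeaseCache


class LeaseCache:
    """Hand out the current lease; fetch a new one shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Lease],
        clock: Callable[[], float] = time.monotonic,
        margin: float = 60.0,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._margin = margin  # renew this many seconds before expiry
        self._lease: Optional[Lease] = None

    def get(self) -> Lease:
        now = self._clock()
        if self._lease is None or now >= self._lease.expires_at - self._margin:
            self._lease = self._fetch()
        return self._lease
```

The renewal margin keeps a request that starts just before expiry from being handed credentials that are revoked mid-query.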
|
|
85
|
+
|
|
86
|
+
## Pre-commit scanning
|
|
87
|
+
|
|
88
|
+
Catch secrets before they hit the repo.
|
|
89
|
+
|
|
90
|
+
- **gitleaks** / **detect-secrets** in pre-commit hook.
|
|
91
|
+
- **trufflehog** / **gitleaks** in CI, scanning history.
|
|
92
|
+
- If something lands by accident: rotate immediately, don't just `git revert`. History is forever.
|
|
93
|
+
|
|
94
|
+
```yaml
|
|
95
|
+
# .pre-commit-config.yaml, entry under `repos:`
|
|
96
|
+
- repo: https://github.com/gitleaks/gitleaks
|
|
97
|
+
  rev: v8.18.0
|
|
98
|
+
  hooks: [{ id: gitleaks }]
|
|
99
|
+
```
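
To illustrate what such scanners look for, here is a deliberately tiny pattern-based check. It is a sketch, not a substitute for gitleaks or trufflehog: the two rules and the `scan_text` helper are illustrative assumptions, and real tools ship far larger rule sets plus entropy checks.

```python
import re
from typing import List, Tuple

# Two illustrative rules; real scanners carry hundreds.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((line_no, rule_name))
    return findings
```

A hook built on this idea would exit non-zero whenever `scan_text` returns findings for the staged content, which blocks the commit.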
|
|
100
|
+
|
|
101
|
+
## Git history cleanup — last resort
|
|
102
|
+
|
|
103
|
+
If a secret is in the history:
|
|
104
|
+
1. **Rotate the secret immediately.** Treat as compromised.
|
|
105
|
+
2. Cleaning history (`git filter-repo`, BFG) is partial — anyone with the old clone still has it.
|
|
106
|
+
3. Force-push is disruptive; coordinate with team.
|
|
107
|
+
4. Document the incident.
|
|
108
|
+
|
|
109
|
+
Assume: if it was pushed, someone scraped it already.
|
|
110
|
+
|
|
111
|
+
## Encrypted configs in git (SOPS)
|
|
112
|
+
|
|
113
|
+
For teams that want encrypted secrets checked into the repo (mostly smaller teams or K8s manifests):
|
|
114
|
+
|
|
115
|
+
```
|
|
116
|
+
config.yaml:   # SOPS-encrypted at rest
|
|
117
|
+
  db:
|
|
118
|
+
    url: ENC[AES256_GCM,data:abc123...]
|
|
119
|
+
```
|
|
120
|
+
|
|
121
|
+
SOPS encrypts values with a key from a KMS (AWS KMS, GCP KMS) or an age key, and decrypts on demand, typically at deploy time. Anyone with access to that key can decrypt.
|
|
122
|
+
|
|
123
|
+
Rules:
|
|
124
|
+
- Encrypt secret fields only (`encrypted_regex: '^(password|token|url)$'` in a `.sops.yaml` creation rule).
|
|
125
|
+
- Commit `.sops.yaml` describing the key.
|
|
126
|
+
- Revoke KMS key access when a team member leaves.
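
To see which fields a pattern like the one in the first rule would select, the sketch below walks a config and lists the key paths whose names match. It approximates the matching only and is not SOPS's implementation; the `keys_to_encrypt` helper is hypothetical, and treating a matching key as covering everything beneath it is an assumption about SOPS's behavior.

```python
import re
from typing import Any, Dict, List

# Same pattern as in the rule above; the anchors require an exact key name.
ENCRYPTED_REGEX = re.compile(r"^(password|token|url)$")


def keys_to_encrypt(config: Dict[str, Any], prefix: str = "") -> List[str]:
    """List dotted paths of keys whose name matches the regex."""
    matched = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if ENCRYPTED_REGEX.search(key):
            matched.append(path)  # assumed to cover the whole subtree
        elif isinstance(value, dict):
            matched.extend(keys_to_encrypt(value, path))
    return matched
```

Because of the `^...$` anchors, a key named `api_token` would stay in plaintext; checking the pattern against real key names before committing avoids that kind of surprise.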
|
|
127
|
+
|
|
128
|
+
## Secret sprawl — audit
|
|
129
|
+
|
|
130
|
+
Once a quarter, audit:
|
|
131
|
+
- How many secrets exist?
|
|
132
|
+
- Who / what has access to each?
|
|
133
|
+
- When was each last rotated?
|
|
134
|
+
- Any that are unused (zero accesses in 90 d)?
|
|
135
|
+
|
|
136
|
+
Cleanup: revoke unused keys. Deleted secrets can't leak.
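
The quarterly questions can be turned into a small report, assuming an inventory export with last-access and last-rotation timestamps is available. `SecretRecord` and `audit` are hypothetical names, and the 90-day thresholds are only the example values from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class SecretRecord:
    name: str
    last_accessed: Optional[datetime]  # None means never accessed
    last_rotated: datetime


def audit(
    records: List[SecretRecord],
    now: datetime,
    unused_after: timedelta = timedelta(days=90),
    rotate_every: timedelta = timedelta(days=90),
) -> Dict[str, List[str]]:
    """Flag secrets with no recent access and secrets overdue for rotation."""
    unused = [
        r.name
        for r in records
        if r.last_accessed is None or now - r.last_accessed > unused_after
    ]
    stale = [r.name for r in records if now - r.last_rotated > rotate_every]
    return {"unused": unused, "stale": stale}
```

Entries under `unused` are candidates for revocation; entries under `stale` need a rotation owner.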
|
|
137
|
+
|
|
138
|
+
## Non-secret configs
|
|
139
|
+
|
|
140
|
+
Not everything is a secret. Feature flags, service endpoints, log levels are config — check them into code (env-specific files), don't put them in the secret manager.
|
|
141
|
+
|
|
142
|
+
Mixing bloats the secret manager and trains people to ignore the "secret" marker.
|
|
143
|
+
|
|
144
|
+
## Logging and secrets
|
|
145
|
+
|
|
146
|
+
- Redact at the logger; don't rely on every call site truncating by hand, as in `log.info(token[:5] + '...')`.
|
|
147
|
+
- Structured loggers (Winston, Pino, Zap, slog) support redaction on field names.
|
|
148
|
+
- Test: grep logs for known secret prefixes. Zero hits.
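
Redaction at the logger can be sketched with Python's standard `logging` module: a filter attached to the handler masks sensitive fields by name before anything is formatted. The field list, the `RedactFilter` class and the `demo` function are illustrative assumptions; the loggers named above expose their own redaction options.

```python
import io
import logging

# Field names to mask; an assumption for this sketch.
SENSITIVE_FIELDS = {"token", "password", "api_key"}


class RedactFilter(logging.Filter):
    """Replace sensitive fields on the record before it is formatted."""

    def filter(self, record: logging.LogRecord) -> bool:
        for field in SENSITIVE_FIELDS:
            if hasattr(record, field):
                setattr(record, field, "[REDACTED]")
        return True  # keep the record, only with masked fields


def demo() -> str:
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s token=%(token)s"))
    handler.addFilter(RedactFilter())

    logger = logging.getLogger("redact-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers = [handler]

    # The call site passes the raw value; the filter masks it centrally.
    logger.info("login ok", extra={"token": "sk_live_abc123"})
    return stream.getvalue()
```

This only covers values passed as named fields; a secret interpolated directly into the message string would still need the grep test above to catch it.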
|
|
149
|
+
|
|
150
|
+
## Cloud IAM vs. shared secrets
|
|
151
|
+
|
|
152
|
+
Prefer IAM / workload identity over shared long-lived API keys.
|
|
153
|
+
|
|
154
|
+
```
|
|
155
|
+
# ❌
|
|
156
|
+
AWS_ACCESS_KEY_ID=AKIA...
|
|
157
|
+
AWS_SECRET_ACCESS_KEY=...
|
|
158
|
+
|
|
159
|
+
# ✅
|
|
160
|
+
# Pod assumes a role via IRSA/Workload Identity; short-lived creds are minted and refreshed automatically.
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
Same for DB access (RDS IAM auth), MQ (IAM policies), object storage. If the cloud supports workload identity, use it.
|
|
164
|
+
|
|
165
|
+
## Anti-patterns
|
|
166
|
+
|
|
167
|
+
| Anti-pattern | Fix |
|
|
168
|
+
|---|---|
|
|
169
|
+
| Secrets in `.env` committed to git | `.gitignore` + secret manager |
|
|
170
|
+
| Same API key used by every service | Scope per service |
|
|
171
|
+
| Rotating by "reminding the team once a year" | Automate or track with a rotation job |
|
|
172
|
+
| Logging tokens for debugging | Redact; use opaque IDs |
|
|
173
|
+
| Long-lived cloud API keys in CI | OIDC + short-lived role assumption |
|
|
174
|
+
| Shared "admin" DB user per team | Per-user creds (even for humans), audit log |
|
|
175
|
+
| Decrypting secrets into an env var and forgetting they are there | Let the app read from a file or secret mount |
|
|
176
|
+
| Secret "just this one time" pasted in a ticket | Rotate now; use a secure channel (short-lived link) |
|
|
177
|
+
| No alerts on secret access from unusual IPs / times | Enable cloud audit logs and alert on anomalies |
|
|
178
|
+
| Skipping pre-commit scanning because "it slows me down" | It saves an incident |
|