@coralai/sps-cli 0.42.0 → 0.44.0
- package/README.md +59 -4
- package/dist/commands/consoleCommand.d.ts +2 -0
- package/dist/commands/consoleCommand.d.ts.map +1 -0
- package/dist/commands/consoleCommand.js +129 -0
- package/dist/commands/consoleCommand.js.map +1 -0
- package/dist/commands/projectInit.d.ts.map +1 -1
- package/dist/commands/projectInit.js +40 -53
- package/dist/commands/projectInit.js.map +1 -1
- package/dist/commands/setup.d.ts.map +1 -1
- package/dist/commands/setup.js +14 -2
- package/dist/commands/setup.js.map +1 -1
- package/dist/commands/skillCommand.d.ts +2 -0
- package/dist/commands/skillCommand.d.ts.map +1 -0
- package/dist/commands/skillCommand.js +235 -0
- package/dist/commands/skillCommand.js.map +1 -0
- package/dist/console-assets/assets/index-Bhd2f9AP.js +125 -0
- package/dist/console-assets/assets/index-bsAN2a12.css +1 -0
- package/dist/console-assets/index.html +16 -0
- package/dist/console-server/index.d.ts +29 -0
- package/dist/console-server/index.d.ts.map +1 -0
- package/dist/console-server/index.js +145 -0
- package/dist/console-server/index.js.map +1 -0
- package/dist/console-server/lib/lockFile.d.ts +17 -0
- package/dist/console-server/lib/lockFile.d.ts.map +1 -0
- package/dist/console-server/lib/lockFile.js +61 -0
- package/dist/console-server/lib/lockFile.js.map +1 -0
- package/dist/console-server/lib/portPicker.d.ts +3 -0
- package/dist/console-server/lib/portPicker.d.ts.map +1 -0
- package/dist/console-server/lib/portPicker.js +25 -0
- package/dist/console-server/lib/portPicker.js.map +1 -0
- package/dist/console-server/routes/projects.d.ts +11 -0
- package/dist/console-server/routes/projects.d.ts.map +1 -0
- package/dist/console-server/routes/projects.js +149 -0
- package/dist/console-server/routes/projects.js.map +1 -0
- package/dist/console-server/routes/system.d.ts +7 -0
- package/dist/console-server/routes/system.d.ts.map +1 -0
- package/dist/console-server/routes/system.js +19 -0
- package/dist/console-server/routes/system.js.map +1 -0
- package/dist/console-server/sse/eventBus.d.ts +25 -0
- package/dist/console-server/sse/eventBus.d.ts.map +1 -0
- package/dist/console-server/sse/eventBus.js +32 -0
- package/dist/console-server/sse/eventBus.js.map +1 -0
- package/dist/console-server/watchers/cardWatcher.d.ts +9 -0
- package/dist/console-server/watchers/cardWatcher.d.ts.map +1 -0
- package/dist/console-server/watchers/cardWatcher.js +42 -0
- package/dist/console-server/watchers/cardWatcher.js.map +1 -0
- package/dist/core/skillStore.d.ts +46 -0
- package/dist/core/skillStore.d.ts.map +1 -0
- package/dist/core/skillStore.js +210 -0
- package/dist/core/skillStore.js.map +1 -0
- package/dist/core/skillStore.test.d.ts +2 -0
- package/dist/core/skillStore.test.d.ts.map +1 -0
- package/dist/core/skillStore.test.js +203 -0
- package/dist/core/skillStore.test.js.map +1 -0
- package/dist/main.js +27 -17
- package/dist/main.js.map +1 -1
- package/package.json +8 -2
- package/skills/architecture-decision-records/SKILL.md +207 -0
- package/skills/backend/SKILL.md +62 -0
- package/skills/backend/references/api-design.md +168 -0
- package/skills/backend/references/caching.md +181 -0
- package/skills/backend/references/data-access.md +173 -0
- package/skills/backend/references/layering.md +181 -0
- package/skills/backend/references/observability.md +190 -0
- package/skills/backend/references/resilience.md +201 -0
- package/skills/backend/references/security.md +186 -0
- package/skills/backend-architect/SKILL.md +119 -0
- package/skills/code-reviewer/SKILL.md +143 -0
- package/skills/coding-standards/SKILL.md +60 -0
- package/skills/coding-standards/references/clean-code.md +258 -0
- package/skills/coding-standards/references/code-review.md +192 -0
- package/skills/coding-standards/references/commits-and-prs.md +226 -0
- package/skills/coding-standards/references/error-strategy.md +193 -0
- package/skills/coding-standards/references/naming.md +185 -0
- package/skills/coding-standards/references/tdd.md +171 -0
- package/skills/database/SKILL.md +53 -0
- package/skills/database/references/indexing.md +190 -0
- package/skills/database/references/migrations.md +199 -0
- package/skills/database/references/nosql.md +185 -0
- package/skills/database/references/queries.md +295 -0
- package/skills/database/references/scaling.md +203 -0
- package/skills/database/references/schema.md +191 -0
- package/skills/database-optimizer/SKILL.md +168 -0
- package/skills/debugging-workflow/SKILL.md +244 -0
- package/skills/devops/SKILL.md +55 -0
- package/skills/devops/references/ci-cd.md +204 -0
- package/skills/devops/references/containers.md +272 -0
- package/skills/devops/references/deploy.md +201 -0
- package/skills/devops/references/iac.md +252 -0
- package/skills/devops/references/observability.md +228 -0
- package/skills/devops/references/secrets.md +178 -0
- package/skills/devops-automator/SKILL.md +164 -0
- package/skills/frontend/SKILL.md +52 -0
- package/skills/frontend/references/accessibility.md +222 -0
- package/skills/frontend/references/components.md +206 -0
- package/skills/frontend/references/performance.md +219 -0
- package/skills/frontend/references/routing.md +209 -0
- package/skills/frontend/references/state.md +190 -0
- package/skills/frontend/references/testing.md +216 -0
- package/skills/frontend-developer/SKILL.md +115 -0
- package/skills/git-workflow/SKILL.md +355 -0
- package/skills/golang/SKILL.md +49 -0
- package/skills/golang/references/concurrency.md +284 -0
- package/skills/golang/references/errors.md +241 -0
- package/skills/golang/references/idioms.md +285 -0
- package/skills/golang/references/testing.md +238 -0
- package/skills/java/SKILL.md +50 -0
- package/skills/java/references/concurrency.md +194 -0
- package/skills/java/references/idioms.md +283 -0
- package/skills/java/references/testing.md +228 -0
- package/skills/kotlin/SKILL.md +47 -0
- package/skills/kotlin/references/coroutines.md +240 -0
- package/skills/kotlin/references/idioms.md +268 -0
- package/skills/kotlin/references/testing.md +219 -0
- package/skills/mobile/SKILL.md +50 -0
- package/skills/mobile/references/architecture.md +204 -0
- package/skills/mobile/references/navigation.md +158 -0
- package/skills/mobile/references/performance.md +152 -0
- package/skills/mobile/references/platform.md +166 -0
- package/skills/mobile/references/state-and-data.md +174 -0
- package/skills/python/SKILL.md +51 -0
- package/skills/python/THIRD_PARTY.md +14 -0
- package/skills/python/references/async.md +218 -0
- package/skills/python/references/error-handling.md +254 -0
- package/skills/python/references/idioms.md +279 -0
- package/skills/python/references/packaging.md +233 -0
- package/skills/python/references/testing.md +269 -0
- package/skills/python/references/typing.md +292 -0
- package/skills/qa-tester/SKILL.md +186 -0
- package/skills/rust/SKILL.md +50 -0
- package/skills/rust/references/async.md +224 -0
- package/skills/rust/references/errors.md +240 -0
- package/skills/rust/references/ownership.md +263 -0
- package/skills/rust/references/testing.md +274 -0
- package/skills/rust/references/traits.md +250 -0
- package/skills/security-engineer/SKILL.md +157 -0
- package/skills/swift/SKILL.md +48 -0
- package/skills/swift/references/concurrency.md +280 -0
- package/skills/swift/references/idioms.md +334 -0
- package/skills/swift/references/testing.md +229 -0
- package/skills/typescript/SKILL.md +51 -0
- package/skills/typescript/references/async.md +241 -0
- package/skills/typescript/references/errors.md +208 -0
- package/skills/typescript/references/idioms.md +246 -0
- package/skills/typescript/references/testing.md +225 -0
- package/skills/typescript/references/tooling.md +208 -0
- package/skills/typescript/references/types.md +259 -0
@@ -0,0 +1,190 @@
# Observability

Logs, metrics, traces, health. A request you can't trace is a bug you can't fix.

## The three pillars

| Signal | Answers | Cost | Cardinality |
|---|---|---|---|
| **Logs** | "What happened in this request?" | High (per-event) | Unlimited |
| **Metrics** | "How much, how often, how fast, across the fleet?" | Low (aggregated) | Bounded (labels explode) |
| **Traces** | "Where did time go in this distributed request?" | Medium (sampled) | Unlimited per trace |

Pick the right signal for the question. Metrics for dashboards, logs for forensics, traces for latency breakdowns.

## Structured logs — JSON, not prose

Human-readable strings are unqueryable. Every log line is a JSON object with a stable schema.

```json
{
  "ts": "2026-04-20T10:23:45.123Z",
  "level": "info",
  "service": "orders",
  "env": "prod",
  "request_id": "req_01HX...",
  "trace_id": "0af7651916cd43dd...",
  "user_id": "u_01HX...",
  "msg": "order created",
  "order_id": "ord_01HX...",
  "amount_cents": 2599,
  "duration_ms": 87
}
```

Rules:
- Always include `ts`, `level`, `service`, `env`.
- Always include a request/trace id so you can stitch a request together across services.
- Message is a short constant string — fixed values in `msg`, varying values in fields. `"order created"` not `"order ord_01HX created for $25.99"`.
- Never log secrets, tokens, passwords, full PII. Redact at the logger, not the call site.
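
The rules above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a prescribed implementation: the `JsonFormatter` class, the `build_logger` helper, the `fields` convention, and the `REDACTED_KEYS` set are all hypothetical names chosen for the example.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical set of field names treated as sensitive; adjust to your schema.
REDACTED_KEYS = {"password", "token", "authorization", "secret"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a stable set of base fields."""

    def __init__(self, service: str, env: str) -> None:
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "env": self.env,
            "msg": record.getMessage(),
        }
        # Varying values arrive via extra={"fields": {...}} and are redacted
        # here, at the logger, so call sites cannot forget to do it.
        for key, value in getattr(record, "fields", {}).items():
            event[key] = "[REDACTED]" if key.lower() in REDACTED_KEYS else value
        return json.dumps(event)


def build_logger(service: str, env: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service, env))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call site would then look like `build_logger("orders", "prod").info("order created", extra={"fields": {"order_id": "ord_1"}})`: the message stays constant and the varying values travel as fields.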

## Log levels — use them honestly

| Level | Means | Typical rate |
|---|---|---|
| ERROR | Something broke; a human should look | Low |
| WARN | Unexpected, but handled (retry succeeded, fallback used) | Low |
| INFO | State changes worth knowing at normal volume | Medium |
| DEBUG | Details useful while investigating; off in prod | High (when on) |

Abused levels poison the signal. If everything is INFO, nothing is INFO.

## Correlation IDs

Every request gets a unique id at the edge; it propagates through every log line and outbound call.

```
incoming request → generate request_id (or accept from X-Request-ID)
                 → bind to logger context
                 → forward on outbound calls (X-Request-ID, traceparent)
```

Distributed tracing (OpenTelemetry) gives you `trace_id` + `span_id` for free. Log both when you have them.

## Metrics — RED + USE

Two checklists that cover almost everything.

### RED (per request-driven service)

- **R**ate — requests per second
- **E**rrors — failing requests per second (or error rate)
- **D**uration — latency distribution (p50 / p95 / p99)

### USE (per resource)

- **U**tilization — how busy is it? (CPU%, thread pool in use / max)
- **S**aturation — how much work is queued? (request queue depth)
- **E**rrors — how many operations failed?

Track these for every service and every critical dependency.

## Latency — measure distributions, not averages

Averages hide the worst cases. P95/P99 are where your users actually feel slowness.

```
# ✅
http_request_duration_seconds{route="/orders", method="POST"}
  → histogram with buckets (0.01, 0.05, 0.1, 0.5, 1, 5)
  → alert on p99 > 1s for 5 min

# ❌
avg_response_time = sum(durations) / count(durations)
  → a 10 s outlier buried in 999 fast ones looks fine
```
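
To make the gap between an average and a tail percentile concrete, here is a small sketch on synthetic data. The 980/20 split and the `percentile` helper (a nearest-rank definition) are assumptions for illustration; a real metrics backend estimates percentiles from histogram buckets instead.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Synthetic latencies in seconds: 980 fast requests and 20 very slow ones.
durations = [0.05] * 980 + [10.0] * 20

mean = sum(durations) / len(durations)
p50 = percentile(durations, 50)
p99 = percentile(durations, 99)
print(f"mean={mean:.3f}s p50={p50}s p99={p99}s")
```

The mean comes out near 0.25 s while p99 is 10 s, so 2% of users wait forty times longer than the average suggests. Note that a single slow request in 1,000 would not move p99 either; catching that needs p99.9 or the maximum.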

## Labels — finite cardinality

Every unique label combination creates a new metric series. High-cardinality labels (user id, request id, email) will blow up storage and cost.

| Label | OK? |
|---|---|
| route, method, status_code | Yes (small set) |
| region, pod_name, env | Yes |
| user_id, request_id, email, SKU | NO — use logs/traces for these |

## Tracing

A trace is a tree of spans representing one request's path through services. Each span has: operation name, start/end time, attributes, parent span.

Auto-instrument with OpenTelemetry. Add manual spans around:
- External HTTP calls (service, endpoint, status)
- DB queries (operation, table; never the full raw query — cardinality)
- Cache ops
- Queue enqueue / dequeue
- Expensive pure computations

Sampling: head-based (1–10% of requests fully traced) or tail-based (keep traces where something went wrong). Keep tracer overhead < 1% of request latency.

## SLOs — the contract

An SLO is a number + a window. "99.9% of /orders responses succeed within 500 ms, measured over 28 days."

Error budget = `1 − SLO`. Over 28 days, 99.9% allows ≈40 min of downtime. When you burn the budget, freeze risky changes and invest in reliability.

Don't set SLOs to what your service does today. Set them to what your users need.
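
The error-budget arithmetic in this section can be checked with a few lines of Python. Both function names are hypothetical; the first treats the SLO as time-based (minutes of full downtime), the second as request-based (share of allowed failures already spent).

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full downtime a time-based SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For 99.9% over 28 days, `error_budget_minutes(0.999, 28)` gives about 40.3 minutes, which matches the ≈40 min figure above.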

## Alerts — page on symptoms, not causes

Alert on "users are affected" (SLO burn rate, error rate spike, latency breach). Don't alert on "CPU is at 80%" — that's often fine.

Every alert must be:
- **Actionable** — there is something the oncall can do right now
- **Unambiguous** — one cause for the page, not "anything could have fired this"
- **Documented** — link to a runbook from the alert body

If an alert fires and the oncall thinks "not my problem" or "auto-resolves in 5 min", it's a bad alert. Delete or tune it.

## Runbooks

One per alert. Structure:

```
# Alert: api-latency-p99-high

## What this means
p99 on /api/orders POST is > 1s for 5m.

## Immediate checks
1. Look at [dashboard-link]
2. Check for recent deploy: [deploys-link]
3. Check upstream health: [dep-status]

## Common causes
- DB slow query → check [slow-query-dashboard]
- Cache outage → check redis metrics
- Upstream payment provider → check provider status page

## Mitigation
- Roll back recent deploy if within 30 min window
- Failover to secondary region
- ...
```

## Health endpoints (minimal)

```
GET /health/live   → 200 if process can serve (don't check dependencies)
GET /health/ready  → 200 only if dependencies are reachable (DB, cache, queue)
```

Live failing → orchestrator restarts the pod.
Ready failing → orchestrator takes the pod out of the load balancer (but doesn't kill it).

Never put business logic in health checks. They should be cheap and boring.

## Anti-patterns

| Anti-pattern | Why |
|---|---|
| String-formatted logs (`"user X did Y at Z"`) | Unqueryable |
| Logging full request bodies | PII leak, storage blow-up |
| Alerting on CPU / disk without symptom link | Pager fatigue; noise |
| No request correlation id | Can't stitch a failure across services |
| Logging at DEBUG in prod | Drowns the signal; storage cost |
| `avg_latency` as the only latency metric | Hides the outliers that hurt users |
| `status:500` as the only error signal | 200 with `{error: ...}` bodies exist and hurt |
| Metrics labels with user id / email | Cardinality explosion |
| Tracing everything, sampling nothing | Cost blowup; latency overhead |
| Alerts without runbooks | Oncall guesses, takes too long |
@@ -0,0 +1,201 @@
# Resilience

Timeouts, retries, circuit breakers, idempotency, background jobs. Make failures cheap.

## Timeouts — every outbound call

No exceptions. A dependency that never answers will exhaust threads, sockets, and memory.

```
# Wrong: no timeout
response = http.get("https://upstream/api")

# Right: fail fast
response = http.get("https://upstream/api", timeout=2.0)
```

Timeout budget, layered:

```
client 10s
 └ gateway 8s
    └ service 5s
       └ dependency call 2s   ← must be smaller than parent budget
```

If the inner call's timeout is ≥ the outer's, the outer layer is cut off by its own caller while it is still waiting on the inner call, so it never gets to return a clean 504; from the caller's side it just hangs.
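
One way to keep inner timeouts below the parent budget is to carry a single deadline through the request and derive each call's timeout from what is left. The sketch below assumes a hypothetical `Deadline` helper; the `reserve_s` margin and the injectable `clock` are illustration choices, not part of any particular framework.

```python
import time
from typing import Callable


class Deadline:
    """An absolute deadline shared by every call made while serving one request."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, call_timeout_s: float, reserve_s: float = 0.1) -> float:
        """Timeout for one outbound call: never more than what is left, minus a
        small reserve so this layer can still build its own error response."""
        available = self.remaining() - reserve_s
        if available <= 0:
            raise TimeoutError("request budget exhausted before the call started")
        return min(call_timeout_s, available)
```

A service with a 5 s budget would create `Deadline(5.0)` when the request arrives and pass `deadline.timeout_for(2.0)` as the timeout of each dependency call.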

## Retries — only for safe, transient failures

**Retryable**:
- Network timeouts
- 5xx on GET/idempotent calls
- 429 (with `Retry-After`)
- Explicit DB "retry" errors (e.g., serialization failures)

**NOT retryable**:
- 4xx other than 429 (client bug; retry won't help)
- Any non-idempotent call without an `Idempotency-Key`
- "Connection reset" where the write may have landed

### Exponential backoff with jitter

Pure exponential backoff creates thundering herds when many clients fail together. Always add jitter.

```
attempt(n):
  base  = 100ms
  max   = 10s
  sleep = min(max, base * 2^n * random(0.5, 1.5))   # cap applied after jitter, so sleep never exceeds max
```

Bound the total attempts and total time; don't let retries outlive the user's patience.
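
The backoff formula and the two bounds (attempts and total time) can be sketched as follows. The function names, default values, and the `is_retryable` callback are assumptions for illustration; the cap is applied after the jitter so that a delay never exceeds `cap_s`.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based): exponential growth,
    plus or minus 50% jitter, and a hard cap applied after the jitter."""
    rng = rng or random.Random()
    return min(cap_s, base_s * (2 ** attempt) * rng.uniform(0.5, 1.5))


def retry(call: Callable[[], T], is_retryable: Callable[[Exception], bool],
          max_attempts: int = 5, max_total_s: float = 15.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on retryable errors, bounded by attempt count and total delay."""
    slept = 0.0
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            delay = backoff_delay(attempt)
            last_try = attempt == max_attempts - 1
            # Give up on the last attempt, on non-retryable errors, or when
            # the next sleep would push past the total time budget.
            if last_try or not is_retryable(exc) or slept + delay > max_total_s:
                raise
            sleep(delay)
            slept += delay
    raise RuntimeError("unreachable when max_attempts >= 1")
```

Passing `sleep` and `rng` as parameters is a testing convenience; production code would normally leave the defaults.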

## Circuit breakers

When a dependency is sick, stop hammering it. Three states:

```
CLOSED (normal)
   │  failures exceed threshold
   ▼
OPEN (fail fast, short-circuit calls)
   │  after cool-down, try one request
   ▼
HALF_OPEN ──success──► CLOSED
   │
   └─failure──────────► OPEN
```

Thresholds to tune: error rate (e.g., >50% of last 20 calls), minimum sample size, cool-down time, half-open probe count.

Open-circuit response: fall back to cache, degraded response, or fail fast with 503. Never silently return empty data.
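
The three states can be sketched as a small class. This is a deliberately simplified illustration: it trips on consecutive failures rather than an error rate over a sliding window, it is not thread-safe, and the names `CircuitBreaker` and `CircuitOpenError` are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cool_down_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cool_down_s = cool_down_s
        self._clock = clock
        self._failures = 0                         # consecutive failures seen
        self._opened_at: Optional[float] = None    # time of the last trip
        self.state = "CLOSED"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.cool_down_s:
                raise CircuitOpenError("dependency unavailable, failing fast")
            self.state = "HALF_OPEN"  # cool-down elapsed: let one probe through
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        self._failures = 0
        self.state = "CLOSED"

    def _on_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

Callers catch `CircuitOpenError` and choose the open-circuit response described above: cached data, a degraded response, or a 503.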

## Idempotency

Any operation that might be retried must be safe to run twice.

### Idempotency keys

For non-GET HTTP writes, accept an `Idempotency-Key` header.

```
POST /payments
Idempotency-Key: 7a8b9c...

server:
  stored = store.get(key)
  if stored and stored.request_hash == hash(body):
    return stored.response
  if stored:
    return 409        # same key, different body → conflict
  response = execute()
  store.set(key, (hash(body), response), ttl=24h)
  return response
```
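
A runnable version of the server-side flow, as a sketch only: the in-memory `_store` dict stands in for a shared store such as Redis, and `handle_once`, `body_hash`, and `IdempotencyConflict` are hypothetical names. It is not atomic, so two concurrent requests with the same key could both execute; a real store would need an atomic set-if-absent.

```python
import hashlib
import json
import time
from typing import Any, Callable

# In-memory stand-in for a shared store: key -> (body hash, response, expiry).
_store: dict[str, tuple[str, Any, float]] = {}
TTL_S = 24 * 60 * 60


class IdempotencyConflict(Exception):
    """Same key reused with a different request body (maps to HTTP 409)."""


def body_hash(body: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def handle_once(key: str, body: dict, execute: Callable[[dict], Any],
                now: Callable[[], float] = time.time) -> Any:
    digest = body_hash(body)
    stored = _store.get(key)
    if stored and stored[2] > now():      # entry exists and has not expired
        if stored[0] != digest:
            raise IdempotencyConflict(key)
        return stored[1]                  # replay the first response
    response = execute(body)
    _store[key] = (digest, response, now() + TTL_S)
    return response
```

Hashing the body is what lets the server tell a genuine retry (same key, same body) from a client bug (same key, different body).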

### Natural idempotency

Often better than keys: design the operation so repeats are harmless.

```
# Not idempotent
UPDATE balance SET amount = amount + 10 WHERE id = 1

# Idempotent — absorbs double-apply
INSERT INTO ledger (id, account, amount) VALUES (:tx_id, 1, 10)
  ON CONFLICT (id) DO NOTHING
```
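
The ledger pattern can be tried end to end with Python's built-in `sqlite3` module. This sketch assumes a SQLite build that supports the `ON CONFLICT ... DO NOTHING` upsert clause (3.24 or newer); the table layout and the `apply_credit` and `balance` helpers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (id TEXT PRIMARY KEY, account INTEGER, amount INTEGER)")


def apply_credit(tx_id: str, account: int, amount: int) -> bool:
    """Record a credit once; return False when tx_id was already applied."""
    cur = conn.execute(
        "INSERT INTO ledger (id, account, amount) VALUES (?, ?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        (tx_id, account, amount),
    )
    conn.commit()
    return cur.rowcount == 1


def balance(account: int) -> int:
    """Derive the balance from the ledger instead of mutating a counter."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE account = ?", (account,)
    ).fetchone()
    return row[0]
```

Applying the same `tx_id` twice leaves the balance unchanged, which is the property a retried request or a redelivered job needs.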

## Graceful degradation

When a non-critical dependency is down, return a usable response, not an error.

```
product = productRepo.get(id)
try:
  product.recommendations = recService.for(id, timeout=300ms)
except (Timeout, ServiceError):
  product.recommendations = []   # degrade, don't fail
return product
```

Decide up front which pieces are essential vs. nice-to-have. Never degrade silently on essentials (payments, auth).

## Background jobs

For anything not strictly needed in the request path: send, enqueue, return.

```
# Request path
handler(req):
  order = orderRepo.save(newOrder)
  queue.enqueue(SendOrderEmail(order.id))    # defer
  queue.enqueue(UpdateSearchIndex(order.id))
  return 201
```

Queue requirements:
- **Durable** — enqueue survives broker restart (disk, replicated)
- **At-least-once delivery** — so jobs must be idempotent
- **Dead-letter queue** — after N failures, park the message and alert
- **Visibility timeout** — consumer crashes → job requeues automatically

Common choices: Postgres-backed (pgboss, solid-queue), Redis (BullMQ, Sidekiq), managed (SQS, Cloud Tasks), streaming (Kafka).

## Scheduled jobs

Two traps:
1. **Lock per job** — multiple replicas must not run the same job twice. Use a DB advisory lock or a leader-election lib.
2. **Overlap** — if a job runs longer than its interval, the next tick starts before the previous ends. Decide: skip, queue, or overlap — explicitly.

Don't use `cron` on a single VM in production; it dies with the VM. Use a platform scheduler (Kubernetes CronJob, cloud scheduler) + idempotent job logic.

## Health checks

Two separate endpoints:

```
GET /health/live    # Am I running? (200 = process alive)
GET /health/ready   # Can I take traffic? (checks DB, cache, queue connectivity)
```

Orchestrators (K8s, load balancers) need both. `/health/ready` failing for 30s → take the pod out of rotation, don't kill it.

## Graceful shutdown

On SIGTERM:
1. Stop accepting new requests (`/health/ready` → 503).
2. Finish in-flight requests (with a hard deadline, e.g., 30 s).
3. Drain the job consumer.
4. Close DB pools and sockets.
5. Exit.

Without this, a deploy drops requests and leaves half-processed jobs.

## Bulkheads

Isolate failure domains so one tenant / feature can't drown the others.

- Separate thread pool / connection pool per downstream service
- Separate queue / worker group per job class
- Separate rate limit per tenant

One noisy neighbor should degrade its own lane, not everyone's.

## Timeouts for tasks, not just HTTP

DB query timeouts (`statement_timeout` in Postgres), job max runtime, lock wait timeout — all finite. Anything unbounded will eventually hang something.

## Anti-patterns

| Anti-pattern | Why |
|---|---|
| Infinite retries | One bad day becomes a queue explosion |
| Retries without backoff | Synchronized thundering herds |
| Retry on POST without idempotency key | Duplicate payments, double-sends |
| Shared retry budget across unrelated calls | One bad dep exhausts retries for healthy ones |
| Catching all exceptions to mask failures | Bugs silently go to prod |
| Fire-and-forget without a dead-letter queue | Failed jobs vanish with no alert |
| "Run every N seconds" cron on a single machine | Loses work on reboot |
| Waiting forever for a lock | Locks don't auto-expire unless you say so |
@@ -0,0 +1,186 @@
# Security

Authentication, authorization, input validation, rate limiting, secrets. The non-negotiables.

## AuthN vs AuthZ

| | Authentication | Authorization |
|---|---|---|
| Answers | Who are you? | What can you do? |
| Failure code | 401 | 403 |
| Mechanism | Session, token, signature | Role / policy / permission check |

Never conflate these. A 401 says "tell me who you are"; a 403 says "I know who you are and you can't do this".

## Session vs token

| | Server session | Stateless token (JWT) |
|---|---|---|
| State | Server-side (DB / Redis) | In the token itself |
| Revocation | Delete session row | Hard — need blocklist or short TTL |
| Scale | Needs sticky / shared store | Stateless across servers |
| Size on wire | Small (cookie id) | Large (signed payload) |
| First-party web | Excellent | Overkill |
| Service-to-service | Weak | Natural fit |

For a browser-based web app, **server-side sessions with secure cookies** are usually the right answer. JWTs shine for APIs, federation, and service-to-service.

## Cookies — secure defaults

```
Set-Cookie: session=abc...; Secure; HttpOnly; SameSite=Lax; Path=/
```

- **Secure** — HTTPS only.
- **HttpOnly** — JS can't read it (blocks XSS-based token theft).
- **SameSite=Lax** — default; blocks CSRF on cross-site POSTs. Use `Strict` for admin; `None` + `Secure` only for true cross-origin use cases.
- **Path** — scope to where it's needed.
- Don't store user data in the cookie payload; store an opaque session id.

## JWT rules

- Always check signature. Reject `alg: none`. Reject unexpected algorithms.
- Verify `iss`, `aud`, `exp`, `nbf`.
- Short lifetime (5–15 min) + rotating refresh token.
- Don't put secrets inside; tokens are readable by anyone who has them.
- Rotate signing keys; publish via JWKS.
- Revocation: maintain a short jti blocklist in Redis for stolen-token cases.

## Authorization models

| Model | Use when |
|---|---|
| RBAC (roles) | Small fixed set of roles: admin, user, moderator |
| ABAC (attributes) | Rules depend on attributes of user, resource, time, IP |
| ReBAC (relationships) | "Can Alice read doc X?" answered via a graph (Google Zanzibar / OpenFGA) |
| Policy-as-code (OPA, Cedar) | Complex rules that need to live outside the app |

Start with RBAC. Graduate to ReBAC/ABAC when roles no longer express the rules. Never hard-code `if user.email == "admin@x.com"`.

## Enforce authorization at the boundary

Every handler starts with a permission check. No implicit trust.

```
handler(req):
  user = requireAuth(req)
  resource = repo.load(req.id)
  if not user.can(READ, resource):
    return 403 | 404   # 404 if the existence of the resource is itself secret
  return resource
```

`403` vs `404`: return 404 if the existence of the resource is itself secret (e.g., private documents); return 403 otherwise.

## Input validation

Validate everything at the edge, once. Never trust "internal" callers.

```
schema:
  email : string, format=email
  age   : int, 0 <= x <= 150
  role  : enum(user, admin)

handler(req):
  cmd = schema.parse(req.body)   # rejects anything else
  useCase.execute(cmd)
```

Rules:
- Whitelist what you accept, not blacklist what you reject.
- Reject unknown fields (guard against mass-assignment).
- Bound all variable-size inputs (strings, arrays): `max_length`, `max_items`.
- Parse into strong types at the boundary; don't pass raw dicts through the system.
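
The validation rules above can be sketched with only the standard library. The `CreateUser` command, `parse_create_user` function, and `MAX_EMAIL_LENGTH` bound are hypothetical, and the email check is deliberately loose; a schema library would normally do this work.

```python
from dataclasses import dataclass

ALLOWED_FIELDS = {"email", "age", "role"}
ALLOWED_ROLES = {"user", "admin"}
MAX_EMAIL_LENGTH = 254  # assumed bound for illustration


class ValidationError(ValueError):
    pass


@dataclass(frozen=True)
class CreateUser:
    email: str
    age: int
    role: str


def parse_create_user(body: dict) -> CreateUser:
    """Turn an untrusted dict into a typed command, or raise ValidationError."""
    unknown = set(body) - ALLOWED_FIELDS
    if unknown:  # reject unknown fields to guard against mass assignment
        raise ValidationError(f"unknown fields: {sorted(unknown)}")
    email, age, role = body.get("email"), body.get("age"), body.get("role")
    if not isinstance(email, str) or len(email) > MAX_EMAIL_LENGTH or "@" not in email:
        raise ValidationError("email: expected an address of bounded length")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or not 0 <= age <= 150:
        raise ValidationError("age: expected an integer in 0..150")
    if not isinstance(role, str) or role not in ALLOWED_ROLES:
        raise ValidationError("role: expected one of user, admin")
    return CreateUser(email=email, age=age, role=role)
```

Everything past the handler then receives a `CreateUser` value rather than a raw dict, so later layers never see an unexpected field or type.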

## Injection defenses

- **SQL**: parameterized queries ONLY. Never string-concatenate. ORMs handle this if you use their query API, not raw strings.
- **Command**: don't build shell commands from user input. If you must: use array-form `exec` (no shell) and whitelist args.
- **LDAP / XPath / NoSQL**: same rule — parameterize.
- **Template injection**: never render user input as a template (Jinja2, ERB, etc.).
- **Path traversal**: canonicalize and assert the result is inside an allow-listed directory.
- **Prototype pollution / mass assignment**: whitelist fields; never `Object.assign(user, req.body)`.
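
For the path traversal rule, "canonicalize and assert" might look like the sketch below. The `resolve_inside` name and the upload directory in the usage note are hypothetical; `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, refusing anything that escapes it."""
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks; an absolute
    # user_path replaces base entirely and is caught by the check below.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise PermissionError(f"path escapes {base}")
    return candidate
```

With a base such as `Path("/srv/uploads")`, a request for `"reports/q1.pdf"` resolves normally, while `"../secret"` and `"/etc/passwd"` raise `PermissionError`.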

## Passwords

- **argon2id** (preferred) or **bcrypt** (with cost ≥ 12). Never SHA-* for passwords.
- Never log passwords, even hashed.
- Enforce length (≥ 12 chars), not character classes. Check against a breached-password list (HaveIBeenPwned API / offline list).
- Account-level lockout on repeated failures, plus rate limiting per IP/account.

## Secrets

- Never in source control. `.env` files are listed in `.gitignore`; production secrets come from a secret manager (Vault, AWS Secrets Manager, GCP Secret Manager, 1Password Connect).
- Rotate on compromise AND on a schedule.
- Scope per-service and per-environment. One stolen dev key should never reach prod.
- Don't print secrets to logs. Redact at the logger config.

## Rate limiting

Apply at the edge (CDN/API gateway) AND per-endpoint in the app.

Limits by identity:
- Anonymous: by IP — coarse, bypassable with proxies.
- Authenticated: by user id — reliable.
- Authenticated + IP: both, for defense in depth.

Algorithms:
- **Token bucket**: allows short bursts; refill rate controls long-run.
- **Fixed window**: simple, but bursty at boundaries.
- **Sliding window**: smooth; costs more.

Always include `Retry-After` on 429 responses.
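
A token bucket fits in a few lines. The sketch below is a single-process illustration with a hypothetical `TokenBucket` class; a multi-instance deployment would keep the bucket state in a shared store, which is not shown here.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`; sustain `refill_per_s` requests per second."""

    def __init__(self, capacity: float, refill_per_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._clock = clock
        self._tokens = capacity
        self._updated_at = clock()

    def try_acquire(self) -> tuple[bool, float]:
        """Return (allowed, retry_after_s); retry_after_s is 0.0 when allowed."""
        now = self._clock()
        elapsed = now - self._updated_at
        # Refill lazily based on elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_s)
        self._updated_at = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True, 0.0
        # Time until one whole token is available again.
        return False, (1.0 - self._tokens) / self.refill_per_s
```

When `try_acquire()` returns `(False, retry_after_s)`, the handler can answer 429 and set `Retry-After` to that value rounded up to whole seconds.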

## CSRF

Required if the client is a browser using cookies. Not required if you use `Authorization: Bearer`, because the browser never attaches that header automatically and a cross-site attacker cannot set it.

Defenses, pick one:
- **SameSite=Lax cookie** (default; covers most cases).
- **Double-submit cookie**: random token in a cookie AND in a header; the server checks that they match.
- **Synchronizer token**: per-session token in the form, checked against a server-side store.
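
The double-submit check can be sketched as below, assuming a Node.js runtime. `newCsrfToken` and `csrfTokensMatch` are hypothetical helpers; reading the cookie and the header from the request is left to whatever framework is in use.

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// Generate an unguessable token to set as the CSRF cookie.
export function newCsrfToken(): string {
  return randomBytes(32).toString("hex");
}

// Compare the cookie value with the value echoed in a request header.
// timingSafeEqual throws on unequal lengths, so lengths are checked first.
export function csrfTokensMatch(
  cookieToken: string | undefined,
  headerToken: string | undefined,
): boolean {
  if (!cookieToken || !headerToken) return false;
  const a = Buffer.from(cookieToken, "utf8");
  const b = Buffer.from(headerToken, "utf8");
  return a.length === b.length && timingSafeEqual(a, b);
}
```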

## CORS

Set it to what you actually need. `Access-Control-Allow-Origin: *` combined with credentials is a silent vulnerability: browsers refuse it, but a misconfigured gateway (for example, one that reflects any `Origin` back) can still leak.

```
Access-Control-Allow-Origin: https://app.example.com
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: GET, POST, PATCH, DELETE
Access-Control-Allow-Headers: Authorization, Content-Type
Access-Control-Max-Age: 86400
```
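
When more than one origin is allowed, the header value has to be chosen per request. A minimal sketch, assuming an explicit allow-list; `corsHeaders` and `ALLOWED_ORIGINS` are hypothetical names, and the `Vary: Origin` header is added so caches do not serve one origin's response to another.

```typescript
// Hypothetical allow-list; exact string match only, no wildcards or regexes.
const ALLOWED_ORIGINS: ReadonlySet<string> = new Set<string>(["https://app.example.com"]);

// Return CORS response headers for an allowed origin, or none at all.
export function corsHeaders(requestOrigin: string | undefined): Record<string, string> {
  if (!requestOrigin || !ALLOWED_ORIGINS.has(requestOrigin)) {
    return {};
  }
  return {
    "Access-Control-Allow-Origin": requestOrigin,
    "Access-Control-Allow-Credentials": "true",
    "Vary": "Origin",
  };
}
```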

## Security headers (for any HTML-serving endpoint)

```
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Content-Security-Policy: default-src 'self'; ...
X-Content-Type-Options: nosniff
Referrer-Policy: strict-origin-when-cross-origin
Permissions-Policy: camera=(), microphone=(), geolocation=()
```

## Audit logging

Every security-relevant event gets an immutable log entry:
- login (success / fail), password change, role change, permission change
- admin actions, data exports
- access to sensitive resources

Include: actor, action, target, timestamp, source IP, request id. Store separately from app logs so a compromised app can't tamper with them.
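
The fields listed above map naturally onto a small record type. The sketch below only shapes and serializes an entry; `AuditEvent` and `auditLine` are hypothetical names, and immutability comes from where the line is written (an append-only, separately owned store), not from this code.

```typescript
export interface AuditEvent {
  actor: string;      // who performed the action
  action: string;     // e.g. "login.fail", "role.change"
  target: string;     // the resource or account affected
  timestamp: string;  // ISO 8601, UTC
  sourceIp: string;
  requestId: string;
}

// Build one JSON line per event; the timestamp is assigned here, not by the caller.
export function auditLine(
  event: Omit<AuditEvent, "timestamp">,
  now: Date = new Date(),
): string {
  const entry: AuditEvent = { ...event, timestamp: now.toISOString() };
  return JSON.stringify(entry);
}
```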

## Anti-patterns

| Anti-pattern | Why |
|---|---|
| Rolling your own crypto | Don't. Use the standard library / vetted lib. |
| Comparing secrets with `==` | Timing attack; use constant-time compare |
| Returning different errors for "user doesn't exist" vs "wrong password" | Username enumeration |
| Trusting `X-Forwarded-For` without checking source | Spoofable; respect it only from trusted proxies |
| One API key per team, shared over Slack | No revocation granularity |
| Storing JWTs in localStorage | XSS steals them; use HttpOnly cookies |
| "Security through obscurity" (weird endpoint paths) | Not a control |
| Disabling TLS verification "temporarily" in prod | Never |
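
To illustrate the `X-Forwarded-For` row, the sketch below honors the header only when the direct peer is a trusted proxy, then walks the hop list from right to left and returns the first address that is not itself a trusted proxy. `clientIp` and the `trustedProxies` set are hypothetical; many frameworks expose an equivalent setting that should be preferred where available.

```typescript
// Determine the client IP for rate limiting or audit logging.
export function clientIp(
  remoteAddr: string,
  forwardedFor: string | undefined,
  trustedProxies: ReadonlySet<string>,
): string {
  // Header sent by an untrusted peer: ignore it entirely.
  if (!forwardedFor || !trustedProxies.has(remoteAddr)) {
    return remoteAddr;
  }
  const hops = forwardedFor
    .split(",")
    .map((hop) => hop.trim())
    .filter((hop) => hop.length > 0);
  // Rightmost entries were appended by our own proxies; anything further
  // left than the first untrusted hop is client-controlled.
  for (let i = hops.length - 1; i >= 0; i--) {
    const hop = hops[i];
    if (hop !== undefined && !trustedProxies.has(hop)) {
      return hop;
    }
  }
  return remoteAddr;
}
```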
@@ -0,0 +1,119 @@
---
name: backend-architect
description: Persona skill. Think like a backend architect. System boundaries, data flow, scaling, failure modes. Overlay on top of `backend` + language skills. For the patterns themselves, load `backend`.
origin: agency-agents-fork + original (https://github.com/msitarzewski/agency-agents, MIT)
---

# Backend Architect

Think like a backend architect. This skill is a **mindset overlay**, not a pattern catalogue; load `backend` for patterns.

## When to load

- Designing a new service / feature
- Reviewing an architectural proposal
- Debating storage / queue / cache choices
- Reviewing a migration plan
- Choosing between build vs. buy / in-house vs. managed

## The posture

1. **Draw the boundaries first.** Service A knows nothing of Service B's internals. Any leak is an eventual coupling bug.
2. **Favor boring technology.** Postgres + a job queue solves 90% of problems. Reach for specialized tools only when boring can't.
3. **Design for the failure cases.** What happens when the DB is slow, the queue is backed up, the API key rotates, the region goes down?
4. **Measure before optimizing.** "Could be a bottleneck" is a hypothesis, not evidence.
5. **Data is the hard part.** Compute scales; data is where consistency, durability, and migrations bite.
6. **Decisions > diagrams.** A clean ADR that records WHY this over that outlives any whiteboard.
7. **Operational load is a product requirement.** If oncall hates it at 3am, it's not done.

## The questions you always ask

Before approving or shipping a design:

- **What's the failure mode?** What breaks first, and what does the user see?
- **What's the blast radius?** Does a bug in this service hurt just this feature, or take the whole site down?
- **What's the rollback story?** How do we get back if this deploy is bad?
- **How does this scale 10×?** Will this design hold at 10× the current load?
- **Where's the data authority?** If two stores disagree, who wins?
- **What's the consistency model?** Strong, eventual, or read-your-writes, decided per data type?
- **What invariants does the DB enforce vs. the app?** Every invariant the app "promises" is a race away from being wrong.
- **What observability does a developer get at 3am?** Logs, metrics, traces for the failure mode.
- **Is this idempotent?** Every write must be safe to retry.
- **Is the contract stable?** What's the versioning plan for public interfaces?

## The checklist

For a new service or major feature, walk through:

### Contract
- [ ] API design: REST / GraphQL / gRPC chosen with reason.
- [ ] Error shape and status codes standardized.
- [ ] Versioning strategy.
- [ ] Idempotency keys on non-GET writes.
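
The idempotency-key item can be illustrated with a small in-memory sketch: the first request with a given key runs the operation, and any retry with the same key gets the same result instead of a second execution. `IdempotencyStore` is a hypothetical name; a real service would persist keys in a database with an expiry rather than in a `Map`.

```typescript
export class IdempotencyStore<T> {
  private readonly results = new Map<string, Promise<T>>();

  // Run `op` at most once per key; concurrent and later callers share the result.
  run(key: string, op: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key);
    if (existing !== undefined) {
      return existing;
    }
    const pending = op().catch((err: unknown) => {
      // Forget failed attempts so the client can retry with the same key.
      this.results.delete(key);
      throw err;
    });
    this.results.set(key, pending);
    return pending;
  }
}
```

Storing the promise rather than the resolved value means two requests that race on the same key still trigger only one execution.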

### Data
- [ ] Schema reviewed for normalization, constraints, types.
- [ ] Foreign keys declared, not just "promised".
- [ ] Indexes match the real queries.
- [ ] Migration plan is expand/contract.
- [ ] Backup and restore tested.

### Infra
- [ ] Timeouts on every outbound call.
- [ ] Retries only on idempotent ops, with jitter.
- [ ] Circuit breaker or fallback for dependencies.
- [ ] Resource limits (CPU, memory, pool sizes) sized, not left as defaults.
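
A minimal sketch of the retry item, using exponential backoff with full jitter (the exponential backoff is an assumption; the checklist itself only asks for jitter). `backoffDelayMs` and `retryIdempotent` are hypothetical helpers, and the caller is responsible for passing only idempotent operations that already carry their own timeout.

```typescript
// Delay before retry number `attempt` (0-based): a random value in
// [0, min(capMs, baseMs * 2^attempt)). `rand` is injectable for tests.
export function backoffDelayMs(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 5000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

// Retry an idempotent async operation up to maxAttempts times in total.
export async function retryIdempotent<T>(
  op: () => Promise<T>,
  maxAttempts: number = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Jitter spreads retries out so that many clients failing at the same moment do not all hit the dependency again in lockstep.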

### Operations
- [ ] Health check endpoints (/health/live, /health/ready).
- [ ] Graceful shutdown on SIGTERM.
- [ ] Structured logs with request / trace id.
- [ ] Key metrics exposed (RED signals + saturation).
- [ ] Alerts defined with runbooks.
- [ ] Oncall documented in service catalogue.
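
The first two items interact: readiness should start failing as soon as shutdown begins, while liveness stays green. A minimal framework-free sketch, in which `ServiceState`, `healthStatus`, and `beginShutdown` are hypothetical names:

```typescript
export interface ServiceState {
  ready: boolean;        // dependencies connected, warm-up done
  shuttingDown: boolean; // SIGTERM received
}

// Map a request path to an HTTP status code for the two health endpoints.
export function healthStatus(pathname: string, state: ServiceState): number {
  if (pathname === "/health/live") {
    return 200; // the process is up and able to answer
  }
  if (pathname === "/health/ready") {
    return state.ready && !state.shuttingDown ? 200 : 503;
  }
  return 404;
}

// Flip readiness first so the load balancer drains traffic, then stop
// accepting new connections; in-flight requests are left to finish.
export function beginShutdown(state: ServiceState, closeServer: () => void): void {
  state.shuttingDown = true;
  closeServer();
}
```

In a Node.js service this would typically be wired up as `process.on("SIGTERM", () => beginShutdown(state, () => server.close()))`, where `server` is the application's HTTP server.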

### Security
- [ ] Auth check at the boundary.
- [ ] Input validated at the edge.
- [ ] Secrets pulled from secret manager, not config.
- [ ] PII handling documented.
- [ ] Rate limiting on public endpoints.

### Rollout
- [ ] Feature flag if behaviour-changing.
- [ ] Deploy plan: dev → staging → canary → prod.
- [ ] Rollback command documented.
- [ ] Observability dashboards exist before release.

## Tradeoffs you name explicitly

- **Strong consistency vs. throughput**: pick per data type.
- **Sync vs. async**: user waiting ≠ background reliability.
- **Monolith vs. services**: don't split until scale / team pain demands.
- **Build vs. buy**: buy the commodity; build where you compete.
- **Flexibility vs. simplicity**: the "flexible" option usually has the higher total cost.

## What you push back on

- **Premature microservices.** Added complexity for no measurable benefit.
- **Ad-hoc schema fields** shoved into JSON columns to "move fast". Within months they are queried like real columns, and regretted.
- **"Reactive everything"** where a simple sync call would work.
- **Home-rolled queues / sharding / consensus.** Almost always the wrong build.
- **Decisions without ADRs.** The reason is always the first thing lost.

## Forbidden patterns

- Architecture diagrams without failure annotations
- Proposals that skip "what happens if X is down"
- Two-phase commit across service boundaries (usually a sign the services should be one)
- Cross-service database joins ("just query the other team's DB")
- Silent coupling: services that "happen to know" each other's internals
- New services without owners, dashboards, and oncall
- Technology choices made because "it's popular"

## Pair with

- [`backend`](../backend/SKILL.md): the patterns.
- [`database`](../database/SKILL.md): schema / scaling details.
- [`devops`](../devops/SKILL.md): how it deploys and is operated.
- [`architecture-decision-records`](../architecture-decision-records/SKILL.md): recording the decisions.