@coralai/sps-cli 0.42.0 → 0.44.0

Files changed (147)
  1. package/README.md +59 -4
  2. package/dist/commands/consoleCommand.d.ts +2 -0
  3. package/dist/commands/consoleCommand.d.ts.map +1 -0
  4. package/dist/commands/consoleCommand.js +129 -0
  5. package/dist/commands/consoleCommand.js.map +1 -0
  6. package/dist/commands/projectInit.d.ts.map +1 -1
  7. package/dist/commands/projectInit.js +40 -53
  8. package/dist/commands/projectInit.js.map +1 -1
  9. package/dist/commands/setup.d.ts.map +1 -1
  10. package/dist/commands/setup.js +14 -2
  11. package/dist/commands/setup.js.map +1 -1
  12. package/dist/commands/skillCommand.d.ts +2 -0
  13. package/dist/commands/skillCommand.d.ts.map +1 -0
  14. package/dist/commands/skillCommand.js +235 -0
  15. package/dist/commands/skillCommand.js.map +1 -0
  16. package/dist/console-assets/assets/index-Bhd2f9AP.js +125 -0
  17. package/dist/console-assets/assets/index-bsAN2a12.css +1 -0
  18. package/dist/console-assets/index.html +16 -0
  19. package/dist/console-server/index.d.ts +29 -0
  20. package/dist/console-server/index.d.ts.map +1 -0
  21. package/dist/console-server/index.js +145 -0
  22. package/dist/console-server/index.js.map +1 -0
  23. package/dist/console-server/lib/lockFile.d.ts +17 -0
  24. package/dist/console-server/lib/lockFile.d.ts.map +1 -0
  25. package/dist/console-server/lib/lockFile.js +61 -0
  26. package/dist/console-server/lib/lockFile.js.map +1 -0
  27. package/dist/console-server/lib/portPicker.d.ts +3 -0
  28. package/dist/console-server/lib/portPicker.d.ts.map +1 -0
  29. package/dist/console-server/lib/portPicker.js +25 -0
  30. package/dist/console-server/lib/portPicker.js.map +1 -0
  31. package/dist/console-server/routes/projects.d.ts +11 -0
  32. package/dist/console-server/routes/projects.d.ts.map +1 -0
  33. package/dist/console-server/routes/projects.js +149 -0
  34. package/dist/console-server/routes/projects.js.map +1 -0
  35. package/dist/console-server/routes/system.d.ts +7 -0
  36. package/dist/console-server/routes/system.d.ts.map +1 -0
  37. package/dist/console-server/routes/system.js +19 -0
  38. package/dist/console-server/routes/system.js.map +1 -0
  39. package/dist/console-server/sse/eventBus.d.ts +25 -0
  40. package/dist/console-server/sse/eventBus.d.ts.map +1 -0
  41. package/dist/console-server/sse/eventBus.js +32 -0
  42. package/dist/console-server/sse/eventBus.js.map +1 -0
  43. package/dist/console-server/watchers/cardWatcher.d.ts +9 -0
  44. package/dist/console-server/watchers/cardWatcher.d.ts.map +1 -0
  45. package/dist/console-server/watchers/cardWatcher.js +42 -0
  46. package/dist/console-server/watchers/cardWatcher.js.map +1 -0
  47. package/dist/core/skillStore.d.ts +46 -0
  48. package/dist/core/skillStore.d.ts.map +1 -0
  49. package/dist/core/skillStore.js +210 -0
  50. package/dist/core/skillStore.js.map +1 -0
  51. package/dist/core/skillStore.test.d.ts +2 -0
  52. package/dist/core/skillStore.test.d.ts.map +1 -0
  53. package/dist/core/skillStore.test.js +203 -0
  54. package/dist/core/skillStore.test.js.map +1 -0
  55. package/dist/main.js +27 -17
  56. package/dist/main.js.map +1 -1
  57. package/package.json +8 -2
  58. package/skills/architecture-decision-records/SKILL.md +207 -0
  59. package/skills/backend/SKILL.md +62 -0
  60. package/skills/backend/references/api-design.md +168 -0
  61. package/skills/backend/references/caching.md +181 -0
  62. package/skills/backend/references/data-access.md +173 -0
  63. package/skills/backend/references/layering.md +181 -0
  64. package/skills/backend/references/observability.md +190 -0
  65. package/skills/backend/references/resilience.md +201 -0
  66. package/skills/backend/references/security.md +186 -0
  67. package/skills/backend-architect/SKILL.md +119 -0
  68. package/skills/code-reviewer/SKILL.md +143 -0
  69. package/skills/coding-standards/SKILL.md +60 -0
  70. package/skills/coding-standards/references/clean-code.md +258 -0
  71. package/skills/coding-standards/references/code-review.md +192 -0
  72. package/skills/coding-standards/references/commits-and-prs.md +226 -0
  73. package/skills/coding-standards/references/error-strategy.md +193 -0
  74. package/skills/coding-standards/references/naming.md +185 -0
  75. package/skills/coding-standards/references/tdd.md +171 -0
  76. package/skills/database/SKILL.md +53 -0
  77. package/skills/database/references/indexing.md +190 -0
  78. package/skills/database/references/migrations.md +199 -0
  79. package/skills/database/references/nosql.md +185 -0
  80. package/skills/database/references/queries.md +295 -0
  81. package/skills/database/references/scaling.md +203 -0
  82. package/skills/database/references/schema.md +191 -0
  83. package/skills/database-optimizer/SKILL.md +168 -0
  84. package/skills/debugging-workflow/SKILL.md +244 -0
  85. package/skills/devops/SKILL.md +55 -0
  86. package/skills/devops/references/ci-cd.md +204 -0
  87. package/skills/devops/references/containers.md +272 -0
  88. package/skills/devops/references/deploy.md +201 -0
  89. package/skills/devops/references/iac.md +252 -0
  90. package/skills/devops/references/observability.md +228 -0
  91. package/skills/devops/references/secrets.md +178 -0
  92. package/skills/devops-automator/SKILL.md +164 -0
  93. package/skills/frontend/SKILL.md +52 -0
  94. package/skills/frontend/references/accessibility.md +222 -0
  95. package/skills/frontend/references/components.md +206 -0
  96. package/skills/frontend/references/performance.md +219 -0
  97. package/skills/frontend/references/routing.md +209 -0
  98. package/skills/frontend/references/state.md +190 -0
  99. package/skills/frontend/references/testing.md +216 -0
  100. package/skills/frontend-developer/SKILL.md +115 -0
  101. package/skills/git-workflow/SKILL.md +355 -0
  102. package/skills/golang/SKILL.md +49 -0
  103. package/skills/golang/references/concurrency.md +284 -0
  104. package/skills/golang/references/errors.md +241 -0
  105. package/skills/golang/references/idioms.md +285 -0
  106. package/skills/golang/references/testing.md +238 -0
  107. package/skills/java/SKILL.md +50 -0
  108. package/skills/java/references/concurrency.md +194 -0
  109. package/skills/java/references/idioms.md +283 -0
  110. package/skills/java/references/testing.md +228 -0
  111. package/skills/kotlin/SKILL.md +47 -0
  112. package/skills/kotlin/references/coroutines.md +240 -0
  113. package/skills/kotlin/references/idioms.md +268 -0
  114. package/skills/kotlin/references/testing.md +219 -0
  115. package/skills/mobile/SKILL.md +50 -0
  116. package/skills/mobile/references/architecture.md +204 -0
  117. package/skills/mobile/references/navigation.md +158 -0
  118. package/skills/mobile/references/performance.md +152 -0
  119. package/skills/mobile/references/platform.md +166 -0
  120. package/skills/mobile/references/state-and-data.md +174 -0
  121. package/skills/python/SKILL.md +51 -0
  122. package/skills/python/THIRD_PARTY.md +14 -0
  123. package/skills/python/references/async.md +218 -0
  124. package/skills/python/references/error-handling.md +254 -0
  125. package/skills/python/references/idioms.md +279 -0
  126. package/skills/python/references/packaging.md +233 -0
  127. package/skills/python/references/testing.md +269 -0
  128. package/skills/python/references/typing.md +292 -0
  129. package/skills/qa-tester/SKILL.md +186 -0
  130. package/skills/rust/SKILL.md +50 -0
  131. package/skills/rust/references/async.md +224 -0
  132. package/skills/rust/references/errors.md +240 -0
  133. package/skills/rust/references/ownership.md +263 -0
  134. package/skills/rust/references/testing.md +274 -0
  135. package/skills/rust/references/traits.md +250 -0
  136. package/skills/security-engineer/SKILL.md +157 -0
  137. package/skills/swift/SKILL.md +48 -0
  138. package/skills/swift/references/concurrency.md +280 -0
  139. package/skills/swift/references/idioms.md +334 -0
  140. package/skills/swift/references/testing.md +229 -0
  141. package/skills/typescript/SKILL.md +51 -0
  142. package/skills/typescript/references/async.md +241 -0
  143. package/skills/typescript/references/errors.md +208 -0
  144. package/skills/typescript/references/idioms.md +246 -0
  145. package/skills/typescript/references/testing.md +225 -0
  146. package/skills/typescript/references/tooling.md +208 -0
  147. package/skills/typescript/references/types.md +259 -0
@@ -0,0 +1,190 @@
1
+ # Observability
2
+
3
+ Logs, metrics, traces, health. A request you can't trace is a bug you can't fix.
4
+
5
+ ## The three pillars
6
+
7
+ | Signal | Answers | Cost | Cardinality |
8
+ |---|---|---|---|
9
+ | **Logs** | "What happened in this request?" | High (per-event) | Unlimited |
10
+ | **Metrics** | "How much, how often, how fast, across the fleet?" | Low (aggregated) | Bounded (labels explode) |
11
+ | **Traces** | "Where did time go in this distributed request?" | Medium (sampled) | Unlimited per trace |
12
+
13
+ Pick the right signal for the question. Metrics for dashboards, logs for forensics, traces for latency breakdowns.
14
+
15
+ ## Structured logs — JSON, not prose
16
+
17
+ Human-readable strings are unqueryable. Every log line is a JSON object with a stable schema.
18
+
19
+ ```json
20
+ {
21
+ "ts": "2026-04-20T10:23:45.123Z",
22
+ "level": "info",
23
+ "service": "orders",
24
+ "env": "prod",
25
+ "request_id": "req_01HX...",
26
+ "trace_id": "0af7651916cd43dd...",
27
+ "user_id": "u_01HX...",
28
+ "msg": "order created",
29
+ "order_id": "ord_01HX...",
30
+ "amount_cents": 2599,
31
+ "duration_ms": 87
32
+ }
33
+ ```
34
+
35
+ Rules:
36
+ - Always include `ts`, `level`, `service`, `env`.
37
+ - Always include a request/trace id so you can stitch a request together across services.
38
+ - Message is a short constant string — fixed values in `msg`, varying values in fields. `"order created"` not `"order ord_01HX created for $25.99"`.
39
+ - Never log secrets, tokens, passwords, full PII. Redact at the logger, not the call site.
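The rules above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a prescribed implementation: the `JsonFormatter` class, the `fields` convention for passing varying values, and the `REDACTED_KEYS` set are all assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone

# Assumed list of field names to redact; adjust to your own schema.
REDACTED_KEYS = {"password", "token", "authorization", "secret"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a stable set of base fields."""

    def __init__(self, service: str, env: str) -> None:
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "env": self.env,
            "msg": record.getMessage(),
        }
        # Varying values arrive via extra={"fields": {...}}, never inside msg.
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in REDACTED_KEYS else value
        return json.dumps(entry)


def make_logger(service: str, env: str) -> logging.Logger:
    """Build a logger whose only handler emits JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service, env))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call site would then look like `make_logger("orders", "prod").info("order created", extra={"fields": {"order_id": "ord_1", "duration_ms": 87}})`. Redaction lives in the formatter, so a call site that passes a sensitive field by mistake still does not leak it.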
40
+
41
+ ## Log levels — use them honestly
42
+
43
+ | Level | Means | Typical rate |
44
+ |---|---|---|
45
+ | ERROR | Something broke; a human should look | Low |
46
+ | WARN | Unexpected, but handled (retry succeeded, fallback used) | Low |
47
+ | INFO | State changes worth knowing at normal volume | Medium |
48
+ | DEBUG | Details useful while investigating; off in prod | High (when on) |
49
+
50
+ Abused levels poison the signal. If everything is INFO, nothing is INFO.
51
+
52
+ ## Correlation IDs
53
+
54
+ Every request gets a unique id at the edge; it propagates through every log line and outbound call.
55
+
56
+ ```
57
+ incoming request → generate request_id (or accept from X-Request-ID)
58
+ → bind to logger context
59
+ → forward on outbound calls (X-Request-ID, traceparent)
60
+ ```
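One way to sketch this flow in Python is with `contextvars`, which keeps the id scoped to the current thread or async task. The function names here are hypothetical, and the example assumes headers arrive as a plain mapping with the canonical `X-Request-ID` key (real frameworks usually give you case-insensitive headers).

```python
import contextvars
import uuid
from typing import Dict, Mapping, Optional

# Holds the id of the request currently being handled.
_request_id = contextvars.ContextVar("request_id", default="")


def bind_request_id(incoming_headers: Mapping[str, str]) -> str:
    """Accept X-Request-ID from the caller, or generate one at the edge."""
    request_id = incoming_headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex}"
    _request_id.set(request_id)
    return request_id


def current_request_id() -> str:
    """Read the bound id, e.g. from a log formatter."""
    return _request_id.get()


def outbound_headers(extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Headers for an outbound call, carrying the current request id forward."""
    headers = dict(extra or {})
    headers["X-Request-ID"] = current_request_id()
    return headers
```

Middleware would call `bind_request_id` once per request; every outbound HTTP client call then builds its headers through `outbound_headers`.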
61
+
62
+ Distributed tracing (OpenTelemetry) gives you `trace_id` + `span_id` for free. Log both when you have them.
63
+
64
+ ## Metrics — RED + USE
65
+
66
+ Two checklists that cover almost everything.
67
+
68
+ ### RED (per request-driven service)
69
+
70
+ - **R**ate — requests per second
71
+ - **E**rrors — failing requests per second (or error rate)
72
+ - **D**uration — latency distribution (p50 / p95 / p99)
73
+
74
+ ### USE (per resource)
75
+
76
+ - **U**tilization — how busy is it? (CPU%, thread pool in use / max)
77
+ - **S**aturation — how much work is queued? (request queue depth)
78
+ - **E**rrors — how many operations failed?
79
+
80
+ Track these for every service and every critical dependency.
81
+
82
+ ## Latency — measure distributions, not averages
83
+
84
+ Averages hide the worst cases. The p95 and p99 are where your users actually feel slowness.
85
+
86
+ ```
87
+ # ✅
88
+ http_request_duration_seconds{route="/orders", method="POST"}
89
+ → histogram with buckets (0.01, 0.05, 0.1, 0.5, 1, 5)
90
+ → alert on p99 > 1s for 5 min
91
+
92
+ # ❌
93
+ avg_response_time = sum(durations) / count(durations)
94
+ → a 10 s outlier buried in 999 fast ones looks fine
95
+ ```
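A small worked example shows the gap between the mean and the tail. The data is hypothetical (980 requests at 50 ms, 20 at 3 s), and `percentile` uses the simple nearest-rank definition rather than the bucket interpolation a metrics backend would apply to histograms.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 980 requests at 50 ms, 20 requests at 3 s.
durations = [0.05] * 980 + [3.0] * 20

average = mean(durations)         # about 0.109 s: looks healthy
p50 = percentile(durations, 50)   # 0.05 s
p99 = percentile(durations, 99)   # 3.0 s: the slowness users actually hit
```

The mean stays near 0.1 s even though 2% of requests take 3 s; the p99 surfaces that immediately.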
96
+
97
+ ## Labels — finite cardinality
98
+
99
+ Every unique label combination creates a new metric series. High-cardinality labels (user id, request id, email) will blow up storage and cost.
100
+
101
+ | Label | OK? |
102
+ |---|---|
103
+ | route, method, status_code | Yes (small set) |
104
+ | region, pod_name, env | Yes |
105
+ | user_id, request_id, email, SKU | NO — use logs/traces for these |
106
+
107
+ ## Tracing
108
+
109
+ A trace is a tree of spans representing one request's path through services. Each span has: operation name, start/end time, attributes, parent span.
110
+
111
+ Auto-instrument with OpenTelemetry. Add manual spans around:
112
+ - External HTTP calls (service, endpoint, status)
113
+ - DB queries (operation, table; never the full raw query — cardinality)
114
+ - Cache ops
115
+ - Queue enqueue / dequeue
116
+ - Expensive pure computations
117
+
118
+ Sampling: head-based (1–10% of requests fully traced) or tail-based (keep traces where something went wrong). Keep tracer overhead < 1% of request latency.
119
+
120
+ ## SLOs — the contract
121
+
122
+ An SLO is a number + a window. "99.9% of /orders responses succeed within 500 ms, measured over 28 days."
123
+
124
+ Error budget = `1 − SLO`. Over 28 days, 99.9% allows ≈40 min of downtime. When you burn the budget, freeze risky changes and invest in reliability.
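The arithmetic behind that figure, as a short sketch. Both function names are illustrative; the second assumes a request-based SLO where the budget is counted in failed requests rather than minutes.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability a time-based SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative when overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For 99.9% over 28 days: 28 × 24 × 60 = 40,320 minutes in the window, and 0.1% of that is 40.32 minutes.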
125
+
126
+ Don't set SLOs to what your service does today. Set them to what your users need.
127
+
128
+ ## Alerts — page on symptoms, not causes
129
+
130
+ Alert on "users are affected" (SLO burn rate, error rate spike, latency breach). Don't alert on "CPU is at 80%" — that's often fine.
131
+
132
+ Every alert must be:
133
+ - **Actionable** — there is something the oncall can do right now
134
+ - **Unambiguous** — one cause for the page, not "anything could have fired this"
135
+ - **Documented** — link to a runbook from the alert body
136
+
137
+ If an alert fires and the oncall thinks "not my problem" or "auto-resolves in 5 min", it's a bad alert. Delete or tune it.
138
+
139
+ ## Runbooks
140
+
141
+ One per alert. Structure:
142
+
143
+ ```
144
+ # Alert: api-latency-p99-high
145
+
146
+ ## What this means
147
+ p99 on /api/orders POST is > 1s for 5m.
148
+
149
+ ## Immediate checks
150
+ 1. Look at [dashboard-link]
151
+ 2. Check for recent deploy: [deploys-link]
152
+ 3. Check upstream health: [dep-status]
153
+
154
+ ## Common causes
155
+ - DB slow query → check [slow-query-dashboard]
156
+ - Cache outage → check redis metrics
157
+ - Upstream payment provider → check provider status page
158
+
159
+ ## Mitigation
160
+ - Roll back recent deploy if within 30 min window
161
+ - Failover to secondary region
162
+ - ...
163
+ ```
164
+
165
+ ## Health endpoints (minimal)
166
+
167
+ ```
168
+ GET /health/live → 200 if process can serve (don't check dependencies)
169
+ GET /health/ready → 200 only if dependencies are reachable (DB, cache, queue)
170
+ ```
171
+
172
+ Live failing → orchestrator restarts the pod.
173
+ Ready failing → orchestrator takes the pod out of the load balancer (but doesn't kill it).
174
+
175
+ Never put business logic in health checks. They should be cheap and boring.
176
+
177
+ ## Anti-patterns
178
+
179
+ | Anti-pattern | Why |
180
+ |---|---|
181
+ | String-formatted logs (`"user X did Y at Z"`) | Unqueryable |
182
+ | Logging full request bodies | PII leak, storage blow-up |
183
+ | Alerting on CPU / disk without symptom link | Pager fatigue; noise |
184
+ | No request correlation id | Can't stitch a failure across services |
185
+ | Logging at DEBUG in prod | Drowns the signal; storage cost |
186
+ | `avg_latency` as the only latency metric | Hides the outliers that hurt users |
187
+ | `status:500` as the only error signal | 200 with `{error: ...}` bodies exist and hurt |
188
+ | Metrics labels with user id / email | Cardinality explosion |
189
+ | Tracing everything, sampling nothing | Cost blowup; latency overhead |
190
+ | Alerts without runbooks | Oncall guesses, takes too long |
@@ -0,0 +1,201 @@
1
+ # Resilience
2
+
3
+ Timeouts, retries, circuit breakers, idempotency, background jobs. Make failures cheap.
4
+
5
+ ## Timeouts — every outbound call
6
+
7
+ No exceptions. A dependency that never answers will exhaust threads, sockets, and memory.
8
+
9
+ ```
10
+ # Wrong: no timeout
11
+ response = http.get("https://upstream/api")
12
+
13
+ # Right: fail fast
14
+ response = http.get("https://upstream/api", timeout=2.0)
15
+ ```
16
+
17
+ Timeout budget, layered:
18
+
19
+ ```
20
+ client 10s
21
+ └ gateway 8s
22
+ └ service 5s
23
+ └ dependency call 2s ← must be smaller than parent budget
24
+ ```
25
+
26
+ If the inner call's timeout is ≥ the outer's, the caller gives up first: the inner layer never gets to return a clean 504, and its work keeps running after nobody is waiting for the result.
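One way to keep inner timeouts inside the parent budget is to carry an absolute deadline with the request and derive each call's timeout from what is left. The `Deadline` class and its `reserve_seconds` parameter are assumptions for this sketch, not part of any specific framework.

```python
import time
from typing import Callable


class Deadline:
    """Absolute deadline shared by every call made on behalf of one request."""

    def __init__(self, budget_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        """Seconds left in the request budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, cap_seconds: float, reserve_seconds: float = 0.1) -> float:
        """Timeout for one dependency call: its own cap, but never more than
        what is left after reserving time to build an error response."""
        available = self.remaining() - reserve_seconds
        if available <= 0:
            raise TimeoutError("request budget exhausted before the call started")
        return min(cap_seconds, available)
```

A handler with a 5 s budget would create `Deadline(5.0)` once and pass `deadline.timeout_for_call(2.0)` as the timeout of each outbound call, so late calls automatically get less time.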
27
+
28
+ ## Retries — only for safe, transient failures
29
+
30
+ **Retryable**:
31
+ - Network timeouts
32
+ - 5xx on GET/idempotent calls
33
+ - 429 (with `Retry-After`)
34
+ - Explicit DB "retry" errors (e.g., serialization failures)
35
+
36
+ **NOT retryable**:
37
+ - 4xx other than 429 (client bug; retry won't help)
38
+ - Any non-idempotent call without an `Idempotency-Key`
39
+ - "Connection reset" where the write may have landed
40
+
41
+ ### Exponential backoff with jitter
42
+
43
+ Pure exponential backoff creates thundering herds when many clients fail together. Always add jitter.
44
+
45
+ ```
46
+ attempt(n):
47
+ base = 100ms
48
+ max = 10s
49
+ sleep = min(max, base * 2^n * random(0.5, 1.5))   # jitter first, then cap, so sleep never exceeds max
50
+ ```
51
+
52
+ Bound the total attempts and total time; don't let retries outlive the user's patience.
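A runnable sketch of bounded retries with jittered backoff follows. `TransientError` is a hypothetical marker exception standing in for whatever your client raises on retryable failures, and the `sleep` and `rng` parameters exist only so the behavior can be tested without waiting. Jitter is applied before the cap so the delay never exceeds it.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng: Callable[[float, float], float] = random.uniform) -> float:
    """Exponential delay with +/-50% jitter, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt) * rng(0.5, 1.5))


def call_with_retries(fn: Callable[[], T], max_attempts: int = 4, max_total: float = 15.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[float, float], float] = random.uniform) -> T:
    """Retry `fn` on TransientError, bounding both attempts and total sleep time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    slept = 0.0
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            delay = backoff_delay(attempt, rng=rng)
            if attempt == max_attempts - 1 or slept + delay > max_total:
                raise  # budget spent: surface the last failure
            sleep(delay)
            slept += delay
    raise AssertionError("unreachable")
```

Only `TransientError` is retried; anything else propagates on the first failure, which matches the retryable / not-retryable split above.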
53
+
54
+ ## Circuit breakers
55
+
56
+ When a dependency is sick, stop hammering it. Three states:
57
+
58
+ ```
59
+ CLOSED (normal)
60
+ │ failures exceed threshold
61
+
62
+ OPEN (fail fast, short-circuit calls)
63
+ │ after cool-down, try one request
64
+
65
+ HALF_OPEN ──success──► CLOSED
66
+
67
+ └─failure──────────► OPEN
68
+ ```
69
+
70
+ Thresholds to tune: error rate (e.g., >50% of last 20 calls), minimum sample size, cool-down time, half-open probe count.
71
+
72
+ Open-circuit response: fall back to cache, degraded response, or fail fast with 503. Never silently return empty data.
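The state machine can be sketched in a few lines. This version is deliberately simplified and its names are illustrative: it trips on consecutive failures rather than an error rate over a sliding window, it is not thread-safe, and it does not limit how many probes run while half-open.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means CLOSED

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "OPEN":
            raise CircuitOpenError("dependency unavailable, failing fast")
        probing = state == "HALF_OPEN"
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if probing or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # (re)open and restart the cool-down
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

Callers catch `CircuitOpenError` and choose the fallback explicitly (cache, degraded response, or 503), which keeps the "never silently return empty data" rule visible at the call site.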
73
+
74
+ ## Idempotency
75
+
76
+ Any operation that might be retried must be safe to run twice.
77
+
78
+ ### Idempotency keys
79
+
80
+ For non-GET HTTP writes, accept an `Idempotency-Key` header.
81
+
82
+ ```
83
+ POST /payments
84
+ Idempotency-Key: 7a8b9c...
85
+
86
+ server:
87
+ stored = store.get(key)
88
+ if stored and stored.request_hash == hash(body):
89
+ return stored.response
90
+ if stored:
91
+ return 409 # same key, different body → conflict
92
+ response = execute()
93
+ store.set(key, (hash(body), response), ttl=24h)
94
+ return response
95
+ ```
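The same flow as runnable Python, using an in-memory dictionary as a stand-in for a shared store such as Redis or a database table. The class and exception names are hypothetical. This sketch does not handle two concurrent requests with the same key; a real store would need an atomic insert-if-absent for that.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple


class IdempotencyConflict(Exception):
    """Same key reused with a different request body (maps to HTTP 409)."""


class IdempotencyStore:
    """In-memory stand-in for a shared store; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, Any, float]] = {}

    def run(self, key: str, body: dict, execute: Callable[[], Any]) -> Any:
        # Hash a canonical serialization so key order in the body does not matter.
        body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        entry = self._entries.get(key)
        if entry and entry[2] > self._clock():
            stored_hash, stored_response, _ = entry
            if stored_hash != body_hash:
                raise IdempotencyConflict(key)
            return stored_response  # replay: do not execute again
        response = execute()
        self._entries[key] = (body_hash, response, self._clock() + self._ttl)
        return response
```

A handler would call `store.run(request_key, request_body, lambda: charge_card(...))` and translate `IdempotencyConflict` into a 409; `charge_card` is a placeholder for the real side effect.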
96
+
97
+ ### Natural idempotency
98
+
99
+ Often better than keys: design the operation so repeats are harmless.
100
+
101
+ ```
102
+ # Not idempotent
103
+ UPDATE balance SET amount = amount + 10 WHERE id = 1
104
+
105
+ # Idempotent — absorbs double-apply
106
+ INSERT INTO ledger (id, account, amount) VALUES (:tx_id, 1, 10)
107
+ ON CONFLICT (id) DO NOTHING
108
+ ```
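The ledger approach can be tried end to end with SQLite from Python's standard library. This assumes a SQLite build with `ON CONFLICT ... DO NOTHING` support (3.24 or newer); the table layout and function names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger ("
    "id TEXT PRIMARY KEY, account INTEGER NOT NULL, amount INTEGER NOT NULL)"
)


def apply_credit(tx_id: str, account: int, amount: int) -> bool:
    """Record a ledger entry once; return False when tx_id was already applied."""
    cur = conn.execute(
        "INSERT INTO ledger (id, account, amount) VALUES (?, ?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        (tx_id, account, amount),
    )
    conn.commit()
    return cur.rowcount == 1


def balance(account: int) -> int:
    """Balance is derived from the ledger, so replays cannot inflate it."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE account = ?", (account,)
    ).fetchone()
    return row[0]
```

Applying the same `tx_id` twice leaves the balance unchanged, because the primary key absorbs the duplicate instead of the application having to remember it.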
109
+
110
+ ## Graceful degradation
111
+
112
+ When a non-critical dependency is down, return a usable response, not an error.
113
+
114
+ ```
115
+ product = productRepo.get(id)
116
+ try:
117
+ product.recommendations = recService.for(id, timeout=300ms)
118
+ except (Timeout, ServiceError):
119
+ product.recommendations = [] # degrade, don't fail
120
+ return product
121
+ ```
122
+
123
+ Decide up front which pieces are essential vs. nice-to-have. Never degrade silently on essentials (payments, auth).
124
+
125
+ ## Background jobs
126
+
127
+ For anything not strictly needed in the request path: save, enqueue, return.
128
+
129
+ ```
130
+ # Request path
131
+ handler(req):
132
+ order = orderRepo.save(newOrder)
133
+ queue.enqueue(SendOrderEmail(order.id)) # defer
134
+ queue.enqueue(UpdateSearchIndex(order.id))
135
+ return 201
136
+ ```
137
+
138
+ Queue requirements:
139
+ - **Durable** — enqueue survives broker restart (disk, replicated)
140
+ - **At-least-once delivery** — so jobs must be idempotent
141
+ - **Dead-letter queue** — after N failures, park the message and alert
142
+ - **Visibility timeout** — consumer crashes → job requeues automatically
143
+
144
+ Common choices: Postgres-backed (pgboss, solid-queue), Redis (BullMQ, Sidekiq), managed (SQS, Cloud Tasks), streaming (Kafka).
145
+
146
+ ## Scheduled jobs
147
+
148
+ Two traps:
149
+ 1. **Lock per job** — multiple replicas must not run the same job twice. Use a DB advisory lock or a leader-election lib.
150
+ 2. **Overlap** — if a job runs longer than its interval, the next tick starts before the previous ends. Decide: skip, queue, or overlap — explicitly.
151
+
152
+ Don't use `cron` on a single VM in production; it dies with the VM. Use a platform scheduler (Kubernetes CronJob, cloud scheduler) + idempotent job logic.
153
+
154
+ ## Health checks
155
+
156
+ Two separate endpoints:
157
+
158
+ ```
159
+ GET /health/live # Am I running? (200 = process alive)
160
+ GET /health/ready # Can I take traffic? (checks DB, cache, queue connectivity)
161
+ ```
162
+
163
+ Orchestrators (K8s, load balancers) need both. `/health/ready` failing for 30s → take the pod out of rotation, don't kill it.
164
+
165
+ ## Graceful shutdown
166
+
167
+ On SIGTERM:
168
+ 1. Stop accepting new requests (`/health/ready` → 503).
169
+ 2. Finish in-flight requests (with a hard deadline, e.g., 30 s).
170
+ 3. Drain the job consumer.
171
+ 4. Close DB pools and sockets.
172
+ 5. Exit.
173
+
174
+ Without this, a deploy drops requests and leaves half-processed jobs.
175
+
176
+ ## Bulkheads
177
+
178
+ Isolate failure domains so one tenant / feature can't drown the others.
179
+
180
+ - Separate thread pool / connection pool per downstream service
181
+ - Separate queue / worker group per job class
182
+ - Separate rate limit per tenant
183
+
184
+ One noisy neighbor should degrade its own lane, not everyone's.
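A per-dependency concurrency cap is one simple way to build such a lane. In this sketch the `Bulkhead` class and the two lane names are hypothetical; it rejects immediately when the lane is full rather than queueing, which is a design choice, not a requirement.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """The lane for this dependency is saturated; shed load instead of queueing."""


class Bulkhead:
    """Caps concurrent calls per downstream so one slow dependency cannot
    occupy every worker in the process."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: a full lane fails fast instead of piling up waiters.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical lanes: each downstream gets its own limit.
payments_lane = Bulkhead("payments", max_concurrent=2)
recommendations_lane = Bulkhead("recommendations", max_concurrent=1)
```

Each outbound call is wrapped in `with recommendations_lane.slot(): ...`; when recommendations stall, only that lane raises `BulkheadFull`, and payment calls keep their own slots.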
185
+
186
+ ## Timeouts for tasks, not just HTTP
187
+
188
+ DB query timeouts (`statement_timeout` in Postgres), job max runtime, lock wait timeout: all of them must be finite. Anything unbounded will eventually hang something.
189
+
190
+ ## Anti-patterns
191
+
192
+ | Anti-pattern | Why |
193
+ |---|---|
194
+ | Infinite retries | One bad day becomes a queue explosion |
195
+ | Retries without backoff | Synchronized thundering herds |
196
+ | Retry on POST without idempotency key | Duplicate payments, double-sends |
197
+ | Shared retry budget across unrelated calls | One bad dep exhausts retries for healthy ones |
198
+ | Catching all exceptions to mask failures | Bugs silently go to prod |
199
+ | Fire-and-forget without a dead-letter queue | Failed jobs vanish with no alert |
200
+ | "Run every N seconds" cron on a single machine | Loses work on reboot |
201
+ | Waiting forever for a lock | Locks don't auto-expire unless you say so |
@@ -0,0 +1,186 @@
1
+ # Security
2
+
3
+ Authentication, authorization, input validation, rate limiting, secrets. The non-negotiables.
4
+
5
+ ## AuthN vs AuthZ
6
+
7
+ | | Authentication | Authorization |
8
+ |---|---|---|
9
+ | Answers | Who are you? | What can you do? |
10
+ | Failure code | 401 | 403 |
11
+ | Mechanism | Session, token, signature | Role / policy / permission check |
12
+
13
+ Never conflate these. A 401 says "tell me who you are"; a 403 says "I know who you are and you can't do this".
14
+
15
+ ## Session vs token
16
+
17
+ | | Server session | Stateless token (JWT) |
18
+ |---|---|---|
19
+ | State | Server-side (DB / Redis) | In the token itself |
20
+ | Revocation | Delete session row | Hard — need blocklist or short TTL |
21
+ | Scale | Needs sticky / shared store | Stateless across servers |
22
+ | Size on wire | Small (cookie id) | Large (signed payload) |
23
+ | First-party web | Excellent | Overkill |
24
+ | Service-to-service | Weak | Natural fit |
25
+
26
+ For a browser-based web app, **server-side sessions with secure cookies** are usually the right answer. JWTs shine for APIs, federation, and service-to-service.
27
+
28
+ ## Cookies — secure defaults
29
+
30
+ ```
31
+ Set-Cookie: session=abc...; Secure; HttpOnly; SameSite=Lax; Path=/
32
+ ```
33
+
34
+ - **Secure** — HTTPS only.
35
+ - **HttpOnly** — JS can't read it (blocks XSS-based token theft).
36
+ - **SameSite=Lax** — default; blocks CSRF on cross-site POSTs. Use `Strict` for admin; `None` + `Secure` only for true cross-origin use cases.
37
+ - **Path** — scope to where it's needed.
38
+ - Don't store user data in the cookie payload; store an opaque session id.
39
+
40
+ ## JWT rules
41
+
42
+ - Always check signature. Reject `alg: none`. Reject unexpected algorithms.
43
+ - Verify `iss`, `aud`, `exp`, `nbf`.
44
+ - Short lifetime (5–15 min) + rotating refresh token.
45
+ - Don't put secrets inside; tokens are readable by anyone who has them.
46
+ - Rotate signing keys; publish via JWKS.
47
+ - Revocation: maintain a short jti blocklist in Redis for stolen-token cases.
48
+
49
+ ## Authorization models
50
+
51
+ | Model | Use when |
52
+ |---|---|
53
+ | RBAC (roles) | Small fixed set of roles: admin, user, moderator |
54
+ | ABAC (attributes) | Rules depend on attributes of user, resource, time, IP |
55
+ | ReBAC (relationships) | "Can Alice read doc X?" answered via a graph (Google Zanzibar / OpenFGA) |
56
+ | Policy-as-code (OPA, Cedar) | Complex rules that need to live outside the app |
57
+
58
+ Start with RBAC. Graduate to ReBAC/ABAC when roles no longer express the rules. Never hard-code `if user.email == "admin@x.com"`.
59
+
60
+ ## Enforce authorization at the boundary
61
+
62
+ Every handler starts with a permission check. No implicit trust.
63
+
64
+ ```
65
+ handler(req):
66
+ user = requireAuth(req)
67
+ resource = repo.load(req.id)
68
+ if not user.can(READ, resource):
69
+ return 403 | 404 # 404 if the existence of the resource is itself secret
70
+ return resource
71
+ ```
72
+
73
+ `403` vs `404`: return 404 if the existence of the resource is itself secret (e.g., private documents); return 403 otherwise.
74
+
75
+ ## Input validation
76
+
77
+ Validate everything at the edge, once. Never trust "internal" callers.
78
+
79
+ ```
80
+ schema:
81
+ email : string, format=email
82
+ age : int, 0 <= x <= 150
83
+ role : enum(user, admin)
84
+
85
+ handler(req):
86
+ cmd = schema.parse(req.body) # rejects anything else
87
+ useCase.execute(cmd)
88
+ ```
89
+
90
+ Rules:
91
+ - Whitelist what you accept rather than blacklisting what you reject.
92
+ - Reject unknown fields (guard against mass-assignment).
93
+ - Bound all variable-size inputs (strings, arrays): `max_length`, `max_items`.
94
+ - Parse into strong types at the boundary; don't pass raw dicts through the system.
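A dependency-free sketch of these rules for the schema above. In practice a validation library would usually do this work; the hand-written parser, the loose email regex, and the 254-character bound are assumptions chosen to keep the example self-contained.

```python
import re
from dataclasses import dataclass
from typing import Any, Mapping

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose format check
ALLOWED_ROLES = {"user", "admin"}
MAX_EMAIL_LENGTH = 254  # assumed bound for this example


class ValidationError(ValueError):
    pass


@dataclass(frozen=True)
class CreateUserCommand:
    email: str
    age: int
    role: str


def parse_create_user(body: Mapping[str, Any]) -> CreateUserCommand:
    """Turn an untrusted mapping into a typed command, or raise ValidationError."""
    expected = {"email", "age", "role"}
    unknown = set(body) - expected
    if unknown:  # reject unknown fields: guards against mass assignment
        raise ValidationError(f"unknown fields: {sorted(unknown)}")
    missing = expected - set(body)
    if missing:
        raise ValidationError(f"missing fields: {sorted(missing)}")
    email, age, role = body["email"], body["age"], body["role"]
    if not isinstance(email, str) or len(email) > MAX_EMAIL_LENGTH or not EMAIL_RE.match(email):
        raise ValidationError("email: invalid")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or not 0 <= age <= 150:
        raise ValidationError("age: must be an integer in 0..150")
    if not isinstance(role, str) or role not in ALLOWED_ROLES:
        raise ValidationError("role: not allowed")
    return CreateUserCommand(email=email, age=age, role=role)
```

Everything past the handler receives a frozen `CreateUserCommand`, so downstream code never has to re-check a raw dictionary.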
95
+
96
+ ## Injection defenses
97
+
98
+ - **SQL**: parameterized queries ONLY. Never string-concatenate. ORMs handle this if you use their query API, not raw strings.
99
+ - **Command**: don't build shell commands from user input. If you must: use array-form `exec` (no shell) and whitelist args.
100
+ - **LDAP / XPath / NoSQL**: same rule — parameterize.
101
+ - **Template injection**: never render user input as a template (Jinja2, ERB, etc.).
102
+ - **Path traversal**: canonicalize and assert the result is inside an allow-listed directory.
103
+ - **Prototype pollution / mass assignment**: whitelist fields; never `Object.assign(user, req.body)`.
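Two of these defenses, parameterized SQL and path canonicalization, sketched with the Python standard library. The `users` table, the sample row, and both function names are made up for the illustration.

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))


def find_user_by_email(email: str):
    # The driver sends `email` as a bound value; it is never parsed as SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchall()


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Canonicalize a user-supplied path and refuse anything that escapes base_dir."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise PermissionError(f"path escapes {base}")
    return candidate
```

With the bound parameter, a classic payload such as `' OR '1'='1` is just a string that matches no email. With `resolve_inside`, both `../` sequences and absolute paths are rejected because the check runs on the resolved result, not on the raw input.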
104
+
105
+ ## Passwords
106
+
107
+ - **argon2id** (preferred) or **bcrypt** (with cost ≥ 12). Never SHA-* for passwords.
108
+ - Never log passwords, even hashed.
109
+ - Enforce length (≥ 12 chars), not character classes. Check against a breached-password list (HaveIBeenPwned API / offline list).
110
+ - Account-level lockout on repeated failures, plus rate limiting per IP/account.
111
+
112
+ ## Secrets
113
+
114
+ - Never in source control. `.env` files are .gitignored; production secrets come from a secret manager (Vault, AWS Secrets Manager, GCP Secret Manager, 1Password Connect).
115
+ - Rotate on compromise AND on a schedule.
116
+ - Scope per-service and per-environment. One stolen dev key should never reach prod.
117
+ - Don't print secrets to logs. Redact at the logger config.
118
+
119
+ ## Rate limiting
120
+
121
+ Apply at the edge (CDN/API gateway) AND per-endpoint in the app.
122
+
123
+ Limits by identity:
124
+ - Anonymous: by IP — coarse, bypassable with proxies.
125
+ - Authenticated: by user id — reliable.
126
+ - Authenticated + IP: both, for defense in depth.
127
+
128
+ Algorithms:
129
+ - **Token bucket**: allows short bursts; refill rate controls long-run.
130
+ - **Fixed window**: simple, but bursty at boundaries.
131
+ - **Sliding window**: smooth; costs more.
132
+
133
+ Always include `Retry-After` on 429 responses.
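A token bucket fits in a few lines and can also produce the `Retry-After` value. This is a single-process sketch with illustrative names; a real deployment would typically keep the bucket state in a shared store, keyed by user id or IP.

```python
import math
import time
from typing import Callable, Tuple


class TokenBucket:
    """Allows bursts up to `capacity`; the sustained rate is `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity
        self._updated_at = clock()

    def try_acquire(self, cost: float = 1.0) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds); retry_after is 0 when allowed."""
        now = self._clock()
        elapsed = now - self._updated_at
        # Refill lazily based on elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._updated_at = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True, 0
        wait = (cost - self._tokens) / self.refill_per_second
        return False, math.ceil(wait)  # whole seconds for the Retry-After header
```

When `try_acquire` returns `(False, n)`, the handler responds with 429 and `Retry-After: n`.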
134
+
135
+ ## CSRF
136
+
137
+ Required if the client is a browser using cookies. Not required if you use `Authorization: Bearer` (the browser never attaches that header automatically, so a cross-site request can't carry it).
138
+
139
+ Defenses, pick one:
140
+ - **SameSite=Lax cookie** (the default; covers most cases).
141
+ - **Double-submit cookie** — random token in cookie AND in a header; server checks they match.
142
+ - **Synchronizer token** — per-session token in the form + server-side store.
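The double-submit check can be sketched as follows. The function names and the set of safe methods are assumptions for the example; note that the CSRF cookie in this scheme has to be readable by client-side JavaScript, so it is a separate cookie from the `HttpOnly` session cookie.

```python
import hmac
import secrets
from typing import Optional

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def issue_csrf_token() -> str:
    """Random value the server sets in a cookie; client JS echoes it in a header."""
    return secrets.token_urlsafe(32)


def csrf_ok(method: str, cookie_token: Optional[str], header_token: Optional[str]) -> bool:
    """Double-submit check: state-changing requests need matching cookie and header."""
    if method.upper() in SAFE_METHODS:
        return True
    if not cookie_token or not header_token:
        return False
    # Constant-time comparison avoids leaking how many leading characters matched.
    return hmac.compare_digest(cookie_token.encode(), header_token.encode())
```

A cross-site form post carries the cookie automatically but cannot set the matching header, so `csrf_ok` returns `False` for it.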
143
+
144
+ ## CORS
145
+
146
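+ Set it to what you actually need. `Access-Control-Allow-Origin: *` cannot be combined with credentials (browsers reject the response), so the real risk is the common workaround: a gateway that reflects any request `Origin` back alongside `Access-Control-Allow-Credentials: true`, which exposes authenticated responses to every site.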
147
+
148
+ ```
149
+ Access-Control-Allow-Origin: https://app.example.com
150
+ Access-Control-Allow-Credentials: true
151
+ Access-Control-Allow-Methods: GET, POST, PATCH, DELETE
152
+ Access-Control-Allow-Headers: Authorization, Content-Type
153
+ Access-Control-Max-Age: 86400
154
+ ```
155
+
156
+ ## Security headers (for any HTML-serving endpoint)
157
+
158
+ ```
159
+ Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
160
+ Content-Security-Policy: default-src 'self'; ...
161
+ X-Content-Type-Options: nosniff
162
+ Referrer-Policy: strict-origin-when-cross-origin
163
+ Permissions-Policy: camera=(), microphone=(), geolocation=()
164
+ ```
165
+
166
+ ## Audit logging
167
+
168
+ Every security-relevant event gets an immutable log entry:
169
+ - login (success / fail), password change, role change, permission change
170
+ - admin actions, data exports
171
+ - access to sensitive resources
172
+
173
+ Include: actor, action, target, timestamp, source IP, request id. Store separately from app logs so a compromised app can't tamper with them.
174
+
175
+ ## Anti-patterns
176
+
177
+ | Anti-pattern | Why |
178
+ |---|---|
179
+ | Rolling your own crypto | Don't. Use the standard library / vetted lib. |
180
+ | Comparing secrets with `==` | Timing attack; use constant-time compare |
181
+ | Returning different errors for "user doesn't exist" vs "wrong password" | Username enumeration |
182
+ | Trusting `X-Forwarded-For` without checking source | Spoofable; respect it only from trusted proxies |
183
+ | One API key per team, shared over Slack | No revocation granularity |
184
+ | Storing JWTs in localStorage | XSS steals them; use HttpOnly cookies |
185
+ | "Security through obscurity" (weird endpoint paths) | Not a control |
186
+ | Disabling TLS verification "temporarily" in prod | Never |
@@ -0,0 +1,119 @@
1
+ ---
2
+ name: backend-architect
3
+ description: Persona skill — think like a backend architect. System boundaries, data flow, scaling, failure modes. Overlay on top of `backend` + language skills. For the patterns themselves, load `backend`.
4
+ origin: agency-agents-fork + original (https://github.com/msitarzewski/agency-agents, MIT)
5
+ ---
6
+
7
+ # Backend Architect
8
+
9
+ Think like a backend architect. This skill is a **mindset overlay**, not a pattern catalogue — load `backend` for patterns.
10
+
11
+ ## When to load
12
+
13
+ - Designing a new service / feature
14
+ - Reviewing an architectural proposal
15
+ - Debating storage / queue / cache choices
16
+ - Reviewing a migration plan
17
+ - Choosing between build vs. buy / in-house vs. managed
18
+
19
+ ## The posture
20
+
21
+ 1. **Draw the boundaries first.** Service A knows nothing of Service B's internals. Any leak is an eventual coupling bug.
22
+ 2. **Favor boring technology.** Postgres + a job queue solves 90% of problems. Reach for specialized tools only when boring can't.
23
+ 3. **Design for the failure cases.** What happens when the DB is slow, the queue is backed up, the API key rotates, the region goes down?
24
+ 4. **Measure before optimizing.** "Could be a bottleneck" is a hypothesis, not evidence.
25
+ 5. **Data is the hard part.** Compute scales; data is where consistency, durability, and migrations bite.
26
+ 6. **Decisions > diagrams.** A clean ADR that records WHY this over that outlives any whiteboard.
27
+ 7. **Operational load is a product requirement.** If oncall hates it at 3am, it's not done.
28
+
29
+ ## The questions you always ask
30
+
31
+ Before approving or shipping a design:
32
+
33
+ - **What's the failure mode?** What breaks first, and what does the user see?
34
+ - **What's the blast radius?** Does a bug in this service hurt just this feature, or take the whole site down?
35
+ - **What's the rollback story?** How do we get back if this deploy is bad?
36
+ - **How does this scale 10×?** Will this design hold at 10× the current load?
37
+ - **Where's the data authority?** If two stores disagree, who wins?
38
+ - **What's the consistency model?** Strong, eventual, read-your-writes — per data type?
39
+ - **What invariants does the DB enforce vs. the app?** Every invariant the app "promises" is a race away from being wrong.
40
+ - **What observability does a developer get at 3am?** Logs, metrics, traces for the failure mode.
41
+ - **Is this idempotent?** Every write must be safe to retry.
42
+ - **Is the contract stable?** What's the versioning plan for public interfaces?
43
+
44
+ ## The checklist
45
+
46
+ For a new service or major feature, walk through:
47
+
48
+ ### Contract
49
+ - [ ] API design: REST / GraphQL / gRPC chosen with reason.
50
+ - [ ] Error shape and status codes standardized.
51
+ - [ ] Versioning strategy.
52
+ - [ ] Idempotency keys on writes that are not naturally idempotent (e.g. POST).
53
+
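The idempotency-key item in the Contract checklist can be sketched as follows. This is an in-memory illustration only; `IdempotentHandler` is a hypothetical name, and a real service would presumably keep the key-to-response mapping in a durable store with a uniqueness constraint so that concurrent retries are also covered.

```python
from typing import Callable, Dict


class IdempotentHandler:
    """Runs the wrapped handler once per idempotency key and replays the
    stored response for any retry that carries the same key."""

    def __init__(self, handler: Callable[[dict], dict]) -> None:
        self._handler = handler
        self._responses: Dict[str, dict] = {}  # key -> first response

    def handle(self, idempotency_key: str, payload: dict) -> dict:
        if idempotency_key in self._responses:
            # Retry of a request we already processed: no second side effect.
            return self._responses[idempotency_key]
        response = self._handler(payload)
        self._responses[idempotency_key] = response
        return response
```

A client that times out and resends the same request with the same key gets the original response back instead of creating a duplicate.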
54
+ ### Data
55
+ - [ ] Schema reviewed for normalization, constraints, types.
56
+ - [ ] Foreign keys declared, not just "promised".
57
+ - [ ] Indexes match the real queries.
58
+ - [ ] Migration plan is expand/contract.
59
+ - [ ] Backup and restore tested.
60
+
61
+ ### Infra
62
+ - [ ] Timeouts on every outbound call.
63
+ - [ ] Retries only on idempotent ops, with backoff and jitter.
64
+ - [ ] Circuit breaker or fallback for dependencies.
65
+ - [ ] Resource limits (CPU, memory, pool sizes) sized, not left as defaults.
66
+
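One way to read the retry item in the Infra checklist, sketched with exponential backoff and full jitter. The function name `retry_idempotent` and its defaults (`attempts`, `base`, `cap`) are assumptions chosen for illustration, and the caller is responsible for passing only operations that are safe to repeat.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_idempotent(
    op: Callable[[], T],
    attempts: int = 3,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Call `op` up to `attempts` times, sleeping a random amount between tries.

    The sleep is drawn uniformly from [0, min(cap, base * 2**attempt)], so
    many clients retrying at once do not hit the dependency in lockstep.
    """
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the behaviour can be tested without real delays. The per-call timeout itself still has to be set on the outbound client.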
67
+ ### Operations
68
+ - [ ] Health check endpoints (/health/live, /health/ready).
69
+ - [ ] Graceful shutdown on SIGTERM.
70
+ - [ ] Structured logs with request / trace id.
71
+ - [ ] Key metrics exposed (RED signals + saturation).
72
+ - [ ] Alerts defined with runbooks.
73
+ - [ ] Oncall documented in service catalogue.
74
+
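The split between `/health/live` and `/health/ready` might look like the framework-agnostic sketch below: liveness says only that the process is up, while readiness also consults dependency checks. The function names and the response shape are hypothetical; wiring them to actual routes is left to whatever HTTP framework the service uses.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """The process is running; no dependencies are consulted."""
    return 200, {"status": "alive"}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Return 200 only if every dependency check passes, else 503.

    A check that raises is treated as a failure rather than crashing
    the health endpoint itself.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    ok = all(results.values())
    return (200 if ok else 503), {
        "status": "ready" if ok else "not_ready",
        "checks": results,
    }
```

Keeping the two apart lets an orchestrator stop routing traffic to an instance whose database is unreachable without restarting a process that is otherwise healthy.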
75
+ ### Security
76
+ - [ ] Auth check at the boundary.
77
+ - [ ] Input validated at the edge.
78
+ - [ ] Secrets pulled from secret manager, not config.
79
+ - [ ] PII handling documented.
80
+ - [ ] Rate limiting on public endpoints.
81
+
82
+ ### Rollout
83
+ - [ ] Feature flag if behaviour-changing.
84
+ - [ ] Deploy plan: dev → staging → canary → prod.
85
+ - [ ] Rollback command documented.
86
+ - [ ] Observability dashboards exist before release.
87
+
88
+ ## Tradeoffs you name explicitly
89
+
90
+ - **Strong consistency vs. throughput** — pick per-data-type.
91
+ - **Sync vs. async** — user waiting ≠ background reliability.
92
+ - **Monolith vs. services** — don't split until scale / team pain demands.
93
+ - **Build vs. buy** — buy the commodity; build where you compete.
94
+ - **Flexibility vs. simplicity** — the "flexible" option usually has the higher total cost.
95
+
96
+ ## What you push back on
97
+
98
+ - **Premature microservices.** Added complexity for no measurable benefit.
99
+ - **Ad-hoc schema fields** shoved into JSON columns to "move fast". Within months they are being queried like real columns, without the types, constraints, or indexes, and everyone regrets it.
100
+ - **"Reactive everything"** where a simple sync call would work.
101
+ - **Home-rolled queues / sharding / consensus.** Almost always the wrong build.
102
+ - **Decisions without ADRs.** The reason is always the first thing lost.
103
+
104
+ ## Forbidden patterns
105
+
106
+ - Architecture diagrams without failure annotations
107
+ - Proposals that skip "what happens if X is down"
108
+ - Two-phase commit across service boundaries (usually a sign the services should be one)
109
+ - Cross-service database joins ("just query the other team's DB")
110
+ - Silent coupling — services that "happen to know" each other's internals
111
+ - New services without owners, dashboards, and oncall
112
+ - Technology choices made because "it's popular"
113
+
114
+ ## Pair with
115
+
116
+ - [`backend`](../backend/SKILL.md) — the patterns.
117
+ - [`database`](../database/SKILL.md) — schema / scaling details.
118
+ - [`devops`](../devops/SKILL.md) — how it deploys and is operated.
119
+ - [`architecture-decision-records`](../architecture-decision-records/SKILL.md) — recording the decisions.