@coralai/sps-cli 0.42.0 → 0.43.0

Files changed (109)
  1. package/README.md +34 -3
  2. package/dist/commands/projectInit.d.ts.map +1 -1
  3. package/dist/commands/projectInit.js +40 -53
  4. package/dist/commands/projectInit.js.map +1 -1
  5. package/dist/commands/skillCommand.d.ts +2 -0
  6. package/dist/commands/skillCommand.d.ts.map +1 -0
  7. package/dist/commands/skillCommand.js +235 -0
  8. package/dist/commands/skillCommand.js.map +1 -0
  9. package/dist/core/skillStore.d.ts +46 -0
  10. package/dist/core/skillStore.d.ts.map +1 -0
  11. package/dist/core/skillStore.js +197 -0
  12. package/dist/core/skillStore.js.map +1 -0
  13. package/dist/core/skillStore.test.d.ts +2 -0
  14. package/dist/core/skillStore.test.d.ts.map +1 -0
  15. package/dist/core/skillStore.test.js +190 -0
  16. package/dist/core/skillStore.test.js.map +1 -0
  17. package/dist/main.js +19 -17
  18. package/dist/main.js.map +1 -1
  19. package/package.json +1 -1
  20. package/skills/architecture-decision-records/SKILL.md +207 -0
  21. package/skills/backend/SKILL.md +62 -0
  22. package/skills/backend/references/api-design.md +168 -0
  23. package/skills/backend/references/caching.md +181 -0
  24. package/skills/backend/references/data-access.md +173 -0
  25. package/skills/backend/references/layering.md +181 -0
  26. package/skills/backend/references/observability.md +190 -0
  27. package/skills/backend/references/resilience.md +201 -0
  28. package/skills/backend/references/security.md +186 -0
  29. package/skills/backend-architect/SKILL.md +119 -0
  30. package/skills/code-reviewer/SKILL.md +143 -0
  31. package/skills/coding-standards/SKILL.md +60 -0
  32. package/skills/coding-standards/references/clean-code.md +258 -0
  33. package/skills/coding-standards/references/code-review.md +192 -0
  34. package/skills/coding-standards/references/commits-and-prs.md +226 -0
  35. package/skills/coding-standards/references/error-strategy.md +193 -0
  36. package/skills/coding-standards/references/naming.md +185 -0
  37. package/skills/coding-standards/references/tdd.md +171 -0
  38. package/skills/database/SKILL.md +53 -0
  39. package/skills/database/references/indexing.md +190 -0
  40. package/skills/database/references/migrations.md +199 -0
  41. package/skills/database/references/nosql.md +185 -0
  42. package/skills/database/references/queries.md +295 -0
  43. package/skills/database/references/scaling.md +203 -0
  44. package/skills/database/references/schema.md +191 -0
  45. package/skills/database-optimizer/SKILL.md +168 -0
  46. package/skills/debugging-workflow/SKILL.md +244 -0
  47. package/skills/devops/SKILL.md +55 -0
  48. package/skills/devops/references/ci-cd.md +204 -0
  49. package/skills/devops/references/containers.md +272 -0
  50. package/skills/devops/references/deploy.md +201 -0
  51. package/skills/devops/references/iac.md +252 -0
  52. package/skills/devops/references/observability.md +228 -0
  53. package/skills/devops/references/secrets.md +178 -0
  54. package/skills/devops-automator/SKILL.md +164 -0
  55. package/skills/frontend/SKILL.md +52 -0
  56. package/skills/frontend/references/accessibility.md +222 -0
  57. package/skills/frontend/references/components.md +206 -0
  58. package/skills/frontend/references/performance.md +219 -0
  59. package/skills/frontend/references/routing.md +209 -0
  60. package/skills/frontend/references/state.md +190 -0
  61. package/skills/frontend/references/testing.md +216 -0
  62. package/skills/frontend-developer/SKILL.md +115 -0
  63. package/skills/git-workflow/SKILL.md +355 -0
  64. package/skills/golang/SKILL.md +49 -0
  65. package/skills/golang/references/concurrency.md +284 -0
  66. package/skills/golang/references/errors.md +241 -0
  67. package/skills/golang/references/idioms.md +285 -0
  68. package/skills/golang/references/testing.md +238 -0
  69. package/skills/java/SKILL.md +50 -0
  70. package/skills/java/references/concurrency.md +194 -0
  71. package/skills/java/references/idioms.md +283 -0
  72. package/skills/java/references/testing.md +228 -0
  73. package/skills/kotlin/SKILL.md +47 -0
  74. package/skills/kotlin/references/coroutines.md +240 -0
  75. package/skills/kotlin/references/idioms.md +268 -0
  76. package/skills/kotlin/references/testing.md +219 -0
  77. package/skills/mobile/SKILL.md +50 -0
  78. package/skills/mobile/references/architecture.md +204 -0
  79. package/skills/mobile/references/navigation.md +158 -0
  80. package/skills/mobile/references/performance.md +152 -0
  81. package/skills/mobile/references/platform.md +166 -0
  82. package/skills/mobile/references/state-and-data.md +174 -0
  83. package/skills/python/SKILL.md +51 -0
  84. package/skills/python/THIRD_PARTY.md +14 -0
  85. package/skills/python/references/async.md +218 -0
  86. package/skills/python/references/error-handling.md +254 -0
  87. package/skills/python/references/idioms.md +279 -0
  88. package/skills/python/references/packaging.md +233 -0
  89. package/skills/python/references/testing.md +269 -0
  90. package/skills/python/references/typing.md +292 -0
  91. package/skills/qa-tester/SKILL.md +186 -0
  92. package/skills/rust/SKILL.md +50 -0
  93. package/skills/rust/references/async.md +224 -0
  94. package/skills/rust/references/errors.md +240 -0
  95. package/skills/rust/references/ownership.md +263 -0
  96. package/skills/rust/references/testing.md +274 -0
  97. package/skills/rust/references/traits.md +250 -0
  98. package/skills/security-engineer/SKILL.md +157 -0
  99. package/skills/swift/SKILL.md +48 -0
  100. package/skills/swift/references/concurrency.md +280 -0
  101. package/skills/swift/references/idioms.md +334 -0
  102. package/skills/swift/references/testing.md +229 -0
  103. package/skills/typescript/SKILL.md +51 -0
  104. package/skills/typescript/references/async.md +241 -0
  105. package/skills/typescript/references/errors.md +208 -0
  106. package/skills/typescript/references/idioms.md +246 -0
  107. package/skills/typescript/references/testing.md +225 -0
  108. package/skills/typescript/references/tooling.md +208 -0
  109. package/skills/typescript/references/types.md +259 -0
@@ -0,0 +1,173 @@
1
+ # Data Access
2
+
3
+ Transactions, queries, migrations, connection pooling. Language-neutral patterns.
4
+
5
+ ## N+1 queries — the universal killer
6
+
7
+ The single most common backend performance bug.
8
+
9
+ ```
10
+ # ❌ N+1
11
+ orders = orderRepo.findAll() # 1 query
12
+ for order in orders:
13
+     order.user = userRepo.find(order.userId) # N queries
14
+
15
+ # ✅ Batch fetch
16
+ orders = orderRepo.findAll()
17
+ userIds = unique(o.userId for o in orders)
18
+ users = userRepo.findByIds(userIds) # 1 query
19
+ userMap = { u.id: u for u in users }
20
+ for o in orders:
21
+     o.user = userMap[o.userId]
22
+
23
+ # ✅ Join (if the ORM supports eager loading)
24
+ orders = orderRepo.findAll(include=['user'])
25
+ ```
26
+
27
+ Detect early: log every query in test mode; assert query count on hot paths.
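One way to assert query counts in a test, sketched with Python's standard `sqlite3` module as a stand-in for the real database. The helper name `count_queries`, the tables, and the two loader functions are hypothetical; an ORM would normally expose its own query log or hook instead.

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def count_queries(conn):
    """Collect every SQL statement executed on `conn` inside the block."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        yield statements
    finally:
        conn.set_trace_callback(None)

def setup_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
        INSERT INTO users VALUES (1, 'ada'), (2, 'bob'), (3, 'cy');
        INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3), (13, 1);
    """)
    return conn

def load_n_plus_one(conn):
    # One query for the orders, then one more per order.
    orders = conn.execute("SELECT id, user_id FROM orders").fetchall()
    return [
        (oid, conn.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchone()[0])
        for oid, uid in orders
    ]

def load_batched(conn):
    # One query for the orders, one for all referenced users.
    orders = conn.execute("SELECT id, user_id FROM orders").fetchall()
    user_ids = sorted({uid for _, uid in orders})
    marks = ",".join("?" * len(user_ids))
    rows = conn.execute(f"SELECT id, name FROM users WHERE id IN ({marks})", user_ids)
    names = dict(rows.fetchall())
    return [(oid, names[uid]) for oid, uid in orders]
```

In a test, wrap the hot path in `with count_queries(conn) as stmts:` and assert on `len(stmts)`; a regression to N+1 then fails the build rather than surfacing in production.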
28
+
29
+ ## Select only what you need
30
+
31
+ Wide `SELECT *` costs bandwidth and memory, and breaks when the schema changes.
32
+
33
+ ```
34
+ # ❌
35
+ SELECT * FROM users WHERE active = true
36
+
37
+ # ✅
38
+ SELECT id, email, name FROM users WHERE active = true
39
+ ```
40
+
41
+ ## Indexes
42
+
43
+ An index is a write-time tax for a read-time refund. Worth it on columns used in WHERE, JOIN, ORDER BY of hot queries.
44
+
45
+ ```
46
+ # Common first indexes
47
+ CREATE INDEX idx_orders_user_id ON orders(user_id);
48
+ CREATE INDEX idx_orders_status ON orders(status) WHERE status = 'pending'; -- partial
49
+ CREATE INDEX idx_users_email_lower ON users (LOWER(email)); -- expression
50
+ ```
51
+
52
+ Rules:
53
+ - Read the query plan. Don't guess.
54
+ - Composite index order matters: `(user_id, created_at)` helps `WHERE user_id = ? ORDER BY created_at`, not the reverse.
55
+ - Every index slows writes. More indexes ≠ faster system.
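A minimal sketch of "read the query plan", using SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module. SQLite is only a stand-in here: Postgres and MySQL have their own `EXPLAIN` output, and the table, index name, and `query_plan` helper are hypothetical.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's plan for `sql` as one string (the detail column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, total INTEGER)"
)

HOT_QUERY = "SELECT id, total FROM orders WHERE user_id = ? ORDER BY created_at"

# Without an index: a full scan plus a temporary sort for ORDER BY.
plan_before = query_plan(conn, HOT_QUERY, (1,))

# Composite index in (filter column, sort column) order.
conn.execute("CREATE INDEX idx_orders_user_created ON orders (user_id, created_at)")

# With the index: a search on user_id, rows already in created_at order.
plan_after = query_plan(conn, HOT_QUERY, (1,))
```

Comparing `plan_before` and `plan_after` shows the sort step disappearing once the index columns match the filter-then-order shape of the query.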
56
+
57
+ ## Transactions
58
+
59
+ One business operation = one transaction. Cross the boundary at the use case, not inside a repository.
60
+
61
+ ```
62
+ unitOfWork.begin()
63
+ try:
64
+     order = orderRepo.save(newOrder)
65
+     inventoryRepo.decrement(order.items)
66
+     unitOfWork.commit()
67
+     eventBus.publish(OrderPlaced(order.id)) # publish only after the commit succeeds
68
+ except:
69
+     unitOfWork.rollback()
70
+     raise
71
+ ```
72
+
73
+ Isolation levels:
74
+ - **READ COMMITTED**: default on most DBs, fine for most workloads
75
+ - **REPEATABLE READ**: if you read the same row twice within a transaction and want consistency
76
+ - **SERIALIZABLE**: correctness over throughput; expect retries
77
+
78
+ Keep transactions short. Long-running transactions hold locks and block everyone.
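As a concrete illustration of "one business operation = one transaction", here is a small sketch with Python's `sqlite3` module, whose connection object works as a context manager that commits on success and rolls back on an exception. The schema and the `place_order` function are hypothetical.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER NOT NULL CHECK (stock >= 0));
        CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, sku TEXT NOT NULL, qty INTEGER NOT NULL);
        INSERT INTO inventory VALUES ('widget', 3);
    """)
    return conn

def place_order(conn, sku, qty):
    """Insert the order and decrement stock atomically: both happen or neither does."""
    with conn:  # commits on success, rolls back if the block raises
        cur = conn.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty))
        # The CHECK constraint rejects negative stock, which aborts the whole transaction.
        conn.execute("UPDATE inventory SET stock = stock - ? WHERE sku = ?", (qty, sku))
        return cur.lastrowid
```

If the stock update fails, the order row inserted a moment earlier is rolled back with it, so no order exists without its inventory decrement.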
79
+
80
+ ## Connection pooling
81
+
82
+ Every real backend uses a pool, not per-request connections. DBs limit max connections (Postgres default ~100); without pooling, a traffic spike exhausts the DB.
83
+
84
+ | Pool param | Starting value | Notes |
85
+ |---|---|---|
86
+ | min idle | 2–5 | Warm connections for low traffic |
87
+ | max size | (DB max ÷ app instances) − safety margin | e.g., 100 ÷ 4 = 25 per instance, then leave room |
88
+ | connection timeout | 2–5 s | Fail fast if pool is saturated |
89
+ | idle timeout | 30 s – 5 min | Recycle stale connections |
90
+ | max lifetime | 30 min | Force re-resolve DNS, rotate creds |
91
+
92
+ Serverless + traditional DB: use a pooler (PgBouncer, RDS Proxy). Otherwise each function instance opens its own connections, and a burst of cold starts can exhaust the DB's connection limit.
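The "max size" row in the table above is simple arithmetic, sketched here as a helper. The function name and the default margin of 5 connections are assumptions for illustration, not recommended values.

```python
def max_pool_size(db_max_connections: int, app_instances: int, reserved: int = 5) -> int:
    """Per-instance pool ceiling: an equal share of the DB limit minus a safety margin."""
    if app_instances < 1:
        raise ValueError("app_instances must be at least 1")
    share = db_max_connections // app_instances
    # Never return less than one connection, even when the share is tiny.
    return max(1, share - reserved)
```

With a 100-connection database and 4 instances, the equal share is 25 and the margin brings each pool down to 20. Rerun the calculation whenever the instance count changes, since autoscaling silently shrinks each instance's fair share.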
93
+
94
+ ## Read replicas
95
+
96
+ Route reads to replicas, writes to primary. Beware of replication lag:
97
+
98
+ ```
99
+ user.save(newEmail) # primary
100
+ user = user.reload() # replica — may still show old email
101
+ ```
102
+
103
+ Common fix: stick to primary for N seconds after a write, or read-your-writes from primary only.
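A sketch of the "stick to primary for N seconds" approach. `ReplicaAwareRouter` and its method names are hypothetical, the state is kept in process memory (a real deployment with several instances would need shared state or a cookie), and the clock is injectable so the behaviour can be tested without sleeping.

```python
import time

class ReplicaAwareRouter:
    """Pick 'primary' or 'replica' for a read, pinning recent writers to the primary."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write = {}  # session key -> time of that session's last write

    def record_write(self, session_key):
        self._last_write[session_key] = self.clock()

    def target_for_read(self, session_key):
        written_at = self._last_write.get(session_key)
        # Inside the pin window the replica may still lag, so read from the primary.
        if written_at is not None and self.clock() - written_at < self.pin_seconds:
            return "primary"
        return "replica"
```

The pin window should comfortably exceed the replication lag you actually observe; sessions that have not written recently keep reading from replicas.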
104
+
105
+ ## Migrations
106
+
107
+ Every schema change is a migration file, checked in, applied in CI/CD, reversible where possible.
108
+
109
+ Rules:
110
+ - **Never edit a merged migration.** Write a new one.
111
+ - **Additive first, destructive later.** Add the new column → backfill → switch code → drop the old column (separate deploys).
112
+ - **Index creation on a hot table**: use `CREATE INDEX CONCURRENTLY` (Postgres) so you don't lock the table.
113
+ - **Default values**: adding a `NOT NULL` column with a default on a big table can rewrite the whole table. In Postgres 11+, adding a column with a constant (non-volatile) `DEFAULT` is metadata-only; in older versions and in some other DBs, do `ADD NULLABLE → backfill → SET NOT NULL`.
114
+
115
+ ## Soft deletes
116
+
117
+ Don't add `deleted_at` everywhere by default. It creates a silent contract that every query must filter. Use it when:
118
+ - You genuinely need to recover records, and
119
+ - You accept the cognitive tax on every query.
120
+
121
+ Prefer hard deletes + an `audit_log` / `events` table if you only need history.
122
+
123
+ ## Bulk operations
124
+
125
+ One round-trip per row kills throughput. Use batch APIs.
126
+
127
+ ```
128
+ # ❌
129
+ for row in 10_000_rows:
130
+     db.insert(row)
131
+
132
+ # ✅
133
+ db.bulk_insert(10_000_rows) # one statement
134
+ # or
135
+ db.copy_from(csv_buffer) # Postgres COPY, fastest
136
+ ```
137
+
138
+ On upserts, use the DB's native construct (`INSERT ... ON CONFLICT`, `MERGE`, `INSERT ... ON DUPLICATE KEY UPDATE`), not read-then-update in app code.
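A small sketch of a batched native upsert, using SQLite's `INSERT ... ON CONFLICT` (available since SQLite 3.24) through Python's `sqlite3` module. The `prices` table and `upsert_prices` function are hypothetical. Note that `executemany` reuses one prepared statement inside one transaction rather than sending a single multi-row statement, which is the closest stdlib equivalent of a bulk API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (sku TEXT PRIMARY KEY, cents INTEGER NOT NULL)")

def upsert_prices(conn, rows):
    """Insert or update many (sku, cents) rows in one batched call and one transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO prices (sku, cents) VALUES (?, ?) "
            "ON CONFLICT (sku) DO UPDATE SET cents = excluded.cents",
            rows,
        )

upsert_prices(conn, [("a", 100), ("b", 200)])
upsert_prices(conn, [("b", 250), ("c", 300)])  # "b" is updated, "c" is inserted
```

Because the database resolves the conflict itself, there is no window between a read and a write in which another writer can slip in.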
139
+
140
+ ## Pagination queries
141
+
142
+ Offset pagination gets slow on large tables because the DB still walks the skipped rows.
143
+
144
+ ```
145
+ # ❌ Slow on page 10 000
146
+ SELECT * FROM events ORDER BY id LIMIT 50 OFFSET 500000
147
+
148
+ # ✅ Keyset / cursor
149
+ SELECT * FROM events WHERE id > :last_id ORDER BY id LIMIT 50
150
+ ```
151
+
152
+ Keyset pagination is O(log n + limit) with an index on the sort key; offset is O(offset + limit).
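The keyset query can be wrapped in a generator that carries the cursor between pages. This is a sketch against an in-memory SQLite table; `iter_pages` and the `payload` column are hypothetical, and it assumes ids are positive and strictly increasing.

```python
import sqlite3

def iter_pages(conn, page_size=50):
    """Yield pages of (id, payload) rows, resuming each page from the last id seen."""
    last_id = 0  # assumes ids are positive integers
    while True:
        page = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield page
        last_id = page[-1][0]  # the cursor is the sort key of the last row returned

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(i, f"e{i}") for i in range(1, 121)])
pages = list(iter_pages(conn, page_size=50))
```

For an API, return the last id to the client as an opaque cursor instead of holding it in a generator. When sorting by a non-unique column, add the primary key as a tiebreaker so rows are neither skipped nor repeated.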
153
+
154
+ ## NoSQL quick notes
155
+
156
+ - **Key-value (Redis, DynamoDB)**: design the key; scan queries are evil.
157
+ - **Document (Mongo)**: embed what you always read together; reference what you sometimes read separately.
158
+ - **Wide column (Cassandra, Bigtable)**: query patterns decide the schema, not the other way around.
159
+ - **Graph (Neo4j)**: use when the traversal depth would be painful in SQL.
160
+
161
+ Rule: pick the store that matches the access pattern. Don't use Mongo because "it's flexible"; flexibility defers modeling pain, it doesn't erase it.
162
+
163
+ ## Anti-patterns
164
+
165
+ | Anti-pattern | Why bad | Fix |
166
+ |---|---|---|
167
+ | Queries in a loop | N+1; one slow endpoint tanks the DB | Batch / join / cache |
168
+ | No timeout on DB calls | A single slow query hangs threads / pool | Set statement timeout |
169
+ | `SELECT *` in hot code | Brittle, wasteful | List columns |
170
+ | Business logic in stored procedures "for speed" | Hard to test, version, review | Keep logic in code; use SQL for set operations |
171
+ | Many overlapping or unused indexes on the same table | Slow writes, bloated storage | Review `pg_stat_user_indexes`; drop unused |
172
+ | Editing an applied migration | Divergent envs | New migration |
173
+ | Schema changes without a rollback plan | Stuck deploys | Reversible migrations or documented forward-only fix |
@@ -0,0 +1,181 @@
1
+ # Layering
2
+
3
+ Split the code so business rules don't depend on the framework, the database, or the network. Hexagonal / clean architecture in practical form.
4
+
5
+ ## The four layers
6
+
7
+ ```
8
+ ┌──────────────────────────────────────────────┐
9
+ │ Delivery (HTTP handler, CLI, gRPC, worker) │ — framework-aware
10
+ ├──────────────────────────────────────────────┤
11
+ │ Application (use cases, orchestration) │ — framework-ignorant
12
+ ├──────────────────────────────────────────────┤
13
+ │ Domain (entities, value objects, rules) │ — pure
14
+ ├──────────────────────────────────────────────┤
15
+ │ Infrastructure (DB, cache, HTTP clients) │ — implements domain ports
16
+ └──────────────────────────────────────────────┘
17
+ ```
18
+
19
+ Dependency direction: **only inward**. Delivery → Application → Domain. Infrastructure implements interfaces owned by the inner layers.
20
+
21
+ If your domain imports an HTTP framework, a DB driver, or a cache client, the layering is broken.
22
+
23
+ ## Minimal layer roles
24
+
25
+ | Layer | Contains | Does NOT contain |
26
+ |---|---|---|
27
+ | Delivery | Request parsing, auth check, calls a use case, maps result to response | Business rules, DB queries |
28
+ | Application | Use case orchestration, transaction boundaries, calls repositories and services | SQL, HTTP, JSON parsing |
29
+ | Domain | Entities, value objects, invariants, domain events | I/O, frameworks |
30
+ | Infrastructure | Repository impls, HTTP client impls, message queue impls | Business decisions |
31
+
32
+ ## Ports and adapters
33
+
34
+ The domain declares a **port** (interface). Infrastructure provides an **adapter** (implementation).
35
+
36
+ ```
37
+ Domain declares (port):
38
+   interface UserRepository
39
+     findById(id) -> User | null
40
+     save(user) -> void
41
+
42
+ Infrastructure provides (adapter):
43
+   PostgresUserRepository implements UserRepository
44
+   InMemoryUserRepository implements UserRepository (for tests)
45
+   RedisUserRepository implements UserRepository (cache-aside)
46
+ ```
47
+
48
+ Rule: the adapter file imports the port. The port file never imports any adapter.
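In Python, one way to express a port is a `typing.Protocol`, which adapters satisfy structurally without importing anything from each other. The sketch below is illustrative only: the `User` fields and the `change_email` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class User:
    id: str
    email: str

class UserRepository(Protocol):
    """Port: owned by the domain, knows nothing about storage."""
    def find_by_id(self, user_id: str) -> Optional[User]: ...
    def save(self, user: User) -> None: ...

class InMemoryUserRepository:
    """Adapter: satisfies the port structurally; handy in tests."""
    def __init__(self) -> None:
        self._users = {}

    def find_by_id(self, user_id: str) -> Optional[User]:
        return self._users.get(user_id)

    def save(self, user: User) -> None:
        self._users[user.id] = user

def change_email(repo: UserRepository, user_id: str, new_email: str) -> bool:
    """Depends only on the port, so any adapter can be passed in."""
    user = repo.find_by_id(user_id)
    if user is None:
        return False
    repo.save(User(user.id, new_email))
    return True
```

A Postgres-backed adapter would implement the same two methods; `change_email` would not change.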
49
+
50
+ ## Use case pattern
51
+
52
+ A use case is one method, one transaction boundary, one business intent.
53
+
54
+ ```
55
+ class CreateOrder:
56
+     deps: OrderRepository, UserRepository, PaymentGateway, EventBus
57
+
58
+     execute(cmd: CreateOrderCommand) -> OrderId:
59
+         user = userRepository.findById(cmd.userId)
60
+         if not user: raise UserNotFound
61
+         if not user.canOrder(): raise UserCannotOrder
62
+
63
+         order = Order.create(user, cmd.items) # domain rules
64
+         paymentGateway.authorize(order) # infra
65
+         orderRepository.save(order) # infra
66
+         eventBus.publish(OrderCreated(order.id)) # infra
67
+
68
+         return order.id
69
+ ```
70
+
71
+ Delivery turns an HTTP request into `CreateOrderCommand`, calls `execute`, turns the result into a response. That's it.
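The delivery step can be sketched without any web framework: a function that takes a parsed body and returns a status code plus payload. Everything here is hypothetical scaffolding: the dict and list stand in for real repositories, and the status codes are one reasonable mapping, not the only one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CreateOrderCommand:
    user_id: str
    items: tuple

class UserNotFound(Exception):
    pass

class CreateOrder:
    """Use case: one business intent, dependencies passed in explicitly."""
    def __init__(self, users, orders):
        self.users = users    # dict-like stand-in for UserRepository
        self.orders = orders  # list stand-in for OrderRepository

    def execute(self, cmd):
        if cmd.user_id not in self.users:
            raise UserNotFound(cmd.user_id)
        order_id = f"ord_{len(self.orders) + 1}"
        self.orders.append((order_id, cmd.user_id, cmd.items))
        return order_id

def http_create_order(body, use_case):
    """Delivery: parse the request, call the use case, map the outcome to a response."""
    try:
        cmd = CreateOrderCommand(user_id=str(body["userId"]), items=tuple(body["items"]))
    except (KeyError, TypeError):
        return 400, {"error": "invalid request body"}
    try:
        return 201, {"orderId": use_case.execute(cmd)}
    except UserNotFound:
        return 404, {"error": "user not found"}
```

The handler contains no business rule: it only translates between the transport's vocabulary (bodies, status codes) and the application's (commands, domain errors).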
72
+
73
+ ## Repository pattern
74
+
75
+ Collect the DB operations for one aggregate behind one interface.
76
+
77
+ ```
78
+ interface OrderRepository:
79
+     findById(id) -> Order | null
80
+     findByUser(uid) -> list[Order]
81
+     save(order) -> void
82
+     delete(id) -> void
83
+ ```
84
+
85
+ Rules:
86
+ - Repositories return **domain objects**, not DB rows.
87
+ - Queries that cross aggregates (reporting, analytics) do NOT belong in a repository; put them in a dedicated `Queries` / `ReadModel` interface.
88
+ - Avoid growing `findByXAndYAndZ` explosions — those signal you need a query object or a read model.
89
+
90
+ ## Service vs domain vs use case
91
+
92
+ People confuse these. Rough guide:
93
+
94
+ | Name | Lives in | Contains |
95
+ |---|---|---|
96
+ | Entity / Aggregate | Domain | State + invariants + rules that depend ONLY on that state |
97
+ | Domain Service | Domain | Rules that span multiple aggregates but are still pure |
98
+ | Use Case / Application Service | Application | Orchestration: load, decide, persist, publish |
99
+ | Gateway / Client | Infrastructure | Talks to the outside world (HTTP, DB, queue) |
100
+
101
+ If you have a `FooService` that does both business rules and DB calls, split it.
102
+
103
+ ## Dependency injection, without magic
104
+
105
+ Pass dependencies in as constructor args. Don't pull them from globals.
106
+
107
+ ```
108
+ # Good
109
+ CreateOrder(orderRepo, userRepo, paymentGateway, eventBus)
110
+
111
+ # Bad
112
+ class CreateOrder:
113
+     def execute(self):
114
+         order_repo = Container.get("OrderRepository") # hidden dep
115
+ ```
116
+
117
+ Any framework DI container that ends up manipulating constructor signatures reflectively becomes impossible to reason about. Prefer explicit wiring in a composition root.
118
+
119
+ ## Composition root
120
+
121
+ One file where everything is wired up.
122
+
123
+ ```
124
+ # main / bootstrap
125
+ db = Postgres(config.url)
126
+ cache = Redis(config.redis_url)
127
+ eventBus = Kafka(config.brokers)
128
+
129
+ userRepo = PostgresUserRepository(db)
130
+ orderRepo = CachedOrderRepository(
131
+     PostgresOrderRepository(db),
132
+     cache,
133
+ )
134
+
135
+ createOrder = CreateOrder(orderRepo, userRepo, PaymentStripe(config.key), eventBus)
136
+
137
+ app.register("POST /orders", lambda req: http_create_order(req, createOrder))
138
+ ```
139
+
140
+ All layering choices become visible in this one file.
141
+
142
+ ## Transaction boundary
143
+
144
+ The use case decides where the transaction starts and ends, not the repository.
145
+
146
+ ```
147
+ class TransferMoney:
148
+     execute(cmd):
149
+         with unitOfWork.begin():
150
+             src = accountRepo.findById(cmd.fromId)
151
+             dst = accountRepo.findById(cmd.toId)
152
+             src.withdraw(cmd.amount)
153
+             dst.deposit(cmd.amount)
154
+             accountRepo.save(src)
155
+             accountRepo.save(dst)
156
+         # commit happens here; rollback on exception
157
+ ```
158
+
159
+ One transaction per use case, not per repository call. If a use case needs multiple transactions, it's probably two use cases.
160
+
161
+ ## Anti-patterns
162
+
163
+ | Anti-pattern | Why bad | Fix |
164
+ |---|---|---|
165
+ | Framework objects (HTTP request/response) inside the domain | Couples domain to HTTP | Parse at delivery, pass plain command |
166
+ | Repository returns a DB row | Leaks schema upward | Map to domain object at the edge |
167
+ | Controller calls the DB directly | Skips domain rules | Every write goes through a use case |
168
+ | ORM entities ARE the domain entities | Can't change storage without rewriting rules | Separate persistence model from domain model |
169
+ | Static "Service" class with 40 unrelated methods | No cohesion; everything imports everything | One use case per class |
170
+ | Domain event published before persistence succeeds | Consumers act on data that doesn't exist | Publish after commit, or use transactional outbox |
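The transactional outbox named in the last row can be sketched in a few lines: the event is written to an `outbox` table in the same transaction as the business row, and a separate relay publishes it afterwards. The schema, function names, and `publish` callback below are hypothetical, and SQLite stands in for the real database.

```python
import json
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn

def save_order(conn, order_id, total_cents):
    """Write the order and its event in ONE transaction: both commit or neither does."""
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )

def relay_outbox(conn, publish):
    """Publish committed, unpublished events in order; mark each one after it is sent."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # may raise; the row then stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Delivery is at-least-once: if the relay crashes after `publish` but before marking the row, the event is sent again on the next run, so consumers should be idempotent.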
171
+
172
+ ## Don't over-engineer
173
+
174
+ A 500-line CRUD service doesn't need four layers, a DI container, and a port-adapter diagram. Start simple:
175
+
176
+ ```
177
+ # Acceptable for small services
178
+ handler -> repository -> db
179
+ ```
180
+
181
+ Introduce the extra seams **when you feel the pain**: when tests get hard, when the DB needs replacing, when rules start repeating across endpoints. Layering is a response to complexity, not a prerequisite for it.
@@ -0,0 +1,190 @@
1
+ # Observability
2
+
3
+ Logs, metrics, traces, health. A request you can't trace is a bug you can't fix.
4
+
5
+ ## The three pillars
6
+
7
+ | Signal | Answers | Cost | Cardinality |
8
+ |---|---|---|---|
9
+ | **Logs** | "What happened in this request?" | High (per-event) | Unlimited |
10
+ | **Metrics** | "How much, how often, how fast, across the fleet?" | Low (aggregated) | Bounded (labels explode) |
11
+ | **Traces** | "Where did time go in this distributed request?" | Medium (sampled) | Unlimited per trace |
12
+
13
+ Pick the right signal for the question. Metrics for dashboards, logs for forensics, traces for latency breakdowns.
14
+
15
+ ## Structured logs — JSON, not prose
16
+
17
+ Human-readable strings are unqueryable. Every log line is a JSON object with a stable schema.
18
+
19
+ ```json
20
+ {
21
+ "ts": "2026-04-20T10:23:45.123Z",
22
+ "level": "info",
23
+ "service": "orders",
24
+ "env": "prod",
25
+ "request_id": "req_01HX...",
26
+ "trace_id": "0af7651916cd43dd...",
27
+ "user_id": "u_01HX...",
28
+ "msg": "order created",
29
+ "order_id": "ord_01HX...",
30
+ "amount_cents": 2599,
31
+ "duration_ms": 87
32
+ }
33
+ ```
34
+
35
+ Rules:
36
+ - Always include `ts`, `level`, `service`, `env`.
37
+ - Always include a request/trace id so you can stitch a request together across services.
38
+ - Message is a short constant string — fixed values in `msg`, varying values in fields. `"order created"` not `"order ord_01HX created for $25.99"`.
39
+ - Never log secrets, tokens, passwords, full PII. Redact at the logger, not the call site.
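A minimal sketch of these rules with Python's standard `logging` module: a formatter that emits one JSON object per record and redacts at the logger rather than at each call site. The `fields` convention, the deny-list, and the `build_logger` helper are assumptions for illustration; production setups usually rely on an established structured-logging library.

```python
import io
import json
import logging
from datetime import datetime, timezone

REDACTED_KEYS = {"password", "token", "authorization"}  # hypothetical deny-list

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object; fields come from extra={"fields": {...}}."""

    def __init__(self, service, env):
        super().__init__()
        self.service, self.env = service, env

    def format(self, record):
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "env": self.env,
            "msg": record.getMessage(),
        }
        # Varying values live in fields, and sensitive keys are masked here, once.
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in REDACTED_KEYS else value
        return json.dumps(entry)

def build_logger(stream, service="orders", env="prod"):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, env))
    logger = logging.getLogger(f"{service}.json")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call then looks like `log.info("order created", extra={"fields": {"order_id": "ord_1"}})`: the message stays a constant string and everything that varies becomes a queryable field.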
40
+
41
+ ## Log levels — use them honestly
42
+
43
+ | Level | Means | Typical rate |
44
+ |---|---|---|
45
+ | ERROR | Something broke; a human should look | Low |
46
+ | WARN | Unexpected, but handled (retry succeeded, fallback used) | Low |
47
+ | INFO | State changes worth knowing at normal volume | Medium |
48
+ | DEBUG | Details useful while investigating; off in prod | High (when on) |
49
+
50
+ Abused levels poison the signal. If everything is INFO, nothing is INFO.
51
+
52
+ ## Correlation IDs
53
+
54
+ Every request gets a unique id at the edge; it propagates through every log line and outbound call.
55
+
56
+ ```
57
+ incoming request → generate request_id (or accept from X-Request-ID)
58
+ → bind to logger context
59
+ → forward on outbound calls (X-Request-ID, traceparent)
60
+ ```
61
+
62
+ Distributed tracing (OpenTelemetry) gives you `trace_id` + `span_id` for free. Log both when you have them.
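The propagation flow above can be sketched with `contextvars`, which keeps the id scoped to the current request even under asyncio. The function names are hypothetical, and header lookup is shown as a plain case-sensitive dict access, whereas real frameworks normalise header case.

```python
import contextvars
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)

def begin_request(incoming_headers):
    """Reuse the caller's X-Request-ID if present, otherwise mint one; bind it to the context."""
    request_id = incoming_headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex}"
    request_id_var.set(request_id)
    return request_id

def outbound_headers(extra=None):
    """Headers for a downstream call, carrying the current request id forward."""
    headers = dict(extra or {})
    request_id = request_id_var.get()
    if request_id is not None:
        headers["X-Request-ID"] = request_id
    return headers

def log_fields(**fields):
    """Fields for a structured log line, with the request id attached automatically."""
    return {"request_id": request_id_var.get(), **fields}
```

Call `begin_request` in middleware at the edge; code deeper in the stack never passes the id around explicitly, it just calls `outbound_headers()` and `log_fields()`.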
63
+
64
+ ## Metrics — RED + USE
65
+
66
+ Two checklists that cover almost everything.
67
+
68
+ ### RED (per request-driven service)
69
+
70
+ - **R**ate — requests per second
71
+ - **E**rrors — failing requests per second (or error rate)
72
+ - **D**uration — latency distribution (p50 / p95 / p99)
73
+
74
+ ### USE (per resource)
75
+
76
+ - **U**tilization — how busy is it? (CPU%, thread pool in use / max)
77
+ - **S**aturation — how much work is queued? (request queue depth)
78
+ - **E**rrors — how many operations failed?
79
+
80
+ Track these for every service and every critical dependency.
81
+
82
+ ## Latency — measure distributions, not averages
83
+
84
+ Averages hide the worst cases. P95/P99 are where your users actually feel slowness.
85
+
86
+ ```
87
+ # ✅
88
+ http_request_duration_seconds{route="/orders", method="POST"}
89
+ → histogram with buckets (0.01, 0.05, 0.1, 0.5, 1, 5)
90
+ → alert on p99 > 1s for 5 min
91
+
92
+ # ❌
93
+ avg_response_time = sum(durations) / count(durations)
94
+ → a 10 s outlier buried in 999 fast ones looks fine
95
+ ```
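To make the difference concrete, here is a small calculation on synthetic data (980 requests at 50 ms, 20 at 10 s). The `percentile` helper uses the nearest-rank definition; metrics backends typically estimate percentiles from histogram buckets instead, so treat this as an illustration of the idea rather than of any particular system.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic data: 980 requests at 50 ms and 20 requests at 10 s.
durations = [0.05] * 980 + [10.0] * 20

mean = sum(durations) / len(durations)  # about 0.25 s, which looks tolerable
p50 = percentile(durations, 50)         # 0.05 s, the typical request
p99 = percentile(durations, 99)         # 10.0 s, what the unlucky 2% actually see
```

The mean suggests a mildly slow service, while p99 shows that one request in fifty takes ten seconds.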
96
+
97
+ ## Labels — finite cardinality
98
+
99
+ Every unique label combination creates a new metric series. High-cardinality labels (user id, request id, email) will blow up storage and cost.
100
+
101
+ | Label | OK? |
102
+ |---|---|
103
+ | route, method, status_code | Yes (small set) |
104
+ | region, pod_name, env | Yes |
105
+ | user_id, request_id, email, SKU | NO — use logs/traces for these |
106
+
107
+ ## Tracing
108
+
109
+ A trace is a tree of spans representing one request's path through services. Each span has: operation name, start/end time, attributes, parent span.
110
+
111
+ Auto-instrument with OpenTelemetry. Add manual spans around:
112
+ - External HTTP calls (service, endpoint, status)
113
+ - DB queries (operation, table; never the full raw query — cardinality)
114
+ - Cache ops
115
+ - Queue enqueue / dequeue
116
+ - Expensive pure computations
117
+
118
+ Sampling: head-based (1–10% of requests fully traced) or tail-based (keep traces where something went wrong). Keep tracer overhead < 1% of request latency.
119
+
120
+ ## SLOs — the contract
121
+
122
+ An SLO is a number + a window. "99.9% of /orders responses succeed within 500 ms, measured over 28 days."
123
+
124
+ Error budget = `1 − SLO`. Over 28 days, 99.9% allows ≈40 min of downtime. When you burn the budget, freeze risky changes and invest in reliability.
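The budget arithmetic, sketched as two helpers. The function names are hypothetical, and the calculation uses the time-based reading of the SLO (minutes of full downtime); a request-based SLO would count failed requests against total requests instead.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full downtime allowed by an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, window_days: int, bad_minutes: float) -> float:
    """Fraction of the budget still unspent (negative once the budget is blown)."""
    return 1 - bad_minutes / error_budget_minutes(slo, window_days)
```

For 99.9% over 28 days this gives (1 − 0.999) × 28 × 24 × 60 ≈ 40.3 minutes; ten minutes of outage leaves roughly three quarters of the budget.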
125
+
126
+ Don't set SLOs to what your service does today. Set them to what your users need.
127
+
128
+ ## Alerts — page on symptoms, not causes
129
+
130
+ Alert on "users are affected" (SLO burn rate, error rate spike, latency breach). Don't alert on "CPU is at 80%" — that's often fine.
131
+
132
+ Every alert must be:
133
+ - **Actionable** — there is something the oncall can do right now
134
+ - **Unambiguous** — one cause for the page, not "anything could have fired this"
135
+ - **Documented** — link to a runbook from the alert body
136
+
137
+ If an alert fires and the oncall thinks "not my problem" or "auto-resolves in 5 min", it's a bad alert. Delete or tune it.
138
+
139
+ ## Runbooks
140
+
141
+ One per alert. Structure:
142
+
143
+ ```
144
+ # Alert: api-latency-p99-high
145
+
146
+ ## What this means
147
+ p99 on /api/orders POST is > 1s for 5m.
148
+
149
+ ## Immediate checks
150
+ 1. Look at [dashboard-link]
151
+ 2. Check for recent deploy: [deploys-link]
152
+ 3. Check upstream health: [dep-status]
153
+
154
+ ## Common causes
155
+ - DB slow query → check [slow-query-dashboard]
156
+ - Cache outage → check redis metrics
157
+ - Upstream payment provider → check provider status page
158
+
159
+ ## Mitigation
160
+ - Roll back recent deploy if within 30 min window
161
+ - Failover to secondary region
162
+ - ...
163
+ ```
164
+
165
+ ## Health endpoints (minimal)
166
+
167
+ ```
168
+ GET /health/live → 200 if process can serve (don't check dependencies)
169
+ GET /health/ready → 200 only if dependencies are reachable (DB, cache, queue)
170
+ ```
171
+
172
+ Live failing → orchestrator restarts the pod.
173
+ Ready failing → orchestrator takes the pod out of the load balancer (but doesn't kill it).
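A framework-free sketch of the two endpoints as a pure function. The `health_response` name, the response bodies, and the choice of 503 for "not ready" are assumptions; the dependency checks are passed in as callables so the handler itself stays cheap and free of business logic.

```python
def health_response(path, dependency_checks):
    """Return (status_code, body) for the liveness and readiness endpoints.

    `dependency_checks` maps a name to a zero-argument callable that returns True when reachable.
    """
    if path == "/health/live":
        return 200, {"status": "ok"}  # never touches dependencies
    if path == "/health/ready":
        results = {}
        for name, check in dependency_checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False  # a failing check means "not ready", not a crash
        ready = all(results.values())
        body = {"status": "ok" if ready else "unavailable", "checks": results}
        return (200 if ready else 503), body
    return 404, {"status": "not found"}
```

Each check should carry its own short timeout, otherwise a hung dependency makes the readiness probe itself hang.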
174
+
175
+ Never put business logic in health checks. They should be cheap and boring.
176
+
177
+ ## Anti-patterns
178
+
179
+ | Anti-pattern | Why |
180
+ |---|---|
181
+ | String-formatted logs (`"user X did Y at Z"`) | Unqueryable |
182
+ | Logging full request bodies | PII leak, storage blow-up |
183
+ | Alerting on CPU / disk without symptom link | Pager fatigue; noise |
184
+ | No request correlation id | Can't stitch a failure across services |
185
+ | Logging at DEBUG in prod | Drowns the signal; storage cost |
186
+ | `avg_latency` as the only latency metric | Hides the outliers that hurt users |
187
+ | `status:500` as the only error signal | 200 with `{error: ...}` bodies exist and hurt |
188
+ | Metrics labels with user id / email | Cardinality explosion |
189
+ | Tracing everything, sampling nothing | Cost blowup; latency overhead |
190
+ | Alerts without runbooks | Oncall guesses, takes too long |