npm - @coralai/sps-cli - Versions diffs - 0.41.2 → 0.43.0 - Mend

@coralai/sps-cli 0.41.2 → 0.43.0

Files changed (168)

package/README.md +34 -3
package/dist/commands/cardAdd.d.ts +1 -1
package/dist/commands/cardAdd.d.ts.map +1 -1
package/dist/commands/cardAdd.js +16 -6
package/dist/commands/cardAdd.js.map +1 -1
package/dist/commands/cardDashboard.js +1 -1
package/dist/commands/cardDashboard.js.map +1 -1
package/dist/commands/doctor.d.ts +9 -0
package/dist/commands/doctor.d.ts.map +1 -1
package/dist/commands/doctor.js +3 -314
package/dist/commands/doctor.js.map +1 -1
package/dist/commands/hookCommand.d.ts.map +1 -1
package/dist/commands/hookCommand.js +6 -7
package/dist/commands/hookCommand.js.map +1 -1
package/dist/commands/pmCommand.js +1 -1
package/dist/commands/pmCommand.js.map +1 -1
package/dist/commands/projectInit.d.ts.map +1 -1
package/dist/commands/projectInit.js +60 -37
package/dist/commands/projectInit.js.map +1 -1
package/dist/commands/setup.d.ts.map +1 -1
package/dist/commands/setup.js +3 -30
package/dist/commands/setup.js.map +1 -1
package/dist/commands/skillCommand.d.ts +2 -0
package/dist/commands/skillCommand.d.ts.map +1 -0
package/dist/commands/skillCommand.js +235 -0
package/dist/commands/skillCommand.js.map +1 -0
package/dist/commands/tick.js +1 -1
package/dist/commands/tick.js.map +1 -1
package/dist/core/checklist.d.ts +22 -0
package/dist/core/checklist.d.ts.map +1 -0
package/dist/core/checklist.js +38 -0
package/dist/core/checklist.js.map +1 -0
package/dist/core/checklist.test.d.ts +2 -0
package/dist/core/checklist.test.d.ts.map +1 -0
package/dist/core/checklist.test.js +74 -0
package/dist/core/checklist.test.js.map +1 -0
package/dist/core/config.d.ts +1 -1
package/dist/core/config.d.ts.map +1 -1
package/dist/core/config.js +1 -1
package/dist/core/config.js.map +1 -1
package/dist/core/config.test.js +7 -4
package/dist/core/config.test.js.map +1 -1
package/dist/core/context.d.ts +1 -1
package/dist/core/context.d.ts.map +1 -1
package/dist/core/skillStore.d.ts +46 -0
package/dist/core/skillStore.d.ts.map +1 -0
package/dist/core/skillStore.js +197 -0
package/dist/core/skillStore.js.map +1 -0
package/dist/core/skillStore.test.d.ts +2 -0
package/dist/core/skillStore.test.d.ts.map +1 -0
package/dist/core/skillStore.test.js +190 -0
package/dist/core/skillStore.test.js.map +1 -0
package/dist/engines/EventHandler.test.js +3 -3
package/dist/engines/EventHandler.test.js.map +1 -1
package/dist/engines/MonitorEngine.js +2 -2
package/dist/engines/MonitorEngine.js.map +1 -1
package/dist/engines/SchedulerEngine.js +1 -1
package/dist/engines/SchedulerEngine.js.map +1 -1
package/dist/engines/StageEngine.js +3 -3
package/dist/engines/StageEngine.js.map +1 -1
package/dist/engines/engine-pipeline-adapter.test.js +2 -2
package/dist/engines/engine-pipeline-adapter.test.js.map +1 -1
package/dist/interfaces/TaskBackend.d.ts +3 -1
package/dist/interfaces/TaskBackend.d.ts.map +1 -1
package/dist/main.js +19 -17
package/dist/main.js.map +1 -1
package/dist/models/types.d.ts +16 -1
package/dist/models/types.d.ts.map +1 -1
package/dist/providers/MarkdownTaskBackend.d.ts +2 -1
package/dist/providers/MarkdownTaskBackend.d.ts.map +1 -1
package/dist/providers/MarkdownTaskBackend.js +28 -5
package/dist/providers/MarkdownTaskBackend.js.map +1 -1
package/dist/providers/registry.d.ts.map +1 -1
package/dist/providers/registry.js +5 -7
package/dist/providers/registry.js.map +1 -1
package/package.json +1 -1
package/project-template/.claude/hooks/start.sh +44 -0
package/project-template/.claude/settings.json +1 -1
package/skills/architecture-decision-records/SKILL.md +207 -0
package/skills/backend/SKILL.md +62 -0
package/skills/backend/references/api-design.md +168 -0
package/skills/backend/references/caching.md +181 -0
package/skills/backend/references/data-access.md +173 -0
package/skills/backend/references/layering.md +181 -0
package/skills/backend/references/observability.md +190 -0
package/skills/backend/references/resilience.md +201 -0
package/skills/backend/references/security.md +186 -0
package/skills/backend-architect/SKILL.md +119 -0
package/skills/code-reviewer/SKILL.md +143 -0
package/skills/coding-standards/SKILL.md +60 -0
package/skills/coding-standards/references/clean-code.md +258 -0
package/skills/coding-standards/references/code-review.md +192 -0
package/skills/coding-standards/references/commits-and-prs.md +226 -0
package/skills/coding-standards/references/error-strategy.md +193 -0
package/skills/coding-standards/references/naming.md +185 -0
package/skills/coding-standards/references/tdd.md +171 -0
package/skills/database/SKILL.md +53 -0
package/skills/database/references/indexing.md +190 -0
package/skills/database/references/migrations.md +199 -0
package/skills/database/references/nosql.md +185 -0
package/skills/database/references/queries.md +295 -0
package/skills/database/references/scaling.md +203 -0
package/skills/database/references/schema.md +191 -0
package/skills/database-optimizer/SKILL.md +168 -0
package/skills/debugging-workflow/SKILL.md +244 -0
package/skills/devops/SKILL.md +55 -0
package/skills/devops/references/ci-cd.md +204 -0
package/skills/devops/references/containers.md +272 -0
package/skills/devops/references/deploy.md +201 -0
package/skills/devops/references/iac.md +252 -0
package/skills/devops/references/observability.md +228 -0
package/skills/devops/references/secrets.md +178 -0
package/skills/devops-automator/SKILL.md +164 -0
package/skills/frontend/SKILL.md +52 -0
package/skills/frontend/references/accessibility.md +222 -0
package/skills/frontend/references/components.md +206 -0
package/skills/frontend/references/performance.md +219 -0
package/skills/frontend/references/routing.md +209 -0
package/skills/frontend/references/state.md +190 -0
package/skills/frontend/references/testing.md +216 -0
package/skills/frontend-developer/SKILL.md +115 -0
package/skills/git-workflow/SKILL.md +355 -0
package/skills/golang/SKILL.md +49 -0
package/skills/golang/references/concurrency.md +284 -0
package/skills/golang/references/errors.md +241 -0
package/skills/golang/references/idioms.md +285 -0
package/skills/golang/references/testing.md +238 -0
package/skills/java/SKILL.md +50 -0
package/skills/java/references/concurrency.md +194 -0
package/skills/java/references/idioms.md +283 -0
package/skills/java/references/testing.md +228 -0
package/skills/kotlin/SKILL.md +47 -0
package/skills/kotlin/references/coroutines.md +240 -0
package/skills/kotlin/references/idioms.md +268 -0
package/skills/kotlin/references/testing.md +219 -0
package/skills/mobile/SKILL.md +50 -0
package/skills/mobile/references/architecture.md +204 -0
package/skills/mobile/references/navigation.md +158 -0
package/skills/mobile/references/performance.md +152 -0
package/skills/mobile/references/platform.md +166 -0
package/skills/mobile/references/state-and-data.md +174 -0
package/skills/python/SKILL.md +51 -0
package/skills/python/THIRD_PARTY.md +14 -0
package/skills/python/references/async.md +218 -0
package/skills/python/references/error-handling.md +254 -0
package/skills/python/references/idioms.md +279 -0
package/skills/python/references/packaging.md +233 -0
package/skills/python/references/testing.md +269 -0
package/skills/python/references/typing.md +292 -0
package/skills/qa-tester/SKILL.md +186 -0
package/skills/rust/SKILL.md +50 -0
package/skills/rust/references/async.md +224 -0
package/skills/rust/references/errors.md +240 -0
package/skills/rust/references/ownership.md +263 -0
package/skills/rust/references/testing.md +274 -0
package/skills/rust/references/traits.md +250 -0
package/skills/security-engineer/SKILL.md +157 -0
package/skills/swift/SKILL.md +48 -0
package/skills/swift/references/concurrency.md +280 -0
package/skills/swift/references/idioms.md +334 -0
package/skills/swift/references/testing.md +229 -0
package/skills/typescript/SKILL.md +51 -0
package/skills/typescript/references/async.md +241 -0
package/skills/typescript/references/errors.md +208 -0
package/skills/typescript/references/idioms.md +246 -0
package/skills/typescript/references/testing.md +225 -0
package/skills/typescript/references/tooling.md +208 -0
package/skills/typescript/references/types.md +259 -0

package/skills/backend/references/data-access.md ADDED

@@ -0,0 +1,173 @@
+# Data Access
+Transactions, queries, migrations, connection pooling. Language-neutral patterns.
+## N+1 queries — the universal killer
+The single most common backend performance bug.
+```
+# ❌ N+1
+orders = orderRepo.findAll()              # 1 query
+for order in orders:
+    order.user = userRepo.find(order.userId)   # N queries
+# ✅ Batch fetch
+orders = orderRepo.findAll()
+userIds = unique(o.userId for o in orders)
+users = userRepo.findByIds(userIds)        # 1 query
+userMap = { u.id: u for u in users }
+for o in orders:
+    o.user = userMap[o.userId]
+# ✅ Join (if the ORM supports eager loading)
+orders = orderRepo.findAll(include=['user'])
+```
+Detect early: log every query in test mode; assert query count on hot paths.
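A minimal sketch of the "assert query count" idea, using Python's built-in `sqlite3` and its `set_trace_callback` hook. The schema, the `count_queries` helper, and the two loader functions are illustrative assumptions, not part of any particular ORM.

```python
import sqlite3

def count_queries(conn, fn):
    """Run fn() and return (result, number of SQL statements it executed)."""
    statements = []
    conn.set_trace_callback(statements.append)  # called once per executed statement
    try:
        result = fn()
    finally:
        conn.set_trace_callback(None)
    return result, len(statements)

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

def load_n_plus_one():
    # 1 query for the orders, then 1 query per order for its user.
    orders = conn.execute("SELECT id, user_id FROM orders ORDER BY id").fetchall()
    return [
        (oid, conn.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchone()[0])
        for oid, uid in orders
    ]

def load_batched():
    # 1 query for the orders, 1 query for all referenced users.
    orders = conn.execute("SELECT id, user_id FROM orders ORDER BY id").fetchall()
    user_ids = sorted({uid for _, uid in orders})
    placeholders = ",".join("?" * len(user_ids))
    names = dict(conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall())
    return [(oid, names[uid]) for oid, uid in orders]

naive, naive_count = count_queries(conn, load_n_plus_one)
batched, batched_count = count_queries(conn, load_batched)
```

In a test suite, an assertion such as `batched_count <= 2` on a hot path would fail as soon as someone reintroduces a per-row lookup.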
+## Select only what you need
+`SELECT *` wastes bandwidth and memory, and it breaks callers when the schema changes.
+```
+# ❌
+SELECT * FROM users WHERE active = true
+# ✅
+SELECT id, email, name FROM users WHERE active = true
+```
+## Indexes
+An index is a write-time tax for a read-time refund. Worth it on columns used in WHERE, JOIN, ORDER BY of hot queries.
+```
+-- Common first indexes
+CREATE INDEX idx_orders_user_id     ON orders(user_id);
+CREATE INDEX idx_orders_status      ON orders(status) WHERE status = 'pending';  -- partial
+CREATE INDEX idx_users_email_lower  ON users (LOWER(email));                      -- expression
+```
+Rules:
+- Read the query plan. Don't guess.
+- Composite index order matters: `(user_id, created_at)` helps `WHERE user_id = ? ORDER BY created_at`, not the reverse.
+- Every index slows writes. More indexes ≠ faster system.
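To make "read the query plan" concrete, here is a small sketch using SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3`. The table, the index name, and the `plan` helper are assumptions for illustration. Plan wording differs between databases and versions, so treat the output as something to read rather than parse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
    CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);
""")

def plan(sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the readable detail

# Leading index column in WHERE, second column in ORDER BY: the index serves both.
helped = plan("SELECT id FROM orders WHERE user_id = ? ORDER BY created_at", (1,))

# The reverse shape: filtering on the second index column only.
reverse = plan("SELECT id FROM orders WHERE created_at = ? ORDER BY user_id", ("2026-01-01",))

print(helped)
print(reverse)
```

Comparing the two printed plans shows whether the planner seeks into the index or walks all of it.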
+## Transactions
+One business operation = one transaction. Cross the boundary at the use case, not inside a repository.
+```
+unitOfWork.begin()
+try:
+    order = orderRepo.save(newOrder)
+    inventoryRepo.decrement(order.items)
+    eventBus.publish(OrderPlaced(order.id))
+    unitOfWork.commit()
+except:
+    unitOfWork.rollback()
+    raise
+```
+Isolation levels:
+- **READ COMMITTED**: default on most DBs, fine for most workloads
+- **REPEATABLE READ**: if you read the same row twice within a transaction and want consistency
+- **SERIALIZABLE**: correctness over throughput; expect retries
+Keep transactions short. Long-running transactions hold locks and block everyone.
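A runnable sketch of "one business operation = one transaction", assuming SQLite through Python's `sqlite3`. The tables, `place_order`, and `OutOfStock` are hypothetical names. The `with conn:` block commits on success and rolls back if the body raises.

```python
import sqlite3

class OutOfStock(Exception):
    pass

def place_order(conn, sku, qty):
    # Both writes succeed together or neither is kept.
    with conn:
        conn.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty))
        cur = conn.execute(
            "UPDATE inventory SET stock = stock - ? WHERE sku = ? AND stock >= ?",
            (qty, sku, qty),
        )
        if cur.rowcount != 1:
            raise OutOfStock(sku)  # triggers rollback of the INSERT above

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER);
    CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER);
    INSERT INTO inventory VALUES ('widget', 5);
""")

place_order(conn, "widget", 3)      # succeeds, stock goes 5 -> 2
try:
    place_order(conn, "widget", 3)  # not enough stock, whole operation is undone
except OutOfStock:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_left = conn.execute("SELECT stock FROM inventory WHERE sku = 'widget'").fetchone()[0]
```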
+## Connection pooling
+Every real backend uses a pool, not per-request connections. DBs limit max connections (Postgres default ~100); without pooling, a traffic spike exhausts the DB.
+| Pool param | Starting value | Notes |
+|---|---|---|
+| min idle | 2–5 | Warm connections for low traffic |
+| max size | (DB max ÷ app instances) − safety margin | e.g., 100 ÷ 4 = 25 per instance, then leave room |
+| connection timeout | 2–5 s | Fail fast if pool is saturated |
+| idle timeout | 30 s – 5 min | Recycle stale connections |
+| max lifetime | 30 min | Force re-resolve DNS, rotate creds |
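+Serverless + traditional DB: put a pooler (PgBouncer, RDS Proxy) in front of the database. Function instances are short-lived and numerous, so a per-instance pool gives little reuse and can exhaust DB connections during a spike.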
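The max-size rule from the table can be written as a few lines of arithmetic. This is only a sketch: `max_pool_size` and its parameter names are made up for illustration, and the margin of 5 is an arbitrary example value.

```python
def max_pool_size(db_max_connections: int, app_instances: int, safety_margin: int) -> int:
    """Per-instance pool cap: split the DB's connection limit, then leave headroom."""
    if app_instances < 1:
        raise ValueError("need at least one app instance")
    size = db_max_connections // app_instances - safety_margin
    if size < 1:
        raise ValueError("too many instances for this DB limit; use a pooler")
    return size

# 100 connections shared by 4 instances, keeping 5 per instance in reserve
# for migrations, admin sessions, and deploy overlap.
per_instance = max_pool_size(100, 4, 5)
```

Here `per_instance` is 20, so the fleet uses at most 80 of the 100 connections.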
+## Read replicas
+Route reads to replicas, writes to primary. Beware of replication lag:
+```
+user.save(newEmail)            # primary
+user = user.reload()           # replica — may still show old email
+```
+Common fixes: route a user's reads to the primary for N seconds after that user writes, or always serve read-your-writes paths from the primary.
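One way to sketch the "stay on the primary for N seconds after a write" fix. `ReadRouter` and its methods are hypothetical names, and the in-process dictionary is a simplification: with several app instances the last-write timestamp would need to live somewhere shared, such as a session or cookie.

```python
import time

class ReadRouter:
    """Choose where a read goes, pinning recent writers to the primary."""

    def __init__(self, pin_seconds: float, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock            # injectable so tests can control time
        self._last_write = {}         # user_id -> time of that user's last write

    def record_write(self, user_id) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"          # replica may not have this user's write yet
        return "replica"
```

Choose `pin_seconds` comfortably above the replication lag you actually observe, not the lag you hope for.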
+## Migrations
+Every schema change is a migration file, checked in, applied in CI/CD, reversible where possible.
+Rules:
+- **Never edit a merged migration.** Write a new one.
+- **Additive first, destructive later.** Add the new column → backfill → switch code → drop the old column (separate deploys).
+- **Index creation on a hot table**: use `CREATE INDEX CONCURRENTLY` (Postgres) so you don't lock the table.
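+- **Default values**: adding a `NOT NULL` column with a default on a big table can rewrite the whole table. In Postgres 11+, adding a column with a constant `DEFAULT` is metadata-only (a volatile default such as `random()` still rewrites the table); on older versions and other DBs without this optimization, do `ADD NULLABLE → backfill → SET NOT NULL`.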
+## Soft deletes
+Don't add `deleted_at` everywhere by default. It creates a silent contract that every query must filter. Use it when:
+- You genuinely need to recover records, and
+- You accept the cognitive tax on every query.
+Prefer hard deletes + an `audit_log` / `events` table if you only need history.
+## Bulk operations
+One round-trip per row kills throughput. Use batch APIs.
+```
+# ❌
+for row in 10_000_rows:
+    db.insert(row)
+# ✅
+db.bulk_insert(10_000_rows)            # one statement
+# or
+db.copy_from(csv_buffer)               # Postgres COPY, fastest
+```
+On upserts, use the DB's native construct (`INSERT ... ON CONFLICT`, `MERGE`, `INSERT ... ON DUPLICATE KEY UPDATE`), not read-then-update in app code.
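A small sketch of batching and a native upsert, assuming SQLite 3.24 or newer through Python's `sqlite3`. The `counters` table is hypothetical. Note that `executemany` here reuses one prepared statement inside one transaction rather than sending a single multi-row statement; against a networked database, the driver's bulk API or `COPY` plays that role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, hits INTEGER NOT NULL)")

# Batch insert: one prepared statement, one transaction, no per-row commit.
rows = [(f"page-{i}", 1) for i in range(1000)]
with conn:
    conn.executemany("INSERT INTO counters (name, hits) VALUES (?, ?)", rows)

# Native upsert: the DB resolves the conflict atomically, so there is no
# read-then-update race in application code.
upsert = """
    INSERT INTO counters (name, hits) VALUES (?, ?)
    ON CONFLICT(name) DO UPDATE SET hits = hits + excluded.hits
"""
with conn:
    conn.executemany(upsert, [("page-0", 5), ("brand-new", 1)])

total_rows = conn.execute("SELECT COUNT(*) FROM counters").fetchone()[0]
page0_hits = conn.execute("SELECT hits FROM counters WHERE name = 'page-0'").fetchone()[0]
```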
+## Pagination queries
+Offset pagination gets slow on large tables because the DB still walks the skipped rows.
+```
+# ❌ Slow on page 10 000
+SELECT * FROM events ORDER BY id LIMIT 50 OFFSET 500000
+# ✅ Keyset / cursor
+SELECT * FROM events WHERE id > :last_id ORDER BY id LIMIT 50
+```
+With an index on the sort key, keyset pagination costs O(log n + limit) per page; offset pagination costs O(offset + limit).
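A runnable sketch of the keyset loop, assuming SQLite through Python's `sqlite3` and a hypothetical `events` table. The cursor handed back to the client is simply the last `id` of the page it just received.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    [(i, f"event-{i}") for i in range(1, 126)],   # 125 rows
)
conn.commit()

def fetch_page(last_id, page_size=50):
    """Return (rows, next_cursor); next_cursor is None when the page is empty."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor

def iterate_all(page_size=50):
    """Walk the whole table page by page, seeking by id instead of skipping rows."""
    last_id, pages = 0, []
    while True:
        rows, cursor = fetch_page(last_id, page_size)
        if not rows:
            break
        pages.append(rows)
        last_id = cursor
    return pages

pages = iterate_all()
```

The trade-off is that keyset pagination cannot jump to an arbitrary page number, and the sort key must be unique (or made unique with a tiebreaker column).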
+## NoSQL quick notes
+- **Key-value (Redis, DynamoDB)**: design the key; scan queries are evil.
+- **Document (Mongo)**: embed what you always read together; reference what you sometimes read separately.
+- **Wide column (Cassandra, Bigtable)**: query patterns decide the schema, not the other way around.
+- **Graph (Neo4j)**: use when the traversal depth would be painful in SQL.
+Rule: pick the store that matches the access pattern. Don't use Mongo because "it's flexible"; flexibility defers modeling pain, it doesn't erase it.
+## Anti-patterns
+| Anti-pattern | Why bad | Fix |
+|---|---|---|
+| Queries in a loop | N+1; one slow endpoint tanks the DB | Batch / join / cache |
+| No timeout on DB calls | A single slow query hangs threads / pool | Set statement timeout |
+| `SELECT *` in hot code | Brittle, wasteful | List columns |
+| Business logic in stored procedures "for speed" | Hard to test, version, review | Keep logic in code; use SQL for set operations |
+| Many overlapping or unused indexes on the same table | Slow writes, bloated storage | Review `pg_stat_user_indexes`; drop unused |
+| Editing an applied migration | Divergent envs | New migration |
+| Schema changes without a rollback plan | Stuck deploys | Reversible migrations or documented forward-only fix |

package/skills/backend/references/layering.md ADDED

@@ -0,0 +1,181 @@
+# Layering
+Split the code so business rules don't depend on the framework, the database, or the network. Hexagonal / clean architecture in practical form.
+## The four layers
+```
+┌──────────────────────────────────────────────┐
+│  Delivery (HTTP handler, CLI, gRPC, worker)  │  — framework-aware
+├──────────────────────────────────────────────┤
+│  Application (use cases, orchestration)      │  — framework-ignorant
+├──────────────────────────────────────────────┤
+│  Domain (entities, value objects, rules)     │  — pure
+├──────────────────────────────────────────────┤
+│  Infrastructure (DB, cache, HTTP clients)    │  — implements domain ports
+└──────────────────────────────────────────────┘
+```
+Dependency direction: **only inward**. Delivery → Application → Domain. Infrastructure implements interfaces owned by the inner layers.
+If your domain imports an HTTP framework, a DB driver, or a cache client, the layering is broken.
+## Minimal layer roles
+| Layer | Contains | Does NOT contain |
+|---|---|---|
+| Delivery | Request parsing, auth check, calls a use case, maps result to response | Business rules, DB queries |
+| Application | Use case orchestration, transaction boundaries, calls repositories and services | SQL, HTTP, JSON parsing |
+| Domain | Entities, value objects, invariants, domain events | I/O, frameworks |
+| Infrastructure | Repository impls, HTTP client impls, message queue impls | Business decisions |
+## Ports and adapters
+The domain declares a **port** (interface). Infrastructure provides an **adapter** (implementation).
+```
+Domain declares (port):
+  interface UserRepository
+      findById(id) -> User | null
+      save(user) -> void
+Infrastructure provides (adapter):
+  PostgresUserRepository    implements UserRepository
+  InMemoryUserRepository    implements UserRepository  (for tests)
+  RedisUserRepository       implements UserRepository  (cache-aside)
+```
+Rule: the adapter file imports the port. The port file never imports any adapter.
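The port and adapter split might look like this in Python, using `typing.Protocol` for the port. The `User` fields, the method names, and `change_email` are illustrative assumptions. The in-memory adapter is the kind you would use in tests.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol

@dataclass
class User:
    id: str
    email: str

class UserRepository(Protocol):
    """Port: owned by the domain, knows nothing about storage."""
    def find_by_id(self, user_id: str) -> Optional[User]: ...
    def save(self, user: User) -> None: ...

class InMemoryUserRepository:
    """Adapter: satisfies the port structurally; a Postgres adapter would too."""
    def __init__(self) -> None:
        self._users: Dict[str, User] = {}

    def find_by_id(self, user_id: str) -> Optional[User]:
        return self._users.get(user_id)

    def save(self, user: User) -> None:
        self._users[user.id] = user

def change_email(repo: UserRepository, user_id: str, new_email: str) -> bool:
    """Application code depends only on the port, never on a concrete adapter."""
    user = repo.find_by_id(user_id)
    if user is None:
        return False
    user.email = new_email
    repo.save(user)
    return True
```

Swapping storage then means writing another class with the same two methods and changing one line in the composition root.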
+## Use case pattern
+A use case is one method, one transaction boundary, one business intent.
+```
+class CreateOrder:
+    deps: OrderRepository, UserRepository, PaymentGateway, EventBus
+    execute(cmd: CreateOrderCommand) -> OrderId:
+        user = userRepository.findById(cmd.userId)
+        if not user:             raise UserNotFound
+        if not user.canOrder():  raise UserCannotOrder
+        order = Order.create(user, cmd.items)     # domain rules
+        paymentGateway.authorize(order)           # infra
+        orderRepository.save(order)               # infra
+        eventBus.publish(OrderCreated(order.id))  # infra
+        return order.id
+```
+Delivery turns an HTTP request into `CreateOrderCommand`, calls `execute`, turns the result into a response. That's it.
+## Repository pattern
+Collect the DB operations for one aggregate behind one interface.
+```
+interface OrderRepository:
+    findById(id)      -> Order | null
+    findByUser(uid)   -> list[Order]
+    save(order)       -> void
+    delete(id)        -> void
+```
+Rules:
+- Repositories return **domain objects**, not DB rows.
+- Queries that cross aggregates (reporting, analytics) do NOT belong in a repository; put them in a dedicated `Queries` / `ReadModel` interface.
+- Avoid growing `findByXAndYAndZ` explosions — those signal you need a query object or a read model.
+## Service vs domain vs use case
+People confuse these. Rough guide:
+| Name | Lives in | Contains |
+|---|---|---|
+| Entity / Aggregate | Domain | State + invariants + rules that depend ONLY on that state |
+| Domain Service | Domain | Rules that span multiple aggregates but are still pure |
+| Use Case / Application Service | Application | Orchestration: load, decide, persist, publish |
+| Gateway / Client | Infrastructure | Talks to the outside world (HTTP, DB, queue) |
+If you have a `FooService` that does both business rules and DB calls, split it.
+## Dependency injection, without magic
+Pass dependencies in as constructor args. Don't pull them from globals.
+```
+# Good
+CreateOrder(orderRepo, userRepo, paymentGateway, eventBus)
+# Bad
+class CreateOrder:
+    def execute():
+        order_repo = Container.get("OrderRepository")   # hidden dep
+```
+Any framework DI container that ends up manipulating constructor signatures reflectively becomes impossible to reason about. Prefer explicit wiring in a composition root.
+## Composition root
+One file where everything is wired up.
+```
+# main / bootstrap
+db        = Postgres(config.url)
+cache     = Redis(config.redis_url)
+eventBus  = Kafka(config.brokers)
+userRepo   = PostgresUserRepository(db)
+orderRepo  = CachedOrderRepository(
+                PostgresOrderRepository(db),
+                cache,
+             )
+createOrder = CreateOrder(orderRepo, userRepo, PaymentStripe(config.key), eventBus)
+app.register("POST /orders", lambda req: http_create_order(req, createOrder))
+```
+All layering choices become visible in this one file.
+## Transaction boundary
+The use case decides where the transaction starts and ends, not the repository.
+```
+class TransferMoney:
+    execute(cmd):
+        with unitOfWork.begin():
+            src = accountRepo.findById(cmd.fromId)
+            dst = accountRepo.findById(cmd.toId)
+            src.withdraw(cmd.amount)
+            dst.deposit(cmd.amount)
+            accountRepo.save(src)
+            accountRepo.save(dst)
+        # commit happens here; rollback on exception
+```
+One transaction per use case, not per repository call. If a use case needs multiple transactions, it's probably two use cases.
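A self-contained sketch of a use case owning the transaction boundary. Everything here is hypothetical: `unit_of_work` fakes a transaction by snapshotting an in-memory store, where a real one would wrap a database transaction. The deposit is saved before the withdrawal on purpose, so a failed withdrawal has something to roll back.

```python
from contextlib import contextmanager
from dataclasses import dataclass

class InsufficientFunds(Exception):
    pass

@dataclass
class Account:
    id: str
    balance: int

    def withdraw(self, amount: int) -> None:
        if amount > self.balance:
            raise InsufficientFunds(self.id)   # domain invariant
        self.balance -= amount

    def deposit(self, amount: int) -> None:
        self.balance += amount

class InMemoryAccountRepo:
    def __init__(self, accounts):
        self.rows = {a.id: a.balance for a in accounts}

    def find_by_id(self, account_id: str) -> Account:
        return Account(account_id, self.rows[account_id])

    def save(self, account: Account) -> None:
        self.rows[account.id] = account.balance

@contextmanager
def unit_of_work(repo):
    """Stand-in for a DB transaction: restore the snapshot if the body raises."""
    snapshot = dict(repo.rows)
    try:
        yield
    except Exception:
        repo.rows = snapshot
        raise

class TransferMoney:
    def __init__(self, repo):
        self.repo = repo

    def execute(self, from_id: str, to_id: str, amount: int) -> None:
        with unit_of_work(self.repo):          # boundary lives in the use case
            src = self.repo.find_by_id(from_id)
            dst = self.repo.find_by_id(to_id)
            dst.deposit(amount)
            self.repo.save(dst)
            src.withdraw(amount)               # may raise after dst was saved
            self.repo.save(src)
```

The repository methods never commit on their own, which is what lets the use case treat both saves as one unit.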
+## Anti-patterns
+| Anti-pattern | Why bad | Fix |
+|---|---|---|
+| Framework objects (HTTP request/response) inside the domain | Couples domain to HTTP | Parse at delivery, pass plain command |
+| Repository returns a DB row | Leaks schema upward | Map to domain object at the edge |
+| Controller calls the DB directly | Skips domain rules | Every write goes through a use case |
+| ORM entities ARE the domain entities | Can't change storage without rewriting rules | Separate persistence model from domain model |
+| Static "Service" class with 40 unrelated methods | No cohesion; everything imports everything | One use case per class |
+| Domain event published before persistence succeeds | Consumers act on data that doesn't exist | Publish after commit, or use transactional outbox |
+## Don't over-engineer
+A 500-line CRUD service doesn't need four layers, a DI container, and a port-adapter diagram. Start simple:
+```
+# Acceptable for small services
+handler -> repository -> db
+```
+Introduce the extra seams **when you feel the pain**: when tests get hard, when the DB needs replacing, when rules start repeating across endpoints. Layering is a response to complexity, not a prerequisite for it.

package/skills/backend/references/observability.md ADDED

@@ -0,0 +1,190 @@
+# Observability
+Logs, metrics, traces, health. A request you can't trace is a bug you can't fix.
+## The three pillars
+| Signal | Answers | Cost | Cardinality |
+|---|---|---|---|
+| **Logs** | "What happened in this request?" | High (per-event) | Unlimited |
+| **Metrics** | "How much, how often, how fast, across the fleet?" | Low (aggregated) | Bounded (labels explode) |
+| **Traces** | "Where did time go in this distributed request?" | Medium (sampled) | Unlimited per trace |
+Pick the right signal for the question. Metrics for dashboards, logs for forensics, traces for latency breakdowns.
+## Structured logs — JSON, not prose
+Human-readable strings are unqueryable. Every log line is a JSON object with a stable schema.
+```json
+{
+  "ts": "2026-04-20T10:23:45.123Z",
+  "level": "info",
+  "service": "orders",
+  "env": "prod",
+  "request_id": "req_01HX...",
+  "trace_id": "0af7651916cd43dd...",
+  "user_id": "u_01HX...",
+  "msg": "order created",
+  "order_id": "ord_01HX...",
+  "amount_cents": 2599,
+  "duration_ms": 87
+}
+```
+Rules:
+- Always include `ts`, `level`, `service`, `env`.
+- Always include a request/trace id so you can stitch a request together across services.
+- Message is a short constant string — fixed values in `msg`, varying values in fields. `"order created"` not `"order ord_01HX created for $25.99"`.
+- Never log secrets, tokens, passwords, full PII. Redact at the logger, not the call site.
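The rules above could be applied with Python's standard `logging` module roughly as follows. `JsonFormatter`, `build_logger`, the `fields` convention on `extra`, and the redaction key list are assumptions made for this sketch, not a standard API.

```python
import io
import json
import logging
from datetime import datetime, timezone

REDACTED_KEYS = {"password", "token", "authorization"}  # illustrative, not exhaustive

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record; redact sensitive fields centrally."""

    def __init__(self, service: str, env: str):
        super().__init__()
        self.service, self.env = service, env

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc)
                          .isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            "env": self.env,
            "msg": record.getMessage(),       # short constant string
        }
        # Varying values arrive via extra={"fields": {...}}.
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in REDACTED_KEYS else value
        return json.dumps(entry)

def build_logger(stream, service: str = "orders", env: str = "dev") -> logging.Logger:
    logger = logging.getLogger(service)
    logger.handlers.clear()                   # avoid duplicate handlers on re-run
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, env))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

buf = io.StringIO()
log = build_logger(buf)
log.info("order created", extra={"fields": {"order_id": "ord_1", "token": "abc"}})
entry = json.loads(buf.getvalue())
```

Because redaction happens in the formatter, a call site that passes a token by mistake still produces a safe log line.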
+## Log levels — use them honestly
+| Level | Means | Typical rate |
+|---|---|---|
+| ERROR | Something broke; a human should look | Low |
+| WARN | Unexpected, but handled (retry succeeded, fallback used) | Low |
+| INFO | State changes worth knowing at normal volume | Medium |
+| DEBUG | Details useful while investigating; off in prod | High (when on) |
+Abused levels poison the signal. If everything is INFO, nothing is INFO.
+## Correlation IDs
+Every request gets a unique id at the edge; it propagates through every log line and outbound call.
+```
+incoming request → generate request_id (or accept from X-Request-ID)
+                 → bind to logger context
+                 → forward on outbound calls (X-Request-ID, traceparent)
+```
+Distributed tracing (OpenTelemetry) gives you `trace_id` + `span_id` for free. Log both when you have them.
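The three steps in the diagram can be sketched with `contextvars` and a `logging.Filter`. The function names are hypothetical, and the plain-dict header lookup is a simplification: real HTTP header names are case-insensitive, and a framework middleware would normally do this binding.

```python
import contextvars
import logging
import uuid

# Holds the current request's id; each thread or async task sees its own value.
request_id_var = contextvars.ContextVar("request_id", default="-")

def begin_request(headers: dict) -> str:
    """Accept an incoming X-Request-ID or generate one, then bind it."""
    request_id = headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex}"
    request_id_var.set(request_id)
    return request_id

class RequestIdFilter(logging.Filter):
    """Attach the bound request id to every log record passing through."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

def outbound_headers() -> dict:
    """Headers to add to every outbound call so downstream logs line up."""
    return {"X-Request-ID": request_id_var.get()}
```

Attach `RequestIdFilter` to the handler and reference `%(request_id)s` in the format string (or read `record.request_id` in a JSON formatter) to get the id on every line.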
+## Metrics — RED + USE
+Two checklists that cover almost everything.
+### RED (per request-driven service)
+- **R**ate — requests per second
+- **E**rrors — failing requests per second (or error rate)
+- **D**uration — latency distribution (p50 / p95 / p99)
+### USE (per resource)
+- **U**tilization — how busy is it? (CPU%, thread pool in use / max)
+- **S**aturation — how much work is queued? (request queue depth)
+- **E**rrors — how many operations failed?
+Track these for every service and every critical dependency.
+## Latency — measure distributions, not averages
+Averages hide the worst cases. P95/P99 are where your users actually feel slowness.
+```
+# ✅
+http_request_duration_seconds{route="/orders", method="POST"}
+  → histogram with buckets (0.01, 0.05, 0.1, 0.5, 1, 5)
+  → alert on p99 > 1s for 5 min
+# ❌
+avg_response_time = sum(durations) / count(durations)
+  → a 10 s outlier buried in 999 fast ones looks fine
+```
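A quick numeric illustration of why the mean misleads, using a nearest-rank percentile. The sample data is invented: 98% of requests take 50 ms and 2% take 10 s. In production the histogram backend computes percentiles for you; `percentile` here is only for the arithmetic.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p% of n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in seconds: 980 fast requests, 20 very slow ones.
durations = [0.05] * 980 + [10.0] * 20

mean = sum(durations) / len(durations)   # about 0.249 s, under a 1 s threshold
p50 = percentile(durations, 50)          # 0.05 s
p99 = percentile(durations, 99)          # 10.0 s, the pain 1 in 50 users feels
```

An alert on the mean stays quiet here, while an alert on p99 fires.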
+## Labels — finite cardinality
+Every unique label combination creates a new metric series. High-cardinality labels (user id, request id, email) will blow up storage and cost.
+| Label | OK? |
+|---|---|
+| route, method, status_code | Yes (small set) |
+| region, pod_name, env | Yes |
+| user_id, request_id, email, SKU | NO — use logs/traces for these |
+## Tracing
+A trace is a tree of spans representing one request's path through services. Each span has: operation name, start/end time, attributes, parent span.
+Auto-instrument with OpenTelemetry. Add manual spans around:
+- External HTTP calls (service, endpoint, status)
+- DB queries (operation, table; never the full raw query — cardinality)
+- Cache ops
+- Queue enqueue / dequeue
+- Expensive pure computations
+Sampling: head-based (1–10% of requests fully traced) or tail-based (keep traces where something went wrong). Keep tracer overhead < 1% of request latency.
+## SLOs — the contract
+An SLO is a number + a window. "99.9% of /orders responses succeed within 500 ms, measured over 28 days."
+Error budget = `1 − SLO`. Over 28 days, 99.9% allows ≈40 min of downtime. When you burn the budget, freeze risky changes and invest in reliability.
+Don't set SLOs to what your service does today. Set them to what your users need.
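The error-budget arithmetic, as a sketch. The function names are made up, and counting "bad minutes" is a simplification of request-based SLOs, where the budget is a number of failed requests rather than minutes.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, window_days: int, bad_minutes: float) -> float:
    """Positive: budget left to spend. Negative: budget blown, freeze risky changes."""
    return error_budget_minutes(slo, window_days) - bad_minutes

budget = error_budget_minutes(0.999, 28)    # 0.001 * 28 * 1440, about 40.3 minutes
left = budget_remaining(0.999, 28, 25.0)    # about 15.3 minutes after a 25-minute incident
```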
+## Alerts — page on symptoms, not causes
+Alert on "users are affected" (SLO burn rate, error rate spike, latency breach). Don't alert on "CPU is at 80%" — that's often fine.
+Every alert must be:
+- **Actionable** — there is something the oncall can do right now
+- **Unambiguous** — one cause for the page, not "anything could have fired this"
+- **Documented** — link to a runbook from the alert body
+If an alert fires and the oncall thinks "not my problem" or "auto-resolves in 5 min", it's a bad alert. Delete or tune it.
+## Runbooks
+One per alert. Structure:
+```
+# Alert: api-latency-p99-high
+## What this means
+p99 on /api/orders POST is > 1s for 5m.
+## Immediate checks
+1. Look at [dashboard-link]
+2. Check for recent deploy: [deploys-link]
+3. Check upstream health: [dep-status]
+## Common causes
+- DB slow query → check [slow-query-dashboard]
+- Cache outage → check redis metrics
+- Upstream payment provider → check provider status page
+## Mitigation
+- Roll back recent deploy if within 30 min window
+- Failover to secondary region
+- ...
+```
+## Health endpoints (minimal)
+```
+GET /health/live   → 200 if process can serve (don't check dependencies)
+GET /health/ready  → 200 only if dependencies are reachable (DB, cache, queue)
+```
+Live failing → orchestrator restarts the pod.
+Ready failing → orchestrator takes the pod out of the load balancer (but doesn't kill it).
+Never put business logic in health checks. They should be cheap and boring.
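A framework-free sketch of the two endpoints as plain functions returning a status code and a body. The 503 status for "not ready" and the response shape are assumptions; the source only specifies when 200 is returned. The dependency checks are passed in as callables so the handlers stay cheap and testable.

```python
from typing import Callable, Dict, Tuple

def live() -> Tuple[int, dict]:
    """Liveness: the process can answer. Deliberately checks no dependencies."""
    return 200, {"status": "ok"}

def ready(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Readiness: 200 only if every dependency check passes."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False   # an unreachable dependency counts as not ready
    ok = all(results.values())
    return (200 if ok else 503), {
        "status": "ok" if ok else "unavailable",
        "checks": results,
    }
```

Each check should be a fast ping with its own short timeout, for example a `SELECT 1` or a cache `PING`, so a slow dependency cannot hang the health endpoint itself.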
+## Anti-patterns
+| Anti-pattern | Why |
+|---|---|
+| String-formatted logs (`"user X did Y at Z"`) | Unqueryable |
+| Logging full request bodies | PII leak, storage blow-up |
+| Alerting on CPU / disk without symptom link | Pager fatigue; noise |
+| No request correlation id | Can't stitch a failure across services |
+| Logging at DEBUG in prod | Drowns the signal; storage cost |
+| `avg_latency` as the only latency metric | Hides the outliers that hurt users |
+| `status:500` as the only error signal | 200 with `{error: ...}` bodies exist and hurt |
+| Metrics labels with user id / email | Cardinality explosion |
+| Tracing everything, sampling nothing | Cost blowup; latency overhead |
+| Alerts without runbooks | Oncall guesses, takes too long |