npm - @coralai/sps-cli - Versions diffs - 0.41.2 → 0.43.0 - Mend

@coralai/sps-cli 0.41.2 → 0.43.0


Files changed (168)

package/README.md +34 -3
package/dist/commands/cardAdd.d.ts +1 -1
package/dist/commands/cardAdd.d.ts.map +1 -1
package/dist/commands/cardAdd.js +16 -6
package/dist/commands/cardAdd.js.map +1 -1
package/dist/commands/cardDashboard.js +1 -1
package/dist/commands/cardDashboard.js.map +1 -1
package/dist/commands/doctor.d.ts +9 -0
package/dist/commands/doctor.d.ts.map +1 -1
package/dist/commands/doctor.js +3 -314
package/dist/commands/doctor.js.map +1 -1
package/dist/commands/hookCommand.d.ts.map +1 -1
package/dist/commands/hookCommand.js +6 -7
package/dist/commands/hookCommand.js.map +1 -1
package/dist/commands/pmCommand.js +1 -1
package/dist/commands/pmCommand.js.map +1 -1
package/dist/commands/projectInit.d.ts.map +1 -1
package/dist/commands/projectInit.js +60 -37
package/dist/commands/projectInit.js.map +1 -1
package/dist/commands/setup.d.ts.map +1 -1
package/dist/commands/setup.js +3 -30
package/dist/commands/setup.js.map +1 -1
package/dist/commands/skillCommand.d.ts +2 -0
package/dist/commands/skillCommand.d.ts.map +1 -0
package/dist/commands/skillCommand.js +235 -0
package/dist/commands/skillCommand.js.map +1 -0
package/dist/commands/tick.js +1 -1
package/dist/commands/tick.js.map +1 -1
package/dist/core/checklist.d.ts +22 -0
package/dist/core/checklist.d.ts.map +1 -0
package/dist/core/checklist.js +38 -0
package/dist/core/checklist.js.map +1 -0
package/dist/core/checklist.test.d.ts +2 -0
package/dist/core/checklist.test.d.ts.map +1 -0
package/dist/core/checklist.test.js +74 -0
package/dist/core/checklist.test.js.map +1 -0
package/dist/core/config.d.ts +1 -1
package/dist/core/config.d.ts.map +1 -1
package/dist/core/config.js +1 -1
package/dist/core/config.js.map +1 -1
package/dist/core/config.test.js +7 -4
package/dist/core/config.test.js.map +1 -1
package/dist/core/context.d.ts +1 -1
package/dist/core/context.d.ts.map +1 -1
package/dist/core/skillStore.d.ts +46 -0
package/dist/core/skillStore.d.ts.map +1 -0
package/dist/core/skillStore.js +197 -0
package/dist/core/skillStore.js.map +1 -0
package/dist/core/skillStore.test.d.ts +2 -0
package/dist/core/skillStore.test.d.ts.map +1 -0
package/dist/core/skillStore.test.js +190 -0
package/dist/core/skillStore.test.js.map +1 -0
package/dist/engines/EventHandler.test.js +3 -3
package/dist/engines/EventHandler.test.js.map +1 -1
package/dist/engines/MonitorEngine.js +2 -2
package/dist/engines/MonitorEngine.js.map +1 -1
package/dist/engines/SchedulerEngine.js +1 -1
package/dist/engines/SchedulerEngine.js.map +1 -1
package/dist/engines/StageEngine.js +3 -3
package/dist/engines/StageEngine.js.map +1 -1
package/dist/engines/engine-pipeline-adapter.test.js +2 -2
package/dist/engines/engine-pipeline-adapter.test.js.map +1 -1
package/dist/interfaces/TaskBackend.d.ts +3 -1
package/dist/interfaces/TaskBackend.d.ts.map +1 -1
package/dist/main.js +19 -17
package/dist/main.js.map +1 -1
package/dist/models/types.d.ts +16 -1
package/dist/models/types.d.ts.map +1 -1
package/dist/providers/MarkdownTaskBackend.d.ts +2 -1
package/dist/providers/MarkdownTaskBackend.d.ts.map +1 -1
package/dist/providers/MarkdownTaskBackend.js +28 -5
package/dist/providers/MarkdownTaskBackend.js.map +1 -1
package/dist/providers/registry.d.ts.map +1 -1
package/dist/providers/registry.js +5 -7
package/dist/providers/registry.js.map +1 -1
package/package.json +1 -1
package/project-template/.claude/hooks/start.sh +44 -0
package/project-template/.claude/settings.json +1 -1
package/skills/architecture-decision-records/SKILL.md +207 -0
package/skills/backend/SKILL.md +62 -0
package/skills/backend/references/api-design.md +168 -0
package/skills/backend/references/caching.md +181 -0
package/skills/backend/references/data-access.md +173 -0
package/skills/backend/references/layering.md +181 -0
package/skills/backend/references/observability.md +190 -0
package/skills/backend/references/resilience.md +201 -0
package/skills/backend/references/security.md +186 -0
package/skills/backend-architect/SKILL.md +119 -0
package/skills/code-reviewer/SKILL.md +143 -0
package/skills/coding-standards/SKILL.md +60 -0
package/skills/coding-standards/references/clean-code.md +258 -0
package/skills/coding-standards/references/code-review.md +192 -0
package/skills/coding-standards/references/commits-and-prs.md +226 -0
package/skills/coding-standards/references/error-strategy.md +193 -0
package/skills/coding-standards/references/naming.md +185 -0
package/skills/coding-standards/references/tdd.md +171 -0
package/skills/database/SKILL.md +53 -0
package/skills/database/references/indexing.md +190 -0
package/skills/database/references/migrations.md +199 -0
package/skills/database/references/nosql.md +185 -0
package/skills/database/references/queries.md +295 -0
package/skills/database/references/scaling.md +203 -0
package/skills/database/references/schema.md +191 -0
package/skills/database-optimizer/SKILL.md +168 -0
package/skills/debugging-workflow/SKILL.md +244 -0
package/skills/devops/SKILL.md +55 -0
package/skills/devops/references/ci-cd.md +204 -0
package/skills/devops/references/containers.md +272 -0
package/skills/devops/references/deploy.md +201 -0
package/skills/devops/references/iac.md +252 -0
package/skills/devops/references/observability.md +228 -0
package/skills/devops/references/secrets.md +178 -0
package/skills/devops-automator/SKILL.md +164 -0
package/skills/frontend/SKILL.md +52 -0
package/skills/frontend/references/accessibility.md +222 -0
package/skills/frontend/references/components.md +206 -0
package/skills/frontend/references/performance.md +219 -0
package/skills/frontend/references/routing.md +209 -0
package/skills/frontend/references/state.md +190 -0
package/skills/frontend/references/testing.md +216 -0
package/skills/frontend-developer/SKILL.md +115 -0
package/skills/git-workflow/SKILL.md +355 -0
package/skills/golang/SKILL.md +49 -0
package/skills/golang/references/concurrency.md +284 -0
package/skills/golang/references/errors.md +241 -0
package/skills/golang/references/idioms.md +285 -0
package/skills/golang/references/testing.md +238 -0
package/skills/java/SKILL.md +50 -0
package/skills/java/references/concurrency.md +194 -0
package/skills/java/references/idioms.md +283 -0
package/skills/java/references/testing.md +228 -0
package/skills/kotlin/SKILL.md +47 -0
package/skills/kotlin/references/coroutines.md +240 -0
package/skills/kotlin/references/idioms.md +268 -0
package/skills/kotlin/references/testing.md +219 -0
package/skills/mobile/SKILL.md +50 -0
package/skills/mobile/references/architecture.md +204 -0
package/skills/mobile/references/navigation.md +158 -0
package/skills/mobile/references/performance.md +152 -0
package/skills/mobile/references/platform.md +166 -0
package/skills/mobile/references/state-and-data.md +174 -0
package/skills/python/SKILL.md +51 -0
package/skills/python/THIRD_PARTY.md +14 -0
package/skills/python/references/async.md +218 -0
package/skills/python/references/error-handling.md +254 -0
package/skills/python/references/idioms.md +279 -0
package/skills/python/references/packaging.md +233 -0
package/skills/python/references/testing.md +269 -0
package/skills/python/references/typing.md +292 -0
package/skills/qa-tester/SKILL.md +186 -0
package/skills/rust/SKILL.md +50 -0
package/skills/rust/references/async.md +224 -0
package/skills/rust/references/errors.md +240 -0
package/skills/rust/references/ownership.md +263 -0
package/skills/rust/references/testing.md +274 -0
package/skills/rust/references/traits.md +250 -0
package/skills/security-engineer/SKILL.md +157 -0
package/skills/swift/SKILL.md +48 -0
package/skills/swift/references/concurrency.md +280 -0
package/skills/swift/references/idioms.md +334 -0
package/skills/swift/references/testing.md +229 -0
package/skills/typescript/SKILL.md +51 -0
package/skills/typescript/references/async.md +241 -0
package/skills/typescript/references/errors.md +208 -0
package/skills/typescript/references/idioms.md +246 -0
package/skills/typescript/references/testing.md +225 -0
package/skills/typescript/references/tooling.md +208 -0
package/skills/typescript/references/types.md +259 -0

package/skills/database/references/schema.md ADDED

@@ -0,0 +1,191 @@
+# Schema
+Normalization, keys, constraints, types.
+## Start normalized
+3rd Normal Form (3NF) is the sweet spot for most OLTP apps:
+- Each table represents one concept.
+- Every non-key column depends on the key, the whole key, and nothing but the key.
+- No repeating groups (prefer child tables over `phone1 / phone2 / phone3`).
+Denormalize later, with measurement, for reporting / read-heavy paths. Denormalizing by default breeds data-consistency bugs.
+## Primary keys
+Every table has one. Options:
+| Type | Pros | Cons |
+|---|---|---|
+| Auto-increment integer | Compact, fast, ordered | Leaks count; awkward in distributed systems |
+| UUID v4 | Globally unique, generated anywhere | 16 bytes, random → index fragmentation |
+| UUID v7 / ULID / KSUID | Time-ordered, unique, generated anywhere | Slightly newer; check lib support |
+| Composite key | Natural uniqueness (e.g., `(order_id, line_no)`) | Join and FK complexity |
+Recommendation: **UUID v7 / ULID** for new systems. Still sortable by time, no central generator needed, no count leakage. Store as `UUID` (Postgres) or `BINARY(16)` (MySQL); avoid `VARCHAR(36)`, which takes more than twice the storage (36 characters against 16 bytes).
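The storage argument above can be checked with Python's standard `uuid` module. This is a minimal sketch that only compares the two representations; `example_id` is an illustrative value, and real column sizes also include per-value overhead that depends on the database.

```python
import uuid

# Illustrative ID used only to measure representation sizes.
example_id = uuid.uuid4()

binary_size = len(example_id.bytes)  # raw 128-bit value, as in UUID / BINARY(16)
text_size = len(str(example_id))     # hyphenated text form, as in VARCHAR(36)

print(binary_size, text_size, text_size / binary_size)
```

The text form is 2.25 times the size of the binary form before any index or row overhead is counted.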
+## Foreign keys
+Declare them:
+```sql
+CREATE TABLE orders (
+    id         UUID PRIMARY KEY,
+    user_id    UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+```
+`ON DELETE`:
+- `RESTRICT` / `NO ACTION` — safe default; forces explicit cleanup.
+- `CASCADE` — dangerous; one delete can wipe large subgraphs. Use carefully.
+- `SET NULL` — when the relationship is optional.
+Don't skip FKs "for performance". The integrity guarantee is worth a lot more than the microseconds.
+## Constraints over application logic
+Let the DB enforce invariants that must always hold.
+```sql
+-- Invariants
+email      TEXT     NOT NULL UNIQUE,
+age        INT      CHECK (age >= 0 AND age <= 150),
+status     TEXT     NOT NULL CHECK (status IN ('pending','active','banned')),
+balance    NUMERIC  NOT NULL CHECK (balance >= 0),
+-- Uniqueness across multiple columns
+CONSTRAINT uq_org_email UNIQUE (org_id, email)
+```
+Application validation is belt; DB constraints are suspenders. You want both.
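A minimal sketch of the database rejecting bad rows on its own, using SQLite through Python's `sqlite3` module as a stand-in for Postgres. The `accounts` table and the `try_insert` helper are assumptions made for illustration, not part of the original text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        email   TEXT    NOT NULL UNIQUE,
        status  TEXT    NOT NULL CHECK (status IN ('pending','active','banned')),
        balance NUMERIC NOT NULL CHECK (balance >= 0)
    )
""")
conn.execute(
    "INSERT INTO accounts (email, status, balance) VALUES ('a@example.com', 'active', 10)"
)

def try_insert(email, status, balance):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO accounts (email, status, balance) VALUES (?, ?, ?)",
            (email, status, balance),
        )
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    "duplicate email": try_insert("a@example.com", "active", 5),
    "unknown status": try_insert("b@example.com", "deleted", 5),
    "negative balance": try_insert("c@example.com", "active", -1),
    "valid row": try_insert("d@example.com", "pending", 0),
}
print(results)
```

Only the last row is accepted, even though no application code validated anything.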
+## Column types
+### Text
+- `TEXT` in Postgres — no length limit, same storage as `VARCHAR`.
+- `VARCHAR(n)` when the limit is a real business rule (e.g., phone max 20). `VARCHAR(255)` by habit is noise.
+- `CHAR(n)` — almost never the right answer (pads with spaces).
+### Numbers
+- Integers: `INTEGER` (32-bit) or `BIGINT` (64-bit). Pick based on range.
+- **Money**: `NUMERIC(12, 2)` or **cents as integer** — never `FLOAT` / `DOUBLE` (binary floats lose pennies).
+- `REAL` / `DOUBLE PRECISION` only for scientific / measurement data where precision loss is OK.
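The money rule above is easy to see in a few lines of Python. The sketch compares binary floats with `decimal.Decimal` (the in-process analogue of `NUMERIC`) and with integer cents; the amounts are arbitrary.

```python
from decimal import Decimal

float_total = 0.1 + 0.2                            # binary floating point
decimal_total = Decimal("0.10") + Decimal("0.20")  # exact decimal arithmetic
cents_total = 10 + 20                              # integer cents

print(float_total == 0.3)                 # False: the float sum is not exactly 0.3
print(decimal_total == Decimal("0.30"))   # True
print(cents_total == 30)                  # True
```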
+### Time
+- `TIMESTAMPTZ` (Postgres with timezone) or UTC `TIMESTAMP` — always store in UTC. Convert in the app.
+- `DATE` for dates without time-of-day.
+- Never store time as `TEXT` or milliseconds-since-epoch as `BIGINT` unless you genuinely need it for external APIs.
+### Boolean
+- `BOOLEAN` where supported.
+- MySQL older versions: `TINYINT(1)` as a workaround.
+### Enum-like
+Three options:
+| Approach | Pros | Cons |
+|---|---|---|
+| Native ENUM (Postgres, MySQL) | Type-checked at DB level | Hard to add values without ALTER |
+| CHECK constraint with string | Easy to extend | String fragility |
+| Lookup table + FK | Flexible, self-documenting | Join on common queries |
+For small fixed sets (status = active / pending / banned), CHECK on a TEXT column is the pragmatic default.
+### JSON
+- `JSONB` (Postgres) — indexable, queryable.
+- Use for schemaless attributes, user-defined fields, optional metadata.
+- **Don't** use as the primary way to represent structured data. If you're going to query `raw->>'email'` everywhere, promote `email` to a column.
+## Naming
+Pick a convention, stay consistent.
+- **Tables**: plural (`users`, `orders`) or singular (`user`, `order`). Most teams pick plural.
+- **Columns**: `snake_case`. Match the language's ORM convention.
+- **Primary key**: `id`.
+- **Foreign keys**: `<table_singular>_id`. `user_id` references `users.id`.
+- **Timestamps**: `created_at`, `updated_at`, `deleted_at`.
+- **Booleans**: `is_active`, `has_verified_email`.
+- **Indexes**: `ix_<table>_<columns>`. Unique: `uq_<table>_<columns>`.
+## Timestamps on every table
+```sql
+created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
+```
+`updated_at` via trigger OR app responsibility. Auditing, debugging, backfills — all need these.
+## Soft delete vs. hard delete
+Default: **hard delete**. Reclaims space, simplifies queries, respects privacy requests.
+Soft delete (`deleted_at TIMESTAMPTZ NULL`) when:
+- You need a recovery window.
+- Historical referencing matters (keep the row for audit but hide from listings).
+Trade-off: every query now filters `WHERE deleted_at IS NULL`. Forgetting is a bug class. Consider a `users_active` view for day-to-day use.
+If you need audit history, an `events` / `audit_log` table is usually cleaner than soft-delete everywhere.
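The `users_active` view mentioned above could look roughly like this. The sketch runs on SQLite through Python's `sqlite3` module, so `deleted_at` is a `TEXT` column only because SQLite lacks `TIMESTAMPTZ`; the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        email      TEXT NOT NULL,
        deleted_at TEXT NULL   -- stand-in for TIMESTAMPTZ, which SQLite does not have
    );
    -- Day-to-day queries read the view, so the filter cannot be forgotten.
    CREATE VIEW users_active AS
        SELECT id, email FROM users WHERE deleted_at IS NULL;

    INSERT INTO users (email) VALUES ('a@example.com'), ('b@example.com');
    UPDATE users SET deleted_at = '2026-04-01T00:00:00Z' WHERE email = 'b@example.com';
""")

active = [row[0] for row in conn.execute("SELECT email FROM users_active ORDER BY id")]
total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(active, total)
```

The soft-deleted row stays in `users` for recovery but never shows up through the view.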
+## Partitioning
+For very large tables (hundreds of millions of rows), partition by time, tenant, or region. Postgres declarative partitioning:
+```sql
+CREATE TABLE events (
+    id UUID NOT NULL,
+    tenant_id UUID NOT NULL,
+    created_at TIMESTAMPTZ NOT NULL,
+    payload JSONB,
+    PRIMARY KEY (id, created_at)  -- a PK on a partitioned table must include the partition key
+) PARTITION BY RANGE (created_at);
+CREATE TABLE events_2026_04 PARTITION OF events
+    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
+```
+Benefits: small indexes per partition, easy to drop old data (`ALTER TABLE events DETACH PARTITION events_2026_04`, then `DROP TABLE`), partition pruning and parallel queries.
+Don't partition speculatively. The operational overhead is real.
+## Multi-tenancy
+Three shapes:
+| Shape | Pros | Cons |
+|---|---|---|
+| Separate DBs per tenant | Full isolation, easy backup per tenant | Operational overhead at scale |
+| Shared DB, separate schemas | Middle ground | Connection per schema can be clunky |
+| Shared DB, shared schema + tenant_id column | Simplest, scales to many tenants | Every query MUST filter tenant_id |
+Shared schema + `tenant_id`: enforce via row-level security (Postgres RLS) if available, or framework middleware that injects the filter. Forgetting is a catastrophic data leak.
+## Referential design patterns
+- **Associations**: join table `order_items (order_id, line_no, product_id, qty, price_cents)` with composite PK.
+- **Hierarchies**: `parent_id` + recursive CTE, or `ltree` (Postgres), or nested sets (read-heavy).
+- **Tags / many-to-many**: `post_tags (post_id, tag_id)`.
+- **Audit**: separate `audit_logs` table with immutable rows.
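The `parent_id` + recursive CTE pattern from the list above can be sketched as follows. It runs on SQLite through Python's `sqlite3` module; the `categories` table and its rows are assumptions made for illustration, and the same `WITH RECURSIVE` shape is accepted by Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES categories(id),
        name      TEXT NOT NULL
    );
    INSERT INTO categories (id, parent_id, name) VALUES
        (1, NULL, 'root'),
        (2, 1,    'electronics'),
        (3, 2,    'phones'),
        (4, 1,    'books');
""")

# Walk down from 'electronics' (id 2), tracking depth at each step.
rows = conn.execute("""
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM categories WHERE id = ?
        UNION ALL
        SELECT c.id, c.name, s.depth + 1
        FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY depth, name
""", (2,)).fetchall()
print(rows)
```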
+## Anti-patterns
+| Anti-pattern | Fix |
+|---|---|
+| Primary key = `VARCHAR(36)` UUID | Use native `UUID` / `BINARY(16)` |
+| `VARCHAR(255)` reflex | Use `TEXT` or a justified length |
+| One table "users_and_admins" with a role column and many nullable fields | Normalize or use type-specific tables |
+| Store JSON with the same shape in every row | Promote to columns |
+| Mutable history columns (`last_email_change_at`) spread across user table | Consider an audit log |
+| Money as `FLOAT` | Never |
+| `bool` stored as `Y/N` strings | Native `BOOLEAN` |
+| Different timestamps in different timezones | UTC everywhere |
+| `NULL` for "unknown" AND "not applicable" | Two different columns, or a CHECK'd enum |
+| Massive tables with no partitioning plan | Partition or archive when they get big |

package/skills/database-optimizer/SKILL.md ADDED

@@ -0,0 +1,168 @@
+---
+name: database-optimizer
+description: Persona skill — debug and tune SQL / DB performance like a specialist. Read plans, spot missing indexes, size pools. Overlay on top of `database`. For patterns, load `database`.
+origin: original
+---
+# Database Optimizer
+Diagnose slow queries. Design indexes that earn their keep. Keep the DB quietly fast.
+## When to load
+- A query is slow
+- A hot table keeps timing out
+- Planning a schema change with performance implications
+- Reviewing an ORM-generated query
+- Sizing a connection pool, memory, autovacuum
+## The posture
+1. **Numbers, not hunches.** `EXPLAIN (ANALYZE, BUFFERS)` beats "this should be fast."
+2. **The right index > three wrong ones.** Over-indexing is its own perf bug (writes, storage).
+3. **Most "DB problems" are query problems.** Bad SQL, N+1, unnecessary ORDER BY. Fix the query before scaling hardware.
+4. **The plan is the truth.** If it says seq scan and you expected index scan, figure out why.
+5. **Stats are often stale.** `ANALYZE` / `ANALYZE TABLE` before believing the plan.
+6. **Think in rows scanned, not rows returned.** Scanning 1M rows to return 10 is the bug.
+## The diagnostic flow
+When "the query is slow":
+1. **Get the query.** Exact SQL as executed, with real parameters.
+2. **Run `EXPLAIN (ANALYZE, BUFFERS)`** (Postgres) / `EXPLAIN ANALYZE` (MySQL 8.0.18+). Read the actual times and row counts, not just the estimated cost.
+3. **Look for the big number.** One node dominates. Start there.
+4. **Compare actual vs. estimated rows.** Orders of magnitude off → stats are stale or skewed.
+5. **Look for seq scan with a selective filter.** Missing / unused index.
+6. **Look for sort spilling to disk.** Under-sized work_mem or wrong index order.
+7. **Look for nested loop on a large inner.** Missing join index or bad cardinality estimate.
+8. **Propose the fix.** Add index / rewrite query / update stats / partition / cache.
+9. **Measure.** Run again with the change. Quote before/after timings.
+Don't propose fixes blind. Every fix you ship without a measured before/after is a guess.
+## Questions you always ask
+- **Is this the exact SQL production runs?** (ORMs lie; check with pg_stat_statements or query log.)
+- **What's the selectivity?** How many rows does the filter return out of the total?
+- **Is there an index that covers this?** Check `pg_stat_user_indexes` for usage.
+- **What's the cardinality estimate vs. actual?**
+- **Is the query sargable?** (`WHERE lower(email) = ?` without an expression index isn't.)
+- **Are statistics fresh?** When was the last `ANALYZE`?
+- **Is this part of an N+1?** A small query run N times usually costs more than one larger query.
+- **How does this scale?** At 10× data, what happens?
+## Common patterns you recognize
+### Missing index
+```
+Seq Scan on orders  (cost=0.00..25000.00 rows=1000 width=48)
+  Filter: (status = 'pending')
+  Rows Removed by Filter: 499000
+  Actual Rows: 1000
+```
+Scanned 500K, kept 1K. Index on `status` (partial index if 'pending' is rare) fixes it.
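The same before/after check can be tried locally. This sketch uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module as a stand-in for Postgres `EXPLAIN`; the plan wording is SQLite's, and the table, the data and the index name `ix_orders_status` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("pending" if i % 500 == 0 else "shipped",) for i in range(5000)],
)

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT id FROM orders WHERE status = 'pending'"

plan_before = plan(query)   # full table scan
conn.execute("CREATE INDEX ix_orders_status ON orders (status)")
plan_after = plan(query)    # index search

print(plan_before)
print(plan_after)
```

Before the index the plan is a scan of `orders`; afterwards it names `ix_orders_status`.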
+### Wrong composite order
+```
+-- Index: (created_at, user_id)
+EXPLAIN SELECT * FROM orders WHERE created_at > now() - interval '1 day' AND user_id = ?;
+```
+The range column leads the index, so the equality on `user_id` cannot narrow the scan and the planner may fall back to a seq scan. Put equality columns first: reorder to `(user_id, created_at)` or add a second index.
+### Stale stats
+```
+Estimated Rows: 50
+Actual Rows: 500000
+```
+10 000× off. The planner used the wrong join strategy. `ANALYZE` the table.
+### N+1
+```python
+for order in orders:                              # 1 query
+    user = User.get(order.user_id)                # N queries
+```
+Fix: batch (`User.get_by_ids([...])`) or eager-load in the ORM (`select_related` / `prefetch_related` in Django, `joinedload` / `selectinload` in SQLAlchemy, etc.).
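To make the cost difference concrete, the sketch below counts queries for both strategies against an in-memory SQLite database. The schema, the `run` helper and the `stats` counter are assumptions made for illustration; an ORM's batch or eager-load call would issue something like the second form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER NOT NULL REFERENCES users(id));
    INSERT INTO users  (id, name)    VALUES (1, 'ada'), (2, 'grace'), (3, 'linus');
    INSERT INTO orders (id, user_id) VALUES (10, 1), (11, 2), (12, 3), (13, 1);
""")

stats = {"queries": 0}

def run(sql, params=()):
    """Execute a query and count it, so the two strategies can be compared."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()

# N+1: one query for the orders, then one per order for its user.
stats["queries"] = 0
orders = run("SELECT id, user_id FROM orders")
for _, user_id in orders:
    run("SELECT name FROM users WHERE id = ?", (user_id,))
n_plus_one_queries = stats["queries"]

# Batched: one query for the orders, one for all distinct users.
stats["queries"] = 0
orders = run("SELECT id, user_id FROM orders")
user_ids = sorted({user_id for _, user_id in orders})
placeholders = ", ".join("?" for _ in user_ids)
users = dict(run(f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids))
batched_queries = stats["queries"]

print(n_plus_one_queries, batched_queries)
```

With four orders the loop issues five queries and the batched form issues two; the gap grows linearly with the number of orders.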
+### ORDER BY without a supporting index
+```sql
+SELECT * FROM events WHERE tenant_id = ? ORDER BY created_at DESC LIMIT 10;
+-- No index on (tenant_id, created_at) → sorts every matching row before applying LIMIT
+```
+Composite index matching filter + sort makes this an index scan + early termination.
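A rough way to watch the sort step disappear, again using SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module rather than Postgres. SQLite reports an explicit sort as `USE TEMP B-TREE FOR ORDER BY`; the table and the index name `ix_events_tenant_created` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, tenant_id INTEGER NOT NULL, created_at TEXT NOT NULL)"
)

def plan(sql):
    """Return the detail column of SQLite's query plan, one string per step."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT id FROM events WHERE tenant_id = 7 ORDER BY created_at DESC LIMIT 10"

sorts_before = any("TEMP B-TREE" in step for step in plan(query))
conn.execute("CREATE INDEX ix_events_tenant_created ON events (tenant_id, created_at)")
sorts_after = any("TEMP B-TREE" in step for step in plan(query))

print(sorts_before, sorts_after)
```

Once the composite index exists, rows come out of the index already ordered and the separate sort step is gone.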
+### Bloat / dead tuples
+Postgres UPDATE / DELETE leaves dead tuples. Autovacuum cleans up. If it's not keeping up:
+```sql
+SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
+FROM pg_stat_user_tables WHERE n_dead_tup > 10000
+ORDER BY n_dead_tup DESC;
+```
+Tune autovacuum thresholds on hot tables.
+## Index recommendations
+- **First index on a hot table**: the one that serves the dominant query.
+- **Composite**: columns in order of `WHERE = ?` first, then `WHERE > ?` (range), then `ORDER BY`.
+- **Partial**: when the query filters on a rare value.
+- **Expression**: when `WHERE lower(col) = ?` / `WHERE date_trunc('day', col) = ?`.
+- **Covering (INCLUDE)**: when the query reads a few extra columns on top of the indexed ones.
+Prune: drop indexes with zero scans in the last 30 days (after verifying usage across envs).
+## Pool / memory sizing
+When the DB is healthy but the app times out:
+- **Connection saturation**: check pool wait times in the app. Likely `max_size` too small.
+- **DB max_connections**: Postgres default ~100. Total app replicas × pool size must leave headroom for admin.
+- **work_mem** (Postgres): per-operation; if queries spill to disk, consider raising (but test — memory multiplies per connection).
+- **shared_buffers**: typically 25% of available RAM for a dedicated DB host.
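The headroom rule for `max_connections` is simple arithmetic. The helper below is a hypothetical sketch, and the default of 10 reserved connections is an assumption rather than a recommendation from the text.

```python
def max_pool_size(max_connections: int, replicas: int, reserved_for_admin: int = 10) -> int:
    """Largest per-replica pool size that keeps total connections within the DB limit.

    reserved_for_admin is an assumed headroom for migrations, monitoring
    and interactive sessions.
    """
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    usable = max_connections - reserved_for_admin
    return max(usable // replicas, 0)

# Postgres default limit of 100 connections shared by 6 app replicas.
pool_size = max_pool_size(max_connections=100, replicas=6)  # (100 - 10) // 6
print(pool_size)
```

Scaling the app to more replicas shrinks the safe pool size unless a pooler or a higher `max_connections` absorbs the difference.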
+## Anti-patterns you always flag
+- `SELECT *` in production app code.
+- Adding an index on every column "just in case".
+- `WHERE function(col) = ?` without a matching expression index.
+- `LIKE '%x%'` on a big table (a leading wildcard cannot use a B-tree index; it needs a trigram index such as `pg_trgm`, or full-text search).
+- `ORDER BY RANDOM()` on large tables.
+- Business logic implemented in triggers without ADR.
+- Read-then-update for an upsert (race condition).
+- One giant transaction wrapping a batch import.
+- UUID v4 primary keys on tables heavily sorted/paginated by PK (use UUID v7 / ULID).
+- Migration that takes a long lock during peak traffic.
+## Tradeoffs you name
+- **Index count vs. write speed.** Every index is a write tax.
+- **Normalization vs. read speed.** Denormalize only where measured.
+- **Consistency vs. throughput.** Choose the isolation level (repeatable read / snapshot isolation / serializable) per workload.
+- **Read from replica vs. primary.** Staleness vs. primary load.
+## Forbidden patterns
+- Proposing performance fixes without an EXPLAIN
+- "Let's just scale up" before diagnosing
+- Adding an index without naming the query it serves
+- Ignoring migration lock impact ("it's a small change")
+- Running heavy analytical queries against the primary
+- Changing indexes without measuring before/after
+## Pair with
+- [`database`](../database/SKILL.md) — the patterns and the vocabulary.
+- [`backend/references/data-access.md`](../backend/references/data-access.md) — where SQL meets app code.
+- [`devops/references/observability.md`](../devops/references/observability.md) — DB dashboards and alerts.

package/skills/debugging-workflow/SKILL.md ADDED

@@ -0,0 +1,244 @@
+---
+name: debugging-workflow
+description: Workflow skill — systematic debugging. Reproduce, isolate, hypothesize, verify. Works for bugs, performance issues, and live incidents.
+origin: original
+---
+# Debugging Workflow
+A method, not a ritual. Works for "user says it's broken" bugs, performance regressions, and live incidents.
+## When to load
+- Something is broken and you don't yet know why
+- A test started failing and you don't know which change broke it
+- A performance regression appeared
+- You're on-call and an alert fired
+- Reviewing someone's debug session to teach or coach
+## The posture
+1. **Change one variable at a time.** If you flip three things and it works, you don't know which one fixed it.
+2. **Reproduce first, diagnose second, fix third.** Fixing without reproducing is guessing.
+3. **Trust the data over the story.** Bug reports are leads, not proofs.
+4. **Read the error, all of it.** Stack trace, message, timestamp, request id.
+5. **When stuck, lower the abstraction.** Go one layer down until the mechanism is visible.
+6. **Stop when stumped. Sleep. Reset.** Fresh eyes find the bugs that tired eyes write.
+## The flow
+```
+  Reproduce ──▶ Isolate ──▶ Hypothesize ──▶ Test ──▶ Fix ──▶ Verify
+      ▲                                       │
+      └──────────── disconfirm? back up ──────┘
+```
+### 1. Reproduce
+You cannot debug what you can't reproduce. Turn the bug into a command.
+- From a user report: collect the steps, the exact time, the user id, the device.
+- From logs / metrics: narrow to the failing request or batch; get a request id.
+- From a test: `cargo test --test my_test`, `pytest -k my_test`.
+Goal: the smallest reproducible case. A 20-step manual reproduction is a lead; a 3-line test is evidence.
+If you cannot reproduce:
+- Add logging around the suspected area, ship a canary, wait for recurrence.
+- Check if it's environment-specific (timezone, locale, OS, version).
+- Check if it's data-specific (a particular record triggers it).
+- Consider whether the bug is the bug report (user confused, different issue).
+### 2. Isolate
+Shrink the reproduction. Remove pieces one at a time. The last piece you remove is the bug's home.
+Techniques:
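+- **Binary search** the codebase: comment out half and try to reproduce. If the bug persists it lives in the remaining half; if it disappears it lives in the half you removed. Halve that part again and repeat. Use `git bisect` if the bug is recent.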
+- **Minimize the input**: shorter string, fewer rows, simpler config.
+- **Swap in fakes**: if the bug reproduces with a fake DB, the bug isn't in the real DB.
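The binary-search idea behind `git bisect` can be sketched in a few lines. `first_bad`, the commit names and the `is_bad` predicate are hypothetical; in practice `is_bad` would check out the commit and run the reproduction.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of checks performed).

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and once the bug appears it stays.
    """
    low, high = 0, len(commits) - 1  # commits[low] is good, commits[high] is bad
    checks = 0
    while high - low > 1:
        mid = (low + high) // 2
        checks += 1
        if is_bad(commits[mid]):
            high = mid   # bug already present: look earlier
        else:
            low = mid    # still good: look later
    return commits[high], checks

# Hypothetical history of 16 commits where the bug arrived in "c11".
history = [f"c{i}" for i in range(16)]
culprit, checks = first_bad(history, lambda commit: int(commit[1:]) >= 11)
print(culprit, checks)
```

Sixteen commits are narrowed down in four checks, which is why bisecting beats reading every diff in order.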
+### 3. Hypothesize
+State, out loud or in writing, what you think is happening. One sentence.
+> "When the cart is empty, checkout is calling `items[0]` and crashing."
+A good hypothesis:
+- Is specific (names a function, a condition).
+- Is disprovable (there's an experiment that would show it's wrong).
+- Explains the observed symptom AND the variations you've seen.
+### 4. Test the hypothesis
+Design the test that would disprove it.
+- Add a print/log at the suspected line.
+- Run with the minimal reproduction.
+- Observe: does reality match the hypothesis?
+If yes → proceed to fix.
+If no → the hypothesis is wrong. Go back to step 3.
+Don't let a wrong hypothesis linger. "It almost fits" is how debug sessions become five-hour goose chases.
+### 5. Fix
+Smallest correct change that fixes the bug and doesn't break other things.
+- Add a test that would have caught this.
+- Make the test fail.
+- Apply the fix.
+- Test passes.
+- Other tests still pass.
+Commit fix + test together.
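As an illustration of the fix-plus-test pairing, here is a sketch built on the empty-cart hypothesis from step 3. The function `first_item_price` and its behavior are invented for this example; the point is that the first test fails against the buggy version and passes after the fix.

```python
import unittest

def first_item_price(items):
    """Fixed version: an empty cart yields None instead of raising IndexError.

    The hypothetical buggy version was simply `return items[0]["price"]`.
    """
    if not items:
        return None
    return items[0]["price"]

class EmptyCartRegressionTest(unittest.TestCase):
    def test_empty_cart_does_not_crash(self):
        # Raises IndexError against the buggy version; passes after the fix.
        self.assertIsNone(first_item_price([]))

    def test_non_empty_cart_is_unchanged(self):
        # Guards against the fix breaking the normal path.
        self.assertEqual(first_item_price([{"price": 5}]), 5)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(EmptyCartRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```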
+### 6. Verify
+- Run the test.
+- Run the minimal reproduction.
+- Run the original user scenario (if different).
+- For production bugs: deploy to staging and verify there before prod.
+Don't close the ticket until you've verified on the system where the bug was reported.
+## Tools by abstraction level
+When the bug hides, drop a layer.
+| Level | Tools |
+|---|---|
+| **Logs** | grep, structured-log viewer, APM log search |
+| **Metrics / dashboards** | Grafana, Datadog, CloudWatch |
+| **Traces** | Jaeger, Tempo, DD APM |
+| **Debugger** | `pdb`, IDE debuggers, `dlv`, `lldb` |
+| **Profiler** | `pprof`, py-spy, perf, Instruments |
+| **Network** | `tcpdump`, Wireshark, browser DevTools Network, `curl -v` |
+| **System calls** | `strace` (Linux), `dtruss` (macOS) |
+| **Kernel / hardware** | `perf`, eBPF, `iostat`, `top` |
+You usually won't go below "traces". When you do, the bug was worth the depth.
+## The log reading discipline
+Read the entire trace, not just the top line.
+```
+ValidationError: email required
+  at validate (validate.py:23)
+  at create (service.py:41)
+  at handler (app.py:15)      ← where the request started
+```
+- **Top**: the immediate cause.
+- **Middle**: the path that got there.
+- **Bottom**: the entry point.
+For multi-service requests, trace by **request id** across services. If you can't — fix that first.
+## Debugging performance
+Different but structurally similar:
+1. **Measure**. Don't optimize without a number. `ab`, `k6`, `wrk` for throughput; APM for p95/p99.
+2. **Profile**. Flame graph reveals the hot function. Guessing reveals nothing.
+3. **Hypothesize the bottleneck**. "The SQL is slow" vs. "JSON serialization is slow" vs. "We're blocking on the main thread."
+4. **Test with EXPLAIN / flame graph / profiler output**.
+5. **Fix the highest-yield bottleneck**. Ignore the rest until you've re-measured.
+Rule: **never optimize the 2% case while the 60% case is still on the table**.
+## Debugging flaky tests
+A flaky test is a bug. Treat it.
+- **Shared mutable state** between tests — reset in setup / use fresh fixtures.
+- **Order dependency** — tests depend on other tests' side effects.
+- **Timing** — tests that wait for "done" via sleep; flip to deterministic waits.
+- **Randomness** — uncontrolled random input; seed it.
+- **External dependencies** — real network / time / env; mock or inject.
+If you can't fix the flake in a week, DELETE the test. A flake that lies about whether the code works is worse than no test.
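Two of the causes above, randomness and time, are usually fixed by injecting the source instead of reaching for a global. A minimal sketch with hypothetical functions `sample_discount` and `is_expired`:

```python
import random

def sample_discount(rng):
    """Pick a discount percentage; the random source is injected, not global."""
    return rng.choice([0, 5, 10, 15])

def is_expired(deadline, now):
    """Time is passed in, so a test never depends on the wall clock."""
    return now >= deadline

# Two generators with the same seed produce the same sequence on every run.
rng_a = random.Random(42)
rng_b = random.Random(42)
first_run = [sample_discount(rng_a) for _ in range(5)]
second_run = [sample_discount(rng_b) for _ in range(5)]
print(first_run == second_run)

# Fixed timestamps replace "sleep and hope".
print(is_expired(deadline=100, now=99), is_expired(deadline=100, now=100))
```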
+## Live incident
+Debugging with a fire lit:
+1. **Stop the bleeding first.** Roll back, disable a feature flag, scale up, divert traffic. Diagnose later.
+2. **Preserve evidence** — snapshot logs, heap, DB state before you mitigate; you'll need them for the postmortem.
+3. **One driver, many helpers**. One person coordinating; others investigate. Avoid overlapping operations.
+4. **Communicate every 15 min** even if nothing new: "still investigating DB side; rollback started at 14:03".
+5. **Fix the immediate symptom. Plan the durable fix.** Different timescales.
+6. **Write the postmortem.** Always. Blameless. Drive action items to completion.
+## Rubber-ducking
+Explaining the problem, in full, in plain words, to anyone or anything:
+- A colleague.
+- A rubber duck on your desk.
+- A paragraph in a doc.
+Making the explanation forces you to sequence the facts; the sequence often exposes the missing step.
+Most "aha!" moments during rubber-ducking come at "okay so X happens, then Y, then — wait, does Y actually happen?"
+## Pair debugging
+Two people, one keyboard. One describes their mental model, the other asks questions. Costly in time; often pays for itself on nasty bugs.
+## Warning signs in your own process
+- You've tried four fixes. None landed.
+- You're re-running the test hoping it passes.
+- You're editing code to "see what happens" without a hypothesis.
+- You've been on the same bug for 3+ hours with no progress.
+All of these say: **stop, step away, reset**. Take a walk. Explain the bug to someone. Sleep on it. You'll come back fresher and more effective.
+## Bugs that turn out to be "not bugs"
+Always worth checking:
+- **Timezone / DST** — off-by-one-hour bugs.
+- **Locale** — decimal separators, date order, sort order.
+- **Unicode** — grapheme cluster length vs. byte length; RTL order.
+- **Float precision** — 0.1 + 0.2 ≠ 0.3.
+- **Integer overflow** — counters that wrap.
+- **Caches** — serving a stale copy.
+- **Config drift** — dev has flag X on, prod doesn't.
+- **Env variables** — typos, unset, accidentally committed.
+When the bug is "it only happens in prod", it's usually one of these.
+## Fixing responsibly
+- Write a regression test BEFORE merging the fix.
+- Describe what the test proves in the commit message.
+- Link the original bug report / ticket.
+- If the fix has broader implications, write an ADR.
+## Forbidden patterns
+- "Just add a try/except around it so it doesn't crash"
+- Closing a ticket without a reproduction-proving test
+- Rolling out a fix to prod before verifying on staging
+- Shipping a fix and "hoping" it works
+- Saying "it works on my machine" as a closing line
+- Removing a test that's failing "to unblock CI"
+- Blaming a user without reproducing the bug first
+- Fixing the symptom when you know where the root cause is
+## The two-question close
+Before declaring "fixed":
+1. **Do I have a test that would have failed before this change?**
+2. **Do I know what caused the bug, not just what suppresses it?**
+If either is "no", you haven't finished.
+## Pair with
+- [`coding-standards/references/tdd.md`](../coding-standards/references/tdd.md) — for the test-first fix discipline.
+- [`qa-tester`](../qa-tester/SKILL.md) — for edge-case intuition.
+- [`devops/references/observability.md`](../devops/references/observability.md) — tools for finding what you can't guess.

package/skills/devops/SKILL.md ADDED

@@ -0,0 +1,55 @@
+---
+name: devops
+description: DevOps end skill — CI/CD, containers, infrastructure-as-code, secrets, observability. Tool-neutral (GitHub Actions / GitLab CI / Argo / Terraform patterns). Pair with `backend`, language skills, and `coding-standards`.
+origin: ecc-fork + original (https://github.com/affaan-m/everything-claude-code, MIT)
+---
+# DevOps
+CI/CD, containers, infra-as-code, secrets, observability. **Tool-neutral** — patterns apply across GitHub Actions / GitLab CI / CircleCI, Terraform / Pulumi / CDK, Docker / OCI, Kubernetes / ECS / Cloud Run.
+## When to load
+- Setting up or reviewing CI/CD
+- Containerization, image builds, multi-stage Dockerfiles
+- Infrastructure-as-code changes (Terraform, Pulumi, CloudFormation, CDK)
+- Secret management, rotation, access control
+- Observability at the platform level (metrics / logs / traces collection, alerting)
+- Deploy strategies (blue-green, canary, progressive delivery)
+## Core principles
+1. **Everything as code.** Infra, CI, secret policy, dashboards, alerts — in the repo, reviewed, versioned.
+2. **Immutable artifacts.** The build produces one artifact (image, binary); the same artifact promotes through envs unchanged.
+3. **Dev / staging / prod parity.** Same tooling, same topology, smaller. Differences are explicit (size, scaling, data), not accidental.
+4. **Automate the path to prod.** Merges to main trigger deploy (with gates); humans click "promote", not "run these commands".
+5. **Ephemeral infra, persistent data.** Nodes, pods, VMs — replaceable. Data — backed up, versioned, migrated.
+6. **Least privilege by default.** CI, services, humans all get scoped credentials. Root access is an event, not a default.
+7. **Fast feedback.** Build < 10 min on typical change, < 3 min on type/lint. Slow CI loses its purpose.
+8. **Observability before features.** You can't fix what you can't see.
+## How to use references
+| Reference | When to load |
+|---|---|
+| [`references/ci-cd.md`](references/ci-cd.md) | Pipelines, caching, parallelism, artifacts, gates, promotion |
+| [`references/containers.md`](references/containers.md) | Dockerfile, multi-stage, size, rootless, base images, image signing |
+| [`references/iac.md`](references/iac.md) | Terraform / Pulumi / CDK — structure, state, modules, reviews |
+| [`references/secrets.md`](references/secrets.md) | Secret managers, rotation, access control, pre-commit scanning |
+| [`references/deploy.md`](references/deploy.md) | Rolling / blue-green / canary / feature flags, rollback |
+| [`references/observability.md`](references/observability.md) | Log/metric/trace pipelines, alerting, on-call, runbooks |
+## Forbidden patterns (auto-reject)
+- Secrets in code / Dockerfile / CI config / `.env` in git
+- CI pipelines that mask test failures with `|| true` / `continue-on-error: true`
+- Pushing only a `latest` tag (no immutable version for rollback)
+- `curl | bash` from the internet in Dockerfile / install script without pinning
+- Running containers as root without documented reason
+- Terraform state on a local dev machine (no remote, no locking)
+- Manual prod changes (clicking in a console) not followed by IaC update
+- Deploy scripts that don't know how to roll back
+- Alerts that wake someone up without a runbook
+- Public S3 buckets / databases without explicit review
+- `:latest` base image tags in prod builds
+- Writing logs to the container filesystem (lost on restart)