npm - codex-genesis-harness - Versions diffs - 0.1.0 → 0.1.4 - Mend

codex-genesis-harness 0.1.0 → 0.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (328)

package/.codex/skills/genesis-performance-profiling/checklists/optimization-verification.md ADDED

@@ -0,0 +1,199 @@
+# Optimization Verification Checklist
+Use this checklist when verifying that a performance optimization is correct, safe, and effective. Run this end-to-end after every optimization implementation — never ship a performance fix without completing all gates.
+---
+## Pre-Optimization State Capture
+### Mandatory — complete BEFORE implementing any optimization
+- [ ] **Before-baseline captured and stored.** Path: `observability/baselines/before-[date]-[ticket].json`. This file is immutable — do NOT overwrite after implementation starts.
+- [ ] **Bottleneck evidence documented.** The `BOTTLENECK_ANALYSIS.md` has been written with specific function names, call counts, and evidence (flame graph, heap snapshot, query plan). Vague bottlenecks ("the API is slow") are NOT acceptable.
+- [ ] **Git branch created.** Optimization work happens on a dedicated branch. Never optimize on `main` directly.
+- [ ] **Rollback plan documented.** In `OPTIMIZATION_RECOMMENDATIONS.md`, under the bottleneck entry, there is a rollback procedure (e.g., feature flag disable, migration revert command, previous Docker image tag).
+- [ ] **Adjacent system state known.** Confirm which other services call the service being optimized. Know which upstream/downstream endpoints could be affected.
+- [ ] **Database schema snapshot taken** (if DB changes are part of the optimization). Record table sizes, index list, and current query plans for all affected queries.
+- [ ] **Feature flag prepared** (for high-risk optimizations). If the optimization changes behavior (not just performance), wrap it in a feature flag that can be disabled without a deployment.
+### State snapshot commands
+Run these and save outputs alongside the before-baseline:
+```bash
+# Service info
+git log --oneline -5 > state-snapshot/git-log.txt
+docker inspect service-name > state-snapshot/docker-inspect.json
+# Database state
+psql -c "SELECT tablename, pg_size_pretty(pg_total_relation_size(tablename::regclass)) FROM pg_tables WHERE schemaname='public' ORDER BY pg_total_relation_size(tablename::regclass) DESC;" > state-snapshot/table-sizes.txt
+psql -c "SELECT indexname, tablename, pg_size_pretty(pg_relation_size(indexname::regclass)) FROM pg_indexes WHERE schemaname='public';" > state-snapshot/indexes.txt
+# Current query plans for affected queries
+psql -c "EXPLAIN ANALYZE SELECT ..." > state-snapshot/query-plan-before.txt
+```
+---
+## Optimization Implementation Checklist
+### Code quality gates
+- [ ] **Change is minimal and focused.** The PR diff touches only the files necessary for this specific bottleneck. No unrelated refactors bundled in the same PR.
+- [ ] **Unit tests written for the optimization.** If caching was added, there are tests for cache hit, cache miss, and cache invalidation. If a query was changed, there are tests for the new query results.
+- [ ] **Existing tests still pass.** Run the full test suite: `npm test` / `pytest` / `go test ./...`. Zero test failures allowed.
+- [ ] **No new linter errors.** Run `eslint` / `flake8` / `golangci-lint`.
+- [ ] **Memory management verified** (for caching optimizations). Cache has a maximum size limit. Cache has a TTL or eviction policy. Unbounded caches are NOT acceptable.
+- [ ] **Error handling preserved.** The optimization does not swallow errors or change error behavior. All original error paths still return correct HTTP status codes.
+- [ ] **Concurrency safety checked** (for shared state optimizations). Thread safety / mutex locking reviewed. Race conditions verified absent (`go test -race` for Go; ThreadSanitizer for C/C++).
+### Database optimization gates
+- [ ] **New indexes created with `CONCURRENTLY`** to avoid table locks in production.
+  ```sql
+  CREATE INDEX CONCURRENTLY idx_users_tenant_status ON users(tenant_id, status, created_at);
+  ```
+- [ ] **EXPLAIN ANALYZE run on optimized query** to confirm index is being used (not a sequential scan).
+- [ ] **Query result correctness verified.** The optimized query returns the same results as the original query. Run a diff of outputs on representative data sets.
+- [ ] **Migration is reversible.** The `down` migration correctly reverts all schema changes.
+- [ ] **No N+1 queries introduced.** If ORM eager loading was added, verify it does not over-fetch (loading too many related records).
+### Caching optimization gates
+- [ ] **Cache key includes all relevant dimensions.** If results vary by tenant, the cache key includes tenant ID. If results vary by user role, the cache key includes role.
+- [ ] **Cache invalidation is correct.** When the underlying data changes (write operation), the affected cache entries are invalidated immediately or expire within the defined TTL.
+- [ ] **Thundering herd protection.** For high-traffic cache misses, implement lock-based cache warming (only one request populates the cache; others wait) to avoid a flood of DB queries on cache miss.
+- [ ] **Cache TTL is appropriate.** TTL matches the acceptable staleness of the data (e.g., user profile: 5 min is fine; real-time stock price: 0 ms TTL, no cache).
+- [ ] **Cache metrics are instrumented.** `cache_hits`, `cache_misses`, and `cache_evictions` counters are exported to monitoring.
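The caching gates above (bounded size, TTL, thundering herd protection, key dimensions, counters) can be sketched together in a small in-memory cache. This is a minimal illustration under stated assumptions, not the package's implementation: `BoundedCache`, `cacheKey`, and the `loader` callback are hypothetical names, eviction here is oldest-written-first rather than true LRU, and a real service would more likely use a shared cache such as Redis.

```javascript
// Hypothetical sketch: bounded, TTL-based cache with single-flight loading.
class BoundedCache {
  constructor({ maxEntries = 1000, ttlMs = 5 * 60 * 1000, now = Date.now } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;            // injectable clock, which makes TTL behavior testable
    this.entries = new Map();  // key -> { value, expiresAt }; Map preserves insertion order
    this.inFlight = new Map(); // key -> Promise of a load that is already running
    this.stats = { cache_hits: 0, cache_misses: 0, cache_evictions: 0 };
  }

  get(key, loader) {
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > this.now()) {
      this.stats.cache_hits += 1;
      return Promise.resolve(entry.value);
    }
    this.stats.cache_misses += 1;

    // Thundering herd protection: concurrent misses share one loader call.
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    const load = Promise.resolve()
      .then(loader)
      .then((value) => {
        this.set(key, value);
        return value;
      })
      // Loader errors are not swallowed; they reject every waiting caller.
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, load);
    return load;
  }

  set(key, value) {
    this.entries.delete(key); // re-insert so the key becomes the newest entry
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    while (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
      this.stats.cache_evictions += 1;
    }
  }

  invalidate(key) {
    this.entries.delete(key); // call this from write paths
  }
}

// The key includes every dimension the cached result varies by.
function cacheKey(tenantId, role, resource) {
  return `${tenantId}:${role}:${resource}`;
}
```

In this sketch a write handler would call `invalidate(cacheKey(tenantId, role, 'users'))`, and the `stats` object is what would be exported to monitoring.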
+---
+## Post-Optimization Validation Gates
+### Gate 1: Functional correctness (run FIRST — before any performance testing)
+- [ ] Run integration test suite against the optimized service. All tests pass.
+- [ ] Run manual smoke test against staging: hit the optimized endpoint manually and verify response body is correct.
+- [ ] If the optimization changed query behavior, compare a sample of API responses (before vs after) to confirm no data differences.
+### Gate 2: Performance improvement confirmation
+Run 5 after-baseline measurement runs (same protocol as before-baseline):
+- [ ] **p95 latency improved** by ≥ 10% vs before-baseline.
+  - Before p95: ___ ms → After p95: ___ ms → Delta: ___%
+  - ✅ PASS if delta ≥ 10% improvement | ❌ FAIL if delta < 10% or regressed
+- [ ] **Throughput not degraded**: After RPS ≥ 95% of before RPS.
+  - Before RPS: ___ → After RPS: ___ → Delta: ___%
+  - ✅ PASS if delta ≥ -5% | ❌ FAIL if throughput decreased > 5%
+- [ ] **Error rate not increased**: After error rate ≤ before error rate.
+  - Before: ___% → After: ___%
+  - ✅ PASS if no increase | ❌ FAIL if increased
+- [ ] **Memory growth not increased**: After memory delta ≤ before memory delta + 10 MB.
+  - Before delta: ___ MB → After delta: ___ MB
+  - ✅ PASS if within acceptable range | ❌ FAIL if memory growth increased
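The four Gate 2 checks reduce to simple arithmetic on the averaged before and after metrics. A minimal sketch follows; the field names (`p95Ms`, `rps`, `errorRatePct`, `memoryDeltaMb`) are assumptions chosen for illustration and are not a schema defined by this checklist.

```javascript
// Hypothetical sketch: evaluate Gate 2 from averaged before/after metrics.
function evaluateGate2(before, after) {
  const p95ImprovementPct = ((before.p95Ms - after.p95Ms) / before.p95Ms) * 100;
  const rpsDeltaPct = ((after.rps - before.rps) / before.rps) * 100;

  const checks = {
    latency: p95ImprovementPct >= 10,                          // p95 improved by >= 10%
    throughput: after.rps >= 0.95 * before.rps,                // RPS >= 95% of before
    errorRate: after.errorRatePct <= before.errorRatePct,      // no increase
    memory: after.memoryDeltaMb <= before.memoryDeltaMb + 10,  // within +10 MB
  };

  return {
    p95ImprovementPct,
    rpsDeltaPct,
    checks,
    pass: Object.values(checks).every(Boolean),
  };
}
```

For example, a before p95 of 450 ms and an after p95 of 42 ms gives an improvement of about 90.7%, which clears the 10% bar.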
+### Gate 3: SLO compliance (all targets must be met)
+- [ ] p50 ≤ SLA target: ___ ms
+- [ ] p95 ≤ SLA target: ___ ms
+- [ ] p99 ≤ SLA target: ___ ms
+- [ ] Throughput ≥ minimum: ___ RPS
+- [ ] Error rate ≤ SLA target: ___%
+### Gate 4: Stability (no flakiness)
+- [ ] All 5 after-baseline runs show variance < 15% for p95. (Consistent improvement, not a one-off lucky run.)
+- [ ] Run a 10-minute mini-soak test at 50% peak concurrency. Memory does not grow > 25 MB over this period.
+- [ ] No error rate spikes observed during the soak test.
+### Gate 5: Adjacent endpoint regression check
+- [ ] Run full regression detection (all in-scope endpoints, not just the optimized one).
+  - `REGRESSION_SUMMARY.json` shows 0 regressions on adjacent endpoints.
+  - If any adjacent endpoint regressed: STOP, investigate before proceeding.
+---
+## Go/No-Go Performance Criteria by Tier
+### Web API (synchronous REST endpoints)
+| Criterion | Go threshold | No-Go threshold |
+|-----------|-------------|-----------------|
+| p95 latency | ≤ 200 ms | > 300 ms |
+| p99 latency | ≤ 500 ms | > 1000 ms |
+| Throughput | ≥ 500 RPS per instance | < 300 RPS per instance |
+| Error rate | < 0.1% | > 0.5% |
+| Memory growth (5 min) | < 20 MB | > 50 MB |
+**Decision**: ALL criteria must be Go. One No-Go criterion blocks the optimization deployment.
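The Web API table and its decision rule can be expressed as data plus one function. This is a sketch under assumptions: the metric names are hypothetical, and `BETWEEN` is a label introduced here for values that fall in the gap between the Go and No-Go thresholds. Because the rule requires every criterion to be Go, a `BETWEEN` value blocks deployment just as a No-Go does.

```javascript
// Hypothetical sketch: thresholds copied from the Web API table above.
const WEB_API_CRITERIA = [
  { name: 'p95LatencyMs', isGo: (v) => v <= 200, isNoGo: (v) => v > 300 },
  { name: 'p99LatencyMs', isGo: (v) => v <= 500, isNoGo: (v) => v > 1000 },
  { name: 'rpsPerInstance', isGo: (v) => v >= 500, isNoGo: (v) => v < 300 },
  { name: 'errorRatePct', isGo: (v) => v < 0.1, isNoGo: (v) => v > 0.5 },
  { name: 'memoryGrowthMb5Min', isGo: (v) => v < 20, isNoGo: (v) => v > 50 },
];

function decideWebApi(metrics) {
  const verdicts = WEB_API_CRITERIA.map(({ name, isGo, isNoGo }) => {
    const value = metrics[name];
    // A missing metric is neither Go nor No-Go, so it ends up as BETWEEN and blocks.
    const verdict = isGo(value) ? 'GO' : isNoGo(value) ? 'NO_GO' : 'BETWEEN';
    return { name, value, verdict };
  });
  return { deploy: verdicts.every((v) => v.verdict === 'GO'), verdicts };
}
```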
+### Batch Job (async/scheduled processing)
+| Criterion | Go threshold | No-Go threshold |
+|-----------|-------------|-----------------|
+| Job completion time | ≤ defined SLA per batch | > 2× SLA |
+| Records processed/sec | ≥ baseline × 0.95 | < baseline × 0.80 |
+| Memory peak | < 2 GB | > 4 GB |
+| Error rate | < 1% | > 5% |
+| CPU utilization | < 80% avg | > 95% sustained |
+### Real-Time Services (WebSocket, streaming, event-driven)
+| Criterion | Go threshold | No-Go threshold |
+|-----------|-------------|-----------------|
+| Message latency p95 | ≤ 50 ms | > 100 ms |
+| Message latency p99 | ≤ 100 ms | > 200 ms |
+| Throughput | ≥ 1000 events/sec per instance | < 500 events/sec |
+| Error rate | < 0.01% | > 0.1% |
+| Memory growth (30 min) | < 50 MB | > 200 MB |
+### Database Layer (query optimization)
+| Criterion | Go threshold | No-Go threshold |
+|-----------|-------------|-----------------|
+| Query p95 | ≤ 10 ms (indexed) | > 50 ms |
+| Sequential scans eliminated | 100% of targeted queries | Any targeted query still scanning |
+| Index usage confirmed | EXPLAIN ANALYZE shows Index Scan | Seq Scan remains |
+| Replication lag | < 100 ms | > 1 s |
+| Connection pool wait | < 5 ms | > 50 ms |
+---
+## Rollback Trigger Conditions
+Initiate rollback IMMEDIATELY if any of the following occur in production after deployment:
+### Automatic rollback triggers (CI/CD can execute without human approval)
+- Error rate exceeds 1% for more than 2 consecutive minutes.
+- p95 latency exceeds 2× the before-baseline value for more than 3 minutes.
+- Service health check (`/health` or `/readiness`) starts failing.
+- Memory RSS exceeds 90% of container limit.
+- Database connection pool wait time exceeds 500 ms.
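The first two automatic triggers depend on a breach being sustained, not on a single bad sample. One way to check that, sketched below, is to measure how long the current run of breaching samples has lasted. The sample shape (`{ tMs, value }`, oldest first) and both function names are assumptions for illustration; a real pipeline would normally express this as an alerting rule in its monitoring system.

```javascript
const MINUTE_MS = 60 * 1000;

// Duration of the breach run that is still ongoing at the latest sample.
// Returns 0 when the latest sample is healthy or there are no samples.
function ongoingBreachMs(samples, isBreach) {
  let runStart = null;
  for (const sample of samples) {
    if (isBreach(sample.value)) {
      if (runStart === null) runStart = sample.tMs;
    } else {
      runStart = null; // a healthy sample resets the run
    }
  }
  return runStart === null ? 0 : samples[samples.length - 1].tMs - runStart;
}

// Hypothetical sketch covering the error-rate and p95 triggers only.
function autoRollbackReasons({ errorRatePct, p95Ms }, beforeP95Ms) {
  const reasons = [];
  if (ongoingBreachMs(errorRatePct, (v) => v > 1) > 2 * MINUTE_MS) {
    reasons.push('error rate above 1% for more than 2 minutes');
  }
  if (ongoingBreachMs(p95Ms, (v) => v > 2 * beforeP95Ms) > 3 * MINUTE_MS) {
    reasons.push('p95 above 2x before-baseline for more than 3 minutes');
  }
  return reasons;
}
```

An empty result means neither of these two triggers fired; the health check, memory, and pool-wait triggers would need their own checks.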
+### Human-reviewed rollback triggers (escalate to on-call, decision within 5 min)
+- p95 latency exceeds 1.5× before-baseline for more than 5 minutes.
+- Error rate between 0.5% and 1% sustained for > 5 minutes.
+- New exception type appearing in logs at high frequency (> 100/min).
+- Database slow query log showing new queries > 5 seconds.
+- External dependency (payment provider, auth service) reporting increased error rates correlated with the deployment.
+### Rollback procedure
+```bash
+# Option 1: Kubernetes rollout undo
+kubectl rollout undo deployment/service-name
+# Option 2: Feature flag disable (no deployment required)
+curl -X PATCH https://flagd-host/flags/perf-optimization-001 \
+  -H 'Content-Type: application/json' \
+  -d '{"enabled": false}'
+# Option 3: Database migration revert
+npm run migrate:down -- --to=<previous-migration-id>
+# Or:
+psql -f migrations/revert-<optimization>.sql
+# Verify rollback
+kubectl rollout status deployment/service-name
+curl https://service/health
+# Run smoke test
+k6 run --vus 1 --duration 30s smoke-test.js
+```
+### Post-rollback actions
+- [ ] Confirm error rate returned to pre-optimization baseline within 5 minutes.
+- [ ] Confirm p95 latency returned to pre-optimization baseline within 5 minutes.
+- [ ] Create incident report documenting: what regressed, what was rolled back, when, by whom.
+- [ ] Add to `OPTIMIZATION_RECOMMENDATIONS.md` under the failed optimization: "ATTEMPTED [date] — ROLLED BACK. Reason: [root cause]. Next attempt: [redesign notes]."
+- [ ] Schedule post-mortem within 24 hours if production impact was > 5 minutes.

package/.codex/skills/genesis-performance-profiling/checklists/performance-baseline.md ADDED

@@ -0,0 +1,183 @@
+# Performance Baseline Checklist
+Use this checklist every time you capture a performance baseline. Complete every item in order. Do not skip items — each gate exists because a skipped step has caused a bad baseline in the past.
+---
+## Pre-Baseline: Environment Isolation
+### Infrastructure isolation
+- [ ] **Test environment is NOT shared with production traffic.** Confirm with ops/SRE that no production routes to this environment.
+- [ ] **No deployments are in progress.** Deployments during baseline capture invalidate results. Check CI/CD pipeline status.
+- [ ] **No scheduled jobs running.** Check cron schedules and batch job queues. Postpone or disable non-critical jobs for the duration.
+- [ ] **Auto-scaling is DISABLED or pinned.** If the environment auto-scales, pin to a fixed instance count for the test. Note the pinned count.
+- [ ] **Horizontal pod count is stable.** For Kubernetes: confirm `kubectl get pods` shows all pods Running and Ready. No pending restarts.
+- [ ] **Database is not under maintenance.** Confirm no vacuum, reindex, or backup jobs scheduled during the test window.
+- [ ] **External dependencies are stubbed or consistent.** Third-party APIs (payment, email, SMS) should be mocked or pointed at their own stable sandbox. Document which ones are mocked.
+### Resource baseline snapshot
+Before starting any test traffic, record current resource state:
+- [ ] CPU utilization < 5% (idle baseline). Command: `top -bn1 | grep "Cpu(s)"`
+- [ ] Available memory > 70% of total. Command: `free -m`
+- [ ] Disk I/O idle. Command: `iostat -x 1 5`
+- [ ] Network utilization < 10 Mbps (background noise). Command: `iftop` or `nethogs`
+- [ ] Database connection pool: active connections < 20% of max. Command: `SELECT count(*) FROM pg_stat_activity WHERE state = 'active';`
+- [ ] No alerts firing in monitoring stack (Grafana/Datadog). Check alert manager before starting.
+### Tool installation verification
+- [ ] **k6 installed and correct version** (≥ 0.45): `k6 version`
+- [ ] **Artillery installed** (if using): `artillery --version`
+- [ ] **Locust installed** (if using): `locust --version`
+- [ ] **wrk2 or hey available** (for quick smoke latency): `wrk --version` or `hey --version`
+- [ ] **Language profiler available**:
+  - Node.js: `clinic --version` or verify `--prof` flag works.
+  - Python: `py-spy --version`
+  - Go: `go tool pprof` available
+- [ ] **jq installed** for JSON result parsing: `jq --version`
+- [ ] **Prometheus metrics endpoint reachable**: `curl http://service-host:9090/metrics | head -20`
+### Authentication and connectivity
+- [ ] **Test API tokens are valid and not expiring during the test.** Check token TTL — should be > test duration + 30 min buffer.
+- [ ] **All target endpoints reachable from test runner.** Run `curl -s -o /dev/null -w "%{http_code}" http://target-host/api/health` for each endpoint.
+- [ ] **DNS resolves correctly from test runner.** Run `dig target-host` and confirm A record.
+- [ ] **TLS certificate is valid** (not expired): `echo | openssl s_client -connect target-host:443 2>/dev/null | openssl x509 -noout -dates`
+---
+## Baseline Measurement Protocol
+### What to measure
+For every endpoint in scope, capture the following metrics in each run:
+| Metric | Description | Unit | How to capture |
+|--------|-------------|------|----------------|
+| `p50_ms` | Median response time | milliseconds | k6 `http_req_duration` trend, `med` |
+| `p95_ms` | 95th percentile response time | milliseconds | k6 `http_req_duration` trend, `p(95)` |
+| `p99_ms` | 99th percentile response time | milliseconds | k6 `http_req_duration` trend, `p(99)` (add it to the summary with `--summary-trend-stats`) |
+| `throughput_rps` | Requests per second at peak | req/s | k6 `http_reqs` counter ÷ duration |
+| `error_rate_pct` | HTTP 4xx/5xx rate | percentage | k6 `http_req_failed` rate |
+| `memory_start_mb` | Process resident memory (RSS) at test start | megabytes | Prometheus `process_resident_memory_bytes` |
+| `memory_end_mb` | Process resident memory (RSS) at test end | megabytes | Prometheus `process_resident_memory_bytes` |
+| `memory_delta_mb` | Memory growth during test | megabytes | `memory_end_mb - memory_start_mb` |
+| `cpu_avg_pct` | Average CPU utilization during test | percentage | Prometheus `process_cpu_seconds_total` rate |
+| `db_query_p95_ms` | Database query 95th percentile | milliseconds | `pg_stat_statements` or slow query log |
+| `connection_pool_wait_ms` | Connection pool wait time | milliseconds | App-level metrics or APM |
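To make the derived columns concrete, the sketch below computes percentiles, throughput, error rate, and memory delta from raw samples. It is illustrative only: the load tool reports its own percentiles, which may use a different interpolation than the nearest-rank method assumed here, and the `summarizeRun` input shape is hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical input shape for one measurement run.
function summarizeRun({ durationsMs, failedCount, testSeconds, memStartBytes, memEndBytes }) {
  const bytesPerMb = 1024 * 1024;
  return {
    p50_ms: percentile(durationsMs, 50),
    p95_ms: percentile(durationsMs, 95),
    p99_ms: percentile(durationsMs, 99),
    throughput_rps: durationsMs.length / testSeconds,            // request count / duration
    error_rate_pct: (failedCount / durationsMs.length) * 100,
    memory_delta_mb: (memEndBytes - memStartBytes) / bytesPerMb, // end minus start
  };
}
```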
+### How many runs
+| Situation | Minimum runs | Notes |
+|-----------|-------------|-------|
+| Initial baseline | 5 runs | Discard first (JIT warm-up). Average remaining 4. |
+| Daily CI check | 3 runs | Acceptable for regression gating if environment is consistent. |
+| Before/after comparison | 5 runs each | Must use identical conditions. Run same day if possible. |
+| After major optimization | 5 runs | Document the change before re-running; the repeated runs confirm stability. |
+| Incident reproduction | 1 run (controlled) | Enough to confirm reproduction; document variance. |
+### Run protocol
+1. **Warm-up run** (discard): 30% of peak concurrency for 2 minutes. Do NOT record results.
+2. **Wait 1 minute** for connection pools and caches to settle after warm-up.
+3. **Measurement run 1**: Full peak concurrency for minimum 5 minutes. Record results.
+4. **Wait 2 minutes** between runs to let pools drain and metrics settle.
+5. **Measurement runs 2-N**: Repeat steps 3-4.
+6. **Post-run snapshot**: Record memory and CPU after all runs complete (check for drift).
+### Variance check
+After all runs, compute the run-to-run spread of p95 (for example, the standard deviation as a percentage of the mean). If that spread exceeds 30% across runs, the baseline is unreliable.
+- Do NOT average an unreliable baseline.
+- Investigate the source of variance (see Recovery Workflow in SKILL.md).
+- Re-run after fixing the root cause.
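The checklist does not define "variance" precisely. The sketch below assumes it means the coefficient of variation (sample standard deviation divided by the mean) of the per-run p95 values; that reading is an assumption, and the function names are hypothetical. The baseline is averaged only when the spread is within the limit.

```javascript
// Coefficient of variation in percent, using the sample standard deviation.
function coefficientOfVariationPct(values) {
  if (values.length < 2) throw new Error('need at least two runs');
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (values.length - 1);
  return (Math.sqrt(variance) / mean) * 100;
}

// p95PerRun: p95 values of the measurement runs (warm-up already discarded).
function baselineFromRuns(p95PerRun, maxSpreadPct = 30) {
  const spreadPct = coefficientOfVariationPct(p95PerRun);
  const reliable = spreadPct <= maxSpreadPct;
  const mean = p95PerRun.reduce((sum, v) => sum + v, 0) / p95PerRun.length;
  // An unreliable baseline is not averaged; the caller should investigate instead.
  return { reliable, spreadPct, p95Ms: reliable ? mean : null };
}
```

Passing `15` as `maxSpreadPct` would apply the stricter stability limit used by the post-optimization gates.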
+---
+## Load Test Design Checklist
+### Traffic model design
+- [ ] **Traffic distribution matches production**. Do not send 100% to one endpoint. Use production access log analysis to determine realistic split. Example: `GET /api/users` = 40%, `GET /api/products` = 30%, `POST /api/orders` = 20%, other = 10%.
+- [ ] **Payload sizes match production**. Test with representative payloads (median size from logs). Do not use tiny toy payloads.
+- [ ] **Read/write ratio matches production**. If production is 80% reads / 20% writes, the load test must reflect this.
+- [ ] **Think time is included** for user-simulation tests (not API microbenchmarks). Add 1–5 second pauses between user actions to simulate real user behavior.
+- [ ] **Session state is realistic**. If production users have sessions with history (shopping carts, multi-step forms), simulate stateful user journeys.
+- [ ] **Geographic distribution considered**. If production traffic comes from multiple regions, run the test from multiple regions as well, so that a test runner located close to the service does not understate real-world latency.
+### Load test stages
+Every load test must include these stages:
+| Stage | Duration | Concurrency | Purpose |
+|-------|----------|-------------|---------|
+| Smoke | 1 min | 1 VU | Verify the endpoint responds at all, without errors |
+| Ramp-up | 2 min | 0 → target VUs | Simulate organic traffic growth; avoid cold-start spikes |
+| Peak | 5-30 min | Target VUs | Measure steady-state performance at expected load |
+| Stress (optional) | 5 min | 2× target VUs | Find breaking point and graceful degradation behavior |
+| Ramp-down | 1 min | Target → 0 VUs | Verify graceful connection drain |
+| Soak (optional) | 30 min | 50% target VUs | Detect memory leaks, connection pool exhaustion, drift |
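The stage table maps naturally onto the `stages` array accepted by k6's `options`. The sketch below builds that array as plain data; `buildStages` is a hypothetical helper, not part of k6. Two simplifications are assumed: k6 ramps VUs linearly within a stage, so the stress stage here ramps toward 2× the target without holding there, and the soak test is left out because it is usually run separately.

```javascript
// Hypothetical helper: returns a k6-style stages array for the table above.
function buildStages(targetVUs, { peakMinutes = 5, stress = false } = {}) {
  if (peakMinutes < 5 || peakMinutes > 30) {
    throw new RangeError('peak stage should last between 5 and 30 minutes');
  }
  const stages = [
    { duration: '1m', target: 1 },                      // smoke
    { duration: '2m', target: targetVUs },              // ramp-up
    { duration: `${peakMinutes}m`, target: targetVUs }, // peak (steady state)
  ];
  if (stress) {
    stages.push({ duration: '5m', target: 2 * targetVUs }); // optional stress
  }
  stages.push({ duration: '1m', target: 0 });           // ramp-down
  return stages;
}
```

In a k6 script this would be used as `export const options = { stages: buildStages(50) };`.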
+---
+## Regression Threshold Definitions
+Use these thresholds to classify measurement deltas as REGRESSION / STABLE / IMPROVEMENT.
+### Latency thresholds
+| Tier | SLA target | Regression if p95 increases by | Improvement if p95 decreases by |
+|------|-----------|-------------------------------|--------------------------------|
+| Real-time (WebSocket, streaming) | < 50 ms | > 10% | > 5% |
+| Web API (synchronous REST) | < 200 ms | > 20% | > 10% |
+| Internal service (microservice call) | < 100 ms | > 15% | > 10% |
+| Background job (async processing) | < 5000 ms | > 30% | > 15% |
+| Batch processing | < 60 s | > 25% | > 20% |
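Applied mechanically, the latency table turns a before/after pair into one of the three labels. A minimal sketch, with the tier keys (`realtime`, `webApi`, and so on) chosen here as hypothetical identifiers:

```javascript
// Percent thresholds copied from the latency table above.
const LATENCY_THRESHOLDS = {
  realtime: { regressionPct: 10, improvementPct: 5 },
  webApi: { regressionPct: 20, improvementPct: 10 },
  internalService: { regressionPct: 15, improvementPct: 10 },
  backgroundJob: { regressionPct: 30, improvementPct: 15 },
  batch: { regressionPct: 25, improvementPct: 20 },
};

function classifyP95(tier, beforeMs, afterMs) {
  const thresholds = LATENCY_THRESHOLDS[tier];
  if (!thresholds) throw new Error(`unknown tier: ${tier}`);
  const deltaPct = ((afterMs - beforeMs) / beforeMs) * 100; // positive means slower
  if (deltaPct > thresholds.regressionPct) return 'REGRESSION';
  if (deltaPct < -thresholds.improvementPct) return 'IMPROVEMENT';
  return 'STABLE';
}
```

With the Web API thresholds, a change from 55 ms to 53 ms (about -3.6%) stays `STABLE`, while 100 ms to 125 ms (+25%) is a `REGRESSION`.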
+### Throughput thresholds
+| Tier | Regression if RPS decreases by | Improvement if RPS increases by |
+|------|-------------------------------|--------------------------------|
+| Web API | > 15% | > 10% |
+| Internal service | > 20% | > 15% |
+| Batch job | > 25% | > 20% |
+### Error rate thresholds (absolute, not relative)
+| Tier | SLA target | Flag as CRITICAL if | Flag as WARNING if |
+|------|-----------|--------------------|--------------------|
+| Payment / financial | < 0.01% | > 0.05% | > 0.01% |
+| Web API (standard) | < 0.1% | > 0.5% | > 0.1% |
+| Internal service | < 0.5% | > 1% | > 0.5% |
+| Background job | < 1% | > 5% | > 1% |
+### Memory growth thresholds
+| Duration | Acceptable growth | Warning | Leak suspected |
+|----------|------------------|---------|----------------|
+| 5 min test | < 20 MB | 20-50 MB | > 50 MB |
+| 30 min soak | < 50 MB | 50-100 MB | > 100 MB |
+| 1 hour soak | < 100 MB | 100-200 MB | > 200 MB |
+---
+## Post-Optimization Validation Gates
+After implementing any optimization, ALL of the following gates must pass before declaring success:
+### Gate 1: Improvement confirmation
+- [ ] After-baseline p95 is ≥ 10% lower than before-baseline p95 (meaningful improvement, not noise).
+- [ ] After-baseline throughput is NOT lower than before-baseline throughput (optimization didn't trade throughput for latency).
+- [ ] After-baseline error rate is NOT higher than before-baseline error rate.
+- [ ] After-baseline memory delta is NOT higher than before-baseline memory delta.
+### Gate 2: SLO compliance
+- [ ] All endpoints pass their defined p95 SLA target (not just "better than before").
+- [ ] Error rate is below SLA threshold.
+- [ ] Throughput meets minimum RPS target.
+### Gate 3: Stability
+- [ ] Run 3 after-baseline measurements; variance < 15% across runs (optimization is stable, not flaky).
+- [ ] Memory does not grow unboundedly over a 10-minute post-optimization soak test.
+### Gate 4: Regression check on adjacent endpoints
+- [ ] Run regression detection on ALL endpoints in scope, not just the optimized one. Confirm no adjacent regressions were introduced.
+### Gate 5: Documentation
+- [ ] `PERF_LOG.md` updated with before/after comparison.
+- [ ] `OPTIMIZATION_RECOMMENDATIONS.md` marked as implemented with date and commit SHA.
+- [ ] `PERF_BASELINE.json` updated with new after-baseline (old baseline archived, not deleted).

package/.codex/skills/genesis-performance-profiling/examples/example.md ADDED

@@ -0,0 +1,234 @@
+# Worked Example: Profiling a Slow /api/users Endpoint
+**Scenario**: The `GET /api/users` endpoint in the `users-api` service is taking 450 ms at p95. The SLA target is 200 ms. A customer support ticket reports "the user list is very slow." The team suspects a database issue.
+This example walks through the complete cycle: baseline → profile → identify → fix → verify.
+---
+## Step 1: Capture Baseline
+**Environment**: staging-isolated (no production traffic, background CPU < 2%)
+**Version**: users-api v1.4.1 (commit: `abc1234`)
+```bash
+# Run k6 baseline (5 runs, discard first)
+k6 run \
+  --env BASE_URL=https://api.staging.example.com \
+  --env AUTH_TOKEN=$STAGING_TOKEN \
+  --vus 50 --duration 5m \
+  baseline-check.js
+```
+**Baseline results (averaged across 4 runs):**
+| Endpoint | p50 ms | p95 ms | p99 ms | RPS | Error % | SLA | Status |
+|----------|--------|--------|--------|-----|---------|-----|--------|
+| GET /api/users | 180 | **450** | 920 | 68 | 0.04 | 200 ms | ❌ FAIL |
+| GET /api/users/:id | 15 | 55 | 120 | 312 | 0.01 | 200 ms | ✅ PASS |
+| POST /api/orders | 90 | 185 | 380 | 145 | 0.05 | 300 ms | ✅ PASS |
+**Finding**: Only `GET /api/users` is failing SLA. Other endpoints are fine. This narrows the scope.
+---
+## Step 2: Identify Bottleneck Type
+**CPU check during load test:**
+```bash
+# During the 5-min load test, check CPU
+top -bn1 | grep "Cpu(s)"
+# us=18.3% sy=2.1% id=78.5% — CPU is NOT the bottleneck (only 20% used)
+```
+**Memory check:**
+```bash
+# Memory at start vs end
+# Start: 256 MB RSS, End: 261 MB — only +5 MB, no leak
+```
+**DB query analysis:**
+```sql
+-- Note: on PostgreSQL 13 and later the column is total_exec_time, not total_time
+SELECT query, calls, total_time/calls AS avg_ms, rows/calls AS avg_rows
+FROM pg_stat_statements
+ORDER BY avg_ms DESC
+LIMIT 5;
+```
+Results:
+```
+query                                          calls  avg_ms   avg_rows
+SELECT * FROM users WHERE tenant_id=$1          50    385.2    8500
+SELECT * FROM profiles WHERE user_id=$1        4250   0.8      1.0
+SELECT * FROM users WHERE tenant_id=$1 ...      50    380.1    8500
+```
+**Finding**: The `SELECT * FROM users WHERE tenant_id=$1` query is taking **385 ms** on average — that's **85% of the total response time (450 ms)**. And there are also 4,250 calls to `SELECT * FROM profiles WHERE user_id=$1` for just 50 parent queries → **classic N+1 pattern**.
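The N+1 signal in this finding is the ratio of child-query calls to parent-query calls. A small sketch of that check on rows shaped like the `pg_stat_statements` output above follows; the function names and the default ratio of 10 are assumptions for illustration, not an established cutoff.

```javascript
// rows are objects such as { query, calls } taken from pg_stat_statements.
function childCallsPerParent(parent, child) {
  return child.calls / parent.calls;
}

// Flags a likely N+1 when each parent query fans out into many child queries.
function looksLikeNPlusOne(parent, child, minRatio = 10) {
  return parent.calls > 0 && childCallsPerParent(parent, child) >= minRatio;
}
```

With the figures above (50 parent calls, 4,250 profile lookups) the ratio is 85 child queries per parent query.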
+---
+## Step 3: Deep Analysis
+```sql
+-- Run EXPLAIN ANALYZE on the slow query
+EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
+SELECT u.*
+FROM users u
+WHERE u.tenant_id = 1
+ORDER BY u.created_at DESC
+LIMIT 50;
+```
+Output:
+```
+Limit  (cost=45234.56..45234.69 rows=50 width=248) (actual time=385.123..385.145 rows=50 loops=1)
+  -> Sort  (cost=45234.56..47484.56 rows=900000 width=248) (actual time=385.111..385.125 rows=50 loops=1)
+       Sort Key: created_at DESC
+       Sort Method: external merge  Disk: 32768kB        ← sorting 32 MB to disk!
+       -> Seq Scan on users  (cost=0..20234.00 rows=900000 width=248) (actual time=0.034..120.456 rows=8500 loops=1)
+            Filter: (tenant_id = 1)
+            Rows Removed by Filter: 841500             ← scanning 850K rows, only 8500 match!
+Planning Time: 0.452 ms
+Execution Time: 385.312 ms
+```
+**Root cause identified:**
+1. **Sequential scan on 850,000 rows** to find 8,500 belonging to `tenant_id=1`. No index on `tenant_id`.
+2. **Sort to disk (32 MB)** because there's no index on `created_at` to support the sort.
+3. **N+1 pattern**: Application makes 1 query for users list, then 1 query per user for their profile (50 users = 51 total queries).
+**Check indexes:**
+```sql
+SELECT indexname FROM pg_indexes WHERE tablename = 'users';
+-- Output: pk_users (primary key only)
+-- Missing: index on tenant_id, index on created_at
+```
+---
+## Step 4: Implement Fix
+**Fix 1: Add missing index** (migration file: `20260531-add-users-tenant-index.sql`)
+```sql
+-- Non-blocking index creation (CONCURRENTLY means no table lock)
+CREATE INDEX CONCURRENTLY idx_users_tenant_created
+  ON users(tenant_id, created_at DESC);
+```
+**Fix 2: Fix N+1 ORM query** (file: `src/routes/users.js`)
+```javascript
+// BEFORE (N+1 pattern):
+const users = await User.findAll({
+  where: { tenant_id: req.tenantId, status: req.query.status },
+  order: [['created_at', 'DESC']],
+  limit: req.query.limit || 20,
+});
+// Then for each user: SELECT * FROM profiles WHERE user_id=$1
+// AFTER (single JOIN query):
+const users = await User.findAll({
+  where: { tenant_id: req.tenantId, status: req.query.status },
+  order: [['created_at', 'DESC']],
+  limit: req.query.limit || 20,
+  include: [{
+    model: Profile,
+    attributes: ['avatar_url', 'bio'],
+    required: false,  // LEFT JOIN so users without profiles still appear
+  }],
+});
+```
+**Verify fix with EXPLAIN ANALYZE:**
+```sql
+EXPLAIN (ANALYZE, BUFFERS)
+SELECT u.*, p.avatar_url, p.bio
+FROM users u
+LEFT JOIN profiles p ON u.id = p.user_id
+WHERE u.tenant_id = 1
+ORDER BY u.created_at DESC
+LIMIT 50;
+```
+Output after index:
+```
+Limit  (cost=0.57..189.23 rows=50 width=248) (actual time=0.089..0.456 rows=50 loops=1)
+  -> Index Scan using idx_users_tenant_created on users u  ← Index Scan (not Seq Scan!)
+       Index Cond: (tenant_id = 1)
+       (... hash join with profiles ...)
+Execution Time: 1.234 ms   ← 385ms → 1.2ms (99.7% improvement!)
+```
+---
+## Step 5: Capture After-Baseline
+Deploy the fix to staging and re-run the baseline:
+```bash
+k6 run \
+  --env BASE_URL=https://api.staging.example.com \
+  --env AUTH_TOKEN=$STAGING_TOKEN \
+  --vus 50 --duration 5m \
+  baseline-check.js
+```
+**After-baseline results (averaged across 4 runs):**
+| Endpoint | Before p95 | After p95 | Delta | Delta % | SLA | Status |
+|----------|-----------|----------|-------|---------|-----|--------|
+| GET /api/users | 450 ms | **42 ms** | -408 ms | **-90.7%** | 200 ms | ✅ PASS |
+| GET /api/users/:id | 55 ms | 53 ms | -2 ms | -3.6% | 200 ms | ✅ PASS |
+| POST /api/orders | 185 ms | 187 ms | +2 ms | +1.1% | 300 ms | ✅ PASS |
+**Verification queries:**
+```sql
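+-- N+1 pattern gone: the profiles query should no longer appear.
+-- Clear old counters first with SELECT pg_stat_statements_reset(); then re-run the load test.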
+SELECT query, calls FROM pg_stat_statements
+WHERE query LIKE '%profiles WHERE user_id%'
+ORDER BY calls DESC;
+-- Result: 0 rows (query no longer exists!)
+-- Main query is now fast
+-- (use total_exec_time instead of total_time on PostgreSQL 13 and later)
+SELECT query, calls, total_time/calls AS avg_ms
+FROM pg_stat_statements
+WHERE query LIKE '%users WHERE tenant_id%'
+ORDER BY avg_ms DESC;
+-- Result: avg_ms = 1.2 ms (was 385 ms — 99.7% improvement)
+```
+---
+## Outcome Summary
+| Metric | Before | After | Improvement |
+|--------|--------|-------|-------------|
+| p50 latency | 180 ms | 18 ms | -90.0% |
+| p95 latency | 450 ms | 42 ms | -90.7% |
+| p99 latency | 920 ms | 95 ms | -89.7% |
+| Throughput | 68 RPS | 489 RPS | +619% |
+| DB queries per request | 51 (N+1) | 1 (JOIN) | -98% |
+| DB query time | 385 ms | 1.2 ms | -99.7% |
+**Code changes**: 1 SQL migration file + 3 lines of ORM change.
+**Time to fix**: 2 hours (1 hour profiling, 1 hour implementing + verifying).
+**SLA compliance**: ✅ PASS (42 ms vs 200 ms SLA target — 79% margin).
+---
+## PERF_LOG.md Entry for This Cycle
+```
+---
+date:        2026-05-31
+version:     1.4.2
+commit:      def5678
+environment: staging-isolated
+test_type:   optimization
+triggered_by: Customer complaint (support ticket SUP-1234)
+---
+Fixed N+1 ORM query and added missing index on users(tenant_id, created_at).
+GET /api/users p95 improved from 450 ms to 42 ms (90.7% improvement).
+SLA target of 200 ms now comfortably met with 79% margin.
+No adjacent endpoint regressions detected.
+```