sanook-cli 0.4.0 → 0.5.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.env.example +19 -0
- package/CHANGELOG.md +173 -0
- package/README.md +153 -20
- package/README.th.md +136 -0
- package/dist/agentContext.js +4 -0
- package/dist/approval.js +6 -0
- package/dist/bin.js +405 -57
- package/dist/brain.js +92 -59
- package/dist/brand.js +47 -0
- package/dist/checkpoint.js +37 -0
- package/dist/commands.js +86 -6
- package/dist/compaction.js +76 -5
- package/dist/config.js +100 -12
- package/dist/cost.js +60 -3
- package/dist/doctor.js +92 -0
- package/dist/gateway/auth.js +2 -2
- package/dist/gateway/ledger.js +2 -2
- package/dist/gateway/scheduler.js +1 -0
- package/dist/gateway/serve.js +6 -4
- package/dist/gateway/server.js +10 -2
- package/dist/git.js +11 -2
- package/dist/hooks.js +43 -17
- package/dist/knowledge.js +48 -49
- package/dist/loop.js +182 -66
- package/dist/lsp/client.js +173 -0
- package/dist/lsp/framing.js +56 -0
- package/dist/lsp/index.js +138 -0
- package/dist/lsp/servers.js +82 -0
- package/dist/mcp-server.js +244 -0
- package/dist/mcp.js +184 -29
- package/dist/memory-store.js +559 -0
- package/dist/memory.js +143 -29
- package/dist/orchestrate.js +150 -0
- package/dist/providers/codex.js +21 -7
- package/dist/providers/keys.js +3 -2
- package/dist/providers/models.js +22 -6
- package/dist/providers/registry.js +155 -1
- package/dist/repomap.js +93 -0
- package/dist/search/chunk.js +158 -0
- package/dist/search/embed-store.js +187 -0
- package/dist/search/engine.js +203 -0
- package/dist/search/fuse.js +35 -0
- package/dist/search/index-core.js +187 -0
- package/dist/search/indexer.js +241 -0
- package/dist/search/store.js +77 -0
- package/dist/session.js +42 -8
- package/dist/skill-install.js +10 -10
- package/dist/skills.js +12 -9
- package/dist/summarize.js +31 -0
- package/dist/tools/bash.js +21 -2
- package/dist/tools/diagnostics.js +41 -0
- package/dist/tools/edit.js +29 -7
- package/dist/tools/index.js +8 -1
- package/dist/tools/list.js +7 -2
- package/dist/tools/permission.js +90 -9
- package/dist/tools/read.js +23 -4
- package/dist/tools/remember.js +1 -1
- package/dist/tools/sandbox.js +61 -0
- package/dist/tools/search.js +105 -4
- package/dist/tools/task.js +195 -29
- package/dist/tools/timeout.js +35 -0
- package/dist/tools/util.js +10 -0
- package/dist/tools/write.js +6 -4
- package/dist/trust.js +89 -0
- package/dist/ui/app.js +228 -31
- package/dist/ui/banner.js +4 -9
- package/dist/ui/brain-wizard.js +2 -2
- package/dist/ui/history.js +30 -0
- package/dist/ui/mentions.js +44 -0
- package/dist/ui/render.js +55 -15
- package/dist/ui/setup.js +97 -12
- package/dist/ui/useEditor.js +83 -0
- package/dist/update.js +114 -0
- package/dist/worktree.js +173 -0
- package/package.json +11 -5
- package/scripts/postinstall.mjs +33 -0
- package/second-brain/.agents/_Index.md +30 -0
- package/second-brain/.agents/skills/_Index.md +30 -0
- package/second-brain/.agents/workflows/_Index.md +30 -0
- package/second-brain/AGENTS.md +4 -4
- package/second-brain/Acceptance/_Index.md +30 -0
- package/second-brain/Acceptance/golden-case-template.md +39 -0
- package/second-brain/Areas/_Index.md +30 -0
- package/second-brain/Bugs/System-OS/_Index.md +30 -0
- package/second-brain/Bugs/_Index.md +30 -0
- package/second-brain/CLAUDE.md +4 -1
- package/second-brain/Checklists/_Index.md +30 -0
- package/second-brain/Checklists/preflight-postflight-template.md +29 -0
- package/second-brain/Distillations/_Index.md +30 -0
- package/second-brain/Entities/_Index.md +30 -0
- package/second-brain/Entities/entity-template.md +33 -0
- package/second-brain/Evals/_Index.md +30 -0
- package/second-brain/Evals/correction-pairs.md +24 -0
- package/second-brain/Evals/failure-taxonomy.md +24 -0
- package/second-brain/Evals/golden-set.md +25 -0
- package/second-brain/Evals/quality-ledger.md +23 -0
- package/second-brain/Evals/self-eval-rubric.md +23 -0
- package/second-brain/GEMINI.md +4 -4
- package/second-brain/Goals/_Index.md +30 -0
- package/second-brain/Handoffs/_Index.md +30 -0
- package/second-brain/Home.md +7 -0
- package/second-brain/Intake/Raw Sources/_Index.md +30 -0
- package/second-brain/Intake/_Index.md +30 -0
- package/second-brain/Intake/_Quarantine/_Index.md +30 -0
- package/second-brain/Learning/_Index.md +30 -0
- package/second-brain/Playbooks/_Index.md +30 -0
- package/second-brain/Playbooks/playbook-template.md +23 -0
- package/second-brain/Projects/_Index.md +30 -0
- package/second-brain/Prompts/_Index.md +30 -0
- package/second-brain/README.md +2 -1
- package/second-brain/Research/_Index.md +30 -0
- package/second-brain/Retrospectives/_Index.md +30 -0
- package/second-brain/Reviews/_Index.md +30 -0
- package/second-brain/Runbooks/_Index.md +30 -0
- package/second-brain/Runbooks/eval-loop.md +24 -0
- package/second-brain/Sessions/_Index.md +30 -0
- package/second-brain/Shared/AI-Context-Index.md +20 -0
- package/second-brain/Shared/AI-Threads/_Index.md +30 -0
- package/second-brain/Shared/Archive/_Index.md +30 -0
- package/second-brain/Shared/Assets/_Index.md +30 -0
- package/second-brain/Shared/Context-Packs/_Index.md +30 -0
- package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
- package/second-brain/Shared/Coordination/NOW.md +28 -0
- package/second-brain/Shared/Coordination/_Index.md +30 -0
- package/second-brain/Shared/Coordination/agent-registry.md +24 -0
- package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
- package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
- package/second-brain/Shared/Coordination/task-board.md +32 -0
- package/second-brain/Shared/Core-Facts/_Index.md +30 -0
- package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
- package/second-brain/Shared/Glossary/_Index.md +30 -0
- package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
- package/second-brain/Shared/Operating-State/_Index.md +30 -0
- package/second-brain/Shared/Prompting/_Index.md +30 -0
- package/second-brain/Shared/Provenance/_Index.md +30 -0
- package/second-brain/Shared/Rules/_Index.md +30 -0
- package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
- package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
- package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
- package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
- package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
- package/second-brain/Shared/Rules/rules-formatting.md +34 -0
- package/second-brain/Shared/Scripts/_Index.md +30 -0
- package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
- package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
- package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
- package/second-brain/Shared/User-Memory/_Index.md +30 -0
- package/second-brain/Shared/User-Persona/_Index.md +30 -0
- package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
- package/second-brain/Shared/Working-Memory/_Index.md +30 -0
- package/second-brain/Shared/_Index.md +30 -0
- package/second-brain/Shared/mcp-servers/_Index.md +30 -0
- package/second-brain/Skills/_Index.md +30 -0
- package/second-brain/Templates/_Index.md +30 -0
- package/second-brain/Templates/bug.md +2 -0
- package/second-brain/Templates/handoff.md +2 -0
- package/second-brain/Templates/session.md +2 -0
- package/second-brain/Tools/_Index.md +30 -0
- package/second-brain/Traces/_Index.md +30 -0
- package/second-brain/Vault Structure Map.md +33 -1
- package/second-brain/copilot/_Index.md +30 -0
- package/skills/audit-license-compliance/SKILL.md +117 -0
- package/skills/author-codemod/SKILL.md +110 -0
- package/skills/build-audit-logging/SKILL.md +112 -0
- package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
- package/skills/build-cli-tool/SKILL.md +108 -0
- package/skills/build-data-table/SKILL.md +141 -0
- package/skills/build-native-mobile-ui/SKILL.md +154 -0
- package/skills/build-offline-first-sync/SKILL.md +118 -0
- package/skills/build-realtime-channel/SKILL.md +122 -0
- package/skills/build-vector-search/SKILL.md +131 -0
- package/skills/compose-local-dev-stack/SKILL.md +149 -0
- package/skills/configure-bundler-build/SKILL.md +166 -0
- package/skills/configure-dns-tls/SKILL.md +142 -0
- package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
- package/skills/configure-security-headers-csp/SKILL.md +122 -0
- package/skills/contract-testing/SKILL.md +140 -0
- package/skills/datetime-timezone-correctness/SKILL.md +125 -0
- package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
- package/skills/debug-flaky-tests/SKILL.md +128 -0
- package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
- package/skills/deliver-webhooks/SKILL.md +116 -0
- package/skills/design-api-pagination/SKILL.md +144 -0
- package/skills/design-authorization-model/SKILL.md +119 -0
- package/skills/design-backup-dr-recovery/SKILL.md +113 -0
- package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
- package/skills/design-multi-tenancy/SKILL.md +100 -0
- package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
- package/skills/design-relational-schema/SKILL.md +129 -0
- package/skills/design-search-index-infra/SKILL.md +151 -0
- package/skills/design-state-machine/SKILL.md +108 -0
- package/skills/design-token-system/SKILL.md +109 -0
- package/skills/distributed-locks-leases/SKILL.md +120 -0
- package/skills/encrypt-sensitive-data/SKILL.md +148 -0
- package/skills/feature-flags-rollout/SKILL.md +130 -0
- package/skills/file-upload-object-storage/SKILL.md +107 -0
- package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
- package/skills/harden-llm-app-reliability/SKILL.md +126 -0
- package/skills/i18n-localization-setup/SKILL.md +113 -0
- package/skills/idempotency-keys/SKILL.md +107 -0
- package/skills/implement-push-notifications/SKILL.md +142 -0
- package/skills/ingest-webhook-secure/SKILL.md +120 -0
- package/skills/integrate-oauth-oidc/SKILL.md +126 -0
- package/skills/load-stress-test/SKILL.md +129 -0
- package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
- package/skills/model-nosql-data/SKILL.md +118 -0
- package/skills/money-decimal-arithmetic/SKILL.md +123 -0
- package/skills/monitor-ml-drift/SKILL.md +109 -0
- package/skills/numeric-precision-units/SKILL.md +144 -0
- package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
- package/skills/optimize-react-rerenders/SKILL.md +124 -0
- package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
- package/skills/payments-billing-integration/SKILL.md +114 -0
- package/skills/pin-toolchain-versions/SKILL.md +116 -0
- package/skills/plan-strangler-migration/SKILL.md +95 -0
- package/skills/property-based-testing/SKILL.md +108 -0
- package/skills/publish-package-registry/SKILL.md +130 -0
- package/skills/recover-git-state/SKILL.md +119 -0
- package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
- package/skills/resilience-timeouts-retries/SKILL.md +104 -0
- package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
- package/skills/rewrite-git-history/SKILL.md +109 -0
- package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
- package/skills/schema-evolution-compatibility/SKILL.md +121 -0
- package/skills/send-transactional-email/SKILL.md +126 -0
- package/skills/serve-deploy-ml-model/SKILL.md +107 -0
- package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
- package/skills/setup-devcontainer-env/SKILL.md +131 -0
- package/skills/setup-lint-format-precommit/SKILL.md +140 -0
- package/skills/setup-monorepo-tooling/SKILL.md +125 -0
- package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
- package/skills/structured-output-llm/SKILL.md +86 -0
- package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
- package/skills/test-data-factories/SKILL.md +158 -0
- package/skills/threat-model-stride/SKILL.md +123 -0
- package/skills/train-evaluate-ml-model/SKILL.md +109 -0
- package/skills/unicode-text-correctness/SKILL.md +109 -0
- package/skills/visual-regression-testing/SKILL.md +120 -0
|
@@ -0,0 +1,104 @@
---
name: resilience-timeouts-retries
description: Makes calls to flaky dependencies (DBs, HTTP/RPC APIs, queues) survive failure without amplifying it — bounded timeouts on connect/read/total/per-attempt, deadline propagation across the call chain, exponential backoff with FULL jitter, retry budgets/caps, circuit breakers (closed/open/half-open), bulkheads, backpressure + load-shedding (429/503 + Retry-After), and hedged requests for tail latency. Retries only idempotent ops; never retries 4xx except 408/429; library-specific for resilience4j, Polly, tenacity, failsafe-go/gobreaker, JS AbortSignal+p-retry, gRPC deadlines, and Envoy/Istio outlier-detection.
when_to_use: User is calling a network dependency that can be slow/down (HTTP API, DB, RPC, queue) and needs it to fail fast, retry safely, or stop hammering a sick service — or is debugging retry storms, thundering-herd, hung pools, cascading timeouts. Distinct from rate-limiting (limits inbound traffic *you* receive; this protects *your* outbound calls) and async-concurrency-correctness (in-process task/lock/cancellation correctness, not network failure policy). For making the retried write itself safe, pair with idempotency-keys.
---

## When to Use

Reach for this skill when your code crosses a network boundary to something that can be slow, flaky, or down:

- "This call to the payments API / DB sometimes hangs forever and pins all our threads"
- "Add retries to this HTTP/RPC client" (and you need them to not make an outage worse)
- "One slow downstream is taking down the whole service" (cascading failure)
- "We had a retry storm / thundering herd after the dependency recovered"
- "Tail latency (p99) is terrible even though p50 is fine"
- "Should we retry this 500? this timeout? this POST?"

NOT this skill:
- Limiting inbound traffic *you serve* (per-user/IP quotas, token bucket) → rate-limiting. This skill governs the *outbound* calls you make and shedding load when *you* are overwhelmed.
- In-process deadlocks, leaked tasks, locks across `await`, channel backpressure → async-concurrency-correctness. That's correctness of concurrency; this is policy for network failure.
- Making the operation you retry safe to run twice (dedup key, exactly-once effect) → idempotency-keys. Retry without idempotency = duplicate charges.
- Delivering *your* outbound webhooks with retry/backoff/DLQ → deliver-webhooks (this skill is the primitive it builds on).

First principle: **every retry adds load to a system that is already failing.** Default to *fewer* retries with jitter and a budget, never *more*.

## Steps

1. **Put a bound on every wait — no unbounded blocking, ever.** A missing timeout is the root cause of most "the whole service hung" incidents: one stuck call holds a connection/thread until the pool is empty. Set all four:

| Timeout | What it caps | Typical |
|---|---|---|
| connect | TCP/TLS handshake | 1–3s |
| read/socket | gap between bytes | 2–10s |
| per-attempt total | one try end-to-end | derived from p99 + margin |
| overall/deadline | whole op incl. retries | < the caller's own deadline |

Per-language: JS `fetch(url, { signal: AbortSignal.timeout(ms) })` (default fetch has NO timeout); Python `httpx.Timeout(connect=, read=, write=, pool=)` or `requests` `timeout=(connect, read)` (a bare `timeout=5` applies 5s to connect and to read separately; neither form caps total elapsed time, so a slow trickle of bytes can still run long); Go `http.Client{Timeout}` + a per-request `context.WithTimeout`; Java set both `connectTimeout` and `requestTimeout`. Never leave a driver/client on its infinite default.

2. **Propagate a deadline (time budget), don't restart the clock per layer.** A fresh 5s timeout at each of three layers lets a request that crosses them in sequence (or retries at each one) wait up to 15s. Compute an absolute deadline once at the edge and pass it down; each hop spends from the *remaining* budget. Go: pass `ctx` (carries `WithDeadline`) into every call — `ctx, cancel := context.WithTimeout(parent, remaining); defer cancel()` (discarding `cancel` leaks the timer). gRPC: set a **deadline** on the client call through the call's `context` (`context.WithTimeout`/`context.WithDeadline`; `grpc.WithTimeout` is a deprecated dial option that bounds connection setup, not the call), and servers must check `ctx.Err()` / `context.Deadline()` and stop work when it's blown. Reserve a slice of the budget for retries — don't let a single attempt consume all of it.
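
The budget arithmetic in step 2 can be sketched in a few lines of Python. This is an illustrative sketch, not part of any library: the `Deadline` class, its `attempt_timeout` method, and the `reserve_s` parameter are hypothetical names, and the injectable `clock` exists only to make the behavior easy to check.

```python
import time


class Deadline:
    """Hypothetical helper: one absolute deadline, created at the edge and passed down."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        """Seconds left in the overall budget (never negative)."""
        return max(0.0, self._expires_at - self._clock())

    def attempt_timeout(self, per_attempt_s, reserve_s=0.0):
        """Timeout for one attempt: the per-attempt cap or what is left, whichever is smaller.

        reserve_s holds back part of the budget so a later retry still has time to run.
        """
        left = self.remaining() - reserve_s
        if left <= 0:
            raise TimeoutError("deadline exceeded before the attempt started")
        return min(per_attempt_s, left)
```

Each hop would call `attempt_timeout(...)` on the same object instead of starting its own clock, so the total wait stays under the budget chosen at the edge.
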

3. **Retry ONLY idempotent/safe operations.** GET/HEAD are safe and PUT/DELETE are idempotent, so all of them can be retried; a raw POST is neither — a retried "create order" can double-charge. Either restrict retries to idempotent verbs, or make the write idempotent with an **idempotency key** the server dedupes on (→ idempotency-keys) and only then retry it. Treat "I don't know if it ran" (timeout after send) as **possibly executed** — never blind-retry a non-idempotent write on timeout.

4. **Retry the right errors, fail fast on the rest.**

| Outcome | Retry? |
|---|---|
| connection refused / reset / DNS / connect timeout | yes |
| read timeout (idempotent op only) | yes |
| 502, 503, 504 | yes |
| 429, 408 | yes — and honor `Retry-After` |
| 500 | usually no (often a deterministic bug — same input, same failure) |
| 400, 401, 403, 404, 409, 422 | **no** — retrying a client error just burns budget |

gRPC: retry `UNAVAILABLE`/`RESOURCE_EXHAUSTED`, and `DEADLINE_EXCEEDED` only while the caller's overall deadline still has budget; not `INVALID_ARGUMENT`/`NOT_FOUND`/`PERMISSION_DENIED`. When `Retry-After` is present, obey it instead of your backoff.
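
Steps 3 and 4 together amount to a small decision function. The sketch below is one possible encoding, under these assumptions: the names `should_retry` and `retry_after_seconds` are hypothetical, `connect_failed` stands for failures where the request was never sent (refused, DNS, connect timeout), and only the delta-seconds form of `Retry-After` is parsed.

```python
RETRYABLE_STATUS = {408, 429, 502, 503, 504}


def should_retry(*, idempotent, status=None, connect_failed=False, read_timeout=False):
    """Decide whether an outcome deserves another attempt (hypothetical helper)."""
    if connect_failed:
        return True   # the request never reached the server
    if not idempotent:
        return False  # it may have executed; step 3 forbids a blind retry
    return read_timeout or status in RETRYABLE_STATUS


def retry_after_seconds(headers):
    """Parse a delta-seconds Retry-After header; None if absent or not a plain integer."""
    value = headers.get("Retry-After", "").strip()
    return float(value) if value.isdigit() else None
```

A caller would sleep for `retry_after_seconds(...)` when it returns a number and fall back to its own backoff otherwise. The HTTP-date form of the header is deliberately left unhandled here.
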

5. **Back off exponentially with FULL jitter.** Fixed delays (or exponential with *no* jitter) make every client that failed at T retry in lockstep — a synchronized retry storm that re-knocks-over the recovering dependency (thundering herd). Use full jitter: `sleep = random_uniform(0, min(cap, base * 2**attempt))`.

```python
import random
delay = random.uniform(0, min(cap, base * (2 ** attempt)))  # base=0.1s, cap=10s
```
"Equal jitter" (`half + rand(half)`) is acceptable; "no jitter" is the bug. Add `Retry-After` override when the server sent one.

6. **Cap retries and add a retry budget — keep nesting from multiplying load.** Limit attempts (typically **2–3 total**, not 5+) AND a max elapsed (the overall deadline from step 2). Crucial at scale: a per-client **retry budget** (e.g. retries ≤ 10–20% of total requests). Retrying at *every* layer multiplies: 3 layers × 3 attempts each = 3³ = 27× load on the bottom service. **Retry at exactly one layer** (usually the lowest, closest to the dependency) and have outer layers fail fast.
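
Steps 5 and 6 combine into one loop: a capped attempt count, a full-jitter delay, and an overall deadline. The following is a minimal sketch assuming a hypothetical `retry_with_full_jitter` helper; in production code the libraries listed in step 12 are the better choice. The `sleep`, `clock`, and `rng` parameters are injected only so the behavior can be checked deterministically.

```python
import random
import time


def retry_with_full_jitter(op, *, is_retryable, max_attempts=3, base=0.1, cap=10.0,
                           deadline_s=30.0, sleep=time.sleep, clock=time.monotonic,
                           rng=random.uniform):
    """Call op() up to max_attempts times with a full-jitter delay between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    expires_at = clock() + deadline_s
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Full jitter: uniform between 0 and the capped exponential ceiling.
            delay = rng(0, min(cap, base * (2 ** attempt)))
            if clock() + delay >= expires_at:
                raise  # sleeping would blow the overall deadline
            sleep(delay)
```

With the defaults, the delay ceiling is 0.1s after the first failure and 0.2s after the second, and the third failure is re-raised to the caller.
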

7. **Add a circuit breaker so a dead dependency fails instantly.** Stop sending into a hole; give it room to recover. States: **closed** (pass through, count failures) → **open** (fail fast immediately, no call) → **half-open** (after a cooldown, allow a few probes; success → closed, failure → open). Trip on **error-rate over a rolling window** (e.g. >50% of ≥20 calls) rather than raw consecutive failures (less noisy under low traffic). Per-dependency breaker — never one global breaker for all downstreams.
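
The three states in step 7 fit in a small class. This is a teaching sketch under stated assumptions, not a replacement for resilience4j, Polly, or gobreaker: the names `CircuitBreaker` and `CircuitOpenError` are hypothetical, the rolling window is count-based (the last `window` calls), half-open admits a single probe, and the class is not thread-safe.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, window=20, min_calls=20, failure_rate=0.5, cooldown_s=30.0,
                 clock=time.monotonic):
        self._results = deque(maxlen=window)  # True = failure; only the latest calls
        self._min_calls = min_calls
        self._failure_rate = failure_rate
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._opened_at = None
        self.state = "closed"

    def call(self, op):
        if self.state == "open":
            if self._clock() - self._opened_at < self._cooldown_s:
                raise CircuitOpenError("failing fast; dependency is cooling down")
            self.state = "half-open"  # cooldown elapsed: let one probe through
        try:
            result = op()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half-open":
            self._results.clear()  # probe succeeded: start a fresh window
            self.state = "closed"
        else:
            self._results.append(False)

    def _on_failure(self):
        if self.state == "half-open":
            self._trip()  # probe failed: back to open
            return
        self._results.append(True)
        enough = len(self._results) >= self._min_calls
        if enough and sum(self._results) / len(self._results) > self._failure_rate:
            self._trip()

    def _trip(self):
        self.state = "open"
        self._opened_at = self._clock()
```

One instance per dependency keeps a failing downstream from opening the breaker for healthy ones.
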

8. **Bulkhead: isolate pools so one sick dependency can't drown the rest.** Give each downstream its *own* connection pool / thread pool / semaphore with a bounded max. Without it, slow calls to dep A consume every worker and requests to healthy dep B also fail. resilience4j `Bulkhead`/`ThreadPoolBulkhead`; a per-dependency `Semaphore(N)`; separate HTTP clients with separate `maxConnsPerHost`. Bound the pool *and* the wait-for-a-slot timeout.
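
A per-dependency semaphore with a bounded wait, as step 8 describes, might look like the sketch below. The `Bulkhead` class and `BulkheadFullError` are hypothetical names; the only library calls are `threading.BoundedSemaphore` and its `acquire(timeout=...)`.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when no slot frees up within the wait limit."""


class Bulkhead:
    def __init__(self, max_concurrent, max_wait_s):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._max_wait_s = max_wait_s

    @contextmanager
    def slot(self):
        # Bound the wait for a slot as well as the number of slots.
        if not self._slots.acquire(timeout=self._max_wait_s):
            raise BulkheadFullError("dependency pool is saturated")
        try:
            yield
        finally:
            self._slots.release()
```

Usage would be one instance per downstream, for example `payments = Bulkhead(10, 0.05)` and `with payments.slot(): ...` around each call, so a saturated dependency rejects quickly instead of holding every worker.
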

9. **Backpressure & load-shed when *you're* the one overwhelmed.** Bounded queues only — an unbounded queue just hides the overload until OOM and inflates latency past every deadline. When the queue/pool is full, **reject early**: return `503` (or `429`) with `Retry-After` and shed load *before* doing expensive work. Fast rejection beats slow timeout. (Inbound *policy* limits → rate-limiting; this is shedding under acute overload.)
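
Early rejection from a bounded queue, as in step 9, can be illustrated with the standard library alone. `BoundedIntake`, its `submit` method, and the `202` accepted status are assumptions made for this sketch; the point is only that a full queue produces an immediate `503` with `Retry-After`.

```python
import queue


class BoundedIntake:
    def __init__(self, max_pending, retry_after_s=1):
        self._pending = queue.Queue(maxsize=max_pending)  # bounded on purpose
        self._retry_after_s = retry_after_s

    def submit(self, job):
        """Return (status, headers): 202 when queued, 503 + Retry-After when full."""
        try:
            self._pending.put_nowait(job)  # never block the caller on a full queue
        except queue.Full:
            return 503, {"Retry-After": str(self._retry_after_s)}
        return 202, {}
```
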

10. **Choose fail-fast vs fail-open/degrade per call deliberately.** On exhausted retries / open breaker: **fail-fast** (propagate the error) for must-be-correct calls (payment authorize); **fail-open / degrade** for optional ones — return a cached/stale value, a default, or a partial response so one non-critical dep can't take down the page. Decide and document it; never default to "throw 500".

11. **Hedge requests for tail latency (read-only/idempotent only).** If p99 ≫ p50, send a second (parallel) request after a delay (e.g. at the p95 latency mark), take whichever returns first, cancel the loser. Cuts tail latency at the cost of extra load — gate it (e.g. ≤5% hedged) so it doesn't amplify during an incident. gRPC supports hedging policy natively. Never hedge non-idempotent writes.
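
The hedging flow in step 11 (wait, send a second request, take the first result, cancel the loser) can be sketched with `asyncio`. The `hedged` function is a hypothetical helper and simplifies in two ways that are assumptions of this sketch: it has no gate on the share of hedged calls, and if the first request to finish raises, that error is returned without waiting for the other.

```python
import asyncio


async def hedged(make_request, hedge_after_s):
    """Start one request; if it is still pending after hedge_after_s, race a second."""
    first = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()  # fast path: no hedge was needed
    second = asyncio.create_task(make_request())
    done, pending = await asyncio.wait({first, second},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # cancel the loser so it stops consuming resources
    return done.pop().result()
```

`make_request` is any zero-argument coroutine function for an idempotent read; `hedge_after_s` would typically be set near the observed p95 latency.
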

12. **Use the battle-tested library, not a hand-rolled `for` loop.**

| Stack | Use | Notes |
|---|---|---|
| Java/Kotlin | **resilience4j** | `Retry` + `CircuitBreaker` + `Bulkhead` + `TimeLimiter`, composed; order matters (TimeLimiter inside Retry) |
| .NET | **Polly** | `ResiliencePipeline`: `AddRetry` (with jitter) + `AddCircuitBreaker` + `AddTimeout` |
| Python | **tenacity** or **backoff** | `@retry(wait=wait_random_exponential(max=10), stop=stop_after_attempt(3), retry=retry_if_exception_type(...))` |
| Go | **failsafe-go** / **sony/gobreaker** | gobreaker for the breaker; failsafe-go for retry+backoff+circuit composed |
| JS/TS | **p-retry** + `AbortSignal.timeout` | p-retry for backoff; `opossum` for the circuit breaker |
| gRPC | service-config `retryPolicy` + `hedgingPolicy` | declarative; set deadlines on the client |
| Mesh | **Envoy / Istio** | `retryPolicy` (`retry_on: 5xx,connect-failure,retriable-4xx`) + `numRetries` + **outlier detection** (the mesh's circuit breaker — ejects bad hosts) |

Push retry/breaker policy to the mesh (Envoy/Istio) when you can — it's per-route, consistent across languages, and observable. Keep app-level resilience for finer control (idempotency-aware retries, fallbacks).

13. **Make all of it observable.** Emit metrics per dependency: attempt count, retry count, breaker state transitions, timeout count, shed/rejected count, p99 latency. A breaker that's silently open looks identical to a healthy-but-idle dependency — you must be able to see it. Alert on breaker-open and on retry-budget exhaustion.

## Anti-Patterns

| Anti-pattern | Why it bites | Fix |
|---|---|---|
| Retrying a non-idempotent POST | Double charge / duplicate record | Idempotency key (→ idempotency-keys) or don't retry |
| Backoff with no jitter | Synchronized retry storm hammers the recovering dep | Full jitter |
| Unbounded retries (`while`/no cap) | Burns budget, melts the dependency | Cap 2–3 + overall deadline |
| Retries nested at every layer | 3×3×3 = 27× load multiplication | Retry at ONE layer only |
| Retrying inside a retry (library + manual) | Hidden multiplication, double the attempts | Pick one place to own retries |
| No timeout / infinite default | One hung call drains the whole pool → service down | Bound connect+read+total |
| Retrying 4xx (400/401/404) | Same input, same failure — pure waste | Only retry 5xx/408/429/connect errors |
| One global circuit breaker | One bad dep opens the breaker for all deps | Per-dependency breaker + bulkhead |
| Unbounded queue for backpressure | Latency blows past deadlines, then OOM | Bounded queue + reject early (503/Retry-After) |
| Same timeout at every layer | Inner timeout ≥ outer → outer fires first, inner work wasted | Propagate a shrinking deadline |
@@ -0,0 +1,97 @@
---
name: resolve-merge-rebase-conflict
description: Resolves non-trivial merge, rebase, and cherry-pick conflicts by reading both sides' intent and combining hunks — handling rename/delete/add-add/binary conflicts, enabling rerere for repeats, and verifying the result builds and tests green rather than just clearing markers.
when_to_use: A merge/rebase/cherry-pick halts on conflicts (large or semantic), or the same conflict keeps recurring. Not clean commits/PRs (git-commit-pr), intentional history editing (rewrite-git-history), or recovering lost work (recover-git-state).
---

## When to Use

- "`git merge`/`git rebase`/`git cherry-pick` stopped with conflicts and I need to finish it correctly."
- "Big conflict across many files after a long-lived branch — which side wins where?"
- "The same conflict keeps coming back every rebase."
- "`<<<<<<<`/`=======`/`>>>>>>>` markers everywhere and the two sides changed the *same* logic differently."
- "rename/rename, modify/delete, add/add, or a binary asset conflicted and `--ours`/`--theirs` isn't obvious."

NOT this skill:
- Writing the commit message / opening the PR once the tree is clean → git-commit-pr
- Deliberately reshaping history (squash, reorder, split, drop commits) when there's no conflict to resolve → rewrite-git-history
- A rebase/merge that ate your work, detached HEAD, or you need a commit back via reflog → recover-git-state

## Steps

1. **Identify the operation first — `ours`/`theirs` MEANING FLIPS.** The same flag points opposite directions depending on what's running. Check `git status` (it names the op) before touching anything:

| Operation | `HEAD` / `--ours` is | `MERGE_HEAD` / `--theirs` is | Mental model |
|---|---|---|---|
| `git merge X` | your current branch | the branch X you're merging in | ours = where you are |
| `git rebase X` | **X (upstream)** — replayed onto | **your commit** being reapplied | **inverted**: "ours" is the base you're moving onto, "theirs" is your own change |
| `git cherry-pick C` | your current branch | commit C being applied | like merge |
| `git revert C` | your current branch | the inverse of C | like merge |

In a rebase, blindly taking `--ours` throws away *your own* work. Never run `checkout --ours/--theirs` on autopilot during a rebase.

2. **Turn on zdiff3 so you can see the base.** Default `merge` style hides the common ancestor, so you can't tell which side actually changed what (zdiff3 needs Git 2.35 or newer; on older versions use `diff3`):

```sh
git config --global merge.conflictStyle zdiff3   # adds a ||||||| base section between the two sides
git config --global rerere.enabled true          # record+replay resolutions (step 7)
```

With zdiff3 a hunk shows `<<<<<<< ours` / `||||||| base` / `=======` / `>>>>>>> theirs`: ours sits above the `|||||||` line, base between `|||||||` and `=======`, and theirs below `=======`. Compare each side *against base*: the side that differs from base is the one that changed; if both differ, you must merge them by hand.
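
To make the "compare each side against base" reading concrete, here is a small Python sketch that splits one diff3/zdiff3-style hunk into its three sections and lists what each side changed relative to base. The helper names `split_conflict` and `changes_vs_base` are hypothetical, and the sketch assumes a single well-formed hunk passed in as a list of lines.

```python
import difflib


def split_conflict(hunk_lines):
    """Split one diff3/zdiff3 conflict hunk into ours, base, and theirs line lists."""
    parts = {"ours": [], "base": [], "theirs": []}
    section = None
    for line in hunk_lines:
        if line.startswith("<<<<<<<"):
            section = "ours"
        elif line.startswith("|||||||"):
            section = "base"
        elif line.startswith("======="):
            section = "theirs"
        elif line.startswith(">>>>>>>"):
            section = None
        elif section is not None:
            parts[section].append(line)
    return parts


def changes_vs_base(parts, side):
    """Lines that `side` removed (-) or added (+) relative to the common ancestor."""
    diff = difflib.ndiff(parts["base"], parts[side])
    return [d for d in diff if d.startswith(("+ ", "- "))]
```

If `changes_vs_base` shows that ours and theirs touched different lines of base, the resolution is usually the union of both changes rather than either side alone.
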

3. **Survey the whole conflict before editing one file.** `git status` lists unmerged paths; `git diff --diff-filter=U` shows only conflicted hunks. Classify each path: content conflict, or a *tree* conflict (rename/delete/add). Resolve content hunks by **intent**, not by picking a side:
- If both sides differ from base, you almost always want **both changes combined**, not one discarded. Read what each side was trying to do, write the union that satisfies both, delete all markers.
- Only take one side wholesale when the other is genuinely superseded (e.g. theirs deleted a function ours merely reformatted).
- For a single file you've decided entirely belongs to one side: `git checkout --ours -- <path>` / `--theirs -- <path>` (remember the rebase flip), then `git add <path>`.

4. **Handle tree conflicts deliberately — markers won't appear:**
- **modify/delete** (`deleted by us`/`deleted by them`): git can't merge a hunk into a missing file. Decide: keep the file and reapply the other side's change (`git add <path>`), or honor the delete (`git rm <path>`). Default: keep it if the surviving side still calls into it.
- **rename/rename** (same file renamed to two names, or both edited one rename): git leaves multiple paths. Pick the intended final name, move the merged content there, `git rm` the stray path, `git add` the keeper.
- **add/add** (both branches created the same path with different content): treat exactly like a content conflict — merge the two versions into one file, `git add`.
- **binary / generated** (images, lockfiles, compiled assets): no line merge possible. Choose a side on purpose — `git checkout --theirs -- yarn.lock && git add yarn.lock` — then **regenerate** rather than trust either copy (e.g. re-run `yarn install` / `npm install` and commit the result, never hand-stitch a lockfile).

5. **Continue, don't re-stage by hand at the end.** After every conflicted path is `git add`-ed (or `git rm`-ed):
- merge → `git commit` (uses the prepared merge message)
- rebase → `git rebase --continue`
- cherry-pick → `git cherry-pick --continue`

Repeat per replayed commit — a rebase can stop again on the next commit; resolve each in turn.

6. **Know the three exits.** Don't thrash a bad resolution — bail and retry with a cleaner head:

| Command | Effect | Use when |
|---|---|---|
| `--continue` | accept current resolution, proceed | you resolved correctly |
| `--abort` | restore the pre-operation state | resolution is going sideways; start over |
| `--skip` | drop the current commit being applied (rebase/cherry-pick only) | this commit is already upstream / empty after resolution — **never** to dodge a hard conflict |

`git merge --abort` / `git rebase --abort` / `git cherry-pick --abort` returns you to the pre-operation state. The one caveat is a merge started with uncommitted changes in the worktree, which `git merge --abort` cannot always reconstruct, so commit or stash before you start. Prefer abort + a fresh attempt over forcing a tangled hunk.

7. **Let rerere kill the repeats.** With `rerere.enabled true` (step 2), git records each manual resolution and **auto-replays** it the next time the identical conflict appears — invaluable when rebasing a long branch onto a moving main, or re-running a merge. Inspect with `git rerere status`/`git rerere diff`; if rerere replayed a *wrong* prior resolution, `git rerere forget <path>` and redo it.

8. **Build and test BEFORE declaring done — clearing markers is not resolving.** A tree with zero markers can still compile to wrong behavior: you may have dropped one side's logic, or combined two valid hunks into a contradiction. Run the project's real build + test (`npm test`, `cargo build && cargo test`, `pytest`, `make`) on the resolved tree. Fix failures at the merge seams, not by weakening assertions. Only then commit/continue.
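
Before the build and test run in step 8, a quick scan for leftover markers is a cheap first gate. The sketch below is one way to do it in Python; `find_conflict_markers` is a hypothetical helper, and it can report false positives on files that legitimately contain a line of exactly seven `=` characters (for example some Markdown or reStructuredText headings).

```python
import re

# Marker lines start at column 0 with exactly seven marker characters,
# followed by a space (and a label) or the end of the line.
MARKER = re.compile(r"^(<{7}|={7}|>{7}|\|{7})( |$)")


def find_conflict_markers(text):
    """Return (line_number, line) pairs for leftover conflict markers in text."""
    return [(n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if MARKER.match(line)]
```

An empty result only says the markers are gone; it says nothing about whether the merged logic is right, which is what the build and tests are for.
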

9. **Reduce future conflicts (avoidance > resolution).** Keep PRs small and short-lived; `git fetch && git rebase origin/main` (or merge main in) frequently so you resolve a little, often, against fresh base instead of a giant divergence later; agree on formatting (run the formatter on both sides) so whitespace/reflow doesn't manufacture conflicts.

## Common Errors

- **`checkout --ours`/`--theirs` during a rebase, expecting merge semantics.** The labels are inverted — `--ours` is upstream, `--theirs` is your own commit. Result: you silently delete your own changes. Confirm the op via `git status` first; in rebase, *theirs* is usually what you want to keep.
- **Committing with conflict markers still in the file.** `<<<<<<<`/`=======`/`>>>>>>>` left behind ship as literal source and break the build. Grep the whole tree before committing (see Verify) — don't trust your eyes per-file.
- **Default `merge` conflictStyle, guessing which side changed.** Without the `|||||||` base you can't see the common ancestor and pick wrong. Set `merge.conflictStyle=zdiff3` once, globally.
- **Blindly taking one side to "make it go away."** `checkout --theirs` on a content conflict where both sides added needed logic drops half the work with a clean exit code. Conflicts where both differ from base almost always want the *union*.
- **Treating modify/delete as nothing to do.** No markers appear, so it looks resolved — but you must explicitly `git add` (keep) or `git rm` (delete). Leaving it unstaged stalls `--continue`.
- **Hand-merging a lockfile or binary.** `package-lock.json`/`yarn.lock`/`Cargo.lock` and images can't be line-merged sanely. Pick a side, then regenerate (`npm install`) or re-export — a stitched lockfile installs a phantom dependency graph.
- **`git rebase --skip` to escape a hard conflict.** Skip *drops the entire commit*, silently losing that change. Only skip a commit already present upstream or empty after resolution; otherwise resolve it.
- **Declaring done at zero markers without building.** A syntactically clean merge can be semantically broken (dropped branch, double-applied change). Always run build + tests on the resolved tree.
- **rerere replaying a stale wrong resolution.** Once enabled it auto-applies your *previous* answer even if that answer was the bug. If an auto-resolved hunk looks off, `git rerere forget <path>` and redo.
|
|
86
|
+
- **Resolving one conflict in a multi-commit rebase and assuming you're done.** Rebase stops per commit; `--continue` may immediately halt on the next. Loop until `git status` reports no rebase in progress.
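
The `--ours`/`--theirs` inversion in the first error above is easier to avoid with the mapping written down. A minimal sketch in TypeScript, assuming a hypothetical `sideMeaning` lookup (an illustration only, not a git or sanook-cli API):

```typescript
// Hypothetical lookup table; names are illustrative, not a real API.
type Operation = "merge" | "rebase" | "cherry-pick";

interface Sides {
  ours: string;   // what `git checkout --ours <path>` selects
  theirs: string; // what `git checkout --theirs <path>` selects
}

const SIDES: Record<Operation, Sides> = {
  merge: {
    ours: "your current branch (HEAD)",
    theirs: "the branch being merged in",
  },
  // Rebase replays your commits on top of upstream, so the labels flip.
  rebase: {
    ours: "the upstream base being rebased onto",
    theirs: "your own commit being replayed",
  },
  "cherry-pick": {
    ours: "your current branch (HEAD)",
    theirs: "the commit being picked",
  },
};

/** Describes what each side label refers to for the given operation. */
function sideMeaning(op: Operation): Sides {
  return SIDES[op];
}
```

Checking `git status` for the operation in progress and then reading the matching row is usually enough to pick the right flag.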
|
|
87
|
+
|
|
88
|
+
## Verify
|
|
89
|
+
|
|
90
|
+
1. **No markers anywhere:** `git grep -nE '^(<{7}|={7}|>{7}|\|{7})' -- ':!*.md'` returns nothing (also catches the zdiff3 `|||||||` base line). Empty output is mandatory.
|
|
91
|
+
2. **No unmerged paths:** `git status --porcelain` shows no `UU`/`AA`/`DU`/`UD`/`DD`/`AU`/`UA` entries; `git diff --diff-filter=U` is empty.
|
|
92
|
+
3. **Operation actually finished:** `git status` reports no merge/rebase/cherry-pick in progress (no `.git/MERGE_HEAD`, `.git/rebase-merge`, `.git/rebase-apply`, or `.git/CHERRY_PICK_HEAD`).
|
|
93
|
+
4. **Build + tests green on the resolved tree** — the project's real commands (`npm test` / `cargo test` / `pytest` / `make`), not a marker check standing in for verification.
|
|
94
|
+
5. **Both sides' intended changes are present:** diff the result against each parent and confirm neither side's needed logic was dropped. For a finished merge: `git diff HEAD^1 HEAD` (your side) and `git diff HEAD^2 HEAD` (the merged-in side). For a rebase, run `git range-diff @{upstream} @{1} HEAD` right after it finishes: it pairs each pre-rebase commit (`@{1}` is the branch's previous tip) with its rebased counterpart, so you can confirm your commits survived intact.
|
|
95
|
+
6. **rerere recorded reusable resolutions** (if conflicts may recur): `git rerere status` is clean and future identical conflicts auto-resolve.
|
|
96
|
+
|
|
97
|
+
Done = zero conflict markers, no unmerged paths, the operation is complete, build + tests pass on the resolved tree, and a diff against both parents confirms each side's intended change is present.
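
Verify checks 1 and 2 can also be run over captured text, for example inside a pre-commit script. A minimal sketch, assuming hypothetical helpers `findConflictMarkers` and `unmergedPaths` (neither is part of git or sanook-cli) that take file contents and `git status --porcelain` output as plain strings:

```typescript
// Hypothetical helpers for Verify checks 1 and 2; illustrative names only.

// Same pattern as the `git grep` check: a line starting with 7 of <, =, > or |.
const MARKER = /^(<{7}|={7}|>{7}|\|{7})/;

/** Returns the 1-based line numbers that start with a conflict marker. */
function findConflictMarkers(text: string): number[] {
  const hits: number[] = [];
  text.split("\n").forEach((line, index) => {
    if (MARKER.test(line)) hits.push(index + 1);
  });
  return hits;
}

// Two-letter status codes that porcelain output uses for unmerged paths.
const UNMERGED_CODES = ["UU", "AA", "DU", "UD", "DD", "AU", "UA"];

/** Extracts unmerged paths from `git status --porcelain` (v1) output. */
function unmergedPaths(porcelain: string): string[] {
  return porcelain
    .split("\n")
    .filter((line) => UNMERGED_CODES.indexOf(line.slice(0, 2)) !== -1)
    .map((line) => line.slice(3)); // skip the "XY " status prefix
}
```

Both functions returning an empty array corresponds to checks 1 and 2 passing; the build and test run in check 4 is still required.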
|
|
@@ -0,0 +1,109 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: rewrite-git-history
|
|
3
|
+
description: Rewrites git history safely — interactive rebase (squash/split/reorder/reword/edit), amend, and git filter-repo/BFG to purge a committed secret or large file — using force-with-lease and explicit shared-branch safeguards.
|
|
4
|
+
when_to_use: Cleaning up a feature branch before merge, purging a committed secret or large blob from all history, or splitting/squashing commits. Distinct from git-commit-pr (normal commit/PR), resolve-merge-rebase-conflict (fixing conflicts), and recover-git-state (recovering lost commits).
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Reach for this skill when you must **change commits that already exist**, not create new ones:
|
|
10
|
+
|
|
11
|
+
- "Squash these 6 WIP commits into one before merging"
|
|
12
|
+
- "Reword that commit message / fix the author/email on old commits"
|
|
13
|
+
- "Split this giant commit into logical pieces"
|
|
14
|
+
- "Reorder / drop a commit from my branch"
|
|
15
|
+
- "I committed an API key / `.env` / a 400 MB binary — scrub it from the whole history"
|
|
16
|
+
- "Amend the last commit, I forgot a file"
|
|
17
|
+
|
|
18
|
+
NOT this skill:
|
|
19
|
+
- Writing a normal commit message or opening a PR on un-rewritten work → git-commit-pr
|
|
20
|
+
- A rebase/merge that stopped on `CONFLICT` markers → resolve-merge-rebase-conflict
|
|
21
|
+
- Lost a commit/branch after a bad rebase/reset, need it back → recover-git-state
|
|
22
|
+
- Moving plaintext secrets into a vault, rotating, or scanning for leaks → secrets-management (run *after* the purge here)
|
|
23
|
+
|
|
24
|
+
## Steps
|
|
25
|
+
|
|
26
|
+
1. **Apply the golden rule first — is this branch shared?** Rewriting changes every downstream SHA; anyone who pulled the old history gets a divergent tree and can re-push the secret you deleted.
|
|
27
|
+
|
|
28
|
+
| Branch state | Rewrite? | How to push |
|---|---|---|
| Local-only, never pushed | Yes, freely | normal `git push` (first push) |
| Pushed, **only you** pull it (personal feature branch) | Yes | `git push --force-with-lease` |
| Shared / `main` / release / others have pulled | **No: coordinate first** | announce → everyone stops → rewrite → `--force-with-lease` → everyone re-clones or `git reset --hard origin/<branch>` |
|
|
33
|
+
|
|
34
|
+
Default: rewrite only local/unshared branches. For `main`, the answer is almost always "don't" — the only exception is purging a secret, and then only with team sign-off.
|
|
35
|
+
|
|
36
|
+
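2. **Record the safety net before touching anything.** Copy the current tip so you can undo: `git rev-parse HEAD`, and confirm `git status` is clean (stash or commit dirty work first; rebase refuses to start otherwise). The reflog (`git reflog`) also keeps the old tip, but entries that become unreachable from the rewritten branch expire after about 30 days by default (`gc.reflogExpireUnreachable`; the 90-day default applies only to reachable entries), and a written-down SHA is faster.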
|
|
37
|
+
|
|
38
|
+
3. **Interactive rebase for squash/reword/reorder/drop/edit.** Rebase the commits *since* the base, not your whole history:
|
|
39
|
+
|
|
40
|
+
```bash
git rebase -i origin/main   # or: git rebase -i HEAD~5
```
|
|
43
|
+
In the todo editor, set the verb on each line (top = oldest), and **reorder by moving whole lines**:
|
|
44
|
+
|
|
45
|
+
| Verb | Effect |
|---|---|
| `pick` | keep as-is |
| `reword` | keep changes, edit the message |
| `edit` | stop here to amend content or split (step 4) |
| `squash` | fold into the commit above, **combine both messages** |
| `fixup` | fold into the commit above, **discard this message** (use for "oops typo" commits) |
| `drop` | delete the commit entirely (or just delete the line) |
|
|
53
|
+
|
|
54
|
+
Save and close. Resolve any conflicts (→ resolve-merge-rebase-conflict), then `git rebase --continue`. Abort cleanly at any point with `git rebase --abort` — it restores the pre-rebase tip exactly.
|
|
55
|
+
|
|
56
|
+
4. **Split one commit into several.** Mark it `edit` in the rebase todo; when rebase stops on it:
|
|
57
|
+
```bash
git reset HEAD^                        # un-commit, keep changes in working tree (mixed reset)
git add -p                             # stage hunk-by-hunk for the first piece
git commit -m "first logical change"
git add -p && git commit -m "second logical change"   # repeat until clean
git rebase --continue
```
|
|
64
|
+
`git reset HEAD^` (mixed) keeps your work in the tree; never `--hard` here or you lose it.
|
|
65
|
+
|
|
66
|
+
5. **Amend only the last commit** (no rebase needed): stage the change with `git add`, then `git commit --amend` (edit message + fold staged changes), or `git commit --amend --no-edit` to fold in forgotten files without touching the message. If already pushed to your own branch, follow with `git push --force-with-lease`.
|
|
67
|
+
|
|
68
|
+
6. **Purge a file/secret across ALL history — `git filter-repo` (preferred, BFG fallback).** `git filter-branch` is deprecated and slow; do not use it.
|
|
69
|
+
```bash
pip install git-filter-repo          # or: brew install git-filter-repo
# Remove a path everywhere (history rewritten in place):
git filter-repo --path config/secrets.yml --invert-paths
# Or redact a literal string/regex from every blob:
printf 'AKIAIOSFODNN7EXAMPLE==>REDACTED\n' > replacements.txt
git filter-repo --replace-text replacements.txt
```
|
|
77
|
+
BFG alternative for big blobs: `bfg --delete-files '*.zip'` or `bfg --replace-text replacements.txt`, then `git reflog expire --expire=now --all && git gc --prune=now --aggressive`. `filter-repo` runs that cleanup for you and removes the `origin` remote on purpose — re-add it before pushing.
|
|
78
|
+
|
|
79
|
+
7. **ROTATE the leaked secret — rewriting history does NOT un-leak it.** Anyone who cloned, any fork, any CI cache, and GitHub's own unreachable-commit cache still hold it. The history scrub is step 1 of 2; the real fix is: **revoke + reissue the key/token/password at the provider** (then → secrets-management). Treat the credential as compromised the moment it was pushed.
|
|
80
|
+
|
|
81
|
+
8. **Force-push with `--force-with-lease`, never bare `--force`.**
|
|
82
|
+
```bash
git push --force-with-lease origin <branch>
```
|
|
85
|
+
`--force-with-lease` refuses the push if the remote moved since your last fetch — it catches the case where a teammate pushed in the meantime, which bare `--force` would silently obliterate. For a full-history purge you must push every ref: `git push --force-with-lease --all && git push --force-with-lease --tags`.
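
The `replacements.txt` file from step 6 can be generated rather than typed by hand when several literals must be scrubbed. A minimal sketch, assuming a hypothetical `buildReplacements` helper (not part of git-filter-repo) that emits one `literal==>replacement` rule per line, the same shape as the `printf` example in step 6:

```typescript
// Hypothetical helper; builds the text of replacements.txt for
// `git filter-repo --replace-text`. Assumes every input is a plain literal.
function buildReplacements(secrets: string[], replacement = "REDACTED"): string {
  const rules: string[] = [];
  for (const secret of secrets) {
    if (secret === "") continue; // nothing to redact
    if (secret.indexOf("\n") !== -1 || secret.indexOf("==>") !== -1) {
      throw new Error("a literal cannot contain a newline or the '==>' separator");
    }
    // filter-repo treats these prefixes as pattern rules, so a literal that
    // starts with one would be misread; reject it instead of guessing.
    if (secret.indexOf("regex:") === 0 || secret.indexOf("glob:") === 0) {
      throw new Error("literal starts with a pattern prefix: " + secret);
    }
    const rule = secret + "==>" + replacement;
    if (rules.indexOf(rule) === -1) rules.push(rule); // de-duplicate
  }
  return rules.length > 0 ? rules.join("\n") + "\n" : "";
}
```

Write the returned string to `replacements.txt`, run the purge, and delete the file afterwards: it contains the secrets in plaintext and must never be committed.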
|
|
86
|
+
|
|
87
|
+
## Common Errors
|
|
88
|
+
|
|
89
|
+
- **Force-pushing a shared/`main` branch without coordination.** Everyone else's next pull diverges, and someone re-pushes the deleted secret. Confirm the branch is unshared (step 1) or get explicit sign-off first.
|
|
90
|
+
- **Using bare `git push --force`.** Overwrites whatever is on the remote with zero safety check, including a teammate's just-pushed commits. Always `--force-with-lease`.
|
|
91
|
+
- **Thinking the rewrite removed the secret.** It only removed it from *your* refs. Forks, clones, CI caches, and GitHub's cached unreachable commits still expose it — rotate the credential (step 7). This is the single most common and most damaging mistake.
|
|
92
|
+
- **`git reset --hard HEAD^` when splitting a commit.** Discards the very changes you were trying to re-commit. Use `git reset HEAD^` (mixed) to keep them in the working tree.
|
|
93
|
+
- **Rebasing onto the wrong base** (`HEAD~10` swallows commits already on `main`). Rebase against the tracking base, e.g. `git rebase -i origin/main`, so you only touch your own commits.
|
|
94
|
+
- **Reordering by editing SHAs or text instead of moving whole lines.** The rebase todo is line-ordered; cut/paste entire lines to reorder. Editing a hash breaks the rebase.
|
|
95
|
+
- **`squash` when you meant `fixup`** (or vice versa). `squash` opens an editor to combine both messages; `fixup` drops the second message. Picking wrong leaves "WIP"/"typo" text in your final message.
|
|
96
|
+
- **Running `git filter-repo` on a dirty or non-fresh clone.** It refuses non-fresh clones by default for safety; work on a fresh `git clone` (or pass `--force` only when you understand why), and commit/stash everything first.
|
|
97
|
+
- **Forgetting `filter-repo` dropped the remote.** It removes `origin` deliberately so you can't push to the wrong place by reflex. Re-add it (`git remote add origin <url>`) before the force-push.
|
|
98
|
+
- **Letting purged objects linger.** After a manual `filter-branch`/BFG run, the blobs survive in reflog/packs until `git reflog expire --expire=now --all && git gc --prune=now`. Without it, `git cat-file` still serves the secret locally.
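
The `squash` versus `fixup` difference, and the effect of `drop`, can be previewed before running the rebase. A minimal sketch, assuming a hypothetical `previewMessages` function over a simplified todo model (it ignores conflicts and the editor sessions that `reword`, `edit`, and `squash` open; it is not how git itself is implemented):

```typescript
// Hypothetical, simplified model of an interactive-rebase todo list.
type Verb = "pick" | "reword" | "edit" | "squash" | "fixup" | "drop";

interface TodoLine {
  verb: Verb;
  sha: string;
  message: string;
}

/** Returns the commit messages the todo would leave behind, oldest first. */
function previewMessages(todo: TodoLine[]): string[] {
  const result: string[] = [];
  for (const line of todo) {
    if (line.verb === "drop") continue; // commit disappears entirely
    if (line.verb === "squash" || line.verb === "fixup") {
      if (result.length === 0) {
        throw new Error("cannot " + line.verb + " " + line.sha + ": no previous commit");
      }
      // squash keeps this message alongside the previous one; fixup discards it.
      if (line.verb === "squash") {
        result[result.length - 1] += "\n\n" + line.message;
      }
      continue;
    }
    result.push(line.message); // pick / reword / edit keep their own commit
  }
  return result;
}
```

If the preview still contains "WIP" or "typo" text, the line that carries it was probably meant to be `fixup` rather than `squash`.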
|
|
99
|
+
|
|
100
|
+
## Verify
|
|
101
|
+
|
|
102
|
+
1. **Target commits are correct.** `git log --oneline` (and `--stat`/`-p`) shows exactly the intended squash/split/reword/reorder — commit count and each message match the plan.
|
|
103
|
+
2. **No unintended commits dropped.** Diff the rewritten branch against the saved pre-rewrite SHA from step 2: `git range-diff <old-sha>...HEAD` (range-diff needs two commit ranges, and the three-dot form derives both from the two tips), or `git diff <old-sha> HEAD` to confirm the *tree* is identical when you only re-shaped messages/structure, not content. Cross-check `git reflog` for the old tip.
|
|
104
|
+
3. **Secret is gone from every ref, not just `HEAD`.** `git log --all -p -S 'AKIA'` returns nothing, and `git grep -I 'AKIA' $(git rev-list --all)` finds no hits across all history. For a removed path: `git log --all --oneline -- config/secrets.yml` is empty.
|
|
105
|
+
4. **No dangling reachable copy.** After cleanup, `git rev-list --objects --all | grep <blob-sha>` is empty and `git count-objects -v` shows the pack shrank.
|
|
106
|
+
5. **Credential actually rotated.** The old key returns 401/403 from the provider (not just deleted from git). If it still authenticates, you are not done.
|
|
107
|
+
6. **Remote matches and was pushed safely.** `git push --force-with-lease` succeeded (not refused), and `git log origin/<branch> --oneline` equals local. Teammates on a shared branch confirmed re-sync.
|
|
108
|
+
|
|
109
|
+
Done = intended commits are exactly reshaped with zero unintended drops (verified against the pre-rewrite SHA/reflog), the secret is absent from every ref and every history blob, the leaked credential is rotated and dead at the provider, and the push used `--force-with-lease` on a branch that was either unshared or coordinated.
|
|
@@ -0,0 +1,137 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: scaffold-cross-platform-app
|
|
3
|
+
description: Scaffolds React Native (Expo Router) and Flutter app shells — feature-first folder layout, typed navigation + deep links, client-store wiring (Zustand/Redux Toolkit/Riverpod/Bloc), platform-divergent code and native bridges (Expo config plugin/Flutter platform channel), token-driven theming with dark mode, and env/build-flavor tooling.
|
|
4
|
+
when_to_use: Standing up or restructuring a whole React Native (Expo) or Flutter app — choosing navigation, client state, platform-conditional code, bridging a native module, theming, and build flavors. Distinct from build-native-mobile-ui (SwiftUI/Compose screens, not RN/Flutter), manage-client-server-state (server cache/data fetching), design-token-system (the token pipeline this skill consumes), and ship-mobile-app-store-release (signing + store upload).
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## When to Use
|
|
8
|
+
|
|
9
|
+
Reach for this skill when the request is about **standing up or reorganizing a whole RN/Flutter app**, not a single screen:
|
|
10
|
+
|
|
11
|
+
- "Set up a new Expo app with tabs + a typed navigation stack and deep links"
|
|
12
|
+
- "Start a Flutter app with go_router and Riverpod, organized by feature"
|
|
13
|
+
- "Pick state management — Redux Toolkit vs Zustand / Bloc vs Riverpod — and wire it"
|
|
14
|
+
- "I need iOS-only and Android-only versions of this code / an adaptive widget"
|
|
15
|
+
- "Bridge a native module / write an Expo config plugin / add a Flutter platform channel"
|
|
16
|
+
- "Apply our design tokens + dark mode across the app shell"
|
|
17
|
+
- "Add dev/staging/prod flavors with separate env, bundle IDs, and icons"
|
|
18
|
+
|
|
19
|
+
NOT this skill:
|
|
20
|
+
- A **native** iOS/Android screen in SwiftUI or Jetpack Compose (not RN/Flutter) → build-native-mobile-ui
|
|
21
|
+
- Building one reusable RN component in an existing tree → build-react-component
|
|
22
|
+
- Server-cache, fetching, optimistic updates, query invalidation → manage-client-server-state (this skill wires *client* state only)
|
|
23
|
+
- Designing the token **architecture/pipeline** (primitive/semantic tiers, Style Dictionary, W3C export) → design-token-system (this skill *consumes* the exported tokens)
|
|
24
|
+
- Pixel-matching a Figma/screenshot for a screen → implement-from-design
|
|
25
|
+
- Tailwind/responsive web layout → style-responsive-tailwind
|
|
26
|
+
- E2E flows on the running app → write-playwright-e2e
|
|
27
|
+
- Code signing, keystores, TestFlight/Play upload, phased rollout → ship-mobile-app-store-release
|
|
28
|
+
- The CI workflow that calls build/sign/upload lanes (EAS/Codemagic/Fastlane in CI) → cicd-pipeline-author
|
|
29
|
+
- Storing signing keys / API secrets safely → secrets-management
|
|
30
|
+
|
|
31
|
+
## Steps
|
|
32
|
+
|
|
33
|
+
1. **Pick the framework lane and don't drift mid-project.** Default to **Expo (managed) + Expo Router** for RN, **Flutter stable + go_router** for Dart. Go bare RN only when a dependency needs native build config the managed prebuild can't express.
|
|
34
|
+
|
|
35
|
+
| Need | RN choice | Flutter choice |
|---|---|---|
| Standard app, OTA updates, fast start | **Expo managed** + `expo-dev-client` | Flutter stable |
| Custom native code you control | Expo + **config plugin** (stay managed) | Flutter + plugin/FFI |
| Native build settings Expo can't model | bare RN (`expo prebuild` then own `ios/`,`android/`) | n/a |
| Routing | **Expo Router** (file-based, typed) | **go_router** (typed routes) |
| New project command | `npx create-expo-app@latest -t default` | `flutter create --org com.acme app` |
|
|
42
|
+
|
|
43
|
+
Reject React-Navigation-only (no router) for new apps: Expo Router *is* React Navigation underneath but gives file-based deep linking for free.
|
|
44
|
+
|
|
45
|
+
2. **Lay out feature-first, not type-first.** Group by domain so a feature is one deletable folder. Avoid the top-level `screens/ components/ reducers/` split — it scatters every feature across the tree.
|
|
46
|
+
|
|
47
|
+
```
src/
  app/                      # Expo Router routes (file = route). Flutter: lib/routing/
    (tabs)/index.tsx        # deep link: myapp:// → /
    (tabs)/profile.tsx
    post/[id].tsx           # myapp://post/42
    _layout.tsx             # Stack/Tabs + theme provider
  features/
    auth/    { ui/ store.ts api.ts types.ts }
    feed/    { ui/ store.ts api.ts }
  shared/    { ui/ hooks/ theme/ lib/ }
  platform/                 # *.ios.tsx / *.android.tsx live next to use site
```
|
|
60
|
+
Flutter mirror: `lib/features/<x>/{presentation,application,data,domain}`, `lib/core/theme`, `lib/routing/app_router.dart`.
|
|
61
|
+
|
|
62
|
+
3. **Make routes typed and deep-linkable from day one.**
|
|
63
|
+
- **Expo Router:** enable typed routes in `app.json` → `"experiments": { "typedRoutes": true }`. Set `scheme` in `app.json` (`"scheme": "myapp"`) so `myapp://post/42` resolves; for universal/app links add `expo-router` `+native-intent` or `associatedDomains`. Nest with `_layout.tsx`: a `(tabs)` group holds `<Tabs>`, a sibling `_layout` holds a `<Stack>` for modals/detail. Navigate with `router.push({ pathname: '/post/[id]', params: { id } })` — params are type-checked.
|
|
64
|
+
- **go_router:** define routes once, use `GoRoute` + `context.goNamed('post', pathParameters: {'id': id})`. Configure `MaterialApp.router(routerConfig: appRouter)`. Deep links work via the platform `<intent-filter>` (Android) / `CFBundleURLTypes` (iOS) — wire `uriPrefix`/`scheme` to match the route table.
|
|
65
|
+
|
|
66
|
+
4. **Wire client state by app shape — opinionated defaults, no "it depends":**
|
|
67
|
+
|
|
68
|
+
| App | RN | Flutter | Why |
|---|---|---|---|
| Small/medium, mostly local UI state | **Zustand** | **Riverpod** | Minimal boilerplate, no provider-tree gymnastics |
| Large, many devs, time-travel/devtools, strict conventions | **Redux Toolkit** | **Bloc** | Enforced structure, traceable events, predictable reducers |
| Server data (lists, caches, mutations) | **TanStack Query** | Riverpod `AsyncNotifier` / `dio` | Don't hand-roll cache in the store → manage-client-server-state |
|
|
73
|
+
|
|
74
|
+
Default to **Zustand** (RN) / **Riverpod** (Flutter) unless team size or audit needs push you to RTK/Bloc. **Boundary rule:** keep *server cache* out of the global store; the store holds session, auth, theme, navigation-adjacent UI state. One `store.ts`/notifier per feature; compose at app root, never one god-store.
|
|
75
|
+
|
|
76
|
+
```ts
// features/auth/store.ts: Zustand slice, typed, selector-friendly
import { create } from 'zustand';
import { api } from './api';               // assumed: exposes login(credentials)
import type { AuthState } from './types';  // assumed: user, token, signIn, signOut

export const useAuth = create<AuthState>()((set) => ({
  user: null,
  token: null,
  // Log in through the feature's API module, then store the session.
  signIn: async (c) => { const { user, token } = await api.login(c); set({ user, token }); },
  signOut: () => set({ user: null, token: null }),
}));
// read narrowly to avoid re-renders: const user = useAuth(s => s.user)
```
|
|
85
|
+
|
|
86
|
+
5. **Diverge by platform with the cheapest tool that works.** Escalate only as needed:
|
|
87
|
+
- **One value differs:** `Platform.select({ ios: 12, android: 8, default: 8 })` or `Platform.OS === 'ios'`. Flutter: `Theme.of(context).platform == TargetPlatform.iOS` or `defaultTargetPlatform`.
|
|
88
|
+
- **A whole component differs:** split files — `Button.ios.tsx` / `Button.android.tsx`; import `./Button` and Metro resolves per-platform. Flutter: `Platform.isIOS ? CupertinoButton(...) : ElevatedButton(...)`, or conditional imports for web vs native.
|
|
89
|
+
- **Adaptive by design:** Flutter `Switch.adaptive`, `CupertinoIcons` on iOS; RN use a wrapper that picks the native control. Never branch on `Platform.OS` deep inside business logic — isolate divergence at the UI/platform layer.
|
|
90
|
+
|
|
91
|
+
6. **Bridge native code through the framework's official channel — never patch generated folders by hand.**
|
|
92
|
+
- **Expo config plugin** (stay managed): write a plugin that mutates native config at prebuild, e.g. `withInfoPlist` / `withAndroidManifest`, register in `app.json` `"plugins": ["./plugins/with-foo"]`. For real native APIs use the **Expo Modules API** (`createModule`, Swift/Kotlin) — typed JS interface, no manual bridge boilerplate.
|
|
93
|
+
- **Bare RN:** Turbo/Native Module — declare a TS spec, run Codegen, implement on iOS (Swift/ObjC) + Android (Kotlin/Java).
|
|
94
|
+
- **Flutter platform channel:** `MethodChannel('com.acme/foo')` on Dart side; implement the matching handler in `AppDelegate.swift` and `MainActivity.kt`. Keep the channel name and method strings in one shared constants file so both sides can't drift.
|
|
95
|
+
```dart
import 'package:flutter/services.dart'; // provides MethodChannel

const _ch = MethodChannel('com.acme/battery');
// Falls back to -1 when the native side replies with null.
Future<int> level() async => await _ch.invokeMethod<int>('getLevel') ?? -1;
```
|
|
99
|
+
After any native change run `expo prebuild --clean` (Expo) or `flutter clean` and rebuild — JS/Dart hot reload will NOT pick up native edits.
|
|
100
|
+
|
|
101
|
+
7. **Consume design tokens at the app shell; theme from them, don't hardcode.** Build the token source/pipeline with design-token-system; *this* step wires its output into RN/Flutter theming. One `theme/tokens.ts` (or `core/theme/tokens.dart`) holds colors/spacing/radii/typography. Build light+dark from the same tokens; resolve via system scheme.
|
|
102
|
+
- **RN:** export a `light`/`dark` theme object keyed off tokens; read `useColorScheme()`; pass to a `ThemeProvider` (for example React Navigation's, which Expo Router builds on: `<ThemeProvider value={scheme === 'dark' ? Dark : Light}>`). Never inline hex in components; pull from theme.
|
|
103
|
+
- **Flutter:** `MaterialApp(theme: lightFromTokens, darkTheme: darkFromTokens, themeMode: ThemeMode.system)`; build `ColorScheme.fromSeed(seedColor: tokens.brand)`; use `CupertinoTheme` where you ship iOS-native chrome. Dark mode = the dark token set + `themeMode`, not ad-hoc `if (isDark)` checks.
|
|
104
|
+
|
|
105
|
+
8. **Set up tooling once so the app is reproducible:**
|
|
106
|
+
- **Env/flavors:** RN — `app.config.ts` reading `process.env`, build profiles in `eas.json` (`development`/`preview`/`production`), distinct `bundleIdentifier`/`package` per profile. Flutter — `--flavor dev|staging|prod` with `--dart-define-from-file=env/dev.json`, matching Xcode schemes + Android `productFlavors`. **Secrets never in `app.json`/committed `.env`** → secrets-management. **Signing certs, keystores, and store upload** are out of scope → ship-mobile-app-store-release.
|
|
107
|
+
- **Fonts/assets:** RN `expo-font` `useFonts()` (or `expo-asset` preload), gate render on loaded; Flutter declare under `pubspec.yaml` `fonts:`/`assets:`.
|
|
108
|
+
- **Types/lint:** TS `strict: true`, `eslint` + `eslint-config-expo`, `prettier`; Flutter `flutter analyze` + `flutter_lints`. Add a `typecheck` script (`tsc --noEmit`) to CI.
|
|
109
|
+
- **Fast refresh** is on by default — if it stops working, it's almost always a non-component export or a circular import, not the bundler.
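
Step 7's "build light+dark from the same tokens" can be sketched without any framework code. The example below is a minimal illustration, assuming a hypothetical token shape and a `themeFor` helper (the real tokens would come from the design-token pipeline; none of these names are an Expo or React Native API):

```typescript
// Hypothetical token and theme shapes; illustrative only.
interface Palette {
  background: string;
  text: string;
  primary: string;
}

interface Tokens {
  color: { light: Palette; dark: Palette };
  spacing: { sm: number; md: number; lg: number };
  radius: { md: number };
}

interface Theme {
  colors: Palette;
  spacing: Tokens["spacing"];
  radius: Tokens["radius"];
}

// Placeholder values standing in for the exported design tokens.
const tokens: Tokens = {
  color: {
    light: { background: "#FFFFFF", text: "#111111", primary: "#3366FF" },
    dark: { background: "#0B0B0F", text: "#F5F5F5", primary: "#7C9BFF" },
  },
  spacing: { sm: 4, md: 8, lg: 16 },
  radius: { md: 8 },
};

// Mirrors the values React Native's useColorScheme() can report.
type Scheme = "light" | "dark" | null | undefined;

/** Picks the palette for the system scheme; unknown schemes fall back to light. */
function themeFor(scheme: Scheme, source: Tokens = tokens): Theme {
  const mode = scheme === "dark" ? "dark" : "light";
  return { colors: source.color[mode], spacing: source.spacing, radius: source.radius };
}
```

In a component tree, the result of `themeFor(useColorScheme())` would be handed to the theme provider in `_layout.tsx`, so no component needs its own `isDark` branch.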
|
|
110
|
+
|
|
111
|
+
## Common Errors
|
|
112
|
+
|
|
113
|
+
- **Type-first folders (`screens/`, `reducers/`, `components/`).** Every feature smears across the tree; deleting a feature touches 6 folders. Group by feature, share only truly shared code in `shared/`.
|
|
114
|
+
- **One global store for everything including server data.** Caching API responses in Zustand/Redux means manual invalidation and stale UI. Put server cache in TanStack Query / Riverpod `AsyncNotifier`; keep the store for session/UI state.
|
|
115
|
+
- **`Platform.OS` checks buried in business logic.** Divergence leaks everywhere and is untestable. Isolate it at the UI/platform layer via `.ios`/`.android` files or `Platform.select`.
|
|
116
|
+
- **Editing `ios/` or `android/` by hand on a managed Expo app.** The next `prebuild` wipes it. Express native changes as a **config plugin** or Expo Module instead.
|
|
117
|
+
- **Native change with no rebuild.** Hot reload/Fast Refresh only reloads JS/Dart. A new native module or channel needs `expo prebuild --clean` / `flutter clean` + a fresh native build, or you'll debug a phantom "method not found."
|
|
118
|
+
- **Hardcoded hex colors / magic spacing.** Dark mode and rebrands become a find-and-replace. Pull every color/space/radius from the token theme; derive light+dark from one source.
|
|
119
|
+
- **Missing `scheme` / intent-filter, so deep links silently no-op.** Set `scheme` in `app.json` (RN) and the Android `<intent-filter>` + iOS `CFBundleURLTypes` (Flutter) to match the route table, or `myapp://post/42` opens the app to the home screen.
|
|
120
|
+
- **Mismatched platform-channel/method names across Dart↔native.** A typo compiles cleanly on both sides and only surfaces at runtime as a `MissingPluginException`. Keep channel + method strings as named constants (one definition per side, kept visibly in sync) rather than scattered literals.
|
|
121
|
+
- **Same `bundleIdentifier`/`applicationId` across flavors.** Dev and prod overwrite each other on-device and can't coexist. Give each flavor a distinct id + icon + display name.
|
|
122
|
+
- **Untyped navigation params.** `router.push('/post/' + id)` loses type-checking and breaks on refactor. Enable typed routes (Expo) / named go_router routes and pass params as objects.
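
To show what typed route params buy you, here is a minimal sketch of a path builder whose parameter names are derived from the route pattern at the type level. `buildPath` and `ParamNames` are hypothetical names for illustration; this is not Expo Router's implementation, only the same idea in miniature:

```typescript
// Hypothetical helper: extracts "[name]" segments from a route pattern as a
// union of string literal types, e.g. "/post/[id]" -> "id".
type ParamNames<P extends string> =
  P extends `${string}[${infer Name}]${infer Rest}` ? Name | ParamNames<Rest> : never;

/** Fills each "[name]" segment of the pattern with the matching param. */
function buildPath<P extends string>(
  pattern: P,
  params: Record<ParamNames<P>, string>
): string {
  const lookup = params as unknown as Record<string, string>;
  return pattern.replace(/\[([^\]]+)\]/g, (_match: string, name: string) => {
    const value = lookup[name];
    if (value === undefined) {
      throw new Error("missing route param: " + name);
    }
    return encodeURIComponent(value);
  });
}

// Usage: the compiler rejects a misspelled or missing key such as { idd: "42" }.
const postPath = buildPath("/post/[id]", { id: "42" });
```

String concatenation (`'/post/' + id`) gives the compiler nothing to check, whereas the pattern-plus-object form fails the typecheck as soon as a route or param is renamed.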
|
|
123
|
+
|
|
124
|
+
## Verify
|
|
125
|
+
|
|
126
|
+
Run on **both** an iOS simulator and an Android emulator/device — a single-platform pass proves nothing cross-platform.
|
|
127
|
+
|
|
128
|
+
1. **Boots clean both OSes:** `npx expo run:ios` and `npx expo run:android` (or `flutter run -d ios` / `-d android`) start with **no red box / no exception**, app reaches the first screen.
|
|
129
|
+
2. **Typed navigation + deep links:** a wrong route param fails `tsc --noEmit`/`flutter analyze`. `xcrun simctl openurl booted myapp://post/42` and `adb shell am start -a android.intent.action.VIEW -d "myapp://post/42"` both open the correct detail screen with the right id.
|
|
130
|
+
3. **State wiring:** an action mutates the store and exactly the subscribed components re-render (verify with a render log/devtools); unrelated screens do not. Server data lives in the query cache, not the store.
|
|
131
|
+
4. **Platform divergence resolves:** the `.ios`/`.android` (or adaptive) variant renders the native-looking control on each OS — confirm by screenshot, not assumption.
|
|
132
|
+
5. **Native bridge round-trips:** call the module/channel method on both platforms and get a real value back (not `-1`/`MissingPluginException`); confirm a rebuild was done after the native edit.
|
|
133
|
+
6. **Theming + dark mode:** toggle system appearance on each OS → colors/typography flip via tokens, no hardcoded color survives; no contrast regressions.
|
|
134
|
+
7. **Flavors:** build `dev` and `prod` → distinct bundle id + icon + name, each reading its own env, no committed secret in the bundle.
|
|
135
|
+
8. **Lint/types green:** `tsc --noEmit` + `eslint .` (or `flutter analyze`) pass with zero errors.
|
|
136
|
+
|
|
137
|
+
Done = the app builds and runs on iOS *and* Android, deep links and typed nav resolve on both, state/theming/native-bridge round-trip correctly per platform, and lint + typecheck are green.
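
When a deep link opens the wrong screen in check 2, it helps to reason about how a URL maps onto the route table. A minimal sketch, assuming a hypothetical `matchDeepLink` function (it ignores query strings and route groups such as `(tabs)`, and is not the resolver Expo Router or go_router actually use):

```typescript
// Hypothetical, simplified deep-link matcher; illustrative only.
interface RouteMatch {
  route: string;
  params: Record<string, string>;
}

/** Matches scheme://path against patterns like "/post/[id]"; null if none fit. */
function matchDeepLink(url: string, scheme: string, routes: string[]): RouteMatch | null {
  const prefix = scheme + "://";
  if (url.slice(0, prefix.length) !== prefix) return null; // wrong scheme

  const segments = url.slice(prefix.length).split("/").filter((s) => s !== "");
  for (const route of routes) {
    const pattern = route.split("/").filter((s) => s !== "");
    if (pattern.length !== segments.length) continue;

    const params: Record<string, string> = {};
    let matched = true;
    for (let i = 0; i < pattern.length; i++) {
      const dynamic = /^\[(.+)\]$/.exec(pattern[i]);
      if (dynamic) {
        params[dynamic[1]] = decodeURIComponent(segments[i]); // capture "[id]"
      } else if (pattern[i] !== segments[i]) {
        matched = false; // static segment differs
        break;
      }
    }
    if (matched) return { route, params };
  }
  return null;
}
```

A `null` result corresponds to the failure described under Common Errors: the scheme or route table does not cover the URL, so the platform falls back to the home screen or ignores the link.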
|
|
@@ -0,0 +1,121 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: schema-evolution-compatibility
|
|
3
|
+
description: Evolves shared data contracts (events, API payloads, DB columns, protobuf/avro) without breaking live consumers — additive-only changes with optional+default fields, NEVER remove/rename/repurpose a field or reuse a protobuf tag / avro position (reserve them with `reserved`/aliases instead), backward vs forward vs full compatibility chosen per producer/consumer upgrade order, expand-then-contract (dual-write/dual-read) migrations for renames and type changes, and a schema registry (Confluent/Buf) wired into CI to mechanically reject incompatible diffs before merge. Tolerant reader, unknown-field preservation, and explicit versioning when a true break is unavoidable.
|
|
4
|
+
when_to_use: Changing a schema that something else already reads or writes — adding/removing/renaming a field on a Kafka event, API JSON payload, protobuf/avro/JSON-Schema, or a DB column other services depend on; deciding if a change is safe to deploy and in what order; or wiring registry compat checks into CI. Distinct from design-protobuf-grpc-service (designs the IDL/RPCs from scratch; this evolves an existing one safely) and db-migration-safety (runs the ALTER without locking/downtime; this decides whether the column change breaks readers at all).
|
|
5
|
+
---

## When to Use

Reach for this skill when a contract that *another* process already produces or consumes is changing and you must not break it mid-deploy:

- "Add a field to this Kafka event / API response. Will old consumers still parse it?"
- "Rename / remove / change the type of a field that other services read"
- "Which compatibility mode (backward/forward/full) for this Avro subject?"
- "We reused a protobuf field number and a consumer is reading garbage"
- "Deploy producers or consumers first? What's the safe order?"
- "Wire `buf breaking` / Confluent compat checks into CI so bad diffs get blocked"
- "Migrate a column/field rename with zero downtime across services"

NOT this skill:
- Designing the proto/gRPC service, message shapes, and RPCs from scratch → design-protobuf-grpc-service (this skill *evolves* an IDL that already has live readers)
- Running the `ALTER TABLE` itself without locks/downtime (lock-free index, batched backfill, `NOT VALID` constraints) → db-migration-safety (it makes the DDL safe; this skill decides if the column change breaks consumers)
- Designing the relational schema / normalization / keys → design-relational-schema
- The REST/GraphQL field-type and nullability contract for one endpoint → rest-graphql-contract
- API versioning policy, deprecation headers, pagination contracts → api-design-review / design-api-pagination
- Validating one payload against a schema at the edge (request validation) → build-form-validation / validate-data-quality
- Verifying producer and consumer agree via recorded pacts → contract-testing (it tests the agreement; this skill governs how the schema may change)
- Big phased rewrite/cutover of a whole system → plan-strangler-migration

## Steps

1. **Pick the compatibility mode from your upgrade order; everything else follows from it.** Compatibility is asymmetric and defined by *who reads data written under the other schema*:

   | Mode | Guarantees | Allowed changes | Upgrade first |
   |---|---|---|---|
   | **BACKWARD** | new consumer reads data from old and new producers | **add** optional field (with default), **delete** any field | **consumers** |
   | **FORWARD** | old consumer reads data from new producer | **add** any field, **delete** optional field (one that had a default) | **producers** |
   | **FULL** | both directions | **only** add/remove **optional fields with defaults** | either |
   | **\*_TRANSITIVE** | same as the base mode, but checked against **all** prior versions, not just the last | same as the base mode | same as the base mode |

   Default to **BACKWARD** for events/topics (Confluent's default: consumers lag and replay history, so the new reader must handle old records). Use **FORWARD** when producers ship ahead of consumers. Use **FULL_TRANSITIVE** for long-lived event logs you replay from the beginning. The rule of thumb: **add a field → forward-safe; remove a field → backward-safe; do both safely → only optional+default**.

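The add/remove rules in the table can be checked mechanically. The sketch below is a toy model rather than any registry's actual algorithm: it assumes a schema is just a list of named fields, treats "optional" as "has a default", and ignores type changes entirely. `Field` and `compatible_modes` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    has_default: bool  # "optional" in the table means the field has a default


def compatible_modes(old: list[Field], new: list[Field]) -> set[str]:
    """Return which of BACKWARD / FORWARD / FULL an add/remove-only change satisfies."""
    old_names = {f.name for f in old}
    new_names = {f.name for f in new}
    added = [f for f in new if f.name not in old_names]
    removed = [f for f in old if f.name not in new_names]

    # BACKWARD: a reader on the new schema must fill every added field when it
    # reads old data, so added fields need defaults; removals are harmless.
    backward = all(f.has_default for f in added)
    # FORWARD: a reader on the old schema must fill every removed field when it
    # reads new data, so removed fields need defaults; additions are ignored.
    forward = all(f.has_default for f in removed)

    modes = set()
    if backward:
        modes.add("BACKWARD")
    if forward:
        modes.add("FORWARD")
    if backward and forward:
        modes.add("FULL")
    return modes


v1 = [Field("id", has_default=False), Field("email", has_default=True)]
v2_add_optional = v1 + [Field("locale", has_default=True)]
v2_add_required = v1 + [Field("locale", has_default=False)]
v2_drop_required = [Field("email", has_default=True)]

print(sorted(compatible_modes(v1, v2_add_optional)))   # ['BACKWARD', 'FORWARD', 'FULL']
print(sorted(compatible_modes(v1, v2_add_required)))   # ['FORWARD']
print(sorted(compatible_modes(v1, v2_drop_required)))  # ['BACKWARD']
```
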
2. **Additive-only is the safe default. Every new field is optional with a default, never required.** A new *required* field breaks backward compatibility at once: a new reader cannot parse what old producers are still writing, nor any historical record. Concretely:
   - **JSON / JSON-Schema:** add the key, do NOT add it to `required`, and have consumers apply a default when it is absent. Keep `additionalProperties` permissive (or `unevaluatedProperties` in draft 2019-09 and later) so old readers tolerate fields they don't know.
   - **protobuf (proto3):** every field is already optional; new scalar fields default to `0`/`""`/`false`. Just append with a **fresh field number**. Use `optional` (proto3 explicit presence) when you must distinguish "unset" from "zero".
   - **Avro:** a new field **must** carry a `"default"`, or a new reader cannot resolve old records and the change is not backward-compatible, e.g. `{"name":"x","type":["null","string"],"default":null}`. The same default is what later allows the field to be removed forward-compatibly. This is the #1 Avro footgun.

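Why the default matters is easiest to see by imitating what a reader does with a record that lacks the new field. The following is a minimal sketch of Avro-style resolution on plain dicts, assuming the record has already been decoded; it is not the Avro library's API, and `resolve` plus the field names are hypothetical.

```python
def resolve(record: dict, reader_fields: list[dict]) -> dict:
    """Project a decoded record onto the reader's schema.

    Fields the reader does not declare are dropped, fields the writer did not
    send are filled from the reader's "default", and a missing field without a
    default is an error.
    """
    out = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out


old_record = {"id": 1}  # written before "locale" existed
with_default = [{"name": "id"}, {"name": "locale", "default": None}]
without_default = [{"name": "id"}, {"name": "locale"}]

print(resolve(old_record, with_default))  # {'id': 1, 'locale': None}
try:
    resolve(old_record, without_default)
except ValueError as err:
    print("incompatible:", err)
```
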
3. **NEVER remove, rename, or repurpose a field in place, and NEVER reuse a tag/number/position.** To every consumer a rename is a remove plus an add. Changing a field's *meaning* while keeping its name/number is the worst break because it passes schema checks but silently corrupts data. Reusing an identifier makes old payloads decode into the wrong field. Reserve instead:
   - **protobuf:** when you drop field `7` (name `email`), reserve both so neither the number nor the name can ever be re-added:
     ```proto
     syntax = "proto3";

     message User {
       reserved 7, 9 to 11;            // retired field numbers
       reserved "email", "legacy_id";  // retired field names
       string username = 3;
     }
     ```
   - **Avro:** never reuse a removed field's name; to *rename*, declare the new name with `"aliases": ["old_name"]` so a reader on the new schema can still resolve data written under the old name. Aliases are applied on the reader side, so they do not help old readers facing new data.
   - **JSON:** treat a removed key as permanently retired; never recycle a key name for a different type/meaning.

   A type change (e.g. `int32 → string`, `string → enum`) is **not** additive even if the name stays; it's a remove-and-add. Wire-compatible changes exist in proto (`int32`/`uint32`/`int64`/`uint64`/`bool` are interchangeable on the wire, though values can be truncated when narrowing; `sint*`/`fixed*` are **not** interchangeable with them) but treat them as breaking unless you've verified the exact pair.

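The reserve-don't-reuse rule can be expressed as a small lint over two versions of one message. This is a sketch under simplifying assumptions: each version is reduced to a hypothetical `{field_number: field_name}` map, and types are not inspected. In practice `buf breaking` performs this kind of check; `check_field_numbers` is an illustrative name only.

```python
def check_field_numbers(
    old: dict[int, str],
    new: dict[int, str],
    reserved_numbers: frozenset[int] = frozenset(),
    reserved_names: frozenset[str] = frozenset(),
) -> list[str]:
    """Report removed-but-unreserved fields and reused or renamed numbers."""
    problems = []
    for number, name in old.items():
        if number not in new:
            if number not in reserved_numbers:
                problems.append(f"{number} ({name}): removed but number not reserved")
            if name not in reserved_names:
                problems.append(f"{number} ({name}): removed but name not reserved")
        elif new[number] != name:
            problems.append(f"{number}: reused or renamed in place ({name} -> {new[number]})")
    for number, name in new.items():
        if number in reserved_numbers or name in reserved_names:
            problems.append(f"{number} ({name}): uses a reserved number or name")
    return problems


old_user = {3: "username", 7: "email"}

# Field 7 dropped and both identifiers reserved: clean.
print(check_field_numbers(old_user, {3: "username"}, frozenset({7}), frozenset({"email"})))
# Field 7 recycled for a different field: flagged.
print(check_field_numbers(old_user, {3: "username", 7: "phone"}))
```
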
4. **For a true rename or type change, run expand → migrate → contract (dual-write/dual-read).** You cannot atomically change a field across N independently deployed services. Phase it:

   | Phase | Producer | Consumer | DB column |
   |---|---|---|---|
   | **1 Expand** | write BOTH `old` + `new` | still reads `old` | add `new` col, backfill, dual-write trigger |
   | **2 Migrate** | writes both | switch reads to `new` (fallback to `old`) | no change |
   | **3 Contract** | stop writing `old`; reserve it | reads `new` only | drop `old` col (after grace + replay window) |

   Each phase is independently deployable and rollback-safe. The grace window between expand and contract must exceed your **longest consumer lag + replay/retention window** (e.g. Kafka topic retention) so no in-flight or replayed record still needs the old field. The DB column drop is where db-migration-safety takes over.

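The producer and consumer columns of the phase table might look like this in code. It is a minimal sketch for a hypothetical rename of `name` to `display_name` on a JSON event; the phase strings and function names are assumptions made for the example. The last line shows why contracting early is dangerous: a consumer already in the contract phase cannot read a record written before the expand phase.

```python
def produce(user_name: str, phase: str) -> dict:
    """Producer side; phase is 'expand', 'migrate' or 'contract'."""
    event = {"display_name": user_name}  # the new field is always written
    if phase in ("expand", "migrate"):
        event["name"] = user_name        # keep dual-writing the old field
    return event


def consume(event: dict, phase: str) -> str:
    """Consumer side, following the Consumer column of the table."""
    if phase == "expand":
        return event["name"]             # still reads the old field
    if phase == "migrate":
        # Prefer the new field, fall back to the old one for in-flight
        # or replayed records that predate the expand phase.
        return event.get("display_name", event.get("name"))
    return event["display_name"]         # contract: new field only


legacy_record = {"name": "Ada"}          # written before the expand phase

print(consume(produce("Ada", "expand"), "expand"))   # Ada
print(consume(legacy_record, "migrate"))             # Ada
print("display_name" in legacy_record)               # False: a contract-phase read would fail
```
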
5. **Deploy in the order the compatibility mode dictates; getting this backwards is the classic outage.**
   - **BACKWARD** change (added/removed optional): deploy **consumers first**, then producers. New consumers can read both shapes; once all consumers handle the new shape, flip producers.
   - **FORWARD** change: deploy **producers first**. Old consumers tolerate the new field (they ignore unknowns); then upgrade consumers to use it.
   - **FULL**: either order, but still roll out gradually and watch dead-letter/parse-error metrics during the canary.
   - Never deploy producer and consumer in lockstep assuming atomicity; there is always a window where mixed versions run.

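A rollout script can refuse to proceed when the planned order contradicts the subject's mode. The helper below is a hypothetical sketch of that guard; it only encodes the three rules above and assumes the mode string uses the registry's naming (`BACKWARD`, `FORWARD_TRANSITIVE`, and so on).

```python
from typing import Optional


def upgrade_first(mode: str) -> Optional[str]:
    """Return which side to deploy first, or None when either order is safe."""
    base = mode.removesuffix("_TRANSITIVE")  # transitivity does not change the order
    if base == "BACKWARD":
        return "consumers"
    if base == "FORWARD":
        return "producers"
    if base == "FULL":
        return None
    raise ValueError(f"unknown compatibility mode: {mode}")


def order_is_safe(mode: str, planned_first: str) -> bool:
    """True when deploying `planned_first` first matches the mode's rule."""
    required = upgrade_first(mode)
    return required is None or required == planned_first


print(upgrade_first("BACKWARD_TRANSITIVE"))   # consumers
print(order_is_safe("FORWARD", "consumers"))  # False
```
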
6. **Run a schema registry with mechanical compatibility checks, and gate CI on them.** Humans miss breaks; the registry doesn't.
   - **Confluent Schema Registry** (Avro/Protobuf/JSON-Schema over Kafka): set the per-subject mode and test the candidate before publishing: `curl -X PUT .../config/<subject> -d '{"compatibility":"BACKWARD_TRANSITIVE"}'`, then `POST .../compatibility/subjects/<subject>/versions/latest` returns `{"is_compatible": true|false}` (send `Content-Type: application/vnd.schemaregistry.v1+json`). The Maven plugin's `schema-registry:test-compatibility` goal runs the same check in CI; for Gradle, a community plugin or a scripted REST call serves the same purpose.
   - **protobuf** → **`buf breaking --against '.git#branch=main'`** in CI. `FIELD_NO_DELETE` forbids deleting a field outright, `FIELD_NO_DELETE_UNLESS_NUMBER_RESERVED` / `FIELD_NO_DELETE_UNLESS_NAME_RESERVED` allow it only once the number/name is `reserved`, and `FIELD_SAME_TYPE` and `RESERVED_*` catch the other breaks above. Pair with `buf lint`.
   - **Avro** standalone → check the candidate against each prior schema with the Java library's `SchemaCompatibility.checkReaderWriterCompatibility` (or a tool that wraps it); gate the PR.
   - **JSON-Schema** → `json-schema-diff` / `oasdiff` (for OpenAPI) flag breaking changes.

   Make the check **fail the build**, not warn. The registry's `compatibility` setting per subject is the contract; CI is the enforcement.

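Where no build plugin fits, the Confluent check can be scripted directly against the REST endpoint quoted above. The sketch below uses only the standard library; the base URL, subject name, and function names are placeholders, and the request shape (a JSON body whose `schema` value is the schema serialized as a string) follows the Schema Registry REST API. A CI step would call `run_check("http://localhost:8081", "orders-value", candidate)` and pass the result to `sys.exit`.

```python
import json
import urllib.parse
import urllib.request

CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def build_compat_request(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Build the POST that asks whether `schema` is compatible with the latest version."""
    quoted = urllib.parse.quote(subject, safe="")
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/compatibility/subjects/{quoted}/versions/latest",
        data=body,
        headers={"Content-Type": CONTENT_TYPE},
        method="POST",
    )


def exit_code(response_body: bytes) -> int:
    """Map the registry's answer to a process exit code: non-zero fails the build."""
    compatible = json.loads(response_body).get("is_compatible", False)
    return 0 if compatible else 1


def run_check(base_url: str, subject: str, schema: dict) -> int:
    """Send the request (needs a reachable registry) and return the exit code."""
    with urllib.request.urlopen(build_compat_request(base_url, subject, schema)) as resp:
        return exit_code(resp.read())
```
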
7. **Write consumers as tolerant readers: ignore unknown fields, never hard-fail on them.** Forward compatibility depends on the *reader's* behavior as much as the schema:
   - JSON: don't use a strict/closed deserializer that throws on unknown keys. Jackson → `@JsonIgnoreProperties(ignoreUnknown = true)` / `FAIL_ON_UNKNOWN_PROPERTIES=false`; Go `encoding/json` ignores unknowns by default (avoid `DisallowUnknownFields`); Pydantic → `model_config = ConfigDict(extra="ignore")` (NOT `"forbid"`).
   - **Preserve, don't drop, unknown fields** on a read-modify-write path, or a round-trip through an old service silently deletes data a newer one added. protobuf keeps unknown fields by default in its binary format (proto3 since 3.5); for JSON, capture them (`@JsonAnySetter` plus `@JsonAnyGetter`, or an `additionalProperties` map) and re-emit. This is the subtle one: a "harmless" old service in the middle of a pipeline strips new fields.
   - Always provide a default when a field is absent; don't assume presence.

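The three reader rules fit in a few lines. The following is a sketch with a hypothetical `Order` payload: known keys are parsed, an absent key gets a default, and keys this version does not understand are carried along and written back on the way out.

```python
from dataclasses import dataclass, field

KNOWN_KEYS = ("order_id", "quantity")


@dataclass
class Order:
    order_id: str
    quantity: int = 1                             # default when the key is absent
    unknown: dict = field(default_factory=dict)   # keys this version does not know


def from_payload(payload: dict) -> Order:
    """Tolerant read: never fail on extra keys, keep them for the round-trip."""
    extras = {k: v for k, v in payload.items() if k not in KNOWN_KEYS}
    return Order(
        order_id=payload["order_id"],
        quantity=payload.get("quantity", 1),
        unknown=extras,
    )


def to_payload(order: Order) -> dict:
    """Re-emit unknown keys first so known fields always win on a name clash."""
    return {**order.unknown, "order_id": order.order_id, "quantity": order.quantity}


incoming = {"order_id": "A-1", "quantity": 2, "gift_wrap": True}  # gift_wrap added by a newer producer
order = from_payload(incoming)
order.quantity += 1                                               # read-modify-write
outgoing = to_payload(order)
print(outgoing)  # {'gift_wrap': True, 'order_id': 'A-1', 'quantity': 3}
```
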
8. **When a break is genuinely unavoidable, version explicitly; don't mutate in place.** Some changes (splitting one field into two, restructuring nesting, semantic redefinition) can't be made compatible. Then:
   - **Events:** new schema = **new subject / new topic** (`orders.v2`) or an explicit `schema_version` field; run v1 and v2 in parallel; migrate consumers; retire v1 after the replay window. Never silently change `v1`'s meaning.
   - **APIs:** new path/header version (`/v2`, `Accept: application/vnd.api.v2+json`); deprecate v1 with a sunset header and timeline.

   Versioning is the escape hatch, not the default; additive evolution avoids a version bump in the common case.

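With an explicit `schema_version` field, a consumer can run both versions side by side during the migration. The sketch below assumes a hypothetical break in which v1's single `name` field was split into `first_name` and `last_name` in v2; the parser names and the rule that records without a version are v1 are illustrative choices.

```python
def parse_v1(event: dict) -> dict:
    # v1 carried one free-form "name"; split on the first space.
    first, _, last = event["name"].partition(" ")
    return {"first_name": first, "last_name": last}


def parse_v2(event: dict) -> dict:
    return {"first_name": event["first_name"], "last_name": event["last_name"]}


PARSERS = {1: parse_v1, 2: parse_v2}


def parse(event: dict) -> dict:
    """Dispatch on schema_version; records that predate the field count as v1."""
    version = event.get("schema_version", 1)
    parser = PARSERS.get(version)
    if parser is None:
        # Fail loudly instead of guessing how to read a version we do not know.
        raise ValueError(f"unsupported schema_version: {version}")
    return parser(event)


print(parse({"name": "Ada Lovelace"}))
print(parse({"schema_version": 2, "first_name": "Ada", "last_name": "Lovelace"}))
```
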
## Common Errors

- **Adding a required field.** New readers fail on everything old producers still write and on every historical record. Fix: optional + default, always.
- **Avro field with no `default`.** Adding one breaks backward compatibility; removing one breaks forward compatibility. Fix: give every Avro field you add, or may later remove, an explicit `"default"`.
- **Reusing a protobuf field number (or Avro position).** Old payloads decode into the wrong field: type-confused garbage that can pass schema checks. Fix: `reserved` the number AND the name; only ever append fresh numbers.
- **Renaming a field in place.** To every consumer it's a delete plus an add, all at once. Fix: expand→migrate→contract, or Avro `aliases`.
- **Repurposing a field's meaning while keeping its name.** Passes all mechanical checks, silently corrupts semantics. Fix: new field; reserve the old one.
- **Wrong deploy order for the compat mode.** Backward change with producers-first (or forward with consumers-first) → mixed-version outage. Fix: consumers-first for backward, producers-first for forward.
- **Strict deserializer that throws on unknown fields.** Kills forward compatibility the moment a producer adds a field. Fix: tolerant reader (`ignoreUnknown`, `extra="ignore"`, no `DisallowUnknownFields`).
- **Dropping unknown fields on read-modify-write.** An older service in the pipeline silently erases data newer services added. Fix: preserve and re-emit unknown fields.
- **Treating a type change as free.** `int32→string` or `string→enum` is a break even with the same name, and not every proto integer swap is wire-safe. Fix: verify the exact pair or run expand→contract.
- **No registry / CI gate.** Relying on review to catch breaks. Fix: `buf breaking` / Confluent compat check that **fails the build**.
- **Checking only against the latest version, not all.** A change compatible with v3 but not v1 breaks replay. Fix: `*_TRANSITIVE` mode for replayable logs.
- **Contracting before the replay/retention window passes.** Dropping the old field while replayable records still reference it. Fix: grace window > longest consumer lag + topic retention.

## Verify

1. **Mechanical compat check passes in CI:** `buf breaking` / Confluent `is_compatible:true` / Avro checker runs on the PR diff and **fails the build** on an incompatible change, proven by intentionally introducing a remove/rename and watching CI go red.
2. **Old-schema read of new data, and vice versa:** serialize a record with the new schema, deserialize with the old (forward); serialize with old, read with new (backward). Both succeed, and defaults fill absent fields. This is the literal compatibility definition; test it, don't assume it.
3. **No required field added, every new field has a default:** grep the diff and confirm new fields are optional and defaulted (`"default"` in Avro, not in JSON `required`, appended proto numbers).
4. **Removed fields are reserved:** any dropped proto field has its number AND name in `reserved`; any renamed Avro field has `aliases`; no identifier is reused.
5. **Tolerant reader confirmed:** feed a consumer a payload with an extra unknown field → it parses and ignores it (no exception); on read-modify-write, the unknown field survives the round-trip.
6. **Deploy order documented and rehearsed:** the rollout plan states consumers-first (backward) or producers-first (forward), and a mixed-version canary shows zero parse errors / dead-letters during the window.
7. **Rename via expand→contract, not in place:** the migration is staged (dual-write, switch reads, then drop + reserve) and each phase is independently rollback-safe; the old field is dropped only after the replay window.
8. **Transitive check for replayable logs:** for an event log replayed from offset 0, compat mode is `*_TRANSITIVE` and a candidate is checked against all prior versions, not just latest.

Done = the change is additive (optional + defaulted) or staged through expand→migrate→contract, no field/tag/position is ever removed-without-reserving or repurposed, the compatibility mode matches the deploy order, consumers are tolerant readers that preserve unknowns, and a schema-registry compat check fails CI on any incompatible diff, all proven by the old↔new round-trip and the red-CI test in checks 1–2.