@luanpdd/kit-mcp 1.34.0 → 1.36.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (118)
  1. package/README.md +1 -1
  2. package/bin/cli.js +2 -2
  3. package/bin/mcp.js +6 -6
  4. package/bin/ui.js +74 -74
  5. package/gates/ai-prompt-stability.md +120 -120
  6. package/gates/budget-description.md +68 -68
  7. package/gates/confidence.md +29 -29
  8. package/gates/dependency-check.md +33 -33
  9. package/gates/dept-cycle-prevention.md +179 -179
  10. package/gates/golden-signals-coverage.md +133 -133
  11. package/gates/legacy-refactor-safety.md +178 -178
  12. package/gates/multi-tenant-rls-coverage.md +102 -102
  13. package/gates/no-personal-uuid.md +72 -72
  14. package/gates/obs-agents-mcp-supabase.md +86 -86
  15. package/gates/obs-skills-frontmatter.md +76 -76
  16. package/gates/observability-coverage.md +151 -151
  17. package/gates/omm-no-regression.md +83 -83
  18. package/gates/postmortem-template-required.md +127 -127
  19. package/gates/prr-checklist-coverage.md +128 -128
  20. package/gates/regression.md +32 -32
  21. package/gates/release-pipeline-policy.md +132 -132
  22. package/gates/secrets-scan.md +33 -33
  23. package/gates/service-role-not-in-user-facing.md +113 -113
  24. package/gates/skill-must-include.md +71 -71
  25. package/gates/sync-idempotent.md +62 -62
  26. package/gates/verify-phase-goal.md +34 -34
  27. package/kit/agents/designer-ui.md +216 -216
  28. package/kit/agents/workflow-generator.md +537 -0
  29. package/kit/commands/adicionar-backlog.md +1 -1
  30. package/kit/commands/adicionar-fase.md +1 -1
  31. package/kit/commands/adicionar-tarefa.md +1 -1
  32. package/kit/commands/auditar-observabilidade.md +103 -103
  33. package/kit/commands/auditar-toil.md +129 -129
  34. package/kit/commands/caracterizar-prompt.md +195 -195
  35. package/kit/commands/criar-workflow.md +158 -0
  36. package/kit/commands/definir-perfil.md +1 -1
  37. package/kit/commands/definir-slo.md +108 -108
  38. package/kit/commands/fio.md +1 -1
  39. package/kit/commands/golden-signals.md +142 -142
  40. package/kit/commands/instrumentar-fase.md +200 -200
  41. package/kit/commands/investigar-producao.md +162 -162
  42. package/kit/commands/observabilidade.md +118 -118
  43. package/kit/commands/postmortem.md +179 -179
  44. package/kit/commands/prr.md +205 -205
  45. package/kit/commands/publicar-rapido.md +207 -207
  46. package/kit/commands/risk-budget.md +220 -220
  47. package/kit/commands/sre.md +230 -230
  48. package/kit/file-manifest.json +5 -2
  49. package/kit/framework/references/output-style.md +22 -22
  50. package/kit/hooks/post-apply-migration.js +199 -199
  51. package/kit/hooks/sidecar-tool-publisher.js +210 -210
  52. package/kit/skills/_shared-dados-distribuidos/glossary.md +224 -224
  53. package/kit/skills/_shared-legacy/glossary.md +389 -389
  54. package/kit/skills/_shared-multi-tenant/glossary.md +186 -186
  55. package/kit/skills/_shared-observability/glossary.md +396 -396
  56. package/kit/skills/_shared-sre/glossary.md +712 -712
  57. package/kit/skills/_shared-supabase/glossary.md +234 -234
  58. package/kit/skills/blameless-postmortems/SKILL.md +340 -340
  59. package/kit/skills/burn-rate-alerting/SKILL.md +258 -258
  60. package/kit/skills/cascading-failures/SKILL.md +311 -311
  61. package/kit/skills/core-analysis-loop/SKILL.md +352 -352
  62. package/kit/skills/distributed-tracing/SKILL.md +362 -362
  63. package/kit/skills/dynamic-workflow-authoring/SKILL.md +327 -0
  64. package/kit/skills/eliminating-toil/SKILL.md +243 -243
  65. package/kit/skills/event-based-slos/SKILL.md +296 -296
  66. package/kit/skills/four-golden-signals/SKILL.md +314 -314
  67. package/kit/skills/hermetic-builds/SKILL.md +323 -323
  68. package/kit/skills/legacy-monster-methods/SKILL.md +444 -444
  69. package/kit/skills/llm-as-dependency/SKILL.md +436 -436
  70. package/kit/skills/load-shedding-graceful-degradation/SKILL.md +396 -396
  71. package/kit/skills/observability-driven-development/SKILL.md +315 -315
  72. package/kit/skills/observability-maturity-model/SKILL.md +222 -222
  73. package/kit/skills/opentelemetry-standard/SKILL.md +351 -351
  74. package/kit/skills/production-readiness-review/SKILL.md +305 -305
  75. package/kit/skills/release-engineering/SKILL.md +367 -367
  76. package/kit/skills/retry-strategies/SKILL.md +372 -372
  77. package/kit/skills/sre-risk-management/SKILL.md +221 -221
  78. package/kit/skills/structured-events/SKILL.md +265 -265
  79. package/kit/skills/supabase-cron-queues/SKILL.md +275 -275
  80. package/kit/skills/supabase-database-functions/SKILL.md +332 -332
  81. package/kit/skills/supabase-declarative-schema/SKILL.md +183 -183
  82. package/kit/skills/supabase-pgvector-rag/SKILL.md +253 -253
  83. package/kit/skills/supabase-postgres-style/SKILL.md +138 -138
  84. package/kit/skills/supabase-storage/SKILL.md +234 -234
  85. package/kit/skills/telemetry-pipelines/SKILL.md +259 -259
  86. package/kit/skills/telemetry-sampling/SKILL.md +256 -256
  87. package/kit/skills/ui-anti-padroes-ia/SKILL.md +261 -261
  88. package/kit/skills/ui-contexto-produto/SKILL.md +248 -248
  89. package/kit/skills/ui-cor-estrategia/SKILL.md +213 -213
  90. package/kit/skills/ui-critica-auditoria/SKILL.md +260 -260
  91. package/kit/skills/ui-motion-funcional/SKILL.md +264 -264
  92. package/kit/skills/ui-ritmo-espacial/SKILL.md +259 -259
  93. package/kit/skills/ui-tipografia/SKILL.md +211 -211
  94. package/package.json +1 -1
  95. package/src/cli/index.js +1114 -1114
  96. package/src/cli/render.js +194 -194
  97. package/src/cli/upgrade-check.js +135 -135
  98. package/src/core/error-redaction.js +76 -76
  99. package/src/core/failures.js +153 -153
  100. package/src/core/gate-runner.js +205 -205
  101. package/src/core/gates.js +82 -82
  102. package/src/core/logger.js +170 -170
  103. package/src/core/manifest-verify.js +174 -174
  104. package/src/core/metrics.js +268 -268
  105. package/src/core/notify.js +60 -60
  106. package/src/core/path-safety.js +141 -141
  107. package/src/core/replays.js +120 -120
  108. package/src/core/ui.js +185 -185
  109. package/src/mcp-server/install.js +149 -149
  110. package/src/mcp-server/roots.js +124 -124
  111. package/src/ui/auto-spawn.js +113 -113
  112. package/src/ui/browser.js +78 -78
  113. package/src/ui/client.js +130 -130
  114. package/src/ui/events.js +65 -65
  115. package/src/ui/lockfile.js +191 -191
  116. package/src/ui/port.js +67 -67
  117. package/src/ui/server.js +547 -547
  118. package/src/ui/wrapper.js +129 -129
@@ -1,372 +1,372 @@
1
- ---
2
- name: retry-strategies
3
- description: Use ao implementar retry — full/equal/decorrelated jitter, exponential backoff cap, retry budget, idempotency keys, when NOT to retry. Cap 22 livro Google SRE.
4
- ---
5
-
6
- # SRE — Retry Strategies
7
-
8
- ## Quando usar
9
-
10
- LLM carrega esta skill ao escrever código que chama dep externa e precisa lidar com failure transient. Trigger phrases:
11
-
12
- - "retry", "exponential backoff"
13
- - "jitter", "thundering herd"
14
- - "retry budget"
15
- - "idempotency key", "safe to retry?"
16
- - "quando NÃO retentar?"
17
- - "retry storm"
18
-
19
- ## Regras absolutas
20
-
21
- - **Retry SEM jitter = retry storm garantido.** Sempre adicione jitter (full por default).
22
- - **Retry SEM deadline = work zumbi.** Cada retry respeita deadline propagation; após deadline aborta.
23
- - **Retry SOMENTE em erros retentáveis.** 5xx, timeout, connection reset = retry. 4xx (validation, auth, not_found) = NÃO retry.
24
- - **Idempotency key OBRIGATÓRIA em retry de write operation.** Sem idempotency, retry pode duplicar (charge double, send email twice, etc.).
25
- - **Retry budget global limita amplificação.** Sem budget, retry total = N clients × M retries × cascading.
26
- - **Max retries ≤ 3-5.** Mais que isso = bug não-transient mascarado.
27
- - **Backoff cap obrigatório.** Sem cap, último retry pode ser horas. `min(base × 2^attempt, cap)`.
28
- - **Não retry em rate limit a menos que respeite Retry-After.** 429 sem header → wait default; com header → respeita.
29
-
30
- ## Patterns canônicos
31
-
32
- ### Pattern 1: Tipos de jitter (cap 22)
33
-
34
- ```ts
35
- // Full jitter — DEFAULT canônico (Google SRE recomenda)
36
- function fullJitter(baseMs: number, attempt: number, capMs = 30000): number {
37
- const expBase = Math.min(baseMs * Math.pow(2, attempt), capMs)
38
- return Math.random() * expBase // [0, expBase)
39
- }
40
-
41
- // Equal jitter — variação; metade fixa, metade jitter
42
- function equalJitter(baseMs: number, attempt: number, capMs = 30000): number {
43
- const expBase = Math.min(baseMs * Math.pow(2, attempt), capMs)
44
- return expBase / 2 + Math.random() * (expBase / 2)
45
- }
46
-
47
- // Decorrelated jitter — para bursty load (AWS recomenda)
48
- function decorrelatedJitter(baseMs: number, lastDelayMs: number, capMs = 30000): number {
49
- return Math.min(capMs, Math.random() * (lastDelayMs * 3 - baseMs) + baseMs)
50
- }
51
-
52
- // Comparação:
53
- // FULL JITTER: spread máximo, simples, default canônico
54
- // EQUAL JITTER: spread parcial, predictable mínimo
55
- // DECORRELATED: spread depende do último; melhor pra long outages
56
- ```
57
-
58
- ### Pattern 2: Retry com deadline propagation
59
-
60
- ```ts
61
- async function callWithRetry<T>(
62
- call: () => Promise<T>,
63
- opts: {
64
- maxRetries: number
65
- baseMs: number
66
- capMs?: number
67
- deadlineMs: number // unix ms; retry só se ainda há tempo
68
- isRetryable?: (e: Error) => boolean
69
- retryBudget?: RetryBudget
70
- }
71
- ): Promise<T> {
72
- const startMs = performance.now()
73
- let lastError: Error | undefined
74
-
75
- for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
76
- // Check deadline antes de cada attempt
77
- if (Date.now() > opts.deadlineMs) {
78
- throw new DeadlineExceededError(lastError)
79
- }
80
-
81
- try {
82
- return await call()
83
- } catch (e) {
84
- lastError = e as Error
85
-
86
- // Não-retentável → throw imediato
87
- if (opts.isRetryable && !opts.isRetryable(lastError)) throw lastError
88
-
89
- // Retry budget global
90
- if (opts.retryBudget && !opts.retryBudget.tryAcquire()) {
91
- throw new RetryBudgetExhaustedError(lastError)
92
- }
93
-
94
- // Last attempt — não delay
95
- if (attempt >= opts.maxRetries) throw lastError
96
-
97
- // Calcula delay com full jitter
98
- const delayMs = fullJitter(opts.baseMs, attempt, opts.capMs)
99
-
100
- // Não exceder deadline
101
- const remainingMs = opts.deadlineMs - Date.now()
102
- if (delayMs >= remainingMs) {
103
- throw new DeadlineExceededError(lastError)
104
- }
105
-
106
- await sleep(delayMs)
107
- }
108
- }
109
- throw lastError!
110
- }
111
- ```
112
-
113
- ### Pattern 3: When NOT to retry — decision tree
114
-
115
- ```text
116
- Erro recebido. Retry?
117
-
118
- 1. HTTP status 4xx (excluding 408, 429)?
119
- → NÃO retry. Validation/auth/not_found.
120
- 400, 401, 403, 404, 422 → throw imediato.
121
-
122
- 2. HTTP status 408 (Request Timeout)?
123
- → SIM retry. Servidor pediu (request expirou no servidor).
124
-
125
- 3. HTTP status 429 (Too Many Requests)?
126
- → SIM retry, COM Retry-After header se presente.
127
- Sem Retry-After → backoff default + retry budget.
128
-
129
- 4. HTTP status 5xx?
130
- → SIM retry. Server error transient.
131
-
132
- 5. Network error (connection reset, DNS failure)?
133
- → SIM retry. Network transient.
134
-
135
- 6. Timeout local (AbortSignal.timeout estourou)?
136
- → SIM retry, mas check deadline global.
137
-
138
- 7. Custom error (validation interna, business rule)?
139
- → NÃO retry. Bug; retry não resolve.
140
-
141
- 8. OperationCancelled (deadline upstream)?
142
- → NÃO retry. Caller já desistiu.
143
- ```
144
-
145
- ```ts
146
- function isRetryable(e: any): boolean {
147
- // 4xx geralmente não retry
148
- if (e.statusCode >= 400 && e.statusCode < 500) {
149
- if (e.statusCode === 408) return true
150
- if (e.statusCode === 429) return true
151
- return false
152
- }
153
-
154
- // 5xx retry
155
- if (e.statusCode >= 500 && e.statusCode < 600) return true
156
-
157
- // Network errors
158
- if (e.code === 'ECONNRESET' || e.code === 'ETIMEDOUT' || e.code === 'EAI_AGAIN') return true
159
-
160
- // Aborts não retry (caller desistiu)
161
- if (e.name === 'AbortError' || e instanceof DeadlineExceededError) return false
162
-
163
- // Default: não retry
164
- return false
165
- }
166
- ```
167
-
168
- ### Pattern 4: Idempotency key em writes
169
-
170
- ```ts
171
- // PT-BR: idempotency key permite retry seguro de writes
172
- import { randomUUID } from 'crypto'
173
-
174
- interface CreateOrderInput {
175
- customerId: string
176
- items: OrderItem[]
177
- idempotencyKey?: string // gerado se não fornecido
178
- }
179
-
180
- async function createOrderSafe(client: PaymentClient, input: CreateOrderInput): Promise<Order> {
181
- const key = input.idempotencyKey ?? randomUUID()
182
-
183
- return callWithRetry(
184
- () => client.createOrder({ ...input, idempotencyKey: key }), // ← SAME KEY em retry
185
- {
186
- maxRetries: 3,
187
- baseMs: 500,
188
- capMs: 30000,
189
- deadlineMs: Date.now() + 30000,
190
- isRetryable,
191
- }
192
- )
193
- }
194
-
195
- // Server-side
196
- async function createOrderHandler(input: CreateOrderInput): Promise<Order> {
197
- // Check se já processamos esta key
198
- const existing = await db.findByIdempotencyKey(input.idempotencyKey)
199
- if (existing) return existing // retorna mesmo result; safe to retry
200
-
201
- // Process; record com idempotency_key (UNIQUE constraint)
202
- return await db.transaction(async (tx) => {
203
- const order = await tx.orders.insert({
204
- ...input,
205
- idempotency_key: input.idempotencyKey, // UNIQUE constraint catches duplicate
206
- created_at: new Date(),
207
- })
208
- return order
209
- })
210
- }
211
- ```
212
-
213
- **Anti-corruption:** idempotency keys SEMPRE em writes. Stripe, AWS S3, todos suportam. Se sua API não suporta, adicione (1 column UNIQUE).
214
-
215
- ### Pattern 5: Retry budget (cap 22)
216
-
217
- ```ts
218
- // PT-BR: retry budget global limita amplificação
219
- class RetryBudget {
220
- private tokens: number
221
- private readonly maxTokens: number
222
- private readonly refillRate: number // tokens per second
223
- private lastRefillMs: number
224
-
225
- constructor(opts: { maxTokens: number; refillPerSec: number }) {
226
- this.tokens = opts.maxTokens
227
- this.maxTokens = opts.maxTokens
228
- this.refillRate = opts.refillPerSec
229
- this.lastRefillMs = Date.now()
230
- }
231
-
232
- tryAcquire(): boolean {
233
- this.refill()
234
- if (this.tokens < 1) return false
235
- this.tokens--
236
- return true
237
- }
238
-
239
- private refill(): void {
240
- const now = Date.now()
241
- const elapsed = (now - this.lastRefillMs) / 1000
242
- const refill = elapsed * this.refillRate
243
- this.tokens = Math.min(this.maxTokens, this.tokens + refill)
244
- this.lastRefillMs = now
245
- }
246
- }
247
-
248
- // Configuração canônica:
249
- // - maxTokens = 10% da capacidade normal de calls/sec
250
- // - refillPerSec = mesmo
251
- // Se sua dep aguenta 1000 RPS, retry budget = 100 RPS de retries.
252
- // Excede = circuit breaker abre OR caller falha rápido.
253
-
254
- const retryBudget = new RetryBudget({ maxTokens: 100, refillPerSec: 100 })
255
- ```
256
-
257
- ### Pattern 6: Configuração canônica por tipo de call
258
-
259
- | Tipo de call | Max retries | Base | Cap | Jitter |
260
- |---|---|---|---|---|
261
- | **Read DB query** | 3 | 50ms | 1000ms | full |
262
- | **Write DB transaction** | 3 (com idempotency key) | 100ms | 5000ms | full |
263
- | **HTTP API third-party** | 3 | 500ms | 30000ms | full |
264
- | **Webhook delivery** | 5 | 1000ms | 60000ms | decorrelated |
265
- | **Background job** | 5+ | 5000ms | 600000ms (10min) | decorrelated |
266
- | **Real-time message** | 0-1 | — | — | — (deadline tight) |
267
-
268
- ### Pattern 7: Observability de retry
269
-
270
- Métricas a instrumentar:
271
-
272
- ```ts
273
- // Counter de retries por (dep, attempt)
274
- metrics.counter('retries_total', { dep: 'stripe', attempt: '1' }) // attempt 1, 2, 3
275
-
276
- // Counter de outcomes finais
277
- metrics.counter('retry_outcomes_total', { dep: 'stripe', outcome: 'success_after_retry' })
278
- metrics.counter('retry_outcomes_total', { dep: 'stripe', outcome: 'exhausted_max' })
279
- metrics.counter('retry_outcomes_total', { dep: 'stripe', outcome: 'budget_exhausted' })
280
- metrics.counter('retry_outcomes_total', { dep: 'stripe', outcome: 'deadline_exceeded' })
281
-
282
- // Histogram de delay total adicionado por retry
283
- metrics.histogram('retry_delay_added_ms', delayMs, { dep: 'stripe' })
284
- ```
285
-
286
- Alertas:
287
- - `rate(retries_total) > 10% × rate(requests_total)` → dep degradada; investigate
288
- - `retry_outcomes{outcome="budget_exhausted"} > 0` → sistema sob storm; load shedding pode ajudar
289
- - `histogram retry_delay_added_ms > p99 baseline × 5` → delays inflados; deps lentas
290
-
291
- ## Anti-patterns
292
-
293
- ### ANTI: retry sem jitter
294
-
295
- ```text
296
- ANTI: setTimeout(call, 1000 * 2^attempt) — fixed exponential.
297
-
298
- PROBLEMA: 1000 clients sincroniza retries. Storm na recovery.
299
-
300
- CERTO: Math.random() * 1000 * 2^attempt — full jitter.
301
- ```
302
-
303
- ### ANTI: retry em 4xx
304
-
305
- ```text
306
- ANTI: catch (e) { return retry(call) } sem checar status.
307
-
308
- PROBLEMA: 400/422/404 retentado infinitamente. Bug não corrige sozinho.
309
-
310
- CERTO: isRetryable(e) check antes de retry. 4xx (excluding 408, 429)
311
- throw imediato.
312
- ```
313
-
314
- ### ANTI: retry sem deadline
315
-
316
- ```text
317
- ANTI: retry com max=5, base=1s. Worst case: 1+2+4+8+16=31s de delays.
318
- Plus call time. Cliente já desistiu.
319
-
320
- PROBLEMA: work zumbi. Recursos consumidos sem benefit.
321
-
322
- CERTO: deadline propagation. Cada attempt checa Date.now() vs deadline.
323
- Aborta cedo.
324
- ```
325
-
326
- ### ANTI: idempotency key per-attempt
327
-
328
- ```text
329
- ANTI: retry gera NEW idempotency key cada attempt.
330
-
331
- PROBLEMA: cada attempt vira write distinta. Charge double, email
332
- duplicado.
333
-
334
- CERTO: idempotency key gerada UMA vez por call lógica. Mesma key
335
- em todos os attempts.
336
- ```
337
-
338
- ### ANTI: max retries muito alto
339
-
340
- ```text
341
- ANTI: maxRetries = 20.
342
-
343
- PROBLEMA: bug não-transient mascarado. Erro real demora pra aparecer.
344
- Logs de retry inundam observability.
345
-
346
- CERTO: max 3-5. Mais que isso = error real, não transient. Falha
347
- rápido + alert > retry esperançoso.
348
- ```
349
-
350
- ## Verificação
351
-
352
- 1. Toda retry tem jitter (full por default)
353
- 2. Toda retry respeita deadline propagation
354
- 3. isRetryable() check em cada attempt
355
- 4. Idempotency key em writes
356
- 5. Retry budget global ativo
357
- 6. Max retries ≤ 5
358
- 7. Backoff cap ≤ 60s para user-facing
359
- 8. Métricas instrumentadas (counter retries, outcome, histogram delay)
360
-
361
- ---
362
-
363
- ## Ver também
364
-
365
- - [`_shared-sre/glossary.md`](../_shared-sre/glossary.md) — vocabulário (jitter types, retry storm, etc.)
366
- - [`cascading-failures`](../cascading-failures/SKILL.md) (v1.11) — retry sem jitter é trigger principal de cascade
367
- - [`load-shedding-graceful-degradation`](../load-shedding-graceful-degradation/SKILL.md) (v1.11) — server-side coopera (503 + Retry-After)
368
- - [`four-golden-signals`](../four-golden-signals/SKILL.md) (v1.10) — métricas de retry instrumentadas seguindo padrão
369
- - [`supabase-edge-fn-writer`](../../agents/supabase-edge-fn-writer.md) (v1.8 + patch v1.11) — Edge Functions ganham retry-with-jitter built-in
370
- - [`cascading-failures-auditor`](../../agents/cascading-failures-auditor.md) (v1.11) — agent detecta retry sem jitter
371
-
372
- *Material-fonte: Site Reliability Engineering — Beyer/Jones/Petoff/Murphy (Google/O'Reilly, 2016) — Cap 22 (subsections sobre retry, jitter, deadline propagation).*
1
+ ---
2
+ name: retry-strategies
3
+ description: Use ao implementar retry — full/equal/decorrelated jitter, exponential backoff cap, retry budget, idempotency keys, when NOT to retry. Cap 22 livro Google SRE.
4
+ ---
5
+
6
+ # SRE — Retry Strategies
7
+
8
+ ## Quando usar
9
+
10
+ LLM carrega esta skill ao escrever código que chama dep externa e precisa lidar com failure transient. Trigger phrases:
11
+
12
+ - "retry", "exponential backoff"
13
+ - "jitter", "thundering herd"
14
+ - "retry budget"
15
+ - "idempotency key", "safe to retry?"
16
+ - "quando NÃO retentar?"
17
+ - "retry storm"
18
+
19
+ ## Regras absolutas
20
+
21
+ - **Retry SEM jitter = retry storm garantido.** Sempre adicione jitter (full por default).
22
+ - **Retry SEM deadline = work zumbi.** Cada retry respeita deadline propagation; após deadline aborta.
23
+ - **Retry SOMENTE em erros retentáveis.** 5xx, timeout, connection reset = retry. 4xx (validation, auth, not_found) = NÃO retry.
24
+ - **Idempotency key OBRIGATÓRIA em retry de write operation.** Sem idempotency, retry pode duplicar (charge double, send email twice, etc.).
25
+ - **Retry budget global limita amplificação.** Sem budget, retry total = N clients × M retries × cascading.
26
+ - **Max retries ≤ 3-5.** Mais que isso = bug não-transient mascarado.
27
+ - **Backoff cap obrigatório.** Sem cap, último retry pode ser horas. `min(base × 2^attempt, cap)`.
28
+ - **Não retry em rate limit a menos que respeite Retry-After.** 429 sem header → wait default; com header → respeita.
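Note on the amplification rule above ("N clients × M retries × cascading"): the arithmetic is easier to see with numbers. The sketch below assumes every layer in a call chain retries independently; `worstCaseAttempts` is a hypothetical helper used only for illustration, not something the skill defines.

```ts
// Hypothetical helper: worst-case number of attempts that reach the deepest
// dependency when every layer in a call chain retries independently.
// Each layer makes (1 + maxRetries) attempts, and the factors multiply.
function worstCaseAttempts(maxRetriesPerLayer: number[]): number {
  return maxRetriesPerLayer.reduce((total, retries) => total * (1 + retries), 1)
}

// Three layers, each allowing 3 retries: 4 * 4 * 4 attempts at the bottom.
const attemptsAtBottom = worstCaseAttempts([3, 3, 3])
console.log(attemptsAtBottom) // 64
```

A single user request can therefore turn into dozens of calls against a dependency that is already struggling, which is the case a global retry budget is meant to bound.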
29
+
30
+ ## Patterns canônicos
31
+
32
+ ### Pattern 1: Tipos de jitter (cap 22)
33
+
34
+ ```ts
35
+ // Full jitter — DEFAULT canônico (Google SRE recomenda)
36
+ function fullJitter(baseMs: number, attempt: number, capMs = 30000): number {
37
+ const expBase = Math.min(baseMs * Math.pow(2, attempt), capMs)
38
+ return Math.random() * expBase // [0, expBase)
39
+ }
40
+
41
+ // Equal jitter — variação; metade fixa, metade jitter
42
+ function equalJitter(baseMs: number, attempt: number, capMs = 30000): number {
43
+ const expBase = Math.min(baseMs * Math.pow(2, attempt), capMs)
44
+ return expBase / 2 + Math.random() * (expBase / 2)
45
+ }
46
+
47
+ // Decorrelated jitter — para bursty load (AWS recomenda)
48
+ function decorrelatedJitter(baseMs: number, lastDelayMs: number, capMs = 30000): number {
49
+ return Math.min(capMs, Math.random() * (lastDelayMs * 3 - baseMs) + baseMs)
50
+ }
51
+
52
+ // Comparação:
53
+ // FULL JITTER: spread máximo, simples, default canônico
54
+ // EQUAL JITTER: spread parcial, predictable mínimo
55
+ // DECORRELATED: spread depende do último; melhor pra long outages
56
+ ```
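Note on Pattern 1: unlike full and equal jitter, decorrelated jitter depends on the previous delay, so the caller has to carry that value from one attempt to the next. A minimal sketch of that bookkeeping follows. The function names and the injectable `rand` parameter are assumptions made here for testability, and the upper bound is clamped so it never falls below `baseMs`.

```ts
// Sketch: each decorrelated delay is drawn from [baseMs, previousDelayMs * 3)
// and then clamped to capMs. `rand` is injectable so the sequence can be made
// deterministic in tests (an assumption of this sketch).
function nextDecorrelatedDelay(
  baseMs: number,
  previousDelayMs: number,
  capMs: number,
  rand: () => number = Math.random,
): number {
  const upperMs = Math.max(baseMs, previousDelayMs * 3)
  return Math.min(capMs, baseMs + rand() * (upperMs - baseMs))
}

// Builds the delay sequence for `attempts` retries, threading the previous
// delay through each step. The sequence is seeded with baseMs.
function decorrelatedSchedule(
  baseMs: number,
  capMs: number,
  attempts: number,
  rand: () => number = Math.random,
): number[] {
  const delays: number[] = []
  let previousMs = baseMs
  for (let i = 0; i < attempts; i++) {
    previousMs = nextDecorrelatedDelay(baseMs, previousMs, capMs, rand)
    delays.push(previousMs)
  }
  return delays
}
```

In a retry loop the same idea applies: keep `previousMs` in a local variable next to the attempt counter and pass it in on every iteration.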
57
+
58
+ ### Pattern 2: Retry com deadline propagation
59
+
60
+ ```ts
61
+ async function callWithRetry<T>(
62
+ call: () => Promise<T>,
63
+ opts: {
64
+ maxRetries: number
65
+ baseMs: number
66
+ capMs?: number
67
+ deadlineMs: number // unix ms; retry só se ainda há tempo
68
+ isRetryable?: (e: Error) => boolean
69
+ retryBudget?: RetryBudget
70
+ }
71
+ ): Promise<T> {
72
+ const startMs = performance.now()
73
+ let lastError: Error | undefined
74
+
75
+ for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
76
+ // Check deadline antes de cada attempt
77
+ if (Date.now() > opts.deadlineMs) {
78
+ throw new DeadlineExceededError(lastError)
79
+ }
80
+
81
+ try {
82
+ return await call()
83
+ } catch (e) {
84
+ lastError = e as Error
85
+
86
+ // Não-retentável → throw imediato
87
+ if (opts.isRetryable && !opts.isRetryable(lastError)) throw lastError
88
+
89
+ // Retry budget global
90
+ if (opts.retryBudget && !opts.retryBudget.tryAcquire()) {
91
+ throw new RetryBudgetExhaustedError(lastError)
92
+ }
93
+
94
+ // Last attempt — não delay
95
+ if (attempt >= opts.maxRetries) throw lastError
96
+
97
+ // Calcula delay com full jitter
98
+ const delayMs = fullJitter(opts.baseMs, attempt, opts.capMs)
99
+
100
+ // Não exceder deadline
101
+ const remainingMs = opts.deadlineMs - Date.now()
102
+ if (delayMs >= remainingMs) {
103
+ throw new DeadlineExceededError(lastError)
104
+ }
105
+
106
+ await sleep(delayMs)
107
+ }
108
+ }
109
+ throw lastError!
110
+ }
111
+ ```
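Note on Pattern 2: `callWithRetry` relies on `sleep`, `DeadlineExceededError`, and `RetryBudgetExhaustedError`, none of which are defined in the skill. One possible shape for them is sketched below. The `RetryAbortedError` base class and the error messages are assumptions, chosen so that both errors keep the last underlying failure for diagnostics.

```ts
// Hypothetical helper definitions assumed by callWithRetry above.

// Resolves after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

// Shared base class: remembers the last underlying error, if any.
class RetryAbortedError extends Error {
  readonly lastError?: Error

  constructor(message: string, lastError?: Error) {
    super(message)
    this.name = new.target.name
    this.lastError = lastError
    // Restore the prototype chain so `instanceof` also works on ES5 targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

// Thrown when the deadline passes before the call succeeds.
class DeadlineExceededError extends RetryAbortedError {
  constructor(lastError?: Error) {
    super('deadline exceeded before the call succeeded', lastError)
  }
}

// Thrown when the shared retry budget has no tokens left.
class RetryBudgetExhaustedError extends RetryAbortedError {
  constructor(lastError?: Error) {
    super('retry budget exhausted', lastError)
  }
}
```

Keeping `lastError` on the wrapper lets callers log the root cause while still branching on the error type, for example to emit the `deadline_exceeded` and `budget_exhausted` outcomes from Pattern 7.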
112
+
113
+ ### Pattern 3: When NOT to retry — decision tree
114
+
115
+ ```text
116
+ Erro recebido. Retry?
117
+
118
+ 1. HTTP status 4xx (excluding 408, 429)?
119
+ → NÃO retry. Validation/auth/not_found.
120
+ 400, 401, 403, 404, 422 → throw imediato.
121
+
122
+ 2. HTTP status 408 (Request Timeout)?
123
+ → SIM retry. Servidor pediu (request expirou no servidor).
124
+
125
+ 3. HTTP status 429 (Too Many Requests)?
126
+ → SIM retry, COM Retry-After header se presente.
127
+ Sem Retry-After → backoff default + retry budget.
128
+
129
+ 4. HTTP status 5xx?
130
+ → SIM retry. Server error transient.
131
+
132
+ 5. Network error (connection reset, DNS failure)?
133
+ → SIM retry. Network transient.
134
+
135
+ 6. Timeout local (AbortSignal.timeout estourou)?
136
+ → SIM retry, mas check deadline global.
137
+
138
+ 7. Custom error (validation interna, business rule)?
139
+ → NÃO retry. Bug; retry não resolve.
140
+
141
+ 8. OperationCancelled (deadline upstream)?
142
+ → NÃO retry. Caller já desistiu.
143
+ ```
144
+
145
+ ```ts
146
+ function isRetryable(e: any): boolean {
147
+ // 4xx geralmente não retry
148
+ if (e.statusCode >= 400 && e.statusCode < 500) {
149
+ if (e.statusCode === 408) return true
150
+ if (e.statusCode === 429) return true
151
+ return false
152
+ }
153
+
154
+ // 5xx retry
155
+ if (e.statusCode >= 500 && e.statusCode < 600) return true
156
+
157
+ // Network errors
158
+ if (e.code === 'ECONNRESET' || e.code === 'ETIMEDOUT' || e.code === 'EAI_AGAIN') return true
159
+
160
+ // Aborts não retry (caller desistiu)
161
+ if (e.name === 'AbortError' || e instanceof DeadlineExceededError) return false
162
+
163
+ // Default: não retry
164
+ return false
165
+ }
166
+ ```
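Note on Pattern 3: the decision tree says a 429 should honor `Retry-After`, but `isRetryable` only returns a boolean and never reads the header. A minimal sketch of the missing piece follows, assuming the caller can get at the raw header string. `retryAfterMs` and `delayFor429` are hypothetical names. The header value is either a number of seconds or an HTTP-date.

```ts
// Sketch: convert a Retry-After header value into a delay in milliseconds.
// Returns undefined when the header is missing or unparseable, so the caller
// can fall back to its default backoff.
function retryAfterMs(
  headerValue: string | null | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (headerValue == null) return undefined
  const value = headerValue.trim()
  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(value)) return Number(value) * 1000
  // Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(value)
  if (Number.isNaN(dateMs)) return undefined
  return Math.max(0, dateMs - nowMs) // a date in the past means "retry now"
}

// Hypothetical combination: wait at least as long as the server asked for,
// otherwise use the locally computed backoff.
function delayFor429(headerValue: string | null | undefined, backoffMs: number): number {
  return Math.max(backoffMs, retryAfterMs(headerValue) ?? 0)
}
```

The resulting delay should still be compared against the remaining deadline, as `callWithRetry` does for its own backoff. If the server asks for more time than the caller has left, failing fast is the consistent choice.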
167
+
168
+ ### Pattern 4: Idempotency key em writes
169
+
170
+ ```ts
171
+ // PT-BR: idempotency key permite retry seguro de writes
172
+ import { randomUUID } from 'crypto'
173
+
174
+ interface CreateOrderInput {
175
+ customerId: string
176
+ items: OrderItem[]
177
+ idempotencyKey?: string // gerado se não fornecido
178
+ }
179
+
180
+ async function createOrderSafe(client: PaymentClient, input: CreateOrderInput): Promise<Order> {
181
+ const key = input.idempotencyKey ?? randomUUID()
182
+
183
+ return callWithRetry(
184
+ () => client.createOrder({ ...input, idempotencyKey: key }), // ← SAME KEY em retry
185
+ {
186
+ maxRetries: 3,
187
+ baseMs: 500,
188
+ capMs: 30000,
189
+ deadlineMs: Date.now() + 30000,
190
+ isRetryable,
191
+ }
192
+ )
193
+ }
194
+
195
+ // Server-side
196
+ async function createOrderHandler(input: CreateOrderInput): Promise<Order> {
197
+ // Check se já processamos esta key
198
+ const existing = await db.findByIdempotencyKey(input.idempotencyKey)
199
+ if (existing) return existing // retorna mesmo result; safe to retry
200
+
201
+ // Process; record com idempotency_key (UNIQUE constraint)
202
+ return await db.transaction(async (tx) => {
203
+ const order = await tx.orders.insert({
204
+ ...input,
205
+ idempotency_key: input.idempotencyKey, // UNIQUE constraint catches duplicate
206
+ created_at: new Date(),
207
+ })
208
+ return order
209
+ })
210
+ }
211
+ ```
212
+
213
+ **Anti-corruption:** idempotency keys SEMPRE em writes. Stripe, AWS S3, todos suportam. Se sua API não suporta, adicione (1 column UNIQUE).
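Note on Pattern 4: the server-side handler checks for an existing key and then inserts, so two concurrent attempts with the same key can both pass the check. The UNIQUE constraint stops the second insert, but the handler as written would surface that as an error. A minimal sketch of the insert-first variant follows, using an in-memory stand-in for the table. Every name here is hypothetical, and the `'23505'` code mirrors the SQLSTATE PostgreSQL reports for a unique violation.

```ts
interface StoredOrder {
  id: number
  idempotencyKey: string
  customerId: string
}

// Stand-in for the error a database driver raises on a UNIQUE violation.
class UniqueViolationError extends Error {
  readonly code = '23505'
}

// In-memory stand-in for an orders table with UNIQUE(idempotency_key).
class InMemoryOrders {
  private readonly byKey = new Map<string, StoredOrder>()
  private nextId = 1

  insert(idempotencyKey: string, customerId: string): StoredOrder {
    if (this.byKey.has(idempotencyKey)) throw new UniqueViolationError(idempotencyKey)
    const order: StoredOrder = { id: this.nextId++, idempotencyKey, customerId }
    this.byKey.set(idempotencyKey, order)
    return order
  }

  find(idempotencyKey: string): StoredOrder | undefined {
    return this.byKey.get(idempotencyKey)
  }
}

function isUniqueViolation(e: unknown): boolean {
  return typeof e === 'object' && e !== null && (e as { code?: unknown }).code === '23505'
}

// Insert first. If the constraint fires, another attempt already created the
// row, so return that row instead of failing the retry.
function createOrderIdempotent(
  orders: InMemoryOrders,
  idempotencyKey: string,
  customerId: string,
): StoredOrder {
  try {
    return orders.insert(idempotencyKey, customerId)
  } catch (e) {
    if (!isUniqueViolation(e)) throw e
    const existing = orders.find(idempotencyKey)
    if (!existing) throw e
    return existing
  }
}
```

With this shape the constraint, not the preliminary lookup, is what guarantees a single write per key. The lookup becomes an optimization for the common retry case.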
214
+
215
+ ### Pattern 5: Retry budget (cap 22)
216
+
217
+ ```ts
218
+ // PT-BR: retry budget global limita amplificação
219
+ class RetryBudget {
220
+ private tokens: number
221
+ private readonly maxTokens: number
222
+ private readonly refillRate: number // tokens per second
223
+ private lastRefillMs: number
224
+
225
+ constructor(opts: { maxTokens: number; refillPerSec: number }) {
226
+ this.tokens = opts.maxTokens
227
+ this.maxTokens = opts.maxTokens
228
+ this.refillRate = opts.refillPerSec
229
+ this.lastRefillMs = Date.now()
230
+ }
231
+
232
+ tryAcquire(): boolean {
233
+ this.refill()
234
+ if (this.tokens < 1) return false
235
+ this.tokens--
236
+ return true
237
+ }
238
+
239
+ private refill(): void {
240
+ const now = Date.now()
241
+ const elapsed = (now - this.lastRefillMs) / 1000
242
+ const refill = elapsed * this.refillRate
243
+ this.tokens = Math.min(this.maxTokens, this.tokens + refill)
244
+ this.lastRefillMs = now
245
+ }
246
+ }
247
+
248
+ // Configuração canônica:
249
+ // - maxTokens = 10% da capacidade normal de calls/sec
250
+ // - refillPerSec = mesmo
251
+ // Se sua dep aguenta 1000 RPS, retry budget = 100 RPS de retries.
252
+ // Excede = circuit breaker abre OR caller falha rápido.
253
+
254
+ const retryBudget = new RetryBudget({ maxTokens: 100, refillPerSec: 100 })
255
+ ```
256
+
257
+ ### Pattern 6: Configuração canônica por tipo de call
258
+
259
+ | Tipo de call | Max retries | Base | Cap | Jitter |
260
+ |---|---|---|---|---|
261
+ | **Read DB query** | 3 | 50ms | 1000ms | full |
262
+ | **Write DB transaction** | 3 (com idempotency key) | 100ms | 5000ms | full |
263
+ | **HTTP API third-party** | 3 | 500ms | 30000ms | full |
264
+ | **Webhook delivery** | 5 | 1000ms | 60000ms | decorrelated |
265
+ | **Background job** | 5+ | 5000ms | 600000ms (10min) | decorrelated |
266
+ | **Real-time message** | 0-1 | — | — | — (deadline tight) |
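Note on Pattern 6: the table can be carried into code as a typed lookup so call sites pick a profile by name instead of repeating numbers. This is a sketch only. The key names and the `RetryProfile` shape are assumptions, "5+" for background jobs is recorded as its minimum of 5, and the real-time row is left out because the table gives it no base or cap.

```ts
type JitterKind = 'full' | 'decorrelated'

interface RetryProfile {
  maxRetries: number
  baseMs: number
  capMs: number
  jitter: JitterKind
}

type CallKind =
  | 'readDbQuery'
  | 'writeDbTransaction'
  | 'httpThirdParty'
  | 'webhookDelivery'
  | 'backgroundJob'

// Values copied from the table above; key names are hypothetical.
const RETRY_PROFILES: Record<CallKind, RetryProfile> = {
  readDbQuery: { maxRetries: 3, baseMs: 50, capMs: 1000, jitter: 'full' },
  writeDbTransaction: { maxRetries: 3, baseMs: 100, capMs: 5000, jitter: 'full' },
  httpThirdParty: { maxRetries: 3, baseMs: 500, capMs: 30000, jitter: 'full' },
  webhookDelivery: { maxRetries: 5, baseMs: 1000, capMs: 60000, jitter: 'decorrelated' },
  backgroundJob: { maxRetries: 5, baseMs: 5000, capMs: 600000, jitter: 'decorrelated' },
}
```

A call site could then spread a profile into the options of `callWithRetry`, adding only the per-request `deadlineMs`.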
267
+
268
+ ### Pattern 7: Observability de retry
269
+
270
+ Métricas a instrumentar:
271
+
272
+ ```ts
273
+ // Counter de retries por (dep, attempt)
274
+ metrics.counter('retries_total', { dep: 'stripe', attempt: '1' }) // attempt 1, 2, 3
275
+
276
+ // Counter de outcomes finais
277
+ metrics.counter('retry_outcomes_total', { dep: 'stripe', outcome: 'success_after_retry' })
278
+ metrics.counter('retry_outcomes_total', { dep: 'stripe', outcome: 'exhausted_max' })
279
+ metrics.counter('retry_outcomes_total', { dep: 'stripe', outcome: 'budget_exhausted' })
280
+ metrics.counter('retry_outcomes_total', { dep: 'stripe', outcome: 'deadline_exceeded' })
281
+
282
+ // Histogram de delay total adicionado por retry
283
+ metrics.histogram('retry_delay_added_ms', delayMs, { dep: 'stripe' })
284
+ ```
285
+
286
+ Alertas:
287
+ - `rate(retries_total) > 10% × rate(requests_total)` → dep degradada; investigate
288
+ - `retry_outcomes{outcome="budget_exhausted"} > 0` → sistema sob storm; load shedding pode ajudar
289
+ - `histogram retry_delay_added_ms > p99 baseline × 5` → delays inflados; deps lentas
290
+
291
+ ## Anti-patterns
292
+
293
+ ### ANTI: retry sem jitter
294
+
295
+ ```text
296
+ ANTI: setTimeout(call, 1000 * 2^attempt) — fixed exponential.
297
+
298
+ PROBLEMA: 1000 clients sincroniza retries. Storm na recovery.
299
+
300
+ CERTO: Math.random() * 1000 * 2^attempt — full jitter.
301
+ ```
302
+
303
+ ### ANTI: retry em 4xx
304
+
305
+ ```text
306
+ ANTI: catch (e) { return retry(call) } sem checar status.
307
+
308
+ PROBLEMA: 400/422/404 retentado infinitamente. Bug não corrige sozinho.
309
+
310
+ CERTO: isRetryable(e) check antes de retry. 4xx (excluding 408, 429)
311
+ throw imediato.
312
+ ```
313
+
314
+ ### ANTI: retry sem deadline
315
+
316
+ ```text
317
+ ANTI: retry com max=5, base=1s. Worst case: 1+2+4+8+16=31s de delays.
318
+ Plus call time. Cliente já desistiu.
319
+
320
+ PROBLEMA: work zumbi. Recursos consumidos sem benefit.
321
+
322
+ CERTO: deadline propagation. Cada attempt checa Date.now() vs deadline.
323
+ Aborta cedo.
324
+ ```
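Note on the "retry sem deadline" anti-pattern: the 31 s figure is the sum of the un-jittered exponential delays. The sketch below reproduces that arithmetic and shows the effect of a cap. `worstCaseBackoffMs` is a hypothetical helper, and it counts only sleep time, not the duration of the calls themselves.

```ts
// Sketch: upper bound on the time spent sleeping between attempts when every
// delay hits its maximum, i.e. min(baseMs * 2^attempt, capMs) per retry.
function worstCaseBackoffMs(maxRetries: number, baseMs: number, capMs: number = Infinity): number {
  let totalMs = 0
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    totalMs += Math.min(baseMs * Math.pow(2, attempt), capMs)
  }
  return totalMs
}

console.log(worstCaseBackoffMs(5, 1000)) // 31000, i.e. 1+2+4+8+16 seconds
console.log(worstCaseBackoffMs(5, 1000, 5000)) // 17000, i.e. 1+2+4+5+5 seconds
```

Comparing this bound with the caller's deadline before shipping a configuration shows whether the later retries can ever run at all.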
325
+
326
+ ### ANTI: idempotency key per-attempt
327
+
328
+ ```text
329
+ ANTI: retry gera NEW idempotency key cada attempt.
330
+
331
+ PROBLEMA: cada attempt vira write distinta. Charge double, email
332
+ duplicado.
333
+
334
+ CERTO: idempotency key gerada UMA vez por call lógica. Mesma key
335
+ em todos os attempts.
336
+ ```
337
+
338
+ ### ANTI: max retries muito alto
339
+
340
+ ```text
341
+ ANTI: maxRetries = 20.
342
+
343
+ PROBLEMA: bug não-transient mascarado. Erro real demora pra aparecer.
344
+ Logs de retry inundam observability.
345
+
346
+ CERTO: max 3-5. Mais que isso = error real, não transient. Falha
347
+ rápido + alert > retry esperançoso.
348
+ ```
349
+
350
+ ## Verificação
351
+
352
+ 1. Toda retry tem jitter (full por default)
353
+ 2. Toda retry respeita deadline propagation
354
+ 3. isRetryable() check em cada attempt
355
+ 4. Idempotency key em writes
356
+ 5. Retry budget global ativo
357
+ 6. Max retries ≤ 5
358
+ 7. Backoff cap ≤ 60s para user-facing
359
+ 8. Métricas instrumentadas (counter retries, outcome, histogram delay)
360
+
361
+ ---
362
+
363
+ ## Ver também
364
+
365
+ - [`_shared-sre/glossary.md`](../_shared-sre/glossary.md) — vocabulário (jitter types, retry storm, etc.)
366
+ - [`cascading-failures`](../cascading-failures/SKILL.md) (v1.11) — retry sem jitter é trigger principal de cascade
367
+ - [`load-shedding-graceful-degradation`](../load-shedding-graceful-degradation/SKILL.md) (v1.11) — server-side coopera (503 + Retry-After)
368
+ - [`four-golden-signals`](../four-golden-signals/SKILL.md) (v1.10) — métricas de retry instrumentadas seguindo padrão
369
+ - [`supabase-edge-fn-writer`](../../agents/supabase-edge-fn-writer.md) (v1.8 + patch v1.11) — Edge Functions ganham retry-with-jitter built-in
370
+ - [`cascading-failures-auditor`](../../agents/cascading-failures-auditor.md) (v1.11) — agent detecta retry sem jitter
371
+
372
+ *Material-fonte: Site Reliability Engineering — Beyer/Jones/Petoff/Murphy (Google/O'Reilly, 2016) — Cap 22 (subsections sobre retry, jitter, deadline propagation).*