@luanpdd/kit-mcp 1.34.0 → 1.36.0

Files changed (118)
  1. package/README.md +1 -1
  2. package/bin/cli.js +2 -2
  3. package/bin/mcp.js +6 -6
  4. package/bin/ui.js +74 -74
  5. package/gates/ai-prompt-stability.md +120 -120
  6. package/gates/budget-description.md +68 -68
  7. package/gates/confidence.md +29 -29
  8. package/gates/dependency-check.md +33 -33
  9. package/gates/dept-cycle-prevention.md +179 -179
  10. package/gates/golden-signals-coverage.md +133 -133
  11. package/gates/legacy-refactor-safety.md +178 -178
  12. package/gates/multi-tenant-rls-coverage.md +102 -102
  13. package/gates/no-personal-uuid.md +72 -72
  14. package/gates/obs-agents-mcp-supabase.md +86 -86
  15. package/gates/obs-skills-frontmatter.md +76 -76
  16. package/gates/observability-coverage.md +151 -151
  17. package/gates/omm-no-regression.md +83 -83
  18. package/gates/postmortem-template-required.md +127 -127
  19. package/gates/prr-checklist-coverage.md +128 -128
  20. package/gates/regression.md +32 -32
  21. package/gates/release-pipeline-policy.md +132 -132
  22. package/gates/secrets-scan.md +33 -33
  23. package/gates/service-role-not-in-user-facing.md +113 -113
  24. package/gates/skill-must-include.md +71 -71
  25. package/gates/sync-idempotent.md +62 -62
  26. package/gates/verify-phase-goal.md +34 -34
  27. package/kit/agents/designer-ui.md +216 -216
  28. package/kit/agents/workflow-generator.md +537 -0
  29. package/kit/commands/adicionar-backlog.md +1 -1
  30. package/kit/commands/adicionar-fase.md +1 -1
  31. package/kit/commands/adicionar-tarefa.md +1 -1
  32. package/kit/commands/auditar-observabilidade.md +103 -103
  33. package/kit/commands/auditar-toil.md +129 -129
  34. package/kit/commands/caracterizar-prompt.md +195 -195
  35. package/kit/commands/criar-workflow.md +158 -0
  36. package/kit/commands/definir-perfil.md +1 -1
  37. package/kit/commands/definir-slo.md +108 -108
  38. package/kit/commands/fio.md +1 -1
  39. package/kit/commands/golden-signals.md +142 -142
  40. package/kit/commands/instrumentar-fase.md +200 -200
  41. package/kit/commands/investigar-producao.md +162 -162
  42. package/kit/commands/observabilidade.md +118 -118
  43. package/kit/commands/postmortem.md +179 -179
  44. package/kit/commands/prr.md +205 -205
  45. package/kit/commands/publicar-rapido.md +207 -207
  46. package/kit/commands/risk-budget.md +220 -220
  47. package/kit/commands/sre.md +230 -230
  48. package/kit/file-manifest.json +5 -2
  49. package/kit/framework/references/output-style.md +22 -22
  50. package/kit/hooks/post-apply-migration.js +199 -199
  51. package/kit/hooks/sidecar-tool-publisher.js +210 -210
  52. package/kit/skills/_shared-dados-distribuidos/glossary.md +224 -224
  53. package/kit/skills/_shared-legacy/glossary.md +389 -389
  54. package/kit/skills/_shared-multi-tenant/glossary.md +186 -186
  55. package/kit/skills/_shared-observability/glossary.md +396 -396
  56. package/kit/skills/_shared-sre/glossary.md +712 -712
  57. package/kit/skills/_shared-supabase/glossary.md +234 -234
  58. package/kit/skills/blameless-postmortems/SKILL.md +340 -340
  59. package/kit/skills/burn-rate-alerting/SKILL.md +258 -258
  60. package/kit/skills/cascading-failures/SKILL.md +311 -311
  61. package/kit/skills/core-analysis-loop/SKILL.md +352 -352
  62. package/kit/skills/distributed-tracing/SKILL.md +362 -362
  63. package/kit/skills/dynamic-workflow-authoring/SKILL.md +327 -0
  64. package/kit/skills/eliminating-toil/SKILL.md +243 -243
  65. package/kit/skills/event-based-slos/SKILL.md +296 -296
  66. package/kit/skills/four-golden-signals/SKILL.md +314 -314
  67. package/kit/skills/hermetic-builds/SKILL.md +323 -323
  68. package/kit/skills/legacy-monster-methods/SKILL.md +444 -444
  69. package/kit/skills/llm-as-dependency/SKILL.md +436 -436
  70. package/kit/skills/load-shedding-graceful-degradation/SKILL.md +396 -396
  71. package/kit/skills/observability-driven-development/SKILL.md +315 -315
  72. package/kit/skills/observability-maturity-model/SKILL.md +222 -222
  73. package/kit/skills/opentelemetry-standard/SKILL.md +351 -351
  74. package/kit/skills/production-readiness-review/SKILL.md +305 -305
  75. package/kit/skills/release-engineering/SKILL.md +367 -367
  76. package/kit/skills/retry-strategies/SKILL.md +372 -372
  77. package/kit/skills/sre-risk-management/SKILL.md +221 -221
  78. package/kit/skills/structured-events/SKILL.md +265 -265
  79. package/kit/skills/supabase-cron-queues/SKILL.md +275 -275
  80. package/kit/skills/supabase-database-functions/SKILL.md +332 -332
  81. package/kit/skills/supabase-declarative-schema/SKILL.md +183 -183
  82. package/kit/skills/supabase-pgvector-rag/SKILL.md +253 -253
  83. package/kit/skills/supabase-postgres-style/SKILL.md +138 -138
  84. package/kit/skills/supabase-storage/SKILL.md +234 -234
  85. package/kit/skills/telemetry-pipelines/SKILL.md +259 -259
  86. package/kit/skills/telemetry-sampling/SKILL.md +256 -256
  87. package/kit/skills/ui-anti-padroes-ia/SKILL.md +261 -261
  88. package/kit/skills/ui-contexto-produto/SKILL.md +248 -248
  89. package/kit/skills/ui-cor-estrategia/SKILL.md +213 -213
  90. package/kit/skills/ui-critica-auditoria/SKILL.md +260 -260
  91. package/kit/skills/ui-motion-funcional/SKILL.md +264 -264
  92. package/kit/skills/ui-ritmo-espacial/SKILL.md +259 -259
  93. package/kit/skills/ui-tipografia/SKILL.md +211 -211
  94. package/package.json +1 -1
  95. package/src/cli/index.js +1114 -1114
  96. package/src/cli/render.js +194 -194
  97. package/src/cli/upgrade-check.js +135 -135
  98. package/src/core/error-redaction.js +76 -76
  99. package/src/core/failures.js +153 -153
  100. package/src/core/gate-runner.js +205 -205
  101. package/src/core/gates.js +82 -82
  102. package/src/core/logger.js +170 -170
  103. package/src/core/manifest-verify.js +174 -174
  104. package/src/core/metrics.js +268 -268
  105. package/src/core/notify.js +60 -60
  106. package/src/core/path-safety.js +141 -141
  107. package/src/core/replays.js +120 -120
  108. package/src/core/ui.js +185 -185
  109. package/src/mcp-server/install.js +149 -149
  110. package/src/mcp-server/roots.js +124 -124
  111. package/src/ui/auto-spawn.js +113 -113
  112. package/src/ui/browser.js +78 -78
  113. package/src/ui/client.js +130 -130
  114. package/src/ui/events.js +65 -65
  115. package/src/ui/lockfile.js +191 -191
  116. package/src/ui/port.js +67 -67
  117. package/src/ui/server.js +547 -547
  118. package/src/ui/wrapper.js +129 -129
@@ -1,256 +1,256 @@
1
- ---
2
- name: telemetry-sampling
3
- description: Use ao reduzir custo de telemetria — head/tail sampling, by-key, dynamic. 100% errors, by-tier para customers, head-based propaga via traceparent.
4
- ---
5
-
6
- # Observabilidade — Telemetry Sampling
7
-
8
- ## Quando usar
9
-
10
- LLM carrega esta skill ao reduzir custo de telemetria sem perder sinal. Trigger phrases:
11
-
12
- - "sampling", "reduzir custo de telemetria"
13
- - "head-based vs tail-based"
14
- - "by-key sampling", "dynamic sampling"
15
- - "100% errors mas só 1% sucessos"
16
- - "trace fica incompleto após sampling"
17
-
18
- ## Regras absolutas
19
-
20
- - **100% dos erros sempre** — sample 100% de eventos com `result.success = false`. Erros são raros e críticos. Nunca sample.
21
- - **100% de paying/enterprise customers** — high-value, baixo volume relativo, debug crucial.
22
- - **Head-based propaga via `traceparent` flag** — decisão tomada no service de entrada, propagada downstream para garantir trace completo.
23
- - **Tail-based requer collector buffer** — decisão pós-trace; impossível de implementar inline em código.
24
- - **Constant probability falha em low volume** — 1/1000 de 100 req/min = 0.1 evento/min, perde tudo.
25
- - **Sample rate gravado no evento** — sem isso, agregações reconstroem totais errados.
26
- - **Errors > success** — categorize: paying customers > free, enterprise > pro > free.
27
- - **Não sample antes de aggregate** — pre-aggregation perde alta cardinalidade. Sample evento bruto, aggregate no read.
28
-
29
- ## Estratégias canônicas
30
-
31
- ### Head-based sampling (decisão no início do trace)
32
-
33
- ```ts
34
- // PT-BR: decisão tomada no service de entrada, propagada via traceparent flag
35
- import { trace, context } from '@opentelemetry/api'
36
- import { TraceFlags } from '@opentelemetry/api'
37
-
38
- function shouldSample(event: SpanContext): boolean {
39
- // PT-BR: 100% errors (head-based: erros raramente são conhecidos no head;
40
- // verificar HTTP status no início via header)
41
- if (event.attributes['result.success'] === false) return true
42
-
43
- // PT-BR: 100% enterprise — alto valor
44
- if (event.attributes['customer.tier'] === 'enterprise') return true
45
-
46
- // PT-BR: 10% pro
47
- if (event.attributes['customer.tier'] === 'pro') return Math.random() < 0.1
48
-
49
- // PT-BR: 1% free baseline
50
- return Math.random() < 0.01
51
- }
52
-
53
- // PT-BR: marcar flag sampled no traceparent — propaga para downstream
54
- const flags = shouldSample(event) ? TraceFlags.SAMPLED : TraceFlags.NONE
55
- ```
56
-
57
- ### Tail-based sampling (decisão após trace completar)
58
-
59
- ```yaml
60
- # PT-BR: OTel Collector config — sampling pós-trace
61
- # 100% errors + outliers de latência + 1% success
62
- processors:
63
- tail_sampling:
64
- decision_wait: 10s # PT-BR: buffer 10s para esperar todos os spans do trace
65
- policies:
66
- - name: errors-policy
67
- type: status_code
68
- status_code: { status_codes: [ERROR] }
69
- - name: latency-outliers
70
- type: latency
71
- latency: { threshold_ms: 1000 } # PT-BR: > 1s é outlier
72
- - name: probabilistic-baseline
73
- type: probabilistic
74
- probabilistic: { sampling_percentage: 1 }
75
- ```
76
-
77
- ### By-key sampling
78
-
79
- ```ts
80
- // PT-BR: taxas diferentes por chave — mais preciso que constant
81
- const SAMPLE_RATES: Record<string, number> = {
82
- // chave: [error.type | endpoint | tenant_id, etc.]
83
- 'error_rate_limit': 0.5, // PT-BR: 50% (já frequente, mas importante)
84
- 'error_validation': 1.0, // PT-BR: 100% (raro, debug crítico)
85
- 'tenant_acme-corp': 1.0, // PT-BR: 100% (big customer)
86
- 'endpoint_/health': 0.001, // PT-BR: 0.1% (muito frequente, baixo valor)
87
- 'default': 0.05 // PT-BR: 5% baseline
88
- }
89
-
90
- function sampleByKey(event: SpanLike): boolean {
91
- const errorKey = `error_${event.attributes['error.type']}`
92
- const tenantKey = `tenant_${event.attributes['tenant_id']}`
93
- const endpointKey = `endpoint_${event.attributes['endpoint']}`
94
-
95
- const rate = SAMPLE_RATES[errorKey]
96
- ?? SAMPLE_RATES[tenantKey]
97
- ?? SAMPLE_RATES[endpointKey]
98
- ?? SAMPLE_RATES['default']
99
-
100
- return Math.random() < rate
101
- }
102
- ```
103
-
104
- ### Dynamic sampling (taxa adapta com volume)
105
-
106
- ```ts
107
- // PT-BR: lookback 30s — quanto traffic veio recentemente?
108
- let recentVolume = 0
109
- setInterval(() => { recentVolume = 0 }, 30_000)
110
-
111
- function sampleDynamic(event: SpanLike): boolean {
112
- recentVolume++
113
-
114
- // PT-BR: tráfego baixo → sample mais; tráfego alto → sample menos
115
- if (recentVolume < 100) return true // até 100 spans em 30s, mantém todos
116
- if (recentVolume < 1000) return Math.random() < 0.1 // até 1k, 10%
117
- return Math.random() < 0.01 // > 1k, 1%
118
- }
119
- ```
120
-
121
- ### Combinando: by-key + dynamic + head
122
-
123
- ```ts
124
- function shouldSample(event: SpanLike): boolean {
125
- // PT-BR: 1. Errors sempre 100%
126
- if (event.attributes['result.success'] === false) return true
127
-
128
- // PT-BR: 2. Enterprise sempre 100%
129
- if (event.attributes['customer.tier'] === 'enterprise') return true
130
-
131
- // PT-BR: 3. Outras chaves de alto valor
132
- if (event.attributes['feature_flag.experiment_a'] === true) return true // experimento ativo
133
-
134
- // PT-BR: 4. Dynamic baseline
135
- return sampleDynamic(event)
136
- }
137
- ```
138
-
139
- ## Patterns canônicos
140
-
141
- ### Pattern: gravar sample_rate no evento
142
-
143
- ```ts
144
- // PT-BR: sem sample_rate, agregações no read time não conseguem reconstruir totais
145
- const sampleRate = computeSampleRate(event)
146
- if (Math.random() < sampleRate) {
147
- span.setAttribute('_sample_rate', sampleRate) // PT-BR: 0.01 = 1% sampled
148
- span.setAttribute('_sampled', true)
149
- // PT-BR: agora o backend pode multiplicar contagens por 1/sample_rate
150
- exportSpan(span)
151
- }
152
- ```
153
-
154
- ### Pattern: query reconstruindo totais com sample_rate
155
-
156
- ```sql
157
- -- PT-BR: sem sample_rate, count(*) está errado
158
- -- COM sample_rate, sum(1/_sample_rate) reconstrói total estimado
159
- select
160
- endpoint,
161
- sum(1.0 / _sample_rate) as estimated_total,
162
- count(*) as samples_collected,
163
- sum(1.0 / _sample_rate) filter (where result_success = false) as estimated_errors
164
- from observability.events
165
- where timestamp > now() - interval '1 hour'
166
- group by endpoint
167
- order by estimated_total desc;
168
- ```
169
-
170
- ### Pattern: sampling para alta cardinalidade
171
-
172
- ```ts
173
- // PT-BR: cardinalidade alta (millions of users) — não pode sample por user.id
174
- // mas pode sample por (customer.tier, error.type) — combinação cardin. baixa
175
- function sampleByDimensions(event: SpanLike): number {
176
- const key = `${event.attributes['customer.tier']}-${event.attributes['error.type'] ?? 'success'}`
177
-
178
- const rates: Record<string, number> = {
179
- 'enterprise-success': 0.5,
180
- 'enterprise-error': 1.0,
181
- 'pro-success': 0.1,
182
- 'pro-error': 1.0,
183
- 'free-success': 0.01,
184
- 'free-error': 1.0,
185
- }
186
-
187
- return rates[key] ?? 0.01
188
- }
189
- ```
190
-
191
- ## Anti-patterns
192
-
193
- ### ANTI: constant probability em low volume
194
-
195
- ```text
196
- ANTI: app com 100 req/min, sample rate fixo 1/1000 → 0.1 evento/min retidos
197
- Você verá 1 erro a cada 10 minutos. Sinal perdido.
198
-
199
- CERTO: dynamic sampling — alta taxa quando volume baixo, baixa quando alto.
200
- ```
201
-
202
- ### ANTI: sample errors
203
-
204
- ```text
205
- ANTI: sample 1% de errors junto com 1% de success — erros são 0.5% do tráfego;
206
- seu sample retém 0.005% de errors total. Praticamente nunca aparecem.
207
-
208
- CERTO: 100% errors. SEMPRE. Erros são raros e críticos.
209
- ```
210
-
211
- ### ANTI: sample sem gravar rate
212
-
213
- ```text
214
- ANTI: sample 1/100 mas evento não tem _sample_rate
215
- Backend conta literais → count = 1% do real → métricas erradas
216
-
217
- CERTO: gravar _sample_rate no evento; agregar com sum(1/rate) no read.
218
- ```
219
-
220
- ### ANTI: tail-based sem collector
221
-
222
- ```text
223
- ANTI: tentar implementar tail-based em SDK do app — precisa bufferizar todos os spans
224
- de cada trace, esperar conclusão, decidir, exportar. Memória e latência altas.
225
-
226
- CERTO: tail-based requer OTel Collector como sidecar/proxy. App envia 100% para
227
- Collector; Collector decide via processor `tail_sampling`.
228
- ```
229
-
230
- ### ANTI: head-based sem propagação
231
-
232
- ```text
233
- ANTI: decisão de sample tomada no service A → não propagada para B → B decide sozinho
234
- → trace fica incompleto (alguns spans em A, outros em B, sem correlação)
235
-
236
- CERTO: marcar TraceFlags.SAMPLED no traceparent; B respeita decisão upstream.
237
- ```
238
-
239
- ## Verificação
240
-
241
- 1. **Errors 100%** — `select count(*) where result_success=false` × `1/sample_rate` ≈ count real
242
- 2. **Enterprise 100%** — verificar via query que enterprise tier tem _sample_rate=1 sempre
243
- 3. **Sample rate gravado** — `select count(*) filter (where _sample_rate is null)` = 0
244
- 4. **Trace integridade** — head-based: trace tem todos os spans (não 50% missing)
245
- 5. **Custo redução real** — bytes/segundo enviado para backend caiu sem perder sinal de error/p99
246
-
247
- ---
248
-
249
- ## Ver também
250
-
251
- - `kit/skills/_shared-observability/glossary.md` — termos sampling
252
- - `kit/skills/distributed-tracing/SKILL.md` — head vs tail decision timing
253
- - `kit/skills/opentelemetry-standard/SKILL.md` — Collector tail_sampling processor
254
- - `kit/skills/event-based-slos/SKILL.md` — SLO precisa de sample_rate para reconstruir totais
255
-
256
- *Material-fonte: Observability Engineering (O'Reilly, 2022) — Cap 17: "Cheap and Accurate Enough: Sampling".*
1
+ ---
2
+ name: telemetry-sampling
3
+ description: Use ao reduzir custo de telemetria — head/tail sampling, by-key, dynamic. 100% errors, by-tier para customers, head-based propaga via traceparent.
4
+ ---
5
+
6
+ # Observabilidade — Telemetry Sampling
7
+
8
+ ## Quando usar
9
+
10
+ LLM carrega esta skill ao reduzir custo de telemetria sem perder sinal. Trigger phrases:
11
+
12
+ - "sampling", "reduzir custo de telemetria"
13
+ - "head-based vs tail-based"
14
+ - "by-key sampling", "dynamic sampling"
15
+ - "100% errors mas só 1% sucessos"
16
+ - "trace fica incompleto após sampling"
17
+
18
+ ## Regras absolutas
19
+
20
+ - **100% dos erros sempre** — sample 100% de eventos com `result.success = false`. Erros são raros e críticos. Nunca sample.
21
+ - **100% de paying/enterprise customers** — high-value, baixo volume relativo, debug crucial.
22
+ - **Head-based propaga via `traceparent` flag** — decisão tomada no service de entrada, propagada downstream para garantir trace completo.
23
+ - **Tail-based requer collector buffer** — decisão pós-trace; impossível de implementar inline em código.
24
+ - **Constant probability falha em low volume** — 1/1000 de 100 req/min = 0.1 evento/min, perde tudo.
25
+ - **Sample rate gravado no evento** — sem isso, agregações reconstroem totais errados.
26
+ - **Errors > success** — categorize: paying customers > free, enterprise > pro > free.
27
+ - **Não sample antes de aggregate** — pre-aggregation perde alta cardinalidade. Sample evento bruto, aggregate no read.
28
+
29
+ ## Estratégias canônicas
30
+
31
+ ### Head-based sampling (decisão no início do trace)
32
+
33
+ ```ts
34
+ // PT-BR: decisão tomada no service de entrada, propagada via traceparent flag
35
+ import { trace, context } from '@opentelemetry/api'
36
+ import { TraceFlags } from '@opentelemetry/api'
37
+
38
+ function shouldSample(event: SpanContext): boolean {
39
+ // PT-BR: 100% errors (head-based: erros raramente são conhecidos no head;
40
+ // verificar HTTP status no início via header)
41
+ if (event.attributes['result.success'] === false) return true
42
+
43
+ // PT-BR: 100% enterprise — alto valor
44
+ if (event.attributes['customer.tier'] === 'enterprise') return true
45
+
46
+ // PT-BR: 10% pro
47
+ if (event.attributes['customer.tier'] === 'pro') return Math.random() < 0.1
48
+
49
+ // PT-BR: 1% free baseline
50
+ return Math.random() < 0.01
51
+ }
52
+
53
+ // PT-BR: marcar flag sampled no traceparent — propaga para downstream
54
+ const flags = shouldSample(event) ? TraceFlags.SAMPLED : TraceFlags.NONE
55
+ ```
56
+
57
+ ### Tail-based sampling (decisão após trace completar)
58
+
59
+ ```yaml
60
+ # PT-BR: OTel Collector config — sampling pós-trace
61
+ # 100% errors + outliers de latência + 1% success
62
+ processors:
63
+ tail_sampling:
64
+ decision_wait: 10s # PT-BR: buffer 10s para esperar todos os spans do trace
65
+ policies:
66
+ - name: errors-policy
67
+ type: status_code
68
+ status_code: { status_codes: [ERROR] }
69
+ - name: latency-outliers
70
+ type: latency
71
+ latency: { threshold_ms: 1000 } # PT-BR: > 1s é outlier
72
+ - name: probabilistic-baseline
73
+ type: probabilistic
74
+ probabilistic: { sampling_percentage: 1 }
75
+ ```
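The Collector config above makes one decision per trace once its spans have been buffered. As a rough illustration of that decision logic only (not the Collector's actual implementation), the sketch below keeps a trace if any span errored, if the end-to-end duration exceeds a threshold, or if it wins a probabilistic draw. The `FinishedSpan` shape, the `keepTrace` name, and the injectable `random` parameter are assumptions made for this example.

```ts
// Hypothetical span shape for illustration; not an OpenTelemetry type.
interface FinishedSpan {
  traceId: string
  startMs: number
  endMs: number
  isError: boolean
}

// Decide once per trace, after all of its spans have been buffered.
// `random` is injectable so the decision can be tested deterministically.
function keepTrace(
  spans: FinishedSpan[],
  latencyThresholdMs = 1000,
  baselineRate = 0.01,
  random: () => number = Math.random,
): boolean {
  if (spans.length === 0) return false

  // Policy 1: any span with an error status keeps the whole trace.
  if (spans.some((s) => s.isError)) return true

  // Policy 2: end-to-end duration above the threshold is a latency outlier.
  const start = Math.min(...spans.map((s) => s.startMs))
  const end = Math.max(...spans.map((s) => s.endMs))
  if (end - start > latencyThresholdMs) return true

  // Policy 3: probabilistic baseline for everything else.
  return random() < baselineRate
}
```

This only works where every span of a trace is visible in one place, which is why the buffering lives in a collector rather than in each application process.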
76
+
77
+ ### By-key sampling
78
+
79
+ ```ts
80
+ // PT-BR: taxas diferentes por chave — mais preciso que constant
81
+ const SAMPLE_RATES: Record<string, number> = {
82
+ // chave: [error.type | endpoint | tenant_id, etc.]
83
+ 'error_rate_limit': 0.5, // PT-BR: 50% (já frequente, mas importante)
84
+ 'error_validation': 1.0, // PT-BR: 100% (raro, debug crítico)
85
+ 'tenant_acme-corp': 1.0, // PT-BR: 100% (big customer)
86
+ 'endpoint_/health': 0.001, // PT-BR: 0.1% (muito frequente, baixo valor)
87
+ 'default': 0.05 // PT-BR: 5% baseline
88
+ }
89
+
90
+ function sampleByKey(event: SpanLike): boolean {
91
+ const errorKey = `error_${event.attributes['error.type']}`
92
+ const tenantKey = `tenant_${event.attributes['tenant_id']}`
93
+ const endpointKey = `endpoint_${event.attributes['endpoint']}`
94
+
95
+ const rate = SAMPLE_RATES[errorKey]
96
+ ?? SAMPLE_RATES[tenantKey]
97
+ ?? SAMPLE_RATES[endpointKey]
98
+ ?? SAMPLE_RATES['default']
99
+
100
+ return Math.random() < rate
101
+ }
102
+ ```
103
+
104
+ ### Dynamic sampling (taxa adapta com volume)
105
+
106
+ ```ts
107
+ // PT-BR: lookback 30s — quanto traffic veio recentemente?
108
+ let recentVolume = 0
109
+ setInterval(() => { recentVolume = 0 }, 30_000)
110
+
111
+ function sampleDynamic(event: SpanLike): boolean {
112
+ recentVolume++
113
+
114
+ // PT-BR: tráfego baixo → sample mais; tráfego alto → sample menos
115
+ if (recentVolume < 100) return true // até 100 spans em 30s, mantém todos
116
+ if (recentVolume < 1000) return Math.random() < 0.1 // até 1k, 10%
117
+ return Math.random() < 0.01 // > 1k, 1%
118
+ }
119
+ ```
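The dynamic sampler above returns only a boolean, so the rate in effect at decision time is lost. A minimal variant, assuming the same volume thresholds, could return the rate together with the decision so the caller can record it on the event. The names `rateForVolume`, `decideDynamic`, and `SamplingDecision` are hypothetical.

```ts
interface SamplingDecision {
  keep: boolean
  sampleRate: number // rate in effect when this event was decided
}

// Same thresholds as the block above: fewer than 100 events in the window
// keeps everything, fewer than 1000 keeps 10%, anything beyond keeps 1%.
function rateForVolume(recentVolume: number): number {
  if (recentVolume < 100) return 1
  if (recentVolume < 1000) return 0.1
  return 0.01
}

// `random` is injectable for deterministic tests; defaults to Math.random.
function decideDynamic(
  recentVolume: number,
  random: () => number = Math.random,
): SamplingDecision {
  const sampleRate = rateForVolume(recentVolume)
  return { keep: random() < sampleRate, sampleRate }
}
```

A caller would typically write `sampleRate` into the `_sample_rate` attribute only when `keep` is true, so that read-time aggregation can weight each kept event correctly.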
120
+
121
+ ### Combinando: by-key + dynamic + head
122
+
123
+ ```ts
124
+ function shouldSample(event: SpanLike): boolean {
125
+ // PT-BR: 1. Errors sempre 100%
126
+ if (event.attributes['result.success'] === false) return true
127
+
128
+ // PT-BR: 2. Enterprise sempre 100%
129
+ if (event.attributes['customer.tier'] === 'enterprise') return true
130
+
131
+ // PT-BR: 3. Outras chaves de alto valor
132
+ if (event.attributes['feature_flag.experiment_a'] === true) return true // experimento ativo
133
+
134
+ // PT-BR: 4. Dynamic baseline
135
+ return sampleDynamic(event)
136
+ }
137
+ ```
138
+
139
+ ## Patterns canônicos
140
+
141
+ ### Pattern: gravar sample_rate no evento
142
+
143
+ ```ts
144
+ // PT-BR: sem sample_rate, agregações no read time não conseguem reconstruir totais
145
+ const sampleRate = computeSampleRate(event)
146
+ if (Math.random() < sampleRate) {
147
+ span.setAttribute('_sample_rate', sampleRate) // PT-BR: 0.01 = 1% sampled
148
+ span.setAttribute('_sampled', true)
149
+ // PT-BR: agora o backend pode multiplicar contagens por 1/sample_rate
150
+ exportSpan(span)
151
+ }
152
+ ```
153
+
154
+ ### Pattern: query reconstruindo totais com sample_rate
155
+
156
+ ```sql
157
+ -- PT-BR: sem sample_rate, count(*) está errado
158
+ -- COM sample_rate, sum(1/_sample_rate) reconstrói total estimado
159
+ select
160
+ endpoint,
161
+ sum(1.0 / _sample_rate) as estimated_total,
162
+ count(*) as samples_collected,
163
+ sum(1.0 / _sample_rate) filter (where result_success = false) as estimated_errors
164
+ from observability.events
165
+ where timestamp > now() - interval '1 hour'
166
+ group by endpoint
167
+ order by estimated_total desc;
168
+ ```
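The arithmetic behind the query can also be sketched outside SQL: each kept event stands in for `1 / sample_rate` original events, so summing those weights estimates the pre-sampling total. The `SampledEvent` shape and the `estimateTotals` function below are illustrative assumptions, not part of the package.

```ts
// Hypothetical event shape mirroring the columns used in the query above.
interface SampledEvent {
  endpoint: string
  sampleRate: number // value of _sample_rate, expected in (0, 1]
  success: boolean
}

interface EndpointEstimate {
  estimatedTotal: number
  samplesCollected: number
  estimatedErrors: number
}

function estimateTotals(events: SampledEvent[]): Map<string, EndpointEstimate> {
  const byEndpoint = new Map<string, EndpointEstimate>()
  for (const e of events) {
    if (!(e.sampleRate > 0 && e.sampleRate <= 1)) {
      throw new RangeError(`invalid sample rate: ${e.sampleRate}`)
    }
    // One kept event represents 1 / sampleRate events before sampling.
    const weight = 1 / e.sampleRate
    const acc = byEndpoint.get(e.endpoint) ?? {
      estimatedTotal: 0,
      samplesCollected: 0,
      estimatedErrors: 0,
    }
    acc.estimatedTotal += weight
    acc.samplesCollected += 1
    if (!e.success) acc.estimatedErrors += weight
    byEndpoint.set(e.endpoint, acc)
  }
  return byEndpoint
}
```

For example, two successes kept at a rate of 0.01 plus one error kept at a rate of 1 yield an estimated total of about 201 from only 3 collected samples.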
169
+
170
+ ### Pattern: sampling para alta cardinalidade
171
+
172
+ ```ts
173
+ // PT-BR: cardinalidade alta (millions of users) — não pode sample por user.id
174
+ // mas pode sample por (customer.tier, error.type) — combinação cardin. baixa
175
+ function sampleByDimensions(event: SpanLike): number {
176
+ const key = `${event.attributes['customer.tier']}-${event.attributes['error.type'] ?? 'success'}`
177
+
178
+ const rates: Record<string, number> = {
179
+ 'enterprise-success': 0.5,
180
+ 'enterprise-error': 1.0,
181
+ 'pro-success': 0.1,
182
+ 'pro-error': 1.0,
183
+ 'free-success': 0.01,
184
+ 'free-error': 1.0,
185
+ }
186
+
187
+ return rates[key] ?? 0.01
188
+ }
189
+ ```
190
+
191
+ ## Anti-patterns
192
+
193
+ ### ANTI: constant probability em low volume
194
+
195
+ ```text
196
+ ANTI: app com 100 req/min, sample rate fixo 1/1000 → 0.1 evento/min retidos
197
+ Você verá 1 erro a cada 10 minutos. Sinal perdido.
198
+
199
+ CERTO: dynamic sampling — alta taxa quando volume baixo, baixa quando alto.
200
+ ```
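The numbers in this anti-pattern follow from a single multiplication, sketched below with hypothetical helper names. Note that 0.1 per minute is the expected count of retained events of any kind. If, as an assumption borrowed from the next anti-pattern, only 0.5% of traffic is errors, retained errors are far rarer still.

```ts
// Expected number of events kept per minute under a fixed sample rate.
function expectedKeptPerMinute(eventsPerMinute: number, sampleRate: number): number {
  return eventsPerMinute * sampleRate
}

// Average wait, in minutes, between two kept events.
function minutesBetweenKept(eventsPerMinute: number, sampleRate: number): number {
  const kept = expectedKeptPerMinute(eventsPerMinute, sampleRate)
  return kept > 0 ? 1 / kept : Infinity
}

// 100 req/min at a fixed 1/1000 rate: roughly one kept event every 10 minutes.
const anyEventWait = minutesBetweenKept(100, 1 / 1000)

// Assuming 0.5% of those requests are errors (0.5 errors/min) at the same
// rate: roughly one kept error every 2000 minutes, i.e. about 33 hours.
const errorWait = minutesBetweenKept(100 * 0.005, 1 / 1000)
```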
201
+
202
+ ### ANTI: sample errors
203
+
204
+ ```text
205
+ ANTI: sample 1% de errors junto com 1% de success — erros são 0.5% do tráfego;
206
+ seu sample retém 0.005% de errors total. Praticamente nunca aparecem.
207
+
208
+ CERTO: 100% errors. SEMPRE. Erros são raros e críticos.
209
+ ```
210
+
211
+ ### ANTI: sample sem gravar rate
212
+
213
+ ```text
214
+ ANTI: sample 1/100 mas evento não tem _sample_rate
215
+ Backend conta literais → count = 1% do real → métricas erradas
216
+
217
+ CERTO: gravar _sample_rate no evento; agregar com sum(1/rate) no read.
218
+ ```
219
+
220
+ ### ANTI: tail-based sem collector
221
+
222
+ ```text
223
+ ANTI: tentar implementar tail-based em SDK do app — precisa bufferizar todos os spans
224
+ de cada trace, esperar conclusão, decidir, exportar. Memória e latência altas.
225
+
226
+ CERTO: tail-based requer OTel Collector como sidecar/proxy. App envia 100% para
227
+ Collector; Collector decide via processor `tail_sampling`.
228
+ ```
229
+
230
+ ### ANTI: head-based sem propagação
231
+
232
+ ```text
233
+ ANTI: decisão de sample tomada no service A → não propagada para B → B decide sozinho
234
+ → trace fica incompleto (alguns spans em A, outros em B, sem correlação)
235
+
236
+ CERTO: marcar TraceFlags.SAMPLED no traceparent; B respeita decisão upstream.
237
+ ```
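On the downstream side, respecting the upstream decision means reading the sampled bit from the incoming W3C `traceparent` header (`version-traceid-parentid-flags`, where bit `0x01` of the flags byte means sampled). The sketch below is a simplified reader that handles only version `00`. The `upstreamSampled` and `decide` helpers are hypothetical names, and in practice an OpenTelemetry SDK's parent-based sampler would normally do this for you.

```ts
const SAMPLED_FLAG = 0x01

// Returns the upstream sampling decision, or undefined when the header is
// missing or malformed. Only version "00" is handled; this is a simplification.
function upstreamSampled(traceparent: string | undefined): boolean | undefined {
  if (traceparent === undefined) return undefined
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(traceparent.trim())
  if (!match) return undefined
  const flags = parseInt(match[3], 16)
  return (flags & SAMPLED_FLAG) === SAMPLED_FLAG
}

// Service B: follow service A's decision when present; decide locally only
// when this service is the entry point of the trace.
function decide(traceparent: string | undefined, localDecision: () => boolean): boolean {
  return upstreamSampled(traceparent) ?? localDecision()
}
```

With this in place, a trace that service A chose to drop stays dropped in B, and one that A chose to keep is kept in full, regardless of B's own rates.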
238
+
239
+ ## Verificação
240
+
241
+ 1. **Errors 100%** — `select count(*) where result_success=false` × `1/sample_rate` ≈ count real
242
+ 2. **Enterprise 100%** — verificar via query que enterprise tier tem _sample_rate=1 sempre
243
+ 3. **Sample rate gravado** — `select count(*) filter (where _sample_rate is null)` = 0
244
+ 4. **Trace integridade** — head-based: trace tem todos os spans (não 50% missing)
245
+ 5. **Custo redução real** — bytes/segundo enviado para backend caiu sem perder sinal de error/p99
246
+
247
+ ---
248
+
249
+ ## Ver também
250
+
251
+ - `kit/skills/_shared-observability/glossary.md` — termos sampling
252
+ - `kit/skills/distributed-tracing/SKILL.md` — head vs tail decision timing
253
+ - `kit/skills/opentelemetry-standard/SKILL.md` — Collector tail_sampling processor
254
+ - `kit/skills/event-based-slos/SKILL.md` — SLO precisa de sample_rate para reconstruir totais
255
+
256
+ *Material-fonte: Observability Engineering (O'Reilly, 2022) — Cap 17: "Cheap and Accurate Enough: Sampling".*