@luanpdd/kit-mcp 1.35.0 → 1.36.0

Files changed (117)
  1. package/bin/cli.js +2 -2
  2. package/bin/mcp.js +6 -6
  3. package/bin/ui.js +74 -74
  4. package/gates/ai-prompt-stability.md +120 -120
  5. package/gates/budget-description.md +68 -68
  6. package/gates/confidence.md +29 -29
  7. package/gates/dependency-check.md +33 -33
  8. package/gates/dept-cycle-prevention.md +179 -179
  9. package/gates/golden-signals-coverage.md +133 -133
  10. package/gates/legacy-refactor-safety.md +178 -178
  11. package/gates/multi-tenant-rls-coverage.md +102 -102
  12. package/gates/no-personal-uuid.md +72 -72
  13. package/gates/obs-agents-mcp-supabase.md +86 -86
  14. package/gates/obs-skills-frontmatter.md +76 -76
  15. package/gates/observability-coverage.md +151 -151
  16. package/gates/omm-no-regression.md +83 -83
  17. package/gates/postmortem-template-required.md +127 -127
  18. package/gates/prr-checklist-coverage.md +128 -128
  19. package/gates/regression.md +32 -32
  20. package/gates/release-pipeline-policy.md +132 -132
  21. package/gates/secrets-scan.md +33 -33
  22. package/gates/service-role-not-in-user-facing.md +113 -113
  23. package/gates/skill-must-include.md +71 -71
  24. package/gates/sync-idempotent.md +62 -62
  25. package/gates/verify-phase-goal.md +34 -34
  26. package/kit/agents/designer-ui.md +216 -216
  27. package/kit/agents/workflow-generator.md +537 -167
  28. package/kit/commands/adicionar-backlog.md +1 -1
  29. package/kit/commands/adicionar-fase.md +1 -1
  30. package/kit/commands/adicionar-tarefa.md +1 -1
  31. package/kit/commands/auditar-observabilidade.md +103 -103
  32. package/kit/commands/auditar-toil.md +129 -129
  33. package/kit/commands/caracterizar-prompt.md +195 -195
  34. package/kit/commands/criar-workflow.md +158 -158
  35. package/kit/commands/definir-perfil.md +1 -1
  36. package/kit/commands/definir-slo.md +108 -108
  37. package/kit/commands/fio.md +1 -1
  38. package/kit/commands/golden-signals.md +142 -142
  39. package/kit/commands/instrumentar-fase.md +200 -200
  40. package/kit/commands/investigar-producao.md +162 -162
  41. package/kit/commands/observabilidade.md +118 -118
  42. package/kit/commands/postmortem.md +179 -179
  43. package/kit/commands/prr.md +205 -205
  44. package/kit/commands/publicar-rapido.md +207 -207
  45. package/kit/commands/risk-budget.md +220 -220
  46. package/kit/commands/sre.md +230 -230
  47. package/kit/file-manifest.json +424 -424
  48. package/kit/framework/references/output-style.md +22 -22
  49. package/kit/hooks/post-apply-migration.js +199 -199
  50. package/kit/hooks/sidecar-tool-publisher.js +210 -210
  51. package/kit/skills/_shared-dados-distribuidos/glossary.md +224 -224
  52. package/kit/skills/_shared-legacy/glossary.md +389 -389
  53. package/kit/skills/_shared-multi-tenant/glossary.md +186 -186
  54. package/kit/skills/_shared-observability/glossary.md +396 -396
  55. package/kit/skills/_shared-sre/glossary.md +712 -712
  56. package/kit/skills/_shared-supabase/glossary.md +234 -234
  57. package/kit/skills/blameless-postmortems/SKILL.md +340 -340
  58. package/kit/skills/burn-rate-alerting/SKILL.md +258 -258
  59. package/kit/skills/cascading-failures/SKILL.md +311 -311
  60. package/kit/skills/core-analysis-loop/SKILL.md +352 -352
  61. package/kit/skills/distributed-tracing/SKILL.md +362 -362
  62. package/kit/skills/dynamic-workflow-authoring/SKILL.md +327 -223
  63. package/kit/skills/eliminating-toil/SKILL.md +243 -243
  64. package/kit/skills/event-based-slos/SKILL.md +296 -296
  65. package/kit/skills/four-golden-signals/SKILL.md +314 -314
  66. package/kit/skills/hermetic-builds/SKILL.md +323 -323
  67. package/kit/skills/legacy-monster-methods/SKILL.md +444 -444
  68. package/kit/skills/llm-as-dependency/SKILL.md +436 -436
  69. package/kit/skills/load-shedding-graceful-degradation/SKILL.md +396 -396
  70. package/kit/skills/observability-driven-development/SKILL.md +315 -315
  71. package/kit/skills/observability-maturity-model/SKILL.md +222 -222
  72. package/kit/skills/opentelemetry-standard/SKILL.md +351 -351
  73. package/kit/skills/production-readiness-review/SKILL.md +305 -305
  74. package/kit/skills/release-engineering/SKILL.md +367 -367
  75. package/kit/skills/retry-strategies/SKILL.md +372 -372
  76. package/kit/skills/sre-risk-management/SKILL.md +221 -221
  77. package/kit/skills/structured-events/SKILL.md +265 -265
  78. package/kit/skills/supabase-cron-queues/SKILL.md +275 -275
  79. package/kit/skills/supabase-database-functions/SKILL.md +332 -332
  80. package/kit/skills/supabase-declarative-schema/SKILL.md +183 -183
  81. package/kit/skills/supabase-pgvector-rag/SKILL.md +253 -253
  82. package/kit/skills/supabase-postgres-style/SKILL.md +138 -138
  83. package/kit/skills/supabase-storage/SKILL.md +234 -234
  84. package/kit/skills/telemetry-pipelines/SKILL.md +259 -259
  85. package/kit/skills/telemetry-sampling/SKILL.md +256 -256
  86. package/kit/skills/ui-anti-padroes-ia/SKILL.md +261 -261
  87. package/kit/skills/ui-contexto-produto/SKILL.md +248 -248
  88. package/kit/skills/ui-cor-estrategia/SKILL.md +213 -213
  89. package/kit/skills/ui-critica-auditoria/SKILL.md +260 -260
  90. package/kit/skills/ui-motion-funcional/SKILL.md +264 -264
  91. package/kit/skills/ui-ritmo-espacial/SKILL.md +259 -259
  92. package/kit/skills/ui-tipografia/SKILL.md +211 -211
  93. package/package.json +1 -1
  94. package/src/cli/index.js +1114 -1114
  95. package/src/cli/render.js +194 -194
  96. package/src/cli/upgrade-check.js +135 -135
  97. package/src/core/error-redaction.js +76 -76
  98. package/src/core/failures.js +153 -153
  99. package/src/core/gate-runner.js +205 -205
  100. package/src/core/gates.js +82 -82
  101. package/src/core/logger.js +170 -170
  102. package/src/core/manifest-verify.js +174 -174
  103. package/src/core/metrics.js +268 -268
  104. package/src/core/notify.js +60 -60
  105. package/src/core/path-safety.js +141 -141
  106. package/src/core/replays.js +120 -120
  107. package/src/core/ui.js +185 -185
  108. package/src/mcp-server/install.js +149 -149
  109. package/src/mcp-server/roots.js +124 -124
  110. package/src/ui/auto-spawn.js +113 -113
  111. package/src/ui/browser.js +78 -78
  112. package/src/ui/client.js +130 -130
  113. package/src/ui/events.js +65 -65
  114. package/src/ui/lockfile.js +191 -191
  115. package/src/ui/port.js +67 -67
  116. package/src/ui/server.js +547 -547
  117. package/src/ui/wrapper.js +129 -129
@@ -1,256 +1,256 @@
1
- ---
2
- name: telemetry-sampling
3
- description: Use ao reduzir custo de telemetria — head/tail sampling, by-key, dynamic. 100% errors, by-tier para customers, head-based propaga via traceparent.
4
- ---
5
-
6
- # Observabilidade — Telemetry Sampling
7
-
8
- ## Quando usar
9
-
10
- LLM carrega esta skill ao reduzir custo de telemetria sem perder sinal. Trigger phrases:
11
-
12
- - "sampling", "reduzir custo de telemetria"
13
- - "head-based vs tail-based"
14
- - "by-key sampling", "dynamic sampling"
15
- - "100% errors mas só 1% sucessos"
16
- - "trace fica incompleto após sampling"
17
-
18
- ## Regras absolutas
19
-
20
- - **100% dos erros sempre** — sample 100% de eventos com `result.success = false`. Erros são raros e críticos. Nunca descarte erros.
21
- - **100% de paying/enterprise customers** — high-value, baixo volume relativo, debug crucial.
22
- - **Head-based propaga via `traceparent` flag** — decisão tomada no service de entrada, propagada downstream para garantir trace completo.
23
- - **Tail-based requer collector buffer** — decisão pós-trace; impraticável de implementar inline no SDK do app (exige bufferizar todos os spans do trace).
24
- - **Constant probability falha em low volume** — 1/1000 de 100 req/min = 0.1 evento/min, perde tudo.
25
- - **Sample rate gravado no evento** — sem isso, agregações reconstroem totais errados.
26
- - **Errors > success** — categorize: paying customers > free, enterprise > pro > free.
27
- - **Não sample antes de aggregate** — pre-aggregation perde alta cardinalidade. Sample evento bruto, aggregate no read.
28
-
29
- ## Estratégias canônicas
30
-
31
- ### Head-based sampling (decisão no início do trace)
32
-
33
- ```ts
34
- // PT-BR: decisão tomada no service de entrada, propagada via traceparent flag
35
- import { TraceFlags } from '@opentelemetry/api'
36
- type SpanLike = { attributes: Record<string, unknown> } // PT-BR: tipo ilustrativo (hipotético); SpanContext do OTel não expõe attributes
37
-
38
- function shouldSample(event: SpanLike): boolean {
39
- // PT-BR: 100% errors (head-based: erros raramente são conhecidos no head;
40
- // verificar HTTP status no início via header)
41
- if (event.attributes['result.success'] === false) return true
42
-
43
- // PT-BR: 100% enterprise — alto valor
44
- if (event.attributes['customer.tier'] === 'enterprise') return true
45
-
46
- // PT-BR: 10% pro
47
- if (event.attributes['customer.tier'] === 'pro') return Math.random() < 0.1
48
-
49
- // PT-BR: 1% free baseline
50
- return Math.random() < 0.01
51
- }
52
-
53
- // PT-BR: marcar flag sampled no traceparent — propaga para downstream
54
- const flags = shouldSample(event) ? TraceFlags.SAMPLED : TraceFlags.NONE
55
- ```
56
-
57
- ### Tail-based sampling (decisão após trace completar)
58
-
59
- ```yaml
60
- # PT-BR: OTel Collector config — sampling pós-trace
61
- # 100% errors + outliers de latência + 1% success
62
- processors:
63
- tail_sampling:
64
- decision_wait: 10s # PT-BR: buffer 10s para esperar todos os spans do trace
65
- policies:
66
- - name: errors-policy
67
- type: status_code
68
- status_code: { status_codes: [ERROR] }
69
- - name: latency-outliers
70
- type: latency
71
- latency: { threshold_ms: 1000 } # PT-BR: > 1s é outlier
72
- - name: probabilistic-baseline
73
- type: probabilistic
74
- probabilistic: { sampling_percentage: 1 }
75
- ```
76
-
77
- ### By-key sampling
78
-
79
- ```ts
80
- // PT-BR: taxas diferentes por chave — mais preciso que constant
81
- const SAMPLE_RATES: Record<string, number> = {
82
- // chave: [error.type | endpoint | tenant_id, etc.]
83
- 'error_rate_limit': 0.5, // PT-BR: 50% (já frequente, mas importante)
84
- 'error_validation': 1.0, // PT-BR: 100% (raro, debug crítico)
85
- 'tenant_acme-corp': 1.0, // PT-BR: 100% (big customer)
86
- 'endpoint_/health': 0.001, // PT-BR: 0.1% (muito frequente, baixo valor)
87
- 'default': 0.05 // PT-BR: 5% baseline
88
- }
89
-
90
- function sampleByKey(event: SpanLike): boolean {
91
- const errorKey = `error_${event.attributes['error.type']}`
92
- const tenantKey = `tenant_${event.attributes['tenant_id']}`
93
- const endpointKey = `endpoint_${event.attributes['endpoint']}`
94
-
95
- const rate = SAMPLE_RATES[errorKey]
96
- ?? SAMPLE_RATES[tenantKey]
97
- ?? SAMPLE_RATES[endpointKey]
98
- ?? SAMPLE_RATES['default']
99
-
100
- return Math.random() < rate
101
- }
102
- ```
103
-
104
- ### Dynamic sampling (taxa adapta com volume)
105
-
106
- ```ts
107
- // PT-BR: lookback 30s — quanto traffic veio recentemente?
108
- let recentVolume = 0
109
- setInterval(() => { recentVolume = 0 }, 30_000)
110
-
111
- function sampleDynamic(event: SpanLike): boolean {
112
- recentVolume++
113
-
114
- // PT-BR: tráfego baixo → sample mais; tráfego alto → sample menos
115
- if (recentVolume < 100) return true // até 100 spans em 30s, mantém todos
116
- if (recentVolume < 1000) return Math.random() < 0.1 // até 1k, 10%
117
- return Math.random() < 0.01 // > 1k, 1%
118
- }
119
- ```
120
-
121
- ### Combinando: by-key + dynamic + head
122
-
123
- ```ts
124
- function shouldSample(event: SpanLike): boolean {
125
- // PT-BR: 1. Errors sempre 100%
126
- if (event.attributes['result.success'] === false) return true
127
-
128
- // PT-BR: 2. Enterprise sempre 100%
129
- if (event.attributes['customer.tier'] === 'enterprise') return true
130
-
131
- // PT-BR: 3. Outras chaves de alto valor
132
- if (event.attributes['feature_flag.experiment_a'] === true) return true // experimento ativo
133
-
134
- // PT-BR: 4. Dynamic baseline
135
- return sampleDynamic(event)
136
- }
137
- ```
138
-
139
- ## Patterns canônicos
140
-
141
- ### Pattern: gravar sample_rate no evento
142
-
143
- ```ts
144
- // PT-BR: sem sample_rate, agregações no read time não conseguem reconstruir totais
145
- const sampleRate = computeSampleRate(event)
146
- if (Math.random() < sampleRate) {
147
- span.setAttribute('_sample_rate', sampleRate) // PT-BR: 0.01 = 1% sampled
148
- span.setAttribute('_sampled', true)
149
- // PT-BR: agora o backend pode multiplicar contagens por 1/sample_rate
150
- exportSpan(span)
151
- }
152
- ```
153
-
154
- ### Pattern: query reconstruindo totais com sample_rate
155
-
156
- ```sql
157
- -- PT-BR: sem sample_rate, count(*) está errado
158
- -- COM sample_rate, sum(1/_sample_rate) reconstrói total estimado
159
- select
160
- endpoint,
161
- sum(1.0 / _sample_rate) as estimated_total,
162
- count(*) as samples_collected,
163
- sum(1.0 / _sample_rate) filter (where result_success = false) as estimated_errors
164
- from observability.events
165
- where timestamp > now() - interval '1 hour'
166
- group by endpoint
167
- order by estimated_total desc;
168
- ```
169
-
170
- ### Pattern: sampling para alta cardinalidade
171
-
172
- ```ts
173
- // PT-BR: cardinalidade alta (millions of users) — não pode sample por user.id
174
- // mas pode sample por (customer.tier, error.type) — combinação cardin. baixa
175
- function sampleByDimensions(event: SpanLike): number {
176
- const key = `${event.attributes['customer.tier']}-${event.attributes['error.type'] ?? 'success'}`
177
-
178
- const rates: Record<string, number> = {
179
- 'enterprise-success': 0.5,
180
- 'enterprise-error': 1.0,
181
- 'pro-success': 0.1,
182
- 'pro-error': 1.0,
183
- 'free-success': 0.01,
184
- 'free-error': 1.0,
185
- }
186
-
187
- return rates[key] ?? 0.01
188
- }
189
- ```
190
-
191
- ## Anti-patterns
192
-
193
- ### ANTI: constant probability em low volume
194
-
195
- ```text
196
- ANTI: app com 100 req/min, sample rate fixo 1/1000 → 0.1 evento/min retidos
197
- Você verá 1 evento retido a cada 10 minutos; erros aparecem ainda mais raramente. Sinal perdido.
198
-
199
- CERTO: dynamic sampling — alta taxa quando volume baixo, baixa quando alto.
200
- ```
201
-
202
- ### ANTI: sample errors
203
-
204
- ```text
205
- ANTI: sample 1% de errors junto com 1% de success — erros são 0.5% do tráfego;
206
- os erros retidos equivalem a 0.005% do tráfego total. Praticamente nunca aparecem.
207
-
208
- CERTO: 100% errors. SEMPRE. Erros são raros e críticos.
209
- ```
210
-
211
- ### ANTI: sample sem gravar rate
212
-
213
- ```text
214
- ANTI: sample 1/100 mas evento não tem _sample_rate
215
- Backend conta literais → count = 1% do real → métricas erradas
216
-
217
- CERTO: gravar _sample_rate no evento; agregar com sum(1/rate) no read.
218
- ```
219
-
220
- ### ANTI: tail-based sem collector
221
-
222
- ```text
223
- ANTI: tentar implementar tail-based em SDK do app — precisa bufferizar todos os spans
224
- de cada trace, esperar conclusão, decidir, exportar. Memória e latência altas.
225
-
226
- CERTO: tail-based requer OTel Collector como sidecar/proxy. App envia 100% para
227
- Collector; Collector decide via processor `tail_sampling`.
228
- ```
229
-
230
- ### ANTI: head-based sem propagação
231
-
232
- ```text
233
- ANTI: decisão de sample tomada no service A → não propagada para B → B decide sozinho
234
- → trace fica incompleto (alguns spans em A, outros em B, sem correlação)
235
-
236
- CERTO: marcar TraceFlags.SAMPLED no traceparent; B respeita decisão upstream.
237
- ```
238
-
239
- ## Verificação
240
-
241
- 1. **Errors 100%** — `select count(*) where result_success=false` × `1/sample_rate` ≈ count real
242
- 2. **Enterprise 100%** — verificar via query que enterprise tier tem _sample_rate=1 sempre
243
- 3. **Sample rate gravado** — `select count(*) filter (where _sample_rate is null)` = 0
244
- 4. **Trace integridade** — head-based: trace tem todos os spans (não 50% missing)
245
- 5. **Custo redução real** — bytes/segundo enviado para backend caiu sem perder sinal de error/p99
246
-
247
- ---
248
-
249
- ## Ver também
250
-
251
- - `kit/skills/_shared-observability/glossary.md` — termos sampling
252
- - `kit/skills/distributed-tracing/SKILL.md` — head vs tail decision timing
253
- - `kit/skills/opentelemetry-standard/SKILL.md` — Collector tail_sampling processor
254
- - `kit/skills/event-based-slos/SKILL.md` — SLO precisa de sample_rate para reconstruir totais
255
-
256
- *Material-fonte: Observability Engineering (O'Reilly, 2022) — Cap 17: "Cheap and Accurate Enough: Sampling".*
1
+ ---
2
+ name: telemetry-sampling
3
+ description: Use ao reduzir custo de telemetria — head/tail sampling, by-key, dynamic. 100% errors, by-tier para customers, head-based propaga via traceparent.
4
+ ---
5
+
6
+ # Observabilidade — Telemetry Sampling
7
+
8
+ ## Quando usar
9
+
10
+ LLM carrega esta skill ao reduzir custo de telemetria sem perder sinal. Trigger phrases:
11
+
12
+ - "sampling", "reduzir custo de telemetria"
13
+ - "head-based vs tail-based"
14
+ - "by-key sampling", "dynamic sampling"
15
+ - "100% errors mas só 1% sucessos"
16
+ - "trace fica incompleto após sampling"
17
+
18
+ ## Regras absolutas
19
+
20
+ - **100% dos erros sempre** — sample 100% de eventos com `result.success = false`. Erros são raros e críticos. Nunca descarte erros.
21
+ - **100% de paying/enterprise customers** — high-value, baixo volume relativo, debug crucial.
22
+ - **Head-based propaga via `traceparent` flag** — decisão tomada no service de entrada, propagada downstream para garantir trace completo.
23
+ - **Tail-based requer collector buffer** — decisão pós-trace; impraticável de implementar inline no SDK do app (exige bufferizar todos os spans do trace).
24
+ - **Constant probability falha em low volume** — 1/1000 de 100 req/min = 0.1 evento/min, perde tudo.
25
+ - **Sample rate gravado no evento** — sem isso, agregações reconstroem totais errados.
26
+ - **Errors > success** — categorize: paying customers > free, enterprise > pro > free.
27
+ - **Não sample antes de aggregate** — pre-aggregation perde alta cardinalidade. Sample evento bruto, aggregate no read.
28
+
29
+ ## Estratégias canônicas
30
+
31
+ ### Head-based sampling (decisão no início do trace)
32
+
33
+ ```ts
34
+ // PT-BR: decisão tomada no service de entrada, propagada via traceparent flag
35
+ import { TraceFlags } from '@opentelemetry/api'
36
+ type SpanLike = { attributes: Record<string, unknown> } // PT-BR: tipo ilustrativo (hipotético); SpanContext do OTel não expõe attributes
37
+
38
+ function shouldSample(event: SpanLike): boolean {
39
+ // PT-BR: 100% errors (head-based: erros raramente são conhecidos no head;
40
+ // verificar HTTP status no início via header)
41
+ if (event.attributes['result.success'] === false) return true
42
+
43
+ // PT-BR: 100% enterprise — alto valor
44
+ if (event.attributes['customer.tier'] === 'enterprise') return true
45
+
46
+ // PT-BR: 10% pro
47
+ if (event.attributes['customer.tier'] === 'pro') return Math.random() < 0.1
48
+
49
+ // PT-BR: 1% free baseline
50
+ return Math.random() < 0.01
51
+ }
52
+
53
+ // PT-BR: marcar flag sampled no traceparent — propaga para downstream
54
+ const flags = shouldSample(event) ? TraceFlags.SAMPLED : TraceFlags.NONE
55
+ ```
56
+
57
+ ### Tail-based sampling (decisão após trace completar)
58
+
59
+ ```yaml
60
+ # PT-BR: OTel Collector config — sampling pós-trace
61
+ # 100% errors + outliers de latência + 1% success
62
+ processors:
63
+ tail_sampling:
64
+ decision_wait: 10s # PT-BR: buffer 10s para esperar todos os spans do trace
65
+ policies:
66
+ - name: errors-policy
67
+ type: status_code
68
+ status_code: { status_codes: [ERROR] }
69
+ - name: latency-outliers
70
+ type: latency
71
+ latency: { threshold_ms: 1000 } # PT-BR: > 1s é outlier
72
+ - name: probabilistic-baseline
73
+ type: probabilistic
74
+ probabilistic: { sampling_percentage: 1 }
75
+ ```
76
+
77
+ ### By-key sampling
78
+
79
+ ```ts
80
+ // PT-BR: taxas diferentes por chave — mais preciso que constant
81
+ const SAMPLE_RATES: Record<string, number> = {
82
+ // chave: [error.type | endpoint | tenant_id, etc.]
83
+ 'error_rate_limit': 0.5, // PT-BR: 50% (já frequente, mas importante)
84
+ 'error_validation': 1.0, // PT-BR: 100% (raro, debug crítico)
85
+ 'tenant_acme-corp': 1.0, // PT-BR: 100% (big customer)
86
+ 'endpoint_/health': 0.001, // PT-BR: 0.1% (muito frequente, baixo valor)
87
+ 'default': 0.05 // PT-BR: 5% baseline
88
+ }
89
+
90
+ function sampleByKey(event: SpanLike): boolean {
91
+ const errorKey = `error_${event.attributes['error.type']}`
92
+ const tenantKey = `tenant_${event.attributes['tenant_id']}`
93
+ const endpointKey = `endpoint_${event.attributes['endpoint']}`
94
+
95
+ const rate = SAMPLE_RATES[errorKey]
96
+ ?? SAMPLE_RATES[tenantKey]
97
+ ?? SAMPLE_RATES[endpointKey]
98
+ ?? SAMPLE_RATES['default']
99
+
100
+ return Math.random() < rate
101
+ }
102
+ ```
103
+
104
+ ### Dynamic sampling (taxa adapta com volume)
105
+
106
+ ```ts
107
+ // PT-BR: lookback 30s — quanto traffic veio recentemente?
108
+ let recentVolume = 0
109
+ setInterval(() => { recentVolume = 0 }, 30_000)
110
+
111
+ function sampleDynamic(event: SpanLike): boolean {
112
+ recentVolume++
113
+
114
+ // PT-BR: tráfego baixo → sample mais; tráfego alto → sample menos
115
+ if (recentVolume < 100) return true // até 100 spans em 30s, mantém todos
116
+ if (recentVolume < 1000) return Math.random() < 0.1 // até 1k, 10%
117
+ return Math.random() < 0.01 // > 1k, 1%
118
+ }
119
+ ```
120
+
121
+ ### Combinando: by-key + dynamic + head
122
+
123
+ ```ts
124
+ function shouldSample(event: SpanLike): boolean {
125
+ // PT-BR: 1. Errors sempre 100%
126
+ if (event.attributes['result.success'] === false) return true
127
+
128
+ // PT-BR: 2. Enterprise sempre 100%
129
+ if (event.attributes['customer.tier'] === 'enterprise') return true
130
+
131
+ // PT-BR: 3. Outras chaves de alto valor
132
+ if (event.attributes['feature_flag.experiment_a'] === true) return true // experimento ativo
133
+
134
+ // PT-BR: 4. Dynamic baseline
135
+ return sampleDynamic(event)
136
+ }
137
+ ```
138
+
139
+ ## Patterns canônicos
140
+
141
+ ### Pattern: gravar sample_rate no evento
142
+
143
+ ```ts
144
+ // PT-BR: sem sample_rate, agregações no read time não conseguem reconstruir totais
145
+ const sampleRate = computeSampleRate(event)
146
+ if (Math.random() < sampleRate) {
147
+ span.setAttribute('_sample_rate', sampleRate) // PT-BR: 0.01 = 1% sampled
148
+ span.setAttribute('_sampled', true)
149
+ // PT-BR: agora o backend pode multiplicar contagens por 1/sample_rate
150
+ exportSpan(span)
151
+ }
152
+ ```
153
+
154
+ ### Pattern: query reconstruindo totais com sample_rate
155
+
156
+ ```sql
157
+ -- PT-BR: sem sample_rate, count(*) está errado
158
+ -- COM sample_rate, sum(1/_sample_rate) reconstrói total estimado
159
+ select
160
+ endpoint,
161
+ sum(1.0 / _sample_rate) as estimated_total,
162
+ count(*) as samples_collected,
163
+ sum(1.0 / _sample_rate) filter (where result_success = false) as estimated_errors
164
+ from observability.events
165
+ where timestamp > now() - interval '1 hour'
166
+ group by endpoint
167
+ order by estimated_total desc;
168
+ ```
169
+
170
+ ### Pattern: sampling para alta cardinalidade
171
+
172
+ ```ts
173
+ // PT-BR: cardinalidade alta (millions of users) — não pode sample por user.id
174
+ // mas pode sample por (customer.tier, error.type) — combinação cardin. baixa
175
+ function sampleByDimensions(event: SpanLike): number {
176
+ const key = `${event.attributes['customer.tier']}-${event.attributes['error.type'] ?? 'success'}`
177
+
178
+ const rates: Record<string, number> = {
179
+ 'enterprise-success': 0.5,
180
+ 'enterprise-error': 1.0,
181
+ 'pro-success': 0.1,
182
+ 'pro-error': 1.0,
183
+ 'free-success': 0.01,
184
+ 'free-error': 1.0,
185
+ }
186
+
187
+ return rates[key] ?? 0.01
188
+ }
189
+ ```
190
+
191
+ ## Anti-patterns
192
+
193
+ ### ANTI: constant probability em low volume
194
+
195
+ ```text
196
+ ANTI: app com 100 req/min, sample rate fixo 1/1000 → 0.1 evento/min retidos
197
+ Você verá 1 evento retido a cada 10 minutos; erros aparecem ainda mais raramente. Sinal perdido.
198
+
199
+ CERTO: dynamic sampling — alta taxa quando volume baixo, baixa quando alto.
200
+ ```
201
+
202
+ ### ANTI: sample errors
203
+
204
+ ```text
205
+ ANTI: sample 1% de errors junto com 1% de success — erros são 0.5% do tráfego;
206
+ os erros retidos equivalem a 0.005% do tráfego total. Praticamente nunca aparecem.
207
+
208
+ CERTO: 100% errors. SEMPRE. Erros são raros e críticos.
209
+ ```
210
+
211
+ ### ANTI: sample sem gravar rate
212
+
213
+ ```text
214
+ ANTI: sample 1/100 mas evento não tem _sample_rate
215
+ Backend conta literais → count = 1% do real → métricas erradas
216
+
217
+ CERTO: gravar _sample_rate no evento; agregar com sum(1/rate) no read.
218
+ ```
219
+
220
+ ### ANTI: tail-based sem collector
221
+
222
+ ```text
223
+ ANTI: tentar implementar tail-based em SDK do app — precisa bufferizar todos os spans
224
+ de cada trace, esperar conclusão, decidir, exportar. Memória e latência altas.
225
+
226
+ CERTO: tail-based requer OTel Collector como sidecar/proxy. App envia 100% para
227
+ Collector; Collector decide via processor `tail_sampling`.
228
+ ```
229
+
230
+ ### ANTI: head-based sem propagação
231
+
232
+ ```text
233
+ ANTI: decisão de sample tomada no service A → não propagada para B → B decide sozinho
234
+ → trace fica incompleto (alguns spans em A, outros em B, sem correlação)
235
+
236
+ CERTO: marcar TraceFlags.SAMPLED no traceparent; B respeita decisão upstream.
237
+ ```
238
+
239
+ ## Verificação
240
+
241
+ 1. **Errors 100%** — `select count(*) where result_success=false` × `1/sample_rate` ≈ count real
242
+ 2. **Enterprise 100%** — verificar via query que enterprise tier tem _sample_rate=1 sempre
243
+ 3. **Sample rate gravado** — `select count(*) filter (where _sample_rate is null)` = 0
244
+ 4. **Trace integridade** — head-based: trace tem todos os spans (não 50% missing)
245
+ 5. **Custo redução real** — bytes/segundo enviado para backend caiu sem perder sinal de error/p99
246
+
247
+ ---
248
+
249
+ ## Ver também
250
+
251
+ - `kit/skills/_shared-observability/glossary.md` — termos sampling
252
+ - `kit/skills/distributed-tracing/SKILL.md` — head vs tail decision timing
253
+ - `kit/skills/opentelemetry-standard/SKILL.md` — Collector tail_sampling processor
254
+ - `kit/skills/event-based-slos/SKILL.md` — SLO precisa de sample_rate para reconstruir totais
255
+
256
+ *Material-fonte: Observability Engineering (O'Reilly, 2022) — Cap 17: "Cheap and Accurate Enough: Sampling".*
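
Note on the skill's `sampleDynamic` example: it returns only a boolean, so the rate that was applied is lost, while the skill's own rules require the sample rate to be recorded on every kept event. The sketch below is one possible way to combine the two ideas; it is not part of the package. The names `SampledEvent`, `rateForVolume`, `decide`, and `estimateTotal` are hypothetical, the thresholds (100 and 1000 events per window) are copied from the skill's example, and the injectable `random` parameter is an assumption added to make the decision testable.

```typescript
// Hypothetical event shape for illustration; not an OpenTelemetry type.
interface SampledEvent {
  endpoint: string
  success: boolean
  sampleRate: number // probability with which this event was kept (0 < rate <= 1)
}

// Pick a keep-probability from the volume seen in the current window.
// Thresholds mirror the skill's sampleDynamic example.
function rateForVolume(volumeInWindow: number): number {
  if (volumeInWindow < 100) return 1
  if (volumeInWindow < 1000) return 0.1
  return 0.01
}

// Decide whether to keep an event. Errors are always kept at rate 1.
// Returns the event with its rate attached, or null when it is dropped.
function decide(
  endpoint: string,
  success: boolean,
  volumeInWindow: number,
  random: () => number = Math.random,
): SampledEvent | null {
  const sampleRate = success ? rateForVolume(volumeInWindow) : 1
  if (random() >= sampleRate) return null
  return { endpoint, success, sampleRate }
}

// Estimate the pre-sampling total: each kept event stands for 1 / sampleRate events.
function estimateTotal(events: SampledEvent[]): number {
  return events.reduce((sum, e) => sum + 1 / e.sampleRate, 0)
}
```

Because the rate travels with each event, `estimateTotal` plays the same role as the `sum(1.0 / _sample_rate)` query in the skill, and it stays correct even when the rate changes between windows.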