e11y 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (157) hide show
  1. checksums.yaml +7 -0
  2. data/.rspec +4 -0
  3. data/.rubocop.yml +69 -0
  4. data/CHANGELOG.md +26 -0
  5. data/CODE_OF_CONDUCT.md +64 -0
  6. data/LICENSE.txt +21 -0
  7. data/README.md +179 -0
  8. data/Rakefile +37 -0
  9. data/benchmarks/run_all.rb +33 -0
  10. data/config/README.md +83 -0
  11. data/config/loki-local-config.yaml +35 -0
  12. data/config/prometheus.yml +15 -0
  13. data/docker-compose.yml +78 -0
  14. data/docs/00-ICP-AND-TIMELINE.md +483 -0
  15. data/docs/01-SCALE-REQUIREMENTS.md +858 -0
  16. data/docs/ADR-001-architecture.md +2617 -0
  17. data/docs/ADR-002-metrics-yabeda.md +1395 -0
  18. data/docs/ADR-003-slo-observability.md +3337 -0
  19. data/docs/ADR-004-adapter-architecture.md +2385 -0
  20. data/docs/ADR-005-tracing-context.md +1372 -0
  21. data/docs/ADR-006-security-compliance.md +4143 -0
  22. data/docs/ADR-007-opentelemetry-integration.md +1385 -0
  23. data/docs/ADR-008-rails-integration.md +1911 -0
  24. data/docs/ADR-009-cost-optimization.md +2993 -0
  25. data/docs/ADR-010-developer-experience.md +2166 -0
  26. data/docs/ADR-011-testing-strategy.md +1836 -0
  27. data/docs/ADR-012-event-evolution.md +958 -0
  28. data/docs/ADR-013-reliability-error-handling.md +2750 -0
  29. data/docs/ADR-014-event-driven-slo.md +1533 -0
  30. data/docs/ADR-015-middleware-order.md +1061 -0
  31. data/docs/ADR-016-self-monitoring-slo.md +1234 -0
  32. data/docs/API-REFERENCE-L28.md +914 -0
  33. data/docs/COMPREHENSIVE-CONFIGURATION.md +2366 -0
  34. data/docs/IMPLEMENTATION_NOTES.md +2804 -0
  35. data/docs/IMPLEMENTATION_PLAN.md +1971 -0
  36. data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +586 -0
  37. data/docs/PLAN.md +148 -0
  38. data/docs/QUICK-START.md +934 -0
  39. data/docs/README.md +296 -0
  40. data/docs/design/00-memory-optimization.md +593 -0
  41. data/docs/guides/MIGRATION-L27-L28.md +692 -0
  42. data/docs/guides/PERFORMANCE-BENCHMARKS.md +434 -0
  43. data/docs/guides/README.md +44 -0
  44. data/docs/prd/01-overview-vision.md +440 -0
  45. data/docs/use_cases/README.md +119 -0
  46. data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +813 -0
  47. data/docs/use_cases/UC-002-business-event-tracking.md +1953 -0
  48. data/docs/use_cases/UC-003-pattern-based-metrics.md +1627 -0
  49. data/docs/use_cases/UC-004-zero-config-slo-tracking.md +728 -0
  50. data/docs/use_cases/UC-005-sentry-integration.md +759 -0
  51. data/docs/use_cases/UC-006-trace-context-management.md +905 -0
  52. data/docs/use_cases/UC-007-pii-filtering.md +2648 -0
  53. data/docs/use_cases/UC-008-opentelemetry-integration.md +1153 -0
  54. data/docs/use_cases/UC-009-multi-service-tracing.md +1043 -0
  55. data/docs/use_cases/UC-010-background-job-tracking.md +1018 -0
  56. data/docs/use_cases/UC-011-rate-limiting.md +1906 -0
  57. data/docs/use_cases/UC-012-audit-trail.md +2301 -0
  58. data/docs/use_cases/UC-013-high-cardinality-protection.md +2127 -0
  59. data/docs/use_cases/UC-014-adaptive-sampling.md +1940 -0
  60. data/docs/use_cases/UC-015-cost-optimization.md +735 -0
  61. data/docs/use_cases/UC-016-rails-logger-migration.md +785 -0
  62. data/docs/use_cases/UC-017-local-development.md +867 -0
  63. data/docs/use_cases/UC-018-testing-events.md +1081 -0
  64. data/docs/use_cases/UC-019-tiered-storage-migration.md +562 -0
  65. data/docs/use_cases/UC-020-event-versioning.md +708 -0
  66. data/docs/use_cases/UC-021-error-handling-retry-dlq.md +956 -0
  67. data/docs/use_cases/UC-022-event-registry.md +648 -0
  68. data/docs/use_cases/backlog.md +226 -0
  69. data/e11y.gemspec +76 -0
  70. data/lib/e11y/adapters/adaptive_batcher.rb +207 -0
  71. data/lib/e11y/adapters/audit_encrypted.rb +239 -0
  72. data/lib/e11y/adapters/base.rb +580 -0
  73. data/lib/e11y/adapters/file.rb +224 -0
  74. data/lib/e11y/adapters/in_memory.rb +216 -0
  75. data/lib/e11y/adapters/loki.rb +333 -0
  76. data/lib/e11y/adapters/otel_logs.rb +203 -0
  77. data/lib/e11y/adapters/registry.rb +141 -0
  78. data/lib/e11y/adapters/sentry.rb +230 -0
  79. data/lib/e11y/adapters/stdout.rb +108 -0
  80. data/lib/e11y/adapters/yabeda.rb +370 -0
  81. data/lib/e11y/buffers/adaptive_buffer.rb +339 -0
  82. data/lib/e11y/buffers/base_buffer.rb +40 -0
  83. data/lib/e11y/buffers/request_scoped_buffer.rb +246 -0
  84. data/lib/e11y/buffers/ring_buffer.rb +267 -0
  85. data/lib/e11y/buffers.rb +14 -0
  86. data/lib/e11y/console.rb +122 -0
  87. data/lib/e11y/current.rb +48 -0
  88. data/lib/e11y/event/base.rb +894 -0
  89. data/lib/e11y/event/value_sampling_config.rb +84 -0
  90. data/lib/e11y/events/base_audit_event.rb +43 -0
  91. data/lib/e11y/events/base_payment_event.rb +33 -0
  92. data/lib/e11y/events/rails/cache/delete.rb +21 -0
  93. data/lib/e11y/events/rails/cache/read.rb +23 -0
  94. data/lib/e11y/events/rails/cache/write.rb +22 -0
  95. data/lib/e11y/events/rails/database/query.rb +45 -0
  96. data/lib/e11y/events/rails/http/redirect.rb +21 -0
  97. data/lib/e11y/events/rails/http/request.rb +26 -0
  98. data/lib/e11y/events/rails/http/send_file.rb +21 -0
  99. data/lib/e11y/events/rails/http/start_processing.rb +26 -0
  100. data/lib/e11y/events/rails/job/completed.rb +22 -0
  101. data/lib/e11y/events/rails/job/enqueued.rb +22 -0
  102. data/lib/e11y/events/rails/job/failed.rb +22 -0
  103. data/lib/e11y/events/rails/job/scheduled.rb +23 -0
  104. data/lib/e11y/events/rails/job/started.rb +22 -0
  105. data/lib/e11y/events/rails/log.rb +56 -0
  106. data/lib/e11y/events/rails/view/render.rb +23 -0
  107. data/lib/e11y/events.rb +18 -0
  108. data/lib/e11y/instruments/active_job.rb +201 -0
  109. data/lib/e11y/instruments/rails_instrumentation.rb +141 -0
  110. data/lib/e11y/instruments/sidekiq.rb +175 -0
  111. data/lib/e11y/logger/bridge.rb +205 -0
  112. data/lib/e11y/metrics/cardinality_protection.rb +172 -0
  113. data/lib/e11y/metrics/cardinality_tracker.rb +134 -0
  114. data/lib/e11y/metrics/registry.rb +234 -0
  115. data/lib/e11y/metrics/relabeling.rb +226 -0
  116. data/lib/e11y/metrics.rb +102 -0
  117. data/lib/e11y/middleware/audit_signing.rb +174 -0
  118. data/lib/e11y/middleware/base.rb +140 -0
  119. data/lib/e11y/middleware/event_slo.rb +167 -0
  120. data/lib/e11y/middleware/pii_filter.rb +266 -0
  121. data/lib/e11y/middleware/pii_filtering.rb +280 -0
  122. data/lib/e11y/middleware/rate_limiting.rb +214 -0
  123. data/lib/e11y/middleware/request.rb +163 -0
  124. data/lib/e11y/middleware/routing.rb +157 -0
  125. data/lib/e11y/middleware/sampling.rb +254 -0
  126. data/lib/e11y/middleware/slo.rb +168 -0
  127. data/lib/e11y/middleware/trace_context.rb +131 -0
  128. data/lib/e11y/middleware/validation.rb +118 -0
  129. data/lib/e11y/middleware/versioning.rb +132 -0
  130. data/lib/e11y/middleware.rb +12 -0
  131. data/lib/e11y/pii/patterns.rb +90 -0
  132. data/lib/e11y/pii.rb +13 -0
  133. data/lib/e11y/pipeline/builder.rb +155 -0
  134. data/lib/e11y/pipeline/zone_validator.rb +110 -0
  135. data/lib/e11y/pipeline.rb +12 -0
  136. data/lib/e11y/presets/audit_event.rb +65 -0
  137. data/lib/e11y/presets/debug_event.rb +34 -0
  138. data/lib/e11y/presets/high_value_event.rb +51 -0
  139. data/lib/e11y/presets.rb +19 -0
  140. data/lib/e11y/railtie.rb +138 -0
  141. data/lib/e11y/reliability/circuit_breaker.rb +216 -0
  142. data/lib/e11y/reliability/dlq/file_storage.rb +277 -0
  143. data/lib/e11y/reliability/dlq/filter.rb +117 -0
  144. data/lib/e11y/reliability/retry_handler.rb +207 -0
  145. data/lib/e11y/reliability/retry_rate_limiter.rb +117 -0
  146. data/lib/e11y/sampling/error_spike_detector.rb +225 -0
  147. data/lib/e11y/sampling/load_monitor.rb +161 -0
  148. data/lib/e11y/sampling/stratified_tracker.rb +92 -0
  149. data/lib/e11y/sampling/value_extractor.rb +82 -0
  150. data/lib/e11y/self_monitoring/buffer_monitor.rb +79 -0
  151. data/lib/e11y/self_monitoring/performance_monitor.rb +97 -0
  152. data/lib/e11y/self_monitoring/reliability_monitor.rb +146 -0
  153. data/lib/e11y/slo/event_driven.rb +150 -0
  154. data/lib/e11y/slo/tracker.rb +119 -0
  155. data/lib/e11y/version.rb +9 -0
  156. data/lib/e11y.rb +283 -0
  157. metadata +452 -0
@@ -0,0 +1,692 @@
1
+ # Migration Guide: L2.7 (Basic Sampling) → L2.8 (Advanced Sampling)
2
+
3
+ **Version:** 1.0
4
+ **Date:** January 20, 2026
5
+ **Applies to:** E11y gem v0.8.0+
6
+
7
+ ---
8
+
9
+ ## 📋 Overview
10
+
11
+ This guide helps you migrate from **L2.7 (Basic Sampling)** to **L2.8 (Advanced Sampling Strategies)** to unlock:
12
+
13
+ - **Error-Based Adaptive Sampling**: 100% sampling during error spikes
14
+ - **Load-Based Adaptive Sampling**: Tiered sampling (100%/50%/10%/1%) based on system load
15
+ - **Value-Based Sampling**: Always sample high-value events (e.g., >$1000 orders)
16
+ - **Stratified Sampling**: SLO-accurate metrics with < 5% error margin
17
+
18
+ **Cost Savings**: 35-90% reduction in observability costs while maintaining or improving data quality.
19
+
20
+ ---
21
+
22
+ ## 🚦 Migration Phases
23
+
24
+ ### Phase 1: Preparation (1 hour)
25
+ 1. Review current sampling config
26
+ 2. Run tests to establish baseline
27
+ 3. Enable self-monitoring metrics
28
+
29
+ ### Phase 2: Enable Error-Based Adaptive (30 minutes)
30
+ 1. Add error spike detection config
31
+ 2. Deploy to staging
32
+ 3. Validate behavior during simulated incidents
33
+
34
+ ### Phase 3: Enable Load-Based Adaptive (30 minutes)
35
+ 1. Add load monitor config
36
+ 2. Deploy to staging
37
+ 3. Load test with varying traffic levels
38
+
39
+ ### Phase 4: Add Value-Based Sampling (1 hour)
40
+ 1. Identify high-value events
41
+ 2. Add `sample_by_value` DSL to event classes
42
+ 3. Validate in staging
43
+
44
+ ### Phase 5: Enable Stratified Sampling (15 minutes)
45
+ 1. Enable SLO sampling correction
46
+ 2. Validate SLO accuracy
47
+ 3. Deploy to production
48
+
49
+ ---
50
+
51
+ ## 📊 Current State (L2.7 - Basic Sampling)
52
+
53
+ **What You Have:**
54
+
55
+ ```ruby
56
+ # config/initializers/e11y.rb (L2.7)
57
+ E11y.configure do |config|
58
+ # Basic sampling middleware (already in pipeline)
59
+ config.pipeline.use E11y::Middleware::Sampling,
60
+ default_sample_rate: 0.1, # 10% sampling
61
+ trace_aware: true # Trace-consistent sampling
62
+ end
63
+
64
+ # Event-level sampling
65
+ class Events::HighFrequencyEvent < E11y::Event::Base
66
+ sample_rate 0.01 # 1% sampling
67
+ end
68
+
69
+ # Severity-based defaults (automatic)
70
+ class Events::ErrorEvent < E11y::Event::Base
71
+ severity :error # → 100% sampling (SEVERITY_SAMPLE_RATES[:error])
72
+ end
73
+
74
+ # Audit event exemption
75
+ class Events::AuditEvent < E11y::Event::Base
76
+ audit_event true # Never sampled, always processed
77
+ end
78
+ ```
79
+
80
+ **What's Working:**
81
+ - ✅ Basic sampling (10% default)
82
+ - ✅ Trace-aware sampling (C05 resolution)
83
+ - ✅ Event-level sample rates
84
+ - ✅ Severity-based defaults
85
+ - ✅ Audit event exemption
86
+
87
+ **What's Missing:**
88
+ - ❌ No dynamic adjustment during errors
89
+ - ❌ No load-based adaptation
90
+ - ❌ No value-based prioritization
91
+ - ❌ SLO metrics not corrected for sampling
92
+
93
+ ---
94
+
95
+ ## 🎯 Target State (L2.8 - Advanced Sampling)
96
+
97
+ **What You'll Have:**
98
+
99
+ ```ruby
100
+ # config/initializers/e11y.rb (L2.8)
101
+ E11y.configure do |config|
102
+ config.pipeline.use E11y::Middleware::Sampling,
103
+ default_sample_rate: 0.1,
104
+
105
+ # ✅ NEW: Error-Based Adaptive (FEAT-4838)
106
+ error_based_adaptive: true,
107
+ error_spike_config: {
108
+ window: 60,
109
+ absolute_threshold: 100,
110
+ relative_threshold: 3.0,
111
+ spike_duration: 300
112
+ },
113
+
114
+ # ✅ NEW: Load-Based Adaptive (FEAT-4842)
115
+ load_based_adaptive: true,
116
+ load_monitor_config: {
117
+ window: 60,
118
+ normal_threshold: 1_000,
119
+ high_threshold: 10_000,
120
+ very_high_threshold: 50_000,
121
+ overload_threshold: 100_000
122
+ }
123
+
124
+ # ✅ NEW: Stratified Sampling for SLO (FEAT-4850)
125
+ config.slo do
126
+ enabled true
127
+ enable_sampling_correction true # Automatic correction
128
+ end
129
+ end
130
+
131
+ # ✅ NEW: Value-Based Sampling (FEAT-4846)
132
+ class Events::OrderPaid < E11y::Event::Base
133
+ schema do
134
+ required(:order_id).filled(:string)
135
+ required(:amount).filled(:decimal)
136
+ end
137
+
138
+ # Always sample high-value orders
139
+ sample_by_value field: "amount",
140
+ operator: :greater_than,
141
+ threshold: 1000,
142
+ sample_rate: 1.0
143
+ end
144
+ ```
145
+
146
+ **What You'll Gain:**
147
+ - ✅ 100% sampling during error spikes (debug priority)
148
+ - ✅ Cost protection during high load (1-10% sampling)
149
+ - ✅ Business-critical event prioritization
150
+ - ✅ Accurate SLO metrics (< 5% error)
151
+
152
+ ---
153
+
154
+ ## 🛠️ Step-by-Step Migration
155
+
156
+ ### Step 1: Review Current Config
157
+
158
+ **Check your current sampling configuration:**
159
+
160
+ ```bash
161
+ # Find current sampling config
162
+ grep -r "Middleware::Sampling" config/initializers/
163
+ grep -r "sample_rate" app/events/
164
+
165
+ # Check event classes with custom sampling
166
+ find app/events -name "*.rb" -exec grep -l "sample_rate" {} \;
167
+ ```
168
+
169
+ **Document current behavior:**
170
+ - What's your `default_sample_rate`?
171
+ - Which events have custom `sample_rate`?
172
+ - Are you using `audit_event true`?
173
+
174
+ **Baseline metrics (capture before migration):**
175
+ ```ruby
176
+ # Run for 1 hour in production
177
+ # - e11y_events_tracked_total (events/sec)
178
+ # - e11y_events_dropped_total (% dropped)
179
+ # - e11y_slo_http_success_rate (success rate)
180
+ ```
181
+
182
+ ---
183
+
184
+ ### Step 2: Enable Self-Monitoring
185
+
186
+ **Add self-monitoring to track migration effectiveness:**
187
+
188
+ ```ruby
189
+ # config/initializers/e11y.rb
190
+ E11y.configure do |config|
191
+ # ... existing config ...
192
+
193
+ # Enable self-monitoring (already included in L2.7)
194
+ config.self_monitoring.enabled = true
195
+ end
196
+ ```
197
+
198
+ **Key metrics to watch:**
199
+ - `e11y_middleware_latency_ms` (sampling overhead)
200
+ - `e11y_events_sampled_total` (events kept)
201
+ - `e11y_events_dropped_total` (events dropped)
202
+
203
+ ---
204
+
205
+ ### Step 3: Enable Error-Based Adaptive Sampling
206
+
207
+ **Add error spike detection config:**
208
+
209
+ ```ruby
210
+ # config/initializers/e11y.rb
211
+ E11y.configure do |config|
212
+ config.pipeline.use E11y::Middleware::Sampling,
213
+ default_sample_rate: 0.1,
214
+ trace_aware: true,
215
+
216
+ # NEW: Error-Based Adaptive
217
+ error_based_adaptive: true,
218
+ error_spike_config: {
219
+ window: 60, # 60 seconds sliding window
220
+ absolute_threshold: 100, # 100 errors/min triggers spike
221
+ relative_threshold: 3.0, # 3x normal rate triggers spike
222
+ spike_duration: 300 # Keep 100% sampling for 5 minutes
223
+ }
224
+ end
225
+ ```
226
+
227
+ **Deploy to staging:**
228
+ ```bash
229
+ # Push config changes
230
+ git add config/initializers/e11y.rb
231
+ git commit -m "feat: enable error-based adaptive sampling (FEAT-4838)"
232
+ git push origin feature/l28-migration
233
+
234
+ # Deploy to staging
235
+ bin/deploy staging
236
+ ```
237
+
238
+ **Validate in staging:**
239
+
240
+ 1. **Simulate error spike:**
241
+ ```bash
242
+ # Generate 150 errors in 1 minute (exceeds absolute threshold)
243
+ 150.times { Events::TestError.track(severity: :error) }
244
+ ```
245
+
246
+ 2. **Check Grafana:**
247
+ ```promql
248
+ # Should see sampling rate jump to 100%
249
+ e11y_sampling_current_rate{strategy="error_spike"}
250
+ ```
251
+
252
+ 3. **Verify events captured:**
253
+ ```bash
254
+ # Query Loki for events during spike
255
+ # All errors should be present (100% sampling)
256
+ ```
257
+
258
+ **Rollback plan:**
259
+ ```ruby
260
+ # If issues, disable error-based adaptive:
261
+ E11y.configure do |config|
262
+ config.pipeline.use E11y::Middleware::Sampling,
263
+ default_sample_rate: 0.1,
264
+ error_based_adaptive: false # ← Disable
265
+ end
266
+ ```
267
+
268
+ ---
269
+
270
+ ### Step 4: Enable Load-Based Adaptive Sampling
271
+
272
+ **Add load monitor config:**
273
+
274
+ ```ruby
275
+ # config/initializers/e11y.rb
276
+ E11y.configure do |config|
277
+ config.pipeline.use E11y::Middleware::Sampling,
278
+ default_sample_rate: 0.1,
279
+ error_based_adaptive: true,
280
+ error_spike_config: { ... },
281
+
282
+ # NEW: Load-Based Adaptive
283
+ load_based_adaptive: true,
284
+ load_monitor_config: {
285
+ window: 60, # 60 seconds
286
+ normal_threshold: 1_000, # < 1k events/sec = normal (100%)
287
+ high_threshold: 10_000, # 10k events/sec = high (50%)
288
+ very_high_threshold: 50_000, # 50k events/sec = very high (10%)
289
+ overload_threshold: 100_000 # > 100k events/sec = overload (1%)
290
+ }
291
+ end
292
+ ```
293
+
294
+ **Tune thresholds for your app:**
295
+
296
+ ```bash
297
+ # Check current event rate in production
298
+ echo "SELECT rate(e11y_events_tracked_total[5m])" | promql
299
+
300
+ # Adjust thresholds based on your baseline:
301
+ # - normal_threshold: 2x baseline
302
+ # - high_threshold: 10x baseline
303
+ # - very_high_threshold: 50x baseline
304
+ # - overload_threshold: 100x baseline
305
+ ```
306
+
307
+ **Load test in staging:**
308
+
309
+ ```bash
310
+ # Simulate high load with wrk
311
+ wrk -t12 -c400 -d30s --latency https://staging.example.com/api/orders
312
+
313
+ # Watch sampling rate adjust in Grafana:
314
+ # - Low load: 100%
315
+ # - High load: 50%
316
+ # - Very high: 10%
317
+ # - Overload: 1%
318
+ ```
319
+
320
+ **Monitor performance:**
321
+ ```promql
322
+ # Check if load-based sampling is working
323
+ e11y_sampling_current_rate{strategy="load_based"}
324
+
325
+ # Verify cost savings
326
+ sum(rate(e11y_events_dropped_total[5m])) / sum(rate(e11y_events_tracked_total[5m]))
327
+ ```
328
+
329
+ ---
330
+
331
+ ### Step 5: Add Value-Based Sampling
332
+
333
+ **Identify high-value events:**
334
+
335
+ 1. **Business-critical events:**
336
+ - Payment transactions
337
+ - Order completions
338
+ - User registrations
339
+
340
+ 2. **High-value thresholds:**
341
+ - Orders > $1000
342
+ - Enterprise/VIP users
343
+ - Critical API endpoints
344
+
345
+ **Add `sample_by_value` to event classes:**
346
+
347
+ ```ruby
348
+ # app/events/order_paid.rb
349
+ class Events::OrderPaid < E11y::Event::Base
350
+ schema do
351
+ required(:order_id).filled(:string)
352
+ required(:amount).filled(:decimal)
353
+ required(:user_segment).filled(:string)
354
+ end
355
+
356
+ # Always sample high-value orders
357
+ sample_by_value field: "amount",
358
+ operator: :greater_than,
359
+ threshold: 1000,
360
+ sample_rate: 1.0
361
+
362
+ # Always sample enterprise users
363
+ sample_by_value field: "user_segment",
364
+ operator: :equals,
365
+ threshold: "enterprise",
366
+ sample_rate: 1.0
367
+ end
368
+
369
+ # app/events/api_request.rb
370
+ class Events::ApiRequest < E11y::Event::Base
371
+ schema do
372
+ required(:endpoint).filled(:string)
373
+ required(:latency_ms).filled(:integer)
374
+ end
375
+
376
+ # Always sample slow requests (>1000ms)
377
+ sample_by_value field: "latency_ms",
378
+ operator: :greater_than,
379
+ threshold: 1000,
380
+ sample_rate: 1.0
381
+ end
382
+ ```
383
+
384
+ **Test in staging:**
385
+
386
+ ```ruby
387
+ # High-value order → Always sampled
388
+ Events::OrderPaid.track(
389
+ order_id: "123",
390
+ amount: 5000, # > $1000 → 100% sampled
391
+ user_segment: "enterprise"
392
+ )
393
+
394
+ # Low-value order → Falls back to load-based sampling
395
+ Events::OrderPaid.track(
396
+ order_id: "456",
397
+ amount: 50, # < $1000 → load-based rate
398
+ user_segment: "free"
399
+ )
400
+ ```
401
+
402
+ **Validate in Grafana:**
403
+ ```promql
404
+ # Check value-based sampling rate
405
+ e11y_sampling_decisions_total{decision="kept", reason="value_based"}
406
+
407
+ # Verify high-value events never dropped
408
+ rate(e11y_events_dropped_total{event_name="order.paid", amount=">1000"}[5m])
409
+ # Should be 0!
410
+ ```
411
+
412
+ ---
413
+
414
+ ### Step 6: Enable Stratified Sampling for SLO
415
+
416
+ **Enable SLO sampling correction:**
417
+
418
+ ```ruby
419
+ # config/initializers/e11y.rb
420
+ E11y.configure do |config|
421
+ # ... existing sampling config ...
422
+
423
+ # NEW: SLO with sampling correction
424
+ config.slo do
425
+ enabled true
426
+ enable_sampling_correction true # Automatic correction
427
+ end
428
+ end
429
+ ```
430
+
431
+ **Validate SLO accuracy:**
432
+
433
+ ```bash
434
+ # Generate test traffic with known success rate
435
+ # - 950 successful requests (95%)
436
+ # - 50 failed requests (5%)
437
+
438
+ # Check corrected SLO in Grafana:
439
+ e11y_slo_http_success_rate
440
+
441
+ # Should be 95.0% (±0.5%), even with aggressive sampling!
442
+ ```
443
+
444
+ **Compare with/without correction:**
445
+
446
+ ```promql
447
+ # Without correction (raw metrics):
448
+ sum(rate(http_requests_total{status="200"}[5m]))
449
+ /
450
+ sum(rate(http_requests_total[5m]))
451
+ # May show 60-70% (biased by sampling)
452
+
453
+ # With correction (E11y SLO):
454
+ e11y_slo_http_success_rate
455
+ # Shows 95.0% (accurate!)
456
+ ```
457
+
458
+ ---
459
+
460
+ ### Step 7: Production Deployment
461
+
462
+ **Pre-deployment checklist:**
463
+ - ✅ All strategies tested in staging
464
+ - ✅ Thresholds tuned for your app
465
+ - ✅ Rollback plan documented
466
+ - ✅ Monitoring dashboard updated
467
+ - ✅ Team notified of changes
468
+
469
+ **Gradual rollout:**
470
+
471
+ 1. **Deploy to canary (10% of traffic):**
472
+ ```bash
473
+ bin/deploy production --canary 10%
474
+ ```
475
+
476
+ 2. **Monitor for 1 hour:**
477
+ - Check error rates
478
+ - Verify sampling behavior
479
+ - Compare SLO metrics
480
+
481
+ 3. **Increase to 50%:**
482
+ ```bash
483
+ bin/deploy production --canary 50%
484
+ ```
485
+
486
+ 4. **Full deployment:**
487
+ ```bash
488
+ bin/deploy production --all
489
+ ```
490
+
491
+ **Post-deployment validation:**
492
+
493
+ 1. **Check sampling effectiveness:**
494
+ ```promql
495
+ # Error spike detection working?
496
+ sum(increase(e11y_sampling_strategy_transitions_total{to_strategy="error_spike"}[1h]))
497
+
498
+ # Load-based adaptation working?
499
+ histogram_quantile(0.99, e11y_sampling_current_rate_bucket)
500
+
501
+ # Value-based sampling working?
502
+ sum(rate(e11y_events_sampled_total{reason="value_based"}[5m]))
503
+ ```
504
+
505
+ 2. **Verify cost savings:**
506
+ ```promql
507
+ # Cost reduction vs baseline
508
+ (baseline_events_per_sec - current_events_per_sec) / baseline_events_per_sec * 100
509
+ ```
510
+
511
+ 3. **Confirm SLO accuracy:**
512
+ ```promql
513
+ # Compare E11y SLO vs raw metrics
514
+ abs(e11y_slo_http_success_rate - raw_http_success_rate) < 0.05
515
+ # Should be < 5% error
516
+ ```
517
+
518
+ ---
519
+
520
+ ## 🔍 Troubleshooting
521
+
522
+ ### Issue 1: Error Spike Not Detected
523
+
524
+ **Symptoms:**
525
+ - Errors occurring, but sampling rate stays at 10%
526
+ - `e11y_sampling_strategy_transitions_total{to_strategy="error_spike"}` is 0
527
+
528
+ **Diagnosis:**
529
+ ```ruby
530
+ # Check error rate:
531
+ E11y::Sampling::ErrorSpikeDetector.new(config).current_error_rate
532
+ # vs
533
+ E11y::Sampling::ErrorSpikeDetector.new(config).baseline_error_rate
534
+
535
+ # Check thresholds:
536
+ config[:absolute_threshold] # e.g., 100 errors/min
537
+ config[:relative_threshold] # e.g., 3.0x baseline
538
+ ```
539
+
540
+ **Fix:**
541
+ - Lower `absolute_threshold` (e.g., 50 errors/min)
542
+ - Lower `relative_threshold` (e.g., 2.0x baseline)
543
+
544
+ ---
545
+
546
+ ### Issue 2: Load-Based Sampling Too Aggressive
547
+
548
+ **Symptoms:**
549
+ - Missing important events during high load
550
+ - Sampling rate drops to 1% too quickly
551
+
552
+ **Diagnosis:**
553
+ ```promql
554
+ # Check current load level:
555
+ e11y_sampling_load_level
556
+
557
+ # Check events per second:
558
+ rate(e11y_events_tracked_total[1m])
559
+ ```
560
+
561
+ **Fix:**
562
+ - Increase thresholds (e.g., `high_threshold: 20_000` instead of 10_000)
563
+ - Add value-based sampling for critical events (they'll be sampled at 100% regardless of load)
564
+
565
+ ---
566
+
567
+ ### Issue 3: Value-Based Sampling Not Working
568
+
569
+ **Symptoms:**
570
+ - High-value events being dropped
571
+ - `e11y_events_sampled_total{reason="value_based"}` is 0
572
+
573
+ **Diagnosis:**
574
+ ```ruby
575
+ # Check if event has value_sampling_config:
576
+ Events::OrderPaid.value_sampling_config
577
+ # Should return ValueSamplingConfig object
578
+
579
+ # Check if value is extracted correctly:
580
+ E11y::Sampling::ValueExtractor.extract({ "amount" => "5000" }, "amount")
581
+ # Should return 5000.0
582
+ ```
583
+
584
+ **Fix:**
585
+ - Verify `sample_by_value` DSL syntax
586
+ - Check field path (use dot notation for nested fields: `"order.amount"`)
587
+ - Ensure numeric values (not strings)
588
+
589
+ ---
590
+
591
+ ### Issue 4: SLO Metrics Inaccurate
592
+
593
+ **Symptoms:**
594
+ - E11y SLO showing 70% success rate, but actual is 95%
595
+ - Correction not being applied
596
+
597
+ **Diagnosis:**
598
+ ```ruby
599
+ # Check if sampling correction enabled:
600
+ E11y.config.slo.enable_sampling_correction
601
+ # Should be true
602
+
603
+ # Check stratified tracker:
604
+ E11y::Sampling::StratifiedTracker.new.sampling_correction(:info)
605
+ # Should return correction factor (e.g., 10.0 for 10% sampling)
606
+ ```
607
+
608
+ **Fix:**
609
+ - Enable `enable_sampling_correction: true` in SLO config
610
+ - Verify sample rates are being recorded (check `event_data[:metadata][:sample_rate]`)
611
+
612
+ ---
613
+
614
+ ## 📈 Expected Results
615
+
616
+ **Before Migration (L2.7):**
617
+ - Fixed 10% sampling
618
+ - 10,000 events/sec × 10% = 1,000 events/sec tracked
619
+ - Cost: $1,000/month
620
+
621
+ **After Migration (L2.8):**
622
+
623
+ | Scenario | Events Tracked | Sampling Rate | Cost Savings |
624
+ |----------|---------------|---------------|--------------|
625
+ | **Normal load** (1k/sec) | 1,000/sec | 100% (load: normal) | 0% (same as before) |
626
+ | **High load** (10k/sec) | 5,000/sec | 50% (load: high) | 50% vs fixed 10% |
627
+ | **Error spike** | 100% | 100% (error spike override) | Better data quality! |
628
+ | **Overload** (100k/sec) | 1,000/sec | 1% (load: overload) | **90% vs fixed 10%** |
629
+
630
+ **Overall Cost Reduction: 35-50% during normal operations, 90% during extreme load.**
631
+
632
+ ---
633
+
634
+ ## 📚 Additional Resources
635
+
636
+ - **[ADR-009: Cost Optimization](../ADR-009-cost-optimization.md)** - Architecture details
637
+ - **[UC-014: Adaptive Sampling](../use_cases/UC-014-adaptive-sampling.md)** - Use case examples
638
+ - **[IMPLEMENTATION_NOTES.md](../IMPLEMENTATION_NOTES.md)** - Implementation details
639
+
640
+ ---
641
+
642
+ ## ✅ Migration Checklist
643
+
644
+ ```
645
+ Phase 1: Preparation
646
+ [ ] Reviewed current sampling config
647
+ [ ] Documented baseline metrics
648
+ [ ] Enabled self-monitoring
649
+
650
+ Phase 2: Error-Based Adaptive
651
+ [ ] Added error spike detection config
652
+ [ ] Deployed to staging
653
+ [ ] Validated error spike behavior
654
+ [ ] Deployed to production (canary)
655
+ [ ] Validated in production
656
+
657
+ Phase 3: Load-Based Adaptive
658
+ [ ] Added load monitor config
659
+ [ ] Tuned thresholds for app
660
+ [ ] Load tested in staging
661
+ [ ] Deployed to production (canary)
662
+ [ ] Validated in production
663
+
664
+ Phase 4: Value-Based Sampling
665
+ [ ] Identified high-value events
666
+ [ ] Added sample_by_value DSL
667
+ [ ] Tested in staging
668
+ [ ] Deployed to production
669
+ [ ] Validated in production
670
+
671
+ Phase 5: Stratified Sampling
672
+ [ ] Enabled SLO sampling correction
673
+ [ ] Validated SLO accuracy
674
+ [ ] Deployed to production
675
+ [ ] Monitored for 7 days
676
+
677
+ Post-Migration
678
+ [ ] Documented cost savings
679
+ [ ] Updated team runbooks
680
+ [ ] Shared learnings with team
681
+ ```
682
+
683
+ ---
684
+
685
+ **Migration Complete! 🎉**
686
+
687
+ You've successfully migrated from L2.7 (Basic Sampling) to L2.8 (Advanced Sampling Strategies).
688
+
689
+ **Next Steps:**
690
+ - Monitor savings over 30 days
691
+ - Fine-tune thresholds based on production data
692
+ - Share success metrics with stakeholders