e11y 0.1.0
- checksums.yaml +7 -0
- data/.rspec +4 -0
- data/.rubocop.yml +69 -0
- data/CHANGELOG.md +26 -0
- data/CODE_OF_CONDUCT.md +64 -0
- data/LICENSE.txt +21 -0
- data/README.md +179 -0
- data/Rakefile +37 -0
- data/benchmarks/run_all.rb +33 -0
- data/config/README.md +83 -0
- data/config/loki-local-config.yaml +35 -0
- data/config/prometheus.yml +15 -0
- data/docker-compose.yml +78 -0
- data/docs/00-ICP-AND-TIMELINE.md +483 -0
- data/docs/01-SCALE-REQUIREMENTS.md +858 -0
- data/docs/ADR-001-architecture.md +2617 -0
- data/docs/ADR-002-metrics-yabeda.md +1395 -0
- data/docs/ADR-003-slo-observability.md +3337 -0
- data/docs/ADR-004-adapter-architecture.md +2385 -0
- data/docs/ADR-005-tracing-context.md +1372 -0
- data/docs/ADR-006-security-compliance.md +4143 -0
- data/docs/ADR-007-opentelemetry-integration.md +1385 -0
- data/docs/ADR-008-rails-integration.md +1911 -0
- data/docs/ADR-009-cost-optimization.md +2993 -0
- data/docs/ADR-010-developer-experience.md +2166 -0
- data/docs/ADR-011-testing-strategy.md +1836 -0
- data/docs/ADR-012-event-evolution.md +958 -0
- data/docs/ADR-013-reliability-error-handling.md +2750 -0
- data/docs/ADR-014-event-driven-slo.md +1533 -0
- data/docs/ADR-015-middleware-order.md +1061 -0
- data/docs/ADR-016-self-monitoring-slo.md +1234 -0
- data/docs/API-REFERENCE-L28.md +914 -0
- data/docs/COMPREHENSIVE-CONFIGURATION.md +2366 -0
- data/docs/IMPLEMENTATION_NOTES.md +2804 -0
- data/docs/IMPLEMENTATION_PLAN.md +1971 -0
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +586 -0
- data/docs/PLAN.md +148 -0
- data/docs/QUICK-START.md +934 -0
- data/docs/README.md +296 -0
- data/docs/design/00-memory-optimization.md +593 -0
- data/docs/guides/MIGRATION-L27-L28.md +692 -0
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +434 -0
- data/docs/guides/README.md +44 -0
- data/docs/prd/01-overview-vision.md +440 -0
- data/docs/use_cases/README.md +119 -0
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +813 -0
- data/docs/use_cases/UC-002-business-event-tracking.md +1953 -0
- data/docs/use_cases/UC-003-pattern-based-metrics.md +1627 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +728 -0
- data/docs/use_cases/UC-005-sentry-integration.md +759 -0
- data/docs/use_cases/UC-006-trace-context-management.md +905 -0
- data/docs/use_cases/UC-007-pii-filtering.md +2648 -0
- data/docs/use_cases/UC-008-opentelemetry-integration.md +1153 -0
- data/docs/use_cases/UC-009-multi-service-tracing.md +1043 -0
- data/docs/use_cases/UC-010-background-job-tracking.md +1018 -0
- data/docs/use_cases/UC-011-rate-limiting.md +1906 -0
- data/docs/use_cases/UC-012-audit-trail.md +2301 -0
- data/docs/use_cases/UC-013-high-cardinality-protection.md +2127 -0
- data/docs/use_cases/UC-014-adaptive-sampling.md +1940 -0
- data/docs/use_cases/UC-015-cost-optimization.md +735 -0
- data/docs/use_cases/UC-016-rails-logger-migration.md +785 -0
- data/docs/use_cases/UC-017-local-development.md +867 -0
- data/docs/use_cases/UC-018-testing-events.md +1081 -0
- data/docs/use_cases/UC-019-tiered-storage-migration.md +562 -0
- data/docs/use_cases/UC-020-event-versioning.md +708 -0
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +956 -0
- data/docs/use_cases/UC-022-event-registry.md +648 -0
- data/docs/use_cases/backlog.md +226 -0
- data/e11y.gemspec +76 -0
- data/lib/e11y/adapters/adaptive_batcher.rb +207 -0
- data/lib/e11y/adapters/audit_encrypted.rb +239 -0
- data/lib/e11y/adapters/base.rb +580 -0
- data/lib/e11y/adapters/file.rb +224 -0
- data/lib/e11y/adapters/in_memory.rb +216 -0
- data/lib/e11y/adapters/loki.rb +333 -0
- data/lib/e11y/adapters/otel_logs.rb +203 -0
- data/lib/e11y/adapters/registry.rb +141 -0
- data/lib/e11y/adapters/sentry.rb +230 -0
- data/lib/e11y/adapters/stdout.rb +108 -0
- data/lib/e11y/adapters/yabeda.rb +370 -0
- data/lib/e11y/buffers/adaptive_buffer.rb +339 -0
- data/lib/e11y/buffers/base_buffer.rb +40 -0
- data/lib/e11y/buffers/request_scoped_buffer.rb +246 -0
- data/lib/e11y/buffers/ring_buffer.rb +267 -0
- data/lib/e11y/buffers.rb +14 -0
- data/lib/e11y/console.rb +122 -0
- data/lib/e11y/current.rb +48 -0
- data/lib/e11y/event/base.rb +894 -0
- data/lib/e11y/event/value_sampling_config.rb +84 -0
- data/lib/e11y/events/base_audit_event.rb +43 -0
- data/lib/e11y/events/base_payment_event.rb +33 -0
- data/lib/e11y/events/rails/cache/delete.rb +21 -0
- data/lib/e11y/events/rails/cache/read.rb +23 -0
- data/lib/e11y/events/rails/cache/write.rb +22 -0
- data/lib/e11y/events/rails/database/query.rb +45 -0
- data/lib/e11y/events/rails/http/redirect.rb +21 -0
- data/lib/e11y/events/rails/http/request.rb +26 -0
- data/lib/e11y/events/rails/http/send_file.rb +21 -0
- data/lib/e11y/events/rails/http/start_processing.rb +26 -0
- data/lib/e11y/events/rails/job/completed.rb +22 -0
- data/lib/e11y/events/rails/job/enqueued.rb +22 -0
- data/lib/e11y/events/rails/job/failed.rb +22 -0
- data/lib/e11y/events/rails/job/scheduled.rb +23 -0
- data/lib/e11y/events/rails/job/started.rb +22 -0
- data/lib/e11y/events/rails/log.rb +56 -0
- data/lib/e11y/events/rails/view/render.rb +23 -0
- data/lib/e11y/events.rb +18 -0
- data/lib/e11y/instruments/active_job.rb +201 -0
- data/lib/e11y/instruments/rails_instrumentation.rb +141 -0
- data/lib/e11y/instruments/sidekiq.rb +175 -0
- data/lib/e11y/logger/bridge.rb +205 -0
- data/lib/e11y/metrics/cardinality_protection.rb +172 -0
- data/lib/e11y/metrics/cardinality_tracker.rb +134 -0
- data/lib/e11y/metrics/registry.rb +234 -0
- data/lib/e11y/metrics/relabeling.rb +226 -0
- data/lib/e11y/metrics.rb +102 -0
- data/lib/e11y/middleware/audit_signing.rb +174 -0
- data/lib/e11y/middleware/base.rb +140 -0
- data/lib/e11y/middleware/event_slo.rb +167 -0
- data/lib/e11y/middleware/pii_filter.rb +266 -0
- data/lib/e11y/middleware/pii_filtering.rb +280 -0
- data/lib/e11y/middleware/rate_limiting.rb +214 -0
- data/lib/e11y/middleware/request.rb +163 -0
- data/lib/e11y/middleware/routing.rb +157 -0
- data/lib/e11y/middleware/sampling.rb +254 -0
- data/lib/e11y/middleware/slo.rb +168 -0
- data/lib/e11y/middleware/trace_context.rb +131 -0
- data/lib/e11y/middleware/validation.rb +118 -0
- data/lib/e11y/middleware/versioning.rb +132 -0
- data/lib/e11y/middleware.rb +12 -0
- data/lib/e11y/pii/patterns.rb +90 -0
- data/lib/e11y/pii.rb +13 -0
- data/lib/e11y/pipeline/builder.rb +155 -0
- data/lib/e11y/pipeline/zone_validator.rb +110 -0
- data/lib/e11y/pipeline.rb +12 -0
- data/lib/e11y/presets/audit_event.rb +65 -0
- data/lib/e11y/presets/debug_event.rb +34 -0
- data/lib/e11y/presets/high_value_event.rb +51 -0
- data/lib/e11y/presets.rb +19 -0
- data/lib/e11y/railtie.rb +138 -0
- data/lib/e11y/reliability/circuit_breaker.rb +216 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +277 -0
- data/lib/e11y/reliability/dlq/filter.rb +117 -0
- data/lib/e11y/reliability/retry_handler.rb +207 -0
- data/lib/e11y/reliability/retry_rate_limiter.rb +117 -0
- data/lib/e11y/sampling/error_spike_detector.rb +225 -0
- data/lib/e11y/sampling/load_monitor.rb +161 -0
- data/lib/e11y/sampling/stratified_tracker.rb +92 -0
- data/lib/e11y/sampling/value_extractor.rb +82 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +79 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +97 -0
- data/lib/e11y/self_monitoring/reliability_monitor.rb +146 -0
- data/lib/e11y/slo/event_driven.rb +150 -0
- data/lib/e11y/slo/tracker.rb +119 -0
- data/lib/e11y/version.rb +9 -0
- data/lib/e11y.rb +283 -0
- metadata +452 -0
|
@@ -0,0 +1,692 @@
|
|
|
1
|
+
# Migration Guide: L2.7 (Basic Sampling) → L2.8 (Advanced Sampling)
|
|
2
|
+
|
|
3
|
+
**Version:** 1.0
|
|
4
|
+
**Date:** January 20, 2026
|
|
5
|
+
**Applies to:** E11y gem v0.8.0+
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## 📋 Overview
|
|
10
|
+
|
|
11
|
+
This guide helps you migrate from **L2.7 (Basic Sampling)** to **L2.8 (Advanced Sampling Strategies)** to unlock:
|
|
12
|
+
|
|
13
|
+
- **Error-Based Adaptive Sampling**: 100% sampling during error spikes
|
|
14
|
+
- **Load-Based Adaptive Sampling**: Tiered sampling (100%/50%/10%/1%) based on system load
|
|
15
|
+
- **Value-Based Sampling**: Always sample high-value events (e.g., >$1000 orders)
|
|
16
|
+
- **Stratified Sampling**: SLO-accurate metrics with < 5% error margin
|
|
17
|
+
|
|
18
|
+
**Cost Savings**: 35-90% reduction in observability costs while maintaining or improving data quality.
|
|
19
|
+
|
|
20
|
+
---
|
|
21
|
+
|
|
22
|
+
## 🚦 Migration Phases
|
|
23
|
+
|
|
24
|
+
### Phase 1: Preparation (1 hour)
|
|
25
|
+
1. Review current sampling config
|
|
26
|
+
2. Run tests to establish baseline
|
|
27
|
+
3. Enable self-monitoring metrics
|
|
28
|
+
|
|
29
|
+
### Phase 2: Enable Error-Based Adaptive (30 minutes)
|
|
30
|
+
1. Add error spike detection config
|
|
31
|
+
2. Deploy to staging
|
|
32
|
+
3. Validate behavior during simulated incidents
|
|
33
|
+
|
|
34
|
+
### Phase 3: Enable Load-Based Adaptive (30 minutes)
|
|
35
|
+
1. Add load monitor config
|
|
36
|
+
2. Deploy to staging
|
|
37
|
+
3. Load test with varying traffic levels
|
|
38
|
+
|
|
39
|
+
### Phase 4: Add Value-Based Sampling (1 hour)
|
|
40
|
+
1. Identify high-value events
|
|
41
|
+
2. Add `sample_by_value` DSL to event classes
|
|
42
|
+
3. Validate in staging
|
|
43
|
+
|
|
44
|
+
### Phase 5: Enable Stratified Sampling (15 minutes)
|
|
45
|
+
1. Enable SLO sampling correction
|
|
46
|
+
2. Validate SLO accuracy
|
|
47
|
+
3. Deploy to production
|
|
48
|
+
|
|
49
|
+
---
|
|
50
|
+
|
|
51
|
+
## 📊 Current State (L2.7 - Basic Sampling)
|
|
52
|
+
|
|
53
|
+
**What You Have:**
|
|
54
|
+
|
|
55
|
+
```ruby
# config/initializers/e11y.rb (L2.7)
E11y.configure do |config|
  # Basic sampling middleware (already in pipeline)
  config.pipeline.use E11y::Middleware::Sampling,
    default_sample_rate: 0.1, # 10% sampling
    trace_aware: true         # Trace-consistent sampling
end

# Event-level sampling
class Events::HighFrequencyEvent < E11y::Event::Base
  sample_rate 0.01 # 1% sampling
end

# Severity-based defaults (automatic)
class Events::ErrorEvent < E11y::Event::Base
  severity :error # → 100% sampling (SEVERITY_SAMPLE_RATES[:error])
end

# Audit event exemption
class Events::AuditEvent < E11y::Event::Base
  audit_event true # Never sampled, always processed
end
```
|
|
79
|
+
|
|
80
|
+
**What's Working:**
|
|
81
|
+
- ✅ Basic sampling (10% default)
|
|
82
|
+
- ✅ Trace-aware sampling (C05 resolution)
|
|
83
|
+
- ✅ Event-level sample rates
|
|
84
|
+
- ✅ Severity-based defaults
|
|
85
|
+
- ✅ Audit event exemption
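
Trace-aware sampling in the list above means that every event belonging to one trace receives the same keep-or-drop decision. The sketch below shows one common way to get that property, by hashing the trace ID into a bucket. `TraceSamplingSketch` and its `keep?` method are hypothetical names used for illustration; the guide does not show how the gem implements this.

```ruby
require "zlib"

# Hypothetical helper, not part of E11y: decides whether to keep an event
# by hashing its trace ID, so every event in one trace gets the same answer.
module TraceSamplingSketch
  BUCKETS = 10_000

  # Returns true when the trace falls inside the sampled fraction.
  def self.keep?(trace_id, sample_rate)
    bucket = Zlib.crc32(trace_id.to_s) % BUCKETS
    bucket < (sample_rate * BUCKETS).round
  end
end

trace_ids = Array.new(1_000) { |i| "trace-#{i}" }
kept = trace_ids.count { |id| TraceSamplingSketch.keep?(id, 0.1) }
puts "kept #{kept} of #{trace_ids.size} traces"
```

Because the decision depends only on the trace ID and the rate, two services that see the same trace reach the same decision without coordinating.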
|
|
86
|
+
|
|
87
|
+
**What's Missing:**
|
|
88
|
+
- ❌ No dynamic adjustment during errors
|
|
89
|
+
- ❌ No load-based adaptation
|
|
90
|
+
- ❌ No value-based prioritization
|
|
91
|
+
- ❌ SLO metrics not corrected for sampling
|
|
92
|
+
|
|
93
|
+
---
|
|
94
|
+
|
|
95
|
+
## 🎯 Target State (L2.8 - Advanced Sampling)
|
|
96
|
+
|
|
97
|
+
**What You'll Have:**
|
|
98
|
+
|
|
99
|
+
```ruby
# config/initializers/e11y.rb (L2.8)
E11y.configure do |config|
  config.pipeline.use E11y::Middleware::Sampling,
    default_sample_rate: 0.1,

    # ✅ NEW: Error-Based Adaptive (FEAT-4838)
    error_based_adaptive: true,
    error_spike_config: {
      window: 60,
      absolute_threshold: 100,
      relative_threshold: 3.0,
      spike_duration: 300
    },

    # ✅ NEW: Load-Based Adaptive (FEAT-4842)
    load_based_adaptive: true,
    load_monitor_config: {
      window: 60,
      normal_threshold: 1_000,
      high_threshold: 10_000,
      very_high_threshold: 50_000,
      overload_threshold: 100_000
    }

  # ✅ NEW: Stratified Sampling for SLO (FEAT-4850)
  config.slo do
    enabled true
    enable_sampling_correction true # Automatic correction
  end
end

# ✅ NEW: Value-Based Sampling (FEAT-4846)
class Events::OrderPaid < E11y::Event::Base
  schema do
    required(:order_id).filled(:string)
    required(:amount).filled(:decimal)
  end

  # Always sample high-value orders
  sample_by_value field: "amount",
                  operator: :greater_than,
                  threshold: 1000,
                  sample_rate: 1.0
end
```
|
|
145
|
+
|
|
146
|
+
**What You'll Gain:**
|
|
147
|
+
- ✅ 100% sampling during error spikes (debug priority)
|
|
148
|
+
- ✅ Cost protection during high load (1-10% sampling)
|
|
149
|
+
- ✅ Business-critical event prioritization
|
|
150
|
+
- ✅ Accurate SLO metrics (< 5% error)
|
|
151
|
+
|
|
152
|
+
---
|
|
153
|
+
|
|
154
|
+
## 🛠️ Step-by-Step Migration
|
|
155
|
+
|
|
156
|
+
### Step 1: Review Current Config
|
|
157
|
+
|
|
158
|
+
**Check your current sampling configuration:**
|
|
159
|
+
|
|
160
|
+
```bash
|
|
161
|
+
# Find current sampling config
|
|
162
|
+
grep -r "Middleware::Sampling" config/initializers/
|
|
163
|
+
grep -r "sample_rate" app/events/
|
|
164
|
+
|
|
165
|
+
# Check event classes with custom sampling
|
|
166
|
+
find app/events -name "*.rb" -exec grep -l "sample_rate" {} \;
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
**Document current behavior:**
|
|
170
|
+
- What's your `default_sample_rate`?
|
|
171
|
+
- Which events have custom `sample_rate`?
|
|
172
|
+
- Are you using `audit_event true`?
|
|
173
|
+
|
|
174
|
+
**Baseline metrics (capture before migration):**
|
|
175
|
+
```ruby
|
|
176
|
+
# Run for 1 hour in production
|
|
177
|
+
# - e11y_events_tracked_total (events/sec)
|
|
178
|
+
# - e11y_events_dropped_total (% dropped)
|
|
179
|
+
# - e11y_slo_http_success_rate (success rate)
|
|
180
|
+
```
|
|
181
|
+
|
|
182
|
+
---
|
|
183
|
+
|
|
184
|
+
### Step 2: Enable Self-Monitoring
|
|
185
|
+
|
|
186
|
+
**Add self-monitoring to track migration effectiveness:**
|
|
187
|
+
|
|
188
|
+
```ruby
|
|
189
|
+
# config/initializers/e11y.rb
|
|
190
|
+
E11y.configure do |config|
|
|
191
|
+
# ... existing config ...
|
|
192
|
+
|
|
193
|
+
# Enable self-monitoring (already included in L2.7)
|
|
194
|
+
config.self_monitoring.enabled = true
|
|
195
|
+
end
|
|
196
|
+
```
|
|
197
|
+
|
|
198
|
+
**Key metrics to watch:**
|
|
199
|
+
- `e11y_middleware_latency_ms` (sampling overhead)
|
|
200
|
+
- `e11y_events_sampled_total` (events kept)
|
|
201
|
+
- `e11y_events_dropped_total` (events dropped)
|
|
202
|
+
|
|
203
|
+
---
|
|
204
|
+
|
|
205
|
+
### Step 3: Enable Error-Based Adaptive Sampling
|
|
206
|
+
|
|
207
|
+
**Add error spike detection config:**
|
|
208
|
+
|
|
209
|
+
```ruby
# config/initializers/e11y.rb
E11y.configure do |config|
  config.pipeline.use E11y::Middleware::Sampling,
    default_sample_rate: 0.1,
    trace_aware: true,

    # NEW: Error-Based Adaptive
    error_based_adaptive: true,
    error_spike_config: {
      window: 60,              # 60 seconds sliding window
      absolute_threshold: 100, # 100 errors/min triggers spike
      relative_threshold: 3.0, # 3x normal rate triggers spike
      spike_duration: 300      # Keep 100% sampling for 5 minutes
    }
end
```
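
To make the four `error_spike_config` settings concrete, here is a minimal sketch of a sliding-window detector. It is not the gem's `ErrorSpikeDetector`: the class name `ErrorSpikeSketch`, the `baseline_per_window` argument, and the choice to treat "reaches the threshold" as a trigger are all assumptions made for illustration.

```ruby
# Hypothetical sketch, not the gem's ErrorSpikeDetector: a sliding-window
# counter that flags a spike on either an absolute or a relative threshold.
class ErrorSpikeSketch
  def initialize(window: 60, absolute_threshold: 100, relative_threshold: 3.0,
                 spike_duration: 300, baseline_per_window: 0.0)
    @window = window
    @absolute_threshold = absolute_threshold
    @relative_threshold = relative_threshold
    @spike_duration = spike_duration
    @baseline = baseline_per_window # assumed to be supplied by the caller
    @timestamps = []
    @spike_until = nil
  end

  # Record one error at time `now` (seconds) and refresh the spike state.
  def record_error(now)
    @timestamps << now
    @timestamps.reject! { |t| t <= now - @window }
    @spike_until = now + @spike_duration if over_threshold?
  end

  # True while a spike is active, including the hold period after it.
  def spike?(now)
    !@spike_until.nil? && now < @spike_until
  end

  # 100% during a spike, otherwise the configured default.
  def sample_rate(now, default_rate)
    spike?(now) ? 1.0 : default_rate
  end

  private

  def over_threshold?
    count = @timestamps.size
    return true if count >= @absolute_threshold

    @baseline.positive? && count >= @baseline * @relative_threshold
  end
end

detector = ErrorSpikeSketch.new
150.times { |i| detector.record_error(i * 0.4) } # 150 errors in 60 seconds
puts detector.sample_rate(60.0, 0.1)  # during the spike
puts detector.sample_rate(500.0, 0.1) # after the hold period
```

Passing timestamps in explicitly keeps the sketch easy to test; a real detector would more likely read a monotonic clock and guard its state with a mutex.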
|
|
226
|
+
|
|
227
|
+
**Deploy to staging:**
|
|
228
|
+
```bash
|
|
229
|
+
# Push config changes
|
|
230
|
+
git add config/initializers/e11y.rb
|
|
231
|
+
git commit -m "feat: enable error-based adaptive sampling (FEAT-4838)"
|
|
232
|
+
git push origin feature/l28-migration
|
|
233
|
+
|
|
234
|
+
# Deploy to staging
|
|
235
|
+
bin/deploy staging
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
**Validate in staging:**
|
|
239
|
+
|
|
240
|
+
1. **Simulate error spike:**
|
|
241
|
+
```ruby
# Ruby, not shell: run it where the app is loaded (e.g. a Rails console on staging)
# Generate 150 errors in 1 minute (exceeds absolute threshold)
150.times { Events::TestError.track(severity: :error) }
```
|
|
245
|
+
|
|
246
|
+
2. **Check Grafana:**
|
|
247
|
+
```promql
|
|
248
|
+
# Should see sampling rate jump to 100%
|
|
249
|
+
e11y_sampling_current_rate{strategy="error_spike"}
|
|
250
|
+
```
|
|
251
|
+
|
|
252
|
+
3. **Verify events captured:**
|
|
253
|
+
```bash
|
|
254
|
+
# Query Loki for events during spike
|
|
255
|
+
# All errors should be present (100% sampling)
|
|
256
|
+
```
|
|
257
|
+
|
|
258
|
+
**Rollback plan:**
|
|
259
|
+
```ruby
|
|
260
|
+
# If issues, disable error-based adaptive:
|
|
261
|
+
E11y.configure do |config|
|
|
262
|
+
config.pipeline.use E11y::Middleware::Sampling,
|
|
263
|
+
default_sample_rate: 0.1,
|
|
264
|
+
error_based_adaptive: false # ← Disable
|
|
265
|
+
end
|
|
266
|
+
```
|
|
267
|
+
|
|
268
|
+
---
|
|
269
|
+
|
|
270
|
+
### Step 4: Enable Load-Based Adaptive Sampling
|
|
271
|
+
|
|
272
|
+
**Add load monitor config:**
|
|
273
|
+
|
|
274
|
+
```ruby
# config/initializers/e11y.rb
E11y.configure do |config|
  config.pipeline.use E11y::Middleware::Sampling,
    default_sample_rate: 0.1,
    error_based_adaptive: true,
    error_spike_config: { ... }, # unchanged from Step 3

    # NEW: Load-Based Adaptive
    load_based_adaptive: true,
    load_monitor_config: {
      window: 60,                  # 60 seconds
      normal_threshold: 1_000,     # < 1k events/sec = normal (100%)
      high_threshold: 10_000,      # 10k events/sec = high (50%)
      very_high_threshold: 50_000, # 50k events/sec = very high (10%)
      overload_threshold: 100_000  # > 100k events/sec = overload (1%)
    }
end
```
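
The tier comments in the config above can be read as a lookup from observed events/sec to a sample rate. The sketch below is one such reading, not the gem's load monitor. It assumes a tier begins once the rate reaches its threshold and that everything below `high_threshold` stays at 100%; the config comments leave the range between `normal_threshold` and `high_threshold` unspecified. `LOAD_TIERS` and `load_tier` are hypothetical names.

```ruby
# Hypothetical sketch of tier selection; the gem's load monitor may draw the
# tier boundaries differently.
LOAD_TIERS = [
  # [minimum events/sec, load level, sample rate]
  [100_000, :overload,  0.01],
  [50_000,  :very_high, 0.10],
  [10_000,  :high,      0.50],
  [0,       :normal,    1.00]
].freeze

# Returns [level, sample_rate] for an observed event rate.
def load_tier(events_per_sec)
  _, level, rate = LOAD_TIERS.find { |minimum, _, _| events_per_sec >= minimum }
  [level, rate]
end

[500, 10_000, 60_000, 250_000].each do |eps|
  level, rate = load_tier(eps)
  puts format("%7d events/sec -> %-9s keep %3d%%", eps, level, (rate * 100).round)
end
```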
|
|
293
|
+
|
|
294
|
+
**Tune thresholds for your app:**
|
|
295
|
+
|
|
296
|
+
```bash
# Check current event rate in production via the Prometheus HTTP API
# (localhost:9090 is a placeholder for your Prometheus server)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(e11y_events_tracked_total[5m]))'

# Adjust thresholds based on your baseline:
# - normal_threshold: 2x baseline
# - high_threshold: 10x baseline
# - very_high_threshold: 50x baseline
# - overload_threshold: 100x baseline
```
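
The baseline multipliers listed above (2x, 10x, 50x, 100x) are simple enough to compute by hand, but a small helper avoids arithmetic slips when the baseline changes. This is an illustrative sketch; `load_monitor_config_for` and `THRESHOLD_MULTIPLIERS` are hypothetical names and not part of E11y.

```ruby
# Multipliers taken from the tuning comments above; the helper itself is
# illustrative and not part of E11y.
THRESHOLD_MULTIPLIERS = {
  normal_threshold: 2,
  high_threshold: 10,
  very_high_threshold: 50,
  overload_threshold: 100
}.freeze

# Builds a load_monitor_config hash from a measured baseline (events/sec).
def load_monitor_config_for(baseline_events_per_sec, window: 60)
  thresholds = THRESHOLD_MULTIPLIERS.transform_values do |multiplier|
    (baseline_events_per_sec * multiplier).round
  end
  { window: window }.merge(thresholds)
end

puts load_monitor_config_for(500).inspect
```

For a baseline of 500 events/sec this yields thresholds of 1,000, 5,000, 25,000 and 50,000 events/sec.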
|
|
306
|
+
|
|
307
|
+
**Load test in staging:**
|
|
308
|
+
|
|
309
|
+
```bash
|
|
310
|
+
# Simulate high load with wrk
|
|
311
|
+
wrk -t12 -c400 -d30s --latency https://staging.example.com/api/orders
|
|
312
|
+
|
|
313
|
+
# Watch sampling rate adjust in Grafana:
|
|
314
|
+
# - Low load: 100%
|
|
315
|
+
# - High load: 50%
|
|
316
|
+
# - Very high: 10%
|
|
317
|
+
# - Overload: 1%
|
|
318
|
+
```
|
|
319
|
+
|
|
320
|
+
**Monitor performance:**
|
|
321
|
+
```promql
|
|
322
|
+
# Check if load-based sampling is working
|
|
323
|
+
e11y_sampling_current_rate{strategy="load_based"}
|
|
324
|
+
|
|
325
|
+
# Verify cost savings
|
|
326
|
+
sum(rate(e11y_events_dropped_total[5m])) / sum(rate(e11y_events_tracked_total[5m]))
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
---
|
|
330
|
+
|
|
331
|
+
### Step 5: Add Value-Based Sampling
|
|
332
|
+
|
|
333
|
+
**Identify high-value events:**
|
|
334
|
+
|
|
335
|
+
1. **Business-critical events:**
|
|
336
|
+
- Payment transactions
|
|
337
|
+
- Order completions
|
|
338
|
+
- User registrations
|
|
339
|
+
|
|
340
|
+
2. **High-value thresholds:**
|
|
341
|
+
- Orders > $1000
|
|
342
|
+
- Enterprise/VIP users
|
|
343
|
+
- Critical API endpoints
|
|
344
|
+
|
|
345
|
+
**Add `sample_by_value` to event classes:**
|
|
346
|
+
|
|
347
|
+
```ruby
# app/events/order_paid.rb
class Events::OrderPaid < E11y::Event::Base
  schema do
    required(:order_id).filled(:string)
    required(:amount).filled(:decimal)
    required(:user_segment).filled(:string)
  end

  # Always sample high-value orders
  sample_by_value field: "amount",
                  operator: :greater_than,
                  threshold: 1000,
                  sample_rate: 1.0

  # Always sample enterprise users
  sample_by_value field: "user_segment",
                  operator: :equals,
                  threshold: "enterprise",
                  sample_rate: 1.0
end

# app/events/api_request.rb
class Events::ApiRequest < E11y::Event::Base
  schema do
    required(:endpoint).filled(:string)
    required(:latency_ms).filled(:integer)
  end

  # Always sample slow requests (>1000ms)
  sample_by_value field: "latency_ms",
                  operator: :greater_than,
                  threshold: 1000,
                  sample_rate: 1.0
end
```
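
The `sample_by_value` rules above boil down to "look up a field, compare it, and pick a rate". The sketch below mimics that evaluation in plain Ruby so the matching logic can be tried without the gem. `ValueRule` and `value_sample_rate` are hypothetical names, and the first-match-wins order and the numeric coercion are assumptions; the gem's own classes may behave differently.

```ruby
# Hypothetical re-implementation for illustration; the gem's value sampling
# classes may behave differently.
ValueRule = Struct.new(:field, :operator, :threshold, :sample_rate) do
  # True when the payload value at `field` satisfies the rule.
  def match?(payload)
    value = lookup(payload, field)
    return false if value.nil?

    case operator
    when :greater_than then value.to_f > threshold.to_f
    when :equals       then value == threshold
    else false
    end
  end

  private

  # Resolves dot notation such as "order.amount" against string or symbol keys.
  def lookup(payload, path)
    path.split(".").reduce(payload) do |node, key|
      return nil unless node.is_a?(Hash)

      node.fetch(key) { node[key.to_sym] }
    end
  end
end

# First matching rule wins; otherwise fall back to the rate chosen elsewhere.
def value_sample_rate(rules, payload, fallback_rate)
  rule = rules.find { |r| r.match?(payload) }
  rule ? rule.sample_rate : fallback_rate
end

rules = [
  ValueRule.new("amount", :greater_than, 1000, 1.0),
  ValueRule.new("user_segment", :equals, "enterprise", 1.0)
]
puts value_sample_rate(rules, { amount: 5000, user_segment: "free" }, 0.1)
puts value_sample_rate(rules, { amount: 50, user_segment: "free" }, 0.1)
```

The fallback argument stands in for whatever rate the load-based or default strategy would otherwise choose.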
|
|
383
|
+
|
|
384
|
+
**Test in staging:**
|
|
385
|
+
|
|
386
|
+
```ruby
|
|
387
|
+
# High-value order → Always sampled
|
|
388
|
+
Events::OrderPaid.track(
|
|
389
|
+
order_id: "123",
|
|
390
|
+
amount: 5000, # > $1000 → 100% sampled
|
|
391
|
+
user_segment: "enterprise"
|
|
392
|
+
)
|
|
393
|
+
|
|
394
|
+
# Low-value order → Falls back to load-based sampling
|
|
395
|
+
Events::OrderPaid.track(
|
|
396
|
+
order_id: "456",
|
|
397
|
+
amount: 50, # < $1000 → load-based rate
|
|
398
|
+
user_segment: "free"
|
|
399
|
+
)
|
|
400
|
+
```
|
|
401
|
+
|
|
402
|
+
**Validate in Grafana:**
|
|
403
|
+
```promql
# Check value-based sampling rate
e11y_sampling_decisions_total{decision="kept", reason="value_based"}

# Verify high-value events never dropped.
# Label matchers compare strings, so this assumes a bucketed `amount` label
# whose value is literally ">1000".
rate(e11y_events_dropped_total{event_name="order.paid", amount=">1000"}[5m])
# Should be 0!
```
|
|
411
|
+
|
|
412
|
+
---
|
|
413
|
+
|
|
414
|
+
### Step 6: Enable Stratified Sampling for SLO
|
|
415
|
+
|
|
416
|
+
**Enable SLO sampling correction:**
|
|
417
|
+
|
|
418
|
+
```ruby
|
|
419
|
+
# config/initializers/e11y.rb
|
|
420
|
+
E11y.configure do |config|
|
|
421
|
+
# ... existing sampling config ...
|
|
422
|
+
|
|
423
|
+
# NEW: SLO with sampling correction
|
|
424
|
+
config.slo do
|
|
425
|
+
enabled true
|
|
426
|
+
enable_sampling_correction true # Automatic correction
|
|
427
|
+
end
|
|
428
|
+
end
|
|
429
|
+
```
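
Sampling correction matters because errors and successes are kept at different rates, which skews any ratio computed from kept events alone. The sketch below shows the usual inverse-probability weighting idea: each kept event counts as `1 / sample_rate` original events. The helper names and the event hashes are made up for the example, and the gem's stratified tracker may compute its correction differently.

```ruby
# Each kept event stands in for 1 / sample_rate original events.
# The event hashes below are made up for the example.
def corrected_success_rate(kept_events)
  total = kept_events.sum { |e| 1.0 / e[:sample_rate] }
  good  = kept_events.select { |e| e[:success] }.sum { |e| 1.0 / e[:sample_rate] }
  total.zero? ? nil : good / total
end

# Naive ratio over kept events, with no correction applied.
def raw_success_rate(kept_events)
  return nil if kept_events.empty?

  kept_events.count { |e| e[:success] }.to_f / kept_events.size
end

# 950 successes sampled at 10% (95 kept) and 50 errors sampled at 100% (50 kept).
kept = Array.new(95) { { success: true,  sample_rate: 0.1 } } +
       Array.new(50) { { success: false, sample_rate: 1.0 } }

puts format("raw:       %.1f%%", raw_success_rate(kept) * 100)
puts format("corrected: %.1f%%", corrected_success_rate(kept) * 100)
```

With these inputs the raw ratio is 95 / 145, about 65.5%, while the weighted ratio recovers 950 / 1000, or 95.0%.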
|
|
430
|
+
|
|
431
|
+
**Validate SLO accuracy:**
|
|
432
|
+
|
|
433
|
+
```promql
# Generate test traffic with known success rate
# - 950 successful requests (95%)
# - 50 failed requests (5%)

# Check corrected SLO in Grafana:
e11y_slo_http_success_rate

# Should be 95.0% (±0.5%), even with aggressive sampling!
```
|
|
443
|
+
|
|
444
|
+
**Compare with/without correction:**
|
|
445
|
+
|
|
446
|
+
```promql
|
|
447
|
+
# Without correction (raw metrics):
|
|
448
|
+
sum(rate(http_requests_total{status="200"}[5m]))
|
|
449
|
+
/
|
|
450
|
+
sum(rate(http_requests_total[5m]))
|
|
451
|
+
# May show 60-70% (biased by sampling)
|
|
452
|
+
|
|
453
|
+
# With correction (E11y SLO):
|
|
454
|
+
e11y_slo_http_success_rate
|
|
455
|
+
# Shows 95.0% (accurate!)
|
|
456
|
+
```
|
|
457
|
+
|
|
458
|
+
---
|
|
459
|
+
|
|
460
|
+
### Step 7: Production Deployment
|
|
461
|
+
|
|
462
|
+
**Pre-deployment checklist:**
|
|
463
|
+
- ✅ All strategies tested in staging
|
|
464
|
+
- ✅ Thresholds tuned for your app
|
|
465
|
+
- ✅ Rollback plan documented
|
|
466
|
+
- ✅ Monitoring dashboard updated
|
|
467
|
+
- ✅ Team notified of changes
|
|
468
|
+
|
|
469
|
+
**Gradual rollout:**
|
|
470
|
+
|
|
471
|
+
1. **Deploy to canary (10% of traffic):**
|
|
472
|
+
```bash
|
|
473
|
+
bin/deploy production --canary 10%
|
|
474
|
+
```
|
|
475
|
+
|
|
476
|
+
2. **Monitor for 1 hour:**
|
|
477
|
+
- Check error rates
|
|
478
|
+
- Verify sampling behavior
|
|
479
|
+
- Compare SLO metrics
|
|
480
|
+
|
|
481
|
+
3. **Increase to 50%:**
|
|
482
|
+
```bash
|
|
483
|
+
bin/deploy production --canary 50%
|
|
484
|
+
```
|
|
485
|
+
|
|
486
|
+
4. **Full deployment:**
|
|
487
|
+
```bash
|
|
488
|
+
bin/deploy production --all
|
|
489
|
+
```
|
|
490
|
+
|
|
491
|
+
**Post-deployment validation:**
|
|
492
|
+
|
|
493
|
+
1. **Check sampling effectiveness:**
|
|
494
|
+
```promql
# Error spike detection working?
sum(increase(e11y_sampling_strategy_transitions_total{to_strategy="error_spike"}[1h]))

# Load-based adaptation working? (same gauge as in Step 4)
e11y_sampling_current_rate{strategy="load_based"}

# Value-based sampling working?
sum(rate(e11y_events_sampled_total{reason="value_based"}[5m]))
```
|
|
504
|
+
|
|
505
|
+
2. **Verify cost savings:**
|
|
506
|
+
```promql
|
|
507
|
+
# Cost reduction vs baseline
|
|
508
|
+
(baseline_events_per_sec - current_events_per_sec) / baseline_events_per_sec * 100
|
|
509
|
+
```
|
|
510
|
+
|
|
511
|
+
3. **Confirm SLO accuracy:**
|
|
512
|
+
```promql
|
|
513
|
+
# Compare E11y SLO vs raw metrics
|
|
514
|
+
abs(e11y_slo_http_success_rate - raw_http_success_rate) < 0.05
|
|
515
|
+
# Should be < 5% error
|
|
516
|
+
```
|
|
517
|
+
|
|
518
|
+
---
|
|
519
|
+
|
|
520
|
+
## 🔍 Troubleshooting
|
|
521
|
+
|
|
522
|
+
### Issue 1: Error Spike Not Detected
|
|
523
|
+
|
|
524
|
+
**Symptoms:**
|
|
525
|
+
- Errors occurring, but sampling rate stays at 10%
|
|
526
|
+
- `e11y_sampling_strategy_transitions_total{to_strategy="error_spike"}` is 0
|
|
527
|
+
|
|
528
|
+
**Diagnosis:**
|
|
529
|
+
```ruby
|
|
530
|
+
# Check error rate:
|
|
531
|
+
E11y::Sampling::ErrorSpikeDetector.new(config).current_error_rate
|
|
532
|
+
# vs
|
|
533
|
+
E11y::Sampling::ErrorSpikeDetector.new(config).baseline_error_rate
|
|
534
|
+
|
|
535
|
+
# Check thresholds:
|
|
536
|
+
config[:absolute_threshold] # e.g., 100 errors/min
|
|
537
|
+
config[:relative_threshold] # e.g., 3.0x baseline
|
|
538
|
+
```
|
|
539
|
+
|
|
540
|
+
**Fix:**
|
|
541
|
+
- Lower `absolute_threshold` (e.g., 50 errors/min)
|
|
542
|
+
- Lower `relative_threshold` (e.g., 2.0x baseline)
|
|
543
|
+
|
|
544
|
+
---
|
|
545
|
+
|
|
546
|
+
### Issue 2: Load-Based Sampling Too Aggressive
|
|
547
|
+
|
|
548
|
+
**Symptoms:**
|
|
549
|
+
- Missing important events during high load
|
|
550
|
+
- Sampling rate drops to 1% too quickly
|
|
551
|
+
|
|
552
|
+
**Diagnosis:**
|
|
553
|
+
```promql
|
|
554
|
+
# Check current load level:
|
|
555
|
+
e11y_sampling_load_level
|
|
556
|
+
|
|
557
|
+
# Check events per second:
|
|
558
|
+
rate(e11y_events_tracked_total[1m])
|
|
559
|
+
```
|
|
560
|
+
|
|
561
|
+
**Fix:**
|
|
562
|
+
- Increase thresholds (e.g., `high_threshold: 20_000` instead of 10_000)
|
|
563
|
+
- Add value-based sampling for critical events (they'll be sampled at 100% regardless of load)
|
|
564
|
+
|
|
565
|
+
---
|
|
566
|
+
|
|
567
|
+
### Issue 3: Value-Based Sampling Not Working
|
|
568
|
+
|
|
569
|
+
**Symptoms:**
|
|
570
|
+
- High-value events being dropped
|
|
571
|
+
- `e11y_events_sampled_total{reason="value_based"}` is 0
|
|
572
|
+
|
|
573
|
+
**Diagnosis:**
|
|
574
|
+
```ruby
|
|
575
|
+
# Check if event has value_sampling_config:
|
|
576
|
+
Events::OrderPaid.value_sampling_config
|
|
577
|
+
# Should return ValueSamplingConfig object
|
|
578
|
+
|
|
579
|
+
# Check if value is extracted correctly:
|
|
580
|
+
E11y::Sampling::ValueExtractor.extract({ "amount" => "5000" }, "amount")
|
|
581
|
+
# Should return 5000.0
|
|
582
|
+
```
|
|
583
|
+
|
|
584
|
+
**Fix:**
|
|
585
|
+
- Verify `sample_by_value` DSL syntax
|
|
586
|
+
- Check field path (use dot notation for nested fields: `"order.amount"`)
|
|
587
|
+
- Ensure numeric values (not strings)
|
|
588
|
+
|
|
589
|
+
---
|
|
590
|
+
|
|
591
|
+
### Issue 4: SLO Metrics Inaccurate
|
|
592
|
+
|
|
593
|
+
**Symptoms:**
|
|
594
|
+
- E11y SLO showing 70% success rate, but actual is 95%
|
|
595
|
+
- Correction not being applied
|
|
596
|
+
|
|
597
|
+
**Diagnosis:**
|
|
598
|
+
```ruby
|
|
599
|
+
# Check if sampling correction enabled:
|
|
600
|
+
E11y.config.slo.enable_sampling_correction
|
|
601
|
+
# Should be true
|
|
602
|
+
|
|
603
|
+
# Check stratified tracker:
|
|
604
|
+
E11y::Sampling::StratifiedTracker.new.sampling_correction(:info)
|
|
605
|
+
# Should return correction factor (e.g., 10.0 for 10% sampling)
|
|
606
|
+
```
|
|
607
|
+
|
|
608
|
+
**Fix:**
|
|
609
|
+
- Set `enable_sampling_correction true` inside the `config.slo` block
|
|
610
|
+
- Verify sample rates are being recorded (check `event_data[:metadata][:sample_rate]`)
|
|
611
|
+
|
|
612
|
+
---
|
|
613
|
+
|
|
614
|
+
## 📈 Expected Results
|
|
615
|
+
|
|
616
|
+
**Before Migration (L2.7):**
|
|
617
|
+
- Fixed 10% sampling
|
|
618
|
+
- 10,000 events/sec × 10% = 1,000 events/sec tracked
|
|
619
|
+
- Cost: $1,000/month
|
|
620
|
+
|
|
621
|
+
**After Migration (L2.8):**
|
|
622
|
+
|
|
623
|
+
| Scenario | Events Tracked | Sampling Rate | Cost Savings |
|----------|---------------|---------------|--------------|
| **Normal load** (1k/sec) | 1,000/sec | 100% (load: normal) | 0% (same 1,000/sec as the L2.7 baseline) |
| **High load** (10k/sec) | 5,000/sec | 50% (load: high) | 50% vs. unsampled; fixed 10% would keep 1,000/sec |
| **Error spike** | All events | 100% (error spike override) | Better data quality! |
| **Overload** (100k/sec) | 1,000/sec | 1% (load: overload) | **90% vs fixed 10%** |
|
|
629
|
+
|
|
630
|
+
**Overall Cost Reduction: 35-50% during normal operations, 90% during extreme load.**
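
Savings figures like these depend on which baseline they are measured against, so it helps to recompute them for your own traffic. The sketch below compares the tiered rates from this guide with a fixed 10% rate at the same incoming volume. It is illustrative arithmetic only; `compare` and `FIXED_RATE` are hypothetical names.

```ruby
# Illustrative arithmetic only: compares the tiered rates from this guide
# with a fixed 10% rate at a given incoming volume.
FIXED_RATE = 0.10

def compare(incoming_per_sec, adaptive_rate)
  adaptive = incoming_per_sec * adaptive_rate
  fixed    = incoming_per_sec * FIXED_RATE
  change   = (adaptive - fixed) / fixed * 100 # negative means fewer events kept
  { adaptive: adaptive.round, fixed: fixed.round, change_pct: change.round }
end

{ normal: [1_000, 1.0], high: [10_000, 0.5], overload: [100_000, 0.01] }.each do |name, (eps, rate)|
  puts "#{name}: #{compare(eps, rate)}"
end
```

A negative `change_pct` means the adaptive rate keeps fewer events than fixed 10% at that volume, which is where the savings come from; a positive value means more data is kept, which raises cost but improves coverage.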
|
|
631
|
+
|
|
632
|
+
---
|
|
633
|
+
|
|
634
|
+
## 📚 Additional Resources
|
|
635
|
+
|
|
636
|
+
- **[ADR-009: Cost Optimization](../ADR-009-cost-optimization.md)** - Architecture details
|
|
637
|
+
- **[UC-014: Adaptive Sampling](../use_cases/UC-014-adaptive-sampling.md)** - Use case examples
|
|
638
|
+
- **[IMPLEMENTATION_NOTES.md](../IMPLEMENTATION_NOTES.md)** - Implementation details
|
|
639
|
+
|
|
640
|
+
---
|
|
641
|
+
|
|
642
|
+
## ✅ Migration Checklist
|
|
643
|
+
|
|
644
|
+
```
|
|
645
|
+
Phase 1: Preparation
|
|
646
|
+
[ ] Reviewed current sampling config
|
|
647
|
+
[ ] Documented baseline metrics
|
|
648
|
+
[ ] Enabled self-monitoring
|
|
649
|
+
|
|
650
|
+
Phase 2: Error-Based Adaptive
|
|
651
|
+
[ ] Added error spike detection config
|
|
652
|
+
[ ] Deployed to staging
|
|
653
|
+
[ ] Validated error spike behavior
|
|
654
|
+
[ ] Deployed to production (canary)
|
|
655
|
+
[ ] Validated in production
|
|
656
|
+
|
|
657
|
+
Phase 3: Load-Based Adaptive
|
|
658
|
+
[ ] Added load monitor config
|
|
659
|
+
[ ] Tuned thresholds for app
|
|
660
|
+
[ ] Load tested in staging
|
|
661
|
+
[ ] Deployed to production (canary)
|
|
662
|
+
[ ] Validated in production
|
|
663
|
+
|
|
664
|
+
Phase 4: Value-Based Sampling
|
|
665
|
+
[ ] Identified high-value events
|
|
666
|
+
[ ] Added sample_by_value DSL
|
|
667
|
+
[ ] Tested in staging
|
|
668
|
+
[ ] Deployed to production
|
|
669
|
+
[ ] Validated in production
|
|
670
|
+
|
|
671
|
+
Phase 5: Stratified Sampling
|
|
672
|
+
[ ] Enabled SLO sampling correction
|
|
673
|
+
[ ] Validated SLO accuracy
|
|
674
|
+
[ ] Deployed to production
|
|
675
|
+
[ ] Monitored for 7 days
|
|
676
|
+
|
|
677
|
+
Post-Migration
|
|
678
|
+
[ ] Documented cost savings
|
|
679
|
+
[ ] Updated team runbooks
|
|
680
|
+
[ ] Shared learnings with team
|
|
681
|
+
```
|
|
682
|
+
|
|
683
|
+
---
|
|
684
|
+
|
|
685
|
+
**Migration Complete! 🎉**
|
|
686
|
+
|
|
687
|
+
You've successfully migrated from L2.7 (Basic Sampling) to L2.8 (Advanced Sampling Strategies).
|
|
688
|
+
|
|
689
|
+
**Next Steps:**
|
|
690
|
+
- Monitor savings over 30 days
|
|
691
|
+
- Fine-tune thresholds based on production data
|
|
692
|
+
- Share success metrics with stakeholders
|