e11y 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (157)
  1. checksums.yaml +7 -0
  2. data/.rspec +4 -0
  3. data/.rubocop.yml +69 -0
  4. data/CHANGELOG.md +26 -0
  5. data/CODE_OF_CONDUCT.md +64 -0
  6. data/LICENSE.txt +21 -0
  7. data/README.md +179 -0
  8. data/Rakefile +37 -0
  9. data/benchmarks/run_all.rb +33 -0
  10. data/config/README.md +83 -0
  11. data/config/loki-local-config.yaml +35 -0
  12. data/config/prometheus.yml +15 -0
  13. data/docker-compose.yml +78 -0
  14. data/docs/00-ICP-AND-TIMELINE.md +483 -0
  15. data/docs/01-SCALE-REQUIREMENTS.md +858 -0
  16. data/docs/ADR-001-architecture.md +2617 -0
  17. data/docs/ADR-002-metrics-yabeda.md +1395 -0
  18. data/docs/ADR-003-slo-observability.md +3337 -0
  19. data/docs/ADR-004-adapter-architecture.md +2385 -0
  20. data/docs/ADR-005-tracing-context.md +1372 -0
  21. data/docs/ADR-006-security-compliance.md +4143 -0
  22. data/docs/ADR-007-opentelemetry-integration.md +1385 -0
  23. data/docs/ADR-008-rails-integration.md +1911 -0
  24. data/docs/ADR-009-cost-optimization.md +2993 -0
  25. data/docs/ADR-010-developer-experience.md +2166 -0
  26. data/docs/ADR-011-testing-strategy.md +1836 -0
  27. data/docs/ADR-012-event-evolution.md +958 -0
  28. data/docs/ADR-013-reliability-error-handling.md +2750 -0
  29. data/docs/ADR-014-event-driven-slo.md +1533 -0
  30. data/docs/ADR-015-middleware-order.md +1061 -0
  31. data/docs/ADR-016-self-monitoring-slo.md +1234 -0
  32. data/docs/API-REFERENCE-L28.md +914 -0
  33. data/docs/COMPREHENSIVE-CONFIGURATION.md +2366 -0
  34. data/docs/IMPLEMENTATION_NOTES.md +2804 -0
  35. data/docs/IMPLEMENTATION_PLAN.md +1971 -0
  36. data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +586 -0
  37. data/docs/PLAN.md +148 -0
  38. data/docs/QUICK-START.md +934 -0
  39. data/docs/README.md +296 -0
  40. data/docs/design/00-memory-optimization.md +593 -0
  41. data/docs/guides/MIGRATION-L27-L28.md +692 -0
  42. data/docs/guides/PERFORMANCE-BENCHMARKS.md +434 -0
  43. data/docs/guides/README.md +44 -0
  44. data/docs/prd/01-overview-vision.md +440 -0
  45. data/docs/use_cases/README.md +119 -0
  46. data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +813 -0
  47. data/docs/use_cases/UC-002-business-event-tracking.md +1953 -0
  48. data/docs/use_cases/UC-003-pattern-based-metrics.md +1627 -0
  49. data/docs/use_cases/UC-004-zero-config-slo-tracking.md +728 -0
  50. data/docs/use_cases/UC-005-sentry-integration.md +759 -0
  51. data/docs/use_cases/UC-006-trace-context-management.md +905 -0
  52. data/docs/use_cases/UC-007-pii-filtering.md +2648 -0
  53. data/docs/use_cases/UC-008-opentelemetry-integration.md +1153 -0
  54. data/docs/use_cases/UC-009-multi-service-tracing.md +1043 -0
  55. data/docs/use_cases/UC-010-background-job-tracking.md +1018 -0
  56. data/docs/use_cases/UC-011-rate-limiting.md +1906 -0
  57. data/docs/use_cases/UC-012-audit-trail.md +2301 -0
  58. data/docs/use_cases/UC-013-high-cardinality-protection.md +2127 -0
  59. data/docs/use_cases/UC-014-adaptive-sampling.md +1940 -0
  60. data/docs/use_cases/UC-015-cost-optimization.md +735 -0
  61. data/docs/use_cases/UC-016-rails-logger-migration.md +785 -0
  62. data/docs/use_cases/UC-017-local-development.md +867 -0
  63. data/docs/use_cases/UC-018-testing-events.md +1081 -0
  64. data/docs/use_cases/UC-019-tiered-storage-migration.md +562 -0
  65. data/docs/use_cases/UC-020-event-versioning.md +708 -0
  66. data/docs/use_cases/UC-021-error-handling-retry-dlq.md +956 -0
  67. data/docs/use_cases/UC-022-event-registry.md +648 -0
  68. data/docs/use_cases/backlog.md +226 -0
  69. data/e11y.gemspec +76 -0
  70. data/lib/e11y/adapters/adaptive_batcher.rb +207 -0
  71. data/lib/e11y/adapters/audit_encrypted.rb +239 -0
  72. data/lib/e11y/adapters/base.rb +580 -0
  73. data/lib/e11y/adapters/file.rb +224 -0
  74. data/lib/e11y/adapters/in_memory.rb +216 -0
  75. data/lib/e11y/adapters/loki.rb +333 -0
  76. data/lib/e11y/adapters/otel_logs.rb +203 -0
  77. data/lib/e11y/adapters/registry.rb +141 -0
  78. data/lib/e11y/adapters/sentry.rb +230 -0
  79. data/lib/e11y/adapters/stdout.rb +108 -0
  80. data/lib/e11y/adapters/yabeda.rb +370 -0
  81. data/lib/e11y/buffers/adaptive_buffer.rb +339 -0
  82. data/lib/e11y/buffers/base_buffer.rb +40 -0
  83. data/lib/e11y/buffers/request_scoped_buffer.rb +246 -0
  84. data/lib/e11y/buffers/ring_buffer.rb +267 -0
  85. data/lib/e11y/buffers.rb +14 -0
  86. data/lib/e11y/console.rb +122 -0
  87. data/lib/e11y/current.rb +48 -0
  88. data/lib/e11y/event/base.rb +894 -0
  89. data/lib/e11y/event/value_sampling_config.rb +84 -0
  90. data/lib/e11y/events/base_audit_event.rb +43 -0
  91. data/lib/e11y/events/base_payment_event.rb +33 -0
  92. data/lib/e11y/events/rails/cache/delete.rb +21 -0
  93. data/lib/e11y/events/rails/cache/read.rb +23 -0
  94. data/lib/e11y/events/rails/cache/write.rb +22 -0
  95. data/lib/e11y/events/rails/database/query.rb +45 -0
  96. data/lib/e11y/events/rails/http/redirect.rb +21 -0
  97. data/lib/e11y/events/rails/http/request.rb +26 -0
  98. data/lib/e11y/events/rails/http/send_file.rb +21 -0
  99. data/lib/e11y/events/rails/http/start_processing.rb +26 -0
  100. data/lib/e11y/events/rails/job/completed.rb +22 -0
  101. data/lib/e11y/events/rails/job/enqueued.rb +22 -0
  102. data/lib/e11y/events/rails/job/failed.rb +22 -0
  103. data/lib/e11y/events/rails/job/scheduled.rb +23 -0
  104. data/lib/e11y/events/rails/job/started.rb +22 -0
  105. data/lib/e11y/events/rails/log.rb +56 -0
  106. data/lib/e11y/events/rails/view/render.rb +23 -0
  107. data/lib/e11y/events.rb +18 -0
  108. data/lib/e11y/instruments/active_job.rb +201 -0
  109. data/lib/e11y/instruments/rails_instrumentation.rb +141 -0
  110. data/lib/e11y/instruments/sidekiq.rb +175 -0
  111. data/lib/e11y/logger/bridge.rb +205 -0
  112. data/lib/e11y/metrics/cardinality_protection.rb +172 -0
  113. data/lib/e11y/metrics/cardinality_tracker.rb +134 -0
  114. data/lib/e11y/metrics/registry.rb +234 -0
  115. data/lib/e11y/metrics/relabeling.rb +226 -0
  116. data/lib/e11y/metrics.rb +102 -0
  117. data/lib/e11y/middleware/audit_signing.rb +174 -0
  118. data/lib/e11y/middleware/base.rb +140 -0
  119. data/lib/e11y/middleware/event_slo.rb +167 -0
  120. data/lib/e11y/middleware/pii_filter.rb +266 -0
  121. data/lib/e11y/middleware/pii_filtering.rb +280 -0
  122. data/lib/e11y/middleware/rate_limiting.rb +214 -0
  123. data/lib/e11y/middleware/request.rb +163 -0
  124. data/lib/e11y/middleware/routing.rb +157 -0
  125. data/lib/e11y/middleware/sampling.rb +254 -0
  126. data/lib/e11y/middleware/slo.rb +168 -0
  127. data/lib/e11y/middleware/trace_context.rb +131 -0
  128. data/lib/e11y/middleware/validation.rb +118 -0
  129. data/lib/e11y/middleware/versioning.rb +132 -0
  130. data/lib/e11y/middleware.rb +12 -0
  131. data/lib/e11y/pii/patterns.rb +90 -0
  132. data/lib/e11y/pii.rb +13 -0
  133. data/lib/e11y/pipeline/builder.rb +155 -0
  134. data/lib/e11y/pipeline/zone_validator.rb +110 -0
  135. data/lib/e11y/pipeline.rb +12 -0
  136. data/lib/e11y/presets/audit_event.rb +65 -0
  137. data/lib/e11y/presets/debug_event.rb +34 -0
  138. data/lib/e11y/presets/high_value_event.rb +51 -0
  139. data/lib/e11y/presets.rb +19 -0
  140. data/lib/e11y/railtie.rb +138 -0
  141. data/lib/e11y/reliability/circuit_breaker.rb +216 -0
  142. data/lib/e11y/reliability/dlq/file_storage.rb +277 -0
  143. data/lib/e11y/reliability/dlq/filter.rb +117 -0
  144. data/lib/e11y/reliability/retry_handler.rb +207 -0
  145. data/lib/e11y/reliability/retry_rate_limiter.rb +117 -0
  146. data/lib/e11y/sampling/error_spike_detector.rb +225 -0
  147. data/lib/e11y/sampling/load_monitor.rb +161 -0
  148. data/lib/e11y/sampling/stratified_tracker.rb +92 -0
  149. data/lib/e11y/sampling/value_extractor.rb +82 -0
  150. data/lib/e11y/self_monitoring/buffer_monitor.rb +79 -0
  151. data/lib/e11y/self_monitoring/performance_monitor.rb +97 -0
  152. data/lib/e11y/self_monitoring/reliability_monitor.rb +146 -0
  153. data/lib/e11y/slo/event_driven.rb +150 -0
  154. data/lib/e11y/slo/tracker.rb +119 -0
  155. data/lib/e11y/version.rb +9 -0
  156. data/lib/e11y.rb +283 -0
  157. metadata +452 -0
@@ -0,0 +1,1533 @@
1
+ # ADR-014: Event-Driven SLO
2
+
3
+ **Status:** Draft
4
+ **Date:** January 13, 2026
5
+ **Covers:** Integration between Event System (ADR-001) and SLO (ADR-003)
6
+ **Depends On:** ADR-001 (Core), ADR-002 (Metrics), ADR-003 (SLO)
7
+
8
+ **Related ADRs:**
9
+ - 📊 **ADR-003: SLO & Observability** - HTTP/Job SLO (infrastructure reliability)
10
+ - 🔗 **Integration:** See `ADR-003-014-INTEGRATION.md` for detailed integration analysis
11
+
12
+ ---
13
+
14
+ ## 🔍 Scope of This ADR
15
+
16
+ This ADR covers **Event-based SLO** (business logic reliability):
17
+ - ✅ Custom SLO based on E11y Events (e.g., `Events::PaymentProcessed`)
18
+ - ✅ Explicit opt-in via `slo { enabled true }` in Event class
19
+ - ✅ Auto-calculation of `slo_status` from event payload
20
+ - ✅ Configuration in `slo.yml` under `custom_slos` section
21
+ - ✅ App-Wide Aggregated SLO (combines HTTP + Event metrics)
22
+
23
+ **For HTTP/Job SLO** (zero-config infrastructure monitoring), see **ADR-003**.
24
+
25
+ **Key Difference:**
26
+ - **ADR-003**: "Is the server responding?" (HTTP 200 vs 500, job success vs failure)
27
+ - **ADR-014**: "Is the business logic working?" (payment processed vs failed, order created vs rejected)
28
+
29
+ ---
30
+
31
+ ## 📋 Table of Contents
32
+
33
+ 1. [Context & Problem](#1-context--problem)
34
+ 2. [Architecture Overview](#2-architecture-overview)
35
+ 3. [Event SLO DSL](#3-event-slo-dsl)
36
+ 4. [SLO Status Calculation](#4-slo-status-calculation)
37
+ 5. [Custom SLO Configuration](#5-custom-slo-configuration)
38
+ 6. [Metrics Export](#6-metrics-export)
39
+ 7. [Validation & Linting](#7-validation--linting)
40
+ 8. [Prometheus Integration](#8-prometheus-integration)
41
+ 9. [App-Wide SLO Aggregation](#9-app-wide-slo-aggregation)
42
+ 10. [Real-World Examples](#10-real-world-examples)
43
+ 11. [Trade-offs](#11-trade-offs)
44
+
45
+ ---
46
+
47
+ ## 1. Context & Problem
48
+
49
+ ### 1.1. Problem Statement
50
+
51
+ **HTTP SLO is insufficient for business metrics:**
52
+
53
+ ```ruby
54
+ # === PROBLEM 1: HTTP 200 ≠ Business Success ===
55
+ # POST /orders → HTTP 200 (infrastructure success)
56
+ # but Events::OrderCreationFailed.track(...) (business logic failure)
57
+ # → HTTP SLO shows 100%, but actually 50% of orders fail to create!
58
+
59
+ # === PROBLEM 2: Background Jobs SLO ===
60
+ # Sidekiq job completed (no exception)
61
+ # but Events::PaymentFailed.track(...) (payment actually failed)
62
+ # → Job SLO shows 100%, but payments are not processing!
63
+
64
+ # === PROBLEM 3: Complex Business Logic SLO ===
65
+ # SLO: "Order fulfillment within 24h"
66
+ # Events::OrderPaid → Events::OrderShipped (time diff < 24h)
67
+ # → How can such an SLO be tracked?
68
+ ```
69
+
70
+ ### 1.2. Design Decisions
71
+
72
+ **Decision 1: Independent HTTP + Event SLO**
73
+ ```ruby
74
+ # ✅ HTTP SLO = Infrastructure reliability (200 vs 500)
75
+ # ✅ Event SLO = Business logic reliability (order created vs failed)
76
+ # ✅ App-Wide SLO = Aggregation of both (for overall health)
77
+ # → All three are important!
78
+ ```
79
+
80
+ **Decision 2: Explicit opt-in for Event SLO**
81
+ ```ruby
82
+ # ✅ By default Events do NOT participate in SLO
83
+ # ✅ Must explicitly declare `slo { enabled true }`
84
+ # ✅ Must explicitly define `slo_status_from`
85
+ ```
86
+
87
+ **Decision 3: Auto-calculation of slo_status (with override)**
88
+ ```ruby
89
+ # ✅ slo_status computed from payload (e.g., status == 'completed')
90
+ # ✅ Can override: track(status: 'completed', slo_status: 'failure')
91
+ # ✅ If slo_status = nil → event not counted in SLO
92
+ ```
93
+
94
+ **Decision 4: Aggregation in Prometheus (not in application)**
95
+ ```ruby
96
+ # ✅ E11y exports raw metrics: event_result_total{slo_status="success|failure"}
97
+ # ✅ Prometheus calculates SLO via PromQL
98
+ # ✅ Flexibility (can recalculate for any time period)
99
+ ```
100
+
101
+ **Decision 5: Linters for explicitness**
102
+ ```ruby
103
+ # ✅ Linter 1: Every Event must have `slo { ... }` or `slo false`
104
+ # ✅ Linter 2: If `slo { enabled true }` → `slo_status_from` is required
105
+ # ✅ Linter 3: If Event in slo.yml → must have `slo { enabled true }`
106
+ ```
107
+
108
+ ### 1.3. Goals
109
+
110
+ **Primary Goals:**
111
+ - ✅ **Custom SLO based on Events** (business logic)
112
+ - ✅ **Explicit configuration** (no magic)
113
+ - ✅ **Auto-calculation of slo_status** (DRY)
114
+ - ✅ **Independent HTTP + Event SLO**
115
+ - ✅ **App-Wide SLO aggregation** (overall health)
116
+ - ✅ **Prometheus aggregation** (flexible)
117
+ - ✅ **Linters for consistency**
118
+
119
+ **Non-Goals:**
120
+ - ❌ Automatic linking of HTTP SLO and Event SLO (magic)
121
+ - ❌ Multi-event SLO in v1.0 (e.g., OrderPaid → OrderShipped within 24h)
122
+ - ❌ ML-based SLO prediction
123
+
124
+ ### 1.4. Success Metrics
125
+
126
+ | Metric | Target | Critical? |
127
+ |--------|--------|-----------|
128
+ | **Event SLO accuracy** | 100% (matches business logic) | ✅ Yes |
129
+ | **Explicit slo declaration** | 100% of Events | ✅ Yes |
130
+ | **slo_status calculation overhead** | <0.1ms p99 | ✅ Yes |
131
+ | **Prometheus query performance** | <500ms for 30d window | ✅ Yes |
132
+
133
+ ---
134
+
135
+ ## 2. Architecture Overview
136
+
137
+ ### 2.1. System Context
138
+
139
+ ```mermaid
140
+ C4Context
141
+ title Event-Driven SLO Context
142
+
143
+ Person(dev, "Developer", "Defines Events + SLO")
144
+ Person(sre, "SRE", "Monitors SLO")
145
+
146
+ System(rails_app, "Rails App", "Tracks Events")
147
+ System(e11y, "E11y Gem", "Event SLO DSL")
148
+ System(slo_yml, "slo.yml", "Custom SLO config")
149
+
150
+ System_Ext(yabeda, "Yabeda", "Metrics export")
151
+ System_Ext(prometheus, "Prometheus", "Aggregation")
152
+ System_Ext(grafana, "Grafana", "Dashboards")
153
+
154
+ Rel(dev, rails_app, "Tracks", "Events::PaymentProcessed")
155
+ Rel(rails_app, e11y, "Evaluates", "slo_status_from")
156
+ Rel(e11y, slo_yml, "Validates", "custom_slos")
157
+ Rel(e11y, yabeda, "Exports", "event_result_total{slo_status}")
158
+ Rel(yabeda, prometheus, "Scrapes", "Metrics")
159
+ Rel(prometheus, grafana, "Queries", "PromQL")
160
+ Rel(sre, grafana, "Views", "SLO dashboards")
161
+
162
+ UpdateLayoutConfig($c4ShapeInRow="3", $c4BoundaryInRow="1")
163
+ ```
164
+
165
+ ### 2.2. Component Architecture
166
+
167
+ ```mermaid
168
+ graph TB
169
+ subgraph "Event Class Definition"
170
+ EventDef[Event Class<br/>PaymentProcessed]
171
+ SLOBlock[slo do<br/>enabled true<br/>slo_status_from]
172
+ end
173
+
174
+ subgraph "Event Tracking"
175
+ Track[.track&#40;payload&#41;]
176
+ Validate[Validate Schema]
177
+ ComputeSLO[Compute slo_status]
178
+ EmitMetric[Emit Yabeda Metric]
179
+ end
180
+
181
+ subgraph "SLO Configuration"
182
+ SLOConfig[slo.yml]
183
+ CustomSLO[custom_slos]
184
+ EventList[events: PaymentProcessed]
185
+ end
186
+
187
+ subgraph "Metrics Export"
188
+ YabedaMetric[Yabeda.e11y_slo<br/>.event_result_total]
189
+ PrometheusMetric[e11y_slo_event_result_total&#123;<br/>slo_status='success'&#125;]
190
+ end
191
+
192
+ subgraph "Validation & Linting"
193
+ Linter1[Explicit slo declaration]
194
+ Linter2[slo_status_from required]
195
+ Linter3[slo.yml consistency]
196
+ end
197
+
198
+ EventDef --> SLOBlock
199
+ SLOBlock --> Track
200
+ Track --> Validate
201
+ Validate --> ComputeSLO
202
+ ComputeSLO --> EmitMetric
203
+
204
+ SLOConfig --> CustomSLO
205
+ CustomSLO --> EventList
206
+ EventList -.validates.-> SLOBlock
207
+
208
+ EmitMetric --> YabedaMetric
209
+ YabedaMetric --> PrometheusMetric
210
+
211
+ SLOBlock -.validates.-> Linter1
212
+ SLOBlock -.validates.-> Linter2
213
+ CustomSLO -.validates.-> Linter3
214
+
215
+ style ComputeSLO fill:#d1ecf1
216
+ style EmitMetric fill:#fff3cd
217
+ style Linter1 fill:#f8d7da
218
+ style Linter2 fill:#f8d7da
219
+ style Linter3 fill:#f8d7da
220
+ ```
221
+
222
+ ### 2.3. Event SLO Flow Sequence
223
+
224
+ ```mermaid
225
+ sequenceDiagram
226
+ participant App as Rails App
227
+ participant Event as Events::PaymentProcessed
228
+ participant SLO as SLO Config
229
+ participant Yabeda as Yabeda
230
+ participant Prom as Prometheus
231
+
232
+ App->>Event: .track(payment_id: 'p123', status: 'completed')
233
+
234
+ Note over Event: 1. Validate schema
235
+ Event->>Event: Schema validation ✅
236
+
237
+ Note over Event: 2. Check if SLO enabled
238
+ Event->>SLO: slo_enabled?
239
+ SLO-->>Event: true
240
+
241
+ Note over Event: 3. Compute slo_status
242
+ Event->>Event: slo_status_from.call(payload)
243
+ Event->>Event: payload[:status] == 'completed' → 'success'
244
+
245
+ Note over Event: 4. Emit Yabeda metric
246
+ Event->>Yabeda: event_result_total.increment(slo_status: 'success')
247
+
248
+ Note over Yabeda: 5. Export to Prometheus
249
+ Yabeda->>Prom: e11y_slo_event_result_total{slo_status="success"} +1
250
+
251
+ Note over Prom: 6. Aggregate SLO
252
+ Prom->>Prom: sum(rate(...{slo_status="success"}[30d])) / sum(rate(...[30d]))
253
+ Prom->>Prom: Result: 0.9996 (99.96%)
254
+ ```
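The six steps in the sequence above can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: `COUNTER` is a hypothetical in-memory stand-in for the Yabeda counter, `STATUS_RULE` and `record_slo` are illustrative names, and `success_ratio` mimics over raw counts what Prometheus would compute with PromQL.

```ruby
# Hypothetical in-memory stand-in for the Yabeda counter (label set => count).
COUNTER = Hash.new(0)

# Hypothetical status rule mirroring step 3: map the payload to an slo_status.
STATUS_RULE = lambda do |payload|
  case payload[:status]
  when 'completed' then 'success'
  when 'failed'    then 'failure'
  end # any other status yields nil, so the event is not counted
end

# Steps 3-4: compute slo_status and increment the counter unless it is nil.
def record_slo(payload, counter = COUNTER, rule = STATUS_RULE)
  slo_status = rule.call(payload)
  return nil if slo_status.nil?

  counter[{ slo_name: 'payment_success_rate', slo_status: slo_status }] += 1
  slo_status
end

# Step 6: success / (success + failure), computed here over raw counts.
def success_ratio(counter, slo_name)
  success = counter[{ slo_name: slo_name, slo_status: 'success' }]
  failure = counter[{ slo_name: slo_name, slo_status: 'failure' }]
  total = success + failure
  total.zero? ? nil : success.fdiv(total)
end

3.times { |i| record_slo({ payment_id: "p#{i}", status: 'completed' }) }
record_slo({ payment_id: 'p3', status: 'failed' })
record_slo({ payment_id: 'p4', status: 'pending' }) # skipped: slo_status is nil

puts success_ratio(COUNTER, 'payment_success_rate')
```

The pending event leaves the counter untouched, so the ratio is computed over four counted events rather than five.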
255
+
256
+ ---
257
+
258
+ ## 3. Event SLO DSL
259
+
260
+ ### 3.1. Basic Event SLO (enabled)
261
+
262
+ ```ruby
263
+ # app/events/payment_processed.rb
264
+ module Events
265
+ class PaymentProcessed < E11y::Event::Base
266
+ schema do
267
+ required(:payment_id).filled(:string)
268
+ required(:amount).filled(:float)
269
+ required(:status).filled(:string) # 'completed', 'failed', 'pending'
270
+ optional(:slo_status).filled(:string) # Optional explicit override
271
+ end
272
+
273
+ # ============================================================
274
+ # SLO CONFIGURATION (explicit, opt-in)
275
+ # ============================================================
276
+ slo do
277
+ # 1. Enable SLO tracking for this Event
278
+ enabled true
279
+
280
+ # 2. Calculate slo_status from payload
281
+ slo_status_from do |payload|
282
+ # Priority 1: Explicit override (if provided)
283
+ next payload[:slo_status] if payload[:slo_status]
284
+
285
+ # Priority 2: Auto-calculate from status
286
+ case payload[:status]
287
+ when 'completed' then 'success'
288
+ when 'failed' then 'failure'
289
+ when 'pending' then nil # Not counted in SLO
290
+ else nil
291
+ end
292
+ end
293
+
294
+ # 3. Which custom SLO does this contribute to?
295
+ contributes_to 'payment_success_rate'
296
+
297
+ # 4. Optional: Group by label (for per-type SLO)
298
+ group_by :payment_method # Separate SLO for 'card', 'bank', 'paypal'
299
+ end
300
+ end
301
+ end
302
+
303
+ # Usage 1: Auto-calculation (normal case)
304
+ Events::PaymentProcessed.track(
305
+ payment_id: 'p123',
306
+ amount: 99.99,
307
+ status: 'completed', # → slo_status = 'success'
308
+ payment_method: 'card'
309
+ )
310
+
311
+ # Usage 2: Explicit override (edge case)
312
+ Events::PaymentProcessed.track(
313
+ payment_id: 'p456',
314
+ amount: 50.00,
315
+ status: 'completed', # Status completed
316
+ slo_status: 'failure', # But business logic says failure (e.g., fraud detected)
317
+ payment_method: 'card'
318
+ )
319
+
320
+ # Usage 3: Not counted in SLO
321
+ Events::PaymentProcessed.track(
322
+ payment_id: 'p789',
323
+ amount: 25.00,
324
+ status: 'pending', # → slo_status = nil (not counted)
325
+ payment_method: 'bank'
326
+ )
327
+ ```
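One Ruby detail matters for `slo_status_from` blocks: the block is stored and called later, so an early exit needs `next` rather than `return`. A `return` inside a stored block tries to return from the context that defined it and raises `LocalJumpError` once that context has finished. With the `rescue` in section 4.1, such an error would be logged and the SLO metric skipped. The sketch below shows the difference; `block_using_return`, `block_using_next`, and `safe_call` are hypothetical helpers, not part of E11y.

```ruby
# Builds a block that exits early with `return` (problematic when called later).
def block_using_return
  proc do |payload|
    return payload[:slo_status] if payload[:slo_status]

    'success'
  end
end

# Builds the same block with `next`, which only leaves the block itself.
def block_using_next
  proc do |payload|
    next payload[:slo_status] if payload[:slo_status]

    'success'
  end
end

# Calls a stored block and reports a LocalJumpError instead of raising it.
def safe_call(block, payload)
  block.call(payload)
rescue LocalJumpError => e
  "LocalJumpError: #{e.message}"
end

puts safe_call(block_using_next, { slo_status: 'failure' })
puts safe_call(block_using_return, { slo_status: 'failure' })
```

The `return` variant only fails on the override path, which makes the problem easy to miss in tests that never pass an explicit `slo_status`.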
328
+
329
+ ### 3.2. Event SLO Disabled
330
+
331
+ ```ruby
332
+ # app/events/health_check_pinged.rb
333
+ module Events
334
+ class HealthCheckPinged < E11y::Event::Base
335
+ schema do
336
+ required(:status).filled(:string)
337
+ required(:response_time_ms).filled(:integer)
338
+ end
339
+
340
+ # ============================================================
341
+ # SLO DISABLED (explicit opt-out)
342
+ # ============================================================
343
+ slo false # ← Explicit: does NOT participate in SLO
344
+ end
345
+ end
346
+
347
+ # Usage: Normal event tracking (no SLO metrics emitted)
348
+ Events::HealthCheckPinged.track(
349
+ status: 'ok',
350
+ response_time_ms: 15
351
+ )
352
+ ```
353
+
354
+ ### 3.3. Event SLO with Latency
355
+
356
+ ```ruby
357
+ # app/events/api_request_completed.rb
358
+ module Events
359
+ class ApiRequestCompleted < E11y::Event::Base
360
+ schema do
361
+ required(:endpoint).filled(:string)
362
+ required(:status_code).filled(:integer)
363
+ required(:duration_ms).filled(:integer)
364
+ optional(:slo_status).filled(:string)
365
+ end
366
+
367
+ slo do
368
+ enabled true
369
+
370
+ slo_status_from do |payload|
371
+ next payload[:slo_status] if payload[:slo_status]
372
+
373
+ payload[:status_code] < 500 ? 'success' : 'failure'
374
+ end
375
+
376
+ contributes_to 'api_latency_slo'
377
+
378
+ # Optional: Extract latency for histogram
379
+ latency_field :duration_ms # E11y will emit histogram metric
380
+
381
+ group_by :endpoint # Per-endpoint latency SLO
382
+ end
383
+ end
384
+ end
385
+ ```
386
+
387
+ ### 3.4. Complex SLO Status Calculation
388
+
389
+ ```ruby
390
+ # app/events/order_created.rb
391
+ module Events
392
+ class OrderCreated < E11y::Event::Base
393
+ schema do
394
+ required(:order_id).filled(:string)
395
+ required(:user_id).filled(:string)
396
+ required(:items).array(:hash)
397
+ required(:total_amount).filled(:float)
398
+ required(:validation_result).hash do
399
+ required(:passed).filled(:bool)
400
+ optional(:errors).array(:string)
401
+ end
402
+ optional(:slo_status).filled(:string)
403
+ end
404
+
405
+ slo do
406
+ enabled true
407
+
408
+ # Complex business logic for slo_status
409
+ slo_status_from do |payload|
410
+ next payload[:slo_status] if payload[:slo_status]
411
+
412
+ # Validation passed AND amount > 0 → success
413
+ if payload[:validation_result][:passed] && payload[:total_amount] > 0
414
+ 'success'
415
+ elsif payload[:validation_result][:passed]
416
+ nil # Passed but $0 order → not counted (test order)
417
+ else
418
+ 'failure' # Validation failed
419
+ end
420
+ end
421
+
422
+ contributes_to 'order_creation_success_rate'
423
+ end
424
+ end
425
+ end
426
+ ```
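Because the status rule is a pure function of the payload, it can be checked in isolation before it is wired into an Event class. The following sketch restates the `OrderCreated` rule as a standalone lambda (`ORDER_STATUS_RULE` is a hypothetical name) and runs it against one payload per branch.

```ruby
# Standalone restatement of the OrderCreated rule (hypothetical constant name).
ORDER_STATUS_RULE = lambda do |payload|
  explicit = payload[:slo_status]
  passed = payload[:validation_result][:passed]

  if explicit then explicit                          # explicit override wins
  elsif passed && payload[:total_amount] > 0 then 'success'
  elsif passed then nil                              # $0 test order, not counted
  else 'failure'                                     # validation failed
  end
end

# One payload per branch, paired with the status the rule should produce.
CASES = [
  [{ validation_result: { passed: true }, total_amount: 49.90 }, 'success'],
  [{ validation_result: { passed: true }, total_amount: 0.0 }, nil],
  [{ validation_result: { passed: false, errors: ['out of stock'] }, total_amount: 49.90 }, 'failure'],
  [{ validation_result: { passed: true }, total_amount: 49.90, slo_status: 'failure' }, 'failure']
].freeze

CASES.each do |payload, expected|
  actual = ORDER_STATUS_RULE.call(payload)
  puts "#{actual.inspect} (expected #{expected.inspect})"
end
```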
427
+
428
+ ---
429
+
430
+ ## 4. SLO Status Calculation
431
+
432
+ ### 4.1. Implementation (Full Code)
433
+
434
+ ```ruby
435
+ # lib/e11y/event/base.rb (extended for SLO)
436
+ module E11y
437
+ module Event
438
+ class Base
439
+ class << self
440
+ # SLO configuration DSL
441
+ def slo(value = nil, &block)
442
+ if value == false
443
+ # Explicit opt-out: slo false
444
+ @slo_disabled = true
445
+ @slo_config = nil
446
+ elsif block_given?
447
+ # Explicit opt-in: slo do ... end
448
+ @slo_config = SLOConfig.new(self)
449
+ @slo_config.instance_eval(&block)
450
+ @slo_disabled = false
451
+ end
452
+ end
453
+
454
+ def slo_enabled?
455
+ !@slo_disabled && @slo_config&.enabled?
456
+ end
457
+
458
+ def slo_disabled?
459
+ @slo_disabled == true
460
+ end
461
+
462
+ def slo_config
463
+ @slo_config
464
+ end
465
+
466
+ # Override track to integrate SLO
467
+ def track(payload, **options)
468
+ # 1. Normal event tracking
469
+ event_data = build_event(payload, **options)
470
+ deliver_to_adapters(event_data)
471
+
472
+ # 2. SLO tracking (if enabled)
473
+ if slo_enabled?
474
+ track_slo(payload, event_data)
475
+ end
476
+
477
+ event_data
478
+ end
479
+
480
+ private
481
+
482
+ def track_slo(payload, event_data)
483
+ # Compute slo_status
484
+ slo_status = @slo_config.compute_slo_status(payload)
485
+
486
+ # Skip if slo_status is nil (not counted in SLO)
487
+ return unless slo_status
488
+
489
+ # Validate slo_status value
490
+ unless ['success', 'failure'].include?(slo_status)
491
+ E11y.logger.error(
492
+ "Invalid slo_status for #{name}: '#{slo_status}' (must be 'success', 'failure', or nil)"
493
+ )
494
+ return
495
+ end
496
+
497
+ # Build labels
498
+ labels = {
499
+ event_class: name,
500
+ slo_name: @slo_config.contributes_to,
501
+ slo_status: slo_status
502
+ }
503
+
504
+ # Add group_by label (if configured)
505
+ if @slo_config.group_by
506
+ group_value = payload[@slo_config.group_by]
507
+ labels[@slo_config.group_by] = group_value if group_value
508
+ end
509
+
510
+ # Emit availability metric
511
+ Yabeda.e11y_slo.event_result_total.increment(labels, by: 1)
512
+
513
+ # Emit latency metric (if latency_field configured)
514
+ if @slo_config.latency_field
515
+ latency_value = payload[@slo_config.latency_field]
516
+
517
+ if latency_value
518
+ Yabeda.e11y_slo.event_duration_seconds.observe(
519
+ labels.except(:slo_status),
520
+ latency_value / 1000.0 # ms → seconds
521
+ )
522
+ end
523
+ end
524
+
525
+ # Log SLO tracking (debug)
526
+ E11y.logger.debug(
527
+ "SLO tracked: #{name} → #{@slo_config.contributes_to} (#{slo_status})"
528
+ )
529
+ rescue => error
530
+ E11y.logger.error("SLO tracking failed for #{name}: #{error.message}")
531
+ E11y.logger.error(error.backtrace.first(5).join("\n"))
532
+ # Don't raise - SLO tracking should not break event tracking
533
+ end
534
+ end
535
+ end
536
+ end
537
+ end
538
+ ```
539
+
540
+ ### 4.2. SLOConfig Class
541
+
542
+ ```ruby
543
+ # lib/e11y/event/slo_config.rb
544
+ module E11y
545
+ module Event
546
+ class SLOConfig
547
+ attr_reader :event_class
548
+ attr_accessor :enabled, :slo_status_from_proc, :contributes_to, :group_by, :latency_field
549
+
550
+ def initialize(event_class)
551
+ @event_class = event_class
552
+ @enabled = false
553
+ @slo_status_from_proc = nil
554
+ @contributes_to = nil
555
+ @group_by = nil
556
+ @latency_field = nil
557
+ end
558
+
559
+ # DSL methods
560
+ def enabled(value = true)
561
+ @enabled = value
562
+ end
563
+
564
+ def enabled?
565
+ @enabled == true
566
+ end
567
+
568
+ def slo_status_from(&block)
569
+ unless block_given?
570
+ raise ArgumentError, "slo_status_from requires a block"
571
+ end
572
+
573
+ @slo_status_from_proc = block
574
+ end
575
+
576
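+ def contributes_to(slo_name = nil) # Setter in the DSL, getter when called without an argument
577
+ slo_name.nil? ? @contributes_to : (@contributes_to = slo_name)
578
+ end
579
+
580
+ def group_by(field = nil) # Setter in the DSL, getter when called without an argument
581
+ field.nil? ? @group_by : (@group_by = field)
582
+ end
583
+
584
+ def latency_field(field = nil) # Setter in the DSL, getter when called without an argument
585
+ field.nil? ? @latency_field : (@latency_field = field)
586
+ end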
587
+
588
+ # Compute slo_status from payload
589
+ def compute_slo_status(payload)
590
+ unless @slo_status_from_proc
591
+ raise "Event #{@event_class.name} has slo enabled but no slo_status_from block!"
592
+ end
593
+
594
+ @slo_status_from_proc.call(payload)
595
+ end
596
+
597
+ # Validate configuration
598
+ def validate!
599
+ errors = []
600
+
601
+ if enabled? && !@slo_status_from_proc
602
+ errors << "slo_status_from block is required when slo is enabled"
603
+ end
604
+
605
+ if enabled? && !@contributes_to
606
+ errors << "contributes_to is required when slo is enabled"
607
+ end
608
+
609
+ if @latency_field && !@event_class.schema.rules.key?(@latency_field)
610
+ errors << "latency_field :#{@latency_field} not found in schema"
611
+ end
612
+
613
+ if @group_by && !@event_class.schema.rules.key?(@group_by)
614
+ errors << "group_by :#{@group_by} not found in schema"
615
+ end
616
+
617
+ if errors.any?
618
+ raise E11y::SLO::ConfigurationError,
619
+ "SLO configuration errors for #{@event_class.name}:\n#{errors.join("\n")}"
620
+ end
621
+
622
+ true
623
+ end
624
+ end
625
+ end
626
+ end
627
+ ```
628
+
629
+ ---
630
+
631
+ ## 5. Custom SLO Configuration
632
+
633
+ ### 5.1. slo.yml Schema for Event-Based SLO
634
+
635
+ ```yaml
636
+ # config/slo.yml
637
+ version: 1
638
+
639
+ # ============================================================================
640
+ # CUSTOM EVENT-BASED SLO
641
+ # ============================================================================
642
+ custom_slos:
643
+ # Simple availability SLO
644
+ - name: "payment_success_rate"
645
+ description: "Payment success rate (business SLO)"
646
+ type: event_based
647
+
648
+ # Which Events contribute to this SLO?
649
+ events:
650
+ - Events::PaymentProcessed
651
+
652
+ # SLO target
653
+ target: 0.999 # 99.9%
654
+ window: 30d
655
+
656
+ # Validation: Ensure Event has slo { enabled true }
657
+ require_explicit_slo_config: true
658
+
659
+ # Prometheus metric name (auto-generated if omitted)
660
+ metric_name: "e11y_slo_payment_success_rate"
661
+
662
+ # Burn rate alerts (same as HTTP SLO)
663
+ burn_rate_alerts:
664
+ fast:
665
+ enabled: true
666
+ threshold: 14.4
667
+ alert_after: 5m
668
+ severity: critical
669
+ medium:
670
+ enabled: true
671
+ threshold: 6.0
672
+ alert_after: 30m
673
+ severity: warning
674
+ slow:
675
+ enabled: true
676
+ threshold: 1.0
677
+ alert_after: 6h
678
+ severity: info
679
+
680
+ # Latency SLO (with histogram)
681
+ - name: "api_latency_slo"
682
+ description: "API request latency SLO"
683
+ type: event_based
684
+
685
+ events:
686
+ - Events::ApiRequestCompleted
687
+
688
+ # Availability target
689
+ target: 0.999 # 99.9% requests < 500ms
690
+ window: 30d
691
+
692
+ # Latency target (Prometheus computes p99 from histogram)
693
+ latency:
694
+ enabled: true
695
+ p99_target: 500 # ms
696
+ p95_target: 300 # ms
697
+ field: :duration_ms # Which field in Event payload
698
+
699
+ # Group by endpoint (per-endpoint SLO)
700
+ group_by: :endpoint
701
+
702
+ # Multi-event SLO (v2.0 feature, not implemented in v1.0)
703
+ - name: "order_fulfillment_slo"
704
+ description: "Order shipped within 24h of payment"
705
+ type: event_sequence # Future feature
706
+
707
+ events:
708
+ start: Events::OrderPaid
709
+ end: Events::OrderShipped
710
+ max_duration: 86400 # 24 hours in seconds
711
+
712
+ target: 0.95 # 95%
713
+ window: 30d
714
+
715
+ # v1.0: Not implemented, placeholder for future
716
+
717
+ # ============================================================================
718
+ # HTTP SLO (unchanged, for reference)
719
+ # ============================================================================
720
+ endpoints:
721
+ - name: "Create Order"
722
+ pattern: "POST /api/orders"
723
+ controller: "Api::OrdersController"
724
+ action: "create"
725
+ slo:
726
+ availability:
727
+ target: 0.999 # HTTP 2xx/3xx vs 5xx (independent from Event SLO)
728
+ latency:
729
+ p99_target: 500
730
+
731
+ # ============================================================================
732
+ # APP-WIDE FALLBACK
733
+ # ============================================================================
734
+ app_wide:
735
+ http:
736
+ availability:
737
+ target: 0.999
738
+ latency:
739
+ p99_target: 500
740
+
741
+ events:
742
+ # NEW: App-wide Event SLO (aggregation of all Event SLOs)
743
+ enabled: true
744
+ target: 0.999 # 99.9%
745
+ window: 30d
746
+ ```
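The burn-rate thresholds in `burn_rate_alerts` are easier to read as concrete numbers. The sketch below assumes the conventional definition, burn rate = observed error rate / error budget, where the error budget is `1 - target`. The helper names are hypothetical and the values are taken from the `payment_success_rate` entry above.

```ruby
TARGET = 0.999      # from payment_success_rate
WINDOW_DAYS = 30    # from window: 30d
THRESHOLDS = { fast: 14.4, medium: 6.0, slow: 1.0 }.freeze

# Error rate at which a given burn rate is reached (assumed definition).
def error_rate_at(burn_rate, target)
  burn_rate * (1.0 - target)
end

# Days until the whole window's error budget is spent at a constant burn rate.
def days_to_exhaust(burn_rate, window_days)
  window_days / burn_rate
end

THRESHOLDS.each do |name, burn_rate|
  rate_pct = (error_rate_at(burn_rate, TARGET) * 100).round(2)
  days = days_to_exhaust(burn_rate, WINDOW_DAYS).round(2)
  puts "#{name}: fires at ~#{rate_pct}% failures, budget gone in ~#{days} days"
end
```

Under these assumptions, a burn rate of 1.0 spends the budget exactly over the 30-day window, while 14.4 spends it in roughly two days, which is why the fast alert is the critical one.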
747
+
748
+ ---
749
+
750
+ ## 6. Metrics Export
751
+
752
+ ### 6.1. Yabeda Metrics Definition
753
+
754
+ ```ruby
755
+ # config/initializers/e11y.rb
756
+ E11y.configure do |config|
757
+ config.metrics do
758
+ enabled true
759
+ adapter :yabeda
760
+
761
+ # Define SLO metrics
762
+ Yabeda.configure do
763
+ group :e11y_slo do
764
+ # Availability metric (counter)
765
+ counter :event_result_total,
766
+ comment: "Total count of Event SLO results (success/failure)",
767
+ tags: [:event_class, :slo_name, :slo_status] # slo_status = 'success' | 'failure'
768
+
769
+ # Latency metric (histogram)
770
+ histogram :event_duration_seconds,
771
+ comment: "Event processing duration for SLO (seconds)",
772
+ tags: [:event_class, :slo_name],
773
+ buckets: [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
774
+ end
775
+ end
776
+ end
777
+ end
778
+ ```
779
+
780
+ ### 6.2. Prometheus Metrics Output
781
+
782
+ ```
783
+ # Availability metrics
784
+ e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="success"} 1234
785
+ e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="failure"} 5
786
+
787
+ # Latency metrics (histogram)
788
+ e11y_slo_event_duration_seconds_bucket{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo",le="0.1"} 500
789
+ e11y_slo_event_duration_seconds_bucket{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo",le="0.5"} 1200
790
+ e11y_slo_event_duration_seconds_bucket{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo",le="1.0"} 1234
791
+ e11y_slo_event_duration_seconds_sum{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo"} 450.67
792
+ e11y_slo_event_duration_seconds_count{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo"} 1239
793
+
794
+ # With group_by (per-payment-method)
795
+ e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="success",payment_method="card"} 1000
796
+ e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="success",payment_method="bank"} 200
797
+ e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="success",payment_method="paypal"} 34
798
+ ```
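To make the link between these counter samples and an availability figure concrete, here is a minimal Ruby sketch. It is not part of E11y: `availability` is a hypothetical helper, and the inputs are the two unlabelled `payment_success_rate` samples shown above.

```ruby
# Hypothetical helper, not part of E11y: derive availability from
# success/failure counter samples like the ones shown above.
def availability(success:, failure:)
  total = success + failure
  return nil if total.zero? # no data: the SLO is undefined, not 100%

  success.to_f / total
end

overall = availability(success: 1234, failure: 5)
puts format("overall availability: %.4f", overall)
```

Returning `nil` for an empty window is a design choice in this sketch; it avoids reporting a perfect score when no events were tracked.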
799
+
800
+ ---
801
+
802
+ ## 7. Validation & Linting
803
+
804
+ ### 7.1. Linter 1: Explicit SLO Declaration
805
+
806
+ **Rule:** Every Event class MUST have an explicit `slo` declaration.
807
+
808
+ ```ruby
809
+ # lib/e11y/slo/linters/explicit_declaration_linter.rb
810
+ module E11y
811
+ module SLO
812
+ module Linters
813
+ class ExplicitDeclarationLinter
814
+ def self.validate!
815
+ errors = []
816
+
817
+ E11y::Registry.all_events.each do |event_class|
818
+ # Check: Has slo declaration?
819
+ has_slo_enabled = event_class.slo_enabled?
820
+ has_slo_disabled = event_class.slo_disabled?
821
+
822
+ unless has_slo_enabled || has_slo_disabled
823
+ errors << "Event #{event_class.name} missing explicit SLO declaration! " \
824
+ "Add `slo do ... end` or `slo false`"
825
+ end
826
+ end
827
+
828
+ if errors.any?
829
+ raise E11y::SLO::LinterError, "SLO Linter 1 failed:\n#{errors.join("\n")}"
830
+ end
831
+
832
+ E11y.logger.info("✅ Linter 1: Explicit SLO declaration (passed)")
833
+ end
834
+ end
835
+ end
836
+ end
837
+ end
838
+ ```
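Linter 1 relies on being able to tell three states apart: enabled, explicitly disabled, and never declared. The following is a minimal sketch of one way a DSL could record that, under stated assumptions: `SloDeclaration` and the three sample classes are hypothetical, and any block passed to `slo` is treated as "enabled" for simplicity.

```ruby
# Hypothetical sketch: record three declaration states so that a linter
# can distinguish "explicitly disabled" from "never declared".
module SloDeclaration
  def slo(flag = nil, &block)
    @slo_state =
      if block
        :enabled
      elsif flag == false
        :disabled
      else
        raise ArgumentError, "use `slo do ... end` or `slo false`"
      end
  end

  def slo_enabled?
    @slo_state == :enabled
  end

  def slo_disabled?
    @slo_state == :disabled
  end
end

class PaymentProcessed
  extend SloDeclaration
  slo { } # stands in for `slo do ... end`
end

class DebugPing
  extend SloDeclaration
  slo false
end

class Forgotten
  extend SloDeclaration
end

# Same check as the linter: neither enabled nor disabled means undeclared.
undeclared = [PaymentProcessed, DebugPing, Forgotten].reject do |klass|
  klass.slo_enabled? || klass.slo_disabled?
end
puts undeclared.map(&:name).inspect
```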
839
+
840
+ ### 7.2. Linter 2: slo_status_from Required
841
+
842
+ **Rule:** If `slo { enabled true }`, then `slo_status_from` is REQUIRED.
843
+
844
+ ```ruby
845
+ # lib/e11y/slo/linters/slo_status_from_linter.rb
846
+ module E11y
847
+ module SLO
848
+ module Linters
849
+ class SloStatusFromLinter
850
+ def self.validate!
851
+ errors = []
852
+
853
+ E11y::Registry.all_events.each do |event_class|
854
+ next unless event_class.slo_enabled?
855
+
856
+ # Check: Has slo_status_from block?
857
+ unless event_class.slo_config.slo_status_from_proc
858
+ errors << "Event #{event_class.name} has `slo { enabled true }`, " \
859
+ "but missing `slo_status_from` block!"
860
+ end
861
+
862
+ # Check: Has contributes_to?
863
+ unless event_class.slo_config.contributes_to
864
+ errors << "Event #{event_class.name} has `slo { enabled true }`, " \
865
+ "but missing `contributes_to` declaration!"
866
+ end
867
+ end
868
+
869
+ if errors.any?
870
+ raise E11y::SLO::LinterError, "SLO Linter 2 failed:\n#{errors.join("\n")}"
871
+ end
872
+
873
+ E11y.logger.info("✅ Linter 2: slo_status_from required (passed)")
874
+ end
875
+ end
876
+ end
877
+ end
878
+ end
879
+ ```
880
+
881
+ ### 7.3. Linter 3: slo.yml Consistency
882
+
883
+ **Rule:** If an Event is referenced in `slo.yml`, it MUST have `slo { enabled true }`.
884
+
885
+ ```ruby
886
+ # lib/e11y/slo/linters/config_consistency_linter.rb
887
+ module E11y
888
+ module SLO
889
+ module Linters
890
+ class ConfigConsistencyLinter
891
+ def self.validate!
892
+ errors = []
893
+ config = E11y::SLO::ConfigLoader.config
894
+
895
+ # Check each custom_slo in slo.yml
896
+ config.custom_slos.each do |slo|
897
+ slo_name = slo['name']
898
+ events = slo['events'] || []
899
+
900
+ events.each do |event_class_name|
901
+ # Check: Event class exists?
902
+ begin
903
+ event_class = event_class_name.constantize
904
+ rescue NameError
905
+ errors << "SLO '#{slo_name}' references Event #{event_class_name}, " \
906
+ "but class not found!"
907
+ next
908
+ end
909
+
910
+ # Check: Event has slo { enabled true }?
911
+ unless event_class.slo_enabled?
912
+ errors << "SLO '#{slo_name}' references Event #{event_class_name}, " \
913
+ "but Event has `slo false` or missing slo declaration!"
914
+ end
915
+
916
+ # Check: Event contributes_to matches slo_name?
917
+ if event_class.slo_enabled? && event_class.slo_config.contributes_to != slo_name
918
+ errors << "Event #{event_class_name} contributes_to " \
919
+ "'#{event_class.slo_config.contributes_to}', " \
920
+ "but slo.yml defines SLO '#{slo_name}'! Mismatch!"
921
+ end
922
+ end
923
+ end
924
+
925
+ # Check reverse: Events with slo enabled but NOT in slo.yml
926
+ E11y::Registry.all_events.each do |event_class|
927
+ next unless event_class.slo_enabled?
928
+
929
+ slo_name = event_class.slo_config.contributes_to
930
+ slo_config = config.resolve_custom_slo(slo_name)
931
+
932
+ unless slo_config
933
+ errors << "Event #{event_class.name} contributes_to '#{slo_name}', " \
934
+ "but this SLO not found in slo.yml!"
935
+ end
936
+ end
937
+
938
+ if errors.any?
939
+ raise E11y::SLO::LinterError, "SLO Linter 3 failed:\n#{errors.join("\n")}"
940
+ end
941
+
942
+ E11y.logger.info("✅ Linter 3: slo.yml consistency (passed)")
943
+ end
944
+ end
945
+ end
946
+ end
947
+ end
948
+ ```
949
+
950
+ ### 7.4. Auto-validation on Boot
951
+
952
+ ```ruby
953
+ # config/initializers/e11y_slo.rb
954
+ Rails.application.config.after_initialize do
955
+ if E11y.config.slo.enabled
956
+ begin
957
+ # Run all linters
958
+ E11y::SLO::Linters::ExplicitDeclarationLinter.validate!
959
+ E11y::SLO::Linters::SloStatusFromLinter.validate!
960
+ E11y::SLO::Linters::ConfigConsistencyLinter.validate!
961
+
962
+ E11y.logger.info("✅ All SLO linters passed")
963
+ rescue E11y::SLO::LinterError => error
964
+ if E11y.config.slo.strict_validation
965
+ # Strict mode: Fail hard
966
+ raise error
967
+ else
968
+ # Lenient mode: Log warning and continue
969
+ E11y.logger.error("❌ SLO Linters failed (continuing in lenient mode):")
970
+ E11y.logger.error(error.message)
971
+ end
972
+ end
973
+ end
974
+ end
975
+ ```
976
+
977
+ ---
978
+
979
+ ## 8. Prometheus Integration
980
+
981
+ ### 8.1. PromQL for Availability SLO
982
+
983
+ ```promql
984
+ # Payment Success Rate (30 days)
985
+ sum(rate(e11y_slo_event_result_total{
986
+ slo_name="payment_success_rate",
987
+ slo_status="success"
988
+ }[30d]))
989
+ /
990
+ sum(rate(e11y_slo_event_result_total{
991
+ slo_name="payment_success_rate"
992
+ }[30d]))
993
+
994
+ # Result: 0.9996 (99.96%)
995
+ ```
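The ratio returned by this query is usually read against the target as an error budget. A minimal sketch of that arithmetic follows; `error_budget_consumed` is a hypothetical helper, and the inputs are the example result above and the 99.9% target used throughout this document.

```ruby
# Hypothetical helper: share of the error budget consumed in the window.
def error_budget_consumed(measured:, target:)
  allowed_failure = 1.0 - target   # e.g. 0.001 for a 99.9% target
  actual_failure  = 1.0 - measured # e.g. 0.0004 for 99.96% measured
  actual_failure / allowed_failure
end

consumed = error_budget_consumed(measured: 0.9996, target: 0.999)
puts format("error budget consumed: %.1f%%", consumed * 100)
```

With these inputs, roughly 40% of the 30-day budget is spent.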
996
+
997
+ ### 8.2. PromQL for Latency SLO
998
+
999
+ ```promql
1000
+ # API Latency p99 (30 days)
1001
+ histogram_quantile(0.99,
1002
+ sum(rate(e11y_slo_event_duration_seconds_bucket{
1003
+ slo_name="api_latency_slo"
1004
+ }[30d])) by (le)
1005
+ )
1006
+
1007
+ # Result: 0.450 (450ms)
1008
+ ```
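For readers unfamiliar with how a quantile is estimated from cumulative buckets, here is a simplified Ruby sketch of linear interpolation within the matching bucket. It illustrates the general idea only and is not Prometheus' exact implementation. `estimate_quantile` is a hypothetical name, and the input reuses the sample buckets from Section 6.2, assuming the `_count` value (1239) is the `+Inf` bucket.

```ruby
# Simplified sketch of bucket interpolation (0 < q <= 1).
# `buckets` is a list of [upper_bound, cumulative_count], sorted by bound
# and ending with the +Inf bucket.
def estimate_quantile(q, buckets)
  total = buckets.last[1]
  return nil if total.zero?

  rank = q * total
  index = buckets.index { |_bound, count| count >= rank }
  upper, count = buckets[index]

  # Rank falls into the +Inf bucket: report the highest finite bound.
  return buckets[-2][0] if upper == Float::INFINITY

  lower, prev_count = index.zero? ? [0.0, 0] : buckets[index - 1]
  lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
end

buckets = [[0.1, 500], [0.5, 1200], [1.0, 1234], [Float::INFINITY, 1239]]
p99 = estimate_quantile(0.99, buckets)
puts format("estimated p99: %.3fs", p99)
```

Because the estimate is interpolated, its precision depends on how closely the bucket boundaries bracket the latency target.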
1009
+
1010
+ ### 8.3. PromQL for Grouped SLO (per payment_method)
1011
+
1012
+ ```promql
1013
+ # Payment Success Rate per payment_method
1014
+ sum(rate(e11y_slo_event_result_total{
1015
+ slo_name="payment_success_rate",
1016
+ slo_status="success"
1017
+ }[30d])) by (payment_method)
1018
+ /
1019
+ sum(rate(e11y_slo_event_result_total{
1020
+ slo_name="payment_success_rate"
1021
+ }[30d])) by (payment_method)
1022
+
1023
+ # Result:
1024
+ # {payment_method="card"} 0.9998
1025
+ # {payment_method="bank"} 0.9950
1026
+ # {payment_method="paypal"} 0.9970
1027
+ ```
1028
+
1029
+ ### 8.4. Burn Rate Alerts (same as HTTP SLO)
1030
+
1031
+ ```yaml
+ # prometheus/alerts/e11y_event_slo.yml
+ groups:
+   - name: e11y_event_slo
+     interval: 30s
+     rules:
+       # Fast burn (1h window, 5 min alert)
+       - alert: E11yEventSLOFastBurn_PaymentSuccessRate
+         expr: |
+           (
+             sum(rate(e11y_slo_event_result_total{
+               slo_name="payment_success_rate",
+               slo_status="failure"
+             }[1h]))
+             /
+             sum(rate(e11y_slo_event_result_total{
+               slo_name="payment_success_rate"
+             }[1h]))
+           )
+           /
+           0.001 # Error budget as a failure ratio (1 - 0.999 target)
+           > 14.4 # 14.4x burn rate = 2% of 30-day budget in 1h
+         for: 5m
+         labels:
+           severity: critical
+           slo_name: "payment_success_rate"
+           burn_window: "1h"
+         annotations:
+           summary: "CRITICAL: Fast burn on payment_success_rate SLO"
+           description: |
+             Payment failure rate is more than 14.4x the sustainable rate.
+             At this rate, 2% of the 30-day error budget burns in 1 hour.
+
+             Current burn rate: {{ $value | humanize }}x
+
+             Dashboard: https://grafana/d/e11y-event-slo?var-slo=payment_success_rate
1067
+ ```
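The two constants in the alert expression follow from simple arithmetic, sketched below. Both helpers are hypothetical and exist only to show the calculation: burn rate is the observed failure ratio divided by the allowed failure ratio, and a burn rate sustained for some hours spends that many hours' worth of the 720-hour window.

```ruby
HOURS_IN_WINDOW = 30 * 24 # 30-day SLO window = 720 hours

# Burn rate: observed failure ratio relative to the allowed failure ratio.
def burn_rate(failure_ratio:, target:)
  failure_ratio / (1.0 - target)
end

# Fraction of the whole 30-day budget spent if `rate` is sustained for `hours`.
def budget_spent(rate:, hours:)
  rate * hours / HOURS_IN_WINDOW.to_f
end

rate  = burn_rate(failure_ratio: 0.0144, target: 0.999)
spent = budget_spent(rate: 14.4, hours: 1)
puts format("burn rate: %.1fx", rate)
puts format("budget spent in 1h: %.1f%%", spent * 100)
```

A 1.44% failure ratio against a 99.9% target is therefore a 14.4x burn, which spends about 2% of the monthly budget per hour.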
1068
+
1069
+ ---
1070
+
1071
+ ## 9. App-Wide SLO Aggregation
1072
+
1073
+ ### 9.1. Problem: Need Overall Health Metric
1074
+
1075
+ **Separate HTTP and Event SLOs are useful, but SREs often want a single number:**
1076
+
1077
+ ```
1078
+ SRE: "What's our overall application health?"
1079
+
1080
+ Current state:
1081
+ - HTTP SLO: 99.95%
1082
+ - Event SLO (payment): 99.96%
1083
+ - Event SLO (orders): 99.92%
1084
+
1085
+ ❌ Which one to report? Need AGGREGATION!
1086
+ ```
1087
+
1088
+ ### 9.2. Solution: Weighted App-Wide SLO
1089
+
1090
+ ```yaml
1091
+ # config/slo.yml (extended)
1092
+ app_wide:
1093
+ # NEW: Aggregated SLO (combines HTTP + Events)
1094
+ aggregated_slo:
1095
+ enabled: true
1096
+
1097
+ # How to combine?
1098
+ strategy: "weighted_average" # or "min" (worst), "max" (best)
1099
+
1100
+ # Weights for each component
1101
+ components:
1102
+ - name: "http_slo"
1103
+ weight: 0.4 # 40% weight (infrastructure)
1104
+ metric: |
1105
+ sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
1106
+ /
1107
+ sum(rate(http_requests_total[30d]))
1108
+
1109
+ - name: "event_slo_payment"
1110
+ weight: 0.4 # 40% weight (critical business logic)
1111
+ metric: |
1112
+ sum(rate(e11y_slo_event_result_total{
1113
+ slo_name="payment_success_rate",
1114
+ slo_status="success"
1115
+ }[30d]))
1116
+ /
1117
+ sum(rate(e11y_slo_event_result_total{
1118
+ slo_name="payment_success_rate"
1119
+ }[30d]))
1120
+
1121
+ - name: "event_slo_orders"
1122
+ weight: 0.2 # 20% weight (important business logic)
1123
+ metric: |
1124
+ sum(rate(e11y_slo_event_result_total{
1125
+ slo_name="order_creation_success_rate",
1126
+ slo_status="success"
1127
+ }[30d]))
1128
+ /
1129
+ sum(rate(e11y_slo_event_result_total{
1130
+ slo_name="order_creation_success_rate"
1131
+ }[30d]))
1132
+
1133
+ # Overall target
1134
+ target: 0.999 # 99.9%
1135
+ window: 30d
1136
+ ```
1137
+
1138
+ ### 9.3. PromQL for Aggregated SLO
1139
+
1140
+ ```promql
1141
+ # Weighted Average App-Wide SLO
1142
+ (
1143
+ # HTTP SLO (40% weight)
1144
+ 0.4 * (
1145
+ sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
1146
+ /
1147
+ sum(rate(http_requests_total[30d]))
1148
+ )
1149
+
1150
+ +
1151
+
1152
+ # Payment Event SLO (40% weight)
1153
+ 0.4 * (
1154
+ sum(rate(e11y_slo_event_result_total{
1155
+ slo_name="payment_success_rate",
1156
+ slo_status="success"
1157
+ }[30d]))
1158
+ /
1159
+ sum(rate(e11y_slo_event_result_total{
1160
+ slo_name="payment_success_rate"
1161
+ }[30d]))
1162
+ )
1163
+
1164
+ +
1165
+
1166
+ # Order Event SLO (20% weight)
1167
+ 0.2 * (
1168
+ sum(rate(e11y_slo_event_result_total{
1169
+ slo_name="order_creation_success_rate",
1170
+ slo_status="success"
1171
+ }[30d]))
1172
+ /
1173
+ sum(rate(e11y_slo_event_result_total{
1174
+ slo_name="order_creation_success_rate"
1175
+ }[30d]))
1176
+ )
1177
+ )
1178
+
1179
+ # Example Result:
1180
+ # HTTP: 99.95%, Payment: 99.96%, Orders: 99.92%
1181
+ # → (0.4 * 0.9995) + (0.4 * 0.9996) + (0.2 * 0.9992)
1182
+ # → 0.39980 + 0.39984 + 0.19984
1183
+ # → 0.99948 (99.948%)
1184
+ ```
1185
+
1186
+ ### 9.4. Alternative: "Worst Case" Strategy
1187
+
1188
+ ```yaml
1189
+ # config/slo.yml
1190
+ app_wide:
1191
+ aggregated_slo:
1192
+ strategy: "min" # Take worst SLO
1193
+
1194
+ components:
1195
+ - name: "http_slo"
1196
+ metric: ...
1197
+ - name: "event_slo_payment"
1198
+ metric: ...
1199
+ - name: "event_slo_orders"
1200
+ metric: ...
1201
+ ```
1202
+
1203
+ ```promql
1204
+ # Min (worst) SLO. `min` is an aggregation operator over a single vector,
+ # so each ratio gets its own `component` label and the three are joined with `or`.
+ min(
+   label_replace(
+     sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
+       / sum(rate(http_requests_total[30d])),
+     "component", "http", "", "")
+   or
+   label_replace(
+     sum(rate(e11y_slo_event_result_total{slo_name="payment_success_rate",slo_status="success"}[30d]))
+       / sum(rate(e11y_slo_event_result_total{slo_name="payment_success_rate"}[30d])),
+     "component", "payment", "", "")
+   or
+   label_replace(
+     sum(rate(e11y_slo_event_result_total{slo_name="order_creation_success_rate",slo_status="success"}[30d]))
+       / sum(rate(e11y_slo_event_result_total{slo_name="order_creation_success_rate"}[30d])),
+     "component", "orders", "", "")
+ )
1218
+
1219
+ # Result: 99.92% (worst of the three)
1220
+ ```
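The aggregation strategies named in `slo.yml` can be sketched in plain Ruby as follows. `aggregate_slo` and the `:value` key are hypothetical; the component values are the example percentages from Section 9.1, not measured data.

```ruby
# Hypothetical aggregator for the strategies named in slo.yml.
def aggregate_slo(components, strategy:)
  values = components.map { |c| c[:value] }

  case strategy
  when "weighted_average"
    weights = components.map { |c| c[:weight] }
    unless (weights.sum - 1.0).abs < 1e-9
      raise ArgumentError, "weights must sum to 1.0, got #{weights.sum}"
    end
    components.sum { |c| c[:weight] * c[:value] }
  when "min" then values.min
  when "max" then values.max
  else raise ArgumentError, "unknown strategy: #{strategy}"
  end
end

components = [
  { name: "http_slo",          weight: 0.4, value: 0.9995 },
  { name: "event_slo_payment", weight: 0.4, value: 0.9996 },
  { name: "event_slo_orders",  weight: 0.2, value: 0.9992 }
]

weighted = aggregate_slo(components, strategy: "weighted_average")
worst    = aggregate_slo(components, strategy: "min")
puts format("weighted: %.5f, worst: %.4f", weighted, worst)
```

Validating that the weights sum to 1.0 is an addition in this sketch. Without that check, a misconfigured weight silently scales the aggregated number.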
1221
+
1222
+ ### 9.5. Grafana Dashboard: Aggregated SLO
1223
+
1224
+ ```json
1225
+ {
1226
+ "dashboard": {
1227
+ "title": "E11y App-Wide SLO Dashboard",
1228
+ "panels": [
1229
+ {
1230
+ "title": "Overall Application Health (Aggregated SLO)",
1231
+ "type": "gauge",
1232
+ "targets": [
1233
+ {
1234
+ "expr": "# Weighted average PromQL from above",
1235
+ "legendFormat": "App-Wide SLO"
1236
+ },
1237
+ {
1238
+ "expr": "0.999",
1239
+ "legendFormat": "Target (99.9%)"
1240
+ }
1241
+ ],
1242
+ "fieldConfig": {
1243
+ "defaults": {
1244
+ "min": 0.99,
1245
+ "max": 1.0,
1246
+ "thresholds": {
1247
+ "mode": "absolute",
1248
+ "steps": [
1249
+ { "value": 0.99, "color": "red" },
1250
+ { "value": 0.995, "color": "yellow" },
1251
+ { "value": 0.999, "color": "green" }
1252
+ ]
1253
+ }
1254
+ }
1255
+ }
1256
+ },
1257
+ {
1258
+ "title": "SLO Components Breakdown",
1259
+ "type": "timeseries",
1260
+ "targets": [
1261
+ { "expr": "# HTTP SLO", "legendFormat": "HTTP (40%)" },
1262
+ { "expr": "# Payment SLO", "legendFormat": "Payment Events (40%)" },
1263
+ { "expr": "# Orders SLO", "legendFormat": "Order Events (20%)" }
1264
+ ]
1265
+ }
1266
+ ]
1267
+ }
1268
+ }
1269
+ ```
1270
+
1271
+ ---
1272
+
1273
+ ## 10. Real-World Examples
1274
+
1275
+ ### 10.1. E-Commerce Platform
1276
+
1277
+ ```ruby
1278
+ # === Payment Success Rate SLO ===
1279
+ module Events
1280
+ class PaymentProcessed < E11y::Event::Base
1281
+ schema do
1282
+ required(:payment_id).filled(:string)
1283
+ required(:order_id).filled(:string)
1284
+ required(:amount).filled(:float)
1285
+ required(:currency).filled(:string)
1286
+ required(:payment_method).filled(:string) # 'card', 'bank', 'paypal'
1287
+ required(:status).filled(:string) # 'completed', 'failed', 'pending'
1288
+ required(:error_code).maybe(:string) # Present if status = 'failed'
1289
+ optional(:slo_status).filled(:string)
1290
+ end
1291
+
1292
+ slo do
1293
+ enabled true
1294
+
1295
+ slo_status_from do |payload|
1296
+ next payload[:slo_status] if payload[:slo_status] # `next` exits the block; `return` can raise LocalJumpError in a proc
1297
+
1298
+ case payload[:status]
1299
+ when 'completed' then 'success'
1300
+ when 'failed' then 'failure'
1301
+ when 'pending' then nil # Not counted
1302
+ end
1303
+ end
1304
+
1305
+ contributes_to 'payment_success_rate'
1306
+ group_by :payment_method # Per-method SLO
1307
+ end
1308
+ end
1309
+ end
1310
+
1311
+ # === Order Creation Success Rate SLO ===
1312
+ module Events
1313
+ class OrderCreated < E11y::Event::Base
1314
+ schema do
1315
+ required(:order_id).filled(:string)
1316
+ required(:user_id).filled(:string)
1317
+ required(:items).array(:hash)
1318
+ required(:total_amount).filled(:float)
1319
+ optional(:slo_status).filled(:string)
1320
+ end
1321
+
1322
+ slo do
1323
+ enabled true
1324
+
1325
+ slo_status_from do |payload|
1326
+ next payload[:slo_status] if payload[:slo_status] # `next` exits the block; `return` can raise LocalJumpError in a proc
1327
+
1328
+ # All OrderCreated events = success
1329
+ 'success'
1330
+ end
1331
+
1332
+ contributes_to 'order_creation_success_rate'
1333
+ end
1334
+ end
1335
+ end
1336
+
1337
+ module Events
1338
+ class OrderCreationFailed < E11y::Event::Base
1339
+ schema do
1340
+ required(:user_id).filled(:string)
1341
+ required(:reason).filled(:string)
1342
+ required(:validation_errors).maybe(:array)
1343
+ optional(:slo_status).filled(:string)
1344
+ end
1345
+
1346
+ slo do
1347
+ enabled true
1348
+
1349
+ slo_status_from do |payload|
1350
+ next payload[:slo_status] if payload[:slo_status] # `next` exits the block; `return` can raise LocalJumpError in a proc
1351
+
1352
+ # All OrderCreationFailed events = failure
1353
+ 'failure'
1354
+ end
1355
+
1356
+ contributes_to 'order_creation_success_rate'
1357
+ end
1358
+ end
1359
+ end
1360
+ ```
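To show how a status block like the one in `Events::PaymentProcessed` might be evaluated, here is a self-contained sketch. `PAYMENT_STATUS` and the tallying loop are hypothetical stand-ins, not E11y internals. The sketch uses `next` for the early exit so that the block works whether it is invoked as a proc or a lambda.

```ruby
# Hypothetical stand-in for the block passed to `slo_status_from`.
PAYMENT_STATUS = proc do |payload|
  next payload[:slo_status] if payload[:slo_status] # explicit override wins

  case payload[:status]
  when "completed" then "success"
  when "failed"    then "failure"
  end # "pending" (or anything else) yields nil and is not counted
end

payloads = [
  { status: "completed", payment_method: "card" },
  { status: "failed",    payment_method: "card" },
  { status: "pending",   payment_method: "bank" },
  { status: "failed",    payment_method: "bank", slo_status: "success" }
]

# Tally results per [payment_method, slo_status], skipping nil statuses.
counts = Hash.new(0)
payloads.each do |payload|
  status = PAYMENT_STATUS.call(payload)
  next if status.nil?

  counts[[payload[:payment_method], status]] += 1
end
puts counts.inspect
```

The last payload shows the optional override: its `status` is `failed`, yet it is counted as a success because `slo_status` was supplied explicitly.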
1361
+
1362
+ ```yaml
1363
+ # config/slo.yml (E-commerce)
1364
+ custom_slos:
1365
+ - name: "payment_success_rate"
1366
+ description: "Payment processing success rate"
1367
+ type: event_based
1368
+ events:
1369
+ - Events::PaymentProcessed
1370
+ target: 0.999 # 99.9%
1371
+ window: 30d
1372
+ group_by: :payment_method
1373
+
1374
+ burn_rate_alerts:
1375
+ fast: { enabled: true, threshold: 14.4, alert_after: 5m }
1376
+
1377
+ - name: "order_creation_success_rate"
1378
+ description: "Order creation success rate"
1379
+ type: event_based
1380
+ events:
1381
+ - Events::OrderCreated
1382
+ - Events::OrderCreationFailed
1383
+ target: 0.999
1384
+ window: 30d
1385
+
1386
+ # App-Wide SLO (combines HTTP + Events)
1387
+ app_wide:
1388
+ aggregated_slo:
1389
+ enabled: true
1390
+ strategy: "weighted_average"
1391
+ components:
1392
+ - name: "http_slo"
1393
+ weight: 0.3 # 30%
1394
+ metric: "sum(rate(http_requests_total{status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total[30d]))"
1395
+ - name: "payment_slo"
1396
+ weight: 0.5 # 50% (most critical!)
1397
+ metric: "sum(rate(e11y_slo_event_result_total{slo_name=\"payment_success_rate\",slo_status=\"success\"}[30d])) / sum(rate(e11y_slo_event_result_total{slo_name=\"payment_success_rate\"}[30d]))"
1398
+ - name: "orders_slo"
1399
+ weight: 0.2 # 20%
1400
+ metric: "sum(rate(e11y_slo_event_result_total{slo_name=\"order_creation_success_rate\",slo_status=\"success\"}[30d])) / sum(rate(e11y_slo_event_result_total{slo_name=\"order_creation_success_rate\"}[30d]))"
1401
+ target: 0.999
1402
+ ```
1403
+
1404
+ ---
1405
+
1406
+ ## 11. Trade-offs
1407
+
1408
+ ### 11.1. Key Decisions
1409
+
1410
+ | Decision | Pro | Con | Rationale |
1411
+ |----------|-----|-----|-----------|
1412
+ | **Explicit slo declaration** | No magic, clear intent | Boilerplate in every Event | Clarity > brevity |
1413
+ | **slo_status_from auto-calc** | DRY, flexible | Extra method call per event | Overhead expected to be acceptable (target: <0.1ms p99) |
1414
+ | **Prometheus aggregation** | Flexible, standard | Complex PromQL | Industry standard (Google SRE) |
1415
+ | **Separate HTTP + Event SLO** | Clear separation | Two SLOs to manage | Reflects reality (infra ≠ business) |
1416
+ | **App-Wide aggregated SLO** | Single health metric | More config | SRE needs one number |
1417
+ | **Linters at boot** | Early failure detection | Startup time +50ms | Worth it for consistency |
1418
+ | **Optional override** | Edge case support | Potential for abuse | Trust developers |
1419
+
1420
+ ### 11.2. Alternatives Considered
1421
+
1422
+ **A) Auto-detect SLO from severity**
1423
+ ```ruby
1424
+ # ❌ REJECTED: Too implicit
1425
+ # severity :error → slo_status = 'failure'
1426
+ # severity :success → slo_status = 'success'
1427
+ ```
1428
+ - ❌ Not flexible (business logic ≠ severity)
1429
+ - ❌ Magic behavior
1430
+ - ✅ **CHOSEN:** Explicit `slo_status_from`
1431
+
1432
+ **B) SLO status as required field**
1433
+ ```ruby
1434
+ # ❌ REJECTED: Too rigid
1435
+ schema do
1436
+ required(:slo_status).filled(:string)
1437
+ end
1438
+ ```
1439
+ - ❌ Forces manual calculation at call site
1440
+ - ❌ Not DRY
1441
+ - ✅ **CHOSEN:** Optional override + auto-calc
1442
+
1443
+ **C) SLO calculation in application**
1444
+ ```ruby
1445
+ # ❌ REJECTED: Not flexible
1446
+ # E11y calculates SLO percentage internally
1447
+ Yabeda.e11y_slo.payment_success_rate.set({}, 0.9996)
1448
+ ```
1449
+ - ❌ Can't recalculate for different time windows
1450
+ - ❌ State lost on restart
1451
+ - ✅ **CHOSEN:** Prometheus aggregation
1452
+
1453
+ ---
1454
+
1455
+ ## 12. Summary & Next Steps
1456
+
1457
+ ### 12.1. What We Achieved
1458
+
1459
+ ✅ **Event-Driven SLO**: Custom SLO based on business events
1460
+ ✅ **Explicit Configuration**: No magic, all visible in Event class
1461
+ ✅ **Auto-calculation**: `slo_status_from` with optional override
1462
+ ✅ **Prometheus Integration**: Standard aggregation, flexible queries
1463
+ ✅ **3 Linters**: Ensure consistency at boot time
1464
+ ✅ **Independent HTTP + Event SLO**: Clear separation of concerns
1465
+ ✅ **App-Wide SLO Aggregation**: Single health metric for SRE
1466
+ ✅ **Group by support**: Per-label SLO (e.g., per payment_method)
1467
+ ✅ **Latency SLO**: Histogram metrics for p99/p95
1468
+ ✅ **Real-world example**: E-commerce platform (Section 10.1)
1469
+
1470
+ ### 12.2. Integration with ADR-003
1471
+
1472
+ | Aspect | ADR-003 (HTTP SLO) | ADR-014 (Event SLO) | Integration |
1473
+ |--------|-------------------|---------------------|-------------|
1474
+ | **Metrics Source** | HTTP requests/jobs | Event tracking | Independent |
1475
+ | **Config Location** | `slo.yml` endpoints | `slo.yml` custom_slos | Same file |
1476
+ | **Linters** | Route validation | Event validation | Run together |
1477
+ | **Burn Rate Alerts** | Multi-window | Multi-window | Same strategy |
1478
+ | **Prometheus** | PromQL aggregation | PromQL aggregation | Same approach |
1479
+ | **App-Wide SLO** | ❌ Not defined | ✅ Aggregated SLO | NEW feature |
1480
+
1481
+ ### 12.3. Implementation Checklist
1482
+
1483
+ **Phase 1: Core (Week 1-2)**
1484
+ - [ ] Implement `E11y::Event::Base.slo` DSL
1485
+ - [ ] Implement `E11y::Event::SLOConfig` class
1486
+ - [ ] Add `slo_status_from` computation
1487
+ - [ ] Integrate with `track` method
1488
+ - [ ] Emit Yabeda metrics (`event_result_total`)
1489
+
1490
+ **Phase 2: Configuration (Week 3)**
1491
+ - [ ] Extend `slo.yml` schema for `custom_slos`
1492
+ - [ ] Implement `ConfigLoader.resolve_custom_slo`
1493
+ - [ ] Add `app_wide.aggregated_slo` config
1494
+ - [ ] Add validation for custom SLO config
1495
+
1496
+ **Phase 3: Linters (Week 4)**
1497
+ - [ ] Implement Linter 1 (explicit slo declaration)
1498
+ - [ ] Implement Linter 2 (slo_status_from required)
1499
+ - [ ] Implement Linter 3 (slo.yml consistency)
1500
+ - [ ] Add auto-validation on boot
1501
+ - [ ] Add `strict_validation` config option
1502
+
1503
+ **Phase 4: Prometheus (Week 5)**
1504
+ - [ ] Document PromQL queries
1505
+ - [ ] Add app-wide aggregated SLO query
1506
+ - [ ] Create Grafana dashboard templates
1507
+ - [ ] Add burn rate alerts for Event SLO
1508
+ - [ ] Test with real Event tracking
1509
+
1510
+ **Phase 5: Documentation (Week 6)**
1511
+ - [ ] Write Event SLO guide
1512
+ - [ ] Add real-world examples
1513
+ - [ ] Document migration from HTTP-only SLO
1514
+ - [ ] Add troubleshooting guide
1515
+
1516
+ **Phase 6: Testing (Week 7)**
1517
+ - [ ] RSpec for `slo_status_from` computation
1518
+ - [ ] RSpec for Yabeda metric emission
1519
+ - [ ] RSpec for all 3 linters
1520
+ - [ ] Integration tests (end-to-end)
1521
+ - [ ] Performance benchmarks (<0.1ms p99)
1522
+
1523
+ ---
1524
+
1525
+ **Status:** ✅ Fully Designed
1526
+ **Next:** Implementation (Phases 1-6)
1527
+ **Estimated Implementation:** 7 weeks
1528
+ **Impact:**
1529
+ - Business-logic SLO visibility (not just infrastructure)
1530
+ - Explicit, no-magic configuration
1531
+ - Flexible Prometheus-based aggregation
1532
+ - App-wide health metric for SRE
1533
+ - Consistency enforced by linters