e11y 0.1.0
- checksums.yaml +7 -0
- data/.rspec +4 -0
- data/.rubocop.yml +69 -0
- data/CHANGELOG.md +26 -0
- data/CODE_OF_CONDUCT.md +64 -0
- data/LICENSE.txt +21 -0
- data/README.md +179 -0
- data/Rakefile +37 -0
- data/benchmarks/run_all.rb +33 -0
- data/config/README.md +83 -0
- data/config/loki-local-config.yaml +35 -0
- data/config/prometheus.yml +15 -0
- data/docker-compose.yml +78 -0
- data/docs/00-ICP-AND-TIMELINE.md +483 -0
- data/docs/01-SCALE-REQUIREMENTS.md +858 -0
- data/docs/ADR-001-architecture.md +2617 -0
- data/docs/ADR-002-metrics-yabeda.md +1395 -0
- data/docs/ADR-003-slo-observability.md +3337 -0
- data/docs/ADR-004-adapter-architecture.md +2385 -0
- data/docs/ADR-005-tracing-context.md +1372 -0
- data/docs/ADR-006-security-compliance.md +4143 -0
- data/docs/ADR-007-opentelemetry-integration.md +1385 -0
- data/docs/ADR-008-rails-integration.md +1911 -0
- data/docs/ADR-009-cost-optimization.md +2993 -0
- data/docs/ADR-010-developer-experience.md +2166 -0
- data/docs/ADR-011-testing-strategy.md +1836 -0
- data/docs/ADR-012-event-evolution.md +958 -0
- data/docs/ADR-013-reliability-error-handling.md +2750 -0
- data/docs/ADR-014-event-driven-slo.md +1533 -0
- data/docs/ADR-015-middleware-order.md +1061 -0
- data/docs/ADR-016-self-monitoring-slo.md +1234 -0
- data/docs/API-REFERENCE-L28.md +914 -0
- data/docs/COMPREHENSIVE-CONFIGURATION.md +2366 -0
- data/docs/IMPLEMENTATION_NOTES.md +2804 -0
- data/docs/IMPLEMENTATION_PLAN.md +1971 -0
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +586 -0
- data/docs/PLAN.md +148 -0
- data/docs/QUICK-START.md +934 -0
- data/docs/README.md +296 -0
- data/docs/design/00-memory-optimization.md +593 -0
- data/docs/guides/MIGRATION-L27-L28.md +692 -0
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +434 -0
- data/docs/guides/README.md +44 -0
- data/docs/prd/01-overview-vision.md +440 -0
- data/docs/use_cases/README.md +119 -0
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +813 -0
- data/docs/use_cases/UC-002-business-event-tracking.md +1953 -0
- data/docs/use_cases/UC-003-pattern-based-metrics.md +1627 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +728 -0
- data/docs/use_cases/UC-005-sentry-integration.md +759 -0
- data/docs/use_cases/UC-006-trace-context-management.md +905 -0
- data/docs/use_cases/UC-007-pii-filtering.md +2648 -0
- data/docs/use_cases/UC-008-opentelemetry-integration.md +1153 -0
- data/docs/use_cases/UC-009-multi-service-tracing.md +1043 -0
- data/docs/use_cases/UC-010-background-job-tracking.md +1018 -0
- data/docs/use_cases/UC-011-rate-limiting.md +1906 -0
- data/docs/use_cases/UC-012-audit-trail.md +2301 -0
- data/docs/use_cases/UC-013-high-cardinality-protection.md +2127 -0
- data/docs/use_cases/UC-014-adaptive-sampling.md +1940 -0
- data/docs/use_cases/UC-015-cost-optimization.md +735 -0
- data/docs/use_cases/UC-016-rails-logger-migration.md +785 -0
- data/docs/use_cases/UC-017-local-development.md +867 -0
- data/docs/use_cases/UC-018-testing-events.md +1081 -0
- data/docs/use_cases/UC-019-tiered-storage-migration.md +562 -0
- data/docs/use_cases/UC-020-event-versioning.md +708 -0
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +956 -0
- data/docs/use_cases/UC-022-event-registry.md +648 -0
- data/docs/use_cases/backlog.md +226 -0
- data/e11y.gemspec +76 -0
- data/lib/e11y/adapters/adaptive_batcher.rb +207 -0
- data/lib/e11y/adapters/audit_encrypted.rb +239 -0
- data/lib/e11y/adapters/base.rb +580 -0
- data/lib/e11y/adapters/file.rb +224 -0
- data/lib/e11y/adapters/in_memory.rb +216 -0
- data/lib/e11y/adapters/loki.rb +333 -0
- data/lib/e11y/adapters/otel_logs.rb +203 -0
- data/lib/e11y/adapters/registry.rb +141 -0
- data/lib/e11y/adapters/sentry.rb +230 -0
- data/lib/e11y/adapters/stdout.rb +108 -0
- data/lib/e11y/adapters/yabeda.rb +370 -0
- data/lib/e11y/buffers/adaptive_buffer.rb +339 -0
- data/lib/e11y/buffers/base_buffer.rb +40 -0
- data/lib/e11y/buffers/request_scoped_buffer.rb +246 -0
- data/lib/e11y/buffers/ring_buffer.rb +267 -0
- data/lib/e11y/buffers.rb +14 -0
- data/lib/e11y/console.rb +122 -0
- data/lib/e11y/current.rb +48 -0
- data/lib/e11y/event/base.rb +894 -0
- data/lib/e11y/event/value_sampling_config.rb +84 -0
- data/lib/e11y/events/base_audit_event.rb +43 -0
- data/lib/e11y/events/base_payment_event.rb +33 -0
- data/lib/e11y/events/rails/cache/delete.rb +21 -0
- data/lib/e11y/events/rails/cache/read.rb +23 -0
- data/lib/e11y/events/rails/cache/write.rb +22 -0
- data/lib/e11y/events/rails/database/query.rb +45 -0
- data/lib/e11y/events/rails/http/redirect.rb +21 -0
- data/lib/e11y/events/rails/http/request.rb +26 -0
- data/lib/e11y/events/rails/http/send_file.rb +21 -0
- data/lib/e11y/events/rails/http/start_processing.rb +26 -0
- data/lib/e11y/events/rails/job/completed.rb +22 -0
- data/lib/e11y/events/rails/job/enqueued.rb +22 -0
- data/lib/e11y/events/rails/job/failed.rb +22 -0
- data/lib/e11y/events/rails/job/scheduled.rb +23 -0
- data/lib/e11y/events/rails/job/started.rb +22 -0
- data/lib/e11y/events/rails/log.rb +56 -0
- data/lib/e11y/events/rails/view/render.rb +23 -0
- data/lib/e11y/events.rb +18 -0
- data/lib/e11y/instruments/active_job.rb +201 -0
- data/lib/e11y/instruments/rails_instrumentation.rb +141 -0
- data/lib/e11y/instruments/sidekiq.rb +175 -0
- data/lib/e11y/logger/bridge.rb +205 -0
- data/lib/e11y/metrics/cardinality_protection.rb +172 -0
- data/lib/e11y/metrics/cardinality_tracker.rb +134 -0
- data/lib/e11y/metrics/registry.rb +234 -0
- data/lib/e11y/metrics/relabeling.rb +226 -0
- data/lib/e11y/metrics.rb +102 -0
- data/lib/e11y/middleware/audit_signing.rb +174 -0
- data/lib/e11y/middleware/base.rb +140 -0
- data/lib/e11y/middleware/event_slo.rb +167 -0
- data/lib/e11y/middleware/pii_filter.rb +266 -0
- data/lib/e11y/middleware/pii_filtering.rb +280 -0
- data/lib/e11y/middleware/rate_limiting.rb +214 -0
- data/lib/e11y/middleware/request.rb +163 -0
- data/lib/e11y/middleware/routing.rb +157 -0
- data/lib/e11y/middleware/sampling.rb +254 -0
- data/lib/e11y/middleware/slo.rb +168 -0
- data/lib/e11y/middleware/trace_context.rb +131 -0
- data/lib/e11y/middleware/validation.rb +118 -0
- data/lib/e11y/middleware/versioning.rb +132 -0
- data/lib/e11y/middleware.rb +12 -0
- data/lib/e11y/pii/patterns.rb +90 -0
- data/lib/e11y/pii.rb +13 -0
- data/lib/e11y/pipeline/builder.rb +155 -0
- data/lib/e11y/pipeline/zone_validator.rb +110 -0
- data/lib/e11y/pipeline.rb +12 -0
- data/lib/e11y/presets/audit_event.rb +65 -0
- data/lib/e11y/presets/debug_event.rb +34 -0
- data/lib/e11y/presets/high_value_event.rb +51 -0
- data/lib/e11y/presets.rb +19 -0
- data/lib/e11y/railtie.rb +138 -0
- data/lib/e11y/reliability/circuit_breaker.rb +216 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +277 -0
- data/lib/e11y/reliability/dlq/filter.rb +117 -0
- data/lib/e11y/reliability/retry_handler.rb +207 -0
- data/lib/e11y/reliability/retry_rate_limiter.rb +117 -0
- data/lib/e11y/sampling/error_spike_detector.rb +225 -0
- data/lib/e11y/sampling/load_monitor.rb +161 -0
- data/lib/e11y/sampling/stratified_tracker.rb +92 -0
- data/lib/e11y/sampling/value_extractor.rb +82 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +79 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +97 -0
- data/lib/e11y/self_monitoring/reliability_monitor.rb +146 -0
- data/lib/e11y/slo/event_driven.rb +150 -0
- data/lib/e11y/slo/tracker.rb +119 -0
- data/lib/e11y/version.rb +9 -0
- data/lib/e11y.rb +283 -0
- metadata +452 -0
|
@@ -0,0 +1,1533 @@
|
|
|
1
|
+
# ADR-014: Event-Driven SLO
|
|
2
|
+
|
|
3
|
+
**Status:** Draft
|
|
4
|
+
**Date:** January 13, 2026
|
|
5
|
+
**Covers:** Integration between Event System (ADR-001) and SLO (ADR-003)
|
|
6
|
+
**Depends On:** ADR-001 (Core), ADR-002 (Metrics), ADR-003 (SLO)
|
|
7
|
+
|
|
8
|
+
**Related ADRs:**
|
|
9
|
+
- 📊 **ADR-003: SLO & Observability** - HTTP/Job SLO (infrastructure reliability)
|
|
10
|
+
- 🔗 **Integration:** See `ADR-003-014-INTEGRATION.md` for detailed integration analysis
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## 🔍 Scope of This ADR
|
|
15
|
+
|
|
16
|
+
This ADR covers **Event-based SLO** (business logic reliability):
|
|
17
|
+
- ✅ Custom SLO based on E11y Events (e.g., `Events::PaymentProcessed`)
|
|
18
|
+
- ✅ Explicit opt-in via `slo { enabled true }` in Event class
|
|
19
|
+
- ✅ Auto-calculation of `slo_status` from event payload
|
|
20
|
+
- ✅ Configuration in `slo.yml` under `custom_slos` section
|
|
21
|
+
- ✅ App-Wide Aggregated SLO (combines HTTP + Event metrics)
|
|
22
|
+
|
|
23
|
+
**For HTTP/Job SLO** (zero-config infrastructure monitoring), see **ADR-003**.
|
|
24
|
+
|
|
25
|
+
**Key Difference:**
|
|
26
|
+
- **ADR-003**: "Is the server responding?" (HTTP 200 vs 500, job success vs failure)
|
|
27
|
+
- **ADR-014**: "Is the business logic working?" (payment processed vs failed, order created vs rejected)
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## 📋 Table of Contents
|
|
32
|
+
|
|
33
|
+
1. [Context & Problem](#1-context--problem)
|
|
34
|
+
2. [Architecture Overview](#2-architecture-overview)
|
|
35
|
+
3. [Event SLO DSL](#3-event-slo-dsl)
|
|
36
|
+
4. [SLO Status Calculation](#4-slo-status-calculation)
|
|
37
|
+
5. [Custom SLO Configuration](#5-custom-slo-configuration)
|
|
38
|
+
6. [Metrics Export](#6-metrics-export)
|
|
39
|
+
7. [Validation & Linting](#7-validation--linting)
|
|
40
|
+
8. [Prometheus Integration](#8-prometheus-integration)
|
|
41
|
+
9. [App-Wide SLO Aggregation](#9-app-wide-slo-aggregation)
|
|
42
|
+
10. [Real-World Examples](#10-real-world-examples)
|
|
43
|
+
11. [Trade-offs](#11-trade-offs)
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## 1. Context & Problem
|
|
48
|
+
|
|
49
|
+
### 1.1. Problem Statement
|
|
50
|
+
|
|
51
|
+
**HTTP SLO is insufficient for business metrics:**
|
|
52
|
+
|
|
53
|
+
```ruby
|
|
54
|
+
# === PROBLEM 1: HTTP 200 ≠ Business Success ===
|
|
55
|
+
# POST /orders → HTTP 200 (infrastructure success)
|
|
56
|
+
# but Events::OrderCreationFailed.track(...) (business logic failure)
|
|
57
|
+
# → HTTP SLO shows 100%, but actually 50% of orders fail to create!
|
|
58
|
+
|
|
59
|
+
# === PROBLEM 2: Background Jobs SLO ===
|
|
60
|
+
# Sidekiq job completed (no exception)
|
|
61
|
+
# but Events::PaymentFailed.track(...) (payment actually failed)
|
|
62
|
+
# → Job SLO shows 100%, but payments are not processing!
|
|
63
|
+
|
|
64
|
+
# === PROBLEM 3: Complex Business Logic SLO ===
|
|
65
|
+
# SLO: "Order fulfillment within 24h"
|
|
66
|
+
# Events::OrderPaid → Events::OrderShipped (time diff < 24h)
|
|
67
|
+
# → How to track such SLO?
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
### 1.2. Design Decisions
|
|
71
|
+
|
|
72
|
+
**Decision 1: Independent HTTP + Event SLO**
|
|
73
|
+
```ruby
|
|
74
|
+
# ✅ HTTP SLO = Infrastructure reliability (200 vs 500)
|
|
75
|
+
# ✅ Event SLO = Business logic reliability (order created vs failed)
|
|
76
|
+
# ✅ App-Wide SLO = Aggregation of both (for overall health)
|
|
77
|
+
# → All three are important!
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
**Decision 2: Explicit opt-in for Event SLO**
|
|
81
|
+
```ruby
|
|
82
|
+
# ✅ By default Events do NOT participate in SLO
|
|
83
|
+
# ✅ Must explicitly declare `slo { enabled true }`
|
|
84
|
+
# ✅ Must explicitly define `slo_status_from`
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
**Decision 3: Auto-calculation of `slo_status` (with override)**
|
|
88
|
+
```ruby
|
|
89
|
+
# ✅ slo_status computed from payload (e.g., status == 'completed')
|
|
90
|
+
# ✅ Can override: track(status: 'completed', slo_status: 'failure')
|
|
91
|
+
# ✅ If slo_status is nil → event is not counted in SLO
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
**Decision 4: Aggregation in Prometheus (not in application)**
|
|
95
|
+
```ruby
|
|
96
|
+
# ✅ E11y exports raw metrics: event_result_total{slo_status="success|failure"}
|
|
97
|
+
# ✅ Prometheus calculates SLO via PromQL
|
|
98
|
+
# ✅ Flexibility (can recalculate for any time period)
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
**Decision 5: Linters for explicitness**
|
|
102
|
+
```ruby
|
|
103
|
+
# ✅ Linter 1: Every Event must have `slo { ... }` or `slo false`
|
|
104
|
+
# ✅ Linter 2: If `slo { enabled true }` → `slo_status_from` is required
|
|
105
|
+
# ✅ Linter 3: If Event in slo.yml → must have `slo { enabled true }`
|
|
106
|
+
```
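The three linter rules can be sketched without any framework code. This is a minimal illustration only: `EventSpec` and `lint_events` are hypothetical names that stand in for real event classes and for whatever form the linters finally take.

```ruby
# Hypothetical, framework-free sketch of the three linter rules.
# EventSpec stands in for an event class; none of these names come from E11y.
EventSpec = Struct.new(:name, :slo_declared, :slo_enabled, :has_status_proc,
                       keyword_init: true)

def lint_events(events, slo_yml_event_names)
  errors = []
  events.each do |event|
    # Linter 1: every event declares `slo { ... }` or `slo false`.
    errors << "#{event.name}: missing explicit slo declaration" unless event.slo_declared

    # Linter 2: an enabled SLO needs a slo_status_from block.
    if event.slo_enabled && !event.has_status_proc
      errors << "#{event.name}: slo enabled but slo_status_from is missing"
    end

    # Linter 3: events listed in slo.yml must have the SLO enabled.
    if slo_yml_event_names.include?(event.name) && !event.slo_enabled
      errors << "#{event.name}: listed in slo.yml but slo is not enabled"
    end
  end
  errors
end

events = [
  EventSpec.new(name: "Events::PaymentProcessed", slo_declared: true,
                slo_enabled: true, has_status_proc: true),
  EventSpec.new(name: "Events::OrderCreated", slo_declared: true,
                slo_enabled: true, has_status_proc: false)
]

puts lint_events(events, ["Events::PaymentProcessed"])
```

Collecting all violations into one list, rather than raising on the first, lets a CI job report every misconfigured event in a single run.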
|
|
107
|
+
|
|
108
|
+
### 1.3. Goals
|
|
109
|
+
|
|
110
|
+
**Primary Goals:**
|
|
111
|
+
- ✅ **Custom SLO based on Events** (business logic)
|
|
112
|
+
- ✅ **Explicit configuration** (no magic)
|
|
113
|
+
- ✅ **Auto-calculation of `slo_status`** (DRY)
|
|
114
|
+
- ✅ **Independent HTTP + Event SLO**
|
|
115
|
+
- ✅ **App-Wide SLO aggregation** (overall health)
|
|
116
|
+
- ✅ **Prometheus aggregation** (flexible)
|
|
117
|
+
- ✅ **Linters for consistency**
|
|
118
|
+
|
|
119
|
+
**Non-Goals:**
|
|
120
|
+
- ❌ Automatic linking of HTTP SLO and Event SLO (magic)
|
|
121
|
+
- ❌ Multi-event SLO in v1.0 (e.g., OrderPaid → OrderShipped within 24h)
|
|
122
|
+
- ❌ ML-based SLO prediction
|
|
123
|
+
|
|
124
|
+
### 1.4. Success Metrics
|
|
125
|
+
|
|
126
|
+
| Metric | Target | Critical? |
|
|
127
|
+
|--------|--------|-----------|
|
|
128
|
+
| **Event SLO accuracy** | 100% (matches business logic) | ✅ Yes |
|
|
129
|
+
| **Explicit slo declaration** | 100% of Events | ✅ Yes |
|
|
130
|
+
| **slo_status calculation overhead** | <0.1ms p99 | ✅ Yes |
|
|
131
|
+
| **Prometheus query performance** | <500ms for 30d window | ✅ Yes |
|
|
132
|
+
|
|
133
|
+
---
|
|
134
|
+
|
|
135
|
+
## 2. Architecture Overview
|
|
136
|
+
|
|
137
|
+
### 2.1. System Context
|
|
138
|
+
|
|
139
|
+
```mermaid
|
|
140
|
+
C4Context
|
|
141
|
+
title Event-Driven SLO Context
|
|
142
|
+
|
|
143
|
+
Person(dev, "Developer", "Defines Events + SLO")
|
|
144
|
+
Person(sre, "SRE", "Monitors SLO")
|
|
145
|
+
|
|
146
|
+
System(rails_app, "Rails App", "Tracks Events")
|
|
147
|
+
System(e11y, "E11y Gem", "Event SLO DSL")
|
|
148
|
+
System(slo_yml, "slo.yml", "Custom SLO config")
|
|
149
|
+
|
|
150
|
+
System_Ext(yabeda, "Yabeda", "Metrics export")
|
|
151
|
+
System_Ext(prometheus, "Prometheus", "Aggregation")
|
|
152
|
+
System_Ext(grafana, "Grafana", "Dashboards")
|
|
153
|
+
|
|
154
|
+
Rel(dev, rails_app, "Tracks", "Events::PaymentProcessed")
|
|
155
|
+
Rel(rails_app, e11y, "Evaluates", "slo_status_from")
|
|
156
|
+
Rel(e11y, slo_yml, "Validates", "custom_slos")
|
|
157
|
+
Rel(e11y, yabeda, "Exports", "event_result_total{slo_status}")
|
|
158
|
+
Rel(yabeda, prometheus, "Exposes", "Metrics (scraped by Prometheus)")
|
|
159
|
+
Rel(prometheus, grafana, "Queries", "PromQL")
|
|
160
|
+
Rel(sre, grafana, "Views", "SLO dashboards")
|
|
161
|
+
|
|
162
|
+
UpdateLayoutConfig($c4ShapeInRow="3", $c4BoundaryInRow="1")
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
### 2.2. Component Architecture
|
|
166
|
+
|
|
167
|
+
```mermaid
|
|
168
|
+
graph TB
|
|
169
|
+
subgraph "Event Class Definition"
|
|
170
|
+
EventDef[Event Class<br/>PaymentProcessed]
|
|
171
|
+
SLOBlock[slo do<br/>enabled true<br/>slo_status_from]
|
|
172
|
+
end
|
|
173
|
+
|
|
174
|
+
subgraph "Event Tracking"
|
|
175
|
+
Track[".track(payload)"]
|
|
176
|
+
Validate[Validate Schema]
|
|
177
|
+
ComputeSLO[Compute slo_status]
|
|
178
|
+
EmitMetric[Emit Yabeda Metric]
|
|
179
|
+
end
|
|
180
|
+
|
|
181
|
+
subgraph "SLO Configuration"
|
|
182
|
+
SLOConfig[slo.yml]
|
|
183
|
+
CustomSLO[custom_slos]
|
|
184
|
+
EventList[events: PaymentProcessed]
|
|
185
|
+
end
|
|
186
|
+
|
|
187
|
+
subgraph "Metrics Export"
|
|
188
|
+
YabedaMetric[Yabeda.e11y_slo<br/>.event_result_total]
|
|
189
|
+
PrometheusMetric["e11y_slo_event_result_total{<br/>slo_status='success'}"]
|
|
190
|
+
end
|
|
191
|
+
|
|
192
|
+
subgraph "Validation & Linting"
|
|
193
|
+
Linter1[Explicit slo declaration]
|
|
194
|
+
Linter2[slo_status_from required]
|
|
195
|
+
Linter3[slo.yml consistency]
|
|
196
|
+
end
|
|
197
|
+
|
|
198
|
+
EventDef --> SLOBlock
|
|
199
|
+
SLOBlock --> Track
|
|
200
|
+
Track --> Validate
|
|
201
|
+
Validate --> ComputeSLO
|
|
202
|
+
ComputeSLO --> EmitMetric
|
|
203
|
+
|
|
204
|
+
SLOConfig --> CustomSLO
|
|
205
|
+
CustomSLO --> EventList
|
|
206
|
+
EventList -.validates.-> SLOBlock
|
|
207
|
+
|
|
208
|
+
EmitMetric --> YabedaMetric
|
|
209
|
+
YabedaMetric --> PrometheusMetric
|
|
210
|
+
|
|
211
|
+
SLOBlock -.validates.-> Linter1
|
|
212
|
+
SLOBlock -.validates.-> Linter2
|
|
213
|
+
CustomSLO -.validates.-> Linter3
|
|
214
|
+
|
|
215
|
+
style ComputeSLO fill:#d1ecf1
|
|
216
|
+
style EmitMetric fill:#fff3cd
|
|
217
|
+
style Linter1 fill:#f8d7da
|
|
218
|
+
style Linter2 fill:#f8d7da
|
|
219
|
+
style Linter3 fill:#f8d7da
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
### 2.3. Event SLO Flow Sequence
|
|
223
|
+
|
|
224
|
+
```mermaid
|
|
225
|
+
sequenceDiagram
|
|
226
|
+
participant App as Rails App
|
|
227
|
+
participant Event as Events::PaymentProcessed
|
|
228
|
+
participant SLO as SLO Config
|
|
229
|
+
participant Yabeda as Yabeda
|
|
230
|
+
participant Prom as Prometheus
|
|
231
|
+
|
|
232
|
+
App->>Event: .track(payment_id: 'p123', status: 'completed')
|
|
233
|
+
|
|
234
|
+
Note over Event: 1. Validate schema
|
|
235
|
+
Event->>Event: Schema validation ✅
|
|
236
|
+
|
|
237
|
+
Note over Event: 2. Check if SLO enabled
|
|
238
|
+
Event->>SLO: slo_enabled?
|
|
239
|
+
SLO-->>Event: true
|
|
240
|
+
|
|
241
|
+
Note over Event: 3. Compute slo_status
|
|
242
|
+
Event->>Event: slo_status_from.call(payload)
|
|
243
|
+
Event->>Event: payload[:status] == 'completed' → 'success'
|
|
244
|
+
|
|
245
|
+
Note over Event: 4. Emit Yabeda metric
|
|
246
|
+
Event->>Yabeda: event_result_total.increment(slo_status: 'success')
|
|
247
|
+
|
|
248
|
+
Note over Yabeda: 5. Export to Prometheus
|
|
249
|
+
Yabeda->>Prom: e11y_slo_event_result_total{slo_status="success"} +1
|
|
250
|
+
|
|
251
|
+
Note over Prom: 6. Aggregate SLO
|
|
252
|
+
Prom->>Prom: sum(rate(...{slo_status="success"}[30d])) / sum(rate(...[30d]))
|
|
253
|
+
Prom->>Prom: Result: 0.9996 (99.96%)
|
|
254
|
+
```
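The ratio from step 6 can be reproduced in plain Ruby to sanity-check dashboard numbers. This is only an illustration of the arithmetic: `slo_ratio` is a hypothetical helper and the counts are made up. In the design above the aggregation itself stays in Prometheus.

```ruby
# Illustrative only: mirrors the success / total ratio that Prometheus
# computes in step 6. `slo_ratio` is a hypothetical helper, not part of E11y.
def slo_ratio(counts)
  success = counts.fetch("success", 0)
  total = success + counts.fetch("failure", 0)
  return nil if total.zero? # no counted events → the SLO is undefined

  success.to_f / total
end

counts = { "success" => 9_996, "failure" => 4 }
puts format("%.2f%%", slo_ratio(counts) * 100) # prints "99.96%"
```

Events whose `slo_status` is nil never reach the counter, so they appear in neither the numerator nor the denominator.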
|
|
255
|
+
|
|
256
|
+
---
|
|
257
|
+
|
|
258
|
+
## 3. Event SLO DSL
|
|
259
|
+
|
|
260
|
+
### 3.1. Basic Event SLO (enabled)
|
|
261
|
+
|
|
262
|
+
```ruby
|
|
263
|
+
# app/events/payment_processed.rb
|
|
264
|
+
module Events
|
|
265
|
+
class PaymentProcessed < E11y::Event::Base
|
|
266
|
+
schema do
|
|
267
|
+
required(:payment_id).filled(:string)
|
|
268
|
+
required(:amount).filled(:float)
|
|
269
|
+
required(:status).filled(:string) # 'completed', 'failed', 'pending'
|
|
270
|
+
optional(:slo_status).filled(:string) # Optional explicit override
|
|
271
|
+
end
|
|
272
|
+
|
|
273
|
+
# ============================================================
|
|
274
|
+
# SLO CONFIGURATION (explicit, opt-in)
|
|
275
|
+
# ============================================================
|
|
276
|
+
slo do
|
|
277
|
+
# 1. Enable SLO tracking for this Event
|
|
278
|
+
enabled true
|
|
279
|
+
|
|
280
|
+
# 2. Calculate slo_status from payload
|
|
281
|
+
slo_status_from do |payload|
|
|
282
|
+
# Priority 1: Explicit override (if provided)
|
|
283
|
+
next payload[:slo_status] if payload[:slo_status] # `next`, not `return`: a `return` here raises LocalJumpError when the block is called
|
|
284
|
+
|
|
285
|
+
# Priority 2: Auto-calculate from status
|
|
286
|
+
case payload[:status]
|
|
287
|
+
when 'completed' then 'success'
|
|
288
|
+
when 'failed' then 'failure'
|
|
289
|
+
when 'pending' then nil # Not counted in SLO
|
|
290
|
+
else nil
|
|
291
|
+
end
|
|
292
|
+
end
|
|
293
|
+
|
|
294
|
+
# 3. Which custom SLO does this contribute to?
|
|
295
|
+
contributes_to 'payment_success_rate'
|
|
296
|
+
|
|
297
|
+
# 4. Optional: Group by label (for per-type SLO)
|
|
298
|
+
group_by :payment_method # Separate SLO for 'card', 'bank', 'paypal'
|
|
299
|
+
end
|
|
300
|
+
end
|
|
301
|
+
end
|
|
302
|
+
|
|
303
|
+
# Usage 1: Auto-calculation (normal case)
|
|
304
|
+
Events::PaymentProcessed.track(
|
|
305
|
+
payment_id: 'p123',
|
|
306
|
+
amount: 99.99,
|
|
307
|
+
status: 'completed', # → slo_status = 'success'
|
|
308
|
+
payment_method: 'card'
|
|
309
|
+
)
|
|
310
|
+
|
|
311
|
+
# Usage 2: Explicit override (edge case)
|
|
312
|
+
Events::PaymentProcessed.track(
|
|
313
|
+
payment_id: 'p456',
|
|
314
|
+
amount: 50.00,
|
|
315
|
+
status: 'completed', # Status completed
|
|
316
|
+
slo_status: 'failure', # But business logic says failure (e.g., fraud detected)
|
|
317
|
+
payment_method: 'card'
|
|
318
|
+
)
|
|
319
|
+
|
|
320
|
+
# Usage 3: Not counted in SLO
|
|
321
|
+
Events::PaymentProcessed.track(
|
|
322
|
+
payment_id: 'p789',
|
|
323
|
+
amount: 25.00,
|
|
324
|
+
status: 'pending', # → slo_status = nil (not counted)
|
|
325
|
+
payment_method: 'bank'
|
|
326
|
+
)
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
### 3.2. Event SLO Disabled
|
|
330
|
+
|
|
331
|
+
```ruby
|
|
332
|
+
# app/events/health_check_pinged.rb
|
|
333
|
+
module Events
|
|
334
|
+
class HealthCheckPinged < E11y::Event::Base
|
|
335
|
+
schema do
|
|
336
|
+
required(:status).filled(:string)
|
|
337
|
+
required(:response_time_ms).filled(:integer)
|
|
338
|
+
end
|
|
339
|
+
|
|
340
|
+
# ============================================================
|
|
341
|
+
# SLO DISABLED (explicit opt-out)
|
|
342
|
+
# ============================================================
|
|
343
|
+
slo false # ← Explicit: does NOT participate in SLO
|
|
344
|
+
end
|
|
345
|
+
end
|
|
346
|
+
|
|
347
|
+
# Usage: Normal event tracking (no SLO metrics emitted)
|
|
348
|
+
Events::HealthCheckPinged.track(
|
|
349
|
+
status: 'ok',
|
|
350
|
+
response_time_ms: 15
|
|
351
|
+
)
|
|
352
|
+
```
|
|
353
|
+
|
|
354
|
+
### 3.3. Event SLO with Latency
|
|
355
|
+
|
|
356
|
+
```ruby
|
|
357
|
+
# app/events/api_request_completed.rb
|
|
358
|
+
module Events
|
|
359
|
+
class ApiRequestCompleted < E11y::Event::Base
|
|
360
|
+
schema do
|
|
361
|
+
required(:endpoint).filled(:string)
|
|
362
|
+
required(:status_code).filled(:integer)
|
|
363
|
+
required(:duration_ms).filled(:integer)
|
|
364
|
+
optional(:slo_status).filled(:string)
|
|
365
|
+
end
|
|
366
|
+
|
|
367
|
+
slo do
|
|
368
|
+
enabled true
|
|
369
|
+
|
|
370
|
+
slo_status_from do |payload|
|
|
371
|
+
next payload[:slo_status] if payload[:slo_status] # `next`, not `return`: a `return` here raises LocalJumpError when the block is called
|
|
372
|
+
|
|
373
|
+
payload[:status_code] < 500 ? 'success' : 'failure'
|
|
374
|
+
end
|
|
375
|
+
|
|
376
|
+
contributes_to 'api_latency_slo'
|
|
377
|
+
|
|
378
|
+
# Optional: Extract latency for histogram
|
|
379
|
+
latency_field :duration_ms # E11y will emit histogram metric
|
|
380
|
+
|
|
381
|
+
group_by :endpoint # Per-endpoint latency SLO
|
|
382
|
+
end
|
|
383
|
+
end
|
|
384
|
+
end
|
|
385
|
+
```
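To make the histogram idea concrete, the sketch below shows how one latency observation could land in cumulative buckets and how the share of requests under a threshold falls out of them. The bucket bounds and both helper names are assumptions for illustration, not E11y or Yabeda defaults.

```ruby
# Hypothetical sketch: how a latency observation lands in cumulative
# histogram buckets (the "le" buckets Prometheus uses). Bounds are assumed.
BUCKETS_SECONDS = [0.1, 0.3, 0.5, 1.0].freeze

def observe(histogram, duration_ms)
  seconds = duration_ms / 1000.0 # payload carries ms, the metric uses seconds
  BUCKETS_SECONDS.each { |le| histogram[le] += 1 if seconds <= le }
  histogram[:inf] += 1 # the +Inf bucket counts every observation
  histogram
end

def fraction_within(histogram, le)
  return nil if histogram[:inf].zero?

  histogram[le].to_f / histogram[:inf]
end

histogram = Hash.new(0)
[120, 280, 450, 900].each { |ms| observe(histogram, ms) }
puts fraction_within(histogram, 0.5) # prints 0.75 (3 of 4 requests at or under 500 ms)
```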
|
|
386
|
+
|
|
387
|
+
### 3.4. Complex SLO Status Calculation
|
|
388
|
+
|
|
389
|
+
```ruby
|
|
390
|
+
# app/events/order_created.rb
|
|
391
|
+
module Events
|
|
392
|
+
class OrderCreated < E11y::Event::Base
|
|
393
|
+
schema do
|
|
394
|
+
required(:order_id).filled(:string)
|
|
395
|
+
required(:user_id).filled(:string)
|
|
396
|
+
required(:items).array(:hash)
|
|
397
|
+
required(:total_amount).filled(:float)
|
|
398
|
+
required(:validation_result).hash do
|
|
399
|
+
required(:passed).filled(:bool)
|
|
400
|
+
optional(:errors).array(:string)
|
|
401
|
+
end
|
|
402
|
+
optional(:slo_status).filled(:string)
|
|
403
|
+
end
|
|
404
|
+
|
|
405
|
+
slo do
|
|
406
|
+
enabled true
|
|
407
|
+
|
|
408
|
+
# Complex business logic for slo_status
|
|
409
|
+
slo_status_from do |payload|
|
|
410
|
+
next payload[:slo_status] if payload[:slo_status] # `next`, not `return`: a `return` here raises LocalJumpError when the block is called
|
|
411
|
+
|
|
412
|
+
# Validation passed AND amount > 0 → success
|
|
413
|
+
if payload[:validation_result][:passed] && payload[:total_amount] > 0
|
|
414
|
+
'success'
|
|
415
|
+
elsif payload[:validation_result][:passed]
|
|
416
|
+
nil # Passed but $0 order → not counted (test order)
|
|
417
|
+
else
|
|
418
|
+
'failure' # Validation failed
|
|
419
|
+
end
|
|
420
|
+
end
|
|
421
|
+
|
|
422
|
+
contributes_to 'order_creation_success_rate'
|
|
423
|
+
end
|
|
424
|
+
end
|
|
425
|
+
end
|
|
426
|
+
```
|
|
427
|
+
|
|
428
|
+
---
|
|
429
|
+
|
|
430
|
+
## 4. SLO Status Calculation
|
|
431
|
+
|
|
432
|
+
### 4.1. Implementation (Full Code)
|
|
433
|
+
|
|
434
|
+
```ruby
|
|
435
|
+
# lib/e11y/event/base.rb (extended for SLO)
|
|
436
|
+
module E11y
|
|
437
|
+
module Event
|
|
438
|
+
class Base
|
|
439
|
+
class << self
|
|
440
|
+
# SLO configuration DSL
|
|
441
|
+
def slo(value = nil, &block)
|
|
442
|
+
if value == false
|
|
443
|
+
# Explicit opt-out: slo false
|
|
444
|
+
@slo_disabled = true
|
|
445
|
+
@slo_config = nil
|
|
446
|
+
elsif block_given?
|
|
447
|
+
# Explicit opt-in: slo do ... end
|
|
448
|
+
@slo_config = SLOConfig.new(self)
|
|
449
|
+
@slo_config.instance_eval(&block)
|
|
450
|
+
@slo_disabled = false
|
|
451
|
+
end
|
|
452
|
+
end
|
|
453
|
+
|
|
454
|
+
def slo_enabled?
|
|
455
|
+
!@slo_disabled && @slo_config&.enabled?
|
|
456
|
+
end
|
|
457
|
+
|
|
458
|
+
def slo_disabled?
|
|
459
|
+
@slo_disabled == true
|
|
460
|
+
end
|
|
461
|
+
|
|
462
|
+
def slo_config
|
|
463
|
+
@slo_config
|
|
464
|
+
end
|
|
465
|
+
|
|
466
|
+
# Override track to integrate SLO
|
|
467
|
+
def track(payload, **options)
|
|
468
|
+
# 1. Normal event tracking
|
|
469
|
+
event_data = build_event(payload, **options)
|
|
470
|
+
deliver_to_adapters(event_data)
|
|
471
|
+
|
|
472
|
+
# 2. SLO tracking (if enabled)
|
|
473
|
+
if slo_enabled?
|
|
474
|
+
track_slo(payload, event_data)
|
|
475
|
+
end
|
|
476
|
+
|
|
477
|
+
event_data
|
|
478
|
+
end
|
|
479
|
+
|
|
480
|
+
private
|
|
481
|
+
|
|
482
|
+
def track_slo(payload, event_data)
|
|
483
|
+
# Compute slo_status
|
|
484
|
+
slo_status = @slo_config.compute_slo_status(payload)
|
|
485
|
+
|
|
486
|
+
# Skip if slo_status is nil (not counted in SLO)
|
|
487
|
+
return unless slo_status
|
|
488
|
+
|
|
489
|
+
# Validate slo_status value
|
|
490
|
+
unless ['success', 'failure'].include?(slo_status)
|
|
491
|
+
E11y.logger.error(
|
|
492
|
+
"Invalid slo_status for #{name}: '#{slo_status}' (must be 'success', 'failure', or nil)"
|
|
493
|
+
)
|
|
494
|
+
return
|
|
495
|
+
end
|
|
496
|
+
|
|
497
|
+
# Build labels
|
|
498
|
+
labels = {
|
|
499
|
+
event_class: name,
|
|
500
|
+
slo_name: @slo_config.contributes_to,
|
|
501
|
+
slo_status: slo_status
|
|
502
|
+
}
|
|
503
|
+
|
|
504
|
+
# Add group_by label (if configured)
|
|
505
|
+
if @slo_config.group_by
|
|
506
|
+
group_value = payload[@slo_config.group_by]
|
|
507
|
+
labels[@slo_config.group_by] = group_value if group_value
|
|
508
|
+
end
|
|
509
|
+
|
|
510
|
+
# Emit availability metric
|
|
511
|
+
Yabeda.e11y_slo.event_result_total.increment(labels, by: 1)
|
|
512
|
+
|
|
513
|
+
# Emit latency metric (if latency_field configured)
|
|
514
|
+
if @slo_config.latency_field
|
|
515
|
+
latency_value = payload[@slo_config.latency_field]
|
|
516
|
+
|
|
517
|
+
if latency_value
|
|
518
|
+
Yabeda.e11y_slo.event_duration_seconds.observe(
|
|
519
|
+
labels.except(:slo_status),
|
|
520
|
+
latency_value / 1000.0 # ms → seconds
|
|
521
|
+
)
|
|
522
|
+
end
|
|
523
|
+
end
|
|
524
|
+
|
|
525
|
+
# Log SLO tracking (debug)
|
|
526
|
+
E11y.logger.debug(
|
|
527
|
+
"SLO tracked: #{name} → #{@slo_config.contributes_to} (#{slo_status})"
|
|
528
|
+
)
|
|
529
|
+
rescue => error
|
|
530
|
+
E11y.logger.error("SLO tracking failed for #{name}: #{error.message}")
|
|
531
|
+
E11y.logger.error(error.backtrace.first(5).join("\n"))
|
|
532
|
+
# Don't raise - SLO tracking should not break event tracking
|
|
533
|
+
end
|
|
534
|
+
end
|
|
535
|
+
end
|
|
536
|
+
end
|
|
537
|
+
end
|
|
538
|
+
```
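The gating rules in `track_slo` (skip nil, reject unknown values, count the rest) are easy to exercise in isolation. The sketch below uses a test double in place of the Yabeda counter. `FakeCounter` and `record_slo` are hypothetical names introduced for illustration only.

```ruby
# Hypothetical test double; the real code increments a Yabeda counter instead.
class FakeCounter
  attr_reader :increments

  def initialize
    @increments = []
  end

  def increment(labels, by: 1)
    @increments << [labels, by]
  end
end

VALID_SLO_STATUSES = %w[success failure].freeze

# Mirrors the gate in track_slo: nil is skipped, unknown values are rejected.
def record_slo(counter, slo_name:, slo_status:, payload:, group_by: nil)
  return :skipped if slo_status.nil?
  return :invalid unless VALID_SLO_STATUSES.include?(slo_status)

  labels = { slo_name: slo_name, slo_status: slo_status }
  labels[group_by] = payload[group_by] if group_by && payload[group_by]
  counter.increment(labels, by: 1)
  :counted
end

counter = FakeCounter.new
record_slo(counter, slo_name: "payment_success_rate", slo_status: "success",
           payload: { payment_method: "card" }, group_by: :payment_method)
record_slo(counter, slo_name: "payment_success_rate", slo_status: nil, payload: {})
record_slo(counter, slo_name: "payment_success_rate", slo_status: "ok", payload: {})
puts counter.increments.size # prints 1: only the 'success' event was counted
```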
|
|
539
|
+
|
|
540
|
+
### 4.2. SLOConfig Class
|
|
541
|
+
|
|
542
|
+
```ruby
|
|
543
|
+
# lib/e11y/event/slo_config.rb
|
|
544
|
+
module E11y
|
|
545
|
+
module Event
|
|
546
|
+
class SLOConfig
|
|
547
|
+
attr_reader :event_class
|
|
548
|
+
attr_accessor :enabled, :slo_status_from_proc, :contributes_to, :group_by, :latency_field
|
|
549
|
+
|
|
550
|
+
def initialize(event_class)
|
|
551
|
+
@event_class = event_class
|
|
552
|
+
@enabled = false
|
|
553
|
+
@slo_status_from_proc = nil
|
|
554
|
+
@contributes_to = nil
|
|
555
|
+
@group_by = nil
|
|
556
|
+
@latency_field = nil
|
|
557
|
+
end
|
|
558
|
+
|
|
559
|
+
# DSL methods
|
|
560
|
+
def enabled(value = true)
|
|
561
|
+
@enabled = value
|
|
562
|
+
end
|
|
563
|
+
|
|
564
|
+
def enabled?
|
|
565
|
+
@enabled == true
|
|
566
|
+
end
|
|
567
|
+
|
|
568
|
+
def slo_status_from(&block)
|
|
569
|
+
unless block_given?
|
|
570
|
+
raise ArgumentError, "slo_status_from requires a block"
|
|
571
|
+
end
|
|
572
|
+
|
|
573
|
+
@slo_status_from_proc = block
|
|
574
|
+
end
|
|
575
|
+
|
|
576
|
+
def contributes_to(slo_name = @contributes_to) # no argument → reader (track_slo calls it without one)
|
|
577
|
+
@contributes_to = slo_name
|
|
578
|
+
end
|
|
579
|
+
|
|
580
|
+
def group_by(field = @group_by) # no argument → reader (track_slo calls it without one)
|
|
581
|
+
@group_by = field
|
|
582
|
+
end
|
|
583
|
+
|
|
584
|
+
def latency_field(field = @latency_field) # no argument → reader (track_slo calls it without one)
|
|
585
|
+
@latency_field = field
|
|
586
|
+
end
|
|
587
|
+
|
|
588
|
+
# Compute slo_status from payload
|
|
589
|
+
def compute_slo_status(payload)
|
|
590
|
+
unless @slo_status_from_proc
|
|
591
|
+
raise "Event #{@event_class.name} has slo enabled but no slo_status_from block!"
|
|
592
|
+
end
|
|
593
|
+
|
|
594
|
+
@slo_status_from_proc.call(payload)
|
|
595
|
+
end
|
|
596
|
+
|
|
597
|
+
# Validate configuration
|
|
598
|
+
def validate!
|
|
599
|
+
errors = []
|
|
600
|
+
|
|
601
|
+
if enabled? && !@slo_status_from_proc
|
|
602
|
+
errors << "slo_status_from block is required when slo is enabled"
|
|
603
|
+
end
|
|
604
|
+
|
|
605
|
+
if enabled? && !@contributes_to
|
|
606
|
+
errors << "contributes_to is required when slo is enabled"
|
|
607
|
+
end
|
|
608
|
+
|
|
609
|
+
if @latency_field && !@event_class.schema.rules.key?(@latency_field)
|
|
610
|
+
errors << "latency_field :#{@latency_field} not found in schema"
|
|
611
|
+
end
|
|
612
|
+
|
|
613
|
+
if @group_by && !@event_class.schema.rules.key?(@group_by)
|
|
614
|
+
errors << "group_by :#{@group_by} not found in schema"
|
|
615
|
+
end
|
|
616
|
+
|
|
617
|
+
if errors.any?
|
|
618
|
+
raise E11y::SLO::ConfigurationError,
|
|
619
|
+
"SLO configuration errors for #{@event_class.name}:\n#{errors.join("\n")}"
|
|
620
|
+
end
|
|
621
|
+
|
|
622
|
+
true
|
|
623
|
+
end
|
|
624
|
+
end
|
|
625
|
+
end
|
|
626
|
+
end
|
|
627
|
+
```
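The DSL pattern behind `slo do ... end` can be reduced to a few lines: the block is evaluated with `instance_eval`, so bare calls such as `enabled true` hit the config object. The sketch below is a simplified stand-in, not the class above. `MiniSLOConfig` is a hypothetical name, and the combined reader/setter for `contributes_to` is one possible design among several.

```ruby
# Minimal stand-in for the DSL pattern (hypothetical class, not E11y's SLOConfig).
class MiniSLOConfig
  def initialize
    @enabled = false
  end

  def enabled(value = true)
    @enabled = value
  end

  def enabled?
    @enabled == true
  end

  # With an argument: setter used inside the DSL block. Without: reader.
  def contributes_to(slo_name = nil)
    @contributes_to = slo_name unless slo_name.nil?
    @contributes_to
  end

  def slo_status_from(&block)
    raise ArgumentError, "slo_status_from requires a block" unless block

    @status_proc = block
  end

  def compute_slo_status(payload)
    @status_proc.call(payload)
  end
end

config = MiniSLOConfig.new
config.instance_eval do
  enabled true
  contributes_to "payment_success_rate"
  slo_status_from do |payload|
    # `next` exits a block early with a value.
    next payload[:slo_status] if payload[:slo_status]

    payload[:status] == "completed" ? "success" : "failure"
  end
end

puts config.compute_slo_status({ status: "completed" })                        # prints "success"
puts config.compute_slo_status({ status: "completed", slo_status: "failure" }) # prints "failure"
```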
|
|
628
|
+
|
|
629
|
+
---
|
|
630
|
+
|
|
631
|
+
## 5. Custom SLO Configuration
|
|
632
|
+
|
|
633
|
+
### 5.1. slo.yml Schema for Event-Based SLO
|
|
634
|
+
|
|
635
|
+
```yaml
|
|
636
|
+
# config/slo.yml
|
|
637
|
+
version: 1
|
|
638
|
+
|
|
639
|
+
# ============================================================================
|
|
640
|
+
# CUSTOM EVENT-BASED SLO
|
|
641
|
+
# ============================================================================
|
|
642
|
+
custom_slos:
|
|
643
|
+
# Simple availability SLO
|
|
644
|
+
- name: "payment_success_rate"
|
|
645
|
+
description: "Payment success rate (business SLO)"
|
|
646
|
+
type: event_based
|
|
647
|
+
|
|
648
|
+
# Which Events contribute to this SLO?
|
|
649
|
+
events:
|
|
650
|
+
- Events::PaymentProcessed
|
|
651
|
+
|
|
652
|
+
# SLO target
|
|
653
|
+
target: 0.999 # 99.9%
|
|
654
|
+
window: 30d
|
|
655
|
+
|
|
656
|
+
# Validation: Ensure Event has slo { enabled true }
|
|
657
|
+
require_explicit_slo_config: true
|
|
658
|
+
|
|
659
|
+
# Prometheus metric name (auto-generated if omitted)
|
|
660
|
+
metric_name: "e11y_slo_payment_success_rate"
|
|
661
|
+
|
|
662
|
+
# Burn rate alerts (same as HTTP SLO)
|
|
663
|
+
burn_rate_alerts:
|
|
664
|
+
fast:
|
|
665
|
+
enabled: true
|
|
666
|
+
threshold: 14.4
|
|
667
|
+
alert_after: 5m
|
|
668
|
+
severity: critical
|
|
669
|
+
medium:
|
|
670
|
+
enabled: true
|
|
671
|
+
threshold: 6.0
|
|
672
|
+
alert_after: 30m
|
|
673
|
+
severity: warning
|
|
674
|
+
slow:
|
|
675
|
+
enabled: true
|
|
676
|
+
threshold: 1.0
|
|
677
|
+
alert_after: 6h
|
|
678
|
+
severity: info
|
|
679
|
+
|
|
680
|
+
# Latency SLO (with histogram)
|
|
681
|
+
- name: "api_latency_slo"
|
|
682
|
+
description: "API request latency SLO"
|
|
683
|
+
type: event_based
|
|
684
|
+
|
|
685
|
+
events:
|
|
686
|
+
- Events::ApiRequestCompleted
|
|
687
|
+
|
|
688
|
+
# Availability target
|
|
689
|
+
target: 0.999 # 99.9% requests < 500ms
|
|
690
|
+
window: 30d
|
|
691
|
+
|
|
692
|
+
# Latency target (Prometheus computes p99 from histogram)
|
|
693
|
+
latency:
|
|
694
|
+
enabled: true
|
|
695
|
+
p99_target: 500 # ms
|
|
696
|
+
p95_target: 300 # ms
|
|
697
|
+
field: :duration_ms # Which field in Event payload
|
|
698
|
+
|
|
699
|
+
# Group by endpoint (per-endpoint SLO)
|
|
700
|
+
group_by: :endpoint
|
|
701
|
+
|
|
702
|
+
# Multi-event SLO (v2.0 feature, not implemented in v1.0)
|
|
703
|
+
- name: "order_fulfillment_slo"
|
|
704
|
+
description: "Order shipped within 24h of payment"
|
|
705
|
+
type: event_sequence # Future feature
|
|
706
|
+
|
|
707
|
+
events:
|
|
708
|
+
start: Events::OrderPaid
|
|
709
|
+
end: Events::OrderShipped
|
|
710
|
+
max_duration: 86400 # 24 hours in seconds
|
|
711
|
+
|
|
712
|
+
target: 0.95 # 95%
|
|
713
|
+
window: 30d
|
|
714
|
+
|
|
715
|
+
# v1.0: Not implemented, placeholder for future
|
|
716
|
+
|
|
717
|
+
# ============================================================================
|
|
718
|
+
# HTTP SLO (unchanged, for reference)
|
|
719
|
+
# ============================================================================
|
|
720
|
+
endpoints:
|
|
721
|
+
- name: "Create Order"
|
|
722
|
+
pattern: "POST /api/orders"
|
|
723
|
+
controller: "Api::OrdersController"
|
|
724
|
+
action: "create"
|
|
725
|
+
slo:
|
|
726
|
+
availability:
|
|
727
|
+
target: 0.999 # HTTP 2xx/3xx vs 5xx (independent from Event SLO)
|
|
728
|
+
latency:
|
|
729
|
+
p99_target: 500
|
|
730
|
+
|
|
731
|
+
# ============================================================================
|
|
732
|
+
# APP-WIDE FALLBACK
|
|
733
|
+
# ============================================================================
|
|
734
|
+
app_wide:
|
|
735
|
+
http:
|
|
736
|
+
availability:
|
|
737
|
+
target: 0.999
|
|
738
|
+
latency:
|
|
739
|
+
p99_target: 500
|
|
740
|
+
|
|
741
|
+
events:
|
|
742
|
+
# NEW: App-wide Event SLO (aggregation of all Event SLOs)
|
|
743
|
+
enabled: true
|
|
744
|
+
target: 0.999 # 99.9%
|
|
745
|
+
window: 30d
|
|
746
|
+
```
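The schema above can be sanity-checked before any SLO is registered. The following is a minimal sketch, not E11y's actual loader: `validate_custom_slos` and `WINDOW_FORMAT` are hypothetical names, and the rules (target strictly between 0 and 1, window written as a number plus `m`, `h` or `d`) are assumptions read off the example file. Because values such as `:payment_method` are parsed by Psych as Ruby symbols, `Symbol` has to be permitted explicitly when using `YAML.safe_load`.

```ruby
require "yaml"

# Hypothetical window format: digits followed by m, h or d (e.g. 5m, 6h, 30d).
WINDOW_FORMAT = /\A\d+[mhd]\z/

# Hypothetical validator; returns a list of human-readable error strings.
def validate_custom_slos(yaml_text)
  # slo.yml uses values like `:endpoint`, which Psych loads as Symbols.
  config = YAML.safe_load(yaml_text, permitted_classes: [Symbol]) || {}
  errors = []

  Array(config["custom_slos"]).each do |slo|
    name = slo["name"] || "(unnamed)"
    target = slo["target"]

    unless target.is_a?(Numeric) && target > 0 && target < 1
      errors << "#{name}: target must be a number between 0 and 1 (exclusive)"
    end
    unless slo["window"].to_s.match?(WINDOW_FORMAT)
      errors << "#{name}: window must look like 30d, 6h or 5m"
    end
    if slo["type"] == "event_based" && Array(slo["events"]).empty?
      errors << "#{name}: event_based SLO needs at least one event"
    end
  end

  errors
end

SAMPLE = <<~YAML
  custom_slos:
    - name: "payment_success_rate"
      type: event_based
      events:
        - Events::PaymentProcessed
      target: 0.999
      window: 30d
      group_by: :payment_method
    - name: "broken_slo"
      type: event_based
      events: []
      target: 99.9
      window: monthly
YAML

errors = validate_custom_slos(SAMPLE)
puts errors
```

Collecting every problem into one list, rather than raising on the first, keeps the error report in line with how the linters in this document join their messages.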
|
|
747
|
+
|
|
748
|
+
---
|
|
749
|
+
|
|
750
|
+
## 6. Metrics Export
|
|
751
|
+
|
|
752
|
+
### 6.1. Yabeda Metrics Definition
|
|
753
|
+
|
|
754
|
+
```ruby
|
|
755
|
+
# config/initializers/e11y.rb
|
|
756
|
+
E11y.configure do |config|
|
|
757
|
+
config.metrics do
|
|
758
|
+
enabled true
|
|
759
|
+
adapter :yabeda
|
|
760
|
+
|
|
761
|
+
# Define SLO metrics
|
|
762
|
+
Yabeda.configure do
|
|
763
|
+
group :e11y_slo do
|
|
764
|
+
# Availability metric (counter)
|
|
765
|
+
counter :event_result_total,
|
|
766
|
+
comment: "Total count of Event SLO results (success/failure)",
|
|
767
|
+
tags: [:event_class, :slo_name, :slo_status] # slo_status = 'success' | 'failure'
|
|
768
|
+
|
|
769
|
+
# Latency metric (histogram)
|
|
770
|
+
histogram :event_duration_seconds,
|
|
771
|
+
comment: "Event processing duration for SLO (seconds)",
|
|
772
|
+
tags: [:event_class, :slo_name],
|
|
773
|
+
buckets: [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
|
|
774
|
+
end
|
|
775
|
+
end
|
|
776
|
+
end
|
|
777
|
+
end
|
|
778
|
+
```
|
|
779
|
+
|
|
780
|
+
### 6.2. Prometheus Metrics Output
|
|
781
|
+
|
|
782
|
+
```
|
|
783
|
+
# Availability metrics
|
|
784
|
+
e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="success"} 1234
|
|
785
|
+
e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="failure"} 5
|
|
786
|
+
|
|
787
|
+
# Latency metrics (histogram)
|
|
788
|
+
e11y_slo_event_duration_seconds_bucket{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo",le="0.1"} 500
|
|
789
|
+
e11y_slo_event_duration_seconds_bucket{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo",le="0.5"} 1200
|
|
790
|
+
e11y_slo_event_duration_seconds_bucket{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo",le="1.0"} 1234
|
|
791
|
+
e11y_slo_event_duration_seconds_sum{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo"} 450.67
|
|
792
|
+
e11y_slo_event_duration_seconds_count{event_class="Events::ApiRequestCompleted",slo_name="api_latency_slo"} 1239
|
|
793
|
+
|
|
794
|
+
# With group_by (per-payment-method)
|
|
795
|
+
e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="success",payment_method="card"} 1000
|
|
796
|
+
e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="success",payment_method="bank"} 200
|
|
797
|
+
e11y_slo_event_result_total{event_class="Events::PaymentProcessed",slo_name="payment_success_rate",slo_status="success",payment_method="paypal"} 34
|
|
798
|
+
```
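The `_bucket` series above are cumulative: each `le` bucket counts every observation at or below that bound. A minimal sketch of that bookkeeping follows, assuming durations arrive in milliseconds (as the `duration_ms` field suggests) and are converted to seconds before bucketing. `cumulative_histogram` is a hypothetical helper and the input durations are made up; this is not how Yabeda stores histograms internally.

```ruby
# Bucket upper bounds in seconds, copied from the histogram definition in 6.1.
BUCKETS = [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0].freeze

# Hypothetical helper: turns raw durations (ms) into cumulative bucket counts,
# plus the _sum and _count values a Prometheus histogram also exposes.
def cumulative_histogram(durations_ms)
  seconds = durations_ms.map { |ms| ms / 1000.0 }
  buckets = BUCKETS.to_h { |le| [le, seconds.count { |s| s <= le }] }
  buckets[Float::INFINITY] = seconds.size # the implicit le="+Inf" bucket
  { buckets: buckets, sum: seconds.sum, count: seconds.size }
end

# Made-up observations: six fast requests and one 61-second outlier.
histogram = cumulative_histogram([8, 42, 95, 120, 480, 730, 61_000])
histogram[:buckets].each { |le, n| puts "le=#{le} #{n}" }
```

The outlier lands only in the `+Inf` bucket, which is why `_count` can exceed the largest finite bucket, as it does in the sample output above (1239 versus 1234).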
|
|
799
|
+
|
|
800
|
+
---
|
|
801
|
+
|
|
802
|
+
## 7. Validation & Linting
|
|
803
|
+
|
|
804
|
+
### 7.1. Linter 1: Explicit SLO Declaration
|
|
805
|
+
|
|
806
|
+
**Rule:** Every Event class MUST have an explicit `slo` declaration.
|
|
807
|
+
|
|
808
|
+
```ruby
|
|
809
|
+
# lib/e11y/slo/linters/explicit_declaration_linter.rb
|
|
810
|
+
module E11y
|
|
811
|
+
module SLO
|
|
812
|
+
module Linters
|
|
813
|
+
class ExplicitDeclarationLinter
|
|
814
|
+
def self.validate!
|
|
815
|
+
errors = []
|
|
816
|
+
|
|
817
|
+
E11y::Registry.all_events.each do |event_class|
|
|
818
|
+
# Check: Has slo declaration?
|
|
819
|
+
has_slo_enabled = event_class.slo_enabled?
|
|
820
|
+
has_slo_disabled = event_class.slo_disabled?
|
|
821
|
+
|
|
822
|
+
unless has_slo_enabled || has_slo_disabled
|
|
823
|
+
errors << "Event #{event_class.name} missing explicit SLO declaration! " \
|
|
824
|
+
"Add `slo do ... end` or `slo false`"
|
|
825
|
+
end
|
|
826
|
+
end
|
|
827
|
+
|
|
828
|
+
if errors.any?
|
|
829
|
+
raise E11y::SLO::LinterError, "SLO Linter 1 failed:\n#{errors.join("\n")}"
|
|
830
|
+
end
|
|
831
|
+
|
|
832
|
+
E11y.logger.info("✅ Linter 1: Explicit SLO declaration (passed)")
|
|
833
|
+
end
|
|
834
|
+
end
|
|
835
|
+
end
|
|
836
|
+
end
|
|
837
|
+
end
|
|
838
|
+
```
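Linter 1 only works if "never declared" can be told apart from "declared as disabled". One way to get that tri-state is sketched below. `SketchEvent` is a hypothetical stand-in for `E11y::Event::Base`, and treating any block as "enabled" is a simplification: the real DSL reads `enabled true` inside the block.

```ruby
# Hypothetical stand-in for E11y::Event::Base; only the bookkeeping is shown.
class SketchEvent
  class << self
    # slo false       -> explicitly disabled
    # slo do ... end  -> explicitly enabled (simplified: any block counts)
    # never called    -> undeclared, which is what Linter 1 rejects
    def slo(flag = nil, &block)
      @slo_state = if block
                     :enabled
                   elsif flag == false
                     :disabled
                   else
                     raise ArgumentError, "use `slo do ... end` or `slo false`"
                   end
    end

    def slo_enabled?
      @slo_state == :enabled
    end

    def slo_disabled?
      @slo_state == :disabled
    end
  end
end

class PaymentSketch < SketchEvent
  slo { }
end

class HealthCheckSketch < SketchEvent
  slo false
end

class ForgottenSketch < SketchEvent; end

# Same predicate as the linter: neither enabled nor disabled means undeclared.
undeclared = [PaymentSketch, HealthCheckSketch, ForgottenSketch]
  .reject { |klass| klass.slo_enabled? || klass.slo_disabled? }
puts undeclared.map(&:name).inspect
```

Storing the state in a class-level instance variable means each subclass starts undeclared, so a parent's declaration is not silently inherited.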
|
|
839
|
+
|
|
840
|
+
### 7.2. Linter 2: slo_status_from Required
|
|
841
|
+
|
|
842
|
+
**Rule:** If `slo { enabled true }`, then `slo_status_from` is REQUIRED.
|
|
843
|
+
|
|
844
|
+
```ruby
|
|
845
|
+
# lib/e11y/slo/linters/slo_status_from_linter.rb
|
|
846
|
+
module E11y
|
|
847
|
+
module SLO
|
|
848
|
+
module Linters
|
|
849
|
+
class SloStatusFromLinter
|
|
850
|
+
def self.validate!
|
|
851
|
+
errors = []
|
|
852
|
+
|
|
853
|
+
E11y::Registry.all_events.each do |event_class|
|
|
854
|
+
next unless event_class.slo_enabled?
|
|
855
|
+
|
|
856
|
+
# Check: Has slo_status_from block?
|
|
857
|
+
unless event_class.slo_config.slo_status_from_proc
|
|
858
|
+
errors << "Event #{event_class.name} has `slo { enabled true }`, " \
|
|
859
|
+
"but missing `slo_status_from` block!"
|
|
860
|
+
end
|
|
861
|
+
|
|
862
|
+
# Check: Has contributes_to?
|
|
863
|
+
unless event_class.slo_config.contributes_to
|
|
864
|
+
errors << "Event #{event_class.name} has `slo { enabled true }`, " \
|
|
865
|
+
"but missing `contributes_to` declaration!"
|
|
866
|
+
end
|
|
867
|
+
end
|
|
868
|
+
|
|
869
|
+
if errors.any?
|
|
870
|
+
raise E11y::SLO::LinterError, "SLO Linter 2 failed:\n#{errors.join("\n")}"
|
|
871
|
+
end
|
|
872
|
+
|
|
873
|
+
E11y.logger.info("✅ Linter 2: slo_status_from required (passed)")
|
|
874
|
+
end
|
|
875
|
+
end
|
|
876
|
+
end
|
|
877
|
+
end
|
|
878
|
+
end
|
|
879
|
+
```
|
|
880
|
+
|
|
881
|
+
### 7.3. Linter 3: slo.yml Consistency
|
|
882
|
+
|
|
883
|
+
**Rule:** If an Event is referenced in `slo.yml`, it MUST have `slo { enabled true }`.
|
|
884
|
+
|
|
885
|
+
```ruby
|
|
886
|
+
# lib/e11y/slo/linters/config_consistency_linter.rb
|
|
887
|
+
module E11y
|
|
888
|
+
module SLO
|
|
889
|
+
module Linters
|
|
890
|
+
class ConfigConsistencyLinter
|
|
891
|
+
def self.validate!
|
|
892
|
+
errors = []
|
|
893
|
+
config = E11y::SLO::ConfigLoader.config
|
|
894
|
+
|
|
895
|
+
# Check each custom_slo in slo.yml
|
|
896
|
+
config.custom_slos.each do |slo|
|
|
897
|
+
slo_name = slo['name']
|
|
898
|
+
events = slo['events'] || []
|
|
899
|
+
|
|
900
|
+
events.each do |event_class_name|
|
|
901
|
+
# Check: Event class exists?
|
|
902
|
+
begin
|
|
903
|
+
event_class = event_class_name.constantize
|
|
904
|
+
rescue NameError
|
|
905
|
+
errors << "SLO '#{slo_name}' references Event #{event_class_name}, " \
|
|
906
|
+
"but class not found!"
|
|
907
|
+
next
|
|
908
|
+
end
|
|
909
|
+
|
|
910
|
+
# Check: Event has slo { enabled true }?
|
|
911
|
+
unless event_class.slo_enabled?
|
|
912
|
+
errors << "SLO '#{slo_name}' references Event #{event_class_name}, " \
|
|
913
|
+
"but Event has `slo false` or missing slo declaration!"
|
|
914
|
+
end
|
|
915
|
+
|
|
916
|
+
# Check: Event contributes_to matches slo_name?
|
|
917
|
+
if event_class.slo_enabled? && event_class.slo_config.contributes_to != slo_name
|
|
918
|
+
errors << "Event #{event_class_name} contributes_to " \
|
|
919
|
+
"'#{event_class.slo_config.contributes_to}', " \
|
|
920
|
+
"but slo.yml defines SLO '#{slo_name}'! Mismatch!"
|
|
921
|
+
end
|
|
922
|
+
end
|
|
923
|
+
end
|
|
924
|
+
|
|
925
|
+
# Check reverse: Events with slo enabled but NOT in slo.yml
|
|
926
|
+
E11y::Registry.all_events.each do |event_class|
|
|
927
|
+
next unless event_class.slo_enabled?
|
|
928
|
+
|
|
929
|
+
slo_name = event_class.slo_config.contributes_to
|
|
930
|
+
slo_config = config.resolve_custom_slo(slo_name)
|
|
931
|
+
|
|
932
|
+
unless slo_config
|
|
933
|
+
errors << "Event #{event_class.name} contributes_to '#{slo_name}', " \
|
|
934
|
+
"but this SLO not found in slo.yml!"
|
|
935
|
+
end
|
|
936
|
+
end
|
|
937
|
+
|
|
938
|
+
if errors.any?
|
|
939
|
+
raise E11y::SLO::LinterError, "SLO Linter 3 failed:\n#{errors.join("\n")}"
|
|
940
|
+
end
|
|
941
|
+
|
|
942
|
+
E11y.logger.info("✅ Linter 3: slo.yml consistency (passed)")
|
|
943
|
+
end
|
|
944
|
+
end
|
|
945
|
+
end
|
|
946
|
+
end
|
|
947
|
+
end
|
|
948
|
+
```
|
|
949
|
+
|
|
950
|
+
### 7.4. Auto-validation on Boot
|
|
951
|
+
|
|
952
|
+
```ruby
|
|
953
|
+
# config/initializers/e11y_slo.rb
|
|
954
|
+
Rails.application.config.after_initialize do
|
|
955
|
+
if E11y.config.slo.enabled
|
|
956
|
+
begin
|
|
957
|
+
# Run all linters
|
|
958
|
+
E11y::SLO::Linters::ExplicitDeclarationLinter.validate!
|
|
959
|
+
E11y::SLO::Linters::SloStatusFromLinter.validate!
|
|
960
|
+
E11y::SLO::Linters::ConfigConsistencyLinter.validate!
|
|
961
|
+
|
|
962
|
+
E11y.logger.info("✅ All SLO linters passed")
|
|
963
|
+
rescue E11y::SLO::LinterError => error
|
|
964
|
+
if E11y.config.slo.strict_validation
|
|
965
|
+
# Strict mode: Fail hard
|
|
966
|
+
raise error
|
|
967
|
+
else
|
|
968
|
+
# Lenient mode: log the failure and continue
|
|
969
|
+
E11y.logger.error("❌ SLO Linters failed (continuing in lenient mode):")
|
|
970
|
+
E11y.logger.error(error.message)
|
|
971
|
+
end
|
|
972
|
+
end
|
|
973
|
+
end
|
|
974
|
+
end
|
|
975
|
+
```
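In the initializer above, the first linter that raises ends the run, so later linters are never evaluated. A possible variation, sketched here under assumptions, runs every linter, collects all failures, and only then applies the strict or lenient policy. `SketchLinterError` and `run_linters` are hypothetical names, and the lambdas stand in for the three `validate!` methods.

```ruby
# Hypothetical stand-in for E11y::SLO::LinterError.
class SketchLinterError < StandardError; end

# Runs every linter, then applies the strict/lenient policy once.
# `linters` maps a display name to a callable that raises on failure.
def run_linters(linters, strict:, log: [])
  failures = []

  linters.each do |name, check|
    begin
      check.call
      log << "passed: #{name}"
    rescue SketchLinterError => e
      failures << "#{name}: #{e.message}"
    end
  end
  return log if failures.empty?

  message = "SLO linters failed:\n#{failures.join("\n")}"
  raise SketchLinterError, message if strict

  log << message # lenient mode: record the failures and keep booting
end

# Made-up linters: one passes, two fail.
linters = {
  "explicit declaration" => -> { },
  "slo_status_from"      => -> { raise SketchLinterError, "Events::Foo missing slo_status_from" },
  "slo.yml consistency"  => -> { raise SketchLinterError, "SLO 'bar' not found in slo.yml" }
}

log = run_linters(linters, strict: false)
puts log
```

The trade-off is a slightly longer boot on failure in exchange for seeing every misconfiguration in a single pass.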
|
|
976
|
+
|
|
977
|
+
---
|
|
978
|
+
|
|
979
|
+
## 8. Prometheus Integration
|
|
980
|
+
|
|
981
|
+
### 8.1. PromQL for Availability SLO
|
|
982
|
+
|
|
983
|
+
```promql
|
|
984
|
+
# Payment Success Rate (30 days)
|
|
985
|
+
sum(rate(e11y_slo_event_result_total{
|
|
986
|
+
slo_name="payment_success_rate",
|
|
987
|
+
slo_status="success"
|
|
988
|
+
}[30d]))
|
|
989
|
+
/
|
|
990
|
+
sum(rate(e11y_slo_event_result_total{
|
|
991
|
+
slo_name="payment_success_rate"
|
|
992
|
+
}[30d]))
|
|
993
|
+
|
|
994
|
+
# Result: 0.9996 (99.96%)
|
|
995
|
+
```
|
|
996
|
+
|
|
997
|
+
### 8.2. PromQL for Latency SLO
|
|
998
|
+
|
|
999
|
+
```promql
|
|
1000
|
+
# API Latency p99 (30 days)
|
|
1001
|
+
histogram_quantile(0.99,
|
|
1002
|
+
sum(rate(e11y_slo_event_duration_seconds_bucket{
|
|
1003
|
+
slo_name="api_latency_slo"
|
|
1004
|
+
}[30d])) by (le)
|
|
1005
|
+
)
|
|
1006
|
+
|
|
1007
|
+
# Result: 0.450 (450ms)
|
|
1008
|
+
```
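`histogram_quantile` does not see individual durations; it estimates the quantile by linear interpolation inside the bucket that contains the target rank. A rough sketch of that estimate follows, using the cumulative bucket counts from the sample output in 6.2. `estimate_quantile` is a hypothetical helper that approximates the documented behaviour and ignores edge cases such as non-monotonic buckets.

```ruby
# Cumulative [upper_bound_seconds, count] pairs from the sample output in 6.2;
# the last pair is the le="+Inf" bucket, whose count equals _count.
SAMPLE_BUCKETS = [[0.1, 500], [0.5, 1200], [1.0, 1234], [Float::INFINITY, 1239]].freeze

# Hypothetical helper approximating histogram_quantile for one series.
def estimate_quantile(q, buckets)
  total = buckets.last[1]
  return Float::NAN if total.zero?

  rank = q * total
  index = buckets.index { |_, count| count >= rank }
  upper, count = buckets[index]
  # A rank that falls into +Inf is reported as the highest finite bound.
  return buckets[-2][0] if upper.infinite?

  lower, prev_count = index.zero? ? [0.0, 0] : buckets[index - 1]
  # Assume observations are spread evenly between lower and upper.
  lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
end

p99 = estimate_quantile(0.99, SAMPLE_BUCKETS)
puts format("p99 is about %.3fs", p99)
```

With these sample counts the 99th-percentile rank falls between the 0.5s and 1.0s bounds, so the estimate can only be as precise as that bucket is wide. A bucket boundary at the SLO threshold (0.5s here) is what makes a "requests under 500ms" ratio exact.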
|
|
1009
|
+
|
|
1010
|
+
### 8.3. PromQL for Grouped SLO (per payment_method)
|
|
1011
|
+
|
|
1012
|
+
```promql
|
|
1013
|
+
# Payment Success Rate per payment_method
|
|
1014
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1015
|
+
slo_name="payment_success_rate",
|
|
1016
|
+
slo_status="success"
|
|
1017
|
+
}[30d])) by (payment_method)
|
|
1018
|
+
/
|
|
1019
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1020
|
+
slo_name="payment_success_rate"
|
|
1021
|
+
}[30d])) by (payment_method)
|
|
1022
|
+
|
|
1023
|
+
# Result:
|
|
1024
|
+
# {payment_method="card"} 0.9998
|
|
1025
|
+
# {payment_method="bank"} 0.9950
|
|
1026
|
+
# {payment_method="paypal"} 0.9970
|
|
1027
|
+
```
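The `by (payment_method)` clause computes one ratio per label value instead of a single global ratio. The same grouping can be sketched in plain Ruby. The sample counts below are made up, chosen so that the rates match the result shown above, and `success_rate_by_group` is a hypothetical helper.

```ruby
# [payment_method, slo_status, count]; made-up counter values.
SAMPLES = [
  ["card",   "success", 9998], ["card",   "failure", 2],
  ["bank",   "success", 1990], ["bank",   "failure", 10],
  ["paypal", "success", 997],  ["paypal", "failure", 3]
].freeze

# Hypothetical helper: success / total, computed separately per group label.
def success_rate_by_group(samples)
  samples.group_by(&:first).transform_values do |rows|
    total = rows.sum { |_, _, n| n }
    ok = rows.select { |_, status, _| status == "success" }.sum { |_, _, n| n }
    ok.to_f / total
  end
end

rates = success_rate_by_group(SAMPLES)
rates.each { |method, rate| puts format("%-7s %.4f", method, rate) }
```

Grouping exposes a weak payment method (here `bank`, below the 99.9% target) that a single aggregate ratio would average away.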
|
|
1028
|
+
|
|
1029
|
+
### 8.4. Burn Rate Alerts (same as HTTP SLO)
|
|
1030
|
+
|
|
1031
|
+
```yaml
|
|
1032
|
+
# prometheus/alerts/e11y_event_slo.yml
|
|
1033
|
+
groups:
|
|
1034
|
+
- name: e11y_event_slo
|
|
1035
|
+
interval: 30s
|
|
1036
|
+
rules:
|
|
1037
|
+
# Fast burn (1h window, 5 min alert)
|
|
1038
|
+
- alert: E11yEventSLOFastBurn_PaymentSuccessRate
|
|
1039
|
+
expr: |
|
|
1040
|
+
(
|
|
1041
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1042
|
+
slo_name="payment_success_rate",
|
|
1043
|
+
slo_status="failure"
|
|
1044
|
+
}[1h]))
|
|
1045
|
+
/
|
|
1046
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1047
|
+
slo_name="payment_success_rate"
|
|
1048
|
+
}[1h]))
|
|
1049
|
+
)
|
|
1050
|
+
/
|
|
1051
|
+
0.001 # Error budget: 1 - 0.999 target (allowed failure ratio)
|
|
1052
|
+
> 14.4 # 14.4x burn rate = 2% of 30-day budget in 1h
|
|
1053
|
+
for: 5m
|
|
1054
|
+
labels:
|
|
1055
|
+
severity: critical
|
|
1056
|
+
slo_name: "payment_success_rate"
|
|
1057
|
+
burn_window: "1h"
|
|
1058
|
+
annotations:
|
|
1059
|
+
summary: "CRITICAL: Fast burn on payment_success_rate SLO"
|
|
1060
|
+
description: |
|
|
1061
|
+
Payment failure rate is 14.4x higher than sustainable rate.
|
|
1062
|
+
Burning 2% of 30-day error budget in 1 hour.
|
|
1063
|
+
|
|
1064
|
+
Current burn rate: {{ $value | humanize }}x
|
|
1065
|
+
|
|
1066
|
+
Dashboard: https://grafana/d/e11y-event-slo?var-slo=payment_success_rate
|
|
1067
|
+
```
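The numbers in the alert can be checked by hand. Burn rate is the observed failure ratio divided by the error budget (1 minus the target), and the share of the 30-day budget consumed is burn rate times elapsed hours divided by 720. A small sketch with made-up request counts follows; `burn_rate` and `budget_consumed` are hypothetical helper names.

```ruby
TARGET = 0.999
WINDOW_HOURS = 30 * 24 # 720 hours in the 30-day SLO window

# Observed failure ratio relative to the allowed failure ratio (1 - target).
def burn_rate(failures, total, target: TARGET)
  return 0.0 if total.zero?

  error_rate = failures.to_f / total
  error_rate / (1.0 - target)
end

# Fraction of the whole window's error budget used after `hours` at `rate`.
def budget_consumed(rate, hours, window_hours: WINDOW_HOURS)
  rate * hours / window_hours
end

# Made-up numbers: 144 failures out of 10,000 payments in the last hour.
rate = burn_rate(144, 10_000)
puts "burn rate: #{rate.round(2)}x"
puts "budget used in 1h: #{(budget_consumed(rate, 1) * 100).round(2)}%"
```

A 1.44% failure ratio against a 0.1% budget gives the 14.4x threshold, and one hour at that rate uses about 2% of the 30-day budget, which is where the figures in the alert description come from.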
|
|
1068
|
+
|
|
1069
|
+
---
|
|
1070
|
+
|
|
1071
|
+
## 9. App-Wide SLO Aggregation
|
|
1072
|
+
|
|
1073
|
+
### 9.1. Problem: Need Overall Health Metric
|
|
1074
|
+
|
|
1075
|
+
**Separate HTTP and Event SLOs are useful, but SREs often want a single number:**
|
|
1076
|
+
|
|
1077
|
+
```
|
|
1078
|
+
SRE: "What's our overall application health?"
|
|
1079
|
+
|
|
1080
|
+
Current state:
|
|
1081
|
+
- HTTP SLO: 99.95%
|
|
1082
|
+
- Event SLO (payment): 99.96%
|
|
1083
|
+
- Event SLO (orders): 99.92%
|
|
1084
|
+
|
|
1085
|
+
❌ Which one to report? Need AGGREGATION!
|
|
1086
|
+
```
|
|
1087
|
+
|
|
1088
|
+
### 9.2. Solution: Weighted App-Wide SLO
|
|
1089
|
+
|
|
1090
|
+
```yaml
|
|
1091
|
+
# config/slo.yml (extended)
|
|
1092
|
+
app_wide:
|
|
1093
|
+
# NEW: Aggregated SLO (combines HTTP + Events)
|
|
1094
|
+
aggregated_slo:
|
|
1095
|
+
enabled: true
|
|
1096
|
+
|
|
1097
|
+
# How to combine?
|
|
1098
|
+
strategy: "weighted_average" # or "min" (worst), "max" (best)
|
|
1099
|
+
|
|
1100
|
+
# Weights for each component
|
|
1101
|
+
components:
|
|
1102
|
+
- name: "http_slo"
|
|
1103
|
+
weight: 0.4 # 40% weight (infrastructure)
|
|
1104
|
+
metric: |
|
|
1105
|
+
sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
|
|
1106
|
+
/
|
|
1107
|
+
sum(rate(http_requests_total[30d]))
|
|
1108
|
+
|
|
1109
|
+
- name: "event_slo_payment"
|
|
1110
|
+
weight: 0.4 # 40% weight (critical business logic)
|
|
1111
|
+
metric: |
|
|
1112
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1113
|
+
slo_name="payment_success_rate",
|
|
1114
|
+
slo_status="success"
|
|
1115
|
+
}[30d]))
|
|
1116
|
+
/
|
|
1117
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1118
|
+
slo_name="payment_success_rate"
|
|
1119
|
+
}[30d]))
|
|
1120
|
+
|
|
1121
|
+
- name: "event_slo_orders"
|
|
1122
|
+
weight: 0.2 # 20% weight (important business logic)
|
|
1123
|
+
metric: |
|
|
1124
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1125
|
+
slo_name="order_creation_success_rate",
|
|
1126
|
+
slo_status="success"
|
|
1127
|
+
}[30d]))
|
|
1128
|
+
/
|
|
1129
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1130
|
+
slo_name="order_creation_success_rate"
|
|
1131
|
+
}[30d]))
|
|
1132
|
+
|
|
1133
|
+
# Overall target
|
|
1134
|
+
target: 0.999 # 99.9%
|
|
1135
|
+
window: 30d
|
|
1136
|
+
```
|
|
1137
|
+
|
|
1138
|
+
### 9.3. PromQL for Aggregated SLO
|
|
1139
|
+
|
|
1140
|
+
```promql
|
|
1141
|
+
# Weighted Average App-Wide SLO
|
|
1142
|
+
(
|
|
1143
|
+
# HTTP SLO (40% weight)
|
|
1144
|
+
0.4 * (
|
|
1145
|
+
sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
|
|
1146
|
+
/
|
|
1147
|
+
sum(rate(http_requests_total[30d]))
|
|
1148
|
+
)
|
|
1149
|
+
|
|
1150
|
+
+
|
|
1151
|
+
|
|
1152
|
+
# Payment Event SLO (40% weight)
|
|
1153
|
+
0.4 * (
|
|
1154
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1155
|
+
slo_name="payment_success_rate",
|
|
1156
|
+
slo_status="success"
|
|
1157
|
+
}[30d]))
|
|
1158
|
+
/
|
|
1159
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1160
|
+
slo_name="payment_success_rate"
|
|
1161
|
+
}[30d]))
|
|
1162
|
+
)
|
|
1163
|
+
|
|
1164
|
+
+
|
|
1165
|
+
|
|
1166
|
+
# Order Event SLO (20% weight)
|
|
1167
|
+
0.2 * (
|
|
1168
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1169
|
+
slo_name="order_creation_success_rate",
|
|
1170
|
+
slo_status="success"
|
|
1171
|
+
}[30d]))
|
|
1172
|
+
/
|
|
1173
|
+
sum(rate(e11y_slo_event_result_total{
|
|
1174
|
+
slo_name="order_creation_success_rate"
|
|
1175
|
+
}[30d]))
|
|
1176
|
+
)
|
|
1177
|
+
)
|
|
1178
|
+
|
|
1179
|
+
# Example Result:
|
|
1180
|
+
# HTTP: 99.95%, Payment: 99.96%, Orders: 99.92%
|
|
1181
|
+
# → (0.4 * 0.9995) + (0.4 * 0.9996) + (0.2 * 0.9992)
|
|
1182
|
+
# → 0.39980 + 0.39984 + 0.19984
|
|
1183
|
+
# → 0.99948 (99.948%)
|
|
1184
|
+
```
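The three `strategy` values from the config (`weighted_average`, `min`, `max`) reduce to a few lines of arithmetic. The sketch below reproduces the worked example using the component values quoted above. `Component` and `aggregate` are hypothetical names, and dividing by the total weight is an added assumption so that weights which do not sum to 1.0 still give a valid ratio.

```ruby
Component = Struct.new(:name, :value, :weight)

# Hypothetical aggregation over already-computed component SLO ratios.
def aggregate(components, strategy)
  case strategy
  when "weighted_average"
    total_weight = components.sum(&:weight)
    components.sum { |c| c.value * c.weight } / total_weight
  when "min" then components.map(&:value).min # worst component
  when "max" then components.map(&:value).max # best component
  else raise ArgumentError, "unknown strategy: #{strategy}"
  end
end

components = [
  Component.new("http_slo",          0.9995, 0.4),
  Component.new("event_slo_payment", 0.9996, 0.4),
  Component.new("event_slo_orders",  0.9992, 0.2)
]

weighted = aggregate(components, "weighted_average")
puts format("weighted_average: %.5f", weighted)
puts format("min: %.4f", aggregate(components, "min"))
```

Note that a weighted average can stay above target while one component is well below it, which is the main argument for also tracking the `min` strategy.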
|
|
1185
|
+
|
|
1186
|
+
### 9.4. Alternative: "Worst Case" Strategy
|
|
1187
|
+
|
|
1188
|
+
```yaml
|
|
1189
|
+
# config/slo.yml
|
|
1190
|
+
app_wide:
|
|
1191
|
+
aggregated_slo:
|
|
1192
|
+
strategy: "min" # Take worst SLO
|
|
1193
|
+
|
|
1194
|
+
components:
|
|
1195
|
+
- name: "http_slo"
|
|
1196
|
+
metric: ...
|
|
1197
|
+
- name: "event_slo_payment"
|
|
1198
|
+
metric: ...
|
|
1199
|
+
- name: "event_slo_orders"
|
|
1200
|
+
metric: ...
|
|
1201
|
+
```
|
|
1202
|
+
|
|
1203
|
+
```promql
|
|
1204
|
+
# Min (Worst) SLO
|
|
1205
|
+
# min() aggregates a single vector, so each ratio gets a distinct
# `component` label and the three are combined with `or`.
min(
  # HTTP SLO
  label_replace(
    sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
    / sum(rate(http_requests_total[30d])),
    "component", "http", "", "")
  or
  # Payment SLO
  label_replace(
    sum(rate(e11y_slo_event_result_total{slo_name="payment_success_rate",slo_status="success"}[30d]))
    / sum(rate(e11y_slo_event_result_total{slo_name="payment_success_rate"}[30d])),
    "component", "payment", "", "")
  or
  # Orders SLO
  label_replace(
    sum(rate(e11y_slo_event_result_total{slo_name="order_creation_success_rate",slo_status="success"}[30d]))
    / sum(rate(e11y_slo_event_result_total{slo_name="order_creation_success_rate"}[30d])),
    "component", "orders", "", "")
)
|
|
1218
|
+
|
|
1219
|
+
# Result: 99.92% (worst of the three)
|
|
1220
|
+
```
|
|
1221
|
+
|
|
1222
|
+
### 9.5. Grafana Dashboard: Aggregated SLO
|
|
1223
|
+
|
|
1224
|
+
```json
|
|
1225
|
+
{
|
|
1226
|
+
"dashboard": {
|
|
1227
|
+
"title": "E11y App-Wide SLO Dashboard",
|
|
1228
|
+
"panels": [
|
|
1229
|
+
{
|
|
1230
|
+
"title": "Overall Application Health (Aggregated SLO)",
|
|
1231
|
+
"type": "gauge",
|
|
1232
|
+
"targets": [
|
|
1233
|
+
{
|
|
1234
|
+
"expr": "# Weighted average PromQL from above",
|
|
1235
|
+
"legendFormat": "App-Wide SLO"
|
|
1236
|
+
},
|
|
1237
|
+
{
|
|
1238
|
+
"expr": "0.999",
|
|
1239
|
+
"legendFormat": "Target (99.9%)"
|
|
1240
|
+
}
|
|
1241
|
+
],
|
|
1242
|
+
"fieldConfig": {
|
|
1243
|
+
"defaults": {
|
|
1244
|
+
"min": 0.99,
|
|
1245
|
+
"max": 1.0,
|
|
1246
|
+
"thresholds": {
|
|
1247
|
+
"mode": "absolute",
|
|
1248
|
+
"steps": [
|
|
1249
|
+
{ "value": 0.99, "color": "red" },
|
|
1250
|
+
{ "value": 0.995, "color": "yellow" },
|
|
1251
|
+
{ "value": 0.999, "color": "green" }
|
|
1252
|
+
]
|
|
1253
|
+
}
|
|
1254
|
+
}
|
|
1255
|
+
}
|
|
1256
|
+
},
|
|
1257
|
+
{
|
|
1258
|
+
"title": "SLO Components Breakdown",
|
|
1259
|
+
"type": "timeseries",
|
|
1260
|
+
"targets": [
|
|
1261
|
+
{ "expr": "# HTTP SLO", "legendFormat": "HTTP (40%)" },
|
|
1262
|
+
{ "expr": "# Payment SLO", "legendFormat": "Payment Events (40%)" },
|
|
1263
|
+
{ "expr": "# Orders SLO", "legendFormat": "Order Events (20%)" }
|
|
1264
|
+
]
|
|
1265
|
+
}
|
|
1266
|
+
]
|
|
1267
|
+
}
|
|
1268
|
+
}
|
|
1269
|
+
```
|
|
1270
|
+
|
|
1271
|
+
---
|
|
1272
|
+
|
|
1273
|
+
## 10. Real-World Examples
|
|
1274
|
+
|
|
1275
|
+
### 10.1. E-Commerce Platform
|
|
1276
|
+
|
|
1277
|
+
```ruby
|
|
1278
|
+
# === Payment Success Rate SLO ===
|
|
1279
|
+
module Events
|
|
1280
|
+
class PaymentProcessed < E11y::Event::Base
|
|
1281
|
+
schema do
|
|
1282
|
+
required(:payment_id).filled(:string)
|
|
1283
|
+
required(:order_id).filled(:string)
|
|
1284
|
+
required(:amount).filled(:float)
|
|
1285
|
+
required(:currency).filled(:string)
|
|
1286
|
+
required(:payment_method).filled(:string) # 'card', 'bank', 'paypal'
|
|
1287
|
+
required(:status).filled(:string) # 'completed', 'failed', 'pending'
|
|
1288
|
+
required(:error_code).maybe(:string) # Present if status = 'failed'
|
|
1289
|
+
optional(:slo_status).filled(:string)
|
|
1290
|
+
end
|
|
1291
|
+
|
|
1292
|
+
slo do
|
|
1293
|
+
enabled true
|
|
1294
|
+
|
|
1295
|
+
slo_status_from do |payload|
|
|
1296
|
+
next payload[:slo_status] if payload[:slo_status] # `next`, not `return`: exits only this block
|
|
1297
|
+
|
|
1298
|
+
case payload[:status]
|
|
1299
|
+
when 'completed' then 'success'
|
|
1300
|
+
when 'failed' then 'failure'
|
|
1301
|
+
when 'pending' then nil # Not counted
|
|
1302
|
+
end
|
|
1303
|
+
end
|
|
1304
|
+
|
|
1305
|
+
contributes_to 'payment_success_rate'
|
|
1306
|
+
group_by :payment_method # Per-method SLO
|
|
1307
|
+
end
|
|
1308
|
+
end
|
|
1309
|
+
end
|
|
1310
|
+
|
|
1311
|
+
# === Order Creation Success Rate SLO ===
|
|
1312
|
+
module Events
|
|
1313
|
+
class OrderCreated < E11y::Event::Base
|
|
1314
|
+
schema do
|
|
1315
|
+
required(:order_id).filled(:string)
|
|
1316
|
+
required(:user_id).filled(:string)
|
|
1317
|
+
required(:items).array(:hash)
|
|
1318
|
+
required(:total_amount).filled(:float)
|
|
1319
|
+
optional(:slo_status).filled(:string)
|
|
1320
|
+
end
|
|
1321
|
+
|
|
1322
|
+
slo do
|
|
1323
|
+
enabled true
|
|
1324
|
+
|
|
1325
|
+
slo_status_from do |payload|
|
|
1326
|
+
next payload[:slo_status] if payload[:slo_status] # `next`, not `return`: exits only this block
|
|
1327
|
+
|
|
1328
|
+
# All OrderCreated events = success
|
|
1329
|
+
'success'
|
|
1330
|
+
end
|
|
1331
|
+
|
|
1332
|
+
contributes_to 'order_creation_success_rate'
|
|
1333
|
+
end
|
|
1334
|
+
end
|
|
1335
|
+
end
|
|
1336
|
+
|
|
1337
|
+
module Events
|
|
1338
|
+
class OrderCreationFailed < E11y::Event::Base
|
|
1339
|
+
schema do
|
|
1340
|
+
required(:user_id).filled(:string)
|
|
1341
|
+
required(:reason).filled(:string)
|
|
1342
|
+
required(:validation_errors).maybe(:array)
|
|
1343
|
+
optional(:slo_status).filled(:string)
|
|
1344
|
+
end
|
|
1345
|
+
|
|
1346
|
+
slo do
|
|
1347
|
+
enabled true
|
|
1348
|
+
|
|
1349
|
+
slo_status_from do |payload|
|
|
1350
|
+
next payload[:slo_status] if payload[:slo_status] # `next`, not `return`: exits only this block
|
|
1351
|
+
|
|
1352
|
+
# All OrderCreationFailed events = failure
|
|
1353
|
+
'failure'
|
|
1354
|
+
end
|
|
1355
|
+
|
|
1356
|
+
contributes_to 'order_creation_success_rate'
|
|
1357
|
+
end
|
|
1358
|
+
end
|
|
1359
|
+
end
|
|
1360
|
+
```
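The status blocks above share one contract: an explicit `slo_status` in the payload wins, otherwise the status is derived from the payload, and `nil` means the event is not counted. A minimal sketch of how such a block might be evaluated and tallied follows. `STATUS_FROM` and `tally` are hypothetical names, not E11y internals; the block uses `next` to exit early, which is safe in a plain proc.

```ruby
# Hypothetical status block modelled on Events::PaymentProcessed.
STATUS_FROM = proc do |payload|
  # Explicit override takes priority over the derived status.
  next payload[:slo_status] if payload[:slo_status]

  case payload[:status]
  when "completed" then "success"
  when "failed"    then "failure"
  end # anything else, including "pending", yields nil
end

# Counts results per status; nil statuses are skipped entirely.
def tally(payloads, status_from)
  payloads.each_with_object(Hash.new(0)) do |payload, counts|
    status = status_from.call(payload)
    counts[status] += 1 if status
  end
end

counts = tally([
  { status: "completed" },
  { status: "failed" },
  { status: "pending" },                      # not counted
  { status: "failed", slo_status: "success" } # override wins
], STATUS_FROM)
puts counts.inspect
```

Skipping `nil` keeps in-flight events out of both the numerator and the denominator, so pending payments neither help nor hurt the success rate.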
|
|
1361
|
+
|
|
1362
|
+
```yaml
|
|
1363
|
+
# config/slo.yml (E-commerce)
|
|
1364
|
+
custom_slos:
|
|
1365
|
+
- name: "payment_success_rate"
|
|
1366
|
+
description: "Payment processing success rate"
|
|
1367
|
+
type: event_based
|
|
1368
|
+
events:
|
|
1369
|
+
- Events::PaymentProcessed
|
|
1370
|
+
target: 0.999 # 99.9%
|
|
1371
|
+
window: 30d
|
|
1372
|
+
group_by: :payment_method
|
|
1373
|
+
|
|
1374
|
+
burn_rate_alerts:
|
|
1375
|
+
fast: { enabled: true, threshold: 14.4, alert_after: 5m }
|
|
1376
|
+
|
|
1377
|
+
- name: "order_creation_success_rate"
|
|
1378
|
+
description: "Order creation success rate"
|
|
1379
|
+
type: event_based
|
|
1380
|
+
events:
|
|
1381
|
+
- Events::OrderCreated
|
|
1382
|
+
- Events::OrderCreationFailed
|
|
1383
|
+
target: 0.999
|
|
1384
|
+
window: 30d
|
|
1385
|
+
|
|
1386
|
+
# App-Wide SLO (combines HTTP + Events)
|
|
1387
|
+
app_wide:
|
|
1388
|
+
aggregated_slo:
|
|
1389
|
+
enabled: true
|
|
1390
|
+
strategy: "weighted_average"
|
|
1391
|
+
components:
|
|
1392
|
+
- name: "http_slo"
|
|
1393
|
+
weight: 0.3 # 30%
|
|
1394
|
+
metric: "sum(rate(http_requests_total{status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total[30d]))"
|
|
1395
|
+
- name: "payment_slo"
|
|
1396
|
+
weight: 0.5 # 50% (most critical!)
|
|
1397
|
+
metric: "sum(rate(e11y_slo_event_result_total{slo_name=\"payment_success_rate\",slo_status=\"success\"}[30d])) / sum(rate(e11y_slo_event_result_total{slo_name=\"payment_success_rate\"}[30d]))"
|
|
1398
|
+
- name: "orders_slo"
|
|
1399
|
+
weight: 0.2 # 20%
|
|
1400
|
+
metric: "sum(rate(e11y_slo_event_result_total{slo_name=\"order_creation_success_rate\",slo_status=\"success\"}[30d])) / sum(rate(e11y_slo_event_result_total{slo_name=\"order_creation_success_rate\"}[30d]))"
|
|
1401
|
+
target: 0.999
|
|
1402
|
+
```
|
|
1403
|
+
|
|
1404
|
+
---
|
|
1405
|
+
|
|
1406
|
+
## 11. Trade-offs
|
|
1407
|
+
|
|
1408
|
+
### 11.1. Key Decisions
|
|
1409
|
+
|
|
1410
|
+
| Decision | Pro | Con | Rationale |
|----------|-----|-----|-----------|
| **Explicit slo declaration** | No magic, clear intent | Boilerplate in every Event | Clarity > brevity |
| **slo_status_from auto-calc** | DRY, flexible | Extra method call | Performance acceptable (<0.1ms) |
| **Prometheus aggregation** | Flexible, standard | Complex PromQL | Industry standard (Google SRE) |
| **Separate HTTP + Event SLO** | Clear separation | Two SLOs to manage | Reflects reality (infra ≠ business) |
| **App-Wide aggregated SLO** | Single health metric | More config | SRE needs one number |
| **Linters at boot** | Early failure detection | Startup time +50ms | Worth it for consistency |
| **Optional override** | Edge case support | Potential for abuse | Trust developers |
|
|
1419
|
+
|
|
1420
|
+
### 11.2. Alternatives Considered
|
|
1421
|
+
|
|
1422
|
+
**A) Auto-detect SLO from severity**
|
|
1423
|
+
```ruby
|
|
1424
|
+
# ❌ REJECTED: Too implicit
|
|
1425
|
+
# severity :error → slo_status = 'failure'
|
|
1426
|
+
# severity :success → slo_status = 'success'
|
|
1427
|
+
```
|
|
1428
|
+
- ❌ Not flexible (business logic ≠ severity)
|
|
1429
|
+
- ❌ Magic behavior
|
|
1430
|
+
- ✅ **CHOSEN:** Explicit `slo_status_from`
|
|
1431
|
+
|
|
1432
|
+
**B) SLO status as required field**
|
|
1433
|
+
```ruby
|
|
1434
|
+
# ❌ REJECTED: Too rigid
|
|
1435
|
+
schema do
|
|
1436
|
+
required(:slo_status).filled(:string)
|
|
1437
|
+
end
|
|
1438
|
+
```
|
|
1439
|
+
- ❌ Forces manual calculation at call site
|
|
1440
|
+
- ❌ Not DRY
|
|
1441
|
+
- ✅ **CHOSEN:** Optional override + auto-calc
|
|
1442
|
+
|
|
1443
|
+
**C) SLO calculation in application**
|
|
1444
|
+
```ruby
|
|
1445
|
+
# ❌ REJECTED: Not flexible
|
|
1446
|
+
# E11y calculates SLO percentage internally
|
|
1447
|
+
Yabeda.e11y_slo.payment_success_rate.set({}, 0.9996)
|
|
1448
|
+
```
|
|
1449
|
+
- ❌ Can't recalculate for different time windows
|
|
1450
|
+
- ❌ State lost on restart
|
|
1451
|
+
- ✅ **CHOSEN:** Prometheus aggregation
|
|
1452
|
+
|
|
1453
|
+
---
|
|
1454
|
+
|
|
1455
|
+
## 12. Summary & Next Steps
|
|
1456
|
+
|
|
1457
|
+
### 12.1. What We Achieved
|
|
1458
|
+
|
|
1459
|
+
✅ **Event-Driven SLO**: Custom SLO based on business events
|
|
1460
|
+
✅ **Explicit Configuration**: No magic, all visible in Event class
|
|
1461
|
+
✅ **Auto-calculation**: `slo_status_from` with optional override
|
|
1462
|
+
✅ **Prometheus Integration**: Standard aggregation, flexible queries
|
|
1463
|
+
✅ **3 Linters**: Ensure consistency at boot time
|
|
1464
|
+
✅ **Independent HTTP + Event SLO**: Clear separation of concerns
|
|
1465
|
+
✅ **App-Wide SLO Aggregation**: Single health metric for SRE
|
|
1466
|
+
✅ **Group by support**: Per-label SLO (e.g., per payment_method)
|
|
1467
|
+
✅ **Latency SLO**: Histogram metrics for p99/p95
|
|
1468
|
+
✅ **Real-world example**: E-commerce platform (payments and order creation)
|
|
1469
|
+
|
|
1470
|
+
### 12.2. Integration with ADR-003
|
|
1471
|
+
|
|
1472
|
+
| Aspect | ADR-003 (HTTP SLO) | ADR-014 (Event SLO) | Integration |
|--------|-------------------|---------------------|-------------|
| **Metrics Source** | HTTP requests/jobs | Event tracking | Independent |
| **Config Location** | `slo.yml` endpoints | `slo.yml` custom_slos | Same file |
| **Linters** | Route validation | Event validation | Run together |
| **Burn Rate Alerts** | Multi-window | Multi-window | Same strategy |
| **Prometheus** | PromQL aggregation | PromQL aggregation | Same approach |
| **App-Wide SLO** | ❌ Not defined | ✅ Aggregated SLO | NEW feature |
|
|
1480
|
+
|
|
1481
|
+
### 12.3. Implementation Checklist
|
|
1482
|
+
|
|
1483
|
+
**Phase 1: Core (Week 1-2)**
|
|
1484
|
+
- [ ] Implement `E11y::Event::Base.slo` DSL
|
|
1485
|
+
- [ ] Implement `E11y::Event::SLOConfig` class
|
|
1486
|
+
- [ ] Add `slo_status_from` computation
|
|
1487
|
+
- [ ] Integrate with `track` method
|
|
1488
|
+
- [ ] Emit Yabeda metrics (`event_result_total`)
|
|
1489
|
+
|
|
1490
|
+
**Phase 2: Configuration (Week 3)**
|
|
1491
|
+
- [ ] Extend `slo.yml` schema for `custom_slos`
|
|
1492
|
+
- [ ] Implement `ConfigLoader.resolve_custom_slo`
|
|
1493
|
+
- [ ] Add `app_wide.aggregated_slo` config
|
|
1494
|
+
- [ ] Add validation for custom SLO config
|
|
1495
|
+
|
|
1496
|
+
**Phase 3: Linters (Week 4)**
|
|
1497
|
+
- [ ] Implement Linter 1 (explicit slo declaration)
|
|
1498
|
+
- [ ] Implement Linter 2 (slo_status_from required)
|
|
1499
|
+
- [ ] Implement Linter 3 (slo.yml consistency)
|
|
1500
|
+
- [ ] Add auto-validation on boot
|
|
1501
|
+
- [ ] Add `strict_validation` config option
|
|
1502
|
+
|
|
1503
|
+
**Phase 4: Prometheus (Week 5)**
|
|
1504
|
+
- [ ] Document PromQL queries
|
|
1505
|
+
- [ ] Add app-wide aggregated SLO query
|
|
1506
|
+
- [ ] Create Grafana dashboard templates
|
|
1507
|
+
- [ ] Add burn rate alerts for Event SLO
|
|
1508
|
+
- [ ] Test with real Event tracking
|
|
1509
|
+
|
|
1510
|
+
**Phase 5: Documentation (Week 6)**
|
|
1511
|
+
- [ ] Write Event SLO guide
|
|
1512
|
+
- [ ] Add real-world examples
|
|
1513
|
+
- [ ] Document migration from HTTP-only SLO
|
|
1514
|
+
- [ ] Add troubleshooting guide
|
|
1515
|
+
|
|
1516
|
+
**Phase 6: Testing (Week 7)**
|
|
1517
|
+
- [ ] RSpec for `slo_status_from` computation
|
|
1518
|
+
- [ ] RSpec for Yabeda metric emission
|
|
1519
|
+
- [ ] RSpec for all 3 linters
|
|
1520
|
+
- [ ] Integration tests (end-to-end)
|
|
1521
|
+
- [ ] Performance benchmarks (<0.1ms p99)
|
|
1522
|
+
|
|
1523
|
+
---
|
|
1524
|
+
|
|
1525
|
+
**Status:** ✅ Fully Designed
|
|
1526
|
+
**Next:** Implementation (Phases 1-6)
|
|
1527
|
+
**Estimated Implementation:** 7 weeks
|
|
1528
|
+
**Impact:**
|
|
1529
|
+
- Business-logic SLO visibility (not just infrastructure)
|
|
1530
|
+
- Explicit, no-magic configuration
|
|
1531
|
+
- Flexible Prometheus-based aggregation
|
|
1532
|
+
- App-wide health metric for SRE
|
|
1533
|
+
- Consistency enforced by linters
|