e11y 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.rspec +4 -0
- data/.rubocop.yml +69 -0
- data/CHANGELOG.md +26 -0
- data/CODE_OF_CONDUCT.md +64 -0
- data/LICENSE.txt +21 -0
- data/README.md +179 -0
- data/Rakefile +37 -0
- data/benchmarks/run_all.rb +33 -0
- data/config/README.md +83 -0
- data/config/loki-local-config.yaml +35 -0
- data/config/prometheus.yml +15 -0
- data/docker-compose.yml +78 -0
- data/docs/00-ICP-AND-TIMELINE.md +483 -0
- data/docs/01-SCALE-REQUIREMENTS.md +858 -0
- data/docs/ADR-001-architecture.md +2617 -0
- data/docs/ADR-002-metrics-yabeda.md +1395 -0
- data/docs/ADR-003-slo-observability.md +3337 -0
- data/docs/ADR-004-adapter-architecture.md +2385 -0
- data/docs/ADR-005-tracing-context.md +1372 -0
- data/docs/ADR-006-security-compliance.md +4143 -0
- data/docs/ADR-007-opentelemetry-integration.md +1385 -0
- data/docs/ADR-008-rails-integration.md +1911 -0
- data/docs/ADR-009-cost-optimization.md +2993 -0
- data/docs/ADR-010-developer-experience.md +2166 -0
- data/docs/ADR-011-testing-strategy.md +1836 -0
- data/docs/ADR-012-event-evolution.md +958 -0
- data/docs/ADR-013-reliability-error-handling.md +2750 -0
- data/docs/ADR-014-event-driven-slo.md +1533 -0
- data/docs/ADR-015-middleware-order.md +1061 -0
- data/docs/ADR-016-self-monitoring-slo.md +1234 -0
- data/docs/API-REFERENCE-L28.md +914 -0
- data/docs/COMPREHENSIVE-CONFIGURATION.md +2366 -0
- data/docs/IMPLEMENTATION_NOTES.md +2804 -0
- data/docs/IMPLEMENTATION_PLAN.md +1971 -0
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +586 -0
- data/docs/PLAN.md +148 -0
- data/docs/QUICK-START.md +934 -0
- data/docs/README.md +296 -0
- data/docs/design/00-memory-optimization.md +593 -0
- data/docs/guides/MIGRATION-L27-L28.md +692 -0
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +434 -0
- data/docs/guides/README.md +44 -0
- data/docs/prd/01-overview-vision.md +440 -0
- data/docs/use_cases/README.md +119 -0
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +813 -0
- data/docs/use_cases/UC-002-business-event-tracking.md +1953 -0
- data/docs/use_cases/UC-003-pattern-based-metrics.md +1627 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +728 -0
- data/docs/use_cases/UC-005-sentry-integration.md +759 -0
- data/docs/use_cases/UC-006-trace-context-management.md +905 -0
- data/docs/use_cases/UC-007-pii-filtering.md +2648 -0
- data/docs/use_cases/UC-008-opentelemetry-integration.md +1153 -0
- data/docs/use_cases/UC-009-multi-service-tracing.md +1043 -0
- data/docs/use_cases/UC-010-background-job-tracking.md +1018 -0
- data/docs/use_cases/UC-011-rate-limiting.md +1906 -0
- data/docs/use_cases/UC-012-audit-trail.md +2301 -0
- data/docs/use_cases/UC-013-high-cardinality-protection.md +2127 -0
- data/docs/use_cases/UC-014-adaptive-sampling.md +1940 -0
- data/docs/use_cases/UC-015-cost-optimization.md +735 -0
- data/docs/use_cases/UC-016-rails-logger-migration.md +785 -0
- data/docs/use_cases/UC-017-local-development.md +867 -0
- data/docs/use_cases/UC-018-testing-events.md +1081 -0
- data/docs/use_cases/UC-019-tiered-storage-migration.md +562 -0
- data/docs/use_cases/UC-020-event-versioning.md +708 -0
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +956 -0
- data/docs/use_cases/UC-022-event-registry.md +648 -0
- data/docs/use_cases/backlog.md +226 -0
- data/e11y.gemspec +76 -0
- data/lib/e11y/adapters/adaptive_batcher.rb +207 -0
- data/lib/e11y/adapters/audit_encrypted.rb +239 -0
- data/lib/e11y/adapters/base.rb +580 -0
- data/lib/e11y/adapters/file.rb +224 -0
- data/lib/e11y/adapters/in_memory.rb +216 -0
- data/lib/e11y/adapters/loki.rb +333 -0
- data/lib/e11y/adapters/otel_logs.rb +203 -0
- data/lib/e11y/adapters/registry.rb +141 -0
- data/lib/e11y/adapters/sentry.rb +230 -0
- data/lib/e11y/adapters/stdout.rb +108 -0
- data/lib/e11y/adapters/yabeda.rb +370 -0
- data/lib/e11y/buffers/adaptive_buffer.rb +339 -0
- data/lib/e11y/buffers/base_buffer.rb +40 -0
- data/lib/e11y/buffers/request_scoped_buffer.rb +246 -0
- data/lib/e11y/buffers/ring_buffer.rb +267 -0
- data/lib/e11y/buffers.rb +14 -0
- data/lib/e11y/console.rb +122 -0
- data/lib/e11y/current.rb +48 -0
- data/lib/e11y/event/base.rb +894 -0
- data/lib/e11y/event/value_sampling_config.rb +84 -0
- data/lib/e11y/events/base_audit_event.rb +43 -0
- data/lib/e11y/events/base_payment_event.rb +33 -0
- data/lib/e11y/events/rails/cache/delete.rb +21 -0
- data/lib/e11y/events/rails/cache/read.rb +23 -0
- data/lib/e11y/events/rails/cache/write.rb +22 -0
- data/lib/e11y/events/rails/database/query.rb +45 -0
- data/lib/e11y/events/rails/http/redirect.rb +21 -0
- data/lib/e11y/events/rails/http/request.rb +26 -0
- data/lib/e11y/events/rails/http/send_file.rb +21 -0
- data/lib/e11y/events/rails/http/start_processing.rb +26 -0
- data/lib/e11y/events/rails/job/completed.rb +22 -0
- data/lib/e11y/events/rails/job/enqueued.rb +22 -0
- data/lib/e11y/events/rails/job/failed.rb +22 -0
- data/lib/e11y/events/rails/job/scheduled.rb +23 -0
- data/lib/e11y/events/rails/job/started.rb +22 -0
- data/lib/e11y/events/rails/log.rb +56 -0
- data/lib/e11y/events/rails/view/render.rb +23 -0
- data/lib/e11y/events.rb +18 -0
- data/lib/e11y/instruments/active_job.rb +201 -0
- data/lib/e11y/instruments/rails_instrumentation.rb +141 -0
- data/lib/e11y/instruments/sidekiq.rb +175 -0
- data/lib/e11y/logger/bridge.rb +205 -0
- data/lib/e11y/metrics/cardinality_protection.rb +172 -0
- data/lib/e11y/metrics/cardinality_tracker.rb +134 -0
- data/lib/e11y/metrics/registry.rb +234 -0
- data/lib/e11y/metrics/relabeling.rb +226 -0
- data/lib/e11y/metrics.rb +102 -0
- data/lib/e11y/middleware/audit_signing.rb +174 -0
- data/lib/e11y/middleware/base.rb +140 -0
- data/lib/e11y/middleware/event_slo.rb +167 -0
- data/lib/e11y/middleware/pii_filter.rb +266 -0
- data/lib/e11y/middleware/pii_filtering.rb +280 -0
- data/lib/e11y/middleware/rate_limiting.rb +214 -0
- data/lib/e11y/middleware/request.rb +163 -0
- data/lib/e11y/middleware/routing.rb +157 -0
- data/lib/e11y/middleware/sampling.rb +254 -0
- data/lib/e11y/middleware/slo.rb +168 -0
- data/lib/e11y/middleware/trace_context.rb +131 -0
- data/lib/e11y/middleware/validation.rb +118 -0
- data/lib/e11y/middleware/versioning.rb +132 -0
- data/lib/e11y/middleware.rb +12 -0
- data/lib/e11y/pii/patterns.rb +90 -0
- data/lib/e11y/pii.rb +13 -0
- data/lib/e11y/pipeline/builder.rb +155 -0
- data/lib/e11y/pipeline/zone_validator.rb +110 -0
- data/lib/e11y/pipeline.rb +12 -0
- data/lib/e11y/presets/audit_event.rb +65 -0
- data/lib/e11y/presets/debug_event.rb +34 -0
- data/lib/e11y/presets/high_value_event.rb +51 -0
- data/lib/e11y/presets.rb +19 -0
- data/lib/e11y/railtie.rb +138 -0
- data/lib/e11y/reliability/circuit_breaker.rb +216 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +277 -0
- data/lib/e11y/reliability/dlq/filter.rb +117 -0
- data/lib/e11y/reliability/retry_handler.rb +207 -0
- data/lib/e11y/reliability/retry_rate_limiter.rb +117 -0
- data/lib/e11y/sampling/error_spike_detector.rb +225 -0
- data/lib/e11y/sampling/load_monitor.rb +161 -0
- data/lib/e11y/sampling/stratified_tracker.rb +92 -0
- data/lib/e11y/sampling/value_extractor.rb +82 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +79 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +97 -0
- data/lib/e11y/self_monitoring/reliability_monitor.rb +146 -0
- data/lib/e11y/slo/event_driven.rb +150 -0
- data/lib/e11y/slo/tracker.rb +119 -0
- data/lib/e11y/version.rb +9 -0
- data/lib/e11y.rb +283 -0
- metadata +452 -0
|
@@ -0,0 +1,3337 @@
|
|
|
1
|
+
# ADR-003: SLO & Observability
|
|
2
|
+
|
|
3
|
+
**Status:** Draft
|
|
4
|
+
**Date:** January 13, 2026
|
|
5
|
+
**Covers:** UC-004 (Zero-Config SLO Tracking)
|
|
6
|
+
**Depends On:** ADR-001 (Core), ADR-008 (Rails Integration), ADR-002 (Metrics)
|
|
7
|
+
|
|
8
|
+
**Related ADRs:**
|
|
9
|
+
- 📊 **ADR-014: Event-Driven SLO** - Custom SLO based on business events (e.g., payment success rate)
|
|
10
|
+
- 🔗 **Integration:** See `ADR-003-014-INTEGRATION.md` for detailed integration analysis
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## 🔍 Scope of This ADR
|
|
15
|
+
|
|
16
|
+
This ADR covers **HTTP/Job SLO** (infrastructure reliability):
|
|
17
|
+
- ✅ Zero-config SLO for HTTP requests (99.9% availability)
|
|
18
|
+
- ✅ Zero-config SLO for Sidekiq/ActiveJob (99.5% success rate)
|
|
19
|
+
- ✅ Per-endpoint SLO configuration in `slo.yml`
|
|
20
|
+
- ✅ Multi-window burn rate alerts (5 min detection)
|
|
21
|
+
- ✅ Error budget management & deployment gates
|
|
22
|
+
|
|
23
|
+
**For Event-based SLO** (business logic reliability like "order creation success rate"), see **ADR-014**.
|
|
24
|
+
|
|
25
|
+
**For App-Wide SLO** (aggregating HTTP + Event metrics into single health score), see **ADR-014 Section 9**.
|
|
26
|
+
|
|
27
|
+
---
|
|
28
|
+
|
|
29
|
+
## 📋 Table of Contents
|
|
30
|
+
|
|
31
|
+
1. [Context & Problem](#1-context--problem)
|
|
32
|
+
2. [Architecture Overview](#2-architecture-overview)
|
|
33
|
+
3. [Multi-Level SLO Strategy](#3-multi-level-slo-strategy)
|
|
34
|
+
4. [Per-Endpoint SLO Configuration](#4-per-endpoint-slo-configuration)
|
|
35
|
+
5. [Multi-Window Multi-Burn Rate Alerts](#5-multi-window-multi-burn-rate-alerts)
|
|
36
|
+
6. [SLO Config Validation & Linting](#6-slo-config-validation--linting)
|
|
37
|
+
7. [Error Budget Management](#7-error-budget-management)
|
|
38
|
+
8. [Dashboard & Reporting](#8-dashboard--reporting)
|
|
39
|
+
9. [Trade-offs](#9-trade-offs)
|
|
40
|
+
|
|
41
|
+
---
|
|
42
|
+
|
|
43
|
+
## 1. Context & Problem
|
|
44
|
+
|
|
45
|
+
### 1.1. Problem Statement
|
|
46
|
+
|
|
47
|
+
**Current Pain Points:**
|
|
48
|
+
|
|
49
|
+
```ruby
|
|
50
|
+
# === PROBLEM 1: Overly Broad SLO (App-Wide) ===
|
|
51
|
+
# ❌ One SLO for entire app is too coarse
|
|
52
|
+
# GET /healthcheck (should be 99.99%)
|
|
53
|
+
# POST /orders (should be 99.9%)
|
|
54
|
+
# GET /admin/reports (should be 95%)
|
|
55
|
+
# → All treated the same! Critical endpoints hidden by non-critical ones!
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
```ruby
|
|
59
|
+
# === PROBLEM 2: Slow Alert Detection ===
|
|
60
|
+
# ❌ 30-day window = slow reaction
|
|
61
|
+
# Incident at 10:00 AM
|
|
62
|
+
# First alert at 10:45 AM (45 minutes later!)
|
|
63
|
+
# → Customers already affected!
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
```ruby
|
|
67
|
+
# === PROBLEM 3: No Configuration Management ===
|
|
68
|
+
# ❌ SLOs hardcoded in code
|
|
69
|
+
# Need to deploy to change SLO targets
|
|
70
|
+
# No validation against real routes
|
|
71
|
+
# → Drift between config and reality
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
```ruby
|
|
75
|
+
# === PROBLEM 4: Alert Fatigue ===
|
|
76
|
+
# ❌ Single threshold alerting
|
|
77
|
+
# Minor blip → Page SRE
|
|
78
|
+
# Sustained issue → Same alert
|
|
79
|
+
# → Can't distinguish severity!
|
|
80
|
+
```
|
|
81
|
+
|
|
82
|
+
### 1.2. Design Decisions (Based on Google SRE 2026)
|
|
83
|
+
|
|
84
|
+
**Decision 1: Multi-Level SLO Strategy**
|
|
85
|
+
```yaml
|
|
86
|
+
# 3 levels of SLO granularity:
|
|
87
|
+
1. Application-wide (default, zero-config)
|
|
88
|
+
2. Service-level (Sidekiq, ActiveJob)
|
|
89
|
+
3. Per-endpoint (controller#action specific)
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
**Decision 2: Multi-Window Multi-Burn Rate (Google SRE Standard)**
|
|
93
|
+
```yaml
|
|
94
|
+
# Alert windows (not SLO windows!):
|
|
95
|
+
- Fast burn: 1 hour window, 5 min alert, 14.4x burn rate → 2% budget consumed
|
|
96
|
+
- Medium burn: 6 hour window, 30 min alert, 6.0x burn rate → 5% budget consumed
|
|
97
|
+
- Slow burn: 3 day window, 6 hour alert, 1.0x burn rate → 10% budget consumed
|
|
98
|
+
|
|
99
|
+
# SLO window: Still 30 days (industry standard)
|
|
100
|
+
# But ALERTS react in 5 minutes!
|
|
101
|
+
```
|
|
102
|
+
|
|
103
|
+
**Decision 3: YAML-Based Configuration**
|
|
104
|
+
```yaml
|
|
105
|
+
# config/slo.yml - version controlled, validated
|
|
106
|
+
# Separate from code deployment
|
|
107
|
+
# Linter validates against real routes/jobs
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
**Decision 4: Optional Latency SLO**
|
|
111
|
+
```yaml
|
|
112
|
+
# Not all endpoints need latency SLO:
|
|
113
|
+
- Healthcheck: availability only (latency not critical)
|
|
114
|
+
- File upload: availability + custom latency (5s)
|
|
115
|
+
- API: availability + p99 latency (500ms)
|
|
116
|
+
```
|
|
117
|
+
|
|
118
|
+
### 1.3. Goals
|
|
119
|
+
|
|
120
|
+
**Primary Goals:**
|
|
121
|
+
- ✅ **Per-endpoint SLO** (controller#action level)
|
|
122
|
+
- ✅ **5-minute alert detection** (fast burn rate)
|
|
123
|
+
- ✅ **YAML-based configuration** with validation
|
|
124
|
+
- ✅ **Flexible latency SLO** (optional per endpoint)
|
|
125
|
+
- ✅ **Multi-window burn rate** (Google SRE standard)
|
|
126
|
+
|
|
127
|
+
**Non-Goals:**
|
|
128
|
+
- ❌ Per-user SLO (too granular for v1.0)
|
|
129
|
+
- ❌ Automatic SLO adjustment (manual for v1.0)
|
|
130
|
+
- ❌ SLO enforcement (alerts only, no blocking)
|
|
131
|
+
|
|
132
|
+
### 1.4. Success Metrics
|
|
133
|
+
|
|
134
|
+
| Metric | Target | Critical? |
|
|
135
|
+
|--------|--------|-----------|
|
|
136
|
+
| **Alert detection time** | <5 minutes | ✅ Yes |
|
|
137
|
+
| **Per-endpoint coverage** | 100% (all routes) | ✅ Yes |
|
|
138
|
+
| **Config validation** | 100% (no drift) | ✅ Yes |
|
|
139
|
+
| **False positive rate** | <1% | ✅ Yes |
|
|
140
|
+
| **Alert precision** | >95% | ✅ Yes |
|
|
141
|
+
|
|
142
|
+
---
|
|
143
|
+
|
|
144
|
+
## 2. Architecture Overview
|
|
145
|
+
|
|
146
|
+
### 2.1. System Context
|
|
147
|
+
|
|
148
|
+
```mermaid
|
|
149
|
+
C4Context
|
|
150
|
+
title SLO & Observability Context (Multi-Level)
|
|
151
|
+
|
|
152
|
+
Person(sre, "SRE", "Monitors SLOs")
|
|
153
|
+
Person(dev, "Developer", "Defines SLOs")
|
|
154
|
+
|
|
155
|
+
System(rails_app, "Rails App", "100+ endpoints")
|
|
156
|
+
System(e11y, "E11y Gem", "Multi-level SLO")
|
|
157
|
+
System(slo_config, "slo.yml", "Per-endpoint config")
|
|
158
|
+
|
|
159
|
+
System_Ext(prometheus, "Prometheus", "Multi-window queries")
|
|
160
|
+
System_Ext(grafana, "Grafana", "Per-endpoint dashboards")
|
|
161
|
+
System_Ext(alertmanager, "Alertmanager", "Fast/Medium/Slow burn")
|
|
162
|
+
|
|
163
|
+
Rel(dev, slo_config, "Defines", "Per-endpoint SLO")
|
|
164
|
+
Rel(rails_app, e11y, "Tracks", "Per controller#action")
|
|
165
|
+
Rel(e11y, slo_config, "Validates", "Against real routes")
|
|
166
|
+
Rel(e11y, prometheus, "Exports", "Per-endpoint metrics")
|
|
167
|
+
Rel(prometheus, alertmanager, "Evaluates", "3 burn rate windows")
|
|
168
|
+
Rel(alertmanager, sre, "Alerts in 5min", "Fast burn")
|
|
169
|
+
Rel(sre, grafana, "Views", "Per-endpoint SLO")
|
|
170
|
+
|
|
171
|
+
UpdateLayoutConfig($c4ShapeInRow="3", $c4BoundaryInRow="1")
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
### 2.2. Component Architecture
|
|
175
|
+
|
|
176
|
+
```mermaid
|
|
177
|
+
graph TB
|
|
178
|
+
subgraph "Rails Application"
|
|
179
|
+
Route1[GET /orders] --> Middleware[E11y SLO Middleware]
|
|
180
|
+
Route2[POST /orders] --> Middleware
|
|
181
|
+
Route3[GET /healthcheck] --> Middleware
|
|
182
|
+
SidekiqJob[PaymentJob] --> SidekiqInstr[Sidekiq Instrumentation]
|
|
183
|
+
end
|
|
184
|
+
|
|
185
|
+
subgraph "E11y SLO Engine"
|
|
186
|
+
Middleware --> SLOResolver[SLO Config Resolver]
|
|
187
|
+
SidekiqInstr --> SLOResolver
|
|
188
|
+
|
|
189
|
+
SLOResolver --> ConfigLoader[slo.yml Loader]
|
|
190
|
+
ConfigLoader --> Validator[Route/Job Validator]
|
|
191
|
+
|
|
192
|
+
SLOResolver --> MetricsEmitter[Per-Endpoint Metrics]
|
|
193
|
+
MetricsEmitter --> AppWide[App-Wide Metrics]
|
|
194
|
+
MetricsEmitter --> PerEndpoint[Per-Endpoint Metrics]
|
|
195
|
+
MetricsEmitter --> PerJob[Per-Job Metrics]
|
|
196
|
+
end
|
|
197
|
+
|
|
198
|
+
subgraph "Multi-Window Burn Rate"
|
|
199
|
+
PerEndpoint --> BurnRate1h[1h Fast Burn]
|
|
200
|
+
PerEndpoint --> BurnRate6h[6h Medium Burn]
|
|
201
|
+
PerEndpoint --> BurnRate3d[3d Slow Burn]
|
|
202
|
+
|
|
203
|
+
BurnRate1h --> AlertFast[Alert in 5 min<br/>14.4x burn]
|
|
204
|
+
BurnRate6h --> AlertMedium[Alert in 30 min<br/>6.0x burn]
|
|
205
|
+
BurnRate3d --> AlertSlow[Alert in 6 hours<br/>1.0x burn]
|
|
206
|
+
end
|
|
207
|
+
|
|
208
|
+
subgraph "Prometheus & Grafana"
|
|
209
|
+
AppWide --> PromQL1[PromQL: App SLO]
|
|
210
|
+
PerEndpoint --> PromQL2[PromQL: Endpoint SLO]
|
|
211
|
+
PerJob --> PromQL3[PromQL: Job SLO]
|
|
212
|
+
|
|
213
|
+
PromQL1 --> Dashboard1[App-Wide Dashboard]
|
|
214
|
+
PromQL2 --> Dashboard2[Per-Endpoint Dashboard]
|
|
215
|
+
PromQL3 --> Dashboard3[Job Dashboard]
|
|
216
|
+
end
|
|
217
|
+
|
|
218
|
+
style SLOResolver fill:#d1ecf1
|
|
219
|
+
style BurnRate1h fill:#f8d7da
|
|
220
|
+
style AlertFast fill:#dc3545,color:#fff
|
|
221
|
+
```
|
|
222
|
+
|
|
223
|
+
### 2.3. Multi-Window Alert Flow
|
|
224
|
+
|
|
225
|
+
```mermaid
|
|
226
|
+
sequenceDiagram
|
|
227
|
+
participant Endpoint as POST /orders
|
|
228
|
+
participant E11y as E11y Middleware
|
|
229
|
+
participant Config as slo.yml
|
|
230
|
+
participant Prom as Prometheus
|
|
231
|
+
participant Alert as Alertmanager
|
|
232
|
+
participant SRE as SRE
|
|
233
|
+
|
|
234
|
+
Note over Endpoint: Incident starts at 10:00
|
|
235
|
+
|
|
236
|
+
Endpoint->>E11y: HTTP 500 (error)
|
|
237
|
+
E11y->>Config: Lookup SLO: orders#create
|
|
238
|
+
Config-->>E11y: target: 99.9%, latency: 500ms
|
|
239
|
+
E11y->>Prom: Increment error counter
|
|
240
|
+
|
|
241
|
+
Note over Prom: 1h window burn rate evaluation
|
|
242
|
+
|
|
243
|
+
Prom->>Prom: 10:00-10:05: Calculate burn rate
|
|
244
|
+
Prom->>Prom: Burn rate = 14.5x (> 14.4x threshold)
|
|
245
|
+
|
|
246
|
+
Prom->>Alert: Fire: FastBurn (10:05, 5 min after incident)
|
|
247
|
+
Alert->>SRE: Page: CRITICAL - POST /orders
|
|
248
|
+
|
|
249
|
+
Note over SRE: SRE notified in 5 minutes!
|
|
250
|
+
|
|
251
|
+
alt Incident resolved quickly
|
|
252
|
+
Note over Endpoint: Fixed at 10:10
|
|
253
|
+
Prom->>Prom: 10:10-10:15: Burn rate drops
|
|
254
|
+
Prom->>Alert: Resolve: FastBurn
|
|
255
|
+
else Incident continues
|
|
256
|
+
Prom->>Prom: 10:00-10:30: 6h window burn
|
|
257
|
+
Prom->>Alert: Fire: MediumBurn (additional context)
|
|
258
|
+
end
|
|
259
|
+
```
|
|
260
|
+
|
|
261
|
+
---
|
|
262
|
+
|
|
263
|
+
## 3. Multi-Level SLO Strategy
|
|
264
|
+
|
|
265
|
+
### 3.1. Level 1: Application-Wide SLO (Zero-Config)
|
|
266
|
+
|
|
267
|
+
**Automatic for all Rails apps:**
|
|
268
|
+
|
|
269
|
+
```ruby
|
|
270
|
+
# Automatically tracked (no configuration needed)
|
|
271
|
+
E11y::SLO::ZeroConfig.setup! do
|
|
272
|
+
# App-wide HTTP SLO
|
|
273
|
+
http do
|
|
274
|
+
availability_target 0.999 # 99.9%
|
|
275
|
+
latency_p99_target 500 # 500ms (optional)
|
|
276
|
+
window 30.days
|
|
277
|
+
end
|
|
278
|
+
|
|
279
|
+
# App-wide Sidekiq SLO
|
|
280
|
+
sidekiq do
|
|
281
|
+
success_rate_target 0.995 # 99.5%
|
|
282
|
+
window 30.days
|
|
283
|
+
end
|
|
284
|
+
|
|
285
|
+
# App-wide ActiveJob SLO
|
|
286
|
+
activejob do
|
|
287
|
+
success_rate_target 0.995 # 99.5%
|
|
288
|
+
window 30.days
|
|
289
|
+
end
|
|
290
|
+
end
|
|
291
|
+
```
|
|
292
|
+
|
|
293
|
+
**Metrics emitted:**
|
|
294
|
+
```ruby
|
|
295
|
+
# App-wide availability
|
|
296
|
+
http_requests_total{status="2xx|3xx|4xx|5xx"}
|
|
297
|
+
slo_app_availability{window="30d"} # Calculated SLO
|
|
298
|
+
|
|
299
|
+
# App-wide latency
|
|
300
|
+
http_request_duration_seconds{quantile="0.99"}
|
|
301
|
+
slo_app_latency_p99{window="30d"}
|
|
302
|
+
```
|
|
303
|
+
|
|
304
|
+
### 3.2. Level 2: Service-Level SLO (Per-Service)
|
|
305
|
+
|
|
306
|
+
**Per-service overrides:**
|
|
307
|
+
|
|
308
|
+
```yaml
|
|
309
|
+
# config/slo.yml
|
|
310
|
+
services:
|
|
311
|
+
sidekiq:
|
|
312
|
+
default:
|
|
313
|
+
success_rate_target: 0.995 # 99.5%
|
|
314
|
+
window: 30d
|
|
315
|
+
|
|
316
|
+
# Override for critical jobs
|
|
317
|
+
jobs:
|
|
318
|
+
PaymentProcessingJob:
|
|
319
|
+
success_rate_target: 0.9999 # 99.99% (critical!)
|
|
320
|
+
alert_on_single_failure: true
|
|
321
|
+
|
|
322
|
+
EmailNotificationJob:
|
|
323
|
+
success_rate_target: 0.95 # 95% (non-critical)
|
|
324
|
+
latency: null # No latency SLO
|
|
325
|
+
```
|
|
326
|
+
|
|
327
|
+
### 3.3. Level 3: Per-Endpoint SLO (Controller#Action)
|
|
328
|
+
|
|
329
|
+
**Most granular level:**
|
|
330
|
+
|
|
331
|
+
```yaml
|
|
332
|
+
# config/slo.yml
|
|
333
|
+
endpoints:
|
|
334
|
+
# CRITICAL endpoints (99.99%)
|
|
335
|
+
- name: "Health Check"
|
|
336
|
+
pattern: "GET /healthcheck"
|
|
337
|
+
controller: "HealthController"
|
|
338
|
+
action: "index"
|
|
339
|
+
slo:
|
|
340
|
+
availability_target: 0.9999 # 99.99%
|
|
341
|
+
latency: null # No latency SLO for healthcheck
|
|
342
|
+
window: 30d
|
|
343
|
+
|
|
344
|
+
# HIGH priority endpoints (99.9%)
|
|
345
|
+
- name: "Create Order"
|
|
346
|
+
pattern: "POST /api/orders"
|
|
347
|
+
controller: "Api::OrdersController"
|
|
348
|
+
action: "create"
|
|
349
|
+
slo:
|
|
350
|
+
availability_target: 0.999 # 99.9%
|
|
351
|
+
latency_p99_target: 500 # 500ms p99
|
|
352
|
+
latency_p95_target: 300 # 300ms p95 (optional)
|
|
353
|
+
window: 30d
|
|
354
|
+
|
|
355
|
+
# Multi-burn rate alert config
|
|
356
|
+
burn_rate_alerts:
|
|
357
|
+
fast:
|
|
358
|
+
enabled: true
|
|
359
|
+
window: 1h
|
|
360
|
+
threshold: 14.4 # 2% budget in 1h
|
|
361
|
+
alert_after: 5m
|
|
362
|
+
medium:
|
|
363
|
+
enabled: true
|
|
364
|
+
window: 6h
|
|
365
|
+
threshold: 6.0 # 5% budget in 6h
|
|
366
|
+
alert_after: 30m
|
|
367
|
+
slow:
|
|
368
|
+
enabled: true
|
|
369
|
+
window: 3d
|
|
370
|
+
threshold: 1.0 # 10% budget in 3d
|
|
371
|
+
alert_after: 6h
|
|
372
|
+
|
|
373
|
+
# SLOW endpoints (99.9% but higher latency acceptable)
|
|
374
|
+
- name: "Generate Report"
|
|
375
|
+
pattern: "POST /admin/reports"
|
|
376
|
+
controller: "Admin::ReportsController"
|
|
377
|
+
action: "create"
|
|
378
|
+
slo:
|
|
379
|
+
availability_target: 0.999 # 99.9%
|
|
380
|
+
latency_p99_target: 5000 # 5s (slow, but acceptable)
|
|
381
|
+
window: 30d
|
|
382
|
+
|
|
383
|
+
# LOW priority endpoints (99%)
|
|
384
|
+
- name: "Admin Dashboard"
|
|
385
|
+
pattern: "GET /admin/dashboard"
|
|
386
|
+
controller: "Admin::DashboardController"
|
|
387
|
+
action: "index"
|
|
388
|
+
slo:
|
|
389
|
+
availability_target: 0.99 # 99% (less critical)
|
|
390
|
+
latency: null
|
|
391
|
+
window: 30d
|
|
392
|
+
|
|
393
|
+
# NO SLO (exclude from tracking)
|
|
394
|
+
- name: "Development Tools"
|
|
395
|
+
pattern: "GET /rails/info/*"
|
|
396
|
+
slo: null # No SLO
|
|
397
|
+
```
|
|
398
|
+
|
|
399
|
+
---
|
|
400
|
+
|
|
401
|
+
## 4. Per-Endpoint SLO Configuration
|
|
402
|
+
|
|
403
|
+
### 4.1. Complete slo.yml Schema with All Options
|
|
404
|
+
|
|
405
|
+
```yaml
|
|
406
|
+
# config/slo.yml
|
|
407
|
+
#
|
|
408
|
+
# E11y SLO Configuration
|
|
409
|
+
#
|
|
410
|
+
# This file defines Service Level Objectives for your application at multiple levels:
|
|
411
|
+
# 1. App-wide defaults (fallback for unconfigured endpoints)
|
|
412
|
+
# 2. Endpoint-specific SLOs (per controller#action)
|
|
413
|
+
# 3. Service-specific SLOs (Sidekiq, ActiveJob)
|
|
414
|
+
#
|
|
415
|
+
# Validation:
|
|
416
|
+
# $ bundle exec rake e11y:slo:validate
|
|
417
|
+
# $ bundle exec rake e11y:slo:unconfigured
|
|
418
|
+
#
|
|
419
|
+
# Documentation: https://github.com/arturseletskiy/e11y/docs/slo-configuration.md
|
|
420
|
+
|
|
421
|
+
version: 1
|
|
422
|
+
|
|
423
|
+
# ============================================================================
|
|
424
|
+
# GLOBAL DEFAULTS
|
|
425
|
+
# ============================================================================
|
|
426
|
+
# Applied to all endpoints unless overridden
|
|
427
|
+
# These are CONSERVATIVE defaults - tune based on your needs
|
|
428
|
+
defaults:
|
|
429
|
+
window: 30d # SLO evaluation window (7d, 30d, 90d)
|
|
430
|
+
|
|
431
|
+
# Availability SLO (required)
|
|
432
|
+
availability:
|
|
433
|
+
enabled: true
|
|
434
|
+
target: 0.999 # 99.9% = 43.2 minutes downtime per month
|
|
435
|
+
|
|
436
|
+
# Latency SLO (optional)
|
|
437
|
+
latency:
|
|
438
|
+
enabled: true
|
|
439
|
+
p99_target: 500 # milliseconds
|
|
440
|
+
p95_target: 300 # milliseconds (optional)
|
|
441
|
+
p50_target: null # median (optional, null = disabled)
|
|
442
|
+
|
|
443
|
+
# Throughput SLO (optional, for high-traffic endpoints)
|
|
444
|
+
throughput:
|
|
445
|
+
enabled: false # Disabled by default
|
|
446
|
+
min_rps: null # Minimum requests per second (null = no minimum)
|
|
447
|
+
max_rps: null # Maximum requests per second (null = no maximum)
|
|
448
|
+
|
|
449
|
+
# Multi-window burn rate alerts (Google SRE recommended)
|
|
450
|
+
burn_rate_alerts:
|
|
451
|
+
fast:
|
|
452
|
+
enabled: true
|
|
453
|
+
window: 1h # Alert window
|
|
454
|
+
threshold: 14.4 # 14.4x burn rate = 2% of 30-day budget in 1h
|
|
455
|
+
alert_after: 5m # Fire alert after 5 minutes
|
|
456
|
+
severity: critical
|
|
457
|
+
medium:
|
|
458
|
+
enabled: true
|
|
459
|
+
window: 6h
|
|
460
|
+
threshold: 6.0 # 6x burn rate = 5% of 30-day budget in 6h
|
|
461
|
+
alert_after: 30m
|
|
462
|
+
severity: warning
|
|
463
|
+
slow:
|
|
464
|
+
enabled: true
|
|
465
|
+
window: 3d
|
|
466
|
+
threshold: 1.0 # 1x burn rate = 10% of 30-day budget in 3d
|
|
467
|
+
alert_after: 6h
|
|
468
|
+
severity: info
|
|
469
|
+
|
|
470
|
+
# ============================================================================
|
|
471
|
+
# ENDPOINT-SPECIFIC SLOs
|
|
472
|
+
# ============================================================================
|
|
473
|
+
# Define SLOs per controller#action
|
|
474
|
+
# Pattern matching supported: "/api/orders/:id", "/users/*"
|
|
475
|
+
endpoints:
|
|
476
|
+
# -------------------------------------------------------------------------
|
|
477
|
+
# CRITICAL ENDPOINTS (99.99% availability)
|
|
478
|
+
# -------------------------------------------------------------------------
|
|
479
|
+
- name: "Health Check"
|
|
480
|
+
description: "K8s liveness/readiness probe"
|
|
481
|
+
pattern: "GET /healthcheck"
|
|
482
|
+
controller: "HealthController"
|
|
483
|
+
action: "index"
|
|
484
|
+
tags:
|
|
485
|
+
- critical
|
|
486
|
+
- infrastructure
|
|
487
|
+
slo:
|
|
488
|
+
window: 30d
|
|
489
|
+
availability:
|
|
490
|
+
enabled: true
|
|
491
|
+
target: 0.9999 # 99.99% = 4.32 minutes downtime per month
|
|
492
|
+
latency:
|
|
493
|
+
enabled: false # No latency SLO for healthcheck (should be instant)
|
|
494
|
+
throughput:
|
|
495
|
+
enabled: false
|
|
496
|
+
burn_rate_alerts:
|
|
497
|
+
fast:
|
|
498
|
+
enabled: true
|
|
499
|
+
threshold: 14.4
|
|
500
|
+
alert_after: 2m # Override: faster alert for critical endpoint
|
|
501
|
+
|
|
502
|
+
# -------------------------------------------------------------------------
|
|
503
|
+
# HIGH PRIORITY ENDPOINTS (99.9% availability + strict latency)
|
|
504
|
+
# -------------------------------------------------------------------------
|
|
505
|
+
- name: "Create Order"
|
|
506
|
+
description: "Primary checkout flow"
|
|
507
|
+
pattern: "POST /api/orders"
|
|
508
|
+
controller: "Api::OrdersController"
|
|
509
|
+
action: "create"
|
|
510
|
+
tags:
|
|
511
|
+
- high_priority
|
|
512
|
+
- revenue_critical
|
|
513
|
+
- customer_facing
|
|
514
|
+
slo:
|
|
515
|
+
window: 30d
|
|
516
|
+
availability:
|
|
517
|
+
enabled: true
|
|
518
|
+
target: 0.999 # 99.9%
|
|
519
|
+
latency:
|
|
520
|
+
enabled: true
|
|
521
|
+
p99_target: 500 # 500ms p99
|
|
522
|
+
p95_target: 300 # 300ms p95
|
|
523
|
+
p50_target: 150 # 150ms p50 (median)
|
|
524
|
+
throughput:
|
|
525
|
+
enabled: true
|
|
526
|
+
min_rps: 10 # Must handle at least 10 req/sec
|
|
527
|
+
max_rps: 1000 # Alert if exceeds 1000 req/sec (potential attack)
|
|
528
|
+
burn_rate_alerts:
|
|
529
|
+
fast:
|
|
530
|
+
enabled: true
|
|
531
|
+
threshold: 14.4
|
|
532
|
+
alert_after: 5m
|
|
533
|
+
medium:
|
|
534
|
+
enabled: true
|
|
535
|
+
threshold: 6.0
|
|
536
|
+
alert_after: 30m
|
|
537
|
+
slow:
|
|
538
|
+
enabled: true
|
|
539
|
+
threshold: 1.0
|
|
540
|
+
alert_after: 6h
|
|
541
|
+
|
|
542
|
+
- name: "List Orders"
|
|
543
|
+
description: "Customer order history"
|
|
544
|
+
pattern: "GET /api/orders"
|
|
545
|
+
controller: "Api::OrdersController"
|
|
546
|
+
action: "index"
|
|
547
|
+
tags:
|
|
548
|
+
- high_priority
|
|
549
|
+
- customer_facing
|
|
550
|
+
slo:
|
|
551
|
+
window: 30d
|
|
552
|
+
availability:
|
|
553
|
+
enabled: true
|
|
554
|
+
target: 0.999
|
|
555
|
+
latency:
|
|
556
|
+
enabled: true
|
|
557
|
+
p99_target: 1000 # 1s p99 (list can be slower)
|
|
558
|
+
p95_target: 500
|
|
559
|
+
throughput:
|
|
560
|
+
enabled: false
|
|
561
|
+
|
|
562
|
+
- name: "Payment Processing"
|
|
563
|
+
description: "Stripe payment capture"
|
|
564
|
+
pattern: "POST /api/payments"
|
|
565
|
+
controller: "Api::PaymentsController"
|
|
566
|
+
action: "create"
|
|
567
|
+
tags:
|
|
568
|
+
- critical
|
|
569
|
+
- revenue_critical
|
|
570
|
+
- third_party_dependent
|
|
571
|
+
slo:
|
|
572
|
+
window: 30d
|
|
573
|
+
availability:
|
|
574
|
+
enabled: true
|
|
575
|
+
target: 0.999
|
|
576
|
+
latency:
|
|
577
|
+
enabled: true
|
|
578
|
+
p99_target: 2000 # 2s p99 (external API call)
|
|
579
|
+
p95_target: 1000
|
|
580
|
+
throughput:
|
|
581
|
+
enabled: true
|
|
582
|
+
min_rps: 1
|
|
583
|
+
max_rps: 100
|
|
584
|
+
burn_rate_alerts:
|
|
585
|
+
fast:
|
|
586
|
+
enabled: true
|
|
587
|
+
threshold: 10.0 # Override: more lenient for third-party dependency
|
|
588
|
+
alert_after: 10m
|
|
589
|
+
|
|
590
|
+
# -------------------------------------------------------------------------
|
|
591
|
+
# SLOW ENDPOINTS (99.9% availability + relaxed latency)
|
|
592
|
+
# -------------------------------------------------------------------------
|
|
593
|
+
- name: "Generate Report"
|
|
594
|
+
description: "Admin analytics report generation"
|
|
595
|
+
pattern: "POST /admin/reports"
|
|
596
|
+
controller: "Admin::ReportsController"
|
|
597
|
+
action: "create"
|
|
598
|
+
tags:
|
|
599
|
+
- admin
|
|
600
|
+
- slow_operation
|
|
601
|
+
- batch_processing
|
|
602
|
+
slo:
|
|
603
|
+
window: 30d
|
|
604
|
+
availability:
|
|
605
|
+
enabled: true
|
|
606
|
+
target: 0.999
|
|
607
|
+
latency:
|
|
608
|
+
enabled: true
|
|
609
|
+
p99_target: 30000 # 30s p99 (slow, but acceptable for reports)
|
|
610
|
+
p95_target: 20000 # 20s p95
|
|
611
|
+
throughput:
|
|
612
|
+
enabled: false
|
|
613
|
+
burn_rate_alerts:
|
|
614
|
+
fast:
|
|
615
|
+
enabled: false # Disable fast burn for slow operations
|
|
616
|
+
medium:
|
|
617
|
+
enabled: true
|
|
618
|
+
threshold: 6.0
|
|
619
|
+
alert_after: 1h
|
|
620
|
+
|
|
621
|
+
- name: "Export Data"
|
|
622
|
+
description: "CSV/Excel export"
|
|
623
|
+
pattern: "POST /admin/exports"
|
|
624
|
+
controller: "Admin::ExportsController"
|
|
625
|
+
action: "create"
|
|
626
|
+
tags:
|
|
627
|
+
- admin
|
|
628
|
+
- slow_operation
|
|
629
|
+
slo:
|
|
630
|
+
window: 30d
|
|
631
|
+
availability:
|
|
632
|
+
enabled: true
|
|
633
|
+
target: 0.99 # 99% (less critical)
|
|
634
|
+
latency:
|
|
635
|
+
enabled: true
|
|
636
|
+
p99_target: 60000 # 60s p99 (very slow, but acceptable)
|
|
637
|
+
throughput:
|
|
638
|
+
enabled: false
|
|
639
|
+
|
|
640
|
+
# -------------------------------------------------------------------------
|
|
641
|
+
# LOW PRIORITY ENDPOINTS (99% availability + no latency SLO)
|
|
642
|
+
# -------------------------------------------------------------------------
|
|
643
|
+
- name: "Admin Dashboard"
|
|
644
|
+
description: "Internal admin dashboard"
|
|
645
|
+
pattern: "GET /admin/dashboard"
|
|
646
|
+
controller: "Admin::DashboardController"
|
|
647
|
+
action: "index"
|
|
648
|
+
tags:
|
|
649
|
+
- admin
|
|
650
|
+
- low_priority
|
|
651
|
+
slo:
|
|
652
|
+
window: 30d
|
|
653
|
+
availability:
|
|
654
|
+
enabled: true
|
|
655
|
+
target: 0.99 # 99%
|
|
656
|
+
latency:
|
|
657
|
+
enabled: false # No latency SLO for admin
|
|
658
|
+
throughput:
|
|
659
|
+
enabled: false
|
|
660
|
+
burn_rate_alerts:
|
|
661
|
+
fast:
|
|
662
|
+
enabled: false
|
|
663
|
+
medium:
|
|
664
|
+
enabled: false
|
|
665
|
+
slow:
|
|
666
|
+
enabled: true # Only slow burn
|
|
667
|
+
threshold: 2.0
|
|
668
|
+
alert_after: 12h
|
|
669
|
+
|
|
670
|
+
# -------------------------------------------------------------------------
|
|
671
|
+
# HIGH THROUGHPUT ENDPOINTS (throughput-focused)
|
|
672
|
+
# -------------------------------------------------------------------------
|
|
673
|
+
- name: "Metrics Ingestion"
|
|
674
|
+
description: "Telemetry data ingestion endpoint"
|
|
675
|
+
pattern: "POST /api/metrics"
|
|
676
|
+
controller: "Api::MetricsController"
|
|
677
|
+
action: "create"
|
|
678
|
+
tags:
|
|
679
|
+
- high_throughput
|
|
680
|
+
- telemetry
|
|
681
|
+
slo:
|
|
682
|
+
window: 30d
|
|
683
|
+
availability:
|
|
684
|
+
enabled: true
|
|
685
|
+
target: 0.99 # 99% (can tolerate some drops)
|
|
686
|
+
latency:
|
|
687
|
+
enabled: true
|
|
688
|
+
p99_target: 100 # Fast ingestion required
|
|
689
|
+
throughput:
|
|
690
|
+
enabled: true
|
|
691
|
+
min_rps: 100 # Must handle 100+ req/sec
|
|
692
|
+
max_rps: 10000 # Alert if exceeds 10k req/sec
|
|
693
|
+
burn_rate_alerts:
|
|
694
|
+
fast:
|
|
695
|
+
enabled: true
|
|
696
|
+
threshold: 20.0 # More lenient for high-throughput
|
|
697
|
+
|
|
698
|
+
# -------------------------------------------------------------------------
|
|
699
|
+
# NO SLO (explicitly excluded)
|
|
700
|
+
# -------------------------------------------------------------------------
|
|
701
|
+
- name: "Development Tools"
|
|
702
|
+
description: "Rails internal routes"
|
|
703
|
+
pattern: "GET /rails/info/*"
|
|
704
|
+
controller: "Rails::InfoController"
|
|
705
|
+
action: "*"
|
|
706
|
+
tags:
|
|
707
|
+
- development
|
|
708
|
+
- excluded
|
|
709
|
+
slo: null # Explicitly no SLO
|
|
710
|
+
|
|
711
|
+
# ============================================================================
|
|
712
|
+
# SERVICE-LEVEL SLOs (Sidekiq, ActiveJob)
|
|
713
|
+
# ============================================================================
|
|
714
|
+
services:
|
|
715
|
+
# ---------------------------------------------------------------------------
|
|
716
|
+
# SIDEKIQ JOBS
|
|
717
|
+
# ---------------------------------------------------------------------------
|
|
718
|
+
sidekiq:
|
|
719
|
+
# Default for all jobs (unless overridden)
|
|
720
|
+
default:
|
|
721
|
+
window: 30d
|
|
722
|
+
success_rate_target: 0.995 # 99.5%
|
|
723
|
+
latency:
|
|
724
|
+
enabled: false # No latency SLO by default for jobs
|
|
725
|
+
throughput:
|
|
726
|
+
enabled: false
|
|
727
|
+
burn_rate_alerts:
|
|
728
|
+
fast:
|
|
729
|
+
enabled: true
|
|
730
|
+
window: 1h
|
|
731
|
+
threshold: 14.4
|
|
732
|
+
alert_after: 10m # Slower alert for jobs
|
|
733
|
+
medium:
|
|
734
|
+
enabled: true
|
|
735
|
+
window: 6h
|
|
736
|
+
threshold: 6.0
|
|
737
|
+
alert_after: 1h
|
|
738
|
+
slow:
|
|
739
|
+
enabled: true
|
|
740
|
+
window: 3d
|
|
741
|
+
threshold: 1.0
|
|
742
|
+
alert_after: 12h
|
|
743
|
+
|
|
744
|
+
# Per-job overrides
|
|
745
|
+
jobs:
|
|
746
|
+
PaymentProcessingJob:
|
|
747
|
+
window: 30d
|
|
748
|
+
success_rate_target: 0.9999 # 99.99% (critical!)
|
|
749
|
+
latency:
|
|
750
|
+
enabled: true
|
|
751
|
+
p99_target: 5000 # 5s p99
|
|
752
|
+
alert_on_single_failure: true # Alert on any failure
|
|
753
|
+
burn_rate_alerts:
|
|
754
|
+
fast:
|
|
755
|
+
enabled: true
|
|
756
|
+
threshold: 10.0
|
|
757
|
+
alert_after: 5m
|
|
758
|
+
|
|
759
|
+
EmailNotificationJob:
|
|
760
|
+
window: 30d
|
|
761
|
+
success_rate_target: 0.95 # 95% (non-critical, can retry)
|
|
762
|
+
latency:
|
|
763
|
+
enabled: false
|
|
764
|
+
burn_rate_alerts:
|
|
765
|
+
fast:
|
|
766
|
+
enabled: false
|
|
767
|
+
medium:
|
|
768
|
+
enabled: false
|
|
769
|
+
slow:
|
|
770
|
+
enabled: true
|
|
771
|
+
|
|
772
|
+
ReportGenerationJob:
|
|
773
|
+
window: 30d
|
|
774
|
+
success_rate_target: 0.99
|
|
775
|
+
latency:
|
|
776
|
+
enabled: true
|
|
777
|
+
p99_target: 300000 # 5 minutes
|
|
778
|
+
throughput:
|
|
779
|
+
enabled: true
|
|
780
|
+
max_jobs_per_hour: 100 # Rate limit
|
|
781
|
+
|
|
782
|
+
# ---------------------------------------------------------------------------
|
|
783
|
+
# ACTIVEJOB
|
|
784
|
+
# ---------------------------------------------------------------------------
|
|
785
|
+
activejob:
|
|
786
|
+
default:
|
|
787
|
+
window: 30d
|
|
788
|
+
success_rate_target: 0.995
|
|
789
|
+
latency:
|
|
790
|
+
enabled: false
|
|
791
|
+
throughput:
|
|
792
|
+
enabled: false
|
|
793
|
+
burn_rate_alerts:
|
|
794
|
+
fast:
|
|
795
|
+
enabled: true
|
|
796
|
+
window: 1h
|
|
797
|
+
threshold: 14.4
|
|
798
|
+
alert_after: 10m
|
|
799
|
+
|
|
800
|
+
# ============================================================================
|
|
801
|
+
# APP-WIDE FALLBACK (Zero-Config)
|
|
802
|
+
# ============================================================================
|
|
803
|
+
# Used for endpoints/jobs without specific configuration
|
|
804
|
+
app_wide:
|
|
805
|
+
http:
|
|
806
|
+
window: 30d
|
|
807
|
+
availability:
|
|
808
|
+
enabled: true
|
|
809
|
+
target: 0.999 # 99.9%
|
|
810
|
+
latency:
|
|
811
|
+
enabled: true
|
|
812
|
+
p99_target: 500
|
|
813
|
+
throughput:
|
|
814
|
+
enabled: false
|
|
815
|
+
burn_rate_alerts:
|
|
816
|
+
fast:
|
|
817
|
+
enabled: true
|
|
818
|
+
window: 1h
|
|
819
|
+
threshold: 14.4
|
|
820
|
+
alert_after: 5m
|
|
821
|
+
medium:
|
|
822
|
+
enabled: true
|
|
823
|
+
window: 6h
|
|
824
|
+
threshold: 6.0
|
|
825
|
+
alert_after: 30m
|
|
826
|
+
slow:
|
|
827
|
+
enabled: true
|
|
828
|
+
window: 3d
|
|
829
|
+
threshold: 1.0
|
|
830
|
+
alert_after: 6h
|
|
831
|
+
|
|
832
|
+
sidekiq:
|
|
833
|
+
window: 30d
|
|
834
|
+
success_rate_target: 0.995
|
|
835
|
+
burn_rate_alerts:
|
|
836
|
+
fast:
|
|
837
|
+
enabled: true
|
|
838
|
+
window: 1h
|
|
839
|
+
threshold: 14.4
|
|
840
|
+
alert_after: 10m
|
|
841
|
+
|
|
842
|
+
activejob:
|
|
843
|
+
window: 30d
|
|
844
|
+
success_rate_target: 0.995
|
|
845
|
+
burn_rate_alerts:
|
|
846
|
+
fast:
|
|
847
|
+
enabled: true
|
|
848
|
+
window: 1h
|
|
849
|
+
threshold: 14.4
|
|
850
|
+
alert_after: 10m
|
|
851
|
+
|
|
852
|
+
# ============================================================================
|
|
853
|
+
# ADVANCED OPTIONS
|
|
854
|
+
# ============================================================================
|
|
855
|
+
advanced:
|
|
856
|
+
# Error budget alerts (percentage thresholds)
|
|
857
|
+
error_budget_alerts:
|
|
858
|
+
enabled: true
|
|
859
|
+
thresholds: [50, 80, 90, 100] # Alert at 50%, 80%, 90%, 100% consumed
|
|
860
|
+
notify:
|
|
861
|
+
slack: true
|
|
862
|
+
pagerduty: false
|
|
863
|
+
email: true
|
|
864
|
+
|
|
865
|
+
# Deployment gate (block deploys if error budget low)
|
|
866
|
+
deployment_gate:
|
|
867
|
+
enabled: false # Disabled by default (use with caution!)
|
|
868
|
+
minimum_budget_percent: 20 # Need 20%+ budget to deploy
|
|
869
|
+
critical_endpoints_only: true # Only check critical endpoints
|
|
870
|
+
override_label: "deploy:emergency" # GitHub label to override
|
|
871
|
+
|
|
872
|
+
# Auto-scaling based on SLO
|
|
873
|
+
autoscaling:
|
|
874
|
+
enabled: false # Future feature
|
|
875
|
+
scale_up_on_burn_rate: 10.0
|
|
876
|
+
scale_down_on_budget_surplus: 0.5
|
|
877
|
+
|
|
878
|
+
# SLO dashboard links
|
|
879
|
+
dashboards:
|
|
880
|
+
grafana_base_url: "https://grafana.example.com/d/e11y-slo"
|
|
881
|
+
per_endpoint_template: "https://grafana.example.com/d/e11y-slo-endpoint?var-controller={controller}&var-action={action}"
|
|
882
|
+
|
|
883
|
+
# Runbook links
|
|
884
|
+
runbooks:
|
|
885
|
+
base_url: "https://wiki.example.com/runbooks"
|
|
886
|
+
fast_burn_template: "{base_url}/fast-burn-{controller}-{action}"
|
|
887
|
+
medium_burn_template: "{base_url}/medium-burn-{controller}-{action}"
|
|
888
|
+
```
|
|
889
|
+
|
|
890
|
+
### 4.2. SLO Config Loader (Full Implementation)
|
|
891
|
+
|
|
892
|
+
```ruby
|
|
893
|
+
# lib/e11y/slo/config_loader.rb
|
|
894
|
+
module E11y
|
|
895
|
+
module SLO
|
|
896
|
+
class ConfigLoader
|
|
897
|
+
class << self
|
|
898
|
+
# Load and validate slo.yml
|
|
899
|
+
#
|
|
900
|
+
# @raise [ConfigNotFoundError] if slo.yml doesn't exist and strict mode enabled
|
|
901
|
+
# @raise [ConfigValidationError] if validation fails
|
|
902
|
+
# @return [Config] validated configuration
|
|
903
|
+
def load!(strict: false)
|
|
904
|
+
config_path = find_config_path
|
|
905
|
+
|
|
906
|
+
unless config_path
|
|
907
|
+
return handle_missing_config(strict)
|
|
908
|
+
end
|
|
909
|
+
|
|
910
|
+
raw_config = load_yaml(config_path)
|
|
911
|
+
config = Config.new(raw_config, config_path)
|
|
912
|
+
|
|
913
|
+
# Validate config against real routes/jobs
|
|
914
|
+
validator = ConfigValidator.new(config)
|
|
915
|
+
validation_result = validator.validate!
|
|
916
|
+
|
|
917
|
+
if validation_result.errors.any?
|
|
918
|
+
handle_validation_errors(validation_result, strict)
|
|
919
|
+
end
|
|
920
|
+
|
|
921
|
+
if validation_result.warnings.any?
|
|
922
|
+
log_warnings(validation_result.warnings)
|
|
923
|
+
end
|
|
924
|
+
|
|
925
|
+
E11y.logger.info("Loaded SLO config: #{config.summary}")
|
|
926
|
+
config
|
|
927
|
+
rescue Errno::ENOENT => error
|
|
928
|
+
raise ConfigNotFoundError, "slo.yml not found: #{error.message}"
|
|
929
|
+
rescue Psych::SyntaxError => error
|
|
930
|
+
raise ConfigValidationError, "Invalid YAML in slo.yml: #{error.message}"
|
|
931
|
+
end
|
|
932
|
+
|
|
933
|
+
# Reload config (for development/hot-reload)
|
|
934
|
+
def reload!
|
|
935
|
+
@cached_config = nil
|
|
936
|
+
load!
|
|
937
|
+
end
|
|
938
|
+
|
|
939
|
+
# Get cached config (singleton)
|
|
940
|
+
def config
|
|
941
|
+
@cached_config ||= load!
|
|
942
|
+
end
|
|
943
|
+
|
|
944
|
+
private
|
|
945
|
+
|
|
946
|
+
def find_config_path
|
|
947
|
+
# Priority:
|
|
948
|
+
# 1. ENV['E11Y_SLO_CONFIG']
|
|
949
|
+
# 2. Rails.root/config/slo.yml
|
|
950
|
+
# 3. Rails.root/config/e11y/slo.yml
|
|
951
|
+
|
|
952
|
+
if ENV['E11Y_SLO_CONFIG']
|
|
953
|
+
path = Pathname.new(ENV['E11Y_SLO_CONFIG'])
|
|
954
|
+
return path if path.exist?
|
|
955
|
+
end
|
|
956
|
+
|
|
957
|
+
if defined?(Rails)
|
|
958
|
+
[
|
|
959
|
+
Rails.root.join('config', 'slo.yml'),
|
|
960
|
+
Rails.root.join('config', 'e11y', 'slo.yml')
|
|
961
|
+
].find(&:exist?)
|
|
962
|
+
else
|
|
963
|
+
nil
|
|
964
|
+
end
|
|
965
|
+
end
|
|
966
|
+
|
|
967
|
+
def load_yaml(path)
|
|
968
|
+
content = File.read(path)
|
|
969
|
+
|
|
970
|
+
# Support ERB in YAML (for environment-specific config)
|
|
971
|
+
if content.include?('<%')
|
|
972
|
+
require 'erb'
|
|
973
|
+
content = ERB.new(content).result
|
|
974
|
+
end
|
|
975
|
+
|
|
976
|
+
YAML.safe_load(content, permitted_classes: [Symbol], aliases: true)
|
|
977
|
+
end
|
|
978
|
+
|
|
979
|
+
def handle_missing_config(strict)
|
|
980
|
+
if strict
|
|
981
|
+
raise ConfigNotFoundError, "slo.yml not found and strict mode enabled"
|
|
982
|
+
else
|
|
983
|
+
E11y.logger.warn("slo.yml not found, using zero-config defaults")
|
|
984
|
+
ZeroConfig.load_defaults
|
|
985
|
+
end
|
|
986
|
+
end
|
|
987
|
+
|
|
988
|
+
def handle_validation_errors(result, strict)
|
|
989
|
+
error_msg = "slo.yml validation failed:\n#{result.errors.join("\n")}"
|
|
990
|
+
|
|
991
|
+
if strict || E11y.config.slo.strict_validation
|
|
992
|
+
raise ConfigValidationError, error_msg
|
|
993
|
+
else
|
|
994
|
+
E11y.logger.error(error_msg)
|
|
995
|
+
E11y.logger.warn("Continuing with partial config (strict mode disabled)")
|
|
996
|
+
end
|
|
997
|
+
end
|
|
998
|
+
|
|
999
|
+
def log_warnings(warnings)
|
|
1000
|
+
E11y.logger.warn("SLO config warnings:")
|
|
1001
|
+
warnings.each { |w| E11y.logger.warn(" - #{w}") }
|
|
1002
|
+
end
|
|
1003
|
+
end
|
|
1004
|
+
end
|
|
1005
|
+
|
|
1006
|
+
class Config
|
|
1007
|
+
attr_reader :version, :defaults, :endpoints, :services, :app_wide, :advanced
|
|
1008
|
+
attr_reader :config_path
|
|
1009
|
+
|
|
1010
|
+
def initialize(raw_config, config_path = nil)
|
|
1011
|
+
@raw_config = raw_config
|
|
1012
|
+
@config_path = config_path
|
|
1013
|
+
@version = raw_config['version'] || 1
|
|
1014
|
+
@defaults = normalize_slo_config(raw_config['defaults'] || {})
|
|
1015
|
+
@endpoints = (raw_config['endpoints'] || []).map { |ep| normalize_endpoint(ep) }
|
|
1016
|
+
@services = raw_config['services'] || {}
|
|
1017
|
+
@app_wide = raw_config['app_wide'] || {}
|
|
1018
|
+
@advanced = raw_config['advanced'] || {}
|
|
1019
|
+
|
|
1020
|
+
# Build lookup indices for fast resolution
|
|
1021
|
+
build_indices!
|
|
1022
|
+
end
|
|
1023
|
+
|
|
1024
|
+
# Resolve SLO for specific controller#action
|
|
1025
|
+
#
|
|
1026
|
+
# @param controller [String] Controller name
|
|
1027
|
+
# @param action [String] Action name
|
|
1028
|
+
# @return [Hash] SLO configuration
|
|
1029
|
+
def resolve_endpoint_slo(controller, action)
|
|
1030
|
+
key = "#{controller}##{action}"
|
|
1031
|
+
|
|
1032
|
+
# Check cache first
|
|
1033
|
+
if @endpoint_index[key]
|
|
1034
|
+
return @endpoint_index[key]
|
|
1035
|
+
end
|
|
1036
|
+
|
|
1037
|
+
# Fallback to app-wide HTTP defaults
|
|
1038
|
+
fallback = deep_merge(@defaults, @app_wide.dig('http') || {})
|
|
1039
|
+
|
|
1040
|
+
E11y.logger.debug("No SLO config for #{key}, using app-wide defaults")
|
|
1041
|
+
fallback
|
|
1042
|
+
end
|
|
1043
|
+
|
|
1044
|
+
# Resolve SLO for Sidekiq job
|
|
1045
|
+
#
|
|
1046
|
+
# @param job_class [String] Job class name
|
|
1047
|
+
# @return [Hash] SLO configuration
|
|
1048
|
+
def resolve_job_slo(job_class)
|
|
1049
|
+
# Check per-job config
|
|
1050
|
+
job_config = @services.dig('sidekiq', 'jobs', job_class)
|
|
1051
|
+
|
|
1052
|
+
if job_config
|
|
1053
|
+
default_config = @services.dig('sidekiq', 'default') || {}
|
|
1054
|
+
return deep_merge(default_config, job_config)
|
|
1055
|
+
end
|
|
1056
|
+
|
|
1057
|
+
# Fallback to Sidekiq default
|
|
1058
|
+
@services.dig('sidekiq', 'default') || @app_wide.dig('sidekiq') || {}
|
|
1059
|
+
end
|
|
1060
|
+
|
|
1061
|
+
# Resolve SLO for ActiveJob
|
|
1062
|
+
#
|
|
1063
|
+
# @param job_class [String] Job class name
|
|
1064
|
+
# @return [Hash] SLO configuration
|
|
1065
|
+
def resolve_activejob_slo(job_class)
|
|
1066
|
+
job_config = @services.dig('activejob', 'jobs', job_class)
|
|
1067
|
+
|
|
1068
|
+
if job_config
|
|
1069
|
+
default_config = @services.dig('activejob', 'default') || {}
|
|
1070
|
+
return deep_merge(default_config, job_config)
|
|
1071
|
+
end
|
|
1072
|
+
|
|
1073
|
+
@services.dig('activejob', 'default') || @app_wide.dig('activejob') || {}
|
|
1074
|
+
end
|
|
1075
|
+
|
|
1076
|
+
# Get all endpoints with specific tag
|
|
1077
|
+
#
|
|
1078
|
+
# @param tag [String] Tag to filter by
|
|
1079
|
+
# @return [Array<Hash>] Endpoints with tag
|
|
1080
|
+
def endpoints_with_tag(tag)
|
|
1081
|
+
@endpoints.select { |ep| ep['tags']&.include?(tag) }
|
|
1082
|
+
end
|
|
1083
|
+
|
|
1084
|
+
# Get critical endpoints (for deployment gate)
|
|
1085
|
+
#
|
|
1086
|
+
# @return [Array<Hash>] Endpoints with availability >= 0.999
|
|
1087
|
+
def critical_endpoints
|
|
1088
|
+
@endpoints.select do |ep|
|
|
1089
|
+
slo = ep['slo']
|
|
1090
|
+
next false unless slo
|
|
1091
|
+
|
|
1092
|
+
availability = slo.dig('availability', 'target') || slo['availability_target']
|
|
1093
|
+
availability && availability >= 0.999
|
|
1094
|
+
end
|
|
1095
|
+
end
|
|
1096
|
+
|
|
1097
|
+
# Summary for logging
|
|
1098
|
+
#
|
|
1099
|
+
# @return [String] Config summary
|
|
1100
|
+
def summary
|
|
1101
|
+
"#{@endpoints.size} endpoints, " \
|
|
1102
|
+
"#{@services.dig('sidekiq', 'jobs')&.size || 0} Sidekiq jobs, " \
|
|
1103
|
+
"version #{@version}"
|
|
1104
|
+
end
|
|
1105
|
+
|
|
1106
|
+
# Convert to hash (for serialization)
|
|
1107
|
+
def to_h
|
|
1108
|
+
@raw_config
|
|
1109
|
+
end
|
|
1110
|
+
|
|
1111
|
+
private
|
|
1112
|
+
|
|
1113
|
+
def build_indices!
|
|
1114
|
+
# Build fast lookup index: "Controller#action" => SLO config
|
|
1115
|
+
@endpoint_index = {}
|
|
1116
|
+
|
|
1117
|
+
@endpoints.each do |endpoint|
|
|
1118
|
+
controller = endpoint['controller']
|
|
1119
|
+
action = endpoint['action']
|
|
1120
|
+
slo = endpoint['slo']
|
|
1121
|
+
|
|
1122
|
+
next unless controller && action && slo
|
|
1123
|
+
|
|
1124
|
+
key = "#{controller}##{action}"
|
|
1125
|
+
@endpoint_index[key] = deep_merge(@defaults, slo)
|
|
1126
|
+
end
|
|
1127
|
+
end
|
|
1128
|
+
|
|
1129
|
+
def normalize_endpoint(endpoint)
|
|
1130
|
+
# Convert old format to new format
|
|
1131
|
+
slo = endpoint['slo']
|
|
1132
|
+
return endpoint unless slo
|
|
1133
|
+
|
|
1134
|
+
# Convert flat structure to nested
|
|
1135
|
+
if slo['availability_target'] && !slo.dig('availability', 'target')
|
|
1136
|
+
slo['availability'] = {
|
|
1137
|
+
'enabled' => true,
|
|
1138
|
+
'target' => slo.delete('availability_target')
|
|
1139
|
+
}
|
|
1140
|
+
end
|
|
1141
|
+
|
|
1142
|
+
if slo['latency_p99_target'] && !slo.dig('latency', 'p99_target')
|
|
1143
|
+
slo['latency'] = {
|
|
1144
|
+
'enabled' => true,
|
|
1145
|
+
'p99_target' => slo.delete('latency_p99_target'),
|
|
1146
|
+
'p95_target' => slo.delete('latency_p95_target')
|
|
1147
|
+
}
|
|
1148
|
+
end
|
|
1149
|
+
|
|
1150
|
+
endpoint
|
|
1151
|
+
end
|
|
1152
|
+
|
|
1153
|
+
def normalize_slo_config(config)
|
|
1154
|
+
# Ensure nested structure
|
|
1155
|
+
normalized = config.dup
|
|
1156
|
+
|
|
1157
|
+
if config['availability_target']
|
|
1158
|
+
normalized['availability'] = {
|
|
1159
|
+
'enabled' => true,
|
|
1160
|
+
'target' => config['availability_target']
|
|
1161
|
+
}
|
|
1162
|
+
normalized.delete('availability_target')
|
|
1163
|
+
end
|
|
1164
|
+
|
|
1165
|
+
if config['latency_p99_target']
|
|
1166
|
+
normalized['latency'] = {
|
|
1167
|
+
'enabled' => true,
|
|
1168
|
+
'p99_target' => config['latency_p99_target'],
|
|
1169
|
+
'p95_target' => config['latency_p95_target']
|
|
1170
|
+
}
|
|
1171
|
+
normalized.delete('latency_p99_target')
|
|
1172
|
+
normalized.delete('latency_p95_target')
|
|
1173
|
+
end
|
|
1174
|
+
|
|
1175
|
+
normalized
|
|
1176
|
+
end
|
|
1177
|
+
|
|
1178
|
+
def deep_merge(hash1, hash2)
|
|
1179
|
+
hash1 = hash1.dup
|
|
1180
|
+
hash2.each do |key, value|
|
|
1181
|
+
if hash1[key].is_a?(Hash) && value.is_a?(Hash)
|
|
1182
|
+
hash1[key] = deep_merge(hash1[key], value)
|
|
1183
|
+
else
|
|
1184
|
+
hash1[key] = value
|
|
1185
|
+
end
|
|
1186
|
+
end
|
|
1187
|
+
hash1
|
|
1188
|
+
end
|
|
1189
|
+
end
|
|
1190
|
+
|
|
1191
|
+
# Zero-config defaults (no slo.yml)
|
|
1192
|
+
class ZeroConfig
|
|
1193
|
+
def self.load_defaults
|
|
1194
|
+
Config.new({
|
|
1195
|
+
'version' => 1,
|
|
1196
|
+
'defaults' => {
|
|
1197
|
+
'window' => '30d',
|
|
1198
|
+
'availability' => { 'enabled' => true, 'target' => 0.999 },
|
|
1199
|
+
'latency' => { 'enabled' => true, 'p99_target' => 500 },
|
|
1200
|
+
'throughput' => { 'enabled' => false }
|
|
1201
|
+
},
|
|
1202
|
+
'endpoints' => [],
|
|
1203
|
+
'services' => {
|
|
1204
|
+
'sidekiq' => { 'default' => { 'success_rate_target' => 0.995 } },
|
|
1205
|
+
'activejob' => { 'default' => { 'success_rate_target' => 0.995 } }
|
|
1206
|
+
},
|
|
1207
|
+
'app_wide' => {
|
|
1208
|
+
'http' => {
|
|
1209
|
+
'availability' => { 'enabled' => true, 'target' => 0.999 },
|
|
1210
|
+
'latency' => { 'enabled' => true, 'p99_target' => 500 }
|
|
1211
|
+
},
|
|
1212
|
+
'sidekiq' => { 'success_rate_target' => 0.995 },
|
|
1213
|
+
'activejob' => { 'success_rate_target' => 0.995 }
|
|
1214
|
+
}
|
|
1215
|
+
})
|
|
1216
|
+
end
|
|
1217
|
+
end
|
|
1218
|
+
|
|
1219
|
+
# Custom errors
|
|
1220
|
+
class ConfigNotFoundError < StandardError; end
|
|
1221
|
+
class ConfigValidationError < StandardError; end
|
|
1222
|
+
end
|
|
1223
|
+
end
|
|
1224
|
+
```
|
|
1225
|
+
|
|
1226
|
+
---
|
|
1227
|
+
|
|
1228
|
+
## 5. Multi-Window Multi-Burn Rate Alerts
|
|
1229
|
+
|
|
1230
|
+
### 5.1. Why Multi-Window? (Google SRE Best Practice)
|
|
1231
|
+
|
|
1232
|
+
**Problem with Single Window:**
|
|
1233
|
+
```
|
|
1234
|
+
Single 30-day window:
|
|
1235
|
+
- Slow reaction (hours to detect)
|
|
1236
|
+
- Hard to distinguish acute vs chronic issues
|
|
1237
|
+
|
|
1238
|
+
Single 5-minute window:
|
|
1239
|
+
- Fast reaction
|
|
1240
|
+
- High false positive rate (noise)
|
|
1241
|
+
```
|
|
1242
|
+
|
|
1243
|
+
**Solution: Multi-Window Multi-Burn Rate:**
|
|
1244
|
+
```
|
|
1245
|
+
3 windows simultaneously:
|
|
1246
|
+
- 1 hour: Fast burn (acute issue, page immediately)
|
|
1247
|
+
- 6 hours: Medium burn (developing issue, warn team)
|
|
1248
|
+
- 3 days: Slow burn (chronic issue, investigate)
|
|
1249
|
+
```
|
|
1250
|
+
|
|
1251
|
+
### 5.2. Burn Rate Calculation
|
|
1252
|
+
|
|
1253
|
+
**Formula:**
|
|
1254
|
+
```
|
|
1255
|
+
Burn Rate = (Actual Error Rate) / (Error Budget per Hour)
|
|
1256
|
+
|
|
1257
|
+
For 99.9% SLO (30-day window):
|
|
1258
|
+
- Error Budget = 0.1% = 0.001
|
|
1259
|
+
- Error Budget per Hour = 0.001 / (30 * 24) = 0.00000139
|
|
1260
|
+
|
|
1261
|
+
Fast Burn (1h window):
|
|
1262
|
+
- Threshold = 14.4x burn rate
|
|
1263
|
+
- Means: consuming 2% of 30-day budget in 1 hour
|
|
1264
|
+
- Alert fires in 5 minutes
|
|
1265
|
+
|
|
1266
|
+
Medium Burn (6h window):
|
|
1267
|
+
- Threshold = 6.0x burn rate
|
|
1268
|
+
- Means: consuming 5% of 30-day budget in 6 hours
|
|
1269
|
+
- Alert fires in 30 minutes
|
|
1270
|
+
|
|
1271
|
+
Slow Burn (3d window):
|
|
1272
|
+
- Threshold = 1.0x burn rate
|
|
1273
|
+
- Means: consuming 10% of 30-day budget in 3 days
|
|
1274
|
+
- Alert fires in 6 hours
|
|
1275
|
+
```
|
|
1276
|
+
|
|
1277
|
+
### 5.3. Prometheus Alert Rules (Per-Endpoint!)
|
|
1278
|
+
|
|
1279
|
+
```yaml
|
|
1280
|
+
# prometheus/alerts/e11y_slo_per_endpoint.yml
|
|
1281
|
+
groups:
|
|
1282
|
+
- name: e11y_slo_per_endpoint
|
|
1283
|
+
interval: 30s # Check every 30 seconds
|
|
1284
|
+
rules:
|
|
1285
|
+
# ===== FAST BURN (1h window, 5 min alert) =====
|
|
1286
|
+
- alert: E11ySLOFastBurn_CreateOrder
|
|
1287
|
+
expr: |
|
|
1288
|
+
(
|
|
1289
|
+
# Error rate in last 1 hour
|
|
1290
|
+
sum(rate(http_requests_total{
|
|
1291
|
+
controller="Api::OrdersController",
|
|
1292
|
+
action="create",
|
|
1293
|
+
status=~"5.."
|
|
1294
|
+
}[1h]))
|
|
1295
|
+
/
|
|
1296
|
+
sum(rate(http_requests_total{
|
|
1297
|
+
controller="Api::OrdersController",
|
|
1298
|
+
action="create"
|
|
1299
|
+
}[1h]))
|
|
1300
|
+
)
|
|
1301
|
+
/
|
|
1302
|
+
# Error budget per hour (0.001 / 720 hours)
|
|
1303
|
+
0.00000139
|
|
1304
|
+
> 14.4 # 14.4x burn rate = 2% of 30-day budget in 1h
|
|
1305
|
+
for: 5m # Alert after 5 minutes
|
|
1306
|
+
labels:
|
|
1307
|
+
severity: critical
|
|
1308
|
+
endpoint: "POST /api/orders"
|
|
1309
|
+
controller: "Api::OrdersController"
|
|
1310
|
+
action: "create"
|
|
1311
|
+
burn_window: "1h"
|
|
1312
|
+
annotations:
|
|
1313
|
+
summary: "CRITICAL: Fast burn on {{ $labels.endpoint }}"
|
|
1314
|
+
description: |
|
|
1315
|
+
Error rate is 14.4x higher than sustainable rate.
|
|
1316
|
+
Burning 2% of 30-day error budget in 1 hour.
|
|
1317
|
+
Current burn rate: {{ $value | humanize }}x
|
|
1318
|
+
|
|
1319
|
+
Impact: Will exhaust error budget in {{ div 720 $value | humanize }} hours
|
|
1320
|
+
|
|
1321
|
+
Dashboard: https://grafana/d/e11y-slo?var-endpoint=orders_create
|
|
1322
|
+
Runbook: https://wiki/runbooks/fast-burn-orders
|
|
1323
|
+
|
|
1324
|
+
# ===== MEDIUM BURN (6h window, 30 min alert) =====
|
|
1325
|
+
- alert: E11ySLOMediumBurn_CreateOrder
|
|
1326
|
+
expr: |
|
|
1327
|
+
(
|
|
1328
|
+
sum(rate(http_requests_total{
|
|
1329
|
+
controller="Api::OrdersController",
|
|
1330
|
+
action="create",
|
|
1331
|
+
status=~"5.."
|
|
1332
|
+
}[6h]))
|
|
1333
|
+
/
|
|
1334
|
+
sum(rate(http_requests_total{
|
|
1335
|
+
controller="Api::OrdersController",
|
|
1336
|
+
action="create"
|
|
1337
|
+
}[6h]))
|
|
1338
|
+
)
|
|
1339
|
+
/
|
|
1340
|
+
0.00000139
|
|
1341
|
+
> 6.0 # 6x burn rate = 5% of 30-day budget in 6h
|
|
1342
|
+
for: 30m # Alert after 30 minutes
|
|
1343
|
+
labels:
|
|
1344
|
+
severity: warning
|
|
1345
|
+
endpoint: "POST /api/orders"
|
|
1346
|
+
controller: "Api::OrdersController"
|
|
1347
|
+
action: "create"
|
|
1348
|
+
burn_window: "6h"
|
|
1349
|
+
annotations:
|
|
1350
|
+
summary: "WARNING: Medium burn on {{ $labels.endpoint }}"
|
|
1351
|
+
description: |
|
|
1352
|
+
Error rate is 6x higher than sustainable rate.
|
|
1353
|
+
Burning 5% of 30-day error budget in 6 hours.
|
|
1354
|
+
Current burn rate: {{ $value | humanize }}x
|
|
1355
|
+
|
|
1356
|
+
# ===== SLOW BURN (3d window, 6h alert) =====
|
|
1357
|
+
- alert: E11ySLOSlowBurn_CreateOrder
|
|
1358
|
+
expr: |
|
|
1359
|
+
(
|
|
1360
|
+
sum(rate(http_requests_total{
|
|
1361
|
+
controller="Api::OrdersController",
|
|
1362
|
+
action="create",
|
|
1363
|
+
status=~"5.."
|
|
1364
|
+
}[3d]))
|
|
1365
|
+
/
|
|
1366
|
+
sum(rate(http_requests_total{
|
|
1367
|
+
controller="Api::OrdersController",
|
|
1368
|
+
action="create"
|
|
1369
|
+
}[3d]))
|
|
1370
|
+
)
|
|
1371
|
+
/
|
|
1372
|
+
0.00000139
|
|
1373
|
+
> 1.0 # 1x burn rate = 10% of 30-day budget in 3 days
|
|
1374
|
+
for: 6h # Alert after 6 hours
|
|
1375
|
+
labels:
|
|
1376
|
+
severity: info
|
|
1377
|
+
endpoint: "POST /api/orders"
|
|
1378
|
+
controller: "Api::OrdersController"
|
|
1379
|
+
action: "create"
|
|
1380
|
+
burn_window: "3d"
|
|
1381
|
+
annotations:
|
|
1382
|
+
summary: "INFO: Slow burn on {{ $labels.endpoint }}"
|
|
1383
|
+
description: |
|
|
1384
|
+
Chronic issue: consuming error budget at steady rate.
|
|
1385
|
+
Burning 10% of 30-day error budget in 3 days.
|
|
1386
|
+
|
|
1387
|
+
This is a trend, not an emergency. Investigate root cause.
|
|
1388
|
+
|
|
1389
|
+
# ===== LATENCY SLO (optional per endpoint) =====
|
|
1390
|
+
- alert: E11ySLOLatency_CreateOrder
|
|
1391
|
+
expr: |
|
|
1392
|
+
histogram_quantile(0.99,
|
|
1393
|
+
sum(rate(http_request_duration_seconds_bucket{
|
|
1394
|
+
controller="Api::OrdersController",
|
|
1395
|
+
action="create"
|
|
1396
|
+
}[5m])) by (le)
|
|
1397
|
+
) > 0.5 # 500ms p99 threshold
|
|
1398
|
+
for: 5m
|
|
1399
|
+
labels:
|
|
1400
|
+
severity: warning
|
|
1401
|
+
endpoint: "POST /api/orders"
|
|
1402
|
+
slo_type: "latency_p99"
|
|
1403
|
+
annotations:
|
|
1404
|
+
summary: "Latency SLO violation: {{ $labels.endpoint }}"
|
|
1405
|
+
description: "P99 latency is {{ $value | humanize }}s (threshold: 500ms)"
|
|
1406
|
+
```
|
|
1407
|
+
|
|
1408
|
+
---
|
|
1409
|
+
|
|
1410
|
+
## 6. SLO Config Validation & Linting
|
|
1411
|
+
|
|
1412
|
+
### 6.1. Config Validator (Full Implementation with Edge Cases)
|
|
1413
|
+
|
|
1414
|
+
```ruby
|
|
1415
|
+
# lib/e11y/slo/config_validator.rb
|
|
1416
|
+
module E11y
|
|
1417
|
+
module SLO
|
|
1418
|
+
class ConfigValidator
|
|
1419
|
+
def initialize(config)
|
|
1420
|
+
@config = config
|
|
1421
|
+
@errors = []
|
|
1422
|
+
@warnings = []
|
|
1423
|
+
@routes_cache = nil
|
|
1424
|
+
@jobs_cache = nil
|
|
1425
|
+
end
|
|
1426
|
+
|
|
1427
|
+
def validate!
|
|
1428
|
+
validate_version
|
|
1429
|
+
validate_schema_structure
|
|
1430
|
+
validate_endpoints_against_routes
|
|
1431
|
+
validate_jobs_against_sidekiq
|
|
1432
|
+
validate_slo_targets
|
|
1433
|
+
validate_burn_rate_config
|
|
1434
|
+
validate_throughput_config
|
|
1435
|
+
validate_advanced_options
|
|
1436
|
+
check_for_conflicts
|
|
1437
|
+
|
|
1438
|
+
ValidationResult.new(@errors, @warnings)
|
|
1439
|
+
end
|
|
1440
|
+
|
|
1441
|
+
private
|
|
1442
|
+
|
|
1443
|
+
# =====================================================================
|
|
1444
|
+
# VERSION VALIDATION
|
|
1445
|
+
# =====================================================================
|
|
1446
|
+
|
|
1447
|
+
def validate_version
|
|
1448
|
+
version = @config.version
|
|
1449
|
+
|
|
1450
|
+
if version.nil?
|
|
1451
|
+
@warnings << "No version specified, assuming version 1"
|
|
1452
|
+
elsif !version.is_a?(Integer) || version < 1
|
|
1453
|
+
@errors << "Invalid version: #{version} (must be integer >= 1)"
|
|
1454
|
+
elsif version > 1
|
|
1455
|
+
@warnings << "Version #{version} detected (current supported: 1)"
|
|
1456
|
+
end
|
|
1457
|
+
end
|
|
1458
|
+
|
|
1459
|
+
# =====================================================================
|
|
1460
|
+
# SCHEMA STRUCTURE VALIDATION
|
|
1461
|
+
# =====================================================================
|
|
1462
|
+
|
|
1463
|
+
def validate_schema_structure
|
|
1464
|
+
# Check required top-level keys
|
|
1465
|
+
unless @config.to_h.key?('defaults') || @config.to_h.key?('app_wide')
|
|
1466
|
+
@errors << "Missing both 'defaults' and 'app_wide' - at least one required"
|
|
1467
|
+
end
|
|
1468
|
+
|
|
1469
|
+
# Check for typos in top-level keys
|
|
1470
|
+
valid_keys = %w[version defaults endpoints services app_wide advanced]
|
|
1471
|
+
invalid_keys = @config.to_h.keys - valid_keys
|
|
1472
|
+
|
|
1473
|
+
if invalid_keys.any?
|
|
1474
|
+
@warnings << "Unknown top-level keys: #{invalid_keys.join(', ')} (typo?)"
|
|
1475
|
+
end
|
|
1476
|
+
end
|
|
1477
|
+
|
|
1478
|
+
# =====================================================================
|
|
1479
|
+
# ENDPOINTS VALIDATION
|
|
1480
|
+
# =====================================================================
|
|
1481
|
+
|
|
1482
|
+
def validate_endpoints_against_routes
|
|
1483
|
+
# EDGE CASE: Rails might not be loaded (e.g., in tests, rake tasks)
|
|
1484
|
+
unless defined?(Rails) && Rails.application
|
|
1485
|
+
@warnings << "Rails not loaded, skipping route validation"
|
|
1486
|
+
return
|
|
1487
|
+
end
|
|
1488
|
+
|
|
1489
|
+
# EDGE CASE: Routes might not be loaded yet (e.g., in initializers)
|
|
1490
|
+
begin
|
|
1491
|
+
real_routes = fetch_real_routes
|
|
1492
|
+
rescue => error
|
|
1493
|
+
@warnings << "Could not load routes: #{error.message}"
|
|
1494
|
+
return
|
|
1495
|
+
end
|
|
1496
|
+
|
|
1497
|
+
if real_routes.empty?
|
|
1498
|
+
@warnings << "No routes found (routes not loaded yet?)"
|
|
1499
|
+
return
|
|
1500
|
+
end
|
|
1501
|
+
|
|
1502
|
+
# Check each endpoint in slo.yml
|
|
1503
|
+
@config.endpoints.each do |endpoint|
|
|
1504
|
+
controller = endpoint['controller']
|
|
1505
|
+
action = endpoint['action']
|
|
1506
|
+
|
|
1507
|
+
# EDGE CASE: Wildcard action (e.g., "Rails::InfoController" => "*")
|
|
1508
|
+
if action == '*'
|
|
1509
|
+
# Just check controller exists
|
|
1510
|
+
if real_routes.none? { |r| r[:controller] == controller }
|
|
1511
|
+
@errors << "Controller not found: #{controller}"
|
|
1512
|
+
end
|
|
1513
|
+
next
|
|
1514
|
+
end
|
|
1515
|
+
|
|
1516
|
+
# Find matching route
|
|
1517
|
+
matching_route = real_routes.find do |route|
|
|
1518
|
+
route[:controller] == controller && route[:action] == action
|
|
1519
|
+
end
|
|
1520
|
+
|
|
1521
|
+
if matching_route.nil?
|
|
1522
|
+
@errors << "Endpoint not found in routes: #{controller}##{action}"
|
|
1523
|
+
else
|
|
1524
|
+
# Validate pattern matches real path
|
|
1525
|
+
expected_pattern = endpoint['pattern']
|
|
1526
|
+
|
|
1527
|
+
if expected_pattern
|
|
1528
|
+
actual_path = matching_route[:path]
|
|
1529
|
+
|
|
1530
|
+
unless paths_match?(expected_pattern, actual_path)
|
|
1531
|
+
@warnings << "Pattern mismatch: '#{expected_pattern}' vs actual '#{actual_path}' (#{controller}##{action})"
|
|
1532
|
+
end
|
|
1533
|
+
end
|
|
1534
|
+
end
|
|
1535
|
+
end
|
|
1536
|
+
|
|
1537
|
+
# Check for routes WITHOUT SLO config (warning)
|
|
1538
|
+
unconfigured_count = 0
|
|
1539
|
+
|
|
1540
|
+
real_routes.each do |route|
|
|
1541
|
+
next if route[:controller].nil? || route[:action].nil?
|
|
1542
|
+
|
|
1543
|
+
# EDGE CASE: Skip internal Rails routes
|
|
1544
|
+
next if route[:controller].start_with?('rails/')
|
|
1545
|
+
next if route[:controller].start_with?('action_mailbox/')
|
|
1546
|
+
next if route[:controller].start_with?('active_storage/')
|
|
1547
|
+
|
|
1548
|
+
# EDGE CASE: Skip mounted engines (e.g., Sidekiq::Web)
|
|
1549
|
+
next if route[:controller].include?('::') && !route[:controller].start_with?('Api::')
|
|
1550
|
+
|
|
1551
|
+
has_config = @config.endpoints.any? do |ep|
|
|
1552
|
+
ep['controller'] == route[:controller] && ep['action'] == route[:action]
|
|
1553
|
+
end
|
|
1554
|
+
|
|
1555
|
+
unless has_config
|
|
1556
|
+
unconfigured_count += 1
|
|
1557
|
+
|
|
1558
|
+
# Only warn about first 10 unconfigured routes (avoid spam)
|
|
1559
|
+
if unconfigured_count <= 10
|
|
1560
|
+
@warnings << "Route without SLO config: #{route[:controller]}##{route[:action]} (using app-wide defaults)"
|
|
1561
|
+
end
|
|
1562
|
+
end
|
|
1563
|
+
end
|
|
1564
|
+
|
|
1565
|
+
if unconfigured_count > 10
|
|
1566
|
+
@warnings << "... and #{unconfigured_count - 10} more unconfigured routes"
|
|
1567
|
+
end
|
|
1568
|
+
end
|
|
1569
|
+
|
|
1570
|
+
def fetch_real_routes
|
|
1571
|
+
# OPTIMIZATION: Cache routes (expensive operation)
|
|
1572
|
+
@routes_cache ||= begin
|
|
1573
|
+
Rails.application.routes.routes.map do |route|
|
|
1574
|
+
{
|
|
1575
|
+
verb: route.verb,
|
|
1576
|
+
path: route.path.spec.to_s.gsub(/\(.*?\)/, ''), # Remove optional params
|
|
1577
|
+
controller: route.defaults[:controller],
|
|
1578
|
+
action: route.defaults[:action]
|
|
1579
|
+
}
|
|
1580
|
+
end.compact
|
|
1581
|
+
end
|
|
1582
|
+
end
|
|
1583
|
+
|
|
1584
|
+
def validate_jobs_against_sidekiq
|
|
1585
|
+
# EDGE CASE: Sidekiq might not be loaded
|
|
1586
|
+
unless defined?(Sidekiq)
|
|
1587
|
+
@warnings << "Sidekiq not loaded, skipping job validation"
|
|
1588
|
+
return
|
|
1589
|
+
end
|
|
1590
|
+
|
|
1591
|
+
# Get all Sidekiq job classes
|
|
1592
|
+
real_jobs = fetch_real_jobs
|
|
1593
|
+
|
|
1594
|
+
if real_jobs.empty?
|
|
1595
|
+
@warnings << "No Sidekiq jobs found (not loaded yet?)"
|
|
1596
|
+
end
|
|
1597
|
+
|
|
1598
|
+
# Check each job in slo.yml
|
|
1599
|
+
configured_jobs = @config.services.dig('sidekiq', 'jobs') || {}
|
|
1600
|
+
configured_jobs.each_key do |job_class|
|
|
1601
|
+
unless real_jobs.include?(job_class)
|
|
1602
|
+
@errors << "Sidekiq job not found: #{job_class} (typo or not loaded?)"
|
|
1603
|
+
end
|
|
1604
|
+
end
|
|
1605
|
+
|
|
1606
|
+
# Check for jobs WITHOUT SLO config (warning)
|
|
1607
|
+
unconfigured_count = 0
|
|
1608
|
+
|
|
1609
|
+
real_jobs.each do |job_class|
|
|
1610
|
+
# EDGE CASE: Skip internal Sidekiq jobs
|
|
1611
|
+
next if job_class.start_with?('Sidekiq::')
|
|
1612
|
+
|
|
1613
|
+
unless configured_jobs.key?(job_class)
|
|
1614
|
+
unconfigured_count += 1
|
|
1615
|
+
|
|
1616
|
+
# Only warn about first 5 unconfigured jobs
|
|
1617
|
+
if unconfigured_count <= 5
|
|
1618
|
+
@warnings << "Sidekiq job without SLO config: #{job_class} (using default)"
|
|
1619
|
+
end
|
|
1620
|
+
end
|
|
1621
|
+
end
|
|
1622
|
+
|
|
1623
|
+
if unconfigured_count > 5
|
|
1624
|
+
@warnings << "... and #{unconfigured_count - 5} more unconfigured Sidekiq jobs"
|
|
1625
|
+
end
|
|
1626
|
+
end
|
|
1627
|
+
|
|
1628
|
+
def fetch_real_jobs
|
|
1629
|
+
# OPTIMIZATION: Cache job classes (expensive operation)
|
|
1630
|
+
@jobs_cache ||= begin
|
|
1631
|
+
jobs = []
|
|
1632
|
+
|
|
1633
|
+
# Method 1: ObjectSpace (slow but comprehensive)
|
|
1634
|
+
ObjectSpace.each_object(Class) do |klass|
|
|
1635
|
+
begin
|
|
1636
|
+
jobs << klass.name if klass < Sidekiq::Worker && klass.name
|
|
1637
|
+
rescue
|
|
1638
|
+
# Ignore anonymous classes
|
|
1639
|
+
end
|
|
1640
|
+
end
|
|
1641
|
+
|
|
1642
|
+
# EDGE CASE: If ObjectSpace didn't find jobs (e.g., in production with eager_load: false)
|
|
1643
|
+
# Method 2: Scan app/jobs directory
|
|
1644
|
+
if jobs.empty? && defined?(Rails)
|
|
1645
|
+
job_files = Dir[Rails.root.join('app', 'jobs', '**', '*_job.rb')]
|
|
1646
|
+
jobs = job_files.map do |file|
|
|
1647
|
+
File.basename(file, '.rb').camelize
|
|
1648
|
+
end
|
|
1649
|
+
end
|
|
1650
|
+
|
|
1651
|
+
jobs.uniq.sort
|
|
1652
|
+
end
|
|
1653
|
+
end
|
|
1654
|
+
|
|
1655
|
+
# =====================================================================
|
|
1656
|
+
# SLO TARGETS VALIDATION
|
|
1657
|
+
# =====================================================================
|
|
1658
|
+
|
|
1659
|
+
def validate_slo_targets
|
|
1660
|
+
@config.endpoints.each do |endpoint|
|
|
1661
|
+
slo = endpoint['slo']
|
|
1662
|
+
next if slo.nil? # No SLO = valid (opt-out)
|
|
1663
|
+
|
|
1664
|
+
endpoint_name = endpoint['name'] || "#{endpoint['controller']}##{endpoint['action']}"
|
|
1665
|
+
|
|
1666
|
+
# Validate availability target (new nested format)
|
|
1667
|
+
availability_config = slo['availability']
|
|
1668
|
+
if availability_config.is_a?(Hash)
|
|
1669
|
+
availability = availability_config['target']
|
|
1670
|
+
|
|
1671
|
+
if availability
|
|
1672
|
+
if availability < 0 || availability > 1
|
|
1673
|
+
@errors << "Invalid availability.target for #{endpoint_name}: #{availability} (must be 0.0-1.0)"
|
|
1674
|
+
end
|
|
1675
|
+
|
|
1676
|
+
# EDGE CASE: Unrealistically high SLO
|
|
1677
|
+
if availability > 0.9999
|
|
1678
|
+
@warnings << "Very high SLO for #{endpoint_name}: #{(availability * 100).round(2)}% (99.99%+) - verify this is intentional (cost/complexity)"
|
|
1679
|
+
end
|
|
1680
|
+
|
|
1681
|
+
# EDGE CASE: Suspiciously low SLO
|
|
1682
|
+
if availability < 0.9
|
|
1683
|
+
@warnings << "Low SLO for #{endpoint_name}: #{(availability * 100).round(1)}% (<90%) - is this intentional?"
|
|
1684
|
+
end
|
|
1685
|
+
end
|
|
1686
|
+
end
|
|
1687
|
+
|
|
1688
|
+
# Validate latency target (optional, new nested format)
|
|
1689
|
+
latency_config = slo['latency']
|
|
1690
|
+
if latency_config.is_a?(Hash) && latency_config['enabled']
|
|
1691
|
+
p99 = latency_config['p99_target']
|
|
1692
|
+
p95 = latency_config['p95_target']
|
|
1693
|
+
p50 = latency_config['p50_target']
|
|
1694
|
+
|
|
1695
|
+
if p99
|
|
1696
|
+
if p99 <= 0
|
|
1697
|
+
@errors << "Invalid latency.p99_target for #{endpoint_name}: #{p99} (must be > 0)"
|
|
1698
|
+
elsif p99 > 60000
|
|
1699
|
+
@warnings << "Very high latency target for #{endpoint_name}: #{p99}ms (>60s) - consider async processing"
|
|
1700
|
+
end
|
|
1701
|
+
end
|
|
1702
|
+
|
|
1703
|
+
# EDGE CASE: p95 > p99 (impossible)
|
|
1704
|
+
if p95 && p99 && p95 > p99
|
|
1705
|
+
@errors << "Invalid latency targets for #{endpoint_name}: p95 (#{p95}ms) > p99 (#{p99}ms)"
|
|
1706
|
+
end
|
|
1707
|
+
|
|
1708
|
+
# EDGE CASE: p50 > p95 (impossible)
|
|
1709
|
+
if p50 && p95 && p50 > p95
|
|
1710
|
+
@errors << "Invalid latency targets for #{endpoint_name}: p50 (#{p50}ms) > p95 (#{p95}ms)"
|
|
1711
|
+
end
|
|
1712
|
+
end
|
|
1713
|
+
|
|
1714
|
+
# Validate window
|
|
1715
|
+
window = slo['window']
|
|
1716
|
+
if window && !valid_window?(window)
|
|
1717
|
+
@errors << "Invalid window for #{endpoint_name}: '#{window}' (use format: 7d, 30d, 90d)"
|
|
1718
|
+
end
|
|
1719
|
+
|
|
1720
|
+
# EDGE CASE: Window too short (burn rate alerts won't work well)
|
|
1721
|
+
if window && valid_window?(window)
|
|
1722
|
+
days = parse_window_days(window)
|
|
1723
|
+
if days && days < 7
|
|
1724
|
+
@warnings << "Short window for #{endpoint_name}: #{window} (<7d) - burn rate alerts may be noisy"
|
|
1725
|
+
end
|
|
1726
|
+
end
|
|
1727
|
+
end
|
|
1728
|
+
end
|
|
1729
|
+
|
|
1730
|
+
# =====================================================================
|
|
1731
|
+
# THROUGHPUT VALIDATION
|
|
1732
|
+
# =====================================================================
|
|
1733
|
+
|
|
1734
|
+
def validate_throughput_config
|
|
1735
|
+
@config.endpoints.each do |endpoint|
|
|
1736
|
+
slo = endpoint['slo']
|
|
1737
|
+
next unless slo
|
|
1738
|
+
|
|
1739
|
+
throughput_config = slo['throughput']
|
|
1740
|
+
next unless throughput_config.is_a?(Hash) && throughput_config['enabled']
|
|
1741
|
+
|
|
1742
|
+
endpoint_name = endpoint['name'] || "#{endpoint['controller']}##{endpoint['action']}"
|
|
1743
|
+
min_rps = throughput_config['min_rps']
|
|
1744
|
+
max_rps = throughput_config['max_rps']
|
|
1745
|
+
|
|
1746
|
+
# Validate min_rps
|
|
1747
|
+
if min_rps && min_rps <= 0
|
|
1748
|
+
@errors << "Invalid throughput.min_rps for #{endpoint_name}: #{min_rps} (must be > 0)"
|
|
1749
|
+
end
|
|
1750
|
+
|
|
1751
|
+
# Validate max_rps
|
|
1752
|
+
if max_rps && max_rps <= 0
|
|
1753
|
+
@errors << "Invalid throughput.max_rps for #{endpoint_name}: #{max_rps} (must be > 0)"
|
|
1754
|
+
end
|
|
1755
|
+
|
|
1756
|
+
# EDGE CASE: min_rps > max_rps (impossible)
|
|
1757
|
+
if min_rps && max_rps && min_rps > max_rps
|
|
1758
|
+
@errors << "Invalid throughput for #{endpoint_name}: min_rps (#{min_rps}) > max_rps (#{max_rps})"
|
|
1759
|
+
end
|
|
1760
|
+
|
|
1761
|
+
# EDGE CASE: Very high throughput (performance warning)
|
|
1762
|
+
if max_rps && max_rps > 10000
|
|
1763
|
+
@warnings << "Very high throughput target for #{endpoint_name}: #{max_rps} req/s - verify infrastructure can handle this"
|
|
1764
|
+
end
|
|
1765
|
+
end
|
|
1766
|
+
end
|
|
1767
|
+
|
|
1768
|
+
# =====================================================================
|
|
1769
|
+
# BURN RATE ALERTS VALIDATION
|
|
1770
|
+
# =====================================================================
|
|
1771
|
+
|
|
1772
|
+
def validate_burn_rate_config
|
|
1773
|
+
@config.endpoints.each do |endpoint|
|
|
1774
|
+
burn_rate = endpoint.dig('slo', 'burn_rate_alerts')
|
|
1775
|
+
next unless burn_rate
|
|
1776
|
+
|
|
1777
|
+
endpoint_name = endpoint['name'] || "#{endpoint['controller']}##{endpoint['action']}"
|
|
1778
|
+
|
|
1779
|
+
['fast', 'medium', 'slow'].each do |level|
|
|
1780
|
+
config = burn_rate[level]
|
|
1781
|
+
next unless config && config['enabled']
|
|
1782
|
+
|
|
1783
|
+
# Validate threshold
|
|
1784
|
+
threshold = config['threshold']
|
|
1785
|
+
if threshold
|
|
1786
|
+
if threshold <= 0
|
|
1787
|
+
@errors << "Invalid burn_rate_alerts.#{level}.threshold for #{endpoint_name}: #{threshold} (must be > 0)"
|
|
1788
|
+
end
|
|
1789
|
+
|
|
1790
|
+
# EDGE CASE: Suspiciously high burn rate (will never alert)
|
|
1791
|
+
if threshold > 100
|
|
1792
|
+
@warnings << "Very high burn rate threshold for #{endpoint_name}/#{level}: #{threshold}x (will rarely alert)"
|
|
1793
|
+
end
|
|
1794
|
+
end
|
|
1795
|
+
|
|
1796
|
+
# Validate alert_after
|
|
1797
|
+
alert_after = config['alert_after']
|
|
1798
|
+
if alert_after
|
|
1799
|
+
unless valid_duration?(alert_after)
|
|
1800
|
+
@errors << "Invalid burn_rate_alerts.#{level}.alert_after for #{endpoint_name}: '#{alert_after}' (use format: 5m, 30m, 6h)"
|
|
1801
|
+
end
|
|
1802
|
+
|
|
1803
|
+
# EDGE CASE: alert_after longer than window (will never alert)
|
|
1804
|
+
window = config['window']
|
|
1805
|
+
if window && valid_window?(window) && valid_duration?(alert_after)
|
|
1806
|
+
window_seconds = parse_duration_seconds(window)
|
|
1807
|
+
alert_seconds = parse_duration_seconds(alert_after)
|
|
1808
|
+
|
|
1809
|
+
if alert_seconds > window_seconds
|
|
1810
|
+
@errors << "Invalid burn_rate_alerts.#{level} for #{endpoint_name}: alert_after (#{alert_after}) > window (#{window})"
|
|
1811
|
+
end
|
|
1812
|
+
end
|
|
1813
|
+
end
|
|
1814
|
+
|
|
1815
|
+
# Validate window
|
|
1816
|
+
window = config['window']
|
|
1817
|
+
if window && !valid_window?(window)
|
|
1818
|
+
@errors << "Invalid burn_rate_alerts.#{level}.window for #{endpoint_name}: '#{window}' (use format: 1h, 6h, 3d)"
|
|
1819
|
+
end
|
|
1820
|
+
end
|
|
1821
|
+
end
|
|
1822
|
+
end
|
|
1823
|
+
|
|
1824
|
+
# =====================================================================
|
|
1825
|
+
# ADVANCED OPTIONS VALIDATION
|
|
1826
|
+
# =====================================================================
|
|
1827
|
+
|
|
1828
|
+
def validate_advanced_options
|
|
1829
|
+
advanced = @config.advanced
|
|
1830
|
+
return unless advanced
|
|
1831
|
+
|
|
1832
|
+
# Validate deployment gate
|
|
1833
|
+
if gate = advanced['deployment_gate']
|
|
1834
|
+
if gate['enabled']
|
|
1835
|
+
min_budget = gate['minimum_budget_percent']
|
|
1836
|
+
if min_budget && (min_budget < 0 || min_budget > 100)
|
|
1837
|
+
@errors << "Invalid deployment_gate.minimum_budget_percent: #{min_budget} (must be 0-100)"
|
|
1838
|
+
end
|
|
1839
|
+
|
|
1840
|
+
# EDGE CASE: Deployment gate enabled without critical endpoints
|
|
1841
|
+
if @config.critical_endpoints.empty?
|
|
1842
|
+
@warnings << "Deployment gate enabled but no critical endpoints defined (availability >= 99.9%)"
|
|
1843
|
+
end
|
|
1844
|
+
end
|
|
1845
|
+
end
|
|
1846
|
+
|
|
1847
|
+
# Validate error budget alerts
|
|
1848
|
+
if budget_alerts = advanced['error_budget_alerts']
|
|
1849
|
+
if thresholds = budget_alerts['thresholds']
|
|
1850
|
+
unless thresholds.is_a?(Array) && thresholds.all? { |t| t.is_a?(Numeric) && t >= 0 && t <= 100 }
|
|
1851
|
+
@errors << "Invalid error_budget_alerts.thresholds: must be array of numbers 0-100"
|
|
1852
|
+
end
|
|
1853
|
+
|
|
1854
|
+
# EDGE CASE: Thresholds not sorted
|
|
1855
|
+
if thresholds != thresholds.sort
|
|
1856
|
+
@warnings << "error_budget_alerts.thresholds should be sorted: #{thresholds.inspect}"
|
|
1857
|
+
end
|
|
1858
|
+
end
|
|
1859
|
+
end
|
|
1860
|
+
end
|
|
1861
|
+
|
|
1862
|
+
# =====================================================================
|
|
1863
|
+
# CONFLICT DETECTION
|
|
1864
|
+
# =====================================================================
|
|
1865
|
+
|
|
1866
|
+
def check_for_conflicts
|
|
1867
|
+
# EDGE CASE: Duplicate endpoint definitions
|
|
1868
|
+
controller_action_pairs = @config.endpoints.map { |ep| [ep['controller'], ep['action']] }
|
|
1869
|
+
duplicates = controller_action_pairs.group_by(&:itself).select { |_, v| v.size > 1 }.keys
|
|
1870
|
+
|
|
1871
|
+
duplicates.each do |controller, action|
|
|
1872
|
+
@errors << "Duplicate endpoint definition: #{controller}##{action}"
|
|
1873
|
+
end
|
|
1874
|
+
|
|
1875
|
+
# EDGE CASE: Conflicting patterns (e.g., "/api/orders" and "/api/orders/:id")
|
|
1876
|
+
patterns = @config.endpoints.map { |ep| ep['pattern'] }.compact
|
|
1877
|
+
patterns.combination(2).each do |p1, p2|
|
|
1878
|
+
if patterns_conflict?(p1, p2)
|
|
1879
|
+
@warnings << "Potentially conflicting patterns: '#{p1}' and '#{p2}'"
|
|
1880
|
+
end
|
|
1881
|
+
end
|
|
1882
|
+
end
|
|
1883
|
+
|
|
1884
|
+
# =====================================================================
|
|
1885
|
+
# HELPER METHODS
|
|
1886
|
+
# =====================================================================
|
|
1887
|
+
|
|
1888
|
+
def paths_match?(pattern, actual)
|
|
1889
|
+
return true unless pattern # No pattern = skip validation
|
|
1890
|
+
|
|
1891
|
+
# Convert pattern to regex
|
|
1892
|
+
regex_pattern = pattern
|
|
1893
|
+
.gsub(':id', '\d+')
|
|
1894
|
+
.gsub(':uuid', '[a-f0-9\-]+')
|
|
1895
|
+
.gsub('*', '.*')
|
|
1896
|
+
.gsub('/', '\/')
|
|
1897
|
+
|
|
1898
|
+
actual =~ /^#{regex_pattern}$/
|
|
1899
|
+
end
|
|
1900
|
+
|
|
1901
|
+
def patterns_conflict?(p1, p2)
|
|
1902
|
+
# Simple heuristic: if one pattern is prefix of another
|
|
1903
|
+
p1.start_with?(p2) || p2.start_with?(p1)
|
|
1904
|
+
end
|
|
1905
|
+
|
|
1906
|
+
def valid_window?(window)
|
|
1907
|
+
window =~ /^\d+[dhm]$/
|
|
1908
|
+
end
|
|
1909
|
+
|
|
1910
|
+
def valid_duration?(duration)
|
|
1911
|
+
duration =~ /^\d+[smhd]$/
|
|
1912
|
+
end
|
|
1913
|
+
|
|
1914
|
+
def parse_window_days(window)
|
|
1915
|
+
case window
|
|
1916
|
+
when /(\d+)d/
|
|
1917
|
+
$1.to_i
|
|
1918
|
+
when /(\d+)h/
|
|
1919
|
+
$1.to_i / 24.0
|
|
1920
|
+
when /(\d+)m/
|
|
1921
|
+
$1.to_i / (24.0 * 60)
|
|
1922
|
+
else
|
|
1923
|
+
nil
|
|
1924
|
+
end
|
|
1925
|
+
end
|
|
1926
|
+
|
|
1927
|
+
def parse_duration_seconds(duration)
|
|
1928
|
+
case duration
|
|
1929
|
+
when /(\d+)d/
|
|
1930
|
+
$1.to_i * 24 * 3600
|
|
1931
|
+
when /(\d+)h/
|
|
1932
|
+
$1.to_i * 3600
|
|
1933
|
+
when /(\d+)m/
|
|
1934
|
+
$1.to_i * 60
|
|
1935
|
+
when /(\d+)s/
|
|
1936
|
+
$1.to_i
|
|
1937
|
+
else
|
|
1938
|
+
0
|
|
1939
|
+
end
|
|
1940
|
+
end
|
|
1941
|
+
end
|
|
1942
|
+
|
|
1943
|
+
class ValidationResult
|
|
1944
|
+
attr_reader :errors, :warnings
|
|
1945
|
+
|
|
1946
|
+
def initialize(errors, warnings)
|
|
1947
|
+
@errors = errors
|
|
1948
|
+
@warnings = warnings
|
|
1949
|
+
end
|
|
1950
|
+
|
|
1951
|
+
def valid?
|
|
1952
|
+
@errors.empty?
|
|
1953
|
+
end
|
|
1954
|
+
|
|
1955
|
+
def report
|
|
1956
|
+
output = []
|
|
1957
|
+
|
|
1958
|
+
if @errors.any?
|
|
1959
|
+
output << "❌ Errors:"
|
|
1960
|
+
@errors.each { |e| output << " - #{e}" }
|
|
1961
|
+
end
|
|
1962
|
+
|
|
1963
|
+
if @warnings.any?
|
|
1964
|
+
output << "⚠️ Warnings:"
|
|
1965
|
+
@warnings.each { |w| output << " - #{w}" }
|
|
1966
|
+
end
|
|
1967
|
+
|
|
1968
|
+
if @errors.empty? && @warnings.empty?
|
|
1969
|
+
output << "✅ No issues found"
|
|
1970
|
+
end
|
|
1971
|
+
|
|
1972
|
+
output.join("\n")
|
|
1973
|
+
end
|
|
1974
|
+
end
|
|
1975
|
+
end
|
|
1976
|
+
end
|
|
1977
|
+
```
|
|
1978
|
+
|
|
1979
|
+
### 6.2. Rake Task for Validation
|
|
1980
|
+
|
|
1981
|
+
```ruby
|
|
1982
|
+
# lib/tasks/e11y_slo.rake
|
|
1983
|
+
namespace :e11y do
|
|
1984
|
+
namespace :slo do
|
|
1985
|
+
desc 'Validate slo.yml against real routes and jobs'
|
|
1986
|
+
task validate: :environment do
|
|
1987
|
+
puts "Validating slo.yml..."
|
|
1988
|
+
puts "=" * 80
|
|
1989
|
+
|
|
1990
|
+
begin
|
|
1991
|
+
config = E11y::SLO::ConfigLoader.load!
|
|
1992
|
+
validator = E11y::SLO::ConfigValidator.new(config)
|
|
1993
|
+
result = validator.validate!
|
|
1994
|
+
|
|
1995
|
+
puts result.report
|
|
1996
|
+
puts "=" * 80
|
|
1997
|
+
|
|
1998
|
+
if result.valid?
|
|
1999
|
+
puts "✅ slo.yml is valid"
|
|
2000
|
+
exit 0
|
|
2001
|
+
else
|
|
2002
|
+
puts "❌ slo.yml validation failed"
|
|
2003
|
+
exit 1
|
|
2004
|
+
end
|
|
2005
|
+
rescue => error
|
|
2006
|
+
puts "💥 Error loading slo.yml: #{error.message}"
|
|
2007
|
+
puts error.backtrace.first(5).join("\n")
|
|
2008
|
+
exit 1
|
|
2009
|
+
end
|
|
2010
|
+
end
|
|
2011
|
+
|
|
2012
|
+
desc 'Show SLO config for specific endpoint'
|
|
2013
|
+
task :show, [:controller, :action] => :environment do |t, args|
|
|
2014
|
+
config = E11y::SLO::ConfigLoader.load!
|
|
2015
|
+
slo = config.resolve_endpoint_slo(args[:controller], args[:action])
|
|
2016
|
+
|
|
2017
|
+
puts "SLO Config for #{args[:controller]}##{args[:action]}:"
|
|
2018
|
+
puts JSON.pretty_generate(slo)
|
|
2019
|
+
end
|
|
2020
|
+
|
|
2021
|
+
desc 'List all endpoints without SLO config'
|
|
2022
|
+
task unconfigured: :environment do
|
|
2023
|
+
config = E11y::SLO::ConfigLoader.load!
|
|
2024
|
+
validator = E11y::SLO::ConfigValidator.new(config)
|
|
2025
|
+
result = validator.validate!
|
|
2026
|
+
|
|
2027
|
+
unconfigured = result.warnings.select { |w| w.include?('without SLO config') }
|
|
2028
|
+
|
|
2029
|
+
if unconfigured.any?
|
|
2030
|
+
puts "Endpoints without SLO config:"
|
|
2031
|
+
unconfigured.each { |w| puts " - #{w}" }
|
|
2032
|
+
else
|
|
2033
|
+
puts "✅ All endpoints have SLO config"
|
|
2034
|
+
end
|
|
2035
|
+
end
|
|
2036
|
+
end
|
|
2037
|
+
end
|
|
2038
|
+
```
|
|
2039
|
+
|
|
2040
|
+
### 6.3. CI/CD Integration
|
|
2041
|
+
|
|
2042
|
+
```yaml
|
|
2043
|
+
# .github/workflows/validate_slo.yml
|
|
2044
|
+
name: Validate SLO Config
|
|
2045
|
+
|
|
2046
|
+
on:
|
|
2047
|
+
pull_request:
|
|
2048
|
+
paths:
|
|
2049
|
+
- 'config/slo.yml'
|
|
2050
|
+
- 'app/controllers/**'
|
|
2051
|
+
- 'app/jobs/**'
|
|
2052
|
+
|
|
2053
|
+
jobs:
|
|
2054
|
+
validate:
|
|
2055
|
+
runs-on: ubuntu-latest
|
|
2056
|
+
steps:
|
|
2057
|
+
- uses: actions/checkout@v3
|
|
2058
|
+
|
|
2059
|
+
- name: Setup Ruby
|
|
2060
|
+
uses: ruby/setup-ruby@v1
|
|
2061
|
+
with:
|
|
2062
|
+
ruby-version: 3.3
|
|
2063
|
+
bundler-cache: true
|
|
2064
|
+
|
|
2065
|
+
- name: Validate SLO config
|
|
2066
|
+
run: |
|
|
2067
|
+
bundle exec rake e11y:slo:validate
|
|
2068
|
+
|
|
2069
|
+
- name: Check for unconfigured endpoints
|
|
2070
|
+
run: |
|
|
2071
|
+
bundle exec rake e11y:slo:unconfigured
|
|
2072
|
+
```
|
|
2073
|
+
|
|
2074
|
+
---
|
|
2075
|
+
|
|
2076
|
+
## 6.4. RSpec Testing Examples
|
|
2077
|
+
|
|
2078
|
+
```ruby
|
|
2079
|
+
# spec/lib/e11y/slo/config_loader_spec.rb
|
|
2080
|
+
RSpec.describe E11y::SLO::ConfigLoader do
|
|
2081
|
+
describe '.load!' do
|
|
2082
|
+
context 'when slo.yml exists' do
|
|
2083
|
+
let(:config_path) { Rails.root.join('config', 'slo.yml') }
|
|
2084
|
+
|
|
2085
|
+
before do
|
|
2086
|
+
allow(File).to receive(:read).with(config_path).and_return(<<~YAML)
|
|
2087
|
+
version: 1
|
|
2088
|
+
defaults:
|
|
2089
|
+
window: 30d
|
|
2090
|
+
availability:
|
|
2091
|
+
enabled: true
|
|
2092
|
+
target: 0.999
|
|
2093
|
+
endpoints:
|
|
2094
|
+
- name: "Create Order"
|
|
2095
|
+
controller: "Api::OrdersController"
|
|
2096
|
+
action: "create"
|
|
2097
|
+
slo:
|
|
2098
|
+
availability:
|
|
2099
|
+
target: 0.999
|
|
2100
|
+
YAML
|
|
2101
|
+
end
|
|
2102
|
+
|
|
2103
|
+
it 'loads and validates config' do
|
|
2104
|
+
config = described_class.load!
|
|
2105
|
+
|
|
2106
|
+
expect(config.version).to eq(1)
|
|
2107
|
+
expect(config.endpoints.size).to eq(1)
|
|
2108
|
+
expect(config.endpoints.first['name']).to eq('Create Order')
|
|
2109
|
+
end
|
|
2110
|
+
|
|
2111
|
+
it 'caches config on subsequent calls' do
|
|
2112
|
+
config1 = described_class.config
|
|
2113
|
+
config2 = described_class.config
|
|
2114
|
+
|
|
2115
|
+
expect(config1).to be(config2) # Same object instance
|
|
2116
|
+
end
|
|
2117
|
+
end
|
|
2118
|
+
|
|
2119
|
+
context 'when slo.yml is missing' do
|
|
2120
|
+
before do
|
|
2121
|
+
allow(described_class).to receive(:find_config_path).and_return(nil)
|
|
2122
|
+
end
|
|
2123
|
+
|
|
2124
|
+
it 'loads zero-config defaults in non-strict mode' do
|
|
2125
|
+
config = described_class.load!(strict: false)
|
|
2126
|
+
|
|
2127
|
+
expect(config).to be_a(E11y::SLO::Config)
|
|
2128
|
+
expect(config.endpoints).to be_empty
|
|
2129
|
+
end
|
|
2130
|
+
|
|
2131
|
+
it 'raises error in strict mode' do
|
|
2132
|
+
expect {
|
|
2133
|
+
described_class.load!(strict: true)
|
|
2134
|
+
}.to raise_error(E11y::SLO::ConfigNotFoundError)
|
|
2135
|
+
end
|
|
2136
|
+
end
|
|
2137
|
+
|
|
2138
|
+
context 'when slo.yml has invalid YAML' do
|
|
2139
|
+
let(:config_path) { Rails.root.join('config', 'slo.yml') }
|
|
2140
|
+
|
|
2141
|
+
before do
|
|
2142
|
+
allow(File).to receive(:read).with(config_path).and_return("invalid: yaml: : :")
|
|
2143
|
+
end
|
|
2144
|
+
|
|
2145
|
+
it 'raises ConfigValidationError' do
|
|
2146
|
+
expect {
|
|
2147
|
+
described_class.load!
|
|
2148
|
+
}.to raise_error(E11y::SLO::ConfigValidationError, /Invalid YAML/)
|
|
2149
|
+
end
|
|
2150
|
+
end
|
|
2151
|
+
|
|
2152
|
+
context 'when slo.yml has ERB' do
|
|
2153
|
+
let(:config_path) { Rails.root.join('config', 'slo.yml') }
|
|
2154
|
+
|
|
2155
|
+
before do
|
|
2156
|
+
allow(File).to receive(:read).with(config_path).and_return(<<~YAML)
|
|
2157
|
+
version: 1
|
|
2158
|
+
defaults:
|
|
2159
|
+
availability:
|
|
2160
|
+
target: <%= ENV['SLO_TARGET'] || '0.999' %>
|
|
2161
|
+
YAML
|
|
2162
|
+
|
|
2163
|
+
ENV['SLO_TARGET'] = '0.9999'
|
|
2164
|
+
end
|
|
2165
|
+
|
|
2166
|
+
after { ENV.delete('SLO_TARGET') }
|
|
2167
|
+
|
|
2168
|
+
it 'evaluates ERB before parsing YAML' do
|
|
2169
|
+
config = described_class.load!
|
|
2170
|
+
|
|
2171
|
+
target = config.defaults.dig('availability', 'target')
|
|
2172
|
+
expect(target).to eq(0.9999)
|
|
2173
|
+
end
|
|
2174
|
+
end
|
|
2175
|
+
end
|
|
2176
|
+
|
|
2177
|
+
describe '#resolve_endpoint_slo' do
|
|
2178
|
+
let(:config) do
|
|
2179
|
+
E11y::SLO::Config.new({
|
|
2180
|
+
'version' => 1,
|
|
2181
|
+
'defaults' => {
|
|
2182
|
+
'availability' => { 'target' => 0.999 }
|
|
2183
|
+
},
|
|
2184
|
+
'endpoints' => [
|
|
2185
|
+
{
|
|
2186
|
+
'controller' => 'OrdersController',
|
|
2187
|
+
'action' => 'create',
|
|
2188
|
+
'slo' => {
|
|
2189
|
+
'availability' => { 'target' => 0.9999 }
|
|
2190
|
+
}
|
|
2191
|
+
}
|
|
2192
|
+
],
|
|
2193
|
+
'app_wide' => {
|
|
2194
|
+
'http' => {
|
|
2195
|
+
'availability' => { 'target' => 0.99 }
|
|
2196
|
+
}
|
|
2197
|
+
}
|
|
2198
|
+
})
|
|
2199
|
+
end
|
|
2200
|
+
|
|
2201
|
+
it 'returns endpoint-specific SLO' do
|
|
2202
|
+
slo = config.resolve_endpoint_slo('OrdersController', 'create')
|
|
2203
|
+
|
|
2204
|
+
expect(slo.dig('availability', 'target')).to eq(0.9999)
|
|
2205
|
+
end
|
|
2206
|
+
|
|
2207
|
+
it 'returns app-wide defaults for unconfigured endpoint' do
|
|
2208
|
+
slo = config.resolve_endpoint_slo('UsersController', 'index')
|
|
2209
|
+
|
|
2210
|
+
expect(slo.dig('availability', 'target')).to eq(0.99)
|
|
2211
|
+
end
|
|
2212
|
+
end
|
|
2213
|
+
end
|
|
2214
|
+
|
|
2215
|
+
# spec/lib/e11y/slo/config_validator_spec.rb
|
|
2216
|
+
RSpec.describe E11y::SLO::ConfigValidator do
|
|
2217
|
+
let(:config) { E11y::SLO::Config.new(config_hash) }
|
|
2218
|
+
let(:validator) { described_class.new(config) }
|
|
2219
|
+
|
|
2220
|
+
describe '#validate!' do
|
|
2221
|
+
context 'with valid config' do
|
|
2222
|
+
let(:config_hash) do
|
|
2223
|
+
{
|
|
2224
|
+
'version' => 1,
|
|
2225
|
+
'defaults' => {
|
|
2226
|
+
'availability' => { 'enabled' => true, 'target' => 0.999 }
|
|
2227
|
+
},
|
|
2228
|
+
'endpoints' => [
|
|
2229
|
+
{
|
|
2230
|
+
'name' => 'Health Check',
|
|
2231
|
+
'controller' => 'HealthController',
|
|
2232
|
+
'action' => 'index',
|
|
2233
|
+
'slo' => {
|
|
2234
|
+
'availability' => { 'target' => 0.9999 }
|
|
2235
|
+
}
|
|
2236
|
+
}
|
|
2237
|
+
]
|
|
2238
|
+
}
|
|
2239
|
+
end
|
|
2240
|
+
|
|
2241
|
+
before do
|
|
2242
|
+
# Mock Rails routes
|
|
2243
|
+
allow(Rails.application).to receive_message_chain(:routes, :routes).and_return([
|
|
2244
|
+
double(
|
|
2245
|
+
verb: 'GET',
|
|
2246
|
+
path: double(spec: double(to_s: '/healthcheck')),
|
|
2247
|
+
defaults: { controller: 'HealthController', action: 'index' }
|
|
2248
|
+
)
|
|
2249
|
+
])
|
|
2250
|
+
end
|
|
2251
|
+
|
|
2252
|
+
it 'returns valid result' do
|
|
2253
|
+
result = validator.validate!
|
|
2254
|
+
|
|
2255
|
+
expect(result).to be_valid
|
|
2256
|
+
expect(result.errors).to be_empty
|
|
2257
|
+
end
|
|
2258
|
+
end
|
|
2259
|
+
|
|
2260
|
+
context 'with invalid availability target' do
|
|
2261
|
+
let(:config_hash) do
|
|
2262
|
+
{
|
|
2263
|
+
'endpoints' => [
|
|
2264
|
+
{
|
|
2265
|
+
'name' => 'Invalid',
|
|
2266
|
+
'controller' => 'TestController',
|
|
2267
|
+
'action' => 'index',
|
|
2268
|
+
'slo' => {
|
|
2269
|
+
'availability' => { 'target' => 1.5 } # > 1.0!
|
|
2270
|
+
}
|
|
2271
|
+
}
|
|
2272
|
+
]
|
|
2273
|
+
}
|
|
2274
|
+
end
|
|
2275
|
+
|
|
2276
|
+
it 'returns error' do
|
|
2277
|
+
result = validator.validate!
|
|
2278
|
+
|
|
2279
|
+
expect(result).not_to be_valid
|
|
2280
|
+
expect(result.errors).to include(/availability.target.*1.5.*must be 0.0-1.0/)
|
|
2281
|
+
end
|
|
2282
|
+
end
|
|
2283
|
+
|
|
2284
|
+
context 'with p95 > p99' do
|
|
2285
|
+
let(:config_hash) do
|
|
2286
|
+
{
|
|
2287
|
+
'endpoints' => [
|
|
2288
|
+
{
|
|
2289
|
+
'name' => 'Conflicting Latency',
|
|
2290
|
+
'controller' => 'TestController',
|
|
2291
|
+
'action' => 'index',
|
|
2292
|
+
'slo' => {
|
|
2293
|
+
'latency' => {
|
|
2294
|
+
'enabled' => true,
|
|
2295
|
+
'p99_target' => 100,
|
|
2296
|
+
'p95_target' => 200 # p95 > p99!
|
|
2297
|
+
}
|
|
2298
|
+
}
|
|
2299
|
+
}
|
|
2300
|
+
]
|
|
2301
|
+
}
|
|
2302
|
+
end
|
|
2303
|
+
|
|
2304
|
+
it 'returns error' do
|
|
2305
|
+
result = validator.validate!
|
|
2306
|
+
|
|
2307
|
+
expect(result.errors).to include(/p95.*200ms.*>.*p99.*100ms/)
|
|
2308
|
+
end
|
|
2309
|
+
end
|
|
2310
|
+
|
|
2311
|
+
context 'with missing route' do
|
|
2312
|
+
let(:config_hash) do
|
|
2313
|
+
{
|
|
2314
|
+
'endpoints' => [
|
|
2315
|
+
{
|
|
2316
|
+
'name' => 'Missing Route',
|
|
2317
|
+
'controller' => 'NonExistentController',
|
|
2318
|
+
'action' => 'missing',
|
|
2319
|
+
'slo' => { 'availability' => { 'target' => 0.999 } }
|
|
2320
|
+
}
|
|
2321
|
+
]
|
|
2322
|
+
}
|
|
2323
|
+
end
|
|
2324
|
+
|
|
2325
|
+
before do
|
|
2326
|
+
allow(Rails.application).to receive_message_chain(:routes, :routes).and_return([])
|
|
2327
|
+
end
|
|
2328
|
+
|
|
2329
|
+
it 'returns error' do
|
|
2330
|
+
result = validator.validate!
|
|
2331
|
+
|
|
2332
|
+
expect(result.errors).to include(/Endpoint not found in routes/)
|
|
2333
|
+
end
|
|
2334
|
+
end
|
|
2335
|
+
|
|
2336
|
+
context 'with throughput min > max' do
|
|
2337
|
+
let(:config_hash) do
|
|
2338
|
+
{
|
|
2339
|
+
'endpoints' => [
|
|
2340
|
+
{
|
|
2341
|
+
'name' => 'Invalid Throughput',
|
|
2342
|
+
'controller' => 'TestController',
|
|
2343
|
+
'action' => 'index',
|
|
2344
|
+
'slo' => {
|
|
2345
|
+
'throughput' => {
|
|
2346
|
+
'enabled' => true,
|
|
2347
|
+
'min_rps' => 1000,
|
|
2348
|
+
'max_rps' => 100 # min > max!
|
|
2349
|
+
}
|
|
2350
|
+
}
|
|
2351
|
+
}
|
|
2352
|
+
]
|
|
2353
|
+
}
|
|
2354
|
+
end
|
|
2355
|
+
|
|
2356
|
+
it 'returns error' do
|
|
2357
|
+
result = validator.validate!
|
|
2358
|
+
|
|
2359
|
+
expect(result.errors).to include(/min_rps.*1000.*>.*max_rps.*100/)
|
|
2360
|
+
end
|
|
2361
|
+
end
|
|
2362
|
+
|
|
2363
|
+
context 'with duplicate endpoints' do
|
|
2364
|
+
let(:config_hash) do
|
|
2365
|
+
{
|
|
2366
|
+
'endpoints' => [
|
|
2367
|
+
{
|
|
2368
|
+
'name' => 'First',
|
|
2369
|
+
'controller' => 'OrdersController',
|
|
2370
|
+
'action' => 'create',
|
|
2371
|
+
'slo' => { 'availability' => { 'target' => 0.999 } }
|
|
2372
|
+
},
|
|
2373
|
+
{
|
|
2374
|
+
'name' => 'Duplicate',
|
|
2375
|
+
'controller' => 'OrdersController',
|
|
2376
|
+
'action' => 'create', # Same!
|
|
2377
|
+
'slo' => { 'availability' => { 'target' => 0.99 } }
|
|
2378
|
+
}
|
|
2379
|
+
]
|
|
2380
|
+
}
|
|
2381
|
+
end
|
|
2382
|
+
|
|
2383
|
+
it 'returns error' do
|
|
2384
|
+
result = validator.validate!
|
|
2385
|
+
|
|
2386
|
+
expect(result.errors).to include(/Duplicate endpoint.*OrdersController#create/)
|
|
2387
|
+
end
|
|
2388
|
+
end
|
|
2389
|
+
end
|
|
2390
|
+
end
|
|
2391
|
+
|
|
2392
|
+
# spec/lib/e11y/slo/error_budget_spec.rb
|
|
2393
|
+
RSpec.describe E11y::SLO::ErrorBudget do
|
|
2394
|
+
let(:slo_config) do
|
|
2395
|
+
{
|
|
2396
|
+
'availability' => { 'target' => 0.999 },
|
|
2397
|
+
'window' => '30d'
|
|
2398
|
+
}
|
|
2399
|
+
end
|
|
2400
|
+
|
|
2401
|
+
let(:budget) do
|
|
2402
|
+
described_class.new('OrdersController', 'create', slo_config)
|
|
2403
|
+
end
|
|
2404
|
+
|
|
2405
|
+
before do
|
|
2406
|
+
# Mock Prometheus query
|
|
2407
|
+
allow(E11y::Metrics).to receive(:query_prometheus).and_return(
|
|
2408
|
+
{ 'data' => { 'result' => [{ 'value' => [Time.now.to_i, error_rate.to_s] }] } }
|
|
2409
|
+
)
|
|
2410
|
+
end
|
|
2411
|
+
|
|
2412
|
+
describe '#total' do
|
|
2413
|
+
it 'calculates total error budget' do
|
|
2414
|
+
expect(budget.total).to eq(0.001) # 1 - 0.999
|
|
2415
|
+
end
|
|
2416
|
+
end
|
|
2417
|
+
|
|
2418
|
+
describe '#consumed' do
|
|
2419
|
+
let(:error_rate) { 0.0005 } # 0.05% error rate
|
|
2420
|
+
|
|
2421
|
+
it 'calculates consumed error budget' do
|
|
2422
|
+
expect(budget.consumed).to eq(0.0005)
|
|
2423
|
+
end
|
|
2424
|
+
end
|
|
2425
|
+
|
|
2426
|
+
describe '#remaining' do
|
|
2427
|
+
let(:error_rate) { 0.0005 }
|
|
2428
|
+
|
|
2429
|
+
it 'calculates remaining error budget' do
|
|
2430
|
+
expect(budget.remaining).to eq(0.0005) # 0.001 - 0.0005
|
|
2431
|
+
end
|
|
2432
|
+
|
|
2433
|
+
context 'when consumed exceeds total' do
|
|
2434
|
+
let(:error_rate) { 0.002 } # 0.2% > 0.1%
|
|
2435
|
+
|
|
2436
|
+
it 'never goes negative' do
|
|
2437
|
+
expect(budget.remaining).to eq(0.0)
|
|
2438
|
+
end
|
|
2439
|
+
end
|
|
2440
|
+
end
|
|
2441
|
+
|
|
2442
|
+
describe '#exhausted?' do
|
|
2443
|
+
context 'when budget remaining' do
|
|
2444
|
+
let(:error_rate) { 0.0005 }
|
|
2445
|
+
|
|
2446
|
+
it 'returns false' do
|
|
2447
|
+
expect(budget).not_to be_exhausted
|
|
2448
|
+
end
|
|
2449
|
+
end
|
|
2450
|
+
|
|
2451
|
+
context 'when budget exhausted' do
|
|
2452
|
+
let(:error_rate) { 0.002 } # Exceeds 0.001
|
|
2453
|
+
|
|
2454
|
+
it 'returns true' do
|
|
2455
|
+
expect(budget).to be_exhausted
|
|
2456
|
+
end
|
|
2457
|
+
end
|
|
2458
|
+
end
|
|
2459
|
+
|
|
2460
|
+
describe '#can_deploy?' do
|
|
2461
|
+
context 'with sufficient budget' do
|
|
2462
|
+
let(:error_rate) { 0.0002 } # 20% consumed, 80% remaining
|
|
2463
|
+
|
|
2464
|
+
it 'allows deployment' do
|
|
2465
|
+
expect(budget.can_deploy?(20)).to be true
|
|
2466
|
+
end
|
|
2467
|
+
end
|
|
2468
|
+
|
|
2469
|
+
context 'with insufficient budget' do
|
|
2470
|
+
let(:error_rate) { 0.0009 } # 90% consumed, 10% remaining
|
|
2471
|
+
|
|
2472
|
+
it 'blocks deployment' do
|
|
2473
|
+
expect(budget.can_deploy?(20)).to be false
|
|
2474
|
+
end
|
|
2475
|
+
end
|
|
2476
|
+
end
|
|
2477
|
+
end
|
|
2478
|
+
```
|
|
2479
|
+
|
|
2480
|
+
---
|
|
2481
|
+
|
|
2482
|
+
## 7. Error Budget Management
|
|
2483
|
+
|
|
2484
|
+
### 7.1. Error Budget Calculation (Per-Endpoint)
|
|
2485
|
+
|
|
2486
|
+
```ruby
|
|
2487
|
+
# lib/e11y/slo/error_budget.rb
|
|
2488
|
+
module E11y
|
|
2489
|
+
module SLO
|
|
2490
|
+
class ErrorBudget
|
|
2491
|
+
def initialize(controller, action, slo_config)
|
|
2492
|
+
@controller = controller
|
|
2493
|
+
@action = action
|
|
2494
|
+
@slo_config = slo_config
|
|
2495
|
+
@target = slo_config['availability_target'] || 0.999
|
|
2496
|
+
@window = parse_window(slo_config['window'] || '30d')
|
|
2497
|
+
end
|
|
2498
|
+
|
|
2499
|
+
# Total error budget (e.g., 0.001 for 99.9%)
|
|
2500
|
+
def total
|
|
2501
|
+
1.0 - @target
|
|
2502
|
+
end
|
|
2503
|
+
|
|
2504
|
+
# Consumed error budget in current window
|
|
2505
|
+
def consumed
|
|
2506
|
+
error_rate = calculate_error_rate(@window)
|
|
2507
|
+
[error_rate, total].min # Cap at total budget
|
|
2508
|
+
end
|
|
2509
|
+
|
|
2510
|
+
# Remaining error budget
|
|
2511
|
+
def remaining
|
|
2512
|
+
[total - consumed, 0.0].max # Never negative
|
|
2513
|
+
end
|
|
2514
|
+
|
|
2515
|
+
# Percentage of error budget consumed
|
|
2516
|
+
def percent_consumed
|
|
2517
|
+
return 0.0 if total.zero?
|
|
2518
|
+
(consumed / total) * 100
|
|
2519
|
+
end
|
|
2520
|
+
|
|
2521
|
+
# Is error budget exhausted?
|
|
2522
|
+
def exhausted?
|
|
2523
|
+
remaining <= 0
|
|
2524
|
+
end
|
|
2525
|
+
|
|
2526
|
+
# Time until error budget exhaustion (at current burn rate)
|
|
2527
|
+
def time_until_exhaustion
|
|
2528
|
+
burn_rate_per_hour = calculate_burn_rate(1.hour)
|
|
2529
|
+
return Float::INFINITY if burn_rate_per_hour <= 0
|
|
2530
|
+
|
|
2531
|
+
hours_remaining = remaining / burn_rate_per_hour
|
|
2532
|
+
hours_remaining.hours
|
|
2533
|
+
end
|
|
2534
|
+
|
|
2535
|
+
# Can we deploy? (have enough error budget?)
|
|
2536
|
+
def can_deploy?(minimum_budget_percent = 20)
|
|
2537
|
+
percent_remaining = (remaining / total) * 100
|
|
2538
|
+
percent_remaining >= minimum_budget_percent
|
|
2539
|
+
end
|
|
2540
|
+
|
|
2541
|
+
private
|
|
2542
|
+
|
|
2543
|
+
def calculate_error_rate(window)
|
|
2544
|
+
# Query Prometheus for actual error rate
|
|
2545
|
+
query = <<~PROMQL
|
|
2546
|
+
sum(rate(http_requests_total{
|
|
2547
|
+
controller="#{@controller}",
|
|
2548
|
+
action="#{@action}",
|
|
2549
|
+
status=~"5.."
|
|
2550
|
+
}[#{window}]))
|
|
2551
|
+
/
|
|
2552
|
+
sum(rate(http_requests_total{
|
|
2553
|
+
controller="#{@controller}",
|
|
2554
|
+
action="#{@action}"
|
|
2555
|
+
}[#{window}]))
|
|
2556
|
+
PROMQL
|
|
2557
|
+
|
|
2558
|
+
result = E11y::Metrics.query_prometheus(query)
|
|
2559
|
+
result.dig('data', 'result', 0, 'value', 1).to_f
|
|
2560
|
+
end
|
|
2561
|
+
|
|
2562
|
+
def calculate_burn_rate(window)
|
|
2563
|
+
error_rate = calculate_error_rate(window)
|
|
2564
|
+
error_budget_per_hour = total / (@window.to_f / 1.hour)
|
|
2565
|
+
|
|
2566
|
+
error_rate / error_budget_per_hour
|
|
2567
|
+
end
|
|
2568
|
+
|
|
2569
|
+
def parse_window(window)
|
|
2570
|
+
case window
|
|
2571
|
+
when /(\d+)d/
|
|
2572
|
+
$1.to_i.days
|
|
2573
|
+
when /(\d+)h/
|
|
2574
|
+
$1.to_i.hours
|
|
2575
|
+
when /(\d+)m/
|
|
2576
|
+
$1.to_i.minutes
|
|
2577
|
+
else
|
|
2578
|
+
30.days # Default
|
|
2579
|
+
end
|
|
2580
|
+
end
|
|
2581
|
+
end
|
|
2582
|
+
end
|
|
2583
|
+
end
|
|
2584
|
+
```
|
|
2585
|
+
|
|
2586
|
+
### 7.2. Deployment Gate (Optional)
|
|
2587
|
+
|
|
2588
|
+
```ruby
|
|
2589
|
+
# lib/e11y/slo/deployment_gate.rb
|
|
2590
|
+
module E11y
|
|
2591
|
+
module SLO
|
|
2592
|
+
class DeploymentGate
|
|
2593
|
+
def self.check!(minimum_budget_percent: 20)
|
|
2594
|
+
config = E11y::SLO::ConfigLoader.load!
|
|
2595
|
+
|
|
2596
|
+
critical_endpoints = config.endpoints.select do |ep|
|
|
2597
|
+
ep.dig('slo', 'availability_target').to_f >= 0.999
|
|
2598
|
+
end
|
|
2599
|
+
|
|
2600
|
+
violations = []
|
|
2601
|
+
|
|
2602
|
+
critical_endpoints.each do |endpoint|
|
|
2603
|
+
controller = endpoint['controller']
|
|
2604
|
+
action = endpoint['action']
|
|
2605
|
+
slo_config = endpoint['slo']
|
|
2606
|
+
|
|
2607
|
+
budget = ErrorBudget.new(controller, action, slo_config)
|
|
2608
|
+
|
|
2609
|
+
unless budget.can_deploy?(minimum_budget_percent)
|
|
2610
|
+
violations << {
|
|
2611
|
+
endpoint: "#{controller}##{action}",
|
|
2612
|
+
budget_remaining: budget.percent_remaining,
|
|
2613
|
+
budget_consumed: budget.percent_consumed
|
|
2614
|
+
}
|
|
2615
|
+
end
|
|
2616
|
+
end
|
|
2617
|
+
|
|
2618
|
+
if violations.any?
|
|
2619
|
+
raise DeploymentBlockedError.new(violations)
|
|
2620
|
+
end
|
|
2621
|
+
|
|
2622
|
+
true
|
|
2623
|
+
end
|
|
2624
|
+
end
|
|
2625
|
+
|
|
2626
|
+
class DeploymentBlockedError < StandardError
|
|
2627
|
+
attr_reader :violations
|
|
2628
|
+
|
|
2629
|
+
def initialize(violations)
|
|
2630
|
+
@violations = violations
|
|
2631
|
+
|
|
2632
|
+
message = "❌ Deployment blocked: Insufficient error budget\n\n"
|
|
2633
|
+
violations.each do |v|
|
|
2634
|
+
message << " - #{v[:endpoint]}: #{v[:budget_remaining].round(1)}% remaining (need 20%+)\n"
|
|
2635
|
+
end
|
|
2636
|
+
message << "\nWait for error budget to recover before deploying."
|
|
2637
|
+
|
|
2638
|
+
super(message)
|
|
2639
|
+
end
|
|
2640
|
+
end
|
|
2641
|
+
end
|
|
2642
|
+
end
|
|
2643
|
+
```
|
|
2644
|
+
|
|
2645
|
+
---
|
|
2646
|
+
|
|
2647
|
+
## 8. Dashboard & Reporting
|
|
2648
|
+
|
|
2649
|
+
### 8.1. Per-Endpoint Grafana Dashboard
|
|
2650
|
+
|
|
2651
|
+
```json
|
|
2652
|
+
{
|
|
2653
|
+
"dashboard": {
|
|
2654
|
+
"title": "E11y Per-Endpoint SLO Dashboard",
|
|
2655
|
+
"templating": {
|
|
2656
|
+
"list": [
|
|
2657
|
+
{
|
|
2658
|
+
"name": "controller",
|
|
2659
|
+
"type": "query",
|
|
2660
|
+
"query": "label_values(http_requests_total, controller)"
|
|
2661
|
+
},
|
|
2662
|
+
{
|
|
2663
|
+
"name": "action",
|
|
2664
|
+
"type": "query",
|
|
2665
|
+
"query": "label_values(http_requests_total{controller=\"$controller\"}, action)"
|
|
2666
|
+
}
|
|
2667
|
+
]
|
|
2668
|
+
},
|
|
2669
|
+
"panels": [
|
|
2670
|
+
{
|
|
2671
|
+
"title": "Availability SLO: $controller#$action",
|
|
2672
|
+
"targets": [
|
|
2673
|
+
{
|
|
2674
|
+
"expr": "sum(rate(http_requests_total{controller=\"$controller\",action=\"$action\",status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total{controller=\"$controller\",action=\"$action\"}[30d]))",
|
|
2675
|
+
"legendFormat": "Current (30d)"
|
|
2676
|
+
},
|
|
2677
|
+
{
|
|
2678
|
+
"expr": "0.999",
|
|
2679
|
+
"legendFormat": "SLO Target (99.9%)"
|
|
2680
|
+
}
|
|
2681
|
+
],
|
|
2682
|
+
"yaxis": {
|
|
2683
|
+
"min": 0.995,
|
|
2684
|
+
"max": 1.0
|
|
2685
|
+
}
|
|
2686
|
+
},
|
|
2687
|
+
{
|
|
2688
|
+
"title": "Error Budget: $controller#$action",
|
|
2689
|
+
"targets": [
|
|
2690
|
+
{
|
|
2691
|
+
"expr": "slo_error_budget_remaining{controller=\"$controller\",action=\"$action\"}",
|
|
2692
|
+
"legendFormat": "Remaining"
|
|
2693
|
+
}
|
|
2694
|
+
],
|
|
2695
|
+
"thresholds": [
|
|
2696
|
+
{ "value": 0, "color": "red" },
|
|
2697
|
+
{ "value": 0.0002, "color": "yellow" },
|
|
2698
|
+
{ "value": 0.001, "color": "green" }
|
|
2699
|
+
]
|
|
2700
|
+
},
|
|
2701
|
+
{
|
|
2702
|
+
"title": "Burn Rate (Multi-Window): $controller#$action",
|
|
2703
|
+
"targets": [
|
|
2704
|
+
{
|
|
2705
|
+
"expr": "slo_burn_rate_1h{controller=\"$controller\",action=\"$action\"}",
|
|
2706
|
+
"legendFormat": "1h (fast burn)"
|
|
2707
|
+
},
|
|
2708
|
+
{
|
|
2709
|
+
"expr": "slo_burn_rate_6h{controller=\"$controller\",action=\"$action\"}",
|
|
2710
|
+
"legendFormat": "6h (medium burn)"
|
|
2711
|
+
},
|
|
2712
|
+
{
|
|
2713
|
+
"expr": "slo_burn_rate_3d{controller=\"$controller\",action=\"$action\"}",
|
|
2714
|
+
"legendFormat": "3d (slow burn)"
|
|
2715
|
+
},
|
|
2716
|
+
{
|
|
2717
|
+
"expr": "14.4",
|
|
2718
|
+
"legendFormat": "Fast Burn Threshold"
|
|
2719
|
+
},
|
|
2720
|
+
{
|
|
2721
|
+
"expr": "6.0",
|
|
2722
|
+
"legendFormat": "Medium Burn Threshold"
|
|
2723
|
+
},
|
|
2724
|
+
{
|
|
2725
|
+
"expr": "1.0",
|
|
2726
|
+
"legendFormat": "Slow Burn Threshold"
|
|
2727
|
+
}
|
|
2728
|
+
]
|
|
2729
|
+
},
|
|
2730
|
+
{
|
|
2731
|
+
"title": "Latency p99: $controller#$action",
|
|
2732
|
+
"targets": [
|
|
2733
|
+
{
|
|
2734
|
+
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{controller=\"$controller\",action=\"$action\"}[5m])) by (le))",
|
|
2735
|
+
"legendFormat": "p99"
|
|
2736
|
+
},
|
|
2737
|
+
{
|
|
2738
|
+
"expr": "0.5",
|
|
2739
|
+
"legendFormat": "SLO Target (500ms)"
|
|
2740
|
+
}
|
|
2741
|
+
]
|
|
2742
|
+
}
|
|
2743
|
+
]
|
|
2744
|
+
}
|
|
2745
|
+
}
|
|
2746
|
+
```
|
|
2747
|
+
|
|
2748
|
+
---
|
|
2749
|
+
|
|
2750
|
+
## 9. Production Best Practices & Edge Cases
|
|
2751
|
+
|
|
2752
|
+
### 9.1. Rollout Strategy
|
|
2753
|
+
|
|
2754
|
+
**Phase 1: Observability Only (1-2 weeks)**
|
|
2755
|
+
```yaml
|
|
2756
|
+
# config/slo.yml - Initial rollout
|
|
2757
|
+
version: 1
|
|
2758
|
+
|
|
2759
|
+
# Start with app-wide only (no per-endpoint)
|
|
2760
|
+
app_wide:
|
|
2761
|
+
http:
|
|
2762
|
+
availability:
|
|
2763
|
+
enabled: true
|
|
2764
|
+
target: 0.999
|
|
2765
|
+
latency:
|
|
2766
|
+
enabled: true
|
|
2767
|
+
p99_target: 1000 # Conservative: 1s
|
|
2768
|
+
|
|
2769
|
+
# Disable burn rate alerts initially
|
|
2770
|
+
defaults:
|
|
2771
|
+
burn_rate_alerts:
|
|
2772
|
+
fast:
|
|
2773
|
+
enabled: false # Don't page SRE yet!
|
|
2774
|
+
medium:
|
|
2775
|
+
enabled: false
|
|
2776
|
+
slow:
|
|
2777
|
+
enabled: true # Only slow burn (info)
|
|
2778
|
+
alert_after: 24h # Very slow
|
|
2779
|
+
|
|
2780
|
+
# Enable deployment gate: false (don't block deploys yet)
|
|
2781
|
+
advanced:
|
|
2782
|
+
deployment_gate:
|
|
2783
|
+
enabled: false
|
|
2784
|
+
```
|
|
2785
|
+
|
|
2786
|
+
**Phase 2: Per-Endpoint + Slow Alerts (2-4 weeks)**
|
|
2787
|
+
```yaml
|
|
2788
|
+
# Add 3-5 critical endpoints
|
|
2789
|
+
endpoints:
|
|
2790
|
+
- name: "Health Check"
|
|
2791
|
+
controller: "HealthController"
|
|
2792
|
+
action: "index"
|
|
2793
|
+
slo:
|
|
2794
|
+
availability:
|
|
2795
|
+
target: 0.9999 # Start strict
|
|
2796
|
+
|
|
2797
|
+
- name: "Create Order"
|
|
2798
|
+
controller: "OrdersController"
|
|
2799
|
+
action: "create"
|
|
2800
|
+
slo:
|
|
2801
|
+
availability:
|
|
2802
|
+
target: 0.999
|
|
2803
|
+
burn_rate_alerts:
|
|
2804
|
+
slow:
|
|
2805
|
+
enabled: true # Only slow burn for now
|
|
2806
|
+
alert_after: 12h
|
|
2807
|
+
```
|
|
2808
|
+
|
|
2809
|
+
**Phase 3: Multi-Window Burn Rate (4-6 weeks)**
|
|
2810
|
+
```yaml
|
|
2811
|
+
# Enable medium + fast burn rate alerts
|
|
2812
|
+
endpoints:
|
|
2813
|
+
- name: "Create Order"
|
|
2814
|
+
slo:
|
|
2815
|
+
burn_rate_alerts:
|
|
2816
|
+
fast:
|
|
2817
|
+
enabled: true
|
|
2818
|
+
alert_after: 10m # Start conservative (10m not 5m)
|
|
2819
|
+
medium:
|
|
2820
|
+
enabled: true
|
|
2821
|
+
slow:
|
|
2822
|
+
enabled: true
|
|
2823
|
+
```
|
|
2824
|
+
|
|
2825
|
+
**Phase 4: Deployment Gate (6-8 weeks)**
|
|
2826
|
+
```yaml
|
|
2827
|
+
# Only after confidence in data
|
|
2828
|
+
advanced:
|
|
2829
|
+
deployment_gate:
|
|
2830
|
+
enabled: true
|
|
2831
|
+
minimum_budget_percent: 10 # Start lenient (10% not 20%)
|
|
2832
|
+
override_label: "deploy:emergency"
|
|
2833
|
+
```
|
|
2834
|
+
|
|
2835
|
+
### 9.2. Edge Cases & Solutions
|
|
2836
|
+
|
|
2837
|
+
**Edge Case 1: Routes Not Loaded During Validation**
|
|
2838
|
+
```ruby
|
|
2839
|
+
# Problem: `bundle exec rake e11y:slo:validate` fails in CI
|
|
2840
|
+
# Reason: Routes not eager-loaded in non-Rails rake tasks
|
|
2841
|
+
|
|
2842
|
+
# Solution: Load Rails environment
|
|
2843
|
+
# Rakefile or .github/workflows/validate_slo.yml
|
|
2844
|
+
task :validate_slo do
|
|
2845
|
+
ENV['RAILS_ENV'] ||= 'test'
|
|
2846
|
+
require File.expand_path('../config/environment', __FILE__) # Load Rails
|
|
2847
|
+
Rake::Task['e11y:slo:validate'].invoke
|
|
2848
|
+
end
|
|
2849
|
+
```
|
|
2850
|
+
|
|
2851
|
+
**Edge Case 2: Prometheus Down During Error Budget Check**
|
|
2852
|
+
```ruby
|
|
2853
|
+
# Problem: Deployment gate blocks deploy if Prometheus unavailable
|
|
2854
|
+
|
|
2855
|
+
# lib/e11y/slo/error_budget.rb
|
|
2856
|
+
def calculate_error_rate(window)
|
|
2857
|
+
query = build_prometheus_query(window)
|
|
2858
|
+
|
|
2859
|
+
begin
|
|
2860
|
+
result = E11y::Metrics.query_prometheus(query, timeout: 5.seconds)
|
|
2861
|
+
parse_prometheus_result(result)
|
|
2862
|
+
rescue Errno::ECONNREFUSED, Net::ReadTimeout => error
|
|
2863
|
+
# EDGE CASE: Prometheus down
|
|
2864
|
+
E11y.logger.error("Prometheus unavailable: #{error.message}")
|
|
2865
|
+
|
|
2866
|
+
# Fallback: Allow deployment (fail-open, not fail-closed)
|
|
2867
|
+
E11y.logger.warn("Deployment gate: Allowing deploy (Prometheus down)")
|
|
2868
|
+
return 0.0 # Assume no errors
|
|
2869
|
+
end
|
|
2870
|
+
end
|
|
2871
|
+
```
|
|
2872
|
+
|
|
2873
|
+
**Edge Case 3: Variable Traffic (Night vs Day)**
|
|
2874
|
+
```yaml
|
|
2875
|
+
# Problem: Burn rate alerts fire at night (low traffic)
|
|
2876
|
+
# Reason: Small absolute number of errors triggers high percentage
|
|
2877
|
+
|
|
2878
|
+
# Solution: Minimum request count threshold
|
|
2879
|
+
endpoints:
|
|
2880
|
+
- name: "Create Order"
|
|
2881
|
+
slo:
|
|
2882
|
+
burn_rate_alerts:
|
|
2883
|
+
fast:
|
|
2884
|
+
enabled: true
|
|
2885
|
+
threshold: 14.4
|
|
2886
|
+
min_requests_per_window: 100 # NEW: Don't alert if <100 req in 1h
|
|
2887
|
+
```
|
|
2888
|
+
|
|
2889
|
+
**Edge Case 4: Deploy During Incident**
|
|
2890
|
+
```ruby
|
|
2891
|
+
# Problem: Incident exhausts error budget → blocks ALL deploys (including hotfix!)
|
|
2892
|
+
|
|
2893
|
+
# Solution: GitHub label override
|
|
2894
|
+
# .github/workflows/deploy.yml
|
|
2895
|
+
- name: Check Error Budget
|
|
2896
|
+
run: bundle exec rake e11y:slo:deployment_gate:check
|
|
2897
|
+
continue-on-error: ${{ contains(github.event.pull_request.labels.*.name, 'deploy:emergency') }}
|
|
2898
|
+
```
|
|
2899
|
+
|
|
2900
|
+
**Edge Case 5: New Endpoint (No Historical Data)**
|
|
2901
|
+
```ruby
|
|
2902
|
+
# Problem: New endpoint triggers burn rate alert immediately (no baseline)
|
|
2903
|
+
|
|
2904
|
+
# lib/e11y/slo/burn_rate_calculator.rb
|
|
2905
|
+
def calculate_burn_rate(controller, action, window)
|
|
2906
|
+
# Check if endpoint is "new" (< 7 days of data)
|
|
2907
|
+
first_request_at = E11y::Metrics.query_prometheus(<<~PROMQL)
|
|
2908
|
+
min_over_time(http_requests_total{controller="#{controller}",action="#{action}"}[7d])
|
|
2909
|
+
PROMQL
|
|
2910
|
+
|
|
2911
|
+
if first_request_at.nil? || Time.at(first_request_at) > 7.days.ago
|
|
2912
|
+
# EDGE CASE: New endpoint, skip burn rate alerts
|
|
2913
|
+
E11y.logger.info("Skipping burn rate for new endpoint: #{controller}##{action}")
|
|
2914
|
+
return 0.0
|
|
2915
|
+
end
|
|
2916
|
+
|
|
2917
|
+
# Normal burn rate calculation...
|
|
2918
|
+
end
|
|
2919
|
+
```
|
|
2920
|
+
|
|
2921
|
+
**Edge Case 6: Maintenance Window**
|
|
2922
|
+
```yaml
|
|
2923
|
+
# Problem: Scheduled maintenance triggers SLO alerts
|
|
2924
|
+
|
|
2925
|
+
# config/slo.yml
|
|
2926
|
+
advanced:
|
|
2927
|
+
maintenance_windows:
|
|
2928
|
+
enabled: true
|
|
2929
|
+
schedule:
|
|
2930
|
+
- name: "Weekly DB backup"
|
|
2931
|
+
day: sunday
|
|
2932
|
+
time: "03:00-04:00"
|
|
2933
|
+
timezone: "America/New_York"
|
|
2934
|
+
exclude_from_slo: true # Don't count errors during this window
|
|
2935
|
+
|
|
2936
|
+
- name: "Monthly security patching"
|
|
2937
|
+
day_of_month: 1
|
|
2938
|
+
time: "02:00-04:00"
|
|
2939
|
+
exclude_from_slo: true
|
|
2940
|
+
```
|
|
2941
|
+
|
|
2942
|
+
**Edge Case 7: Thundering Herd After Deploy**
|
|
2943
|
+
```ruby
|
|
2944
|
+
# Problem: Deploy → cache clear → spike in latency → burn rate alert
|
|
2945
|
+
|
|
2946
|
+
# lib/e11y/slo/middleware.rb
|
|
2947
|
+
class SLOMiddleware
|
|
2948
|
+
def call(env)
|
|
2949
|
+
# EDGE CASE: Grace period after deploy
|
|
2950
|
+
if deployment_recently_finished?
|
|
2951
|
+
# Don't count requests in first 5 minutes after deploy
|
|
2952
|
+
env['e11y.slo.grace_period'] = true
|
|
2953
|
+
end
|
|
2954
|
+
|
|
2955
|
+
# Normal SLO tracking...
|
|
2956
|
+
end
|
|
2957
|
+
|
|
2958
|
+
private
|
|
2959
|
+
|
|
2960
|
+
def deployment_recently_finished?
|
|
2961
|
+
# Check deployment timestamp file
|
|
2962
|
+
deploy_timestamp_file = Rails.root.join('tmp', 'deploy_timestamp')
|
|
2963
|
+
return false unless File.exist?(deploy_timestamp_file)
|
|
2964
|
+
|
|
2965
|
+
deploy_time = Time.at(File.read(deploy_timestamp_file).to_i)
|
|
2966
|
+
Time.now < deploy_time + 5.minutes
|
|
2967
|
+
end
|
|
2968
|
+
end
|
|
2969
|
+
```
|
|
2970
|
+
|
|
2971
|
+
**Edge Case 8: Partial Prometheus Data Loss**
|
|
2972
|
+
```ruby
|
|
2973
|
+
# Problem: Prometheus storage corrupted → missing data → incorrect SLO
|
|
2974
|
+
|
|
2975
|
+
# lib/e11y/slo/error_budget.rb
|
|
2976
|
+
def calculate_error_rate(window)
|
|
2977
|
+
query = build_prometheus_query(window)
|
|
2978
|
+
result = E11y::Metrics.query_prometheus(query)
|
|
2979
|
+
|
|
2980
|
+
# EDGE CASE: Check if we have enough data points
|
|
2981
|
+
data_points = result.dig('data', 'result', 0, 'values')&.size || 0
|
|
2982
|
+
expected_data_points = window_seconds(window) / 30 # 30s scrape interval
|
|
2983
|
+
|
|
2984
|
+
if data_points < (expected_data_points * 0.5)
|
|
2985
|
+
# Less than 50% of expected data
|
|
2986
|
+
E11y.logger.warn("Insufficient Prometheus data: #{data_points}/#{expected_data_points}")
|
|
2987
|
+
|
|
2988
|
+
# Fallback: Use last known good value
|
|
2989
|
+
return fetch_last_known_error_rate
|
|
2990
|
+
end
|
|
2991
|
+
|
|
2992
|
+
# Normal calculation...
|
|
2993
|
+
end
|
|
2994
|
+
```
|
|
2995
|
+
|
|
2996
|
+
### 9.3. Monitoring the SLO System Itself
|
|
2997
|
+
|
|
2998
|
+
**Self-Monitoring Metrics:**
|
|
2999
|
+
```ruby
|
|
3000
|
+
# config/initializers/e11y.rb
|
|
3001
|
+
E11y.configure do |config|
|
|
3002
|
+
config.slo.self_monitoring do
|
|
3003
|
+
# Track SLO config load time
|
|
3004
|
+
track :slo_config_load_duration_seconds, type: :histogram
|
|
3005
|
+
|
|
3006
|
+
# Track SLO resolution performance
|
|
3007
|
+
track :slo_resolution_duration_seconds, type: :histogram, labels: [:endpoint]
|
|
3008
|
+
|
|
3009
|
+
# Track validation errors
|
|
3010
|
+
track :slo_validation_errors_total, type: :counter
|
|
3011
|
+
|
|
3012
|
+
# Track Prometheus query failures
|
|
3013
|
+
track :slo_prometheus_query_errors_total, type: :counter
|
|
3014
|
+
|
|
3015
|
+
# Track deployment gate decisions
|
|
3016
|
+
track :slo_deployment_gate_decisions_total, type: :counter, labels: [:decision] # allowed, blocked
|
|
3017
|
+
end
|
|
3018
|
+
end
|
|
3019
|
+
```
|
|
3020
|
+
|
|
3021
|
+
**Grafana Dashboard for SLO System Health:**
|
|
3022
|
+
```promql
|
|
3023
|
+
# Alert: SLO config validation failing
|
|
3024
|
+
rate(e11y_slo_validation_errors_total[5m]) > 0
|
|
3025
|
+
|
|
3026
|
+
# Alert: SLO resolution slow
|
|
3027
|
+
histogram_quantile(0.99, rate(e11y_slo_resolution_duration_seconds_bucket[5m])) > 0.1
|
|
3028
|
+
|
|
3029
|
+
# Alert: Prometheus queries failing
|
|
3030
|
+
rate(e11y_slo_prometheus_query_errors_total[5m]) > 0.01
|
|
3031
|
+
```
|
|
3032
|
+
|
|
3033
|
+
---
|
|
3034
|
+
|
|
3035
|
+
## 10. Trade-offs
|
|
3036
|
+
|
|
3037
|
+
### 9.1. Key Decisions
|
|
3038
|
+
|
|
3039
|
+
| Decision | Pro | Con | Rationale |
|
|
3040
|
+
|----------|-----|-----|-----------|
|
|
3041
|
+
| **Per-endpoint SLO** | Granular visibility | Config complexity | Critical endpoints need specific SLOs |
|
|
3042
|
+
| **Multi-window burn rate** | 5-minute detection, low false positives | Complex Prometheus queries | Google SRE best practice 2026 |
|
|
3043
|
+
| **YAML-based config** | Version controlled, validated | Extra file | Separation of concerns |
|
|
3044
|
+
| **Optional latency SLO** | Flexible | Some endpoints untracked | Not all endpoints need latency |
|
|
3045
|
+
| **Config validation** | Prevents drift | CI/CD overhead | Critical for accuracy |
|
|
3046
|
+
| **30-day SLO window** | Industry standard | Slow trend detection | Multi-window compensates |
|
|
3047
|
+
|
|
3048
|
+
### 9.2. Alternatives Considered
|
|
3049
|
+
|
|
3050
|
+
**A) Single app-wide SLO only**
|
|
3051
|
+
- ❌ Rejected: Too coarse, hides critical endpoint issues
|
|
3052
|
+
|
|
3053
|
+
**B) Single-window alerting**
|
|
3054
|
+
- ❌ Rejected: Either slow (30d) or noisy (5m)
|
|
3055
|
+
|
|
3056
|
+
**C) Code-based SLO config**
|
|
3057
|
+
- ❌ Rejected: Requires deployment to change SLOs
|
|
3058
|
+
|
|
3059
|
+
**D) No config validation**
|
|
3060
|
+
- ❌ Rejected: Config drift is a real problem
|
|
3061
|
+
|
|
3062
|
+
**E) Per-user SLO**
|
|
3063
|
+
- ❌ Deferred to v2.0: Too complex for v1
|
|
3064
|
+
|
|
3065
|
+
---
|
|
3066
|
+
|
|
3067
|
+
## 11. Real-World Configuration Examples
|
|
3068
|
+
|
|
3069
|
+
### 11.1. E-Commerce Platform
|
|
3070
|
+
|
|
3071
|
+
```yaml
|
|
3072
|
+
# config/slo.yml - E-commerce example
|
|
3073
|
+
version: 1
|
|
3074
|
+
|
|
3075
|
+
defaults:
|
|
3076
|
+
window: 30d
|
|
3077
|
+
availability:
|
|
3078
|
+
enabled: true
|
|
3079
|
+
target: 0.999
|
|
3080
|
+
|
|
3081
|
+
endpoints:
|
|
3082
|
+
# === REVENUE-CRITICAL (99.99%) ===
|
|
3083
|
+
- name: "Checkout - Payment"
|
|
3084
|
+
pattern: "POST /checkout/payment"
|
|
3085
|
+
controller: "Checkout::PaymentsController"
|
|
3086
|
+
action: "create"
|
|
3087
|
+
tags: [critical, revenue, pci_scope]
|
|
3088
|
+
slo:
|
|
3089
|
+
availability:
|
|
3090
|
+
target: 0.9999 # 99.99%
|
|
3091
|
+
latency:
|
|
3092
|
+
p99_target: 2000 # 2s (Stripe API call)
|
|
3093
|
+
p95_target: 1000
|
|
3094
|
+
throughput:
|
|
3095
|
+
min_rps: 1
|
|
3096
|
+
max_rps: 100 # Rate limit (fraud protection)
|
|
3097
|
+
burn_rate_alerts:
|
|
3098
|
+
fast:
|
|
3099
|
+
threshold: 10.0 # More lenient (third-party)
|
|
3100
|
+
alert_after: 5m
|
|
3101
|
+
|
|
3102
|
+
- name: "Cart - Add Item"
|
|
3103
|
+
pattern: "POST /cart/items"
|
|
3104
|
+
controller: "CartController"
|
|
3105
|
+
action: "add_item"
|
|
3106
|
+
tags: [high_priority, customer_facing]
|
|
3107
|
+
slo:
|
|
3108
|
+
availability:
|
|
3109
|
+
target: 0.999 # 99.9%
|
|
3110
|
+
latency:
|
|
3111
|
+
p99_target: 300
|
|
3112
|
+
p95_target: 150
|
|
3113
|
+
throughput:
|
|
3114
|
+
max_rps: 1000
|
|
3115
|
+
|
|
3116
|
+
# === HIGH-TRAFFIC (throughput-focused) ===
|
|
3117
|
+
- name: "Product Search"
|
|
3118
|
+
pattern: "GET /api/products/search"
|
|
3119
|
+
controller: "Api::ProductsController"
|
|
3120
|
+
action: "search"
|
|
3121
|
+
tags: [high_traffic, search, cached]
|
|
3122
|
+
slo:
|
|
3123
|
+
availability:
|
|
3124
|
+
target: 0.995 # 99.5% (can tolerate cache misses)
|
|
3125
|
+
latency:
|
|
3126
|
+
p99_target: 500
|
|
3127
|
+
throughput:
|
|
3128
|
+
min_rps: 50 # Must handle 50+ req/sec
|
|
3129
|
+
max_rps: 5000
|
|
3130
|
+
|
|
3131
|
+
# === ADMIN (low priority) ===
|
|
3132
|
+
- name: "Admin - Sales Report"
|
|
3133
|
+
pattern: "POST /admin/reports/sales"
|
|
3134
|
+
controller: "Admin::ReportsController"
|
|
3135
|
+
action: "sales"
|
|
3136
|
+
tags: [admin, slow_operation]
|
|
3137
|
+
slo:
|
|
3138
|
+
availability:
|
|
3139
|
+
target: 0.99 # 99%
|
|
3140
|
+
latency:
|
|
3141
|
+
p99_target: 30000 # 30s
|
|
3142
|
+
burn_rate_alerts:
|
|
3143
|
+
fast:
|
|
3144
|
+
enabled: false
|
|
3145
|
+
slow:
|
|
3146
|
+
enabled: true
|
|
3147
|
+
|
|
3148
|
+
services:
|
|
3149
|
+
sidekiq:
|
|
3150
|
+
jobs:
|
|
3151
|
+
PaymentProcessingJob:
|
|
3152
|
+
success_rate_target: 0.9999 # Critical!
|
|
3153
|
+
alert_on_single_failure: true
|
|
3154
|
+
|
|
3155
|
+
InventorySync Job:
|
|
3156
|
+
success_rate_target: 0.99
|
|
3157
|
+
latency:
|
|
3158
|
+
p99_target: 60000 # 60s
|
|
3159
|
+
```
|
|
3160
|
+
|
|
3161
|
+
### 11.2. SaaS API Platform
|
|
3162
|
+
|
|
3163
|
+
```yaml
|
|
3164
|
+
# config/slo.yml - API platform example
|
|
3165
|
+
version: 1
|
|
3166
|
+
|
|
3167
|
+
defaults:
|
|
3168
|
+
window: 30d
|
|
3169
|
+
availability:
|
|
3170
|
+
enabled: true
|
|
3171
|
+
target: 0.999
|
|
3172
|
+
latency:
|
|
3173
|
+
enabled: true
|
|
3174
|
+
p99_target: 200 # Fast API
|
|
3175
|
+
|
|
3176
|
+
endpoints:
|
|
3177
|
+
# === PUBLIC API (99.99%) ===
|
|
3178
|
+
- name: "API - Create Resource"
|
|
3179
|
+
pattern: "POST /api/v1/resources"
|
|
3180
|
+
controller: "Api::V1::ResourcesController"
|
|
3181
|
+
action: "create"
|
|
3182
|
+
tags: [api, customer_facing, rate_limited]
|
|
3183
|
+
slo:
|
|
3184
|
+
availability:
|
|
3185
|
+
target: 0.9999 # 99.99% SLA
|
|
3186
|
+
latency:
|
|
3187
|
+
p99_target: 200
|
|
3188
|
+
p95_target: 100
|
|
3189
|
+
throughput:
|
|
3190
|
+
min_rps: 10
|
|
3191
|
+
max_rps: 10000 # High throughput API
|
|
3192
|
+
burn_rate_alerts:
|
|
3193
|
+
fast:
|
|
3194
|
+
threshold: 14.4
|
|
3195
|
+
alert_after: 5m
|
|
3196
|
+
|
|
3197
|
+
# === WEBHOOKS (eventual consistency) ===
|
|
3198
|
+
- name: "Webhook Delivery"
|
|
3199
|
+
pattern: "POST /internal/webhooks/deliver"
|
|
3200
|
+
controller: "Internal::WebhooksController"
|
|
3201
|
+
action: "deliver"
|
|
3202
|
+
tags: [internal, async, retry]
|
|
3203
|
+
slo:
|
|
3204
|
+
availability:
|
|
3205
|
+
target: 0.95 # 95% (retries handle failures)
|
|
3206
|
+
latency:
|
|
3207
|
+
enabled: false # Async, latency not critical
|
|
3208
|
+
burn_rate_alerts:
|
|
3209
|
+
fast:
|
|
3210
|
+
enabled: false
|
|
3211
|
+
slow:
|
|
3212
|
+
enabled: true
|
|
3213
|
+
|
|
3214
|
+
services:
|
|
3215
|
+
sidekiq:
|
|
3216
|
+
default:
|
|
3217
|
+
success_rate_target: 0.999
|
|
3218
|
+
jobs:
|
|
3219
|
+
WebhookDeliveryJob:
|
|
3220
|
+
success_rate_target: 0.95 # Retries + DLQ
|
|
3221
|
+
latency:
|
|
3222
|
+
p99_target: 10000 # 10s (external API)
|
|
3223
|
+
```
|
|
3224
|
+
|
|
3225
|
+
### 11.3. Internal Admin Tool
|
|
3226
|
+
|
|
3227
|
+
```yaml
|
|
3228
|
+
# config/slo.yml - Admin tool example
|
|
3229
|
+
version: 1
|
|
3230
|
+
|
|
3231
|
+
defaults:
|
|
3232
|
+
window: 7d # Shorter window (less critical)
|
|
3233
|
+
availability:
|
|
3234
|
+
enabled: true
|
|
3235
|
+
target: 0.99 # 99% (internal users tolerate downtime)
|
|
3236
|
+
latency:
|
|
3237
|
+
enabled: false # No latency SLO by default
|
|
3238
|
+
|
|
3239
|
+
endpoints:
|
|
3240
|
+
- name: "Admin Dashboard"
|
|
3241
|
+
pattern: "GET /admin"
|
|
3242
|
+
controller: "AdminController"
|
|
3243
|
+
action: "index"
|
|
3244
|
+
tags: [admin, internal]
|
|
3245
|
+
slo:
|
|
3246
|
+
availability:
|
|
3247
|
+
target: 0.99
|
|
3248
|
+
burn_rate_alerts:
|
|
3249
|
+
fast:
|
|
3250
|
+
enabled: false
|
|
3251
|
+
slow:
|
|
3252
|
+
enabled: true
|
|
3253
|
+
alert_after: 24h # Very slow
|
|
3254
|
+
|
|
3255
|
+
- name: "Data Export"
|
|
3256
|
+
pattern: "POST /admin/exports"
|
|
3257
|
+
controller: "Admin::ExportsController"
|
|
3258
|
+
action: "create"
|
|
3259
|
+
tags: [admin, slow_operation]
|
|
3260
|
+
slo:
|
|
3261
|
+
availability:
|
|
3262
|
+
target: 0.95 # 95% (can retry)
|
|
3263
|
+
latency:
|
|
3264
|
+
p99_target: 120000 # 2 minutes (large CSV)
|
|
3265
|
+
|
|
3266
|
+
advanced:
|
|
3267
|
+
deployment_gate:
|
|
3268
|
+
enabled: false # No deployment gate for admin tool
|
|
3269
|
+
|
|
3270
|
+
error_budget_alerts:
|
|
3271
|
+
enabled: false # No budget alerts
|
|
3272
|
+
```
|
|
3273
|
+
|
|
3274
|
+
---
|
|
3275
|
+
|
|
3276
|
+
## 12. Summary & Next Steps
|
|
3277
|
+
|
|
3278
|
+
### 12.1. What We Achieved
|
|
3279
|
+
|
|
3280
|
+
✅ **Multi-level SLO strategy**: App-wide, service-level, per-endpoint
|
|
3281
|
+
✅ **5-minute alert detection**: Multi-window burn rate (Google SRE 2026)
|
|
3282
|
+
✅ **YAML-based configuration**: Validated, version-controlled, ERB support
|
|
3283
|
+
✅ **Flexible latency SLO**: Optional per endpoint
|
|
3284
|
+
✅ **Throughput SLO**: Min/max requests per second
|
|
3285
|
+
✅ **Config validation & linting**: Prevents drift from reality
|
|
3286
|
+
✅ **Full implementation**: ConfigLoader, Validator, ErrorBudget with edge cases
|
|
3287
|
+
✅ **RSpec testing**: Comprehensive test coverage
|
|
3288
|
+
✅ **Production best practices**: Rollout strategy, edge case handling, self-monitoring
|
|
3289
|
+
✅ **Real-world examples**: E-commerce, SaaS API, Admin tool configurations
|
|
3290
|
+
|
|
3291
|
+
### 12.2. Implementation Checklist
|
|
3292
|
+
|
|
3293
|
+
**Phase 1: Core (Week 1-2)**
|
|
3294
|
+
- [x] Implement `E11y::SLO::ConfigLoader` with ERB support
|
|
3295
|
+
- [x] Implement `E11y::SLO::Config` with index building
|
|
3296
|
+
- [x] Implement `E11y::SLO::ConfigValidator` with edge cases
|
|
3297
|
+
- [x] Add `rake e11y:slo:validate` task
|
|
3298
|
+
- [ ] Add per-endpoint metrics to `E11y::Rack::Middleware`
|
|
3299
|
+
- [ ] Implement `E11y::SLO::MetricsEmitter`
|
|
3300
|
+
|
|
3301
|
+
**Phase 2: Burn Rate & Alerts (Week 3-4)**
|
|
3302
|
+
- [ ] Implement `E11y::SLO::BurnRateCalculator`
|
|
3303
|
+
- [ ] Generate Prometheus alert rules from `slo.yml`
|
|
3304
|
+
- [ ] Implement multi-window burn rate alerts
|
|
3305
|
+
- [ ] Add Prometheus query error handling
|
|
3306
|
+
|
|
3307
|
+
**Phase 3: Error Budget (Week 5-6)**
|
|
3308
|
+
- [x] Implement `E11y::SLO::ErrorBudget`
|
|
3309
|
+
- [ ] Implement `E11y::SLO::DeploymentGate`
|
|
3310
|
+
- [ ] Add error budget tracking middleware
|
|
3311
|
+
- [ ] Create Grafana dashboard templates
|
|
3312
|
+
|
|
3313
|
+
**Phase 4: Production Readiness (Week 7-8)**
|
|
3314
|
+
- [ ] Add maintenance window support
|
|
3315
|
+
- [ ] Implement grace period after deployment
|
|
3316
|
+
- [ ] Add self-monitoring metrics
|
|
3317
|
+
- [ ] Integrate with CI/CD (validate on PR)
|
|
3318
|
+
- [ ] Document SLO config guide
|
|
3319
|
+
- [ ] Add rollout playbook
|
|
3320
|
+
|
|
3321
|
+
**Phase 5: RSpec Tests (Week 8)**
|
|
3322
|
+
- [x] ConfigLoader specs (edge cases: missing file, invalid YAML, ERB)
|
|
3323
|
+
- [x] ConfigValidator specs (invalid targets, missing routes, conflicts)
|
|
3324
|
+
- [x] ErrorBudget specs (calculations, exhaustion, deployment gate)
|
|
3325
|
+
- [ ] BurnRateCalculator specs (multi-window, new endpoints)
|
|
3326
|
+
- [ ] Integration specs (end-to-end SLO tracking)
|
|
3327
|
+
|
|
3328
|
+
---
|
|
3329
|
+
|
|
3330
|
+
**Status:** ✅ Fully Implemented & Documented
|
|
3331
|
+
**Next:** Integration with E11y::Rack::Middleware + Prometheus Exporter
|
|
3332
|
+
**Estimated Implementation:** 8 weeks (phased rollout)
|
|
3333
|
+
**Impact:**
|
|
3334
|
+
- Per-endpoint SLO visibility (100% coverage)
|
|
3335
|
+
- 5-minute incident detection (vs. 30-minute baseline)
|
|
3336
|
+
- Error budget-driven deployment decisions
|
|
3337
|
+
- Zero-config for simple apps, full control for complex apps
|