e11y 0.2.0 → 1.0.0

Files changed (230)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +130 -10
  3. data/CHANGELOG.md +56 -1
  4. data/CLAUDE.md +168 -0
  5. data/CONTRIBUTING.md +640 -0
  6. data/README.md +134 -702
  7. data/RELEASE.md +18 -3
  8. data/Rakefile +108 -29
  9. data/config/README.md +1 -1
  10. data/config/loki-local-config.yaml +12 -0
  11. data/config/otel-collector-config.yaml +44 -0
  12. data/cucumber.yml +1 -0
  13. data/docker-compose.yml +18 -2
  14. data/docs/ADAPTERS.md +76 -0
  15. data/docs/ADAPTIVE_SAMPLING.md +59 -0
  16. data/docs/COMPARISON.md +104 -0
  17. data/docs/CONFIGURATION.md +52 -0
  18. data/docs/DISTRIBUTED_TRACING.md +44 -0
  19. data/docs/LIMITATIONS.md +13 -0
  20. data/docs/METRICS_DSL.md +84 -0
  21. data/docs/PERFORMANCE.md +60 -0
  22. data/docs/PII_FILTERING.md +40 -0
  23. data/docs/PRESETS.md +65 -0
  24. data/docs/QUICK-START.md +546 -587
  25. data/docs/RAILS_INTEGRATION.md +29 -0
  26. data/docs/SCHEMA_VALIDATION.md +63 -0
  27. data/docs/SLO-PROMQL-ALERTS.md +161 -0
  28. data/docs/TESTING.md +69 -0
  29. data/docs/{ADR-001-architecture.md → architecture/ADR-001-architecture.md} +35 -64
  30. data/docs/{ADR-002-metrics-yabeda.md → architecture/ADR-002-metrics-yabeda.md} +62 -236
  31. data/docs/{ADR-003-slo-observability.md → architecture/ADR-003-slo-observability.md} +27 -466
  32. data/docs/{ADR-004-adapter-architecture.md → architecture/ADR-004-adapter-architecture.md} +163 -146
  33. data/docs/{ADR-005-tracing-context.md → architecture/ADR-005-tracing-context.md} +10 -9
  34. data/docs/{ADR-006-security-compliance.md → architecture/ADR-006-security-compliance.md} +184 -191
  35. data/docs/{ADR-007-opentelemetry-integration.md → architecture/ADR-007-opentelemetry-integration.md} +3 -21
  36. data/docs/{ADR-008-rails-integration.md → architecture/ADR-008-rails-integration.md} +209 -339
  37. data/docs/{ADR-009-cost-optimization.md → architecture/ADR-009-cost-optimization.md} +45 -54
  38. data/docs/architecture/ADR-010-developer-experience.md +522 -0
  39. data/docs/{ADR-011-testing-strategy.md → architecture/ADR-011-testing-strategy.md} +41 -83
  40. data/docs/{ADR-013-reliability-error-handling.md → architecture/ADR-013-reliability-error-handling.md} +37 -12
  41. data/docs/{ADR-014-event-driven-slo.md → architecture/ADR-014-event-driven-slo.md} +12 -24
  42. data/docs/{ADR-015-middleware-order.md → architecture/ADR-015-middleware-order.md} +23 -41
  43. data/docs/{ADR-016-self-monitoring-slo.md → architecture/ADR-016-self-monitoring-slo.md} +52 -349
  44. data/docs/{ADR-017-multi-rails-compatibility.md → architecture/ADR-017-multi-rails-compatibility.md} +4 -11
  45. data/docs/architecture/ADR-018-memory-optimization.md +366 -0
  46. data/docs/{ADR-INDEX.md → architecture/ADR-INDEX.md} +11 -6
  47. data/docs/{00-ICP-AND-TIMELINE.md → prd/00-ICP-AND-TIMELINE.md} +6 -6
  48. data/docs/{01-SCALE-REQUIREMENTS.md → prd/01-SCALE-REQUIREMENTS.md} +6 -6
  49. data/docs/prd/01-overview-vision.md +19 -14
  50. data/docs/use_cases/README.md +22 -23
  51. data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +50 -44
  52. data/docs/use_cases/UC-002-business-event-tracking.md +26 -95
  53. data/docs/use_cases/UC-003-event-metrics.md +66 -0
  54. data/docs/use_cases/UC-004-zero-config-slo-tracking.md +42 -101
  55. data/docs/use_cases/UC-005-sentry-integration.md +13 -15
  56. data/docs/use_cases/UC-006-trace-context-management.md +30 -28
  57. data/docs/use_cases/UC-007-pii-filtering.md +35 -87
  58. data/docs/use_cases/UC-008-opentelemetry-integration.md +51 -89
  59. data/docs/use_cases/UC-009-multi-service-tracing.md +4 -4
  60. data/docs/use_cases/UC-010-background-job-tracking.md +5 -5
  61. data/docs/use_cases/UC-011-rate-limiting.md +95 -168
  62. data/docs/use_cases/UC-012-audit-trail.md +21 -46
  63. data/docs/use_cases/UC-013-high-cardinality-protection.md +29 -167
  64. data/docs/use_cases/UC-014-adaptive-sampling.md +2 -2
  65. data/docs/use_cases/UC-015-cost-optimization.md +46 -99
  66. data/docs/use_cases/UC-016-rails-logger-migration.md +39 -213
  67. data/docs/use_cases/UC-017-local-development.md +203 -777
  68. data/docs/use_cases/UC-018-testing-events.md +3 -3
  69. data/docs/use_cases/UC-019-retention-based-routing.md +53 -106
  70. data/docs/use_cases/UC-020-event-versioning.md +8 -9
  71. data/docs/use_cases/UC-021-error-handling-retry-dlq.md +18 -22
  72. data/docs/use_cases/UC-022-event-registry.md +15 -21
  73. data/docs/use_cases/backlog.md +119 -87
  74. data/e11y.gemspec +2 -2
  75. data/gems/e11y-devtools/README.md +136 -0
  76. data/gems/e11y-devtools/config/routes.rb +8 -0
  77. data/gems/e11y-devtools/e11y-devtools.gemspec +25 -0
  78. data/gems/e11y-devtools/exe/e11y +34 -0
  79. data/gems/e11y-devtools/lib/e11y/devtools/mcp/server.rb +96 -0
  80. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tool_base.rb +25 -0
  81. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/clear.rb +31 -0
  82. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/errors.rb +35 -0
  83. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/event_detail.rb +33 -0
  84. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/events_by_trace.rb +33 -0
  85. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/interactions.rb +40 -0
  86. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/recent_events.rb +34 -0
  87. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/search.rb +34 -0
  88. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/stats.rb +30 -0
  89. data/gems/e11y-devtools/lib/e11y/devtools/overlay/assets/overlay.js +115 -0
  90. data/gems/e11y-devtools/lib/e11y/devtools/overlay/controller.rb +54 -0
  91. data/gems/e11y-devtools/lib/e11y/devtools/overlay/engine.rb +26 -0
  92. data/gems/e11y-devtools/lib/e11y/devtools/overlay/middleware.rb +80 -0
  93. data/gems/e11y-devtools/lib/e11y/devtools/overlay/rails_controller.rb +42 -0
  94. data/gems/e11y-devtools/lib/e11y/devtools/tui/app.rb +262 -0
  95. data/gems/e11y-devtools/lib/e11y/devtools/tui/grouping.rb +66 -0
  96. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_detail.rb +62 -0
  97. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_list.rb +70 -0
  98. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/interaction_list.rb +47 -0
  99. data/gems/e11y-devtools/lib/e11y/devtools/version.rb +8 -0
  100. data/gems/e11y-devtools/lib/e11y/devtools.rb +13 -0
  101. data/gems/e11y-devtools/spec/e11y/devtools/mcp/tools_spec.rb +107 -0
  102. data/gems/e11y-devtools/spec/e11y/devtools/overlay/controller_spec.rb +58 -0
  103. data/gems/e11y-devtools/spec/e11y/devtools/overlay/middleware_spec.rb +46 -0
  104. data/gems/e11y-devtools/spec/e11y/devtools/tui/app_spec.rb +85 -0
  105. data/gems/e11y-devtools/spec/e11y/devtools/tui/grouping_spec.rb +64 -0
  106. data/gems/e11y-devtools/spec/spec_helper.rb +5 -0
  107. data/gems/e11y-devtools/spec/tui/widgets/event_list_spec.rb +44 -0
  108. data/gems/e11y-devtools/spec/tui/widgets/interaction_list_spec.rb +62 -0
  109. data/lib/e11y/adapters/audit_encrypted.rb +53 -11
  110. data/lib/e11y/adapters/base.rb +33 -34
  111. data/lib/e11y/adapters/dev_log/file_store.rb +143 -0
  112. data/lib/e11y/adapters/dev_log/query.rb +219 -0
  113. data/lib/e11y/adapters/dev_log.rb +118 -0
  114. data/lib/e11y/adapters/file.rb +3 -6
  115. data/lib/e11y/adapters/in_memory.rb +52 -5
  116. data/lib/e11y/adapters/in_memory_test.rb +29 -0
  117. data/lib/e11y/adapters/loki.rb +58 -23
  118. data/lib/e11y/adapters/null.rb +82 -0
  119. data/lib/e11y/adapters/opentelemetry_collector.rb +183 -0
  120. data/lib/e11y/adapters/otel_logs.rb +136 -23
  121. data/lib/e11y/adapters/sentry.rb +4 -7
  122. data/lib/e11y/adapters/stdout.rb +73 -7
  123. data/lib/e11y/adapters/yabeda.rb +153 -29
  124. data/lib/e11y/buffers/adaptive_buffer.rb +3 -17
  125. data/lib/e11y/buffers/{request_scoped_buffer.rb → ephemeral_buffer.rb} +72 -58
  126. data/lib/e11y/buffers/ring_buffer.rb +3 -16
  127. data/lib/e11y/configuration.rb +272 -0
  128. data/lib/e11y/console.rb +10 -17
  129. data/lib/e11y/current.rb +53 -1
  130. data/lib/e11y/debug/pipeline_inspector.rb +96 -0
  131. data/lib/e11y/documentation/generator.rb +48 -0
  132. data/lib/e11y/event/base.rb +176 -82
  133. data/lib/e11y/event/value_sampling_config.rb +1 -5
  134. data/lib/e11y/events/rails/database/query.rb +1 -4
  135. data/lib/e11y/events/rails/job/failed.rb +2 -0
  136. data/lib/e11y/instruments/active_job.rb +46 -12
  137. data/lib/e11y/instruments/rails_instrumentation.rb +49 -24
  138. data/lib/e11y/instruments/sidekiq.rb +137 -31
  139. data/lib/e11y/linters/base.rb +11 -0
  140. data/lib/e11y/linters/pii/pii_declaration_linter.rb +120 -0
  141. data/lib/e11y/linters/slo/config_consistency_linter.rb +76 -0
  142. data/lib/e11y/linters/slo/explicit_declaration_linter.rb +36 -0
  143. data/lib/e11y/linters/slo/slo_status_from_linter.rb +41 -0
  144. data/lib/e11y/logger/bridge.rb +26 -7
  145. data/lib/e11y/metrics/cardinality_protection.rb +10 -15
  146. data/lib/e11y/metrics/cardinality_tracker.rb +16 -6
  147. data/lib/e11y/metrics/registry.rb +3 -5
  148. data/lib/e11y/metrics/test_backend.rb +62 -0
  149. data/lib/e11y/metrics.rb +56 -10
  150. data/lib/e11y/middleware/adapter_resolver.rb +40 -0
  151. data/lib/e11y/middleware/audit_signing.rb +43 -6
  152. data/lib/e11y/middleware/baggage_protection.rb +75 -0
  153. data/lib/e11y/middleware/dev_log_source.rb +24 -0
  154. data/lib/e11y/middleware/event_slo.rb +23 -9
  155. data/lib/e11y/middleware/otel_span.rb +23 -0
  156. data/lib/e11y/middleware/pii_filter.rb +104 -75
  157. data/lib/e11y/middleware/rate_limiting.rb +54 -27
  158. data/lib/e11y/middleware/request.rb +70 -23
  159. data/lib/e11y/middleware/routing.rb +78 -21
  160. data/lib/e11y/middleware/sampling.rb +66 -17
  161. data/lib/e11y/middleware/self_monitoring_emit.rb +39 -0
  162. data/lib/e11y/middleware/trace_context.rb +45 -10
  163. data/lib/e11y/middleware/track_latency.rb +34 -0
  164. data/lib/e11y/middleware/validation.rb +7 -16
  165. data/lib/e11y/middleware/versioning.rb +26 -22
  166. data/lib/e11y/opentelemetry/semantic_conventions.rb +109 -0
  167. data/lib/e11y/opentelemetry/span_creator.rb +142 -0
  168. data/lib/e11y/pii/patterns.rb +12 -1
  169. data/lib/e11y/pipeline/builder.rb +1 -1
  170. data/lib/e11y/presets/audit_event.rb +13 -2
  171. data/lib/e11y/railtie.rb +52 -15
  172. data/lib/e11y/registry.rb +306 -0
  173. data/lib/e11y/reliability/circuit_breaker.rb +19 -21
  174. data/lib/e11y/reliability/dlq/base.rb +71 -0
  175. data/lib/e11y/reliability/dlq/file_adapter.rb +301 -0
  176. data/lib/e11y/reliability/dlq/file_storage.rb +63 -34
  177. data/lib/e11y/reliability/dlq/filter.rb +37 -54
  178. data/lib/e11y/reliability/retry_handler.rb +26 -29
  179. data/lib/e11y/reliability/retry_rate_limiter.rb +3 -11
  180. data/lib/e11y/sampling/error_spike_detector.rb +0 -2
  181. data/lib/e11y/sampling/load_monitor.rb +5 -9
  182. data/lib/e11y/sampling/stratified_tracker.rb +18 -0
  183. data/lib/e11y/self_monitoring/buffer_monitor.rb +2 -0
  184. data/lib/e11y/self_monitoring/performance_monitor.rb +19 -61
  185. data/lib/e11y/self_monitoring/reliability_monitor.rb +4 -74
  186. data/lib/e11y/slo/config_loader.rb +40 -0
  187. data/lib/e11y/slo/config_validator.rb +58 -0
  188. data/lib/e11y/slo/dashboard_generator.rb +122 -0
  189. data/lib/e11y/slo/event_driven.rb +8 -0
  190. data/lib/e11y/slo/tracker.rb +31 -4
  191. data/lib/e11y/testing/have_tracked_event_matcher.rb +190 -0
  192. data/lib/e11y/testing/rspec_matchers.rb +21 -0
  193. data/lib/e11y/testing/snapshot_matcher.rb +86 -0
  194. data/lib/e11y/trace_context/sampler.rb +35 -0
  195. data/lib/e11y/tracing/faraday_middleware.rb +31 -0
  196. data/lib/e11y/tracing/net_http_patch.rb +33 -0
  197. data/lib/e11y/tracing/propagator.rb +116 -0
  198. data/lib/e11y/tracing.rb +47 -0
  199. data/lib/e11y/version.rb +1 -1
  200. data/lib/e11y/versioning/version_extractor.rb +32 -0
  201. data/lib/e11y.rb +141 -265
  202. data/lib/generators/e11y/event/event_generator.rb +22 -0
  203. data/lib/generators/e11y/event/templates/event.rb.tt +16 -0
  204. data/lib/generators/e11y/grafana_dashboard/grafana_dashboard_generator.rb +30 -0
  205. data/lib/generators/e11y/grafana_dashboard/templates/e11y_dashboard.json +81 -0
  206. data/lib/generators/e11y/install/install_generator.rb +34 -0
  207. data/lib/generators/e11y/install/templates/e11y.rb +239 -0
  208. data/lib/generators/e11y/prometheus_alerts/prometheus_alerts_generator.rb +29 -0
  209. data/lib/generators/e11y/prometheus_alerts/templates/e11y_alerts.yml +28 -0
  210. data/lib/tasks/e11y_docs.rake +30 -0
  211. data/lib/tasks/e11y_events.rake +71 -0
  212. data/lib/tasks/e11y_lint.rake +91 -0
  213. data/lib/tasks/e11y_slo.rake +29 -0
  214. metadata +129 -39
  215. data/docs/ADR-010-developer-experience.md +0 -2166
  216. data/docs/API-REFERENCE-L28.md +0 -914
  217. data/docs/COMPREHENSIVE-CONFIGURATION.md +0 -2366
  218. data/docs/CONTRIBUTING.md +0 -312
  219. data/docs/IMPLEMENTATION_NOTES.md +0 -2804
  220. data/docs/IMPLEMENTATION_PLAN.md +0 -1971
  221. data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +0 -586
  222. data/docs/PLAN.md +0 -148
  223. data/docs/README.md +0 -296
  224. data/docs/design/00-memory-optimization.md +0 -593
  225. data/docs/guides/MIGRATION-L27-L28.md +0 -692
  226. data/docs/guides/PERFORMANCE-BENCHMARKS.md +0 -434
  227. data/docs/guides/README.md +0 -44
  228. data/docs/use_cases/UC-003-pattern-based-metrics.md +0 -1627
  229. data/lib/e11y/adapters/registry.rb +0 -141
  230. data/docs/{ADR-012-event-evolution.md → architecture/ADR-012-event-evolution.md} +0 -0
@@ -17,8 +17,7 @@ This ADR covers **HTTP/Job SLO** (infrastructure reliability):
17
17
  - ✅ Zero-config SLO for HTTP requests (99.9% availability)
18
18
  - ✅ Zero-config SLO for Sidekiq/ActiveJob (99.5% success rate)
19
19
  - ✅ Per-endpoint SLO configuration in `slo.yml`
20
- - ✅ Multi-window burn rate alerts (5 min detection)
21
- - ✅ Error budget management & deployment gates
20
+ - ✅ PromQL queries and alert rules: see [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md)
22
21
 
23
22
  **For Event-based SLO** (business logic reliability like "order creation success rate"), see **ADR-014**.
24
23
 
@@ -32,11 +31,13 @@ This ADR covers **HTTP/Job SLO** (infrastructure reliability):
32
31
  2. [Architecture Overview](#2-architecture-overview)
33
32
  3. [Multi-Level SLO Strategy](#3-multi-level-slo-strategy)
34
33
  4. [Per-Endpoint SLO Configuration](#4-per-endpoint-slo-configuration)
35
- 5. [Multi-Window Multi-Burn Rate Alerts](#5-multi-window-multi-burn-rate-alerts)
34
+ 5. [PromQL & Alerts](#5-promql--alerts)
36
35
  6. [SLO Config Validation & Linting](#6-slo-config-validation--linting)
37
- 7. [Error Budget Management](#7-error-budget-management)
38
- 8. [Dashboard & Reporting](#8-dashboard--reporting)
36
+ 7. [Dashboard & Reporting](#7-dashboard--reporting)
37
+ 8. [Production Best Practices & Edge Cases](#8-production-best-practices--edge-cases)
39
38
  9. [Trade-offs](#9-trade-offs)
39
+ 10. [Real-World Configuration Examples](#10-real-world-configuration-examples)
40
+ 11. [Summary & Next Steps](#11-summary--next-steps)
40
41
 
41
42
  ---
42
43
 
@@ -1225,185 +1226,9 @@ end
1225
1226
 
1226
1227
  ---
1227
1228
 
1228
- ## 5. Multi-Window Multi-Burn Rate Alerts
1229
+ ## 5. PromQL & Alerts
1229
1230
 
1230
- ### 5.1. Why Multi-Window? (Google SRE Best Practice)
1231
-
1232
- **Problem with Single Window:**
1233
- ```
1234
- Single 30-day window:
1235
- - Slow reaction (hours to detect)
1236
- - Hard to distinguish acute vs chronic issues
1237
-
1238
- Single 5-minute window:
1239
- - Fast reaction
1240
- - High false positive rate (noise)
1241
- ```
1242
-
1243
- **Solution: Multi-Window Multi-Burn Rate:**
1244
- ```
1245
- 3 windows simultaneously:
1246
- - 1 hour: Fast burn (acute issue, page immediately)
1247
- - 6 hours: Medium burn (developing issue, warn team)
1248
- - 3 days: Slow burn (chronic issue, investigate)
1249
- ```
1250
-
1251
- ### 5.2. Burn Rate Calculation
1252
-
1253
- **Formula:**
1254
- ```
1255
- Burn Rate = (Actual Error Rate) / (Error Budget per Hour)
1256
-
1257
- For 99.9% SLO (30-day window):
1258
- - Error Budget = 0.1% = 0.001
1259
- - Error Budget per Hour = 0.001 / (30 * 24) = 0.00000139
1260
-
1261
- Fast Burn (1h window):
1262
- - Threshold = 14.4x burn rate
1263
- - Means: consuming 2% of 30-day budget in 1 hour
1264
- - Alert fires in 5 minutes
1265
-
1266
- Medium Burn (6h window):
1267
- - Threshold = 6.0x burn rate
1268
- - Means: consuming 5% of 30-day budget in 6 hours
1269
- - Alert fires in 30 minutes
1270
-
1271
- Slow Burn (3d window):
1272
- - Threshold = 1.0x burn rate
1273
- - Means: consuming 10% of 30-day budget in 3 days
1274
- - Alert fires in 6 hours
1275
- ```
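The thresholds in the removed block can be checked with a little arithmetic. The sketch below assumes the conventional definition, in which burn rate is the observed error ratio divided by the total error budget (1 minus the SLO target), so a burn rate of 1.0 spends the budget exactly over the 30-day window. The helper names (`burn_rate`, `budget_fraction_consumed`, `hours_to_exhaustion`) are illustrative and not part of e11y. Under this definition, dividing an error ratio by the per-hour budget, as the removed formula appears to do, would inflate the result by a factor of 720.

```ruby
WINDOW_HOURS = 30 * 24 # 30-day SLO window, in hours

# Burn rate as a multiple of the sustainable error ratio.
def burn_rate(error_ratio, slo_target)
  error_ratio / (1.0 - slo_target)
end

# Fraction of the 30-day budget spent if `rate` is sustained for `hours`.
def budget_fraction_consumed(rate, hours)
  rate * hours / WINDOW_HOURS
end

# Hours until the whole budget is gone at a constant burn rate.
def hours_to_exhaustion(rate)
  WINDOW_HOURS / rate
end

# Thresholds and windows from the removed section: 1h, 6h, 3d (72h).
[[14.4, 1], [6.0, 6], [1.0, 72]].each do |rate, hours|
  pct = (budget_fraction_consumed(rate, hours) * 100).round(1)
  puts "#{rate}x for #{hours}h consumes #{pct}% of the budget"
end
```

The three results match the 2%, 5% and 10% figures quoted in the removed text, which suggests the thresholds themselves were consistent even though the divisor was not.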
1276
-
1277
- ### 5.3. Prometheus Alert Rules (Per-Endpoint!)
1278
-
1279
- ```yaml
1280
- # prometheus/alerts/e11y_slo_per_endpoint.yml
1281
- groups:
1282
- - name: e11y_slo_per_endpoint
1283
- interval: 30s # Check every 30 seconds
1284
- rules:
1285
- # ===== FAST BURN (1h window, 5 min alert) =====
1286
- - alert: E11ySLOFastBurn_CreateOrder
1287
- expr: |
1288
- (
1289
- # Error rate in last 1 hour
1290
- sum(rate(http_requests_total{
1291
- controller="Api::OrdersController",
1292
- action="create",
1293
- status=~"5.."
1294
- }[1h]))
1295
- /
1296
- sum(rate(http_requests_total{
1297
- controller="Api::OrdersController",
1298
- action="create"
1299
- }[1h]))
1300
- )
1301
- /
1302
- # Error budget per hour (0.001 / 720 hours)
1303
- 0.00000139
1304
- > 14.4 # 14.4x burn rate = 2% of 30-day budget in 1h
1305
- for: 5m # Alert after 5 minutes
1306
- labels:
1307
- severity: critical
1308
- endpoint: "POST /api/orders"
1309
- controller: "Api::OrdersController"
1310
- action: "create"
1311
- burn_window: "1h"
1312
- annotations:
1313
- summary: "CRITICAL: Fast burn on {{ $labels.endpoint }}"
1314
- description: |
1315
- Error rate is 14.4x higher than sustainable rate.
1316
- Burning 2% of 30-day error budget in 1 hour.
1317
- Current burn rate: {{ $value | humanize }}x
1318
-
1319
- Impact: Will exhaust error budget in {{ div 720 $value | humanize }} hours
1320
-
1321
- Dashboard: https://grafana/d/e11y-slo?var-endpoint=orders_create
1322
- Runbook: https://wiki/runbooks/fast-burn-orders
1323
-
1324
- # ===== MEDIUM BURN (6h window, 30 min alert) =====
1325
- - alert: E11ySLOMediumBurn_CreateOrder
1326
- expr: |
1327
- (
1328
- sum(rate(http_requests_total{
1329
- controller="Api::OrdersController",
1330
- action="create",
1331
- status=~"5.."
1332
- }[6h]))
1333
- /
1334
- sum(rate(http_requests_total{
1335
- controller="Api::OrdersController",
1336
- action="create"
1337
- }[6h]))
1338
- )
1339
- /
1340
- 0.00000139
1341
- > 6.0 # 6x burn rate = 5% of 30-day budget in 6h
1342
- for: 30m # Alert after 30 minutes
1343
- labels:
1344
- severity: warning
1345
- endpoint: "POST /api/orders"
1346
- controller: "Api::OrdersController"
1347
- action: "create"
1348
- burn_window: "6h"
1349
- annotations:
1350
- summary: "WARNING: Medium burn on {{ $labels.endpoint }}"
1351
- description: |
1352
- Error rate is 6x higher than sustainable rate.
1353
- Burning 5% of 30-day error budget in 6 hours.
1354
- Current burn rate: {{ $value | humanize }}x
1355
-
1356
- # ===== SLOW BURN (3d window, 6h alert) =====
1357
- - alert: E11ySLOSlowBurn_CreateOrder
1358
- expr: |
1359
- (
1360
- sum(rate(http_requests_total{
1361
- controller="Api::OrdersController",
1362
- action="create",
1363
- status=~"5.."
1364
- }[3d]))
1365
- /
1366
- sum(rate(http_requests_total{
1367
- controller="Api::OrdersController",
1368
- action="create"
1369
- }[3d]))
1370
- )
1371
- /
1372
- 0.00000139
1373
- > 1.0 # 1x burn rate = 10% of 30-day budget in 3 days
1374
- for: 6h # Alert after 6 hours
1375
- labels:
1376
- severity: info
1377
- endpoint: "POST /api/orders"
1378
- controller: "Api::OrdersController"
1379
- action: "create"
1380
- burn_window: "3d"
1381
- annotations:
1382
- summary: "INFO: Slow burn on {{ $labels.endpoint }}"
1383
- description: |
1384
- Chronic issue: consuming error budget at steady rate.
1385
- Burning 10% of 30-day error budget in 3 days.
1386
-
1387
- This is a trend, not an emergency. Investigate root cause.
1388
-
1389
- # ===== LATENCY SLO (optional per endpoint) =====
1390
- - alert: E11ySLOLatency_CreateOrder
1391
- expr: |
1392
- histogram_quantile(0.99,
1393
- sum(rate(http_request_duration_seconds_bucket{
1394
- controller="Api::OrdersController",
1395
- action="create"
1396
- }[5m])) by (le)
1397
- ) > 0.5 # 500ms p99 threshold
1398
- for: 5m
1399
- labels:
1400
- severity: warning
1401
- endpoint: "POST /api/orders"
1402
- slo_type: "latency_p99"
1403
- annotations:
1404
- summary: "Latency SLO violation: {{ $labels.endpoint }}"
1405
- description: "P99 latency is {{ $value | humanize }}s (threshold: 500ms)"
1406
- ```
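The three burn-rate rules in the removed YAML differ only in window, threshold, `for` duration and severity, so they could be generated from a table rather than written by hand. The following is a minimal sketch of that idea using only Ruby's standard `yaml` library. `BURN_WINDOWS`, `burn_rate_expr` and `burn_rate_rule` are hypothetical names, not e11y APIs, and the expression divides by the total error budget (0.001 for a 99.9% target), which is the usual burn-rate convention rather than the per-hour constant used in the removed rules.

```ruby
require "yaml"

# Window, burn-rate threshold, `for` duration and severity from the removed rules.
BURN_WINDOWS = [
  { window: "1h", threshold: 14.4, for: "5m",  severity: "critical" },
  { window: "6h", threshold: 6.0,  for: "30m", severity: "warning" },
  { window: "3d", threshold: 1.0,  for: "6h",  severity: "info" }
].freeze

# Build the PromQL burn-rate expression for one endpoint and window.
def burn_rate_expr(controller:, action:, window:, budget:)
  selector = %(controller="#{controller}",action="#{action}")
  errors = %(sum(rate(http_requests_total{#{selector},status=~"5.."}[#{window}])))
  total  = %(sum(rate(http_requests_total{#{selector}}[#{window}])))
  "(#{errors} / #{total}) / #{budget}"
end

# Build one alert rule as a plain Hash ready for YAML serialization.
def burn_rate_rule(name:, controller:, action:, slo_target:, cfg:)
  budget = (1.0 - slo_target).round(6) # total error budget, e.g. 0.001
  expr = burn_rate_expr(controller: controller, action: action,
                        window: cfg[:window], budget: budget)
  {
    "alert" => "E11ySLOBurn_#{name}_#{cfg[:window]}",
    "expr" => "#{expr} > #{cfg[:threshold]}",
    "for" => cfg[:for],
    "labels" => { "severity" => cfg[:severity], "burn_window" => cfg[:window] }
  }
end

rules = BURN_WINDOWS.map do |cfg|
  burn_rate_rule(name: "CreateOrder", controller: "Api::OrdersController",
                 action: "create", slo_target: 0.999, cfg: cfg)
end

puts({ "groups" => [{ "name" => "e11y_slo_per_endpoint", "rules" => rules }] }.to_yaml)
```

Keeping the windows in one table means a change to a threshold or `for` duration is made once per window instead of once per endpoint.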
1231
+ PromQL queries and Prometheus alert rules: see [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md).
1407
1232
 
1408
1233
  ---
1409
1234
 
@@ -2389,264 +2214,13 @@ RSpec.describe E11y::SLO::ConfigValidator do
2389
2214
  end
2390
2215
  end
2391
2216
 
2392
- # spec/lib/e11y/slo/error_budget_spec.rb
2393
- RSpec.describe E11y::SLO::ErrorBudget do
2394
- let(:slo_config) do
2395
- {
2396
- 'availability' => { 'target' => 0.999 },
2397
- 'window' => '30d'
2398
- }
2399
- end
2400
-
2401
- let(:budget) do
2402
- described_class.new('OrdersController', 'create', slo_config)
2403
- end
2404
-
2405
- before do
2406
- # Mock Prometheus query
2407
- allow(E11y::Metrics).to receive(:query_prometheus).and_return(
2408
- { 'data' => { 'result' => [{ 'value' => [Time.now.to_i, error_rate.to_s] }] } }
2409
- )
2410
- end
2411
-
2412
- describe '#total' do
2413
- it 'calculates total error budget' do
2414
- expect(budget.total).to eq(0.001) # 1 - 0.999
2415
- end
2416
- end
2417
-
2418
- describe '#consumed' do
2419
- let(:error_rate) { 0.0005 } # 0.05% error rate
2420
-
2421
- it 'calculates consumed error budget' do
2422
- expect(budget.consumed).to eq(0.0005)
2423
- end
2424
- end
2425
-
2426
- describe '#remaining' do
2427
- let(:error_rate) { 0.0005 }
2428
-
2429
- it 'calculates remaining error budget' do
2430
- expect(budget.remaining).to eq(0.0005) # 0.001 - 0.0005
2431
- end
2432
-
2433
- context 'when consumed exceeds total' do
2434
- let(:error_rate) { 0.002 } # 0.2% > 0.1%
2435
-
2436
- it 'never goes negative' do
2437
- expect(budget.remaining).to eq(0.0)
2438
- end
2439
- end
2440
- end
2441
-
2442
- describe '#exhausted?' do
2443
- context 'when budget remaining' do
2444
- let(:error_rate) { 0.0005 }
2445
-
2446
- it 'returns false' do
2447
- expect(budget).not_to be_exhausted
2448
- end
2449
- end
2450
-
2451
- context 'when budget exhausted' do
2452
- let(:error_rate) { 0.002 } # Exceeds 0.001
2453
-
2454
- it 'returns true' do
2455
- expect(budget).to be_exhausted
2456
- end
2457
- end
2458
- end
2459
-
2460
- describe '#can_deploy?' do
2461
- context 'with sufficient budget' do
2462
- let(:error_rate) { 0.0002 } # 20% consumed, 80% remaining
2463
-
2464
- it 'allows deployment' do
2465
- expect(budget.can_deploy?(20)).to be true
2466
- end
2467
- end
2468
-
2469
- context 'with insufficient budget' do
2470
- let(:error_rate) { 0.0009 } # 90% consumed, 10% remaining
2471
-
2472
- it 'blocks deployment' do
2473
- expect(budget.can_deploy?(20)).to be false
2474
- end
2475
- end
2476
- end
2477
- end
2478
- ```
2479
-
2480
- ---
2481
-
2482
- ## 7. Error Budget Management
2483
-
2484
- ### 7.1. Error Budget Calculation (Per-Endpoint)
2485
-
2486
- ```ruby
2487
- # lib/e11y/slo/error_budget.rb
2488
- module E11y
2489
- module SLO
2490
- class ErrorBudget
2491
- def initialize(controller, action, slo_config)
2492
- @controller = controller
2493
- @action = action
2494
- @slo_config = slo_config
2495
- @target = slo_config['availability_target'] || 0.999
2496
- @window = parse_window(slo_config['window'] || '30d')
2497
- end
2498
-
2499
- # Total error budget (e.g., 0.001 for 99.9%)
2500
- def total
2501
- 1.0 - @target
2502
- end
2503
-
2504
- # Consumed error budget in current window
2505
- def consumed
2506
- error_rate = calculate_error_rate(@window)
2507
- [error_rate, total].min # Cap at total budget
2508
- end
2509
-
2510
- # Remaining error budget
2511
- def remaining
2512
- [total - consumed, 0.0].max # Never negative
2513
- end
2514
-
2515
- # Percentage of error budget consumed
2516
- def percent_consumed
2517
- return 0.0 if total.zero?
2518
- (consumed / total) * 100
2519
- end
2520
-
2521
- # Is error budget exhausted?
2522
- def exhausted?
2523
- remaining <= 0
2524
- end
2525
-
2526
- # Time until error budget exhaustion (at current burn rate)
2527
- def time_until_exhaustion
2528
- burn_rate_per_hour = calculate_burn_rate(1.hour)
2529
- return Float::INFINITY if burn_rate_per_hour <= 0
2530
-
2531
- hours_remaining = remaining / burn_rate_per_hour
2532
- hours_remaining.hours
2533
- end
2534
-
2535
- # Can we deploy? (have enough error budget?)
2536
- def can_deploy?(minimum_budget_percent = 20)
2537
- percent_remaining = (remaining / total) * 100
2538
- percent_remaining >= minimum_budget_percent
2539
- end
2540
-
2541
- private
2542
-
2543
- def calculate_error_rate(window)
2544
- # Query Prometheus for actual error rate
2545
- query = <<~PROMQL
2546
- sum(rate(http_requests_total{
2547
- controller="#{@controller}",
2548
- action="#{@action}",
2549
- status=~"5.."
2550
- }[#{window}]))
2551
- /
2552
- sum(rate(http_requests_total{
2553
- controller="#{@controller}",
2554
- action="#{@action}"
2555
- }[#{window}]))
2556
- PROMQL
2557
-
2558
- result = E11y::Metrics.query_prometheus(query)
2559
- result.dig('data', 'result', 0, 'value', 1).to_f
2560
- end
2561
-
2562
- def calculate_burn_rate(window)
2563
- error_rate = calculate_error_rate(window)
2564
- error_budget_per_hour = total / (@window.to_f / 1.hour)
2565
-
2566
- error_rate / error_budget_per_hour
2567
- end
2568
-
2569
- def parse_window(window)
2570
- case window
2571
- when /(\d+)d/
2572
- $1.to_i.days
2573
- when /(\d+)h/
2574
- $1.to_i.hours
2575
- when /(\d+)m/
2576
- $1.to_i.minutes
2577
- else
2578
- 30.days # Default
2579
- end
2580
- end
2581
- end
2582
- end
2583
- end
2584
- ```
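The budget arithmetic in the removed class can be exercised without Prometheus or ActiveSupport. The sketch below is a simplified stand-in, not the removed implementation: it takes the observed error ratio as a constructor argument instead of querying it, and `BudgetSketch` and its method names are chosen for illustration. It also shows why exact float comparisons such as `eq(0.001)` in the removed spec are fragile: `1.0 - 0.999` is not exactly `0.001` in floating point, so a tolerance-based comparison is safer.

```ruby
# Error-budget arithmetic for a single endpoint (illustrative stand-in).
class BudgetSketch
  def initialize(target:, error_ratio:)
    @target = target
    @error_ratio = error_ratio
  end

  # Total budget, roughly 0.001 for a 99.9% target.
  def total
    1.0 - @target
  end

  # Budget spent so far, capped at the total.
  def consumed
    [@error_ratio, total].min
  end

  # Budget left, never negative.
  def remaining
    [total - consumed, 0.0].max
  end

  def percent_remaining
    return 0.0 if total.zero?

    remaining / total * 100
  end

  def exhausted?
    remaining <= 0
  end

  # Allow a deploy only if at least `minimum_percent` of the budget is left.
  def can_deploy?(minimum_percent = 20)
    percent_remaining >= minimum_percent
  end
end

healthy = BudgetSketch.new(target: 0.999, error_ratio: 0.0002)
burned  = BudgetSketch.new(target: 0.999, error_ratio: 0.0009)

puts healthy.percent_remaining.round(1)
puts healthy.can_deploy?
puts burned.can_deploy?
puts BudgetSketch.new(target: 0.999, error_ratio: 0.002).exhausted?
```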
2585
-
2586
- ### 7.2. Deployment Gate (Optional)
2587
-
2588
- ```ruby
2589
- # lib/e11y/slo/deployment_gate.rb
2590
- module E11y
2591
- module SLO
2592
- class DeploymentGate
2593
- def self.check!(minimum_budget_percent: 20)
2594
- config = E11y::SLO::ConfigLoader.load!
2595
-
2596
- critical_endpoints = config.endpoints.select do |ep|
2597
- ep.dig('slo', 'availability_target').to_f >= 0.999
2598
- end
2599
-
2600
- violations = []
2601
-
2602
- critical_endpoints.each do |endpoint|
2603
- controller = endpoint['controller']
2604
- action = endpoint['action']
2605
- slo_config = endpoint['slo']
2606
-
2607
- budget = ErrorBudget.new(controller, action, slo_config)
2608
-
2609
- unless budget.can_deploy?(minimum_budget_percent)
2610
- violations << {
2611
- endpoint: "#{controller}##{action}",
2612
- budget_remaining: budget.percent_remaining,
2613
- budget_consumed: budget.percent_consumed
2614
- }
2615
- end
2616
- end
2617
-
2618
- if violations.any?
2619
- raise DeploymentBlockedError.new(violations)
2620
- end
2621
-
2622
- true
2623
- end
2624
- end
2625
-
2626
- class DeploymentBlockedError < StandardError
2627
- attr_reader :violations
2628
-
2629
- def initialize(violations)
2630
- @violations = violations
2631
-
2632
- message = "❌ Deployment blocked: Insufficient error budget\n\n"
2633
- violations.each do |v|
2634
- message << " - #{v[:endpoint]}: #{v[:budget_remaining].round(1)}% remaining (need 20%+)\n"
2635
- end
2636
- message << "\nWait for error budget to recover before deploying."
2637
-
2638
- super(message)
2639
- end
2640
- end
2641
- end
2642
- end
2643
2217
  ```
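The removed deployment gate collected every critical endpoint whose remaining budget fell below a minimum and raised one error listing them all. A minimal sketch of that pattern follows, with remaining-budget percentages passed in as plain data. `check_gate!` and `GateBlockedError` are hypothetical names, and the percentages are made up for the example. The removed gate read `budget.percent_remaining`, a method the removed `ErrorBudget` class did not define, so this sketch deliberately does not depend on that class.

```ruby
# Raised when one or more endpoints lack the required error budget.
class GateBlockedError < StandardError
  attr_reader :violations

  def initialize(violations, minimum)
    @violations = violations
    lines = violations.map do |v|
      "  - #{v[:endpoint]}: #{v[:remaining].round(1)}% remaining (need #{minimum}%+)"
    end
    super("Deployment blocked: insufficient error budget\n#{lines.join("\n")}")
  end
end

# `budgets` maps "Controller#action" to the percentage of budget remaining.
def check_gate!(budgets, minimum_percent: 20)
  violations = budgets
    .select { |_endpoint, remaining| remaining < minimum_percent }
    .map { |endpoint, remaining| { endpoint: endpoint, remaining: remaining } }

  raise GateBlockedError.new(violations, minimum_percent) if violations.any?

  true
end

# Example data (invented): one endpoint below the 20% minimum, one above.
budgets = { "Api::OrdersController#create" => 10.0, "Api::UsersController#show" => 80.0 }

begin
  check_gate!(budgets)
rescue GateBlockedError => e
  puts e.message
end
```

Passing the minimum into the error keeps the message in step with the threshold actually used, whereas the removed error message hard-coded "need 20%+".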
2644
2218
 
2645
2219
  ---
2646
2220
 
2647
- ## 8. Dashboard & Reporting
2221
+ ## 7. Dashboard & Reporting
2648
2222
 
2649
- ### 8.1. Per-Endpoint Grafana Dashboard
2223
+ ### 7.1. Per-Endpoint Grafana Dashboard
2650
2224
 
2651
2225
  ```json
2652
2226
  {
@@ -2747,9 +2321,9 @@ end
2747
2321
 
2748
2322
  ---
2749
2323
 
2750
- ## 9. Production Best Practices & Edge Cases
2324
+ ## 8. Production Best Practices & Edge Cases
2751
2325
 
2752
- ### 9.1. Rollout Strategy
2326
+ ### 8.1. Rollout Strategy
2753
2327
 
2754
2328
  **Phase 1: Observability Only (1-2 weeks)**
2755
2329
  ```yaml
@@ -2832,7 +2406,7 @@ advanced:
2832
2406
  override_label: "deploy:emergency"
2833
2407
  ```
2834
2408
 
2835
- ### 9.2. Edge Cases & Solutions
2409
+ ### 8.2. Edge Cases & Solutions
2836
2410
 
2837
2411
  **Edge Case 1: Routes Not Loaded During Validation**
2838
2412
  ```ruby
@@ -2993,7 +2567,7 @@ def calculate_error_rate(window)
2993
2567
  end
2994
2568
  ```
2995
2569
 
2996
- ### 9.3. Monitoring the SLO System Itself
2570
+ ### 8.3. Monitoring the SLO System Itself
2997
2571
 
2998
2572
  **Self-Monitoring Metrics:**
2999
2573
  ```ruby
@@ -3032,7 +2606,7 @@ rate(e11y_slo_prometheus_query_errors_total[5m]) > 0.01
3032
2606
 
3033
2607
  ---
3034
2608
 
3035
- ## 10. Trade-offs
2609
+ ## 9. Trade-offs
3036
2610
 
3037
2611
  ### 9.1. Key Decisions
3038
2612
 
@@ -3064,9 +2638,9 @@ rate(e11y_slo_prometheus_query_errors_total[5m]) > 0.01
3064
2638
 
3065
2639
  ---
3066
2640
 
3067
- ## 11. Real-World Configuration Examples
2641
+ ## 10. Real-World Configuration Examples
3068
2642
 
3069
- ### 11.1. E-Commerce Platform
2643
+ ### 10.1. E-Commerce Platform
3070
2644
 
3071
2645
  ```yaml
3072
2646
  # config/slo.yml - E-commerce example
@@ -3158,7 +2732,7 @@ services:
3158
2732
  p99_target: 60000 # 60s
3159
2733
  ```
3160
2734
 
3161
- ### 11.2. SaaS API Platform
2735
+ ### 10.2. SaaS API Platform
3162
2736
 
3163
2737
  ```yaml
3164
2738
  # config/slo.yml - API platform example
@@ -3222,7 +2796,7 @@ services:
3222
2796
  p99_target: 10000 # 10s (external API)
3223
2797
  ```
3224
2798
 
3225
- ### 11.3. Internal Admin Tool
2799
+ ### 10.3. Internal Admin Tool
3226
2800
 
3227
2801
  ```yaml
3228
2802
  # config/slo.yml - Admin tool example
@@ -3273,9 +2847,9 @@ advanced:
3273
2847
 
3274
2848
  ---
3275
2849
 
3276
- ## 12. Summary & Next Steps
2850
+ ## 11. Summary & Next Steps
3277
2851
 
3278
- ### 12.1. What We Achieved
2852
+ ### 11.1. What We Achieved
3279
2853
 
3280
2854
  ✅ **Multi-level SLO strategy**: App-wide, service-level, per-endpoint
3281
2855
  ✅ **5-minute alert detection**: Multi-window burn rate (Google SRE 2026)
@@ -3283,12 +2857,13 @@ advanced:
3283
2857
  ✅ **Flexible latency SLO**: Optional per endpoint
3284
2858
  ✅ **Throughput SLO**: Min/max requests per second
3285
2859
  ✅ **Config validation & linting**: Prevents drift from reality
3286
- ✅ **Full implementation**: ConfigLoader, Validator, ErrorBudget with edge cases
2860
+ ✅ **Full implementation**: ConfigLoader, Validator with edge cases
2861
+ ✅ **PromQL & alerts**: See [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md)
3287
2862
  ✅ **RSpec testing**: Comprehensive test coverage
3288
2863
  ✅ **Production best practices**: Rollout strategy, edge case handling, self-monitoring
3289
2864
  ✅ **Real-world examples**: E-commerce, SaaS API, Admin tool configurations
3290
2865
 
3291
- ### 12.2. Implementation Checklist
2866
+ ### 11.2. Implementation Checklist
3292
2867
 
3293
2868
  **Phase 1: Core (Week 1-2)**
3294
2869
  - [x] Implement `E11y::SLO::ConfigLoader` with ERB support
@@ -3298,19 +2873,7 @@ advanced:
3298
2873
  - [ ] Add per-endpoint metrics to `E11y::Rack::Middleware`
3299
2874
  - [ ] Implement `E11y::SLO::MetricsEmitter`
3300
2875
 
3301
- **Phase 2: Burn Rate & Alerts (Week 3-4)**
3302
- - [ ] Implement `E11y::SLO::BurnRateCalculator`
3303
- - [ ] Generate Prometheus alert rules from `slo.yml`
3304
- - [ ] Implement multi-window burn rate alerts
3305
- - [ ] Add Prometheus query error handling
3306
-
3307
- **Phase 3: Error Budget (Week 5-6)**
3308
- - [x] Implement `E11y::SLO::ErrorBudget`
3309
- - [ ] Implement `E11y::SLO::DeploymentGate`
3310
- - [ ] Add error budget tracking middleware
3311
- - [ ] Create Grafana dashboard templates
3312
-
3313
- **Phase 4: Production Readiness (Week 7-8)**
2876
+ **Phase 2: Production Readiness (Week 3-4)**
3314
2877
  - [ ] Add maintenance window support
3315
2878
  - [ ] Implement grace period after deployment
3316
2879
  - [ ] Add self-monitoring metrics
@@ -3318,11 +2881,9 @@ advanced:
3318
2881
  - [ ] Document SLO config guide
3319
2882
  - [ ] Add rollout playbook
3320
2883
 
3321
- **Phase 5: RSpec Tests (Week 8)**
2884
+ **Phase 3: RSpec Tests**
3322
2885
  - [x] ConfigLoader specs (edge cases: missing file, invalid YAML, ERB)
3323
2886
  - [x] ConfigValidator specs (invalid targets, missing routes, conflicts)
3324
- - [x] ErrorBudget specs (calculations, exhaustion, deployment gate)
3325
- - [ ] BurnRateCalculator specs (multi-window, new endpoints)
3326
2887
  - [ ] Integration specs (end-to-end SLO tracking)
3327
2888
 
3328
2889
  ---
@@ -3333,5 +2894,5 @@ advanced:
3333
2894
  **Impact:**
3334
2895
  - Per-endpoint SLO visibility (100% coverage)
3335
2896
  - 5-minute incident detection (vs. 30-minute baseline)
3336
- - Error budget-driven deployment decisions
2897
+ - PromQL-based alerts (see SLO-PROMQL-ALERTS.md)
3337
2898
  - Zero-config for simple apps, full control for complex apps