e11y 0.2.0 → 1.1.0

Files changed (288)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +130 -10
  3. data/CHANGELOG.md +80 -1
  4. data/CLAUDE.md +168 -0
  5. data/CONTRIBUTING.md +640 -0
  6. data/README.md +165 -701
  7. data/RELEASE.md +41 -12
  8. data/Rakefile +249 -57
  9. data/config/README.md +1 -1
  10. data/config/loki-local-config.yaml +12 -0
  11. data/config/otel-collector-config.yaml +44 -0
  12. data/cucumber.yml +1 -0
  13. data/docker-compose.yml +18 -2
  14. data/docs/ADAPTERS.md +76 -0
  15. data/docs/ADAPTIVE_SAMPLING.md +59 -0
  16. data/docs/COMPARISON.md +104 -0
  17. data/docs/CONFIGURATION.md +52 -0
  18. data/docs/DISTRIBUTED_TRACING.md +44 -0
  19. data/docs/LIMITATIONS.md +13 -0
  20. data/docs/METRICS_DSL.md +84 -0
  21. data/docs/PERFORMANCE.md +60 -0
  22. data/docs/PII_FILTERING.md +40 -0
  23. data/docs/PRESETS.md +65 -0
  24. data/docs/QUICK-START.md +546 -587
  25. data/docs/RAILS_INTEGRATION.md +79 -0
  26. data/docs/SCHEMA_VALIDATION.md +63 -0
  27. data/docs/SLO-PROMQL-ALERTS.md +161 -0
  28. data/docs/TESTING.md +69 -0
  29. data/docs/{ADR-001-architecture.md → architecture/ADR-001-architecture.md} +36 -65
  30. data/docs/{ADR-002-metrics-yabeda.md → architecture/ADR-002-metrics-yabeda.md} +62 -236
  31. data/docs/architecture/ADR-003-slo-observability.md +1402 -0
  32. data/docs/{ADR-004-adapter-architecture.md → architecture/ADR-004-adapter-architecture.md} +163 -146
  33. data/docs/{ADR-005-tracing-context.md → architecture/ADR-005-tracing-context.md} +10 -9
  34. data/docs/{ADR-006-security-compliance.md → architecture/ADR-006-security-compliance.md} +184 -191
  35. data/docs/{ADR-007-opentelemetry-integration.md → architecture/ADR-007-opentelemetry-integration.md} +3 -21
  36. data/docs/{ADR-008-rails-integration.md → architecture/ADR-008-rails-integration.md} +182 -743
  37. data/docs/{ADR-009-cost-optimization.md → architecture/ADR-009-cost-optimization.md} +45 -54
  38. data/docs/architecture/ADR-010-developer-experience.md +522 -0
  39. data/docs/{ADR-011-testing-strategy.md → architecture/ADR-011-testing-strategy.md} +44 -86
  40. data/docs/{ADR-012-event-evolution.md → architecture/ADR-012-event-evolution.md} +11 -11
  41. data/docs/{ADR-013-reliability-error-handling.md → architecture/ADR-013-reliability-error-handling.md} +37 -12
  42. data/docs/{ADR-014-event-driven-slo.md → architecture/ADR-014-event-driven-slo.md} +12 -24
  43. data/docs/{ADR-015-middleware-order.md → architecture/ADR-015-middleware-order.md} +43 -59
  44. data/docs/{ADR-016-self-monitoring-slo.md → architecture/ADR-016-self-monitoring-slo.md} +58 -355
  45. data/docs/{ADR-017-multi-rails-compatibility.md → architecture/ADR-017-multi-rails-compatibility.md} +4 -11
  46. data/docs/architecture/ADR-018-memory-optimization.md +366 -0
  47. data/docs/{ADR-INDEX.md → architecture/ADR-INDEX.md} +11 -6
  48. data/docs/plans/2026-03-20-browser-overlay-svelte.md +281 -0
  49. data/docs/{00-ICP-AND-TIMELINE.md → prd/00-ICP-AND-TIMELINE.md} +6 -6
  50. data/docs/{01-SCALE-REQUIREMENTS.md → prd/01-SCALE-REQUIREMENTS.md} +6 -6
  51. data/docs/prd/01-overview-vision.md +19 -14
  52. data/docs/use_cases/README.md +22 -23
  53. data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +50 -44
  54. data/docs/use_cases/UC-002-business-event-tracking.md +26 -95
  55. data/docs/use_cases/UC-003-event-metrics.md +66 -0
  56. data/docs/use_cases/UC-004-zero-config-slo-tracking.md +33 -684
  57. data/docs/use_cases/UC-005-sentry-integration.md +13 -15
  58. data/docs/use_cases/UC-006-trace-context-management.md +30 -28
  59. data/docs/use_cases/UC-007-pii-filtering.md +35 -87
  60. data/docs/use_cases/UC-008-opentelemetry-integration.md +51 -89
  61. data/docs/use_cases/UC-009-multi-service-tracing.md +30 -178
  62. data/docs/use_cases/UC-010-background-job-tracking.md +24 -91
  63. data/docs/use_cases/UC-011-rate-limiting.md +95 -168
  64. data/docs/use_cases/UC-012-audit-trail.md +21 -46
  65. data/docs/use_cases/UC-013-high-cardinality-protection.md +29 -167
  66. data/docs/use_cases/UC-014-adaptive-sampling.md +2 -2
  67. data/docs/use_cases/UC-015-cost-optimization.md +46 -99
  68. data/docs/use_cases/UC-016-rails-logger-migration.md +39 -213
  69. data/docs/use_cases/UC-017-local-development.md +203 -777
  70. data/docs/use_cases/UC-018-testing-events.md +3 -3
  71. data/docs/use_cases/UC-019-retention-based-routing.md +53 -106
  72. data/docs/use_cases/UC-020-event-versioning.md +8 -9
  73. data/docs/use_cases/UC-021-error-handling-retry-dlq.md +18 -22
  74. data/docs/use_cases/UC-022-event-registry.md +15 -21
  75. data/docs/use_cases/backlog.md +119 -87
  76. data/e11y.gemspec +2 -2
  77. data/gems/e11y-devtools/README.md +158 -0
  78. data/gems/e11y-devtools/config/routes.rb +15 -0
  79. data/gems/e11y-devtools/e11y-devtools.gemspec +25 -0
  80. data/gems/e11y-devtools/exe/e11y +34 -0
  81. data/gems/e11y-devtools/frontend/.gitignore +24 -0
  82. data/gems/e11y-devtools/frontend/README.md +51 -0
  83. data/gems/e11y-devtools/frontend/index.html +14 -0
  84. data/gems/e11y-devtools/frontend/package-lock.json +3707 -0
  85. data/gems/e11y-devtools/frontend/package.json +28 -0
  86. data/gems/e11y-devtools/frontend/public/mocks/v1/events/recent.json +4205 -0
  87. data/gems/e11y-devtools/frontend/public/mocks/v1/interactions.json +194 -0
  88. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/0a2e04027cfa22d014bc22e8b27cd913/events.json +86 -0
  89. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/0e1543af6a630fb3af6b52283154b3e0/events.json +169 -0
  90. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/1838b691faa49564f97db8592ff3978d/events.json +78 -0
  91. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/29f198f6588dacffb687777eb5f8f118/events.json +197 -0
  92. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/34bc3c9c0097de28a7a6f99b90a8e7bc/events.json +194 -0
  93. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/3ba6c20d068ab9cee00e51b180e66444/events.json +184 -0
  94. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/435bfd8f17b9009146a79812d7c3726d/events.json +144 -0
  95. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/4c7676e3fe668e99edb2b94d7d5678a9/events.json +222 -0
  96. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/6daf0d47974bedfc55d5de7004a3ea9f/events.json +194 -0
  97. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/8a81ada42834d15f287bb40010043605/events.json +194 -0
  98. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/8c0a98900edaae105469df8daedccf02/events.json +198 -0
  99. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/8e4f645180f8a7d1dce426b07380466b/events.json +222 -0
  100. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/93db346fa5d44a032605a13b627f4b80/events.json +128 -0
  101. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/98ff6146faf7bd9be8bd03a8275817ba/events.json +223 -0
  102. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/9997ddd0247bc7e25f2ca7a5c415c93d/events.json +197 -0
  103. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/99e35f8ef3baedd798cc4fd085980ad9/events.json +194 -0
  104. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/b4f3095c1909924cbc98889a86c83d6d/events.json +131 -0
  105. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/b54b7fc32b7575a7110de809d11ccda0/events.json +128 -0
  106. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/c0b48033fa06746bcc5886745e053cff/events.json +169 -0
  107. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/c44649ac76701b4558927cd2305ab535/events.json +169 -0
  108. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/d601ae3320057580a39dbdac2edfdf4a/events.json +248 -0
  109. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/e67e724bab422d2b52eeb49635e512e1/events.json +194 -0
  110. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/e6c72765a28f158a8485b35fa63f73da/events.json +194 -0
  111. data/gems/e11y-devtools/frontend/public/mocks/v1/traces/f541b87405c9a54819b18ebe529f6419/events.json +194 -0
  112. data/gems/e11y-devtools/frontend/scripts/generate_mocks.rb +397 -0
  113. data/gems/e11y-devtools/frontend/src/App.svelte +827 -0
  114. data/gems/e11y-devtools/frontend/src/components/Fab.svelte +19 -0
  115. data/gems/e11y-devtools/frontend/src/components/FilterBar.svelte +38 -0
  116. data/gems/e11y-devtools/frontend/src/components/FullscreenPanel.svelte +82 -0
  117. data/gems/e11y-devtools/frontend/src/components/InteractionsTimeline.svelte +264 -0
  118. data/gems/e11y-devtools/frontend/src/components/RecentHistogram.svelte +354 -0
  119. data/gems/e11y-devtools/frontend/src/lib/api.ts +37 -0
  120. data/gems/e11y-devtools/frontend/src/lib/eventIdentity.ts +12 -0
  121. data/gems/e11y-devtools/frontend/src/lib/format.ts +37 -0
  122. data/gems/e11y-devtools/frontend/src/lib/listFilter.ts +43 -0
  123. data/gems/e11y-devtools/frontend/src/lib/recentVolume.ts +80 -0
  124. data/gems/e11y-devtools/frontend/src/lib/router.ts +12 -0
  125. data/gems/e11y-devtools/frontend/src/lib/transitions.ts +34 -0
  126. data/gems/e11y-devtools/frontend/src/lib/viewportOrigin.ts +25 -0
  127. data/gems/e11y-devtools/frontend/src/main.ts +8 -0
  128. data/gems/e11y-devtools/frontend/src/overlay-entry.ts +24 -0
  129. data/gems/e11y-devtools/frontend/src/overlay.css +1080 -0
  130. data/gems/e11y-devtools/frontend/svelte.config.js +2 -0
  131. data/gems/e11y-devtools/frontend/test_puppeteer.js +41 -0
  132. data/gems/e11y-devtools/frontend/test_scale.js +3 -0
  133. data/gems/e11y-devtools/frontend/tsconfig.app.json +21 -0
  134. data/gems/e11y-devtools/frontend/tsconfig.json +7 -0
  135. data/gems/e11y-devtools/frontend/tsconfig.node.json +26 -0
  136. data/gems/e11y-devtools/frontend/vite.config.ts +36 -0
  137. data/gems/e11y-devtools/lib/e11y/devtools/mcp/server.rb +96 -0
  138. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tool_base.rb +25 -0
  139. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/clear.rb +31 -0
  140. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/errors.rb +35 -0
  141. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/event_detail.rb +33 -0
  142. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/events_by_trace.rb +33 -0
  143. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/interactions.rb +40 -0
  144. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/recent_events.rb +34 -0
  145. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/search.rb +34 -0
  146. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/stats.rb +30 -0
  147. data/gems/e11y-devtools/lib/e11y/devtools/overlay/assets/overlay.js +20 -0
  148. data/gems/e11y-devtools/lib/e11y/devtools/overlay/controller.rb +94 -0
  149. data/gems/e11y-devtools/lib/e11y/devtools/overlay/engine.rb +26 -0
  150. data/gems/e11y-devtools/lib/e11y/devtools/overlay/middleware.rb +80 -0
  151. data/gems/e11y-devtools/lib/e11y/devtools/overlay/rails_controller.rb +67 -0
  152. data/gems/e11y-devtools/lib/e11y/devtools/tui/app.rb +262 -0
  153. data/gems/e11y-devtools/lib/e11y/devtools/tui/grouping.rb +66 -0
  154. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_detail.rb +62 -0
  155. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_list.rb +70 -0
  156. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/interaction_list.rb +47 -0
  157. data/gems/e11y-devtools/lib/e11y/devtools/version.rb +8 -0
  158. data/gems/e11y-devtools/lib/e11y/devtools.rb +13 -0
  159. data/gems/e11y-devtools/spec/e11y/devtools/mcp/tools_spec.rb +107 -0
  160. data/gems/e11y-devtools/spec/e11y/devtools/overlay/controller_spec.rb +91 -0
  161. data/gems/e11y-devtools/spec/e11y/devtools/overlay/middleware_spec.rb +46 -0
  162. data/gems/e11y-devtools/spec/e11y/devtools/tui/app_spec.rb +85 -0
  163. data/gems/e11y-devtools/spec/e11y/devtools/tui/grouping_spec.rb +64 -0
  164. data/gems/e11y-devtools/spec/spec_helper.rb +5 -0
  165. data/gems/e11y-devtools/spec/tui/widgets/event_list_spec.rb +44 -0
  166. data/gems/e11y-devtools/spec/tui/widgets/interaction_list_spec.rb +62 -0
  167. data/lib/e11y/adapters/audit_encrypted.rb +53 -11
  168. data/lib/e11y/adapters/base.rb +33 -34
  169. data/lib/e11y/adapters/dev_log/file_store.rb +143 -0
  170. data/lib/e11y/adapters/dev_log/query.rb +219 -0
  171. data/lib/e11y/adapters/dev_log.rb +118 -0
  172. data/lib/e11y/adapters/file.rb +3 -6
  173. data/lib/e11y/adapters/in_memory.rb +52 -5
  174. data/lib/e11y/adapters/in_memory_test.rb +29 -0
  175. data/lib/e11y/adapters/loki.rb +58 -23
  176. data/lib/e11y/adapters/null.rb +82 -0
  177. data/lib/e11y/adapters/opentelemetry_collector.rb +183 -0
  178. data/lib/e11y/adapters/otel_logs.rb +136 -23
  179. data/lib/e11y/adapters/sentry.rb +4 -7
  180. data/lib/e11y/adapters/stdout.rb +73 -7
  181. data/lib/e11y/adapters/yabeda.rb +153 -29
  182. data/lib/e11y/buffers/adaptive_buffer.rb +3 -17
  183. data/lib/e11y/buffers/{request_scoped_buffer.rb → ephemeral_buffer.rb} +72 -58
  184. data/lib/e11y/buffers/ring_buffer.rb +3 -16
  185. data/lib/e11y/configuration.rb +272 -0
  186. data/lib/e11y/console.rb +10 -17
  187. data/lib/e11y/current.rb +53 -1
  188. data/lib/e11y/debug/pipeline_inspector.rb +96 -0
  189. data/lib/e11y/documentation/generator.rb +48 -0
  190. data/lib/e11y/event/base.rb +176 -82
  191. data/lib/e11y/event/value_sampling_config.rb +1 -5
  192. data/lib/e11y/events/rails/database/query.rb +1 -4
  193. data/lib/e11y/events/rails/job/failed.rb +2 -0
  194. data/lib/e11y/instruments/active_job.rb +44 -12
  195. data/lib/e11y/instruments/rails_instrumentation.rb +49 -24
  196. data/lib/e11y/instruments/sidekiq.rb +135 -31
  197. data/lib/e11y/linters/base.rb +11 -0
  198. data/lib/e11y/linters/pii/pii_declaration_linter.rb +120 -0
  199. data/lib/e11y/linters/slo/config_consistency_linter.rb +76 -0
  200. data/lib/e11y/linters/slo/explicit_declaration_linter.rb +36 -0
  201. data/lib/e11y/linters/slo/slo_status_from_linter.rb +41 -0
  202. data/lib/e11y/logger/bridge.rb +26 -7
  203. data/lib/e11y/metrics/cardinality_protection.rb +10 -15
  204. data/lib/e11y/metrics/cardinality_tracker.rb +16 -6
  205. data/lib/e11y/metrics/registry.rb +3 -5
  206. data/lib/e11y/metrics/test_backend.rb +62 -0
  207. data/lib/e11y/metrics.rb +56 -10
  208. data/lib/e11y/middleware/adapter_resolver.rb +40 -0
  209. data/lib/e11y/middleware/audit_signing.rb +43 -6
  210. data/lib/e11y/middleware/baggage_protection.rb +75 -0
  211. data/lib/e11y/middleware/dev_log_source.rb +24 -0
  212. data/lib/e11y/middleware/event_slo.rb +23 -9
  213. data/lib/e11y/middleware/otel_span.rb +23 -0
  214. data/lib/e11y/middleware/pii_filter.rb +104 -75
  215. data/lib/e11y/middleware/rate_limiting.rb +54 -27
  216. data/lib/e11y/middleware/request.rb +70 -23
  217. data/lib/e11y/middleware/routing.rb +78 -21
  218. data/lib/e11y/middleware/sampling.rb +66 -17
  219. data/lib/e11y/middleware/self_monitoring_emit.rb +39 -0
  220. data/lib/e11y/middleware/trace_context.rb +45 -10
  221. data/lib/e11y/middleware/track_latency.rb +34 -0
  222. data/lib/e11y/middleware/validation.rb +7 -16
  223. data/lib/e11y/middleware/versioning.rb +26 -22
  224. data/lib/e11y/opentelemetry/semantic_conventions.rb +109 -0
  225. data/lib/e11y/opentelemetry/span_creator.rb +142 -0
  226. data/lib/e11y/pii/patterns.rb +12 -1
  227. data/lib/e11y/pipeline/builder.rb +4 -4
  228. data/lib/e11y/presets/audit_event.rb +13 -2
  229. data/lib/e11y/railtie.rb +52 -14
  230. data/lib/e11y/registry.rb +306 -0
  231. data/lib/e11y/reliability/circuit_breaker.rb +19 -21
  232. data/lib/e11y/reliability/dlq/base.rb +71 -0
  233. data/lib/e11y/reliability/dlq/file_adapter.rb +301 -0
  234. data/lib/e11y/reliability/dlq/file_storage.rb +63 -34
  235. data/lib/e11y/reliability/dlq/filter.rb +37 -54
  236. data/lib/e11y/reliability/retry_handler.rb +26 -29
  237. data/lib/e11y/reliability/retry_rate_limiter.rb +3 -11
  238. data/lib/e11y/sampling/error_spike_detector.rb +0 -2
  239. data/lib/e11y/sampling/load_monitor.rb +5 -9
  240. data/lib/e11y/sampling/stratified_tracker.rb +18 -0
  241. data/lib/e11y/self_monitoring/buffer_monitor.rb +2 -0
  242. data/lib/e11y/self_monitoring/performance_monitor.rb +19 -61
  243. data/lib/e11y/self_monitoring/reliability_monitor.rb +4 -74
  244. data/lib/e11y/slo/config_loader.rb +40 -0
  245. data/lib/e11y/slo/config_validator.rb +58 -0
  246. data/lib/e11y/slo/dashboard_generator.rb +122 -0
  247. data/lib/e11y/slo/event_driven.rb +8 -0
  248. data/lib/e11y/slo/tracker.rb +31 -4
  249. data/lib/e11y/testing/have_tracked_event_matcher.rb +190 -0
  250. data/lib/e11y/testing/rspec_matchers.rb +21 -0
  251. data/lib/e11y/testing/snapshot_matcher.rb +86 -0
  252. data/lib/e11y/trace_context/sampler.rb +35 -0
  253. data/lib/e11y/tracing/faraday_middleware.rb +31 -0
  254. data/lib/e11y/tracing/net_http_patch.rb +33 -0
  255. data/lib/e11y/tracing/propagator.rb +144 -0
  256. data/lib/e11y/tracing.rb +47 -0
  257. data/lib/e11y/version.rb +1 -1
  258. data/lib/e11y/versioning/version_extractor.rb +32 -0
  259. data/lib/e11y.rb +123 -266
  260. data/lib/generators/e11y/event/event_generator.rb +22 -0
  261. data/lib/generators/e11y/event/templates/event.rb.tt +16 -0
  262. data/lib/generators/e11y/grafana_dashboard/grafana_dashboard_generator.rb +30 -0
  263. data/lib/generators/e11y/grafana_dashboard/templates/e11y_dashboard.json +81 -0
  264. data/lib/generators/e11y/install/install_generator.rb +34 -0
  265. data/lib/generators/e11y/install/templates/e11y.rb +239 -0
  266. data/lib/generators/e11y/prometheus_alerts/prometheus_alerts_generator.rb +29 -0
  267. data/lib/generators/e11y/prometheus_alerts/templates/e11y_alerts.yml +28 -0
  268. data/lib/tasks/e11y_docs.rake +30 -0
  269. data/lib/tasks/e11y_events.rake +71 -0
  270. data/lib/tasks/e11y_lint.rake +91 -0
  271. data/lib/tasks/e11y_slo.rake +29 -0
  272. metadata +186 -39
  273. data/docs/ADR-003-slo-observability.md +0 -3337
  274. data/docs/ADR-010-developer-experience.md +0 -2166
  275. data/docs/API-REFERENCE-L28.md +0 -914
  276. data/docs/COMPREHENSIVE-CONFIGURATION.md +0 -2366
  277. data/docs/CONTRIBUTING.md +0 -312
  278. data/docs/IMPLEMENTATION_NOTES.md +0 -2804
  279. data/docs/IMPLEMENTATION_PLAN.md +0 -1971
  280. data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +0 -586
  281. data/docs/PLAN.md +0 -148
  282. data/docs/README.md +0 -296
  283. data/docs/design/00-memory-optimization.md +0 -593
  284. data/docs/guides/MIGRATION-L27-L28.md +0 -692
  285. data/docs/guides/PERFORMANCE-BENCHMARKS.md +0 -434
  286. data/docs/guides/README.md +0 -44
  287. data/docs/use_cases/UC-003-pattern-based-metrics.md +0 -1627
  288. data/lib/e11y/adapters/registry.rb +0 -141
@@ -0,0 +1,1402 @@
1
+ # ADR-003: SLO & Observability
2
+
3
+ **Status:** Draft
4
+ **Date:** January 13, 2026
5
+ **Covers:** UC-004 (Zero-Config SLO Tracking)
6
+ **Depends On:** ADR-001 (Core), ADR-008 (Rails Integration), ADR-002 (Metrics)
7
+
8
+ **Related ADRs:**
9
+ - 📊 **ADR-014: Event-Driven SLO** - Custom SLO based on business events (e.g., payment success rate)
10
+ - 🔗 **Integration:** See `ADR-003-014-INTEGRATION.md` for detailed integration analysis
11
+
12
+ ---
13
+
14
+ ## 🔍 Scope of This ADR
15
+
16
+ This ADR covers **HTTP/Job SLO** (infrastructure reliability):
17
+ - ✅ Zero-config SLO for HTTP requests (99.9% availability)
18
+ - ✅ Zero-config SLO for Sidekiq/ActiveJob (99.5% success rate)
19
+ - ✅ Per-endpoint SLO configuration in `slo.yml`
20
+ - ✅ PromQL queries and alert rules — see [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md)
21
+
22
+ **For Event-based SLO** (business logic reliability like "order creation success rate"), see **ADR-014**.
23
+
24
+ **For App-Wide SLO** (aggregating HTTP + Event metrics into single health score), see **ADR-014 Section 9**.
25
+
26
+ ---
27
+
28
+ ## 📋 Table of Contents
29
+
30
+ 1. [Context & Problem](#1-context--problem)
31
+ 2. [Architecture Overview](#2-architecture-overview)
32
+ 3. [Multi-Level SLO Strategy](#3-multi-level-slo-strategy)
33
+ 4. [Per-Endpoint SLO Configuration](#4-per-endpoint-slo-configuration)
34
+ 5. [PromQL & Alerts](#5-promql--alerts)
35
+ 6. [SLO Config Validation & Linting](#6-slo-config-validation--linting)
36
+ 7. [Dashboard & Reporting](#7-dashboard--reporting)
37
+ 8. [Production Best Practices & Edge Cases](#8-production-best-practices--edge-cases)
38
+ 9. [Trade-offs](#9-trade-offs)
39
+ 10. [Real-World Configuration Examples](#10-real-world-configuration-examples)
40
+ 11. [Summary & Next Steps](#11-summary--next-steps)
41
+
42
+ ---
43
+
44
+ ## 1. Context & Problem
45
+
46
+ ### 1.1. Problem Statement
47
+
48
+ **Current Pain Points:**
49
+
50
+ ```ruby
51
+ # === PROBLEM 1: Overly Broad SLO (App-Wide) ===
52
+ # ❌ One SLO for entire app is too coarse
53
+ # GET /healthcheck (should be 99.99%)
54
+ # POST /orders (should be 99.9%)
55
+ # GET /admin/reports (should be 95%)
56
+ # → All treated the same! Critical endpoints hidden by non-critical ones!
57
+ ```
58
+
59
+ ```ruby
60
+ # === PROBLEM 2: Slow Alert Detection ===
61
+ # ❌ 30-day window = slow reaction
62
+ # Incident at 10:00 AM
63
+ # First alert at 10:45 AM (45 minutes later!)
64
+ # → Customers already affected!
65
+ ```
66
+
67
+ ```ruby
68
+ # === PROBLEM 3: No Configuration Management ===
69
+ # ❌ SLOs hardcoded in code
70
+ # Need to deploy to change SLO targets
71
+ # No validation against real routes
72
+ # → Drift between config and reality
73
+ ```
74
+
75
+ ```ruby
76
+ # === PROBLEM 4: Alert Fatigue ===
77
+ # ❌ Single threshold alerting
78
+ # Minor blip → Page SRE
79
+ # Sustained issue → Same alert
80
+ # → Can't distinguish severity!
81
+ ```
82
+
83
+ ### 1.2. Design Decisions (Based on Google SRE 2026)
84
+
85
+ **Decision 1: Multi-Level SLO Strategy**
86
+ ```yaml
87
+ # 3 levels of SLO granularity:
88
+ 1. Application-wide (default, zero-config)
89
+ 2. Service-level (Sidekiq, ActiveJob)
90
+ 3. Per-endpoint (controller#action specific)
91
+ ```
92
+
93
+ **Decision 2: Multi-Window Multi-Burn Rate (Google SRE Standard)**
94
+ ```yaml
95
+ # Alert windows (not SLO windows!):
96
+ - Fast burn: 1 hour window, alert after 5 min, 14.4x burn rate → 2% of the 30-day budget consumed
97
+ - Medium burn: 6 hour window, alert after 30 min, 6.0x burn rate → 5% of the 30-day budget consumed
98
+ - Slow burn: 3 day window, alert after 6 hours, 1.0x burn rate → 10% of the 30-day budget consumed
99
+
100
+ # SLO window: Still 30 days (industry standard)
101
+ # But ALERTS react in 5 minutes!
102
+ ```
103
+
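The budget percentages in Decision 2 follow from simple arithmetic: a burn rate sustained for a whole alert window consumes `burn_rate * window / slo_window` of the error budget. The sketch below reproduces the 2%, 5% and 10% figures under the 30-day SLO window stated above. The method and constant names are illustrative and are not part of E11y.

```ruby
# Fraction of the SLO error budget consumed when a burn rate is sustained
# for an entire alert window. Assumes the 30-day SLO window from the ADR.
SLO_WINDOW_HOURS = 30 * 24 # 720 hours

def budget_consumed(burn_rate:, window_hours:, slo_window_hours: SLO_WINDOW_HOURS)
  burn_rate * window_hours.to_f / slo_window_hours
end

# The three alert windows from Decision 2 (3 days = 72 hours).
WINDOWS = {
  fast:   { burn_rate: 14.4, window_hours: 1 },
  medium: { burn_rate: 6.0,  window_hours: 6 },
  slow:   { burn_rate: 1.0,  window_hours: 72 }
}.freeze

WINDOWS.each do |name, window|
  pct = (budget_consumed(**window) * 100).round(2)
  puts "#{name}: #{pct}% of the 30-day budget"
end
```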
104
+ **Decision 3: YAML-Based Configuration**
105
+ ```yaml
106
+ # config/slo.yml - version controlled, validated
107
+ # Separate from code deployment
108
+ # Linter validates against real routes/jobs
109
+ ```
110
+
111
+ **Decision 4: Optional Latency SLO**
112
+ ```yaml
113
+ # Not all endpoints need latency SLO:
114
+ - Healthcheck: availability only (latency not critical)
115
+ - File upload: availability + custom latency (5s)
116
+ - API: availability + p99 latency (500ms)
117
+ ```
118
+
119
+ ### 1.3. Goals
120
+
121
+ **Primary Goals:**
122
+ - ✅ **Per-endpoint SLO** (controller#action level)
123
+ - ✅ **5-minute alert detection** (fast burn rate)
124
+ - ✅ **YAML-based configuration** with validation
125
+ - ✅ **Flexible latency SLO** (optional per endpoint)
126
+ - ✅ **Multi-window burn rate** (Google SRE standard)
127
+
128
+ **Non-Goals:**
129
+ - ❌ Per-user SLO (too granular for v1.0)
130
+ - ❌ Automatic SLO adjustment (manual for v1.0)
131
+ - ❌ SLO enforcement (alerts only, no blocking)
132
+
133
+ ### 1.4. Success Metrics
134
+
135
+ | Metric | Target | Critical? |
136
+ |--------|--------|-----------|
137
+ | **Alert detection time** | ≤5 minutes | ✅ Yes |
138
+ | **Per-endpoint coverage** | 100% (all routes) | ✅ Yes |
139
+ | **Config validation** | 100% (no drift) | ✅ Yes |
140
+ | **False positive rate** | <1% | ✅ Yes |
141
+ | **Alert precision** | >95% | ✅ Yes |
142
+
143
+ ---
144
+
145
+ ## 2. Architecture Overview
146
+
147
+ ### 2.1. System Context
148
+
149
+ ```mermaid
150
+ C4Context
151
+ title SLO & Observability Context (Multi-Level)
152
+
153
+ Person(sre, "SRE", "Monitors SLOs")
154
+ Person(dev, "Developer", "Defines SLOs")
155
+
156
+ System(rails_app, "Rails App", "100+ endpoints")
157
+ System(e11y, "E11y Gem", "Multi-level SLO")
158
+ System(slo_config, "slo.yml", "Per-endpoint config")
159
+
160
+ System_Ext(prometheus, "Prometheus", "Multi-window queries")
161
+ System_Ext(grafana, "Grafana", "Per-endpoint dashboards")
162
+ System_Ext(alertmanager, "Alertmanager", "Fast/Medium/Slow burn")
163
+
164
+ Rel(dev, slo_config, "Defines", "Per-endpoint SLO")
165
+ Rel(rails_app, e11y, "Tracks", "Per controller#action")
166
+ Rel(e11y, slo_config, "Validates", "Against real routes")
167
+ Rel(e11y, prometheus, "Exports", "Per-endpoint metrics")
168
+ Rel(prometheus, alertmanager, "Evaluates", "3 burn rate windows")
169
+ Rel(alertmanager, sre, "Alerts in 5min", "Fast burn")
170
+ Rel(sre, grafana, "Views", "Per-endpoint SLO")
171
+
172
+ UpdateLayoutConfig($c4ShapeInRow="3", $c4BoundaryInRow="1")
173
+ ```
174
+
175
+ ### 2.2. Component Architecture
176
+
177
+ ```mermaid
178
+ graph TB
179
+ subgraph "Rails Application"
180
+ Route1[GET /orders] --> Middleware[E11y SLO Middleware]
181
+ Route2[POST /orders] --> Middleware
182
+ Route3[GET /healthcheck] --> Middleware
183
+ SidekiqJob[PaymentJob] --> SidekiqInstr[Sidekiq Instrumentation]
184
+ end
185
+
186
+ subgraph "E11y SLO Engine"
187
+ Middleware --> SLOResolver[SLO Config Resolver]
188
+ SidekiqInstr --> SLOResolver
189
+
190
+ SLOResolver --> ConfigLoader[slo.yml Loader]
191
+ ConfigLoader --> Validator[Route/Job Validator]
192
+
193
+ SLOResolver --> MetricsEmitter[Per-Endpoint Metrics]
194
+ MetricsEmitter --> AppWide[App-Wide Metrics]
195
+ MetricsEmitter --> PerEndpoint[Per-Endpoint Metrics]
196
+ MetricsEmitter --> PerJob[Per-Job Metrics]
197
+ end
198
+
199
+ subgraph "Multi-Window Burn Rate"
200
+ PerEndpoint --> BurnRate1h[1h Fast Burn]
201
+ PerEndpoint --> BurnRate6h[6h Medium Burn]
202
+ PerEndpoint --> BurnRate3d[3d Slow Burn]
203
+
204
+ BurnRate1h --> AlertFast[Alert in 5 min<br/>14.4x burn]
205
+ BurnRate6h --> AlertMedium[Alert in 30 min<br/>6.0x burn]
206
+ BurnRate3d --> AlertSlow[Alert in 6 hours<br/>1.0x burn]
207
+ end
208
+
209
+ subgraph "Prometheus & Grafana"
210
+ AppWide --> PromQL1[PromQL: App SLO]
211
+ PerEndpoint --> PromQL2[PromQL: Endpoint SLO]
212
+ PerJob --> PromQL3[PromQL: Job SLO]
213
+
214
+ PromQL1 --> Dashboard1[App-Wide Dashboard]
215
+ PromQL2 --> Dashboard2[Per-Endpoint Dashboard]
216
+ PromQL3 --> Dashboard3[Job Dashboard]
217
+ end
218
+
219
+ style SLOResolver fill:#d1ecf1
220
+ style BurnRate1h fill:#f8d7da
221
+ style AlertFast fill:#dc3545,color:#fff
222
+ ```
223
+
224
+ ### 2.3. Multi-Window Alert Flow
225
+
226
+ ```mermaid
227
+ sequenceDiagram
228
+ participant Endpoint as POST /orders
229
+ participant E11y as E11y Middleware
230
+ participant Config as slo.yml
231
+ participant Prom as Prometheus
232
+ participant Alert as Alertmanager
233
+ participant SRE as SRE
234
+
235
+ Note over Endpoint: Incident starts at 10:00
236
+
237
+ Endpoint->>E11y: HTTP 500 (error)
238
+ E11y->>Config: Lookup SLO: orders#create
239
+ Config-->>E11y: target: 99.9%, latency: 500ms
240
+ E11y->>Prom: Increment error counter
241
+
242
+ Note over Prom: 1h window burn rate evaluation
243
+
244
+ Prom->>Prom: 10:00-10:05: Calculate burn rate
245
+ Prom->>Prom: Burn rate = 14.5x (> 14.4x threshold)
246
+
247
+ Prom->>Alert: Fire: FastBurn (10:05, 5 min after incident)
248
+ Alert->>SRE: Page: CRITICAL - POST /orders
249
+
250
+ Note over SRE: SRE notified in 5 minutes!
251
+
252
+ alt Incident resolved quickly
253
+ Note over Endpoint: Fixed at 10:10
254
+ Prom->>Prom: 10:10-10:15: Burn rate drops
255
+ Prom->>Alert: Resolve: FastBurn
256
+ else Incident continues
257
+ Prom->>Prom: 10:00-10:30: 6h window burn
258
+ Prom->>Alert: Fire: MediumBurn (additional context)
259
+ end
260
+ ```
261
+
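The burn-rate check in the flow above can be sketched as follows. Burn rate is taken here as the observed error ratio divided by the error budget ratio (`1 - target`), which is the usual definition in multi-window alerting. In the ADR the evaluation happens in Prometheus, so this Ruby version is only an illustration, and `burn_rate` and `fast_burn?` are hypothetical names.

```ruby
# Burn rate = observed error ratio / error budget ratio (1 - SLO target).
def burn_rate(errors:, total:, target:)
  return 0.0 if total.zero? # no traffic means nothing is burning

  (errors.to_f / total) / (1.0 - target)
end

# Hypothetical helper: true when the burn rate exceeds the alert threshold.
def fast_burn?(errors:, total:, target:, threshold: 14.4)
  burn_rate(errors: errors, total: total, target: target) > threshold
end

# 145 failures out of 10_000 requests against a 99.9% target gives a burn
# rate of roughly 14.5, which is above the 14.4x fast-burn threshold.
rate = burn_rate(errors: 145, total: 10_000, target: 0.999)
puts rate.round(1)
puts fast_burn?(errors: 145, total: 10_000, target: 0.999)
```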
262
+ ---
263
+
264
+ ## 3. Multi-Level SLO Strategy
265
+
266
+ ### 3.1. Level 1: Application-Wide SLO (Zero-Config)
267
+
268
+ **Automatic for all Rails apps:**
269
+
270
+ ```ruby
271
+ # Shipped knobs (see lib/e11y/configuration.rb, lib/e11y/slo/)
272
+ E11y.configure do |config|
273
+ config.slo_tracking_enabled = true # default
274
+ config.rails_instrumentation_enabled = true
275
+ config.sidekiq_enabled = true # job SLO signals from Sidekiq middleware
276
+ config.active_job_enabled = true # job SLO signals from Active Job callbacks
277
+ end
278
+ ```
279
+
280
+ **Metrics emitted:**
281
+ ```text
282
+ # App-wide availability
283
+ http_requests_total{status="2xx|3xx|4xx|5xx"}
284
+ slo_app_availability{window="30d"} # Calculated SLO
285
+
286
+ # App-wide latency
287
+ http_request_duration_seconds{quantile="0.99"}
288
+ slo_app_latency_p99{window="30d"}
289
+ ```
290
+
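As an illustration of how an availability figure could be derived from the per-status counters listed above, here is a minimal sketch. It assumes that only 5xx responses count against availability; the ADR excerpt does not say how 4xx responses are treated, so that rule is an assumption, and `availability` is a hypothetical helper.

```ruby
# Availability over a window from per-status-class request counts.
# Assumption (not confirmed by the ADR): only 5xx responses are failures.
def availability(counts)
  total = counts.values.sum
  return nil if total.zero? # no traffic, so no SLO signal

  good = total - counts.fetch("5xx", 0)
  good.to_f / total
end

counts = { "2xx" => 9_850, "3xx" => 100, "4xx" => 40, "5xx" => 10 }
puts availability(counts) # 9_990 good requests out of 10_000
```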
291
+ ### 3.2. Level 2: Service-Level SLO (Per-Service)
292
+
293
+ **Per-service overrides:**
294
+
295
+ ```yaml
296
+ # config/slo.yml
297
+ services:
298
+ sidekiq:
299
+ default:
300
+ success_rate_target: 0.995 # 99.5%
301
+ window: 30d
302
+
303
+ # Override for critical jobs
304
+ jobs:
305
+ PaymentProcessingJob:
306
+ success_rate_target: 0.9999 # 99.99% (critical!)
307
+ alert_on_single_failure: true
308
+
309
+ EmailNotificationJob:
310
+ success_rate_target: 0.95 # 95% (non-critical)
311
+ latency: null # No latency SLO
312
+ ```
313
+
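The override behaviour described in this section (a per-job target first, then the service default) can be sketched with the standard `yaml` library. The YAML below is a trimmed copy of the example above, and `job_success_target` is a hypothetical helper rather than E11y's actual loader.

```ruby
require "yaml"

# Trimmed copy of the services section shown above.
SLO_YAML = <<~YAML
  services:
    sidekiq:
      default:
        success_rate_target: 0.995
      jobs:
        PaymentProcessingJob:
          success_rate_target: 0.9999
        EmailNotificationJob:
          success_rate_target: 0.95
YAML

# Hypothetical lookup: per-job override first, then the service default.
def job_success_target(config, service, job_class)
  svc = config.dig("services", service) || {}
  svc.dig("jobs", job_class, "success_rate_target") ||
    svc.dig("default", "success_rate_target")
end

config = YAML.safe_load(SLO_YAML)
puts job_success_target(config, "sidekiq", "PaymentProcessingJob") # per-job override
puts job_success_target(config, "sidekiq", "ReportJob")            # falls back to default
```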
314
+ ### 3.3. Level 3: Per-Endpoint SLO (Controller#Action)
315
+
316
+ **Most granular level:**
317
+
318
+ ```yaml
319
+ # config/slo.yml
320
+ endpoints:
321
+ # CRITICAL endpoints (99.99%)
322
+ - name: "Health Check"
323
+ pattern: "GET /healthcheck"
324
+ controller: "HealthController"
325
+ action: "index"
326
+ slo:
327
+ availability_target: 0.9999 # 99.99%
328
+ latency: null # No latency SLO for healthcheck
329
+ window: 30d
330
+
331
+ # HIGH priority endpoints (99.9%)
332
+ - name: "Create Order"
333
+ pattern: "POST /api/orders"
334
+ controller: "Api::OrdersController"
335
+ action: "create"
336
+ slo:
337
+ availability_target: 0.999 # 99.9%
338
+ latency_p99_target: 500 # 500ms p99
339
+ latency_p95_target: 300 # 300ms p95 (optional)
340
+ window: 30d
341
+
342
+ # Multi-burn rate alert config
343
+ burn_rate_alerts:
344
+ fast:
345
+ enabled: true
346
+ window: 1h
347
+ threshold: 14.4 # 2% budget in 1h
348
+ alert_after: 5m
349
+ medium:
350
+ enabled: true
351
+ window: 6h
352
+ threshold: 6.0 # 5% budget in 6h
353
+ alert_after: 30m
354
+ slow:
355
+ enabled: true
356
+ window: 3d
357
+ threshold: 1.0 # 10% budget in 3d
358
+ alert_after: 6h
359
+
360
+ # SLOW endpoints (99.9% but higher latency acceptable)
361
+ - name: "Generate Report"
362
+ pattern: "POST /admin/reports"
363
+ controller: "Admin::ReportsController"
364
+ action: "create"
365
+ slo:
366
+ availability_target: 0.999 # 99.9%
367
+ latency_p99_target: 5000 # 5s (slow, but acceptable)
368
+ window: 30d
369
+
370
+ # LOW priority endpoints (99%)
371
+ - name: "Admin Dashboard"
372
+ pattern: "GET /admin/dashboard"
373
+ controller: "Admin::DashboardController"
374
+ action: "index"
375
+ slo:
376
+ availability_target: 0.99 # 99% (less critical)
377
+ latency: null
378
+ window: 30d
379
+
380
+ # NO SLO (exclude from tracking)
381
+ - name: "Development Tools"
382
+ pattern: "GET /rails/info/*"
383
+ slo: null # No SLO
384
+ ```
385
+
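To make the three outcomes of the endpoint list concrete (a matching SLO, an explicit `slo: null` opt-out, and a fallback for unlisted routes), here is a small sketch. It matches on the `pattern` string with `File.fnmatch`, which is only one possible matching strategy. `resolve_slo`, `ENDPOINTS` and `APP_DEFAULT` are hypothetical and do not describe E11y's shipped resolver.

```ruby
# A few entries from the endpoint list above, as already-parsed hashes.
ENDPOINTS = [
  { "pattern" => "GET /healthcheck",
    "slo" => { "availability_target" => 0.9999, "latency" => nil } },
  { "pattern" => "POST /api/orders",
    "slo" => { "availability_target" => 0.999, "latency_p99_target" => 500 } },
  { "pattern" => "GET /rails/info/*", "slo" => nil } # explicitly excluded
].freeze

# Assumed app-wide fallback for routes that have no entry.
APP_DEFAULT = { "availability_target" => 0.999 }.freeze

# Returns the endpoint's SLO hash, nil when the entry opts out of tracking,
# or the app-wide default when no pattern matches.
def resolve_slo(request_line)
  entry = ENDPOINTS.find { |e| File.fnmatch(e["pattern"], request_line) }
  return APP_DEFAULT unless entry

  entry["slo"]
end

puts resolve_slo("POST /api/orders").inspect
puts resolve_slo("GET /rails/info/routes").inspect
puts resolve_slo("GET /profile").inspect
```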
386
+ ---
387
+
388
+ ## 4. Per-Endpoint SLO Configuration
389
+
390
+ ### 4.1. Complete slo.yml Schema with All Options
391
+
392
+ ```yaml
+ # config/slo.yml
+ #
+ # E11y SLO Configuration
+ #
+ # This file defines Service Level Objectives for your application at multiple levels:
+ # 1. App-wide defaults (fallback for unconfigured endpoints)
+ # 2. Endpoint-specific SLOs (per controller#action)
+ # 3. Service-specific SLOs (Sidekiq, ActiveJob)
+ #
+ # Lint / validation (see §6; e11y:slo:validate aliases e11y:lint):
+ #   $ bundle exec rake e11y:lint
+ #   $ bundle exec rake e11y:slo:validate
+ # Dashboard JSON from slo.yml:
+ #   $ bundle exec rake e11y:slo:dashboard
+ #
+ # Documentation: https://github.com/arturseletskiy/e11y/docs/slo-configuration.md
+
+ version: 1
+
+ # ============================================================================
+ # GLOBAL DEFAULTS
+ # ============================================================================
+ # Applied to all endpoints unless overridden
+ # These are CONSERVATIVE defaults - tune based on your needs
+ defaults:
+   window: 30d  # SLO evaluation window (7d, 30d, 90d)
+
+   # Availability SLO (required)
+   availability:
+     enabled: true
+     target: 0.999  # 99.9% = 43.2 minutes downtime per month
+
+   # Latency SLO (optional)
+   latency:
+     enabled: true
+     p99_target: 500  # milliseconds
+     p95_target: 300  # milliseconds (optional)
+     p50_target: null  # median (optional, null = disabled)
+
+   # Throughput SLO (optional, for high-traffic endpoints)
+   throughput:
+     enabled: false  # Disabled by default
+     min_rps: null  # Minimum requests per second (null = no minimum)
+     max_rps: null  # Maximum requests per second (null = no maximum)
+
+   # Multi-window burn rate alerts (Google SRE recommended)
+   burn_rate_alerts:
+     fast:
+       enabled: true
+       window: 1h  # Alert window
+       threshold: 14.4  # 14.4x burn rate = 2% of 30-day budget in 1h
+       alert_after: 5m  # Fire alert after 5 minutes
+       severity: critical
+     medium:
+       enabled: true
+       window: 6h
+       threshold: 6.0  # 6x burn rate = 5% of 30-day budget in 6h
+       alert_after: 30m
+       severity: warning
+     slow:
+       enabled: true
+       window: 3d
+       threshold: 1.0  # 1x burn rate = 10% of 30-day budget in 3d
+       alert_after: 6h
+       severity: info
+
+ # ============================================================================
+ # ENDPOINT-SPECIFIC SLOs
+ # ============================================================================
+ # Define SLOs per controller#action
+ # Pattern matching supported: "/api/orders/:id", "/users/*"
+ endpoints:
+   # -------------------------------------------------------------------------
+   # CRITICAL ENDPOINTS (99.99% availability)
+   # -------------------------------------------------------------------------
+   - name: "Health Check"
+     description: "K8s liveness/readiness probe"
+     pattern: "GET /healthcheck"
+     controller: "HealthController"
+     action: "index"
+     tags:
+       - critical
+       - infrastructure
+     slo:
+       window: 30d
+       availability:
+         enabled: true
+         target: 0.9999  # 99.99% = 4.32 minutes downtime per month
+       latency:
+         enabled: false  # No latency SLO for healthcheck (should be instant)
+       throughput:
+         enabled: false
+       burn_rate_alerts:
+         fast:
+           enabled: true
+           threshold: 14.4
+           alert_after: 2m  # Override: faster alert for critical endpoint
+
+   # -------------------------------------------------------------------------
+   # HIGH PRIORITY ENDPOINTS (99.9% availability + strict latency)
+   # -------------------------------------------------------------------------
+   - name: "Create Order"
+     description: "Primary checkout flow"
+     pattern: "POST /api/orders"
+     controller: "Api::OrdersController"
+     action: "create"
+     tags:
+       - high_priority
+       - revenue_critical
+       - customer_facing
+     slo:
+       window: 30d
+       availability:
+         enabled: true
+         target: 0.999  # 99.9%
+       latency:
+         enabled: true
+         p99_target: 500  # 500ms p99
+         p95_target: 300  # 300ms p95
+         p50_target: 150  # 150ms p50 (median)
+       throughput:
+         enabled: true
+         min_rps: 10  # Must handle at least 10 req/sec
+         max_rps: 1000  # Alert if exceeds 1000 req/sec (potential attack)
+       burn_rate_alerts:
+         fast:
+           enabled: true
+           threshold: 14.4
+           alert_after: 5m
+         medium:
+           enabled: true
+           threshold: 6.0
+           alert_after: 30m
+         slow:
+           enabled: true
+           threshold: 1.0
+           alert_after: 6h
+
+   - name: "List Orders"
+     description: "Customer order history"
+     pattern: "GET /api/orders"
+     controller: "Api::OrdersController"
+     action: "index"
+     tags:
+       - high_priority
+       - customer_facing
+     slo:
+       window: 30d
+       availability:
+         enabled: true
+         target: 0.999
+       latency:
+         enabled: true
+         p99_target: 1000  # 1s p99 (list can be slower)
+         p95_target: 500
+       throughput:
+         enabled: false
+
+   - name: "Payment Processing"
+     description: "Stripe payment capture"
+     pattern: "POST /api/payments"
+     controller: "Api::PaymentsController"
+     action: "create"
+     tags:
+       - critical
+       - revenue_critical
+       - third_party_dependent
+     slo:
+       window: 30d
+       availability:
+         enabled: true
+         target: 0.999
+       latency:
+         enabled: true
+         p99_target: 2000  # 2s p99 (external API call)
+         p95_target: 1000
+       throughput:
+         enabled: true
+         min_rps: 1
+         max_rps: 100
+       burn_rate_alerts:
+         fast:
+           enabled: true
+           threshold: 10.0  # Override: more lenient for third-party dependency
+           alert_after: 10m
+
+   # -------------------------------------------------------------------------
+   # SLOW ENDPOINTS (99.9% availability + relaxed latency)
+   # -------------------------------------------------------------------------
+   - name: "Generate Report"
+     description: "Admin analytics report generation"
+     pattern: "POST /admin/reports"
+     controller: "Admin::ReportsController"
+     action: "create"
+     tags:
+       - admin
+       - slow_operation
+       - batch_processing
+     slo:
+       window: 30d
+       availability:
+         enabled: true
+         target: 0.999
+       latency:
+         enabled: true
+         p99_target: 30000  # 30s p99 (slow, but acceptable for reports)
+         p95_target: 20000  # 20s p95
+       throughput:
+         enabled: false
+       burn_rate_alerts:
+         fast:
+           enabled: false  # Disable fast burn for slow operations
+         medium:
+           enabled: true
+           threshold: 6.0
+           alert_after: 1h
+
+   - name: "Export Data"
+     description: "CSV/Excel export"
+     pattern: "POST /admin/exports"
+     controller: "Admin::ExportsController"
+     action: "create"
+     tags:
+       - admin
+       - slow_operation
+     slo:
+       window: 30d
+       availability:
+         enabled: true
+         target: 0.99  # 99% (less critical)
+       latency:
+         enabled: true
+         p99_target: 60000  # 60s p99 (very slow, but acceptable)
+       throughput:
+         enabled: false
+
+   # -------------------------------------------------------------------------
+   # LOW PRIORITY ENDPOINTS (99% availability + no latency SLO)
+   # -------------------------------------------------------------------------
+   - name: "Admin Dashboard"
+     description: "Internal admin dashboard"
+     pattern: "GET /admin/dashboard"
+     controller: "Admin::DashboardController"
+     action: "index"
+     tags:
+       - admin
+       - low_priority
+     slo:
+       window: 30d
+       availability:
+         enabled: true
+         target: 0.99  # 99%
+       latency:
+         enabled: false  # No latency SLO for admin
+       throughput:
+         enabled: false
+       burn_rate_alerts:
+         fast:
+           enabled: false
+         medium:
+           enabled: false
+         slow:
+           enabled: true  # Only slow burn
+           threshold: 2.0
+           alert_after: 12h
+
+   # -------------------------------------------------------------------------
+   # HIGH THROUGHPUT ENDPOINTS (throughput-focused)
+   # -------------------------------------------------------------------------
+   - name: "Metrics Ingestion"
+     description: "Telemetry data ingestion endpoint"
+     pattern: "POST /api/metrics"
+     controller: "Api::MetricsController"
+     action: "create"
+     tags:
+       - high_throughput
+       - telemetry
+     slo:
+       window: 30d
+       availability:
+         enabled: true
+         target: 0.99  # 99% (can tolerate some drops)
+       latency:
+         enabled: true
+         p99_target: 100  # Fast ingestion required
+       throughput:
+         enabled: true
+         min_rps: 100  # Must handle 100+ req/sec
+         max_rps: 10000  # Alert if exceeds 10k req/sec
+       burn_rate_alerts:
+         fast:
+           enabled: true
+           threshold: 20.0  # More lenient for high-throughput
+
+   # -------------------------------------------------------------------------
+   # NO SLO (explicitly excluded)
+   # -------------------------------------------------------------------------
+   - name: "Development Tools"
+     description: "Rails internal routes"
+     pattern: "GET /rails/info/*"
+     controller: "Rails::InfoController"
+     action: "*"
+     tags:
+       - development
+       - excluded
+     slo: null  # Explicitly no SLO
+
+ # ============================================================================
+ # SERVICE-LEVEL SLOs (Sidekiq, ActiveJob)
+ # ============================================================================
+ services:
+   # ---------------------------------------------------------------------------
+   # SIDEKIQ JOBS
+   # ---------------------------------------------------------------------------
+   sidekiq:
+     # Default for all jobs (unless overridden)
+     default:
+       window: 30d
+       success_rate_target: 0.995  # 99.5%
+       latency:
+         enabled: false  # No latency SLO by default for jobs
+       throughput:
+         enabled: false
+       burn_rate_alerts:
+         fast:
+           enabled: true
+           window: 1h
+           threshold: 14.4
+           alert_after: 10m  # Slower alert for jobs
+         medium:
+           enabled: true
+           window: 6h
+           threshold: 6.0
+           alert_after: 1h
+         slow:
+           enabled: true
+           window: 3d
+           threshold: 1.0
+           alert_after: 12h
+
+     # Per-job overrides
+     jobs:
+       PaymentProcessingJob:
+         window: 30d
+         success_rate_target: 0.9999  # 99.99% (critical!)
+         latency:
+           enabled: true
+           p99_target: 5000  # 5s p99
+         alert_on_single_failure: true  # Alert on any failure
+         burn_rate_alerts:
+           fast:
+             enabled: true
+             threshold: 10.0
+             alert_after: 5m
+
+       EmailNotificationJob:
+         window: 30d
+         success_rate_target: 0.95  # 95% (non-critical, can retry)
+         latency:
+           enabled: false
+         burn_rate_alerts:
+           fast:
+             enabled: false
+           medium:
+             enabled: false
+           slow:
+             enabled: true
+
+       ReportGenerationJob:
+         window: 30d
+         success_rate_target: 0.99
+         latency:
+           enabled: true
+           p99_target: 300000  # 5 minutes
+         throughput:
+           enabled: true
+           max_jobs_per_hour: 100  # Rate limit
+
+   # ---------------------------------------------------------------------------
+   # ACTIVEJOB
+   # ---------------------------------------------------------------------------
+   activejob:
+     default:
+       window: 30d
+       success_rate_target: 0.995
+       latency:
+         enabled: false
+       throughput:
+         enabled: false
+       burn_rate_alerts:
+         fast:
+           enabled: true
+           window: 1h
+           threshold: 14.4
+           alert_after: 10m
+
+ # ============================================================================
+ # APP-WIDE FALLBACK (Zero-Config)
+ # ============================================================================
+ # Used for endpoints/jobs without specific configuration
+ app_wide:
+   http:
+     window: 30d
+     availability:
+       enabled: true
+       target: 0.999  # 99.9%
+     latency:
+       enabled: true
+       p99_target: 500
+     throughput:
+       enabled: false
+     burn_rate_alerts:
+       fast:
+         enabled: true
+         window: 1h
+         threshold: 14.4
+         alert_after: 5m
+       medium:
+         enabled: true
+         window: 6h
+         threshold: 6.0
+         alert_after: 30m
+       slow:
+         enabled: true
+         window: 3d
+         threshold: 1.0
+         alert_after: 6h
+
+   sidekiq:
+     window: 30d
+     success_rate_target: 0.995
+     burn_rate_alerts:
+       fast:
+         enabled: true
+         window: 1h
+         threshold: 14.4
+         alert_after: 10m
+
+   activejob:
+     window: 30d
+     success_rate_target: 0.995
+     burn_rate_alerts:
+       fast:
+         enabled: true
+         window: 1h
+         threshold: 14.4
+         alert_after: 10m
+
+ # ============================================================================
+ # ADVANCED OPTIONS
+ # ============================================================================
+ advanced:
+   # Error budget alerts (percentage thresholds)
+   error_budget_alerts:
+     enabled: true
+     thresholds: [50, 80, 90, 100]  # Alert at 50%, 80%, 90%, 100% consumed
+     notify:
+       slack: true
+       pagerduty: false
+       email: true
+
+   # Deployment gate (block deploys if error budget low)
+   deployment_gate:
+     enabled: false  # Disabled by default (use with caution!)
+     minimum_budget_percent: 20  # Need 20%+ budget to deploy
+     critical_endpoints_only: true  # Only check critical endpoints
+     override_label: "deploy:emergency"  # GitHub label to override
+
+   # Auto-scaling based on SLO
+   autoscaling:
+     enabled: false  # Future feature
+     scale_up_on_burn_rate: 10.0
+     scale_down_on_budget_surplus: 0.5
+
+   # SLO dashboard links
+   dashboards:
+     grafana_base_url: "https://grafana.example.com/d/e11y-slo"
+     per_endpoint_template: "https://grafana.example.com/d/e11y-slo-endpoint?var-controller={controller}&var-action={action}"
+
+   # Runbook links
+   runbooks:
+     base_url: "https://wiki.example.com/runbooks"
+     fast_burn_template: "{base_url}/fast-burn-{controller}-{action}"
+     medium_burn_template: "{base_url}/medium-burn-{controller}-{action}"
+ ```
+
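The burn-rate comments in the schema above (for example "14.4x burn rate = 2% of 30-day budget in 1h") and the downtime figures next to the availability targets all follow from two small formulas: budget consumed = burn rate × alert window / SLO window, and allowed downtime = (1 − target) × SLO window. The Ruby sketch below only checks that arithmetic. The method names are illustrative assumptions and are not part of the gem.

```ruby
# Fraction of the error budget spent when burning at `burn_rate`
# for `alert_window_hours`, relative to an SLO window given in days.
# (Hypothetical helper for illustration, not an E11y API.)
def budget_consumed(burn_rate:, alert_window_hours:, slo_window_days: 30)
  burn_rate * alert_window_hours / (slo_window_days * 24.0)
end

# Minutes of full downtime that a given availability target tolerates
# over the SLO window. (Hypothetical helper for illustration.)
def allowed_downtime_minutes(target:, slo_window_days: 30)
  (1.0 - target) * slo_window_days * 24 * 60
end

# The three default alert tiers from the schema: [name, threshold, window in hours].
[["fast", 14.4, 1], ["medium", 6.0, 6], ["slow", 1.0, 72]].each do |name, rate, hours|
  share = budget_consumed(burn_rate: rate, alert_window_hours: hours)
  puts format("%-6s burn: %.0f%% of the 30-day budget", name, share * 100)
end

puts format("99.9%% target:  %.2f minutes per 30 days", allowed_downtime_minutes(target: 0.999))
puts format("99.99%% target: %.2f minutes per 30 days", allowed_downtime_minutes(target: 0.9999))
```

With a 7-day window (as in the admin-tool example later in this ADR), pass `slo_window_days: 7`; the same thresholds then correspond to a larger share of the budget.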
879
+ ### 4.2. Config loading (shipped)
880
+
881
+ The gem ships **`E11y::SLO::ConfigLoader`** (`lib/e11y/slo/config_loader.rb`): it searches configurable directories for `slo.yml`, parses with `YAML.safe_load`, and returns a Hash or `nil` if no file exists. There is no `Config` object, no `load!`, no `ZeroConfig`, and no route/job validation inside the loader.
882
+
883
+ Use **`E11y::SLO::ConfigValidator.validate(config)`** on a parsed Hash for the checks implemented today (`version`, `endpoints`, `app_wide.aggregated_slo`, `e11y_self_monitoring`). See `lib/e11y/slo/config_validator.rb`.
884
+
885
+ Runtime SLO behaviour is driven by **`E11y::SLO::Tracker`** and code under `lib/e11y/slo/` plus request/job instrumentation—not by the long pseudo-code that used to appear in this ADR.
886
+
887
+ The YAML in §4.1 remains a **reference** for metrics, PromQL, and dashboards; the loader and **`E11y::SLO::DashboardGenerator`** consume only part of that structure.
888
+
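For orientation, the loading behaviour described above (search directories for `slo.yml`, parse with `YAML.safe_load`, return a Hash or `nil`) can be approximated in a few lines of plain Ruby. This is a stand-alone sketch under that description, not the gem's `ConfigLoader` source; `find_slo_config` and its argument are hypothetical names.

```ruby
require "yaml"
require "tmpdir"

# Hypothetical stand-in, NOT the gem's ConfigLoader: look for slo.yml in each
# directory in order and return the parsed document, or nil if none is found.
def find_slo_config(search_paths)
  path = search_paths
         .map { |dir| File.join(dir, "slo.yml") }
         .find { |candidate| File.file?(candidate) }
  return nil unless path

  # safe_load only builds plain scalars, arrays and hashes, so a config file
  # cannot instantiate arbitrary Ruby objects.
  YAML.safe_load(File.read(path))
end

# Small demonstration against a throwaway directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "slo.yml"), "version: 1\nendpoints: []\n")
  p find_slo_config([dir])                        # the parsed Hash
  p find_slo_config([File.join(dir, "missing")])  # nil: no slo.yml there
end
```

Note that values such as `30d` or `5m` come back as plain strings; turning them into durations is left to whatever consumes the Hash.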
+ ---
+
+ ## 5. PromQL & Alerts
+
+ PromQL queries and Prometheus alert rules: see [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md).
+
+ ---
+
+ ## 6. SLO config validation and linting
+
+ ### 6.1. Shipped APIs and tasks
+
+ - **`E11y::SLO::ConfigLoader.load` / `.load(search_paths:)`**: returns a Hash or `nil`; see `spec/e11y/slo/config_loader_spec.rb`.
+ - **`E11y::SLO::ConfigValidator.validate(config)`**: returns an array of error strings; see `spec/e11y/slo/config_validator_spec.rb`.
+ - **`rake e11y:slo:dashboard`**: generates Grafana JSON from `slo.yml`; see `lib/tasks/e11y_slo.rake`.
+ - **`rake e11y:lint`**: lints E11y configuration and related rules.
+ - **`rake e11y:slo:validate`**: backwards-compatible alias that **invokes `e11y:lint`** (`lib/tasks/e11y_lint.rake`). It does **not** run a dedicated “SLO vs routes” validator of the kind described in old drafts of this ADR.
+
+ There is **no** `rake e11y:slo:unconfigured` task in the repository.
+
+ ### 6.2. CI
+
+ You can run `bundle exec rake e11y:lint` (or `e11y:slo:validate`) in CI when `config/slo.yml` changes. Drop any workflow step that called `e11y:slo:unconfigured`, or replace it with your own check using `ConfigLoader.load` + `ConfigValidator.validate`.
+
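If you replace a removed workflow step with your own check, it can follow the same contract as the shipped validator: take a parsed Hash and return an array of error strings. The sketch below is one possible project-specific check, written so it runs without the gem. The method name and the rules it enforces (a `version` key, named endpoints, no duplicate `controller#action` pairs) are assumptions for illustration, not the rules `ConfigValidator` implements.

```ruby
# Hypothetical project-specific check, NOT the gem's ConfigValidator.
# Takes a parsed slo.yml Hash and returns an array of error strings.
def slo_config_errors(config)
  return ["config must be a Hash"] unless config.is_a?(Hash)

  errors = []
  errors << "version is required" unless config.key?("version")

  endpoints = config.fetch("endpoints", [])
  return errors << "endpoints must be a list" unless endpoints.is_a?(Array)

  seen = {}
  endpoints.each_with_index do |endpoint, index|
    unless endpoint.is_a?(Hash) && endpoint["name"]
      errors << "endpoints[#{index}] must be a mapping with a name"
      next
    end

    key = endpoint.values_at("controller", "action")
    next if key.any?(&:nil?) # name-only entries have nothing to compare

    errors << "endpoints[#{index}] duplicates #{key.join('#')}" if seen[key]
    seen[key] = true
  end
  errors
end

sample = {
  "version" => 1,
  "endpoints" => [
    { "name" => "Create Order", "controller" => "Api::OrdersController", "action" => "create" },
    { "name" => "Create Order (copy)", "controller" => "Api::OrdersController", "action" => "create" }
  ]
}
puts slo_config_errors(sample)
```

In a CI job, a wrapper script could feed the Hash returned by `ConfigLoader.load` into a check like this, print the returned strings, and exit non-zero when the array is not empty.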
+ ### 6.3. Tests
+
+ Use the specs under `spec/e11y/slo/` and `spec/e11y/linters/slo/` as the source of truth for behaviour.
+
+ ---
+
+ ## 7. Dashboard & Reporting
+
+ ### 7.1. Per-Endpoint Grafana Dashboard
+
+ ```json
+ {
+   "dashboard": {
+     "title": "E11y Per-Endpoint SLO Dashboard",
+     "templating": {
+       "list": [
+         {
+           "name": "controller",
+           "type": "query",
+           "query": "label_values(http_requests_total, controller)"
+         },
+         {
+           "name": "action",
+           "type": "query",
+           "query": "label_values(http_requests_total{controller=\"$controller\"}, action)"
+         }
+       ]
+     },
+     "panels": [
+       {
+         "title": "Availability SLO: $controller#$action",
+         "targets": [
+           {
+             "expr": "sum(rate(http_requests_total{controller=\"$controller\",action=\"$action\",status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total{controller=\"$controller\",action=\"$action\"}[30d]))",
+             "legendFormat": "Current (30d)"
+           },
+           {
+             "expr": "0.999",
+             "legendFormat": "SLO Target (99.9%)"
+           }
+         ],
+         "yaxis": {
+           "min": 0.995,
+           "max": 1.0
+         }
+       },
+       {
+         "title": "Error Budget: $controller#$action",
+         "targets": [
+           {
+             "expr": "slo_error_budget_remaining{controller=\"$controller\",action=\"$action\"}",
+             "legendFormat": "Remaining"
+           }
+         ],
+         "thresholds": [
+           { "value": 0, "color": "red" },
+           { "value": 0.0002, "color": "yellow" },
+           { "value": 0.001, "color": "green" }
+         ]
+       },
+       {
+         "title": "Burn Rate (Multi-Window): $controller#$action",
+         "targets": [
+           {
+             "expr": "slo_burn_rate_1h{controller=\"$controller\",action=\"$action\"}",
+             "legendFormat": "1h (fast burn)"
+           },
+           {
+             "expr": "slo_burn_rate_6h{controller=\"$controller\",action=\"$action\"}",
+             "legendFormat": "6h (medium burn)"
+           },
+           {
+             "expr": "slo_burn_rate_3d{controller=\"$controller\",action=\"$action\"}",
+             "legendFormat": "3d (slow burn)"
+           },
+           {
+             "expr": "14.4",
+             "legendFormat": "Fast Burn Threshold"
+           },
+           {
+             "expr": "6.0",
+             "legendFormat": "Medium Burn Threshold"
+           },
+           {
+             "expr": "1.0",
+             "legendFormat": "Slow Burn Threshold"
+           }
+         ]
+       },
+       {
+         "title": "Latency p99: $controller#$action",
+         "targets": [
+           {
+             "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{controller=\"$controller\",action=\"$action\"}[5m])) by (le))",
+             "legendFormat": "p99"
+           },
+           {
+             "expr": "0.5",
+             "legendFormat": "SLO Target (500ms)"
+           }
+         ]
+       }
+     ]
+   }
+ }
+ ```
+
+ ---
+
+ ## 8. Production Best Practices & Edge Cases
+
+ ### 8.1. Rollout Strategy
+
+ **Phase 1: Observability Only (1-2 weeks)**
+ ```yaml
+ # config/slo.yml - Initial rollout
+ version: 1
+
+ # Start with app-wide only (no per-endpoint)
+ app_wide:
+   http:
+     availability:
+       enabled: true
+       target: 0.999
+     latency:
+       enabled: true
+       p99_target: 1000  # Conservative: 1s
+
+ # Disable burn rate alerts initially
+ defaults:
+   burn_rate_alerts:
+     fast:
+       enabled: false  # Don't page SRE yet!
+     medium:
+       enabled: false
+     slow:
+       enabled: true  # Only slow burn (info)
+       alert_after: 24h  # Very slow
+
+ # Keep the deployment gate disabled (don't block deploys yet)
+ advanced:
+   deployment_gate:
+     enabled: false
+ ```
+
+ **Phase 2: Per-Endpoint + Slow Alerts (2-4 weeks)**
+ ```yaml
+ # Add 3-5 critical endpoints
+ endpoints:
+   - name: "Health Check"
+     controller: "HealthController"
+     action: "index"
+     slo:
+       availability:
+         target: 0.9999  # Start strict
+
+   - name: "Create Order"
+     controller: "OrdersController"
+     action: "create"
+     slo:
+       availability:
+         target: 0.999
+       burn_rate_alerts:
+         slow:
+           enabled: true  # Only slow burn for now
+           alert_after: 12h
+ ```
+
+ **Phase 3: Multi-Window Burn Rate (4-6 weeks)**
+ ```yaml
+ # Enable medium + fast burn rate alerts
+ endpoints:
+   - name: "Create Order"
+     slo:
+       burn_rate_alerts:
+         fast:
+           enabled: true
+           alert_after: 10m  # Start conservative (10m not 5m)
+         medium:
+           enabled: true
+         slow:
+           enabled: true
+ ```
+
+ **Phase 4: Deployment Gate (6-8 weeks)**
+ ```yaml
+ # Only after confidence in data
+ advanced:
+   deployment_gate:
+     enabled: true
+     minimum_budget_percent: 10  # Start lenient (10% not 20%)
+     override_label: "deploy:emergency"
+ ```
+
+ ### 8.2. Edge cases (guidance)
+
+ Rollouts often hit: routes unavailable in CI, Prometheus outages during budget checks, low-traffic burn-rate noise, emergency deploys during incidents, endpoints without history, and maintenance windows. Older drafts of this ADR embedded Ruby and YAML listings that referred to **non-existent** rake tasks (for example `e11y:slo:deployment_gate:check`) and to Ruby modules under `lib/e11y/slo/` that were never shipped. Those listings were removed because they are **not** gem APIs.
+
+ Use **§6** for what ships today (`ConfigLoader`, `ConfigValidator`, `e11y:lint`, `e11y:slo:dashboard`). Treat deployment gates, error-budget automation, and advanced burn-rate logic as **platform or application** work built on your own Prometheus and CI.
+
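Since deployment gates are left to platform or application code, the following Ruby sketch shows one way the decision itself could be expressed once good and total request counts for the SLO window have been fetched from your own metrics backend. Everything here is an assumption for illustration: the method names, the choice to treat zero traffic as a full budget, and the boolean override (standing in for something like the `deploy:emergency` label). None of it is provided by the gem.

```ruby
# Percentage of the error budget still unspent over the SLO window.
# Can go negative once the budget is overspent. (Hypothetical helper.)
def error_budget_remaining_percent(good:, total:, target:)
  raise ArgumentError, "target must be below 1.0" unless target < 1.0
  return 100.0 if total.zero? # assumption: no traffic means nothing was spent

  allowed_failures = (1.0 - target) * total
  actual_failures = total - good
  (1.0 - actual_failures / allowed_failures) * 100.0
end

# Gate decision: deploy when enough budget remains, or when an operator
# has explicitly overridden the gate. (Hypothetical helper.)
def deploy_allowed?(remaining_percent:, minimum_budget_percent:, override: false)
  override || remaining_percent >= minimum_budget_percent
end

remaining = error_budget_remaining_percent(good: 999_100, total: 1_000_000, target: 0.999)
puts format("error budget remaining: %.1f%%", remaining)
puts "deploy allowed at 20%% minimum: #{deploy_allowed?(remaining_percent: remaining, minimum_budget_percent: 20)}"
```

A real gate also needs a policy for the edge cases listed above, for example whether a metrics outage should block or allow the deploy.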
+ ### 8.3. Monitoring the SLO system itself
+
+ When `e11y_self_monitoring.enabled` is `true` in `slo.yml`, **`E11y::SLO::ConfigLoader.self_monitoring_enabled?`** is true and **`E11y::Middleware::SelfMonitoringEmit`** increments **`e11y_events_tracked_total`** for events that complete the pipeline (see `lib/e11y/middleware/self_monitoring_emit.rb`). There is **no** `E11y.configure { |c| c.slo.self_monitoring do ... }` DSL.
+
+ Derive Prometheus alert rules from the metric names your deployment actually exports (Yabeda prefixing, scrape config, etc.). Do not assume placeholder metric names from older versions of this ADR.
+
+ ---
+
+ ## 9. Trade-offs
+
+ ### 9.1. Key Decisions
+
+ | Decision | Pro | Con | Rationale |
+ |----------|-----|-----|-----------|
+ | **Per-endpoint SLO** | Granular visibility | Config complexity | Critical endpoints need specific SLOs |
+ | **Multi-window burn rate** | 5-minute detection, low false positives | Complex Prometheus queries | Google SRE best practice 2026 |
+ | **YAML-based config** | Version controlled, validated | Extra file | Separation of concerns |
+ | **Optional latency SLO** | Flexible | Some endpoints untracked | Not all endpoints need latency |
+ | **Config validation** | Prevents drift | CI/CD overhead | Critical for accuracy |
+ | **30-day SLO window** | Industry standard | Slow trend detection | Multi-window compensates |
+
+ ### 9.2. Alternatives Considered
+
+ **A) Single app-wide SLO only**
+ - ❌ Rejected: Too coarse, hides critical endpoint issues
+
+ **B) Single-window alerting**
+ - ❌ Rejected: Either slow (30d) or noisy (5m)
+
+ **C) Code-based SLO config**
+ - ❌ Rejected: Requires deployment to change SLOs
+
+ **D) No config validation**
+ - ❌ Rejected: Config drift is a real problem
+
+ **E) Per-user SLO**
+ - ❌ Deferred to v2.0: Too complex for v1
+
+ ---
+
+ ## 10. Real-World Configuration Examples
+
+ ### 10.1. E-Commerce Platform
+
+ ```yaml
+ # config/slo.yml - E-commerce example
+ version: 1
+
+ defaults:
+   window: 30d
+   availability:
+     enabled: true
+     target: 0.999
+
+ endpoints:
+   # === REVENUE-CRITICAL (99.99%) ===
+   - name: "Checkout - Payment"
+     pattern: "POST /checkout/payment"
+     controller: "Checkout::PaymentsController"
+     action: "create"
+     tags: [critical, revenue, pci_scope]
+     slo:
+       availability:
+         target: 0.9999  # 99.99%
+       latency:
+         p99_target: 2000  # 2s (Stripe API call)
+         p95_target: 1000
+       throughput:
+         min_rps: 1
+         max_rps: 100  # Rate limit (fraud protection)
+       burn_rate_alerts:
+         fast:
+           threshold: 10.0  # More lenient (third-party)
+           alert_after: 5m
+
+   - name: "Cart - Add Item"
+     pattern: "POST /cart/items"
+     controller: "CartController"
+     action: "add_item"
+     tags: [high_priority, customer_facing]
+     slo:
+       availability:
+         target: 0.999  # 99.9%
+       latency:
+         p99_target: 300
+         p95_target: 150
+       throughput:
+         max_rps: 1000
+
+   # === HIGH-TRAFFIC (throughput-focused) ===
+   - name: "Product Search"
+     pattern: "GET /api/products/search"
+     controller: "Api::ProductsController"
+     action: "search"
+     tags: [high_traffic, search, cached]
+     slo:
+       availability:
+         target: 0.995  # 99.5% (can tolerate cache misses)
+       latency:
+         p99_target: 500
+       throughput:
+         min_rps: 50  # Must handle 50+ req/sec
+         max_rps: 5000
+
+   # === ADMIN (low priority) ===
+   - name: "Admin - Sales Report"
+     pattern: "POST /admin/reports/sales"
+     controller: "Admin::ReportsController"
+     action: "sales"
+     tags: [admin, slow_operation]
+     slo:
+       availability:
+         target: 0.99  # 99%
+       latency:
+         p99_target: 30000  # 30s
+       burn_rate_alerts:
+         fast:
+           enabled: false
+         slow:
+           enabled: true
+
+ services:
+   sidekiq:
+     jobs:
+       PaymentProcessingJob:
+         success_rate_target: 0.9999  # Critical!
+         alert_on_single_failure: true
+
+       InventorySyncJob:
+         success_rate_target: 0.99
+         latency:
+           p99_target: 60000  # 60s
+ ```
+
+ ### 10.2. SaaS API Platform
+
+ ```yaml
+ # config/slo.yml - API platform example
+ version: 1
+
+ defaults:
+   window: 30d
+   availability:
+     enabled: true
+     target: 0.999
+   latency:
+     enabled: true
+     p99_target: 200  # Fast API
+
+ endpoints:
+   # === PUBLIC API (99.99%) ===
+   - name: "API - Create Resource"
+     pattern: "POST /api/v1/resources"
+     controller: "Api::V1::ResourcesController"
+     action: "create"
+     tags: [api, customer_facing, rate_limited]
+     slo:
+       availability:
+         target: 0.9999  # 99.99% SLA
+       latency:
+         p99_target: 200
+         p95_target: 100
+       throughput:
+         min_rps: 10
+         max_rps: 10000  # High throughput API
+       burn_rate_alerts:
+         fast:
+           threshold: 14.4
+           alert_after: 5m
+
+   # === WEBHOOKS (eventual consistency) ===
+   - name: "Webhook Delivery"
+     pattern: "POST /internal/webhooks/deliver"
+     controller: "Internal::WebhooksController"
+     action: "deliver"
+     tags: [internal, async, retry]
+     slo:
+       availability:
+         target: 0.95  # 95% (retries handle failures)
+       latency:
+         enabled: false  # Async, latency not critical
+       burn_rate_alerts:
+         fast:
+           enabled: false
+         slow:
+           enabled: true
+
+ services:
+   sidekiq:
+     default:
+       success_rate_target: 0.999
+     jobs:
+       WebhookDeliveryJob:
+         success_rate_target: 0.95  # Retries + DLQ
+         latency:
+           p99_target: 10000  # 10s (external API)
+ ```
+
+ ### 10.3. Internal Admin Tool
+
+ ```yaml
+ # config/slo.yml - Admin tool example
+ version: 1
+
+ defaults:
+   window: 7d  # Shorter window (less critical)
+   availability:
+     enabled: true
+     target: 0.99  # 99% (internal users tolerate downtime)
+   latency:
+     enabled: false  # No latency SLO by default
+
+ endpoints:
+   - name: "Admin Dashboard"
+     pattern: "GET /admin"
+     controller: "AdminController"
+     action: "index"
+     tags: [admin, internal]
+     slo:
+       availability:
+         target: 0.99
+       burn_rate_alerts:
+         fast:
+           enabled: false
+         slow:
+           enabled: true
+           alert_after: 24h  # Very slow
+
+   - name: "Data Export"
+     pattern: "POST /admin/exports"
+     controller: "Admin::ExportsController"
+     action: "create"
+     tags: [admin, slow_operation]
+     slo:
+       availability:
+         target: 0.95  # 95% (can retry)
+       latency:
+         p99_target: 120000  # 2 minutes (large CSV)
+
+ advanced:
+   deployment_gate:
+     enabled: false  # No deployment gate for admin tool
+
+   error_budget_alerts:
+     enabled: false  # No budget alerts
+ ```
+
+ ---
+
+ ## 11. Summary & Next Steps
+
+ ### 11.1. What We Achieved
+
+ ✅ **Multi-level SLO strategy**: App-wide, service-level, per-endpoint
+ ✅ **5-minute alert detection**: Multi-window burn rate (Google SRE 2026)
+ ✅ **YAML-based configuration**: Version-controlled `slo.yml`; partial validation via `ConfigValidator`
+ ✅ **Flexible latency SLO**: Optional per endpoint (reference schema in §4.1)
+ ✅ **Throughput SLO**: Reference schema for high-traffic endpoints
+ ✅ **Config validation & linting**: `e11y:lint`, `ConfigLoader` + `ConfigValidator` (subset)
+ ✅ **Shipped tooling**: `ConfigLoader`, `ConfigValidator`, `e11y:slo:dashboard`, `SelfMonitoringEmit` when enabled in YAML
+ ✅ **PromQL & alerts**: See [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md)
+ ✅ **RSpec testing**: Specs for `ConfigLoader`, `ConfigValidator`, and the SLO linters (see §6.3)
+ ✅ **Production best practices**: Rollout strategy, edge case handling, self-monitoring
+ ✅ **Real-world examples**: E-commerce, SaaS API, Admin tool configurations
+
+ ### 11.2. Implementation Checklist
+
+ **Phase 1: Core (Week 1-2)**
+ - [x] Implement `E11y::SLO::ConfigLoader` (YAML search + `safe_load`)
+ - [ ] Optional: richer loader (ERB, strict mode) if product needs it
+ - [x] Implement `E11y::SLO::ConfigValidator.validate` (subset of schema)
+ - [x] `rake e11y:slo:validate` → alias for `e11y:lint`
+ - [x] HTTP / job SLIs via `E11y::SLO::Tracker` + Rack / Sidekiq / ActiveJob hooks
+ - [ ] Broader `slo.yml` consumption (beyond dashboard + linters), if desired
+
+ **Phase 2: Production Readiness (Week 3-4)**
+ - [ ] Maintenance windows, deploy grace periods, deployment gates (product/platform)
+ - [x] Basic self-monitoring hook (`e11y_self_monitoring` in YAML + `SelfMonitoringEmit`)
+ - [ ] CI: run `e11y:lint` / optional `ConfigValidator.validate` on `slo.yml` changes
+ - [ ] Operational playbooks (runbooks, Grafana) per team
+
+ **Phase 3: Tests**
+ - [x] `spec/e11y/slo/config_loader_spec.rb`, `config_validator_spec.rb`, linters under `spec/e11y/linters/slo/`
+ - [ ] Additional integration coverage as features grow
+
+ ---
+
+ **Status:** Core SLO tracking and `slo.yml` tooling are in the gem; large parts of the §4.1 YAML remain a **reference** for alerting/dashboards rather than a fully interpreted config file.
+ **Next:** Expand or document which YAML keys each component reads (`DashboardGenerator`, linters, `ConfigLoader`).
+ **Impact:** HTTP/job SLI metrics, optional `slo.yml` + dashboard export, PromQL examples in [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md).