e11y 0.2.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.rubocop.yml +130 -10
- data/CHANGELOG.md +80 -1
- data/CLAUDE.md +168 -0
- data/CONTRIBUTING.md +640 -0
- data/README.md +165 -701
- data/RELEASE.md +41 -12
- data/Rakefile +249 -57
- data/config/README.md +1 -1
- data/config/loki-local-config.yaml +12 -0
- data/config/otel-collector-config.yaml +44 -0
- data/cucumber.yml +1 -0
- data/docker-compose.yml +18 -2
- data/docs/ADAPTERS.md +76 -0
- data/docs/ADAPTIVE_SAMPLING.md +59 -0
- data/docs/COMPARISON.md +104 -0
- data/docs/CONFIGURATION.md +52 -0
- data/docs/DISTRIBUTED_TRACING.md +44 -0
- data/docs/LIMITATIONS.md +13 -0
- data/docs/METRICS_DSL.md +84 -0
- data/docs/PERFORMANCE.md +60 -0
- data/docs/PII_FILTERING.md +40 -0
- data/docs/PRESETS.md +65 -0
- data/docs/QUICK-START.md +546 -587
- data/docs/RAILS_INTEGRATION.md +79 -0
- data/docs/SCHEMA_VALIDATION.md +63 -0
- data/docs/SLO-PROMQL-ALERTS.md +161 -0
- data/docs/TESTING.md +69 -0
- data/docs/{ADR-001-architecture.md → architecture/ADR-001-architecture.md} +36 -65
- data/docs/{ADR-002-metrics-yabeda.md → architecture/ADR-002-metrics-yabeda.md} +62 -236
- data/docs/architecture/ADR-003-slo-observability.md +1402 -0
- data/docs/{ADR-004-adapter-architecture.md → architecture/ADR-004-adapter-architecture.md} +163 -146
- data/docs/{ADR-005-tracing-context.md → architecture/ADR-005-tracing-context.md} +10 -9
- data/docs/{ADR-006-security-compliance.md → architecture/ADR-006-security-compliance.md} +184 -191
- data/docs/{ADR-007-opentelemetry-integration.md → architecture/ADR-007-opentelemetry-integration.md} +3 -21
- data/docs/{ADR-008-rails-integration.md → architecture/ADR-008-rails-integration.md} +182 -743
- data/docs/{ADR-009-cost-optimization.md → architecture/ADR-009-cost-optimization.md} +45 -54
- data/docs/architecture/ADR-010-developer-experience.md +522 -0
- data/docs/{ADR-011-testing-strategy.md → architecture/ADR-011-testing-strategy.md} +44 -86
- data/docs/{ADR-012-event-evolution.md → architecture/ADR-012-event-evolution.md} +11 -11
- data/docs/{ADR-013-reliability-error-handling.md → architecture/ADR-013-reliability-error-handling.md} +37 -12
- data/docs/{ADR-014-event-driven-slo.md → architecture/ADR-014-event-driven-slo.md} +12 -24
- data/docs/{ADR-015-middleware-order.md → architecture/ADR-015-middleware-order.md} +43 -59
- data/docs/{ADR-016-self-monitoring-slo.md → architecture/ADR-016-self-monitoring-slo.md} +58 -355
- data/docs/{ADR-017-multi-rails-compatibility.md → architecture/ADR-017-multi-rails-compatibility.md} +4 -11
- data/docs/architecture/ADR-018-memory-optimization.md +366 -0
- data/docs/{ADR-INDEX.md → architecture/ADR-INDEX.md} +11 -6
- data/docs/plans/2026-03-20-browser-overlay-svelte.md +281 -0
- data/docs/{00-ICP-AND-TIMELINE.md → prd/00-ICP-AND-TIMELINE.md} +6 -6
- data/docs/{01-SCALE-REQUIREMENTS.md → prd/01-SCALE-REQUIREMENTS.md} +6 -6
- data/docs/prd/01-overview-vision.md +19 -14
- data/docs/use_cases/README.md +22 -23
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +50 -44
- data/docs/use_cases/UC-002-business-event-tracking.md +26 -95
- data/docs/use_cases/UC-003-event-metrics.md +66 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +33 -684
- data/docs/use_cases/UC-005-sentry-integration.md +13 -15
- data/docs/use_cases/UC-006-trace-context-management.md +30 -28
- data/docs/use_cases/UC-007-pii-filtering.md +35 -87
- data/docs/use_cases/UC-008-opentelemetry-integration.md +51 -89
- data/docs/use_cases/UC-009-multi-service-tracing.md +30 -178
- data/docs/use_cases/UC-010-background-job-tracking.md +24 -91
- data/docs/use_cases/UC-011-rate-limiting.md +95 -168
- data/docs/use_cases/UC-012-audit-trail.md +21 -46
- data/docs/use_cases/UC-013-high-cardinality-protection.md +29 -167
- data/docs/use_cases/UC-014-adaptive-sampling.md +2 -2
- data/docs/use_cases/UC-015-cost-optimization.md +46 -99
- data/docs/use_cases/UC-016-rails-logger-migration.md +39 -213
- data/docs/use_cases/UC-017-local-development.md +203 -777
- data/docs/use_cases/UC-018-testing-events.md +3 -3
- data/docs/use_cases/UC-019-retention-based-routing.md +53 -106
- data/docs/use_cases/UC-020-event-versioning.md +8 -9
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +18 -22
- data/docs/use_cases/UC-022-event-registry.md +15 -21
- data/docs/use_cases/backlog.md +119 -87
- data/e11y.gemspec +2 -2
- data/gems/e11y-devtools/README.md +158 -0
- data/gems/e11y-devtools/config/routes.rb +15 -0
- data/gems/e11y-devtools/e11y-devtools.gemspec +25 -0
- data/gems/e11y-devtools/exe/e11y +34 -0
- data/gems/e11y-devtools/frontend/.gitignore +24 -0
- data/gems/e11y-devtools/frontend/README.md +51 -0
- data/gems/e11y-devtools/frontend/index.html +14 -0
- data/gems/e11y-devtools/frontend/package-lock.json +3707 -0
- data/gems/e11y-devtools/frontend/package.json +28 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/events/recent.json +4205 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/interactions.json +194 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/0a2e04027cfa22d014bc22e8b27cd913/events.json +86 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/0e1543af6a630fb3af6b52283154b3e0/events.json +169 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/1838b691faa49564f97db8592ff3978d/events.json +78 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/29f198f6588dacffb687777eb5f8f118/events.json +197 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/34bc3c9c0097de28a7a6f99b90a8e7bc/events.json +194 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/3ba6c20d068ab9cee00e51b180e66444/events.json +184 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/435bfd8f17b9009146a79812d7c3726d/events.json +144 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/4c7676e3fe668e99edb2b94d7d5678a9/events.json +222 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/6daf0d47974bedfc55d5de7004a3ea9f/events.json +194 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/8a81ada42834d15f287bb40010043605/events.json +194 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/8c0a98900edaae105469df8daedccf02/events.json +198 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/8e4f645180f8a7d1dce426b07380466b/events.json +222 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/93db346fa5d44a032605a13b627f4b80/events.json +128 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/98ff6146faf7bd9be8bd03a8275817ba/events.json +223 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/9997ddd0247bc7e25f2ca7a5c415c93d/events.json +197 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/99e35f8ef3baedd798cc4fd085980ad9/events.json +194 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/b4f3095c1909924cbc98889a86c83d6d/events.json +131 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/b54b7fc32b7575a7110de809d11ccda0/events.json +128 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/c0b48033fa06746bcc5886745e053cff/events.json +169 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/c44649ac76701b4558927cd2305ab535/events.json +169 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/d601ae3320057580a39dbdac2edfdf4a/events.json +248 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/e67e724bab422d2b52eeb49635e512e1/events.json +194 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/e6c72765a28f158a8485b35fa63f73da/events.json +194 -0
- data/gems/e11y-devtools/frontend/public/mocks/v1/traces/f541b87405c9a54819b18ebe529f6419/events.json +194 -0
- data/gems/e11y-devtools/frontend/scripts/generate_mocks.rb +397 -0
- data/gems/e11y-devtools/frontend/src/App.svelte +827 -0
- data/gems/e11y-devtools/frontend/src/components/Fab.svelte +19 -0
- data/gems/e11y-devtools/frontend/src/components/FilterBar.svelte +38 -0
- data/gems/e11y-devtools/frontend/src/components/FullscreenPanel.svelte +82 -0
- data/gems/e11y-devtools/frontend/src/components/InteractionsTimeline.svelte +264 -0
- data/gems/e11y-devtools/frontend/src/components/RecentHistogram.svelte +354 -0
- data/gems/e11y-devtools/frontend/src/lib/api.ts +37 -0
- data/gems/e11y-devtools/frontend/src/lib/eventIdentity.ts +12 -0
- data/gems/e11y-devtools/frontend/src/lib/format.ts +37 -0
- data/gems/e11y-devtools/frontend/src/lib/listFilter.ts +43 -0
- data/gems/e11y-devtools/frontend/src/lib/recentVolume.ts +80 -0
- data/gems/e11y-devtools/frontend/src/lib/router.ts +12 -0
- data/gems/e11y-devtools/frontend/src/lib/transitions.ts +34 -0
- data/gems/e11y-devtools/frontend/src/lib/viewportOrigin.ts +25 -0
- data/gems/e11y-devtools/frontend/src/main.ts +8 -0
- data/gems/e11y-devtools/frontend/src/overlay-entry.ts +24 -0
- data/gems/e11y-devtools/frontend/src/overlay.css +1080 -0
- data/gems/e11y-devtools/frontend/svelte.config.js +2 -0
- data/gems/e11y-devtools/frontend/test_puppeteer.js +41 -0
- data/gems/e11y-devtools/frontend/test_scale.js +3 -0
- data/gems/e11y-devtools/frontend/tsconfig.app.json +21 -0
- data/gems/e11y-devtools/frontend/tsconfig.json +7 -0
- data/gems/e11y-devtools/frontend/tsconfig.node.json +26 -0
- data/gems/e11y-devtools/frontend/vite.config.ts +36 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/server.rb +96 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tool_base.rb +25 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/clear.rb +31 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/errors.rb +35 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/event_detail.rb +33 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/events_by_trace.rb +33 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/interactions.rb +40 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/recent_events.rb +34 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/search.rb +34 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/stats.rb +30 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/assets/overlay.js +20 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/controller.rb +94 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/engine.rb +26 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/middleware.rb +80 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/rails_controller.rb +67 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/app.rb +262 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/grouping.rb +66 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_detail.rb +62 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_list.rb +70 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/interaction_list.rb +47 -0
- data/gems/e11y-devtools/lib/e11y/devtools/version.rb +8 -0
- data/gems/e11y-devtools/lib/e11y/devtools.rb +13 -0
- data/gems/e11y-devtools/spec/e11y/devtools/mcp/tools_spec.rb +107 -0
- data/gems/e11y-devtools/spec/e11y/devtools/overlay/controller_spec.rb +91 -0
- data/gems/e11y-devtools/spec/e11y/devtools/overlay/middleware_spec.rb +46 -0
- data/gems/e11y-devtools/spec/e11y/devtools/tui/app_spec.rb +85 -0
- data/gems/e11y-devtools/spec/e11y/devtools/tui/grouping_spec.rb +64 -0
- data/gems/e11y-devtools/spec/spec_helper.rb +5 -0
- data/gems/e11y-devtools/spec/tui/widgets/event_list_spec.rb +44 -0
- data/gems/e11y-devtools/spec/tui/widgets/interaction_list_spec.rb +62 -0
- data/lib/e11y/adapters/audit_encrypted.rb +53 -11
- data/lib/e11y/adapters/base.rb +33 -34
- data/lib/e11y/adapters/dev_log/file_store.rb +143 -0
- data/lib/e11y/adapters/dev_log/query.rb +219 -0
- data/lib/e11y/adapters/dev_log.rb +118 -0
- data/lib/e11y/adapters/file.rb +3 -6
- data/lib/e11y/adapters/in_memory.rb +52 -5
- data/lib/e11y/adapters/in_memory_test.rb +29 -0
- data/lib/e11y/adapters/loki.rb +58 -23
- data/lib/e11y/adapters/null.rb +82 -0
- data/lib/e11y/adapters/opentelemetry_collector.rb +183 -0
- data/lib/e11y/adapters/otel_logs.rb +136 -23
- data/lib/e11y/adapters/sentry.rb +4 -7
- data/lib/e11y/adapters/stdout.rb +73 -7
- data/lib/e11y/adapters/yabeda.rb +153 -29
- data/lib/e11y/buffers/adaptive_buffer.rb +3 -17
- data/lib/e11y/buffers/{request_scoped_buffer.rb → ephemeral_buffer.rb} +72 -58
- data/lib/e11y/buffers/ring_buffer.rb +3 -16
- data/lib/e11y/configuration.rb +272 -0
- data/lib/e11y/console.rb +10 -17
- data/lib/e11y/current.rb +53 -1
- data/lib/e11y/debug/pipeline_inspector.rb +96 -0
- data/lib/e11y/documentation/generator.rb +48 -0
- data/lib/e11y/event/base.rb +176 -82
- data/lib/e11y/event/value_sampling_config.rb +1 -5
- data/lib/e11y/events/rails/database/query.rb +1 -4
- data/lib/e11y/events/rails/job/failed.rb +2 -0
- data/lib/e11y/instruments/active_job.rb +44 -12
- data/lib/e11y/instruments/rails_instrumentation.rb +49 -24
- data/lib/e11y/instruments/sidekiq.rb +135 -31
- data/lib/e11y/linters/base.rb +11 -0
- data/lib/e11y/linters/pii/pii_declaration_linter.rb +120 -0
- data/lib/e11y/linters/slo/config_consistency_linter.rb +76 -0
- data/lib/e11y/linters/slo/explicit_declaration_linter.rb +36 -0
- data/lib/e11y/linters/slo/slo_status_from_linter.rb +41 -0
- data/lib/e11y/logger/bridge.rb +26 -7
- data/lib/e11y/metrics/cardinality_protection.rb +10 -15
- data/lib/e11y/metrics/cardinality_tracker.rb +16 -6
- data/lib/e11y/metrics/registry.rb +3 -5
- data/lib/e11y/metrics/test_backend.rb +62 -0
- data/lib/e11y/metrics.rb +56 -10
- data/lib/e11y/middleware/adapter_resolver.rb +40 -0
- data/lib/e11y/middleware/audit_signing.rb +43 -6
- data/lib/e11y/middleware/baggage_protection.rb +75 -0
- data/lib/e11y/middleware/dev_log_source.rb +24 -0
- data/lib/e11y/middleware/event_slo.rb +23 -9
- data/lib/e11y/middleware/otel_span.rb +23 -0
- data/lib/e11y/middleware/pii_filter.rb +104 -75
- data/lib/e11y/middleware/rate_limiting.rb +54 -27
- data/lib/e11y/middleware/request.rb +70 -23
- data/lib/e11y/middleware/routing.rb +78 -21
- data/lib/e11y/middleware/sampling.rb +66 -17
- data/lib/e11y/middleware/self_monitoring_emit.rb +39 -0
- data/lib/e11y/middleware/trace_context.rb +45 -10
- data/lib/e11y/middleware/track_latency.rb +34 -0
- data/lib/e11y/middleware/validation.rb +7 -16
- data/lib/e11y/middleware/versioning.rb +26 -22
- data/lib/e11y/opentelemetry/semantic_conventions.rb +109 -0
- data/lib/e11y/opentelemetry/span_creator.rb +142 -0
- data/lib/e11y/pii/patterns.rb +12 -1
- data/lib/e11y/pipeline/builder.rb +4 -4
- data/lib/e11y/presets/audit_event.rb +13 -2
- data/lib/e11y/railtie.rb +52 -14
- data/lib/e11y/registry.rb +306 -0
- data/lib/e11y/reliability/circuit_breaker.rb +19 -21
- data/lib/e11y/reliability/dlq/base.rb +71 -0
- data/lib/e11y/reliability/dlq/file_adapter.rb +301 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +63 -34
- data/lib/e11y/reliability/dlq/filter.rb +37 -54
- data/lib/e11y/reliability/retry_handler.rb +26 -29
- data/lib/e11y/reliability/retry_rate_limiter.rb +3 -11
- data/lib/e11y/sampling/error_spike_detector.rb +0 -2
- data/lib/e11y/sampling/load_monitor.rb +5 -9
- data/lib/e11y/sampling/stratified_tracker.rb +18 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +2 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +19 -61
- data/lib/e11y/self_monitoring/reliability_monitor.rb +4 -74
- data/lib/e11y/slo/config_loader.rb +40 -0
- data/lib/e11y/slo/config_validator.rb +58 -0
- data/lib/e11y/slo/dashboard_generator.rb +122 -0
- data/lib/e11y/slo/event_driven.rb +8 -0
- data/lib/e11y/slo/tracker.rb +31 -4
- data/lib/e11y/testing/have_tracked_event_matcher.rb +190 -0
- data/lib/e11y/testing/rspec_matchers.rb +21 -0
- data/lib/e11y/testing/snapshot_matcher.rb +86 -0
- data/lib/e11y/trace_context/sampler.rb +35 -0
- data/lib/e11y/tracing/faraday_middleware.rb +31 -0
- data/lib/e11y/tracing/net_http_patch.rb +33 -0
- data/lib/e11y/tracing/propagator.rb +144 -0
- data/lib/e11y/tracing.rb +47 -0
- data/lib/e11y/version.rb +1 -1
- data/lib/e11y/versioning/version_extractor.rb +32 -0
- data/lib/e11y.rb +123 -266
- data/lib/generators/e11y/event/event_generator.rb +22 -0
- data/lib/generators/e11y/event/templates/event.rb.tt +16 -0
- data/lib/generators/e11y/grafana_dashboard/grafana_dashboard_generator.rb +30 -0
- data/lib/generators/e11y/grafana_dashboard/templates/e11y_dashboard.json +81 -0
- data/lib/generators/e11y/install/install_generator.rb +34 -0
- data/lib/generators/e11y/install/templates/e11y.rb +239 -0
- data/lib/generators/e11y/prometheus_alerts/prometheus_alerts_generator.rb +29 -0
- data/lib/generators/e11y/prometheus_alerts/templates/e11y_alerts.yml +28 -0
- data/lib/tasks/e11y_docs.rake +30 -0
- data/lib/tasks/e11y_events.rake +71 -0
- data/lib/tasks/e11y_lint.rake +91 -0
- data/lib/tasks/e11y_slo.rake +29 -0
- metadata +186 -39
- data/docs/ADR-003-slo-observability.md +0 -3337
- data/docs/ADR-010-developer-experience.md +0 -2166
- data/docs/API-REFERENCE-L28.md +0 -914
- data/docs/COMPREHENSIVE-CONFIGURATION.md +0 -2366
- data/docs/CONTRIBUTING.md +0 -312
- data/docs/IMPLEMENTATION_NOTES.md +0 -2804
- data/docs/IMPLEMENTATION_PLAN.md +0 -1971
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +0 -586
- data/docs/PLAN.md +0 -148
- data/docs/README.md +0 -296
- data/docs/design/00-memory-optimization.md +0 -593
- data/docs/guides/MIGRATION-L27-L28.md +0 -692
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +0 -434
- data/docs/guides/README.md +0 -44
- data/docs/use_cases/UC-003-pattern-based-metrics.md +0 -1627
- data/lib/e11y/adapters/registry.rb +0 -141
@@ -0,0 +1,1402 @@
# ADR-003: SLO & Observability

**Status:** Draft
**Date:** January 13, 2026
**Covers:** UC-004 (Zero-Config SLO Tracking)
**Depends On:** ADR-001 (Core), ADR-008 (Rails Integration), ADR-002 (Metrics)

**Related ADRs:**
- 📊 **ADR-014: Event-Driven SLO** - Custom SLO based on business events (e.g., payment success rate)
- 🔗 **Integration:** See `ADR-003-014-INTEGRATION.md` for detailed integration analysis

---

## 🔍 Scope of This ADR

This ADR covers **HTTP/Job SLO** (infrastructure reliability):
- ✅ Zero-config SLO for HTTP requests (99.9% availability)
- ✅ Zero-config SLO for Sidekiq/ActiveJob (99.5% success rate)
- ✅ Per-endpoint SLO configuration in `slo.yml`
- ✅ PromQL queries and alert rules (see [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md))

**For Event-based SLO** (business logic reliability like "order creation success rate"), see **ADR-014**.

**For App-Wide SLO** (aggregating HTTP + Event metrics into single health score), see **ADR-014 Section 9**.

---

## 📋 Table of Contents

1. [Context & Problem](#1-context--problem)
2. [Architecture Overview](#2-architecture-overview)
3. [Multi-Level SLO Strategy](#3-multi-level-slo-strategy)
4. [Per-Endpoint SLO Configuration](#4-per-endpoint-slo-configuration)
5. [PromQL & Alerts](#5-promql--alerts)
6. [SLO Config Validation & Linting](#6-slo-config-validation--linting)
7. [Dashboard & Reporting](#7-dashboard--reporting)
8. [Production Best Practices & Edge Cases](#8-production-best-practices--edge-cases)
9. [Trade-offs](#9-trade-offs)
10. [Real-World Configuration Examples](#10-real-world-configuration-examples)
11. [Summary & Next Steps](#11-summary--next-steps)

---

## 1. Context & Problem

### 1.1. Problem Statement

**Current Pain Points:**

```ruby
# === PROBLEM 1: Overly Broad SLO (App-Wide) ===
# ❌ One SLO for entire app is too coarse
# GET /healthcheck (should be 99.99%)
# POST /orders (should be 99.9%)
# GET /admin/reports (should be 95%)
# → All treated the same! Critical endpoints hidden by non-critical ones!
```
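
To make Problem 1 concrete, the sketch below computes availability per endpoint and for the app as a whole. The traffic numbers are invented for illustration and do not come from the ADR. They show how a high-volume healthy endpoint can hide a badly failing low-volume one behind a single app-wide figure.

```ruby
# Hypothetical traffic over one SLO window: [requests, errors] per endpoint.
traffic = {
  "GET /healthcheck"   => [1_000_000, 0],
  "POST /orders"       => [10_000, 500], # 95% availability, far below a 99.9% target
  "GET /admin/reports" => [1_000, 20]
}

# Availability = share of requests that did not fail.
def availability(requests, errors)
  (requests - errors).fdiv(requests)
end

per_endpoint = traffic.transform_values { |(req, err)| availability(req, err) }

total_requests = traffic.values.sum { |(req, _)| req }
total_errors   = traffic.values.sum { |(_, err)| err }
app_wide = availability(total_requests, total_errors)

per_endpoint.each { |name, value| puts format("%-20s %.4f", name, value) }
puts format("%-20s %.4f", "app-wide", app_wide)
```

With these numbers the app-wide figure stays above 99.9% while `POST /orders` sits at 95%, which is the masking effect the per-endpoint level is meant to remove.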

```ruby
# === PROBLEM 2: Slow Alert Detection ===
# ❌ 30-day window = slow reaction
# Incident at 10:00 AM
# First alert at 10:45 AM (45 minutes later!)
# → Customers already affected!
```

```ruby
# === PROBLEM 3: No Configuration Management ===
# ❌ SLOs hardcoded in code
# Need to deploy to change SLO targets
# No validation against real routes
# → Drift between config and reality
```

```ruby
# === PROBLEM 4: Alert Fatigue ===
# ❌ Single threshold alerting
# Minor blip → Page SRE
# Sustained issue → Same alert
# → Can't distinguish severity!
```

### 1.2. Design Decisions (Based on Google SRE 2026)

**Decision 1: Multi-Level SLO Strategy**
```yaml
# 3 levels of SLO granularity:
1. Application-wide (default, zero-config)
2. Service-level (Sidekiq, ActiveJob)
3. Per-endpoint (controller#action specific)
```

**Decision 2: Multi-Window Multi-Burn Rate (Google SRE Standard)**
```yaml
# Alert windows (not SLO windows!):
- Fast burn: 1 hour window, 5 min alert, 14.4x burn rate → 2% budget consumed
- Medium burn: 6 hour window, 30 min alert, 6.0x burn rate → 5% budget consumed
- Slow burn: 3 day window, 6 hour alert, 1.0x burn rate → 10% budget consumed

# SLO window: Still 30 days (industry standard)
# But ALERTS react in 5 minutes!
```
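
The budget percentages in Decision 2 follow from simple arithmetic: burning at rate `r` for a window of `w` hours consumes `r * w / (30 * 24)` of a 30-day error budget. The following is a minimal sketch of that calculation. The method and constant names are illustrative and are not part of the E11y API.

```ruby
SLO_WINDOW_HOURS = 30 * 24 # 30-day SLO window, as stated in Decision 2

# Fraction of the error budget consumed when burning at `burn_rate`
# for the whole alert window of `window_hours`.
def budget_consumed(burn_rate, window_hours)
  burn_rate * window_hours / SLO_WINDOW_HOURS.to_f
end

# Alert windows from Decision 2 (3 days = 72 hours).
ALERT_WINDOWS = {
  fast:   { burn_rate: 14.4, window_hours: 1 },
  medium: { burn_rate: 6.0,  window_hours: 6 },
  slow:   { burn_rate: 1.0,  window_hours: 72 }
}.freeze

ALERT_WINDOWS.each do |name, cfg|
  pct = budget_consumed(cfg[:burn_rate], cfg[:window_hours]) * 100
  puts format("%-6s %.1f%% of the 30-day budget", name, pct)
end
```

The three results come out at 2%, 5%, and 10%, matching the figures listed for the fast, medium, and slow windows.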

**Decision 3: YAML-Based Configuration**
```yaml
# config/slo.yml - version controlled, validated
# Separate from code deployment
# Linter validates against real routes/jobs
```

**Decision 4: Optional Latency SLO**
```yaml
# Not all endpoints need latency SLO:
- Healthcheck: availability only (latency not critical)
- File upload: availability + custom latency (5s)
- API: availability + p99 latency (500ms)
```

### 1.3. Goals

**Primary Goals:**
- ✅ **Per-endpoint SLO** (controller#action level)
- ✅ **5-minute alert detection** (fast burn rate)
- ✅ **YAML-based configuration** with validation
- ✅ **Flexible latency SLO** (optional per endpoint)
- ✅ **Multi-window burn rate** (Google SRE standard)

**Non-Goals:**
- ❌ Per-user SLO (too granular for v1.0)
- ❌ Automatic SLO adjustment (manual for v1.0)
- ❌ SLO enforcement (alerts only, no blocking)

### 1.4. Success Metrics

| Metric | Target | Critical? |
|--------|--------|-----------|
| **Alert detection time** | <5 minutes | ✅ Yes |
| **Per-endpoint coverage** | 100% (all routes) | ✅ Yes |
| **Config validation** | 100% (no drift) | ✅ Yes |
| **False positive rate** | <1% | ✅ Yes |
| **Alert precision** | >95% | ✅ Yes |

---

## 2. Architecture Overview

### 2.1. System Context

```mermaid
C4Context
    title SLO & Observability Context (Multi-Level)

    Person(sre, "SRE", "Monitors SLOs")
    Person(dev, "Developer", "Defines SLOs")

    System(rails_app, "Rails App", "100+ endpoints")
    System(e11y, "E11y Gem", "Multi-level SLO")
    System(slo_config, "slo.yml", "Per-endpoint config")

    System_Ext(prometheus, "Prometheus", "Multi-window queries")
    System_Ext(grafana, "Grafana", "Per-endpoint dashboards")
    System_Ext(alertmanager, "Alertmanager", "Fast/Medium/Slow burn")

    Rel(dev, slo_config, "Defines", "Per-endpoint SLO")
    Rel(rails_app, e11y, "Tracks", "Per controller#action")
    Rel(e11y, slo_config, "Validates", "Against real routes")
    Rel(e11y, prometheus, "Exports", "Per-endpoint metrics")
    Rel(prometheus, alertmanager, "Evaluates", "3 burn rate windows")
    Rel(alertmanager, sre, "Alerts in 5min", "Fast burn")
    Rel(sre, grafana, "Views", "Per-endpoint SLO")

    UpdateLayoutConfig($c4ShapeInRow="3", $c4BoundaryInRow="1")
```

### 2.2. Component Architecture

```mermaid
graph TB
    subgraph "Rails Application"
        Route1[GET /orders] --> Middleware[E11y SLO Middleware]
        Route2[POST /orders] --> Middleware
        Route3[GET /healthcheck] --> Middleware
        SidekiqJob[PaymentJob] --> SidekiqInstr[Sidekiq Instrumentation]
    end

    subgraph "E11y SLO Engine"
        Middleware --> SLOResolver[SLO Config Resolver]
        SidekiqInstr --> SLOResolver

        SLOResolver --> ConfigLoader[slo.yml Loader]
        ConfigLoader --> Validator[Route/Job Validator]

        SLOResolver --> MetricsEmitter[Per-Endpoint Metrics]
        MetricsEmitter --> AppWide[App-Wide Metrics]
        MetricsEmitter --> PerEndpoint[Per-Endpoint Metrics]
        MetricsEmitter --> PerJob[Per-Job Metrics]
    end

    subgraph "Multi-Window Burn Rate"
        PerEndpoint --> BurnRate1h[1h Fast Burn]
        PerEndpoint --> BurnRate6h[6h Medium Burn]
        PerEndpoint --> BurnRate3d[3d Slow Burn]

        BurnRate1h --> AlertFast[Alert in 5 min<br/>14.4x burn]
        BurnRate6h --> AlertMedium[Alert in 30 min<br/>6.0x burn]
        BurnRate3d --> AlertSlow[Alert in 6 hours<br/>1.0x burn]
    end

    subgraph "Prometheus & Grafana"
        AppWide --> PromQL1[PromQL: App SLO]
        PerEndpoint --> PromQL2[PromQL: Endpoint SLO]
        PerJob --> PromQL3[PromQL: Job SLO]

        PromQL1 --> Dashboard1[App-Wide Dashboard]
        PromQL2 --> Dashboard2[Per-Endpoint Dashboard]
        PromQL3 --> Dashboard3[Job Dashboard]
    end

    style SLOResolver fill:#d1ecf1
    style BurnRate1h fill:#f8d7da
    style AlertFast fill:#dc3545,color:#fff
```

### 2.3. Multi-Window Alert Flow

```mermaid
sequenceDiagram
    participant Endpoint as POST /orders
    participant E11y as E11y Middleware
    participant Config as slo.yml
    participant Prom as Prometheus
    participant Alert as Alertmanager
    participant SRE as SRE

    Note over Endpoint: Incident starts at 10:00

    Endpoint->>E11y: HTTP 500 (error)
    E11y->>Config: Lookup SLO: orders#create
    Config-->>E11y: target: 99.9%, latency: 500ms
    E11y->>Prom: Increment error counter

    Note over Prom: 1h window burn rate evaluation

    Prom->>Prom: 10:00-10:05: Calculate burn rate
    Prom->>Prom: Burn rate = 14.5x (> 14.4x threshold)

    Prom->>Alert: Fire: FastBurn (10:05, 5 min after incident)
    Alert->>SRE: Page: CRITICAL - POST /orders

    Note over SRE: SRE notified in 5 minutes!

    alt Incident resolved quickly
        Note over Endpoint: Fixed at 10:10
        Prom->>Prom: 10:10-10:15: Burn rate drops
        Prom->>Alert: Resolve: FastBurn
    else Incident continues
        Prom->>Prom: 10:00-10:30: 6h window burn
        Prom->>Alert: Fire: MediumBurn (additional context)
    end
```
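
The burn-rate comparison in the flow above can be sketched as a ratio: the observed error ratio divided by the error ratio the SLO allows (`1 - target`). This is a minimal illustration under that standard definition. The method names and request counts are hypothetical, and in the ADR's design the evaluation is done by Prometheus rules, not by Ruby code.

```ruby
# Burn rate = observed error ratio / allowed error ratio (1 - SLO target).
# A burn rate of 1.0 spends the budget exactly over the full SLO window.
def burn_rate(errors, requests, target)
  return 0.0 if requests.zero?

  error_ratio = errors.fdiv(requests)
  error_ratio / (1.0 - target)
end

# True when the burn rate exceeds the fast-burn threshold from the diagram.
def fast_burn?(errors, requests, target, threshold: 14.4)
  burn_rate(errors, requests, target) > threshold
end

# Hypothetical counts for POST /orders with a 99.9% availability target.
puts burn_rate(145, 10_000, 0.999).round(1)
puts fast_burn?(145, 10_000, 0.999)
puts fast_burn?(10, 10_000, 0.999)
```

With 145 errors in 10,000 requests the error ratio is 1.45%, about 14.5 times the 0.1% allowed by a 99.9% target, so the fast-burn condition holds. At 10 errors the burn rate is roughly 1.0 and no fast-burn alert would fire.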

---

## 3. Multi-Level SLO Strategy

### 3.1. Level 1: Application-Wide SLO (Zero-Config)

**Automatic for all Rails apps:**

```ruby
# Shipped knobs (see lib/e11y/configuration.rb, lib/e11y/slo/)
E11y.configure do |config|
  config.slo_tracking_enabled = true # default
  config.rails_instrumentation_enabled = true
  config.sidekiq_enabled = true # job SLO signals from Sidekiq middleware
  config.active_job_enabled = true # job SLO signals from Active Job callbacks
end
```

**Metrics emitted:**
```ruby
# App-wide availability
http_requests_total{status="2xx|3xx|4xx|5xx"}
slo_app_availability{window="30d"} # Calculated SLO

# App-wide latency
http_request_duration_seconds{quantile="0.99"}
slo_app_latency_p99{window="30d"}
```

### 3.2. Level 2: Service-Level SLO (Per-Service)

**Per-service overrides:**

```yaml
# config/slo.yml
services:
  sidekiq:
    default:
      success_rate_target: 0.995 # 99.5%
      window: 30d

    # Override for critical jobs
    jobs:
      PaymentProcessingJob:
        success_rate_target: 0.9999 # 99.99% (critical!)
        alert_on_single_failure: true

      EmailNotificationJob:
        success_rate_target: 0.95 # 95% (non-critical)
        latency: null # No latency SLO
```
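
One way to read the service-level config is "service defaults, overridden per job". The sketch below loads a copy of the YAML above and merges a job's overrides onto the service default. The `job_slo` helper and the merge semantics are assumptions made for illustration. The ADR excerpt does not state how E11y resolves these values internally.

```ruby
require "yaml"

# Copy of the services section shown above.
SLO_YAML = <<~YAML
  services:
    sidekiq:
      default:
        success_rate_target: 0.995
        window: 30d
      jobs:
        PaymentProcessingJob:
          success_rate_target: 0.9999
          alert_on_single_failure: true
        EmailNotificationJob:
          success_rate_target: 0.95
          latency: null
YAML

# Hypothetical resolver: job-specific keys win over the service default.
def job_slo(config, service, job_class)
  service_cfg = config.dig("services", service) || {}
  defaults    = service_cfg["default"] || {}
  overrides   = service_cfg.dig("jobs", job_class) || {}
  defaults.merge(overrides)
end

config = YAML.safe_load(SLO_YAML)

p job_slo(config, "sidekiq", "PaymentProcessingJob")
p job_slo(config, "sidekiq", "SomeUnconfiguredJob") # falls back to the default
```

Under this reading, a job without an entry inherits the 99.5% default, and `PaymentProcessingJob` keeps the default `window` while raising its target to 99.99%.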

### 3.3. Level 3: Per-Endpoint SLO (Controller#Action)

**Most granular level:**

```yaml
# config/slo.yml
endpoints:
  # CRITICAL endpoints (99.99%)
  - name: "Health Check"
    pattern: "GET /healthcheck"
    controller: "HealthController"
    action: "index"
    slo:
      availability_target: 0.9999 # 99.99%
      latency: null # No latency SLO for healthcheck
      window: 30d

  # HIGH priority endpoints (99.9%)
  - name: "Create Order"
    pattern: "POST /api/orders"
    controller: "Api::OrdersController"
    action: "create"
    slo:
      availability_target: 0.999 # 99.9%
      latency_p99_target: 500 # 500ms p99
      latency_p95_target: 300 # 300ms p95 (optional)
      window: 30d

      # Multi-burn rate alert config
      burn_rate_alerts:
        fast:
          enabled: true
          window: 1h
          threshold: 14.4 # 2% budget in 1h
          alert_after: 5m
        medium:
          enabled: true
          window: 6h
          threshold: 6.0 # 5% budget in 6h
          alert_after: 30m
        slow:
          enabled: true
          window: 3d
          threshold: 1.0 # 10% budget in 3d
          alert_after: 6h

  # SLOW endpoints (99.9% but higher latency acceptable)
  - name: "Generate Report"
    pattern: "POST /admin/reports"
    controller: "Admin::ReportsController"
    action: "create"
    slo:
      availability_target: 0.999 # 99.9%
      latency_p99_target: 5000 # 5s (slow, but acceptable)
      window: 30d

  # LOW priority endpoints (99%)
  - name: "Admin Dashboard"
    pattern: "GET /admin/dashboard"
    controller: "Admin::DashboardController"
    action: "index"
    slo:
      availability_target: 0.99 # 99% (less critical)
      latency: null
      window: 30d

  # NO SLO (exclude from tracking)
  - name: "Development Tools"
    pattern: "GET /rails/info/*"
    slo: null # No SLO
```
|
|
385
|
+
|
|
386
|
+
---
|
|
387
|
+
|
|
388
|
+
## 4. Per-Endpoint SLO Configuration
|
|
389
|
+
|
|
390
|
+
### 4.1. Complete slo.yml Schema with All Options
|
|
391
|
+
|
|
392
|
+
```yaml
|
|
393
|
+
# config/slo.yml
|
|
394
|
+
#
|
|
395
|
+
# E11y SLO Configuration
|
|
396
|
+
#
|
|
397
|
+
# This file defines Service Level Objectives for your application at multiple levels:
|
|
398
|
+
# 1. App-wide defaults (fallback for unconfigured endpoints)
|
|
399
|
+
# 2. Endpoint-specific SLOs (per controller#action)
|
|
400
|
+
# 3. Service-specific SLOs (Sidekiq, ActiveJob)
|
|
401
|
+
#
|
|
402
|
+
# Lint / validation (see §6 — e11y:slo:validate aliases e11y:lint):
|
|
403
|
+
# $ bundle exec rake e11y:lint
|
|
404
|
+
# $ bundle exec rake e11y:slo:validate
|
|
405
|
+
# Dashboard JSON from slo.yml:
|
|
406
|
+
# $ bundle exec rake e11y:slo:dashboard
|
|
407
|
+
#
|
|
408
|
+
# Documentation: https://github.com/arturseletskiy/e11y/docs/slo-configuration.md
|
|
409
|
+
|
|
410
|
+
version: 1
|
|
411
|
+
|
|
412
|
+
# ============================================================================
|
|
413
|
+
# GLOBAL DEFAULTS
|
|
414
|
+
# ============================================================================
|
|
415
|
+
# Applied to all endpoints unless overridden
|
|
416
|
+
# These are CONSERVATIVE defaults - tune based on your needs
|
|
417
|
+
defaults:
|
|
418
|
+
window: 30d # SLO evaluation window (7d, 30d, 90d)
|
|
419
|
+
|
|
420
|
+
# Availability SLO (required)
|
|
421
|
+
availability:
|
|
422
|
+
enabled: true
|
|
423
|
+
target: 0.999 # 99.9% = 43.2 minutes downtime per month
|
|
424
|
+
|
|
425
|
+
# Latency SLO (optional)
|
|
426
|
+
latency:
|
|
427
|
+
enabled: true
|
|
428
|
+
p99_target: 500 # milliseconds
|
|
429
|
+
p95_target: 300 # milliseconds (optional)
|
|
430
|
+
p50_target: null # median (optional, null = disabled)
|
|
431
|
+
|
|
432
|
+
# Throughput SLO (optional, for high-traffic endpoints)
|
|
433
|
+
throughput:
|
|
434
|
+
enabled: false # Disabled by default
|
|
435
|
+
min_rps: null # Minimum requests per second (null = no minimum)
|
|
436
|
+
max_rps: null # Maximum requests per second (null = no maximum)
|
|
437
|
+
|
|
438
|
+
# Multi-window burn rate alerts (Google SRE recommended)
|
|
439
|
+
burn_rate_alerts:
|
|
440
|
+
fast:
|
|
441
|
+
enabled: true
|
|
442
|
+
window: 1h # Alert window
|
|
443
|
+
threshold: 14.4 # 14.4x burn rate = 2% of 30-day budget in 1h
|
|
444
|
+
alert_after: 5m # Fire alert after 5 minutes
|
|
445
|
+
severity: critical
|
|
446
|
+
medium:
|
|
447
|
+
enabled: true
|
|
448
|
+
window: 6h
|
|
449
|
+
threshold: 6.0 # 6x burn rate = 5% of 30-day budget in 6h
|
|
450
|
+
alert_after: 30m
|
|
451
|
+
severity: warning
|
|
452
|
+
slow:
|
|
453
|
+
enabled: true
|
|
454
|
+
window: 3d
|
|
455
|
+
threshold: 1.0 # 1x burn rate = 10% of 30-day budget in 3d
|
|
456
|
+
alert_after: 6h
|
|
457
|
+
severity: info
|
|
458
|
+
|
|
459
|
+
# ============================================================================
|
|
460
|
+
# ENDPOINT-SPECIFIC SLOs
|
|
461
|
+
# ============================================================================
|
|
462
|
+
# Define SLOs per controller#action
|
|
463
|
+
# Pattern matching supported: "/api/orders/:id", "/users/*"
|
|
464
|
+
endpoints:
|
|
465
|
+
# -------------------------------------------------------------------------
|
|
466
|
+
# CRITICAL ENDPOINTS (99.99% availability)
|
|
467
|
+
# -------------------------------------------------------------------------
|
|
468
|
+
- name: "Health Check"
|
|
469
|
+
description: "K8s liveness/readiness probe"
|
|
470
|
+
pattern: "GET /healthcheck"
|
|
471
|
+
controller: "HealthController"
|
|
472
|
+
action: "index"
|
|
473
|
+
tags:
|
|
474
|
+
- critical
|
|
475
|
+
- infrastructure
|
|
476
|
+
slo:
|
|
477
|
+
window: 30d
|
|
478
|
+
availability:
|
|
479
|
+
enabled: true
|
|
480
|
+
target: 0.9999 # 99.99% = 4.32 minutes downtime per month
|
|
481
|
+
latency:
|
|
482
|
+
enabled: false # No latency SLO for healthcheck (should be instant)
|
|
483
|
+
throughput:
|
|
484
|
+
enabled: false
|
|
485
|
+
burn_rate_alerts:
|
|
486
|
+
fast:
|
|
487
|
+
enabled: true
|
|
488
|
+
threshold: 14.4
|
|
489
|
+
alert_after: 2m # Override: faster alert for critical endpoint
|
|
490
|
+
|
|
491
|
+
# -------------------------------------------------------------------------
|
|
492
|
+
# HIGH PRIORITY ENDPOINTS (99.9% availability + strict latency)
|
|
493
|
+
# -------------------------------------------------------------------------
|
|
494
|
+
- name: "Create Order"
|
|
495
|
+
description: "Primary checkout flow"
|
|
496
|
+
pattern: "POST /api/orders"
|
|
497
|
+
controller: "Api::OrdersController"
|
|
498
|
+
action: "create"
|
|
499
|
+
tags:
|
|
500
|
+
- high_priority
|
|
501
|
+
- revenue_critical
|
|
502
|
+
- customer_facing
|
|
503
|
+
slo:
|
|
504
|
+
window: 30d
|
|
505
|
+
availability:
|
|
506
|
+
enabled: true
|
|
507
|
+
target: 0.999 # 99.9%
|
|
508
|
+
latency:
|
|
509
|
+
enabled: true
|
|
510
|
+
p99_target: 500 # 500ms p99
|
|
511
|
+
p95_target: 300 # 300ms p95
|
|
512
|
+
p50_target: 150 # 150ms p50 (median)
|
|
513
|
+
throughput:
|
|
514
|
+
enabled: true
|
|
515
|
+
min_rps: 10 # Must handle at least 10 req/sec
|
|
516
|
+
max_rps: 1000 # Alert if traffic exceeds 1000 req/sec (potential attack)
|
|
517
|
+
burn_rate_alerts:
|
|
518
|
+
fast:
|
|
519
|
+
enabled: true
|
|
520
|
+
threshold: 14.4
|
|
521
|
+
alert_after: 5m
|
|
522
|
+
medium:
|
|
523
|
+
enabled: true
|
|
524
|
+
threshold: 6.0
|
|
525
|
+
alert_after: 30m
|
|
526
|
+
slow:
|
|
527
|
+
enabled: true
|
|
528
|
+
threshold: 1.0
|
|
529
|
+
alert_after: 6h
|
|
530
|
+
|
|
531
|
+
- name: "List Orders"
|
|
532
|
+
description: "Customer order history"
|
|
533
|
+
pattern: "GET /api/orders"
|
|
534
|
+
controller: "Api::OrdersController"
|
|
535
|
+
action: "index"
|
|
536
|
+
tags:
|
|
537
|
+
- high_priority
|
|
538
|
+
- customer_facing
|
|
539
|
+
slo:
|
|
540
|
+
window: 30d
|
|
541
|
+
availability:
|
|
542
|
+
enabled: true
|
|
543
|
+
target: 0.999
|
|
544
|
+
latency:
|
|
545
|
+
enabled: true
|
|
546
|
+
p99_target: 1000 # 1s p99 (list can be slower)
|
|
547
|
+
p95_target: 500
|
|
548
|
+
throughput:
|
|
549
|
+
enabled: false
|
|
550
|
+
|
|
551
|
+
- name: "Payment Processing"
|
|
552
|
+
description: "Stripe payment capture"
|
|
553
|
+
pattern: "POST /api/payments"
|
|
554
|
+
controller: "Api::PaymentsController"
|
|
555
|
+
action: "create"
|
|
556
|
+
tags:
|
|
557
|
+
- critical
|
|
558
|
+
- revenue_critical
|
|
559
|
+
- third_party_dependent
|
|
560
|
+
slo:
|
|
561
|
+
window: 30d
|
|
562
|
+
availability:
|
|
563
|
+
enabled: true
|
|
564
|
+
target: 0.999
|
|
565
|
+
latency:
|
|
566
|
+
enabled: true
|
|
567
|
+
p99_target: 2000 # 2s p99 (external API call)
|
|
568
|
+
p95_target: 1000
|
|
569
|
+
throughput:
|
|
570
|
+
enabled: true
|
|
571
|
+
min_rps: 1
|
|
572
|
+
max_rps: 100
|
|
573
|
+
burn_rate_alerts:
|
|
574
|
+
fast:
|
|
575
|
+
enabled: true
|
|
576
|
+
threshold: 10.0 # Override: more lenient for third-party dependency
|
|
577
|
+
alert_after: 10m
|
|
578
|
+
|
|
579
|
+
# -------------------------------------------------------------------------
|
|
580
|
+
# SLOW ENDPOINTS (99.9% availability + relaxed latency)
|
|
581
|
+
# -------------------------------------------------------------------------
|
|
582
|
+
- name: "Generate Report"
|
|
583
|
+
description: "Admin analytics report generation"
|
|
584
|
+
pattern: "POST /admin/reports"
|
|
585
|
+
controller: "Admin::ReportsController"
|
|
586
|
+
action: "create"
|
|
587
|
+
tags:
|
|
588
|
+
- admin
|
|
589
|
+
- slow_operation
|
|
590
|
+
- batch_processing
|
|
591
|
+
slo:
|
|
592
|
+
window: 30d
|
|
593
|
+
availability:
|
|
594
|
+
enabled: true
|
|
595
|
+
target: 0.999
|
|
596
|
+
latency:
|
|
597
|
+
enabled: true
|
|
598
|
+
p99_target: 30000 # 30s p99 (slow, but acceptable for reports)
|
|
599
|
+
p95_target: 20000 # 20s p95
|
|
600
|
+
throughput:
|
|
601
|
+
enabled: false
|
|
602
|
+
burn_rate_alerts:
|
|
603
|
+
fast:
|
|
604
|
+
enabled: false # Disable fast burn for slow operations
|
|
605
|
+
medium:
|
|
606
|
+
enabled: true
|
|
607
|
+
threshold: 6.0
|
|
608
|
+
alert_after: 1h
|
|
609
|
+
|
|
610
|
+
- name: "Export Data"
|
|
611
|
+
description: "CSV/Excel export"
|
|
612
|
+
pattern: "POST /admin/exports"
|
|
613
|
+
controller: "Admin::ExportsController"
|
|
614
|
+
action: "create"
|
|
615
|
+
tags:
|
|
616
|
+
- admin
|
|
617
|
+
- slow_operation
|
|
618
|
+
slo:
|
|
619
|
+
window: 30d
|
|
620
|
+
availability:
|
|
621
|
+
enabled: true
|
|
622
|
+
target: 0.99 # 99% (less critical)
|
|
623
|
+
latency:
|
|
624
|
+
enabled: true
|
|
625
|
+
p99_target: 60000 # 60s p99 (very slow, but acceptable)
|
|
626
|
+
throughput:
|
|
627
|
+
enabled: false
|
|
628
|
+
|
|
629
|
+
# -------------------------------------------------------------------------
|
|
630
|
+
# LOW PRIORITY ENDPOINTS (99% availability + no latency SLO)
|
|
631
|
+
# -------------------------------------------------------------------------
|
|
632
|
+
- name: "Admin Dashboard"
|
|
633
|
+
description: "Internal admin dashboard"
|
|
634
|
+
pattern: "GET /admin/dashboard"
|
|
635
|
+
controller: "Admin::DashboardController"
|
|
636
|
+
action: "index"
|
|
637
|
+
tags:
|
|
638
|
+
- admin
|
|
639
|
+
- low_priority
|
|
640
|
+
slo:
|
|
641
|
+
window: 30d
|
|
642
|
+
availability:
|
|
643
|
+
enabled: true
|
|
644
|
+
target: 0.99 # 99%
|
|
645
|
+
latency:
|
|
646
|
+
enabled: false # No latency SLO for admin
|
|
647
|
+
throughput:
|
|
648
|
+
enabled: false
|
|
649
|
+
burn_rate_alerts:
|
|
650
|
+
fast:
|
|
651
|
+
enabled: false
|
|
652
|
+
medium:
|
|
653
|
+
enabled: false
|
|
654
|
+
slow:
|
|
655
|
+
enabled: true # Only slow burn
|
|
656
|
+
threshold: 2.0
|
|
657
|
+
alert_after: 12h
|
|
658
|
+
|
|
659
|
+
# -------------------------------------------------------------------------
|
|
660
|
+
# HIGH THROUGHPUT ENDPOINTS (throughput-focused)
|
|
661
|
+
# -------------------------------------------------------------------------
|
|
662
|
+
- name: "Metrics Ingestion"
|
|
663
|
+
description: "Telemetry data ingestion endpoint"
|
|
664
|
+
pattern: "POST /api/metrics"
|
|
665
|
+
controller: "Api::MetricsController"
|
|
666
|
+
action: "create"
|
|
667
|
+
tags:
|
|
668
|
+
- high_throughput
|
|
669
|
+
- telemetry
|
|
670
|
+
slo:
|
|
671
|
+
window: 30d
|
|
672
|
+
availability:
|
|
673
|
+
enabled: true
|
|
674
|
+
target: 0.99 # 99% (can tolerate some drops)
|
|
675
|
+
latency:
|
|
676
|
+
enabled: true
|
|
677
|
+
p99_target: 100 # Fast ingestion required
|
|
678
|
+
throughput:
|
|
679
|
+
enabled: true
|
|
680
|
+
min_rps: 100 # Must handle 100+ req/sec
|
|
681
|
+
max_rps: 10000 # Alert if traffic exceeds 10k req/sec
|
|
682
|
+
burn_rate_alerts:
|
|
683
|
+
fast:
|
|
684
|
+
enabled: true
|
|
685
|
+
threshold: 20.0 # More lenient for high-throughput
|
|
686
|
+
|
|
687
|
+
# -------------------------------------------------------------------------
|
|
688
|
+
# NO SLO (explicitly excluded)
|
|
689
|
+
# -------------------------------------------------------------------------
|
|
690
|
+
- name: "Development Tools"
|
|
691
|
+
description: "Rails internal routes"
|
|
692
|
+
pattern: "GET /rails/info/*"
|
|
693
|
+
controller: "Rails::InfoController"
|
|
694
|
+
action: "*"
|
|
695
|
+
tags:
|
|
696
|
+
- development
|
|
697
|
+
- excluded
|
|
698
|
+
slo: null # Explicitly no SLO
|
|
699
|
+
|
|
700
|
+
# ============================================================================
|
|
701
|
+
# SERVICE-LEVEL SLOs (Sidekiq, ActiveJob)
|
|
702
|
+
# ============================================================================
|
|
703
|
+
services:
|
|
704
|
+
# ---------------------------------------------------------------------------
|
|
705
|
+
# SIDEKIQ JOBS
|
|
706
|
+
# ---------------------------------------------------------------------------
|
|
707
|
+
sidekiq:
|
|
708
|
+
# Default for all jobs (unless overridden)
|
|
709
|
+
default:
|
|
710
|
+
window: 30d
|
|
711
|
+
success_rate_target: 0.995 # 99.5%
|
|
712
|
+
latency:
|
|
713
|
+
enabled: false # No latency SLO by default for jobs
|
|
714
|
+
throughput:
|
|
715
|
+
enabled: false
|
|
716
|
+
burn_rate_alerts:
|
|
717
|
+
fast:
|
|
718
|
+
enabled: true
|
|
719
|
+
window: 1h
|
|
720
|
+
threshold: 14.4
|
|
721
|
+
alert_after: 10m # Slower alert for jobs
|
|
722
|
+
medium:
|
|
723
|
+
enabled: true
|
|
724
|
+
window: 6h
|
|
725
|
+
threshold: 6.0
|
|
726
|
+
alert_after: 1h
|
|
727
|
+
slow:
|
|
728
|
+
enabled: true
|
|
729
|
+
window: 3d
|
|
730
|
+
threshold: 1.0
|
|
731
|
+
alert_after: 12h
|
|
732
|
+
|
|
733
|
+
# Per-job overrides
|
|
734
|
+
jobs:
|
|
735
|
+
PaymentProcessingJob:
|
|
736
|
+
window: 30d
|
|
737
|
+
success_rate_target: 0.9999 # 99.99% (critical!)
|
|
738
|
+
latency:
|
|
739
|
+
enabled: true
|
|
740
|
+
p99_target: 5000 # 5s p99
|
|
741
|
+
alert_on_single_failure: true # Alert on any failure
|
|
742
|
+
burn_rate_alerts:
|
|
743
|
+
fast:
|
|
744
|
+
enabled: true
|
|
745
|
+
threshold: 10.0
|
|
746
|
+
alert_after: 5m
|
|
747
|
+
|
|
748
|
+
EmailNotificationJob:
|
|
749
|
+
window: 30d
|
|
750
|
+
success_rate_target: 0.95 # 95% (non-critical, can retry)
|
|
751
|
+
latency:
|
|
752
|
+
enabled: false
|
|
753
|
+
burn_rate_alerts:
|
|
754
|
+
fast:
|
|
755
|
+
enabled: false
|
|
756
|
+
medium:
|
|
757
|
+
enabled: false
|
|
758
|
+
slow:
|
|
759
|
+
enabled: true
|
|
760
|
+
|
|
761
|
+
ReportGenerationJob:
|
|
762
|
+
window: 30d
|
|
763
|
+
success_rate_target: 0.99
|
|
764
|
+
latency:
|
|
765
|
+
enabled: true
|
|
766
|
+
p99_target: 300000 # 5 minutes
|
|
767
|
+
throughput:
|
|
768
|
+
enabled: true
|
|
769
|
+
max_jobs_per_hour: 100 # Rate limit
|
|
770
|
+
|
|
771
|
+
# ---------------------------------------------------------------------------
|
|
772
|
+
# ACTIVEJOB
|
|
773
|
+
# ---------------------------------------------------------------------------
|
|
774
|
+
activejob:
|
|
775
|
+
default:
|
|
776
|
+
window: 30d
|
|
777
|
+
success_rate_target: 0.995
|
|
778
|
+
latency:
|
|
779
|
+
enabled: false
|
|
780
|
+
throughput:
|
|
781
|
+
enabled: false
|
|
782
|
+
burn_rate_alerts:
|
|
783
|
+
fast:
|
|
784
|
+
enabled: true
|
|
785
|
+
window: 1h
|
|
786
|
+
threshold: 14.4
|
|
787
|
+
alert_after: 10m
|
|
788
|
+
|
|
789
|
+
# ============================================================================
|
|
790
|
+
# APP-WIDE FALLBACK (Zero-Config)
|
|
791
|
+
# ============================================================================
|
|
792
|
+
# Used for endpoints/jobs without specific configuration
|
|
793
|
+
app_wide:
|
|
794
|
+
http:
|
|
795
|
+
window: 30d
|
|
796
|
+
availability:
|
|
797
|
+
enabled: true
|
|
798
|
+
target: 0.999 # 99.9%
|
|
799
|
+
latency:
|
|
800
|
+
enabled: true
|
|
801
|
+
p99_target: 500
|
|
802
|
+
throughput:
|
|
803
|
+
enabled: false
|
|
804
|
+
burn_rate_alerts:
|
|
805
|
+
fast:
|
|
806
|
+
enabled: true
|
|
807
|
+
window: 1h
|
|
808
|
+
threshold: 14.4
|
|
809
|
+
alert_after: 5m
|
|
810
|
+
medium:
|
|
811
|
+
enabled: true
|
|
812
|
+
window: 6h
|
|
813
|
+
threshold: 6.0
|
|
814
|
+
alert_after: 30m
|
|
815
|
+
slow:
|
|
816
|
+
enabled: true
|
|
817
|
+
window: 3d
|
|
818
|
+
threshold: 1.0
|
|
819
|
+
alert_after: 6h
|
|
820
|
+
|
|
821
|
+
sidekiq:
|
|
822
|
+
window: 30d
|
|
823
|
+
success_rate_target: 0.995
|
|
824
|
+
burn_rate_alerts:
|
|
825
|
+
fast:
|
|
826
|
+
enabled: true
|
|
827
|
+
window: 1h
|
|
828
|
+
threshold: 14.4
|
|
829
|
+
alert_after: 10m
|
|
830
|
+
|
|
831
|
+
activejob:
|
|
832
|
+
window: 30d
|
|
833
|
+
success_rate_target: 0.995
|
|
834
|
+
burn_rate_alerts:
|
|
835
|
+
fast:
|
|
836
|
+
enabled: true
|
|
837
|
+
window: 1h
|
|
838
|
+
threshold: 14.4
|
|
839
|
+
alert_after: 10m
|
|
840
|
+
|
|
841
|
+
# ============================================================================
|
|
842
|
+
# ADVANCED OPTIONS
|
|
843
|
+
# ============================================================================
|
|
844
|
+
advanced:
|
|
845
|
+
# Error budget alerts (percentage thresholds)
|
|
846
|
+
error_budget_alerts:
|
|
847
|
+
enabled: true
|
|
848
|
+
thresholds: [50, 80, 90, 100] # Alert at 50%, 80%, 90%, 100% consumed
|
|
849
|
+
notify:
|
|
850
|
+
slack: true
|
|
851
|
+
pagerduty: false
|
|
852
|
+
email: true
|
|
853
|
+
|
|
854
|
+
# Deployment gate (block deploys if error budget low)
|
|
855
|
+
deployment_gate:
|
|
856
|
+
enabled: false # Disabled by default (use with caution!)
|
|
857
|
+
minimum_budget_percent: 20 # Need 20%+ budget to deploy
|
|
858
|
+
critical_endpoints_only: true # Only check critical endpoints
|
|
859
|
+
override_label: "deploy:emergency" # GitHub label to override
|
|
860
|
+
|
|
861
|
+
# Auto-scaling based on SLO
|
|
862
|
+
autoscaling:
|
|
863
|
+
enabled: false # Future feature
|
|
864
|
+
scale_up_on_burn_rate: 10.0
|
|
865
|
+
scale_down_on_budget_surplus: 0.5
|
|
866
|
+
|
|
867
|
+
# SLO dashboard links
|
|
868
|
+
dashboards:
|
|
869
|
+
grafana_base_url: "https://grafana.example.com/d/e11y-slo"
|
|
870
|
+
per_endpoint_template: "https://grafana.example.com/d/e11y-slo-endpoint?var-controller={controller}&var-action={action}"
|
|
871
|
+
|
|
872
|
+
# Runbook links
|
|
873
|
+
runbooks:
|
|
874
|
+
base_url: "https://wiki.example.com/runbooks"
|
|
875
|
+
fast_burn_template: "{base_url}/fast-burn-{controller}-{action}"
|
|
876
|
+
medium_burn_template: "{base_url}/medium-burn-{controller}-{action}"
|
|
877
|
+
```
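
The burn-rate comments in the schema above follow from simple arithmetic: a burn rate of N sustained over an alert window consumes N × (alert window / SLO window) of the error budget. The following is a minimal Ruby sketch of that calculation and of the downtime figures quoted next to the availability targets. The method names are hypothetical helpers for illustration and are not part of the gem.

```ruby
# Hypothetical helpers (not E11y APIs): the arithmetic behind the slo.yml comments.

HOURS_PER_DAY = 24

# Fraction of the error budget consumed when `burn_rate` is sustained for
# `alert_window_hours` inside an SLO window of `slo_window_days`.
def budget_consumed(burn_rate:, alert_window_hours:, slo_window_days: 30)
  burn_rate * alert_window_hours / (slo_window_days * HOURS_PER_DAY).to_f
end

# Allowed downtime, in minutes, for an availability target over the SLO window.
def allowed_downtime_minutes(target:, slo_window_days: 30)
  (1.0 - target) * slo_window_days * HOURS_PER_DAY * 60
end

fast   = budget_consumed(burn_rate: 14.4, alert_window_hours: 1)  # 1h window
medium = budget_consumed(burn_rate: 6.0,  alert_window_hours: 6)  # 6h window
slow   = budget_consumed(burn_rate: 1.0,  alert_window_hours: 72) # 3d window

puts format("fast: %.0f%%, medium: %.0f%%, slow: %.0f%%", fast * 100, medium * 100, slow * 100)
puts format("99.9%% target allows %.2f minutes per 30 days", allowed_downtime_minutes(target: 0.999))
```

With a different `window` (for example 7d), the same thresholds consume a larger share of the budget, so they would need to be rescaled rather than copied.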
|
|
878
|
+
|
|
879
|
+
### 4.2. Config loading (shipped)
|
|
880
|
+
|
|
881
|
+
The gem ships **`E11y::SLO::ConfigLoader`** (`lib/e11y/slo/config_loader.rb`): it searches configurable directories for `slo.yml`, parses with `YAML.safe_load`, and returns a Hash or `nil` if no file exists. There is no `Config` object, no `load!`, no `ZeroConfig`, and no route/job validation inside the loader.
|
|
882
|
+
|
|
883
|
+
Use **`E11y::SLO::ConfigValidator.validate(config)`** on a parsed Hash for the checks implemented today (`version`, `endpoints`, `app_wide.aggregated_slo`, `e11y_self_monitoring`). See `lib/e11y/slo/config_validator.rb`.
|
|
884
|
+
|
|
885
|
+
Runtime SLO behaviour is driven by **`E11y::SLO::Tracker`** and code under `lib/e11y/slo/` plus request/job instrumentation—not by the long pseudo-code that used to appear in this ADR.
|
|
886
|
+
|
|
887
|
+
The YAML in §4.1 remains a **reference** for metrics, PromQL, and dashboards; the loader and **`E11y::SLO::DashboardGenerator`** consume only part of that structure.
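
As an illustration of how an application could read the reference structure itself, here is a minimal Ruby sketch that parses a small `slo.yml` fragment with `YAML.safe_load` (the same parser the loader is described as using) and resolves an endpoint's effective SLO by layering its `slo` block over `defaults`. The helper names `deep_merge` and `effective_slo`, and the merge semantics, are assumptions made for this example. They are not gem APIs and do not describe what `E11y::SLO::Tracker` does.

```ruby
require "yaml"

SLO_SOURCE = <<~SLO
  version: 1
  defaults:
    window: 30d
    availability:
      enabled: true
      target: 0.999
    latency:
      enabled: true
      p99_target: 500
  endpoints:
    - name: "Health Check"
      controller: "HealthController"
      action: "index"
      slo:
        availability:
          target: 0.9999
        latency:
          enabled: false
    - name: "Development Tools"
      controller: "Rails::InfoController"
      action: "*"
      slo: null
SLO

# Recursively merge `override` into `base`; nested hashes merge key by key.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Hypothetical helper: effective SLO for controller#action, or nil when excluded.
def effective_slo(config, controller, action)
  defaults = config.fetch("defaults", {})
  endpoint = config.fetch("endpoints", []).find do |entry|
    entry["controller"] == controller && [action, "*"].include?(entry["action"])
  end
  return defaults if endpoint.nil?     # unconfigured endpoint: use defaults
  return nil if endpoint["slo"].nil?   # `slo: null` (or missing) means no SLO

  deep_merge(defaults, endpoint["slo"])
end

config = YAML.safe_load(SLO_SOURCE)
health = effective_slo(config, "HealthController", "index")
puts health.inspect
```

Keeping the resolution in one pure function makes it easy to unit-test against the same YAML that feeds dashboards and alert rules.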
|
|
888
|
+
|
|
889
|
+
---
|
|
890
|
+
|
|
891
|
+
## 5. PromQL & Alerts
|
|
892
|
+
|
|
893
|
+
PromQL queries and Prometheus alert rules: see [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md).
|
|
894
|
+
|
|
895
|
+
---
|
|
896
|
+
|
|
897
|
+
## 6. SLO config validation and linting
|
|
898
|
+
|
|
899
|
+
### 6.1. Shipped APIs and tasks
|
|
900
|
+
|
|
901
|
+
- **`E11y::SLO::ConfigLoader.load` / `.load(search_paths:)`** — returns a Hash or `nil`; see `spec/e11y/slo/config_loader_spec.rb`.
|
|
902
|
+
- **`E11y::SLO::ConfigValidator.validate(config)`** — returns an array of error strings; see `spec/e11y/slo/config_validator_spec.rb`.
|
|
903
|
+
- **`rake e11y:slo:dashboard`** — generates Grafana JSON from `slo.yml`; see `lib/tasks/e11y_slo.rake`.
|
|
904
|
+
- **`rake e11y:lint`** — lints E11y configuration and related rules.
|
|
905
|
+
- **`rake e11y:slo:validate`** — backwards-compatible alias that **invokes `e11y:lint`** (`lib/tasks/e11y_lint.rake`). It does **not** run a dedicated “SLO vs routes” validator like the old drafts of this ADR described.
|
|
906
|
+
|
|
907
|
+
There is **no** `rake e11y:slo:unconfigured` task in the repository.
|
|
908
|
+
|
|
909
|
+
### 6.2. CI
|
|
910
|
+
|
|
911
|
+
You can run `bundle exec rake e11y:lint` (or `e11y:slo:validate`) in CI when `config/slo.yml` changes. Drop any workflow step that called `e11y:slo:unconfigured`, or replace it with your own check using `ConfigLoader.load` + `ConfigValidator.validate`.
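
A custom CI check along those lines could look like the sketch below. It is only an outline under stated assumptions: `slo_config_errors` and `run_slo_check` are hypothetical names, and the loader and validator are injected as callables so the example runs without the gem. The demo lambdas are stand-ins and do not reproduce the gem's validation rules.

```ruby
# Hypothetical CI helper. `loader` returns a Hash or nil; `validator` returns
# an array of error strings, mirroring the contracts listed in section 6.1.
def slo_config_errors(loader:, validator:)
  config = loader.call
  return ["slo.yml not found in any search path"] if config.nil?

  validator.call(config)
end

# Prints each error and returns a process exit status (0 = ok, 1 = failed).
def run_slo_check(loader:, validator:, out: $stderr)
  errors = slo_config_errors(loader: loader, validator: validator)
  errors.each { |message| out.puts("SLO config error: #{message}") }
  errors.empty? ? 0 : 1
end

# Stand-ins for demonstration only; real rules live in the gem's validator.
demo_loader = -> { { "version" => 1, "endpoints" => [] } }
demo_validator = ->(config) { config["version"] == 1 ? [] : ["unsupported version"] }

status = run_slo_check(loader: demo_loader, validator: demo_validator)
puts "exit status: #{status}"
```

In a project with the gem loaded, the callables would presumably be `-> { E11y::SLO::ConfigLoader.load }` and `->(config) { E11y::SLO::ConfigValidator.validate(config) }`, followed by `exit(status)`.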
|
|
912
|
+
|
|
913
|
+
### 6.3. Tests
|
|
914
|
+
|
|
915
|
+
Use the specs under `spec/e11y/slo/` and `spec/e11y/linters/slo/` as the source of truth for behaviour.
|
|
916
|
+
|
|
917
|
+
---
|
|
918
|
+
|
|
919
|
+
## 7. Dashboard & Reporting
|
|
920
|
+
|
|
921
|
+
### 7.1. Per-Endpoint Grafana Dashboard
|
|
922
|
+
|
|
923
|
+
```json
|
|
924
|
+
{
|
|
925
|
+
"dashboard": {
|
|
926
|
+
"title": "E11y Per-Endpoint SLO Dashboard",
|
|
927
|
+
"templating": {
|
|
928
|
+
"list": [
|
|
929
|
+
{
|
|
930
|
+
"name": "controller",
|
|
931
|
+
"type": "query",
|
|
932
|
+
"query": "label_values(http_requests_total, controller)"
|
|
933
|
+
},
|
|
934
|
+
{
|
|
935
|
+
"name": "action",
|
|
936
|
+
"type": "query",
|
|
937
|
+
"query": "label_values(http_requests_total{controller=\"$controller\"}, action)"
|
|
938
|
+
}
|
|
939
|
+
]
|
|
940
|
+
},
|
|
941
|
+
"panels": [
|
|
942
|
+
{
|
|
943
|
+
"title": "Availability SLO: $controller#$action",
|
|
944
|
+
"targets": [
|
|
945
|
+
{
|
|
946
|
+
"expr": "sum(rate(http_requests_total{controller=\"$controller\",action=\"$action\",status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total{controller=\"$controller\",action=\"$action\"}[30d]))",
|
|
947
|
+
"legendFormat": "Current (30d)"
|
|
948
|
+
},
|
|
949
|
+
{
|
|
950
|
+
"expr": "0.999",
|
|
951
|
+
"legendFormat": "SLO Target (99.9%)"
|
|
952
|
+
}
|
|
953
|
+
],
|
|
954
|
+
"yaxis": {
|
|
955
|
+
"min": 0.995,
|
|
956
|
+
"max": 1.0
|
|
957
|
+
}
|
|
958
|
+
},
|
|
959
|
+
{
|
|
960
|
+
"title": "Error Budget: $controller#$action",
|
|
961
|
+
"targets": [
|
|
962
|
+
{
|
|
963
|
+
"expr": "slo_error_budget_remaining{controller=\"$controller\",action=\"$action\"}",
|
|
964
|
+
"legendFormat": "Remaining"
|
|
965
|
+
}
|
|
966
|
+
],
|
|
967
|
+
"thresholds": [
|
|
968
|
+
{ "value": 0, "color": "red" },
|
|
969
|
+
{ "value": 0.0002, "color": "yellow" },
|
|
970
|
+
{ "value": 0.001, "color": "green" }
|
|
971
|
+
]
|
|
972
|
+
},
|
|
973
|
+
{
|
|
974
|
+
"title": "Burn Rate (Multi-Window): $controller#$action",
|
|
975
|
+
"targets": [
|
|
976
|
+
{
|
|
977
|
+
"expr": "slo_burn_rate_1h{controller=\"$controller\",action=\"$action\"}",
|
|
978
|
+
"legendFormat": "1h (fast burn)"
|
|
979
|
+
},
|
|
980
|
+
{
|
|
981
|
+
"expr": "slo_burn_rate_6h{controller=\"$controller\",action=\"$action\"}",
|
|
982
|
+
"legendFormat": "6h (medium burn)"
|
|
983
|
+
},
|
|
984
|
+
{
|
|
985
|
+
"expr": "slo_burn_rate_3d{controller=\"$controller\",action=\"$action\"}",
|
|
986
|
+
"legendFormat": "3d (slow burn)"
|
|
987
|
+
},
|
|
988
|
+
{
|
|
989
|
+
"expr": "14.4",
|
|
990
|
+
"legendFormat": "Fast Burn Threshold"
|
|
991
|
+
},
|
|
992
|
+
{
|
|
993
|
+
"expr": "6.0",
|
|
994
|
+
"legendFormat": "Medium Burn Threshold"
|
|
995
|
+
},
|
|
996
|
+
{
|
|
997
|
+
"expr": "1.0",
|
|
998
|
+
"legendFormat": "Slow Burn Threshold"
|
|
999
|
+
}
|
|
1000
|
+
]
|
|
1001
|
+
},
|
|
1002
|
+
{
|
|
1003
|
+
"title": "Latency p99: $controller#$action",
|
|
1004
|
+
"targets": [
|
|
1005
|
+
{
|
|
1006
|
+
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{controller=\"$controller\",action=\"$action\"}[5m])) by (le))",
|
|
1007
|
+
"legendFormat": "p99"
|
|
1008
|
+
},
|
|
1009
|
+
{
|
|
1010
|
+
"expr": "0.5",
|
|
1011
|
+
"legendFormat": "SLO Target (500ms)"
|
|
1012
|
+
}
|
|
1013
|
+
]
|
|
1014
|
+
}
|
|
1015
|
+
]
|
|
1016
|
+
}
|
|
1017
|
+
}
|
|
1018
|
+
```
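
When panels like the availability one above are produced by a script instead of being edited by hand, the PromQL string can be assembled from the controller and action. The following Ruby sketch shows one way to do that. `availability_expr` is a hypothetical helper, and `http_requests_total` is simply the metric name used in the JSON above; substitute whatever your deployment actually exports.

```ruby
# Hypothetical helper (not an E11y API): builds the availability ratio used in
# the first panel. Label values are inserted verbatim, so they must not
# contain double quotes or backslashes.
def availability_expr(controller:, action:, window: "30d")
  selector = %(controller="#{controller}",action="#{action}")
  good  = %(sum(rate(http_requests_total{#{selector},status=~"2..|3.."}[#{window}])))
  total = %(sum(rate(http_requests_total{#{selector}}[#{window}])))
  "#{good} / #{total}"
end

puts availability_expr(controller: "Api::OrdersController", action: "create")
```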
|
|
1019
|
+
|
|
1020
|
+
---
|
|
1021
|
+
|
|
1022
|
+
## 8. Production Best Practices & Edge Cases
|
|
1023
|
+
|
|
1024
|
+
### 8.1. Rollout Strategy
|
|
1025
|
+
|
|
1026
|
+
**Phase 1: Observability Only (1-2 weeks)**
|
|
1027
|
+
```yaml
|
|
1028
|
+
# config/slo.yml - Initial rollout
|
|
1029
|
+
version: 1
|
|
1030
|
+
|
|
1031
|
+
# Start with app-wide only (no per-endpoint)
|
|
1032
|
+
app_wide:
|
|
1033
|
+
http:
|
|
1034
|
+
availability:
|
|
1035
|
+
enabled: true
|
|
1036
|
+
target: 0.999
|
|
1037
|
+
latency:
|
|
1038
|
+
enabled: true
|
|
1039
|
+
p99_target: 1000 # Conservative: 1s
|
|
1040
|
+
|
|
1041
|
+
# Disable burn rate alerts initially
|
|
1042
|
+
defaults:
|
|
1043
|
+
burn_rate_alerts:
|
|
1044
|
+
fast:
|
|
1045
|
+
enabled: false # Don't page SRE yet!
|
|
1046
|
+
medium:
|
|
1047
|
+
enabled: false
|
|
1048
|
+
slow:
|
|
1049
|
+
enabled: true # Only slow burn (info)
|
|
1050
|
+
alert_after: 24h # Very slow
|
|
1051
|
+
|
|
1052
|
+
# Keep the deployment gate disabled (don't block deploys yet)
|
|
1053
|
+
advanced:
|
|
1054
|
+
deployment_gate:
|
|
1055
|
+
enabled: false
|
|
1056
|
+
```
|
|
1057
|
+
|
|
1058
|
+
**Phase 2: Per-Endpoint + Slow Alerts (2-4 weeks)**
|
|
1059
|
+
```yaml
|
|
1060
|
+
# Add 3-5 critical endpoints
|
|
1061
|
+
endpoints:
|
|
1062
|
+
- name: "Health Check"
|
|
1063
|
+
controller: "HealthController"
|
|
1064
|
+
action: "index"
|
|
1065
|
+
slo:
|
|
1066
|
+
availability:
|
|
1067
|
+
target: 0.9999 # Start strict
|
|
1068
|
+
|
|
1069
|
+
- name: "Create Order"
|
|
1070
|
+
controller: "Api::OrdersController"
|
|
1071
|
+
action: "create"
|
|
1072
|
+
slo:
|
|
1073
|
+
availability:
|
|
1074
|
+
target: 0.999
|
|
1075
|
+
burn_rate_alerts:
|
|
1076
|
+
slow:
|
|
1077
|
+
enabled: true # Only slow burn for now
|
|
1078
|
+
alert_after: 12h
|
|
1079
|
+
```
|
|
1080
|
+
|
|
1081
|
+
**Phase 3: Multi-Window Burn Rate (4-6 weeks)**
|
|
1082
|
+
```yaml
|
|
1083
|
+
# Enable medium + fast burn rate alerts
|
|
1084
|
+
endpoints:
|
|
1085
|
+
- name: "Create Order"
|
|
1086
|
+
slo:
|
|
1087
|
+
burn_rate_alerts:
|
|
1088
|
+
fast:
|
|
1089
|
+
enabled: true
|
|
1090
|
+
alert_after: 10m # Start conservative (10m not 5m)
|
|
1091
|
+
medium:
|
|
1092
|
+
enabled: true
|
|
1093
|
+
slow:
|
|
1094
|
+
enabled: true
|
|
1095
|
+
```
|
|
1096
|
+
|
|
1097
|
+
**Phase 4: Deployment Gate (6-8 weeks)**
|
|
1098
|
+
```yaml
|
|
1099
|
+
# Only after confidence in data
|
|
1100
|
+
advanced:
|
|
1101
|
+
deployment_gate:
|
|
1102
|
+
enabled: true
|
|
1103
|
+
minimum_budget_percent: 10 # Start lenient (10% not 20%)
|
|
1104
|
+
override_label: "deploy:emergency"
|
|
1105
|
+
```
|
|
1106
|
+
|
|
1107
|
+
### 8.2. Edge cases (guidance)
|
|
1108
|
+
|
|
1109
|
+
Rollouts commonly run into these edge cases: routes unavailable in CI, Prometheus outages during budget checks, burn-rate noise on low-traffic endpoints, emergency deploys during incidents, endpoints without history, and maintenance windows. Older drafts of this ADR embedded Ruby and YAML that named **non-existent** rake tasks (for example `e11y:slo:deployment_gate:check`) and Ruby modules under `lib/e11y/slo/`. Those listings were removed because they are **not** gem APIs.
|
|
1110
|
+
|
|
1111
|
+
Use **§6** for what ships today (`ConfigLoader`, `ConfigValidator`, `e11y:lint`, `e11y:slo:dashboard`). Treat deployment gates, error-budget automation, and advanced burn-rate logic as **platform or application** work built on your own Prometheus and CI.
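
For teams that build such a gate themselves, the core decision is small. The sketch below is a hedged illustration in Ruby: both methods are hypothetical, the request counts are assumed to come from your own Prometheus queries over the SLO window, and the parameter `minimum_budget_percent` only mirrors the key of the same name in the reference YAML.

```ruby
# Hypothetical helper (not an E11y API): percent of the error budget still
# unspent. `good` and `total` are request counts over the SLO window.
# Assumes target < 1.0 so that the allowed failure count is non-zero.
def error_budget_remaining_percent(good:, total:, target:)
  return 100.0 if total.zero? # no traffic yet: nothing has been spent

  allowed_failures = (1.0 - target) * total
  actual_failures = total - good
  remaining = 1.0 - (actual_failures / allowed_failures)
  (remaining * 100).clamp(0.0, 100.0)
end

# Deploy when enough budget remains, or when an emergency override is set.
def deploy_allowed?(remaining_percent:, minimum_budget_percent:, emergency_override: false)
  emergency_override || remaining_percent >= minimum_budget_percent
end

remaining = error_budget_remaining_percent(good: 999_150, total: 1_000_000, target: 0.999)
puts "remaining budget: #{remaining.round(1)}%"
puts "deploy allowed at 10%: #{deploy_allowed?(remaining_percent: remaining, minimum_budget_percent: 10)}"
```

Treating "no traffic" as a full budget is a design choice for this example; a stricter gate might refuse to decide when there is no data, which also covers Prometheus outages.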
|
|
1112
|
+
|
|
1113
|
+
### 8.3. Monitoring the SLO system itself
|
|
1114
|
+
|
|
1115
|
+
When `e11y_self_monitoring.enabled` is `true` in `slo.yml`, **`E11y::SLO::ConfigLoader.self_monitoring_enabled?`** is true and **`E11y::Middleware::SelfMonitoringEmit`** increments **`e11y_events_tracked_total`** for events that complete the pipeline (see `lib/e11y/middleware/self_monitoring_emit.rb`). There is **no** `E11y.configure { |c| c.slo.self_monitoring do ... }` DSL.
|
|
1116
|
+
|
|
1117
|
+
Derive Prometheus alert rules from the metric names your deployment actually exports (Yabeda prefixing, scrape config, etc.)—do not assume placeholder metric names from older versions of this ADR.
|
|
1118
|
+
|
|
1119
|
+
---
|
|
1120
|
+
|
|
1121
|
+
## 9. Trade-offs
|
|
1122
|
+
|
|
1123
|
+
### 9.1. Key Decisions
|
|
1124
|
+
|
|
1125
|
+
| Decision | Pro | Con | Rationale |
|
|
1126
|
+
|----------|-----|-----|-----------|
|
|
1127
|
+
| **Per-endpoint SLO** | Granular visibility | Config complexity | Critical endpoints need specific SLOs |
|
|
1128
|
+
| **Multi-window burn rate** | 5-minute detection, low false positives | Complex Prometheus queries | Google SRE best practice 2026 |
|
|
1129
|
+
| **YAML-based config** | Version controlled, validated | Extra file | Separation of concerns |
|
|
1130
|
+
| **Optional latency SLO** | Flexible | Some endpoints untracked | Not all endpoints need latency |
|
|
1131
|
+
| **Config validation** | Prevents drift | CI/CD overhead | Critical for accuracy |
|
|
1132
|
+
| **30-day SLO window** | Industry standard | Slow trend detection | Multi-window compensates |
|
|
1133
|
+
|
|
1134
|
+
### 9.2. Alternatives Considered
|
|
1135
|
+
|
|
1136
|
+
**A) Single app-wide SLO only**
|
|
1137
|
+
- ❌ Rejected: Too coarse, hides critical endpoint issues
|
|
1138
|
+
|
|
1139
|
+
**B) Single-window alerting**
|
|
1140
|
+
- ❌ Rejected: Either slow (30d) or noisy (5m)
|
|
1141
|
+
|
|
1142
|
+
**C) Code-based SLO config**
|
|
1143
|
+
- ❌ Rejected: Requires deployment to change SLOs
|
|
1144
|
+
|
|
1145
|
+
**D) No config validation**
|
|
1146
|
+
- ❌ Rejected: Config drift is a real problem
|
|
1147
|
+
|
|
1148
|
+
**E) Per-user SLO**
|
|
1149
|
+
- ❌ Deferred to v2.0: Too complex for v1
|
|
1150
|
+
|
|
1151
|
+
---
|
|
1152
|
+
|
|
1153
|
+
## 10. Real-World Configuration Examples
|
|
1154
|
+
|
|
1155
|
+
### 10.1. E-Commerce Platform
|
|
1156
|
+
|
|
1157
|
+
```yaml
# config/slo.yml - E-commerce example
version: 1

defaults:
  window: 30d
  availability:
    enabled: true
    target: 0.999

endpoints:
  # === REVENUE-CRITICAL (99.99%) ===
  - name: "Checkout - Payment"
    pattern: "POST /checkout/payment"
    controller: "Checkout::PaymentsController"
    action: "create"
    tags: [critical, revenue, pci_scope]
    slo:
      availability:
        target: 0.9999  # 99.99%
      latency:
        p99_target: 2000  # 2s (Stripe API call)
        p95_target: 1000
      throughput:
        min_rps: 1
        max_rps: 100  # Rate limit (fraud protection)
      burn_rate_alerts:
        fast:
          threshold: 10.0  # More lenient (third-party)
          alert_after: 5m

  - name: "Cart - Add Item"
    pattern: "POST /cart/items"
    controller: "CartController"
    action: "add_item"
    tags: [high_priority, customer_facing]
    slo:
      availability:
        target: 0.999  # 99.9%
      latency:
        p99_target: 300
        p95_target: 150
      throughput:
        max_rps: 1000

  # === HIGH-TRAFFIC (throughput-focused) ===
  - name: "Product Search"
    pattern: "GET /api/products/search"
    controller: "Api::ProductsController"
    action: "search"
    tags: [high_traffic, search, cached]
    slo:
      availability:
        target: 0.995  # 99.5% (can tolerate cache misses)
      latency:
        p99_target: 500
      throughput:
        min_rps: 50  # Must handle 50+ req/sec
        max_rps: 5000

  # === ADMIN (low priority) ===
  - name: "Admin - Sales Report"
    pattern: "POST /admin/reports/sales"
    controller: "Admin::ReportsController"
    action: "sales"
    tags: [admin, slow_operation]
    slo:
      availability:
        target: 0.99  # 99%
      latency:
        p99_target: 30000  # 30s
      burn_rate_alerts:
        fast:
          enabled: false
        slow:
          enabled: true

services:
  sidekiq:
    jobs:
      PaymentProcessingJob:
        success_rate_target: 0.9999  # Critical!
        alert_on_single_failure: true

      InventorySyncJob:
        success_rate_target: 0.99
        latency:
          p99_target: 60000  # 60s
```
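
To make the availability targets above concrete, the following Ruby sketch converts each target into the downtime it allows over the 30-day window. It assumes a time-based error budget (a request-based SLO would count failed requests instead), and `error_budget_minutes` is a hypothetical helper, not part of e11y.

```ruby
# Allowed "bad" minutes for an availability target over a rolling window.
# Assumption: the budget is measured in time, not in failed requests.
MINUTES_PER_DAY = 24 * 60

def error_budget_minutes(target, window_days)
  ((1.0 - target) * window_days * MINUTES_PER_DAY).round(2)
end

# Availability targets copied from the e-commerce example.
targets = {
  "Checkout - Payment"   => 0.9999,
  "Cart - Add Item"      => 0.999,
  "Product Search"       => 0.995,
  "Admin - Sales Report" => 0.99
}

budgets = targets.transform_values { |target| error_budget_minutes(target, 30) }

budgets.each do |name, minutes|
  puts format("%-22s %8.2f min / 30d", name, minutes)
end
```

Under this assumption the 99.99% checkout target leaves roughly 4.3 minutes of budget per 30 days, while the 99% admin report target leaves about 7.2 hours. The gap explains why only the revenue-critical endpoints get fast burn-rate alerts.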

### 10.2. SaaS API Platform

```yaml
# config/slo.yml - API platform example
version: 1

defaults:
  window: 30d
  availability:
    enabled: true
    target: 0.999
  latency:
    enabled: true
    p99_target: 200  # Fast API

endpoints:
  # === PUBLIC API (99.99%) ===
  - name: "API - Create Resource"
    pattern: "POST /api/v1/resources"
    controller: "Api::V1::ResourcesController"
    action: "create"
    tags: [api, customer_facing, rate_limited]
    slo:
      availability:
        target: 0.9999  # 99.99% SLA
      latency:
        p99_target: 200
        p95_target: 100
      throughput:
        min_rps: 10
        max_rps: 10000  # High throughput API
      burn_rate_alerts:
        fast:
          threshold: 14.4
          alert_after: 5m

  # === WEBHOOKS (eventual consistency) ===
  - name: "Webhook Delivery"
    pattern: "POST /internal/webhooks/deliver"
    controller: "Internal::WebhooksController"
    action: "deliver"
    tags: [internal, async, retry]
    slo:
      availability:
        target: 0.95  # 95% (retries handle failures)
      latency:
        enabled: false  # Async, latency not critical
      burn_rate_alerts:
        fast:
          enabled: false
        slow:
          enabled: true

services:
  sidekiq:
    default:
      success_rate_target: 0.999
    jobs:
      WebhookDeliveryJob:
        success_rate_target: 0.95  # Retries + DLQ
        latency:
          p99_target: 10000  # 10s (external API)
```
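
The `threshold` values in the examples (14.4 here, 10.0 for the payment endpoint in §10.1) read as burn-rate multipliers. The sketch below shows the usual arithmetic behind such numbers. It assumes `threshold` means "observed error ratio divided by the error ratio the target allows", which this section does not state explicitly. Both helper names are hypothetical.

```ruby
# Burn rate: how many times faster than "exactly on budget" errors are arriving.
def burn_rate(error_ratio, target)
  error_ratio / (1.0 - target)
end

# Fraction of the window's error budget spent if a burn rate is sustained
# for the given number of hours.
def budget_consumed(rate, duration_hours, window_days: 30)
  rate * duration_hours / (window_days * 24.0)
end

one_hour_at_14_4 = budget_consumed(14.4, 1)
one_hour_at_10   = budget_consumed(10.0, 1)
observed_rate    = burn_rate(0.00144, 0.9999) # 0.144% errors against a 99.99% target

puts format("14.4x for 1h spends %.2f%% of a 30d budget", one_hour_at_14_4 * 100)
puts format("10.0x for 1h spends %.2f%% of a 30d budget", one_hour_at_10 * 100)
puts format("0.144%% errors at 99.99%% is a %.1fx burn rate", observed_rate)
```

With a 30-day window, an hour at 14.4x spends about 2% of the budget, and an hour at 10.0x about 1.4%. That is consistent with the "more lenient" comment on the payment endpoint.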

### 10.3. Internal Admin Tool

```yaml
# config/slo.yml - Admin tool example
version: 1

defaults:
  window: 7d  # Shorter window (less critical)
  availability:
    enabled: true
    target: 0.99  # 99% (internal users tolerate downtime)
  latency:
    enabled: false  # No latency SLO by default

endpoints:
  - name: "Admin Dashboard"
    pattern: "GET /admin"
    controller: "AdminController"
    action: "index"
    tags: [admin, internal]
    slo:
      availability:
        target: 0.99
      burn_rate_alerts:
        fast:
          enabled: false
        slow:
          enabled: true
          alert_after: 24h  # Very slow

  - name: "Data Export"
    pattern: "POST /admin/exports"
    controller: "Admin::ExportsController"
    action: "create"
    tags: [admin, slow_operation]
    slo:
      availability:
        target: 0.95  # 95% (can retry)
      latency:
        p99_target: 120000  # 2 minutes (large CSV)

advanced:
  deployment_gate:
    enabled: false  # No deployment gate for admin tool

  error_budget_alerts:
    enabled: false  # No budget alerts
```
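
The examples rely on endpoint-level `slo` settings overriding `defaults`. This section does not say how, or whether, the gem merges the two, so the sketch below only illustrates one plausible reading: a recursive overlay. `deep_merge` is a hypothetical helper and the YAML is a trimmed copy of the admin example.

```ruby
require "yaml"

# Hypothetical helper: overlay `override` on `base`, recursing into nested hashes.
def deep_merge(base, override)
  base.merge(override) do |_key, old_value, new_value|
    old_value.is_a?(Hash) && new_value.is_a?(Hash) ? deep_merge(old_value, new_value) : new_value
  end
end

config = YAML.safe_load(<<~SLO)
  defaults:
    window: 7d
    availability:
      enabled: true
      target: 0.99
    latency:
      enabled: false
  endpoints:
    - name: "Data Export"
      slo:
        availability:
          target: 0.95
        latency:
          p99_target: 120000
SLO

# Effective settings per endpoint name under the overlay assumption.
effective = config["endpoints"].to_h do |endpoint|
  [endpoint["name"], deep_merge(config["defaults"], endpoint.fetch("slo", {}))]
end

puts effective["Data Export"].inspect
```

Under a plain overlay, "Data Export" inherits `latency.enabled: false` from `defaults` even though it sets a `p99_target`. If the latency objective is meant to apply to that endpoint, setting `enabled: true` next to the target avoids the ambiguity.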

---

## 11. Summary & Next Steps

### 11.1. What We Achieved

✅ **Multi-level SLO strategy**: App-wide, service-level, per-endpoint
✅ **5-minute alert detection**: Multi-window burn rate (Google SRE 2026)
✅ **YAML-based configuration**: Version-controlled `slo.yml`; partial validation via `ConfigValidator`
✅ **Flexible latency SLO**: Optional per endpoint (reference schema in §4.1)
✅ **Throughput SLO**: Reference schema for high-traffic endpoints
✅ **Config validation & linting**: `e11y:lint`, `ConfigLoader` + `ConfigValidator` (subset)
✅ **Shipped tooling**: `ConfigLoader`, `ConfigValidator`, `e11y:slo:dashboard`, `SelfMonitoringEmit` when enabled in YAML
✅ **PromQL & alerts**: See [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md)
✅ **RSpec testing**: Specs for `ConfigLoader`, `ConfigValidator`, and the SLO linters (see §11.2, Phase 3)
✅ **Production best practices**: Rollout strategy, edge case handling, self-monitoring
✅ **Real-world examples**: E-commerce, SaaS API, Admin tool configurations

### 11.2. Implementation Checklist

**Phase 1: Core (Week 1-2)**
- [x] Implement `E11y::SLO::ConfigLoader` (YAML search + `safe_load`)
- [ ] Optional: richer loader (ERB, strict mode) if product needs it
- [x] Implement `E11y::SLO::ConfigValidator.validate` (subset of schema)
- [x] `rake e11y:slo:validate` → alias for `e11y:lint`
- [x] HTTP / job SLIs via `E11y::SLO::Tracker` + Rack / Sidekiq / ActiveJob hooks
- [ ] Broader `slo.yml` consumption (beyond dashboard + linters), if desired

**Phase 2: Production Readiness (Week 3-4)**
- [ ] Maintenance windows, deploy grace periods, deployment gates (product/platform)
- [x] Basic self-monitoring hook (`e11y_self_monitoring` in YAML + `SelfMonitoringEmit`)
- [ ] CI: run `e11y:lint` / optional `ConfigValidator.validate` on `slo.yml` changes
- [ ] Operational playbooks (runbooks, Grafana) per team

**Phase 3: Tests**
- [x] `spec/e11y/slo/config_loader_spec.rb`, `config_validator_spec.rb`, linters under `spec/e11y/linters/slo/`
- [ ] Additional integration coverage as features grow
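
The checklist says `ConfigValidator` validates only a subset of the schema. As an illustration of the kind of check such a subset might contain, here is a minimal sketch. It is not the gem's `ConfigValidator`: `slo_errors` and its three rules are assumptions, chosen because the examples in §10 would all pass them.

```ruby
# Hypothetical sanity checks for one endpoint entry from slo.yml.
# Returns an array of human-readable error messages (empty when valid).
def slo_errors(endpoint)
  slo = endpoint.fetch("slo", {})
  errors = []

  # A ratio such as 0.999, not a percentage such as 99.9.
  target = slo.dig("availability", "target")
  unless target.nil? || (target.is_a?(Numeric) && target > 0 && target < 1)
    errors << "availability.target must be a number between 0 and 1"
  end

  p95 = slo.dig("latency", "p95_target")
  p99 = slo.dig("latency", "p99_target")
  errors << "latency.p95_target must not exceed p99_target" if p95 && p99 && p95 > p99

  min_rps = slo.dig("throughput", "min_rps")
  max_rps = slo.dig("throughput", "max_rps")
  errors << "throughput.min_rps must not exceed max_rps" if min_rps && max_rps && min_rps > max_rps

  errors
end

good = {
  "name" => "Cart - Add Item",
  "slo" => {
    "availability" => { "target" => 0.999 },
    "latency" => { "p99_target" => 300, "p95_target" => 150 }
  }
}
bad = { "name" => "Percent typo", "slo" => { "availability" => { "target" => 99.9 } } }

puts slo_errors(good).inspect
puts slo_errors(bad).inspect
```

Checks of this shape are cheap enough to run in CI on every `slo.yml` change, which is what the open Phase 2 item proposes.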

---

**Status:** Core SLO tracking and `slo.yml` tooling are in the gem; large parts of §4.1 YAML remain a **reference** for alerting/dashboards rather than a fully interpreted config file.
**Next:** Expand or document which YAML keys each component reads (`DashboardGenerator`, linters, `ConfigLoader`).
**Impact:** HTTP/job SLI metrics, optional `slo.yml` + dashboard export, PromQL examples in [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md).