e11y 0.2.0 → 1.0.0
- checksums.yaml +4 -4
- data/.rubocop.yml +130 -10
- data/CHANGELOG.md +56 -1
- data/CLAUDE.md +168 -0
- data/CONTRIBUTING.md +640 -0
- data/README.md +134 -702
- data/RELEASE.md +18 -3
- data/Rakefile +108 -29
- data/config/README.md +1 -1
- data/config/loki-local-config.yaml +12 -0
- data/config/otel-collector-config.yaml +44 -0
- data/cucumber.yml +1 -0
- data/docker-compose.yml +18 -2
- data/docs/ADAPTERS.md +76 -0
- data/docs/ADAPTIVE_SAMPLING.md +59 -0
- data/docs/COMPARISON.md +104 -0
- data/docs/CONFIGURATION.md +52 -0
- data/docs/DISTRIBUTED_TRACING.md +44 -0
- data/docs/LIMITATIONS.md +13 -0
- data/docs/METRICS_DSL.md +84 -0
- data/docs/PERFORMANCE.md +60 -0
- data/docs/PII_FILTERING.md +40 -0
- data/docs/PRESETS.md +65 -0
- data/docs/QUICK-START.md +546 -587
- data/docs/RAILS_INTEGRATION.md +29 -0
- data/docs/SCHEMA_VALIDATION.md +63 -0
- data/docs/SLO-PROMQL-ALERTS.md +161 -0
- data/docs/TESTING.md +69 -0
- data/docs/{ADR-001-architecture.md → architecture/ADR-001-architecture.md} +35 -64
- data/docs/{ADR-002-metrics-yabeda.md → architecture/ADR-002-metrics-yabeda.md} +62 -236
- data/docs/{ADR-003-slo-observability.md → architecture/ADR-003-slo-observability.md} +27 -466
- data/docs/{ADR-004-adapter-architecture.md → architecture/ADR-004-adapter-architecture.md} +163 -146
- data/docs/{ADR-005-tracing-context.md → architecture/ADR-005-tracing-context.md} +10 -9
- data/docs/{ADR-006-security-compliance.md → architecture/ADR-006-security-compliance.md} +184 -191
- data/docs/{ADR-007-opentelemetry-integration.md → architecture/ADR-007-opentelemetry-integration.md} +3 -21
- data/docs/{ADR-008-rails-integration.md → architecture/ADR-008-rails-integration.md} +209 -339
- data/docs/{ADR-009-cost-optimization.md → architecture/ADR-009-cost-optimization.md} +45 -54
- data/docs/architecture/ADR-010-developer-experience.md +522 -0
- data/docs/{ADR-011-testing-strategy.md → architecture/ADR-011-testing-strategy.md} +41 -83
- data/docs/{ADR-013-reliability-error-handling.md → architecture/ADR-013-reliability-error-handling.md} +37 -12
- data/docs/{ADR-014-event-driven-slo.md → architecture/ADR-014-event-driven-slo.md} +12 -24
- data/docs/{ADR-015-middleware-order.md → architecture/ADR-015-middleware-order.md} +23 -41
- data/docs/{ADR-016-self-monitoring-slo.md → architecture/ADR-016-self-monitoring-slo.md} +52 -349
- data/docs/{ADR-017-multi-rails-compatibility.md → architecture/ADR-017-multi-rails-compatibility.md} +4 -11
- data/docs/architecture/ADR-018-memory-optimization.md +366 -0
- data/docs/{ADR-INDEX.md → architecture/ADR-INDEX.md} +11 -6
- data/docs/{00-ICP-AND-TIMELINE.md → prd/00-ICP-AND-TIMELINE.md} +6 -6
- data/docs/{01-SCALE-REQUIREMENTS.md → prd/01-SCALE-REQUIREMENTS.md} +6 -6
- data/docs/prd/01-overview-vision.md +19 -14
- data/docs/use_cases/README.md +22 -23
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +50 -44
- data/docs/use_cases/UC-002-business-event-tracking.md +26 -95
- data/docs/use_cases/UC-003-event-metrics.md +66 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +42 -101
- data/docs/use_cases/UC-005-sentry-integration.md +13 -15
- data/docs/use_cases/UC-006-trace-context-management.md +30 -28
- data/docs/use_cases/UC-007-pii-filtering.md +35 -87
- data/docs/use_cases/UC-008-opentelemetry-integration.md +51 -89
- data/docs/use_cases/UC-009-multi-service-tracing.md +4 -4
- data/docs/use_cases/UC-010-background-job-tracking.md +5 -5
- data/docs/use_cases/UC-011-rate-limiting.md +95 -168
- data/docs/use_cases/UC-012-audit-trail.md +21 -46
- data/docs/use_cases/UC-013-high-cardinality-protection.md +29 -167
- data/docs/use_cases/UC-014-adaptive-sampling.md +2 -2
- data/docs/use_cases/UC-015-cost-optimization.md +46 -99
- data/docs/use_cases/UC-016-rails-logger-migration.md +39 -213
- data/docs/use_cases/UC-017-local-development.md +203 -777
- data/docs/use_cases/UC-018-testing-events.md +3 -3
- data/docs/use_cases/UC-019-retention-based-routing.md +53 -106
- data/docs/use_cases/UC-020-event-versioning.md +8 -9
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +18 -22
- data/docs/use_cases/UC-022-event-registry.md +15 -21
- data/docs/use_cases/backlog.md +119 -87
- data/e11y.gemspec +2 -2
- data/gems/e11y-devtools/README.md +136 -0
- data/gems/e11y-devtools/config/routes.rb +8 -0
- data/gems/e11y-devtools/e11y-devtools.gemspec +25 -0
- data/gems/e11y-devtools/exe/e11y +34 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/server.rb +96 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tool_base.rb +25 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/clear.rb +31 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/errors.rb +35 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/event_detail.rb +33 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/events_by_trace.rb +33 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/interactions.rb +40 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/recent_events.rb +34 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/search.rb +34 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/stats.rb +30 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/assets/overlay.js +115 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/controller.rb +54 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/engine.rb +26 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/middleware.rb +80 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/rails_controller.rb +42 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/app.rb +262 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/grouping.rb +66 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_detail.rb +62 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_list.rb +70 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/interaction_list.rb +47 -0
- data/gems/e11y-devtools/lib/e11y/devtools/version.rb +8 -0
- data/gems/e11y-devtools/lib/e11y/devtools.rb +13 -0
- data/gems/e11y-devtools/spec/e11y/devtools/mcp/tools_spec.rb +107 -0
- data/gems/e11y-devtools/spec/e11y/devtools/overlay/controller_spec.rb +58 -0
- data/gems/e11y-devtools/spec/e11y/devtools/overlay/middleware_spec.rb +46 -0
- data/gems/e11y-devtools/spec/e11y/devtools/tui/app_spec.rb +85 -0
- data/gems/e11y-devtools/spec/e11y/devtools/tui/grouping_spec.rb +64 -0
- data/gems/e11y-devtools/spec/spec_helper.rb +5 -0
- data/gems/e11y-devtools/spec/tui/widgets/event_list_spec.rb +44 -0
- data/gems/e11y-devtools/spec/tui/widgets/interaction_list_spec.rb +62 -0
- data/lib/e11y/adapters/audit_encrypted.rb +53 -11
- data/lib/e11y/adapters/base.rb +33 -34
- data/lib/e11y/adapters/dev_log/file_store.rb +143 -0
- data/lib/e11y/adapters/dev_log/query.rb +219 -0
- data/lib/e11y/adapters/dev_log.rb +118 -0
- data/lib/e11y/adapters/file.rb +3 -6
- data/lib/e11y/adapters/in_memory.rb +52 -5
- data/lib/e11y/adapters/in_memory_test.rb +29 -0
- data/lib/e11y/adapters/loki.rb +58 -23
- data/lib/e11y/adapters/null.rb +82 -0
- data/lib/e11y/adapters/opentelemetry_collector.rb +183 -0
- data/lib/e11y/adapters/otel_logs.rb +136 -23
- data/lib/e11y/adapters/sentry.rb +4 -7
- data/lib/e11y/adapters/stdout.rb +73 -7
- data/lib/e11y/adapters/yabeda.rb +153 -29
- data/lib/e11y/buffers/adaptive_buffer.rb +3 -17
- data/lib/e11y/buffers/{request_scoped_buffer.rb → ephemeral_buffer.rb} +72 -58
- data/lib/e11y/buffers/ring_buffer.rb +3 -16
- data/lib/e11y/configuration.rb +272 -0
- data/lib/e11y/console.rb +10 -17
- data/lib/e11y/current.rb +53 -1
- data/lib/e11y/debug/pipeline_inspector.rb +96 -0
- data/lib/e11y/documentation/generator.rb +48 -0
- data/lib/e11y/event/base.rb +176 -82
- data/lib/e11y/event/value_sampling_config.rb +1 -5
- data/lib/e11y/events/rails/database/query.rb +1 -4
- data/lib/e11y/events/rails/job/failed.rb +2 -0
- data/lib/e11y/instruments/active_job.rb +46 -12
- data/lib/e11y/instruments/rails_instrumentation.rb +49 -24
- data/lib/e11y/instruments/sidekiq.rb +137 -31
- data/lib/e11y/linters/base.rb +11 -0
- data/lib/e11y/linters/pii/pii_declaration_linter.rb +120 -0
- data/lib/e11y/linters/slo/config_consistency_linter.rb +76 -0
- data/lib/e11y/linters/slo/explicit_declaration_linter.rb +36 -0
- data/lib/e11y/linters/slo/slo_status_from_linter.rb +41 -0
- data/lib/e11y/logger/bridge.rb +26 -7
- data/lib/e11y/metrics/cardinality_protection.rb +10 -15
- data/lib/e11y/metrics/cardinality_tracker.rb +16 -6
- data/lib/e11y/metrics/registry.rb +3 -5
- data/lib/e11y/metrics/test_backend.rb +62 -0
- data/lib/e11y/metrics.rb +56 -10
- data/lib/e11y/middleware/adapter_resolver.rb +40 -0
- data/lib/e11y/middleware/audit_signing.rb +43 -6
- data/lib/e11y/middleware/baggage_protection.rb +75 -0
- data/lib/e11y/middleware/dev_log_source.rb +24 -0
- data/lib/e11y/middleware/event_slo.rb +23 -9
- data/lib/e11y/middleware/otel_span.rb +23 -0
- data/lib/e11y/middleware/pii_filter.rb +104 -75
- data/lib/e11y/middleware/rate_limiting.rb +54 -27
- data/lib/e11y/middleware/request.rb +70 -23
- data/lib/e11y/middleware/routing.rb +78 -21
- data/lib/e11y/middleware/sampling.rb +66 -17
- data/lib/e11y/middleware/self_monitoring_emit.rb +39 -0
- data/lib/e11y/middleware/trace_context.rb +45 -10
- data/lib/e11y/middleware/track_latency.rb +34 -0
- data/lib/e11y/middleware/validation.rb +7 -16
- data/lib/e11y/middleware/versioning.rb +26 -22
- data/lib/e11y/opentelemetry/semantic_conventions.rb +109 -0
- data/lib/e11y/opentelemetry/span_creator.rb +142 -0
- data/lib/e11y/pii/patterns.rb +12 -1
- data/lib/e11y/pipeline/builder.rb +1 -1
- data/lib/e11y/presets/audit_event.rb +13 -2
- data/lib/e11y/railtie.rb +52 -15
- data/lib/e11y/registry.rb +306 -0
- data/lib/e11y/reliability/circuit_breaker.rb +19 -21
- data/lib/e11y/reliability/dlq/base.rb +71 -0
- data/lib/e11y/reliability/dlq/file_adapter.rb +301 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +63 -34
- data/lib/e11y/reliability/dlq/filter.rb +37 -54
- data/lib/e11y/reliability/retry_handler.rb +26 -29
- data/lib/e11y/reliability/retry_rate_limiter.rb +3 -11
- data/lib/e11y/sampling/error_spike_detector.rb +0 -2
- data/lib/e11y/sampling/load_monitor.rb +5 -9
- data/lib/e11y/sampling/stratified_tracker.rb +18 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +2 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +19 -61
- data/lib/e11y/self_monitoring/reliability_monitor.rb +4 -74
- data/lib/e11y/slo/config_loader.rb +40 -0
- data/lib/e11y/slo/config_validator.rb +58 -0
- data/lib/e11y/slo/dashboard_generator.rb +122 -0
- data/lib/e11y/slo/event_driven.rb +8 -0
- data/lib/e11y/slo/tracker.rb +31 -4
- data/lib/e11y/testing/have_tracked_event_matcher.rb +190 -0
- data/lib/e11y/testing/rspec_matchers.rb +21 -0
- data/lib/e11y/testing/snapshot_matcher.rb +86 -0
- data/lib/e11y/trace_context/sampler.rb +35 -0
- data/lib/e11y/tracing/faraday_middleware.rb +31 -0
- data/lib/e11y/tracing/net_http_patch.rb +33 -0
- data/lib/e11y/tracing/propagator.rb +116 -0
- data/lib/e11y/tracing.rb +47 -0
- data/lib/e11y/version.rb +1 -1
- data/lib/e11y/versioning/version_extractor.rb +32 -0
- data/lib/e11y.rb +141 -265
- data/lib/generators/e11y/event/event_generator.rb +22 -0
- data/lib/generators/e11y/event/templates/event.rb.tt +16 -0
- data/lib/generators/e11y/grafana_dashboard/grafana_dashboard_generator.rb +30 -0
- data/lib/generators/e11y/grafana_dashboard/templates/e11y_dashboard.json +81 -0
- data/lib/generators/e11y/install/install_generator.rb +34 -0
- data/lib/generators/e11y/install/templates/e11y.rb +239 -0
- data/lib/generators/e11y/prometheus_alerts/prometheus_alerts_generator.rb +29 -0
- data/lib/generators/e11y/prometheus_alerts/templates/e11y_alerts.yml +28 -0
- data/lib/tasks/e11y_docs.rake +30 -0
- data/lib/tasks/e11y_events.rake +71 -0
- data/lib/tasks/e11y_lint.rake +91 -0
- data/lib/tasks/e11y_slo.rake +29 -0
- metadata +129 -39
- data/docs/ADR-010-developer-experience.md +0 -2166
- data/docs/API-REFERENCE-L28.md +0 -914
- data/docs/COMPREHENSIVE-CONFIGURATION.md +0 -2366
- data/docs/CONTRIBUTING.md +0 -312
- data/docs/IMPLEMENTATION_NOTES.md +0 -2804
- data/docs/IMPLEMENTATION_PLAN.md +0 -1971
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +0 -586
- data/docs/PLAN.md +0 -148
- data/docs/README.md +0 -296
- data/docs/design/00-memory-optimization.md +0 -593
- data/docs/guides/MIGRATION-L27-L28.md +0 -692
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +0 -434
- data/docs/guides/README.md +0 -44
- data/docs/use_cases/UC-003-pattern-based-metrics.md +0 -1627
- data/lib/e11y/adapters/registry.rb +0 -141
- /data/docs/{ADR-012-event-evolution.md → architecture/ADR-012-event-evolution.md} +0 -0
@@ -17,8 +17,7 @@ This ADR covers **HTTP/Job SLO** (infrastructure reliability):
 - ✅ Zero-config SLO for HTTP requests (99.9% availability)
 - ✅ Zero-config SLO for Sidekiq/ActiveJob (99.5% success rate)
 - ✅ Per-endpoint SLO configuration in `slo.yml`
-- ✅
-- ✅ Error budget management & deployment gates
+- ✅ PromQL queries and alert rules — see [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md)
 
 **For Event-based SLO** (business logic reliability like "order creation success rate"), see **ADR-014**.
 
@@ -32,11 +31,13 @@ This ADR covers **HTTP/Job SLO** (infrastructure reliability):
 2. [Architecture Overview](#2-architecture-overview)
 3. [Multi-Level SLO Strategy](#3-multi-level-slo-strategy)
 4. [Per-Endpoint SLO Configuration](#4-per-endpoint-slo-configuration)
-5. [
+5. [PromQL & Alerts](#5-promql--alerts)
 6. [SLO Config Validation & Linting](#6-slo-config-validation--linting)
-7. [
-8. [
+7. [Dashboard & Reporting](#7-dashboard--reporting)
+8. [Production Best Practices & Edge Cases](#8-production-best-practices--edge-cases)
 9. [Trade-offs](#9-trade-offs)
+10. [Real-World Configuration Examples](#10-real-world-configuration-examples)
+11. [Summary & Next Steps](#11-summary--next-steps)
 
 ---
 
@@ -1225,185 +1226,9 @@ end
 
 ---
 
-## 5.
+## 5. PromQL & Alerts
 
-
-
-**Problem with Single Window:**
-```
-Single 30-day window:
-- Slow reaction (hours to detect)
-- Hard to distinguish acute vs chronic issues
-
-Single 5-minute window:
-- Fast reaction
-- High false positive rate (noise)
-```
-
-**Solution: Multi-Window Multi-Burn Rate:**
-```
-3 windows simultaneously:
-- 1 hour: Fast burn (acute issue, page immediately)
-- 6 hours: Medium burn (developing issue, warn team)
-- 3 days: Slow burn (chronic issue, investigate)
-```
-
-### 5.2. Burn Rate Calculation
-
-**Formula:**
-```
-Burn Rate = (Actual Error Rate) / (Error Budget per Hour)
-
-For 99.9% SLO (30-day window):
-- Error Budget = 0.1% = 0.001
-- Error Budget per Hour = 0.001 / (30 * 24) = 0.00000139
-
-Fast Burn (1h window):
-- Threshold = 14.4x burn rate
-- Means: consuming 2% of 30-day budget in 1 hour
-- Alert fires in 5 minutes
-
-Medium Burn (6h window):
-- Threshold = 6.0x burn rate
-- Means: consuming 5% of 30-day budget in 6 hours
-- Alert fires in 30 minutes
-
-Slow Burn (3d window):
-- Threshold = 1.0x burn rate
-- Means: consuming 10% of 30-day budget in 3 days
-- Alert fires in 6 hours
-```
-
-### 5.3. Prometheus Alert Rules (Per-Endpoint!)
-
-```yaml
-# prometheus/alerts/e11y_slo_per_endpoint.yml
-groups:
-  - name: e11y_slo_per_endpoint
-    interval: 30s # Check every 30 seconds
-    rules:
-      # ===== FAST BURN (1h window, 5 min alert) =====
-      - alert: E11ySLOFastBurn_CreateOrder
-        expr: |
-          (
-            # Error rate in last 1 hour
-            sum(rate(http_requests_total{
-              controller="Api::OrdersController",
-              action="create",
-              status=~"5.."
-            }[1h]))
-            /
-            sum(rate(http_requests_total{
-              controller="Api::OrdersController",
-              action="create"
-            }[1h]))
-          )
-          /
-          # Error budget per hour (0.001 / 720 hours)
-          0.00000139
-          > 14.4 # 14.4x burn rate = 2% of 30-day budget in 1h
-        for: 5m # Alert after 5 minutes
-        labels:
-          severity: critical
-          endpoint: "POST /api/orders"
-          controller: "Api::OrdersController"
-          action: "create"
-          burn_window: "1h"
-        annotations:
-          summary: "CRITICAL: Fast burn on {{ $labels.endpoint }}"
-          description: |
-            Error rate is 14.4x higher than sustainable rate.
-            Burning 2% of 30-day error budget in 1 hour.
-            Current burn rate: {{ $value | humanize }}x
-
-            Impact: Will exhaust error budget in {{ div 720 $value | humanize }} hours
-
-            Dashboard: https://grafana/d/e11y-slo?var-endpoint=orders_create
-            Runbook: https://wiki/runbooks/fast-burn-orders
-
-      # ===== MEDIUM BURN (6h window, 30 min alert) =====
-      - alert: E11ySLOMediumBurn_CreateOrder
-        expr: |
-          (
-            sum(rate(http_requests_total{
-              controller="Api::OrdersController",
-              action="create",
-              status=~"5.."
-            }[6h]))
-            /
-            sum(rate(http_requests_total{
-              controller="Api::OrdersController",
-              action="create"
-            }[6h]))
-          )
-          /
-          0.00000139
-          > 6.0 # 6x burn rate = 5% of 30-day budget in 6h
-        for: 30m # Alert after 30 minutes
-        labels:
-          severity: warning
-          endpoint: "POST /api/orders"
-          controller: "Api::OrdersController"
-          action: "create"
-          burn_window: "6h"
-        annotations:
-          summary: "WARNING: Medium burn on {{ $labels.endpoint }}"
-          description: |
-            Error rate is 6x higher than sustainable rate.
-            Burning 5% of 30-day error budget in 6 hours.
-            Current burn rate: {{ $value | humanize }}x
-
-      # ===== SLOW BURN (3d window, 6h alert) =====
-      - alert: E11ySLOSlowBurn_CreateOrder
-        expr: |
-          (
-            sum(rate(http_requests_total{
-              controller="Api::OrdersController",
-              action="create",
-              status=~"5.."
-            }[3d]))
-            /
-            sum(rate(http_requests_total{
-              controller="Api::OrdersController",
-              action="create"
-            }[3d]))
-          )
-          /
-          0.00000139
-          > 1.0 # 1x burn rate = 10% of 30-day budget in 3 days
-        for: 6h # Alert after 6 hours
-        labels:
-          severity: info
-          endpoint: "POST /api/orders"
-          controller: "Api::OrdersController"
-          action: "create"
-          burn_window: "3d"
-        annotations:
-          summary: "INFO: Slow burn on {{ $labels.endpoint }}"
-          description: |
-            Chronic issue: consuming error budget at steady rate.
-            Burning 10% of 30-day error budget in 3 days.
-
-            This is a trend, not an emergency. Investigate root cause.
-
-      # ===== LATENCY SLO (optional per endpoint) =====
-      - alert: E11ySLOLatency_CreateOrder
-        expr: |
-          histogram_quantile(0.99,
-            sum(rate(http_request_duration_seconds_bucket{
-              controller="Api::OrdersController",
-              action="create"
-            }[5m])) by (le)
-          ) > 0.5 # 500ms p99 threshold
-        for: 5m
-        labels:
-          severity: warning
-          endpoint: "POST /api/orders"
-          slo_type: "latency_p99"
-        annotations:
-          summary: "Latency SLO violation: {{ $labels.endpoint }}"
-          description: "P99 latency is {{ $value | humanize }}s (threshold: 500ms)"
-```
+PromQL queries and Prometheus alert rules: see [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md).
 
 ---
 
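The removed section 5.2 pairs each burn-rate threshold (14.4x, 6x, 1x) with a share of the 30-day error budget (2%, 5%, 10%). The short Ruby sketch below checks that arithmetic. It assumes the common definition of burn rate as the observed error ratio divided by the error budget `1 - target`, which is the reading under which the quoted percentages work out; the removed text instead divides by a per-hour budget. The helper names `burn_rate` and `budget_consumed` are hypothetical and are not part of the e11y gem.

```ruby
# Hypothetical helpers for illustration only; not part of the e11y API.
PERIOD_HOURS = 30 * 24 # 30-day SLO window = 720 hours

# Burn rate: how many times faster than "exactly on budget" errors occur.
# Assumes burn rate = error ratio / (1 - SLO target).
def burn_rate(error_rate, slo_target)
  error_rate / (1.0 - slo_target)
end

# Fraction of the whole-period error budget consumed when a given burn
# rate is sustained for window_hours.
def budget_consumed(rate, window_hours, period_hours = PERIOD_HOURS)
  rate * window_hours / period_hours.to_f
end

# Window label => [burn-rate threshold, window length in hours],
# taken from the removed section 5.2.
THRESHOLDS = { "1h" => [14.4, 1], "6h" => [6.0, 6], "3d" => [1.0, 72] }.freeze

THRESHOLDS.each do |window, (rate, hours)|
  pct = (budget_consumed(rate, hours) * 100).round(1)
  puts "#{window}: #{rate}x burn consumes #{pct}% of the 30-day budget"
end
```

Under this definition, a 1.44% error ratio against a 99.9% target gives `burn_rate(0.0144, 0.999)` of roughly 14.4, which is the fast-burn threshold.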
@@ -2389,264 +2214,13 @@ RSpec.describe E11y::SLO::ConfigValidator do
   end
 end
 
-# spec/lib/e11y/slo/error_budget_spec.rb
-RSpec.describe E11y::SLO::ErrorBudget do
-  let(:slo_config) do
-    {
-      'availability' => { 'target' => 0.999 },
-      'window' => '30d'
-    }
-  end
-
-  let(:budget) do
-    described_class.new('OrdersController', 'create', slo_config)
-  end
-
-  before do
-    # Mock Prometheus query
-    allow(E11y::Metrics).to receive(:query_prometheus).and_return(
-      { 'data' => { 'result' => [{ 'value' => [Time.now.to_i, error_rate.to_s] }] } }
-    )
-  end
-
-  describe '#total' do
-    it 'calculates total error budget' do
-      expect(budget.total).to eq(0.001) # 1 - 0.999
-    end
-  end
-
-  describe '#consumed' do
-    let(:error_rate) { 0.0005 } # 0.05% error rate
-
-    it 'calculates consumed error budget' do
-      expect(budget.consumed).to eq(0.0005)
-    end
-  end
-
-  describe '#remaining' do
-    let(:error_rate) { 0.0005 }
-
-    it 'calculates remaining error budget' do
-      expect(budget.remaining).to eq(0.0005) # 0.001 - 0.0005
-    end
-
-    context 'when consumed exceeds total' do
-      let(:error_rate) { 0.002 } # 0.2% > 0.1%
-
-      it 'never goes negative' do
-        expect(budget.remaining).to eq(0.0)
-      end
-    end
-  end
-
-  describe '#exhausted?' do
-    context 'when budget remaining' do
-      let(:error_rate) { 0.0005 }
-
-      it 'returns false' do
-        expect(budget).not_to be_exhausted
-      end
-    end
-
-    context 'when budget exhausted' do
-      let(:error_rate) { 0.002 } # Exceeds 0.001
-
-      it 'returns true' do
-        expect(budget).to be_exhausted
-      end
-    end
-  end
-
-  describe '#can_deploy?' do
-    context 'with sufficient budget' do
-      let(:error_rate) { 0.0002 } # 20% consumed, 80% remaining
-
-      it 'allows deployment' do
-        expect(budget.can_deploy?(20)).to be true
-      end
-    end
-
-    context 'with insufficient budget' do
-      let(:error_rate) { 0.0009 } # 90% consumed, 10% remaining
-
-      it 'blocks deployment' do
-        expect(budget.can_deploy?(20)).to be false
-      end
-    end
-  end
-end
-```
-
----
-
-## 7. Error Budget Management
-
-### 7.1. Error Budget Calculation (Per-Endpoint)
-
-```ruby
-# lib/e11y/slo/error_budget.rb
-module E11y
-  module SLO
-    class ErrorBudget
-      def initialize(controller, action, slo_config)
-        @controller = controller
-        @action = action
-        @slo_config = slo_config
-        @target = slo_config['availability_target'] || 0.999
-        @window = parse_window(slo_config['window'] || '30d')
-      end
-
-      # Total error budget (e.g., 0.001 for 99.9%)
-      def total
-        1.0 - @target
-      end
-
-      # Consumed error budget in current window
-      def consumed
-        error_rate = calculate_error_rate(@window)
-        [error_rate, total].min # Cap at total budget
-      end
-
-      # Remaining error budget
-      def remaining
-        [total - consumed, 0.0].max # Never negative
-      end
-
-      # Percentage of error budget consumed
-      def percent_consumed
-        return 0.0 if total.zero?
-        (consumed / total) * 100
-      end
-
-      # Is error budget exhausted?
-      def exhausted?
-        remaining <= 0
-      end
-
-      # Time until error budget exhaustion (at current burn rate)
-      def time_until_exhaustion
-        burn_rate_per_hour = calculate_burn_rate(1.hour)
-        return Float::INFINITY if burn_rate_per_hour <= 0
-
-        hours_remaining = remaining / burn_rate_per_hour
-        hours_remaining.hours
-      end
-
-      # Can we deploy? (have enough error budget?)
-      def can_deploy?(minimum_budget_percent = 20)
-        percent_remaining = (remaining / total) * 100
-        percent_remaining >= minimum_budget_percent
-      end
-
-      private
-
-      def calculate_error_rate(window)
-        # Query Prometheus for actual error rate
-        query = <<~PROMQL
-          sum(rate(http_requests_total{
-            controller="#{@controller}",
-            action="#{@action}",
-            status=~"5.."
-          }[#{window}]))
-          /
-          sum(rate(http_requests_total{
-            controller="#{@controller}",
-            action="#{@action}"
-          }[#{window}]))
-        PROMQL
-
-        result = E11y::Metrics.query_prometheus(query)
-        result.dig('data', 'result', 0, 'value', 1).to_f
-      end
-
-      def calculate_burn_rate(window)
-        error_rate = calculate_error_rate(window)
-        error_budget_per_hour = total / (@window.to_f / 1.hour)
-
-        error_rate / error_budget_per_hour
-      end
-
-      def parse_window(window)
-        case window
-        when /(\d+)d/
-          $1.to_i.days
-        when /(\d+)h/
-          $1.to_i.hours
-        when /(\d+)m/
-          $1.to_i.minutes
-        else
-          30.days # Default
-        end
-      end
-    end
-  end
-end
-```
-
-### 7.2. Deployment Gate (Optional)
-
-```ruby
-# lib/e11y/slo/deployment_gate.rb
-module E11y
-  module SLO
-    class DeploymentGate
-      def self.check!(minimum_budget_percent: 20)
-        config = E11y::SLO::ConfigLoader.load!
-
-        critical_endpoints = config.endpoints.select do |ep|
-          ep.dig('slo', 'availability_target').to_f >= 0.999
-        end
-
-        violations = []
-
-        critical_endpoints.each do |endpoint|
-          controller = endpoint['controller']
-          action = endpoint['action']
-          slo_config = endpoint['slo']
-
-          budget = ErrorBudget.new(controller, action, slo_config)
-
-          unless budget.can_deploy?(minimum_budget_percent)
-            violations << {
-              endpoint: "#{controller}##{action}",
-              budget_remaining: budget.percent_remaining,
-              budget_consumed: budget.percent_consumed
-            }
-          end
-        end
-
-        if violations.any?
-          raise DeploymentBlockedError.new(violations)
-        end
-
-        true
-      end
-    end
-
-    class DeploymentBlockedError < StandardError
-      attr_reader :violations
-
-      def initialize(violations)
-        @violations = violations
-
-        message = "❌ Deployment blocked: Insufficient error budget\n\n"
-        violations.each do |v|
-          message << " - #{v[:endpoint]}: #{v[:budget_remaining].round(1)}% remaining (need 20%+)\n"
-        end
-        message << "\nWait for error budget to recover before deploying."
-
-        super(message)
-      end
-    end
-  end
-end
 ```
 
 ---
 
-##
+## 7. Dashboard & Reporting
 
-###
+### 7.1. Per-Endpoint Grafana Dashboard
 
 ```json
 {
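The error-budget arithmetic in the removed `ErrorBudget` class can be exercised without Prometheus. The sketch below is a minimal stand-in that takes the observed error ratio as a constructor argument; the class name `ErrorBudgetSketch` and its keyword arguments are assumptions made for illustration, not the gem's API. It also defines `percent_remaining`, which the removed `DeploymentGate` calls on the budget object although the removed `ErrorBudget` only computes that value as a local variable inside `can_deploy?`.

```ruby
# Illustrative only: class name and constructor are assumptions, not e11y API.
class ErrorBudgetSketch
  def initialize(target:, error_rate:)
    @target = target         # e.g. 0.999 for a 99.9% availability SLO
    @error_rate = error_rate # observed error ratio over the SLO window
  end

  # Total error budget as a ratio (about 0.001 for a 99.9% target).
  def total
    1.0 - @target
  end

  # Consumed budget, capped at the total.
  def consumed
    [@error_rate, total].min
  end

  # Remaining budget, never negative.
  def remaining
    [total - consumed, 0.0].max
  end

  def percent_consumed
    return 0.0 if total.zero?

    consumed / total * 100
  end

  def percent_remaining
    return 0.0 if total.zero?

    remaining / total * 100
  end

  def exhausted?
    remaining <= 0
  end

  # Deployment is allowed only while enough of the budget is left.
  def can_deploy?(minimum_budget_percent = 20)
    percent_remaining >= minimum_budget_percent
  end
end

healthy = ErrorBudgetSketch.new(target: 0.999, error_rate: 0.0002)
burned  = ErrorBudgetSketch.new(target: 0.999, error_rate: 0.0009)
puts healthy.can_deploy? # about 80% of the budget is left
puts burned.can_deploy?  # about 10% of the budget is left
```

Because `1.0 - 0.999` is not exactly `0.001` in binary floating point, assertions on these values are safer with a tolerance (for example RSpec's `be_within`) than with the exact `eq` comparisons used in the removed spec.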
@@ -2747,9 +2321,9 @@ end
 
 ---
 
-##
+## 8. Production Best Practices & Edge Cases
 
-###
+### 8.1. Rollout Strategy
 
 **Phase 1: Observability Only (1-2 weeks)**
 ```yaml
@@ -2832,7 +2406,7 @@ advanced:
 override_label: "deploy:emergency"
 ```
 
-###
+### 8.2. Edge Cases & Solutions
 
 **Edge Case 1: Routes Not Loaded During Validation**
 ```ruby
@@ -2993,7 +2567,7 @@ def calculate_error_rate(window)
 end
 ```
 
-###
+### 8.3. Monitoring the SLO System Itself
 
 **Self-Monitoring Metrics:**
 ```ruby
@@ -3032,7 +2606,7 @@ rate(e11y_slo_prometheus_query_errors_total[5m]) > 0.01
 
 ---
 
-##
+## 9. Trade-offs
 
 ### 9.1. Key Decisions
 
@@ -3064,9 +2638,9 @@ rate(e11y_slo_prometheus_query_errors_total[5m]) > 0.01
 
 ---
 
-##
+## 10. Real-World Configuration Examples
 
-###
+### 10.1. E-Commerce Platform
 
 ```yaml
 # config/slo.yml - E-commerce example
@@ -3158,7 +2732,7 @@ services:
 p99_target: 60000 # 60s
 ```
 
-###
+### 10.2. SaaS API Platform
 
 ```yaml
 # config/slo.yml - API platform example
@@ -3222,7 +2796,7 @@ services:
 p99_target: 10000 # 10s (external API)
 ```
 
-###
+### 10.3. Internal Admin Tool
 
 ```yaml
 # config/slo.yml - Admin tool example
@@ -3273,9 +2847,9 @@ advanced:
 
 ---
 
-##
+## 11. Summary & Next Steps
 
-###
+### 11.1. What We Achieved
 
 ✅ **Multi-level SLO strategy**: App-wide, service-level, per-endpoint
 ✅ **5-minute alert detection**: Multi-window burn rate (Google SRE 2026)
@@ -3283,12 +2857,13 @@ advanced:
 ✅ **Flexible latency SLO**: Optional per endpoint
 ✅ **Throughput SLO**: Min/max requests per second
 ✅ **Config validation & linting**: Prevents drift from reality
-✅ **Full implementation**: ConfigLoader, Validator
+✅ **Full implementation**: ConfigLoader, Validator with edge cases
+✅ **PromQL & alerts**: See [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md)
 ✅ **RSpec testing**: Comprehensive test coverage
 ✅ **Production best practices**: Rollout strategy, edge case handling, self-monitoring
 ✅ **Real-world examples**: E-commerce, SaaS API, Admin tool configurations
 
-###
+### 11.2. Implementation Checklist
 
 **Phase 1: Core (Week 1-2)**
 - [x] Implement `E11y::SLO::ConfigLoader` with ERB support
@@ -3298,19 +2873,7 @@ advanced:
 - [ ] Add per-endpoint metrics to `E11y::Rack::Middleware`
 - [ ] Implement `E11y::SLO::MetricsEmitter`
 
-**Phase 2:
-- [ ] Implement `E11y::SLO::BurnRateCalculator`
-- [ ] Generate Prometheus alert rules from `slo.yml`
-- [ ] Implement multi-window burn rate alerts
-- [ ] Add Prometheus query error handling
-
-**Phase 3: Error Budget (Week 5-6)**
-- [x] Implement `E11y::SLO::ErrorBudget`
-- [ ] Implement `E11y::SLO::DeploymentGate`
-- [ ] Add error budget tracking middleware
-- [ ] Create Grafana dashboard templates
-
-**Phase 4: Production Readiness (Week 7-8)**
+**Phase 2: Production Readiness (Week 3-4)**
 - [ ] Add maintenance window support
 - [ ] Implement grace period after deployment
 - [ ] Add self-monitoring metrics
@@ -3318,11 +2881,9 @@ advanced:
 - [ ] Document SLO config guide
 - [ ] Add rollout playbook
 
-**Phase
+**Phase 3: RSpec Tests**
 - [x] ConfigLoader specs (edge cases: missing file, invalid YAML, ERB)
 - [x] ConfigValidator specs (invalid targets, missing routes, conflicts)
-- [x] ErrorBudget specs (calculations, exhaustion, deployment gate)
-- [ ] BurnRateCalculator specs (multi-window, new endpoints)
 - [ ] Integration specs (end-to-end SLO tracking)
 
 ---
@@ -3333,5 +2894,5 @@ advanced:
 **Impact:**
 - Per-endpoint SLO visibility (100% coverage)
 - 5-minute incident detection (vs. 30-minute baseline)
--
+- PromQL-based alerts (see SLO-PROMQL-ALERTS.md)
 - Zero-config for simple apps, full control for complex apps