e11y 0.2.0 → 1.0.0

Files changed (230)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +130 -10
  3. data/CHANGELOG.md +56 -1
  4. data/CLAUDE.md +168 -0
  5. data/CONTRIBUTING.md +640 -0
  6. data/README.md +134 -702
  7. data/RELEASE.md +18 -3
  8. data/Rakefile +108 -29
  9. data/config/README.md +1 -1
  10. data/config/loki-local-config.yaml +12 -0
  11. data/config/otel-collector-config.yaml +44 -0
  12. data/cucumber.yml +1 -0
  13. data/docker-compose.yml +18 -2
  14. data/docs/ADAPTERS.md +76 -0
  15. data/docs/ADAPTIVE_SAMPLING.md +59 -0
  16. data/docs/COMPARISON.md +104 -0
  17. data/docs/CONFIGURATION.md +52 -0
  18. data/docs/DISTRIBUTED_TRACING.md +44 -0
  19. data/docs/LIMITATIONS.md +13 -0
  20. data/docs/METRICS_DSL.md +84 -0
  21. data/docs/PERFORMANCE.md +60 -0
  22. data/docs/PII_FILTERING.md +40 -0
  23. data/docs/PRESETS.md +65 -0
  24. data/docs/QUICK-START.md +546 -587
  25. data/docs/RAILS_INTEGRATION.md +29 -0
  26. data/docs/SCHEMA_VALIDATION.md +63 -0
  27. data/docs/SLO-PROMQL-ALERTS.md +161 -0
  28. data/docs/TESTING.md +69 -0
  29. data/docs/{ADR-001-architecture.md → architecture/ADR-001-architecture.md} +35 -64
  30. data/docs/{ADR-002-metrics-yabeda.md → architecture/ADR-002-metrics-yabeda.md} +62 -236
  31. data/docs/{ADR-003-slo-observability.md → architecture/ADR-003-slo-observability.md} +27 -466
  32. data/docs/{ADR-004-adapter-architecture.md → architecture/ADR-004-adapter-architecture.md} +163 -146
  33. data/docs/{ADR-005-tracing-context.md → architecture/ADR-005-tracing-context.md} +10 -9
  34. data/docs/{ADR-006-security-compliance.md → architecture/ADR-006-security-compliance.md} +184 -191
  35. data/docs/{ADR-007-opentelemetry-integration.md → architecture/ADR-007-opentelemetry-integration.md} +3 -21
  36. data/docs/{ADR-008-rails-integration.md → architecture/ADR-008-rails-integration.md} +209 -339
  37. data/docs/{ADR-009-cost-optimization.md → architecture/ADR-009-cost-optimization.md} +45 -54
  38. data/docs/architecture/ADR-010-developer-experience.md +522 -0
  39. data/docs/{ADR-011-testing-strategy.md → architecture/ADR-011-testing-strategy.md} +41 -83
  40. data/docs/{ADR-013-reliability-error-handling.md → architecture/ADR-013-reliability-error-handling.md} +37 -12
  41. data/docs/{ADR-014-event-driven-slo.md → architecture/ADR-014-event-driven-slo.md} +12 -24
  42. data/docs/{ADR-015-middleware-order.md → architecture/ADR-015-middleware-order.md} +23 -41
  43. data/docs/{ADR-016-self-monitoring-slo.md → architecture/ADR-016-self-monitoring-slo.md} +52 -349
  44. data/docs/{ADR-017-multi-rails-compatibility.md → architecture/ADR-017-multi-rails-compatibility.md} +4 -11
  45. data/docs/architecture/ADR-018-memory-optimization.md +366 -0
  46. data/docs/{ADR-INDEX.md → architecture/ADR-INDEX.md} +11 -6
  47. data/docs/{00-ICP-AND-TIMELINE.md → prd/00-ICP-AND-TIMELINE.md} +6 -6
  48. data/docs/{01-SCALE-REQUIREMENTS.md → prd/01-SCALE-REQUIREMENTS.md} +6 -6
  49. data/docs/prd/01-overview-vision.md +19 -14
  50. data/docs/use_cases/README.md +22 -23
  51. data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +50 -44
  52. data/docs/use_cases/UC-002-business-event-tracking.md +26 -95
  53. data/docs/use_cases/UC-003-event-metrics.md +66 -0
  54. data/docs/use_cases/UC-004-zero-config-slo-tracking.md +42 -101
  55. data/docs/use_cases/UC-005-sentry-integration.md +13 -15
  56. data/docs/use_cases/UC-006-trace-context-management.md +30 -28
  57. data/docs/use_cases/UC-007-pii-filtering.md +35 -87
  58. data/docs/use_cases/UC-008-opentelemetry-integration.md +51 -89
  59. data/docs/use_cases/UC-009-multi-service-tracing.md +4 -4
  60. data/docs/use_cases/UC-010-background-job-tracking.md +5 -5
  61. data/docs/use_cases/UC-011-rate-limiting.md +95 -168
  62. data/docs/use_cases/UC-012-audit-trail.md +21 -46
  63. data/docs/use_cases/UC-013-high-cardinality-protection.md +29 -167
  64. data/docs/use_cases/UC-014-adaptive-sampling.md +2 -2
  65. data/docs/use_cases/UC-015-cost-optimization.md +46 -99
  66. data/docs/use_cases/UC-016-rails-logger-migration.md +39 -213
  67. data/docs/use_cases/UC-017-local-development.md +203 -777
  68. data/docs/use_cases/UC-018-testing-events.md +3 -3
  69. data/docs/use_cases/UC-019-retention-based-routing.md +53 -106
  70. data/docs/use_cases/UC-020-event-versioning.md +8 -9
  71. data/docs/use_cases/UC-021-error-handling-retry-dlq.md +18 -22
  72. data/docs/use_cases/UC-022-event-registry.md +15 -21
  73. data/docs/use_cases/backlog.md +119 -87
  74. data/e11y.gemspec +2 -2
  75. data/gems/e11y-devtools/README.md +136 -0
  76. data/gems/e11y-devtools/config/routes.rb +8 -0
  77. data/gems/e11y-devtools/e11y-devtools.gemspec +25 -0
  78. data/gems/e11y-devtools/exe/e11y +34 -0
  79. data/gems/e11y-devtools/lib/e11y/devtools/mcp/server.rb +96 -0
  80. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tool_base.rb +25 -0
  81. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/clear.rb +31 -0
  82. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/errors.rb +35 -0
  83. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/event_detail.rb +33 -0
  84. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/events_by_trace.rb +33 -0
  85. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/interactions.rb +40 -0
  86. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/recent_events.rb +34 -0
  87. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/search.rb +34 -0
  88. data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/stats.rb +30 -0
  89. data/gems/e11y-devtools/lib/e11y/devtools/overlay/assets/overlay.js +115 -0
  90. data/gems/e11y-devtools/lib/e11y/devtools/overlay/controller.rb +54 -0
  91. data/gems/e11y-devtools/lib/e11y/devtools/overlay/engine.rb +26 -0
  92. data/gems/e11y-devtools/lib/e11y/devtools/overlay/middleware.rb +80 -0
  93. data/gems/e11y-devtools/lib/e11y/devtools/overlay/rails_controller.rb +42 -0
  94. data/gems/e11y-devtools/lib/e11y/devtools/tui/app.rb +262 -0
  95. data/gems/e11y-devtools/lib/e11y/devtools/tui/grouping.rb +66 -0
  96. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_detail.rb +62 -0
  97. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_list.rb +70 -0
  98. data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/interaction_list.rb +47 -0
  99. data/gems/e11y-devtools/lib/e11y/devtools/version.rb +8 -0
  100. data/gems/e11y-devtools/lib/e11y/devtools.rb +13 -0
  101. data/gems/e11y-devtools/spec/e11y/devtools/mcp/tools_spec.rb +107 -0
  102. data/gems/e11y-devtools/spec/e11y/devtools/overlay/controller_spec.rb +58 -0
  103. data/gems/e11y-devtools/spec/e11y/devtools/overlay/middleware_spec.rb +46 -0
  104. data/gems/e11y-devtools/spec/e11y/devtools/tui/app_spec.rb +85 -0
  105. data/gems/e11y-devtools/spec/e11y/devtools/tui/grouping_spec.rb +64 -0
  106. data/gems/e11y-devtools/spec/spec_helper.rb +5 -0
  107. data/gems/e11y-devtools/spec/tui/widgets/event_list_spec.rb +44 -0
  108. data/gems/e11y-devtools/spec/tui/widgets/interaction_list_spec.rb +62 -0
  109. data/lib/e11y/adapters/audit_encrypted.rb +53 -11
  110. data/lib/e11y/adapters/base.rb +33 -34
  111. data/lib/e11y/adapters/dev_log/file_store.rb +143 -0
  112. data/lib/e11y/adapters/dev_log/query.rb +219 -0
  113. data/lib/e11y/adapters/dev_log.rb +118 -0
  114. data/lib/e11y/adapters/file.rb +3 -6
  115. data/lib/e11y/adapters/in_memory.rb +52 -5
  116. data/lib/e11y/adapters/in_memory_test.rb +29 -0
  117. data/lib/e11y/adapters/loki.rb +58 -23
  118. data/lib/e11y/adapters/null.rb +82 -0
  119. data/lib/e11y/adapters/opentelemetry_collector.rb +183 -0
  120. data/lib/e11y/adapters/otel_logs.rb +136 -23
  121. data/lib/e11y/adapters/sentry.rb +4 -7
  122. data/lib/e11y/adapters/stdout.rb +73 -7
  123. data/lib/e11y/adapters/yabeda.rb +153 -29
  124. data/lib/e11y/buffers/adaptive_buffer.rb +3 -17
  125. data/lib/e11y/buffers/{request_scoped_buffer.rb → ephemeral_buffer.rb} +72 -58
  126. data/lib/e11y/buffers/ring_buffer.rb +3 -16
  127. data/lib/e11y/configuration.rb +272 -0
  128. data/lib/e11y/console.rb +10 -17
  129. data/lib/e11y/current.rb +53 -1
  130. data/lib/e11y/debug/pipeline_inspector.rb +96 -0
  131. data/lib/e11y/documentation/generator.rb +48 -0
  132. data/lib/e11y/event/base.rb +176 -82
  133. data/lib/e11y/event/value_sampling_config.rb +1 -5
  134. data/lib/e11y/events/rails/database/query.rb +1 -4
  135. data/lib/e11y/events/rails/job/failed.rb +2 -0
  136. data/lib/e11y/instruments/active_job.rb +46 -12
  137. data/lib/e11y/instruments/rails_instrumentation.rb +49 -24
  138. data/lib/e11y/instruments/sidekiq.rb +137 -31
  139. data/lib/e11y/linters/base.rb +11 -0
  140. data/lib/e11y/linters/pii/pii_declaration_linter.rb +120 -0
  141. data/lib/e11y/linters/slo/config_consistency_linter.rb +76 -0
  142. data/lib/e11y/linters/slo/explicit_declaration_linter.rb +36 -0
  143. data/lib/e11y/linters/slo/slo_status_from_linter.rb +41 -0
  144. data/lib/e11y/logger/bridge.rb +26 -7
  145. data/lib/e11y/metrics/cardinality_protection.rb +10 -15
  146. data/lib/e11y/metrics/cardinality_tracker.rb +16 -6
  147. data/lib/e11y/metrics/registry.rb +3 -5
  148. data/lib/e11y/metrics/test_backend.rb +62 -0
  149. data/lib/e11y/metrics.rb +56 -10
  150. data/lib/e11y/middleware/adapter_resolver.rb +40 -0
  151. data/lib/e11y/middleware/audit_signing.rb +43 -6
  152. data/lib/e11y/middleware/baggage_protection.rb +75 -0
  153. data/lib/e11y/middleware/dev_log_source.rb +24 -0
  154. data/lib/e11y/middleware/event_slo.rb +23 -9
  155. data/lib/e11y/middleware/otel_span.rb +23 -0
  156. data/lib/e11y/middleware/pii_filter.rb +104 -75
  157. data/lib/e11y/middleware/rate_limiting.rb +54 -27
  158. data/lib/e11y/middleware/request.rb +70 -23
  159. data/lib/e11y/middleware/routing.rb +78 -21
  160. data/lib/e11y/middleware/sampling.rb +66 -17
  161. data/lib/e11y/middleware/self_monitoring_emit.rb +39 -0
  162. data/lib/e11y/middleware/trace_context.rb +45 -10
  163. data/lib/e11y/middleware/track_latency.rb +34 -0
  164. data/lib/e11y/middleware/validation.rb +7 -16
  165. data/lib/e11y/middleware/versioning.rb +26 -22
  166. data/lib/e11y/opentelemetry/semantic_conventions.rb +109 -0
  167. data/lib/e11y/opentelemetry/span_creator.rb +142 -0
  168. data/lib/e11y/pii/patterns.rb +12 -1
  169. data/lib/e11y/pipeline/builder.rb +1 -1
  170. data/lib/e11y/presets/audit_event.rb +13 -2
  171. data/lib/e11y/railtie.rb +52 -15
  172. data/lib/e11y/registry.rb +306 -0
  173. data/lib/e11y/reliability/circuit_breaker.rb +19 -21
  174. data/lib/e11y/reliability/dlq/base.rb +71 -0
  175. data/lib/e11y/reliability/dlq/file_adapter.rb +301 -0
  176. data/lib/e11y/reliability/dlq/file_storage.rb +63 -34
  177. data/lib/e11y/reliability/dlq/filter.rb +37 -54
  178. data/lib/e11y/reliability/retry_handler.rb +26 -29
  179. data/lib/e11y/reliability/retry_rate_limiter.rb +3 -11
  180. data/lib/e11y/sampling/error_spike_detector.rb +0 -2
  181. data/lib/e11y/sampling/load_monitor.rb +5 -9
  182. data/lib/e11y/sampling/stratified_tracker.rb +18 -0
  183. data/lib/e11y/self_monitoring/buffer_monitor.rb +2 -0
  184. data/lib/e11y/self_monitoring/performance_monitor.rb +19 -61
  185. data/lib/e11y/self_monitoring/reliability_monitor.rb +4 -74
  186. data/lib/e11y/slo/config_loader.rb +40 -0
  187. data/lib/e11y/slo/config_validator.rb +58 -0
  188. data/lib/e11y/slo/dashboard_generator.rb +122 -0
  189. data/lib/e11y/slo/event_driven.rb +8 -0
  190. data/lib/e11y/slo/tracker.rb +31 -4
  191. data/lib/e11y/testing/have_tracked_event_matcher.rb +190 -0
  192. data/lib/e11y/testing/rspec_matchers.rb +21 -0
  193. data/lib/e11y/testing/snapshot_matcher.rb +86 -0
  194. data/lib/e11y/trace_context/sampler.rb +35 -0
  195. data/lib/e11y/tracing/faraday_middleware.rb +31 -0
  196. data/lib/e11y/tracing/net_http_patch.rb +33 -0
  197. data/lib/e11y/tracing/propagator.rb +116 -0
  198. data/lib/e11y/tracing.rb +47 -0
  199. data/lib/e11y/version.rb +1 -1
  200. data/lib/e11y/versioning/version_extractor.rb +32 -0
  201. data/lib/e11y.rb +141 -265
  202. data/lib/generators/e11y/event/event_generator.rb +22 -0
  203. data/lib/generators/e11y/event/templates/event.rb.tt +16 -0
  204. data/lib/generators/e11y/grafana_dashboard/grafana_dashboard_generator.rb +30 -0
  205. data/lib/generators/e11y/grafana_dashboard/templates/e11y_dashboard.json +81 -0
  206. data/lib/generators/e11y/install/install_generator.rb +34 -0
  207. data/lib/generators/e11y/install/templates/e11y.rb +239 -0
  208. data/lib/generators/e11y/prometheus_alerts/prometheus_alerts_generator.rb +29 -0
  209. data/lib/generators/e11y/prometheus_alerts/templates/e11y_alerts.yml +28 -0
  210. data/lib/tasks/e11y_docs.rake +30 -0
  211. data/lib/tasks/e11y_events.rake +71 -0
  212. data/lib/tasks/e11y_lint.rake +91 -0
  213. data/lib/tasks/e11y_slo.rake +29 -0
  214. metadata +129 -39
  215. data/docs/ADR-010-developer-experience.md +0 -2166
  216. data/docs/API-REFERENCE-L28.md +0 -914
  217. data/docs/COMPREHENSIVE-CONFIGURATION.md +0 -2366
  218. data/docs/CONTRIBUTING.md +0 -312
  219. data/docs/IMPLEMENTATION_NOTES.md +0 -2804
  220. data/docs/IMPLEMENTATION_PLAN.md +0 -1971
  221. data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +0 -586
  222. data/docs/PLAN.md +0 -148
  223. data/docs/README.md +0 -296
  224. data/docs/design/00-memory-optimization.md +0 -593
  225. data/docs/guides/MIGRATION-L27-L28.md +0 -692
  226. data/docs/guides/PERFORMANCE-BENCHMARKS.md +0 -434
  227. data/docs/guides/README.md +0 -44
  228. data/docs/use_cases/UC-003-pattern-based-metrics.md +0 -1627
  229. data/lib/e11y/adapters/registry.rb +0 -141
  230. data/docs/{ADR-012-event-evolution.md → architecture/ADR-012-event-evolution.md} +0 -0
@@ -5,6 +5,8 @@
5
5
  **Covers:** Internal observability and reliability of E11y gem itself
6
6
  **Depends On:** ADR-001 (Core), ADR-002 (Metrics), ADR-003 (SLO)
7
7
 
8
+ **Implementation (2026-03):** `track_latency` implemented via `E11y::Middleware::TrackLatency` (first in pipeline).
9
+
8
10
  ---
9
11
 
10
12
  ## 📋 Table of Contents
@@ -163,9 +165,7 @@ graph TB
163
165
  end
164
166
 
165
167
  subgraph "SLO Tracking"
166
- Prometheus --> SLOCalc[E11y SLO Calculator]
167
- SLOCalc --> ErrorBudget[E11y Error Budget]
168
- ErrorBudget --> Alerts[Alertmanager]
168
+ Prometheus --> Alerts[Alertmanager]
169
169
  end
170
170
 
171
171
  style SelfMetrics fill:#fff3cd
@@ -484,205 +484,7 @@ end
484
484
 
485
485
  ## 4. Internal SLO Tracking
486
486
 
487
- ### 4.1. E11y SLO Definition
488
-
489
- **E11y has its own SLO** (separate from application SLO):
490
-
491
- ```yaml
492
- # config/e11y_slo.yml
493
- #
494
- # E11y Gem Internal SLO
495
- #
496
- # This defines reliability targets for E11y itself.
497
- # If E11y violates its SLO → alert SRE immediately!
498
-
499
- version: 1
500
-
501
- e11y_slo:
502
- # === LATENCY SLO ===
503
- # E11y.track() must be fast (<1ms p99)
504
- latency:
505
- enabled: true
506
- p99_target: 0.001 # 1ms
507
- p95_target: 0.0005 # 0.5ms
508
- p50_target: 0.0001 # 0.1ms
509
- window: 30d
510
-
511
- # Multi-window burn rate alerts
512
- burn_rate_alerts:
513
- fast:
514
- enabled: true
515
- window: 1h
516
- threshold: 14.4
517
- alert_after: 5m
518
- severity: critical
519
- medium:
520
- enabled: true
521
- window: 6h
522
- threshold: 6.0
523
- alert_after: 30m
524
- severity: warning
525
-
526
- # === RELIABILITY SLO ===
527
- # E11y must deliver 99.9% of events
528
- reliability:
529
- enabled: true
530
- success_rate_target: 0.999 # 99.9%
531
- window: 30d
532
-
533
- # What counts as "success"?
534
- success_criteria:
535
- - event_tracked: true
536
- - not_dropped: true
537
- - adapter_delivered: true # At least 1 adapter succeeded
538
-
539
- # What counts as "failure"?
540
- failure_criteria:
541
- - validation_failed: true
542
- - all_adapters_failed: true
543
- - buffer_overflow: true
544
-
545
- burn_rate_alerts:
546
- fast:
547
- enabled: true
548
- window: 1h
549
- threshold: 14.4
550
- alert_after: 5m
551
-
552
- # === RESOURCE SLO ===
553
- # E11y must use <2% CPU, <100MB memory
554
- resources:
555
- enabled: true
556
-
557
- cpu_percent_target: 2.0 # <2% CPU
558
- memory_mb_target: 100 # <100MB
559
-
560
- buffer_utilization_target: 80 # <80% full
561
-
562
- alerts:
563
- cpu_high:
564
- threshold: 5.0 # Alert if >5% CPU
565
- duration: 5m
566
- memory_high:
567
- threshold: 200 # Alert if >200MB
568
- duration: 5m
569
- buffer_high:
570
- threshold: 90 # Alert if >90% full
571
- duration: 1m
572
-
573
- # === ERROR BUDGET ===
574
- error_budget:
575
- enabled: true
576
-
577
- # Latency budget: 0.1% of requests can be >1ms
578
- latency_budget: 0.001
579
-
580
- # Reliability budget: 0.1% of events can be dropped
581
- reliability_budget: 0.001
582
-
583
- # Alert thresholds
584
- alert_at_percent_consumed: [50, 80, 90, 100]
585
- ```
586
-
587
- ### 4.2. SLO Calculator
588
-
589
- ```ruby
590
- # lib/e11y/self_monitoring/slo_calculator.rb
591
- module E11y
592
- module SelfMonitoring
593
- class SLOCalculator
594
- def self.calculate_latency_slo(window: 30.days)
595
- # Query Prometheus for E11y latency p99
596
- query = <<~PROMQL
597
- histogram_quantile(0.99,
598
- sum(rate(e11y_track_duration_seconds_bucket[#{window}])) by (le)
599
- )
600
- PROMQL
601
-
602
- p99_latency = E11y::Metrics.query_prometheus(query)
603
- target = 0.001 # 1ms
604
-
605
- {
606
- current_p99: p99_latency,
607
- target_p99: target,
608
- slo_met: p99_latency <= target,
609
- error_budget_consumed: calculate_latency_budget_consumed(p99_latency, target, window)
610
- }
611
- end
612
-
613
- def self.calculate_reliability_slo(window: 30.days)
614
- # Query Prometheus for E11y success rate
615
- query = <<~PROMQL
616
- sum(rate(e11y_events_tracked_total{result="success"}[#{window}]))
617
- /
618
- sum(rate(e11y_events_tracked_total[#{window}]))
619
- PROMQL
620
-
621
- success_rate = E11y::Metrics.query_prometheus(query)
622
- target = 0.999 # 99.9%
623
-
624
- {
625
- current_success_rate: success_rate,
626
- target_success_rate: target,
627
- slo_met: success_rate >= target,
628
- error_budget_consumed: calculate_reliability_budget_consumed(success_rate, target)
629
- }
630
- end
631
-
632
- def self.calculate_resource_slo
633
- # Query Prometheus for E11y resource usage
634
- cpu_query = 'avg(rate(e11y_cpu_seconds_total[5m])) * 100'
635
- memory_query = 'e11y_memory_usage_mb'
636
- buffer_query = 'e11y_buffer_utilization_percent'
637
-
638
- cpu_percent = E11y::Metrics.query_prometheus(cpu_query)
639
- memory_mb = E11y::Metrics.query_prometheus(memory_query)
640
- buffer_percent = E11y::Metrics.query_prometheus(buffer_query)
641
-
642
- {
643
- cpu: {
644
- current: cpu_percent,
645
- target: 2.0,
646
- slo_met: cpu_percent <= 2.0
647
- },
648
- memory: {
649
- current: memory_mb,
650
- target: 100,
651
- slo_met: memory_mb <= 100
652
- },
653
- buffer: {
654
- current: buffer_percent,
655
- target: 80,
656
- slo_met: buffer_percent <= 80
657
- }
658
- }
659
- end
660
-
661
- private
662
-
663
- def self.calculate_latency_budget_consumed(current, target, window)
664
- # Simplified: % of requests exceeding target
665
- # In reality, use Prometheus query for exact calculation
666
- return 0.0 if current <= target
667
-
668
- excess = current - target
669
- budget = target * 0.001 # 0.1% budget
670
-
671
- (excess / budget * 100).round(2)
672
- end
673
-
674
- def self.calculate_reliability_budget_consumed(current, target)
675
- error_rate = 1.0 - current
676
- error_budget = 1.0 - target
677
-
678
- return 0.0 if error_rate <= error_budget
679
-
680
- (error_rate / error_budget * 100).round(2)
681
- end
682
- end
683
- end
684
- end
685
- ```
487
+ **PromQL and alerts:** See [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md) for E11y latency, delivery rate, and alert rules.
686
488
 
687
489
  ---
688
490
 
@@ -780,160 +582,61 @@ end
780
582
 
781
583
  ## 6. Health Checks
782
584
 
783
- ### 6.1. Health Check API
585
+ The gem runs in the host application's main process. E11y does not provide a separate health endpoint — the host application decides how and where to expose health (e.g. its own `/health` or `/ready`).
586
+
587
+ ### 6.1. API: E11y.health and E11y.healthy?
784
588
 
785
589
  ```ruby
786
- # lib/e11y/health_check.rb
787
- module E11y
788
- class HealthCheck
789
- def self.status
790
- {
791
- status: overall_status,
792
- timestamp: Time.now.iso8601,
793
- checks: {
794
- latency: check_latency,
795
- reliability: check_reliability,
796
- resources: check_resources,
797
- adapters: check_adapters,
798
- buffer: check_buffer
799
- },
800
- slo: {
801
- latency: SelfMonitoring::SLOCalculator.calculate_latency_slo,
802
- reliability: SelfMonitoring::SLOCalculator.calculate_reliability_slo,
803
- resources: SelfMonitoring::SLOCalculator.calculate_resource_slo
804
- }
805
- }
806
- end
807
-
808
- def self.healthy?
809
- status[:status] == :healthy
810
- end
811
-
812
- private
813
-
814
- def self.overall_status
815
- checks = [
816
- check_latency[:status],
817
- check_reliability[:status],
818
- check_resources[:status],
819
- check_adapters[:status],
820
- check_buffer[:status]
821
- ]
822
-
823
- return :unhealthy if checks.include?(:unhealthy)
824
- return :degraded if checks.include?(:degraded)
825
- :healthy
826
- end
827
-
828
- def self.check_latency
829
- slo = SelfMonitoring::SLOCalculator.calculate_latency_slo(window: 5.minutes)
830
-
831
- {
832
- status: slo[:slo_met] ? :healthy : :degraded,
833
- current_p99: slo[:current_p99],
834
- target_p99: slo[:target_p99],
835
- message: slo[:slo_met] ? 'Latency within SLO' : 'Latency exceeds SLO'
836
- }
837
- end
838
-
839
- def self.check_reliability
840
- slo = SelfMonitoring::SLOCalculator.calculate_reliability_slo(window: 5.minutes)
841
-
842
- {
843
- status: slo[:slo_met] ? :healthy : :unhealthy,
844
- current_success_rate: slo[:current_success_rate],
845
- target_success_rate: slo[:target_success_rate],
846
- message: slo[:slo_met] ? 'Reliability within SLO' : 'Reliability below SLO'
847
- }
848
- end
849
-
850
- def self.check_resources
851
- slo = SelfMonitoring::SLOCalculator.calculate_resource_slo
852
-
853
- cpu_ok = slo[:cpu][:slo_met]
854
- memory_ok = slo[:memory][:slo_met]
855
- buffer_ok = slo[:buffer][:slo_met]
856
-
857
- status = (cpu_ok && memory_ok && buffer_ok) ? :healthy : :degraded
858
-
859
- {
860
- status: status,
861
- cpu_percent: slo[:cpu][:current],
862
- memory_mb: slo[:memory][:current],
863
- buffer_percent: slo[:buffer][:current],
864
- message: status == :healthy ? 'Resources within limits' : 'Resource usage high'
865
- }
866
- end
867
-
868
- def self.check_adapters
869
- # Check circuit breaker states
870
- adapters = E11y.config.adapters.all
871
-
872
- failed_adapters = adapters.select do |name, adapter|
873
- adapter.circuit_breaker&.open?
874
- end
875
-
876
- {
877
- status: failed_adapters.empty? ? :healthy : :degraded,
878
- total_adapters: adapters.size,
879
- failed_adapters: failed_adapters.keys,
880
- message: failed_adapters.empty? ? 'All adapters healthy' : "#{failed_adapters.size} adapters failed"
881
- }
882
- end
883
-
884
- def self.check_buffer
885
- buffer = E11y::Buffer.instance
886
- utilization = (buffer.size.to_f / buffer.max_size * 100).round(2)
887
-
888
- status = case utilization
889
- when 0..80 then :healthy
890
- when 81..90 then :degraded
891
- else :unhealthy
892
- end
893
-
894
- {
895
- status: status,
896
- current_size: buffer.size,
897
- max_size: buffer.max_size,
898
- utilization_percent: utilization,
899
- message: "Buffer #{utilization}% full"
900
- }
901
- end
902
- end
903
- end
590
+ # Module methods on E11y
591
+ E11y.health # => Hash with details
592
+ E11y.healthy? # => true/false
904
593
  ```
905
594
 
906
- ### 6.2. Health Check Endpoint
595
+ **Structure of `E11y.health`:**
907
596
 
908
597
  ```ruby
909
- # config/routes.rb (for Web UI)
910
- E11y::WebUI::Engine.routes.draw do
911
- # ... existing routes ...
912
-
913
- get '/health', to: 'health#show'
914
- get '/health/detailed', to: 'health#detailed'
915
- end
598
+ {
599
+ status: :healthy | :degraded | :unhealthy,
600
+ timestamp: "2026-03-13T12:00:00Z", # ISO8601
601
+ adapters: {
602
+ loki: { healthy: true, circuit_breaker: :closed },
603
+ sentry: { healthy: false, circuit_breaker: :open }
604
+ }
605
+ }
606
+ ```
916
607
 
917
- # app/controllers/e11y/web_ui/health_controller.rb
918
- module E11y
919
- module WebUI
920
- class HealthController < ApplicationController
921
- def show
922
- status = E11y::HealthCheck.status
923
-
924
- render json: {
925
- status: status[:status],
926
- timestamp: status[:timestamp]
927
- }, status: status[:status] == :healthy ? 200 : 503
928
- end
929
-
930
- def detailed
931
- status = E11y::HealthCheck.status
932
-
933
- render json: status, status: status[:status] == :healthy ? 200 : 503
934
- end
935
- end
936
- end
608
+ - **`status`** — overall telemetry status (see below).
609
+ - **`adapters`** — per registered adapter:
610
+ - **`healthy`** — result of `adapter.healthy?` (backend reachability: Loki pings, Sentry validates DSN, etc.).
611
+ - **`circuit_breaker`** `:closed` | `:open` | `:half_open` (adapter circuit breaker state).
612
+
613
+ ### 6.2. What Does "Telemetry Healthy" Mean?
614
+
615
+ **Telemetry is healthy** = we can deliver events to at least one backend.
616
+
617
+ | status | Condition |
618
+ |-------------|-----------|
619
+ | **:healthy** | All adapters have `healthy? == true` and `circuit_breaker == :closed`. |
620
+ | **:degraded** | At least one adapter is unhealthy or circuit open, but at least one is healthy + closed. |
621
+ | **:unhealthy** | No adapter is healthy + closed — nowhere to deliver events. |
622
+
623
+ **`E11y.healthy?`** returns `true` only when `status == :healthy`.
624
+
625
+ ### 6.3. Integration into Host Application Health
626
+
627
+ ```ruby
628
+ # config/routes.rb
629
+ get "/health", to: "health#show"
630
+
631
+ # app/controllers/health_controller.rb
632
+ def show
633
+ e11y = E11y.health
634
+ overall = e11y[:status] == :healthy ? :ok : :degraded
635
+
636
+ render json: {
637
+ app: check_app,
638
+ e11y: e11y
639
+ }, status: overall == :ok ? 200 : 503
937
640
  end
938
641
  ```
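
The status rules in the added section 6.2 can be read as a small pure function over the `adapters` hash shown in the `E11y.health` structure. The following is a minimal sketch of that reading, not the gem's implementation: `overall_status` is a hypothetical helper name, and treating an empty adapter set as `:unhealthy` is an assumption derived from the "nowhere to deliver events" wording.

```ruby
# Hypothetical helper (not part of E11y): derives the overall status from a
# hash shaped like the `adapters` key of E11y.health.
def overall_status(adapters)
  # An adapter is usable only if it is healthy AND its circuit breaker is closed.
  usable, unusable = adapters.values.partition do |adapter|
    adapter[:healthy] && adapter[:circuit_breaker] == :closed
  end

  # No usable adapter (including no adapters at all, by assumption): unhealthy.
  return :unhealthy if usable.empty?

  # Every adapter usable: healthy. Otherwise at least one is usable: degraded.
  unusable.empty? ? :healthy : :degraded
end

# The sample from the structure above: Loki is usable, Sentry is not.
sample = {
  loki:   { healthy: true,  circuit_breaker: :closed },
  sentry: { healthy: false, circuit_breaker: :open }
}
overall_status(sample) # => :degraded
```

Under this reading a `:half_open` breaker counts as not usable, because the table requires `:closed` for an adapter to count toward `:healthy`.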
939
642
 
@@ -55,22 +55,15 @@ matrix:
55
55
  ### Negative
56
56
  - CI time: 5min → 15min (3× Rails versions)
57
57
  - Maintenance: 3 Rails versions to test
58
- - Conditional code (2 places only)
58
+ - Conditional code: 1 place (sqlite3 in Gemfile)
59
59
 
60
60
  ---
61
61
 
62
62
  ## Version-Specific Code
63
63
 
64
- **1. Exception handling (Rails 8.0 changed behavior):**
65
- ```ruby
66
- if Rails.version.to_f >= 8.0
67
- expect(response.status).to eq(500) # Rails 8.0: caught
68
- else
69
- expect { get '/error' }.to raise_error # Rails 7.x: raised
70
- end
71
- ```
64
+ **1. Exception handling** — not needed: `show_exceptions = :all` in dummy app gives identical behavior (exceptions → 500) on 7.x and 8.x. Tests without version branching.
72
65
 
73
- **2. sqlite3 dependency (Gemfile only)**
66
+ **2. sqlite3 dependency (Gemfile only)** — different versions for 7.x (~> 1.4) and 8.x (~> 2.0).
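
One way the single remaining version branch could look in a Gemfile is sketched below. This is an illustration under assumptions, not the gem's actual Gemfile: the helper name `sqlite3_requirement` and the `RAILS_VERSION` environment variable are hypothetical; only the two version constraints come from the line above.

```ruby
require "rubygems"

# Hypothetical helper: picks the sqlite3 constraint for a given Rails version.
# Rails 8.x uses "~> 2.0"; earlier versions (7.x) use "~> 1.4".
def sqlite3_requirement(rails_version)
  if Gem::Version.new(rails_version) >= Gem::Version.new("8.0")
    "~> 2.0"
  else
    "~> 1.4"
  end
end

# In a Gemfile this might be used as (RAILS_VERSION is an assumed CI variable):
#   gem "sqlite3", sqlite3_requirement(ENV.fetch("RAILS_VERSION", "8.0"))
```

Keeping the branch in one helper means the test suite itself needs no `Rails.version` checks.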
74
67
 
75
68
  ---
76
69
 
@@ -99,5 +92,5 @@ end
99
92
  |--------|--------|---------|
100
93
  | Test pass rate | 100% | 100% ✅ |
101
94
  | Code coverage | ≥95% | 96.46% ✅ |
102
- | Version checks | <10 | 2 ✅ |
95
+ | Version checks | <10 | 1 ✅ |
103
96
  | CI time | <20min | ~15min ✅ |