e11y 0.2.0 → 1.0.0
- checksums.yaml +4 -4
- data/.rubocop.yml +130 -10
- data/CHANGELOG.md +56 -1
- data/CLAUDE.md +168 -0
- data/CONTRIBUTING.md +640 -0
- data/README.md +134 -702
- data/RELEASE.md +18 -3
- data/Rakefile +108 -29
- data/config/README.md +1 -1
- data/config/loki-local-config.yaml +12 -0
- data/config/otel-collector-config.yaml +44 -0
- data/cucumber.yml +1 -0
- data/docker-compose.yml +18 -2
- data/docs/ADAPTERS.md +76 -0
- data/docs/ADAPTIVE_SAMPLING.md +59 -0
- data/docs/COMPARISON.md +104 -0
- data/docs/CONFIGURATION.md +52 -0
- data/docs/DISTRIBUTED_TRACING.md +44 -0
- data/docs/LIMITATIONS.md +13 -0
- data/docs/METRICS_DSL.md +84 -0
- data/docs/PERFORMANCE.md +60 -0
- data/docs/PII_FILTERING.md +40 -0
- data/docs/PRESETS.md +65 -0
- data/docs/QUICK-START.md +546 -587
- data/docs/RAILS_INTEGRATION.md +29 -0
- data/docs/SCHEMA_VALIDATION.md +63 -0
- data/docs/SLO-PROMQL-ALERTS.md +161 -0
- data/docs/TESTING.md +69 -0
- data/docs/{ADR-001-architecture.md → architecture/ADR-001-architecture.md} +35 -64
- data/docs/{ADR-002-metrics-yabeda.md → architecture/ADR-002-metrics-yabeda.md} +62 -236
- data/docs/{ADR-003-slo-observability.md → architecture/ADR-003-slo-observability.md} +27 -466
- data/docs/{ADR-004-adapter-architecture.md → architecture/ADR-004-adapter-architecture.md} +163 -146
- data/docs/{ADR-005-tracing-context.md → architecture/ADR-005-tracing-context.md} +10 -9
- data/docs/{ADR-006-security-compliance.md → architecture/ADR-006-security-compliance.md} +184 -191
- data/docs/{ADR-007-opentelemetry-integration.md → architecture/ADR-007-opentelemetry-integration.md} +3 -21
- data/docs/{ADR-008-rails-integration.md → architecture/ADR-008-rails-integration.md} +209 -339
- data/docs/{ADR-009-cost-optimization.md → architecture/ADR-009-cost-optimization.md} +45 -54
- data/docs/architecture/ADR-010-developer-experience.md +522 -0
- data/docs/{ADR-011-testing-strategy.md → architecture/ADR-011-testing-strategy.md} +41 -83
- data/docs/{ADR-013-reliability-error-handling.md → architecture/ADR-013-reliability-error-handling.md} +37 -12
- data/docs/{ADR-014-event-driven-slo.md → architecture/ADR-014-event-driven-slo.md} +12 -24
- data/docs/{ADR-015-middleware-order.md → architecture/ADR-015-middleware-order.md} +23 -41
- data/docs/{ADR-016-self-monitoring-slo.md → architecture/ADR-016-self-monitoring-slo.md} +52 -349
- data/docs/{ADR-017-multi-rails-compatibility.md → architecture/ADR-017-multi-rails-compatibility.md} +4 -11
- data/docs/architecture/ADR-018-memory-optimization.md +366 -0
- data/docs/{ADR-INDEX.md → architecture/ADR-INDEX.md} +11 -6
- data/docs/{00-ICP-AND-TIMELINE.md → prd/00-ICP-AND-TIMELINE.md} +6 -6
- data/docs/{01-SCALE-REQUIREMENTS.md → prd/01-SCALE-REQUIREMENTS.md} +6 -6
- data/docs/prd/01-overview-vision.md +19 -14
- data/docs/use_cases/README.md +22 -23
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +50 -44
- data/docs/use_cases/UC-002-business-event-tracking.md +26 -95
- data/docs/use_cases/UC-003-event-metrics.md +66 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +42 -101
- data/docs/use_cases/UC-005-sentry-integration.md +13 -15
- data/docs/use_cases/UC-006-trace-context-management.md +30 -28
- data/docs/use_cases/UC-007-pii-filtering.md +35 -87
- data/docs/use_cases/UC-008-opentelemetry-integration.md +51 -89
- data/docs/use_cases/UC-009-multi-service-tracing.md +4 -4
- data/docs/use_cases/UC-010-background-job-tracking.md +5 -5
- data/docs/use_cases/UC-011-rate-limiting.md +95 -168
- data/docs/use_cases/UC-012-audit-trail.md +21 -46
- data/docs/use_cases/UC-013-high-cardinality-protection.md +29 -167
- data/docs/use_cases/UC-014-adaptive-sampling.md +2 -2
- data/docs/use_cases/UC-015-cost-optimization.md +46 -99
- data/docs/use_cases/UC-016-rails-logger-migration.md +39 -213
- data/docs/use_cases/UC-017-local-development.md +203 -777
- data/docs/use_cases/UC-018-testing-events.md +3 -3
- data/docs/use_cases/UC-019-retention-based-routing.md +53 -106
- data/docs/use_cases/UC-020-event-versioning.md +8 -9
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +18 -22
- data/docs/use_cases/UC-022-event-registry.md +15 -21
- data/docs/use_cases/backlog.md +119 -87
- data/e11y.gemspec +2 -2
- data/gems/e11y-devtools/README.md +136 -0
- data/gems/e11y-devtools/config/routes.rb +8 -0
- data/gems/e11y-devtools/e11y-devtools.gemspec +25 -0
- data/gems/e11y-devtools/exe/e11y +34 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/server.rb +96 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tool_base.rb +25 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/clear.rb +31 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/errors.rb +35 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/event_detail.rb +33 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/events_by_trace.rb +33 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/interactions.rb +40 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/recent_events.rb +34 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/search.rb +34 -0
- data/gems/e11y-devtools/lib/e11y/devtools/mcp/tools/stats.rb +30 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/assets/overlay.js +115 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/controller.rb +54 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/engine.rb +26 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/middleware.rb +80 -0
- data/gems/e11y-devtools/lib/e11y/devtools/overlay/rails_controller.rb +42 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/app.rb +262 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/grouping.rb +66 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_detail.rb +62 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/event_list.rb +70 -0
- data/gems/e11y-devtools/lib/e11y/devtools/tui/widgets/interaction_list.rb +47 -0
- data/gems/e11y-devtools/lib/e11y/devtools/version.rb +8 -0
- data/gems/e11y-devtools/lib/e11y/devtools.rb +13 -0
- data/gems/e11y-devtools/spec/e11y/devtools/mcp/tools_spec.rb +107 -0
- data/gems/e11y-devtools/spec/e11y/devtools/overlay/controller_spec.rb +58 -0
- data/gems/e11y-devtools/spec/e11y/devtools/overlay/middleware_spec.rb +46 -0
- data/gems/e11y-devtools/spec/e11y/devtools/tui/app_spec.rb +85 -0
- data/gems/e11y-devtools/spec/e11y/devtools/tui/grouping_spec.rb +64 -0
- data/gems/e11y-devtools/spec/spec_helper.rb +5 -0
- data/gems/e11y-devtools/spec/tui/widgets/event_list_spec.rb +44 -0
- data/gems/e11y-devtools/spec/tui/widgets/interaction_list_spec.rb +62 -0
- data/lib/e11y/adapters/audit_encrypted.rb +53 -11
- data/lib/e11y/adapters/base.rb +33 -34
- data/lib/e11y/adapters/dev_log/file_store.rb +143 -0
- data/lib/e11y/adapters/dev_log/query.rb +219 -0
- data/lib/e11y/adapters/dev_log.rb +118 -0
- data/lib/e11y/adapters/file.rb +3 -6
- data/lib/e11y/adapters/in_memory.rb +52 -5
- data/lib/e11y/adapters/in_memory_test.rb +29 -0
- data/lib/e11y/adapters/loki.rb +58 -23
- data/lib/e11y/adapters/null.rb +82 -0
- data/lib/e11y/adapters/opentelemetry_collector.rb +183 -0
- data/lib/e11y/adapters/otel_logs.rb +136 -23
- data/lib/e11y/adapters/sentry.rb +4 -7
- data/lib/e11y/adapters/stdout.rb +73 -7
- data/lib/e11y/adapters/yabeda.rb +153 -29
- data/lib/e11y/buffers/adaptive_buffer.rb +3 -17
- data/lib/e11y/buffers/{request_scoped_buffer.rb → ephemeral_buffer.rb} +72 -58
- data/lib/e11y/buffers/ring_buffer.rb +3 -16
- data/lib/e11y/configuration.rb +272 -0
- data/lib/e11y/console.rb +10 -17
- data/lib/e11y/current.rb +53 -1
- data/lib/e11y/debug/pipeline_inspector.rb +96 -0
- data/lib/e11y/documentation/generator.rb +48 -0
- data/lib/e11y/event/base.rb +176 -82
- data/lib/e11y/event/value_sampling_config.rb +1 -5
- data/lib/e11y/events/rails/database/query.rb +1 -4
- data/lib/e11y/events/rails/job/failed.rb +2 -0
- data/lib/e11y/instruments/active_job.rb +46 -12
- data/lib/e11y/instruments/rails_instrumentation.rb +49 -24
- data/lib/e11y/instruments/sidekiq.rb +137 -31
- data/lib/e11y/linters/base.rb +11 -0
- data/lib/e11y/linters/pii/pii_declaration_linter.rb +120 -0
- data/lib/e11y/linters/slo/config_consistency_linter.rb +76 -0
- data/lib/e11y/linters/slo/explicit_declaration_linter.rb +36 -0
- data/lib/e11y/linters/slo/slo_status_from_linter.rb +41 -0
- data/lib/e11y/logger/bridge.rb +26 -7
- data/lib/e11y/metrics/cardinality_protection.rb +10 -15
- data/lib/e11y/metrics/cardinality_tracker.rb +16 -6
- data/lib/e11y/metrics/registry.rb +3 -5
- data/lib/e11y/metrics/test_backend.rb +62 -0
- data/lib/e11y/metrics.rb +56 -10
- data/lib/e11y/middleware/adapter_resolver.rb +40 -0
- data/lib/e11y/middleware/audit_signing.rb +43 -6
- data/lib/e11y/middleware/baggage_protection.rb +75 -0
- data/lib/e11y/middleware/dev_log_source.rb +24 -0
- data/lib/e11y/middleware/event_slo.rb +23 -9
- data/lib/e11y/middleware/otel_span.rb +23 -0
- data/lib/e11y/middleware/pii_filter.rb +104 -75
- data/lib/e11y/middleware/rate_limiting.rb +54 -27
- data/lib/e11y/middleware/request.rb +70 -23
- data/lib/e11y/middleware/routing.rb +78 -21
- data/lib/e11y/middleware/sampling.rb +66 -17
- data/lib/e11y/middleware/self_monitoring_emit.rb +39 -0
- data/lib/e11y/middleware/trace_context.rb +45 -10
- data/lib/e11y/middleware/track_latency.rb +34 -0
- data/lib/e11y/middleware/validation.rb +7 -16
- data/lib/e11y/middleware/versioning.rb +26 -22
- data/lib/e11y/opentelemetry/semantic_conventions.rb +109 -0
- data/lib/e11y/opentelemetry/span_creator.rb +142 -0
- data/lib/e11y/pii/patterns.rb +12 -1
- data/lib/e11y/pipeline/builder.rb +1 -1
- data/lib/e11y/presets/audit_event.rb +13 -2
- data/lib/e11y/railtie.rb +52 -15
- data/lib/e11y/registry.rb +306 -0
- data/lib/e11y/reliability/circuit_breaker.rb +19 -21
- data/lib/e11y/reliability/dlq/base.rb +71 -0
- data/lib/e11y/reliability/dlq/file_adapter.rb +301 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +63 -34
- data/lib/e11y/reliability/dlq/filter.rb +37 -54
- data/lib/e11y/reliability/retry_handler.rb +26 -29
- data/lib/e11y/reliability/retry_rate_limiter.rb +3 -11
- data/lib/e11y/sampling/error_spike_detector.rb +0 -2
- data/lib/e11y/sampling/load_monitor.rb +5 -9
- data/lib/e11y/sampling/stratified_tracker.rb +18 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +2 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +19 -61
- data/lib/e11y/self_monitoring/reliability_monitor.rb +4 -74
- data/lib/e11y/slo/config_loader.rb +40 -0
- data/lib/e11y/slo/config_validator.rb +58 -0
- data/lib/e11y/slo/dashboard_generator.rb +122 -0
- data/lib/e11y/slo/event_driven.rb +8 -0
- data/lib/e11y/slo/tracker.rb +31 -4
- data/lib/e11y/testing/have_tracked_event_matcher.rb +190 -0
- data/lib/e11y/testing/rspec_matchers.rb +21 -0
- data/lib/e11y/testing/snapshot_matcher.rb +86 -0
- data/lib/e11y/trace_context/sampler.rb +35 -0
- data/lib/e11y/tracing/faraday_middleware.rb +31 -0
- data/lib/e11y/tracing/net_http_patch.rb +33 -0
- data/lib/e11y/tracing/propagator.rb +116 -0
- data/lib/e11y/tracing.rb +47 -0
- data/lib/e11y/version.rb +1 -1
- data/lib/e11y/versioning/version_extractor.rb +32 -0
- data/lib/e11y.rb +141 -265
- data/lib/generators/e11y/event/event_generator.rb +22 -0
- data/lib/generators/e11y/event/templates/event.rb.tt +16 -0
- data/lib/generators/e11y/grafana_dashboard/grafana_dashboard_generator.rb +30 -0
- data/lib/generators/e11y/grafana_dashboard/templates/e11y_dashboard.json +81 -0
- data/lib/generators/e11y/install/install_generator.rb +34 -0
- data/lib/generators/e11y/install/templates/e11y.rb +239 -0
- data/lib/generators/e11y/prometheus_alerts/prometheus_alerts_generator.rb +29 -0
- data/lib/generators/e11y/prometheus_alerts/templates/e11y_alerts.yml +28 -0
- data/lib/tasks/e11y_docs.rake +30 -0
- data/lib/tasks/e11y_events.rake +71 -0
- data/lib/tasks/e11y_lint.rake +91 -0
- data/lib/tasks/e11y_slo.rake +29 -0
- metadata +129 -39
- data/docs/ADR-010-developer-experience.md +0 -2166
- data/docs/API-REFERENCE-L28.md +0 -914
- data/docs/COMPREHENSIVE-CONFIGURATION.md +0 -2366
- data/docs/CONTRIBUTING.md +0 -312
- data/docs/IMPLEMENTATION_NOTES.md +0 -2804
- data/docs/IMPLEMENTATION_PLAN.md +0 -1971
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +0 -586
- data/docs/PLAN.md +0 -148
- data/docs/README.md +0 -296
- data/docs/design/00-memory-optimization.md +0 -593
- data/docs/guides/MIGRATION-L27-L28.md +0 -692
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +0 -434
- data/docs/guides/README.md +0 -44
- data/docs/use_cases/UC-003-pattern-based-metrics.md +0 -1627
- data/lib/e11y/adapters/registry.rb +0 -141
- /data/docs/{ADR-012-event-evolution.md → architecture/ADR-012-event-evolution.md} +0 -0
@@ -5,6 +5,8 @@
 **Covers:** Internal observability and reliability of E11y gem itself
 **Depends On:** ADR-001 (Core), ADR-002 (Metrics), ADR-003 (SLO)
 
+**Implementation (2026-03):** `track_latency` implemented via `E11y::Middleware::TrackLatency` (first in pipeline).
+
 ---
 
 ## 📋 Table of Contents
@@ -163,9 +165,7 @@ graph TB
     end
 
     subgraph "SLO Tracking"
-        Prometheus -->
-        SLOCalc --> ErrorBudget[E11y Error Budget]
-        ErrorBudget --> Alerts[Alertmanager]
+        Prometheus --> Alerts[Alertmanager]
     end
 
     style SelfMetrics fill:#fff3cd
@@ -484,205 +484,7 @@ end
 
 ## 4. Internal SLO Tracking
 
-
-
-**E11y has its own SLO** (separate from application SLO):
-
-```yaml
-# config/e11y_slo.yml
-#
-# E11y Gem Internal SLO
-#
-# This defines reliability targets for E11y itself.
-# If E11y violates its SLO → alert SRE immediately!
-
-version: 1
-
-e11y_slo:
-  # === LATENCY SLO ===
-  # E11y.track() must be fast (<1ms p99)
-  latency:
-    enabled: true
-    p99_target: 0.001 # 1ms
-    p95_target: 0.0005 # 0.5ms
-    p50_target: 0.0001 # 0.1ms
-    window: 30d
-
-    # Multi-window burn rate alerts
-    burn_rate_alerts:
-      fast:
-        enabled: true
-        window: 1h
-        threshold: 14.4
-        alert_after: 5m
-        severity: critical
-      medium:
-        enabled: true
-        window: 6h
-        threshold: 6.0
-        alert_after: 30m
-        severity: warning
-
-  # === RELIABILITY SLO ===
-  # E11y must deliver 99.9% of events
-  reliability:
-    enabled: true
-    success_rate_target: 0.999 # 99.9%
-    window: 30d
-
-    # What counts as "success"?
-    success_criteria:
-      - event_tracked: true
-      - not_dropped: true
-      - adapter_delivered: true # At least 1 adapter succeeded
-
-    # What counts as "failure"?
-    failure_criteria:
-      - validation_failed: true
-      - all_adapters_failed: true
-      - buffer_overflow: true
-
-    burn_rate_alerts:
-      fast:
-        enabled: true
-        window: 1h
-        threshold: 14.4
-        alert_after: 5m
-
-  # === RESOURCE SLO ===
-  # E11y must use <2% CPU, <100MB memory
-  resources:
-    enabled: true
-
-    cpu_percent_target: 2.0 # <2% CPU
-    memory_mb_target: 100 # <100MB
-
-    buffer_utilization_target: 80 # <80% full
-
-    alerts:
-      cpu_high:
-        threshold: 5.0 # Alert if >5% CPU
-        duration: 5m
-      memory_high:
-        threshold: 200 # Alert if >200MB
-        duration: 5m
-      buffer_high:
-        threshold: 90 # Alert if >90% full
-        duration: 1m
-
-  # === ERROR BUDGET ===
-  error_budget:
-    enabled: true
-
-    # Latency budget: 0.1% of requests can be >1ms
-    latency_budget: 0.001
-
-    # Reliability budget: 0.1% of events can be dropped
-    reliability_budget: 0.001
-
-    # Alert thresholds
-    alert_at_percent_consumed: [50, 80, 90, 100]
-```
-
-### 4.2. SLO Calculator
-
-```ruby
-# lib/e11y/self_monitoring/slo_calculator.rb
-module E11y
-  module SelfMonitoring
-    class SLOCalculator
-      def self.calculate_latency_slo(window: 30.days)
-        # Query Prometheus for E11y latency p99
-        query = <<~PROMQL
-          histogram_quantile(0.99,
-            sum(rate(e11y_track_duration_seconds_bucket[#{window}])) by (le)
-          )
-        PROMQL
-
-        p99_latency = E11y::Metrics.query_prometheus(query)
-        target = 0.001 # 1ms
-
-        {
-          current_p99: p99_latency,
-          target_p99: target,
-          slo_met: p99_latency <= target,
-          error_budget_consumed: calculate_latency_budget_consumed(p99_latency, target, window)
-        }
-      end
-
-      def self.calculate_reliability_slo(window: 30.days)
-        # Query Prometheus for E11y success rate
-        query = <<~PROMQL
-          sum(rate(e11y_events_tracked_total{result="success"}[#{window}]))
-          /
-          sum(rate(e11y_events_tracked_total[#{window}]))
-        PROMQL
-
-        success_rate = E11y::Metrics.query_prometheus(query)
-        target = 0.999 # 99.9%
-
-        {
-          current_success_rate: success_rate,
-          target_success_rate: target,
-          slo_met: success_rate >= target,
-          error_budget_consumed: calculate_reliability_budget_consumed(success_rate, target)
-        }
-      end
-
-      def self.calculate_resource_slo
-        # Query Prometheus for E11y resource usage
-        cpu_query = 'avg(rate(e11y_cpu_seconds_total[5m])) * 100'
-        memory_query = 'e11y_memory_usage_mb'
-        buffer_query = 'e11y_buffer_utilization_percent'
-
-        cpu_percent = E11y::Metrics.query_prometheus(cpu_query)
-        memory_mb = E11y::Metrics.query_prometheus(memory_query)
-        buffer_percent = E11y::Metrics.query_prometheus(buffer_query)
-
-        {
-          cpu: {
-            current: cpu_percent,
-            target: 2.0,
-            slo_met: cpu_percent <= 2.0
-          },
-          memory: {
-            current: memory_mb,
-            target: 100,
-            slo_met: memory_mb <= 100
-          },
-          buffer: {
-            current: buffer_percent,
-            target: 80,
-            slo_met: buffer_percent <= 80
-          }
-        }
-      end
-
-      private
-
-      def self.calculate_latency_budget_consumed(current, target, window)
-        # Simplified: % of requests exceeding target
-        # In reality, use Prometheus query for exact calculation
-        return 0.0 if current <= target
-
-        excess = current - target
-        budget = target * 0.001 # 0.1% budget
-
-        (excess / budget * 100).round(2)
-      end
-
-      def self.calculate_reliability_budget_consumed(current, target)
-        error_rate = 1.0 - current
-        error_budget = 1.0 - target
-
-        return 0.0 if error_rate <= error_budget
-
-        (error_rate / error_budget * 100).round(2)
-      end
-    end
-  end
-end
-```
+**PromQL and alerts:** See [SLO-PROMQL-ALERTS.md](../SLO-PROMQL-ALERTS.md) for E11y latency, delivery rate, and alert rules.
 
 ---
 
@@ -780,160 +582,61 @@ end
 
 ## 6. Health Checks
 
-
+The gem runs in the host application's main process. E11y does not provide a separate health endpoint — the host application decides how and where to expose health (e.g. its own `/health` or `/ready`).
+
+### 6.1. API: E11y.health and E11y.healthy?
 
 ```ruby
-#
-
-
-    def self.status
-      {
-        status: overall_status,
-        timestamp: Time.now.iso8601,
-        checks: {
-          latency: check_latency,
-          reliability: check_reliability,
-          resources: check_resources,
-          adapters: check_adapters,
-          buffer: check_buffer
-        },
-        slo: {
-          latency: SelfMonitoring::SLOCalculator.calculate_latency_slo,
-          reliability: SelfMonitoring::SLOCalculator.calculate_reliability_slo,
-          resources: SelfMonitoring::SLOCalculator.calculate_resource_slo
-        }
-      }
-    end
-
-    def self.healthy?
-      status[:status] == :healthy
-    end
-
-    private
-
-    def self.overall_status
-      checks = [
-        check_latency[:status],
-        check_reliability[:status],
-        check_resources[:status],
-        check_adapters[:status],
-        check_buffer[:status]
-      ]
-
-      return :unhealthy if checks.include?(:unhealthy)
-      return :degraded if checks.include?(:degraded)
-      :healthy
-    end
-
-    def self.check_latency
-      slo = SelfMonitoring::SLOCalculator.calculate_latency_slo(window: 5.minutes)
-
-      {
-        status: slo[:slo_met] ? :healthy : :degraded,
-        current_p99: slo[:current_p99],
-        target_p99: slo[:target_p99],
-        message: slo[:slo_met] ? 'Latency within SLO' : 'Latency exceeds SLO'
-      }
-    end
-
-    def self.check_reliability
-      slo = SelfMonitoring::SLOCalculator.calculate_reliability_slo(window: 5.minutes)
-
-      {
-        status: slo[:slo_met] ? :healthy : :unhealthy,
-        current_success_rate: slo[:current_success_rate],
-        target_success_rate: slo[:target_success_rate],
-        message: slo[:slo_met] ? 'Reliability within SLO' : 'Reliability below SLO'
-      }
-    end
-
-    def self.check_resources
-      slo = SelfMonitoring::SLOCalculator.calculate_resource_slo
-
-      cpu_ok = slo[:cpu][:slo_met]
-      memory_ok = slo[:memory][:slo_met]
-      buffer_ok = slo[:buffer][:slo_met]
-
-      status = (cpu_ok && memory_ok && buffer_ok) ? :healthy : :degraded
-
-      {
-        status: status,
-        cpu_percent: slo[:cpu][:current],
-        memory_mb: slo[:memory][:current],
-        buffer_percent: slo[:buffer][:current],
-        message: status == :healthy ? 'Resources within limits' : 'Resource usage high'
-      }
-    end
-
-    def self.check_adapters
-      # Check circuit breaker states
-      adapters = E11y.config.adapters.all
-
-      failed_adapters = adapters.select do |name, adapter|
-        adapter.circuit_breaker&.open?
-      end
-
-      {
-        status: failed_adapters.empty? ? :healthy : :degraded,
-        total_adapters: adapters.size,
-        failed_adapters: failed_adapters.keys,
-        message: failed_adapters.empty? ? 'All adapters healthy' : "#{failed_adapters.size} adapters failed"
-      }
-    end
-
-    def self.check_buffer
-      buffer = E11y::Buffer.instance
-      utilization = (buffer.size.to_f / buffer.max_size * 100).round(2)
-
-      status = case utilization
-               when 0..80 then :healthy
-               when 81..90 then :degraded
-               else :unhealthy
-               end
-
-      {
-        status: status,
-        current_size: buffer.size,
-        max_size: buffer.max_size,
-        utilization_percent: utilization,
-        message: "Buffer #{utilization}% full"
-      }
-    end
-  end
-end
+# Module methods on E11y
+E11y.health # => Hash with details
+E11y.healthy? # => true/false
 ```
 
-
+**Structure of `E11y.health`:**
 
 ```ruby
-
-
-#
-
-
-
-
+{
+  status: :healthy | :degraded | :unhealthy,
+  timestamp: "2026-03-13T12:00:00Z", # ISO8601
+  adapters: {
+    loki: { healthy: true, circuit_breaker: :closed },
+    sentry: { healthy: false, circuit_breaker: :open }
+  }
+}
+```
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+- **`status`** — overall telemetry status (see below).
+- **`adapters`** — per registered adapter:
+  - **`healthy`** — result of `adapter.healthy?` (backend reachability: Loki pings, Sentry validates DSN, etc.).
+  - **`circuit_breaker`** — `:closed` | `:open` | `:half_open` (adapter circuit breaker state).
+
+### 6.2. What Does "Telemetry Healthy" Mean?
+
+**Telemetry is healthy** = we can deliver events to at least one backend.
+
+| status | Condition |
+|-------------|-----------|
+| **:healthy** | All adapters have `healthy? == true` and `circuit_breaker == :closed`. |
+| **:degraded** | At least one adapter is unhealthy or circuit open, but at least one is healthy + closed. |
+| **:unhealthy** | No adapter is healthy + closed — nowhere to deliver events. |
+
+**`E11y.healthy?`** returns `true` only when `status == :healthy`.
+
+### 6.3. Integration into Host Application Health
+
+```ruby
+# config/routes.rb
+get "/health", to: "health#show"
+
+# app/controllers/health_controller.rb
+def show
+  e11y = E11y.health
+  overall = e11y[:status] == :healthy ? :ok : :degraded
+
+  render json: {
+    app: check_app,
+    e11y: e11y
+  }, status: overall == :ok ? 200 : 503
 end
 ```
 
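Reviewer note: the status table added in section 6.2 reads as a small decision rule over the per-adapter entries. The sketch below is illustrative only. `telemetry_status` is a hypothetical helper, not part of the e11y API, and it assumes the hash shape shown under "Structure of `E11y.health`". Treating an empty adapter set as `:unhealthy` is also an assumption, since the diff does not cover that case.

```ruby
# Hypothetical helper (not e11y API): derive the overall status from the
# per-adapter entries documented in section 6.1.
def telemetry_status(adapters)
  # An adapter can take events only if its backend is reachable
  # and its circuit breaker is closed.
  usable = adapters.values.count do |a|
    a[:healthy] && a[:circuit_breaker] == :closed
  end

  # Nowhere to deliver events (assumed to cover an empty adapter set too).
  return :unhealthy if usable.zero?

  usable == adapters.size ? :healthy : :degraded
end

adapters = {
  loki:   { healthy: true,  circuit_breaker: :closed },
  sentry: { healthy: false, circuit_breaker: :open }
}
puts telemetry_status(adapters) # prints "degraded"
```

Note that the controller example in section 6.3 checks only for `:healthy`, so under this rule it would answer 503 for both `:degraded` and `:unhealthy`.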
data/docs/{ADR-017-multi-rails-compatibility.md → architecture/ADR-017-multi-rails-compatibility.md}
RENAMED
@@ -55,22 +55,15 @@ matrix:
 ### Negative
 - CI time: 5min → 15min (3× Rails versions)
 - Maintenance: 3 Rails versions to test
-- Conditional code (
+- Conditional code: 1 place (sqlite3 in Gemfile)
 
 ---
 
 ## Version-Specific Code
 
-**1. Exception handling (
-```ruby
-if Rails.version.to_f >= 8.0
-  expect(response.status).to eq(500) # Rails 8.0: caught
-else
-  expect { get '/error' }.to raise_error # Rails 7.x: raised
-end
-```
+**1. Exception handling** — not needed: `show_exceptions = :all` in dummy app gives identical behavior (exceptions → 500) on 7.x and 8.x. Tests without version branching.
 
-**2. sqlite3 dependency (Gemfile only)**
+**2. sqlite3 dependency (Gemfile only)** — different versions for 7.x (~> 1.4) and 8.x (~> 2.0).
 
 ---
 
@@ -99,5 +92,5 @@ end
 |--------|--------|---------|
 | Test pass rate | 100% | 100% ✅ |
 | Code coverage | ≥95% | 96.46% ✅ |
-| Version checks | <10 |
+| Version checks | <10 | 1 ✅ |
 | CI time | <20min | ~15min ✅ |
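Reviewer note: the one remaining version check (sqlite3 in the Gemfile) could look roughly like the sketch below. The two constraints come from the ADR text above. The helper name `sqlite3_requirement` and the `RAILS_VERSION` environment variable are assumptions made for illustration, and the gem's real Gemfile may select the version differently.

```ruby
# Hypothetical Gemfile helper: pick the sqlite3 constraint for the Rails
# version under test. Only the two constraints are taken from the ADR.
def sqlite3_requirement(rails_version)
  if Gem::Version.new(rails_version) >= Gem::Version.new("8.0")
    "~> 2.0"
  else
    "~> 1.4"
  end
end

# In a Gemfile this might be used as (RAILS_VERSION is an assumed variable):
#   gem "sqlite3", sqlite3_requirement(ENV.fetch("RAILS_VERSION", "8.0"))
puts sqlite3_requirement("7.1") # prints "~> 1.4"
```

Comparing with `Gem::Version` rather than `Rails.version.to_f` avoids float parsing of multi-part versions such as "8.0.1".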