e11y 0.1.0
- checksums.yaml +7 -0
- data/.rspec +4 -0
- data/.rubocop.yml +69 -0
- data/CHANGELOG.md +26 -0
- data/CODE_OF_CONDUCT.md +64 -0
- data/LICENSE.txt +21 -0
- data/README.md +179 -0
- data/Rakefile +37 -0
- data/benchmarks/run_all.rb +33 -0
- data/config/README.md +83 -0
- data/config/loki-local-config.yaml +35 -0
- data/config/prometheus.yml +15 -0
- data/docker-compose.yml +78 -0
- data/docs/00-ICP-AND-TIMELINE.md +483 -0
- data/docs/01-SCALE-REQUIREMENTS.md +858 -0
- data/docs/ADR-001-architecture.md +2617 -0
- data/docs/ADR-002-metrics-yabeda.md +1395 -0
- data/docs/ADR-003-slo-observability.md +3337 -0
- data/docs/ADR-004-adapter-architecture.md +2385 -0
- data/docs/ADR-005-tracing-context.md +1372 -0
- data/docs/ADR-006-security-compliance.md +4143 -0
- data/docs/ADR-007-opentelemetry-integration.md +1385 -0
- data/docs/ADR-008-rails-integration.md +1911 -0
- data/docs/ADR-009-cost-optimization.md +2993 -0
- data/docs/ADR-010-developer-experience.md +2166 -0
- data/docs/ADR-011-testing-strategy.md +1836 -0
- data/docs/ADR-012-event-evolution.md +958 -0
- data/docs/ADR-013-reliability-error-handling.md +2750 -0
- data/docs/ADR-014-event-driven-slo.md +1533 -0
- data/docs/ADR-015-middleware-order.md +1061 -0
- data/docs/ADR-016-self-monitoring-slo.md +1234 -0
- data/docs/API-REFERENCE-L28.md +914 -0
- data/docs/COMPREHENSIVE-CONFIGURATION.md +2366 -0
- data/docs/IMPLEMENTATION_NOTES.md +2804 -0
- data/docs/IMPLEMENTATION_PLAN.md +1971 -0
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +586 -0
- data/docs/PLAN.md +148 -0
- data/docs/QUICK-START.md +934 -0
- data/docs/README.md +296 -0
- data/docs/design/00-memory-optimization.md +593 -0
- data/docs/guides/MIGRATION-L27-L28.md +692 -0
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +434 -0
- data/docs/guides/README.md +44 -0
- data/docs/prd/01-overview-vision.md +440 -0
- data/docs/use_cases/README.md +119 -0
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +813 -0
- data/docs/use_cases/UC-002-business-event-tracking.md +1953 -0
- data/docs/use_cases/UC-003-pattern-based-metrics.md +1627 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +728 -0
- data/docs/use_cases/UC-005-sentry-integration.md +759 -0
- data/docs/use_cases/UC-006-trace-context-management.md +905 -0
- data/docs/use_cases/UC-007-pii-filtering.md +2648 -0
- data/docs/use_cases/UC-008-opentelemetry-integration.md +1153 -0
- data/docs/use_cases/UC-009-multi-service-tracing.md +1043 -0
- data/docs/use_cases/UC-010-background-job-tracking.md +1018 -0
- data/docs/use_cases/UC-011-rate-limiting.md +1906 -0
- data/docs/use_cases/UC-012-audit-trail.md +2301 -0
- data/docs/use_cases/UC-013-high-cardinality-protection.md +2127 -0
- data/docs/use_cases/UC-014-adaptive-sampling.md +1940 -0
- data/docs/use_cases/UC-015-cost-optimization.md +735 -0
- data/docs/use_cases/UC-016-rails-logger-migration.md +785 -0
- data/docs/use_cases/UC-017-local-development.md +867 -0
- data/docs/use_cases/UC-018-testing-events.md +1081 -0
- data/docs/use_cases/UC-019-tiered-storage-migration.md +562 -0
- data/docs/use_cases/UC-020-event-versioning.md +708 -0
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +956 -0
- data/docs/use_cases/UC-022-event-registry.md +648 -0
- data/docs/use_cases/backlog.md +226 -0
- data/e11y.gemspec +76 -0
- data/lib/e11y/adapters/adaptive_batcher.rb +207 -0
- data/lib/e11y/adapters/audit_encrypted.rb +239 -0
- data/lib/e11y/adapters/base.rb +580 -0
- data/lib/e11y/adapters/file.rb +224 -0
- data/lib/e11y/adapters/in_memory.rb +216 -0
- data/lib/e11y/adapters/loki.rb +333 -0
- data/lib/e11y/adapters/otel_logs.rb +203 -0
- data/lib/e11y/adapters/registry.rb +141 -0
- data/lib/e11y/adapters/sentry.rb +230 -0
- data/lib/e11y/adapters/stdout.rb +108 -0
- data/lib/e11y/adapters/yabeda.rb +370 -0
- data/lib/e11y/buffers/adaptive_buffer.rb +339 -0
- data/lib/e11y/buffers/base_buffer.rb +40 -0
- data/lib/e11y/buffers/request_scoped_buffer.rb +246 -0
- data/lib/e11y/buffers/ring_buffer.rb +267 -0
- data/lib/e11y/buffers.rb +14 -0
- data/lib/e11y/console.rb +122 -0
- data/lib/e11y/current.rb +48 -0
- data/lib/e11y/event/base.rb +894 -0
- data/lib/e11y/event/value_sampling_config.rb +84 -0
- data/lib/e11y/events/base_audit_event.rb +43 -0
- data/lib/e11y/events/base_payment_event.rb +33 -0
- data/lib/e11y/events/rails/cache/delete.rb +21 -0
- data/lib/e11y/events/rails/cache/read.rb +23 -0
- data/lib/e11y/events/rails/cache/write.rb +22 -0
- data/lib/e11y/events/rails/database/query.rb +45 -0
- data/lib/e11y/events/rails/http/redirect.rb +21 -0
- data/lib/e11y/events/rails/http/request.rb +26 -0
- data/lib/e11y/events/rails/http/send_file.rb +21 -0
- data/lib/e11y/events/rails/http/start_processing.rb +26 -0
- data/lib/e11y/events/rails/job/completed.rb +22 -0
- data/lib/e11y/events/rails/job/enqueued.rb +22 -0
- data/lib/e11y/events/rails/job/failed.rb +22 -0
- data/lib/e11y/events/rails/job/scheduled.rb +23 -0
- data/lib/e11y/events/rails/job/started.rb +22 -0
- data/lib/e11y/events/rails/log.rb +56 -0
- data/lib/e11y/events/rails/view/render.rb +23 -0
- data/lib/e11y/events.rb +18 -0
- data/lib/e11y/instruments/active_job.rb +201 -0
- data/lib/e11y/instruments/rails_instrumentation.rb +141 -0
- data/lib/e11y/instruments/sidekiq.rb +175 -0
- data/lib/e11y/logger/bridge.rb +205 -0
- data/lib/e11y/metrics/cardinality_protection.rb +172 -0
- data/lib/e11y/metrics/cardinality_tracker.rb +134 -0
- data/lib/e11y/metrics/registry.rb +234 -0
- data/lib/e11y/metrics/relabeling.rb +226 -0
- data/lib/e11y/metrics.rb +102 -0
- data/lib/e11y/middleware/audit_signing.rb +174 -0
- data/lib/e11y/middleware/base.rb +140 -0
- data/lib/e11y/middleware/event_slo.rb +167 -0
- data/lib/e11y/middleware/pii_filter.rb +266 -0
- data/lib/e11y/middleware/pii_filtering.rb +280 -0
- data/lib/e11y/middleware/rate_limiting.rb +214 -0
- data/lib/e11y/middleware/request.rb +163 -0
- data/lib/e11y/middleware/routing.rb +157 -0
- data/lib/e11y/middleware/sampling.rb +254 -0
- data/lib/e11y/middleware/slo.rb +168 -0
- data/lib/e11y/middleware/trace_context.rb +131 -0
- data/lib/e11y/middleware/validation.rb +118 -0
- data/lib/e11y/middleware/versioning.rb +132 -0
- data/lib/e11y/middleware.rb +12 -0
- data/lib/e11y/pii/patterns.rb +90 -0
- data/lib/e11y/pii.rb +13 -0
- data/lib/e11y/pipeline/builder.rb +155 -0
- data/lib/e11y/pipeline/zone_validator.rb +110 -0
- data/lib/e11y/pipeline.rb +12 -0
- data/lib/e11y/presets/audit_event.rb +65 -0
- data/lib/e11y/presets/debug_event.rb +34 -0
- data/lib/e11y/presets/high_value_event.rb +51 -0
- data/lib/e11y/presets.rb +19 -0
- data/lib/e11y/railtie.rb +138 -0
- data/lib/e11y/reliability/circuit_breaker.rb +216 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +277 -0
- data/lib/e11y/reliability/dlq/filter.rb +117 -0
- data/lib/e11y/reliability/retry_handler.rb +207 -0
- data/lib/e11y/reliability/retry_rate_limiter.rb +117 -0
- data/lib/e11y/sampling/error_spike_detector.rb +225 -0
- data/lib/e11y/sampling/load_monitor.rb +161 -0
- data/lib/e11y/sampling/stratified_tracker.rb +92 -0
- data/lib/e11y/sampling/value_extractor.rb +82 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +79 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +97 -0
- data/lib/e11y/self_monitoring/reliability_monitor.rb +146 -0
- data/lib/e11y/slo/event_driven.rb +150 -0
- data/lib/e11y/slo/tracker.rb +119 -0
- data/lib/e11y/version.rb +9 -0
- data/lib/e11y.rb +283 -0
- metadata +452 -0
|
@@ -0,0 +1,2127 @@
|
|
|
1
|
+
# UC-013: High Cardinality Protection
|
|
2
|
+
|
|
3
|
+
**Status:** v1.0 Feature (Critical for Scale)
|
|
4
|
+
**Complexity:** Advanced
|
|
5
|
+
**Setup Time:** 30-60 minutes
|
|
6
|
+
**Target Users:** Engineering Managers, SRE, DevOps, Backend Developers
|
|
7
|
+
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
## 📋 Overview
|
|
11
|
+
|
|
12
|
+
### Problem Statement
|
|
13
|
+
|
|
14
|
+
**The $68,000/month mistake:**
|
|
15
|
+
```ruby
|
|
16
|
+
# ❌ CATASTROPHIC: Using user_id as metric label
|
|
17
|
+
E11y.configure do |config|
|
|
18
|
+
config.metrics do
|
|
19
|
+
counter_for pattern: 'user.action',
|
|
20
|
+
name: 'user_actions_total',
|
|
21
|
+
tags: [:user_id, :action_type] # ← 💸💸💸
|
|
22
|
+
end
|
|
23
|
+
end
|
|
24
|
+
|
|
25
|
+
# With 100,000 users × 10 action types = 1,000,000 metric series
|
|
26
|
+
# Datadog cost: $68/host × 1000 hosts = $68,000/month
|
|
27
|
+
# Prometheus memory: ~200 bytes/series × 1M = 200 MB per host
|
|
28
|
+
# Query latency: 10x slower due to cardinality explosion
|
|
29
|
+
```
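
The arithmetic in the comments above can be reproduced in a few lines. This is a back-of-the-envelope sketch: `series_count` is a hypothetical helper, and the 200 bytes per series is the rough figure quoted above, not a measurement.

```ruby
# Worst-case number of time series for one metric: the product of the
# number of distinct values of each label.
def series_count(label_cardinalities)
  label_cardinalities.values.reduce(1, :*)
end

BYTES_PER_SERIES = 200 # rough figure from the text above (assumption)

unsafe = series_count(user_id: 100_000, action_type: 10)
safe   = series_count(user_segment: 3, action_type: 10)

puts "unsafe: #{unsafe} series, ~#{unsafe * BYTES_PER_SERIES / 1_000_000} MB"
puts "safe:   #{safe} series"
```

Because the count is a product, replacing a single unbounded label with a small, fixed set of values shrinks the total by orders of magnitude.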
|
|
30
|
+
|
|
31
|
+
**Real-world impact:**
|
|
32
|
+
- 200 services × 1,000 users × 5 dimensions = **1,000,000 metric series**
|
|
33
|
+
- **Datadog cost: $68,000/month**
|
|
34
|
+
- **Prometheus OOM crashes** (out of memory)
|
|
35
|
+
- **Query timeouts** (PromQL queries take 30+ seconds)
|
|
36
|
+
- **Incident during Black Friday** (metrics system collapsed)
|
|
37
|
+
|
|
38
|
+
### E11y Solution
|
|
39
|
+
|
|
40
|
+
**4-Layer Defense System + 99% Cost Reduction:**
|
|
41
|
+
```ruby
|
|
42
|
+
# ✅ SAFE: Aggregate user_id → user_segment
|
|
43
|
+
E11y.configure do |config|
|
|
44
|
+
config.metrics do
|
|
45
|
+
# Layer 1: Denylist (hard block)
|
|
46
|
+
forbidden_labels :user_id, :order_id, :session_id, :trace_id
|
|
47
|
+
|
|
48
|
+
# Layer 2: Safe aggregation
|
|
49
|
+
counter_for pattern: 'user.action',
|
|
50
|
+
name: 'user_actions_total',
|
|
51
|
+
tags: [:user_segment, :action_type], # ← 3 segments × 10 actions = 30 series
|
|
52
|
+
tag_extractors: {
|
|
53
|
+
user_segment: ->(event) {
|
|
54
|
+
user = User.find(event.payload[:user_id])
|
|
55
|
+
user.segment # 'free', 'paid', 'enterprise'
|
|
56
|
+
}
|
|
57
|
+
}
|
|
58
|
+
|
|
59
|
+
# Layer 3: Per-metric limits
|
|
60
|
+
cardinality_limit_for 'user_actions_total', max: 100
|
|
61
|
+
|
|
62
|
+
# Layer 4: Dynamic monitoring
|
|
63
|
+
cardinality_monitoring do
|
|
64
|
+
warn_threshold 0.7 # Alert at 70%
|
|
65
|
+
auto_aggregate true # Auto-fix if exceeded
|
|
66
|
+
end
|
|
67
|
+
end
|
|
68
|
+
end
|
|
69
|
+
|
|
70
|
+
# Result:
|
|
71
|
+
# - 200 services × 10 segments × 5 dimensions = 10,000 series
|
|
72
|
+
# - Datadog cost: $680/month
|
|
73
|
+
# - Savings: $67,320/month (99% reduction) ✅
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
---
|
|
77
|
+
|
|
78
|
+
## 🎯 Event-Level Cardinality Protection (NEW - v1.1)
|
|
79
|
+
|
|
80
|
+
> **🎯 CONTRADICTION_01 Resolution:** Move cardinality config from global initializer to event classes.
|
|
81
|
+
|
|
82
|
+
**Event-level cardinality DSL:**
|
|
83
|
+
|
|
84
|
+
```ruby
|
|
85
|
+
# app/events/user_action.rb
|
|
86
|
+
module Events
|
|
87
|
+
class UserAction < E11y::Event::Base
|
|
88
|
+
schema do
|
|
89
|
+
required(:user_id).filled(:string)
|
|
90
|
+
required(:action_type).filled(:string)
|
|
91
|
+
required(:user_segment).filled(:string)
|
|
92
|
+
end
|
|
93
|
+
|
|
94
|
+
# ✨ Event-level cardinality protection (right next to schema!)
|
|
95
|
+
metric :counter,
|
|
96
|
+
name: 'user_actions_total',
|
|
97
|
+
tags: [:user_segment, :action_type], # ← Safe labels
|
|
98
|
+
cardinality_limit: 100 # Max 100 series
|
|
99
|
+
|
|
100
|
+
# Forbidden labels (high cardinality)
|
|
101
|
+
forbidden_metric_labels :user_id, :session_id
|
|
102
|
+
|
|
103
|
+
# Safe labels (low cardinality)
|
|
104
|
+
safe_metric_labels :user_segment, :action_type, :status
|
|
105
|
+
end
|
|
106
|
+
end
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
**Inheritance for cardinality protection:**
|
|
110
|
+
|
|
111
|
+
```ruby
|
|
112
|
+
# Base class with common cardinality rules
|
|
113
|
+
module Events
|
|
114
|
+
class BaseUserEvent < E11y::Event::Base
|
|
115
|
+
# Common for ALL user events
|
|
116
|
+
forbidden_metric_labels :user_id, :email, :ip_address
|
|
117
|
+
safe_metric_labels :user_segment, :country, :plan
|
|
118
|
+
|
|
119
|
+
# Default cardinality limit
|
|
120
|
+
default_cardinality_limit 100
|
|
121
|
+
end
|
|
122
|
+
end
|
|
123
|
+
|
|
124
|
+
# Inherit from base
|
|
125
|
+
class Events::UserAction < Events::BaseUserEvent
|
|
126
|
+
schema do
|
|
127
|
+
required(:user_id).filled(:string)
|
|
128
|
+
required(:action_type).filled(:string)
|
|
129
|
+
end
|
|
130
|
+
|
|
131
|
+
metric :counter,
|
|
132
|
+
name: 'user_actions_total',
|
|
133
|
+
tags: [:user_segment, :action_type] # ← Uses safe labels
|
|
134
|
+
# ← Inherits: forbidden_metric_labels + safe_metric_labels
|
|
135
|
+
end
|
|
136
|
+
|
|
137
|
+
class Events::UserProfileUpdated < Events::BaseUserEvent
|
|
138
|
+
schema do
|
|
139
|
+
required(:user_id).filled(:string)
|
|
140
|
+
required(:field_name).filled(:string)
|
|
141
|
+
end
|
|
142
|
+
|
|
143
|
+
metric :counter,
|
|
144
|
+
name: 'profile_updates_total',
|
|
145
|
+
tags: [:user_segment, :field_name]
|
|
146
|
+
# ← Inherits: forbidden_metric_labels + safe_metric_labels
|
|
147
|
+
end
|
|
148
|
+
```
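
One way such class-level rules could be inherited is with class instance variables that fall back to the superclass. The sketch below is an assumption about the mechanism, not E11y's actual code; `SketchEventBase` and the `*Sketch` classes are hypothetical stand-ins for `E11y::Event::Base` and the events above.

```ruby
# Hypothetical stand-in for E11y::Event::Base: only the label DSL is modelled.
class SketchEventBase
  class << self
    # Declares forbidden labels on this class. With no arguments it returns
    # the labels declared here plus everything inherited from ancestors.
    def forbidden_metric_labels(*labels)
      @forbidden_metric_labels ||= []
      @forbidden_metric_labels |= labels
      parent = superclass
      inherited = parent.respond_to?(:forbidden_metric_labels) ? parent.forbidden_metric_labels : []
      inherited | @forbidden_metric_labels
    end

    # Sets the limit when given a value; otherwise falls back to the parent,
    # and finally to the conventional default of 100.
    def default_cardinality_limit(value = nil)
      @default_cardinality_limit = value if value
      return @default_cardinality_limit if @default_cardinality_limit

      parent = superclass
      parent.respond_to?(:default_cardinality_limit) ? parent.default_cardinality_limit : 100
    end
  end
end

class BaseUserEventSketch < SketchEventBase
  forbidden_metric_labels :user_id, :email, :ip_address
  default_cardinality_limit 100
end

class UserActionSketch < BaseUserEventSketch
  forbidden_metric_labels :session_id # added on top of the inherited labels
end

class ProfileUpdatedSketch < BaseUserEventSketch
  default_cardinality_limit 50 # overrides the inherited limit
end

p UserActionSketch.forbidden_metric_labels
p ProfileUpdatedSketch.default_cardinality_limit
```

In this sketch a subclass can only add forbidden labels, never remove them, which keeps a base-class denylist from being weakened by accident.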
|
|
149
|
+
|
|
150
|
+
**Preset modules for cardinality protection:**
|
|
151
|
+
|
|
152
|
+
```ruby
|
|
153
|
+
# lib/e11y/presets/metric_safe_event.rb
|
|
154
|
+
module E11y
|
|
155
|
+
module Presets
|
|
156
|
+
module MetricSafeEvent
|
|
157
|
+
extend ActiveSupport::Concern
|
|
158
|
+
included do
|
|
159
|
+
# Common forbidden labels (high cardinality)
|
|
160
|
+
forbidden_metric_labels :user_id, :order_id, :session_id,
|
|
161
|
+
:trace_id, :request_id, :email,
|
|
162
|
+
:ip_address, :uuid
|
|
163
|
+
|
|
164
|
+
# Common safe labels (low cardinality)
|
|
165
|
+
safe_metric_labels :status, :severity, :country,
|
|
166
|
+
:plan, :segment, :method
|
|
167
|
+
|
|
168
|
+
# Default cardinality limit
|
|
169
|
+
default_cardinality_limit 100
|
|
170
|
+
|
|
171
|
+
# Auto-aggregate on limit
|
|
172
|
+
cardinality_monitoring do
|
|
173
|
+
warn_threshold 0.7
|
|
174
|
+
auto_aggregate true
|
|
175
|
+
end
|
|
176
|
+
end
|
|
177
|
+
end
|
|
178
|
+
end
|
|
179
|
+
end
|
|
180
|
+
|
|
181
|
+
# Usage:
|
|
182
|
+
class Events::OrderPlaced < E11y::Event::Base
|
|
183
|
+
include E11y::Presets::MetricSafeEvent # ← Cardinality rules inherited!
|
|
184
|
+
|
|
185
|
+
schema do
|
|
186
|
+
required(:order_id).filled(:string)
|
|
187
|
+
required(:user_id).filled(:string)
|
|
188
|
+
required(:status).filled(:string)
|
|
189
|
+
end
|
|
190
|
+
|
|
191
|
+
metric :counter,
|
|
192
|
+
name: 'orders_total',
|
|
193
|
+
tags: [:status] # ← Only safe labels (status)
|
|
194
|
+
# ← Inherits: forbidden_metric_labels (user_id blocked!)
|
|
195
|
+
end
|
|
196
|
+
```
|
|
197
|
+
|
|
198
|
+
**Tag extractors (aggregation):**
|
|
199
|
+
|
|
200
|
+
```ruby
|
|
201
|
+
# app/events/user_action.rb
|
|
202
|
+
module Events
|
|
203
|
+
class UserAction < E11y::Event::Base
|
|
204
|
+
schema do
|
|
205
|
+
required(:user_id).filled(:string)
|
|
206
|
+
required(:action_type).filled(:string)
|
|
207
|
+
end
|
|
208
|
+
|
|
209
|
+
# ✨ Event-level tag extractors (aggregate user_id → segment)
|
|
210
|
+
metric :counter,
|
|
211
|
+
name: 'user_actions_total',
|
|
212
|
+
tags: [:user_segment, :action_type],
|
|
213
|
+
tag_extractors: {
|
|
214
|
+
user_segment: ->(event) {
|
|
215
|
+
user = User.find(event.payload[:user_id])
|
|
216
|
+
user.segment # 'free', 'paid', 'enterprise'
|
|
217
|
+
}
|
|
218
|
+
},
|
|
219
|
+
cardinality_limit: 30 # 3 segments × 10 actions
|
|
220
|
+
end
|
|
221
|
+
end
|
|
222
|
+
```
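
To make the extractor mechanics concrete, here is a minimal sketch of how tags and `tag_extractors` might be turned into a label hash. `SketchEvent`, `extract_labels`, and the `SEGMENTS` table are hypothetical; the table stands in for the `User.find` lookup so the example runs without a database.

```ruby
# Hypothetical event object: only #payload is assumed.
SketchEvent = Struct.new(:payload)

# Builds the label hash for a metric. A tag with an extractor is computed
# from the event; any other tag is read straight from the payload.
def extract_labels(event, tags:, tag_extractors: {})
  tags.to_h do |tag|
    extractor = tag_extractors[tag]
    value = extractor ? extractor.call(event) : event.payload[tag]
    [tag, value.to_s]
  end
end

# In-memory stand-in for the User.find lookup used above.
SEGMENTS = { "u1" => "free", "u2" => "enterprise" }.freeze

labels = extract_labels(
  SketchEvent.new({ user_id: "u2", action_type: "login" }),
  tags: [:user_segment, :action_type],
  tag_extractors: {
    user_segment: ->(event) { SEGMENTS.fetch(event.payload[:user_id], "unknown") }
  }
)

p labels
```

Only the declared tags end up in the label hash, so `user_id` is used as an input to the extractor but never becomes a label. In production an extractor that hits the database on every event is worth caching or replacing with a field already present in the payload.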
|
|
223
|
+
|
|
224
|
+
**Conventions for cardinality (sensible defaults):**
|
|
225
|
+
|
|
226
|
+
```ruby
|
|
227
|
+
# Convention: Default cardinality limit = 100 series per metric
|
|
228
|
+
# Convention: Common forbidden labels auto-blocked
|
|
229
|
+
|
|
230
|
+
# Zero-config event (uses conventions):
|
|
231
|
+
class Events::OrderCreated < E11y::Event::Base
|
|
232
|
+
schema do
|
|
233
|
+
required(:order_id).filled(:string)
|
|
234
|
+
required(:status).filled(:string)
|
|
235
|
+
end
|
|
236
|
+
|
|
237
|
+
metric :counter,
|
|
238
|
+
name: 'orders_total',
|
|
239
|
+
tags: [:status] # ← Safe (low cardinality)
|
|
240
|
+
# ← Auto: cardinality_limit = 100 (default)
|
|
241
|
+
# ← Auto: order_id blocked (common forbidden label)
|
|
242
|
+
end
|
|
243
|
+
|
|
244
|
+
# Override convention:
|
|
245
|
+
class Events::OrderCreated < E11y::Event::Base
|
|
246
|
+
schema do; required(:order_id).filled(:string); required(:status).filled(:string); end
|
|
247
|
+
|
|
248
|
+
metric :counter,
|
|
249
|
+
name: 'orders_total',
|
|
250
|
+
tags: [:status],
|
|
251
|
+
cardinality_limit: 50 # ← Override: 50 (not 100)
|
|
252
|
+
end
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
**Precedence (event-level overrides global):**
|
|
256
|
+
|
|
257
|
+
```ruby
|
|
258
|
+
# Global config (infrastructure):
|
|
259
|
+
E11y.configure do |config|
|
|
260
|
+
config.cardinality_protection do
|
|
261
|
+
forbidden_labels :user_id, :order_id # Global defaults
|
|
262
|
+
default_cardinality_limit 100
|
|
263
|
+
end
|
|
264
|
+
end
|
|
265
|
+
|
|
266
|
+
# Event-level config (overrides global):
|
|
267
|
+
class Events::UserAction < E11y::Event::Base
|
|
268
|
+
forbidden_metric_labels :user_id, :session_id # ← Override: adds session_id
|
|
269
|
+
default_cardinality_limit 50 # ← Override: 50 (not 100)
|
|
270
|
+
end
|
|
271
|
+
```
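
The precedence rule can be pictured as a small merge function. This sketch assumes that event-level forbidden labels are added to the global list (as the "adds session_id" comment suggests) and that an event-level limit replaces the global one; `effective_cardinality_config` and the hash layout are hypothetical.

```ruby
GLOBAL_DEFAULTS = {
  forbidden_labels: [:user_id, :order_id],
  cardinality_limit: 100
}.freeze

# Combines global and event-level settings:
# - forbidden labels are unioned (an event can add, not remove)
# - the event-level limit wins when present
def effective_cardinality_config(global, event_level)
  {
    forbidden_labels: global[:forbidden_labels] | event_level.fetch(:forbidden_labels, []),
    cardinality_limit: event_level.fetch(:cardinality_limit, global[:cardinality_limit])
  }
end

config = effective_cardinality_config(
  GLOBAL_DEFAULTS,
  { forbidden_labels: [:user_id, :session_id], cardinality_limit: 50 }
)

p config
```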
|
|
272
|
+
|
|
273
|
+
**Benefits:**
|
|
274
|
+
- ✅ Locality of behavior (cardinality rules next to schema)
|
|
275
|
+
- ✅ DRY via inheritance/presets
|
|
276
|
+
- ✅ Sensible defaults (100 series limit)
|
|
277
|
+
- ✅ Easy to override when needed
|
|
278
|
+
- ✅ Tag extractors co-located with metrics
|
|
279
|
+
|
|
280
|
+
---
|
|
281
|
+
|
|
282
|
+
## 🎯 The 4-Layer Defense System
|
|
283
|
+
|
|
284
|
+
### Layer Processing Flow
|
|
285
|
+
|
|
286
|
+
> **Implementation:** See [ADR-002 Section 4.1: Four-Layer Defense](../ADR-002-metrics-yabeda.md#41-four-layer-defense) for detailed architecture.
|
|
287
|
+
|
|
288
|
+
**🔑 Critical: Layers execute SEQUENTIALLY (not simultaneously).**
|
|
289
|
+
|
|
290
|
+
Each label is evaluated against the layers **in order**. As soon as a layer makes a decision (DROP/KEEP), the remaining layers are skipped for that label:
|
|
291
|
+
|
|
292
|
+
```
|
|
293
|
+
┌────────────────────────────────────────────────────────┐
|
|
294
|
+
│ Incoming Event: { user_id: 123, status: 'paid' } │
|
|
295
|
+
└────────────────────────────────────────────────────────┘
|
|
296
|
+
↓
|
|
297
|
+
┌──────────────────────────────┐
|
|
298
|
+
│ For EACH label in event: │
|
|
299
|
+
└──────────────────────────────┘
|
|
300
|
+
↓
|
|
301
|
+
╔═══════════════════════════════════════╗
|
|
302
|
+
║ Layer 1: Universal Denylist ║
|
|
303
|
+
║ Q: Is label in FORBIDDEN_LABELS? ║
|
|
304
|
+
╚═══════════════════════════════════════╝
|
|
305
|
+
↓
|
|
306
|
+
┌───────────┴───────────┐
|
|
307
|
+
│ YES │ NO
|
|
308
|
+
↓ ↓
|
|
309
|
+
❌ DROP Continue
|
|
310
|
+
(stop here) ↓
|
|
311
|
+
╔═══════════════════════════════════════╗
|
|
312
|
+
║ Layer 2: Safe Allowlist ║
|
|
313
|
+
║ Q: Is label in SAFE_LABELS? ║
|
|
314
|
+
╚═══════════════════════════════════════╝
|
|
315
|
+
↓
|
|
316
|
+
┌───────────┴───────────┐
|
|
317
|
+
│ YES │ NO
|
|
318
|
+
↓ ↓
|
|
319
|
+
✅ KEEP Continue
|
|
320
|
+
(skip Layer 3-4) ↓
|
|
321
|
+
╔═══════════════════════════════════════╗
|
|
322
|
+
║ Layer 3: Per-Metric Cardinality Limit ║
|
|
323
|
+
║ Q: Is cardinality < limit? ║
|
|
324
|
+
╚═══════════════════════════════════════╝
|
|
325
|
+
↓
|
|
326
|
+
┌───────────┴───────────┐
|
|
327
|
+
│ YES │ NO
|
|
328
|
+
↓ ↓
|
|
329
|
+
✅ KEEP Continue
|
|
330
|
+
↓
|
|
331
|
+
╔═══════════════════════════════════════╗
|
|
332
|
+
║ Layer 4: Dynamic Action ║
|
|
333
|
+
║ Execute: drop/alert/sample ║
|
|
334
|
+
╚═══════════════════════════════════════╝
|
|
335
|
+
↓
|
|
336
|
+
❌ DROP (or alert)
|
|
337
|
+
```
|
|
338
|
+
|
|
339
|
+
**Example: Processing 3 labels**
|
|
340
|
+
|
|
341
|
+
```ruby
|
|
342
|
+
# Incoming event
|
|
343
|
+
Events::OrderPlaced.track(
|
|
344
|
+
user_id: 'user_12345', # ← Label 1
|
|
345
|
+
status: 'paid', # ← Label 2
|
|
346
|
+
custom_field: 'special_123' # ← Label 3
|
|
347
|
+
)
|
|
348
|
+
|
|
349
|
+
# Processing:
|
|
350
|
+
|
|
351
|
+
# user_id:
|
|
352
|
+
# → Layer 1: in FORBIDDEN_LABELS? ✅ YES → ❌ DROP (stop, skip Layer 2-4)
|
|
353
|
+
# Result: user_id not included in metric
|
|
354
|
+
|
|
355
|
+
# status:
|
|
356
|
+
# → Layer 1: in FORBIDDEN_LABELS? ❌ NO → continue
|
|
357
|
+
# → Layer 2: in SAFE_LABELS? ✅ YES → ✅ KEEP (skip Layer 3-4)
|
|
358
|
+
# Result: status='paid' included in metric
|
|
359
|
+
|
|
360
|
+
# custom_field:
|
|
361
|
+
# → Layer 1: in FORBIDDEN_LABELS? ❌ NO → continue
|
|
362
|
+
# → Layer 2: in SAFE_LABELS? ❌ NO → continue
|
|
363
|
+
# → Layer 3: cardinality < limit? ❌ NO (150 > 100) → continue
|
|
364
|
+
# → Layer 4: action = :drop → ❌ DROP
|
|
365
|
+
# Result: custom_field not included in metric
|
|
366
|
+
|
|
367
|
+
# Final metric:
|
|
368
|
+
# orders_total{status="paid"} 1
|
|
369
|
+
```
|
|
370
|
+
|
|
371
|
+
**Key Properties:**
|
|
372
|
+
|
|
373
|
+
1. **Early Exit Optimization:** If Layer 1 drops a label, Layers 2-4 never execute (performance optimization).
|
|
374
|
+
2. **Safe Labels Fast Path:** Layer 2 approval skips expensive cardinality tracking (Layers 3-4).
|
|
375
|
+
3. **Fallback to Dynamic Action:** Only labels that pass Layer 1-2 but fail Layer 3 reach Layer 4.
|
|
376
|
+
4. **Order Matters:** Changing layer order breaks the protection model (e.g., Layer 3 before Layer 1 = wrong).
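
The sequential flow and its early exits can be sketched as a single method with guard clauses. This is an illustrative model, not the gem's implementation: `LabelFilterSketch` and its parameters are hypothetical, and cardinality is tracked per label rather than per metric to keep the example short.

```ruby
require "set"

# Minimal model of the four sequential layers.
class LabelFilterSketch
  attr_reader :alerts

  def initialize(forbidden:, safe:, limit:, overflow_action: :drop)
    @forbidden = forbidden.to_set
    @safe = safe.to_set
    @limit = limit
    @overflow_action = overflow_action
    @seen = Hash.new { |hash, label| hash[label] = Set.new } # label => values seen
    @alerts = []
  end

  # Returns :keep or :drop for one label/value pair.
  def decide(label, value)
    return :drop if @forbidden.include?(label)     # Layer 1: denylist, stop here
    return :keep if @safe.include?(label)          # Layer 2: allowlist, skip 3-4
    return :keep if under_limit?(label, value)     # Layer 3: cardinality limit

    @alerts << label if @overflow_action == :alert # Layer 4: dynamic action
    :drop
  end

  # Keeps only the labels that survive the layers.
  def filter(labels)
    labels.select { |label, value| decide(label, value) == :keep }
  end

  private

  # True if the value was seen before or there is still room for a new one.
  def under_limit?(label, value)
    values = @seen[label]
    return true if values.include?(value)
    return false if values.size >= @limit

    values.add(value)
    true
  end
end

filter = LabelFilterSketch.new(forbidden: [:user_id], safe: [:status], limit: 2)
filter.filter({ custom_field: "a" }) # fills the limit for custom_field
filter.filter({ custom_field: "b" })

result = filter.filter({ user_id: "user_12345", status: "paid", custom_field: "special_123" })
p result
```

Each guard clause returns as soon as a layer decides, which is the early-exit behaviour described in the key properties.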
|
|
377
|
+
|
|
378
|
+
**Performance Impact:**
|
|
379
|
+
|
|
380
|
+
| Scenario | Layers Executed | Time | Example |
|
|
381
|
+
|---|---|---|---|
|
|
382
|
+
| Forbidden label | Layer 1 only | ~0.001ms | `user_id` |
|
|
383
|
+
| Safe label | Layer 1-2 | ~0.002ms | `status`, `method` |
|
|
384
|
+
| New label (under limit) | Layer 1-3 | ~0.01ms | `custom_field` (90th unique value) |
|
|
385
|
+
| Overflow label | Layer 1-4 | ~0.02ms | `custom_field` (101st unique value) |
|
|
386
|
+
|
|
387
|
+
**Why Sequential?**
|
|
388
|
+
|
|
389
|
+
```ruby
|
|
390
|
+
# ❌ WRONG: Parallel layer execution
|
|
391
|
+
# Problem: All layers execute simultaneously, wasting CPU on labels already dropped
|
|
392
|
+
|
|
393
|
+
# ✅ CORRECT: Sequential execution
|
|
394
|
+
# Benefit: A forbidden label exits after Layer 1, skipping 3 of the 4 layers
|
|
395
|
+
```
|
|
396
|
+
|
|
397
|
+
---
|
|
398
|
+
|
|
399
|
+
### Layer 1: Denylist (Hard Block)
|
|
400
|
+
|
|
401
|
+
> **⚠️ CRITICAL: Adapter-Specific Filtering**
|
|
402
|
+
> **Implementation:** See [ADR-002 Section 4.2: Layer 1 - Universal Denylist](../ADR-002-metrics-yabeda.md#42-layer-1-universal-denylist) for detailed architecture.
|
|
403
|
+
>
|
|
404
|
+
> **Cardinality protection (denylist/allowlist) applies ONLY to metrics adapters (Yabeda/Prometheus), NOT to other adapters:**
|
|
405
|
+
>
|
|
406
|
+
> | Adapter Type | Denylist Applied? | Why? |
|
|
407
|
+
> |---|---|---|
|
|
408
|
+
> | **Metrics (Yabeda/Prometheus)** | ✅ YES | High-cardinality labels cause memory explosion in time-series databases, because every unique label combination is a separate series (1M series = 1GB RAM). |
|
|
409
|
+
> | **Logs (Loki)** | ❌ NO | Loki indexes only stream labels, not log content, so high-cardinality fields such as `user_id` belong in the log line rather than in labels. Full payload preserved. |
|
|
410
|
+
> | **Errors (Sentry)** | ❌ NO | Sentry needs full context for debugging. High cardinality is acceptable for error tracking. |
|
|
411
|
+
> | **Audit (File/PostgreSQL)** | ❌ NO | Audit trails require complete, unfiltered data for compliance. |
|
|
412
|
+
>
|
|
413
|
+
> **Example:**
|
|
414
|
+
> ```ruby
|
|
415
|
+
> # Event with user_id (forbidden for metrics)
|
|
416
|
+
> Events::UserAction.track(user_id: "12345", action: "login")
|
|
417
|
+
>
|
|
418
|
+
> # What happens:
|
|
419
|
+
> # ✅ Prometheus: { action="login" } ← user_id DROPPED
|
|
420
|
+
> # ✅ Loki:       { user_id="12345", action="login" }  ← user_id PRESERVED (in the log line, not as a stream label)
|
|
421
|
+
> # ✅ Sentry: { user_id="12345", action="login" } ← user_id PRESERVED
|
|
422
|
+
> # ✅ Audit: { user_id="12345", action="login" } ← user_id PRESERVED
|
|
423
|
+
> ```
|
|
424
|
+
>
|
|
425
|
+
> **Why This Matters:**
|
|
426
|
+
> - ✅ **Metrics stay safe:** Prometheus won't OOM due to cardinality explosion
|
|
427
|
+
> - ✅ **Debugging stays rich:** Loki/Sentry get full context for troubleshooting
|
|
428
|
+
> - ✅ **Compliance stays intact:** Audit logs remain complete and unfiltered
|
|
429
|
+
> - ✅ **Best of both worlds:** Safety for metrics + completeness for logs/errors
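
The split between adapters can be sketched as two small functions applied to the same payload. The names `metric_labels` and `log_payload` are hypothetical and only illustrate the idea that filtering happens on the metrics path alone.

```ruby
FORBIDDEN_METRIC_LABELS = %i[user_id session_id].freeze

# Metrics adapters receive only the labels that are not on the denylist.
def metric_labels(payload)
  payload.reject { |key, _value| FORBIDDEN_METRIC_LABELS.include?(key) }
end

# Log, error, and audit adapters receive a copy of the full payload.
def log_payload(payload)
  payload.dup
end

payload = { user_id: "12345", action: "login" }

p metric_labels(payload) # labels sent to the metrics backend
p log_payload(payload)   # payload sent to logs, errors, and audit
```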
|
|
430
|
+
|
|
431
|
+
**Universal denylist - NEVER use these as labels (for metrics adapters):**
|
|
432
|
+
|
|
433
|
+
```ruby
|
|
434
|
+
E11y.configure do |config|
|
|
435
|
+
config.metrics do
|
|
436
|
+
# === UNBOUNDED IDENTIFIERS (FORBIDDEN) ===
|
|
437
|
+
forbidden_labels :user_id, :customer_id, :account_id,
|
|
438
|
+
:order_id, :transaction_id, :invoice_id,
|
|
439
|
+
:session_id, :request_id, :trace_id, :span_id
|
|
440
|
+
|
|
441
|
+
# === INFRASTRUCTURE (FORBIDDEN) ===
|
|
442
|
+
forbidden_labels :pod_uid, :container_id, :instance_id,
|
|
443
|
+
:node_name # If dynamic
|
|
444
|
+
|
|
445
|
+
# === NETWORK/HTTP (FORBIDDEN) ===
|
|
446
|
+
forbidden_labels :url, # With query strings
|
|
447
|
+
:ip_address,
|
|
448
|
+
:user_agent,
|
|
449
|
+
:hostname # If ephemeral
|
|
450
|
+
|
|
451
|
+
# === TIME-BASED AND VERSION (FORBIDDEN) ===
|
|
452
|
+
forbidden_labels :timestamp, :created_at,
|
|
453
|
+
:version # Patch-level: 2.5.7234
|
|
454
|
+
|
|
455
|
+
# === ENFORCEMENT ===
|
|
456
|
+
enforcement :strict        # ERROR on forbidden label usage
# Alternatives (choose exactly one mode):
# enforcement :warn        # Log warning but allow
# enforcement :aggregate   # Auto-aggregate to "_other"
|
|
461
|
+
end
|
|
462
|
+
end
|
|
463
|
+
|
|
464
|
+
# Usage:
|
|
465
|
+
counter_for pattern: 'user.action',
|
|
466
|
+
tags: [:user_id] # ← ERROR: "user_id is forbidden!"
|
|
467
|
+
|
|
468
|
+
# Error reported with enforcement :strict:
|
|
469
|
+
# [E11y ERROR] Metric 'user.action_total' uses forbidden label 'user_id'
|
|
470
|
+
# Cardinality explosion risk! Use 'user_segment' instead.
|
|
471
|
+
```
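
The three enforcement modes could behave roughly as follows. This is a sketch under assumptions: `enforce_label` and `ForbiddenLabelError` are hypothetical names, and `:aggregate` is modelled as replacing the label's value with `"_other"`, which is one reading of the comment above.

```ruby
class ForbiddenLabelError < StandardError; end

# Applies one enforcement mode to a single label/value pair and returns
# the pair that should be recorded.
def enforce_label(label, value, forbidden:, mode:, warnings: [])
  return [label, value] unless forbidden.include?(label)

  case mode
  when :strict
    raise ForbiddenLabelError, "#{label} is forbidden as a metric label"
  when :warn
    warnings << "forbidden label #{label} used" # a real system would log this
    [label, value]
  when :aggregate
    [label, "_other"] # collapse every value into one bucket
  else
    raise ArgumentError, "unknown enforcement mode: #{mode.inspect}"
  end
end

warnings = []
p enforce_label(:user_id, "42", forbidden: [:user_id], mode: :warn, warnings: warnings)
p enforce_label(:user_id, "42", forbidden: [:user_id], mode: :aggregate)
```

Note that `:warn` still records the high-cardinality value, so it suits development or migration rather than production protection.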
|
|
472
|
+
|
|
473
|
+
---
|
|
474
|
+
|
|
475
|
+
### Layer 2: Allowlist (Strict Mode)
|
|
476
|
+
|
|
477
|
+
**Only allow explicitly safe labels:**
|
|
478
|
+
|
|
479
|
+
```ruby
|
|
480
|
+
E11y.configure do |config|
|
|
481
|
+
config.metrics do
|
|
482
|
+
# Strict mode: ONLY these labels allowed
|
|
483
|
+
allowed_labels_only true
|
|
484
|
+
|
|
485
|
+
# === BUSINESS DIMENSIONS (< 50 values) ===
|
|
486
|
+
allowed_labels :status, # pending, paid, failed (4-10 values)
|
|
487
|
+
:payment_method, # card, paypal (5-20 values)
|
|
488
|
+
:plan_tier # free, pro, enterprise (3-5 values)
|
|
489
|
+
|
|
490
|
+
# === INFRASTRUCTURE (< 20 values) ===
|
|
491
|
+
allowed_labels :env, # production, staging, dev (3 values)
|
|
492
|
+
:region, # us-east, eu-west (5-20 values)
|
|
493
|
+
:cluster, # main, backup (2-5 values)
|
|
494
|
+
:availability_zone
|
|
495
|
+
|
|
496
|
+
# === HTTP/SERVICE (< 100 values) ===
|
|
497
|
+
allowed_labels :http_method, # GET, POST, PUT, DELETE (10 values)
|
|
498
|
+
:http_status_code, # 200, 404, 500 (50 values)
|
|
499
|
+
:controller_action # UsersController#show (20-100 values)
|
|
500
|
+
end
|
|
501
|
+
end
|
|
502
|
+
|
|
503
|
+
# Usage:
|
|
504
|
+
counter_for pattern: 'order.paid',
|
|
505
|
+
tags: [:currency] # ← ERROR: "currency not in allowlist!"
|
|
506
|
+
|
|
507
|
+
# Must explicitly allow:
|
|
508
|
+
allowed_labels :currency # USD, EUR, GBP (3-20 values)
|
|
509
|
+
```
|
|
510
|
+
|
|
511
|
+
**Rule of thumb:**
|
|
512
|
+
- ✅ **< 10 values** - Always safe
|
|
513
|
+
- 🟡 **10-100 values** - Usually OK, monitor
|
|
514
|
+
- 🔴 **> 100 values** - High risk, aggregate!
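
The rule of thumb translates directly into a small classifier. `cardinality_risk` is a hypothetical helper; the thresholds are the ones listed above.

```ruby
# Classifies a label by its number of distinct values.
def cardinality_risk(distinct_values)
  if distinct_values < 10
    :safe      # always safe
  elsif distinct_values <= 100
    :monitor   # usually OK, keep an eye on growth
  else
    :high_risk # aggregate before using as a label
  end
end

{ env: 3, http_status_code: 50, user_id: 100_000 }.each do |label, count|
  puts "#{label}: #{cardinality_risk(count)}"
end
```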
|
|
515
|
+
|
|
516
|
+
---
|
|
517
|
+
|
|
518
|
+
### Layer 3: Per-Metric Limits
|
|
519
|
+
|
|
520
|
+
**Set cardinality limits per metric:**
|
|
521
|
+
|
|
522
|
+
```ruby
|
|
523
|
+
E11y.configure do |config|
|
|
524
|
+
config.metrics do
|
|
525
|
+
# === GLOBAL DEFAULT ===
|
|
526
|
+
default_cardinality_limit 1_000
|
|
527
|
+
|
|
528
|
+
# === PER-METRIC LIMITS ===
|
|
529
|
+
cardinality_limit_for 'http.requests' do
|
|
530
|
+
max_cardinality 2_000 # Higher limit for this metric
|
|
531
|
+
overflow_strategy :drop # → Drop overflow events
|
|
532
|
+
overflow_sample_rate 0.1 # Sample 10% of overflow events
|
|
533
|
+
end
|
|
534
|
+
|
|
535
|
+
cardinality_limit_for 'user.actions' do
|
|
536
|
+
max_cardinality 500 # Lower limit
|
|
537
|
+
overflow_strategy :drop # Drop overflow events
|
|
538
|
+
overflow_alert true # Alert on overflow
|
|
539
|
+
end
|
|
540
|
+
|
|
541
|
+
cardinality_limit_for 'orders.paid' do
|
|
542
|
+
max_cardinality 100
|
|
543
|
+
overflow_strategy :alert # Alert ops team + drop
|
|
544
|
+
end
|
|
545
|
+
end
|
|
546
|
+
end
|
|
547
|
+
|
|
548
|
+
# How it works:
|
|
549
|
+
# 1. Track unique label combinations per metric
|
|
550
|
+
# 2. If exceeds limit:
|
|
551
|
+
# - :drop → Discard overflow events (increment drop counter)
|
|
552
|
+
# - :alert → Alert ops team + drop
|
|
553
|
+
#
|
|
554
|
+
# NOTE: For aggregation/relabeling (e.g., user_id → user_segment),
|
|
555
|
+
# use tag_extractors (see "Aggregation" section below),
|
|
556
|
+
# NOT overflow_strategy.
|
|
557
|
+
```
|
|
558
|
+
|
|
559
|
+
**Overflow strategies:**
|
|
560
|
+
|
|
561
|
+
| Strategy | Behavior | Use Case |
|
|
562
|
+
|----------|----------|----------|
|
|
563
|
+
| `:drop` | Discard overflow events | Default, simplest |
|
|
564
|
+
| `:alert` | Alert ops team + drop | Critical metrics |
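
A minimal sketch of how the two strategies might be applied when tracking unique label combinations per metric. `SeriesLimiterSketch` is hypothetical, and the "alert" is modelled as appending to an array rather than paging anyone.

```ruby
require "set"

class SeriesLimiterSketch
  attr_reader :dropped, :alerts

  def initialize(max_cardinality:, overflow_strategy: :drop)
    @max = max_cardinality
    @strategy = overflow_strategy
    @series = Hash.new { |hash, metric| hash[metric] = Set.new } # metric => label combinations
    @dropped = Hash.new(0)                                       # metric => drop counter
    @alerts = []
  end

  # Returns true if the sample may be recorded, false if it is dropped.
  def allow?(metric, labels)
    combos = @series[metric]
    key = labels.sort_by { |name, _value| name.to_s } # order-independent key

    return true if combos.include?(key)

    if combos.size < @max
      combos << key
      return true
    end

    @dropped[metric] += 1
    @alerts << "#{metric} exceeded #{@max} series" if @strategy == :alert
    false
  end
end

limiter = SeriesLimiterSketch.new(max_cardinality: 2, overflow_strategy: :alert)
limiter.allow?("orders.paid", { status: "paid" })
limiter.allow?("orders.paid", { status: "failed" })
puts limiter.allow?("orders.paid", { status: "refunded" }) # third combination overflows
```

Combinations that were admitted before the limit was reached keep being recorded; only new combinations are rejected.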
|
|
565
|
+
|
|
566
|
+
#### Thread Safety
|
|
567
|
+
|
|
568
|
+
> **Implementation:** See [ADR-002 Section 4.4: Layer 3 - Per-Metric Cardinality Limits](../ADR-002-metrics-yabeda.md#44-layer-3-per-metric-cardinality-limits) for detailed architecture.
|
|
569
|
+
>
|
|
570
|
+
> **Sources:**
|
|
571
|
+
> - [Ruby Hash thread safety - Stack Overflow](https://stackoverflow.com/questions/22674498/thread-safety-for-hashes-in-ruby)
|
|
572
|
+
> - [Mutex performance overhead - Stack Overflow](https://stackoverflow.com/questions/9761899/why-does-this-code-run-slower-with-multiple-threads-even-on-a-multi-core-mach)
|
|
573
|
+
> - [Thread Safety with Mutexes - GoRails](https://gorails.com/episodes/thread-safety-with-mutexes-in-ruby)
|
|
574
|
+
> - [Understanding Ruby Threads and Concurrency - Better Stack](https://betterstack.com/community/guides/scaling-ruby/threads-and-concurrency/)
|
|
575
|
+
|
|
576
|
+
**🔒 Critical: CardinalityTracker is thread-safe by design.**
|
|
577
|
+
|
|
578
|
+
Applications instrumented with E11y often serve many concurrent requests, each potentially emitting events with labels. The `CardinalityTracker` uses a **mutex** to ensure thread-safe tracking of unique label values across concurrent requests.
|
|
579
|
+
|
|
580
|
+
**Why Thread Safety Matters:**
|
|
581
|
+
|
|
582
|
+
```ruby
|
|
583
|
+
# Scenario: 3 concurrent requests tracking same metric
|
|
584
|
+
# Thread 1: track('orders_total', status: 'paid')     # ← Same time
# Thread 2: track('orders_total', status: 'pending')  # ← Same time
# Thread 3: track('orders_total', status: 'paid')     # ← Same time
|
|
587
|
+
|
|
588
|
+
# Without mutex:
|
|
589
|
+
# - Race condition: both Thread 1 & 3 might think 'paid' is new
|
|
590
|
+
# - Tracker corruption: @trackers hash modified by 3 threads simultaneously
|
|
591
|
+
# - Lost updates: Thread 2's 'pending' might be overwritten
|
|
592
|
+
# - RESULT: Incorrect cardinality counts, potential memory leaks
|
|
593
|
+
|
|
594
|
+
# With mutex (actual E11y implementation):
|
|
595
|
+
# - Thread 1 acquires lock → adds 'paid' → releases (1/limit)
|
|
596
|
+
# - Thread 2 acquires lock → adds 'pending' → releases (2/limit)
|
|
597
|
+
# - Thread 3 acquires lock → sees 'paid' exists → releases (2/limit)
|
|
598
|
+
# - RESULT: Correct cardinality = 2
|
|
599
|
+
```
|
|
600
|
+
**Implementation:**

```ruby
# From ADR-002 Section 4.4
class CardinalityTracker
  def initialize(limit: 100)
    @limit = limit
    @trackers = {}  # { metric_name: { label_name: Set[values] } }
    @mutex = Mutex.new  # ← Thread safety
  end

  def check_and_track(metric_name, label_name, value)
    @mutex.synchronize do  # ← Only 1 thread executes this block at a time
      @trackers[metric_name] ||= {}
      @trackers[metric_name][label_name] ||= Set.new

      tracker = @trackers[metric_name][label_name]

      if tracker.include?(value)
        true  # Already seen
      elsif tracker.size < @limit
        tracker.add(value)
        true  # Added, under limit
      else
        false  # Rejected, over limit
      end
    end
  end
end
```

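To make the locking behaviour concrete, the sketch below drives a tracker of the same shape from several threads at once. `DemoTracker` and its `cardinality` reader are hypothetical stand-ins that mirror the ADR-002 snippet; they are not the shipped E11y class.

```ruby
require 'set'

# Hypothetical stand-in mirroring the CardinalityTracker snippet (assumption, not E11y code).
class DemoTracker
  def initialize(limit: 100)
    @limit = limit
    @trackers = {}
    @mutex = Mutex.new
  end

  # Returns true if the value is already known or was added, false if over the limit.
  def check_and_track(metric_name, label_name, value)
    @mutex.synchronize do
      values = ((@trackers[metric_name] ||= {})[label_name] ||= Set.new)
      next true if values.include?(value)
      next false if values.size >= @limit

      values.add(value)
      true
    end
  end

  # Number of unique values currently tracked for one label.
  def cardinality(metric_name, label_name)
    @mutex.synchronize { @trackers.dig(metric_name, label_name)&.size.to_i }
  end
end

tracker = DemoTracker.new(limit: 2)
statuses = %w[paid pending paid refunded]

# One thread per simulated request; Thread#value joins and returns the block result.
results = statuses.map do |status|
  Thread.new { tracker.check_and_track('orders_total', :status, status) }
end.map(&:value)

puts "accepted=#{results.count(true)} rejected=#{results.count(false)}"
puts "cardinality=#{tracker.cardinality('orders_total', :status)}"
```

Which value gets rejected depends on thread scheduling, but the tracked set should never grow past the limit.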
**Performance Impact:**

⚠️ **Reality Check:** Mutex synchronization has measurable overhead, especially under high concurrency:

- **Single-threaded baseline:** Hash lookup + Set operation ~0.001ms (1 microsecond)
- **With Mutex (low contention):** ~0.005-0.01ms (5-10 microseconds) - 5-10x slower
- **With Mutex (high contention):** Can degrade significantly due to cache coherency overhead

**Why slower?** Each `@mutex.synchronize` call requires the runtime to:
1. Acquire the lock (coordinating with other threads and cores)
2. Read shared state that another core may have just modified, which can miss the L1/L2 cache and fall back to main memory (roughly 100x slower than a cache hit)
3. Release the lock (waking any waiting threads)

**Mitigation:** E11y minimizes overhead by:
- Keeping the critical section **extremely short** (hash lookup + set add only)
- Using simple data structures (Hash + Set, not complex objects)
- Avoiding I/O or heavy computation inside the `synchronize` block

**Real-world impact:** For most applications (100-1000 concurrent requests), mutex overhead is acceptable compared to the catastrophic cost of NOT having thread safety (corrupted cardinality counts, memory leaks, incorrect metrics).

**Monitoring Thread Contention:**

If you suspect mutex contention is becoming a bottleneck, monitor these indicators:

```ruby
# Built-in E11y metrics (no extra config needed)
e11y_cardinality_checks_total  # Total cardinality checks
e11y_cardinality_checks_duration_seconds  # Duration histogram

# Prometheus query to detect contention:
# If p99 latency >> p50, likely contention
histogram_quantile(0.99, rate(e11y_cardinality_checks_duration_seconds_bucket[5m]))
/
histogram_quantile(0.50, rate(e11y_cardinality_checks_duration_seconds_bucket[5m]))
# Ratio > 10 = high contention
```

**If contention becomes critical:**
- Consider using `Concurrent::Map` from concurrent-ruby gem (lock-free for reads)
- Shard cardinality trackers by metric name (separate mutex per metric)
- Profile with `ruby-prof` to identify exact bottleneck

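The sharding option in the list above can be sketched as follows. `ShardedCardinalityTracker` and `Shard` are hypothetical names introduced for illustration, and the design is an assumption about how per-metric locks might look rather than E11y's implementation.

```ruby
require 'set'

# Hypothetical sketch: one mutex per metric instead of a single global mutex.
class ShardedCardinalityTracker
  Shard = Struct.new(:mutex, :labels)

  def initialize(limit: 100)
    @limit = limit
    @shards = {}
    @shards_mutex = Mutex.new # guards only shard lookup/creation
  end

  def check_and_track(metric_name, label_name, value)
    shard = shard_for(metric_name)
    # Threads tracking different metrics no longer wait on each other here.
    shard.mutex.synchronize do
      values = (shard.labels[label_name] ||= Set.new)
      next true if values.include?(value)
      next false if values.size >= @limit

      values.add(value)
      true
    end
  end

  private

  def shard_for(metric_name)
    @shards_mutex.synchronize do
      @shards[metric_name] ||= Shard.new(Mutex.new, {})
    end
  end
end

sharded = ShardedCardinalityTracker.new(limit: 1)
puts sharded.check_and_track('orders_total', :status, 'paid')
puts sharded.check_and_track('payments_total', :status, 'paid')
```

A short global lock still protects the shard table, so this only helps when time is dominated by the per-label set operations; profiling should confirm that before adopting it.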
---

### Layer 4: Dynamic Monitoring

**Auto-detect and alert on high cardinality:**

```ruby
E11y.configure do |config|
  config.metrics do
    cardinality_monitoring do
      # === THRESHOLDS ===
      warn_threshold 0.7  # Alert at 70% of limit
      critical_threshold 0.9  # Critical alert at 90%

      # === AUTO-ADJUSTMENT ===
      auto_adjust do
        enabled true
        threshold 0.8  # Trigger at 80%
        action :aggregate  # Auto-switch to aggregate strategy
        notify :slack  # Notify team
      end

      # === REPORTING ===
      report_interval 1.minute  # Check every minute
      top_violators_count 10  # Track top 10 high-cardinality metrics

      # === ALERTS ===
      on_high_cardinality do |metric_name, current, limit|
        Rails.logger.warn(
          "[E11y] High cardinality: #{metric_name} at #{current}/#{limit}"
        )

        # Send to Slack
        SlackNotifier.notify(
          channel: '#observability',
          message: "⚠️ Metric #{metric_name} cardinality: #{current}/#{limit}"
        )
      end
    end
  end
end
```

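As a rough illustration of how the `warn_threshold` and `critical_threshold` values might be applied, the helper below classifies a metric's current usage. `cardinality_level` is a hypothetical function, not part of the E11y API; only the 0.7 and 0.9 defaults come from the configuration above.

```ruby
# Hypothetical helper: classify current usage against the configured thresholds.
def cardinality_level(current, limit, warn: 0.7, critical: 0.9)
  raise ArgumentError, 'limit must be positive' unless limit.positive?

  ratio = current.to_f / limit
  if ratio >= critical
    :critical
  elsif ratio >= warn
    :warn
  else
    :ok
  end
end

puts cardinality_level(10, 100)
puts cardinality_level(75, 100)
puts cardinality_level(95, 100)
```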
#### Action Selection Guide

> **Implementation:** See [ADR-002 Section 4.5: Layer 4 - Dynamic Actions](../ADR-002-metrics-yabeda.md#45-layer-4-dynamic-actions) for detailed architecture.

**🎯 When cardinality limit is exceeded, which action should you choose?**

Use this decision tree to select the right strategy:

```
┌─────────────────────────────────────┐
│     Cardinality Limit Exceeded      │
└─────────────────────────────────────┘
                  ↓
         ┌────────────────┐
         │  Critical to   │  ← Question 1
         │  investigate?  │
         └────────────────┘
            ↙          ↘
          YES           NO
           ↓             ↓
      ┌─────────┐   ┌──────────────┐
      │  ALERT  │   │  Can group   │  ← Question 2
      │         │   │  values into │
      │ + Drop  │   │  categories? │
      └─────────┘   └──────────────┘
           ↓           ↙        ↘
           │         YES         NO
           │          ↓           ↓
           │     ┌─────────┐  ┌───────┐
           │     │ RELABEL │  │ DROP  │
           │     └─────────┘  └───────┘
           ↓          ↓           ↓
       PagerDuty   Reduced     Silent
         Alert   Cardinality   Removal
```

**Decision Matrix:**

| Action | When to Use | Signal Preserved | Cardinality | Example |
|--------|-------------|------------------|-------------|---------|
| **DROP** | Label not important for analysis | ❌ None (label removed entirely) | 1 (label dropped) | Drop `request_id`, `trace_id` from metrics (keep in logs) |
| **RELABEL** | Clear categories exist (e.g., status codes, paths) | ✅✅✅ High (grouped into buckets) | 5-10 (category count) | `http_status: 200` → `status_class: 2xx` |
| **ALERT** | Unexpected high cardinality, needs investigation | ❌ None + 🚨 (label dropped + ops alerted) | 1 (label dropped) | Sudden spike in unique `customer_id` values |

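The two questions in the decision tree reduce to a small function. The sketch below is a hypothetical helper (`overflow_action` is not an E11y method) that returns the strategy symbol the tree would pick.

```ruby
# Hypothetical helper encoding the decision tree:
#   Question 1: is the overflow critical to investigate?  -> :alert
#   Question 2: can values be grouped into categories?    -> :relabel, else :drop
def overflow_action(critical_to_investigate:, groupable:)
  return :alert if critical_to_investigate

  groupable ? :relabel : :drop
end

puts overflow_action(critical_to_investigate: false, groupable: true)
puts overflow_action(critical_to_investigate: true, groupable: true)
```

Note that Question 1 takes precedence: a label that is critical to investigate is alerted on even when it could also be grouped.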
**Practical Examples:**

**1. DROP - Default for non-critical labels**
```ruby
# ❌ BAD: request_id creates 1M unique series
counter_for pattern: 'api.request',
            tags: [:request_id, :endpoint]  # request_id = high cardinality!

# ✅ GOOD: Drop request_id from metrics
counter_for pattern: 'api.request',
            tags: [:endpoint]  # Only low-cardinality tags

cardinality_limit_for 'api.request' do
  max_cardinality 100
  overflow_strategy :drop  # Silent drop if exceeded
end

# Result: request_id still in logs/traces, just not in metrics
```

**2. RELABEL - Best for known categories**
```ruby
# ❌ BAD: up to ~200 distinct HTTP status values
counter_for pattern: 'http.response',
            tags: [:http_status]  # 200, 201, 204, 400, 401, 403, ...

# ✅ GOOD: Relabel to status classes (5 categories)
counter_for pattern: 'http.response',
            tags: [:status_class],
            tag_extractors: {
              status_class: ->(event) {
                status = event.payload[:http_status].to_i
                case status
                when 100..199 then '1xx'
                when 200..299 then '2xx'
                when 300..399 then '3xx'
                when 400..499 then '4xx'
                when 500..599 then '5xx'
                else 'unknown'
                end
              }
            }

# Result: ~200 values → 5 categories (about 97.5% cardinality reduction)
```

**3. ALERT - For unexpected cardinality spikes**
```ruby
# Payment events should have stable cardinality
cardinality_limit_for 'payments.processed' do
  max_cardinality 50  # Expect ~10 payment methods
  overflow_strategy :alert  # Alert if exceeded
  overflow_sample_rate 0.1  # Sample 10% of overflow events
end

# Scenario: Suddenly 1000 unique payment_method values
# → Alert sent to PagerDuty
# → Label dropped from metrics
# → Ops investigates (possible bug, data corruption, attack)
```

**When NOT to use each action:**

| Action | DON'T Use When | Why |
|--------|---------------|-----|
| DROP | Label is critical for debugging | You lose all visibility into this dimension |
| RELABEL | No clear categories exist | Arbitrary bucketing (e.g., hash-based) loses signal |
| ALERT | High cardinality is expected | Alert fatigue, ops team overwhelmed |

**Common Patterns:**

```ruby
# Pattern 1: DROP non-critical identifiers
# request_id, session_id, trace_id → DROP (keep in logs)
overflow_strategy :drop

# Pattern 2: RELABEL known enums
# http_status, country_code, user_tier → RELABEL (aggregate)
tag_extractors: { status_class: ->(e) { ... } }

# Pattern 3: ALERT on unexpected cardinality
# payment_method, product_sku → ALERT (should be stable)
overflow_strategy :alert
```

**Monitoring Your Decisions:**

```ruby
# Track how often each action triggers
Yabeda.e11y_internal.cardinality_actions_total.values
# => { action: 'drop', metric: 'api.request' } => 42
# => { action: 'alert', metric: 'payments.processed' } => 1

# Prometheus query:
rate(e11y_cardinality_actions_total{action="alert"}[5m])
# → If >0, investigate what's causing unexpected cardinality
```

---

## 💻 Advanced Techniques

### 1. Aggregation (Best ROI - 99% Reduction)

> **Note:** This section describes **relabeling/normalization** (e.g., `user_id` → `user_segment`) via `tag_extractors`, which is different from `overflow_strategy`. Aggregation reduces cardinality **before** metrics are created, while overflow handling (`drop`/`alert`) deals with exceeding limits **after** creation. See [ADR-002 Section 4.5](../ADR-002-metrics-yabeda.md#45-cardinality-protection) for implementation details.

**Problem:** 1M users = 1M metric series

**Solution:** Aggregate to segments

```ruby
# ❌ BAD: 1,000,000 users = 1,000,000 series
counter_for pattern: 'user.action',
            tags: [:user_id]

# ✅ GOOD: 3 segments = 3 series (99.9997% reduction!)
counter_for pattern: 'user.action',
            tags: [:user_segment],
            tag_extractors: {
              user_segment: ->(event) {
                user_id = event.payload[:user_id]
                user = User.find_by(id: user_id)
                user&.segment || 'unknown'  # 'free', 'paid', 'enterprise'
              }
            }

# Result:
# user_actions_total{user_segment="free"} 500000
# user_actions_total{user_segment="paid"} 400000
# user_actions_total{user_segment="enterprise"} 100000
```

**Common aggregation strategies:**

| High-Cardinality Field | Aggregate To | Example Values (Count) |
|------------------------|--------------|------------------------|
| `user_id` (1M) | `user_segment` | free, paid, enterprise (3) |
| `order_id` (10M) | `order_status` | e.g. pending, paid, shipped (4) |
| `ip_address` (100k) | `country` | e.g. US, UK, DE, FR (50) |
| `version` (1000) | `major_version` | 1.x, 2.x, 3.x (3) |
| `url` (10k) | `endpoint_pattern` | e.g. /api/users/:id (100) |

---

### 2. Relabeling & Normalization

**Transform high-cardinality values to low-cardinality:**

```ruby
counter_for pattern: 'http.request',
            tags: [:http_status, :endpoint, :version],
            tag_extractors: {
              # Aggregate status codes: 200..299 → 2xx
              http_status: ->(event) {
                status = event.payload[:status].to_i  # to_i guards against string statuses
                "#{status / 100}xx"  # 200 → "2xx", 404 → "4xx"
              },

              # Normalize endpoints: /api/users/123 → /api/users/:id
              endpoint: ->(event) {
                path = event.payload[:path]
                path.gsub(/\/\d+/, '/:id')  # Replace numbers with :id
              },

              # Major version only: 2.5.7234 → 2.x
              version: ->(event) {
                version = event.payload[:version]
                major = version.split('.').first
                "#{major}.x"
              }
            }

# Before relabeling: 50 status codes × 1000 endpoints × 100 versions = 5M series
# After relabeling: 5 status groups × 100 patterns × 10 major versions = 5k series
# Reduction: 99.9%
```

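The three extractors can be tried outside E11y by applying them to plain values instead of event payloads. The constants below are hypothetical standalone versions written for this illustration; they reuse the same expressions as the `tag_extractors` above.

```ruby
# Hypothetical standalone versions of the extractors, taking plain values
# rather than an event object (assumption made for easy experimentation).
STATUS_CLASS     = ->(status)  { "#{status.to_i / 100}xx" }
ENDPOINT_PATTERN = ->(path)    { path.gsub(%r{/\d+}, '/:id') }
MAJOR_VERSION    = ->(version) { "#{version.to_s.split('.').first}.x" }

puts STATUS_CLASS.call(404)
puts ENDPOINT_PATTERN.call('/api/users/123/orders/7')
puts MAJOR_VERSION.call('2.5.7234')
```

One caveat of the endpoint pattern: it replaces any path segment that starts with digits, so identifiers such as UUIDs or slugs would need their own rule.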
---

### 3. Exemplars (Best of Both Worlds)

**Low-cardinality metrics + high-cardinality exemplars:**

```ruby
counter_for pattern: 'order.paid',
            name: 'orders_paid_total',
            # LOW-cardinality labels (stored for all events)
            tags: [:currency, :payment_method],
            # HIGH-cardinality exemplars (sampled, not stored as labels)
            exemplars: {
              user_id: ->(event) { event.payload[:user_id] },
              order_id: ->(event) { event.payload[:order_id] },
              trace_id: ->(event) { event.trace_id }
            },
            exemplar_sample_rate: 0.01  # Sample 1% of events

# Result in Prometheus:
# Metric: orders_paid_total{currency="USD",payment_method="stripe"} 1234
# Exemplar (sampled): {user_id="12345",order_id="ord_abc",trace_id="xyz"}
#
# Benefits:
# - Low cardinality for storage/query (2 labels)
# - High cardinality context available (3 exemplars, sampled)
# - Can jump from metric to trace via trace_id
```

---

### 4. Streaming Aggregation

**Aggregate BEFORE sending to metrics backend:**

```ruby
E11y.configure do |config|
  config.metrics do
    # Pre-aggregate high-cardinality dimensions
    streaming_aggregation do
      # Aggregate all http.* events
      aggregate pattern: 'http.*' do
        # Keep these dimensions
        keep_dimensions [:controller, :action, :http_status]

        # Drop these dimensions (aggregate out)
        drop_dimensions [:user_id, :session_id, :ip_address]

        # Aggregation window
        window 10.seconds

        # Flush interval
        flush_interval 5.seconds
      end
    end
  end
end

# How it works:
# 1. Events buffered for 10 seconds
# 2. Aggregate by keep_dimensions (drop others)
# 3. Flush aggregated metrics every 5 seconds
# 4. Result: 90% fewer metric updates
```

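Step 2 of the flow above (aggregating by `keep_dimensions`) can be pictured with a few lines of plain Ruby. `aggregate_events` is a hypothetical function that operates on hashes; windowing and flushing are deliberately left out of this sketch.

```ruby
# Hypothetical sketch: collapse buffered events onto the kept dimensions and
# count how many events share each combination.
def aggregate_events(events, keep_dimensions:)
  events.each_with_object(Hash.new(0)) do |event, counts|
    key = event.slice(*keep_dimensions) # dropped dimensions simply vanish
    counts[key] += 1
  end
end

buffer = [
  { controller: 'orders', action: 'show', http_status: 200, user_id: 1 },
  { controller: 'orders', action: 'show', http_status: 200, user_id: 2 },
  { controller: 'orders', action: 'show', http_status: 404, user_id: 3 }
]

aggregated = aggregate_events(buffer, keep_dimensions: %i[controller action http_status])
aggregated.each { |dimensions, count| puts "#{dimensions} => #{count}" }
```

Three buffered events with three different `user_id` values collapse into two counter updates here; the real saving depends on how many events share each kept combination within a window.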
---

### 5. Tiered Retention

**Different retention for different cardinality:**

```ruby
E11y.configure do |config|
  config.metrics do
    # High-cardinality: short retention
    retention_for pattern: 'http.request.*',
                  cardinality: :high,
                  duration: 1.hour,
                  aggregation: :mean  # Downsample to mean after 1 hour

    # Low-cardinality: long retention
    retention_for pattern: 'orders.paid.*',
                  cardinality: :low,
                  duration: 90.days,
                  aggregation: :none  # Keep raw data

    # Auto-classify by actual cardinality
    auto_classify_retention true
    high_cardinality_threshold 1_000
    low_cardinality_threshold 100
  end
end

# Result:
# - High-cardinality metrics: 1 hour raw + 30 days aggregated
# - Low-cardinality metrics: 90 days raw
# - Cost savings: 70% reduction in storage
```

---

### 6. Universal Cardinality Protection (C04 Resolution) ⚠️ CRITICAL

> **⚠️ CRITICAL: C04 Conflict Resolution - Cardinality Protection for ALL Backends**
> **See:** [ADR-009 Section 8](../ADR-009-cost-optimization.md#8-cardinality-protection-c04-resolution--critical) for detailed architecture and cost impact analysis.
> **Problem:** Original UC-013 cardinality protection applied ONLY to Yabeda/Prometheus metrics, but NOT to OpenTelemetry span attributes or Loki log labels. High-cardinality values (`user_id`, `order_id`) bypassed protection and caused cost explosions in OTLP backends (Datadog, Honeycomb).
> **Solution:** Universal `CardinalityFilter` middleware applies protection to **ALL backends** (Yabeda, OpenTelemetry, Loki) with optional per-backend overrides.

**The Problem - Inconsistent Cardinality Protection:**

Before C04 resolution, cardinality protection was **metrics-only**:

```ruby
# ❌ BEFORE C04: Inconsistent protection (cost explosion!)
E11y.configure do |config|
  config.metrics do
    # Cardinality protection for Yabeda/Prometheus ✅
    forbidden_labels :user_id, :order_id
    cardinality_limit_for 'orders_total', max: 100
  end

  # OpenTelemetry: NO cardinality protection! ❌
  config.opentelemetry do
    enabled true
    export_traces true  # Spans include ALL attributes
  end
end

# Event tracking (10,000 unique users):
10_000.times do |i|
  Events::OrderCreated.track(
    order_id: "order-#{i}",  # ← 10,000 unique values!
    user_id: "user-#{i}",  # ← 10,000 unique values!
    amount: 99.99
  )
end

# Result:
# ✅ Prometheus: order_id/user_id PROTECTED (only 100 unique values tracked)
# ❌ OpenTelemetry: order_id/user_id NOT PROTECTED (all 10,000 exported!)
# ❌ Loki: order_id/user_id NOT PROTECTED (index bloat!)

# Cost impact:
# - Datadog: $0.10/span × 10,000 = $1,000/day = $30,000/month 💸
# - Backend cardinality limit exceeded → data loss
```

**The Solution - Universal Cardinality Protection:**

After C04 resolution, protection applies to **ALL backends**:

```ruby
# ✅ AFTER C04: Unified protection (cost savings!)
E11y.configure do |config|
  # GLOBAL cardinality protection (applies to ALL backends)
  config.cardinality_protection do
    enabled true
    max_unique_values 100  # Conservative default (Prometheus-safe)
    protected_labels [:user_id, :order_id, :session_id, :tenant_id]
  end

  # Optional: Per-backend overrides (if needed)
  config.adapters do
    # Yabeda: Use global settings (default)
    yabeda do
      cardinality_protection.inherit_from :global
    end

    # OpenTelemetry: Higher limits OK (OTLP backends handle more)
    opentelemetry do
      cardinality_protection do
        max_unique_values 1000  # OTLP backends can handle more
        protected_labels [:user_id, :order_id]  # Subset of global
      end
    end

    # Loki: Use global settings
    loki do
      cardinality_protection.inherit_from :global
    end
  end
end

# Same event tracking (10,000 unique users):
10_000.times do |i|
  Events::OrderCreated.track(
    order_id: "order-#{i}",
    user_id: "user-#{i}",
    amount: 99.99
  )
end

# Result:
# ✅ Prometheus: order_id/user_id → 100 + [OTHER] (protected)
# ✅ OpenTelemetry: order_id/user_id → 1000 + [OTHER] (protected)
# ✅ Loki: order_id/user_id → 100 + [OTHER] (protected)

# Cost impact:
# - Datadog: $0.01/span × 10,000 = $100/day = $3,000/month ✅
# - Monthly savings: $27,000 💰 (90% reduction!)
```

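The `100 + [OTHER]` behaviour in the results above can be illustrated with a small filter. `DemoCardinalityFilter` is a hypothetical class written for this sketch; the real `CardinalityFilter` middleware is described in ADR-009 and may differ in interface and details.

```ruby
require 'set'

# Hypothetical sketch: cap unique values per protected label and rewrite
# anything beyond the cap to "[OTHER]". Not the actual E11y middleware.
class DemoCardinalityFilter
  OTHER = '[OTHER]'

  def initialize(max_unique_values:, protected_labels:)
    @max = max_unique_values
    @protected = protected_labels
    @seen = Hash.new { |hash, label| hash[label] = Set.new }
    @mutex = Mutex.new
  end

  # Returns a filtered copy of the payload; the input hash is not modified.
  def call(payload)
    @mutex.synchronize do
      payload.to_h do |label, value|
        next [label, value] unless @protected.include?(label)

        seen = @seen[label]
        if seen.include?(value) || seen.size < @max
          seen.add(value)
          [label, value]
        else
          [label, OTHER]
        end
      end
    end
  end
end

filter = DemoCardinalityFilter.new(max_unique_values: 2, protected_labels: [:user_id])
%w[user-1 user-2 user-3 user-1].each do |user_id|
  puts filter.call(user_id: user_id, amount: 99.99)
end
```

Values seen before the cap was reached keep passing through unchanged, so the first `max_unique_values` users stay individually visible while later ones share one bucket.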
**Configuration Examples:**

**1. Production: Strict Limits (Cost-Sensitive)**

```ruby
# config/environments/production.rb
E11y.configure do |config|
  config.cardinality_protection do
    enabled true
    max_unique_values 100  # Prometheus-safe
    protected_labels [:user_id, :order_id, :session_id, :tenant_id, :ip_address]
  end

  # OTLP can handle more (optional override)
  config.adapters.opentelemetry do
    cardinality_protection.max_unique_values 1000
  end
end
```

**2. Development: No Limits (Full Visibility)**

```ruby
# config/environments/development.rb
E11y.configure do |config|
  config.cardinality_protection.enabled false  # Unlimited cardinality
end
```

**3. Staging: Moderate Limits (Balance Cost vs Debugging)**

```ruby
# config/environments/staging.rb
E11y.configure do |config|
  config.cardinality_protection do
    enabled true
    max_unique_values 500  # More than prod, less than unlimited
    protected_labels [:user_id, :order_id]
  end

  # OTLP backend can handle even more
  config.adapters.opentelemetry do
    cardinality_protection.max_unique_values 1000
  end
end
```

**Per-Backend Cardinality Budgets:**

Different backends have different cardinality tolerance:

| Backend | Recommended `max_unique_values` | Why |
|---------|----------------------------------|-----|
| **Prometheus (Yabeda)** | 100 | Time-series DB, high memory usage per series |
| **OpenTelemetry (Datadog)** | 1000 | Columnar storage, better cardinality handling |
| **Loki** | 100 | Label cardinality affects index size & query performance |
| **Sentry** | Unlimited | Error tracking needs full context (not cost-sensitive) |
| **Audit (PostgreSQL)** | Unlimited | Compliance requires complete data |

**Example: Different Limits per Backend**

```ruby
E11y.configure do |config|
  # Global default (applies to Yabeda, Loki)
  config.cardinality_protection do
    enabled true
    max_unique_values 100
    protected_labels [:user_id, :order_id]
  end

  # OpenTelemetry: 10× higher limit
  config.adapters.opentelemetry do
    cardinality_protection.max_unique_values 1000
  end

  # Sentry: No limit (need full context for debugging)
  config.adapters.sentry do
    cardinality_protection.enabled false
  end

  # Audit: No limit (compliance)
  config.adapters.audit do
    cardinality_protection.enabled false
  end
end

# Event with high-cardinality fields:
Events::OrderCreated.track(
  order_id: "order-12345",  # High-cardinality
  user_id: "user-67890",  # High-cardinality
  amount: 99.99
)

# Result per backend:
# Prometheus: order_id/user_id → [OTHER] (after 100 unique values)
# OpenTelemetry: order_id/user_id → [OTHER] (after 1000 unique values)
# Loki: order_id/user_id → [OTHER] (after 100 unique values)
# Sentry: order_id="order-12345", user_id="user-67890" (full context)
# Audit: order_id="order-12345", user_id="user-67890" (full context)
```

**Monitoring Cardinality Protection:**

Track cardinality protection effectiveness:

```ruby
# Metrics:
e11y_cardinality_filtered_labels_total{backend="all",label="user_id"}
e11y_cardinality_unique_values{label="order_id"}
e11y_cardinality_limit_breached_total{label="session_id"}

# Prometheus queries:

# 1. Cardinality protection rate (% of labels filtered)
# sum() on both sides so the differing label sets do not prevent vector matching
sum(rate(e11y_cardinality_filtered_labels_total[5m]))
/
sum(rate(e11y_events_tracked_total[5m])) * 100

# 2. Labels at risk (approaching limit)
e11y_cardinality_unique_values
/
100 * 100 > 80  # 80% of max_unique_values (100)

# 3. Top high-cardinality labels
topk(10,
  sum by (label) (
    rate(e11y_cardinality_filtered_labels_total[1h])
  )
)

# 4. Cost savings estimate (assume $0.10 per unique span attribute)
# increase() gives the count over the whole day; rate() would be per second
sum(increase(e11y_cardinality_filtered_labels_total[1d])) * 0.10
# Result: Daily $ saved
```

**Trade-offs:**

| Aspect | Pros | Cons | Mitigation |
|--------|------|------|------------|
| **Unified protection** | Consistent across all backends | One size doesn't fit all backends | Per-backend overrides (`max_unique_values`) |
| **[OTHER] grouping** | Prevents cost explosion | Loses context for debugging | Log original values at debug level |
| **Global config** | Simple, DRY | May not fit all backend limits | Environment-specific: prod=100, staging=500, dev=unlimited |
| **max_unique_values 100** | Conservative, safe for Prometheus | May be too strict for OTLP backends | Per-backend override: OTLP=1000, Yabeda=100 |

**Cost Impact:**

Real-world example from C04 analysis:

```
BEFORE C04 (no OTLP protection):
- 10,000 orders/day with unique order_id
- Datadog pricing: $0.10/span with high-cardinality attributes
- Daily cost: $1,000
- Monthly cost: $30,000 ❌

AFTER C04 (universal protection):
- Same 10,000 orders/day
- Cardinality protected: 1000 unique + [OTHER]
- Datadog pricing: $0.01/span with low-cardinality attributes
- Daily cost: $100
- Monthly cost: $3,000 ✅
- Monthly savings: $27,000 💰 (90% reduction!)
```

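The arithmetic behind these figures is easy to re-run with your own volumes. The sketch below only reproduces the calculation in the example; the per-span prices are the illustrative numbers quoted above, not actual vendor pricing, and `monthly_cost` is a hypothetical helper.

```ruby
# Assumption taken from the example: a month is counted as 30 days.
DAYS_PER_MONTH = 30

# Hypothetical helper: monthly cost in dollars for a given daily span volume.
def monthly_cost(spans_per_day:, dollars_per_span:)
  (spans_per_day * dollars_per_span * DAYS_PER_MONTH).round(2)
end

before  = monthly_cost(spans_per_day: 10_000, dollars_per_span: 0.10)
after   = monthly_cost(spans_per_day: 10_000, dollars_per_span: 0.01)
savings = before - after
percent = (savings / before * 100).round(1)

puts "before=$#{before} after=$#{after} savings=$#{savings} (#{percent}%)"
```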
---

## 📊 Self-Monitoring Metrics

**E11y tracks its own cardinality:**

```ruby
# === CARDINALITY METRICS ===
e11y_internal_metric_cardinality{metric="user_actions_total"}  # Current unique series
e11y_internal_metric_cardinality_limit{metric="user_actions_total"}  # Configured limit
e11y_internal_metric_cardinality_ratio{metric="user_actions_total"}  # current/limit (0-1)

# === OVERFLOW METRICS ===
e11y_internal_metric_overflow_count{metric="user_actions_total"}  # Times limit exceeded
e11y_internal_metric_overflow_events_total{metric="user_actions_total"}  # Events via overflow path

# === VIOLATION METRICS ===
e11y_internal_forbidden_label_violations_total{label="user_id"}  # Denylist violations
e11y_internal_label_value_count{metric="orders_paid_total",label="currency"}  # Unique values per label

# === AGGREGATE METRICS ===
e11y_internal_high_cardinality_metrics_total  # Metrics above threshold
e11y_internal_aggregated_series_total  # Series using "_other" bucket
```

**Prometheus alerting:**

```yaml
# config/prometheus/alerts.yml
groups:
  - name: e11y_cardinality
    rules:
      # Alert at 80% of limit
      - alert: E11yHighCardinality
        expr: e11y_internal_metric_cardinality_ratio > 0.8
        for: 5m
        annotations:
          # The ratio is 0-1, so format it as a percentage
          summary: "Metric {{ $labels.metric }} at {{ $value | humanizePercentage }} of limit"
          description: "Consider aggregating or increasing limit"

      # Alert on overflow
      - alert: E11yCardinalityOverflow
        expr: rate(e11y_internal_metric_overflow_events_total[5m]) > 10
        for: 2m
        annotations:
          summary: "Metric {{ $labels.metric }} overflowing ({{ $value }} events/sec)"

      # Alert on forbidden label usage
      - alert: E11yForbiddenLabelViolation
        expr: increase(e11y_internal_forbidden_label_violations_total[1h]) > 0
        annotations:
          summary: "Forbidden label {{ $labels.label }} used!"
          description: "Check metric configuration"
```

|
|
1359
|
+
---
|
|
1360
|
+
|
|
1361
|
+
## 💻 Implementation Examples
|
|
1362
|
+
|
|
1363
|
+
### Example 1: User Analytics (Safe)
|
|
1364
|
+
|
|
1365
|
+
```ruby
|
|
1366
|
+
# ❌ BEFORE: High cardinality
|
|
1367
|
+
counter_for pattern: 'user.action',
|
|
1368
|
+
tags: [:user_id, :action] # 1M users × 10 actions = 10M series
|
|
1369
|
+
|
|
1370
|
+
# ✅ AFTER: Low cardinality
|
|
1371
|
+
counter_for pattern: 'user.action',
|
|
1372
|
+
tags: [:user_segment, :action, :cohort],
|
|
1373
|
+
tag_extractors: {
|
|
1374
|
+
user_segment: ->(e) {
|
|
1375
|
+
User.find(e.payload[:user_id]).segment # free, paid, enterprise
|
|
1376
|
+
},
|
|
1377
|
+
cohort: ->(e) {
|
|
1378
|
+
User.find(e.payload[:user_id]).cohort_month # 2024-01, 2024-02
|
|
1379
|
+
}
|
|
1380
|
+
}
|
|
1381
|
+
# Result: 3 segments × 10 actions × 12 cohorts = 360 series (99.996% reduction!)
|
|
1382
|
+
```
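The extractors above call `User.find` once per tag for every event, which can become a noticeable database cost on hot paths. One way to soften that is a small bounded cache in front of the lookup. The sketch below is a hypothetical helper (the `BoundedLookupCache` name and its API are assumptions, not part of E11y), and the loader block stands in for `User.find(id).segment`. Cached values can go stale if a user changes segment, so a real setup would probably also want expiry.

```ruby
# Hypothetical helper, not part of E11y: a small thread-safe LRU cache
# so tag extractors do not hit the database on every event.
class BoundedLookupCache
  def initialize(max_size: 10_000, &loader)
    raise ArgumentError, 'a loader block is required' unless loader

    @max_size = max_size
    @loader = loader
    @store = {} # Ruby hashes keep insertion order, which gives us LRU ordering
    @mutex = Mutex.new
  end

  # Returns the cached value for key, calling the loader on a miss.
  def fetch(key)
    @mutex.synchronize do
      if @store.key?(key)
        # Re-insert to mark the entry as most recently used.
        @store[key] = @store.delete(key)
      else
        @store.shift if @store.size >= @max_size # evict the least recently used entry
        @store[key] = @loader.call(key)
      end
    end
  end

  def size
    @mutex.synchronize { @store.size }
  end
end

# Stand-in for User.find(id).segment; replace with the real lookup.
lookups = 0
segments = BoundedLookupCache.new(max_size: 2) do |user_id|
  lookups += 1
  user_id.even? ? 'paid' : 'free'
end

segments.fetch(1) # miss: loader runs
segments.fetch(1) # hit: served from the cache
segments.fetch(2) # miss
segments.fetch(3) # miss: evicts key 1, the least recently used
```

An extractor could then read `user_segment: ->(e) { segments.fetch(e.payload[:user_id]) }`, keeping the tag low-cardinality while avoiding a query per event.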
|
|
1383
|
+
|
|
1384
|
+
---
|
|
1385
|
+
|
|
1386
|
+
### Example 2: HTTP Request Tracking
|
|
1387
|
+
|
|
1388
|
+
```ruby
|
|
1389
|
+
counter_for pattern: 'http.request',
|
|
1390
|
+
tags: [:controller_action, :http_status_group, :region],
|
|
1391
|
+
tag_extractors: {
|
|
1392
|
+
# Normalize controller#action
|
|
1393
|
+
controller_action: ->(e) {
|
|
1394
|
+
"#{e.payload[:controller]}##{e.payload[:action]}"
|
|
1395
|
+
},
|
|
1396
|
+
|
|
1397
|
+
# Aggregate status codes
|
|
1398
|
+
http_status_group: ->(e) {
|
|
1399
|
+
status = e.payload[:status]
|
|
1400
|
+
case status
|
|
1401
|
+
when 200..299 then '2xx'
|
|
1402
|
+
when 300..399 then '3xx'
|
|
1403
|
+
when 400..499 then '4xx'
|
|
1404
|
+
when 500..599 then '5xx'
|
|
1405
|
+
else 'unknown'
|
|
1406
|
+
end
|
|
1407
|
+
}
|
|
1408
|
+
}
|
|
1409
|
+
|
|
1410
|
+
# With exemplars for debugging
|
|
1411
|
+
histogram_for pattern: 'http.request',
|
|
1412
|
+
value: ->(e) { e.duration_ms / 1000.0 },
|
|
1413
|
+
tags: [:controller_action, :http_status_group],
|
|
1414
|
+
exemplars: {
|
|
1415
|
+
trace_id: ->(e) { e.trace_id },
|
|
1416
|
+
user_id: ->(e) { e.context[:user_id] }
|
|
1417
|
+
},
|
|
1418
|
+
exemplar_sample_rate: 0.01 # 1% sampling
|
|
1419
|
+
```
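How `exemplar_sample_rate` picks events is not specified here. One common approach is to hash the trace ID, so that every event in the same trace gets the same decision. The sketch below illustrates that idea under stated assumptions: `sample_exemplar?` is a hypothetical helper, and CRC32 with 10,000 buckets is an arbitrary choice, not E11y's documented behavior.

```ruby
require 'zlib'

# Hypothetical helper: decide whether to attach an exemplar for this trace.
# Hashing the trace ID (instead of calling rand) keeps the decision stable
# for every event that belongs to the same trace.
def sample_exemplar?(trace_id, rate)
  return false if trace_id.nil? || rate <= 0
  return true if rate >= 1

  bucket = Zlib.crc32(trace_id.to_s) % 10_000 # 0..9_999
  bucket < (rate * 10_000).round
end

sample_exemplar?('4bf92f3577b34da6a3ce929d0e0e4736', 0.01)
```

Because exemplars are attached to samples rather than stored as labels, sampling them bounds storage without affecting series cardinality.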
|
|
1420
|
+
|
|
1421
|
+
---
|
|
1422
|
+
|
|
1423
|
+
### Example 3: E-Commerce Orders
|
|
1424
|
+
|
|
1425
|
+
```ruby
|
|
1426
|
+
# Orders by status, payment method, country
|
|
1427
|
+
counter_for pattern: 'order.paid',
|
|
1428
|
+
name: 'orders_paid_total',
|
|
1429
|
+
tags: [:status, :payment_method, :country, :amount_bucket],
|
|
1430
|
+
tag_extractors: {
|
|
1431
|
+
# Bucket amounts
|
|
1432
|
+
amount_bucket: ->(e) {
|
|
1433
|
+
amount = e.payload[:amount]
|
|
1434
|
+
case amount
|
|
1435
|
+
when 0..50 then 'small'
|
|
1436
|
+
when 50..200 then 'medium' # ranges share endpoints so fractional amounts (e.g. 50.50) cannot fall through to 'xlarge'
|
|
1437
|
+
when 200..1000 then 'large'
|
|
1438
|
+
else 'xlarge'
|
|
1439
|
+
end
|
|
1440
|
+
},
|
|
1441
|
+
|
|
1442
|
+
# Aggregate country to region
|
|
1443
|
+
country: ->(e) {
|
|
1444
|
+
Country.find(e.payload[:country_code]).region # US, EU, APAC
|
|
1445
|
+
}
|
|
1446
|
+
}
|
|
1447
|
+
|
|
1448
|
+
# Cardinality:
|
|
1449
|
+
# 4 statuses × 5 payment methods × 3 regions × 4 amount buckets = 240 series ✅
|
|
1450
|
+
```
|
|
1451
|
+
|
|
1452
|
+
---
|
|
1453
|
+
|
|
1454
|
+
## 🧪 Testing
|
|
1455
|
+
|
|
1456
|
+
```ruby
|
|
1457
|
+
# spec/e11y/cardinality_spec.rb
|
|
1458
|
+
RSpec.describe 'E11y Cardinality Protection' do
|
|
1459
|
+
describe 'forbidden labels' do
|
|
1460
|
+
it 'raises error on forbidden label usage' do
|
|
1461
|
+
E11y.configure do |config|
|
|
1462
|
+
config.metrics do
|
|
1463
|
+
forbidden_labels :user_id
|
|
1464
|
+
enforcement :strict
|
|
1465
|
+
end
|
|
1466
|
+
end
|
|
1467
|
+
|
|
1468
|
+
expect {
|
|
1469
|
+
E11y.configure do |config|
|
|
1470
|
+
config.metrics do
|
|
1471
|
+
counter_for pattern: 'test',
|
|
1472
|
+
tags: [:user_id]
|
|
1473
|
+
end
|
|
1474
|
+
end
|
|
1475
|
+
}.to raise_error(E11y::ForbiddenLabelError, /user_id/)
|
|
1476
|
+
end
|
|
1477
|
+
end
|
|
1478
|
+
|
|
1479
|
+
describe 'cardinality limits' do
|
|
1480
|
+
it 'drops overflow events' do
|
|
1481
|
+
E11y.configure do |config|
|
|
1482
|
+
config.metrics do
|
|
1483
|
+
cardinality_limit_for 'test_metric', max: 3
|
|
1484
|
+
overflow_strategy :drop
|
|
1485
|
+
end
|
|
1486
|
+
end
|
|
1487
|
+
|
|
1488
|
+
# Track 5 unique label values (exceeds limit of 3)
|
|
1489
|
+
5.times do |i|
|
|
1490
|
+
Events::TestEvent.track(category: "cat_#{i}")
|
|
1491
|
+
end
|
|
1492
|
+
|
|
1493
|
+
metric = Yabeda.test_metric
|
|
1494
|
+
# Expect only 3 unique (2 dropped)
|
|
1495
|
+
expect(metric.values.keys.size).to eq(3)
|
|
1496
|
+
|
|
1497
|
+
# Verify drop counter incremented
|
|
1498
|
+
expect(Yabeda.e11y_internal.metric_overflow_events_total.get({ metric: 'test_metric' })).to be > 0
|
|
1499
|
+
end
|
|
1500
|
+
end
|
|
1501
|
+
|
|
1502
|
+
describe 'self-monitoring' do
|
|
1503
|
+
it 'tracks cardinality ratio' do
|
|
1504
|
+
E11y.configure do |config|
|
|
1505
|
+
config.metrics do
|
|
1506
|
+
cardinality_limit_for 'test_metric', max: 100
|
|
1507
|
+
end
|
|
1508
|
+
end
|
|
1509
|
+
|
|
1510
|
+
50.times { |i| Events::TestEvent.track(category: "cat_#{i}") }
|
|
1511
|
+
|
|
1512
|
+
ratio = Yabeda.e11y_internal.metric_cardinality_ratio.get(
|
|
1513
|
+
{ metric: 'test_metric' }
|
|
1514
|
+
)
|
|
1515
|
+
expect(ratio).to eq(0.5) # 50/100
|
|
1516
|
+
end
|
|
1517
|
+
end
|
|
1518
|
+
end
|
|
1519
|
+
```
|
|
1520
|
+
|
|
1521
|
+
---
|
|
1522
|
+
|
|
1523
|
+
## 💡 Best Practices
|
|
1524
|
+
|
|
1525
|
+
### ✅ DO
|
|
1526
|
+
|
|
1527
|
+
**1. Use aggregation for high-cardinality dimensions**
|
|
1528
|
+
```ruby
|
|
1529
|
+
# ✅ GOOD
|
|
1530
|
+
tags: [:user_segment] # free, paid, enterprise (3 values)
|
|
1531
|
+
```
|
|
1532
|
+
|
|
1533
|
+
**2. Monitor cardinality proactively**
|
|
1534
|
+
```ruby
|
|
1535
|
+
# ✅ GOOD
|
|
1536
|
+
cardinality_monitoring do
|
|
1537
|
+
warn_threshold 0.7
|
|
1538
|
+
alert_channel '#observability'
|
|
1539
|
+
end
|
|
1540
|
+
```
|
|
1541
|
+
|
|
1542
|
+
**3. Use exemplars for debugging**
|
|
1543
|
+
```ruby
|
|
1544
|
+
# ✅ GOOD
|
|
1545
|
+
exemplars: { trace_id: ->(e) { e.trace_id } }
|
|
1546
|
+
exemplar_sample_rate: 0.01
|
|
1547
|
+
```
|
|
1548
|
+
|
|
1549
|
+
---
|
|
1550
|
+
|
|
1551
|
+
### ❌ DON'T
|
|
1552
|
+
|
|
1553
|
+
**1. Don't use unbounded identifiers as labels**
|
|
1554
|
+
```ruby
|
|
1555
|
+
# ❌ BAD
|
|
1556
|
+
tags: [:user_id, :order_id, :session_id]
|
|
1557
|
+
```
|
|
1558
|
+
|
|
1559
|
+
**2. Don't ignore cardinality warnings**
|
|
1560
|
+
```ruby
|
|
1561
|
+
# ❌ BAD: Ignoring production alerts
|
|
1562
|
+
# [E11y WARNING] Metric at 95% of limit
|
|
1563
|
+
# → Action: Aggregate or increase limit immediately!
|
|
1564
|
+
```
|
|
1565
|
+
|
|
1566
|
+
**3. Don't use timestamps as labels**
|
|
1567
|
+
```ruby
|
|
1568
|
+
# ❌ BAD
|
|
1569
|
+
tags: [:timestamp, :created_at]
|
|
1570
|
+
# Prometheus already timestamps every sample; filter by time range in queries instead.
|
|
1571
|
+
```
|
|
1572
|
+
|
|
1573
|
+
---
|
|
1574
|
+
|
|
1575
|
+
## 💰 Cost Calculator
|
|
1576
|
+
|
|
1577
|
+
```ruby
|
|
1578
|
+
# Calculate your potential savings
|
|
1579
|
+
def calculate_cardinality_cost(
|
|
1580
|
+
services:,
|
|
1581
|
+
dimensions:,
|
|
1582
|
+
values_per_dimension:,
|
|
1583
|
+
cost_per_series: 0.068 # USD per series per month (Datadog pricing)
|
|
1584
|
+
)
|
|
1585
|
+
total_series = dimensions.map { |d| values_per_dimension[d] }.reduce(:*)
|
|
1586
|
+
total_series *= services
|
|
1587
|
+
|
|
1588
|
+
monthly_cost = total_series * cost_per_series
|
|
1589
|
+
|
|
1590
|
+
{
|
|
1591
|
+
total_series: total_series,
|
|
1592
|
+
monthly_cost: monthly_cost,
|
|
1593
|
+
yearly_cost: monthly_cost * 12
|
|
1594
|
+
}
|
|
1595
|
+
end
|
|
1596
|
+
|
|
1597
|
+
# Example: E-commerce app
|
|
1598
|
+
before = calculate_cardinality_cost(
|
|
1599
|
+
services: 50,
|
|
1600
|
+
dimensions: [:user_id, :product_id, :action],
|
|
1601
|
+
values_per_dimension: {
|
|
1602
|
+
user_id: 100_000,
|
|
1603
|
+
product_id: 10_000,
|
|
1604
|
+
action: 10
|
|
1605
|
+
}
|
|
1606
|
+
)
|
|
1607
|
+
# => 500B series (100_000 × 10_000 × 10 × 50), roughly $34B/month! 😱
|
|
1608
|
+
|
|
1609
|
+
after = calculate_cardinality_cost(
|
|
1610
|
+
services: 50,
|
|
1611
|
+
dimensions: [:user_segment, :product_category, :action],
|
|
1612
|
+
values_per_dimension: {
|
|
1613
|
+
user_segment: 3,
|
|
1614
|
+
product_category: 20,
|
|
1615
|
+
action: 10
|
|
1616
|
+
}
|
|
1617
|
+
)
|
|
1618
|
+
# => 30k series, $2k/month ✅
|
|
1619
|
+
# SAVINGS: roughly $34B - $2k per month (over 99.9999% reduction!)
|
|
1620
|
+
```
|
|
1621
|
+
|
|
1622
|
+
---
|
|
1623
|
+
|
|
1624
|
+
## ❓ Frequently Asked Questions
|
|
1625
|
+
|
|
1626
|
+
> **Technical Details:** See [ADR-002 Section 11: FAQ & Critical Clarifications](../ADR-002-metrics-yabeda.md#11-faq--critical-clarifications) for architectural rationale.
|
|
1627
|
+
|
|
1628
|
+
### Q1: Does cardinality protection apply to all my logs and metrics?
|
|
1629
|
+
|
|
1630
|
+
**A: No, only to metrics (Prometheus/Yabeda). Logs keep full data.**
|
|
1631
|
+
|
|
1632
|
+
This is a common source of confusion. Let's clarify:
|
|
1633
|
+
|
|
1634
|
+
```ruby
|
|
1635
|
+
# Same event, different treatment:
|
|
1636
|
+
Events::OrderCreated.track(
|
|
1637
|
+
user_id: '123', # High-cardinality
|
|
1638
|
+
status: 'paid', # Low-cardinality
|
|
1639
|
+
amount: 99.99
|
|
1640
|
+
)
|
|
1641
|
+
```
|
|
1642
|
+
|
|
1643
|
+
**What happens:**
|
|
1644
|
+
|
|
1645
|
+
| Adapter | `user_id` | `status` | `amount` | Why |
|
|
1646
|
+
|---------|-----------|----------|----------|-----|
|
|
1647
|
+
| **Prometheus** | ❌ Dropped (denylist) | ✅ Kept | ❌ Dropped (value, not label) | Cardinality protection active |
|
|
1648
|
+
| **Loki (logs)** | ✅ Kept | ✅ Kept | ✅ Kept | No cardinality limits |
|
|
1649
|
+
| **Sentry** | ✅ Kept | ✅ Kept | ✅ Kept | Full context needed for debugging |
|
|
1650
|
+
| **Audit** | ✅ Kept | ✅ Kept | ✅ Kept | Compliance requires full data |
|
|
1651
|
+
|
|
1652
|
+
**Why this design?**
|
|
1653
|
+
|
|
1654
|
+
- **Metrics (Prometheus):** Cardinality explosions are catastrophic (cost, performance, query failures)
|
|
1655
|
+
- **Logs (Loki):** High-cardinality fields are fine (indexed differently, stored cheaper)
|
|
1656
|
+
- **Error tracking (Sentry):** Need full context to debug issues
|
|
1657
|
+
- **Audit trails:** Regulatory compliance requires complete data
|
|
1658
|
+
|
|
1659
|
+
**Practical implication:**
|
|
1660
|
+
|
|
1661
|
+
```ruby
|
|
1662
|
+
# ✅ This is SAFE and RECOMMENDED:
|
|
1663
|
+
Events::ApiRequest.track(
|
|
1664
|
+
request_id: SecureRandom.uuid, # High-cardinality, but OK!
|
|
1665
|
+
endpoint: '/api/users',
|
|
1666
|
+
user_id: current_user.id
|
|
1667
|
+
)
|
|
1668
|
+
|
|
1669
|
+
# Result:
|
|
1670
|
+
# - Metrics: only endpoint tracked (request_id/user_id dropped)
|
|
1671
|
+
# - Logs: full payload with request_id for debugging
|
|
1672
|
+
# - Best of both worlds!
|
|
1673
|
+
```
|
|
1674
|
+
|
|
1675
|
+
---
|
|
1676
|
+
|
|
1677
|
+
### Q2: Are the 4 layers checked simultaneously or one-by-one?
|
|
1678
|
+
|
|
1679
|
+
**A: One-by-one (sequential waterfall), not simultaneously.**
|
|
1680
|
+
|
|
1681
|
+
This is critical to understand for debugging and configuration:
|
|
1682
|
+
|
|
1683
|
+
```
|
|
1684
|
+
Processing order for each label:
|
|
1685
|
+
|
|
1686
|
+
┌─────────────────────────────────────────┐
|
|
1687
|
+
│ 1. Layer 1: Denylist Check │
|
|
1688
|
+
│ ↓ In denylist? → DROP, stop here │
|
|
1689
|
+
│ ↓ Not in denylist? → Continue to L2 │
|
|
1690
|
+
└─────────────────────────────────────────┘
|
|
1691
|
+
↓
|
|
1692
|
+
┌─────────────────────────────────────────┐
|
|
1693
|
+
│ 2. Layer 2: Allowlist Check │
|
|
1694
|
+
│ ↓ In allowlist? → KEEP, stop here │
|
|
1695
|
+
│ ↓ Not in allowlist? → Continue to L3 │
|
|
1696
|
+
└─────────────────────────────────────────┘
|
|
1697
|
+
↓
|
|
1698
|
+
┌─────────────────────────────────────────┐
|
|
1699
|
+
│ 3. Layer 3: Cardinality Limit │
|
|
1700
|
+
│ ↓ Under limit? → KEEP, stop here │
|
|
1701
|
+
│ ↓ Over limit? → Continue to L4 │
|
|
1702
|
+
└─────────────────────────────────────────┘
|
|
1703
|
+
↓
|
|
1704
|
+
┌─────────────────────────────────────────┐
|
|
1705
|
+
│ 4. Layer 4: Dynamic Action │
|
|
1706
|
+
│ ↓ Apply configured action: │
|
|
1707
|
+
│ drop / alert / relabel │
|
|
1708
|
+
└─────────────────────────────────────────┘
|
|
1709
|
+
```
|
|
1710
|
+
|
|
1711
|
+
**Example trace through all layers:**
|
|
1712
|
+
|
|
1713
|
+
```ruby
|
|
1714
|
+
# Event: { user_id: '123', status: 'paid', tier: 'premium' }
|
|
1715
|
+
|
|
1716
|
+
# Label: user_id
|
|
1717
|
+
# Layer 1: ✅ in FORBIDDEN_LABELS → ❌ DROP (stop here, never reaches L2-L4)
|
|
1718
|
+
|
|
1719
|
+
# Label: status
|
|
1720
|
+
# Layer 1: ✅ not in FORBIDDEN_LABELS → continue to L2
|
|
1721
|
+
# Layer 2: ✅ in SAFE_LABELS → ✅ KEEP (stop here, skip L3-L4)
|
|
1722
|
+
|
|
1723
|
+
# Label: tier
|
|
1724
|
+
# Layer 1: ✅ not in FORBIDDEN_LABELS → continue to L2
|
|
1725
|
+
# Layer 2: ✅ not in SAFE_LABELS → continue to L3
|
|
1726
|
+
# Layer 3: ❌ cardinality 150 > limit 100 → continue to L4
|
|
1727
|
+
# Layer 4: ✅ action=drop → ❌ DROP
|
|
1728
|
+
|
|
1729
|
+
# Final metric:
|
|
1730
|
+
# orders_total{status="paid"} 1
|
|
1731
|
+
# (user_id and tier dropped)
|
|
1732
|
+
```
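The waterfall can be sketched in plain Ruby to make the early exits concrete. This is an illustrative model only: `LabelWaterfall`, `Decision`, and the per-label set of seen values are assumptions made for the example and do not mirror E11y internals (which may, for instance, count cardinality per metric rather than per label).

```ruby
require 'set'

# Illustrative sketch of the four-layer waterfall; names are hypothetical.
class LabelWaterfall
  Decision = Struct.new(:label, :action, :layer)

  def initialize(denylist:, allowlist:, limit:, overflow_action: :drop)
    @denylist = denylist.to_set
    @allowlist = allowlist.to_set
    @limit = limit
    @overflow_action = overflow_action
    # Distinct values seen so far, tracked per label.
    @seen_values = Hash.new { |hash, label| hash[label] = Set.new }
  end

  # Runs the layers in order and returns as soon as one of them decides.
  def decide(label, value)
    return Decision.new(label, :drop, 1) if @denylist.include?(label)  # Layer 1
    return Decision.new(label, :keep, 2) if @allowlist.include?(label) # Layer 2

    seen = @seen_values[label]
    if seen.include?(value) || seen.size < @limit                      # Layer 3
      seen << value
      return Decision.new(label, :keep, 3)
    end

    Decision.new(label, @overflow_action, 4)                           # Layer 4
  end

  # Keeps only the labels whose decision is :keep.
  def filter(labels)
    labels.select { |label, value| decide(label, value).action == :keep }
  end
end

waterfall = LabelWaterfall.new(denylist: [:user_id], allowlist: [:status], limit: 2)
kept = waterfall.filter({ user_id: '123', status: 'paid', tier: 'premium' })
# kept == { status: 'paid', tier: 'premium' } while tier is still under its limit
```

Each `Decision` records the layer that produced it, which is the property that makes the sequential design easy to debug.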
|
|
1733
|
+
|
|
1734
|
+
**Why sequential (not simultaneous)?**
|
|
1735
|
+
|
|
1736
|
+
- **Performance:** Early exit on denylist (L1) avoids expensive cardinality checks (L3)
|
|
1737
|
+
- **Predictability:** Clear precedence (denylist > allowlist > cardinality > action)
|
|
1738
|
+
- **Debuggability:** Easy to trace which layer made the decision
|
|
1739
|
+
|
|
1740
|
+
---
|
|
1741
|
+
|
|
1742
|
+
### Q3: What should I do when I hit a cardinality limit?
|
|
1743
|
+
|
|
1744
|
+
**A: Use relabeling if possible, otherwise drop the label.**
|
|
1745
|
+
|
|
1746
|
+
Use this decision process:
|
|
1747
|
+
|
|
1748
|
+
**Step 1: Can you group values into clear categories?**
|
|
1749
|
+
|
|
1750
|
+
```ruby
|
|
1751
|
+
# ✅ YES - Use relabeling (best signal preservation):
|
|
1752
|
+
|
|
1753
|
+
# Example 1: HTTP status codes (200, 201, 204...) → status classes (2xx, 3xx...)
|
|
1754
|
+
tag_extractors: {
|
|
1755
|
+
status_class: ->(event) {
|
|
1756
|
+
case event.payload[:http_status].to_i
|
|
1757
|
+
when 200..299 then '2xx'
when 300..399 then '3xx'
|
|
1758
|
+
when 400..499 then '4xx'
|
|
1759
|
+
when 500..599 then '5xx'
else 'other' # 1xx or malformed codes; avoids returning nil
|
|
1760
|
+
end
|
|
1761
|
+
}
|
|
1762
|
+
}
|
|
1763
|
+
# Result: 50 values → 5 categories (90% reduction)
|
|
1764
|
+
|
|
1765
|
+
# Example 2: Paths (/users/123, /users/456...) → endpoint patterns (/users/:id)
|
|
1766
|
+
tag_extractors: {
|
|
1767
|
+
endpoint: ->(event) {
|
|
1768
|
+
event.payload[:path].gsub(%r{/\d+(?=/|\z)}, '/:id') # only whole numeric segments, so /api/v2 stays intact
|
|
1769
|
+
}
|
|
1770
|
+
}
|
|
1771
|
+
# Result: Infinite values → ~100 endpoints
|
|
1772
|
+
```
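Path relabeling is easiest to reason about segment by segment, because that avoids rewriting digits that belong to a fixed segment (such as the `2` in `/api/v2`) and leaves room for other ID formats. The sketch below is a hypothetical `normalize_path` helper, not an E11y API; the `:uuid` placeholder is an assumption added for illustration.

```ruby
# Hypothetical helper: collapse ID-like path segments into placeholders.
NUMERIC_SEGMENT = /\A\d+\z/
UUID_SEGMENT = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/

def normalize_path(path)
  # The -1 limit keeps empty trailing segments, so "/" and trailing slashes survive.
  path.split('/', -1).map do |segment|
    case segment
    when NUMERIC_SEGMENT then ':id'
    when UUID_SEGMENT then ':uuid'
    else segment
    end
  end.join('/')
end

normalize_path('/users/123')              # => "/users/:id"
normalize_path('/api/v2/users/42/orders') # => "/api/v2/users/:id/orders"
```

Slugs and other free-form segments would still pass through unchanged, so a cardinality limit remains useful as a backstop.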
|
|
1773
|
+
|
|
1774
|
+
**Step 2: Is this label critical for alerts/dashboards?**
|
|
1775
|
+
|
|
1776
|
+
```ruby
|
|
1777
|
+
# ❌ NO - Just drop it (keep in logs):
|
|
1778
|
+
cardinality_limit_for 'api.requests' do
|
|
1779
|
+
max_cardinality 100
|
|
1780
|
+
overflow_strategy :drop # Silent drop
|
|
1781
|
+
end
|
|
1782
|
+
|
|
1783
|
+
# Result: request_id removed from metrics, but still in logs for debugging
|
|
1784
|
+
```
|
|
1785
|
+
|
|
1786
|
+
**Step 3: Is this an unexpected cardinality spike?**
|
|
1787
|
+
|
|
1788
|
+
```ruby
|
|
1789
|
+
# ✅ YES - Alert ops team:
|
|
1790
|
+
cardinality_limit_for 'payments.processed' do
|
|
1791
|
+
max_cardinality 50
|
|
1792
|
+
overflow_strategy :alert # Alert + drop
|
|
1793
|
+
end
|
|
1794
|
+
|
|
1795
|
+
# Scenario: Suddenly 1000 unique payment_method values
|
|
1796
|
+
# → Alert sent to PagerDuty/Slack
|
|
1797
|
+
# → Ops investigates (possible bug, attack, data corruption)
|
|
1798
|
+
```
|
|
1799
|
+
|
|
1800
|
+
**Common patterns:**
|
|
1801
|
+
|
|
1802
|
+
| Your Situation | Recommended Action | Example |
|
|
1803
|
+
|----------------|-------------------|---------|
|
|
1804
|
+
| Label not needed for analysis | **DROP** | `request_id`, `trace_id` → keep in logs only |
|
|
1805
|
+
| Clear categories exist | **RELABEL** | `http_status: 200` → `status_class: 2xx` |
|
|
1806
|
+
| Cardinality should be stable | **ALERT** | Payment methods suddenly spike to 1000 values |
|
|
1807
|
+
| Need debugging context | **Keep in logs** | Drop from metrics, query logs when debugging |
|
|
1808
|
+
|
|
1809
|
+
**Anti-pattern to avoid:**
|
|
1810
|
+
|
|
1811
|
+
```ruby
|
|
1812
|
+
# ❌ DON'T: Keep high-cardinality labels in metrics
|
|
1813
|
+
counter_for pattern: 'api.request',
|
|
1814
|
+
tags: [:endpoint, :user_id] # user_id = millions of values!
|
|
1815
|
+
|
|
1816
|
+
# ✅ DO: Drop from metrics, keep in logs
|
|
1817
|
+
counter_for pattern: 'api.request',
|
|
1818
|
+
tags: [:endpoint] # Only low-cardinality tags
|
|
1819
|
+
|
|
1820
|
+
# user_id still available in Loki logs for debugging:
|
|
1821
|
+
# 2024-01-15 10:23:45 | api.request | endpoint=/api/users user_id=123 status=200
|
|
1822
|
+
```
|
|
1823
|
+
|
|
1824
|
+
---
|
|
1825
|
+
|
|
1826
|
+
### Q4: How do I debug which layer dropped my label?
|
|
1827
|
+
|
|
1828
|
+
**A: Check E11y's built-in cardinality metrics:**
|
|
1829
|
+
|
|
1830
|
+
```ruby
|
|
1831
|
+
# See which labels are being dropped:
|
|
1832
|
+
Yabeda.e11y_internal.cardinality_dropped_labels_total.values
|
|
1833
|
+
# => { metric: 'api_requests_total', label: 'user_id', reason: 'denylist' } => 1523
|
|
1834
|
+
# => { metric: 'api_requests_total', label: 'session_id', reason: 'limit_exceeded' } => 42
|
|
1835
|
+
|
|
1836
|
+
# See which layer made the decision:
|
|
1837
|
+
# - reason: 'denylist' → Layer 1
|
|
1838
|
+
# - reason: 'not_in_allowlist' → Layer 2 (if allowlist configured)
|
|
1839
|
+
# - reason: 'limit_exceeded' → Layer 3
|
|
1840
|
+
```
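To turn that raw output into a per-layer summary, the counter values can be grouped by `reason`. The sketch below assumes the values arrive as a hash keyed by label hashes, as in the sample output above; `REASON_TO_LAYER` and `drops_by_layer` are hypothetical names, and the sample data is copied from that output rather than read from a live registry.

```ruby
# Hypothetical mapping from the `reason` label to the layer that decided.
REASON_TO_LAYER = {
  'denylist' => 'Layer 1 (denylist)',
  'not_in_allowlist' => 'Layer 2 (allowlist)',
  'limit_exceeded' => 'Layer 3 (cardinality limit)'
}.freeze

# Sums dropped-label counts per layer.
def drops_by_layer(counter_values)
  counter_values.each_with_object(Hash.new(0)) do |(labels, count), totals|
    layer = REASON_TO_LAYER.fetch(labels[:reason], 'unknown')
    totals[layer] += count
  end
end

sample = {
  { metric: 'api_requests_total', label: 'user_id', reason: 'denylist' } => 1523,
  { metric: 'api_requests_total', label: 'session_id', reason: 'limit_exceeded' } => 42
}

drops_by_layer(sample)
# => {"Layer 1 (denylist)"=>1523, "Layer 3 (cardinality limit)"=>42}
```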
|
|
1841
|
+
|
|
1842
|
+
**Prometheus queries for debugging:**
|
|
1843
|
+
|
|
1844
|
+
```promql
|
|
1845
|
+
# Which metrics are dropping labels most frequently?
|
|
1846
|
+
topk(10, rate(e11y_internal_cardinality_dropped_labels_total[5m]))
|
|
1847
|
+
|
|
1848
|
+
# Which labels are being dropped?
|
|
1849
|
+
sum by (label, reason) (e11y_internal_cardinality_dropped_labels_total)
|
|
1850
|
+
|
|
1851
|
+
# Alert on unexpected drops:
|
|
1852
|
+
rate(e11y_internal_cardinality_dropped_labels_total{reason="limit_exceeded"}[5m]) > 10
|
|
1853
|
+
```
|
|
1854
|
+
|
|
1855
|
+
**Development/staging debugging:**
|
|
1856
|
+
|
|
1857
|
+
```ruby
|
|
1858
|
+
# Temporarily log all cardinality decisions:
|
|
1859
|
+
E11y.configure do |config|
|
|
1860
|
+
config.metrics.cardinality_protection do
|
|
1861
|
+
debug_mode true # Logs every decision (verbose!)
|
|
1862
|
+
end
|
|
1863
|
+
end
|
|
1864
|
+
|
|
1865
|
+
# Output:
|
|
1866
|
+
# [E11y] Cardinality: KEEP label 'status' (Layer 2: allowlist)
|
|
1867
|
+
# [E11y] Cardinality: DROP label 'user_id' (Layer 1: denylist)
|
|
1868
|
+
# [E11y] Cardinality: DROP label 'tier' (Layer 3: limit 150/100, action=drop)
|
|
1869
|
+
```
|
|
1870
|
+
|
|
1871
|
+
---
|
|
1872
|
+
|
|
1873
|
+
## 🔒 Validations (NEW - v1.1)
|
|
1874
|
+
|
|
1875
|
+
> **🎯 Pattern:** Validate cardinality configuration at class load time.
|
|
1876
|
+
|
|
1877
|
+
### Cardinality Limit Validation
|
|
1878
|
+
|
|
1879
|
+
**Problem:** Invalid cardinality limits → metric explosion.
|
|
1880
|
+
|
|
1881
|
+
**Solution:** Validate that `cardinality_limit` is an integer in the range 1..10_000:
|
|
1882
|
+
|
|
1883
|
+
```ruby
|
|
1884
|
+
# Gem implementation (automatic):
|
|
1885
|
+
def self.cardinality_limit(max)
|
|
1886
|
+
unless max.is_a?(Integer) && max > 0 && max <= 10_000
|
|
1887
|
+
raise ArgumentError, "cardinality_limit must be 1..10_000, got: #{max.inspect}"
|
|
1888
|
+
end
|
|
1889
|
+
self._cardinality_limit = max
|
|
1890
|
+
end
|
|
1891
|
+
|
|
1892
|
+
# Result:
|
|
1893
|
+
class Events::UserAction < E11y::Event::Base
|
|
1894
|
+
metric :counter, name: 'actions_total', cardinality_limit: -100
|
|
1895
|
+
# ← ERROR: "cardinality_limit must be 1..10_000, got: -100"
|
|
1896
|
+
end
|
|
1897
|
+
```
|
|
1898
|
+
|
|
1899
|
+
### Forbidden Labels Validation
|
|
1900
|
+
|
|
1901
|
+
**Problem:** Using high-cardinality labels → cost explosion.
|
|
1902
|
+
|
|
1903
|
+
**Solution:** Validate against denylist:
|
|
1904
|
+
|
|
1905
|
+
```ruby
|
|
1906
|
+
# Gem implementation (automatic):
|
|
1907
|
+
FORBIDDEN_LABELS = [:user_id, :order_id, :session_id, :trace_id, :request_id]
|
|
1908
|
+
|
|
1909
|
+
def self.metric(type, name:, tags: [], **options) # tags default to none, so metrics without tags still validate
|
|
1910
|
+
forbidden = tags & FORBIDDEN_LABELS
|
|
1911
|
+
if forbidden.any?
|
|
1912
|
+
raise ArgumentError, "Forbidden high-cardinality labels: #{forbidden.join(', ')}. Use aggregation instead!"
|
|
1913
|
+
end
|
|
1914
|
+
# ...
|
|
1915
|
+
end
|
|
1916
|
+
|
|
1917
|
+
# Result:
|
|
1918
|
+
class Events::UserAction < E11y::Event::Base
|
|
1919
|
+
metric :counter, name: 'actions_total', tags: [:user_id, :action_type]
|
|
1920
|
+
# ← ERROR: "Forbidden high-cardinality labels: user_id. Use aggregation instead!"
|
|
1921
|
+
end
|
|
1922
|
+
```
|
|
1923
|
+
|
|
1924
|
+
### Tag Extractors Validation
|
|
1925
|
+
|
|
1926
|
+
**Problem:** Tag extractors returning nil → metric gaps.
|
|
1927
|
+
|
|
1928
|
+
**Solution:** Validate extractor return values:
|
|
1929
|
+
|
|
1930
|
+
```ruby
|
|
1931
|
+
# Gem implementation (runtime):
|
|
1932
|
+
def extract_tag_value(event, extractor)
|
|
1933
|
+
value = extractor.call(event)
|
|
1934
|
+
if value.nil? || value.to_s.empty?
|
|
1935
|
+
raise ArgumentError, "Tag extractor returned nil/empty for event: #{event.name}"
|
|
1936
|
+
end
|
|
1937
|
+
value.to_s
|
|
1938
|
+
end
|
|
1939
|
+
```
|
|
1940
|
+
|
|
1941
|
+
---
|
|
1942
|
+
|
|
1943
|
+
## 🌍 Environment-Specific Cardinality Protection (NEW - v1.1)
|
|
1944
|
+
|
|
1945
|
+
> **🎯 Pattern:** Different cardinality limits per environment.
|
|
1946
|
+
|
|
1947
|
+
### Example 1: Stricter Limits in Production
|
|
1948
|
+
|
|
1949
|
+
```ruby
|
|
1950
|
+
class Events::UserAction < E11y::Event::Base
|
|
1951
|
+
schema do
|
|
1952
|
+
required(:user_id).filled(:string)
|
|
1953
|
+
required(:action_type).filled(:string)
|
|
1954
|
+
end
|
|
1955
|
+
|
|
1956
|
+
# Environment-specific cardinality limits
|
|
1957
|
+
metric :counter,
|
|
1958
|
+
name: 'user_actions_total',
|
|
1959
|
+
tags: [:user_segment, :action_type],
|
|
1960
|
+
cardinality_limit: Rails.env.production? ? 100 : 1_000,
|
|
1961
|
+
tag_extractors: {
|
|
1962
|
+
user_segment: ->(event) {
|
|
1963
|
+
if Rails.env.production?
|
|
1964
|
+
# Production: strict aggregation
|
|
1965
|
+
User.find(event.payload[:user_id]).segment # 'free', 'paid', 'enterprise'
|
|
1966
|
+
else
|
|
1967
|
+
# Dev/test: allow user_id for debugging
|
|
1968
|
+
event.payload[:user_id]
|
|
1969
|
+
end
|
|
1970
|
+
}
|
|
1971
|
+
}
|
|
1972
|
+
end
|
|
1973
|
+
```
|
|
1974
|
+
|
|
1975
|
+
### Example 2: Feature Flag for Cardinality Protection
|
|
1976
|
+
|
|
1977
|
+
```ruby
|
|
1978
|
+
class Events::ApiRequest < E11y::Event::Base
|
|
1979
|
+
schema do
|
|
1980
|
+
required(:endpoint).filled(:string)
|
|
1981
|
+
required(:user_id).filled(:string)
|
|
1982
|
+
end
|
|
1983
|
+
|
|
1984
|
+
# Enable cardinality protection only when flag is on
|
|
1985
|
+
if ENV['ENABLE_CARDINALITY_PROTECTION'] == 'true'
|
|
1986
|
+
metric :counter,
|
|
1987
|
+
name: 'api_requests_total',
|
|
1988
|
+
tags: [:endpoint_group], # Aggregated
|
|
1989
|
+
cardinality_limit: 50,
|
|
1990
|
+
tag_extractors: {
|
|
1991
|
+
endpoint_group: ->(event) {
|
|
1992
|
+
# Group /users/123 → /users/:id
|
|
1993
|
+
event.payload[:endpoint].gsub(/\/\d+/, '/:id')
|
|
1994
|
+
}
|
|
1995
|
+
}
|
|
1996
|
+
else
|
|
1997
|
+
# Dev: no aggregation
|
|
1998
|
+
metric :counter,
|
|
1999
|
+
name: 'api_requests_total',
|
|
2000
|
+
tags: [:endpoint] # Full endpoint
|
|
2001
|
+
end
|
|
2002
|
+
end
|
|
2003
|
+
```
|
|
2004
|
+
|
|
2005
|
+
---
|
|
2006
|
+
|
|
2007
|
+
## 📊 Precedence Rules for Cardinality Protection (NEW - v1.1)
|
|
2008
|
+
|
|
2009
|
+
> **🎯 Pattern:** Cardinality configuration precedence (most specific wins).
|
|
2010
|
+
|
|
2011
|
+
### Precedence Order (Highest to Lowest)
|
|
2012
|
+
|
|
2013
|
+
```
|
|
2014
|
+
1. Event-level explicit config (highest priority)
|
|
2015
|
+
↓
|
|
2016
|
+
2. Preset module config
|
|
2017
|
+
↓
|
|
2018
|
+
3. Base class config (inheritance)
|
|
2019
|
+
↓
|
|
2020
|
+
4. Convention-based defaults (100 series)
|
|
2021
|
+
↓
|
|
2022
|
+
5. Global config (lowest priority)
|
|
2023
|
+
```
|
|
2024
|
+
|
|
2025
|
+
### Example: Mixing Inheritance + Presets for Cardinality
|
|
2026
|
+
|
|
2027
|
+
```ruby
|
|
2028
|
+
# Global config (lowest priority)
|
|
2029
|
+
E11y.configure do |config|
|
|
2030
|
+
config.metrics do
|
|
2031
|
+
cardinality_limit 1_000 # Default for all metrics
|
|
2032
|
+
forbidden_labels :user_id, :session_id
|
|
2033
|
+
end
|
|
2034
|
+
end
|
|
2035
|
+
|
|
2036
|
+
# Base class (medium priority)
|
|
2037
|
+
class Events::BaseUserEvent < E11y::Event::Base
|
|
2038
|
+
# Common cardinality protection
|
|
2039
|
+
metric :counter,
|
|
2040
|
+
name: 'user_events_total',
|
|
2041
|
+
tags: [:user_segment, :event_type],
|
|
2042
|
+
cardinality_limit: 100, # Override global (stricter)
|
|
2043
|
+
tag_extractors: {
|
|
2044
|
+
user_segment: ->(event) { User.find(event.payload[:user_id]).segment }
|
|
2045
|
+
}
|
|
2046
|
+
end
|
|
2047
|
+
|
|
2048
|
+
# Preset module (higher priority)
|
|
2049
|
+
module E11y::Presets::MetricSafeEvent
|
|
2050
|
+
extend ActiveSupport::Concern
|
|
2051
|
+
included do
|
|
2052
|
+
# Override cardinality limit
|
|
2053
|
+
metric :counter,
|
|
2054
|
+
name: 'safe_events_total',
|
|
2055
|
+
tags: [:severity],
|
|
2056
|
+
cardinality_limit: 10 # Very strict!
|
|
2057
|
+
end
|
|
2058
|
+
end
|
|
2059
|
+
|
|
2060
|
+
# Event (highest priority)
|
|
2061
|
+
class Events::UserLogin < Events::BaseUserEvent
|
|
2062
|
+
include E11y::Presets::MetricSafeEvent
|
|
2063
|
+
|
|
2064
|
+
# Override preset (looser limit)
|
|
2065
|
+
metric :counter,
|
|
2066
|
+
name: 'user_logins_total',
|
|
2067
|
+
tags: [:user_segment, :login_method],
|
|
2068
|
+
cardinality_limit: 50 # Override preset
|
|
2069
|
+
|
|
2070
|
+
# Final config:
|
|
2071
|
+
# - cardinality_limit: 50 (event-level override)
|
|
2072
|
+
# - tags: [:user_segment, :login_method] (event-level)
|
|
2073
|
+
# - tag_extractors: inherited from base
|
|
2074
|
+
end
|
|
2075
|
+
```
|
|
2076
|
+
|
|
2077
|
+
### Precedence Rules Table
|
|
2078
|
+
|
|
2079
|
+
| Config | Global | Convention | Base Class | Preset | Event-Level | Winner |
|
|
2080
|
+
|--------|--------|------------|------------|--------|-------------|--------|
|
|
2081
|
+
| `cardinality_limit` | `1_000` | `100` | `100` | `10` | `50` | **`50`** (event) |
|
|
2082
|
+
| `tags` | - | - | `[:user_segment, :event_type]` | `[:severity]` | `[:user_segment, :login_method]` | **`[:user_segment, :login_method]`** (event) |
|
|
2083
|
+
| `forbidden_labels` | `[:user_id, :session_id]` | - | - | - | - | **`[:user_id, :session_id]`** (global) |
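
The table can be read as "take the first level, from most to least specific, that defines the setting". The sketch below models that rule with plain hashes. `PRECEDENCE`, `resolve_setting`, and the `sources` layout are assumptions made for illustration, not E11y's actual resolution code; the values are the ones from the table.

```ruby
# Levels ordered from highest to lowest priority, as listed above.
PRECEDENCE = %i[event preset base_class convention global].freeze

# Returns [value, level] for the first level that defines the setting.
def resolve_setting(key, sources)
  PRECEDENCE.each do |level|
    value = sources.dig(level, key)
    return [value, level] unless value.nil?
  end
  [nil, nil]
end

sources = {
  global: { cardinality_limit: 1_000, forbidden_labels: %i[user_id session_id] },
  convention: { cardinality_limit: 100 },
  base_class: { cardinality_limit: 100, tags: %i[user_segment event_type] },
  preset: { cardinality_limit: 10, tags: %i[severity] },
  event: { cardinality_limit: 50, tags: %i[user_segment login_method] }
}

resolve_setting(:cardinality_limit, sources) # => [50, :event]
resolve_setting(:forbidden_labels, sources)  # => [[:user_id, :session_id], :global]
```

Returning the winning level alongside the value makes it easy to log where a limit came from when a metric behaves unexpectedly.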
|
|
2084
|
+
|
|
2085
|
+
### Convention-Based Defaults
|
|
2086
|
+
|
|
2087
|
+
**Convention:** If no `cardinality_limit` is specified, the default is 100 series:
|
|
2088
|
+
|
|
2089
|
+
```ruby
|
|
2090
|
+
class Events::ApiRequest < E11y::Event::Base
|
|
2091
|
+
metric :counter, name: 'api_requests_total', tags: [:status]
|
|
2092
|
+
# ← Auto: cardinality_limit = 100 (convention!)
|
|
2093
|
+
end
|
|
2094
|
+
```
|
|
2095
|
+
|
|
2096
|
+
---
|
|
2097
|
+
|
|
2098
|
+
## 📚 Related Use Cases
|
|
2099
|
+
|
|
2100
|
+
- **[UC-003: Pattern-Based Metrics](./UC-003-pattern-based-metrics.md)** - Auto-generate metrics
|
|
2101
|
+
- **[UC-008: OpenTelemetry Integration](./UC-008-opentelemetry-integration.md)** - OTLP cardinality protection (C04)
|
|
2102
|
+
- **[UC-015: Cost Optimization](./UC-015-cost-optimization.md)** - Reduce observability costs
|
|
2103
|
+
|
|
2104
|
+
---
|
|
2105
|
+
|
|
2106
|
+
## 🎯 Summary
|
|
2107
|
+
|
|
2108
|
+
### E11y's Competitive Advantage
|
|
2109
|
+
|
|
2110
|
+
**ONLY Ruby gem with production-grade cardinality protection:**
|
|
2111
|
+
|
|
2112
|
+
| Feature | Yabeda | OTel Ruby | AppSignal | E11y |
|
|
2113
|
+
|---------|--------|-----------|-----------|------|
|
|
2114
|
+
| Forbidden labels | ❌ | ❌ | ❌ | ✅ |
|
|
2115
|
+
| Cardinality limits | ❌ | Basic (2000) | Vendor-specific | ✅ 4-layer defense |
|
|
2116
|
+
| Auto-aggregation | ❌ | ❌ | ❌ | ✅ |
|
|
2117
|
+
| Exemplars | ❌ | ❌ | ❌ | ✅ |
|
|
2118
|
+
| Self-monitoring | ❌ | Partial | Vendor-specific | ✅ 8+ metrics |
|
|
2119
|
+
| Cost reduction | 0% | ~30% | Vendor lock-in | **99%** |
|
|
2120
|
+
|
|
2121
|
+
**Real-world impact:** $67,320/month savings (99% reduction)
|
|
2122
|
+
|
|
2123
|
+
---
|
|
2124
|
+
|
|
2125
|
+
**Document Version:** 1.1 (Unified DSL)
|
|
2126
|
+
**Last Updated:** January 16, 2026
|
|
2127
|
+
**Status:** ✅ Complete - Consistent with DSL-SPECIFICATION.md v1.1.0
|