e11y 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.rspec +4 -0
- data/.rubocop.yml +69 -0
- data/CHANGELOG.md +26 -0
- data/CODE_OF_CONDUCT.md +64 -0
- data/LICENSE.txt +21 -0
- data/README.md +179 -0
- data/Rakefile +37 -0
- data/benchmarks/run_all.rb +33 -0
- data/config/README.md +83 -0
- data/config/loki-local-config.yaml +35 -0
- data/config/prometheus.yml +15 -0
- data/docker-compose.yml +78 -0
- data/docs/00-ICP-AND-TIMELINE.md +483 -0
- data/docs/01-SCALE-REQUIREMENTS.md +858 -0
- data/docs/ADR-001-architecture.md +2617 -0
- data/docs/ADR-002-metrics-yabeda.md +1395 -0
- data/docs/ADR-003-slo-observability.md +3337 -0
- data/docs/ADR-004-adapter-architecture.md +2385 -0
- data/docs/ADR-005-tracing-context.md +1372 -0
- data/docs/ADR-006-security-compliance.md +4143 -0
- data/docs/ADR-007-opentelemetry-integration.md +1385 -0
- data/docs/ADR-008-rails-integration.md +1911 -0
- data/docs/ADR-009-cost-optimization.md +2993 -0
- data/docs/ADR-010-developer-experience.md +2166 -0
- data/docs/ADR-011-testing-strategy.md +1836 -0
- data/docs/ADR-012-event-evolution.md +958 -0
- data/docs/ADR-013-reliability-error-handling.md +2750 -0
- data/docs/ADR-014-event-driven-slo.md +1533 -0
- data/docs/ADR-015-middleware-order.md +1061 -0
- data/docs/ADR-016-self-monitoring-slo.md +1234 -0
- data/docs/API-REFERENCE-L28.md +914 -0
- data/docs/COMPREHENSIVE-CONFIGURATION.md +2366 -0
- data/docs/IMPLEMENTATION_NOTES.md +2804 -0
- data/docs/IMPLEMENTATION_PLAN.md +1971 -0
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +586 -0
- data/docs/PLAN.md +148 -0
- data/docs/QUICK-START.md +934 -0
- data/docs/README.md +296 -0
- data/docs/design/00-memory-optimization.md +593 -0
- data/docs/guides/MIGRATION-L27-L28.md +692 -0
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +434 -0
- data/docs/guides/README.md +44 -0
- data/docs/prd/01-overview-vision.md +440 -0
- data/docs/use_cases/README.md +119 -0
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +813 -0
- data/docs/use_cases/UC-002-business-event-tracking.md +1953 -0
- data/docs/use_cases/UC-003-pattern-based-metrics.md +1627 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +728 -0
- data/docs/use_cases/UC-005-sentry-integration.md +759 -0
- data/docs/use_cases/UC-006-trace-context-management.md +905 -0
- data/docs/use_cases/UC-007-pii-filtering.md +2648 -0
- data/docs/use_cases/UC-008-opentelemetry-integration.md +1153 -0
- data/docs/use_cases/UC-009-multi-service-tracing.md +1043 -0
- data/docs/use_cases/UC-010-background-job-tracking.md +1018 -0
- data/docs/use_cases/UC-011-rate-limiting.md +1906 -0
- data/docs/use_cases/UC-012-audit-trail.md +2301 -0
- data/docs/use_cases/UC-013-high-cardinality-protection.md +2127 -0
- data/docs/use_cases/UC-014-adaptive-sampling.md +1940 -0
- data/docs/use_cases/UC-015-cost-optimization.md +735 -0
- data/docs/use_cases/UC-016-rails-logger-migration.md +785 -0
- data/docs/use_cases/UC-017-local-development.md +867 -0
- data/docs/use_cases/UC-018-testing-events.md +1081 -0
- data/docs/use_cases/UC-019-tiered-storage-migration.md +562 -0
- data/docs/use_cases/UC-020-event-versioning.md +708 -0
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +956 -0
- data/docs/use_cases/UC-022-event-registry.md +648 -0
- data/docs/use_cases/backlog.md +226 -0
- data/e11y.gemspec +76 -0
- data/lib/e11y/adapters/adaptive_batcher.rb +207 -0
- data/lib/e11y/adapters/audit_encrypted.rb +239 -0
- data/lib/e11y/adapters/base.rb +580 -0
- data/lib/e11y/adapters/file.rb +224 -0
- data/lib/e11y/adapters/in_memory.rb +216 -0
- data/lib/e11y/adapters/loki.rb +333 -0
- data/lib/e11y/adapters/otel_logs.rb +203 -0
- data/lib/e11y/adapters/registry.rb +141 -0
- data/lib/e11y/adapters/sentry.rb +230 -0
- data/lib/e11y/adapters/stdout.rb +108 -0
- data/lib/e11y/adapters/yabeda.rb +370 -0
- data/lib/e11y/buffers/adaptive_buffer.rb +339 -0
- data/lib/e11y/buffers/base_buffer.rb +40 -0
- data/lib/e11y/buffers/request_scoped_buffer.rb +246 -0
- data/lib/e11y/buffers/ring_buffer.rb +267 -0
- data/lib/e11y/buffers.rb +14 -0
- data/lib/e11y/console.rb +122 -0
- data/lib/e11y/current.rb +48 -0
- data/lib/e11y/event/base.rb +894 -0
- data/lib/e11y/event/value_sampling_config.rb +84 -0
- data/lib/e11y/events/base_audit_event.rb +43 -0
- data/lib/e11y/events/base_payment_event.rb +33 -0
- data/lib/e11y/events/rails/cache/delete.rb +21 -0
- data/lib/e11y/events/rails/cache/read.rb +23 -0
- data/lib/e11y/events/rails/cache/write.rb +22 -0
- data/lib/e11y/events/rails/database/query.rb +45 -0
- data/lib/e11y/events/rails/http/redirect.rb +21 -0
- data/lib/e11y/events/rails/http/request.rb +26 -0
- data/lib/e11y/events/rails/http/send_file.rb +21 -0
- data/lib/e11y/events/rails/http/start_processing.rb +26 -0
- data/lib/e11y/events/rails/job/completed.rb +22 -0
- data/lib/e11y/events/rails/job/enqueued.rb +22 -0
- data/lib/e11y/events/rails/job/failed.rb +22 -0
- data/lib/e11y/events/rails/job/scheduled.rb +23 -0
- data/lib/e11y/events/rails/job/started.rb +22 -0
- data/lib/e11y/events/rails/log.rb +56 -0
- data/lib/e11y/events/rails/view/render.rb +23 -0
- data/lib/e11y/events.rb +18 -0
- data/lib/e11y/instruments/active_job.rb +201 -0
- data/lib/e11y/instruments/rails_instrumentation.rb +141 -0
- data/lib/e11y/instruments/sidekiq.rb +175 -0
- data/lib/e11y/logger/bridge.rb +205 -0
- data/lib/e11y/metrics/cardinality_protection.rb +172 -0
- data/lib/e11y/metrics/cardinality_tracker.rb +134 -0
- data/lib/e11y/metrics/registry.rb +234 -0
- data/lib/e11y/metrics/relabeling.rb +226 -0
- data/lib/e11y/metrics.rb +102 -0
- data/lib/e11y/middleware/audit_signing.rb +174 -0
- data/lib/e11y/middleware/base.rb +140 -0
- data/lib/e11y/middleware/event_slo.rb +167 -0
- data/lib/e11y/middleware/pii_filter.rb +266 -0
- data/lib/e11y/middleware/pii_filtering.rb +280 -0
- data/lib/e11y/middleware/rate_limiting.rb +214 -0
- data/lib/e11y/middleware/request.rb +163 -0
- data/lib/e11y/middleware/routing.rb +157 -0
- data/lib/e11y/middleware/sampling.rb +254 -0
- data/lib/e11y/middleware/slo.rb +168 -0
- data/lib/e11y/middleware/trace_context.rb +131 -0
- data/lib/e11y/middleware/validation.rb +118 -0
- data/lib/e11y/middleware/versioning.rb +132 -0
- data/lib/e11y/middleware.rb +12 -0
- data/lib/e11y/pii/patterns.rb +90 -0
- data/lib/e11y/pii.rb +13 -0
- data/lib/e11y/pipeline/builder.rb +155 -0
- data/lib/e11y/pipeline/zone_validator.rb +110 -0
- data/lib/e11y/pipeline.rb +12 -0
- data/lib/e11y/presets/audit_event.rb +65 -0
- data/lib/e11y/presets/debug_event.rb +34 -0
- data/lib/e11y/presets/high_value_event.rb +51 -0
- data/lib/e11y/presets.rb +19 -0
- data/lib/e11y/railtie.rb +138 -0
- data/lib/e11y/reliability/circuit_breaker.rb +216 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +277 -0
- data/lib/e11y/reliability/dlq/filter.rb +117 -0
- data/lib/e11y/reliability/retry_handler.rb +207 -0
- data/lib/e11y/reliability/retry_rate_limiter.rb +117 -0
- data/lib/e11y/sampling/error_spike_detector.rb +225 -0
- data/lib/e11y/sampling/load_monitor.rb +161 -0
- data/lib/e11y/sampling/stratified_tracker.rb +92 -0
- data/lib/e11y/sampling/value_extractor.rb +82 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +79 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +97 -0
- data/lib/e11y/self_monitoring/reliability_monitor.rb +146 -0
- data/lib/e11y/slo/event_driven.rb +150 -0
- data/lib/e11y/slo/tracker.rb +119 -0
- data/lib/e11y/version.rb +9 -0
- data/lib/e11y.rb +283 -0
- metadata +452 -0
@@ -0,0 +1,562 @@
# UC-019: Retention Tagging for Downstream Data Lifecycle

**Status:** Cost Optimization Feature (v1.0)
**Complexity:** Simple (just tagging!)
**Setup Time:** 10 minutes (E11y config only)
**Target Users:** Platform Engineers, DevOps, Cost Optimization Teams

---

## Overview

### Problem Statement

**Current Pain Points:**

1. **Downstream systems don't know event retention requirements**
   - Elasticsearch ILM treats all events the same
   - S3 Lifecycle Rules need manual setup per prefix
   - No way to express "this event should be kept 7 years"

2. **Storage costs grow linearly**
   - All events in hot tier (expensive)
   - No differentiation between debug (1 day) and audit (7 years)

3. **Manual configuration hell**
   - Different ES indices for different retention? (nightmare)
   - Different S3 buckets per retention? (management overhead)
   - No single source of truth

### E11y Solution

**Just Add Metadata Tags!**

E11y simply adds **retention tags** to every event:
- `retention_days: 7` for debug events
- `retention_days: 2555` (7 years) for audit events
- **Downstream systems** (ES ILM, S3 Lifecycle) use these tags

**Result:** Simple configuration; downstream does all the work.

---

## Use Case Scenarios

### Scenario 1: Standard Observability Events

**Context:** Regular application events (logs, metrics)

```ruby
# Default retention: 30 days
class OrderCreated < E11y::Event::Base
  # No explicit retention → use default (30 days)
end

# E11y adds metadata:
Events::OrderCreated.track(order_id: '123')
# Event written with:
# {
#   "@timestamp": "2026-01-12T10:30:00Z",
#   "retention_until": "2026-02-11T10:30:00Z",  # ← E11y calculates: @timestamp + 30 days
#   "event_name": "order.created",
#   ...
# }

# Downstream systems simply check:
#   if now > retention_until → delete
# No calculation needed!
```
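
The expiry calculation in this scenario can be sketched in plain Ruby. This is a minimal illustration, not E11y's implementation: `tag_retention` and `SECONDS_PER_DAY` are hypothetical names, and the event is modelled as a plain Hash.

```ruby
require "time"

SECONDS_PER_DAY = 86_400

# Hypothetical helper (not part of E11y): returns a copy of the event
# with an absolute expiry timestamp derived from "@timestamp".
def tag_retention(event, retention_days:)
  written_at = Time.iso8601(event.fetch("@timestamp"))
  expires_at = written_at + (retention_days * SECONDS_PER_DAY)
  event.merge("retention_until" => expires_at.utc.iso8601)
end

event  = { "@timestamp" => "2026-01-12T10:30:00Z", "event_name" => "order.created" }
tagged = tag_retention(event, retention_days: 30)

puts tagged["retention_until"] # prints 2026-02-11T10:30:00Z
```

Because the helper returns a new Hash, the original event stays untouched, so the tag can be added late in a pipeline without side effects.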

### Scenario 2: Audit Events (Long Retention)

**Context:** Compliance events requiring 7-year retention

```ruby
class UserPermissionChanged < E11y::AuditEvent
  audit_retention 7.years  # Compliance requirement

  schema do
    required(:user_id).filled(:string)
    required(:old_role).filled(:string)
    required(:new_role).filled(:string)
  end
end

# E11y adds metadata:
Events::UserPermissionChanged.track(...)
# Event written with:
# {
#   "@timestamp": "2026-01-12T10:30:00Z",
#   "retention_until": "2033-01-12T10:30:00Z",  # ← @timestamp + 7 years
#   "event_name": "user.permission_changed",
#   ...
# }

# Downstream systems simply check:
#   if now > retention_until → delete
```
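
This document expresses seven years both as `7.years` and as `2555` days. The two are not quite the same once leap days are involved, which matters for compliance deadlines. The sketch below uses only Ruby's standard `Date` class (no ActiveSupport); the variable names are illustrative.

```ruby
require "date"

written_on = Date.new(2026, 1, 12)

# Seven calendar years, expressed as 84 months.
calendar_expiry = written_on >> (7 * 12)

# Seven years approximated as 7 x 365 days, which ignores leap days.
fixed_day_expiry = written_on + 2555

drift_days = (calendar_expiry - fixed_day_expiry).to_i

puts calendar_expiry  # prints 2033-01-12
puts fixed_day_expiry # prints 2033-01-10
puts drift_days       # prints 2
```

If a regulation is worded in calendar years, a fixed-day approximation would expire the event slightly early, so the calendar form is the safer default for audit data.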

### Scenario 3: High-Volume Events (Short Retention)

**Context:** Debug logs, page views (noise if kept long)

```ruby
class PageView < E11y::Event::Base
  retention 1.day  # Short retention

  schema do
    required(:path).filled(:string)
    required(:user_id).filled(:string)
  end
end

# E11y adds metadata:
Events::PageView.track(...)
# Event written with:
# {
#   "@timestamp": "2026-01-12T10:30:00Z",
#   "retention_until": "2026-01-13T10:30:00Z",  # ← @timestamp + 1 day
#   "event_name": "page.view",
#   ...
# }

# Downstream cleanup:
# - Deletes when now > retention_until (1 day later)
```

---

## Architecture

> **Implementation:** See [ADR-009 Section 6: Tiered Storage](../ADR-009-cost-optimization.md#6-tiered-storage-retention_until-tagging) for retention tagger architecture, retention tiers configuration, and downstream integration (Elasticsearch ILM, S3 Lifecycle).

### E11y's Simple Role: Just Add Expiry Date!

```
┌─────────────────────────────────────────────────────────────────┐
│ E11Y (Dead Simple: Calculate Expiry Date)                       │
│                                                                 │
│  Event.track(...)                                               │
│       ↓                                                         │
│  Add metadata to event:                                         │
│  {                                                              │
│    "@timestamp": "2026-01-12T10:30:00Z",                        │
│    "retention_until": "2026-02-11T10:30:00Z", ← @timestamp + 30d│
│    "event_name": "order.created",                               │
│    ...                                                          │
│  }                                                              │
│       ↓                                                         │
│  Write to adapters (Loki, ES, S3)                               │
│                                                                 │
│  THAT'S IT! E11y's job done ✅                                  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ DOWNSTREAM SYSTEMS (Trivial Logic)                              │
│                                                                 │
│  Elasticsearch ILM / S3 Lifecycle / External job:               │
│                                                                 │
│  ┌────────────────────────────────┐                             │
│  │ if now > retention_until       │                             │
│  │   delete(event)                │                             │
│  │ end                            │                             │
│  └────────────────────────────────┘                             │
│                                                                 │
│  No calculations! Just date comparison ✅                       │
└─────────────────────────────────────────────────────────────────┘
```
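
The downstream half of the diagram reduces to a single date comparison. The sketch below shows it over an in-memory list of events; `expired?` and `sweep` are hypothetical helper names, and a real job would query the storage backend rather than a Ruby array.

```ruby
require "time"

# Hypothetical predicate mirroring "if now > retention_until".
def expired?(event, now: Time.now.utc)
  now > Time.iso8601(event.fetch("retention_until"))
end

# Returns only the events that are still within their retention period.
def sweep(events, now: Time.now.utc)
  events.reject { |event| expired?(event, now: now) }
end

now = Time.iso8601("2026-01-14T00:00:00Z")
events = [
  { "event_name" => "page.view",     "retention_until" => "2026-01-13T10:30:00Z" },
  { "event_name" => "order.created", "retention_until" => "2026-02-11T10:30:00Z" }
]

kept = sweep(events, now: now)
puts kept.map { |event| event["event_name"] }.inspect # prints ["order.created"]
```

Passing `now` explicitly keeps the check deterministic in tests, while production callers can rely on the default.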

### Benefits of `retention_until` (Absolute Date)

**vs. `retention_days` (Relative):**

| Approach | Downstream Logic | Clock Skew Issues? | Simple? |
|----------|------------------|--------------------|---------|
| `retention_days: 30` | `if now > (@timestamp + 30.days)` | ❌ Yes (if clocks differ) | 🟡 Needs calculation |
| `retention_until: "2026-02-11"` | `if now > retention_until` | ✅ No (date already calculated) | ✅ Trivial comparison |

**E11y calculates once, downstream just compares!**

---

## Configuration

### E11y Configuration (Dead Simple!)

```ruby
# config/initializers/e11y.rb
E11y.configure do |config|
  # Just enable retention tagging!
  config.retention_tagging do
    enabled true

    # Default retention for events without explicit retention
    default_retention 30.days

    # Per-pattern retention rules
    retention_by_pattern do
      pattern 'audit.*', retention: 7.years
      pattern 'security.*', retention: 1.year
      pattern 'debug.*', retention: 1.day
      pattern '*.page_view', retention: 7.days
      pattern '*', retention: 30.days  # Default
    end

    # Field name (what E11y adds to each event)
    retention_field :retention_until  # ISO8601 timestamp
  end
end

# E11y automatically adds to each event:
# event["retention_until"] = event["@timestamp"] + retention_period
```

**That's it for E11y!** Downstream just checks: `now > retention_until`
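
One plausible reading of the per-pattern rules is "first matching pattern wins", with `'*'` as the catch-all at the end. That ordering is an assumption, not something this document states, and `RETENTION_RULES` and `retention_days_for` are illustrative names. The sketch uses Ruby's built-in `File.fnmatch` for glob matching and expresses retention in days (`7.years` is written as 2555 days, as elsewhere in this document). With this ordering an event such as `audit.page_view` keeps the audit retention, so more specific or stricter rules belong at the top of the list.

```ruby
# Rules in declaration order: [glob pattern, retention in days].
RETENTION_RULES = [
  ["audit.*",     2555],
  ["security.*",  365],
  ["debug.*",     1],
  ["*.page_view", 7],
  ["*",           30]
].freeze

DEFAULT_RETENTION_DAYS = 30

# Assumed semantics: the first rule whose glob matches the event name wins.
def retention_days_for(event_name, rules = RETENTION_RULES)
  _pattern, days = rules.find { |pattern, _days| File.fnmatch(pattern, event_name) }
  days || DEFAULT_RETENTION_DAYS
end

puts retention_days_for("audit.permission_changed") # prints 2555
puts retention_days_for("shop.page_view")           # prints 7
puts retention_days_for("order.created")            # prints 30
puts retention_days_for("audit.page_view")          # prints 2555
```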

---

## Downstream Configuration

**E11y doesn't migrate data!** Downstream systems act on the retention metadata E11y writes: the `retention_until` field, or a `retention_days` tag for S3.

### Option 1: Elasticsearch ILM (Recommended)

**Index-level ILM policy (ILM ages whole indices and does not read per-event retention):**

```bash
# Create ILM policy in Elasticsearch.
# IMPORTANT: the delete phase is an index-wide upper bound ("min_age": "365d", the max default).
# Deleting each event at its own retention date requires an ES script or an external job.
PUT _ilm/policy/e11y-events-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50GB",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "e11y-s3-repository"
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

# NOTE: ES ILM doesn't natively support per-document retention!
# You need EITHER:
# 1. Multiple ILM policies per retention period (complex)
# 2. External cleanup job (reads retention_until, deletes expired docs)
```

**Better approach:** an external cleanup job that reads `retention_until` (see Option 3).

### Option 2: S3 Lifecycle Rules

**Problem:** S3 Lifecycle works on object creation date, not event @timestamp!

**Solution:** E11y can add S3 object tags (if using S3 adapter):

```ruby
# E11y S3 Adapter adds object tags
config.adapters do
  register :s3, E11y::Adapters::S3Adapter.new(
    bucket: 'e11y-events',
    tagging: true,                      # Enable object tagging
    tags_from_event: [:retention_days]  # Copy event field to S3 tag
  )
end
```

```hcl
# AWS S3 Lifecycle (using tags)
resource "aws_s3_bucket_lifecycle_configuration" "e11y_events" {
  bucket = aws_s3_bucket.e11y_events.id

  # Rule for 7-day retention (debug, page views)
  rule {
    id     = "short-retention"
    status = "Enabled"

    expiration {
      days = 7
    }

    filter {
      tag {
        key   = "retention_days"
        value = "7"
      }
    }
  }

  # Rule for 30-day retention (standard events)
  rule {
    id     = "standard-retention"
    status = "Enabled"

    # No STANDARD_IA transition here: S3 only allows that transition after
    # 30 days, which is when these objects expire anyway.
    expiration {
      days = 30
    }

    filter {
      tag {
        key   = "retention_days"
        value = "30"
      }
    }
  }

  # Rule for 7-year retention (audit)
  rule {
    id     = "audit-retention"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "GLACIER"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }

    expiration {
      days = 2555  # 7 years
    }

    filter {
      tag {
        key   = "retention_days"
        value = "2555"
      }
    }
  }
}
```

**Note:** You need one rule per `retention_days` value (manageable for common values like 1, 7, 30, 365, 2555).
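
Since each lifecycle rule matches one exact tag value, an event with an unusual retention (say 14 days) would match no rule at all. One way to handle that, sketched here as an assumption rather than as E11y behaviour, is to round the retention up to the nearest value that has a rule, so data is never deleted earlier than requested. `SUPPORTED_RETENTION_DAYS` and `retention_tag_for` are hypothetical names.

```ruby
# Tag values that have a matching S3 lifecycle rule (from the note above).
SUPPORTED_RETENTION_DAYS = [1, 7, 30, 365, 2555].freeze

# Rounds the requested retention UP to the nearest supported bucket.
def retention_tag_for(days)
  bucket = SUPPORTED_RETENTION_DAYS.find { |supported| supported >= days }
  raise ArgumentError, "no lifecycle rule covers #{days} days" unless bucket

  { "retention_days" => bucket.to_s }
end

puts retention_tag_for(7).inspect
puts retention_tag_for(14).inspect
```

Rounding up trades a little extra storage for the guarantee that no object expires before its intended retention date.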

### Option 3: External Cleanup Job (Recommended!)

**Trivial logic with `retention_until`:**

```ruby
# lib/tasks/e11y_cleanup.rake
namespace :e11y do
  desc "Delete events past their retention period"
  task cleanup: :environment do
    es_client = Elasticsearch::Client.new(url: ENV['ES_URL'])

    # Delete expired events (dead simple query!)
    response = es_client.delete_by_query(
      index: 'e11y-events-*',
      body: {
        query: {
          range: {
            retention_until: {
              lte: 'now'  # ← That's it! Just: retention_until <= now
            }
          }
        }
      }
    )

    puts "Deleted #{response['deleted']} expired events"
  end
end

# Schedule daily: 0 2 * * * rake e11y:cleanup
```

**This approach:**
- ✅ Works with ANY retention period (no calculation!)
- ✅ Trivial query: `retention_until <= now`
- ✅ No Painless scripts (faster, simpler)
- ✅ Standard Elasticsearch range query
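
The query body from the rake task can be built and inspected without an Elasticsearch connection, which is handy for unit tests or for previewing the request before running a delete. In this sketch `expired_events_query` is a hypothetical helper, and the `field` default assumes the `retention_until` field name configured earlier.

```ruby
require "json"

# Builds the range query "field <= now" used by the cleanup task.
# `now` accepts Elasticsearch date math ("now") or an ISO8601 timestamp.
def expired_events_query(field: "retention_until", now: "now")
  { query: { range: { field => { lte: now } } } }
end

puts JSON.pretty_generate(expired_events_query)
```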

---

## Cost Savings Example

### Before Tiered Storage

```
All events in Elasticsearch (hot tier):
- Volume: 1TB/month
- Retention: 365 days
- Total storage: 12TB/year
- ES cost: $0.10/GB/month
- Annual cost: $0.10 × 12,000 GB × 12 months = $14,400/year
```

### After Tiered Storage

```
Hot tier (ES, 0-7 days):
- Volume: 1TB/month × 7/30 = 233GB
- Cost: $0.10 × 233 GB × 12 = $280/year

Warm tier (S3 Standard, 7-30 days):
- Volume: 1TB/month × 23/30 = 767GB
- Cost: $0.023/GB/month
- Annual cost: $0.023 × 767 GB × 12 = $212/year

Cold tier (S3 Glacier, 30-365 days):
- Volume: 12TB/year × 335/365 ≈ 11,000GB
- Cost: $0.004/GB/month
- Annual cost: $0.004 × 11,000 GB × 12 = $528/year

Total cost: $280 + $212 + $528 = $1,020/year
Savings: $14,400 - $1,020 = $13,380/year (93% reduction!)
```
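
The per-tier arithmetic follows one formula: resident gigabytes × monthly price per GB × 12. The sketch below applies that formula to the volumes and prices quoted in this section, so the figures can be rechecked or adapted to other workloads. `annual_cost` and the `tiers` hash are illustrative names, and the prices are the example's inputs rather than current list prices.

```ruby
# Annual cost of keeping `gb` gigabytes resident at a per-GB monthly price.
def annual_cost(gb:, price_per_gb_month:)
  (gb * price_per_gb_month * 12).round(2)
end

tiers = {
  hot:  { gb: 233,    price_per_gb_month: 0.10 },  # ES, 0-7 days
  warm: { gb: 767,    price_per_gb_month: 0.023 }, # S3 Standard, 7-30 days
  cold: { gb: 11_000, price_per_gb_month: 0.004 }  # S3 Glacier, 30-365 days
}

baseline = annual_cost(gb: 12_000, price_per_gb_month: 0.10)
tiered   = tiers.values.sum { |tier| annual_cost(**tier) }
savings  = baseline - tiered

tiers.each { |name, tier| puts format("%-5s $%.2f/year", name, annual_cost(**tier)) }
puts format("total $%.2f/year, saving %.1f%% vs. $%.2f", tiered, 100.0 * savings / baseline, baseline)
```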

---

## Best Practices

### ✅ DO

**1. Define retention at event level**
```ruby
class AuditEvent < E11y::Event::Base
  retention 7.years  # Explicit
end
```

**2. Use retention tagging for S3 lifecycle**
```ruby
config.cost_optimization.retention_tagging do
  enabled true
  tag_with_retention true  # Adds retention_days to event metadata
end
```

**3. Query warm/cold data via Athena/BigQuery**
```sql
-- Query S3 via AWS Athena
SELECT * FROM e11y_events_warm
WHERE date = '2024-01-15'
  AND event_name = 'order.created'
LIMIT 100;
```

**4. Set up ES ILM for automatic migration**
```bash
# Let Elasticsearch handle hot→warm→cold automatically
```

---

### ❌ DON'T

**1. Don't expect E11y to migrate data**
```ruby
# ❌ E11y doesn't move data between adapters
# It only routes writes to appropriate tiers

# ✅ Use ES ILM or S3 Lifecycle for migration
```

**2. Don't keep high-volume data in hot tier long**
```ruby
# ❌ BAD: Debug logs in ES for 30 days
class DebugEvent < E11y::Event::Base
  retention 30.days  # Expensive!
end

# ✅ GOOD: Short retention for debug
class DebugEvent < E11y::Event::Base
  retention 1.day  # Cheap!
end
```

**3. Don't forget to configure S3 lifecycle rules**
```ruby
# If you send events to S3, set up lifecycle rules!
# Otherwise data stays in Standard tier (expensive)
```

---

## Success Metrics

### Quantifiable Benefits

**1. Storage Cost Reduction**
- Before: $14,400/year (all in ES)
- After: $1,020/year (tiered; hot $280 + warm $212 + cold $528)
- **Savings: 93%**

**2. Query Performance**
- Hot tier (0-7 days): <1s queries ✓
- Warm tier (7-30 days): 1-5s queries (acceptable)
- Cold tier (30+ days): Rare access (minutes OK)

**3. Compliance**
- Audit events: 7-year retention ✓
- PII events: Auto-deleted after 30 days ✓
- Debug logs: Deleted after 1 day ✓

---

## Related Use Cases

- **[UC-012: Audit Trail](./UC-012-audit-trail.md)** - Long-term retention for compliance
- **[UC-015: Cost Optimization](./UC-015-cost-optimization.md)** - Overall cost reduction strategies
- **[UC-002: Business Event Tracking](./UC-002-business-event-tracking.md)** - Event definitions with retention

---

## Quick Start Checklist

- [ ] Enable tiered storage in E11y config
- [ ] Configure retention tagging
- [ ] Set up Elasticsearch ILM policy
- [ ] Configure S3 lifecycle rules
- [ ] Define per-event retention policies
- [ ] Test write-time routing
- [ ] Monitor storage costs (before/after)
- [ ] Set up Athena for warm/cold queries (optional)

---

**Status:** ✅ Ready for Implementation
**Priority:** High (significant cost savings)
**Complexity:** Simple in E11y; advanced downstream (requires downstream setup)