e11y 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.rspec +4 -0
- data/.rubocop.yml +69 -0
- data/CHANGELOG.md +26 -0
- data/CODE_OF_CONDUCT.md +64 -0
- data/LICENSE.txt +21 -0
- data/README.md +179 -0
- data/Rakefile +37 -0
- data/benchmarks/run_all.rb +33 -0
- data/config/README.md +83 -0
- data/config/loki-local-config.yaml +35 -0
- data/config/prometheus.yml +15 -0
- data/docker-compose.yml +78 -0
- data/docs/00-ICP-AND-TIMELINE.md +483 -0
- data/docs/01-SCALE-REQUIREMENTS.md +858 -0
- data/docs/ADR-001-architecture.md +2617 -0
- data/docs/ADR-002-metrics-yabeda.md +1395 -0
- data/docs/ADR-003-slo-observability.md +3337 -0
- data/docs/ADR-004-adapter-architecture.md +2385 -0
- data/docs/ADR-005-tracing-context.md +1372 -0
- data/docs/ADR-006-security-compliance.md +4143 -0
- data/docs/ADR-007-opentelemetry-integration.md +1385 -0
- data/docs/ADR-008-rails-integration.md +1911 -0
- data/docs/ADR-009-cost-optimization.md +2993 -0
- data/docs/ADR-010-developer-experience.md +2166 -0
- data/docs/ADR-011-testing-strategy.md +1836 -0
- data/docs/ADR-012-event-evolution.md +958 -0
- data/docs/ADR-013-reliability-error-handling.md +2750 -0
- data/docs/ADR-014-event-driven-slo.md +1533 -0
- data/docs/ADR-015-middleware-order.md +1061 -0
- data/docs/ADR-016-self-monitoring-slo.md +1234 -0
- data/docs/API-REFERENCE-L28.md +914 -0
- data/docs/COMPREHENSIVE-CONFIGURATION.md +2366 -0
- data/docs/IMPLEMENTATION_NOTES.md +2804 -0
- data/docs/IMPLEMENTATION_PLAN.md +1971 -0
- data/docs/IMPLEMENTATION_PLAN_ARCHITECTURE.md +586 -0
- data/docs/PLAN.md +148 -0
- data/docs/QUICK-START.md +934 -0
- data/docs/README.md +296 -0
- data/docs/design/00-memory-optimization.md +593 -0
- data/docs/guides/MIGRATION-L27-L28.md +692 -0
- data/docs/guides/PERFORMANCE-BENCHMARKS.md +434 -0
- data/docs/guides/README.md +44 -0
- data/docs/prd/01-overview-vision.md +440 -0
- data/docs/use_cases/README.md +119 -0
- data/docs/use_cases/UC-001-request-scoped-debug-buffering.md +813 -0
- data/docs/use_cases/UC-002-business-event-tracking.md +1953 -0
- data/docs/use_cases/UC-003-pattern-based-metrics.md +1627 -0
- data/docs/use_cases/UC-004-zero-config-slo-tracking.md +728 -0
- data/docs/use_cases/UC-005-sentry-integration.md +759 -0
- data/docs/use_cases/UC-006-trace-context-management.md +905 -0
- data/docs/use_cases/UC-007-pii-filtering.md +2648 -0
- data/docs/use_cases/UC-008-opentelemetry-integration.md +1153 -0
- data/docs/use_cases/UC-009-multi-service-tracing.md +1043 -0
- data/docs/use_cases/UC-010-background-job-tracking.md +1018 -0
- data/docs/use_cases/UC-011-rate-limiting.md +1906 -0
- data/docs/use_cases/UC-012-audit-trail.md +2301 -0
- data/docs/use_cases/UC-013-high-cardinality-protection.md +2127 -0
- data/docs/use_cases/UC-014-adaptive-sampling.md +1940 -0
- data/docs/use_cases/UC-015-cost-optimization.md +735 -0
- data/docs/use_cases/UC-016-rails-logger-migration.md +785 -0
- data/docs/use_cases/UC-017-local-development.md +867 -0
- data/docs/use_cases/UC-018-testing-events.md +1081 -0
- data/docs/use_cases/UC-019-tiered-storage-migration.md +562 -0
- data/docs/use_cases/UC-020-event-versioning.md +708 -0
- data/docs/use_cases/UC-021-error-handling-retry-dlq.md +956 -0
- data/docs/use_cases/UC-022-event-registry.md +648 -0
- data/docs/use_cases/backlog.md +226 -0
- data/e11y.gemspec +76 -0
- data/lib/e11y/adapters/adaptive_batcher.rb +207 -0
- data/lib/e11y/adapters/audit_encrypted.rb +239 -0
- data/lib/e11y/adapters/base.rb +580 -0
- data/lib/e11y/adapters/file.rb +224 -0
- data/lib/e11y/adapters/in_memory.rb +216 -0
- data/lib/e11y/adapters/loki.rb +333 -0
- data/lib/e11y/adapters/otel_logs.rb +203 -0
- data/lib/e11y/adapters/registry.rb +141 -0
- data/lib/e11y/adapters/sentry.rb +230 -0
- data/lib/e11y/adapters/stdout.rb +108 -0
- data/lib/e11y/adapters/yabeda.rb +370 -0
- data/lib/e11y/buffers/adaptive_buffer.rb +339 -0
- data/lib/e11y/buffers/base_buffer.rb +40 -0
- data/lib/e11y/buffers/request_scoped_buffer.rb +246 -0
- data/lib/e11y/buffers/ring_buffer.rb +267 -0
- data/lib/e11y/buffers.rb +14 -0
- data/lib/e11y/console.rb +122 -0
- data/lib/e11y/current.rb +48 -0
- data/lib/e11y/event/base.rb +894 -0
- data/lib/e11y/event/value_sampling_config.rb +84 -0
- data/lib/e11y/events/base_audit_event.rb +43 -0
- data/lib/e11y/events/base_payment_event.rb +33 -0
- data/lib/e11y/events/rails/cache/delete.rb +21 -0
- data/lib/e11y/events/rails/cache/read.rb +23 -0
- data/lib/e11y/events/rails/cache/write.rb +22 -0
- data/lib/e11y/events/rails/database/query.rb +45 -0
- data/lib/e11y/events/rails/http/redirect.rb +21 -0
- data/lib/e11y/events/rails/http/request.rb +26 -0
- data/lib/e11y/events/rails/http/send_file.rb +21 -0
- data/lib/e11y/events/rails/http/start_processing.rb +26 -0
- data/lib/e11y/events/rails/job/completed.rb +22 -0
- data/lib/e11y/events/rails/job/enqueued.rb +22 -0
- data/lib/e11y/events/rails/job/failed.rb +22 -0
- data/lib/e11y/events/rails/job/scheduled.rb +23 -0
- data/lib/e11y/events/rails/job/started.rb +22 -0
- data/lib/e11y/events/rails/log.rb +56 -0
- data/lib/e11y/events/rails/view/render.rb +23 -0
- data/lib/e11y/events.rb +18 -0
- data/lib/e11y/instruments/active_job.rb +201 -0
- data/lib/e11y/instruments/rails_instrumentation.rb +141 -0
- data/lib/e11y/instruments/sidekiq.rb +175 -0
- data/lib/e11y/logger/bridge.rb +205 -0
- data/lib/e11y/metrics/cardinality_protection.rb +172 -0
- data/lib/e11y/metrics/cardinality_tracker.rb +134 -0
- data/lib/e11y/metrics/registry.rb +234 -0
- data/lib/e11y/metrics/relabeling.rb +226 -0
- data/lib/e11y/metrics.rb +102 -0
- data/lib/e11y/middleware/audit_signing.rb +174 -0
- data/lib/e11y/middleware/base.rb +140 -0
- data/lib/e11y/middleware/event_slo.rb +167 -0
- data/lib/e11y/middleware/pii_filter.rb +266 -0
- data/lib/e11y/middleware/pii_filtering.rb +280 -0
- data/lib/e11y/middleware/rate_limiting.rb +214 -0
- data/lib/e11y/middleware/request.rb +163 -0
- data/lib/e11y/middleware/routing.rb +157 -0
- data/lib/e11y/middleware/sampling.rb +254 -0
- data/lib/e11y/middleware/slo.rb +168 -0
- data/lib/e11y/middleware/trace_context.rb +131 -0
- data/lib/e11y/middleware/validation.rb +118 -0
- data/lib/e11y/middleware/versioning.rb +132 -0
- data/lib/e11y/middleware.rb +12 -0
- data/lib/e11y/pii/patterns.rb +90 -0
- data/lib/e11y/pii.rb +13 -0
- data/lib/e11y/pipeline/builder.rb +155 -0
- data/lib/e11y/pipeline/zone_validator.rb +110 -0
- data/lib/e11y/pipeline.rb +12 -0
- data/lib/e11y/presets/audit_event.rb +65 -0
- data/lib/e11y/presets/debug_event.rb +34 -0
- data/lib/e11y/presets/high_value_event.rb +51 -0
- data/lib/e11y/presets.rb +19 -0
- data/lib/e11y/railtie.rb +138 -0
- data/lib/e11y/reliability/circuit_breaker.rb +216 -0
- data/lib/e11y/reliability/dlq/file_storage.rb +277 -0
- data/lib/e11y/reliability/dlq/filter.rb +117 -0
- data/lib/e11y/reliability/retry_handler.rb +207 -0
- data/lib/e11y/reliability/retry_rate_limiter.rb +117 -0
- data/lib/e11y/sampling/error_spike_detector.rb +225 -0
- data/lib/e11y/sampling/load_monitor.rb +161 -0
- data/lib/e11y/sampling/stratified_tracker.rb +92 -0
- data/lib/e11y/sampling/value_extractor.rb +82 -0
- data/lib/e11y/self_monitoring/buffer_monitor.rb +79 -0
- data/lib/e11y/self_monitoring/performance_monitor.rb +97 -0
- data/lib/e11y/self_monitoring/reliability_monitor.rb +146 -0
- data/lib/e11y/slo/event_driven.rb +150 -0
- data/lib/e11y/slo/tracker.rb +119 -0
- data/lib/e11y/version.rb +9 -0
- data/lib/e11y.rb +283 -0
- metadata +452 -0
@@ -0,0 +1,735 @@

# UC-015: Cost Optimization

**Status:** v1.1 Enhancement
**Complexity:** Advanced
**Setup Time:** 45-60 minutes
**Target Users:** Engineering Managers, CTOs, FinOps Teams, SRE

---

## 📋 Overview

### Problem Statement

**The $160,416/year observability bill:**
```ruby
# ❌ UNOPTIMIZED: Burning money on observability
# Current setup:
# - 50 services × 2k events/sec = 100k events/sec
# - All events at full payload size (~2KB each)
# - No compression
# - No intelligent sampling
# - 100% sent to Datadog + Loki

# Monthly costs:
# - Datadog: $15/host × 200 hosts = $3,000/month
# - Loki ingestion: 100k events/sec × 2KB × 86400 sec/day × 30 days
#   = 518.4 TB/month × $0.02/GB = $10,368/month
# - Total: $13,368/month = $160,416/year 😱

# But wait... there's more waste:
# - 80% of events are duplicates (retry storms)
# - 50% of payload is empty/default values
# - 30% of events are DEBUG (not needed in prod)
# - Storing everything for 30 days (overkill for most data)
```

### E11y Solution

**Layered optimizations (sampling, compression, payload minimization, tiered storage, routing) = 70-90% cost reduction:**
```ruby
# ✅ OPTIMIZED: Same insight, ~7x less cost
E11y.configure do |config|
  config.cost_optimization do
    # 1. Intelligent sampling (90% reduction)
    adaptive_sampling enabled: true,
                      base_rate: 0.1 # 10% of normal events

    # 2. Compression (70% size reduction)
    compression enabled: true,
                algorithm: :zstd, # Better ratio than gzip
                level: 3

    # 3. Payload minimization (50% smaller)
    minimize_payloads enabled: true,
                      drop_null_fields: true,
                      drop_empty_strings: true,
                      truncate_strings: 1000 # chars

    # 4. Tiered storage (60% cheaper)
    retention_tiers do
      hot 7.days, storage: :loki        # Fast queries
      warm 30.days, storage: :s3        # Slower, cheaper
      cold 1.year, storage: :s3_glacier # Archive
    end

    # 5. Smart routing (send only what's needed)
    routing do
      # Errors → Datadog (for alerting)
      route event_patterns: ['*.error', '*.fatal'],
            to: [:datadog, :loki]

      # Everything else → Loki only
      route event_patterns: ['*'],
            to: [:loki]
    end
  end
end

# Result:
# - 100k events/sec → 10k events/sec (adaptive sampling)
# - 2KB/event → 0.6KB/event (compression + minimization)
# - 30 days hot storage → 7 days hot + 23 days warm (tiered)
# - Datadog: Only errors (3k/sec instead of 100k/sec)
#
# New monthly cost:
# - Datadog: $3,000 → $500 (only errors)
# - Loki: $10,368 → $1,200 (10% volume, 70% smaller, 7 days hot)
# - S3: $200 (warm storage)
# - Total: $1,900/month = $22,800/year
#
# SAVINGS: $160,416 - $22,800 = $137,616/year (86% reduction!)
```
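The headline numbers can be re-derived with a few lines of plain Ruby. This is only a hand-check of the arithmetic quoted in this section: the method name `unoptimized_monthly` is hypothetical, the prices are this document's illustrative figures (not vendor quotes), and the optimized line items are taken as given rather than computed.

```ruby
SECONDS_PER_MONTH = 86_400 * 30 # 30-day month

# Monthly bill before optimization: per-host Datadog plus per-GB Loki ingestion.
def unoptimized_monthly(events_per_sec:, event_bytes:, hosts:,
                        usd_per_host: 15, usd_per_gb: 0.02)
  gb = events_per_sec * event_bytes * SECONDS_PER_MONTH / 1_000_000_000.0 # decimal GB
  hosts * usd_per_host + gb * usd_per_gb
end

before = unoptimized_monthly(events_per_sec: 100_000, event_bytes: 2_000, hosts: 200)
after  = 500 + 1_200 + 200 # Datadog + Loki + S3 estimates quoted in this section

puts format('before:  $%.0f/month', before)
puts format('after:   $%.0f/month', after)
puts format('savings: $%.0f/year (%.1f%%)',
            (before - after) * 12, (before - after) / before * 100)
```

Swapping in your own event rate, event size, and per-GB price is the quickest way to see whether the scenario resembles your bill.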

---

## 🎯 Cost Optimization Strategies

> **Note:** This UC focuses on proven, low-overhead optimizations. **Deduplication is intentionally NOT included** as a strategy. While it may seem like an obvious cost optimization, [ADR-009 Section 9.2.D](../ADR-009-cost-optimization.md#alternatives-considered) explains why it was rejected: high computational overhead (hash + Redis lookup per event), large memory cost (3.6GB for 1000 events/sec), false positives on legitimate retries, and confusion when debugging. Better alternatives (sampling + compression) achieve the same cost goals without these drawbacks.

### Strategy 1: Intelligent Sampling by Value

**Keep every high-value event; sample the rest:**
```ruby
E11y.configure do |config|
  config.cost_optimization do
    intelligent_sampling do
      # Always track high-value events
      always_sample do
        # High-value transactions
        when_field :amount, greater_than: 1000

        # VIP users
        when_field :user_segment, in: ['enterprise', 'vip']

        # Errors (always important)
        when_severity :error, :fatal

        # Security events
        when_pattern 'security.*', 'audit.*'
      end

      # Aggressively sample low-value
      sample_rate_for do
        # Debug events: 1%
        when_severity :debug, sample_rate: 0.01

        # Success events: 5%
        when_severity :success, sample_rate: 0.05

        # Low-value transactions (<$10): 10%
        when_field :amount, less_than: 10, sample_rate: 0.1
      end

      # Default: 10%
      default_sample_rate 0.1
    end
  end
end

# Impact:
# Before: 100k events/sec × 100% = 100k tracked
# After:
# - High-value (5k/sec): 100% = 5k tracked
# - Errors (3k/sec): 100% = 3k tracked
# - Debug (20k/sec): 1% = 200 tracked
# - Other (72k/sec): 10% = 7.2k tracked
# - Total: 15.4k tracked (~85% reduction!)
```
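To make the rule order concrete, here is a minimal plain-Ruby sketch of how such a sampling decision could work. It is not E11y's implementation: the event shape (`{ name:, severity:, payload: }`), the helper names `sample_rate_for` and `keep?`, and the assumption that the `always_sample` rules are checked before the rate rules are all illustrative.

```ruby
ALWAYS_KEEP_SEVERITIES = %i[error fatal].freeze
ALWAYS_KEEP_PREFIXES   = %w[security. audit.].freeze
VIP_SEGMENTS           = %w[enterprise vip].freeze

# Returns the probability (0.0..1.0) with which an event is kept.
def sample_rate_for(event)
  payload = event.fetch(:payload, {})
  amount  = payload[:amount]

  # "Always sample" rules win over every rate rule (assumed precedence).
  return 1.0 if amount && amount > 1000
  return 1.0 if VIP_SEGMENTS.include?(payload[:user_segment])
  return 1.0 if ALWAYS_KEEP_SEVERITIES.include?(event[:severity])
  return 1.0 if ALWAYS_KEEP_PREFIXES.any? { |prefix| event[:name].to_s.start_with?(prefix) }

  return 0.01 if event[:severity] == :debug
  return 0.05 if event[:severity] == :success
  return 0.1  if amount && amount < 10

  0.1 # default rate
end

# Keeps the event when a uniform random draw falls below its rate.
def keep?(event, rng: Random.new)
  rng.rand < sample_rate_for(event)
end

# Expected tracked volume for the traffic mix in the Impact comment above.
mix = { 1.0 => 5_000 + 3_000, 0.01 => 20_000, 0.1 => 72_000 }
puts "tracked/sec: #{mix.sum { |rate, per_sec| rate * per_sec }.round}"
```

Because the decision is a pure function of the event, the same rule table can be unit-tested without touching any adapter.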

---

### Strategy 2: Payload Minimization

**Remove unnecessary data:**
```ruby
E11y.configure do |config|
  config.cost_optimization do
    payload_minimization do
      enabled true

      # Remove null/empty values
      drop_null_fields true
      drop_empty_strings true
      drop_empty_arrays true
      drop_empty_hashes true

      # Truncate long strings
      truncate_strings max_length: 1000,
                       suffix: '...[truncated]'

      # Remove default values
      drop_default_values true,
                          defaults: {
                            status: 'pending',
                            currency: 'USD',
                            country: 'US'
                          }

      # Exclude specific fields (never send)
      exclude_fields [:internal_debug_data, :temp_cache]

      # Compress repeated values
      compress_repeated_values threshold: 3 # If >3 occurrences
    end
  end
end

# Example:
# Before minimization:
{
  event_name: 'order.created',
  payload: {
    order_id: '123',
    user_id: '456',
    status: 'pending',            # ← Default, removed
    currency: 'USD',              # ← Default, removed
    notes: '',                    # ← Empty, removed
    tags: [],                     # ← Empty, removed
    metadata: {},                 # ← Empty, removed
    internal_debug_data: { ... }, # ← Excluded
    long_description: 'Lorem ipsum...' # ~10,000 chars ← Truncated to 1000 chars
  }
}
# Size: ~12 KB

# After minimization:
{
  event_name: 'order.created',
  payload: {
    order_id: '123',
    user_id: '456',
    long_description: 'Lorem ipsum...[truncated]' # 1000 chars
  }
}
# Size: ~1.2 KB (90% reduction!)
```
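The before/after example can be reproduced with a small pure function. The sketch below is an assumption about how such a minimizer might behave, not the gem's code: `minimize` is a hypothetical helper, it only handles top-level keys, and it assumes the truncation suffix counts toward the 1000-character limit.

```ruby
DEFAULTS = { status: 'pending', currency: 'USD', country: 'US' }.freeze
EXCLUDED = %i[internal_debug_data temp_cache].freeze
MAX_LEN  = 1000
SUFFIX   = '...[truncated]'

# Returns a new, smaller payload hash; the input is not mutated.
def minimize(payload)
  payload.each_with_object({}) do |(key, value), out|
    next if EXCLUDED.include?(key)
    next if value.nil?
    next if value.respond_to?(:empty?) && value.empty? # '', [], {}
    next if DEFAULTS.key?(key) && DEFAULTS[key] == value

    if value.is_a?(String) && value.length > MAX_LEN
      # Keep the result at exactly MAX_LEN characters, suffix included.
      value = value[0, MAX_LEN - SUFFIX.length] + SUFFIX
    end
    out[key] = value
  end
end

payload = {
  order_id: '123', user_id: '456', status: 'pending', currency: 'USD',
  notes: '', tags: [], metadata: {}, internal_debug_data: { trace: 'abc' },
  long_description: 'x' * 10_000
}
p minimize(payload).keys
```

Dropping default values is only safe when every consumer knows the same defaults, so the table of defaults is best kept next to the event schema.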

---

### Strategy 3: Compression

**Compress before sending:**
```ruby
E11y.configure do |config|
  config.cost_optimization do
    compression do
      enabled true

      # Algorithm (by ratio on JSON: zstd > gzip > lz4; lz4 is the fastest)
      algorithm :zstd # OR :lz4, :gzip

      # Compression level (gzip: 1-9, zstd: 1-22)
      level 3 # Balance speed/ratio (3 = good default)

      # Batch compression (more efficient)
      batch_size 500 # Compress 500 events together

      # Only compress if beneficial
      min_batch_size 10.kilobytes # Don't compress tiny batches

      # Compression statistics
      track_compression_ratio true
    end
  end
end

# Compression ratios (for JSON events):
# - gzip level 6: ~65% reduction (2KB → 700 bytes)
# - lz4 default: ~55% reduction (2KB → 900 bytes, faster)
# - zstd level 3: ~70% reduction (2KB → 600 bytes, best!)
#
# Network cost reduction: ~70%!
```
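Why batching matters for compression can be shown with Ruby's standard library alone. The sketch below uses `Zlib` (DEFLATE, the algorithm behind gzip) instead of zstd, because zstd needs a third-party gem; the synthetic events and the helper name `compression_sizes` are made up for illustration, and real ratios depend on your payloads.

```ruby
require 'json'
require 'zlib'

# Compares compressing each JSON document on its own with compressing
# all of them as one newline-delimited batch. Sizes are in bytes.
def compression_sizes(docs, level: 6)
  {
    raw: docs.sum(&:bytesize),
    per_event: docs.sum { |doc| Zlib::Deflate.deflate(doc, level).bytesize },
    batch: Zlib::Deflate.deflate(docs.join("\n"), level).bytesize
  }
end

# 500 synthetic events with the repetitive structure typical of JSON logs.
docs = Array.new(500) do |i|
  { event_name: 'order.created', severity: 'info',
    payload: { order_id: i, user_id: i % 50, currency: 'USD' } }.to_json
end

sizes = compression_sizes(docs)
puts "raw: #{sizes[:raw]} B, per-event: #{sizes[:per_event]} B, batch: #{sizes[:batch]} B"
```

The batch wins because the compressor can reuse keys and values repeated across events, which is also why compressing single small events rarely pays off.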

---

### Strategy 4: Tiered Storage

**Hot/warm/cold storage based on age:**
```ruby
E11y.configure do |config|
  config.cost_optimization do
    tiered_storage do
      # HOT: Fast queries, expensive ($0.20/GB/month)
      hot_tier do
        duration 7.days
        storage :loki # OR :elasticsearch
        query_performance :fast
      end

      # WARM: Slower queries, cheaper ($0.05/GB/month)
      warm_tier do
        duration 30.days
        storage :s3
        query_performance :medium
        compression :zstd # Compress when moving to warm
      end

      # COLD: Archive, very cheap ($0.004/GB/month)
      cold_tier do
        duration 1.year
        storage :s3_glacier
        query_performance :slow # Minutes to hours
        compression :zstd
      end

      # Auto-archival
      auto_archive enabled: true,
                   schedule: '0 2 * * *' # 2 AM daily
    end
  end
end

# Cost comparison (per 1TB):
# Hot (Loki): $0.20/GB × 1000 = $200/month
# Warm (S3): $0.05/GB × 1000 = $50/month
# Cold (Glacier): $0.004/GB × 1000 = $4/month
#
# Strategy:
# - 7 days hot (for active debugging)
# - 30 days warm (for recent lookups)
# - 1 year cold (for compliance)
#
# Cost for 30 days of data (assuming 1 TB ingested per day):
# Before: 30 days × $200 = $6,000/month
# After: (7 × $200) + (23 × $50) + (0 × $4) = $1,400 + $1,150 = $2,550/month
# Savings: $3,450/month (58% reduction!)
```
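The tier arithmetic generalizes to any ingest rate and tier split. The following sketch assumes a constant daily ingest and the illustrative per-GB prices from the comparison above; `monthly_storage_cost` is a hypothetical helper, not part of the gem.

```ruby
# Illustrative $/GB/month prices from the comparison above.
PRICE_PER_GB = { hot: 0.20, warm: 0.05, cold: 0.004 }.freeze

# days_in_tier: how many days of data sit in each tier at steady state.
# daily_gb: constant daily ingest in (decimal) GB.
def monthly_storage_cost(daily_gb:, days_in_tier:)
  days_in_tier.sum { |tier, days| days * daily_gb * PRICE_PER_GB.fetch(tier) }
end

all_hot = monthly_storage_cost(daily_gb: 1000, days_in_tier: { hot: 30 })
tiered  = monthly_storage_cost(daily_gb: 1000, days_in_tier: { hot: 7, warm: 23, cold: 0 })

puts format('all hot: $%.0f/month', all_hot)
puts format('tiered:  $%.0f/month', tiered)
puts format('saving:  %.1f%%', (all_hot - tiered) / all_hot * 100)
```

Retrieval and transition fees, which archive tiers usually charge, are left out of this model and can change the picture for data that is read often.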

---

### Strategy 5: Smart Routing

**Send events only to necessary destinations:**
```ruby
E11y.configure do |config|
  config.cost_optimization do
    smart_routing do
      # Errors → Multiple destinations (alerting)
      route event_patterns: ['*.error', '*.fatal'],
            severities: [:error, :fatal],
            to: [:datadog, :loki, :sentry]

      # High-value transactions → All (audit + analytics)
      route event_patterns: ['payment.*', 'order.*'],
            when: ->(e) { e.payload[:amount].to_i > 1000 },
            to: [:datadog, :loki, :s3_archive]

      # Security events → Specific SIEM
      route event_patterns: ['security.*', 'audit.*'],
            to: [:splunk, :s3_archive]

      # Debug events → Only Loki (no expensive Datadog)
      route severities: [:debug],
            to: [:loki]

      # Everything else → Loki only
      route event_patterns: ['*'],
            to: [:loki]
    end
  end
end

# Cost impact:
# Datadog: $15/host/month (expensive!)
# Loki: $0.20/GB/month (cheaper)
#
# Before: All 100k events/sec → Datadog + Loki
# Datadog cost: $3,000/month
#
# After: Only errors (3k events/sec) → Datadog
# Datadog cost: $500/month
#
# Savings: $2,500/month (83% reduction!)
```
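A routing table like this can be modeled as an ordered list of rules. The sketch below makes two assumptions the configuration does not spell out: the first matching route wins, and a route with several criteria requires all of them. `Route` and `destinations_for` are hypothetical names, and events are plain hashes.

```ruby
Route = Struct.new(:patterns, :severities, :condition, :destinations, keyword_init: true) do
  # A route matches only if every criterion it declares is satisfied.
  def match?(event)
    return false if patterns && patterns.none? { |p| File.fnmatch(p, event[:name]) }
    return false if severities && !severities.include?(event[:severity])
    return false if condition && !condition.call(event)

    true
  end
end

ROUTES = [
  Route.new(patterns: ['*.error', '*.fatal'], severities: %i[error fatal],
            destinations: %i[datadog loki sentry]),
  Route.new(patterns: ['payment.*', 'order.*'],
            condition: ->(e) { e.fetch(:payload, {})[:amount].to_i > 1000 },
            destinations: %i[datadog loki s3_archive]),
  Route.new(patterns: ['security.*', 'audit.*'], destinations: %i[splunk s3_archive]),
  Route.new(severities: [:debug], destinations: [:loki]),
  Route.new(patterns: ['*'], destinations: [:loki]) # catch-all, must stay last
].freeze

# Returns the destinations of the first matching route (assumed semantics).
def destinations_for(event)
  ROUTES.find { |route| route.match?(event) }&.destinations || []
end

p destinations_for({ name: 'order.created', severity: :info, payload: { amount: 5000 } })
```

With first-match semantics the order of routes is part of the configuration, so the catch-all `'*'` rule belongs at the end.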

---

### Strategy 6: Retention-Aware Tagging

**Tag events with retention requirements:**
```ruby
E11y.configure do |config|
  config.cost_optimization do
    retention_aware_tagging do
      # Auto-tag events with retention hints
      tag_with_retention do
        # Compliance events: Long retention
        when_pattern 'audit.*', 'gdpr.*', retention: 7.years

        # Financial: Long retention
        when_pattern 'payment.*', 'transaction.*', retention: 7.years

        # Errors: Medium retention
        when_severity :error, :fatal, retention: 90.days

        # Debug: Short retention
        when_severity :debug, retention: 7.days

        # Default
        default_retention 30.days
      end

      # Backend respects retention tags
      backends do
        loki retention_based: true,
             max_retention: 30.days

        s3_archive retention_based: true,
                   max_retention: 7.years
      end
    end
  end
end

# Result:
# - Debug events: 7 days in Loki (cheap)
# - Errors: 90 days total (Loki caps at 30 days; the rest lives in the S3 archive)
# - Compliance: 7 years in the S3 archive (very cheap)
# - Default: 30 days in Loki
#
# Cost optimization: Store data only as long as needed!
```
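How a retention tag interacts with a backend's `max_retention` is easiest to see in code. The sketch below is illustrative only: it assumes the first matching rule wins, counts a year as 365 days, and uses the hypothetical helpers `retention_days` and `days_kept` on plain-hash events.

```ruby
DAYS_PER_YEAR = 365

RETENTION_RULES = [
  { patterns: ['audit.*', 'gdpr.*'],          days: 7 * DAYS_PER_YEAR },
  { patterns: ['payment.*', 'transaction.*'], days: 7 * DAYS_PER_YEAR },
  { severities: %i[error fatal],              days: 90 },
  { severities: %i[debug],                    days: 7 }
].freeze
DEFAULT_RETENTION_DAYS = 30
BACKEND_MAX_DAYS = { loki: 30, s3_archive: 7 * DAYS_PER_YEAR }.freeze

# Retention tag for an event: first matching rule, else the default.
def retention_days(event)
  rule = RETENTION_RULES.find do |r|
    (r[:patterns] || []).any? { |p| File.fnmatch(p, event[:name]) } ||
      (r[:severities] || []).include?(event[:severity])
  end
  rule ? rule[:days] : DEFAULT_RETENTION_DAYS
end

# How long a backend actually keeps the event: the tag, capped by the backend.
def days_kept(event, backend)
  [retention_days(event), BACKEND_MAX_DAYS.fetch(backend)].min
end

error = { name: 'cache.error', severity: :error }
puts "tag: #{retention_days(error)}d, loki: #{days_kept(error, :loki)}d, " \
     "s3_archive: #{days_kept(error, :s3_archive)}d"
```

The cap is the important part: a 90-day tag only delivers 90 days if the event is also routed to a backend whose limit allows it.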

---

### Strategy 7: Batch & Bundle

**Batch events for efficiency:**
```ruby
E11y.configure do |config|
  config.cost_optimization do
    batching do
      enabled true

      # Batch parameters
      max_batch_size 500 # events
      max_batch_bytes 1.megabyte
      max_wait_time 5.seconds

      # Batch compression (more efficient)
      compress_batches true

      # Bundle similar events (further compression)
      bundle_similar_events do
        enabled true
        similarity_threshold 0.8 # 80% similar
        max_bundle_size 100
      end
    end
  end
end

# Example:
# 500 events sent separately:
# - 500 HTTP requests
# - 500 × 2KB = 1 MB payload
# - Network overhead: 500 × 1KB = 500 KB
# - Total: 1.5 MB

# 500 events in 1 batch (compressed):
# - 1 HTTP request
# - 1 MB payload → 300 KB (compressed)
# - Network overhead: 1 KB
# - Total: 301 KB
#
# Bandwidth reduction: 80%!
```
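The size-based part of batching fits in a few lines. This is a simplified sketch, not the gem's batcher: the `Batcher` class is hypothetical, it flushes on event count or byte size only, and it leaves out the `max_wait_time` timer, compression, and thread safety.

```ruby
require 'json'

# Buffers events and hands each full batch to the given block.
class Batcher
  def initialize(max_events: 500, max_bytes: 1_000_000, &on_flush)
    @max_events = max_events
    @max_bytes  = max_bytes
    @on_flush   = on_flush
    @buffer     = []
    @bytes      = 0
  end

  # Adds an event and flushes as soon as either limit is reached.
  def <<(event)
    @buffer << event
    @bytes += event.to_json.bytesize
    flush if @buffer.size >= @max_events || @bytes >= @max_bytes
    self
  end

  # Sends whatever is buffered; a no-op when the buffer is empty.
  def flush
    return if @buffer.empty?

    batch = @buffer
    @buffer = []
    @bytes = 0
    @on_flush.call(batch)
  end
end

sent = []
batcher = Batcher.new(max_events: 500) { |batch| sent << batch.size }
1_200.times { |i| batcher << { event_name: 'order.created', id: i } }
batcher.flush # drain the remainder, as a shutdown hook would
p sent
```

A production batcher also needs the time-based flush so that a quiet service does not hold events indefinitely.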

---

## 💰 Cost Calculator

**Calculate your potential savings:**
```ruby
# lib/e11y/cost_calculator.rb
require 'active_support/core_ext/numeric/bytes' # provides 1.gigabyte

module E11y
  class CostCalculator
    def calculate(
      events_per_second:,
      avg_event_size_bytes:,
      num_services:,                    # accepted but not used in this estimate
      datadog_hosts: 0,
      loki_ingestion_rate_gb_month: nil # accepted but not used in this estimate
    )
      # Calculate monthly volume
      seconds_per_month = 30 * 24 * 60 * 60 # 2,592,000
      total_events_month = events_per_second * seconds_per_month
      total_bytes_month = total_events_month * avg_event_size_bytes
      total_gb_month = total_bytes_month / 1.gigabyte.to_f # float division, no truncation

      # === UNOPTIMIZED COSTS ===
      unoptimized = {
        datadog: datadog_hosts * 15, # $15/host/month
        loki: total_gb_month * 0.20, # $0.20/GB/month
        total: 0
      }
      unoptimized[:total] = unoptimized.values.sum

      # === OPTIMIZED COSTS (with E11y) ===
      # Assumptions:
      # - 90% sampling reduction
      # - 70% compression
      # - 60% cheaper storage (tiered)

      effective_events = total_events_month * 0.1 # 90% sampling
      effective_bytes = effective_events * avg_event_size_bytes * 0.3 # 70% compression
      effective_gb = effective_bytes / 1.gigabyte

      optimized = {
        datadog: datadog_hosts * 5, # Only errors ($5/host/month)
        loki_hot: effective_gb * 0.20 * (7.0 / 30.0), # 7 days hot
        loki_warm: effective_gb * 0.05 * (23.0 / 30.0), # 23 days warm
        total: 0
      }
      optimized[:total] = optimized.values.sum

      # === SAVINGS ===
      {
        unoptimized: unoptimized,
        optimized: optimized,
        monthly_savings: unoptimized[:total] - optimized[:total],
        yearly_savings: (unoptimized[:total] - optimized[:total]) * 12,
        savings_pct: ((unoptimized[:total] - optimized[:total]) / unoptimized[:total] * 100).round(1)
      }
    end
  end
end

# Example usage:
calculator = E11y::CostCalculator.new
result = calculator.calculate(
  events_per_second: 100_000,
  avg_event_size_bytes: 2000, # 2 KB
  num_services: 50,
  datadog_hosts: 200
)

puts "Current monthly cost: $#{result[:unoptimized][:total].round}"
puts "Optimized monthly cost: $#{result[:optimized][:total].round}"
puts "Monthly savings: $#{result[:monthly_savings].round} (#{result[:savings_pct]}%)"
puts "Yearly savings: $#{result[:yearly_savings].round}"

# Note: this calculator prices Loki at $0.20/GB and optimized Datadog at $5/host,
# so its output differs from the hand-worked Overview scenario, which assumes
# $0.02/GB ingestion and a $500 Datadog bill. The Overview figures are:
#   Current monthly cost:   $13,368
#   Optimized monthly cost: $1,900
#   Monthly savings:        $11,468 (85.8%)
#   Yearly savings:         $137,616
```

---

## 📊 Monitoring Cost Optimization

**Track savings in real-time:**
```ruby
# Self-monitoring metrics
E11y.configure do |config|
  config.self_monitoring do
    # Bytes saved, by optimization type
    counter :cost_optimization_bytes_saved_total,
            tags: [:optimization_type] # compression, sampling

    # Events dropped/sampled
    counter :cost_optimization_events_reduced_total,
            tags: [:reason]

    # Estimated cost savings
    gauge :cost_optimization_monthly_savings_usd,
          tags: [:backend]

    # Compression ratio
    histogram :cost_optimization_compression_ratio,
              buckets: [0.1, 0.3, 0.5, 0.7, 0.9]
  end
end

# Dashboard queries:
# - Total bytes saved: sum(cost_optimization_bytes_saved_total)
# - Monthly savings: cost_optimization_monthly_savings_usd
# - Median compression ratio:
#   histogram_quantile(0.5, sum by (le) (rate(cost_optimization_compression_ratio_bucket[5m])))
```
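One way to feed the savings gauge is to price the bytes-saved counter and extrapolate to a month. The helpers below are a hypothetical sketch of that conversion, assuming decimal gigabytes, a flat per-GB price, and traffic that stays representative over the observed window.

```ruby
# Converts a bytes-saved counter delta into estimated USD for one backend.
def estimated_savings_usd(bytes_saved:, price_per_gb:)
  (bytes_saved / 1_000_000_000.0 * price_per_gb).round(2)
end

# Extrapolates savings observed over `window_hours` to a 30-day month.
def monthly_estimate(savings_usd, window_hours:)
  (savings_usd * (30 * 24.0 / window_hours)).round(2)
end

daily = estimated_savings_usd(bytes_saved: 50_000_000_000, price_per_gb: 0.02)
puts "last 24h: $#{daily}, projected month: $#{monthly_estimate(daily, window_hours: 24)}"
```

Treat the result as an estimate for dashboards rather than a billing figure, since vendors round and tier their pricing differently.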

---

## 🧪 Testing

```ruby
# spec/e11y/cost_optimization_spec.rb
RSpec.describe 'Cost Optimization' do
  describe 'compression' do
    it 'compresses event payloads' do
      E11y.configure do |config|
        config.cost_optimization do
          compression enabled: true,
                      algorithm: :zstd
        end
      end

      # Send event with large payload
      Events::TestEvent.track(user_id: '123', large_data: 'x' * 10_000)

      # Compression is lossless: one event is buffered with its payload intact
      events = E11y::Buffer.flush
      expect(events.size).to eq(1)
      expect(events.first.payload[:large_data]).to eq('x' * 10_000)
    end
  end

  describe 'payload minimization' do
    it 'removes null and empty values' do
      E11y.configure do |config|
        config.cost_optimization do
          payload_minimization enabled: true,
                               drop_null_fields: true,
                               drop_empty_strings: true,
                               drop_empty_arrays: true
        end
      end

      Events::TestEvent.track(
        foo: 'bar',
        baz: nil, # ← Should be removed
        qux: '',  # ← Should be removed
        empty: [] # ← Should be removed
      )

      event = E11y::Buffer.pop
      expect(event[:payload].keys).to eq([:foo])
    end
  end

  describe 'batch compression' do
    it 'compresses event batches' do
      E11y.configure do |config|
        config.cost_optimization do
          compression enabled: true, algorithm: :zstd, level: 3
        end
      end

      # create_event: spec helper (not shown) that builds an event of about `size` bytes
      events = Array.new(500) { create_event(size: 2000) }

      uncompressed_size = events.sum { |e| e.to_json.bytesize }
      compressed = E11y::Compression.compress_batch(events)

      compression_ratio = compressed.bytesize.to_f / uncompressed_size
      expect(compression_ratio).to be < 0.4 # At least 60% reduction
    end
  end
end
```

---

## 💡 Best Practices

### ✅ DO

**1. Combine multiple optimizations**
```ruby
# ✅ GOOD: Layered optimizations
config.cost_optimization do
  intelligent_sampling { ... } # 90% reduction
  compression { ... }          # 70% smaller payloads
  tiered_storage { ... }       # 60% cheaper storage
  smart_routing { ... }        # 50% fewer expensive destinations
end
# Combined: ~86% lower bill in the Overview scenario
```

**2. Monitor savings**
```ruby
# ✅ GOOD: Track ROI
# Dashboard: "Cost Optimization Savings"
# - Monthly savings: $X
# - YTD savings: $Y
# - Optimization breakdown (sampling, compression, tiered storage)
```

**3. Test in staging first**
```ruby
# ✅ GOOD: Validate optimizations don't lose critical data
# - Verify high-value events always tracked
# - Verify errors never sampled out
# - Verify compliance events retained
```

---

### ❌ DON'T

**1. Don't over-optimize critical events**
```ruby
# ❌ BAD: Sampling errors
config.sampling do
  sample_rate 0.01 # 1%
end
# → You'll miss 99% of errors!

# ✅ GOOD: Never sample errors
always_sample severities: [:error, :fatal]
```

**2. Don't compress tiny batches**
```ruby
# ❌ BAD: Compression overhead > savings
compress_batch_size 1 # Compress single events

# ✅ GOOD: Only compress larger batches
compress_batch_size 100 # Worthwhile
```

**3. Don't ignore retention requirements**
```ruby
# ❌ BAD: Delete compliance data too soon
retention 7.days # But SOX requires 7 years!

# ✅ GOOD: Respect legal requirements
retention_for 'payment.*', 7.years
```

---

## 📚 Related Use Cases

- **[UC-013: High Cardinality Protection](./UC-013-high-cardinality-protection.md)** - Metric cost savings
- **[UC-014: Adaptive Sampling](./UC-014-adaptive-sampling.md)** - Smart sampling

---

## 🎯 Summary

### Real-World Savings Example

**Company:** E-commerce platform (50 services, 100k events/sec)

| Optimization | Before | After | Savings |
|--------------|--------|-------|---------|
| **Intelligent sampling** | 100k ev/sec | 10k ev/sec | 90% |
| **Compression** | 2KB/event | 0.6KB/event | 70% |
| **Tiered storage** | $200/TB/mo | $50/TB/mo | 75% |
| **Smart routing** | All → Datadog | Errors only → Datadog | 83% (Datadog bill: $3,000 → $500) |

**Total Monthly Cost:**
- Before: $13,368/month
- After: $1,900/month
- **Savings: $11,468/month (86%)**
- **Yearly savings: $137,616**

**ROI:** Implementation effort: 2 weeks → Payback: Immediate → 3-year value: $412,848

---

**Document Version:** 1.0
**Last Updated:** January 12, 2026
**Status:** ✅ Complete