hyll 0.1.1 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 322ba54ca099c154e9bc295200933314e019bc61adcc0df9f45a4ee352e7a8fa
-   data.tar.gz: ebcdfac3bcd8d421876c329b33db3be5594379ce7586d6d11585beb6308a7e52
+   metadata.gz: 445c738f59356bd13dc55dd8f6c8a2b74e39260e4367fa28012a4f13c0f5ebeb
+   data.tar.gz: 5f5d79fcd5aa2aec6afeb4e04be2725c6cc08cefa99c497e8070ee43eee03742
  SHA512:
-   metadata.gz: 6d308475f7666c0d945c0a1b780aa84de2f15b987f0bea463b8897067c8190c9681d76689570535075c312de65aac30fca028dab29eb62fd48f90c12c0db2c56
-   data.tar.gz: edf858ca1185c8b419fb8291a30fd99323610724d8c3a77b82b8f19f956b9784c32125eba600284ad62ab359ebe8fc74accb783189a3b04859b0fff88cb5be70
+   metadata.gz: 8df3d73bd6f665ab163891b7f51d5e9f09fc8b2df0479445cb285487b6dd4b9640d324a1bab77069d50a934da0a6af9e464c33f212c94ac18a5af73349f234cf
+   data.tar.gz: f73ae9153b519e77dfb77685395e5d1736de23f1f53e3575db29e96fb4f2fa512e08b1be737d42717db9e28b936358731a17ec69234864532c731a467782ec4a
data/CHANGELOG.md CHANGED
@@ -5,6 +5,28 @@ All notable changes to this project will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [0.2.0] - 2025-03-24
+
+ ### Added
+ - Associativity test for HyperLogLog merges to ensure (A ∪ B) ∪ C = A ∪ (B ∪ C) (see the sketch below)
+ - Guards against invalid inputs in mathematical functions
+ - Memoization of the h_values calculation in the MLE algorithm
+ - More comprehensive error handling with safeguards
+ - `examples/redis_comparison_benchmark.rb` for comparing Hyll against Redis
+
+ ### Changed
+ - Optimized the Maximum Likelihood Estimation (MLE) algorithm for better performance
+ - Improved numerical stability in the secant method implementation
+ - Enhanced the Taylor series calculation for better accuracy
+ - Resolved Math module namespace conflicts
+ - Added safeguards against division by zero and other numerical errors
+ - Capped the number of iterations in convergence algorithms to prevent infinite loops
+
+ ### Fixed
+ - Addressed potential numerical instability in the calculate_h_values method
+ - Fixed the undefined method `exp` error by using the global namespace operator (`::Math.exp`)
+ - Improved edge case handling in the MLE algorithm
+
  ## [0.1.1] - 2025-03-21
 
  ### Changed
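The 0.2.0 entry above lists an associativity test for merges. As a rough illustration of the property being checked, here is a minimal sketch; it is not taken from the gem's test suite, it uses only the `add`, `merge`, and `cardinality` calls that appear elsewhere in this diff, and it assumes `merge` updates the receiver in place, as the benchmark script below relies on.

```ruby
require "hyll"

# Three independent sketches over made-up data (precision 10).
a = Hyll::HyperLogLog.new(10)
b = Hyll::HyperLogLog.new(10)
c = Hyll::HyperLogLog.new(10)
1_000.times { |i| a.add("a-#{i}") }
1_000.times { |i| b.add("b-#{i}") }
1_000.times { |i| c.add("c-#{i}") }

# (A ∪ B) ∪ C: merge into deep copies so the originals stay untouched.
left = Marshal.load(Marshal.dump(a))
left.merge(b)
left.merge(c)

# A ∪ (B ∪ C)
bc = Marshal.load(Marshal.dump(b))
bc.merge(c)
right = Marshal.load(Marshal.dump(a))
right.merge(bc)

# Both orders should produce identical registers, hence identical estimates.
puts left.cardinality == right.cardinality
```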
data/README.md CHANGED
@@ -1,5 +1,7 @@
  # Hyll
 
+ ![Gem Version](https://img.shields.io/gem/v/hyll)
+ ![Gem Total Downloads](https://img.shields.io/gem/dt/hyll)
  [![Build Status](https://github.com/davidesantangelo/hyll/workflows/Ruby%20Tests/badge.svg)](https://github.com/davidesantangelo/hyll/actions)
 
  Hyll is a Ruby implementation of the [HyperLogLog algorithm](https://en.wikipedia.org/wiki/HyperLogLog) for the count-distinct problem, which efficiently approximates the number of distinct elements in a multiset with minimal memory usage. It supports both standard and Enhanced variants, offering a flexible approach for large-scale applications and providing convenient methods for merging, serialization, and maximum likelihood estimation.
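The paragraph above mentions merging and maximum likelihood estimation. As a quick orientation, here is a minimal sketch built only from calls that appear in the benchmark script later in this diff (`Hyll::HyperLogLog.new(precision)`, `add`, `add_all`, `cardinality`, `mle_cardinality`, `merge`); treat it as an illustration rather than the gem's official quick-start.

```ruby
require "hyll"

# Standard sketch with precision 10 (the value used throughout the benchmarks below).
hll = Hyll::HyperLogLog.new(10)
hll.add_all((1..10_000).map { |i| "user-#{i}" })

puts hll.cardinality      # standard estimate of distinct elements
puts hll.mle_cardinality  # maximum likelihood estimate

# Merge a second sketch into the first one.
other = Hyll::HyperLogLog.new(10)
other.add("user-10001")
hll.merge(other)
puts hll.cardinality
```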
@@ -226,6 +228,76 @@ Below are actual performance measurements from an Apple Mac Mini M4 with 24GB RA
 
  These benchmarks demonstrate HyperLogLog's exceptional memory efficiency, maintaining a compression ratio of over 6,250x compared to storing the raw elements, while still providing accurate cardinality estimates.
 
+ ## Benchmark Comparison with Redis
+
+ Hyll has been benchmarked against Redis' HyperLogLog implementation to provide a comparison with a widely used production system. The tests were run on an Apple Silicon M1 Mac using Ruby 3.1.4, with 10,000 elements and a precision of 10.
+
+ ### Insertion Performance
+
+ | Implementation  | Operations/sec | Relative Performance |
+ |-----------------|---------------:|---------------------:|
+ | Hyll Standard   | 86.32          | 1.00x (fastest)      |
+ | Hyll Batch      | 85.98          | 1.00x                |
+ | Redis Pipelined | 20.51          | 4.21x slower         |
+ | Redis PFADD     | 4.93           | 17.51x slower        |
+ | Hyll Enhanced   | 1.20           | 71.87x slower        |
+
+ ### Cardinality Estimation Performance
+
+ | Implementation       | Operations/sec | Relative Performance |
+ |----------------------|---------------:|---------------------:|
+ | Redis PFCOUNT        | 53,131         | 1.00x (fastest)      |
+ | Hyll Enhanced Stream | 24,412         | 2.18x slower         |
+ | Hyll Enhanced        | 8,843          | 6.01x slower         |
+ | Hyll Standard        | 8,538          | 6.22x slower         |
+ | Hyll MLE             | 5,645          | 9.41x slower         |
+
+ ### Merge Performance
+
+ | Implementation | Operations/sec | Relative Performance |
+ |----------------|---------------:|---------------------:|
+ | Redis PFMERGE  | 12,735         | 1.00x (fastest)      |
+ | Hyll Enhanced  | 6,523          | 1.95x slower         |
+ | Hyll Standard  | 2,932          | 4.34x slower         |
+
+ ### Memory Usage
+
+ | Implementation | Memory Usage |
+ |----------------|-------------:|
+ | Hyll Enhanced  | 0.28 KB      |
+ | Hyll Standard  | 18.30 KB     |
+ | Redis          | 12.56 KB     |
+ | Raw Elements   | 0.04 KB      |
+
+ ### Accuracy Comparison
+
+ | Implementation       | Estimated Count | Actual Count | Error       |
+ |----------------------|----------------:|-------------:|------------:|
+ | Redis                | 9,990           | 10,000       | 0.10%       |
+ | Hyll Enhanced        | 3,018           | 10,000       | 69.82%      |
+ | Hyll (High Prec)     | 19,016          | 10,000       | 90.16%      |
+ | Hyll Standard        | 32,348          | 10,000       | 223.48%     |
+ | Hyll Enhanced Stream | 8,891,659       | 10,000       | 88,816.59%  |
+ | Hyll MLE             | 19,986,513      | 10,000       | 199,765.13% |
+
+ ### Summary of Findings
+
+ - **Insertion Performance**: Hyll Standard and Batch operations are significantly faster than Redis for adding elements.
+ - **Cardinality Estimation**: Redis has the fastest cardinality estimation, with Hyll Enhanced Stream as a close second.
+ - **Merge Operations**: Redis outperforms Hyll for merging HyperLogLog sketches, but Hyll Enhanced provides competitive performance.
+ - **Memory Usage**: Hyll Enhanced offers the most memory-efficient implementation.
+ - **Accuracy**: Redis provides the best accuracy in this test scenario.
+
+ #### Recommendation
+
+ For most use cases, Redis offers an excellent balance of accuracy and performance. However, Hyll provides superior insertion performance and memory efficiency, making it a good choice for scenarios where those attributes matter most.
+
+ You can run these benchmarks yourself using the included script:
+
+ ```bash
+ ruby examples/redis_comparison_benchmark.rb
+ ```
+
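The script also accepts a few command-line options; the flags below are the ones defined in its `OptionParser` block further down in this diff, and the concrete values are only illustrative.

```bash
# Higher precision, more elements, accuracy benchmark only, results exported to JSON
ruby examples/redis_comparison_benchmark.rb --precision 12 --data-size 50000 \
  --benchmark accuracy --output results.json

# Skip the warm-up phase and run only the merge benchmark
ruby examples/redis_comparison_benchmark.rb --no-warm-up -b merge
```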
  ## Features
 
  - Standard HyperLogLog implementation with customizable precision
@@ -293,6 +365,13 @@ For advanced usage scenarios, check out `examples/advance.rb` which includes:
  - Advanced serialization techniques
  - Precision vs. memory usage benchmarks
 
+ ## Example Use Cases
+
+ Here are a few ways Hyll can help in real-world scenarios (a sketch of the first one follows this list):
+ - **Unique visitor counting**: Track unique users visiting a website in real time. By adding each user's session ID or IP to the HyperLogLog, you get an approximate number of distinct users without storing everybody's data.
+ - **Log analytics**: Continuously process large log files to calculate the volume of unique events while keeping memory usage low.
+ - **Marketing campaigns**: Quickly gauge how many distinct customers participate in a campaign while merging data from multiple sources.
+
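A minimal sketch of the first use case above, using only the `add`, `cardinality`, and `merge` calls shown in this diff; the session IDs are hypothetical placeholders.

```ruby
require "hyll"

# One sketch per day of traffic.
monday  = Hyll::HyperLogLog.new(12)
tuesday = Hyll::HyperLogLog.new(12)

%w[s-101 s-102 s-103 s-101].each { |session_id| monday.add(session_id) } # duplicates are fine
%w[s-102 s-104 s-105].each { |session_id| tuesday.add(session_id) }

puts monday.cardinality # ~3 distinct visitors on Monday

# Approximate distinct visitors across both days by merging the daily sketches.
monday.merge(tuesday)
puts monday.cardinality # ~5 distinct visitors overall
```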
  ## Development
 
  After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
data/examples/redis_comparison_benchmark.rb ADDED
@@ -0,0 +1,539 @@
+ # frozen_string_literal: true
+
+ require "hyll"
+ require "redis"
+ require "benchmark"
+ require "benchmark/ips"
+ require "memory_profiler"
+ require "optparse"
+ require "json"
+
+ class HyllRedisComparison
+   def initialize(precision: 10, data_size: 10_000, overlapping: 1_000, warm_up: true, benchmark_type: :all)
+     @precision = precision
+     @data_size = data_size
+     @overlapping = overlapping
+     @warm_up = warm_up
+     @benchmark_type = benchmark_type
+     @redis = Redis.new
+     @results = {}
+
+     puts "Initializing benchmark with:"
+     puts " - Precision: #{@precision}"
+     puts " - Data size: #{@data_size} elements"
+     puts " - Overlapping elements for merge tests: #{@overlapping}"
+     puts " - Warm-up: #{@warm_up ? "Enabled" : "Disabled"}"
+     puts "\n"
+
+     # Clean up any existing Redis keys
+     @redis.del("hll1", "hll2", "hll_merged", "redis_hll1", "redis_hll2", "redis_merged", "accuracy_test")
+
+     # Pre-generate the test data
+     @elements = (0...@data_size).map { |i| "element-#{i}" }.freeze
+     @set1_elements = (0...@data_size).map { |i| "set1-#{i}" }.freeze
+     @set2_elements = (0...@data_size).map { |i| "set2-#{i + @data_size - @overlapping}" }.freeze
+
+     # Pre-load Redis
+     if %i[all cardinality memory].include?(benchmark_type)
+       @redis.pipelined do |pipeline|
+         @elements.each { |e| pipeline.pfadd("hll1", e) }
+       end
+     end
+
+     if %i[all merge].include?(benchmark_type)
+       @redis.pipelined do |pipeline|
+         @set1_elements.each { |e| pipeline.pfadd("redis_hll1", e) }
+         @set2_elements.each { |e| pipeline.pfadd("redis_hll2", e) }
+       end
+     end
+
+     if %i[all cardinality].include?(benchmark_type)
+       @pre_hll_standard = Hyll::HyperLogLog.new(@precision)
+       @pre_hll_enhanced = Hyll::EnhancedHyperLogLog.new(@precision)
+
+       @elements.each do |e|
+         @pre_hll_standard.add(e)
+         @pre_hll_enhanced.add(e)
+       end
+     end
+
+     if %i[all merge].include?(benchmark_type)
+       @pre_merge_hll1 = Hyll::HyperLogLog.new(@precision)
+       @pre_merge_hll2 = Hyll::HyperLogLog.new(@precision)
+       @pre_merge_enhanced1 = Hyll::EnhancedHyperLogLog.new(@precision)
+       @pre_merge_enhanced2 = Hyll::EnhancedHyperLogLog.new(@precision)
+
+       @set1_elements.each do |e|
+         @pre_merge_hll1.add(e)
+         @pre_merge_enhanced1.add(e)
+       end
+
+       @set2_elements.each do |e|
+         @pre_merge_hll2.add(e)
+         @pre_merge_enhanced2.add(e)
+       end
+
+       @pre_merge_hll1_serialized = Marshal.dump(@pre_merge_hll1)
+       @pre_merge_enhanced1_serialized = Marshal.dump(@pre_merge_enhanced1)
+     end
+
+     warm_up_benchmarks if @warm_up
+   end
+
+   def run_benchmarks
+     case @benchmark_type
+     when :insertion
+       benchmark_insertion
+     when :cardinality
+       benchmark_cardinality
+     when :merge
+       benchmark_merge
+     when :memory
+       benchmark_memory_usage
+     when :accuracy
+       benchmark_accuracy
+     else
+       benchmark_insertion
+       benchmark_cardinality
+       benchmark_merge
+       benchmark_memory_usage
+       benchmark_accuracy
+     end
+
+     print_summary
+   end
+
+   def warm_up_benchmarks
+     puts "Performing warm-up..."
+     # Warm-up JIT compiler
+     warm_up_count = [@data_size / 10, 1000].min
+
+     # Warm-up insertion
+     hll_warmup = Hyll::HyperLogLog.new(@precision)
+     enhanced_warmup = Hyll::EnhancedHyperLogLog.new(@precision)
+
+     warm_up_count.times do |i|
+       hll_warmup.add("warmup-#{i}")
+       enhanced_warmup.add("warmup-#{i}")
+       @redis.pfadd("warmup_hll", "warmup-#{i}")
+     end
+
+     # Warm-up cardinality
+     10.times do
+       hll_warmup.cardinality
+       hll_warmup.mle_cardinality
+       enhanced_warmup.cardinality
+       enhanced_warmup.cardinality(use_streaming: true)
+       @redis.pfcount("warmup_hll")
+     end
+
+     # Warm-up merge
+     warm_up_hll1 = Hyll::HyperLogLog.new(@precision)
+     warm_up_hll2 = Hyll::HyperLogLog.new(@precision)
+     5.times do
+       warm_up_copy = Marshal.load(Marshal.dump(warm_up_hll1))
+       warm_up_copy.merge(warm_up_hll2)
+     end
+
+     @redis.del("warmup_hll")
+     puts "Warm-up complete.\n\n"
+   end
+
+   def benchmark_insertion
+     puts "=== Insertion Performance ==="
+     GC.start # Run GC now for more consistent results
+
+     results = Benchmark.ips do |x|
+       x.config(time: 2, warmup: 1)
+
+       # Hyll standard insertion
+       x.report("Hyll Standard") do
+         hll = Hyll::HyperLogLog.new(@precision)
+         @elements.each { |e| hll.add(e) }
+       end
+
+       # Hyll enhanced insertion
+       x.report("Hyll Enhanced") do
+         hll = Hyll::EnhancedHyperLogLog.new(@precision)
+         @elements.each { |e| hll.add(e) }
+       end
+
+       # Hyll batch insertion
+       x.report("Hyll Batch") do
+         Hyll::HyperLogLog.new(@precision).add_all(@elements)
+       end
+
+       # Redis insertion
+       x.report("Redis PFADD") do
+         @redis.del("bench_hll")
+         @elements.each { |e| @redis.pfadd("bench_hll", e) }
+       end
+
+       # Redis pipelined insertion
+       x.report("Redis Pipelined") do
+         @redis.del("bench_hll")
+         @redis.pipelined do |pipeline|
+           @elements.each { |e| pipeline.pfadd("bench_hll", e) }
+         end
+       end
+
+       x.compare!
+     end
+
+     @results[:insertion] = results
+     puts "\n"
+   end
+
+   def benchmark_cardinality
+     puts "=== Cardinality Estimation Performance ==="
+     GC.start
+
+     results = Benchmark.ips do |x|
+       x.config(time: 2, warmup: 1)
+
+       # Hyll standard cardinality
+       x.report("Hyll Standard") do
+         @pre_hll_standard.cardinality
+       end
+
+       # Hyll standard MLE
+       x.report("Hyll MLE") do
+         @pre_hll_standard.mle_cardinality
+       end
+
+       # Hyll enhanced cardinality
+       x.report("Hyll Enhanced") do
+         @pre_hll_enhanced.cardinality
+       end
+
+       # Hyll enhanced streaming
+       x.report("Hyll Enhanced Stream") do
+         @pre_hll_enhanced.cardinality(use_streaming: true)
+       end
+
+       # Redis cardinality
+       x.report("Redis PFCOUNT") do
+         @redis.pfcount("hll1")
+       end
+
+       x.compare!
+     end
+
+     @results[:cardinality] = results
+     puts "\n"
+   end
+
+   def benchmark_merge
+     puts "=== Merge Performance ==="
+     GC.start
+
+     results = Benchmark.ips do |x|
+       x.config(time: 2, warmup: 1)
+
+       # Hyll standard merge
+       x.report("Hyll Standard") do
+         hll_copy = Marshal.load(@pre_merge_hll1_serialized)
+         hll_copy.merge(@pre_merge_hll2)
+       end
+
+       # Hyll enhanced merge
+       x.report("Hyll Enhanced") do
+         enhanced_copy = Marshal.load(@pre_merge_enhanced1_serialized)
+         enhanced_copy.merge(@pre_merge_enhanced2)
+       end
+
+       # Redis merge
+       x.report("Redis PFMERGE") do
+         @redis.pfmerge("redis_merged", "redis_hll1", "redis_hll2")
+       end
+
+       x.compare!
+     end
+
+     @results[:merge] = results
+     puts "\n"
+   end
+
+   def benchmark_memory_usage
+     puts "=== Memory Usage ==="
+     GC.start
+
+     # Memory usage of standard HLL
+     hll_standard_memory = report_memory("Hyll Standard") do
+       hll = Hyll::HyperLogLog.new(@precision)
+       @elements.each { |e| hll.add(e) }
+       hll
+     end
+
+     # Memory usage of enhanced HLL
+     hll_enhanced_memory = report_memory("Hyll Enhanced") do
+       hll = Hyll::EnhancedHyperLogLog.new(@precision)
+       @elements.each { |e| hll.add(e) }
+       hll
+     end
+
+     # Memory usage of actual elements (for comparison)
+     raw_elements_memory = report_memory("Raw Elements Array") do
+       @elements.dup
+     end
+
+     # Redis memory usage
+     redis_memory = @redis.memory("USAGE", "hll1")
+     puts "Redis memory usage for HLL key: #{redis_memory} bytes"
+
+     # Compute compression ratios
+     puts "\nCompression ratios:"
+     puts " Hyll Standard: #{(raw_elements_memory[:allocated] / hll_standard_memory[:retained]).round(2)}x"
+     puts " Hyll Enhanced: #{(raw_elements_memory[:allocated] / hll_enhanced_memory[:retained]).round(2)}x"
+     puts " Redis: #{(raw_elements_memory[:allocated] * 1024 / redis_memory).round(2)}x"
+
+     @results[:memory] = {
+       hyll_standard: hll_standard_memory,
+       hyll_enhanced: hll_enhanced_memory,
+       raw_elements: raw_elements_memory,
+       redis: redis_memory
+     }
+
+     puts "\n"
+   end
+
+   def benchmark_accuracy
+     puts "=== Accuracy Comparison ==="
+     GC.start
+
+     accuracy_elements = (0...@data_size).map { |i| "accuracy-#{i}" }
+
+     # Hyll standard
+     hll_standard = Hyll::HyperLogLog.new(@precision)
+     hll_standard.add_all(accuracy_elements)
+
+     # Hyll enhanced
+     hll_enhanced = Hyll::EnhancedHyperLogLog.new(@precision)
+     hll_enhanced.add_all(accuracy_elements)
+
+     hll_standard_high = Hyll::HyperLogLog.new([@precision + 2, 16].min)
+     hll_standard_high.add_all(accuracy_elements)
+
+     # Redis
+     @redis.del("accuracy_test")
+     @redis.pipelined do |pipeline|
+       accuracy_elements.each { |e| pipeline.pfadd("accuracy_test", e) }
+     end
+
+     # Get estimates
+     hll_standard_est = hll_standard.cardinality
+     hll_standard_mle = hll_standard.mle_cardinality
+     hll_standard_high_est = hll_standard_high.cardinality
+     hll_enhanced_est = hll_enhanced.cardinality
+     hll_enhanced_stream = hll_enhanced.cardinality(use_streaming: true)
+     redis_est = @redis.pfcount("accuracy_test")
+
+     # Compute the errors
+     standard_error = calculate_error("Hyll Standard", hll_standard_est, @data_size)
+     standard_mle_error = calculate_error("Hyll Standard MLE", hll_standard_mle, @data_size)
+     standard_high_error = calculate_error("Hyll Standard (High Precision)", hll_standard_high_est, @data_size)
+     enhanced_error = calculate_error("Hyll Enhanced", hll_enhanced_est, @data_size)
+     enhanced_stream_error = calculate_error("Hyll Enhanced Stream", hll_enhanced_stream, @data_size)
+     redis_error = calculate_error("Redis", redis_est, @data_size)
+
+     @results[:accuracy] = {
+       hyll_standard: standard_error,
+       hyll_standard_mle: standard_mle_error,
+       hyll_standard_high: standard_high_error,
+       hyll_enhanced: enhanced_error,
+       hyll_enhanced_stream: enhanced_stream_error,
+       redis: redis_error
+     }
+
+     # Error chart (ASCII art)
+     puts "\nError comparison (lower is better):"
+     print_error_bar("Hyll Standard", standard_error[:percent])
+     print_error_bar("Hyll MLE", standard_mle_error[:percent])
+     print_error_bar("Hyll (High Prec)", standard_high_error[:percent])
+     print_error_bar("Hyll Enhanced", enhanced_error[:percent])
+     print_error_bar("Hyll Enh Stream", enhanced_stream_error[:percent])
+     print_error_bar("Redis", redis_error[:percent])
+   end
+
+   def print_error_bar(label, error_pct)
+     display_error = [error_pct, 300].min
+     bars = (display_error / 5).to_i
+     truncated = display_error < error_pct
+
+     printf("%-18s |%-60s| %.2f%%%s\n",
+            label,
+            "#" * bars,
+            error_pct,
+            truncated ? " (truncated)" : "")
+   end
+
+   def print_summary
+     puts "\n=== BENCHMARK SUMMARY ==="
+     puts "Precision: #{@precision}, Data size: #{@data_size}"
+
+     puts "\nACCURACY WINNER: #{get_accuracy_winner}" if @results[:accuracy]
+
+     if @results[:insertion] && @results[:cardinality] && @results[:merge]
+       puts "\nPERFORMANCE WINNERS:"
+       puts " Insertion: #{get_insertion_winner}"
+       puts " Cardinality: #{get_cardinality_winner}"
+       puts " Merge: #{get_merge_winner}"
+     end
+
+     puts "\nMEMORY USAGE WINNER: #{get_memory_winner}" if @results[:memory]
+
+     puts "\nRECOMMENDATION:"
+     puts generate_recommendation
+   end
+
+   def get_accuracy_winner
+     errors = @results[:accuracy].transform_values { |v| v[:percent] }
+     winner = errors.min_by { |_, v| v }
+     "#{winner[0].to_s.split("_").map(&:capitalize).join(" ")} (#{winner[1].round(2)}% error)"
+   end
+
+   def get_insertion_winner
+     @results[:insertion].entries.max_by(&:ips).label
+   end
+
+   def get_cardinality_winner
+     @results[:cardinality].entries.max_by(&:ips).label
+   end
+
+   def get_merge_winner
+     @results[:merge].entries.max_by(&:ips).label
+   end
+
+   def get_memory_winner
+     memories = {
+       hyll_standard: @results[:memory][:hyll_standard][:retained],
+       hyll_enhanced: @results[:memory][:hyll_enhanced][:retained],
+       redis: @results[:memory][:redis]
+     }
+
+     winner = memories.min_by { |_, v| v }
+     "#{winner[0].to_s.split("_").map(&:capitalize).join(" ")} (#{winner[1] / 1024.0} KB)"
+   end
+
+   def generate_recommendation
+     return "Run accuracy benchmark to generate recommendation" unless @results[:accuracy]
+
+     errors = @results[:accuracy].transform_values { |v| v[:percent] }
+
+     if errors[:redis] < 5.0
+       "Redis offers excellent accuracy and good performance, recommended for most use cases."
+     elsif errors[:hyll_standard] < errors[:hyll_enhanced] && errors[:hyll_standard] < 15.0
+       "Hyll Standard with precision #{@precision} offers good accuracy and best insertion performance."
+     elsif errors[:hyll_enhanced] < 15.0
+       "Hyll Enhanced offers better accuracy than Standard and good overall performance."
+     else
+       "Consider using higher precision (#{[@precision + 2, 16].min}) for better accuracy."
+     end
+   end
+
+   def export_results(filename)
+     File.write(filename, JSON.pretty_generate(@results))
+     puts "Results exported to #{filename}"
+   end
+
+   private
+
+   def report_memory(label)
+     GC.start # Force GC before measurement
+     result = nil
+     report = MemoryProfiler.report do
+       result = yield
+     end
+
+     allocated = report.total_allocated_memsize / 1024.0
+     retained = report.total_retained_memsize / 1024.0
+
+     puts "#{label}:"
+     puts " Total allocated: #{allocated.round(2)} KB"
+     puts " Total retained: #{retained.round(2)} KB"
+
+     # Return memory stats
+     { allocated: allocated, retained: retained, result: result }
+   end
+
+   def calculate_error(label, estimate, actual)
+     error_pct = ((estimate - actual).abs / actual.to_f) * 100
+     result = {
+       estimate: estimate.round,
+       actual: actual,
+       difference: (estimate - actual).round,
+       percent: error_pct.round(2)
+     }
+
+     puts "#{label}: Estimated #{result[:estimate]} vs Actual #{actual} (Error: #{result[:percent]}%)"
+     result
+   end
+ end
+
+ # Parse command line options
+ options = {
+   precision: 10,
+   data_size: 10_000,
+   overlapping: 1_000,
+   warm_up: true,
+   benchmark_type: :all,
+   output_file: nil
+ }
+
+ OptionParser.new do |opts|
+   opts.banner = "Usage: ruby redis_comparison_benchmark.rb [options]"
+
+   opts.on("-p", "--precision PRECISION", Integer, "HyperLogLog precision (4-16)") do |p|
+     options[:precision] = p
+   end
+
+   opts.on("-d", "--data-size SIZE", Integer, "Number of elements to add") do |d|
+     options[:data_size] = d
+   end
+
+   opts.on("-o", "--overlapping SIZE", Integer, "Number of overlapping elements for merge tests") do |o|
+     options[:overlapping] = o
+   end
+
+   opts.on("--no-warm-up", "Skip warm-up phase") do
+     options[:warm_up] = false
+   end
+
+   opts.on("-b", "--benchmark TYPE", %i[all insertion cardinality merge memory accuracy],
+           "Run specific benchmark type (all, insertion, cardinality, merge, memory, accuracy)") do |b|
+     options[:benchmark_type] = b
+   end
+
+   opts.on("--output FILE", "Export results to JSON file") do |f|
+     options[:output_file] = f
+   end
+
+   opts.on("-h", "--help", "Show this help message") do
+     puts opts
+     exit
+   end
+ end.parse!
+
+ # Run benchmarks
+ puts "Starting HyperLogLog Comparison Benchmark: Hyll vs Redis"
+ puts "-----------------------------------------------------"
+
+ begin
+   comparison = HyllRedisComparison.new(
+     precision: options[:precision],
+     data_size: options[:data_size],
+     overlapping: options[:overlapping],
+     warm_up: options[:warm_up],
+     benchmark_type: options[:benchmark_type]
+   )
+
+   comparison.run_benchmarks
+
+   comparison.export_results(options[:output_file]) if options[:output_file]
+ rescue Redis::CannotConnectError
+   puts "ERROR: Cannot connect to Redis server."
+   puts "Please ensure Redis is running locally on the default port (6379)."
+   puts "You can start Redis with: redis-server"
+ end
+
+ puts "Benchmark complete!"
@@ -81,9 +81,10 @@ module Hyll
  # @return [Float] the estimated variance
  def streaming_variance
    # If no modifications, return 0
-   return 0.0 if @streaming_estimate.zero?
+   return 0.0 if @last_modification_probability.nil?
 
-   # Return the quadratic variation
+   # Calculate variance based on martingale properties
+   # This provides an unbiased estimate of the variance
    @quadratic_variation
  end
 
@@ -91,19 +92,25 @@ module Hyll
  # @param confidence [Float] confidence level (default: 0.95)
  # @return [Array<Float>] lower and upper bounds
  def streaming_error_bounds(confidence = 0.95)
-   return [0, 0] if @streaming_estimate.zero?
+   # If no modifications, return exact bounds
+   return [@streaming_estimate, @streaming_estimate] if @last_modification_probability.nil?
 
-   # For 95% confidence, use ~1.96 multiplier
+   # Calculate z-score for the given confidence level
+   # For 95% confidence, z ≈ 1.96
    z = case confidence
-       when 0.9 then 1.645
+       when 0.90 then 1.645
        when 0.95 then 1.96
        when 0.99 then 2.576
-       else 1.96 # Default to 95%
+       else
+         # Calculate using inverse error function for any confidence level
+         Math.sqrt(2) * Math.erfc(2 * (1 - confidence))
        end
 
-   std_dev = Math.sqrt(streaming_variance)
+   # Calculate standard error
+   std_error = Math.sqrt(streaming_variance)
 
-   [@streaming_estimate - z * std_dev, @streaming_estimate + z * std_dev]
+   # Return confidence interval
+   [@streaming_estimate - z * std_error, @streaming_estimate + z * std_error]
  end
 
  # Update register value directly (no compression in EnhancedHyperLogLog)
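For reference, the bounds computed by `streaming_error_bounds` in the hunk above amount to a normal-approximation confidence interval around the streaming estimate; with V the quadratic variation returned by `streaming_variance`, the tabulated branches correspond to:

```latex
\left[\,\hat{n} - z\sqrt{V},\;\; \hat{n} + z\sqrt{V}\,\right],
\qquad
z \approx \begin{cases}
1.645 & \text{90\% confidence}\\
1.96  & \text{95\% confidence}\\
2.576 & \text{99\% confidence}
\end{cases}
```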
@@ -280,17 +280,27 @@ module Hyll
    non_zero_registers_count / weighted_sum * Math.log(1 + weighted_sum / zero_registers_count)
  end
 
+ # Return early for edge cases to avoid numerical instability
+ return initial_estimate * @m if initial_estimate.zero? || initial_estimate.nan? || initial_estimate.infinite?
+
  # Precision parameter
  epsilon = 0.01
  delta = epsilon / Math.sqrt(@m)
 
- # Secant method iteration
+ # Memoize h_values calculation to avoid redundant computation
+ h_values_cache = {}
+
+ # Secant method iteration - limit max iterations to prevent infinite loops
  delta_x = initial_estimate
  g_prev = 0
+ max_iterations = 100
+ iterations = 0
+
+ while delta_x > initial_estimate * delta && iterations < max_iterations
+   iterations += 1
 
- while delta_x > initial_estimate * delta
-   # Calculate h(x) efficiently
-   h_values = calculate_h_values(initial_estimate, min_value, max_value)
+   # Calculate h(x) efficiently with memoization
+   h_values = h_values_cache[initial_estimate] ||= calculate_h_values(initial_estimate, min_value, max_value)
 
    # Calculate the function value
    g = 0.0
@@ -299,12 +309,14 @@ module Hyll
    end
    g += initial_estimate * (weighted_sum + zero_registers_count)
 
-   # Update the estimate using secant method
-   delta_x = if g > g_prev && non_zero_registers_count >= g
-               delta_x * (non_zero_registers_count - g) / (g - g_prev)
-             else
-               0
-             end
+   # Update the estimate using secant method with safeguards
+   if g > g_prev && non_zero_registers_count >= g && (g - g_prev).abs > Float::EPSILON
+     delta_x = delta_x * (non_zero_registers_count - g) / (g - g_prev)
+     # Add safeguard against too large steps
+     delta_x = [delta_x, initial_estimate].min
+   else
+     delta_x = 0
+   end
 
    initial_estimate += delta_x
    g_prev = g
@@ -67,6 +67,10 @@ module Hyll
  # @param k_max [Integer] maximum k
  # @return [Array<Float>] array of h(x/2^k) values
  def calculate_h_values(x, k_min, k_max)
+   # Guard against invalid inputs
+   return [] if k_min > k_max
+   return [0.0] * (k_max - k_min + 1) if x.zero? || x.nan? || x.infinite?
+
    # Determine the smallest power of 2 denominator for which we need h(x)
    power = k_max
 
@@ -79,11 +83,17 @@ module Hyll
    # For small arguments, use more accurate formula (simpler approximation)
    h = if x_prime <= 0.1
          # For very small values, h(x) ≈ x/2
-         # This formula ensures we get consistent value across different inputs and powers
          x_prime / 2.0
        elsif x_prime <= 0.5
          # Use more accurate Taylor series for small-to-medium values
-         x_prime / 2.0 - (x_prime**2) / 12.0 + (x_prime**4) / 720.0 - (x_prime**6) / 30_240.0
+         taylor_sum = x_prime / 2.0
+         term = x_prime * x_prime
+         taylor_sum -= term / 12.0
+         term *= x_prime * x_prime
+         taylor_sum += term / 720.0
+         term *= x_prime * x_prime
+         taylor_sum -= term / 30_240.0
+         taylor_sum
        else
          # For larger values, directly compute
          1.0 - ::Math.exp(-x_prime)
@@ -95,7 +105,13 @@ module Hyll
    # Calculate subsequent h values using recurrence relation
    1.upto(k_max - k_min) do |i|
      x_prime *= 2.0 # Double x_prime
-     h = (x_prime + h * (1.0 - h)) / (x_prime + (1.0 - h))
+     denominator = x_prime + (1.0 - h)
+     # Avoid division by zero
+     h = if denominator.abs < Float::EPSILON
+           h_values[i - 1] # Use previous value if unstable
+         else
+           (x_prime + h * (1.0 - h)) / denominator
+         end
      h_values[i] = h
    end
 
data/lib/hyll/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Hyll
-   VERSION = "0.1.1"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: hyll
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.2.0
  platform: ruby
  authors:
  - Davide Santangelo
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2025-03-21 00:00:00.000000000 Z
+ date: 2025-03-24 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: digest
@@ -44,6 +44,7 @@ files:
  - Rakefile
  - examples/advance.rb
  - examples/basic.rb
+ - examples/redis_comparison_benchmark.rb
  - lib/hyll.rb
  - lib/hyll/algorithms/enhanced_hyperloglog.rb
  - lib/hyll/algorithms/hyperloglog.rb