hyll 0.1.1 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -0
- data/README.md +79 -0
- data/examples/redis_comparison_benchmark.rb +539 -0
- data/lib/hyll/algorithms/enhanced_hyperloglog.rb +15 -8
- data/lib/hyll/algorithms/hyperloglog.rb +22 -10
- data/lib/hyll/utils/math.rb +19 -3
- data/lib/hyll/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 445c738f59356bd13dc55dd8f6c8a2b74e39260e4367fa28012a4f13c0f5ebeb
+  data.tar.gz: 5f5d79fcd5aa2aec6afeb4e04be2725c6cc08cefa99c497e8070ee43eee03742
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8df3d73bd6f665ab163891b7f51d5e9f09fc8b2df0479445cb285487b6dd4b9640d324a1bab77069d50a934da0a6af9e464c33f212c94ac18a5af73349f234cf
+  data.tar.gz: f73ae9153b519e77dfb77685395e5d1736de23f1f53e3575db29e96fb4f2fa512e08b1be737d42717db9e28b936358731a17ec69234864532c731a467782ec4a
data/CHANGELOG.md
CHANGED

@@ -5,6 +5,28 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.0] - 2025-03-24
+
+### Added
+- Associativity test for HyperLogLog merges to ensure (A ∪ B) ∪ C = A ∪ (B ∪ C)
+- Guards against invalid inputs in mathematical functions
+- Memoization for h_values calculation in MLE algorithm
+- More comprehensive error handling with safeguards
+- Added `examples/redis_comparison_benchmark.rb` for Redis comparison
+
+### Changed
+- Optimized Maximum Likelihood Estimation (MLE) algorithm for better performance
+- Improved numerical stability in the secant method implementation
+- Enhanced Taylor series calculation for better accuracy
+- Fixed Math module namespace conflicts
+- Added safeguards against division by zero and other numerical errors
+- Limited maximum iterations in convergence algorithms to prevent infinite loops
+
+### Fixed
+- Addressed potential numerical instability in calculate_h_values method
+- Fixed undefined method 'exp' error by using global namespace operator
+- Improved edge case handling in the MLE algorithm
+
 ## [0.1.1] - 2025-03-21
 
 ### Changed
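The associativity claim in the changelog above holds because a HyperLogLog merge is an element-wise maximum over the register array, and `max` is associative. A minimal self-contained Ruby sketch of that property (illustrative register values, not Hyll's internal layout):

```ruby
# A HyperLogLog merge takes the element-wise max of two register arrays.
# Because max is associative, the order in which sketches are merged
# cannot change the result: (A ∪ B) ∪ C == A ∪ (B ∪ C).
def merge_registers(a, b)
  a.zip(b).map(&:max)
end

a = [0, 3, 1, 5]
b = [2, 1, 4, 0]
c = [1, 6, 0, 2]

left  = merge_registers(merge_registers(a, b), c) # (A ∪ B) ∪ C
right = merge_registers(a, merge_registers(b, c)) # A ∪ (B ∪ C)

puts left == right # => true
puts left.inspect  # => [2, 6, 4, 5]
```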
data/README.md
CHANGED

@@ -1,5 +1,7 @@
 # Hyll
 
+
+
 [](https://github.com/davidesantangelo/hyll/actions)
 
 Hyll is a Ruby implementation of the [HyperLogLog algorithm](https://en.wikipedia.org/wiki/HyperLogLog) for the count-distinct problem, which efficiently approximates the number of distinct elements in a multiset with minimal memory usage. It supports both standard and Enhanced variants, offering a flexible approach for large-scale applications and providing convenient methods for merging, serialization, and maximum likelihood estimation.
@@ -226,6 +228,76 @@ Below are actual performance measurements from an Apple Mac Mini M4 with 24GB RA
 
 These benchmarks demonstrate HyperLogLog's exceptional memory efficiency, maintaining a compression ratio of over 6,250x compared to storing the raw elements, while still providing accurate cardinality estimates.
 
+## Benchmark Comparison with Redis
+
+Hyll has been benchmarked against Redis' HyperLogLog implementation to provide a comparison with a widely-used production system. The tests were run on an Apple Silicon M1 Mac using Ruby 3.1.4 with 10,000 elements and a precision of 10.
+
+### Insertion Performance
+
+| Implementation | Operations/sec | Relative Performance |
+|----------------|---------------:|---------------------:|
+| Hyll Standard  | 86.32 | 1.00x (fastest) |
+| Hyll Batch     | 85.98 | 1.00x |
+| Redis Pipelined| 20.51 | 4.21x slower |
+| Redis PFADD    | 4.93 | 17.51x slower |
+| Hyll Enhanced  | 1.20 | 71.87x slower |
+
+### Cardinality Estimation Performance
+
+| Implementation | Operations/sec | Relative Performance |
+|---------------------|---------------:|---------------------:|
+| Redis PFCOUNT       | 53,131 | 1.00x (fastest) |
+| Hyll Enhanced Stream| 24,412 | 2.18x slower |
+| Hyll Enhanced       | 8,843 | 6.01x slower |
+| Hyll Standard       | 8,538 | 6.22x slower |
+| Hyll MLE            | 5,645 | 9.41x slower |
+
+### Merge Performance
+
+| Implementation | Operations/sec | Relative Performance |
+|----------------|---------------:|---------------------:|
+| Redis PFMERGE  | 12,735 | 1.00x (fastest) |
+| Hyll Enhanced  | 6,523 | 1.95x slower |
+| Hyll Standard  | 2,932 | 4.34x slower |
+
+### Memory Usage
+
+| Implementation | Memory Usage |
+|------------------|-------------:|
+| Hyll Enhanced    | 0.28 KB |
+| Hyll Standard    | 18.30 KB |
+| Redis            | 12.56 KB |
+| Raw Elements     | 0.04 KB |
+
+### Accuracy Comparison
+
+| Implementation | Estimated Count | Actual Count | Error |
+|---------------------|----------------:|-------------:|---------:|
+| Redis               | 9,990 | 10,000 | 0.10% |
+| Hyll Enhanced       | 3,018 | 10,000 | 69.82% |
+| Hyll (High Prec)    | 19,016 | 10,000 | 90.16% |
+| Hyll Standard       | 32,348 | 10,000 | 223.48% |
+| Hyll Enhanced Stream| 8,891,659 | 10,000 | 88,816.59% |
+| Hyll MLE            | 19,986,513 | 10,000 | 199,765.13% |
+
+### Summary of Findings
+
+- **Insertion Performance**: Hyll Standard and Batch operations are significantly faster than Redis for adding elements.
+- **Cardinality Estimation**: Redis has the fastest cardinality estimation, with Hyll Enhanced Stream as a close second.
+- **Merge Operations**: Redis outperforms Hyll for merging HyperLogLog sketches, but Hyll Enhanced provides competitive performance.
+- **Memory Usage**: Hyll Enhanced offers the most memory-efficient implementation.
+- **Accuracy**: Redis provides the best accuracy in this test scenario.
+
+#### Recommendation
+
+For most use cases, Redis offers an excellent balance of accuracy and performance. However, Hyll provides superior insertion performance and memory efficiency, making it a good choice for scenarios where these attributes are prioritized.
+
+You can run these benchmarks yourself using the included script:
+
+```ruby
+ruby examples/redis_comparison_benchmark.rb
+```
+
 ## Features
 
 - Standard HyperLogLog implementation with customizable precision
@@ -293,6 +365,13 @@ For advanced usage scenarios, check out `examples/advance.rb` which includes:
 - Advanced serialization techniques
 - Precision vs. memory usage benchmarks
 
+## Example Use Cases
+
+Here is a quick illustration of how Hyll can be helpful in real-world scenarios:
+- Unique visitor counting: Track unique users visiting a website in real time. By adding each user's session ID or IP to the HyperLogLog, you get an approximate number of distinct users without storing everybody's data.
+- Log analytics: Continuously process large log files to calculate the volume of unique events, keeping memory usage low.
+- Marketing campaigns: Quickly gauge how many distinct customers participate in a campaign while merging data from multiple sources.
+
 ## Development
 
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
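The use cases above all follow the same pattern: hash each item into a fixed-size sketch, then read off an estimate. As a self-contained illustration of the unique-visitor case, here is a deliberately tiny toy HyperLogLog (this is *not* Hyll's implementation; `TinyHLL` and its constants are invented for this sketch):

```ruby
require "digest"

# Toy HyperLogLog, for illustration only: 2**precision registers, each
# storing the highest "rank" (position of the first 1-bit) observed.
class TinyHLL
  def initialize(precision = 12)
    @p = precision
    @m = 1 << precision
    @registers = Array.new(@m, 0)
  end

  def add(element)
    h = Digest::SHA1.hexdigest(element.to_s)[0, 16].to_i(16) # 64-bit hash
    idx = h & (@m - 1) # low p bits choose a register
    w = h >> @p        # remaining bits determine the rank
    rank = 1
    while w.even? && rank < 64 - @p
      rank += 1
      w >>= 1
    end
    @registers[idx] = rank if rank > @registers[idx]
  end

  def cardinality
    alpha = 0.7213 / (1 + 1.079 / @m)
    raw = alpha * @m * @m / @registers.sum { |r| 2.0**-r }
    zeros = @registers.count(0)
    # Linear-counting correction for small cardinalities
    raw <= 2.5 * @m && zeros.positive? ? @m * Math.log(@m.to_f / zeros) : raw
  end
end

hll = TinyHLL.new
5_000.times { |i| hll.add("session-#{i}") } # e.g. unique session IDs
estimate = hll.cardinality
puts estimate.round # close to 5000 (a few percent error)
```

The sketch uses about 4 KB of registers regardless of how many sessions are added, which is the point of the use cases listed above.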
data/examples/redis_comparison_benchmark.rb
ADDED

@@ -0,0 +1,539 @@
+# frozen_string_literal: true
+
+require "hyll"
+require "redis"
+require "benchmark"
+require "benchmark/ips"
+require "memory_profiler"
+require "optparse"
+require "json"
+
+class HyllRedisComparison
+  def initialize(precision: 10, data_size: 10_000, overlapping: 1_000, warm_up: true, benchmark_type: :all)
+    @precision = precision
+    @data_size = data_size
+    @overlapping = overlapping
+    @warm_up = warm_up
+    @benchmark_type = benchmark_type
+    @redis = Redis.new
+    @results = {}
+
+    puts "Initializing benchmark with:"
+    puts " - Precision: #{@precision}"
+    puts " - Data size: #{@data_size} elements"
+    puts " - Overlapping elements for merge tests: #{@overlapping}"
+    puts " - Warm-up: #{@warm_up ? "Enabled" : "Disabled"}"
+    puts "\n"
+
+    # Clean up any existing Redis keys
+    @redis.del("hll1", "hll2", "hll_merged", "redis_hll1", "redis_hll2", "redis_merged", "accuracy_test")
+
+    # Pre-generate data for the tests
+    @elements = (0...@data_size).map { |i| "element-#{i}" }.freeze
+    @set1_elements = (0...@data_size).map { |i| "set1-#{i}" }.freeze
+    @set2_elements = (0...@data_size).map { |i| "set2-#{i + @data_size - @overlapping}" }.freeze
+
+    # Pre-load Redis
+    if %i[all cardinality memory].include?(benchmark_type)
+      @redis.pipelined do |pipeline|
+        @elements.each { |e| pipeline.pfadd("hll1", e) }
+      end
+    end
+
+    if %i[all merge].include?(benchmark_type)
+      @redis.pipelined do |pipeline|
+        @set1_elements.each { |e| pipeline.pfadd("redis_hll1", e) }
+        @set2_elements.each { |e| pipeline.pfadd("redis_hll2", e) }
+      end
+    end
+
+    if %i[all cardinality].include?(benchmark_type)
+      @pre_hll_standard = Hyll::HyperLogLog.new(@precision)
+      @pre_hll_enhanced = Hyll::EnhancedHyperLogLog.new(@precision)
+
+      @elements.each do |e|
+        @pre_hll_standard.add(e)
+        @pre_hll_enhanced.add(e)
+      end
+    end
+
+    if %i[all merge].include?(benchmark_type)
+      @pre_merge_hll1 = Hyll::HyperLogLog.new(@precision)
+      @pre_merge_hll2 = Hyll::HyperLogLog.new(@precision)
+      @pre_merge_enhanced1 = Hyll::EnhancedHyperLogLog.new(@precision)
+      @pre_merge_enhanced2 = Hyll::EnhancedHyperLogLog.new(@precision)
+
+      @set1_elements.each do |e|
+        @pre_merge_hll1.add(e)
+        @pre_merge_enhanced1.add(e)
+      end
+
+      @set2_elements.each do |e|
+        @pre_merge_hll2.add(e)
+        @pre_merge_enhanced2.add(e)
+      end
+
+      @pre_merge_hll1_serialized = Marshal.dump(@pre_merge_hll1)
+      @pre_merge_enhanced1_serialized = Marshal.dump(@pre_merge_enhanced1)
+    end
+
+    warm_up_benchmarks if @warm_up
+  end
+
+  def run_benchmarks
+    case @benchmark_type
+    when :insertion
+      benchmark_insertion
+    when :cardinality
+      benchmark_cardinality
+    when :merge
+      benchmark_merge
+    when :memory
+      benchmark_memory_usage
+    when :accuracy
+      benchmark_accuracy
+    else
+      benchmark_insertion
+      benchmark_cardinality
+      benchmark_merge
+      benchmark_memory_usage
+      benchmark_accuracy
+    end
+
+    print_summary
+  end
+
+  def warm_up_benchmarks
+    puts "Performing warm-up..."
+    # Warm-up JIT compiler
+    warm_up_count = [@data_size / 10, 1000].min
+
+    # Warm-up insertion
+    hll_warmup = Hyll::HyperLogLog.new(@precision)
+    enhanced_warmup = Hyll::EnhancedHyperLogLog.new(@precision)
+
+    warm_up_count.times do |i|
+      hll_warmup.add("warmup-#{i}")
+      enhanced_warmup.add("warmup-#{i}")
+      @redis.pfadd("warmup_hll", "warmup-#{i}")
+    end
+
+    # Warm-up cardinality
+    10.times do
+      hll_warmup.cardinality
+      hll_warmup.mle_cardinality
+      enhanced_warmup.cardinality
+      enhanced_warmup.cardinality(use_streaming: true)
+      @redis.pfcount("warmup_hll")
+    end
+
+    # Warm-up merge
+    warm_up_hll1 = Hyll::HyperLogLog.new(@precision)
+    warm_up_hll2 = Hyll::HyperLogLog.new(@precision)
+    5.times do
+      warm_up_copy = Marshal.load(Marshal.dump(warm_up_hll1))
+      warm_up_copy.merge(warm_up_hll2)
+    end
+
+    @redis.del("warmup_hll")
+    puts "Warm-up complete.\n\n"
+  end
+
+  def benchmark_insertion
+    puts "=== Insertion Performance ==="
+    GC.start # Run GC now for more consistent results
+
+    results = Benchmark.ips do |x|
+      x.config(time: 2, warmup: 1)
+
+      # Hyll standard insertion
+      x.report("Hyll Standard") do
+        hll = Hyll::HyperLogLog.new(@precision)
+        @elements.each { |e| hll.add(e) }
+      end
+
+      # Hyll enhanced insertion
+      x.report("Hyll Enhanced") do
+        hll = Hyll::EnhancedHyperLogLog.new(@precision)
+        @elements.each { |e| hll.add(e) }
+      end
+
+      # Hyll batch insertion
+      x.report("Hyll Batch") do
+        Hyll::HyperLogLog.new(@precision).add_all(@elements)
+      end
+
+      # Redis insertion
+      x.report("Redis PFADD") do
+        @redis.del("bench_hll")
+        @elements.each { |e| @redis.pfadd("bench_hll", e) }
+      end
+
+      # Redis pipelined insertion
+      x.report("Redis Pipelined") do
+        @redis.del("bench_hll")
+        @redis.pipelined do |pipeline|
+          @elements.each { |e| pipeline.pfadd("bench_hll", e) }
+        end
+      end
+
+      x.compare!
+    end
+
+    @results[:insertion] = results
+    puts "\n"
+  end
+
+  def benchmark_cardinality
+    puts "=== Cardinality Estimation Performance ==="
+    GC.start
+
+    results = Benchmark.ips do |x|
+      x.config(time: 2, warmup: 1)
+
+      # Hyll standard cardinality
+      x.report("Hyll Standard") do
+        @pre_hll_standard.cardinality
+      end
+
+      # Hyll standard MLE
+      x.report("Hyll MLE") do
+        @pre_hll_standard.mle_cardinality
+      end
+
+      # Hyll enhanced cardinality
+      x.report("Hyll Enhanced") do
+        @pre_hll_enhanced.cardinality
+      end
+
+      # Hyll enhanced streaming
+      x.report("Hyll Enhanced Stream") do
+        @pre_hll_enhanced.cardinality(use_streaming: true)
+      end
+
+      # Redis cardinality
+      x.report("Redis PFCOUNT") do
+        @redis.pfcount("hll1")
+      end
+
+      x.compare!
+    end
+
+    @results[:cardinality] = results
+    puts "\n"
+  end
+
+  def benchmark_merge
+    puts "=== Merge Performance ==="
+    GC.start
+
+    results = Benchmark.ips do |x|
+      x.config(time: 2, warmup: 1)
+
+      # Hyll standard merge
+      x.report("Hyll Standard") do
+        hll_copy = Marshal.load(@pre_merge_hll1_serialized)
+        hll_copy.merge(@pre_merge_hll2)
+      end
+
+      # Hyll enhanced merge
+      x.report("Hyll Enhanced") do
+        enhanced_copy = Marshal.load(@pre_merge_enhanced1_serialized)
+        enhanced_copy.merge(@pre_merge_enhanced2)
+      end
+
+      # Redis merge
+      x.report("Redis PFMERGE") do
+        @redis.pfmerge("redis_merged", "redis_hll1", "redis_hll2")
+      end
+
+      x.compare!
+    end
+
+    @results[:merge] = results
+    puts "\n"
+  end
+
+  def benchmark_memory_usage
+    puts "=== Memory Usage ==="
+    GC.start
+
+    # Memory usage of standard HLL
+    hll_standard_memory = report_memory("Hyll Standard") do
+      hll = Hyll::HyperLogLog.new(@precision)
+      @elements.each { |e| hll.add(e) }
+      hll
+    end
+
+    # Memory usage of enhanced HLL
+    hll_enhanced_memory = report_memory("Hyll Enhanced") do
+      hll = Hyll::EnhancedHyperLogLog.new(@precision)
+      @elements.each { |e| hll.add(e) }
+      hll
+    end
+
+    # Memory usage of actual elements (for comparison)
+    raw_elements_memory = report_memory("Raw Elements Array") do
+      @elements.dup
+    end
+
+    # Redis memory usage
+    redis_memory = @redis.memory("USAGE", "hll1")
+    puts "Redis memory usage for HLL key: #{redis_memory} bytes"
+
+    # Compute compression ratios
+    puts "\nCompression ratios:"
+    puts " Hyll Standard: #{(raw_elements_memory[:allocated] / hll_standard_memory[:retained]).round(2)}x"
+    puts " Hyll Enhanced: #{(raw_elements_memory[:allocated] / hll_enhanced_memory[:retained]).round(2)}x"
+    puts " Redis: #{(raw_elements_memory[:allocated] * 1024 / redis_memory).round(2)}x"
+
+    @results[:memory] = {
+      hyll_standard: hll_standard_memory,
+      hyll_enhanced: hll_enhanced_memory,
+      raw_elements: raw_elements_memory,
+      redis: redis_memory
+    }
+
+    puts "\n"
+  end
+
+  def benchmark_accuracy
+    puts "=== Accuracy Comparison ==="
+    GC.start
+
+    accuracy_elements = (0...@data_size).map { |i| "accuracy-#{i}" }
+
+    # Hyll standard
+    hll_standard = Hyll::HyperLogLog.new(@precision)
+    hll_standard.add_all(accuracy_elements)
+
+    # Hyll enhanced
+    hll_enhanced = Hyll::EnhancedHyperLogLog.new(@precision)
+    hll_enhanced.add_all(accuracy_elements)
+
+    hll_standard_high = Hyll::HyperLogLog.new([@precision + 2, 16].min)
+    hll_standard_high.add_all(accuracy_elements)
+
+    # Redis
+    @redis.del("accuracy_test")
+    @redis.pipelined do |pipeline|
+      accuracy_elements.each { |e| pipeline.pfadd("accuracy_test", e) }
+    end
+
+    # Get estimates
+    hll_standard_est = hll_standard.cardinality
+    hll_standard_mle = hll_standard.mle_cardinality
+    hll_standard_high_est = hll_standard_high.cardinality
+    hll_enhanced_est = hll_enhanced.cardinality
+    hll_enhanced_stream = hll_enhanced.cardinality(use_streaming: true)
+    redis_est = @redis.pfcount("accuracy_test")
+
+    # Compute errors
+    standard_error = calculate_error("Hyll Standard", hll_standard_est, @data_size)
+    standard_mle_error = calculate_error("Hyll Standard MLE", hll_standard_mle, @data_size)
+    standard_high_error = calculate_error("Hyll Standard (High Precision)", hll_standard_high_est, @data_size)
+    enhanced_error = calculate_error("Hyll Enhanced", hll_enhanced_est, @data_size)
+    enhanced_stream_error = calculate_error("Hyll Enhanced Stream", hll_enhanced_stream, @data_size)
+    redis_error = calculate_error("Redis", redis_est, @data_size)
+
+    @results[:accuracy] = {
+      hyll_standard: standard_error,
+      hyll_standard_mle: standard_mle_error,
+      hyll_standard_high: standard_high_error,
+      hyll_enhanced: enhanced_error,
+      hyll_enhanced_stream: enhanced_stream_error,
+      redis: redis_error
+    }
+
+    # Error chart (ASCII art)
+    puts "\nError comparison (lower is better):"
+    print_error_bar("Hyll Standard", standard_error[:percent])
+    print_error_bar("Hyll MLE", standard_mle_error[:percent])
+    print_error_bar("Hyll (High Prec)", standard_high_error[:percent])
+    print_error_bar("Hyll Enhanced", enhanced_error[:percent])
+    print_error_bar("Hyll Enh Stream", enhanced_stream_error[:percent])
+    print_error_bar("Redis", redis_error[:percent])
+  end
+
+  def print_error_bar(label, error_pct)
+    display_error = [error_pct, 300].min
+    bars = (display_error / 5).to_i
+    truncated = display_error < error_pct
+
+    printf("%-18s |%-60s| %.2f%%%s\n",
+           label,
+           "#" * bars,
+           error_pct,
+           truncated ? " (truncated)" : "")
+  end
+
+  def print_summary
+    puts "\n=== BENCHMARK SUMMARY ==="
+    puts "Precision: #{@precision}, Data size: #{@data_size}"
+
+    puts "\nACCURACY WINNER: #{get_accuracy_winner}" if @results[:accuracy]
+
+    if @results[:insertion] && @results[:cardinality] && @results[:merge]
+      puts "\nPERFORMANCE WINNERS:"
+      puts " Insertion: #{get_insertion_winner}"
+      puts " Cardinality: #{get_cardinality_winner}"
+      puts " Merge: #{get_merge_winner}"
+    end
+
+    puts "\nMEMORY USAGE WINNER: #{get_memory_winner}" if @results[:memory]
+
+    puts "\nRECOMMENDATION:"
+    puts generate_recommendation
+  end
+
+  def get_accuracy_winner
+    errors = @results[:accuracy].transform_values { |v| v[:percent] }
+    winner = errors.min_by { |_, v| v }
+    "#{winner[0].to_s.split("_").map(&:capitalize).join(" ")} (#{winner[1].round(2)}% error)"
+  end
+
+  def get_insertion_winner
+    @results[:insertion].entries.max_by(&:ips).label
+  end
+
+  def get_cardinality_winner
+    @results[:cardinality].entries.max_by(&:ips).label
+  end
+
+  def get_merge_winner
+    @results[:merge].entries.max_by(&:ips).label
+  end
+
+  def get_memory_winner
+    memories = {
+      hyll_standard: @results[:memory][:hyll_standard][:retained],
+      hyll_enhanced: @results[:memory][:hyll_enhanced][:retained],
+      redis: @results[:memory][:redis]
+    }
+
+    winner = memories.min_by { |_, v| v }
+    "#{winner[0].to_s.split("_").map(&:capitalize).join(" ")} (#{winner[1] / 1024.0} KB)"
+  end
+
+  def generate_recommendation
+    return "Run accuracy benchmark to generate recommendation" unless @results[:accuracy]
+
+    errors = @results[:accuracy].transform_values { |v| v[:percent] }
+
+    if errors[:redis] < 5.0
+      "Redis offers excellent accuracy and good performance, recommended for most use cases."
+    elsif errors[:hyll_standard] < errors[:hyll_enhanced] && errors[:hyll_standard] < 15.0
+      "Hyll Standard with precision #{@precision} offers good accuracy and best insertion performance."
+    elsif errors[:hyll_enhanced] < 15.0
+      "Hyll Enhanced offers better accuracy than Standard and good overall performance."
+    else
+      "Consider using higher precision (#{[@precision + 2, 16].min}) for better accuracy."
+    end
+  end
+
+  def export_results(filename)
+    File.write(filename, JSON.pretty_generate(@results))
+    puts "Results exported to #{filename}"
+  end
+
+  private
+
+  def report_memory(label)
+    GC.start # Force GC before measurement
+    result = nil
+    report = MemoryProfiler.report do
+      result = yield
+    end
+
+    allocated = report.total_allocated_memsize / 1024.0
+    retained = report.total_retained_memsize / 1024.0
+
+    puts "#{label}:"
+    puts " Total allocated: #{allocated.round(2)} KB"
+    puts " Total retained: #{retained.round(2)} KB"
+
+    # Return memory stats
+    { allocated: allocated, retained: retained, result: result }
+  end
+
+  def calculate_error(label, estimate, actual)
+    error_pct = ((estimate - actual).abs / actual.to_f) * 100
+    result = {
+      estimate: estimate.round,
+      actual: actual,
+      difference: (estimate - actual).round,
+      percent: error_pct.round(2)
+    }
+
+    puts "#{label}: Estimated #{result[:estimate]} vs Actual #{actual} (Error: #{result[:percent]}%)"
+    result
+  end
+end
+
+# Parse command line options
+options = {
+  precision: 10,
+  data_size: 10_000,
+  overlapping: 1_000,
+  warm_up: true,
+  benchmark_type: :all,
+  output_file: nil
+}
+
+OptionParser.new do |opts|
+  opts.banner = "Usage: ruby redis_comparison_benchmark.rb [options]"
+
+  opts.on("-p", "--precision PRECISION", Integer, "HyperLogLog precision (4-16)") do |p|
+    options[:precision] = p
+  end
+
+  opts.on("-d", "--data-size SIZE", Integer, "Number of elements to add") do |d|
+    options[:data_size] = d
+  end
+
+  opts.on("-o", "--overlapping SIZE", Integer, "Number of overlapping elements for merge tests") do |o|
+    options[:overlapping] = o
+  end
+
+  opts.on("--no-warm-up", "Skip warm-up phase") do
+    options[:warm_up] = false
+  end
+
+  opts.on("-b", "--benchmark TYPE", %i[all insertion cardinality merge memory accuracy],
+          "Run specific benchmark type (all, insertion, cardinality, merge, memory, accuracy)") do |b|
+    options[:benchmark_type] = b
+  end
+
+  opts.on("--output FILE", "Export results to JSON file") do |f|
+    options[:output_file] = f
+  end
+
+  opts.on("-h", "--help", "Show this help message") do
+    puts opts
+    exit
+  end
+end.parse!
+
+# Run benchmarks
+puts "Starting HyperLogLog Comparison Benchmark: Hyll vs Redis"
+puts "-----------------------------------------------------"
+
+begin
+  comparison = HyllRedisComparison.new(
+    precision: options[:precision],
+    data_size: options[:data_size],
+    overlapping: options[:overlapping],
+    warm_up: options[:warm_up],
+    benchmark_type: options[:benchmark_type]
+  )
+
+  comparison.run_benchmarks
+
+  comparison.export_results(options[:output_file]) if options[:output_file]
+rescue Redis::CannotConnectError
+  puts "ERROR: Cannot connect to Redis server."
+  puts "Please ensure Redis is running locally on the default port (6379)."
+  puts "You can start Redis with: redis-server"
+end
+
+puts "Benchmark complete!"
data/lib/hyll/algorithms/enhanced_hyperloglog.rb
CHANGED

@@ -81,9 +81,10 @@ module Hyll
     # @return [Float] the estimated variance
     def streaming_variance
       # If no modifications, return 0
-      return 0.0 if @
+      return 0.0 if @last_modification_probability.nil?
 
-      #
+      # Calculate variance based on martingale properties
+      # This provides an unbiased estimate of the variance
       @quadratic_variation
     end
 
@@ -91,19 +92,25 @@ module Hyll
     # @param confidence [Float] confidence level (default: 0.95)
     # @return [Array<Float>] lower and upper bounds
    def streaming_error_bounds(confidence = 0.95)
-
+      # If no modifications, return exact bounds
+      return [@streaming_estimate, @streaming_estimate] if @last_modification_probability.nil?
 
-      #
+      # Calculate z-score for the given confidence level
+      # For 95% confidence, z ≈ 1.96
       z = case confidence
-          when 0.
+          when 0.90 then 1.645
           when 0.95 then 1.96
           when 0.99 then 2.576
-          else
+          else
+            # Calculate using inverse error function for any confidence level
+            Math.sqrt(2) * Math.erfc(2 * (1 - confidence))
           end
 
-
+      # Calculate standard error
+      std_error = Math.sqrt(streaming_variance)
 
-
+      # Return confidence interval
+      [@streaming_estimate - z * std_error, @streaming_estimate + z * std_error]
     end
 
     # Update register value directly (no compression in EnhancedHyperLogLog)
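The bounds added in the hunk above are a standard normal-approximation confidence interval: estimate ± z·√variance. A self-contained sketch of that arithmetic (illustrative values, not Hyll's API):

```ruby
# Normal-approximation confidence interval, as in streaming_error_bounds:
# pick z for the confidence level, scale by the standard error.
Z_SCORES = { 0.90 => 1.645, 0.95 => 1.96, 0.99 => 2.576 }.freeze

def error_bounds(estimate, variance, confidence = 0.95)
  z = Z_SCORES.fetch(confidence)
  std_error = Math.sqrt(variance)
  [estimate - z * std_error, estimate + z * std_error]
end

lo, hi = error_bounds(10_000.0, 40_000.0, 0.95) # standard error = 200
puts [lo, hi].inspect # roughly [9608, 10392]
```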
data/lib/hyll/algorithms/hyperloglog.rb
CHANGED

@@ -280,17 +280,27 @@ module Hyll
       non_zero_registers_count / weighted_sum * Math.log(1 + weighted_sum / zero_registers_count)
     end
 
+    # Return early for edge cases to avoid numerical instability
+    return initial_estimate * @m if initial_estimate.zero? || initial_estimate.nan? || initial_estimate.infinite?
+
     # Precision parameter
     epsilon = 0.01
     delta = epsilon / Math.sqrt(@m)
 
-    #
+    # Memoize h_values calculation to avoid redundant computation
+    h_values_cache = {}
+
+    # Secant method iteration - limit max iterations to prevent infinite loops
     delta_x = initial_estimate
     g_prev = 0
+    max_iterations = 100
+    iterations = 0
+
+    while delta_x > initial_estimate * delta && iterations < max_iterations
+      iterations += 1
 
-
-
-      h_values = calculate_h_values(initial_estimate, min_value, max_value)
+      # Calculate h(x) efficiently with memoization
+      h_values = h_values_cache[initial_estimate] ||= calculate_h_values(initial_estimate, min_value, max_value)
 
       # Calculate the function value
       g = 0.0
@@ -299,12 +309,14 @@ module Hyll
       end
       g += initial_estimate * (weighted_sum + zero_registers_count)
 
-      # Update the estimate using secant method
-
-
-
-
-
+      # Update the estimate using secant method with safeguards
+      if g > g_prev && non_zero_registers_count >= g && (g - g_prev).abs > Float::EPSILON
+        delta_x = delta_x * (non_zero_registers_count - g) / (g - g_prev)
+        # Add safeguard against too large steps
+        delta_x = [delta_x, initial_estimate].min
+      else
+        delta_x = 0
+      end
 
       initial_estimate += delta_x
       g_prev = g
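The hunk above bounds the MLE solver with `max_iterations` and guards the secant update's denominator. The same two safeguards can be shown on a generic secant root-finder, independent of Hyll's internals (a sketch, not the gem's code):

```ruby
# Secant root-finding with the two safeguards the diff adds:
# an iteration cap and a guard against a near-zero denominator.
def secant_root(f, x0, x1, max_iterations: 100, tol: 1e-12)
  max_iterations.times do
    f0 = f.call(x0)
    f1 = f.call(x1)
    denom = f1 - f0
    break if denom.abs < Float::EPSILON # avoid division by (almost) zero
    x0, x1 = x1, x1 - f1 * (x1 - x0) / denom
    break if (x1 - x0).abs < tol
  end
  x1
end

root = secant_root(->(x) { x * x - 2 }, 1.0, 2.0)
puts root # ≈ 1.4142135623730951 (the square root of 2)
```

Without the iteration cap, a flat or oscillating objective could loop forever; without the denominator guard, two equal function values would divide by zero — exactly the failure modes the changelog mentions.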
data/lib/hyll/utils/math.rb
CHANGED

@@ -67,6 +67,10 @@ module Hyll
     # @param k_max [Integer] maximum k
     # @return [Array<Float>] array of h(x/2^k) values
     def calculate_h_values(x, k_min, k_max)
+      # Guard against invalid inputs
+      return [] if k_min > k_max
+      return [0.0] * (k_max - k_min + 1) if x.zero? || x.nan? || x.infinite?
+
       # Determine the smallest power of 2 denominator for which we need h(x)
       power = k_max
 
@@ -79,11 +83,17 @@ module Hyll
       # For small arguments, use more accurate formula (simpler approximation)
       h = if x_prime <= 0.1
             # For very small values, h(x) ≈ x/2
-            # This formula ensures we get consistent value across different inputs and powers
             x_prime / 2.0
           elsif x_prime <= 0.5
             # Use more accurate Taylor series for small-to-medium values
-
+            taylor_sum = x_prime / 2.0
+            term = x_prime * x_prime
+            taylor_sum -= term / 12.0
+            term *= x_prime * x_prime
+            taylor_sum += term / 720.0
+            term *= x_prime * x_prime
+            taylor_sum -= term / 30_240.0
+            taylor_sum
           else
             # For larger values, directly compute
             1.0 - ::Math.exp(-x_prime)
 
@@ -95,7 +105,13 @@ module Hyll
       # Calculate subsequent h values using recurrence relation
       1.upto(k_max - k_min) do |i|
         x_prime *= 2.0 # Double x_prime
-
+        denominator = x_prime + (1.0 - h)
+        # Avoid division by zero
+        h = if denominator.abs < Float::EPSILON
+              h_values[i - 1] # Use previous value if unstable
+            else
+              (x_prime + h * (1.0 - h)) / denominator
+            end
         h_values[i] = h
       end
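The truncated series in the hunk above, x/2 − x²/12 + x⁴/720 − x⁶/30240, has exactly the leading Bernoulli-number coefficients of 1 − x/(eˣ − 1), which is why it is so accurate on the small-to-medium range (the next omitted term is x⁸/1209600). A standalone numeric sanity check of the coefficients, independent of the gem:

```ruby
# Compare the diff's truncated Taylor series against the closed form
# 1 - x/(e^x - 1); on x <= 0.5 the truncation error stays below ~1e-8.
def h_taylor(x)
  (x / 2.0) - (x**2 / 12.0) + (x**4 / 720.0) - (x**6 / 30_240.0)
end

def h_closed(x)
  1.0 - x / (Math.exp(x) - 1.0)
end

x = 0.3
puts (h_taylor(x) - h_closed(x)).abs < 1e-8 # => true
```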
data/lib/hyll/version.rb
CHANGED
metadata
CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: hyll
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Davide Santangelo
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-03-
+date: 2025-03-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: digest
@@ -44,6 +44,7 @@ files:
 - Rakefile
 - examples/advance.rb
 - examples/basic.rb
+- examples/redis_comparison_benchmark.rb
 - lib/hyll.rb
 - lib/hyll/algorithms/enhanced_hyperloglog.rb
 - lib/hyll/algorithms/hyperloglog.rb