hyll 0.1.1 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -0
- data/README.md +79 -0
- data/examples/redis_comparison_benchmark.rb +539 -0
- data/lib/hyll/algorithms/enhanced_hyperloglog.rb +15 -8
- data/lib/hyll/algorithms/hyperloglog.rb +22 -10
- data/lib/hyll/utils/math.rb +19 -3
- data/lib/hyll/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 445c738f59356bd13dc55dd8f6c8a2b74e39260e4367fa28012a4f13c0f5ebeb
+  data.tar.gz: 5f5d79fcd5aa2aec6afeb4e04be2725c6cc08cefa99c497e8070ee43eee03742
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8df3d73bd6f665ab163891b7f51d5e9f09fc8b2df0479445cb285487b6dd4b9640d324a1bab77069d50a934da0a6af9e464c33f212c94ac18a5af73349f234cf
+  data.tar.gz: f73ae9153b519e77dfb77685395e5d1736de23f1f53e3575db29e96fb4f2fa512e08b1be737d42717db9e28b936358731a17ec69234864532c731a467782ec4a
data/CHANGELOG.md
CHANGED

@@ -5,6 +5,28 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.0] - 2025-03-24
+
+### Added
+- Associativity test for HyperLogLog merges to ensure (A ∪ B) ∪ C = A ∪ (B ∪ C)
+- Guards against invalid inputs in mathematical functions
+- Memoization for h_values calculation in MLE algorithm
+- More comprehensive error handling with safeguards
+- Added `examples/redis_comparison_benchmark.rb` for Redis comparison
+
+### Changed
+- Optimized Maximum Likelihood Estimation (MLE) algorithm for better performance
+- Improved numerical stability in the secant method implementation
+- Enhanced Taylor series calculation for better accuracy
+- Fixed Math module namespace conflicts
+- Added safeguards against division by zero and other numerical errors
+- Limited maximum iterations in convergence algorithms to prevent infinite loops
+
+### Fixed
+- Addressed potential numerical instability in calculate_h_values method
+- Fixed undefined method 'exp' error by using global namespace operator
+- Improved edge case handling in the MLE algorithm
+
 ## [0.1.1] - 2025-03-21
 
 ### Changed
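The associativity claim in the changelog above holds because a HyperLogLog merge is an element-wise maximum over the register array, and `max` is associative. A minimal self-contained Ruby sketch of that property (illustrative register values, not Hyll's internal layout):

```ruby
# A HyperLogLog merge takes the element-wise max of two register arrays.
# Because max is associative, the order in which sketches are merged
# cannot change the result: (A ∪ B) ∪ C == A ∪ (B ∪ C).
def merge_registers(a, b)
  a.zip(b).map(&:max)
end

a = [0, 3, 1, 5]
b = [2, 1, 4, 0]
c = [1, 6, 0, 2]

left  = merge_registers(merge_registers(a, b), c) # (A ∪ B) ∪ C
right = merge_registers(a, merge_registers(b, c)) # A ∪ (B ∪ C)

puts left == right # => true
puts left.inspect  # => [2, 6, 4, 5]
```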
data/README.md
CHANGED

@@ -1,5 +1,7 @@
 # Hyll
 
+
+
 [](https://github.com/davidesantangelo/hyll/actions)
 
 Hyll is a Ruby implementation of the [HyperLogLog algorithm](https://en.wikipedia.org/wiki/HyperLogLog) for the count-distinct problem, which efficiently approximates the number of distinct elements in a multiset with minimal memory usage. It supports both standard and Enhanced variants, offering a flexible approach for large-scale applications and providing convenient methods for merging, serialization, and maximum likelihood estimation.
@@ -226,6 +228,76 @@ Below are actual performance measurements from an Apple Mac Mini M4 with 24GB RA
 
 These benchmarks demonstrate HyperLogLog's exceptional memory efficiency, maintaining a compression ratio of over 6,250x compared to storing the raw elements, while still providing accurate cardinality estimates.
 
+## Benchmark Comparison with Redis
+
+Hyll has been benchmarked against Redis' HyperLogLog implementation to provide a comparison with a widely-used production system. The tests were run on an Apple Silicon M1 Mac using Ruby 3.1.4 with 10,000 elements and a precision of 10.
+
+### Insertion Performance
+
+| Implementation | Operations/sec | Relative Performance |
+|----------------|---------------:|---------------------:|
+| Hyll Standard  | 86.32 | 1.00x (fastest) |
+| Hyll Batch     | 85.98 | 1.00x |
+| Redis Pipelined| 20.51 | 4.21x slower |
+| Redis PFADD    | 4.93 | 17.51x slower |
+| Hyll Enhanced  | 1.20 | 71.87x slower |
+
+### Cardinality Estimation Performance
+
+| Implementation | Operations/sec | Relative Performance |
+|---------------------|---------------:|---------------------:|
+| Redis PFCOUNT       | 53,131 | 1.00x (fastest) |
+| Hyll Enhanced Stream| 24,412 | 2.18x slower |
+| Hyll Enhanced       | 8,843 | 6.01x slower |
+| Hyll Standard       | 8,538 | 6.22x slower |
+| Hyll MLE            | 5,645 | 9.41x slower |
+
+### Merge Performance
+
+| Implementation | Operations/sec | Relative Performance |
+|----------------|---------------:|---------------------:|
+| Redis PFMERGE  | 12,735 | 1.00x (fastest) |
+| Hyll Enhanced  | 6,523 | 1.95x slower |
+| Hyll Standard  | 2,932 | 4.34x slower |
+
+### Memory Usage
+
+| Implementation | Memory Usage |
+|------------------|-------------:|
+| Hyll Enhanced    | 0.28 KB |
+| Hyll Standard    | 18.30 KB |
+| Redis            | 12.56 KB |
+| Raw Elements     | 0.04 KB |
+
+### Accuracy Comparison
+
+| Implementation | Estimated Count | Actual Count | Error |
+|---------------------|----------------:|-------------:|---------:|
+| Redis               | 9,990 | 10,000 | 0.10% |
+| Hyll Enhanced       | 3,018 | 10,000 | 69.82% |
+| Hyll (High Prec)    | 19,016 | 10,000 | 90.16% |
+| Hyll Standard       | 32,348 | 10,000 | 223.48% |
+| Hyll Enhanced Stream| 8,891,659 | 10,000 | 88,816.59% |
+| Hyll MLE            | 19,986,513 | 10,000 | 199,765.13% |
+
+### Summary of Findings
+
+- **Insertion Performance**: Hyll Standard and Batch operations are significantly faster than Redis for adding elements.
+- **Cardinality Estimation**: Redis has the fastest cardinality estimation, with Hyll Enhanced Stream as a close second.
+- **Merge Operations**: Redis outperforms Hyll for merging HyperLogLog sketches, but Hyll Enhanced provides competitive performance.
+- **Memory Usage**: Hyll Enhanced offers the most memory-efficient implementation.
+- **Accuracy**: Redis provides the best accuracy in this test scenario.
+
+#### Recommendation
+
+For most use cases, Redis offers an excellent balance of accuracy and performance. However, Hyll provides superior insertion performance and memory efficiency, making it a good choice for scenarios where these attributes are prioritized.
+
+You can run these benchmarks yourself using the included script:
+
+```ruby
+ruby examples/redis_comparison_benchmark.rb
+```
+
 ## Features
 
 - Standard HyperLogLog implementation with customizable precision
@@ -293,6 +365,13 @@ For advanced usage scenarios, check out `examples/advance.rb` which includes:
 - Advanced serialization techniques
 - Precision vs. memory usage benchmarks
 
+## Example Use Cases
+
+Here is a quick illustration of how Hyll can be helpful in real-world scenarios:
+- Unique visitor counting: Track unique users visiting a website in real time. By adding each user's session ID or IP to the HyperLogLog, you get an approximate number of distinct users without storing everybody's data.
+- Log analytics: Continuously process large log files to calculate the volume of unique events, keeping memory usage low.
+- Marketing campaigns: Quickly gauge how many distinct customers participate in a campaign while merging data from multiple sources.
+
 ## Development
 
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
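The use cases above all follow the same pattern: hash each item into a fixed-size sketch, then read off an estimate. As a self-contained illustration of the unique-visitor case, here is a deliberately tiny toy HyperLogLog (this is *not* Hyll's implementation; `TinyHLL` and its constants are invented for this sketch):

```ruby
require "digest"

# Toy HyperLogLog, for illustration only: 2**precision registers, each
# storing the highest "rank" (position of the first 1-bit) observed.
class TinyHLL
  def initialize(precision = 12)
    @p = precision
    @m = 1 << precision
    @registers = Array.new(@m, 0)
  end

  def add(element)
    h = Digest::SHA1.hexdigest(element.to_s)[0, 16].to_i(16) # 64-bit hash
    idx = h & (@m - 1) # low p bits choose a register
    w = h >> @p        # remaining bits determine the rank
    rank = 1
    while w.even? && rank < 64 - @p
      rank += 1
      w >>= 1
    end
    @registers[idx] = rank if rank > @registers[idx]
  end

  def cardinality
    alpha = 0.7213 / (1 + 1.079 / @m)
    raw = alpha * @m * @m / @registers.sum { |r| 2.0**-r }
    zeros = @registers.count(0)
    # Linear-counting correction for small cardinalities
    raw <= 2.5 * @m && zeros.positive? ? @m * Math.log(@m.to_f / zeros) : raw
  end
end

hll = TinyHLL.new
5_000.times { |i| hll.add("session-#{i}") } # e.g. unique session IDs
estimate = hll.cardinality
puts estimate.round # close to 5000 (a few percent error)
```

The sketch uses about 4 KB of registers regardless of how many sessions are added, which is the point of the use cases listed above.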
data/examples/redis_comparison_benchmark.rb
ADDED

@@ -0,0 +1,539 @@
+# frozen_string_literal: true
+
+require "hyll"
+require "redis"
+require "benchmark"
+require "benchmark/ips"
+require "memory_profiler"
+require "optparse"
+require "json"
+
+class HyllRedisComparison
+  def initialize(precision: 10, data_size: 10_000, overlapping: 1_000, warm_up: true, benchmark_type: :all)
+    @precision = precision
+    @data_size = data_size
+    @overlapping = overlapping
+    @warm_up = warm_up
+    @benchmark_type = benchmark_type
+    @redis = Redis.new
+    @results = {}
+
+    puts "Initializing benchmark with:"
+    puts " - Precision: #{@precision}"
+    puts " - Data size: #{@data_size} elements"
+    puts " - Overlapping elements for merge tests: #{@overlapping}"
+    puts " - Warm-up: #{@warm_up ? "Enabled" : "Disabled"}"
+    puts "\n"
+
+    # Clean up any existing Redis keys
+    @redis.del("hll1", "hll2", "hll_merged", "redis_hll1", "redis_hll2", "redis_merged", "accuracy_test")
+
+    # Pre-generate data for the tests
+    @elements = (0...@data_size).map { |i| "element-#{i}" }.freeze
+    @set1_elements = (0...@data_size).map { |i| "set1-#{i}" }.freeze
+    @set2_elements = (0...@data_size).map { |i| "set2-#{i + @data_size - @overlapping}" }.freeze
+
+    # Pre-load Redis
+    if %i[all cardinality memory].include?(benchmark_type)
+      @redis.pipelined do |pipeline|
+        @elements.each { |e| pipeline.pfadd("hll1", e) }
+      end
+    end
+
+    if %i[all merge].include?(benchmark_type)
+      @redis.pipelined do |pipeline|
+        @set1_elements.each { |e| pipeline.pfadd("redis_hll1", e) }
+        @set2_elements.each { |e| pipeline.pfadd("redis_hll2", e) }
+      end
+    end
+
+    if %i[all cardinality].include?(benchmark_type)
+      @pre_hll_standard = Hyll::HyperLogLog.new(@precision)
+      @pre_hll_enhanced = Hyll::EnhancedHyperLogLog.new(@precision)
+
+      @elements.each do |e|
+        @pre_hll_standard.add(e)
+        @pre_hll_enhanced.add(e)
+      end
+    end
+
+    if %i[all merge].include?(benchmark_type)
+      @pre_merge_hll1 = Hyll::HyperLogLog.new(@precision)
+      @pre_merge_hll2 = Hyll::HyperLogLog.new(@precision)
+      @pre_merge_enhanced1 = Hyll::EnhancedHyperLogLog.new(@precision)
+      @pre_merge_enhanced2 = Hyll::EnhancedHyperLogLog.new(@precision)
+
+      @set1_elements.each do |e|
+        @pre_merge_hll1.add(e)
+        @pre_merge_enhanced1.add(e)
+      end
+
+      @set2_elements.each do |e|
+        @pre_merge_hll2.add(e)
+        @pre_merge_enhanced2.add(e)
+      end
+
+      @pre_merge_hll1_serialized = Marshal.dump(@pre_merge_hll1)
+      @pre_merge_enhanced1_serialized = Marshal.dump(@pre_merge_enhanced1)
+    end
+
+    warm_up_benchmarks if @warm_up
+  end
+
+  def run_benchmarks
+    case @benchmark_type
+    when :insertion
+      benchmark_insertion
+    when :cardinality
+      benchmark_cardinality
+    when :merge
+      benchmark_merge
+    when :memory
+      benchmark_memory_usage
+    when :accuracy
+      benchmark_accuracy
+    else
+      benchmark_insertion
+      benchmark_cardinality
+      benchmark_merge
+      benchmark_memory_usage
+      benchmark_accuracy
+    end
+
+    print_summary
+  end
+
+  def warm_up_benchmarks
+    puts "Performing warm-up..."
+    # Warm-up JIT compiler
+    warm_up_count = [@data_size / 10, 1000].min
+
+    # Warm-up insertion
+    hll_warmup = Hyll::HyperLogLog.new(@precision)
+    enhanced_warmup = Hyll::EnhancedHyperLogLog.new(@precision)
+
+    warm_up_count.times do |i|
+      hll_warmup.add("warmup-#{i}")
+      enhanced_warmup.add("warmup-#{i}")
+      @redis.pfadd("warmup_hll", "warmup-#{i}")
+    end
+
+    # Warm-up cardinality
+    10.times do
+      hll_warmup.cardinality
+      hll_warmup.mle_cardinality
+      enhanced_warmup.cardinality
+      enhanced_warmup.cardinality(use_streaming: true)
+      @redis.pfcount("warmup_hll")
+    end
+
+    # Warm-up merge
+    warm_up_hll1 = Hyll::HyperLogLog.new(@precision)
+    warm_up_hll2 = Hyll::HyperLogLog.new(@precision)
+    5.times do
+      warm_up_copy = Marshal.load(Marshal.dump(warm_up_hll1))
+      warm_up_copy.merge(warm_up_hll2)
+    end
+
+    @redis.del("warmup_hll")
+    puts "Warm-up complete.\n\n"
+  end
+
+  def benchmark_insertion
+    puts "=== Insertion Performance ==="
+    GC.start # Run GC now for more consistent results
+
+    results = Benchmark.ips do |x|
+      x.config(time: 2, warmup: 1)
+
+      # Hyll standard insertion
+      x.report("Hyll Standard") do
+        hll = Hyll::HyperLogLog.new(@precision)
+        @elements.each { |e| hll.add(e) }
+      end
+
+      # Hyll enhanced insertion
+      x.report("Hyll Enhanced") do
+        hll = Hyll::EnhancedHyperLogLog.new(@precision)
+        @elements.each { |e| hll.add(e) }
+      end
+
+      # Hyll batch insertion
+      x.report("Hyll Batch") do
+        Hyll::HyperLogLog.new(@precision).add_all(@elements)
+      end
+
+      # Redis insertion
+      x.report("Redis PFADD") do
+        @redis.del("bench_hll")
+        @elements.each { |e| @redis.pfadd("bench_hll", e) }
+      end
+
+      # Redis pipelined insertion
+      x.report("Redis Pipelined") do
+        @redis.del("bench_hll")
+        @redis.pipelined do |pipeline|
+          @elements.each { |e| pipeline.pfadd("bench_hll", e) }
+        end
+      end
+
+      x.compare!
+    end
+
+    @results[:insertion] = results
+    puts "\n"
+  end
+
+  def benchmark_cardinality
+    puts "=== Cardinality Estimation Performance ==="
+    GC.start
+
+    results = Benchmark.ips do |x|
+      x.config(time: 2, warmup: 1)
+
+      # Hyll standard cardinality
+      x.report("Hyll Standard") do
+        @pre_hll_standard.cardinality
+      end
+
+      # Hyll standard MLE
+      x.report("Hyll MLE") do
+        @pre_hll_standard.mle_cardinality
+      end
+
+      # Hyll enhanced cardinality
+      x.report("Hyll Enhanced") do
+        @pre_hll_enhanced.cardinality
+      end
+
+      # Hyll enhanced streaming
+      x.report("Hyll Enhanced Stream") do
+        @pre_hll_enhanced.cardinality(use_streaming: true)
+      end
+
+      # Redis cardinality
+      x.report("Redis PFCOUNT") do
+        @redis.pfcount("hll1")
+      end
+
+      x.compare!
+    end
+
+    @results[:cardinality] = results
+    puts "\n"
+  end
+
+  def benchmark_merge
+    puts "=== Merge Performance ==="
+    GC.start
+
+    results = Benchmark.ips do |x|
+      x.config(time: 2, warmup: 1)
+
+      # Hyll standard merge
+      x.report("Hyll Standard") do
+        hll_copy = Marshal.load(@pre_merge_hll1_serialized)
+        hll_copy.merge(@pre_merge_hll2)
+      end
+
+      # Hyll enhanced merge
+      x.report("Hyll Enhanced") do
+        enhanced_copy = Marshal.load(@pre_merge_enhanced1_serialized)
+        enhanced_copy.merge(@pre_merge_enhanced2)
+      end
+
+      # Redis merge
+      x.report("Redis PFMERGE") do
+        @redis.pfmerge("redis_merged", "redis_hll1", "redis_hll2")
+      end
+
+      x.compare!
+    end
+
+    @results[:merge] = results
+    puts "\n"
+  end
+
+  def benchmark_memory_usage
+    puts "=== Memory Usage ==="
+    GC.start
+
+    # Memory usage of standard HLL
+    hll_standard_memory = report_memory("Hyll Standard") do
+      hll = Hyll::HyperLogLog.new(@precision)
+      @elements.each { |e| hll.add(e) }
+      hll
+    end
+
+    # Memory usage of enhanced HLL
+    hll_enhanced_memory = report_memory("Hyll Enhanced") do
+      hll = Hyll::EnhancedHyperLogLog.new(@precision)
+      @elements.each { |e| hll.add(e) }
+      hll
+    end
+
+    # Memory usage of actual elements (for comparison)
+    raw_elements_memory = report_memory("Raw Elements Array") do
+      @elements.dup
+    end
+
+    # Redis memory usage
+    redis_memory = @redis.memory("USAGE", "hll1")
+    puts "Redis memory usage for HLL key: #{redis_memory} bytes"
+
+    # Compute compression ratios
+    puts "\nCompression ratios:"
+    puts " Hyll Standard: #{(raw_elements_memory[:allocated] / hll_standard_memory[:retained]).round(2)}x"
+    puts " Hyll Enhanced: #{(raw_elements_memory[:allocated] / hll_enhanced_memory[:retained]).round(2)}x"
+    puts " Redis: #{(raw_elements_memory[:allocated] * 1024 / redis_memory).round(2)}x"
+
+    @results[:memory] = {
+      hyll_standard: hll_standard_memory,
+      hyll_enhanced: hll_enhanced_memory,
+      raw_elements: raw_elements_memory,
+      redis: redis_memory
+    }
+
+    puts "\n"
+  end
+
+  def benchmark_accuracy
+    puts "=== Accuracy Comparison ==="
+    GC.start
+
+    accuracy_elements = (0...@data_size).map { |i| "accuracy-#{i}" }
+
+    # Hyll standard
+    hll_standard = Hyll::HyperLogLog.new(@precision)
+    hll_standard.add_all(accuracy_elements)
+
+    # Hyll enhanced
+    hll_enhanced = Hyll::EnhancedHyperLogLog.new(@precision)
+    hll_enhanced.add_all(accuracy_elements)
+
+    hll_standard_high = Hyll::HyperLogLog.new([@precision + 2, 16].min)
+    hll_standard_high.add_all(accuracy_elements)
+
+    # Redis
+    @redis.del("accuracy_test")
+    @redis.pipelined do |pipeline|
+      accuracy_elements.each { |e| pipeline.pfadd("accuracy_test", e) }
+    end
+
+    # Get estimates
+    hll_standard_est = hll_standard.cardinality
+    hll_standard_mle = hll_standard.mle_cardinality
+    hll_standard_high_est = hll_standard_high.cardinality
+    hll_enhanced_est = hll_enhanced.cardinality
+    hll_enhanced_stream = hll_enhanced.cardinality(use_streaming: true)
+    redis_est = @redis.pfcount("accuracy_test")
+
+    # Compute errors
+    standard_error = calculate_error("Hyll Standard", hll_standard_est, @data_size)
+    standard_mle_error = calculate_error("Hyll Standard MLE", hll_standard_mle, @data_size)
+    standard_high_error = calculate_error("Hyll Standard (High Precision)", hll_standard_high_est, @data_size)
+    enhanced_error = calculate_error("Hyll Enhanced", hll_enhanced_est, @data_size)
+    enhanced_stream_error = calculate_error("Hyll Enhanced Stream", hll_enhanced_stream, @data_size)
+    redis_error = calculate_error("Redis", redis_est, @data_size)
+
+    @results[:accuracy] = {
+      hyll_standard: standard_error,
+      hyll_standard_mle: standard_mle_error,
+      hyll_standard_high: standard_high_error,
+      hyll_enhanced: enhanced_error,
+      hyll_enhanced_stream: enhanced_stream_error,
+      redis: redis_error
+    }
+
+    # Error chart (ASCII art)
+    puts "\nError comparison (lower is better):"
+    print_error_bar("Hyll Standard", standard_error[:percent])
+    print_error_bar("Hyll MLE", standard_mle_error[:percent])
+    print_error_bar("Hyll (High Prec)", standard_high_error[:percent])
+    print_error_bar("Hyll Enhanced", enhanced_error[:percent])
+    print_error_bar("Hyll Enh Stream", enhanced_stream_error[:percent])
+    print_error_bar("Redis", redis_error[:percent])
+  end
+
+  def print_error_bar(label, error_pct)
+    display_error = [error_pct, 300].min
+    bars = (display_error / 5).to_i
+    truncated = display_error < error_pct
+
+    printf("%-18s |%-60s| %.2f%%%s\n",
+           label,
+           "#" * bars,
+           error_pct,
+           truncated ? " (truncated)" : "")
+  end
+
+  def print_summary
+    puts "\n=== BENCHMARK SUMMARY ==="
+    puts "Precision: #{@precision}, Data size: #{@data_size}"
+
+    puts "\nACCURACY WINNER: #{get_accuracy_winner}" if @results[:accuracy]
+
+    if @results[:insertion] && @results[:cardinality] && @results[:merge]
+      puts "\nPERFORMANCE WINNERS:"
+      puts " Insertion: #{get_insertion_winner}"
+      puts " Cardinality: #{get_cardinality_winner}"
+      puts " Merge: #{get_merge_winner}"
+    end
+
+    puts "\nMEMORY USAGE WINNER: #{get_memory_winner}" if @results[:memory]
+
+    puts "\nRECOMMENDATION:"
+    puts generate_recommendation
+  end
+
+  def get_accuracy_winner
+    errors = @results[:accuracy].transform_values { |v| v[:percent] }
+    winner = errors.min_by { |_, v| v }
+    "#{winner[0].to_s.split("_").map(&:capitalize).join(" ")} (#{winner[1].round(2)}% error)"
+  end
+
+  def get_insertion_winner
+    @results[:insertion].entries.max_by(&:ips).label
+  end
+
+  def get_cardinality_winner
+    @results[:cardinality].entries.max_by(&:ips).label
+  end
+
+  def get_merge_winner
+    @results[:merge].entries.max_by(&:ips).label
+  end
+
+  def get_memory_winner
+    memories = {
+      hyll_standard: @results[:memory][:hyll_standard][:retained],
+      hyll_enhanced: @results[:memory][:hyll_enhanced][:retained],
+      redis: @results[:memory][:redis]
+    }
+
+    winner = memories.min_by { |_, v| v }
+    "#{winner[0].to_s.split("_").map(&:capitalize).join(" ")} (#{winner[1] / 1024.0} KB)"
+  end
+
+  def generate_recommendation
+    return "Run accuracy benchmark to generate recommendation" unless @results[:accuracy]
+
+    errors = @results[:accuracy].transform_values { |v| v[:percent] }
+
+    if errors[:redis] < 5.0
+      "Redis offers excellent accuracy and good performance, recommended for most use cases."
+    elsif errors[:hyll_standard] < errors[:hyll_enhanced] && errors[:hyll_standard] < 15.0
+      "Hyll Standard with precision #{@precision} offers good accuracy and best insertion performance."
+    elsif errors[:hyll_enhanced] < 15.0
+      "Hyll Enhanced offers better accuracy than Standard and good overall performance."
+    else
+      "Consider using higher precision (#{[@precision + 2, 16].min}) for better accuracy."
+    end
+  end
+
+  def export_results(filename)
+    File.write(filename, JSON.pretty_generate(@results))
+    puts "Results exported to #{filename}"
+  end
+
+  private
+
+  def report_memory(label)
+    GC.start # Force GC before measurement
+    result = nil
+    report = MemoryProfiler.report do
+      result = yield
+    end
+
+    allocated = report.total_allocated_memsize / 1024.0
+    retained = report.total_retained_memsize / 1024.0
+
+    puts "#{label}:"
+    puts " Total allocated: #{allocated.round(2)} KB"
+    puts " Total retained: #{retained.round(2)} KB"
+
+    # Return memory stats
+    { allocated: allocated, retained: retained, result: result }
+  end
+
+  def calculate_error(label, estimate, actual)
+    error_pct = ((estimate - actual).abs / actual.to_f) * 100
+    result = {
+      estimate: estimate.round,
+      actual: actual,
+      difference: (estimate - actual).round,
+      percent: error_pct.round(2)
+    }
+
+    puts "#{label}: Estimated #{result[:estimate]} vs Actual #{actual} (Error: #{result[:percent]}%)"
+    result
+  end
+end
+
+# Parse command line options
+options = {
+  precision: 10,
+  data_size: 10_000,
+  overlapping: 1_000,
+  warm_up: true,
+  benchmark_type: :all,
+  output_file: nil
+}
+
+OptionParser.new do |opts|
+  opts.banner = "Usage: ruby redis_comparison_benchmark.rb [options]"
+
+  opts.on("-p", "--precision PRECISION", Integer, "HyperLogLog precision (4-16)") do |p|
+    options[:precision] = p
+  end
+
+  opts.on("-d", "--data-size SIZE", Integer, "Number of elements to add") do |d|
+    options[:data_size] = d
+  end
+
+  opts.on("-o", "--overlapping SIZE", Integer, "Number of overlapping elements for merge tests") do |o|
+    options[:overlapping] = o
+  end
+
+  opts.on("--no-warm-up", "Skip warm-up phase") do
+    options[:warm_up] = false
+  end
+
+  opts.on("-b", "--benchmark TYPE", %i[all insertion cardinality merge memory accuracy],
+          "Run specific benchmark type (all, insertion, cardinality, merge, memory, accuracy)") do |b|
+    options[:benchmark_type] = b
+  end
+
+  opts.on("--output FILE", "Export results to JSON file") do |f|
+    options[:output_file] = f
+  end
+
+  opts.on("-h", "--help", "Show this help message") do
+    puts opts
+    exit
+  end
+end.parse!
+
+# Run benchmarks
+puts "Starting HyperLogLog Comparison Benchmark: Hyll vs Redis"
+puts "-----------------------------------------------------"
+
+begin
+  comparison = HyllRedisComparison.new(
+    precision: options[:precision],
+    data_size: options[:data_size],
+    overlapping: options[:overlapping],
+    warm_up: options[:warm_up],
+    benchmark_type: options[:benchmark_type]
+  )
+
+  comparison.run_benchmarks
+
+  comparison.export_results(options[:output_file]) if options[:output_file]
+rescue Redis::CannotConnectError
+  puts "ERROR: Cannot connect to Redis server."
+  puts "Please ensure Redis is running locally on the default port (6379)."
+  puts "You can start Redis with: redis-server"
+end
+
+puts "Benchmark complete!"
data/lib/hyll/algorithms/enhanced_hyperloglog.rb
CHANGED

@@ -81,9 +81,10 @@ module Hyll
     # @return [Float] the estimated variance
     def streaming_variance
       # If no modifications, return 0
-      return 0.0 if @
+      return 0.0 if @last_modification_probability.nil?
 
-      #
+      # Calculate variance based on martingale properties
+      # This provides an unbiased estimate of the variance
       @quadratic_variation
     end
 
@@ -91,19 +92,25 @@ module Hyll
     # @param confidence [Float] confidence level (default: 0.95)
     # @return [Array<Float>] lower and upper bounds
    def streaming_error_bounds(confidence = 0.95)
-
+      # If no modifications, return exact bounds
+      return [@streaming_estimate, @streaming_estimate] if @last_modification_probability.nil?
 
-      #
+      # Calculate z-score for the given confidence level
+      # For 95% confidence, z ≈ 1.96
       z = case confidence
-          when 0.
+          when 0.90 then 1.645
           when 0.95 then 1.96
           when 0.99 then 2.576
-          else
+          else
+            # Calculate using inverse error function for any confidence level
+            Math.sqrt(2) * Math.erfc(2 * (1 - confidence))
           end
 
-
+      # Calculate standard error
+      std_error = Math.sqrt(streaming_variance)
 
-
+      # Return confidence interval
+      [@streaming_estimate - z * std_error, @streaming_estimate + z * std_error]
     end
 
     # Update register value directly (no compression in EnhancedHyperLogLog)
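The bounds added in the hunk above are a standard normal-approximation confidence interval: estimate ± z·√variance. A self-contained sketch of that arithmetic (illustrative values, not Hyll's API):

```ruby
# Normal-approximation confidence interval, as in streaming_error_bounds:
# pick z for the confidence level, scale by the standard error.
Z_SCORES = { 0.90 => 1.645, 0.95 => 1.96, 0.99 => 2.576 }.freeze

def error_bounds(estimate, variance, confidence = 0.95)
  z = Z_SCORES.fetch(confidence)
  std_error = Math.sqrt(variance)
  [estimate - z * std_error, estimate + z * std_error]
end

lo, hi = error_bounds(10_000.0, 40_000.0, 0.95) # standard error = 200
puts [lo, hi].inspect # roughly [9608, 10392]
```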
data/lib/hyll/algorithms/hyperloglog.rb
CHANGED

@@ -280,17 +280,27 @@ module Hyll
       non_zero_registers_count / weighted_sum * Math.log(1 + weighted_sum / zero_registers_count)
     end
 
+    # Return early for edge cases to avoid numerical instability
+    return initial_estimate * @m if initial_estimate.zero? || initial_estimate.nan? || initial_estimate.infinite?
+
     # Precision parameter
     epsilon = 0.01
     delta = epsilon / Math.sqrt(@m)
 
-    #
+    # Memoize h_values calculation to avoid redundant computation
+    h_values_cache = {}
+
+    # Secant method iteration - limit max iterations to prevent infinite loops
     delta_x = initial_estimate
     g_prev = 0
+    max_iterations = 100
+    iterations = 0
+
+    while delta_x > initial_estimate * delta && iterations < max_iterations
+      iterations += 1
 
-
-
-      h_values = calculate_h_values(initial_estimate, min_value, max_value)
+      # Calculate h(x) efficiently with memoization
+      h_values = h_values_cache[initial_estimate] ||= calculate_h_values(initial_estimate, min_value, max_value)
 
       # Calculate the function value
       g = 0.0
@@ -299,12 +309,14 @@ module Hyll
       end
       g += initial_estimate * (weighted_sum + zero_registers_count)
 
-      # Update the estimate using secant method
-
-
-
-
-
+      # Update the estimate using secant method with safeguards
+      if g > g_prev && non_zero_registers_count >= g && (g - g_prev).abs > Float::EPSILON
+        delta_x = delta_x * (non_zero_registers_count - g) / (g - g_prev)
+        # Add safeguard against too large steps
+        delta_x = [delta_x, initial_estimate].min
+      else
+        delta_x = 0
+      end
 
       initial_estimate += delta_x
       g_prev = g
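The hunk above bounds the MLE solver with `max_iterations` and guards the secant update's denominator. The same two safeguards can be shown on a generic secant root-finder, independent of Hyll's internals (a sketch, not the gem's code):

```ruby
# Secant root-finding with the two safeguards the diff adds:
# an iteration cap and a guard against a near-zero denominator.
def secant_root(f, x0, x1, max_iterations: 100, tol: 1e-12)
  max_iterations.times do
    f0 = f.call(x0)
    f1 = f.call(x1)
    denom = f1 - f0
    break if denom.abs < Float::EPSILON # avoid division by (almost) zero
    x0, x1 = x1, x1 - f1 * (x1 - x0) / denom
    break if (x1 - x0).abs < tol
  end
  x1
end

root = secant_root(->(x) { x * x - 2 }, 1.0, 2.0)
puts root # ≈ 1.4142135623730951 (the square root of 2)
```

Without the iteration cap, a flat or oscillating objective could loop forever; without the denominator guard, two equal function values would divide by zero — exactly the failure modes the changelog mentions.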
data/lib/hyll/utils/math.rb
CHANGED

@@ -67,6 +67,10 @@ module Hyll
     # @param k_max [Integer] maximum k
     # @return [Array<Float>] array of h(x/2^k) values
     def calculate_h_values(x, k_min, k_max)
+      # Guard against invalid inputs
+      return [] if k_min > k_max
+      return [0.0] * (k_max - k_min + 1) if x.zero? || x.nan? || x.infinite?
+
       # Determine the smallest power of 2 denominator for which we need h(x)
       power = k_max
 
@@ -79,11 +83,17 @@ module Hyll
       # For small arguments, use more accurate formula (simpler approximation)
       h = if x_prime <= 0.1
             # For very small values, h(x) ≈ x/2
-            # This formula ensures we get consistent value across different inputs and powers
             x_prime / 2.0
           elsif x_prime <= 0.5
             # Use more accurate Taylor series for small-to-medium values
-
+            taylor_sum = x_prime / 2.0
+            term = x_prime * x_prime
+            taylor_sum -= term / 12.0
+            term *= x_prime * x_prime
+            taylor_sum += term / 720.0
+            term *= x_prime * x_prime
+            taylor_sum -= term / 30_240.0
+            taylor_sum
           else
             # For larger values, directly compute
             1.0 - ::Math.exp(-x_prime)
 
@@ -95,7 +105,13 @@ module Hyll
       # Calculate subsequent h values using recurrence relation
       1.upto(k_max - k_min) do |i|
         x_prime *= 2.0 # Double x_prime
-
+        denominator = x_prime + (1.0 - h)
+        # Avoid division by zero
+        h = if denominator.abs < Float::EPSILON
+              h_values[i - 1] # Use previous value if unstable
+            else
+              (x_prime + h * (1.0 - h)) / denominator
+            end
         h_values[i] = h
       end
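The truncated series in the hunk above, x/2 − x²/12 + x⁴/720 − x⁶/30240, has exactly the leading Bernoulli-number coefficients of 1 − x/(eˣ − 1), which is why it is so accurate on the small-to-medium range (the next omitted term is x⁸/1209600). A standalone numeric sanity check of the coefficients, independent of the gem:

```ruby
# Compare the diff's truncated Taylor series against the closed form
# 1 - x/(e^x - 1); on x <= 0.5 the truncation error stays below ~1e-8.
def h_taylor(x)
  (x / 2.0) - (x**2 / 12.0) + (x**4 / 720.0) - (x**6 / 30_240.0)
end

def h_closed(x)
  1.0 - x / (Math.exp(x) - 1.0)
end

x = 0.3
puts (h_taylor(x) - h_closed(x)).abs < 1e-8 # => true
```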
data/lib/hyll/version.rb
CHANGED
metadata
CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: hyll
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Davide Santangelo
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-03-
+date: 2025-03-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: digest
@@ -44,6 +44,7 @@ files:
 - Rakefile
 - examples/advance.rb
 - examples/basic.rb
+- examples/redis_comparison_benchmark.rb
 - lib/hyll.rb
 - lib/hyll/algorithms/enhanced_hyperloglog.rb
 - lib/hyll/algorithms/hyperloglog.rb