hyperloglog-redis 1.0.0 → 2.0.0
- data/Gemfile +1 -0
- data/Gemfile.lock +2 -0
- data/HISTORY.md +25 -6
- data/README.md +124 -25
- data/VERSION +1 -1
- data/hyperloglog-redis.gemspec +10 -4
- data/lib/algorithm.rb +69 -0
- data/lib/counter.rb +40 -0
- data/lib/hyperloglog-redis.rb +3 -1
- data/lib/time_series_counter.rb +80 -0
- data/spec/hyper_log_log_spec.rb +199 -197
- data/spec/time_series_counter_spec.rb +216 -0
- metadata +28 -14
- data/lib/hyper_log_log.rb +0 -96
data/Gemfile
CHANGED
data/Gemfile.lock
CHANGED
data/HISTORY.md
CHANGED
@@ -1,3 +1,22 @@
+## 2.0.0 (11/30/2012)
+
+* Changed the underlying storage from Redis hashes to bitstrings [simonkro](https://github.com/simonkro)
+  If you have existing counters stored from version 1.0.0, you can upgrade them with
+  the following method:
+
+      def upgrade_1_2(counter, redis)
+        return if redis.type(counter) == "string"
+        sketch = redis.hgetall(counter)
+        redis.del(counter)
+        sketch.each{ |key, value| redis.setrange(counter, key.to_i, value.to_i.chr) }
+      end
+
+* Moved main counter implementation from `HyperLogLog` to the class `HyperLogLog::Counter`
+
+* Added `HyperLogLog::TimeSeriesCounter`, a counter type that can estimate cardinalities
+  for all events from a particular point in the past until the present.
+
+
 ## 1.0.0 (10/26/2012)
 
 * Changed the underlying storage from Redis sorted sets to Redis hashes. This
@@ -5,12 +24,12 @@
   versions of this library, you can upgrade them with something like the
   following method:
 
-
-
-
-
-
-
+      def upgrade(counter, redis)
+        return if redis.type(counter) == "hash"
+        values = redis.zrange(counter, 0, -1, {withscores: true})
+        redis.del(counter)
+        values.each { |key, value| redis.hset(counter, key, value.to_i) }
+      end
 
 * Added union_store command, which stores the results of a union for querying
   or combining with other sets later
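The changelog's `upgrade_1_2` helper operates on one counter key at a time. A hedged sketch of how it might be driven across a whole keyspace, assuming (purely for illustration) that counters live under a `counters:*` key pattern and that `upgrade_1_2` as defined above is in scope:

    require 'redis'

    redis = Redis.new
    # KEYS is fine for a one-off migration; the 'counters:*' pattern is a
    # hypothetical naming convention, not something the gem prescribes.
    redis.keys('counters:*').each do |counter|
      upgrade_1_2(counter, redis)  # skips counters already stored as strings
    end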
data/README.md
CHANGED
@@ -1,15 +1,14 @@
 hyperloglog-redis
 =================
 
-This gem is
+This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating
 cardinalities of sets observed via a stream of events. A [Redis](http://redis.io)
-instance is used for storing the counters. A
+instance is used for storing the counters. A minimal example:
 
     require 'redis'
     require 'hyperloglog-redis'
 
-
-    counter = HyperLogLog.new(redis)
+    counter = HyperLogLog::Counter.new(Redis.new)
     ['john', 'paul', 'george', 'ringo', 'john', 'paul'].each do |beatle|
       counter.add('beatles', beatle)
     end
@@ -18,22 +17,26 @@ instance is used for storing the counters. A simple example:
 
 Each HyperLogLog counter uses a small, fixed amount of space but can
 estimate the cardinality of any set of up to around a billion values with
-relative error of
-passed to the HyperLogLog initializer that defaults to 10.
-each counter is represented by a
-
-
-
-
-The basic idea of HyperLogLog (and its predecessors PCSA and
-a good hash function to each value
-run of zeros
-function is good,
-seeing a value that starts with exactly X zeros should happen with probability
-2 ** -(X + 1). So if you've seen a run of 5 zeros in one of your hash values,
+relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a
+parameter passed to the `HyperLogLog::Counter` initializer that defaults to 10.
+With b = 10, each counter is represented by a 1 KB string in Redis and we get
+an expected relative error of 3%. Contrast this with the amount of space needed
+to compute set cardinality exactly, which is over 100 MB for even a bit vector
+representing a set with a billion values.
+
+The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is
+to apply a good hash function to each value observed in the stream and record the longest
+run of zeros seen as a prefix of any hashed value. If the hash
+function is good, the bits in any hashed value should be close to statistically independent,
+so seeing a value that starts with exactly X zeros should happen with probability close to
+2 ** -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values,
 you're likely to have around 2 ** 6 = 64 values in the underlying set. The actual
 implementation and analysis are much more advanced than this, but that's the idea.
 
+This gem implements a few useful extensions to the basic HyperLogLog algorithm
+which allow you to estimate unions and intersections of counters as well as
+counts within specific time ranges. These extensions are described in detail below.
+
 The HyperLogLog algorithm is described and analyzed in the paper
 ["HyperLogLog: the analysis of a near-optimal cardinality estimation
 algorithm"](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
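To make the leading-zero intuition above concrete, here is a small standalone sketch (not part of the gem) using the same MurmurHash3 function the library depends on; `b = 10` and the value `'john'` are arbitrary choices:

    require 'murmurhash3'

    b = 10
    m = 2 ** b                  # number of registers
    bits_in_hash = 32 - b

    hash = MurmurHash3::V32.murmur3_32_str_hash('john')
    register = hash % m         # low b bits pick a register
    rest     = hash / m         # remaining (32 - b) bits feed the zero-run count

    # rho = 1 + the number of leading zeros in the (32 - b)-bit remainder, so a
    # remainder starting with exactly X zeros gives rho = X + 1, which happens
    # with probability about 2 ** -(X + 1) for a good hash.
    rho = rest.zero? ? bits_in_hash + 1 : bits_in_hash - Math.log(rest, 2).floor

    puts "value 'john' lands in register #{register} with rho = #{rho}"

Each counter then simply keeps, per register, the largest rho seen so far, and the published estimator combines those maxima.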
@@ -49,31 +52,127 @@ You can also ask for an estimate of the union from multiple counters:
       counter.add('wings', wings_member)
     end
 
-    puts "There are approximately #{counter.union('beatles', 'wings')} people who were in the Beatles or Wings"
+    puts "There are approximately #{counter.union(['beatles', 'wings'])} people who were in the Beatles or Wings"
 
 The same relative error guarantee above applies to unions: a union of
-size N can be estimated to within N * (1.04 / Math.sqrt(2 ** b)) elements,
+size N can be estimated to within +/- N * (1.04 / Math.sqrt(2 ** b)) elements,
 regardless of how many HyperLogLog counters that union spans. You can store
 a unioned counter for querying or combining later with `union_store`:
 
-    counter.union_store('all_beatles_and_wings_members', 'beatles', 'wings')
+    counter.union_store('all_beatles_and_wings_members', ['beatles', 'wings'])
 
     puts "There are approximately #{counter.count('all_beatles_and_wings_members')} people who were in the Beatles or Wings"
 
 Intersections can also be estimated:
 
-    puts "There are approximately #{counter.intersection('beatles', 'wings')} people who were in both the Beatles and Wings"
+    puts "There are approximately #{counter.intersection(['beatles', 'wings'])} people who were in both the Beatles and Wings"
 
 However, intersections of HyperLogLog counters are calculated indirectly via the
 [inclusion/exclusion principle](http://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle)
 as a sum of unions and there aren't good theoretical bounds on the error of that sum. In
 practice, the estimates that come out of small intersections tend to follow the
-same relative error patterns, but beware using this type of estimation on
-
+same relative error patterns, but beware using this type of estimation on intersections
+of large numbers of sets, both because the errors can be much larger than those guaranteed
 for unions and the complexity of computing intersections grows exponentially with
-the number of
+the number of sets in the intersection.
+
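For two sets, the inclusion/exclusion computation the library performs reduces to |A ∩ B| = |A| + |B| - |A ∪ B|. A hedged sketch, reusing the 'beatles' and 'wings' counters from the examples above, that spells the identity out by hand:

    # Two-set inclusion/exclusion, written out explicitly. The library's
    # intersection method computes this same alternating sum (clamped at zero),
    # so the two estimates below should agree.
    by_hand = counter.count('beatles') + counter.count('wings') -
              counter.union(['beatles', 'wings'])

    puts "by hand:  #{by_hand}"
    puts "built in: #{counter.intersection(['beatles', 'wings'])}"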
+Set cardinality within a time interval
+======================================
+
+All examples up until now use `HyperLogLog::Counter`, which stores HyperLogLog
+counters as (2 ** b)-byte Redis strings. hyperloglog-redis also contains the counter implementation
+`HyperLogLog::TimeSeriesCounter`, which uses a little more space (Redis strings of up to
+4 * (32 - b) * (2 ** b) bytes) but allows you to estimate the cardinality of sets during
+certain time windows.
+
+Using `HyperLogLog::TimeSeriesCounter`, you can get estimates of the number of distinct
+elements added to a set in the past X seconds, for any value of X. A `HyperLogLog::TimeSeriesCounter`
+is initialized with the same arguments as a regular `Counter` but implements a
+superset of `HyperLogLog::Counter`'s interface. Namely, each of the methods `add`,
+`count`, `union`, `intersection`, and `union_store` takes an optional final time argument,
+either a Ruby `Time` or an integer representing seconds since the epoch.
+
+When passed a time argument t, `add` registers an addition to the set at time t. When no
+time is passed, the current system time is used. The methods `count`, `union`,
+`intersection`, and `union_store` all estimate set cardinality for the time interval
+consisting of all events that happened after time t when t is passed as a final argument.
+
+For example, to get the number of distinct user logins within the
+past week, we might call:
+
+    one_week = 60 * 60 * 24 * 7
+    logins_in_past_week = counter.count('user_logins', Time.now - one_week)
+
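Since `add` also accepts an explicit timestamp, past events can be backfilled and then queried over arbitrary windows. A hedged sketch; the key name, the `events` array, and the assumption that user ids are strings are all illustrative:

    counter = HyperLogLog::TimeSeriesCounter.new(Redis.new)

    # Backfill: record each login at the time it actually happened.
    # `events` is assumed to be an array of [user_id_string, Time] pairs.
    events.each do |user_id, logged_in_at|
      counter.add('user_logins', user_id, logged_in_at)
    end

    one_day = 60 * 60 * 24
    puts "distinct users, last day:  #{counter.count('user_logins', Time.now - one_day)}"
    puts "distinct users, last week: #{counter.count('user_logins', Time.now - 7 * one_day)}"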
+A note about relative errors
+============================
+
+With a parameter `b` in the range [4..16], HyperLogLog counters provide a relative
+error of 1.04/sqrt(2 ** b) with high probability. When unions, intersections, and
+time range queries are involved, it's sometimes not clear what the relative error
+is relative to, so here is some clarification:
+
+* For a union of counters, the relative error applies to the size of the union. Taking
+  the union of counters is lossless in the sense that you end up with the same counter
+  you would have arrived at had you observed the union of all of the individual events.
+
+* For an intersection of counters, there's no good theoretical bound on the relative
+  error. In practice, and especially for intersections involving a small number of sets,
+  the relative error you obtain tends to be in relation to the size of the union of the
+  sets involved. For example, if you have two sets, each of cardinality 5000, and observe
+  both sets through HyperLogLog counters with parameter b=10 (3% relative error), you can
+  expect the intersection estimate to be within 10000 * 0.03 = 300 of the actual intersection
+  size.
+
+* For time queries, the relative error applies to the size of the set within the time
+  range you've queried. For example, given a set of cardinality 1,000,000 that has had
+  100 distinct additions within the last 10 minutes, if you observe such a set with a
+  HyperLogLog counter with parameter b=10 (3% relative error), you can expect the count
+  returned from a query about the last 10 minutes to be within 3 of 100.
+
+Comparison to other approaches
+==============================
+
+When trying to optimize for space, two well-known alternatives to HyperLogLog exist:
+
+* Bit vectors: you provide some near-perfect hash function between keys in your domain
+  and an interval of integers, then represent that interval of integers with bits.
+* Bloom filters with counters: use a [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter)
+  to keep track of items seen; on insert, when the Bloom filter tells you that the item
+  seen is not in the set, increment the counter.
+
+Both bit vectors and Bloom filters can be augmented to hold timestamps for entries in the
+data structures and simulate counters for time ranges like `HyperLogLog::TimeSeriesCounter`.
+
+Bit vectors give exact counts, but the space complexity is linear with the size of
+the set, and you must either allocate a large bit vector upfront or cope with the complexity
+of dynamically resizing your bit vector as the set grows. Providing a manual mapping from
+members of your set to an interval of integers is sometimes a non-trivial task. Counts,
+unions, and intersections are all linear-time operations in the size of the universe of
+the set being represented.
+
+Bloom filters can be much more compact than bit vectors, but the actual count associated
+with a Bloom filter is an artifact of the construction of the data structure, so the cost
+of estimating a union or intersection is linear in the size of the Bloom filter. Getting
+high probability guarantees on the quality of the estimate of Bloom filter counts requires
+several "good" hash functions that have some degree of independence from each other; in
+practice, coming up with several independent implementations of good hash functions is
+difficult. Bloom filters require that all of their space be allocated upfront (re-hashing
+isn't possible without replaying all events), so in practice you need some estimate of
+how large the counters are going to be before allocating the counter.
+
+HyperLogLog counters take up less space than either of the above approaches and provide
+constant-time implementations (in the size of the sets being represented) of unions,
+intersections, and time range queries. A `HyperLogLog::Counter` with parameter b will
+be stored in a Redis string of length at most 2 ** b bytes, whereas a `HyperLogLog::TimeSeriesCounter` with parameter
+b will be stored in a Redis string of length at most 4 * (32 - b) * (2 ** b) bytes. For counters representing smaller sets,
+the size taken up by a `HyperLogLog::TimeSeriesCounter` can be significantly less. Here
+are some examples for specific values of b:
+
+* With b = 7, a `HyperLogLog::Counter` uses at most 128 bytes and a `HyperLogLog::TimeSeriesCounter` uses at most 13 KB while providing a relative error of 9%.
+* With b = 11, a `HyperLogLog::Counter` uses at most 2 KB and a `HyperLogLog::TimeSeriesCounter` uses at most 168 KB while providing a relative error of 2%.
+* With b = 16, a `HyperLogLog::Counter` uses at most 64 KB and a `HyperLogLog::TimeSeriesCounter` uses at most 4 MB while providing a relative error of less than half a percent.
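The figures in the list above follow directly from the size and error formulas already given; a short sketch (not part of the gem) that reproduces them for any b in [4..16]:

    def counter_footprint(b)
      m = 2 ** b
      { relative_error:        1.04 / Math.sqrt(m),
        counter_bytes:         m,                    # HyperLogLog::Counter, exact maximum
        time_series_max_bytes: 4 * (32 - b) * m }    # HyperLogLog::TimeSeriesCounter, worst case
    end

    [7, 11, 16].each { |b| p counter_footprint(b) }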
 
 Installation
 ============
 
-gem install hyperloglog-redis
+    gem install hyperloglog-redis
data/VERSION
CHANGED
@@ -1 +1 @@
-1.0.0
+2.0.0
data/hyperloglog-redis.gemspec
CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = "hyperloglog-redis"
-  s.version = "1.0.0"
+  s.version = "2.0.0"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Aaron Windsor"]
-  s.date = "2012-
+  s.date = "2012-11-30"
   s.description = "An implementation of the HyperLogLog set cardinality estimation algorithm in Ruby using Redis as a back-end"
   s.email = "aaron.windsor@gmail.com"
   s.extra_rdoc_files = [
@@ -27,10 +27,13 @@ Gem::Specification.new do |s|
     "Rakefile",
     "VERSION",
     "hyperloglog-redis.gemspec",
-    "lib/hyper_log_log.rb",
+    "lib/algorithm.rb",
+    "lib/counter.rb",
     "lib/hyperloglog-redis.rb",
+    "lib/time_series_counter.rb",
     "spec/hyper_log_log_spec.rb",
-    "spec/spec_helper.rb"
+    "spec/spec_helper.rb",
+    "spec/time_series_counter_spec.rb"
   ]
   s.homepage = "http://github.com/aaw/hyperloglog-redis"
   s.licenses = ["MIT"]
@@ -47,12 +50,14 @@ Gem::Specification.new do |s|
       s.add_development_dependency(%q<jeweler>, ["~> 1.8.4"])
       s.add_development_dependency(%q<rake>, ["~> 0.9.2.2"])
       s.add_development_dependency(%q<rspec>, ["~> 2.11.0"])
+      s.add_development_dependency(%q<timecop>, ["~> 0.5.3"])
     else
       s.add_dependency(%q<murmurhash3>, ["~> 0.1.3"])
       s.add_dependency(%q<redis>, ["~> 3.0.1"])
      s.add_dependency(%q<jeweler>, ["~> 1.8.4"])
       s.add_dependency(%q<rake>, ["~> 0.9.2.2"])
       s.add_dependency(%q<rspec>, ["~> 2.11.0"])
+      s.add_dependency(%q<timecop>, ["~> 0.5.3"])
     end
   else
     s.add_dependency(%q<murmurhash3>, ["~> 0.1.3"])
@@ -60,6 +65,7 @@ Gem::Specification.new do |s|
     s.add_dependency(%q<jeweler>, ["~> 1.8.4"])
     s.add_dependency(%q<rake>, ["~> 0.9.2.2"])
     s.add_dependency(%q<rspec>, ["~> 2.11.0"])
+    s.add_dependency(%q<timecop>, ["~> 0.5.3"])
   end
 end
 
data/lib/algorithm.rb
ADDED
@@ -0,0 +1,69 @@
+require 'redis'
+require 'murmurhash3'
+
+module HyperLogLog
+  module Algorithm
+
+    def initialize(redis, b=10)
+      raise "Accuracy not supported. Please choose a value of b between 4 and 16" if b < 4 || b > 16
+      @redis = redis
+      @bits_in_hash = 32 - b
+      @m = (2 ** b).to_i
+      if @m == 16
+        @alpha = 0.673
+      elsif @m == 32
+        @alpha = 0.697
+      elsif @m == 64
+        @alpha = 0.709
+      else
+        @alpha = 0.7213/(1 + 1.079/@m)
+      end
+    end
+
+    # Estimate the cardinality of the intersection of several sets. We do this by
+    # using the principle of inclusion and exclusion to represent the size of the
+    # intersection as the alternating sum of an exponential number of
+    # cardinalities of unions of smaller sets.
+    def intersection(counter_names, time=0)
+      icount = (1..counter_names.length).map do |k|
+        counter_names.combination(k).map do |group|
+          ((k % 2 == 0) ? -1 : 1) * union_helper(group, time)
+        end.inject(0, :+)
+      end.inject(0, :+)
+      [icount, 0].max
+    end
+
+    private
+
+    def hash_info(value)
+      hash = MurmurHash3::V32.murmur3_32_str_hash(value)
+      [hash, hash % @m, rho(hash / @m)]
+    end
+
+    def union_helper(counter_names, time=0)
+      all_estimates = raw_union(counter_names, time).select{ |i| i > 0 }
+      estimate_sum = all_estimates.reduce(0.0){ |a, score| a + 2.0 ** -score }
+      estimate = @alpha * @m * @m / (estimate_sum + @m - all_estimates.length)
+      if estimate <= 2.5 * @m
+        if all_estimates.length == @m
+          estimate.round
+        else # Correction for small sets
+          (@m * Math.log(Float(@m)/(@m - all_estimates.length))).round
+        end
+      elsif estimate <= 2 ** 32 / 30.0
+        estimate.round
+      else # Correction for large sets
+        (-2**32 * Math.log(1 - estimate/(2.0**32))).round
+      end
+    end
+
+    # rho(i) is the position of the first 1 in the binary representation of i,
+    # reading from most significant to least significant bits. Some examples:
+    # rho(1...) = 1, rho(001...) = 3, rho(000...0) = @bits_in_hash + 1
+    def rho(i)
+      return @bits_in_hash + 1 if i == 0
+      @bits_in_hash - Math.log(i, 2).floor
+    end
+
+  end
+end
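The estimator buried in `union_helper` can be read in isolation. Below is a standalone sketch (not part of the gem, and omitting the large-range correction) that applies the same bias constant, harmonic-style sum, and small-set correction to a plain array of register values:

    # `registers` plays the role of raw_union's output: one rho value per
    # register, with 0 meaning the register has never been touched.
    def estimate_cardinality(registers, b = 10)
      m = 2 ** b
      alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m > 64
      nonzero = registers.select { |r| r > 0 }
      sum = nonzero.reduce(0.0) { |a, r| a + 2.0 ** -r }
      raw = alpha * m * m / (sum + m - nonzero.length)
      if raw <= 2.5 * m && nonzero.length < m    # linear counting for small sets
        (m * Math.log(Float(m) / (m - nonzero.length))).round
      else
        raw.round
      end
    end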
data/lib/counter.rb
ADDED
@@ -0,0 +1,40 @@
+module HyperLogLog
+  class Counter
+    include Algorithm
+
+    # This is the implementation of the standard HyperLogLog algorithm, storing
+    # counts in each byte of a string of length 2 ** b.
+
+    def add(counter_name, value)
+      hash, function_name, new_value = hash_info(value)
+      existing_value = @redis.getrange(counter_name, function_name, function_name).unpack('C').first.to_i
+      @redis.setrange(counter_name, function_name, new_value.chr) if new_value > existing_value
+    end
+
+    # Estimate the cardinality of a single set
+    def count(counter_name)
+      union_helper([counter_name])
+    end
+
+    # Estimate the cardinality of the union of several sets
+    def union(counter_names)
+      union_helper(counter_names)
+    end
+
+    # Store the union of several sets in *destination* so that it can be used as
+    # a HyperLogLog counter later.
+    def union_store(destination, counter_names)
+      @redis.set(destination, raw_union(counter_names).inject('') {|a, e| a << e.chr})
+    end
+
+    private
+
+    def raw_union(counter_names, time=nil)
+      counters = @redis.mget(*counter_names).compact
+      return [] if counters.none?
+      return counters.first.each_byte if counters.one?
+      counters.map{|c| c.unpack("C#{@m}")}.transpose.map {|e| e.compact.max.to_i}
+    end
+
+  end
+end
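A minimal usage sketch for `HyperLogLog::Counter`, assuming a Redis server on localhost. Because the counter is just a Redis string of at most 2 ** b bytes, its footprint can be checked with ordinary Redis commands:

    require 'redis'
    require 'hyperloglog-redis'

    redis   = Redis.new
    counter = HyperLogLog::Counter.new(redis, 10)   # b = 10, ~3% relative error

    1000.times { |i| counter.add('visitors', "user-#{i}") }

    puts counter.count('visitors')   # estimate, close to 1000
    puts redis.type('visitors')      # "string"
    puts redis.strlen('visitors')    # at most 2 ** 10 = 1024 bytes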
data/lib/hyperloglog-redis.rb
CHANGED
data/lib/time_series_counter.rb
ADDED
@@ -0,0 +1,80 @@
+module HyperLogLog
+  class TimeSeriesCounter
+    include Algorithm
+
+    # This is an implementation of HyperLogLog that allows for querying counts
+    # within time ranges of the form (t, current_time] with second-level
+    # granularity. The standard implementation of HyperLogLog stores the max
+    # number of leading zeros seen in the image of each of 2 ** b hash
+    # functions. These counts can naturally be stored in a string of length
+    # 2 ** b by allocating one byte per leading zero count.
+    #
+    # To provide counts within a time range, we alter the standard
+    # implementation to store a mapping of pairs of the form (hash function,
+    # leading zero count) -> timestamp, where the mapping (h,z) -> t represents
+    # the fact that we observed z leading zeros in the image of hash function h
+    # most recently at time t. This mapping is stored in a string by packing
+    # 4-byte words (timestamps, represented in seconds since the epoch) into
+    # a matrix indexed by hash function and zero count stored in row-major
+    # order. Since the max zero count for a counter with parameter b is (32-b),
+    # this representation takes up at most 4 * (32-b) * (2 ** b) bytes (and
+    # usually much less, since we don't allocate space for rows corresponding
+    # to higher leading zero counts until they're actually observed.)
+    #
+    # To convert this representation to a HyperLogLog counter for the time
+    # range (t, current_time], we simply filter out all timestamps less than t
+    # in the matrix and then find, for each hash function, the maximum z for
+    # which that hash function has a non-zero timestamp.
+
+    def add(counter_name, value, time=nil)
+      hash, function_name, new_value = hash_info(value)
+      index = 4 * (function_name + (new_value.to_i * @m))
+      if time.nil?
+        @redis.setrange(counter_name, index, [Time.now.to_i].pack('N'))
+      else
+        existing_time = @redis.getrange(counter_name, index, index + 3)
+        existing_val = existing_time.empty? ? 0 : existing_time.unpack('N').first
+        @redis.setrange(counter_name, index, [time.to_i].pack('N')) if time.to_i > existing_val
+      end
+    end
+
+    # Estimate the cardinality of a single set
+    def count(counter_name, time=0)
+      union_helper([counter_name], time)
+    end
+
+    # Estimate the cardinality of the union of several sets
+    def union(counter_names, time=0)
+      union_helper(counter_names, time)
+    end
+
+    # Store the union of several sets in *destination* so that it can be used as
+    # a HyperLogLog counter later.
+    def union_store(destination, counter_names, time=0)
+      raw_counters = @redis.mget(*counter_names).compact.map{ |c| c.unpack('N*').map{ |x| x > time ? x : 0 } }
+      combined_counters = jagged_transpose(raw_counters).map{ |x| x.max.to_i }
+      @redis.set(destination, combined_counters.pack('N*'))
+    end
+
+    private
+
+    def raw_union(counter_names, time=0)
+      raw_counters = @redis.mget(*counter_names).compact
+      return [] if raw_counters.none?
+      hyperloglog_counters = raw_counters.map do |counter|
+        jagged_transpose(counter.unpack('N*').each_slice(@m).to_a).map{ |x| x.rindex{ |c| c > time } || 0 }
+      end
+      return hyperloglog_counters.first if hyperloglog_counters.one?
+      jagged_transpose(hyperloglog_counters).map{ |x| x.max.to_i }
+    end
+
+    # Given an array of non-uniform length arrays, right-pad all arrays with
+    # zeros so they're the same size, then transpose the array. This is a
+    # destructive operation: the zero-padding modifies the array-of-arrays
+    def jagged_transpose(arrays)
+      max_length = arrays.map{ |a| a.length }.max
+      arrays.map{ |a| a.fill(0, a.length, max_length - a.length) }.transpose
+    end
+
+  end
+end
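To see how the packed representation decodes, here is a hedged sketch (not part of the gem) that unpacks a single `TimeSeriesCounter` string back into per-register rho values for events after a cutoff, mirroring what `raw_union` does for one counter; the helper name and the default `b = 10` are illustrative:

    # Decode one TimeSeriesCounter key by hand. Assumes the counter was built
    # with parameter b, so it has m = 2 ** b registers.
    def registers_since(redis, counter_name, cutoff, b = 10)
      m = 2 ** b
      raw = redis.get(counter_name) || ''
      # Rows of the row-major (rho value x register) timestamp matrix.
      rows = raw.unpack('N*').each_slice(m).to_a
      (0...m).map do |register|
        column = rows.map { |row| row[register] || 0 }
        # Largest rho whose most recent observation is after the cutoff.
        column.rindex { |timestamp| timestamp > cutoff.to_i } || 0
      end
    end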