hyperloglog-redis 1.0.0 → 2.0.0

data/Gemfile CHANGED
@@ -7,4 +7,5 @@ group :development, :test do
   gem 'jeweler', '~> 1.8.4'
   gem 'rake', '~> 0.9.2.2'
   gem 'rspec', '~> 2.11.0'
+  gem 'timecop', '~> 0.5.3'
 end

data/Gemfile.lock CHANGED
@@ -22,6 +22,7 @@ GEM
     rspec-expectations (2.11.3)
       diff-lcs (~> 1.1.3)
     rspec-mocks (2.11.2)
+    timecop (0.5.3)
 
 PLATFORMS
   ruby
@@ -32,3 +33,4 @@ DEPENDENCIES
   rake (~> 0.9.2.2)
   redis (~> 3.0.1)
   rspec (~> 2.11.0)
+  timecop (~> 0.5.3)
data/HISTORY.md CHANGED
@@ -1,3 +1,22 @@
+## 2.0.0 (11/30/2012)
+
+* Changed the underlying storage from Redis hashes to bitstrings [simonkro](https://github.com/simonkro).
+  If you have existing counters stored from version 1.0.0, you can upgrade them with
+  the following method:
+
+      def upgrade_1_2(counter, redis)
+        return if redis.type(counter) == "string"
+        sketch = redis.hgetall(counter)
+        redis.del(counter)
+        sketch.each{ |key, value| redis.setrange(counter, key.to_i, value.to_i.chr) }
+      end
+
+* Moved the main counter implementation from `HyperLogLog` to the class `HyperLogLog::Counter`
+
+* Added `HyperLogLog::TimeSeriesCounter`, a counter type that can estimate cardinalities
+  for all events from a particular point in the past until the present.
+
+
 ## 1.0.0 (10/26/2012)
 
 * Changed the underlying storage from Redis sorted sets to Redis hashes. This
@@ -5,12 +24,12 @@
   versions of this library, you can upgrade them with something like the
   following method:
 
-    def upgrade(counter, redis)
-      return if redis.type(counter) == "hash"
-      values = redis.zrange(counter, 0, -1, {withscores: true})
-      redis.del(counter)
-      values.each { |key, value| redis.hset(counter, key, value.to_i) }
-    end
+      def upgrade(counter, redis)
+        return if redis.type(counter) == "hash"
+        values = redis.zrange(counter, 0, -1, {withscores: true})
+        redis.del(counter)
+        values.each { |key, value| redis.hset(counter, key, value.to_i) }
+      end
 
 * Added union_store command, which stores the results of a union for querying
   or combining with other sets later
data/README.md CHANGED
@@ -1,15 +1,14 @@
 hyperloglog-redis
 =================
 
-This gem is an implementation of the HyperLogLog algorithm for estimating
+This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating
 cardinalities of sets observed via a stream of events. A [Redis](http://redis.io)
-instance is used for storing the counters. A simple example:
+instance is used for storing the counters. A minimal example:
 
     require 'redis'
     require 'hyperloglog-redis'
 
-    redis = Redis.new
-    counter = HyperLogLog.new(redis)
+    counter = HyperLogLog::Counter.new(Redis.new)
     ['john', 'paul', 'george', 'ringo', 'john', 'paul'].each do |beatle|
       counter.add('beatles', beatle)
     end
@@ -18,22 +17,26 @@ instance is used for storing the counters. A simple example:
 
 Each HyperLogLog counter uses a small, fixed amount of space but can
 estimate the cardinality of any set of up to around a billion values with
-relative error of about 1.04 / Math.sqrt(2 ** b), where b is a parameter
-passed to the HyperLogLog initializer that defaults to 10. With b = 10,
-each counter is represented by a Redis sorted set with 2 ** b = 1024 values
-(a few KB of space) and we get an expected relative error of 3%. Contrast this
-with the amount of space needed to compute set cardinality exactly, which is
-over 100 MB for even a bit vector representing a set with a billion values.
-
-The basic idea of HyperLogLog (and its predecessors PCSA and LogLog) is to apply
-a good hash function to each value you see in the stream and record the longest
-run of zeros that you've seen as a prefix of any hashed value. If the hash
-function is good, you'd expect that its bits are statistically independent, so
-seeing a value that starts with exactly X zeros should happen with probability
-2 ** -(X + 1). So if you've seen a run of 5 zeros in one of your hash values,
+relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a
+parameter passed to the `HyperLogLog::Counter` initializer that defaults to 10.
+With b = 10, each counter is represented by a 1 KB string in Redis and we get
+an expected relative error of 3%. Contrast this with the amount of space needed
+to compute set cardinality exactly, which is over 100 MB for even a bit vector
+representing a set with a billion values.
+
+The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is
+to apply a good hash function to each value observed in the stream and record the
+longest run of zeros seen as a prefix of any hashed value. If the hash function
+is good, the bits in any hashed value should be close to statistically independent,
+so seeing a value that starts with exactly X zeros should happen with probability
+close to 2 ** -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values,
 you're likely to have around 2 ** 6 = 64 values in the underlying set. The actual
 implementation and analysis are much more advanced than this, but that's the idea.
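The leading-zeros idea described above can be sketched in a few lines of plain Ruby. This is a toy illustration only, not the gem's code: the gem uses MurmurHash3 and 2 ** b bucketed registers, while this sketch uses Ruby's built-in `Digest::MD5` and a single global maximum, so its estimates are very noisy.

```ruby
require 'digest'

# Number of leading zeros in a 32-bit value.
def leading_zeros(hash32)
  return 32 if hash32.zero?
  31 - Math.log2(hash32).floor
end

# Toy estimator: hash each item to 32 bits, track the longest run of leading
# zeros seen, and guess the cardinality as 2 ** (max_run + 1).
def crude_estimate(items)
  max_run = items.map do |item|
    h = Digest::MD5.hexdigest(item.to_s)[0, 8].to_i(16) # first 32 bits of the digest
    leading_zeros(h)
  end.max
  2 ** (max_run + 1)
end
```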
 
+This gem implements a few useful extensions to the basic HyperLogLog algorithm
+which allow you to estimate unions and intersections of counters as well as
+counts within specific time ranges. These extensions are described in detail below.
+
 The HyperLogLog algorithm is described and analyzed in the paper
 ["HyperLogLog: the analysis of a near-optimal cardinality estimation
 algorithm"](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
@@ -49,31 +52,127 @@ You can also ask for an estimate of the union from multiple counters:
       counter.add('wings', wings_member)
     end
 
-    puts "There are approximately #{counter.union('beatles', 'wings')} people who were in the Beatles or Wings"
+    puts "There are approximately #{counter.union(['beatles', 'wings'])} people who were in the Beatles or Wings"
 
 The same relative error guarantee above applies to unions: a union of
-size N can be estimated to within N * (1.04 / Math.sqrt(2 ** b)) elements,
+size N can be estimated to within +/- N * (1.04 / Math.sqrt(2 ** b)) elements,
 regardless of how many HyperLogLog counters that union spans. You can store
 a unioned counter for querying or combining later with `union_store`:
 
-    counter.union_store('all_beatles_and_wings_members', 'beatles', 'wings')
+    counter.union_store('all_beatles_and_wings_members', ['beatles', 'wings'])
 
     puts "There are approximately #{counter.count('all_beatles_and_wings_members')} people who were in the Beatles or Wings"
 
 Intersections can also be estimated:
 
-    puts "There are approximately #{counter.intersection('beatles', 'wings')} people who were in both the Beatles and Wings"
+    puts "There are approximately #{counter.intersection(['beatles', 'wings'])} people who were in both the Beatles and Wings"
 
 However, intersections of HyperLogLog counters are calculated indirectly via the
 [inclusion/exclusion principle](http://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle)
 as a sum of unions, and there aren't good theoretical bounds on the error of that sum. In
 practice, the estimates that come out of small intersections tend to follow the
-same relative error patterns, but beware using this type of estimation on large
-intersections, both because the errors can be much larger than those guaranteed
+same relative error patterns, but beware of using this type of estimation on intersections
+of large numbers of sets, both because the errors can be much larger than those guaranteed
 for unions and because the complexity of computing intersections grows exponentially with
-the number of counters being intersected.
+the number of sets in the intersection.
+
+ Set cardinality within a time interval
80
+ ======================================
81
+
82
+ All examples up until now use `HyperLogLog::Counter`, which stores HyperLogLog
83
+ counters as (2 ** b)-byte Redis strings. hyperloglog-redis also contains the counter implementation
84
+ `HyperLogLog::TimeSeriesCounter`, which uses a little more space (Redis strings of up to
85
+ 4 * (32 - b) * (2 ** b) bytes) but allows you to estimate the cardinality of sets during
86
+ certain time windows.
87
+
88
+ Using `HyperLogLog::TimeSeriesCounter`, you can get estimates of the number of distinct
89
+ elements added to a set in the past X seconds, for any value of X. A `HyperLogLog::TimeSeriesCounter`
90
+ is initialized with the same arguments as a regular `Counter` but implements a
91
+ superset of `HyperLogLog::Counter`'s interface. Namely, each of the methods `add`,
92
+ `count`, `union`, `intersection`, and `union_store` take an optional final time argument,
93
+ either a Ruby `Time` or an integer representing seconds since the epoch.
94
+
95
+ When passed a time argument t, `add` registers an addition to the set at time t. When no
96
+ time is passed, the current system time is used. The methods `count`, `union`,
97
+ `intersection`, and `union_store` all estimate set cardinality for the time interval
98
+ consisting of all events that happened after time t when t is passed as a final argument.
99
+
100
+ For example, to get the number of distinct user logins within the
101
+ past week, we might call:
102
+
103
+ one_week = 60 * 60 * 24 * 7
104
+ logins_in_past_week = counter.count('user_logins', Time.now - one_week)
105
+
+A note about relative errors
+============================
+
+With a parameter `b` in the range [4..16], HyperLogLog counters provide a relative
+error of 1.04/sqrt(2 ** b) with high probability. When unions, intersections, and
+time range queries are involved, it's sometimes not clear what the relative error
+is relative to, so here is some clarification:
+
+* For a union of counters, the relative error applies to the size of the union. Taking
+  the union of counters is lossless in the sense that you end up with the same counter
+  you would have arrived at had you observed the union of all of the individual events.
+
+* For an intersection of counters, there's no good theoretical bound on the relative
+  error. In practice, and especially for intersections involving a small number of sets,
+  the relative error you obtain tends to be in relation to the size of the union of the
+  sets involved. For example, if you have two sets, each of cardinality 5000, and observe
+  both sets through HyperLogLog counters with parameter b=10 (3% relative error), you can
+  expect the intersection estimate to be within 10000 * 0.03 = 300 of the actual intersection
+  size.
+
+* For time queries, the relative error applies to the size of the set within the time
+  range you've queried. For example, given a set of cardinality 1,000,000 that has had
+  100 distinct additions within the last 10 minutes, if you observe such a set with a
+  HyperLogLog counter with parameter b=10 (3% relative error), you can expect the count
+  returned from a query about the last 10 minutes to be within 3 of 100.
+
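The error figures quoted in this section are plain arithmetic on the bound stated above, and can be checked directly:

```ruby
# The README's error bound as a function of the counter parameter b.
def relative_error(b)
  1.04 / Math.sqrt(2 ** b)
end

relative_error(10)  # ~0.0325, the "3%" figure used in the examples above
```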
+Comparison to other approaches
+==============================
+
+When trying to optimize for space, two well-known alternatives to HyperLogLog exist:
+
+* Bit vectors: you provide some near-perfect hash function between keys in your domain
+  and an interval of integers, then represent that interval of integers with bits.
+* Bloom filters with counters: use a [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter)
+  to keep track of items seen; on insert, when the Bloom filter tells you that the item
+  seen is not in the set, increment the counter.
+
+Both bit vectors and Bloom filters can be augmented to hold timestamps for entries in the
+data structures and simulate counters for time ranges like `HyperLogLog::TimeSeriesCounter`.
+
+Bit vectors give exact counts, but the space complexity is linear in the size of
+the set, and you must either allocate a large bit vector upfront or cope with the complexity
+of dynamically resizing your bit vector as the set grows. Providing a manual mapping from
+members of your set to an interval of integers is sometimes a non-trivial task. Counts,
+unions, and intersections are all linear-time operations in the size of the universe of
+the set being represented.
+
+Bloom filters can be much more compact than bit vectors, but the actual count associated
+with a Bloom filter is an artifact of the construction of the data structure, so the cost
+of estimating a union or intersection is linear in the size of the Bloom filter. Getting
+high-probability guarantees on the quality of the estimate of Bloom filter counts requires
+several "good" hash functions that have some degree of independence from each other; in
+practice, coming up with several independent implementations of good hash functions is
+difficult. Bloom filters require that all of their space be allocated upfront (re-hashing
+isn't possible without replaying all events), so in practice you need some estimate of
+how large the counters are going to be before allocating the counter.
+
+HyperLogLog counters take up less space than either of the above approaches and provide
+constant-time implementations (in the size of the sets being represented) of unions,
+intersections, and time range queries. A `HyperLogLog::Counter` with parameter b will
+be stored in a Redis string of length at most 2 ** b bytes, whereas a `HyperLogLog::TimeSeriesCounter`
+with parameter b will be stored in a Redis string of length at most 4 * (32 - b) * (2 ** b) bytes.
+For counters representing smaller sets, the size taken up by a `HyperLogLog::TimeSeriesCounter`
+can be significantly less. Here are some examples for specific values of b:
+
+* With b = 7, a `HyperLogLog::Counter` uses at most 128 bytes and a `HyperLogLog::TimeSeriesCounter` uses at most 13 KB while providing a relative error of 9%.
+* With b = 11, a `HyperLogLog::Counter` uses at most 2 KB and a `HyperLogLog::TimeSeriesCounter` uses at most 168 KB while providing a relative error of 2%.
+* With b = 16, a `HyperLogLog::Counter` uses at most 64 KB and a `HyperLogLog::TimeSeriesCounter` uses at most 4 MB while providing a relative error of less than half a percent.
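The sizes in the bullets above follow from the two storage formulas stated earlier (2 ** b bytes for `Counter`, 4 * (32 - b) * (2 ** b) bytes for `TimeSeriesCounter`); for b = 7 the time-series bound works out to 12800 bytes, roughly the 13 KB quoted:

```ruby
# Worst-case storage from the README's formulas.
def counter_bytes(b)
  2 ** b
end

def time_series_bytes(b)
  4 * (32 - b) * (2 ** b)
end

time_series_bytes(11) / 1024  # => 168, the 168 KB quoted for b = 11
```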
 
 Installation
 ============
 
-    gem install hyperloglog-redis
+    gem install hyperloglog-redis
data/VERSION CHANGED
@@ -1 +1 @@
-1.0.0
+2.0.0
data/hyperloglog-redis.gemspec CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = "hyperloglog-redis"
-  s.version = "1.0.0"
+  s.version = "2.0.0"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Aaron Windsor"]
-  s.date = "2012-10-26"
+  s.date = "2012-11-30"
   s.description = "An implementation of the HyperLogLog set cardinality estimation algorithm in Ruby using Redis as a back-end"
   s.email = "aaron.windsor@gmail.com"
   s.extra_rdoc_files = [
@@ -27,10 +27,13 @@ Gem::Specification.new do |s|
     "Rakefile",
     "VERSION",
     "hyperloglog-redis.gemspec",
-    "lib/hyper_log_log.rb",
+    "lib/algorithm.rb",
+    "lib/counter.rb",
     "lib/hyperloglog-redis.rb",
+    "lib/time_series_counter.rb",
     "spec/hyper_log_log_spec.rb",
-    "spec/spec_helper.rb"
+    "spec/spec_helper.rb",
+    "spec/time_series_counter_spec.rb"
   ]
   s.homepage = "http://github.com/aaw/hyperloglog-redis"
   s.licenses = ["MIT"]
@@ -47,12 +50,14 @@ Gem::Specification.new do |s|
       s.add_development_dependency(%q<jeweler>, ["~> 1.8.4"])
       s.add_development_dependency(%q<rake>, ["~> 0.9.2.2"])
       s.add_development_dependency(%q<rspec>, ["~> 2.11.0"])
+      s.add_development_dependency(%q<timecop>, ["~> 0.5.3"])
     else
       s.add_dependency(%q<murmurhash3>, ["~> 0.1.3"])
       s.add_dependency(%q<redis>, ["~> 3.0.1"])
       s.add_dependency(%q<jeweler>, ["~> 1.8.4"])
       s.add_dependency(%q<rake>, ["~> 0.9.2.2"])
       s.add_dependency(%q<rspec>, ["~> 2.11.0"])
+      s.add_dependency(%q<timecop>, ["~> 0.5.3"])
     end
   else
     s.add_dependency(%q<murmurhash3>, ["~> 0.1.3"])
@@ -60,6 +65,7 @@ Gem::Specification.new do |s|
     s.add_dependency(%q<jeweler>, ["~> 1.8.4"])
     s.add_dependency(%q<rake>, ["~> 0.9.2.2"])
     s.add_dependency(%q<rspec>, ["~> 2.11.0"])
+    s.add_dependency(%q<timecop>, ["~> 0.5.3"])
   end
 end
 
data/lib/algorithm.rb ADDED
@@ -0,0 +1,69 @@
+require 'redis'
+require 'murmurhash3'
+
+module HyperLogLog
+  module Algorithm
+
+    def initialize(redis, b=10)
+      raise "Accuracy not supported. Please choose a value of b between 4 and 16" if b < 4 || b > 16
+      @redis = redis
+      @bits_in_hash = 32 - b
+      @m = (2 ** b).to_i
+      if @m == 16
+        @alpha = 0.673
+      elsif @m == 32
+        @alpha = 0.697
+      elsif @m == 64
+        @alpha = 0.709
+      else
+        @alpha = 0.7213/(1 + 1.079/@m)
+      end
+    end
+
+    # Estimate the cardinality of the intersection of several sets. We do this by
+    # using the principle of inclusion and exclusion to represent the size of the
+    # intersection as the alternating sum of an exponential number of
+    # cardinalities of unions of smaller sets.
+    def intersection(counter_names, time=0)
+      icount = (1..counter_names.length).map do |k|
+        counter_names.combination(k).map do |group|
+          ((k % 2 == 0) ? -1 : 1) * union_helper(group, time)
+        end.inject(0, :+)
+      end.inject(0, :+)
+      [icount, 0].max
+    end
+
+    private
+
+    def hash_info(value)
+      hash = MurmurHash3::V32.murmur3_32_str_hash(value)
+      [hash, hash % @m, rho(hash / @m)]
+    end
+
+    def union_helper(counter_names, time=0)
+      all_estimates = raw_union(counter_names, time).select{ |i| i > 0 }
+      estimate_sum = all_estimates.reduce(0.0){ |a, score| a + 2.0 ** -score }
+      estimate = @alpha * @m * @m / (estimate_sum + @m - all_estimates.length)
+      if estimate <= 2.5 * @m
+        if all_estimates.length == @m
+          estimate.round
+        else # Correction for small sets
+          (@m * Math.log(Float(@m)/(@m - all_estimates.length))).round
+        end
+      elsif estimate <= 2 ** 32 / 30.0
+        estimate.round
+      else # Correction for large sets
+        (-2**32 * Math.log(1 - estimate/(2.0**32))).round
+      end
+    end
+
+    # rho(i) is the position of the first 1 in the binary representation of i,
+    # reading from most significant to least significant bits. Some examples:
+    # rho(1...) = 1, rho(001...) = 3, rho(000...0) = @bits_in_hash + 1
+    def rho(i)
+      return @bits_in_hash + 1 if i == 0
+      @bits_in_hash - Math.log(i, 2).floor
+    end
+
+  end
+end
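The edge cases described in the `rho` comment above can be checked with a standalone copy of the method. This is a hypothetical free-function version for illustration: `@bits_in_hash` is passed explicitly so it can run outside the module.

```ruby
# rho(i): position of the first 1 bit, reading from the most significant of
# bits_in_hash bits; an all-zero input maps to bits_in_hash + 1.
def rho(i, bits_in_hash)
  return bits_in_hash + 1 if i == 0
  bits_in_hash - Math.log(i, 2).floor
end

# With b = 10, each register sees 32 - 10 = 22 hash bits:
rho(0, 22)        # => 23, the rho(000...0) = @bits_in_hash + 1 case
rho(2 ** 19, 22)  # => 3, matching the rho(001...) = 3 example
```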
data/lib/counter.rb ADDED
@@ -0,0 +1,40 @@
+module HyperLogLog
+  class Counter
+    include Algorithm
+
+    # This is the implementation of the standard HyperLogLog algorithm, storing
+    # counts in each byte of a string of length 2 ** b.
+
+    def add(counter_name, value)
+      hash, function_name, new_value = hash_info(value)
+      existing_value = @redis.getrange(counter_name, function_name, function_name).unpack('C').first.to_i
+      @redis.setrange(counter_name, function_name, new_value.chr) if new_value > existing_value
+    end
+
+    # Estimate the cardinality of a single set
+    def count(counter_name)
+      union_helper([counter_name])
+    end
+
+    # Estimate the cardinality of the union of several sets
+    def union(counter_names)
+      union_helper(counter_names)
+    end
+
+    # Store the union of several sets in *destination* so that it can be used as
+    # a HyperLogLog counter later.
+    def union_store(destination, counter_names)
+      @redis.set(destination, raw_union(counter_names).inject('') {|a, e| a << e.chr})
+    end
+
+    private
+
+    def raw_union(counter_names, time=nil)
+      counters = @redis.mget(*counter_names).compact
+      return [] if counters.none?
+      return counters.first.each_byte if counters.one?
+      counters.map{|c| c.unpack("C#{@m}")}.transpose.map {|e| e.compact.max.to_i}
+    end
+
+  end
+end
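The merge at the heart of `raw_union` above is just a byte-wise maximum across counter strings, and that line runs fine on plain Ruby strings without Redis. A sketch, with a local `m` standing in for the instance variable `@m`:

```ruby
# Each counter is a string of m one-byte registers; their union takes the
# max register value at each position.
def byte_max_union(counters, m)
  counters.map { |c| c.unpack("C#{m}") }.transpose.map { |e| e.compact.max.to_i }
end

a = [1, 5, 0, 2].pack('C4')
b = [3, 4, 0, 7].pack('C4')
byte_max_union([a, b], 4)  # => [3, 5, 0, 7]
```

Taking the position-wise max is why unions are lossless: the merged registers are exactly what a single counter would hold after observing both event streams.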
data/lib/hyperloglog-redis.rb CHANGED
@@ -1 +1,3 @@
-require "hyper_log_log"
+require "algorithm"
+require "counter"
+require "time_series_counter"
data/lib/time_series_counter.rb ADDED
@@ -0,0 +1,80 @@
+module HyperLogLog
+  class TimeSeriesCounter
+    include Algorithm
+
+    # This is an implementation of HyperLogLog that allows for querying counts
+    # within time ranges of the form (t, current_time] with second-level
+    # granularity. The standard implementation of HyperLogLog stores the max
+    # number of leading zeros seen in the image of each of 2 ** b hash
+    # functions. These counts can naturally be stored in a string of length
+    # 2 ** b by allocating one byte per leading zero count.
+    #
+    # To provide counts within a time range, we alter the standard
+    # implementation to store a mapping of pairs of the form (hash function,
+    # leading zero count) -> timestamp, where the mapping (h,z) -> t represents
+    # the fact that we observed z leading zeros in the image of hash function h
+    # most recently at time t. This mapping is stored in a string by packing
+    # 4-byte words (timestamps, represented in seconds since the epoch) into
+    # a matrix indexed by hash function and zero count stored in row-major
+    # order. Since the max zero count for a counter with parameter b is (32-b),
+    # this representation takes up at most 4 * (32-b) * (2 ** b) bytes (and
+    # usually much less, since we don't allocate space for rows corresponding
+    # to higher leading zero counts until they're actually observed.)
+    #
+    # To convert this representation to a HyperLogLog counter for the time
+    # range (t, current_time], we simply filter out all timestamps less than t
+    # in the matrix and then find, for each hash function, the maximum z for
+    # which that hash function has a non-zero timestamp.
+
+    def add(counter_name, value, time=nil)
+      hash, function_name, new_value = hash_info(value)
+      index = 4 * (function_name + (new_value.to_i * @m))
+      if time.nil?
+        @redis.setrange(counter_name, index, [Time.now.to_i].pack('N'))
+      else
+        existing_time = @redis.getrange(counter_name, index, index + 3)
+        existing_val = existing_time.empty? ? 0 : existing_time.unpack('N').first
+        @redis.setrange(counter_name, index, [time.to_i].pack('N')) if time.to_i > existing_val
+      end
+    end
+
+    # Estimate the cardinality of a single set
+    def count(counter_name, time=0)
+      union_helper([counter_name], time)
+    end
+
+    # Estimate the cardinality of the union of several sets
+    def union(counter_names, time=0)
+      union_helper(counter_names, time)
+    end
+
+    # Store the union of several sets in *destination* so that it can be used as
+    # a HyperLogLog counter later.
+    def union_store(destination, counter_names, time=0)
+      raw_counters = @redis.mget(*counter_names).compact.map{ |c| c.unpack('N*').map{ |x| x > time ? x : 0 } }
+      combined_counters = jagged_transpose(raw_counters).map{ |x| x.max.to_i }
+      @redis.set(destination, combined_counters.pack('N*'))
+    end
+
+    private
+
+    def raw_union(counter_names, time=0)
+      raw_counters = @redis.mget(*counter_names).compact
+      return [] if raw_counters.none?
+      hyperloglog_counters = raw_counters.map do |counter|
+        jagged_transpose(counter.unpack('N*').each_slice(@m).to_a).map{ |x| x.rindex{ |c| c > time } || 0 }
+      end
+      return hyperloglog_counters.first if hyperloglog_counters.one?
+      jagged_transpose(hyperloglog_counters).map{ |x| x.max.to_i }
+    end
+
+    # Given an array of non-uniform length arrays, right-pad all arrays with
+    # zeros so they're the same size, then transpose the array. This is a
+    # destructive operation: the zero-padding modifies the array-of-arrays.
+    def jagged_transpose(arrays)
+      max_length = arrays.map{ |a| a.length }.max
+      arrays.map{ |a| a.fill(0, a.length, max_length - a.length) }.transpose
+    end
+
+  end
+end
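The `jagged_transpose` helper above is self-contained and can be exercised directly; here it is copied as a plain function rather than a private instance method:

```ruby
# Right-pad ragged rows with zeros so all rows are the same length, then
# transpose. Note that the padding mutates the input arrays in place
# (Array#fill), as the original comment warns.
def jagged_transpose(arrays)
  max_length = arrays.map { |a| a.length }.max
  arrays.map { |a| a.fill(0, a.length, max_length - a.length) }.transpose
end

jagged_transpose([[1, 2, 3], [4]])  # => [[1, 4], [2, 0], [3, 0]]
```

Zero-padding is what makes the lazy row allocation described in the class comment safe: rows for rare high zero counts that were never written read back as zero timestamps.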