hyperloglog-redis 1.0.0 → 2.0.0
- data/Gemfile +1 -0
- data/Gemfile.lock +2 -0
- data/HISTORY.md +25 -6
- data/README.md +124 -25
- data/VERSION +1 -1
- data/hyperloglog-redis.gemspec +10 -4
- data/lib/algorithm.rb +69 -0
- data/lib/counter.rb +40 -0
- data/lib/hyperloglog-redis.rb +3 -1
- data/lib/time_series_counter.rb +80 -0
- data/spec/hyper_log_log_spec.rb +199 -197
- data/spec/time_series_counter_spec.rb +216 -0
- metadata +28 -14
- data/lib/hyper_log_log.rb +0 -96
data/Gemfile
CHANGED
data/Gemfile.lock
CHANGED
data/HISTORY.md
CHANGED
@@ -1,3 +1,22 @@
+## 2.0.0 (11/30/2012)
+
+* Changed the underlying storage from Redis hashes to bitstrings [simonkro](https://github.com/simonkro)
+  If you have existing counters stored from version 1.0.0, you can upgrade them with
+  the following method:
+
+      def upgrade_1_2(counter, redis)
+        return if redis.type(counter) == "string"
+        sketch = redis.hgetall(counter)
+        redis.del(counter)
+        sketch.each{ |key, value| redis.setrange(counter, key.to_i, value.to_i.chr) }
+      end
+
+* Moved main counter implementation from `HyperLogLog` to the class `HyperLogLog::Counter`
+
+* Added `HyperLogLog::TimeSeriesCounter`, a counter type that can estimate cardinalities
+  for all events from a particular point in the past until the present.
+
+
 ## 1.0.0 (10/26/2012)
 
 * Changed the underlying storage from Redis sorted sets to Redis hashes. This
@@ -5,12 +24,12 @@
   versions of this library, you can upgrade them with something like the
   following method:
 
-
-
-
-
-
-
+      def upgrade(counter, redis)
+        return if redis.type(counter) == "hash"
+        values = redis.zrange(counter, 0, -1, {withscores: true})
+        redis.del(counter)
+        values.each { |key, value| redis.hset(counter, key, value.to_i) }
+      end
 
 * Added union_store command, which stores the results of a union for querying
   or combining with other sets later
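
Applying the 2.0.0 upgrade above to an existing install means running `upgrade_1_2` once per counter key. The following is only a sketch of that migration loop: it assumes your counter keys can be enumerated with a pattern (the pattern 'counter:*' below is hypothetical) and that `upgrade_1_2` is defined as in the changelog entry above.

    require 'redis'

    redis = Redis.new
    redis.keys('counter:*').each do |counter_key|
      upgrade_1_2(counter_key, redis)   # no-op for keys already stored as strings
    end
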
data/README.md
CHANGED
@@ -1,15 +1,14 @@
 hyperloglog-redis
 =================
 
-This gem is
+This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating
 cardinalities of sets observed via a stream of events. A [Redis](http://redis.io)
-instance is used for storing the counters. A
+instance is used for storing the counters. A minimal example:
 
     require 'redis'
     require 'hyperloglog-redis'
 
-
-    counter = HyperLogLog.new(redis)
+    counter = HyperLogLog::Counter.new(Redis.new)
     ['john', 'paul', 'george', 'ringo', 'john', 'paul'].each do |beatle|
       counter.add('beatles', beatle)
     end
@@ -18,22 +17,26 @@ instance is used for storing the counters. A simple example:
 
 Each HyperLogLog counter uses a small, fixed amount of space but can
 estimate the cardinality of any set of up to around a billion values with
-relative error of
-passed to the HyperLogLog initializer that defaults to 10.
-each counter is represented by a
-
-
-
-
-The basic idea of HyperLogLog (and its predecessors PCSA and
-a good hash function to each value
-run of zeros
-function is good,
-seeing a value that starts with exactly X zeros should happen with probability
-2 ** -(X + 1). So if you've seen a run of 5 zeros in one of your hash values,
+relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a
+parameter passed to the `HyperLogLog::Counter` initializer that defaults to 10.
+With b = 10, each counter is represented by a 1 KB string in Redis and we get
+an expected relative error of 3%. Contrast this with the amount of space needed
+to compute set cardinality exactly, which is over 100 MB for even a bit vector
+representing a set with a billion values.
+
+The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is
+to apply a good hash function to each value observed in the stream and record the longest
+run of zeros seen as a prefix of any hashed value. If the hash
+function is good, the bits in any hashed value should be close to statistically independent,
+so seeing a value that starts with exactly X zeros should happen with probability close to
+2 ** -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values,
 you're likely to have around 2 ** 6 = 64 values in the underlying set. The actual
 implementation and analysis are much more advanced than this, but that's the idea.
 
+This gem implements a few useful extensions to the basic HyperLogLog algorithm
+which allow you to estimate unions and intersections of counters as well as
+counts within specific time ranges. These extensions are described in detail below.
+
 The HyperLogLog algorithm is described and analyzed in the paper
 ["HyperLogLog: the analysis of a near-optimal cardinality estimation
 algorithm"](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
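
The run-of-zeros intuition in the new README text above can be seen directly in plain Ruby. The sketch below is only an illustration and is not part of the gem: it draws random 32-bit values in place of hashed stream elements and shows that the longest run of leading zeros grows roughly like the log of the number of distinct values, which is exactly what the counter's registers record.

    # Count leading zeros in a 32-bit value, reading from the most significant bit.
    def leading_zeros(x, bits = 32)
      bits.downto(1) { |i| return bits - i if x[i - 1] == 1 }
      bits
    end

    longest = 0
    10_000.times { longest = [longest, leading_zeros(rand(2 ** 32))].max }

    puts "Longest run of leading zeros: #{longest}"         # often around 13 for 10,000 values
    puts "Crude estimate of set size:   #{2 ** (longest + 1)}"
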
@@ -49,31 +52,127 @@ You can also ask for an estimate of the union from multiple counters:
       counter.add('wings', wings_member)
     end
 
-    puts "There are approximately #{counter.union('beatles', 'wings')} people who were in the Beatles or Wings"
+    puts "There are approximately #{counter.union(['beatles', 'wings'])} people who were in the Beatles or Wings"
 
 The same relative error guarantee above applies to unions: a union of
-size N can be estimated to within N * (1.04 / Math.sqrt(2 ** b)) elements,
+size N can be estimated to within +/- N * (1.04 / Math.sqrt(2 ** b)) elements,
 regardless of how many HyperLogLog counters that union spans. You can store
 a unioned counter for querying or combining later with `union_store`:
 
-    counter.union_store('all_beatles_and_wings_members', 'beatles', 'wings')
+    counter.union_store('all_beatles_and_wings_members', ['beatles', 'wings'])
 
     puts "There are approximately #{counter.count('all_beatles_and_wings_members')} people who were in the Beatles or Wings"
 
 Intersections can also be estimated:
 
-    puts "There are approximately #{counter.intersection('beatles', 'wings')} people who were in both the Beatles and Wings"
+    puts "There are approximately #{counter.intersection(['beatles', 'wings'])} people who were in both the Beatles and Wings"
 
 However, intersections of HyperLogLog counters are calculated indirectly via the
 [inclusion/exclusion principle](http://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle)
 as a sum of unions and there aren't good theoretical bounds on the error of that sum. In
 practice, the estimates that come out of small intersections tend to follow the
-same relative error patterns, but beware using this type of estimation on
-
+same relative error patterns, but beware using this type of estimation on intersections
+of large numbers of sets, both because the errors can be much larger than those guaranteed
 for unions and the complexity of computing intersections grows exponentially with
-the number of
+the number of sets in the intersection.
+
+Set cardinality within a time interval
+======================================
+
+All examples up until now use `HyperLogLog::Counter`, which stores HyperLogLog
+counters as (2 ** b)-byte Redis strings. hyperloglog-redis also contains the counter implementation
+`HyperLogLog::TimeSeriesCounter`, which uses a little more space (Redis strings of up to
+4 * (32 - b) * (2 ** b) bytes) but allows you to estimate the cardinality of sets during
+certain time windows.
+
+Using `HyperLogLog::TimeSeriesCounter`, you can get estimates of the number of distinct
+elements added to a set in the past X seconds, for any value of X. A `HyperLogLog::TimeSeriesCounter`
+is initialized with the same arguments as a regular `Counter` but implements a
+superset of `HyperLogLog::Counter`'s interface. Namely, each of the methods `add`,
+`count`, `union`, `intersection`, and `union_store` take an optional final time argument,
+either a Ruby `Time` or an integer representing seconds since the epoch.
+
+When passed a time argument t, `add` registers an addition to the set at time t. When no
+time is passed, the current system time is used. The methods `count`, `union`,
+`intersection`, and `union_store` all estimate set cardinality for the time interval
+consisting of all events that happened after time t when t is passed as a final argument.
+
+For example, to get the number of distinct user logins within the
+past week, we might call:
+
+    one_week = 60 * 60 * 24 * 7
+    logins_in_past_week = counter.count('user_logins', Time.now - one_week)
+
+A note about relative errors
+============================
+
+With a parameter `b` in the range [4..16], HyperLogLog counters provide a relative
+error of 1.04/sqrt(2 ** b) with high probability. When unions, intersections, and
+time range queries are involved, it's sometimes not clear what the relative error
+is relative to, so here is some clarification:
+
+* For a union of counters, the relative error applies to the size of the union. Taking
+  the union of counters is lossless in the sense that you end up with the same counter
+  you would have arrived at had you observed the union of all of the individual events.
+
+* For an intersection of counters, there's no good theoretical bound on the relative
+  error. In practice, and especially for intersections involving a small number of sets,
+  the relative error you obtain tends to be in relation to the size of the union of the
+  sets involved. For example, if you have two sets, each of cardinality 5000, and observe
+  both sets through HyperLogLog counters with parameter b=10 (3% relative error), you can
+  expect the intersection estimate to be within 10000 * 0.03 = 300 of the actual intersection
+  size.
+
+* For time queries, the relative error applies to the size of the set within the time
+  range you've queried. For example, given a set of cardinality 1,000,000 that has had
+  100 distinct additions within the last 10 minutes, if you observe such a set with a
+  HyperLogLog counter with parameter b=10 (3% relative error), you can expect the count
+  returned from a query about the last 10 minutes to be within 3 of 100.
+
+Comparison to other approaches
+==============================
+
+When trying to optimize for space, two well-known alternatives to HyperLogLog exist:
+
+* Bit vectors: you provide some near-perfect hash function between keys in your domain
+  and an interval of integers, then represent that interval of integers with bits.
+* Bloom filters with counters: use a [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter)
+  to keep track of items seen; on insert, when the Bloom filter tells you that the item
+  seen is not in the set, increment the counter.
+
+Both bit vectors and Bloom filters can be augmented to hold timestamps for entries in the
+data structures and simulate counters for time-ranges like `HyperLogLog::TimeSeriesCounter`.
+
+Bit vectors give exact counts, but the space complexity is linear in the size of
+the set, and you must either allocate a large bit vector upfront or cope with the complexity
+of dynamically resizing your bit vector as the set grows. Providing a manual mapping from
+members of your set to an interval of integers is sometimes a non-trivial task. Counts,
+unions, and intersections are all linear-time operations in the size of the universe of
+the set being represented.
+
+Bloom filters can be much more compact than bit vectors, but the actual count associated
+with a Bloom filter is an artifact of the construction of the data structure, so the cost
+of estimating a union or intersection is linear in the size of the Bloom filter. Getting
+high probability guarantees on the quality of the estimate of Bloom filter counts requires
+several "good" hash functions that have some degree of independence from each other; in
+practice, coming up with several independent implementations of good hash functions is
+difficult. Bloom filters require that all of their space be allocated upfront (re-hashing
+isn't possible without replaying all events), so in practice you need some estimate of
+how large the counters are going to be before allocating the counter.
+
+HyperLogLog counters take up less space than either of the above approaches and provide
+constant-time implementations (in the size of the sets being represented) of unions,
+intersections, and time range queries. A `HyperLogLog::Counter` with parameter b will
+be stored in a Redis string of length at most 2 ** b bytes, whereas a `HyperLogLog::TimeSeriesCounter` with parameter
+b will be stored in a Redis string of length at most 4 * (32 - b) * (2 ** b) bytes. For counters representing smaller sets,
+the size taken up by a `HyperLogLog::TimeSeriesCounter` can be significantly less. Here
+are some examples for specific values of b:
+
+* With b = 7, a `HyperLogLog::Counter` uses at most 128 bytes and a `HyperLogLog::TimeSeriesCounter` uses at most 13 KB while providing a relative error of 9%.
+* With b = 11, a `HyperLogLog::Counter` uses at most 2 KB and a `HyperLogLog::TimeSeriesCounter` uses at most 168 KB while providing a relative error of 2%.
+* With b = 16, a `HyperLogLog::Counter` uses at most 64 KB and a `HyperLogLog::TimeSeriesCounter` uses at most 4 MB while providing a relative error of less than half a percent.
 
 Installation
 ============
 
-gem install hyperloglog-redis
+    gem install hyperloglog-redis
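
To make the optional time argument described in the new README section concrete, here is a small usage sketch. It is illustrative only and assumes a Redis server on localhost; the key name 'user_logins' is just an example.

    require 'redis'
    require 'hyperloglog-redis'

    counter = HyperLogLog::TimeSeriesCounter.new(Redis.new)

    # Record additions with explicit timestamps (a Ruby Time or seconds since the epoch)
    counter.add('user_logins', 'alice', Time.now - 60 * 60 * 24 * 30)  # about a month ago
    counter.add('user_logins', 'bob',   Time.now - 60 * 60 * 24 * 3)   # three days ago
    counter.add('user_logins', 'carol')                                # defaults to now

    one_week = 60 * 60 * 24 * 7
    puts counter.count('user_logins')                       # all logins ever seen, ~3
    puts counter.count('user_logins', Time.now - one_week)  # logins in the past week, ~2
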
data/VERSION
CHANGED
@@ -1 +1 @@
-1.0.0
+2.0.0
data/hyperloglog-redis.gemspec
CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = "hyperloglog-redis"
-  s.version = "1.0.0"
+  s.version = "2.0.0"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Aaron Windsor"]
-  s.date = "2012-10-26"
+  s.date = "2012-11-30"
   s.description = "An implementation of the HyperLogLog set cardinality estimation algorithm in Ruby using Redis as a back-end"
   s.email = "aaron.windsor@gmail.com"
   s.extra_rdoc_files = [
@@ -27,10 +27,13 @@ Gem::Specification.new do |s|
     "Rakefile",
     "VERSION",
     "hyperloglog-redis.gemspec",
-    "lib/hyper_log_log.rb",
+    "lib/algorithm.rb",
+    "lib/counter.rb",
     "lib/hyperloglog-redis.rb",
+    "lib/time_series_counter.rb",
     "spec/hyper_log_log_spec.rb",
-    "spec/spec_helper.rb"
+    "spec/spec_helper.rb",
+    "spec/time_series_counter_spec.rb"
   ]
   s.homepage = "http://github.com/aaw/hyperloglog-redis"
   s.licenses = ["MIT"]
@@ -47,12 +50,14 @@ Gem::Specification.new do |s|
       s.add_development_dependency(%q<jeweler>, ["~> 1.8.4"])
       s.add_development_dependency(%q<rake>, ["~> 0.9.2.2"])
       s.add_development_dependency(%q<rspec>, ["~> 2.11.0"])
+      s.add_development_dependency(%q<timecop>, ["~> 0.5.3"])
     else
       s.add_dependency(%q<murmurhash3>, ["~> 0.1.3"])
       s.add_dependency(%q<redis>, ["~> 3.0.1"])
       s.add_dependency(%q<jeweler>, ["~> 1.8.4"])
       s.add_dependency(%q<rake>, ["~> 0.9.2.2"])
       s.add_dependency(%q<rspec>, ["~> 2.11.0"])
+      s.add_dependency(%q<timecop>, ["~> 0.5.3"])
     end
   else
     s.add_dependency(%q<murmurhash3>, ["~> 0.1.3"])
@@ -60,6 +65,7 @@ Gem::Specification.new do |s|
     s.add_dependency(%q<jeweler>, ["~> 1.8.4"])
     s.add_dependency(%q<rake>, ["~> 0.9.2.2"])
     s.add_dependency(%q<rspec>, ["~> 2.11.0"])
+    s.add_dependency(%q<timecop>, ["~> 0.5.3"])
   end
 end
 
data/lib/algorithm.rb
ADDED
@@ -0,0 +1,69 @@
+require 'redis'
+require 'murmurhash3'
+
+module HyperLogLog
+  module Algorithm
+
+    def initialize(redis, b=10)
+      raise "Accuracy not supported. Please choose a value of b between 4 and 16" if b < 4 || b > 16
+      @redis = redis
+      @bits_in_hash = 32 - b
+      @m = (2 ** b).to_i
+      if @m == 16
+        @alpha = 0.673
+      elsif @m == 32
+        @alpha = 0.697
+      elsif @m == 64
+        @alpha = 0.709
+      else
+        @alpha = 0.7213/(1 + 1.079/@m)
+      end
+    end
+
+    # Estimate the cardinality of the intersection of several sets. We do this by
+    # using the principle of inclusion and exclusion to represent the size of the
+    # intersection as the alternating sum of an exponential number of
+    # cardinalities of unions of smaller sets.
+    def intersection(counter_names, time=0)
+      icount = (1..counter_names.length).map do |k|
+        counter_names.combination(k).map do |group|
+          ((k % 2 == 0) ? -1 : 1) * union_helper(group, time)
+        end.inject(0, :+)
+      end.inject(0, :+)
+      [icount, 0].max
+    end
+
+    private
+
+    def hash_info(value)
+      hash = MurmurHash3::V32.murmur3_32_str_hash(value)
+      [hash, hash % @m, rho(hash / @m)]
+    end
+
+    def union_helper(counter_names, time=0)
+      all_estimates = raw_union(counter_names, time).select{ |i| i > 0 }
+      estimate_sum = all_estimates.reduce(0.0){ |a, score| a + 2.0 ** -score }
+      estimate = @alpha * @m * @m / (estimate_sum + @m - all_estimates.length)
+      if estimate <= 2.5 * @m
+        if all_estimates.length == @m
+          estimate.round
+        else # Correction for small sets
+          (@m * Math.log(Float(@m)/(@m - all_estimates.length))).round
+        end
+      elsif estimate <= 2 ** 32 / 30.0
+        estimate.round
+      else # Correction for large sets
+        (-2**32 * Math.log(1 - estimate/(2.0**32))).round
+      end
+    end
+
+    # rho(i) is the position of the first 1 in the binary representation of i,
+    # reading from most significant to least significant bits. Some examples:
+    # rho(1...) = 1, rho(001...) = 3, rho(000...0) = @bits_in_hash + 1
+    def rho(i)
+      return @bits_in_hash + 1 if i == 0
+      @bits_in_hash - Math.log(i, 2).floor
+    end
+
+  end
+end
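
The alternating sum in Algorithm#intersection above is easiest to see for two counters, where it reduces to |A ∩ B| = |A| + |B| - |A ∪ B|. The following sketch is not part of the gem and assumes a Redis server on localhost; it checks that identity against the public API:

    require 'redis'
    require 'hyperloglog-redis'

    counter = HyperLogLog::Counter.new(Redis.new)
    ('a'..'m').each { |x| counter.add('left', x) }   # 13 members
    ('h'..'t').each { |x| counter.add('right', x) }  # 13 members, 6 shared with 'left'

    by_hand = counter.count('left') + counter.count('right') - counter.union(['left', 'right'])
    puts counter.intersection(['left', 'right'])     # estimate of the true intersection size, 6
    puts by_hand                                     # the same sum, computed by hand via inclusion/exclusion
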
data/lib/counter.rb
ADDED
@@ -0,0 +1,40 @@
+module HyperLogLog
+  class Counter
+    include Algorithm
+
+    # This is the implementation of the standard HyperLogLog algorithm, storing
+    # counts in each byte of a string of length 2 ** b.
+
+    def add(counter_name, value)
+      hash, function_name, new_value = hash_info(value)
+      existing_value = @redis.getrange(counter_name, function_name, function_name).unpack('C').first.to_i
+      @redis.setrange(counter_name, function_name, new_value.chr) if new_value > existing_value
+    end
+
+    # Estimate the cardinality of a single set
+    def count(counter_name)
+      union_helper([counter_name])
+    end
+
+    # Estimate the cardinality of the union of several sets
+    def union(counter_names)
+      union_helper(counter_names)
+    end
+
+    # Store the union of several sets in *destination* so that it can be used as
+    # a HyperLogLog counter later.
+    def union_store(destination, counter_names)
+      @redis.set(destination, raw_union(counter_names).inject('') {|a, e| a << e.chr})
+    end
+
+    private
+
+    def raw_union(counter_names, time=nil)
+      counters = @redis.mget(*counter_names).compact
+      return [] if counters.none?
+      return counters.first.each_byte if counters.one?
+      counters.map{|c| c.unpack("C#{@m}")}.transpose.map {|e| e.compact.max.to_i}
+    end
+
+  end
+end
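
A short usage sketch for Counter (illustrative only; it assumes a Redis server on localhost) showing that the registers live in an ordinary Redis string of at most 2 ** b bytes:

    require 'redis'
    require 'hyperloglog-redis'

    redis = Redis.new
    counter = HyperLogLog::Counter.new(redis, 10)        # b = 10, so at most 1024 bytes
    1000.times { |i| counter.add('demo', "item-#{i}") }

    puts counter.count('demo')   # close to 1000, within roughly 3%
    puts redis.type('demo')      # "string"
    puts redis.strlen('demo')    # at most 1024
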
data/lib/hyperloglog-redis.rb
CHANGED
data/lib/time_series_counter.rb
ADDED
@@ -0,0 +1,80 @@
+module HyperLogLog
+  class TimeSeriesCounter
+    include Algorithm
+
+    # This is an implementation of HyperLogLog that allows for querying counts
+    # within time ranges of the form (t, current_time] with second-level
+    # granularity. The standard implementation of HyperLogLog stores the max
+    # number of leading zeros seen in the image of each of 2 ** b hash
+    # functions. These counts can naturally be stored in a string of length
+    # 2 ** b by allocating one byte per leading zero count.
+    #
+    # To provide counts within a time range, we alter the standard
+    # implementation to store a mapping of pairs of the form (hash function,
+    # leading zero count) -> timestamp, where the mapping (h,z) -> t represents
+    # the fact that we observed z leading zeros in the image of hash function h
+    # most recently at time t. This mapping is stored in a string by packing
+    # 4-byte words (timestamps, represented in seconds since the epoch) into
+    # a matrix indexed by hash function and zero count stored in row-major
+    # order. Since the max zero count for a counter with parameter b is (32-b),
+    # this representation takes up at most 4 * (32-b) * (2 ** b) bytes (and
+    # usually much less, since we don't allocate space for rows corresponding
+    # to higher leading zero counts until they're actually observed.)
+    #
+    # To convert this representation to a HyperLogLog counter for the time
+    # range (t, current_time], we simply filter out all timestamps less than t
+    # in the matrix and then find, for each hash function, the maximum z for
+    # which that hash function has a non-zero timestamp.
+
+    def add(counter_name, value, time=nil)
+      hash, function_name, new_value = hash_info(value)
+      index = 4 * (function_name + (new_value.to_i * @m))
+      if time.nil?
+        @redis.setrange(counter_name, index, [Time.now.to_i].pack('N'))
+      else
+        existing_time = @redis.getrange(counter_name, index, index + 3)
+        existing_val = existing_time.empty? ? 0 : existing_time.unpack('N').first
+        @redis.setrange(counter_name, index, [time.to_i].pack('N')) if time.to_i > existing_val
+      end
+    end
+
+    # Estimate the cardinality of a single set
+    def count(counter_name, time=0)
+      union_helper([counter_name], time)
+    end
+
+    # Estimate the cardinality of the union of several sets
+    def union(counter_names, time=0)
+      union_helper(counter_names, time)
+    end
+
+    # Store the union of several sets in *destination* so that it can be used as
+    # a HyperLogLog counter later.
+    def union_store(destination, counter_names, time=0)
+      raw_counters = @redis.mget(*counter_names).compact.map{ |c| c.unpack('N*').map{ |x| x > time ? x : 0 } }
+      combined_counters = jagged_transpose(raw_counters).map{ |x| x.max.to_i }
+      @redis.set(destination, combined_counters.pack('N*'))
+    end
+
+    private
+
+    def raw_union(counter_names, time=0)
+      raw_counters = @redis.mget(*counter_names).compact
+      return [] if raw_counters.none?
+      hyperloglog_counters = raw_counters.map do |counter|
+        jagged_transpose(counter.unpack('N*').each_slice(@m).to_a).map{ |x| x.rindex{ |c| c > time } || 0 }
+      end
+      return hyperloglog_counters.first if hyperloglog_counters.one?
+      jagged_transpose(hyperloglog_counters).map{ |x| x.max.to_i }
+    end
+
+    # Given an array of non-uniform length arrays, right-pad all arrays with
+    # zeros so they're the same size, then transpose the array. This is a
+    # destructive operation: the zero-padding modifies the array-of-arrays
+    def jagged_transpose(arrays)
+      max_length = arrays.map{ |a| a.length }.max
+      arrays.map{ |a| a.fill(0, a.length, max_length - a.length) }.transpose
+    end
+
+  end
+end
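
As a final illustration of the layout described in the comments above (not part of the gem; it assumes a Redis server on localhost), the backing string can be decoded by hand: it is a row-major matrix of 4-byte big-endian timestamps, with the row given by the leading-zero count and the column by the hash function index.

    require 'redis'
    require 'hyperloglog-redis'

    redis = Redis.new
    b = 10
    m = 2 ** b
    counter = HyperLogLog::TimeSeriesCounter.new(redis, b)
    counter.add('events', 'some-user', Time.now - 3600)

    words  = redis.get('events').unpack('N*')   # flat array of timestamps, zero where nothing was seen
    matrix = words.each_slice(m).to_a           # matrix[zero_count][function_index]
    puts Time.at(matrix.flatten.max)            # roughly an hour ago
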