bloombroom 1.0.0 → 1.2.0
- data/.gitignore +12 -0
- data/.travis.yml +6 -0
- data/CHANGELOG.md +11 -1
- data/Gemfile +4 -0
- data/README.md +75 -45
- data/Rakefile +23 -0
- data/benchmark/bloom_filter.rb +35 -0
- data/benchmark/bloom_filter_memory.rb +28 -0
- data/benchmark/continuous_bloom_filter.rb +60 -0
- data/benchmark/continuous_bloom_filter_memory.rb +28 -0
- data/benchmark/continuous_bloom_filter_stats.rb +63 -0
- data/benchmark/memory.rb +8 -0
- data/bloombroom.gemspec +28 -0
- data/ffi/bloombroom/hash/Rakefile +5 -0
- data/{ext/bloombroom/hash/ffi → ffi/bloombroom/hash}/ffi_fnv.c +0 -0
- data/lib/bloombroom.rb +10 -6
- data/lib/bloombroom/hash/ffi_fnv.rb +4 -8
- data/lib/bloombroom/version.rb +1 -1
- data/spec/bloombroom/bits/bit_bucket_field_spec.rb +142 -0
- data/spec/bloombroom/bits/bit_field_spec.rb +108 -0
- data/spec/bloombroom/filter/bloom_filter_spec.rb +43 -0
- data/spec/bloombroom/filter/bloom_helper_spec.rb +18 -0
- data/spec/bloombroom/filter/continuous_bloom_filter_spec.rb +107 -0
- data/spec/bloombroom/hash/ffi_fnv_spec.rb +31 -0
- data/spec/bloombroom/hash/test_vectors.rb +610 -0
- data/spec/spec_helper.rb +2 -0
- metadata +94 -49
- data/ext/bloombroom/hash/cext/cext_fnv.c +0 -91
- data/ext/bloombroom/hash/cext/extconf.rb +0 -3
- data/ext/bloombroom/hash/ffi/extconf.rb +0 -3
- data/lib/bloombroom/hash/fnv_a.rb +0 -100
- data/lib/bloombroom/hash/fnv_b.rb +0 -56
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/CHANGELOG.md
CHANGED
@@ -1 +1,11 @@
-
+# 1.0.0, 05-09-2012
+- initial release
+
+# 1.1.x
+- bad gems, yanked from Rubygems
+
+# 1.2.0, 02-27-2013
+- refactored to use ffi-compiler
+- now only using FFI FNV hashing, removed all other implementations
+- test on JRuby 1.7.3 and MRI Ruby 2.0.0
+- FFI performance improvements
data/Gemfile
ADDED
data/README.md
CHANGED
@@ -1,10 +1,12 @@
-# Bloombroom v1.
+# Bloombroom v1.2.0
+
+[![build status](https://secure.travis-ci.org/colinsurprenant/bloombroom.png)](http://travis-ci.org/colinsurprenant/bloombroom)
 
 - Standard **Bloomfilter** class for bounded key space
 - **ContinuousBloomfilter** class for unbounded keys (**stream**)
-- Bitfield class
-- BitBucketField class (
--
+- Bitfield class (single bit field)
+- BitBucketField class (multiple bit fields)
+- Fast FNV hashing using C implementation with FFI bindings.
 
 The Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. See [wikipedia](http://en.wikipedia.org/wiki/Bloom_filter).
 
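The README context above states that a Bloom filter can return false positives but never false negatives. A minimal pure-Ruby sketch of that property follows. It is not the gem's implementation: the `TinyBloom` class, its use of SHA1 digest words as hash functions, and the sizes are assumptions chosen only for illustration.

```ruby
require "digest/sha1"

# Hypothetical illustration only, not Bloombroom's BloomFilter class.
class TinyBloom
  def initialize(m, k)
    raise ArgumentError, "k must be between 1 and 5" unless (1..5).cover?(k)
    @m = m                        # number of bits in the filter
    @k = k                        # number of hash values used per key
    @bits = Array.new(m, false)
  end

  # Set the k bits selected by the key.
  def add(key)
    indexes(key).each { |i| @bits[i] = true }
  end

  # True only if all k bits are set. Bits set by other keys can make this
  # true for a key that was never added (a false positive).
  def include?(key)
    indexes(key).all? { |i| @bits[i] }
  end

  private

  # A SHA1 digest is 20 bytes, i.e. five 32-bit words; use the first k.
  def indexes(key)
    Digest::SHA1.digest(key).unpack("N5").first(@k).map { |h| h % @m }
  end
end

bf = TinyBloom.new(1024, 3)
%w[apple banana cherry].each { |key| bf.add(key) }
puts bf.include?("banana") # an added key is always reported as present
```

Bits are only ever set, never cleared, which is why an added key cannot be missed in this sketch.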
@@ -21,16 +23,15 @@ The internal timer resolution is set to half of the required TTL (resolution div
 ring buffer using the current timer tick modulo 15. The timer ticks will be time slot=1, 2, ... 15, 1, 2 and so on. The total
 time of our internal clock will thus be 15 * (TTL / 2). We keep track of TTL by writing the current time slot
 in the key k buckets when inserted in the filter. For a key lookup if the interval betweem the current time slot and any of the k buckets value
-is greater than 2 (resolution divisor) we know this key is expired. See [
+is greater than 2 (resolution divisor) we know this key is expired. See my [Continuous Bloom filter](http://colinsurprenant.com/blog/2012/05/12/continuous-bloom-filter/) blog post about this.
 
 This means that an element is garanteed to not be expired before the given TTL but in the worst case could survive until 3 * (TTL / 2).
 
 ### Hashing
 Bloom filters require the use of multiple (k) hash functions for each inserted element. We actually simulate multiple hash functions by having just two hash functions which are actually the upper and lower 32 bits of our FFI FNV1a 64 bits hash function. Double hashing with one hash function. Very very fast. See [bloom_helper.rb](https://github.com/colinsurprenant/bloombroom/blob/master/lib/bloombroom/filter/bloom_helper.rb) and the [references](#references) section for more info on this technique.
 
-
 ## Installation
-tested in both MRI Ruby 1.9.2, 1.9.3 and JRuby 1.
+tested in both MRI Ruby 1.9.2, 1.9.3, 2.0 and JRuby 1.7.3.
 
 ``` sh
 $ gem install bloombroom
@@ -113,45 +114,67 @@ ruby benchmark/continuous_bloom_filter_memory.rb auto 100000000 0.001
 - **1.0%** error rate for **100M** keys: **914mb**
 - **0.1%** error rate for **100M** keys: **1371mb**
 
+## Simulation
+This is an input stream simulation into the ContinuousBloomfilter. First a series to 32 x 20k random unique insertion keys & 20k random unique test keys not part of the insertion set are generated. At each iteration, 20k insertion keys are added, and 20k test keys are checked for inclusion and the internal timer tick is incremented. Since the life of our keys is of 3 timer ticks we chose a filter capacity of 3 x 20k elements. Specific m and k parameter will be computed for an error rate of 0.1% and 3 x 20k capacity.
 
-
-All benchmarks have been run on a MacbookPro with a 2.66GHz i7 with 8GB RAM on OSX 10.6.8 with MRI Ruby 1.9.3p194
-
-### Hashing
-The Hashing benchmark compares the performance of SHA1, MD5, two native Ruby FNV (A & B) implementations, a C implementation as a C extension and FFI extension for 32 and 64 bits hashes.
+We see that as we add more keys, the test keys false positive rate is stable at the required error rate. In the second section, the same sequence is applied to a standard Bloomfilter to show that, obviously, the error rate will increase as more elements are added past the required capacity.
 
 ``` sh
-ruby benchmark/
-```
-
-```
-benchmarking for 1000000 iterations
-user system total real
-MD5: 1.900000 0.010000 1.910000 ( 1.912995)
-SHA-1: 2.110000 0.000000 2.110000 ( 2.109739)
-native FNV A 32: 32.470000 0.110000 32.580000 ( 32.596759)
-native FNV A 64: 38.330000 0.570000 38.900000 ( 38.923384)
-native FNV B 32: 4.870000 0.020000 4.890000 ( 4.882862)
-native FNV B 64: 37.700000 0.110000 37.810000 ( 37.842873)
-ffi FNV 32: 0.760000 0.010000 0.770000 ( 0.754941)
-ffi FNV 64: 0.890000 0.000000 0.890000 ( 0.901954)
-c-ext FNV 32: 0.310000 0.000000 0.310000 ( 0.307131)
-c-ext FNV 64: 0.480000 0.000000 0.480000 ( 0.485310)
-
-MD5: 522740 ops/s
-SHA-1: 473992 ops/s
-native FNV A 32: 30678 ops/s
-native FNV A 64: 25691 ops/s
-native FNV B 32: 204798 ops/s
-native FNV B 64: 26425 ops/s
-ffi FNV 32: 1324607 ops/s
-ffi FNV 64: 1108704 ops/s
-c-ext FNV 32: 3255939 ops/s
-c-ext FNV 64: 2060538 ops/s
+ruby benchmark/continuous_bloom_filter_stats.rb
 ```
 
+```
+generating lots of random keys
+
+Continuous BloomFilter with capacity=60000, error=0.001(0.1%) -> m=862656, k=10
+added 20000 keys, tested 20000 keys, FPs=0/20000 (0.000)%
+added 20000 keys, tested 20000 keys, FPs=1/20000 (0.005)%
+added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+added 20000 keys, tested 20000 keys, FPs=20/20000 (0.100)%
+added 20000 keys, tested 20000 keys, FPs=23/20000 (0.115)%
+added 20000 keys, tested 20000 keys, FPs=22/20000 (0.110)%
+added 20000 keys, tested 20000 keys, FPs=22/20000 (0.110)%
+added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+added 20000 keys, tested 20000 keys, FPs=18/20000 (0.090)%
+added 20000 keys, tested 20000 keys, FPs=21/20000 (0.105)%
+added 20000 keys, tested 20000 keys, FPs=11/20000 (0.055)%
+added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+added 20000 keys, tested 20000 keys, FPs=18/20000 (0.090)%
+added 20000 keys, tested 20000 keys, FPs=19/20000 (0.095)%
+added 20000 keys, tested 20000 keys, FPs=21/20000 (0.105)%
+added 20000 keys, tested 20000 keys, FPs=20/20000 (0.100)%
+added 20000 keys, tested 20000 keys, FPs=24/20000 (0.120)%
+added 20000 keys, tested 20000 keys, FPs=21/20000 (0.105)%
+added 20000 keys, tested 20000 keys, FPs=22/20000 (0.110)%
+added 20000 keys, tested 20000 keys, FPs=24/20000 (0.120)%
+added 20000 keys, tested 20000 keys, FPs=15/20000 (0.075)%
+added 20000 keys, tested 20000 keys, FPs=16/20000 (0.080)%
+added 20000 keys, tested 20000 keys, FPs=16/20000 (0.080)%
+added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+added 20000 keys, tested 20000 keys, FPs=22/20000 (0.110)%
+added 20000 keys, tested 20000 keys, FPs=21/20000 (0.105)%
+added 20000 keys, tested 20000 keys, FPs=24/20000 (0.120)%
+added 20000 keys, tested 20000 keys, FPs=16/20000 (0.080)%
+added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+added 20000 keys, tested 20000 keys, FPs=24/20000 (0.120)%
+added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+added 20000 keys, tested 20000 keys, FPs=19/20000 (0.095)%
+Continuous BloomFilter 640000 adds + 640000 tests in 16.95s, 75537 ops/s
+
+BloomFilter with capacity=60000, error=0.001(0.1%) -> m=862656, k=10
+added 20000 keys, tested 20000 keys, FPs=0/20000 (0.000)%
+added 20000 keys, tested 20000 keys, FPs=1/20000 (0.005)%
+added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+added 20000 keys, tested 20000 keys, FPs=131/20000 (0.655)%
+added 20000 keys, tested 20000 keys, FPs=453/20000 (2.265)%
+added 20000 keys, tested 20000 keys, FPs=1162/20000 (5.810)%
+BloomFilter 120000 adds + 120000 tests in 1.64s, 146008 ops/s
+```
+
+## Benchmarks
+All benchmarks have been run on a MacbookPro with a 2.66GHz i7 with 8GB RAM on OSX 10.6.8 with MRI Ruby 1.9.3p194
+
 ### Bloomfilter
-The Bloomfilter class is using the FFI FNV hashing by default, for speed and compatibility.
 
 ``` sh
 ruby benchmark/bloom_filter.rb
@@ -176,7 +199,6 @@ BloomFilter m=2875518, k=13 include? 119154 ops/s
 ```
 
 ### ContinuousBloomfilter
-The ContinuousBloomfilter class is using the FFI FNV hashing by default, for speed and compatibility.
 
 ``` sh
 ruby benchmark/continuous_bloom_filter.rb
@@ -211,6 +233,10 @@ ContinuousBloomFilter m=2875518, k=13 add+include 56606 ops/s
 ```
 
 ## JRuby
+This has only been tested in Ruby **1.9** mode. JRuby 1.9 mode has to be enabled to run tests and benchmarks.
+
+Note that this is not necessary anymore with JRuby 1.7 which is in 1.9 mode by default.
+
 - to run specs use
 
 ``` sh
@@ -222,7 +248,7 @@ jruby --1.9 -S rake spec
 jruby --1.9 benchmark/some_benchmark.rb
 ```
 
-<a id="
+<a id="references" />
 ## References ##
 - [Bloom filter on wikipedia](http://en.wikipedia.org/wiki/Bloom_filter)
 - [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
@@ -232,11 +258,15 @@ jruby --1.9 benchmark/some_benchmark.rb
 - [Producing n hash functions by hashing only once](http://willwhim.wordpress.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/)
 - [Less Hashing, Same Performance: Building a Better Bloom Filter](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.152.579&rep=rep1&type=pdf)
 
+## Credits
+- [Ilya Grigorik](http://www.igvita.com/) for his inspiration with the [Time-based Bloom filters](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/).
+- Authors of the [Stable Bloom filters research paper](http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf) which also provided inspiration.
+- [Robey Pointer](https://github.com/robey) for his [Ruby FNV C extension implementation](https://github.com/robey/rbfnv).
+- [Peter Cooper](http://www.petercooper.co.uk/) for inspiration with [his BitField class](http://dzone.com/snippets/bitfield-fastish-pure-ruby-bit).
+
 ## Author
-Colin Surprenant, [@colinsurprenant]
+Colin Surprenant, [@colinsurprenant](http://twitter.com/colinsurprenant), [http://github.com/colinsurprenant](http://github.com/colinsurprenant), colin.surprenant@gmail.com
 
 ## License
 Bloombroom is distributed under the Apache License, Version 2.0.
 
-[twitter]: http://twitter.com/colinsurprenant
-[github]: http://github.com/colinsurprenant
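The README's Hashing section describes deriving the k bucket indexes from the upper and lower 32 bits of a single 64-bit FNV1a hash. A minimal pure-Ruby sketch of that idea follows. The gem itself uses a C implementation through FFI, so the `fnv1a_64` and `bucket_indexes` helpers are hypothetical names, and the combination `h1 + i * h2` is the form given in the "Less Hashing, Same Performance" reference, assumed here rather than copied from `bloom_helper.rb`.

```ruby
FNV64_OFFSET_BASIS = 0xcbf29ce484222325
FNV64_PRIME        = 0x100000001b3
MASK64             = 0xffffffffffffffff

# FNV-1a: xor each byte into the hash, then multiply by the prime (mod 2**64).
def fnv1a_64(data)
  data.each_byte.reduce(FNV64_OFFSET_BASIS) do |hash, byte|
    ((hash ^ byte) * FNV64_PRIME) & MASK64
  end
end

# Simulate k hash functions from one 64-bit hash (double hashing).
# m is the number of buckets, k the number of indexes to produce.
def bucket_indexes(key, m, k)
  hash = fnv1a_64(key)
  h1 = hash >> 32          # upper 32 bits
  h2 = hash & 0xffffffff   # lower 32 bits
  (0...k).map { |i| (h1 + i * h2) % m }
end

puts format("%016x", fnv1a_64("a"))
p bucket_indexes("some key", 862_656, 10)
```

Only one hash computation is needed per key, whatever the value of k, which is where the speed of the technique comes from.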
data/Rakefile
ADDED
@@ -0,0 +1,23 @@
+require 'bundler/setup'
+require 'rake'
+require 'rake/clean'
+require 'bundler/gem_tasks'
+require 'rspec/core/rake_task'
+require 'ffi'
+require 'ffi-compiler/compile_task'
+
+task :default => [:clean, :compile_ffi, :spec]
+
+desc "clean, make and run specs"
+task :spec do
+  RSpec::Core::RakeTask.new
+end
+
+desc "FFI compiler"
+namespace "ffi-compiler" do
+  FFI::Compiler::CompileTask.new('ffi/bloombroom/hash/ffi_fnv')
+end
+task :compile_ffi => ["ffi-compiler:default"]
+
+CLEAN.include('ffi/**/*{.o,.log,.so,.bundle}')
+CLEAN.include('lib/**/*{.o,.log,.so,.bundle}')
data/benchmark/bloom_filter.rb
ADDED
@@ -0,0 +1,35 @@
+require 'bundler/setup'
+require "benchmark"
+require "digest/sha1"
+require "bloombroom"
+
+KEYS_COUNT = 150000
+ERRORS = [0.01, 0.001, 0.0001]
+TEST_M_K = ERRORS.map{|error| Bloombroom::BloomHelper.find_m_k(KEYS_COUNT, error)}
+
+keys = KEYS_COUNT.times.map{|i| Digest::SHA1.hexdigest("#{i}#{rand(1000000)}")}
+
+if !!(defined?(RUBY_ENGINE) && RUBY_ENGINE == 'jruby')
+  puts("warming JVM...")
+  bf = Bloombroom::BloomFilter.new(KEYS_COUNT, 7)
+  keys.each{|key| bf.add(key)}
+end
+
+puts("benchmarking for #{keys.size} keys with #{ERRORS.map{|e| "#{e * 100}%"}.join(", ")} error rates")
+
+reports = []
+Benchmark.bm(40) do |x|
+  TEST_M_K.each do |m, k|
+    bf = Bloombroom::BloomFilter.new(m, k)
+    adds = x.report("BloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} add") {keys.each{|key| bf.add(key)}}
+    includes = x.report("BloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} include?") {keys.each{|key| bf.include?(key)}}
+    reports << {:m => m, :k => k, :adds => adds, :includes => includes}
+  end
+end
+
+puts("\n")
+
+reports.each do |report|
+  puts("BloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} add #{"%10.0f" % (keys.size / report[:adds].real)} ops/s")
+  puts("BloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} include? #{"%10.0f" % (keys.size / report[:includes].real)} ops/s")
+end
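The benchmark above takes its filter parameters from `Bloombroom::BloomHelper.find_m_k(capacity, error)`, whose source is not part of this diff. As a rough sketch, assuming it applies the textbook sizing formulas m = ceil(-n ln p / (ln 2)^2) and k = round((m / n) ln 2), the hypothetical `sketch_find_m_k` below yields the same m and k values that appear in the README output: m=862656, k=10 for a capacity of 60000 at 0.1% error, and m=2875518, k=13 for the 150000 keys and 0.0001 error rate used by this benchmark.

```ruby
# Hypothetical stand-in for BloomHelper.find_m_k, using the standard formulas.
# capacity: expected number of keys (n); error: target false positive rate (p).
def sketch_find_m_k(capacity, error)
  m = (-capacity * Math.log(error) / (Math.log(2)**2)).ceil  # number of bits
  k = (Math.log(2) * m / capacity).round                     # number of hashes
  [m, k]
end

p sketch_find_m_k(60_000, 0.001)    # => [862656, 10]
p sketch_find_m_k(150_000, 0.0001)  # => [2875518, 13]
```

Whether the gem rounds exactly this way is an assumption; the sketch only shows that the standard formulas are consistent with the printed values.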
data/benchmark/bloom_filter_memory.rb
ADDED
@@ -0,0 +1,28 @@
+require 'bundler/setup'
+require 'bloombroom'
+require 'benchmark/memory'
+
+DEFAULT_M = 10000000
+DEFAULT_K = 1
+DEFAULT_CAPACITY = 1000000
+DEFAULT_ERROR = 0.01
+
+m,k = if ARGV[0] == "auto"
+  ARGV.shift
+  capacity = (ARGV.shift || DEFAULT_CAPACITY).to_i
+  error = (ARGV.shift || DEFAULT_ERROR).to_f
+  Bloombroom::BloomHelper.find_m_k(capacity, error)
+else
+  m = (ARGV.shift || DEFAULT_M).to_i
+  k = (ARGV.shift || DEFAULT_K).to_i
+  [m ,k]
+end
+
+puts("bloomfilter m=#{m}, k=#{k}, size=#{m} bits / #{"%.1f" % ((m / 8) / 1024.0)}k")
+
+before = Bloombroom::Process.rss
+bf = Bloombroom::BloomFilter.new(m, k)
+after = Bloombroom::Process.rss
+
+puts("process size before=#{before}k, after=#{after}k")
+puts("process size growth=#{(after - before)}k" )
data/benchmark/continuous_bloom_filter.rb
ADDED
@@ -0,0 +1,60 @@
+require 'bundler/setup'
+require "benchmark"
+require "digest/sha1"
+require "bloombroom"
+
+KEYS_COUNT = 150000
+ERRORS = [0.01, 0.001, 0.0001]
+TEST_M_K = ERRORS.map{|error| Bloombroom::BloomHelper.find_m_k(KEYS_COUNT, error)}
+
+keys = KEYS_COUNT.times.map{|i| Digest::SHA1.hexdigest("#{i}#{rand(1000000)}")}
+slots = 10.times.map{|i| (KEYS_COUNT / 3).times.map{|i| Digest::SHA1.hexdigest("#{i}#{rand(1000000)}")}}
+
+if !!(defined?(RUBY_ENGINE) && RUBY_ENGINE == 'jruby')
+  puts("warming JVM...")
+  bf = Bloombroom::ContinuousBloomFilter.new(*Bloombroom::BloomHelper.find_m_k(KEYS_COUNT, 0.001), 0)
+  keys.each{|key| bf.add(key)}
+end
+
+puts("benchmarking WITHOUT expiration for #{keys.size} keys with #{ERRORS.map{|e| "#{e * 100}%"}.join(", ")} error rates")
+
+reports = []
+Benchmark.bm(53) do |x|
+  TEST_M_K.each do |m, k|
+    bf = Bloombroom::ContinuousBloomFilter.new(m, k, 0)
+    adds = x.report("ContinuousBloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} add") {keys.each{|key| bf.add(key)}}
+    includes = x.report("ContinuousBloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} include?") {keys.each{|key| bf.include?(key)}}
+    reports << {:m => m, :k => k, :adds => adds, :includes => includes}
+  end
+end
+
+puts("\n")
+
+reports.each do |report|
+  puts("ContinuousBloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} add #{"%10.0f" % (keys.size / report[:adds].real)} ops/s")
+  puts("ContinuousBloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} include? #{"%10.0f" % (keys.size / report[:includes].real)} ops/s")
+end
+
+puts("\nbenchmarking WITH expiration for #{slots.map(&:size).reduce(&:+)} keys with #{ERRORS.map{|e| "#{e * 100}%"}.join(", ")} error rates")
+
+reports = []
+Benchmark.bm(53) do |x|
+  TEST_M_K.each do |m, k|
+    bf = Bloombroom::ContinuousBloomFilter.new(m, k, 0)
+    addincludes = x.report("ContinuousBloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} add+include") do
+      slots.each do |slot|
+        slot.each{|key| bf.add(key)}
+        slot.each{|key| bf.include?(key)}
+        bf.inc_time_slot
+      end
+    end
+
+    reports << {:m => m, :k => k, :addincludes => addincludes}
+  end
+end
+
+puts("\n")
+
+reports.each do |report|
+  puts("ContinuousBloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} add+include #{"%10.0f" % (slots.map(&:size).reduce(&:+) * 2 / report[:addincludes].real)} ops/s")
+end
data/benchmark/continuous_bloom_filter_memory.rb
ADDED
@@ -0,0 +1,28 @@
+require 'bundler/setup'
+require 'benchmark/memory'
+require "bloombroom"
+
+DEFAULT_M = 10000000
+DEFAULT_K = 1
+DEFAULT_CAPACITY = 1000000
+DEFAULT_ERROR = 0.01
+
+m,k = if ARGV[0] == "auto"
+  ARGV.shift
+  capacity = (ARGV.shift || DEFAULT_CAPACITY).to_i
+  error = (ARGV.shift || DEFAULT_ERROR).to_f
+  Bloombroom::BloomHelper.find_m_k(capacity, error)
+else
+  m = (ARGV.shift || DEFAULT_M).to_i
+  k = (ARGV.shift || DEFAULT_K).to_i
+  [m ,k]
+end
+
+puts("continuous bloomfilter m=#{m}, k=#{k}, size=#{m * Bloombroom::ContinuousBloomFilter::BITS_PER_BUCKET} bits / #{"%.1f" % (((m * Bloombroom::ContinuousBloomFilter::BITS_PER_BUCKET) / 8) / 1024.0)}k")
+
+before = Bloombroom::Process.rss
+bf = Bloombroom::ContinuousBloomFilter.new(m, k, 0)
+after = Bloombroom::Process.rss
+
+puts("process size before=#{before}k, after=#{after}k")
+puts("process size growth=#{(after - before)}k" )
data/benchmark/continuous_bloom_filter_stats.rb
ADDED
@@ -0,0 +1,63 @@
+require 'bundler/setup'
+require "benchmark"
+require "digest/sha1"
+require "bloombroom"
+
+module Bloombroom
+
+  KEYS_PER_SLOT = 20000
+  SLOTS_PER_FILTER = 3
+  KEY_VALUE_RANGE = 100000000
+
+  puts("\ngenerating lots of random keys")
+  slots = 32.times.map do
+    add = {}
+    KEYS_PER_SLOT.times.each{|i| add["#{i}#{Digest::SHA1.hexdigest(rand(KEY_VALUE_RANGE).to_s)}"] = true}
+
+    free = []
+    while free.size < add.size
+      key = "#{Digest::SHA1.hexdigest(rand(KEY_VALUE_RANGE).to_s)}"
+      free << key unless add.has_key?(key)
+    end
+
+    [add.keys, free]
+  end
+
+  # puts(slots.map{|slot| slot.first.size}.inspect)
+  # puts(slots.map{|slot| slot.last.size}.inspect)
+
+  capacity = KEYS_PER_SLOT * SLOTS_PER_FILTER
+  error = 0.001 # 0.001 == 0.1%
+
+  m, k = BloomHelper.find_m_k(capacity, error)
+  puts("\nContinuous BloomFilter with capacity=#{capacity}, error=#{error}(#{error * 100}%) -> m=#{m}, k=#{k}")
+  bf = ContinuousBloomFilter.new(m, k, 0)
+
+  t = Benchmark.realtime do
+    slots.each do |slot|
+      slot.first.each{|key| bf.add(key)}
+      false_positives = 0
+      slot.last.each{|key| false_positives += 1 if bf.include?(key)}
+      puts("added #{slot.first.size} keys, tested #{slot.last.size} keys, FPs=#{false_positives}/#{slot.last.size} (#{"%.3f" % ((false_positives * 100) / Float(slot.last.size))})%")
+      bf.inc_time_slot
+    end
+  end
+  n = slots.size * KEYS_PER_SLOT
+  puts("Continuous BloomFilter #{n} adds + #{n} tests in #{"%.2f" % t}s, #{"%2.0f" % ((2 * n) / t)} ops/s")
+
+
+  puts("\nBloomFilter with capacity=#{capacity}, error=#{error}(#{error * 100}%) -> m=#{m}, k=#{k}")
+  bf = BloomFilter.new(m, k)
+
+  t = Benchmark.realtime do
+    slots[0, SLOTS_PER_FILTER + 3].each do |slot|
+      slot.first.each{|key| bf.add(key); n += 1}
+      false_positives = 0
+      slot.last.each{|key| false_positives += 1 if bf.include?(key)}
+      puts("added #{slot.first.size} keys, tested #{slot.last.size} keys, FPs=#{false_positives}/#{slot.last.size} (#{"%.3f" % ((false_positives * 100) / Float(slot.last.size))})%")
+    end
+  end
+  n = (SLOTS_PER_FILTER + 3) * KEYS_PER_SLOT
+  puts("BloomFilter #{n} adds + #{n} tests in #{"%.2f" % t}s, #{"%2.0f" % ((2 * n) / t)} ops/s")
+
+end
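The script above advances the continuous filter's clock with `bf.inc_time_slot`. The README describes that clock as a ring of 15 time slots (1 to 15) with a resolution divisor of 2, where a bucket counts as expired once more than 2 ticks separate it from the current slot. A minimal sketch of that arithmetic follows, using hypothetical helper names and assuming that a stored value of 0 marks a bucket that was never written (this diff does not show how the gem encodes empty buckets).

```ruby
TIME_SLOTS = 15          # slots cycle 1, 2, ... 15, 1, 2, ...
RESOLUTION_DIVISOR = 2   # the timer ticks every TTL / 2

# Advance the ring clock by one tick.
def next_time_slot(slot)
  (slot % TIME_SLOTS) + 1
end

# Ticks between the slot stored in a bucket and the current slot,
# accounting for wrap-around from 15 back to 1.
def elapsed_ticks(current, stored)
  (current - stored) % TIME_SLOTS
end

# Assumption: 0 means the bucket was never written.
def bucket_expired?(current, stored)
  stored.zero? || elapsed_ticks(current, stored) > RESOLUTION_DIVISOR
end

slot = 14
4.times do
  puts "slot=#{slot}: key written at slot 14 expired? #{bucket_expired?(slot, 14)}"
  slot = next_time_slot(slot)
end
```

In this sketch a key written at slot 14 stays alive through slots 14, 15 and 1 and is reported expired at slot 2, which agrees with the README's statement that keys live for 3 timer ticks. The sketch does not clear stale buckets, so a value left untouched for a full 15-tick cycle would look fresh again.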