bloombroom 1.0.0 → 1.2.0

@@ -0,0 +1,12 @@
+ *.gem
+ *.bundle
+ *.o
+ .rvmrc
+ Makefile
+ Gemfile.lock
+ pkg/*
+ .ruby-version
+ tmp/
+ vendor/
+ .vagrant
+ Vagrantfile
@@ -0,0 +1,6 @@
+ language: ruby
+ rvm:
+ - 1.9.3
+ - jruby-19mode
+ - jruby-head
+
@@ -1 +1,11 @@
- tbd
+ # 1.0.0, 05-09-2012
+ - initial release
+
+ # 1.1.x
+ - bad gems, yanked from Rubygems
+
+ # 1.2.0, 02-27-2013
+ - refactored to use ffi-compiler
+ - now only using FFI FNV hashing, removed all other implementations
+ - test on JRuby 1.7.3 and MRI Ruby 2.0.0
+ - FFI performance improvements
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source "https://rubygems.org"
+
+ # Specify your gem's dependencies in bloombroom.gemspec
+ gemspec
data/README.md CHANGED
@@ -1,10 +1,12 @@
- # Bloombroom v1.0.0
+ # Bloombroom v1.2.0
+
+ [![build status](https://secure.travis-ci.org/colinsurprenant/bloombroom.png)](http://travis-ci.org/colinsurprenant/bloombroom)

  - Standard **Bloomfilter** class for bounded key space
  - **ContinuousBloomfilter** class for unbounded keys (**stream**)
- - Bitfield class
- - BitBucketField class (multi bits)
- - native, C & FFI extensions FNV hash classes
+ - Bitfield class (single bit field)
+ - BitBucketField class (multiple bit fields)
+ - Fast FNV hashing using a C implementation with FFI bindings

  The Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. See [wikipedia](http://en.wikipedia.org/wiki/Bloom_filter).
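A minimal usage sketch, assembled from the APIs exercised by the benchmark scripts added later in this diff (`Bloombroom::BloomHelper.find_m_k`, `BloomFilter#add`/`#include?`, `ContinuousBloomFilter#inc_time_slot`). It is illustrative only, not the gem's official usage documentation:

``` ruby
require "bloombroom"

# size the filter for an expected capacity and a target false positive rate
m, k = Bloombroom::BloomHelper.find_m_k(10_000, 0.001)

bf = Bloombroom::BloomFilter.new(m, k)
bf.add("some key")
bf.include?("some key")   # => true
bf.include?("other key")  # => false (modulo the ~0.1% false positive rate)

# streaming variant; the benchmarks pass 0 as the third constructor argument
# and advance the internal clock explicitly with inc_time_slot
cbf = Bloombroom::ContinuousBloomFilter.new(m, k, 0)
cbf.add("some key")
cbf.inc_time_slot
cbf.include?("some key")
```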
 
@@ -21,16 +23,15 @@ The internal timer resolution is set to half of the required TTL (resolution div
  ring buffer using the current timer tick modulo 15. The timer ticks will be time slot=1, 2, ... 15, 1, 2 and so on. The total
  time of our internal clock will thus be 15 * (TTL / 2). We keep track of TTL by writing the current time slot
  in the key k buckets when inserted in the filter. For a key lookup, if the interval between the current time slot and any of the k bucket values
- is greater than 2 (resolution divisor) we know this key is expired. See [continuous_bloom_filter.rb](https://github.com/colinsurprenant/bloombroom/blob/master/lib/bloombroom/filter/continuous_bloom_filter.rb)
+ is greater than 2 (resolution divisor) we know this key is expired. See my [Continuous Bloom filter](http://colinsurprenant.com/blog/2012/05/12/continuous-bloom-filter/) blog post about this.

  This means that an element is guaranteed to not be expired before the given TTL but in the worst case could survive until 3 * (TTL / 2).
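A rough sketch of the expiry rule described above, assuming the 15-slot ring and resolution divisor of 2 from the paragraph. The names and structure are hypothetical; the real logic lives in continuous_bloom_filter.rb:

``` ruby
# hypothetical illustration of the described expiry rule, not the gem's code
RESOLUTION_DIVISOR = 2   # internal timer resolution is TTL / 2
RING_SLOTS = 15          # time slots 1..15 written into the key's k buckets

# ticks elapsed between the slot stored in a bucket and the current slot,
# accounting for wrap-around on the 15-slot ring
def elapsed_ticks(current_slot, stored_slot)
  current_slot >= stored_slot ? current_slot - stored_slot : RING_SLOTS - stored_slot + current_slot
end

# a key is treated as expired once a bucket was last written more than
# 2 ticks (i.e. more than one full TTL) before the current slot
def expired?(current_slot, stored_slot)
  elapsed_ticks(current_slot, stored_slot) > RESOLUTION_DIVISOR
end
```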
 
  ### Hashing
  Bloom filters require the use of multiple (k) hash functions for each inserted element. We simulate multiple hash functions using just two, which are the upper and lower 32 bits of our FFI FNV1a 64 bit hash function. Double hashing with one hash function. Very very fast. See [bloom_helper.rb](https://github.com/colinsurprenant/bloombroom/blob/master/lib/bloombroom/filter/bloom_helper.rb) and the [references](#references) section for more info on this technique.

-
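The technique is classic double hashing (see the Kirsch and Mitzenmacher paper in the references): one 64 bit hash is split into two 32 bit halves which are combined to produce the k bucket positions. A minimal pure-Ruby sketch, with an inline FNV1a standing in for the gem's FFI hash; the exact combination used in bloom_helper.rb may differ:

``` ruby
# illustrative sketch of double hashing, not the gem's implementation
FNV64_PRIME  = 0x100000001b3
FNV64_OFFSET = 0xcbf29ce484222325

# pure-Ruby FNV1a 64 stand-in for the gem's FFI hash function
def fnv1a_64(key)
  key.each_byte.inject(FNV64_OFFSET) do |h, byte|
    ((h ^ byte) * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
  end
end

# derive k bucket positions from a single 64 bit hash:
# h1 = upper 32 bits, h2 = lower 32 bits, position_i = (h1 + i * h2) mod m
def k_positions(key, m, k)
  h = fnv1a_64(key)
  h1, h2 = h >> 32, h & 0xFFFFFFFF
  (0...k).map { |i| (h1 + i * h2) % m }
end

k_positions("some key", 862_656, 10) # => 10 bucket indexes
```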
  ## Installation
- tested in both MRI Ruby 1.9.2, 1.9.3 and JRuby 1.6.7 in 1.9 mode.
+ tested on MRI Ruby 1.9.2, 1.9.3, 2.0 and JRuby 1.7.3.

  ``` sh
  $ gem install bloombroom
@@ -113,45 +114,67 @@ ruby benchmark/continuous_bloom_filter_memory.rb auto 100000000 0.001
  - **1.0%** error rate for **100M** keys: **914mb**
  - **0.1%** error rate for **100M** keys: **1371mb**

+ ## Simulation
+ This is an input stream simulation into the ContinuousBloomfilter. First, a series of 32 x 20k random unique insertion keys & 20k random unique test keys not part of the insertion set are generated. At each iteration, 20k insertion keys are added, 20k test keys are checked for inclusion and the internal timer tick is incremented. Since the life of our keys is 3 timer ticks, we chose a filter capacity of 3 x 20k elements. Specific m and k parameters will be computed for an error rate of 0.1% and 3 x 20k capacity (the sizing math is sketched after the output below).

- ## Benchmarks
- All benchmarks have been run on a MacbookPro with a 2.66GHz i7 with 8GB RAM on OSX 10.6.8 with MRI Ruby 1.9.3p194
-
- ### Hashing
- The Hashing benchmark compares the performance of SHA1, MD5, two native Ruby FNV (A & B) implementations, a C implementation as a C extension and FFI extension for 32 and 64 bits hashes.
+ We see that as we add more keys, the test keys' false positive rate is stable at the required error rate. In the second section, the same sequence is applied to a standard Bloomfilter to show that, obviously, the error rate will increase as more elements are added past the required capacity.

  ``` sh
- ruby benchmark/fnv.rb
- ```
-
- ```
- benchmarking for 1000000 iterations
- user system total real
- MD5: 1.900000 0.010000 1.910000 ( 1.912995)
- SHA-1: 2.110000 0.000000 2.110000 ( 2.109739)
- native FNV A 32: 32.470000 0.110000 32.580000 ( 32.596759)
- native FNV A 64: 38.330000 0.570000 38.900000 ( 38.923384)
- native FNV B 32: 4.870000 0.020000 4.890000 ( 4.882862)
- native FNV B 64: 37.700000 0.110000 37.810000 ( 37.842873)
- ffi FNV 32: 0.760000 0.010000 0.770000 ( 0.754941)
- ffi FNV 64: 0.890000 0.000000 0.890000 ( 0.901954)
- c-ext FNV 32: 0.310000 0.000000 0.310000 ( 0.307131)
- c-ext FNV 64: 0.480000 0.000000 0.480000 ( 0.485310)
-
- MD5: 522740 ops/s
- SHA-1: 473992 ops/s
- native FNV A 32: 30678 ops/s
- native FNV A 64: 25691 ops/s
- native FNV B 32: 204798 ops/s
- native FNV B 64: 26425 ops/s
- ffi FNV 32: 1324607 ops/s
- ffi FNV 64: 1108704 ops/s
- c-ext FNV 32: 3255939 ops/s
- c-ext FNV 64: 2060538 ops/s
+ ruby benchmark/continuous_bloom_filter_stats.rb
  ```

+ ```
+ generating lots of random keys
+
+ Continuous BloomFilter with capacity=60000, error=0.001(0.1%) -> m=862656, k=10
+ added 20000 keys, tested 20000 keys, FPs=0/20000 (0.000)%
+ added 20000 keys, tested 20000 keys, FPs=1/20000 (0.005)%
+ added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+ added 20000 keys, tested 20000 keys, FPs=20/20000 (0.100)%
+ added 20000 keys, tested 20000 keys, FPs=23/20000 (0.115)%
+ added 20000 keys, tested 20000 keys, FPs=22/20000 (0.110)%
+ added 20000 keys, tested 20000 keys, FPs=22/20000 (0.110)%
+ added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+ added 20000 keys, tested 20000 keys, FPs=18/20000 (0.090)%
+ added 20000 keys, tested 20000 keys, FPs=21/20000 (0.105)%
+ added 20000 keys, tested 20000 keys, FPs=11/20000 (0.055)%
+ added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+ added 20000 keys, tested 20000 keys, FPs=18/20000 (0.090)%
+ added 20000 keys, tested 20000 keys, FPs=19/20000 (0.095)%
+ added 20000 keys, tested 20000 keys, FPs=21/20000 (0.105)%
+ added 20000 keys, tested 20000 keys, FPs=20/20000 (0.100)%
+ added 20000 keys, tested 20000 keys, FPs=24/20000 (0.120)%
+ added 20000 keys, tested 20000 keys, FPs=21/20000 (0.105)%
+ added 20000 keys, tested 20000 keys, FPs=22/20000 (0.110)%
+ added 20000 keys, tested 20000 keys, FPs=24/20000 (0.120)%
+ added 20000 keys, tested 20000 keys, FPs=15/20000 (0.075)%
+ added 20000 keys, tested 20000 keys, FPs=16/20000 (0.080)%
+ added 20000 keys, tested 20000 keys, FPs=16/20000 (0.080)%
+ added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+ added 20000 keys, tested 20000 keys, FPs=22/20000 (0.110)%
+ added 20000 keys, tested 20000 keys, FPs=21/20000 (0.105)%
+ added 20000 keys, tested 20000 keys, FPs=24/20000 (0.120)%
+ added 20000 keys, tested 20000 keys, FPs=16/20000 (0.080)%
+ added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+ added 20000 keys, tested 20000 keys, FPs=24/20000 (0.120)%
+ added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+ added 20000 keys, tested 20000 keys, FPs=19/20000 (0.095)%
+ Continuous BloomFilter 640000 adds + 640000 tests in 16.95s, 75537 ops/s
+
+ BloomFilter with capacity=60000, error=0.001(0.1%) -> m=862656, k=10
+ added 20000 keys, tested 20000 keys, FPs=0/20000 (0.000)%
+ added 20000 keys, tested 20000 keys, FPs=1/20000 (0.005)%
+ added 20000 keys, tested 20000 keys, FPs=17/20000 (0.085)%
+ added 20000 keys, tested 20000 keys, FPs=131/20000 (0.655)%
+ added 20000 keys, tested 20000 keys, FPs=453/20000 (2.265)%
+ added 20000 keys, tested 20000 keys, FPs=1162/20000 (5.810)%
+ BloomFilter 120000 adds + 120000 tests in 1.64s, 146008 ops/s
+ ```
+
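For reference, the m=862656, k=10 sizing shown in the output above is consistent with the standard Bloom filter formulas. A quick sketch follows; BloomHelper.find_m_k is presumed to use these or equivalent formulas, and its exact rounding is not shown in this diff:

``` ruby
# standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
n = 60_000  # capacity: 3 time slots x 20k keys
p = 0.001   # target error rate (0.1%)

m = (-n * Math.log(p) / Math.log(2)**2).ceil
k = (Math.log(2) * m / n).round

puts "m=#{m}, k=#{k}"  # => m=862656, k=10
```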
+ ## Benchmarks
+ All benchmarks have been run on a MacBook Pro with a 2.66GHz i7 and 8GB RAM on OSX 10.6.8 with MRI Ruby 1.9.3p194.
+
  ### Bloomfilter
- The Bloomfilter class is using the FFI FNV hashing by default, for speed and compatibility.

  ``` sh
  ruby benchmark/bloom_filter.rb
@@ -176,7 +199,6 @@ BloomFilter m=2875518, k=13 include? 119154 ops/s
  ```

  ### ContinuousBloomfilter
- The ContinuousBloomfilter class is using the FFI FNV hashing by default, for speed and compatibility.

  ``` sh
  ruby benchmark/continuous_bloom_filter.rb
@@ -211,6 +233,10 @@ ContinuousBloomFilter m=2875518, k=13 add+include 56606 ops/s
  ```

  ## JRuby
+ This has only been tested in Ruby **1.9** mode. JRuby 1.9 mode has to be enabled to run the tests and benchmarks.
+
+ Note that this is no longer necessary with JRuby 1.7, which runs in 1.9 mode by default.
+
  - to run specs use

  ``` sh
@@ -222,7 +248,7 @@ jruby --1.9 -S rake spec
  jruby --1.9 benchmark/some_benchmark.rb
  ```

- <a id="reference" />
+ <a id="references" />
  ## References ##
  - [Bloom filter on wikipedia](http://en.wikipedia.org/wiki/Bloom_filter)
  - [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
@@ -232,11 +258,15 @@ jruby --1.9 benchmark/some_benchmark.rb
  - [Producing n hash functions by hashing only once](http://willwhim.wordpress.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/)
  - [Less Hashing, Same Performance: Building a Better Bloom Filter](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.152.579&rep=rep1&type=pdf)

+ ## Credits
+ - [Ilya Grigorik](http://www.igvita.com/) for his inspiration with the [Time-based Bloom filters](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/).
+ - Authors of the [Stable Bloom filters research paper](http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf) which also provided inspiration.
+ - [Robey Pointer](https://github.com/robey) for his [Ruby FNV C extension implementation](https://github.com/robey/rbfnv).
+ - [Peter Cooper](http://www.petercooper.co.uk/) for inspiration with [his BitField class](http://dzone.com/snippets/bitfield-fastish-pure-ruby-bit).
+
  ## Author
- Colin Surprenant, [@colinsurprenant][twitter], [http://github.com/colinsurprenant][github], colin.surprenant@needium.com, colin.surprenant@gmail.com
+ Colin Surprenant, [@colinsurprenant](http://twitter.com/colinsurprenant), [http://github.com/colinsurprenant](http://github.com/colinsurprenant), colin.surprenant@gmail.com

  ## License
  Bloombroom is distributed under the Apache License, Version 2.0.

- [twitter]: http://twitter.com/colinsurprenant
- [github]: http://github.com/colinsurprenant
@@ -0,0 +1,23 @@
+ require 'bundler/setup'
+ require 'rake'
+ require 'rake/clean'
+ require 'bundler/gem_tasks'
+ require 'rspec/core/rake_task'
+ require 'ffi'
+ require 'ffi-compiler/compile_task'
+
+ task :default => [:clean, :compile_ffi, :spec]
+
+ desc "clean, make and run specs"
+ task :spec do
+   RSpec::Core::RakeTask.new
+ end
+
+ desc "FFI compiler"
+ namespace "ffi-compiler" do
+   FFI::Compiler::CompileTask.new('ffi/bloombroom/hash/ffi_fnv')
+ end
+ task :compile_ffi => ["ffi-compiler:default"]
+
+ CLEAN.include('ffi/**/*{.o,.log,.so,.bundle}')
+ CLEAN.include('lib/**/*{.o,.log,.so,.bundle}')
@@ -0,0 +1,35 @@
+ require 'bundler/setup'
+ require "benchmark"
+ require "digest/sha1"
+ require "bloombroom"
+
+ KEYS_COUNT = 150000
+ ERRORS = [0.01, 0.001, 0.0001]
+ TEST_M_K = ERRORS.map{|error| Bloombroom::BloomHelper.find_m_k(KEYS_COUNT, error)}
+
+ keys = KEYS_COUNT.times.map{|i| Digest::SHA1.hexdigest("#{i}#{rand(1000000)}")}
+
+ if !!(defined?(RUBY_ENGINE) && RUBY_ENGINE == 'jruby')
+   puts("warming JVM...")
+   bf = Bloombroom::BloomFilter.new(KEYS_COUNT, 7)
+   keys.each{|key| bf.add(key)}
+ end
+
+ puts("benchmarking for #{keys.size} keys with #{ERRORS.map{|e| "#{e * 100}%"}.join(", ")} error rates")
+
+ reports = []
+ Benchmark.bm(40) do |x|
+   TEST_M_K.each do |m, k|
+     bf = Bloombroom::BloomFilter.new(m, k)
+     adds = x.report("BloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} add") {keys.each{|key| bf.add(key)}}
+     includes = x.report("BloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} include?") {keys.each{|key| bf.include?(key)}}
+     reports << {:m => m, :k => k, :adds => adds, :includes => includes}
+   end
+ end
+
+ puts("\n")
+
+ reports.each do |report|
+   puts("BloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} add #{"%10.0f" % (keys.size / report[:adds].real)} ops/s")
+   puts("BloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} include? #{"%10.0f" % (keys.size / report[:includes].real)} ops/s")
+ end
@@ -0,0 +1,28 @@
+ require 'bundler/setup'
+ require 'bloombroom'
+ require 'benchmark/memory'
+
+ DEFAULT_M = 10000000
+ DEFAULT_K = 1
+ DEFAULT_CAPACITY = 1000000
+ DEFAULT_ERROR = 0.01
+
+ m,k = if ARGV[0] == "auto"
+   ARGV.shift
+   capacity = (ARGV.shift || DEFAULT_CAPACITY).to_i
+   error = (ARGV.shift || DEFAULT_ERROR).to_f
+   Bloombroom::BloomHelper.find_m_k(capacity, error)
+ else
+   m = (ARGV.shift || DEFAULT_M).to_i
+   k = (ARGV.shift || DEFAULT_K).to_i
+   [m ,k]
+ end
+
+ puts("bloomfilter m=#{m}, k=#{k}, size=#{m} bits / #{"%.1f" % ((m / 8) / 1024.0)}k")
+
+ before = Bloombroom::Process.rss
+ bf = Bloombroom::BloomFilter.new(m, k)
+ after = Bloombroom::Process.rss
+
+ puts("process size before=#{before}k, after=#{after}k")
+ puts("process size growth=#{(after - before)}k" )
@@ -0,0 +1,60 @@
+ require 'bundler/setup'
+ require "benchmark"
+ require "digest/sha1"
+ require "bloombroom"
+
+ KEYS_COUNT = 150000
+ ERRORS = [0.01, 0.001, 0.0001]
+ TEST_M_K = ERRORS.map{|error| Bloombroom::BloomHelper.find_m_k(KEYS_COUNT, error)}
+
+ keys = KEYS_COUNT.times.map{|i| Digest::SHA1.hexdigest("#{i}#{rand(1000000)}")}
+ slots = 10.times.map{|i| (KEYS_COUNT / 3).times.map{|i| Digest::SHA1.hexdigest("#{i}#{rand(1000000)}")}}
+
+ if !!(defined?(RUBY_ENGINE) && RUBY_ENGINE == 'jruby')
+   puts("warming JVM...")
+   bf = Bloombroom::ContinuousBloomFilter.new(*Bloombroom::BloomHelper.find_m_k(KEYS_COUNT, 0.001), 0)
+   keys.each{|key| bf.add(key)}
+ end
+
+ puts("benchmarking WITHOUT expiration for #{keys.size} keys with #{ERRORS.map{|e| "#{e * 100}%"}.join(", ")} error rates")
+
+ reports = []
+ Benchmark.bm(53) do |x|
+   TEST_M_K.each do |m, k|
+     bf = Bloombroom::ContinuousBloomFilter.new(m, k, 0)
+     adds = x.report("ContinuousBloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} add") {keys.each{|key| bf.add(key)}}
+     includes = x.report("ContinuousBloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} include?") {keys.each{|key| bf.include?(key)}}
+     reports << {:m => m, :k => k, :adds => adds, :includes => includes}
+   end
+ end
+
+ puts("\n")
+
+ reports.each do |report|
+   puts("ContinuousBloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} add #{"%10.0f" % (keys.size / report[:adds].real)} ops/s")
+   puts("ContinuousBloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} include? #{"%10.0f" % (keys.size / report[:includes].real)} ops/s")
+ end
+
+ puts("\nbenchmarking WITH expiration for #{slots.map(&:size).reduce(&:+)} keys with #{ERRORS.map{|e| "#{e * 100}%"}.join(", ")} error rates")
+
+ reports = []
+ Benchmark.bm(53) do |x|
+   TEST_M_K.each do |m, k|
+     bf = Bloombroom::ContinuousBloomFilter.new(m, k, 0)
+     addincludes = x.report("ContinuousBloomFilter m=#{"%07.0f" % m}, k=#{"%02.0f" % k} add+include") do
+       slots.each do |slot|
+         slot.each{|key| bf.add(key)}
+         slot.each{|key| bf.include?(key)}
+         bf.inc_time_slot
+       end
+     end
+
+     reports << {:m => m, :k => k, :addincludes => addincludes}
+   end
+ end
+
+ puts("\n")
+
+ reports.each do |report|
+   puts("ContinuousBloomFilter m=#{"%07.0f" % report[:m]}, k=#{"%02.0f" % report[:k]} add+include #{"%10.0f" % (slots.map(&:size).reduce(&:+) * 2 / report[:addincludes].real)} ops/s")
+ end
@@ -0,0 +1,28 @@
+ require 'bundler/setup'
+ require 'benchmark/memory'
+ require "bloombroom"
+
+ DEFAULT_M = 10000000
+ DEFAULT_K = 1
+ DEFAULT_CAPACITY = 1000000
+ DEFAULT_ERROR = 0.01
+
+ m,k = if ARGV[0] == "auto"
+   ARGV.shift
+   capacity = (ARGV.shift || DEFAULT_CAPACITY).to_i
+   error = (ARGV.shift || DEFAULT_ERROR).to_f
+   Bloombroom::BloomHelper.find_m_k(capacity, error)
+ else
+   m = (ARGV.shift || DEFAULT_M).to_i
+   k = (ARGV.shift || DEFAULT_K).to_i
+   [m ,k]
+ end
+
+ puts("continuous bloomfilter m=#{m}, k=#{k}, size=#{m * Bloombroom::ContinuousBloomFilter::BITS_PER_BUCKET} bits / #{"%.1f" % (((m * Bloombroom::ContinuousBloomFilter::BITS_PER_BUCKET) / 8) / 1024.0)}k")
+
+ before = Bloombroom::Process.rss
+ bf = Bloombroom::ContinuousBloomFilter.new(m, k, 0)
+ after = Bloombroom::Process.rss
+
+ puts("process size before=#{before}k, after=#{after}k")
+ puts("process size growth=#{(after - before)}k" )
@@ -0,0 +1,63 @@
+ require 'bundler/setup'
+ require "benchmark"
+ require "digest/sha1"
+ require "bloombroom"
+
+ module Bloombroom
+
+   KEYS_PER_SLOT = 20000
+   SLOTS_PER_FILTER = 3
+   KEY_VALUE_RANGE = 100000000
+
+   puts("\ngenerating lots of random keys")
+   slots = 32.times.map do
+     add = {}
+     KEYS_PER_SLOT.times.each{|i| add["#{i}#{Digest::SHA1.hexdigest(rand(KEY_VALUE_RANGE).to_s)}"] = true}
+
+     free = []
+     while free.size < add.size
+       key = "#{Digest::SHA1.hexdigest(rand(KEY_VALUE_RANGE).to_s)}"
+       free << key unless add.has_key?(key)
+     end
+
+     [add.keys, free]
+   end
+
+   # puts(slots.map{|slot| slot.first.size}.inspect)
+   # puts(slots.map{|slot| slot.last.size}.inspect)
+
+   capacity = KEYS_PER_SLOT * SLOTS_PER_FILTER
+   error = 0.001 # 0.001 == 0.1%
+
+   m, k = BloomHelper.find_m_k(capacity, error)
+   puts("\nContinuous BloomFilter with capacity=#{capacity}, error=#{error}(#{error * 100}%) -> m=#{m}, k=#{k}")
+   bf = ContinuousBloomFilter.new(m, k, 0)
+
+   t = Benchmark.realtime do
+     slots.each do |slot|
+       slot.first.each{|key| bf.add(key)}
+       false_positives = 0
+       slot.last.each{|key| false_positives += 1 if bf.include?(key)}
+       puts("added #{slot.first.size} keys, tested #{slot.last.size} keys, FPs=#{false_positives}/#{slot.last.size} (#{"%.3f" % ((false_positives * 100) / Float(slot.last.size))})%")
+       bf.inc_time_slot
+     end
+   end
+   n = slots.size * KEYS_PER_SLOT
+   puts("Continuous BloomFilter #{n} adds + #{n} tests in #{"%.2f" % t}s, #{"%2.0f" % ((2 * n) / t)} ops/s")
+
+
+   puts("\nBloomFilter with capacity=#{capacity}, error=#{error}(#{error * 100}%) -> m=#{m}, k=#{k}")
+   bf = BloomFilter.new(m, k)
+
+   t = Benchmark.realtime do
+     slots[0, SLOTS_PER_FILTER + 3].each do |slot|
+       slot.first.each{|key| bf.add(key); n += 1}
+       false_positives = 0
+       slot.last.each{|key| false_positives += 1 if bf.include?(key)}
+       puts("added #{slot.first.size} keys, tested #{slot.last.size} keys, FPs=#{false_positives}/#{slot.last.size} (#{"%.3f" % ((false_positives * 100) / Float(slot.last.size))})%")
+     end
+   end
+   n = (SLOTS_PER_FILTER + 3) * KEYS_PER_SLOT
+   puts("BloomFilter #{n} adds + #{n} tests in #{"%.2f" % t}s, #{"%2.0f" % ((2 * n) / t)} ops/s")
+
+ end