bloombroom 1.0.0

@@ -0,0 +1 @@
1
+ tbd
@@ -0,0 +1,11 @@
1
+ Licensed under the Apache License, Version 2.0 (the "License");
2
+ you may not use this file except in compliance with the License.
3
+ You may obtain a copy of the License at
4
+
5
+ http://www.apache.org/licenses/LICENSE-2.0
6
+
7
+ Unless required by applicable law or agreed to in writing, software
8
+ distributed under the License is distributed on an "AS IS" BASIS,
9
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ See the License for the specific language governing permissions and
11
+ limitations under the License.
@@ -0,0 +1,242 @@
1
+ # Bloombroom v1.0.0
2
+
3
+ - Standard **BloomFilter** class for a bounded key space
+ - **ContinuousBloomFilter** class for unbounded keys (**stream**)
+ - BitField class
+ - BitBucketField class (multiple bits per bucket)
+ - FNV hash classes: native Ruby, C extension and FFI extension
8
+
9
+ The Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. See [wikipedia](http://en.wikipedia.org/wiki/Bloom_filter).
10
+
11
+ Bloom filters are normally used with a bounded set, since the filter size must be fixed in advance. For a given filter capacity, the total number of bits determines the false positive rate. The number of bits required can be computed from the required filter capacity and the target error rate. See the [references](#references) section for more info.
12
+
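The sizing rule described above can be sketched in plain Ruby. This is a minimal illustration using the standard Bloom filter formulas; `optimal_m_k` is a hypothetical stand-alone helper written for this example and does not require the gem.

```ruby
# Hypothetical helper for illustration; it applies the standard formulas
#   m = -n * ln(p) / (ln 2)^2   and   k = (m / n) * ln 2
def optimal_m_k(capacity, error)
  m = (-capacity * Math.log(error) / (Math.log(2)**2)).ceil # total number of bits
  k = (Math.log(2) * m / capacity).round                     # number of hash functions
  [m, k]
end

m, k = optimal_m_k(150_000, 0.01)
puts "m=#{m} bits (#{(m / 150_000.0).round(2)} bits per key), k=#{k}"
```

For 150000 keys at a 1% error rate this yields m=1437759 and k=7, the same parameters that appear in the benchmark output further down.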
13
+ ### ContinuousBloomfilter
14
+ The **ContinuousBloomFilter** provides a bloom filter implementation that supports an unbounded stream of elements. Elements expire after a chosen TTL. At initialization the filter capacity must be estimated as the number of elements expected over the given TTL period.
15
+
16
+ For example, deduplicating a stream with a rate of **5000 items/sec** over a period of **60 minutes** would require a filter capacity of **18M** elements. For a required error rate of **0.1%** the filter would need **246mb** of memory (which includes all Ruby object overhead).
17
+
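The capacity estimate in this example is simply rate multiplied by TTL. The sketch below works through that arithmetic and derives the bucket count and number of hash functions from the standard Bloom filter formulas. The variable names are illustrative only, and nothing here calls the gem.

```ruby
rate_per_second = 5_000
ttl_seconds     = 60 * 60
capacity        = rate_per_second * ttl_seconds # 18_000_000 elements

error = 0.001 # 0.1% target error rate

# Standard Bloom filter sizing, applied to buckets instead of single bits
buckets = (-capacity * Math.log(error) / (Math.log(2)**2)).ceil
k       = (Math.log(2) * buckets / capacity).round

puts "capacity=#{capacity} buckets=#{buckets} k=#{k}"
```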
18
+ The ContinuousBloomFilter uses 4 bits for each filter *position* or *bucket* (instead of 1 bit in a normal bloom filter) to keep track of each key's TTL.
+ The internal timer resolution is set to half of the required TTL (resolution divisor of 2). Using 4 bits gives us
+ 15 usable time slots (slot 0 is for the unset state). Basically the internal time bookkeeping is similar to a
+ ring buffer using the current timer tick modulo 15. The timer ticks will be time slot=1, 2, ... 15, 1, 2 and so on. The total
+ time of our internal clock will thus be 15 * (TTL / 2). We keep track of TTL by writing the current time slot
+ in the key's k buckets when it is inserted in the filter. On a key lookup, if the interval between the current time slot and any of the k bucket values
+ is greater than 2 (resolution divisor) we know this key is expired. See [continuous_bloom_filter.rb](https://github.com/colinsurprenant/bloombroom/blob/master/lib/bloombroom/filter/continuous_bloom_filter.rb).
25
+
26
+ This means that an element is guaranteed not to expire before the given TTL, but in the worst case it could survive until 3 * (TTL / 2).
27
+
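The slot bookkeeping described above can be sketched in a few lines of stand-alone Ruby. The `SlotRing` module and its method names are hypothetical and written only for this illustration. It mirrors the rules in the text: slots cycle through 1..15, slot 0 means unset, and a bucket is expired once more than 2 ticks have elapsed.

```ruby
# Hypothetical stand-alone model of the time slot ring; not the gem's class.
module SlotRing
  MAX_SLOT = 15          # 4 bits give slots 1..15, slot 0 is the unset state
  RESOLUTION_DIVISOR = 2 # one tick lasts TTL / 2

  module_function

  # next slot in the ring: 1, 2, ..., 15, 1, 2, ...
  def next_slot(slot)
    (slot % MAX_SLOT) + 1
  end

  # number of ticks between the slot stored in a bucket and the current slot
  def elapsed(start_slot, current_slot)
    current_slot >= start_slot ? current_slot - start_slot : current_slot + MAX_SLOT - start_slot
  end

  def expired?(start_slot, current_slot)
    start_slot == 0 || elapsed(start_slot, current_slot) > RESOLUTION_DIVISOR
  end
end

stored = 14                                      # key written during slot 14
slot = stored
2.times { slot = SlotRing.next_slot(slot) }      # ticks to 15, then 1
puts SlotRing.expired?(stored, slot)             # false: only 2 ticks elapsed
slot = SlotRing.next_slot(slot)                  # third tick, slot 2
puts SlotRing.expired?(stored, slot)             # true: 3 ticks elapsed
```

A key written just before a tick expires a little after 2 ticks (one TTL), while a key written just after a tick survives almost 3 ticks, which is where the 3 * (TTL / 2) worst case comes from.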
28
+ ### Hashing
29
+ Bloom filters require multiple (k) hash functions for each inserted element. We simulate the k hash functions with just two, which are the upper and lower 32 bits of our FFI FNV1a 64-bit hash: double hashing with a single hash computation, which is very fast. See [bloom_helper.rb](https://github.com/colinsurprenant/bloombroom/blob/master/lib/bloombroom/filter/bloom_helper.rb) and the [references](#references) section for more info on this technique.
30
+
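As a rough illustration of this technique, the sketch below computes a 64-bit FNV-1a hash in pure Ruby, splits it into two 32-bit halves, and combines them into k bucket positions. The function names `fnv1a_64_pure` and `positions_for` are made up for this example. The gem itself uses its FFI hash, and this pure Ruby version is much slower.

```ruby
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME  = 0x100000001b3      # 1099511628211
FNV64_MASK   = (1 << 64) - 1

# FNV-1a: XOR the byte in first, then multiply, keeping the result within 64 bits
def fnv1a_64_pure(key)
  key.each_byte.reduce(FNV64_OFFSET) { |h, byte| ((h ^ byte) * FNV64_PRIME) & FNV64_MASK }
end

# derive k positions in a filter of m buckets from a single 64-bit hash
def positions_for(key, k, m)
  h = fnv1a_64_pure(key)
  a = h >> 32        # upper 32 bits act as the first hash
  b = h & 0xFFFFFFFF # lower 32 bits act as the second hash
  Array.new(k) { |i| (a + b * (i + 1)) % m }
end

p positions_for("key1", 3, 1000)
```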
31
+
32
+ ## Installation
33
+ Tested on MRI Ruby 1.9.2 and 1.9.3, and on JRuby 1.6.7 in 1.9 mode.
34
+
35
+ ``` sh
36
+ $ gem install bloombroom
37
+ ```
38
+
39
+ ## Examples
40
+
41
+ ### Standard Bloom filter
42
+ ``` ruby
43
+ require 'bloombroom'
44
+
45
+ bf = Bloombroom::BloomFilter.new(1000, 3) # 1000 bits and 3 hash functions
46
+
47
+ bf.add("key1")
48
+ bf.add("key2")
49
+
50
+ bf.include?("key1") # => true
51
+ bf.include?("key3") # => false
52
+ ```
53
+
54
+ ``` ruby
55
+ require 'bloombroom'
56
+
57
+ # compute optimal m,k for a filter capacity of 1000 elements and 0.1% error rate
58
+ m, k = Bloombroom::BloomHelper.find_m_k(1000, 0.001)
59
+
60
+ bf = Bloombroom::BloomFilter.new(m, k)
61
+
62
+ bf << "key1"
63
+ bf << "key2"
64
+
65
+ bf["key1"] # => true
66
+ bf["key3"] # => false
67
+ ```
68
+ ### Continuous Bloom filter
69
+
70
+ ``` ruby
71
+ require 'bloombroom'
72
+
73
+ # 1000 buckets, 3 hash functions and a TTL of 2 seconds
74
+ bf = Bloombroom::ContinuousBloomFilter.new(1000, 3, 2)
75
+ bf.start_timer
76
+
77
+ bf << "key1"
78
+ bf << "key2"
79
+
80
+ bf["key1"] # => true
81
+ bf["key2"] # => true
82
+ bf["key3"] # => false
83
+
84
+ sleep(4) # worst-case expiry is 3 * (TTL / 2) = 3 seconds, so wait slightly longer
85
+
86
+ bf["key1"] # => false
87
+ bf["key2"] # => false
88
+ bf["key3"] # => false
89
+ ```
90
+
91
+ ## Memory footprint
92
+ The calculated memory footprints **include all Ruby object overhead**: the footprint is calculated by querying the process size (rss) before and after the bloom filter object initialization.
93
+
94
+ ### Bloomfilter
95
+ ``` sh
96
+ ruby benchmark/bloom_filter_memory.rb auto 1000000 0.01
97
+ ruby benchmark/bloom_filter_memory.rb auto 100000000 0.01
98
+ ruby benchmark/bloom_filter_memory.rb auto 100000000 0.001
99
+ ```
100
+
101
+ - **1.0%** error rate for **1M** keys: **2.3mb**
102
+ - **1.0%** error rate for **100M** keys: **228mb**
103
+ - **0.1%** error rate for **100M** keys: **342mb**
104
+
105
+ ### ContinuousBloomfilter
106
+ ``` sh
107
+ ruby benchmark/continuous_bloom_filter_memory.rb auto 1000000 0.01
108
+ ruby benchmark/continuous_bloom_filter_memory.rb auto 100000000 0.01
109
+ ruby benchmark/continuous_bloom_filter_memory.rb auto 100000000 0.001
110
+ ```
111
+
112
+ - **1.0%** error rate for **1M** keys: **9.1mb**
113
+ - **1.0%** error rate for **100M** keys: **914mb**
114
+ - **0.1%** error rate for **100M** keys: **1371mb**
115
+
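The measured figures above can be approximated with a back-of-the-envelope model. The sketch below is only an estimate under two assumptions that are not stated in this README: the bits are packed 32 per Array element, and each Array element costs 8 bytes on a 64-bit MRI build. `estimated_mib` is a hypothetical helper, not part of the gem or its benchmark scripts.

```ruby
# Hypothetical estimate; assumes 32 usable bits per Array slot and 8 bytes per slot.
def estimated_mib(capacity, error, bits_per_position)
  positions = (-capacity * Math.log(error) / (Math.log(2)**2)).ceil
  per_slot  = 32 / bits_per_position          # 32 for a bit field, 8 for 4-bit buckets
  slots     = (positions - 1) / per_slot + 1
  slots * 8 / (1024.0 * 1024.0)
end

[[1_000_000, 0.01], [100_000_000, 0.01], [100_000_000, 0.001]].each do |capacity, error|
  standard   = estimated_mib(capacity, error, 1).round(1)
  continuous = estimated_mib(capacity, error, 4).round(1)
  puts "#{capacity} keys at #{error * 100}%: standard ~#{standard} MiB, continuous ~#{continuous} MiB"
end
```

Under these assumptions the estimates land close to the measured values, and the continuous filter comes out four times larger because each position takes 4 bits instead of 1.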
116
+
117
+ ## Benchmarks
118
+ All benchmarks were run on a MacBook Pro (2.66GHz i7, 8GB RAM) running OS X 10.6.8 with MRI Ruby 1.9.3p194.
119
+
120
+ ### Hashing
121
+ The Hashing benchmark compares the performance of SHA1, MD5, two native Ruby FNV implementations (A & B), and a C implementation exposed both as a C extension and as an FFI extension, for 32- and 64-bit hashes.
122
+
123
+ ``` sh
124
+ ruby benchmark/fnv.rb
125
+ ```
126
+
127
+ ```
128
+ benchmarking for 1000000 iterations
129
+ user system total real
130
+ MD5: 1.900000 0.010000 1.910000 ( 1.912995)
131
+ SHA-1: 2.110000 0.000000 2.110000 ( 2.109739)
132
+ native FNV A 32: 32.470000 0.110000 32.580000 ( 32.596759)
133
+ native FNV A 64: 38.330000 0.570000 38.900000 ( 38.923384)
134
+ native FNV B 32: 4.870000 0.020000 4.890000 ( 4.882862)
135
+ native FNV B 64: 37.700000 0.110000 37.810000 ( 37.842873)
136
+ ffi FNV 32: 0.760000 0.010000 0.770000 ( 0.754941)
137
+ ffi FNV 64: 0.890000 0.000000 0.890000 ( 0.901954)
138
+ c-ext FNV 32: 0.310000 0.000000 0.310000 ( 0.307131)
139
+ c-ext FNV 64: 0.480000 0.000000 0.480000 ( 0.485310)
140
+
141
+ MD5: 522740 ops/s
142
+ SHA-1: 473992 ops/s
143
+ native FNV A 32: 30678 ops/s
144
+ native FNV A 64: 25691 ops/s
145
+ native FNV B 32: 204798 ops/s
146
+ native FNV B 64: 26425 ops/s
147
+ ffi FNV 32: 1324607 ops/s
148
+ ffi FNV 64: 1108704 ops/s
149
+ c-ext FNV 32: 3255939 ops/s
150
+ c-ext FNV 64: 2060538 ops/s
151
+ ```
152
+
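The ops/s figures in the output above are consistent with dividing the iteration count by the real (wall clock) time and truncating. The sketch below shows that arithmetic and a small timing loop with Ruby's standard `Benchmark` module. It is an assumption about how the numbers relate, not a copy of `benchmark/fnv.rb`, and `ops_per_second` is a hypothetical helper.

```ruby
require 'benchmark'
require 'digest/md5'

# hypothetical helper: operations per second, truncated to an integer
def ops_per_second(iterations, real_seconds)
  (iterations / real_seconds).floor
end

# reproduces the MD5 line from the output above: 1000000 iterations in 1.912995s
puts ops_per_second(1_000_000, 1.912995)

# timing a small loop locally; results depend on the machine
iterations = 100_000
real = Benchmark.realtime { iterations.times { Digest::MD5.digest("some key") } }
puts "MD5: #{ops_per_second(iterations, real)} ops/s"
```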
153
+ ### Bloomfilter
154
+ The BloomFilter class uses the FFI FNV hashing by default, for speed and compatibility.
155
+
156
+ ``` sh
157
+ ruby benchmark/bloom_filter.rb
158
+ ```
159
+
160
+ ```
161
+ benchmarking for 150000 keys with 1.0%, 0.1%, 0.01% error rates
162
+ user system total real
163
+ BloomFilter m=1437759, k=07 add 0.940000 0.000000 0.940000 ( 0.948075)
164
+ BloomFilter m=1437759, k=07 include? 0.830000 0.010000 0.840000 ( 0.834414)
165
+ BloomFilter m=2156639, k=10 add 1.220000 0.000000 1.220000 ( 1.227294)
166
+ BloomFilter m=2156639, k=10 include? 1.050000 0.010000 1.060000 ( 1.052358)
167
+ BloomFilter m=2875518, k=13 add 1.500000 0.010000 1.510000 ( 1.516086)
168
+ BloomFilter m=2875518, k=13 include? 1.260000 0.010000 1.270000 ( 1.258877)
169
+
170
+ BloomFilter m=1437759, k=07 add 158215 ops/s
171
+ BloomFilter m=1437759, k=07 include? 179767 ops/s
172
+ BloomFilter m=2156639, k=10 add 122220 ops/s
173
+ BloomFilter m=2156639, k=10 include? 142537 ops/s
174
+ BloomFilter m=2875518, k=13 add 98939 ops/s
175
+ BloomFilter m=2875518, k=13 include? 119154 ops/s
176
+ ```
177
+
178
+ ### ContinuousBloomfilter
179
+ The ContinuousBloomFilter class uses the FFI FNV hashing by default, for speed and compatibility.
180
+
181
+ ``` sh
182
+ ruby benchmark/continuous_bloom_filter.rb
183
+ ```
184
+
185
+ ```
186
+ benchmarking WITHOUT expiration for 150000 keys with 1.0%, 0.1%, 0.01% error rates
187
+ user system total real
188
+ ContinuousBloomFilter m=1437759, k=07 add 1.720000 0.000000 1.720000 ( 1.733903)
189
+ ContinuousBloomFilter m=1437759, k=07 include? 1.630000 0.010000 1.640000 ( 1.630668)
190
+ ContinuousBloomFilter m=2156639, k=10 add 2.130000 0.010000 2.140000 ( 2.142091)
191
+ ContinuousBloomFilter m=2156639, k=10 include? 2.160000 0.000000 2.160000 ( 2.159395)
192
+ ContinuousBloomFilter m=2875518, k=13 add 2.650000 0.010000 2.660000 ( 2.655585)
193
+ ContinuousBloomFilter m=2875518, k=13 include? 2.570000 0.010000 2.580000 ( 2.586032)
194
+
195
+ ContinuousBloomFilter m=1437759, k=07 add 86510 ops/s
196
+ ContinuousBloomFilter m=1437759, k=07 include? 91987 ops/s
197
+ ContinuousBloomFilter m=2156639, k=10 add 70025 ops/s
198
+ ContinuousBloomFilter m=2156639, k=10 include? 69464 ops/s
199
+ ContinuousBloomFilter m=2875518, k=13 add 56485 ops/s
200
+ ContinuousBloomFilter m=2875518, k=13 include? 58004 ops/s
201
+
202
+ benchmarking WITH expiration for 500000 keys with 1.0%, 0.1%, 0.01% error rates
203
+ user system total real
204
+ ContinuousBloomFilter m=1437759, k=07 add+include 11.110000 0.040000 11.150000 ( 11.146869)
205
+ ContinuousBloomFilter m=2156639, k=10 add+include 14.220000 0.040000 14.260000 ( 14.269583)
206
+ ContinuousBloomFilter m=2875518, k=13 add+include 17.600000 0.060000 17.660000 ( 17.665917)
207
+
208
+ ContinuousBloomFilter m=1437759, k=07 add+include 89711 ops/s
209
+ ContinuousBloomFilter m=2156639, k=10 add+include 70079 ops/s
210
+ ContinuousBloomFilter m=2875518, k=13 add+include 56606 ops/s
211
+ ```
212
+
213
+ ## JRuby
214
+ - to run specs use
215
+
216
+ ``` sh
217
+ jruby --1.9 -S rake spec
218
+ ```
219
+ - to run benchmarks use
220
+
221
+ ``` sh
222
+ jruby --1.9 benchmark/some_benchmark.rb
223
+ ```
224
+
225
+ <a id="references" />
226
+ ## References ##
227
+ - [Bloom filter on wikipedia](http://en.wikipedia.org/wiki/Bloom_filter)
228
+ - [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
229
+ - [Flow Analysis & Time-based Bloom Filters](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/)
230
+ - [Stable Bloom filters](http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf)
231
+ - [The maths to compute optimal m and k ](http://www.siaris.net/index.cgi/Programming/LanguageBits/Ruby/BloomFilter.rdoc)
232
+ - [Producing n hash functions by hashing only once](http://willwhim.wordpress.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/)
233
+ - [Less Hashing, Same Performance: Building a Better Bloom Filter](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.152.579&rep=rep1&type=pdf)
234
+
235
+ ## Author
236
+ Colin Surprenant, [@colinsurprenant][twitter], [http://github.com/colinsurprenant][github], colin.surprenant@needium.com, colin.surprenant@gmail.com
237
+
238
+ ## License
239
+ Bloombroom is distributed under the Apache License, Version 2.0.
240
+
241
+ [twitter]: http://twitter.com/colinsurprenant
242
+ [github]: http://github.com/colinsurprenant
@@ -0,0 +1,91 @@
1
+ /*
2
+ * based on https://github.com/robey/rbfnv with various fixes from forks
3
+ */
4
+
5
+ #include <stdint.h>
6
+ #include "ruby.h"
7
+
8
+ #define PRIME32 16777619
9
+ #define PRIME64 1099511628211ULL
10
+
11
+ /**
12
+ * FNV fast hashing algorithm in 32 bits.
13
+ * @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
14
+ */
15
+ uint32_t fnv1_32(const char *data, uint64_t len) {
16
+ uint32_t rv = 0x811c9dc5U;
17
+ uint64_t i;
18
+ for (i = 0; i < len; i++) {
19
+ rv = (rv * PRIME32) ^ (unsigned char)(data[i]);
20
+ }
21
+ return rv;
22
+ }
23
+
24
+ /**
25
+ * FNV fast hashing algorithm in 32 bits, variant with operations reversed.
26
+ * @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
27
+ */
28
+ uint32_t fnv1a_32(const char *data, uint64_t len) {
29
+ uint32_t rv = 0x811c9dc5U;
30
+ uint64_t i;
31
+ for (i = 0; i < len; i++) {
32
+ rv = (rv ^ (unsigned char)data[i]) * PRIME32;
33
+ }
34
+ return rv;
35
+ }
36
+
37
+ /**
38
+ * FNV fast hashing algorithm in 64 bits.
39
+ * @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
40
+ */
41
+ uint64_t fnv1_64(const char *data, uint64_t len) {
42
+ uint64_t rv = 0xcbf29ce484222325ULL;
43
+ uint64_t i;
44
+ for (i = 0; i < len; i++) {
45
+ rv = (rv * PRIME64) ^ (unsigned char)data[i];
46
+ }
47
+ return rv;
48
+ }
49
+
50
+ /**
51
+ * FNV fast hashing algorithm in 64 bits, variant with operations reversed.
52
+ * @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
53
+ */
54
+ uint64_t fnv1a_64(const char *data, uint64_t len) {
55
+ uint64_t rv = 0xcbf29ce484222325ULL;
56
+ uint64_t i;
57
+ for (i = 0; i < len; i++) {
58
+ rv = (rv ^ (unsigned char)data[i]) * PRIME64;
59
+ }
60
+ return rv;
61
+ }
62
+
63
+ /* ----- ruby bindings ----- */
64
+
65
+ VALUE rb_fnv1_32(VALUE self, VALUE data) {
+   StringValue(data); /* raise TypeError instead of reading a non-String as a string */
+   return UINT2NUM(fnv1_32(RSTRING_PTR(data), RSTRING_LEN(data)));
+ }
+
+ VALUE rb_fnv1a_32(VALUE self, VALUE data) {
+   StringValue(data);
+   return UINT2NUM(fnv1a_32(RSTRING_PTR(data), RSTRING_LEN(data)));
+ }
+
+ VALUE rb_fnv1_64(VALUE self, VALUE data) {
+   StringValue(data);
+   return ULL2NUM(fnv1_64(RSTRING_PTR(data), RSTRING_LEN(data)));
+ }
+
+ VALUE rb_fnv1a_64(VALUE self, VALUE data) {
+   StringValue(data);
+   return ULL2NUM(fnv1a_64(RSTRING_PTR(data), RSTRING_LEN(data)));
+ }
80
+
81
+ VALUE rb_class;
82
+ VALUE rb_module;
83
+
84
+ void Init_cext_fnv() {
85
+ rb_module = rb_define_module("Bloombroom");
86
+ rb_class = rb_define_class_under(rb_module, "FNVEXT", rb_cObject);
87
+ rb_define_singleton_method(rb_class, "fnv1_32", rb_fnv1_32, 1);
88
+ rb_define_singleton_method(rb_class, "fnv1a_32", rb_fnv1a_32, 1);
89
+ rb_define_singleton_method(rb_class, "fnv1_64", rb_fnv1_64, 1);
90
+ rb_define_singleton_method(rb_class, "fnv1a_64", rb_fnv1a_64, 1);
91
+ }
@@ -0,0 +1,3 @@
1
+ require 'mkmf'
2
+
3
+ create_makefile 'bloombroom/hash/cext_fnv'
@@ -0,0 +1,3 @@
1
+ require 'mkmf'
2
+
3
+ create_makefile 'bloombroom/hash/ffi_fnv'
@@ -0,0 +1,60 @@
1
+ /*
2
+ * based on https://github.com/robey/rbfnv with various fixes from forks
3
+ */
4
+
5
+ #include <stdint.h>
6
+
7
+ #define PRIME32 16777619
8
+ #define PRIME64 1099511628211ULL
9
+
10
+ /**
11
+ * FNV fast hashing algorithm in 32 bits.
12
+ * @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
13
+ */
14
+ uint32_t fnv1_32(const char *data, uint32_t len) {
15
+ uint32_t rv = 0x811c9dc5U;
16
+ uint32_t i;
17
+ for (i = 0; i < len; i++) {
18
+ rv = (rv * PRIME32) ^ (unsigned char)(data[i]);
19
+ }
20
+ return rv;
21
+ }
22
+
23
+ /**
24
+ * FNV fast hashing algorithm in 32 bits, variant with operations reversed.
25
+ * @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
26
+ */
27
+ uint32_t fnv1a_32(const char *data, uint32_t len) {
28
+ uint32_t rv = 0x811c9dc5U;
29
+ uint32_t i;
30
+ for (i = 0; i < len; i++) {
31
+ rv = (rv ^ (unsigned char)data[i]) * PRIME32;
32
+ }
33
+ return rv;
34
+ }
35
+
36
+ /**
37
+ * FNV fast hashing algorithm in 64 bits.
38
+ * @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
39
+ */
40
+ uint64_t fnv1_64(const char *data, uint32_t len) {
41
+ uint64_t rv = 0xcbf29ce484222325ULL;
42
+ uint32_t i;
43
+ for (i = 0; i < len; i++) {
44
+ rv = (rv * PRIME64) ^ (unsigned char)data[i];
45
+ }
46
+ return rv;
47
+ }
48
+
49
+ /**
50
+ * FNV fast hashing algorithm in 64 bits, variant with operations reversed.
51
+ * @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
52
+ */
53
+ uint64_t fnv1a_64(const char *data, uint32_t len) {
54
+ uint64_t rv = 0xcbf29ce484222325ULL;
55
+ uint32_t i;
56
+ for (i = 0; i < len; i++) {
57
+ rv = (rv ^ (unsigned char)data[i]) * PRIME64;
58
+ }
59
+ return rv;
60
+ }
@@ -0,0 +1,13 @@
1
+ require "bloombroom/version"
2
+ require "bloombroom/bits/bit_field"
3
+ require "bloombroom/bits/bit_bucket_field"
4
+ require "bloombroom/filter/bloom_helper"
5
+ require "bloombroom/filter/bloom_filter"
6
+ require "bloombroom/filter/continuous_bloom_filter"
7
+ require "bloombroom/hash/fnv_a"
8
+ require "bloombroom/hash/fnv_b"
9
+ require "bloombroom/hash/cext_fnv"
10
+ require "bloombroom/hash/ffi_fnv"
11
+
12
+ module Bloombroom
13
+ end
@@ -0,0 +1,90 @@
1
+ # create a bit bucket field of 100 buckets of 4 bits
2
+ # bf = BitBucketField.new(4, 100)
3
+ #
4
+ # bf[10] = 5 or bf.set(10, 5)
5
+ # bf[10] => 5 or bf.get(10) => 5
6
+ # bf[10] = 0
7
+ # bf.zero?(10) => true
8
+ #
9
+ # bf.to_s = "10101000101010101"
10
+ # bf.to_s(2) = "10101000101010101"
11
+ # bf.to_s(10) = "5 23 7"
12
+
13
+ module Bloombroom
14
+ class BitBucketField
15
+ attr_reader :size
16
+ include Enumerable
17
+
18
+ ELEMENT_WIDTH = 32
19
+
20
+ # new BitBucketField
21
+ # @param bits [Fixnum] number of bits per bucket
22
+ # @param size [Fixnum] number of buckets in field
23
+ def initialize(bits, size)
24
+ @size = size
25
+ @bits = bits
26
+ @buckets_per_element = ELEMENT_WIDTH / bits
27
+ @field = Array.new(((size - 1) / @buckets_per_element) + 1, 0)
28
+ @bucket_mask = (2 ** @bits) - 1
29
+ end
30
+
31
+ # set a bucket
32
+ # @param position [Fixnum] bucket position
33
+ # @param value [Fixnum] bucket value
34
+ def []=(position, value)
35
+ element, offset = position.divmod(@buckets_per_element)
36
+ shift_bits = offset * @bits
37
+ if value == 0
38
+ @field[element] &= ~(@bucket_mask << shift_bits)
39
+ else
40
+ @field[element] = (@field[element] & ~(@bucket_mask << shift_bits)) | value << shift_bits
41
+ end
42
+ end
43
+ alias_method :set, :[]=
44
+
45
+ # read a bucket
46
+ # @param position [Fixnum] bucket position
47
+ # @return [Fixnum] bucket value
48
+ def [](position)
49
+ element, offset = position.divmod(@buckets_per_element)
50
+ shift_bits = offset * @bits
51
+ (@field[element] & (@bucket_mask << shift_bits)) >> shift_bits
52
+ end
53
+ alias_method :get, :[]
54
+
55
+ def zero?(position)
56
+ element, offset = position.divmod(@buckets_per_element)
57
+ shift_bits = offset * @bits
58
+ (@field[element] & (@bucket_mask << shift_bits)) == 0
59
+ end
60
+
61
+ def inc(position)
62
+ end
63
+
64
+ def dec(position)
65
+ end
66
+
67
+ # iterate over each bucket
68
+ def each(&block)
69
+ @size.times { |position| yield self[position] }
70
+ end
71
+
72
+ # returns the field as a string like "0101010100111100," etc.
73
+ def to_s(base = 2)
74
+ case base
75
+ when 2
76
+ inject("") { |a, b| a + "%0#{@bits}b " % b }.strip
77
+ when 10
78
+ self.inject("") { |a, b| a + "%1d " % b }.strip
79
+ else
80
+ raise(ArgumentError, "unsupported base")
81
+ end
82
+ end
83
+
84
+ # returns the total number of non zero buckets
85
+ def total_set
86
+ self.inject(0) { |a, bucket| a += bucket.zero? ? 0 : 1; a }
87
+ end
88
+
89
+ end
90
+ end
@@ -0,0 +1,90 @@
1
+ # inspired by Peter Cooper's http://snippets.dzone.com/posts/show/4234
2
+ #
3
+ # create a bit field 1000 bits wide
4
+ # bf = BitField.new(1000)
5
+ #
6
+ # bf[100] = 1 or bf.set(100)
7
+ # bf[100] => 1 or bf.get(100) => 1
8
+ # bf[100] = 0 or bf.unset(100)
9
+ # bf.zero?(100) => true
10
+ #
11
+ # bf.to_s = "10101000101010101"
12
+ # bf.total_set => 10 (example - 10 bits are set to "1")
13
+
14
+ module Bloombroom
15
+ class BitField
16
+ attr_reader :size
17
+ include Enumerable
18
+
19
+ ELEMENT_WIDTH = 32
20
+
21
+ def initialize(size)
22
+ @size = size
23
+ @field = Array.new(((size - 1) / ELEMENT_WIDTH) + 1, 0)
24
+ end
25
+
26
+ # set a bit
27
+ # @param position [Fixnum] bit position
28
+ # @param value [Fixnum] bit value 0/1
29
+ def []=(position, value)
30
+ if value == 0
31
+ @field[position / ELEMENT_WIDTH] &= ~(1 << (position % ELEMENT_WIDTH))
32
+ else
33
+ @field[position / ELEMENT_WIDTH] |= 1 << (position % ELEMENT_WIDTH)
34
+ end
35
+ end
36
+
37
+ # read a bit
38
+ # @param position [Fixnum] bit position
39
+ # @return [Fixnum] bit value 0/1
40
+ def [](position)
41
+ @field[position / ELEMENT_WIDTH] & 1 << (position % ELEMENT_WIDTH) > 0 ? 1 : 0
42
+ end
43
+ alias_method :get, :[]
44
+
45
+ # set a bit to 1
46
+ # @param position [Fixnum] bit position
47
+ def set(position)
48
+ # duplicated code to avoid a method call
49
+ @field[position / ELEMENT_WIDTH] |= 1 << (position % ELEMENT_WIDTH)
50
+ end
51
+
52
+ # set a bit to 0
53
+ # @param position [Fixnum] bit position
54
+ def unset(position)
55
+ # duplicated code to avoid a method call
56
+ @field[position / ELEMENT_WIDTH] &= ~(1 << (position % ELEMENT_WIDTH))
57
+ end
58
+
59
+ # check if bit is set
60
+ # @param position [Fixnum] bit position
61
+ # @return [Boolean] true if bit is set
62
+ def include?(position)
63
+ @field[position / ELEMENT_WIDTH] & 1 << (position % ELEMENT_WIDTH) > 0
64
+ end
65
+
66
+ # check if bit is not set
67
+ # @param position [Fixnum] bit position
68
+ # @return [Boolean] true if bit is not set
69
+ def zero?(position)
70
+ # duplicated code to avoid a method call
71
+ @field[position / ELEMENT_WIDTH] & 1 << (position % ELEMENT_WIDTH) == 0
72
+ end
73
+
74
+ # iterate over each bit
75
+ def each(&block)
76
+ @size.times { |position| yield self[position] }
77
+ end
78
+
79
+ # returns the field as a string like "0101010100111100," etc.
80
+ def to_s
81
+ inject("") { |a, b| a + b.to_s }
82
+ end
83
+
84
+ # returns the total number of bits that are set
85
+ # (the technique used here is about 6 times faster than using each or inject direct on the bitfield)
86
+ def total_set
87
+ @field.inject(0) { |a, byte| a += byte & 1 and byte >>= 1 until byte == 0; a }
88
+ end
89
+ end
90
+ end
@@ -0,0 +1,45 @@
1
+ require 'bloombroom/hash/ffi_fnv'
2
+ require 'bloombroom/bits/bit_field'
3
+ require 'bloombroom/filter/bloom_helper'
4
+
5
+ module Bloombroom
6
+
7
+ # BloomFilter false positive probability rule of thumb: see http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
8
+ # a Bloom filter with a 1% error rate and an optimal value for k only needs 9.6 bits per key, and each time we add 4.8 bits
9
+ # per element we decrease the error rate by ten times.
10
+ #
11
+ # 10000 elements, 1% error rate: m = 10000 * 10 bits -> 12k of memory, k = 0.7 * (10000 * 10 bits / 10000) = 7 hash functions
12
+ # 10000 elements, 0.1% error rate: m = 10000 * 15 bits -> 18k of memory, k = 0.7 * (10000 * 15 bits / 10000) = 11 hash functions
13
+ #
14
+ # Bloombroom::BloomHelper.find_m_k can be used to compute optimal m & k values for a required capacity and error rate.
15
+ class BloomFilter
16
+
17
+ attr_reader :m, :k, :bits, :size
18
+
19
+ # @param m [Fixnum] filter size in bits
20
+ # @param k [Fixnum] number of hashing functions
21
+ def initialize(m, k)
22
+ @bits = BitField.new(m)
23
+ @m = m
24
+ @k = k
25
+ @size = 0
26
+ end
27
+
28
+ # @param key [String] the key to add in the filter
29
+ # @return [Fixnum] the total number of keys in the filter
30
+ def add(key)
31
+ BloomHelper.multi_hash(key, @k).each{|position| @bits.set(position % @m)}
32
+ @size += 1
33
+ end
34
+ alias_method :<<, :add
35
+
36
+ # @param key [String] the key to test for inclusion in the filter
+ # @return [Boolean] true if the given key is present in the filter. False positives are possible and depend on the m and k filter parameters.
38
+ def include?(key)
39
+ BloomHelper.multi_hash(key, @k).each{|position| return false unless @bits.include?(position % @m)}
40
+ true
41
+ end
42
+ alias_method :[], :include?
43
+
44
+ end
45
+ end
@@ -0,0 +1,35 @@
1
+ require 'bloombroom/hash/ffi_fnv'
2
+
3
+ module Bloombroom
4
+
5
+ class BloomHelper
6
+
7
+ # compute optimal m and k for a given capacity and error rate
8
+ # @param capacity [Fixnum] number of expected keys
9
+ # @param error [Float] error rate (0.0 < error < 1.0). Ex: 1% == 0.01, 0.1% == 0.001, ...
10
+ def self.find_m_k(capacity, error)
11
+ # thanks to http://www.siaris.net/index.cgi/Programming/LanguageBits/Ruby/BloomFilter.rdoc
12
+ m = (capacity * Math.log(error) / Math.log(1.0 / 2 ** Math.log(2))).ceil
13
+ k = (Math.log(2) * m / capacity).round
14
+ [m, k]
15
+ end
16
+
17
+ # produce k hash values for key
18
+ # @param key [String] key to hash
19
+ # @param k [Fixnum] number of hash functions
20
+ def self.multi_hash(key, k)
21
+ # simulate n hash functions by having just two hash functions
22
+ # see http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.152.579&rep=rep1&type=pdf
23
+ # see http://willwhim.wordpress.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/
24
+ #
25
+ # fake two hash functions by using the upper/lower 32 bits of a 64 bits FNV1a hash
26
+
27
+ h = Bloombroom::FNVFFI.fnv1a_64(key)
28
+ a = (h & 0xFFFFFFFF00000000) >> 32
29
+ b = h & 0xFFFFFFFF
30
+
31
+ Array.new(k) {|i| (a + b * (i + 1))}
32
+ end
33
+
34
+ end
35
+ end
@@ -0,0 +1,108 @@
1
+ require 'bloombroom/hash/ffi_fnv'
2
+ require 'bloombroom/bits/bit_bucket_field'
3
+ require 'bloombroom/filter/bloom_helper'
4
+ require 'thread'
5
+
6
+ module Bloombroom
7
+
8
+ # ContinuousBloomFilter is a bloom filter for an unbounded stream of keys, where keys expire after a given period
+ # of time. The expected capacity of the bloom filter for the desired validity period must be known or estimated.
+ # For a given capacity and error rate, BloomHelper.find_m_k can be used to compute optimal m & k values.
+ #
+ # 4 bits per bucket (instead of 1 bit in a normal bloom filter) are used for keeping track of the keys' ttl.
+ # The internal timer resolution is set to half of the ttl (resolution divisor of 2). Using 4 bits gives us
+ # 15 usable time slots (slot 0 is for the unset state). Basically the internal time bookkeeping is similar to a
+ # ring buffer where the first timer tick will be time slot=1, then slot=2, ... slot=15, slot=1 and so on. The total
+ # time of our internal clock will thus be 15 * (ttl / 2). We keep track of ttl by writing the current time slot
+ # in the key's k buckets when it is inserted in the filter. When doing a key lookup, if any of the buckets contains
+ # the 0 value the key is not found. If the interval between the current time slot and any of the k bucket values
+ # is greater than 2 (resolution divisor) we know this key is expired and we reset the expired buckets to 0.
20
+ class ContinuousBloomFilter
21
+
22
+ attr_reader :m, :k, :ttl, :buckets
23
+
24
+ RESOLUTION_DIVISOR = 2
25
+ BITS_PER_BUCKET = 4
26
+
27
+ # @param m [Fixnum] total filter size in number of buckets. optimal m can be computed using BloomHelper.find_m_k
28
+ # @param k [Fixnum] number of hashing functions. optimal k can be computed using BloomHelper.find_m_k
29
+ # @param ttl [Fixnum] key time to live in seconds (validity period)
30
+ def initialize(m, k, ttl)
31
+ @m = m
32
+ @k = k
33
+ @ttl = ttl
34
+ @buckets = BitBucketField.new(BITS_PER_BUCKET, m)
35
+
36
+ # time management
37
+ @increment_period = @ttl / RESOLUTION_DIVISOR.to_f # float division so an odd or 1 second ttl is not truncated
38
+ @current_slot = 1
39
+ @max_slot = (2 ** BITS_PER_BUCKET) - 1 # ex. with 4 bits -> we want range 1..15
40
+ @lock = Mutex.new
41
+ end
42
+
43
+ # @param key [String] the key to add in the filter
44
+ # @return [ContinuousBloomFilter] self
45
+ def add(key)
46
+ current_slot = @lock.synchronize{@current_slot}
47
+ BloomHelper.multi_hash(key, @k).each{|position| @buckets[position % @m] = current_slot}
48
+ self
49
+ end
50
+ alias_method :<<, :add
51
+
52
+ # @param key [String] the key to test for inclusion in the filter
+ # @return [Boolean] true if the given key is present in the filter. False positives are possible and depend on the m and k filter parameters.
54
+ def include?(key)
55
+ current_slot = @lock.synchronize{@current_slot}
56
+ expired = false
57
+
58
+ BloomHelper.multi_hash(key, @k).each do |position|
59
+ start_slot = @buckets[position % @m]
60
+ if start_slot == 0
61
+ expired = true
62
+ elsif elapsed(start_slot, current_slot) > RESOLUTION_DIVISOR
63
+ expired = true
64
+ @buckets[position % @m] = 0
65
+ end
66
+ end
67
+ !expired
68
+ end
69
+ alias_method :[], :include?
70
+
71
+ # start the internal timer thread for managing ttls. must be explicitly called
72
+ def start_timer
73
+ @timer ||= detach_timer
74
+ end
75
+
76
+ # advance internal time slot. this is exposed primarily for spec'ing purposes.
77
+ # normally this is automatically called by the internal timer thread but if not
78
+ # using the internal timer thread it can be called explicitly when doing your
79
+ # own time management.
80
+ def inc_time_slot
81
+ # ex. with 4 bits -> we want range 1..15,
82
+ @lock.synchronize{@current_slot = (@current_slot % @max_slot) + 1}
83
+ end
84
+
85
+ private
86
+
87
+ def current_slot
88
+ @lock.synchronize{@current_slot}
89
+ end
90
+
91
+ def elapsed(start_slot, current_slot)
92
+ # ring buffer style
93
+ current_slot >= start_slot ? current_slot - start_slot : (current_slot + @max_slot) - start_slot
94
+ end
95
+
96
+ def detach_timer
97
+ Thread.new do
98
+ Thread.current.abort_on_exception = true
99
+
100
+ loop do
101
+ sleep(@increment_period)
102
+ inc_time_slot
103
+ end
104
+ end
105
+ end
106
+
107
+ end
108
+ end
@@ -0,0 +1,30 @@
1
+ require 'ffi'
2
+
3
+ module Bloombroom
4
+ class FNVFFI
5
+ extend FFI::Library
6
+
7
+ ffi_lib File.dirname(__FILE__) + "/" + (FFI::Platform.mac? ? "ffi_fnv.bundle" : FFI.map_library_name("ffi_fnv"))
8
+
9
+ attach_function :c_fnv1_32, :fnv1_32, [:string, :uint32], :uint32
10
+ attach_function :c_fnv1a_32, :fnv1a_32, [:string, :uint32], :uint32
11
+ attach_function :c_fnv1_64, :fnv1_64, [:string, :uint32], :uint64
12
+ attach_function :c_fnv1a_64, :fnv1a_64, [:string, :uint32], :uint64
13
+
14
+ def self.fnv1_32(data)
15
+ c_fnv1_32(data, data.bytesize) # byte length, not character count
16
+ end
17
+
18
+ def self.fnv1_64(data)
19
+ c_fnv1_64(data, data.bytesize) # byte length, not character count
20
+ end
21
+
22
+ def self.fnv1a_32(data)
23
+ c_fnv1a_32(data, data.bytesize) # byte length, not character count
24
+ end
25
+
26
+ def self.fnv1a_64(data)
27
+ c_fnv1a_64(data, data.bytesize) # byte length, not character count
28
+ end
29
+ end
30
+ end
@@ -0,0 +1,100 @@
1
+ # based on https://github.com/andyjeffries/digestfnv
2
+
3
+ module Bloombroom
4
+ class FNVA
5
+
6
+ OFFSET32 = 2166136261
7
+ OFFSET64 = 14695981039346656037
8
+ OFFSET128 = 144066263297769815596495629667062367629
9
+ OFFSET256 = 100029257958052580907070968620625704837092796014241193945225284501741471925557
10
+ OFFSET512 = 9659303129496669498009435400716310466090418745672637896108374329434462657994582932197716438449813051892206539805784495328239340083876191928701583869517785
11
+ OFFSET1024 = 14197795064947621068722070641403218320880622795441933960878474914617582723252296732303717722150864096521202355549365628174669108571814760471015076148029755969804077320157692458563003215304957150157403644460363550505412711285966361610267868082893823963790439336411086884584107735010676915
12
+
13
+ PRIME32 = 16777619
14
+ PRIME64 = 1099511628211
15
+ PRIME128 = 309485009821345068724781371
16
+ PRIME256 = 374144419156711147060143317175368453031918731002211
17
+ PRIME512 = 35835915874844867368919076489095108449946327955754392558399825615420669938882575126094039892345713852759
18
+ PRIME1024 = 5016456510113118655434598811035278955030765345404790744303017523831112055108147451509157692220295382716162651878526895249385292291816524375083746691371804094271873160484737966720260389217684476157468082573
19
+
20
+ MASK32 = (2 ** 32) - 1
21
+ MASK64 = (2 ** 64) - 1
22
+ MASK128 = (2 ** 128) - 1
23
+ MASK256 = (2 ** 256) - 1
24
+ MASK512 = (2 ** 512) - 1
25
+ MASK1024 = (2 ** 1024) - 1
26
+
27
+ def self.fnv1_32(input)
28
+ hash = OFFSET32
29
+ input.each_byte { |b| hash = ((hash * PRIME32) ^ b) & MASK32 }
30
+ hash & MASK32
31
+ end
32
+
33
+ def self.fnv1_64(input)
34
+ hash = OFFSET64
35
+ input.each_byte { |b| hash = ((hash * PRIME64) ^ b) & MASK64 }
36
+ hash & MASK64
37
+ end
38
+
39
+ def self.fnv1_128(input)
40
+ hash = OFFSET128
41
+ input.each_byte { |b| hash = ((hash * PRIME128) ^ b) & MASK128 }
42
+ hash & MASK128
43
+ end
44
+
45
+ def self.fnv1_256(input)
46
+ hash = OFFSET256
47
+ input.each_byte { |b| hash = ((hash * PRIME256) ^ b) & MASK256 }
48
+ hash & MASK256
49
+ end
50
+
51
+ def self.fnv1_512(input)
52
+ hash = OFFSET512
53
+ input.each_byte { |b| hash = ((hash * PRIME512) ^ b) & MASK512 }
54
+ hash & MASK512
55
+ end
56
+
57
+ def self.fnv1_1024(input)
58
+ hash = OFFSET1024
59
+ input.each_byte { |b| hash = ((hash * PRIME1024) ^ b) & MASK1024 }
60
+ hash & MASK1024
61
+ end
62
+
63
+ def self.fnv1a_32(input)
64
+ hash = OFFSET32
65
+ input.each_byte { |b| hash = ((hash ^ b) * PRIME32) & MASK32 }
66
+ hash & MASK32
67
+ end
68
+
69
+ def self.fnv1a_64(input)
70
+ hash = OFFSET64
71
+ input.each_byte { |b| hash = ((hash ^ b) * PRIME64) & MASK64 }
72
+ hash & MASK64
73
+ end
74
+
75
+ def self.fnv1a_128(input)
76
+ hash = OFFSET128
77
+ input.each_byte { |b| hash = ((hash ^ b) * PRIME128) & MASK128 }
78
+ hash & MASK128
79
+ end
80
+
81
+ def self.fnv1a_256(input)
82
+ hash = OFFSET256
83
+ input.each_byte { |b| hash = ((hash ^ b) * PRIME256) & MASK256 }
84
+ hash & MASK256
85
+ end
86
+
87
+ def self.fnv1a_512(input)
88
+ hash = OFFSET512
89
+ input.each_byte { |b| hash = ((hash ^ b) * PRIME512) & MASK512 }
90
+ hash & MASK512
91
+ end
92
+
93
+ def self.fnv1a_1024(input)
94
+ hash = OFFSET1024
95
+ input.each_byte { |b| hash = ((hash ^ b) * PRIME1024) & MASK1024 }
96
+ hash & MASK1024
97
+ end
98
+
99
+ end
100
+ end
@@ -0,0 +1,56 @@
1
+ # based on https://github.com/jakedouglas/fnv-ruby
2
+
3
+ module Bloombroom
4
+ class FNVB
5
+ INIT32 = 0x811c9dc5
6
+ INIT64 = 0xcbf29ce484222325
7
+ PRIME32 = 0x01000193
8
+ PRIME64 = 0x100000001b3
9
+ MOD32 = 2 ** 32
10
+ MOD64 = 2 ** 64
11
+
12
+ def self.fnv1_32(data)
13
+ hash = INIT32
14
+
15
+ data.each_byte do |byte|
16
+ hash = (hash * PRIME32) % MOD32
17
+ hash = hash ^ byte
18
+ end
19
+
20
+ hash
21
+ end
22
+
23
+ def self.fnv1_64(data)
24
+ hash = INIT64
25
+
26
+ data.each_byte do |byte|
27
+ hash = (hash * PRIME64) % MOD64
28
+ hash = hash ^ byte
29
+ end
30
+
31
+ hash
32
+ end
33
+
34
+ def self.fnv1a_32(data)
35
+ hash = INIT32
36
+
37
+ data.each_byte do |byte|
38
+ hash = hash ^ byte
39
+ hash = (hash * PRIME32) % MOD32
40
+ end
41
+
42
+ hash
43
+ end
44
+
45
+ def self.fnv1a_64(data)
46
+ hash = INIT64
47
+
48
+ data.each_byte do |byte|
49
+ hash = hash ^ byte
50
+ hash = (hash * PRIME64) % MOD64
51
+ end
52
+
53
+ hash
54
+ end
55
+ end
56
+ end
@@ -0,0 +1,3 @@
1
+ module Bloombroom
2
+ VERSION = "1.0.0"
3
+ end
metadata ADDED
@@ -0,0 +1,100 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: bloombroom
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Colin Surprenant
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-05-09 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: rspec
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ~>
20
+ - !ruby/object:Gem::Version
21
+ version: 2.8.0
22
+ type: :development
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ~>
28
+ - !ruby/object:Gem::Version
29
+ version: 2.8.0
30
+ - !ruby/object:Gem::Dependency
31
+ name: ffi
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :runtime
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ description: bloombroom has two bloom filter implementations, a standard filter for
47
+ bounded key space and a continuous filter for unbounded keys
48
+ (stream). also contains fast bit field and bit bucket field (multi
49
+ bits), native/C-ext/FFI FNV hashing and benchmarks for all these.
50
+ email:
51
+ - colin.surprenant@gmail.com
52
+ executables: []
53
+ extensions:
54
+ - ext/bloombroom/hash/cext/extconf.rb
55
+ - ext/bloombroom/hash/ffi/extconf.rb
56
+ extra_rdoc_files: []
57
+ files:
58
+ - lib/bloombroom/bits/bit_bucket_field.rb
59
+ - lib/bloombroom/bits/bit_field.rb
60
+ - lib/bloombroom/filter/bloom_filter.rb
61
+ - lib/bloombroom/filter/bloom_helper.rb
62
+ - lib/bloombroom/filter/continuous_bloom_filter.rb
63
+ - lib/bloombroom/hash/ffi_fnv.rb
64
+ - lib/bloombroom/hash/fnv_a.rb
65
+ - lib/bloombroom/hash/fnv_b.rb
66
+ - lib/bloombroom/version.rb
67
+ - lib/bloombroom.rb
68
+ - ext/bloombroom/hash/cext/extconf.rb
69
+ - ext/bloombroom/hash/ffi/extconf.rb
70
+ - ext/bloombroom/hash/cext/cext_fnv.c
71
+ - ext/bloombroom/hash/ffi/ffi_fnv.c
72
+ - README.md
73
+ - CHANGELOG.md
74
+ - LICENSE.md
75
+ homepage: https://github.com/colinsurprenant/bloombroom
76
+ licenses: []
77
+ post_install_message:
78
+ rdoc_options: []
79
+ require_paths:
80
+ - lib
81
+ required_ruby_version: !ruby/object:Gem::Requirement
82
+ none: false
83
+ requirements:
84
+ - - ! '>='
85
+ - !ruby/object:Gem::Version
86
+ version: '0'
87
+ required_rubygems_version: !ruby/object:Gem::Requirement
88
+ none: false
89
+ requirements:
90
+ - - ! '>='
91
+ - !ruby/object:Gem::Version
92
+ version: '0'
93
+ requirements: []
94
+ rubyforge_project: bloombroom
95
+ rubygems_version: 1.8.24
96
+ signing_key:
97
+ specification_version: 3
98
+ summary: bloom filters for bounded and unbounded (streaming) data, FNV hashing and
99
+ bit fields
100
+ test_files: []