bloombroom 1.0.0
- data/CHANGELOG.md +1 -0
- data/LICENSE.md +11 -0
- data/README.md +242 -0
- data/ext/bloombroom/hash/cext/cext_fnv.c +91 -0
- data/ext/bloombroom/hash/cext/extconf.rb +3 -0
- data/ext/bloombroom/hash/ffi/extconf.rb +3 -0
- data/ext/bloombroom/hash/ffi/ffi_fnv.c +60 -0
- data/lib/bloombroom.rb +13 -0
- data/lib/bloombroom/bits/bit_bucket_field.rb +90 -0
- data/lib/bloombroom/bits/bit_field.rb +90 -0
- data/lib/bloombroom/filter/bloom_filter.rb +45 -0
- data/lib/bloombroom/filter/bloom_helper.rb +35 -0
- data/lib/bloombroom/filter/continuous_bloom_filter.rb +108 -0
- data/lib/bloombroom/hash/ffi_fnv.rb +30 -0
- data/lib/bloombroom/hash/fnv_a.rb +100 -0
- data/lib/bloombroom/hash/fnv_b.rb +56 -0
- data/lib/bloombroom/version.rb +3 -0
- metadata +100 -0
data/CHANGELOG.md
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
tbd
|
data/LICENSE.md
ADDED
@@ -0,0 +1,11 @@
|
|
1
|
+
Licensed under the Apache License, Version 2.0 (the "License");
|
2
|
+
you may not use this file except in compliance with the License.
|
3
|
+
You may obtain a copy of the License at
|
4
|
+
|
5
|
+
http://www.apache.org/licenses/LICENSE-2.0
|
6
|
+
|
7
|
+
Unless required by applicable law or agreed to in writing, software
|
8
|
+
distributed under the License is distributed on an "AS IS" BASIS,
|
9
|
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
10
|
+
See the License for the specific language governing permissions and
|
11
|
+
limitations under the License.
|
data/README.md
ADDED
@@ -0,0 +1,242 @@
|
|
1
|
+
# Bloombroom v1.0.0
|
2
|
+
|
3
|
+
- Standard **Bloomfilter** class for bounded key space
|
4
|
+
- **ContinuousBloomfilter** class for unbounded keys (**stream**)
|
5
|
+
- Bitfield class
|
6
|
+
- BitBucketField class (multi bits)
|
7
|
+
- FNV hash classes: native Ruby, C extension and FFI extension
|
8
|
+
|
9
|
+
The Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. See [wikipedia](http://en.wikipedia.org/wiki/Bloom_filter).
|
10
|
+
|
11
|
+
Bloom filters are normally used in the context of a bounded set since the filter size must be known in advance. For a given filter capacity, the total bit size affects the false positive error rate. The total number of bits required for a given filter can be computed from the required filter capacity and target error rate. See the [references](#references) section for more info.
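
As a rough sketch of that computation, the standard sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2 fit in a few lines of Ruby. The method name `optimal_m_k` is made up for this illustration; the gem ships its own helper, `Bloombroom::BloomHelper.find_m_k`.

``` ruby
# Hypothetical helper (not part of the gem): size a Bloom filter
# for `capacity` keys and a target false positive rate `error`.
def optimal_m_k(capacity, error)
  m = (-capacity * Math.log(error) / (Math.log(2)**2)).ceil # total bits
  k = (Math.log(2) * m / capacity).round                    # hash functions
  [m, k]
end

m, k = optimal_m_k(150_000, 0.01)
puts "m=#{m} bits, k=#{k}"
```

For 150000 keys and a 1% error rate this gives m=1437759 and k=7, the same parameters that appear in the benchmark output further down.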
|
12
|
+
|
13
|
+
### ContinuousBloomfilter
|
14
|
+
The **ContinuousBloomfilter** provides a bloom filter implementation that supports an unbounded stream of elements. Elements are expired after a chosen TTL. At initialization the filter capacity must be estimated for the number of elements expected over the given TTL period.
|
15
|
+
|
16
|
+
For example, deduplicating a stream with a rate of **5000 items/sec** over a period of **60 minutes** would require a filter capacity of **18M** elements. For a required error rate of **0.1%** the filter would need **246mb** of memory (which includes all Ruby object overhead).
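
The arithmetic behind that capacity estimate can be sketched as follows. The helper name `stream_filter_sizing` is hypothetical, and the sizing formulas are the standard Bloom filter ones, applied here to 4-bit buckets instead of single bits.

``` ruby
# Hypothetical helper, not part of the gem: estimate the capacity and the
# filter parameters for a stream with a known rate and TTL.
def stream_filter_sizing(rate_per_sec, ttl_seconds, error)
  capacity = rate_per_sec * ttl_seconds                      # keys alive during one TTL
  m = (-capacity * Math.log(error) / (Math.log(2)**2)).ceil  # number of buckets
  k = (Math.log(2) * m / capacity).round                     # hash functions
  raw_mib = m * 4 / 8.0 / (1024 * 1024)                      # 4 bits per bucket, tightly packed
  { capacity: capacity, m: m, k: k, raw_mib: raw_mib.round(1) }
end

sizing = stream_filter_sizing(5_000, 60 * 60, 0.001)
puts sizing.inspect
```

The capacity comes out at 18,000,000 keys. The tightly packed size printed here is smaller than the 246mb quoted above, since that figure is a measured process footprint that includes Ruby object overhead.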
|
17
|
+
|
18
|
+
The ContinuousBloomfilter uses 4 bits for each filter *position* or *bucket* (instead of 1 bit in a normal bloom filter) for keeping track of each key's TTL.
|
19
|
+
The internal timer resolution is set to half of the required TTL (resolution divisor of 2). Using 4 bits gives us
|
20
|
+
15 usable time slots (slot 0 is for the unset state). Basically the internal time bookkeeping is similar to a
|
21
|
+
ring buffer using the current timer tick modulo 15. The timer ticks will be time slot=1, 2, ... 15, 1, 2 and so on. The total
|
22
|
+
time of our internal clock will thus be 15 * (TTL / 2). We keep track of TTL by writing the current time slot
|
23
|
+
in the key's k buckets when it is inserted in the filter. For a key lookup, if the interval between the current time slot and the value of any of the k buckets
|
24
|
+
is greater than 2 (resolution divisor) we know this key is expired. See [continuous_bloom_filter.rb](https://github.com/colinsurprenant/bloombroom/blob/master/lib/bloombroom/filter/continuous_bloom_filter.rb)
|
25
|
+
|
26
|
+
This means that an element is guaranteed not to expire before the given TTL, but in the worst case it could survive until 3 * (TTL / 2).
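
The slot arithmetic described above can be sketched in standalone Ruby. This is a minimal illustration modeled on the description, not the gem's code; the method names and default arguments are assumptions made for the example.

``` ruby
# Illustrative sketch only: 4 bits per bucket give slots 1..15, slot 0 means "unset".

# Next time slot: 1, 2, ... 15, then back to 1.
def next_slot(slot, max_slot = 15)
  (slot % max_slot) + 1
end

# Number of timer ticks between the slot stored in a bucket and the current slot.
def elapsed(start_slot, current_slot, max_slot = 15)
  current_slot >= start_slot ? current_slot - start_slot : current_slot + max_slot - start_slot
end

# A bucket counts as expired when it is unset or older than `divisor` ticks.
def expired?(start_slot, current_slot, divisor = 2)
  start_slot == 0 || elapsed(start_slot, current_slot) > divisor
end

slot = 14
3.times { slot = next_slot(slot) }  # 14 -> 15 -> 1 -> 2
puts elapsed(14, slot)              # 3 ticks have passed since slot 14
puts expired?(14, slot)             # true, since 3 > 2
```

Because the timer ticks every TTL / 2, a key written just after a tick expires only once 3 ticks have elapsed, which is where the 3 * (TTL / 2) worst case comes from.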
|
27
|
+
|
28
|
+
### Hashing
|
29
|
+
Bloom filters require the use of multiple (k) hash functions for each inserted element. We simulate multiple hash functions with just two hash values, which are the upper and lower 32 bits of our FFI FNV1a 64-bit hash function. Double hashing with one hash function. Very very fast. See [bloom_helper.rb](https://github.com/colinsurprenant/bloombroom/blob/master/lib/bloombroom/filter/bloom_helper.rb) and the [references](#references) section for more info on this technique.
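
A minimal sketch of the technique, assuming a pure Ruby FNV-1a so that the example is self-contained (the gem itself uses its FFI implementation). The method names `fnv1a_64` and `positions` are illustrative.

``` ruby
# Pure Ruby FNV-1a, 64 bits. Slow, but enough to illustrate the idea.
def fnv1a_64(data)
  hash = 0xcbf29ce484222325                      # FNV 64-bit offset basis
  data.each_byte do |byte|
    # xor the byte in, multiply by the FNV 64-bit prime, truncate to 64 bits
    hash = ((hash ^ byte) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
  end
  hash
end

# Derive k filter positions from the two 32-bit halves of one 64-bit hash.
def positions(key, k, m)
  h = fnv1a_64(key)
  a = h >> 32          # upper 32 bits act as the first hash
  b = h & 0xFFFFFFFF   # lower 32 bits act as the second hash
  Array.new(k) { |i| (a + b * (i + 1)) % m }
end

p positions("key1", 3, 1000)
```

Each key is hashed only once, and the k positions are cheap linear combinations of the two halves.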
|
30
|
+
|
31
|
+
|
32
|
+
## Installation
|
33
|
+
Tested on MRI Ruby 1.9.2 and 1.9.3, and on JRuby 1.6.7 in 1.9 mode.
|
34
|
+
|
35
|
+
``` sh
|
36
|
+
$ gem install bloombroom
|
37
|
+
```
|
38
|
+
|
39
|
+
## Examples
|
40
|
+
|
41
|
+
### Standard Bloom filter
|
42
|
+
``` ruby
|
43
|
+
require 'bloombroom'
|
44
|
+
|
45
|
+
bf = Bloombroom::BloomFilter.new(1000, 3) # 1000 bits and 3 hash functions
|
46
|
+
|
47
|
+
bf.add("key1")
|
48
|
+
bf.add("key2")
|
49
|
+
|
50
|
+
bf.include?("key1") # => true
|
51
|
+
bf.include?("key3") # => false
|
52
|
+
```
|
53
|
+
|
54
|
+
``` ruby
|
55
|
+
require 'bloombroom'
|
56
|
+
|
57
|
+
# compute optimal m,k for a filter capacity of 1000 elements and 0.1% error rate
|
58
|
+
m, k = Bloombroom::BloomHelper.find_m_k(1000, 0.001)
|
59
|
+
|
60
|
+
bf = Bloombroom::BloomFilter.new(m, k)
|
61
|
+
|
62
|
+
bf << "key1"
|
63
|
+
bf << "key2"
|
64
|
+
|
65
|
+
bf["key1"] # => true
|
66
|
+
bf["key3"] # => false
|
67
|
+
```
|
68
|
+
### Continuous Bloom filter
|
69
|
+
|
70
|
+
``` ruby
|
71
|
+
require 'bloombroom'
|
72
|
+
|
73
|
+
# 1000 buckets, 3 hash functions and a TTL of 2 seconds
|
74
|
+
bf = Bloombroom::ContinuousBloomFilter.new(1000, 3, 2)
|
75
|
+
bf.start_timer
|
76
|
+
|
77
|
+
bf << "key1"
|
78
|
+
bf << "key2"
|
79
|
+
|
80
|
+
bf["key1"] # => true
|
81
|
+
bf["key2"] # => true
|
82
|
+
bf["key3"] # => false
|
83
|
+
|
84
|
+
sleep(3)
|
85
|
+
|
86
|
+
bf["key1"] # => false
|
87
|
+
bf["key2"] # => false
|
88
|
+
bf["key3"] # => false
|
89
|
+
```
|
90
|
+
|
91
|
+
## Memory footprint
|
92
|
+
The calculated memory footprint **includes all Ruby object overhead**. The footprint is calculated by querying the process size (rss) before and after the bloom filter object initialization.
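
One way to take such a before/after measurement is sketched below. This is not the benchmark script shipped with the gem: the helper names are hypothetical, and it assumes either a Linux `/proc` filesystem or a `ps` command that accepts `-o rss=`.

``` ruby
# Resident set size of the current process, in kilobytes.
# Reads /proc on Linux and falls back to `ps` elsewhere.
def rss_kb
  status = "/proc/self/status"
  if File.readable?(status)
    File.read(status)[/^VmRSS:\s+(\d+)/, 1].to_i
  else
    `ps -o rss= -p #{Process.pid}`.to_i
  end
end

# Returns the RSS growth (in KB) observed while the block runs,
# together with the block result so the allocated object stays referenced.
def rss_growth_kb
  GC.start
  before = rss_kb
  result = yield
  [rss_kb - before, result]
end

growth, _field = rss_growth_kb { Array.new(1_000_000, 0) }
puts "grew by roughly #{growth} KB"
```

RSS deltas are noisy and depend on the allocator and garbage collector, so figures obtained this way are best read as approximations.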
|
93
|
+
|
94
|
+
### Bloomfilter
|
95
|
+
``` sh
|
96
|
+
ruby benchmark/bloom_filter_memory.rb auto 1000000 0.01
|
97
|
+
ruby benchmark/bloom_filter_memory.rb auto 100000000 0.01
|
98
|
+
ruby benchmark/bloom_filter_memory.rb auto 100000000 0.001
|
99
|
+
```
|
100
|
+
|
101
|
+
- **1.0%** error rate for **1M** keys: **2.3mb**
|
102
|
+
- **1.0%** error rate for **100M** keys: **228mb**
|
103
|
+
- **0.1%** error rate for **100M** keys: **342mb**
|
104
|
+
|
105
|
+
### ContinuousBloomfilter
|
106
|
+
``` sh
|
107
|
+
ruby benchmark/continuous_bloom_filter_memory.rb auto 1000000 0.01
|
108
|
+
ruby benchmark/continuous_bloom_filter_memory.rb auto 100000000 0.01
|
109
|
+
ruby benchmark/continuous_bloom_filter_memory.rb auto 100000000 0.001
|
110
|
+
```
|
111
|
+
|
112
|
+
- **1.0%** error rate for **1M** keys: **9.1mb**
|
113
|
+
- **1.0%** error rate for **100M** keys: **914mb**
|
114
|
+
- **0.1%** error rate for **100M** keys: **1371mb**
|
115
|
+
|
116
|
+
|
117
|
+
## Benchmarks
|
118
|
+
All benchmarks have been run on a MacBook Pro with a 2.66GHz i7 and 8GB RAM on OSX 10.6.8 with MRI Ruby 1.9.3p194.
|
119
|
+
|
120
|
+
### Hashing
|
121
|
+
The hashing benchmark compares the performance of SHA1, MD5, two native Ruby FNV implementations (A & B), and a C implementation exposed both as a C extension and as an FFI extension, for 32 and 64 bit hashes.
|
122
|
+
|
123
|
+
``` sh
|
124
|
+
ruby benchmark/fnv.rb
|
125
|
+
```
|
126
|
+
|
127
|
+
```
|
128
|
+
benchmarking for 1000000 iterations
|
129
|
+
user system total real
|
130
|
+
MD5: 1.900000 0.010000 1.910000 ( 1.912995)
|
131
|
+
SHA-1: 2.110000 0.000000 2.110000 ( 2.109739)
|
132
|
+
native FNV A 32: 32.470000 0.110000 32.580000 ( 32.596759)
|
133
|
+
native FNV A 64: 38.330000 0.570000 38.900000 ( 38.923384)
|
134
|
+
native FNV B 32: 4.870000 0.020000 4.890000 ( 4.882862)
|
135
|
+
native FNV B 64: 37.700000 0.110000 37.810000 ( 37.842873)
|
136
|
+
ffi FNV 32: 0.760000 0.010000 0.770000 ( 0.754941)
|
137
|
+
ffi FNV 64: 0.890000 0.000000 0.890000 ( 0.901954)
|
138
|
+
c-ext FNV 32: 0.310000 0.000000 0.310000 ( 0.307131)
|
139
|
+
c-ext FNV 64: 0.480000 0.000000 0.480000 ( 0.485310)
|
140
|
+
|
141
|
+
MD5: 522740 ops/s
|
142
|
+
SHA-1: 473992 ops/s
|
143
|
+
native FNV A 32: 30678 ops/s
|
144
|
+
native FNV A 64: 25691 ops/s
|
145
|
+
native FNV B 32: 204798 ops/s
|
146
|
+
native FNV B 64: 26425 ops/s
|
147
|
+
ffi FNV 32: 1324607 ops/s
|
148
|
+
ffi FNV 64: 1108704 ops/s
|
149
|
+
c-ext FNV 32: 3255939 ops/s
|
150
|
+
c-ext FNV 64: 2060538 ops/s
|
151
|
+
```
|
152
|
+
|
153
|
+
### Bloomfilter
|
154
|
+
The Bloomfilter class uses the FFI FNV hashing by default, for speed and compatibility.
|
155
|
+
|
156
|
+
``` sh
|
157
|
+
ruby benchmark/bloom_filter.rb
|
158
|
+
```
|
159
|
+
|
160
|
+
```
|
161
|
+
benchmarking for 150000 keys with 1.0%, 0.1%, 0.01% error rates
|
162
|
+
user system total real
|
163
|
+
BloomFilter m=1437759, k=07 add 0.940000 0.000000 0.940000 ( 0.948075)
|
164
|
+
BloomFilter m=1437759, k=07 include? 0.830000 0.010000 0.840000 ( 0.834414)
|
165
|
+
BloomFilter m=2156639, k=10 add 1.220000 0.000000 1.220000 ( 1.227294)
|
166
|
+
BloomFilter m=2156639, k=10 include? 1.050000 0.010000 1.060000 ( 1.052358)
|
167
|
+
BloomFilter m=2875518, k=13 add 1.500000 0.010000 1.510000 ( 1.516086)
|
168
|
+
BloomFilter m=2875518, k=13 include? 1.260000 0.010000 1.270000 ( 1.258877)
|
169
|
+
|
170
|
+
BloomFilter m=1437759, k=07 add 158215 ops/s
|
171
|
+
BloomFilter m=1437759, k=07 include? 179767 ops/s
|
172
|
+
BloomFilter m=2156639, k=10 add 122220 ops/s
|
173
|
+
BloomFilter m=2156639, k=10 include? 142537 ops/s
|
174
|
+
BloomFilter m=2875518, k=13 add 98939 ops/s
|
175
|
+
BloomFilter m=2875518, k=13 include? 119154 ops/s
|
176
|
+
```
|
177
|
+
|
178
|
+
### ContinuousBloomfilter
|
179
|
+
The ContinuousBloomfilter class uses the FFI FNV hashing by default, for speed and compatibility.
|
180
|
+
|
181
|
+
``` sh
|
182
|
+
ruby benchmark/continuous_bloom_filter.rb
|
183
|
+
```
|
184
|
+
|
185
|
+
```
|
186
|
+
benchmarking WITHOUT expiration for 150000 keys with 1.0%, 0.1%, 0.01% error rates
|
187
|
+
user system total real
|
188
|
+
ContinuousBloomFilter m=1437759, k=07 add 1.720000 0.000000 1.720000 ( 1.733903)
|
189
|
+
ContinuousBloomFilter m=1437759, k=07 include? 1.630000 0.010000 1.640000 ( 1.630668)
|
190
|
+
ContinuousBloomFilter m=2156639, k=10 add 2.130000 0.010000 2.140000 ( 2.142091)
|
191
|
+
ContinuousBloomFilter m=2156639, k=10 include? 2.160000 0.000000 2.160000 ( 2.159395)
|
192
|
+
ContinuousBloomFilter m=2875518, k=13 add 2.650000 0.010000 2.660000 ( 2.655585)
|
193
|
+
ContinuousBloomFilter m=2875518, k=13 include? 2.570000 0.010000 2.580000 ( 2.586032)
|
194
|
+
|
195
|
+
ContinuousBloomFilter m=1437759, k=07 add 86510 ops/s
|
196
|
+
ContinuousBloomFilter m=1437759, k=07 include? 91987 ops/s
|
197
|
+
ContinuousBloomFilter m=2156639, k=10 add 70025 ops/s
|
198
|
+
ContinuousBloomFilter m=2156639, k=10 include? 69464 ops/s
|
199
|
+
ContinuousBloomFilter m=2875518, k=13 add 56485 ops/s
|
200
|
+
ContinuousBloomFilter m=2875518, k=13 include? 58004 ops/s
|
201
|
+
|
202
|
+
benchmarking WITH expiration for 500000 keys with 1.0%, 0.1%, 0.01% error rates
|
203
|
+
user system total real
|
204
|
+
ContinuousBloomFilter m=1437759, k=07 add+include 11.110000 0.040000 11.150000 ( 11.146869)
|
205
|
+
ContinuousBloomFilter m=2156639, k=10 add+include 14.220000 0.040000 14.260000 ( 14.269583)
|
206
|
+
ContinuousBloomFilter m=2875518, k=13 add+include 17.600000 0.060000 17.660000 ( 17.665917)
|
207
|
+
|
208
|
+
ContinuousBloomFilter m=1437759, k=07 add+include 89711 ops/s
|
209
|
+
ContinuousBloomFilter m=2156639, k=10 add+include 70079 ops/s
|
210
|
+
ContinuousBloomFilter m=2875518, k=13 add+include 56606 ops/s
|
211
|
+
```
|
212
|
+
|
213
|
+
## JRuby
|
214
|
+
- to run specs use
|
215
|
+
|
216
|
+
``` sh
|
217
|
+
jruby --1.9 -S rake spec
|
218
|
+
```
|
219
|
+
- to run benchmarks use
|
220
|
+
|
221
|
+
``` sh
|
222
|
+
jruby --1.9 benchmark/some_benchmark.rb
|
223
|
+
```
|
224
|
+
|
225
|
+
<a id="reference" />
|
226
|
+
## References ##
|
227
|
+
- [Bloom filter on wikipedia](http://en.wikipedia.org/wiki/Bloom_filter)
|
228
|
+
- [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
|
229
|
+
- [Flow Analysis & Time-based Bloom Filters](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/)
|
230
|
+
- [Stable Bloom filters](http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf)
|
231
|
+
- [The maths to compute optimal m and k](http://www.siaris.net/index.cgi/Programming/LanguageBits/Ruby/BloomFilter.rdoc)
|
232
|
+
- [Producing n hash functions by hashing only once](http://willwhim.wordpress.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/)
|
233
|
+
- [Less Hashing, Same Performance: Building a Better Bloom Filter](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.152.579&rep=rep1&type=pdf)
|
234
|
+
|
235
|
+
## Author
|
236
|
+
Colin Surprenant, [@colinsurprenant][twitter], [http://github.com/colinsurprenant][github], colin.surprenant@needium.com, colin.surprenant@gmail.com
|
237
|
+
|
238
|
+
## License
|
239
|
+
Bloombroom is distributed under the Apache License, Version 2.0.
|
240
|
+
|
241
|
+
[twitter]: http://twitter.com/colinsurprenant
|
242
|
+
[github]: http://github.com/colinsurprenant
|
@@ -0,0 +1,91 @@
|
|
1
|
+
/*
|
2
|
+
* based on https://github.com/robey/rbfnv with various fixes from forks
|
3
|
+
*/
|
4
|
+
|
5
|
+
#include <stdint.h>
|
6
|
+
#include "ruby.h"
|
7
|
+
|
8
|
+
#define PRIME32 16777619
|
9
|
+
#define PRIME64 1099511628211ULL
|
10
|
+
|
11
|
+
/**
|
12
|
+
* FNV fast hashing algorithm in 32 bits.
|
13
|
+
* @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
|
14
|
+
*/
|
15
|
+
uint32_t fnv1_32(const char *data, uint64_t len) {
|
16
|
+
uint32_t rv = 0x811c9dc5U;
|
17
|
+
uint64_t i;
|
18
|
+
for (i = 0; i < len; i++) {
|
19
|
+
rv = (rv * PRIME32) ^ (unsigned char)(data[i]);
|
20
|
+
}
|
21
|
+
return rv;
|
22
|
+
}
|
23
|
+
|
24
|
+
/**
|
25
|
+
* FNV fast hashing algorithm in 32 bits, variant with operations reversed.
|
26
|
+
* @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
|
27
|
+
*/
|
28
|
+
uint32_t fnv1a_32(const char *data, uint64_t len) {
|
29
|
+
uint32_t rv = 0x811c9dc5U;
|
30
|
+
uint64_t i;
|
31
|
+
for (i = 0; i < len; i++) {
|
32
|
+
rv = (rv ^ (unsigned char)data[i]) * PRIME32;
|
33
|
+
}
|
34
|
+
return rv;
|
35
|
+
}
|
36
|
+
|
37
|
+
/**
|
38
|
+
* FNV fast hashing algorithm in 64 bits.
|
39
|
+
* @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
|
40
|
+
*/
|
41
|
+
uint64_t fnv1_64(const char *data, uint64_t len) {
|
42
|
+
uint64_t rv = 0xcbf29ce484222325ULL;
|
43
|
+
uint64_t i;
|
44
|
+
for (i = 0; i < len; i++) {
|
45
|
+
rv = (rv * PRIME64) ^ (unsigned char)data[i];
|
46
|
+
}
|
47
|
+
return rv;
|
48
|
+
}
|
49
|
+
|
50
|
+
/**
|
51
|
+
* FNV fast hashing algorithm in 64 bits, variant with operations reversed.
|
52
|
+
* @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
|
53
|
+
*/
|
54
|
+
uint64_t fnv1a_64(const char *data, uint64_t len) {
|
55
|
+
uint64_t rv = 0xcbf29ce484222325ULL;
|
56
|
+
uint64_t i;
|
57
|
+
for (i = 0; i < len; i++) {
|
58
|
+
rv = (rv ^ (unsigned char)data[i]) * PRIME64;
|
59
|
+
}
|
60
|
+
return rv;
|
61
|
+
}
|
62
|
+
|
63
|
+
/* ----- ruby bindings ----- */
|
64
|
+
|
65
|
+
VALUE rb_fnv1_32(VALUE self, VALUE data) {
|
66
|
+
return UINT2NUM(fnv1_32(RSTRING_PTR(data), RSTRING_LEN(data)));
|
67
|
+
}
|
68
|
+
|
69
|
+
VALUE rb_fnv1a_32(VALUE self, VALUE data) {
|
70
|
+
return UINT2NUM(fnv1a_32(RSTRING_PTR(data), RSTRING_LEN(data)));
|
71
|
+
}
|
72
|
+
|
73
|
+
VALUE rb_fnv1_64(VALUE self, VALUE data) {
|
74
|
+
return ULL2NUM(fnv1_64(RSTRING_PTR(data), RSTRING_LEN(data)));
|
75
|
+
}
|
76
|
+
|
77
|
+
VALUE rb_fnv1a_64(VALUE self, VALUE data) {
|
78
|
+
return ULL2NUM(fnv1a_64(RSTRING_PTR(data), RSTRING_LEN(data)));
|
79
|
+
}
|
80
|
+
|
81
|
+
VALUE rb_class;
|
82
|
+
VALUE rb_module;
|
83
|
+
|
84
|
+
void Init_cext_fnv() {
|
85
|
+
rb_module = rb_define_module("Bloombroom");
|
86
|
+
rb_class = rb_define_class_under(rb_module, "FNVEXT", rb_cObject);
|
87
|
+
rb_define_singleton_method(rb_class, "fnv1_32", rb_fnv1_32, 1);
|
88
|
+
rb_define_singleton_method(rb_class, "fnv1a_32", rb_fnv1a_32, 1);
|
89
|
+
rb_define_singleton_method(rb_class, "fnv1_64", rb_fnv1_64, 1);
|
90
|
+
rb_define_singleton_method(rb_class, "fnv1a_64", rb_fnv1a_64, 1);
|
91
|
+
}
|
@@ -0,0 +1,60 @@
|
|
1
|
+
/*
|
2
|
+
* based on https://github.com/robey/rbfnv with various fixes from forks
|
3
|
+
*/
|
4
|
+
|
5
|
+
#include <stdint.h>
|
6
|
+
|
7
|
+
#define PRIME32 16777619
|
8
|
+
#define PRIME64 1099511628211ULL
|
9
|
+
|
10
|
+
/**
|
11
|
+
* FNV fast hashing algorithm in 32 bits.
|
12
|
+
* @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
|
13
|
+
*/
|
14
|
+
uint32_t fnv1_32(const char *data, uint32_t len) {
|
15
|
+
uint32_t rv = 0x811c9dc5U;
|
16
|
+
uint32_t i;
|
17
|
+
for (i = 0; i < len; i++) {
|
18
|
+
rv = (rv * PRIME32) ^ (unsigned char)(data[i]);
|
19
|
+
}
|
20
|
+
return rv;
|
21
|
+
}
|
22
|
+
|
23
|
+
/**
|
24
|
+
* FNV fast hashing algorithm in 32 bits, variant with operations reversed.
|
25
|
+
* @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
|
26
|
+
*/
|
27
|
+
uint32_t fnv1a_32(const char *data, uint32_t len) {
|
28
|
+
uint32_t rv = 0x811c9dc5U;
|
29
|
+
uint32_t i;
|
30
|
+
for (i = 0; i < len; i++) {
|
31
|
+
rv = (rv ^ (unsigned char)data[i]) * PRIME32;
|
32
|
+
}
|
33
|
+
return rv;
|
34
|
+
}
|
35
|
+
|
36
|
+
/**
|
37
|
+
* FNV fast hashing algorithm in 64 bits.
|
38
|
+
* @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
|
39
|
+
*/
|
40
|
+
uint64_t fnv1_64(const char *data, uint32_t len) {
|
41
|
+
uint64_t rv = 0xcbf29ce484222325ULL;
|
42
|
+
uint32_t i;
|
43
|
+
for (i = 0; i < len; i++) {
|
44
|
+
rv = (rv * PRIME64) ^ (unsigned char)data[i];
|
45
|
+
}
|
46
|
+
return rv;
|
47
|
+
}
|
48
|
+
|
49
|
+
/**
|
50
|
+
* FNV fast hashing algorithm in 64 bits, variant with operations reversed.
|
51
|
+
* @see http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
|
52
|
+
*/
|
53
|
+
uint64_t fnv1a_64(const char *data, uint32_t len) {
|
54
|
+
uint64_t rv = 0xcbf29ce484222325ULL;
|
55
|
+
uint32_t i;
|
56
|
+
for (i = 0; i < len; i++) {
|
57
|
+
rv = (rv ^ (unsigned char)data[i]) * PRIME64;
|
58
|
+
}
|
59
|
+
return rv;
|
60
|
+
}
|
data/lib/bloombroom.rb
ADDED
@@ -0,0 +1,13 @@
|
|
1
|
+
require "bloombroom/version"
|
2
|
+
require "bloombroom/bits/bit_field"
|
3
|
+
require "bloombroom/bits/bit_bucket_field"
|
4
|
+
require "bloombroom/filter/bloom_helper"
|
5
|
+
require "bloombroom/filter/bloom_filter"
|
6
|
+
require "bloombroom/filter/continuous_bloom_filter"
|
7
|
+
require "bloombroom/hash/fnv_a"
|
8
|
+
require "bloombroom/hash/fnv_b"
|
9
|
+
require "bloombroom/hash/cext_fnv"
|
10
|
+
require "bloombroom/hash/ffi_fnv"
|
11
|
+
|
12
|
+
module Bloombroom
|
13
|
+
end
|
@@ -0,0 +1,90 @@
|
|
1
|
+
# create a bit bucket field of 100 buckets of 4 bits
|
2
|
+
# bf = BitBucketField.new(4, 100)
|
3
|
+
#
|
4
|
+
# bf[10] = 5 or bf.set(10, 5)
|
5
|
+
# bf[10] => 5 or bf.get(10) => 5
|
6
|
+
# bf[10] = 0
|
7
|
+
# bf.zero?(10) => true
|
8
|
+
#
|
9
|
+
# bf.to_s = "10101000101010101"
|
10
|
+
# bf.to_s(2) = "10101000101010101"
|
11
|
+
# bf.to_s(10) = "5 23 7"
|
12
|
+
|
13
|
+
module Bloombroom
|
14
|
+
class BitBucketField
|
15
|
+
attr_reader :size
|
16
|
+
include Enumerable
|
17
|
+
|
18
|
+
ELEMENT_WIDTH = 32
|
19
|
+
|
20
|
+
# new BitBucketField
|
21
|
+
# @param bits [Fixnum] number of bits per bucket
|
22
|
+
# @param size [Fixnum] number of buckets in field
|
23
|
+
def initialize(bits, size)
|
24
|
+
@size = size
|
25
|
+
@bits = bits
|
26
|
+
@buckets_per_element = ELEMENT_WIDTH / bits
|
27
|
+
@field = Array.new(((size - 1) / @buckets_per_element) + 1, 0)
|
28
|
+
@bucket_mask = (2 ** @bits) - 1
|
29
|
+
end
|
30
|
+
|
31
|
+
# set a bucket
|
32
|
+
# @param position [Fixnum] bucket position
|
33
|
+
# @param value [Fixnum] bucket value
|
34
|
+
def []=(position, value)
|
35
|
+
element, offset = position.divmod(@buckets_per_element)
|
36
|
+
shift_bits = offset * @bits
|
37
|
+
if value == 0
|
38
|
+
@field[element] &= ~(@bucket_mask << shift_bits)
|
39
|
+
else
|
40
|
+
@field[element] = (@field[element] & ~(@bucket_mask << shift_bits)) | value << shift_bits
|
41
|
+
end
|
42
|
+
end
|
43
|
+
alias_method :set, :[]=
|
44
|
+
|
45
|
+
# read a bucket
|
46
|
+
# @param position [Fixnum] bucket position
|
47
|
+
# @return [Fixnum] bucket value
|
48
|
+
def [](position)
|
49
|
+
element, offset = position.divmod(@buckets_per_element)
|
50
|
+
shift_bits = (position % @buckets_per_element) * @bits
|
51
|
+
(@field[element] & (@bucket_mask << shift_bits)) >> shift_bits
|
52
|
+
end
|
53
|
+
alias_method :get, :[]
|
54
|
+
|
55
|
+
def zero?(position)
|
56
|
+
element, offset = position.divmod(@buckets_per_element)
|
57
|
+
shift_bits = (position % @buckets_per_element) * @bits
|
58
|
+
(@field[element] & (@bucket_mask << shift_bits)) == 0
|
59
|
+
end
|
60
|
+
|
61
|
+
def inc(position)
|
62
|
+
end
|
63
|
+
|
64
|
+
def dec(position)
|
65
|
+
end
|
66
|
+
|
67
|
+
# iterate over each bucket
|
68
|
+
def each(&block)
|
69
|
+
@size.times { |position| yield self[position] }
|
70
|
+
end
|
71
|
+
|
72
|
+
# returns the field as a string like "0101010100111100," etc.
|
73
|
+
def to_s(base = 2)
|
74
|
+
case base
|
75
|
+
when 2
|
76
|
+
inject("") { |a, b| a + "%0#{@bits}b " % b }.strip
|
77
|
+
when 10
|
78
|
+
self.inject("") { |a, b| a + "%1d " % b }.strip
|
79
|
+
else
|
80
|
+
raise(ArgumentError, "unsupported base")
|
81
|
+
end
|
82
|
+
end
|
83
|
+
|
84
|
+
# returns the total number of non zero buckets
|
85
|
+
def total_set
|
86
|
+
self.inject(0) { |a, bucket| a += bucket.zero? ? 0 : 1; a }
|
87
|
+
end
|
88
|
+
|
89
|
+
end
|
90
|
+
end
|
@@ -0,0 +1,90 @@
|
|
1
|
+
# inspired by Peter Cooper's http://snippets.dzone.com/posts/show/4234
|
2
|
+
#
|
3
|
+
# create a bit field 1000 bits wide
|
4
|
+
# bf = BitField.new(1000)
|
5
|
+
#
|
6
|
+
# bf[100] = 1 or bf.set(100)
|
7
|
+
# bf[100] => 1 or bf.get(100) => 1
|
8
|
+
# bf[100] = 0 or bf.unset(100)
|
9
|
+
# bf.zero?(100) => true
|
10
|
+
#
|
11
|
+
# bf.to_s = "10101000101010101"
|
12
|
+
# bf.total_set => 10 (example - 10 bits are set to "1")
|
13
|
+
|
14
|
+
module Bloombroom
|
15
|
+
class BitField
|
16
|
+
attr_reader :size
|
17
|
+
include Enumerable
|
18
|
+
|
19
|
+
ELEMENT_WIDTH = 32
|
20
|
+
|
21
|
+
def initialize(size)
|
22
|
+
@size = size
|
23
|
+
@field = Array.new(((size - 1) / ELEMENT_WIDTH) + 1, 0)
|
24
|
+
end
|
25
|
+
|
26
|
+
# set a bit
|
27
|
+
# @param position [Fixnum] bit position
|
28
|
+
# @param value [Fixnum] bit value 0/1
|
29
|
+
def []=(position, value)
|
30
|
+
if value == 0
|
31
|
+
@field[position / ELEMENT_WIDTH] &= ~(1 << (position % ELEMENT_WIDTH))
|
32
|
+
else
|
33
|
+
@field[position / ELEMENT_WIDTH] |= 1 << (position % ELEMENT_WIDTH)
|
34
|
+
end
|
35
|
+
end
|
36
|
+
|
37
|
+
# read a bit
|
38
|
+
# @param position [Fixnum] bit position
|
39
|
+
# @return [Fixnum] bit value 0/1
|
40
|
+
def [](position)
|
41
|
+
@field[position / ELEMENT_WIDTH] & 1 << (position % ELEMENT_WIDTH) > 0 ? 1 : 0
|
42
|
+
end
|
43
|
+
alias_method :get, :[]
|
44
|
+
|
45
|
+
# set a bit to 1
|
46
|
+
# @param position [Fixnum] bit position
|
47
|
+
def set(position)
|
48
|
+
# duplicated code to avoid a method call
|
49
|
+
@field[position / ELEMENT_WIDTH] |= 1 << (position % ELEMENT_WIDTH)
|
50
|
+
end
|
51
|
+
|
52
|
+
# set a bit to 0
|
53
|
+
# @param position [Fixnum] bit position
|
54
|
+
def unset(position)
|
55
|
+
# duplicated code to avoid a method call
|
56
|
+
@field[position / ELEMENT_WIDTH] &= ~(1 << (position % ELEMENT_WIDTH))
|
57
|
+
end
|
58
|
+
|
59
|
+
# check if bit is set
|
60
|
+
# @param position [Fixnum] bit position
|
61
|
+
# @return [Boolean] true if bit is set
|
62
|
+
def include?(position)
|
63
|
+
@field[position / ELEMENT_WIDTH] & 1 << (position % ELEMENT_WIDTH) > 0
|
64
|
+
end
|
65
|
+
|
66
|
+
# check if bit is not set
|
67
|
+
# @param position [Fixnum] bit position
|
68
|
+
# @return [Boolean] true if bit is not set
|
69
|
+
def zero?(position)
|
70
|
+
# duplicated code to avoid a method call
|
71
|
+
@field[position / ELEMENT_WIDTH] & 1 << (position % ELEMENT_WIDTH) == 0
|
72
|
+
end
|
73
|
+
|
74
|
+
# iterate over each bit
|
75
|
+
def each(&block)
|
76
|
+
@size.times { |position| yield self[position] }
|
77
|
+
end
|
78
|
+
|
79
|
+
# returns the field as a string like "0101010100111100," etc.
|
80
|
+
def to_s
|
81
|
+
inject("") { |a, b| a + b.to_s }
|
82
|
+
end
|
83
|
+
|
84
|
+
# returns the total number of bits that are set
|
85
|
+
# (the technique used here is about 6 times faster than using each or inject direct on the bitfield)
|
86
|
+
def total_set
|
87
|
+
@field.inject(0) { |a, byte| a += byte & 1 and byte >>= 1 until byte == 0; a }
|
88
|
+
end
|
89
|
+
end
|
90
|
+
end
|
@@ -0,0 +1,45 @@
|
|
1
|
+
require 'bloombroom/hash/ffi_fnv'
|
2
|
+
require 'bloombroom/bits/bit_field'
|
3
|
+
require 'bloombroom/filter/bloom_helper'
|
4
|
+
|
5
|
+
module Bloombroom
|
6
|
+
|
7
|
+
# BloomFilter false positive probability rule of thumb: see http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
|
8
|
+
# a Bloom filter with a 1% error rate and an optimal value for k only needs 9.6 bits per key, and each time we add 4.8 bits
|
9
|
+
# per element we decrease the error rate by ten times.
|
10
|
+
#
|
11
|
+
# 10000 elements, 1% error rate: m = 10000 * 10 bits -> 12k of memory, k = 0.7 * (10000 * 10 bits / 10000) = 7 hash functions
|
12
|
+
# 10000 elements, 0.1% error rate: m = 10000 * 15 bits -> 18k of memory, k = 0.7 * (10000 * 15 bits / 10000) = 11 hash functions
|
13
|
+
#
|
14
|
+
# Bloombroom::BloomHelper.find_m_k can be used to compute optimal m & k values for a required capacity and error rate.
|
15
|
+
class BloomFilter
|
16
|
+
|
17
|
+
attr_reader :m, :k, :bits, :size
|
18
|
+
|
19
|
+
# @param m [Fixnum] filter size in bits
|
20
|
+
# @param k [Fixnum] number of hashing functions
|
21
|
+
def initialize(m, k)
|
22
|
+
@bits = BitField.new(m)
|
23
|
+
@m = m
|
24
|
+
@k = k
|
25
|
+
@size = 0
|
26
|
+
end
|
27
|
+
|
28
|
+
# @param key [String] the key to add in the filter
|
29
|
+
# @return [Fixnum] the total number of keys in the filter
|
30
|
+
def add(key)
|
31
|
+
BloomHelper.multi_hash(key, @k).each{|position| @bits.set(position % @m)}
|
32
|
+
@size += 1
|
33
|
+
end
|
34
|
+
alias_method :<<, :add
|
35
|
+
|
36
|
+
# @param key [String] test for the inclusion of key in the filter
|
37
|
+
# @return [Boolean] true if given key is present in the filter. false positives are possible and depend on the m and k filter parameters.
|
38
|
+
def include?(key)
|
39
|
+
BloomHelper.multi_hash(key, @k).each{|position| return false unless @bits.include?(position % @m)}
|
40
|
+
true
|
41
|
+
end
|
42
|
+
alias_method :[], :include?
|
43
|
+
|
44
|
+
end
|
45
|
+
end
|
@@ -0,0 +1,35 @@
|
|
1
|
+
require 'bloombroom/hash/ffi_fnv'
|
2
|
+
|
3
|
+
module Bloombroom
|
4
|
+
|
5
|
+
class BloomHelper
|
6
|
+
|
7
|
+
# compute optimal m and k for a given capacity and error rate
|
8
|
+
# @param capacity [Fixnum] number of expected keys
|
9
|
+
# @param error [Float] error rate (0.0 < error < 1.0). Ex: 1% == 0.01, 0.1% == 0.001, ...
|
10
|
+
def self.find_m_k(capacity, error)
|
11
|
+
# thanks to http://www.siaris.net/index.cgi/Programming/LanguageBits/Ruby/BloomFilter.rdoc
|
12
|
+
m = (capacity * Math.log(error) / Math.log(1.0 / 2 ** Math.log(2))).ceil
|
13
|
+
k = (Math.log(2) * m / capacity).round
|
14
|
+
[m, k]
|
15
|
+
end
|
16
|
+
|
17
|
+
# produce k hash values for key
|
18
|
+
# @param key [String] key to hash
|
19
|
+
# @param k [Fixnum] number of hash functions
|
20
|
+
def self.multi_hash(key, k)
|
21
|
+
# simulate n hash functions by having just two hash functions
|
22
|
+
# see http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.152.579&rep=rep1&type=pdf
|
23
|
+
# see http://willwhim.wordpress.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/
|
24
|
+
#
|
25
|
+
# fake two hash functions by using the upper/lower 32 bits of a 64 bits FNV1a hash
|
26
|
+
|
27
|
+
h = Bloombroom::FNVFFI.fnv1a_64(key)
|
28
|
+
a = (h & 0xFFFFFFFF00000000) >> 32
|
29
|
+
b = h & 0xFFFFFFFF
|
30
|
+
|
31
|
+
Array.new(k) {|i| (a + b * (i + 1))}
|
32
|
+
end
|
33
|
+
|
34
|
+
end
|
35
|
+
end
|
@@ -0,0 +1,108 @@
|
|
1
|
+
require 'bloombroom/hash/ffi_fnv'
|
2
|
+
require 'bloombroom/bits/bit_bucket_field'
|
3
|
+
require 'bloombroom/filter/bloom_helper'
|
4
|
+
require 'thread'
|
5
|
+
|
6
|
+
module Bloombroom
|
7
|
+
|
8
|
+
# ContinuousBloomFilter is a bloom filter for unbounded stream of keys where keys are expired over a given period
|
9
|
+
# of time. The expected capacity of the bloom filter for the desired validity period must be known or estimated.
|
10
|
+
# For a given capacity and error rate, BloomHelper.find_m_k can be used to compute optimal m & k values.
|
11
|
+
#
|
12
|
+
# 4 bits per bucket (instead of 1 bit in a normal bloom filter) are used for keeping track of the keys ttl.
|
13
|
+
# the internal timer resolution is set to half of the ttl (resolution divisor of 2). using 4 bits gives us
|
14
|
+
# 15 usable time slots (slot 0 is for the unset state). basically the internal time bookkeeping is similar to a
|
15
|
+
# ring buffer where the first timer tick will be time slot=1, slot=2, .. slot=15, slot=1 and so on. The total
|
16
|
+
# time of our internal clock will thus be 15 * (ttl / 2). We keep track of ttl by writing the current time slot
|
17
|
+
# in the key k buckets when first inserted in the filter. when doing a key lookup if any of the bucket contain
|
18
|
+
# the 0 value the key is not found. if the interval between the current time slot and any of the k buckets value
|
19
|
+
# is greater than 2 (resolution divisor) we know this key is expired and we reset the expired buckets to 0.
|
20
|
+
class ContinuousBloomFilter
|
21
|
+
|
22
|
+
attr_reader :m, :k, :ttl, :buckets
|
23
|
+
|
24
|
+
RESOLUTION_DIVISOR = 2
|
25
|
+
BITS_PER_BUCKET = 4
|
26
|
+
|
27
|
+
# @param m [Fixnum] total filter size in number of buckets. optimal m can be computed using BloomHelper.find_m_k
|
28
|
+
# @param k [Fixnum] number of hashing functions. optimal k can be computed using BloomHelper.find_m_k
|
29
|
+
# @param ttl [Fixnum] key time to live in seconds (validity period)
|
30
|
+
def initialize(m, k, ttl)
|
31
|
+
@m = m
|
32
|
+
@k = k
|
33
|
+
@ttl = ttl
|
34
|
+
@buckets = BitBucketField.new(BITS_PER_BUCKET, m)
|
35
|
+
|
36
|
+
# time management
|
37
|
+
@increment_period = @ttl / RESOLUTION_DIVISOR
|
38
|
+
@current_slot = 1
|
39
|
+
@max_slot = (2 ** BITS_PER_BUCKET) - 1 # ex. with 4 bits -> we want range 1..15
|
40
|
+
@lock = Mutex.new
|
41
|
+
end
|
42
|
+
|
43
|
+
# @param key [String] the key to add in the filter
|
44
|
+
# @return [ContinuousBloomFilter] self
|
45
|
+
def add(key)
|
46
|
+
current_slot = @lock.synchronize{@current_slot}
|
47
|
+
BloomHelper.multi_hash(key, @k).each{|position| @buckets[position % @m] = current_slot}
|
48
|
+
self
|
49
|
+
end
|
50
|
+
alias_method :<<, :add
|
51
|
+
|
52
|
+
# @param key [String] the key to test for inclusion in the filter
|
53
|
+
# @return [Boolean] true if the given key is present in the filter. false positives are possible and depend on the m and k filter parameters.
|
54
|
+
def include?(key)
|
55
|
+
current_slot = @lock.synchronize{@current_slot}
|
56
|
+
expired = false
|
57
|
+
|
58
|
+
BloomHelper.multi_hash(key, @k).each do |position|
|
59
|
+
start_slot = @buckets[position % @m]
|
60
|
+
if start_slot == 0
|
61
|
+
expired = true
|
62
|
+
elsif elapsed(start_slot, current_slot) > RESOLUTION_DIVISOR
|
63
|
+
expired = true
|
64
|
+
@buckets[position % @m] = 0
|
65
|
+
end
|
66
|
+
end
|
67
|
+
!expired
|
68
|
+
end
|
69
|
+
alias_method :[], :include?
|
70
|
+
|
71
|
+
# start the internal timer thread for managing ttls. must be explicitly called
|
72
|
+
def start_timer
|
73
|
+
@timer ||= detach_timer
|
74
|
+
end
|
75
|
+
|
76
|
+
# advance internal time slot. this is exposed primarily for spec'ing purposes.
|
77
|
+
# normally this is automatically called by the internal timer thread but if not
|
78
|
+
# using the internal timer thread it can be called explicitly when doing your
|
79
|
+
# own time management.
|
80
|
+
def inc_time_slot
|
81
|
+
# ex. with 4 bits -> slots cycle through 1..15 and never take the value 0 (unset state)
|
82
|
+
@lock.synchronize{@current_slot = (@current_slot % @max_slot) + 1}
|
83
|
+
end
|
84
|
+
|
85
|
+
private
|
86
|
+
|
87
|
+
def current_slot
|
88
|
+
@lock.synchronize{@current_slot}
|
89
|
+
end
|
90
|
+
|
91
|
+
def elapsed(start_slot, current_slot)
|
92
|
+
# ring buffer style
|
93
|
+
current_slot >= start_slot ? current_slot - start_slot : (current_slot + @max_slot) - start_slot
|
94
|
+
end
|
95
|
+
|
96
|
+
def detach_timer
|
97
|
+
Thread.new do
|
98
|
+
Thread.current.abort_on_exception = true
|
99
|
+
|
100
|
+
loop do
|
101
|
+
sleep(@increment_period)
|
102
|
+
inc_time_slot
|
103
|
+
end
|
104
|
+
end
|
105
|
+
end
|
106
|
+
|
107
|
+
end
|
108
|
+
end
|
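The slot arithmetic described in the header comment of `ContinuousBloomFilter` can be tried in isolation. The following is a minimal standalone sketch, not the gem's code: the method names `next_slot` and `expired?` and the default arguments are assumptions made for illustration, while the formulas are copied from `inc_time_slot`, `elapsed` and `include?` above.

```ruby
# Standalone sketch of the ring-buffer slot arithmetic (hypothetical helper names).
# max_slot = 15 corresponds to 4 bits per bucket; 0 is reserved for "unset".

# Advance the clock: 1, 2, .., 15, 1, 2, ..
def next_slot(slot, max_slot = 15)
  (slot % max_slot) + 1
end

# Number of ticks between the slot stored in a bucket and the current slot,
# accounting for wrap-around.
def elapsed(start_slot, current_slot, max_slot = 15)
  current_slot >= start_slot ? current_slot - start_slot : (current_slot + max_slot) - start_slot
end

# A bucket is treated as expired when unset or older than the resolution divisor.
def expired?(start_slot, current_slot, divisor = 2)
  start_slot == 0 || elapsed(start_slot, current_slot) > divisor
end

written_at = 14
slot = written_at
3.times do
  slot = next_slot(slot)
  puts "slot=#{slot} elapsed=#{elapsed(written_at, slot)} expired=#{expired?(written_at, slot)}"
end
# slot=15 elapsed=1 expired=false
# slot=1 elapsed=2 expired=false
# slot=2 elapsed=3 expired=true
```

One consequence of this arithmetic is that a bucket left untouched for a full cycle of 15 ticks reads as elapsed 0 again, so stale buckets only stay detectable if they are looked up (and reset) before the clock wraps.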
@@ -0,0 +1,30 @@
|
|
1
|
+
require 'ffi'
|
2
|
+
|
3
|
+
module Bloombroom
|
4
|
+
class FNVFFI
|
5
|
+
extend FFI::Library
|
6
|
+
|
7
|
+
ffi_lib File.dirname(__FILE__) + "/" + (FFI::Platform.mac? ? "ffi_fnv.bundle" : FFI.map_library_name("ffi_fnv"))
|
8
|
+
|
9
|
+
attach_function :c_fnv1_32, :fnv1_32, [:string, :uint32], :uint32
|
10
|
+
attach_function :c_fnv1a_32, :fnv1a_32, [:string, :uint32], :uint32
|
11
|
+
attach_function :c_fnv1_64, :fnv1_64, [:string, :uint32], :uint64
|
12
|
+
attach_function :c_fnv1a_64, :fnv1a_64, [:string, :uint32], :uint64
|
13
|
+
|
14
|
+
def self.fnv1_32(data)
|
15
|
+
c_fnv1_32(data, data.bytesize)
|
16
|
+
end
|
17
|
+
|
18
|
+
def self.fnv1_64(data)
|
19
|
+
c_fnv1_64(data, data.bytesize)
|
20
|
+
end
|
21
|
+
|
22
|
+
def self.fnv1a_32(data)
|
23
|
+
c_fnv1a_32(data, data.bytesize)
|
24
|
+
end
|
25
|
+
|
26
|
+
def self.fnv1a_64(data)
|
27
|
+
c_fnv1a_64(data, data.bytesize)
|
28
|
+
end
|
29
|
+
end
|
30
|
+
end
|
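The `FNVFFI` wrappers pass a length next to the string. Assuming the C functions treat that `:uint32` argument as the number of bytes to hash (the C source is not shown in this part of the diff), the value should be a byte count rather than a character count. The two differ for multibyte strings on Ruby 1.9 and later, as this small sketch shows:

```ruby
# A one-character UTF-8 string that occupies two bytes.
s = "\u00e9"

puts s.size      # character count
puts s.bytesize  # byte count
# 1
# 2
```

For pure ASCII keys both values are equal, so the difference only shows up with non-ASCII input.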
@@ -0,0 +1,100 @@
|
|
1
|
+
# based on https://github.com/andyjeffries/digestfnv
|
2
|
+
|
3
|
+
module Bloombroom
|
4
|
+
class FNVA
|
5
|
+
|
6
|
+
OFFSET32 = 2166136261
|
7
|
+
OFFSET64 = 14695981039346656037
|
8
|
+
OFFSET128 = 144066263297769815596495629667062367629
|
9
|
+
OFFSET256 = 100029257958052580907070968620625704837092796014241193945225284501741471925557
|
10
|
+
OFFSET512 = 9659303129496669498009435400716310466090418745672637896108374329434462657994582932197716438449813051892206539805784495328239340083876191928701583869517785
|
11
|
+
OFFSET1024 = 14197795064947621068722070641403218320880622795441933960878474914617582723252296732303717722150864096521202355549365628174669108571814760471015076148029755969804077320157692458563003215304957150157403644460363550505412711285966361610267868082893823963790439336411086884584107735010676915
|
12
|
+
|
13
|
+
PRIME32 = 16777619
|
14
|
+
PRIME64 = 1099511628211
|
15
|
+
PRIME128 = 309485009821345068724781371
|
16
|
+
PRIME256 = 374144419156711147060143317175368453031918731002211
|
17
|
+
PRIME512 = 35835915874844867368919076489095108449946327955754392558399825615420669938882575126094039892345713852759
|
18
|
+
PRIME1024 = 5016456510113118655434598811035278955030765345404790744303017523831112055108147451509157692220295382716162651878526895249385292291816524375083746691371804094271873160484737966720260389217684476157468082573
|
19
|
+
|
20
|
+
MASK32 = (2 ** 32) - 1
|
21
|
+
MASK64 = (2 ** 64) - 1
|
22
|
+
MASK128 = (2 ** 128) - 1
|
23
|
+
MASK256 = (2 ** 256) - 1
|
24
|
+
MASK512 = (2 ** 512) - 1
|
25
|
+
MASK1024 = (2 ** 1024) - 1
|
26
|
+
|
27
|
+
def self.fnv1_32(input)
|
28
|
+
hash = OFFSET32
|
29
|
+
input.each_byte { |b| hash = (hash * PRIME32) ^ b }
|
30
|
+
hash & MASK32
|
31
|
+
end
|
32
|
+
|
33
|
+
def self.fnv1_64(input)
|
34
|
+
hash = OFFSET64
|
35
|
+
input.each_byte { |b| hash = (hash * PRIME64) ^ b }
|
36
|
+
hash & MASK64
|
37
|
+
end
|
38
|
+
|
39
|
+
def self.fnv1_128(input)
|
40
|
+
hash = OFFSET128
|
41
|
+
input.each_byte { |b| hash = (hash * PRIME128) ^ b }
|
42
|
+
hash & MASK128
|
43
|
+
end
|
44
|
+
|
45
|
+
def self.fnv1_256(input)
|
46
|
+
hash = OFFSET256
|
47
|
+
input.each_byte { |b| hash = (hash * PRIME256) ^ b }
|
48
|
+
hash & MASK256
|
49
|
+
end
|
50
|
+
|
51
|
+
def self.fnv1_512(input)
|
52
|
+
hash = OFFSET512
|
53
|
+
input.each_byte { |b| hash = (hash * PRIME512) ^ b }
|
54
|
+
hash & MASK512
|
55
|
+
end
|
56
|
+
|
57
|
+
def self.fnv1_1024(input)
|
58
|
+
hash = OFFSET1024
|
59
|
+
input.each_byte { |b| hash = (hash * PRIME1024) ^ b }
|
60
|
+
hash & MASK1024
|
61
|
+
end
|
62
|
+
|
63
|
+
def self.fnv1a_32(input)
|
64
|
+
hash = OFFSET32
|
65
|
+
input.each_byte { |b| hash = (hash ^ b) * PRIME32 }
|
66
|
+
hash & MASK32
|
67
|
+
end
|
68
|
+
|
69
|
+
def self.fnv1a_64(input)
|
70
|
+
hash = OFFSET64
|
71
|
+
input.each_byte { |b| hash = (hash ^ b) * PRIME64 }
|
72
|
+
hash & MASK64
|
73
|
+
end
|
74
|
+
|
75
|
+
def self.fnv1a_128(input)
|
76
|
+
hash = OFFSET128
|
77
|
+
input.each_byte { |b| hash = (hash ^ b) * PRIME128 }
|
78
|
+
hash & MASK128
|
79
|
+
end
|
80
|
+
|
81
|
+
def self.fnv1a_256(input)
|
82
|
+
hash = OFFSET256
|
83
|
+
input.each_byte { |b| hash = (hash ^ b) * PRIME256 }
|
84
|
+
hash & MASK256
|
85
|
+
end
|
86
|
+
|
87
|
+
def self.fnv1a_512(input)
|
88
|
+
hash = OFFSET512
|
89
|
+
input.each_byte { |b| hash = (hash ^ b) * PRIME512 }
|
90
|
+
hash & MASK512
|
91
|
+
end
|
92
|
+
|
93
|
+
def self.fnv1a_1024(input)
|
94
|
+
hash = OFFSET1024
|
95
|
+
input.each_byte { |b| hash = (hash ^ b) * PRIME1024 }
|
96
|
+
hash & MASK1024
|
97
|
+
end
|
98
|
+
|
99
|
+
end
|
100
|
+
end
|
@@ -0,0 +1,56 @@
|
|
1
|
+
# based on https://github.com/jakedouglas/fnv-ruby
|
2
|
+
|
3
|
+
module Bloombroom
|
4
|
+
class FNVB
|
5
|
+
INIT32 = 0x811c9dc5
|
6
|
+
INIT64 = 0xcbf29ce484222325
|
7
|
+
PRIME32 = 0x01000193
|
8
|
+
PRIME64 = 0x100000001b3
|
9
|
+
MOD32 = 2 ** 32
|
10
|
+
MOD64 = 2 ** 64
|
11
|
+
|
12
|
+
def self.fnv1_32(data)
|
13
|
+
hash = INIT32
|
14
|
+
|
15
|
+
data.each_byte do |byte|
|
16
|
+
hash = (hash * PRIME32) % MOD32
|
17
|
+
hash = hash ^ byte
|
18
|
+
end
|
19
|
+
|
20
|
+
hash
|
21
|
+
end
|
22
|
+
|
23
|
+
def self.fnv1_64(data)
|
24
|
+
hash = INIT64
|
25
|
+
|
26
|
+
data.each_byte do |byte|
|
27
|
+
hash = (hash * PRIME64) % MOD64
|
28
|
+
hash = hash ^ byte
|
29
|
+
end
|
30
|
+
|
31
|
+
hash
|
32
|
+
end
|
33
|
+
|
34
|
+
def self.fnv1a_32(data)
|
35
|
+
hash = INIT32
|
36
|
+
|
37
|
+
data.each_byte do |byte|
|
38
|
+
hash = hash ^ byte
|
39
|
+
hash = (hash * PRIME32) % MOD32
|
40
|
+
end
|
41
|
+
|
42
|
+
hash
|
43
|
+
end
|
44
|
+
|
45
|
+
def self.fnv1a_64(data)
|
46
|
+
hash = INIT64
|
47
|
+
|
48
|
+
data.each_byte do |byte|
|
49
|
+
hash = hash ^ byte
|
50
|
+
hash = (hash * PRIME64) % MOD64
|
51
|
+
end
|
52
|
+
|
53
|
+
hash
|
54
|
+
end
|
55
|
+
end
|
56
|
+
end
|
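`FNVA` and `FNVB` implement the same two variants, which differ only in the order of the multiply and XOR steps: FNV-1 multiplies and then XORs the byte, FNV-1a XORs the byte and then multiplies. A minimal 32-bit sketch, written independently of both classes (the top-level method names are illustrative only), makes the difference visible on a one-byte input:

```ruby
# 32-bit FNV-1: multiply by the prime, then XOR the next byte.
def fnv1_32(data)
  hash = 0x811c9dc5 # 32-bit offset basis
  data.each_byte { |b| hash = ((hash * 0x01000193) & 0xffffffff) ^ b }
  hash
end

# 32-bit FNV-1a: XOR the next byte, then multiply by the prime.
def fnv1a_32(data)
  hash = 0x811c9dc5 # 32-bit offset basis
  data.each_byte { |b| hash = ((hash ^ b) * 0x01000193) & 0xffffffff }
  hash
end

puts format("%08x", fnv1_32("a"))
puts format("%08x", fnv1a_32("a"))
# 050c5d7e
# e40c292c
```

For an empty string both variants return the offset basis `0x811c9dc5`, since no byte is ever mixed in. Masking after every multiplication, as here and in `FNVB`, keeps the intermediate value within 32 bits; masking once at the end, as `FNVA` does, gives the same result but lets the intermediate integer grow with the input length.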
metadata
ADDED
@@ -0,0 +1,100 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: bloombroom
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 1.0.0
|
5
|
+
prerelease:
|
6
|
+
platform: ruby
|
7
|
+
authors:
|
8
|
+
- Colin Surprenant
|
9
|
+
autorequire:
|
10
|
+
bindir: bin
|
11
|
+
cert_chain: []
|
12
|
+
date: 2012-05-09 00:00:00.000000000 Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: rspec
|
16
|
+
requirement: !ruby/object:Gem::Requirement
|
17
|
+
none: false
|
18
|
+
requirements:
|
19
|
+
- - ~>
|
20
|
+
- !ruby/object:Gem::Version
|
21
|
+
version: 2.8.0
|
22
|
+
type: :development
|
23
|
+
prerelease: false
|
24
|
+
version_requirements: !ruby/object:Gem::Requirement
|
25
|
+
none: false
|
26
|
+
requirements:
|
27
|
+
- - ~>
|
28
|
+
- !ruby/object:Gem::Version
|
29
|
+
version: 2.8.0
|
30
|
+
- !ruby/object:Gem::Dependency
|
31
|
+
name: ffi
|
32
|
+
requirement: !ruby/object:Gem::Requirement
|
33
|
+
none: false
|
34
|
+
requirements:
|
35
|
+
- - ! '>='
|
36
|
+
- !ruby/object:Gem::Version
|
37
|
+
version: '0'
|
38
|
+
type: :runtime
|
39
|
+
prerelease: false
|
40
|
+
version_requirements: !ruby/object:Gem::Requirement
|
41
|
+
none: false
|
42
|
+
requirements:
|
43
|
+
- - ! '>='
|
44
|
+
- !ruby/object:Gem::Version
|
45
|
+
version: '0'
|
46
|
+
description: bloombroom has two bloom filter implementations, a standard filter for
|
47
|
+
bounded key space and a continuous filter for unbounded keys
|
48
|
+
(stream). also contains fast bit field and bit bucket field (multi
|
49
|
+
bits), native/C-ext/FFI FNV hashing and benchmarks for all these.
|
50
|
+
email:
|
51
|
+
- colin.surprenant@gmail.com
|
52
|
+
executables: []
|
53
|
+
extensions:
|
54
|
+
- ext/bloombroom/hash/cext/extconf.rb
|
55
|
+
- ext/bloombroom/hash/ffi/extconf.rb
|
56
|
+
extra_rdoc_files: []
|
57
|
+
files:
|
58
|
+
- lib/bloombroom/bits/bit_bucket_field.rb
|
59
|
+
- lib/bloombroom/bits/bit_field.rb
|
60
|
+
- lib/bloombroom/filter/bloom_filter.rb
|
61
|
+
- lib/bloombroom/filter/bloom_helper.rb
|
62
|
+
- lib/bloombroom/filter/continuous_bloom_filter.rb
|
63
|
+
- lib/bloombroom/hash/ffi_fnv.rb
|
64
|
+
- lib/bloombroom/hash/fnv_a.rb
|
65
|
+
- lib/bloombroom/hash/fnv_b.rb
|
66
|
+
- lib/bloombroom/version.rb
|
67
|
+
- lib/bloombroom.rb
|
68
|
+
- ext/bloombroom/hash/cext/extconf.rb
|
69
|
+
- ext/bloombroom/hash/ffi/extconf.rb
|
70
|
+
- ext/bloombroom/hash/cext/cext_fnv.c
|
71
|
+
- ext/bloombroom/hash/ffi/ffi_fnv.c
|
72
|
+
- README.md
|
73
|
+
- CHANGELOG.md
|
74
|
+
- LICENSE.md
|
75
|
+
homepage: https://github.com/colinsurprenant/bloombroom
|
76
|
+
licenses: []
|
77
|
+
post_install_message:
|
78
|
+
rdoc_options: []
|
79
|
+
require_paths:
|
80
|
+
- lib
|
81
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
82
|
+
none: false
|
83
|
+
requirements:
|
84
|
+
- - ! '>='
|
85
|
+
- !ruby/object:Gem::Version
|
86
|
+
version: '0'
|
87
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
88
|
+
none: false
|
89
|
+
requirements:
|
90
|
+
- - ! '>='
|
91
|
+
- !ruby/object:Gem::Version
|
92
|
+
version: '0'
|
93
|
+
requirements: []
|
94
|
+
rubyforge_project: bloombroom
|
95
|
+
rubygems_version: 1.8.24
|
96
|
+
signing_key:
|
97
|
+
specification_version: 3
|
98
|
+
summary: bloom filters for bounded and unbounded (streaming) data, FNV hashing and
|
99
|
+
bit fields
|
100
|
+
test_files: []
|