bloom_fit 0.3.1 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: cd631cdb483e0a84fa05d56eb962fda0f7c7d7a0b002ea708024ce82505a9054
4
- data.tar.gz: ee781997465d6f5b590828082e4fadd5b00768298bbdec7845b9f07c3d046549
3
+ metadata.gz: ed19ba044e45497c9026b8227e77c48cd62aea3043f698c6aca4955eb734f17e
4
+ data.tar.gz: e712cf58a3b6b11e38733da4437c95fbd94a7e9b07eeb4e72945138a140d730f
5
5
  SHA512:
6
- metadata.gz: 7862f2d0189bae865c6fc5e7c7ad24f5c7ab0420415a455a1a0b130835d639c536cb8925b08219eab7dd7a10db1e9299b2868019d3e2259db4dce96de01e50a2
7
- data.tar.gz: 41cb7f2fcb8cf80f5345785ce0110e242a29fbe6177284b13b701973ec7b0e7010d788585e406f77712f7ee284ff308633fe060e492b0e153a4a5598658fd465
6
+ metadata.gz: 69b2b91fdf8e3995931507a53b13c6923e225faef01f6cf39c3524e9ad2e63673411452719fc93509aaa830b42f4fa45198cf39f0b3f70cd33f60846116f5430
7
+ data.tar.gz: e33e427c4bd6ca79d818887dbca0d80348a868fdea85930197e92e087c63adb8a3d339b74d420a894ed55c731e05fedfeb9875a57e5f32698ee342c2836a1ebc
data/README.md CHANGED
@@ -1,77 +1,250 @@
1
- # BloomFit makes Bloom Filter tuning easy
1
+ # BloomFit
2
2
 
3
- [![Gem Version](http://img.shields.io/gem/v/bloom_fit.svg)](https://rubygems.org/gems/bloom_fit)
3
+ [![Gem Version](https://img.shields.io/gem/v/bloom_fit.svg)](https://rubygems.org/gems/bloom_fit)
4
4
  [![CI](https://github.com/rmm5t/bloom_fit/actions/workflows/ci.yml/badge.svg)](https://github.com/rmm5t/bloom_fit/actions/workflows/ci.yml)
5
5
  [![Gem Downloads](https://img.shields.io/gem/dt/bloom_fit.svg)](https://rubygems.org/gems/bloom_fit)
6
6
 
7
- BloomFit provides a MRI/C-based non-counting bloom filter for use in your Ruby projects. It is heavily based on [bloomfilter-rb]'s native implementation, but differs in the following ways:
7
+ BloomFit is an in-memory, non-counting Bloom filter for Ruby backed by a small C extension.
8
+
9
+ It gives you a compact, Set-like API for probabilistic membership checks:
10
+
11
+ - false positives are possible
12
+ - false negatives are not, as long as a value was added to the same filter
13
+ - individual values cannot be deleted safely because the filter is non-counting
14
+
15
+ BloomFit is heavily inspired by [bloomfilter-rb]'s native implementation and the original C implementation by Tatsuya Mori. This version uses a DJB2 hash with salts from the CRC table and wraps the native filter in a Ruby-friendly API. The most common way to use it is to pass an expected `capacity` and optional `false_positive_rate`, then let BloomFit calculate `size` and `hashes` for you.
16
+
17
+ Compared with bloomfilter-rb, BloomFit:
8
18
 
9
19
  - uses DJB2 over CRC32 yielding better hash distribution
10
20
  - improves performance for very large datasets
11
21
  - avoids the need to supply a seed
12
- - automatically calculates the bit size (m) and the number of hashes (k) when given a capacity and false-positive-rate
22
+ - automatically calculates the filter size (`m`) and hash count (`k`) from capacity and false-positive rate
13
23
 
14
- A [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter) is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Instead of using k different hash functions, this implementation uses a DJB2 hash with k seeds from the CRC table.
24
+ ## Features
15
25
 
16
- Performance of the Bloom filter depends on the following:
26
+ - native `CBloomFilter` implementation for MRI Ruby
27
+ - automatic sizing from `capacity` and `false_positive_rate`
28
+ - small Ruby API with familiar methods like `add`, `include?`, `merge`, `|`, and `&`
29
+ - supports strings, symbols, integers, booleans, and other values that can be converted with `to_s`
30
+ - manual `size` / `hashes` overrides when you want control
31
+ - save and reload filters with Ruby `Marshal`
32
+ - inspect filter state with `stats`, `to_hex`, `to_binary`, and `bitmap`
17
33
 
18
- - size of the bit array
19
- - number of hash functions
34
+ ## Requirements
20
35
 
21
- ## Resources
36
+ - Ruby `>= 3.2.0`
22
37
 
23
- - Background: [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter)
24
- - Determining parameters: [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
25
- - Applications & reasons behind bloom filter: [Flow analysis: Time based bloom filter](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/)
38
+ ## Installation
26
39
 
27
- ## Examples
40
+ ```bash
41
+ gem install bloom_fit
42
+ ```
28
43
 
29
- MRI/C implementation which creates an in-memory filter which can be saved and reloaded from disk.
44
+ ```ruby
45
+ require "bloom_fit"
46
+ ```
30
47
 
31
- (COMING SOON) If you'd like to specify an expected item count and a false-positive rate that you can tolerate. Visit the [Bloom Filter Calculator](https://hur.st/bloomfilter/) to learn more.
48
+ ## Quick Start
32
49
 
33
50
  ```ruby
34
51
  require "bloom_fit"
35
52
 
36
- bf = BloomFit.new(capacity: 250, false_positive_rate: 0.001)
37
- bf.add("cat")
38
- bf.include?("cat") # => true
39
- bf.include?("dog") # => false
40
-
41
- # Hash syntax with a bloom filter!
42
- bf["bird"] = "bar"
43
- bf["bird"] # => true
44
- bf["mouse"] # => false
45
-
46
- puts bf.stats
47
- # Number of filter bits (m): 3600
48
- # Number of set bits (n): 20
49
- # Number of filter hashes (k) : 10
50
- # Predicted false positive rate = 0.00%
53
+ filter = BloomFit.new(capacity: 250, false_positive_rate: 0.001)
54
+
55
+ filter.add("cat")
56
+ filter << :dog
57
+
58
+ filter.include?("cat") # => true
59
+ filter.key?("dog") # => true
60
+ filter["bird"] # => false
61
+
62
+ filter["owl"] = true
63
+ filter["ant"] = false
64
+
65
+ filter["owl"] # => true
66
+ filter["ant"] # => false
67
+
68
+ filter.empty? # => false
69
+
70
+ filter.size # => 3595
71
+ filter.hashes # => 10
72
+
73
+ filter.clear
74
+ filter.empty? # => true
51
75
  ```
52
76
 
53
- If you'd like more control over the traditional inputs like bit size and the number of hashes:
77
+ `#include?`, `#key?`, and `#[]` are aliases. `#add` and `#<<` are also aliases.
78
+
79
+ ## Automatic Sizing
80
+
81
+ BloomFit now calculates `size` and `hashes` for you when you initialize it with an expected capacity:
54
82
 
55
83
  ```ruby
56
- require "bloom_fit"
84
+ filter = BloomFit.new(capacity: 10_000, false_positive_rate: 0.01)
85
+
86
+ filter.size # => 95851
87
+ filter.hashes # => 7
88
+ ```
89
+
90
+ The defaults are a good starting point for many small filters:
91
+
92
+ ```ruby
93
+ filter = BloomFit.new
94
+
95
+ filter.size # => 1438
96
+ filter.hashes # => 10
97
+ ```
98
+
99
+ That is equivalent to:
100
+
101
+ ```ruby
102
+ filter = BloomFit.new(capacity: 100, false_positive_rate: 0.001)
103
+ ```
104
+
105
+ Internally BloomFit uses the standard Bloom filter formulas:
106
+
107
+ ```text
108
+ m = -(n * ln(p)) / (ln(2)^2)
109
+ k = (m / n) * ln(2)
110
+ ```
111
+
112
+ - `n`: expected number of inserted values
113
+ - `p`: target false-positive rate
114
+ - `m`: number of filter buckets (`size`)
115
+ - `k`: number of hash functions (`hashes`)
116
+
117
+ For example, if you expect about `10_000` inserts and can tolerate a `1%` false-positive rate, BloomFit will calculate `size: 95_851` and `hashes: 7` for you.
118
+
119
+ If you prefer a calculator, see [Bloom Filter Calculator](https://hur.st/bloomfilter/).
120
+
121
+ ## Manual Sizing
122
+
123
+ If you already know the exact filter width and hash count you want, you can still pass them directly:
124
+
125
+ ```ruby
126
+ filter = BloomFit.new(size: 95_851, hashes: 7)
127
+ ```
128
+
129
+ This bypasses automatic sizing.
130
+
131
+ ## Common Operations
57
132
 
58
- bf = BloomFit.new(size: 100, hashes: 2)
59
- bf.add("cat")
60
- bf.include?("cat") # => true
61
- bf.include?("dog") # => false
62
-
63
- # Hash syntax with a bloom filter!
64
- bf["bird"] = "bar"
65
- bf["bird"] # => true
66
- bf["mouse"] # => false
67
-
68
- puts bf.stats
69
- # Number of filter bits (m): 100
70
- # Number of set bits (n): 4
71
- # Number of filter hashes (k) : 2
72
- # Predicted false positive rate = 10.87%
133
+ ### Add and check membership
134
+
135
+ ```ruby
136
+ filter = BloomFit.new(capacity: 100)
137
+
138
+ filter << "cat"
139
+ filter << "dog"
140
+
141
+ filter.include?("cat") # => true
142
+ filter.include?("bird") # => false
143
+ ```
144
+
145
+ ### Use hash-like syntax for truthy values
146
+
147
+ ```ruby
148
+ filter = BloomFit.new(capacity: 64)
149
+
150
+ filter[:cat] = true
151
+ filter[:dog] = false
152
+
153
+ filter[:cat] # => true
154
+ filter[:dog] # => false
155
+
156
+ filter.merge({ bird: true, ant: nil })
157
+
158
+ filter.include?(:bird) # => true
159
+ filter.include?(:ant) # => false
160
+ ```
161
+
162
+ When merging a hash, only keys with truthy values are added.
163
+
164
+ ### Merge, union, and intersection
165
+
166
+ ```ruby
167
+ pets = BloomFit.new(capacity: 50)
168
+ pets << "cat" << "dog"
169
+
170
+ more_pets = BloomFit.new(capacity: 50)
171
+ more_pets << "dog" << "bird"
172
+
173
+ combined = pets | more_pets
174
+ overlap = pets & more_pets
175
+
176
+ combined.include?("bird") # => true
177
+ overlap.include?("dog") # => true
178
+ overlap.include?("cat") # => false
179
+ ```
180
+
181
+ `#merge` also accepts arrays, sets, and other enumerables:
182
+
183
+ ```ruby
184
+ filter = BloomFit.new(capacity: 100)
185
+ filter.merge(%w[cat dog bird])
186
+ ```
187
+
188
+ Filters can only be combined when they have the same `size` and `hashes`. Otherwise BloomFit raises `ArgumentError`.
189
+
190
+ When you create filters with automatic sizing, use the same `capacity` and `false_positive_rate` for filters you plan to merge, union, or intersect.
191
+
192
+ ### Save and load filters
193
+
194
+ ```ruby
195
+ filter = BloomFit.new(capacity: 100)
196
+ filter << "cat" << "dog"
197
+ filter.save("pets.bloom")
198
+
199
+ reloaded = BloomFit.load("pets.bloom")
200
+ reloaded.include?("cat") # => true
201
+ reloaded.include?("dog") # => true
202
+ ```
203
+
204
+ Persistence uses Ruby `Marshal`. Only load files you trust.
205
+
206
+ ### Inspect the bitmap
207
+
208
+ ```ruby
209
+ filter = BloomFit.new(size: 16, hashes: 4)
210
+ filter << "cool"
211
+
212
+ filter.to_hex # => "1441"
213
+ filter.to_binary # => "0001010001000001"
214
+ filter.bitmap # => raw bytes from the native filter
73
215
  ```
74
216
 
217
+ `#bitmap` returns the native byte representation, which may include padding bytes beyond the configured filter width. `#to_binary` trims the result to exactly `size` bits.
218
+
219
+ ## API Overview
220
+
221
+ | Method | Notes |
222
+ | --- | --- |
223
+ | `BloomFit.new` or `BloomFit.new(capacity:, false_positive_rate:)` | Creates a filter and calculates `size` and `hashes` automatically. Defaults to `capacity: 100`, `false_positive_rate: 0.001`. |
224
+ | `BloomFit.new(size:, hashes:)` | Creates a filter with explicit sizing when you want fixed parameters. |
225
+ | `add`, `<<` | Adds a value and returns the filter. |
226
+ | `add?` | Adds the value only when the filter does not already report it as present. |
227
+ | `include?`, `key?`, `[]` | Probabilistic membership check. |
228
+ | `[]=` | Adds a key only when the assigned value is truthy. |
229
+ | `merge` | Merges another filter or an enumerable into the receiver. |
230
+ | `\|`, `union` | Returns a new filter containing the union. |
231
+ | `&`, `intersection` | Returns a new filter containing the intersection. |
232
+ | `clear` | Resets all bits to `0`. |
233
+ | `empty?` | Exact check for whether any bits are set. |
234
+ | `size`, `m` | Returns the configured filter width. |
235
+ | `hashes`, `k` | Returns the number of hash functions. |
236
+ | `set_bits`, `n` | Returns the number of bits currently set. |
237
+ | `stats` | Returns a human-readable summary including predicted false-positive rate. |
238
+ | `to_hex`, `to_binary`, `bitmap` | Returns the filter bitmap in different representations. |
239
+ | `save`, `BloomFit.load` | Serializes and restores a filter with Ruby `Marshal`. |
240
+
241
+ ## Resources
242
+
243
+ - Background: [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter)
244
+ - Determining parameters: [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
245
+ - Applications and motivation: [Flow analysis: Time based bloom filter](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/)
246
+ - Calculator: [Bloom Filter Calculator](https://hur.st/bloomfilter/)
247
+
75
248
  ## Credits
76
249
 
77
250
  - Tatsuya Mori <valdzone@gmail.com> (Original C implementation)
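
As a rough illustration of the sizing formulas quoted in the README's Automatic Sizing section, the sketch below recomputes `size` and `hashes` in plain Ruby without loading the gem. The helper name `bloom_parameters` is hypothetical and not part of BloomFit, and rounding both values up with `ceil` is an assumption chosen to match the figures the README reports.

```ruby
# Hypothetical helper, not part of the bloom_fit gem.
# n = expected inserts (capacity), p = target false-positive rate.
def bloom_parameters(capacity, false_positive_rate)
  ln2 = Math.log(2.0)
  # m = -(n * ln(p)) / (ln(2)^2), rounded up to a whole number of buckets
  m = (-capacity * Math.log(false_positive_rate) / (ln2**2)).ceil
  # k = (m / n) * ln(2), using float division before rounding up
  k = (m.to_f / capacity * ln2).ceil
  { size: m, hashes: k }
end

p bloom_parameters(10_000, 0.01) # matches the README: size 95851, hashes 7
p bloom_parameters(100, 0.001)   # matches the README defaults: size 1438, hashes 10
```

Running it for `capacity: 250, false_positive_rate: 0.001` should likewise give the `3595` and `10` shown in the Quick Start.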
@@ -4,15 +4,15 @@
4
4
  */
5
5
 
6
6
  #include "ruby.h"
7
- #include "crc32.h"
7
+ #include <limits.h>
8
+ #include "salts.h"
8
9
 
9
10
  #if !defined(RSTRING_LEN)
10
11
  # define RSTRING_LEN(x) (RSTRING(x)->len)
11
12
  # define RSTRING_PTR(x) (RSTRING(x)->ptr)
12
13
  #endif
13
14
 
14
- /* Reuse the standard CRC table for consistent salts */
15
- static unsigned int *salts = crc_table;
15
+ static const int salts_length = sizeof(salts) / sizeof(salts[0]);
16
16
 
17
17
  static VALUE cBloomFilter;
18
18
 
@@ -26,7 +26,7 @@ struct BloomFilter {
26
26
  unsigned long djb2(const char *str, int len) {
27
27
  unsigned long hash = 5381;
28
28
  for (int i = 0; i < len; i++) {
29
- hash = ((hash << 5) + hash) + str[i];
29
+ hash = ((hash << 5) + hash) + (unsigned char) str[i];
30
30
  }
31
31
  return hash;
32
32
  }
@@ -92,14 +92,41 @@ static int bucket_check(struct BloomFilter *bf, int index) {
92
92
  return (bf->ptr[byte_offset] >> bit_offset) & 1;
93
93
  }
94
94
 
95
+ static void bf_ensure_compatible(struct BloomFilter *bf, struct BloomFilter *other) {
96
+ if (bf->m != other->m || bf->k != other->k || bf->bytes != other->bytes) {
97
+ rb_raise(rb_eArgError, "bloom filters must have matching size and hash count");
98
+ }
99
+ }
100
+
101
+ static void bf_clear_padding_bits(struct BloomFilter *bf) {
102
+ int full_bytes = bf->m / 8;
103
+ int remaining_bits = bf->m % 8;
104
+ int i;
105
+
106
+ if (remaining_bits > 0) {
107
+ unsigned char mask = (unsigned char) ((1U << remaining_bits) - 1U);
108
+ bf->ptr[full_bytes] &= mask;
109
+ full_bytes += 1;
110
+ }
111
+
112
+ for (i = full_bytes; i < bf->bytes; i++) {
113
+ bf->ptr[i] = 0;
114
+ }
115
+ }
116
+
95
117
  static VALUE bf_initialize(int argc, VALUE *argv, VALUE self) {
96
118
  struct BloomFilter *bf;
97
119
  VALUE arg1, arg2;
120
+ long m_value, k_value;
98
121
  int m, k;
99
122
 
100
123
  bf = bf_ptr(self);
101
124
 
102
- /* default = Fugou approach :-) */
125
+ if (argc > 2) {
126
+ rb_error_arity(argc, 0, 2);
127
+ }
128
+
129
+ /* defaults */
103
130
  arg1 = INT2FIX(1000);
104
131
  arg2 = INT2FIX(4);
105
132
 
@@ -111,13 +138,23 @@ static VALUE bf_initialize(int argc, VALUE *argv, VALUE self) {
111
138
  break;
112
139
  }
113
140
 
114
- m = FIX2INT(arg1);
115
- k = FIX2INT(arg2);
141
+ m_value = NUM2LONG(arg1);
142
+ k_value = NUM2LONG(arg2);
143
+
144
+ if (m_value > INT_MAX - 15)
145
+ rb_raise(rb_eRangeError, "bit length is too large");
146
+ if (k_value > INT_MAX)
147
+ rb_raise(rb_eRangeError, "hash length is too large");
148
+
149
+ m = (int) m_value;
150
+ k = (int) k_value;
116
151
 
117
152
  if (m < 1)
118
- rb_raise(rb_eArgError, "array size");
153
+ rb_raise(rb_eArgError, "bit length must be >= 1");
119
154
  if (k < 1)
120
- rb_raise(rb_eArgError, "hash length");
155
+ rb_raise(rb_eArgError, "hash length must be >= 1");
156
+ if (k > salts_length)
157
+ rb_raise(rb_eArgError, "hash length must be <= %d", salts_length);
121
158
 
122
159
  bf->m = m;
123
160
  bf->k = k;
@@ -131,7 +168,6 @@ static VALUE bf_initialize(int argc, VALUE *argv, VALUE self) {
131
168
 
132
169
  /* initialize the bits with zeros */
133
170
  memset(bf->ptr, 0, bf->bytes);
134
- rb_iv_set(self, "@hash_value", rb_hash_new());
135
171
 
136
172
  return self;
137
173
  }
@@ -154,12 +190,18 @@ static VALUE bf_k(VALUE self) {
154
190
 
155
191
  static VALUE bf_set_bits(VALUE self){
156
192
  struct BloomFilter *bf = bf_ptr(self);
157
- int i,j,count = 0;
193
+ int i, count = 0;
194
+
158
195
  for (i = 0; i < bf->bytes; i++) {
159
- for (j = 0; j < 8; j++) {
160
- count += (bf->ptr[i] >> j) & 1;
196
+ unsigned char byte = bf->ptr[i];
197
+
198
+ /* Brian Kernighan’s bit-count loop */
199
+ while (byte != 0) {
200
+ byte &= (unsigned char) (byte - 1);
201
+ count++;
161
202
  }
162
203
  }
204
+
163
205
  return INT2FIX(count);
164
206
  }
165
207
 
@@ -193,6 +235,9 @@ static VALUE bf_merge(VALUE self, VALUE other) {
193
235
  struct BloomFilter *bf = bf_ptr(self);
194
236
  struct BloomFilter *target = bf_ptr(other);
195
237
  int i;
238
+
239
+ bf_ensure_compatible(bf, target);
240
+
196
241
  for (i = 0; i < bf->bytes; i++) {
197
242
  bf->ptr[i] |= target->ptr[i];
198
243
  }
@@ -206,6 +251,8 @@ static VALUE bf_and(VALUE self, VALUE other) {
206
251
  VALUE klass, obj, args[5];
207
252
  int i;
208
253
 
254
+ bf_ensure_compatible(bf, bf_other);
255
+
209
256
  args[0] = INT2FIX(bf->m);
210
257
  args[1] = INT2FIX(bf->k);
211
258
  klass = rb_funcall(self,rb_intern("class"),0);
@@ -225,6 +272,8 @@ static VALUE bf_or(VALUE self, VALUE other) {
225
272
  VALUE klass, obj, args[5];
226
273
  int i;
227
274
 
275
+ bf_ensure_compatible(bf, bf_other);
276
+
228
277
  args[0] = INT2FIX(bf->m);
229
278
  args[1] = INT2FIX(bf->k);
230
279
  klass = rb_funcall(self,rb_intern("class"),0);
@@ -278,9 +327,17 @@ static VALUE bf_bitmap(VALUE self) {
278
327
 
279
328
  static VALUE bf_load(VALUE self, VALUE bitmap) {
280
329
  struct BloomFilter *bf = bf_ptr(self);
281
- unsigned char* ptr = (unsigned char *) RSTRING_PTR(bitmap);
330
+ VALUE bitmap_string = StringValue(bitmap);
331
+ unsigned char* ptr;
332
+
333
+ if (RSTRING_LEN(bitmap_string) != bf->bytes) {
334
+ rb_raise(rb_eArgError, "bitmap length must be %d bytes", bf->bytes);
335
+ }
336
+
337
+ ptr = (unsigned char *) RSTRING_PTR(bitmap_string);
282
338
 
283
339
  memcpy(bf->ptr, ptr, bf->bytes);
340
+ bf_clear_padding_bits(bf);
284
341
 
285
342
  return Qnil;
286
343
  }
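
The padding-bit cleanup in `bf_clear_padding_bits` and the Kernighan-style count in `bf_set_bits` can be modeled in Ruby for readers who would rather not trace the C. This is only a sketch under stated assumptions: `clear_padding_bits` and `count_set_bits` are hypothetical names, the bitmap is modeled as a binary String that is long enough to hold `m` bits, and bit 0 of each byte is taken to be the lowest bucket in that byte, as the `>> bit_offset` check in `bucket_check` suggests.

```ruby
# Hypothetical Ruby model of the native helpers; not the gem's API.

# Zero every bit at or beyond position m. Assumes bitmap holds at least m bits.
def clear_padding_bits(bitmap, m)
  bytes = bitmap.bytes
  full_bytes = m / 8
  remaining_bits = m % 8

  if remaining_bits > 0
    # Keep only the low `remaining_bits` bits of the partially used byte.
    bytes[full_bytes] &= (1 << remaining_bits) - 1
    full_bytes += 1
  end

  # Any bytes after the last used one are pure padding.
  (full_bytes...bytes.length).each { |i| bytes[i] = 0 }
  bytes.pack("C*")
end

# Kernighan's loop: each pass clears the lowest set bit of the byte.
def count_set_bits(bitmap)
  bitmap.each_byte.sum do |byte|
    count = 0
    while byte != 0
      byte &= byte - 1
      count += 1
    end
    count
  end
end

clean = clear_padding_bits("\xFF\xFF\xFF".b, 10)
p clean.bytes           # => [255, 3, 0]
p count_set_bits(clean) # => 10
```

With padding cleared on load, the set-bit count can never exceed `m`, which keeps a fill-level estimate based on that count within range.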
@@ -0,0 +1,50 @@
1
+ /*
2
+ * Borrowed from the CRC table
3
+ * https://www.mrob.com/pub/comp/crc-all.html
4
+ *
5
+ */
6
+ static unsigned int salts[] = {
7
+ 0x00000000UL, 0x77073096UL, 0xee0e612cUL, 0x990951baUL, 0x076dc419UL, 0x706af48fUL,
8
+ 0xe963a535UL, 0x9e6495a3UL, 0x0edb8832UL, 0x79dcb8a4UL, 0xe0d5e91eUL, 0x97d2d988UL,
9
+ 0x09b64c2bUL, 0x7eb17cbdUL, 0xe7b82d07UL, 0x90bf1d91UL, 0x1db71064UL, 0x6ab020f2UL,
10
+ 0xf3b97148UL, 0x84be41deUL, 0x1adad47dUL, 0x6ddde4ebUL, 0xf4d4b551UL, 0x83d385c7UL,
11
+ 0x136c9856UL, 0x646ba8c0UL, 0xfd62f97aUL, 0x8a65c9ecUL, 0x14015c4fUL, 0x63066cd9UL,
12
+ 0xfa0f3d63UL, 0x8d080df5UL, 0x3b6e20c8UL, 0x4c69105eUL, 0xd56041e4UL, 0xa2677172UL,
13
+ 0x3c03e4d1UL, 0x4b04d447UL, 0xd20d85fdUL, 0xa50ab56bUL, 0x35b5a8faUL, 0x42b2986cUL,
14
+ 0xdbbbc9d6UL, 0xacbcf940UL, 0x32d86ce3UL, 0x45df5c75UL, 0xdcd60dcfUL, 0xabd13d59UL,
15
+ 0x26d930acUL, 0x51de003aUL, 0xc8d75180UL, 0xbfd06116UL, 0x21b4f4b5UL, 0x56b3c423UL,
16
+ 0xcfba9599UL, 0xb8bda50fUL, 0x2802b89eUL, 0x5f058808UL, 0xc60cd9b2UL, 0xb10be924UL,
17
+ 0x2f6f7c87UL, 0x58684c11UL, 0xc1611dabUL, 0xb6662d3dUL, 0x76dc4190UL, 0x01db7106UL,
18
+ 0x98d220bcUL, 0xefd5102aUL, 0x71b18589UL, 0x06b6b51fUL, 0x9fbfe4a5UL, 0xe8b8d433UL,
19
+ 0x7807c9a2UL, 0x0f00f934UL, 0x9609a88eUL, 0xe10e9818UL, 0x7f6a0dbbUL, 0x086d3d2dUL,
20
+ 0x91646c97UL, 0xe6635c01UL, 0x6b6b51f4UL, 0x1c6c6162UL, 0x856530d8UL, 0xf262004eUL,
21
+ 0x6c0695edUL, 0x1b01a57bUL, 0x8208f4c1UL, 0xf50fc457UL, 0x65b0d9c6UL, 0x12b7e950UL,
22
+ 0x8bbeb8eaUL, 0xfcb9887cUL, 0x62dd1ddfUL, 0x15da2d49UL, 0x8cd37cf3UL, 0xfbd44c65UL,
23
+ 0x4db26158UL, 0x3ab551ceUL, 0xa3bc0074UL, 0xd4bb30e2UL, 0x4adfa541UL, 0x3dd895d7UL,
24
+ 0xa4d1c46dUL, 0xd3d6f4fbUL, 0x4369e96aUL, 0x346ed9fcUL, 0xad678846UL, 0xda60b8d0UL,
25
+ 0x44042d73UL, 0x33031de5UL, 0xaa0a4c5fUL, 0xdd0d7cc9UL, 0x5005713cUL, 0x270241aaUL,
26
+ 0xbe0b1010UL, 0xc90c2086UL, 0x5768b525UL, 0x206f85b3UL, 0xb966d409UL, 0xce61e49fUL,
27
+ 0x5edef90eUL, 0x29d9c998UL, 0xb0d09822UL, 0xc7d7a8b4UL, 0x59b33d17UL, 0x2eb40d81UL,
28
+ 0xb7bd5c3bUL, 0xc0ba6cadUL, 0xedb88320UL, 0x9abfb3b6UL, 0x03b6e20cUL, 0x74b1d29aUL,
29
+ 0xead54739UL, 0x9dd277afUL, 0x04db2615UL, 0x73dc1683UL, 0xe3630b12UL, 0x94643b84UL,
30
+ 0x0d6d6a3eUL, 0x7a6a5aa8UL, 0xe40ecf0bUL, 0x9309ff9dUL, 0x0a00ae27UL, 0x7d079eb1UL,
31
+ 0xf00f9344UL, 0x8708a3d2UL, 0x1e01f268UL, 0x6906c2feUL, 0xf762575dUL, 0x806567cbUL,
32
+ 0x196c3671UL, 0x6e6b06e7UL, 0xfed41b76UL, 0x89d32be0UL, 0x10da7a5aUL, 0x67dd4accUL,
33
+ 0xf9b9df6fUL, 0x8ebeeff9UL, 0x17b7be43UL, 0x60b08ed5UL, 0xd6d6a3e8UL, 0xa1d1937eUL,
34
+ 0x38d8c2c4UL, 0x4fdff252UL, 0xd1bb67f1UL, 0xa6bc5767UL, 0x3fb506ddUL, 0x48b2364bUL,
35
+ 0xd80d2bdaUL, 0xaf0a1b4cUL, 0x36034af6UL, 0x41047a60UL, 0xdf60efc3UL, 0xa867df55UL,
36
+ 0x316e8eefUL, 0x4669be79UL, 0xcb61b38cUL, 0xbc66831aUL, 0x256fd2a0UL, 0x5268e236UL,
37
+ 0xcc0c7795UL, 0xbb0b4703UL, 0x220216b9UL, 0x5505262fUL, 0xc5ba3bbeUL, 0xb2bd0b28UL,
38
+ 0x2bb45a92UL, 0x5cb36a04UL, 0xc2d7ffa7UL, 0xb5d0cf31UL, 0x2cd99e8bUL, 0x5bdeae1dUL,
39
+ 0x9b64c2b0UL, 0xec63f226UL, 0x756aa39cUL, 0x026d930aUL, 0x9c0906a9UL, 0xeb0e363fUL,
40
+ 0x72076785UL, 0x05005713UL, 0x95bf4a82UL, 0xe2b87a14UL, 0x7bb12baeUL, 0x0cb61b38UL,
41
+ 0x92d28e9bUL, 0xe5d5be0dUL, 0x7cdcefb7UL, 0x0bdbdf21UL, 0x86d3d2d4UL, 0xf1d4e242UL,
42
+ 0x68ddb3f8UL, 0x1fda836eUL, 0x81be16cdUL, 0xf6b9265bUL, 0x6fb077e1UL, 0x18b74777UL,
43
+ 0x88085ae6UL, 0xff0f6a70UL, 0x66063bcaUL, 0x11010b5cUL, 0x8f659effUL, 0xf862ae69UL,
44
+ 0x616bffd3UL, 0x166ccf45UL, 0xa00ae278UL, 0xd70dd2eeUL, 0x4e048354UL, 0x3903b3c2UL,
45
+ 0xa7672661UL, 0xd06016f7UL, 0x4969474dUL, 0x3e6e77dbUL, 0xaed16a4aUL, 0xd9d65adcUL,
46
+ 0x40df0b66UL, 0x37d83bf0UL, 0xa9bcae53UL, 0xdebb9ec5UL, 0x47b2cf7fUL, 0x30b5ffe9UL,
47
+ 0xbdbdf21cUL, 0xcabac28aUL, 0x53b39330UL, 0x24b4a3a6UL, 0xbad03605UL, 0xcdd70693UL,
48
+ 0x54de5729UL, 0x23d967bfUL, 0xb3667a2eUL, 0xc4614ab8UL, 0x5d681b02UL, 0x2a6f2b94UL,
49
+ 0xb40bbe37UL, 0xc30c8ea1UL, 0x5a05df1bUL, 0x2d02ef8dUL
50
+ };
@@ -1,3 +1,3 @@
1
1
  class BloomFit
2
- VERSION = "0.3.1".freeze
2
+ VERSION = "1.1.0".freeze
3
3
  end
data/lib/bloom_fit.rb CHANGED
@@ -1,7 +1,6 @@
1
1
  require "forwardable"
2
2
 
3
3
  require "cbloomfilter"
4
- require "bloom_fit/configuration_mismatch"
5
4
  require "bloom_fit/version"
6
5
 
7
6
  # BloomFit is an in-memory Bloom filter with a small, Set-like API.
@@ -16,7 +15,7 @@ require "bloom_fit/version"
16
15
  # serialized with +save+ and reloaded with +BloomFit.load+.
17
16
  #
18
17
  # Filters can only be combined when they were created with the same +size+ and
19
- # +hashes+ values; otherwise +BloomFit::ConfigurationMismatch+ is raised.
18
+ # +hashes+ values; otherwise the native extension raises +ArgumentError+.
20
19
  #
21
20
  # filter = BloomFit.new(size: 10_000, hashes: 6)
22
21
  # filter.add("cat")
@@ -28,6 +27,8 @@ require "bloom_fit/version"
28
27
  class BloomFit
29
28
  extend Forwardable
30
29
 
30
+ LN2 = Math.log(2.0).freeze
31
+
31
32
  # The wrapped native +CBloomFilter+ instance.
32
33
  #
33
34
  # This is mostly useful for low-level integrations and internal filter
@@ -40,9 +41,19 @@ class BloomFit
40
41
  # but the best values depend on how many keys you expect to insert and how
41
42
  # many false positives you can tolerate.
42
43
  #
44
+ # @param capacity [Integer] expected number of elements to store in the set
45
+ # @param false_positive_rate [Float] acceptable false-positive rate, strictly between 0 and 1
43
46
  # @param size [Integer] number of buckets in a bloom filter
44
47
  # @param hashes [Integer] number of hash functions
45
- def initialize(size: 1_000, hashes: 4)
48
+ def initialize(capacity: 100, false_positive_rate: 0.001, size: nil, hashes: 4)
49
+ if size.nil? || hashes.nil?
50
+ raise ArgumentError, "capacity must be > 0" unless capacity.positive?
51
+ raise ArgumentError, "false_positive_rate must be between 0 and 1" if false_positive_rate <= 0.0 || false_positive_rate >= 1.0
52
+
53
+ size = (-capacity.to_f * Math.log(false_positive_rate) / (LN2**2)).ceil
54
+ hashes = (size.to_f / capacity * LN2).ceil
55
+ end
56
+
46
57
  @bf = CBloomFilter.new(size, hashes)
47
58
  end
48
59
 
@@ -68,15 +79,11 @@ class BloomFit
68
79
  #
69
80
  # Positive results are probabilistic and may be false positives.
70
81
 
71
- # :method: clear
72
- #
73
- # Clears the filter by resetting all bits to +0+.
74
-
75
82
  # :method: set_bits
76
83
  #
77
84
  # Returns the number of bits currently set to +1+.
78
85
 
79
- def_delegators :@bf, :m, :k, :bitmap, :include?, :clear, :set_bits
86
+ def_delegators :@bf, :m, :k, :bitmap, :include?, :set_bits
80
87
 
81
88
  # Returns the configured filter width.
82
89
  alias size m
@@ -103,6 +110,12 @@ class BloomFit
103
110
  end
104
111
  alias << add
105
112
 
113
+ # Clears the filter by resetting all bits to +0+ and returns +self+.
114
+ def clear
115
+ @bf.clear
116
+ self
117
+ end
118
+
106
119
  # Adds +key+ to the filter when +value+ is truthy.
107
120
  #
108
121
  # This makes BloomFit behave like a write-only membership hash: truthy values
@@ -150,7 +163,6 @@ class BloomFit
150
163
  # This method mutates the receiver and mimics Set#merge.
151
164
  def merge(other)
152
165
  if other.is_a?(BloomFit)
153
- raise BloomFit::ConfigurationMismatch unless same_parameters?(other)
154
166
  @bf.merge(other.bf)
155
167
  elsif other.respond_to?(:each_key)
156
168
  other.each { |k, v| add(k) if v }
@@ -159,17 +171,18 @@ class BloomFit
159
171
  else
160
172
  raise ArgumentError, "value must be enumerable or another BloomFit filter"
161
173
  end
174
+
175
+ self
162
176
  end
163
177
 
164
178
  # Returns a new filter containing the bitwise intersection of two filters.
165
179
  #
166
- # Both filters must have the same +size+ and +hashes+ values or
167
- # +BloomFit::ConfigurationMismatch+ is raised.
180
+ # Both filters must have the same +size+ and +hashes+ values or the native
181
+ # extension raises +ArgumentError+.
168
182
  #
169
183
  # Like all Bloom filter operations, membership checks on the result remain
170
184
  # probabilistic and may still produce false positives.
171
185
  def &(other)
172
- raise BloomFit::ConfigurationMismatch unless same_parameters?(other)
173
186
  self.class.new(size:, hashes:).tap do |result|
174
187
  result.instance_variable_set(:@bf, @bf.&(other.bf))
175
188
  end
@@ -178,12 +191,11 @@ class BloomFit
178
191
 
179
192
  # Returns a new filter containing the bitwise union of two filters.
180
193
  #
181
- # Both filters must have the same +size+ and +hashes+ values or
182
- # +BloomFit::ConfigurationMismatch+ is raised.
194
+ # Both filters must have the same +size+ and +hashes+ values or the native
195
+ # extension raises +ArgumentError+.
183
196
  #
184
197
  # The receiver and +other+ are left unchanged.
185
198
  def |(other)
186
- raise BloomFit::ConfigurationMismatch unless same_parameters?(other)
187
199
  self.class.new(size:, hashes:).tap do |result|
188
200
  result.instance_variable_set(:@bf, @bf.|(other.bf))
189
201
  end
@@ -196,14 +208,14 @@ class BloomFit
196
208
  # bits (+n+), the hash count (+k+), and the predicted false-positive rate
197
209
  # based on the current fill level.
198
210
  def stats
199
- fpr = ((1.0 - Math.exp(-(k * n).to_f / m))**k) * 100
211
+ fpr = ((n.to_f / m)**k) * 100
200
212
 
201
- (+"").tap do |s|
202
- s << format("Number of filter buckets (m): %d\n", m)
203
- s << format("Number of set bits (n): %d\n", n)
204
- s << format("Number of filter hashes (k): %d\n", k)
205
- s << format("Predicted false positive rate: %.2f%%\n", fpr)
206
- end
213
+ format <<~STATS, m, n, k, fpr
214
+ Number of filter buckets (m): %d
215
+ Number of set bits (n): %d
216
+ Number of filter hashes (k): %d
217
+ Predicted false positive rate: %.2f%%
218
+ STATS
207
219
  end
208
220
 
209
221
  # Rebuilds the filter from the serialized data returned by +marshal_dump+.
@@ -226,20 +238,11 @@ class BloomFit
226
238
  # The file is read using Ruby's +Marshal+ format, so it should only be used
227
239
  # with trusted input.
228
240
  def self.load(filename)
229
- Marshal.load(File.open(filename, "r")) # rubocop:disable Security/MarshalLoad
241
+ Marshal.load(File.binread(filename)) # rubocop:disable Security/MarshalLoad
230
242
  end
231
243
 
232
244
  # Writes the filter to +filename+ using Ruby's +Marshal+ format.
233
245
  def save(filename)
234
- File.open(filename, "w") do |f|
235
- f << Marshal.dump(self)
236
- end
237
- end
238
-
239
- protected
240
-
241
- # Returns +true+ when +other+ has the same +size+ and +hashes+ values.
242
- def same_parameters?(other)
243
- bf.m == other.bf.m && bf.k == other.bf.k
246
+ File.binwrite(filename, Marshal.dump(self))
244
247
  end
245
248
  end
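
The revised `stats` arithmetic is easy to check by hand: it raises the fraction of set bits to the power of the hash count. The sketch below reproduces that calculation outside the gem; `predicted_fpr_percent` is a hypothetical helper name, and the inputs are the ones used by the new spec (3 set bits, 10 buckets, 3 hashes).

```ruby
# Hypothetical helper mirroring the arithmetic in the new `stats` method.
# A lookup is a false positive only if all k probed buckets are already set,
# so the estimate is (set_bits / size) ** hashes, expressed as a percentage.
def predicted_fpr_percent(set_bits, size, hashes)
  ((set_bits.to_f / size)**hashes) * 100
end

puts format("%.2f%%", predicted_fpr_percent(3, 10, 3)) # prints "2.70%"
```

The earlier expression, `(1 - e^(-k*n/m))^k`, is the textbook estimate when `n` is the number of inserted items; because `n` in this class is the set-bit count, the fill-ratio form appears to be the better fit.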
Binary file
@@ -3,6 +3,28 @@ require "test_helper"
3
3
  class BloomFitTest < Minitest::Spec
4
4
  subject { BloomFit.new(size: 100, hashes: 4) }
5
5
 
6
+ describe ".new" do
7
+ it "accepts size and hashes override" do
8
+ bf = BloomFit.new(size: 10, hashes: 1)
9
+ assert_equal 10, bf.size
10
+ assert_equal 1, bf.hashes
11
+ end
12
+
13
+ it "has default capacity and false-positive rate" do
14
+ bf = BloomFit.new
15
+ # https://hur.st/bloomfilter/?n=100&p=0.001&m=&k=
16
+ assert_equal 1438, bf.size
17
+ assert_equal 10, bf.hashes
18
+ end
19
+
20
+ it "calculates size and hashes given a capacity and false positive rate" do
21
+ bf = BloomFit.new(capacity: 10_000, false_positive_rate: 0.0001)
22
+ # https://hur.st/bloomfilter/?n=10000&p=0.0001&m=&k=
23
+ assert_equal 191_702, bf.size
24
+ assert_equal 14, bf.hashes
25
+ end
26
+ end
27
+
6
28
  describe "#empty?" do
7
29
  it "returns true when nothing set" do
8
30
  assert_equal true, subject.empty? # rubocop:disable Minitest/AssertTruthy
@@ -102,11 +124,11 @@ class BloomFitTest < Minitest::Spec
102
124
  end
103
125
 
104
126
  describe "#clear" do
105
- it "zeroes the bits" do
127
+ it "zeroes the bits and returns self" do
106
128
  subject.add("test")
107
129
  assert_includes subject, "test"
108
130
  assert_includes subject.to_binary, "1"
109
- subject.clear
131
+ assert_equal subject, subject.clear
110
132
  refute_includes subject, "test"
111
133
  refute_includes subject.to_binary, "1"
112
134
  end
@@ -180,14 +202,14 @@ class BloomFitTest < Minitest::Spec
180
202
  end
181
203
 
182
204
  describe "#merge" do
183
- it "merges another BloomFit filter" do
205
+ it "merges another BloomFit filter and returns self" do
184
206
  bf1 = BloomFit.new(size: 100, hashes: 2)
185
207
  bf2 = BloomFit.new(size: 100, hashes: 2)
186
208
  bf1 << "mouse"
187
209
  bf2 << "cat" << "dog"
188
210
  refute_includes bf1, "cat"
189
211
  refute_includes bf1, "dog"
190
- bf1.merge(bf2)
212
+ assert_equal bf1, bf1.merge(bf2)
191
213
  assert_includes bf1, "mouse"
192
214
  assert_includes bf1, "cat"
193
215
  assert_includes bf1, "dog"
@@ -196,9 +218,9 @@ class BloomFitTest < Minitest::Spec
196
218
  assert_includes bf2, "dog"
197
219
  end
198
220
 
199
- it "merges an array" do
221
+ it "merges an array and returns self" do
200
222
  subject << "mouse"
201
- subject.merge %i[cat dog]
223
+ assert_equal subject, subject.merge(%i[cat dog])
202
224
  assert_includes subject, "mouse"
203
225
  assert_includes subject, "cat"
204
226
  assert_includes subject, "dog"
@@ -225,7 +247,7 @@ class BloomFitTest < Minitest::Spec
225
247
  it "raises when merge is between incompatible filters" do
226
248
  bf1 = BloomFit.new(size: 10)
227
249
  bf2 = BloomFit.new(size: 20)
228
- assert_raises(BloomFit::ConfigurationMismatch) { bf1.merge(bf2) }
250
+ assert_raises(ArgumentError) { bf1.merge(bf2) }
229
251
  end
230
252
  end
231
253
 
@@ -263,11 +285,11 @@ class BloomFitTest < Minitest::Spec
263
285
  it "raises when intersection is between incompatible filters" do
264
286
  bf1 = BloomFit.new(size: 10)
265
287
  bf2 = BloomFit.new(size: 20)
266
- assert_raises(BloomFit::ConfigurationMismatch) { bf1 & bf2 }
288
+ assert_raises(ArgumentError) { bf1 & bf2 }
267
289
 
268
290
  bf1 = BloomFit.new(size: 10, hashes: 2)
269
291
  bf2 = BloomFit.new(size: 10, hashes: 4)
270
- assert_raises(BloomFit::ConfigurationMismatch) { bf1 & bf2 }
292
+ assert_raises(ArgumentError) { bf1 & bf2 }
271
293
  end
272
294
  end
273
295
 
@@ -303,7 +325,7 @@ class BloomFitTest < Minitest::Spec
303
325
  it "raises when union is between incompatible filters" do
304
326
  bf1 = BloomFit.new(size: 10)
305
327
  bf2 = BloomFit.new(size: 20)
306
- assert_raises(BloomFit::ConfigurationMismatch) { bf1 | bf2 }
328
+ assert_raises(ArgumentError) { bf1 | bf2 }
307
329
  end
308
330
  end
309
331
 
@@ -318,16 +340,51 @@ class BloomFitTest < Minitest::Spec
318
340
  STATS
319
341
  assert_equal expected, bf.stats
320
342
  end
343
+
344
+ it "estimates false positives from the current fill level" do
345
+ bf = BloomFit.new(size: 10, hashes: 3)
346
+ bf.bf.load("\x07\x00\x00".b)
347
+
348
+ expected = <<~STATS
349
+ Number of filter buckets (m): 10
350
+ Number of set bits (n): 3
351
+ Number of filter hashes (k): 3
352
+ Predicted false positive rate: 2.70%
353
+ STATS
354
+ assert_equal expected, bf.stats
355
+ end
321
356
  end
322
357
 
323
358
  describe "serialization" do
324
- after { File.unlink("bf.out") }
359
+ after { FileUtils.rm_f("bf.out") }
325
360
 
326
361
  it "marshalls" do
327
362
  bf = BloomFit.new
328
363
  assert bf.save("bf.out")
329
364
  end
330
365
 
366
+ it "uses binary file io" do
367
+ dumped = Marshal.dump(subject)
368
+ writer = Minitest::Mock.new
369
+ writer.expect(:call, dumped.bytesize, ["bf.out", dumped])
370
+
371
+ reader = Minitest::Mock.new
372
+ reader.expect(:call, dumped, ["bf.out"])
373
+
374
+ File.stub(:binwrite, writer) do
375
+ assert_equal dumped.bytesize, subject.save("bf.out")
376
+ end
377
+
378
+ File.stub(:binread, reader) do
379
+ bf2 = BloomFit.load("bf.out")
380
+ assert_equal subject.size, bf2.size
381
+ assert_equal subject.hashes, bf2.hashes
382
+ end
383
+
384
+ writer.verify
385
+ reader.verify
386
+ end
387
+
331
388
  it "loads from marshalled" do
332
389
  subject.add("foo")
333
390
  subject.add("bar")
@@ -338,7 +395,8 @@ class BloomFitTest < Minitest::Spec
338
395
  assert_includes bf2, "bar"
339
396
  refute_includes bf2, "baz"
340
397
 
341
- assert subject.send(:same_parameters?, bf2)
398
+ assert_equal subject.size, bf2.size
399
+ assert_equal subject.hashes, bf2.hashes
342
400
  end
343
401
  end
344
402
  end
@@ -0,0 +1,233 @@
1
+ require "test_helper"
2
+
3
+ class CBloomFilterTest < Minitest::Spec
4
+ subject { CBloomFilter.new }
5
+
6
+ describe ".new" do
7
+ it "rejects more than two arguments" do
8
+ error = assert_raises(ArgumentError) { CBloomFilter.new(1, 2, 3) }
9
+ assert_equal "wrong number of arguments (given 3, expected 0..2)", error.message
10
+ end
11
+ end
12
+
13
+ describe "#m" do
14
+ it "defaults" do
15
+ assert_equal 1000, subject.m
16
+ end
17
+
18
+ it "is set by the 1st arg of the constructor" do
19
+ bf = CBloomFilter.new(10_000)
20
+ assert_equal 10_000, bf.m
21
+ end
22
+
23
+ it "rejects values less than 1" do
24
+ error = assert_raises(ArgumentError) { CBloomFilter.new(-1) }
25
+ assert_equal "bit length must be >= 1", error.message
26
+ end
27
+
28
+ it "rejects values that overflow internal byte sizing" do
29
+ error = assert_raises(RangeError) { CBloomFilter.new((1 << 31) - 7) }
30
+ assert_equal "bit length is too large", error.message
31
+ end
32
+ end
33
+
34
+ describe "#k" do
35
+ it "defaults" do
36
+ assert_equal 4, subject.k
37
+ end
38
+
39
+ it "is set by the 2nd arg of the constructor" do
40
+ bf = CBloomFilter.new(10_000, 9)
41
+ assert_equal 9, bf.k
42
+ end
43
+
44
+ it "rejects values less than 1" do
45
+ error = assert_raises(ArgumentError) { CBloomFilter.new(1000, 0) }
46
+ assert_equal "hash length must be >= 1", error.message
47
+ end
48
+
49
+ it "rejects values larger than the salt table" do
50
+ error = assert_raises(ArgumentError) { CBloomFilter.new(10_000, 257) }
51
+ assert_equal "hash length must be <= 256", error.message
52
+ end
53
+ end
54
+
55
+ describe "#set_bits" do
56
+ it "initializes to zero" do
57
+ assert_equal 0, subject.set_bits
58
+ end
59
+
60
+ it "counts the bits when active" do
61
+ subject.add("foo")
62
+ assert_equal 4, subject.set_bits
63
+ end
64
+ end
65
+
66
+ describe "#add" do
67
+ it "adds keys to the filter set" do
68
+ subject.add("foo")
69
+ subject.add("bar")
70
+ assert_includes subject, "foo"
71
+ assert_includes subject, "bar"
72
+ refute_includes subject, "baz"
73
+ end
74
+
75
+ it "treats binary bytes as unsigned when hashing" do
76
+ bf = CBloomFilter.new(20, 4)
77
+ bf.add("\xFF".b)
78
+ assert_equal "\x00\x05\x05\x00".b, bf.bitmap
79
+ end
80
+ end
81
+
82
+ describe "#include?" do
83
+ it "returns true when a key is in the set" do
84
+ subject.add("foo")
85
+ assert_equal true, subject.include?("foo") # rubocop:disable Minitest/AssertTruthy
86
+ end
87
+
88
+ it "returns false when a key is not in the set" do
89
+ subject.add("foo")
90
+ assert_equal false, subject.include?("bar") # rubocop:disable Minitest/RefuteFalse
91
+ end
92
+ end
93
+
94
+ describe "#clear" do
95
+ it "clears a set" do
96
+ subject.add("foo")
97
+ subject.add("bar")
98
+ subject.add("baz")
99
+ assert subject.set_bits.positive?
100
+ subject.clear
101
+ assert subject.set_bits.zero?
102
+ end
103
+ end
104
+
105
+ describe "#merge" do
106
+ it "adds keys from another set" do
107
+ subject.add("foo")
108
+
109
+ bf = CBloomFilter.new
110
+ bf.add("bar")
111
+ bf.add("baz")
112
+
113
+ subject.merge(bf)
114
+ assert_includes subject, "foo"
115
+ assert_includes subject, "bar"
116
+ assert_includes subject, "baz"
117
+ end
118
+
119
+ it "rejects incompatible filters" do
120
+ error = assert_raises(ArgumentError) { subject.merge(CBloomFilter.new(2000, 4)) }
121
+ assert_equal "bloom filters must have matching size and hash count", error.message
122
+ end
123
+ end
124
+
125
+ describe "#&" do
126
+ it "intersects keys from another set" do
127
+ subject.add("foo")
128
+ subject.add("bar")
129
+
130
+ bf = CBloomFilter.new
131
+ bf.add("bar")
132
+ bf.add("baz")
133
+
134
+ bf2 = subject & bf
135
+ refute_includes bf2, "foo"
136
+ assert_includes bf2, "bar"
137
+ refute_includes bf2, "baz"
138
+
139
+ bf3 = bf & subject
140
+ refute_includes bf3, "foo"
141
+ assert_includes bf3, "bar"
142
+ refute_includes bf3, "baz"
143
+ end
144
+
145
+ it "rejects incompatible filters" do
146
+ error = assert_raises(ArgumentError) { subject & CBloomFilter.new(1000, 2) }
147
+ assert_equal "bloom filters must have matching size and hash count", error.message
148
+ end
149
+ end
150
+
151
+ describe "#|" do
152
+ it "unions keys from another set" do
153
+ subject.add("foo")
154
+ subject.add("bar")
155
+
156
+ bf = CBloomFilter.new
157
+ bf.add("bar")
158
+ bf.add("baz")
159
+
160
+ bf2 = subject | bf
161
+ assert_includes bf2, "foo"
162
+ assert_includes bf2, "bar"
163
+ assert_includes bf2, "baz"
164
+
165
+ bf3 = bf | subject
166
+ assert_includes bf3, "foo"
167
+ assert_includes bf3, "bar"
168
+ assert_includes bf3, "baz"
169
+ end
170
+
171
+ it "rejects incompatible filters" do
172
+ error = assert_raises(ArgumentError) { subject | CBloomFilter.new(2000, 4) }
173
+ assert_equal "bloom filters must have matching size and hash count", error.message
174
+ end
175
+ end
176
+
177
+ describe "#bitmap" do
178
+ it "returns a binary bitmap of all zeros when empty (including a terminating byte)" do
179
+ bf = CBloomFilter.new(16)
180
+ assert_equal "\x00\x00\x00".b, bf.bitmap
181
+ end
182
+
183
+ it "returns a binary bitmap representing the set" do
184
+ bf = CBloomFilter.new(16, 4)
185
+ bf.add("something")
186
+ assert_equal "(\x82\x00".b, bf.bitmap
187
+ end
188
+
189
+ it "returns a binary bitmap representing the set even if not a multiple of 8 bits (includes padding)" do
190
+ bf = CBloomFilter.new(20, 4)
191
+ bf.add("wow")
192
+ assert_equal "\x04\x14\x00\x00".b, bf.bitmap
193
+ end
194
+ end
195
+
196
+ describe "#load" do
197
+ it "overwrites the bitmap" do
198
+ bf = CBloomFilter.new(1000, 4)
199
+ bf.add("foo")
200
+ bf.add("bar")
201
+ subject.load(bf.bitmap)
202
+ assert_includes subject, "foo"
203
+ assert_includes subject, "bar"
204
+ end
205
+
206
+ it "rejects a short bitmap" do
207
+ error = assert_raises(ArgumentError) { subject.load("\x00".b) }
208
+ assert_equal "bitmap length must be 126 bytes", error.message
209
+ end
210
+
211
+ it "rejects a long bitmap" do
212
+ error = assert_raises(ArgumentError) { subject.load("\x00".b * 127) }
213
+ assert_equal "bitmap length must be 126 bytes", error.message
214
+ end
215
+
216
+ it "coerces bitmap-like objects to strings before loading" do
217
+ bitmap_data = subject.bitmap
218
+ bitmap = Object.new
219
+ bitmap.define_singleton_method(:to_str) { bitmap_data }
220
+ subject.load(bitmap)
221
+ assert_equal 0, subject.set_bits
222
+ end
223
+
224
+ it "clears loaded padding bits beyond the configured size" do
225
+ bf = CBloomFilter.new(20, 4)
226
+
227
+ bf.load("\x00\x00\xF0\xFF".b)
228
+
229
+ assert_equal 0, bf.set_bits
230
+ assert_equal "\x00\x00\x00\x00".b, bf.bitmap
231
+ end
232
+ end
233
+ end
data/test/test_helper.rb CHANGED
@@ -1,4 +1,5 @@
1
1
  require "minitest/autorun"
2
+ require "minitest/mock"
2
3
  require "minitest/reporters"
3
4
 
4
5
  Minitest::Reporters.use! # override with MINITEST_REPORTER env var
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bloom_fit
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.3.1
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ryan McGeary
@@ -24,13 +24,13 @@ extra_rdoc_files: []
24
24
  files:
25
25
  - README.md
26
26
  - ext/cbloomfilter/cbloomfilter.c
27
- - ext/cbloomfilter/crc32.h
28
27
  - ext/cbloomfilter/extconf.rb
28
+ - ext/cbloomfilter/salts.h
29
29
  - lib/bloom_fit.rb
30
- - lib/bloom_fit/configuration_mismatch.rb
31
30
  - lib/bloom_fit/version.rb
32
31
  - lib/cbloomfilter.bundle
33
32
  - test/bloom_fit_test.rb
33
+ - test/c_bloom_filter_test.rb
34
34
  - test/test_helper.rb
35
35
  homepage: https://github.com/rmm5t/bloom_fit
36
36
  licenses: []
@@ -1,76 +0,0 @@
1
- /* simple CRC32 code */
2
- /*
3
- * Copyright 2005 Aris Adamantiadis
4
- *
5
- * This file is part of the SSH Library
6
- *
7
- * The SSH Library is free software; you can redistribute it and/or modify
8
- * it under the terms of the GNU Lesser General Public License as published by
9
- * the Free Software Foundation; either version 2.1 of the License, or (at your
10
- * option) any later version.
11
- *
12
- *
13
- * The SSH Library is distributed in the hope that it will be useful, but
14
- * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
15
- * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
16
- * License for more details.
17
- *
18
- * You should have received a copy of the GNU Lesser General Public License
19
- * along with the SSH Library; see the file COPYING. If not, write to
20
- * the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
21
- * MA 02111-1307, USA. */
22
-
23
- static unsigned int crc_table[] = {
24
- 0x00000000UL, 0x77073096UL, 0xee0e612cUL, 0x990951baUL, 0x076dc419UL,
25
- 0x706af48fUL, 0xe963a535UL, 0x9e6495a3UL, 0x0edb8832UL, 0x79dcb8a4UL,
26
- 0xe0d5e91eUL, 0x97d2d988UL, 0x09b64c2bUL, 0x7eb17cbdUL, 0xe7b82d07UL,
27
- 0x90bf1d91UL, 0x1db71064UL, 0x6ab020f2UL, 0xf3b97148UL, 0x84be41deUL,
28
- 0x1adad47dUL, 0x6ddde4ebUL, 0xf4d4b551UL, 0x83d385c7UL, 0x136c9856UL,
29
- 0x646ba8c0UL, 0xfd62f97aUL, 0x8a65c9ecUL, 0x14015c4fUL, 0x63066cd9UL,
30
- 0xfa0f3d63UL, 0x8d080df5UL, 0x3b6e20c8UL, 0x4c69105eUL, 0xd56041e4UL,
31
- 0xa2677172UL, 0x3c03e4d1UL, 0x4b04d447UL, 0xd20d85fdUL, 0xa50ab56bUL,
32
- 0x35b5a8faUL, 0x42b2986cUL, 0xdbbbc9d6UL, 0xacbcf940UL, 0x32d86ce3UL,
33
- 0x45df5c75UL, 0xdcd60dcfUL, 0xabd13d59UL, 0x26d930acUL, 0x51de003aUL,
34
- 0xc8d75180UL, 0xbfd06116UL, 0x21b4f4b5UL, 0x56b3c423UL, 0xcfba9599UL,
35
- 0xb8bda50fUL, 0x2802b89eUL, 0x5f058808UL, 0xc60cd9b2UL, 0xb10be924UL,
36
- 0x2f6f7c87UL, 0x58684c11UL, 0xc1611dabUL, 0xb6662d3dUL, 0x76dc4190UL,
37
- 0x01db7106UL, 0x98d220bcUL, 0xefd5102aUL, 0x71b18589UL, 0x06b6b51fUL,
38
- 0x9fbfe4a5UL, 0xe8b8d433UL, 0x7807c9a2UL, 0x0f00f934UL, 0x9609a88eUL,
39
- 0xe10e9818UL, 0x7f6a0dbbUL, 0x086d3d2dUL, 0x91646c97UL, 0xe6635c01UL,
40
- 0x6b6b51f4UL, 0x1c6c6162UL, 0x856530d8UL, 0xf262004eUL, 0x6c0695edUL,
41
- 0x1b01a57bUL, 0x8208f4c1UL, 0xf50fc457UL, 0x65b0d9c6UL, 0x12b7e950UL,
42
- 0x8bbeb8eaUL, 0xfcb9887cUL, 0x62dd1ddfUL, 0x15da2d49UL, 0x8cd37cf3UL,
43
- 0xfbd44c65UL, 0x4db26158UL, 0x3ab551ceUL, 0xa3bc0074UL, 0xd4bb30e2UL,
44
- 0x4adfa541UL, 0x3dd895d7UL, 0xa4d1c46dUL, 0xd3d6f4fbUL, 0x4369e96aUL,
45
- 0x346ed9fcUL, 0xad678846UL, 0xda60b8d0UL, 0x44042d73UL, 0x33031de5UL,
46
- 0xaa0a4c5fUL, 0xdd0d7cc9UL, 0x5005713cUL, 0x270241aaUL, 0xbe0b1010UL,
47
- 0xc90c2086UL, 0x5768b525UL, 0x206f85b3UL, 0xb966d409UL, 0xce61e49fUL,
48
- 0x5edef90eUL, 0x29d9c998UL, 0xb0d09822UL, 0xc7d7a8b4UL, 0x59b33d17UL,
49
- 0x2eb40d81UL, 0xb7bd5c3bUL, 0xc0ba6cadUL, 0xedb88320UL, 0x9abfb3b6UL,
50
- 0x03b6e20cUL, 0x74b1d29aUL, 0xead54739UL, 0x9dd277afUL, 0x04db2615UL,
51
- 0x73dc1683UL, 0xe3630b12UL, 0x94643b84UL, 0x0d6d6a3eUL, 0x7a6a5aa8UL,
52
- 0xe40ecf0bUL, 0x9309ff9dUL, 0x0a00ae27UL, 0x7d079eb1UL, 0xf00f9344UL,
53
- 0x8708a3d2UL, 0x1e01f268UL, 0x6906c2feUL, 0xf762575dUL, 0x806567cbUL,
54
- 0x196c3671UL, 0x6e6b06e7UL, 0xfed41b76UL, 0x89d32be0UL, 0x10da7a5aUL,
55
- 0x67dd4accUL, 0xf9b9df6fUL, 0x8ebeeff9UL, 0x17b7be43UL, 0x60b08ed5UL,
56
- 0xd6d6a3e8UL, 0xa1d1937eUL, 0x38d8c2c4UL, 0x4fdff252UL, 0xd1bb67f1UL,
57
- 0xa6bc5767UL, 0x3fb506ddUL, 0x48b2364bUL, 0xd80d2bdaUL, 0xaf0a1b4cUL,
58
- 0x36034af6UL, 0x41047a60UL, 0xdf60efc3UL, 0xa867df55UL, 0x316e8eefUL,
59
- 0x4669be79UL, 0xcb61b38cUL, 0xbc66831aUL, 0x256fd2a0UL, 0x5268e236UL,
60
- 0xcc0c7795UL, 0xbb0b4703UL, 0x220216b9UL, 0x5505262fUL, 0xc5ba3bbeUL,
61
- 0xb2bd0b28UL, 0x2bb45a92UL, 0x5cb36a04UL, 0xc2d7ffa7UL, 0xb5d0cf31UL,
62
- 0x2cd99e8bUL, 0x5bdeae1dUL, 0x9b64c2b0UL, 0xec63f226UL, 0x756aa39cUL,
63
- 0x026d930aUL, 0x9c0906a9UL, 0xeb0e363fUL, 0x72076785UL, 0x05005713UL,
64
- 0x95bf4a82UL, 0xe2b87a14UL, 0x7bb12baeUL, 0x0cb61b38UL, 0x92d28e9bUL,
65
- 0xe5d5be0dUL, 0x7cdcefb7UL, 0x0bdbdf21UL, 0x86d3d2d4UL, 0xf1d4e242UL,
66
- 0x68ddb3f8UL, 0x1fda836eUL, 0x81be16cdUL, 0xf6b9265bUL, 0x6fb077e1UL,
67
- 0x18b74777UL, 0x88085ae6UL, 0xff0f6a70UL, 0x66063bcaUL, 0x11010b5cUL,
68
- 0x8f659effUL, 0xf862ae69UL, 0x616bffd3UL, 0x166ccf45UL, 0xa00ae278UL,
69
- 0xd70dd2eeUL, 0x4e048354UL, 0x3903b3c2UL, 0xa7672661UL, 0xd06016f7UL,
70
- 0x4969474dUL, 0x3e6e77dbUL, 0xaed16a4aUL, 0xd9d65adcUL, 0x40df0b66UL,
71
- 0x37d83bf0UL, 0xa9bcae53UL, 0xdebb9ec5UL, 0x47b2cf7fUL, 0x30b5ffe9UL,
72
- 0xbdbdf21cUL, 0xcabac28aUL, 0x53b39330UL, 0x24b4a3a6UL, 0xbad03605UL,
73
- 0xcdd70693UL, 0x54de5729UL, 0x23d967bfUL, 0xb3667a2eUL, 0xc4614ab8UL,
74
- 0x5d681b02UL, 0x2a6f2b94UL, 0xb40bbe37UL, 0xc30c8ea1UL, 0x5a05df1bUL,
75
- 0x2d02ef8dUL
76
- };
@@ -1,4 +0,0 @@
1
- class BloomFit
2
- class ConfigurationMismatch < ArgumentError
3
- end
4
- end