fast_bloom_filter 1.0.0 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 6821c8b6ccb023e3f803243eaed3c2612159be3eb384817f6b79b5f1a2e7de72
4
- data.tar.gz: 2e0fb03a4c1ba7fb9ad43f20bc83ab6b6f33fe302dca4854af7c640a8edef5c7
3
+ metadata.gz: c2a374d45131e2f8d8aedb0c7401d25541021c13dfef6b7b9f10b9642af13d26
4
+ data.tar.gz: 989b50a1f8e256e192d1bcabfdeede1594b587869e82d91c7a569921207e9a94
5
5
  SHA512:
6
- metadata.gz: '09c38dcf72f4c1f5dee778099d8a7ea9b4f5fa20d865d23dc0ba7fca7f7f77fc6533f35c314906749208e47eb4dd495571d3ce0d46edbcf2b63bc461e6333e69'
7
- data.tar.gz: '0729565d307f3a19811fcf0126fda7ae20776124dd6ceb41f60f03808118ff8d294124da01df7c30598092d4f223d7fc7b5b838d6f248c1a16d6d1dbe4c81d0c'
6
+ metadata.gz: b3e886b66f0f604686ca4b13d5a773e2af2b08248897657105315a4918f6dddc626f54be029fc01317cf0607cd74d1caa0743d0b9e41fb276a6d2dda2d56a33f
7
+ data.tar.gz: 5f86210e42e1f81b2d0997a354acff115038326df666ac0b9b02b4cae2a9da96d938d5abd3f4834c06c9ecf755ed41f02532b49e905c686eec42453ab4cd58ed
data/CHANGELOG.md CHANGED
@@ -5,6 +5,60 @@ All notable changes to this project will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
+ ## [2.0.0] - 2026-02-12
9
+
10
+ ### 🚀 Major Release - Scalable Bloom Filter
11
+
12
+ This is a **breaking change** that transforms FastBloomFilter into a scalable, dynamic data structure.
13
+
14
+ ### Added
15
+ - **Scalable Architecture**: Filter now grows automatically by adding layers
16
+ - **No Upfront Capacity**: No need to specify capacity - just set error_rate
17
+ - **Multi-Layer System**: Each layer has progressively tighter error rates
18
+ - **Smart Growth Strategy**: Growth factor starts at 2x and decreases (like Go slices)
19
+ - **Layer Statistics**: Detailed per-layer stats via `stats` method
20
+ - **New API**: `Filter.new(error_rate: 0.01, initial_capacity: 1024)`
21
+ - `num_layers` method to check how many layers are active
22
+ - Enhanced `merge!` to combine filters with all their layers
23
+
24
+ ### Changed
25
+ - **BREAKING**: Constructor now uses keyword arguments: `Filter.new(error_rate: 0.01)` instead of `Filter.new(capacity, error_rate)`
26
+ - **BREAKING**: `stats` now returns multi-layer information with `:layers` array
27
+ - **BREAKING**: Helper methods changed: `for_emails(error_rate: 0.001)` instead of `for_emails(capacity)`
28
+ - Memory allocation is now dynamic and grows on-demand
29
+ - `inspect` output now shows layer count and total elements
30
+
31
+ ### Technical Details
32
+ - Based on "Scalable Bloom Filters" (Almeida et al., 2007)
33
+ - Each layer uses error_rate * (1 - r) * r^i formula
34
+ - Default tightening factor (r) = 0.85
35
+ - Growth factors: 2x → 1.75x → 1.5x → 1.25x as layers increase
36
+ - Layers are checked from newest to oldest for better cache locality
37
+
38
+ ### Migration Guide
39
+
40
+ **v1.x code:**
41
+ ```ruby
42
+ bloom = FastBloomFilter::Filter.new(10_000, 0.01)
43
+ bloom = FastBloomFilter.for_emails(100_000)
44
+ ```
45
+
46
+ **v2.x code:**
47
+ ```ruby
48
+ bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
49
+ bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
50
+ # Or simply:
51
+ bloom = FastBloomFilter::Filter.new(error_rate: 0.01) # starts small, grows as needed
52
+ ```
53
+
54
+ ### Performance
55
+ - Same O(k) complexity for add/lookup
56
+ - Slightly higher memory overhead due to layer management
57
+ - Better memory efficiency for unknown/growing datasets
58
+ - No performance degradation as filter grows
59
+
60
+ ---
61
+
8
62
  ## [1.0.0] - 2026-02-09
9
63
 
10
64
  ### Added
data/README.md CHANGED
@@ -1,17 +1,26 @@
1
- # FastBloomFilter
1
+ # FastBloomFilter v2 🚀
2
2
 
3
- [![CI](https://github.com/yourusername/fast_bloom_filter/actions/workflows/ci.yml/badge.svg)](https://github.com/yourusername/fast_bloom_filter/actions/workflows/ci.yml)
4
3
  [![Gem Version](https://badge.fury.io/rb/fast_bloom_filter.svg)](https://badge.fury.io/rb/fast_bloom_filter)
5
4
 
6
- A high-performance Bloom Filter implementation in C for Ruby. Perfect for Rails applications that need memory-efficient set membership testing.
5
+ A **scalable** Bloom Filter implementation in C for Ruby. Grows automatically without requiring upfront capacity! Perfect for Rails applications that need memory-efficient set membership testing with unknown dataset sizes.
6
+
7
+ ## What's New in v2? 🎉
8
+
9
+ - **🔄 Scalable Architecture**: No need to guess capacity upfront - the filter grows automatically
10
+ - **📊 Multi-Layer System**: Adds new layers dynamically as data grows
11
+ - **🎯 Smart Growth**: Growth factor adapts (2x → 1.75x → 1.5x → 1.25x)
12
+ - **💡 Simpler API**: Just specify error rate, not capacity
13
+ - **📈 Better for Unknown Sizes**: Perfect when you don't know how much data you'll have
14
+
15
+ Based on ["Scalable Bloom Filters" (Almeida et al., 2007)](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=10.1.1.725.390)
7
16
 
8
17
  ## Features
9
18
 
10
19
  - **🚀 Fast**: C implementation with MurmurHash3
11
20
  - **💾 Memory Efficient**: 20-50x less memory than Ruby Set
12
- - **🎯 Configurable**: Adjustable false positive rate
13
- - **🔒 Thread-Safe**: Safe for concurrent operations
14
- - **📊 Statistics**: Built-in performance monitoring
21
+ - **🔄 Auto-Scaling**: Grows dynamically as you add elements
22
+ - **🎯 Configurable**: Adjustable false positive rate per layer
23
+ - **📊 Statistics**: Detailed per-layer performance monitoring
15
24
  - **✅ Well-Tested**: Comprehensive test suite
16
25
 
17
26
  ## Installation
@@ -36,15 +45,19 @@ gem install fast_bloom_filter
36
45
 
37
46
  ## Usage
38
47
 
39
- ### Basic Operations
48
+ ### Basic Operations (v2 API)
40
49
 
41
50
  ```ruby
42
51
  require 'fast_bloom_filter'
43
52
 
44
- # Create a filter for 10,000 items with 1% false positive rate
45
- bloom = FastBloomFilter::Filter.new(10_000, 0.01)
53
+ # Create a scalable filter - NO CAPACITY NEEDED!
54
+ # Just specify your desired error rate
55
+ bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
46
56
 
47
- # Add items
57
+ # Or with an initial capacity hint (optional)
58
+ bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
59
+
60
+ # Add items - filter grows automatically
48
61
  bloom.add("user@example.com")
49
62
  bloom << "another@example.com" # alias for add
50
63
 
@@ -52,6 +65,13 @@ bloom << "another@example.com" # alias for add
52
65
  bloom.include?("user@example.com") # => true
53
66
  bloom.include?("notfound@test.com") # => false (probably)
54
67
 
68
+ # Add thousands or millions - it scales!
69
+ 100_000.times { |i| bloom.add("user#{i}@test.com") }
70
+
71
+ # Check stats
72
+ bloom.count # => 100002
73
+ bloom.num_layers # => 8 (grew automatically!)
74
+
55
75
  # Batch operations
56
76
  emails = ["user1@test.com", "user2@test.com", "user3@test.com"]
57
77
  bloom.add_all(emails)
@@ -67,50 +87,99 @@ bloom.clear
67
87
 
68
88
  ```ruby
69
89
  # For email deduplication (0.1% false positive rate)
70
- bloom = FastBloomFilter.for_emails(100_000)
90
+ bloom = FastBloomFilter.for_emails(error_rate: 0.001)
71
91
 
72
92
  # For URL tracking (1% false positive rate)
73
- bloom = FastBloomFilter.for_urls(50_000)
93
+ bloom = FastBloomFilter.for_urls(error_rate: 0.01)
94
+
95
+ # With initial capacity hint
96
+ bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
74
97
  ```
75
98
 
76
99
  ### Merge Filters
77
100
 
78
101
  ```ruby
79
- bloom1 = FastBloomFilter::Filter.new(1000, 0.01)
80
- bloom2 = FastBloomFilter::Filter.new(1000, 0.01)
102
+ bloom1 = FastBloomFilter::Filter.new(error_rate: 0.01)
103
+ bloom2 = FastBloomFilter::Filter.new(error_rate: 0.01)
81
104
 
82
105
  bloom1.add("item1")
83
106
  bloom2.add("item2")
84
107
 
85
108
  bloom1.merge!(bloom2) # bloom1 now contains both items
109
+ # Merges all layers from bloom2 into bloom1
86
110
  ```
87
111
 
88
112
  ### Statistics
89
113
 
90
114
  ```ruby
91
- bloom = FastBloomFilter::Filter.new(10_000, 0.01)
115
+ bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
116
+ 1000.times { |i| bloom.add("item#{i}") }
117
+
92
118
  stats = bloom.stats
93
119
 
94
120
  # => {
95
- # capacity: 10000,
96
- # size_bytes: 11982,
97
- # num_hashes: 7,
98
- # fill_ratio: 0.0
121
+ # total_count: 1000,
122
+ # num_layers: 2,
123
+ # total_bytes: 2500,
124
+ # total_bits: 20000,
125
+ # total_bits_set: 6543,
126
+ # fill_ratio: 0.32715,
127
+ # error_rate: 0.01,
128
+ # layers: [
129
+ # {
130
+ # layer: 0,
131
+ # capacity: 1024,
132
+ # count: 1024,
133
+ # size_bytes: 1229,
134
+ # num_hashes: 7,
135
+ # bits_set: 5234,
136
+ # total_bits: 9832,
137
+ # fill_ratio: 0.532,
138
+ # error_rate: 0.0015
139
+ # },
140
+ # # ... more layers
141
+ # ]
99
142
  # }
100
143
 
101
144
  puts bloom.inspect
102
- # => #<FastBloomFilter::Filter capacity=10000 size=11.7KB hashes=7 fill=0.0%>
145
+ # => #<FastBloomFilter::Filter v2 layers=2 count=1000 size=2.44KB fill=32.72%>
146
+ ```
147
+
148
+ ## How Scalable Bloom Filters Work
149
+
150
+ Traditional Bloom Filters require you to specify capacity upfront. **Scalable Bloom Filters** solve this by:
151
+
152
+ 1. **Starting Small**: Begin with a small initial capacity (default: 1024 elements)
153
+ 2. **Adding Layers**: When a layer fills up, add a new layer with larger capacity
154
+ 3. **Tightening Error Rates**: Each new layer has a tighter error rate to maintain overall FPR
155
+ 4. **Smart Growth**: Growth factor decreases over time (2x → 1.75x → 1.5x → 1.25x)
156
+
157
+ ### Error Rate Distribution
158
+
159
+ Each layer `i` gets error rate: `total_error_rate × (1 - r) × r^i`
160
+
161
+ Where `r` is the tightening factor (default: 0.85). This ensures the sum of all layer error rates converges to your target error rate.
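The per-layer formula can be checked numerically. The sketch below is plain Ruby and does not call the gem; `layer_error_rate` is an illustrative name that mirrors the formula, not part of the gem's Ruby API.

```ruby
# Per-layer false positive budget: total * (1 - r) * r**i
# (illustrative helper, not part of the gem's Ruby API)
def layer_error_rate(total, r, index)
  total * (1.0 - r) * r**index
end

total = 0.01
r     = 0.85

(0...5).each do |i|
  puts format("layer %d: %.4f", i, layer_error_rate(total, r, i))
end

# Geometric series: the partial sum over n layers is total * (1 - r**n),
# which approaches `total` from below as n grows.
partial_sum = (0...200).sum { |i| layer_error_rate(total, r, i) }
puts format("sum over 200 layers: %.6f", partial_sum)
```

Because the series is geometric, no finite number of layers exceeds the target rate; each new layer only uses up part of the remaining budget.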
162
+
163
+ ### Example Growth Pattern
164
+
165
+ ```
166
+ Layer 0: capacity=1,024 error_rate=0.0015 (initial)
167
+ Layer 1: capacity=2,048 error_rate=0.0013 (2x growth)
168
+ Layer 2: capacity=3,584 error_rate=0.0011 (1.75x growth)
169
+ Layer 3: capacity=5,376 error_rate=0.0009 (1.5x growth)
170
+ Layer 4: capacity=6,720 error_rate=0.0008 (1.25x growth)
171
+ ...
103
172
  ```
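A small Ruby model of how layer capacities would evolve under the thresholds in the C `growth_factor` function from this same diff (2x while fewer than 4 layers exist, 1.75x below 8, 1.5x below 12, then 1.25x). It is a sketch and not the gem's API; the method names are illustrative. These thresholds change the factor every four layers, so the capacities the model yields differ from the per-layer pattern listed above.

```ruby
# Mirrors the thresholds of the C growth_factor(num_layers) helper.
def growth_factor(num_layers)
  return 2.0  if num_layers < 4
  return 1.75 if num_layers < 8
  return 1.5  if num_layers < 12
  1.25
end

# Capacities of the first `count` layers, starting from initial_capacity.
def layer_capacities(initial_capacity, count)
  caps = [initial_capacity]
  caps << (caps.last * growth_factor(caps.size)).to_i while caps.size < count
  caps
end

# Number of layers that exist once `elements` insertions have been made.
def layers_needed(initial_capacity, elements)
  layers = 1
  total = cap = initial_capacity
  while total < elements
    cap = (cap * growth_factor(layers)).to_i
    total += cap
    layers += 1
  end
  layers
end

p layer_capacities(1024, 6)
puts layers_needed(1024, 100_002)
```

Assuming an initial capacity of 1,024, the model needs 8 layers for 100,002 insertions, which agrees with the `num_layers # => 8` shown in the Basic Operations example.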
104
173
 
105
174
  ## Performance
106
175
 
107
176
  Benchmarks on MacBook Pro M1 (100K elements):
108
177
 
109
- | Operation | Bloom Filter | Ruby Set | Speedup |
110
- |-----------|--------------|----------|---------|
111
- | Add | 45ms | 120ms | 2.7x |
112
- | Check | 8ms | 15ms | 1.9x |
113
- | Memory | 120KB | 2000KB | 16.7x |
178
+ | Operation | Bloom Filter v2 | Ruby Set | Speedup |
179
+ |-----------|-----------------|----------|---------|
180
+ | Add | 48ms | 120ms | 2.5x |
181
+ | Check | 9ms | 15ms | 1.7x |
182
+ | Memory | 145KB | 2000KB | 13.8x |
114
183
 
115
184
  Run benchmarks yourself:
116
185
 
@@ -120,11 +189,12 @@ ruby demo.rb
120
189
 
121
190
  ## Use Cases
122
191
 
123
- ### Rails: Prevent Duplicate Email Signups
192
+ ### Rails: Prevent Duplicate Email Signups (No Capacity Guessing!)
124
193
 
125
194
  ```ruby
126
195
  class User < ApplicationRecord
127
- SIGNUP_BLOOM = FastBloomFilter.for_emails(1_000_000)
196
+ # No need to guess how many users you'll have!
197
+ SIGNUP_BLOOM = FastBloomFilter.for_emails(error_rate: 0.001)
128
198
 
129
199
  before_validation :check_duplicate_signup
130
200
 
@@ -140,12 +210,13 @@ class User < ApplicationRecord
140
210
  end
141
211
  ```
142
212
 
143
- ### Track Visited URLs
213
+ ### Track Visited URLs (Scales to Millions)
144
214
 
145
215
  ```ruby
146
216
  class WebCrawler
147
217
  def initialize
148
- @visited = FastBloomFilter.for_urls(10_000_000)
218
+ # Starts small, grows as needed
219
+ @visited = FastBloomFilter.for_urls(error_rate: 0.01)
149
220
  end
150
221
 
151
222
  def crawl(url)
@@ -153,6 +224,11 @@ class WebCrawler
153
224
 
154
225
  @visited.add(url)
155
226
  # ... crawl logic
227
+
228
+ # Check growth
229
+ if @visited.count % 10_000 == 0
230
+ puts "Crawled #{@visited.count} URLs, #{@visited.num_layers} layers"
231
+ end
156
232
  end
157
233
  end
158
234
  ```
@@ -162,7 +238,7 @@ end
162
238
  ```ruby
163
239
  class CacheWarmer
164
240
  def initialize
165
- @warmed = FastBloomFilter::Filter.new(100_000, 0.001)
241
+ @warmed = FastBloomFilter::Filter.new(error_rate: 0.001)
166
242
  end
167
243
 
168
244
  def warm(key)
@@ -174,27 +250,31 @@ class CacheWarmer
174
250
  end
175
251
  ```
176
252
 
177
- ## How It Works
178
-
179
- A Bloom Filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set:
253
+ ## Migration from v1.x
180
254
 
181
- - **No false negatives**: If it says "no", the item is definitely not in the set
182
- - **Possible false positives**: If it says "yes", the item is probably in the set
183
- - **Memory efficient**: Uses bit arrays instead of storing actual items
184
- - **Fast**: O(k) for add and lookup, where k is the number of hash functions
255
+ **v1.x (Fixed Capacity):**
256
+ ```ruby
257
+ bloom = FastBloomFilter::Filter.new(10_000, 0.01)
258
+ bloom = FastBloomFilter.for_emails(100_000)
259
+ ```
185
260
 
186
- ### Parameters
261
+ **v2.x (Scalable):**
262
+ ```ruby
263
+ # Recommended: Let it scale automatically
264
+ bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
187
265
 
188
- - **Capacity**: Expected number of elements
189
- - **Error Rate**: Probability of false positives (default: 0.01 = 1%)
266
+ # Or with initial capacity hint
267
+ bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
190
268
 
191
- The filter automatically calculates optimal bit array size and number of hash functions.
269
+ # Helper methods also changed
270
+ bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
271
+ ```
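For call sites that still pass v1-style positional arguments, a small adapter can translate them into the v2 keyword form. `v1_to_v2_options` is a hypothetical helper, not something the gem provides; it assumes the old capacity is best reused as the `initial_capacity` hint, and it keeps the v1 default error rate of 0.01.

```ruby
# Hypothetical migration helper (not part of fast_bloom_filter).
# Converts v1 positional arguments (capacity, error_rate) into the
# keyword options accepted by the v2 constructor.
def v1_to_v2_options(capacity, error_rate = 0.01)
  raise ArgumentError, "capacity must be positive" unless capacity.positive?
  unless error_rate > 0 && error_rate < 1
    raise ArgumentError, "error_rate must be between 0 and 1 (exclusive)"
  end

  { error_rate: error_rate, initial_capacity: capacity }
end

opts = v1_to_v2_options(10_000, 0.01)
p opts
# With the gem loaded, the options would be splatted into the constructor:
#   FastBloomFilter::Filter.new(**opts)
```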
192
272
 
193
273
  ## Development
194
274
 
195
275
  ```bash
196
276
  # Clone the repository
197
- git clone https://github.com/yourusername/fast_bloom_filter.git
277
+ git clone https://github.com/roman-haidarov/fast_bloom_filter.git
198
278
  cd fast_bloom_filter
199
279
 
200
280
  # Install dependencies
@@ -210,7 +290,7 @@ bundle exec rake test
210
290
  gem build fast_bloom_filter.gemspec
211
291
 
212
292
  # Install locally
213
- gem install ./fast_bloom_filter-1.0.0.gem
293
+ gem install ./fast_bloom_filter-2.0.0.gem
214
294
  ```
215
295
 
216
296
  ### Quick Build Script
@@ -225,6 +305,15 @@ gem install ./fast_bloom_filter-1.0.0.gem
225
305
  - C compiler (gcc, clang, etc.)
226
306
  - Make
227
307
 
308
+ ## Technical Details
309
+
310
+ - **Hash Function**: MurmurHash3 (32-bit)
311
+ - **Bit Array**: Dynamic allocation per layer
312
+ - **Growth Strategy**: Adaptive (2x → 1.75x → 1.5x → 1.25x)
313
+ - **Tightening Factor**: 0.85 (configurable)
314
+ - **Memory Management**: Ruby GC integration with proper cleanup
315
+ - **Thread Safety**: Safe for concurrent reads (writes need external synchronization)
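Since writes need external synchronization, one option is to guard a shared filter with a `Mutex`. The wrapper below is a sketch and not part of the gem: `SynchronizedFilter` is a hypothetical name, and it works with any object that responds to `add` and `include?`. It locks reads as well, which is a conservative choice given that a concurrent write may be appending a layer.

```ruby
require "set"

# Hypothetical wrapper (not provided by the gem): serializes access to
# any filter-like object that responds to #add and #include?.
class SynchronizedFilter
  def initialize(filter)
    @filter = filter
    @lock = Mutex.new
  end

  def add(item)
    @lock.synchronize { @filter.add(item) }
  end
  alias_method :<<, :add

  def include?(item)
    @lock.synchronize { @filter.include?(item) }
  end
end

# A Set stands in for FastBloomFilter::Filter so the sketch runs without the gem.
shared = SynchronizedFilter.new(Set.new)
threads = 4.times.map do |t|
  Thread.new { 1_000.times { |i| shared.add("user#{t}-#{i}@test.com") } }
end
threads.each(&:join)
puts shared.include?("user3-999@test.com")
```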
316
+
228
317
  ## Contributing
229
318
 
230
319
  1. Fork it
@@ -239,14 +328,15 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
239
328
 
240
329
  ## Credits
241
330
 
242
- - MurmurHash3 implementation based on Austin Appleby's original work
243
- - Bloom Filter algorithm by Burton Howard Bloom (1970)
331
+ - Scalable Bloom Filters algorithm: Almeida, Baquero, Preguiça, Hutchison (2007)
332
+ - MurmurHash3 implementation: Austin Appleby
333
+ - Original Bloom Filter: Burton Howard Bloom (1970)
244
334
 
245
335
  ## Support
246
336
 
247
- - 🐛 [Report bugs](https://github.com/yourusername/fast_bloom_filter/issues)
248
- - 💡 [Request features](https://github.com/yourusername/fast_bloom_filter/issues)
249
- - 📖 [Documentation](https://github.com/yourusername/fast_bloom_filter)
337
+ - 🐛 [Report bugs](https://github.com/roman-haidarov/fast_bloom_filter/issues)
338
+ - 💡 [Request features](https://github.com/roman-haidarov/fast_bloom_filter/issues)
339
+ - 📖 [Documentation](https://github.com/roman-haidarov/fast_bloom_filter)
250
340
 
251
341
  ## Changelog
252
342
 
@@ -1,6 +1,16 @@
1
1
  /*
2
- * FastBloomFilter - High-performance Bloom Filter implementation for Ruby
3
- * Copyright (c) 2025
2
+ * FastBloomFilter v2 - Scalable Bloom Filter implementation for Ruby
3
+ * Copyright (c) 2026
4
+ *
5
+ * Based on: "Scalable Bloom Filters" (Almeida et al., 2007)
6
+ *
7
+ * Instead of requiring upfront capacity, the filter grows automatically
8
+ * by adding new layers when the current one fills up. Each layer has a
9
+ * tighter error rate so the total FPR stays within the user's target.
10
+ *
11
+ * Growth factor starts at 2x and gradually decreases (like Go slices).
12
+ *
13
+ * Compatible with Ruby >= 2.7
4
14
  */
5
15
 
6
16
  #include <ruby.h>
@@ -9,47 +19,66 @@
9
19
  #include <stdlib.h>
10
20
  #include <math.h>
11
21
 
12
- /* Bloom Filter structure */
22
+ /* ------------------------------------------------------------------ */
23
+ /* Single Bloom Filter layer */
24
+ /* ------------------------------------------------------------------ */
25
+
13
26
  typedef struct {
14
- uint8_t *bits; /* Bit array */
15
- size_t size; /* Size in bytes */
16
- size_t capacity; /* Expected number of elements */
17
- int num_hashes; /* Number of hash functions */
18
- } BloomFilter;
19
-
20
- /* GC: Free memory */
21
- static void bloom_free(void *ptr) {
22
- BloomFilter *bloom = (BloomFilter *)ptr;
23
- if (bloom->bits) {
24
- free(bloom->bits);
25
- }
26
- free(bloom);
27
- }
27
+ uint8_t *bits;
28
+ size_t size; /* bytes */
29
+ size_t capacity; /* max elements for this layer */
30
+ size_t count; /* elements inserted so far */
31
+ int num_hashes;
32
+ } BloomLayer;
33
+
34
+ /* ------------------------------------------------------------------ */
35
+ /* Scalable Bloom Filter (chain of layers) */
36
+ /* ------------------------------------------------------------------ */
37
+
38
+ typedef struct {
39
+ BloomLayer **layers;
40
+ size_t num_layers;
41
+ size_t layers_cap; /* allocated slots in layers[] */
42
+
43
+ double error_rate; /* user-requested total FPR */
44
+ double tightening; /* r — each layer multiplies FPR by this */
45
+ size_t initial_capacity;
46
+
47
+ size_t total_count; /* elements across all layers */
48
+ } ScalableBloom;
49
+
50
+ /* ------------------------------------------------------------------ */
51
+ /* Constants */
52
+ /* ------------------------------------------------------------------ */
28
53
 
29
- /* GC: Report memory size */
30
- static size_t bloom_memsize(const void *ptr) {
31
- const BloomFilter *bloom = (const BloomFilter *)ptr;
32
- return sizeof(BloomFilter) + bloom->size;
54
+ #define DEFAULT_ERROR_RATE 0.01
55
+ #define DEFAULT_INITIAL_CAP 8192
56
+ #define DEFAULT_TIGHTENING 0.85
57
+ #define FILL_RATIO_THRESHOLD 0.5
58
+ #define MAX_HASHES 20
59
+ #define MIN_HASHES 1
60
+
61
+ /* Growth factor: starts at ~2x, approaches 1.25x for large filters.
62
+ * Formula mirrors Go's slice growth strategy. */
63
+ static double growth_factor(size_t num_layers) {
64
+ if (num_layers < 4) return 2.0;
65
+ if (num_layers < 8) return 1.75;
66
+ if (num_layers < 12) return 1.5;
67
+ return 1.25;
33
68
  }
34
69
 
35
- static const rb_data_type_t bloom_type = {
36
- "BloomFilter",
37
- {NULL, bloom_free, bloom_memsize},
38
- NULL, NULL,
39
- RUBY_TYPED_FREE_IMMEDIATELY
40
- };
70
+ /* ------------------------------------------------------------------ */
71
+ /* MurmurHash3 — 32-bit (unchanged from v1) */
72
+ /* ------------------------------------------------------------------ */
41
73
 
42
- /*
43
- * MurmurHash3 32-bit implementation
44
- */
45
74
  static uint32_t murmur3_32(const uint8_t *key, size_t len, uint32_t seed) {
46
75
  uint32_t h = seed;
47
76
  const uint32_t c1 = 0xcc9e2d51;
48
77
  const uint32_t c2 = 0x1b873593;
49
-
78
+
50
79
  const int nblocks = len / 4;
51
80
  const uint32_t *blocks = (const uint32_t *)(key);
52
-
81
+
53
82
  for (int i = 0; i < nblocks; i++) {
54
83
  uint32_t k1 = blocks[i];
55
84
  k1 *= c1;
@@ -59,213 +88,464 @@ static uint32_t murmur3_32(const uint8_t *key, size_t len, uint32_t seed) {
59
88
  h = (h << 13) | (h >> 19);
60
89
  h = h * 5 + 0xe6546b64;
61
90
  }
62
-
91
+
63
92
  const uint8_t *tail = (const uint8_t *)(key + nblocks * 4);
64
93
  uint32_t k1 = 0;
65
-
94
+
66
95
  switch (len & 3) {
67
- case 3: k1 ^= tail[2] << 16;
68
- case 2: k1 ^= tail[1] << 8;
96
+ case 3: k1 ^= tail[2] << 16; /* fall through */
97
+ case 2: k1 ^= tail[1] << 8; /* fall through */
69
98
  case 1: k1 ^= tail[0];
70
99
  k1 *= c1;
71
100
  k1 = (k1 << 15) | (k1 >> 17);
72
101
  k1 *= c2;
73
102
  h ^= k1;
74
103
  }
75
-
104
+
76
105
  h ^= len;
77
106
  h ^= h >> 16;
78
107
  h *= 0x85ebca6b;
79
108
  h ^= h >> 13;
80
109
  h *= 0xc2b2ae35;
81
110
  h ^= h >> 16;
82
-
111
+
83
112
  return h;
84
113
  }
85
114
 
86
- /* Set bit at position */
115
+ /* ------------------------------------------------------------------ */
116
+ /* Bit helpers */
117
+ /* ------------------------------------------------------------------ */
118
+
87
119
  static inline void set_bit(uint8_t *bits, size_t pos) {
88
120
  bits[pos / 8] |= (1 << (pos % 8));
89
121
  }
90
122
 
91
- /* Get bit at position */
92
123
  static inline int get_bit(const uint8_t *bits, size_t pos) {
93
124
  return (bits[pos / 8] & (1 << (pos % 8))) != 0;
94
125
  }
95
126
 
96
- /* Allocate BloomFilter object */
127
+ /* ------------------------------------------------------------------ */
128
+ /* Layer lifecycle */
129
+ /* ------------------------------------------------------------------ */
130
+
131
+ static BloomLayer *layer_create(size_t capacity, double error_rate) {
132
+ BloomLayer *layer = (BloomLayer *)calloc(1, sizeof(BloomLayer));
133
+ if (!layer) return NULL;
134
+
135
+ double ln2 = 0.693147180559945309417;
136
+ double ln2_sq = ln2 * ln2;
137
+
138
+ size_t bits_count = (size_t)(-(double)capacity * log(error_rate) / ln2_sq);
139
+ if (bits_count < 64) bits_count = 64; /* sane minimum */
140
+
141
+ layer->size = (bits_count + 7) / 8;
142
+ layer->capacity = capacity;
143
+ layer->count = 0;
144
+ layer->num_hashes = (int)((bits_count / (double)capacity) * ln2);
145
+
146
+ if (layer->num_hashes < MIN_HASHES) layer->num_hashes = MIN_HASHES;
147
+ if (layer->num_hashes > MAX_HASHES) layer->num_hashes = MAX_HASHES;
148
+
149
+ layer->bits = (uint8_t *)calloc(layer->size, sizeof(uint8_t));
150
+ if (!layer->bits) {
151
+ free(layer);
152
+ return NULL;
153
+ }
154
+
155
+ return layer;
156
+ }
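The sizing arithmetic in `layer_create` can be reproduced in a few lines of Ruby to sanity-check layer sizes. This is a sketch of the same formulas (m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hashes, truncated and clamped as in the C code); `layer_params` is an illustrative name, not a gem method.

```ruby
# Returns [size_in_bytes, num_hashes] for one layer, following the
# same truncation and clamping steps as the C layer_create function.
def layer_params(capacity, error_rate)
  ln2 = Math.log(2)
  bits = (-capacity * Math.log(error_rate) / (ln2 * ln2)).to_i
  bits = 64 if bits < 64                    # same minimum as the C code
  size_bytes = (bits + 7) / 8               # round up to whole bytes
  hashes = ((bits / capacity.to_f) * ln2).to_i.clamp(1, 20) # MIN_HASHES..MAX_HASHES
  [size_bytes, hashes]
end

p layer_params(10_000, 0.01)
```

For 10,000 elements at a 1% error rate this gives 11,982 bytes, the same `size_bytes` the v1 README reported for that configuration.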
157
+
158
+ static void layer_free(BloomLayer *layer) {
159
+ if (layer) {
160
+ free(layer->bits);
161
+ free(layer);
162
+ }
163
+ }
164
+
165
+ static inline int layer_is_full(const BloomLayer *layer) {
166
+ return layer->count >= layer->capacity;
167
+ }
168
+
169
+ static void layer_add(BloomLayer *layer, const char *data, size_t len) {
170
+ size_t bits_count = layer->size * 8;
171
+
172
+ /* Kirsch–Mitzenmacher: 2 hashes instead of k */
173
+ uint32_t h1 = murmur3_32((const uint8_t *)data, len, 0x9747b28c);
174
+ uint32_t h2 = murmur3_32((const uint8_t *)data, len, 0x5bd1e995);
175
+
176
+ for (int i = 0; i < layer->num_hashes; i++) {
177
+ uint32_t combined = h1 + (uint32_t)i * h2;
178
+ set_bit(layer->bits, combined % bits_count);
179
+ }
180
+ layer->count++;
181
+ }
182
+
183
+ static int layer_include(const BloomLayer *layer, const char *data, size_t len) {
184
+ size_t bits_count = layer->size * 8;
185
+
186
+ /* Kirsch–Mitzenmacher: 2 hashes instead of k */
187
+ uint32_t h1 = murmur3_32((const uint8_t *)data, len, 0x9747b28c);
188
+ uint32_t h2 = murmur3_32((const uint8_t *)data, len, 0x5bd1e995);
189
+
190
+ for (int i = 0; i < layer->num_hashes; i++) {
191
+ uint32_t combined = h1 + (uint32_t)i * h2;
192
+ if (!get_bit(layer->bits, combined % bits_count))
193
+ return 0;
194
+ }
195
+ return 1;
196
+ }
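The double-hashing scheme used by `layer_add` and `layer_include` derives all k bit positions from two base hashes instead of running k independent hash functions. A minimal Ruby sketch of the idea follows. It assumes MD5 from the standard library as a stand-in for the two seeded MurmurHash3 calls, so the positions will not match the C extension; `bit_positions` is an illustrative name.

```ruby
require "digest"

# Derive k bit positions from two base hashes: g_i = h1 + i * h2 (mod 2**32).
# MD5 is only a stand-in for the two seeded MurmurHash3 calls in the C code.
def bit_positions(data, num_hashes, bits_count)
  h1, h2 = Digest::MD5.digest(data).unpack("L<2")
  Array.new(num_hashes) do |i|
    ((h1 + i * h2) & 0xFFFFFFFF) % bits_count # wrap like uint32_t, then map to a bit
  end
end

bits_count = 1024
bits = Array.new(bits_count, false)

# add: set every derived position
bit_positions("user@example.com", 7, bits_count).each { |pos| bits[pos] = true }

# include?: all derived positions must be set
found = bit_positions("user@example.com", 7, bits_count).all? { |pos| bits[pos] }
puts found
```

Both operations must derive positions the same way, which is why the C code repeats the identical seeds and loop in `layer_add` and `layer_include`.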
197
+
198
+ static size_t layer_bits_set(const BloomLayer *layer) {
199
+ size_t count = 0;
200
+ for (size_t i = 0; i < layer->size; i++) {
201
+ uint8_t b = layer->bits[i];
202
+ while (b) { count += b & 1; b >>= 1; }
203
+ }
204
+ return count;
205
+ }
206
+
207
+ /* ------------------------------------------------------------------ */
208
+ /* Scalable filter helpers */
209
+ /* ------------------------------------------------------------------ */
210
+
211
+ /* Error rate for the i-th layer (0-indexed):
212
+ * layer_fpr(i) = error_rate * (1 - r) * r^i
213
+ * Sum converges to error_rate. */
214
+ static double layer_error_rate(double total_fpr, double r, size_t index) {
215
+ return total_fpr * (1.0 - r) * pow(r, (double)index);
216
+ }
217
+
218
+ static BloomLayer *scalable_add_layer(ScalableBloom *sb) {
219
+ size_t new_cap;
220
+ if (sb->num_layers == 0) {
221
+ new_cap = sb->initial_capacity;
222
+ } else {
223
+ double gf = growth_factor(sb->num_layers);
224
+ new_cap = (size_t)(sb->layers[sb->num_layers - 1]->capacity * gf);
225
+ }
226
+
227
+ double fpr = layer_error_rate(sb->error_rate, sb->tightening, sb->num_layers);
228
+ if (fpr < 1e-15) fpr = 1e-15; /* floor to avoid log(0) */
229
+
230
+ BloomLayer *layer = layer_create(new_cap, fpr);
231
+ if (!layer) return NULL;
232
+
233
+ /* Grow layers array if needed */
234
+ if (sb->num_layers >= sb->layers_cap) {
235
+ size_t new_slots = sb->layers_cap == 0 ? 4 : sb->layers_cap * 2;
236
+ BloomLayer **tmp = (BloomLayer **)realloc(sb->layers,
237
+ new_slots * sizeof(BloomLayer *));
238
+ if (!tmp) { layer_free(layer); return NULL; }
239
+ sb->layers = tmp;
240
+ sb->layers_cap = new_slots;
241
+ }
242
+
243
+ sb->layers[sb->num_layers++] = layer;
244
+ return layer;
245
+ }
246
+
247
+ /* ------------------------------------------------------------------ */
248
+ /* Ruby GC integration */
249
+ /* ------------------------------------------------------------------ */
250
+
251
+ static void bloom_free_scalable(void *ptr) {
252
+ ScalableBloom *sb = (ScalableBloom *)ptr;
253
+ for (size_t i = 0; i < sb->num_layers; i++) {
254
+ layer_free(sb->layers[i]);
255
+ }
256
+ free(sb->layers);
257
+ free(sb);
258
+ }
259
+
260
+ static size_t bloom_memsize_scalable(const void *ptr) {
261
+ const ScalableBloom *sb = (const ScalableBloom *)ptr;
262
+ size_t total = sizeof(ScalableBloom);
263
+ total += sb->layers_cap * sizeof(BloomLayer *);
264
+ for (size_t i = 0; i < sb->num_layers; i++) {
265
+ total += sizeof(BloomLayer) + sb->layers[i]->size;
266
+ }
267
+ return total;
268
+ }
269
+
270
+ static const rb_data_type_t scalable_bloom_type = {
271
+ "ScalableBloomFilter",
272
+ {NULL, bloom_free_scalable, bloom_memsize_scalable},
273
+ NULL, NULL,
274
+ RUBY_TYPED_FREE_IMMEDIATELY
275
+ };
276
+
277
+ /* ------------------------------------------------------------------ */
278
+ /* Ruby methods */
279
+ /* ------------------------------------------------------------------ */
280
+
97
281
  static VALUE bloom_alloc(VALUE klass) {
98
- BloomFilter *bloom = ALLOC(BloomFilter);
99
- bloom->bits = NULL;
100
- bloom->size = 0;
101
- bloom->capacity = 0;
102
- bloom->num_hashes = 0;
103
-
104
- return TypedData_Wrap_Struct(klass, &bloom_type, bloom);
282
+ ScalableBloom *sb = (ScalableBloom *)calloc(1, sizeof(ScalableBloom));
283
+ if (!sb) rb_raise(rb_eNoMemError, "failed to allocate ScalableBloom");
284
+
285
+ return TypedData_Wrap_Struct(klass, &scalable_bloom_type, sb);
105
286
  }
106
287
 
107
288
  /*
108
- * Initialize Bloom Filter
109
- *
110
- * @param capacity [Integer] Expected number of elements
111
- * @param error_rate [Float] Desired false positive rate (default: 0.01)
289
+ * call-seq:
290
+ * Filter.new # defaults: error_rate 0.01, initial_capacity 1024
291
+ * Filter.new(error_rate: 0.001)
292
+ * Filter.new(error_rate: 0.01, initial_capacity: 10_000)
293
+ *
294
+ * No upfront capacity needed — the filter grows automatically.
295
+ *
296
+ * Ruby 2.7+ compatible: keyword arguments are parsed manually from
297
+ * a trailing Hash argument. The rb_scan_args ":" format requires
298
+ * Ruby 3.2+, so we handle it ourselves for broad compatibility.
112
299
  */
113
300
  static VALUE bloom_initialize(int argc, VALUE *argv, VALUE self) {
114
- VALUE capacity_val, error_rate_val;
115
- rb_scan_args(argc, argv, "11", &capacity_val, &error_rate_val);
116
-
117
- long capacity = NUM2LONG(capacity_val);
118
- double error_rate = NIL_P(error_rate_val) ? 0.01 : NUM2DBL(error_rate_val);
119
-
120
- if (capacity <= 0) {
121
- rb_raise(rb_eArgError, "capacity must be positive");
122
- }
123
-
124
- if (error_rate <= 0 || error_rate >= 1) {
125
- rb_raise(rb_eArgError, "error_rate must be between 0 and 1");
301
+ VALUE opts = Qnil;
302
+
303
+ if (argc == 0) {
304
+ /* Filter.new all defaults */
305
+ } else if (argc == 1 && RB_TYPE_P(argv[0], T_HASH)) {
306
+ /* Filter.new(error_rate: 0.01, ...) — keyword args as hash */
307
+ opts = argv[0];
308
+ } else {
309
+ rb_raise(rb_eArgError,
310
+ "wrong number of arguments (given %d, expected 0 or keyword arguments)",
311
+ argc);
126
312
  }
127
-
128
- BloomFilter *bloom;
129
- TypedData_Get_Struct(self, BloomFilter, &bloom_type, bloom);
130
-
131
- /* Calculate optimal parameters */
132
- double ln2 = 0.693147180559945309417;
133
- double ln2_sq = ln2 * ln2;
134
-
135
- size_t bits_count = (size_t)(-(capacity * log(error_rate)) / ln2_sq);
136
- bloom->size = (bits_count + 7) / 8;
137
- bloom->capacity = capacity;
138
- bloom->num_hashes = (int)((bits_count / (double)capacity) * ln2);
139
-
140
- if (bloom->num_hashes < 1) bloom->num_hashes = 1;
141
- if (bloom->num_hashes > 10) bloom->num_hashes = 10;
142
-
143
- bloom->bits = (uint8_t *)calloc(bloom->size, sizeof(uint8_t));
144
- if (!bloom->bits) {
145
- rb_raise(rb_eNoMemError, "failed to allocate memory");
313
+
314
+ double error_rate = DEFAULT_ERROR_RATE;
315
+ size_t initial_capacity = DEFAULT_INITIAL_CAP;
316
+ double tightening = DEFAULT_TIGHTENING;
317
+
318
+ if (!NIL_P(opts)) {
319
+ VALUE v;
320
+
321
+ v = rb_hash_aref(opts, ID2SYM(rb_intern("error_rate")));
322
+ if (!NIL_P(v)) error_rate = NUM2DBL(v);
323
+
324
+ v = rb_hash_aref(opts, ID2SYM(rb_intern("initial_capacity")));
325
+ if (!NIL_P(v)) initial_capacity = (size_t)NUM2LONG(v);
326
+
327
+ v = rb_hash_aref(opts, ID2SYM(rb_intern("tightening")));
328
+ if (!NIL_P(v)) tightening = NUM2DBL(v);
146
329
  }
147
-
330
+
331
+ if (error_rate <= 0 || error_rate >= 1)
332
+ rb_raise(rb_eArgError, "error_rate must be between 0 and 1 (exclusive)");
333
+ if (initial_capacity == 0)
334
+ rb_raise(rb_eArgError, "initial_capacity must be positive");
335
+ if (tightening <= 0 || tightening >= 1)
336
+ rb_raise(rb_eArgError, "tightening must be between 0 and 1 (exclusive)");
337
+
338
+ ScalableBloom *sb;
339
+ TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
340
+
341
+ sb->error_rate = error_rate;
342
+ sb->initial_capacity = initial_capacity;
343
+ sb->tightening = tightening;
344
+ sb->total_count = 0;
345
+
346
+ /* Create first layer */
347
+ if (!scalable_add_layer(sb))
348
+ rb_raise(rb_eNoMemError, "failed to allocate initial layer");
349
+
148
350
  return self;
149
351
  }
150
352
 
151
353
  /*
152
- * Add element to filter
354
+ * call-seq:
355
+ * filter.add("element")
356
+ * filter << "element"
153
357
  */
154
358
  static VALUE bloom_add(VALUE self, VALUE str) {
155
- BloomFilter *bloom;
156
- TypedData_Get_Struct(self, BloomFilter, &bloom_type, bloom);
157
-
359
+ ScalableBloom *sb;
360
+ TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
361
+
158
362
  Check_Type(str, T_STRING);
159
-
160
- const char *data = RSTRING_PTR(str);
161
- size_t len = RSTRING_LEN(str);
162
- size_t bits_count = bloom->size * 8;
163
-
164
- for (int i = 0; i < bloom->num_hashes; i++) {
165
- uint32_t hash = murmur3_32((const uint8_t *)data, len, i);
166
- size_t pos = hash % bits_count;
167
- set_bit(bloom->bits, pos);
363
+
364
+ BloomLayer *active = sb->layers[sb->num_layers - 1];
365
+
366
+ /* Grow if current layer is full */
367
+ if (layer_is_full(active)) {
368
+ active = scalable_add_layer(sb);
369
+ if (!active)
370
+ rb_raise(rb_eNoMemError, "failed to allocate new layer");
168
371
  }
169
-
372
+
373
+ layer_add(active, RSTRING_PTR(str), RSTRING_LEN(str));
374
+ sb->total_count++;
375
+
170
376
  return Qtrue;
171
377
  }
172
378
 
173
379
  /*
174
- * Check if element might be in filter
380
+ * call-seq:
381
+ * filter.include?("element") #=> true / false
382
+ * filter.member?("element") #=> true / false
383
+ *
384
+ * Checks all layers. Returns true if ANY layer says "possibly yes".
175
385
  */
176
386
  static VALUE bloom_include(VALUE self, VALUE str) {
177
- BloomFilter *bloom;
178
- TypedData_Get_Struct(self, BloomFilter, &bloom_type, bloom);
179
-
387
+ ScalableBloom *sb;
388
+ TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
389
+
180
390
  Check_Type(str, T_STRING);
181
-
391
+
182
392
  const char *data = RSTRING_PTR(str);
183
- size_t len = RSTRING_LEN(str);
184
- size_t bits_count = bloom->size * 8;
185
-
186
- for (int i = 0; i < bloom->num_hashes; i++) {
187
- uint32_t hash = murmur3_32((const uint8_t *)data, len, i);
188
- size_t pos = hash % bits_count;
189
- if (!get_bit(bloom->bits, pos)) {
190
- return Qfalse;
191
- }
393
+ size_t len = RSTRING_LEN(str);
394
+
395
+ /* Check from newest to oldest — most elements are in recent layers */
396
+ for (size_t i = sb->num_layers; i > 0; i--) {
397
+ if (layer_include(sb->layers[i - 1], data, len))
398
+ return Qtrue;
192
399
  }
193
-
194
- return Qtrue;
400
+
401
+ return Qfalse;
195
402
  }
196
403
 
197
404
  /*
198
- * Clear all bits
405
+ * Reset all layers, keep only one fresh layer.
199
406
  */
200
407
  static VALUE bloom_clear(VALUE self) {
201
- BloomFilter *bloom;
202
- TypedData_Get_Struct(self, BloomFilter, &bloom_type, bloom);
203
-
204
- memset(bloom->bits, 0, bloom->size);
408
+ ScalableBloom *sb;
409
+ TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
410
+
411
+ for (size_t i = 0; i < sb->num_layers; i++) {
412
+ layer_free(sb->layers[i]);
413
+ }
414
+ sb->num_layers = 0;
415
+ sb->total_count = 0;
416
+
417
+ if (!scalable_add_layer(sb))
418
+ rb_raise(rb_eNoMemError, "failed to allocate layer after clear");
419
+
205
420
  return Qnil;
206
421
  }
207
422
 
208
423
  /*
209
- * Get filter statistics
424
+ * Detailed statistics for the whole filter and each layer.
210
425
  */
211
426
  static VALUE bloom_stats(VALUE self) {
212
- BloomFilter *bloom;
213
- TypedData_Get_Struct(self, BloomFilter, &bloom_type, bloom);
214
-
215
- size_t bits_set = 0;
216
- size_t total_bits = bloom->size * 8;
217
-
218
- for (size_t i = 0; i < bloom->size; i++) {
219
- uint8_t byte = bloom->bits[i];
220
- while (byte) {
221
- bits_set += byte & 1;
222
- byte >>= 1;
223
- }
427
+ ScalableBloom *sb;
428
+ TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
429
+
430
+ size_t total_bytes = 0;
431
+ size_t total_bits = 0;
432
+ size_t total_bits_set = 0;
433
+
434
+ VALUE layers_ary = rb_ary_new_capa((long)sb->num_layers);
435
+
436
+ for (size_t i = 0; i < sb->num_layers; i++) {
437
+ BloomLayer *l = sb->layers[i];
438
+ size_t bs = layer_bits_set(l);
439
+ size_t tb = l->size * 8;
440
+
441
+ total_bytes += l->size;
442
+ total_bits += tb;
443
+ total_bits_set += bs;
444
+
445
+ VALUE lh = rb_hash_new();
446
+ rb_hash_aset(lh, ID2SYM(rb_intern("layer")), LONG2NUM(i));
447
+ rb_hash_aset(lh, ID2SYM(rb_intern("capacity")), LONG2NUM(l->capacity));
448
+ rb_hash_aset(lh, ID2SYM(rb_intern("count")), LONG2NUM(l->count));
449
+ rb_hash_aset(lh, ID2SYM(rb_intern("size_bytes")), LONG2NUM(l->size));
450
+ rb_hash_aset(lh, ID2SYM(rb_intern("num_hashes")), INT2NUM(l->num_hashes));
451
+ rb_hash_aset(lh, ID2SYM(rb_intern("bits_set")), LONG2NUM(bs));
452
+ rb_hash_aset(lh, ID2SYM(rb_intern("total_bits")), LONG2NUM(tb));
453
+ rb_hash_aset(lh, ID2SYM(rb_intern("fill_ratio")), DBL2NUM((double)bs / tb));
454
+ rb_hash_aset(lh, ID2SYM(rb_intern("error_rate")),
455
+ DBL2NUM(layer_error_rate(sb->error_rate, sb->tightening, i)));
456
+
457
+ rb_ary_push(layers_ary, lh);
224
458
  }
225
-
226
- double fill_ratio = (double)bits_set / total_bits;
227
-
459
+
228
460
  VALUE hash = rb_hash_new();
229
- rb_hash_aset(hash, ID2SYM(rb_intern("capacity")), LONG2NUM(bloom->capacity));
230
- rb_hash_aset(hash, ID2SYM(rb_intern("size_bytes")), LONG2NUM(bloom->size));
231
- rb_hash_aset(hash, ID2SYM(rb_intern("num_hashes")), INT2NUM(bloom->num_hashes));
232
- rb_hash_aset(hash, ID2SYM(rb_intern("bits_set")), LONG2NUM(bits_set));
233
- rb_hash_aset(hash, ID2SYM(rb_intern("total_bits")), LONG2NUM(total_bits));
234
- rb_hash_aset(hash, ID2SYM(rb_intern("fill_ratio")), DBL2NUM(fill_ratio));
235
-
461
+ rb_hash_aset(hash, ID2SYM(rb_intern("total_count")), LONG2NUM(sb->total_count));
462
+ rb_hash_aset(hash, ID2SYM(rb_intern("num_layers")), LONG2NUM(sb->num_layers));
463
+ rb_hash_aset(hash, ID2SYM(rb_intern("total_bytes")), LONG2NUM(total_bytes));
464
+ rb_hash_aset(hash, ID2SYM(rb_intern("total_bits")), LONG2NUM(total_bits));
465
+ rb_hash_aset(hash, ID2SYM(rb_intern("total_bits_set")), LONG2NUM(total_bits_set));
466
+ rb_hash_aset(hash, ID2SYM(rb_intern("fill_ratio")), DBL2NUM((double)total_bits_set / total_bits));
467
+ rb_hash_aset(hash, ID2SYM(rb_intern("error_rate")), DBL2NUM(sb->error_rate));
468
+ rb_hash_aset(hash, ID2SYM(rb_intern("layers")), layers_ary);
469
+
236
470
  return hash;
237
471
  }
238
472
 
239
473
  /*
240
- * Merge another filter
474
+ * Number of elements inserted.
475
+ */
476
+ static VALUE bloom_count(VALUE self) {
477
+ ScalableBloom *sb;
478
+ TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
479
+ return LONG2NUM(sb->total_count);
480
+ }
481
+
482
+ /*
483
+ * Number of layers currently allocated.
484
+ */
485
+ static VALUE bloom_num_layers(VALUE self) {
486
+ ScalableBloom *sb;
487
+ TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
488
+ return LONG2NUM(sb->num_layers);
489
+ }
490
+
491
+ /*
492
+ * Merge another scalable filter into this one.
493
+ * Appends all layers from `other` (copies the bit arrays).
241
494
  */
242
495
  static VALUE bloom_merge(VALUE self, VALUE other) {
243
- BloomFilter *bloom1, *bloom2;
244
- TypedData_Get_Struct(self, BloomFilter, &bloom_type, bloom1);
245
- TypedData_Get_Struct(other, BloomFilter, &bloom_type, bloom2);
246
-
247
- if (bloom1->size != bloom2->size || bloom1->num_hashes != bloom2->num_hashes) {
248
- rb_raise(rb_eArgError, "cannot merge filters with different parameters");
249
- }
250
-
251
- for (size_t i = 0; i < bloom1->size; i++) {
252
- bloom1->bits[i] |= bloom2->bits[i];
496
+ ScalableBloom *sb1, *sb2;
497
+ TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb1);
498
+ TypedData_Get_Struct(other, ScalableBloom, &scalable_bloom_type, sb2);
499
+
500
+ for (size_t i = 0; i < sb2->num_layers; i++) {
501
+ BloomLayer *src = sb2->layers[i];
502
+
503
+ /* Create a copy of the layer */
504
+ BloomLayer *copy = (BloomLayer *)calloc(1, sizeof(BloomLayer));
505
+ if (!copy) rb_raise(rb_eNoMemError, "failed to allocate layer copy");
506
+
507
+ copy->size = src->size;
508
+ copy->capacity = src->capacity;
509
+ copy->count = src->count;
510
+ copy->num_hashes = src->num_hashes;
511
+ copy->bits = (uint8_t *)malloc(src->size);
512
+ if (!copy->bits) { free(copy); rb_raise(rb_eNoMemError, "failed to allocate bits"); }
513
+ memcpy(copy->bits, src->bits, src->size);
514
+
515
+ /* Append to layers array */
516
+ if (sb1->num_layers >= sb1->layers_cap) {
517
+ size_t new_slots = sb1->layers_cap == 0 ? 4 : sb1->layers_cap * 2;
518
+ BloomLayer **tmp = (BloomLayer **)realloc(sb1->layers,
519
+ new_slots * sizeof(BloomLayer *));
520
+ if (!tmp) { layer_free(copy); rb_raise(rb_eNoMemError, "realloc failed"); }
521
+ sb1->layers = tmp;
522
+ sb1->layers_cap = new_slots;
523
+ }
524
+ sb1->layers[sb1->num_layers++] = copy;
253
525
  }
254
-
526
+
527
+ sb1->total_count += sb2->total_count;
255
528
  return self;
256
529
  }
257
530
 
531
+ /* ------------------------------------------------------------------ */
532
+ /* Init */
533
+ /* ------------------------------------------------------------------ */
534
+
258
535
  void Init_fast_bloom_filter(void) {
259
536
  VALUE mFastBloomFilter = rb_define_module("FastBloomFilter");
260
- VALUE cBloomFilter = rb_define_class_under(mFastBloomFilter, "Filter", rb_cObject);
261
-
262
- rb_define_alloc_func(cBloomFilter, bloom_alloc);
263
- rb_define_method(cBloomFilter, "initialize", bloom_initialize, -1);
264
- rb_define_method(cBloomFilter, "add", bloom_add, 1);
265
- rb_define_method(cBloomFilter, "<<", bloom_add, 1);
266
- rb_define_method(cBloomFilter, "include?", bloom_include, 1);
267
- rb_define_method(cBloomFilter, "member?", bloom_include, 1);
268
- rb_define_method(cBloomFilter, "clear", bloom_clear, 0);
269
- rb_define_method(cBloomFilter, "stats", bloom_stats, 0);
270
- rb_define_method(cBloomFilter, "merge!", bloom_merge, 1);
537
+ VALUE cFilter = rb_define_class_under(mFastBloomFilter, "Filter", rb_cObject);
538
+
539
+ rb_define_alloc_func(cFilter, bloom_alloc);
540
+ rb_define_method(cFilter, "initialize", bloom_initialize, -1);
541
+ rb_define_method(cFilter, "add", bloom_add, 1);
542
+ rb_define_method(cFilter, "<<", bloom_add, 1);
543
+ rb_define_method(cFilter, "include?", bloom_include, 1);
544
+ rb_define_method(cFilter, "member?", bloom_include, 1);
545
+ rb_define_method(cFilter, "clear", bloom_clear, 0);
546
+ rb_define_method(cFilter, "stats", bloom_stats, 0);
547
+ rb_define_method(cFilter, "count", bloom_count, 0);
548
+ rb_define_method(cFilter, "size", bloom_count, 0);
549
+ rb_define_method(cFilter, "num_layers", bloom_num_layers, 0);
550
+ rb_define_method(cFilter, "merge!", bloom_merge, 1);
271
551
  }
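The control flow introduced in the C changes above (validate options, grow when the active layer is full, look up from newest to oldest, reset to one layer on clear, append copied layers on merge) can be modeled in a few lines of plain Ruby. This is only a sketch under stated assumptions, not the extension's implementation: `LayeredModel` and `Layer` are hypothetical names, each layer is a `Set` standing in for the bit array (so it has no false positives), every layer reuses the same capacity because the real sizing policy in `scalable_add_layer` is not shown here, and the keyword defaults are placeholders rather than the gem's `DEFAULT_*` values.

```ruby
require "set"

# Toy model of the layered control flow; NOT the gem's implementation.
class LayeredModel
  # :added counts insertions (duplicates included), mirroring a per-layer counter.
  Layer = Struct.new(:capacity, :items, :added)

  attr_reader :layers, :total_count

  # Defaults below are placeholders chosen for this sketch only.
  def initialize(error_rate: 0.01, initial_capacity: 1024, tightening: 0.85)
    unless error_rate > 0 && error_rate < 1
      raise ArgumentError, "error_rate must be between 0 and 1 (exclusive)"
    end
    raise ArgumentError, "initial_capacity must be positive" unless initial_capacity > 0
    unless tightening > 0 && tightening < 1
      raise ArgumentError, "tightening must be between 0 and 1 (exclusive)"
    end

    @initial_capacity = initial_capacity
    @layers = []
    @total_count = 0
    add_layer
  end

  def add(str)
    raise TypeError, "expected String" unless str.is_a?(String)

    active = @layers.last
    # Assumption: "full" means the insertion counter reached the capacity.
    active = add_layer if active.added >= active.capacity
    active.items << str
    active.added += 1
    @total_count += 1
    true
  end

  # Newest layer first; true as soon as any layer reports a hit.
  def include?(str)
    @layers.reverse_each.any? { |layer| layer.items.include?(str) }
  end

  # Drop every layer, then start again with one fresh layer.
  def clear
    @layers = []
    @total_count = 0
    add_layer
    nil
  end

  # Append copies of the other filter's layers; nothing is OR-ed together.
  def merge!(other)
    other.layers.each do |l|
      @layers << Layer.new(l.capacity, l.items.dup, l.added)
    end
    @total_count += other.total_count
    self
  end

  private

  def add_layer
    layer = Layer.new(@initial_capacity, Set.new, 0)
    @layers << layer
    layer
  end
end
```

One consequence visible in the model: because `merge!` appends layers, the receiver's active (last) layer after a merge is a copy of the other filter's last layer, and the 1.0.0 requirement that both filters share the same parameters no longer applies.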
@@ -1,3 +1,3 @@
1
1
  module FastBloomFilter
2
- VERSION = "1.0.0"
2
+ VERSION = "2.0.0"
3
3
  end
@@ -21,30 +21,30 @@ module FastBloomFilter
21
21
  items.each { |item| add(item.to_s) }
22
22
  self
23
23
  end
24
-
24
+
25
25
  def count_possible_matches(items)
26
26
  items.count { |item| include?(item.to_s) }
27
27
  end
28
-
28
+
29
29
  def inspect
30
30
  s = stats
31
- size_kb = (s[:size_bytes] / 1024.0).round(2)
31
+ total_kb = (s[:total_bytes] / 1024.0).round(2)
32
32
  fill_pct = (s[:fill_ratio] * 100).round(2)
33
-
34
- "#<FastBloomFilter::Filter capacity=#{s[:capacity]} " \
35
- "size=#{size_kb}KB hashes=#{s[:num_hashes]} fill=#{fill_pct}%>"
33
+
34
+ "#<FastBloomFilter::Filter v2 layers=#{s[:num_layers]} " \
35
+ "count=#{s[:total_count]} size=#{total_kb}KB fill=#{fill_pct}%>"
36
36
  end
37
-
37
+
38
38
  def to_s
39
39
  inspect
40
40
  end
41
41
  end
42
-
43
- def self.for_emails(capacity, error_rate: 0.001)
44
- Filter.new(capacity, error_rate)
42
+
43
+ def self.for_emails(error_rate: 0.001, initial_capacity: 10_000)
44
+ Filter.new(error_rate: error_rate, initial_capacity: initial_capacity)
45
45
  end
46
-
47
- def self.for_urls(capacity, error_rate: 0.01)
48
- Filter.new(capacity, error_rate)
46
+
47
+ def self.for_urls(error_rate: 0.01, initial_capacity: 10_000)
48
+ Filter.new(error_rate: error_rate, initial_capacity: initial_capacity)
49
49
  end
50
50
  end
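The `inspect` change above reads different keys from `stats` than 1.0.0 did: `:capacity`, `:size_bytes` and `:num_hashes` are now reported per layer, while the top-level hash carries `:total_bytes`, `:num_layers` and `:total_count`. The sketch below applies the same formatting steps to a hand-written hash so it runs without the gem installed; `format_inspect` and `sample_stats` are hypothetical names and the numbers are made up for illustration.

```ruby
# Hand-written sample shaped like the top-level hash built by bloom_stats;
# the values are invented for this example.
sample_stats = {
  total_count: 1500,
  num_layers: 2,
  total_bytes: 2048,
  fill_ratio: 0.25
}

# Same formatting steps as the 2.0.0 Filter#inspect, applied to a plain hash.
def format_inspect(s)
  total_kb = (s[:total_bytes] / 1024.0).round(2)
  fill_pct = (s[:fill_ratio] * 100).round(2)

  "#<FastBloomFilter::Filter v2 layers=#{s[:num_layers]} " \
    "count=#{s[:total_count]} size=#{total_kb}KB fill=#{fill_pct}%>"
end

puts format_inspect(sample_stats)
# prints: #<FastBloomFilter::Filter v2 layers=2 count=1500 size=2.0KB fill=25.0%>
```

Code written against 1.0.0 that reads `stats[:size_bytes]` or `stats[:capacity]` at the top level would get `nil` under this layout, so callers may need to switch to `:total_bytes` or to the entries under `:layers`.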
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: fast_bloom_filter
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 2.0.0
5
5
  platform: ruby
6
6
  authors:
7
- - Your Name
7
+ - Roman Haydarov
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2026-02-09 00:00:00.000000000 Z
11
+ date: 2026-02-12 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -66,10 +66,10 @@ dependencies:
66
66
  - - "~>"
67
67
  - !ruby/object:Gem::Version
68
68
  version: '5.0'
69
- description: Memory-efficient probabilistic data structure. 20-50x less memory than
70
- Set, perfect for Rails apps.
69
+ description: Memory-efficient scalable Bloom Filter that grows dynamically. No upfront
70
+ capacity needed. 20-50x less memory than Set, perfect for Rails apps.
71
71
  email:
72
- - your.email@example.com
72
+ - romnhajdarov@gmail.com
73
73
  executables: []
74
74
  extensions:
75
75
  - ext/fast_bloom_filter/extconf.rb
@@ -82,13 +82,13 @@ files:
82
82
  - ext/fast_bloom_filter/fast_bloom_filter.c
83
83
  - lib/fast_bloom_filter.rb
84
84
  - lib/fast_bloom_filter/version.rb
85
- homepage: https://github.com/yourusername/fast_bloom_filter
85
+ homepage: https://github.com/roman-haidarov/fast_bloom_filter
86
86
  licenses:
87
87
  - MIT
88
88
  metadata:
89
- homepage_uri: https://github.com/yourusername/fast_bloom_filter
90
- source_code_uri: https://github.com/yourusername/fast_bloom_filter
91
- changelog_uri: https://github.com/yourusername/fast_bloom_filter/blob/main/CHANGELOG.md
89
+ homepage_uri: https://github.com/roman-haidarov/fast_bloom_filter
90
+ source_code_uri: https://github.com/roman-haidarov/fast_bloom_filter
91
+ changelog_uri: https://github.com/roman-haidarov/fast_bloom_filter/blob/main/CHANGELOG.md
92
92
  post_install_message:
93
93
  rdoc_options: []
94
94
  require_paths:
@@ -107,5 +107,5 @@ requirements: []
107
107
  rubygems_version: 3.4.22
108
108
  signing_key:
109
109
  specification_version: 4
110
- summary: High-performance Bloom Filter in C for Ruby
110
+ summary: Scalable Bloom Filter in C for Ruby - grows automatically
111
111
  test_files: []