fast_bloom_filter 1.0.0 → 2.0.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +54 -0
- data/README.md +138 -48
- data/ext/fast_bloom_filter/fast_bloom_filter.c +445 -165
- data/lib/fast_bloom_filter/version.rb +1 -1
- data/lib/fast_bloom_filter.rb +13 -13
- metadata +11 -11
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c2a374d45131e2f8d8aedb0c7401d25541021c13dfef6b7b9f10b9642af13d26
+  data.tar.gz: 989b50a1f8e256e192d1bcabfdeede1594b587869e82d91c7a569921207e9a94
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b3e886b66f0f604686ca4b13d5a773e2af2b08248897657105315a4918f6dddc626f54be029fc01317cf0607cd74d1caa0743d0b9e41fb276a6d2dda2d56a33f
+  data.tar.gz: 5f86210e42e1f81b2d0997a354acff115038326df666ac0b9b02b4cae2a9da96d938d5abd3f4834c06c9ecf755ed41f02532b49e905c686eec42453ab4cd58ed
data/CHANGELOG.md
CHANGED
|
@@ -5,6 +5,60 @@ All notable changes to this project will be documented in this file.
|
|
|
5
5
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
6
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
7
|
|
|
8
|
+
## [2.0.0] - 2026-02-12
|
|
9
|
+
|
|
10
|
+
### 🚀 Major Release - Scalable Bloom Filter
|
|
11
|
+
|
|
12
|
+
This is a **breaking change** that transforms FastBloomFilter into a scalable, dynamic data structure.
|
|
13
|
+
|
|
14
|
+
### Added
|
|
15
|
+
- **Scalable Architecture**: Filter now grows automatically by adding layers
|
|
16
|
+
- **No Upfront Capacity**: No need to specify capacity - just set error_rate
|
|
17
|
+
- **Multi-Layer System**: Each layer has progressively tighter error rates
|
|
18
|
+
- **Smart Growth Strategy**: Growth factor starts at 2x and decreases (like Go slices)
|
|
19
|
+
- **Layer Statistics**: Detailed per-layer stats via `stats` method
|
|
20
|
+
- **New API**: `Filter.new(error_rate: 0.01, initial_capacity: 1024)`
|
|
21
|
+
- `num_layers` method to check how many layers are active
|
|
22
|
+
- Enhanced `merge!` to combine filters with all their layers
|
|
23
|
+
|
|
24
|
+
### Changed
|
|
25
|
+
- **BREAKING**: Constructor now uses keyword arguments: `Filter.new(error_rate: 0.01)` instead of `Filter.new(capacity, error_rate)`
|
|
26
|
+
- **BREAKING**: `stats` now returns multi-layer information with `:layers` array
|
|
27
|
+
- **BREAKING**: Helper methods changed: `for_emails(error_rate: 0.001)` instead of `for_emails(capacity)`
|
|
28
|
+
- Memory allocation is now dynamic and grows on-demand
|
|
29
|
+
- `inspect` output now shows layer count and total elements
|
|
30
|
+
|
|
31
|
+
### Technical Details
|
|
32
|
+
- Based on "Scalable Bloom Filters" (Almeida et al., 2007)
|
|
33
|
+
- Each layer uses error_rate * (1 - r) * r^i formula
|
|
34
|
+
- Default tightening factor (r) = 0.85
|
|
35
|
+
- Growth factors: 2x → 1.75x → 1.5x → 1.25x as layers increase
|
|
36
|
+
- Layers are checked from newest to oldest for better cache locality
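
Each per-layer error rate listed above translates into a bit-array size and a hash count for that layer. The Ruby sketch below follows the arithmetic of `layer_create` in the C source of this release (64-bit minimum, hash count clamped to 1..20). The method name `layer_params` is hypothetical and exists only for illustration; it is not part of the gem's API.

```ruby
# Returns [bits, bytes, num_hashes] for one layer.
# Illustrative port of the sizing arithmetic in layer_create().
def layer_params(capacity, error_rate)
  ln2 = Math.log(2)

  # m = -n * ln(p) / (ln 2)^2, truncated like the C (size_t) cast
  bits = (-capacity * Math.log(error_rate) / (ln2 * ln2)).to_i
  bits = 64 if bits < 64 # same minimum as the C code

  bytes = (bits + 7) / 8 # round up to whole bytes

  # k = (m / n) * ln 2, truncated and clamped to MIN_HASHES..MAX_HASHES
  num_hashes = ((bits / capacity.to_f) * ln2).to_i.clamp(1, 20)

  [bits, bytes, num_hashes]
end

bits, bytes, hashes = layer_params(1000, 0.01)
puts "bits=#{bits} bytes=#{bytes} hashes=#{hashes}"
```

Because later layers receive smaller error rates, they need more bits per element and more hash functions than earlier ones.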
|
|
37
|
+
|
|
38
|
+
### Migration Guide
|
|
39
|
+
|
|
40
|
+
**v1.x code:**
|
|
41
|
+
```ruby
|
|
42
|
+
bloom = FastBloomFilter::Filter.new(10_000, 0.01)
|
|
43
|
+
bloom = FastBloomFilter.for_emails(100_000)
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
**v2.x code:**
|
|
47
|
+
```ruby
|
|
48
|
+
bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
|
|
49
|
+
bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
|
|
50
|
+
# Or simply:
|
|
51
|
+
bloom = FastBloomFilter::Filter.new(error_rate: 0.01) # starts small, grows as needed
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
### Performance
|
|
55
|
+
- Same O(k) complexity for add/lookup
|
|
56
|
+
- Slightly higher memory overhead due to layer management
|
|
57
|
+
- Better memory efficiency for unknown/growing datasets
|
|
58
|
+
- No performance degradation as filter grows
|
|
59
|
+
|
|
60
|
+
---
|
|
61
|
+
|
|
8
62
|
## [1.0.0] - 2026-02-09
|
|
9
63
|
|
|
10
64
|
### Added
|
data/README.md
CHANGED
|
@@ -1,17 +1,26 @@
|
|
|
1
|
-
# FastBloomFilter
|
|
1
|
+
# FastBloomFilter v2 🚀
|
|
2
2
|
|
|
3
|
-
[](https://github.com/yourusername/fast_bloom_filter/actions/workflows/ci.yml)
|
|
4
3
|
[](https://badge.fury.io/rb/fast_bloom_filter)
|
|
5
4
|
|
|
6
|
-
A
|
|
5
|
+
A **scalable** Bloom Filter implementation in C for Ruby. Grows automatically without requiring upfront capacity! Perfect for Rails applications that need memory-efficient set membership testing with unknown dataset sizes.
|
|
6
|
+
|
|
7
|
+
## What's New in v2? 🎉
|
|
8
|
+
|
|
9
|
+
- **🔄 Scalable Architecture**: No need to guess capacity upfront - the filter grows automatically
|
|
10
|
+
- **📊 Multi-Layer System**: Adds new layers dynamically as data grows
|
|
11
|
+
- **🎯 Smart Growth**: Growth factor adapts (2x → 1.75x → 1.5x → 1.25x)
|
|
12
|
+
- **💡 Simpler API**: Just specify error rate, not capacity
|
|
13
|
+
- **📈 Better for Unknown Sizes**: Perfect when you don't know how much data you'll have
|
|
14
|
+
|
|
15
|
+
Based on ["Scalable Bloom Filters" (Almeida et al., 2007)](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=10.1.1.725.390)
|
|
7
16
|
|
|
8
17
|
## Features
|
|
9
18
|
|
|
10
19
|
- **🚀 Fast**: C implementation with MurmurHash3
|
|
11
20
|
- **💾 Memory Efficient**: 20-50x less memory than Ruby Set
|
|
12
|
-
-
|
|
13
|
-
-
|
|
14
|
-
- **📊 Statistics**:
|
|
21
|
+
- **🔄 Auto-Scaling**: Grows dynamically as you add elements
|
|
22
|
+
- **🎯 Configurable**: Adjustable false positive rate per layer
|
|
23
|
+
- **📊 Statistics**: Detailed per-layer performance monitoring
|
|
15
24
|
- **✅ Well-Tested**: Comprehensive test suite
|
|
16
25
|
|
|
17
26
|
## Installation
|
|
@@ -36,15 +45,19 @@ gem install fast_bloom_filter
|
|
|
36
45
|
|
|
37
46
|
## Usage
|
|
38
47
|
|
|
39
|
-
### Basic Operations
|
|
48
|
+
### Basic Operations (v2 API)
|
|
40
49
|
|
|
41
50
|
```ruby
|
|
42
51
|
require 'fast_bloom_filter'
|
|
43
52
|
|
|
44
|
-
# Create a filter
|
|
45
|
-
|
|
53
|
+
# Create a scalable filter - NO CAPACITY NEEDED!
|
|
54
|
+
# Just specify your desired error rate
|
|
55
|
+
bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
|
|
46
56
|
|
|
47
|
-
#
|
|
57
|
+
# Or with an initial capacity hint (optional)
|
|
58
|
+
bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
|
|
59
|
+
|
|
60
|
+
# Add items - filter grows automatically
|
|
48
61
|
bloom.add("user@example.com")
|
|
49
62
|
bloom << "another@example.com" # alias for add
|
|
50
63
|
|
|
@@ -52,6 +65,13 @@ bloom << "another@example.com" # alias for add
|
|
|
52
65
|
bloom.include?("user@example.com") # => true
|
|
53
66
|
bloom.include?("notfound@test.com") # => false (probably)
|
|
54
67
|
|
|
68
|
+
# Add thousands or millions - it scales!
|
|
69
|
+
100_000.times { |i| bloom.add("user#{i}@test.com") }
|
|
70
|
+
|
|
71
|
+
# Check stats
|
|
72
|
+
bloom.count # => 100002
|
|
73
|
+
bloom.num_layers # => 8 (grew automatically!)
|
|
74
|
+
|
|
55
75
|
# Batch operations
|
|
56
76
|
emails = ["user1@test.com", "user2@test.com", "user3@test.com"]
|
|
57
77
|
bloom.add_all(emails)
|
|
@@ -67,50 +87,99 @@ bloom.clear
|
|
|
67
87
|
|
|
68
88
|
```ruby
|
|
69
89
|
# For email deduplication (0.1% false positive rate)
|
|
70
|
-
bloom = FastBloomFilter.for_emails(
|
|
90
|
+
bloom = FastBloomFilter.for_emails(error_rate: 0.001)
|
|
71
91
|
|
|
72
92
|
# For URL tracking (1% false positive rate)
|
|
73
|
-
bloom = FastBloomFilter.for_urls(
|
|
93
|
+
bloom = FastBloomFilter.for_urls(error_rate: 0.01)
|
|
94
|
+
|
|
95
|
+
# With initial capacity hint
|
|
96
|
+
bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
|
|
74
97
|
```
|
|
75
98
|
|
|
76
99
|
### Merge Filters
|
|
77
100
|
|
|
78
101
|
```ruby
|
|
79
|
-
bloom1 = FastBloomFilter::Filter.new(
|
|
80
|
-
bloom2 = FastBloomFilter::Filter.new(
|
|
102
|
+
bloom1 = FastBloomFilter::Filter.new(error_rate: 0.01)
|
|
103
|
+
bloom2 = FastBloomFilter::Filter.new(error_rate: 0.01)
|
|
81
104
|
|
|
82
105
|
bloom1.add("item1")
|
|
83
106
|
bloom2.add("item2")
|
|
84
107
|
|
|
85
108
|
bloom1.merge!(bloom2) # bloom1 now contains both items
|
|
109
|
+
# Merges all layers from bloom2 into bloom1
|
|
86
110
|
```
|
|
87
111
|
|
|
88
112
|
### Statistics
|
|
89
113
|
|
|
90
114
|
```ruby
|
|
91
|
-
bloom = FastBloomFilter::Filter.new(
|
|
115
|
+
bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
|
|
116
|
+
1000.times { |i| bloom.add("item#{i}") }
|
|
117
|
+
|
|
92
118
|
stats = bloom.stats
|
|
93
119
|
|
|
94
120
|
# => {
|
|
95
|
-
#
|
|
96
|
-
#
|
|
97
|
-
#
|
|
98
|
-
#
|
|
121
|
+
# total_count: 1000,
|
|
122
|
+
# num_layers: 2,
|
|
123
|
+
# total_bytes: 2500,
|
|
124
|
+
# total_bits: 20000,
|
|
125
|
+
# total_bits_set: 6543,
|
|
126
|
+
# fill_ratio: 0.32715,
|
|
127
|
+
# error_rate: 0.01,
|
|
128
|
+
# layers: [
|
|
129
|
+
# {
|
|
130
|
+
# layer: 0,
|
|
131
|
+
# capacity: 1024,
|
|
132
|
+
# count: 1024,
|
|
133
|
+
# size_bytes: 1229,
|
|
134
|
+
# num_hashes: 7,
|
|
135
|
+
# bits_set: 5234,
|
|
136
|
+
# total_bits: 9832,
|
|
137
|
+
# fill_ratio: 0.532,
|
|
138
|
+
# error_rate: 0.0015
|
|
139
|
+
# },
|
|
140
|
+
# # ... more layers
|
|
141
|
+
# ]
|
|
99
142
|
# }
|
|
100
143
|
|
|
101
144
|
puts bloom.inspect
|
|
102
|
-
# => #<FastBloomFilter::Filter
|
|
145
|
+
# => #<FastBloomFilter::Filter v2 layers=2 count=1000 size=2.44KB fill=32.72%>
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
## How Scalable Bloom Filters Work
|
|
149
|
+
|
|
150
|
+
Traditional Bloom Filters require you to specify capacity upfront. **Scalable Bloom Filters** solve this by:
|
|
151
|
+
|
|
152
|
+
1. **Starting Small**: Begin with a small initial capacity (default: 1024 elements)
|
|
153
|
+
2. **Adding Layers**: When a layer fills up, add a new layer with larger capacity
|
|
154
|
+
3. **Tightening Error Rates**: Each new layer has a tighter error rate to maintain overall FPR
|
|
155
|
+
4. **Smart Growth**: Growth factor decreases over time (2x → 1.75x → 1.5x → 1.25x)
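
The steps above can be sketched in plain Ruby. This is a toy model under several assumptions, not the gem's implementation: `ToyLayer` and `ToyScalableFilter` are hypothetical names, every layer uses a fixed 10 bits per element and 7 hashes instead of a tightened error rate, growth is a flat 2x, and `Zlib.crc32` with two seeds stands in for MurmurHash3.

```ruby
require "zlib"

# One fixed-size Bloom layer. A Ruby Integer serves as the bit array.
class ToyLayer
  attr_reader :capacity, :count

  BITS_PER_ELEMENT = 10 # simplification: the gem sizes layers from the error rate
  NUM_HASHES = 7        # simplification: the gem derives this per layer

  def initialize(capacity)
    @capacity = capacity
    @count = 0
    @num_bits = capacity * BITS_PER_ELEMENT
    @bits = 0
  end

  def full?
    @count >= @capacity
  end

  def add(item)
    positions(item).each { |pos| @bits |= (1 << pos) }
    @count += 1
  end

  def include?(item)
    positions(item).all? { |pos| @bits[pos] == 1 }
  end

  private

  # Double hashing: position_i = (h1 + i * h2) mod num_bits.
  def positions(item)
    h1 = Zlib.crc32(item, 0x9747b28c)
    h2 = Zlib.crc32(item, 0x5bd1e995)
    (0...NUM_HASHES).map { |i| (h1 + i * h2) % @num_bits }
  end
end

# A chain of layers that grows when the active layer is full.
class ToyScalableFilter
  attr_reader :layers

  def initialize(initial_capacity: 1024)
    @layers = [ToyLayer.new(initial_capacity)]
  end

  def add(item)
    # Open a larger layer once the active one has reached its capacity.
    @layers << ToyLayer.new(@layers.last.capacity * 2) if @layers.last.full?
    @layers.last.add(item)
    self
  end

  # An item may live in any layer, so layers are consulted newest first
  # until one of them reports a possible match.
  def include?(item)
    @layers.reverse_each.any? { |layer| layer.include?(item) }
  end

  def count
    @layers.sum(&:count)
  end
end

filter = ToyScalableFilter.new(initial_capacity: 4)
20.times { |i| filter.add("user#{i}@test.com") }
puts "layers=#{filter.layers.size} count=#{filter.count}"
```

Old layers are never resized or rehashed. Growth only appends a new layer, which is why inserts stay cheap while lookups may have to visit several layers.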
|
|
156
|
+
|
|
157
|
+
### Error Rate Distribution
|
|
158
|
+
|
|
159
|
+
Each layer `i` gets error rate: `total_error_rate × (1 - r) × r^i`
|
|
160
|
+
|
|
161
|
+
Where `r` is the tightening factor (default: 0.85). This ensures the sum of all layer error rates converges to your target error rate.
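
As a quick check of the formula, the Ruby sketch below computes the per-layer rates for a 1% target with `r = 0.85` and sums them. The method name mirrors `layer_error_rate` in the C source of this release, but this Ruby version is only an illustration and is not exposed by the gem.

```ruby
# Per-layer false-positive budget: total * (1 - r) * r**i
def layer_error_rate(total_error_rate, tightening, index)
  total_error_rate * (1.0 - tightening) * tightening**index
end

# Sum of the first `layer_count` layer budgets (a geometric series).
def cumulative_error_rate(total_error_rate, tightening, layer_count)
  (0...layer_count).sum { |i| layer_error_rate(total_error_rate, tightening, i) }
end

target = 0.01
r = 0.85

5.times do |i|
  puts format("layer %d: %.6f", i, layer_error_rate(target, r, i))
end

puts format("sum over 200 layers: %.6f", cumulative_error_rate(target, r, 200))
```

The sum is a geometric series, `target × (1 - r) × (1 + r + r² + …) = target`, so any finite number of layers stays at or below the requested rate, up to floating-point rounding.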
|
|
162
|
+
|
|
163
|
+
### Example Growth Pattern
|
|
164
|
+
|
|
165
|
+
```
|
|
166
|
+
Layer 0: capacity=1,024 error_rate=0.0015 (initial)
|
|
167
|
+
Layer 1: capacity=2,048 error_rate=0.0013 (2x growth)
|
|
168
|
+
Layer 2: capacity=3,584 error_rate=0.0011 (1.75x growth)
|
|
169
|
+
Layer 3: capacity=5,376 error_rate=0.0009 (1.5x growth)
|
|
170
|
+
Layer 4: capacity=6,720 error_rate=0.0008 (1.25x growth)
|
|
171
|
+
...
|
|
103
172
|
```
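
To project capacities for a particular starting size, the thresholds of the `growth_factor` function in the C source of this release can be ported to Ruby as below. Note that the C function switches factor at 4, 8, and 12 layers rather than on every layer. The method `layer_capacities` is a hypothetical helper, and the 1,024 starting capacity is only an example value.

```ruby
# Mirrors the thresholds of growth_factor() in fast_bloom_filter.c.
def growth_factor(num_layers)
  return 2.0  if num_layers < 4
  return 1.75 if num_layers < 8
  return 1.5  if num_layers < 12

  1.25
end

# Projects the capacity of each layer, truncating like the C (size_t) cast.
# The factor is chosen from the number of layers that already exist.
def layer_capacities(initial_capacity, layer_count)
  capacities = [initial_capacity]
  while capacities.size < layer_count
    capacities << (capacities.last * growth_factor(capacities.size)).to_i
  end
  capacities
end

layer_capacities(1024, 6).each_with_index do |capacity, index|
  puts "layer #{index}: capacity=#{capacity}"
end
```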
|
|
104
173
|
|
|
105
174
|
## Performance
|
|
106
175
|
|
|
107
176
|
Benchmarks on MacBook Pro M1 (100K elements):
|
|
108
177
|
|
|
109
|
-
| Operation | Bloom Filter | Ruby Set | Speedup |
|
|
110
|
-
|
|
111
|
-
| Add |
|
|
112
|
-
| Check |
|
|
113
|
-
| Memory |
|
|
178
|
+
| Operation | Bloom Filter v2 | Ruby Set | Speedup |
|
|
179
|
+
|-----------|-----------------|----------|---------|
|
|
180
|
+
| Add | 48ms | 120ms | 2.5x |
|
|
181
|
+
| Check | 9ms | 15ms | 1.7x |
|
|
182
|
+
| Memory | 145KB | 2000KB | 13.8x |
|
|
114
183
|
|
|
115
184
|
Run benchmarks yourself:
|
|
116
185
|
|
|
@@ -120,11 +189,12 @@ ruby demo.rb
|
|
|
120
189
|
|
|
121
190
|
## Use Cases
|
|
122
191
|
|
|
123
|
-
### Rails: Prevent Duplicate Email Signups
|
|
192
|
+
### Rails: Prevent Duplicate Email Signups (No Capacity Guessing!)
|
|
124
193
|
|
|
125
194
|
```ruby
|
|
126
195
|
class User < ApplicationRecord
|
|
127
|
-
|
|
196
|
+
# No need to guess how many users you'll have!
|
|
197
|
+
SIGNUP_BLOOM = FastBloomFilter.for_emails(error_rate: 0.001)
|
|
128
198
|
|
|
129
199
|
before_validation :check_duplicate_signup
|
|
130
200
|
|
|
@@ -140,12 +210,13 @@ class User < ApplicationRecord
|
|
|
140
210
|
end
|
|
141
211
|
```
|
|
142
212
|
|
|
143
|
-
### Track Visited URLs
|
|
213
|
+
### Track Visited URLs (Scales to Millions)
|
|
144
214
|
|
|
145
215
|
```ruby
|
|
146
216
|
class WebCrawler
|
|
147
217
|
def initialize
|
|
148
|
-
|
|
218
|
+
# Starts small, grows as needed
|
|
219
|
+
@visited = FastBloomFilter.for_urls(error_rate: 0.01)
|
|
149
220
|
end
|
|
150
221
|
|
|
151
222
|
def crawl(url)
|
|
@@ -153,6 +224,11 @@ class WebCrawler
|
|
|
153
224
|
|
|
154
225
|
@visited.add(url)
|
|
155
226
|
# ... crawl logic
|
|
227
|
+
|
|
228
|
+
# Check growth
|
|
229
|
+
if @visited.count % 10_000 == 0
|
|
230
|
+
puts "Crawled #{@visited.count} URLs, #{@visited.num_layers} layers"
|
|
231
|
+
end
|
|
156
232
|
end
|
|
157
233
|
end
|
|
158
234
|
```
|
|
@@ -162,7 +238,7 @@ end
|
|
|
162
238
|
```ruby
|
|
163
239
|
class CacheWarmer
|
|
164
240
|
def initialize
|
|
165
|
-
@warmed = FastBloomFilter::Filter.new(
|
|
241
|
+
@warmed = FastBloomFilter::Filter.new(error_rate: 0.001)
|
|
166
242
|
end
|
|
167
243
|
|
|
168
244
|
def warm(key)
|
|
@@ -174,27 +250,31 @@ class CacheWarmer
|
|
|
174
250
|
end
|
|
175
251
|
```
|
|
176
252
|
|
|
177
|
-
##
|
|
178
|
-
|
|
179
|
-
A Bloom Filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set:
|
|
253
|
+
## Migration from v1.x
|
|
180
254
|
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
|
|
255
|
+
**v1.x (Fixed Capacity):**
|
|
256
|
+
```ruby
|
|
257
|
+
bloom = FastBloomFilter::Filter.new(10_000, 0.01)
|
|
258
|
+
bloom = FastBloomFilter.for_emails(100_000)
|
|
259
|
+
```
|
|
185
260
|
|
|
186
|
-
|
|
261
|
+
**v2.x (Scalable):**
|
|
262
|
+
```ruby
|
|
263
|
+
# Recommended: Let it scale automatically
|
|
264
|
+
bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
|
|
187
265
|
|
|
188
|
-
|
|
189
|
-
|
|
266
|
+
# Or with initial capacity hint
|
|
267
|
+
bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
|
|
190
268
|
|
|
191
|
-
|
|
269
|
+
# Helper methods also changed
|
|
270
|
+
bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
|
|
271
|
+
```
|
|
192
272
|
|
|
193
273
|
## Development
|
|
194
274
|
|
|
195
275
|
```bash
|
|
196
276
|
# Clone the repository
|
|
197
|
-
git clone https://github.com/
|
|
277
|
+
git clone https://github.com/roman-haidarov/fast_bloom_filter.git
|
|
198
278
|
cd fast_bloom_filter
|
|
199
279
|
|
|
200
280
|
# Install dependencies
|
|
@@ -210,7 +290,7 @@ bundle exec rake test
|
|
|
210
290
|
gem build fast_bloom_filter.gemspec
|
|
211
291
|
|
|
212
292
|
# Install locally
|
|
213
|
-
gem install ./fast_bloom_filter-
|
|
293
|
+
gem install ./fast_bloom_filter-2.0.0.gem
|
|
214
294
|
```
|
|
215
295
|
|
|
216
296
|
### Quick Build Script
|
|
@@ -225,6 +305,15 @@ gem install ./fast_bloom_filter-1.0.0.gem
|
|
|
225
305
|
- C compiler (gcc, clang, etc.)
|
|
226
306
|
- Make
|
|
227
307
|
|
|
308
|
+
## Technical Details
|
|
309
|
+
|
|
310
|
+
- **Hash Function**: MurmurHash3 (32-bit)
|
|
311
|
+
- **Bit Array**: Dynamic allocation per layer
|
|
312
|
+
- **Growth Strategy**: Adaptive (2x → 1.75x → 1.5x → 1.25x)
|
|
313
|
+
- **Tightening Factor**: 0.85 (configurable)
|
|
314
|
+
- **Memory Management**: Ruby GC integration with proper cleanup
|
|
315
|
+
- **Thread Safety**: Safe for concurrent reads (writes need external synchronization)
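
Since writes need external synchronization, one option is a thin wrapper that guards the filter with a `Mutex`. The sketch below is an assumption about how an application might do this, not something the gem ships: `SynchronizedFilter` is a hypothetical class, and a `Set` stands in for `FastBloomFilter::Filter` so the example runs without the gem installed.

```ruby
require "set"

# Wraps any object responding to #add and #include? and serializes access.
class SynchronizedFilter
  def initialize(filter)
    @filter = filter
    @lock = Mutex.new
  end

  def add(item)
    @lock.synchronize { @filter.add(item) }
    self
  end
  alias << add

  # Reads take the lock too, because a concurrent write may be adding a layer.
  def include?(item)
    @lock.synchronize { @filter.include?(item) }
  end
end

# Set is only a stand-in here; pass a real filter instance in an application.
shared = SynchronizedFilter.new(Set.new)
threads = 4.times.map do |t|
  Thread.new { 100.times { |i| shared.add("worker#{t}-item#{i}") } }
end
threads.each(&:join)
puts shared.include?("worker0-item0")
```

If the workload is read-only after an initial load, the lock can be dropped for lookups, which matches the "safe for concurrent reads" guarantee stated above.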
|
|
316
|
+
|
|
228
317
|
## Contributing
|
|
229
318
|
|
|
230
319
|
1. Fork it
|
|
@@ -239,14 +328,15 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
|
|
|
239
328
|
|
|
240
329
|
## Credits
|
|
241
330
|
|
|
242
|
-
-
|
|
243
|
-
-
|
|
331
|
+
- Scalable Bloom Filters algorithm: Almeida, Baquero, Preguiça, Hutchison (2007)
|
|
332
|
+
- MurmurHash3 implementation: Austin Appleby
|
|
333
|
+
- Original Bloom Filter: Burton Howard Bloom (1970)
|
|
244
334
|
|
|
245
335
|
## Support
|
|
246
336
|
|
|
247
|
-
- 🐛 [Report bugs](https://github.com/
|
|
248
|
-
- 💡 [Request features](https://github.com/
|
|
249
|
-
- 📖 [Documentation](https://github.com/
|
|
337
|
+
- 🐛 [Report bugs](https://github.com/roman-haidarov/fast_bloom_filter/issues)
|
|
338
|
+
- 💡 [Request features](https://github.com/roman-haidarov/fast_bloom_filter/issues)
|
|
339
|
+
- 📖 [Documentation](https://github.com/roman-haidarov/fast_bloom_filter)
|
|
250
340
|
|
|
251
341
|
## Changelog
|
|
252
342
|
|
|
data/ext/fast_bloom_filter/fast_bloom_filter.c
CHANGED
|
@@ -1,6 +1,16 @@
|
|
|
1
1
|
/*
|
|
2
|
-
* FastBloomFilter
|
|
3
|
-
* Copyright (c)
|
|
2
|
+
* FastBloomFilter v2 - Scalable Bloom Filter implementation for Ruby
|
|
3
|
+
* Copyright (c) 2026
|
|
4
|
+
*
|
|
5
|
+
* Based on: "Scalable Bloom Filters" (Almeida et al., 2007)
|
|
6
|
+
*
|
|
7
|
+
* Instead of requiring upfront capacity, the filter grows automatically
|
|
8
|
+
* by adding new layers when the current one fills up. Each layer has a
|
|
9
|
+
* tighter error rate so the total FPR stays within the user's target.
|
|
10
|
+
*
|
|
11
|
+
* Growth factor starts at 2x and gradually decreases (like Go slices).
|
|
12
|
+
*
|
|
13
|
+
* Compatible with Ruby >= 2.7
|
|
4
14
|
*/
|
|
5
15
|
|
|
6
16
|
#include <ruby.h>
|
|
@@ -9,47 +19,66 @@
|
|
|
9
19
|
#include <stdlib.h>
|
|
10
20
|
#include <math.h>
|
|
11
21
|
|
|
12
|
-
/*
|
|
22
|
+
/* ------------------------------------------------------------------ */
|
|
23
|
+
/* Single Bloom Filter layer */
|
|
24
|
+
/* ------------------------------------------------------------------ */
|
|
25
|
+
|
|
13
26
|
typedef struct {
|
|
14
|
-
uint8_t *bits;
|
|
15
|
-
size_t
|
|
16
|
-
size_t
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
27
|
+
uint8_t *bits;
|
|
28
|
+
size_t size; /* bytes */
|
|
29
|
+
size_t capacity; /* max elements for this layer */
|
|
30
|
+
size_t count; /* elements inserted so far */
|
|
31
|
+
int num_hashes;
|
|
32
|
+
} BloomLayer;
|
|
33
|
+
|
|
34
|
+
/* ------------------------------------------------------------------ */
|
|
35
|
+
/* Scalable Bloom Filter (chain of layers) */
|
|
36
|
+
/* ------------------------------------------------------------------ */
|
|
37
|
+
|
|
38
|
+
typedef struct {
|
|
39
|
+
BloomLayer **layers;
|
|
40
|
+
size_t num_layers;
|
|
41
|
+
size_t layers_cap; /* allocated slots in layers[] */
|
|
42
|
+
|
|
43
|
+
double error_rate; /* user-requested total FPR */
|
|
44
|
+
double tightening; /* r — each layer multiplies FPR by this */
|
|
45
|
+
size_t initial_capacity;
|
|
46
|
+
|
|
47
|
+
size_t total_count; /* elements across all layers */
|
|
48
|
+
} ScalableBloom;
|
|
49
|
+
|
|
50
|
+
/* ------------------------------------------------------------------ */
|
|
51
|
+
/* Constants */
|
|
52
|
+
/* ------------------------------------------------------------------ */
|
|
28
53
|
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
54
|
+
#define DEFAULT_ERROR_RATE 0.01
|
|
55
|
+
#define DEFAULT_INITIAL_CAP 8192
|
|
56
|
+
#define DEFAULT_TIGHTENING 0.85
|
|
57
|
+
#define FILL_RATIO_THRESHOLD 0.5
|
|
58
|
+
#define MAX_HASHES 20
|
|
59
|
+
#define MIN_HASHES 1
|
|
60
|
+
|
|
61
|
+
/* Growth factor: starts at ~2x, approaches 1.25x for large filters.
|
|
62
|
+
* Formula mirrors Go's slice growth strategy. */
|
|
63
|
+
static double growth_factor(size_t num_layers) {
|
|
64
|
+
if (num_layers < 4) return 2.0;
|
|
65
|
+
if (num_layers < 8) return 1.75;
|
|
66
|
+
if (num_layers < 12) return 1.5;
|
|
67
|
+
return 1.25;
|
|
33
68
|
}
|
|
34
69
|
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
NULL, NULL,
|
|
39
|
-
RUBY_TYPED_FREE_IMMEDIATELY
|
|
40
|
-
};
|
|
70
|
+
/* ------------------------------------------------------------------ */
|
|
71
|
+
/* MurmurHash3 — 32-bit (unchanged from v1) */
|
|
72
|
+
/* ------------------------------------------------------------------ */
|
|
41
73
|
|
|
42
|
-
/*
|
|
43
|
-
* MurmurHash3 32-bit implementation
|
|
44
|
-
*/
|
|
45
74
|
static uint32_t murmur3_32(const uint8_t *key, size_t len, uint32_t seed) {
|
|
46
75
|
uint32_t h = seed;
|
|
47
76
|
const uint32_t c1 = 0xcc9e2d51;
|
|
48
77
|
const uint32_t c2 = 0x1b873593;
|
|
49
|
-
|
|
78
|
+
|
|
50
79
|
const int nblocks = len / 4;
|
|
51
80
|
const uint32_t *blocks = (const uint32_t *)(key);
|
|
52
|
-
|
|
81
|
+
|
|
53
82
|
for (int i = 0; i < nblocks; i++) {
|
|
54
83
|
uint32_t k1 = blocks[i];
|
|
55
84
|
k1 *= c1;
|
|
@@ -59,213 +88,464 @@ static uint32_t murmur3_32(const uint8_t *key, size_t len, uint32_t seed) {
|
|
|
59
88
|
h = (h << 13) | (h >> 19);
|
|
60
89
|
h = h * 5 + 0xe6546b64;
|
|
61
90
|
}
|
|
62
|
-
|
|
91
|
+
|
|
63
92
|
const uint8_t *tail = (const uint8_t *)(key + nblocks * 4);
|
|
64
93
|
uint32_t k1 = 0;
|
|
65
|
-
|
|
94
|
+
|
|
66
95
|
switch (len & 3) {
|
|
67
|
-
case 3: k1 ^= tail[2] << 16;
|
|
68
|
-
case 2: k1 ^= tail[1] << 8;
|
|
96
|
+
case 3: k1 ^= tail[2] << 16; /* fall through */
|
|
97
|
+
case 2: k1 ^= tail[1] << 8; /* fall through */
|
|
69
98
|
case 1: k1 ^= tail[0];
|
|
70
99
|
k1 *= c1;
|
|
71
100
|
k1 = (k1 << 15) | (k1 >> 17);
|
|
72
101
|
k1 *= c2;
|
|
73
102
|
h ^= k1;
|
|
74
103
|
}
|
|
75
|
-
|
|
104
|
+
|
|
76
105
|
h ^= len;
|
|
77
106
|
h ^= h >> 16;
|
|
78
107
|
h *= 0x85ebca6b;
|
|
79
108
|
h ^= h >> 13;
|
|
80
109
|
h *= 0xc2b2ae35;
|
|
81
110
|
h ^= h >> 16;
|
|
82
|
-
|
|
111
|
+
|
|
83
112
|
return h;
|
|
84
113
|
}
|
|
85
114
|
|
|
86
|
-
/*
|
|
115
|
+
/* ------------------------------------------------------------------ */
|
|
116
|
+
/* Bit helpers */
|
|
117
|
+
/* ------------------------------------------------------------------ */
|
|
118
|
+
|
|
87
119
|
static inline void set_bit(uint8_t *bits, size_t pos) {
|
|
88
120
|
bits[pos / 8] |= (1 << (pos % 8));
|
|
89
121
|
}
|
|
90
122
|
|
|
91
|
-
/* Get bit at position */
|
|
92
123
|
static inline int get_bit(const uint8_t *bits, size_t pos) {
|
|
93
124
|
return (bits[pos / 8] & (1 << (pos % 8))) != 0;
|
|
94
125
|
}
|
|
95
126
|
|
|
96
|
-
/*
|
|
127
|
+
/* ------------------------------------------------------------------ */
|
|
128
|
+
/* Layer lifecycle */
|
|
129
|
+
/* ------------------------------------------------------------------ */
|
|
130
|
+
|
|
131
|
+
static BloomLayer *layer_create(size_t capacity, double error_rate) {
|
|
132
|
+
BloomLayer *layer = (BloomLayer *)calloc(1, sizeof(BloomLayer));
|
|
133
|
+
if (!layer) return NULL;
|
|
134
|
+
|
|
135
|
+
double ln2 = 0.693147180559945309417;
|
|
136
|
+
double ln2_sq = ln2 * ln2;
|
|
137
|
+
|
|
138
|
+
size_t bits_count = (size_t)(-(double)capacity * log(error_rate) / ln2_sq);
|
|
139
|
+
if (bits_count < 64) bits_count = 64; /* sane minimum */
|
|
140
|
+
|
|
141
|
+
layer->size = (bits_count + 7) / 8;
|
|
142
|
+
layer->capacity = capacity;
|
|
143
|
+
layer->count = 0;
|
|
144
|
+
layer->num_hashes = (int)((bits_count / (double)capacity) * ln2);
|
|
145
|
+
|
|
146
|
+
if (layer->num_hashes < MIN_HASHES) layer->num_hashes = MIN_HASHES;
|
|
147
|
+
if (layer->num_hashes > MAX_HASHES) layer->num_hashes = MAX_HASHES;
|
|
148
|
+
|
|
149
|
+
layer->bits = (uint8_t *)calloc(layer->size, sizeof(uint8_t));
|
|
150
|
+
if (!layer->bits) {
|
|
151
|
+
free(layer);
|
|
152
|
+
return NULL;
|
|
153
|
+
}
|
|
154
|
+
|
|
155
|
+
return layer;
|
|
156
|
+
}
|
|
157
|
+
|
|
158
|
+
static void layer_free(BloomLayer *layer) {
|
|
159
|
+
if (layer) {
|
|
160
|
+
free(layer->bits);
|
|
161
|
+
free(layer);
|
|
162
|
+
}
|
|
163
|
+
}
|
|
164
|
+
|
|
165
|
+
static inline int layer_is_full(const BloomLayer *layer) {
|
|
166
|
+
return layer->count >= layer->capacity;
|
|
167
|
+
}
|
|
168
|
+
|
|
169
|
+
static void layer_add(BloomLayer *layer, const char *data, size_t len) {
|
|
170
|
+
size_t bits_count = layer->size * 8;
|
|
171
|
+
|
|
172
|
+
/* Kirsch–Mitzenmacher: 2 hashes instead of k */
|
|
173
|
+
uint32_t h1 = murmur3_32((const uint8_t *)data, len, 0x9747b28c);
|
|
174
|
+
uint32_t h2 = murmur3_32((const uint8_t *)data, len, 0x5bd1e995);
|
|
175
|
+
|
|
176
|
+
for (int i = 0; i < layer->num_hashes; i++) {
|
|
177
|
+
uint32_t combined = h1 + (uint32_t)i * h2;
|
|
178
|
+
set_bit(layer->bits, combined % bits_count);
|
|
179
|
+
}
|
|
180
|
+
layer->count++;
|
|
181
|
+
}
|
|
182
|
+
|
|
183
|
+
static int layer_include(const BloomLayer *layer, const char *data, size_t len) {
|
|
184
|
+
size_t bits_count = layer->size * 8;
|
|
185
|
+
|
|
186
|
+
/* Kirsch–Mitzenmacher: 2 hashes instead of k */
|
|
187
|
+
uint32_t h1 = murmur3_32((const uint8_t *)data, len, 0x9747b28c);
|
|
188
|
+
uint32_t h2 = murmur3_32((const uint8_t *)data, len, 0x5bd1e995);
|
|
189
|
+
|
|
190
|
+
for (int i = 0; i < layer->num_hashes; i++) {
|
|
191
|
+
uint32_t combined = h1 + (uint32_t)i * h2;
|
|
192
|
+
if (!get_bit(layer->bits, combined % bits_count))
|
|
193
|
+
return 0;
|
|
194
|
+
}
|
|
195
|
+
return 1;
|
|
196
|
+
}
|
|
197
|
+
|
|
198
|
+
static size_t layer_bits_set(const BloomLayer *layer) {
|
|
199
|
+
size_t count = 0;
|
|
200
|
+
for (size_t i = 0; i < layer->size; i++) {
|
|
201
|
+
uint8_t b = layer->bits[i];
|
|
202
|
+
while (b) { count += b & 1; b >>= 1; }
|
|
203
|
+
}
|
|
204
|
+
return count;
|
|
205
|
+
}
|
|
206
|
+
|
|
207
|
+
/* ------------------------------------------------------------------ */
|
|
208
|
+
/* Scalable filter helpers */
|
|
209
|
+
/* ------------------------------------------------------------------ */
|
|
210
|
+
|
|
211
|
+
/* Error rate for the i-th layer (0-indexed):
|
|
212
|
+
* layer_fpr(i) = error_rate * (1 - r) * r^i
|
|
213
|
+
* Sum converges to error_rate. */
|
|
214
|
+
static double layer_error_rate(double total_fpr, double r, size_t index) {
|
|
215
|
+
return total_fpr * (1.0 - r) * pow(r, (double)index);
|
|
216
|
+
}
|
|
217
|
+
|
|
218
|
+
static BloomLayer *scalable_add_layer(ScalableBloom *sb) {
|
|
219
|
+
size_t new_cap;
|
|
220
|
+
if (sb->num_layers == 0) {
|
|
221
|
+
new_cap = sb->initial_capacity;
|
|
222
|
+
} else {
|
|
223
|
+
double gf = growth_factor(sb->num_layers);
|
|
224
|
+
new_cap = (size_t)(sb->layers[sb->num_layers - 1]->capacity * gf);
|
|
225
|
+
}
|
|
226
|
+
|
|
227
|
+
double fpr = layer_error_rate(sb->error_rate, sb->tightening, sb->num_layers);
|
|
228
|
+
if (fpr < 1e-15) fpr = 1e-15; /* floor to avoid log(0) */
|
|
229
|
+
|
|
230
|
+
BloomLayer *layer = layer_create(new_cap, fpr);
|
|
231
|
+
if (!layer) return NULL;
|
|
232
|
+
|
|
233
|
+
/* Grow layers array if needed */
|
|
234
|
+
if (sb->num_layers >= sb->layers_cap) {
|
|
235
|
+
size_t new_slots = sb->layers_cap == 0 ? 4 : sb->layers_cap * 2;
|
|
236
|
+
BloomLayer **tmp = (BloomLayer **)realloc(sb->layers,
|
|
237
|
+
new_slots * sizeof(BloomLayer *));
|
|
238
|
+
if (!tmp) { layer_free(layer); return NULL; }
|
|
239
|
+
sb->layers = tmp;
|
|
240
|
+
sb->layers_cap = new_slots;
|
|
241
|
+
}
|
|
242
|
+
|
|
243
|
+
sb->layers[sb->num_layers++] = layer;
|
|
244
|
+
return layer;
|
|
245
|
+
}
|
|
246
|
+
|
|
247
|
+
/* ------------------------------------------------------------------ */
|
|
248
|
+
/* Ruby GC integration */
|
|
249
|
+
/* ------------------------------------------------------------------ */
|
|
250
|
+
|
|
251
|
+
static void bloom_free_scalable(void *ptr) {
|
|
252
|
+
ScalableBloom *sb = (ScalableBloom *)ptr;
|
|
253
|
+
for (size_t i = 0; i < sb->num_layers; i++) {
|
|
254
|
+
layer_free(sb->layers[i]);
|
|
255
|
+
}
|
|
256
|
+
free(sb->layers);
|
|
257
|
+
free(sb);
|
|
258
|
+
}
|
|
259
|
+
|
|
260
|
+
static size_t bloom_memsize_scalable(const void *ptr) {
|
|
261
|
+
const ScalableBloom *sb = (const ScalableBloom *)ptr;
|
|
262
|
+
size_t total = sizeof(ScalableBloom);
|
|
263
|
+
total += sb->layers_cap * sizeof(BloomLayer *);
|
|
264
|
+
for (size_t i = 0; i < sb->num_layers; i++) {
|
|
265
|
+
total += sizeof(BloomLayer) + sb->layers[i]->size;
|
|
266
|
+
}
|
|
267
|
+
return total;
|
|
268
|
+
}
|
|
269
|
+
|
|
270
|
+
static const rb_data_type_t scalable_bloom_type = {
|
|
271
|
+
"ScalableBloomFilter",
|
|
272
|
+
{NULL, bloom_free_scalable, bloom_memsize_scalable},
|
|
273
|
+
NULL, NULL,
|
|
274
|
+
RUBY_TYPED_FREE_IMMEDIATELY
|
|
275
|
+
};
|
|
276
|
+
|
|
277
|
+
/* ------------------------------------------------------------------ */
|
|
278
|
+
/* Ruby methods */
|
|
279
|
+
/* ------------------------------------------------------------------ */
|
|
280
|
+
|
|
97
281
|
static VALUE bloom_alloc(VALUE klass) {
|
|
98
|
-
|
|
99
|
-
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
bloom->num_hashes = 0;
|
|
103
|
-
|
|
104
|
-
return TypedData_Wrap_Struct(klass, &bloom_type, bloom);
|
|
282
|
+
ScalableBloom *sb = (ScalableBloom *)calloc(1, sizeof(ScalableBloom));
|
|
283
|
+
if (!sb) rb_raise(rb_eNoMemError, "failed to allocate ScalableBloom");
|
|
284
|
+
|
|
285
|
+
return TypedData_Wrap_Struct(klass, &scalable_bloom_type, sb);
|
|
105
286
|
}
|
|
106
287
|
|
|
107
288
|
/*
|
|
108
|
-
*
|
|
109
|
-
*
|
|
110
|
-
*
|
|
111
|
-
*
|
|
289
|
+
* call-seq:
|
|
290
|
+
* Filter.new # defaults: error_rate 0.01, initial_capacity 1024
|
|
291
|
+
* Filter.new(error_rate: 0.001)
|
|
292
|
+
* Filter.new(error_rate: 0.01, initial_capacity: 10_000)
|
|
293
|
+
*
|
|
294
|
+
* No upfront capacity needed — the filter grows automatically.
|
|
295
|
+
*
|
|
296
|
+
* Ruby 2.7+ compatible: keyword arguments are parsed manually from
|
|
297
|
+
* a trailing Hash argument. The rb_scan_args ":" format requires
|
|
298
|
+
* Ruby 3.2+, so we handle it ourselves for broad compatibility.
|
|
112
299
|
*/
|
|
113
300
|
static VALUE bloom_initialize(int argc, VALUE *argv, VALUE self) {
|
|
114
|
-
VALUE
|
|
115
|
-
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
|
|
121
|
-
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
|
|
125
|
-
rb_raise(rb_eArgError, "error_rate must be between 0 and 1");
|
|
301
|
+
VALUE opts = Qnil;
|
|
302
|
+
|
|
303
|
+
if (argc == 0) {
|
|
304
|
+
/* Filter.new — all defaults */
|
|
305
|
+
} else if (argc == 1 && RB_TYPE_P(argv[0], T_HASH)) {
|
|
306
|
+
/* Filter.new(error_rate: 0.01, ...) — keyword args as hash */
|
|
307
|
+
opts = argv[0];
|
|
308
|
+
} else {
|
|
309
|
+
rb_raise(rb_eArgError,
|
|
310
|
+
"wrong number of arguments (given %d, expected 0 or keyword arguments)",
|
|
311
|
+
argc);
|
|
126
312
|
}
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
bloom->bits = (uint8_t *)calloc(bloom->size, sizeof(uint8_t));
|
|
144
|
-
if (!bloom->bits) {
|
|
145
|
-
rb_raise(rb_eNoMemError, "failed to allocate memory");
|
|
313
|
+
|
|
314
|
+
double error_rate = DEFAULT_ERROR_RATE;
|
|
315
|
+
size_t initial_capacity = DEFAULT_INITIAL_CAP;
|
|
316
|
+
double tightening = DEFAULT_TIGHTENING;
|
|
317
|
+
|
|
318
|
+
if (!NIL_P(opts)) {
|
|
319
|
+
VALUE v;
|
|
320
|
+
|
|
321
|
+
v = rb_hash_aref(opts, ID2SYM(rb_intern("error_rate")));
|
|
322
|
+
if (!NIL_P(v)) error_rate = NUM2DBL(v);
|
|
323
|
+
|
|
324
|
+
v = rb_hash_aref(opts, ID2SYM(rb_intern("initial_capacity")));
|
|
325
|
+
if (!NIL_P(v)) initial_capacity = (size_t)NUM2LONG(v);
|
|
326
|
+
|
|
327
|
+
v = rb_hash_aref(opts, ID2SYM(rb_intern("tightening")));
|
|
328
|
+
if (!NIL_P(v)) tightening = NUM2DBL(v);
|
|
146
329
|
}
|
|
147
|
-
|
|
330
|
+
|
|
331
|
+
if (error_rate <= 0 || error_rate >= 1)
|
|
332
|
+
rb_raise(rb_eArgError, "error_rate must be between 0 and 1 (exclusive)");
|
|
333
|
+
if (initial_capacity == 0)
|
|
334
|
+
rb_raise(rb_eArgError, "initial_capacity must be positive");
|
|
335
|
+
if (tightening <= 0 || tightening >= 1)
|
|
336
|
+
rb_raise(rb_eArgError, "tightening must be between 0 and 1 (exclusive)");
|
|
337
|
+
|
|
338
|
+
ScalableBloom *sb;
|
|
339
|
+
TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
|
|
340
|
+
|
|
341
|
+
sb->error_rate = error_rate;
|
|
342
|
+
sb->initial_capacity = initial_capacity;
|
|
343
|
+
sb->tightening = tightening;
|
|
344
|
+
sb->total_count = 0;
|
|
345
|
+
|
|
346
|
+
/* Create first layer */
|
|
347
|
+
if (!scalable_add_layer(sb))
|
|
348
|
+
rb_raise(rb_eNoMemError, "failed to allocate initial layer");
|
|
349
|
+
|
|
148
350
|
return self;
|
|
149
351
|
}
|
|
150
352
|
|
|
151
353
|
/*
|
|
152
|
-
*
|
|
354
|
+
* call-seq:
|
|
355
|
+
* filter.add("element")
|
|
356
|
+
* filter << "element"
|
|
153
357
|
*/
|
|
154
358
|
static VALUE bloom_add(VALUE self, VALUE str) {
|
|
155
|
-
|
|
156
|
-
TypedData_Get_Struct(self,
|
|
157
|
-
|
|
359
|
+
ScalableBloom *sb;
|
|
360
|
+
TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
|
|
361
|
+
|
|
158
362
|
Check_Type(str, T_STRING);
|
|
159
|
-
|
|
160
|
-
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
167
|
-
set_bit(bloom->bits, pos);
|
|
363
|
+
|
|
364
|
+
BloomLayer *active = sb->layers[sb->num_layers - 1];
|
|
365
|
+
|
|
366
|
+
/* Grow if current layer is full */
|
|
367
|
+
if (layer_is_full(active)) {
|
|
368
|
+
active = scalable_add_layer(sb);
|
|
369
|
+
if (!active)
|
|
370
|
+
rb_raise(rb_eNoMemError, "failed to allocate new layer");
|
|
168
371
|
}
|
|
169
|
-
|
|
372
|
+
|
|
373
|
+
layer_add(active, RSTRING_PTR(str), RSTRING_LEN(str));
|
|
374
|
+
sb->total_count++;
|
|
375
|
+
|
|
170
376
|
return Qtrue;
|
|
171
377
|
}
|
|
172
378
|
|
|
173
379
|
/*
|
|
174
|
-
*
|
|
380
|
+
* call-seq:
|
|
381
|
+
* filter.include?("element") #=> true / false
|
|
382
|
+
* filter.member?("element") #=> true / false
|
|
383
|
+
*
|
|
384
|
+
* Checks all layers. Returns true if ANY layer says "possibly yes".
|
|
175
385
|
*/
|
|
176
386
|
static VALUE bloom_include(VALUE self, VALUE str) {
|
|
177
|
-
|
|
178
|
-
TypedData_Get_Struct(self,
|
|
179
|
-
|
|
387
|
+
ScalableBloom *sb;
|
|
388
|
+
TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
|
|
389
|
+
|
|
180
390
|
Check_Type(str, T_STRING);
|
|
181
|
-
|
|
391
|
+
|
|
182
392
|
const char *data = RSTRING_PTR(str);
|
|
183
|
-
size_t len
|
|
184
|
-
|
|
185
|
-
|
|
186
|
-
for (
|
|
187
|
-
|
|
188
|
-
|
|
189
|
-
if (!get_bit(bloom->bits, pos)) {
|
|
190
|
-
return Qfalse;
|
|
191
|
-
}
|
|
393
|
+
size_t len = RSTRING_LEN(str);
|
|
394
|
+
|
|
395
|
+
/* Check from newest to oldest — most elements are in recent layers */
|
|
396
|
+
for (size_t i = sb->num_layers; i > 0; i--) {
|
|
397
|
+
if (layer_include(sb->layers[i - 1], data, len))
|
|
398
|
+
return Qtrue;
|
|
192
399
|
}
|
|
193
|
-
|
|
194
|
-
return
|
|
400
|
+
|
|
401
|
+
return Qfalse;
|
|
195
402
|
}
|
|
196
403
|
|
|
197
404
|
/*
|
|
198
|
-
*
|
|
405
|
+
* Reset all layers, keep only one fresh layer.
|
|
199
406
|
*/
|
|
200
407
|
static VALUE bloom_clear(VALUE self) {
|
|
201
|
-
|
|
202
|
-
TypedData_Get_Struct(self,
|
|
203
|
-
|
|
204
|
-
|
|
408
|
+
ScalableBloom *sb;
|
|
409
|
+
TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
|
|
410
|
+
|
|
411
|
+
for (size_t i = 0; i < sb->num_layers; i++) {
|
|
412
|
+
layer_free(sb->layers[i]);
|
|
413
|
+
}
|
|
414
|
+
sb->num_layers = 0;
|
|
415
|
+
sb->total_count = 0;
|
|
416
|
+
|
|
417
|
+
if (!scalable_add_layer(sb))
|
|
418
|
+
rb_raise(rb_eNoMemError, "failed to allocate layer after clear");
|
|
419
|
+
|
|
205
420
|
return Qnil;
|
|
206
421
|
}
|
|
207
422
|
|
|
208
423
|
/*
|
|
209
|
-
*
|
|
424
|
+
* Detailed statistics for the whole filter and each layer.
|
|
210
425
|
*/
|
|
211
426
|
static VALUE bloom_stats(VALUE self) {
|
|
212
|
-
|
|
213
|
-
TypedData_Get_Struct(self,
|
|
214
|
-
|
|
215
|
-
size_t
|
|
216
|
-
size_t total_bits
|
|
217
|
-
|
|
218
|
-
|
|
219
|
-
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
223
|
-
|
|
427
|
+
ScalableBloom *sb;
|
|
428
|
+
TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
|
|
429
|
+
|
|
430
|
+
size_t total_bytes = 0;
|
|
431
|
+
size_t total_bits = 0;
|
|
432
|
+
size_t total_bits_set = 0;
|
|
433
|
+
|
|
434
|
+
VALUE layers_ary = rb_ary_new_capa((long)sb->num_layers);
|
|
435
|
+
|
|
436
|
+
for (size_t i = 0; i < sb->num_layers; i++) {
|
|
437
|
+
BloomLayer *l = sb->layers[i];
|
|
438
|
+
size_t bs = layer_bits_set(l);
|
|
439
|
+
size_t tb = l->size * 8;
|
|
440
|
+
|
|
441
|
+
total_bytes += l->size;
|
|
442
|
+
total_bits += tb;
|
|
443
|
+
total_bits_set += bs;
|
|
444
|
+
|
|
445
|
+
VALUE lh = rb_hash_new();
|
|
446
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("layer")), LONG2NUM(i));
|
|
447
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("capacity")), LONG2NUM(l->capacity));
|
|
448
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("count")), LONG2NUM(l->count));
|
|
449
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("size_bytes")), LONG2NUM(l->size));
|
|
450
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("num_hashes")), INT2NUM(l->num_hashes));
|
|
451
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("bits_set")), LONG2NUM(bs));
|
|
452
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("total_bits")), LONG2NUM(tb));
|
|
453
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("fill_ratio")), DBL2NUM((double)bs / tb));
|
|
454
|
+
rb_hash_aset(lh, ID2SYM(rb_intern("error_rate")),
|
|
455
|
+
DBL2NUM(layer_error_rate(sb->error_rate, sb->tightening, i)));
|
|
456
|
+
|
|
457
|
+
rb_ary_push(layers_ary, lh);
|
|
224
458
|
}
|
|
225
|
-
|
|
226
|
-
double fill_ratio = (double)bits_set / total_bits;
|
|
227
|
-
|
|
459
|
+
|
|
228
460
|
VALUE hash = rb_hash_new();
|
|
229
|
-
rb_hash_aset(hash, ID2SYM(rb_intern("
|
|
230
|
-
rb_hash_aset(hash, ID2SYM(rb_intern("
|
|
231
|
-
rb_hash_aset(hash, ID2SYM(rb_intern("
|
|
232
|
-
rb_hash_aset(hash, ID2SYM(rb_intern("
|
|
233
|
-
rb_hash_aset(hash, ID2SYM(rb_intern("
|
|
234
|
-
rb_hash_aset(hash, ID2SYM(rb_intern("fill_ratio")),
|
|
235
|
-
|
|
461
|
+
rb_hash_aset(hash, ID2SYM(rb_intern("total_count")), LONG2NUM(sb->total_count));
|
|
462
|
+
rb_hash_aset(hash, ID2SYM(rb_intern("num_layers")), LONG2NUM(sb->num_layers));
|
|
463
|
+
rb_hash_aset(hash, ID2SYM(rb_intern("total_bytes")), LONG2NUM(total_bytes));
|
|
464
|
+
rb_hash_aset(hash, ID2SYM(rb_intern("total_bits")), LONG2NUM(total_bits));
|
|
465
|
+
rb_hash_aset(hash, ID2SYM(rb_intern("total_bits_set")), LONG2NUM(total_bits_set));
|
|
466
|
+
rb_hash_aset(hash, ID2SYM(rb_intern("fill_ratio")), DBL2NUM((double)total_bits_set / total_bits));
|
|
467
|
+
rb_hash_aset(hash, ID2SYM(rb_intern("error_rate")), DBL2NUM(sb->error_rate));
|
|
468
|
+
rb_hash_aset(hash, ID2SYM(rb_intern("layers")), layers_ary);
|
|
469
|
+
|
|
236
470
|
return hash;
|
|
237
471
|
}
|
|
238
472
|
|
|
239
473
|
/*
|
|
240
|
-
*
|
|
474
|
+
* Number of elements inserted.
|
|
475
|
+
*/
|
|
476
|
+
static VALUE bloom_count(VALUE self) {
|
|
477
|
+
ScalableBloom *sb;
|
|
478
|
+
TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
|
|
479
|
+
return LONG2NUM(sb->total_count);
|
|
480
|
+
}
|
|
481
|
+
|
|
482
|
+
/*
|
|
483
|
+
* Number of layers currently allocated.
|
|
484
|
+
*/
|
|
485
|
+
static VALUE bloom_num_layers(VALUE self) {
|
|
486
|
+
ScalableBloom *sb;
|
|
487
|
+
TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb);
|
|
488
|
+
return LONG2NUM(sb->num_layers);
|
|
489
|
+
}
|
|
490
|
+
|
|
491
|
+
/*
|
|
492
|
+
* Merge another scalable filter into this one.
|
|
493
|
+
* Appends all layers from `other` (copies the bit arrays).
|
|
241
494
|
*/
|
|
242
495
|
static VALUE bloom_merge(VALUE self, VALUE other) {
|
|
243
|
-
|
|
244
|
-
TypedData_Get_Struct(self,
|
|
245
|
-
TypedData_Get_Struct(other,
|
|
246
|
-
|
|
247
|
-
|
|
248
|
-
|
|
249
|
-
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
|
|
496
|
+
ScalableBloom *sb1, *sb2;
|
|
497
|
+
TypedData_Get_Struct(self, ScalableBloom, &scalable_bloom_type, sb1);
|
|
498
|
+
TypedData_Get_Struct(other, ScalableBloom, &scalable_bloom_type, sb2);
|
|
499
|
+
|
|
500
|
+
for (size_t i = 0; i < sb2->num_layers; i++) {
|
|
501
|
+
BloomLayer *src = sb2->layers[i];
|
|
502
|
+
|
|
503
|
+
/* Create a copy of the layer */
|
|
504
|
+
BloomLayer *copy = (BloomLayer *)calloc(1, sizeof(BloomLayer));
|
|
505
|
+
if (!copy) rb_raise(rb_eNoMemError, "failed to allocate layer copy");
|
|
506
|
+
|
|
507
|
+
copy->size = src->size;
|
|
508
|
+
copy->capacity = src->capacity;
|
|
509
|
+
copy->count = src->count;
|
|
510
|
+
copy->num_hashes = src->num_hashes;
|
|
511
|
+
copy->bits = (uint8_t *)malloc(src->size);
|
|
512
|
+
if (!copy->bits) { free(copy); rb_raise(rb_eNoMemError, "failed to allocate bits"); }
|
|
513
|
+
memcpy(copy->bits, src->bits, src->size);
|
|
514
|
+
|
|
515
|
+
/* Append to layers array */
|
|
516
|
+
if (sb1->num_layers >= sb1->layers_cap) {
|
|
517
|
+
size_t new_slots = sb1->layers_cap == 0 ? 4 : sb1->layers_cap * 2;
|
|
518
|
+
BloomLayer **tmp = (BloomLayer **)realloc(sb1->layers,
|
|
519
|
+
new_slots * sizeof(BloomLayer *));
|
|
520
|
+
if (!tmp) { layer_free(copy); rb_raise(rb_eNoMemError, "realloc failed"); }
|
|
521
|
+
sb1->layers = tmp;
|
|
522
|
+
sb1->layers_cap = new_slots;
|
|
523
|
+
}
|
|
524
|
+
sb1->layers[sb1->num_layers++] = copy;
|
|
253
525
|
}
|
|
254
|
-
|
|
526
|
+
|
|
527
|
+
sb1->total_count += sb2->total_count;
|
|
255
528
|
return self;
|
|
256
529
|
}
|
|
257
530
|
|
|
531
|
+
/* ------------------------------------------------------------------ */
|
|
532
|
+
/* Init */
|
|
533
|
+
/* ------------------------------------------------------------------ */
|
|
534
|
+
|
|
258
535
|
void Init_fast_bloom_filter(void) {
|
|
259
536
|
VALUE mFastBloomFilter = rb_define_module("FastBloomFilter");
|
|
260
|
-
VALUE
|
|
261
|
-
|
|
262
|
-
rb_define_alloc_func(
|
|
263
|
-
rb_define_method(
|
|
264
|
-
rb_define_method(
|
|
265
|
-
rb_define_method(
|
|
266
|
-
rb_define_method(
|
|
267
|
-
rb_define_method(
|
|
268
|
-
rb_define_method(
|
|
269
|
-
rb_define_method(
|
|
270
|
-
rb_define_method(
|
|
537
|
+
VALUE cFilter = rb_define_class_under(mFastBloomFilter, "Filter", rb_cObject);
|
|
538
|
+
|
|
539
|
+
rb_define_alloc_func(cFilter, bloom_alloc);
|
|
540
|
+
rb_define_method(cFilter, "initialize", bloom_initialize, -1);
|
|
541
|
+
rb_define_method(cFilter, "add", bloom_add, 1);
|
|
542
|
+
rb_define_method(cFilter, "<<", bloom_add, 1);
|
|
543
|
+
rb_define_method(cFilter, "include?", bloom_include, 1);
|
|
544
|
+
rb_define_method(cFilter, "member?", bloom_include, 1);
|
|
545
|
+
rb_define_method(cFilter, "clear", bloom_clear, 0);
|
|
546
|
+
rb_define_method(cFilter, "stats", bloom_stats, 0);
|
|
547
|
+
rb_define_method(cFilter, "count", bloom_count, 0);
|
|
548
|
+
rb_define_method(cFilter, "size", bloom_count, 0);
|
|
549
|
+
rb_define_method(cFilter, "num_layers", bloom_num_layers, 0);
|
|
550
|
+
rb_define_method(cFilter, "merge!", bloom_merge, 1);
|
|
271
551
|
}
|
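The layered behaviour in the C diff above can be sketched in plain Ruby: validate the three options, add to the newest layer, grow when that layer is full, and look up from newest to oldest. This is only an illustration and not the gem's implementation. The class name `SketchScalableFilter`, the default option values, the capacity doubling, the per-layer error-rate formula, and the SHA-256 double hashing are all assumptions, because `scalable_add_layer`, `layer_error_rate`, and the hash function do not appear in this part of the diff.

```ruby
require "digest"

# Hypothetical pure-Ruby stand-in for the C extension. Anything marked
# "assumption" is not taken from the gem's source.
class SketchScalableFilter
  Layer = Struct.new(:capacity, :used, :num_bits, :num_hashes, :bits)

  attr_reader :total_count

  # Default values here are placeholders (assumption), not the gem's defaults.
  def initialize(error_rate: 0.01, initial_capacity: 1024, tightening: 0.85)
    # Same range checks as bloom_initialize in the diff.
    if error_rate <= 0 || error_rate >= 1
      raise ArgumentError, "error_rate must be between 0 and 1 (exclusive)"
    end
    raise ArgumentError, "initial_capacity must be positive" if initial_capacity <= 0
    if tightening <= 0 || tightening >= 1
      raise ArgumentError, "tightening must be between 0 and 1 (exclusive)"
    end

    @error_rate = error_rate
    @initial_capacity = initial_capacity
    @tightening = tightening
    @layers = []
    @total_count = 0
    add_layer # create the first layer
  end

  # Mirrors bloom_add: grow if the active layer is full, then insert.
  def add(str)
    active = @layers.last
    active = add_layer if active.used >= active.capacity
    positions(active, str).each { |pos| active.bits[pos] = true }
    active.used += 1
    @total_count += 1
    true
  end
  alias_method :<<, :add

  # Mirrors bloom_include: check layers from newest to oldest.
  def include?(str)
    @layers.reverse_each.any? do |layer|
      positions(layer, str).all? { |pos| layer.bits[pos] }
    end
  end

  def num_layers
    @layers.size
  end

  private

  def add_layer
    i = @layers.size
    capacity = @initial_capacity * (2**i)       # assumption: capacity doubles
    p_i = @error_rate * (@tightening**(i + 1))  # assumption: tightening formula
    # Standard Bloom filter sizing: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2.
    num_bits = (-capacity * Math.log(p_i) / (Math.log(2)**2)).ceil
    num_hashes = [(num_bits.to_f / capacity * Math.log(2)).round, 1].max
    layer = Layer.new(capacity, 0, num_bits, num_hashes, Array.new(num_bits, false))
    @layers << layer
    layer
  end

  # Assumption: double hashing over a SHA-256 digest, chosen for brevity.
  def positions(layer, str)
    h1, h2 = Digest::SHA256.digest(str).unpack("Q<Q<")
    (0...layer.num_hashes).map { |k| (h1 + k * h2) % layer.num_bits }
  end
end

filter = SketchScalableFilter.new(error_rate: 0.01, initial_capacity: 100)
300.times { |i| filter.add("item-#{i}") }
puts filter.num_layers
puts filter.include?("item-42")
```

As in `bloom_add`, the counter goes up on every call, so adding the same string twice counts it twice.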
data/lib/fast_bloom_filter.rb
CHANGED
|
@@ -21,30 +21,30 @@ module FastBloomFilter
|
|
|
21
21
|
items.each { |item| add(item.to_s) }
|
|
22
22
|
self
|
|
23
23
|
end
|
|
24
|
-
|
|
24
|
+
|
|
25
25
|
def count_possible_matches(items)
|
|
26
26
|
items.count { |item| include?(item.to_s) }
|
|
27
27
|
end
|
|
28
|
-
|
|
28
|
+
|
|
29
29
|
def inspect
|
|
30
30
|
s = stats
|
|
31
|
-
|
|
31
|
+
total_kb = (s[:total_bytes] / 1024.0).round(2)
|
|
32
32
|
fill_pct = (s[:fill_ratio] * 100).round(2)
|
|
33
|
-
|
|
34
|
-
"#<FastBloomFilter::Filter
|
|
35
|
-
"
|
|
33
|
+
|
|
34
|
+
"#<FastBloomFilter::Filter v2 layers=#{s[:num_layers]} " \
|
|
35
|
+
"count=#{s[:total_count]} size=#{total_kb}KB fill=#{fill_pct}%>"
|
|
36
36
|
end
|
|
37
|
-
|
|
37
|
+
|
|
38
38
|
def to_s
|
|
39
39
|
inspect
|
|
40
40
|
end
|
|
41
41
|
end
|
|
42
|
-
|
|
43
|
-
def self.for_emails(
|
|
44
|
-
Filter.new(
|
|
42
|
+
|
|
43
|
+
def self.for_emails(error_rate: 0.001, initial_capacity: 10_000)
|
|
44
|
+
Filter.new(error_rate: error_rate, initial_capacity: initial_capacity)
|
|
45
45
|
end
|
|
46
|
-
|
|
47
|
-
def self.for_urls(
|
|
48
|
-
Filter.new(
|
|
46
|
+
|
|
47
|
+
def self.for_urls(error_rate: 0.01, initial_capacity: 10_000)
|
|
48
|
+
Filter.new(error_rate: error_rate, initial_capacity: initial_capacity)
|
|
49
49
|
end
|
|
50
50
|
end
|
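The new `inspect` reads three top-level keys from the hash built by `bloom_stats`. A small sketch of that formatting, written as a standalone function so it runs without the C extension, may help when checking the output shape. The helper name `describe_filter` and the sample values are hypothetical; the hash keys and the format string come from the diff above.

```ruby
# Hypothetical helper mirroring Filter#inspect from the diff. It takes the
# stats hash as an argument instead of calling the C-backed `stats` method.
def describe_filter(stats)
  total_kb = (stats[:total_bytes] / 1024.0).round(2)
  fill_pct = (stats[:fill_ratio] * 100).round(2)

  "#<FastBloomFilter::Filter v2 layers=#{stats[:num_layers]} " \
    "count=#{stats[:total_count]} size=#{total_kb}KB fill=#{fill_pct}%>"
end

# Sample values are made up; only the keys follow bloom_stats.
sample_stats = {
  total_count: 1500,
  num_layers: 2,
  total_bytes: 4096,
  total_bits: 32_768,
  total_bits_set: 8192,
  fill_ratio: 0.25,
  error_rate: 0.01,
  layers: []
}

puts describe_filter(sample_stats)
# prints: #<FastBloomFilter::Filter v2 layers=2 count=1500 size=4.0KB fill=25.0%>
```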
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: fast_bloom_filter
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version:
|
|
4
|
+
version: 2.0.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
|
-
-
|
|
7
|
+
- Roman Haydarov
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: exe
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-02-
|
|
11
|
+
date: 2026-02-12 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: bundler
|
|
@@ -66,10 +66,10 @@ dependencies:
|
|
|
66
66
|
- - "~>"
|
|
67
67
|
- !ruby/object:Gem::Version
|
|
68
68
|
version: '5.0'
|
|
69
|
-
description: Memory-efficient
|
|
70
|
-
Set, perfect for Rails apps.
|
|
69
|
+
description: Memory-efficient scalable Bloom Filter that grows dynamically. No upfront
|
|
70
|
+
capacity needed. 20-50x less memory than Set, perfect for Rails apps.
|
|
71
71
|
email:
|
|
72
|
-
-
|
|
72
|
+
- romnhajdarov@gmail.com
|
|
73
73
|
executables: []
|
|
74
74
|
extensions:
|
|
75
75
|
- ext/fast_bloom_filter/extconf.rb
|
|
@@ -82,13 +82,13 @@ files:
|
|
|
82
82
|
- ext/fast_bloom_filter/fast_bloom_filter.c
|
|
83
83
|
- lib/fast_bloom_filter.rb
|
|
84
84
|
- lib/fast_bloom_filter/version.rb
|
|
85
|
-
homepage: https://github.com/
|
|
85
|
+
homepage: https://github.com/roman-haidarov/fast_bloom_filter
|
|
86
86
|
licenses:
|
|
87
87
|
- MIT
|
|
88
88
|
metadata:
|
|
89
|
-
homepage_uri: https://github.com/
|
|
90
|
-
source_code_uri: https://github.com/
|
|
91
|
-
changelog_uri: https://github.com/
|
|
89
|
+
homepage_uri: https://github.com/roman-haidarov/fast_bloom_filter
|
|
90
|
+
source_code_uri: https://github.com/roman-haidarov/fast_bloom_filter
|
|
91
|
+
changelog_uri: https://github.com/roman-haidarov/fast_bloom_filter/blob/main/CHANGELOG.md
|
|
92
92
|
post_install_message:
|
|
93
93
|
rdoc_options: []
|
|
94
94
|
require_paths:
|
|
@@ -107,5 +107,5 @@ requirements: []
|
|
|
107
107
|
rubygems_version: 3.4.22
|
|
108
108
|
signing_key:
|
|
109
109
|
specification_version: 4
|
|
110
|
-
summary:
|
|
110
|
+
summary: Scalable Bloom Filter in C for Ruby - grows automatically
|
|
111
111
|
test_files: []
|