RubyGems - fast_bloom_filter - Versions diffs - 1.0.0 → 2.1.0 - Mend

fast_bloom_filter 1.0.0 → 2.1.0

Files changed (7)

checksums.yaml +4 -4
data/CHANGELOG.md +75 -0
data/README.md +138 -48
data/ext/fast_bloom_filter/fast_bloom_filter.c +733 -216
data/lib/fast_bloom_filter/version.rb +1 -1
data/lib/fast_bloom_filter.rb +13 -13
metadata +12 -12

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 6821c8b6ccb023e3f803243eaed3c2612159be3eb384817f6b79b5f1a2e7de72
-  data.tar.gz: 2e0fb03a4c1ba7fb9ad43f20bc83ab6b6f33fe302dca4854af7c640a8edef5c7
+  metadata.gz: 124ed9c861897621021ba516be4389a0c5304282147406fd1d79a68264041ebf
+  data.tar.gz: 17324726d1f5eaad49a362334499d79c72ed3af924abe9d84a81023f942ac056
 SHA512:
-  metadata.gz: '09c38dcf72f4c1f5dee778099d8a7ea9b4f5fa20d865d23dc0ba7fca7f7f77fc6533f35c314906749208e47eb4dd495571d3ce0d46edbcf2b63bc461e6333e69'
-  data.tar.gz: '0729565d307f3a19811fcf0126fda7ae20776124dd6ceb41f60f03808118ff8d294124da01df7c30598092d4f223d7fc7b5b838d6f248c1a16d6d1dbe4c81d0c'
+  metadata.gz: eb1437aec23308784ebb440f46815cee965c1d530ca957d3edb97de2cc361db5987f2a73f32d10884ce744eb87b6eabe9a44b6f0cd91acbbea44d62514c35b8b
+  data.tar.gz: 776703bb0bf4b3cd6f243dfb1b87a8402e2e72f2c483396a9001f91656b975d4c44641dbddf99c41f0d98cb2a83db8e8954460787e08e0a63e5c2d787a4a2c56

data/CHANGELOG.md CHANGED

@@ -5,6 +5,81 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [2.1.0] - 2026-03-24
+### ⚡ Performance Optimizations
+### Changed
+- **Performance improvements**: Optimized C extension implementation
+- **Memory efficiency**: Improved memory usage from ~332KB to ~242KB for 100K elements (~27% reduction)
+- **Speed boost**: Add operations now consistently ~5x faster than Ruby Set (up from ~4.7x)
+- Better overall stability and performance characteristics across multiple benchmark runs
+### Technical Details
+- Enhanced C code optimization in hash functions and bit operations
+- More efficient memory allocation and management
+- Improved layer scaling algorithm for better memory utilization
+- Reduced temporary allocations during hash computation
+### Benchmarks (100K elements)
+- **Before**: Add: ~5.5ms, Memory: ~332KB
+- **After**: Add: ~5.4ms, Memory: ~242KB
+- Consistent 5x+ speedup vs Ruby Set across multiple runs
+## [2.0.0] - 2026-02-12
+### 🚀 Major Release - Scalable Bloom Filter
+This is a **breaking change** that transforms FastBloomFilter into a scalable, dynamic data structure.
+### Added
+- **Scalable Architecture**: Filter now grows automatically by adding layers
+- **No Upfront Capacity**: No need to specify capacity - just set error_rate
+- **Multi-Layer System**: Each layer has progressively tighter error rates
+- **Smart Growth Strategy**: Growth factor starts at 2x and decreases (like Go slices)
+- **Layer Statistics**: Detailed per-layer stats via `stats` method
+- **New API**: `Filter.new(error_rate: 0.01, initial_capacity: 1024)`
+- `num_layers` method to check how many layers are active
+- Enhanced `merge!` to combine filters with all their layers
+### Changed
+- **BREAKING**: Constructor now uses keyword arguments: `Filter.new(error_rate: 0.01)` instead of `Filter.new(capacity, error_rate)`
+- **BREAKING**: `stats` now returns multi-layer information with `:layers` array
+- **BREAKING**: Helper methods changed: `for_emails(error_rate: 0.001)` instead of `for_emails(capacity)`
+- Memory allocation is now dynamic and grows on-demand
+- `inspect` output now shows layer count and total elements
+### Technical Details
+- Based on "Scalable Bloom Filters" (Almeida et al., 2007)
+- Each layer uses error_rate * (1 - r) * r^i formula
+- Default tightening factor (r) = 0.85
+- Growth factors: 2x → 1.75x → 1.5x → 1.25x as layers increase
+- Layers are checked from newest to oldest for better cache locality
+### Migration Guide
+**v1.x code:**
+```ruby
+bloom = FastBloomFilter::Filter.new(10_000, 0.01)
+bloom = FastBloomFilter.for_emails(100_000)
+```
+**v2.x code:**
+```ruby
+bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
+bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
+# Or simply:
+bloom = FastBloomFilter::Filter.new(error_rate: 0.01)  # starts small, grows as needed
+```
+### Performance
+- Same O(k) complexity for add/lookup
+- Slightly higher memory overhead due to layer management
+- Better memory efficiency for unknown/growing datasets
+- No performance degradation as filter grows
+---
 ## [1.0.0] - 2026-02-09
 ### Added
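
Note on the 2.0.0 "Technical Details" entry: the per-layer formula `error_rate * (1 - r) * r^i` is a geometric series, so the layer rates sum to at most the configured `error_rate`. The sketch below is only an illustration of that arithmetic in plain Ruby; `layer_error_rate` and `cumulative_error_rate` are hypothetical helper names and are not part of the gem.

```ruby
# Per-layer error rate following the changelog's formula:
#   layer_rate(i) = error_rate * (1 - r) * r**i
# Both method names are hypothetical; the gem computes this inside its C extension.
def layer_error_rate(error_rate, layer_index, tightening: 0.85)
  error_rate * (1 - tightening) * tightening**layer_index
end

# Sum of the first `num_layers` layer rates. This is a geometric series equal to
# error_rate * (1 - r**num_layers), so it approaches error_rate from below.
def cumulative_error_rate(error_rate, num_layers, tightening: 0.85)
  (0...num_layers).sum { |i| layer_error_rate(error_rate, i, tightening: tightening) }
end

5.times do |i|
  puts format("layer %d: %.4f", i, layer_error_rate(0.01, i))
end
puts format("sum over 50 layers: %.6f", cumulative_error_rate(0.01, 50))
```

With the default `r = 0.85`, layer 0 receives 15% of the overall error budget and each later layer receives 85% of the previous layer's share.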

data/README.md CHANGED

@@ -1,17 +1,26 @@
-# FastBloomFilter
+# FastBloomFilter v2 🚀
-[![CI](https://github.com/yourusername/fast_bloom_filter/actions/workflows/ci.yml/badge.svg)](https://github.com/yourusername/fast_bloom_filter/actions/workflows/ci.yml)
 [![Gem Version](https://badge.fury.io/rb/fast_bloom_filter.svg)](https://badge.fury.io/rb/fast_bloom_filter)
-A high-performance Bloom Filter implementation in C for Ruby. Perfect for Rails applications that need memory-efficient set membership testing.
+A **scalable** Bloom Filter implementation in C for Ruby. Grows automatically without requiring upfront capacity! Perfect for Rails applications that need memory-efficient set membership testing with unknown dataset sizes.
+## What's New in v2? 🎉
+- **🔄 Scalable Architecture**: No need to guess capacity upfront - the filter grows automatically
+- **📊 Multi-Layer System**: Adds new layers dynamically as data grows
+- **🎯 Smart Growth**: Growth factor adapts (2x → 1.75x → 1.5x → 1.25x)
+- **💡 Simpler API**: Just specify error rate, not capacity
+- **📈 Better for Unknown Sizes**: Perfect when you don't know how much data you'll have
+Based on ["Scalable Bloom Filters" (Almeida et al., 2007)](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=10.1.1.725.390)
 ## Features
 - **🚀 Fast**: C implementation with MurmurHash3
 - **💾 Memory Efficient**: 20-50x less memory than Ruby Set
-- **🎯 Configurable**: Adjustable false positive rate
-- **🔒 Thread-Safe**: Safe for concurrent operations
-- **📊 Statistics**: Built-in performance monitoring
+- **🔄 Auto-Scaling**: Grows dynamically as you add elements
+- **🎯 Configurable**: Adjustable false positive rate per layer
+- **📊 Statistics**: Detailed per-layer performance monitoring
 - **✅ Well-Tested**: Comprehensive test suite
 ## Installation
@@ -36,15 +45,19 @@ gem install fast_bloom_filter
 ## Usage
-### Basic Operations
+### Basic Operations (v2 API)
 ```ruby
 require 'fast_bloom_filter'
-# Create a filter for 10,000 items with 1% false positive rate
-bloom = FastBloomFilter::Filter.new(10_000, 0.01)
+# Create a scalable filter - NO CAPACITY NEEDED!
+# Just specify your desired error rate
+bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
-# Add items
+# Or with an initial capacity hint (optional)
+bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
+# Add items - filter grows automatically
 bloom.add("user@example.com")
 bloom << "another@example.com"  # alias for add
@@ -52,6 +65,13 @@ bloom << "another@example.com"  # alias for add
 bloom.include?("user@example.com")  # => true
 bloom.include?("notfound@test.com") # => false (probably)
+# Add thousands or millions - it scales!
+100_000.times { |i| bloom.add("user#{i}@test.com") }
+# Check stats
+bloom.count        # => 100002
+bloom.num_layers   # => 8 (grew automatically!)
 # Batch operations
 emails = ["user1@test.com", "user2@test.com", "user3@test.com"]
 bloom.add_all(emails)
@@ -67,50 +87,99 @@ bloom.clear
 ```ruby
 # For email deduplication (0.1% false positive rate)
-bloom = FastBloomFilter.for_emails(100_000)
+bloom = FastBloomFilter.for_emails(error_rate: 0.001)
 # For URL tracking (1% false positive rate)
-bloom = FastBloomFilter.for_urls(50_000)
+bloom = FastBloomFilter.for_urls(error_rate: 0.01)
+# With initial capacity hint
+bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
 ```
 ### Merge Filters
 ```ruby
-bloom1 = FastBloomFilter::Filter.new(1000, 0.01)
-bloom2 = FastBloomFilter::Filter.new(1000, 0.01)
+bloom1 = FastBloomFilter::Filter.new(error_rate: 0.01)
+bloom2 = FastBloomFilter::Filter.new(error_rate: 0.01)
 bloom1.add("item1")
 bloom2.add("item2")
 bloom1.merge!(bloom2)  # bloom1 now contains both items
+# Merges all layers from bloom2 into bloom1
 ```
 ### Statistics
 ```ruby
-bloom = FastBloomFilter::Filter.new(10_000, 0.01)
+bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
+1000.times { |i| bloom.add("item#{i}") }
 stats = bloom.stats
 # => {
-#   capacity: 10000,
-#   size_bytes: 11982,
-#   num_hashes: 7,
-#   fill_ratio: 0.0
+#   total_count: 1000,
+#   num_layers: 2,
+#   total_bytes: 2500,
+#   total_bits: 20000,
+#   total_bits_set: 6543,
+#   fill_ratio: 0.32715,
+#   error_rate: 0.01,
+#   layers: [
+#     {
+#       layer: 0,
+#       capacity: 1024,
+#       count: 1024,
+#       size_bytes: 1229,
+#       num_hashes: 7,
+#       bits_set: 5234,
+#       total_bits: 9832,
+#       fill_ratio: 0.532,
+#       error_rate: 0.0015
+#     },
+#     # ... more layers
+#   ]
 # }
 puts bloom.inspect
-# => #<FastBloomFilter::Filter capacity=10000 size=11.7KB hashes=7 fill=0.0%>
+# => #<FastBloomFilter::Filter v2 layers=2 count=1000 size=2.44KB fill=32.72%>
+```
+## How Scalable Bloom Filters Work
+Traditional Bloom Filters require you to specify capacity upfront. **Scalable Bloom Filters** solve this by:
+1. **Starting Small**: Begin with a small initial capacity (default: 1024 elements)
+2. **Adding Layers**: When a layer fills up, add a new layer with larger capacity
+3. **Tightening Error Rates**: Each new layer has a tighter error rate to maintain overall FPR
+4. **Smart Growth**: Growth factor decreases over time (2x → 1.75x → 1.5x → 1.25x)
+### Error Rate Distribution
+Each layer `i` gets error rate: `total_error_rate × (1 - r) × r^i`
+Where `r` is the tightening factor (default: 0.85). This ensures the sum of all layer error rates converges to your target error rate.
+### Example Growth Pattern
+```
+Layer 0: capacity=1,024   error_rate=0.0015  (initial)
+Layer 1: capacity=2,048   error_rate=0.0013  (2x growth)
+Layer 2: capacity=3,584   error_rate=0.0011  (1.75x growth)
+Layer 3: capacity=5,376   error_rate=0.0009  (1.5x growth)
+Layer 4: capacity=6,720   error_rate=0.0008  (1.25x growth)
+...
 ```
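
Note: the capacities in the growth pattern above can be reproduced with a few lines of Ruby. This is a sketch, not the gem's C implementation; `layer_capacities` is a hypothetical helper, and it assumes the growth factor stays at 1.25x for every layer after the fourth, which the README implies but does not state.

```ruby
# Growth factors taken from the README: 2x, 1.75x, 1.5x, then 1.25x.
# Assumption: the factor stays at 1.25x for all later layers.
GROWTH_FACTORS = [2.0, 1.75, 1.5, 1.25].freeze

# Returns the capacity of each of the first `num_layers` layers.
# `layer_capacities` is a hypothetical helper, not part of the gem.
def layer_capacities(initial_capacity, num_layers)
  return [] if num_layers <= 0

  capacities = [initial_capacity]
  (1...num_layers).each do |i|
    # Layer i grows from layer i - 1 using the (i - 1)-th factor, capped at the last one.
    factor = GROWTH_FACTORS[[i - 1, GROWTH_FACTORS.size - 1].min]
    capacities << (capacities.last * factor).round
  end
  capacities
end

p layer_capacities(1024, 5)
# => [1024, 2048, 3584, 5376, 6720]
```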
 ## Performance
 Benchmarks on MacBook Pro M1 (100K elements):
-| Operation | Bloom Filter | Ruby Set | Speedup |
-|-----------|--------------|----------|---------|
-| Add       | 45ms         | 120ms    | 2.7x    |
-| Check     | 8ms          | 15ms     | 1.9x    |
-| Memory    | 120KB        | 2000KB   | 16.7x   |
+| Operation | Bloom Filter v2 | Ruby Set | Speedup |
+|-----------|-----------------|----------|---------|
+| Add       | 48ms            | 120ms    | 2.5x    |
+| Check     | 9ms             | 15ms     | 1.7x    |
+| Memory    | 145KB           | 2000KB   | 13.8x   |
 Run benchmarks yourself:
@@ -120,11 +189,12 @@ ruby demo.rb
 ## Use Cases
-### Rails: Prevent Duplicate Email Signups
+### Rails: Prevent Duplicate Email Signups (No Capacity Guessing!)
 ```ruby
 class User < ApplicationRecord
-  SIGNUP_BLOOM = FastBloomFilter.for_emails(1_000_000)
+  # No need to guess how many users you'll have!
+  SIGNUP_BLOOM = FastBloomFilter.for_emails(error_rate: 0.001)
   before_validation :check_duplicate_signup
@@ -140,12 +210,13 @@ class User < ApplicationRecord
 end
 ```
-### Track Visited URLs
+### Track Visited URLs (Scales to Millions)
 ```ruby
 class WebCrawler
   def initialize
-    @visited = FastBloomFilter.for_urls(10_000_000)
+    # Starts small, grows as needed
+    @visited = FastBloomFilter.for_urls(error_rate: 0.01)
   end
   def crawl(url)
@@ -153,6 +224,11 @@ class WebCrawler
     @visited.add(url)
     # ... crawl logic
+    # Check growth
+    if @visited.count % 10_000 == 0
+      puts "Crawled #{@visited.count} URLs, #{@visited.num_layers} layers"
+    end
   end
 end
 ```
@@ -162,7 +238,7 @@ end
 ```ruby
 class CacheWarmer
   def initialize
-    @warmed = FastBloomFilter::Filter.new(100_000, 0.001)
+    @warmed = FastBloomFilter::Filter.new(error_rate: 0.001)
   end
   def warm(key)
@@ -174,27 +250,31 @@ class CacheWarmer
 end
 ```
-## How It Works
-A Bloom Filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set:
+## Migration from v1.x
-- **No false negatives**: If it says "no", the item is definitely not in the set
-- **Possible false positives**: If it says "yes", the item is probably in the set
-- **Memory efficient**: Uses bit arrays instead of storing actual items
-- **Fast**: O(k) for add and lookup, where k is the number of hash functions
+**v1.x (Fixed Capacity):**
+```ruby
+bloom = FastBloomFilter::Filter.new(10_000, 0.01)
+bloom = FastBloomFilter.for_emails(100_000)
+```
-### Parameters
+**v2.x (Scalable):**
+```ruby
+# Recommended: Let it scale automatically
+bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
-- **Capacity**: Expected number of elements
-- **Error Rate**: Probability of false positives (default: 0.01 = 1%)
+# Or with initial capacity hint
+bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)
-The filter automatically calculates optimal bit array size and number of hash functions.
+# Helper methods also changed
+bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)
+```
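
Note: for call sites that still hold v1-style positional values, one possible approach is a small adapter that builds the v2 keyword arguments. This is a sketch under assumptions: `v2_filter_options` is a hypothetical helper, and it forwards the old capacity only as the `initial_capacity` hint, which may not be the sizing you want in v2.

```ruby
# Hypothetical helper: maps v1-style positional arguments onto the v2 keywords.
# Assumption: the old capacity is forwarded only as the initial_capacity hint.
def v2_filter_options(capacity = nil, error_rate = 0.01)
  options = { error_rate: error_rate }
  options[:initial_capacity] = capacity if capacity
  options
end

p v2_filter_options(10_000, 0.01)
p v2_filter_options
```

The resulting hash could then be splatted into the v2 constructor, for example `FastBloomFilter::Filter.new(**v2_filter_options(10_000, 0.01))`.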
 ## Development
 ```bash
 # Clone the repository
-git clone https://github.com/yourusername/fast_bloom_filter.git
+git clone https://github.com/roman-haidarov/fast_bloom_filter.git
 cd fast_bloom_filter
 # Install dependencies
@@ -210,7 +290,7 @@ bundle exec rake test
 gem build fast_bloom_filter.gemspec
 # Install locally
-gem install ./fast_bloom_filter-1.0.0.gem
+gem install ./fast_bloom_filter-2.0.0.gem
 ```
 ### Quick Build Script
@@ -225,6 +305,15 @@ gem install ./fast_bloom_filter-1.0.0.gem
 - C compiler (gcc, clang, etc.)
 - Make
+## Technical Details
+- **Hash Function**: MurmurHash3 (32-bit)
+- **Bit Array**: Dynamic allocation per layer
+- **Growth Strategy**: Adaptive (2x → 1.75x → 1.5x → 1.25x)
+- **Tightening Factor**: 0.85 (configurable)
+- **Memory Management**: Ruby GC integration with proper cleanup
+- **Thread Safety**: Safe for concurrent reads (writes need external synchronization)
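
Note: since writes need external synchronization, one way to share a filter across threads is to route calls through a `Mutex`. The wrapper below is a hypothetical sketch, not part of the gem. It conservatively locks reads as well as writes, and it uses a Ruby `Set` as a stand-in so it runs without the gem installed.

```ruby
require 'set'

# Hypothetical wrapper: serializes every call through one Mutex so that a
# write never runs concurrently with another write or with a read.
class SynchronizedFilter
  def initialize(filter)
    @filter = filter
    @lock = Mutex.new
  end

  def add(item)
    @lock.synchronize { @filter.add(item) }
    self
  end
  alias_method :<<, :add

  def include?(item)
    @lock.synchronize { @filter.include?(item) }
  end
end

# Set stands in for the filter here; in an application you would pass
# FastBloomFilter::Filter.new(error_rate: 0.01) instead.
shared = SynchronizedFilter.new(Set.new)
threads = 4.times.map do |t|
  Thread.new { 1_000.times { |i| shared.add("user#{t}-#{i}@test.com") } }
end
threads.each(&:join)
puts shared.include?("user3-999@test.com")
```

If lookups dominate and the concurrent-read guarantee holds for your workload, locking only `add` would reduce contention, at the cost of relying on that guarantee while a write is in progress.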
 ## Contributing
 1. Fork it
@@ -239,14 +328,15 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
 ## Credits
-- MurmurHash3 implementation based on Austin Appleby's original work
-- Bloom Filter algorithm by Burton Howard Bloom (1970)
+- Scalable Bloom Filters algorithm: Almeida, Baquero, Preguiça, Hutchison (2007)
+- MurmurHash3 implementation: Austin Appleby
+- Original Bloom Filter: Burton Howard Bloom (1970)
 ## Support
-- 🐛 [Report bugs](https://github.com/yourusername/fast_bloom_filter/issues)
-- 💡 [Request features](https://github.com/yourusername/fast_bloom_filter/issues)
-- 📖 [Documentation](https://github.com/yourusername/fast_bloom_filter)
+- 🐛 [Report bugs](https://github.com/roman-haidarov/fast_bloom_filter/issues)
+- 💡 [Request features](https://github.com/roman-haidarov/fast_bloom_filter/issues)
+- 📖 [Documentation](https://github.com/roman-haidarov/fast_bloom_filter)
 ## Changelog