bloom_fit 0.3.1 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml +4 -4
data/README.md +220 -47
data/ext/cbloomfilter/cbloomfilter.c +1 -1
data/lib/bloom_fit/version.rb +1 -1
data/lib/bloom_fit.rb +13 -1
data/lib/cbloomfilter.bundle +0 -0
data/test/bloom_fit_test.rb +22 -0
data/test/c_bloom_filter_test.rb +158 -0
metadata +2 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: cd631cdb483e0a84fa05d56eb962fda0f7c7d7a0b002ea708024ce82505a9054
-  data.tar.gz: ee781997465d6f5b590828082e4fadd5b00768298bbdec7845b9f07c3d046549
+  metadata.gz: 54da887424b56d9c09e4d351125c22873bc24be3e32e96cf3716d044a0864957
+  data.tar.gz: 50780ab65355bc42c075586888f4f09ee6ce6849b16c01264d83887dc83f71a3
 SHA512:
-  metadata.gz: 7862f2d0189bae865c6fc5e7c7ad24f5c7ab0420415a455a1a0b130835d639c536cb8925b08219eab7dd7a10db1e9299b2868019d3e2259db4dce96de01e50a2
-  data.tar.gz: 41cb7f2fcb8cf80f5345785ce0110e242a29fbe6177284b13b701973ec7b0e7010d788585e406f77712f7ee284ff308633fe060e492b0e153a4a5598658fd465
+  metadata.gz: 53511030706f900e42050938ff80eaaaa5290c609dcd40e6b809bed6c6d491fe63bc57d4d2c1e494c0081642f85e6e29e8c5bc46cbe9cc342d8700990d910043
+  data.tar.gz: f5da69e7acebde88b41649f6dfac9925e4f021c5fb7687f442a0b61b78efd1c30423f013851c2982048ef2f3c374b0e583cca7f92352475a31ca5cddfb67fd46

data/README.md CHANGED

@@ -1,77 +1,250 @@
-# BloomFit makes Bloom Filter tuning easy
+# BloomFit
-[![Gem Version](http://img.shields.io/gem/v/bloom_fit.svg)](https://rubygems.org/gems/bloom_fit)
+[![Gem Version](https://img.shields.io/gem/v/bloom_fit.svg)](https://rubygems.org/gems/bloom_fit)
 [![CI](https://github.com/rmm5t/bloom_fit/actions/workflows/ci.yml/badge.svg)](https://github.com/rmm5t/bloom_fit/actions/workflows/ci.yml)
 [![Gem Downloads](https://img.shields.io/gem/dt/bloom_fit.svg)](https://rubygems.org/gems/bloom_fit)
-BloomFit provides a MRI/C-based non-counting bloom filter for use in your Ruby projects. It is heavily based on [bloomfilter-rb]'s native implementation, but differs in the following ways:
+BloomFit is an in-memory, non-counting Bloom filter for Ruby backed by a small C extension.
+It gives you a compact, Set-like API for probabilistic membership checks:
+- false positives are possible
+- false negatives are not, as long as a value was added to the same filter
+- individual values cannot be deleted safely because the filter is non-counting
+BloomFit is heavily inspired by [bloomfilter-rb]'s native implementation and the original C implementation by Tatsuya Mori. This version uses a DJB2 hash with salts from the CRC table and wraps the native filter in a Ruby-friendly API. The most common way to use it is to pass an expected `capacity` and optional `false_positive_rate`, then let BloomFit calculate `size` and `hashes` for you.
+Compared with bloomfilter-rb, BloomFit:
 - uses DJB2 over CRC32 yielding better hash distribution
 - improves performance for very large datasets
 - avoids the need to supply a seed
-- automatically calculates the bit size (m) and the number of hashes (k) when given a capacity and false-positive-rate
+- automatically calculates the filter size (`m`) and hash count (`k`) from capacity and false-positive rate
-A [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter) is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Instead of using k different hash functions, this implementation a DJB2 hash with k seeds from the CRC table.
+## Features
-Performance of the Bloom filter depends on the following:
+- native `CBloomFilter` implementation for MRI Ruby
+- automatic sizing from `capacity` and `false_positive_rate`
+- small Ruby API with familiar methods like `add`, `include?`, `merge`, `|`, and `&`
+- supports strings, symbols, integers, booleans, and other values that can be converted with `to_s`
+- manual `size` / `hashes` overrides when you want control
+- save and reload filters with Ruby `Marshal`
+- inspect filter state with `stats`, `to_hex`, `to_binary`, and `bitmap`
-- size of the bit array
-- number of hash functions
+## Requirements
-## Resources
+- Ruby `>= 3.2.0`
-- Background: [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter)
-- Determining parameters: [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
-- Applications & reasons behind bloom filter: [Flow analysis: Time based bloom filter](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/)
+## Installation
-## Examples
+```bash
+gem install bloom_fit
+```
-MRI/C implementation which creates an in-memory filter which can be saved and reloaded from disk.
+```ruby
+require "bloom_fit"
+```
-(COMING SOON) If you'd like to specify an expected item count and a false-positive rate that you can tolerate. Visit the [Bloom Filter Calculator](https://hur.st/bloomfilter/) to learn more.
+## Quick Start
 ```ruby
 require "bloom_fit"
-bf = BloomFit.new(capacity: 250, false_positive_rate: 0.001)
-bf.add("cat")
-bf.include?("cat")     # => true
-bf.include?("dog")     # => false
-# Hash syntax with a bloom filter!
-bf["bird"] = "bar"
-bf["bird"]             # => true
-bf["mouse"]            # => false
-puts bf.stats
-# Number of filter bits (m): 3600
-# Number of set bits (n): 20
-# Number of filter hashes (k) : 10
-# Predicted false positive rate = 0.00%
+filter = BloomFit.new(capacity: 250, false_positive_rate: 0.001)
+filter.add("cat")
+filter << :dog
+filter.include?("cat") # => true
+filter.key?("dog")     # => true
+filter["bird"]         # => false
+filter["owl"] = true
+filter["ant"] = false
+filter["owl"]          # => true
+filter["ant"]          # => false
+filter.empty?          # => false
+filter.size            # => 3595
+filter.hashes          # => 10
+filter.clear
+filter.empty?          # => true
 ```
-If you'd like more control over the traditional inputs like bit size and the number of hashes:
+`#include?`, `#key?`, and `#[]` are aliases. `#add` and `#<<` are also aliases.
+## Automatic Sizing
+BloomFit now calculates `size` and `hashes` for you when you initialize it with an expected capacity:
 ```ruby
-require "bloom_fit"
+filter = BloomFit.new(capacity: 10_000, false_positive_rate: 0.01)
+filter.size   # => 95851
+filter.hashes # => 7
+```
+The defaults are a good starting point for many small filters:
+```ruby
+filter = BloomFit.new
+filter.size   # => 1438
+filter.hashes # => 10
+```
+That is equivalent to:
+```ruby
+filter = BloomFit.new(capacity: 100, false_positive_rate: 0.001)
+```
+Internally BloomFit uses the standard Bloom filter formulas:
+```text
+m = -(n * ln(p)) / (ln(2)^2)
+k = (m / n) * ln(2)
+```
+- `n`: expected number of inserted values
+- `p`: target false-positive rate
+- `m`: number of filter buckets (`size`)
+- `k`: number of hash functions (`hashes`)
+For example, if you expect about `10_000` inserts and can tolerate a `1%` false-positive rate, BloomFit will calculate `size: 95_851` and `hashes: 7` for you.
+If you prefer a calculator, see [Bloom Filter Calculator](https://hur.st/bloomfilter/).
+## Manual Sizing
+If you already know the exact filter width and hash count you want, you can still pass them directly:
+```ruby
+filter = BloomFit.new(size: 95_851, hashes: 7)
+```
+This bypasses automatic sizing.
+## Common Operations
-bf = BloomFit.new(size: 100, hashes: 2)
-bf.add("cat")
-bf.include?("cat")     # => true
-bf.include?("dog")     # => false
-# Hash syntax with a bloom filter!
-bf["bird"] = "bar"
-bf["bird"]             # => true
-bf["mouse"]            # => false
-puts bf.stats
-# Number of filter bits (m): 100
-# Number of set bits (n): 4
-# Number of filter hashes (k) : 2
-# Predicted false positive rate = 10.87%
+### Add and check membership
+```ruby
+filter = BloomFit.new(capacity: 100)
+filter << "cat"
+filter << "dog"
+filter.include?("cat")  # => true
+filter.include?("bird") # => false
+```
+### Use hash-like syntax for truthy values
+```ruby
+filter = BloomFit.new(capacity: 64)
+filter[:cat] = true
+filter[:dog] = false
+filter[:cat] # => true
+filter[:dog] # => false
+filter.merge({ bird: true, ant: nil })
+filter.include?(:bird) # => true
+filter.include?(:ant)  # => false
+```
+When merging a hash, only keys with truthy values are added.
+### Merge, union, and intersection
+```ruby
+pets = BloomFit.new(capacity: 50)
+pets << "cat" << "dog"
+more_pets = BloomFit.new(capacity: 50)
+more_pets << "dog" << "bird"
+combined = pets | more_pets
+overlap = pets & more_pets
+combined.include?("bird") # => true
+overlap.include?("dog")   # => true
+overlap.include?("cat")   # => false
+```
+`#merge` also accepts arrays, sets, and other enumerables:
+```ruby
+filter = BloomFit.new(capacity: 100)
+filter.merge(%w[cat dog bird])
+```
+Filters can only be combined when they have the same `size` and `hashes`. Otherwise BloomFit raises `BloomFit::ConfigurationMismatch`.
+When you create filters with automatic sizing, use the same `capacity` and `false_positive_rate` for filters you plan to merge, union, or intersect.
+### Save and load filters
+```ruby
+filter = BloomFit.new(capacity: 100)
+filter << "cat" << "dog"
+filter.save("pets.bloom")
+reloaded = BloomFit.load("pets.bloom")
+reloaded.include?("cat") # => true
+reloaded.include?("dog") # => true
+```
+Persistence uses Ruby `Marshal`. Only load files you trust.
+### Inspect the bitmap
+```ruby
+filter = BloomFit.new(size: 16, hashes: 4)
+filter << "cool"
+filter.to_hex    # => "1441"
+filter.to_binary # => "0001010001000001"
+filter.bitmap    # => raw bytes from the native filter
 ```
+`#bitmap` returns the native byte representation, which may include padding bytes beyond the configured filter width. `#to_binary` trims the result to exactly `size` bits.
+## API Overview
+| Method | Notes |
+| --- | --- |
+| `BloomFit.new` or `BloomFit.new(capacity:, false_positive_rate:)` | Creates a filter and calculates `size` and `hashes` automatically. Defaults to `capacity: 100`, `false_positive_rate: 0.001`. |
+| `BloomFit.new(size:, hashes:)` | Creates a filter with explicit sizing when you want fixed parameters. |
+| `add`, `<<` | Adds a value and returns the filter. |
+| `add?` | Adds only when the value does not already appear present. |
+| `include?`, `key?`, `[]` | Probabilistic membership check. |
+| `[]=` | Adds a key only when the assigned value is truthy. |
+| `merge` | Merges another filter or an enumerable into the receiver. |
+| `\|`, `union` | Returns a new filter containing the union. |
+| `&`, `intersection` | Returns a new filter containing the intersection. |
+| `clear` | Resets all bits to `0`. |
+| `empty?` | Exact check for whether any bits are set. |
+| `size`, `m` | Returns the configured filter width. |
+| `hashes`, `k` | Returns the number of hash functions. |
+| `set_bits`, `n` | Returns the number of bits currently set. |
+| `stats` | Returns a human-readable summary including predicted false-positive rate. |
+| `to_hex`, `to_binary`, `bitmap` | Returns the filter bitmap in different representations. |
+| `save`, `BloomFit.load` | Serializes and restores a filter with Ruby `Marshal`. |
+## Resources
+- Background: [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter)
+- Determining parameters: [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
+- Applications and motivation: [Flow analysis: Time based bloom filter](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/)
+- Calculator: [Bloom Filter Calculator](https://hur.st/bloomfilter/)
 ## Credits
 - Tatsuya Mori <valdzone@gmail.com> (Original C implementation)
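
The README's claims about membership checks, union, and intersection can be illustrated with a toy filter in plain Ruby. This is only a sketch under stated assumptions: `ToyBloom` is a hypothetical class, it derives bit positions from salted `Zlib.crc32` calls rather than the DJB2 hash BloomFit uses, and it raises `ArgumentError` where BloomFit is documented to raise `BloomFit::ConfigurationMismatch`.

```ruby
require "zlib"

# Toy Bloom filter for illustration only; NOT BloomFit's implementation.
class ToyBloom
  attr_reader :size, :hashes, :bits

  def initialize(size:, hashes:, bits: 0)
    @size = size
    @hashes = hashes
    @bits = bits # a single Integer used as the bit array
  end

  # Set one bit per hash function; returns self so calls can be chained.
  def add(value)
    positions(value).each { |pos| @bits |= (1 << pos) }
    self
  end
  alias_method :<<, :add

  # True only if every bit for the value is set (false positives possible).
  def include?(value)
    positions(value).all? { |pos| @bits[pos] == 1 }
  end

  # Union is a bitwise OR of the two bit arrays.
  def |(other)
    check_compatible!(other)
    ToyBloom.new(size: size, hashes: hashes, bits: bits | other.bits)
  end

  # Intersection is a bitwise AND of the two bit arrays.
  def &(other)
    check_compatible!(other)
    ToyBloom.new(size: size, hashes: hashes, bits: bits & other.bits)
  end

  private

  # Assumption: salted CRC32 stands in for BloomFit's DJB2-with-salts scheme.
  def positions(value)
    (0...hashes).map { |salt| Zlib.crc32("#{salt}:#{value}") % size }
  end

  def check_compatible!(other)
    return if size == other.size && hashes == other.hashes

    raise ArgumentError, "filters must share size and hashes"
  end
end

pets = ToyBloom.new(size: 1024, hashes: 3)
pets << "cat" << "dog"
more_pets = ToyBloom.new(size: 1024, hashes: 3)
more_pets << "dog" << "bird"

combined = pets | more_pets
overlap = pets & more_pets
combined.include?("bird") # => true
overlap.include?("dog")   # => true
```

Because union and intersection operate position by position on the bit arrays, they are only meaningful when both filters map values to the same positions, which is why the README requires matching `size` and `hashes`.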

data/ext/cbloomfilter/cbloomfilter.c CHANGED

@@ -99,7 +99,7 @@ static VALUE bf_initialize(int argc, VALUE *argv, VALUE self) {
     bf = bf_ptr(self);
-    /* default = Fugou approach :-) */
+    /* defaults */
     arg1 = INT2FIX(1000);
     arg2 = INT2FIX(4);

data/lib/bloom_fit/version.rb CHANGED

@@ -1,3 +1,3 @@
 class BloomFit
-  VERSION = "0.3.1".freeze
+  VERSION = "1.0.0".freeze
 end

data/lib/bloom_fit.rb CHANGED

@@ -28,6 +28,8 @@ require "bloom_fit/version"
 class BloomFit
   extend Forwardable
+  LN2 = Math.log(2.0).freeze
   # The wrapped native +CBloomFilter+ instance.
   #
   # This is mostly useful for low-level integrations and internal filter
@@ -40,9 +42,19 @@ class BloomFit
   # but the best values depend on how many keys you expect to insert and how
   # many false positives you can tolerate.
   #
+  # @param capacity [Integer] expected number of elements to store in the set
+  # @param false_positive_rate [Float] acceptable false-positive rate, strictly between 0 and 1
   # @param size [Integer] number of buckets in a bloom filter
   # @param hashes [Integer] number of hash functions
-  def initialize(size: 1_000, hashes: 4)
+  def initialize(capacity: 100, false_positive_rate: 0.001, size: nil, hashes: 4)
+    if size.nil? || hashes.nil?
+      raise ArgumentError, "capacity must be > 0" unless capacity.positive?
+      raise ArgumentError, "false_positive_rate must be between 0 and 1" if false_positive_rate <= 0.0 || false_positive_rate >= 1.0
+      size = (-capacity.to_f * Math.log(false_positive_rate) / (LN2**2)).ceil
+      hashes = (size / capacity * LN2).ceil
+    end
     @bf = CBloomFilter.new(size, hashes)
   end
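
The sizing arithmetic in this hunk can be checked in isolation. The sketch below reproduces the two formulas in a hypothetical helper, `bloom_parameters`, which is not part of BloomFit's API. It also computes the hash count two ways, because `size / capacity` is integer division in Ruby when both operands are Integers, so the quotient is truncated before it is multiplied by `LN2`. For the inputs used in the README and tests both variants agree, but they can differ for other inputs.

```ruby
LN2 = Math.log(2.0)

# Hypothetical helper mirroring the formulas in the hunk above.
def bloom_parameters(capacity, false_positive_rate)
  size = (-capacity.to_f * Math.log(false_positive_rate) / (LN2**2)).ceil
  {
    size: size,
    # Integer / Integer truncates first, as in the diffed expression.
    hashes_truncated: (size / capacity * LN2).ceil,
    # Float division keeps the fractional part of size / capacity.
    hashes_float: (size.to_f / capacity * LN2).ceil
  }
end

bloom_parameters(100, 0.001)
# => {size: 1438, hashes_truncated: 10, hashes_float: 10}
bloom_parameters(10_000, 0.0001)
# => {size: 191702, hashes_truncated: 14, hashes_float: 14}
bloom_parameters(1_000, 0.1)
# => {size: 4793, hashes_truncated: 3, hashes_float: 4}
```

The last call shows the divergence: `4793 / 1000` evaluates to `4`, so the truncated variant yields 3 hashes where the untruncated formula yields 4.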

data/lib/cbloomfilter.bundle CHANGED

Binary file

data/test/bloom_fit_test.rb CHANGED

@@ -3,6 +3,28 @@ require "test_helper"
 class BloomFitTest < Minitest::Spec
   subject { BloomFit.new(size: 100, hashes: 4) }
+  describe ".new" do
+    it "accepts size and hashes override" do
+      bf = BloomFit.new(size: 10, hashes: 1)
+      assert_equal 10, bf.size
+      assert_equal 1, bf.hashes
+    end
+    it "has default capacity and false-positive rate" do
+      bf = BloomFit.new
+      # https://hur.st/bloomfilter/?n=100&p=0.001&m=&k=
+      assert_equal 1438, bf.size
+      assert_equal 10, bf.hashes
+    end
+    it "calculates size and hashes given a capacity and false positive rate" do
+      bf = BloomFit.new(capacity: 10_000, false_positive_rate: 0.0001)
+      # https://hur.st/bloomfilter/?n=10000&p=0.0001&m=&k=
+      assert_equal 191_702, bf.size
+      assert_equal 14, bf.hashes
+    end
+  end
   describe "#empty?" do
     it "returns true when nothing set" do
       assert_equal true, subject.empty? # rubocop:disable Minitest/AssertTruthy

data/test/c_bloom_filter_test.rb ADDED

@@ -0,0 +1,158 @@
+require "test_helper"
+class CBloomFilterTest < Minitest::Spec
+  subject { CBloomFilter.new }
+  describe "#m" do
+    it "defaults" do
+      assert_equal 1000, subject.m
+    end
+    it "is set by the 1st arg of the constructor" do
+      bf = CBloomFilter.new(10_000)
+      assert_equal 10_000, bf.m
+    end
+  end
+  describe "#k" do
+    it "defaults" do
+      assert_equal 4, subject.k
+    end
+    it "is set by the 2nd arg of the constructor" do
+      bf = CBloomFilter.new(10_000, 9)
+      assert_equal 9, bf.k
+    end
+  end
+  describe "#set_bits" do
+    it "initializes to zero" do
+      assert_equal 0, subject.set_bits
+    end
+    it "counts the bits when active" do
+      subject.add("foo")
+      assert_equal 4, subject.set_bits
+    end
+  end
+  describe "#add" do
+    it "adds keys to the filter set" do
+      subject.add("foo")
+      subject.add("bar")
+      assert_includes subject, "foo"
+      assert_includes subject, "bar"
+      refute_includes subject, "baz"
+    end
+  end
+  describe "#include?" do
+    it "returns true when a key is in the set" do
+      subject.add("foo")
+      assert_equal true, subject.include?("foo") # rubocop:disable Minitest/AssertTruthy
+    end
+    it "returns false when a key is not in the set" do
+      subject.add("foo")
+      assert_equal false, subject.include?("bar") # rubocop:disable Minitest/RefuteFalse
+    end
+  end
+  describe "#clear" do
+    it "clears a set" do
+      subject.add("foo")
+      subject.add("bar")
+      subject.add("baz")
+      assert subject.set_bits.positive?
+      subject.clear
+      assert subject.set_bits.zero?
+    end
+  end
+  describe "#merge" do
+    it "adds keys from another set" do
+      subject.add("foo")
+      bf = CBloomFilter.new
+      bf.add("bar")
+      bf.add("baz")
+      subject.merge(bf)
+      assert_includes subject, "foo"
+      assert_includes subject, "bar"
+      assert_includes subject, "baz"
+    end
+  end
+  describe "#&" do
+    it "intersects keys from another set" do
+      subject.add("foo")
+      subject.add("bar")
+      bf = CBloomFilter.new
+      bf.add("bar")
+      bf.add("baz")
+      bf2 = subject & bf
+      refute_includes bf2, "foo"
+      assert_includes bf2, "bar"
+      refute_includes bf2, "baz"
+      bf3 = bf & subject
+      refute_includes bf3, "foo"
+      assert_includes bf3, "bar"
+      refute_includes bf3, "baz"
+    end
+  end
+  describe "#|" do
+    it "unions keys from another set" do
+      subject.add("foo")
+      subject.add("bar")
+      bf = CBloomFilter.new
+      bf.add("bar")
+      bf.add("baz")
+      bf2 = subject | bf
+      assert_includes bf2, "foo"
+      assert_includes bf2, "bar"
+      assert_includes bf2, "baz"
+      bf3 = bf | subject
+      assert_includes bf3, "foo"
+      assert_includes bf3, "bar"
+      assert_includes bf3, "baz"
+    end
+  end
+  describe "#bitmap" do
+    it "returns a binary bitmap of all zeros when empty (including a terminating byte)" do
+      bf = CBloomFilter.new(16)
+      assert_equal "\x00\x00\x00".b, bf.bitmap
+    end
+    it "returns a binary bitmap representing the set" do
+      bf = CBloomFilter.new(16, 4)
+      bf.add("something")
+      assert_equal "(\x82\x00".b, bf.bitmap
+    end
+    it "returns a binary bitmap representing the set even if not a multiple of 8 bits (includes padding)" do
+      bf = CBloomFilter.new(20, 4)
+      bf.add("wow")
+      assert_equal "\x04\x14\x00\x00".b, bf.bitmap
+    end
+  end
+  describe "#load" do
+    it "overwrites the bitmap" do
+      bf = CBloomFilter.new(1000, 4)
+      bf.add("foo")
+      bf.add("bar")
+      subject.load(bf.bitmap)
+      assert_includes subject, "foo"
+      assert_includes subject, "bar"
+    end
+  end
+end
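
The `#bitmap` tests above show byte strings longer than the configured width (3 bytes for 16 bits, 4 bytes for 20 bits), and the README says `#to_binary` trims the result to exactly `size` bits. One way such trimming could work is sketched below. `trim_bitmap` is a hypothetical helper, and the most-significant-bit-first ordering is an assumption, since the extension's actual bit layout is not shown in this diff.

```ruby
# Hypothetical helper; assumes MSB-first bit order within each byte.
def trim_bitmap(bytes, size)
  # "B*" unpacks every byte into a string of "0"/"1" characters, MSB first.
  bytes.unpack1("B*")[0, size]
end

padded = "\x04\x14\x00\x00".b # byte string taken from the 20-bit test above
bits = trim_bitmap(padded, 20)
bits            # => "00000100000101000000"
bits.length     # => 20
bits.count("1") # => 3
```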

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: bloom_fit
 version: !ruby/object:Gem::Version
-  version: 0.3.1
+  version: 1.0.0
 platform: ruby
 authors:
 - Ryan McGeary
@@ -31,6 +31,7 @@ files:
 - lib/bloom_fit/version.rb
 - lib/cbloomfilter.bundle
 - test/bloom_fit_test.rb
+- test/c_bloom_filter_test.rb
 - test/test_helper.rb
 homepage: https://github.com/rmm5t/bloom_fit
 licenses: []