bloomer 0.0.2 → 0.0.3

Files changed (5)

data/README.md CHANGED

@@ -1,18 +1,29 @@
-# Bloomer: A pure-ruby bloom filter with no extra fluff
+# Bloomer: A Scalable pure-ruby Bloom filter
 [Bloom filters](http://en.wikipedia.org/wiki/Bloom_filter) are great for quickly checking to see if
-a given string has been seen before--in constant time, and using a fixed amount of RAM.
+a given string has been seen before--in constant time, and using a fixed amount of RAM, as long
+as you know the expected number of elements up front.
-Note that false positives with bloom filters *are possible*, but false negatives are not. In other words,
+[Scalable Bloom Filters](http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf) allow you to establish an
+initial capacity, but dynamically scale past that and maintain a false_positive_probability at the expense of
+growing the RAM requirements.
+```Bloomer``` is a Bloom Filter. ```Bloomer::Scalable``` is a Scalable Bloom Filter.
+Keep in mind that false positives with Bloom Filters *are expected* with a specified probability rate.
+False negatives, however, are not. In other words,
 * if ```include?``` returns *false*, that string has *certainly not* been ```add```ed
 * if ```include?``` returns *true*, it *might* mean that string was ```add```ed (depending on the
 ```false_positive_probability``` parameter provided to the constructor).
-This implementation is the Nth bloom filter gem written in ruby -- but, at the time of conception, the only one that
+This implementation is unique in that Bloomer
-* uses a robust set of hashing functions
+* supports scalable bloom filters (SBF)
+* uses triple hash chains (see [the paper](http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf))
 * can marshal state quickly
+* has rigorous tests
+* is pure ruby
 * does not require EM or Redis or something else unrelated to simply implementing a bloom filter
 ## Usage
@@ -28,6 +39,16 @@ bf.include? "dog"
 #=> false
 ```
+Scalable Bloom filters use the same API:
+```ruby
+b = Bloomer::Scalable.new
+b.add "boom"
+b.include? "boom"
+#=> true
+b.include? "badda"
+#=> false
+```
 Serialization is through [Marshal](http://ruby-doc.org/core-1.8.7/Marshal.html):
 ```ruby
@@ -42,11 +63,7 @@ new_b.include? "a"
 ## History
 * 0.0.1 Bloom, there it is.
-* 0.0.2 Switch to triple hash chaining, which resulted in better, faster hashing (!!):
-  md5 (v0.0.2): 66 sec, false positive rate = 1.116%, expected 1.0%
-  multihash (0.0.1): 92 sec, false positive rate = 1.27%, expected 1.0%
+* 0.0.2 Switch to triple-hash chaining (simpler, faster, and better false-positive rate)
+* 0.0.3 Added support for scalable bloom filters (SBF)
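
The `include?` guarantees described in the README can be illustrated with a toy filter. This is a minimal sketch using only the standard library; `ToyBloom`, its fixed bit count, and its salted-MD5 hashing are assumptions made for illustration and are not Bloomer's implementation.

```ruby
require 'digest/md5'

# Toy Bloom filter for illustration only; NOT Bloomer's implementation.
class ToyBloom
  def initialize(bits = 1024, hashes = 3)
    @bits = Array.new(bits, 0)
    @hashes = hashes
  end

  # Sets every bit position derived from the string.
  def add(string)
    positions(string).each { |i| @bits[i] = 1 }
    self
  end

  # True only if every derived bit is set. Bits are never cleared, so an
  # added string can never be reported as absent (no false negatives).
  def include?(string)
    positions(string).all? { |i| @bits[i] == 1 }
  end

  private

  # Derives @hashes positions by salting MD5 with the hash index
  # (an assumption for this sketch, not the gem's hash chain).
  def positions(string)
    (0...@hashes).map do |salt|
      Digest::MD5.hexdigest("#{salt}:#{string}").to_i(16) % @bits.size
    end
  end
end

toy = ToyBloom.new
toy.add("cat")
puts toy.include?("cat") # an added string is always reported as present
```

A string that was never added usually returns `false`, but it can return `true` when other additions happen to have set all of its bit positions. That is the false positive the README warns about.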

data/bloomer.gemspec CHANGED

@@ -9,8 +9,8 @@ Gem::Specification.new do |s|
   s.authors     = ["Matthew McEachen"]
   s.email       = ["matthew+github@mceachen.org"]
   s.homepage    = "https://github.com/mceachen/bloomer"
-  s.summary     = %q{Pure-ruby bloom filter with minimal dependencies}
-  s.description = %q{Pure-ruby bloom filter with minimal dependencies}
+  s.summary     = %q{Pure-ruby scalable bloom filter}
+  s.description = %q{Bloomer implements both simple Bloom filters as well as Scalable Bloom Filters (SBF), in pure ruby and with minimal external dependencies}
   s.rubyforge_project = "bloomer"

data/lib/bloomer.rb CHANGED

@@ -2,23 +2,26 @@ require 'bitarray'
 require 'digest/md5'
 class Bloomer
-  VERSION = "0.0.2"
+  VERSION = "0.0.3"
-  def initialize(expected_size, false_positive_probability = 0.001, opts = {})
-    @ba = opts[:ba] || begin
-      # m is the required number of bits in the array
-      m = -(expected_size * Math.log(false_positive_probability)) / (Math.log(2) ** 2)
-      BitArray.new(m.round)
-    end
+  def initialize(capacity, false_positive_probability = 0.001)
+    @capacity = capacity.round
+    # m is the required number of bits in the array
+    m = -(capacity * Math.log(false_positive_probability)) / (Math.log(2) ** 2)
+    @ba = BitArray.new(m.round)
+    # count is the number of unique additions to this filter.
+    @count = 0
     # k is the number of hash functions that minimizes the probability of false positives
-    @k = (opts[:k] || Math.log(2) * (@ba.size / expected_size)).round
+    @k = (Math.log(2) * (@ba.size / capacity)).round
   end
-  # returns true if item hadn't already been added
+  # returns true if the item had not already been added
   def add string
     count = 0
-    hashes(string).each { |ea| count += @ba[ea] ; @ba[ea] = 1 }
-    count == @k
+    hashes(string).each { |ea| count += @ba[ea]; @ba[ea] = 1 }
+    previously_included = (count == @k)
+    @count += 1 unless previously_included
+    !previously_included
   end
   # returns false if the item hadn't already been added
@@ -27,13 +30,15 @@ class Bloomer
     !hashes(string).any? { |ea| @ba[ea] == 0 }
   end
-  def _dump(depth)
-    [@k, Marshal.dump(@ba)].join(" ")
+  # The number of unique strings given to #add. A false positive in #add is
+  # treated as a duplicate, so this number can under-count.
+  def count
+    @count
   end
-  def self._load(data)
-    k, ba = data.split(" ", 2)
-    new(nil, nil, :k => k.to_i, :ba => Marshal.load(ba))
+  # If count exceeds capacity, the provided #false_positive_probability will probably be exceeded.
+  def capacity
+    @capacity
   end
   private
@@ -54,4 +59,37 @@ class Bloomer
       x
     end
   end
+  # Automatically expanding bloom filter.
+  # See http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
+  class Scalable
+    S = 2
+    R = Math.log(2) ** 2
+    def initialize(initial_capacity = 256, false_positive_probability = 0.001)
+      @false_positive_probability = false_positive_probability
+      @bloomers = [Bloomer.new(initial_capacity, false_positive_probability * R)]
+    end
+    def capacity
+      @bloomers.last.capacity
+    end
+    def count
+      @bloomers.inject(0) {|i,b|i + b.count}
+    end
+    def add string
+      l = @bloomers.last
+      r = l.add(string)
+      if r && (l.count > l.capacity)
+        @bloomers << Bloomer.new(l.capacity * S, @false_positive_probability * (R**@bloomers.size))
+      end
+      r
+    end
+    # only return false if no bloomers include string.
+    def include? string
+      @bloomers.any? { |ea| ea.include? string }
+    end
+  end
 end
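
To see how `Bloomer::Scalable` grows, the sketch below tabulates the capacity and false-positive target that each internal filter would receive, following the expressions in `initialize` and `add` as they appear in this diff. `stage_schedule` and `bits_for` are hypothetical helpers written for this illustration; they are not part of the gem.

```ruby
# Constants copied from the diff above.
S = 2                 # each new stage doubles the previous capacity
R = Math.log(2) ** 2  # tightening ratio, roughly 0.48

# Hypothetical helper: bit count from the same formula Bloomer#initialize uses.
def bits_for(capacity, fpp)
  (-(capacity * Math.log(fpp)) / (Math.log(2) ** 2)).round
end

# Hypothetical helper: one row per internal filter, in creation order.
def stage_schedule(initial_capacity, fpp, stages)
  capacity = initial_capacity
  (0...stages).map do |i|
    # The first filter is built with fpp * R. Each later filter uses
    # fpp * R**n, where n is the number of filters that already exist,
    # so the second filter also gets fpp * R.
    exponent = i.zero? ? 1 : i
    target = fpp * (R ** exponent)
    row = { stage: i, capacity: capacity, fpp: target, bits: bits_for(capacity, target) }
    capacity *= S
    row
  end
end

schedule = stage_schedule(256, 0.001, 6)
schedule.each do |row|
  puts format("stage %d: capacity %d, fpp %.6f, bits %d",
              row[:stage], row[:capacity], row[:fpp], row[:bits])
end

# Summing the per-stage targets gives an upper bound on the combined rate.
total = schedule.sum { |row| row[:fpp] }
puts format("sum of per-stage targets: %.6f", total)
```

Read this way, the first two stages share the same target, and the sum of targets lands somewhat above the requested probability (roughly 1.4 times it as the number of stages grows). The specs in this diff only fail when the measured rate exceeds twice the requested value, so they would tolerate that margin.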

data/spec/bloomer_spec.rb CHANGED

@@ -1,64 +1,105 @@
 require "spec_helper"
 require "benchmark"
+C = ('a'..'z').to_a
 def rand_word(length = 8)
-  ('a'..'z').to_a.shuffle.first(length).join # not random enough to cause hits.
+  C.shuffle.first(length).join # not random enough to cause hits.
+end
+def test_bloom(size, max_false_prob, bloom)
+  set = Set.new
+  size.times do
+    w = rand_word
+    bloom.add(w)
+    set.add(w)
+  end
+  set.each { |ea| bloom.include?(ea).should be_true }
+  tries = size * 3
+  false_hits = 0
+  hits = 0
+  tries.times.each do
+    word = rand_word
+    b_inc, s_inc = bloom.include?(word), set.include?(word)
+    hits += 1 if s_inc
+    if s_inc && !b_inc
+      fail "'#{word}': false negative on include"
+    elsif !s_inc && b_inc
+      false_hits += 1
+    end
+  end
+  false_positive_failure_rate = false_hits.to_f / tries
+  puts "False positive rate = #{false_positive_failure_rate * 100}%, expected #{max_false_prob * 100}% (#{false_hits} false positives, #{hits} hits)"
+  if (false_positive_failure_rate) > max_false_prob * 2
+    fail "False-positive failure rate was bad: #{false_positive_failure_rate}"
+  end
+end
+def test_marshal_state(b)
+  inputs = b.capacity.times.collect { rand_word }
+  inputs.each { |ea| b.add(ea) }
+  new_b = Marshal.load(Marshal.dump(b))
+  new_b.count.should == b.count
+  new_b.capacity.should == b.capacity
+  inputs.each { |ea| new_b.should include(ea) }
+end
+def test_simple(b)
+  b.add("a").should be_true
+  b.add("a").should be_false
+  b.should include("a")
+  b.should_not include("")
+  b.should_not include("b")
+  b.add("b").should be_true
+  b.add("b").should be_false
+  b.should include("b")
+  b.should_not include("")
+  b.add("")
+  b.should include("")
 end
 describe Bloomer do
   it "should work trivially" do
     b = Bloomer.new(10, 0.001)
-    b.add("a").should be_false
-    b.add("a").should be_true
-    b.should include("a")
-    b.should_not include("")
-    b.should_not include("b")
-    b.add("b").should be_false
-    b.add("b").should be_true
-    b.should include("b")
-    b.should_not include("")
-    b.add("")
-    b.should include("")
+    test_simple(b)
   end
   it "should marshal state correctly" do
     b = Bloomer.new(10, 0.001)
-    inputs = %q(a b c d)
-    inputs.each { |ea| b.add(ea) }
-    s = Marshal.dump(b)
-    new_b = Marshal.load(s)
-    inputs.each { |ea| new_b.should include(ea) }
+    test_marshal_state(b)
   end
   it "should result in similar-to-expected false positives" do
     max_false_prob = 0.001
     size = 50_000
-    bloom = Bloomer.new(size, max_false_prob)
-    set = Set.new
-    size.times do
-      w = rand_word
-      bloom.add(w)
-      set.add(w)
-    end
-    set.each { |ea| bloom.include?(ea).should be_true }
-    tries = size * 3
-    false_hits = 0
-    hits = 0
-    tries.times.each do
-      word = rand_word
-      b_inc, s_inc = bloom.include?(word), set.include?(word)
-      hits += 1 if s_inc
-      if s_inc && !b_inc
-        fail "'#{word}': false negative on include"
-      elsif !s_inc && b_inc
-        false_hits += 1
-      end
-    end
+    b = Bloomer.new(size, max_false_prob)
+    test_bloom(size, max_false_prob, b)
+  end
+end
-    false_positive_failure_rate = false_hits.to_f / tries
-    puts "False positive rate = #{false_positive_failure_rate * 100}%, expected #{max_false_prob * 100}% (#{false_hits} false positives, #{hits} hits)"
-    if (false_positive_failure_rate) > max_false_prob * 2
-      fail "False-positive failure rate was bad: #{false_positive_failure_rate}"
-    end
+describe Bloomer::Scalable do
+  it "should work trivially" do
+    b = Bloomer::Scalable.new
+    test_simple(b)
+  end
+  it "should marshal state correctly" do
+    b = Bloomer::Scalable.new(10, 0.001)
+    100.times.each { b.add(rand_word) }
+    test_marshal_state(b)
+  end
+  it "should result in similar-to-expected false positives" do
+    max_false_prob = 0.001
+    size = 10_000
+    b = Bloomer::Scalable.new(1024, max_false_prob)
+    test_bloom(size, max_false_prob, b)
+  end
+  it "should result in similar-to-expected false positives" do
+    max_false_prob = 0.01
+    size = 50_000
+    b = Bloomer::Scalable.new(1024, max_false_prob)
+    test_bloom(size, max_false_prob, b)
   end
 end
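
The `hits` counter in `test_bloom` should almost always stay at zero, because `rand_word` draws from a very large space of words. A rough back-of-the-envelope check, assuming uniformly shuffled letters and ignoring duplicate additions (the constant names here are made up for the sketch):

```ruby
# rand_word shuffles 'a'..'z' and keeps the first 8 letters, so each word is
# an ordered selection of 8 distinct letters out of 26.
ALPHABET_SIZE = 26
WORD_LENGTH = 8

# Number of ordered selections: 26 * 25 * ... * 19
distinct_words = (0...WORD_LENGTH).inject(1) { |acc, i| acc * (ALPHABET_SIZE - i) }

size = 50_000     # words added in the largest spec above
tries = size * 3  # probes made by test_bloom

# Expected number of probes that match a word that really was added.
expected_true_hits = tries * size.to_f / distinct_words

puts distinct_words
puts expected_true_hits
```

With so few genuine matches expected, nearly every `true` from `include?` during the probe loop is a false positive, which is what the spec is trying to measure.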

metadata CHANGED

@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: bloomer
 version: !ruby/object:Gem::Version
-  hash: 27
+  hash: 25
   prerelease:
   segments:
   - 0
   - 0
-  - 2
-  version: 0.0.2
+  - 3
+  version: 0.0.3
 platform: ruby
 authors:
 - Matthew McEachen
@@ -32,7 +32,7 @@ dependencies:
   type: :runtime
   name: bitarray
   version_requirements: *id001
-description: Pure-ruby bloom filter with minimal dependencies
+description: Bloomer implements both simple Bloom filters as well as Scalable Bloom Filters (SBF), in pure ruby and with minimal external dependencies
 email:
 - matthew+github@mceachen.org
 executables: []
@@ -84,7 +84,7 @@ rubyforge_project: bloomer
 rubygems_version: 1.6.2
 signing_key:
 specification_version: 3
-summary: Pure-ruby bloom filter with minimal dependencies
+summary: Pure-ruby scalable bloom filter
 test_files:
 - spec/bloomer_spec.rb
 - spec/spec_helper.rb