RubyGems: simhilarity 1.0.2 → 1.0.3


Files changed (14)

data/.gitignore +1 -0
data/.travis.yml +6 -0
data/README.md +13 -12
data/bin/simhilarity +5 -2
data/lib/simhilarity.rb +3 -3
data/lib/simhilarity/candidate.rb +4 -28
data/lib/simhilarity/candidates.rb +91 -0
data/lib/simhilarity/matcher.rb +84 -52
data/lib/simhilarity/score.rb +56 -0
data/lib/simhilarity/version.rb +1 -1
data/test/tests.rb +51 -25
metadata +6 -5
data/lib/simhilarity/bulk.rb +0 -163
data/lib/simhilarity/single.rb +0 -18

data/.gitignore CHANGED

@@ -1,5 +1,6 @@
 *.gem
 .bundle
+.rake-complete-cache
 Gemfile.lock
 pkg/*
 rdoc

data/.travis.yml ADDED

@@ -0,0 +1,6 @@
+language: ruby
+rvm:
+  - 1.9.3
+  - 2.0.0
+  - rbx-19mode
+  - jruby-19mode

data/README.md CHANGED

@@ -66,15 +66,9 @@ score,needle,haystack
 It will print out the best matches between needle and haystack in CSV format. Use `simhilarity --verbose` to look at pretty progress bars while it's running. Use --candidates to customize the candidates selection method, which will dramatically affect performance for large data sets.
-### Simhilarity::Bulk
+### Simhilarity::Matcher
-To use simhilarity from code, create a `Bulk` and call `matches(needles, haystack)`. It'll return an array of tuples, `[needle, haystack, score]`. By default, simhilarity assumes that needles and haystack are arrays of strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
-### Simhilarity::Single
-Sometimes it's useful to just calculate the score between two strings. For example, if you just want a title similarity measurement as part of some larger analysis between two books. Create a `Single` and call `score(a, b)` to measure similarity between those two items. By default, simhilarity assumes that needle and haystack are strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
-Important note: For best results with `Single`, set the corpus so that simhilarity can calculate ngram frequencies. This can dramatically improve accuracy. `Bulk` will do this automatically because it has access to the corpus, but `Single` doesn't. Call `corpus=` manually when using `Single`.
+To use simhilarity from code, create a `Matcher` and call `matches(needles, haystack)`. It'll return an array of tuples, `[needle, haystack, score]`. By default, simhilarity assumes that needles and haystack are arrays of strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
 <a name="benchmarks"/>
@@ -135,10 +129,10 @@ There are a few ways to configure simhilarity:
   Simhash works great, but there's no reason not to use `:ngrams` or even `:all` for small data sets. In fact, that's what simhilarity does by default - if you use a small dataset (needle * haystack < 200,000) it defaults to `:all`, otherwise it uses `:simhash`. Some examples:
   ```ruby
-  Simhilarity::Bulk.new  # defaults to :all or :simhash based on size
-  Simhilarity::Bulk.new(candidates: :simhash)
-  Simhilarity::Bulk.new(candidates: :simhash, simhash_max_hamming: 8)
-  Simhilarity::Bulk.new(candidates: :ngrams, ngram_overlaps: 4)
+  Simhilarity::Matcher.new  # defaults to :all or :simhash based on size
+  Simhilarity::Matcher.new(candidates: :simhash)
+  Simhilarity::Matcher.new(candidates: :simhash, simhash_max_hamming: 8)
+  Simhilarity::Matcher.new(candidates: :ngrams, ngram_overlaps: 4)
   ```
   or:
@@ -162,3 +156,10 @@ There are a few ways to configure simhilarity:
 * **ngrammer** - proc for converting normalized strings into ngrams. The default ngrammer pulls out bigrams and runs of digits, which is perfect for matching names and addresses.
 * **verbose** - if true, show progress while simhilarity is working. Great for the impatient. Use --verbose from the command line.
+## Changelog
+#### Master (unreleased)
+* Works with Ruby 2.0 - thanks @abscondment!
+* Travis

data/bin/simhilarity CHANGED

@@ -15,8 +15,11 @@ class Main
     # match
     tm = Time.now
-    matcher = Simhilarity::Bulk.new(options)
-    matches = matcher.matches(needle, haystack)
+    matcher = Simhilarity::Matcher.new
+    matcher.verbose = options[:verbose]
+    matcher.candidates = options[:candidates]
+    matcher.haystack = haystack
+    matches = matcher.matches(needle)
     if options[:verbose]
       tm = Time.now - tm

data/lib/simhilarity.rb CHANGED

@@ -1,8 +1,8 @@
 require "simhilarity/bits"
 require "simhilarity/candidate"
+require "simhilarity/candidates"
 require "simhilarity/element"
-require "simhilarity/matcher"
+require "simhilarity/score"
 require "simhilarity/version"
-require "simhilarity/bulk"
-require "simhilarity/single"
+require "simhilarity/matcher"

data/lib/simhilarity/candidate.rb CHANGED

@@ -1,44 +1,20 @@
 module Simhilarity
   # A potential match between two +Elements+. It can calculate it's own score.
   class Candidate
-    # matcher that owns this guy
-    attr_reader :matcher
     # first half of the candidate pair - the needle.
     attr_reader :a
     # first half of the candidate pair - the haystack.
     attr_reader :b
-    def initialize(matcher, a, b) #:nodoc:
-      @matcher = matcher
+    # the score between these two candidates
+    attr_accessor :score
+    def initialize(a, b) #:nodoc:
       @a = a
       @b = b
     end
-    # Calculate the score for this +Candidate+. The score is the {dice
-    # coefficient}[http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient],
-    # <tt>(2*c)/(a+b)</tt>.
-    #
-    # * +a+: the frequency weighted sum of the ngrams in a
-    # * +b+: the frequency weighted sum of the ngrams in b
-    # * +c+: the frequency weighted sum of the ngrams in (a & b)
-    #
-    # Lazily calculated and memoized.
-    def score
-      @score ||= begin
-        c = (self.a.ngrams & self.b.ngrams)
-        if c.length > 0
-          a = self.a.ngrams_sum
-          b = self.b.ngrams_sum
-          c = matcher.ngrams_sum(c)
-          (2.0 * c) / (a + b)
-        else
-          0
-        end
-      end
-    end
     def to_s #:nodoc:
       "Candidate #{score}: #{a.inspect}..#{b.inspect}"
     end

data/lib/simhilarity/candidates.rb ADDED

@@ -0,0 +1,91 @@
+module Simhilarity
+  module Candidates
+    # default minimum number # of ngram overlaps with :ngrams
+    DEFAULT_NGRAM_OVERLAPS = 3
+    # default maximum hamming distance with :simhash
+    DEFAULT_SIMHASH_MAX_HAMMING = 7
+    # Find candidates from +needles+ & +haystack+. The method used
+    # depends on the value of +candidates+
+    def candidates_for(needles)
+      # generate candidates
+      candidates_method = candidates_method(needles)
+      candidates = self.send(candidates_method, needles)
+      # if these are the same, no self-dups
+      if needles == haystack
+        candidates = candidates.reject { |n, h| n == h }
+      end
+      # map and return
+      candidates.map { |n, h| Candidate.new(n, h) }
+    end
+    # Select the method for finding candidates based on +candidates+.
+    def candidates_method(needles)
+      # pick the method
+      method = self.candidates
+      method ||= (needles.length * haystack.length < 200000) ? :all : :simhash
+      case method
+      when /^ngrams=(\d+)$/
+        method = :ngrams
+        self.ngram_overlaps = $1.to_i
+      when /^simhash=(\d+)$/
+        method = :simhash
+        self.simhash_max_hamming = $1.to_i
+      end
+      method = "candidates_#{method}".to_sym
+      if !respond_to?(method, true)
+        raise "unsupported candidates #{candidates.inspect}"
+      end
+      vputs "Using #{method} with needles=#{needles.length} haystack=#{haystack.length}..."
+      method
+    end
+    # Return ALL candidates. This only works for small datasets.
+    def candidates_all(needles)
+      needles.product(haystack)
+    end
+    # Return candidates that overlap with three or more matching
+    # ngrams. Only works for small datasets.
+    def candidates_ngrams(needles)
+      ngram_overlaps = self.ngram_overlaps || DEFAULT_NGRAM_OVERLAPS
+      candidates = []
+      veach(" ngrams #{ngram_overlaps}", needles) do |n|
+        ngrams_set = Set.new(n.ngrams)
+        haystack.each do |h|
+          count = 0
+          h.ngrams.each do |ngram|
+            if ngrams_set.include?(ngram)
+              if (count += 1) == ngram_overlaps
+                candidates << [n, h]
+                break
+              end
+            end
+          end
+        end
+      end
+      candidates
+    end
+    # Find candidates that are close based on hamming distance between
+    # the simhashes.
+    def candidates_simhash(needles)
+      max_hamming = self.simhash_max_hamming || DEFAULT_SIMHASH_MAX_HAMMING
+      # search for candidates with low hamming distance
+      candidates = []
+      veach(" hamming #{max_hamming}", needles) do |n|
+        bk_tree.query(n, max_hamming).each do |h, distance|
+          candidates << [n, h]
+        end
+      end
+      candidates
+    end
+  end
+end
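
Reviewer note: the selection logic in `candidates_method` above can be sketched in isolation as follows. This is a minimal sketch, not the gem's code: `default_candidates` and `parse_candidates` are hypothetical helper names, and returning a pair instead of assigning `ngram_overlaps` / `simhash_max_hamming` on the matcher is a simplification made for illustration.

```ruby
# Hypothetical standalone helpers (not part of the gem) mirroring the
# selection logic in Candidates#candidates_method.

# Pick :all for small inputs, :simhash otherwise, unless a method was given.
def default_candidates(explicit, needle_count, haystack_count)
  explicit || (needle_count * haystack_count < 200_000 ? :all : :simhash)
end

# Split "ngrams=4" / "simhash=8" into a method name and its numeric setting.
# Plain names such as :all come back with a nil setting.
def parse_candidates(value)
  case value.to_s
  when /^ngrams=(\d+)$/ then [:ngrams, $1.to_i]
  when /^simhash=(\d+)$/ then [:simhash, $1.to_i]
  else [value.to_sym, nil]
  end
end

p parse_candidates("ngrams=4")
p parse_candidates(:all)
p default_candidates(nil, 100, 1_000)
p default_candidates(nil, 1_000, 1_000)
```

The 200,000 threshold is taken from the diff; everything else about the helpers is an assumption.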

data/lib/simhilarity/matcher.rb CHANGED

@@ -1,70 +1,95 @@
+require "bk"
+require "set"
 require "progressbar"
 module Simhilarity
-  # Abstract superclass for matching. Mainly a container for options, corpus, etc.
   class Matcher
-    # Options used to create this Matcher.
-    attr_accessor :options
+    include Simhilarity::Candidates
+    include Simhilarity::Score
-    # Proc for turning needle/haystack elements into strings. You can
-    # leave this nil if the elements are already strings. See
-    # Matcher#reader for the default implementation.
+    # If true, show progress bars and timing
+    attr_accessor :verbose
+    # Proc for turning opaque items into strings.
     attr_accessor :reader
-    # Proc for normalizing input strings. See Matcher#normalize
-    # for the default implementation.
+    # Proc for normalizing strings.
     attr_accessor :normalizer
-    # Proc for generating ngrams from a normalized string. See
-    # Matcher#ngrams for the default implementation.
+    # Proc for generating ngrams.
     attr_accessor :ngrammer
-    # Ngram frequency weights from the corpus, or 1 if the ngram isn't
-    # in the corpus.
-    attr_accessor :freq
+    # Proc for scoring ngrams.
+    attr_accessor :scorer
+    # Specifies which method to use for finding candidates. See the
+    # README for more details.
+    attr_accessor :candidates
-    # Create a new Matcher matcher. Options include:
-    #
-    # * +reader+: Proc for turning opaque items into strings.
-    # * +normalizer+: Proc for normalizing strings.
-    # * +ngrammer+: Proc for generating ngrams.
-    # * +verbose+: If true, show progress bars and timing.
-    def initialize(options = {})
-      @options = options
+    # Minimum number of ngram overlaps, defaults to 3 (for candidates
+    # = :ngrams)
+    attr_accessor :ngram_overlaps
-      # procs
-      self.reader = options[:reader]
-      self.normalizer = options[:normalizer]
-      self.ngrammer = options[:ngrammer]
+    # Maximum simhash hamming distance, defaults to 7. (for candidates
+    # = :simhash)
+    attr_accessor :simhash_max_hamming
-      reset_corpus
+    # Set the haystack.
+    def haystack=(haystack)
+      @haystack = import_list(haystack)
+      # this stuff is lazily calculated from the haystack, and needs
+      # to be reset whenever the haystack changes.
+      @bitsums = { }
+      @bk_tree = nil
+      @freq = nil
     end
-    # Set the corpus. Calculates ngram frequencies (#freq) for future
-    # scoring.
-    def corpus=(corpus)
-      @corpus = corpus
+    # The current haystack.
+    def haystack
+      @haystack
+    end
-      reset_corpus
+    # Ngram frequency weights from the haystack, or 1 if the ngram
+    # isn't in the haystack. Lazily calculated.
+    def freq
+      @freq ||= begin
+        # calculate ngram counts for the haystack
+        counts = Hash.new(0)
+        veach("Haystack", @haystack) do |element|
+          element.ngrams.each do |ngram|
+            counts[ngram] += 1
+          end
+        end
-      # calculate ngram counts for the corpus
-      counts = Hash.new(0)
-      veach("Corpus", import_list(corpus)) do |element|
-        element.ngrams.each do |ngram|
-          counts[ngram] += 1
+        # turn counts into inverse frequencies
+        map = Hash.new(1)
+        total = counts.values.inject(&:+).to_f
+        counts.each do |ngram, count|
+          map[ngram] = ((total / count) * 10).round
         end
+        map
       end
+    end
-      # turn counts into inverse frequencies
-      total = counts.values.inject(&:+).to_f
-      counts.each do |ngram, count|
-        @freq[ngram] = ((total / count) * 10).round
+    # Match each item in +needles+ to an item in #haystack. Returns an
+    # array of tuples, <tt>[needle, haystack, score]</tt>. Scores
+    # range from 0 to 1, with 1 being a perfect match and 0 being a
+    # terrible match.
+    def matches(needles)
+      if haystack.nil?
+        raise RuntimeError.new('can\'t match before setting a haystack')
       end
-    end
-    # The current corpus.
-    def corpus
-      @corpus
+      # create Elements
+      needles = import_list(needles)
+      # get candidate matches
+      candidates = candidates_for(needles)
+      vputs " got #{candidates.length} candidates."
+      # pick winners
+      winners(needles, candidates)
     end
     # Turn an opaque item from the user into a string.
@@ -108,7 +133,7 @@ module Simhilarity
     # Sum up the frequency weights of the +ngrams+.
     def ngrams_sum(ngrams)
-      ngrams.map { |i| @freq[i] }.inject(&:+) || 0
+      ngrams.map { |i| freq[i] }.inject(&:+) || 0
     end
     # Calculate the frequency weighted
@@ -147,9 +172,16 @@ module Simhilarity
       Element.new(self, opaque)
     end
-    def reset_corpus
-      @freq = Hash.new(1)
-      @bitsums = { }
+    def bk_tree
+      @bk_tree ||= begin
+        # calculate this first so we get a nice progress bar
+        veach(" simhash", haystack) { |i| i.simhash }
+        # build the bk tree
+        tree = BK::Tree.new(lambda { |a, b| Bits.hamming32(a.simhash, b.simhash) })
+        veach(" bktree", haystack) { |i| tree.add(i) }
+        tree
+      end
     end
     # calculate the simhash bitsums for this +ngram+, as part of
@@ -171,14 +203,14 @@ module Simhilarity
       end
     end
-    # Puts if options[:verbose]
+    # Puts if +verbose+ is true
     def vputs(s)
-      $stderr.puts s if options[:verbose]
+      $stderr.puts s if verbose
     end
-    # Like each, but with a progress bar if options[:verbose]
+    # Like each, but with a progress bar if +verbose+ is true
     def veach(title, array, &block)
-      if !options[:verbose]
+      if !verbose
         array.each do |i|
           yield(i)
         end
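
Reviewer note: the lazily computed `freq` table added in this file turns ngram counts into inverse-frequency weights. A minimal sketch of that calculation, assuming each haystack entry has already been reduced to an array of ngrams (`inverse_frequencies` is a hypothetical name, and the sample ngrams are made up):

```ruby
# Hypothetical standalone version of the weighting done in Matcher#freq.
# Assumes each document is already an array of ngrams.
def inverse_frequencies(docs)
  # count how often each ngram occurs across all documents
  counts = Hash.new(0)
  docs.each { |ngrams| ngrams.each { |ngram| counts[ngram] += 1 } }

  # rarer ngrams get larger weights; ngrams never seen default to 1
  weights = Hash.new(1)
  total = counts.values.inject(0, :+).to_f
  counts.each { |ngram, count| weights[ngram] = ((total / count) * 10).round }
  weights
end

weights = inverse_frequencies([%w[ab bc], %w[ab cd]])
p weights["ab"]
p weights["bc"]
p weights["zz"]
```

Here `ab` occurs twice out of four ngrams and `bc` once, so `bc` ends up with twice the weight of `ab`, while the unseen `zz` falls back to the default of 1.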

data/lib/simhilarity/score.rb ADDED

@@ -0,0 +1,56 @@
+module Simhilarity
+  module Score
+    # walk candidates by score, pick winners
+    def winners(needles, candidates)
+      # calculate this first so we get a nice progress bar
+      veach("Scoring", candidates) do |i|
+        i.score = score(i)
+      end
+      # sort by score
+      candidates = candidates.sort_by { |i| -i.score }
+      # walk them, eliminate dups
+      seen = Set.new
+      winners = candidates.map do |i|
+        next if seen.include?(i.a)
+        seen << i.a
+        i
+      end.compact
+      # build map from needle => candidate...
+      needle_to_winner = { }
+      winners.each { |i| needle_to_winner[i.a] = i }
+      # so we can return in the original order
+      needles.map do |i|
+        if candidate = needle_to_winner[i]
+          [ i.opaque, candidate.b.opaque, candidate.score ]
+        else
+          [ i.opaque, nil, nil ]
+        end
+      end
+    end
+    # Score a +Candidate+. The default implementation is the {dice
+    # coefficient}[http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient],
+    # <tt>(2*c)/(a+b)</tt>.
+    #
+    # * +a+: the frequency weighted sum of the ngrams in a
+    # * +b+: the frequency weighted sum of the ngrams in b
+    # * +c+: the frequency weighted sum of the ngrams in (a & b)
+    def score(candidate)
+      if scorer
+        return scorer.call(candidate)
+      end
+      c = (candidate.a.ngrams & candidate.b.ngrams)
+      return 0 if c.length == 0
+      a = candidate.a.ngrams_sum
+      b = candidate.b.ngrams_sum
+      c = ngrams_sum(c)
+      (2.0 * c) / (a + b)
+    end
+  end
+end
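
Reviewer note: the default `score` above is a frequency-weighted Dice coefficient. A minimal sketch under stated assumptions: ngrams are plain arrays of strings, `weights` is a hash like the one `freq` returns, and `weighted_dice` is a hypothetical name. How `Element#ngrams_sum` treats repeated ngrams is not shown in this diff, so the sketch simply sums over the arrays as given.

```ruby
# Hypothetical standalone version of the default scoring in Score#score.
def weighted_dice(a_ngrams, b_ngrams, weights)
  common = a_ngrams & b_ngrams
  return 0 if common.empty?

  # sum of the frequency weights for a list of ngrams
  sum = lambda { |ngrams| ngrams.map { |ngram| weights[ngram] }.inject(0, :+) }
  (2.0 * sum.call(common)) / (sum.call(a_ngrams) + sum.call(b_ngrams))
end

uniform = Hash.new(1) # every ngram weighs 1, i.e. the plain Dice coefficient
p weighted_dice(%w[he el ll lo], %w[he el lp], uniform)
p weighted_dice(%w[he el], %w[he el], uniform)
p weighted_dice(%w[he el], %w[xy yz], uniform)
```

With uniform weights the first call reduces to `2 * 2 / (4 + 3)`; identical inputs score 1.0 and disjoint inputs score 0.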

data/lib/simhilarity/version.rb CHANGED

@@ -1,4 +1,4 @@
 module Simhilarity
   # Gem version
-  VERSION = "1.0.2"
+  VERSION = "1.0.3"
 end

data/test/tests.rb CHANGED

@@ -29,15 +29,17 @@ class Tests < Test::Unit::TestCase
     sample
   end
-  def assert_bulk_candidates(candidates, percent)
+  def assert_candidates(candidates, percent)
     sample = self.sample
     # match, with benchmark
     output = nil
     Benchmark.bm(10) do |bm|
       bm.report(candidates.to_s) do
-        matcher = Simhilarity::Bulk.new(candidates: candidates)
-        output = matcher.matches(sample.needle, sample.haystack)
+        matcher = Simhilarity::Matcher.new
+        matcher.candidates = candidates
+        matcher.haystack = sample.haystack
+        output = matcher.matches(sample.needle)
       end
     end
@@ -55,9 +57,11 @@ class Tests < Test::Unit::TestCase
     assert((correct - percent).abs < 0.001, "percent #{correct} != #{percent}")
   end
+  TMP = "/tmp/_simhilarity_tests.txt"
   def assert_system(cmd)
-    system("#{cmd} > /dev/null 2>&1")
-    assert($? == 0, "#{cmd} failed")
+    system("#{cmd} > #{TMP} 2>&1")
+    assert($? == 0, File.read(TMP))
   end
   #
@@ -70,46 +74,62 @@ class Tests < Test::Unit::TestCase
     # not a string
     assert_raise(RuntimeError) { @matcher.read(123) }
-    # custom
-    @matcher.reader = lambda(&:key)
-    assert_equal @matcher.read(OpenStruct.new(key: "gub")), "gub"
   end
   def test_normalizer
     # default
     assert_equal @matcher.normalize(" HELLO,\tWORLD! "), "hello world"
-    # custom
-    @matcher.normalizer = lambda(&:upcase)
-    assert_equal @matcher.normalize("gub"), "GUB"
   end
   def test_ngrams
     # default
     assert_equal @matcher.ngrams("hi 42"), ["hi", "i ", " 4", "42"]
-    # custom
-    @matcher.ngrammer = lambda(&:split)
-    assert_equal @matcher.ngrams("hi 42"), ["hi", "42"]
   end
   def test_proc_options
-    matcher = Simhilarity::Matcher.new(reader: lambda(&:key), normalizer: lambda(&:upcase), ngrammer: lambda(&:split))
+    matcher = Simhilarity::Matcher.new
+    matcher.reader = lambda(&:key)
+    matcher.normalizer = lambda(&:upcase)
+    matcher.ngrammer =  lambda(&:split)
     assert_equal matcher.read(OpenStruct.new(key: "gub")), "gub"
     assert_equal matcher.normalize("gub"), "GUB"
     assert_equal matcher.ngrams("hi 42"), ["hi", "42"]
   end
-  def test_single
-    score = Simhilarity::Single.new.score("hello world", "hi worlds")
-    assert (score - 0.556).abs < 0.001, "test_single percent was wrong!"
+  def test_no_selfdups
+    # if you pass in the same list twice, it should ignore self-dups
+    list = ["hello, world", "hello there"]
+    @matcher.haystack = list
+    matches = @matcher.matches(@matcher.haystack)
+    assert_not_equal matches[0][1], "hello, world"
+  end
+  def test_haystack_required
+    # if you do not set a haystack, the matcher should yell
+    matcher = Simhilarity::Matcher.new
+    assert_raise RuntimeError do
+      matches = matcher.matches(['FOOM'])
+    end
   end
-  def test_bulk
-    assert_bulk_candidates(:all, 0.974)
-    assert_bulk_candidates(:ngrams, 0.974)
-    assert_bulk_candidates(:simhash, 0.949)
+  def test_one_result_can_win_multiple_times
+    # We should be able to find the same piece of hay multiple times for
+    # different needles.
+    haystack = ['Black Sabbath', 'Led Zeppelin', 'The Doors',
+                'The Beatles', 'Neil Young']
+    needles = ['blak sabbath', 'black sabath', 'block soborch']
+    @matcher.haystack = haystack
+    # Whether matched individually or as a group, all of these needles
+    # should produce the same result.
+    matches = @matcher.matches(needles)
+    needles.each do |n|
+      matches.concat @matcher.matches([n])
+    end
+    matches.each do |n, h, s|
+      assert_equal 'Black Sabbath', h
+    end
   end
   def test_bin
@@ -122,4 +142,10 @@ class Tests < Test::Unit::TestCase
     assert_system("#{bin} --candidates ngrams=3 identity.txt identity.txt")
     assert_system("#{bin} --candidates all identity.txt identity.txt")
   end
+  def test_candidates
+    assert_candidates(:all, 0.949)
+    assert_candidates(:ngrams, 0.949)
+    assert_candidates(:simhash, 0.949)
+  end
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: simhilarity
 version: !ruby/object:Gem::Version
-  version: 1.0.2
+  version: 1.0.3
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-04-18 00:00:00.000000000 Z
+date: 2013-04-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bk
@@ -100,6 +100,7 @@ extensions: []
 extra_rdoc_files: []
 files:
 - .gitignore
+- .travis.yml
 - Gemfile
 - LICENSE
 - README.md
@@ -107,11 +108,11 @@ files:
 - bin/simhilarity
 - lib/simhilarity.rb
 - lib/simhilarity/bits.rb
-- lib/simhilarity/bulk.rb
 - lib/simhilarity/candidate.rb
+- lib/simhilarity/candidates.rb
 - lib/simhilarity/element.rb
 - lib/simhilarity/matcher.rb
-- lib/simhilarity/single.rb
+- lib/simhilarity/score.rb
 - lib/simhilarity/version.rb
 - simhilarity.gemspec
 - test/harness
@@ -140,7 +141,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: -1024001634221116929
+      hash: 3122268769366489382
 requirements: []
 rubyforge_project: simhilarity
 rubygems_version: 1.8.24

data/lib/simhilarity/bulk.rb DELETED

@@ -1,163 +0,0 @@
-require "bk"
-require "set"
-module Simhilarity
-  # Match a set of needles against a haystack, in bulk. For example,
-  # this is used if you want to match 50 new addresses against your
-  # database of 1,000 known addresses.
-  class Bulk < Matcher
-    # default minimum number # of ngram overlaps with :ngrams
-    DEFAULT_NGRAM_OVERLAPS = 3
-    # default maximum hamming distance with :simhash
-    DEFAULT_SIMHASH_MAX_HAMMING = 7
-    # Initialize a new Bulk matcher. See Matcher#initialize. Bulk adds
-    # these options:
-    #
-    # * +candidates+: specifies which method to use for finding
-    #   candidates. See the README for more details.
-    # * +ngrams_overlaps+: Minimum number of ngram overlaps, defaults
-    #   to 3.
-    # * +simhash_max_hamming+: Maximum simhash hamming distance,
-    #   defaults to 7.
-    def initialize(options = {})
-      super(options)
-    end
-    # Match each item in +needles+ to an item in +haystack+. Returns
-    # an array of tuples, <tt>[needle, haystack, score]</tt>. Scores
-    # range from 0 to 1, with 1 being a perfect match and 0 being a
-    # terrible match.
-    def matches(needles, haystack)
-      # create Elements
-      if needles == haystack
-        needles = haystack = import_list(needles)
-        # set the corpus, to generate frequency weights
-        self.corpus = needles
-      else
-        needles = import_list(needles)
-        haystack = import_list(haystack)
-        # set the corpus, to generate frequency weights
-        self.corpus = (needles + haystack)
-      end
-      # get candidate matches
-      candidates = candidates(needles, haystack)
-      vputs " got #{candidates.length} candidates."
-      # pick winners
-      winners(needles, candidates)
-    end
-    protected
-    # Find candidates from +needles+ & +haystack+. The method used
-    # depends on the value of options[:candidates]
-    def candidates(needles, haystack)
-      method = options[:candidates]
-      method ||= (needles.length * haystack.length < 200000) ? :all : :simhash
-      case method
-      when /^ngrams=(\d+)$/
-        method = :ngrams
-        options[:ngram_overlaps] = $1.to_i
-      when /^simhash=(\d+)$/
-        method = :simhash
-        options[:simhash_max_hamming] = $1.to_i
-      end
-      method = "candidates_#{method}".to_sym
-      if !respond_to?(method)
-        raise "unsupported options[:candidates] #{options[:candidates].inspect}"
-      end
-      vputs "Using #{method} with needles=#{needles.length} haystack=#{haystack.length}..."
-      self.send(method, needles, haystack).map do |n, h|
-        Candidate.new(self, n, h)
-      end
-    end
-    # Return ALL candidates. This only works for small datasets.
-    def candidates_all(needles, haystack)
-      needles.product(haystack)
-    end
-    # Return candidates that overlap with three or more matching
-    # ngrams. Only works for small datasets.
-    def candidates_ngrams(needles, haystack)
-      ngram_overlaps = options[:ngram_overlaps] || DEFAULT_NGRAM_OVERLAPS
-      candidates = []
-      veach(" ngrams #{ngram_overlaps}", needles) do |n|
-        ngrams_set = Set.new(n.ngrams)
-        haystack.each do |h|
-          count = 0
-          h.ngrams.each do |ngram|
-            if ngrams_set.include?(ngram)
-              if (count += 1) == ngram_overlaps
-                candidates << [n, h]
-                break
-              end
-            end
-          end
-        end
-      end
-      candidates
-    end
-    # Find candidates that are close based on hamming distance between
-    # the simhashes.
-    def candidates_simhash(needles, haystack)
-      max_hamming = options[:simhash_max_hamming] || DEFAULT_SIMHASH_MAX_HAMMING
-      # calculate this first so we get a nice progress bar
-      veach(" simhash", corpus) { |i| i.simhash }
-      # build the bk tree
-      bk = BK::Tree.new(lambda { |a, b| Bits.hamming32(a.simhash, b.simhash) })
-      veach(" bktree", haystack) { |i| bk.add(i) }
-      # search for candidates with low hamming distance
-      candidates = []
-      veach(" hamming #{max_hamming}", needles) do |n|
-        bk.query(n, max_hamming).each do |h, distance|
-          candidates << [n, h]
-        end
-      end
-      candidates
-    end
-    # walk candidates by score, pick winners
-    def winners(needles, candidates)
-      # calculate this first so we get a nice progress bar
-      veach("Scoring", candidates) { |i| i.score }
-      # score the candidates
-      candidates = candidates.sort_by { |i| -i.score }
-      # walk them, eliminate dups
-      seen = Set.new
-      winners = candidates.map do |i|
-        next if seen.include?(i.a) || seen.include?(i.b)
-        seen << i.a
-        seen << i.b
-        i
-      end.compact
-      # build map from needle => candidate...
-      needle_to_winner = { }
-      winners.each { |i| needle_to_winner[i.a] = i }
-      # so we can return in the original order
-      needles.map do |i|
-        if candidate = needle_to_winner[i]
-          [ i.opaque, candidate.b.opaque, candidate.score ]
-        else
-          [ i.opaque, nil, nil ]
-        end
-      end
-    end
-  end
-end
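
Reviewer note: the deleted `Bulk#winners` above marked both the needle and the haystack item as seen, whereas the replacement in `score.rb` marks only the needle, so one haystack item can now win for several needles. A minimal sketch contrasting the two behaviours; `pick_winners`, the `exclusive_hay` flag, and the scores are hypothetical and exist only for illustration.

```ruby
require "set"

# Hypothetical helper. pairs is an array of [needle, hay, score] tuples.
# exclusive_hay = true  mimics the deleted Bulk#winners (hay is used once).
# exclusive_hay = false mimics the new Score#winners (only needles are unique).
def pick_winners(pairs, exclusive_hay)
  seen = Set.new
  pairs.sort_by { |_, _, score| -score }.select do |needle, hay, _|
    next false if seen.include?(needle)
    next false if exclusive_hay && seen.include?(hay)
    seen << needle
    seen << hay if exclusive_hay
    true
  end
end

# Made-up scores, chosen so both needles prefer the same hay.
pairs = [
  ["blak sabbath", "Black Sabbath", 0.9],
  ["black sabath", "Black Sabbath", 0.8],
  ["black sabath", "Led Zeppelin", 0.1],
]

p pick_winners(pairs, true).map { |needle, hay, _| [needle, hay] }
p pick_winners(pairs, false).map { |needle, hay, _| [needle, hay] }
```

In the exclusive variant the second needle is pushed to its weaker match because its preferred hay is already taken; this is the situation `test_one_result_can_win_multiple_times` guards against.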

data/lib/simhilarity/single.rb DELETED

@@ -1,18 +0,0 @@
-require "set"
-module Simhilarity
-  # Calculate the similarity score for pairs of items, one at a time.
-  class Single < Matcher
-    # See Matcher#initialize.
-    def initialize(options = {})
-      super(options)
-    end
-    # Calculate the similarity score for these two items. Scores range
-    # from 0 to 1, with 1 being a perfect match and 0 being a terrible
-    # match. For best results, call #corpus= first.
-    def score(a, b)
-      Candidate.new(self, element_for(a), element_for(b)).score
-    end
-  end
-end