RubyGems - simhilarity - Versions diffs - 1.0.0 - Mend

simhilarity 1.0.0

Files changed (22)

data/.gitignore +5 -0
data/Gemfile +2 -0
data/LICENSE +20 -0
data/README.md +165 -0
data/Rakefile +18 -0
data/bin/simhilarity +84 -0
data/lib/simhilarity/bits.rb +62 -0
data/lib/simhilarity/bulk.rb +163 -0
data/lib/simhilarity/candidate.rb +46 -0
data/lib/simhilarity/element.rb +50 -0
data/lib/simhilarity/matcher.rb +164 -0
data/lib/simhilarity/single.rb +18 -0
data/lib/simhilarity/version.rb +4 -0
data/lib/simhilarity.rb +8 -0
data/simhilarity.gemspec +27 -0
data/test/harness +138 -0
data/test/identity.txt +1 -0
data/test/large_haystack.txt +10000 -0
data/test/large_needles.txt +500 -0
data/test/sample.csv +2669 -0
data/test/tests.rb +125 -0
metadata +156 -0

data/.gitignore ADDED

@@ -0,0 +1,5 @@
+*.gem
+.bundle
+Gemfile.lock
+pkg/*
+rdoc

data/Gemfile ADDED

@@ -0,0 +1,2 @@
+source "http://rubygems.org"
+gemspec

data/LICENSE ADDED

@@ -0,0 +1,20 @@
+Copyright (c) 2013 Adam Doppelt
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,165 @@
+# Welcome to simhilarity
+Simhilarity is a gem for quickly matching up text strings that are similar but not identical. Here is how it works:
+1. Normalize strings. Downcase, remove non-alpha, etc:
+   ```ruby
+   normalize("Hello,  WORLD!") => "hello world"
+   ```
+1. Calculate [ngrams](http://en.wikipedia.org/wiki/N-gram) from strings. Specifically, it creates bigrams (2 character ngrams) and also creates an ngram for each sequence of digits in the string:
+   ```ruby
+                        # bigrams                        # digits
+   ngrams("hi 123") => ["hi", "i ", " 1", "12", "23"] + ["123"]
+   ```
+1. Calculate frequency of ngrams in the corpus.
+1. Select pairs of strings that might be matches. These are called **candidates**, and there are a few different ways they are chosen - see [options](#options). Simhilarity will try to pick the best method based on the size of your data set.
+1. Score candidates by measuring ngram overlap (with frequency weighting), using the [dice coefficient](http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient).
+1. For each input string, return the match with the highest score.
+Here is output from a sample run:
+```
+score   needle                                      haystack
+1.000   Night Heron 19                              Night Heron 19
+1.000   103 Oceanwood                               103 Oceanwood
+0.987   Sea Crest 1504                              1504 Sea Crest
+0.986   Twin Oaks 189                               189 Twin Oaks
+0.981   Sea Crest 1205                              1205 Sea Crest
+0.980   Sea Crest 2411                              2411 Sea Crest
+0.972   Sea Crest 3405                              3405 Sea Crest
+0.968   Barrington Arms 504                         504 Barrington Arms
+0.964   Windsor Place 503                           503 Windsor Place
+0.951   1802 Bluff Villas - Hilton Head Island      1802 Bluff Villas
+0.943   3221 Villamare - Hilton Head Island         3221 Villamare
+0.941   134 Shorewood - Hilton Head Island          134 Shorewood
+0.900   1 Quail Street                              1 Quail
+0.894   2 Quail Street                              2 Quail
+0.823   Windsor II 2315                             2315 Windsor Place II
+0.736   Beachside Tennis 12                         12 Beachside
+0.732   16 Piping Plover - Hilton Head Island       16 Piping Plover
+0.460   7 Quail                                     7 QUAIL/126 Dune Lane
+0.379   11 Battery                                  11 Gunnery
+```
+Note that the final match has the lowest score, and is incorrect!
+## Usage
+### simhilarity executable
+The gem includes an executable called `simhilarity`. For example:
+```sh
+$ simhilarity needles.txt haystack.txt
+score,needle,haystack
+0.900,1 Quail Street,1 Quail
+1.000,103 Oceanwood,103 Oceanwood
+...
+```
+It prints the best matches between needle and haystack in CSV format. Use `simhilarity --verbose` to watch progress bars while it runs. Use `--candidates` to choose the candidate selection method, which dramatically affects performance for large data sets.
+### Simhilarity::Bulk
+To use simhilarity from code, create a `Bulk` and call `matches(needles, haystack)`. It'll return an array of tuples, `[needle, haystack, score]`. By default, simhilarity assumes that needles and haystack are arrays of strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
+### Simhilarity::Single
+Sometimes it's useful to just calculate the score between two strings. For example, if you just want a title similarity measurement as part of some larger analysis between two books. Create a `Single` and call `score(a, b)` to measure similarity between those two items. By default, simhilarity assumes that needle and haystack are strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
+Important note: For best results with `Single`, set the corpus so that simhilarity can calculate ngram frequencies. This can dramatically improve accuracy. `Bulk` will do this automatically because it has access to the corpus, but `Single` doesn't. Call `corpus=` manually when using `Single`.
+<a name="benchmarks"/>
+## Benchmarks
+When looking at simhilarity's speed, there are two important aspects to consider:
+* **picking candidates** - how long does it take to pick decent candidates out of all the potential string pairs?
+* **matching** - once candidates are identified, how long does it take to score them?
+#### Picking Candidates
+There are three different methods for picking candidates - see [options](#options) for a detailed explanation. Here are some numbers from my 3 GHz i5, for a test dataset consisting of 500 needles and 10,000 haystacks.
+```
+method      time   candidates returned
+simhash 5   4s     3,500
+simhash 6   7s     5,000
+simhash 7   9s     10,000   (this is the default)
+simhash 8   12s    25,000
+simhash 9   13s    60,000
+ngrams 5    46s    1,000,000
+ngrams 4    44s    1,500,000
+ngrams 3    40s    2,100,000
+all         3.9s   5,000,000
+```
+#### Matching
+Once candidates are identified, the string pairs are scored and winners are picked out. Scoring is O(n) in the number of candidates. On my 3 GHz i5:
+```
+candidates   time
+25,000       1s
+60,000       2s
+1,000,000    35s
+5,000,000    190s
+```
+<a name="options"/>
+## Options
+There are a few ways to configure simhilarity:
+* **candidates** - controls how candidates are picked from the complete set of all string pairs. We want to avoid looking at all string pairs, because that's quite expensive for large datasets. On the other hand, if we examine too few we might miss some of the best matches. A conundrum. There are three different settings:
+  `:simhash` - generate a weighted [simhash](http://matpalm.com/resemblance/simhash/) for each string, then iterate over the needles and look for "nearby" haystack simhashes using a [bktree](https://github.com/threedaymonk/bktree). Simhashes are compared using the [hamming distance](http://en.wikipedia.org/wiki/Hamming_distance). If the hamming distance between the simhashes is <= `options[:simhash_max_hamming]`, the pair becomes a candidate. The default max hamming distance is 7 - see [benchmarks](#benchmarks) to get a sense for how different values perform.
+  `:ngrams` - for each pair of strings, count the number of ngrams they have in common. If the overlap is >= `options[:ngram_overlaps]`, the pair becomes a candidate. The default minimum number of overlaps is 3 - see [benchmarks](#benchmarks) to get a sense for how different values perform.
+  `:all` - all pairs are examined. This is completely braindead and very slow for large datasets.
+  Simhash works great, but there's no reason not to use `:ngrams` or even `:all` for small data sets. In fact, that's what simhilarity does by default - if you use a small dataset (needle * haystack < 200,000) it defaults to `:all`, otherwise it uses `:simhash`. Some examples:
+  ```ruby
+  Simhilarity::Bulk.new  # defaults to :all or :simhash based on size
+  Simhilarity::Bulk.new(candidates: :simhash)
+  Simhilarity::Bulk.new(candidates: :simhash, simhash_max_hamming: 8)
+  Simhilarity::Bulk.new(candidates: :ngrams, ngram_overlaps: 4)
+  ```
+  or:
+  ```
+  $ simhilarity --candidates simhash   needles.txt haystack.txt
+  $ simhilarity --candidates simhash=8 needles.txt haystack.txt
+  $ simhilarity --candidates ngrams    needles.txt haystack.txt
+  $ simhilarity --candidates ngrams=4  needles.txt haystack.txt
+  ```
+* **reader** - proc for converting your opaque objects into strings. Set this to use something other than strings for source data. For example, if you want to match author names between ActiveRecord book objects:
+   ```ruby
+   matcher.reader = lambda { |i| i.author }
+   matcher.matches(needles, haystack)
+   ```
+* **normalizer** - proc for normalizing incoming strings. The default normalizer downcases, removes non-alphas, and strips whitespace.
+* **ngrammer** - proc for converting normalized strings into ngrams. The default ngrammer pulls out bigrams and runs of digits, which is perfect for matching names and addresses.
+* **verbose** - if true, show progress while simhilarity is working. Great for the impatient. Use --verbose from the command line.
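
The first two README steps (normalize, then extract ngrams) can be sketched in a few lines. This is a minimal illustration that only reproduces the two README examples. The helper names `normalize` and `ngrams` here are standalone, hypothetical functions, and the gem's actual defaults live in `matcher.rb` and may differ in detail.

```ruby
# Hypothetical helpers that mirror the README examples; not the gem's code.

# Downcase, turn runs of non-alphanumerics into one space, trim the ends.
def normalize(str)
  str.downcase.gsub(/[^a-z0-9]+/, " ").strip
end

# Bigrams (every pair of adjacent characters) plus each run of digits.
def ngrams(str)
  bigrams = str.chars.each_cons(2).map(&:join)
  digits  = str.scan(/\d+/)
  bigrams + digits
end

normalized = normalize("Hello,  WORLD!")  # "hello world"
grams      = ngrams("hi 123")             # ["hi", "i ", " 1", "12", "23", "123"]
```

Keeping whole digit runs as their own ngram means "1504" counts as a single strong signal, which suits address-like strings where the number often identifies the match.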

data/Rakefile ADDED

@@ -0,0 +1,18 @@
+require "bundler/gem_tasks"
+require "rake/testtask"
+require "rdoc/task"
+Bundler::GemHelper.install_tasks
+# testing
+Rake::TestTask.new(:test) do |test|
+  test.libs << "test"
+end
+task default: :test
+# rdoc
+RDoc::Task.new do |rdoc|
+  rdoc.rdoc_dir = "rdoc"
+  rdoc.title = "simhilarity #{Simhilarity::VERSION}"
+  rdoc.rdoc_files.include("lib/**/*.rb")
+end

data/bin/simhilarity ADDED

@@ -0,0 +1,84 @@
+#!/usr/bin/env ruby
+# in lieu of -w, since we're using env to startup
+$VERBOSE = true
+require "csv"
+require "optparse"
+require "simhilarity"
+class Main
+  def initialize(options = {})
+    # load
+    needle   = File.readlines(options[:needle]).map(&:chomp)
+    haystack = File.readlines(options[:haystack]).map(&:chomp)
+    # match
+    tm = Time.now
+    matcher = Simhilarity::Bulk.new(options)
+    matches = matcher.matches(needle, haystack)
+    if options[:verbose]
+      tm = Time.now - tm
+      $stderr.printf("Simhilarity finished in %.3fs.\n\n", tm)
+    end
+    # now report
+    csv = CSV.new($stdout)
+    csv << %w(score needle haystack)
+    matches.each do |n, h, score|
+      csv << [sprintf("%4.3f", score || 0), n, h]
+    end
+  end
+end
+#
+# parse command line
+#
+options = { }
+opt = OptionParser.new do |o|
+  o.banner = <<EOF
+simhilarity matches lines of text between needle_file and
+haystack_file, then prints a report. Potential matches are scored
+using frequency weighted ngrams.
+Usage: simhilarity [options] <needle_file> <haystack_file>
+EOF
+  o.on("-v", "--verbose", "enable verbose/progress output") do |f|
+    options[:verbose] = true
+  end
+  o.on("-c", "--candidates [CANDIDATES]", "set candidates search method") do |f|
+    options[:candidates] = f
+  end
+  o.on_tail("-h", "--help", "print this help text") do
+    puts opt
+    exit 0
+  end
+end
+begin
+  opt.parse!
+rescue OptionParser::InvalidOption, OptionParser::MissingArgument => e
+  puts e
+  puts opt
+  exit 1
+end
+# mandatory args
+if ARGV.length != 2
+  puts opt
+  exit 1
+end
+options[:needle] = ARGV[0]
+options[:haystack] = ARGV[1]
+%w(needle haystack).map(&:to_sym).each do |i|
+  if !File.exist?(options[i])
+    puts "error: #{i.capitalize} file #{options[i].inspect} doesn't exist."
+    exit 1
+  end
+end
+Main.new(options)

data/lib/simhilarity/bits.rb ADDED

@@ -0,0 +1,62 @@
+require "digest"
+module Simhilarity
+  module Bits
+    # Calculate the {hamming
+    # distance}[http://en.wikipedia.org/wiki/Hamming_distance] between
+    # two integers. Not particularly fast.
+    def self.hamming(a, b)
+      x, d = 0, a ^ b
+      while d > 0
+        x += 1
+        d &= d - 1
+      end
+      x
+    end
+    HAMMING8  = (0..0xff).map { |i| Bits.hamming(0, i) }
+    HAMMING16 = (0..0xffff).map { |i| HAMMING8[(i >> 8) & 0xff] + HAMMING8[(i >> 0) & 0xff] }
+    # Calculate the {hamming
+    # distance}[http://en.wikipedia.org/wiki/Hamming_distance] between
+    # two 32 bit integers using a lookup table. This is fast.
+    def self.hamming32(a, b)
+      x = a ^ b
+      a = (x >> 16) & 0xffff
+      b = (x >>  0) & 0xffff
+      HAMMING16[a] + HAMMING16[b]
+    end
+    # can't rely on ruby hash, because it's not consistent across
+    # sessions. Let's just use MD5.
+    def self.nhash(ngram)
+      @hashes ||= { }
+      @hashes[ngram] ||= Digest::MD5.hexdigest(ngram).to_i(16)
+    end
+    # Calculate the frequency weighted
+    # simhash[http://matpalm.com/resemblance/simhash/] of the
+    # +ngrams+.
+    def self.simhash32(freq, ngrams)
+      # array of bit sums
+      bits = Array.new(32, 0)
+      # walk bits of ngram's hash, increase/decrease bit sums
+      ngrams.each do |ngram|
+        f = freq[ngram]
+        hash = nhash(ngram)
+        (0...32).each do |i|
+          bits[i] += (((hash >> i) & 1) == 1) ? f : -f
+        end
+      end
+      # calculate simhash based on whether bit sums are negative or
+      # positive
+      simhash = 0
+      (0...32).each do |bit|
+        simhash |= (1 << bit) if bits[bit] > 0
+      end
+      simhash
+    end
+  end
+end
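
To see how the pieces of `bits.rb` fit together, here is a standalone sketch under simplifying assumptions: every ngram gets a weight of 1 (the gem weights by corpus frequency), and the function names are illustrative rather than the gem's API. It counts differing bits with the same clear-lowest-bit loop as `Bits.hamming`, and builds a 32-bit simhash from the low bits of each ngram's MD5.

```ruby
require "digest"

# Count differing bits between two non-negative integers by repeatedly
# clearing the lowest set bit of a ^ b.
def hamming(a, b)
  count, diff = 0, a ^ b
  while diff > 0
    count += 1
    diff &= diff - 1
  end
  count
end

# Unweighted 32-bit simhash (assumption: weight 1 per ngram).
# Each ngram votes +1 or -1 on every bit position; positive sums set the bit.
def simhash32(ngrams)
  sums = Array.new(32, 0)
  ngrams.each do |ngram|
    hash = Digest::MD5.hexdigest(ngram).to_i(16)
    32.times { |i| sums[i] += hash[i] == 1 ? 1 : -1 }
  end
  (0...32).inject(0) { |acc, i| sums[i] > 0 ? acc | (1 << i) : acc }
end

a = simhash32(["se", "ea", "a ", " c", "cr"])
b = simhash32(["se", "ea", "a ", " c", "cx"])
distance = hamming(a, b)  # small when most ngrams are shared
```

Because each output bit is a majority vote, changing one ngram out of many tends to flip only a few bits, which is what makes a hamming-distance threshold a reasonable candidate filter.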

data/lib/simhilarity/bulk.rb ADDED

@@ -0,0 +1,163 @@
+require "bk"
+require "set"
+module Simhilarity
+  # Match a set of needles against a haystack, in bulk. For example,
+  # this is used if you want to match 50 new addresses against your
+  # database of 1,000 known addresses.
+  class Bulk < Matcher
+    # default minimum number of ngram overlaps with :ngrams
+    DEFAULT_NGRAM_OVERLAPS = 3
+    # default maximum hamming distance with :simhash
+    DEFAULT_SIMHASH_MAX_HAMMING = 7
+    # Initialize a new Bulk matcher. See Matcher#initialize. Bulk adds
+    # these options:
+    #
+    # * +candidates+: specifies which method to use for finding
+    #   candidates. See the README for more details.
+    # * +ngram_overlaps+: Minimum number of ngram overlaps, defaults
+    #   to 3.
+    # * +simhash_max_hamming+: Maximum simhash hamming distance,
+    #   defaults to 7.
+    def initialize(options = {})
+      super(options)
+    end
+    # Match each item in +needles+ to an item in +haystack+. Returns
+    # an array of tuples, <tt>[needle, haystack, score]</tt>. Scores
+    # range from 0 to 1, with 1 being a perfect match and 0 being a
+    # terrible match.
+    def matches(needles, haystack)
+      # create Elements
+      if needles == haystack
+        needles = haystack = import_list(needles)
+        # set the corpus, to generate frequency weights
+        self.corpus = needles
+      else
+        needles = import_list(needles)
+        haystack = import_list(haystack)
+        # set the corpus, to generate frequency weights
+        self.corpus = (needles + haystack)
+      end
+      # get candidate matches
+      candidates = candidates(needles, haystack)
+      vputs " got #{candidates.length} candidates."
+      # pick winners
+      winners(needles, candidates)
+    end
+    protected
+    # Find candidates from +needles+ & +haystack+. The method used
+    # depends on the value of options[:candidates]
+    def candidates(needles, haystack)
+      method = options[:candidates]
+      method ||= (needles.length * haystack.length < 200000) ? :all : :simhash
+      case method
+      when /^ngrams=(\d+)$/
+        method = :ngrams
+        options[:ngram_overlaps] = $1.to_i
+      when /^simhash=(\d+)$/
+        method = :simhash
+        options[:simhash_max_hamming] = $1.to_i
+      end
+      method = "candidates_#{method}".to_sym
+      # pass true so the protected candidates_* methods are found
+      if !respond_to?(method, true)
+        raise "unsupported options[:candidates] #{options[:candidates].inspect}"
+      end
+      vputs "Using #{method} with needles=#{needles.length} haystack=#{haystack.length}..."
+      self.send(method, needles, haystack).map do |n, h|
+        Candidate.new(self, n, h)
+      end
+    end
+    # Return ALL candidates. This only works for small datasets.
+    def candidates_all(needles, haystack)
+      needles.product(haystack)
+    end
+    # Return candidates that share at least options[:ngram_overlaps]
+    # ngrams (three by default). Only works for small datasets.
+    def candidates_ngrams(needles, haystack)
+      ngram_overlaps = options[:ngram_overlaps] || DEFAULT_NGRAM_OVERLAPS
+      candidates = []
+      veach(" ngrams #{ngram_overlaps}", needles) do |n|
+        ngrams_set = Set.new(n.ngrams)
+        haystack.each do |h|
+          count = 0
+          h.ngrams.each do |ngram|
+            if ngrams_set.include?(ngram)
+              if (count += 1) == ngram_overlaps
+                candidates << [n, h]
+                break
+              end
+            end
+          end
+        end
+      end
+      candidates
+    end
+    # Find candidates that are close based on hamming distance between
+    # the simhashes.
+    def candidates_simhash(needles, haystack)
+      max_hamming = options[:simhash_max_hamming] || DEFAULT_SIMHASH_MAX_HAMMING
+      # calculate this first so we get a nice progress bar
+      veach(" simhash", corpus) { |i| i.simhash }
+      # build the bk tree
+      bk = BK::Tree.new(lambda { |a, b| Bits.hamming32(a.simhash, b.simhash) })
+      veach(" bktree", haystack) { |i| bk.add(i) }
+      # search for candidates with low hamming distance
+      candidates = []
+      veach(" hamming #{max_hamming}", needles) do |n|
+        bk.query(n, max_hamming).each do |h, distance|
+          candidates << [n, h]
+        end
+      end
+      candidates
+    end
+    # walk candidates by score, pick winners
+    def winners(needles, candidates)
+      # calculate this first so we get a nice progress bar
+      veach("Scoring", candidates) { |i| i.score }
+      # score the candidates
+      candidates = candidates.sort_by { |i| -i.score }
+      # walk them, eliminate dups
+      seen = Set.new
+      winners = candidates.map do |i|
+        next if seen.include?(i.a) || seen.include?(i.b)
+        seen << i.a
+        seen << i.b
+        i
+      end.compact
+      # build map from needle => candidate...
+      needle_to_winner = { }
+      winners.each { |i| needle_to_winner[i.a] = i }
+      # so we can return in the original order
+      needles.map do |i|
+        if candidate = needle_to_winner[i]
+          [ i.opaque, candidate.b.opaque, candidate.score ]
+        else
+          [ i.opaque, nil, nil ]
+        end
+      end
+    end
+  end
+end
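
The greedy selection in `winners` is easier to follow on plain data. The sketch below assumes a hypothetical `Pair` struct holding strings, where the gem works on `Candidate` and `Element` objects. It also tracks needles and haystack items in two separate sets, whereas the gem uses a single `seen` set because its elements are distinct objects.

```ruby
require "set"

# Hypothetical stand-in for a scored candidate pair.
Pair = Struct.new(:needle, :haystack, :score)

# Walk pairs from best to worst score; keep a pair only if neither side
# has already been claimed by a better-scoring pair.
def pick_winners(pairs)
  seen_needles   = Set.new
  seen_haystacks = Set.new
  pairs.sort_by { |pair| -pair.score }.select do |pair|
    next false if seen_needles.include?(pair.needle) ||
                  seen_haystacks.include?(pair.haystack)
    seen_needles   << pair.needle
    seen_haystacks << pair.haystack
    true
  end
end

pairs = [
  Pair.new("a", "x", 0.9),
  Pair.new("a", "y", 0.8),
  Pair.new("b", "x", 0.7),
  Pair.new("b", "y", 0.4),
]
winners = pick_winners(pairs).map { |pair| [pair.needle, pair.haystack] }
# winners is [["a", "x"], ["b", "y"]]
```

Note the consequence of the greedy pass: "b" ends up with "y" at 0.4 even though it scored 0.7 against "x", because "x" was already claimed by the higher-scoring pair.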

data/lib/simhilarity/candidate.rb ADDED

@@ -0,0 +1,46 @@
+module Simhilarity
+  # A potential match between two +Elements+. It can calculate its own score.
+  class Candidate
+    # matcher that owns this guy
+    attr_reader :matcher
+    # first half of the candidate pair - the needle.
+    attr_reader :a
+    # second half of the candidate pair - the haystack.
+    attr_reader :b
+    def initialize(matcher, a, b) #:nodoc:
+      @matcher = matcher
+      @a = a
+      @b = b
+    end
+    # Calculate the score for this +Candidate+. The score is the {dice
+    # coefficient}[http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient],
+    # <tt>(2*c)/(a+b)</tt>.
+    #
+    # * +a+: the frequency weighted sum of the ngrams in a
+    # * +b+: the frequency weighted sum of the ngrams in b
+    # * +c+: the frequency weighted sum of the ngrams in (a & b)
+    #
+    # Lazily calculated and memoized.
+    def score
+      @score ||= begin
+        c = (self.a.ngrams & self.b.ngrams)
+        if c.length > 0
+          a = self.a.ngrams_sum
+          b = self.b.ngrams_sum
+          c = matcher.ngrams_sum(c)
+          (2.0 * c) / (a + b)
+        else
+          0
+        end
+      end
+    end
+    def to_s #:nodoc:
+      "Candidate #{score}: #{a.inspect}..#{b.inspect}"
+    end
+  end
+end
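
A worked example of the `(2*c)/(a+b)` formula may help. The sketch below is standalone and makes one loud assumption: `weights` is a hypothetical ngram-to-weight hash supplied by the caller, with unknown ngrams defaulting to 1.0. How the gem derives its real weights from the corpus is defined in `matcher.rb`, which this example does not try to reproduce.

```ruby
# Sum the weight of each ngram; ngrams missing from the table weigh 1.0.
# (The weights table is an assumption for illustration only.)
def weighted_sum(ngrams, weights)
  ngrams.sum { |ngram| weights.fetch(ngram, 1.0) }
end

# Frequency-weighted Dice coefficient: (2*c)/(a+b), or 0.0 with no overlap.
def dice(a_ngrams, b_ngrams, weights = {})
  common = a_ngrams & b_ngrams
  return 0.0 if common.empty?
  (2.0 * weighted_sum(common, weights)) /
    (weighted_sum(a_ngrams, weights) + weighted_sum(b_ngrams, weights))
end

a = ["ab", "bc", "cd"]
b = ["bc", "cd", "de"]

plain    = dice(a, b)                     # (2*2)/(3+3), about 0.667
weighted = dice(a, b, { "bc" => 2.0 })    # (2*3)/(4+4) = 0.75
```

Raising the weight of a shared ngram pulls the score up, so weighting rare ngrams heavily rewards pairs that agree on the distinctive parts of a string.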

data/lib/simhilarity/element.rb ADDED

@@ -0,0 +1,50 @@
+require "set"
+module Simhilarity
+  # Internal wrapper around opaque items from user. This mostly exists
+  # to cache stuff that's expensive, like the ngrams.
+  class Element
+    # matcher that owns this guy
+    attr_reader :matcher
+    # opaque object from the user
+    attr_reader :opaque
+    def initialize(matcher, opaque) #:nodoc:
+      @matcher = matcher
+      @opaque = opaque
+    end
+    # Text string generated from +opaque+ via Matcher#read. Lazily
+    # calculated.
+    def str
+      @str ||= matcher.normalize(matcher.read(opaque))
+    end
+    # List of ngrams generated from +str+ via
+    # Matcher#ngrams. Lazily calculated.
+    def ngrams
+      @ngrams ||= matcher.ngrams(str)
+    end
+    # Weighted frequency sum of +ngrams+ via
+    # Matcher#ngrams_sum. Lazily calculated.
+    def ngrams_sum
+      @ngrams_sum ||= matcher.ngrams_sum(ngrams)
+    end
+    # Weighted simhash of +ngrams+ via Matcher#simhash. Lazily
+    # calculated.
+    def simhash
+      @simhash ||= matcher.simhash(ngrams)
+    end
+    def to_s #:nodoc:
+      str
+    end
+    def inspect #:nodoc:
+      str.inspect
+    end
+  end
+end