RubyGems - fuzzy_match - Versions diffs - 1.5.0 → 2.0.0 - Mend

fuzzy_match 1.5.0 → 2.0.0

Files changed (55)

checksums.yaml +8 -8
data/.rspec +2 -0
data/CHANGELOG +14 -0
data/Gemfile +8 -0
data/README.markdown +58 -38
data/Rakefile +0 -9
data/bin/fuzzy_match +106 -0
data/fuzzy_match.gemspec +4 -4
data/groupings-screenshot.png +0 -0
data/highlevel.graffle +0 -0
data/highlevel.png +0 -0
data/lib/fuzzy_match/record.rb +58 -0
data/lib/fuzzy_match/result.rb +11 -8
data/lib/fuzzy_match/rule/grouping.rb +70 -12
data/lib/fuzzy_match/rule/identity.rb +3 -3
data/lib/fuzzy_match/rule.rb +1 -1
data/lib/fuzzy_match/score/amatch.rb +0 -4
data/lib/fuzzy_match/score/pure_ruby.rb +2 -8
data/lib/fuzzy_match/score.rb +4 -0
data/lib/fuzzy_match/similarity.rb +10 -32
data/lib/fuzzy_match/version.rb +1 -1
data/lib/fuzzy_match.rb +78 -94
data/{test/test_amatch.rb → spec/amatch_spec.rb} +1 -2
data/{test/test_cache.rb → spec/cache_spec.rb} +7 -7
data/spec/foo.rb +9 -0
data/spec/fuzzy_match_spec.rb +354 -0
data/spec/grouping_spec.rb +60 -0
data/spec/identity_spec.rb +29 -0
data/{test/test_wrapper.rb → spec/record_spec.rb} +3 -7
data/spec/spec_helper.rb +21 -0
metadata +56 -50
data/bin/fuzzy_match_checker +0 -71
data/examples/bts_aircraft/5-2-A.htm +0 -10305
data/examples/bts_aircraft/5-2-B.htm +0 -9576
data/examples/bts_aircraft/5-2-D.htm +0 -7094
data/examples/bts_aircraft/5-2-E.htm +0 -2349
data/examples/bts_aircraft/5-2-G.htm +0 -2922
data/examples/bts_aircraft/groupings.csv +0 -1
data/examples/bts_aircraft/identities.csv +0 -1
data/examples/bts_aircraft/negatives.csv +0 -1
data/examples/bts_aircraft/normalizers.csv +0 -1
data/examples/bts_aircraft/number_260.csv +0 -334
data/examples/bts_aircraft/positives.csv +0 -1
data/examples/bts_aircraft/test_bts_aircraft.rb +0 -116
data/examples/first_name_matching.rb +0 -15
data/examples/icao-bts.xls +0 -0
data/lib/fuzzy_match/rule/normalizer.rb +0 -20
data/lib/fuzzy_match/rule/stop_word.rb +0 -11
data/lib/fuzzy_match/wrapper.rb +0 -73
data/test/helper.rb +0 -12
data/test/test_fuzzy_match.rb +0 -304
data/test/test_fuzzy_match_convoluted.rb.disabled +0 -268
data/test/test_grouping.rb +0 -28
data/test/test_identity.rb +0 -34
data/test/test_normalizer.rb +0 -10

checksums.yaml CHANGED

@@ -1,15 +1,15 @@
 ---
 !binary "U0hBMQ==":
   metadata.gz: !binary |-
-    YTlmMmE3MDI3MWQ0NzY0NGE5N2Q4ZDY4MmI5NjUzZTg5YWU1OWE5OQ==
+    NDhjZDk1NjAxOTMzMGZkMWU4ODAyNjRmMzA4YzlhNzQwZWIwZGU5MQ==
   data.tar.gz: !binary |-
-    MzU3MDc3NjQ1NDczNWFhNWE0ZDdlZGRmYjlhYWQ3Y2YyZTNiMjRhMQ==
+    OTk5Yjc2NmY4ZTY3NDkyOTFjOGQwYjlhZDgwYjk2NmViMzI0NGYyMQ==
 !binary "U0hBNTEy":
   metadata.gz: !binary |-
-    MDcyZDk5OGY5MTQwNTEyMDA1YmQ1MzdiYzYyZTczMWExNzI0NjhlYWNkYzcx
-    Y2UwNzgwN2I5YmI3MjVhMzIwMDljZmVmMTFkODQ2MjY2NDdkNzViZjlhNTcy
-    NjliMmU5NTAwOTBiYTYxNjk4ZDkwZWM0OWMzMTdjZTU4ZjQ4NDk=
+    YmE4MjUxNzg0MmFlN2UyMTYxNGJhNGRiZjZmYThjOGYzNDMwYjRjZmVlNTBk
+    MmQzZjE0NWJiN2IzZGY3YzBhZWQzOWVkMTNjZmVhODUwMTk5ODJmMTY0Njk0
+    ODEwMjc2MDc0M2M2YjNmZmY5MmViZGNiOTYzNGQ2MjQxODZkZDQ=
   data.tar.gz: !binary |-
-    MTg0MGVlYTc3NTY0NjMxZWMwNWFmZTdhYmRhOWM4MGJmN2QwYjc3NmNiYzQz
-    MzU3ODc2NDRjNDJjNzhiYmQwMmNkNmQ3MTdjYzMyNjM2YzBjMDEwOGYzNzgy
-    NmFkZTM0M2E2YzkwZjY1YjM4ZTEzY2Y1YmU5MmY3MDZjNTJhMzg=
+    NGIzYTVmNzk0N2E4YjU4ZDc1MTNiNTU4YmRhMDgxNmM5Y2Q0MWM2NzYzZjhj
+    NTA3ZDg5NWE2ZWM2MWZkNDRjM2I1NWY2NzA4ODAwM2E4NTIwY2NmYWI4MzBm
+    MjliNDZkNzIzMGJhZTVhMzViNThhZjk1MmM1NTViNmQ4ODFjNzM=

data/.rspec ADDED

@@ -0,0 +1,2 @@
+--color
+--format progress

data/CHANGELOG CHANGED

@@ -1,3 +1,17 @@
+2.0.0 / 2013-05-22
+* Breaking changes
+  * normalizers removed - use groupings instead
+  * first_grouping_decides removed
+  * FuzzyMatch#free gone
+* Enhancements
+  * chained groupings!
+  * faster and simpler structure
+  * FuzzyMatch#find_with_score returns [record, dice_score, lev_score]
 1.5.0 / 2013-04-03
 * Breaking changes
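
Note on the `[record, dice_score, lev_score]` tuple listed under 2.0.0: the sketch below shows what two such scores can look like. It is an illustration only, assuming a bigram Dice's coefficient and a Levenshtein similarity normalised by the longer string. `letter_pairs`, `dice`, `levenshtein`, `lev_similarity` and `best_with_scores` are hypothetical names, not the gem's API, and the gem's own scoring may differ in detail.

```ruby
# Split a string into lowercase two-letter pairs, word by word.
def letter_pairs(str)
  str.downcase.split(' ').flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Dice's coefficient over letter pairs: 2 * shared / (pairs_a + pairs_b).
def dice(a, b)
  return 1.0 if a.downcase == b.downcase
  pairs_a = letter_pairs(a)
  pairs_b = letter_pairs(b)
  total = pairs_a.length + pairs_b.length
  return 0.0 if total.zero?
  shared = 0
  pairs_a.each do |pair|
    idx = pairs_b.index(pair)
    next unless idx
    shared += 1
    pairs_b.delete_at(idx) # each pair in b can be shared only once
  end
  2.0 * shared / total
end

# Classic row-by-row Levenshtein edit distance.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumption: similarity is 1 minus distance over the longer length.
def lev_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a.downcase, b.downcase).to_f / longest
end

# Hypothetical stand-in returning a [record, dice_score, lev_score] tuple.
def best_with_scores(needle, haystack)
  haystack.map { |straw| [straw, dice(needle, straw), lev_similarity(needle, straw)] }
          .max_by { |_, dice_score, lev_score| [dice_score, lev_score] }
end

record, dice_score, lev_score = best_with_scores('youruguay', %w[Uruguay Paraguay Peru])
puts format('%s (dice=%0.5f,lev=%0.5f)', record, dice_score, lev_score)
```

Ranking on Dice first and using Levenshtein only as a tie-breaker is a choice made for this sketch.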

data/Gemfile CHANGED

@@ -1,3 +1,11 @@
 source :rubygems
 gemspec
+# bin/fuzzy_match development
+gem 'activesupport'
+gem 'remote_table'
+gem 'thor'
+gem 'to_regexp'
+gem 'perftools.rb'
+gem 'pry'

data/README.markdown CHANGED

@@ -4,20 +4,9 @@ Find a needle in a haystack based on string similarity and regular expression ru
 Replaces [`loose_tight_dictionary`](https://github.com/seamusabshere/loose_tight_dictionary) because that was a confusing name.
-## Real-world usage
-<p><a href="http://brighterplanet.com"><img src="https://s3.amazonaws.com/static.brighterplanet.com/assets/logos/flush-left/inline/green/rasterized/brighter_planet-160-transparent.png" alt="Brighter Planet logo"/></a></p>
-We use `fuzzy_match` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at
-* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)
-* [Brighter Planet's reference data web service](http://data.brighterplanet.com)
-We often combine it with [`remote_table`](https://github.com/seamusabshere/remote_table) and [`errata`](https://github.com/seamusabshere/errata):
+Warning! `normalizers` are gone in version 2 and above! See the CHANGELOG and check out enhanced (and hopefully more intuitive) `groupings`.
-- download table with `remote_table`
-- correct serious or repeated errors with `errata`
-- `fuzzy_match` the rest
+![diagram of matching process](https://raw.github.com/seamusabshere/fuzzy_match/master/highlevel.png)
 ## Quickstart
@@ -30,41 +19,62 @@ See also the blog post [Fuzzy match in Ruby](http://numbers.brighterplanet.com/2
 ## Default matching (string similarity)
-At the core, and even if you configure nothing else, string similarity (calculated by "pair distance" aka Dice's) is used to compare records.
+At the core, and even if you configure nothing else, string similarity (calculated by "pair distance" aka Dice's Coefficient) is used to compare records.
 You can tell `FuzzyMatch` what field or method to use via the `:read` option... for example, let's say you want to match a `Country` object like `#<Country name:"Uruguay" iso_3166_code:"UY">`
-    >> matcher = FuzzyMatch.new(Country.all, :read => :name)  # Country#name will be called when comparing
+    >> fz = FuzzyMatch.new(Country.all, :read => :name)
     => #<FuzzyMatch: [...]>
-    >> matcher.find('youruguay')
-    => #<Country name:"Uruguay" iso_3166_code:"UY">            # the matcher returns a Country object
+    >> fz.find('youruguay')
+    => #<Country name:"Uruguay" iso_3166_code:"UY">
 ## Optional rules (regular expressions)
-You can improve the default matchings with rules. There are 4 different kinds of rules. Each rule is a regular expression. Depending on the kind of rule, the results of running the regular expression are used for a particular purpose.
+You can improve the default matchings with rules. There are 3 different kinds of rules. Each rule is a regular expression.
 We suggest that you **first try without any rules** and only define them to improve matching, prevent false positives, etc.
-    >> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :groupings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d50)/i ])
-    => #<FuzzyMatch: [...]>
-    >> matcher.find('fordf250')
-    => "Ford F-250"
-    >> matcher.find('gmc truck k1500')
-    => "GMC 1500"
+### Groupings
-For identities and normalizers (see below), **only the captures are used.** For example, `/(f)-?(\d50)/i` captures the "F" and the "250" but ignores the dash. So place your parentheses carefully! Groupings work the same way, except that if you don't have any captures, a simple match will pass.
+Group records together. The two laws of groupings:
-### Groupings
+1. If a needle matches a grouping, only compare it with straws in the same grouping; (the "buddies vs buddies" rule)
+2. If a needle doesn't match any grouping, only compare it with straws that also don't match ANY grouping (the "misfits vs misfits" rule)
+The two laws of chained groupings: (new in v2.0 and rather important)
+1. Sub-groupings (e.g., `/plaza/i` below) only match if their primary (e.g., `/ramada/i`) does
+2. In final grouping decisions, sub-groupings win over primaries (so "Ramada Inn" is NOT grouped with "Ramada Plaza", but if you removed `/plaza/i` sub-grouping, then they would be grouped together)
+Hopefully they are rather intuitive once you start using them.
+[![screenshot of spreadsheet of groupings](https://raw.github.com/seamusabshere/fuzzy_match/master/groupings-screenshot.png)](https://docs.google.com/spreadsheet/pub?key=0AkCJNpm9Ks6JdG4xSWhfWFlOV1RsZ2NCeU9seGx6cnc&single=true&gid=0&output=html)
+That will...
-Group records together.
+* separate "Orient Express Hotel" and "Ramada Conference Center Mandarin" from real Mandarin Oriental hotels
+* keep "Trump Hotel Collection" away from "Luxury Collection" (another real hotel brand) without messing with the word "Luxury"
+* make sure that "Ramada Plaza" are always grouped with other RPs&mdash;and not with plain old Ramadas&mdash;and vice versa
+* splits out Hyatts into their different brands
+* and more
-Setting a grouping of `/Airbus/` ensures that strings containing "Airbus" will only be scored against to other strings containing "Airbus". A better grouping in this case would probably be `/airbus/i`.
+You specify chained groupings as arrays of regexps:
+    groupings = [
+      /mandarin/i,
+      /trump/i,
+      [ /ramada/i, /plaza/i ],
+      ...
+    ]
+    fz = FuzzyMatch.new(haystack, groupings: groupings)
+This way of specifying groupings is meant to be easy to load from a CSV, like `bin/fuzzy_match` does.
 Formerly called "blockings," but that was jargon that confused people.
 ### Identities
-Prevent impossible matches.
+Prevent impossible matches. Can be very confusing&mdash;see if you can make things work with groupings first.
 Adding an identity like `/(f)-?(\d50)/i` ensures that "Ford F-150" and "Ford F-250" never match.
@@ -72,29 +82,24 @@ Note that identities do not establish certainty. They just say whether two recor
 ### Stop words
-Ignore common and/or meaningless words. Applied before normalizers.
+Ignore common and/or meaningless words when doing string similarity.
 Adding a stop word like `THE` ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT"
-### Normalizers (formerly called tighteners)
-Strip strings down to the essentials. Applied after stop words.
-Adding a normalizer like `/(boeing).*(7\d\d)/i` will cause "BOEING COMPANY 747" and "boeing747" to be normalized to "BOEING 747" and "boeing 747", respectively. Since things are generally downcased before they are compared, these would be an exact match.
+Stop words are NOT removed when checking `:must_match_at_least_one_word` and when doing identities and groupings.
 ## Find options
 * `read`: how to interpret each record in the 'haystack', either a Proc or a symbol
 * `must_match_grouping`: don't return a match unless the needle fits into one of the groupings you specified
 * `must_match_at_least_one_word`: don't return a match unless the needle shares at least one word with the match. Note that "Foo's" is treated like one word (so that it won't match "'s") and "Bolivia," is treated as just "bolivia"
-* `first_grouping_decides`: force records into the first grouping they match, rather than choosing a grouping that will give them a higher score
 * `gather_last_result`: enable `last_result`
 ## Case sensitivity
 String similarity is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
-Be careful when trying to use case-sensitivity in your rules; in general, things are downcased before comparing.
+Be careful with uppercase letters in your rules; in general, things are downcased before comparing.
 ## String similarity algorithm
@@ -152,6 +157,21 @@ You can optionally use [`amatch`](http://flori.github.com/amatch/) by [Florian F
 Otherwise, pure ruby versions of the string similarity algorithms derived from the [answer to a StackOverflow question](http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings) and [the text gem](https://github.com/threedaymonk/text/blob/master/lib/text/levenshtein.rb) are used. Thanks [marzagao](http://stackoverflow.com/users/10997/marzagao) and [threedaymonk](https://github.com/threedaymonk)!
+## Real-world usage
+<p><a href="http://brighterplanet.com"><img src="https://s3.amazonaws.com/static.brighterplanet.com/assets/logos/flush-left/inline/green/rasterized/brighter_planet-160-transparent.png" alt="Brighter Planet logo"/></a></p>
+We use `fuzzy_match` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at
+* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)
+* [Brighter Planet's reference data web service](http://data.brighterplanet.com)
+We often combine it with [`remote_table`](https://github.com/seamusabshere/remote_table) and [`errata`](https://github.com/seamusabshere/errata):
+- download table with `remote_table`
+- correct serious or repeated errors with `errata`
+- `fuzzy_match` the rest
 ## Authors
 * Seamus Abshere <seamus@abshere.net>
@@ -160,4 +180,4 @@ Otherwise, pure ruby versions of the string similarity algorithms derived from t
 ## Copyright
-Copyright 2012 Brighter Planet, Inc.
+Copyright 2013 Seamus Abshere
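
Note on the two laws of groupings quoted in the README hunk above ("buddies vs buddies" and "misfits vs misfits"): they can be sketched in plain Ruby without the gem. `candidates_for` is a hypothetical helper, not part of fuzzy_match, and it only checks whether a regexp matches; the real grouping rule also compares captures.

```ruby
# Hypothetical helper: pick which straws a needle may be compared with.
def candidates_for(needle, haystack, groupings)
  matched = groupings.select { |regexp| regexp.match?(needle) }
  if matched.empty?
    # Law 2 ("misfits vs misfits"): only straws that match no grouping at all.
    haystack.reject { |straw| groupings.any? { |regexp| regexp.match?(straw) } }
  else
    # Law 1 ("buddies vs buddies"): only straws that share a grouping with the needle.
    haystack.select { |straw| matched.any? { |regexp| regexp.match?(straw) } }
  end
end

HAYSTACK = ['Ford F-150', 'Ford F-250', 'GMC 1500', 'Toyota Tundra'].freeze
GROUPINGS = [/ford/i, /gmc/i].freeze

p candidates_for('fordf250', HAYSTACK, GROUPINGS)
p candidates_for('tundra', HAYSTACK, GROUPINGS)
```

String similarity would then be scored only within the returned candidates.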

data/Rakefile CHANGED

@@ -1,14 +1,5 @@
 require 'bundler'
 Bundler::GemHelper.install_tasks
-require 'rake/testtask'
-Rake::TestTask.new(:test) do |test|
-  test.libs << 'lib' << 'test'
-  test.pattern = 'test/**/test_*.rb'
-  test.verbose = true
-end
-task :default => :test
 require 'yard'
 YARD::Rake::YardocTask.new

data/bin/fuzzy_match ADDED

@@ -0,0 +1,106 @@
+#!/usr/bin/env ruby
+if File.exist?('Gemfile')
+  require 'bundler/setup'
+end
+if ENV['PROFILE'] == 'true'
+  require 'perftools'
+end
+# PerfTools::CpuProfiler.start("profile_data") do
+require 'fuzzy_match'
+require 'fuzzy_match/version'
+require 'active_support/core_ext'
+require 'remote_table'
+require 'thor'
+require 'to_regexp'
+class FuzzyMatch
+  class Cli < ::Thor
+    desc :match, "Print out matches between A and B, where A is haystack and B is a bunch of needles."
+    method_option :csv, default: false, type: :boolean, desc: "CSV output"
+    method_option :a_col, default: 0, type: :string, desc: "Column name in A. Defaults to first column."
+    method_option :b_col, default: 0, type: :string, desc: "Column name in B. Defaults to first column."
+    method_option :downcase, default: true, type: :boolean, desc: "Whether to downcase everything (except regexes, where you have to do /foo/i)"
+    method_option :groupings, default: nil, type: :string, desc: "Spreadsheet with groupings - no headers, multi-part groupings on the same row"
+    method_option :rules, default: nil, type: :string, desc: "Spreadsheet with headers: stop_words, identities, find_options. Listing a find_option like must_match_grouping makes it true."
+    method_option :explain, default: false, type: :boolean
+    method_option :grep, default: nil, type: :string
+    method_option :limit, default: 1.0/0, type: :numeric
+    def match(a_url, b_url)
+      puts "Checking matches using fuzzy_match version #{FuzzyMatch::VERSION}..."
+      fz = mkfz a_url
+      b = load_b b_url
+      if ENV['PROFILE'] == 'true'
+        require 'perftools'
+        PerfTools::CpuProfiler.start("profile.bin") { report(fz, b) }
+        system "pprof.rb --text profile.bin"
+        `pprof.rb --gif profile.bin > profile.gif`
+      else
+        report fz, b
+      end
+    end
+    private
+    def report(fz, b)
+      b.each do |b_val|
+        if options.explain
+          fz.explain
+        else
+          a_val = fz.find b_val
+          if options.csv
+            # puts [ b_val.ljust(50), a_val ].join('-> ')
+            puts [ b_val, a_val ].to_csv
+          else
+            puts %{\nB: #{b_val}\nA: #{a_val}}
+          end
+        end
+      end
+    end
+    def load_b(b_url)
+      b_options = options.b_col.is_a?(String) ? { headers: :first_row } : { headers: false }
+      if options[:grep]
+        regexp = options[:grep].to_regexp(detect: true)
+        b_options[:select] = lambda { |row| regexp =~ row[options.b_col] }
+      end
+      b = RemoteTable.new(b_url, b_options).to_a
+      limit = [options.limit, b.length].min
+      b.first(limit).map do |row|
+        b_val = row[options.b_col]
+        b_val.downcase! if options.downcase
+        b_val
+      end
+    end
+    def mkfz(a_url)
+      a_options = options.a_col.is_a?(String) ? { headers: :first_row } : { headers: false }
+      a = RemoteTable.new(a_url, a_options).map { |row| row[options.a_col] }
+      a.map!(&:downcase) if options.downcase
+      FuzzyMatch.new a, fz_options
+    end
+    def fz_options
+      memo = {}
+      if options.groupings
+        memo[:groupings] = RemoteTable.new(options.groupings, headers: false).map do |row|
+          row.to_a.select(&:present?).map { |v| v.to_regexp(detect: true) }
+        end
+      end
+      if options.rules
+        t = RemoteTable.new(options.rules, headers: :first_row)
+        find_options = t.rows.map { |row| row['find_options'] }
+        memo.merge!(
+          identities: t.rows.map { |row| row['identities'] }.select(&:present?).map { |v| v.to_regexp(detect: true) },
+          stop_words: t.rows.map { |row| row['stop_words'] }.select(&:present?).map { |v| v.to_regexp(detect: true) },
+          must_match_grouping: find_options.include?('must_match_grouping'),
+          must_match_at_least_one_word: find_options.include?('must_match_at_least_one_word'),
+        )
+      end
+      memo
+    end
+  end
+end
+FuzzyMatch::Cli.start
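
Note on `fz_options` above: each row of the groupings spreadsheet becomes an array of regexps, so a row with several filled cells becomes a chained grouping. A minimal sketch of that conversion, using plain arrays in place of `RemoteTable` rows. `string_to_regexp` is a hypothetical stand-in for the to_regexp gem's `String#to_regexp` and handles only the `i` flag.

```ruby
# Hypothetical stand-in for String#to_regexp: turns a literal such as
# "/plaza/i" into a Regexp. Only the "i" flag is recognised here.
def string_to_regexp(str)
  match = %r{\A/(.*)/([a-z]*)\z}.match(str.strip)
  raise ArgumentError, "not a regexp literal: #{str.inspect}" unless match
  flags = match[2].include?('i') ? Regexp::IGNORECASE : 0
  Regexp.new(match[1], flags)
end

# Blank cells are dropped; the remaining cells of a row form one grouping
# (primary first, then any sub-groupings).
def rows_to_groupings(rows)
  rows.map do |row|
    row.reject { |cell| cell.nil? || cell.strip.empty? }
       .map { |cell| string_to_regexp(cell) }
  end
end

ROWS = [
  ['/mandarin/i', nil],
  ['/trump/i', ''],
  ['/ramada/i', '/plaza/i']
].freeze

p rows_to_groupings(ROWS)
```

Single-cell rows come out as one-element arrays, which `Grouping.make` in this diff treats as "not really a chain after all".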

data/fuzzy_match.gemspec CHANGED

@@ -21,14 +21,14 @@ Gem::Specification.new do |s|
   s.add_development_dependency 'active_record_inline_schema', '>=0.4.0'
   # development dependencies
-  s.add_development_dependency "minitest"
+  s.add_development_dependency 'pry'
+  s.add_development_dependency 'rspec-core'
+  s.add_development_dependency 'rspec-expectations'
+  s.add_development_dependency 'rspec-mocks'
   s.add_development_dependency 'activerecord', '>=3'
   s.add_development_dependency 'mysql2'
   s.add_development_dependency 'cohort_analysis'
   s.add_development_dependency 'weighted_average'
   s.add_development_dependency 'yard'
   s.add_development_dependency 'amatch'
-  if RUBY_VERSION >= '1.9'
-    s.add_development_dependency 'minitest-reporters'
-  end
 end

data/groupings-screenshot.png ADDED

Binary file

data/highlevel.graffle ADDED

Binary file

data/highlevel.png ADDED

Binary file

data/lib/fuzzy_match/record.rb ADDED

@@ -0,0 +1,58 @@
+class FuzzyMatch
+  # Records are the tokens that are passed around when doing scoring and optimizing.
+  class Record #:nodoc: all
+    # "Foo's" is one word
+    # "North-west" is just one word
+    # "Bolivia," is just Bolivia
+    WORD_BOUNDARY = %r{\W*(?:\s+|$)}
+    EMPTY = [].freeze
+    BLANK = ''.freeze
+    attr_reader :original
+    attr_reader :read
+    attr_reader :stop_words
+    def initialize(original, options = {})
+      @original = original
+      @read = options[:read]
+      @stop_words = options.fetch(:stop_words, EMPTY)
+    end
+    def inspect
+      "w(#{clean.inspect})"
+    end
+    def clean
+      @clean ||= begin
+        memo = whole.dup
+        stop_words.each do |stop_word|
+          memo.gsub! stop_word, BLANK
+        end
+        memo.strip.freeze
+      end
+    end
+    def words
+      @words ||= clean.downcase.split(WORD_BOUNDARY).freeze
+    end
+    def similarity(other)
+      Similarity.new self, other
+    end
+    def whole
+      @whole ||= case read
+      when ::NilClass
+        original
+      when ::Numeric, ::String
+        original[read]
+      when ::Proc
+        read.call original
+      when ::Symbol
+        original.respond_to?(read) ? original.send(read) : original[read]
+      else
+        raise "Expected nil, a proc, or a symbol, got #{read.inspect}"
+      end.to_s.strip.freeze
+    end
+  end
+end
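
Note on `Record#clean` and `Record#words` above: stop words are stripped first, then the cleaned string is downcased and split on `WORD_BOUNDARY`. The sketch below reuses the same pattern outside the class; the free-standing `clean` and `words` methods are hypothetical simplifications of the methods in the diff.

```ruby
# Same pattern as Record::WORD_BOUNDARY in the diff above.
WORD_BOUNDARY = %r{\W*(?:\s+|$)}

# Remove every stop word (a Regexp), then trim surrounding whitespace.
def clean(whole, stop_words = [])
  stop_words.reduce(whole.dup) { |memo, stop_word| memo.gsub(stop_word, '') }.strip
end

# Downcase the cleaned string and split it into words.
def words(whole, stop_words = [])
  clean(whole, stop_words).downcase.split(WORD_BOUNDARY)
end

p words("Foo's North-west Bolivia,")
p clean('THE CAT', [/\bthe\b/i])
p words('THE CAT', [/\bthe\b/i])
```

Trailing punctuation is swallowed by the boundary pattern, while apostrophes and hyphens inside a word are kept, which matches the comments at the top of the class.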

data/lib/fuzzy_match/result.rb CHANGED

@@ -1,19 +1,23 @@
+# encoding: utf-8
 require 'erb'
+require 'pp'
 class FuzzyMatch
   class Result #:nodoc: all
     EXPLANATION = <<-ERB
-You looked for <%= needle.render.inspect %>
+#####################################################
+# SUMMARY
+#####################################################
-<% if winner %>It was matched with "<%= winner %>"<% else %>No match was found<% end %>
+Needle: <%= needle.inspect %>
+Match:  <%= winner.inspect %>
-# THE HAYSTACK
+#####################################################
+# OPTIONS
+#####################################################
-The haystack reader was <%= read.inspect %>.
+<%= PP.pp(options, '') %>
-The haystack contained <%= haystack.length %> records like <%= haystack[0, 3].map(&:render).map(&:inspect).join(', ') %>
-# HOW IT WAS MATCHED
 <% timeline.each_with_index do |event, index| %>
 (<%= index+1 %>) <%= event %>
 <% end %>
@@ -23,7 +27,6 @@ ERB
     attr_accessor :read
     attr_accessor :haystack
     attr_accessor :options
-    attr_accessor :normalizers
     attr_accessor :groupings
     attr_accessor :identities
     attr_accessor :stop_words

data/lib/fuzzy_match/rule/grouping.rb CHANGED

@@ -1,3 +1,4 @@
+# require 'pry'
 class FuzzyMatch
   class Rule
     # "Record linkage typically involves two main steps: grouping and scoring..."
@@ -8,18 +9,75 @@ class FuzzyMatch
     # A grouping (formerly known as a blocking) comes into effect when a str matches.
     # Then the needle must also match the grouping's regexp.
     class Grouping < Rule
-      def match?(str)
-        !!(regexp.match(str))
-      end
-      # If a grouping "joins" two strings, that means they both fit into it.
-      #
-      # Returns false if they certainly don't fit this grouping.
-      # Returns nil if the grouping doesn't apply, i.e. str2 doesn't fit the grouping.
-      def join?(str1, str2)
-        if str2_match_data = regexp.match(str2)
-          if str1_match_data = regexp.match(str1)
-            str2_match_data.captures.join.downcase == str1_match_data.captures.join.downcase
+      class << self
+        def make(regexps)
+          case regexps
+          when ::Regexp
+            new regexps
+          when ::Array
+            chain = regexps.flatten.map { |regexp| new regexp }
+            if chain.length == 1
+              chain[0] # not really a chain after all
+            else
+              chain.each { |grouping| grouping.chain = chain }
+              chain
+            end
+          else
+            raise ArgumentError, "[fuzzy_match] Groupings should be specified as single regexps or an array of regexps (got #{regexps.inspect})"
+          end
+        end
+      end
+      attr_accessor :chain
+      def inspect
+        memo = []
+        memo << "#{regexp.inspect}"
+        if chain
+          memo << "(#{chain.find_index(self)} of #{chain.length})"
+        end
+        memo.join ' '
+      end
+      def xmatch?(record)
+        if primary?
+          match?(record) and subs.none? { |sub| sub.match?(record) }
+        else
+          match?(record) and primary.match?(record)
+        end
+      end
+      def xjoin?(needle, straw)
+        if primary?
+          join?(needle, straw) and subs.none? { |sub| sub.match?(straw) } # maybe xmatch here?
+        else
+          join?(needle, straw) and primary.match?(straw)
+        end
+      end
+      protected
+      def primary?
+        chain ? (primary == self) : true
+        # not chain or primary == self
+      end
+      def primary
+        chain ? chain[0] : self
+      end
+      def subs
+        chain ? chain[1..-1] : []
+      end
+      def match?(record)
+        !!(regexp.match(record.whole))
+      end
+      def join?(needle, straw)
+        if straw_match_data = regexp.match(straw.whole)
+          if needle_match_data = regexp.match(needle.whole)
+            straw_match_data.captures.join.downcase == needle_match_data.captures.join.downcase
           else
             false
           end
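
Note on `xmatch?` above: in a chain, a primary grouping claims a record only when none of its sub-groupings match, and a sub-grouping claims a record only when its primary matches too. A miniature that works on plain strings rather than `Record` objects; `MiniGrouping` and `make_chain` are hypothetical names, not the gem's classes.

```ruby
# Hypothetical miniature of Rule::Grouping, operating on plain strings.
class MiniGrouping
  attr_reader :regexp
  attr_accessor :chain

  def initialize(regexp)
    @regexp = regexp
  end

  # Build a chain from an array of regexps; every link knows the whole chain.
  def self.make_chain(regexps)
    chain = regexps.map { |regexp| new(regexp) }
    chain.each { |grouping| grouping.chain = chain }
    chain
  end

  def primary
    chain ? chain.first : self
  end

  def primary?
    primary.equal?(self)
  end

  def subs
    chain ? chain.drop(1) : []
  end

  def match?(str)
    regexp.match?(str)
  end

  # Primary: matches unless a sub-grouping also matches.
  # Sub-grouping: matches only together with its primary.
  def xmatch?(str)
    if primary?
      match?(str) && subs.none? { |sub| sub.match?(str) }
    else
      match?(str) && primary.match?(str)
    end
  end
end

RAMADA, PLAZA = MiniGrouping.make_chain([/ramada/i, /plaza/i])

['Ramada Inn', 'Ramada Plaza', 'Crowne Plaza'].each do |name|
  puts "#{name}: ramada=#{RAMADA.xmatch?(name)} plaza=#{PLAZA.xmatch?(name)}"
end
```

This is why, per the README, "Ramada Inn" is not grouped with "Ramada Plaza" while the `/plaza/i` sub-grouping is present.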

data/lib/fuzzy_match/rule/identity.rb CHANGED

@@ -7,9 +7,9 @@ class FuzzyMatch
       #
       # Only returns true/false if both strings match the regexp.
       # Otherwise returns nil.
-      def identical?(str1, str2)
-        if str1_match_data = regexp.match(str1) and match_data = regexp.match(str2)
-          str1_match_data.captures.join.downcase == match_data.captures.join.downcase
+      def identical?(record1, record2)
+        if str1_match_data = regexp.match(record1.whole) and str2_match_data = regexp.match(record2.whole)
+          str1_match_data.captures.join.downcase == str2_match_data.captures.join.downcase
         else
           nil
         end
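
Note on `identical?` above: it is tri-state. It returns `true` or `false` only when both records match the regexp, and `nil` when the identity does not apply. A free-standing sketch on plain strings, using the `/(f)-?(\d50)/i` identity from the README; the three-argument `identical?` here is a hypothetical simplification.

```ruby
# Hypothetical free-standing version of Rule::Identity#identical?
def identical?(regexp, str1, str2)
  match1 = regexp.match(str1)
  match2 = regexp.match(str2)
  return nil unless match1 && match2 # the identity says nothing about this pair
  # Only the captures are compared, case-insensitively.
  match1.captures.join.downcase == match2.captures.join.downcase
end

F_SERIES = /(f)-?(\d50)/i

p identical?(F_SERIES, 'Ford F-150', 'Ford F-250')
p identical?(F_SERIES, 'fordf250', 'Ford F-250')
p identical?(F_SERIES, 'GMC 1500', 'Ford F-250')
```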

data/lib/fuzzy_match/rule.rb CHANGED

@@ -11,7 +11,7 @@ class FuzzyMatch
     end
     def ==(other)
-      regexp == other.regexp
+      other.class == self.class and regexp == other.regexp
     end
   end
 end
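
Note on the `==` change above: comparing only `regexp` makes two rules of different kinds equal whenever they wrap the same pattern. A small sketch of the stricter comparison, with hypothetical stand-in classes rather than the gem's own.

```ruby
# Hypothetical stand-ins for Rule and two of its subclasses.
class MiniRule
  attr_reader :regexp

  def initialize(regexp)
    @regexp = regexp
  end

  # Same check as the 2.0.0 line: class first, then regexp.
  def ==(other)
    other.class == self.class && regexp == other.regexp
  end
end

class MiniGroupingRule < MiniRule; end
class MiniIdentityRule < MiniRule; end

p MiniGroupingRule.new(/ford/i) == MiniGroupingRule.new(/ford/i)
p MiniGroupingRule.new(/ford/i) == MiniIdentityRule.new(/ford/i)
```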

data/lib/fuzzy_match/score/amatch.rb CHANGED

@@ -3,10 +3,6 @@ class FuzzyMatch
     # be sure to `require 'amatch'` before you use this class
     class Amatch < Score
-      def inspect
-        %{#<FuzzyMatch::Score::Amatch: str1=#{str1.inspect} str2=#{str2.inspect} dices_coefficient_similar=#{dices_coefficient_similar} levenshtein_similar=#{levenshtein_similar}>}
-      end
       def dices_coefficient_similar
         @dices_coefficient_similar ||= if str1 == str2
           1.0

data/lib/fuzzy_match/score/pure_ruby.rb CHANGED

@@ -4,10 +4,6 @@ class FuzzyMatch
       SPACE = ' '
-      def inspect
-        %{#<FuzzyMatch::Score::PureRuby: str1=#{str1.inspect} str2=#{str2.inspect} dices_coefficient_similar=#{dices_coefficient_similar} levenshtein_similar=#{levenshtein_similar}>}
-      end
       # http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings
       def dices_coefficient_similar
         @dices_coefficient_similar ||= begin
@@ -90,10 +86,8 @@ class FuzzyMatch
       private
       def utf8?
-        return @utf8_query[0] if @utf8_query.is_a?(::Array) # ActiveSupport::Memoizable is deprecated in 3.2, how annoying
-        utf8_query = (defined?(::Encoding) ? str1.encoding.to_s : $KCODE).downcase.start_with?('u')
-        @utf8_query = [utf8_query]
-        utf8_query
+        return @utf8_query if defined?(@utf8_query)
+        @utf8_query = (defined?(::Encoding) ? str1.encoding.to_s : $KCODE).downcase.start_with?('u')
       end
     end
   end
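
Note on the `utf8?` change above: the result can be `false`, so `||=` would recompute it on every call. The old code wrapped the value in an array to get around that; the new code checks `defined?(@utf8_query)` instead. A sketch of the difference, with a hypothetical `Probe` class that counts how often the underlying computation runs.

```ruby
# Hypothetical class that counts how many times compute is invoked.
class Probe
  attr_reader :calls

  def initialize(answer)
    @answer = answer
    @calls = 0
  end

  # ||= re-runs the computation whenever the cached value is false or nil.
  def with_or_equals
    @or_equals ||= compute
  end

  # defined? tells "never computed" apart from "computed and falsy".
  def with_defined
    return @memo if defined?(@memo)
    @memo = compute
  end

  private

  def compute
    @calls += 1
    @answer
  end
end

# Call the given memoized method three times and report the compute count.
def calls_after(method_name, answer)
  probe = Probe.new(answer)
  3.times { probe.public_send(method_name) }
  probe.calls
end

puts "||=      with false: #{calls_after(:with_or_equals, false)} computations"
puts "defined? with false: #{calls_after(:with_defined, false)} computations"
```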

data/lib/fuzzy_match/score.rb CHANGED

@@ -13,6 +13,10 @@ class FuzzyMatch
       @str2 = str2.downcase
     end
+    def inspect
+      %{(dice=#{"%0.5f" % dices_coefficient_similar},lev=#{"%0.5f" % levenshtein_similar})}
+    end
     def <=>(other)
       a = dices_coefficient_similar
       b = other.dices_coefficient_similar