fuzzy_match 1.1.1 → 1.2.1

Files changed (22)

data/.gitignore +3 -1
data/README.markdown +124 -0
data/Rakefile +5 -8
data/benchmark/before-with-free.txt +25 -25
data/benchmark/before-without-last-result.txt +31 -31
data/benchmark/before.txt +29 -29
data/benchmark/memory.rb +3 -4
data/examples/bts_aircraft/{tighteners.csv → normalizers.csv} +0 -0
data/examples/bts_aircraft/test_bts_aircraft.rb +3 -3
data/lib/fuzzy_match/blocking.rb +1 -1
data/lib/fuzzy_match/identity.rb +1 -1
data/lib/fuzzy_match/{tightener.rb → normalizer.rb} +5 -5
data/lib/fuzzy_match/result.rb +1 -1
data/lib/fuzzy_match/version.rb +1 -1
data/lib/fuzzy_match/wrapper.rb +3 -3
data/lib/fuzzy_match.rb +30 -45
data/test/test_blocking.rb +5 -0
data/test/test_fuzzy_match.rb +40 -42
data/test/test_identity.rb +5 -0
data/test/{test_tightening.rb → test_normalizer.rb} +2 -2
metadata +26 -25
data/README.rdoc +0 -94

data/.gitignore CHANGED

@@ -15,8 +15,10 @@ tmtags
 ## PROJECT::GENERAL
 coverage
-rdoc
+doc
+.yardoc
 pkg
 ## PROJECT::SPECIFIC
 Gemfile.lock
+*.gem

data/README.markdown ADDED

@@ -0,0 +1,124 @@
+# fuzzy_match
+Find a needle in a haystack based on string similarity (using the Pair Distance algorithm and Levenshtein distance) and regular expressions.
+Replaces [`loose_tight_dictionary`](https://github.com/seamusabshere/loose_tight_dictionary) because that was a confusing name.
+## Quickstart
+    >> require 'fuzzy_match'
+    => true
+    >> FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus')
+    => "seamus"
+## Default matching (string similarity)
+If you configure nothing else, string similarity matching is used. That's why we call it fuzzy matching.
+The algorithm is [Dice's Coefficient](http://en.wikipedia.org/wiki/Dice's_coefficient) (aka Pair Distance) because it seemed to work better than Jaro Winkler, etc.
+## Rules (regular expressions)
+You can improve the default matchings with rules, which are generally regular expressions.
+    >> require 'fuzzy_match'
+    => true
+    >> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :blockings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d\d\d)/i ])
+    => #<FuzzyMatch: [...]>
+    >> matcher.find('fordf250')
+    => "Ford F-250"
+    >> matcher.find('gmc truck k1500')
+    => "GMC 1500"
+### Blockings
+Group records together.
+Setting a blocking of `/Airbus/` ensures that strings containing "Airbus" will only be scored against other strings containing "Airbus". A better blocking in this case would probably be `/airbus/i`.
+### Normalizers (formerly called tighteners)
+Strip strings down to the essentials.
+Adding a normalizer like `/(boeing).*(7\d\d)/i` will cause "BOEING COMPANY 747" and "boeing747" to be scored as if they were "BOEING 747" and "boeing 747", respectively. See also "Case sensitivity" below.
+### Identities
+Prevent impossible matches.
+Adding an identity like `/(F)\-?(\d50)/` ensures that "Ford F-150" and "Ford F-250" never match.
+### Stop words
+Ignore common and/or meaningless words.
+Adding a stop word like `THE` ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT".
+## Find options
+* `read`: how to interpret each record in the 'haystack', either a Proc or a symbol
+* `must_match_blocking`: don't return a match unless the needle fits into one of the blockings you specified
+* `must_match_at_least_one_word`: don't return a match unless the needle shares at least one word with the match
+* `first_blocking_decides`: force records into the first blocking they match, rather than choosing a blocking that will give them a higher score
+* `gather_last_result`: enable `last_result`
+### `:read`
+So, what if your needle is a string like `youruguay` and your haystack is full of `Country` objects like `<Country name:"Uruguay">`?
+    >> FuzzyMatch.new(Country.all, :read => :name).find('youruguay')
+    => <Country name:"Uruguay">
+## Case sensitivity
+String similarity is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
+Be careful when trying to use case-sensitivity in your rules; in general, things are downcased before comparing.
+## Dice's coefficient edge case
+When Dice's coefficient finds that two strings are equally similar to a third string, Levenshtein distance is used to break the tie. For example, pair distance considers "RATZ" and "CATZ" to be equally similar to "RITZ", so we invoke Levenshtein.
+    >> require 'amatch'
+    => true
+    >> 'RITZ'.pair_distance_similar 'RATZ'
+    => 0.3333333333333333
+    >> 'RITZ'.pair_distance_similar 'CATZ'  # <-- pair distance can't tell the difference, so we fall back to levenshtein...
+    => 0.3333333333333333
+    >> 'RITZ'.levenshtein_similar 'RATZ'
+    => 0.75
+    >> 'RITZ'.levenshtein_similar 'CATZ'    # <-- which properly shows that RATZ should win
+    => 0.5
+## Production use
+Used in production for over 2 years in [Brighter Planet's environmental impact API](http://impact.brighterplanet.com) and [reference data service](http://data.brighterplanet.com).
+We often combine `fuzzy_match` with [`remote_table`](https://github.com/seamusabshere/remote_table) and [`errata`](https://github.com/seamusabshere/errata):
+- download table with `remote_table`
+- correct serious or repeated errors with `errata`
+- `fuzzy_match` the rest
+## Glossary
+The admittedly imperfect metaphor is "look for a needle in a haystack"
+* needle: the search term
+* haystack: the records you are searching (<b>your result will be an object from here</b>)
+## Credits (and how to make things faster)
+If you add the [`amatch`](http://flori.github.com/amatch/) gem to your Gemfile, it will use that, which is much faster (but [segfaults have been seen in the wild](https://github.com/flori/amatch/issues/3)). Thanks [Flori](https://github.com/flori)!
+Otherwise, pure ruby versions of the string similarity algorithms derived from the [answer to a StackOverflow question](http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings) and [the text gem](https://github.com/threedaymonk/text/blob/master/lib/text/levenshtein.rb) are used. Thanks [marzagao](http://stackoverflow.com/users/10997/marzagao) and [threedaymonk](https://github.com/threedaymonk)!
+## Authors
+* Seamus Abshere <seamus@abshere.net>
+* Ian Hough <ijhough@gmail.com>
+* Andy Rossmeissl <andy@rossmeissl.net>
+## Copyright
+Copyright 2012 Brighter Planet, Inc.
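
Reviewer note on the "Dice's coefficient edge case" section of the new README: the tie it describes can be reproduced without `amatch`. The following is a minimal pure-Ruby sketch, not the gem's implementation. The helper names (`bigrams`, `dice_coefficient`, `levenshtein_distance`, `levenshtein_similarity`, `best_match`) are hypothetical, and the similarity formula `1 - distance / longest_length` is an assumption chosen because it reproduces the README's 0.75 and 0.5.

```ruby
# Hypothetical helpers for illustration only; fuzzy_match has its own versions.

# Adjacent character pairs of a downcased string, e.g. "ritz" => ["ri", "it", "tz"].
def bigrams(str)
  str.downcase.each_char.each_cons(2).map(&:join)
end

# Dice's coefficient over bigrams: 2 * shared / (pairs in a + pairs in b).
def dice_coefficient(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?

  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    idx = remaining.index(pair)
    remaining.delete_at(idx) if idx # each pair in b may be matched only once
  end
  2.0 * shared / total
end

# Classic dynamic-programming edit distance, keeping two rows at a time.
def levenshtein_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_distance(a, b).to_f / longest
end

# Rank by Dice first; Levenshtein only matters when Dice scores tie.
def best_match(needle, haystack)
  haystack.max_by do |candidate|
    [dice_coefficient(needle, candidate), levenshtein_similarity(needle, candidate)]
  end
end

puts dice_coefficient('RITZ', 'RATZ')       # 0.3333333333333333
puts dice_coefficient('RITZ', 'CATZ')       # 0.3333333333333333
puts levenshtein_similarity('RITZ', 'RATZ') # 0.75
puts levenshtein_similarity('RITZ', 'CATZ') # 0.5
puts best_match('RITZ', %w[CATZ RATZ])      # RATZ
```

Comparing `[dice, levenshtein]` arrays keeps the fallback in one expression: Ruby orders arrays element by element, so the second score is consulted only on a tie in the first.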

data/Rakefile CHANGED

@@ -10,12 +10,9 @@ end
 task :default => :test
-require 'rake/rdoctask'
-Rake::RDocTask.new do |rdoc|
-  version = File.exist?('VERSION') ? File.read('VERSION') : ""
-  rdoc.rdoc_dir = 'rdoc'
-  rdoc.title = "fuzzy_match #{version}"
-  rdoc.rdoc_files.include('README*')
-  rdoc.rdoc_files.include('lib/**/*.rb')
+require 'yard'
+require File.expand_path('../lib/fuzzy_match/version.rb', __FILE__)
+YARD::Rake::YardocTask.new do |t|
+  t.files   = ['lib/**/*.rb', 'README.markdown']   # optional
+  # t.options = ['--any', '--extra', '--opts'] # optional
 end

data/benchmark/before-with-free.txt CHANGED

@@ -14,8 +14,8 @@
     325 ./benchmark/../lib/fuzzy_match.rb:35:FuzzyMatch::Wrapper
     320 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/format/delimited.rb:28:String
     303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
-    201 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
-    184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
+    201 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
+    184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
     140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
      41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
      31 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
@@ -45,8 +45,8 @@
       9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
       9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
       8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
-      8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
-      8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
+      8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
+      8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
       8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
       8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
       8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -71,8 +71,8 @@
       6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
       5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
       5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
-      5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
-      5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
+      5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
+      5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -85,7 +85,7 @@
       5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
       4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
       4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
-      4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
+      4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
       4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
       4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
       4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -112,10 +112,10 @@
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -159,15 +159,15 @@
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -230,11 +230,11 @@
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:String
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String

data/benchmark/before-without-last-result.txt CHANGED

@@ -11,7 +11,7 @@
     779 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
     779 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
     676 benchmark/memory.rb:21:String
-    607 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
+    607 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
     444 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
     342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
     325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
@@ -25,7 +25,7 @@
     303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
     234 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
     234 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
-    184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
+    184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
     140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
     129 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
     127 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
@@ -37,13 +37,13 @@
     118 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
     117 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
     117 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
-    102 ./benchmark/../lib/fuzzy_match/tightener.rb:20:Array
-    102 ./benchmark/../lib/fuzzy_match/tightener.rb:19:MatchData
-    101 ./benchmark/../lib/fuzzy_match/tightener.rb:14:MatchData
+    102 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
+    102 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
+    101 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
      41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
      36 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
      28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
-     26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Tightener
+     26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
      22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
      22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
      17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
@@ -65,9 +65,9 @@
       9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
       9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
       8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
-      8 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
-      8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
-      8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
+      8 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
+      8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
+      8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
       8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
       8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
       8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -77,10 +77,10 @@
       7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
       7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
       6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
-      6 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
-      6 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
-      6 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
-      6 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
+      6 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
+      6 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
+      6 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
+      6 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
       6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
       6 ./benchmark/../lib/fuzzy_match/similarity.rb:15:__node__
       6 ./benchmark/../lib/fuzzy_match/similarity.rb:13:__node__
@@ -89,8 +89,8 @@
       6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
       5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
       5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
-      5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
-      5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
+      5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
+      5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -100,8 +100,8 @@
       4 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
       4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
       4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
-      4 ./benchmark/../lib/fuzzy_match/tightener.rb:4:__node__
-      4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
+      4 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:__node__
+      4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
       4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
       4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
       4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -116,12 +116,12 @@
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:30:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:29:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:30:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:29:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -160,12 +160,12 @@
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -218,8 +218,8 @@
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String

data/benchmark/before.txt CHANGED

@@ -11,7 +11,7 @@
     806 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
     805 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
     688 benchmark/memory.rb:21:String
-    639 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
+    639 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
     448 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
     342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
     325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
@@ -25,7 +25,7 @@
     303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
     242 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
     242 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
-    184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
+    184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
     140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
     133 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
     131 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
@@ -37,13 +37,13 @@
     122 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
     121 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
     121 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
-    110 ./benchmark/../lib/fuzzy_match/tightener.rb:20:Array
-    110 ./benchmark/../lib/fuzzy_match/tightener.rb:19:MatchData
-    109 ./benchmark/../lib/fuzzy_match/tightener.rb:14:MatchData
+    110 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
+    110 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
+    109 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
      57 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
      41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
      28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
-     26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Tightener
+     26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
      22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
      22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
      21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Blocking
@@ -67,8 +67,8 @@
       9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
       9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
       8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
-      8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
-      8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
+      8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
+      8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
       8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
       8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
       8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -92,8 +92,8 @@
       6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
       5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
       5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
-      5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
-      5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
+      5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
+      5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
       5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -106,7 +106,7 @@
       5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
       4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
       4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
-      4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
+      4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
       4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
       4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
       4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -133,10 +133,10 @@
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
       3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
-      3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
+      3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
       3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -182,15 +182,15 @@
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
       2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
-      2 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
+      2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
       2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -253,11 +253,11 @@
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
       1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:String
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
-      1 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
+      1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
       1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String

data/benchmark/memory.rb CHANGED

@@ -28,9 +28,9 @@ MUST_MATCH_BLOCKING = false
 # (Example) We made these by trial and error
 BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
-# Tighteners
+# Normalizers
 # (Example) We made these by trial and error
-TIGHTENERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/tighteners.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
+NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
 # Identities
 # (Example) We made these by trial and error
@@ -39,7 +39,7 @@ IDENTITIES = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/
 FINAL_OPTIONS = {
   :read => HAYSTACK_READER,
   :must_match_blocking => MUST_MATCH_BLOCKING,
-  :tighteners => TIGHTENERS,
+  :normalizers => NORMALIZERS,
   :identities => IDENTITIES,
   :blockings => BLOCKINGS
 }
@@ -48,7 +48,6 @@ Memprof.start
 d = FuzzyMatch.new HAYSTACK, FINAL_OPTIONS
 record = d.find('boeing 707(100)', :gather_last_result => false)
-# d.free
 Memprof.stats
 Memprof.stop
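
Reviewer note: callers that still build an options hash with the old `:tighteners` key will need the same rename shown in `FINAL_OPTIONS` above. Whether 1.2.1 still accepts the old key is not visible in this hunk, so the sketch below assumes it does not. `rename_tighteners_option` is a hypothetical helper and not part of fuzzy_match.

```ruby
# Hypothetical migration helper; not part of the fuzzy_match gem.
# Returns a copy of the options hash with :tighteners renamed to :normalizers.
def rename_tighteners_option(options)
  migrated = options.dup
  if migrated.key?(:tighteners)
    # Refuse to guess which list wins if a caller passes both keys.
    raise ArgumentError, 'both :tighteners and :normalizers given' if migrated.key?(:normalizers)

    migrated[:normalizers] = migrated.delete(:tighteners)
  end
  migrated
end

old_options = { :must_match_blocking => false, :tighteners => [/(boeing).*(7\d\d)/i] }
new_options = rename_tighteners_option(old_options)
puts new_options.keys.inspect # [:must_match_blocking, :normalizers]
```

The helper works on a copy, so the caller's original hash is left untouched.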

data/examples/bts_aircraft/{tighteners.csv → normalizers.csv} RENAMED

File without changes

data/examples/bts_aircraft/test_bts_aircraft.rb CHANGED

@@ -26,9 +26,9 @@ MUST_MATCH_BLOCKING = false
 # (Example) We made these by trial and error
 BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
-# Tighteners
+# Normalizers
 # (Example) We made these by trial and error
-TIGHTENERS = RemoteTable.new(:url => "file://#{File.expand_path("../tighteners.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
+NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
 # Identities
 # (Example) We made these by trial and error
@@ -65,7 +65,7 @@ NEGATIVES = RemoteTable.new :url => "file://#{File.expand_path("../negatives.csv
 FINAL_OPTIONS = {
   :read => HAYSTACK_READER,
   :must_match_blocking => MUST_MATCH_BLOCKING,
-  :tighteners => TIGHTENERS,
+  :normalizers => NORMALIZERS,
   :identities => IDENTITIES,
   :blockings => BLOCKINGS
 }

data/lib/fuzzy_match/blocking.rb CHANGED

@@ -24,7 +24,7 @@ class FuzzyMatch
     def join?(str1, str2)
       if str2_match_data = regexp.match(str2)
         if str1_match_data = regexp.match(str1)
-          str2_match_data.captures == str1_match_data.captures
+          str2_match_data.captures.join.downcase == str1_match_data.captures.join.downcase
         else
           false
         end
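
Reviewer note: the one-line change above only alters behavior when the blocking regexp has capture groups. With no groups, both versions compare empty captures. The sketch below contrasts the two comparisons. `join_by_captures?` and `join_by_downcased_captures?` are hypothetical stand-ins, and returning `false` whenever either string fails to match is a simplification, because the outer branch of the original method is not shown in this hunk.

```ruby
# Hypothetical stand-ins for the old and new comparison in Blocking#join?.

# Old behavior: captures must be equal as arrays, case included.
def join_by_captures?(regexp, str1, str2)
  m1 = regexp.match(str1)
  m2 = regexp.match(str2)
  return false unless m1 && m2 # simplification, see the note above

  m2.captures == m1.captures
end

# New behavior: captures are concatenated and downcased before comparing.
def join_by_downcased_captures?(regexp, str1, str2)
  m1 = regexp.match(str1)
  m2 = regexp.match(str2)
  return false unless m1 && m2 # simplification, see the note above

  m2.captures.join.downcase == m1.captures.join.downcase
end

blocking = /(airbus)/i
puts join_by_captures?(blocking, 'Airbus A320', 'AIRBUS a319')           # false
puts join_by_downcased_captures?(blocking, 'Airbus A320', 'AIRBUS a319') # true

# Without capture groups both versions agree, since captures are empty.
puts join_by_captures?(/airbus/i, 'Airbus A320', 'AIRBUS a319')           # true
puts join_by_downcased_captures?(/airbus/i, 'Airbus A320', 'AIRBUS a319') # true
```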

data/lib/fuzzy_match/identity.rb CHANGED

@@ -14,7 +14,7 @@ class FuzzyMatch
     # Otherwise returns nil.
     def identical?(str1, str2)
       if str1_match_data = regexp.match(str1) and match_data = regexp.match(str2)
-        str1_match_data.captures == match_data.captures
+        str1_match_data.captures.join.downcase == match_data.captures.join.downcase
       else
         nil
       end
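
Reviewer note: `identical?` is three-valued (`true`, `false`, or `nil` when the regexp does not match both strings), and the new comparison keeps that shape. The standalone sketch below mirrors the changed line. The free function taking the regexp as an argument is a hypothetical convenience, not the gem's API. The last example shows a side effect worth knowing: joining captures discards the boundaries between groups.

```ruby
# Standalone illustration of the new comparison; not the gem's Identity class.
def identical?(regexp, str1, str2)
  m1 = regexp.match(str1)
  m2 = regexp.match(str2)
  return nil unless m1 && m2 # the identity says nothing about this pair

  m1.captures.join.downcase == m2.captures.join.downcase
end

identity = /(f)-?(\d50)/i
puts identical?(identity, 'Ford F-150', 'ford f150').inspect  # true
puts identical?(identity, 'Ford F-150', 'Ford F-250').inspect # false
puts identical?(identity, 'Ford F-150', 'GMC 1500').inspect   # nil

# Caveat: ["ab", "c"] and ["a", "bc"] both join to "abc".
puts identical?(/(\w+)-(\w+)/, 'ab-c', 'a-bc').inspect        # true
```

If group boundaries matter for a particular identity, comparing `captures.map(&:downcase)` instead would preserve them, at the cost of diverging from the line in this diff.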

data/lib/fuzzy_match/{tightener.rb → normalizer.rb} RENAMED

@@ -1,18 +1,18 @@
 class FuzzyMatch
-  # A tightener just strips a string down to its core
-  class Tightener
+  # A normalizer just strips a string down to its core
+  class Normalizer
     attr_reader :regexp
     def initialize(regexp_or_str)
       @regexp = regexp_or_str.to_regexp
     end
-    # A tightener applies when its regexp matches and captures a new (shorter) string
+    # A normalizer applies when its regexp matches and captures a new (shorter) string
     def apply?(str)
       !!(regexp.match(str))
     end
-    # The result of applying a tightener is just all the captures put together.
+    # The result of applying a normalizer is just all the captures put together.
     def apply(str)
       if match_data = regexp.match(str)
         match_data.captures.join
@@ -22,7 +22,7 @@ class FuzzyMatch
     end
     def inspect
-      "#<Tightener regexp=#{regexp.inspect}>"
+      "#<Normalizer regexp=#{regexp.inspect}>"
     end
   end
 end
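
Reviewer note: to see what "all the captures put together" means in practice, here is a self-contained sketch of the renamed class. `NormalizerSketch` is a hypothetical name. It takes a `Regexp` directly rather than calling `to_regexp`, and its `else` branch (return the string unchanged) is an assumption, because that part of `apply` falls between the two hunks above.

```ruby
# Hypothetical stand-in for FuzzyMatch::Normalizer, for illustration only.
class NormalizerSketch
  attr_reader :regexp

  def initialize(regexp)
    @regexp = regexp
  end

  # True when the regexp matches the string at all.
  def apply?(str)
    !!regexp.match(str)
  end

  # Concatenates the capture groups; plain `join` inserts no separator.
  def apply(str)
    if (match_data = regexp.match(str))
      match_data.captures.join
    else
      str # assumption: non-matching strings pass through unchanged
    end
  end
end

normalizer = NormalizerSketch.new(/(boeing).*(7\d\d)/i)
puts normalizer.apply('BOEING COMPANY 747') # BOEING747
puts normalizer.apply('boeing747')          # boeing747
puts normalizer.apply?('Airbus A320')       # false
puts normalizer.apply('Airbus A320')        # Airbus A320
```

Note that the captures are joined with no space between them, so the two Boeing strings normalize to the same text apart from case.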

data/lib/fuzzy_match/result.rb CHANGED

@@ -27,7 +27,7 @@ ERB
     attr_accessor :read
     attr_accessor :haystack
     attr_accessor :options
-    attr_accessor :tighteners
+    attr_accessor :normalizers
     attr_accessor :blockings
     attr_accessor :identities
     attr_accessor :stop_words