fuzzy_match 1.1.1 → 1.2.1

data/.gitignore CHANGED
@@ -15,8 +15,10 @@ tmtags
15
15
 
16
16
  ## PROJECT::GENERAL
17
17
  coverage
18
- rdoc
18
+ doc
19
+ .yardoc
19
20
  pkg
20
21
 
21
22
  ## PROJECT::SPECIFIC
22
23
  Gemfile.lock
24
+ *.gem
data/README.markdown ADDED
@@ -0,0 +1,124 @@
1
+ # fuzzy_match
2
+
3
+ Find a needle in a haystack based on string similarity (using the Pair Distance algorithm and Levenshtein distance) and regular expressions.
4
+
5
+ Replaces [`loose_tight_dictionary`](https://github.com/seamusabshere/loose_tight_dictionary) because that was a confusing name.
6
+
7
+ ## Quickstart
8
+
9
+ >> require 'fuzzy_match'
10
+ => true
11
+ >> FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus')
12
+ => "seamus"
13
+
14
+ ## Default matching (string similarity)
15
+
16
+ If you configure nothing else, string similarity matching is used. That's why we call it fuzzy matching.
17
+
18
+ The algorithm is [Dice's Coefficient](http://en.wikipedia.org/wiki/Dice's_coefficient) (aka Pair Distance) because it seemed to work better than Jaro Winkler, etc.
19
+
20
+ ## Rules (regular expressions)
21
+
22
+ You can improve the default matchings with rules, which are generally regular expressions.
23
+
24
+ >> require 'fuzzy_match'
25
+ => true
26
+ >> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :blockings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d\d\d)/i ])
27
+ => #<FuzzyMatch: [...]>
28
+ >> matcher.find('fordf250')
29
+ => "Ford F-250"
30
+ >> matcher.find('gmc truck k1500')
31
+ => "GMC 1500"
32
+
33
+ ### Blockings
34
+
35
+ Group records together.
36
+
37
+ Setting a blocking of `/Airbus/` ensures that strings containing "Airbus" will only be scored against other strings containing "Airbus". A better blocking in this case would probably be `/airbus/i`.
38
+
39
+ ### Normalizers (formerly called tighteners)
40
+
41
+ Strip strings down to the essentials.
42
+
43
+ Adding a normalizer like `/(boeing).*(7\d\d)/i` will cause "BOEING COMPANY 747" and "boeing747" to be scored as if they were "BOEING 747" and "boeing 747", respectively. See also "Case sensitivity" below.
44
+
45
+ ### Identities
46
+
47
+ Prevent impossible matches.
48
+
49
+ Adding an identity like `/(F)\-?(\d50)/` ensures that "Ford F-150" and "Ford F-250" never match.
50
+
51
+ ### Stop words
52
+
53
+ Ignore common and/or meaningless words.
54
+
55
+ Adding a stop word like `THE` ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT".
56
+
57
+ ## Find options
58
+
59
+ * `read`: how to interpret each record in the 'haystack', either a Proc or a symbol
60
+ * `must_match_blocking`: don't return a match unless the needle fits into one of the blockings you specified
61
+ * `must_match_at_least_one_word`: don't return a match unless the needle shares at least one word with the match
62
+ * `first_blocking_decides`: force records into the first blocking they match, rather than choosing a blocking that will give them a higher score
63
+ * `gather_last_result`: enable `last_result`
64
+
65
+ ### `:read`
66
+
67
+ So, what if your needle is a string like `youruguay` and your haystack is full of `Country` objects like `<Country name:"Uruguay">`?
68
+
69
+ >> FuzzyMatch.new(Country.all, :read => :name).find('youruguay')
70
+ => <Country name:"Uruguay">
71
+
72
+ ## Case sensitivity
73
+
74
+ String similarity is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
75
+
76
+ Be careful when trying to use case-sensitivity in your rules; in general, things are downcased before comparing.
77
+
78
+ ## Dice's coefficient edge case
79
+
80
+ In edge cases where Dice's coefficient finds that two strings are equally similar to a third string, Levenshtein distance is used to break the tie. For example, pair distance considers "RATZ" and "CATZ" to be equally similar to "RITZ", so we invoke Levenshtein.
81
+
82
+ >> require 'amatch'
83
+ => true
84
+ >> 'RITZ'.pair_distance_similar 'RATZ'
85
+ => 0.3333333333333333
86
+ >> 'RITZ'.pair_distance_similar 'CATZ' # <-- pair distance can't tell the difference, so we fall back to levenshtein...
87
+ => 0.3333333333333333
88
+ >> 'RITZ'.levenshtein_similar 'RATZ'
89
+ => 0.75
90
+ >> 'RITZ'.levenshtein_similar 'CATZ' # <-- which properly shows that RATZ should win
91
+ => 0.5
92
+
93
+ ## Production use
94
+
95
+ Used in production for over 2 years in [Brighter Planet's environmental impact API](http://impact.brighterplanet.com) and [reference data service](http://data.brighterplanet.com).
96
+
97
+ We often combine `fuzzy_match` with [`remote_table`](https://github.com/seamusabshere/remote_table) and [`errata`](https://github.com/seamusabshere/errata):
98
+
99
+ - download table with `remote_table`
100
+ - correct serious or repeated errors with `errata`
101
+ - `fuzzy_match` the rest
102
+
103
+ ## Glossary
104
+
105
+ The admittedly imperfect metaphor is "look for a needle in a haystack".
106
+
107
+ * needle: the search term
108
+ * haystack: the records you are searching (<b>your result will be an object from here</b>)
109
+
110
+ ## Credits (and how to make things faster)
111
+
112
+ If you add the [`amatch`](http://flori.github.com/amatch/) gem to your Gemfile, it will use that, which is much faster (but [segfaults have been seen in the wild](https://github.com/flori/amatch/issues/3)). Thanks [Flori](https://github.com/flori)!
113
+
114
+ Otherwise, pure ruby versions of the string similarity algorithms derived from the [answer to a StackOverflow question](http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings) and [the text gem](https://github.com/threedaymonk/text/blob/master/lib/text/levenshtein.rb) are used. Thanks [marzagao](http://stackoverflow.com/users/10997/marzagao) and [threedaymonk](https://github.com/threedaymonk)!
115
+
116
+ ## Authors
117
+
118
+ * Seamus Abshere <seamus@abshere.net>
119
+ * Ian Hough <ijhough@gmail.com>
120
+ * Andy Rossmeissl <andy@rossmeissl.net>
121
+
122
+ ## Copyright
123
+
124
+ Copyright 2012 Brighter Planet, Inc.
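The "Dice's coefficient edge case" section of the README added above can be reproduced without `amatch`. The following is a minimal pure-Ruby sketch, not the gem's own implementation: the method names (`bigrams`, `dice_similarity`, `levenshtein_similarity`, `best_match`) are hypothetical, and it assumes that ties on the pair-distance score are broken by the Levenshtein score, as the README describes.

```ruby
# Illustrative sketch only; these helpers are not part of fuzzy_match's API.

# Split a string into adjacent character pairs (bigrams), word by word.
def bigrams(str)
  str.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Dice's coefficient (pair distance): 2 * shared pairs / total pairs.
def dice_similarity(a, b)
  pairs_a = bigrams(a)
  remaining = bigrams(b)
  total = pairs_a.size + remaining.size
  return 0.0 if total.zero?
  # Count each pair of b at most once by removing it when matched.
  shared = pairs_a.count do |pair|
    idx = remaining.index(pair)
    idx && remaining.delete_at(idx)
  end
  2.0 * shared / total
end

# Classic row-by-row dynamic-programming edit distance.
def levenshtein_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Normalize edit distance to a 0.0..1.0 similarity.
def levenshtein_similarity(a, b)
  a = a.downcase
  b = b.downcase
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein_distance(a, b).to_f / longest
end

# Rank by Dice first; Levenshtein only matters when Dice scores tie.
def best_match(needle, haystack)
  haystack.max_by do |candidate|
    [dice_similarity(needle, candidate), levenshtein_similarity(needle, candidate)]
  end
end

puts best_match('RITZ', %w[CATZ RATZ])
```

Sorting on a `[dice, levenshtein]` pair keeps the tie-break explicit: the second element is only consulted when the first elements are equal.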
data/Rakefile CHANGED
@@ -10,12 +10,9 @@ end
10
10
 
11
11
  task :default => :test
12
12
 
13
- require 'rake/rdoctask'
14
- Rake::RDocTask.new do |rdoc|
15
- version = File.exist?('VERSION') ? File.read('VERSION') : ""
16
-
17
- rdoc.rdoc_dir = 'rdoc'
18
- rdoc.title = "fuzzy_match #{version}"
19
- rdoc.rdoc_files.include('README*')
20
- rdoc.rdoc_files.include('lib/**/*.rb')
13
+ require 'yard'
14
+ require File.expand_path('../lib/fuzzy_match/version.rb', __FILE__)
15
+ YARD::Rake::YardocTask.new do |t|
16
+ t.files = ['lib/**/*.rb', 'README.markdown'] # optional
17
+ # t.options = ['--any', '--extra', '--opts'] # optional
21
18
  end
@@ -14,8 +14,8 @@
14
14
  325 ./benchmark/../lib/fuzzy_match.rb:35:FuzzyMatch::Wrapper
15
15
  320 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/format/delimited.rb:28:String
16
16
  303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
17
- 201 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
18
- 184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
17
+ 201 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
18
+ 184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
19
19
  140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
20
20
  41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
21
21
  31 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
@@ -45,8 +45,8 @@
45
45
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
46
46
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
47
47
  8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
48
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
49
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
48
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
49
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
50
50
  8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
51
51
  8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
52
52
  8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -71,8 +71,8 @@
71
71
  6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
72
72
  5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
73
73
  5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
74
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
75
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
74
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
75
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
76
76
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
77
77
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
78
78
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -85,7 +85,7 @@
85
85
  5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
86
86
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
87
87
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
88
- 4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
88
+ 4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
89
89
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
90
90
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
91
91
  4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -112,10 +112,10 @@
112
112
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
113
113
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
114
114
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
115
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
116
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
117
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
118
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
115
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
116
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
117
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
118
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
119
119
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
120
120
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
121
121
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -159,15 +159,15 @@
159
159
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
160
160
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
161
161
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
162
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
163
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
164
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
165
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
166
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
167
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
168
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
169
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
170
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
162
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
163
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
164
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
165
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
166
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
167
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
168
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
169
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
170
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
171
171
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
172
172
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
173
173
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -230,11 +230,11 @@
230
230
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
231
231
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
232
232
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
233
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:String
234
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
235
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
236
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
237
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
233
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
234
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
235
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
236
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
237
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
238
238
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
239
239
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
240
240
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
@@ -11,7 +11,7 @@
11
11
  779 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
12
12
  779 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
13
13
  676 benchmark/memory.rb:21:String
14
- 607 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
14
+ 607 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
15
15
  444 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
16
16
  342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
17
17
  325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
@@ -25,7 +25,7 @@
25
25
  303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
26
26
  234 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
27
27
  234 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
28
- 184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
28
+ 184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
29
29
  140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
30
30
  129 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
31
31
  127 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
@@ -37,13 +37,13 @@
37
37
  118 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
38
38
  117 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
39
39
  117 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
40
- 102 ./benchmark/../lib/fuzzy_match/tightener.rb:20:Array
41
- 102 ./benchmark/../lib/fuzzy_match/tightener.rb:19:MatchData
42
- 101 ./benchmark/../lib/fuzzy_match/tightener.rb:14:MatchData
40
+ 102 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
41
+ 102 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
42
+ 101 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
43
43
  41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
44
44
  36 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
45
45
  28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
46
- 26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Tightener
46
+ 26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
47
47
  22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
48
48
  22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
49
49
  17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
@@ -65,9 +65,9 @@
65
65
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
66
66
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
67
67
  8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
68
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
69
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
70
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
68
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
69
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
70
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
71
71
  8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
72
72
  8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
73
73
  8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -77,10 +77,10 @@
77
77
  7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
78
78
  7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
79
79
  6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
80
- 6 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
81
- 6 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
82
- 6 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
83
- 6 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
80
+ 6 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
81
+ 6 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
82
+ 6 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
83
+ 6 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
84
84
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
85
85
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:15:__node__
86
86
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:13:__node__
@@ -89,8 +89,8 @@
89
89
  6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
90
90
  5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
91
91
  5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
92
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
93
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
92
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
93
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
94
94
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
95
95
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
96
96
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -100,8 +100,8 @@
100
100
  4 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
101
101
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
102
102
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
103
- 4 ./benchmark/../lib/fuzzy_match/tightener.rb:4:__node__
104
- 4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
103
+ 4 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:__node__
104
+ 4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
105
105
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
106
106
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
107
107
  4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -116,12 +116,12 @@
116
116
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
117
117
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
118
118
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
119
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
120
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:30:__node__
121
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:29:__node__
122
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
123
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
124
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
119
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
120
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:30:__node__
121
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:29:__node__
122
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
123
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
124
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
125
125
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
126
126
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
127
127
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -160,12 +160,12 @@
160
160
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
161
161
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
162
162
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
163
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
164
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
165
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
166
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
167
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
168
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
163
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
164
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
165
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
166
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
167
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
168
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
169
169
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
170
170
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
171
171
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -218,8 +218,8 @@
218
218
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
219
219
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
220
220
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
221
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
222
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
221
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
222
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
223
223
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
224
224
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
225
225
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
data/benchmark/before.txt CHANGED
@@ -11,7 +11,7 @@
11
11
  806 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
12
12
  805 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
13
13
  688 benchmark/memory.rb:21:String
14
- 639 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
14
+ 639 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
15
15
  448 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
16
16
  342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
17
17
  325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
@@ -25,7 +25,7 @@
25
25
  303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
26
26
  242 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
27
27
  242 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
28
- 184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
28
+ 184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
29
29
  140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
30
30
  133 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
31
31
  131 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
@@ -37,13 +37,13 @@
37
37
  122 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
38
38
  121 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
39
39
  121 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
40
- 110 ./benchmark/../lib/fuzzy_match/tightener.rb:20:Array
41
- 110 ./benchmark/../lib/fuzzy_match/tightener.rb:19:MatchData
42
- 109 ./benchmark/../lib/fuzzy_match/tightener.rb:14:MatchData
40
+ 110 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
41
+ 110 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
42
+ 109 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
43
43
  57 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
44
44
  41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
45
45
  28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
46
- 26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Tightener
46
+ 26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
47
47
  22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
48
48
  22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
49
49
  21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Blocking
@@ -67,8 +67,8 @@
67
67
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
68
68
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
69
69
  8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
70
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
71
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
70
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
71
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
72
72
  8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
73
73
  8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
74
74
  8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -92,8 +92,8 @@
92
92
  6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
93
93
  5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
94
94
  5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
95
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
96
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
95
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
96
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
97
97
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
98
98
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
99
99
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -106,7 +106,7 @@
106
106
  5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
107
107
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
108
108
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
109
- 4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
109
+ 4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
110
110
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
111
111
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
112
112
  4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -133,10 +133,10 @@
133
133
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
134
134
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
135
135
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
136
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
137
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
138
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
139
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
136
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
137
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
138
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
139
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
140
140
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
141
141
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
142
142
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -182,15 +182,15 @@
182
182
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
183
183
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
184
184
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
185
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
186
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
187
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
188
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
189
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
190
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
191
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
192
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
193
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
185
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
186
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
187
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
188
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
189
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
190
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
191
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
192
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
193
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
194
194
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
195
195
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
196
196
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -253,11 +253,11 @@
253
253
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
254
254
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
255
255
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
256
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:String
257
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
258
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
259
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
260
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
256
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
257
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
258
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
259
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
260
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
261
261
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
262
262
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
263
263
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
data/benchmark/memory.rb CHANGED
@@ -28,9 +28,9 @@ MUST_MATCH_BLOCKING = false
28
28
  # (Example) We made these by trial and error
29
29
  BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
30
30
 
31
- # Tighteners
31
+ # Normalizers
32
32
  # (Example) We made these by trial and error
33
- TIGHTENERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/tighteners.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
33
+ NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
34
34
 
35
35
  # Identities
36
36
  # (Example) We made these by trial and error
@@ -39,7 +39,7 @@ IDENTITIES = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/
39
39
  FINAL_OPTIONS = {
40
40
  :read => HAYSTACK_READER,
41
41
  :must_match_blocking => MUST_MATCH_BLOCKING,
42
- :tighteners => TIGHTENERS,
42
+ :normalizers => NORMALIZERS,
43
43
  :identities => IDENTITIES,
44
44
  :blockings => BLOCKINGS
45
45
  }
@@ -48,7 +48,6 @@ Memprof.start
48
48
 
49
49
  d = FuzzyMatch.new HAYSTACK, FINAL_OPTIONS
50
50
  record = d.find('boeing 707(100)', :gather_last_result => false)
51
- # d.free
52
51
 
53
52
  Memprof.stats
54
53
  Memprof.stop
@@ -26,9 +26,9 @@ MUST_MATCH_BLOCKING = false
26
26
  # (Example) We made these by trial and error
27
27
  BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
28
28
 
29
- # Tighteners
29
+ # Normalizers
30
30
  # (Example) We made these by trial and error
31
- TIGHTENERS = RemoteTable.new(:url => "file://#{File.expand_path("../tighteners.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
31
+ NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
32
32
 
33
33
  # Identities
34
34
  # (Example) We made these by trial and error
@@ -65,7 +65,7 @@ NEGATIVES = RemoteTable.new :url => "file://#{File.expand_path("../negatives.csv
65
65
  FINAL_OPTIONS = {
66
66
  :read => HAYSTACK_READER,
67
67
  :must_match_blocking => MUST_MATCH_BLOCKING,
68
- :tighteners => TIGHTENERS,
68
+ :normalizers => NORMALIZERS,
69
69
  :identities => IDENTITIES,
70
70
  :blockings => BLOCKINGS
71
71
  }
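The hunks above rename the `:tighteners` option to `:normalizers`. Whether 1.2.1 still accepts the old key is not visible in this diff, so callers that build option hashes dynamically may want to translate the key themselves. A minimal sketch, assuming a hypothetical helper named `migrate_fuzzy_match_options` (not part of the gem):

```ruby
# Hypothetical helper: rename the 1.1.x :tighteners key to :normalizers.
# Returns a new hash and leaves the caller's hash untouched.
def migrate_fuzzy_match_options(options)
  migrated = options.dup
  if migrated.key?(:tighteners)
    legacy = migrated.delete(:tighteners)
    # Prefer an explicit :normalizers value if both keys were given.
    migrated[:normalizers] ||= legacy
  end
  migrated
end

old_options = { :tighteners => [/(boeing).*(7\d\d)/i], :blockings => [/boeing/i] }
new_options = migrate_fuzzy_match_options(old_options)
puts new_options.keys.inspect
```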
@@ -24,7 +24,7 @@ class FuzzyMatch
24
24
  def join?(str1, str2)
25
25
  if str2_match_data = regexp.match(str2)
26
26
  if str1_match_data = regexp.match(str1)
27
- str2_match_data.captures == str1_match_data.captures
27
+ str2_match_data.captures.join.downcase == str1_match_data.captures.join.downcase
28
28
  else
29
29
  false
30
30
  end
@@ -14,7 +14,7 @@ class FuzzyMatch
14
14
  # Otherwise returns nil.
15
15
  def identical?(str1, str2)
16
16
  if str1_match_data = regexp.match(str1) and match_data = regexp.match(str2)
17
- str1_match_data.captures == match_data.captures
17
+ str1_match_data.captures.join.downcase == match_data.captures.join.downcase
18
18
  else
19
19
  nil
20
20
  end
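Both the `join?` and `identical?` hunks above switch from comparing capture arrays directly to comparing the joined, downcased captures. The sketch below contrasts the two comparisons in isolation; the method names and the `IDENTITY` constant are illustrative and do not appear in the gem.

```ruby
# Illustrative regexp, modelled on the identity example in the README.
IDENTITY = /(f)-?(\d50)/i

# 1.1.1 behaviour: capture arrays must be equal element by element, case included.
def captures_equal_strict?(regexp, str1, str2)
  m1 = regexp.match(str1)
  m2 = regexp.match(str2)
  return nil unless m1 && m2
  m1.captures == m2.captures
end

# 1.2.1 behaviour: captures are joined and downcased before comparing.
def captures_equal_downcased?(regexp, str1, str2)
  m1 = regexp.match(str1)
  m2 = regexp.match(str2)
  return nil unless m1 && m2
  m1.captures.join.downcase == m2.captures.join.downcase
end

puts captures_equal_strict?(IDENTITY, 'Ford F-150', 'ford f150').inspect
puts captures_equal_downcased?(IDENTITY, 'Ford F-150', 'ford f150').inspect
puts captures_equal_downcased?(IDENTITY, 'Ford F-150', 'Ford F-250').inspect
```

One side effect worth knowing: joining discards the boundaries between capture groups, so captures `["ab", "c"]` and `["a", "bc"]` compare as equal under the new rule.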
@@ -1,18 +1,18 @@
1
1
  class FuzzyMatch
2
- # A tightener just strips a string down to its core
3
- class Tightener
2
+ # A normalizer just strips a string down to its core
3
+ class Normalizer
4
4
  attr_reader :regexp
5
5
 
6
6
  def initialize(regexp_or_str)
7
7
  @regexp = regexp_or_str.to_regexp
8
8
  end
9
9
 
10
- # A tightener applies when its regexp matches and captures a new (shorter) string
10
+ # A normalizer applies when its regexp matches and captures a new (shorter) string
11
11
  def apply?(str)
12
12
  !!(regexp.match(str))
13
13
  end
14
14
 
15
- # The result of applying a tightener is just all the captures put together.
15
+ # The result of applying a normalizer is just all the captures put together.
16
16
  def apply(str)
17
17
  if match_data = regexp.match(str)
18
18
  match_data.captures.join
@@ -22,7 +22,7 @@ class FuzzyMatch
22
22
  end
23
23
 
24
24
  def inspect
25
- "#<Tightener regexp=#{regexp.inspect}>"
25
+ "#<Normalizer regexp=#{regexp.inspect}>"
26
26
  end
27
27
  end
28
28
  end
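To see what the renamed `Normalizer` does to a string, here is a standalone sketch that mirrors the `apply?` and `apply` methods shown in the hunk above. The class name `SketchNormalizer` is hypothetical, and the no-match branch (returning the string unchanged) is an assumption, because those lines are elided from the diff.

```ruby
# Standalone sketch of a normalizer; not the gem's class.
class SketchNormalizer
  attr_reader :regexp

  def initialize(regexp)
    @regexp = regexp
  end

  # A normalizer applies when its regexp matches the string.
  def apply?(str)
    !!regexp.match(str)
  end

  # The result is all captures concatenated, with no separator.
  def apply(str)
    if (match_data = regexp.match(str))
      match_data.captures.join
    else
      str # assumption: the elided branch leaves non-matching strings as-is
    end
  end
end

normalizer = SketchNormalizer.new(/(boeing).*(7\d\d)/i)
puts normalizer.apply('BOEING COMPANY 747')
puts normalizer.apply('boeing747')
puts normalizer.apply('Airbus A320')
```

Because the captures are joined without a separator, "BOEING COMPANY 747" reduces to "BOEING747" in this sketch; case differences are left for the downcasing step that happens before scoring.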
@@ -27,7 +27,7 @@ ERB
27
27
  attr_accessor :read
28
28
  attr_accessor :haystack
29
29
  attr_accessor :options
30
- attr_accessor :tighteners
30
+ attr_accessor :normalizers
31
31
  attr_accessor :blockings
32
32
  attr_accessor :identities
33
33
  attr_accessor :stop_words