fuzzy_match 1.1.1 → 1.2.1

data/.gitignore CHANGED
@@ -15,8 +15,10 @@ tmtags
15
15
 
16
16
  ## PROJECT::GENERAL
17
17
  coverage
18
- rdoc
18
+ doc
19
+ .yardoc
19
20
  pkg
20
21
 
21
22
  ## PROJECT::SPECIFIC
22
23
  Gemfile.lock
24
+ *.gem
data/README.markdown ADDED
@@ -0,0 +1,124 @@
1
+ # fuzzy_match
2
+
3
+ Find a needle in a haystack based on string similarity (using the Pair Distance algorithm and Levenshtein distance) and regular expressions.
4
+
5
+ Replaces [`loose_tight_dictionary`](https://github.com/seamusabshere/loose_tight_dictionary) because that was a confusing name.
6
+
7
+ ## Quickstart
8
+
9
+ >> require 'fuzzy_match'
10
+ => true
11
+ >> FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus')
12
+ => "seamus"
13
+
14
+ ## Default matching (string similarity)
15
+
16
+ If you configure nothing else, string similarity matching is used. That's why we call it fuzzy matching.
17
+
18
+ The algorithm is [Dice's Coefficient](http://en.wikipedia.org/wiki/Dice's_coefficient) (aka Pair Distance) because it seemed to work better than Jaro Winkler, etc.
19
+
20
+ ## Rules (regular expressions)
21
+
22
+ You can improve the default matchings with rules, which are generally regular expressions.
23
+
24
+ >> require 'fuzzy_match'
25
+ => true
26
+ >> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :blockings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d\d\d)/i ])
27
+ => #<FuzzyMatch: [...]>
28
+ >> matcher.find('fordf250')
29
+ => "Ford F-250"
30
+ >> matcher.find('gmc truck k1500')
31
+ => "GMC 1500"
32
+
33
+ ### Blockings
34
+
35
+ Group records together.
36
+
37
+ Setting a blocking of `/Airbus/` ensures that strings containing "Airbus" are scored only against other strings containing "Airbus". A better blocking in this case would probably be `/airbus/i`.
38
+
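As a rough illustration of the blocking rule described above, the sketch below mirrors the nested checks of the `join?` method that appears later in this diff. The helper name `same_block?` and its argument order are hypothetical, and the `nil` returned when the second string does not match is an assumption, since that branch is not visible in the diff.

```ruby
# Hypothetical helper, not part of the gem: decides whether str1 may be
# scored against str2 under a single blocking regexp.
def same_block?(regexp, str1, str2)
  m2 = regexp.match(str2)
  # Assumption: the blocking has nothing to say when str2 does not match.
  return nil unless m2

  m1 = regexp.match(str1)
  return false unless m1

  # Same comparison as the changed line in the diff: joined, downcased captures.
  m2.captures.join.downcase == m1.captures.join.downcase
end

puts same_block?(/Airbus/, "airbus a320", "Airbus A380").inspect  # false
puts same_block?(/airbus/i, "airbus a320", "Airbus A380").inspect # true
```

The first call shows why the case-insensitive `/airbus/i` is the safer blocking: with `/Airbus/`, the lowercase needle falls outside the block.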
39
+ ### Normalizers (formerly called tighteners)
40
+
41
+ Strip strings down to the essentials.
42
+
43
+ Adding a normalizer like `/(boeing).*(7\d\d)/i` will cause "BOEING COMPANY 747" and "boeing747" to be scored as if they were "BOEING 747" and "boeing 747", respectively. See also "Case sensitivity" below.
44
+
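A minimal sketch of what applying a normalizer amounts to, based on the `apply` method shown later in this diff (`match_data.captures.join`). The standalone function `normalize` is hypothetical, and returning the string unchanged when the regexp does not match is an assumption, because that branch is cut off in the diff. Note that the captures are joined here without a separator; the exact string the gem scores may differ.

```ruby
# Hypothetical stand-in for a normalizer: keep only what the capture groups matched.
def normalize(regexp, str)
  match_data = regexp.match(str)
  # Assumption: strings that do not match are passed through untouched.
  match_data ? match_data.captures.join : str
end

normalizer = /(boeing).*(7\d\d)/i
puts normalize(normalizer, "BOEING COMPANY 747") # BOEING747
puts normalize(normalizer, "boeing747")          # boeing747
puts normalize(normalizer, "Airbus A380")        # Airbus A380
```

After normalization the two Boeing strings differ only by case, which the scoring step ignores according to the "Case sensitivity" section.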
45
+ ### Identities
46
+
47
+ Prevent impossible matches.
48
+
49
+ Adding an identity like `/(F)\-?(\d50)/` ensures that "Ford F-150" and "Ford F-250" never match.
50
+
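The identity check can be sketched as follows. The comparison follows the `identical?` method shown later in this diff, including the new `captures.join.downcase` step; the free-standing function that takes the regexp as an argument is only an illustration, not the gem's interface.

```ruby
# Returns true or false when both strings match the identity regexp,
# and nil when the identity does not apply to the pair.
def identical?(regexp, str1, str2)
  if (m1 = regexp.match(str1)) && (m2 = regexp.match(str2))
    m1.captures.join.downcase == m2.captures.join.downcase
  else
    nil
  end
end

identity = /(f)-?(\d\d\d)/i
puts identical?(identity, "Ford F-150", "Ford F-250").inspect # false
puts identical?(identity, "fordf250", "Ford F-250").inspect   # true
puts identical?(identity, "GMC 1500", "Ford F-250").inspect   # nil
```

A `false` result vetoes the match, which is how "Ford F-150" and "Ford F-250" are kept apart, while `nil` leaves the decision to the similarity score.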
51
+ ### Stop words
52
+
53
+ Ignore common and/or meaningless words.
54
+
55
+ Adding a stop word like `THE` ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT".
56
+
57
+ ## Find options
58
+
59
+ * `read`: how to interpret each record in the 'haystack', either a Proc or a symbol
60
+ * `must_match_blocking`: don't return a match unless the needle fits into one of the blockings you specified
61
+ * `must_match_at_least_one_word`: don't return a match unless the needle shares at least one word with the match
62
+ * `first_blocking_decides`: force records into the first blocking they match, rather than choosing a blocking that will give them a higher score
63
+ * `gather_last_result`: enable `last_result`
64
+
65
+ ### `:read`
66
+
67
+ So, what if your needle is a string like `youruguay` and your haystack is full of `Country` objects like `<Country name:"Uruguay">`?
68
+
69
+ >> FuzzyMatch.new(Country.all, :read => :name).find('youruguay')
70
+ => <Country name:"Uruguay">
71
+
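One plausible way to picture the `:read` option, which is described above as "either a Proc or a symbol", is sketched below. This is an assumption about the behavior, not the gem's code: `read_record` and the `Country` struct are hypothetical names introduced only for this example.

```ruby
# Hypothetical stand-in for the Country objects in the example above.
Country = Struct.new(:name)

# Assumed behavior: a Proc is called with the record, a Symbol names a
# method to call on it, and anything else falls back to to_s.
def read_record(record, read)
  case read
  when Proc   then read.call(record)
  when Symbol then record.send(read)
  else record.to_s
  end
end

uruguay = Country.new("Uruguay")
puts read_record(uruguay, :name)                          # Uruguay
puts read_record(uruguay, lambda { |c| c.name.downcase }) # uruguay
```

Either way, only the extracted string takes part in scoring; the object returned by `find` is still the original record from the haystack.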
72
+ ## Case sensitivity
73
+
74
+ String similarity is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
75
+
76
+ Be careful when trying to use case-sensitivity in your rules; in general, things are downcased before comparing.
77
+
78
+ ## Dice's coefficient edge case
79
+
80
+ When Dice's coefficient scores two strings as equally similar to a third string, Levenshtein distance is used to break the tie. For example, pair distance considers "RATZ" and "CATZ" to be equally similar to "RITZ", so we invoke Levenshtein.
81
+
82
+ >> require 'amatch'
83
+ => true
84
+ >> 'RITZ'.pair_distance_similar 'RATZ'
85
+ => 0.3333333333333333
86
+ >> 'RITZ'.pair_distance_similar 'CATZ' # <-- pair distance can't tell the difference, so we fall back to levenshtein...
87
+ => 0.3333333333333333
88
+ >> 'RITZ'.levenshtein_similar 'RATZ'
89
+ => 0.75
90
+ >> 'RITZ'.levenshtein_similar 'CATZ' # <-- which properly shows that RATZ should win
91
+ => 0.5
92
+
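The numbers in the session above can be reproduced without `amatch` using the pure-Ruby sketch below. It is a simplified illustration, not the gem's implementation: the method names are hypothetical, bigrams are taken over the whole upcased string rather than word by word, and Levenshtein similarity is assumed to be one minus the edit distance divided by the longer length.

```ruby
# Adjacent character pairs of a string, compared case-insensitively.
def bigrams(str)
  s = str.upcase
  (0...(s.length - 1)).map { |i| s[i, 2] }
end

# Dice's coefficient: twice the shared pairs over the total number of pairs.
def pair_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.length + pairs_b.length
  return 0.0 if total.zero?

  shared = 0
  pairs_a.each do |pair|
    idx = pairs_b.index(pair)
    next unless idx
    pairs_b.delete_at(idx) # each pair in b may be matched only once
    shared += 1
  end
  2.0 * shared / total
end

# Classic dynamic-programming edit distance, one row at a time.
def levenshtein_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein_distance(a, b).to_f / longest
end

puts pair_similarity("RITZ", "RATZ")        # 0.3333333333333333
puts pair_similarity("RITZ", "CATZ")        # 0.3333333333333333
puts levenshtein_similarity("RITZ", "RATZ") # 0.75
puts levenshtein_similarity("RITZ", "CATZ") # 0.5

# Rank by pair similarity first and fall back to Levenshtein on ties.
best = %w[CATZ RATZ].max_by { |c| [pair_similarity("RITZ", c), levenshtein_similarity("RITZ", c)] }
puts best # RATZ
```

Both candidates share only the pair "TZ" with "RITZ", so Dice's coefficient ties at 2/6, and the edit distance (1 versus 2) decides in favor of "RATZ".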
93
+ ## Production use
94
+
95
+ Used in production for over 2 years in [Brighter Planet's environmental impact API](http://impact.brighterplanet.com) and [reference data service](http://data.brighterplanet.com).
96
+
97
+ We often combine `fuzzy_match` with [`remote_table`](https://github.com/seamusabshere/remote_table) and [`errata`](https://github.com/seamusabshere/errata):
98
+
99
+ - download table with `remote_table`
100
+ - correct serious or repeated errors with `errata`
101
+ - `fuzzy_match` the rest
102
+
103
+ ## Glossary
104
+
105
+ The admittedly imperfect metaphor is "look for a needle in a haystack":
106
+
107
+ * needle: the search term
108
+ * haystack: the records you are searching (<b>your result will be an object from here</b>)
109
+
110
+ ## Credits (and how to make things faster)
111
+
112
+ If you add the [`amatch`](http://flori.github.com/amatch/) gem to your Gemfile, `fuzzy_match` will use it, which is much faster (but [segfaults have been seen in the wild](https://github.com/flori/amatch/issues/3)). Thanks [Flori](https://github.com/flori)!
113
+
114
+ Otherwise, pure Ruby versions of the string similarity algorithms are used, derived from the [answer to a StackOverflow question](http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings) and [the text gem](https://github.com/threedaymonk/text/blob/master/lib/text/levenshtein.rb). Thanks [marzagao](http://stackoverflow.com/users/10997/marzagao) and [threedaymonk](https://github.com/threedaymonk)!
115
+
116
+ ## Authors
117
+
118
+ * Seamus Abshere <seamus@abshere.net>
119
+ * Ian Hough <ijhough@gmail.com>
120
+ * Andy Rossmeissl <andy@rossmeissl.net>
121
+
122
+ ## Copyright
123
+
124
+ Copyright 2012 Brighter Planet, Inc.
data/Rakefile CHANGED
@@ -10,12 +10,9 @@ end
10
10
 
11
11
  task :default => :test
12
12
 
13
- require 'rake/rdoctask'
14
- Rake::RDocTask.new do |rdoc|
15
- version = File.exist?('VERSION') ? File.read('VERSION') : ""
16
-
17
- rdoc.rdoc_dir = 'rdoc'
18
- rdoc.title = "fuzzy_match #{version}"
19
- rdoc.rdoc_files.include('README*')
20
- rdoc.rdoc_files.include('lib/**/*.rb')
13
+ require 'yard'
14
+ require File.expand_path('../lib/fuzzy_match/version.rb', __FILE__)
15
+ YARD::Rake::YardocTask.new do |t|
16
+ t.files = ['lib/**/*.rb', 'README.markdown'] # optional
17
+ # t.options = ['--any', '--extra', '--opts'] # optional
21
18
  end
@@ -14,8 +14,8 @@
14
14
  325 ./benchmark/../lib/fuzzy_match.rb:35:FuzzyMatch::Wrapper
15
15
  320 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/format/delimited.rb:28:String
16
16
  303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
17
- 201 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
18
- 184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
17
+ 201 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
18
+ 184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
19
19
  140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
20
20
  41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
21
21
  31 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
@@ -45,8 +45,8 @@
45
45
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
46
46
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
47
47
  8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
48
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
49
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
48
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
49
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
50
50
  8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
51
51
  8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
52
52
  8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -71,8 +71,8 @@
71
71
  6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
72
72
  5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
73
73
  5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
74
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
75
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
74
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
75
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
76
76
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
77
77
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
78
78
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -85,7 +85,7 @@
85
85
  5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
86
86
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
87
87
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
88
- 4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
88
+ 4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
89
89
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
90
90
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
91
91
  4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -112,10 +112,10 @@
112
112
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
113
113
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
114
114
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
115
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
116
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
117
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
118
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
115
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
116
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
117
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
118
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
119
119
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
120
120
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
121
121
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -159,15 +159,15 @@
159
159
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
160
160
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
161
161
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
162
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
163
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
164
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
165
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
166
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
167
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
168
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
169
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
170
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
162
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
163
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
164
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
165
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
166
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
167
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
168
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
169
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
170
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
171
171
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
172
172
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
173
173
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -230,11 +230,11 @@
230
230
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
231
231
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
232
232
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
233
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:String
234
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
235
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
236
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
237
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
233
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
234
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
235
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
236
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
237
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
238
238
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
239
239
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
240
240
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
@@ -11,7 +11,7 @@
11
11
  779 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
12
12
  779 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
13
13
  676 benchmark/memory.rb:21:String
14
- 607 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
14
+ 607 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
15
15
  444 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
16
16
  342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
17
17
  325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
@@ -25,7 +25,7 @@
25
25
  303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
26
26
  234 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
27
27
  234 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
28
- 184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
28
+ 184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
29
29
  140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
30
30
  129 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
31
31
  127 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
@@ -37,13 +37,13 @@
37
37
  118 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
38
38
  117 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
39
39
  117 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
40
- 102 ./benchmark/../lib/fuzzy_match/tightener.rb:20:Array
41
- 102 ./benchmark/../lib/fuzzy_match/tightener.rb:19:MatchData
42
- 101 ./benchmark/../lib/fuzzy_match/tightener.rb:14:MatchData
40
+ 102 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
41
+ 102 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
42
+ 101 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
43
43
  41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
44
44
  36 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
45
45
  28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
46
- 26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Tightener
46
+ 26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
47
47
  22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
48
48
  22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
49
49
  17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
@@ -65,9 +65,9 @@
65
65
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
66
66
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
67
67
  8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
68
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
69
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
70
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
68
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
69
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
70
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
71
71
  8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
72
72
  8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
73
73
  8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -77,10 +77,10 @@
77
77
  7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
78
78
  7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
79
79
  6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
80
- 6 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
81
- 6 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
82
- 6 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
83
- 6 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
80
+ 6 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
81
+ 6 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
82
+ 6 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
83
+ 6 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
84
84
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
85
85
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:15:__node__
86
86
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:13:__node__
@@ -89,8 +89,8 @@
89
89
  6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
90
90
  5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
91
91
  5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
92
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
93
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
92
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
93
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
94
94
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
95
95
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
96
96
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -100,8 +100,8 @@
100
100
  4 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
101
101
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
102
102
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
103
- 4 ./benchmark/../lib/fuzzy_match/tightener.rb:4:__node__
104
- 4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
103
+ 4 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:__node__
104
+ 4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
105
105
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
106
106
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
107
107
  4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -116,12 +116,12 @@
116
116
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
117
117
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
118
118
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
119
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
120
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:30:__node__
121
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:29:__node__
122
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
123
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
124
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
119
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
120
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:30:__node__
121
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:29:__node__
122
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
123
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
124
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
125
125
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
126
126
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
127
127
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -160,12 +160,12 @@
160
160
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
161
161
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
162
162
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
163
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
164
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
165
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
166
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
167
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
168
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
163
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
164
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
165
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
166
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
167
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
168
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
169
169
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
170
170
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
171
171
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -218,8 +218,8 @@
218
218
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
219
219
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
220
220
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
221
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
222
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
221
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
222
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
223
223
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
224
224
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
225
225
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
data/benchmark/before.txt CHANGED
@@ -11,7 +11,7 @@
11
11
  806 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
12
12
  805 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
13
13
  688 benchmark/memory.rb:21:String
14
- 639 ./benchmark/../lib/fuzzy_match/tightener.rb:20:String
14
+ 639 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
15
15
  448 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
16
16
  342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
17
17
  325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
@@ -25,7 +25,7 @@
25
25
  303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
26
26
  242 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
27
27
  242 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
28
- 184 ./benchmark/../lib/fuzzy_match/tightener.rb:14:String
28
+ 184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
29
29
  140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
30
30
  133 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
31
31
  131 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
@@ -37,13 +37,13 @@
37
37
  122 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
38
38
  121 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
39
39
  121 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
40
- 110 ./benchmark/../lib/fuzzy_match/tightener.rb:20:Array
41
- 110 ./benchmark/../lib/fuzzy_match/tightener.rb:19:MatchData
42
- 109 ./benchmark/../lib/fuzzy_match/tightener.rb:14:MatchData
40
+ 110 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
41
+ 110 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
42
+ 109 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
43
43
  57 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
44
44
  41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
45
45
  28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
46
- 26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Tightener
46
+ 26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
47
47
  22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
48
48
  22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
49
49
  21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Blocking
@@ -67,8 +67,8 @@
67
67
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
68
68
  9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
69
69
  8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
70
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:27:__node__
71
- 8 ./benchmark/../lib/fuzzy_match/tightener.rb:14:__node__
70
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
71
+ 8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
72
72
  8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
73
73
  8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
74
74
  8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
@@ -92,8 +92,8 @@
92
92
  6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
93
93
  5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
94
94
  5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
95
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:9:__node__
96
- 5 ./benchmark/../lib/fuzzy_match/tightener.rb:19:__node__
95
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
96
+ 5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
97
97
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
98
98
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
99
99
  5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
@@ -106,7 +106,7 @@
106
106
  5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
107
107
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
108
108
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
109
- 4 ./benchmark/../lib/fuzzy_match/tightener.rb:20:__node__
109
+ 4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
110
110
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
111
111
  4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
112
112
  4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
@@ -133,10 +133,10 @@
133
133
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
134
134
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
135
135
  3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
136
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:8:__node__
137
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:26:__node__
138
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:18:__node__
139
- 3 ./benchmark/../lib/fuzzy_match/tightener.rb:13:__node__
136
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
137
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
138
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
139
+ 3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
140
140
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
141
141
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
142
142
  3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
@@ -182,15 +182,15 @@
182
182
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
183
183
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
184
184
  2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
185
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:6:__node__
186
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:3:Class
187
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:28:__node__
188
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:27:String
189
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:24:__node__
190
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:23:__node__
191
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:22:__node__
192
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:15:__node__
193
- 2 ./benchmark/../lib/fuzzy_match/tightener.rb:10:__node__
185
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
186
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
187
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
188
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
189
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
190
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
191
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
192
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
193
+ 2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
194
194
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
195
195
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
196
196
  2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
@@ -253,11 +253,11 @@
253
253
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
254
254
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
255
255
  1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
256
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:String
257
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:4:FuzzyMatch::ExtractRegexp
258
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:__node__
259
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:3:String
260
- 1 ./benchmark/../lib/fuzzy_match/tightener.rb:1:__node__
256
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
257
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
258
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
259
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
260
+ 1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
261
261
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
262
262
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
263
263
  1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
data/benchmark/memory.rb CHANGED
@@ -28,9 +28,9 @@ MUST_MATCH_BLOCKING = false
28
28
  # (Example) We made these by trial and error
29
29
  BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
30
30
 
31
- # Tighteners
31
+ # Normalizers
32
32
  # (Example) We made these by trial and error
33
- TIGHTENERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/tighteners.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
33
+ NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
34
34
 
35
35
  # Identities
36
36
  # (Example) We made these by trial and error
@@ -39,7 +39,7 @@ IDENTITIES = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/
39
39
  FINAL_OPTIONS = {
40
40
  :read => HAYSTACK_READER,
41
41
  :must_match_blocking => MUST_MATCH_BLOCKING,
42
- :tighteners => TIGHTENERS,
42
+ :normalizers => NORMALIZERS,
43
43
  :identities => IDENTITIES,
44
44
  :blockings => BLOCKINGS
45
45
  }
@@ -48,7 +48,6 @@ Memprof.start
48
48
 
49
49
  d = FuzzyMatch.new HAYSTACK, FINAL_OPTIONS
50
50
  record = d.find('boeing 707(100)', :gather_last_result => false)
51
- # d.free
52
51
 
53
52
  Memprof.stats
54
53
  Memprof.stop
@@ -26,9 +26,9 @@ MUST_MATCH_BLOCKING = false
26
26
  # (Example) We made these by trial and error
27
27
  BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
28
28
 
29
- # Tighteners
29
+ # Normalizers
30
30
  # (Example) We made these by trial and error
31
- TIGHTENERS = RemoteTable.new(:url => "file://#{File.expand_path("../tighteners.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
31
+ NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
32
32
 
33
33
  # Identities
34
34
  # (Example) We made these by trial and error
@@ -65,7 +65,7 @@ NEGATIVES = RemoteTable.new :url => "file://#{File.expand_path("../negatives.csv
65
65
  FINAL_OPTIONS = {
66
66
  :read => HAYSTACK_READER,
67
67
  :must_match_blocking => MUST_MATCH_BLOCKING,
68
- :tighteners => TIGHTENERS,
68
+ :normalizers => NORMALIZERS,
69
69
  :identities => IDENTITIES,
70
70
  :blockings => BLOCKINGS
71
71
  }
@@ -24,7 +24,7 @@ class FuzzyMatch
24
24
  def join?(str1, str2)
25
25
  if str2_match_data = regexp.match(str2)
26
26
  if str1_match_data = regexp.match(str1)
27
- str2_match_data.captures == str1_match_data.captures
27
+ str2_match_data.captures.join.downcase == str1_match_data.captures.join.downcase
28
28
  else
29
29
  false
30
30
  end
@@ -14,7 +14,7 @@ class FuzzyMatch
14
14
  # Otherwise returns nil.
15
15
  def identical?(str1, str2)
16
16
  if str1_match_data = regexp.match(str1) and match_data = regexp.match(str2)
17
- str1_match_data.captures == match_data.captures
17
+ str1_match_data.captures.join.downcase == match_data.captures.join.downcase
18
18
  else
19
19
  nil
20
20
  end
@@ -1,18 +1,18 @@
1
1
  class FuzzyMatch
2
- # A tightener just strips a string down to its core
3
- class Tightener
2
+ # A normalizer just strips a string down to its core
3
+ class Normalizer
4
4
  attr_reader :regexp
5
5
 
6
6
  def initialize(regexp_or_str)
7
7
  @regexp = regexp_or_str.to_regexp
8
8
  end
9
9
 
10
- # A tightener applies when its regexp matches and captures a new (shorter) string
10
+ # A normalizer applies when its regexp matches and captures a new (shorter) string
11
11
  def apply?(str)
12
12
  !!(regexp.match(str))
13
13
  end
14
14
 
15
- # The result of applying a tightener is just all the captures put together.
15
+ # The result of applying a normalizer is just all the captures put together.
16
16
  def apply(str)
17
17
  if match_data = regexp.match(str)
18
18
  match_data.captures.join
@@ -22,7 +22,7 @@ class FuzzyMatch
22
22
  end
23
23
 
24
24
  def inspect
25
- "#<Tightener regexp=#{regexp.inspect}>"
25
+ "#<Normalizer regexp=#{regexp.inspect}>"
26
26
  end
27
27
  end
28
28
  end
@@ -27,7 +27,7 @@ ERB
27
27
  attr_accessor :read
28
28
  attr_accessor :haystack
29
29
  attr_accessor :options
30
- attr_accessor :tighteners
30
+ attr_accessor :normalizers
31
31
  attr_accessor :blockings
32
32
  attr_accessor :identities
33
33
  attr_accessor :stop_words