fuzzy_match 1.1.1 → 1.2.1
- data/.gitignore +3 -1
- data/README.markdown +124 -0
- data/Rakefile +5 -8
- data/benchmark/before-with-free.txt +25 -25
- data/benchmark/before-without-last-result.txt +31 -31
- data/benchmark/before.txt +29 -29
- data/benchmark/memory.rb +3 -4
- data/examples/bts_aircraft/{tighteners.csv → normalizers.csv} +0 -0
- data/examples/bts_aircraft/test_bts_aircraft.rb +3 -3
- data/lib/fuzzy_match/blocking.rb +1 -1
- data/lib/fuzzy_match/identity.rb +1 -1
- data/lib/fuzzy_match/{tightener.rb → normalizer.rb} +5 -5
- data/lib/fuzzy_match/result.rb +1 -1
- data/lib/fuzzy_match/version.rb +1 -1
- data/lib/fuzzy_match/wrapper.rb +3 -3
- data/lib/fuzzy_match.rb +30 -45
- data/test/test_blocking.rb +5 -0
- data/test/test_fuzzy_match.rb +40 -42
- data/test/test_identity.rb +5 -0
- data/test/{test_tightening.rb → test_normalizer.rb} +2 -2
- metadata +26 -25
- data/README.rdoc +0 -94
data/.gitignore
CHANGED
data/README.markdown
ADDED
@@ -0,0 +1,124 @@
|
|
1
|
+
# fuzzy_match
|
2
|
+
|
3
|
+
Find a needle in a haystack based on string similarity (using the Pair Distance algorithm and Levenshtein distance) and regular expressions.
|
4
|
+
|
5
|
+
Replaces [`loose_tight_dictionary`](https://github.com/seamusabshere/loose_tight_dictionary) because that was a confusing name.
|
6
|
+
|
7
|
+
## Quickstart
|
8
|
+
|
9
|
+
>> require 'fuzzy_match'
|
10
|
+
=> true
|
11
|
+
>> FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus')
|
12
|
+
=> "seamus"
|
13
|
+
|
14
|
+
## Default matching (string similarity)
|
15
|
+
|
16
|
+
If you configure nothing else, string similarity matching is used. That's why we call it fuzzy matching.
|
17
|
+
|
18
|
+
The algorithm is [Dice's Coefficient](http://en.wikipedia.org/wiki/Dice's_coefficient) (aka Pair Distance) because it seemed to work better than Jaro Winkler, etc.
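As a rough illustration of how Dice's coefficient scores two strings, here is a minimal pure-Ruby sketch. The helper names `bigrams` and `dice_coefficient` are made up for this example and are not part of `fuzzy_match`. It also treats the whole string as a single token, so spaces become part of the character pairs, which a production implementation may handle differently.

```ruby
# Split a string into its adjacent character pairs (bigrams), ignoring case.
def bigrams(str)
  str.downcase.chars.each_cons(2).map(&:join)
end

# Dice's coefficient: twice the number of shared bigrams divided by the
# total number of bigrams in both strings. Returns a Float in 0.0..1.0.
def dice_coefficient(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?

  # Each bigram in b may be paired off with at most one bigram in a.
  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    index = remaining.index(pair)
    remaining.delete_at(index) if index
  end
  2.0 * shared / total
end

dice_coefficient('RITZ', 'RATZ') # => 0.3333333333333333
dice_coefficient('abc', 'ABC')   # => 1.0
```

"RITZ" and "RATZ" share only the pair "tz" out of three pairs each, which gives 2 * 1 / 6.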
|
19
|
+
|
20
|
+
## Rules (regular expressions)
|
21
|
+
|
22
|
+
You can improve on the default matching with rules, which are generally regular expressions.
|
23
|
+
|
24
|
+
>> require 'fuzzy_match'
|
25
|
+
=> true
|
26
|
+
>> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :blockings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d\d\d)/i ])
|
27
|
+
=> #<FuzzyMatch: [...]>
|
28
|
+
>> matcher.find('fordf250')
|
29
|
+
=> "Ford F-250"
|
30
|
+
>> matcher.find('gmc truck k1500')
|
31
|
+
=> "GMC 1500"
|
32
|
+
|
33
|
+
### Blockings
|
34
|
+
|
35
|
+
Group records together.
|
36
|
+
|
37
|
+
Setting a blocking of `/Airbus/` ensures that strings containing "Airbus" will only be scored against other strings containing "Airbus". A better blocking in this case would probably be `/airbus/i`.
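The grouping idea can be sketched in plain Ruby. This is a simplified illustration, not the library's implementation: `candidates_for` is a hypothetical helper, and letting every record through when the needle matches no blocking is an assumption.

```ruby
# Hypothetical helper: keep only the haystack records that fall into the same
# blocking as the needle. If the needle matches no blocking, every record
# stays a candidate (an assumption made for this sketch).
def candidates_for(needle, haystack, blockings)
  blocking = blockings.find { |regexp| regexp.match(needle) }
  return haystack unless blocking

  haystack.select { |record| blocking.match(record) }
end

haystack = ['Airbus A320', 'AIRBUS A380', 'Boeing 747']

# Case-sensitive blocking: the lowercase needle matches nothing, so no filtering happens.
candidates_for('airbus a-320', haystack, [/Airbus/])
# => ["Airbus A320", "AIRBUS A380", "Boeing 747"]

# Case-insensitive blocking: only the Airbus records remain.
candidates_for('airbus a-320', haystack, [/airbus/i])
# => ["Airbus A320", "AIRBUS A380"]
```

The second call shows why `/airbus/i` is the safer choice when the casing of the input is not under your control.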
|
38
|
+
|
39
|
+
### Normalizers (formerly called tighteners)
|
40
|
+
|
41
|
+
Strip strings down to the essentials.
|
42
|
+
|
43
|
+
Adding a normalizer like `/(boeing).*(7\d\d)/i` will cause "BOEING COMPANY 747" and "boeing747" to be scored as if they were "BOEING 747" and "boeing 747", respectively. See also "Case sensitivity" below.
|
44
|
+
|
45
|
+
### Identities
|
46
|
+
|
47
|
+
Prevent impossible matches.
|
48
|
+
|
49
|
+
Adding an identity like `/(F)\-?(\d50)/` ensures that "Ford F-150" and "Ford F-250" never match.
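A minimal stand-alone sketch of the identity check, modelled on the `identical?` method in `lib/fuzzy_match/identity.rb` from this diff. The three-argument top-level form is a simplification for illustration; in the library the regexp belongs to an identity object.

```ruby
# Returns true or false when both strings match the regexp, and nil when
# at least one of them does not match.
def identical?(regexp, str1, str2)
  m1 = regexp.match(str1)
  m2 = regexp.match(str2)
  return nil unless m1 && m2

  # Compare the joined captures case-insensitively.
  m1.captures.join.downcase == m2.captures.join.downcase
end

identity = /(F)\-?(\d50)/
identical?(identity, 'Ford F-150', 'Ford F-250') # => false
identical?(identity, 'Ford F-150', 'Ford F150')  # => true
identical?(identity, 'Ford F-150', 'GMC 1500')   # => nil
```

A `false` result is what rules out the F-150/F-250 pairing; `nil` means the rule has no opinion about that pair.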
|
50
|
+
|
51
|
+
### Stop words
|
52
|
+
|
53
|
+
Ignore common and/or meaningless words.
|
54
|
+
|
55
|
+
Adding a stop word like `THE` ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT".
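One way to picture the effect is to strip the stop words before comparing. The sketch below assumes stop words are given as regular expressions and removed with `gsub`; `strip_stop_words` is a hypothetical helper and the library may handle this differently.

```ruby
# Hypothetical helper: remove every stop word, then collapse the leftover
# whitespace so only the meaningful words remain.
def strip_stop_words(str, stop_words)
  stop_words.reduce(str) { |memo, regexp| memo.gsub(regexp, '') }.split.join(' ')
end

stop_words = [/\bthe\b/i]
['THE CAT', 'THE DAT', 'THE CATT'].map { |s| strip_stop_words(s, stop_words) }
# => ["CAT", "DAT", "CATT"]
```

With "THE" gone, the shared prefix no longer inflates the similarity of all three strings.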
|
56
|
+
|
57
|
+
## Find options
|
58
|
+
|
59
|
+
* `read`: how to interpret each record in the 'haystack', either a Proc or a symbol
|
60
|
+
* `must_match_blocking`: don't return a match unless the needle fits into one of the blockings you specified
|
61
|
+
* `must_match_at_least_one_word`: don't return a match unless the needle shares at least one word with the match
|
62
|
+
* `first_blocking_decides`: force records into the first blocking they match, rather than choosing a blocking that will give them a higher score
|
63
|
+
* `gather_last_result`: enable `last_result`
|
64
|
+
|
65
|
+
### `:read`
|
66
|
+
|
67
|
+
So, what if your needle is a string like `youruguay` and your haystack is full of `Country` objects like `<Country name:"Uruguay">`?
|
68
|
+
|
69
|
+
>> FuzzyMatch.new(Country.all, :read => :name).find('youruguay')
|
70
|
+
=> <Country name:"Uruguay">
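To make the role of `:read` concrete, here is a small sketch of how a record could be turned into a comparable string. `Country` is a stand-in `Struct` and `read_record` is a hypothetical helper; the dispatch rules are assumptions for illustration, not the library's code.

```ruby
# Hypothetical stand-in for the Country objects mentioned above.
Country = Struct.new(:name)

# Turn a haystack record into the string that gets compared.
# `read` may be a Proc (called with the record), a Symbol (sent to the
# record as a method name), or nil (the record is used as-is).
def read_record(record, read)
  case read
  when Proc   then read.call(record)
  when Symbol then record.send(read)
  else record
  end.to_s
end

uruguay = Country.new('Uruguay')
read_record(uruguay, :name)                        # => "Uruguay"
read_record(uruguay, lambda { |c| c.name.upcase }) # => "URUGUAY"
read_record('Uruguay', nil)                        # => "Uruguay"
```

The string is only used for scoring; the result of a search is still the original object from the haystack.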
|
71
|
+
|
72
|
+
## Case sensitivity
|
73
|
+
|
74
|
+
String similarity is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
|
75
|
+
|
76
|
+
Be careful when trying to use case-sensitivity in your rules; in general, things are downcased before comparing.
|
77
|
+
|
78
|
+
## Dice's coefficient edge case
|
79
|
+
|
80
|
+
When Dice's coefficient finds that two strings are equally similar to a third string, Levenshtein distance is used to break the tie. For example, pair distance considers "RATZ" and "CATZ" to be equally similar to "RITZ", so we invoke Levenshtein.
|
81
|
+
|
82
|
+
>> require 'amatch'
|
83
|
+
=> true
|
84
|
+
>> 'RITZ'.pair_distance_similar 'RATZ'
|
85
|
+
=> 0.3333333333333333
|
86
|
+
>> 'RITZ'.pair_distance_similar 'CATZ' # <-- pair distance can't tell the difference, so we fall back to levenshtein...
|
87
|
+
=> 0.3333333333333333
|
88
|
+
>> 'RITZ'.levenshtein_similar 'RATZ'
|
89
|
+
=> 0.75
|
90
|
+
>> 'RITZ'.levenshtein_similar 'CATZ' # <-- which properly shows that RATZ should win
|
91
|
+
=> 0.5
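The same tie-break can be reproduced without `amatch`. The following is a minimal pure-Ruby sketch; the helper names are invented for this example, and the similarity formula (one minus distance over the longer length) is an assumption chosen because it reproduces the values in the session above.

```ruby
# Classic dynamic-programming Levenshtein distance (case-sensitive).
def levenshtein_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      # Cheapest of deletion, insertion, and substitution.
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Similarity in 0.0..1.0: one minus the distance over the longer length.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_distance(a, b).to_f / longest
end

# Among candidates that tied on the primary score, pick the one closest
# to the needle by Levenshtein similarity.
def break_tie(needle, tied_candidates)
  tied_candidates.max_by { |candidate| levenshtein_similarity(needle, candidate) }
end

levenshtein_similarity('RITZ', 'RATZ') # => 0.75
levenshtein_similarity('RITZ', 'CATZ') # => 0.5
break_tie('RITZ', ['CATZ', 'RATZ'])    # => "RATZ"
```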
|
92
|
+
|
93
|
+
## Production use
|
94
|
+
|
95
|
+
Used in production for over 2 years in [Brighter Planet's environmental impact API](http://impact.brighterplanet.com) and [reference data service](http://data.brighterplanet.com).
|
96
|
+
|
97
|
+
We often combine `fuzzy_match` with [`remote_table`](https://github.com/seamusabshere/remote_table) and [`errata`](https://github.com/seamusabshere/errata):
|
98
|
+
|
99
|
+
- download table with `remote_table`
|
100
|
+
- correct serious or repeated errors with `errata`
|
101
|
+
- `fuzzy_match` the rest
|
102
|
+
|
103
|
+
## Glossary
|
104
|
+
|
105
|
+
The admittedly imperfect metaphor is "look for a needle in a haystack".
|
106
|
+
|
107
|
+
* needle: the search term
|
108
|
+
* haystack: the records you are searching (<b>your result will be an object from here</b>)
|
109
|
+
|
110
|
+
## Credits (and how to make things faster)
|
111
|
+
|
112
|
+
If you add the [`amatch`](http://flori.github.com/amatch/) gem to your Gemfile, it will use that, which is much faster (but [segfaults have been seen in the wild](https://github.com/flori/amatch/issues/3)). Thanks [Flori](https://github.com/flori)!
|
113
|
+
|
114
|
+
Otherwise, pure ruby versions of the string similarity algorithms derived from the [answer to a StackOverflow question](http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings) and [the text gem](https://github.com/threedaymonk/text/blob/master/lib/text/levenshtein.rb) are used. Thanks [marzagao](http://stackoverflow.com/users/10997/marzagao) and [threedaymonk](https://github.com/threedaymonk)!
|
115
|
+
|
116
|
+
## Authors
|
117
|
+
|
118
|
+
* Seamus Abshere <seamus@abshere.net>
|
119
|
+
* Ian Hough <ijhough@gmail.com>
|
120
|
+
* Andy Rossmeissl <andy@rossmeissl.net>
|
121
|
+
|
122
|
+
## Copyright
|
123
|
+
|
124
|
+
Copyright 2012 Brighter Planet, Inc.
|
data/Rakefile
CHANGED
@@ -10,12 +10,9 @@ end
|
|
10
10
|
|
11
11
|
task :default => :test
|
12
12
|
|
13
|
-
require '
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
rdoc.title = "fuzzy_match #{version}"
|
19
|
-
rdoc.rdoc_files.include('README*')
|
20
|
-
rdoc.rdoc_files.include('lib/**/*.rb')
|
13
|
+
require 'yard'
|
14
|
+
require File.expand_path('../lib/fuzzy_match/version.rb', __FILE__)
|
15
|
+
YARD::Rake::YardocTask.new do |t|
|
16
|
+
t.files = ['lib/**/*.rb', 'README.markdown'] # optional
|
17
|
+
# t.options = ['--any', '--extra', '--opts'] # optional
|
21
18
|
end
|
data/benchmark/before-with-free.txt
CHANGED
@@ -14,8 +14,8 @@
|
|
14
14
|
325 ./benchmark/../lib/fuzzy_match.rb:35:FuzzyMatch::Wrapper
|
15
15
|
320 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/format/delimited.rb:28:String
|
16
16
|
303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
|
17
|
-
201 ./benchmark/../lib/fuzzy_match/
|
18
|
-
184 ./benchmark/../lib/fuzzy_match/
|
17
|
+
201 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
|
18
|
+
184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
|
19
19
|
140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
|
20
20
|
41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
|
21
21
|
31 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
|
@@ -45,8 +45,8 @@
|
|
45
45
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
|
46
46
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
|
47
47
|
8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
|
48
|
-
8 ./benchmark/../lib/fuzzy_match/
|
49
|
-
8 ./benchmark/../lib/fuzzy_match/
|
48
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
|
49
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
|
50
50
|
8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
|
51
51
|
8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
|
52
52
|
8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
|
@@ -71,8 +71,8 @@
|
|
71
71
|
6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
|
72
72
|
5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
|
73
73
|
5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
|
74
|
-
5 ./benchmark/../lib/fuzzy_match/
|
75
|
-
5 ./benchmark/../lib/fuzzy_match/
|
74
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
|
75
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
|
76
76
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
|
77
77
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
|
78
78
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
|
@@ -85,7 +85,7 @@
|
|
85
85
|
5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
|
86
86
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
|
87
87
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
|
88
|
-
4 ./benchmark/../lib/fuzzy_match/
|
88
|
+
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
|
89
89
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
|
90
90
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
|
91
91
|
4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
|
@@ -112,10 +112,10 @@
|
|
112
112
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
|
113
113
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
|
114
114
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
|
115
|
-
3 ./benchmark/../lib/fuzzy_match/
|
116
|
-
3 ./benchmark/../lib/fuzzy_match/
|
117
|
-
3 ./benchmark/../lib/fuzzy_match/
|
118
|
-
3 ./benchmark/../lib/fuzzy_match/
|
115
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
|
116
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
|
117
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
|
118
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
|
119
119
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
|
120
120
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
|
121
121
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
|
@@ -159,15 +159,15 @@
|
|
159
159
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
|
160
160
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
|
161
161
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
|
162
|
-
2 ./benchmark/../lib/fuzzy_match/
|
163
|
-
2 ./benchmark/../lib/fuzzy_match/
|
164
|
-
2 ./benchmark/../lib/fuzzy_match/
|
165
|
-
2 ./benchmark/../lib/fuzzy_match/
|
166
|
-
2 ./benchmark/../lib/fuzzy_match/
|
167
|
-
2 ./benchmark/../lib/fuzzy_match/
|
168
|
-
2 ./benchmark/../lib/fuzzy_match/
|
169
|
-
2 ./benchmark/../lib/fuzzy_match/
|
170
|
-
2 ./benchmark/../lib/fuzzy_match/
|
162
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
|
163
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
|
164
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
|
165
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
|
166
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
|
167
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
|
168
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
|
169
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
|
170
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
|
171
171
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
|
172
172
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
|
173
173
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
|
@@ -230,11 +230,11 @@
|
|
230
230
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
|
231
231
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
|
232
232
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
|
233
|
-
1 ./benchmark/../lib/fuzzy_match/
|
234
|
-
1 ./benchmark/../lib/fuzzy_match/
|
235
|
-
1 ./benchmark/../lib/fuzzy_match/
|
236
|
-
1 ./benchmark/../lib/fuzzy_match/
|
237
|
-
1 ./benchmark/../lib/fuzzy_match/
|
233
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
|
234
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
|
235
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
|
236
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
|
237
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
|
238
238
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
|
239
239
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
|
240
240
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
|
data/benchmark/before-without-last-result.txt
CHANGED
@@ -11,7 +11,7 @@
|
|
11
11
|
779 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
|
12
12
|
779 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
|
13
13
|
676 benchmark/memory.rb:21:String
|
14
|
-
607 ./benchmark/../lib/fuzzy_match/
|
14
|
+
607 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
|
15
15
|
444 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
|
16
16
|
342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
|
17
17
|
325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
|
@@ -25,7 +25,7 @@
|
|
25
25
|
303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
|
26
26
|
234 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
|
27
27
|
234 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
|
28
|
-
184 ./benchmark/../lib/fuzzy_match/
|
28
|
+
184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
|
29
29
|
140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
|
30
30
|
129 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
|
31
31
|
127 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
|
@@ -37,13 +37,13 @@
|
|
37
37
|
118 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
|
38
38
|
117 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
|
39
39
|
117 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
|
40
|
-
102 ./benchmark/../lib/fuzzy_match/
|
41
|
-
102 ./benchmark/../lib/fuzzy_match/
|
42
|
-
101 ./benchmark/../lib/fuzzy_match/
|
40
|
+
102 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
|
41
|
+
102 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
|
42
|
+
101 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
|
43
43
|
41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
|
44
44
|
36 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
|
45
45
|
28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
|
46
|
-
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::
|
46
|
+
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
|
47
47
|
22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
|
48
48
|
22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
|
49
49
|
17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
|
@@ -65,9 +65,9 @@
|
|
65
65
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
|
66
66
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
|
67
67
|
8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
|
68
|
-
8 ./benchmark/../lib/fuzzy_match/
|
69
|
-
8 ./benchmark/../lib/fuzzy_match/
|
70
|
-
8 ./benchmark/../lib/fuzzy_match/
|
68
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
|
69
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
|
70
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
|
71
71
|
8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
|
72
72
|
8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
|
73
73
|
8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
|
@@ -77,10 +77,10 @@
|
|
77
77
|
7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
|
78
78
|
7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
|
79
79
|
6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
|
80
|
-
6 ./benchmark/../lib/fuzzy_match/
|
81
|
-
6 ./benchmark/../lib/fuzzy_match/
|
82
|
-
6 ./benchmark/../lib/fuzzy_match/
|
83
|
-
6 ./benchmark/../lib/fuzzy_match/
|
80
|
+
6 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
|
81
|
+
6 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
|
82
|
+
6 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
|
83
|
+
6 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
|
84
84
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
|
85
85
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:15:__node__
|
86
86
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:13:__node__
|
@@ -89,8 +89,8 @@
|
|
89
89
|
6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
|
90
90
|
5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
|
91
91
|
5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
|
92
|
-
5 ./benchmark/../lib/fuzzy_match/
|
93
|
-
5 ./benchmark/../lib/fuzzy_match/
|
92
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
|
93
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
|
94
94
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
|
95
95
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
|
96
96
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
|
@@ -100,8 +100,8 @@
|
|
100
100
|
4 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
|
101
101
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
|
102
102
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
|
103
|
-
4 ./benchmark/../lib/fuzzy_match/
|
104
|
-
4 ./benchmark/../lib/fuzzy_match/
|
103
|
+
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:__node__
|
104
|
+
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
|
105
105
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
|
106
106
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
|
107
107
|
4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
|
@@ -116,12 +116,12 @@
|
|
116
116
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
|
117
117
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
|
118
118
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
|
119
|
-
3 ./benchmark/../lib/fuzzy_match/
|
120
|
-
3 ./benchmark/../lib/fuzzy_match/
|
121
|
-
3 ./benchmark/../lib/fuzzy_match/
|
122
|
-
3 ./benchmark/../lib/fuzzy_match/
|
123
|
-
3 ./benchmark/../lib/fuzzy_match/
|
124
|
-
3 ./benchmark/../lib/fuzzy_match/
|
119
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
|
120
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:30:__node__
|
121
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:29:__node__
|
122
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
|
123
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
|
124
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
|
125
125
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
|
126
126
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
|
127
127
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
|
@@ -160,12 +160,12 @@
|
|
160
160
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
|
161
161
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
|
162
162
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
|
163
|
-
2 ./benchmark/../lib/fuzzy_match/
|
164
|
-
2 ./benchmark/../lib/fuzzy_match/
|
165
|
-
2 ./benchmark/../lib/fuzzy_match/
|
166
|
-
2 ./benchmark/../lib/fuzzy_match/
|
167
|
-
2 ./benchmark/../lib/fuzzy_match/
|
168
|
-
2 ./benchmark/../lib/fuzzy_match/
|
163
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
|
164
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
|
165
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
|
166
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
|
167
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
|
168
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
|
169
169
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
|
170
170
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
|
171
171
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
|
@@ -218,8 +218,8 @@
|
|
218
218
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
|
219
219
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
|
220
220
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
|
221
|
-
1 ./benchmark/../lib/fuzzy_match/
|
222
|
-
1 ./benchmark/../lib/fuzzy_match/
|
221
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
|
222
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
|
223
223
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
|
224
224
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
|
225
225
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
|
data/benchmark/before.txt
CHANGED
@@ -11,7 +11,7 @@
|
|
11
11
|
806 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
|
12
12
|
805 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
|
13
13
|
688 benchmark/memory.rb:21:String
|
14
|
-
639 ./benchmark/../lib/fuzzy_match/
|
14
|
+
639 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
|
15
15
|
448 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
|
16
16
|
342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
|
17
17
|
325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
|
@@ -25,7 +25,7 @@
|
|
25
25
|
303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
|
26
26
|
242 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
|
27
27
|
242 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
|
28
|
-
184 ./benchmark/../lib/fuzzy_match/
|
28
|
+
184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
|
29
29
|
140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
|
30
30
|
133 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
|
31
31
|
131 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
|
@@ -37,13 +37,13 @@
|
|
37
37
|
122 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
|
38
38
|
121 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
|
39
39
|
121 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
|
40
|
-
110 ./benchmark/../lib/fuzzy_match/
|
41
|
-
110 ./benchmark/../lib/fuzzy_match/
|
42
|
-
109 ./benchmark/../lib/fuzzy_match/
|
40
|
+
110 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
|
41
|
+
110 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
|
42
|
+
109 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
|
43
43
|
57 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
|
44
44
|
41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
|
45
45
|
28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
|
46
|
-
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::
|
46
|
+
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
|
47
47
|
22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
|
48
48
|
22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
|
49
49
|
21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Blocking
|
@@ -67,8 +67,8 @@
|
|
67
67
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
|
68
68
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
|
69
69
|
8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
|
70
|
-
8 ./benchmark/../lib/fuzzy_match/
|
71
|
-
8 ./benchmark/../lib/fuzzy_match/
|
70
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
|
71
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
|
72
72
|
8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
|
73
73
|
8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
|
74
74
|
8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
|
@@ -92,8 +92,8 @@
|
|
92
92
|
6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
|
93
93
|
5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
|
94
94
|
5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
|
95
|
-
5 ./benchmark/../lib/fuzzy_match/
|
96
|
-
5 ./benchmark/../lib/fuzzy_match/
|
95
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
|
96
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
|
97
97
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
|
98
98
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
|
99
99
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
|
@@ -106,7 +106,7 @@
|
|
106
106
|
5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
|
107
107
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
|
108
108
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
|
109
|
-
4 ./benchmark/../lib/fuzzy_match/
|
109
|
+
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
|
110
110
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
|
111
111
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
|
112
112
|
4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
|
@@ -133,10 +133,10 @@
|
|
133
133
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
|
134
134
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
|
135
135
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
|
136
|
-
3 ./benchmark/../lib/fuzzy_match/
|
137
|
-
3 ./benchmark/../lib/fuzzy_match/
|
138
|
-
3 ./benchmark/../lib/fuzzy_match/
|
139
|
-
3 ./benchmark/../lib/fuzzy_match/
|
136
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
|
137
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
|
138
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
|
139
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
|
140
140
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
|
141
141
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
|
142
142
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
|
@@ -182,15 +182,15 @@
|
|
182
182
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
|
183
183
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
|
184
184
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
|
185
|
-
2 ./benchmark/../lib/fuzzy_match/
|
186
|
-
2 ./benchmark/../lib/fuzzy_match/
|
187
|
-
2 ./benchmark/../lib/fuzzy_match/
|
188
|
-
2 ./benchmark/../lib/fuzzy_match/
|
189
|
-
2 ./benchmark/../lib/fuzzy_match/
|
190
|
-
2 ./benchmark/../lib/fuzzy_match/
|
191
|
-
2 ./benchmark/../lib/fuzzy_match/
|
192
|
-
2 ./benchmark/../lib/fuzzy_match/
|
193
|
-
2 ./benchmark/../lib/fuzzy_match/
|
185
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
|
186
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
|
187
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
|
188
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
|
189
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
|
190
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
|
191
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
|
192
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
|
193
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
|
194
194
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
|
195
195
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
|
196
196
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
|
@@ -253,11 +253,11 @@
|
|
253
253
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
|
254
254
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
|
255
255
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
|
256
|
-
1 ./benchmark/../lib/fuzzy_match/
|
257
|
-
1 ./benchmark/../lib/fuzzy_match/
|
258
|
-
1 ./benchmark/../lib/fuzzy_match/
|
259
|
-
1 ./benchmark/../lib/fuzzy_match/
|
260
|
-
1 ./benchmark/../lib/fuzzy_match/
|
256
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
|
257
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
|
258
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
|
259
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
|
260
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
|
261
261
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
|
262
262
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
|
263
263
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
|
data/benchmark/memory.rb
CHANGED
@@ -28,9 +28,9 @@ MUST_MATCH_BLOCKING = false
|
|
28
28
|
# (Example) We made these by trial and error
|
29
29
|
BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
30
30
|
|
31
|
-
#
|
31
|
+
# Normalizers
|
32
32
|
# (Example) We made these by trial and error
|
33
|
-
|
33
|
+
NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
34
34
|
|
35
35
|
# Identities
|
36
36
|
# (Example) We made these by trial and error
|
@@ -39,7 +39,7 @@ IDENTITIES = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/
|
|
39
39
|
FINAL_OPTIONS = {
|
40
40
|
:read => HAYSTACK_READER,
|
41
41
|
:must_match_blocking => MUST_MATCH_BLOCKING,
|
42
|
-
:
|
42
|
+
:normalizers => NORMALIZERS,
|
43
43
|
:identities => IDENTITIES,
|
44
44
|
:blockings => BLOCKINGS
|
45
45
|
}
|
@@ -48,7 +48,6 @@ Memprof.start
|
|
48
48
|
|
49
49
|
d = FuzzyMatch.new HAYSTACK, FINAL_OPTIONS
|
50
50
|
record = d.find('boeing 707(100)', :gather_last_result => false)
|
51
|
-
# d.free
|
52
51
|
|
53
52
|
Memprof.stats
|
54
53
|
Memprof.stop
|
File without changes
|
data/examples/bts_aircraft/test_bts_aircraft.rb
CHANGED
@@ -26,9 +26,9 @@ MUST_MATCH_BLOCKING = false
|
|
26
26
|
# (Example) We made these by trial and error
|
27
27
|
BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
28
28
|
|
29
|
-
#
|
29
|
+
# Normalizers
|
30
30
|
# (Example) We made these by trial and error
|
31
|
-
|
31
|
+
NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
32
32
|
|
33
33
|
# Identities
|
34
34
|
# (Example) We made these by trial and error
|
@@ -65,7 +65,7 @@ NEGATIVES = RemoteTable.new :url => "file://#{File.expand_path("../negatives.csv
|
|
65
65
|
FINAL_OPTIONS = {
|
66
66
|
:read => HAYSTACK_READER,
|
67
67
|
:must_match_blocking => MUST_MATCH_BLOCKING,
|
68
|
-
:
|
68
|
+
:normalizers => NORMALIZERS,
|
69
69
|
:identities => IDENTITIES,
|
70
70
|
:blockings => BLOCKINGS
|
71
71
|
}
|
data/lib/fuzzy_match/blocking.rb
CHANGED
@@ -24,7 +24,7 @@ class FuzzyMatch
|
|
24
24
|
def join?(str1, str2)
|
25
25
|
if str2_match_data = regexp.match(str2)
|
26
26
|
if str1_match_data = regexp.match(str1)
|
27
|
-
str2_match_data.captures == str1_match_data.captures
|
27
|
+
str2_match_data.captures.join.downcase == str1_match_data.captures.join.downcase
|
28
28
|
else
|
29
29
|
false
|
30
30
|
end
|
data/lib/fuzzy_match/identity.rb
CHANGED
@@ -14,7 +14,7 @@ class FuzzyMatch
|
|
14
14
|
# Otherwise returns nil.
|
15
15
|
def identical?(str1, str2)
|
16
16
|
if str1_match_data = regexp.match(str1) and match_data = regexp.match(str2)
|
17
|
-
str1_match_data.captures == match_data.captures
|
17
|
+
str1_match_data.captures.join.downcase == match_data.captures.join.downcase
|
18
18
|
else
|
19
19
|
nil
|
20
20
|
end
|
@@ -1,18 +1,18 @@
|
|
1
1
|
class FuzzyMatch
|
2
|
-
# A
|
3
|
-
class
|
2
|
+
# A normalizer just strips a string down to its core
|
3
|
+
class Normalizer
|
4
4
|
attr_reader :regexp
|
5
5
|
|
6
6
|
def initialize(regexp_or_str)
|
7
7
|
@regexp = regexp_or_str.to_regexp
|
8
8
|
end
|
9
9
|
|
10
|
-
# A
|
10
|
+
# A normalizer applies when its regexp matches and captures a new (shorter) string
|
11
11
|
def apply?(str)
|
12
12
|
!!(regexp.match(str))
|
13
13
|
end
|
14
14
|
|
15
|
-
# The result of applying a
|
15
|
+
# The result of applying a normalizer is just all the captures put together.
|
16
16
|
def apply(str)
|
17
17
|
if match_data = regexp.match(str)
|
18
18
|
match_data.captures.join
|
@@ -22,7 +22,7 @@ class FuzzyMatch
|
|
22
22
|
end
|
23
23
|
|
24
24
|
def inspect
|
25
|
-
"#<
|
25
|
+
"#<Normalizer regexp=#{regexp.inspect}>"
|
26
26
|
end
|
27
27
|
end
|
28
28
|
end
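The normalizer logic in this hunk can be tried out in isolation. What follows is a minimal sketch rather than the library's code: `SketchNormalizer` is a hypothetical name, it takes a `Regexp` directly instead of calling `to_regexp`, and returning the string unchanged when nothing matches is an assumption because that branch is not visible in the diff.

```ruby
# Stand-alone re-creation of the normalizer idea for experimentation.
class SketchNormalizer
  attr_reader :regexp

  def initialize(regexp)
    @regexp = regexp
  end

  # True when the regexp matches the string at all.
  def apply?(str)
    !!regexp.match(str)
  end

  # Join all captures; fall back to the original string when nothing
  # matches (the fallback is an assumption made for this sketch).
  def apply(str)
    match_data = regexp.match(str)
    match_data ? match_data.captures.join : str
  end

  def inspect
    "#<SketchNormalizer regexp=#{regexp.inspect}>"
  end
end

normalizer = SketchNormalizer.new(/(boeing).*(7\d\d)/i)
normalizer.apply?('BOEING COMPANY 747') # => true
normalizer.apply('BOEING COMPANY 747')  # => "BOEING747"
normalizer.apply('Airbus A380')         # => "Airbus A380"
```

Because `captures.join` uses no separator, the captured pieces are concatenated directly in this sketch.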
|
data/lib/fuzzy_match/result.rb
CHANGED