fuzzy_match 1.1.1 → 1.2.1
- data/.gitignore +3 -1
- data/README.markdown +124 -0
- data/Rakefile +5 -8
- data/benchmark/before-with-free.txt +25 -25
- data/benchmark/before-without-last-result.txt +31 -31
- data/benchmark/before.txt +29 -29
- data/benchmark/memory.rb +3 -4
- data/examples/bts_aircraft/{tighteners.csv → normalizers.csv} +0 -0
- data/examples/bts_aircraft/test_bts_aircraft.rb +3 -3
- data/lib/fuzzy_match/blocking.rb +1 -1
- data/lib/fuzzy_match/identity.rb +1 -1
- data/lib/fuzzy_match/{tightener.rb → normalizer.rb} +5 -5
- data/lib/fuzzy_match/result.rb +1 -1
- data/lib/fuzzy_match/version.rb +1 -1
- data/lib/fuzzy_match/wrapper.rb +3 -3
- data/lib/fuzzy_match.rb +30 -45
- data/test/test_blocking.rb +5 -0
- data/test/test_fuzzy_match.rb +40 -42
- data/test/test_identity.rb +5 -0
- data/test/{test_tightening.rb → test_normalizer.rb} +2 -2
- metadata +26 -25
- data/README.rdoc +0 -94
data/.gitignore
CHANGED
data/README.markdown
ADDED
@@ -0,0 +1,124 @@
|
|
1
|
+
# fuzzy_match
|
2
|
+
|
3
|
+
Find a needle in a haystack based on string similarity (using the Pair Distance algorithm and Levenshtein distance) and regular expressions.
|
4
|
+
|
5
|
+
Replaces [`loose_tight_dictionary`](https://github.com/seamusabshere/loose_tight_dictionary) because that was a confusing name.
|
6
|
+
|
7
|
+
## Quickstart
|
8
|
+
|
9
|
+
>> require 'fuzzy_match'
|
10
|
+
=> true
|
11
|
+
>> FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus')
|
12
|
+
=> "seamus"
|
13
|
+
|
14
|
+
## Default matching (string similarity)
|
15
|
+
|
16
|
+
If you configure nothing else, string similarity matching is used. That's why we call it fuzzy matching.
|
17
|
+
|
18
|
+
The algorithm is [Dice's Coefficient](http://en.wikipedia.org/wiki/Dice's_coefficient) (aka Pair Distance) because it seemed to work better than Jaro Winkler, etc.
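As a rough illustration of the default scoring, here is a minimal pure-Ruby sketch of Dice's coefficient over character bigrams. The helper names `bigrams` and `dice_coefficient` are assumptions made for this example and are not part of the gem's API; the gem's own implementation may tokenize differently.

```ruby
# Hypothetical helpers for illustration only; not fuzzy_match's API.

# Split a string into overlapping two-character pairs, ignoring case.
def bigrams(str)
  str.downcase.each_char.each_cons(2).map(&:join)
end

# Dice's coefficient: 2 * shared pairs / total pairs in both strings.
def dice_coefficient(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  # Count each pair of b at most once so repeated bigrams are not over-counted.
  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    idx = remaining.index(pair)
    idx && remaining.delete_at(idx)
  end
  (2.0 * shared) / (pairs_a.size + pairs_b.size)
end

puts dice_coefficient('night', 'nacht') # one shared pair ("ht") out of eight
```

A score of 1.0 means the two strings have identical bigram lists; 0.0 means they share none.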
|
19
|
+
|
20
|
+
## Rules (regular expressions)
|
21
|
+
|
22
|
+
You can improve the default matchings with rules, which are generally regular expressions.
|
23
|
+
|
24
|
+
>> require 'fuzzy_match'
|
25
|
+
=> true
|
26
|
+
>> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :blockings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d\d\d)/i ])
|
27
|
+
=> #<FuzzyMatch: [...]>
|
28
|
+
>> matcher.find('fordf250')
|
29
|
+
=> "Ford F-250"
|
30
|
+
>> matcher.find('gmc truck k1500')
|
31
|
+
=> "GMC 1500"
|
32
|
+
|
33
|
+
### Blockings
|
34
|
+
|
35
|
+
Group records together.
|
36
|
+
|
37
|
+
Setting a blocking of `/Airbus/` ensures that strings containing "Airbus" will only be scored against other strings containing "Airbus". A better blocking in this case would probably be `/airbus/i`.
|
38
|
+
|
39
|
+
### Normalizers (formerly called tighteners)
|
40
|
+
|
41
|
+
Strip strings down to the essentials.
|
42
|
+
|
43
|
+
Adding a normalizer like `/(boeing).*(7\d\d)/i` will cause "BOEING COMPANY 747" and "boeing747" to be scored as if they were "BOEING 747" and "boeing 747", respectively. See also "Case sensitivity" below.
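A minimal sketch of what applying a normalizer could look like, modeled on the `apply` method visible in `normalizer.rb` in this diff, which joins the regexp captures. The free-standing `normalize` helper is hypothetical, and returning the string unchanged when the regexp does not match is an assumption, since that branch is not shown here. Note that this sketch concatenates captures with no separator.

```ruby
# Hypothetical stand-in for a normalizer; not the gem's API.
def normalize(str, regexp)
  match_data = regexp.match(str)
  # Assumption: a non-matching string passes through untouched.
  match_data ? match_data.captures.join : str
end

boeing = /(boeing).*(7\d\d)/i

puts normalize('BOEING COMPANY 747', boeing) # captures "BOEING" and "747"
puts normalize('boeing747', boeing)          # captures "boeing" and "747"
puts normalize('Airbus A320', boeing)        # no match, printed as-is
```

Because scoring is case-insensitive, the two Boeing strings end up equivalent once the noise between the captures is dropped.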
|
44
|
+
|
45
|
+
### Identities
|
46
|
+
|
47
|
+
Prevent impossible matches.
|
48
|
+
|
49
|
+
Adding an identity like `/(F)\-?(\d50)/` ensures that "Ford F-150" and "Ford F-250" never match.
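The identity check can be sketched roughly as follows, mirroring the comparison that `identity.rb` uses in this version (joined, downcased captures). The three-argument `identical?` helper is a hypothetical free-standing version written for this example.

```ruby
# Hypothetical free-standing version of an identity check; not the gem's API.
# Returns true/false when the regexp matches both strings, nil otherwise.
def identical?(str1, str2, regexp)
  m1 = regexp.match(str1)
  m2 = regexp.match(str2)
  return nil unless m1 && m2

  m1.captures.join.downcase == m2.captures.join.downcase
end

identity = /(F)\-?(\d50)/

p identical?('Ford F-150', 'Ford F-250', identity) # captures differ ("f150" vs "f250")
p identical?('Ford F-150', 'FORD F150', identity)  # captures agree once joined
p identical?('Ford F-150', 'GMC 1500', identity)   # identity does not apply
```

A `false` result is what rules out a match; `nil` means the identity has no opinion about the pair.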
|
50
|
+
|
51
|
+
### Stop words
|
52
|
+
|
53
|
+
Ignore common and/or meaningless words.
|
54
|
+
|
55
|
+
Adding a stop word like `THE` ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT".
|
56
|
+
|
57
|
+
## Find options
|
58
|
+
|
59
|
+
* `read`: how to interpret each record in the 'haystack', either a Proc or a symbol
|
60
|
+
* `must_match_blocking`: don't return a match unless the needle fits into one of the blockings you specified
|
61
|
+
* `must_match_at_least_one_word`: don't return a match unless the needle shares at least one word with the match
|
62
|
+
* `first_blocking_decides`: force records into the first blocking they match, rather than choosing a blocking that will give them a higher score
|
63
|
+
* `gather_last_result`: enable `last_result`
|
64
|
+
|
65
|
+
### `:read`
|
66
|
+
|
67
|
+
So, what if your needle is a string like `youruguay` and your haystack is full of `Country` objects like `<Country name:"Uruguay">`?
|
68
|
+
|
69
|
+
>> FuzzyMatch.new(Country.all, :read => :name).find('youruguay')
|
70
|
+
=> <Country name:"Uruguay">
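To make the `:read` option more concrete, here is a hedged sketch of how a record might be turned into a comparable string from either a symbol or a Proc. `Country` and `read_record` are hypothetical names invented for this example; the gem's actual handling may differ.

```ruby
# Hypothetical record type standing in for a model class.
Country = Struct.new(:name)

# Hypothetical helper: turn a haystack record into the string to be scored.
def read_record(record, read = nil)
  case read
  when Proc   then read.call(record).to_s        # caller-supplied extraction
  when Symbol then record.public_send(read).to_s # call the named method
  else             record.to_s                   # plain strings need no reader
  end
end

uruguay = Country.new('Uruguay')

puts read_record(uruguay, :name)
puts read_record(uruguay, ->(country) { country.name.upcase })
puts read_record('Uruguay')
```

The object returned from a search would still be the original record; the reader only decides which text is compared against the needle.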
|
71
|
+
|
72
|
+
## Case sensitivity
|
73
|
+
|
74
|
+
String similarity is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
|
75
|
+
|
76
|
+
Be careful when trying to use case-sensitivity in your rules; in general, things are downcased before comparing.
|
77
|
+
|
78
|
+
## Dice's coefficient edge case
|
79
|
+
|
80
|
+
When Dice's coefficient rates two strings as equally similar to a third, Levenshtein distance is used as a tie-breaker. For example, pair distance considers "RATZ" and "CATZ" to be equally similar to "RITZ", so we invoke Levenshtein.
|
81
|
+
|
82
|
+
>> require 'amatch'
|
83
|
+
=> true
|
84
|
+
>> 'RITZ'.pair_distance_similar 'RATZ'
|
85
|
+
=> 0.3333333333333333
|
86
|
+
>> 'RITZ'.pair_distance_similar 'CATZ' # <-- pair distance can't tell the difference, so we fall back to levenshtein...
|
87
|
+
=> 0.3333333333333333
|
88
|
+
>> 'RITZ'.levenshtein_similar 'RATZ'
|
89
|
+
=> 0.75
|
90
|
+
>> 'RITZ'.levenshtein_similar 'CATZ' # <-- which properly shows that RATZ should win
|
91
|
+
=> 0.5
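The same tie-break can be sketched without `amatch`. The helpers below (`dice`, `levenshtein_distance`, `levenshtein_similarity`) are hypothetical pure-Ruby stand-ins written for this example, not the gem's own code; the similarity is assumed to be one minus the edit distance divided by the longer string's length.

```ruby
# Hypothetical pure-Ruby stand-ins; not fuzzy_match's or amatch's API.

# Dice's coefficient over distinct character bigrams (adequate for short words).
def dice(a, b)
  pa = a.downcase.each_char.each_cons(2).map(&:join)
  pb = b.downcase.each_char.each_cons(2).map(&:join)
  return 0.0 if pa.empty? || pb.empty?
  2.0 * (pa & pb).size / (pa.size + pb.size)
end

# Classic dynamic-programming edit distance, keeping only two rows.
def levenshtein_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Scale the distance to 0..1 by the longer string's length.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein_distance(a, b).to_f / longest
end

# Rank by Dice first; Levenshtein only matters when Dice scores tie.
best = %w[CATZ RATZ].max_by do |candidate|
  [dice('RITZ', candidate), levenshtein_similarity('RITZ', candidate)]
end
puts best
```

Comparing `[dice, levenshtein]` pairs keeps the primary score in charge and lets the second element decide only when the first is equal.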
|
92
|
+
|
93
|
+
## Production use
|
94
|
+
|
95
|
+
Used in production for over 2 years in [Brighter Planet's environmental impact API](http://impact.brighterplanet.com) and [reference data service](http://data.brighterplanet.com).
|
96
|
+
|
97
|
+
We often combine `fuzzy_match` with [`remote_table`](https://github.com/seamusabshere/remote_table) and [`errata`](https://github.com/seamusabshere/errata):
|
98
|
+
|
99
|
+
- download table with `remote_table`
|
100
|
+
- correct serious or repeated errors with `errata`
|
101
|
+
- `fuzzy_match` the rest
|
102
|
+
|
103
|
+
## Glossary
|
104
|
+
|
105
|
+
The admittedly imperfect metaphor is "look for a needle in a haystack".
|
106
|
+
|
107
|
+
* needle: the search term
|
108
|
+
* haystack: the records you are searching (<b>your result will be an object from here</b>)
|
109
|
+
|
110
|
+
## Credits (and how to make things faster)
|
111
|
+
|
112
|
+
If you add the [`amatch`](http://flori.github.com/amatch/) gem to your Gemfile, it will use that, which is much faster (but [segfaults have been seen in the wild](https://github.com/flori/amatch/issues/3)). Thanks [Flori](https://github.com/flori)!
|
113
|
+
|
114
|
+
Otherwise, pure ruby versions of the string similarity algorithms derived from the [answer to a StackOverflow question](http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings) and [the text gem](https://github.com/threedaymonk/text/blob/master/lib/text/levenshtein.rb) are used. Thanks [marzagao](http://stackoverflow.com/users/10997/marzagao) and [threedaymonk](https://github.com/threedaymonk)!
|
115
|
+
|
116
|
+
## Authors
|
117
|
+
|
118
|
+
* Seamus Abshere <seamus@abshere.net>
|
119
|
+
* Ian Hough <ijhough@gmail.com>
|
120
|
+
* Andy Rossmeissl <andy@rossmeissl.net>
|
121
|
+
|
122
|
+
## Copyright
|
123
|
+
|
124
|
+
Copyright 2012 Brighter Planet, Inc.
|
data/Rakefile
CHANGED
@@ -10,12 +10,9 @@ end
|
|
10
10
|
|
11
11
|
task :default => :test
|
12
12
|
|
13
|
-
require '
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
rdoc.title = "fuzzy_match #{version}"
|
19
|
-
rdoc.rdoc_files.include('README*')
|
20
|
-
rdoc.rdoc_files.include('lib/**/*.rb')
|
13
|
+
require 'yard'
|
14
|
+
require File.expand_path('../lib/fuzzy_match/version.rb', __FILE__)
|
15
|
+
YARD::Rake::YardocTask.new do |t|
|
16
|
+
t.files = ['lib/**/*.rb', 'README.markdown'] # optional
|
17
|
+
# t.options = ['--any', '--extra', '--opts'] # optional
|
21
18
|
end
|
@@ -14,8 +14,8 @@
|
|
14
14
|
325 ./benchmark/../lib/fuzzy_match.rb:35:FuzzyMatch::Wrapper
|
15
15
|
320 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/format/delimited.rb:28:String
|
16
16
|
303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
|
17
|
-
201 ./benchmark/../lib/fuzzy_match/
|
18
|
-
184 ./benchmark/../lib/fuzzy_match/
|
17
|
+
201 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
|
18
|
+
184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
|
19
19
|
140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
|
20
20
|
41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
|
21
21
|
31 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
|
@@ -45,8 +45,8 @@
|
|
45
45
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
|
46
46
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
|
47
47
|
8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
|
48
|
-
8 ./benchmark/../lib/fuzzy_match/
|
49
|
-
8 ./benchmark/../lib/fuzzy_match/
|
48
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
|
49
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
|
50
50
|
8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
|
51
51
|
8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
|
52
52
|
8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
|
@@ -71,8 +71,8 @@
|
|
71
71
|
6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
|
72
72
|
5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
|
73
73
|
5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
|
74
|
-
5 ./benchmark/../lib/fuzzy_match/
|
75
|
-
5 ./benchmark/../lib/fuzzy_match/
|
74
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
|
75
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
|
76
76
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
|
77
77
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
|
78
78
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
|
@@ -85,7 +85,7 @@
|
|
85
85
|
5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
|
86
86
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
|
87
87
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
|
88
|
-
4 ./benchmark/../lib/fuzzy_match/
|
88
|
+
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
|
89
89
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
|
90
90
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
|
91
91
|
4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
|
@@ -112,10 +112,10 @@
|
|
112
112
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
|
113
113
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
|
114
114
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
|
115
|
-
3 ./benchmark/../lib/fuzzy_match/
|
116
|
-
3 ./benchmark/../lib/fuzzy_match/
|
117
|
-
3 ./benchmark/../lib/fuzzy_match/
|
118
|
-
3 ./benchmark/../lib/fuzzy_match/
|
115
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
|
116
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
|
117
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
|
118
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
|
119
119
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
|
120
120
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
|
121
121
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
|
@@ -159,15 +159,15 @@
|
|
159
159
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
|
160
160
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
|
161
161
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
|
162
|
-
2 ./benchmark/../lib/fuzzy_match/
|
163
|
-
2 ./benchmark/../lib/fuzzy_match/
|
164
|
-
2 ./benchmark/../lib/fuzzy_match/
|
165
|
-
2 ./benchmark/../lib/fuzzy_match/
|
166
|
-
2 ./benchmark/../lib/fuzzy_match/
|
167
|
-
2 ./benchmark/../lib/fuzzy_match/
|
168
|
-
2 ./benchmark/../lib/fuzzy_match/
|
169
|
-
2 ./benchmark/../lib/fuzzy_match/
|
170
|
-
2 ./benchmark/../lib/fuzzy_match/
|
162
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
|
163
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
|
164
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
|
165
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
|
166
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
|
167
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
|
168
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
|
169
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
|
170
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
|
171
171
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
|
172
172
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
|
173
173
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
|
@@ -230,11 +230,11 @@
|
|
230
230
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
|
231
231
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
|
232
232
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
|
233
|
-
1 ./benchmark/../lib/fuzzy_match/
|
234
|
-
1 ./benchmark/../lib/fuzzy_match/
|
235
|
-
1 ./benchmark/../lib/fuzzy_match/
|
236
|
-
1 ./benchmark/../lib/fuzzy_match/
|
237
|
-
1 ./benchmark/../lib/fuzzy_match/
|
233
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
|
234
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
|
235
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
|
236
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
|
237
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
|
238
238
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
|
239
239
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
|
240
240
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
|
@@ -11,7 +11,7 @@
|
|
11
11
|
779 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
|
12
12
|
779 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
|
13
13
|
676 benchmark/memory.rb:21:String
|
14
|
-
607 ./benchmark/../lib/fuzzy_match/
|
14
|
+
607 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
|
15
15
|
444 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
|
16
16
|
342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
|
17
17
|
325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
|
@@ -25,7 +25,7 @@
|
|
25
25
|
303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
|
26
26
|
234 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
|
27
27
|
234 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
|
28
|
-
184 ./benchmark/../lib/fuzzy_match/
|
28
|
+
184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
|
29
29
|
140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
|
30
30
|
129 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
|
31
31
|
127 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
|
@@ -37,13 +37,13 @@
|
|
37
37
|
118 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
|
38
38
|
117 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
|
39
39
|
117 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
|
40
|
-
102 ./benchmark/../lib/fuzzy_match/
|
41
|
-
102 ./benchmark/../lib/fuzzy_match/
|
42
|
-
101 ./benchmark/../lib/fuzzy_match/
|
40
|
+
102 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
|
41
|
+
102 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
|
42
|
+
101 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
|
43
43
|
41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
|
44
44
|
36 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
|
45
45
|
28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
|
46
|
-
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::
|
46
|
+
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
|
47
47
|
22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
|
48
48
|
22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
|
49
49
|
17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
|
@@ -65,9 +65,9 @@
|
|
65
65
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
|
66
66
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
|
67
67
|
8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
|
68
|
-
8 ./benchmark/../lib/fuzzy_match/
|
69
|
-
8 ./benchmark/../lib/fuzzy_match/
|
70
|
-
8 ./benchmark/../lib/fuzzy_match/
|
68
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
|
69
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
|
70
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
|
71
71
|
8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
|
72
72
|
8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
|
73
73
|
8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
|
@@ -77,10 +77,10 @@
|
|
77
77
|
7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
|
78
78
|
7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
|
79
79
|
6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
|
80
|
-
6 ./benchmark/../lib/fuzzy_match/
|
81
|
-
6 ./benchmark/../lib/fuzzy_match/
|
82
|
-
6 ./benchmark/../lib/fuzzy_match/
|
83
|
-
6 ./benchmark/../lib/fuzzy_match/
|
80
|
+
6 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
|
81
|
+
6 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
|
82
|
+
6 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
|
83
|
+
6 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
|
84
84
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
|
85
85
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:15:__node__
|
86
86
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:13:__node__
|
@@ -89,8 +89,8 @@
|
|
89
89
|
6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
|
90
90
|
5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
|
91
91
|
5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
|
92
|
-
5 ./benchmark/../lib/fuzzy_match/
|
93
|
-
5 ./benchmark/../lib/fuzzy_match/
|
92
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
|
93
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
|
94
94
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
|
95
95
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
|
96
96
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
|
@@ -100,8 +100,8 @@
|
|
100
100
|
4 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
|
101
101
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
|
102
102
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
|
103
|
-
4 ./benchmark/../lib/fuzzy_match/
|
104
|
-
4 ./benchmark/../lib/fuzzy_match/
|
103
|
+
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:__node__
|
104
|
+
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
|
105
105
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
|
106
106
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
|
107
107
|
4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
|
@@ -116,12 +116,12 @@
|
|
116
116
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
|
117
117
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
|
118
118
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
|
119
|
-
3 ./benchmark/../lib/fuzzy_match/
|
120
|
-
3 ./benchmark/../lib/fuzzy_match/
|
121
|
-
3 ./benchmark/../lib/fuzzy_match/
|
122
|
-
3 ./benchmark/../lib/fuzzy_match/
|
123
|
-
3 ./benchmark/../lib/fuzzy_match/
|
124
|
-
3 ./benchmark/../lib/fuzzy_match/
|
119
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
|
120
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:30:__node__
|
121
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:29:__node__
|
122
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
|
123
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
|
124
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
|
125
125
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
|
126
126
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
|
127
127
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
|
@@ -160,12 +160,12 @@
|
|
160
160
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
|
161
161
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
|
162
162
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
|
163
|
-
2 ./benchmark/../lib/fuzzy_match/
|
164
|
-
2 ./benchmark/../lib/fuzzy_match/
|
165
|
-
2 ./benchmark/../lib/fuzzy_match/
|
166
|
-
2 ./benchmark/../lib/fuzzy_match/
|
167
|
-
2 ./benchmark/../lib/fuzzy_match/
|
168
|
-
2 ./benchmark/../lib/fuzzy_match/
|
163
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
|
164
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
|
165
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
|
166
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
|
167
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
|
168
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
|
169
169
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
|
170
170
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
|
171
171
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
|
@@ -218,8 +218,8 @@
|
|
218
218
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
|
219
219
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
|
220
220
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
|
221
|
-
1 ./benchmark/../lib/fuzzy_match/
|
222
|
-
1 ./benchmark/../lib/fuzzy_match/
|
221
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
|
222
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
|
223
223
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
|
224
224
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
|
225
225
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
|
data/benchmark/before.txt
CHANGED
@@ -11,7 +11,7 @@
|
|
11
11
|
806 ./benchmark/../lib/fuzzy_match/similarity.rb:41:FuzzyMatch::Score
|
12
12
|
805 ./benchmark/../lib/fuzzy_match/similarity.rb:42:FuzzyMatch::Score
|
13
13
|
688 benchmark/memory.rb:21:String
|
14
|
-
639 ./benchmark/../lib/fuzzy_match/
|
14
|
+
639 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:String
|
15
15
|
448 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:Array
|
16
16
|
342 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:String
|
17
17
|
325 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/remote_table-1.1.6/lib/remote_table/hasher.rb:20:String
|
@@ -25,7 +25,7 @@
|
|
25
25
|
303 ./benchmark/../lib/fuzzy_match/similarity.rb:21:Float
|
26
26
|
242 ./benchmark/../lib/fuzzy_match/similarity.rb:56:Array
|
27
27
|
242 ./benchmark/../lib/fuzzy_match/similarity.rb:55:Array
|
28
|
-
184 ./benchmark/../lib/fuzzy_match/
|
28
|
+
184 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:String
|
29
29
|
140 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:__node__
|
30
30
|
133 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__node__
|
31
31
|
131 ./benchmark/../lib/fuzzy_match/similarity.rb:55:__node__
|
@@ -37,13 +37,13 @@
|
|
37
37
|
122 ./benchmark/../lib/fuzzy_match/similarity.rb:12:__scope__
|
38
38
|
121 ./benchmark/../lib/fuzzy_match/wrapper.rb:29:__scope__
|
39
39
|
121 ./benchmark/../lib/fuzzy_match/wrapper.rb:19:Array
|
40
|
-
110 ./benchmark/../lib/fuzzy_match/
|
41
|
-
110 ./benchmark/../lib/fuzzy_match/
|
42
|
-
109 ./benchmark/../lib/fuzzy_match/
|
40
|
+
110 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:Array
|
41
|
+
110 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:MatchData
|
42
|
+
109 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:MatchData
|
43
43
|
57 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:27:Regexp
|
44
44
|
41 ./benchmark/../lib/fuzzy_match/similarity.rb:49:__node__
|
45
45
|
28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
|
46
|
-
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::
|
46
|
+
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
|
47
47
|
22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
|
48
48
|
22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
|
49
49
|
21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Blocking
|
@@ -67,8 +67,8 @@
|
|
67
67
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:42:__node__
|
68
68
|
9 ./benchmark/../lib/fuzzy_match/similarity.rb:41:__node__
|
69
69
|
8 ./benchmark/../lib/fuzzy_match/wrapper.rb:31:__node__
|
70
|
-
8 ./benchmark/../lib/fuzzy_match/
|
71
|
-
8 ./benchmark/../lib/fuzzy_match/
|
70
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:__node__
|
71
|
+
8 ./benchmark/../lib/fuzzy_match/normalizer.rb:14:__node__
|
72
72
|
8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
|
73
73
|
8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
|
74
74
|
8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
|
@@ -92,8 +92,8 @@
|
|
92
92
|
6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
|
93
93
|
5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
|
94
94
|
5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
|
95
|
-
5 ./benchmark/../lib/fuzzy_match/
|
96
|
-
5 ./benchmark/../lib/fuzzy_match/
|
95
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
|
96
|
+
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:19:__node__
|
97
97
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:8:__node__
|
98
98
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:33:__node__
|
99
99
|
5 ./benchmark/../lib/fuzzy_match/similarity.rb:29:__node__
|
@@ -106,7 +106,7 @@
|
|
106
106
|
5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
|
107
107
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
|
108
108
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
|
109
|
-
4 ./benchmark/../lib/fuzzy_match/
|
109
|
+
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
|
110
110
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:59:__node__
|
111
111
|
4 ./benchmark/../lib/fuzzy_match/similarity.rb:54:__node__
|
112
112
|
4 ./benchmark/../lib/fuzzy_match/score.rb:5:__node__
|
@@ -133,10 +133,10 @@
|
|
133
133
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:18:__node__
|
134
134
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:15:String
|
135
135
|
3 ./benchmark/../lib/fuzzy_match/wrapper.rb:14:__node__
|
136
|
-
3 ./benchmark/../lib/fuzzy_match/
|
137
|
-
3 ./benchmark/../lib/fuzzy_match/
|
138
|
-
3 ./benchmark/../lib/fuzzy_match/
|
139
|
-
3 ./benchmark/../lib/fuzzy_match/
|
136
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:8:__node__
|
137
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:26:__node__
|
138
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:18:__node__
|
139
|
+
3 ./benchmark/../lib/fuzzy_match/normalizer.rb:13:__node__
|
140
140
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:7:__node__
|
141
141
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:6:__node__
|
142
142
|
3 ./benchmark/../lib/fuzzy_match/similarity.rb:58:__node__
|
@@ -182,15 +182,15 @@
|
|
182
182
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:16:__node__
|
183
183
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:12:__node__
|
184
184
|
2 ./benchmark/../lib/fuzzy_match/wrapper.rb:11:__node__
|
185
|
-
2 ./benchmark/../lib/fuzzy_match/
|
186
|
-
2 ./benchmark/../lib/fuzzy_match/
|
187
|
-
2 ./benchmark/../lib/fuzzy_match/
|
188
|
-
2 ./benchmark/../lib/fuzzy_match/
|
189
|
-
2 ./benchmark/../lib/fuzzy_match/
|
190
|
-
2 ./benchmark/../lib/fuzzy_match/
|
191
|
-
2 ./benchmark/../lib/fuzzy_match/
|
192
|
-
2 ./benchmark/../lib/fuzzy_match/
|
193
|
-
2 ./benchmark/../lib/fuzzy_match/
|
185
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:6:__node__
|
186
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:Class
|
187
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:28:__node__
|
188
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:27:String
|
189
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:24:__node__
|
190
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:23:__node__
|
191
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:22:__node__
|
192
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:15:__node__
|
193
|
+
2 ./benchmark/../lib/fuzzy_match/normalizer.rb:10:__node__
|
194
194
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:60:__node__
|
195
195
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:50:__node__
|
196
196
|
2 ./benchmark/../lib/fuzzy_match/similarity.rb:4:__node__
|
@@ -253,11 +253,11 @@
|
|
253
253
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:25:String
|
254
254
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:1:__node__
|
255
255
|
1 ./benchmark/../lib/fuzzy_match/wrapper.rb:10:String
|
256
|
-
1 ./benchmark/../lib/fuzzy_match/
|
257
|
-
1 ./benchmark/../lib/fuzzy_match/
|
258
|
-
1 ./benchmark/../lib/fuzzy_match/
|
259
|
-
1 ./benchmark/../lib/fuzzy_match/
|
260
|
-
1 ./benchmark/../lib/fuzzy_match/
|
256
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:String
|
257
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:4:FuzzyMatch::ExtractRegexp
|
258
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:__node__
|
259
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:3:String
|
260
|
+
1 ./benchmark/../lib/fuzzy_match/normalizer.rb:1:__node__
|
261
261
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:9:__node__
|
262
262
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:__node__
|
263
263
|
1 ./benchmark/../lib/fuzzy_match/similarity.rb:2:String
|
data/benchmark/memory.rb
CHANGED
@@ -28,9 +28,9 @@ MUST_MATCH_BLOCKING = false
|
|
28
28
|
# (Example) We made these by trial and error
|
29
29
|
BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
30
30
|
|
31
|
-
#
|
31
|
+
# Normalizers
|
32
32
|
# (Example) We made these by trial and error
|
33
|
-
|
33
|
+
NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
34
34
|
|
35
35
|
# Identities
|
36
36
|
# (Example) We made these by trial and error
|
@@ -39,7 +39,7 @@ IDENTITIES = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/
|
|
39
39
|
FINAL_OPTIONS = {
|
40
40
|
:read => HAYSTACK_READER,
|
41
41
|
:must_match_blocking => MUST_MATCH_BLOCKING,
|
42
|
-
:
|
42
|
+
:normalizers => NORMALIZERS,
|
43
43
|
:identities => IDENTITIES,
|
44
44
|
:blockings => BLOCKINGS
|
45
45
|
}
|
@@ -48,7 +48,6 @@ Memprof.start
|
|
48
48
|
|
49
49
|
d = FuzzyMatch.new HAYSTACK, FINAL_OPTIONS
|
50
50
|
record = d.find('boeing 707(100)', :gather_last_result => false)
|
51
|
-
# d.free
|
52
51
|
|
53
52
|
Memprof.stats
|
54
53
|
Memprof.stop
|
File without changes
|
@@ -26,9 +26,9 @@ MUST_MATCH_BLOCKING = false
|
|
26
26
|
# (Example) We made these by trial and error
|
27
27
|
BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
28
28
|
|
29
|
-
#
|
29
|
+
# Normalizers
|
30
30
|
# (Example) We made these by trial and error
|
31
|
-
|
31
|
+
NORMALIZERS = RemoteTable.new(:url => "file://#{File.expand_path("../normalizers.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
32
32
|
|
33
33
|
# Identities
|
34
34
|
# (Example) We made these by trial and error
|
@@ -65,7 +65,7 @@ NEGATIVES = RemoteTable.new :url => "file://#{File.expand_path("../negatives.csv
|
|
65
65
|
FINAL_OPTIONS = {
|
66
66
|
:read => HAYSTACK_READER,
|
67
67
|
:must_match_blocking => MUST_MATCH_BLOCKING,
|
68
|
-
:
|
68
|
+
:normalizers => NORMALIZERS,
|
69
69
|
:identities => IDENTITIES,
|
70
70
|
:blockings => BLOCKINGS
|
71
71
|
}
|
data/lib/fuzzy_match/blocking.rb
CHANGED
@@ -24,7 +24,7 @@ class FuzzyMatch
 def join?(str1, str2)
   if str2_match_data = regexp.match(str2)
     if str1_match_data = regexp.match(str1)
-      str2_match_data.captures == str1_match_data.captures
+      str2_match_data.captures.join.downcase == str1_match_data.captures.join.downcase
     else
       false
     end
data/lib/fuzzy_match/identity.rb
CHANGED
@@ -14,7 +14,7 @@ class FuzzyMatch
 # Otherwise returns nil.
 def identical?(str1, str2)
   if str1_match_data = regexp.match(str1) and match_data = regexp.match(str2)
-    str1_match_data.captures == match_data.captures
+    str1_match_data.captures.join.downcase == match_data.captures.join.downcase
   else
     nil
   end
@@ -1,18 +1,18 @@
|
|
1
1
|
class FuzzyMatch
|
2
|
-
# A
|
3
|
-
class
|
2
|
+
# A normalizer just strips a string down to its core
|
3
|
+
class Normalizer
|
4
4
|
attr_reader :regexp
|
5
5
|
|
6
6
|
def initialize(regexp_or_str)
|
7
7
|
@regexp = regexp_or_str.to_regexp
|
8
8
|
end
|
9
9
|
|
10
|
-
# A
|
10
|
+
# A normalizer applies when its regexp matches and captures a new (shorter) string
|
11
11
|
def apply?(str)
|
12
12
|
!!(regexp.match(str))
|
13
13
|
end
|
14
14
|
|
15
|
-
# The result of applying a
|
15
|
+
# The result of applying a normalizer is just all the captures put together.
|
16
16
|
def apply(str)
|
17
17
|
if match_data = regexp.match(str)
|
18
18
|
match_data.captures.join
|
@@ -22,7 +22,7 @@ class FuzzyMatch
|
|
22
22
|
end
|
23
23
|
|
24
24
|
def inspect
|
25
|
-
"#<
|
25
|
+
"#<Normalizer regexp=#{regexp.inspect}>"
|
26
26
|
end
|
27
27
|
end
|
28
28
|
end
|
data/lib/fuzzy_match/result.rb
CHANGED