fuzzy_match 1.3.1 → 1.3.2
- data/Gemfile +12 -2
- data/History.txt +13 -0
- data/README.markdown +10 -6
- data/benchmark/before-with-free.txt +21 -21
- data/benchmark/before.txt +21 -21
- data/benchmark/memory.rb +6 -6
- data/examples/bts_aircraft/{blockings.csv → groupings.csv} +0 -0
- data/examples/bts_aircraft/test_bts_aircraft.rb +6 -6
- data/fuzzy_match.gemspec +1 -10
- data/lib/fuzzy_match.rb +41 -33
- data/lib/fuzzy_match/result.rb +1 -1
- data/lib/fuzzy_match/rule.rb +14 -0
- data/lib/fuzzy_match/rule/grouping.rb +32 -0
- data/lib/fuzzy_match/rule/identity.rb +19 -0
- data/lib/fuzzy_match/rule/normalizer.rb +20 -0
- data/lib/fuzzy_match/rule/stop_word.rb +11 -0
- data/lib/fuzzy_match/version.rb +1 -1
- data/test/helper.rb +3 -1
- data/test/test_fuzzy_match.rb +188 -124
- data/test/test_fuzzy_match_convoluted.rb.disabled +12 -12
- data/test/{test_blocking.rb → test_grouping.rb} +6 -6
- data/test/test_identity.rb +8 -8
- data/test/test_normalizer.rb +2 -2
- data/test/test_wrapper.rb +1 -1
- metadata +15 -101
- data/lib/fuzzy_match/blocking.rb +0 -36
- data/lib/fuzzy_match/identity.rb +0 -23
- data/lib/fuzzy_match/normalizer.rb +0 -28
- data/lib/fuzzy_match/stop_word.rb +0 -19
data/Gemfile
CHANGED
@@ -1,4 +1,14 @@
|
|
1
|
-
source
|
1
|
+
source :rubygems
|
2
2
|
|
3
|
-
# Specify your gem's dependencies in fuzzy_match.gemspec
|
4
3
|
gemspec
|
4
|
+
|
5
|
+
# development dependencies
|
6
|
+
gem 'minitest-reporters'
|
7
|
+
gem "minitest"
|
8
|
+
gem 'activerecord', '>=3'
|
9
|
+
gem 'mysql2'
|
10
|
+
gem 'cohort_scope'
|
11
|
+
gem 'weighted_average'
|
12
|
+
gem 'rake'
|
13
|
+
gem 'yard'
|
14
|
+
gem 'amatch'
|
data/History.txt
ADDED
@@ -0,0 +1,13 @@
|
|
1
|
+
== 1.3.2 / 2012-02-24
|
2
|
+
|
3
|
+
* Start keeping a changelog!
|
4
|
+
|
5
|
+
* Enhancements
|
6
|
+
|
7
|
+
* renamed blockings to groupings
|
8
|
+
* cleaned up tests
|
9
|
+
|
10
|
+
* Bug fixes
|
11
|
+
|
12
|
+
* better handling for one-letter similarities like 'X foo' vs 'X bar' which couldn't be detected by pair distance
|
13
|
+
* take deprecated option :tighteners as :normalizers
|
data/README.markdown
CHANGED
@@ -32,20 +32,22 @@ You can improve the default matchings with rules. There are 4 different kinds of
|
|
32
32
|
|
33
33
|
We suggest that you **first try without any rules** and only define them to improve matching, prevent false positives, etc.
|
34
34
|
|
35
|
-
>> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :
|
35
|
+
>> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :groupings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d50)/i ])
|
36
36
|
=> #<FuzzyMatch: [...]>
|
37
37
|
>> matcher.find('fordf250')
|
38
38
|
=> "Ford F-250"
|
39
39
|
>> matcher.find('gmc truck k1500')
|
40
40
|
=> "GMC 1500"
|
41
41
|
|
42
|
-
For identities and normalizers (see below), **only the captures are used.** For example, `/(f)-?(\d50)/i` captures the "F" and the "250" but ignores the dash. So place your parentheses carefully!
|
42
|
+
For identities and normalizers (see below), **only the captures are used.** For example, `/(f)-?(\d50)/i` captures the "F" and the "250" but ignores the dash. So place your parentheses carefully! Groupings work the same way, except that if you don't have any captures, a simple match will pass.
|
43
43
|
|
44
|
-
###
|
44
|
+
### Groupings
|
45
45
|
|
46
46
|
Group records together.
|
47
47
|
|
48
|
-
Setting a
|
48
|
+
Setting a grouping of `/Airbus/` ensures that strings containing "Airbus" will only be scored against other strings containing "Airbus". A better grouping in this case would probably be `/airbus/i`.
|
49
|
+
|
50
|
+
Formerly called "blockings," but that was jargon that confused people.
|
49
51
|
|
50
52
|
### Identities
|
51
53
|
|
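The grouping rule described in the README hunk above can be illustrated with plain Ruby regexps. This is only a sketch under stated assumptions: `same_group?` is a hypothetical helper, not part of fuzzy_match, and it covers only the simple case of groupings without captures, where a plain match on both strings is enough.

```ruby
# Hypothetical illustration, not fuzzy_match's API.
# Case-insensitive groupings, as the README suggests with /airbus/i.
GROUPINGS = [/airbus/i, /boeing/i]

# Two strings may be scored against each other only if at least one
# grouping regexp matches both of them.
def same_group?(a, b, groupings = GROUPINGS)
  groupings.any? { |regexp| a =~ regexp && b =~ regexp }
end

haystack   = ['AIRBUS A320', 'Airbus A380', 'Boeing 747']
needle     = 'airbus a-320'

# Only records in the needle's group stay in the competition.
candidates = haystack.select { |straw| same_group?(needle, straw) }
```

With the case-sensitive `/Airbus/` from the README text, the record `'AIRBUS A320'` would drop out of the candidates, which is presumably why the README recommends `/airbus/i`.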
@@ -53,6 +55,8 @@ Prevent impossible matches.
|
|
53
55
|
|
54
56
|
Adding an identity like `/(f)-?(\d50)/i` ensures that "Ford F-150" and "Ford F-250" never match.
|
55
57
|
|
58
|
+
Note that identities do not establish certainty. They just say whether two records **could** be identical... then string similarity takes over.
|
59
|
+
|
56
60
|
### Stop words
|
57
61
|
|
58
62
|
Ignore common and/or meaningless words. Applied before normalizers.
|
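The "only the captures are used" rule for identities can be sketched in plain Ruby as follows. The names `identity_key` and `could_be_identical?` are hypothetical and not taken from the gem. Two details are assumptions made for the illustration: the captures are downcased before comparison, and a rule that fails to match either string is treated as having no opinion.

```ruby
# Hypothetical illustration, not fuzzy_match's API.
IDENTITY = /(f)-?(\d50)/i

# Build a comparison key from the captures only, so the optional dash
# (which sits outside the parentheses) never affects the result.
def identity_key(str, regexp = IDENTITY)
  match = regexp.match(str)
  match && match.captures.join.downcase # downcasing is an assumption
end

# An identity only says whether two records *could* be identical.
def could_be_identical?(a, b, regexp = IDENTITY)
  key_a = identity_key(a, regexp)
  key_b = identity_key(b, regexp)
  # Assumption: if the rule does not apply to both strings, it does not veto.
  return true if key_a.nil? || key_b.nil?
  key_a == key_b
end

could_be_identical?('Ford F-150', 'Ford F-250') # keys "f150" vs "f250"
could_be_identical?('fordf250', 'Ford F-250')   # keys "f250" vs "f250"
```

A `true` result here does not declare a match; as the README notes, string similarity still decides among the records that survive.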
@@ -68,9 +72,9 @@ Adding a normalizer like `/(boeing).*(7\d\d)/i` will cause "BOEING COMPANY 747"
|
|
68
72
|
## Find options
|
69
73
|
|
70
74
|
* `read`: how to interpret each record in the 'haystack', either a Proc or a symbol
|
71
|
-
* `
|
75
|
+
* `must_match_grouping`: don't return a match unless the needle fits into one of the groupings you specified
|
72
76
|
* `must_match_at_least_one_word`: don't return a match unless the needle shares at least one word with the match. Note that "Foo's" is treated like one word (so that it won't match "'s") and "Bolivia," is treated as just "bolivia"
|
73
|
-
* `
|
77
|
+
* `first_grouping_decides`: force records into the first grouping they match, rather than choosing a grouping that will give them a higher score
|
74
78
|
* `gather_last_result`: enable `last_result`
|
75
79
|
|
76
80
|
## Case sensitivity
|
data/benchmark/before-with-free.txt
CHANGED
@@ -22,7 +22,7 @@
|
|
22
22
|
28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
|
23
23
|
22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
|
24
24
|
22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
|
25
|
-
21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::
|
25
|
+
21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Grouping
|
26
26
|
17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
|
27
27
|
16 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:Class
|
28
28
|
14 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:4:__node__
|
@@ -50,7 +50,7 @@
|
|
50
50
|
8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
|
51
51
|
8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
|
52
52
|
8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
|
53
|
-
8 ./benchmark/../lib/fuzzy_match/
|
53
|
+
8 ./benchmark/../lib/fuzzy_match/grouping.rb:24:__node__
|
54
54
|
7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
|
55
55
|
7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:6:__node__
|
56
56
|
7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:5:__node__
|
@@ -59,7 +59,7 @@
|
|
59
59
|
7 ./benchmark/../lib/fuzzy_match/similarity.rb:45:__node__
|
60
60
|
7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
|
61
61
|
7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
|
62
|
-
7 ./benchmark/../lib/fuzzy_match/
|
62
|
+
7 ./benchmark/../lib/fuzzy_match/grouping.rb:27:__node__
|
63
63
|
7 ./benchmark/../lib/fuzzy_match.rb:209:String
|
64
64
|
6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
|
65
65
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
|
@@ -68,7 +68,7 @@
|
|
68
68
|
6 ./benchmark/../lib/fuzzy_match/score.rb:25:__node__
|
69
69
|
6 ./benchmark/../lib/fuzzy_match/score.rb:21:__node__
|
70
70
|
6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
|
71
|
-
6 ./benchmark/../lib/fuzzy_match/
|
71
|
+
6 ./benchmark/../lib/fuzzy_match/grouping.rb:22:__node__
|
72
72
|
5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
|
73
73
|
5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
|
74
74
|
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
|
@@ -80,9 +80,9 @@
|
|
80
80
|
5 ./benchmark/../lib/fuzzy_match/score.rb:9:__node__
|
81
81
|
5 ./benchmark/../lib/fuzzy_match/result.rb:16:__node__
|
82
82
|
5 ./benchmark/../lib/fuzzy_match/identity.rb:10:__node__
|
83
|
-
5 ./benchmark/../lib/fuzzy_match/
|
84
|
-
5 ./benchmark/../lib/fuzzy_match/
|
85
|
-
5 ./benchmark/../lib/fuzzy_match/
|
83
|
+
5 ./benchmark/../lib/fuzzy_match/grouping.rb:26:__node__
|
84
|
+
5 ./benchmark/../lib/fuzzy_match/grouping.rb:25:__node__
|
85
|
+
5 ./benchmark/../lib/fuzzy_match/grouping.rb:15:__node__
|
86
86
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
|
87
87
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
|
88
88
|
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
|
@@ -142,8 +142,8 @@
|
|
142
142
|
3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:18:__node__
|
143
143
|
3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:16:String
|
144
144
|
3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:15:String
|
145
|
-
3 ./benchmark/../lib/fuzzy_match/
|
146
|
-
3 ./benchmark/../lib/fuzzy_match/
|
145
|
+
3 ./benchmark/../lib/fuzzy_match/grouping.rb:33:__node__
|
146
|
+
3 ./benchmark/../lib/fuzzy_match/grouping.rb:14:__node__
|
147
147
|
3 ./benchmark/../lib/fuzzy_match.rb:77:Array
|
148
148
|
2 /Users/seamus/.rvm/rubies/ruby-1.8.7-p334/lib/ruby/1.8/uri/common.rb:387:String
|
149
149
|
2 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:3:String
|
@@ -200,14 +200,14 @@
|
|
200
200
|
2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:5:__node__
|
201
201
|
2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:String
|
202
202
|
2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:12:__node__
|
203
|
-
2 ./benchmark/../lib/fuzzy_match/
|
204
|
-
2 ./benchmark/../lib/fuzzy_match/
|
205
|
-
2 ./benchmark/../lib/fuzzy_match/
|
206
|
-
2 ./benchmark/../lib/fuzzy_match/
|
207
|
-
2 ./benchmark/../lib/fuzzy_match/
|
208
|
-
2 ./benchmark/../lib/fuzzy_match/
|
209
|
-
2 ./benchmark/../lib/fuzzy_match/
|
210
|
-
2 ./benchmark/../lib/fuzzy_match/
|
203
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:9:Class
|
204
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:34:__node__
|
205
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:32:__node__
|
206
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:30:__node__
|
207
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:29:__node__
|
208
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:23:__node__
|
209
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:16:__node__
|
210
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:12:__node__
|
211
211
|
2 ./benchmark/../lib/fuzzy_match.rb:86:Array
|
212
212
|
1 benchmark/memory.rb:50:String
|
213
213
|
1 benchmark/memory.rb:49:FuzzyMatch
|
@@ -265,10 +265,10 @@
|
|
265
265
|
1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:1:__node__
|
266
266
|
1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:17:Hash
|
267
267
|
1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:String
|
268
|
-
1 ./benchmark/../lib/fuzzy_match/
|
269
|
-
1 ./benchmark/../lib/fuzzy_match/
|
270
|
-
1 ./benchmark/../lib/fuzzy_match/
|
271
|
-
1 ./benchmark/../lib/fuzzy_match/
|
268
|
+
1 ./benchmark/../lib/fuzzy_match/grouping.rb:9:__node__
|
269
|
+
1 ./benchmark/../lib/fuzzy_match/grouping.rb:9:String
|
270
|
+
1 ./benchmark/../lib/fuzzy_match/grouping.rb:1:__node__
|
271
|
+
1 ./benchmark/../lib/fuzzy_match/grouping.rb:10:FuzzyMatch::ExtractRegexp
|
272
272
|
1 ./benchmark/../lib/fuzzy_match.rb:62:FuzzyMatch::Wrapper
|
273
273
|
1 ./benchmark/../lib/fuzzy_match.rb:39:String
|
274
274
|
1 ./benchmark/../lib/fuzzy_match.rb:39:FuzzyMatch::Result
|
data/benchmark/before.txt
CHANGED
@@ -46,7 +46,7 @@
|
|
46
46
|
26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
|
47
47
|
22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
|
48
48
|
22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
|
49
|
-
21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::
|
49
|
+
21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Grouping
|
50
50
|
17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
|
51
51
|
16 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:Class
|
52
52
|
14 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:4:__node__
|
@@ -72,7 +72,7 @@
|
|
72
72
|
8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
|
73
73
|
8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
|
74
74
|
8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
|
75
|
-
8 ./benchmark/../lib/fuzzy_match/
|
75
|
+
8 ./benchmark/../lib/fuzzy_match/grouping.rb:24:__node__
|
76
76
|
7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
|
77
77
|
7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:6:__node__
|
78
78
|
7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:5:__node__
|
@@ -81,7 +81,7 @@
|
|
81
81
|
7 ./benchmark/../lib/fuzzy_match/similarity.rb:45:__node__
|
82
82
|
7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
|
83
83
|
7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
|
84
|
-
7 ./benchmark/../lib/fuzzy_match/
|
84
|
+
7 ./benchmark/../lib/fuzzy_match/grouping.rb:27:__node__
|
85
85
|
6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
|
86
86
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
|
87
87
|
6 ./benchmark/../lib/fuzzy_match/similarity.rb:15:__node__
|
@@ -89,7 +89,7 @@
|
|
89
89
|
6 ./benchmark/../lib/fuzzy_match/score.rb:25:__node__
|
90
90
|
6 ./benchmark/../lib/fuzzy_match/score.rb:21:__node__
|
91
91
|
6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
|
92
|
-
6 ./benchmark/../lib/fuzzy_match/
|
92
|
+
6 ./benchmark/../lib/fuzzy_match/grouping.rb:22:__node__
|
93
93
|
5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
|
94
94
|
5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
|
95
95
|
5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
|
@@ -101,9 +101,9 @@
|
|
101
101
|
5 ./benchmark/../lib/fuzzy_match/score.rb:9:__node__
|
102
102
|
5 ./benchmark/../lib/fuzzy_match/result.rb:16:__node__
|
103
103
|
5 ./benchmark/../lib/fuzzy_match/identity.rb:10:__node__
|
104
|
-
5 ./benchmark/../lib/fuzzy_match/
|
105
|
-
5 ./benchmark/../lib/fuzzy_match/
|
106
|
-
5 ./benchmark/../lib/fuzzy_match/
|
104
|
+
5 ./benchmark/../lib/fuzzy_match/grouping.rb:26:__node__
|
105
|
+
5 ./benchmark/../lib/fuzzy_match/grouping.rb:25:__node__
|
106
|
+
5 ./benchmark/../lib/fuzzy_match/grouping.rb:15:__node__
|
107
107
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
|
108
108
|
4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
|
109
109
|
4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
|
@@ -163,8 +163,8 @@
|
|
163
163
|
3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:18:__node__
|
164
164
|
3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:16:String
|
165
165
|
3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:15:String
|
166
|
-
3 ./benchmark/../lib/fuzzy_match/
|
167
|
-
3 ./benchmark/../lib/fuzzy_match/
|
166
|
+
3 ./benchmark/../lib/fuzzy_match/grouping.rb:33:__node__
|
167
|
+
3 ./benchmark/../lib/fuzzy_match/grouping.rb:14:__node__
|
168
168
|
3 ./benchmark/../lib/fuzzy_match.rb:86:Array
|
169
169
|
3 ./benchmark/../lib/fuzzy_match.rb:77:Array
|
170
170
|
2 /Users/seamus/.rvm/rubies/ruby-1.8.7-p334/lib/ruby/1.8/uri/common.rb:387:String
|
@@ -222,14 +222,14 @@
|
|
222
222
|
2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:5:__node__
|
223
223
|
2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:String
|
224
224
|
2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:12:__node__
|
225
|
-
2 ./benchmark/../lib/fuzzy_match/
|
226
|
-
2 ./benchmark/../lib/fuzzy_match/
|
227
|
-
2 ./benchmark/../lib/fuzzy_match/
|
228
|
-
2 ./benchmark/../lib/fuzzy_match/
|
229
|
-
2 ./benchmark/../lib/fuzzy_match/
|
230
|
-
2 ./benchmark/../lib/fuzzy_match/
|
231
|
-
2 ./benchmark/../lib/fuzzy_match/
|
232
|
-
2 ./benchmark/../lib/fuzzy_match/
|
225
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:9:Class
|
226
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:34:__node__
|
227
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:32:__node__
|
228
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:30:__node__
|
229
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:29:__node__
|
230
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:23:__node__
|
231
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:16:__node__
|
232
|
+
2 ./benchmark/../lib/fuzzy_match/grouping.rb:12:__node__
|
233
233
|
2 ./benchmark/../lib/fuzzy_match.rb:101:Array
|
234
234
|
1 benchmark/memory.rb:50:__scope__
|
235
235
|
1 benchmark/memory.rb:50:String
|
@@ -287,10 +287,10 @@
|
|
287
287
|
1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:1:__node__
|
288
288
|
1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:17:Hash
|
289
289
|
1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:String
|
290
|
-
1 ./benchmark/../lib/fuzzy_match/
|
291
|
-
1 ./benchmark/../lib/fuzzy_match/
|
292
|
-
1 ./benchmark/../lib/fuzzy_match/
|
293
|
-
1 ./benchmark/../lib/fuzzy_match/
|
290
|
+
1 ./benchmark/../lib/fuzzy_match/grouping.rb:9:__node__
|
291
|
+
1 ./benchmark/../lib/fuzzy_match/grouping.rb:9:String
|
292
|
+
1 ./benchmark/../lib/fuzzy_match/grouping.rb:1:__node__
|
293
|
+
1 ./benchmark/../lib/fuzzy_match/grouping.rb:10:FuzzyMatch::ExtractRegexp
|
294
294
|
1 ./benchmark/../lib/fuzzy_match.rb:62:FuzzyMatch::Wrapper
|
295
295
|
1 ./benchmark/../lib/fuzzy_match.rb:39:String
|
296
296
|
1 ./benchmark/../lib/fuzzy_match.rb:39:FuzzyMatch::Result
|
data/benchmark/memory.rb
CHANGED
@@ -20,13 +20,13 @@ HAYSTACK = RemoteTable.new :url => "file://#{File.expand_path('../../examples/bt
|
|
20
20
|
# Note the downcase!
|
21
21
|
HAYSTACK_READER = lambda { |record| "#{record['Manufacturer']} #{record['Long Name']}".downcase }
|
22
22
|
|
23
|
-
# Whether to even bother trying to find a match for something without an explicit
|
23
|
+
# Whether to even bother trying to find a match for something without an explicit group
|
24
24
|
# (Example) False, which is the default, which means we have more work to do
|
25
|
-
|
25
|
+
MUST_MATCH_GROUPING = false
|
26
26
|
|
27
|
-
#
|
27
|
+
# Groupings
|
28
28
|
# (Example) We made these by trial and error
|
29
|
-
|
29
|
+
GROUPINGS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/groupings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
30
30
|
|
31
31
|
# Normalizers
|
32
32
|
# (Example) We made these by trial and error
|
@@ -38,10 +38,10 @@ IDENTITIES = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/
|
|
38
38
|
|
39
39
|
FINAL_OPTIONS = {
|
40
40
|
:read => HAYSTACK_READER,
|
41
|
-
:
|
41
|
+
:must_match_grouping => MUST_MATCH_GROUPING,
|
42
42
|
:normalizers => NORMALIZERS,
|
43
43
|
:identities => IDENTITIES,
|
44
|
-
:
|
44
|
+
:groupings => GROUPINGS
|
45
45
|
}
|
46
46
|
|
47
47
|
Memprof.start
|
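data/examples/bts_aircraft/{blockings.csv → groupings.csv}
File without changes
data/examples/bts_aircraft/test_bts_aircraft.rb
CHANGED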
@@ -16,13 +16,13 @@ HAYSTACK = RemoteTable.new :url => "file://#{File.expand_path('../number_260.csv
|
|
16
16
|
# Note the downcase!
|
17
17
|
HAYSTACK_READER = lambda { |record| "#{record['Manufacturer']} #{record['Long Name']}".downcase }
|
18
18
|
|
19
|
-
# Whether to even bother trying to find a match for something without an explicit
|
19
|
+
# Whether to even bother trying to find a match for something without an explicit group
|
20
20
|
# (Example) False, which is the default, which means we have more work to do
|
21
|
-
|
21
|
+
MUST_MATCH_GROUPING = false
|
22
22
|
|
23
|
-
#
|
23
|
+
# Groupings
|
24
24
|
# (Example) We made these by trial and error
|
25
|
-
|
25
|
+
GROUPINGS = RemoteTable.new(:url => "file://#{File.expand_path("../groupings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
|
26
26
|
|
27
27
|
# Normalizers
|
28
28
|
# (Example) We made these by trial and error
|
@@ -62,10 +62,10 @@ NEGATIVES = RemoteTable.new :url => "file://#{File.expand_path("../negatives.csv
|
|
62
62
|
|
63
63
|
FINAL_OPTIONS = {
|
64
64
|
:read => HAYSTACK_READER,
|
65
|
-
:
|
65
|
+
:must_match_grouping => MUST_MATCH_GROUPING,
|
66
66
|
:normalizers => NORMALIZERS,
|
67
67
|
:identities => IDENTITIES,
|
68
|
-
:
|
68
|
+
:groupings => GROUPINGS
|
69
69
|
}
|
70
70
|
|
71
71
|
class TestBtsAircraft < MiniTest::Spec
|
data/fuzzy_match.gemspec
CHANGED
@@ -15,19 +15,10 @@ Gem::Specification.new do |s|
|
|
15
15
|
s.rubyforge_project = "fuzzy_match"
|
16
16
|
|
17
17
|
s.files = `git ls-files`.split("\n")
|
18
|
-
s.test_files = `git ls-files --
|
18
|
+
s.test_files = `git ls-files -- test/*`.split("\n")
|
19
19
|
s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
|
20
20
|
s.require_paths = ["lib"]
|
21
21
|
|
22
|
-
s.add_development_dependency "minitest"
|
23
|
-
s.add_development_dependency 'activerecord', '>=3'
|
24
|
-
s.add_development_dependency 'mysql2'
|
25
|
-
s.add_development_dependency 'cohort_scope'
|
26
|
-
s.add_development_dependency 'weighted_average'
|
27
|
-
s.add_development_dependency 'rake'
|
28
|
-
s.add_development_dependency 'yard'
|
29
|
-
s.add_development_dependency 'amatch'
|
30
|
-
|
31
22
|
s.add_runtime_dependency 'activesupport', '>=3'
|
32
23
|
s.add_runtime_dependency 'to_regexp', '>=0.0.3'
|
33
24
|
end
|
data/lib/fuzzy_match.rb
CHANGED
@@ -5,10 +5,11 @@ if ::ActiveSupport::VERSION::MAJOR >= 3
|
|
5
5
|
end
|
6
6
|
require 'to_regexp'
|
7
7
|
|
8
|
-
require 'fuzzy_match/
|
9
|
-
require 'fuzzy_match/
|
10
|
-
require 'fuzzy_match/
|
11
|
-
require 'fuzzy_match/
|
8
|
+
require 'fuzzy_match/rule'
|
9
|
+
require 'fuzzy_match/rule/normalizer'
|
10
|
+
require 'fuzzy_match/rule/stop_word'
|
11
|
+
require 'fuzzy_match/rule/grouping'
|
12
|
+
require 'fuzzy_match/rule/identity'
|
12
13
|
require 'fuzzy_match/result'
|
13
14
|
require 'fuzzy_match/wrapper'
|
14
15
|
require 'fuzzy_match/similarity'
|
@@ -44,15 +45,15 @@ class FuzzyMatch
|
|
44
45
|
DEFAULT_ENGINE = :pure_ruby
|
45
46
|
|
46
47
|
DEFAULT_OPTIONS = {
|
47
|
-
:
|
48
|
-
:
|
48
|
+
:first_grouping_decides => false,
|
49
|
+
:must_match_grouping => false,
|
49
50
|
:must_match_at_least_one_word => false,
|
50
51
|
:gather_last_result => false,
|
51
52
|
:find_all => false
|
52
53
|
}
|
53
54
|
|
54
55
|
attr_reader :haystack
|
55
|
-
attr_reader :
|
56
|
+
attr_reader :groupings
|
56
57
|
attr_reader :identities
|
57
58
|
attr_reader :normalizers
|
58
59
|
attr_reader :stop_words
|
@@ -64,46 +65,52 @@ class FuzzyMatch
|
|
64
65
|
# Rules (can only be specified at initialization or by using a setter)
|
65
66
|
# * :<tt>normalizers</tt> - regexps (see README)
|
66
67
|
# * :<tt>identities</tt> - regexps
|
67
|
-
# * :<tt>
|
68
|
+
# * :<tt>groupings</tt> - regexps
|
68
69
|
# * :<tt>stop_words</tt> - regexps
|
69
70
|
#
|
70
71
|
# Options (can be specified at initialization or when calling #find)
|
71
72
|
# * :<tt>read</tt> - how to interpret each record in the 'haystack', either a Proc or a symbol
|
72
|
-
# * :<tt>
|
73
|
+
# * :<tt>must_match_grouping</tt> - don't return a match unless the needle fits into one of the groupings you specified
|
73
74
|
# * :<tt>must_match_at_least_one_word</tt> - don't return a match unless the needle shares at least one word with the match
|
74
|
-
# * :<tt>
|
75
|
+
# * :<tt>first_grouping_decides</tt> - force records into the first grouping they match, rather than choosing a grouping that will give them a higher score
|
75
76
|
# * :<tt>gather_last_result</tt> - enable <tt>last_result</tt>
|
76
77
|
def initialize(competitors, options_and_rules = {})
|
77
78
|
options_and_rules = options_and_rules.symbolize_keys
|
78
79
|
|
79
80
|
# rules
|
80
|
-
self.
|
81
|
+
self.groupings = options_and_rules.delete(:groupings) || options_and_rules.delete(:blockings) || []
|
81
82
|
self.identities = options_and_rules.delete(:identities) || []
|
82
83
|
self.normalizers = options_and_rules.delete(:normalizers) || options_and_rules.delete(:tighteners) || []
|
83
84
|
self.stop_words = options_and_rules.delete(:stop_words) || []
|
84
85
|
@read = options_and_rules.delete(:read) || options_and_rules.delete(:haystack_reader)
|
85
86
|
|
86
87
|
# options
|
88
|
+
if deprecated = options_and_rules.delete(:first_blocking_decides)
|
89
|
+
options_and_rules[:first_grouping_decides] = deprecated
|
90
|
+
end
|
91
|
+
if deprecated = options_and_rules.delete(:must_match_blocking)
|
92
|
+
options_and_rules[:must_match_grouping] = deprecated
|
93
|
+
end
|
87
94
|
@default_options = options_and_rules.reverse_merge(DEFAULT_OPTIONS).freeze
|
88
95
|
|
89
96
|
# do this last
|
90
97
|
self.haystack = competitors
|
91
98
|
end
|
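The constructor above keeps accepting the old names (`:blockings`, `:tighteners`, `:first_blocking_decides`, `:must_match_blocking`) alongside the new ones. The same idea can be sketched without ActiveSupport as a small key-renaming step. `DEPRECATED_KEYS` and `upgrade_options` are hypothetical names for this illustration. Unlike the diff, which tests the truthiness of the deleted value, this sketch checks for key presence and always lets the new name win when both are given.

```ruby
# Hypothetical illustration of aliasing deprecated option names.
DEPRECATED_KEYS = {
  :blockings              => :groupings,
  :tighteners             => :normalizers,
  :first_blocking_decides => :first_grouping_decides,
  :must_match_blocking    => :must_match_grouping
}

# Return a copy of the options hash with old names replaced by new ones.
def upgrade_options(options)
  upgraded = options.dup
  DEPRECATED_KEYS.each do |old_key, new_key|
    next unless upgraded.key?(old_key)
    value = upgraded.delete(old_key)
    # If the caller passed both names, keep the value given under the new one.
    upgraded[new_key] = value unless upgraded.key?(new_key)
  end
  upgraded
end

legacy   = { :blockings => [/ford/i], :must_match_blocking => true }
upgraded = upgrade_options(legacy)
```

Keeping the renaming in one table makes it easy to see which names are deprecated, at the cost of treating rules and options uniformly, which the gem itself does not do.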
92
99
|
|
93
|
-
def
|
94
|
-
@
|
100
|
+
def groupings=(ary)
|
101
|
+
@groupings = ary.map { |regexp_or_str| Rule::Grouping.new regexp_or_str }
|
95
102
|
end
|
96
103
|
|
97
104
|
def identities=(ary)
|
98
|
-
@identities = ary.map { |regexp_or_str| Identity.new regexp_or_str }
|
105
|
+
@identities = ary.map { |regexp_or_str| Rule::Identity.new regexp_or_str }
|
99
106
|
end
|
100
107
|
|
101
108
|
def normalizers=(ary)
|
102
|
-
@normalizers = ary.map { |regexp_or_str| Normalizer.new regexp_or_str }
|
109
|
+
@normalizers = ary.map { |regexp_or_str| Rule::Normalizer.new regexp_or_str }
|
103
110
|
end
|
104
111
|
|
105
112
|
def stop_words=(ary)
|
106
|
-
@stop_words = ary.map { |regexp_or_str| StopWord.new regexp_or_str }
|
113
|
+
@stop_words = ary.map { |regexp_or_str| Rule::StopWord.new regexp_or_str }
|
107
114
|
end
|
108
115
|
|
109
116
|
def haystack=(ary)
|
@@ -124,8 +131,8 @@ class FuzzyMatch
|
|
124
131
|
|
125
132
|
gather_last_result = options[:gather_last_result]
|
126
133
|
is_find_all = options[:find_all]
|
127
|
-
|
128
|
-
|
134
|
+
first_grouping_decides = options[:first_grouping_decides]
|
135
|
+
must_match_grouping = options[:must_match_grouping]
|
129
136
|
must_match_at_least_one_word = options[:must_match_at_least_one_word]
|
130
137
|
|
131
138
|
if gather_last_result
|
@@ -142,7 +149,7 @@ EOS
|
|
142
149
|
if gather_last_result
|
143
150
|
last_result.normalizers = normalizers
|
144
151
|
last_result.identities = identities
|
145
|
-
last_result.
|
152
|
+
last_result.groupings = groupings
|
146
153
|
last_result.stop_words = stop_words
|
147
154
|
end
|
148
155
|
|
@@ -156,11 +163,11 @@ The needle's #{needle.variants.length} variants were enumerated.
|
|
156
163
|
EOS
|
157
164
|
end
|
158
165
|
|
159
|
-
if
|
166
|
+
if must_match_grouping and groupings.any? and groupings.none? { |grouping| grouping.match? needle }
|
160
167
|
if gather_last_result
|
161
168
|
last_result.timeline << <<-EOS
|
162
|
-
The needle didn't match any of the #{
|
163
|
-
\
|
169
|
+
The needle didn't match any of the #{groupings.length} groupings, which was a requirement.
|
170
|
+
\tGroupings (first 3): #{groupings[0,3].map(&:inspect).join(', ')}
|
164
171
|
EOS
|
165
172
|
end
|
166
173
|
|
@@ -187,18 +194,18 @@ EOS
|
|
187
194
|
passed_word_requirement = haystack
|
188
195
|
end
|
189
196
|
|
190
|
-
if
|
197
|
+
if groupings.any?
|
191
198
|
joint = passed_word_requirement.select do |straw|
|
192
|
-
if
|
193
|
-
|
199
|
+
if first_grouping_decides
|
200
|
+
groupings.detect { |grouping| grouping.match? needle }.try :join?, needle, straw
|
194
201
|
else
|
195
|
-
|
202
|
+
groupings.any? { |grouping| grouping.join? needle, straw }
|
196
203
|
end
|
197
204
|
end
|
198
205
|
if gather_last_result
|
199
206
|
last_result.timeline << <<-EOS
|
200
|
-
Since there were
|
201
|
-
\
|
207
|
+
Since there were groupings, the competition was reduced to records in the same group as the needle.
|
208
|
+
\tGroupings (first 3): #{groupings[0,3].map(&:inspect).join(', ')}
|
202
209
|
\tPassed (first 3): #{joint[0,3].map(&:render).map(&:inspect).join(', ')}
|
203
210
|
\tFailed (first 3): #{(passed_word_requirement-joint)[0,3].map(&:render).map(&:inspect).join(', ')}
|
204
211
|
EOS
|
@@ -208,10 +215,10 @@ EOS
|
|
208
215
|
end
|
209
216
|
|
210
217
|
if joint.none?
|
211
|
-
if
|
218
|
+
if must_match_grouping
|
212
219
|
if gather_last_result
|
213
220
|
last_result.timeline << <<-EOS
|
214
|
-
Since :must_match_at_least_one_word => true and none of the competition was in the same
|
221
|
+
Since :must_match_grouping => true and none of the competition was in the same group as the needle, the search stopped.
|
215
222
|
EOS
|
216
223
|
end
|
217
224
|
if is_find_all
|
@@ -256,20 +263,21 @@ EOS
|
|
256
263
|
return similarities.map { |similarity| similarity.wrapper2.record }
|
257
264
|
end
|
258
265
|
|
266
|
+
best_similarity = similarities.first
|
259
267
|
winner = nil
|
260
268
|
|
261
|
-
if best_similarity
|
269
|
+
if best_similarity and (best_similarity.best_score.dices_coefficient_similar > 0 or (needle.words & best_similarity.wrapper2.words).any?)
|
262
270
|
winner = best_similarity.wrapper2.record
|
263
271
|
if gather_last_result
|
264
272
|
last_result.winner = winner
|
265
273
|
last_result.score = best_similarity.best_score.dices_coefficient_similar
|
266
274
|
last_result.timeline << <<-EOS
|
267
|
-
A winner was determined because the similarity
|
275
|
+
A winner was determined because the Dice's Coefficient similarity (#{best_similarity.best_score.dices_coefficient_similar}) is greater than zero or because it shared a word with the needle.
|
268
276
|
EOS
|
269
277
|
end
|
270
278
|
elsif gather_last_result
|
271
279
|
last_result.timeline << <<-EOS
|
272
|
-
No winner assigned because
|
280
|
+
No winner assigned because the score of the best similarity (#{best_similarity.try(:wrapper2).try(:record).try(:inspect)}) was zero and it didn't match any words with the needle (#{needle.inspect}).
|
273
281
|
EOS
|
274
282
|
end
|
275
283
|
|
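The new winner condition in this hunk pairs a Dice's Coefficient check with a shared-word check, which matches the changelog entry about 'X foo' vs 'X bar'. The sketch below shows why a pair-based score alone misses that case. It is a standalone approximation, not the gem's code: it assumes letter pairs are taken per whitespace-separated word, and `letter_pairs`, `dice` and `winner?` are hypothetical names.

```ruby
# Hypothetical illustration, not fuzzy_match's implementation.

# Adjacent letter pairs, taken word by word. A one-letter word such as
# "x" contributes no pairs at all.
def letter_pairs(str)
  str.downcase.split(/\s+/).flat_map do |word|
    (0...(word.length - 1)).map { |i| word[i, 2] }
  end
end

# Dice's coefficient: 2 * shared pairs / total pairs.
def dice(a, b)
  pairs_a = letter_pairs(a)
  pairs_b = letter_pairs(b)
  total = pairs_a.length + pairs_b.length
  return 0.0 if total.zero?
  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    idx = remaining.index(pair)
    idx && remaining.delete_at(idx) # consume each matched pair once
  end
  2.0 * shared / total
end

# Mirror of the condition in the diff: a positive score, or a shared word.
def winner?(needle, candidate)
  shared_words = needle.downcase.split(/\s+/) & candidate.downcase.split(/\s+/)
  dice(needle, candidate) > 0 || shared_words.any?
end

dice('X foo', 'X bar')    # no pairs in common, since "x" yields none
winner?('X foo', 'X bar') # still accepted because the word "x" is shared
```

Under these assumptions the pair score for 'X foo' and 'X bar' is zero, so only the shared-word clause lets the candidate through.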