fuzzy_match 1.3.1 → 1.3.2

data/Gemfile CHANGED
@@ -1,4 +1,14 @@
- source "http://rubygems.org"
+ source :rubygems
 
- # Specify your gem's dependencies in fuzzy_match.gemspec
  gemspec
+
+ # development dependencies
+ gem 'minitest-reporters'
+ gem "minitest"
+ gem 'activerecord', '>=3'
+ gem 'mysql2'
+ gem 'cohort_scope'
+ gem 'weighted_average'
+ gem 'rake'
+ gem 'yard'
+ gem 'amatch'
data/History.txt ADDED
@@ -0,0 +1,13 @@
+ == 1.3.2 / 2012-02-24
+
+ * Start keeping a changelog!
+
+ * Enhancements
+
+ * renamed blockings to groupings
+ * cleaned up tests
+
+ * Bug fixes
+
+ * better handling for one-letter similarities like 'X foo' vs 'X bar' which couldn't be detected by pair distance
+ * take deprecated option :tighteners as :normalizers
data/README.markdown CHANGED
@@ -32,20 +32,22 @@ You can improve the default matchings with rules. There are 4 different kinds of
 
  We suggest that you **first try without any rules** and only define them to improve matching, prevent false positives, etc.
 
- >> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :blockings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d50)/i ])
+ >> matcher = FuzzyMatch.new(['Ford F-150', 'Ford F-250', 'GMC 1500', 'GMC 2500'], :groupings => [ /ford/i, /gmc/i ], :normalizers => [ /K(\d500)/i ], :identities => [ /(f)-?(\d50)/i ])
  => #<FuzzyMatch: [...]>
  >> matcher.find('fordf250')
  => "Ford F-250"
  >> matcher.find('gmc truck k1500')
  => "GMC 1500"
 
- For identities and normalizers (see below), **only the captures are used.** For example, `/(f)-?(\d50)/i` captures the "F" and the "250" but ignores the dash. So place your parentheses carefully! Blockings work the same way, except that if you don't have any captures, a simple match will pass.
+ For identities and normalizers (see below), **only the captures are used.** For example, `/(f)-?(\d50)/i` captures the "F" and the "250" but ignores the dash. So place your parentheses carefully! Groupings work the same way, except that if you don't have any captures, a simple match will pass.
 
- ### Blockings
+ ### Groupings
 
  Group records together.
 
- Setting a blocking of `/Airbus/` ensures that strings containing "Airbus" will only be scored against to other strings containing "Airbus". A better blocking in this case would probably be `/airbus/i`.
+ Setting a grouping of `/Airbus/` ensures that strings containing "Airbus" will only be scored against other strings containing "Airbus". A better grouping in this case would probably be `/airbus/i`.
+
+ Formerly called "blockings," but that was jargon that confused people.
 
  ### Identities
 
@@ -53,6 +55,8 @@ Prevent impossible matches.
 
  Adding an identity like `/(f)-?(\d50)/i` ensures that "Ford F-150" and "Ford F-250" never match.
 
+ Note that identities do not establish certainty. They just say whether two records **could** be identical... then string similarity takes over.
+
  ### Stop words
 
  Ignore common and/or meaningless words. Applied before normalizers.
@@ -68,9 +72,9 @@ Adding a normalizer like `/(boeing).*(7\d\d)/i` will cause "BOEING COMPANY 747"
  ## Find options
 
  * `read`: how to interpret each record in the 'haystack', either a Proc or a symbol
- * `must_match_blocking`: don't return a match unless the needle fits into one of the blockings you specified
+ * `must_match_grouping`: don't return a match unless the needle fits into one of the groupings you specified
  * `must_match_at_least_one_word`: don't return a match unless the needle shares at least one word with the match. Note that "Foo's" is treated like one word (so that it won't match "'s") and "Bolivia," is treated as just "bolivia"
- * `first_blocking_decides`: force records into the first blocking they match, rather than choosing a blocking that will give them a higher score
+ * `first_grouping_decides`: force records into the first grouping they match, rather than choosing a grouping that will give them a higher score
  * `gather_last_result`: enable `last_result`
 
  ## Case sensitivity
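
Review note on the README hunks above: the "only the captures are used" rule is easier to see in plain Ruby. The following is a minimal sketch, not the gem's implementation. The helper names (`captured_parts`, `could_be_identical?`, `same_group?`) are hypothetical, and the behaviour for strings that do not match the regexp is an assumption made for illustration.

```ruby
# Hypothetical helpers; none of these names belong to fuzzy_match.

# Downcased captures of +regexp+ in +str+, or nil when there is no match.
# A regexp without capture groups yields an empty array on a match.
def captured_parts(regexp, str)
  match = regexp.match(str)
  match && match.captures.map { |c| c.to_s.downcase }
end

# Identity: two records could be identical only if their captures agree.
# Returns nil when the rule does not apply to both strings (assumption).
def could_be_identical?(regexp, a, b)
  parts_a = captured_parts(regexp, a)
  parts_b = captured_parts(regexp, b)
  return nil if parts_a.nil? || parts_b.nil?
  parts_a == parts_b
end

# Grouping: both strings must match; with no captures a simple match passes.
def same_group?(regexp, a, b)
  parts_a = captured_parts(regexp, a)
  parts_b = captured_parts(regexp, b)
  return false if parts_a.nil? || parts_b.nil?
  parts_a == parts_b
end

identity = /(f)-?(\d50)/i
could_be_identical?(identity, 'Ford F-150', 'Ford F-250') # => false
could_be_identical?(identity, 'fordf250', 'Ford F-250')   # => true
same_group?(/airbus/i, 'AIRBUS A320', 'Airbus A380')      # => true
```

The dash never takes part in the comparison because it sits outside the parentheses, which is why 'fordf250' and 'Ford F-250' are still allowed to match.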
@@ -22,7 +22,7 @@
22
22
  28 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:19:__node__
23
23
  22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
24
24
  22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
25
- 21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Blocking
25
+ 21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Grouping
26
26
  17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
27
27
  16 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:Class
28
28
  14 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:4:__node__
@@ -50,7 +50,7 @@
50
50
  8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
51
51
  8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
52
52
  8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
53
- 8 ./benchmark/../lib/fuzzy_match/blocking.rb:24:__node__
53
+ 8 ./benchmark/../lib/fuzzy_match/grouping.rb:24:__node__
54
54
  7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
55
55
  7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:6:__node__
56
56
  7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:5:__node__
@@ -59,7 +59,7 @@
59
59
  7 ./benchmark/../lib/fuzzy_match/similarity.rb:45:__node__
60
60
  7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
61
61
  7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
62
- 7 ./benchmark/../lib/fuzzy_match/blocking.rb:27:__node__
62
+ 7 ./benchmark/../lib/fuzzy_match/grouping.rb:27:__node__
63
63
  7 ./benchmark/../lib/fuzzy_match.rb:209:String
64
64
  6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
65
65
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
@@ -68,7 +68,7 @@
68
68
  6 ./benchmark/../lib/fuzzy_match/score.rb:25:__node__
69
69
  6 ./benchmark/../lib/fuzzy_match/score.rb:21:__node__
70
70
  6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
71
- 6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
71
+ 6 ./benchmark/../lib/fuzzy_match/grouping.rb:22:__node__
72
72
  5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
73
73
  5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
74
74
  5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
@@ -80,9 +80,9 @@
80
80
  5 ./benchmark/../lib/fuzzy_match/score.rb:9:__node__
81
81
  5 ./benchmark/../lib/fuzzy_match/result.rb:16:__node__
82
82
  5 ./benchmark/../lib/fuzzy_match/identity.rb:10:__node__
83
- 5 ./benchmark/../lib/fuzzy_match/blocking.rb:26:__node__
84
- 5 ./benchmark/../lib/fuzzy_match/blocking.rb:25:__node__
85
- 5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
83
+ 5 ./benchmark/../lib/fuzzy_match/grouping.rb:26:__node__
84
+ 5 ./benchmark/../lib/fuzzy_match/grouping.rb:25:__node__
85
+ 5 ./benchmark/../lib/fuzzy_match/grouping.rb:15:__node__
86
86
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
87
87
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
88
88
  4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
@@ -142,8 +142,8 @@
142
142
  3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:18:__node__
143
143
  3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:16:String
144
144
  3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:15:String
145
- 3 ./benchmark/../lib/fuzzy_match/blocking.rb:33:__node__
146
- 3 ./benchmark/../lib/fuzzy_match/blocking.rb:14:__node__
145
+ 3 ./benchmark/../lib/fuzzy_match/grouping.rb:33:__node__
146
+ 3 ./benchmark/../lib/fuzzy_match/grouping.rb:14:__node__
147
147
  3 ./benchmark/../lib/fuzzy_match.rb:77:Array
148
148
  2 /Users/seamus/.rvm/rubies/ruby-1.8.7-p334/lib/ruby/1.8/uri/common.rb:387:String
149
149
  2 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:3:String
@@ -200,14 +200,14 @@
200
200
  2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:5:__node__
201
201
  2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:String
202
202
  2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:12:__node__
203
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:9:Class
204
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:34:__node__
205
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:32:__node__
206
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:30:__node__
207
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:29:__node__
208
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:23:__node__
209
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:16:__node__
210
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:12:__node__
203
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:9:Class
204
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:34:__node__
205
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:32:__node__
206
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:30:__node__
207
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:29:__node__
208
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:23:__node__
209
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:16:__node__
210
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:12:__node__
211
211
  2 ./benchmark/../lib/fuzzy_match.rb:86:Array
212
212
  1 benchmark/memory.rb:50:String
213
213
  1 benchmark/memory.rb:49:FuzzyMatch
@@ -265,10 +265,10 @@
265
265
  1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:1:__node__
266
266
  1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:17:Hash
267
267
  1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:String
268
- 1 ./benchmark/../lib/fuzzy_match/blocking.rb:9:__node__
269
- 1 ./benchmark/../lib/fuzzy_match/blocking.rb:9:String
270
- 1 ./benchmark/../lib/fuzzy_match/blocking.rb:1:__node__
271
- 1 ./benchmark/../lib/fuzzy_match/blocking.rb:10:FuzzyMatch::ExtractRegexp
268
+ 1 ./benchmark/../lib/fuzzy_match/grouping.rb:9:__node__
269
+ 1 ./benchmark/../lib/fuzzy_match/grouping.rb:9:String
270
+ 1 ./benchmark/../lib/fuzzy_match/grouping.rb:1:__node__
271
+ 1 ./benchmark/../lib/fuzzy_match/grouping.rb:10:FuzzyMatch::ExtractRegexp
272
272
  1 ./benchmark/../lib/fuzzy_match.rb:62:FuzzyMatch::Wrapper
273
273
  1 ./benchmark/../lib/fuzzy_match.rb:39:String
274
274
  1 ./benchmark/../lib/fuzzy_match.rb:39:FuzzyMatch::Result
data/benchmark/before.txt CHANGED
@@ -46,7 +46,7 @@
46
46
  26 ./benchmark/../lib/fuzzy_match.rb:187:FuzzyMatch::Normalizer
47
47
  22 ./benchmark/../lib/fuzzy_match/similarity.rb:57:__node__
48
48
  22 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:20:__node__
49
- 21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Blocking
49
+ 21 ./benchmark/../lib/fuzzy_match.rb:199:FuzzyMatch::Grouping
50
50
  17 ./benchmark/../lib/fuzzy_match/similarity.rb:21:__node__
51
51
  16 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch.bundle:0:Class
52
52
  14 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:4:__node__
@@ -72,7 +72,7 @@
72
72
  8 ./benchmark/../lib/fuzzy_match/similarity.rb:38:__node__
73
73
  8 ./benchmark/../lib/fuzzy_match/score.rb:13:__node__
74
74
  8 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:__node__
75
- 8 ./benchmark/../lib/fuzzy_match/blocking.rb:24:__node__
75
+ 8 ./benchmark/../lib/fuzzy_match/grouping.rb:24:__node__
76
76
  7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:7:__node__
77
77
  7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:6:__node__
78
78
  7 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/amatch-0.2.5/lib/amatch/version.rb:5:__node__
@@ -81,7 +81,7 @@
81
81
  7 ./benchmark/../lib/fuzzy_match/similarity.rb:45:__node__
82
82
  7 ./benchmark/../lib/fuzzy_match/score.rb:17:__node__
83
83
  7 ./benchmark/../lib/fuzzy_match/identity.rb:19:__node__
84
- 7 ./benchmark/../lib/fuzzy_match/blocking.rb:27:__node__
84
+ 7 ./benchmark/../lib/fuzzy_match/grouping.rb:27:__node__
85
85
  6 ./benchmark/../lib/fuzzy_match/wrapper.rb:8:__node__
86
86
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:44:__node__
87
87
  6 ./benchmark/../lib/fuzzy_match/similarity.rb:15:__node__
@@ -89,7 +89,7 @@
89
89
  6 ./benchmark/../lib/fuzzy_match/score.rb:25:__node__
90
90
  6 ./benchmark/../lib/fuzzy_match/score.rb:21:__node__
91
91
  6 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:__node__
92
- 6 ./benchmark/../lib/fuzzy_match/blocking.rb:22:__node__
92
+ 6 ./benchmark/../lib/fuzzy_match/grouping.rb:22:__node__
93
93
  5 /Users/seamus/.rvm/gems/ruby-1.8.7-p334/gems/fastercsv-1.5.4/lib/faster_csv.rb:1640:String
94
94
  5 ./benchmark/../lib/fuzzy_match/wrapper.rb:34:__node__
95
95
  5 ./benchmark/../lib/fuzzy_match/normalizer.rb:9:__node__
@@ -101,9 +101,9 @@
101
101
  5 ./benchmark/../lib/fuzzy_match/score.rb:9:__node__
102
102
  5 ./benchmark/../lib/fuzzy_match/result.rb:16:__node__
103
103
  5 ./benchmark/../lib/fuzzy_match/identity.rb:10:__node__
104
- 5 ./benchmark/../lib/fuzzy_match/blocking.rb:26:__node__
105
- 5 ./benchmark/../lib/fuzzy_match/blocking.rb:25:__node__
106
- 5 ./benchmark/../lib/fuzzy_match/blocking.rb:15:__node__
104
+ 5 ./benchmark/../lib/fuzzy_match/grouping.rb:26:__node__
105
+ 5 ./benchmark/../lib/fuzzy_match/grouping.rb:25:__node__
106
+ 5 ./benchmark/../lib/fuzzy_match/grouping.rb:15:__node__
107
107
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:33:__node__
108
108
  4 ./benchmark/../lib/fuzzy_match/wrapper.rb:30:__node__
109
109
  4 ./benchmark/../lib/fuzzy_match/normalizer.rb:20:__node__
@@ -163,8 +163,8 @@
163
163
  3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:18:__node__
164
164
  3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:16:String
165
165
  3 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:15:String
166
- 3 ./benchmark/../lib/fuzzy_match/blocking.rb:33:__node__
167
- 3 ./benchmark/../lib/fuzzy_match/blocking.rb:14:__node__
166
+ 3 ./benchmark/../lib/fuzzy_match/grouping.rb:33:__node__
167
+ 3 ./benchmark/../lib/fuzzy_match/grouping.rb:14:__node__
168
168
  3 ./benchmark/../lib/fuzzy_match.rb:86:Array
169
169
  3 ./benchmark/../lib/fuzzy_match.rb:77:Array
170
170
  2 /Users/seamus/.rvm/rubies/ruby-1.8.7-p334/lib/ruby/1.8/uri/common.rb:387:String
@@ -222,14 +222,14 @@
222
222
  2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:5:__node__
223
223
  2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:23:String
224
224
  2 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:12:__node__
225
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:9:Class
226
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:34:__node__
227
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:32:__node__
228
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:30:__node__
229
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:29:__node__
230
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:23:__node__
231
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:16:__node__
232
- 2 ./benchmark/../lib/fuzzy_match/blocking.rb:12:__node__
225
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:9:Class
226
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:34:__node__
227
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:32:__node__
228
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:30:__node__
229
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:29:__node__
230
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:23:__node__
231
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:16:__node__
232
+ 2 ./benchmark/../lib/fuzzy_match/grouping.rb:12:__node__
233
233
  2 ./benchmark/../lib/fuzzy_match.rb:101:Array
234
234
  1 benchmark/memory.rb:50:__scope__
235
235
  1 benchmark/memory.rb:50:String
@@ -287,10 +287,10 @@
287
287
  1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:1:__node__
288
288
  1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:17:Hash
289
289
  1 ./benchmark/../lib/fuzzy_match/extract_regexp.rb:10:String
290
- 1 ./benchmark/../lib/fuzzy_match/blocking.rb:9:__node__
291
- 1 ./benchmark/../lib/fuzzy_match/blocking.rb:9:String
292
- 1 ./benchmark/../lib/fuzzy_match/blocking.rb:1:__node__
293
- 1 ./benchmark/../lib/fuzzy_match/blocking.rb:10:FuzzyMatch::ExtractRegexp
290
+ 1 ./benchmark/../lib/fuzzy_match/grouping.rb:9:__node__
291
+ 1 ./benchmark/../lib/fuzzy_match/grouping.rb:9:String
292
+ 1 ./benchmark/../lib/fuzzy_match/grouping.rb:1:__node__
293
+ 1 ./benchmark/../lib/fuzzy_match/grouping.rb:10:FuzzyMatch::ExtractRegexp
294
294
  1 ./benchmark/../lib/fuzzy_match.rb:62:FuzzyMatch::Wrapper
295
295
  1 ./benchmark/../lib/fuzzy_match.rb:39:String
296
296
  1 ./benchmark/../lib/fuzzy_match.rb:39:FuzzyMatch::Result
data/benchmark/memory.rb CHANGED
@@ -20,13 +20,13 @@ HAYSTACK = RemoteTable.new :url => "file://#{File.expand_path('../../examples/bt
20
20
  # Note the downcase!
21
21
  HAYSTACK_READER = lambda { |record| "#{record['Manufacturer']} #{record['Long Name']}".downcase }
22
22
 
23
- # Whether to even bother trying to find a match for something without an explicit block
23
+ # Whether to even bother trying to find a match for something without an explicit group
24
24
  # (Example) False, which is the default, which means we have more work to do
25
- MUST_MATCH_BLOCKING = false
25
+ MUST_MATCH_GROUPING = false
26
26
 
27
- # Blockings
27
+ # Groupings
28
28
  # (Example) We made these by trial and error
29
- BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
29
+ GROUPINGS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/groupings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
30
30
 
31
31
  # Normalizers
32
32
  # (Example) We made these by trial and error
@@ -38,10 +38,10 @@ IDENTITIES = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/
38
38
 
39
39
  FINAL_OPTIONS = {
40
40
  :read => HAYSTACK_READER,
41
- :must_match_blocking => MUST_MATCH_BLOCKING,
41
+ :must_match_grouping => MUST_MATCH_GROUPING,
42
42
  :normalizers => NORMALIZERS,
43
43
  :identities => IDENTITIES,
44
- :blockings => BLOCKINGS
44
+ :groupings => GROUPINGS
45
45
  }
46
46
 
47
47
  Memprof.start
@@ -16,13 +16,13 @@ HAYSTACK = RemoteTable.new :url => "file://#{File.expand_path('../number_260.csv
16
16
  # Note the downcase!
17
17
  HAYSTACK_READER = lambda { |record| "#{record['Manufacturer']} #{record['Long Name']}".downcase }
18
18
 
19
- # Whether to even bother trying to find a match for something without an explicit block
19
+ # Whether to even bother trying to find a match for something without an explicit group
20
20
  # (Example) False, which is the default, which means we have more work to do
21
- MUST_MATCH_BLOCKING = false
21
+ MUST_MATCH_GROUPING = false
22
22
 
23
- # Blockings
23
+ # Groupings
24
24
  # (Example) We made these by trial and error
25
- BLOCKINGS = RemoteTable.new(:url => "file://#{File.expand_path("../blockings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
25
+ GROUPINGS = RemoteTable.new(:url => "file://#{File.expand_path("../groupings.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
26
26
 
27
27
  # Normalizers
28
28
  # (Example) We made these by trial and error
@@ -62,10 +62,10 @@ NEGATIVES = RemoteTable.new :url => "file://#{File.expand_path("../negatives.csv
62
62
 
63
63
  FINAL_OPTIONS = {
64
64
  :read => HAYSTACK_READER,
65
- :must_match_blocking => MUST_MATCH_BLOCKING,
65
+ :must_match_grouping => MUST_MATCH_GROUPING,
66
66
  :normalizers => NORMALIZERS,
67
67
  :identities => IDENTITIES,
68
- :blockings => BLOCKINGS
68
+ :groupings => GROUPINGS
69
69
  }
70
70
 
71
71
  class TestBtsAircraft < MiniTest::Spec
data/fuzzy_match.gemspec CHANGED
@@ -15,19 +15,10 @@ Gem::Specification.new do |s|
  s.rubyforge_project = "fuzzy_match"
 
  s.files = `git ls-files`.split("\n")
- s.test_files = `git ls-files -- {test,spec,ffuzzy_matchures}/*`.split("\n")
+ s.test_files = `git ls-files -- test/*`.split("\n")
  s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
  s.require_paths = ["lib"]
 
- s.add_development_dependency "minitest"
- s.add_development_dependency 'activerecord', '>=3'
- s.add_development_dependency 'mysql2'
- s.add_development_dependency 'cohort_scope'
- s.add_development_dependency 'weighted_average'
- s.add_development_dependency 'rake'
- s.add_development_dependency 'yard'
- s.add_development_dependency 'amatch'
-
  s.add_runtime_dependency 'activesupport', '>=3'
  s.add_runtime_dependency 'to_regexp', '>=0.0.3'
  end
data/lib/fuzzy_match.rb CHANGED
@@ -5,10 +5,11 @@ if ::ActiveSupport::VERSION::MAJOR >= 3
5
5
  end
6
6
  require 'to_regexp'
7
7
 
8
- require 'fuzzy_match/normalizer'
9
- require 'fuzzy_match/stop_word'
10
- require 'fuzzy_match/blocking'
11
- require 'fuzzy_match/identity'
8
+ require 'fuzzy_match/rule'
9
+ require 'fuzzy_match/rule/normalizer'
10
+ require 'fuzzy_match/rule/stop_word'
11
+ require 'fuzzy_match/rule/grouping'
12
+ require 'fuzzy_match/rule/identity'
12
13
  require 'fuzzy_match/result'
13
14
  require 'fuzzy_match/wrapper'
14
15
  require 'fuzzy_match/similarity'
@@ -44,15 +45,15 @@ class FuzzyMatch
44
45
  DEFAULT_ENGINE = :pure_ruby
45
46
 
46
47
  DEFAULT_OPTIONS = {
47
- :first_blocking_decides => false,
48
- :must_match_blocking => false,
48
+ :first_grouping_decides => false,
49
+ :must_match_grouping => false,
49
50
  :must_match_at_least_one_word => false,
50
51
  :gather_last_result => false,
51
52
  :find_all => false
52
53
  }
53
54
 
54
55
  attr_reader :haystack
55
- attr_reader :blockings
56
+ attr_reader :groupings
56
57
  attr_reader :identities
57
58
  attr_reader :normalizers
58
59
  attr_reader :stop_words
@@ -64,46 +65,52 @@ class FuzzyMatch
64
65
  # Rules (can only be specified at initialization or by using a setter)
65
66
  # * :<tt>normalizers</tt> - regexps (see README)
66
67
  # * :<tt>identities</tt> - regexps
67
- # * :<tt>blockings</tt> - regexps
68
+ # * :<tt>groupings</tt> - regexps
68
69
  # * :<tt>stop_words</tt> - regexps
69
70
  #
70
71
  # Options (can be specified at initialization or when calling #find)
71
72
  # * :<tt>read</tt> - how to interpret each record in the 'haystack', either a Proc or a symbol
72
- # * :<tt>must_match_blocking</tt> - don't return a match unless the needle fits into one of the blockings you specified
73
+ # * :<tt>must_match_grouping</tt> - don't return a match unless the needle fits into one of the groupings you specified
73
74
  # * :<tt>must_match_at_least_one_word</tt> - don't return a match unless the needle shares at least one word with the match
74
- # * :<tt>first_blocking_decides</tt> - force records into the first blocking they match, rather than choosing a blocking that will give them a higher score
75
+ # * :<tt>first_grouping_decides</tt> - force records into the first grouping they match, rather than choosing a grouping that will give them a higher score
75
76
  # * :<tt>gather_last_result</tt> - enable <tt>last_result</tt>
76
77
  def initialize(competitors, options_and_rules = {})
77
78
  options_and_rules = options_and_rules.symbolize_keys
78
79
 
79
80
  # rules
80
- self.blockings = options_and_rules.delete(:blockings) || []
81
+ self.groupings = options_and_rules.delete(:groupings) || options_and_rules.delete(:blockings) || []
81
82
  self.identities = options_and_rules.delete(:identities) || []
82
83
  self.normalizers = options_and_rules.delete(:normalizers) || options_and_rules.delete(:tighteners) || []
83
84
  self.stop_words = options_and_rules.delete(:stop_words) || []
84
85
  @read = options_and_rules.delete(:read) || options_and_rules.delete(:haystack_reader)
85
86
 
86
87
  # options
88
+ if deprecated = options_and_rules.delete(:first_blocking_decides)
89
+ options_and_rules[:first_grouping_decides] = deprecated
90
+ end
91
+ if deprecated = options_and_rules.delete(:must_match_blocking)
92
+ options_and_rules[:must_match_grouping] = deprecated
93
+ end
87
94
  @default_options = options_and_rules.reverse_merge(DEFAULT_OPTIONS).freeze
88
95
 
89
96
  # do this last
90
97
  self.haystack = competitors
91
98
  end
92
99
 
93
- def blockings=(ary)
94
- @blockings = ary.map { |regexp_or_str| Blocking.new regexp_or_str }
100
+ def groupings=(ary)
101
+ @groupings = ary.map { |regexp_or_str| Rule::Grouping.new regexp_or_str }
95
102
  end
96
103
 
97
104
  def identities=(ary)
98
- @identities = ary.map { |regexp_or_str| Identity.new regexp_or_str }
105
+ @identities = ary.map { |regexp_or_str| Rule::Identity.new regexp_or_str }
99
106
  end
100
107
 
101
108
  def normalizers=(ary)
102
- @normalizers = ary.map { |regexp_or_str| Normalizer.new regexp_or_str }
109
+ @normalizers = ary.map { |regexp_or_str| Rule::Normalizer.new regexp_or_str }
103
110
  end
104
111
 
105
112
  def stop_words=(ary)
106
- @stop_words = ary.map { |regexp_or_str| StopWord.new regexp_or_str }
113
+ @stop_words = ary.map { |regexp_or_str| Rule::StopWord.new regexp_or_str }
107
114
  end
108
115
 
109
116
  def haystack=(ary)
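
Review note on the hunk above: the renamed keys stay backward compatible because `initialize` falls back to the old names (`:blockings`, `:tighteners`, `:first_blocking_decides`, `:must_match_blocking`). The same idea can be sketched as one table-driven helper. This is a hypothetical alternative, not code from the gem: `DEPRECATED_KEYS` and `migrate_deprecated_options` are made-up names, and unlike the diff (where a truthy deprecated option overwrites the new key) this sketch lets the new key win in every case.

```ruby
# Old key => new key, as renamed in this release (table name is hypothetical).
DEPRECATED_KEYS = {
  :blockings              => :groupings,
  :tighteners             => :normalizers,
  :first_blocking_decides => :first_grouping_decides,
  :must_match_blocking    => :must_match_grouping
}

# Return a copy of +options+ with deprecated keys renamed.
# A value supplied under the new name takes precedence over the old one.
def migrate_deprecated_options(options)
  migrated = options.dup
  DEPRECATED_KEYS.each do |old_key, new_key|
    next unless migrated.key?(old_key)
    old_value = migrated.delete(old_key)
    migrated[new_key] = old_value unless migrated.key?(new_key)
  end
  migrated
end

migrate_deprecated_options(:blockings => [/ford/i], :must_match_blocking => true)
# => {:groupings=>[/ford/i], :must_match_grouping=>true}
```

Keeping the renames in one table makes it harder to forget an alias when another option is renamed later.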
@@ -124,8 +131,8 @@ class FuzzyMatch
124
131
 
125
132
  gather_last_result = options[:gather_last_result]
126
133
  is_find_all = options[:find_all]
127
- first_blocking_decides = options[:first_blocking_decides]
128
- must_match_blocking = options[:must_match_blocking]
134
+ first_grouping_decides = options[:first_grouping_decides]
135
+ must_match_grouping = options[:must_match_grouping]
129
136
  must_match_at_least_one_word = options[:must_match_at_least_one_word]
130
137
 
131
138
  if gather_last_result
@@ -142,7 +149,7 @@ EOS
142
149
  if gather_last_result
143
150
  last_result.normalizers = normalizers
144
151
  last_result.identities = identities
145
- last_result.blockings = blockings
152
+ last_result.groupings = groupings
146
153
  last_result.stop_words = stop_words
147
154
  end
148
155
 
@@ -156,11 +163,11 @@ The needle's #{needle.variants.length} variants were enumerated.
156
163
  EOS
157
164
  end
158
165
 
159
- if must_match_blocking and blockings.any? and blockings.none? { |blocking| blocking.match? needle }
166
+ if must_match_grouping and groupings.any? and groupings.none? { |grouping| grouping.match? needle }
160
167
  if gather_last_result
161
168
  last_result.timeline << <<-EOS
162
- The needle didn't match any of the #{blockings.length} blocking, which was a requirement.
163
- \tBlockings (first 3): #{blockings[0,3].map(&:inspect).join(', ')}
169
+ The needle didn't match any of the #{groupings.length} groupings, which was a requirement.
170
+ \tGroupings (first 3): #{groupings[0,3].map(&:inspect).join(', ')}
164
171
  EOS
165
172
  end
166
173
 
@@ -187,18 +194,18 @@ EOS
187
194
  passed_word_requirement = haystack
188
195
  end
189
196
 
190
- if blockings.any?
197
+ if groupings.any?
191
198
  joint = passed_word_requirement.select do |straw|
192
- if first_blocking_decides
193
- blockings.detect { |blocking| blocking.match? needle }.try :join?, needle, straw
199
+ if first_grouping_decides
200
+ groupings.detect { |grouping| grouping.match? needle }.try :join?, needle, straw
194
201
  else
195
- blockings.any? { |blocking| blocking.join? needle, straw }
202
+ groupings.any? { |grouping| grouping.join? needle, straw }
196
203
  end
197
204
  end
198
205
  if gather_last_result
199
206
  last_result.timeline << <<-EOS
200
- Since there were blockings, the competition was reduced to records in the same block as the needle.
201
- \tBlockings (first 3): #{blockings[0,3].map(&:inspect).join(', ')}
207
+ Since there were groupings, the competition was reduced to records in the same group as the needle.
208
+ \tGroupings (first 3): #{groupings[0,3].map(&:inspect).join(', ')}
202
209
  \tPassed (first 3): #{joint[0,3].map(&:render).map(&:inspect).join(', ')}
203
210
  \tFailed (first 3): #{(passed_word_requirement-joint)[0,3].map(&:render).map(&:inspect).join(', ')}
204
211
  EOS
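
Review note on the hunk above: the effect of `first_grouping_decides` on the candidate list can be reproduced with a small stand-in. This is a sketch under simplifying assumptions. `SketchGrouping` and `same_group_candidates` are hypothetical names, captures are ignored, and `join?` here only checks that both strings match, which may differ from the gem's `Rule::Grouping#join?`.

```ruby
# Hypothetical stand-in for a grouping rule (not the gem's Rule::Grouping).
SketchGrouping = Struct.new(:regexp) do
  # True if the string belongs to this group at all.
  def match?(str)
    !regexp.match(str).nil?
  end

  # True if both strings fall into this group (captures ignored for brevity).
  def join?(a, b)
    match?(a) && match?(b)
  end
end

# Keep only the straws that share a group with the needle.
def same_group_candidates(needle, haystack, groupings, first_grouping_decides = false)
  return haystack if groupings.empty?
  haystack.select do |straw|
    if first_grouping_decides
      # Only the first grouping that matches the needle is consulted.
      first = groupings.detect { |g| g.match?(needle) }
      !first.nil? && first.join?(needle, straw)
    else
      # Any grouping that joins needle and straw is enough.
      groupings.any? { |g| g.join?(needle, straw) }
    end
  end
end

groupings = [SketchGrouping.new(/ford/i), SketchGrouping.new(/truck/i)]
haystack  = ['Ford F-150', 'GMC truck', 'GMC 1500']
same_group_candidates('ford truck', haystack, groupings)       # => ["Ford F-150", "GMC truck"]
same_group_candidates('ford truck', haystack, groupings, true) # => ["Ford F-150"]
```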
@@ -208,10 +215,10 @@ EOS
208
215
  end
209
216
 
210
217
  if joint.none?
211
- if must_match_blocking
218
+ if must_match_grouping
212
219
  if gather_last_result
213
220
  last_result.timeline << <<-EOS
214
- Since :must_match_at_least_one_word => true and none of the competition was in the same block as the needle, the search stopped.
221
+ Since :must_match_grouping => true and none of the competition was in the same group as the needle, the search stopped.
215
222
  EOS
216
223
  end
217
224
  if is_find_all
@@ -256,20 +263,21 @@ EOS
256
263
  return similarities.map { |similarity| similarity.wrapper2.record }
257
264
  end
258
265
 
266
+ best_similarity = similarities.first
259
267
  winner = nil
260
268
 
261
- if best_similarity = similarities.first and best_similarity.best_score.dices_coefficient_similar > 0
269
+ if best_similarity and (best_similarity.best_score.dices_coefficient_similar > 0 or (needle.words & best_similarity.wrapper2.words).any?)
262
270
  winner = best_similarity.wrapper2.record
263
271
  if gather_last_result
264
272
  last_result.winner = winner
265
273
  last_result.score = best_similarity.best_score.dices_coefficient_similar
266
274
  last_result.timeline << <<-EOS
267
- A winner was determined because the similarity score #{best_similarity.best_score.dices_coefficient_similar} is greater than zero.
275
+ A winner was determined because the Dice's Coefficient similarity (#{best_similarity.best_score.dices_coefficient_similar}) is greater than zero or because it shared a word with the needle.
268
276
  EOS
269
277
  end
270
278
  elsif gather_last_result
271
279
  last_result.timeline << <<-EOS
272
- No winner assigned because similarity score was zero.
280
+ No winner assigned because the score of the best similarity (#{best_similarity.try(:wrapper2).try(:record).try(:inspect)}) was zero and it didn't match any words with the needle (#{needle.inspect}).
273
281
  EOS
274
282
  end
275
283
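
Review note on the last hunk: the new winner condition addresses the changelog entry about 'X foo' vs 'X bar'. A pair-based score sees no shared character pairs between those strings, so the shared one-letter word has to be checked separately. The sketch below illustrates this with a plain bigram version of Dice's coefficient. It is an approximation for illustration only: the gem's scoring (including the amatch engine) may tokenize and weight differently, and `words`, `bigrams`, `dice_coefficient` and `acceptable_winner?` are hypothetical names.

```ruby
require 'set'

# Lowercase words, split on whitespace (a simplification).
def words(str)
  str.downcase.split
end

# Adjacent character pairs, taken within each word; one-letter words yield none.
def bigrams(str)
  words(str).flat_map { |w| w.chars.each_cons(2).map(&:join) }.to_set
end

# Dice's coefficient over bigram sets: 2 * |A and B| / (|A| + |B|).
def dice_coefficient(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?
  2.0 * (pairs_a & pairs_b).size / total
end

# Accept the best candidate if it scores above zero or shares a word with the needle.
def acceptable_winner?(needle, candidate)
  dice_coefficient(needle, candidate) > 0 || (words(needle) & words(candidate)).any?
end

dice_coefficient('X foo', 'X bar')   # => 0.0
acceptable_winner?('X foo', 'X bar') # => true, because both contain the word "x"
acceptable_winner?('foo', 'bar')     # => false
```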