loose_tight_dictionary 0.2.3 → 1.0.0

data/LICENSE CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2009 Seamus Abshere
1
+ Copyright 2011 Brighter Planet, Inc.
2
2
 
3
3
  Permission is hereby granted, free of charge, to any person obtaining
4
4
  a copy of this software and associated documentation files (the
data/README.rdoc CHANGED
@@ -4,102 +4,59 @@ Match things based on string similarity (using the Pair Distance algorithm) and
4
4
 
5
5
  == Quickstart
6
6
 
7
- >> d = LooseTightDictionary.new %w{seamus andy ben}
8
- => [...]
9
- >> puts d.find 'Shamus Heaney'
10
- => 'seamus'
11
-
12
- Try running the included example file:
13
-
14
- $ ruby examples/first_name_matching.rb
15
- ######################################################################################################################################################
16
- # Match "Mr. Seamus" => "seamus"
17
- ######################################################################################################################################################
18
-
19
- Needle
20
- (needle_reader proc not defined, so downcasing everything)
21
- ------------------------------------------------------------------------------------------------------------------------------------------------------
22
- "mr. seamus"
23
-
24
- Haystack
25
- (haystack_reader proc not defined, so downcasing everything)
26
- ------------------------------------------------------------------------------------------------------------------------------------------------------
27
- "seamus"
28
- "andy"
29
- "ben"
30
-
31
- Tighteners
32
- ------------------------------------------------------------------------------------------------------------------------------------------------------
33
- (none)
34
-
35
- Comparisons
36
- Score t_haystack [=> tightened/prefixed] t_needle [=> tightened/prefixed]
37
- ------------------------------------------------------------------------------------------------------------------------------------------------------
38
- 0.8333333333333334 "seamus" "mr. seamus"
39
- 0.0 "andy" "mr. seamus"
40
- 0.0 "ben" "mr. seamus"
41
-
42
- Match
43
- ------------------------------------------------------------------------------------------------------------------------------------------------------
44
- "seamus"
45
-
46
- # [... there's more output ...]
47
-
48
- == The Boeing example
49
-
50
- From the tests:
51
-
52
- ######################################################################################################################################################
53
- # Match "BOEING 737100" => "BOEING BOEING 737-100/200"
54
- ######################################################################################################################################################
55
-
56
- Needle
57
- (needle_reader proc not defined, so downcasing everything)
58
- ------------------------------------------------------------------------------------------------------------------------------------------------------
59
- "boeing 737100"
60
-
61
- Haystack
62
- (haystack_reader proc not defined, so downcasing everything)
63
- ------------------------------------------------------------------------------------------------------------------------------------------------------
64
- "boeing boeing 737-100/200"
65
- "boeing boeing 737-900"
66
-
67
- Tighteners
68
- ------------------------------------------------------------------------------------------------------------------------------------------------------
69
- /(7\d)(7|0)-?(\d{1,3})/i
70
-
71
- Comparisons
72
- Score t_haystack [=> tightened/prefixed] t_needle [=> tightened/prefixed]
73
- ------------------------------------------------------------------------------------------------------------------------------------------------------
74
- 1.0 "boeing boeing 737-100/200" => "737100" "boeing 737100" => "737100"
75
- 0.6666666666666666 "boeing boeing 737-100/200" => "737100" "boeing 737100"
76
- 0.6153846153846154 "boeing boeing 737-900" "boeing 737100"
77
- 0.6 "boeing boeing 737-900" => "737900" "boeing 737100" => "737100"
78
- 0.6 "boeing boeing 737-100/200" "boeing 737100"
79
- 0.4 "boeing boeing 737-900" => "737900" "boeing 737100"
80
- 0.32 "boeing boeing 737-100/200" "boeing 737100" => "737100"
81
- 0.2857142857142857 "boeing boeing 737-900" "boeing 737100" => "737100"
82
-
83
- Match
84
- ------------------------------------------------------------------------------------------------------------------------------------------------------
85
- "BOEING BOEING 737-100/200"
86
-
87
- == Improving dictionaries
88
-
89
- Similarity matching will only get you so far.
90
-
91
- TODO: regex usage
92
-
93
- == Note on Patches/Pull Requests
94
-
95
- * Fork the project.
96
- * Make your feature addition or bug fix.
97
- * Add tests for it. This is important so I don't break it in a
98
- future version unintentionally.
99
- * Commit, do not mess with rakefile, version, or history.
100
- (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
101
- * Send me a pull request. Bonus points for topic branches.
7
+ >> require 'loose_tight_dictionary'
8
+ => true
9
+ >> LooseTightDictionary.new(%w{seamus andy ben}).find('Shamus')
10
+ => "seamus"
11
+
12
+ == String similarity matching
13
+
14
+ Uses the {Dice's Coefficient}[http://en.wikipedia.org/wiki/Dice's_coefficient] algorithm (aka Pair Distance) exclusively.
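A minimal sketch of Dice's coefficient over character pairs, for illustration only. The helper names `letter_pairs` and `dice_sketch` are hypothetical and not part of the gem's API. The sketch assumes, like the pure Ruby fallback added in this release, that both strings are downcased and that pairs containing a space are ignored. The score is twice the number of shared pairs divided by the total number of pairs in both strings.

```ruby
# Hypothetical helper (not the gem's API): adjacent character pairs of a
# downcased string, skipping any pair that contains a space.
def letter_pairs(str)
  str.downcase.each_char.each_cons(2).map(&:join).reject { |pair| pair.include?(' ') }
end

# Hypothetical helper: 2 * shared pairs / total pairs.
def dice_sketch(str1, str2)
  pairs1 = letter_pairs(str1)
  pairs2 = letter_pairs(str2)
  total = pairs1.size + pairs2.size
  return 0.0 if total.zero? # assumption: no pairs at all means no similarity

  shared = 0
  pairs1.each do |pair|
    index = pairs2.index(pair)
    next unless index
    shared += 1
    pairs2.delete_at(index) # each pair in the second string is counted once
  end
  (2.0 * shared) / total
end

puts dice_sketch('seamus', 'Shamus') # prints 0.6
```

Here "seamus" and "shamus" each yield five pairs and share three of them (am, mu, us), so the score is 2 * 3 / 10 = 0.6.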
15
+
16
+ == Production use
17
+
18
+ Used in production for over 2 years in {Brighter Planet's environmental impact API}[http://impact.brighterplanet.com] and {reference data service}[http://data.brighterplanet.com].
19
+
20
+ == Speed
21
+
22
+ If you add the amatch[http://flori.github.com/amatch/] gem to your Gemfile, loose_tight_dictionary will use it for scoring, which is much faster (but {segfaults have been seen in the wild}[https://github.com/flori/amatch/issues/3]).
23
+
24
+ Otherwise, a {pure ruby version}[http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings] is used.
25
+
26
+ == Regular expressions
27
+
28
+ You can improve the default matchings with regular expressions.
29
+
30
+ * Emphasize important words using <b>blockings</b> and <b>tighteners</b>
31
+ * Filter out stop words with <b>tighteners</b>
32
+ * Prevent impossible matches with <b>blockings</b> and <b>identities</b>
33
+
34
+ === Blockings
35
+
36
+ Setting a blocking of <tt>/Airbus/</tt> ensures that strings containing "Airbus" will only be scored against other strings containing "Airbus". A better blocking in this case would probably be the case-insensitive <tt>/airbus/i</tt>.
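The blocking rule can be sketched in plain Ruby. This does not use the gem's API: `comparable?` is a hypothetical helper, and the sketch assumes that two strings may be scored against each other only when the blocking matches both of them or neither of them.

```ruby
# Hypothetical helper (not the gem's API): a pair is comparable when the
# blocking matches both strings or matches neither.
def comparable?(str1, str2, blocking)
  blocking.match(str1).nil? == blocking.match(str2).nil?
end

puts comparable?('Airbus A320', 'Airbus A330', /Airbus/)  # true: both match
puts comparable?('Airbus A320', 'Boeing 747', /Airbus/)   # false: only one matches
puts comparable?('AIRBUS A320', 'Airbus A320', /Airbus/)  # false: case differs
puts comparable?('AIRBUS A320', 'Airbus A320', /airbus/i) # true: case-insensitive
```

The last two calls show why the case-insensitive form is the safer choice: with <tt>/Airbus/</tt>, "AIRBUS A320" is treated as a non-Airbus string.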
37
+
38
+ === Tighteners
39
+
40
+ Adding a tightener like <tt>/(boeing).*(7\d\d)/i</tt> will cause "BOEING COMPANY 747" and "boeing747" to be scored as if they were "BOEING 747" and "boeing 747", respectively. See also "Case sensitivity" below.
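A sketch of what the tightener example implies, in plain Ruby. The helper `tighten` is hypothetical and not the gem's API. It assumes that a matching tightener keeps only the captured groups, joined by a single space, and that a non-matching string is left unchanged.

```ruby
# Hypothetical helper (not the gem's API): reduce a string to the
# tightener's captures, or return it untouched when there is no match.
def tighten(str, tightener)
  match = tightener.match(str)
  match ? match.captures.join(' ') : str
end

tightener = /(boeing).*(7\d\d)/i
puts tighten('BOEING COMPANY 747', tightener) # prints BOEING 747
puts tighten('boeing747', tightener)          # prints boeing 747
```

Because scoring downcases everything, the two tightened strings would then compare as identical.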
41
+
42
+ === Identities
43
+
44
+ Adding an identity like <tt>/(F)\-?(\d50)/</tt> ensures that "Ford F-150" and "Ford F-250" never match.
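The identity rule can be sketched the same way. The helper `same_identity?` is hypothetical and not the gem's API. It assumes that when the identity matches both strings, their captures must be equal for a match to be allowed, and that the identity says nothing when it fails to match either string.

```ruby
# Hypothetical helper (not the gem's API): true or false when the identity
# applies to both strings, nil when it does not apply.
def same_identity?(str1, str2, identity)
  match1 = identity.match(str1)
  match2 = identity.match(str2)
  return nil unless match1 && match2
  match1.captures == match2.captures
end

identity = /(F)\-?(\d50)/
puts same_identity?('Ford F-150', 'Ford F-250', identity) # false: 150 vs 250
puts same_identity?('Ford F-150', 'Ford F150', identity)  # true: same captures
```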
45
+
46
+ == Case sensitivity
47
+
48
+ Scoring is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
49
+
50
+ == Examples
51
+
52
+ Try running the included example files (<tt>examples/first_name_matching.rb</tt>) and check out the tests.
53
+
54
+ == Authors
55
+
56
+ * Seamus Abshere <seamus@abshere.net>
57
+ * Ian Hough <ijhough@gmail.com>
58
+ * Andy Rossmeissl <andy@rossmeissl.net>
102
59
 
103
60
  == Copyright
104
61
 
105
- Copyright (c) 2011 Seamus Abshere. See LICENSE for details.
62
+ Copyright 2011 Brighter Planet, Inc.
@@ -0,0 +1,37 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # Thanks William James!
4
+ # http://www.ruby-forum.com/topic/95519#200484
5
+ def cart_prod(*args)
6
+ args.inject([[]]){|old,lst|
7
+ new = []
8
+ lst.each{|e| new += old.map{|c| c.dup << e }}
9
+ new
10
+ }
11
+ end
12
+
13
+ require 'benchmark'
14
+
15
+ a = [1,2,3]
16
+ b = [4,5]
17
+ Benchmark.bmbm do |x|
18
+ x.report("native") do
19
+ 500_000.times { a.product(b) }
20
+ end
21
+ x.report("william-james") do
22
+ 500_000.times { cart_prod(a, b) }
23
+ end
24
+ end
25
+
26
+ # results:
27
+ # $ ruby foo.rb
28
+ # Rehearsal -------------------------------------------------
29
+ # native 0.720000 0.000000 0.720000 ( 0.729319)
30
+ # william-james 3.620000 0.010000 3.630000 ( 3.629198)
31
+ # ---------------------------------------- total: 4.350000sec
32
+ #
33
+ # user system total real
34
+ # native 0.710000 0.000000 0.710000 ( 0.708620)
35
+ # william-james 3.800000 0.000000 3.800000 ( 3.792538)
36
+
37
+ # thanks for all the fish!
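The benchmark above compares speed only. The following sketch checks that the two approaches also produce the same tuples. `cart_prod_sketch` is a hypothetical restatement of `cart_prod` written for this illustration; it is not code from the gem.

```ruby
# Hypothetical restatement of cart_prod: extend every partial tuple
# with each element of the next list.
def cart_prod_sketch(*lists)
  lists.inject([[]]) do |partials, list|
    list.flat_map { |element| partials.map { |partial| partial + [element] } }
  end
end

a = [1, 2, 3]
b = [4, 5]
native = a.product(b)
manual = cart_prod_sketch(a, b)

puts native.inspect             # [[1, 4], [1, 5], [2, 4], [2, 5], [3, 4], [3, 5]]
puts manual.inspect             # [[1, 4], [2, 4], [3, 4], [1, 5], [2, 5], [3, 5]]
puts native.sort == manual.sort # true
```

The tuples are the same but arrive in a different order. Since `best_variants` sorts the result of the product, that difference should not affect which variant pair wins.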
@@ -1,4 +1,8 @@
1
- require 'amatch'
1
+ begin
2
+ require 'amatch'
3
+ rescue ::LoadError
4
+ # using native ruby similarity scoring
5
+ end
2
6
 
3
7
  class LooseTightDictionary
4
8
  class Score
@@ -10,7 +14,7 @@ class LooseTightDictionary
10
14
  end
11
15
 
12
16
  def to_f
13
- @to_f ||= str1.pair_distance_similar str2
17
+ @to_f ||= dices_coefficient(str1, str2)
14
18
  end
15
19
 
16
20
  def inspect
@@ -24,5 +28,44 @@ class LooseTightDictionary
24
28
  def ==(other)
25
29
  to_f == other.to_f
26
30
  end
31
+
32
+ private
33
+
34
+ # http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings
35
+ if defined?(::Amatch)
36
+ def dices_coefficient(str1, str2)
37
+ str1 = str1.downcase
38
+ str2 = str2.downcase
39
+ str1.pair_distance_similar str2
40
+ end
41
+ else
42
+ SPACE = ' '
43
+ def dices_coefficient(str1, str2)
44
+ str1 = str1.downcase
45
+ str2 = str2.downcase
46
+ pairs1 = (0..str1.length-2).map do |i|
47
+ str1[i,2]
48
+ end.reject do |pair|
49
+ pair.include? SPACE
50
+ end
51
+ pairs2 = (0..str2.length-2).map do |i|
52
+ str2[i,2]
53
+ end.reject do |pair|
54
+ pair.include? SPACE
55
+ end
56
+ union = pairs1.size + pairs2.size
57
+ intersection = 0
58
+ pairs1.each do |p1|
59
+ 0.upto(pairs2.size-1) do |i|
60
+ if p1 == pairs2[i]
61
+ intersection += 1
62
+ pairs2.slice!(i)
63
+ break
64
+ end
65
+ end
66
+ end
67
+ (2.0 * intersection) / union
68
+ end
69
+ end
27
70
  end
28
71
  end
@@ -34,7 +34,7 @@ class LooseTightDictionary
34
34
  end
35
35
 
36
36
  def best_variants
37
- @best_variants ||= cart_prod(wrapper1.variants, wrapper2.variants).sort do |tuple1, tuple2|
37
+ @best_variants ||= wrapper1.variants.product(wrapper2.variants).sort do |tuple1, tuple2|
38
38
  wrapper1_variant1, wrapper2_variant1 = tuple1
39
39
  wrapper1_variant2, wrapper2_variant2 = tuple2
40
40
 
@@ -48,15 +48,5 @@ class LooseTightDictionary
48
48
  def inspect
49
49
  %{#<Similarity "#{wrapper2.to_str}"=>"#{best_wrapper2_variant}" versus "#{wrapper1.to_str}"=>"#{best_wrapper1_variant}" weight=#{"%0.5f" % weight} best_score=#{"%0.5f" % best_score.to_f}>}
50
50
  end
51
-
52
- # Thanks William James!
53
- # http://www.ruby-forum.com/topic/95519#200484
54
- def cart_prod(*args)
55
- args.inject([[]]){|old,lst|
56
- new = []
57
- lst.each{|e| new += old.map{|c| c.dup << e }}
58
- new
59
- }
60
- end
61
51
  end
62
52
  end
@@ -1,3 +1,3 @@
1
1
  class LooseTightDictionary
2
- VERSION = '0.2.3'
2
+ VERSION = '1.0.0'
3
3
  end
@@ -25,7 +25,8 @@ Gem::Specification.new do |s|
25
25
  s.add_development_dependency 'mysql'
26
26
  s.add_development_dependency 'cohort_scope'
27
27
  s.add_development_dependency 'weighted_average'
28
- s.add_dependency 'activesupport', '>=3'
29
- s.add_dependency 'amatch'
30
- s.add_dependency 'to_regexp', '>=0.0.3'
28
+ s.add_development_dependency 'rake'
29
+ # s.add_development_dependency 'amatch'
30
+ s.add_runtime_dependency 'activesupport', '>=3'
31
+ s.add_runtime_dependency 'to_regexp', '>=0.0.3'
31
32
  end
data/test/test_cache.rb CHANGED
@@ -112,7 +112,7 @@ class TestCache < Test::Unit::TestCase
112
112
 
113
113
  def test_004_weighted_average
114
114
  aircraft = Aircraft.find('B742')
115
- assert_equal 5.4545, aircraft.flight_segments.weighted_average(:seats, :weighted_by => :passengers)
115
+ assert_equal 5.45455, aircraft.flight_segments.weighted_average(:seats, :weighted_by => :passengers)
116
116
  end
117
117
 
118
118
  def test_005_right_way_to_do_cohorts
metadata CHANGED
@@ -1,133 +1,130 @@
1
- --- !ruby/object:Gem::Specification
1
+ --- !ruby/object:Gem::Specification
2
2
  name: loose_tight_dictionary
3
- version: !ruby/object:Gem::Version
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
4
5
  prerelease:
5
- version: 0.2.3
6
6
  platform: ruby
7
- authors:
7
+ authors:
8
8
  - Seamus Abshere
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
-
13
- date: 2011-05-17 00:00:00 -05:00
14
- default_executable:
15
- dependencies:
16
- - !ruby/object:Gem::Dependency
12
+ date: 2011-12-03 00:00:00.000000000Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
17
15
  name: shoulda
18
- prerelease: false
19
- requirement: &id001 !ruby/object:Gem::Requirement
16
+ requirement: &2160898300 !ruby/object:Gem::Requirement
20
17
  none: false
21
- requirements:
22
- - - ">="
23
- - !ruby/object:Gem::Version
24
- version: "0"
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
25
22
  type: :development
26
- version_requirements: *id001
27
- - !ruby/object:Gem::Dependency
28
- name: remote_table
29
23
  prerelease: false
30
- requirement: &id002 !ruby/object:Gem::Requirement
24
+ version_requirements: *2160898300
25
+ - !ruby/object:Gem::Dependency
26
+ name: remote_table
27
+ requirement: &2160965820 !ruby/object:Gem::Requirement
31
28
  none: false
32
- requirements:
33
- - - ">="
34
- - !ruby/object:Gem::Version
35
- version: "0"
29
+ requirements:
30
+ - - ! '>='
31
+ - !ruby/object:Gem::Version
32
+ version: '0'
36
33
  type: :development
37
- version_requirements: *id002
38
- - !ruby/object:Gem::Dependency
39
- name: activerecord
40
34
  prerelease: false
41
- requirement: &id003 !ruby/object:Gem::Requirement
35
+ version_requirements: *2160965820
36
+ - !ruby/object:Gem::Dependency
37
+ name: activerecord
38
+ requirement: &2161059600 !ruby/object:Gem::Requirement
42
39
  none: false
43
- requirements:
44
- - - ">="
45
- - !ruby/object:Gem::Version
46
- version: "3"
40
+ requirements:
41
+ - - ! '>='
42
+ - !ruby/object:Gem::Version
43
+ version: '3'
47
44
  type: :development
48
- version_requirements: *id003
49
- - !ruby/object:Gem::Dependency
50
- name: mysql
51
45
  prerelease: false
52
- requirement: &id004 !ruby/object:Gem::Requirement
46
+ version_requirements: *2161059600
47
+ - !ruby/object:Gem::Dependency
48
+ name: mysql
49
+ requirement: &2161138940 !ruby/object:Gem::Requirement
53
50
  none: false
54
- requirements:
55
- - - ">="
56
- - !ruby/object:Gem::Version
57
- version: "0"
51
+ requirements:
52
+ - - ! '>='
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
58
55
  type: :development
59
- version_requirements: *id004
60
- - !ruby/object:Gem::Dependency
61
- name: cohort_scope
62
56
  prerelease: false
63
- requirement: &id005 !ruby/object:Gem::Requirement
57
+ version_requirements: *2161138940
58
+ - !ruby/object:Gem::Dependency
59
+ name: cohort_scope
60
+ requirement: &2161208000 !ruby/object:Gem::Requirement
64
61
  none: false
65
- requirements:
66
- - - ">="
67
- - !ruby/object:Gem::Version
68
- version: "0"
62
+ requirements:
63
+ - - ! '>='
64
+ - !ruby/object:Gem::Version
65
+ version: '0'
69
66
  type: :development
70
- version_requirements: *id005
71
- - !ruby/object:Gem::Dependency
72
- name: weighted_average
73
67
  prerelease: false
74
- requirement: &id006 !ruby/object:Gem::Requirement
68
+ version_requirements: *2161208000
69
+ - !ruby/object:Gem::Dependency
70
+ name: weighted_average
71
+ requirement: &2161382600 !ruby/object:Gem::Requirement
75
72
  none: false
76
- requirements:
77
- - - ">="
78
- - !ruby/object:Gem::Version
79
- version: "0"
73
+ requirements:
74
+ - - ! '>='
75
+ - !ruby/object:Gem::Version
76
+ version: '0'
80
77
  type: :development
81
- version_requirements: *id006
82
- - !ruby/object:Gem::Dependency
83
- name: activesupport
84
78
  prerelease: false
85
- requirement: &id007 !ruby/object:Gem::Requirement
79
+ version_requirements: *2161382600
80
+ - !ruby/object:Gem::Dependency
81
+ name: rake
82
+ requirement: &2161619140 !ruby/object:Gem::Requirement
86
83
  none: false
87
- requirements:
88
- - - ">="
89
- - !ruby/object:Gem::Version
90
- version: "3"
91
- type: :runtime
92
- version_requirements: *id007
93
- - !ruby/object:Gem::Dependency
94
- name: amatch
84
+ requirements:
85
+ - - ! '>='
86
+ - !ruby/object:Gem::Version
87
+ version: '0'
88
+ type: :development
95
89
  prerelease: false
96
- requirement: &id008 !ruby/object:Gem::Requirement
90
+ version_requirements: *2161619140
91
+ - !ruby/object:Gem::Dependency
92
+ name: activesupport
93
+ requirement: &2161769980 !ruby/object:Gem::Requirement
97
94
  none: false
98
- requirements:
99
- - - ">="
100
- - !ruby/object:Gem::Version
101
- version: "0"
95
+ requirements:
96
+ - - ! '>='
97
+ - !ruby/object:Gem::Version
98
+ version: '3'
102
99
  type: :runtime
103
- version_requirements: *id008
104
- - !ruby/object:Gem::Dependency
105
- name: to_regexp
106
100
  prerelease: false
107
- requirement: &id009 !ruby/object:Gem::Requirement
101
+ version_requirements: *2161769980
102
+ - !ruby/object:Gem::Dependency
103
+ name: to_regexp
104
+ requirement: &2161790420 !ruby/object:Gem::Requirement
108
105
  none: false
109
- requirements:
110
- - - ">="
111
- - !ruby/object:Gem::Version
106
+ requirements:
107
+ - - ! '>='
108
+ - !ruby/object:Gem::Version
112
109
  version: 0.0.3
113
110
  type: :runtime
114
- version_requirements: *id009
115
- description: Create dictionaries that link rows between two tables using loose matching (string similarity) by default and tight matching (regexp) by request.
116
- email:
111
+ prerelease: false
112
+ version_requirements: *2161790420
113
+ description: Create dictionaries that link rows between two tables using loose matching
114
+ (string similarity) by default and tight matching (regexp) by request.
115
+ email:
117
116
  - seamus@abshere.net
118
117
  executables: []
119
-
120
118
  extensions: []
121
-
122
119
  extra_rdoc_files: []
123
-
124
- files:
120
+ files:
125
121
  - .document
126
122
  - .gitignore
127
123
  - Gemfile
128
124
  - LICENSE
129
125
  - README.rdoc
130
126
  - Rakefile
127
+ - THANKS-WILLIAM-JAMES.rb
131
128
  - benchmark/before-with-free.txt
132
129
  - benchmark/before-without-last-result.txt
133
130
  - benchmark/before.txt
@@ -164,35 +161,31 @@ files:
164
161
  - test/test_loose_tight_dictionary.rb
165
162
  - test/test_loose_tight_dictionary_convoluted.rb.disabled
166
163
  - test/test_tightening.rb
167
- has_rdoc: true
168
164
  homepage: https://github.com/seamusabshere/loose_tight_dictionary
169
165
  licenses: []
170
-
171
166
  post_install_message:
172
167
  rdoc_options: []
173
-
174
- require_paths:
168
+ require_paths:
175
169
  - lib
176
- required_ruby_version: !ruby/object:Gem::Requirement
170
+ required_ruby_version: !ruby/object:Gem::Requirement
177
171
  none: false
178
- requirements:
179
- - - ">="
180
- - !ruby/object:Gem::Version
181
- version: "0"
182
- required_rubygems_version: !ruby/object:Gem::Requirement
172
+ requirements:
173
+ - - ! '>='
174
+ - !ruby/object:Gem::Version
175
+ version: '0'
176
+ required_rubygems_version: !ruby/object:Gem::Requirement
183
177
  none: false
184
- requirements:
185
- - - ">="
186
- - !ruby/object:Gem::Version
187
- version: "0"
178
+ requirements:
179
+ - - ! '>='
180
+ - !ruby/object:Gem::Version
181
+ version: '0'
188
182
  requirements: []
189
-
190
183
  rubyforge_project: loose_tight_dictionary
191
- rubygems_version: 1.6.2
184
+ rubygems_version: 1.8.10
192
185
  signing_key:
193
186
  specification_version: 3
194
187
  summary: Allows iterative development of dictionaries for big data sets.
195
- test_files:
188
+ test_files:
196
189
  - test/helper.rb
197
190
  - test/test_blocking.rb
198
191
  - test/test_cache.rb