loose_tight_dictionary 1.0.0 → 1.0.1

data/README.rdoc CHANGED
@@ -17,11 +17,17 @@ Exclusively uses {Dice's Coefficient}[http://en.wikipedia.org/wiki/Dice's_coeffi
 
  Over 2 years in {Brighter Planet's environmental impact API}[http://impact.brighterplanet.com] and {reference data service}[http://data.brighterplanet.com].
 
- == Speed
+ == Haystacks and how to read them
+
+ The (admittedly imperfect) metaphor is "look for a needle in a haystack"
 
- If you add the amatch[http://flori.github.com/amatch/] gem to your Gemfile, it will use that, which is much faster (but {segfaults have been seen in the wild}[https://github.com/flori/amatch/issues/3]).
+ * needle - the search term
+ * haystack - the records you are searching (<b>your result will be an object from here</b>)
 
- Otherwise, a {pure ruby version}[http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings] is used.
+ So, what if your needle is a string like <tt>youruguay</tt> and your haystack is full of <tt>Country</tt> objects like <tt><Country name:"Uruguay"></tt>?
+
+ >> LooseTightDictionary.new(countries, :read => :name).find('youruguay')
+ => <Country name:"Uruguay">
 
  == Regular expressions
 
@@ -49,7 +55,13 @@ Scoring is case-insensitive. Everything is downcased before scoring. This is a c
 
  == Examples
 
- Try running the included example files (<tt>examples/first_name_matching.rb</tt>) and check out the tests.
+ Check out the tests.
+
+ == Speed
+
+ If you add the amatch[http://flori.github.com/amatch/] gem to your Gemfile, it will use that, which is much faster (but {segfaults have been seen in the wild}[https://github.com/flori/amatch/issues/3]). Thanks Flori!
+
+ Otherwise, a pure ruby version derived from the {answer to a StackOverflow question}[http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings] is used. Thanks {marzagao}[http://stackoverflow.com/users/10997/marzagao]!
 
  == Authors
 
data/benchmark/memory.rb CHANGED
@@ -37,7 +37,7 @@ TIGHTENERS = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/
  IDENTITIES = RemoteTable.new(:url => "file://#{File.expand_path("../../examples/bts_aircraft/identities.csv", __FILE__)}", :headers => :first_row).map { |row| row['regexp'] }
 
  FINAL_OPTIONS = {
-   :haystack_reader => HAYSTACK_READER,
+   :read => HAYSTACK_READER,
    :must_match_blocking => MUST_MATCH_BLOCKING,
    :tighteners => TIGHTENERS,
    :identities => IDENTITIES,
@@ -63,7 +63,7 @@ NEGATIVES = RemoteTable.new :url => "file://#{File.expand_path("../negatives.csv
  # Section 3
 
  FINAL_OPTIONS = {
-   :haystack_reader => HAYSTACK_READER,
+   :read => HAYSTACK_READER,
    :must_match_blocking => MUST_MATCH_BLOCKING,
    :tighteners => TIGHTENERS,
    :identities => IDENTITIES,
@@ -6,7 +6,7 @@ require 'active_support/version'
    active_support/core_ext/object
  }.each do |active_support_3_requirement|
    require active_support_3_requirement
- end if ::ActiveSupport::VERSION::MAJOR == 3
+ end if ::ActiveSupport::VERSION::MAJOR >= 3
  require 'to_regexp'
 
  # See the README for more information.
@@ -20,24 +20,25 @@ class LooseTightDictionary
    autoload :Score, 'loose_tight_dictionary/score'
    autoload :CachedResult, 'loose_tight_dictionary/cached_result'
 
-   class Freed < RuntimeError; end
-
    attr_reader :options
    attr_reader :haystack
    attr_reader :records
 
    # haystack - a bunch of records
    # options
-   # * tighteners: regexps that essentialize strings down
-   # * identities: regexps that rule out similarities, for example a 737 cannot be identical to a 747
+   # * tighteners: regexps (see readme)
+   # * identities: regexps
+   # * blockings: regexps
+   # * read: how to interpret each entry in the 'haystack', either a Proc or a symbol
    def initialize(records, options = {})
      @options = options.symbolize_keys
      @records = records
-     @haystack = records.map { |record| Wrapper.new :parent => self, :record => record, :reader => haystack_reader }
+     read = options[:read] || options[:haystack_reader]
+     @haystack = records.map { |record| Wrapper.new self, record, read }
    end
 
    def last_result
-     @last_result ||= Result.new
+     @last_result || raise(::RuntimeError, "[loose_tight_dictionary] You can't access the last result until you've run a find with :gather_last_result => true")
    end
 
    def log(str = '') #:nodoc:
@@ -50,11 +51,13 @@ class LooseTightDictionary
    end
 
    def find(needle, options = {})
-     raise Freed if freed?
-     free_last_result
+     raise ::RuntimeError, "[loose_tight_dictionary] Dictionary has already been freed, can't perform more finds" if freed?
 
      options = options.symbolize_keys
-     gather_last_result = options.fetch(:gather_last_result, true)
+     if gather_last_result = options.fetch(:gather_last_result, false)
+       free_last_result
+       @last_result = Result.new
+     end
      find_all = options.fetch(:find_all, false)
 
      if gather_last_result
@@ -63,7 +66,7 @@ class LooseTightDictionary
        last_result.blockings = blockings
      end
 
-     needle = Wrapper.new :parent => self, :record => needle
+     needle = Wrapper.new self, needle
 
      if gather_last_result
        last_result.needle = needle
@@ -126,7 +129,9 @@ class LooseTightDictionary
        last_result.similarities = similarities
      end
 
-     if best_similarity = similarities[-1] and straw = best_similarity.wrapper2 and record = straw.record
+     best_similarity = similarities[-1]
+     if best_similarity.best_score.to_f > 0
+       record = best_similarity.wrapper2.record
        if gather_last_result
          last_result.record = record
          last_result.score = best_similarity.best_score.to_f
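The two behavior changes in the `find` hunks above (result gathering becomes opt-in, and a best score of zero now yields `nil`) can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: `TinyDictionary`, its `Result` struct, and the character-overlap scorer are stand-ins invented for illustration and are not the gem's classes or its scoring algorithm.

```ruby
# Hypothetical stand-in for the dictionary; NOT the gem's implementation.
class TinyDictionary
  Result = Struct.new(:needle, :record, :score)

  def initialize(records)
    @records = records
  end

  # Raises unless a previous find was asked to gather its result.
  def last_result
    @last_result || raise(RuntimeError, "run find with :gather_last_result => true first")
  end

  def find(needle, options = {})
    gather = options.fetch(:gather_last_result, false) # opt-in, as in the diff
    @last_result = Result.new(needle) if gather

    best = @records.max_by { |record| score(needle, record) }
    best_score = best.nil? ? 0.0 : score(needle, best)
    return nil unless best_score > 0 # a zero score is treated as "no match"

    if gather
      @last_result.record = best
      @last_result.score = best_score
    end
    best
  end

  private

  # Assumed toy scorer: shared distinct characters over all distinct characters.
  def score(a, b)
    chars_a = a.downcase.chars
    chars_b = b.downcase.chars
    all = (chars_a | chars_b).length
    all.zero? ? 0.0 : (chars_a & chars_b).length.to_f / all
  end
end

dictionary = TinyDictionary.new(%w{ NISSAN HONDA })
dictionary.find('MISSAM')                               # last_result is not populated here
dictionary.find('MISSAM', :gather_last_result => true)  # now last_result is available
```

Making the bookkeeping opt-in means a plain `find` does not allocate or retain a result object, at the cost of callers such as `explain` having to request it explicitly.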
@@ -140,7 +145,7 @@ class LooseTightDictionary
    # d = LooseTightDictionary.new ['737', '747', '757' ]
    # d.explain 'boeing 737-100'
    def explain(needle)
-     record = find needle
+     record = find needle, :gather_last_result => true
      log "#" * 150
      log "# Match #{needle.inspect} => #{record.inspect}"
      log "#" * 150
@@ -190,16 +195,12 @@ class LooseTightDictionary
      log record.inspect
    end
 
-   def haystack_reader
-     options[:haystack_reader]
-   end
-
    def must_match_blocking
-     options[:must_match_blocking] || false
+     options.fetch :must_match_blocking, false
    end
 
    def first_blocking_decides
-     options[:first_blocking_decides] || false
+     options.fetch :first_blocking_decides, false
    end
 
    def tighteners
@@ -43,6 +43,11 @@ class LooseTightDictionary
      def dices_coefficient(str1, str2)
        str1 = str1.downcase
        str2 = str2.downcase
+       if str1 == str2
+         return 1.0
+       elsif str1.length == 1 and str2.length == 1
+         return 0.0
+       end
        pairs1 = (0..str1.length-2).map do |i|
          str1[i,2]
        end.reject do |pair|
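To see why early returns like the ones added above can matter, here is a simplified bigram-based Dice's coefficient. It is a sketch under stated assumptions, not the gem's code: `bigrams` and `dice_sketch` are hypothetical names, the pair filtering hinted at by the truncated `reject` block is omitted, and the zero-bigram guard is broader than the one-character check in the diff. In this sketch, two one-character strings produce no bigrams at all, so without a guard the final division would be `0.0 / 0`, which is `NaN` in Ruby.

```ruby
# Hypothetical helper: all overlapping two-character slices of a string.
def bigrams(str)
  (0..str.length - 2).map { |i| str[i, 2] }
end

# Simplified Dice's coefficient: 2 * shared bigrams / total bigrams.
def dice_sketch(str1, str2)
  str1 = str1.downcase
  str2 = str2.downcase
  return 1.0 if str1 == str2 # identical strings always match fully

  pairs1 = bigrams(str1)
  pairs2 = bigrams(str2)
  total = pairs1.length + pairs2.length
  return 0.0 if total.zero? # no bigrams on either side: avoid dividing by zero

  shared = 0
  pairs1.each do |pair|
    index = pairs2.index(pair)
    next unless index
    shared += 1
    pairs2.delete_at(index) # each bigram in pairs2 may be matched only once
  end
  (2.0 * shared) / total
end

dice_sketch('NISSAN', 'MISSAM') # shares the bigrams "is", "ss" and "sa"
dice_sketch('X', 'X')           # identical, so 1.0 via the first guard
dice_sketch('a', 'b')           # no bigrams, so 0.0 via the second guard
```

Under this simplified formula `'NISSAN'` and `'MISSAM'` share 3 of 10 bigrams, giving 2 * 3 / 10 = 0.6, which agrees with the score asserted in `test_003_last_result`.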
@@ -1,3 +1,3 @@
  class LooseTightDictionary
-   VERSION = '1.0.0'
+   VERSION = '1.0.1'
  end
@@ -3,12 +3,12 @@ class LooseTightDictionary
    class Wrapper #:nodoc: all
      attr_reader :parent
      attr_reader :record
-     attr_reader :reader
+     attr_reader :read
 
-     def initialize(attrs = {})
-       attrs.each do |k, v|
-         instance_variable_set "@#{k}", v
-       end
+     def initialize(parent, record, read = nil)
+       @parent = parent
+       @record = record
+       @read = read
      end
 
      def inspect
@@ -16,7 +16,20 @@ class LooseTightDictionary
      end
 
      def to_str
-       @to_str ||= reader ? reader.call(record) : record.to_s
+       @to_str ||= case read
+       when ::Proc
+         read.call record
+       when ::Symbol
+         if record.respond_to?(read)
+           record.send read
+         else
+           record[read]
+         end
+       when ::NilClass
+         record
+       else
+         record[read]
+       end.to_s
      end
 
      alias :to_s :to_str
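The dispatch in the new `to_str` can be tried in isolation with a small standalone sketch. `read_straw` and `Country` are hypothetical names introduced only for this illustration; the branching mirrors the hunk above but is not the gem's `Wrapper` class.

```ruby
# Hypothetical helper mirroring the case statement above: turn one haystack
# record into the string that would be scored.
def read_straw(record, read = nil)
  value = case read
          when Proc
            read.call(record) # caller-supplied extraction
          when Symbol
            # Prefer a method of that name, otherwise fall back to key lookup.
            record.respond_to?(read) ? record.send(read) : record[read]
          when nil
            record # no reader given: use the record itself
          else
            record[read] # e.g. an Integer index into an Array
          end
  value.to_s
end

Country = Struct.new(:name)

read_straw(Country.new('Uruguay'), :name)                         # method call on the Struct
read_straw({ :name => 'Uruguay' }, :name)                         # Hash has no #name, so key lookup
read_straw(['UY', 'Uruguay'], 1)                                  # Integer falls through to record[read]
read_straw(Country.new('Uruguay'), lambda { |c| c.name.upcase })  # Proc reader
read_straw('Uruguay')                                             # plain string record
```

One consequence of checking `respond_to?` first: a Symbol that happens to name an existing method wins over a key of the same name, so `read_straw({ :size => 'big' }, :size)` reads `Hash#size` rather than the `:size` key.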
data/test/test_cache.rb CHANGED
@@ -33,7 +33,7 @@ class Aircraft < ActiveRecord::Base
    end
 
    def self.loose_tight_dictionary
-     @loose_tight_dictionary ||= LooseTightDictionary.new all, :haystack_reader => lambda { |straw| straw.aircraft_description }
+     @loose_tight_dictionary ||= LooseTightDictionary.new all, :read => ::Proc.new { |straw| straw.aircraft_description }
    end
 
    def self.create_table
@@ -13,17 +13,23 @@ class TestLooseTightDictionary < Test::Unit::TestCase
    def test_001_find
      d = LooseTightDictionary.new %w{ NISSAN HONDA }
      assert_equal 'NISSAN', d.find('MISSAM')
+
+     d = LooseTightDictionary.new [ 'X' ]
+     assert_equal 'X', d.find('X')
+     assert_equal nil, d.find('A')
    end
 
-   def test_002_find_with_score
+   def test_002_dont_gather_last_result_by_default
      d = LooseTightDictionary.new %w{ NISSAN HONDA }
-     assert_equal 'NISSAN', d.find('MISSAM')
-     assert_equal 0.6, d.last_result.score
+     d.find('MISSAM')
+     assert_raises(::RuntimeError, /gather_last_result/) do
+       d.last_result
+     end
    end
 
    def test_003_last_result
      d = LooseTightDictionary.new %w{ NISSAN HONDA }
-     d.find 'MISSAM'
+     d.find 'MISSAM', :gather_last_result => true
      assert_equal 0.6, d.last_result.score
      assert_equal 'NISSAN', d.last_result.record
    end
@@ -48,18 +54,18 @@ class TestLooseTightDictionary < Test::Unit::TestCase
 
    def test_008_identify_false_positive
      d = LooseTightDictionary.new %w{ foo bar }, :identities => [ /ba(.)/ ]
-     assert_equal 'foo', d.find('baz')
+     assert_equal nil, d.find('baz')
    end
 
-   def test_009_must_match_blocking
-     d = LooseTightDictionary.new [ 'X' ]
-     assert_equal 'X', d.find('X')
-     assert_equal 'X', d.find('A')
-
+   # TODO this is not very helpful
+   def test_009_blocking
      d = LooseTightDictionary.new [ 'X' ], :blockings => [ /X/, /Y/ ]
      assert_equal 'X', d.find('X')
-     assert_equal 'X', d.find('A')
-
+     assert_equal nil, d.find('A')
+   end
+
+   # TODO this is not very helpful
+   def test_0095_must_match_blocking
      d = LooseTightDictionary.new [ 'X' ], :blockings => [ /X/, /Y/ ], :must_match_blocking => true
      assert_equal 'X', d.find('X')
      assert_equal nil, d.find('A')
@@ -68,7 +74,7 @@ class TestLooseTightDictionary < Test::Unit::TestCase
    def test_011_free
      d = LooseTightDictionary.new %w{ NISSAN HONDA }
      d.free
-     assert_raises(LooseTightDictionary::Freed) do
+     assert_raises(::RuntimeError, /free/) do
        d.find('foobar')
      end
    end
@@ -97,4 +103,54 @@ class TestLooseTightDictionary < Test::Unit::TestCase
      d = LooseTightDictionary.new [ 'Boeing 747', 'Boeing 747SR', 'Boeing ER6' ], :blockings => [ /(boeing \d{3})/i, /boeing/i ], :first_blocking_decides => true, :identities => [ /boeing (7|E)/i ]
      assert_equal [ 'Boeing ER6' ], d.find_all('Boeing ER6')
    end
+
+   MyStruct = Struct.new(:one, :two)
+   def test_014_symbol_read_sends_method
+     ab = MyStruct.new('a', 'b')
+     ba = MyStruct.new('b', 'a')
+     haystack = [ab, ba]
+     by_first = LooseTightDictionary.new haystack, :read => :one
+     by_last = LooseTightDictionary.new haystack, :read => :two
+     assert_equal ab, by_first.find('a')
+     assert_equal ab, by_last.find('b')
+     assert_equal ba, by_first.find('b')
+     assert_equal ba, by_last.find('a')
+   end
+
+   def test_015_symbol_read_reads_array
+     ab = ['a', 'b']
+     ba = ['b', 'a']
+     haystack = [ab, ba]
+     by_first = LooseTightDictionary.new haystack, :read => 0
+     by_last = LooseTightDictionary.new haystack, :read => 1
+     assert_equal ab, by_first.find('a')
+     assert_equal ab, by_last.find('b')
+     assert_equal ba, by_first.find('b')
+     assert_equal ba, by_last.find('a')
+   end
+
+   def test_016_symbol_read_reads_hash
+     ab = { :one => 'a', :two => 'b' }
+     ba = { :one => 'b', :two => 'a' }
+     haystack = [ab, ba]
+     by_first = LooseTightDictionary.new haystack, :read => :one
+     by_last = LooseTightDictionary.new haystack, :read => :two
+     assert_equal ab, by_first.find('a')
+     assert_equal ab, by_last.find('b')
+     assert_equal ba, by_first.find('b')
+     assert_equal ba, by_last.find('a')
+   end
+
+   def test_017_understands_haystack_reader_option
+     ab = ['a', 'b']
+     ba = ['b', 'a']
+     haystack = [ab, ba]
+     by_first = LooseTightDictionary.new haystack, :haystack_reader => 0
+     assert_equal ab, by_first.find('a')
+     assert_equal ba, by_first.find('b')
+   end
+
+   def test_018_no_result_if_best_score_is_zero
+     assert_equal nil, LooseTightDictionary.new(['a']).find('b')
+   end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: loose_tight_dictionary
  version: !ruby/object:Gem::Version
-   version: 1.0.0
+   version: 1.0.1
    prerelease:
  platform: ruby
  authors:
@@ -13,7 +13,7 @@ date: 2011-12-03 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: shoulda
-   requirement: &2160898300 !ruby/object:Gem::Requirement
+   requirement: &2164915940 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2160898300
+   version_requirements: *2164915940
  - !ruby/object:Gem::Dependency
    name: remote_table
-   requirement: &2160965820 !ruby/object:Gem::Requirement
+   requirement: &2164983440 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2160965820
+   version_requirements: *2164983440
  - !ruby/object:Gem::Dependency
    name: activerecord
-   requirement: &2161059600 !ruby/object:Gem::Requirement
+   requirement: &2165077220 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
          version: '3'
    type: :development
    prerelease: false
-   version_requirements: *2161059600
+   version_requirements: *2165077220
  - !ruby/object:Gem::Dependency
    name: mysql
-   requirement: &2161138940 !ruby/object:Gem::Requirement
+   requirement: &2165156580 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2161138940
+   version_requirements: *2165156580
  - !ruby/object:Gem::Dependency
    name: cohort_scope
-   requirement: &2161208000 !ruby/object:Gem::Requirement
+   requirement: &2165225600 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -65,10 +65,10 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2161208000
+   version_requirements: *2165225600
  - !ruby/object:Gem::Dependency
    name: weighted_average
-   requirement: &2161382600 !ruby/object:Gem::Requirement
+   requirement: &2165400220 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -76,10 +76,10 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2161382600
+   version_requirements: *2165400220
  - !ruby/object:Gem::Dependency
    name: rake
-   requirement: &2161619140 !ruby/object:Gem::Requirement
+   requirement: &2165637020 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -87,10 +87,10 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2161619140
+   version_requirements: *2165637020
  - !ruby/object:Gem::Dependency
    name: activesupport
-   requirement: &2161769980 !ruby/object:Gem::Requirement
+   requirement: &2165774580 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -98,10 +98,10 @@ dependencies:
          version: '3'
    type: :runtime
    prerelease: false
-   version_requirements: *2161769980
+   version_requirements: *2165774580
  - !ruby/object:Gem::Dependency
    name: to_regexp
-   requirement: &2161790420 !ruby/object:Gem::Requirement
+   requirement: &2165808840 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -109,7 +109,7 @@ dependencies:
          version: 0.0.3
    type: :runtime
    prerelease: false
-   version_requirements: *2161790420
+   version_requirements: *2165808840
  description: Create dictionaries that link rows between two tables using loose matching
    (string similarity) by default and tight matching (regexp) by request.
  email: