simhilarity 1.0.2 → 1.0.3

data/.gitignore CHANGED
@@ -1,5 +1,6 @@
  *.gem
  .bundle
+ .rake-complete-cache
  Gemfile.lock
  pkg/*
  rdoc
data/.travis.yml ADDED
@@ -0,0 +1,6 @@
+ language: ruby
+ rvm:
+   - 1.9.3
+   - 2.0.0
+   - rbx-19mode
+   - jruby-19mode
data/README.md CHANGED
@@ -66,15 +66,9 @@ score,needle,haystack

  It will print out the best matches between needle and haystack in CSV format. Use `simhilarity --verbose` to look at pretty progress bars while it's running. Use --candidates to customize the candidates selection method, which will dramatically affect performance for large data sets.

- ### Simhilarity::Bulk
+ ### Simhilarity::Matcher

- To use simhilarity from code, create a `Bulk` and call `matches(needles, haystack)`. It'll return an array of tuples, `[needle, haystack, score]`. By default, simhilarity assumes that needles and haystack are arrays of strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
-
- ### Simhilarity::Single
-
- Sometimes it's useful to just calculate the score between two strings. For example, if you just want a title similarity measurement as part of some larger analysis between two books. Create a `Single` and call `score(a, b)` to measure similarity between those two items. By default, simhilarity assumes that needle and haystack are strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
-
- Important note: For best results with `Single`, set the corpus so that simhilarity can calculate ngram frequencies. This can dramatically improve accuracy. `Bulk` will do this automatically because it has access to the corpus, but `Single` doesn't. Call `corpus=` manually when using `Single`.
+ To use simhilarity from code, create a `Matcher` and call `matches(needles, haystack)`. It'll return an array of tuples, `[needle, haystack, score]`. By default, simhilarity assumes that needles and haystack are arrays of strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).

  <a name="benchmarks"/>

@@ -135,10 +129,10 @@ There are a few ways to configure simhilarity:
  Simhash works great, but there's no reason not to use `:ngrams` or even `:all` for small data sets. In fact, that's what simhilarity does by default - if you use a small dataset (needle * haystack < 200,000) it defaults to `:all`, otherwise it uses `:simhash`. Some examples:

  ```ruby
- Simhilarity::Bulk.new # defaults to :all or :simhash based on size<
- Simhilarity::Bulk.new(candidates: :simhash)
- Simhilarity::Bulk.new(candidates: :simhash, simhash_max_hamming: 8)
- Simhilarity::Bulk.new(candidates: :ngrams, ngram_overlaps: 4)
+ Simhilarity::Matcher.new # defaults to :all or :simhash based on size
+ Simhilarity::Matcher.new(candidates: :simhash)
+ Simhilarity::Matcher.new(candidates: :simhash, simhash_max_hamming: 8)
+ Simhilarity::Matcher.new(candidates: :ngrams, ngram_overlaps: 4)
  ```

  or:
@@ -162,3 +156,10 @@ There are a few ways to configure simhilarity:
  * **ngrammer** - proc for converting normalized strings into ngrams. The default ngrammer pulls out bigrams and runs of digits, which is perfect for matching names and addresses.

  * **verbose** - if true, show progress while simhilarity is working. Great for the impatient. Use --verbose from the command line.
+
+ ## Changelog
+
+ #### Master (unreleased)
+
+ * Works with Ruby 2.0 - thanks @abscondment!
+ * Travis
data/bin/simhilarity CHANGED
@@ -15,8 +15,11 @@ class Main

      # match
      tm = Time.now
-     matcher = Simhilarity::Bulk.new(options)
-     matches = matcher.matches(needle, haystack)
+     matcher = Simhilarity::Matcher.new
+     matcher.verbose = options[:verbose]
+     matcher.candidates = options[:candidates]
+     matcher.haystack = haystack
+     matches = matcher.matches(needle)

      if options[:verbose]
        tm = Time.now - tm
data/lib/simhilarity.rb CHANGED
@@ -1,8 +1,8 @@
  require "simhilarity/bits"
  require "simhilarity/candidate"
+ require "simhilarity/candidates"
  require "simhilarity/element"
- require "simhilarity/matcher"
+ require "simhilarity/score"
  require "simhilarity/version"

- require "simhilarity/bulk"
- require "simhilarity/single"
+ require "simhilarity/matcher"
data/lib/simhilarity/candidate.rb CHANGED
@@ -1,44 +1,20 @@
  module Simhilarity
    # A potential match between two +Elements+. It can calculate it's own score.
    class Candidate
-     # matcher that owns this guy
-     attr_reader :matcher
-
      # first half of the candidate pair - the needle.
      attr_reader :a

      # first half of the candidate pair - the haystack.
      attr_reader :b

-     def initialize(matcher, a, b) #:nodoc:
-       @matcher = matcher
+     # the score between these two candidates
+     attr_accessor :score
+
+     def initialize(a, b) #:nodoc:
        @a = a
        @b = b
      end

-     # Calculate the score for this +Candidate+. The score is the {dice
-     # coefficient}[http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient],
-     # <tt>(2*c)/(a+b)</tt>.
-     #
-     # * +a+: the frequency weighted sum of the ngrams in a
-     # * +b+: the frequency weighted sum of the ngrams in b
-     # * +c+: the frequency weighted sum of the ngrams in (a & b)
-     #
-     # Lazily calculated and memoized.
-     def score
-       @score ||= begin
-         c = (self.a.ngrams & self.b.ngrams)
-         if c.length > 0
-           a = self.a.ngrams_sum
-           b = self.b.ngrams_sum
-           c = matcher.ngrams_sum(c)
-           (2.0 * c) / (a + b)
-         else
-           0
-         end
-       end
-     end
-
      def to_s #:nodoc:
        "Candidate #{score}: #{a.inspect}..#{b.inspect}"
      end
data/lib/simhilarity/candidates.rb ADDED
@@ -0,0 +1,91 @@
+ module Simhilarity
+   module Candidates
+     # default minimum number # of ngram overlaps with :ngrams
+     DEFAULT_NGRAM_OVERLAPS = 3
+
+     # default maximum hamming distance with :simhash
+     DEFAULT_SIMHASH_MAX_HAMMING = 7
+
+     # Find candidates from +needles+ & +haystack+. The method used
+     # depends on the value of +candidates+
+     def candidates_for(needles)
+       # generate candidates
+       candidates_method = candidates_method(needles)
+       candidates = self.send(candidates_method, needles)
+
+       # if these are the same, no self-dups
+       if needles == haystack
+         candidates = candidates.reject { |n, h| n == h }
+       end
+
+       # map and return
+       candidates.map { |n, h| Candidate.new(n, h) }
+     end
+
+     # Select the method for finding candidates based on +candidates+.
+     def candidates_method(needles)
+       # pick the method
+       method = self.candidates
+       method ||= (needles.length * haystack.length < 200000) ? :all : :simhash
+       case method
+       when /^ngrams=(\d+)$/
+         method = :ngrams
+         self.ngram_overlaps = $1.to_i
+       when /^simhash=(\d+)$/
+         method = :simhash
+         self.simhash_max_hamming = $1.to_i
+       end
+
+       method = "candidates_#{method}".to_sym
+       if !respond_to?(method, true)
+         raise "unsupported candidates #{candidates.inspect}"
+       end
+
+       vputs "Using #{method} with needles=#{needles.length} haystack=#{haystack.length}..."
+       method
+     end
+
+     # Return ALL candidates. This only works for small datasets.
+     def candidates_all(needles)
+       needles.product(haystack)
+     end
+
+     # Return candidates that overlap with three or more matching
+     # ngrams. Only works for small datasets.
+     def candidates_ngrams(needles)
+       ngram_overlaps = self.ngram_overlaps || DEFAULT_NGRAM_OVERLAPS
+
+       candidates = []
+       veach(" ngrams #{ngram_overlaps}", needles) do |n|
+         ngrams_set = Set.new(n.ngrams)
+         haystack.each do |h|
+           count = 0
+           h.ngrams.each do |ngram|
+             if ngrams_set.include?(ngram)
+               if (count += 1) == ngram_overlaps
+                 candidates << [n, h]
+                 break
+               end
+             end
+           end
+         end
+       end
+       candidates
+     end
+
+     # Find candidates that are close based on hamming distance between
+     # the simhashes.
+     def candidates_simhash(needles)
+       max_hamming = self.simhash_max_hamming || DEFAULT_SIMHASH_MAX_HAMMING
+
+       # search for candidates with low hamming distance
+       candidates = []
+       veach(" hamming #{max_hamming}", needles) do |n|
+         bk_tree.query(n, max_hamming).each do |h, distance|
+           candidates << [n, h]
+         end
+       end
+       candidates
+     end
+   end
+ end
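The selection rule in `candidates_method` can be tried outside the gem. The following is a minimal standalone sketch, not the gem's code: `pick_candidates` and its `[method, parameter]` return shape are hypothetical, while the 200,000 threshold and the defaults of 3 and 7 are taken from the file above.

```ruby
# Standalone sketch of the candidate-method selection shown above.
# `pick_candidates` is a hypothetical helper, not part of simhilarity.
DEFAULT_NGRAM_OVERLAPS = 3
DEFAULT_SIMHASH_MAX_HAMMING = 7

# Returns [method, parameter] for a candidates spec such as :all,
# "ngrams=4", "simhash=8", or nil (pick a default from the sizes).
def pick_candidates(spec, needle_count, haystack_count)
  # Small problems can afford the full cross product.
  spec ||= (needle_count * haystack_count < 200_000) ? :all : :simhash

  case spec.to_s
  when /\Angrams(?:=(\d+))?\z/
    [:ngrams, ($1 || DEFAULT_NGRAM_OVERLAPS).to_i]
  when /\Asimhash(?:=(\d+))?\z/
    [:simhash, ($1 || DEFAULT_SIMHASH_MAX_HAMMING).to_i]
  when "all"
    [:all, nil]
  else
    raise ArgumentError, "unsupported candidates #{spec.inspect}"
  end
end

p pick_candidates(nil, 50, 1_000)
p pick_candidates("ngrams=4", 50, 1_000)
```

Unlike the gem, which stores the parsed number on the matcher, this sketch returns it alongside the method name so the function stays free of side effects.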
data/lib/simhilarity/matcher.rb CHANGED
@@ -1,70 +1,95 @@
+ require "bk"
+ require "set"
  require "progressbar"

  module Simhilarity
-   # Abstract superclass for matching. Mainly a container for options, corpus, etc.
    class Matcher
-     # Options used to create this Matcher.
-     attr_accessor :options
+     include Simhilarity::Candidates
+     include Simhilarity::Score

-     # Proc for turning needle/haystack elements into strings. You can
-     # leave this nil if the elements are already strings. See
-     # Matcher#reader for the default implementation.
+     # If true, show progress bars and timing
+     attr_accessor :verbose
+
+     # Proc for turning opaque items into strings.
      attr_accessor :reader

-     # Proc for normalizing input strings. See Matcher#normalize
-     # for the default implementation.
+     # Proc for normalizing strings.
      attr_accessor :normalizer

-     # Proc for generating ngrams from a normalized string. See
-     # Matcher#ngrams for the default implementation.
+     # Proc for generating ngrams.
      attr_accessor :ngrammer

-     # Ngram frequency weights from the corpus, or 1 if the ngram isn't
-     # in the corpus.
-     attr_accessor :freq
+     # Proc for scoring ngrams.
+     attr_accessor :scorer
+
+     # Specifies which method to use for finding candidates. See the
+     # README for more details.
+     attr_accessor :candidates

-     # Create a new Matcher matcher. Options include:
-     #
-     # * +reader+: Proc for turning opaque items into strings.
-     # * +normalizer+: Proc for normalizing strings.
-     # * +ngrammer+: Proc for generating ngrams.
-     # * +verbose+: If true, show progress bars and timing.
-     def initialize(options = {})
-       @options = options
+     # Minimum number of ngram overlaps, defaults to 3 (for candidates
+     # = :ngrams)
+     attr_accessor :ngram_overlaps

-       # procs
-       self.reader = options[:reader]
-       self.normalizer = options[:normalizer]
-       self.ngrammer = options[:ngrammer]
+     # Maximum simhash hamming distance, defaults to 7. (for candidates
+     # = :simhash)
+     attr_accessor :simhash_max_hamming

-       reset_corpus
+     # Set the haystack.
+     def haystack=(haystack)
+       @haystack = import_list(haystack)
+
+       # this stuff is lazily calculated from the haystack, and needs
+       # to be reset whenever the haystack changes.
+       @bitsums = { }
+       @bk_tree = nil
+       @freq = nil
      end

-     # Set the corpus. Calculates ngram frequencies (#freq) for future
-     # scoring.
-     def corpus=(corpus)
-       @corpus = corpus
+     # The current haystack.
+     def haystack
+       @haystack
+     end

-       reset_corpus
+     # Ngram frequency weights from the haystack, or 1 if the ngram
+     # isn't in the haystack. Lazily calculated.
+     def freq
+       @freq ||= begin
+         # calculate ngram counts for the haystack
+         counts = Hash.new(0)
+         veach("Haystack", @haystack) do |element|
+           element.ngrams.each do |ngram|
+             counts[ngram] += 1
+           end
+         end

-       # calculate ngram counts for the corpus
-       counts = Hash.new(0)
-       veach("Corpus", import_list(corpus)) do |element|
-         element.ngrams.each do |ngram|
-           counts[ngram] += 1
+         # turn counts into inverse frequencies
+         map = Hash.new(1)
+         total = counts.values.inject(&:+).to_f
+         counts.each do |ngram, count|
+           map[ngram] = ((total / count) * 10).round
          end
+         map
        end
+     end

-       # turn counts into inverse frequencies
-       total = counts.values.inject(&:+).to_f
-       counts.each do |ngram, count|
-         @freq[ngram] = ((total / count) * 10).round
+     # Match each item in +needles+ to an item in #haystack. Returns an
+     # array of tuples, <tt>[needle, haystack, score]</tt>. Scores
+     # range from 0 to 1, with 1 being a perfect match and 0 being a
+     # terrible match.
+     def matches(needles)
+       if haystack.nil?
+         raise RuntimeError.new('can\'t match before setting a haystack')
        end
-     end

-     # The current corpus.
-     def corpus
-       @corpus
+       # create Elements
+       needles = import_list(needles)
+
+       # get candidate matches
+       candidates = candidates_for(needles)
+       vputs " got #{candidates.length} candidates."
+
+       # pick winners
+       winners(needles, candidates)
      end

      # Turn an opaque item from the user into a string.
@@ -108,7 +133,7 @@ module Simhilarity

      # Sum up the frequency weights of the +ngrams+.
      def ngrams_sum(ngrams)
-       ngrams.map { |i| @freq[i] }.inject(&:+) || 0
+       ngrams.map { |i| freq[i] }.inject(&:+) || 0
      end

      # Calculate the frequency weighted
@@ -147,9 +172,16 @@ module Simhilarity
        Element.new(self, opaque)
      end

-     def reset_corpus
-       @freq = Hash.new(1)
-       @bitsums = { }
+     def bk_tree
+       @bk_tree ||= begin
+         # calculate this first so we get a nice progress bar
+         veach(" simhash", haystack) { |i| i.simhash }
+
+         # build the bk tree
+         tree = BK::Tree.new(lambda { |a, b| Bits.hamming32(a.simhash, b.simhash) })
+         veach(" bktree", haystack) { |i| tree.add(i) }
+         tree
+       end
      end

      # calculate the simhash bitsums for this +ngram+, as part of
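The `bk_tree` method in this hunk indexes the haystack by the Hamming distance between 32-bit simhashes. As a rough illustration of that distance, here is a minimal sketch in which `hamming32` is a hypothetical stand-in for `Bits.hamming32` (whose implementation is not part of this diff) and a linear scan replaces the BK-tree query:

```ruby
# Sketch of the distance the BK tree is built on. `hamming32` is a
# hypothetical stand-in for Bits.hamming32, and `nearby` uses a linear
# scan instead of a BK-tree query, purely for illustration.
def hamming32(a, b)
  # XOR leaves a 1 bit wherever the two hashes differ; count those bits.
  ((a ^ b) & 0xFFFFFFFF).to_s(2).count("1")
end

# Return every [hash, distance] pair within max_hamming of the needle.
def nearby(needle_hash, haystack_hashes, max_hamming)
  pairs = haystack_hashes.map { |h| [h, hamming32(needle_hash, h)] }
  pairs.select { |_, distance| distance <= max_hamming }
end

p nearby(0b1111, [0b1111, 0b0000, 0b1110], 1)
```

A linear scan compares the needle against every haystack hash, whereas a BK tree can prune branches that cannot fall within the distance limit, which is presumably why the gem caches the tree per haystack.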
@@ -171,14 +203,14 @@ module Simhilarity
        end
      end

-     # Puts if options[:verbose]
+     # Puts if +verbose+ is true
      def vputs(s)
-       $stderr.puts s if options[:verbose]
+       $stderr.puts s if verbose
      end

-     # Like each, but with a progress bar if options[:verbose]
+     # Like each, but with a progress bar if +verbose+ is true
      def veach(title, array, &block)
-       if !options[:verbose]
+       if !verbose
          array.each do |i|
            yield(i)
          end
data/lib/simhilarity/score.rb ADDED
@@ -0,0 +1,56 @@
+ module Simhilarity
+   module Score
+     # walk candidates by score, pick winners
+     def winners(needles, candidates)
+       # calculate this first so we get a nice progress bar
+       veach("Scoring", candidates) do |i|
+         i.score = score(i)
+       end
+
+       # sort by score
+       candidates = candidates.sort_by { |i| -i.score }
+
+       # walk them, eliminate dups
+       seen = Set.new
+       winners = candidates.map do |i|
+         next if seen.include?(i.a)
+         seen << i.a
+         i
+       end.compact
+
+       # build map from needle => candidate...
+       needle_to_winner = { }
+       winners.each { |i| needle_to_winner[i.a] = i }
+
+       # so we can return in the original order
+       needles.map do |i|
+         if candidate = needle_to_winner[i]
+           [ i.opaque, candidate.b.opaque, candidate.score ]
+         else
+           [ i.opaque, nil, nil ]
+         end
+       end
+     end
+
+     # Score a +Candidate+. The default implementation is the {dice
+     # coefficient}[http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient],
+     # <tt>(2*c)/(a+b)</tt>.
+     #
+     # * +a+: the frequency weighted sum of the ngrams in a
+     # * +b+: the frequency weighted sum of the ngrams in b
+     # * +c+: the frequency weighted sum of the ngrams in (a & b)
+     def score(candidate)
+       if scorer
+         return scorer.call(candidate)
+       end
+
+       c = (candidate.a.ngrams & candidate.b.ngrams)
+       return 0 if c.length == 0
+
+       a = candidate.a.ngrams_sum
+       b = candidate.b.ngrams_sum
+       c = ngrams_sum(c)
+       (2.0 * c) / (a + b)
+     end
+   end
+ end
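The default `score` combines two pieces from this release: the inverse-frequency weights built in `Matcher#freq` and the weighted Dice coefficient `(2*c)/(a+b)`. The sketch below reproduces that arithmetic on plain strings. It is an approximation under stated assumptions: `bigrams` is a simplified stand-in for the gem's ngrammer (which also handles runs of digits), and `frequency_weights` and `dice` are hypothetical names.

```ruby
# Standalone sketch of frequency-weighted Dice scoring. All method names
# are hypothetical, and `bigrams` is a simplified stand-in for the gem's
# default ngrammer.

# Split a string into overlapping two-character ngrams, without duplicates.
def bigrams(str)
  str.downcase.each_char.each_cons(2).map(&:join).uniq
end

# Inverse-frequency weights: rare ngrams weigh more. Ngrams that never
# appear in the haystack fall back to a weight of 1.
def frequency_weights(haystack)
  counts = Hash.new(0)
  haystack.each { |item| bigrams(item).each { |ngram| counts[ngram] += 1 } }

  total = counts.values.inject(0, :+).to_f
  weights = Hash.new(1)
  counts.each { |ngram, count| weights[ngram] = ((total / count) * 10).round }
  weights
end

# Weighted Dice coefficient: (2 * c) / (a + b), where a, b and c are the
# weighted sums for the first string, the second string and their overlap.
def dice(a, b, weights)
  sum = lambda { |ngrams| ngrams.map { |n| weights[n] }.inject(0, :+) }
  a_ngrams = bigrams(a)
  b_ngrams = bigrams(b)
  shared = a_ngrams & b_ngrams
  return 0.0 if shared.empty?

  (2.0 * sum.call(shared)) / (sum.call(a_ngrams) + sum.call(b_ngrams))
end

haystack = ["Black Sabbath", "Led Zeppelin", "The Doors"]
weights = frequency_weights(haystack)
puts dice("Black Sabbath", "Black Sabbath", weights) # identical strings
puts dice("blak sabbath", "Black Sabbath", weights)  # near miss
puts dice("xyz", "Black Sabbath", weights)           # no shared bigrams
```

Because the weights come only from the haystack, a misspelled needle contributes unseen ngrams at weight 1, so they dilute the score only slightly.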
data/lib/simhilarity/version.rb CHANGED
@@ -1,4 +1,4 @@
  module Simhilarity
    # Gem version
-   VERSION = "1.0.2"
+   VERSION = "1.0.3"
  end
@@ -29,15 +29,17 @@ class Tests < Test::Unit::TestCase
      sample
    end

-   def assert_bulk_candidates(candidates, percent)
+   def assert_candidates(candidates, percent)
      sample = self.sample

      # match, with benchmark
      output = nil
      Benchmark.bm(10) do |bm|
        bm.report(candidates.to_s) do
-         matcher = Simhilarity::Bulk.new(candidates: candidates)
-         output = matcher.matches(sample.needle, sample.haystack)
+         matcher = Simhilarity::Matcher.new
+         matcher.candidates = candidates
+         matcher.haystack = sample.haystack
+         output = matcher.matches(sample.needle)
        end
      end

@@ -55,9 +57,11 @@ class Tests < Test::Unit::TestCase
      assert((correct - percent).abs < 0.001, "percent #{correct} != #{percent}")
    end

+   TMP = "/tmp/_simhilarity_tests.txt"
+
    def assert_system(cmd)
-     system("#{cmd} > /dev/null 2>&1")
-     assert($? == 0, "#{cmd} failed")
+     system("#{cmd} > #{TMP} 2>&1")
+     assert($? == 0, File.read(TMP))
    end

    #
@@ -70,46 +74,62 @@ class Tests < Test::Unit::TestCase

      # not a string
      assert_raise(RuntimeError) { @matcher.read(123) }
-
-     # custom
-     @matcher.reader = lambda(&:key)
-     assert_equal @matcher.read(OpenStruct.new(key: "gub")), "gub"
    end

    def test_normalizer
      # default
      assert_equal @matcher.normalize(" HELLO,\tWORLD! "), "hello world"
-
-     # custom
-     @matcher.normalizer = lambda(&:upcase)
-     assert_equal @matcher.normalize("gub"), "GUB"
    end

    def test_ngrams
      # default
      assert_equal @matcher.ngrams("hi 42"), ["hi", "i ", " 4", "42"]
-
-     # custom
-     @matcher.ngrammer = lambda(&:split)
-     assert_equal @matcher.ngrams("hi 42"), ["hi", "42"]
    end

    def test_proc_options
-     matcher = Simhilarity::Matcher.new(reader: lambda(&:key), normalizer: lambda(&:upcase), ngrammer: lambda(&:split))
+     matcher = Simhilarity::Matcher.new
+     matcher.reader = lambda(&:key)
+     matcher.normalizer = lambda(&:upcase)
+     matcher.ngrammer = lambda(&:split)
      assert_equal matcher.read(OpenStruct.new(key: "gub")), "gub"
      assert_equal matcher.normalize("gub"), "GUB"
      assert_equal matcher.ngrams("hi 42"), ["hi", "42"]
    end

-   def test_single
-     score = Simhilarity::Single.new.score("hello world", "hi worlds")
-     assert (score - 0.556).abs < 0.001, "test_single percent was wrong!"
+   def test_no_selfdups
+     # if you pass in the same list twice, it should ignore self-dups
+     list = ["hello, world", "hello there"]
+     @matcher.haystack = list
+     matches = @matcher.matches(@matcher.haystack)
+     assert_not_equal matches[0][1], "hello, world"
+   end
+
+   def test_haystack_required
+     # if you do not set a haystack, the matcher should yell
+     matcher = Simhilarity::Matcher.new
+     assert_raise RuntimeError do
+       matches = matcher.matches(['FOOM'])
+     end
    end

-   def test_bulk
-     assert_bulk_candidates(:all, 0.974)
-     assert_bulk_candidates(:ngrams, 0.974)
-     assert_bulk_candidates(:simhash, 0.949)
+   def test_one_result_can_win_multiple_times
+     # We should be able to find the same piece of hay multiple times for
+     # different needles.
+     haystack = ['Black Sabbath', 'Led Zeppelin', 'The Doors',
+                 'The Beatles', 'Neil Young']
+     needles = ['blak sabbath', 'black sabath', 'block soborch']
+     @matcher.haystack = haystack
+
+     # Whether matched individually or as a group, all of these needles
+     # should produce the same result.
+     matches = @matcher.matches(needles)
+     needles.each do |n|
+       matches.concat @matcher.matches([n])
+     end
+
+     matches.each do |n, h, s|
+       assert_equal 'Black Sabbath', h
+     end
    end

    def test_bin
@@ -122,4 +142,10 @@ class Tests < Test::Unit::TestCase
      assert_system("#{bin} --candidates ngrams=3 identity.txt identity.txt")
      assert_system("#{bin} --candidates all identity.txt identity.txt")
    end
+
+   def test_candidates
+     assert_candidates(:all, 0.949)
+     assert_candidates(:ngrams, 0.949)
+     assert_candidates(:simhash, 0.949)
+   end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: simhilarity
  version: !ruby/object:Gem::Version
-   version: 1.0.2
+   version: 1.0.3
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-04-18 00:00:00.000000000 Z
+ date: 2013-04-26 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bk
@@ -100,6 +100,7 @@ extensions: []
  extra_rdoc_files: []
  files:
  - .gitignore
+ - .travis.yml
  - Gemfile
  - LICENSE
  - README.md
@@ -107,11 +108,11 @@ files:
  - bin/simhilarity
  - lib/simhilarity.rb
  - lib/simhilarity/bits.rb
- - lib/simhilarity/bulk.rb
  - lib/simhilarity/candidate.rb
+ - lib/simhilarity/candidates.rb
  - lib/simhilarity/element.rb
  - lib/simhilarity/matcher.rb
- - lib/simhilarity/single.rb
+ - lib/simhilarity/score.rb
  - lib/simhilarity/version.rb
  - simhilarity.gemspec
  - test/harness
@@ -140,7 +141,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
        version: '0'
        segments:
        - 0
-       hash: -1024001634221116929
+       hash: 3122268769366489382
  requirements: []
  rubyforge_project: simhilarity
  rubygems_version: 1.8.24
data/lib/simhilarity/bulk.rb DELETED
@@ -1,163 +0,0 @@
- require "bk"
- require "set"
-
- module Simhilarity
-   # Match a set of needles against a haystack, in bulk. For example,
-   # this is used if you want to match 50 new addresses against your
-   # database of 1,000 known addresses.
-   class Bulk < Matcher
-     # default minimum number # of ngram overlaps with :ngrams
-     DEFAULT_NGRAM_OVERLAPS = 3
-     # default maximum hamming distance with :simhash
-     DEFAULT_SIMHASH_MAX_HAMMING = 7
-
-     # Initialize a new Bulk matcher. See Matcher#initialize. Bulk adds
-     # these options:
-     #
-     # * +candidates+: specifies which method to use for finding
-     # candidates. See the README for more details.
-     # * +ngrams_overlaps+: Minimum number of ngram overlaps, defaults
-     # to 3.
-     # * +simhash_max_hamming+: Maximum simhash hamming distance,
-     # defaults to 7.
-     def initialize(options = {})
-       super(options)
-     end
-
-     # Match each item in +needles+ to an item in +haystack+. Returns
-     # an array of tuples, <tt>[needle, haystack, score]</tt>. Scores
-     # range from 0 to 1, with 1 being a perfect match and 0 being a
-     # terrible match.
-     def matches(needles, haystack)
-       # create Elements
-       if needles == haystack
-         needles = haystack = import_list(needles)
-
-         # set the corpus, to generate frequency weights
-         self.corpus = needles
-       else
-         needles = import_list(needles)
-         haystack = import_list(haystack)
-
-         # set the corpus, to generate frequency weights
-         self.corpus = (needles + haystack)
-       end
-
-       # get candidate matches
-       candidates = candidates(needles, haystack)
-       vputs " got #{candidates.length} candidates."
-
-       # pick winners
-       winners(needles, candidates)
-     end
-
-     protected
-
-     # Find candidates from +needles+ & +haystack+. The method used
-     # depends on the value of options[:candidates]
-     def candidates(needles, haystack)
-       method = options[:candidates]
-       method ||= (needles.length * haystack.length < 200000) ? :all : :simhash
-
-       case method
-       when /^ngrams=(\d+)$/
-         method = :ngrams
-         options[:ngram_overlaps] = $1.to_i
-       when /^simhash=(\d+)$/
-         method = :simhash
-         options[:simhash_max_hamming] = $1.to_i
-       end
-
-       method = "candidates_#{method}".to_sym
-       if !respond_to?(method)
-         raise "unsupported options[:candidates] #{options[:candidates].inspect}"
-       end
-
-       vputs "Using #{method} with needles=#{needles.length} haystack=#{haystack.length}..."
-       self.send(method, needles, haystack).map do |n, h|
-         Candidate.new(self, n, h)
-       end
-     end
-
-     # Return ALL candidates. This only works for small datasets.
-     def candidates_all(needles, haystack)
-       needles.product(haystack)
-     end
-
-     # Return candidates that overlap with three or more matching
-     # ngrams. Only works for small datasets.
-     def candidates_ngrams(needles, haystack)
-       ngram_overlaps = options[:ngram_overlaps] || DEFAULT_NGRAM_OVERLAPS
-
-       candidates = []
-       veach(" ngrams #{ngram_overlaps}", needles) do |n|
-         ngrams_set = Set.new(n.ngrams)
-         haystack.each do |h|
-           count = 0
-           h.ngrams.each do |ngram|
-             if ngrams_set.include?(ngram)
-               if (count += 1) == ngram_overlaps
-                 candidates << [n, h]
-                 break
-               end
-             end
-           end
-         end
-       end
-       candidates
-     end
-
-     # Find candidates that are close based on hamming distance between
-     # the simhashes.
-     def candidates_simhash(needles, haystack)
-       max_hamming = options[:simhash_max_hamming] || DEFAULT_SIMHASH_MAX_HAMMING
-
-       # calculate this first so we get a nice progress bar
-       veach(" simhash", corpus) { |i| i.simhash }
-
-       # build the bk tree
-       bk = BK::Tree.new(lambda { |a, b| Bits.hamming32(a.simhash, b.simhash) })
-       veach(" bktree", haystack) { |i| bk.add(i) }
-
-       # search for candidates with low hamming distance
-       candidates = []
-       veach(" hamming #{max_hamming}", needles) do |n|
-         bk.query(n, max_hamming).each do |h, distance|
-           candidates << [n, h]
-         end
-       end
-       candidates
-     end
-
-     # walk candidates by score, pick winners
-     def winners(needles, candidates)
-       # calculate this first so we get a nice progress bar
-       veach("Scoring", candidates) { |i| i.score }
-
-       # score the candidates
-       candidates = candidates.sort_by { |i| -i.score }
-
-       # walk them, eliminate dups
-       seen = Set.new
-       winners = candidates.map do |i|
-         next if seen.include?(i.a) || seen.include?(i.b)
-         seen << i.a
-         seen << i.b
-         i
-       end.compact
-
-       # build map from needle => candidate...
-       needle_to_winner = { }
-       winners.each { |i| needle_to_winner[i.a] = i }
-
-       # so we can return in the original order
-       needles.map do |i|
-         if candidate = needle_to_winner[i]
-           [ i.opaque, candidate.b.opaque, candidate.score ]
-         else
-           [ i.opaque, nil, nil ]
-         end
-       end
-     end
-   end
- end
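One behavioural difference is easy to miss when comparing this deleted `winners` with its replacement in `score.rb`: the old version skipped a candidate if either its needle or its haystack item had already been claimed, while the new version only tracks needles. The sketch below contrasts the two rules on plain `[needle, hay, score]` triples. `pick_winners` and its `exclusive` flag are hypothetical names for this illustration, and the scores are invented sample values.

```ruby
require "set"

# Standalone sketch contrasting the two winner-selection rules.
# `pick_winners` and `exclusive` are hypothetical; candidates are
# [needle, hay, score] triples with made-up scores.
def pick_winners(candidates, exclusive)
  seen = Set.new
  winners = {}

  # Walk candidates from best score to worst.
  candidates.sort_by { |_, _, score| -score }.each do |needle, hay, score|
    next if seen.include?(needle)
    # 1.0.2 rule: a haystack item could only be claimed once.
    next if exclusive && seen.include?(hay)

    seen << needle
    seen << hay if exclusive
    winners[needle] = [hay, score]
  end
  winners
end

candidates = [
  ["blak sabbath", "Black Sabbath", 0.9],
  ["black sabath", "Black Sabbath", 0.8],
  ["black sabath", "Led Zeppelin", 0.1],
]

p pick_winners(candidates, true)  # second needle falls back to "Led Zeppelin"
p pick_winners(candidates, false) # both needles match "Black Sabbath"
```

This is the situation exercised by `test_one_result_can_win_multiple_times` in the test changes of this release.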
data/lib/simhilarity/single.rb DELETED
@@ -1,18 +0,0 @@
- require "set"
-
- module Simhilarity
-   # Calculate the similarity score for pairs of items, one at a time.
-   class Single < Matcher
-     # See Matcher#initialize.
-     def initialize(options = {})
-       super(options)
-     end
-
-     # Calculate the similarity score for these two items. Scores range
-     # from 0 to 1, with 1 being a perfect match and 0 being a terrible
-     # match. For best results, call #corpus= first.
-     def score(a, b)
-       Candidate.new(self, element_for(a), element_for(b)).score
-     end
-   end
- end