simhilarity 1.0.0

data/.gitignore ADDED
@@ -0,0 +1,5 @@
1
+ *.gem
2
+ .bundle
3
+ Gemfile.lock
4
+ pkg/*
5
+ rdoc
data/Gemfile ADDED
@@ -0,0 +1,2 @@
1
+ source "https://rubygems.org"
2
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2013 Adam Doppelt
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,165 @@
1
+ # Welcome to simhilarity
2
+
3
+ Simhilarity is a gem for quickly matching up text strings that are similar but not identical. Here is how it works:
4
+
5
+ 1. Normalize strings. Downcase, remove non-alpha, etc:
6
+
7
+ ```ruby
8
+ normalize("Hello, WORLD!") => "hello world"
9
+ ```
10
+
11
+ 1. Calculate [ngrams](http://en.wikipedia.org/wiki/N-gram) from strings. Specifically, it creates bigrams (2 character ngrams) and also creates an ngram for each sequence of digits in the string:
12
+
13
+ ```ruby
14
+ # bigrams # digits
15
+ ngrams("hi 123") => ["hi", "i ", " 1", "12", "23"] + ["123"]
16
+ ```
17
+
18
+ 1. Calculate frequency of ngrams in the corpus.
19
+
20
+ 1. Select pairs of strings that might be matches. These are called **candidates**, and there are a few different ways they are chosen - see [options](#options). Simhilarity will try to pick the best method based on the size of your data set.
21
+
22
+ 1. Score candidates by measuring ngram overlap (with frequency weighting), using the [dice coefficient](http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient).
23
+
24
+ 1. For each input string, return the match with the highest score.
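The first three steps can be sketched in plain Ruby. This is a minimal sketch, not the gem's actual implementation: `normalize`, `ngrams`, and `ngram_counts` are hypothetical helpers written to reproduce the two examples above, and the exact rules applied by the gem's default normalizer may differ.

```ruby
# Hypothetical helpers that mirror steps 1-3; not the gem's own code.

# Downcase, turn anything that is not a letter or digit into a space,
# then strip leading and trailing whitespace.
def normalize(str)
  str.downcase.gsub(/[^a-z0-9]+/, " ").strip
end

# Character bigrams plus one ngram per run of digits.
def ngrams(str)
  bigrams = str.chars.each_cons(2).map(&:join)
  digits = str.scan(/\d+/)
  bigrams + digits
end

# Count how often each ngram occurs across a corpus of strings.
def ngram_counts(corpus)
  counts = Hash.new(0)
  corpus.each { |s| ngrams(normalize(s)).each { |ng| counts[ng] += 1 } }
  counts
end

p normalize("Hello, WORLD!")
p ngrams("hi 123")
p ngram_counts(["Sea Crest 1504", "1504 Sea Crest"])["1504"]
```

The digit-run ngram is what lets "Sea Crest 1504" and "1504 Sea Crest" share a distinctive token even though the word order differs.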
25
+
26
+ Here is output from a sample run:
27
+
28
+ ```
29
+ score needle haystack
30
+ 1.000 Night Heron 19 Night Heron 19
31
+ 1.000 103 Oceanwood 103 Oceanwood
32
+ 0.987 Sea Crest 1504 1504 Sea Crest
33
+ 0.986 Twin Oaks 189 189 Twin Oaks
34
+ 0.981 Sea Crest 1205 1205 Sea Crest
35
+ 0.980 Sea Crest 2411 2411 Sea Crest
36
+ 0.972 Sea Crest 3405 3405 Sea Crest
37
+ 0.968 Barrington Arms 504 504 Barrington Arms
38
+ 0.964 Windsor Place 503 503 Windsor Place
39
+ 0.951 1802 Bluff Villas - Hilton Head Island 1802 Bluff Villas
40
+ 0.943 3221 Villamare - Hilton Head Island 3221 Villamare
41
+ 0.941 134 Shorewood - Hilton Head Island 134 Shorewood
42
+ 0.900 1 Quail Street 1 Quail
43
+ 0.894 2 Quail Street 2 Quail
44
+ 0.823 Windsor II 2315 2315 Windsor Place II
45
+ 0.736 Beachside Tennis 12 12 Beachside
46
+ 0.732 16 Piping Plover - Hilton Head Island 16 Piping Plover
47
+ 0.460 7 Quail 7 QUAIL/126 Dune Lane
48
+ 0.379 11 Battery 11 Gunnery
49
+ ```
50
+
51
+ Note that the final match has the lowest score, and is incorrect!
52
+
53
+ ## Usage
54
+
55
+ ### simhilarity executable
56
+
57
+ The gem includes an executable called `simhilarity`. For example:
58
+
59
+ ```sh
60
+ $ simhilarity needles.txt haystack.txt
61
+ score,needle,haystack
62
+ 0.900,1 Quail Street,1 Quail
63
+ 1.000,103 Oceanwood,103 Oceanwood
64
+ ...
65
+ ```
66
+
67
+ It prints the best matches between needle and haystack in CSV format. Use `simhilarity --verbose` to watch progress bars while it's running. Use `--candidates` to customize the candidate selection method, which can dramatically affect performance for large data sets.
68
+
69
+ ### Simhilarity::Bulk
70
+
71
+ To use simhilarity from code, create a `Bulk` and call `matches(needles, haystack)`. It'll return an array of tuples, `[needle, haystack, score]`. By default, simhilarity assumes that needles and haystack are arrays of strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
72
+
73
+ ### Simhilarity::Single
74
+
75
+ Sometimes it's useful to just calculate the score between two strings. For example, you might want a title similarity measurement as part of a larger comparison between two books. Create a `Single` and call `score(a, b)` to measure similarity between those two items. By default, simhilarity assumes that both items are strings. To use something else, set `reader` to a proc that converts your opaque objects into strings. See [options](#options).
76
+
77
+ Important note: For best results with `Single`, set the corpus so that simhilarity can calculate ngram frequencies. This can dramatically improve accuracy. `Bulk` will do this automatically because it has access to the corpus, but `Single` doesn't. Call `corpus=` manually when using `Single`.
78
+
79
+ <a name="benchmarks"/>
80
+
81
+ ## Benchmarks
82
+
83
+ When looking at simhilarity's speed, there are two important aspects to consider:
84
+
85
+ * **picking candidates** - how long does it take to pick decent candidates out of all the potential string pairs?
86
+ * **matching** - once candidates are identified, how long does it take to score them?
87
+
88
+ #### Picking Candidates
89
+
90
+ There are three different methods for picking candidates - see [options](#options) for a detailed explanation. Here are some numbers from my 3 GHz i5, for a test dataset consisting of 500 needles and 10,000 haystack entries.
91
+
92
+
93
+ ```
94
+ method time candidates returned
95
+ simhash 5 4s 3,500
96
+ simhash 6 7s 5,000
97
+ simhash 7 9s 10,000 (this is the default)
98
+ simhash 8 12s 25,000
99
+ simhash 9 13s 60,000
100
+
101
+ ngrams 5 46s 1,000,000
102
+ ngrams 4 44s 1,500,000
103
+ ngrams 3 40s 2,100,000
104
+
105
+ all 3.9s 5,000,000
106
+ ```
107
+
108
+ #### Matching
109
+
110
+ Once candidates are identified, the string pairs are scored and winners are picked out. Scoring is O(n) in the number of candidates. On my 3 GHz i5:
111
+
112
+ ```
113
+ candidates time
114
+ 25,000 1s
115
+ 60,000 2s
116
+ 1,000,000 35s
117
+ 5,000,000 190s
118
+ ```
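Combining the two tables gives a rough sense of the end-to-end trade-off. The sketch below is only back-of-the-envelope arithmetic on the figures quoted above: the scoring rate it derives is an estimate, `total_seconds` is a hypothetical helper, and the totals are not measured results.

```ruby
# Rough arithmetic using only figures from the two tables above.
needles, haystacks = 500, 10_000
all_pairs = needles * haystacks          # every possible pair

# Scoring throughput implied by the largest scoring measurement.
scoring_rate = 5_000_000 / 190.0         # candidates per second

# Estimated end-to-end time = time to pick candidates + time to score them.
def total_seconds(pick_seconds, candidates, rate)
  pick_seconds + candidates / rate
end

puts "all pairs: #{all_pairs}"
printf("scoring rate: ~%d candidates/s\n", scoring_rate)
printf(":all       ~%.0fs\n", total_seconds(3.9, 5_000_000, scoring_rate))
printf(":simhash 7 ~%.0fs\n", total_seconds(9, 10_000, scoring_rate))
```

Under these assumptions `:all` spends almost all of its time scoring, while `:simhash` spends almost all of its time picking candidates.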
119
+
120
+
121
+
122
+ <a name="options"/>
123
+
124
+ ## Options
125
+
126
+ There are a few ways to configure simhilarity:
127
+
128
+ * **candidates** - controls how candidates are picked from the complete set of all string pairs. We want to avoid looking at all string pairs, because that's quite expensive for large datasets. On the other hand, if we examine too few we might miss some of the best matches. A conundrum. There are three different settings:
129
+
130
+ `:simhash` - generate a weighted [simhash](http://matpalm.com/resemblance/simhash/) for each string, then iterate the needles and look for "nearby" haystack simhashes using a [bktree](https://github.com/threedaymonk/bktree). Simhashes are compared using the [hamming distance](http://en.wikipedia.org/wiki/Hamming_distance). If the hamming distance between the simhashes <= `options[:simhash_max_hamming]`, the pair becomes a candidate. The default max hamming distance is 7 - see [benchmarks](#benchmarks) to get a sense for how different values perform.
131
+
132
+ `:ngrams` - for each pair of strings, count the number of ngrams they have in common. If the overlap is >= `options[:ngram_overlaps]`, the pair becomes a candidate. The default minimum number of overlaps is 3 - see [benchmarks](#benchmarks) to get a sense for how different values perform.
133
+
134
+ `:all` - all pairs are examined. This is completely braindead and very slow for large datasets.
135
+
136
+ Simhash works great, but there's no reason not to use `:ngrams` or even `:all` for small data sets. In fact, that's what simhilarity does by default - if you use a small dataset (needle * haystack < 200,000) it defaults to `:all`, otherwise it uses `:simhash`. Some examples:
137
+
138
+ ```ruby
139
+ Simhilarity::Bulk.new # defaults to :all or :simhash based on size
140
+ Simhilarity::Bulk.new(candidates: :simhash)
141
+ Simhilarity::Bulk.new(candidates: :simhash, simhash_max_hamming: 8)
142
+ Simhilarity::Bulk.new(candidates: :ngrams, ngram_overlaps: 4)
143
+ ```
144
+
145
+ or:
146
+
147
+ ```
148
+ $ simhilarity --candidates simhash needles.txt haystack.txt
149
+ $ simhilarity --candidates simhash=8 needles.txt haystack.txt
150
+ $ simhilarity --candidates ngrams needles.txt haystack.txt
151
+ $ simhilarity --candidates ngrams=4 needles.txt haystack.txt
152
+ ```
153
+
154
+ * **reader** - proc for converting your opaque objects into strings. Set this to use something other than strings for source data. For example, if you want to match author names between ActiveRecord book objects:
155
+
156
+ ```ruby
157
+ matcher.reader = lambda { |i| i.author }
158
+ matcher.matches(needles, haystack)
159
+ ```
160
+
161
+ * **normalizer** - proc for normalizing incoming strings. The default normalizer downcases, removes non-alphas, and strips whitespace.
162
+
163
+ * **ngrammer** - proc for converting normalized strings into ngrams. The default ngrammer pulls out bigrams and runs of digits, which is perfect for matching names and addresses.
164
+
165
+ * **verbose** - if true, show progress while simhilarity is working. Great for the impatient. Use --verbose from the command line.
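The `--candidates` values used in the command line examples above combine a method name with an optional numeric setting. Here is a minimal sketch of how such a value could be split into options. `parse_candidates` is a hypothetical helper, not part of the gem, and it is stricter than the gem in rejecting unknown values up front.

```ruby
# Hypothetical helper: split "simhash=8" into a method and a numeric setting.
def parse_candidates(spec)
  case spec.to_s
  when /\Angrams=(\d+)\z/
    { candidates: :ngrams, ngram_overlaps: $1.to_i }
  when /\Asimhash=(\d+)\z/
    { candidates: :simhash, simhash_max_hamming: $1.to_i }
  when "all", "ngrams", "simhash"
    { candidates: spec.to_sym }
  else
    raise ArgumentError, "unsupported candidates value #{spec.inspect}"
  end
end

p parse_candidates("simhash=8")
p parse_candidates("ngrams")
```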
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+ require "rdoc/task"
4
+
5
+ Bundler::GemHelper.install_tasks
6
+
7
+ # testing
8
+ Rake::TestTask.new(:test) do |test|
9
+ test.libs << "test"
10
+ end
11
+ task default: :test
12
+
13
+ # rdoc
14
+ RDoc::Task.new do |rdoc|
15
+ rdoc.rdoc_dir = "rdoc"
16
+ rdoc.title = "simhilarity #{Simhilarity::VERSION}"
17
+ rdoc.rdoc_files.include("lib/**/*.rb")
18
+ end
data/bin/simhilarity ADDED
@@ -0,0 +1,84 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # in lieu of -w, since we're using env to startup
4
+ $VERBOSE = true
5
+
6
+ require "csv"
7
+ require "optparse"
8
+ require "simhilarity"
9
+
10
+ class Main
11
+ def initialize(options = {})
12
+ # load
13
+ needle = File.readlines(options[:needle]).map(&:chomp)
14
+ haystack = File.readlines(options[:haystack]).map(&:chomp)
15
+
16
+ # match
17
+ tm = Time.now
18
+ matcher = Simhilarity::Bulk.new(options)
19
+ matches = matcher.matches(needle, haystack)
20
+
21
+ if options[:verbose]
22
+ tm = Time.now - tm
23
+ $stderr.printf("Simhilarity finished in %.3fs.\n\n", tm)
24
+ end
25
+
26
+ # now report
27
+ csv = CSV.new($stdout)
28
+ csv << %w(score needle haystack)
29
+ matches.each do |n, h, score|
30
+ csv << [sprintf("%4.3f", score || 0), n, h]
31
+ end
32
+ end
33
+ end
34
+
35
+
36
+
37
+ #
38
+ # parse command line
39
+ #
40
+
41
+ options = { }
42
+
43
+ opt = OptionParser.new do |o|
44
+ o.banner = <<EOF
45
+ simhilarity matches lines of text between needle_file and
46
+ haystack_file, then prints a report. Potential matches are scored
47
+ using frequency weighted ngrams.
48
+
49
+ Usage: simhilarity [options] <needle_file> <haystack_file>
50
+ EOF
51
+ o.on("-v", "--verbose", "enable verbose/progress output") do |f|
52
+ options[:verbose] = true
53
+ end
54
+ o.on("-c", "--candidates [CANDIDATES]", "set candidates search method") do |f|
55
+ options[:candidates] = f
56
+ end
57
+ o.on_tail("-h", "--help", "print this help text") do
58
+ puts opt
59
+ exit 0
60
+ end
61
+ end
62
+ begin
63
+ opt.parse!
64
+ rescue OptionParser::InvalidOption, OptionParser::MissingArgument => e
65
+ puts e
66
+ puts opt
67
+ exit 1
68
+ end
69
+
70
+ # mandatory args
71
+ if ARGV.length != 2
72
+ puts opt
73
+ exit 1
74
+ end
75
+ options[:needle] = ARGV[0]
76
+ options[:haystack] = ARGV[1]
77
+ %w(needle haystack).map(&:to_sym).each do |i|
78
+ if !File.exist?(options[i])
79
+ puts "error: #{i.capitalize} file #{options[i].inspect} doesn't exist."
80
+ exit 1
81
+ end
82
+ end
83
+
84
+ Main.new(options)
@@ -0,0 +1,62 @@
1
+ require "digest"
2
+
3
+ module Simhilarity
4
+ module Bits
5
+ # Calculate the {hamming
6
+ # distance}[http://en.wikipedia.org/wiki/Hamming_distance] between
7
+ # two integers. Not particularly fast.
8
+ def self.hamming(a, b)
9
+ x, d = 0, a ^ b
10
+ while d > 0
11
+ x += 1
12
+ d &= d - 1
13
+ end
14
+ x
15
+ end
16
+
17
+ HAMMING8 = (0..0xff).map { |i| Bits.hamming(0, i) }
18
+ HAMMING16 = (0..0xffff).map { |i| HAMMING8[(i >> 8) & 0xff] + HAMMING8[(i >> 0) & 0xff] }
19
+
20
+ # Calculate the {hamming
21
+ # distance}[http://en.wikipedia.org/wiki/Hamming_distance] between
22
+ # two 32 bit integers using a lookup table. This is fast.
23
+ def self.hamming32(a, b)
24
+ x = a ^ b
25
+ a = (x >> 16) & 0xffff
26
+ b = (x >> 0) & 0xffff
27
+ HAMMING16[a] + HAMMING16[b]
28
+ end
29
+
30
+ # can't rely on ruby hash, because it's not consistent across
31
+ # sessions. Let's just use MD5.
32
+ def self.nhash(ngram)
33
+ @hashes ||= { }
34
+ @hashes[ngram] ||= Digest::MD5.hexdigest(ngram).to_i(16)
35
+ end
36
+
37
+ # Calculate the frequency weighted
38
+ # simhash[http://matpalm.com/resemblance/simhash/] of the
39
+ # +ngrams+.
40
+ def self.simhash32(freq, ngrams)
41
+ # array of bit sums
42
+ bits = Array.new(32, 0)
43
+
44
+ # walk bits of ngram's hash, increase/decrease bit sums
45
+ ngrams.each do |ngram|
46
+ f = freq[ngram]
47
+ hash = nhash(ngram)
48
+ (0...32).each do |i|
49
+ bits[i] += (((hash >> i) & 1) == 1) ? f : -f
50
+ end
51
+ end
52
+
53
+ # calculate simhash based on whether bit sums are negative or
54
+ # positive
55
+ simhash = 0
56
+ (0...32).each do |bit|
57
+ simhash |= (1 << bit) if bits[bit] > 0
58
+ end
59
+ simhash
60
+ end
61
+ end
62
+ end
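A quick way to sanity-check the bit tricks above is to compare them against a slower but obvious implementation. This is a self-contained sketch: `popcount_hamming`, `reference_hamming`, and `toy_simhash32` are hypothetical names, and the weights passed in are made-up values for illustration.

```ruby
require "digest"

# Kernighan's trick: each `d &= d - 1` clears the lowest set bit,
# so the loop runs once per differing bit.
def popcount_hamming(a, b)
  count, d = 0, a ^ b
  while d > 0
    count += 1
    d &= d - 1
  end
  count
end

# Obvious reference version: count "1" characters in the binary string.
def reference_hamming(a, b)
  (a ^ b).to_s(2).count("1")
end

# Weighted 32 bit simhash: each ngram votes +weight or -weight per bit.
def toy_simhash32(weights, ngrams)
  sums = Array.new(32, 0)
  ngrams.each do |ngram|
    hash = Digest::MD5.hexdigest(ngram).to_i(16)
    32.times { |i| sums[i] += (hash[i] == 1 ? weights[ngram] : -weights[ngram]) }
  end
  # Set bit i of the result when the votes for that bit are positive.
  (0...32).reduce(0) { |acc, i| sums[i] > 0 ? acc | (1 << i) : acc }
end

weights = Hash.new(1) # made-up: every ngram weighs 1
a = toy_simhash32(weights, %w(se ea cr re es st))
b = toy_simhash32(weights, %w(se ea cr re es st))
puts popcount_hamming(a, b)
puts popcount_hamming(0b1011, 0b0010) == reference_hamming(0b1011, 0b0010)
```

Identical ngram lists always produce identical simhashes, so their hamming distance is zero. Lists that differ in a few ngrams tend to differ in only a few bits, which is what makes the hamming distance usable as a cheap similarity filter.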
@@ -0,0 +1,163 @@
1
+ require "bk"
2
+ require "set"
3
+
4
+ module Simhilarity
5
+ # Match a set of needles against a haystack, in bulk. For example,
6
+ # this is used if you want to match 50 new addresses against your
7
+ # database of 1,000 known addresses.
8
+ class Bulk < Matcher
9
+ # default minimum number of ngram overlaps with :ngrams
10
+ DEFAULT_NGRAM_OVERLAPS = 3
11
+ # default maximum hamming distance with :simhash
12
+ DEFAULT_SIMHASH_MAX_HAMMING = 7
13
+
14
+ # Initialize a new Bulk matcher. See Matcher#initialize. Bulk adds
15
+ # these options:
16
+ #
17
+ # * +candidates+: specifies which method to use for finding
18
+ # candidates. See the README for more details.
19
+ # * +ngram_overlaps+: Minimum number of ngram overlaps, defaults
20
+ # to 3.
21
+ # * +simhash_max_hamming+: Maximum simhash hamming distance,
22
+ # defaults to 7.
23
+ def initialize(options = {})
24
+ super(options)
25
+ end
26
+
27
+ # Match each item in +needles+ to an item in +haystack+. Returns
28
+ # an array of tuples, <tt>[needle, haystack, score]</tt>. Scores
29
+ # range from 0 to 1, with 1 being a perfect match and 0 being a
30
+ # terrible match.
31
+ def matches(needles, haystack)
32
+ # create Elements
33
+ if needles == haystack
34
+ needles = haystack = import_list(needles)
35
+
36
+ # set the corpus, to generate frequency weights
37
+ self.corpus = needles
38
+ else
39
+ needles = import_list(needles)
40
+ haystack = import_list(haystack)
41
+
42
+ # set the corpus, to generate frequency weights
43
+ self.corpus = (needles + haystack)
44
+ end
45
+
46
+ # get candidate matches
47
+ candidates = candidates(needles, haystack)
48
+ vputs " got #{candidates.length} candidates."
49
+
50
+ # pick winners
51
+ winners(needles, candidates)
52
+ end
53
+
54
+ protected
55
+
56
+ # Find candidates from +needles+ & +haystack+. The method used
57
+ # depends on the value of options[:candidates]
58
+ def candidates(needles, haystack)
59
+ method = options[:candidates]
60
+ method ||= (needles.length * haystack.length < 200000) ? :all : :simhash
61
+
62
+ case method
63
+ when /^ngrams=(\d+)$/
64
+ method = :ngrams
65
+ options[:ngram_overlaps] = $1.to_i
66
+ when /^simhash=(\d+)$/
67
+ method = :simhash
68
+ options[:simhash_max_hamming] = $1.to_i
69
+ end
70
+
71
+ method = "candidates_#{method}".to_sym
72
+ if !respond_to?(method, true)
73
+ raise "unsupported options[:candidates] #{options[:candidates].inspect}"
74
+ end
75
+
76
+ vputs "Using #{method} with needles=#{needles.length} haystack=#{haystack.length}..."
77
+ self.send(method, needles, haystack).map do |n, h|
78
+ Candidate.new(self, n, h)
79
+ end
80
+ end
81
+
82
+ # Return ALL candidates. This only works for small datasets.
83
+ def candidates_all(needles, haystack)
84
+ needles.product(haystack)
85
+ end
86
+
87
+ # Return candidates that share at least +ngram_overlaps+ ngrams
88
+ # (three by default). Only works for small datasets.
89
+ def candidates_ngrams(needles, haystack)
90
+ ngram_overlaps = options[:ngram_overlaps] || DEFAULT_NGRAM_OVERLAPS
91
+
92
+ candidates = []
93
+ veach(" ngrams #{ngram_overlaps}", needles) do |n|
94
+ ngrams_set = Set.new(n.ngrams)
95
+ haystack.each do |h|
96
+ count = 0
97
+ h.ngrams.each do |ngram|
98
+ if ngrams_set.include?(ngram)
99
+ if (count += 1) == ngram_overlaps
100
+ candidates << [n, h]
101
+ break
102
+ end
103
+ end
104
+ end
105
+ end
106
+ end
107
+ candidates
108
+ end
109
+
110
+ # Find candidates that are close based on hamming distance between
111
+ # the simhashes.
112
+ def candidates_simhash(needles, haystack)
113
+ max_hamming = options[:simhash_max_hamming] || DEFAULT_SIMHASH_MAX_HAMMING
114
+
115
+ # calculate this first so we get a nice progress bar
116
+ veach(" simhash", corpus) { |i| i.simhash }
117
+
118
+ # build the bk tree
119
+ bk = BK::Tree.new(lambda { |a, b| Bits.hamming32(a.simhash, b.simhash) })
120
+ veach(" bktree", haystack) { |i| bk.add(i) }
121
+
122
+ # search for candidates with low hamming distance
123
+ candidates = []
124
+ veach(" hamming #{max_hamming}", needles) do |n|
125
+ bk.query(n, max_hamming).each do |h, distance|
126
+ candidates << [n, h]
127
+ end
128
+ end
129
+ candidates
130
+ end
131
+
132
+ # walk candidates by score, pick winners
133
+ def winners(needles, candidates)
134
+ # calculate this first so we get a nice progress bar
135
+ veach("Scoring", candidates) { |i| i.score }
136
+
137
+ # sort the candidates by descending score
138
+ candidates = candidates.sort_by { |i| -i.score }
139
+
140
+ # walk them, eliminate dups
141
+ seen = Set.new
142
+ winners = candidates.map do |i|
143
+ next if seen.include?(i.a) || seen.include?(i.b)
144
+ seen << i.a
145
+ seen << i.b
146
+ i
147
+ end.compact
148
+
149
+ # build map from needle => candidate...
150
+ needle_to_winner = { }
151
+ winners.each { |i| needle_to_winner[i.a] = i }
152
+
153
+ # so we can return in the original order
154
+ needles.map do |i|
155
+ if candidate = needle_to_winner[i]
156
+ [ i.opaque, candidate.b.opaque, candidate.score ]
157
+ else
158
+ [ i.opaque, nil, nil ]
159
+ end
160
+ end
161
+ end
162
+ end
163
+ end
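The winner selection above is a greedy one-to-one assignment: take pairs in descending score order and accept a pair only if neither side has been used yet. Here is a minimal sketch with a hypothetical `Pair` struct. It tracks the two sides in separate sets, which is a simplification of the single `seen` set used above, and the 0.70 score is made up.

```ruby
require "set"

# Hypothetical stand-in for a scored candidate pair.
Pair = Struct.new(:needle, :hay, :score)

# Greedy one-to-one matching: best scores first, each needle and each
# haystack entry may be used at most once.
def pick_winners(pairs)
  used_needles = Set.new
  used_hay = Set.new
  pairs.sort_by { |c| -c.score }.select do |c|
    next false if used_needles.include?(c.needle) || used_hay.include?(c.hay)
    used_needles << c.needle
    used_hay << c.hay
    true
  end
end

pairs = [
  Pair.new("7 Quail", "7 QUAIL/126 Dune Lane", 0.46),
  Pair.new("1 Quail Street", "1 Quail", 0.90),
  Pair.new("7 Quail", "1 Quail", 0.70), # made-up score
]
pick_winners(pairs).each do |c|
  puts format("%.3f %s => %s", c.score, c.needle, c.hay)
end
```

Because the selection is greedy, a needle can lose its best haystack entry to a higher-scoring pair and fall back to a weaker match, or to none at all.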
@@ -0,0 +1,46 @@
1
+ module Simhilarity
2
+ # A potential match between two +Elements+. It can calculate its own score.
3
+ class Candidate
4
+ # matcher that owns this guy
5
+ attr_reader :matcher
6
+
7
+ # first half of the candidate pair - the needle.
8
+ attr_reader :a
9
+
10
+ # second half of the candidate pair - the haystack.
11
+ attr_reader :b
12
+
13
+ def initialize(matcher, a, b) #:nodoc:
14
+ @matcher = matcher
15
+ @a = a
16
+ @b = b
17
+ end
18
+
19
+ # Calculate the score for this +Candidate+. The score is the {dice
20
+ # coefficient}[http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient],
21
+ # <tt>(2*c)/(a+b)</tt>.
22
+ #
23
+ # * +a+: the frequency weighted sum of the ngrams in a
24
+ # * +b+: the frequency weighted sum of the ngrams in b
25
+ # * +c+: the frequency weighted sum of the ngrams in (a & b)
26
+ #
27
+ # Lazily calculated and memoized.
28
+ def score
29
+ @score ||= begin
30
+ c = (self.a.ngrams & self.b.ngrams)
31
+ if c.length > 0
32
+ a = self.a.ngrams_sum
33
+ b = self.b.ngrams_sum
34
+ c = matcher.ngrams_sum(c)
35
+ (2.0 * c) / (a + b)
36
+ else
37
+ 0
38
+ end
39
+ end
40
+ end
41
+
42
+ def to_s #:nodoc:
43
+ "Candidate #{score}: #{a.inspect}..#{b.inspect}"
44
+ end
45
+ end
46
+ end
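To make the formula concrete, here is a small worked sketch. The weighting function is an assumption: the code that turns corpus frequencies into weights is not shown in this file, so `weight` below simply uses inverse frequency, and the counts in `COUNTS` are made up.

```ruby
# Made-up corpus counts; rarer ngrams get larger weights.
COUNTS = { "se" => 40, "ea" => 35, "15" => 2, "50" => 1, "1504" => 1 }

# Assumed weighting (inverse frequency); the gem's real formula may differ.
def weight(ngram)
  1.0 / COUNTS.fetch(ngram, 1)
end

def weighted_sum(ngrams)
  ngrams.sum { |ng| weight(ng) }
end

# Dice coefficient: 2*c / (a + b), with c summed over the shared ngrams.
def dice(a_ngrams, b_ngrams)
  shared = a_ngrams & b_ngrams
  return 0.0 if shared.empty?
  (2.0 * weighted_sum(shared)) / (weighted_sum(a_ngrams) + weighted_sum(b_ngrams))
end

a = %w(se ea 15 50 1504)
b = %w(se ea 15 50 1504)
c = %w(se ea)
puts dice(a, b)
puts dice(a, c)
puts dice(a, %w(zz))
```

With these weights, sharing only the common ngrams "se" and "ea" yields a low score, because the rare digit ngrams dominate the weighted sums.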
@@ -0,0 +1,50 @@
1
+ require "set"
2
+
3
+ module Simhilarity
4
+ # Internal wrapper around opaque items from user. This mostly exists
5
+ # to cache stuff that's expensive, like the ngrams.
6
+ class Element
7
+ # matcher that owns this guy
8
+ attr_reader :matcher
9
+
10
+ # opaque object from the user
11
+ attr_reader :opaque
12
+
13
+ def initialize(matcher, opaque) #:nodoc:
14
+ @matcher = matcher
15
+ @opaque = opaque
16
+ end
17
+
18
+ # Text string generated from +opaque+ via Matcher#read. Lazily
19
+ # calculated.
20
+ def str
21
+ @str ||= matcher.normalize(matcher.read(opaque))
22
+ end
23
+
24
+ # List of ngrams generated from +str+ via
25
+ # Matcher#ngrams. Lazily calculated.
26
+ def ngrams
27
+ @ngrams ||= matcher.ngrams(str)
28
+ end
29
+
30
+ # Weighted frequency sum of +ngrams+ via
31
+ # Matcher#ngrams_sum. Lazily calculated.
32
+ def ngrams_sum
33
+ @ngrams_sum ||= matcher.ngrams_sum(ngrams)
34
+ end
35
+
36
+ # Weighted simhash of +ngrams+ via Matcher#simhash. Lazily
37
+ # calculated.
38
+ def simhash
39
+ @simhash ||= matcher.simhash(ngrams)
40
+ end
41
+
42
+ def to_s #:nodoc:
43
+ str
44
+ end
45
+
46
+ def inspect #:nodoc:
47
+ str.inspect
48
+ end
49
+ end
50
+ end
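The `@ivar ||= ...` idiom used throughout this class computes a value on first access and reuses it afterwards. A minimal sketch with a hypothetical `LazyWords` class that counts how often the expensive step actually runs:

```ruby
# Hypothetical class showing the `@ivar ||= ...` caching pattern.
class LazyWords
  attr_reader :computations

  def initialize(text)
    @text = text
    @computations = 0
  end

  # Computed on first call, then served from the instance variable.
  def words
    @words ||= begin
      @computations += 1
      @text.downcase.split
    end
  end
end

lazy = LazyWords.new("Sea Crest 1504")
3.times { lazy.words }
puts lazy.computations
```

Note that `||=` recomputes whenever the cached value is `nil` or `false`. Strings, arrays, and numbers (including 0) are always truthy in Ruby, so values of those types are cached after the first call.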