fuzzy_tools 1.0.0

data/.rspec ADDED
@@ -0,0 +1 @@
+ --color
@@ -0,0 +1,12 @@
+ language: ruby
+ rvm:
+   - 1.8.7
+   - 1.9.2
+   - 1.9.3
+   - ruby-head
+   - jruby-18mode # JRuby in 1.8 mode
+   - jruby-19mode # JRuby in 1.9 mode
+   # - rbx-18mode
+   - rbx-19mode # currently in active development, may or may not work for your project
+ # uncomment this line if your project needs to run something other than `rake`:
+ # script: bundle exec rspec spec
data/Gemfile ADDED
@@ -0,0 +1,9 @@
+ source "http://rubygems.org"
+
+ gem 'simple_stats'
+ gem 'nokogiri', :platforms => [:mri_18, :mri_19, :jruby, :rbx]
+ gem 'perftools.rb', :platforms => [:mri_18, :mri_19], :require => false
+ gem 'rake'
+
+ # Specify your gem's dependencies in fuzzy_tools.gemspec
+ gemspec
@@ -0,0 +1,236 @@
+ # FuzzyTools [![Build Status](https://secure.travis-ci.org/brianhempel/fuzzy_tools.png)](http://travis-ci.org/brianhempel/fuzzy_tools)
+
+ FuzzyTools is a toolset for fuzzy searches in Ruby. The default algorithm has been tuned for accuracy (and reasonable speed) on 23 different [test files](https://github.com/brianhempel/fuzzy_tools/tree/master/accuracy/test_data/query_tests) gathered from [many sources](https://github.com/brianhempel/fuzzy_tools/blob/master/accuracy/test_data/sources/SOURCES.txt).
+
+ Because it's mostly Ruby, FuzzyTools is best for searching smaller datasets—say, less than 50 KB in size. Data cleaning or auto-complete over known options are potential uses.
+
+ Tested on Ruby 1.8.7, 1.9.2, 1.9.3, 2.0.0dev, JRuby (1.8 and 1.9 mode), and Rubinius (1.9 mode only).
+
+ ## Usage
+
+ Install with [Bundler](http://gembundler.com/):
+
+ ``` ruby
+ gem "fuzzy_tools"
+ ```
+
+ Install without Bundler:
+
+     gem install fuzzy_tools --no-ri --no-rdoc
+
+ Then, put it to work!
+
+ ``` ruby
+ require 'fuzzy_tools'
+
+ books = [
+   "Till We Have Faces",
+   "Ecclesiastes",
+   "The Prodigal God"
+ ]
+
+ # Search for a single object
+
+ books.fuzzy_find("facade")                                  # => "Till We Have Faces"
+ books.fuzzy_index.find("facade")                            # => "Till We Have Faces"
+ FuzzyTools::TfIdfIndex.new(:source => books).find("facade") # => "Till We Have Faces"
+
+ # Search for all matches, from best to worst
+
+ books.fuzzy_find_all("the")                             # => ["The Prodigal God", "Till We Have Faces"]
+ books.fuzzy_index.all("the")                            # => ["The Prodigal God", "Till We Have Faces"]
+ FuzzyTools::TfIdfIndex.new(:source => books).all("the") # => ["The Prodigal God", "Till We Have Faces"]
+
+ # You can also get scored results, if you need them
+
+ books.fuzzy_find_all_with_scores("the") # =>
+ # [
+ #   ["The Prodigal God",   0.443175985397319 ],
+ #   ["Till We Have Faces", 0.0102817553829306]
+ # ]
+ books.fuzzy_index.all_with_scores("the") # =>
+ # [
+ #   ["The Prodigal God",   0.443175985397319 ],
+ #   ["Till We Have Faces", 0.0102817553829306]
+ # ]
+ FuzzyTools::TfIdfIndex.new(:source => books).all_with_scores("the") # =>
+ # [
+ #   ["The Prodigal God",   0.443175985397319 ],
+ #   ["Till We Have Faces", 0.0102817553829306]
+ # ]
+ ```
+
+ FuzzyTools is not limited to searching strings. In fact, strings work simply because FuzzyTools indexes on `to_s` by default. You can index on any method you like.
+
+ ``` ruby
+ require 'fuzzy_tools'
+
+ Book = Struct.new(:title, :author)
+
+ books = [
+   Book.new("Till We Have Faces", "C.S. Lewis" ),
+   Book.new("Ecclesiastes",       "The Teacher"),
+   Book.new("The Prodigal God",   "Tim Keller" )
+ ]
+
+ books.fuzzy_find(:author => "timmy")
+ books.fuzzy_index(:attribute => :author).find("timmy")
+ FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).find("timmy")
+ # => #<struct Book title="The Prodigal God", author="Tim Keller">
+
+ books.fuzzy_find_all(:author => "timmy")
+ books.fuzzy_index(:attribute => :author).all("timmy")
+ FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).all("timmy")
+ # =>
+ # [
+ #   #<struct Book title="The Prodigal God", author="Tim Keller" >,
+ #   #<struct Book title="Ecclesiastes",     author="The Teacher">
+ # ]
+
+ books.fuzzy_find_all_with_scores(:author => "timmy")
+ books.fuzzy_index(:attribute => :author).all_with_scores("timmy")
+ FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).all_with_scores("timmy")
+ # =>
+ # [
+ #   [#<struct Book title="The Prodigal God", author="Tim Keller" >, 0.29874954780727  ],
+ #   [#<struct Book title="Ecclesiastes",     author="The Teacher">, 0.0117801403002398]
+ # ]
+ ```
+
+ If the objects to be searched are hashes, FuzzyTools indexes the specified hash value.
+
+ ```ruby
+ books = [
+   { :title => "Till We Have Faces", :author => "C.S. Lewis"  },
+   { :title => "Ecclesiastes",       :author => "The Teacher" },
+   { :title => "The Prodigal God",   :author => "Tim Keller"  }
+ ]
+
+ books.fuzzy_find(:author => "timmy")
+ # => { :title => "The Prodigal God", :author => "Tim Keller" }
+ ```
+
+ If you want to index on calculated data, such as a combination of fields, you can provide a proc.
+
+ ``` ruby
+ books.fuzzy_find("timmy", :attribute => lambda { |book| book.title + " " + book.author })
+ books.fuzzy_index(:attribute => lambda { |book| book.title + " " + book.author }).find("timmy")
+ FuzzyTools::TfIdfIndex.new(:source => books, :attribute => lambda { |book| book.title + " " + book.author }).find("timmy")
+ ```
+
+ ## Can it go faster?
+
+ If you need to do multiple searches on the same collection, grab a fuzzy index with `my_collection.fuzzy_index` and do finds on that. The `fuzzy_find` and `fuzzy_find_all` methods on Enumerable reindex every time they are called.
+
+ Here's a performance comparison:
+
+ ``` ruby
+ array_methods = Array.new.methods
+
+ Benchmark.bm(20) do |b|
+   b.report("fuzzy_find") do
+     1000.times { array_methods.fuzzy_find("juice") }
+   end
+
+   b.report("fuzzy_index.find") do
+     index = array_methods.fuzzy_index
+     1000.times { index.find("juice") }
+   end
+ end
+ ```
+
+ ```
+                            user     system      total        real
+ fuzzy_find            29.250000   0.040000  29.290000 ( 29.287992)
+ fuzzy_index.find       0.360000   0.000000   0.360000 (  0.360066)
+ ```
+
+ If you need even more speed, you can [try a different tokenizer](#specifying-your-own-tokenizer). Fewer tokens per document shortens the comparison time between documents, lessens the garbage collector load, and reduces the number of candidate documents for a given query.
+
+ If it's still too slow, [open an issue](https://github.com/brianhempel/fuzzy_tools/issues) and perhaps we can figure out what can be done.
+
+ ## How does it work?
+
+ FuzzyTools downcases and then tokenizes each value using a [hybrid combination](https://github.com/brianhempel/fuzzy_tools/blob/master/lib/fuzzy/tokenizers.rb#L20-27) of words, [character bigrams](http://en.wikipedia.org/wiki/N-gram), [Soundex](http://en.wikipedia.org/wiki/Soundex), and words without vowels.
+
+ ``` ruby
+ FuzzyTools::Tokenizers::HYBRID.call("Till We Have Faces")
+ # => ["T400", "W000", "H100", "F220", "_t", "ti", "il", "ll", "l ", " w",
+ #     "we", "e ", " h", "ha", "av", "ve", "e ", " f", "fa", "ac", "ce",
+ #     "es", "s_", "tll", "w", "hv", "fcs", "till", "we", "have", "faces"]
+ ```
+
+ Gross, eh? But that's what worked best on the [test data sets](https://github.com/brianhempel/fuzzy_tools/tree/master/accuracy/test_data/query_tests).
+
+ The tokens are weighted using [Term Frequency * Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf*idf), which basically assigns higher weights to the tokens that occur in fewer documents.
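For intuition, here's a toy in plain Ruby (a made-up, word-tokenized corpus, not the gem's internals) showing how inverse document frequency falls out of a corpus: tokens shared by many documents get a low IDF, tokens unique to one document get a high one. The gem then weights a token occurring `n` times in a document as `idf * ln(n + 1)` and normalizes each document's weights to unit length.

```ruby
# Hypothetical word-tokenized corpus -- three documents.
docs = {
  "The Prodigal God"   => %w[the prodigal god],
  "Ecclesiastes"       => %w[the teacher],
  "Till We Have Faces" => %w[till we have faces]
}

# Document frequency: in how many documents does each token appear?
df = Hash.new(0)
docs.each_value { |tokens| tokens.uniq.each { |token| df[token] += 1 } }

# IDF: log of (number of documents / document frequency).
n_docs = docs.size.to_f
idf = {}
df.each { |token, n| idf[token] = Math.log(n_docs / n) }

idf["the"] # => ln(3/2), low  -- "the" appears in two of the three documents
idf["god"] # => ln(3),   high -- "god" is unique to one document
```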
+
+ ```ruby
+ # hacky introspection here--don't do this!
+ index = books.fuzzy_index(:attribute => :author)
+ index.instance_variable_get(:@document_tokens)["The Teacher"].weights.sort_by { |k,v| [-v,k] }
+ # =>
+ # [
+ #   ["he",      0.3910],
+ #   ["th",      0.3910],
+ #   [" t",      0.2467],
+ #   ["T000",    0.2467],
+ #   ["T260",    0.2467],
+ #   ["ac",      0.2467],
+ #   ["ch",      0.2467],
+ #   ["e ",      0.2467],
+ #   ["ea",      0.2467],
+ #   ["tchr",    0.2467],
+ #   ["te",      0.2467],
+ #   ["teacher", 0.2467],
+ #   ["the",     0.2467],
+ #   ["_t",      0.0910],
+ #   ["er",      0.0910],
+ #   ["r_",      0.0910]
+ # ]
+ ```
+
+ When you do a query, that query string is tokenized and weighted, then compared against some of the documents using [Cosine Similarity](http://www.gettingcirrius.com/2010/12/calculating-similarity-part-1-cosine.html). Cosine similarity is not that terrible of a concept, assuming you like terms like "N-dimensional space". Basically, each unique token becomes an axis in N-dimensional space. If we had 4 different tokens in all, we'd use 4-D space. A document's token weights define a vector in this space. The _cosine_ of the _angle_ between documents' vectors becomes the similarity between the documents.
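Since the weight vectors are normalized to unit length, that cosine is just the dot product over shared tokens. A minimal sketch in plain Ruby (hypothetical tokens and weights, not the gem's implementation):

```ruby
# Normalize a token => weight hash to unit length.
def normalize(weights)
  length = Math.sqrt(weights.values.reduce(0.0) { |sum, w| sum + w * w })
  weights.each_with_object({}) { |(token, w), h| h[token] = w / length }
end

a = normalize("th" => 1.0, "he" => 1.0, "e_" => 1.0)
b = normalize("th" => 1.0, "he" => 1.0, "ha" => 1.0, "av" => 1.0)

# Sum the weight products over shared tokens; because both vectors are
# unit length, this dot product is the cosine of the angle between them.
similarity = a.reduce(0.0) { |sum, (token, w)| sum + w * (b[token] || 0.0) }
similarity # => ~0.577, i.e. 2 * (1/sqrt(3)) * (1/2)
```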
+
+ Trust me, it works.
+
+ ## Specifying your own tokenizer
+
+ If the default tokenizer isn't working for your data or you need more speed, you can try swapping out the tokenizers. You can use one of the tokenizers defined in [`FuzzyTools::Tokenizers`](https://github.com/brianhempel/fuzzy_tools/blob/master/lib/fuzzy/tokenizers.rb), or you can write your own.
+
+ ``` ruby
+ # a predefined tokenizer
+ books.fuzzy_find("facade", :tokenizer => FuzzyTools::Tokenizers::CHARACTERS)
+ books.fuzzy_index(:tokenizer => FuzzyTools::Tokenizers::CHARACTERS).find("facade")
+ FuzzyTools::TfIdfIndex.new(:source => books, :tokenizer => FuzzyTools::Tokenizers::CHARACTERS).find("facade")
+
+ # roll your own
+ punctuation_normalizer = lambda { |str| str.downcase.split.map { |word| word.gsub(/\W/, '') } }
+ books.fuzzy_find("facade", :tokenizer => punctuation_normalizer)
+ books.fuzzy_index(:tokenizer => punctuation_normalizer).find("facade")
+ FuzzyTools::TfIdfIndex.new(:source => books, :tokenizer => punctuation_normalizer).find("facade")
+ ```
+
+ ## I've heard of Soft TF-IDF. It's supposed to be better than TF-IDF.
+
+ Despite the impressive graphs, the "Soft TF-IDF" described in [WW Cohen, P Ravikumar, and SE Fienberg, A comparison of string distance metrics for name-matching tasks, IIWEB, pages 73-78, 2003](http://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf) didn't give me good results. In the paper, they tokenized by word. A standard TF-IDF tokenized by character 4-grams or 5-grams may have been more effective.
+
+ In my tests, the word-tokenized Soft TF-IDF was significantly slower and considerably less accurate than a standard TF-IDF with n-gram tokenization.
+
+ ## Help make it better!
+
+ Need something added? Please [open an issue](https://github.com/brianhempel/fuzzy_tools/issues)! Or, even better, code it yourself and send a pull request:
+
+     # fork it on github, then clone:
+     git clone git@github.com:your_username/fuzzy_tools.git
+     bundle install
+     rspec
+     # hack away
+     git push
+     # then make a pull request
+
+ ## Acknowledgements
+
+ The [SecondString](http://secondstring.sourceforge.net/) source code was a valuable reference.
+
+ ## License
+
+ Authored by Brian Hempel. Public domain, no restrictions.
@@ -0,0 +1,29 @@
+ require 'bundler'
+ Bundler::GemHelper.install_tasks
+
+ Dir[File.expand_path('../accuracy/**/*.task', __FILE__)].each { |f| load f }
+ Dir[File.expand_path('../performance/**/*.task', __FILE__)].each { |f| load f }
+
+ task :default => :test
+
+ require 'rspec/core/rake_task'
+ desc "Run the tests"
+ RSpec::Core::RakeTask.new(:test)
+
+ desc "Launch an IRB session with the gem required"
+ task :console do
+   $:.unshift(File.dirname(__FILE__) + '/../lib')
+
+   require 'fuzzy_tools'
+   require 'irb'
+
+   IRB.setup(nil)
+   irb = IRB::Irb.new
+
+   IRB.conf[:MAIN_CONTEXT] = irb.context
+
+   irb.context.evaluate("require 'irb/completion'", 0)
+
+   trap("SIGINT") { irb.signal_handle }
+   catch(:IRB_EXIT) { irb.eval_input }
+ end
@@ -0,0 +1,24 @@
+ # -*- encoding: utf-8 -*-
+ $:.push File.expand_path("../lib", __FILE__)
+ require "fuzzy_tools/version"
+
+ Gem::Specification.new do |s|
+   s.name        = "fuzzy_tools"
+   s.version     = FuzzyTools::VERSION
+   s.platform    = Gem::Platform::RUBY
+   s.authors     = ["Brian Hempel"]
+   s.email       = ["plasticchicken@gmail.com"]
+   s.homepage    = "https://github.com/brianhempel/fuzzy_tools"
+   s.summary     = %q{Easy, high quality fuzzy search in Ruby.}
+   s.description = %q{Easy, high quality fuzzy search in Ruby.}
+
+   s.files         = `git ls-files | grep --invert-match --extended-regexp '^(accuracy|performance)/'`.split("\n")
+   s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
+   s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+   s.require_paths = ["lib"]
+
+   s.add_dependency 'RubyInline'
+
+   s.add_development_dependency 'bundler'
+   s.add_development_dependency 'rspec'
+ end
@@ -0,0 +1,4 @@
+ require 'fuzzy_tools/helpers'
+ require 'fuzzy_tools/index'
+ require 'fuzzy_tools/tf_idf_index'
+ require 'fuzzy_tools/core_ext/enumerable'
@@ -0,0 +1,41 @@
+ require 'fuzzy_tools/index'
+
+ module Enumerable
+   def fuzzy_find(*args)
+     query, options = parse_fuzzy_finder_arguments(args)
+     fuzzy_index(options).find(query)
+   end
+
+   def fuzzy_find_all(*args)
+     query, options = parse_fuzzy_finder_arguments(args)
+     fuzzy_index(options).all(query)
+   end
+
+   def fuzzy_find_all_with_scores(*args)
+     query, options = parse_fuzzy_finder_arguments(args)
+     fuzzy_index(options).all_with_scores(query)
+   end
+
+   def fuzzy_index(options = {})
+     options = options.merge(:source => self)
+     FuzzyTools::TfIdfIndex.new(options)
+   end
+
+   private
+
+   def parse_fuzzy_finder_arguments(args)
+     index_option_keys = [:tokenizer]
+
+     if args.first.is_a? Hash
+       args    = args.first.dup
+       options = {}
+       index_option_keys.each do |key|
+         options[key] = args.delete(key) if args.has_key?(key)
+       end
+       options[:attribute], query = args.first
+       [query, options]
+     else
+       [args[0], args[1] || {}]
+     end
+   end
+ end
@@ -0,0 +1,133 @@
+ module FuzzyTools
+   module Helpers
+     extend self
+
+     def term_counts(enumerator)
+       {}.tap do |counts|
+         enumerator.each do |e|
+           counts[e] ||= 0
+           counts[e] += 1
+         end
+       end
+     end
+
+     def bigrams(str)
+       ngrams(str, 2)
+     end
+
+     def trigrams(str)
+       ngrams(str, 3)
+     end
+
+     def tetragrams(str)
+       ngrams(str, 4)
+     end
+
+     def ngrams(str, n)
+       ends = "_" * (n - 1)
+       str  = "#{ends}#{str}#{ends}"
+
+       (0..str.length - n).map { |i| str[i,n] }
+     end
+
+     if RUBY_DESCRIPTION !~ /^ruby/ # rbx, jruby
+
+       SOUNDEX_LETTERS_TO_CODES = {
+         'A' => 0, 'B' => 1, 'C' => 2, 'D' => 3, 'E' => 0, 'F' => 1,
+         'G' => 2, 'H' => 0, 'I' => 0, 'J' => 2, 'K' => 2,
+         'L' => 4, 'M' => 5, 'N' => 5, 'O' => 0, 'P' => 1,
+         'Q' => 2, 'R' => 6, 'S' => 2, 'T' => 3, 'U' => 0,
+         'V' => 1, 'W' => 0, 'X' => 2, 'Y' => 0, 'Z' => 2
+       }
+
+       # Ruby port of the C below
+       def soundex(str)
+         soundex = "Z000"
+         chars   = str.upcase.chars.to_a
+         first_letter = chars.shift until (last_numeral = first_letter && SOUNDEX_LETTERS_TO_CODES[first_letter]) || chars.size == 0
+
+         return soundex unless last_numeral
+
+         soundex[0] = first_letter
+
+         i = 1
+         while i < 4 && chars.size > 0
+           char = chars.shift
+           next unless numeral = SOUNDEX_LETTERS_TO_CODES[char]
+           if numeral != last_numeral
+             last_numeral = numeral
+             if numeral != 0
+               soundex[i] = numeral.to_s
+               i += 1
+             end
+           end
+         end
+
+         soundex
+       end
+
+     else # MRI
+
+       require 'inline'
+
+       # http://en.literateprograms.org/Soundex_(C)
+       inline(:C) do |builder|
+         builder.include '<ctype.h>'
+         builder.c_raw <<-EOC
+           static VALUE soundex(int argc, VALUE *argv, VALUE self) {
+             VALUE ruby_str = argv[0];
+             char * in;
+
+             static int code[] =
+               { 0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2 };
+             /*  a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z */
+             static char key[5];
+             register char ch;
+             register int last;
+             register int count;
+
+             Check_Type(ruby_str, T_STRING);
+
+             in = StringValueCStr(ruby_str);
+
+             /* Set up default key, complete with trailing '0's */
+             strcpy(key, "Z000");
+
+             /* Advance to the first letter. If none present,
+                return default key */
+             while (*in != '\\0' && !isalpha(*in))
+               ++in;
+             if (*in == '\\0')
+               return rb_str_new2(key);
+
+             /* Pull out the first letter, uppercase it, and
+                set up for main loop */
+             key[0] = toupper(*in);
+             last = code[key[0] - 'A'];
+             ++in;
+
+             /* Scan rest of string, stop at end of string or
+                when the key is full */
+             for (count = 1; count < 4 && *in != '\\0'; ++in) {
+               /* If non-alpha, ignore the character altogether */
+               if (isalpha(*in)) {
+                 ch = tolower(*in);
+                 /* Fold together adjacent letters sharing the same code */
+                 if (last != code[ch - 'a']) {
+                   last = code[ch - 'a'];
+                   /* Ignore code==0 letters except as separators */
+                   if (last != 0)
+                     key[count++] = '0' + last;
+                 }
+               }
+             }
+
+             return rb_str_new2(key);
+           }
+         EOC
+       end
+
+     end
+
+   end
+ end
@@ -0,0 +1,41 @@
+ require 'fuzzy_tools/helpers'
+ require 'fuzzy_tools/tokenizers'
+
+ module FuzzyTools
+   class Index
+     attr_reader :source, :indexed_attribute
+
+     def initialize(options = {})
+       @source            = options[:source]
+       @indexed_attribute = options[:attribute] || :to_s
+       build_index
+     end
+
+     def find(query)
+       result, score = unsorted_scored_results(query).max_by { |doc, score| [score, document_attribute(doc)] }
+       result
+     end
+
+     def all(query)
+       all_with_scores(query).map(&:first)
+     end
+
+     def all_with_scores(query)
+       unsorted_scored_results(query).sort_by { |doc, score| [-score, document_attribute(doc)] }
+     end
+
+     private
+
+     def each_attribute_and_document(&block)
+       source.each do |document|
+         yield(document_attribute(document), document)
+       end
+     end
+
+     def document_attribute(document)
+       return @indexed_attribute.call(document) if @indexed_attribute.is_a?(Proc)
+       return document[@indexed_attribute]      if document.is_a?(Hash)
+       document.send(@indexed_attribute)
+     end
+   end
+ end
@@ -0,0 +1,106 @@
+ require 'set'
+ require 'fuzzy_tools/index'
+ require 'fuzzy_tools/weighted_document_tokens'
+
+ module FuzzyTools
+   class TfIdfIndex < Index
+     class Token
+       attr_accessor :documents, :idf
+
+       def initialize
+         @documents = Set.new
+       end
+     end
+
+     def self.default_tokenizer
+       FuzzyTools::Tokenizers::HYBRID
+     end
+
+     attr_reader :tokenizer
+
+     def initialize(options = {})
+       @tokenizer = options[:tokenizer] || self.class.default_tokenizer
+       super
+     end
+
+     def tokenize(str)
+       tokenizer.call(str.to_s)
+     end
+
+     def unsorted_scored_results(query)
+       query_weighted_tokens = WeightedDocumentTokens.new(tokenize(query), :weight_function => weight_function)
+
+       candidates = select_candidate_documents(query, query_weighted_tokens)
+
+       candidates.map do |candidate|
+         candidate_tokens = @document_tokens[document_attribute(candidate)]
+
+         score = self.score(query_weighted_tokens, candidate_tokens)
+
+         [candidate, score]
+       end
+     end
+
+     def score(weighted_tokens_1, weighted_tokens_2)
+       weighted_tokens_1.cosine_similarity(weighted_tokens_2)
+     end
+
+     def select_candidate_documents(query, query_weighted_tokens)
+       candidates = Set.new
+       check_all_threshold = @source_count * 0.75 # this threshold works best on the accuracy data
+       query_weighted_tokens.tokens.each do |query_token|
+         if tf_idf_token = @tf_idf_tokens[query_token]
+           next if tf_idf_token.idf < @idf_cutoff
+           candidates.merge(tf_idf_token.documents)
+           if candidates.size > check_all_threshold
+             candidates = source
+             break
+           end
+         end
+       end
+       candidates
+     end
+
+     private
+
+     # consolidate the same strings together
+     # lowers GC load
+     def tokenize_consolidated(str)
+       tokenize(str).map { |token| @token_table[token] ||= token }
+     end
+
+     def clear_token_table
+       @token_table = {}
+     end
+
+     def build_index
+       @source_count = source.count
+       clear_token_table
+       @tf_idf_tokens = {}
+       each_attribute_and_document do |attribute, document|
+         tokenize_consolidated(attribute).each do |token_str|
+           @tf_idf_tokens[token_str] ||= Token.new
+           @tf_idf_tokens[token_str].documents << document
+         end
+       end
+       @tf_idf_tokens.keys.each do |token_str|
+         @tf_idf_tokens[token_str].idf = Math.log(@source_count.to_f / @tf_idf_tokens[token_str].documents.size)
+       end
+       @document_tokens = {}
+       each_attribute_and_document do |attribute, document|
+         tokens = @document_tokens[attribute] = WeightedDocumentTokens.new(tokenize_consolidated(attribute), :weight_function => weight_function)
+       end
+       clear_token_table
+       idfs = @tf_idf_tokens.values.map(&:idf).sort
+       @idf_cutoff = (idfs[idfs.size/16] || 0.0) / 2.0
+     end
+
+     def weight_function
+       @weight_function ||= lambda do |token, n|
+         # secondstring gives unknown tokens a df of 1
+         idf = @tf_idf_tokens[token] ? @tf_idf_tokens[token].idf : Math.log(@source_count.to_f)
+         idf * Math.log(n + 1)
+       end
+     end
+   end
+ end
@@ -0,0 +1,30 @@
+ module FuzzyTools
+   module Tokenizers
+
+     CHARACTERS           = lambda { |str| str.chars }
+     CHARACTERS_DOWNCASED = lambda { |str| str.downcase.chars }
+     BIGRAMS              = lambda { |str| FuzzyTools::Helpers.ngrams(str, 2) }
+     BIGRAMS_DOWNCASED    = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 2) }
+     TRIGRAMS             = lambda { |str| FuzzyTools::Helpers.ngrams(str, 3) }
+     TRIGRAMS_DOWNCASED   = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 3) }
+     TETRAGRAMS           = lambda { |str| FuzzyTools::Helpers.ngrams(str, 4) }
+     TETRAGRAMS_DOWNCASED = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 4) }
+     PENTAGRAMS           = lambda { |str| FuzzyTools::Helpers.ngrams(str, 5) }
+     PENTAGRAMS_DOWNCASED = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 5) }
+     HEXAGRAMS            = lambda { |str| FuzzyTools::Helpers.ngrams(str, 6) }
+     HEXAGRAMS_DOWNCASED  = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 6) }
+
+     WORDS           = lambda { |str| str.split }
+     WORDS_DOWNCASED = lambda { |str| str.downcase.split }
+
+     HYBRID = lambda do |str|
+       str   = str.downcase
+       words = str.split
+       words.map { |word| FuzzyTools::Helpers.soundex(word) } +
+         FuzzyTools::Helpers.ngrams(str.downcase, 2) +
+         words.map { |word| word.gsub(/[aeiou]/, '') } +
+         words
+     end
+
+   end
+ end
@@ -0,0 +1,3 @@
+ module FuzzyTools
+   VERSION = "1.0.0"
+ end
@@ -0,0 +1,88 @@
+ require 'fuzzy_tools/helpers'
+
+ module FuzzyTools
+   class WeightedDocumentTokens
+     attr_reader :weights
+
+     def initialize(tokens, options)
+       weight_function = options[:weight_function]
+       set_token_weights(tokens, &weight_function)
+     end
+
+     if RUBY_DESCRIPTION !~ /^ruby/
+
+       # Rubinius and JRuby
+       def cosine_similarity(other)
+         # equivalent to the C below, but the C is >2x faster
+         similarity    = 0.0
+         other_weights = other.weights
+         @weights.each do |token, weight|
+           if other_weight = other_weights[token]
+             similarity += other_weight*weight
+           end
+         end
+         similarity
+       end
+
+     else
+
+       # MRI
+
+       require 'inline'
+
+       def cosine_similarity(other)
+         cosine_similarity_fast(@weights, tokens, other.weights)
+       end
+
+       inline(:C) do |builder|
+         builder.c_raw <<-EOC
+           static VALUE cosine_similarity_fast(int argc, VALUE *argv, VALUE self) {
+             double similarity = 0.0;
+             VALUE my_weights    = argv[0];
+             VALUE my_tokens     = argv[1];
+             VALUE other_weights = argv[2];
+             int i;
+             VALUE token;
+             VALUE my_weight;
+             VALUE other_weight;
+
+             for(i = 0; i < RARRAY_LEN(RARRAY(my_tokens)); i++) {
+               token        = RARRAY_PTR(RARRAY(my_tokens))[i];
+               other_weight = rb_hash_aref(other_weights, token);
+               if (other_weight != Qnil) {
+                 my_weight   = rb_hash_aref(my_weights, token);
+                 similarity += NUM2DBL(my_weight)*NUM2DBL(other_weight);
+               }
+             }
+
+             return rb_float_new(similarity);
+           }
+         EOC
+       end
+
+     end
+
+     def tokens
+       @tokens ||= @weights.keys
+     end
+
+     private
+
+     def set_token_weights(tokens, &block)
+       @weights = {}
+       counts = FuzzyTools::Helpers.term_counts(tokens)
+       counts.each do |token, n|
+         @weights[token] = yield(token, n)
+       end
+       normalize_weights
+       @weights
+     end
+
+     def normalize_weights
+       length = Math.sqrt(weights.values.reduce(0.0) { |sum, w| sum + w*w })
+       weights.each do |token, w|
+         weights[token] /= length
+       end
+     end
+   end
+ end
@@ -0,0 +1,124 @@
+ require 'spec_helper'
+ require 'set'
+
+ describe Enumerable do
+   before :each do
+     @till_we_have_faces = Book.new("Till We Have Faces", "C.S. Lewis" )
+     @ecclesiastes       = Book.new("Ecclesiastes",       "The Teacher")
+     @the_prodigal_god   = Book.new("The Prodigal God",   "Tim Keller" )
+
+     @books = [
+       @till_we_have_faces,
+       @ecclesiastes,
+       @the_prodigal_god
+     ].each
+   end
+
+   describe "#fuzzy_find" do
+     it "works with simple query syntax" do
+       @books.fuzzy_find("the").should == @ecclesiastes
+     end
+
+     it "works with :attribute => query syntax" do
+       @books.fuzzy_find(:title => "the").should == @the_prodigal_god
+     end
+
+     context "passes :tokenizer through to the index" do
+       before(:each) { @letter_count_tokenizer = lambda { |str| str.size.to_s } }
+
+       it "passes :tokenizer through to the index with simple query syntax" do
+         FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer)
+         begin
+           @books.fuzzy_find("the", :tokenizer => @letter_count_tokenizer)
+         rescue
+         end
+       end
+
+       it "passes :tokenizer through to the index with :attribute => query syntax" do
+         FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer, :attribute => :title)
+         begin
+           @books.fuzzy_find(:title => "the", :tokenizer => @letter_count_tokenizer)
+         rescue
+         end
+       end
+     end
+   end
+
+   describe "#fuzzy_find_all" do
+     it "works with simple query syntax" do
+       @books.fuzzy_find_all("the").should == [@ecclesiastes, @the_prodigal_god, @till_we_have_faces]
+     end
+
+     it "works with :attribute => query syntax" do
+       @books.fuzzy_find_all(:title => "the").should == [@the_prodigal_god, @till_we_have_faces]
+     end
+
+     context "passes :tokenizer through to the index" do
+       before(:each) { @letter_count_tokenizer = lambda { |str| str.size.to_s } }
+
+       it "passes :tokenizer through to the index with simple query syntax" do
+         FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer)
+         begin
+           @books.fuzzy_find_all("the", :tokenizer => @letter_count_tokenizer)
+         rescue
+         end
+       end
+
+       it "passes :tokenizer through to the index with :attribute => query syntax" do
+         FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer, :attribute => :title)
+         begin
+           @books.fuzzy_find_all(:title => "the", :tokenizer => @letter_count_tokenizer)
+         rescue
+         end
+       end
+     end
+   end
+
+   describe "#fuzzy_find_all_with_scores" do
+     it "works with simple query syntax" do
+       results = @books.fuzzy_find_all_with_scores("the")
+
+       results.map(&:first).should == [@ecclesiastes, @the_prodigal_god, @till_we_have_faces]
+       results.sort_by { |doc, score| -score }.should == results
+     end
+
+     it "works with :attribute => query syntax" do
+       results = @books.fuzzy_find_all_with_scores(:title => "the")
+
+       results.map(&:first).should == [@the_prodigal_god, @till_we_have_faces]
+       results.sort_by { |doc, score| -score }.should == results
+     end
+
+     context "passes :tokenizer through to the index" do
+       before(:each) { @letter_count_tokenizer = lambda { |str| str.size.to_s } }
+
+       it "passes :tokenizer through to the index with simple query syntax" do
+         FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer)
+         begin
+           @books.fuzzy_find_all_with_scores("the", :tokenizer => @letter_count_tokenizer)
+         rescue
+         end
+       end
+
+       it "passes :tokenizer through to the index with :attribute => query syntax" do
+         FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer, :attribute => :title)
+         begin
+           @books.fuzzy_find_all_with_scores(:title => "the", :tokenizer => @letter_count_tokenizer)
+         rescue
+         end
+       end
+     end
+   end
+
+   describe "#fuzzy_index" do
+     it "returns a TfIdfIndex" do
+       @books.fuzzy_index.class.should == FuzzyTools::TfIdfIndex
+     end
+
+     it "passes options along to the index" do
+       letter_count_tokenizer = lambda { |str| str.size.to_s }
+       FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => letter_count_tokenizer, :attribute => :title)
+       @books.fuzzy_index(:attribute => :title, :tokenizer => letter_count_tokenizer)
+     end
+   end
+ end
@@ -0,0 +1,65 @@
+ require 'spec_helper'
+
+ describe FuzzyTools::Helpers do
+   describe ".ngrams" do
+
+     it "should do trigrams" do
+       FuzzyTools::Helpers.trigrams("hello").should == %w{
+         __h
+         _he
+         hel
+         ell
+         llo
+         lo_
+         o__
+       }
+     end
+
+     it "should do bigrams" do
+       FuzzyTools::Helpers.bigrams("hello").should == %w{
+         _h
+         he
+         el
+         ll
+         lo
+         o_
+       }
+     end
+
+     it "should do 1-grams" do
+       FuzzyTools::Helpers.ngrams("hello", 1).should == %w{
+         h
+         e
+         l
+         l
+         o
+       }
+     end
+
+     it "should do x-grams" do
+       FuzzyTools::Helpers.ngrams("hello", 4).should == %w{
+         ___h
+         __he
+         _hel
+         hell
+         ello
+         llo_
+         lo__
+         o___
+       }
+     end
+
+   end
+
+   describe ".soundex" do
+     it "works" do
+       FuzzyTools::Helpers.soundex("").should           == "Z000"
+       FuzzyTools::Helpers.soundex("123").should        == "Z000"
+       FuzzyTools::Helpers.soundex("Robert").should     == "R163"
+       FuzzyTools::Helpers.soundex("Rubin").should      == "R150"
+       FuzzyTools::Helpers.soundex("Washington").should == "W252"
+       FuzzyTools::Helpers.soundex("Lee").should        == "L000"
+       FuzzyTools::Helpers.soundex("Gutierrez").should  == "G362"
+     end
+   end
+ end
@@ -0,0 +1,5 @@
+ $:.unshift(File.join(File.dirname(__FILE__), "..", "lib"))
+
+ require 'fuzzy_tools'
+
+ Book = Struct.new(:title, :author)
@@ -0,0 +1,201 @@
1
+ require 'spec_helper'
2
+
3
+ describe FuzzyTools::TfIdfIndex do
4
+ it "takes a source" do
5
+ vegetables = ["mushroom", "olive", "tomato"]
6
+ index = FuzzyTools::TfIdfIndex.new(:source => vegetables)
7
+ index.source.should == vegetables
8
+ end
9
+
10
+ it "indexes on to_s by default" do
11
+ index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
12
+ index.find("2").should == 2
13
+ end
14
+
15
+ it "defaults tokenizer to FuzzyTools::Tokenizers::HYBRID" do
16
+ FuzzyTools::TfIdfIndex.new(:source => []).tokenizer.should == FuzzyTools::Tokenizers::HYBRID
17
+ end
18
+
19
+ it "takes any proc as a tokenizer" do
20
+ foods = ["muffins", "pancakes"]
21
+ letter_count_tokenizer = lambda { |str| [str.size.to_s] }
22
+ index = FuzzyTools::TfIdfIndex.new(:source => foods, :tokenizer => letter_count_tokenizer)
23
+
24
+ index.tokenizer.should == letter_count_tokenizer
25
+ index.find("octoword").should == "pancakes"
26
+ end
27
+
28
+ context "indexing incomparable objects" do
29
+ before :each do
30
+ @till_we_have_faces = Book.new("Till We Have Faces", "C.S. Lewis")
31
+ @perelandra = Book.new("Perelandra", "C.S. Lewis")
32
+
33
+ @books = [@till_we_have_faces, @perelandra]
34
+ end
35
+
36
+ it "#find works when they index the same" do
37
+ index = FuzzyTools::TfIdfIndex.new(:source => @books)
38
+ expect { index.find("louis") }.to_not raise_error
39
+ end
40
+
41
+ it "#all works when they index the same" do
42
+ index = FuzzyTools::TfIdfIndex.new(:source => @books)
43
+ expect { index.all("louis") }.to_not raise_error
44
+ end
45
+
46
+ it "#all_with_scores works when they index the same" do
47
+ index = FuzzyTools::TfIdfIndex.new(:source => @books)
48
+ expect { index.all_with_scores("louis") }.to_not raise_error
49
+ end
50
+ end
51
+
52
+ context "indexing objects" do
53
+ before :each do
54
+ @till_we_have_faces = Book.new("Till We Have Faces", "C.S. Lewis" )
55
+ @ecclesiastes = Book.new("Ecclesiastes", "The Teacher")
56
+ @the_prodigal_god = Book.new("The Prodigal God", "Tim Keller" )
57
+
58
+ @books = [
59
+ @till_we_have_faces,
60
+ @ecclesiastes,
61
+ @the_prodigal_god,
62
+ ]
63
+ end
64
+
65
+ it "indexes on the method specified in :attribute" do
66
+ index = FuzzyTools::TfIdfIndex.new(:source => @books, :attribute => :title)
67
+ index.find("ecklestica").should == @ecclesiastes
68
+ end
69
+
70
+ it "indexes the proc result if a proc is given for :attribute" do
71
+ index = FuzzyTools::TfIdfIndex.new(:source => @books, :attribute => lambda { |book| book.title + " " + book.author })
72
+ index.find("prodigy").should == @the_prodigal_god
73
+ index.find("LEWIS").should == @till_we_have_faces
74
+ end
75
+ end
76
+
77
+ context "indexing hashes" do
78
+ before :each do
79
+ @till_we_have_faces = { :title => "Till We Have Faces", :author => "C.S. Lewis" }
80
+ @ecclesiastes = { :title => "Ecclesiastes", :author => "The Teacher" }
81
+ @the_prodigal_god = { :title => "The Prodigal God", :author => "Tim Keller" }
82
+
83
+ @books = [
84
+ @till_we_have_faces,
85
+ @ecclesiastes,
86
+ @the_prodigal_god,
87
+ ]
88
+ end
89
+
90
+ it "indexes on the hash key specified in :attribute" do
91
+ index = FuzzyTools::TfIdfIndex.new(:source => @books, :attribute => :title)
92
+ index.find("ecklestica").should == @ecclesiastes
93
+ end
94
+
95
+ it "indexes the proc result if a proc is given for :attribute" do
96
+ index = FuzzyTools::TfIdfIndex.new(:source => @books, :attribute => lambda { |book| book[:title] + " " + book[:author] })
97
+ index.find("prodigy").should == @the_prodigal_god
98
+ index.find("LEWIS").should == @till_we_have_faces
99
+ end
100
+ end
101
+
102
+ context "query methods" do
103
+ describe "#find" do
104
+ it "returns the best result" do
105
+ mushy_stuff = ["mushrooms", "mushroom", "mushy pit", "ABC"]
106
+ index = FuzzyTools::TfIdfIndex.new(:source => mushy_stuff)
107
+
108
+ index.find("ushr").should == "mushroom"
109
+ end
110
+
111
+ it "calls to_s on input" do
112
+ index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
113
+ index.find(2).should == 2
114
+ end
115
+
116
+ it "returns nil if no results" do
117
+ index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
118
+ index.find("bubble").should be_nil
119
+ end
120
+ end
121
+
122
+ describe "#all" do
123
+ it "returns all results, from best to worst" do
124
+ mushy_stuff = ["mushrooms", "mushroom", "mushy pit", "ABC"]
125
+ index = FuzzyTools::TfIdfIndex.new(:source => mushy_stuff)
126
+
127
+ index.all("ushr").should == [
128
+ "mushroom",
129
+ "mushrooms",
130
+ "mushy pit"
131
+ ]
132
+ end
133
+
134
+ it "calls to_s on input" do
135
+ index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
136
+ index.all(2).first.should == 2
137
+ end
138
+
139
+ it "returns an empty array if no results" do
140
+ index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
141
+ index.all("bubble").should == []
142
+ end
143
+ end
144
+
169
+ describe "#all_with_scores" do
170
+ it "returns ordered array of arrays of score and results" do
171
+ mushy_stuff = ["mushrooms", "mushroom", "mushy pit", "ABC"]
172
+ index = FuzzyTools::TfIdfIndex.new(:source => mushy_stuff)
173
+
174
+ results = index.all_with_scores("ushr")
175
+
176
+ results.map(&:first).should == [
177
+ "mushroom",
178
+ "mushrooms",
179
+ "mushy pit"
180
+ ]
181
+
182
+ results.sort_by { |doc, score| -score }.should == results
183
+
184
+ results.map(&:last).each { |score| score.class.should == Float }
185
+ results.map(&:last).each { |score| score.should > 0.0 }
186
+ results.map(&:last).each { |score| score.should < 1.0 }
187
+ results.map(&:last).uniq.should == results.map(&:last)
188
+ end
189
+
190
+ it "calls to_s on input" do
191
+ index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
192
+ index.all_with_scores(2).first.should == [2, 1.0]
193
+ end
194
+
195
+ it "returns an empty array if no results" do
196
+ index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
197
+ index.all_with_scores("bubble").should == []
198
+ end
199
+ end
200
+ end
201
+ end
metadata ADDED
@@ -0,0 +1,121 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: fuzzy_tools
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Brian Hempel
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-07-24 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: RubyInline
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
30
+ - !ruby/object:Gem::Dependency
31
+ name: bundler
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: rspec
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ description: Easy, high quality fuzzy search in Ruby.
63
+ email:
64
+ - plasticchicken@gmail.com
65
+ executables: []
66
+ extensions: []
67
+ extra_rdoc_files: []
68
+ files:
69
+ - .rspec
70
+ - .travis.yml
71
+ - Gemfile
72
+ - README.md
73
+ - Rakefile
74
+ - fuzzy_tools.gemspec
75
+ - lib/fuzzy_tools.rb
76
+ - lib/fuzzy_tools/core_ext/enumerable.rb
77
+ - lib/fuzzy_tools/helpers.rb
78
+ - lib/fuzzy_tools/index.rb
79
+ - lib/fuzzy_tools/tf_idf_index.rb
80
+ - lib/fuzzy_tools/tokenizers.rb
81
+ - lib/fuzzy_tools/version.rb
82
+ - lib/fuzzy_tools/weighted_document_tokens.rb
83
+ - spec/enumerable_spec.rb
84
+ - spec/helpers_spec.rb
85
+ - spec/spec_helper.rb
86
+ - spec/tf_idf_index_spec.rb
87
+ homepage: https://github.com/brianhempel/fuzzy_tools
88
+ licenses: []
89
+ post_install_message:
90
+ rdoc_options: []
91
+ require_paths:
92
+ - lib
93
+ required_ruby_version: !ruby/object:Gem::Requirement
94
+ none: false
95
+ requirements:
96
+ - - ! '>='
97
+ - !ruby/object:Gem::Version
98
+ version: '0'
99
+ segments:
100
+ - 0
101
+ hash: -1099286336038854081
102
+ required_rubygems_version: !ruby/object:Gem::Requirement
103
+ none: false
104
+ requirements:
105
+ - - ! '>='
106
+ - !ruby/object:Gem::Version
107
+ version: '0'
108
+ segments:
109
+ - 0
110
+ hash: -1099286336038854081
111
+ requirements: []
112
+ rubyforge_project:
113
+ rubygems_version: 1.8.24
114
+ signing_key:
115
+ specification_version: 3
116
+ summary: Easy, high quality fuzzy search in Ruby.
117
+ test_files:
118
+ - spec/enumerable_spec.rb
119
+ - spec/helpers_spec.rb
120
+ - spec/spec_helper.rb
121
+ - spec/tf_idf_index_spec.rb