fuzzy_tools 1.0.0
Sign up to get free protection for your applications and to get access to all the features.
- data/.rspec +1 -0
- data/.travis.yml +12 -0
- data/Gemfile +9 -0
- data/README.md +236 -0
- data/Rakefile +29 -0
- data/fuzzy_tools.gemspec +24 -0
- data/lib/fuzzy_tools.rb +4 -0
- data/lib/fuzzy_tools/core_ext/enumerable.rb +41 -0
- data/lib/fuzzy_tools/helpers.rb +133 -0
- data/lib/fuzzy_tools/index.rb +41 -0
- data/lib/fuzzy_tools/tf_idf_index.rb +106 -0
- data/lib/fuzzy_tools/tokenizers.rb +30 -0
- data/lib/fuzzy_tools/version.rb +3 -0
- data/lib/fuzzy_tools/weighted_document_tokens.rb +88 -0
- data/spec/enumerable_spec.rb +124 -0
- data/spec/helpers_spec.rb +65 -0
- data/spec/spec_helper.rb +5 -0
- data/spec/tf_idf_index_spec.rb +201 -0
- metadata +121 -0
data/.rspec
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
--color
|
data/.travis.yml
ADDED
@@ -0,0 +1,12 @@
|
|
1
|
+
language: ruby
rvm:
- 1.8.7
- 1.9.2
- 1.9.3
- ruby-head
- jruby-18mode # JRuby in 1.8 mode
- jruby-19mode # JRuby in 1.9 mode
# - rbx-18mode
- rbx-19mode # currently in active development, may or may not work for your project
# uncomment this line if your project needs to run something other than `rake`:
# script: bundle exec rspec spec
|
data/Gemfile
ADDED
@@ -0,0 +1,9 @@
|
|
1
|
+
source "http://rubygems.org"
|
2
|
+
|
3
|
+
gem 'simple_stats'
|
4
|
+
gem 'nokogiri', :platforms => [:mri_18, :mri_19, :jruby, :rbx]
|
5
|
+
gem 'perftools.rb', :platforms => [:mri_18, :mri_19], :require => false
|
6
|
+
gem 'rake'
|
7
|
+
|
8
|
+
# Specify your gem's dependencies in fuzzy_tools.gemspec
|
9
|
+
gemspec
|
data/README.md
ADDED
@@ -0,0 +1,236 @@
|
|
1
|
+
# FuzzyTools [![Build Status](https://secure.travis-ci.org/brianhempel/fuzzy_tools.png)](http://travis-ci.org/brianhempel/fuzzy_tools)
|
2
|
+
|
3
|
+
FuzzyTools is a toolset for fuzzy searches in Ruby. The default algorithm has been tuned for accuracy (and reasonable speed) on 23 different [test files](https://github.com/brianhempel/fuzzy_tools/tree/master/accuracy/test_data/query_tests) gathered from [many sources](https://github.com/brianhempel/fuzzy_tools/blob/master/accuracy/test_data/sources/SOURCES.txt).
|
4
|
+
|
5
|
+
Because it's mostly Ruby, FuzzyTools is best for searching smaller datasets—say less than 50Kb in size. Data cleaning or auto-complete over known options are potential uses.
|
6
|
+
|
7
|
+
Tested on Ruby 1.8.7, 1.9.2, 1.9.3, 2.0.0dev, JRuby (1.8 and 1.9 mode), and Rubinius (1.9 mode only).
|
8
|
+
|
9
|
+
## Usage
|
10
|
+
|
11
|
+
Install with [Bundler](http://gembundler.com/):
|
12
|
+
|
13
|
+
``` ruby
|
14
|
+
gem "fuzzy_tools"
|
15
|
+
```
|
16
|
+
|
17
|
+
Install without Bundler:
|
18
|
+
|
19
|
+
gem install fuzzy_tools --no-ri --no-rdoc
|
20
|
+
|
21
|
+
Then, put it to work!
|
22
|
+
|
23
|
+
``` ruby
|
24
|
+
require 'fuzzy_tools'
|
25
|
+
|
26
|
+
books = [
|
27
|
+
"Till We Have Faces",
|
28
|
+
"Ecclesiastes",
|
29
|
+
"The Prodigal God"
|
30
|
+
]
|
31
|
+
|
32
|
+
# Search for a single object
|
33
|
+
|
34
|
+
books.fuzzy_find("facade") # => "Till We Have Faces"
|
35
|
+
books.fuzzy_index.find("facade") # => "Till We Have Faces"
|
36
|
+
FuzzyTools::TfIdfIndex.new(:source => books).find("facade") # => "Till We Have Faces"
|
37
|
+
|
38
|
+
# Search for all matches, from best to worst
|
39
|
+
|
40
|
+
books.fuzzy_find_all("the") # => ["The Prodigal God", "Till We Have Faces"]
|
41
|
+
books.fuzzy_index.all("the") # => ["The Prodigal God", "Till We Have Faces"]
|
42
|
+
FuzzyTools::TfIdfIndex.new(:source => books).all("the") # => ["The Prodigal God", "Till We Have Faces"]
|
43
|
+
|
44
|
+
# You can also get scored results, if you need
|
45
|
+
|
46
|
+
books.fuzzy_find_all_with_scores("the") # =>
|
47
|
+
# [
|
48
|
+
# ["The Prodigal God", 0.443175985397319 ],
|
49
|
+
# ["Till We Have Faces", 0.0102817553829306]
|
50
|
+
# ]
|
51
|
+
books.fuzzy_index.all_with_scores("the") # =>
|
52
|
+
# [
|
53
|
+
# ["The Prodigal God", 0.443175985397319 ],
|
54
|
+
# ["Till We Have Faces", 0.0102817553829306]
|
55
|
+
# ]
|
56
|
+
FuzzyTools::TfIdfIndex.new(:source => books).all_with_scores("the") # =>
|
57
|
+
# [
|
58
|
+
# ["The Prodigal God", 0.443175985397319 ],
|
59
|
+
# ["Till We Have Faces", 0.0102817553829306]
|
60
|
+
# ]
|
61
|
+
```
|
62
|
+
|
63
|
+
FuzzyTools is not limited to searching strings. In fact, strings work simply because FuzzyTools indexes on `to_s` by default. You can index on any method you like.
|
64
|
+
|
65
|
+
``` ruby
|
66
|
+
require 'fuzzy_tools'
|
67
|
+
|
68
|
+
Book = Struct.new(:title, :author)
|
69
|
+
|
70
|
+
books = [
|
71
|
+
Book.new("Till We Have Faces", "C.S. Lewis" ),
|
72
|
+
Book.new("Ecclesiastes", "The Teacher"),
|
73
|
+
Book.new("The Prodigal God", "Tim Keller" )
|
74
|
+
]
|
75
|
+
|
76
|
+
books.fuzzy_find(:author => "timmy")
|
77
|
+
books.fuzzy_index(:attribute => :author).find("timmy")
|
78
|
+
FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).find("timmy")
|
79
|
+
# => #<struct Book title="The Prodigal God", author="Tim Keller">
|
80
|
+
|
81
|
+
books.fuzzy_find_all(:author => "timmy")
|
82
|
+
books.fuzzy_index(:attribute => :author).all("timmy")
|
83
|
+
FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).all("timmy")
|
84
|
+
# =>
|
85
|
+
# [
|
86
|
+
# #<struct Book title="The Prodigal God", author="Tim Keller" >,
|
87
|
+
# #<struct Book title="Ecclesiastes", author="The Teacher">
|
88
|
+
# ]
|
89
|
+
|
90
|
+
books.fuzzy_find_all_with_scores(:author => "timmy")
|
91
|
+
books.fuzzy_index(:attribute => :author).all_with_scores("timmy")
|
92
|
+
FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).all_with_scores("timmy")
|
93
|
+
# =>
|
94
|
+
# [
|
95
|
+
# [#<struct Book title="The Prodigal God", author="Tim Keller" >, 0.29874954780727 ],
|
96
|
+
# [#<struct Book title="Ecclesiastes", author="The Teacher">, 0.0117801403002398]
|
97
|
+
# ]
|
98
|
+
```
|
99
|
+
|
100
|
+
If the objects to be searched are hashes, FuzzyTools indexes the specified hash value.
|
101
|
+
|
102
|
+
```ruby
|
103
|
+
books = [
|
104
|
+
{ :title => "Till We Have Faces", :author => "C.S. Lewis" },
|
105
|
+
{ :title => "Ecclesiastes", :author => "The Teacher" },
|
106
|
+
{ :title => "The Prodigal God", :author => "Tim Keller" }
|
107
|
+
]
|
108
|
+
|
109
|
+
books.fuzzy_find(:author => "timmy")
|
110
|
+
# => { :title => "The Prodigal God", :author => "Tim Keller" }
|
111
|
+
```
|
112
|
+
|
113
|
+
If you want to index on some calculated data such as more than one field at a time, you can provide a proc.
|
114
|
+
|
115
|
+
``` ruby
|
116
|
+
books.fuzzy_find("timmy", :attribute => lambda { |book| book.title + " " + book.author })
|
117
|
+
books.fuzzy_index(:attribute => lambda { |book| book.title + " " + book.author }).find("timmy")
|
118
|
+
FuzzyTools::TfIdfIndex.new(:source => books, :attribute => lambda { |book| book.title + " " + book.author }).find("timmy")
|
119
|
+
```
|
120
|
+
|
121
|
+
## Can it go faster?
|
122
|
+
|
123
|
+
If you need to do multiple searches on the same collection, grab a fuzzy index with `my_collection.fuzzy_index` and do finds on that. The `fuzzy_find` and `fuzzy_find_all` methods on Enumerable reindex every time they are called.
|
124
|
+
|
125
|
+
Here's a performance comparison:
|
126
|
+
|
127
|
+
``` ruby
|
128
|
+
array_methods = Array.new.methods
|
129
|
+
|
130
|
+
Benchmark.bm(20) do |b|
|
131
|
+
b.report("fuzzy_find") do
|
132
|
+
1000.times { array_methods.fuzzy_find("juice") }
|
133
|
+
end
|
134
|
+
|
135
|
+
b.report("fuzzy_index.find") do
|
136
|
+
index = array_methods.fuzzy_index
|
137
|
+
1000.times { index.find("juice") }
|
138
|
+
end
|
139
|
+
end
|
140
|
+
```
|
141
|
+
|
142
|
+
```
|
143
|
+
user system total real
|
144
|
+
fuzzy_find 29.250000 0.040000 29.290000 ( 29.287992)
|
145
|
+
fuzzy_index.find 0.360000 0.000000 0.360000 ( 0.360066)
|
146
|
+
```
|
147
|
+
|
148
|
+
If you need even more speed, you can [try a different tokenizer](#specifying-your-own-tokenizer). Fewer tokens per document shortens the comparison time between documents, lessens the garbage collector load, and reduces the number of candidate documents for a given query.
|
149
|
+
|
150
|
+
If it's still too slow, [open an issue](https://github.com/brianhempel/fuzzy_tools/issues) and perhaps we can figure out what can be done.
|
151
|
+
|
152
|
+
## How does it work?
|
153
|
+
|
154
|
+
FuzzyTools downcases and then tokenizes each value using a [hybrid combination](https://github.com/brianhempel/fuzzy_tools/blob/master/lib/fuzzy_tools/tokenizers.rb#L20-27) of words, [character bigrams](http://en.wikipedia.org/wiki/N-gram), [Soundex](http://en.wikipedia.org/wiki/Soundex), and words without vowels.
|
155
|
+
|
156
|
+
``` ruby
|
157
|
+
FuzzyTools::Tokenizers::HYBRID.call("Till We Have Faces")
|
158
|
+
# => ["T400", "W000", "H100", "F220", "_t", "ti", "il", "ll", "l ", " w",
|
159
|
+
# "we", "e ", " h", "ha", "av", "ve", "e ", " f", "fa", "ac", "ce",
|
160
|
+
# "es", "s_", "tll", "w", "hv", "fcs", "till", "we", "have", "faces"]
|
161
|
+
```
|
162
|
+
|
163
|
+
Gross, eh? But that's what worked best on the [test data sets](https://github.com/brianhempel/fuzzy_tools/tree/master/accuracy/test_data/query_tests).
|
164
|
+
|
165
|
+
The tokens are weighted using [Term Frequency * Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf*idf) which basically assigns higher weights to the tokens that occur in fewer documents.
|
166
|
+
|
167
|
+
```ruby
|
168
|
+
# hacky introspection here--don't do this!
|
169
|
+
index = books.fuzzy_index(:attribute => :author)
|
170
|
+
index.instance_variable_get(:@document_tokens)["The Teacher"].weights.sort_by { |k,v| [-v,k] }
|
171
|
+
# =>
|
172
|
+
# [
|
173
|
+
# ["he", 0.3910],
|
174
|
+
# ["th", 0.3910],
|
175
|
+
# [" t", 0.2467],
|
176
|
+
# ["T000", 0.2467],
|
177
|
+
# ["T260", 0.2467],
|
178
|
+
# ["ac", 0.2467],
|
179
|
+
# ["ch", 0.2467],
|
180
|
+
# ["e ", 0.2467],
|
181
|
+
# ["ea", 0.2467],
|
182
|
+
# ["tchr", 0.2467],
|
183
|
+
# ["te", 0.2467],
|
184
|
+
# ["teacher", 0.2467],
|
185
|
+
# ["the", 0.2467],
|
186
|
+
# ["_t", 0.0910],
|
187
|
+
# ["er", 0.0910],
|
188
|
+
# ["r_", 0.0910]
|
189
|
+
# ]
|
190
|
+
```
|
191
|
+
|
192
|
+
When you do a query, that query string is tokenized and weighted, then compared against some of the documents using [Cosine Similarity](http://www.gettingcirrius.com/2010/12/calculating-similarity-part-1-cosine.html). Cosine similarity is not that terrible of a concept, assuming you like terms like "N-dimensional space". Basically, each unique token becomes an axis in N-dimensional space. If we had 4 different tokens in all, we'd use 4-D space. A document's token weights define a vector in this space. The _cosine_ of the _angle_ between documents' vectors becomes the similarity between the documents.
|
193
|
+
|
194
|
+
Trust me, it works.
|
195
|
+
|
196
|
+
## Specifying your own tokenizer
|
197
|
+
|
198
|
+
If the default tokenizer isn't working for your data or you need more speed, you can try swapping out the tokenizers. You can use one of the tokenizers defined in [`FuzzyTools::Tokenizers`](https://github.com/brianhempel/fuzzy_tools/blob/master/lib/fuzzy_tools/tokenizers.rb), or you can write your own.
|
199
|
+
|
200
|
+
``` ruby
|
201
|
+
# a predefined tokenizer
|
202
|
+
books.fuzzy_find("facade", :tokenizer => FuzzyTools::Tokenizers::CHARACTERS)
|
203
|
+
books.fuzzy_index(:tokenizer => FuzzyTools::Tokenizers::CHARACTERS).find("facade")
|
204
|
+
FuzzyTools::TfIdfIndex.new(:source => books, :tokenizer => FuzzyTools::Tokenizers::CHARACTERS).find("facade")
|
205
|
+
|
206
|
+
# roll your own
|
207
|
+
punctuation_normalizer = lambda { |str| str.downcase.split.map { |word| word.gsub(/\W/, '') } }
|
208
|
+
books.fuzzy_find("facade", :tokenizer => punctuation_normalizer)
|
209
|
+
books.fuzzy_index(:tokenizer => punctuation_normalizer).find("facade")
|
210
|
+
FuzzyTools::TfIdfIndex.new(:source => books, :tokenizer => punctuation_normalizer).find("facade")
|
211
|
+
```
|
212
|
+
## I've heard of Soft TF-IDF. It's supposed to be better than TF-IDF.
|
213
|
+
|
214
|
+
Despite the impressive graphs, the "Soft TF-IDF" described in [WW Cohen, P Ravikumar, and SE Fienberg, A comparison of string distance metrics for name-matching tasks, IIWEB, pages 73-78, 2003](http://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf) didn't give me good results. In the paper, they tokenized by word. The standard TF-IDF tokenized by character 4-grams or 5-grams may have been more effective.
|
215
|
+
|
216
|
+
In my tests, the word-tokenized Soft TF-IDF was significantly slower and considerably less accurate than a standard TF-IDF with n-gram tokenization.
|
217
|
+
|
218
|
+
## Help make it better!
|
219
|
+
|
220
|
+
Need something added? Please [open an issue](https://github.com/brianhempel/fuzzy_tools/issues)! Or, even better, code it yourself and send a pull request:
|
221
|
+
|
222
|
+
# fork it on github, then clone:
|
223
|
+
git clone git@github.com:your_username/fuzzy_tools.git
|
224
|
+
bundle install
|
225
|
+
rspec
|
226
|
+
# hack away
|
227
|
+
git push
|
228
|
+
# then make a pull request
|
229
|
+
|
230
|
+
## Acknowledgements
|
231
|
+
|
232
|
+
The [SecondString](http://secondstring.sourceforge.net/) source code was a valuable reference.
|
233
|
+
|
234
|
+
## License
|
235
|
+
|
236
|
+
Authored by Brian Hempel. Public domain, no restrictions.
|
data/Rakefile
ADDED
@@ -0,0 +1,29 @@
|
|
1
|
+
require 'bundler'
Bundler::GemHelper.install_tasks

# Pull in any extra rake tasks shipped with the accuracy and performance suites.
Dir[File.expand_path('../accuracy/**/*.task',    __FILE__)].each { |f| load f }
Dir[File.expand_path('../performance/**/*.task', __FILE__)].each { |f| load f }

task :default => :test

require 'rspec/core/rake_task'
desc "Run the tests"
RSpec::Core::RakeTask.new(:test)

desc "Launch an IRB session with the gem required"
task :console do
  $:.unshift(File.dirname(__FILE__) + '/../lib')

  require 'fuzzy_tools'
  require 'irb'

  # Boot IRB by hand so the gem is already loaded in the session.
  IRB.setup(nil)
  session = IRB::Irb.new

  IRB.conf[:MAIN_CONTEXT] = session.context

  session.context.evaluate("require 'irb/completion'", 0)

  trap("SIGINT") { session.signal_handle }
  catch(:IRB_EXIT) { session.eval_input }
end
|
data/fuzzy_tools.gemspec
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
# -*- encoding: utf-8 -*-
$:.push File.expand_path("../lib", __FILE__)
require "fuzzy_tools/version"

Gem::Specification.new do |s|
  s.name        = "fuzzy_tools"
  s.version     = FuzzyTools::VERSION
  s.platform    = Gem::Platform::RUBY
  s.authors     = ["Brian Hempel"]
  s.email       = ["plasticchicken@gmail.com"]
  s.homepage    = "https://github.com/brianhempel/fuzzy_tools"
  s.summary     = %q{Easy, high quality fuzzy search in Ruby.}
  s.description = %q{Easy, high quality fuzzy search in Ruby.}

  # Package everything git tracks except the accuracy/performance suites,
  # which are development-only and large.
  s.files         = `git ls-files | grep --invert-match --extended-regexp '^(accuracy|performance)/'`.split("\n")
  s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
  s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
  s.require_paths = ["lib"]

  # RubyInline compiles the C hot paths at runtime on MRI.
  s.add_dependency 'RubyInline'

  s.add_development_dependency 'bundler'
  s.add_development_dependency 'rspec'
end
|
data/lib/fuzzy_tools.rb
ADDED
@@ -0,0 +1,41 @@
|
|
1
|
+
require 'fuzzy_tools/index'
|
2
|
+
|
3
|
+
# Convenience fuzzy-search methods mixed into every Enumerable. Each of the
# fuzzy_find* methods builds a fresh FuzzyTools::TfIdfIndex per call; use
# #fuzzy_index directly when running many queries against one collection.
module Enumerable
  # Returns the single best match for the query, or nil when nothing matches.
  def fuzzy_find(*args)
    query, options = parse_fuzzy_finder_arguments(args)
    fuzzy_index(options).find(query)
  end

  # Returns every match, ordered best first.
  def fuzzy_find_all(*args)
    query, options = parse_fuzzy_finder_arguments(args)
    fuzzy_index(options).all(query)
  end

  # Returns [match, score] pairs, ordered best first.
  def fuzzy_find_all_with_scores(*args)
    query, options = parse_fuzzy_finder_arguments(args)
    fuzzy_index(options).all_with_scores(query)
  end

  # Builds a TF-IDF index over this collection. Accepts :attribute and
  # :tokenizer options, which are forwarded to the index.
  def fuzzy_index(options = {})
    FuzzyTools::TfIdfIndex.new(options.merge(:source => self))
  end

  private

  # Normalizes the two supported call styles into [query, index_options]:
  #
  #   fuzzy_find("query", :tokenizer => t)            # simple style
  #   fuzzy_find(:attribute => "query", :tokenizer => t)  # attribute style
  def parse_fuzzy_finder_arguments(args)
    if args.first.is_a?(Hash)
      remaining = args.first.dup
      options   = {}
      # Peel off recognized index options; whatever pair is left is
      # taken as attribute => query.
      [:tokenizer].each do |key|
        options[key] = remaining.delete(key) if remaining.has_key?(key)
      end
      options[:attribute], query = remaining.first
      [query, options]
    else
      [args[0], args[1] || {}]
    end
  end
end
|
@@ -0,0 +1,133 @@
|
|
1
|
+
module FuzzyTools
  # Stateless utility functions shared by the tokenizers and indexes:
  # term counting, padded n-gram extraction, and Soundex codes. Soundex has
  # two implementations: pure Ruby for JRuby/Rubinius, RubyInline C for MRI.
  module Helpers
    extend self

    # Returns a Hash mapping each element yielded by +enumerator+ to the
    # number of times it occurs.
    def term_counts(enumerator)
      counts = {}
      enumerator.each do |term|
        counts[term] = (counts[term] || 0) + 1
      end
      counts
    end

    # Character 2-grams of +str+, padded with "_".
    def bigrams(str)
      ngrams(str, 2)
    end

    # Character 3-grams of +str+, padded with "__".
    def trigrams(str)
      ngrams(str, 3)
    end

    # Character 4-grams of +str+, padded with "___".
    def tetragrams(str)
      ngrams(str, 4)
    end

    # All length-n substrings of +str+ after padding each end with
    # n - 1 underscores, so edge characters appear in as many grams
    # as interior ones.
    def ngrams(str, n)
      pad    = "_" * (n - 1)
      padded = "#{pad}#{str}#{pad}"

      (0..padded.length - n).map { |i| padded[i, n] }
    end

    if RUBY_DESCRIPTION !~ /^ruby/ # rbx, jruby

      # Standard Soundex digit for each letter; 0 marks letters that act
      # only as separators (vowels, H, W, Y).
      SOUNDEX_LETTERS_TO_CODES = {
        'A' => 0, 'B' => 1, 'C' => 2, 'D' => 3, 'E' => 0, 'F' => 1,
        'G' => 2, 'H' => 0, 'I' => 0, 'J' => 2, 'K' => 2,
        'L' => 4, 'M' => 5, 'N' => 5, 'O' => 0, 'P' => 1,
        'Q' => 2, 'R' => 6, 'S' => 2, 'T' => 3, 'U' => 0,
        'V' => 1, 'W' => 0, 'X' => 2, 'Y' => 0, 'Z' => 2
      }

      # Ruby port of the C below. Returns a four-character Soundex key,
      # or "Z000" when +str+ contains no letters at all.
      def soundex(str)
        key     = "Z000"
        letters = str.upcase.chars.to_a

        # Skip ahead to the first letter, remembering its digit.
        lead = nil
        lead = letters.shift until (last_digit = lead && SOUNDEX_LETTERS_TO_CODES[lead]) || letters.size == 0

        return key unless last_digit

        key[0] = lead

        pos = 1
        while pos < 4 && letters.size > 0
          digit = SOUNDEX_LETTERS_TO_CODES[letters.shift]
          next unless digit # ignore non-letters entirely
          if digit != last_digit
            last_digit = digit
            # Digit 0 letters are skipped but still break up runs.
            if digit != 0
              key[pos] = digit.to_s
              pos += 1
            end
          end
        end

        key
      end

    else # MRI

      require 'inline'

      # http://en.literateprograms.org/Soundex_(C)
      inline(:C) do |builder|
        builder.include '<ctype.h>'
        builder.c_raw <<-EOC
          static VALUE soundex(int argc, VALUE *argv, VALUE self) {
            VALUE ruby_str = argv[0];
            char * in;

            static int code[] =
              { 0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2 };
            /* a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z */
            static char key[5];
            register char ch;
            register int last;
            register int count;

            Check_Type(ruby_str, T_STRING);

            in = StringValueCStr(ruby_str);

            /* Set up default key, complete with trailing '0's */
            strcpy(key, "Z000");

            /* Advance to the first letter. If none present,
               return default key */
            while (*in != '\\0' && !isalpha(*in))
              ++in;
            if (*in == '\\0')
              return rb_str_new2(key);

            /* Pull out the first letter, uppercase it, and
               set up for main loop */
            key[0] = toupper(*in);
            last = code[key[0] - 'A'];
            ++in;

            /* Scan rest of string, stop at end of string or
               when the key is full */
            for (count = 1; count < 4 && *in != '\\0'; ++in) {
              /* If non-alpha, ignore the character altogether */
              if (isalpha(*in)) {
                ch = tolower(*in);
                /* Fold together adjacent letters sharing the same code */
                if (last != code[ch - 'a']) {
                  last = code[ch - 'a'];
                  /* Ignore code==0 letters except as separators */
                  if (last != 0)
                    key[count++] = '0' + last;
                }
              }
            }

            return rb_str_new2(key);
          }
        EOC
      end

    end

  end
end
|
@@ -0,0 +1,41 @@
|
|
1
|
+
require 'fuzzy_tools/helpers'
|
2
|
+
require 'fuzzy_tools/tokenizers'
|
3
|
+
|
4
|
+
module FuzzyTools
  # Abstract base class for fuzzy indexes. Subclasses must implement
  # build_index (called from initialize) and unsorted_scored_results(query),
  # which returns [document, score] pairs.
  class Index
    attr_reader :source, :indexed_attribute

    # options[:source]    - the collection to index
    # options[:attribute] - method name, hash key, or proc used to extract
    #                       the searchable value (defaults to :to_s)
    def initialize(options = {})
      @source            = options[:source]
      @indexed_attribute = options[:attribute] || :to_s
      build_index
    end

    # Returns the single best match for +query+, or nil when there are no
    # candidates. Ties are broken by ascending attribute — the same ordering
    # all_with_scores uses — so find(q) always equals all(q).first.
    # (Previously find used max_by { [score, attr] }, which broke score ties
    # toward the *largest* attribute, disagreeing with all_with_scores.)
    def find(query)
      best = unsorted_scored_results(query).min_by { |doc, score| [-score, document_attribute(doc)] }
      best && best.first
    end

    # Returns every matching document, best first.
    def all(query)
      all_with_scores(query).map(&:first)
    end

    # Returns [document, score] pairs sorted by descending score, then by
    # ascending attribute for deterministic tie-breaking.
    def all_with_scores(query)
      unsorted_scored_results(query).sort_by { |doc, score| [-score, document_attribute(doc)] }
    end

    private

    # Yields [attribute_value, document] for each document in the source.
    def each_attribute_and_document
      source.each do |document|
        yield(document_attribute(document), document)
      end
    end

    # Extracts the indexed value from a document: via call for procs,
    # via [] for hashes, and via send for everything else.
    def document_attribute(document)
      return @indexed_attribute.call(document) if @indexed_attribute.is_a?(Proc)
      return document[@indexed_attribute]      if document.is_a?(Hash)
      document.send(@indexed_attribute)
    end
  end
end
|
@@ -0,0 +1,106 @@
|
|
1
|
+
require 'set'
|
2
|
+
require 'fuzzy_tools/index'
|
3
|
+
require 'fuzzy_tools/weighted_document_tokens'
|
4
|
+
|
5
|
+
module FuzzyTools
  # A fuzzy index that weights tokens by TF-IDF and ranks candidates by
  # cosine similarity between the query's token vector and each document's.
  class TfIdfIndex < Index
    # Per-token bookkeeping: the set of documents containing the token and
    # its inverse document frequency.
    class Token
      attr_accessor :documents, :idf

      def initialize
        @documents = Set.new
      end
    end

    # The tokenizer used when the caller does not supply :tokenizer.
    def self.default_tokenizer
      FuzzyTools::Tokenizers::HYBRID
    end

    attr_reader :tokenizer

    def initialize(options = {})
      @tokenizer = options[:tokenizer] || self.class.default_tokenizer
      super
    end

    # Tokenizes any value via its string form.
    def tokenize(str)
      tokenizer.call(str.to_s)
    end

    # Scores candidate documents against +query+; order is unspecified
    # (Index#find / #all_with_scores handle ranking).
    def unsorted_scored_results(query)
      query_weighted_tokens = WeightedDocumentTokens.new(tokenize(query), :weight_function => weight_function)

      select_candidate_documents(query, query_weighted_tokens).map do |candidate|
        candidate_tokens = @document_tokens[document_attribute(candidate)]
        [candidate, score(query_weighted_tokens, candidate_tokens)]
      end
    end

    # Cosine similarity between two weighted token vectors.
    def score(weighted_tokens_1, weighted_tokens_2)
      weighted_tokens_1.cosine_similarity(weighted_tokens_2)
    end

    # Gathers the documents sharing at least one sufficiently-rare token
    # with the query. Once the candidate set covers most of the corpus it
    # is cheaper to just score everything.
    def select_candidate_documents(query, query_weighted_tokens)
      candidates = Set.new
      check_all_threshold = @source_count * 0.75 # this threshold works best on the accuracy data
      query_weighted_tokens.tokens.each do |query_token|
        known = @tf_idf_tokens[query_token]
        next unless known
        next if known.idf < @idf_cutoff # too common to narrow anything down
        candidates.merge(known.documents)
        if candidates.size > check_all_threshold
          candidates = source
          break
        end
      end
      candidates
    end

    private

    # Interns token strings through a lookup table so identical tokens
    # share one object — consolidating lowers GC load.
    def tokenize_consolidated(str)
      tokenize(str).map { |token| @token_table[token] ||= token }
    end

    def clear_token_table
      @token_table = {}
    end

    # Two passes over the source: first collect document sets per token and
    # derive idf values, then build each document's weighted token vector.
    def build_index
      @source_count = source.count
      clear_token_table

      @tf_idf_tokens = {}
      each_attribute_and_document do |attribute, document|
        tokenize_consolidated(attribute).each do |token_str|
          (@tf_idf_tokens[token_str] ||= Token.new).documents << document
        end
      end

      # idf = log(N / document frequency)
      @tf_idf_tokens.keys.each do |token_str|
        @tf_idf_tokens[token_str].idf = Math.log(@source_count.to_f / @tf_idf_tokens[token_str].documents.size)
      end

      @document_tokens = {}
      each_attribute_and_document do |attribute, document|
        @document_tokens[attribute] = WeightedDocumentTokens.new(tokenize_consolidated(attribute), :weight_function => weight_function)
      end
      clear_token_table

      # Query-time cutoff: ignore tokens whose idf is below half the
      # corpus's ~16th-percentile idf (0.0 when the corpus is empty).
      sorted_idfs = @tf_idf_tokens.values.map { |t| t.idf }.sort
      @idf_cutoff = (sorted_idfs[sorted_idfs.size / 16] || 0.0) / 2.0
    end

    # token weight = idf * log(term_count + 1), memoized as a lambda so it
    # can be handed to WeightedDocumentTokens.
    def weight_function
      @weight_function ||= lambda do |token, n|
        # secondstring gives unknown tokens a df of 1
        known = @tf_idf_tokens[token]
        idf   = known ? known.idf : Math.log(@source_count.to_f)
        idf * Math.log(n + 1)
      end
    end
  end
end
|
@@ -0,0 +1,30 @@
|
|
1
|
+
module FuzzyTools
  # Ready-made tokenizer procs. Each takes a string and returns its tokens;
  # pass one as :tokenizer to TfIdfIndex (or Enumerable#fuzzy_*) to override
  # the default HYBRID tokenizer.
  module Tokenizers

    CHARACTERS           = lambda { |str| str.chars }
    CHARACTERS_DOWNCASED = lambda { |str| str.downcase.chars }
    BIGRAMS              = lambda { |str| FuzzyTools::Helpers.ngrams(str, 2) }
    BIGRAMS_DOWNCASED    = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 2) }
    TRIGRAMS             = lambda { |str| FuzzyTools::Helpers.ngrams(str, 3) }
    TRIGRAMS_DOWNCASED   = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 3) }
    TETRAGRAMS           = lambda { |str| FuzzyTools::Helpers.ngrams(str, 4) }
    TETRAGRAMS_DOWNCASED = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 4) }
    PENTAGRAMS           = lambda { |str| FuzzyTools::Helpers.ngrams(str, 5) }
    PENTAGRAMS_DOWNCASED = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 5) }
    HEXAGRAMS            = lambda { |str| FuzzyTools::Helpers.ngrams(str, 6) }
    HEXAGRAMS_DOWNCASED  = lambda { |str| FuzzyTools::Helpers.ngrams(str.downcase, 6) }

    WORDS           = lambda { |str| str.split }
    WORDS_DOWNCASED = lambda { |str| str.downcase.split }

    # The accuracy-tuned default: per-word Soundex codes, character bigrams,
    # vowel-stripped words, and the words themselves — all downcased.
    HYBRID = lambda do |str|
      str   = str.downcase
      words = str.split
      # str is already downcased above, so the bigrams use it directly
      # (previously it was redundantly downcased a second time).
      words.map { |word| FuzzyTools::Helpers.soundex(word) } +
      FuzzyTools::Helpers.ngrams(str, 2) +
      words.map { |word| word.gsub(/[aeiou]/, '') } +
      words
    end

  end
end
|
@@ -0,0 +1,88 @@
|
|
1
|
+
require 'fuzzy_tools/helpers'
|
2
|
+
|
3
|
+
module FuzzyTools
  # A document's token vector: each token mapped to its weight, normalized
  # to unit length so the dot product of two vectors is their cosine
  # similarity. The hot dot-product loop has a pure-Ruby form for
  # JRuby/Rubinius and a RubyInline C form for MRI.
  class WeightedDocumentTokens
    attr_reader :weights

    # tokens  - array of token strings (repeats allowed)
    # options - must include :weight_function, a proc of (token, count)
    #           returning that token's raw weight
    def initialize(tokens, options)
      weight_function = options[:weight_function]
      set_token_weights(tokens, &weight_function)
    end

    if RUBY_DESCRIPTION !~ /^ruby/

      # Rubinius and JRuby
      def cosine_similarity(other)
        # equivalent to the C below, but the C is >2x faster
        total         = 0.0
        other_weights = other.weights
        @weights.each do |token, weight|
          counterpart = other_weights[token]
          total += counterpart * weight if counterpart
        end
        total
      end

    else

      # MRI

      require 'inline'

      def cosine_similarity(other)
        cosine_similarity_fast(@weights, tokens, other.weights)
      end

      inline(:C) do |builder|
        builder.c_raw <<-EOC
          static VALUE cosine_similarity_fast(int argc, VALUE *argv, VALUE self) {
            double similarity = 0.0;
            VALUE my_weights = argv[0];
            VALUE my_tokens = argv[1];
            VALUE other_weights = argv[2];
            int i;
            VALUE token;
            VALUE my_weight;
            VALUE other_weight;

            for(i = 0; i < RARRAY_LEN(RARRAY(my_tokens)); i++) {
              token = RARRAY_PTR(RARRAY(my_tokens))[i];
              other_weight = rb_hash_aref(other_weights, token);
              if (other_weight != Qnil) {
                my_weight = rb_hash_aref(my_weights, token);
                similarity += NUM2DBL(my_weight)*NUM2DBL(other_weight);
              }
            }

            return rb_float_new(similarity);
          }
        EOC
      end

    end

    # The distinct tokens of this document, memoized.
    def tokens
      @tokens ||= @weights.keys
    end

    private

    # Weighs each distinct token by its count via the block, then scales
    # the vector to unit length.
    def set_token_weights(tokens, &block)
      @weights = {}
      FuzzyTools::Helpers.term_counts(tokens).each do |token, count|
        @weights[token] = yield(token, count)
      end
      normalize_weights
      @weights
    end

    # Divides every weight by the vector's Euclidean norm.
    def normalize_weights
      norm = Math.sqrt(weights.values.reduce(0.0) { |sum, w| sum + w * w })
      weights.each_key do |token|
        weights[token] /= norm
      end
    end
  end
end
|
@@ -0,0 +1,124 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
require 'set'
|
3
|
+
|
4
|
+
describe Enumerable do
|
5
|
+
before :each do
|
6
|
+
@till_we_have_faces = Book.new("Till We Have Faces", "C.S. Lewis" )
|
7
|
+
@ecclesiastes = Book.new("Ecclesiastes", "The Teacher")
|
8
|
+
@the_prodigal_god = Book.new("The Prodigal God", "Tim Keller" )
|
9
|
+
|
10
|
+
@books = [
|
11
|
+
@till_we_have_faces,
|
12
|
+
@ecclesiastes,
|
13
|
+
@the_prodigal_god
|
14
|
+
].each
|
15
|
+
end
|
16
|
+
|
17
|
+
describe "#fuzzy_find" do
|
18
|
+
it "works with simple query syntax" do
|
19
|
+
@books.fuzzy_find("the").should == @ecclesiastes
|
20
|
+
end
|
21
|
+
|
22
|
+
it "works with :attribute => query syntax" do
|
23
|
+
@books.fuzzy_find(:title => "the").should == @the_prodigal_god
|
24
|
+
end
|
25
|
+
|
26
|
+
context "passes :tokenizer through to the index" do
|
27
|
+
before(:each) { @letter_count_tokenizer = lambda { |str| str.size.to_s } }
|
28
|
+
|
29
|
+
it "passes :tokenizer through to the index with simple query syntax" do
|
30
|
+
FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer)
|
31
|
+
begin
|
32
|
+
@books.fuzzy_find("the", :tokenizer => @letter_count_tokenizer)
|
33
|
+
rescue
|
34
|
+
end
|
35
|
+
end
|
36
|
+
|
37
|
+
it "passes :tokenizer through to the index with :attribute => query syntax" do
|
38
|
+
FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer, :attribute => :title)
|
39
|
+
begin
|
40
|
+
@books.fuzzy_find(:title => "the", :tokenizer => @letter_count_tokenizer)
|
41
|
+
rescue
|
42
|
+
end
|
43
|
+
end
|
44
|
+
end
|
45
|
+
end
|
46
|
+
|
47
|
+
describe "#fuzzy_find_all" do
|
48
|
+
it "works with simple query syntax" do
|
49
|
+
@books.fuzzy_find_all("the").should == [@ecclesiastes, @the_prodigal_god, @till_we_have_faces]
|
50
|
+
end
|
51
|
+
|
52
|
+
it "works with :attribute => query syntax" do
|
53
|
+
@books.fuzzy_find_all(:title => "the").should == [@the_prodigal_god, @till_we_have_faces]
|
54
|
+
end
|
55
|
+
|
56
|
+
context "passes :tokenizer through to the index" do
|
57
|
+
before(:each) { @letter_count_tokenizer = lambda { |str| str.size.to_s } }
|
58
|
+
|
59
|
+
it "passes :tokenizer through to the index with simple query syntax" do
|
60
|
+
FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer)
|
61
|
+
begin
|
62
|
+
@books.fuzzy_find_all("the", :tokenizer => @letter_count_tokenizer)
|
63
|
+
rescue
|
64
|
+
end
|
65
|
+
end
|
66
|
+
|
67
|
+
it "passes :tokenizer through to the index with :attribute => query syntax" do
|
68
|
+
FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer, :attribute => :title)
|
69
|
+
begin
|
70
|
+
@books.fuzzy_find_all(:title => "the", :tokenizer => @letter_count_tokenizer)
|
71
|
+
rescue
|
72
|
+
end
|
73
|
+
end
|
74
|
+
end
|
75
|
+
end
|
76
|
+
|
77
|
+
describe "#fuzzy_find_all_with_scores" do
|
78
|
+
it "works with simple query syntax" do
|
79
|
+
results = @books.fuzzy_find_all_with_scores("the")
|
80
|
+
|
81
|
+
results.map(&:first).should == [@ecclesiastes, @the_prodigal_god, @till_we_have_faces]
|
82
|
+
results.sort_by { |doc, score| -score }.should == results
|
83
|
+
end
|
84
|
+
|
85
|
+
it "works with :attribute => query syntax" do
|
86
|
+
results = @books.fuzzy_find_all_with_scores(:title => "the")
|
87
|
+
|
88
|
+
results.map(&:first).should == [@the_prodigal_god, @till_we_have_faces]
|
89
|
+
results.sort_by { |doc, score| -score }.should == results
|
90
|
+
end
|
91
|
+
|
92
|
+
context "passes :tokenizer through to the index" do
|
93
|
+
before(:each) { @letter_count_tokenizer = lambda { |str| str.size.to_s } }
|
94
|
+
|
95
|
+
it "passes :tokenizer through to the index with simple query syntax" do
|
96
|
+
FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer)
|
97
|
+
begin
|
98
|
+
@books.fuzzy_find_all_with_scores("the", :tokenizer => @letter_count_tokenizer)
|
99
|
+
rescue
|
100
|
+
end
|
101
|
+
end
|
102
|
+
|
103
|
+
it "passes :tokenizer through to the index with :attribute => query syntax" do
|
104
|
+
FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => @letter_count_tokenizer, :attribute => :title)
|
105
|
+
begin
|
106
|
+
@books.fuzzy_find_all_with_scores(:title => "the", :tokenizer => @letter_count_tokenizer)
|
107
|
+
rescue
|
108
|
+
end
|
109
|
+
end
|
110
|
+
end
|
111
|
+
end
|
112
|
+
|
113
|
+
describe "#fuzzy_index" do
|
114
|
+
it "returns an TfIdfIndex" do
|
115
|
+
@books.fuzzy_index.class.should == FuzzyTools::TfIdfIndex
|
116
|
+
end
|
117
|
+
|
118
|
+
it "passes options along to the index" do
|
119
|
+
letter_count_tokenizer = lambda { |str| str.size.to_s }
|
120
|
+
FuzzyTools::TfIdfIndex.should_receive(:new).with(:source => @books, :tokenizer => letter_count_tokenizer, :attribute => :title)
|
121
|
+
@books.fuzzy_index(:attribute => :title, :tokenizer => letter_count_tokenizer)
|
122
|
+
end
|
123
|
+
end
|
124
|
+
end
|
@@ -0,0 +1,65 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe FuzzyTools::Helpers do
|
4
|
+
describe ".ngrams" do
|
5
|
+
|
6
|
+
it "should do trigrams" do
|
7
|
+
FuzzyTools::Helpers.trigrams("hello").should == %w{
|
8
|
+
__h
|
9
|
+
_he
|
10
|
+
hel
|
11
|
+
ell
|
12
|
+
llo
|
13
|
+
lo_
|
14
|
+
o__
|
15
|
+
}
|
16
|
+
end
|
17
|
+
|
18
|
+
it "should do bigrams" do
|
19
|
+
FuzzyTools::Helpers.bigrams("hello").should == %w{
|
20
|
+
_h
|
21
|
+
he
|
22
|
+
el
|
23
|
+
ll
|
24
|
+
lo
|
25
|
+
o_
|
26
|
+
}
|
27
|
+
end
|
28
|
+
|
29
|
+
it "should do 1-grams" do
|
30
|
+
FuzzyTools::Helpers.ngrams("hello", 1).should == %w{
|
31
|
+
h
|
32
|
+
e
|
33
|
+
l
|
34
|
+
l
|
35
|
+
o
|
36
|
+
}
|
37
|
+
end
|
38
|
+
|
39
|
+
it "should do x-grams" do
|
40
|
+
FuzzyTools::Helpers.ngrams("hello", 4).should == %w{
|
41
|
+
___h
|
42
|
+
__he
|
43
|
+
_hel
|
44
|
+
hell
|
45
|
+
ello
|
46
|
+
llo_
|
47
|
+
lo__
|
48
|
+
o___
|
49
|
+
}
|
50
|
+
end
|
51
|
+
|
52
|
+
end
|
53
|
+
|
54
|
+
describe ".soundex" do
|
55
|
+
it "works" do
|
56
|
+
FuzzyTools::Helpers.soundex("").should == "Z000"
|
57
|
+
FuzzyTools::Helpers.soundex("123").should == "Z000"
|
58
|
+
FuzzyTools::Helpers.soundex("Robert").should == "R163"
|
59
|
+
FuzzyTools::Helpers.soundex("Rubin").should == "R150"
|
60
|
+
FuzzyTools::Helpers.soundex("Washington").should == "W252"
|
61
|
+
FuzzyTools::Helpers.soundex("Lee").should == "L000"
|
62
|
+
FuzzyTools::Helpers.soundex("Gutierrez").should == "G362"
|
63
|
+
end
|
64
|
+
end
|
65
|
+
end
|
data/spec/tf_idf_index_spec.rb
ADDED
@@ -0,0 +1,201 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe FuzzyTools::TfIdfIndex do
|
4
|
+
it "takes a source" do
|
5
|
+
vegetables = ["mushroom", "olive", "tomato"]
|
6
|
+
index = FuzzyTools::TfIdfIndex.new(:source => vegetables)
|
7
|
+
index.source.should == vegetables
|
8
|
+
end
|
9
|
+
|
10
|
+
it "indexes on to_s by default" do
|
11
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
12
|
+
index.find("2").should == 2
|
13
|
+
end
|
14
|
+
|
15
|
+
it "defaults tokenizer to FuzzyTools::Tokenizers::HYBRID" do
|
16
|
+
FuzzyTools::TfIdfIndex.new(:source => []).tokenizer.should == FuzzyTools::Tokenizers::HYBRID
|
17
|
+
end
|
18
|
+
|
19
|
+
it "takes any proc as a tokenizer" do
|
20
|
+
foods = ["muffins", "pancakes"]
|
21
|
+
letter_count_tokenizer = lambda { |str| [str.size.to_s] }
|
22
|
+
index = FuzzyTools::TfIdfIndex.new(:source => foods, :tokenizer => letter_count_tokenizer)
|
23
|
+
|
24
|
+
index.tokenizer.should == letter_count_tokenizer
|
25
|
+
index.find("octoword").should == "pancakes"
|
26
|
+
end
|
27
|
+
|
28
|
+
context "indexing incomparable objects" do
|
29
|
+
before :each do
|
30
|
+
@till_we_have_faces = Book.new("Till We Have Faces", "C.S. Lewis")
|
31
|
+
@perelandra = Book.new("Perelandra", "C.S. Lewis")
|
32
|
+
|
33
|
+
@books = [@till_we_have_faces, @perelandra]
|
34
|
+
end
|
35
|
+
|
36
|
+
it "#find works when they index the same" do
|
37
|
+
index = FuzzyTools::TfIdfIndex.new(:source => @books)
|
38
|
+
expect { index.all("louis") }.to_not raise_error
|
39
|
+
end
|
40
|
+
|
41
|
+
it "#all works when they index the same" do
|
42
|
+
index = FuzzyTools::TfIdfIndex.new(:source => @books)
|
43
|
+
expect { index.all("louis") }.to_not raise_error
|
44
|
+
end
|
45
|
+
|
46
|
+
it "#all_with_scores works when they index the same" do
|
47
|
+
index = FuzzyTools::TfIdfIndex.new(:source => @books)
|
48
|
+
expect { index.all("louis") }.to_not raise_error
|
49
|
+
end
|
50
|
+
end
|
51
|
+
|
52
|
+
context "indexing objects" do
|
53
|
+
before :each do
|
54
|
+
@till_we_have_faces = Book.new("Till We Have Faces", "C.S. Lewis" )
|
55
|
+
@ecclesiastes = Book.new("Ecclesiastes", "The Teacher")
|
56
|
+
@the_prodigal_god = Book.new("The Prodigal God", "Tim Keller" )
|
57
|
+
|
58
|
+
@books = [
|
59
|
+
@till_we_have_faces,
|
60
|
+
@ecclesiastes,
|
61
|
+
@the_prodigal_god,
|
62
|
+
]
|
63
|
+
end
|
64
|
+
|
65
|
+
it "indexes on the method specified in :attribute" do
|
66
|
+
index = FuzzyTools::TfIdfIndex.new(:source => @books, :attribute => :title)
|
67
|
+
index.find("ecklestica").should == @ecclesiastes
|
68
|
+
end
|
69
|
+
|
70
|
+
it "indexes the proc result if a proc is given for :attribute" do
|
71
|
+
index = FuzzyTools::TfIdfIndex.new(:source => @books, :attribute => lambda { |book| book.title + " " + book.author })
|
72
|
+
index.find("prodigy").should == @the_prodigal_god
|
73
|
+
index.find("LEWIS").should == @till_we_have_faces
|
74
|
+
end
|
75
|
+
end
|
76
|
+
|
77
|
+
context "indexing hashes" do
|
78
|
+
before :each do
|
79
|
+
@till_we_have_faces = { :title => "Till We Have Faces", :author => "C.S. Lewis" }
|
80
|
+
@ecclesiastes = { :title => "Ecclesiastes", :author => "The Teacher" }
|
81
|
+
@the_prodigal_god = { :title => "The Prodigal God", :author => "Tim Keller" }
|
82
|
+
|
83
|
+
@books = [
|
84
|
+
@till_we_have_faces,
|
85
|
+
@ecclesiastes,
|
86
|
+
@the_prodigal_god,
|
87
|
+
]
|
88
|
+
end
|
89
|
+
|
90
|
+
it "indexes on the hash key specified in :attribute" do
|
91
|
+
index = FuzzyTools::TfIdfIndex.new(:source => @books, :attribute => :title)
|
92
|
+
index.find("ecklestica").should == @ecclesiastes
|
93
|
+
end
|
94
|
+
|
95
|
+
it "indexes the proc result if a proc is given for :attribute" do
|
96
|
+
index = FuzzyTools::TfIdfIndex.new(:source => @books, :attribute => lambda { |book| book[:title] + " " + book[:author] })
|
97
|
+
index.find("prodigy").should == @the_prodigal_god
|
98
|
+
index.find("LEWIS").should == @till_we_have_faces
|
99
|
+
end
|
100
|
+
end
|
101
|
+
|
102
|
+
context "query methods" do
|
103
|
+
describe "#find" do
|
104
|
+
it "returns the best result" do
|
105
|
+
mushy_stuff = ["mushrooms", "mushroom", "mushy pit", "ABC"]
|
106
|
+
index = FuzzyTools::TfIdfIndex.new(:source => mushy_stuff)
|
107
|
+
|
108
|
+
index.find("ushr").should == "mushroom"
|
109
|
+
end
|
110
|
+
|
111
|
+
it "calls to_s on input" do
|
112
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
113
|
+
index.find(2).should == 2
|
114
|
+
end
|
115
|
+
|
116
|
+
it "returns nil if no results" do
|
117
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
118
|
+
index.find("bubble").should be_nil
|
119
|
+
end
|
120
|
+
end
|
121
|
+
|
122
|
+
describe "#all" do
|
123
|
+
it "returns all results, from best to worst" do
|
124
|
+
mushy_stuff = ["mushrooms", "mushroom", "mushy pit", "ABC"]
|
125
|
+
index = FuzzyTools::TfIdfIndex.new(:source => mushy_stuff)
|
126
|
+
|
127
|
+
index.all("ushr").should == [
|
128
|
+
"mushroom",
|
129
|
+
"mushrooms",
|
130
|
+
"mushy pit"
|
131
|
+
]
|
132
|
+
end
|
133
|
+
|
134
|
+
it "calls to_s on input" do
|
135
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
136
|
+
index.all(2).first.should == 2
|
137
|
+
end
|
138
|
+
|
139
|
+
it "returns an empty array if no results" do
|
140
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
141
|
+
index.all("bubble").should == []
|
142
|
+
end
|
143
|
+
end
|
144
|
+
|
145
|
+
|
146
|
+
describe "#all" do
|
147
|
+
it "returns all results, from best to worst" do
|
148
|
+
mushy_stuff = ["mushrooms", "mushroom", "mushy pit", "ABC"]
|
149
|
+
index = FuzzyTools::TfIdfIndex.new(:source => mushy_stuff)
|
150
|
+
|
151
|
+
index.all("ushr").should == [
|
152
|
+
"mushroom",
|
153
|
+
"mushrooms",
|
154
|
+
"mushy pit"
|
155
|
+
]
|
156
|
+
end
|
157
|
+
|
158
|
+
it "calls to_s on input" do
|
159
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
160
|
+
index.all(2).first.should == 2
|
161
|
+
end
|
162
|
+
|
163
|
+
it "returns an empty array if no results" do
|
164
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
165
|
+
index.all("bubble").should == []
|
166
|
+
end
|
167
|
+
end
|
168
|
+
|
169
|
+
describe "#all_with_scores" do
|
170
|
+
it "returns ordered array of arrays of score and results" do
|
171
|
+
mushy_stuff = ["mushrooms", "mushroom", "mushy pit", "ABC"]
|
172
|
+
index = FuzzyTools::TfIdfIndex.new(:source => mushy_stuff)
|
173
|
+
|
174
|
+
results = index.all_with_scores("ushr")
|
175
|
+
|
176
|
+
results.map(&:first).should == [
|
177
|
+
"mushroom",
|
178
|
+
"mushrooms",
|
179
|
+
"mushy pit"
|
180
|
+
]
|
181
|
+
|
182
|
+
results.sort_by { |doc, score| -score }.should == results
|
183
|
+
|
184
|
+
results.map(&:last).each { |score| score.class.should == Float }
|
185
|
+
results.map(&:last).each { |score| score.should > 0.0 }
|
186
|
+
results.map(&:last).each { |score| score.should < 1.0 }
|
187
|
+
results.map(&:last).uniq.should == results.map(&:last)
|
188
|
+
end
|
189
|
+
|
190
|
+
it "calls to_s on input" do
|
191
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
192
|
+
index.all_with_scores(2).first.should == [2, 1.0]
|
193
|
+
end
|
194
|
+
|
195
|
+
it "returns an empty array if no results" do
|
196
|
+
index = FuzzyTools::TfIdfIndex.new(:source => 1..3)
|
197
|
+
index.all_with_scores("bubble").should == []
|
198
|
+
end
|
199
|
+
end
|
200
|
+
end
|
201
|
+
end
|
metadata
ADDED
@@ -0,0 +1,121 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: fuzzy_tools
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 1.0.0
|
5
|
+
prerelease:
|
6
|
+
platform: ruby
|
7
|
+
authors:
|
8
|
+
- Brian Hempel
|
9
|
+
autorequire:
|
10
|
+
bindir: bin
|
11
|
+
cert_chain: []
|
12
|
+
date: 2012-07-24 00:00:00.000000000 Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: RubyInline
|
16
|
+
requirement: !ruby/object:Gem::Requirement
|
17
|
+
none: false
|
18
|
+
requirements:
|
19
|
+
- - ! '>='
|
20
|
+
- !ruby/object:Gem::Version
|
21
|
+
version: '0'
|
22
|
+
type: :runtime
|
23
|
+
prerelease: false
|
24
|
+
version_requirements: !ruby/object:Gem::Requirement
|
25
|
+
none: false
|
26
|
+
requirements:
|
27
|
+
- - ! '>='
|
28
|
+
- !ruby/object:Gem::Version
|
29
|
+
version: '0'
|
30
|
+
- !ruby/object:Gem::Dependency
|
31
|
+
name: bundler
|
32
|
+
requirement: !ruby/object:Gem::Requirement
|
33
|
+
none: false
|
34
|
+
requirements:
|
35
|
+
- - ! '>='
|
36
|
+
- !ruby/object:Gem::Version
|
37
|
+
version: '0'
|
38
|
+
type: :development
|
39
|
+
prerelease: false
|
40
|
+
version_requirements: !ruby/object:Gem::Requirement
|
41
|
+
none: false
|
42
|
+
requirements:
|
43
|
+
- - ! '>='
|
44
|
+
- !ruby/object:Gem::Version
|
45
|
+
version: '0'
|
46
|
+
- !ruby/object:Gem::Dependency
|
47
|
+
name: rspec
|
48
|
+
requirement: !ruby/object:Gem::Requirement
|
49
|
+
none: false
|
50
|
+
requirements:
|
51
|
+
- - ! '>='
|
52
|
+
- !ruby/object:Gem::Version
|
53
|
+
version: '0'
|
54
|
+
type: :development
|
55
|
+
prerelease: false
|
56
|
+
version_requirements: !ruby/object:Gem::Requirement
|
57
|
+
none: false
|
58
|
+
requirements:
|
59
|
+
- - ! '>='
|
60
|
+
- !ruby/object:Gem::Version
|
61
|
+
version: '0'
|
62
|
+
description: Easy, high quality fuzzy search in Ruby.
|
63
|
+
email:
|
64
|
+
- plasticchicken@gmail.com
|
65
|
+
executables: []
|
66
|
+
extensions: []
|
67
|
+
extra_rdoc_files: []
|
68
|
+
files:
|
69
|
+
- .rspec
|
70
|
+
- .travis.yml
|
71
|
+
- Gemfile
|
72
|
+
- README.md
|
73
|
+
- Rakefile
|
74
|
+
- fuzzy_tools.gemspec
|
75
|
+
- lib/fuzzy_tools.rb
|
76
|
+
- lib/fuzzy_tools/core_ext/enumerable.rb
|
77
|
+
- lib/fuzzy_tools/helpers.rb
|
78
|
+
- lib/fuzzy_tools/index.rb
|
79
|
+
- lib/fuzzy_tools/tf_idf_index.rb
|
80
|
+
- lib/fuzzy_tools/tokenizers.rb
|
81
|
+
- lib/fuzzy_tools/version.rb
|
82
|
+
- lib/fuzzy_tools/weighted_document_tokens.rb
|
83
|
+
- spec/enumerable_spec.rb
|
84
|
+
- spec/helpers_spec.rb
|
85
|
+
- spec/spec_helper.rb
|
86
|
+
- spec/tf_idf_index_spec.rb
|
87
|
+
homepage: https://github.com/brianhempel/fuzzy_tools
|
88
|
+
licenses: []
|
89
|
+
post_install_message:
|
90
|
+
rdoc_options: []
|
91
|
+
require_paths:
|
92
|
+
- lib
|
93
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
94
|
+
none: false
|
95
|
+
requirements:
|
96
|
+
- - ! '>='
|
97
|
+
- !ruby/object:Gem::Version
|
98
|
+
version: '0'
|
99
|
+
segments:
|
100
|
+
- 0
|
101
|
+
hash: -1099286336038854081
|
102
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
103
|
+
none: false
|
104
|
+
requirements:
|
105
|
+
- - ! '>='
|
106
|
+
- !ruby/object:Gem::Version
|
107
|
+
version: '0'
|
108
|
+
segments:
|
109
|
+
- 0
|
110
|
+
hash: -1099286336038854081
|
111
|
+
requirements: []
|
112
|
+
rubyforge_project:
|
113
|
+
rubygems_version: 1.8.24
|
114
|
+
signing_key:
|
115
|
+
specification_version: 3
|
116
|
+
summary: Easy, high quality fuzzy search in Ruby.
|
117
|
+
test_files:
|
118
|
+
- spec/enumerable_spec.rb
|
119
|
+
- spec/helpers_spec.rb
|
120
|
+
- spec/spec_helper.rb
|
121
|
+
- spec/tf_idf_index_spec.rb
|