RubyGems - tf-idf-similarity - Versions diffs - 0.0.2 → 0.0.3 - Mend

tf-idf-similarity 0.0.2 → 0.0.3

Files changed (5)

data/README.md +13 -2
data/lib/tf-idf-similarity/collection.rb +81 -30
data/lib/tf-idf-similarity/document.rb +3 -0
data/lib/tf-idf-similarity/version.rb +1 -1
metadata +4 -4

data/README.md CHANGED

@@ -3,7 +3,7 @@
 [![Dependency Status](https://gemnasium.com/opennorth/tf-idf-similarity.png)](https://gemnasium.com/opennorth/tf-idf-similarity)
 [![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
-Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM 25](http://en.wikipedia.org/wiki/Okapi_BM25).
+Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) or similar (see below).
 ## Usage
@@ -41,7 +41,7 @@ gem install gsl
 You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
-### Other Considerations
+### Other Options
 The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
@@ -63,6 +63,17 @@ The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/r
 * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval."" Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
 * [E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.](http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf)
+## Further Reading
+Lucene implements many more [similarity functions](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/Similarity.html), such as:
+* a [divergence from randomness (DFR) framework](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/DFRSimilarity.html)
+* a [framework for the family of information-based models](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/IBSimilarity.html)
+* a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
+* a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
+Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
 ## Bugs? Questions?
 This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
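
The README describes the approach only in prose, so here is a minimal sketch of cosine similarity over a term-document matrix using Ruby's standard `matrix` library, the same fallback path `collection.rb` takes. It is not code from the gem: the weights are made-up numbers and the variable names are illustrative.

```ruby
require 'matrix'

# Rows are terms, columns are documents. The weights are invented for
# illustration; in the gem they would be tf-idf (or BM25) values.
weights = Matrix[
  [1.0, 2.0, 0.0],
  [1.0, 2.0, 0.0],
  [0.0, 0.0, 3.0]
]

# Scale every document (column) vector to unit length.
unit_columns = Matrix.columns(weights.column_vectors.map(&:normalize))

# With unit-length columns, the product of the transpose and the matrix
# holds the cosine similarity of every pair of documents.
similarity = unit_columns.transpose * unit_columns

puts similarity[0, 1] # documents 0 and 1 point in the same direction
puts similarity[0, 2] # documents 0 and 2 share no terms
```

Note that `Vector#normalize` raises an error for a zero vector, so a document with no weighted terms would need separate handling in this sketch.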

data/lib/tf-idf-similarity/collection.rb CHANGED

@@ -34,43 +34,42 @@ class TfIdfSimilarity::Collection
     term_counts.keys
   end
+  # @param [Hash] opts optional arguments
+  # @option opts [Symbol] :function one of :tfidf (default) or :bm25
+  #
+  # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
+  # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/BM25Similarity.html
   # @see http://en.wikipedia.org/wiki/Vector_space_model
   # @see http://en.wikipedia.org/wiki/Document-term_matrix
   # @see http://en.wikipedia.org/wiki/Cosine_similarity
-  def similarity_matrix
-    if matrix?
+  def similarity_matrix(opts = {})
+    if stdlib?
       idf = []
-      term_document_matrix = Matrix.build(terms.size, documents.size) do |i,j|
-        idf[i] ||= inverse_document_frequency terms[i]
-        documents[j].term_frequency(terms[i]) * idf[i]
+      matrix = Matrix.build(terms.size, documents.size) do |i,j|
+        idf[i] ||= inverse_document_frequency(terms[i], opts)
+        idf[i] * term_frequency(documents[j], terms[i], opts)
       end
     else
-      term_document_matrix = if gsl?
-        GSL::Matrix.alloc terms.size, documents.size
-      elsif narray?
-        NArray.float documents.size, terms.size
-      elsif nmatrix?
-        NMatrix.new(:list, [terms.size, documents.size], :float64)
-      end
+      matrix = initialize_matrix
       terms.each_with_index do |term,i|
-        idf = inverse_document_frequency term
+        idf = inverse_document_frequency(term, opts)
         documents.each_with_index do |document,j|
-          tfidf = document.term_frequency(term) * idf
-          if gsl? || nmatrix?
-            term_document_matrix[i, j] = tfidf
+          value = idf * term_frequency(document, term, opts)
           # NArray puts the dimensions in a different order.
           # @see http://narray.rubyforge.org/SPEC.en
-          elsif narray?
-            term_document_matrix[j, i] = tfidf
+          if narray?
+            matrix[j, i] = value
+          else
+            matrix[i, j] = value
           end
         end
       end
-    end
-    # Columns are normalized to unit vectors, so we can calculate the cosine
-    # similarity of all document vectors.
-    matrix = normalize term_document_matrix
+      # Columns are normalized to unit vectors, so we can calculate the cosine
+      # similarity of all document vectors. BM25 doesn't normalize columns, but
+      # BM25 wasn't written with this use case in mind.
+      matrix = normalize matrix
+    end
     if nmatrix?
       matrix.transpose.dot matrix
@@ -80,14 +79,46 @@ class TfIdfSimilarity::Collection
   end
   # @param [String] term a term
+  # @param [Hash] opts optional arguments
+  # @option opts [Symbol] :function one of :tfidf (default) or :bm25
   # @return [Float] the term's inverse document frequency
-  #
-  # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
-  def inverse_document_frequency(term)
-    1 + Math.log(documents.size / (document_counts[term].to_f + 1))
+  def inverse_document_frequency(term, opts = {})
+    if opts[:function] == :bm25
+      Math.log (documents.size - document_counts[term] + 0.5) / (document_counts[term] + 0.5)
+    else
+      1 + Math.log(documents.size / (document_counts[term].to_f + 1))
+    end
   end
   alias_method :idf, :inverse_document_frequency
+  # @param [Document] document a document
+  # @param [String] term a term
+  # @param [Hash] opts optional arguments
+  # @option opts [Symbol] :function one of :tfidf (default) or :bm25
+  # @return [Float] the term's frequency in the document
+  #
+  # @note Like Lucene, we use a b value of 0.75 and a k1 value of 1.2.
+  def term_frequency(document, term, opts = {})
+    if opts[:function] == :bm25
+      (document.term_counts[term] * 2.2) / (document.term_counts[term] + 0.3 + 0.9 * document.size / average_document_size)
+    else
+      document.term_frequency term
+    end
+  end
+  # @return [Float] the average document size, in terms
+  def average_document_size
+    @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
+  end
+  # Resets the average document size.
+  #
+  # If you have already made a similarity matrix and are adding more documents,
+  # call this method before creating a new similarity matrix.
+  def reset_average_document_size!
+    @average_document_size = nil
+  end
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix in which all document vectors are unit vectors
   #
@@ -99,7 +130,12 @@ class TfIdfSimilarity::Collection
       # @see https://github.com/masa16/narray/issues/21
       NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
     elsif nmatrix?
-      # @todo NMatrix has no way to retrieve a column, besides iteration.
+      # @todo NMatrix has no way to perform scalar operations on matrices.
+      # (0...matrix.shape[0]).each do |i|
+      #   column = matrix.slice i, 0...matrix.shape[1]
+      #   norm   = column.dot column.transpose
+      #   # No way to divide column by norm.
+      # end
       matrix.cast :yale, :float64
     else
       Matrix.columns matrix.column_vectors.map(&:normalize)
@@ -108,19 +144,34 @@ class TfIdfSimilarity::Collection
 private
+  # @return a matrix
+  def initialize_matrix
+    if gsl?
+      GSL::Matrix.alloc terms.size, documents.size
+    elsif narray?
+      NArray.float documents.size, terms.size
+    elsif nmatrix?
+      NMatrix.new(:list, [terms.size, documents.size], :float64)
+    end
+  end
+  # @return [Boolean] whether to use the GSL gem
   def gsl?
     @gsl     ||= Object.const_defined?(:GSL)
   end
+  # @return [Boolean] whether to use the NArray gem
   def narray?
     @narray  ||= Object.const_defined?(:NArray) && !gsl?
   end
+  # @return [Boolean] whether to use the NMatrix gem
   def nmatrix?
-    @nmatrix ||= Object.const_defined?(:NMatrix) && !narray?
+    @nmatrix ||= Object.const_defined?(:NMatrix) && !gsl? && !narray?
   end
-  def matrix?
+  # @return [Boolean] whether to use the standard library
+  def stdlib?
     @matrix  ||= Object.const_defined?(:Matrix)
   end
 end
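
The numeric literals in the new `:bm25` branches (2.2, 0.3, 0.9) follow from the k1 and b values named in the `@note`. The standalone sketch below spells that out. It is not code from the gem: the method names `bm25_idf` and `bm25_tf` and the sample numbers are hypothetical, and only the formulas and the constants k1 = 1.2 and b = 0.75 come from the diff above.

```ruby
# Standalone sketch of the BM25 weights added in 0.0.3.
# Method names and sample values are illustrative, not part of the gem.

K1 = 1.2
B  = 0.75

# BM25 inverse document frequency, matching the :bm25 branch of
# inverse_document_frequency: log((N - n + 0.5) / (n + 0.5)).
def bm25_idf(document_count, documents_with_term)
  Math.log((document_count - documents_with_term + 0.5) / (documents_with_term + 0.5))
end

# BM25 term frequency:
#   tf * (k1 + 1) / (tf + k1 * (1 - b + b * size / average_size))
# With k1 = 1.2 and b = 0.75 this expands to the literals in the diff:
#   tf * 2.2 / (tf + 0.3 + 0.9 * size / average_size)
def bm25_tf(term_count, document_size, average_document_size)
  (term_count * (K1 + 1)) /
    (term_count + K1 * (1 - B + B * document_size / average_document_size.to_f))
end

# A term that occurs twice in a 50-term document, in a collection of
# 10 documents averaging 40 terms, 3 of which contain the term.
puts bm25_idf(10, 3) * bm25_tf(2, 50, 40.0)
```

One property worth keeping in mind when reading the diff: this form of the BM25 idf is negative for any term that appears in more than half of the documents, whereas the default `:tfidf` branch adds 1 to the logarithm.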

data/lib/tf-idf-similarity/document.rb CHANGED

@@ -8,6 +8,8 @@ class TfIdfSimilarity::Document
   attr_reader :text
   # The number of times each term appears in the document.
   attr_reader :term_counts
+  # The document size, in terms.
+  attr_reader :size
   # @param [String] text the document's text
   # @param [Hash] opts optional arguments
@@ -43,6 +45,7 @@ private
         @term_counts[token.lowercase_filter.classic_filter.to_s] += 1
       end
     end
+    @size = term_counts.values.reduce(:+)
   end
   # Tokenizes a text, respecting the word boundary rules from Unicode’s Default
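
The new `size` attribute is the sum of the term counts, and `average_document_size` in `collection.rb` averages it across documents. The sketch below mirrors both calculations outside the gem. The helper names are hypothetical, and seeding `reduce` with 0 is an assumption about the desired behavior rather than what the diff does: `reduce(:+)` without a seed returns `nil` for an empty list of counts.

```ruby
# Illustrative helpers, not part of the gem.

# Sum of all term counts, i.e. the document length in terms.
# Seeding with 0 makes an empty document have size 0 instead of nil.
def document_size(term_counts)
  term_counts.values.reduce(0, :+)
end

# Mean document length; to_f avoids integer division.
def average_document_size(sizes)
  sizes.reduce(0, :+) / sizes.size.to_f
end

term_counts = Hash.new(0)
%w[the cat saw the other cat].each { |token| term_counts[token] += 1 }

puts document_size(term_counts)
puts average_document_size([document_size(term_counts), 4])
```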

data/lib/tf-idf-similarity/version.rb CHANGED

@@ -1,3 +1,3 @@
 module TfIdfSimilarity
-  VERSION = "0.0.2"
+  VERSION = "0.0.3"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-  version: 0.0.2
+  version: 0.0.3
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-09-10 00:00:00.000000000 Z
+date: 2012-09-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode_utils
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: -1570138910816303214
+      hash: -4125970683092216956
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: -1570138910816303214
+      hash: -4125970683092216956
 requirements: []
 rubyforge_project:
 rubygems_version: 1.8.24