tf-idf-similarity 0.0.8 → 0.0.9

Files changed (10)

data/.yardopts +4 -0
data/Gemfile +1 -1
data/README.md +22 -10
data/lib/tf-idf-similarity/collection.rb +9 -7
data/lib/tf-idf-similarity/document.rb +1 -0
data/lib/tf-idf-similarity/extras/collection.rb +32 -7
data/lib/tf-idf-similarity/extras/document.rb +18 -0
data/lib/tf-idf-similarity/version.rb +1 -1
data/lib/tf-idf-similarity.rb +4 -3
metadata +4 -2

data/.yardopts ADDED

@@ -0,0 +1,4 @@
+--no-private
+--hide-void-return
+--embed-mixin ClassMethods
+--markup=markdown

data/Gemfile CHANGED

@@ -1,4 +1,4 @@
 source "http://rubygems.org"
-# Specify your gem's dependencies in scraperwiki-api.gemspec
+# Specify your gem's dependencies in the gemspec
 gemspec

data/README.md CHANGED

@@ -40,14 +40,6 @@ Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can
     gem install narray
-### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
-You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
-### Other Options
-[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 ## Extras
 You can access more term frequency, document frequency, and normalization formulas with:
@@ -59,7 +51,19 @@ The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http:
 ## Why?
-The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsimilarity](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and have no normalization component. [vss](https://github.com/mkdynamic/vss) uses plain term and document frequencies, with no damping or normalization.
+No other Ruby gem implements the tf*idf formula used by Lucene, Sphinx and Ferret.
+### Term frequencies
+The [vss](https://github.com/mkdynamic/vss) gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important. The [treat](https://github.com/louismullie/treat), [tf_idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsemantic](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature. The [tf-idf](https://github.com/mchung/tf-idf) gem normalizes the frequency of a term in a document to the number of *unique* terms in that document, which never occurs in the literature.
+### Document frequencies
+The vss gem does not normalize the inverse document frequency. The tf_idf, tf-idf, similarity and rsemantic gems use variants of the typical inverse document frequency formula.
+### Normalization
+The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
 ## Reference
@@ -75,7 +79,15 @@ Lucene implements many more [similarity functions](http://lucene.apache.org/core
 * a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
 * a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
-Lucene can even [combine similarity meatures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
+Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
+## Other optimizations
+[Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/) is available through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
+### Other Options
+[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 ## Bugs? Questions?

data/lib/tf-idf-similarity/collection.rb CHANGED

@@ -153,15 +153,17 @@ class TfIdfSimilarity::Collection
       matrix.each_col(&:normalize!)
     elsif narray?
       # @see https://github.com/masa16/narray/issues/21
-      NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1))
+      NMatrix.refer(matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1)))
     elsif nmatrix?
       # @see https://github.com/SciRuby/nmatrix/issues/38
-      # @todo NMatrix has no way to perform scalar operations on matrices.
-      # (0...matrix.shape[0]).each do |i|
-      #   column = matrix.slice i, 0...matrix.shape[1]
-      #   norm   = column.dot column.transpose
-      #   # No way to divide column by norm.
-      # end
+      (0...matrix.shape[1]).each do |j|
+        # @see https://github.com/SciRuby/nmatrix/pull/46
+        column = matrix.column(j)
+        norm = Math.sqrt(column.transpose.dot(column)[0, 0])
+        (0...matrix.shape[0]).each do |i|
+          matrix[i, j] /= norm
+        end
+      end
       matrix.cast :yale, :float64
     else
       Matrix.columns matrix.column_vectors.map(&:normalize)
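
The nmatrix branch above divides every entry of a column by that column's Euclidean norm. The same idea can be sketched with only Ruby's standard `matrix` library. `normalize_columns` is a hypothetical helper written for illustration, not part of the gem, and the guard for all-zero columns is an added assumption.

```ruby
require 'matrix'

# Hypothetical helper (not part of the gem): scale every column of a
# term-document matrix to unit Euclidean length.
def normalize_columns(matrix)
  columns = matrix.column_vectors.map do |column|
    norm = Math.sqrt(column.inner_product(column))
    # Assumption: leave all-zero columns untouched to avoid dividing by zero.
    norm.zero? ? column.to_a : column.to_a.map { |value| value / norm }
  end
  Matrix.columns(columns)
end

m = Matrix[[3.0, 0.0],
           [4.0, 2.0]]
normalized = normalize_columns(m)
```

Each column of `normalized` has length 1, so the dot product of two columns is their cosine similarity.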

data/lib/tf-idf-similarity/document.rb CHANGED

@@ -34,6 +34,7 @@ class TfIdfSimilarity::Document
   # @return [Float] the square root of the term count
   #
   # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13
   def term_frequency(term)
     Math.sqrt term_counts[term].to_i
   end

data/lib/tf-idf-similarity/extras/collection.rb CHANGED

@@ -1,15 +1,32 @@
 require 'tf-idf-similarity/collection'
+# @note The treat and similarity gems do not add one to the inverse document frequency.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L16
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/corpus.rb#L44
+#
+# @note The tf-idf gem adds one to the numerator when calculating inverse document frequency.
+# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L153
+#
+# @note The vss gem does not take the logarithm of the inverse document frequency.
+# @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L79
+#
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Collection
+  # https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L17
+  #
   # SMART n, Salton x, Chisholm NONE
   def no_collection_frequency(term)
     1.0
   end
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L50
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L15
+  #
   # SMART t, Salton f, Chisholm IDFB
   def plain_inverse_document_frequency(term)
-    count = document_counts[term].to_f
-    Math.log documents.size / count
+    Math.log documents.size / document_counts[term].to_f
   end
   alias_method :plain_idf, :plain_inverse_document_frequency
@@ -58,6 +75,11 @@ class TfIdfSimilarity::Collection
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] the same matrix
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb
+  # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb
   #
   # SMART n, Salton x, Chisholm NONE
   def no_normalization(matrix)
@@ -66,20 +88,23 @@ class TfIdfSimilarity::Collection
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix in which all document vectors are unit vectors
+  # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/term_document_matrix.rb#L23
   #
   # SMART c, Salton c, Chisholm COSN
   def cosine_normalization(matrix)
-    Matrix.columns(tfidf.column_vectors.map do |column|
-      column.normalize
-    end)
+    if gsl?
+      matrix.each_col(&:normalize!)
+    else
+      Matrix.columns matrix.column_vectors.map(&:normalize)
+    end
   end
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix
+  # @todo http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
   #
   # SMART u, Chisholm PUQN
   def pivoted_unique_normalization(matrix)
-    # @todo
-    # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
+    raise NotImplementedError
   end
 end
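
The notes in this file contrast inverse document frequency variants that differ in whether one is added. A minimal sketch of two of them follows: the plain form `log(N / df)` shown in `plain_inverse_document_frequency`, and the classic Lucene form `1 + log(N / (df + 1))`. The variable names and the sample counts are hypothetical stand-ins for the collection's state, not the gem's API.

```ruby
# Hypothetical stand-ins for the collection's state.
document_count  = 4
document_counts = { 'ruby' => 1, 'gem' => 4 }

# Plain inverse document frequency: log(N / df).
plain_idf = lambda do |term|
  Math.log(document_count / document_counts[term].to_f)
end

# Classic Lucene-style inverse document frequency: 1 + log(N / (df + 1)).
lucene_idf = lambda do |term|
  1 + Math.log(document_count / (document_counts[term].to_f + 1))
end

plain_idf.call('gem')   # a term found in every document gets weight 0.0
lucene_idf.call('gem')  # stays positive because of the added one
```

With the plain form, a term that occurs in every document contributes nothing to similarity. The added one in the Lucene form keeps such terms at a small positive weight.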

data/lib/tf-idf-similarity/extras/document.rb CHANGED

@@ -1,5 +1,19 @@
 require 'tf-idf-similarity/document'
+# @todo http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
+#
+# @note The treat, tf_idf, similarity and rsemantic gems normalize to the number of terms in the document.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L77
+# @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L76
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/document.rb#L42
+# @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L17
+#
+# @note The tf-idf gem normalizes to the number of unique terms in the document.
+# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L172
+#
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Document
   # @return [Float] the maximum term count of any term in the document
   def maximum_term_count
@@ -12,6 +26,8 @@ class TfIdfSimilarity::Document
   end
   # Returns the term count.
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L75
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L11
   #
   # SMART n, Salton t, Chisholm FREQ
   def plain_term_frequency(term)
@@ -70,6 +86,8 @@ class TfIdfSimilarity::Document
   end
   alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L12
+  #
   # SMART l, Chisholm LOGA
   def log_term_frequency(term)
     count = term_counts[term]

data/lib/tf-idf-similarity/version.rb CHANGED

@@ -1,3 +1,3 @@
 module TfIdfSimilarity
-  VERSION = "0.0.8"
+  VERSION = "0.0.9"
 end

data/lib/tf-idf-similarity.rb CHANGED

@@ -1,5 +1,6 @@
 module TfIdfSimilarity
-  autoload :Collection, 'tf-idf-similarity/collection'
-  autoload :Document, 'tf-idf-similarity/document'
-  autoload :Token, 'tf-idf-similarity/token'
 end
+require 'tf-idf-similarity/collection'
+require 'tf-idf-similarity/document'
+require 'tf-idf-similarity/token'
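
This hunk replaces lazy `autoload` registrations with eager `require` calls. The diff does not state the motivation, so the sketch below only illustrates the behavioral difference: an autoloaded file is not evaluated until its constant is first referenced, whereas a required file is evaluated immediately. `AutoloadDemo`, `LazyThing` and the temporary file are hypothetical names for this demonstration.

```ruby
require 'tmpdir'

# Hypothetical namespace, used only for this demonstration.
module AutoloadDemo; end

before_load = after_load = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'lazy_thing.rb')
  File.write(path, <<~RUBY)
    $lazy_thing_loaded = true
    module AutoloadDemo
      class LazyThing; end
    end
  RUBY

  # Register the constant; the file is not evaluated yet.
  AutoloadDemo.autoload(:LazyThing, path)
  before_load = $lazy_thing_loaded

  # The first reference to the constant triggers the load.
  klass = AutoloadDemo::LazyThing
  after_load = $lazy_thing_loaded
end
```

With `require`, the file's top-level code would have run at the point of the `require` call, before any constant was referenced.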

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-  version: 0.0.8
+  version: 0.0.9
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-11-20 00:00:00.000000000 Z
+date: 2013-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode_utils
@@ -68,6 +68,7 @@ extra_rdoc_files: []
 files:
 - .gitignore
 - .travis.yml
+- .yardopts
 - Gemfile
 - LICENSE
 - README.md
@@ -106,3 +107,4 @@ signing_key:
 specification_version: 3
 summary: Calculates the similarity between texts using tf*idf
 test_files: []
+has_rdoc: