tf-idf-similarity 0.0.8 → 0.0.9

Files changed (10)

data/.yardopts +4 -0
data/Gemfile +1 -1
data/README.md +22 -10
data/lib/tf-idf-similarity/collection.rb +9 -7
data/lib/tf-idf-similarity/document.rb +1 -0
data/lib/tf-idf-similarity/extras/collection.rb +32 -7
data/lib/tf-idf-similarity/extras/document.rb +18 -0
data/lib/tf-idf-similarity/version.rb +1 -1
data/lib/tf-idf-similarity.rb +4 -3
metadata +4 -2

data/.yardopts ADDED

@@ -0,0 +1,4 @@
+--no-private
+--hide-void-return
+--embed-mixin ClassMethods
+--markup=markdown

data/Gemfile CHANGED

@@ -1,4 +1,4 @@
 source "http://rubygems.org"
-# Specify your gem's dependencies in scraperwiki-api.gemspec
+# Specify your gem's dependencies in the gemspec
 gemspec

data/README.md CHANGED

@@ -40,14 +40,6 @@ Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can
     gem install narray
-### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
-You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
-### Other Options
-[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 ## Extras
 You can access more term frequency, document frequency, and normalization formulas with:
@@ -59,7 +51,19 @@ The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http:
 ## Why?
-The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsimilarity](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and have no normalization component. [vss](https://github.com/mkdynamic/vss) uses plain term and document frequencies, with no damping or normalization.
+No other Ruby gem implements the tf*idf formula used by Lucene, Sphinx and Ferret.
+### Term frequencies
+The [vss](https://github.com/mkdynamic/vss) gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important. The [treat](https://github.com/louismullie/treat), [tf_idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsemantic](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature. The [tf-idf](https://github.com/mchung/tf-idf) gem normalizes the frequency of a term in a document to the number of *unique* terms in that document, which never occurs in the literature.
+### Document frequencies
+The vss gem does not normalize the inverse document frequency. The tf_idf, tf-idf, similarity and rsemantic gems use variants of the typical inverse document frequency formula.
+### Normalization
+The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
 ## Reference
@@ -75,7 +79,15 @@ Lucene implements many more [similarity functions](http://lucene.apache.org/core
 * a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
 * a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
-Lucene can even [combine similarity meatures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
+Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
+## Other optimizations
+[Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/) is available through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
+### Other Options
+[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 ## Bugs? Questions?
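
The term frequency variants that the new README text contrasts can be sketched side by side. This is a minimal illustration, not code from any of the gems named above: the token list and variable names are hypothetical, and only the formulas (plain count, count over number of terms, count over number of unique terms, square root of the count) come from the README text and from `Document#term_frequency` in this diff.

```ruby
# Hypothetical tokenized document; not taken from the gem.
tokens = %w[the cat sat on the mat the end]

# Count occurrences of each term.
counts = Hash.new(0)
tokens.each { |token| counts[token] += 1 }

term = "the"

# Plain count, no normalization (the approach attributed to vss).
raw = counts[term]

# Count divided by the number of terms in the document
# (attributed to treat, tf_idf, similarity and rsemantic).
per_term = counts[term].to_f / tokens.size

# Count divided by the number of unique terms (attributed to tf-idf).
per_unique = counts[term].to_f / counts.size

# Square root of the count, as in Document#term_frequency in this diff.
damped = Math.sqrt(counts[term])

puts raw, per_term, per_unique, damped
```

The square root grows more slowly than the raw count, so a term that repeats many times in one document has a smaller effect on the score than it would with the plain count.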

data/lib/tf-idf-similarity/collection.rb CHANGED

@@ -153,15 +153,17 @@ class TfIdfSimilarity::Collection
       matrix.each_col(&:normalize!)
     elsif narray?
       # @see https://github.com/masa16/narray/issues/21
-      NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1))
+      NMatrix.refer(matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1)))
     elsif nmatrix?
       # @see https://github.com/SciRuby/nmatrix/issues/38
-      # @todo NMatrix has no way to perform scalar operations on matrices.
-      # (0...matrix.shape[0]).each do |i|
-      #   column = matrix.slice i, 0...matrix.shape[1]
-      #   norm   = column.dot column.transpose
-      #   # No way to divide column by norm.
-      # end
+      (0...matrix.shape[1]).each do |j|
+        # @see https://github.com/SciRuby/nmatrix/pull/46
+        column = matrix.column(j)
+        norm = Math.sqrt(column.transpose.dot(column)[0, 0])
+        (0...matrix.shape[0]).each do |i|
+          matrix[i, j] /= norm
+        end
+      end
       matrix.cast :yale, :float64
     else
       Matrix.columns matrix.column_vectors.map(&:normalize)
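
The column normalization that each branch of this hunk performs can be sketched with Ruby's standard `matrix` library alone. The sample values are hypothetical, and the explicit loop is only an approximation of the nmatrix branch written against plain arrays; it does not use the NMatrix API.

```ruby
require 'matrix'

# Hypothetical 2x2 term-document matrix: rows are terms, columns are documents.
matrix = Matrix[[3.0, 0.0], [4.0, 2.0]]

# Same expression as the fallback branch in the diff:
# scale every column vector to unit length.
normalized = Matrix.columns(matrix.column_vectors.map(&:normalize))

# Explicit loop in the spirit of the nmatrix branch: compute each
# column's Euclidean norm, then divide every entry in that column by it.
rows = matrix.to_a
(0...matrix.column_count).each do |j|
  norm = Math.sqrt((0...matrix.row_count).sum { |i| rows[i][j]**2 })
  (0...matrix.row_count).each { |i| rows[i][j] /= norm }
end
by_hand = Matrix.rows(rows)

puts normalized
puts by_hand
```

Both approaches should yield columns of length one, which is what makes the dot product of two document columns equal to their cosine similarity.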

data/lib/tf-idf-similarity/document.rb CHANGED

@@ -34,6 +34,7 @@ class TfIdfSimilarity::Document
   # @return [Float] the square root of the term count
   #
   # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13
   def term_frequency(term)
     Math.sqrt term_counts[term].to_i
   end

data/lib/tf-idf-similarity/extras/collection.rb CHANGED

@@ -1,15 +1,32 @@
 require 'tf-idf-similarity/collection'
+# @note The treat and similarity gems do not add one to the inverse document frequency.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L16
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/corpus.rb#L44
+#
+# @note The tf-idf gem adds one to the numerator when calculating inverse document frequency.
+# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L153
+#
+# @note The vss gem does not take the logarithm of the inverse document frequency.
+# @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L79
+#
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Collection
+  # https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L17
+  #
   # SMART n, Salton x, Chisholm NONE
   def no_collection_frequency(term)
     1.0
   end
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L50
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L15
+  #
   # SMART t, Salton f, Chisholm IDFB
   def plain_inverse_document_frequency(term)
-    count = document_counts[term].to_f
-    Math.log documents.size / count
+    Math.log documents.size / document_counts[term].to_f
   end
   alias_method :plain_idf, :plain_inverse_document_frequency
@@ -58,6 +75,11 @@ class TfIdfSimilarity::Collection
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] the same matrix
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb
+  # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb
   #
   # SMART n, Salton x, Chisholm NONE
   def no_normalization(matrix)
@@ -66,20 +88,23 @@ class TfIdfSimilarity::Collection
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix in which all document vectors are unit vectors
+  # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/term_document_matrix.rb#L23
   #
   # SMART c, Salton c, Chisholm COSN
   def cosine_normalization(matrix)
-    Matrix.columns(tfidf.column_vectors.map do |column|
-      column.normalize
-    end)
+    if gsl?
+      matrix.each_col(&:normalize!)
+    else
+      Matrix.columns matrix.column_vectors.map(&:normalize)
+    end
   end
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix
+  # @todo http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
   #
   # SMART u, Chisholm PUQN
   def pivoted_unique_normalization(matrix)
-    # @todo
-    # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
+    raise NotImplementedError
   end
 end
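
The inverse document frequency variants mentioned in the notes above can be compared on a toy collection. This is a sketch under stated assumptions: the collection size, the counts and the lambda names are hypothetical, and the `lucene` lambda assumes the classic Lucene form `1 + log(N / (df + 1))`, which this diff does not show.

```ruby
# Hypothetical collection: 4 documents, with per-term document counts.
documents_size  = 4
document_counts = { "cat" => 1, "the" => 4 }

# Plain idf, as in plain_inverse_document_frequency: log(N / df).
# The to_f matters; without it 4 / 3 would be integer division.
plain = ->(term) { Math.log(documents_size / document_counts[term].to_f) }

# Ratio without the logarithm (the behaviour the notes attribute to vss).
no_log = ->(term) { documents_size / document_counts[term].to_f }

# Assumed classic Lucene form: 1 + log(N / (df + 1)).
lucene = ->(term) { 1 + Math.log(documents_size / (document_counts[term].to_f + 1)) }

%w[cat the].each do |term|
  puts "#{term}: plain=#{plain.(term)} no_log=#{no_log.(term)} lucene=#{lucene.(term)}"
end
```

Note that `Math.log documents.size / x`, written without parentheses as in the diff, parses as `Math.log(documents.size / x)`, so the one-line rewrite of `plain_inverse_document_frequency` keeps the same meaning as the two-line original.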

data/lib/tf-idf-similarity/extras/document.rb CHANGED

@@ -1,5 +1,19 @@
 require 'tf-idf-similarity/document'
+# @todo http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
+#
+# @note The treat, tf_idf, similarity and rsemantic gems normalize to the number of terms in the document.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L77
+# @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L76
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/document.rb#L42
+# @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L17
+#
+# @note The tf-idf gem normalizes to the number of unique terms in the document.
+# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L172
+#
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Document
   # @return [Float] the maximum term count of any term in the document
   def maximum_term_count
@@ -12,6 +26,8 @@ class TfIdfSimilarity::Document
   end
   # Returns the term count.
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L75
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L11
   #
   # SMART n, Salton t, Chisholm FREQ
   def plain_term_frequency(term)
@@ -70,6 +86,8 @@ class TfIdfSimilarity::Document
   end
   alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L12
+  #
   # SMART l, Chisholm LOGA
   def log_term_frequency(term)
     count = term_counts[term]

data/lib/tf-idf-similarity/version.rb CHANGED

@@ -1,3 +1,3 @@
 module TfIdfSimilarity
-  VERSION = "0.0.8"
+  VERSION = "0.0.9"
 end

data/lib/tf-idf-similarity.rb CHANGED

@@ -1,5 +1,6 @@
 module TfIdfSimilarity
-  autoload :Collection, 'tf-idf-similarity/collection'
-  autoload :Document, 'tf-idf-similarity/document'
-  autoload :Token, 'tf-idf-similarity/token'
 end
+require 'tf-idf-similarity/collection'
+require 'tf-idf-similarity/document'
+require 'tf-idf-similarity/token'
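
This hunk swaps lazy `autoload` registrations for eager `require` calls. The behavioural difference can be sketched in isolation; the module name, class name and temporary file below are hypothetical and are not part of the gem.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical namespace used only for this demonstration.
module AutoloadDemo; end

# Write a small source file that defines a constant inside the namespace.
dir  = Dir.mktmpdir
path = File.join(dir, 'lazy_thing.rb')
File.write(path, "module AutoloadDemo\n  class LazyThing; end\nend\n")

# autoload only registers the path; the file is not read yet.
AutoloadDemo.autoload(:LazyThing, path)
before = AutoloadDemo.autoload?(:LazyThing) # the registered path

# The first reference to the constant triggers the load.
klass = AutoloadDemo::LazyThing
after = AutoloadDemo.autoload?(:LazyThing)  # nil once loaded

puts before.inspect, klass, after.inspect

FileUtils.remove_entry(dir)
```

With plain `require`, by contrast, every file is loaded as soon as `tf-idf-similarity` itself is required, so all three constants are defined up front regardless of which ones the caller uses.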

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-  version: 0.0.8
+  version: 0.0.9
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-11-20 00:00:00.000000000 Z
+date: 2013-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode_utils
@@ -68,6 +68,7 @@ extra_rdoc_files: []
 files:
 - .gitignore
 - .travis.yml
+- .yardopts
 - Gemfile
 - LICENSE
 - README.md
@@ -106,3 +107,4 @@ signing_key:
 specification_version: 3
 summary: Calculates the similarity between texts using tf*idf
 test_files: []
+has_rdoc: