RubyGems - tf-idf-similarity - Versions diffs - 0.0.9 → 0.1.0 - Mend

tf-idf-similarity 0.0.9 → 0.1.0


Files changed (22)

data/.travis.yml +29 -0
data/Gemfile +4 -0
data/README.md +41 -29
data/lib/tf-idf-similarity.rb +12 -1
data/lib/tf-idf-similarity/document.rb +35 -28
data/lib/tf-idf-similarity/extras/document.rb +2 -125
data/lib/tf-idf-similarity/extras/tf_idf_model.rb +192 -0
data/lib/tf-idf-similarity/matrix_methods.rb +164 -0
data/lib/tf-idf-similarity/term_count_model.rb +78 -0
data/lib/tf-idf-similarity/tf_idf_model.rb +81 -0
data/lib/tf-idf-similarity/token.rb +34 -12
data/lib/tf-idf-similarity/version.rb +1 -1
data/spec/document_spec.rb +136 -0
data/spec/extras/tf_idf_model_spec.rb +269 -0
data/spec/spec_helper.rb +21 -0
data/spec/term_count_model_spec.rb +108 -0
data/spec/tf_idf_model_spec.rb +174 -0
data/spec/token_spec.rb +34 -0
data/td-idf-similarity.gemspec +3 -3
metadata +91 -63
data/lib/tf-idf-similarity/collection.rb +0 -205
data/lib/tf-idf-similarity/extras/collection.rb +0 -110

data/.travis.yml CHANGED

@@ -1,3 +1,32 @@
 language: ruby
 rvm:
+  - 1.8.7
+  - 1.9.2
   - 1.9.3
+  - 2.0.0
+  - ree
+env:
+  - MATRIX_LIBRARY=gsl
+  - MATRIX_LIBRARY=narray
+  - MATRIX_LIBRARY=nmatrix
+  - MATRIX_LIBRARY=matrix
+matrix:
+  exclude:
+    - rvm: 1.8.7
+      env: MATRIX_LIBRARY=nmatrix
+    - rvm: ree
+      env: MATRIX_LIBRARY=nmatrix
+before_install:
+  - bundle config build.nmatrix --with-lapacklib
+  - if [ $MATRIX_LIBRARY = 'nmatrix' -o $MATRIX_LIBRARY = 'gsl' ]; then sudo apt-get update -qq; fi
+  - if [ $MATRIX_LIBRARY = 'gsl' ]; then sudo apt-get install gsl-bin libgsl0-dev; fi
+  # Installing ATLAS will install BLAS.
+  - if [ $MATRIX_LIBRARY = 'nmatrix' ]; then sudo apt-get install -qq libatlas-dev libatlas-base-dev libatlas3gf-base; fi
+  - if [ $MATRIX_LIBRARY = 'nmatrix' ]; then export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/include/atlas; fi
+  - if [ $MATRIX_LIBRARY = 'nmatrix' ]; then git clone git://github.com/SciRuby/nmatrix.git; fi
+  - if [ $MATRIX_LIBRARY = 'nmatrix' ]; then cd nmatrix && ORIGINAL_BUNDLE_GEMFILE=$BUNDLE_GEMFILE; fi
+  - if [ $MATRIX_LIBRARY = 'nmatrix' ]; then BUNDLE_GEMFILE=`pwd`/Gemfile && bundle && bundle exec rake install; fi
+  - if [ $MATRIX_LIBRARY = 'nmatrix' ]; then cd .. && BUNDLE_GEMFILE=$ORIGINAL_BUNDLE_GEMFILE; fi
+# Travis sometimes runs without Bundler.
+install: bundle
+script: bundle exec rake --trace

data/Gemfile CHANGED

@@ -1,4 +1,8 @@
 source "http://rubygems.org"
+gem 'gsl', '~> 1.15.3'     if ENV['MATRIX_LIBRARY'] == 'gsl'
+gem 'narray', '~> 0.6.0.0' if ENV['MATRIX_LIBRARY'] == 'narray'
+gem 'nmatrix', :git => 'git://github.com/SciRuby/nmatrix.git' if ENV['MATRIX_LIBRARY'] == 'nmatrix' && RUBY_VERSION >= '1.9'
 # Specify your gem's dependencies in the gemspec
 gemspec

data/README.md CHANGED

@@ -1,70 +1,90 @@
 # Ruby Vector Space Model (VSM) with tf*idf weights
+[![Build Status](https://secure.travis-ci.org/opennorth/tf-idf-similarity.png)](http://travis-ci.org/opennorth/tf-idf-similarity)
 [![Dependency Status](https://gemnasium.com/opennorth/tf-idf-similarity.png)](https://gemnasium.com/opennorth/tf-idf-similarity)
-[![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
+[![Coverage Status](https://coveralls.io/repos/opennorth/tf-idf-similarity/badge.png?branch=master)](https://coveralls.io/r/opennorth/tf-idf-similarity)
+[![Code Climate](https://codeclimate.com/github/opennorth/tf-idf-similarity.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
-Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) or similar (see below).
+Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency (tf*idf)](http://en.wikipedia.org/wiki/) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) or similar (see below).
 ## Usage
+    require 'matrix'
     require 'tf-idf-similarity'
-    corpus = TfIdfSimilarity::Collection.new
+Create a set of documents:
+    corpus = []
     corpus << TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
     corpus << TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
     corpus << TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")
-    p corpus.similarity_matrix
+Create a document-term matrix using [Term Frequency-Inverse Document Frequency function](http://en.wikipedia.org/wiki/) (default):
-## Optimizations
+    model = TfIdfSimilarity::TfIdfModel.new(corpus, :function => :tf_idf)
-This gem will use the first available library below, for faster matrix multiplication.
+Create a document-term matrix using the [Okapi BM25 ranking function](http://en.wikipedia.org/wiki/Okapi_BM25):
-### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
+    model = TfIdfSimilarity::TfIdfModel.new(corpus, :function => :bm25)
+[Read the documentation at RubyDoc.info.](http://rubydoc.info/gems/tf-idf-similarity)
+## Speed
-The latest [gsl gem](http://rb-gsl.rubyforge.org/) (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
+Instead of using the Ruby Standard Library's [Matrix](http://www.ruby-doc.org/stdlib-2.0/libdoc/matrix/rdoc/Matrix.html) class, you can use one of the `gsl`, `narray` or `nmatrix` gems for faster matrix operations, e.g.:
-```sh
-cd /usr/local
-git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
-brew install gsl
-git checkout master
-git branch -d gsl-1.14
-```
+    require 'gsl'
+    model = TfIdfSimilarity::TfIdfModel.new(corpus, :library => :gsl)
-Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can now run:
+### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
-    gem install gsl --no-ri --no-rdoc
+    gem install gsl
 ### [NArray](http://narray.rubyforge.org/)
     gem install narray
+### [NMatrix](https://github.com/SciRuby/nmatrix)
+The nmatrix gem gives access to [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/), which you may know of through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install the nmatrix gem. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/Installation).
 ## Extras
 You can access more term frequency, document frequency, and normalization formulas with:
-    require 'tf-idf-similarity/extras/collection'
     require 'tf-idf-similarity/extras/document'
+    require 'tf-idf-similarity/extras/tf_idf_model'
 The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
 ## Why?
-No other Ruby gem implements the tf*idf formula used by Lucene, Sphinx and Ferret.
+At the time of writing, no other Ruby gem implemented the tf*idf formula used by Lucene, Sphinx and Ferret.
+* [rsemantic](https://github.com/josephwilk/rsemantic) now uses the same [term frequency](https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L14) and [document frequency](https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L13) formulas as Lucene.
+* [treat](https://github.com/louismullie/treat) offers many term frequency formulas, [one of which](https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13) is the same as Lucene.
+* [similarity](https://github.com/bbcrd/Similarity) uses [cosine normalization](https://github.com/bbcrd/Similarity/blob/master/lib/similarity/term_document_matrix.rb#L23), which corresponds roughly to Lucene.
 ### Term frequencies
-The [vss](https://github.com/mkdynamic/vss) gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important. The [treat](https://github.com/louismullie/treat), [tf_idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsemantic](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature. The [tf-idf](https://github.com/mchung/tf-idf) gem normalizes the frequency of a term in a document to the number of *unique* terms in that document, which never occurs in the literature.
+The [vss](https://github.com/mkdynamic/vss) gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important. The [tf_idf](https://github.com/reddavis/TF-IDF) and similarity gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature. The [tf-idf](https://github.com/mchung/tf-idf) gem normalizes the frequency of a term in a document to the number of *unique* terms in that document, which never occurs in the literature.
 ### Document frequencies
-The vss gem does not normalize the inverse document frequency. The tf_idf, tf-idf, similarity and rsemantic gems use variants of the typical inverse document frequency formula.
+The vss gem does not normalize the inverse document frequency. The treat, tf_idf, tf-idf and similarity gems use variants of the typical inverse document frequency formula.
 ### Normalization
 The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
+## Additional adapters
+Adapters for the following projects were also considered:
+* [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme.
+* [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) give access to LAPACK from Ruby, but are old and unavailable as gems.
 ## Reference
 * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
@@ -81,14 +101,6 @@ Lucene implements many more [similarity functions](http://lucene.apache.org/core
 Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
-## Other optimizations
-[Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/) is available through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
-### Other Options
-[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 ## Bugs? Questions?
 This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.

data/lib/tf-idf-similarity.rb CHANGED

@@ -1,6 +1,17 @@
+require 'forwardable'
+require 'set'
+begin
+  require 'unicode_utils'
+rescue LoadError
+  # Ruby 1.8
+end
 module TfIdfSimilarity
 end
-require 'tf-idf-similarity/collection'
+require 'tf-idf-similarity/matrix_methods'
+require 'tf-idf-similarity/term_count_model'
+require 'tf-idf-similarity/tf_idf_model'
 require 'tf-idf-similarity/document'
 require 'tf-idf-similarity/token'

data/lib/tf-idf-similarity/document.rb CHANGED

@@ -1,56 +1,63 @@
-# coding: utf-8
-require 'unicode_utils'
+# A document.
 class TfIdfSimilarity::Document
-  # An optional document identifier.
+  # The document's identifier.
   attr_reader :id
   # The document's text.
   attr_reader :text
-  # The document's tokenized text.
-  attr_reader :tokens
   # The number of times each term appears in the document.
   attr_reader :term_counts
-  # The document size, in terms.
+  # The number of tokens in the document.
   attr_reader :size
   # @param [String] text the document's text
   # @param [Hash] opts optional arguments
-  # @option opts [String] :id a string to identify the document
+  # @option opts [String] :id the document's identifier
   # @option opts [Array] :tokens the document's tokenized text
+  # @option opts [Hash] :term_counts the number of times each term appears
+  # @option opts [Integer] :size the number of tokens in the document
   def initialize(text, opts = {})
-    @text        = text
-    @id          = opts[:id] || object_id
-    @tokens      = opts[:tokens]
-    @term_counts = Hash.new 0
-    process
+    @text   = text
+    @id     = opts[:id] || object_id
+    @tokens = opts[:tokens]
+    if opts[:term_counts]
+      @term_counts = opts[:term_counts]
+      @size = opts[:size] || term_counts.values.reduce(0, :+)
+      # Nothing to do.
+    else
+      @term_counts = Hash.new(0)
+      @size = 0
+      set_term_counts_and_size
+    end
   end
-  # @return [Array<String>] the set of the document's terms with no duplicates
+  # Returns the set of terms in the document.
+  #
+  # @return [Array<String>] the unique terms in the document
   def terms
     term_counts.keys
   end
-  # @param [String] term a term
-  # @return [Float] the square root of the term count
+  # Returns the number of occurrences of the term in the document.
   #
-  # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
-  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13
-  def term_frequency(term)
-    Math.sqrt term_counts[term].to_i
+  # @param [String] term a term
+  # @return [Integer] the number of times the term appears in the document
+  def term_count(term)
+    term_counts[term].to_i # need #to_i if unmarshalled
   end
-  alias_method :tf, :term_frequency
 private
-  # Tokenize the text and counts terms.
-  def process
+  # Tokenizes the text and counts terms and total tokens.
+  def set_term_counts_and_size
     tokenize(text).each do |word|
-      token = TfIdfSimilarity::Token.new word
+      token = TfIdfSimilarity::Token.new(word)
       if token.valid?
-        @term_counts[token.lowercase_filter.classic_filter.to_s] += 1
+        term = token.lowercase_filter.classic_filter.to_s
+        @term_counts[term] += 1
+        @size += 1
       end
     end
-    @size = term_counts.values.reduce(:+)
   end
   # Tokenizes a text, respecting the word boundary rules from Unicode’s Default
@@ -68,6 +75,6 @@ private
   # @see http://unicode.org/reports/tr29/#Default_Word_Boundaries
   # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
   def tokenize(text)
-    @tokens || UnicodeUtils.each_word(text)
+    @tokens || defined?(UnicodeUtils) && UnicodeUtils.each_word(text) || text.split(/\b/) # @todo Ruby 1.8 has no good word boundary code
   end
 end
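
Note on this change: the counting loop in `set_term_counts_and_size` can be tried without the gem. The following is a minimal sketch, assuming the Ruby 1.8 fallback tokenizer (`text.split(/\b/)`) from the hunk above. `word_like?` and `count_terms` are hypothetical names; `word_like?` stands in for `TfIdfSimilarity::Token#valid?`, whose rules are not shown in this file, and plain `downcase` stands in for the `lowercase_filter.classic_filter` chain.

```ruby
# Hypothetical stand-in for TfIdfSimilarity::Token#valid?: keep tokens that
# contain at least one letter or digit.
def word_like?(token)
  !!(token =~ /[[:alnum:]]/)
end

# Counts terms and total tokens in a single pass, in the same way as the new
# set_term_counts_and_size method (assumption: downcase is the only filter).
def count_terms(text)
  term_counts = Hash.new(0)
  size = 0
  text.split(/\b/).each do |word|
    next unless word_like?(word)
    term_counts[word.downcase] += 1
    size += 1
  end
  [term_counts, size]
end

term_counts, size = count_terms("Lorem ipsum dolor. Ipsum lorem ipsum!")
p term_counts
p size
```

Incrementing `size` inside the loop keeps it defined (as 0) for a text with no valid tokens, which the previous `reduce(:+)` on an empty hash did not.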

data/lib/tf-idf-similarity/extras/document.rb CHANGED

@@ -1,134 +1,11 @@
-require 'tf-idf-similarity/document'
-# @todo http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
-#
-# @note The treat, tf_idf, similarity and rsemantic gems normalizes to the number of terms in the document.
-# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L77
-# @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L76
-# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/document.rb#L42
-# @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L17
-#
-# @note The tf-idf gem normalizes to the number of unique terms in the document.
-# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L172
-#
-# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
-# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
-# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Document
   # @return [Float] the maximum term count of any term in the document
   def maximum_term_count
-    @maximum_term_count ||= @term_counts.values.max.to_f
+    @maximum_term_count ||= term_counts.values.max.to_f
   end
   # @return [Float] the average term count of all terms in the document
   def average_term_count
-    @average_term_count ||= @term_counts.values.reduce(:+) / @term_counts.size.to_f
-  end
-  # Returns the term count.
-  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L75
-  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L11
-  #
-  # SMART n, Salton t, Chisholm FREQ
-  def plain_term_frequency(term)
-    term_counts[term]
-  end
-  alias :plain_tf, :plain_term_frequency
-  # Returns 1 if the term is present, 0 otherwise.
-  #
-  # SMART b, Salton b, Chisholm BNRY
-  def binary_term_frequency(term)
-    count = term_counts[term]
-    if count > 0
-      1
-    else
-      0
-    end
-  end
-  alias_method :binary_tf, :binary_term_frequency
-  # Normalizes the term count by the maximum term count.
-  #
-  # @see http://en.wikipedia.org/wiki/Tf*idf
-  def normalized_term_frequency(term)
-    term_counts[term] / maximum_term_count
-  end
-  alias_method :normalized_tf, :normalized_term_frequency
-  # Further normalizes the normalized term frequency to lie between 0.5 and 1.
-  #
-  # SMART a, Salton n, Chisholm ATF1
-  def augmented_normalized_term_frequency(term)
-    0.5 + 0.5 * normalized_term_frequency(term)
-  end
-  alias_method :augmented_normalized_tf, :augmented_normalized_term_frequency
-  # Chisholm ATFA
-  def augmented_average_term_frequency(term)
-    count = term_counts[term]
-    if count > 0
-      0.9 + 0.1 * count / average_term_count
-    else
-      0
-    end
-  end
-  alias_method :augmented_average_tf, :augmented_average_term_frequency
-  # Chisholm ATFC
-  def changed_coefficient_augmented_normalized_term_frequency(term)
-    count = term_counts[term]
-    if count > 0
-      0.2 + 0.8 * count / maximum_term_count
-    else
-      0
-    end
-  end
-  alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
-  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L12
-  #
-  # SMART l, Chisholm LOGA
-  def log_term_frequency(term)
-    count = term_counts[term]
-    if count > 0
-      1 + Math.log(count)
-    else
-      0
-    end
-  end
-  alias_method :log_tf, :log_term_frequency
-  # SMART L, Chisholm LOGN
-  def normalized_log_term_frequency(term)
-    count = term_counts[term]
-    if count > 0
-      (1 + Math.log(count)) / (1 + Math.log(average_term_count))
-    else
-      0
-    end
-  end
-  alias_method :normalized_log_tf, :normalized_log_term_frequency
-  # Chisholm LOGG
-  def augmented_log_term_frequency(term)
-    count = term_counts[term]
-    if count > 0
-      0.2 + 0.8 * Math.log(count + 1)
-    else
-      0
-    end
-  end
-  alias_method :augmented_log_tf, :augmented_log_term_frequency
-  # Chisholm SQRT
-  def square_root_term_frequency(term)
-    count = term_counts[term]
-    if count > 0
-      Math.sqrt(count - 0.5) + 1
-    else
-      0
-    end
+    @average_term_count ||= term_counts.values.reduce(0, :+) / term_counts.size.to_f
   end
-  alias_method :square_root_tf, :square_root_term_frequency
 end
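
Note on this change: the switch from `reduce(:+)` to `reduce(0, :+)` matters when a document has no terms. The sketch below reproduces the two remaining helpers as standalone methods that take a plain hash of term counts (an assumption made for illustration; in the gem they are instance methods of `TfIdfSimilarity::Document`).

```ruby
# Average term count, written as in the new version of the method.
def average_term_count(term_counts)
  term_counts.values.reduce(0, :+) / term_counts.size.to_f
end

# Maximum term count of any term.
def maximum_term_count(term_counts)
  term_counts.values.max.to_f
end

counts = { "ipsum" => 3, "lorem" => 2, "dolor" => 1 }
puts average_term_count(counts) # 6 / 3.0
puts maximum_term_count(counts)

# Without an initial value, reducing an empty collection returns nil,
# so the old code would raise NoMethodError on the division.
p [].reduce(:+)

# With an initial value of 0, an empty collection sums to 0.
p [].reduce(0, :+)
```

With the initial value, an empty hash yields `NaN` (0 divided by 0.0) rather than raising an exception.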

data/lib/tf-idf-similarity/extras/tf_idf_model.rb ADDED

@@ -0,0 +1,192 @@
+# @note The vss gem does not take the logarithm of the inverse document frequency.
+# @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L79
+# @note The treat gem does not add one to the inverse document frequency.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L16
+# @note The treat gem normalizes to the number of tokens in the document.
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/document.rb#L42
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
+class TfIdfSimilarity::TfIdfModel
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L17
+  #
+  # SMART n, Salton x, Chisholm NONE
+  def no_collection_frequency(term)
+    1.0
+  end
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L50
+  #
+  # SMART t, Salton f, Chisholm IDFB
+  def plain_inverse_document_frequency(term, numerator = 0, denominator = 0)
+    log((documents.size + numerator) / (@model.document_count(term).to_f + denominator))
+  end
+  alias_method :plain_idf, :plain_inverse_document_frequency
+  # SMART p, Salton p, Chisholm IDFP
+  def probabilistic_inverse_document_frequency(term)
+    count = @model.document_count(term).to_f
+    log((documents.size - count) / count)
+  end
+  alias_method :probabilistic_idf, :probabilistic_inverse_document_frequency
+  # Chisholm IGFF
+  def global_frequency_inverse_document_frequency(term)
+    @model.term_count(term) / @model.document_count(term).to_f
+  end
+  alias_method :gfidf, :global_frequency_inverse_document_frequency
+  # Chisholm IGFL
+  def log_global_frequency_inverse_document_frequency(term)
+    log(global_frequency_inverse_document_frequency(term) + 1)
+  end
+  alias_method :log_gfidf, :log_global_frequency_inverse_document_frequency
+  # Chisholm IGFI
+  def incremented_global_frequency_inverse_document_frequency(term)
+    global_frequency_inverse_document_frequency(term) + 1
+  end
+  alias_method :incremented_gfidf, :incremented_global_frequency_inverse_document_frequency
+  # Chisholm IGFS
+  def square_root_global_frequency_inverse_document_frequency(term)
+    sqrt(global_frequency_inverse_document_frequency(term) - 0.9)
+  end
+  alias_method :square_root_gfidf, :square_root_global_frequency_inverse_document_frequency
+  # Chisholm ENPY
+  def entropy(term)
+    denominator = @model.term_count(term).to_f
+    logN = log(documents.size)
+    1 + documents.reduce(0) do |sum,document|
+      quotient = document.term_count(term) / denominator
+      sum += quotient * log(quotient) / logN
+    end
+  end
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb
+  # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb
+  #
+  # SMART n, Salton x, Chisholm NONE
+  def no_normalization(matrix)
+    matrix
+  end
+  # @see http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
+  #
+  # SMART u, Chisholm PUQN
+  def pivoted_unique_normalization(matrix)
+    raise NotImplementedError
+  end
+  # Cosine normalization is implemented as TfIdfSimilarity::MatrixMethods#normalize.
+  #
+  # SMART c, Salton c, Chisholm COSN
+  # The plain term frequency is implemented as TfIdfSimilarity::Document#term_count.
+  #
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L75
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L11
+  #
+  # SMART n, Salton t, Chisholm FREQ
+  # SMART b, Salton b, Chisholm BNRY
+  def binary_term_frequency(document, term)
+    count = document.term_count(term)
+    if count > 0
+      1
+    else
+      0
+    end
+  end
+  alias_method :binary_tf, :binary_term_frequency
+  # @see http://en.wikipedia.org/wiki/Tf*idf
+  # @see http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
+  def normalized_term_frequency(document, term, a = 0)
+    a + (1 - a) * document.term_count(term) / document.maximum_term_count
+  end
+  alias_method :normalized_tf, :normalized_term_frequency
+  # SMART a, Salton n, Chisholm ATF1
+  def augmented_normalized_term_frequency(document, term)
+    0.5 + 0.5 * normalized_term_frequency(document, term)
+  end
+  alias_method :augmented_normalized_tf, :augmented_normalized_term_frequency
+  # Chisholm ATFA
+  def augmented_average_term_frequency(document, term)
+    count = document.term_count(term)
+    if count > 0
+      0.9 + 0.1 * count / document.average_term_count
+    else
+      0
+    end
+  end
+  alias_method :augmented_average_tf, :augmented_average_term_frequency
+  # Chisholm ATFC
+  def changed_coefficient_augmented_normalized_term_frequency(document, term)
+    count = document.term_count(term)
+    if count > 0
+      0.2 + 0.8 * count / document.maximum_term_count
+    else
+      0
+    end
+  end
+  alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L12
+  #
+  # SMART l, Chisholm LOGA
+  def log_term_frequency(document, term)
+    count = document.term_count(term)
+    if count > 0
+      1 + log(count)
+    else
+      0
+    end
+  end
+  alias_method :log_tf, :log_term_frequency
+  # SMART L, Chisholm LOGN
+  def normalized_log_term_frequency(document, term)
+    count = document.term_count(term)
+    if count > 0
+      (1 + log(count)) / (1 + log(document.average_term_count))
+    else
+      0
+    end
+  end
+  alias_method :normalized_log_tf, :normalized_log_term_frequency
+  # Chisholm LOGG
+  def augmented_log_term_frequency(document, term)
+    count = document.term_count(term)
+    if count > 0
+      0.2 + 0.8 * log(count + 1)
+    else
+      0
+    end
+  end
+  alias_method :augmented_log_tf, :augmented_log_term_frequency
+  # Chisholm SQRT
+  def square_root_term_frequency(document, term)
+    count = document.term_count(term)
+    if count > 0
+      sqrt(count - 0.5) + 1
+    else
+      0
+    end
+  end
+  alias_method :square_root_tf, :square_root_term_frequency
+end
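
Note on this file: to see how a few of these weighting formulas behave, here is a standalone sketch that re-implements some of them over plain hashes and arrays. The method names and signatures are illustrative, not the gem's API; in the gem these are instance methods that take a `TfIdfSimilarity::Document` and read counts from the model. The `entropy` sketch also skips documents that lack the term, which is an assumption of this sketch: if the gem's `log` behaves like `Math.log`, such a document would contribute 0 × −Infinity, which is `NaN`.

```ruby
# Illustrative re-implementations over a plain Hash of term counts.
# These names and signatures are hypothetical, not the gem's API.

# SMART b: 1 if the term is present, 0 otherwise.
def binary_tf(counts, term)
  counts.fetch(term, 0) > 0 ? 1 : 0
end

# SMART l: 1 + ln(count) for present terms, 0 otherwise.
def log_tf(counts, term)
  count = counts.fetch(term, 0)
  count > 0 ? 1 + Math.log(count) : 0
end

# SMART a: 0.5 + 0.5 * count / maximum count in the document.
def augmented_normalized_tf(counts, term)
  0.5 + 0.5 * counts.fetch(term, 0) / counts.values.max.to_f
end

# SMART t: ln(number of documents / number of documents containing the term).
def plain_idf(total_documents, document_count)
  Math.log(total_documents / document_count.to_f)
end

# Chisholm ENPY over the per-document counts of a single term.
# Assumptions: at least two documents, and documents without the term are
# skipped, i.e. 0 * log(0) is treated as 0.
def entropy(per_document_counts)
  total = per_document_counts.reduce(0, :+).to_f
  log_n = Math.log(per_document_counts.size)
  sum = per_document_counts.reduce(0) do |acc, count|
    next acc if count.zero?
    quotient = count / total
    acc + quotient * Math.log(quotient) / log_n
  end
  1 + sum
end

counts = { "ipsum" => 4, "lorem" => 2, "dolor" => 1 }
puts binary_tf(counts, "amet")
puts log_tf(counts, "dolor")
puts augmented_normalized_tf(counts, "lorem")
puts plain_idf(3, 1)
puts entropy([2, 0])
```

A term spread evenly across all documents gets an entropy weight near 0, while a term concentrated in one document gets a weight of 1.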