tf-idf-similarity 0.0.8 → 0.0.9

data/.yardopts ADDED
@@ -0,0 +1,4 @@
1
+ --no-private
2
+ --hide-void-return
3
+ --embed-mixin ClassMethods
4
+ --markup=markdown
data/Gemfile CHANGED
@@ -1,4 +1,4 @@
1
1
  source "http://rubygems.org"
2
2
 
3
- # Specify your gem's dependencies in scraperwiki-api.gemspec
3
+ # Specify your gem's dependencies in the gemspec
4
4
  gemspec
data/README.md CHANGED
@@ -40,14 +40,6 @@ Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can
40
40
 
41
41
  gem install narray
42
42
 
43
- ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
44
-
45
- You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
46
-
47
- ### Other Options
48
-
49
- [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
50
-
51
43
  ## Extras
52
44
 
53
45
  You can access more term frequency, document frequency, and normalization formulas with:
@@ -59,7 +51,19 @@ The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http:
59
51
 
60
52
  ## Why?
61
53
 
62
- The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsimilarity](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and have no normalization component. [vss](https://github.com/mkdynamic/vss) uses plain term and document frequencies, with no damping or normalization.
54
+ No other Ruby gem implements the tf*idf formula used by Lucene, Sphinx and Ferret.
55
+
56
+ ### Term frequencies
57
+
58
+ The [vss](https://github.com/mkdynamic/vss) gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important. The [treat](https://github.com/louismullie/treat), [tf_idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsemantic](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature. The [tf-idf](https://github.com/mchung/tf-idf) gem normalizes the frequency of a term in a document to the number of *unique* terms in that document, which never occurs in the literature.
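To make the comparison above concrete, here is a minimal sketch of the term-frequency variants it describes. All helper names are hypothetical and none of this code is taken from the gems mentioned; only the square-root variant mirrors this gem's `term_frequency`, which returns the square root of the term count.

```ruby
# Hypothetical helpers for illustration only; not the API of any gem named above.

# Counts how often each token occurs in a tokenized document.
def term_counts(tokens)
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

# Raw count with no normalization (the behaviour attributed to vss).
def raw_tf(tokens, term)
  term_counts(tokens)[term]
end

# Count divided by the total number of terms in the document
# (the behaviour attributed to treat, tf_idf, similarity and rsemantic).
def length_normalized_tf(tokens, term)
  term_counts(tokens)[term] / tokens.size.to_f
end

# Count divided by the number of unique terms in the document
# (the behaviour attributed to tf-idf).
def unique_normalized_tf(tokens, term)
  counts = term_counts(tokens)
  counts[term] / counts.size.to_f
end

# Square root of the count, as in this gem's term_frequency.
def sqrt_tf(tokens, term)
  Math.sqrt(term_counts(tokens)[term])
end

tokens = %w[a b a c a b]
puts raw_tf(tokens, 'a')
puts length_normalized_tf(tokens, 'a')
puts unique_normalized_tf(tokens, 'a')
puts sqrt_tf(tokens, 'a')
```

With this toy document, the term `a` occurs three times among six tokens and three unique terms, so the four variants give visibly different weights for the same input.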
59
+
60
+ ### Document frequencies
61
+
62
+ The vss gem does not normalize the inverse document frequency. The tf_idf, tf-idf, similarity and rsemantic gems use variants of the typical inverse document frequency formula.
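The inverse document frequency variants can be sketched the same way. The function names below are hypothetical. `plain_idf` follows the `plain_inverse_document_frequency` method in this diff; `lucene_idf` assumes the idf given in Lucene's TFIDFSimilarity documentation, `1 + log(numDocs / (docFreq + 1))`; `unlogged_idf` illustrates skipping the logarithm, which is how the vss gem is described here.

```ruby
# Hypothetical standalone functions for illustration.
# n  is the number of documents in the collection.
# df is the number of documents that contain the term.

# Plain idf: log(N / df).
def plain_idf(n, df)
  Math.log(n / df.to_f)
end

# Assumed Lucene-style idf: 1 + log(N / (df + 1)).
def lucene_idf(n, df)
  1 + Math.log(n / (df + 1).to_f)
end

# Ratio without the logarithm.
def unlogged_idf(n, df)
  n / df.to_f
end

puts plain_idf(10, 5)
puts lucene_idf(10, 5)
puts unlogged_idf(10, 5)
```

Adding one inside the logarithm's denominator avoids division by zero for unseen terms, and adding one to the result keeps the weight positive for terms that appear in nearly every document.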
63
+
64
+ ### Normalization
65
+
66
+ The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
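As a rough illustration of what a normalization component does, the sketch below applies cosine normalization with Ruby's `matrix` library, using the same `Matrix.columns ... map(&:normalize)` expression that appears in this diff's fallback branch. The sample matrix and variable names are made up for the example. Once every document column is a unit vector, the dot product of two columns is their cosine similarity.

```ruby
require 'matrix'

# A toy term-document matrix: rows are terms, columns are documents.
matrix = Matrix[[3.0, 0.0],
                [4.0, 2.0]]

# Scale each document (column) vector to unit length.
normalized = Matrix.columns(matrix.column_vectors.map(&:normalize))

# For unit vectors, the dot product equals the cosine similarity,
# so this product holds the pairwise document similarities.
similarity = normalized.transpose * normalized

puts normalized
puts similarity
```

Without this step, longer documents tend to receive larger scores simply because their vectors have larger magnitudes.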
63
67
 
64
68
  ## Reference
65
69
 
@@ -75,7 +79,15 @@ Lucene implements many more [similarity functions](http://lucene.apache.org/core
75
79
  * a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
76
80
  * a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
77
81
 
78
- Lucene can even [combine similarity meatures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
82
+ Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
83
+
84
+ ## Other optimizations
85
+
86
+ [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/) is available through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
87
+
88
+ ### Other Options
89
+
90
+ [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
79
91
 
80
92
  ## Bugs? Questions?
81
93
 
@@ -153,15 +153,17 @@ class TfIdfSimilarity::Collection
153
153
  matrix.each_col(&:normalize!)
154
154
  elsif narray?
155
155
  # @see https://github.com/masa16/narray/issues/21
156
- NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1))
156
+ NMatrix.refer(matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1)))
157
157
  elsif nmatrix?
158
158
  # @see https://github.com/SciRuby/nmatrix/issues/38
159
- # @todo NMatrix has no way to perform scalar operations on matrices.
160
- # (0...matrix.shape[0]).each do |i|
161
- # column = matrix.slice i, 0...matrix.shape[1]
162
- # norm = column.dot column.transpose
163
- # # No way to divide column by norm.
164
- # end
159
+ (0...matrix.shape[1]).each do |j|
160
+ # @see https://github.com/SciRuby/nmatrix/pull/46
161
+ column = matrix.column(j)
162
+ norm = Math.sqrt(column.transpose.dot(column)[0, 0])
163
+ (0...matrix.shape[0]).each do |i|
164
+ matrix[i, j] /= norm
165
+ end
166
+ end
165
167
  matrix.cast :yale, :float64
166
168
  else
167
169
  Matrix.columns matrix.column_vectors.map(&:normalize)
@@ -34,6 +34,7 @@ class TfIdfSimilarity::Document
34
34
  # @return [Float] the square root of the term count
35
35
  #
36
36
  # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
37
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13
37
38
  def term_frequency(term)
38
39
  Math.sqrt term_counts[term].to_i
39
40
  end
@@ -1,15 +1,32 @@
1
1
  require 'tf-idf-similarity/collection'
2
2
 
3
+ # @note The treat and similarity gems do not add one to the inverse document frequency.
4
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L16
5
+ # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/corpus.rb#L44
6
+ #
7
+ # @note The tf-idf gem adds one to the numerator when calculating inverse document frequency.
8
+ # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L153
9
+ #
10
+ # @note The vss gem does not take the logarithm of the inverse document frequency.
11
+ # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L79
12
+ #
13
+ # @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
14
+ # @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
15
+ # @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
3
16
  class TfIdfSimilarity::Collection
17
+ # https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L17
18
+ #
4
19
  # SMART n, Salton x, Chisholm NONE
5
20
  def no_collection_frequency(term)
6
21
  1.0
7
22
  end
8
23
 
24
+ # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L50
25
+ # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L15
26
+ #
9
27
  # SMART t, Salton f, Chisholm IDFB
10
28
  def plain_inverse_document_frequency(term)
11
- count = document_counts[term].to_f
12
- Math.log documents.size / count
29
+ Math.log documents.size / document_counts[term].to_f
13
30
  end
14
31
  alias_method :plain_idf, :plain_inverse_document_frequency
15
32
 
@@ -58,6 +75,11 @@ class TfIdfSimilarity::Collection
58
75
 
59
76
  # @param [Document] matrix a term-document matrix
60
77
  # @return [Matrix] the same matrix
78
+ # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb
79
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb
80
+ # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb
81
+ # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb
82
+ # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb
61
83
  #
62
84
  # SMART n, Salton x, Chisholm NONE
63
85
  def no_normalization(matrix)
@@ -66,20 +88,23 @@ class TfIdfSimilarity::Collection
66
88
 
67
89
  # @param [Document] matrix a term-document matrix
68
90
  # @return [Matrix] a matrix in which all document vectors are unit vectors
91
+ # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/term_document_matrix.rb#L23
69
92
  #
70
93
  # SMART c, Salton c, Chisholm COSN
71
94
  def cosine_normalization(matrix)
72
- Matrix.columns(tfidf.column_vectors.map do |column|
73
- column.normalize
74
- end)
95
+ if gsl?
96
+ matrix.each_col(&:normalize!)
97
+ else
98
+ Matrix.columns matrix.column_vectors.map(&:normalize)
99
+ end
75
100
  end
76
101
 
77
102
  # @param [Document] matrix a term-document matrix
78
103
  # @return [Matrix] a matrix
104
+ # @todo http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
79
105
  #
80
106
  # SMART u, Chisholm PUQN
81
107
  def pivoted_unique_normalization(matrix)
82
- # @todo
83
- # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
108
+ raise NotImplementedError
84
109
  end
85
110
  end
@@ -1,5 +1,19 @@
1
1
  require 'tf-idf-similarity/document'
2
2
 
3
+ # @todo http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
4
+ #
5
+ # @note The treat, tf_idf, similarity and rsemantic gems normalize to the number of terms in the document.
6
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L77
7
+ # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L76
8
+ # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/document.rb#L42
9
+ # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L17
10
+ #
11
+ # @note The tf-idf gem normalizes to the number of unique terms in the document.
12
+ # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L172
13
+ #
14
+ # @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
15
+ # @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
16
+ # @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
3
17
  class TfIdfSimilarity::Document
4
18
  # @return [Float] the maximum term count of any term in the document
5
19
  def maximum_term_count
@@ -12,6 +26,8 @@ class TfIdfSimilarity::Document
12
26
  end
13
27
 
14
28
  # Returns the term count.
29
+ # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L75
30
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L11
15
31
  #
16
32
  # SMART n, Salton t, Chisholm FREQ
17
33
  def plain_term_frequency(term)
@@ -70,6 +86,8 @@ class TfIdfSimilarity::Document
70
86
  end
71
87
  alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
72
88
 
89
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L12
90
+ #
73
91
  # SMART l, Chisholm LOGA
74
92
  def log_term_frequency(term)
75
93
  count = term_counts[term]
@@ -1,3 +1,3 @@
1
1
  module TfIdfSimilarity
2
- VERSION = "0.0.8"
2
+ VERSION = "0.0.9"
3
3
  end
@@ -1,5 +1,6 @@
1
1
  module TfIdfSimilarity
2
- autoload :Collection, 'tf-idf-similarity/collection'
3
- autoload :Document, 'tf-idf-similarity/document'
4
- autoload :Token, 'tf-idf-similarity/token'
5
2
  end
3
+
4
+ require 'tf-idf-similarity/collection'
5
+ require 'tf-idf-similarity/document'
6
+ require 'tf-idf-similarity/token'
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: tf-idf-similarity
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.8
4
+ version: 0.0.9
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-11-20 00:00:00.000000000 Z
12
+ date: 2013-01-07 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: unicode_utils
@@ -68,6 +68,7 @@ extra_rdoc_files: []
68
68
  files:
69
69
  - .gitignore
70
70
  - .travis.yml
71
+ - .yardopts
71
72
  - Gemfile
72
73
  - LICENSE
73
74
  - README.md
@@ -106,3 +107,4 @@ signing_key:
106
107
  specification_version: 3
107
108
  summary: Calculates the similarity between texts using tf*idf
108
109
  test_files: []
110
+ has_rdoc: