tf-idf-similarity 0.0.8 → 0.0.9

data/.yardopts ADDED
@@ -0,0 +1,4 @@
+ --no-private
+ --hide-void-return
+ --embed-mixin ClassMethods
+ --markup=markdown
data/Gemfile CHANGED
@@ -1,4 +1,4 @@
  source "http://rubygems.org"
 
- # Specify your gem's dependencies in scraperwiki-api.gemspec
+ # Specify your gem's dependencies in the gemspec
  gemspec
data/README.md CHANGED
@@ -40,14 +40,6 @@ Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can
 
  gem install narray
 
- ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
-
- You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
-
- ### Other Options
-
- [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
-
  ## Extras
 
  You can access more term frequency, document frequency, and normalization formulas with:
@@ -59,7 +51,19 @@ The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http:
 
  ## Why?
 
- The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsimilarity](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and have no normalization component. [vss](https://github.com/mkdynamic/vss) uses plain term and document frequencies, with no damping or normalization.
+ No other Ruby gem implements the tf*idf formula used by Lucene, Sphinx and Ferret.
+
+ ### Term frequencies
+
+ The [vss](https://github.com/mkdynamic/vss) gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important. The [treat](https://github.com/louismullie/treat), [tf_idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsemantic](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature. The [tf-idf](https://github.com/mchung/tf-idf) gem normalizes the frequency of a term in a document to the number of *unique* terms in that document, which never occurs in the literature.
+
+ ### Document frequencies
+
+ The vss gem does not normalize the inverse document frequency. The tf_idf, tf-idf, similarity and rsemantic gems use variants of the typical inverse document frequency formula.
+
+ ### Normalization
+
+ The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
 
  ## Reference
 
@@ -75,7 +79,15 @@ Lucene implements many more [similarity functions](http://lucene.apache.org/core
  * a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
  * a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
 
- Lucene can even [combine similarity meatures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
+ Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
+
+ ## Other optimizations
+
+ [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/) is available through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
+
+ ### Other Options
+
+ [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
  ## Bugs? Questions?
 
@@ -153,15 +153,17 @@ class TfIdfSimilarity::Collection
  matrix.each_col(&:normalize!)
  elsif narray?
  # @see https://github.com/masa16/narray/issues/21
- NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1))
+ NMatrix.refer(matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1)))
  elsif nmatrix?
  # @see https://github.com/SciRuby/nmatrix/issues/38
- # @todo NMatrix has no way to perform scalar operations on matrices.
- # (0...matrix.shape[0]).each do |i|
- # column = matrix.slice i, 0...matrix.shape[1]
- # norm = column.dot column.transpose
- # # No way to divide column by norm.
- # end
+ (0...matrix.shape[1]).each do |j|
+ # @see https://github.com/SciRuby/nmatrix/pull/46
+ column = matrix.column(j)
+ norm = Math.sqrt(column.transpose.dot(column)[0, 0])
+ (0...matrix.shape[0]).each do |i|
+ matrix[i, j] /= norm
+ end
+ end
  matrix.cast :yale, :float64
  else
  Matrix.columns matrix.column_vectors.map(&:normalize)
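
The fallback branch above scales every column (one column per document) to unit length using Ruby's standard `matrix` library. A minimal sketch of that step follows; the method name `unit_columns` and the sample values are hypothetical and not part of the gem.

```ruby
require 'matrix'

# Hypothetical helper for illustration: scale each column of a
# term-document matrix to unit Euclidean length, using the same
# expression as the fallback branch in the hunk above.
def unit_columns(matrix)
  Matrix.columns(matrix.column_vectors.map(&:normalize))
end

# Made-up 2x2 example: rows are terms, columns are documents.
sample = Matrix[[3.0, 0.0],
                [4.0, 2.0]]

normalized = unit_columns(sample)

# Print the Euclidean norm of each normalized column.
normalized.column_vectors.each { |v| puts v.norm }
```

Note that `Vector#normalize` raises `Vector::ZeroVectorError` for an all-zero column, so a document with no indexed terms would need separate handling in a sketch like this.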
@@ -34,6 +34,7 @@ class TfIdfSimilarity::Document
  # @return [Float] the square root of the term count
  #
  # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13
  def term_frequency(term)
  Math.sqrt term_counts[term].to_i
  end
@@ -1,15 +1,32 @@
  require 'tf-idf-similarity/collection'
 
+ # @note The treat and similarity gems do not add one to the inverse document frequency.
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L16
+ # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/corpus.rb#L44
+ #
+ # @note The tf-idf gem adds one to the numerator when calculating inverse document frequency.
+ # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L153
+ #
+ # @note The vss gem does not take the logarithm of the inverse document frequency.
+ # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L79
+ #
+ # @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+ # @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+ # @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
  class TfIdfSimilarity::Collection
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L17
+ #
  # SMART n, Salton x, Chisholm NONE
  def no_collection_frequency(term)
  1.0
  end
 
+ # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L50
+ # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L15
+ #
  # SMART t, Salton f, Chisholm IDFB
  def plain_inverse_document_frequency(term)
- count = document_counts[term].to_f
- Math.log documents.size / count
+ Math.log documents.size / document_counts[term].to_f
  end
  alias_method :plain_idf, :plain_inverse_document_frequency
 
@@ -58,6 +75,11 @@ class TfIdfSimilarity::Collection
 
  # @param [Document] matrix a term-document matrix
  # @return [Matrix] the same matrix
+ # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb
+ # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb
+ # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb
+ # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb
  #
  # SMART n, Salton x, Chisholm NONE
  def no_normalization(matrix)
@@ -66,20 +88,23 @@ class TfIdfSimilarity::Collection
 
  # @param [Document] matrix a term-document matrix
  # @return [Matrix] a matrix in which all document vectors are unit vectors
+ # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/term_document_matrix.rb#L23
  #
  # SMART c, Salton c, Chisholm COSN
  def cosine_normalization(matrix)
- Matrix.columns(tfidf.column_vectors.map do |column|
- column.normalize
- end)
+ if gsl?
+ matrix.each_col(&:normalize!)
+ else
+ Matrix.columns matrix.column_vectors.map(&:normalize)
+ end
  end
 
  # @param [Document] matrix a term-document matrix
  # @return [Matrix] a matrix
+ # @todo http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
  #
  # SMART u, Chisholm PUQN
  def pivoted_unique_normalization(matrix)
- # @todo
- # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
+ raise NotImplementedError
  end
  end
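
The `@note` comments added in this file contrast several inverse document frequency variants. A rough Ruby sketch of three of them follows, assuming `n_docs` is the collection size and `doc_freq` is the number of documents containing the term. The module and method names are hypothetical and not part of the gem. The Lucene form is written from the `TFIDFSimilarity` documentation linked elsewhere in this diff, and the no-logarithm form is only inferred from the note about vss.

```ruby
# Hypothetical module for illustration; these names are not the gem's API.
module IdfSketch
  module_function

  # log(N / df): the plain form used by plain_inverse_document_frequency.
  def plain(n_docs, doc_freq)
    Math.log(n_docs / doc_freq.to_f)
  end

  # 1 + log(N / (df + 1)): the form described in Lucene's
  # TFIDFSimilarity documentation (assumed here, not taken from the gem).
  def lucene(n_docs, doc_freq)
    1 + Math.log(n_docs / (doc_freq + 1).to_f)
  end

  # N / df with no logarithm: an assumption based on the note
  # that vss does not take the logarithm.
  def no_log(n_docs, doc_freq)
    n_docs / doc_freq.to_f
  end
end

# A term that appears in 10 of 100 documents.
puts IdfSketch.plain(100, 10)
puts IdfSketch.lucene(100, 10)
puts IdfSketch.no_log(100, 10)
```

The logarithm damps the weight of rare terms, and the added constants in the Lucene form keep the value finite and positive when a term appears in every document.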
@@ -1,5 +1,19 @@
  require 'tf-idf-similarity/document'
 
+ # @todo http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
+ #
+ # @note The treat, tf_idf, similarity and rsemantic gems normalize to the number of terms in the document.
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L77
+ # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L76
+ # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/document.rb#L42
+ # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L17
+ #
+ # @note The tf-idf gem normalizes to the number of unique terms in the document.
+ # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L172
+ #
+ # @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+ # @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+ # @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
  class TfIdfSimilarity::Document
  # @return [Float] the maximum term count of any term in the document
  def maximum_term_count
@@ -12,6 +26,8 @@ class TfIdfSimilarity::Document
  end
 
  # Returns the term count.
+ # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L75
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L11
  #
  # SMART n, Salton t, Chisholm FREQ
  def plain_term_frequency(term)
@@ -70,6 +86,8 @@ class TfIdfSimilarity::Document
  end
  alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
 
+ # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L12
+ #
  # SMART l, Chisholm LOGA
  def log_term_frequency(term)
  count = term_counts[term]
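
The notes in this file compare several ways of weighting a term's count within one document. A small Ruby sketch of four of them follows. The module, method names, and sample tokens are hypothetical and not part of the gem; only the square-root form mirrors `term_frequency` as it appears in this diff.

```ruby
# Hypothetical module for illustration; these names are not the gem's API.
module TermFrequencySketch
  module_function

  # Count occurrences of each token in a document.
  def term_counts(tokens)
    tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
  end

  # Raw count, with no normalization.
  def plain(tokens, term)
    term_counts(tokens)[term].to_f
  end

  # Count divided by the total number of terms in the document.
  def by_length(tokens, term)
    plain(tokens, term) / tokens.size
  end

  # Count divided by the number of unique terms in the document.
  def by_unique_terms(tokens, term)
    plain(tokens, term) / tokens.uniq.size
  end

  # Square root of the count, as in Document#term_frequency.
  def square_root(tokens, term)
    Math.sqrt(plain(tokens, term))
  end
end

tokens = %w[a b a c a b] # made-up document
puts TermFrequencySketch.plain(tokens, 'a')
puts TermFrequencySketch.by_length(tokens, 'a')
puts TermFrequencySketch.by_unique_terms(tokens, 'a')
puts TermFrequencySketch.square_root(tokens, 'a')
```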
@@ -1,3 +1,3 @@
  module TfIdfSimilarity
- VERSION = "0.0.8"
+ VERSION = "0.0.9"
  end
@@ -1,5 +1,6 @@
  module TfIdfSimilarity
- autoload :Collection, 'tf-idf-similarity/collection'
- autoload :Document, 'tf-idf-similarity/document'
- autoload :Token, 'tf-idf-similarity/token'
  end
+
+ require 'tf-idf-similarity/collection'
+ require 'tf-idf-similarity/document'
+ require 'tf-idf-similarity/token'
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: tf-idf-similarity
  version: !ruby/object:Gem::Version
- version: 0.0.8
+ version: 0.0.9
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-11-20 00:00:00.000000000 Z
+ date: 2013-01-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: unicode_utils
@@ -68,6 +68,7 @@ extra_rdoc_files: []
  files:
  - .gitignore
  - .travis.yml
+ - .yardopts
  - Gemfile
  - LICENSE
  - README.md
@@ -106,3 +107,4 @@ signing_key:
  specification_version: 3
  summary: Calculates the similarity between texts using tf*idf
  test_files: []
+ has_rdoc: