tf-idf-similarity 0.0.8 → 0.0.9
- data/.yardopts +4 -0
- data/Gemfile +1 -1
- data/README.md +22 -10
- data/lib/tf-idf-similarity/collection.rb +9 -7
- data/lib/tf-idf-similarity/document.rb +1 -0
- data/lib/tf-idf-similarity/extras/collection.rb +32 -7
- data/lib/tf-idf-similarity/extras/document.rb +18 -0
- data/lib/tf-idf-similarity/version.rb +1 -1
- data/lib/tf-idf-similarity.rb +4 -3
- metadata +4 -2
data/.yardopts
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -40,14 +40,6 @@ Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can
 
 gem install narray
 
-### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
-
-You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
-
-### Other Options
-
-[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
-
 ## Extras
 
 You can access more term frequency, document frequency, and normalization formulas with:
@@ -59,7 +51,19 @@ The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http:
 
 ## Why?
 
-
+No other Ruby gem implements the tf*idf formula used by Lucene, Sphinx and Ferret.
+
+### Term frequencies
+
+The [vss](https://github.com/mkdynamic/vss) gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important. The [treat](https://github.com/louismullie/treat), [tf_idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsemantic](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature. The [tf-idf](https://github.com/mchung/tf-idf) gem normalizes the frequency of a term in a document to the number of *unique* terms in that document, which never occurs in the literature.
+
+### Document frequencies
+
+The vss gem does not normalize the inverse document frequency. The tf_idf, tf-idf, similarity and rsemantic gems use variants of the typical inverse document frequency formula.
+
+### Normalization
+
+The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
 
 ## Reference
 
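The term-frequency variants contrasted in the README text above can be sketched in plain Ruby. This is a minimal illustration, not code taken from any of the gems named; the helper names (`term_counts`, `raw_tf`, `length_normalized_tf`, `unique_normalized_tf`, `sqrt_tf`) are hypothetical, and only the square-root form corresponds to the `term_frequency` method that appears in `document.rb` in this diff.

```ruby
# Hypothetical helpers for illustration only; none of these names come
# from the gems discussed in the README.

# Count how often each token occurs in a document.
def term_counts(tokens)
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

# Raw count, with no normalization (what the README says vss does).
def raw_tf(counts, term)
  counts[term]
end

# Count divided by the total number of terms in the document.
def length_normalized_tf(counts, term)
  counts[term] / counts.values.sum.to_f
end

# Count divided by the number of unique terms in the document.
def unique_normalized_tf(counts, term)
  counts[term] / counts.size.to_f
end

# Square root of the count, the form used by Lucene's TFIDFSimilarity.
def sqrt_tf(counts, term)
  Math.sqrt(counts[term])
end

counts = term_counts(%w[a b a c a b]) # 6 terms, 3 of them unique
```

For the term `a`, which occurs three times in this six-token sample, the four helpers give different weights, which is the point the README makes about the gems not being interchangeable.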
@@ -75,7 +79,15 @@ Lucene implements many more [similarity functions](http://lucene.apache.org/core
 * a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
 * a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
 
-Lucene can even [combine similarity
+Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
+
+## Other optimizations
+
+[Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/) is available through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
+
+### Other Options
+
+[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
 ## Bugs? Questions?
 
data/lib/tf-idf-similarity/collection.rb
CHANGED
@@ -153,15 +153,17 @@ class TfIdfSimilarity::Collection
       matrix.each_col(&:normalize!)
     elsif narray?
       # @see https://github.com/masa16/narray/issues/21
-      NMatrix.refer
+      NMatrix.refer(matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1)))
     elsif nmatrix?
       # @see https://github.com/SciRuby/nmatrix/issues/38
-
-
-
-
-
-
+      (0...matrix.shape[1]).each do |j|
+        # @see https://github.com/SciRuby/nmatrix/pull/46
+        column = matrix.column(j)
+        norm = Math.sqrt(column.transpose.dot(column)[0, 0])
+        (0...m.shape[0]).each do |i|
+          m[i, j] /= norm
+        end
+      end
       matrix.cast :yale, :float64
     else
       Matrix.columns matrix.column_vectors.map(&:normalize)
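The final `else` branch in the hunk above normalizes each column with Ruby's standard `matrix` library. The following is a minimal sketch of that branch alone, assuming a small dense term-document matrix whose columns are documents; the method name `cosine_normalize` and the sample values are hypothetical and are not part of the gem.

```ruby
require 'matrix'

# Hypothetical wrapper around the expression used in the fallback branch:
# scale every column (document vector) to unit Euclidean length.
# Note that Vector#normalize raises on a zero vector, so an all-zero
# column would need separate handling.
def cosine_normalize(matrix)
  Matrix.columns(matrix.column_vectors.map(&:normalize))
end

# Two documents (columns) over two terms (rows); values are made up.
tdm = Matrix[[3.0, 0.0],
             [4.0, 2.0]]

normalized = cosine_normalize(tdm)

# Each entry should be close to 1.0 after normalization.
norms = normalized.column_vectors.map(&:norm)
```

Once every document vector has unit length, the dot product of two columns is their cosine similarity, which is presumably why the normalization is applied before the similarity matrix is computed.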
data/lib/tf-idf-similarity/document.rb
CHANGED
@@ -34,6 +34,7 @@ class TfIdfSimilarity::Document
   # @return [Float] the square root of the term count
   #
   # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13
   def term_frequency(term)
     Math.sqrt term_counts[term].to_i
   end
data/lib/tf-idf-similarity/extras/collection.rb
CHANGED
@@ -1,15 +1,32 @@
 require 'tf-idf-similarity/collection'
 
+# @note The treat and similarity gems do not add one to the inverse document frequency.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L16
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/corpus.rb#L44
+#
+# @note The tf-idf gem adds one to the numerator when calculating inverse document frequency.
+# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L153
+#
+# @note The vss gem does not take the logarithm of the inverse document frequency.
+# @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L79
+#
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Collection
+  # https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L17
+  #
   # SMART n, Salton x, Chisholm NONE
   def no_collection_frequency(term)
     1.0
   end
 
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L50
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L15
+  #
   # SMART t, Salton f, Chisholm IDFB
   def plain_inverse_document_frequency(term)
-
-    Math.log documents.size / count
+    Math.log documents.size / document_counts[term].to_f
   end
   alias_method :plain_idf, :plain_inverse_document_frequency
 
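The added line in `plain_inverse_document_frequency` converts the document count to a float before dividing. The diff does not show how `count` was computed in the previous version, so the following is only a sketch of why the conversion matters in Ruby, where dividing two integers truncates. The variable names and values are hypothetical stand-ins for the collection's state.

```ruby
# Hypothetical stand-ins: 3 documents, 2 of which contain the term "ruby".
documents_size  = 3
document_counts = { 'ruby' => 2 }

# Integer division truncates 3 / 2 to 1, so the logarithm collapses to zero.
integer_idf = Math.log(documents_size / document_counts['ruby'])

# Converting the divisor to a float keeps the ratio 1.5 intact.
float_idf = Math.log(documents_size / document_counts['ruby'].to_f)
```

With integer operands, every term that appears in more than half of the documents would receive the same weight of zero, which removes most of the discrimination the inverse document frequency is meant to provide.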
@@ -58,6 +75,11 @@ class TfIdfSimilarity::Collection
 
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] the same matrix
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb
+  # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb
   #
   # SMART n, Salton x, Chisholm NONE
   def no_normalization(matrix)
@@ -66,20 +88,23 @@ class TfIdfSimilarity::Collection
 
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix in which all document vectors are unit vectors
+  # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/term_document_matrix.rb#L23
   #
   # SMART c, Salton c, Chisholm COSN
   def cosine_normalization(matrix)
-
-
-
+    if gsl?
+      matrix.each_col(&:normalize!)
+    else
+      Matrix.columns matrix.column_vectors.map(&:normalize)
+    end
   end
 
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix
+  # @todo http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
   #
   # SMART u, Chisholm PUQN
   def pivoted_unique_normalization(matrix)
-
-    # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
+    raise NotImplementedError
   end
 end
data/lib/tf-idf-similarity/extras/document.rb
CHANGED
@@ -1,5 +1,19 @@
 require 'tf-idf-similarity/document'
 
+# @todo http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
+#
+# @note The treat, tf_idf, similarity and rsemantic gems normalizes to the number of terms in the document.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L77
+# @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L76
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/document.rb#L42
+# @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L17
+#
+# @note The tf-idf gem normalizes to the number of unique terms in the document.
+# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L172
+#
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Document
   # @return [Float] the maximum term count of any term in the document
   def maximum_term_count
@@ -12,6 +26,8 @@ class TfIdfSimilarity::Document
   end
 
   # Returns the term count.
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L75
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L11
   #
   # SMART n, Salton t, Chisholm FREQ
   def plain_term_frequency(term)
@@ -70,6 +86,8 @@ class TfIdfSimilarity::Document
   end
   alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
 
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L12
+  #
   # SMART l, Chisholm LOGA
   def log_term_frequency(term)
     count = term_counts[term]
data/lib/tf-idf-similarity.rb
CHANGED
@@ -1,5 +1,6 @@
 module TfIdfSimilarity
-  autoload :Collection, 'tf-idf-similarity/collection'
-  autoload :Document, 'tf-idf-similarity/document'
-  autoload :Token, 'tf-idf-similarity/token'
 end
+
+require 'tf-idf-similarity/collection'
+require 'tf-idf-similarity/document'
+require 'tf-idf-similarity/token'
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.9
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2013-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode_utils
@@ -68,6 +68,7 @@ extra_rdoc_files: []
 files:
 - .gitignore
 - .travis.yml
+- .yardopts
 - Gemfile
 - LICENSE
 - README.md
@@ -106,3 +107,4 @@ signing_key:
 specification_version: 3
 summary: Calculates the similarity between texts using tf*idf
 test_files: []
+has_rdoc: