tf-idf-similarity 0.0.8 → 0.0.9
- data/.yardopts +4 -0
- data/Gemfile +1 -1
- data/README.md +22 -10
- data/lib/tf-idf-similarity/collection.rb +9 -7
- data/lib/tf-idf-similarity/document.rb +1 -0
- data/lib/tf-idf-similarity/extras/collection.rb +32 -7
- data/lib/tf-idf-similarity/extras/document.rb +18 -0
- data/lib/tf-idf-similarity/version.rb +1 -1
- data/lib/tf-idf-similarity.rb +4 -3
- metadata +4 -2
data/.yardopts
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -40,14 +40,6 @@ Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can
 
 gem install narray
 
-### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
-
-You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
-
-### Other Options
-
-[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
-
 ## Extras
 
 You can access more term frequency, document frequency, and normalization formulas with:
@@ -59,7 +51,19 @@ The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http:
 
 ## Why?
 
-
+No other Ruby gem implements the tf*idf formula used by Lucene, Sphinx and Ferret.
+
+### Term frequencies
+
+The [vss](https://github.com/mkdynamic/vss) gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important. The [treat](https://github.com/louismullie/treat), [tf_idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsemantic](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature. The [tf-idf](https://github.com/mchung/tf-idf) gem normalizes the frequency of a term in a document to the number of *unique* terms in that document, which never occurs in the literature.
+
+### Document frequencies
+
+The vss gem does not normalize the inverse document frequency. The tf_idf, tf-idf, similarity and rsemantic gems use variants of the typical inverse document frequency formula.
+
+### Normalization
+
+The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
 
 ## Reference
 
@@ -75,7 +79,15 @@ Lucene implements many more [similarity functions](http://lucene.apache.org/core
 * a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
 * a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
 
-Lucene can even [combine similarity
+Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
+
+## Other optimizations
+
+[Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/) is available through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
+
+### Other Options
+
+[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
 ## Bugs? Questions?
 
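Review note: the "Term frequencies" paragraph added to the README contrasts four ways of weighting a term within one document. The sketch below puts them side by side. The method names and the sample counts are hypothetical and are not part of this gem or of the other gems; only the square-root form mirrors `Math.sqrt term_counts[term].to_i` from `document.rb` in this diff.

```ruby
# Hypothetical term counts for a single document (illustration only).
SAMPLE_COUNTS = { 'lucene' => 4, 'ruby' => 1, 'gem' => 1 }.freeze

# Square root of the raw count, as in Document#term_frequency in this diff.
def sqrt_tf(counts, term)
  Math.sqrt(counts[term].to_i)
end

# Raw count with no normalization (what the README attributes to vss).
def raw_tf(counts, term)
  counts[term].to_i
end

# Count divided by the total number of terms in the document.
def length_normalized_tf(counts, term)
  counts[term].to_f / counts.values.sum
end

# Count divided by the number of unique terms in the document.
def unique_normalized_tf(counts, term)
  counts[term].to_f / counts.size
end

%w[lucene ruby missing].each do |term|
  puts format('%-8s sqrt=%.3f raw=%d length=%.3f unique=%.3f',
              term,
              sqrt_tf(SAMPLE_COUNTS, term),
              raw_tf(SAMPLE_COUNTS, term),
              length_normalized_tf(SAMPLE_COUNTS, term),
              unique_normalized_tf(SAMPLE_COUNTS, term))
end
```

A term that does not occur in the document scores zero under all four variants, because `nil.to_i` and `nil.to_f` both evaluate to zero.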
data/lib/tf-idf-similarity/collection.rb
CHANGED
@@ -153,15 +153,17 @@ class TfIdfSimilarity::Collection
       matrix.each_col(&:normalize!)
     elsif narray?
       # @see https://github.com/masa16/narray/issues/21
-      NMatrix.refer
+      NMatrix.refer(matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1)))
     elsif nmatrix?
       # @see https://github.com/SciRuby/nmatrix/issues/38
-
-
-
-
-
-
+      (0...matrix.shape[1]).each do |j|
+        # @see https://github.com/SciRuby/nmatrix/pull/46
+        column = matrix.column(j)
+        norm = Math.sqrt(column.transpose.dot(column)[0, 0])
+        (0...m.shape[0]).each do |i|
+          m[i, j] /= norm
+        end
+      end
       matrix.cast :yale, :float64
     else
       Matrix.columns matrix.column_vectors.map(&:normalize)
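Review note: every branch of the hunk above does the same thing, scaling each document column to unit Euclidean length. A minimal sketch of the fallback branch using only Ruby's `matrix` library follows. The sample matrix and the `normalize_columns` helper are hypothetical; the one-line body is the expression from the `else` branch.

```ruby
require 'matrix'

# Hypothetical helper wrapping the fallback expression from the diff:
# normalize every column vector, then rebuild the matrix from the columns.
def normalize_columns(matrix)
  Matrix.columns(matrix.column_vectors.map(&:normalize))
end

# Hypothetical 3-term x 2-document matrix; each column is one document.
sample = Matrix[[3.0, 1.0],
                [4.0, 2.0],
                [0.0, 2.0]]

normalized = normalize_columns(sample)

# After normalization every column should have a Euclidean norm of 1.
normalized.column_vectors.each_with_index do |column, j|
  puts "document #{j}: norm = #{column.norm.round(6)}"
end
```

`Vector#normalize` raises `Vector::ZeroVectorError` for an all-zero column, so a document with no weighted terms would need separate handling in this sketch.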
data/lib/tf-idf-similarity/document.rb
CHANGED
@@ -34,6 +34,7 @@ class TfIdfSimilarity::Document
   # @return [Float] the square root of the term count
   #
   # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13
   def term_frequency(term)
     Math.sqrt term_counts[term].to_i
   end
data/lib/tf-idf-similarity/extras/collection.rb
CHANGED
@@ -1,15 +1,32 @@
 require 'tf-idf-similarity/collection'
 
+# @note The treat and similarity gems do not add one to the inverse document frequency.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L16
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/corpus.rb#L44
+#
+# @note The tf-idf gem adds one to the numerator when calculating inverse document frequency.
+# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L153
+#
+# @note The vss gem does not take the logarithm of the inverse document frequency.
+# @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L79
+#
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Collection
+  # https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L17
+  #
   # SMART n, Salton x, Chisholm NONE
   def no_collection_frequency(term)
     1.0
   end
 
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L50
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L15
+  #
   # SMART t, Salton f, Chisholm IDFB
   def plain_inverse_document_frequency(term)
-
-    Math.log documents.size / count
+    Math.log documents.size / document_counts[term].to_f
   end
   alias_method :plain_idf, :plain_inverse_document_frequency
 
@@ -58,6 +75,11 @@ class TfIdfSimilarity::Collection
 
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] the same matrix
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb
+  # @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb
+  # @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb
+  # @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb
   #
   # SMART n, Salton x, Chisholm NONE
   def no_normalization(matrix)
@@ -66,20 +88,23 @@ class TfIdfSimilarity::Collection
 
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix in which all document vectors are unit vectors
+  # @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/term_document_matrix.rb#L23
   #
   # SMART c, Salton c, Chisholm COSN
   def cosine_normalization(matrix)
-
-
-
+    if gsl?
+      matrix.each_col(&:normalize!)
+    else
+      Matrix.columns matrix.column_vectors.map(&:normalize)
+    end
   end
 
   # @param [Document] matrix a term-document matrix
   # @return [Matrix] a matrix
+  # @todo http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
   #
   # SMART u, Chisholm PUQN
   def pivoted_unique_normalization(matrix)
-
-    # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
+    raise NotImplementedError
   end
 end
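Review note: the `@note` comments added at the top of `extras/collection.rb` distinguish several inverse document frequency variants. The sketch below is one hedged reading of three of them. The method names and the sample numbers are hypothetical. The `1 + log(N / (df + 1))` form is the idf documented for Lucene's `TFIDFSimilarity`; the no-logarithm form is an interpretation of the vss note, not a copy of that gem's code.

```ruby
# Plain idf, as in plain_inverse_document_frequency in this diff: log(N / df).
def plain_idf(document_count, document_frequency)
  Math.log(document_count / document_frequency.to_f)
end

# Lucene-style idf: 1 + log(N / (df + 1)).
def lucene_idf(document_count, document_frequency)
  1 + Math.log(document_count / (document_frequency.to_f + 1))
end

# Ratio without the logarithm (assumed reading of the vss note).
def ratio_without_log(document_count, document_frequency)
  document_count / document_frequency.to_f
end

# Hypothetical collection: 10 documents, 4 of which contain the term.
n  = 10
df = 4

puts "plain:  #{plain_idf(n, df)}"
puts "lucene: #{lucene_idf(n, df)}"
puts "ratio:  #{ratio_without_log(n, df)}"

# The .to_f matters in Ruby: Integer#/ truncates, so 10 / 4 is 2, not 2.5.
puts "integer division: #{Math.log(n / df)}"
```

The last line shows why the float conversion in the changed `Math.log` line is needed when both operands are integers: without it the ratio is truncated before the logarithm is taken.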
data/lib/tf-idf-similarity/extras/document.rb
CHANGED
@@ -1,5 +1,19 @@
 require 'tf-idf-similarity/document'
 
+# @todo http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html
+#
+# @note The treat, tf_idf, similarity and rsemantic gems normalizes to the number of terms in the document.
+# @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L77
+# @see https://github.com/reddavis/TF-IDF/blob/master/lib/tf_idf.rb#L76
+# @see https://github.com/bbcrd/Similarity/blob/master/lib/similarity/document.rb#L42
+# @see https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L17
+#
+# @note The tf-idf gem normalizes to the number of unique terms in the document.
+# @see https://github.com/mchung/tf-idf/blob/master/lib/tf-idf.rb#L172
+#
+# @see http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
+# @see http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf
+# @see http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf
 class TfIdfSimilarity::Document
   # @return [Float] the maximum term count of any term in the document
   def maximum_term_count
@@ -12,6 +26,8 @@ class TfIdfSimilarity::Document
   end
 
   # Returns the term count.
+  # @see https://github.com/mkdynamic/vss/blob/master/lib/vss/engine.rb#L75
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L11
   #
   # SMART n, Salton t, Chisholm FREQ
   def plain_term_frequency(term)
@@ -70,6 +86,8 @@ class TfIdfSimilarity::Document
   end
   alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
 
+  # @see https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L12
+  #
   # SMART l, Chisholm LOGA
   def log_term_frequency(term)
     count = term_counts[term]
data/lib/tf-idf-similarity.rb
CHANGED
@@ -1,5 +1,6 @@
 module TfIdfSimilarity
-  autoload :Collection, 'tf-idf-similarity/collection'
-  autoload :Document, 'tf-idf-similarity/document'
-  autoload :Token, 'tf-idf-similarity/token'
 end
+
+require 'tf-idf-similarity/collection'
+require 'tf-idf-similarity/document'
+require 'tf-idf-similarity/token'
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-version: 0.0.
+version: 0.0.9
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2013-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: unicode_utils
@@ -68,6 +68,7 @@ extra_rdoc_files: []
 files:
 - .gitignore
 - .travis.yml
+- .yardopts
 - Gemfile
 - LICENSE
 - README.md
@@ -106,3 +107,4 @@ signing_key:
 specification_version: 3
 summary: Calculates the similarity between texts using tf*idf
 test_files: []
+has_rdoc: