tf-idf-similarity 0.0.2 → 0.0.3
- data/README.md +13 -2
- data/lib/tf-idf-similarity/collection.rb +81 -30
- data/lib/tf-idf-similarity/document.rb +3 -0
- data/lib/tf-idf-similarity/version.rb +1 -1
- metadata +4 -4
data/README.md
CHANGED
@@ -3,7 +3,7 @@
|
|
3
3
|
[](https://gemnasium.com/opennorth/tf-idf-similarity)
|
4
4
|
[](https://codeclimate.com/github/opennorth/tf-idf-similarity)
|
5
5
|
|
6
|
-
Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/)
|
6
|
+
Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) or similar (see below).
|
7
7
|
|
8
8
|
## Usage
|
9
9
|
|
@@ -41,7 +41,7 @@ gem install gsl
|
|
41
41
|
|
42
42
|
You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
|
43
43
|
|
44
|
-
### Other
|
44
|
+
### Other Options
|
45
45
|
|
46
46
|
The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
|
47
47
|
|
@@ -63,6 +63,17 @@ The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/r
|
|
63
63
|
* [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
|
64
64
|
* [E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.](http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf)
|
65
65
|
|
66
|
+
## Further Reading
|
67
|
+
|
68
|
+
Lucene implements many more [similarity functions](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/Similarity.html), such as:
|
69
|
+
|
70
|
+
* a [divergence from randomness (DFR) framework](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/DFRSimilarity.html)
|
71
|
+
* a [framework for the family of information-based models](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/IBSimilarity.html)
|
72
|
+
* a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
|
73
|
+
* a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
|
74
|
+
|
75
|
+
Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
|
76
|
+
|
66
77
|
## Bugs? Questions?
|
67
78
|
|
68
79
|
This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
|
data/lib/tf-idf-similarity/collection.rb
CHANGED
@@ -34,43 +34,42 @@ class TfIdfSimilarity::Collection
|
|
34
34
|
term_counts.keys
|
35
35
|
end
|
36
36
|
|
37
|
+
# @param [Hash] opts optional arguments
|
38
|
+
# @option opts [Symbol] :function one of :tfidf (default) or :bm25
|
39
|
+
#
|
40
|
+
# @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
|
41
|
+
# @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/BM25Similarity.html
|
37
42
|
# @see http://en.wikipedia.org/wiki/Vector_space_model
|
38
43
|
# @see http://en.wikipedia.org/wiki/Document-term_matrix
|
39
44
|
# @see http://en.wikipedia.org/wiki/Cosine_similarity
|
40
|
-
def similarity_matrix
|
41
|
-
if
|
45
|
+
def similarity_matrix(opts = {})
|
46
|
+
if stdlib?
|
42
47
|
idf = []
|
43
|
-
|
44
|
-
idf[i] ||= inverse_document_frequency
|
45
|
-
|
48
|
+
matrix = Matrix.build(terms.size, documents.size) do |i,j|
|
49
|
+
idf[i] ||= inverse_document_frequency(terms[i], opts)
|
50
|
+
idf[i] * term_frequency(documents[j], terms[i], opts)
|
46
51
|
end
|
47
52
|
else
|
48
|
-
|
49
|
-
GSL::Matrix.alloc terms.size, documents.size
|
50
|
-
elsif narray?
|
51
|
-
NArray.float documents.size, terms.size
|
52
|
-
elsif nmatrix?
|
53
|
-
NMatrix.new(:list, [terms.size, documents.size], :float64)
|
54
|
-
end
|
55
|
-
|
53
|
+
matrix = initialize_matrix
|
56
54
|
terms.each_with_index do |term,i|
|
57
|
-
idf = inverse_document_frequency
|
55
|
+
idf = inverse_document_frequency(term, opts)
|
58
56
|
documents.each_with_index do |document,j|
|
59
|
-
|
60
|
-
if gsl? || nmatrix?
|
61
|
-
term_document_matrix[i, j] = tfidf
|
57
|
+
value = idf * term_frequency(document, term, opts)
|
62
58
|
# NArray puts the dimensions in a different order.
|
63
59
|
# @see http://narray.rubyforge.org/SPEC.en
|
64
|
-
|
65
|
-
|
60
|
+
if narray?
|
61
|
+
matrix[j, i] = value
|
62
|
+
else
|
63
|
+
matrix[i, j] = value
|
66
64
|
end
|
67
65
|
end
|
68
66
|
end
|
69
|
-
end
|
70
67
|
|
71
|
-
|
72
|
-
|
73
|
-
|
68
|
+
# Columns are normalized to unit vectors, so we can calculate the cosine
|
69
|
+
# similarity of all document vectors. BM25 doesn't normalize columns, but
|
70
|
+
# BM25 wasn't written with this use case in mind.
|
71
|
+
matrix = normalize matrix
|
72
|
+
end
|
74
73
|
|
75
74
|
if nmatrix?
|
76
75
|
matrix.transpose.dot matrix
|
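The standard-library branch of `similarity_matrix` in the hunk above can be sketched in isolation as follows. This is a minimal illustration and not the gem's code: the sample term counts, the constants `DOCS`, `TERMS` and `DOCUMENT_COUNTS`, and the `tf` helper are all hypothetical. In particular, `Document#term_frequency` is not shown in this diff, so the square root of the raw count is only an assumed stand-in.

```ruby
require 'matrix'

# Hypothetical term counts for three tiny documents (not taken from the gem).
DOCS = [
  { 'cat' => 2, 'dog' => 1 },
  { 'cat' => 1, 'fish' => 3 },
  { 'dog' => 1, 'fish' => 1 }
].freeze

TERMS = DOCS.flat_map(&:keys).uniq.freeze

# Number of documents in which each term appears.
DOCUMENT_COUNTS = TERMS.each_with_object({}) do |term, counts|
  counts[term] = DOCS.count { |doc| doc.key?(term) }
end.freeze

# Default (:tfidf) branch of inverse_document_frequency, as written in the diff.
def idf(term)
  1 + Math.log(DOCS.size / (DOCUMENT_COUNTS[term].to_f + 1))
end

# Assumed stand-in for Document#term_frequency: square root of the raw count.
def tf(doc, term)
  Math.sqrt(doc.fetch(term, 0))
end

def similarity_matrix
  # Rows are terms, columns are documents.
  weights = Matrix.build(TERMS.size, DOCS.size) do |i, j|
    idf(TERMS[i]) * tf(DOCS[j], TERMS[i])
  end
  # Scale every document column to unit length; the dot product of two
  # unit vectors is then their cosine similarity.
  normalized = Matrix.columns(weights.column_vectors.map(&:normalize))
  normalized.transpose * normalized
end

similarity = similarity_matrix
```

Entry `[i, j]` of the result is the cosine similarity of documents `i` and `j`, so the diagonal is 1 up to floating-point error. `Vector#normalize` raises on a zero vector, so a document with no terms would need separate handling in this sketch.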
@@ -80,14 +79,46 @@ class TfIdfSimilarity::Collection
|
|
80
79
|
end
|
81
80
|
|
82
81
|
# @param [String] term a term
|
82
|
+
# @param [Hash] opts optional arguments
|
83
|
+
# @option opts [Symbol] :function one of :tfidf (default) or :bm25
|
83
84
|
# @return [Float] the term's inverse document frequency
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
85
|
+
def inverse_document_frequency(term, opts = {})
|
86
|
+
if opts[:function] == :bm25
|
87
|
+
Math.log (documents.size - document_counts[term] + 0.5) / (document_counts[term] + 0.5)
|
88
|
+
else
|
89
|
+
1 + Math.log(documents.size / (document_counts[term].to_f + 1))
|
90
|
+
end
|
88
91
|
end
|
89
92
|
alias_method :idf, :inverse_document_frequency
|
90
93
|
|
94
|
+
# @param [Document] document a document
|
95
|
+
# @param [String] term a term
|
96
|
+
# @param [Hash] opts optional arguments
|
97
|
+
# @option opts [Symbol] :function one of :tfidf (default) or :bm25
|
98
|
+
# @return [Float] the term's frequency in the document
|
99
|
+
#
|
100
|
+
# @note Like Lucene, we use a b value of 0.75 and a k1 value of 1.2.
|
101
|
+
def term_frequency(document, term, opts = {})
|
102
|
+
if opts[:function] == :bm25
|
103
|
+
(document.term_counts[term] * 2.2) / (document.term_counts[term] + 0.3 + 0.9 * document.size / average_document_size)
|
104
|
+
else
|
105
|
+
document.term_frequency term
|
106
|
+
end
|
107
|
+
end
|
108
|
+
|
109
|
+
# @return [Float] the average document size, in terms
|
110
|
+
def average_document_size
|
111
|
+
@average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
|
112
|
+
end
|
113
|
+
|
114
|
+
# Resets the average document size.
|
115
|
+
#
|
116
|
+
# If you have already made a similarity matrix and are adding more documents,
|
117
|
+
# call this method before creating a new similarity matrix.
|
118
|
+
def reset_average_document_size!
|
119
|
+
@average_document_size = nil
|
120
|
+
end
|
121
|
+
|
91
122
|
# @param [Document] matrix a term-document matrix
|
92
123
|
# @return [Matrix] a matrix in which all document vectors are unit vectors
|
93
124
|
#
|
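The literals `2.2`, `0.3` and `0.9` in the BM25 branch of `term_frequency` follow from the `k1 = 1.2` and `b = 0.75` values mentioned in the `@note`. The sketch below writes the same expression with the parameters spelled out so the correspondence is visible. The method names and keyword arguments are hypothetical and do not exist in the gem.

```ruby
K1 = 1.2
B  = 0.75

# BM25 term-frequency component with explicit parameters.
def bm25_tf(count, doc_size, avg_doc_size, k1: K1, b: B)
  (count * (k1 + 1)) / (count + k1 * (1 - b + b * doc_size / avg_doc_size))
end

# The same expression with the constants folded in, as written in the diff:
#   k1 + 1       = 2.2
#   k1 * (1 - b) = 0.3
#   k1 * b       = 0.9
def bm25_tf_folded(count, doc_size, avg_doc_size)
  (count * 2.2) / (count + 0.3 + 0.9 * doc_size / avg_doc_size)
end

# BM25 branch of inverse_document_frequency from the diff, with the
# argument to Math.log parenthesized explicitly.
def bm25_idf(total_docs, docs_with_term)
  Math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
end
```

Note that this form of the BM25 inverse document frequency is negative for any term that appears in more than half of the documents, for example `bm25_idf(10, 8)`. The default `:tfidf` branch does not have this property for the same inputs.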
@@ -99,7 +130,12 @@ class TfIdfSimilarity::Collection
|
|
99
130
|
# @see https://github.com/masa16/narray/issues/21
|
100
131
|
NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
|
101
132
|
elsif nmatrix?
|
102
|
-
# @todo NMatrix has no way to
|
133
|
+
# @todo NMatrix has no way to perform scalar operations on matrices.
|
134
|
+
# (0...matrix.shape[0]).each do |i|
|
135
|
+
# column = matrix.slice i, 0...matrix.shape[1]
|
136
|
+
# norm = column.dot column.transpose
|
137
|
+
# # No way to divide column by norm.
|
138
|
+
# end
|
103
139
|
matrix.cast :yale, :float64
|
104
140
|
else
|
105
141
|
Matrix.columns matrix.column_vectors.map(&:normalize)
|
@@ -108,19 +144,34 @@ class TfIdfSimilarity::Collection
|
|
108
144
|
|
109
145
|
private
|
110
146
|
|
147
|
+
# @return a matrix
|
148
|
+
def initialize_matrix
|
149
|
+
if gsl?
|
150
|
+
GSL::Matrix.alloc terms.size, documents.size
|
151
|
+
elsif narray?
|
152
|
+
NArray.float documents.size, terms.size
|
153
|
+
elsif nmatrix?
|
154
|
+
NMatrix.new(:list, [terms.size, documents.size], :float64)
|
155
|
+
end
|
156
|
+
end
|
157
|
+
|
158
|
+
# @return [Boolean] whether to use the GSL gem
|
111
159
|
def gsl?
|
112
160
|
@gsl ||= Object.const_defined?(:GSL)
|
113
161
|
end
|
114
162
|
|
163
|
+
# @return [Boolean] whether to use the NArray gem
|
115
164
|
def narray?
|
116
165
|
@narray ||= Object.const_defined?(:NArray) && !gsl?
|
117
166
|
end
|
118
167
|
|
168
|
+
# @return [Boolean] whether to use the NMatrix gem
|
119
169
|
def nmatrix?
|
120
|
-
@nmatrix ||= Object.const_defined?(:NMatrix) && !narray?
|
170
|
+
@nmatrix ||= Object.const_defined?(:NMatrix) && !gsl? && !narray?
|
121
171
|
end
|
122
172
|
|
123
|
-
|
173
|
+
# @return [Boolean] whether to use the standard library
|
174
|
+
def stdlib?
|
124
175
|
@matrix ||= Object.const_defined?(:Matrix)
|
125
176
|
end
|
126
177
|
end
|
data/lib/tf-idf-similarity/document.rb
CHANGED
@@ -8,6 +8,8 @@ class TfIdfSimilarity::Document
|
|
8
8
|
attr_reader :text
|
9
9
|
# The number of times each term appears in the document.
|
10
10
|
attr_reader :term_counts
|
11
|
+
# The document size, in terms.
|
12
|
+
attr_reader :size
|
11
13
|
|
12
14
|
# @param [String] text the document's text
|
13
15
|
# @param [Hash] opts optional arguments
|
@@ -43,6 +45,7 @@ private
|
|
43
45
|
@term_counts[token.lowercase_filter.classic_filter.to_s] += 1
|
44
46
|
end
|
45
47
|
end
|
48
|
+
@size = term_counts.values.reduce(:+)
|
46
49
|
end
|
47
50
|
|
48
51
|
# Tokenizes a text, respecting the word boundary rules from Unicode’s Default
|
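The new `Document#size` attribute feeds `Collection#average_document_size`, which is memoized and therefore goes stale when documents are added later. The sketch below reproduces that interaction with two hypothetical stand-in classes, `SketchDocument` and `SketchCollection`; they are not the gem's classes and take pre-tokenized arrays instead of text. One deliberate difference: the sketch uses `reduce(0, :+)`, because `reduce(:+)` as written in the diff returns `nil` for a document with no terms.

```ruby
# Hypothetical stand-in for TfIdfSimilarity::Document (size logic only).
class SketchDocument
  attr_reader :term_counts, :size

  def initialize(tokens)
    @term_counts = Hash.new(0)
    tokens.each { |token| @term_counts[token] += 1 }
    # Document size in terms: the sum of all term counts.
    @size = @term_counts.values.reduce(0, :+)
  end
end

# Hypothetical stand-in for TfIdfSimilarity::Collection (averaging only).
class SketchCollection
  attr_reader :documents

  def initialize
    @documents = []
  end

  def <<(document)
    @documents << document
    self
  end

  # Memoized, as in the diff.
  def average_document_size
    @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
  end

  def reset_average_document_size!
    @average_document_size = nil
  end
end

collection = SketchCollection.new
collection << SketchDocument.new(%w[a b b]) << SketchDocument.new(%w[a])
first_average = collection.average_document_size # (3 + 1) / 2.0

collection << SketchDocument.new(%w[c c c c c])
stale_average = collection.average_document_size # still the memoized value

collection.reset_average_document_size!
fresh_average = collection.average_document_size # (3 + 1 + 5) / 3.0
```

Without the reset, BM25 term frequencies computed after adding documents would be based on the old average, which is why the diff's comment asks callers to reset before building a new similarity matrix.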
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: tf-idf-similarity
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.3
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-09-
|
12
|
+
date: 2012-09-11 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: unicode_utils
|
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
95
95
|
version: '0'
|
96
96
|
segments:
|
97
97
|
- 0
|
98
|
-
hash: -
|
98
|
+
hash: -4125970683092216956
|
99
99
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
100
100
|
none: false
|
101
101
|
requirements:
|
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
104
104
|
version: '0'
|
105
105
|
segments:
|
106
106
|
- 0
|
107
|
-
hash: -
|
107
|
+
hash: -4125970683092216956
|
108
108
|
requirements: []
|
109
109
|
rubyforge_project:
|
110
110
|
rubygems_version: 1.8.24
|