tf-idf-similarity 0.0.2 → 0.0.3
- data/README.md +13 -2
- data/lib/tf-idf-similarity/collection.rb +81 -30
- data/lib/tf-idf-similarity/document.rb +3 -0
- data/lib/tf-idf-similarity/version.rb +1 -1
- metadata +4 -4
data/README.md
CHANGED
@@ -3,7 +3,7 @@
|
|
3
3
|
[![Dependency Status](https://gemnasium.com/opennorth/tf-idf-similarity.png)](https://gemnasium.com/opennorth/tf-idf-similarity)
|
4
4
|
[![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
|
5
5
|
|
6
|
-
Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/)
|
6
|
+
Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) or similar (see below).
|
7
7
|
|
8
8
|
## Usage
|
9
9
|
|
@@ -41,7 +41,7 @@ gem install gsl
|
|
41
41
|
|
42
42
|
You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
|
43
43
|
|
44
|
-
### Other
|
44
|
+
### Other Options
|
45
45
|
|
46
46
|
The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
|
47
47
|
|
@@ -63,6 +63,17 @@ The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/r
|
|
63
63
|
* [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
|
64
64
|
* [E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.](http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf)
|
65
65
|
|
66
|
+
## Further Reading
|
67
|
+
|
68
|
+
Lucene implements many more [similarity functions](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/Similarity.html), such as:
|
69
|
+
|
70
|
+
* a [divergence from randomness (DFR) framework](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/DFRSimilarity.html)
|
71
|
+
* a [framework for the family of information-based models](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/IBSimilarity.html)
|
72
|
+
* a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
|
73
|
+
* a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
|
74
|
+
|
75
|
+
Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
|
76
|
+
|
66
77
|
## Bugs? Questions?
|
67
78
|
|
68
79
|
This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
|
data/lib/tf-idf-similarity/collection.rb
CHANGED
@@ -34,43 +34,42 @@ class TfIdfSimilarity::Collection
|
|
34
34
|
term_counts.keys
|
35
35
|
end
|
36
36
|
|
37
|
+
# @param [Hash] opts optional arguments
|
38
|
+
# @option opts [Symbol] :function one of :tfidf (default) or :bm25
|
39
|
+
#
|
40
|
+
# @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
|
41
|
+
# @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/BM25Similarity.html
|
37
42
|
# @see http://en.wikipedia.org/wiki/Vector_space_model
|
38
43
|
# @see http://en.wikipedia.org/wiki/Document-term_matrix
|
39
44
|
# @see http://en.wikipedia.org/wiki/Cosine_similarity
|
40
|
-
def similarity_matrix
|
41
|
-
if
|
45
|
+
def similarity_matrix(opts = {})
|
46
|
+
if stdlib?
|
42
47
|
idf = []
|
43
|
-
|
44
|
-
idf[i] ||= inverse_document_frequency
|
45
|
-
|
48
|
+
matrix = Matrix.build(terms.size, documents.size) do |i,j|
|
49
|
+
idf[i] ||= inverse_document_frequency(terms[i], opts)
|
50
|
+
idf[i] * term_frequency(documents[j], terms[i], opts)
|
46
51
|
end
|
47
52
|
else
|
48
|
-
|
49
|
-
GSL::Matrix.alloc terms.size, documents.size
|
50
|
-
elsif narray?
|
51
|
-
NArray.float documents.size, terms.size
|
52
|
-
elsif nmatrix?
|
53
|
-
NMatrix.new(:list, [terms.size, documents.size], :float64)
|
54
|
-
end
|
55
|
-
|
53
|
+
matrix = initialize_matrix
|
56
54
|
terms.each_with_index do |term,i|
|
57
|
-
idf = inverse_document_frequency
|
55
|
+
idf = inverse_document_frequency(term, opts)
|
58
56
|
documents.each_with_index do |document,j|
|
59
|
-
|
60
|
-
if gsl? || nmatrix?
|
61
|
-
term_document_matrix[i, j] = tfidf
|
57
|
+
value = idf * term_frequency(document, term, opts)
|
62
58
|
# NArray puts the dimensions in a different order.
|
63
59
|
# @see http://narray.rubyforge.org/SPEC.en
|
64
|
-
|
65
|
-
|
60
|
+
if narray?
|
61
|
+
matrix[j, i] = value
|
62
|
+
else
|
63
|
+
matrix[i, j] = value
|
66
64
|
end
|
67
65
|
end
|
68
66
|
end
|
69
|
-
end
|
70
67
|
|
71
|
-
|
72
|
-
|
73
|
-
|
68
|
+
# Columns are normalized to unit vectors, so we can calculate the cosine
|
69
|
+
# similarity of all document vectors. BM25 doesn't normalize columns, but
|
70
|
+
# BM25 wasn't written with this use case in mind.
|
71
|
+
matrix = normalize matrix
|
72
|
+
end
|
74
73
|
|
75
74
|
if nmatrix?
|
76
75
|
matrix.transpose.dot matrix
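The comment above `normalize matrix` relies on a property worth spelling out: once every document column is a unit vector, the product of the transposed matrix with itself holds the cosine similarity of each pair of documents. The following is a minimal sketch using only Ruby's standard `Matrix` class; the `weights` values are made up for illustration and are not taken from the gem.

```ruby
require 'matrix'

# Hypothetical term-document matrix: rows are terms, columns are documents.
# Documents 0 and 1 point in the same direction; document 2 shares no terms.
weights = Matrix[
  [1.0, 2.0, 0.0],
  [1.0, 2.0, 0.0],
  [0.0, 0.0, 3.0]
]

# Scale every column (document vector) to unit length.
# Vector#normalize raises on a zero vector, so empty documents need care.
unit_columns = Matrix.columns(weights.column_vectors.map { |v| v.normalize.to_a })

# Entry [i, j] is the dot product of unit vectors i and j,
# which is the cosine of the angle between documents i and j.
similarity = unit_columns.transpose * unit_columns
```

Here `similarity[0, 1]` is approximately 1 because the two columns are parallel, and `similarity[0, 2]` is 0 because the documents have no terms in common.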
|
@@ -80,14 +79,46 @@ class TfIdfSimilarity::Collection
|
|
80
79
|
end
|
81
80
|
|
82
81
|
# @param [String] term a term
|
82
|
+
# @param [Hash] opts optional arguments
|
83
|
+
# @option opts [Symbol] :function one of :tfidf (default) or :bm25
|
83
84
|
# @return [Float] the term's inverse document frequency
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
85
|
+
def inverse_document_frequency(term, opts = {})
|
86
|
+
if opts[:function] == :bm25
|
87
|
+
Math.log((documents.size - document_counts[term] + 0.5) / (document_counts[term] + 0.5))
|
88
|
+
else
|
89
|
+
1 + Math.log(documents.size / (document_counts[term].to_f + 1))
|
90
|
+
end
|
88
91
|
end
|
89
92
|
alias_method :idf, :inverse_document_frequency
|
90
93
|
|
94
|
+
# @param [Document] document a document
|
95
|
+
# @param [String] term a term
|
96
|
+
# @param [Hash] opts optional arguments
|
97
|
+
# @option opts [Symbol] :function one of :tfidf (default) or :bm25
|
98
|
+
# @return [Float] the term's frequency in the document
|
99
|
+
#
|
100
|
+
# @note Like Lucene, we use a b value of 0.75 and a k1 value of 1.2.
|
101
|
+
def term_frequency(document, term, opts = {})
|
102
|
+
if opts[:function] == :bm25
|
103
|
+
(document.term_counts[term] * 2.2) / (document.term_counts[term] + 0.3 + 0.9 * document.size / average_document_size)
|
104
|
+
else
|
105
|
+
document.term_frequency term
|
106
|
+
end
|
107
|
+
end
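The literals `2.2`, `0.3` and `0.9` in the BM25 branch follow from the `k1 = 1.2` and `b = 0.75` values mentioned in the `@note`: `k1 + 1 = 2.2`, `k1 * (1 - b) = 0.3` and `k1 * b = 0.9`. The sketch below restates both weighting schemes as standalone functions so the constants are visible. The function and parameter names are hypothetical and are not part of the gem's API.

```ruby
# Parameter values stated in the @note above.
K1 = 1.2
B  = 0.75

# BM25 term frequency with the constants left unexpanded.
def bm25_term_frequency(count, document_size, average_document_size)
  (count * (K1 + 1)) /
    (count + K1 * (1 - B + B * document_size / average_document_size))
end

# BM25 inverse document frequency. It becomes negative when a term
# occurs in more than half of the documents.
def bm25_idf(document_total, document_count)
  Math.log((document_total - document_count + 0.5) / (document_count + 0.5))
end

# The default (:tfidf) inverse document frequency from the diff.
def tfidf_idf(document_total, document_count)
  1 + Math.log(document_total / (document_count.to_f + 1))
end

# A term seen 3 times in a 10-term document, average size 8 terms.
tf = bm25_term_frequency(3, 10, 8.0)
```

Under these assumptions, `bm25_term_frequency(3, 10, 8.0)` evaluates the same expression as `(3 * 2.2) / (3 + 0.3 + 0.9 * 10 / 8.0)` in the diff, up to floating-point rounding.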
|
108
|
+
|
109
|
+
# @return [Float] the average document size, in terms
|
110
|
+
def average_document_size
|
111
|
+
@average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
|
112
|
+
end
|
113
|
+
|
114
|
+
# Resets the average document size.
|
115
|
+
#
|
116
|
+
# If you have already made a similarity matrix and are adding more documents,
|
117
|
+
# call this method before creating a new similarity matrix.
|
118
|
+
def reset_average_document_size!
|
119
|
+
@average_document_size = nil
|
120
|
+
end
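The reset method exists because `average_document_size` is memoized with `||=`, so the first computed value is reused even after more documents are added. The sketch below reproduces that behaviour with a hypothetical `SizeTracker` class that models only the memoization, not the gem's `Collection`.

```ruby
# Hypothetical stand-in for the collection; only the caching is modelled.
class SizeTracker
  attr_reader :sizes

  def initialize
    @sizes = []
  end

  # Memoized average, computed once and then reused.
  def average_size
    @average_size ||= sizes.reduce(:+) / sizes.size.to_f
  end

  # Clears the cached value so the next call recomputes it.
  def reset_average_size!
    @average_size = nil
  end
end

tracker = SizeTracker.new
tracker.sizes << 10 << 20
first = tracker.average_size   # computed from two documents

tracker.sizes << 60
stale = tracker.average_size   # cached value, ignores the new document

tracker.reset_average_size!
fresh = tracker.average_size   # recomputed from all three documents
```

In this sketch `stale` equals `first`, while `fresh` reflects the added document, which is why the reset is needed before building a new similarity matrix.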
|
121
|
+
|
91
122
|
# @param [Document] matrix a term-document matrix
|
92
123
|
# @return [Matrix] a matrix in which all document vectors are unit vectors
|
93
124
|
#
|
@@ -99,7 +130,12 @@ class TfIdfSimilarity::Collection
|
|
99
130
|
# @see https://github.com/masa16/narray/issues/21
|
100
131
|
NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
|
101
132
|
elsif nmatrix?
|
102
|
-
# @todo NMatrix has no way to
|
133
|
+
# @todo NMatrix has no way to perform scalar operations on matrices.
|
134
|
+
# (0...matrix.shape[0]).each do |i|
|
135
|
+
# column = matrix.slice i, 0...matrix.shape[1]
|
136
|
+
# norm = column.dot column.transpose
|
137
|
+
# # No way to divide column by norm.
|
138
|
+
# end
|
103
139
|
matrix.cast :yale, :float64
|
104
140
|
else
|
105
141
|
Matrix.columns matrix.column_vectors.map(&:normalize)
|
@@ -108,19 +144,34 @@ class TfIdfSimilarity::Collection
|
|
108
144
|
|
109
145
|
private
|
110
146
|
|
147
|
+
# @return a matrix
|
148
|
+
def initialize_matrix
|
149
|
+
if gsl?
|
150
|
+
GSL::Matrix.alloc terms.size, documents.size
|
151
|
+
elsif narray?
|
152
|
+
NArray.float documents.size, terms.size
|
153
|
+
elsif nmatrix?
|
154
|
+
NMatrix.new(:list, [terms.size, documents.size], :float64)
|
155
|
+
end
|
156
|
+
end
|
157
|
+
|
158
|
+
# @return [Boolean] whether to use the GSL gem
|
111
159
|
def gsl?
|
112
160
|
@gsl ||= Object.const_defined?(:GSL)
|
113
161
|
end
|
114
162
|
|
163
|
+
# @return [Boolean] whether to use the NArray gem
|
115
164
|
def narray?
|
116
165
|
@narray ||= Object.const_defined?(:NArray) && !gsl?
|
117
166
|
end
|
118
167
|
|
168
|
+
# @return [Boolean] whether to use the NMatrix gem
|
119
169
|
def nmatrix?
|
120
|
-
@nmatrix ||= Object.const_defined?(:NMatrix) && !narray?
|
170
|
+
@nmatrix ||= Object.const_defined?(:NMatrix) && !gsl? && !narray?
|
121
171
|
end
|
122
172
|
|
123
|
-
|
173
|
+
# @return [Boolean] whether to use the standard library
|
174
|
+
def stdlib?
|
124
175
|
@matrix ||= Object.const_defined?(:Matrix)
|
125
176
|
end
|
126
177
|
end
|
data/lib/tf-idf-similarity/document.rb
CHANGED
@@ -8,6 +8,8 @@ class TfIdfSimilarity::Document
|
|
8
8
|
attr_reader :text
|
9
9
|
# The number of times each term appears in the document.
|
10
10
|
attr_reader :term_counts
|
11
|
+
# The document size, in terms.
|
12
|
+
attr_reader :size
|
11
13
|
|
12
14
|
# @param [String] text the document's text
|
13
15
|
# @param [Hash] opts optional arguments
|
@@ -43,6 +45,7 @@ private
|
|
43
45
|
@term_counts[token.lowercase_filter.classic_filter.to_s] += 1
|
44
46
|
end
|
45
47
|
end
|
48
|
+
@size = term_counts.values.reduce(:+)
|
46
49
|
end
|
47
50
|
|
48
51
|
# Tokenizes a text, respecting the word boundary rules from Unicode’s Default
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: tf-idf-similarity
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.3
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-09-
|
12
|
+
date: 2012-09-11 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: unicode_utils
|
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
95
95
|
version: '0'
|
96
96
|
segments:
|
97
97
|
- 0
|
98
|
-
hash: -
|
98
|
+
hash: -4125970683092216956
|
99
99
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
100
100
|
none: false
|
101
101
|
requirements:
|
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
104
104
|
version: '0'
|
105
105
|
segments:
|
106
106
|
- 0
|
107
|
-
hash: -
|
107
|
+
hash: -4125970683092216956
|
108
108
|
requirements: []
|
109
109
|
rubyforge_project:
|
110
110
|
rubygems_version: 1.8.24
|