tf-idf-similarity 0.0.7 → 0.0.8
- data/README.md +12 -9
- data/lib/tf-idf-similarity/collection.rb +18 -6
- data/lib/tf-idf-similarity/version.rb +1 -1
- metadata +1 -2
data/README.md
CHANGED
@@ -16,17 +16,13 @@ Calculates the similarity between texts using a [bag-of-words](http://en.wikiped
 
 p corpus.similarity_matrix
 
-This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for faster matrix multiplication.
-
 ## Optimizations
 
-
-
-gem install narray
+This gem will use the first available library below, for faster matrix multiplication.
 
 ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
 
-The latest
+The latest [gsl gem](http://rb-gsl.rubyforge.org/) (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
 
 ```sh
 cd /usr/local
@@ -34,16 +30,23 @@ git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
 brew install gsl
 git checkout master
 git branch -d gsl-1.14
-gem install gsl
 ```
 
+Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can now run:
+
+gem install gsl --no-ri --no-rdoc
+
+### [NArray](http://narray.rubyforge.org/)
+
+gem install narray
+
 ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
 
-You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through
+You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
 
 ### Other Options
 
-
+[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
 ## Extras
 
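Note: the updated README says the gem uses the first available library for matrix multiplication. The sketch below shows one way such a fallback can be written with `require` and `rescue LoadError`. The helper name `first_available_library` and the candidate list are assumptions made for illustration; this is not the gem's actual loading code.

```ruby
# Hypothetical helper, not part of tf-idf-similarity: returns the name of the
# first library in `candidates` that can be required, or nil if none loads.
def first_available_library(candidates)
  candidates.find do |name|
    begin
      require name
      true
    rescue LoadError
      # The library is not installed; try the next candidate.
      false
    end
  end
end

# 'matrix' is distributed with Ruby, so it serves as the last resort here.
backend = first_available_library(%w[gsl narray nmatrix matrix])
puts "Using #{backend}"
```

Listing the faster native libraries first means the pure-Ruby `matrix` library is only chosen when nothing else can be loaded.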
data/lib/tf-idf-similarity/collection.rb
CHANGED
@@ -10,6 +10,8 @@ rescue LoadError
 end
 
 class TfIdfSimilarity::Collection
+  class CollectionError < StandardError; end
+
   # The documents in the collection.
   attr_reader :documents
   # The number of times each term appears in all documents.
@@ -46,6 +48,11 @@ class TfIdfSimilarity::Collection
   # @see http://en.wikipedia.org/wiki/Cosine_similarity
   # @see http://en.wikipedia.org/wiki/Okapi_BM25
   def similarity_matrix(opts = {})
+    if documents.empty?
+      raise CollectionError, "No documents in collection"
+    end
+
+    # Calculate tf*idf.
     if stdlib?
       idf = []
       matrix = Matrix.build(terms.size, documents.size) do |i,j|
@@ -67,13 +74,13 @@ class TfIdfSimilarity::Collection
           end
         end
       end
-
-      # Columns are normalized to unit vectors, so we can calculate the cosine
-      # similarity of all document vectors. BM25 doesn't normalize columns, but
-      # BM25 wasn't written with this use case in mind.
-      matrix = normalize matrix
     end
 
+    # Columns are normalized to unit vectors, so we can calculate the cosine
+    # similarity of all document vectors. BM25 doesn't normalize columns, but
+    # BM25 wasn't written with this use case in mind.
+    matrix = normalize matrix
+
     if nmatrix?
       matrix.transpose.dot matrix
     else
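Note: the comment moved in this hunk relies on the fact that, once every column of a term-document matrix is a unit vector, multiplying the transposed matrix by the matrix gives the cosine similarity of every pair of documents. A minimal sketch of that idea with Ruby's standard `matrix` library follows. The helper `normalize_columns` and the sample numbers are assumptions for illustration, not the gem's `normalize` method.

```ruby
require 'matrix'

# Hypothetical helper: scales each column (document vector) to unit length.
# Zero columns are left untouched to avoid dividing by zero.
def normalize_columns(matrix)
  Matrix.columns(matrix.column_vectors.map { |column|
    magnitude = column.magnitude
    (magnitude.zero? ? column : column / magnitude).to_a
  })
end

# Rows are terms, columns are documents. Documents 0 and 1 point in the same
# direction; document 2 shares no terms with either of them.
weights = Matrix[
  [3.0, 6.0, 0.0],
  [4.0, 8.0, 0.0],
  [0.0, 0.0, 5.0]
]

unit = normalize_columns(weights)

# Entry (i, j) is the dot product of unit vectors i and j, which is
# the cosine similarity of documents i and j.
similarity = unit.transpose * unit
p similarity
```

In this sample, documents 0 and 1 have a similarity of about 1, and document 2 has a similarity of 0 with both.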
@@ -122,6 +129,10 @@ class TfIdfSimilarity::Collection
 
   # @return [Float] the average document size, in terms
   def average_document_size
+    if documents.empty?
+      raise CollectionError, "No documents in collection"
+    end
+
     @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
   end
 
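Note: without the new guard, an empty collection would fail less clearly, because `[].reduce(:+)` returns `nil` and `nil / 0.0` raises `NoMethodError`. The sketch below reproduces the guarded behaviour in a stand-alone class. `TinyCollection` is a hypothetical stand-in written for this note, and its documents are plain arrays of terms rather than the gem's document objects.

```ruby
# Hypothetical stand-in for TfIdfSimilarity::Collection, for illustration only.
class TinyCollection
  class CollectionError < StandardError; end

  attr_reader :documents

  def initialize(documents = [])
    @documents = documents
  end

  # Mirrors the guarded method in the diff: fail early with a clear error
  # instead of calling `/` on nil.
  def average_document_size
    if documents.empty?
      raise CollectionError, "No documents in collection"
    end

    @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
  end
end

begin
  TinyCollection.new.average_document_size
rescue TinyCollection::CollectionError => e
  puts e.message # prints "No documents in collection"
end

# Two documents with 2 and 4 terms average to 3.0 terms.
puts TinyCollection.new([%w[a b], %w[c d e f]]).average_document_size
```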
@@ -134,7 +145,7 @@ class TfIdfSimilarity::Collection
   end
 
   # @param [Document] matrix a term-document matrix
-  # @return [Matrix] a matrix in which all document vectors are unit vectors
+  # @return [GSL::Matrix,NMatrix,Matrix] a matrix in which all document vectors are unit vectors
   #
   # @note Lucene normalizes document length differently.
   def normalize(matrix)
@@ -144,6 +155,7 @@ class TfIdfSimilarity::Collection
       # @see https://github.com/masa16/narray/issues/21
       NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1))
     elsif nmatrix?
+      # @see https://github.com/SciRuby/nmatrix/issues/38
       # @todo NMatrix has no way to perform scalar operations on matrices.
       # (0...matrix.shape[0]).each do |i|
       #   column = matrix.slice i, 0...matrix.shape[1]
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.8
 prerelease:
 platform: ruby
 authors:
@@ -106,4 +106,3 @@ signing_key:
 specification_version: 3
 summary: Calculates the similarity between texts using tf*idf
 test_files: []
-has_rdoc: