tf-idf-similarity 0.0.7 → 0.0.8
- data/README.md +12 -9
- data/lib/tf-idf-similarity/collection.rb +18 -6
- data/lib/tf-idf-similarity/version.rb +1 -1
- metadata +1 -2
data/README.md
CHANGED
@@ -16,17 +16,13 @@ Calculates the similarity between texts using a [bag-of-words](http://en.wikiped

 p corpus.similarity_matrix

-This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for faster matrix multiplication.
-
 ## Optimizations

-
-
-gem install narray
+This gem will use the first available library below, for faster matrix multiplication.

 ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)

-The latest
+The latest [gsl gem](http://rb-gsl.rubyforge.org/) (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:

 ```sh
 cd /usr/local
@@ -34,16 +30,23 @@ git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
 brew install gsl
 git checkout master
 git branch -d gsl-1.14
-gem install gsl
 ```

+Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can now run:
+
+gem install gsl --no-ri --no-rdoc
+
+### [NArray](http://narray.rubyforge.org/)
+
+gem install narray
+
 ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)

-You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through
+You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).

 ### Other Options

-
+[Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.

 ## Extras

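The new README wording ("the first available library below") describes a fallback chain. A minimal sketch of how such a chain could be probed, assuming the libraries are tried in the order the README lists them; `first_available_library` is a hypothetical helper, not part of the gem:

```ruby
# Hypothetical helper (not taken from the gem): return the name of the first
# library in `candidates` that can be required, or nil if none can be loaded.
def first_available_library(candidates)
  candidates.find do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end

# The candidate names follow the README's order. Which of them (if any) is
# installed depends on the machine, so the result varies.
backend = first_available_library(%w[gsl narray nmatrix])
puts "matrix backend: #{backend || 'stdlib Matrix'}"
```

Rescuing only `LoadError` keeps a missing optional dependency from aborting the program while still letting unrelated errors surface.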
data/lib/tf-idf-similarity/collection.rb
CHANGED
@@ -10,6 +10,8 @@ rescue LoadError
 end

 class TfIdfSimilarity::Collection
+  class CollectionError < StandardError; end
+
   # The documents in the collection.
   attr_reader :documents
   # The number of times each term appears in all documents.
@@ -46,6 +48,11 @@ class TfIdfSimilarity::Collection
   # @see http://en.wikipedia.org/wiki/Cosine_similarity
   # @see http://en.wikipedia.org/wiki/Okapi_BM25
   def similarity_matrix(opts = {})
+    if documents.empty?
+      raise CollectionError, "No documents in collection"
+    end
+
+    # Calculate tf*idf.
     if stdlib?
       idf = []
       matrix = Matrix.build(terms.size, documents.size) do |i,j|
@@ -67,13 +74,13 @@ class TfIdfSimilarity::Collection
           end
         end
       end
-
-      # Columns are normalized to unit vectors, so we can calculate the cosine
-      # similarity of all document vectors. BM25 doesn't normalize columns, but
-      # BM25 wasn't written with this use case in mind.
-      matrix = normalize matrix
     end

+    # Columns are normalized to unit vectors, so we can calculate the cosine
+    # similarity of all document vectors. BM25 doesn't normalize columns, but
+    # BM25 wasn't written with this use case in mind.
+    matrix = normalize matrix
+
     if nmatrix?
       matrix.transpose.dot matrix
     else
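The relocated comment relies on a simple identity: once every column of the term-document matrix is a unit vector, multiplying the transposed matrix by the matrix itself yields the cosine similarity of every pair of documents. A minimal sketch using Ruby's standard `matrix` library and a tiny hand-written matrix; `normalize_columns` and `cosine_similarity_matrix` are hypothetical names for illustration, not the gem's methods:

```ruby
require 'matrix'

# Hypothetical helper: scale each column (one document) to unit length.
# Vector#normalize raises Vector::ZeroVectorError for an all-zero column.
def normalize_columns(matrix)
  Matrix.columns(matrix.column_vectors.map { |column| column.normalize.to_a })
end

# Hypothetical helper: entry (i, j) is the dot product of unit columns i and j,
# which is the cosine of the angle between documents i and j.
def cosine_similarity_matrix(matrix)
  unit = normalize_columns(matrix)
  unit.transpose * unit
end

# Rows are terms, columns are documents:
# doc 0 = (1, 0), doc 1 = (2, 0), doc 2 = (0, 3).
tfidf = Matrix[[1.0, 2.0, 0.0],
               [0.0, 0.0, 3.0]]

similarity = cosine_similarity_matrix(tfidf)
puts similarity
```

Documents 0 and 1 point in the same direction, so their similarity is 1.0; document 2 shares no terms with either, so its similarity to both is 0.0.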
@@ -122,6 +129,10 @@ class TfIdfSimilarity::Collection

   # @return [Float] the average document size, in terms
   def average_document_size
+    if documents.empty?
+      raise CollectionError, "No documents in collection"
+    end
+
     @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
   end

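The same guard now appears in both `similarity_matrix` and `average_document_size`. A minimal sketch of why it helps in the averaging case, using a hypothetical stand-in class (`SketchCollection`, with documents modeled as plain arrays of terms) rather than the gem's own `Collection`:

```ruby
# Hypothetical stand-in for the gem's collection; documents are arrays of terms.
class SketchCollection
  class CollectionError < StandardError; end

  attr_reader :documents

  def initialize(documents = [])
    @documents = documents
  end

  # Without the guard, [].reduce(:+) returns nil and nil / 0.0 raises a
  # NoMethodError that says nothing about the real cause.
  def average_document_size
    raise CollectionError, "No documents in collection" if documents.empty?

    documents.map(&:size).reduce(:+) / documents.size.to_f
  end
end

begin
  SketchCollection.new.average_document_size
rescue SketchCollection::CollectionError => e
  puts "rescued: #{e.message}"
end

# Two documents with 3 and 1 terms average to 2.0 terms.
puts SketchCollection.new([%w[a b c], %w[d]]).average_document_size
```

A dedicated error class lets callers rescue the empty-collection case specifically instead of catching every `StandardError`.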
@@ -134,7 +145,7 @@ class TfIdfSimilarity::Collection
   end

   # @param [Document] matrix a term-document matrix
-  # @return [Matrix] a matrix in which all document vectors are unit vectors
+  # @return [GSL::Matrix,NMatrix,Matrix] a matrix in which all document vectors are unit vectors
   #
   # @note Lucene normalizes document length differently.
   def normalize(matrix)
@@ -144,6 +155,7 @@ class TfIdfSimilarity::Collection
       # @see https://github.com/masa16/narray/issues/21
       NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1))
     elsif nmatrix?
+      # @see https://github.com/SciRuby/nmatrix/issues/38
       # @todo NMatrix has no way to perform scalar operations on matrices.
       # (0...matrix.shape[0]).each do |i|
       #   column = matrix.slice i, 0...matrix.shape[1]
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.8
   prerelease:
 platform: ruby
 authors:
@@ -106,4 +106,3 @@ signing_key:
 specification_version: 3
 summary: Calculates the similarity between texts using tf*idf
 test_files: []
-has_rdoc: