tf-idf-similarity 0.0.7 → 0.0.8

data/README.md CHANGED
@@ -16,17 +16,13 @@ Calculates the similarity between texts using a [bag-of-words](http://en.wikiped
 
  p corpus.similarity_matrix
 
- This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for faster matrix multiplication.
-
  ## Optimizations
 
- ### [NArray](http://narray.rubyforge.org/)
-
- gem install narray
+ This gem will use the first available library below, for faster matrix multiplication.
 
  ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
 
- The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
+ The latest [gsl gem](http://rb-gsl.rubyforge.org/) (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
 
  ```sh
  cd /usr/local
@@ -34,16 +30,23 @@ git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
  brew install gsl
  git checkout master
  git branch -d gsl-1.14
- gem install gsl
  ```
 
+ Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can now run:
+
+ gem install gsl --no-ri --no-rdoc
+
+ ### [NArray](http://narray.rubyforge.org/)
+
+ gem install narray
+
  ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
 
- You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
+ You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
 
  ### Other Options
 
- The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
+ [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
  ## Extras
 
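The README change above says the gem uses the first available library for matrix multiplication. The gem's actual loading code is not part of this diff (only a `rescue LoadError` is visible in a hunk header), so the following is just a minimal sketch of that kind of fallback. The helper name `first_available` is hypothetical, and the require names `gsl`, `narray`, and `nmatrix` are assumed to match the gems listed in the README.

```ruby
# Hypothetical helper, not part of the gem: try each library in order and
# return the name of the first one that loads, or nil if none does.
def first_available(candidates)
  candidates.each do |name|
    begin
      require name
      return name
    rescue LoadError
      # Library is not installed; try the next candidate.
      next
    end
  end
  nil
end

# The order mirrors the README (GSL, then NArray, then NMatrix), with the
# standard library's Matrix as the last resort.
backend = first_available(%w[gsl narray nmatrix matrix])
puts "Using #{backend.inspect} for matrix operations"
```

Rescuing `LoadError` per candidate keeps an optional dependency optional: a missing gem only moves the search on to the next entry.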
@@ -10,6 +10,8 @@ rescue LoadError
  end
 
  class TfIdfSimilarity::Collection
+ class CollectionError < StandardError; end
+
  # The documents in the collection.
  attr_reader :documents
  # The number of times each term appears in all documents.
@@ -46,6 +48,11 @@ class TfIdfSimilarity::Collection
  # @see http://en.wikipedia.org/wiki/Cosine_similarity
  # @see http://en.wikipedia.org/wiki/Okapi_BM25
  def similarity_matrix(opts = {})
+ if documents.empty?
+ raise CollectionError, "No documents in collection"
+ end
+
+ # Calculate tf*idf.
  if stdlib?
  idf = []
  matrix = Matrix.build(terms.size, documents.size) do |i,j|
@@ -67,13 +74,13 @@ class TfIdfSimilarity::Collection
  end
  end
  end
-
- # Columns are normalized to unit vectors, so we can calculate the cosine
- # similarity of all document vectors. BM25 doesn't normalize columns, but
- # BM25 wasn't written with this use case in mind.
- matrix = normalize matrix
  end
 
+ # Columns are normalized to unit vectors, so we can calculate the cosine
+ # similarity of all document vectors. BM25 doesn't normalize columns, but
+ # BM25 wasn't written with this use case in mind.
+ matrix = normalize matrix
+
  if nmatrix?
  matrix.transpose.dot matrix
  else
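The comment moved in this hunk relies on a standard identity: once every column of the term-document matrix is a unit vector, the product of the transposed matrix with the matrix holds the cosine similarity of each pair of documents. A minimal sketch with plain Ruby arrays follows; the weights are made up, and the helpers `normalize_columns` and `similarity` are illustrative names, not the gem's methods.

```ruby
# Term-document matrix as an array of rows: rows are terms, columns are
# documents. The weights are invented for illustration.
matrix = [
  [1.0, 2.0, 0.0],
  [0.0, 2.0, 3.0],
]

# Scale each column (document vector) to unit length.
# Assumes no document is all zeros, which would divide by zero.
def normalize_columns(rows)
  norms = rows.transpose.map { |col| Math.sqrt(col.sum { |x| x * x }) }
  rows.map { |row| row.each_with_index.map { |x, j| x / norms[j] } }
end

# Transpose times matrix: entry [i][j] is the dot product of documents i
# and j, which is their cosine similarity when columns are unit vectors.
def similarity(rows)
  cols = rows.transpose
  cols.map { |a| cols.map { |b| a.zip(b).sum { |x, y| x * y } } }
end

sim = similarity(normalize_columns(matrix))
sim.each { |row| puts row.map { |v| v.round(3) }.inspect }
```

The diagonal comes out as 1 because each document is perfectly similar to itself, and documents that share no terms get 0.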
@@ -122,6 +129,10 @@ class TfIdfSimilarity::Collection
 
  # @return [Float] the average document size, in terms
  def average_document_size
+ if documents.empty?
+ raise CollectionError, "No documents in collection"
+ end
+
  @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
  end
 
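The guard added in this hunk matters because, on an empty collection, `documents.map(&:size).reduce(:+)` evaluates to `nil`, so the division would otherwise fail with a `NoMethodError` that says nothing about the real cause. A reduced sketch of the same pattern is below; `TinyCollection` is a hypothetical stand-in, and it assumes each document is simply an array of terms.

```ruby
# Hypothetical stand-in for the gem's collection class, reduced to the guard.
class TinyCollection
  class CollectionError < StandardError; end

  def initialize(documents = [])
    @documents = documents
  end

  # Average document size, where each document is an array of terms.
  def average_document_size
    raise CollectionError, "No documents in collection" if @documents.empty?
    @documents.map(&:size).reduce(:+) / @documents.size.to_f
  end
end

# An empty collection now fails with a descriptive, rescuable error.
begin
  TinyCollection.new.average_document_size
rescue TinyCollection::CollectionError => e
  puts e.message
end
```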
@@ -134,7 +145,7 @@ class TfIdfSimilarity::Collection
  end
 
  # @param [Document] matrix a term-document matrix
- # @return [Matrix] a matrix in which all document vectors are unit vectors
+ # @return [GSL::Matrix,NMatrix,Matrix] a matrix in which all document vectors are unit vectors
  #
  # @note Lucene normalizes document length differently.
  def normalize(matrix)
@@ -144,6 +155,7 @@ class TfIdfSimilarity::Collection
  # @see https://github.com/masa16/narray/issues/21
  NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1))
  elsif nmatrix?
+ # @see https://github.com/SciRuby/nmatrix/issues/38
  # @todo NMatrix has no way to perform scalar operations on matrices.
  # (0...matrix.shape[0]).each do |i|
  # column = matrix.slice i, 0...matrix.shape[1]
@@ -1,3 +1,3 @@
  module TfIdfSimilarity
- VERSION = "0.0.7"
+ VERSION = "0.0.8"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: tf-idf-similarity
  version: !ruby/object:Gem::Version
- version: 0.0.7
+ version: 0.0.8
  prerelease:
  platform: ruby
  authors:
@@ -106,4 +106,3 @@ signing_key:
  specification_version: 3
  summary: Calculates the similarity between texts using tf*idf
  test_files: []
- has_rdoc: