tf-idf-similarity 0.0.7 → 0.0.8

data/README.md CHANGED
@@ -16,17 +16,13 @@ Calculates the similarity between texts using a [bag-of-words](http://en.wikiped
 
  p corpus.similarity_matrix
 
- This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for faster matrix multiplication.
-
  ## Optimizations
 
- ### [NArray](http://narray.rubyforge.org/)
-
- gem install narray
+ This gem will use the first available library below, for faster matrix multiplication.
 
  ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
 
- The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
+ The latest [gsl gem](http://rb-gsl.rubyforge.org/) (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
 
  ```sh
  cd /usr/local
@@ -34,16 +30,23 @@ git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
  brew install gsl
  git checkout master
  git branch -d gsl-1.14
- gem install gsl
  ```
 
+ Be careful not to upgrade `gsl` to `1.15` with `brew upgrade outdated`. You can now run:
+
+ gem install gsl --no-ri --no-rdoc
+
+ ### [NArray](http://narray.rubyforge.org/)
+
+ gem install narray
+
  ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
 
- You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
+ You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through the next release (after `0.0.2`) of the [nmatrix gem](https://github.com/SciRuby/nmatrix). Follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
 
  ### Other Options
 
- The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
+ [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
  ## Extras
 
@@ -10,6 +10,8 @@ rescue LoadError
  end
 
  class TfIdfSimilarity::Collection
+ class CollectionError < StandardError; end
+
  # The documents in the collection.
  attr_reader :documents
  # The number of times each term appears in all documents.
@@ -46,6 +48,11 @@ class TfIdfSimilarity::Collection
  # @see http://en.wikipedia.org/wiki/Cosine_similarity
  # @see http://en.wikipedia.org/wiki/Okapi_BM25
  def similarity_matrix(opts = {})
+ if documents.empty?
+ raise CollectionError, "No documents in collection"
+ end
+
+ # Calculate tf*idf.
  if stdlib?
  idf = []
  matrix = Matrix.build(terms.size, documents.size) do |i,j|
@@ -67,13 +74,13 @@ class TfIdfSimilarity::Collection
  end
  end
  end
-
- # Columns are normalized to unit vectors, so we can calculate the cosine
- # similarity of all document vectors. BM25 doesn't normalize columns, but
- # BM25 wasn't written with this use case in mind.
- matrix = normalize matrix
  end
 
+ # Columns are normalized to unit vectors, so we can calculate the cosine
+ # similarity of all document vectors. BM25 doesn't normalize columns, but
+ # BM25 wasn't written with this use case in mind.
+ matrix = normalize matrix
+
  if nmatrix?
  matrix.transpose.dot matrix
  else
@@ -122,6 +129,10 @@ class TfIdfSimilarity::Collection
 
  # @return [Float] the average document size, in terms
  def average_document_size
+ if documents.empty?
+ raise CollectionError, "No documents in collection"
+ end
+
  @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
  end
 
@@ -134,7 +145,7 @@ class TfIdfSimilarity::Collection
  end
 
  # @param [Document] matrix a term-document matrix
- # @return [Matrix] a matrix in which all document vectors are unit vectors
+ # @return [GSL::Matrix,NMatrix,Matrix] a matrix in which all document vectors are unit vectors
  #
  # @note Lucene normalizes document length differently.
  def normalize(matrix)
@@ -144,6 +155,7 @@ class TfIdfSimilarity::Collection
  # @see https://github.com/masa16/narray/issues/21
  NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(documents.size, 1))
  elsif nmatrix?
+ # @see https://github.com/SciRuby/nmatrix/issues/38
  # @todo NMatrix has no way to perform scalar operations on matrices.
  # (0...matrix.shape[0]).each do |i|
  # column = matrix.slice i, 0...matrix.shape[1]
@@ -1,3 +1,3 @@
  module TfIdfSimilarity
- VERSION = "0.0.7"
+ VERSION = "0.0.8"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: tf-idf-similarity
  version: !ruby/object:Gem::Version
- version: 0.0.7
+ version: 0.0.8
  prerelease:
  platform: ruby
  authors:
@@ -106,4 +106,3 @@ signing_key:
  specification_version: 3
  summary: Calculates the similarity between texts using tf*idf
  test_files: []
- has_rdoc:
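
For readers following the `similarity_matrix` change, here is a minimal sketch of what the reordered code appears to do: reject an empty collection, normalize every column of the term-document matrix to a unit vector, then multiply the transpose by the matrix to get pairwise cosine similarities. It uses only Ruby's standard `Matrix` class. The top-level `CollectionError`, `normalize_columns` and the standalone `similarity_matrix` are hypothetical stand-ins, not the gem's actual implementation, and leaving all-zero columns unchanged is an assumption made here to avoid dividing by zero.

```ruby
require 'matrix'

# Hypothetical stand-in for TfIdfSimilarity::Collection::CollectionError.
class CollectionError < StandardError; end

# Scale each column (one document vector) to unit length.
# Assumption: an all-zero column is returned unchanged.
def normalize_columns(matrix)
  columns = matrix.column_vectors.map do |column|
    norm = Math.sqrt(column.inner_product(column))
    norm.zero? ? column.to_a : column.to_a.map { |value| value / norm }
  end
  Matrix.columns(columns)
end

# Cosine similarity between all pairs of columns of a term-document matrix.
def similarity_matrix(matrix)
  # Mirrors the guard added in 0.0.8: fail early when there are no documents.
  raise CollectionError, "No documents in collection" if matrix.column_count.zero?

  normalized = normalize_columns(matrix)
  # With unit-length columns, each dot product is a cosine similarity.
  normalized.transpose * normalized
end

# Rows are terms, columns are documents.
weights = Matrix[[1, 0], [1, 1]]
p similarity_matrix(weights)
```

Because normalization happens before the product, the diagonal of the result is 1 for every non-zero document, which is a quick sanity check when swapping in GSL, NArray or NMatrix.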