tf-idf-similarity 0.0.1 → 0.0.2

data/README.md CHANGED
@@ -1,5 +1,8 @@
  # Ruby Vector Space Model (VSM) with tf*idf weights
 
+ [![Dependency Status](https://gemnasium.com/opennorth/tf-idf-similarity.png)](https://gemnasium.com/opennorth/tf-idf-similarity)
+ [![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
+
  Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM 25](http://en.wikipedia.org/wiki/Okapi_BM25).
 
  ## Usage
@@ -17,6 +20,10 @@ This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for
 
  ## Optimizations
 
+ ### [NArray](http://narray.rubyforge.org/)
+
+ gem install narray
+
  ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
 
  The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
@@ -32,22 +39,11 @@ gem install gsl
 
  ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
 
- You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/).
-
- The `nmatrix` gem (`0.0.1`) can't find the `cblas.h` and `clapack.h` header files. Either [set the C_INCLUDE_PATH](https://github.com/SciRuby/nmatrix#synopsis):
-
- export C_INCLUDE_PATH=/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/
-
- Or [create links](https://github.com/SciRuby/nmatrix/issues/21) before installing the gem:
-
- sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/cblas.h /usr/include/cblas.h
- sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/clapack.h /usr/include/clapack.h
-
- Version `0.0.2` [doesn't compile on Mac OS X Lion](https://github.com/SciRuby/nmatrix/issues/34).
+ You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
 
  ### Other Considerations
 
- The [narray](http://narray.rubyforge.org/) and [nmatrix](http://sciruby.com/nmatrix/) gems have no method to calculate the magnitude of a vector. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) and old and not available as gems.
+ The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
  ## Extras
 
@@ -58,6 +54,10 @@ You can access more term frequency, document frequency, and normalization formul
 
  The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
 
+ ## Why?
+
+ The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsimilarity](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and have no normalization component. [vss](https://github.com/mkdynamic/vss) uses plain term and document frequencies, with no damping or normalization.
+
  ## Reference
 
  * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval."" Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
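For context on the "Why?" paragraph added in this README hunk, here is a minimal sketch of what damping and normalization mean in practice. It assumes the classic Lucene defaults (tf = sqrt(count), idf = 1 + ln(N / (df + 1))) and scales every document vector to unit length. The helper names are hypothetical and are not this gem's API; the gem's actual formulas may differ.

```ruby
# Hypothetical helpers for illustration only; not the gem's API.

# Damped term frequency (assumption: classic Lucene default).
def damped_tf(count)
  Math.sqrt(count)
end

# Inverse document frequency (assumption: classic Lucene default).
def idf(document_count, document_frequency)
  1 + Math.log(document_count / (document_frequency + 1.0))
end

# documents: an array of token arrays. Returns one unit-length
# tf*idf vector per document, over the sorted vocabulary.
def tfidf_vectors(documents)
  terms = documents.flatten.uniq.sort
  frequencies = terms.map { |term| documents.count { |d| d.include?(term) } }

  documents.map do |document|
    weights = terms.each_with_index.map do |term, i|
      count = document.count(term)
      count.zero? ? 0.0 : damped_tf(count) * idf(documents.size, frequencies[i])
    end
    norm = Math.sqrt(weights.sum { |w| w * w })
    norm.zero? ? weights : weights.map { |w| w / norm }
  end
end

# For unit-length vectors, cosine similarity is the dot product.
def cosine(a, b)
  a.zip(b).sum { |x, y| x * y }
end
```

Because every vector has unit length, a long document does not outscore a short one merely by repeating terms, which is the normalization component the paragraph says the other gems lack.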
@@ -1,5 +1,8 @@
+ # @todo Do speed comparison between these gsl and narray, to load fastest first.
  begin
  require 'gsl'
+ rescue LoadError
+ require 'narray'
  rescue LoadError
  require 'matrix'
  end
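A note on the fallback chain in this hunk: in Ruby, an exception raised inside a `rescue` body is not handled by a sibling `rescue` clause of the same `begin`, so as written, a `LoadError` from `require 'narray'` would propagate instead of falling through to `require 'matrix'`. One possible alternative is sketched below; `require_first` is a hypothetical helper name, not something the gem defines.

```ruby
# Hypothetical helper: tries each library in order and returns the
# name of the first one that loads, or nil if none can be loaded.
def require_first(*libraries)
  libraries.each do |library|
    begin
      require library
      return library
    rescue LoadError
      next # This library is unavailable; try the next candidate.
    end
  end
  nil
end

# Preference order taken from the hunk above: gsl, then narray, then matrix.
backend = require_first('gsl', 'narray', 'matrix')
```

Keeping the returned name around would also give predicates such as `gsl?` or `narray?` something to check, although how the gem actually implements those predicates is not shown in this diff.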
@@ -45,11 +48,9 @@ class TfIdfSimilarity::Collection
  term_document_matrix = if gsl?
  GSL::Matrix.alloc terms.size, documents.size
  elsif narray?
- NMatrix.float documents.size, terms.size
+ NArray.float documents.size, terms.size
  elsif nmatrix?
- # The nmatrix gem's sparse matrices are unfortunately buggy.
- # @see https://github.com/SciRuby/nmatrix/issues/35
- NMatrix.new([terms.size, documents.size], :float64)
+ NMatrix.new(:list, [terms.size, documents.size], :float64)
  end
 
  terms.each_with_index do |term,i|
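Every branch in this hunk allocates the same logical structure: one row per term and one column per document (NArray takes its dimensions in column-first order, hence the swapped arguments). As a library-free illustration of that layout, here is a sketch using nested arrays. The function name and the raw-count weighting are assumptions made for the example; the gem fills the cells with tf*idf weights.

```ruby
# Hypothetical stand-in for the gem's collection: documents are token arrays.
# Returns the sorted vocabulary and a terms x documents matrix of raw counts.
def term_document_matrix(documents)
  terms = documents.flatten.uniq.sort
  matrix = Array.new(terms.size) { Array.new(documents.size, 0.0) }

  terms.each_with_index do |term, i|
    documents.each_with_index do |document, j|
      # Row i is the term, column j is the document.
      matrix[i][j] = document.count(term).to_f
    end
  end

  [terms, matrix]
end
```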
@@ -95,14 +96,11 @@ class TfIdfSimilarity::Collection
  if gsl?
  matrix.each_col(&:normalize!)
  elsif narray?
- # @todo NArray doesn't have a method to normalize a vector.
- # 0.upto(matrix.shape[0] - 1).each do |j|
- # matrix[j, true] # Normalize this column somehow.
- # end
- matrix
+ # @see https://github.com/masa16/narray/issues/21
+ NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
  elsif nmatrix?
- # @todo NMatrix doesn't have a method to normalize a vector.
- matrix
+ # @todo NMatrix has no way to retrieve a column, besides iteration.
+ matrix.cast :yale, :float64
  else
  Matrix.columns matrix.column_vectors.map(&:normalize)
  end
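The GSL, NArray and stdlib `Matrix` branches in this hunk all aim at the same result: each column (one document vector) is divided by its Euclidean length. The NArray line appears to hard-code the number of documents in `reshape(5,1)`, which would only suit a five-document collection. The sketch below shows the size-independent computation on nested arrays; `normalize_columns` is a hypothetical name, and leaving all-zero columns as zeros is a choice made for this example.

```ruby
# Hypothetical helper: scales every column of a row-major matrix
# (an array of rows) to unit Euclidean length.
def normalize_columns(rows)
  # One norm per column, computed from the transposed matrix.
  norms = rows.transpose.map { |column| Math.sqrt(column.sum { |x| x * x }) }

  rows.map do |row|
    row.each_with_index.map do |value, j|
      # An all-zero column has no direction, so it is left as zeros.
      norms[j].zero? ? 0.0 : value / norms[j]
    end
  end
end
```

Deriving the norms from the matrix itself avoids any dependence on the collection size.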
@@ -1,3 +1,3 @@
  module TfIdfSimilarity
- VERSION = "0.0.1"
+ VERSION = "0.0.2"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: tf-idf-similarity
  version: !ruby/object:Gem::Version
- version: 0.0.1
+ version: 0.0.2
  prerelease:
  platform: ruby
  authors:
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  version: '0'
  segments:
  - 0
- hash: 697007281194730821
+ hash: -1570138910816303214
  required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  segments:
  - 0
- hash: 697007281194730821
+ hash: -1570138910816303214
  requirements: []
  rubyforge_project:
  rubygems_version: 1.8.24