tf-idf-similarity 0.0.1 → 0.0.2

data/README.md CHANGED
@@ -1,5 +1,8 @@
  # Ruby Vector Space Model (VSM) with tf*idf weights
 
+ [![Dependency Status](https://gemnasium.com/opennorth/tf-idf-similarity.png)](https://gemnasium.com/opennorth/tf-idf-similarity)
+ [![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
+
  Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM 25](http://en.wikipedia.org/wiki/Okapi_BM25).
 
  ## Usage
@@ -17,6 +20,10 @@ This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for
 
  ## Optimizations
 
+ ### [NArray](http://narray.rubyforge.org/)
+
+     gem install narray
+
  ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
 
  The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
@@ -32,22 +39,11 @@ gem install gsl
 
  ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
 
- You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/).
-
- The `nmatrix` gem (`0.0.1`) can't find the `cblas.h` and `clapack.h` header files. Either [set the C_INCLUDE_PATH](https://github.com/SciRuby/nmatrix#synopsis):
-
-     export C_INCLUDE_PATH=/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/
-
- Or [create links](https://github.com/SciRuby/nmatrix/issues/21) before installing the gem:
-
-     sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/cblas.h /usr/include/cblas.h
-     sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/clapack.h /usr/include/clapack.h
-
- Version `0.0.2` [doesn't compile on Mac OS X Lion](https://github.com/SciRuby/nmatrix/issues/34).
+ You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
 
  ### Other Considerations
 
- The [narray](http://narray.rubyforge.org/) and [nmatrix](http://sciruby.com/nmatrix/) gems have no method to calculate the magnitude of a vector. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) and old and not available as gems.
+ The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
  ## Extras
 
@@ -58,6 +54,10 @@ You can access more term frequency, document frequency, and normalization formul
 
  The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
 
+ ## Why?
+
+ The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsimilarity](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and have no normalization component. [vss](https://github.com/mkdynamic/vss) uses plain term and document frequencies, with no damping or normalization.
+
  ## Reference
 
  * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval."" Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
@@ -1,5 +1,8 @@
+ # @todo Do speed comparison between these gsl and narray, to load fastest first.
  begin
    require 'gsl'
+ rescue LoadError
+   require 'narray'
  rescue LoadError
    require 'matrix'
  end
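
A note on the fallback chain in the hunk above: in Ruby, the `rescue` clauses of a `begin` block only cover the block's body, so a `LoadError` raised by `require 'narray'` inside the first `rescue` is not caught by the second one. If neither `gsl` nor `narray` is installed, the error would propagate instead of falling back to `matrix`. One way to express the intended fallback is sketched below. The helper name `require_first_available` and the constant `LINEAR_ALGEBRA_BACKEND` are hypothetical and not part of the gem.

```ruby
# Hypothetical helper, not part of the gem: returns the name of the first
# library in +candidates+ that can be required, or nil if none can.
def require_first_available(candidates)
  candidates.each do |name|
    begin
      require name
      return name
    rescue LoadError
      next # this candidate is not installed; try the next one
    end
  end
  nil
end

# Same preference order as the hunk above: gsl, then narray, then stdlib matrix.
LINEAR_ALGEBRA_BACKEND = require_first_available(%w[gsl narray matrix])
```

Each `require` gets its own `begin`/`rescue`, so a missing library only skips that candidate. Reordering the list is also enough to act on the `@todo` about loading the fastest library first.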
@@ -45,11 +48,9 @@ class TfIdfSimilarity::Collection
  term_document_matrix = if gsl?
    GSL::Matrix.alloc terms.size, documents.size
  elsif narray?
-   NMatrix.float documents.size, terms.size
+   NArray.float documents.size, terms.size
  elsif nmatrix?
-   # The nmatrix gem's sparse matrices are unfortunately buggy.
-   # @see https://github.com/SciRuby/nmatrix/issues/35
-   NMatrix.new([terms.size, documents.size], :float64)
+   NMatrix.new(:list, [terms.size, documents.size], :float64)
  end
 
  terms.each_with_index do |term,i|
@@ -95,14 +96,11 @@ class TfIdfSimilarity::Collection
  if gsl?
    matrix.each_col(&:normalize!)
  elsif narray?
-   # @todo NArray doesn't have a method to normalize a vector.
-   # 0.upto(matrix.shape[0] - 1).each do |j|
-   #   matrix[j, true] # Normalize this column somehow.
-   # end
-   matrix
+   # @see https://github.com/masa16/narray/issues/21
+   NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
  elsif nmatrix?
-   # @todo NMatrix doesn't have a method to normalize a vector.
-   matrix
+   # @todo NMatrix has no way to retrieve a column, besides iteration.
+   matrix.cast :yale, :float64
  else
    Matrix.columns matrix.column_vectors.map(&:normalize)
  end
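
To make the normalization step in the hunk above concrete, here is a minimal sketch using only the standard library `Matrix`, the same call as the final `else` branch. The helper name `normalize_columns` and the toy weights are hypothetical. Computing similarities as `unit.transpose * unit` is an assumption about how the normalized matrix is used, based on the fact that the dot product of two unit vectors is their cosine similarity.

```ruby
require 'matrix'

# Hypothetical helper, not part of the gem: scales every column of +matrix+
# to unit Euclidean length, mirroring the stdlib branch in the hunk above.
def normalize_columns(matrix)
  Matrix.columns(matrix.column_vectors.map(&:normalize))
end

# Rows are terms, columns are documents (toy weights, not real tf*idf values).
weights = Matrix[[3.0, 0.0],
                 [4.0, 2.0]]
unit = normalize_columns(weights)

# With unit-length columns, the dot product of two columns is their
# cosine similarity, so this yields a document-by-document matrix.
similarity = unit.transpose * unit
```

The column count comes from the matrix itself, so nothing here depends on the number of documents. By contrast, the literal `5` in the `reshape(5,1)` call above looks tied to a five-document collection. Note that `Vector#normalize` raises `Vector::ZeroVectorError` for an all-zero column, so a document with no weighted terms would need separate handling.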
@@ -1,3 +1,3 @@
  module TfIdfSimilarity
-   VERSION = "0.0.1"
+   VERSION = "0.0.2"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: tf-idf-similarity
  version: !ruby/object:Gem::Version
-   version: 0.0.1
+   version: 0.0.2
  prerelease:
  platform: ruby
  authors:
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
        version: '0'
        segments:
        - 0
-       hash: 697007281194730821
+       hash: -1570138910816303214
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements:
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
        version: '0'
        segments:
        - 0
-       hash: 697007281194730821
+       hash: -1570138910816303214
  requirements: []
  rubyforge_project:
  rubygems_version: 1.8.24