tf-idf-similarity 0.0.1 → 0.0.2
- data/README.md +13 -13
- data/lib/tf-idf-similarity/collection.rb +9 -11
- data/lib/tf-idf-similarity/version.rb +1 -1
- metadata +3 -3
data/README.md
CHANGED
@@ -1,5 +1,8 @@
 # Ruby Vector Space Model (VSM) with tf*idf weights
 
+[![Dependency Status](https://gemnasium.com/opennorth/tf-idf-similarity.png)](https://gemnasium.com/opennorth/tf-idf-similarity)
+[![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
+
 Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM 25](http://en.wikipedia.org/wiki/Okapi_BM25).
 
 ## Usage
@@ -17,6 +20,10 @@ This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for
 
 ## Optimizations
 
+### [NArray](http://narray.rubyforge.org/)
+
+    gem install narray
+
 ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
 
 The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
@@ -32,22 +39,11 @@ gem install gsl
 
 ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
 
-You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/).
-
-The `nmatrix` gem (`0.0.1`) can't find the `cblas.h` and `clapack.h` header files. Either [set the C_INCLUDE_PATH](https://github.com/SciRuby/nmatrix#synopsis):
-
-    export C_INCLUDE_PATH=/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/
-
-Or [create links](https://github.com/SciRuby/nmatrix/issues/21) before installing the gem:
-
-    sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/cblas.h /usr/include/cblas.h
-    sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/clapack.h /usr/include/clapack.h
-
-Version `0.0.2` [doesn't compile on Mac OS X Lion](https://github.com/SciRuby/nmatrix/issues/34).
+You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
 
 ### Other Considerations
 
-The [
+The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
 
 ## Extras
 
@@ -58,6 +54,10 @@ You can access more term frequency, document frequency, and normalization formul
 
 The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
 
+## Why?
+
+The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsimilarity](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and have no normalization component. [vss](https://github.com/mkdynamic/vss) uses plain term and document frequencies, with no damping or normalization.
+
 ## Reference
 
 * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval."" Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
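The "Why?" paragraph added above distinguishes damped term frequencies from plain or length-normalized ones. A minimal sketch of a damped weight, assuming the square-root tf and smoothed-log idf of Lucene's `DefaultSimilarity`, which the README cites as its model. The gem's own method names do not appear in this diff, so `tf` and `idf` below are hypothetical helpers and the documents are toy data:

```ruby
# Hypothetical helper: damped term frequency (square root of the raw count).
def tf(count)
  Math.sqrt(count)
end

# Hypothetical helper: inverse document frequency with an add-one
# denominator, so a term found in no document does not divide by zero.
def idf(document_count, document_frequency)
  1 + Math.log(document_count / (document_frequency + 1.0))
end

# Toy corpus, already tokenized.
documents = [
  %w[the cat sat on the mat],
  %w[the dog sat],
  %w[a bird flew]
]

term = 'the'
count = documents[0].count(term)                   # raw count in the first document
df = documents.count { |doc| doc.include?(term) }  # documents containing the term
weight = tf(count) * idf(documents.size, df)
```

Doubling a raw count here multiplies the weight by the square root of two rather than by two, which is the damping that the paragraph says `vss` lacks.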
data/lib/tf-idf-similarity/collection.rb
CHANGED
@@ -1,5 +1,8 @@
+# @todo Do speed comparison between these gsl and narray, to load fastest first.
 begin
   require 'gsl'
+rescue LoadError
+  require 'narray'
 rescue LoadError
   require 'matrix'
 end
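In Ruby, a `rescue` clause handles exceptions raised in the `begin` body, not those raised inside a sibling `rescue` body, so a `LoadError` from `require 'narray'` in the hunk above would not reach the final `rescue LoadError`. A minimal sketch of one way to express a three-way fallback. `require_first` is a hypothetical helper and is not part of the gem:

```ruby
# Hypothetical helper: tries each library in order and returns the name of
# the first one that loads, or nil if none of them can be required.
def require_first(*names)
  names.find do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end

# Prefer the native extensions and fall back to the standard library.
backend = require_first('gsl', 'narray', 'matrix')
```

Each attempt gets its own `begin`/`rescue`, so a missing library only moves the search on to the next candidate.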
@@ -45,11 +48,9 @@ class TfIdfSimilarity::Collection
 term_document_matrix = if gsl?
   GSL::Matrix.alloc terms.size, documents.size
 elsif narray?
-
+  NArray.float documents.size, terms.size
 elsif nmatrix?
-
-  # @see https://github.com/SciRuby/nmatrix/issues/35
-  NMatrix.new([terms.size, documents.size], :float64)
+  NMatrix.new(:list, [terms.size, documents.size], :float64)
 end
 
 terms.each_with_index do |term,i|
@@ -95,14 +96,11 @@ class TfIdfSimilarity::Collection
 if gsl?
   matrix.each_col(&:normalize!)
 elsif narray?
-  # @
-
-  # matrix[j, true] # Normalize this column somehow.
-  # end
-  matrix
+  # @see https://github.com/masa16/narray/issues/21
+  NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
 elsif nmatrix?
-  # @todo NMatrix
-  matrix
+  # @todo NMatrix has no way to retrieve a column, besides iteration.
+  matrix.cast :yale, :float64
 else
   Matrix.columns matrix.column_vectors.map(&:normalize)
 end
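Every branch above aims to scale each document column to unit length. The NArray branch reshapes to a literal `5`, which presumably stands in for the number of documents. A minimal sketch of the same normalization using only the standard-library `Matrix`, as in the final `else` branch. The weights are toy values, not output from the gem:

```ruby
require 'matrix'

# Rows are terms, columns are documents (toy weights for illustration).
weights = Matrix[
  [3.0, 0.0],
  [4.0, 2.0],
  [0.0, 0.0]
]

# Scale every column to unit length. Vector#normalize raises
# Vector::ZeroVectorError for an all-zero column.
unit_columns = weights.column_vectors.map { |column| column.normalize.to_a }
normalized = Matrix.columns(unit_columns)

# With unit-length columns, the product of the transpose and the matrix
# holds the cosine similarity of every pair of documents.
similarity = normalized.transpose * normalized
```

Because the column count comes from the matrix itself, this version does not depend on a fixed number of documents.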
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-version: 0.0.
+version: 0.0.2
 prerelease:
 platform: ruby
 authors:
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
 version: '0'
 segments:
 - 0
-hash:
+hash: -1570138910816303214
 required_rubygems_version: !ruby/object:Gem::Requirement
 none: false
 requirements:
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 version: '0'
 segments:
 - 0
-hash:
+hash: -1570138910816303214
 requirements: []
 rubyforge_project:
 rubygems_version: 1.8.24