tf-idf-similarity 0.0.1 → 0.0.2
- data/README.md +13 -13
- data/lib/tf-idf-similarity/collection.rb +9 -11
- data/lib/tf-idf-similarity/version.rb +1 -1
- metadata +3 -3
data/README.md
CHANGED
@@ -1,5 +1,8 @@
|
|
1
1
|
# Ruby Vector Space Model (VSM) with tf*idf weights
|
2
2
|
|
3
|
+
[](https://gemnasium.com/opennorth/tf-idf-similarity)
|
4
|
+
[](https://codeclimate.com/github/opennorth/tf-idf-similarity)
|
5
|
+
|
3
6
|
Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM 25](http://en.wikipedia.org/wiki/Okapi_BM25).
|
4
7
|
|
5
8
|
## Usage
|
@@ -17,6 +20,10 @@ This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for
|
|
17
20
|
|
18
21
|
## Optimizations
|
19
22
|
|
23
|
+
### [NArray](http://narray.rubyforge.org/)
|
24
|
+
|
25
|
+
gem install narray
|
26
|
+
|
20
27
|
### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
|
21
28
|
|
22
29
|
The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
|
@@ -32,22 +39,11 @@ gem install gsl
|
|
32
39
|
|
33
40
|
### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
|
34
41
|
|
35
|
-
You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/).
|
36
|
-
|
37
|
-
The `nmatrix` gem (`0.0.1`) can't find the `cblas.h` and `clapack.h` header files. Either [set the C_INCLUDE_PATH](https://github.com/SciRuby/nmatrix#synopsis):
|
38
|
-
|
39
|
-
export C_INCLUDE_PATH=/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/
|
40
|
-
|
41
|
-
Or [create links](https://github.com/SciRuby/nmatrix/issues/21) before installing the gem:
|
42
|
-
|
43
|
-
sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/cblas.h /usr/include/cblas.h
|
44
|
-
sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/clapack.h /usr/include/clapack.h
|
45
|
-
|
46
|
-
Version `0.0.2` [doesn't compile on Mac OS X Lion](https://github.com/SciRuby/nmatrix/issues/34).
|
42
|
+
You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
|
47
43
|
|
48
44
|
### Other Considerations
|
49
45
|
|
50
|
-
The [
|
46
|
+
The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
|
51
47
|
|
52
48
|
## Extras
|
53
49
|
|
@@ -58,6 +54,10 @@ You can access more term frequency, document frequency, and normalization formul
|
|
58
54
|
|
59
55
|
The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
|
60
56
|
|
57
|
+
## Why?
|
58
|
+
|
59
|
+
The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/reddavis/TF-IDF), [similarity](https://github.com/bbcrd/Similarity) and [rsimilarity](https://github.com/josephwilk/rsemantic) gems normalize the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and have no normalization component. [vss](https://github.com/mkdynamic/vss) uses plain term and document frequencies, with no damping or normalization.
|
60
|
+
|
61
61
|
## Reference
|
62
62
|
|
63
63
|
* [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
|
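The README states that the default tf*idf formula follows the Lucene Conceptual Scoring Formula, and the new "Why?" section contrasts that with gems that use undamped or length-normalized term frequencies. A minimal sketch of the idea follows, assuming Lucene's classic definitions (tf = sqrt(count), idf = 1 + ln(N / (df + 1))). The helper names `term_counts`, `tf_idf_vectors` and `cosine` are hypothetical and are not part of this gem's API; the gem's actual implementation may differ.

```ruby
# Illustrative sketch only: hypothetical helpers, not the gem's API.

# Count how often each lowercase word occurs in a text.
def term_counts(text)
  text.downcase.scan(/[a-z]+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# Build one tf*idf vector per text, assuming Lucene's classic formulas:
#   tf  = sqrt(term count in the document)
#   idf = 1 + ln(number of documents / (document frequency + 1))
def tf_idf_vectors(texts)
  counts = texts.map { |text| term_counts(text) }
  terms = counts.flat_map(&:keys).uniq.sort
  vectors = counts.map do |doc|
    terms.map do |term|
      df = counts.count { |c| c.key?(term) }
      tf = Math.sqrt(doc.fetch(term, 0))
      idf = 1 + Math.log(texts.size / (df + 1.0))
      tf * idf
    end
  end
  [terms, vectors]
end

# Cosine similarity: dot product divided by the product of the vector lengths.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  length = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (length.call(a) * length.call(b))
end

texts = ["the cat sat", "the cat sat", "dogs bark loudly"]
_terms, vectors = tf_idf_vectors(texts)
same = cosine(vectors[0], vectors[1])      # identical texts
disjoint = cosine(vectors[0], vectors[2])  # no shared terms
```

The square root damps repeated terms, and dividing by the vector lengths is the normalization step that the README says some other gems omit.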
data/lib/tf-idf-similarity/collection.rb
CHANGED
@@ -1,5 +1,8 @@
|
|
1
|
+
# @todo Do speed comparison between these gsl and narray, to load fastest first.
|
1
2
|
begin
|
2
3
|
require 'gsl'
|
4
|
+
rescue LoadError
|
5
|
+
require 'narray'
|
3
6
|
rescue LoadError
|
4
7
|
require 'matrix'
|
5
8
|
end
|
@@ -45,11 +48,9 @@ class TfIdfSimilarity::Collection
|
|
45
48
|
term_document_matrix = if gsl?
|
46
49
|
GSL::Matrix.alloc terms.size, documents.size
|
47
50
|
elsif narray?
|
48
|
-
|
51
|
+
NArray.float documents.size, terms.size
|
49
52
|
elsif nmatrix?
|
50
|
-
|
51
|
-
# @see https://github.com/SciRuby/nmatrix/issues/35
|
52
|
-
NMatrix.new([terms.size, documents.size], :float64)
|
53
|
+
NMatrix.new(:list, [terms.size, documents.size], :float64)
|
53
54
|
end
|
54
55
|
|
55
56
|
terms.each_with_index do |term,i|
|
@@ -95,14 +96,11 @@ class TfIdfSimilarity::Collection
|
|
95
96
|
if gsl?
|
96
97
|
matrix.each_col(&:normalize!)
|
97
98
|
elsif narray?
|
98
|
-
# @
|
99
|
-
|
100
|
-
# matrix[j, true] # Normalize this column somehow.
|
101
|
-
# end
|
102
|
-
matrix
|
99
|
+
# @see https://github.com/masa16/narray/issues/21
|
100
|
+
NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
|
103
101
|
elsif nmatrix?
|
104
|
-
# @todo NMatrix
|
105
|
-
matrix
|
102
|
+
# @todo NMatrix has no way to retrieve a column, besides iteration.
|
103
|
+
matrix.cast :yale, :float64
|
106
104
|
else
|
107
105
|
Matrix.columns matrix.column_vectors.map(&:normalize)
|
108
106
|
end
|
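The fallback branch above normalizes each column of the term-document matrix to a unit vector with Ruby's standard `matrix` library. The sketch below shows that one step in isolation on made-up data; the final multiplication is an assumption about how pairwise similarities could then be derived, not a statement of what this gem does.

```ruby
require 'matrix'

# Made-up 2x2 term-document matrix: rows are terms, columns are documents.
matrix = Matrix[[3.0, 0.0],
                [4.0, 2.0]]

# Same expression as the fallback branch: scale every column to length 1.
# Note that Vector#normalize raises Vector::ZeroVectorError for an all-zero column.
normalized = Matrix.columns(matrix.column_vectors.map(&:normalize))

# Assumption for illustration: with unit-length columns, the product of the
# transpose and the matrix holds the cosine similarity of every document pair.
similarity = normalized.transpose * normalized
```

Here the first column `[3, 4]` becomes `[0.6, 0.8]` and the second `[0, 2]` becomes `[0, 1]`, so the off-diagonal similarity is their dot product.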
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: tf-idf-similarity
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.2
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
95
95
|
version: '0'
|
96
96
|
segments:
|
97
97
|
- 0
|
98
|
-
hash:
|
98
|
+
hash: -1570138910816303214
|
99
99
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
100
100
|
none: false
|
101
101
|
requirements:
|
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
104
104
|
version: '0'
|
105
105
|
segments:
|
106
106
|
- 0
|
107
|
-
hash:
|
107
|
+
hash: -1570138910816303214
|
108
108
|
requirements: []
|
109
109
|
rubyforge_project:
|
110
110
|
rubygems_version: 1.8.24
|