tf-idf-similarity 0.0.2 → 0.0.3

data/README.md CHANGED
@@ -3,7 +3,7 @@
3
3
  [![Dependency Status](https://gemnasium.com/opennorth/tf-idf-similarity.png)](https://gemnasium.com/opennorth/tf-idf-similarity)
4
4
  [![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
5
5
 
6
- Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM 25](http://en.wikipedia.org/wiki/Okapi_BM25).
6
+ Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) or similar (see below).
7
7
 
8
8
  ## Usage
9
9
 
@@ -41,7 +41,7 @@ gem install gsl
41
41
 
42
42
  You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
43
43
 
44
- ### Other Considerations
44
+ ### Other Options
45
45
 
46
46
  The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
47
47
 
@@ -63,6 +63,17 @@ The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/r
63
63
  * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
64
64
  * [E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.](http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf)
65
65
 
66
+ ## Further Reading
67
+
68
+ Lucene implements many more [similarity functions](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/Similarity.html), such as:
69
+
70
+ * a [divergence from randomness (DFR) framework](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/DFRSimilarity.html)
71
+ * a [framework for the family of information-based models](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/IBSimilarity.html)
72
+ * a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
73
+ * a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
74
+
75
+ Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
76
+
66
77
  ## Bugs? Questions?
67
78
 
68
79
  This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
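
Reviewer note: the README describes cosine similarity over TF-IDF-weighted document vectors. The following is a minimal sketch of that idea using only Ruby's standard `matrix` library, not the gem's actual implementation. The toy corpus `DOCS` and the helper names `term_weight`, `idf_weight`, and `cosine_similarity_matrix` are hypothetical. The square-root term frequency is an assumption modelled on Lucene's classic TF-IDF default, and the IDF formula is the one shown in this diff.

```ruby
require 'matrix'

# Hypothetical toy corpus: each document is an array of already-tokenized terms.
DOCS = [
  %w[the quick brown fox],
  %w[the lazy dog],
  %w[the quick dog]
]

# Assumption: square root of the raw count, as in Lucene's classic TF-IDF.
def term_weight(doc, term)
  Math.sqrt(doc.count(term))
end

# Same formula as the non-BM25 branch of inverse_document_frequency in this diff.
def idf_weight(docs, term)
  document_count = docs.count { |doc| doc.include?(term) }
  1 + Math.log(docs.size / (document_count + 1.0))
end

# Builds one unit-length TF-IDF vector per document (the columns of the
# term-document matrix), then multiplies the transposed matrix by the matrix.
# Entry [i, j] of the result is the cosine similarity of documents i and j.
def cosine_similarity_matrix(docs)
  terms = docs.flatten.uniq
  columns = docs.map do |doc|
    weights = terms.map { |term| term_weight(doc, term) * idf_weight(docs, term) }
    Vector.elements(weights).normalize.to_a
  end
  matrix = Matrix.columns(columns)
  matrix.transpose * matrix
end

similarities = cosine_similarity_matrix(DOCS)
puts similarities[1, 2] # similarity between the second and third documents
```

Because every column is normalized first, the diagonal of the result is 1 (up to floating-point error) and the matrix is symmetric.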
@@ -34,43 +34,42 @@ class TfIdfSimilarity::Collection
34
34
  term_counts.keys
35
35
  end
36
36
 
37
+ # @param [Hash] opts optional arguments
38
+ # @option opts [Symbol] :function one of :tfidf (default) or :bm25
39
+ #
40
+ # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
41
+ # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/BM25Similarity.html
37
42
  # @see http://en.wikipedia.org/wiki/Vector_space_model
38
43
  # @see http://en.wikipedia.org/wiki/Document-term_matrix
39
44
  # @see http://en.wikipedia.org/wiki/Cosine_similarity
40
- def similarity_matrix
41
- if matrix?
45
+ def similarity_matrix(opts = {})
46
+ if stdlib?
42
47
  idf = []
43
- term_document_matrix = Matrix.build(terms.size, documents.size) do |i,j|
44
- idf[i] ||= inverse_document_frequency terms[i]
45
- documents[j].term_frequency(terms[i]) * idf[i]
48
+ matrix = Matrix.build(terms.size, documents.size) do |i,j|
49
+ idf[i] ||= inverse_document_frequency(terms[i], opts)
50
+ idf[i] * term_frequency(documents[j], terms[i], opts)
46
51
  end
47
52
  else
48
- term_document_matrix = if gsl?
49
- GSL::Matrix.alloc terms.size, documents.size
50
- elsif narray?
51
- NArray.float documents.size, terms.size
52
- elsif nmatrix?
53
- NMatrix.new(:list, [terms.size, documents.size], :float64)
54
- end
55
-
53
+ matrix = initialize_matrix
56
54
  terms.each_with_index do |term,i|
57
- idf = inverse_document_frequency term
55
+ idf = inverse_document_frequency(term, opts)
58
56
  documents.each_with_index do |document,j|
59
- tfidf = document.term_frequency(term) * idf
60
- if gsl? || nmatrix?
61
- term_document_matrix[i, j] = tfidf
57
+ value = idf * term_frequency(document, term, opts)
62
58
  # NArray puts the dimensions in a different order.
63
59
  # @see http://narray.rubyforge.org/SPEC.en
64
- elsif narray?
65
- term_document_matrix[j, i] = tfidf
60
+ if narray?
61
+ matrix[j, i] = value
62
+ else
63
+ matrix[i, j] = value
66
64
  end
67
65
  end
68
66
  end
69
- end
70
67
 
71
- # Columns are normalized to unit vectors, so we can calculate the cosine
72
- # similarity of all document vectors.
73
- matrix = normalize term_document_matrix
68
+ # Columns are normalized to unit vectors, so we can calculate the cosine
69
+ # similarity of all document vectors. BM25 doesn't normalize columns, but
70
+ # BM25 wasn't written with this use case in mind.
71
+ matrix = normalize matrix
72
+ end
74
73
 
75
74
  if nmatrix?
76
75
  matrix.transpose.dot matrix
@@ -80,14 +79,46 @@ class TfIdfSimilarity::Collection
80
79
  end
81
80
 
82
81
  # @param [String] term a term
82
+ # @param [Hash] opts optional arguments
83
+ # @option opts [Symbol] :function one of :tfidf (default) or :bm25
83
84
  # @return [Float] the term's inverse document frequency
84
- #
85
- # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
86
- def inverse_document_frequency(term)
87
- 1 + Math.log(documents.size / (document_counts[term].to_f + 1))
85
+ def inverse_document_frequency(term, opts = {})
86
+ if opts[:function] == :bm25
87
+ Math.log((documents.size - document_counts[term] + 0.5) / (document_counts[term] + 0.5))
88
+ else
89
+ 1 + Math.log(documents.size / (document_counts[term].to_f + 1))
90
+ end
88
91
  end
89
92
  alias_method :idf, :inverse_document_frequency
90
93
 
94
+ # @param [Document] document a document
95
+ # @param [String] term a term
96
+ # @param [Hash] opts optional arguments
97
+ # @option opts [Symbol] :function one of :tfidf (default) or :bm25
98
+ # @return [Float] the term's frequency in the document
99
+ #
100
+ # @note Like Lucene, we use a b value of 0.75 and a k1 value of 1.2.
101
+ def term_frequency(document, term, opts = {})
102
+ if opts[:function] == :bm25
103
+ (document.term_counts[term] * 2.2) / (document.term_counts[term] + 0.3 + 0.9 * document.size / average_document_size)
104
+ else
105
+ document.term_frequency term
106
+ end
107
+ end
108
+
109
+ # @return [Float] the average document size, in terms
110
+ def average_document_size
111
+ @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
112
+ end
113
+
114
+ # Resets the average document size.
115
+ #
116
+ # If you have already made a similarity matrix and are adding more documents,
117
+ # call this method before creating a new similarity matrix.
118
+ def reset_average_document_size!
119
+ @average_document_size = nil
120
+ end
121
+
91
122
  # @param [Matrix] matrix a term-document matrix
92
123
  # @return [Matrix] a matrix in which all document vectors are unit vectors
93
124
  #
@@ -99,7 +130,12 @@ class TfIdfSimilarity::Collection
99
130
  # @see https://github.com/masa16/narray/issues/21
100
131
  NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
101
132
  elsif nmatrix?
102
- # @todo NMatrix has no way to retrieve a column, besides iteration.
133
+ # @todo NMatrix has no way to perform scalar operations on matrices.
134
+ # (0...matrix.shape[0]).each do |i|
135
+ # column = matrix.slice i, 0...matrix.shape[1]
136
+ # norm = column.dot column.transpose
137
+ # # No way to divide column by norm.
138
+ # end
103
139
  matrix.cast :yale, :float64
104
140
  else
105
141
  Matrix.columns matrix.column_vectors.map(&:normalize)
@@ -108,19 +144,34 @@ class TfIdfSimilarity::Collection
108
144
 
109
145
  private
110
146
 
147
+ # @return a matrix
148
+ def initialize_matrix
149
+ if gsl?
150
+ GSL::Matrix.alloc terms.size, documents.size
151
+ elsif narray?
152
+ NArray.float documents.size, terms.size
153
+ elsif nmatrix?
154
+ NMatrix.new(:list, [terms.size, documents.size], :float64)
155
+ end
156
+ end
157
+
158
+ # @return [Boolean] whether to use the GSL gem
111
159
  def gsl?
112
160
  @gsl ||= Object.const_defined?(:GSL)
113
161
  end
114
162
 
163
+ # @return [Boolean] whether to use the NArray gem
115
164
  def narray?
116
165
  @narray ||= Object.const_defined?(:NArray) && !gsl?
117
166
  end
118
167
 
168
+ # @return [Boolean] whether to use the NMatrix gem
119
169
  def nmatrix?
120
- @nmatrix ||= Object.const_defined?(:NMatrix) && !narray?
170
+ @nmatrix ||= Object.const_defined?(:NMatrix) && !gsl? && !narray?
121
171
  end
122
172
 
123
- def matrix?
173
+ # @return [Boolean] whether to use the standard library
174
+ def stdlib?
124
175
  @matrix ||= Object.const_defined?(:Matrix)
125
176
  end
126
177
  end
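
Reviewer note: the literals `2.2`, `0.3`, and `0.9` in the new `term_frequency` method follow from the `k1 = 1.2` and `b = 0.75` values mentioned in its `@note`. The sketch below spells out the parameterized BM25 form next to the hard-coded one so the correspondence is visible. The method names `bm25_idf`, `bm25_tf`, and `bm25_tf_hardcoded` and their plain numeric arguments are hypothetical, chosen for illustration, and are not part of the gem.

```ruby
# Parameter values taken from the @note in this diff.
K1 = 1.2
B  = 0.75

# BM25 inverse document frequency, as in the :bm25 branch of
# inverse_document_frequency. document_count is the number of documents
# containing the term.
def bm25_idf(total_documents, document_count)
  Math.log((total_documents - document_count + 0.5) / (document_count + 0.5))
end

# BM25 term-frequency component in its parameterized form.
def bm25_tf(count, document_size, average_size)
  (count * (K1 + 1)) / (count + K1 * (1 - B + B * document_size / average_size.to_f))
end

# The same expression with the constants folded in, as written in the diff:
# K1 + 1 = 2.2, K1 * (1 - B) = 0.3, K1 * B = 0.9.
def bm25_tf_hardcoded(count, document_size, average_size)
  (count * 2.2) / (count + 0.3 + 0.9 * document_size / average_size.to_f)
end

puts bm25_tf(3, 10, 8.0)
puts bm25_tf_hardcoded(3, 10, 8.0)
puts bm25_idf(10, 9)
```

One consequence worth noting: with this IDF formula, a term that occurs in more than half of the documents gets a negative weight, since the numerator of the logarithm's argument becomes smaller than the denominator.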
@@ -8,6 +8,8 @@ class TfIdfSimilarity::Document
8
8
  attr_reader :text
9
9
  # The number of times each term appears in the document.
10
10
  attr_reader :term_counts
11
+ # The document size, in terms.
12
+ attr_reader :size
11
13
 
12
14
  # @param [String] text the document's text
13
15
  # @param [Hash] opts optional arguments
@@ -43,6 +45,7 @@ private
43
45
  @term_counts[token.lowercase_filter.classic_filter.to_s] += 1
44
46
  end
45
47
  end
48
+ @size = term_counts.values.reduce(:+)
46
49
  end
47
50
 
48
51
  # Tokenizes a text, respecting the word boundary rules from Unicode’s Default
@@ -1,3 +1,3 @@
1
1
  module TfIdfSimilarity
2
- VERSION = "0.0.2"
2
+ VERSION = "0.0.3"
3
3
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: tf-idf-similarity
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
4
+ version: 0.0.3
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-09-10 00:00:00.000000000 Z
12
+ date: 2012-09-11 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: unicode_utils
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
95
95
  version: '0'
96
96
  segments:
97
97
  - 0
98
- hash: -1570138910816303214
98
+ hash: -4125970683092216956
99
99
  required_rubygems_version: !ruby/object:Gem::Requirement
100
100
  none: false
101
101
  requirements:
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
104
104
  version: '0'
105
105
  segments:
106
106
  - 0
107
- hash: -1570138910816303214
107
+ hash: -4125970683092216956
108
108
  requirements: []
109
109
  rubyforge_project:
110
110
  rubygems_version: 1.8.24