tf-idf-similarity 0.0.2 → 0.0.3

data/README.md CHANGED
@@ -3,7 +3,7 @@
3
3
  [![Dependency Status](https://gemnasium.com/opennorth/tf-idf-similarity.png)](https://gemnasium.com/opennorth/tf-idf-similarity)
4
4
  [![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/opennorth/tf-idf-similarity)
5
5
 
6
- Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM 25](http://en.wikipedia.org/wiki/Okapi_BM25).
6
+ Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) or similar (see below).
7
7
 
8
8
  ## Usage
9
9
 
@@ -41,7 +41,7 @@ gem install gsl
41
41
 
42
42
  You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/). You can use it through version `0.0.2` of the [nmatrix gem](https://github.com/SciRuby/nmatrix). As of writing, `0.0.2` is not released, so follow [these instructions](https://github.com/SciRuby/nmatrix#synopsis) to install it. You may need [additional instructions for Mac OS X Lion](https://github.com/SciRuby/nmatrix/wiki/NMatrix-Installation).
43
43
 
44
- ### Other Considerations
44
+ ### Other Options
45
45
 
46
46
  The [nmatrix](http://sciruby.com/nmatrix/) gem has no easy way to normalize all columns to unit vectors. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
47
47
 
@@ -63,6 +63,17 @@ The [treat](https://github.com/louismullie/treat), [tf-idf](https://github.com/r
63
63
  * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
64
64
  * [E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.](http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf)
65
65
 
66
+ ## Further Reading
67
+
68
+ Lucene implements many more [similarity functions](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/Similarity.html), such as:
69
+
70
+ * a [divergence from randomness (DFR) framework](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/DFRSimilarity.html)
71
+ * a [framework for the family of information-based models](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/IBSimilarity.html)
72
+ * a [language model with Bayesian smoothing using Dirichlet priors](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html)
73
+ * a [language model with Jelinek-Mercer smoothing](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html)
74
+
75
+ Lucene can even [combine similarity measures](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/MultiSimilarity.html).
76
+
66
77
  ## Bugs? Questions?
67
78
 
68
79
  This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
@@ -34,43 +34,42 @@ class TfIdfSimilarity::Collection
34
34
  term_counts.keys
35
35
  end
36
36
 
37
+ # @param [Hash] opts optional arguments
38
+ # @option opts [Symbol] :function one of :tfidf (default) or :bm25
39
+ #
40
+ # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
41
+ # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/BM25Similarity.html
37
42
  # @see http://en.wikipedia.org/wiki/Vector_space_model
38
43
  # @see http://en.wikipedia.org/wiki/Document-term_matrix
39
44
  # @see http://en.wikipedia.org/wiki/Cosine_similarity
40
- def similarity_matrix
41
- if matrix?
45
+ def similarity_matrix(opts = {})
46
+ if stdlib?
42
47
  idf = []
43
- term_document_matrix = Matrix.build(terms.size, documents.size) do |i,j|
44
- idf[i] ||= inverse_document_frequency terms[i]
45
- documents[j].term_frequency(terms[i]) * idf[i]
48
+ matrix = Matrix.build(terms.size, documents.size) do |i,j|
49
+ idf[i] ||= inverse_document_frequency(terms[i], opts)
50
+ idf[i] * term_frequency(documents[j], terms[i], opts)
46
51
  end
47
52
  else
48
- term_document_matrix = if gsl?
49
- GSL::Matrix.alloc terms.size, documents.size
50
- elsif narray?
51
- NArray.float documents.size, terms.size
52
- elsif nmatrix?
53
- NMatrix.new(:list, [terms.size, documents.size], :float64)
54
- end
55
-
53
+ matrix = initialize_matrix
56
54
  terms.each_with_index do |term,i|
57
- idf = inverse_document_frequency term
55
+ idf = inverse_document_frequency(term, opts)
58
56
  documents.each_with_index do |document,j|
59
- tfidf = document.term_frequency(term) * idf
60
- if gsl? || nmatrix?
61
- term_document_matrix[i, j] = tfidf
57
+ value = idf * term_frequency(document, term, opts)
62
58
  # NArray puts the dimensions in a different order.
63
59
  # @see http://narray.rubyforge.org/SPEC.en
64
- elsif narray?
65
- term_document_matrix[j, i] = tfidf
60
+ if narray?
61
+ matrix[j, i] = value
62
+ else
63
+ matrix[i, j] = value
66
64
  end
67
65
  end
68
66
  end
69
- end
70
67
 
71
- # Columns are normalized to unit vectors, so we can calculate the cosine
72
- # similarity of all document vectors.
73
- matrix = normalize term_document_matrix
68
+ # Columns are normalized to unit vectors, so we can calculate the cosine
69
+ # similarity of all document vectors. BM25 doesn't normalize columns, but
70
+ # BM25 wasn't written with this use case in mind.
71
+ matrix = normalize matrix
72
+ end
74
73
 
75
74
  if nmatrix?
76
75
  matrix.transpose.dot matrix
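
Note on the hunk above: the comment about unit vectors can be illustrated with Ruby's standard-library `Matrix`. The sketch below is only an illustration under stated assumptions: `cosine_similarity_matrix` is a hypothetical helper that is not part of the gem, and the weights are made-up numbers rather than real tf-idf values. It normalizes each column (document vector) and multiplies the transposed matrix by the matrix, so that entry `[i, j]` is the cosine similarity of documents `i` and `j`.

```ruby
require 'matrix'

# Hypothetical helper (not part of the gem): given a term-document matrix
# whose columns are document vectors, return the document-document cosine
# similarity matrix.
def cosine_similarity_matrix(term_document_matrix)
  # Scale every column (document vector) to unit length.
  # Vector#normalize raises on a zero vector, so empty documents need care.
  normalized = Matrix.columns(term_document_matrix.column_vectors.map(&:normalize))
  # The dot product of two unit vectors is their cosine similarity.
  normalized.transpose * normalized
end

# Made-up weights: 3 terms (rows) by 3 documents (columns).
weights = Matrix[
  [1.0, 2.0, 0.0],
  [0.0, 0.0, 3.0],
  [2.0, 4.0, 0.0]
]
similarities = cosine_similarity_matrix(weights)
```

Here the first two columns point in the same direction, so their similarity is close to 1, while the third column shares no terms with them, so its similarity to either is 0.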
@@ -80,14 +79,46 @@ class TfIdfSimilarity::Collection
80
79
  end
81
80
 
82
81
  # @param [String] term a term
82
+ # @param [Hash] opts optional arguments
83
+ # @option opts [Symbol] :function one of :tfidf (default) or :bm25
83
84
  # @return [Float] the term's inverse document frequency
84
- #
85
- # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
86
- def inverse_document_frequency(term)
87
- 1 + Math.log(documents.size / (document_counts[term].to_f + 1))
85
+ def inverse_document_frequency(term, opts = {})
86
+ if opts[:function] == :bm25
87
+ Math.log((documents.size - document_counts[term] + 0.5) / (document_counts[term] + 0.5))
88
+ else
89
+ 1 + Math.log(documents.size / (document_counts[term].to_f + 1))
90
+ end
88
91
  end
89
92
  alias_method :idf, :inverse_document_frequency
90
93
 
94
+ # @param [Document] document a document
95
+ # @param [String] term a term
96
+ # @param [Hash] opts optional arguments
97
+ # @option opts [Symbol] :function one of :tfidf (default) or :bm25
98
+ # @return [Float] the term's frequency in the document
99
+ #
100
+ # @note Like Lucene, we use a b value of 0.75 and a k1 value of 1.2.
101
+ def term_frequency(document, term, opts = {})
102
+ if opts[:function] == :bm25
103
+ (document.term_counts[term] * 2.2) / (document.term_counts[term] + 0.3 + 0.9 * document.size / average_document_size)
104
+ else
105
+ document.term_frequency term
106
+ end
107
+ end
108
+
109
+ # @return [Float] the average document size, in terms
110
+ def average_document_size
111
+ @average_document_size ||= documents.map(&:size).reduce(:+) / documents.size.to_f
112
+ end
113
+
114
+ # Resets the average document size.
115
+ #
116
+ # If you have already made a similarity matrix and are adding more documents,
117
+ # call this method before creating a new similarity matrix.
118
+ def reset_average_document_size!
119
+ @average_document_size = nil
120
+ end
121
+
91
122
  # @param [Document] matrix a term-document matrix
92
123
  # @return [Matrix] a matrix in which all document vectors are unit vectors
93
124
  #
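
Note on the hunk above: the constants `2.2`, `0.3` and `0.9` in the BM25 branch of `term_frequency` appear to follow from the `k1 = 1.2` and `b = 0.75` values named in the `@note`, since `k1 + 1 = 2.2`, `k1 * (1 - b) = 0.3` and `k1 * b = 0.9`. The sketch below spells this out with the parameters left explicit. The function names `bm25_term_frequency` and `bm25_inverse_document_frequency` are hypothetical standalone helpers written for this note; they are not the gem's API.

```ruby
# Assumed BM25 parameters, matching the values named in the @note.
K1 = 1.2
B = 0.75

# Hypothetical helper: BM25 term-frequency component for a term that occurs
# `count` times in a document of `document_size` terms.
def bm25_term_frequency(count, document_size, average_document_size, k1 = K1, b = B)
  length_ratio = document_size / average_document_size.to_f
  (count * (k1 + 1)) / (count + k1 * (1 - b + b * length_ratio))
end

# Hypothetical helper: BM25 inverse document frequency for a term that
# appears in `document_count` of `document_total` documents.
def bm25_inverse_document_frequency(document_total, document_count)
  Math.log((document_total - document_count + 0.5) / (document_count + 0.5))
end

# The parameterized form should agree with the hard-coded constants in the diff.
parameterized = bm25_term_frequency(3, 10, 8.0)
hard_coded = (3 * 2.2) / (3 + 0.3 + 0.9 * 10 / 8.0)
```

One property worth keeping in mind with this form of the inverse document frequency: it becomes negative for a term that appears in more than half of the documents, which the `:tfidf` formula (`1 + log(...)`) does not do over the same range.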
@@ -99,7 +130,12 @@ class TfIdfSimilarity::Collection
99
130
  # @see https://github.com/masa16/narray/issues/21
100
131
  NMatrix.refer matrix / NMath.sqrt((matrix ** 2).sum(1).reshape(5,1))
101
132
  elsif nmatrix?
102
- # @todo NMatrix has no way to retrieve a column, besides iteration.
133
+ # @todo NMatrix has no way to perform scalar operations on matrices.
134
+ # (0...matrix.shape[0]).each do |i|
135
+ # column = matrix.slice i, 0...matrix.shape[1]
136
+ # norm = column.dot column.transpose
137
+ # # No way to divide column by norm.
138
+ # end
103
139
  matrix.cast :yale, :float64
104
140
  else
105
141
  Matrix.columns matrix.column_vectors.map(&:normalize)
@@ -108,19 +144,34 @@ class TfIdfSimilarity::Collection
108
144
 
109
145
  private
110
146
 
147
+ # @return a matrix
148
+ def initialize_matrix
149
+ if gsl?
150
+ GSL::Matrix.alloc terms.size, documents.size
151
+ elsif narray?
152
+ NArray.float documents.size, terms.size
153
+ elsif nmatrix?
154
+ NMatrix.new(:list, [terms.size, documents.size], :float64)
155
+ end
156
+ end
157
+
158
+ # @return [Boolean] whether to use the GSL gem
111
159
  def gsl?
112
160
  @gsl ||= Object.const_defined?(:GSL)
113
161
  end
114
162
 
163
+ # @return [Boolean] whether to use the NArray gem
115
164
  def narray?
116
165
  @narray ||= Object.const_defined?(:NArray) && !gsl?
117
166
  end
118
167
 
168
+ # @return [Boolean] whether to use the NMatrix gem
119
169
  def nmatrix?
120
- @nmatrix ||= Object.const_defined?(:NMatrix) && !narray?
170
+ @nmatrix ||= Object.const_defined?(:NMatrix) && !gsl? && !narray?
121
171
  end
122
172
 
123
- def matrix?
173
+ # @return [Boolean] whether to use the standard library
174
+ def stdlib?
124
175
  @matrix ||= Object.const_defined?(:Matrix)
125
176
  end
126
177
  end
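
Note on the hunk above: the backend predicates memoize with `||=`, which only caches truthy results. When a constant is not defined, the check returns `false` and is evaluated again on every call. This looks harmless here because `Object.const_defined?` is cheap, but the sketch below shows the difference against a `defined?`-based variant. `BackendProbe`, its method names, and the constant `:HypotheticalMissingBackend` are all invented for this illustration and do not exist in the gem.

```ruby
# Hypothetical class, written only to compare two memoization styles.
class BackendProbe
  attr_reader :checks

  def initialize
    @checks = 0
  end

  # Same style as the predicates in the diff: a false result is not cached,
  # so the probe runs again on the next call.
  def loose?
    @loose ||= probe
  end

  # Caches any result, including false, by testing whether the instance
  # variable has been assigned at all.
  def strict?
    return @strict if defined?(@strict)
    @strict = probe
  end

  private

  # Counts how many times the constant lookup is actually performed.
  def probe
    @checks += 1
    Object.const_defined?(:HypotheticalMissingBackend)
  end
end

loose = BackendProbe.new
2.times { loose.loose? }

strict = BackendProbe.new
2.times { strict.strict? }
```

After two calls each, `loose.checks` is 2 and `strict.checks` is 1, because only the second style remembers the `false` answer.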
@@ -8,6 +8,8 @@ class TfIdfSimilarity::Document
8
8
  attr_reader :text
9
9
  # The number of times each term appears in the document.
10
10
  attr_reader :term_counts
11
+ # The document size, in terms.
12
+ attr_reader :size
11
13
 
12
14
  # @param [String] text the document's text
13
15
  # @param [Hash] opts optional arguments
@@ -43,6 +45,7 @@ private
43
45
  @term_counts[token.lowercase_filter.classic_filter.to_s] += 1
44
46
  end
45
47
  end
48
+ @size = term_counts.values.reduce(:+)
46
49
  end
47
50
 
48
51
  # Tokenizes a text, respecting the word boundary rules from Unicode’s Default
@@ -1,3 +1,3 @@
1
1
  module TfIdfSimilarity
2
- VERSION = "0.0.2"
2
+ VERSION = "0.0.3"
3
3
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: tf-idf-similarity
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
4
+ version: 0.0.3
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-09-10 00:00:00.000000000 Z
12
+ date: 2012-09-11 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: unicode_utils
@@ -95,7 +95,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
95
95
  version: '0'
96
96
  segments:
97
97
  - 0
98
- hash: -1570138910816303214
98
+ hash: -4125970683092216956
99
99
  required_rubygems_version: !ruby/object:Gem::Requirement
100
100
  none: false
101
101
  requirements:
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
104
104
  version: '0'
105
105
  segments:
106
106
  - 0
107
- hash: -1570138910816303214
107
+ hash: -4125970683092216956
108
108
  requirements: []
109
109
  rubyforge_project:
110
110
  rubygems_version: 1.8.24