tf-idf-similarity 0.0.1
- data/.gitignore +6 -0
- data/.travis.yml +3 -0
- data/Gemfile +4 -0
- data/LICENSE +20 -0
- data/README.md +70 -0
- data/Rakefile +16 -0
- data/USAGE +1 -0
- data/lib/tf-idf-similarity.rb +7 -0
- data/lib/tf-idf-similarity/collection.rb +128 -0
- data/lib/tf-idf-similarity/document.rb +62 -0
- data/lib/tf-idf-similarity/extras/collection.rb +85 -0
- data/lib/tf-idf-similarity/extras/document.rb +118 -0
- data/lib/tf-idf-similarity/token.rb +42 -0
- data/lib/tf-idf-similarity/version.rb +3 -0
- data/td-idf-similarity.gemspec +22 -0
- metadata +114 -0
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2012 Open North Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,70 @@
+# Ruby Vector Space Model (VSM) with tf*idf weights
+
+Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM25](http://en.wikipedia.org/wiki/Okapi_BM25).
+
+## Usage
+
+    require 'tf-idf-similarity'
+
+    corpus = TfIdfSimilarity::Collection.new
+    corpus << TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
+    corpus << TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
+    corpus << TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")
+
+    p corpus.similarity_matrix
+
+This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for faster matrix multiplication.
+
+## Optimizations
+
+### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
+
+The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew, so install the older `gsl` 1.14 formula before installing the gem:
+
+```sh
+cd /usr/local
+git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
+brew install gsl
+git checkout master
+git branch -d gsl-1.14
+gem install gsl
+```
+
+### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
+
+You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/).
+
+The `nmatrix` gem (`0.0.1`) can't find the `cblas.h` and `clapack.h` header files. Either [set the C_INCLUDE_PATH](https://github.com/SciRuby/nmatrix#synopsis):
+
+    export C_INCLUDE_PATH=/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/
+
+Or [create links](https://github.com/SciRuby/nmatrix/issues/21) before installing the gem:
+
+    sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/cblas.h /usr/include/cblas.h
+    sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/clapack.h /usr/include/clapack.h
+
+Version `0.0.2` [doesn't compile on Mac OS X Lion](https://github.com/SciRuby/nmatrix/issues/34).
+
+### Other Considerations
+
+The [narray](http://narray.rubyforge.org/) and [nmatrix](http://sciruby.com/nmatrix/) gems have no method to calculate the magnitude of a vector. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
+
+## Extras
+
+You can access more term frequency, document frequency, and normalization formulas with:
+
+    require 'tf-idf-similarity/extras/collection'
+    require 'tf-idf-similarity/extras/document'
+
+The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
+
+## Reference
+
+* [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
+* [E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.](http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf)
+
+## Bugs? Questions?
+
+This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are welcome.
+
+Copyright (c) 2012 Open North Inc., released under the MIT license
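The README notes that the narray and nmatrix gems have no method for the magnitude of a vector. As a rough illustration of what such a helper would need to compute, here is a minimal sketch on plain Ruby arrays. The method names `magnitude` and `normalize` are hypothetical and are not part of this gem or of either library.

```ruby
# Euclidean magnitude (L2 norm) of a vector given as an Array of numbers.
def magnitude(vector)
  Math.sqrt(vector.reduce(0.0) { |sum, x| sum + x * x })
end

# Scales a vector to unit length. A zero vector is returned unchanged,
# because it has no direction to preserve.
def normalize(vector)
  norm = magnitude(vector)
  return vector.dup if norm.zero?
  vector.map { |x| x / norm }
end

unit = normalize([3, 4])
```

With unit-length document vectors, the dot product of two vectors is their cosine similarity, which is why the collection normalizes columns before multiplying.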
data/Rakefile
ADDED
@@ -0,0 +1,16 @@
+require 'bundler'
+Bundler::GemHelper.install_tasks
+
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)
+
+task :default => :spec
+
+begin
+  require 'yard'
+  YARD::Rake::YardocTask.new
+rescue LoadError
+  task :yard do
+    abort 'YARD is not available. In order to run yard, you must: gem install yard'
+  end
+end
data/USAGE
ADDED
@@ -0,0 +1 @@
+See README.md for full usage details.
data/lib/tf-idf-similarity.rb
ADDED
@@ -0,0 +1,7 @@
+$LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
+
+module TfIdfSimilarity
+  autoload :Collection, 'tf-idf-similarity/collection'
+  autoload :Document, 'tf-idf-similarity/document'
+  autoload :Token, 'tf-idf-similarity/token'
+end
data/lib/tf-idf-similarity/collection.rb
ADDED
@@ -0,0 +1,128 @@
+begin
+  require 'gsl'
+rescue LoadError
+  require 'matrix'
+end
+
+class TfIdfSimilarity::Collection
+  # The documents in the collection.
+  attr_reader :documents
+  # The number of times each term appears in all documents.
+  attr_reader :term_counts
+  # The number of documents each term appears in.
+  attr_reader :document_counts
+
+  def initialize
+    @documents = []
+    @term_counts = Hash.new 0
+    @document_counts = Hash.new 0
+  end
+
+  def <<(document)
+    document.term_counts.each do |term,count|
+      @term_counts[term] += count
+      @document_counts[term] += 1
+    end
+    @documents << document
+  end
+
+  # @return [Array<String>] the set of the collection's terms with no duplicates
+  def terms
+    term_counts.keys
+  end
+
+  # @see http://en.wikipedia.org/wiki/Vector_space_model
+  # @see http://en.wikipedia.org/wiki/Document-term_matrix
+  # @see http://en.wikipedia.org/wiki/Cosine_similarity
+  def similarity_matrix
+    if matrix?
+      idf = []
+      term_document_matrix = Matrix.build(terms.size, documents.size) do |i,j|
+        idf[i] ||= inverse_document_frequency terms[i]
+        documents[j].term_frequency(terms[i]) * idf[i]
+      end
+    else
+      term_document_matrix = if gsl?
+        GSL::Matrix.alloc terms.size, documents.size
+      elsif narray?
+        NMatrix.float documents.size, terms.size
+      elsif nmatrix?
+        # The nmatrix gem's sparse matrices are unfortunately buggy.
+        # @see https://github.com/SciRuby/nmatrix/issues/35
+        NMatrix.new([terms.size, documents.size], :float64)
+      end
+
+      terms.each_with_index do |term,i|
+        idf = inverse_document_frequency term
+        documents.each_with_index do |document,j|
+          tfidf = document.term_frequency(term) * idf
+          if gsl? || nmatrix?
+            term_document_matrix[i, j] = tfidf
+          # NArray puts the dimensions in a different order.
+          # @see http://narray.rubyforge.org/SPEC.en
+          elsif narray?
+            term_document_matrix[j, i] = tfidf
+          end
+        end
+      end
+    end
+
+    # Columns are normalized to unit vectors, so we can calculate the cosine
+    # similarity of all document vectors.
+    matrix = normalize term_document_matrix
+
+    if nmatrix?
+      matrix.transpose.dot matrix
+    else
+      matrix.transpose * matrix
+    end
+  end
+
+  # @param [String] term a term
+  # @return [Float] the term's inverse document frequency
+  #
+  # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
+  def inverse_document_frequency(term)
+    1 + Math.log(documents.size / (document_counts[term].to_f + 1))
+  end
+  alias_method :idf, :inverse_document_frequency
+
+  # @param [Matrix] matrix a term-document matrix
+  # @return [Matrix] a matrix in which all document vectors are unit vectors
+  #
+  # @note Lucene normalizes document length differently.
+  def normalize(matrix)
+    if gsl?
+      matrix.each_col(&:normalize!)
+    elsif narray?
+      # @todo NArray doesn't have a method to normalize a vector.
+      # 0.upto(matrix.shape[0] - 1).each do |j|
+      #   matrix[j, true] # Normalize this column somehow.
+      # end
+      matrix
+    elsif nmatrix?
+      # @todo NMatrix doesn't have a method to normalize a vector.
+      matrix
+    else
+      Matrix.columns matrix.column_vectors.map(&:normalize)
+    end
+  end
+
+private
+
+  def gsl?
+    @gsl ||= Object.const_defined?(:GSL)
+  end
+
+  def narray?
+    @narray ||= Object.const_defined?(:NArray) && !gsl?
+  end
+
+  def nmatrix?
+    @nmatrix ||= Object.const_defined?(:NMatrix) && !narray?
+  end
+
+  def matrix?
+    @matrix ||= Object.const_defined?(:Matrix)
+  end
+end
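The pipeline in `similarity_matrix` (weight each term by tf*idf, normalize each document column, then multiply the transposed matrix by itself) can be sketched end to end with only the standard library `matrix`. This is a simplified illustration, not the gem's code path: the toy `docs` hashes and the free-standing `idf` helper are assumptions made for the example, while the two formulas are the ones used by `Document#term_frequency` and `Collection#inverse_document_frequency` in this release.

```ruby
require 'matrix'

# Hypothetical toy corpus: each document is already reduced to term counts.
docs = [
  { 'lorem' => 2, 'ipsum' => 1 },
  { 'ipsum' => 1, 'dui' => 3 },
  { 'dui' => 1, 'leo' => 1 }
]
terms = docs.flat_map(&:keys).uniq

# Same formula as Collection#inverse_document_frequency: 1 + ln(N / (df + 1)).
def idf(term, docs)
  df = docs.count { |counts| counts.key?(term) }
  1 + Math.log(docs.size / (df + 1.0))
end

# One unit-length tf*idf vector per document; tf is the square root of the count.
columns = docs.map do |counts|
  weights = terms.map { |term| Math.sqrt(counts.fetch(term, 0)) * idf(term, docs) }
  Vector.elements(weights).normalize
end

# Rows are terms, columns are documents, as in the term-document matrix above.
matrix = Matrix.columns(columns)

# Entry [i, j] is the cosine similarity between documents i and j.
similarity = matrix.transpose * matrix
```

Because every column has unit length, the diagonal of `similarity` is 1 and documents that share no terms score 0.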
data/lib/tf-idf-similarity/document.rb
ADDED
@@ -0,0 +1,62 @@
+# coding: utf-8
+require 'unicode_utils'
+
+class TfIdfSimilarity::Document
+  # An optional document identifier.
+  attr_reader :id
+  # The document's text.
+  attr_reader :text
+  # The number of times each term appears in the document.
+  attr_reader :term_counts
+
+  # @param [String] text the document's text
+  # @param [Hash] opts optional arguments
+  # @option opts [String] :id a string to identify the document
+  def initialize(text, opts = {})
+    @text = text
+    @id = opts[:id] || object_id
+    @term_counts = Hash.new 0
+    process
+  end
+
+  # @return [Array<String>] the set of the document's terms with no duplicates
+  def terms
+    term_counts.keys
+  end
+
+  # @param [String] term a term
+  # @return [Float] the square root of the term count
+  #
+  # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
+  def term_frequency(term)
+    Math.sqrt term_counts[term]
+  end
+  alias_method :tf, :term_frequency
+
+private
+
+  # Tokenizes the text and counts terms.
+  def process
+    tokenize(text).each do |word|
+      token = TfIdfSimilarity::Token.new word
+      if token.valid?
+        @term_counts[token.lowercase_filter.classic_filter.to_s] += 1
+      end
+    end
+  end
+
+  # Tokenizes a text, respecting the word boundary rules from Unicode’s Default
+  # Word Boundary Specification.
+  #
+  # @param [String] text a text
+  # @return [Enumerator] a token enumerator
+  #
+  # @note We should evaluate the tokenizers by {http://www.sciencemag.org/content/suppl/2010/12/16/science.1199644.DC1/Michel.SOM.revision.2.pdf Google}
+  #   or {http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory Solr}.
+  #
+  # @see http://unicode.org/reports/tr29/#Default_Word_Boundaries
+  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
+  def tokenize(text)
+    UnicodeUtils.each_word text
+  end
+end
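To make the flow of `Document#process` concrete (tokenize, drop invalid tokens, apply the lowercase and classic filters, count), here is a dependency-free sketch. It is only an approximation: `simple_tokenize` and `count_terms` are hypothetical names, the regex split is a crude stand-in for `UnicodeUtils.each_word`, and `String#downcase` replaces `UnicodeUtils.downcase`.

```ruby
# Hypothetical stand-in for UnicodeUtils.each_word: split on anything that
# is not a letter, a digit or an apostrophe.
def simple_tokenize(text)
  text.split(/[^\p{L}\p{N}']+/).reject(&:empty?)
end

def count_terms(text)
  counts = Hash.new(0)
  simple_tokenize(text).each do |word|
    # Like Token#valid?: skip tokens made only of digits, punctuation or whitespace.
    next if word =~ /\A[\d[:punct:][:space:]]+\z/
    # Like lowercase_filter then classic_filter: downcase, drop periods, drop a trailing 's.
    counts[word.downcase.delete('.').chomp("'s")] += 1
  end
  counts
end

counts = count_terms("The cat's toy. The CAT, 42 toys!")

# The default term frequency is the square root of the raw count.
tf = Math.sqrt(counts['the'])
```

Note that without stemming, "toy" and "toys" remain distinct terms, which is the limitation the `@note` in `token.rb` alludes to.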
data/lib/tf-idf-similarity/extras/collection.rb
ADDED
@@ -0,0 +1,85 @@
+require 'tf-idf-similarity/collection'
+
+class TfIdfSimilarity::Collection
+  # SMART n, Salton x, Chisholm NONE
+  def no_collection_frequency(term)
+    1.0
+  end
+
+  # SMART t, Salton f, Chisholm IDFB
+  def plain_inverse_document_frequency(term)
+    count = document_counts[term].to_f
+    Math.log(documents.size / count)
+  end
+  alias_method :plain_idf, :plain_inverse_document_frequency
+
+  # SMART p, Salton p, Chisholm IDFP
+  def probabilistic_inverse_document_frequency(term)
+    count = document_counts[term].to_f
+    Math.log((documents.size - count) / count)
+  end
+  alias_method :probabilistic_idf, :probabilistic_inverse_document_frequency
+
+  # Chisholm IGFF
+  def global_frequency_inverse_document_frequency(term)
+    term_counts[term] / document_counts[term].to_f
+  end
+  alias_method :gfidf, :global_frequency_inverse_document_frequency
+
+  # Chisholm IGFL
+  def log_global_frequency_inverse_document_frequency(term)
+    Math.log(global_frequency_inverse_document_frequency(term) + 1)
+  end
+  alias_method :log_gfidf, :log_global_frequency_inverse_document_frequency
+
+  # Chisholm IGFI
+  def incremented_global_frequency_inverse_document_frequency(term)
+    global_frequency_inverse_document_frequency(term) + 1
+  end
+  alias_method :incremented_gfidf, :incremented_global_frequency_inverse_document_frequency
+
+  # Chisholm IGFS
+  def square_root_global_frequency_inverse_document_frequency(term)
+    Math.sqrt(global_frequency_inverse_document_frequency(term) - 0.9)
+  end
+  alias_method :square_root_gfidf, :square_root_global_frequency_inverse_document_frequency
+
+  # Chisholm ENPY
+  def entropy(term)
+    denominator = term_counts[term].to_f
+    logN = Math.log documents.size
+    1 + documents.reduce(0) do |sum,document|
+      quotient = document.term_counts[term] / denominator
+      sum + (quotient > 0 ? quotient * Math.log(quotient) / logN : 0) # 0 * log(0) counts as 0
+    end
+  end
+
+
+
+  # @param [Matrix] matrix a term-document matrix
+  # @return [Matrix] the same matrix
+  #
+  # SMART n, Salton x, Chisholm NONE
+  def no_normalization(matrix)
+    matrix
+  end
+
+  # @param [Matrix] matrix a term-document matrix
+  # @return [Matrix] a matrix in which all document vectors are unit vectors
+  #
+  # SMART c, Salton c, Chisholm COSN
+  def cosine_normalization(matrix)
+    Matrix.columns(matrix.column_vectors.map do |column|
+      column.normalize
+    end)
+  end
+
+  # @param [Matrix] matrix a term-document matrix
+  # @return [Matrix] a matrix
+  #
+  # SMART u, Chisholm PUQN
+  def pivoted_unique_normalization(matrix)
+    # @todo
+    # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
+  end
+end
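The entropy weight (Chisholm ENPY) is the least obvious of the collection-level formulas, so a small worked sketch may help. It takes a term's count in each document and returns 1 plus the sum of p * ln(p) / ln(N), where p is the share of the term's occurrences that fall in one document. The method name `entropy_weight` and its array argument are assumptions for illustration, and the sketch assumes at least two documents, since ln(1) = 0 would otherwise divide by zero.

```ruby
# counts_per_doc: one Integer per document, the term's count in that document.
def entropy_weight(counts_per_doc)
  total = counts_per_doc.reduce(:+).to_f
  log_n = Math.log(counts_per_doc.size)
  sum = counts_per_doc.reduce(0.0) do |acc, count|
    next acc if count.zero? # a document without the term contributes nothing
    share = count / total
    acc + share * Math.log(share) / log_n
  end
  1 + sum
end

evenly_spread = entropy_weight([1, 1, 1, 1]) # term appears equally everywhere
concentrated  = entropy_weight([4, 0, 0, 0]) # term appears in one document only
```

A term spread evenly over all documents gets a weight near 0 because it does not discriminate between them, while a term confined to a single document gets the maximum weight of 1.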
data/lib/tf-idf-similarity/extras/document.rb
ADDED
@@ -0,0 +1,118 @@
+require 'tf-idf-similarity/document'
+
+class TfIdfSimilarity::Document
+  # @return [Float] the maximum term count of any term in the document
+  def maximum_term_count
+    @maximum_term_count ||= @term_counts.values.max.to_f
+  end
+
+  # @return [Float] the average term count of all terms in the document
+  def average_term_count
+    @average_term_count ||= @term_counts.values.reduce(:+) / @term_counts.size.to_f
+  end
+
+
+
+  # Returns the term count.
+  #
+  # SMART n, Salton t, Chisholm FREQ
+  def plain_term_frequency(term)
+    term_counts[term]
+  end
+  alias_method :plain_tf, :plain_term_frequency
+
+  # Returns 1 if the term is present, 0 otherwise.
+  #
+  # SMART b, Salton b, Chisholm BNRY
+  def binary_term_frequency(term)
+    count = term_counts[term]
+    if count > 0
+      1
+    else
+      0
+    end
+  end
+  alias_method :binary_tf, :binary_term_frequency
+
+  # Normalizes the term count by the maximum term count.
+  #
+  # @see http://en.wikipedia.org/wiki/Tf*idf
+  def normalized_term_frequency(term)
+    term_counts[term] / maximum_term_count
+  end
+  alias_method :normalized_tf, :normalized_term_frequency
+
+  # Further normalizes the normalized term frequency to lie between 0.5 and 1.
+  #
+  # SMART a, Salton n, Chisholm ATF1
+  def augmented_normalized_term_frequency(term)
+    0.5 + 0.5 * normalized_term_frequency(term)
+  end
+  alias_method :augmented_normalized_tf, :augmented_normalized_term_frequency
+
+  # Chisholm ATFA
+  def augmented_average_term_frequency(term)
+    count = term_counts[term]
+    if count > 0
+      0.9 + 0.1 * count / average_term_count
+    else
+      0
+    end
+  end
+  alias_method :augmented_average_tf, :augmented_average_term_frequency
+
+  # Chisholm ATFC
+  def changed_coefficient_augmented_normalized_term_frequency(term)
+    count = term_counts[term]
+    if count > 0
+      0.2 + 0.8 * count / maximum_term_count
+    else
+      0
+    end
+  end
+  alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
+
+  # SMART l, Chisholm LOGA
+  def log_term_frequency(term)
+    count = term_counts[term]
+    if count > 0
+      1 + Math.log(count)
+    else
+      0
+    end
+  end
+  alias_method :log_tf, :log_term_frequency
+
+  # SMART L, Chisholm LOGN
+  def normalized_log_term_frequency(term)
+    count = term_counts[term]
+    if count > 0
+      (1 + Math.log(count)) / (1 + Math.log(average_term_count))
+    else
+      0
+    end
+  end
+  alias_method :normalized_log_tf, :normalized_log_term_frequency
+
+  # Chisholm LOGG
+  def augmented_log_term_frequency(term)
+    count = term_counts[term]
+    if count > 0
+      0.2 + 0.8 * Math.log(count + 1)
+    else
+      0
+    end
+  end
+  alias_method :augmented_log_tf, :augmented_log_term_frequency
+
+  # Chisholm SQRT
+  def square_root_term_frequency(term)
+    count = term_counts[term]
+    if count > 0
+      Math.sqrt(count - 0.5) + 1
+    else
+      0
+    end
+  end
+  alias_method :square_root_tf, :square_root_term_frequency
+end
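To compare how a few of these term frequency variants behave, the following sketch applies the binary, augmented normalized, and log formulas to a hypothetical pair of counts. The lambdas and the `counts` hash are illustrative names only; the formulas are copied from the methods above.

```ruby
counts = { 'ipsum' => 4, 'dolor' => 1 } # hypothetical term counts for one document
max = counts.values.max.to_f

binary_tf    = ->(count) { count > 0 ? 1 : 0 }               # SMART b
augmented_tf = ->(count) { 0.5 + 0.5 * count / max }         # SMART a
log_tf       = ->(count) { count > 0 ? 1 + Math.log(count) : 0 } # SMART l

weights = counts.map do |term, count|
  [term, [binary_tf.(count), augmented_tf.(count), log_tf.(count)]]
end.to_h
```

The binary variant ignores repetition entirely, the augmented variant compresses counts into the range 0.5 to 1, and the log variant grows slowly with the count, so a term that appears four times weighs roughly 2.4 times as much as one that appears once rather than four times as much.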
data/lib/tf-idf-similarity/token.rb
ADDED
@@ -0,0 +1,42 @@
+# coding: utf-8
+
+# @note We can add more filters from Solr and stem using Porter's Snowball.
+#
+# @see https://github.com/aurelian/ruby-stemmer
+# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
+# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
+# @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
+class TfIdfSimilarity::Token < String
+  # Returns a falsy value if all its characters are numbers, punctuation,
+  # whitespace or control characters.
+  #
+  # @note Some implementations ignore one and two-letter words.
+  #
+  # @return [Boolean] whether the string is a token
+  def valid?
+    !self[%r{
+      \A
+      (
+        \d | # number
+        \p{Cntrl} | # control character
+        \p{Punct} | # punctuation
+        [[:space:]] # whitespace
+      )+
+      \z
+    }x]
+  end
+
+  # @return [Token] a lowercase string
+  #
+  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseFilterFactory
+  def lowercase_filter
+    self.class.new UnicodeUtils.downcase(self, :fr)
+  end
+
+  # @return [Token] a string with no English possessive or periods in acronyms
+  #
+  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ClassicFilterFactory
+  def classic_filter
+    self.class.new self.gsub('.', '').chomp("'s")
+  end
+end
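The `valid?` check rejects a string only when every character is a digit, control character, punctuation mark or whitespace, so mixed strings such as "R2D2" still count as tokens. A standalone sketch of the same pattern, using a hypothetical constant `NON_TOKEN` and method `valid_token?` rather than the `Token` class itself:

```ruby
# Matches strings made up entirely of digits, control characters,
# punctuation or whitespace (the same alternatives as Token#valid?).
NON_TOKEN = /\A(\d|\p{Cntrl}|\p{Punct}|[[:space:]])+\z/

def valid_token?(word)
  !(word =~ NON_TOKEN)
end

results = %w[ipsum 42 ... R2D2 3.14].map { |word| [word, valid_token?(word)] }.to_h
```

Here "ipsum" and "R2D2" are kept, while "42", "..." and "3.14" are dropped because they contain no letters.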
data/td-idf-similarity.gemspec
ADDED
@@ -0,0 +1,22 @@
+# -*- encoding: utf-8 -*-
+$:.push File.expand_path("../lib", __FILE__)
+require "tf-idf-similarity/version"
+
+Gem::Specification.new do |s|
+  s.name = "tf-idf-similarity"
+  s.version = TfIdfSimilarity::VERSION
+  s.platform = Gem::Platform::RUBY
+  s.authors = ["Open North"]
+  s.email = ["info@opennorth.ca"]
+  s.homepage = "http://github.com/opennorth/tf-idf-similarity"
+  s.summary = %q{Calculates the similarity between texts using tf*idf}
+
+  s.files = `git ls-files`.split("\n")
+  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
+  s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+  s.require_paths = ["lib"]
+
+  s.add_runtime_dependency('unicode_utils')
+  s.add_development_dependency('rspec', '~> 2.10')
+  s.add_development_dependency('rake')
+end
metadata
ADDED
@@ -0,0 +1,114 @@
+--- !ruby/object:Gem::Specification
+name: tf-idf-similarity
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+  prerelease:
+platform: ruby
+authors:
+- Open North
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2012-09-10 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: unicode_utils
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.10'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.10'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description:
+email:
+- info@opennorth.ca
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- .travis.yml
+- Gemfile
+- LICENSE
+- README.md
+- Rakefile
+- USAGE
+- lib/tf-idf-similarity.rb
+- lib/tf-idf-similarity/collection.rb
+- lib/tf-idf-similarity/document.rb
+- lib/tf-idf-similarity/extras/collection.rb
+- lib/tf-idf-similarity/extras/document.rb
+- lib/tf-idf-similarity/token.rb
+- lib/tf-idf-similarity/version.rb
+- td-idf-similarity.gemspec
+homepage: http://github.com/opennorth/tf-idf-similarity
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 697007281194730821
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 697007281194730821
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: Calculates the similarity between texts using tf*idf
+test_files: []