tf-idf-similarity 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,6 @@
1
+ *.gem
2
+ .bundle
3
+ .yardoc
4
+ Gemfile.lock
5
+ doc/*
6
+ pkg/*
data/.travis.yml ADDED
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in the gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2012 Open North Inc.
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,70 @@
1
+ # Ruby Vector Space Model (VSM) with tf*idf weights
2
+
3
+ Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM25](http://en.wikipedia.org/wiki/Okapi_BM25).
4
+
5
+ ## Usage
6
+
7
+     require 'tf-idf-similarity'
8
+
9
+     corpus = TfIdfSimilarity::Collection.new
10
+     corpus << TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
11
+     corpus << TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
12
+     corpus << TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")
13
+
14
+     p corpus.similarity_matrix
15
+
16
+ This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for faster matrix multiplication.
17
+
18
+ ## Optimizations
19
+
20
+ ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
21
+
22
+ The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew. To install version `1.14` of the package instead:
23
+
24
+ ```sh
25
+ cd /usr/local
26
+ git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
27
+ brew install gsl
28
+ git checkout master
29
+ git branch -d gsl-1.14
30
+ gem install gsl
31
+ ```
32
+
33
+ ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
34
+
35
+ You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/).
36
+
37
+ The `nmatrix` gem (`0.0.1`) can't find the `cblas.h` and `clapack.h` header files. Either [set the C_INCLUDE_PATH](https://github.com/SciRuby/nmatrix#synopsis):
38
+
39
+     export C_INCLUDE_PATH=/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/
40
+
41
+ Or [create links](https://github.com/SciRuby/nmatrix/issues/21) before installing the gem:
42
+
43
+     sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/cblas.h /usr/include/cblas.h
44
+     sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/clapack.h /usr/include/clapack.h
45
+
46
+ Version `0.0.2` of `nmatrix` [doesn't compile on Mac OS X Lion](https://github.com/SciRuby/nmatrix/issues/34).
47
+
48
+ ### Other Considerations
49
+
50
+ The [narray](http://narray.rubyforge.org/) and [nmatrix](http://sciruby.com/nmatrix/) gems have no method to calculate the magnitude of a vector. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
51
+
52
+ ## Extras
53
+
54
+ You can access more term frequency, document frequency, and normalization formulas with:
55
+
56
+     require 'tf-idf-similarity/extras/collection'
57
+     require 'tf-idf-similarity/extras/document'
58
+
59
+ The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
60
+
61
+ ## Reference
62
+
63
+ * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
64
+ * [E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.](http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf)
65
+
66
+ ## Bugs? Questions?
67
+
68
+ This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
69
+
70
+ Copyright (c) 2012 Open North Inc., released under the MIT license
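The README says the default tf*idf formula follows Lucene's conceptual scoring formula. As a rough illustration of how a single weight comes together, here is a minimal sketch that assumes the term frequency is the square root of the term count and the inverse document frequency is `1 + ln(N / (df + 1))`, as in the library code further down this diff. The helper names `tf` and `idf` and the sample numbers are hypothetical and are not part of the gem.

```ruby
# Hypothetical helpers mirroring the gem's default formulas.
def tf(count)
  # Square root of the raw term count.
  Math.sqrt(count)
end

def idf(document_total, document_frequency)
  # 1 + ln(N / (df + 1)); to_f forces floating-point division.
  1 + Math.log(document_total / (document_frequency.to_f + 1))
end

# Assumed sample: a term occurring 4 times in one document,
# and appearing in 1 of the 3 documents in the collection.
weight = tf(4) * idf(3, 1)
puts weight
```

Because of the `+ 1` in the denominator, a term found in all but one document gets an inverse document frequency of exactly 1, so its weight reduces to the term frequency alone.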
data/Rakefile ADDED
@@ -0,0 +1,16 @@
1
+ require 'bundler'
2
+ Bundler::GemHelper.install_tasks
3
+
4
+ require 'rspec/core/rake_task'
5
+ RSpec::Core::RakeTask.new(:spec)
6
+
7
+ task :default => :spec
8
+
9
+ begin
10
+ require 'yard'
11
+ YARD::Rake::YardocTask.new
12
+ rescue LoadError
13
+ task :yard do
14
+ abort 'YARD is not available. In order to run yard, you must: gem install yard'
15
+ end
16
+ end
data/USAGE ADDED
@@ -0,0 +1 @@
1
+ See README.md for full usage details.
data/lib/tf-idf-similarity.rb ADDED
@@ -0,0 +1,7 @@
1
+ $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
2
+
3
+ module TfIdfSimilarity
4
+ autoload :Collection, 'tf-idf-similarity/collection'
5
+ autoload :Document, 'tf-idf-similarity/document'
6
+ autoload :Token, 'tf-idf-similarity/token'
7
+ end
data/lib/tf-idf-similarity/collection.rb ADDED
@@ -0,0 +1,128 @@
1
+ begin
2
+ require 'gsl'
3
+ rescue LoadError
4
+ require 'matrix'
5
+ end
6
+
7
+ class TfIdfSimilarity::Collection
8
+ # The documents in the collection.
9
+ attr_reader :documents
10
+ # The number of times each term appears in all documents.
11
+ attr_reader :term_counts
12
+ # The number of documents each term appears in.
13
+ attr_reader :document_counts
14
+
15
+ def initialize
16
+ @documents = []
17
+ @term_counts = Hash.new 0
18
+ @document_counts = Hash.new 0
19
+ end
20
+
21
+ def <<(document)
22
+ document.term_counts.each do |term,count|
23
+ @term_counts[term] += count
24
+ @document_counts[term] += 1
25
+ end
26
+ @documents << document
27
+ end
28
+
29
+ # @return [Array<String>] the set of the collection's terms with no duplicates
30
+ def terms
31
+ term_counts.keys
32
+ end
33
+
34
+ # @see http://en.wikipedia.org/wiki/Vector_space_model
35
+ # @see http://en.wikipedia.org/wiki/Document-term_matrix
36
+ # @see http://en.wikipedia.org/wiki/Cosine_similarity
37
+ def similarity_matrix
38
+ if matrix?
39
+ idf = []
40
+ term_document_matrix = Matrix.build(terms.size, documents.size) do |i,j|
41
+ idf[i] ||= inverse_document_frequency terms[i]
42
+ documents[j].term_frequency(terms[i]) * idf[i]
43
+ end
44
+ else
45
+ term_document_matrix = if gsl?
46
+ GSL::Matrix.alloc terms.size, documents.size
47
+ elsif narray?
48
+ NMatrix.float documents.size, terms.size
49
+ elsif nmatrix?
50
+ # The nmatrix gem's sparse matrices are unfortunately buggy.
51
+ # @see https://github.com/SciRuby/nmatrix/issues/35
52
+ NMatrix.new([terms.size, documents.size], :float64)
53
+ end
54
+
55
+ terms.each_with_index do |term,i|
56
+ idf = inverse_document_frequency term
57
+ documents.each_with_index do |document,j|
58
+ tfidf = document.term_frequency(term) * idf
59
+ if gsl? || nmatrix?
60
+ term_document_matrix[i, j] = tfidf
61
+ # NArray puts the dimensions in a different order.
62
+ # @see http://narray.rubyforge.org/SPEC.en
63
+ elsif narray?
64
+ term_document_matrix[j, i] = tfidf
65
+ end
66
+ end
67
+ end
68
+ end
69
+
70
+ # Columns are normalized to unit vectors, so we can calculate the cosine
71
+ # similarity of all document vectors.
72
+ matrix = normalize term_document_matrix
73
+
74
+ if nmatrix?
75
+ matrix.transpose.dot matrix
76
+ else
77
+ matrix.transpose * matrix
78
+ end
79
+ end
80
+
81
+ # @param [String] term a term
82
+ # @return [Float] the term's inverse document frequency
83
+ #
84
+ # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
85
+ def inverse_document_frequency(term)
86
+ 1 + Math.log(documents.size / (document_counts[term].to_f + 1))
87
+ end
88
+ alias_method :idf, :inverse_document_frequency
89
+
90
+ # @param [Matrix] matrix a term-document matrix
91
+ # @return [Matrix] a matrix in which all document vectors are unit vectors
92
+ #
93
+ # @note Lucene normalizes document length differently.
94
+ def normalize(matrix)
95
+ if gsl?
96
+ matrix.each_col(&:normalize!)
97
+ elsif narray?
98
+ # @todo NArray doesn't have a method to normalize a vector.
99
+ # 0.upto(matrix.shape[0] - 1).each do |j|
100
+ # matrix[j, true] # Normalize this column somehow.
101
+ # end
102
+ matrix
103
+ elsif nmatrix?
104
+ # @todo NMatrix doesn't have a method to normalize a vector.
105
+ matrix
106
+ else
107
+ Matrix.columns matrix.column_vectors.map(&:normalize)
108
+ end
109
+ end
110
+
111
+ private
112
+
113
+ def gsl?
114
+ @gsl ||= Object.const_defined?(:GSL)
115
+ end
116
+
117
+ def narray?
118
+ @narray ||= Object.const_defined?(:NArray) && !gsl?
119
+ end
120
+
121
+ def nmatrix?
122
+ @nmatrix ||= Object.const_defined?(:NMatrix) && !narray?
123
+ end
124
+
125
+ def matrix?
126
+ @matrix ||= Object.const_defined?(:Matrix)
127
+ end
128
+ end
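The last step of `similarity_matrix` relies on a property of unit vectors: once every document column has length 1, the product of the transposed matrix with itself holds the cosine similarity of every pair of documents. The following is a minimal sketch of that step using only Ruby's standard `matrix` library. The helper name `cosine_similarities` and the sample weights are assumptions made for illustration, not part of the gem.

```ruby
require 'matrix'

# Hypothetical helper: takes a term-document matrix (rows are terms,
# columns are documents) and returns the document-by-document cosine
# similarities.
def cosine_similarities(term_document_matrix)
  # Scale each document column to unit length.
  unit_columns = term_document_matrix.column_vectors.map(&:normalize)
  normalized = Matrix.columns(unit_columns.map(&:to_a))
  # Entry [i, j] is the dot product of unit vectors i and j.
  normalized.transpose * normalized
end

# Assumed sample weights: documents 0 and 1 point in the same direction,
# document 2 shares no terms with them.
weights = Matrix[[1.0, 2.0, 0.0],
                 [1.0, 2.0, 0.0],
                 [0.0, 0.0, 3.0]]
similarities = cosine_similarities(weights)
p similarities[0, 1]
p similarities[0, 2]
```

Note that `Vector#normalize` raises `Vector::ZeroVectorError` for an all-zero column, so a document with no valid terms would need separate handling in a sketch like this one.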
data/lib/tf-idf-similarity/document.rb ADDED
@@ -0,0 +1,62 @@
1
+ # coding: utf-8
2
+ require 'unicode_utils'
3
+
4
+ class TfIdfSimilarity::Document
5
+ # An optional document identifier.
6
+ attr_reader :id
7
+ # The document's text.
8
+ attr_reader :text
9
+ # The number of times each term appears in the document.
10
+ attr_reader :term_counts
11
+
12
+ # @param [String] text the document's text
13
+ # @param [Hash] opts optional arguments
14
+ # @option opts [String] :id a string to identify the document
15
+ def initialize(text, opts = {})
16
+ @text = text
17
+ @id = opts[:id] || object_id
18
+ @term_counts = Hash.new 0
19
+ process
20
+ end
21
+
22
+ # @return [Array<String>] the set of the document's terms with no duplicates
23
+ def terms
24
+ term_counts.keys
25
+ end
26
+
27
+ # @param [String] term a term
28
+ # @return [Float] the square root of the term count
29
+ #
30
+ # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
31
+ def term_frequency(term)
32
+ Math.sqrt term_counts[term]
33
+ end
34
+ alias_method :tf, :term_frequency
35
+
36
+ private
37
+
38
+ # Tokenizes the text and counts terms.
39
+ def process
40
+ tokenize(text).each do |word|
41
+ token = TfIdfSimilarity::Token.new word
42
+ if token.valid?
43
+ @term_counts[token.lowercase_filter.classic_filter.to_s] += 1
44
+ end
45
+ end
46
+ end
47
+
48
+ # Tokenizes a text, respecting the word boundary rules from Unicode’s Default
49
+ # Word Boundary Specification.
50
+ #
51
+ # @param [String] text a text
52
+ # @return [Enumerator] a token enumerator
53
+ #
54
+ # @note We should evaluate the tokenizers by {http://www.sciencemag.org/content/suppl/2010/12/16/science.1199644.DC1/Michel.SOM.revision.2.pdf Google}
55
+ # or {http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory Solr}.
56
+ #
57
+ # @see http://unicode.org/reports/tr29/#Default_Word_Boundaries
58
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
59
+ def tokenize(text)
60
+ UnicodeUtils.each_word text
61
+ end
62
+ end
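To make the flow of `Document#process` easier to follow, here is a minimal sketch of the same idea without the `unicode_utils` dependency: split the text into words, lowercase them, and count them. It assumes a plain regular expression is an acceptable stand-in for `UnicodeUtils.each_word`, so word boundaries are only approximate, and it skips the `Token` filters. The helper names `count_terms` and `term_frequency` are hypothetical.

```ruby
# Hypothetical stand-in for Document#process. A plain regular expression
# replaces UnicodeUtils.each_word, so this only approximates the Unicode
# word boundary rules.
def count_terms(text)
  counts = Hash.new(0)
  text.scan(/[[:alpha:]]+/) { |word| counts[word.downcase] += 1 }
  counts
end

# Square root of the term count, as in Document#term_frequency.
def term_frequency(counts, term)
  Math.sqrt(counts[term])
end

counts = count_terms("The cat sat. The cat ran.")
p counts
p term_frequency(counts, "cat")
```

Because the hash defaults to 0, asking for a term that never occurs yields a term frequency of 0.0 rather than an error.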
data/lib/tf-idf-similarity/extras/collection.rb ADDED
@@ -0,0 +1,85 @@
1
+ require 'tf-idf-similarity/collection'
2
+
3
+ class TfIdfSimilarity::Collection
4
+ # SMART n, Salton x, Chisholm NONE
5
+ def no_collection_frequency(term)
6
+ 1.0
7
+ end
8
+
9
+ # SMART t, Salton f, Chisholm IDFB
10
+ def plain_inverse_document_frequency(term)
11
+ count = document_counts[term].to_f
12
+ Math.log documents.size / count
13
+ end
14
+ alias_method :plain_idf, :plain_inverse_document_frequency
15
+
16
+ # SMART p, Salton p, Chisholm IDFP
17
+ def probabilistic_inverse_document_frequency(term)
18
+ count = document_counts[term].to_f
19
+ Math.log((documents.size - count) / count)
20
+ end
21
+ alias_method :probabilistic_idf, :probabilistic_inverse_document_frequency
22
+
23
+ # Chisholm IGFF
24
+ def global_frequency_inverse_document_frequency(term)
25
+ term_counts[term] / document_counts[term].to_f
26
+ end
27
+ alias_method :gfidf, :global_frequency_inverse_document_frequency
28
+
29
+ # Chisholm IGFL
30
+ def log_global_frequency_inverse_document_frequency(term)
31
+ Math.log(global_frequency_inverse_document_frequency(term) + 1)
32
+ end
33
+ alias_method :log_gfidf, :log_global_frequency_inverse_document_frequency
34
+
35
+ # Chisholm IGFI
36
+ def incremented_global_frequency_inverse_document_frequency(term)
37
+ global_frequency_inverse_document_frequency(term) + 1
38
+ end
39
+ alias_method :incremented_gfidf, :incremented_global_frequency_inverse_document_frequency
40
+
41
+ # Chisholm IGFS
42
+ def square_root_global_frequency_inverse_document_frequency(term)
43
+ Math.sqrt(global_frequency_inverse_document_frequency(term) - 0.9)
44
+ end
45
+ alias_method :square_root_gfidf, :square_root_global_frequency_inverse_document_frequency
46
+
47
+ # Chisholm ENPY
48
+ def entropy(term)
49
+ denominator = term_counts[term].to_f
50
+ logN = Math.log documents.size
51
+ 1 + documents.reduce(0) do |sum,document|
52
+ quotient = document.term_counts[term] / denominator
53
+ quotient > 0 ? sum + quotient * Math.log(quotient) / logN : sum
54
+ end
55
+ end
56
+
57
+
58
+
59
+ # @param [Matrix] matrix a term-document matrix
60
+ # @return [Matrix] the same matrix
61
+ #
62
+ # SMART n, Salton x, Chisholm NONE
63
+ def no_normalization(matrix)
64
+ matrix
65
+ end
66
+
67
+ # @param [Matrix] matrix a term-document matrix
68
+ # @return [Matrix] a matrix in which all document vectors are unit vectors
69
+ #
70
+ # SMART c, Salton c, Chisholm COSN
71
+ def cosine_normalization(matrix)
72
+ Matrix.columns(matrix.column_vectors.map do |column|
73
+ column.normalize
74
+ end)
75
+ end
76
+
77
+ # @param [Matrix] matrix a term-document matrix
78
+ # @return [Matrix] a matrix
79
+ #
80
+ # SMART u, Chisholm PUQN
81
+ def pivoted_unique_normalization(matrix)
82
+ # @todo
83
+ # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
84
+ end
85
+ end
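The entropy weight (Chisholm ENPY) is the least obvious of the collection formulas above, so a small worked sketch may help. It assumes the weight is `1 + sum(p * ln(p)) / ln(N)`, where `p` is the share of a term's occurrences that falls in each document and `N` is the number of documents, and it treats `0 * ln(0)` as 0 for documents that lack the term. The helper name `entropy_weight` and the sample counts are hypothetical, and the sketch needs at least two documents because `ln(1)` is 0.

```ruby
# Hypothetical helper: counts holds one term's count in each document.
def entropy_weight(counts)
  total = counts.reduce(:+).to_f
  log_n = Math.log(counts.size)
  spread = counts.reduce(0.0) do |sum, count|
    quotient = count / total
    # Documents without the term are skipped, taking 0 * log(0) as 0.
    quotient > 0 ? sum + quotient * Math.log(quotient) / log_n : sum
  end
  1 + spread
end

p entropy_weight([4, 0, 0, 0]) # all occurrences in a single document
p entropy_weight([1, 1, 1, 1]) # occurrences spread evenly
```

A term concentrated in one document keeps the full weight of 1, while a term spread evenly across every document is pushed toward 0, which is the behavior one would want from a measure of how discriminating a term is.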
data/lib/tf-idf-similarity/extras/document.rb ADDED
@@ -0,0 +1,118 @@
1
+ require 'tf-idf-similarity/document'
2
+
3
+ class TfIdfSimilarity::Document
4
+ # @return [Float] the maximum term count of any term in the document
5
+ def maximum_term_count
6
+ @maximum_term_count ||= @term_counts.values.max.to_f
7
+ end
8
+
9
+ # @return [Float] the average term count of all terms in the document
10
+ def average_term_count
11
+ @average_term_count ||= @term_counts.values.reduce(:+) / @term_counts.size.to_f
12
+ end
13
+
14
+
15
+
16
+ # Returns the term count.
17
+ #
18
+ # SMART n, Salton t, Chisholm FREQ
19
+ def plain_term_frequency(term)
20
+ term_counts[term]
21
+ end
22
+ alias_method :plain_tf, :plain_term_frequency
23
+
24
+ # Returns 1 if the term is present, 0 otherwise.
25
+ #
26
+ # SMART b, Salton b, Chisholm BNRY
27
+ def binary_term_frequency(term)
28
+ count = term_counts[term]
29
+ if count > 0
30
+ 1
31
+ else
32
+ 0
33
+ end
34
+ end
35
+ alias_method :binary_tf, :binary_term_frequency
36
+
37
+ # Normalizes the term count by the maximum term count.
38
+ #
39
+ # @see http://en.wikipedia.org/wiki/Tf*idf
40
+ def normalized_term_frequency(term)
41
+ term_counts[term] / maximum_term_count
42
+ end
43
+ alias_method :normalized_tf, :normalized_term_frequency
44
+
45
+ # Further normalizes the normalized term frequency to lie between 0.5 and 1.
46
+ #
47
+ # SMART a, Salton n, Chisholm ATF1
48
+ def augmented_normalized_term_frequency(term)
49
+ 0.5 + 0.5 * normalized_term_frequency(term)
50
+ end
51
+ alias_method :augmented_normalized_tf, :augmented_normalized_term_frequency
52
+
53
+ # Chisholm ATFA
54
+ def augmented_average_term_frequency(term)
55
+ count = term_counts[term]
56
+ if count > 0
57
+ 0.9 + 0.1 * count / average_term_count
58
+ else
59
+ 0
60
+ end
61
+ end
62
+ alias_method :augmented_average_tf, :augmented_average_term_frequency
63
+
64
+ # Chisholm ATFC
65
+ def changed_coefficient_augmented_normalized_term_frequency(term)
66
+ count = term_counts[term]
67
+ if count > 0
68
+ 0.2 + 0.8 * count / maximum_term_count
69
+ else
70
+ 0
71
+ end
72
+ end
73
+ alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
74
+
75
+ # SMART l, Chisholm LOGA
76
+ def log_term_frequency(term)
77
+ count = term_counts[term]
78
+ if count > 0
79
+ 1 + Math.log(count)
80
+ else
81
+ 0
82
+ end
83
+ end
84
+ alias_method :log_tf, :log_term_frequency
85
+
86
+ # SMART L, Chisholm LOGN
87
+ def normalized_log_term_frequency(term)
88
+ count = term_counts[term]
89
+ if count > 0
90
+ (1 + Math.log(count)) / (1 + Math.log(average_term_count))
91
+ else
92
+ 0
93
+ end
94
+ end
95
+ alias_method :normalized_log_tf, :normalized_log_term_frequency
96
+
97
+ # Chisholm LOGG
98
+ def augmented_log_term_frequency(term)
99
+ count = term_counts[term]
100
+ if count > 0
101
+ 0.2 + 0.8 * Math.log(count + 1)
102
+ else
103
+ 0
104
+ end
105
+ end
106
+ alias_method :augmented_log_tf, :augmented_log_term_frequency
107
+
108
+ # Chisholm SQRT
109
+ def square_root_term_frequency(term)
110
+ count = term_counts[term]
111
+ if count > 0
112
+ Math.sqrt(count - 0.5) + 1
113
+ else
114
+ 0
115
+ end
116
+ end
117
+ alias_method :square_root_tf, :square_root_term_frequency
118
+ end
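The term frequency variants above differ mainly in how strongly they reward repeated occurrences. As a rough comparison, this sketch reimplements three of them as standalone functions over raw counts. The function signatures and the sample counts are assumptions for illustration; in the gem these are instance methods on `Document` that take a term.

```ruby
# Hypothetical standalone versions of three variants, taking raw counts.
def binary_tf(count)
  # 1 if the term is present, 0 otherwise.
  count > 0 ? 1 : 0
end

def augmented_normalized_tf(count, maximum)
  # Maps the count into the range 0.5..1 relative to the largest count.
  0.5 + 0.5 * count / maximum
end

def log_tf(count)
  # Dampens large counts logarithmically.
  count > 0 ? 1 + Math.log(count) : 0
end

# Assumed sample: term counts for a single document.
sample = { "cat" => 4, "sat" => 1 }
maximum = sample.values.max.to_f

sample.each do |term, count|
  puts "#{term}: binary=#{binary_tf(count)} " \
       "augmented=#{augmented_normalized_tf(count, maximum)} " \
       "log=#{log_tf(count)}"
end
```

The binary variant ignores repetition entirely, the augmented variant compresses it into a narrow band, and the logarithmic variant lets it grow slowly.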
data/lib/tf-idf-similarity/token.rb ADDED
@@ -0,0 +1,42 @@
1
+ # coding: utf-8
2
+
3
+ # @note We can add more filters from Solr and stem using Porter's Snowball.
4
+ #
5
+ # @see https://github.com/aurelian/ruby-stemmer
6
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
7
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
8
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
9
+ class TfIdfSimilarity::Token < String
10
+ # Returns a falsy value if all its characters are numbers, punctuation,
11
+ # whitespace or control characters.
12
+ #
13
+ # @note Some implementations ignore one- and two-letter words.
14
+ #
15
+ # @return [Boolean] whether the string is a token
16
+ def valid?
17
+ !self[%r{
18
+ \A
19
+ (
20
+ \d | # number
21
+ \p{Cntrl} | # control character
22
+ \p{Punct} | # punctuation
23
+ [[:space:]] # whitespace
24
+ )+
25
+ \z
26
+ }x]
27
+ end
28
+
29
+ # @return [Token] a lowercase string
30
+ #
31
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseFilterFactory
32
+ def lowercase_filter
33
+ self.class.new UnicodeUtils.downcase(self, :fr)
34
+ end
35
+
36
+ # @return [Token] a string with no English possessive or periods in acronyms
37
+ #
38
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ClassicFilterFactory
39
+ def classic_filter
40
+ self.class.new self.gsub('.', '').chomp("'s")
41
+ end
42
+ end
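The three `Token` methods form a small pipeline: reject tokens made only of digits, punctuation, whitespace or control characters, then lowercase, then strip periods and a trailing English possessive. The sketch below shows that pipeline on plain strings. It assumes `String#downcase` is an acceptable substitute for `UnicodeUtils.downcase` and uses POSIX bracket classes in place of the `\p{...}` properties. The helper names `valid_token?` and `normalize_token` are hypothetical.

```ruby
# Hypothetical stand-in for Token#valid?, operating on a plain string.
def valid_token?(word)
  word !~ /\A(\d|[[:cntrl:]]|[[:punct:]]|[[:space:]])+\z/
end

# Hypothetical stand-in for lowercase_filter followed by classic_filter.
# String#downcase replaces UnicodeUtils.downcase here.
def normalize_token(word)
  word.downcase.gsub('.', '').chomp("'s")
end

words = ["U.S.A.", "Dog's", "2012", "..."]
p words.select { |word| valid_token?(word) }.map { |word| normalize_token(word) }
```

The order matters: lowercasing first means a possessive written as `'S` is also removed by the `chomp("'s")` call.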
data/lib/tf-idf-similarity/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module TfIdfSimilarity
2
+ VERSION = "0.0.1"
3
+ end
data/td-idf-similarity.gemspec ADDED
@@ -0,0 +1,22 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+ require "tf-idf-similarity/version"
4
+
5
+ Gem::Specification.new do |s|
6
+ s.name = "tf-idf-similarity"
7
+ s.version = TfIdfSimilarity::VERSION
8
+ s.platform = Gem::Platform::RUBY
9
+ s.authors = ["Open North"]
10
+ s.email = ["info@opennorth.ca"]
11
+ s.homepage = "http://github.com/opennorth/tf-idf-similarity"
12
+ s.summary = %q{Calculates the similarity between texts using tf*idf}
13
+
14
+ s.files = `git ls-files`.split("\n")
15
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
16
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
17
+ s.require_paths = ["lib"]
18
+
19
+ s.add_runtime_dependency('unicode_utils')
20
+ s.add_development_dependency('rspec', '~> 2.10')
21
+ s.add_development_dependency('rake')
22
+ end
metadata ADDED
@@ -0,0 +1,114 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: tf-idf-similarity
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Open North
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-09-10 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: unicode_utils
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
30
+ - !ruby/object:Gem::Dependency
31
+ name: rspec
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ~>
36
+ - !ruby/object:Gem::Version
37
+ version: '2.10'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ~>
44
+ - !ruby/object:Gem::Version
45
+ version: '2.10'
46
+ - !ruby/object:Gem::Dependency
47
+ name: rake
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ description:
63
+ email:
64
+ - info@opennorth.ca
65
+ executables: []
66
+ extensions: []
67
+ extra_rdoc_files: []
68
+ files:
69
+ - .gitignore
70
+ - .travis.yml
71
+ - Gemfile
72
+ - LICENSE
73
+ - README.md
74
+ - Rakefile
75
+ - USAGE
76
+ - lib/tf-idf-similarity.rb
77
+ - lib/tf-idf-similarity/collection.rb
78
+ - lib/tf-idf-similarity/document.rb
79
+ - lib/tf-idf-similarity/extras/collection.rb
80
+ - lib/tf-idf-similarity/extras/document.rb
81
+ - lib/tf-idf-similarity/token.rb
82
+ - lib/tf-idf-similarity/version.rb
83
+ - td-idf-similarity.gemspec
84
+ homepage: http://github.com/opennorth/tf-idf-similarity
85
+ licenses: []
86
+ post_install_message:
87
+ rdoc_options: []
88
+ require_paths:
89
+ - lib
90
+ required_ruby_version: !ruby/object:Gem::Requirement
91
+ none: false
92
+ requirements:
93
+ - - ! '>='
94
+ - !ruby/object:Gem::Version
95
+ version: '0'
96
+ segments:
97
+ - 0
98
+ hash: 697007281194730821
99
+ required_rubygems_version: !ruby/object:Gem::Requirement
100
+ none: false
101
+ requirements:
102
+ - - ! '>='
103
+ - !ruby/object:Gem::Version
104
+ version: '0'
105
+ segments:
106
+ - 0
107
+ hash: 697007281194730821
108
+ requirements: []
109
+ rubyforge_project:
110
+ rubygems_version: 1.8.24
111
+ signing_key:
112
+ specification_version: 3
113
+ summary: Calculates the similarity between texts using tf*idf
114
+ test_files: []