tf-idf-similarity 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,6 @@
1
+ *.gem
2
+ .bundle
3
+ .yardoc
4
+ Gemfile.lock
5
+ doc/*
6
+ pkg/*
data/.travis.yml ADDED
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in this gem's .gemspec file
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2012 Open North Inc.
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,70 @@
1
+ # Ruby Vector Space Model (VSM) with tf*idf weights
2
+
3
+ Calculates the similarity between texts using a [bag-of-words](http://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](http://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf*idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (or similar), which also implements other information retrieval functions like [BM 25](http://en.wikipedia.org/wiki/Okapi_BM25).
4
+
5
+ ## Usage
6
+
7
+ require 'tf-idf-similarity'
8
+
9
+ corpus = TfIdfSimilarity::Collection.new
10
+ corpus << TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
11
+ corpus << TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
12
+ corpus << TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")
13
+
14
+ p corpus.similarity_matrix
15
+
16
+ This gem will use the [gsl gem](http://rb-gsl.rubyforge.org/) if available, for faster matrix multiplication.
17
+
18
+ ## Optimizations
19
+
20
+ ### [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl/)
21
+
22
+ The latest `gsl` gem (`1.14.7`) is [not compatible](http://bretthard.in/2012/03/getting-related_posts-lsi-and-gsl-to-work-in-jekyll/) with the `gsl` package (`1.15`) in Homebrew:
23
+
24
+ ```sh
25
+ cd /usr/local
26
+ git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
27
+ brew install gsl
28
+ git checkout master
29
+ git branch -d gsl-1.14
30
+ gem install gsl
31
+ ```
32
+
33
+ ### [Automatically Tuned Linear Algebra Software (ATLAS)](http://math-atlas.sourceforge.net/)
34
+
35
+ You may know this software through [Linear Algebra PACKage (LAPACK)](http://www.netlib.org/lapack/) or [Basic Linear Algebra Subprograms (BLAS)](http://www.netlib.org/blas/).
36
+
37
+ The `nmatrix` gem (`0.0.1`) can't find the `cblas.h` and `clapack.h` header files. Either [set the C_INCLUDE_PATH](https://github.com/SciRuby/nmatrix#synopsis):
38
+
39
+ export C_INCLUDE_PATH=/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/
40
+
41
+ Or [create links](https://github.com/SciRuby/nmatrix/issues/21) before installing the gem:
42
+
43
+ sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/cblas.h /usr/include/cblas.h
44
+ sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/clapack.h /usr/include/clapack.h
45
+
46
+ Version `0.0.2` [doesn't compile on Mac OS X Lion](https://github.com/SciRuby/nmatrix/issues/34).
47
+
48
+ ### Other Considerations
49
+
50
+ The [narray](http://narray.rubyforge.org/) and [nmatrix](http://sciruby.com/nmatrix/) gems have no method to calculate the magnitude of a vector. [Ruby-LAPACK](http://ruby.gfd-dennou.org/products/ruby-lapack/) is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. [Linalg](https://github.com/quix/linalg) and [RNum](http://rnum.rubyforge.org/) are old and not available as gems.
51
+
52
+ ## Extras
53
+
54
+ You can access more term frequency, document frequency, and normalization formulas with:
55
+
56
+ require 'tf-idf-similarity/extras/collection'
57
+ require 'tf-idf-similarity/extras/document'
58
+
59
+ The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
60
+
61
+ ## Reference
62
+
63
+ * [G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.](http://www.cs.odu.edu/~jbollen/IR04/readings/article1-29-03.pdf)
64
+ * [E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.](http://www.sandia.gov/~tgkolda/pubs/bibtgkfiles/ornl-tm-13756.pdf)
65
+
66
+ ## Bugs? Questions?
67
+
68
+ This gem's main repository is on GitHub: [http://github.com/opennorth/tf-idf-similarity](http://github.com/opennorth/tf-idf-similarity), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
69
+
70
+ Copyright (c) 2012 Open North Inc., released under the MIT license
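Review note on the README usage: the sketch below shows, in plain Ruby with no gems, roughly what one entry of `similarity_matrix` represents. It assumes the default weights that appear in this diff (tf = sqrt(count), idf = 1 + ln(N / (df + 1))). The whitespace tokenizer and the helper names `weights` and `cosine` are hypothetical and are not part of the gem.

```ruby
# Hypothetical helpers; not part of tf-idf-similarity.
docs = [
  "lorem ipsum dolor sit amet",
  "pellentesque sed ipsum dui",
  "nam scelerisque dui sed leo"
].map { |text| text.split }

# Document frequency: the number of documents each term appears in.
df = Hash.new(0)
docs.each { |tokens| tokens.uniq.each { |term| df[term] += 1 } }

# tf*idf weights for one document, as a term => weight hash.
def weights(tokens, df, n)
  counts = Hash.new(0)
  tokens.each { |term| counts[term] += 1 }
  counts.each_with_object({}) do |(term, count), h|
    h[term] = Math.sqrt(count) * (1 + Math.log(n / (df[term].to_f + 1)))
  end
end

# Cosine similarity of two sparse vectors stored as hashes.
def cosine(a, b)
  dot = a.sum { |term, w| w * b.fetch(term, 0.0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  dot / (norm.(a) * norm.(b))
end

vectors = docs.map { |tokens| weights(tokens, df, docs.size) }
similarity_01 = cosine(vectors[0], vectors[1]) # the documents share "ipsum"
similarity_02 = cosine(vectors[0], vectors[2]) # the documents share no terms
```

Documents with no terms in common score zero, and a document compared with itself scores one (up to floating-point rounding).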
data/Rakefile ADDED
@@ -0,0 +1,16 @@
1
+ require 'bundler'
2
+ Bundler::GemHelper.install_tasks
3
+
4
+ require 'rspec/core/rake_task'
5
+ RSpec::Core::RakeTask.new(:spec)
6
+
7
+ task :default => :spec
8
+
9
+ begin
10
+ require 'yard'
11
+ YARD::Rake::YardocTask.new
12
+ rescue LoadError
13
+ task :yard do
14
+ abort 'YARD is not available. In order to run yard, you must: gem install yard'
15
+ end
16
+ end
data/USAGE ADDED
@@ -0,0 +1 @@
1
+ See README.md for full usage details.
data/lib/tf-idf-similarity.rb ADDED
@@ -0,0 +1,7 @@
1
+ $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
2
+
3
+ module TfIdfSimilarity
4
+ autoload :Collection, 'tf-idf-similarity/collection'
5
+ autoload :Document, 'tf-idf-similarity/document'
6
+ autoload :Token, 'tf-idf-similarity/token'
7
+ end
data/lib/tf-idf-similarity/collection.rb ADDED
@@ -0,0 +1,128 @@
1
+ begin
2
+ require 'gsl'
3
+ rescue LoadError
4
+ require 'matrix'
5
+ end
6
+
7
+ class TfIdfSimilarity::Collection
8
+ # The documents in the collection.
9
+ attr_reader :documents
10
+ # The number of times each term appears in all documents.
11
+ attr_reader :term_counts
12
+ # The number of documents each term appears in.
13
+ attr_reader :document_counts
14
+
15
+ def initialize
16
+ @documents = []
17
+ @term_counts = Hash.new 0
18
+ @document_counts = Hash.new 0
19
+ end
20
+
21
+ def <<(document)
22
+ document.term_counts.each do |term,count|
23
+ @term_counts[term] += count
24
+ @document_counts[term] += 1
25
+ end
26
+ @documents << document
27
+ end
28
+
29
+ # @return [Array<String>] the set of the collection's terms with no duplicates
30
+ def terms
31
+ term_counts.keys
32
+ end
33
+
34
+ # @see http://en.wikipedia.org/wiki/Vector_space_model
35
+ # @see http://en.wikipedia.org/wiki/Document-term_matrix
36
+ # @see http://en.wikipedia.org/wiki/Cosine_similarity
37
+ def similarity_matrix
38
+ if matrix?
39
+ idf = []
40
+ term_document_matrix = Matrix.build(terms.size, documents.size) do |i,j|
41
+ idf[i] ||= inverse_document_frequency terms[i]
42
+ documents[j].term_frequency(terms[i]) * idf[i]
43
+ end
44
+ else
45
+ term_document_matrix = if gsl?
46
+ GSL::Matrix.alloc terms.size, documents.size
47
+ elsif narray?
48
+ NMatrix.float documents.size, terms.size
49
+ elsif nmatrix?
50
+ # The nmatrix gem's sparse matrices are unfortunately buggy.
51
+ # @see https://github.com/SciRuby/nmatrix/issues/35
52
+ NMatrix.new([terms.size, documents.size], :float64)
53
+ end
54
+
55
+ terms.each_with_index do |term,i|
56
+ idf = inverse_document_frequency term
57
+ documents.each_with_index do |document,j|
58
+ tfidf = document.term_frequency(term) * idf
59
+ if gsl? || nmatrix?
60
+ term_document_matrix[i, j] = tfidf
61
+ # NArray puts the dimensions in a different order.
62
+ # @see http://narray.rubyforge.org/SPEC.en
63
+ elsif narray?
64
+ term_document_matrix[j, i] = tfidf
65
+ end
66
+ end
67
+ end
68
+ end
69
+
70
+ # Columns are normalized to unit vectors, so we can calculate the cosine
71
+ # similarity of all document vectors.
72
+ matrix = normalize term_document_matrix
73
+
74
+ if nmatrix?
75
+ matrix.transpose.dot matrix
76
+ else
77
+ matrix.transpose * matrix
78
+ end
79
+ end
80
+
81
+ # @param [String] term a term
82
+ # @return [Float] the term's inverse document frequency
83
+ #
84
+ # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
85
+ def inverse_document_frequency(term)
86
+ 1 + Math.log(documents.size / (document_counts[term].to_f + 1))
87
+ end
88
+ alias_method :idf, :inverse_document_frequency
89
+
90
+   # @param [Matrix] matrix a term-document matrix
91
+ # @return [Matrix] a matrix in which all document vectors are unit vectors
92
+ #
93
+ # @note Lucene normalizes document length differently.
94
+ def normalize(matrix)
95
+ if gsl?
96
+ matrix.each_col(&:normalize!)
97
+ elsif narray?
98
+ # @todo NArray doesn't have a method to normalize a vector.
99
+ # 0.upto(matrix.shape[0] - 1).each do |j|
100
+ # matrix[j, true] # Normalize this column somehow.
101
+ # end
102
+ matrix
103
+ elsif nmatrix?
104
+ # @todo NMatrix doesn't have a method to normalize a vector.
105
+ matrix
106
+ else
107
+ Matrix.columns matrix.column_vectors.map(&:normalize)
108
+ end
109
+ end
110
+
111
+ private
112
+
113
+ def gsl?
114
+ @gsl ||= Object.const_defined?(:GSL)
115
+ end
116
+
117
+ def narray?
118
+ @narray ||= Object.const_defined?(:NArray) && !gsl?
119
+ end
120
+
121
+ def nmatrix?
122
+ @nmatrix ||= Object.const_defined?(:NMatrix) && !narray?
123
+ end
124
+
125
+ def matrix?
126
+ @matrix ||= Object.const_defined?(:Matrix)
127
+ end
128
+ end
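Review note on `normalize` and `similarity_matrix`: the normalize-then-multiply step can be sketched with the standard library's `Matrix` alone. This is an illustration under assumptions, not the gem's full code path: the weights are made-up numbers rather than real tf*idf values, and only the plain `Matrix` branch is shown.

```ruby
require 'matrix'

# Rows are terms, columns are documents; the weights are made-up values.
term_document = Matrix[
  [1.0, 0.0, 2.0],
  [0.0, 3.0, 2.0],
  [2.0, 0.0, 0.0]
]

# Scale every column (document vector) to unit length.
# Vector#normalize raises on a zero vector, so no column may be all zeros.
unit_columns = Matrix.columns(term_document.column_vectors.map(&:normalize))

# Entry (i, j) of the product is the dot product of unit vectors i and j,
# which is the cosine similarity of documents i and j.
similarity = unit_columns.transpose * unit_columns
```

Because the columns are unit vectors, the diagonal of `similarity` is one and the matrix is symmetric. The NArray and NMatrix branches in the diff return the matrix without normalizing it, so for those backends the product would hold raw dot products rather than cosines.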
data/lib/tf-idf-similarity/document.rb ADDED
@@ -0,0 +1,62 @@
1
+ # coding: utf-8
2
+ require 'unicode_utils'
3
+
4
+ class TfIdfSimilarity::Document
5
+ # An optional document identifier.
6
+ attr_reader :id
7
+ # The document's text.
8
+ attr_reader :text
9
+ # The number of times each term appears in the document.
10
+ attr_reader :term_counts
11
+
12
+ # @param [String] text the document's text
13
+ # @param [Hash] opts optional arguments
14
+ # @option opts [String] :id a string to identify the document
15
+ def initialize(text, opts = {})
16
+ @text = text
17
+ @id = opts[:id] || object_id
18
+ @term_counts = Hash.new 0
19
+ process
20
+ end
21
+
22
+ # @return [Array<String>] the set of the document's terms with no duplicates
23
+ def terms
24
+ term_counts.keys
25
+ end
26
+
27
+ # @param [String] term a term
28
+ # @return [Float] the square root of the term count
29
+ #
30
+ # @see http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
31
+ def term_frequency(term)
32
+ Math.sqrt term_counts[term]
33
+ end
34
+ alias_method :tf, :term_frequency
35
+
36
+ private
37
+
38
+   # Tokenizes the text and counts terms.
39
+ def process
40
+ tokenize(text).each do |word|
41
+ token = TfIdfSimilarity::Token.new word
42
+ if token.valid?
43
+ @term_counts[token.lowercase_filter.classic_filter.to_s] += 1
44
+ end
45
+ end
46
+ end
47
+
48
+ # Tokenizes a text, respecting the word boundary rules from Unicode’s Default
49
+ # Word Boundary Specification.
50
+ #
51
+ # @param [String] text a text
52
+ # @return [Enumerator] a token enumerator
53
+ #
54
+ # @note We should evaluate the tokenizers by {http://www.sciencemag.org/content/suppl/2010/12/16/science.1199644.DC1/Michel.SOM.revision.2.pdf Google}
55
+ # or {http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory Solr}.
56
+ #
57
+ # @see http://unicode.org/reports/tr29/#Default_Word_Boundaries
58
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
59
+ def tokenize(text)
60
+ UnicodeUtils.each_word text
61
+ end
62
+ end
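Review note on `process` and `term_frequency`: the counting step can be sketched without the `unicode_utils` dependency. The regex tokenizer and the helper name `term_counts` below are assumptions for illustration only; the gem itself tokenizes with `UnicodeUtils.each_word` and filters tokens through `TfIdfSimilarity::Token`.

```ruby
# Hypothetical stand-in for Document#process: a regex tokenizer is assumed
# here instead of UnicodeUtils.each_word.
def term_counts(text)
  counts = Hash.new(0)
  text.scan(/\p{L}+/) { |word| counts[word.downcase] += 1 }
  counts
end

counts = term_counts("The cat saw the other cat; the end.")

# Square-root term frequency, as in Document#term_frequency.
tf = counts.each_with_object({}) do |(term, count), h|
  h[term] = Math.sqrt(count)
end
```

Taking the square root dampens repeated terms: a term that occurs three times weighs about 1.73 times as much as a term that occurs once, not three times as much.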
data/lib/tf-idf-similarity/extras/collection.rb ADDED
@@ -0,0 +1,85 @@
1
+ require 'tf-idf-similarity/collection'
2
+
3
+ class TfIdfSimilarity::Collection
4
+ # SMART n, Salton x, Chisholm NONE
5
+ def no_collection_frequency(term)
6
+ 1.0
7
+ end
8
+
9
+ # SMART t, Salton f, Chisholm IDFB
10
+ def plain_inverse_document_frequency(term)
11
+ count = document_counts[term].to_f
12
+ Math.log documents.size / count
13
+ end
14
+ alias_method :plain_idf, :plain_inverse_document_frequency
15
+
16
+ # SMART p, Salton p, Chisholm IDFP
17
+ def probabilistic_inverse_document_frequency(term)
18
+ count = document_counts[term].to_f
19
+     Math.log((documents.size - count) / count)
20
+ end
21
+ alias_method :probabilistic_idf, :probabilistic_inverse_document_frequency
22
+
23
+ # Chisholm IGFF
24
+ def global_frequency_inverse_document_frequency(term)
25
+ term_counts[term] / document_counts[term].to_f
26
+ end
27
+ alias_method :gfidf, :global_frequency_inverse_document_frequency
28
+
29
+ # Chisholm IGFL
30
+ def log_global_frequency_inverse_document_frequency(term)
31
+ Math.log global_frequency_inverse_document_frequency(term) + 1
32
+ end
33
+ alias_method :log_gfidf, :log_global_frequency_inverse_document_frequency
34
+
35
+ # Chisholm IGFI
36
+ def incremented_global_frequency_inverse_document_frequency(term)
37
+ global_frequency_inverse_document_frequency(term) + 1
38
+ end
39
+ alias_method :incremented_gfidf, :incremented_global_frequency_inverse_document_frequency
40
+
41
+ # Chisholm IGFS
42
+ def square_root_global_frequency_inverse_document_frequency(term)
43
+ Math.sqrt global_frequency_inverse_document_frequency(term) - 0.9
44
+ end
45
+ alias_method :square_root_gfidf, :square_root_global_frequency_inverse_document_frequency
46
+
47
+ # Chisholm ENPY
48
+ def entropy(term)
49
+ denominator = term_counts[term].to_f
50
+ logN = Math.log documents.size
51
+ 1 + documents.reduce(0) do |sum,document|
52
+ quotient = document.term_counts[term] / denominator
53
+       quotient.zero? ? sum : sum + quotient * Math.log(quotient) / logN
54
+ end
55
+ end
56
+
57
+
58
+
59
+   # @param [Matrix] matrix a term-document matrix
60
+ # @return [Matrix] the same matrix
61
+ #
62
+ # SMART n, Salton x, Chisholm NONE
63
+ def no_normalization(matrix)
64
+ matrix
65
+ end
66
+
67
+   # @param [Matrix] matrix a term-document matrix
68
+ # @return [Matrix] a matrix in which all document vectors are unit vectors
69
+ #
70
+ # SMART c, Salton c, Chisholm COSN
71
+ def cosine_normalization(matrix)
72
+     Matrix.columns(matrix.column_vectors.map do |column|
73
+ column.normalize
74
+ end)
75
+ end
76
+
77
+   # @param [Matrix] matrix a term-document matrix
78
+ # @return [Matrix] a matrix
79
+ #
80
+ # SMART u, Chisholm PUQN
81
+ def pivoted_unique_normalization(matrix)
82
+ # @todo
83
+ # http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
84
+ end
85
+ end
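Review note on `entropy`: the weight is 1 plus the sum over documents of p * ln(p) / ln(N), where p is the share of the term's occurrences that fall in one document. A document that does not contain the term has p = 0, and `Math.log(0)` is `-Infinity`, so 0 * ln(0) evaluates to `NaN` unless that case is skipped. The standalone sketch below treats 0 * ln(0) as 0, which is the usual convention; the helper name `entropy_weight` and its array argument are hypothetical.

```ruby
# Hypothetical standalone version of the entropy weight (Chisholm ENPY).
# counts holds one term's count in each document of the collection.
# Assumes at least two documents, since ln(1) = 0 would divide by zero.
def entropy_weight(counts)
  total = counts.sum.to_f
  log_n = Math.log(counts.size)
  sum = counts.sum do |count|
    share = count / total
    # Treat 0 * log(0) as 0, so documents without the term add nothing.
    share.zero? ? 0.0 : share * Math.log(share) / log_n
  end
  1 + sum
end

even         = entropy_weight([2, 2, 2, 2]) # spread evenly across documents
concentrated = entropy_weight([8, 0, 0, 0]) # found in a single document
```

A term spread evenly over every document gets a weight near zero, and a term confined to one document gets a weight of one, so the measure rewards terms that discriminate between documents.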
data/lib/tf-idf-similarity/extras/document.rb ADDED
@@ -0,0 +1,118 @@
1
+ require 'tf-idf-similarity/document'
2
+
3
+ class TfIdfSimilarity::Document
4
+ # @return [Float] the maximum term count of any term in the document
5
+ def maximum_term_count
6
+ @maximum_term_count ||= @term_counts.values.max.to_f
7
+ end
8
+
9
+ # @return [Float] the average term count of all terms in the document
10
+ def average_term_count
11
+ @average_term_count ||= @term_counts.values.reduce(:+) / @term_counts.size.to_f
12
+ end
13
+
14
+
15
+
16
+ # Returns the term count.
17
+ #
18
+ # SMART n, Salton t, Chisholm FREQ
19
+ def plain_term_frequency(term)
20
+ term_counts[term]
21
+ end
22
+   alias_method :plain_tf, :plain_term_frequency
23
+
24
+ # Returns 1 if the term is present, 0 otherwise.
25
+ #
26
+ # SMART b, Salton b, Chisholm BNRY
27
+ def binary_term_frequency(term)
28
+ count = term_counts[term]
29
+ if count > 0
30
+ 1
31
+ else
32
+ 0
33
+ end
34
+ end
35
+ alias_method :binary_tf, :binary_term_frequency
36
+
37
+ # Normalizes the term count by the maximum term count.
38
+ #
39
+ # @see http://en.wikipedia.org/wiki/Tf*idf
40
+ def normalized_term_frequency(term)
41
+ term_counts[term] / maximum_term_count
42
+ end
43
+ alias_method :normalized_tf, :normalized_term_frequency
44
+
45
+ # Further normalizes the normalized term frequency to lie between 0.5 and 1.
46
+ #
47
+ # SMART a, Salton n, Chisholm ATF1
48
+ def augmented_normalized_term_frequency(term)
49
+ 0.5 + 0.5 * normalized_term_frequency(term)
50
+ end
51
+ alias_method :augmented_normalized_tf, :augmented_normalized_term_frequency
52
+
53
+ # Chisholm ATFA
54
+ def augmented_average_term_frequency(term)
55
+ count = term_counts[term]
56
+ if count > 0
57
+ 0.9 + 0.1 * count / average_term_count
58
+ else
59
+ 0
60
+ end
61
+ end
62
+ alias_method :augmented_average_tf, :augmented_average_term_frequency
63
+
64
+ # Chisholm ATFC
65
+ def changed_coefficient_augmented_normalized_term_frequency(term)
66
+ count = term_counts[term]
67
+ if count > 0
68
+ 0.2 + 0.8 * count / maximum_term_count
69
+ else
70
+ 0
71
+ end
72
+ end
73
+ alias_method :changed_coefficient_augmented_normalized_tf, :changed_coefficient_augmented_normalized_term_frequency
74
+
75
+ # SMART l, Chisholm LOGA
76
+ def log_term_frequency(term)
77
+ count = term_counts[term]
78
+ if count > 0
79
+ 1 + Math.log(count)
80
+ else
81
+ 0
82
+ end
83
+ end
84
+ alias_method :log_tf, :log_term_frequency
85
+
86
+ # SMART L, Chisholm LOGN
87
+ def normalized_log_term_frequency(term)
88
+ count = term_counts[term]
89
+ if count > 0
90
+ (1 + Math.log(count)) / (1 + Math.log(average_term_count))
91
+ else
92
+ 0
93
+ end
94
+ end
95
+ alias_method :normalized_log_tf, :normalized_log_term_frequency
96
+
97
+ # Chisholm LOGG
98
+ def augmented_log_term_frequency(term)
99
+ count = term_counts[term]
100
+ if count > 0
101
+ 0.2 + 0.8 * Math.log(count + 1)
102
+ else
103
+ 0
104
+ end
105
+ end
106
+ alias_method :augmented_log_tf, :augmented_log_term_frequency
107
+
108
+ # Chisholm SQRT
109
+ def square_root_term_frequency(term)
110
+ count = term_counts[term]
111
+ if count > 0
112
+ Math.sqrt(count - 0.5) + 1
113
+ else
114
+ 0
115
+ end
116
+ end
117
+ alias_method :square_root_tf, :square_root_term_frequency
118
+ end
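Review note on the term frequency variants: a side-by-side comparison on one small document may make the differences easier to see. The term counts below are made up, and only three of the formulas from this file are reproduced (binary, log, and augmented normalized).

```ruby
# Hypothetical term counts for one document.
counts = { "similarity" => 4, "vector" => 2, "gsl" => 1 }
max = counts.values.max.to_f

# SMART b: 1 if the term is present, 0 otherwise.
binary = counts.transform_values { |c| c > 0 ? 1 : 0 }

# SMART l: 1 + ln(count) for terms that are present.
logarithm = counts.transform_values { |c| 1 + Math.log(c) }

# SMART a: 0.5 + 0.5 * count / max, which lies between 0.5 and 1.
augmented = counts.transform_values { |c| 0.5 + 0.5 * c / max }
```

The binary weight ignores repetition entirely, the log weight grows slowly with the count, and the augmented weight rescales every present term into the range 0.5 to 1 relative to the most frequent term.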
data/lib/tf-idf-similarity/token.rb ADDED
@@ -0,0 +1,42 @@
1
+ # coding: utf-8
2
+
3
+ # @note We can add more filters from Solr and stem using Porter's Snowball.
4
+ #
5
+ # @see https://github.com/aurelian/ruby-stemmer
6
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
7
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
8
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
9
+ class TfIdfSimilarity::Token < String
10
+ # Returns a falsy value if all its characters are numbers, punctuation,
11
+ # whitespace or control characters.
12
+ #
13
+   # @note Some implementations ignore one- and two-letter words.
14
+ #
15
+ # @return [Boolean] whether the string is a token
16
+ def valid?
17
+ !self[%r{
18
+ \A
19
+ (
20
+ \d | # number
21
+ \p{Cntrl} | # control character
22
+ \p{Punct} | # punctuation
23
+ [[:space:]] # whitespace
24
+ )+
25
+ \z
26
+ }x]
27
+ end
28
+
29
+ # @return [Token] a lowercase string
30
+ #
31
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseFilterFactory
32
+ def lowercase_filter
33
+ self.class.new UnicodeUtils.downcase(self, :fr)
34
+ end
35
+
36
+ # @return [Token] a string with no English possessive or periods in acronyms
37
+ #
38
+ # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ClassicFilterFactory
39
+ def classic_filter
40
+ self.class.new self.gsub('.', '').chomp("'s")
41
+ end
42
+ end
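Review note on `valid?` and the two filters: their combined effect can be sketched on plain strings. The helper names `valid_token?` and `normalize_token` are hypothetical, and `String#downcase` is assumed here in place of `UnicodeUtils.downcase(self, :fr)`, so the locale-aware case mapping in the diff is not reproduced.

```ruby
# Hypothetical stand-in for Token#valid?: rejects strings made up entirely
# of digits, control characters, punctuation, or whitespace.
def valid_token?(word)
  word !~ /\A(\d|\p{Cntrl}|\p{Punct}|[[:space:]])+\z/
end

# Hypothetical stand-in for lowercase_filter followed by classic_filter:
# lowercases, strips periods, and drops a trailing English possessive.
def normalize_token(word)
  word.downcase.gsub('.', '').chomp("'s")
end

words = ["Canada's", "U.S.A.", "2012", "...", "3D"]
terms = words.select { |w| valid_token?(w) }.map { |w| normalize_token(w) }
```

A token such as "3D" is kept because it contains at least one letter, while "2012" and "..." are dropped before they reach the term counts.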
data/lib/tf-idf-similarity/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module TfIdfSimilarity
2
+ VERSION = "0.0.1"
3
+ end
data/td-idf-similarity.gemspec ADDED
@@ -0,0 +1,22 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+ require "tf-idf-similarity/version"
4
+
5
+ Gem::Specification.new do |s|
6
+ s.name = "tf-idf-similarity"
7
+ s.version = TfIdfSimilarity::VERSION
8
+ s.platform = Gem::Platform::RUBY
9
+ s.authors = ["Open North"]
10
+ s.email = ["info@opennorth.ca"]
11
+ s.homepage = "http://github.com/opennorth/tf-idf-similarity"
12
+ s.summary = %q{Calculates the similarity between texts using tf*idf}
13
+
14
+ s.files = `git ls-files`.split("\n")
15
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
16
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
17
+ s.require_paths = ["lib"]
18
+
19
+ s.add_runtime_dependency('unicode_utils')
20
+ s.add_development_dependency('rspec', '~> 2.10')
21
+ s.add_development_dependency('rake')
22
+ end
metadata ADDED
@@ -0,0 +1,114 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: tf-idf-similarity
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Open North
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-09-10 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: unicode_utils
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
30
+ - !ruby/object:Gem::Dependency
31
+ name: rspec
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ~>
36
+ - !ruby/object:Gem::Version
37
+ version: '2.10'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ~>
44
+ - !ruby/object:Gem::Version
45
+ version: '2.10'
46
+ - !ruby/object:Gem::Dependency
47
+ name: rake
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ description:
63
+ email:
64
+ - info@opennorth.ca
65
+ executables: []
66
+ extensions: []
67
+ extra_rdoc_files: []
68
+ files:
69
+ - .gitignore
70
+ - .travis.yml
71
+ - Gemfile
72
+ - LICENSE
73
+ - README.md
74
+ - Rakefile
75
+ - USAGE
76
+ - lib/tf-idf-similarity.rb
77
+ - lib/tf-idf-similarity/collection.rb
78
+ - lib/tf-idf-similarity/document.rb
79
+ - lib/tf-idf-similarity/extras/collection.rb
80
+ - lib/tf-idf-similarity/extras/document.rb
81
+ - lib/tf-idf-similarity/token.rb
82
+ - lib/tf-idf-similarity/version.rb
83
+ - td-idf-similarity.gemspec
84
+ homepage: http://github.com/opennorth/tf-idf-similarity
85
+ licenses: []
86
+ post_install_message:
87
+ rdoc_options: []
88
+ require_paths:
89
+ - lib
90
+ required_ruby_version: !ruby/object:Gem::Requirement
91
+ none: false
92
+ requirements:
93
+ - - ! '>='
94
+ - !ruby/object:Gem::Version
95
+ version: '0'
96
+ segments:
97
+ - 0
98
+ hash: 697007281194730821
99
+ required_rubygems_version: !ruby/object:Gem::Requirement
100
+ none: false
101
+ requirements:
102
+ - - ! '>='
103
+ - !ruby/object:Gem::Version
104
+ version: '0'
105
+ segments:
106
+ - 0
107
+ hash: 697007281194730821
108
+ requirements: []
109
+ rubyforge_project:
110
+ rubygems_version: 1.8.24
111
+ signing_key:
112
+ specification_version: 3
113
+ summary: Calculates the similarity between texts using tf*idf
114
+ test_files: []