tf-idf-similarity 0.1.6 → 0.2.0 (RubyGems)

Files changed (10)

checksums.yaml +5 -5
data/.gitignore +1 -0
data/.travis.yml +1 -1
data/README.md +7 -6
data/lib/tf-idf-similarity.rb +0 -3
data/lib/tf-idf-similarity/document.rb +7 -5
data/lib/tf-idf-similarity/token.rb +7 -0
data/lib/tf-idf-similarity/tokenizer.rb +19 -0
data/lib/tf-idf-similarity/version.rb +1 -1
metadata +4 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: 03431fb16064caa54fe9cbfc17a151acb1a25fa5
-  data.tar.gz: be2e97b63e14244925937ee71fc8dc60c88dfce4
+SHA256:
+  metadata.gz: 605ac457508eaf64a7e583e8a4a71af231d3d9d2f9c30ee82b25fb9f647d1312
+  data.tar.gz: f24b89dccdcbef3c4fcaa59d15050f064455859c134c550fd6a432346883eb31
 SHA512:
-  metadata.gz: f615fae6cfad994fa25c85b1f3d6882742944e7bb5894ae3fcf6b4c9d7b34647b0da1b3914f127eb26e46a299c0f8a4e9d64bc05a7cb1c429663beaf657704eb
-  data.tar.gz: 317ea7c5a1a72e53419f2eadb5b4789bccbe29f0f7bf742f89e9ed9ffb210b43a78180ebef818baf497a48911e0f25897e6906251c45cd787d61c5da43cbbb92
+  metadata.gz: a41195c6543dea206baa8ce3e2095437d1df94fabedcc76a8151fa5af5991524d96530710a7216c1fef48a7008f88a43773ce2a2323afa563fa29f5abed9909c
+  data.tar.gz: aadbb85d6bd74625088d0aa7cb58b4127337d5c1dcc2af13c22664f1562013c59d79d8b3bcc3564a2861dfd968d39770205d3b401114e8bdf870b2ac412fda26

data/.gitignore CHANGED

@@ -4,3 +4,4 @@
 Gemfile.lock
 doc/*
 pkg/*
+coverage/*

data/.travis.yml CHANGED

@@ -18,7 +18,7 @@ addons:
     # Installing ATLAS will install BLAS.
     - libatlas-dev
     - libatlas-base-dev
-    - libatlas3gf-base
+    - libatlas3-base
 before_install:
   - bundle config build.nmatrix --with-lapacklib
   - export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/include/atlas

data/README.md CHANGED

@@ -1,12 +1,11 @@
-# Ruby Vector Space Model (VSM) with tf*idf weights
+# Ruby Vector Space Model (VSM) with tf\*idf weights
 [![Gem Version](https://badge.fury.io/rb/tf-idf-similarity.svg)](https://badge.fury.io/rb/tf-idf-similarity)
 [![Build Status](https://secure.travis-ci.org/jpmckinney/tf-idf-similarity.png)](https://travis-ci.org/jpmckinney/tf-idf-similarity)
-[![Dependency Status](https://gemnasium.com/jpmckinney/tf-idf-similarity.png)](https://gemnasium.com/jpmckinney/tf-idf-similarity)
 [![Coverage Status](https://coveralls.io/repos/jpmckinney/tf-idf-similarity/badge.png)](https://coveralls.io/r/jpmckinney/tf-idf-similarity)
 [![Code Climate](https://codeclimate.com/github/jpmckinney/tf-idf-similarity.png)](https://codeclimate.com/github/jpmckinney/tf-idf-similarity)
-Calculates the similarity between texts using a [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency (tf*idf)](https://en.wikipedia.org/wiki/Tf–idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (see below).
+Calculates the similarity between texts using a [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency (tf\*idf)](https://en.wikipedia.org/wiki/Tf–idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (see below).
 ## Usage
@@ -48,7 +47,7 @@ Find the similarity of two documents in the matrix:
 matrix[model.document_index(document1), model.document_index(document2)]
 ```
-Print the tf*idf values for terms in a document:
+Print the tf\*idf values for terms in a document:
 ```ruby
 tfidf_by_term = {}
@@ -86,6 +85,8 @@ end
 document1 = TfIdfSimilarity::Document.new(text, :term_counts => term_counts, :size => size)
 ```
+Or, use your own classes for the tokenizer and tokens, like in [this example](https://gist.github.com/satoryu/0183a4eba365cc67e28988a09f3035b3).
 [Read the documentation at RubyDoc.info.](http://rubydoc.info/gems/tf-idf-similarity)
 ## Troubleshooting
@@ -114,11 +115,11 @@ You can access more term frequency, document frequency, and normalization formul
     require 'tf-idf-similarity/extras/document'
     require 'tf-idf-similarity/extras/tf_idf_model'
-The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
+The default tf\*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
 ## Why?
-At the time of writing, no other Ruby gem implemented the tf*idf formula used by Lucene, Sphinx and Ferret.
+At the time of writing, no other Ruby gem implemented the tf\*idf formula used by Lucene, Sphinx and Ferret.
 * [rsemantic](https://github.com/josephwilk/rsemantic) now uses the same [term frequency](https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L14) and [document frequency](https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L13) formulas as Lucene.
 * [treat](https://github.com/louismullie/treat) offers many term frequency formulas, [one of which](https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13) is the same as Lucene.

data/lib/tf-idf-similarity.rb CHANGED

@@ -1,9 +1,6 @@
 require 'forwardable'
 require 'set'
-require 'unicode_utils/downcase'
-require 'unicode_utils/each_word'
 module TfIdfSimilarity
 end

data/lib/tf-idf-similarity/document.rb CHANGED

@@ -1,3 +1,5 @@
+require 'tf-idf-similarity/tokenizer'
 # A document.
 module TfIdfSimilarity
   class Document
@@ -19,7 +21,8 @@ module TfIdfSimilarity
     def initialize(text, opts = {})
       @text   = text
       @id     = opts[:id] || object_id
-      @tokens = opts[:tokens]
+      @tokens = Array(opts[:tokens]).map { |t| Token.new(t) } if opts[:tokens]
+      @tokenizer = opts[:tokenizer] || Tokenizer.new
       if opts[:term_counts]
         @term_counts = opts[:term_counts]
@@ -51,10 +54,9 @@ module TfIdfSimilarity
     # Tokenizes the text and counts terms and total tokens.
     def set_term_counts_and_size
-      tokenize(text).each do |word|
-        token = Token.new(word)
+      tokenize(text).each do |token|
         if token.valid?
-          term = token.lowercase_filter.classic_filter.to_s
+          term = token.to_s
           @term_counts[term] += 1
           @size += 1
         end
@@ -76,7 +78,7 @@ module TfIdfSimilarity
     # @see http://unicode.org/reports/tr29/#Default_Word_Boundaries
     # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
     def tokenize(text)
-      @tokens || UnicodeUtils.each_word(text)
+      @tokens || @tokenizer.tokenize(text)
     end
   end
 end

data/lib/tf-idf-similarity/token.rb CHANGED

@@ -1,5 +1,7 @@
 # coding: utf-8
 require 'delegate'
+require 'unicode_utils/downcase'
+require 'unicode_utils/each_word'
 # A token.
 #
@@ -47,5 +49,10 @@ module TfIdfSimilarity
     def classic_filter
       self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
     end
+    def to_s
+      # Don't call #lowercase_filter and #classic_filter to avoid creating unnecessary objects.
+      UnicodeUtils.downcase(self).gsub('.', '').sub(/['`’]s\z/, '')
+    end
   end
 end
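
The new `Token#to_s` above folds the lowercase and classic filters into one expression. A minimal sketch of that normalization follows, assuming Ruby's built-in `String#downcase` as a stand-in for `UnicodeUtils.downcase`; the `normalize` helper is hypothetical and is not part of the gem.

```ruby
# Hypothetical helper mirroring the body of Token#to_s in the diff.
# Assumption: String#downcase stands in for UnicodeUtils.downcase.
def normalize(word)
  word.downcase            # lowercase filter
      .gsub('.', '')       # classic filter: drop periods (acronyms)
      .sub(/['`’]s\z/, '') # classic filter: drop a trailing possessive
end

acronym    = normalize("U.S.A.")
possessive = normalize("Dog's")
curly      = normalize("O’Neil’s")

puts acronym, possessive, curly
```

Only a possessive at the very end of the token is removed, because the pattern is anchored with `\z`.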

data/lib/tf-idf-similarity/tokenizer.rb ADDED

@@ -0,0 +1,19 @@
+require 'unicode_utils/each_word'
+require 'tf-idf-similarity/token'
+# A tokenizer using UnicodeUtils to tokenize a text.
+#
+# @see https://github.com/lang/unicode_utils
+module TfIdfSimilarity
+  class Tokenizer
+    # Tokenizes a text.
+    #
+    # @param [String] text
+    # @return [Enumerator] an enumerator of Token objects
+    def tokenize(text)
+      UnicodeUtils.each_word(text).map do |word|
+        Token.new(word)
+      end
+    end
+  end
+end
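
Taken together, the `document.rb` and `tokenizer.rb` changes suggest a duck-typed contract: a tokenizer responds to `tokenize(text)`, and each token responds to `valid?` and `to_s`. The sketch below illustrates that contract without loading the gem. `WhitespaceToken`, `WhitespaceTokenizer`, and `count_terms` are hypothetical names, and `count_terms` only imitates the loop in `Document#set_term_counts_and_size`.

```ruby
# Hypothetical token class; not part of the gem.
class WhitespaceToken
  def initialize(word)
    @word = word
  end

  # Tokens without any letter or digit are skipped by the counting loop.
  def valid?
    @word.match?(/[[:alnum:]]/)
  end

  # The string returned here is used as the term.
  def to_s
    @word.downcase
  end
end

# Hypothetical tokenizer class; splits on whitespace only.
class WhitespaceTokenizer
  def tokenize(text)
    text.split.map { |word| WhitespaceToken.new(word) }
  end
end

# Imitates the loop in Document#set_term_counts_and_size from the diff.
def count_terms(text, tokenizer)
  term_counts = Hash.new(0)
  size = 0
  tokenizer.tokenize(text).each do |token|
    if token.valid?
      term_counts[token.to_s] += 1
      size += 1
    end
  end
  [term_counts, size]
end

counts, size = count_terms("The cat -- the hat", WhitespaceTokenizer.new)
puts counts.inspect, size
```

With the gem itself, such an object would presumably be passed as `TfIdfSimilarity::Document.new(text, :tokenizer => WhitespaceTokenizer.new)`, since the constructor in the diff reads `opts[:tokenizer]` and falls back to `Tokenizer.new`.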

data/lib/tf-idf-similarity/version.rb CHANGED

@@ -1,3 +1,3 @@
 module TfIdfSimilarity
-  VERSION = "0.1.6"
+  VERSION = "0.2.0"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: tf-idf-similarity
 version: !ruby/object:Gem::Version
-  version: 0.1.6
+  version: 0.2.0
 platform: ruby
 authors:
 - James McKinney
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-03-07 00:00:00.000000000 Z
+date: 2019-12-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode_utils
@@ -104,6 +104,7 @@ files:
 - lib/tf-idf-similarity/term_count_model.rb
 - lib/tf-idf-similarity/tf_idf_model.rb
 - lib/tf-idf-similarity/token.rb
+- lib/tf-idf-similarity/tokenizer.rb
 - lib/tf-idf-similarity/version.rb
 - spec/bm25_model_spec.rb
 - spec/document_spec.rb
@@ -133,7 +134,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.5
+rubygems_version: 2.7.6
 signing_key:
 specification_version: 4
 summary: Calculates the similarity between texts using tf*idf