tf-idf-similarity 0.1.6 → 0.2.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
- SHA1:
3
- metadata.gz: 03431fb16064caa54fe9cbfc17a151acb1a25fa5
4
- data.tar.gz: be2e97b63e14244925937ee71fc8dc60c88dfce4
2
+ SHA256:
3
+ metadata.gz: 605ac457508eaf64a7e583e8a4a71af231d3d9d2f9c30ee82b25fb9f647d1312
4
+ data.tar.gz: f24b89dccdcbef3c4fcaa59d15050f064455859c134c550fd6a432346883eb31
5
5
  SHA512:
6
- metadata.gz: f615fae6cfad994fa25c85b1f3d6882742944e7bb5894ae3fcf6b4c9d7b34647b0da1b3914f127eb26e46a299c0f8a4e9d64bc05a7cb1c429663beaf657704eb
7
- data.tar.gz: 317ea7c5a1a72e53419f2eadb5b4789bccbe29f0f7bf742f89e9ed9ffb210b43a78180ebef818baf497a48911e0f25897e6906251c45cd787d61c5da43cbbb92
6
+ metadata.gz: a41195c6543dea206baa8ce3e2095437d1df94fabedcc76a8151fa5af5991524d96530710a7216c1fef48a7008f88a43773ce2a2323afa563fa29f5abed9909c
7
+ data.tar.gz: aadbb85d6bd74625088d0aa7cb58b4127337d5c1dcc2af13c22664f1562013c59d79d8b3bcc3564a2861dfd968d39770205d3b401114e8bdf870b2ac412fda26
data/.gitignore CHANGED
@@ -4,3 +4,4 @@
4
4
  Gemfile.lock
5
5
  doc/*
6
6
  pkg/*
7
+ coverage/*
@@ -18,7 +18,7 @@ addons:
18
18
  # Installing ATLAS will install BLAS.
19
19
  - libatlas-dev
20
20
  - libatlas-base-dev
21
- - libatlas3gf-base
21
+ - libatlas3-base
22
22
  before_install:
23
23
  - bundle config build.nmatrix --with-lapacklib
24
24
  - export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/include/atlas
data/README.md CHANGED
@@ -1,12 +1,11 @@
1
- # Ruby Vector Space Model (VSM) with tf*idf weights
1
+ # Ruby Vector Space Model (VSM) with tf\*idf weights
2
2
 
3
3
  [![Gem Version](https://badge.fury.io/rb/tf-idf-similarity.svg)](https://badge.fury.io/rb/tf-idf-similarity)
4
4
  [![Build Status](https://secure.travis-ci.org/jpmckinney/tf-idf-similarity.png)](https://travis-ci.org/jpmckinney/tf-idf-similarity)
5
- [![Dependency Status](https://gemnasium.com/jpmckinney/tf-idf-similarity.png)](https://gemnasium.com/jpmckinney/tf-idf-similarity)
6
5
  [![Coverage Status](https://coveralls.io/repos/jpmckinney/tf-idf-similarity/badge.png)](https://coveralls.io/r/jpmckinney/tf-idf-similarity)
7
6
  [![Code Climate](https://codeclimate.com/github/jpmckinney/tf-idf-similarity.png)](https://codeclimate.com/github/jpmckinney/tf-idf-similarity)
8
7
 
9
- Calculates the similarity between texts using a [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency (tf*idf)](https://en.wikipedia.org/wiki/Tf–idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (see below).
8
+ Calculates the similarity between texts using a [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency (tf\*idf)](https://en.wikipedia.org/wiki/Tf–idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (see below).
10
9
 
11
10
  ## Usage
12
11
 
@@ -48,7 +47,7 @@ Find the similarity of two documents in the matrix:
48
47
  matrix[model.document_index(document1), model.document_index(document2)]
49
48
  ```
50
49
 
51
- Print the tf*idf values for terms in a document:
50
+ Print the tf\*idf values for terms in a document:
52
51
 
53
52
  ```ruby
54
53
  tfidf_by_term = {}
@@ -86,6 +85,8 @@ end
86
85
  document1 = TfIdfSimilarity::Document.new(text, :term_counts => term_counts, :size => size)
87
86
  ```
88
87
 
88
+ Or, use your own classes for the tokenizer and tokens, like in [this example](https://gist.github.com/satoryu/0183a4eba365cc67e28988a09f3035b3).
89
+
89
90
  [Read the documentation at RubyDoc.info.](http://rubydoc.info/gems/tf-idf-similarity)
90
91
 
91
92
  ## Troubleshooting
@@ -114,11 +115,11 @@ You can access more term frequency, document frequency, and normalization formul
114
115
  require 'tf-idf-similarity/extras/document'
115
116
  require 'tf-idf-similarity/extras/tf_idf_model'
116
117
 
117
- The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
118
+ The default tf\*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
118
119
 
119
120
  ## Why?
120
121
 
121
- At the time of writing, no other Ruby gem implemented the tf*idf formula used by Lucene, Sphinx and Ferret.
122
+ At the time of writing, no other Ruby gem implemented the tf\*idf formula used by Lucene, Sphinx and Ferret.
122
123
 
123
124
  * [rsemantic](https://github.com/josephwilk/rsemantic) now uses the same [term frequency](https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L14) and [document frequency](https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L13) formulas as Lucene.
124
125
  * [treat](https://github.com/louismullie/treat) offers many term frequency formulas, [one of which](https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13) is the same as Lucene.
@@ -1,9 +1,6 @@
1
1
  require 'forwardable'
2
2
  require 'set'
3
3
 
4
- require 'unicode_utils/downcase'
5
- require 'unicode_utils/each_word'
6
-
7
4
  module TfIdfSimilarity
8
5
  end
9
6
 
@@ -1,3 +1,5 @@
1
+ require 'tf-idf-similarity/tokenizer'
2
+
1
3
  # A document.
2
4
  module TfIdfSimilarity
3
5
  class Document
@@ -19,7 +21,8 @@ module TfIdfSimilarity
19
21
  def initialize(text, opts = {})
20
22
  @text = text
21
23
  @id = opts[:id] || object_id
22
- @tokens = opts[:tokens]
24
+ @tokens = Array(opts[:tokens]).map { |t| Token.new(t) } if opts[:tokens]
25
+ @tokenizer = opts[:tokenizer] || Tokenizer.new
23
26
 
24
27
  if opts[:term_counts]
25
28
  @term_counts = opts[:term_counts]
@@ -51,10 +54,9 @@ module TfIdfSimilarity
51
54
 
52
55
  # Tokenizes the text and counts terms and total tokens.
53
56
  def set_term_counts_and_size
54
- tokenize(text).each do |word|
55
- token = Token.new(word)
57
+ tokenize(text).each do |token|
56
58
  if token.valid?
57
- term = token.lowercase_filter.classic_filter.to_s
59
+ term = token.to_s
58
60
  @term_counts[term] += 1
59
61
  @size += 1
60
62
  end
@@ -76,7 +78,7 @@ module TfIdfSimilarity
76
78
  # @see http://unicode.org/reports/tr29/#Default_Word_Boundaries
77
79
  # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
78
80
  def tokenize(text)
79
- @tokens || UnicodeUtils.each_word(text)
81
+ @tokens || @tokenizer.tokenize(text)
80
82
  end
81
83
  end
82
84
  end
@@ -1,5 +1,7 @@
1
1
  # coding: utf-8
2
2
  require 'delegate'
3
+ require 'unicode_utils/downcase'
4
+ require 'unicode_utils/each_word'
3
5
 
4
6
  # A token.
5
7
  #
@@ -47,5 +49,10 @@ module TfIdfSimilarity
47
49
  def classic_filter
48
50
  self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
49
51
  end
52
+
53
+ def to_s
54
+ # Don't call #lowercase_filter and #classic_filter to avoid creating unnecessary objects.
55
+ UnicodeUtils.downcase(self).gsub('.', '').sub(/['`’]s\z/, '')
56
+ end
50
57
  end
51
58
  end
@@ -0,0 +1,19 @@
1
+ require 'unicode_utils/each_word'
2
+ require 'tf-idf-similarity/token'
3
+
4
+ # A tokenizer using UnicodeUtils to tokenize a text.
5
+ #
6
+ # @see https://github.com/lang/unicode_utils
7
+ module TfIdfSimilarity
8
+ class Tokenizer
9
+ # Tokenizes a text.
10
+ #
11
+ # @param [String] text
12
+ # @return [Enumerator] an enumerator of Token objects
13
+ def tokenize(text)
14
+ UnicodeUtils.each_word(text).map do |word|
15
+ Token.new(word)
16
+ end
17
+ end
18
+ end
19
+ end
@@ -1,3 +1,3 @@
1
1
  module TfIdfSimilarity
2
- VERSION = "0.1.6"
2
+ VERSION = "0.2.0"
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: tf-idf-similarity
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.6
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - James McKinney
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2017-03-07 00:00:00.000000000 Z
11
+ date: 2019-12-19 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: unicode_utils
@@ -104,6 +104,7 @@ files:
104
104
  - lib/tf-idf-similarity/term_count_model.rb
105
105
  - lib/tf-idf-similarity/tf_idf_model.rb
106
106
  - lib/tf-idf-similarity/token.rb
107
+ - lib/tf-idf-similarity/tokenizer.rb
107
108
  - lib/tf-idf-similarity/version.rb
108
109
  - spec/bm25_model_spec.rb
109
110
  - spec/document_spec.rb
@@ -133,7 +134,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
133
134
  version: '0'
134
135
  requirements: []
135
136
  rubyforge_project:
136
- rubygems_version: 2.4.5
137
+ rubygems_version: 2.7.6
137
138
  signing_key:
138
139
  specification_version: 4
139
140
  summary: Calculates the similarity between texts using tf*idf