bow_tfidf 0.1.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 6bfa8c6ca528c4eb7258f265958162243253bd8327b6bbccd91665d162360b5c
4
+ data.tar.gz: 3cd70368a67dd1ba28921b3438857b2a1c20bbef4b90e16ad2c9c964071199d2
5
+ SHA512:
6
+ metadata.gz: 3258284a37b2ae2a90e0e63c16a6ec972bead603cfbdc0a0c4ffe1cb3fa27c79308a1bd98a9d289994e7b244292e5418bf8b704b04b7c1494701c0626473353e
7
+ data.tar.gz: a58ed2da2b29a666fbadafc099a999e3741c3fe78fbb931334c954f6ada09460d44e65791952e26148d267503f1676fdab7e12a21e23a4414d32b1b26694a11f
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in bow_tfidf.gemspec
4
+ gemspec
@@ -0,0 +1,88 @@
1
+ # BowTfidf
2
+
3
+ Based on two concepts: TFIDF and bag-of-words.
4
+
5
+ ### TFIDF
6
+ > TFIDF - In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.
7
+ Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.
8
+
9
+ Read more about TFIDF on [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
10
+
11
+
12
+ ### Bag-of-words
13
+ >The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.
14
+ The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
15
+
16
+ Read more about Bag-of-words on [Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model).
17
+
18
+ ## Installation
19
+
20
+ Add this line to your application's Gemfile:
21
+
22
+ ```ruby
23
+ gem 'bow_tfidf'
24
+ ```
25
+
26
+ And then execute:
27
+
28
+ $ bundle
29
+
30
+ Or install it yourself as:
31
+
32
+ $ gem install bow_tfidf
33
+
34
+ ## Usage
35
+
36
+ First, build a bag of words with a TF-IDF weight computed for each word. To do this, pass labeled words as a hash to the bag of words:
37
+
38
+ ```ruby
39
+ bow = BowTfidf::BagOfWords.new
40
+ bow.add_labeled_data!({
41
+ category1: ['word', 'word1'],
42
+ category2: ['word', 'word2'],
43
+ category3: ['word', 'word2', 'word3']
44
+ })
45
+ ```
46
+
47
+ To identify the category of a text, pass an array of words to the classifier:
48
+ ```ruby
49
+ classifier = BowTfidf::Classifier.new(bow)
50
+ classifier.call(['word2', 'word3'])
51
+ # {
52
+ # category_key: :category3,
53
+ # score: {
54
+ # category3: 0.27185717486836963,
55
+ # category2: 0.09061905828945654
56
+ # }
57
+ # }
58
+ ```
59
+ `:category_key` is the classifier's best guess at the category of the text for the given words. It is derived from `:score`: the category with the highest score wins.
60
+
61
+ `BowTfidf::Classifier` looks up the TF-IDF weight that relates each word to each category, sums these weights per category, and returns the totals as the score.
62
+
63
+ ### When the classifier cannot recognize a category
64
+
65
+ 1. None of the given words are in the BOW.
66
+ - **Solution:** update the BOW with new words.
67
+
68
+ 2. Each of the given words belongs to all categories.
69
+ - In the current implementation, such words are ignored and dropped from the BOW. This relies on the assumption that less frequent, more distinctive words exist.
70
+
71
+ ### Performance
72
+ To improve performance and memory usage, create a dump of the built BOW using a lighter data structure (without the attributes the classifier does not need), together with a custom classifier that works with the dump.
73
+
74
+ ### Split text into words (tokens)
75
+ ```ruby
76
+ BowTfidf::Tokenizer.new.call('word word2, some! text')
77
+ # <Set: {"word", "word2", "some", "text"}>
78
+ ```
79
+
80
+ ## Development
81
+
82
+ After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
83
+
84
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
85
+
86
+ ## Contributing
87
+
88
+ Bug reports and pull requests are welcome on GitHub at https://github.com/isidzukuri/bow_tfidf.
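The Performance note in the README above suggests dumping a lighter structure and classifying against it. The following is only a sketch of one way that could look, not part of the gem: the `words` and `categories` hashes are hand-built stand-ins shaped like `bow.words` and `bow.categories`, the weights are illustrative, and the JSON format and the `classify` lambda are assumptions.

```ruby
require 'json'

# Hand-built stand-ins shaped like bow.words and bow.categories
# (the numbers are illustrative, not produced by the gem).
words = {
  'word2' => { idf: 0.30, categories: { 1 => { entries: 1, tf: 0.30, tfidf: 0.09 },
                                         2 => { entries: 1, tf: 0.30, tfidf: 0.09 } } },
  'word3' => { idf: 0.60, categories: { 2 => { entries: 1, tf: 0.30, tfidf: 0.18 } } }
}
categories = {
  category2: { id: 1, key: :category2 },
  category3: { id: 2, key: :category3 }
}

# Keep only what classification needs: word => { category key => tfidf }.
key_by_id = categories.values.map { |c| [c[:id], c[:key].to_s] }.to_h
light = words.map do |word, attrs|
  [word, attrs[:categories].map { |id, rel| [key_by_id[id], rel[:tfidf]] }.to_h]
end.to_h

dump = JSON.generate(light) # this string could be written to a file
loaded = JSON.parse(dump)

# Hypothetical minimal classifier that works directly on the loaded dump.
classify = lambda do |tokens|
  totals = Hash.new(0.0)
  tokens.each { |t| loaded.fetch(t, {}).each { |cat, w| totals[cat] += w } }
  best = totals.max_by { |_cat, total| total }
  best && best.first
end

best = classify.call(%w[word2 word3])
```

The light structure drops `:entries`, `:tf`, `:idf` and the per-category word sets, which the scoring step never reads.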
@@ -0,0 +1,2 @@
1
+ require 'bundler/gem_tasks'
2
+ task default: :spec
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'bundler/setup'
4
+ require 'bow_tfidf'
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require 'irb'
14
+ IRB.start
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,36 @@
1
+ # coding: utf-8
2
+
3
+ lib = File.expand_path('../lib', __FILE__)
4
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
5
+ require 'bow_tfidf/version'
6
+
7
+ Gem::Specification.new do |spec|
8
+ spec.name = 'bow_tfidf'
9
+ spec.version = BowTfidf::VERSION
10
+ spec.authors = ['isidzukuri']
11
+ spec.email = ['axesigon@gmail.com']
12
+
13
+ spec.summary = 'Tf–idf is one of the most popular term-weighting schemes.'
14
+ spec.description = 'In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.'
15
+ spec.homepage = 'https://github.com/isidzukuri/bow_tfidf'
16
+
17
+ # Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host'
18
+ # to allow pushing to a single host or delete this section to allow pushing to any host.
19
+ if spec.respond_to?(:metadata)
20
+ spec.metadata['allowed_push_host'] = "https://rubygems.org"
21
+ else
22
+ raise 'RubyGems 2.0 or newer is required to protect against ' \
23
+ 'public gem pushes.'
24
+ end
25
+
26
+ spec.files = `git ls-files -z`.split("\x0").reject do |f|
27
+ f.match(%r{^(test|spec|features)/})
28
+ end
29
+ spec.bindir = 'exe'
30
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
31
+ spec.require_paths = ['lib']
32
+
33
+ spec.add_development_dependency 'bundler', '~> 1.13'
34
+ spec.add_development_dependency 'rake', '~> 10.0'
35
+ spec.add_development_dependency 'rspec'
36
+ end
@@ -0,0 +1,9 @@
1
+ require 'set'
2
+ require 'bow_tfidf/version'
3
+ require 'bow_tfidf/computation'
4
+ require 'bow_tfidf/bag_of_words'
5
+ require 'bow_tfidf/classifier'
6
+ require 'bow_tfidf/tokenizer'
7
+
8
+ module BowTfidf
9
+ end
@@ -0,0 +1,59 @@
1
+ module BowTfidf
2
+ class BagOfWords
3
+ attr_reader :words, :categories
4
+
5
+ def initialize
6
+ @words = {}
7
+ @categories = {}
8
+ end
9
+
10
+ def add_labeled_data!(data)
11
+ validate_labeled_data(data)
12
+
13
+ data.each do |category_key, category_words|
14
+ category = category_by_key(category_key)
15
+
16
+ category_words.each do |word|
17
+ add_word(word, category)
18
+ end
19
+ end
20
+
21
+ compute_tfidf
22
+ end
23
+
24
+ private
25
+
26
+ def validate_labeled_data(data)
27
+ raise(ArgumentError, 'Hash with arrays expected') unless data.is_a?(Hash)
28
+
29
+ data.values.each do |array|
30
+ raise(ArgumentError, 'Hash with arrays expected') unless array.is_a?(Enumerable)
31
+
32
+ raise(ArgumentError, 'Hash with arrays of strings expected') unless array.all? { |value| value.is_a?(String) }
33
+ end
34
+ end
35
+
36
+ def add_word(word, category)
37
+ words[word] = { categories: {} } unless words[word]
38
+ words[word][:categories][category[:id]] ||= { entries: 0 }
39
+ words[word][:categories][category[:id]][:entries] += 1
40
+
41
+ categories[category[:key]][:words] << word
42
+ end
43
+
44
+ def category_by_key(key)
45
+ unless categories[key]
46
+ categories[key] = {
47
+ id: categories.length,
48
+ key: key,
49
+ words: Set[]
50
+ }
51
+ end
52
+ categories[key]
53
+ end
54
+
55
+ def compute_tfidf
56
+ Computation.new(self).call
57
+ end
58
+ end
59
+ end
@@ -0,0 +1,70 @@
1
+ module BowTfidf
2
+ class Classifier
3
+ attr_reader :bow, :score
4
+
5
+ def initialize(bow)
6
+ raise(ArgumentError, 'BowTfidf::BagOfWords instance expected') unless bow.is_a?(BowTfidf::BagOfWords)
7
+
8
+ @bow = bow
9
+ @score = {}
10
+ end
11
+
12
+ def call(tokens)
13
+ raise(ArgumentError, 'Array of strings expected') unless tokens.is_a?(Array)
14
+
15
+ tokens.each do |word|
16
+ process_word(word)
17
+ end
18
+
19
+ result
20
+ end
21
+
22
+ def find_word(word)
23
+ bow.words[word]
24
+ end
25
+
26
+ def category_by_id(id)
27
+ return nil unless id
28
+
29
+ bow.categories.values.find { |category| category[:id] == id }
30
+ end
31
+
32
+ private
33
+
34
+ def process_word(word)
35
+ return unless (word_data = find_word(word.to_s))
36
+
37
+ word_data[:categories].each do |category_id, word_category_relation|
38
+ score[category_id] = 0 unless score[category_id]
39
+ score[category_id] += word_category_relation[:tfidf]
40
+ end
41
+ end
42
+
43
+ def category_by_highest_score
44
+ ranking = score.max_by { |_k, v| v }
45
+
46
+ return unless ranking
47
+
48
+ category_id = ranking[0]
49
+ category_by_id(category_id)[:key]
50
+ end
51
+
52
+ def display_score
53
+ sorted = score.sort_by { |_k, v| v }.reverse
54
+ result_hash = {}
55
+ sorted.each do |item|
56
+ key = category_by_id(item[0])[:key]
57
+ result_hash[key] = item[1]
58
+ end
59
+
60
+ result_hash
61
+ end
62
+
63
+ def result
64
+ {
65
+ category_key: category_by_highest_score,
66
+ score: display_score
67
+ }
68
+ end
69
+ end
70
+ end
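A note on the classifier above: `score` lives on the instance and `call` never resets it, so reusing one classifier for several texts appears to add the scores together. The sketch below uses a hypothetical stand-in class (not the gem's `Classifier`) that mirrors only that accumulation pattern, to show why a fresh instance per text may be the safer choice.

```ruby
# Hypothetical stand-in that copies only the accumulation pattern.
scorer_class = Class.new do
  def initialize(weights)
    @weights = weights
    @score = {}
  end

  def call(tokens)
    tokens.each do |token|
      (@weights[token] || {}).each do |category, weight|
        @score[category] = 0 unless @score[category]
        @score[category] += weight
      end
    end
    @score.dup
  end
end

weights = { 'ruby' => { code: 0.5 } }

reused = scorer_class.new(weights)
first  = reused.call(['ruby'])  # score after one text
second = reused.call(['ruby'])  # same instance, scores keep growing
fresh  = scorer_class.new(weights).call(['ruby'])
```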
@@ -0,0 +1,59 @@
1
+ module BowTfidf
2
+ class Computation
3
+ attr_reader :bow
4
+
5
+ def initialize(bow)
6
+ raise(ArgumentError, 'BowTfidf::BagOfWords instance expected') unless bow.is_a?(BowTfidf::BagOfWords)
7
+
8
+ @bow = bow
9
+ end
10
+
11
+ def call
12
+ compute_idf
13
+ compute_tfidf
14
+ bow
15
+ end
16
+
17
+ private
18
+
19
+ def words
20
+ bow.words
21
+ end
22
+
23
+ def categories
24
+ bow.categories
25
+ end
26
+
27
+ def compute_idf
28
+ words.each do |word, attrs|
29
+ idf(attrs)
30
+
31
+ words.delete(word) if attrs[:idf] == 0.0
32
+ end
33
+ end
34
+
35
+ def idf(attrs)
36
+ if categories.length == attrs[:categories].length
37
+ attrs[:idf] = 0.0
38
+ else
39
+ # number of categories / number of categories that contain the word (integer division, so 3 / 2 == 1)
40
+ attrs[:idf] = Math.log10(1 + categories.length / attrs[:categories].length)
41
+ end
42
+ end
43
+
44
+ def compute_tfidf
45
+ categories.values.each do |category|
46
+ category[:words].each do |category_word|
47
+ next unless words[category_word]
48
+
49
+ # how many times the word occurs in the category / the number of words in category
50
+ # tf = category_word_attrs[:entries].to_f/category_attrs[:words].length
51
+ tf = Math.log10(1 + words[category_word][:categories][category[:id]][:entries])
52
+
53
+ words[category_word][:categories][category[:id]][:tf] = tf
54
+ words[category_word][:categories][category[:id]][:tfidf] = tf * words[category_word][:idf]
55
+ end
56
+ end
57
+ end
58
+ end
59
+ end
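The weighting in `Computation` can be traced by hand on the README's labeled data. The sketch below is a standalone re-derivation that does not load the gem; the variable names and the `score` lambda are illustrative assumptions, while the formulas (`log10(1 + entries)` for tf, `log10(1 + categories / containing)` with integer division for idf, and dropping words found in every category) follow the source above.

```ruby
labeled = {
  category1: %w[word word1],
  category2: %w[word word2],
  category3: %w[word word2 word3]
}

# Count how often each word occurs in each category.
entries = Hash.new { |hash, word| hash[word] = Hash.new(0) }
labeled.each do |category, words|
  words.each { |word| entries[word][category] += 1 }
end

tfidf = {}
entries.each do |word, per_category|
  # Words present in every category get idf 0.0 and are dropped.
  next if per_category.length == labeled.length

  # Integer division, as in the source: 3 / 2 == 1.
  idf = Math.log10(1 + labeled.length / per_category.length)
  per_category.each do |category, count|
    tf = Math.log10(1 + count)
    (tfidf[word] ||= {})[category] = tf * idf
  end
end

# Sum the per-category weights of the given tokens.
score = lambda do |tokens|
  totals = Hash.new(0.0)
  tokens.each { |t| (tfidf[t] || {}).each { |cat, w| totals[cat] += w } }
  totals
end

scores = score.call(%w[word2 word3])
```

With these inputs the totals should agree with the README example, roughly 0.0906 for `category2` and 0.2719 for `category3`.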
@@ -0,0 +1,39 @@
1
+ module BowTfidf
2
+ class Tokenizer
3
+ SPLIT_REGEX = /[\s\n\t\.,\-\!:()\/%\\+\|@^<«>*'~;=»\?—•$”\"’\[£“■‘\{#®♦°™€¥\]©§\}–]/
4
+ TOKEN_MIN_LENGTH = 3
5
+ TOKEN_MAX_LENGTH = 15
6
+
7
+ attr_reader :tokens
8
+
9
+ def initialize
10
+ @tokens = Set[]
11
+ end
12
+
13
+ def call(text)
14
+ raise(ArgumentError, 'String instance expected') unless text.is_a?(String)
15
+
16
+ raw_tokens = split(text)
17
+
18
+ raw_tokens.each do |token|
19
+ process_token(token)
20
+ end
21
+
22
+ tokens
23
+ end
24
+
25
+ private
26
+
27
+ def split(text)
28
+ text.split(SPLIT_REGEX)
29
+ end
30
+
31
+ def process_token(token)
32
+ return if token.length < TOKEN_MIN_LENGTH
33
+ return if token.length > TOKEN_MAX_LENGTH
34
+ return if token.scan(/\D/).empty? # skip if str contains only digits
35
+
36
+ tokens << token.downcase
37
+ end
38
+ end
39
+ end
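The filtering rules in `Tokenizer` (length between 3 and 15, no digit-only tokens, lowercasing, de-duplication through a `Set`) can be tried without the gem. This is a simplified sketch: the `tokenize` lambda is hypothetical and its split pattern `/[^[:alnum:]]+/` is a stand-in for `SPLIT_REGEX`, not an equivalent. Unlike the gem's tokenizer, which keeps its set on the instance across calls, the sketch returns a new set each time.

```ruby
require 'set'

# Hypothetical helper; the split pattern is a simplified assumption.
tokenize = lambda do |text, min = 3, max = 15|
  text.split(/[^[:alnum:]]+/).each_with_object(Set.new) do |token, set|
    next if token.length < min || token.length > max
    next if token.match?(/\A\d+\z/) # skip tokens made only of digits

    set << token.downcase
  end
end

tokens = tokenize.call('Word word2, some! text 2019 an')
```

Here `2019` is skipped as digit-only and `an` as too short, while `Word` and `word` would collapse into one lowercase entry.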
@@ -0,0 +1,3 @@
1
+ module BowTfidf
2
+ VERSION = '0.1.0'.freeze
3
+ end
metadata ADDED
@@ -0,0 +1,105 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: bow_tfidf
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - isidzukuri
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2019-04-08 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.13'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.13'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '10.0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '10.0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rspec
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ description: In information retrieval, tf–idf or TFIDF, short for term frequency–inverse
56
+ document frequency, is a numerical statistic that is intended to reflect how important
57
+ a word is to a document in a collection or corpus. It is often used as a weighting
58
+ factor in searches of information retrieval, text mining, and user modeling. The
59
+ tf–idf value increases proportionally to the number of times a word appears in the
60
+ document and is offset by the number of documents in the corpus that contain the
61
+ word, which helps to adjust for the fact that some words appear more frequently
62
+ in general.
63
+ email:
64
+ - axesigon@gmail.com
65
+ executables: []
66
+ extensions: []
67
+ extra_rdoc_files: []
68
+ files:
69
+ - Gemfile
70
+ - README.md
71
+ - Rakefile
72
+ - bin/console
73
+ - bin/setup
74
+ - bow_tfidf.gemspec
75
+ - lib/bow_tfidf.rb
76
+ - lib/bow_tfidf/bag_of_words.rb
77
+ - lib/bow_tfidf/classifier.rb
78
+ - lib/bow_tfidf/computation.rb
79
+ - lib/bow_tfidf/tokenizer.rb
80
+ - lib/bow_tfidf/version.rb
81
+ homepage: https://github.com/isidzukuri/bow_tfidf
82
+ licenses: []
83
+ metadata:
84
+ allowed_push_host: https://rubygems.org
85
+ post_install_message:
86
+ rdoc_options: []
87
+ require_paths:
88
+ - lib
89
+ required_ruby_version: !ruby/object:Gem::Requirement
90
+ requirements:
91
+ - - ">="
92
+ - !ruby/object:Gem::Version
93
+ version: '0'
94
+ required_rubygems_version: !ruby/object:Gem::Requirement
95
+ requirements:
96
+ - - ">="
97
+ - !ruby/object:Gem::Version
98
+ version: '0'
99
+ requirements: []
100
+ rubyforge_project:
101
+ rubygems_version: 2.7.4
102
+ signing_key:
103
+ specification_version: 4
104
+ summary: Tf–idf is one of the most popular term-weighting schemes.
105
+ test_files: []