RubyGems - bow_tfidf - Versions diffs - 0.1.0 - Mend

bow_tfidf 0.1.0

Files changed (14)

checksums.yaml +7 -0
data/Gemfile +4 -0
data/README.md +88 -0
data/Rakefile +2 -0
data/bin/console +14 -0
data/bin/setup +8 -0
data/bow_tfidf.gemspec +36 -0
data/lib/bow_tfidf.rb +9 -0
data/lib/bow_tfidf/bag_of_words.rb +59 -0
data/lib/bow_tfidf/classifier.rb +70 -0
data/lib/bow_tfidf/computation.rb +59 -0
data/lib/bow_tfidf/tokenizer.rb +39 -0
data/lib/bow_tfidf/version.rb +3 -0
metadata +105 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 6bfa8c6ca528c4eb7258f265958162243253bd8327b6bbccd91665d162360b5c
+  data.tar.gz: 3cd70368a67dd1ba28921b3438857b2a1c20bbef4b90e16ad2c9c964071199d2
+SHA512:
+  metadata.gz: 3258284a37b2ae2a90e0e63c16a6ec972bead603cfbdc0a0c4ffe1cb3fa27c79308a1bd98a9d289994e7b244292e5418bf8b704b04b7c1494701c0626473353e
+  data.tar.gz: a58ed2da2b29a666fbadafc099a999e3741c3fe78fbb931334c954f6ada09460d44e65791952e26148d267503f1676fdab7e12a21e23a4414d32b1b26694a11f

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in bow_tfidf.gemspec
+gemspec

data/README.md ADDED

@@ -0,0 +1,88 @@
+# BowTfidf
+Based on two concepts: TFIDF and bag-of-words.
+### TFIDF
+> TFIDF - In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.
+Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.
+Read more about TFIDF on [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
+### Bag-of-words
+>The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.
+The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
+Read more about Bag-of-words on [Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model).
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'bow_tfidf'
+```
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install bow_tfidf
+## Usage
+First, build a bag of words with a TFIDF value computed for each word. To do this, add labeled words to the bag of words as a hash:
+```ruby
+bow = BowTfidf::BagOfWords.new
+bow.add_labeled_data!({
+  category1: ['word', 'word1'],
+  category2: ['word', 'word2'],
+  category3: ['word', 'word2', 'word3']
+})
+```
+To identify the category of a text, pass an array of its words to the classifier:
+```ruby
+classifier = BowTfidf::Classifier.new(bow)
+classifier.call(['word2', 'word3'])
+# {
+#    category_key: :category3,
+#    score: {
+#        category3: 0.27185717486836963,
+#        category2: 0.09061905828945654
+#    }
+# }
+```
+`:category_key` - the predicted category of the text for the given words. It is derived from `:score`: the category with the highest score wins.
+`BowTfidf::Classifier` takes the numerical weight (TFIDF) relating each word to each category, sums these weights per category over the given words, and returns the resulting scores.
+### When the classifier cannot recognize a category:
+1. None of the given words is in the BOW.
+    - **Solution:** update the BOW with new words.
+2. Each of the given words belongs to all categories.
+    - In the current implementation, the TFIDF computation ignores such words and does not keep them in the BOW. This assumes that less frequent words exist.
+### Performance
+To improve performance and memory usage, create a dump of the built BOW with a light data structure (without the attributes the classifier does not need) and a custom classifier that can work with the dump.
+### Split text into words (tokens)
+```ruby
+BowTfidf::Tokenizer.new.call('word word2, some! text')
+# <Set: {"word", "word2", "some", "text"}>
+```
+## Development
+After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/isidzukuri/bow_tfidf.
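
The scores in the README usage example can be checked by hand. The sketch below is a standalone recomputation, not the gem's API: it assumes the formulas visible in `lib/bow_tfidf/computation.rb` (tf = log10(1 + entries), idf = log10(1 + categories / categories containing the word, using integer division as in the source). The names `LABELED`, `tf`, `idf`, `scores` and `README_SCORES` are hypothetical.

```ruby
# Standalone recomputation of the README example; the gem is not required.
# All names here are hypothetical helpers, not part of bow_tfidf.
LABELED = {
  category1: %w[word word1],
  category2: %w[word word2],
  category3: %w[word word2 word3]
}.freeze

# Term frequency as in computation.rb: log10(1 + number of entries).
def tf(entries)
  Math.log10(1 + entries)
end

# Inverse document frequency; the integer division mirrors the gem source.
def idf(total_categories, categories_with_word)
  return 0.0 if total_categories == categories_with_word
  Math.log10(1 + total_categories / categories_with_word)
end

# Sum tf * idf per category over the given tokens.
def scores(labeled, tokens)
  result = Hash.new(0.0)
  tokens.each do |token|
    holders = labeled.select { |_key, words| words.include?(token) }
    next if holders.empty? # unknown word: nothing to add
    weight = idf(labeled.size, holders.size)
    next if weight.zero?   # word occurs in every category: ignored
    holders.each do |key, words|
      result[key] += tf(words.count(token)) * weight
    end
  end
  result
end

README_SCORES = scores(LABELED, %w[word2 word3])
puts README_SCORES
```

Under these assumptions, `category3` scores roughly 0.2719 and `category2` roughly 0.0906, which agrees with the values quoted in the README, and `'word'` contributes nothing because it appears in all three categories.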

data/Rakefile ADDED

@@ -0,0 +1,2 @@
+require 'bundler/gem_tasks'
+task default: :spec

data/bin/console ADDED

@@ -0,0 +1,14 @@
+#!/usr/bin/env ruby
+require 'bundler/setup'
+require 'bow_tfidf'
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+require 'irb'
+IRB.start

data/bin/setup ADDED

@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
+set -vx
+bundle install
+# Do any other automated setup that you need to do here

data/bow_tfidf.gemspec ADDED

@@ -0,0 +1,36 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'bow_tfidf/version'
+Gem::Specification.new do |spec|
+  spec.name          = 'bow_tfidf'
+  spec.version       = BowTfidf::VERSION
+  spec.authors       = ['isidzukuri']
+  spec.email         = ['axesigon@gmail.com']
+  spec.summary       = 'Tf–idf is one of the most popular term-weighting schemes.'
+  spec.description   = 'In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.'
+  spec.homepage      = 'https://github.com/isidzukuri/bow_tfidf'
+  # Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host'
+  # to allow pushing to a single host or delete this section to allow pushing to any host.
+  if spec.respond_to?(:metadata)
+    spec.metadata['allowed_push_host'] = "https://rubygems.org"
+  else
+    raise 'RubyGems 2.0 or newer is required to protect against ' \
+      'public gem pushes.'
+  end
+  spec.files = `git ls-files -z`.split("\x0").reject do |f|
+    f.match(%r{^(test|spec|features)/})
+  end
+  spec.bindir        = 'exe'
+  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = ['lib']
+  spec.add_development_dependency 'bundler', '~> 1.13'
+  spec.add_development_dependency 'rake', '~> 10.0'
+  spec.add_development_dependency 'rspec'
+end

data/lib/bow_tfidf.rb ADDED

@@ -0,0 +1,9 @@
+require 'set'
+require 'bow_tfidf/version'
+require 'bow_tfidf/computation'
+require 'bow_tfidf/bag_of_words'
+require 'bow_tfidf/classifier'
+require 'bow_tfidf/tokenizer'
+module BowTfidf
+end

data/lib/bow_tfidf/bag_of_words.rb ADDED

@@ -0,0 +1,59 @@
+module BowTfidf
+  class BagOfWords
+    attr_reader :words, :categories
+    def initialize
+      @words = {}
+      @categories = {}
+    end
+    def add_labeled_data!(data)
+      validate_labeled_data(data)
+      data.each do |category_key, category_words|
+        category = category_by_key(category_key)
+        category_words.each do |word|
+          add_word(word, category)
+        end
+      end
+      compute_tfidf
+    end
+    private
+    def validate_labeled_data(data)
+      raise(ArgumentError, 'Hash with arrays expected') unless data.is_a?(Hash)
+      data.values.each do |array|
+        raise(ArgumentError, 'Hash with arrays expected') unless array.is_a?(Enumerable)
+        raise(ArgumentError, 'Hash with arrays of strings expected') unless array.all? { |value| value.is_a?(String) }
+      end
+    end
+    def add_word(word, category)
+      words[word] = { categories: {} } unless words[word]
+      words[word][:categories][category[:id]] ||= { entries: 0 }
+      words[word][:categories][category[:id]][:entries] += 1
+      categories[category[:key]][:words] << word
+    end
+    def category_by_key(key)
+      unless categories[key]
+        categories[key] = {
+          id: categories.length,
+          key: key,
+          words: Set[]
+        }
+      end
+      categories[key]
+    end
+    def compute_tfidf
+      Computation.new(self).call
+    end
+  end
+end
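
To make the shape of `words` and `categories` easier to picture, here is a minimal sketch that repeats the bookkeeping of `add_word` and `category_by_key` on plain hashes, without the TFIDF step. The method name `build_index` and the sample data are hypothetical.

```ruby
require 'set'

# Hypothetical stand-in for the bookkeeping in BagOfWords#add_labeled_data!.
# Returns [words, categories] in the same nested-hash shape as the diff.
def build_index(data)
  words = {}
  categories = {}

  data.each do |key, category_words|
    # The id is the number of categories seen before this one (0, 1, 2, ...).
    category = (categories[key] ||= { id: categories.length, key: key, words: Set[] })

    category_words.each do |word|
      words[word] ||= { categories: {} }
      relation = (words[word][:categories][category[:id]] ||= { entries: 0 })
      relation[:entries] += 1    # how often the word was given for this category
      category[:words] << word   # a Set, so duplicates collapse
    end
  end

  [words, categories]
end

words, categories = build_index(sports: %w[ball goal ball], music: %w[song])
p words['ball']        # entries are keyed by category id, not by category key
p categories[:sports]
```

Note that word entries are keyed by the numeric category id, which is why the classifier needs `category_by_id` to map a score back to a category key.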

data/lib/bow_tfidf/classifier.rb ADDED

@@ -0,0 +1,70 @@
+module BowTfidf
+  class Classifier
+    attr_reader :bow, :score
+    def initialize(bow)
+      raise(ArgumentError, 'BowTfidf::BagOfWords instance expected') unless bow.is_a?(BowTfidf::BagOfWords)
+      @bow = bow
+      @score = {}
+    end
+    def call(tokens)
+      raise(ArgumentError, 'Array of strings expected') unless tokens.is_a?(Array)
+      tokens.each do |word|
+        process_word(word)
+      end
+      result
+    end
+    def find_word(word)
+      bow.words[word]
+    end
+    def category_by_id(id)
+      return nil unless id
+      bow.categories.values.find { |category| category[:id] == id }
+    end
+    private
+    def process_word(word)
+      return unless (word_data = find_word(word.to_s))
+      word_data[:categories].each do |category_id, word_category_relation|
+        score[category_id] = 0 unless score[category_id]
+        score[category_id] += word_category_relation[:tfidf]
+      end
+    end
+    def category_by_highest_score
+      ranking = score.max_by { |_k, v| v }
+      return unless ranking
+      category_id = ranking[0]
+      category_by_id(category_id)[:key]
+    end
+    def display_score
+      sorted = score.sort_by { |_k, v| v }.reverse
+      result_hash = {}
+      sorted.each do |item|
+        key = category_by_id(item[0])[:key]
+        result_hash[key] = item[1]
+      end
+      result_hash
+    end
+    def result
+      {
+        category_key: category_by_highest_score,
+        score: display_score
+      }
+    end
+  end
+end
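
One detail worth noting in the classifier above: `@score` is created in `initialize` and `call` never clears it, so a reused instance appears to carry scores over from one call to the next. The sketch below illustrates that behaviour with a hypothetical stand-in class (`AccumulatingScorer`) and made-up weights; it is not the gem's class.

```ruby
# Hypothetical stand-in: keeps the score hash in an instance variable and
# never resets it, as the Classifier in this diff appears to do.
class AccumulatingScorer
  attr_reader :score

  # weights: { 'word' => { category => weight } } (made-up structure)
  def initialize(weights)
    @weights = weights
    @score = {}
  end

  def call(tokens)
    tokens.each do |token|
      (@weights[token] || {}).each do |category, weight|
        score[category] = (score[category] || 0) + weight
      end
    end
    score.dup # copy, so later calls do not mutate earlier results
  end
end

scorer = AccumulatingScorer.new('ruby' => { code: 1.0 })
p scorer.call(['ruby']) # first call
p scorer.call(['ruby']) # second call on the same instance adds to the first
```

If the real classifier behaves the same way, creating a new `BowTfidf::Classifier` for each text would keep the results independent.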

data/lib/bow_tfidf/computation.rb ADDED

@@ -0,0 +1,59 @@
+module BowTfidf
+  class Computation
+    attr_reader :bow
+    def initialize(bow)
+      raise(ArgumentError, 'BowTfidf::BagOfWords instance expected') unless bow.is_a?(BowTfidf::BagOfWords)
+      @bow = bow
+    end
+    def call
+      compute_idf
+      compute_tfidf
+      bow
+    end
+    private
+    def words
+      bow.words
+    end
+    def categories
+      bow.categories
+    end
+    def compute_idf
+      words.each do |word, attrs|
+        idf(attrs)
+        words.delete(word) if attrs[:idf] == 0.0
+      end
+    end
+    def idf(attrs)
+      if categories.length == attrs[:categories].length
+        attrs[:idf] = 0.0
+      else
+        # the number of categories / in how many occurs
+        attrs[:idf] = Math.log10(1 + categories.length / attrs[:categories].length)
+      end
+    end
+    def compute_tfidf
+      categories.values.each do |category|
+        category[:words].each do |category_word|
+          next unless words[category_word]
+          # how many times the word occurs in the category / the number of words in category
+          # tf = category_word_attrs[:entries].to_f/category_attrs[:words].length
+          tf = Math.log10(1 + words[category_word][:categories][category[:id]][:entries])
+          words[category_word][:categories][category[:id]][:tf] = tf
+          words[category_word][:categories][category[:id]][:tfidf] = tf * words[category_word][:idf]
+        end
+      end
+    end
+  end
+end
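
The `idf` method above divides two integers, so in Ruby the ratio is truncated before the logarithm is taken. The following sketch compares that with a float division; `idf_integer` and `idf_float` are hypothetical helper names, and the float variant is shown only for contrast, not as what the gem does.

```ruby
# Hypothetical helpers contrasting integer and float division in the idf term.

# Mirrors the expression in the diff: Integer / Integer truncates in Ruby.
def idf_integer(total, containing)
  Math.log10(1 + total / containing)
end

# Contrast only: converts to Float before dividing.
def idf_float(total, containing)
  Math.log10(1 + total.to_f / containing)
end

# With 3 categories and a word found in 2 of them, 3 / 2 is 1 for integers.
puts idf_integer(3, 2)
puts idf_float(3, 2)
```

With integer division, a word found in 2 of 3 categories gets the same idf as a word found in 3 of 5 or 4 of 5, since each ratio truncates to 1. The scores quoted in the README are consistent with the integer form.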

data/lib/bow_tfidf/tokenizer.rb ADDED

@@ -0,0 +1,39 @@
+module BowTfidf
+  class Tokenizer
+    SPLIT_REGEX = /[\s\n\t\.,\-\!:()\/%\\+\|@^<«>*'~;=»\?—•$”\"’\[£“■‘\{#®♦°™€¥\]©§\}–]/
+    TOKEN_MIN_LENGTH = 3
+    TOKEN_MAX_LENGTH = 15
+    attr_reader :tokens
+    def initialize
+      @tokens = Set[]
+    end
+    def call(text)
+      raise(ArgumentError, 'String instance expected') unless text.is_a?(String)
+      raw_tokens = split(text)
+      raw_tokens.each do |token|
+        process_token(token)
+      end
+      tokens
+    end
+    private
+    def split(text)
+      text.split(SPLIT_REGEX)
+    end
+    def process_token(token)
+      return if token.length < TOKEN_MIN_LENGTH
+      return if token.length > TOKEN_MAX_LENGTH
+      return if token.scan(/\D/).empty? # skip if str contains only digits
+      tokens << token.downcase
+    end
+  end
+end
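
The filtering rules in `process_token` (length between 3 and 15, digit-only tokens skipped, everything downcased) can be tried in isolation. This is a simplified sketch: `SIMPLE_SPLIT` is an assumed, much smaller pattern than `SPLIT_REGEX`, and `simple_tokens` is a hypothetical function, not the gem's `Tokenizer`.

```ruby
require 'set'

# Assumed, reduced split pattern: whitespace and a few punctuation marks.
SIMPLE_SPLIT = /[\s.,!?;:]/
MIN_LENGTH = 3
MAX_LENGTH = 15

# Hypothetical stand-in for Tokenizer#call with the same filtering rules.
def simple_tokens(text)
  text.split(SIMPLE_SPLIT).each_with_object(Set[]) do |token, set|
    next unless token.length.between?(MIN_LENGTH, MAX_LENGTH)
    next if token.match?(/\A\d+\z/) # skip tokens made of digits only
    set << token.downcase
  end
end

p simple_tokens('Hi Word 12345 word2, some! text')
```

Unlike this function, which returns a fresh set on every call, the `Tokenizer` in the diff stores its set in `@tokens` at construction time, so a reused instance would keep adding to the same set.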

data/lib/bow_tfidf/version.rb ADDED

@@ -0,0 +1,3 @@
+module BowTfidf
+  VERSION = '0.1.0'.freeze
+end

metadata ADDED

@@ -0,0 +1,105 @@
+--- !ruby/object:Gem::Specification
+name: bow_tfidf
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- isidzukuri
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2019-04-08 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.13'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.13'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description: In information retrieval, tf–idf or TFIDF, short for term frequency–inverse
+  document frequency, is a numerical statistic that is intended to reflect how important
+  a word is to a document in a collection or corpus. It is often used as a weighting
+  factor in searches of information retrieval, text mining, and user modeling. The
+  tf–idf value increases proportionally to the number of times a word appears in the
+  document and is offset by the number of documents in the corpus that contain the
+  word, which helps to adjust for the fact that some words appear more frequently
+  in general.
+email:
+- axesigon@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- Gemfile
+- README.md
+- Rakefile
+- bin/console
+- bin/setup
+- bow_tfidf.gemspec
+- lib/bow_tfidf.rb
+- lib/bow_tfidf/bag_of_words.rb
+- lib/bow_tfidf/classifier.rb
+- lib/bow_tfidf/computation.rb
+- lib/bow_tfidf/tokenizer.rb
+- lib/bow_tfidf/version.rb
+homepage: https://github.com/isidzukuri/bow_tfidf
+licenses: []
+metadata:
+  allowed_push_host: https://rubygems.org
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.7.4
+signing_key:
+specification_version: 4
+summary: Tf–idf is one of the most popular term-weighting schemes.
+test_files: []