RubyGems - classifier-reborn - Versions diffs - 2.2.0 → 2.3.0 - Mend

classifier-reborn 2.2.0 → 2.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (25)

checksums.yaml +5 -5
data/README.markdown +6 -8
data/lib/classifier-reborn/backends/bayes_memory_backend.rb +3 -1
data/lib/classifier-reborn/backends/bayes_redis_backend.rb +2 -0
data/lib/classifier-reborn/backends/no_redis_error.rb +5 -3
data/lib/classifier-reborn/bayes.rb +27 -15
data/lib/classifier-reborn/category_namer.rb +2 -0
data/lib/classifier-reborn/extensions/hasher.rb +20 -45
data/lib/classifier-reborn/extensions/token_filter/stemmer.rb +24 -0
data/lib/classifier-reborn/extensions/token_filter/stopword.rb +48 -0
data/lib/classifier-reborn/extensions/token_filter/symbol.rb +20 -0
data/lib/classifier-reborn/extensions/tokenizer/token.rb +36 -0
data/lib/classifier-reborn/extensions/tokenizer/whitespace.rb +28 -0
data/lib/classifier-reborn/extensions/vector.rb +13 -7
data/lib/classifier-reborn/extensions/vector_serialize.rb +3 -1
data/lib/classifier-reborn/extensions/zero_vector.rb +7 -0
data/lib/classifier-reborn/lsi/cached_content_node.rb +2 -0
data/lib/classifier-reborn/lsi/content_node.rb +23 -10
data/lib/classifier-reborn/lsi/summarizer.rb +3 -1
data/lib/classifier-reborn/lsi/word_list.rb +2 -0
data/lib/classifier-reborn/lsi.rb +79 -23
data/lib/classifier-reborn/validators/classifier_validator.rb +39 -38
data/lib/classifier-reborn/version.rb +3 -1
data/lib/classifier-reborn.rb +3 -1
metadata +36 -18

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: e626667dbd70c34cda4604852500af1bd9cc8f9f
-  data.tar.gz: 36dad105dca4770c1a5b708257c66721522d235e
+SHA256:
+  metadata.gz: '0100803f158326f660f53694ff5d0d400440792bb5174a10d80ae7eb780c5b6b'
+  data.tar.gz: 1f5a249471e67beb8796a0a61f47ea18fa2f0a252e832f03cb7e7b1937921fa5
 SHA512:
-  metadata.gz: aa08b0c7ad09138ac9efb0f55daa3d24a43b1c1388cef4ad2a71a9375620ce1794db66d3eca470c7772a80e4eac8e3c7a6811b20a3b65e2e93cfced86922554b
-  data.tar.gz: c137296ec3661043828e99abd11814b44a1d7e7ab6ca3d6f36ae54eaba1044d426847a9782bd106e082a6f7bacc22d72910eb51e6a37516017886a4c7df38a57
+  metadata.gz: e63b40492f9d35092353c198822f2ce444d05dec7613572048c3f420eecda4040c84026fe621ccb6c316e9862bc25258d47e32663168eb8f67c2b29b41733c57
+  data.tar.gz: abad42c42694cea59acf4bb59184a8f2aaa1d909826b126b4917b67b350c3ca9a14a3b688bd648e7ff8bba241a72e7846c749b91092e7ea91b5bc373c793b24f
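
The shorter digest in this file moves from SHA1 to SHA256, while SHA512 is kept. As a rough illustration of how digests of these two kinds can be produced, the sketch below hashes an in-memory string with Ruby's standard `digest` library. The helper name `checksums_for` and the sample payload are assumptions made for the example; a real check would hash the bytes of `metadata.gz` or `data.tar.gz` instead.

```ruby
require 'digest'

# Hypothetical helper: returns hex digests in the two algorithms that the
# 2.3.0 checksums.yaml lists.
def checksums_for(payload)
  {
    'SHA256' => Digest::SHA256.hexdigest(payload),
    'SHA512' => Digest::SHA512.hexdigest(payload)
  }
end

# Stand-in bytes, not the real archive contents.
checksums_for('example archive bytes').each do |name, hex|
  puts "#{name}: #{hex}"
end
```

A SHA256 hex digest is 64 characters long and a SHA512 one is 128, which matches the lengths of the values in the hunk above.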

data/README.markdown CHANGED

@@ -2,11 +2,9 @@
 [![Gem Version](https://badge.fury.io/rb/classifier-reborn.svg)](https://rubygems.org/gems/classifier-reborn)
 [![Build Status](https://img.shields.io/travis/jekyll/classifier-reborn/master.svg)](https://travis-ci.org/jekyll/classifier-reborn)
-[![Dependency Status](https://img.shields.io/gemnasium/jekyll/classifier-reborn.svg)](https://gemnasium.com/jekyll/classifier-reborn)
 ---
-## [Read the Docs](http://www.classifier-reborn.com/)
+## [Read the Docs](https://jekyll.github.io/classifier-reborn/)
 ## Getting Started
@@ -45,11 +43,11 @@ irb(main):013:0> lsi.find_related("This text is around cats!", 2)
 There is much more that can be done using Bayes and LSI beyond these quick examples.
 For more information read the following documentation topics.
-* [Installation and Dependencies](http://www.classifier-reborn.com/)
-* [Bayesian Classifier](http://www.classifier-reborn.com/bayes)
-* [Latent Semantic Indexer (LSI)](http://www.classifier-reborn.com/lsi)
-* [Classifier Validation](http://www.classifier-reborn.com/validation)
-* [Development and Contributions](http://www.classifier-reborn.com/development) (*Optional Docker instructions included*)
+* [Installation and Dependencies](https://jekyll.github.io/classifier-reborn/)
+* [Bayesian Classifier](https://jekyll.github.io/classifier-reborn/bayes)
+* [Latent Semantic Indexer (LSI)](https://jekyll.github.io/classifier-reborn/lsi)
+* [Classifier Validation](https://jekyll.github.io/classifier-reborn/validation)
+* [Development and Contributions](https://jekyll.github.io/classifier-reborn/development) (*Optional Docker instructions included*)
 ### Notes on JRuby support

data/lib/classifier-reborn/backends/bayes_memory_backend.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 module ClassifierReborn
   class BayesMemoryBackend
     attr_reader :total_words, :total_trainings
@@ -69,7 +71,7 @@ module ClassifierReborn
     private
     def category_counts(category)
-      @category_counts[category] ||= {training: 0, word: 0}
+      @category_counts[category] ||= { training: 0, word: 0 }
     end
   end
 end

data/lib/classifier-reborn/backends/bayes_redis_backend.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require_relative 'no_redis_error'
 # require redis when we run #intialize. This way only people using this backend
 # will need to install and load the backend without having to

data/lib/classifier-reborn/backends/no_redis_error.rb CHANGED

@@ -1,12 +1,14 @@
-class NoRedisError < LoadError
+# frozen_string_literal: true
+class NoRedisError < RuntimeError
   def initialize
     msg =
-      %q{The Redis Backend can only be used if Redis is installed.
+      %q(The Redis Backend can only be used if Redis is installed.
         This error is raised from 'lib/classifier-reborn/backends/bayes_redis_backend.rb'.
         If you have encountered this error and would like to use the Redis Backend,
         please run 'gem install redis' or include 'gem "redis"' in
         your gemfile. For more info see https://github.com/jekyll/classifier-reborn#usage.
-      }
+      )
     super(msg)
   end
 end
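
The parent class of `NoRedisError` changes from `LoadError` to `RuntimeError`. The diff does not state a motive, but one observable consequence is how a bare `rescue` treats the error: `RuntimeError` descends from `StandardError`, while `LoadError` descends from `ScriptError`. The sketch below uses two hypothetical stand-in classes, not the gem's own, to show the difference.

```ruby
# Hypothetical stand-ins for the two parent-class choices.
class LoadStyleError < LoadError; end
class RuntimeStyleError < RuntimeError; end

# Returns :rescued when a bare `rescue` (StandardError and its subclasses)
# catches the error, or :escaped when the error propagates past it.
def bare_rescue_outcome(error_class)
  begin
    raise error_class
  rescue
    :rescued
  end
rescue ScriptError
  :escaped
end

puts bare_rescue_outcome(RuntimeStyleError) # prints "rescued"
puts bare_rescue_outcome(LoadStyleError)    # prints "escaped"
```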

data/lib/classifier-reborn/bayes.rb CHANGED

@@ -1,9 +1,14 @@
+# frozen_string_literal: true
 # Author::    Lucas Carlson  (mailto:lucas@rufy.com)
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License::   LGPL
 require 'set'
+require_relative 'extensions/tokenizer/whitespace'
+require_relative 'extensions/token_filter/stopword'
+require_relative 'extensions/token_filter/stemmer'
 require_relative 'category_namer'
 require_relative 'backends/bayes_memory_backend'
 require_relative 'backends/bayes_redis_backend'
@@ -26,12 +31,11 @@ module ClassifierReborn
     #   backend:          BayesMemoryBackend.new  Alternatively, BayesRedisBackend.new for persistent storage
     def initialize(*args)
       @initial_categories = []
-      options = { language:         'en',
+      options = { language: 'en',
                   enable_threshold: false,
-                  threshold:        0.0,
-                  enable_stemmer:   true,
-                  backend:          BayesMemoryBackend.new
-                }
+                  threshold: 0.0,
+                  enable_stemmer: true,
+                  backend: BayesMemoryBackend.new }
       args.flatten.each do |arg|
         if arg.is_a?(Hash)
           options.merge!(arg)
@@ -50,12 +54,14 @@ module ClassifierReborn
       @threshold           = options[:threshold]
       @enable_stemmer      = options[:enable_stemmer]
       @backend             = options[:backend]
+      @tokenizer           = options[:tokenizer] || Tokenizer::Whitespace
+      @token_filters       = options[:token_filters] || [TokenFilter::Stopword]
+      @token_filters << TokenFilter::Stemmer if @enable_stemmer && !@token_filters.include?(TokenFilter::Stemmer)
+      TokenFilter::Stopword.language = @language if @token_filters.include?(TokenFilter::Stopword)
       populate_initial_categories
-      if options.key?(:stopwords)
-        custom_stopwords options[:stopwords]
-      end
+      custom_stopwords options[:stopwords] if options.key?(:stopwords)
     end
     # Provides a general training method for all categories specified in Bayes#new
@@ -65,8 +71,10 @@ module ClassifierReborn
     #     b.train "that", "That text"
     #     b.train "The other", "The other text"
     def train(category, text)
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       return if word_hash.empty?
       category = CategoryNamer.prepare_name(category)
       # Add the category dynamically or raise an error
@@ -95,11 +103,14 @@ module ClassifierReborn
     #     b.train :this, "This text"
     #     b.untrain :this, "This text"
     def untrain(category, text)
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       return if word_hash.empty?
       category = CategoryNamer.prepare_name(category)
       word_hash.each do |word, count|
         next if @backend.total_words < 0
         orig = @backend.category_word_frequency(category, word) || 0
         @backend.update_category_word_frequency(category, word, -count)
         if @backend.category_word_frequency(category, word) <= 0
@@ -120,7 +131,8 @@ module ClassifierReborn
     # The largest of these scores (the one closest to 0) is the one picked out by #classify
     def classifications(text)
       score = {}
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       if word_hash.empty?
         category_keys.each do |category|
           score[category.to_s] = Float::INFINITY
@@ -152,7 +164,7 @@ module ClassifierReborn
     # Return the classification without the score
     def classify(text)
       result, score = classify_with_score(text)
-      result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?
+      result = nil if threshold_enabled? && (score < @threshold || score == Float::INFINITY)
       result
     end
@@ -240,7 +252,7 @@ module ClassifierReborn
       @backend.add_category(category)
     end
-    alias_method :append_category, :add_category
+    alias append_category add_category
     def reset
       @backend.reset
@@ -261,12 +273,12 @@ module ClassifierReborn
         if stopwords.strip.empty?
           stopwords = []
         elsif File.exist?(stopwords)
-          stopwords = File.read(stopwords).force_encoding("utf-8").split
+          stopwords = File.read(stopwords).force_encoding('utf-8').split
         else
           return # Do not overwrite the default
         end
       end
-      Hasher::STOPWORDS[@language] = Set.new stopwords
+      TokenFilter::Stopword::STOPWORDS[@language] = Set.new stopwords
     end
   end
 end
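
The `classify` hunk above replaces a doubled `if` modifier with a single combined condition. The two forms appear to behave the same; the rewrite mainly makes explicit that the threshold check only applies when thresholding is enabled. A minimal sketch of that guard, with a hypothetical method name and keyword arguments standing in for the instance state:

```ruby
# Hypothetical helper mirroring the rewritten guard in Bayes#classify.
# `enabled` stands in for threshold_enabled? and `threshold` for @threshold.
def apply_threshold(result, score, threshold:, enabled:)
  return nil if enabled && (score < threshold || score == Float::INFINITY)

  result
end

# Threshold disabled: the result passes through whatever the score is.
p apply_threshold('Spam', -10.0, threshold: -5.0, enabled: false)
# Threshold enabled and score below it: the result is dropped.
p apply_threshold('Spam', -10.0, threshold: -5.0, enabled: true)
```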

data/lib/classifier-reborn/category_namer.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 # Author::    Lucas Carlson  (mailto:lucas@rufy.com)
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License::   LGPL

data/lib/classifier-reborn/extensions/hasher.rb CHANGED

@@ -1,67 +1,42 @@
-# encoding: utf-8
+# frozen_string_literal: true
 # Author::    Lucas Carlson  (mailto:lucas@rufy.com)
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License::   LGPL
 require 'set'
+require_relative 'tokenizer/whitespace'
+require_relative 'token_filter/stopword'
+require_relative 'token_filter/stemmer'
 module ClassifierReborn
   module Hasher
-    STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]
     module_function
     # Return a Hash of strings => ints. Each word in the string is stemmed,
     # interned, and indexes to its frequency in the document.
-    def word_hash(str, language = 'en', enable_stemmer = true)
-      cleaned_word_hash = clean_word_hash(str, language, enable_stemmer)
-      symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/))
-      cleaned_word_hash.merge(symbol_hash)
-    end
-    # Return a word hash without extra punctuation or short symbols, just stemmed words
-    def clean_word_hash(str, language = 'en', enable_stemmer = true)
-      word_hash_for_words(str.gsub(/[^\p{WORD}\s]/, '').downcase.split, language, enable_stemmer)
-    end
-    def word_hash_for_words(words, language = 'en', enable_stemmer = true)
-      d = Hash.new(0)
-      words.each do |word|
-        next unless word.length > 2 && !STOPWORDS[language].include?(word)
-        if enable_stemmer
-          d[word.stem.intern] += 1
-        else
-          d[word.intern] += 1
+    def word_hash(str, enable_stemmer = true,
+                  tokenizer: Tokenizer::Whitespace,
+                  token_filters: [TokenFilter::Stopword])
+      if token_filters.include?(TokenFilter::Stemmer)
+        unless enable_stemmer
+          token_filters.reject! do |token_filter|
+            token_filter == TokenFilter::Stemmer
+          end
         end
+      else
+        token_filters << TokenFilter::Stemmer if enable_stemmer
+      end
+      words = tokenizer.call(str)
+      token_filters.each do |token_filter|
+        words = token_filter.call(words)
       end
-      d
-    end
-    # Add custom path to a new stopword file created by user
-    def add_custom_stopword_path(path)
-      STOPWORDS_PATH.unshift(path)
-    end
-    def word_hash_for_symbols(words)
       d = Hash.new(0)
       words.each do |word|
         d[word.intern] += 1
       end
       d
     end
-    # Create a lazily-loaded hash of stopword data
-    STOPWORDS = Hash.new do |hash, language|
-      hash[language] = []
-      STOPWORDS_PATH.each do |path|
-        if File.exist?(File.join(path, language))
-          hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
-          break
-        end
-      end
-      hash[language]
-    end
   end
 end
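
After this change `Hasher.word_hash` is a small pipeline: a tokenizer turns the string into tokens, each token filter transforms the token list in turn, and the survivors are counted. The sketch below imitates that flow without the gem. The lambdas, the stopword list, and the name `sketch_word_hash` are illustrative assumptions; anything that responds to `call` would fit the same slots.

```ruby
# Hypothetical stand-ins for a tokenizer and a stopword filter.
SKETCH_STOPWORDS = %w[the and was].freeze
SKETCH_TOKENIZER = ->(str) { str.downcase.split }
SKETCH_STOPWORD_FILTER = lambda do |tokens|
  tokens.reject { |t| t.length <= 2 || SKETCH_STOPWORDS.include?(t) }
end

# Tokenize, run every filter in order, then count the remaining tokens.
def sketch_word_hash(str, tokenizer: SKETCH_TOKENIZER,
                     token_filters: [SKETCH_STOPWORD_FILTER])
  words = tokenizer.call(str)
  token_filters.each { |filter| words = filter.call(words) }
  counts = Hash.new(0)
  words.each { |word| counts[word.intern] += 1 }
  counts
end

p sketch_word_hash('The cat and the other cat')
```

For this input the result maps `:cat` to 2 and `:other` to 1. One difference from the hunk above: the real method adds or removes `TokenFilter::Stemmer` on the `token_filters` array it is given (`<<` and `reject!`), so a caller-supplied array can be modified, whereas this sketch leaves its arguments untouched.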

data/lib/classifier-reborn/extensions/token_filter/stemmer.rb ADDED

@@ -0,0 +1,24 @@
+# frozen_string_literal: true
+# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License::   LGPL
+module ClassifierReborn
+  module TokenFilter
+    # This filter converts given tokens to their stemmed versions.
+    module Stemmer
+      module_function
+      def call(tokens)
+        tokens.collect do |token|
+          if token.stemmable?
+            token.stem
+          else
+            token
+          end
+        end
+      end
+    end
+  end
+end

data/lib/classifier-reborn/extensions/token_filter/stopword.rb ADDED

@@ -0,0 +1,48 @@
+# frozen_string_literal: true
+# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License::   LGPL
+module ClassifierReborn
+  module TokenFilter
+    # This filter removes stopwords in the language, from given tokens.
+    module Stopword
+      STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../../data/stopwords')]
+      @language = 'en'
+      module_function
+      def call(tokens)
+        tokens.reject do |token|
+          token.maybe_stopword? &&
+            (token.length <= 2 || STOPWORDS[@language].include?(token))
+        end
+      end
+      # Add custom path to a new stopword file created by user
+      def add_custom_stopword_path(path)
+        STOPWORDS_PATH.unshift(path)
+      end
+      # Create a lazily-loaded hash of stopword data
+      STOPWORDS = Hash.new do |hash, language|
+        hash[language] = []
+        STOPWORDS_PATH.each do |path|
+          if File.exist?(File.join(path, language))
+            hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding('utf-8').split
+            break
+          end
+        end
+        hash[language]
+      end
+      # Changes the language of stopwords
+      def language=(language)
+        @language = language
+      end
+    end
+  end
+end
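
`STOPWORDS` relies on the block form of `Hash.new`: the block runs only the first time a language key is looked up, and the stored result is reused afterwards. The sketch below shows that lazy-loading pattern with in-memory strings in place of the stopword files; every constant name in it is hypothetical.

```ruby
require 'set'

# Hypothetical in-memory "files"; the gem reads real files from STOPWORDS_PATH.
SKETCH_STOPWORD_FILES = { 'en' => "the and was\n", 'fr' => "le la les\n" }.freeze

# Records which languages triggered a load, to make the laziness visible.
SKETCH_LOAD_LOG = []

SKETCH_LAZY_STOPWORDS = Hash.new do |hash, language|
  SKETCH_LOAD_LOG << language
  hash[language] = Set.new(SKETCH_STOPWORD_FILES.fetch(language, '').split)
end

SKETCH_LAZY_STOPWORDS['en'] # first lookup runs the block
SKETCH_LAZY_STOPWORDS['en'] # second lookup hits the stored Set
puts SKETCH_LOAD_LOG.length # prints 1
```

Unlike the gem's version, this sketch stores an empty `Set` rather than an empty `Array` for an unknown language; that is a simplification, not a description of the gem.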

data/lib/classifier-reborn/extensions/token_filter/symbol.rb ADDED

@@ -0,0 +1,20 @@
+# frozen_string_literal: true
+# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License::   LGPL
+module ClassifierReborn
+  module TokenFilter
+    # This filter removes symbol-only terms, from given tokens.
+    module Symbol
+      module_function
+      def call(tokens)
+        tokens.reject do |token|
+          /[^\s\p{WORD}]/ === token
+        end
+      end
+    end
+  end
+end

data/lib/classifier-reborn/extensions/tokenizer/token.rb ADDED

@@ -0,0 +1,36 @@
+# frozen_string_literal: true
+# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License::   LGPL
+module ClassifierReborn
+  module Tokenizer
+    class Token < String
+      # The class can be created with one token string and extra attributes. E.g.,
+      #      t = ClassifierReborn::Tokenizer::Token.new 'Tokenize', stemmable: true, maybe_stopword: false
+      #
+      # Attributes available are:
+      #   stemmable:        true  Possibility that the token can be stemmed. This must be false for un-stemmable terms, otherwise this should be true.
+      #   maybe_stopword:   true  Possibility that the token is a stopword. This must be false for terms which never been stopword, otherwise this should be true.
+      def initialize(string, stemmable: true, maybe_stopword: true)
+        super(string)
+        @stemmable = stemmable
+        @maybe_stopword = maybe_stopword
+      end
+      def stemmable?
+        @stemmable
+      end
+      def maybe_stopword?
+        @maybe_stopword
+      end
+      def stem
+        stemmed = super
+        self.class.new(stemmed, stemmable: @stemmable, maybe_stopword: @maybe_stopword)
+      end
+    end
+  end
+end
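
`Token` is a `String` subclass that carries two extra flags, so filters can skip tokens that should never be stemmed or treated as stopwords. The sketch below re-creates only the flag-carrying part under a hypothetical class name. It leaves out `stem`, because that method depends on the stemming extension the gem loads.

```ruby
# Hypothetical, reduced re-creation of the flag-carrying String subclass.
class SketchToken < String
  def initialize(string, stemmable: true, maybe_stopword: true)
    super(string)
    @stemmable = stemmable
    @maybe_stopword = maybe_stopword
  end

  def stemmable?
    @stemmable
  end

  def maybe_stopword?
    @maybe_stopword
  end
end

word = SketchToken.new('cats')
mark = SketchToken.new('!', stemmable: false, maybe_stopword: false)

puts word == 'cats'   # prints true: it still compares as a plain String
puts mark.stemmable?  # prints false
```

The flags live in instance variables of the wrapping object, which is presumably why the `stem` method in the hunk above builds a new `Token` around the stemmed string and passes the flags along.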

data/lib/classifier-reborn/extensions/tokenizer/whitespace.rb ADDED

@@ -0,0 +1,28 @@
+# frozen_string_literal: true
+# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License::   LGPL
+require_relative 'token'
+module ClassifierReborn
+  module Tokenizer
+    # This tokenizes given input as white-space separated terms.
+    # It mainly aims to tokenize sentences written with a space between words, like English, French, and others.
+    module Whitespace
+      module_function
+      def call(str)
+        tokens = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.collect do |word|
+          Token.new(word, stemmable: true, maybe_stopword: true)
+        end
+        symbol_tokens = str.scan(/[^\s\p{WORD}]/).collect do |word|
+          Token.new(word, stemmable: false, maybe_stopword: false)
+        end
+        tokens += symbol_tokens
+        tokens
+      end
+    end
+  end
+end
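
The tokenizer makes two passes over the input with complementary character classes: one strips everything that is neither a word character nor whitespace and then splits on whitespace, the other collects exactly the characters the first pass stripped. The sketch below applies the same two expressions to a plain string and returns plain strings rather than `Token` objects; the method name is hypothetical.

```ruby
# Hypothetical helper applying the two regular expressions from
# Tokenizer::Whitespace.call and returning [words, symbols].
def sketch_split(str)
  words = str.gsub(/[^\p{WORD}\s]/, '').downcase.split
  symbols = str.scan(/[^\s\p{WORD}]/)
  [words, symbols]
end

words, symbols = sketch_split('Hello, world!')
p words   # ["hello", "world"]
p symbols # [",", "!"]
```

A side effect of the first pass is that in-word punctuation is dropped rather than split on, so `don't` becomes the single word `dont`, with the apostrophe reported separately as a symbol.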

data/lib/classifier-reborn/extensions/vector.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 # Author::    Ernest Ellingson
 # Copyright:: Copyright (c) 2005
@@ -10,14 +12,14 @@ class Matrix
     Matrix.diagonal(*s)
   end
-  alias_method :trans, :transpose
+  alias trans transpose
   def SV_decomp(maxSweeps = 20)
-    if row_size >= column_size
-      q = trans * self
-    else
-      q = self * trans
-    end
+    q = if row_size >= column_size
+          trans * self
+        else
+          self * trans
+        end
     qrot    = q.dup
     v       = Matrix.identity(q.row_size)
@@ -31,7 +33,11 @@ class Matrix
         (1..qrot.row_size - 1).each do |col|
           next if row == col
-          h = Math.atan((2 * qrot[row, col]) / (qrot[row, row] - qrot[col, col])) / 2.0
+          h = if (2.0 * qrot[row, col]) == (qrot[row, row] - qrot[col, col])
+                Math.atan(1) / 2.0
+              else
+                Math.atan((2.0 * qrot[row, col]) / (qrot[row, row] - qrot[col, col])) / 2.0
+              end
           hcos = Math.cos(h)
           hsin = Math.sin(h)
           mzrot = Matrix.identity(qrot.row_size)
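
The second hunk changes how the rotation angle `h` is computed. It switches the literal `2` to `2.0` and adds a branch for the case where numerator and denominator are equal, which includes the case where both are zero. The sketch below isolates that arithmetic; the method and parameter names are hypothetical stand-ins for `qrot[row, col]`, `qrot[row, row]` and `qrot[col, col]`.

```ruby
# Hypothetical extraction of the angle computation from the 2.3.0 hunk.
def rotation_angle(off_diag, diag_a, diag_b)
  numerator = 2.0 * off_diag
  denominator = diag_a - diag_b
  if numerator == denominator
    # Covers 0/0, which would otherwise produce NaN in float arithmetic.
    Math.atan(1) / 2.0
  else
    Math.atan(numerator / denominator) / 2.0
  end
end

puts rotation_angle(0, 3, 3) == Math::PI / 8 # prints true
puts rotation_angle(0, 5, 3)                 # prints 0.0
```

If the matrix entries are Integers, the old expression `(2 * x) / (a - b)` would have used integer division and raised `ZeroDivisionError` when the two diagonal entries were equal; with `2.0` the division is done in floating point. That reading is an inference from the hunk, not something the diff states.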

data/lib/classifier-reborn/extensions/vector_serialize.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 module GSL
   class Vector
     def _dump(_v)
@@ -12,7 +14,7 @@ module GSL
   class Matrix
     class <<self
-       alias_method :diag, :diagonal
+       alias diag diagonal
     end
   end
 end

data/lib/classifier-reborn/extensions/zero_vector.rb ADDED

@@ -0,0 +1,7 @@
+# frozen_string_literal: true
+class Vector
+  def zero?
+    all?(&:zero?)
+  end
+end
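
`Vector#zero?` is used by the pure-Ruby branch of `build_index` in lsi.rb to skip normalization for all-zero columns, since a zero vector has no direction to normalize to. The sketch below shows the same guard on plain Arrays, so it runs without the `matrix` library; `safe_normalize` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper: returns the input unchanged when every component is
# zero, otherwise scales it to unit Euclidean length.
def safe_normalize(values)
  return values if values.all?(&:zero?)

  norm = Math.sqrt(values.sum { |v| v * v })
  values.map { |v| v / norm }
end

p safe_normalize([3.0, 4.0]) # components 0.6 and 0.8
p safe_normalize([0.0, 0.0]) # returned as-is instead of dividing by zero
```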

data/lib/classifier-reborn/lsi/cached_content_node.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 # Author::    Kelley Reynolds  (mailto:kelley@insidesystems.net)
 # Copyright:: Copyright (c) 2015 Kelley Reynolds
 # License::   LGPL

data/lib/classifier-reborn/lsi/content_node.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 # Author::    David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License::   LGPL
@@ -27,7 +29,11 @@ module ClassifierReborn
     # Method to access the transposed search vector
     def transposed_search_vector
-      search_vector.col
+      if $SVD == :numo
+        search_vector
+      else
+        search_vector.col
+      end
     end
     # Use this to fetch the appropriate search vector in normalized form.
@@ -38,18 +44,22 @@ module ClassifierReborn
     # Creates the raw vector out of word_hash using word_list as the
     # key for mapping the vector space.
     def raw_vector_with(word_list)
-      if $GSL
-        vec = GSL::Vector.alloc(word_list.size)
-      else
-        vec = Array.new(word_list.size, 0)
-      end
+      vec = if $SVD == :numo
+              Numo::DFloat.zeros(word_list.size)
+            elsif $SVD == :gsl
+              GSL::Vector.alloc(word_list.size)
+            else
+              Array.new(word_list.size, 0)
+            end
       @word_hash.each_key do |word|
         vec[word_list[word]] = @word_hash[word] if word_list[word]
       end
       # Perform the scaling transform and force floating point arithmetic
-      if $GSL
+      if $SVD == :numo
+        total_words = vec.sum.to_f
+      elsif $SVD == :gsl
         sum = 0.0
         vec.each { |v| sum += v }
         total_words = sum
@@ -59,7 +69,7 @@ module ClassifierReborn
       total_unique_words = 0
-      if $GSL
+      if [:numo, :gsl].include?($SVD)
         vec.each { |word| total_unique_words += 1 if word != 0.0 }
       else
         total_unique_words = vec.count { |word| word != 0 }
@@ -83,12 +93,15 @@ module ClassifierReborn
           hash[val] = Math.log(val + 1) / -weighted_total
         end
-        vec.collect! do |val|
+        vec = vec.map do |val|
           cached_calcs[val]
         end
       end
-      if $GSL
+      if $SVD == :numo
+        @raw_norm   = vec / Numo::Linalg.norm(vec)
+        @raw_vector = vec
+      elsif $SVD == :gsl
         @raw_norm   = vec.normalize
         @raw_vector = vec
       else
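
The `$SVD` global that this file now branches on takes one of three values, `:numo`, `:gsl` or `:ruby`, and is set once when lsi.rb loads by trying each library in turn and falling back on `LoadError`. A generic sketch of that try-in-order pattern follows. The method name and the placeholder library names are assumptions, and the sketch returns the choice instead of assigning a global.

```ruby
# Hypothetical helper: tries each [library, backend] pair in order and
# returns the backend of the first library that loads.
def pick_backend(candidates, fallback: :ruby)
  candidates.each do |library, backend|
    begin
      require library
      return backend
    rescue LoadError
      next
    end
  end
  fallback
end

# Placeholder library names that are not expected to exist.
puts pick_backend([['no_such_library_for_sketch_a', :numo],
                   ['no_such_library_for_sketch_b', :gsl]]) # prints "ruby"
```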

data/lib/classifier-reborn/lsi/summarizer.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 # Author::    Lucas Carlson  (mailto:lucas@rufy.com)
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License::   LGPL
@@ -27,7 +29,7 @@ module ClassifierReborn
       chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
       lsi.build_index
       summaries = lsi.highest_relative_content count
-      summaries.reject { |chunk| !summaries.include? chunk }.map(&:strip).join(separator)
+      summaries.select { |chunk| summaries.include? chunk }.map(&:strip).join(separator)
     end
   end
 end

data/lib/classifier-reborn/lsi/word_list.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 # Author::    David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License::   LGPL

data/lib/classifier-reborn/lsi.rb CHANGED

@@ -1,23 +1,39 @@
+# frozen_string_literal: true
 # Author::    David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License::   LGPL
+# Try to load Numo first - it's the most current and the most well-supported.
+# Fall back to GSL.
+# Fall back to native vector.
 begin
   raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+  raise LoadError if ENV['GSL'] == 'true' # to test with gsl, try `rake test GSL=true`
-  require 'gsl' # requires https://github.com/SciRuby/rb-gsl
-  require_relative 'extensions/vector_serialize'
-  $GSL = true
+  require 'numo/narray' # https://ruby-numo.github.io/narray/
+  require 'numo/linalg' # https://ruby-numo.github.io/linalg/
+  $SVD = :numo
 rescue LoadError
-  $GSL = false
-  require_relative 'extensions/vector'
+  begin
+    raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+    require 'gsl' # requires https://github.com/SciRuby/rb-gsl
+    require_relative 'extensions/vector_serialize'
+    $SVD = :gsl
+  rescue LoadError
+    $SVD = :ruby
+    require_relative 'extensions/vector'
+    require_relative 'extensions/zero_vector'
+  end
 end
 require_relative 'lsi/word_list'
 require_relative 'lsi/content_node'
 require_relative 'lsi/cached_content_node'
 require_relative 'lsi/summarizer'
+require_relative 'extensions/token_filter/stopword'
+require_relative 'extensions/token_filter/symbol'
 module ClassifierReborn
   # This class implements a Latent Semantic Indexer, which can search, classify and cluster
@@ -40,6 +56,11 @@ module ClassifierReborn
       @version = 0
       @built_at_version = -1
       @language = options[:language] || 'en'
+      @token_filters = [
+        TokenFilter::Stopword,
+        TokenFilter::Symbol
+      ]
+      TokenFilter::Stopword.language = @language
       extend CachedContentNode::InstanceMethods if @cache_node_vectors = options[:cache_node_vectors]
     end
@@ -64,7 +85,8 @@ module ClassifierReborn
     #   lsi.add_item ar, *ar.categories { |x| ar.content }
     #
     def add_item(item, *categories, &block)
-      clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language)
+      clean_word_hash = Hasher.word_hash((block ? yield(item) : item.to_s),
+                                         token_filters: @token_filters)
       if clean_word_hash.empty?
         puts "Input: '#{item}' is entirely stopwords or words with 2 or fewer characters. Classifier-Reborn cannot handle this document properly."
       else
@@ -124,12 +146,21 @@ module ClassifierReborn
     # turning the LSI class into a simple vector search engine.
     def build_index(cutoff = 0.75)
       return unless needs_rebuild?
       make_word_list
       doc_list = @items.values
       tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }
-      if $GSL
+      if $SVD == :numo
+        tdm = Numo::NArray.asarray(tda.map(&:to_a)).transpose
+        ntdm = numo_build_reduced_matrix(tdm, cutoff)
+        ntdm.each_over_axis(1).with_index do |col_vec, i|
+          doc_list[i].lsi_vector = col_vec
+          doc_list[i].lsi_norm = col_vec / Numo::Linalg.norm(col_vec)
+        end
+      elsif $SVD == :gsl
         tdm = GSL::Matrix.alloc(*tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)
@@ -142,9 +173,13 @@ module ClassifierReborn
         tdm = Matrix.rows(tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)
-        ntdm.row_size.times do |col|
+        ntdm.column_size.times do |col|
           doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col]
-          doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
+          if ntdm.column(col).zero?
+            doc_list[col].lsi_norm = ntdm.column(col) if doc_list[col]
+          else
+            doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
+          end
         end
       end
@@ -186,11 +221,13 @@ module ClassifierReborn
       content_node = node_for_content(doc, &block)
       result =
         @items.keys.collect do |item|
-          if $GSL
-            val = content_node.search_vector * @items[item].transposed_search_vector
-          else
-            val = (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
-          end
+          val = if $SVD == :numo
+                  content_node.search_vector.dot(@items[item].transposed_search_vector)
+                elsif $SVD == :gsl
+                  content_node.search_vector * @items[item].transposed_search_vector
+                else
+                  (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
+                end
           [item, val]
         end
       result.sort_by { |x| x[1] }.reverse
@@ -205,7 +242,8 @@ module ClassifierReborn
       return [] if needs_rebuild?
       content_node = node_for_content(doc, &block)
-      if $GSL && content_node.raw_norm.isnan?.all?
+      if ($SVD == :gsl && content_node.raw_norm.isnan?.all?) ||
+          ($SVD == :numo && content_node.raw_norm.isnan.all?)
         puts "There are no documents that are similar to #{doc}"
       else
         content_node_norms(content_node)
@@ -215,11 +253,13 @@ module ClassifierReborn
     def content_node_norms(content_node)
       result =
         @items.keys.collect do |item|
-          if $GSL
-            val = content_node.search_norm * @items[item].search_norm.col
-          else
-            val = (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
-          end
+          val = if $SVD == :numo
+                  content_node.search_norm.dot(@items[item].search_norm)
+                elsif $SVD == :gsl
+                  content_node.search_norm * @items[item].search_norm.col
+                else
+                  (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
+                end
           [item, val]
         end
       result.sort_by { |x| x[1] }.reverse
@@ -234,6 +274,7 @@ module ClassifierReborn
     # it is actually the same algorithm, just applied on a smaller document.
     def search(string, max_nearest = 3)
       return [] if needs_rebuild?
       carry = proximity_norms_for_content(string)
       unless carry.nil?
         result = carry.collect { |x| x[0] }
@@ -295,6 +336,7 @@ module ClassifierReborn
     # it's supposed to.
     def highest_ranked_stems(doc, count = 3)
       raise 'Requested stem ranking on non-indexed content!' unless @items[doc]
       content_vector_array = node_for_content(doc).lsi_vector.to_a
       top_n = content_vector_array.sort.reverse[0..count - 1]
       top_n.collect { |x| @word_list.word_for_index(content_vector_array.index(x)) }
@@ -315,14 +357,28 @@ module ClassifierReborn
         s[ord] = 0.0 if s[ord] < s_cutoff
       end
       # Reconstruct the term document matrix, only with reduced rank
-      u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+      u * ($SVD == :gsl ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+    end
+    def numo_build_reduced_matrix(matrix, cutoff = 0.75)
+      s, u, vt = Numo::Linalg.svd(matrix, driver: 'svd', job: 'S')
+      # TODO: Better than 75% term (as above)
+      s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
+      s.size.times do |ord|
+        s[ord] = 0.0 if s[ord] < s_cutoff
+      end
+      # Reconstruct the term document matrix, only with reduced rank
+      u.dot(::Numo::DFloat.eye(s.size) * s).dot(vt)
     end
     def node_for_content(item, &block)
       if @items[item]
         return @items[item]
       else
-        clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language)
+        clean_word_hash = Hasher.word_hash((block ? yield(item) : item.to_s),
+                                           token_filters: @token_filters)
         content_node = ContentNode.new(clean_word_hash, &block) # make the node and extract the data
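
Both `build_reduced_matrix` and the new `numo_build_reduced_matrix` pick a cutoff among the singular values and zero everything smaller before rebuilding the matrix. The sketch below reproduces only that selection step on a plain Array, under a hypothetical method name, to make the index arithmetic easier to follow.

```ruby
# Hypothetical helper: zeroes every singular value below the one found at
# position (size * cutoff).round - 1 of the descending-sorted list.
def reduce_singular_values(values, cutoff = 0.75)
  s_cutoff = values.sort.reverse[(values.size * cutoff).round - 1]
  values.map { |v| v < s_cutoff ? 0.0 : v }
end

p reduce_singular_values([5.0, 3.0, 2.0, 1.0]) # [5.0, 3.0, 2.0, 0.0]
```

With four values and the default cutoff the index is `(4 * 0.75).round - 1 = 2`, so the threshold is the third-largest value, 2.0, and only 1.0 is zeroed. When `(size * cutoff).round` is 0 the index becomes -1, which selects the smallest value, so nothing is zeroed in that case.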

data/lib/classifier-reborn/validators/classifier_validator.rb CHANGED

@@ -1,10 +1,11 @@
+# frozen_string_literal: true
 module ClassifierReborn
   module ClassifierValidator
     module_function
-    def cross_validate(classifier, sample_data, fold=10, *options)
-      classifier = ClassifierReborn::const_get(classifier).new(options) if classifier.is_a?(String)
+    def cross_validate(classifier, sample_data, fold = 10, *options)
+      classifier = ClassifierReborn.const_get(classifier).new(options) if classifier.is_a?(String)
       sample_data.shuffle!
       partition_size = sample_data.length / fold
       partitioned_data = sample_data.each_slice(partition_size)
@@ -14,13 +15,13 @@ module ClassifierReborn
         test_data = training_data.slice!(i)
         conf_mats << validate(classifier, training_data.flatten!(1), test_data)
       end
-      classifier.reset()
+      classifier.reset
       generate_report(conf_mats)
     end
     def validate(classifier, training_data, test_data, *options)
-      classifier = ClassifierReborn::const_get(classifier).new(options) if classifier.is_a?(String)
-      classifier.reset()
+      classifier = ClassifierReborn.const_get(classifier).new(options) if classifier.is_a?(String)
+      classifier.reset
       training_data.each do |rec|
         classifier.train(rec.first, rec.last)
       end
@@ -40,25 +41,25 @@ module ClassifierReborn
     def generate_report(*conf_mats)
       conf_mats.flatten!
       accumulated_conf_mat = conf_mats.length == 1 ? conf_mats.first : empty_conf_mat(conf_mats.first.keys.sort)
-      header = "Run     Total   Correct Incorrect  Accuracy"
+      header = 'Run     Total   Correct Incorrect  Accuracy'
       puts
-      puts " Run Report ".center(header.length, "-")
+      puts ' Run Report '.center(header.length, '-')
       puts header
-      puts "-" * header.length
+      puts '-' * header.length
       if conf_mats.length > 1
         conf_mats.each_with_index do |conf_mat, i|
           run_report = build_run_report(conf_mat)
-          print_run_report(run_report, i+1)
+          print_run_report(run_report, i + 1)
           conf_mat.each do |actual, cols|
             cols.each do |predicted, v|
               accumulated_conf_mat[actual][predicted] += v
             end
           end
         end
-        puts "-" * header.length
+        puts '-' * header.length
       end
       run_report = build_run_report(accumulated_conf_mat)
-      print_run_report(run_report, "All")
+      print_run_report(run_report, 'All')
       puts
       print_conf_mat(accumulated_conf_mat)
       puts
@@ -78,11 +79,11 @@ module ClassifierReborn
         end
       end
       total = correct + incorrect
-      {total: total, correct: correct, incorrect: incorrect, accuracy: divide(correct, total)}
+      { total: total, correct: correct, incorrect: incorrect, accuracy: divide(correct, total) }
     end
     def conf_mat_to_tab(conf_mat)
-      conf_tab = Hash.new {|h, k| h[k] = {p: {t: 0, f: 0}, n: {t: 0, f: 0}}}
+      conf_tab = Hash.new { |h, k| h[k] = { p: { t: 0, f: 0 }, n: { t: 0, f: 0 } } }
       conf_mat.each_key do |positive|
         conf_mat.each do |actual, cols|
           cols.each do |predicted, v|
@@ -93,32 +94,32 @@ module ClassifierReborn
       conf_tab
     end
-    def print_run_report(stats, prefix="", print_header=false)
-      puts "#{"Run".rjust([3, prefix.length].max)}     Total   Correct Incorrect  Accuracy" if print_header
+    def print_run_report(stats, prefix = '', print_header = false)
+      puts "#{'Run'.rjust([3, prefix.length].max)}     Total   Correct Incorrect  Accuracy" if print_header
       puts "#{prefix.to_s.rjust(3)} #{stats[:total].to_s.rjust(9)} #{stats[:correct].to_s.rjust(9)} #{stats[:incorrect].to_s.rjust(9)} #{stats[:accuracy].round(5).to_s.ljust(7, '0').rjust(9)}"
     end
     def print_conf_mat(conf_mat)
-      header = ["Predicted ->"] + conf_mat.keys + ["Total", "Recall"]
+      header = ['Predicted ->'] + conf_mat.keys + %w[Total Recall]
       cell_size = header.map(&:length).max
-      header = header.map{|h| h.rjust(cell_size)}.join(" ")
-      puts " Confusion Matrix ".center(header.length, "-")
+      header = header.map { |h| h.rjust(cell_size) }.join(' ')
+      puts ' Confusion Matrix '.center(header.length, '-')
       puts header
-      puts "-" * header.length
-      predicted_totals = conf_mat.keys.map{|predicted| [predicted, 0]}.to_h
+      puts '-' * header.length
+      predicted_totals = conf_mat.keys.map { |predicted| [predicted, 0] }.to_h
       correct = 0
       conf_mat.each do |k, rec|
         actual_total = rec.values.reduce(:+)
-        puts ([k.ljust(cell_size)] + rec.values.map{|v| v.to_s.rjust(cell_size)} + [actual_total.to_s.rjust(cell_size), divide(rec[k], actual_total).round(5).to_s.rjust(cell_size)]).join(" ")
+        puts ([k.ljust(cell_size)] + rec.values.map { |v| v.to_s.rjust(cell_size) } + [actual_total.to_s.rjust(cell_size), divide(rec[k], actual_total).round(5).to_s.rjust(cell_size)]).join(' ')
         rec.each do |cat, val|
           predicted_totals[cat] += val
           correct += val if cat == k
         end
       end
       total = predicted_totals.values.reduce(:+)
-      puts "-" * header.length
-      puts (["Total".ljust(cell_size)] + predicted_totals.values.map{|v| v.to_s.rjust(cell_size)} + [total.to_s.rjust(cell_size), "".rjust(cell_size)]).join(" ")
-      puts (["Precision".ljust(cell_size)] + predicted_totals.keys.map{|k| divide(conf_mat[k][k], predicted_totals[k]).round(5).to_s.rjust(cell_size)} + ["Accuracy ->".rjust(cell_size), divide(correct, total).round(5).to_s.rjust(cell_size)]).join(" ")
+      puts '-' * header.length
+      puts (['Total'.ljust(cell_size)] + predicted_totals.values.map { |v| v.to_s.rjust(cell_size) } + [total.to_s.rjust(cell_size), ''.rjust(cell_size)]).join(' ')
+      puts (['Precision'.ljust(cell_size)] + predicted_totals.keys.map { |k| divide(conf_mat[k][k], predicted_totals[k]).round(5).to_s.rjust(cell_size) } + ['Accuracy ->'.rjust(cell_size), divide(correct, total).round(5).to_s.rjust(cell_size)]).join(' ')
     end
     def print_conf_tab(conf_tab)
@@ -135,31 +136,31 @@ module ClassifierReborn
       negatives = tab[:n][:t] + tab[:p][:f]
       total     = positives + negatives
       {
-        total_population:   positives + negatives,
+        total_population: positives + negatives,
         condition_positive: positives,
         condition_negative: negatives,
-        true_positive:      tab[:p][:t],
-        true_negative:      tab[:n][:t],
-        false_positive:     tab[:p][:f],
-        false_negative:     tab[:n][:f],
-        prevalence:         divide(positives, total),
-        specificity:        divide(tab[:n][:t], negatives),
-        recall:             divide(tab[:p][:t], positives),
-        precision:          divide(tab[:p][:t], tab[:p][:t] + tab[:p][:f]),
-        accuracy:           divide(tab[:p][:t] + tab[:n][:t], total),
-        f1_score:           divide(2 * tab[:p][:t], 2 * tab[:p][:t] + tab[:p][:f] + tab[:n][:f])
+        true_positive: tab[:p][:t],
+        true_negative: tab[:n][:t],
+        false_positive: tab[:p][:f],
+        false_negative: tab[:n][:f],
+        prevalence: divide(positives, total),
+        specificity: divide(tab[:n][:t], negatives),
+        recall: divide(tab[:p][:t], positives),
+        precision: divide(tab[:p][:t], tab[:p][:t] + tab[:p][:f]),
+        accuracy: divide(tab[:p][:t] + tab[:n][:t], total),
+        f1_score: divide(2 * tab[:p][:t], 2 * tab[:p][:t] + tab[:p][:f] + tab[:n][:f])
       }
     end
     def print_derivations(derivations)
       max_len = derivations.keys.map(&:length).max
       derivations.each do |k, v|
-        puts k.to_s.tr('_', ' ').capitalize.ljust(max_len) + " : " + v.to_s
+        puts k.to_s.tr('_', ' ').capitalize.ljust(max_len) + ' : ' + v.to_s
       end
     end
     def empty_conf_mat(categories)
-      categories.map{|actual| [actual, categories.map{|predicted| [predicted, 0]}.to_h]}.to_h
+      categories.map { |actual| [actual, categories.map { |predicted| [predicted, 0] }.to_h] }.to_h
     end
     def divide(dividend, divisor)
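
Note on the validator hunks above: the changes in this file are stylistic (quoting, spacing, hash alignment), so the derived metrics are computed as before. As a rough illustration of those formulas, the sketch below recomputes a few of them from a confusion table shaped like the one built in `conf_mat_to_tab`. Two things are assumptions rather than facts from the diff: the body of `divide` is not shown, so `safe_divide` is a hypothetical stand-in that returns 0.0 on a zero divisor, and `positives` is assumed to be true positives plus false negatives.

```ruby
# Hypothetical stand-in for the validator's `divide` helper, whose body is
# not visible in the diff; assumed to guard against division by zero.
def safe_divide(dividend, divisor)
  divisor.zero? ? 0.0 : dividend / divisor.to_f
end

# Recomputes a subset of the derivations from a table of the form
# { p: { t: true_pos, f: false_pos }, n: { t: true_neg, f: false_neg } }.
def derive_metrics(tab)
  positives = tab[:p][:t] + tab[:n][:f] # assumption: TP + FN
  negatives = tab[:n][:t] + tab[:p][:f] # as in the diff: TN + FP
  total     = positives + negatives
  {
    recall: safe_divide(tab[:p][:t], positives),
    precision: safe_divide(tab[:p][:t], tab[:p][:t] + tab[:p][:f]),
    accuracy: safe_divide(tab[:p][:t] + tab[:n][:t], total),
    f1_score: safe_divide(2 * tab[:p][:t], 2 * tab[:p][:t] + tab[:p][:f] + tab[:n][:f])
  }
end

metrics = derive_metrics(p: { t: 8, f: 2 }, n: { t: 85, f: 5 })
metrics.each { |name, value| puts "#{name}: #{value.round(5)}" }
```

For this made-up table (8 true positives, 2 false positives, 85 true negatives, 5 false negatives) precision is 8/10 and accuracy is 93/100.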

data/lib/classifier-reborn/version.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 module ClassifierReborn
-  VERSION = '2.2.0'
+  VERSION = '2.3.0'
 end

data/lib/classifier-reborn.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 #--
 # Copyright (c) 2005 Lucas Carlson
 #
@@ -36,4 +38,4 @@ end
 require_relative 'classifier-reborn/category_namer'
 require_relative 'classifier-reborn/bayes'
 require_relative 'classifier-reborn/lsi'
-require_relative 'classifier-reborn/validators/classifier_validator'
+require_relative 'classifier-reborn/validators/classifier_validator'

metadata CHANGED

@@ -1,16 +1,16 @@
 --- !ruby/object:Gem::Specification
 name: classifier-reborn
 version: !ruby/object:Gem::Version
-  version: 2.2.0
+  version: 2.3.0
 platform: ruby
 authors:
 - Lucas Carlson
 - Parker Moore
 - Chase Gilliam
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-12-15 00:00:00.000000000 Z
+date: 2022-06-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fast-stemmer
@@ -27,7 +27,21 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '1.0'
 - !ruby/object:Gem::Dependency
-  name: rake
+  name: matrix
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
+- !ruby/object:Gem::Dependency
+  name: minitest
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -41,7 +55,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name: rdoc
+  name: minitest-reporters
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -55,7 +69,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name: minitest
+  name: pry
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -69,7 +83,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name: minitest-reporters
+  name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -83,7 +97,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name: rubocop
+  name: rdoc
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -97,7 +111,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name: pry
+  name: redis
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -111,7 +125,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name: redis
+  name: rubocop
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -124,7 +138,7 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description:
+description:
 email:
 - lucas@rufy.com
 - parkrmoore@gmail.com
@@ -167,8 +181,14 @@ files:
 - lib/classifier-reborn/bayes.rb
 - lib/classifier-reborn/category_namer.rb
 - lib/classifier-reborn/extensions/hasher.rb
+- lib/classifier-reborn/extensions/token_filter/stemmer.rb
+- lib/classifier-reborn/extensions/token_filter/stopword.rb
+- lib/classifier-reborn/extensions/token_filter/symbol.rb
+- lib/classifier-reborn/extensions/tokenizer/token.rb
+- lib/classifier-reborn/extensions/tokenizer/whitespace.rb
 - lib/classifier-reborn/extensions/vector.rb
 - lib/classifier-reborn/extensions/vector_serialize.rb
+- lib/classifier-reborn/extensions/zero_vector.rb
 - lib/classifier-reborn/lsi.rb
 - lib/classifier-reborn/lsi/cached_content_node.rb
 - lib/classifier-reborn/lsi/content_node.rb
@@ -176,11 +196,11 @@ files:
 - lib/classifier-reborn/lsi/word_list.rb
 - lib/classifier-reborn/validators/classifier_validator.rb
 - lib/classifier-reborn/version.rb
-homepage: https://github.com/jekyll/classifier-reborn
+homepage: https://jekyll.github.io/classifier-reborn/
 licenses:
 - LGPL
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options:
 - "--charset=UTF-8"
 require_paths:
@@ -189,17 +209,15 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: 1.9.3
+      version: 2.4.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubyforge_project:
-rubygems_version: 2.6.14
-signing_key:
+rubygems_version: 3.3.7
+signing_key:
 specification_version: 2
 summary: A general classifier module to allow Bayesian and other types of classifications.
 test_files: []
-has_rdoc: true
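
Note on the metadata above: 2.3.0 adds a runtime dependency on `matrix` with the pessimistic constraint `~> 0.4` and raises `required_ruby_version` from `>= 1.9.3` to `>= 2.4.0`. The sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show which versions those constraints accept. The version numbers being tested are arbitrary examples, not versions taken from the diff.

```ruby
require 'rubygems'

# The two constraints introduced or changed in the 2.3.0 gemspec.
matrix_requirement = Gem::Requirement.new('~> 0.4')
ruby_requirement   = Gem::Requirement.new('>= 2.4.0')

# Example versions chosen only for illustration.
%w[0.3.9 0.4.0 0.5.1 1.0.0].each do |version|
  ok = matrix_requirement.satisfied_by?(Gem::Version.new(version))
  puts "matrix #{version}: #{ok}"
end

%w[2.3.8 2.4.0 3.1.2].each do |version|
  ok = ruby_requirement.satisfied_by?(Gem::Version.new(version))
  puts "ruby #{version}: #{ok}"
end
```

A two-segment pessimistic constraint such as `~> 0.4` is equivalent to `>= 0.4, < 1.0`, so any 0.x release from 0.4 onward satisfies it while 1.0.0 does not.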