classifier-reborn 2.0.4 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.markdown +25 -3
- data/bin/bayes.rb +22 -22
- data/bin/summarize.rb +4 -4
- data/lib/classifier-reborn.rb +1 -1
- data/lib/classifier-reborn/bayes.rb +50 -46
- data/lib/classifier-reborn/category_namer.rb +4 -3
- data/lib/classifier-reborn/extensions/hasher.rb +19 -11
- data/lib/classifier-reborn/extensions/vector.rb +27 -26
- data/lib/classifier-reborn/extensions/vector_serialize.rb +8 -10
- data/lib/classifier-reborn/lsi.rb +89 -77
- data/lib/classifier-reborn/lsi/cached_content_node.rb +4 -5
- data/lib/classifier-reborn/lsi/content_node.rb +15 -18
- data/lib/classifier-reborn/lsi/summarizer.rb +5 -5
- data/lib/classifier-reborn/lsi/word_list.rb +1 -2
- data/lib/classifier-reborn/version.rb +1 -1
- metadata +46 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 53745cead2833e74e9d74f3359bae0bc7fd01fa4
+  data.tar.gz: 2926365890cf0bc43f7ff17570789ed5e032a45d
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5def462eccbb9ef7a45d3968a5fbc2e2a886bfdda734a98e1f5904943ba1b4012ea331a51ff52e2c165db776dc8c6aed35fac9e01ea3e3f001f82daf98027f74
+  data.tar.gz: 22726b8f6c2acab5bb47b9faac2a2d615557bfff535327ef5bbe022a661c1be13c3b7cec95d1c0030f93158eb0942e121d97f2a0213a640eca00607de9d55f61
data/README.markdown
CHANGED
@@ -1,5 +1,12 @@
 ## Welcome to Classifier Reborn
 
+[][ruby-gems]
+[][travis]
+[][gemnasium]
+[ruby-gems]: https://rubygems.org/gems/jekyll/classifier-reborn
+[gemnasium]: https://gemnasium.com/jekyll/classifier-reborn
+[travis]: https://travis-ci.org/jekyll/classifier-reborn
+
 Classifier is a general module to allow Bayesian and other types of classifications.
 
 Classifier Reborn is a fork of cardmagic/classifier under more active development.
@@ -37,6 +44,8 @@ Notice that LSI will work without these libraries, but as soon as they are insta
 
 A Bayesian classifier by Lucas Carlson. Bayesian Classifiers are accurate, fast, and have modest memory requirements.
 
+*Note: Classifier only supports UTF-8 characters.*
+
 ### Usage
 
 ```ruby
@@ -60,7 +69,7 @@ trained_classifier.classify "I love" # returns 'Interesting'
 ```
 
 Beyond the basic example, the constructor and trainer can be used in a more
-flexible way to
+flexible way to accommodate non-trival applications. Consider the following
 program:
 
 ```ruby
@@ -72,7 +81,8 @@ require 'classifier-reborn'
 training_set = DATA.read.split("\n")
 categories = training_set.shift.split(',').map{|c| c.strip}
 
-
+# pass :auto_categorize option to allow feeding previously unknown categories
+classifier = ClassifierReborn::Bayes.new categories, auto_categorize: true
 
 training_set.each do |a_line|
   next if a_line.empty? || '#' == a_line.strip[0]
@@ -141,7 +151,7 @@ Or suppose you just want the ability to have multiple categories and a 'None of
 When you initialize the *ClassifierReborn::Bayes* classifier there are several options which can be set that control threshold processing.
 
 ```ruby
-b =
+b = ClassifierReborn::Bayes.new(
     'good', # one or more categories
     enable_threshold: true, # default: false
     threshold: -10.0 # default: 0.0
@@ -218,6 +228,18 @@ with more than just simple strings.
 * http://www.chadfowler.com/index.cgi/Computing/LatentSemanticIndexing.rdoc
 * http://en.wikipedia.org/wiki/Latent_semantic_analysis
 
+
+## Code of Conduct
+
+In order to have a more open and welcoming community, Classifier-Reborn adheres to the Jekyll
+[code of conduct](https://github.com/jekyll/jekyll/blob/master/CONDUCT.markdown) adapted from the Ruby on Rails code of
+conduct.
+
+Please adhere to this code of conduct in any interactions you have in the
+Classifier community. If you encounter someone violating
+these terms, please let [@chase](https://github.com/Ch4s3) know and we will address it as soon as possible.
+
+
 ## Authors
 
 * Lucas Carlson (lucas@rufy.com)
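The README hunks above add an `auto_categorize:` constructor flag and complete the previously truncated threshold example. A minimal sketch combining both (assuming the 2.1.0 gem is installed; the training text and return values are illustrative):

```ruby
require 'classifier-reborn'

# auto_categorize lets train() accept categories that were never declared
# up front; enable_threshold/threshold gate what classify() will return.
classifier = ClassifierReborn::Bayes.new(
  'Interesting',
  auto_categorize: true,
  enable_threshold: true,
  threshold: -10.0
)

classifier.train('Interesting', 'Here are some good words. I hope you love them.')
classifier.train('Uninteresting', 'Here are some bad words.') # category created on the fly

classifier.classify('I love you') # => "Interesting", or nil when the score falls below -10.0
```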
data/bin/bayes.rb
CHANGED
@@ -1,36 +1,36 @@
 #!/usr/bin/env ruby
 
 begin
-
-
+  require 'rubygems'
+  require 'classifier'
 rescue
-
+  require 'classifier'
 end
 
 require 'madeleine'
 
-m = SnapshotMadeleine.new(File.expand_path(
-
-
+m = SnapshotMadeleine.new(File.expand_path('~/.bayes_data')) do
+  ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
+end
 
 case ARGV[0]
-when
-
-
-
-
-
-
-
-
-
-
-
-when
-
+when 'add'
+  case ARGV[1].downcase
+  when 'interesting'
+    m.system.train_interesting File.open(ARGV[2]).read
+    puts "#{ARGV[2]} has been classified as interesting"
+  when 'uninteresting'
+    m.system.train_uninteresting File.open(ARGV[2]).read
+    puts "#{ARGV[2]} has been classified as uninteresting"
+  else
+    puts 'Invalid category: choose between interesting and uninteresting'
+    exit(1)
+  end
+when 'classify'
+  puts m.system.classify(File.open(ARGV[1]).read)
 else
-
-
+  puts 'Invalid option: choose add [category] [file] or clasify [file]'
+  exit(-1)
 end
 
 m.take_snapshot
data/bin/summarize.rb
CHANGED
@@ -1,10 +1,10 @@
 #!/usr/bin/env ruby
 
 begin
-
-
+  require 'rubygems'
+  require 'classifier'
 rescue
-
+  require 'classifier'
 end
 
 require 'open-uri'
@@ -13,4 +13,4 @@ num = ARGV[1].to_i
 num = num < 1 ? 10 : num
 
 text = open(ARGV.first).read
-puts text.gsub(/<[^>]+>/,
+puts text.gsub(/<[^>]+>/, '').gsub(/[\s]+/, ' ').summary(num)
data/lib/classifier-reborn/bayes.rb
CHANGED
@@ -17,20 +17,22 @@ module ClassifierReborn
     # auto_categorize: false  When true, enables ability to dynamically declare a category
     # enable_threshold: false  When true, enables a threshold requirement for classifition
     # threshold: 0.0  Default threshold, only used when enabled
+    # enable_stemmer: true  When false, disables word stemming
     def initialize(*args)
-      @categories =
-      options = { language: 'en',
+      @categories = {}
+      options = { language: 'en',
                   auto_categorize: false,
                   enable_threshold: false,
-                  threshold: 0.0
+                  threshold: 0.0,
+                  enable_stemmer: true
                 }
-      args.flatten.each
-        if arg.
+      args.flatten.each do |arg|
+        if arg.is_a?(Hash)
           options.merge!(arg)
         else
           add_category(arg)
         end
-
+      end
 
       @total_words = 0
       @category_counts = Hash.new(0)
@@ -40,6 +42,7 @@ module ClassifierReborn
       @auto_categorize = options[:auto_categorize]
       @enable_threshold = options[:enable_threshold]
       @threshold = options[:threshold]
+      @enable_stemmer = options[:enable_stemmer]
     end
 
     # Provides a general training method for all categories specified in Bayes#new
@@ -52,18 +55,18 @@ module ClassifierReborn
       category = CategoryNamer.prepare_name(category)
 
       # Add the category dynamically or raise an error
-
+      unless @categories.key?(category)
         if @auto_categorize
           add_category(category)
         else
-          raise CategoryNotFoundError
+          raise CategoryNotFoundError, "Cannot train; category #{category} does not exist"
         end
       end
 
       @category_counts[category] += 1
-      Hasher.word_hash(text, @language).each do |word, count|
-        @categories[category][word]
-        @category_word_count[category]
+      Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
+        @categories[category][word] += count
+        @category_word_count[category] += count
         @total_words += count
       end
     end
@@ -78,20 +81,17 @@ module ClassifierReborn
     def untrain(category, text)
       category = CategoryNamer.prepare_name(category)
       @category_counts[category] -= 1
-      Hasher.word_hash(text, @language).each do |word, count|
-        if @total_words
-
-
-
-
-
-        end
-
-        if @category_word_count[category] >= count
-          @category_word_count[category] -= count
-        end
-        @total_words -= count
+      Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
+        next if @total_words < 0
+        orig = @categories[category][word] || 0
+        @categories[category][word] -= count
+        if @categories[category][word] <= 0
+          @categories[category].delete(word)
+          count = orig
        end
+
+        @category_word_count[category] -= count if @category_word_count[category] >= count
+        @total_words -= count
      end
     end
 
@@ -100,21 +100,21 @@ module ClassifierReborn
     #  => {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
     # The largest of these scores (the one closest to 0) is the one picked out by #classify
     def classifications(text)
-      score =
-      word_hash = Hasher.word_hash(text, @language)
+      score = {}
+      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
       training_count = @category_counts.values.reduce(:+).to_f
       @categories.each do |category, category_words|
         score[category.to_s] = 0
         total = (@category_word_count[category] || 1).to_f
-        word_hash.each do |word,
-          s = category_words.
-          score[category.to_s] += Math.log(s/total)
+        word_hash.each do |word, _count|
+          s = category_words.key?(word) ? category_words[word] : 0.1
+          score[category.to_s] += Math.log(s / total)
         end
         # now add prior probability for the category
-        s = @category_counts.
+        s = @category_counts.key?(category) ? @category_counts[category] : 0.1
         score[category.to_s] += Math.log(s / training_count)
       end
-
+      score
     end
 
     # Returns the classification of the provided +text+, which is one of the
@@ -128,21 +128,15 @@ module ClassifierReborn
     # Return the classification without the score
     def classify(text)
       result, score = classify_with_score(text)
-      if threshold_enabled?
-
-      end
-      return result
+      result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?
+      result
     end
 
     # Retrieve the current threshold value
-
-      @threshold
-    end
+    attr_reader :threshold
 
     # Dynamically set the threshold value
-
-      @threshold = a_float
-    end
+    attr_writer :threshold
 
     # Dynamically enable threshold for classify results
     def enable_threshold
@@ -164,6 +158,16 @@ module ClassifierReborn
       !@enable_threshold
     end
 
+    # Is word stemming enabled?
+    def stemmer_enabled?
+      @enable_stemmer
+    end
+
+    # Is word stemming disabled?
+    def stemmer_disabled?
+      !@enable_stemmer
+    end
+
     # Provides training and untraining methods for the categories specified in Bayes#new
     # For example:
     #     b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other'
@@ -174,12 +178,12 @@ module ClassifierReborn
     def method_missing(name, *args)
       cleaned_name = name.to_s.gsub(/(un)?train_([\w]+)/, '\2')
       category = CategoryNamer.prepare_name(cleaned_name)
-      if @categories.
-        args.each { |text| eval("#{
+      if @categories.key? category
+        args.each { |text| eval("#{Regexp.last_match(1)}train(category, text)") }
       elsif name.to_s =~ /(un)?train_([\w]+)/
         raise StandardError, "No such category: #{category}"
       else
-        super
+        super # raise StandardError, "No such method: #{name}"
       end
     end
 
@@ -188,7 +192,7 @@ module ClassifierReborn
     #     b.categories
     #      => ['This', 'That', 'the_other']
     def categories # :nodoc:
-      @categories.keys.collect
+      @categories.keys.collect(&:to_s)
     end
 
     # Allows you to add categories to the classifier.
@@ -203,6 +207,6 @@ module ClassifierReborn
       @categories[CategoryNamer.prepare_name(category)] ||= Hash.new(0)
     end
 
-
+    alias_method :append_category, :add_category
   end
 end
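Beyond the `enable_stemmer` plumbing, the hunks above collapse the hand-written threshold getter/setter into `attr_reader`/`attr_writer` and add `stemmer_enabled?`/`stemmer_disabled?` predicates. A short sketch of that surface (return values illustrative):

```ruby
require 'classifier-reborn'

# enable_stemmer: false indexes whole words instead of stems (new in 2.1.0)
b = ClassifierReborn::Bayes.new('Interesting', 'Uninteresting', enable_stemmer: false)

b.stemmer_enabled?   # => false
b.stemmer_disabled?  # => true

b.threshold          # => 0.0 (now a plain attr_reader)
b.threshold = -5.0   # now a plain attr_writer
b.enable_threshold
b.threshold_enabled? # => true
```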
data/lib/classifier-reborn/category_namer.rb
CHANGED
@@ -7,11 +7,12 @@ require 'classifier-reborn/extensions/hasher'
 
 module ClassifierReborn
   module CategoryNamer
-
-
+    module_function
+
+    def prepare_name(name)
       return name if name.is_a?(Symbol)
 
-      name.to_s.
+      name.to_s.tr('_', ' ').capitalize.intern
     end
   end
 end
data/lib/classifier-reborn/extensions/hasher.rb
CHANGED
@@ -9,29 +9,37 @@ module ClassifierReborn
   module Hasher
     STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]
 
-
+    module_function
 
     # Return a Hash of strings => ints. Each word in the string is stemmed,
     # interned, and indexes to its frequency in the document.
-    def word_hash(str, language = 'en')
-      cleaned_word_hash = clean_word_hash(str, language)
+    def word_hash(str, language = 'en', enable_stemmer = true)
+      cleaned_word_hash = clean_word_hash(str, language, enable_stemmer)
       symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/))
-
+      cleaned_word_hash.merge(symbol_hash)
     end
 
     # Return a word hash without extra punctuation or short symbols, just stemmed words
-    def clean_word_hash(str, language = 'en')
-      word_hash_for_words str.gsub(/[^\p{WORD}\s]/,'').downcase.split, language
+    def clean_word_hash(str, language = 'en', enable_stemmer = true)
+      word_hash_for_words str.gsub(/[^\p{WORD}\s]/, '').downcase.split, language, enable_stemmer
     end
 
-    def word_hash_for_words(words, language = 'en')
+    def word_hash_for_words(words, language = 'en', enable_stemmer = true)
       d = Hash.new(0)
       words.each do |word|
-
+        next unless word.length > 2 && !STOPWORDS[language].include?(word)
+        if enable_stemmer
           d[word.stem.intern] += 1
+        else
+          d[word.intern] += 1
         end
       end
-
+      d
+    end
+
+    # Add custom path to a new stopword file created by user
+    def add_custom_stopword_path(path)
+      STOPWORDS_PATH.unshift(path)
     end
 
     def word_hash_for_symbols(words)
@@ -39,7 +47,7 @@ module ClassifierReborn
       words.each do |word|
         d[word.intern] += 1
       end
-
+      d
     end
 
     # Create a lazily-loaded hash of stopword data
@@ -48,7 +56,7 @@ module ClassifierReborn
 
       STOPWORDS_PATH.each do |path|
         if File.exist?(File.join(path, language))
-          hash[language] = Set.new File.read(File.join(path, language.to_s)).split
+          hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
           break
         end
       end
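Two things fall out of the hasher hunks: every hashing method gains an `enable_stemmer` positional argument, and `add_custom_stopword_path` lets callers prepend their own stopword directory. A sketch (the path and the exact hash contents are illustrative; stems come from fast-stemmer):

```ruby
require 'classifier-reborn'

# A custom directory is searched before the bundled data/stopwords; it is
# assumed to hold one file per language code (e.g. 'en'), same format.
ClassifierReborn::Hasher.add_custom_stopword_path('/opt/my_stopwords')

# The third positional argument toggles stemming (defaults to true).
ClassifierReborn::Hasher.word_hash('running quickly', 'en', true)  # => { run: 1, quickli: 1 }
ClassifierReborn::Hasher.word_hash('running quickly', 'en', false) # => { running: 1, quickly: 1 }
```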
data/lib/classifier-reborn/extensions/vector.rb
CHANGED
@@ -6,17 +6,17 @@
 require 'matrix'
 
 class Matrix
-  def
-
+  def self.diag(s)
+    Matrix.diagonal(*s)
   end
 
-
+  alias_method :trans, :transpose
 
   def SV_decomp(maxSweeps = 20)
-    if
-      q =
+    if row_size >= column_size
+      q = trans * self
     else
-      q = self *
+      q = self * trans
     end
 
     qrot = q.dup
@@ -24,55 +24,56 @@ class Matrix
     mzrot = nil
     cnt = 0
     s_old = nil
-    mu = nil
 
-
+    loop do
       cnt += 1
-
-
+      (0...qrot.row_size - 1).each do |row|
+        (1..qrot.row_size - 1).each do |col|
           next if row == col
-
+
+          h = Math.atan((2 * qrot[row, col]) / (qrot[row, row] - qrot[col, col])) / 2.0
           hcos = Math.cos(h)
           hsin = Math.sin(h)
           mzrot = Matrix.identity(qrot.row_size)
-          mzrot[row,row] = hcos
-          mzrot[row,col] = -hsin
-          mzrot[col,row] = hsin
-          mzrot[col,col] = hcos
+          mzrot[row, row] = hcos
+          mzrot[row, col] = -hsin
+          mzrot[col, row] = hsin
+          mzrot[col, col] = hcos
           qrot = mzrot.trans * qrot * mzrot
-          v
+          v *= mzrot
         end
       end
       s_old = qrot.dup if cnt == 1
       sum_qrot = 0.0
       if cnt > 1
         qrot.row_size.times do |r|
-          sum_qrot += (qrot[r,r]-s_old[r,r]).abs if (qrot[r,r]-s_old[r,r]).abs > 0.001
+          sum_qrot += (qrot[r, r] - s_old[r, r]).abs if (qrot[r, r] - s_old[r, r]).abs > 0.001
         end
         s_old = qrot.dup
       end
-      break if (sum_qrot <= 0.001
+      break if (sum_qrot <= 0.001 && cnt > 1) || cnt >= maxSweeps
     end # of do while true
     s = []
     qrot.row_size.times do |r|
-      s << Math.sqrt(qrot[r,r])
+      s << Math.sqrt(qrot[r, r])
     end
-    #puts "cnt = #{cnt}"
-    if
-      mu = self *
+    # puts "cnt = #{cnt}"
+    if row_size >= column_size
+      mu = self * v * Matrix.diagonal(*s).inverse
       return [mu, v, s]
     else
       puts v.row_size
       puts v.column_size
-      puts
-      puts
+      puts row_size
+      puts column_size
       puts s.size
 
-      mu = (
+      mu = (trans * v * Matrix.diagonal(*s).inverse)
       return [mu, v, s]
     end
   end
-
+
+  def []=(i, j, val)
     @rows[i][j] = val
   end
 end
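The pure-Ruby fallback above fills in a Jacobi-rotation SVD (`SV_decomp`) plus the `diag` and `trans` helpers that `build_reduced_matrix` in lsi.rb relies on. A sketch of how the monkey-patched `Matrix` gets used, mirroring that caller (only valid when the extension is loaded, i.e. when GSL is absent):

```ruby
require 'matrix'
require 'classifier-reborn' # loads extensions/vector when the gsl gem is missing

m = Matrix[[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u, v, s = m.SV_decomp                  # left vectors, right vectors, singular values
approx = u * Matrix.diag(s) * v.trans  # rank-limited reconstruction of m
```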
data/lib/classifier-reborn/extensions/vector_serialize.rb
CHANGED
@@ -1,20 +1,18 @@
 module GSL
-
   class Vector
-    def _dump(
-      Marshal.dump(
+    def _dump(_v)
+      Marshal.dump(to_a)
     end
-
+
     def self._load(arr)
       arry = Marshal.load(arr)
-
+      GSL::Vector.alloc(arry)
     end
-
   end
-
+
   class Matrix
-
-
-
+    class <<self
+      alias_method :diag, :diagonal
+    end
   end
 end
data/lib/classifier-reborn/lsi.rb
CHANGED
@@ -3,13 +3,14 @@
 # License:: LGPL
 
 begin
-  raise LoadError if ENV['NATIVE_VECTOR'] ==
+  raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
 
-  require 'gsl' # requires
+  require 'gsl' # requires https://github.com/SciRuby/rb-gsl
   require_relative 'extensions/vector_serialize'
   $GSL = true
 
 rescue LoadError
+  $GSL = false
   require_relative 'extensions/vector'
 end
 
@@ -19,12 +20,10 @@ require_relative 'lsi/cached_content_node'
 require_relative 'lsi/summarizer'
 
 module ClassifierReborn
-
   # This class implements a Latent Semantic Indexer, which can search, classify and cluster
   # data based on underlying semantic relations. For more information on the algorithms used,
   # please consult Wikipedia[http://en.wikipedia.org/wiki/Latent_Semantic_Indexing].
   class LSI
-
     attr_reader :word_list, :cache_node_vectors
     attr_accessor :auto_rebuild
 
@@ -36,12 +35,12 @@ module ClassifierReborn
     #
     def initialize(options = {})
       @auto_rebuild = options[:auto_rebuild] != false
-      @word_list
-      @
+      @word_list = WordList.new
+      @items = {}
+      @version = 0
+      @built_at_version = -1
       @language = options[:language] || 'en'
-      if @cache_node_vectors = options[:cache_node_vectors]
-        extend CachedContentNode::InstanceMethods
-      end
+      extend CachedContentNode::InstanceMethods if @cache_node_vectors = options[:cache_node_vectors]
     end
 
     # Returns true if the index needs to be rebuilt. The index needs
@@ -64,39 +63,44 @@ module ClassifierReborn
     #   ar = ActiveRecordObject.find( :all )
     #   lsi.add_item ar, *ar.categories { |x| ar.content }
     #
-    def add_item(
+    def add_item(item, *categories, &block)
       clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language)
-
-
+      if clean_word_hash.empty?
+        puts "Input: '#{item}' is entirely stopwords or words with 2 or fewer characters. Classifier-Reborn cannot handle this document properly."
       else
-
+        @items[item] = if @cache_node_vectors
+                         CachedContentNode.new(clean_word_hash, *categories)
+                       else
+                         ContentNode.new(clean_word_hash, *categories)
+                       end
+        @version += 1
+        build_index if @auto_rebuild
       end
-      @version += 1
-      build_index if @auto_rebuild
     end
 
     # A less flexible shorthand for add_item that assumes
     # you are passing in a string with no categorries. item
    # will be duck typed via to_s .
     #
-    def <<(
-      add_item
+    def <<(item)
+      add_item(item)
     end
 
     # Returns the categories for a given indexed items. You are free to add and remove
     # items from this as you see fit. It does not invalide an index to change its categories.
     def categories_for(item)
       return [] unless @items[item]
-
+
+      @items[item].categories
     end
 
     # Removes an item from the database, if it is indexed.
     #
-    def remove_item(
-
-
-
+    def remove_item(item)
+      return unless @items.key? item
+
+      @items.delete item
+      @version += 1
     end
 
     # Returns an array of items that are indexed.
@@ -118,30 +122,30 @@ module ClassifierReborn
     # cutoff parameter tells the indexer how many of these values to keep.
     # A value of 1 for cutoff means that no semantic analysis will take place,
     # turning the LSI class into a simple vector search engine.
-    def build_index(
+    def build_index(cutoff = 0.75)
       return unless needs_rebuild?
       make_word_list
 
       doc_list = @items.values
-      tda = doc_list.collect { |node| node.raw_vector_with(
+      tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }
 
       if $GSL
-
-
-
-
-
-
-
-
+        tdm = GSL::Matrix.alloc(*tda).trans
+        ntdm = build_reduced_matrix(tdm, cutoff)
+
+        ntdm.size[1].times do |col|
+          vec = GSL::Vector.alloc(ntdm.column(col)).row
+          doc_list[col].lsi_vector = vec
+          doc_list[col].lsi_norm = vec.normalize
+        end
       else
-
-
+        tdm = Matrix.rows(tda).trans
+        ntdm = build_reduced_matrix(tdm, cutoff)
 
-
-
-
-
+        ntdm.row_size.times do |col|
+          doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col]
+          doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
+        end
       end
 
       @built_at_version = @version
@@ -155,13 +159,13 @@ module ClassifierReborn
     # your dataset's general content. For example, if you were to use categorize on the
     # results of this data, you could gather information on what your dataset is generally
     # about.
-    def highest_relative_content(
-
+    def highest_relative_content(max_chunks = 10)
+      return [] if needs_rebuild?
 
-
-
+      avg_density = {}
+      @items.each_key { |item| avg_density[item] = proximity_array_for_content(item).inject(0.0) { |x, y| x + y[1] } }
 
-
+      avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks - 1].map
     end
 
     # This function is the primitive that find_related and classify
@@ -176,10 +180,10 @@ module ClassifierReborn
     # The parameter doc is the content to compare. If that content is not
     # indexed, you can pass an optional block to define how to create the
     # text data. See add_item for examples of how this works.
-    def proximity_array_for_content(
+    def proximity_array_for_content(doc, &block)
      return [] if needs_rebuild?
 
-      content_node = node_for_content(
+      content_node = node_for_content(doc, &block)
       result =
         @items.keys.collect do |item|
           if $GSL
@@ -197,10 +201,18 @@ module ClassifierReborn
     # calculated vectors instead of their full versions. This is useful when
     # you're trying to perform operations on content that is much smaller than
     # the text you're working with. search uses this primitive.
-    def proximity_norms_for_content(
+    def proximity_norms_for_content(doc, &block)
       return [] if needs_rebuild?
 
-      content_node = node_for_content(
+      content_node = node_for_content(doc, &block)
+      if $GSL && content_node.raw_norm.isnan?.all?
+        puts "There are no documents that are similar to #{doc}"
+      else
+        content_node_norms(content_node)
+      end
+    end
+
+    def content_node_norms(content_node)
       result =
         @items.keys.collect do |item|
           if $GSL
@@ -220,11 +232,13 @@ module ClassifierReborn
     #
     # While this may seem backwards compared to the other functions that LSI supports,
     # it is actually the same algorithm, just applied on a smaller document.
-    def search(
+    def search(string, max_nearest = 3)
       return [] if needs_rebuild?
-      carry = proximity_norms_for_content(
-
-
+      carry = proximity_norms_for_content(string)
+      unless carry.nil?
+        result = carry.collect { |x| x[0] }
+        result[0..max_nearest - 1]
+      end
     end
 
     # This function takes content and finds other documents
@@ -236,21 +250,21 @@ module ClassifierReborn
     # This is particularly useful for identifing clusters in your document space.
     # For example you may want to identify several "What's Related" items for weblog
     # articles, or find paragraphs that relate to each other in an essay.
-    def find_related(
+    def find_related(doc, max_nearest = 3, &block)
       carry =
-        proximity_array_for_content(
+        proximity_array_for_content(doc, &block).reject { |pair| pair[0].eql? doc }
       result = carry.collect { |x| x[0] }
-
+      result[0..max_nearest - 1]
     end
 
     # Return the most obvious category with the score
-    def classify_with_score(
-
+    def classify_with_score(doc, cutoff = 0.30, &block)
+      scored_categories(doc, cutoff, &block).last
     end
 
     # Return the most obvious category without the score
-    def classify(
-
+    def classify(doc, cutoff = 0.30, &block)
+      scored_categories(doc, cutoff, &block).last.first
     end
 
     # This function uses a voting system to categorize documents, based on
@@ -262,10 +276,10 @@ module ClassifierReborn
     # text. A cutoff of 1 means that every document in the index votes on
     # what category the document is in. This may not always make sense.
     #
-    def scored_categories(
+    def scored_categories(doc, cutoff = 0.30, &block)
       icutoff = (@items.size * cutoff).round
-      carry = proximity_array_for_content(
-      carry = carry[0..icutoff-1]
+      carry = proximity_array_for_content(doc, &block)
+      carry = carry[0..icutoff - 1]
       votes = Hash.new(0.0)
       carry.each do |pair|
         @items[pair[0]].categories.each do |category|
@@ -273,31 +287,31 @@ module ClassifierReborn
         end
       end
 
-
+      votes.sort_by { |_, score| score }
     end
 
     # Prototype, only works on indexed documents.
     # I have no clue if this is going to work, but in theory
     # it's supposed to.
-    def highest_ranked_stems(
-      raise
-
-      top_n =
-
+    def highest_ranked_stems(doc, count = 3)
+      raise 'Requested stem ranking on non-indexed content!' unless @items[doc]
+      content_vector_array = node_for_content(doc).lsi_vector.to_a
+      top_n = content_vector_array.sort.reverse[0..count - 1]
+      top_n.collect { |x| @word_list.word_for_index(content_vector_array.index(x)) }
     end
 
     private
-
+
+    def build_reduced_matrix(matrix, cutoff = 0.75)
       # TODO: Check that M>=N on these dimensions! Transpose helps assure this
       u, v, s = matrix.SV_decomp
-
       # TODO: Better than 75% term, please. :\
       s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
       s.size.times do |ord|
         s[ord] = 0.0 if s[ord] < s_cutoff
       end
       # Reconstruct the term document matrix, only with reduced rank
-      u * ($GSL ? GSL::Matrix : ::Matrix).diag(
+      u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
     end
 
     def node_for_content(item, &block)
@@ -306,23 +320,21 @@ module ClassifierReborn
       else
         clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language)
 
-
+        content_node = ContentNode.new(clean_word_hash, &block) # make the node and extract the data
 
         unless needs_rebuild?
-
+          content_node.raw_vector_with(@word_list) # make the lsi raw and norm vectors
         end
       end
 
-
+      content_node
     end
 
     def make_word_list
       @word_list = WordList.new
       @items.each_value do |node|
-        node.word_hash.each_key { |key| @word_list.add_word
+        node.word_hash.each_key { |key| @word_list.add_word(key) }
       end
     end
-
   end
 end
-
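The lsi.rb hunks are mostly signature restorations (`doc`, `cutoff = 0.30`, `&block`) plus an explicit `$GSL = false` fallback and a NaN guard in `proximity_norms_for_content`. The public API they restore looks like this (a sketch; the categories and queries are illustrative):

```ruby
require 'classifier-reborn'

lsi = ClassifierReborn::LSI.new
lsi.add_item('Dogs are fun pets to walk', :dog)
lsi.add_item('Cats are quiet pets that purr', :cat)
lsi.add_item('Bird songs brighten mornings', :bird)

lsi.search('walking dogs', 2)          # the two nearest indexed items
lsi.classify('My pet purrs all night') # => :cat (vote over the nearest items)
lsi.find_related('Dogs are fun pets to walk', 1)
```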
data/lib/classifier-reborn/lsi/cached_content_node.rb
CHANGED
@@ -3,7 +3,6 @@
 # License:: LGPL
 
 module ClassifierReborn
-
   # Subclass of ContentNode which caches the search_vector transpositions.
   # Its great because its much faster for large indexes, but at the cost of more ram. Additionally,
   # if you Marshal your classifier and want to keep the size down, you'll need to manually
@@ -16,7 +15,7 @@ module ClassifierReborn
       end
     end
 
-    def initialize(
+    def initialize(word_hash, *categories)
       clear_cache!
       super
     end
@@ -29,13 +28,13 @@ module ClassifierReborn
     def transposed_search_vector
       @transposed_search_vector ||= super
     end
-
+
     # Clear the cache before we continue on
-    def raw_vector_with(
+    def raw_vector_with(word_list)
       clear_cache!
       super
     end
-
+
     # We don't want the cached_data here
     def marshal_dump
       [@lsi_vector, @lsi_norm, @raw_vector, @raw_norm, @categories, @word_hash]
data/lib/classifier-reborn/lsi/content_node.rb
CHANGED
@@ -3,10 +3,9 @@
 # License:: LGPL
 
 module ClassifierReborn
-
-  #
-  #
-  # You should never have to use it directly.
+  # This is an internal data structure class for the LSI node. Save for
+  # raw_vector_with, it should be fairly straightforward to understand.
+  # You should never have to use it directly.
   class ContentNode
     attr_accessor :raw_vector, :raw_norm,
                   :lsi_vector, :lsi_norm,
@@ -15,7 +14,7 @@ module ClassifierReborn
     attr_reader :word_hash
     # If text_proc is not specified, the source will be duck-typed
     # via source.to_s
-    def initialize(
+    def initialize(word_hash, *categories)
       @categories = categories || []
       @word_hash = word_hash
       @lsi_norm, @lsi_vector = nil
@@ -38,11 +37,11 @@ module ClassifierReborn
 
     # Creates the raw vector out of word_hash using word_list as the
     # key for mapping the vector space.
-    def raw_vector_with(
+    def raw_vector_with(word_list)
       if $GSL
-
+        vec = GSL::Vector.alloc(word_list.size)
       else
-
+        vec = Array.new(word_list.size, 0)
       end
 
       @word_hash.each_key do |word|
@@ -52,7 +51,7 @@ module ClassifierReborn
       # Perform the scaling transform and force floating point arithmetic
       if $GSL
         sum = 0.0
-        vec.each {|v| sum += v }
+        vec.each { |v| sum += v }
         total_words = sum
       else
         total_words = vec.reduce(0, :+).to_f
@@ -63,7 +62,7 @@ module ClassifierReborn
       if $GSL
         vec.each { |word| total_unique_words += 1 if word != 0.0 }
       else
-        total_unique_words = vec.count{ |word| word != 0 }
+        total_unique_words = vec.count { |word| word != 0 }
       end
 
       # Perform first-order association transform if this vector has more
@@ -71,9 +70,9 @@ module ClassifierReborn
       if total_words > 1.0 && total_unique_words > 1
         weighted_total = 0.0
         # Cache calculations, this takes too long on large indexes
-        cached_calcs = Hash.new
-          hash[term] = ((
-
+        cached_calcs = Hash.new do |hash, term|
+          hash[term] = ((term / total_words) * Math.log(term / total_words))
+        end
 
         vec.each do |term|
           weighted_total += cached_calcs[term] if term > 0.0
@@ -81,12 +80,12 @@ module ClassifierReborn
 
         # Cache calculations, this takes too long on large indexes
         cached_calcs = Hash.new do |hash, val|
-          hash[val] = Math.log(
+          hash[val] = Math.log(val + 1) / -weighted_total
         end
 
-        vec.collect!
+        vec.collect! do |val|
           cached_calcs[val]
-
+        end
       end
 
       if $GSL
@@ -97,7 +96,5 @@ module ClassifierReborn
         @raw_vector = Vector[*vec]
       end
     end
-
   end
-
 end
data/lib/classifier-reborn/lsi/summarizer.rb
CHANGED
@@ -4,13 +4,13 @@
 
 module ClassifierReborn
   module Summarizer
-
+    module_function
 
-    def summary(
+    def summary(str, count = 10, separator = ' [...] ')
       perform_lsi split_sentences(str), count, separator
     end
 
-    def paragraph_summary(
+    def paragraph_summary(str, count = 1, separator = ' [...] ')
       perform_lsi split_paragraphs(str), count, separator
     end
 
@@ -23,11 +23,11 @@ module ClassifierReborn
     end
 
     def perform_lsi(chunks, count, separator)
-      lsi = ClassifierReborn::LSI.new :
+      lsi = ClassifierReborn::LSI.new auto_rebuild: false
       chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
       lsi.build_index
       summaries = lsi.highest_relative_content count
-
+      summaries.reject { |chunk| !summaries.include? chunk }.map(&:strip).join(separator)
     end
   end
 end
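With `module_function` in place, the summarizer's methods double as module-level calls. A sketch of the signatures the hunk fills in (`count` and `separator` now have visible defaults; the input text is illustrative):

```ruby
require 'classifier-reborn'

text = 'Dogs love long walks in the park. Cats prefer napping near a warm window. ' \
       'Either way, pets reward patient owners with affection.'

# Keep the 2 most representative sentences, joined by the default ' [...] '
ClassifierReborn::Summarizer.summary(text, 2)

# Or summarize paragraph by paragraph
ClassifierReborn::Summarizer.paragraph_summary(text, 1)
```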
data/lib/classifier-reborn/lsi/word_list.rb
CHANGED
@@ -8,7 +8,7 @@ module ClassifierReborn
 
   class WordList
     def initialize
-      @location_table =
+      @location_table = {}
     end
 
     # Adds a word (if it is new) and assigns it a unique dimension.
@@ -31,6 +31,5 @@ module ClassifierReborn
     def size
       @location_table.size
     end
-
   end
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: classifier-reborn
 version: !ruby/object:Gem::Version
-  version: 2.0.4
+  version: 2.1.0
 platform: ruby
 authors:
 - Lucas Carlson
@@ -10,7 +10,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2017-01-01 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fast-stemmer
@@ -55,7 +55,49 @@ dependencies:
   - !ruby/object:Gem::Version
     version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: minitest
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: minitest-reporters
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rubocop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: pry
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -134,9 +176,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.
+rubygems_version: 2.5.2
 signing_key:
 specification_version: 2
 summary: A general classifier module to allow Bayesian and other types of classifications.
 test_files: []
-has_rdoc: true