classifier-reborn 2.0.4 → 2.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 193d6c53d76559337140f192fc69910418015abe
- data.tar.gz: 745ec353d12ad84aaf74cca6b235401127f00075
+ metadata.gz: 53745cead2833e74e9d74f3359bae0bc7fd01fa4
+ data.tar.gz: 2926365890cf0bc43f7ff17570789ed5e032a45d
  SHA512:
- metadata.gz: c9891b16c6e9fb2ddfffd32a2335f59bfc55a5e97dae675b652acaa9122a2fc268bfc0d4c2be945456fe46397601127dd2a39782a99767718adf7fee373bdae2
- data.tar.gz: af844b19d90186a6e3866cdcbfa3deb6ba8492e4646298d6712168c1a329f34fbd24c551ff241a561711a8516cf57577dd7fe1a96bb069190e8545a09a570ff0
+ metadata.gz: 5def462eccbb9ef7a45d3968a5fbc2e2a886bfdda734a98e1f5904943ba1b4012ea331a51ff52e2c165db776dc8c6aed35fac9e01ea3e3f001f82daf98027f74
+ data.tar.gz: 22726b8f6c2acab5bb47b9faac2a2d615557bfff535327ef5bbe022a661c1be13c3b7cec95d1c0030f93158eb0942e121d97f2a0213a640eca00607de9d55f61
README.markdown CHANGED
@@ -1,5 +1,12 @@
  ## Welcome to Classifier Reborn

+ [![Gem Version](https://img.shields.io/gem/v/classifier-reborn.svg)][ruby-gems]
+ [![Build Status](https://img.shields.io/travis/jekyll/classifier-reborn/master.svg)][travis]
+ [![Dependency Status](https://img.shields.io/gemnasium/jekyll/classifier-reborn.svg)][gemnasium]
+ [ruby-gems]: https://rubygems.org/gems/jekyll/classifier-reborn
+ [gemnasium]: https://gemnasium.com/jekyll/classifier-reborn
+ [travis]: https://travis-ci.org/jekyll/classifier-reborn
+
  Classifier is a general module to allow Bayesian and other types of classifications.

  Classifier Reborn is a fork of cardmagic/classifier under more active development.
@@ -37,6 +44,8 @@ Notice that LSI will work without these libraries, but as soon as they are insta

  A Bayesian classifier by Lucas Carlson. Bayesian Classifiers are accurate, fast, and have modest memory requirements.

+ *Note: Classifier only supports UTF-8 characters.*
+
  ### Usage

  ```ruby
@@ -60,7 +69,7 @@ trained_classifier.classify "I love" # returns 'Interesting'
  ```

  Beyond the basic example, the constructor and trainer can be used in a more
- flexible way to accomidate non-trival applications. Consider the following
+ flexible way to accommodate non-trival applications. Consider the following
  program:

  ```ruby
@@ -72,7 +81,8 @@ require 'classifier-reborn'
  training_set = DATA.read.split("\n")
  categories = training_set.shift.split(',').map{|c| c.strip}

- classifier = ClassifierReborn::Bayes.new categories
+ # pass :auto_categorize option to allow feeding previously unknown categories
+ classifier = ClassifierReborn::Bayes.new categories, auto_categorize: true

  training_set.each do |a_line|
  next if a_line.empty? || '#' == a_line.strip[0]
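
A minimal sketch of the `auto_categorize: true` option this hunk documents (training data invented for illustration): with the option set, `train` accepts categories that were never declared in the constructor, instead of raising `CategoryNotFoundError`.

```ruby
require 'classifier-reborn'

classifier = ClassifierReborn::Bayes.new 'Ham', auto_categorize: true
classifier.train('Ham', 'are we still on for lunch tomorrow?')
classifier.train('Spam', 'cheap pills, buy now!') # 'Spam' is added on the fly
classifier.classify('buy cheap pills now')        # => 'Spam'
```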
@@ -141,7 +151,7 @@ Or suppose you just want the ability to have multiple categories and a 'None of
  When you initialize the *ClassifierReborn::Bayes* classifier there are several options which can be set that control threshold processing.

  ```ruby
- b = ClassifierRebor::Bayes.new(
+ b = ClassifierReborn::Bayes.new(
  'good', # one or more categories
  enable_threshold: true, # default: false
  threshold: -10.0 # default: 0.0
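
Continuing the corrected snippet, a hedged sketch of how the threshold plays out at classification time (training line invented; the nil-on-miss behavior comes from the `classify` change in `bayes.rb` later in this diff):

```ruby
b = ClassifierReborn::Bayes.new('good', enable_threshold: true, threshold: -10.0)
b.train('good', 'sunshine rainbows kittens')
b.classify('sunshine and kittens')   # => 'Good' when the score clears -10.0
b.classify('completely alien words') # => nil when the best score falls below the threshold
```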
@@ -218,6 +228,18 @@ with more than just simple strings.
  * http://www.chadfowler.com/index.cgi/Computing/LatentSemanticIndexing.rdoc
  * http://en.wikipedia.org/wiki/Latent_semantic_analysis

+
+ ## Code of Conduct
+
+ In order to have a more open and welcoming community, Classifier-Reborn adheres to the Jekyll
+ [code of conduct](https://github.com/jekyll/jekyll/blob/master/CONDUCT.markdown) adapted from the Ruby on Rails code of
+ conduct.
+
+ Please adhere to this code of conduct in any interactions you have in the
+ Classifier community. If you encounter someone violating
+ these terms, please let [@chase](https://github.com/Ch4s3) know and we will address it as soon as possible.
+
+
  ## Authors

  * Lucas Carlson (lucas@rufy.com)
bin/bayes.rb CHANGED
@@ -1,36 +1,36 @@
  #!/usr/bin/env ruby

  begin
- require 'rubygems'
- require 'classifier'
+ require 'rubygems'
+ require 'classifier'
  rescue
- require 'classifier'
+ require 'classifier'
  end

  require 'madeleine'

- m = SnapshotMadeleine.new(File.expand_path("~/.bayes_data")) {
- ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
- }
+ m = SnapshotMadeleine.new(File.expand_path('~/.bayes_data')) do
+ ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
+ end

  case ARGV[0]
- when "add"
- case ARGV[1].downcase
- when "interesting"
- m.system.train_interesting File.open(ARGV[2]).read
- puts "#{ARGV[2]} has been classified as interesting"
- when "uninteresting"
- m.system.train_uninteresting File.open(ARGV[2]).read
- puts "#{ARGV[2]} has been classified as uninteresting"
- else
- puts "Invalid category: choose between interesting and uninteresting"
- exit(1)
- end
- when "classify"
- puts m.system.classify(File.open(ARGV[1]).read)
+ when 'add'
+ case ARGV[1].downcase
+ when 'interesting'
+ m.system.train_interesting File.open(ARGV[2]).read
+ puts "#{ARGV[2]} has been classified as interesting"
+ when 'uninteresting'
+ m.system.train_uninteresting File.open(ARGV[2]).read
+ puts "#{ARGV[2]} has been classified as uninteresting"
+ else
+ puts 'Invalid category: choose between interesting and uninteresting'
+ exit(1)
+ end
+ when 'classify'
+ puts m.system.classify(File.open(ARGV[1]).read)
  else
- puts "Invalid option: choose add [category] [file] or clasify [file]"
- exit(-1)
+ puts 'Invalid option: choose add [category] [file] or clasify [file]'
+ exit(-1)
  end

  m.take_snapshot
bin/summarize.rb CHANGED
@@ -1,10 +1,10 @@
  #!/usr/bin/env ruby

  begin
- require 'rubygems'
- require 'classifier'
+ require 'rubygems'
+ require 'classifier'
  rescue
- require 'classifier'
+ require 'classifier'
  end

  require 'open-uri'
@@ -13,4 +13,4 @@ num = ARGV[1].to_i
  num = num < 1 ? 10 : num

  text = open(ARGV.first).read
- puts text.gsub(/<[^>]+>/,"").gsub(/[\s]+/," ").summary(num)
+ puts text.gsub(/<[^>]+>/, '').gsub(/[\s]+/, ' ').summary(num)
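
The script reaches summarization through the `String#summary` helper the gem mixes in; the same result can be had by calling the `ClassifierReborn::Summarizer` module that is refactored later in this diff. A small hedged sketch (sample text invented):

```ruby
require 'classifier-reborn'

text = 'Ruby is a dynamic language. It focuses on programmer happiness. ' \
       'Many web frameworks are written in it. Some people prefer static typing.'
# Keep the two most representative sentences.
puts ClassifierReborn::Summarizer.summary(text, 2)
```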
lib/classifier-reborn.rb CHANGED
@@ -27,4 +27,4 @@
  require 'rubygems'
  require_relative 'classifier-reborn/category_namer'
  require_relative 'classifier-reborn/bayes'
- require_relative 'classifier-reborn/lsi'
+ require_relative 'classifier-reborn/lsi'
lib/classifier-reborn/bayes.rb CHANGED
@@ -17,20 +17,22 @@ module ClassifierReborn
  # auto_categorize: false When true, enables ability to dynamically declare a category
  # enable_threshold: false When true, enables a threshold requirement for classifition
  # threshold: 0.0 Default threshold, only used when enabled
+ # enable_stemmer: true When false, disables word stemming
  def initialize(*args)
- @categories = Hash.new
- options = { language: 'en',
+ @categories = {}
+ options = { language: 'en',
  auto_categorize: false,
  enable_threshold: false,
- threshold: 0.0
+ threshold: 0.0,
+ enable_stemmer: true
  }
- args.flatten.each { |arg|
- if arg.kind_of?(Hash)
+ args.flatten.each do |arg|
+ if arg.is_a?(Hash)
  options.merge!(arg)
  else
  add_category(arg)
  end
- }
+ end

  @total_words = 0
  @category_counts = Hash.new(0)
@@ -40,6 +42,7 @@ module ClassifierReborn
  @auto_categorize = options[:auto_categorize]
  @enable_threshold = options[:enable_threshold]
  @threshold = options[:threshold]
+ @enable_stemmer = options[:enable_stemmer]
  end

  # Provides a general training method for all categories specified in Bayes#new
@@ -52,18 +55,18 @@ module ClassifierReborn
  category = CategoryNamer.prepare_name(category)

  # Add the category dynamically or raise an error
- if !@categories.has_key?(category)
+ unless @categories.key?(category)
  if @auto_categorize
  add_category(category)
  else
- raise CategoryNotFoundError.new("Cannot train; category #{category} does not exist")
+ raise CategoryNotFoundError, "Cannot train; category #{category} does not exist"
  end
  end

  @category_counts[category] += 1
- Hasher.word_hash(text, @language).each do |word, count|
- @categories[category][word] += count
- @category_word_count[category] += count
+ Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
+ @categories[category][word] += count
+ @category_word_count[category] += count
  @total_words += count
  end
  end
@@ -78,20 +81,17 @@ module ClassifierReborn
  def untrain(category, text)
  category = CategoryNamer.prepare_name(category)
  @category_counts[category] -= 1
- Hasher.word_hash(text, @language).each do |word, count|
- if @total_words >= 0
- orig = @categories[category][word] || 0
- @categories[category][word] -= count
- if @categories[category][word] <= 0
- @categories[category].delete(word)
- count = orig
- end
-
- if @category_word_count[category] >= count
- @category_word_count[category] -= count
- end
- @total_words -= count
+ Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
+ next if @total_words < 0
+ orig = @categories[category][word] || 0
+ @categories[category][word] -= count
+ if @categories[category][word] <= 0
+ @categories[category].delete(word)
+ count = orig
  end
+
+ @category_word_count[category] -= count if @category_word_count[category] >= count
+ @total_words -= count
  end
  end

@@ -100,21 +100,21 @@
  # => {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
  # The largest of these scores (the one closest to 0) is the one picked out by #classify
  def classifications(text)
- score = Hash.new
- word_hash = Hasher.word_hash(text, @language)
+ score = {}
+ word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
  training_count = @category_counts.values.reduce(:+).to_f
  @categories.each do |category, category_words|
  score[category.to_s] = 0
  total = (@category_word_count[category] || 1).to_f
- word_hash.each do |word, count|
- s = category_words.has_key?(word) ? category_words[word] : 0.1
- score[category.to_s] += Math.log(s/total)
+ word_hash.each do |word, _count|
+ s = category_words.key?(word) ? category_words[word] : 0.1
+ score[category.to_s] += Math.log(s / total)
  end
  # now add prior probability for the category
- s = @category_counts.has_key?(category) ? @category_counts[category] : 0.1
+ s = @category_counts.key?(category) ? @category_counts[category] : 0.1
  score[category.to_s] += Math.log(s / training_count)
  end
- return score
+ score
  end

  # Returns the classification of the provided +text+, which is one of the
@@ -128,21 +128,15 @@
  # Return the classification without the score
  def classify(text)
  result, score = classify_with_score(text)
- if threshold_enabled?
- result = nil if score < @threshold || score == Float::INFINITY
- end
- return result
+ result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?
+ result
  end

  # Retrieve the current threshold value
- def threshold
- @threshold
- end
+ attr_reader :threshold

  # Dynamically set the threshold value
- def threshold=(a_float)
- @threshold = a_float
- end
+ attr_writer :threshold

  # Dynamically enable threshold for classify results
  def enable_threshold
@@ -164,6 +158,16 @@ module ClassifierReborn
  !@enable_threshold
  end

+ # Is word stemming enabled?
+ def stemmer_enabled?
+ @enable_stemmer
+ end
+
+ # Is word stemming disabled?
+ def stemmer_disabled?
+ !@enable_stemmer
+ end
+
  # Provides training and untraining methods for the categories specified in Bayes#new
  # For example:
  # b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other'
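
A hedged sketch of the new stemming switch wired up in the hunks above (training text invented):

```ruby
b = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', enable_stemmer: false
b.stemmer_enabled?  # => false
b.stemmer_disabled? # => true
# With stemming disabled, 'running' and 'run' are counted as distinct words;
# under the default enable_stemmer: true they collapse to the stem 'run'.
```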
@@ -174,12 +178,12 @@
  def method_missing(name, *args)
  cleaned_name = name.to_s.gsub(/(un)?train_([\w]+)/, '\2')
  category = CategoryNamer.prepare_name(cleaned_name)
- if @categories.has_key? category
- args.each { |text| eval("#{$1}train(category, text)") }
+ if @categories.key? category
+ args.each { |text| eval("#{Regexp.last_match(1)}train(category, text)") }
  elsif name.to_s =~ /(un)?train_([\w]+)/
  raise StandardError, "No such category: #{category}"
  else
- super #raise StandardError, "No such method: #{name}"
+ super # raise StandardError, "No such method: #{name}"
  end
  end

@@ -188,7 +192,7 @@
  # b.categories
  # => ['This', 'That', 'the_other']
  def categories # :nodoc:
- @categories.keys.collect {|c| c.to_s}
+ @categories.keys.collect(&:to_s)
  end

  # Allows you to add categories to the classifier.
@@ -203,6 +207,6 @@
  @categories[CategoryNamer.prepare_name(category)] ||= Hash.new(0)
  end

- alias append_category add_category
+ alias_method :append_category, :add_category
  end
  end
lib/classifier-reborn/category_namer.rb CHANGED
@@ -7,11 +7,12 @@ require 'classifier-reborn/extensions/hasher'

  module ClassifierReborn
  module CategoryNamer
- extend self
- def prepare_name(name)
+ module_function
+
+ def prepare_name(name)
  return name if name.is_a?(Symbol)

- name.to_s.gsub("_"," ").capitalize.intern
+ name.to_s.tr('_', ' ').capitalize.intern
  end
  end
  end
lib/classifier-reborn/extensions/hasher.rb CHANGED
@@ -9,29 +9,37 @@ module ClassifierReborn
  module Hasher
  STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]

- extend self
+ module_function

  # Return a Hash of strings => ints. Each word in the string is stemmed,
  # interned, and indexes to its frequency in the document.
- def word_hash(str, language = 'en')
- cleaned_word_hash = clean_word_hash(str, language)
+ def word_hash(str, language = 'en', enable_stemmer = true)
+ cleaned_word_hash = clean_word_hash(str, language, enable_stemmer)
  symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/))
- return cleaned_word_hash.merge(symbol_hash)
+ cleaned_word_hash.merge(symbol_hash)
  end

  # Return a word hash without extra punctuation or short symbols, just stemmed words
- def clean_word_hash(str, language = 'en')
- word_hash_for_words str.gsub(/[^\p{WORD}\s]/,'').downcase.split, language
+ def clean_word_hash(str, language = 'en', enable_stemmer = true)
+ word_hash_for_words str.gsub(/[^\p{WORD}\s]/, '').downcase.split, language, enable_stemmer
  end

- def word_hash_for_words(words, language = 'en')
+ def word_hash_for_words(words, language = 'en', enable_stemmer = true)
  d = Hash.new(0)
  words.each do |word|
- if word.length > 2 && !STOPWORDS[language].include?(word)
+ next unless word.length > 2 && !STOPWORDS[language].include?(word)
+ if enable_stemmer
  d[word.stem.intern] += 1
+ else
+ d[word.intern] += 1
  end
  end
- return d
+ d
+ end
+
+ # Add custom path to a new stopword file created by user
+ def add_custom_stopword_path(path)
+ STOPWORDS_PATH.unshift(path)
  end

  def word_hash_for_symbols(words)
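
A hedged sketch of the new stopword hook (the directory path is hypothetical): the custom directory is consulted before the gem's bundled data/stopwords, and is expected to contain one whitespace-separated word list per language code, e.g. a file named en.

```ruby
# Prepend a user-maintained stopword directory, then hash as usual.
ClassifierReborn::Hasher.add_custom_stopword_path('/path/to/custom/stopwords')
ClassifierReborn::Hasher.word_hash('an example sentence', 'en')
```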
@@ -39,7 +47,7 @@ module ClassifierReborn
  words.each do |word|
  d[word.intern] += 1
  end
- return d
+ d
  end

  # Create a lazily-loaded hash of stopword data
@@ -48,7 +56,7 @@

  STOPWORDS_PATH.each do |path|
  if File.exist?(File.join(path, language))
- hash[language] = Set.new File.read(File.join(path, language.to_s)).split
+ hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
  break
  end
  end
lib/classifier-reborn/extensions/vector.rb CHANGED
@@ -6,17 +6,17 @@
  require 'matrix'

  class Matrix
- def Matrix.diag(s)
- Matrix.diagonal(*s)
+ def self.diag(s)
+ Matrix.diagonal(*s)
  end

- alias :trans :transpose
+ alias_method :trans, :transpose

  def SV_decomp(maxSweeps = 20)
- if self.row_size >= self.column_size
- q = self.trans * self
+ if row_size >= column_size
+ q = trans * self
  else
- q = self * self.trans
+ q = self * trans
  end

  qrot = q.dup
@@ -24,55 +24,56 @@ class Matrix
  mzrot = nil
  cnt = 0
  s_old = nil
- mu = nil

- while true do
+ loop do
  cnt += 1
- for row in (0...qrot.row_size-1) do
- for col in (1..qrot.row_size-1) do
+ (0...qrot.row_size - 1).each do |row|
+ (1..qrot.row_size - 1).each do |col|
  next if row == col
- h = Math.atan((2 * qrot[row,col])/(qrot[row,row]-qrot[col,col]))/2.0
+
+ h = Math.atan((2 * qrot[row, col]) / (qrot[row, row] - qrot[col, col])) / 2.0
  hcos = Math.cos(h)
  hsin = Math.sin(h)
  mzrot = Matrix.identity(qrot.row_size)
- mzrot[row,row] = hcos
- mzrot[row,col] = -hsin
- mzrot[col,row] = hsin
- mzrot[col,col] = hcos
+ mzrot[row, row] = hcos
+ mzrot[row, col] = -hsin
+ mzrot[col, row] = hsin
+ mzrot[col, col] = hcos
  qrot = mzrot.trans * qrot * mzrot
- v = v * mzrot
+ v *= mzrot
  end
  end
  s_old = qrot.dup if cnt == 1
  sum_qrot = 0.0
  if cnt > 1
  qrot.row_size.times do |r|
- sum_qrot += (qrot[r,r]-s_old[r,r]).abs if (qrot[r,r]-s_old[r,r]).abs > 0.001
+ sum_qrot += (qrot[r, r] - s_old[r, r]).abs if (qrot[r, r] - s_old[r, r]).abs > 0.001
  end
  s_old = qrot.dup
  end
- break if (sum_qrot <= 0.001 and cnt > 1) or cnt >= maxSweeps
+ break if (sum_qrot <= 0.001 && cnt > 1) || cnt >= maxSweeps
  end # of do while true
  s = []
  qrot.row_size.times do |r|
- s << Math.sqrt(qrot[r,r])
+ s << Math.sqrt(qrot[r, r])
  end
- #puts "cnt = #{cnt}"
- if self.row_size >= self.column_size
- mu = self * v * Matrix.diagonal(*s).inverse
+ # puts "cnt = #{cnt}"
+ if row_size >= column_size
+ mu = self * v * Matrix.diagonal(*s).inverse
  return [mu, v, s]
  else
  puts v.row_size
  puts v.column_size
- puts self.row_size
- puts self.column_size
+ puts row_size
+ puts column_size
  puts s.size

- mu = (self.trans * v * Matrix.diagonal(*s).inverse)
+ mu = (trans * v * Matrix.diagonal(*s).inverse)
  return [mu, v, s]
  end
  end
- def []=(i,j,val)
+
+ def []=(i, j, val)
  @rows[i][j] = val
  end
  end
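
For orientation, this monkey patch gives Ruby's stdlib Matrix a Jacobi-rotation singular value decomposition; a hedged usage sketch (values arbitrary, and note the extension is only loaded on the non-GSL fallback path):

```ruby
require 'classifier-reborn' # pulls in the Matrix extension when GSL is unavailable

m = Matrix[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
u, v, s = m.SV_decomp # => [left singular vectors, right singular vectors, singular values]
```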
lib/classifier-reborn/extensions/vector_serialize.rb CHANGED
@@ -1,20 +1,18 @@
  module GSL
-
  class Vector
- def _dump(v)
- Marshal.dump( self.to_a )
+ def _dump(_v)
+ Marshal.dump(to_a)
  end
-
+
  def self._load(arr)
  arry = Marshal.load(arr)
- return GSL::Vector.alloc(arry)
+ GSL::Vector.alloc(arry)
  end
-
  end
-
+
  class Matrix
- class <<self
- alias :diag :diagonal
- end
+ class <<self
+ alias_method :diag, :diagonal
+ end
  end
  end
lib/classifier-reborn/lsi.rb CHANGED
@@ -3,13 +3,14 @@
  # License:: LGPL

  begin
- raise LoadError if ENV['NATIVE_VECTOR'] == "true" # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+ raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`

- require 'gsl' # requires http://rb-gsl.rubyforge.org/
+ require 'gsl' # requires https://github.com/SciRuby/rb-gsl
  require_relative 'extensions/vector_serialize'
  $GSL = true

  rescue LoadError
+ $GSL = false
  require_relative 'extensions/vector'
  end

@@ -19,12 +20,10 @@ require_relative 'lsi/cached_content_node'
  require_relative 'lsi/summarizer'

  module ClassifierReborn
-
  # This class implements a Latent Semantic Indexer, which can search, classify and cluster
  # data based on underlying semantic relations. For more information on the algorithms used,
  # please consult Wikipedia[http://en.wikipedia.org/wiki/Latent_Semantic_Indexing].
  class LSI
-
  attr_reader :word_list, :cache_node_vectors
  attr_accessor :auto_rebuild

@@ -36,12 +35,12 @@
  #
  def initialize(options = {})
  @auto_rebuild = options[:auto_rebuild] != false
- @word_list, @items = WordList.new, {}
- @version, @built_at_version = 0, -1
+ @word_list = WordList.new
+ @items = {}
+ @version = 0
+ @built_at_version = -1
  @language = options[:language] || 'en'
- if @cache_node_vectors = options[:cache_node_vectors]
- extend CachedContentNode::InstanceMethods
- end
+ extend CachedContentNode::InstanceMethods if @cache_node_vectors = options[:cache_node_vectors]
  end

  # Returns true if the index needs to be rebuilt. The index needs
@@ -64,39 +63,44 @@
  # ar = ActiveRecordObject.find( :all )
  # lsi.add_item ar, *ar.categories { |x| ar.content }
  #
- def add_item( item, *categories, &block )
+ def add_item(item, *categories, &block)
  clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language)
- @items[item] = if @cache_node_vectors
- CachedContentNode.new(clean_word_hash, *categories)
+ if clean_word_hash.empty?
+ puts "Input: '#{item}' is entirely stopwords or words with 2 or fewer characters. Classifier-Reborn cannot handle this document properly."
  else
- ContentNode.new(clean_word_hash, *categories)
+ @items[item] = if @cache_node_vectors
+ CachedContentNode.new(clean_word_hash, *categories)
+ else
+ ContentNode.new(clean_word_hash, *categories)
+ end
+ @version += 1
+ build_index if @auto_rebuild
  end
- @version += 1
- build_index if @auto_rebuild
  end

  # A less flexible shorthand for add_item that assumes
  # you are passing in a string with no categorries. item
  # will be duck typed via to_s .
  #
- def <<( item )
- add_item item
+ def <<(item)
+ add_item(item)
  end

  # Returns the categories for a given indexed items. You are free to add and remove
  # items from this as you see fit. It does not invalide an index to change its categories.
  def categories_for(item)
  return [] unless @items[item]
- return @items[item].categories
+
+ @items[item].categories
  end

  # Removes an item from the database, if it is indexed.
  #
- def remove_item( item )
- if @items.key? item
- @items.delete item
- @version += 1
- end
+ def remove_item(item)
+ return unless @items.key? item
+
+ @items.delete item
+ @version += 1
  end

  # Returns an array of items that are indexed.
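
A hedged sketch of the new empty-input guard in add_item (sample strings invented): text that reduces to nothing after stopword and short-word filtering is no longer indexed, so it cannot poison the index.

```ruby
lsi = ClassifierReborn::LSI.new
lsi.add_item 'Dogs are not cats.', :dog  # indexed normally
lsi.add_item 'a an the', :noise          # prints the warning and is skipped
lsi.items.include?('a an the')           # => false
```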
@@ -118,30 +122,30 @@
  # cutoff parameter tells the indexer how many of these values to keep.
  # A value of 1 for cutoff means that no semantic analysis will take place,
  # turning the LSI class into a simple vector search engine.
- def build_index( cutoff=0.75 )
+ def build_index(cutoff = 0.75)
  return unless needs_rebuild?
  make_word_list

  doc_list = @items.values
- tda = doc_list.collect { |node| node.raw_vector_with( @word_list ) }
+ tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

  if $GSL
- tdm = GSL::Matrix.alloc(*tda).trans
- ntdm = build_reduced_matrix(tdm, cutoff)
-
- ntdm.size[1].times do |col|
- vec = GSL::Vector.alloc( ntdm.column(col) ).row
- doc_list[col].lsi_vector = vec
- doc_list[col].lsi_norm = vec.normalize
- end
+ tdm = GSL::Matrix.alloc(*tda).trans
+ ntdm = build_reduced_matrix(tdm, cutoff)
+
+ ntdm.size[1].times do |col|
+ vec = GSL::Vector.alloc(ntdm.column(col)).row
+ doc_list[col].lsi_vector = vec
+ doc_list[col].lsi_norm = vec.normalize
+ end
  else
- tdm = Matrix.rows(tda).trans
- ntdm = build_reduced_matrix(tdm, cutoff)
+ tdm = Matrix.rows(tda).trans
+ ntdm = build_reduced_matrix(tdm, cutoff)

- ntdm.row_size.times do |col|
- doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col]
- doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
- end
+ ntdm.row_size.times do |col|
+ doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col]
+ doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
+ end
  end

  @built_at_version = @version
@@ -155,13 +159,13 @@
  # your dataset's general content. For example, if you were to use categorize on the
  # results of this data, you could gather information on what your dataset is generally
  # about.
- def highest_relative_content( max_chunks=10 )
- return [] if needs_rebuild?
+ def highest_relative_content(max_chunks = 10)
+ return [] if needs_rebuild?

- avg_density = Hash.new
- @items.each_key { |item| avg_density[item] = proximity_array_for_content(item).inject(0.0) { |x,y| x + y[1]} }
+ avg_density = {}
+ @items.each_key { |item| avg_density[item] = proximity_array_for_content(item).inject(0.0) { |x, y| x + y[1] } }

- avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks-1].map
+ avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks - 1].map
  end

  # This function is the primitive that find_related and classify
@@ -176,10 +180,10 @@
  # The parameter doc is the content to compare. If that content is not
  # indexed, you can pass an optional block to define how to create the
  # text data. See add_item for examples of how this works.
- def proximity_array_for_content( doc, &block )
+ def proximity_array_for_content(doc, &block)
  return [] if needs_rebuild?

- content_node = node_for_content( doc, &block )
+ content_node = node_for_content(doc, &block)
  result =
  @items.keys.collect do |item|
  if $GSL
@@ -197,10 +201,18 @@
  # calculated vectors instead of their full versions. This is useful when
  # you're trying to perform operations on content that is much smaller than
  # the text you're working with. search uses this primitive.
- def proximity_norms_for_content( doc, &block )
+ def proximity_norms_for_content(doc, &block)
  return [] if needs_rebuild?

- content_node = node_for_content( doc, &block )
+ content_node = node_for_content(doc, &block)
+ if $GSL && content_node.raw_norm.isnan?.all?
+ puts "There are no documents that are similar to #{doc}"
+ else
+ content_node_norms(content_node)
+ end
+ end
+
+ def content_node_norms(content_node)
  result =
  @items.keys.collect do |item|
  if $GSL
@@ -220,11 +232,13 @@
  #
  # While this may seem backwards compared to the other functions that LSI supports,
  # it is actually the same algorithm, just applied on a smaller document.
- def search( string, max_nearest=3 )
+ def search(string, max_nearest = 3)
  return [] if needs_rebuild?
- carry = proximity_norms_for_content( string )
- result = carry.collect { |x| x[0] }
- return result[0..max_nearest-1]
+ carry = proximity_norms_for_content(string)
+ unless carry.nil?
+ result = carry.collect { |x| x[0] }
+ result[0..max_nearest - 1]
+ end
  end

  # This function takes content and finds other documents
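
A hedged sketch of the hardened search path (documents borrowed from the gem's README examples): on the GSL branch, a query whose norm comes back NaN now prints a notice and yields nil instead of raising.

```ruby
lsi = ClassifierReborn::LSI.new
lsi.add_item 'This text deals with dogs. Dogs.', :dog
lsi.add_item 'This text involves birds. Birds.', :bird
lsi.search('dog', 1) # => the dog document; nil (not an exception) when nothing is similar under GSL
```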
@@ -236,21 +250,21 @@
  # This is particularly useful for identifing clusters in your document space.
  # For example you may want to identify several "What's Related" items for weblog
  # articles, or find paragraphs that relate to each other in an essay.
- def find_related( doc, max_nearest=3, &block )
+ def find_related(doc, max_nearest = 3, &block)
  carry =
- proximity_array_for_content( doc, &block ).reject { |pair| pair[0].eql? doc }
+ proximity_array_for_content(doc, &block).reject { |pair| pair[0].eql? doc }
  result = carry.collect { |x| x[0] }
- return result[0..max_nearest-1]
+ result[0..max_nearest - 1]
  end

  # Return the most obvious category with the score
- def classify_with_score( doc, cutoff=0.30, &block)
- return scored_categories(doc, cutoff, &block).last
+ def classify_with_score(doc, cutoff = 0.30, &block)
+ scored_categories(doc, cutoff, &block).last
  end

  # Return the most obvious category without the score
- def classify( doc, cutoff=0.30, &block )
- return scored_categories(doc, cutoff, &block).last.first
+ def classify(doc, cutoff = 0.30, &block)
+ scored_categories(doc, cutoff, &block).last.first
  end

  # This function uses a voting system to categorize documents, based on
@@ -262,10 +276,10 @@
  # text. A cutoff of 1 means that every document in the index votes on
  # what category the document is in. This may not always make sense.
  #
- def scored_categories( doc, cutoff=0.30, &block )
+ def scored_categories(doc, cutoff = 0.30, &block)
  icutoff = (@items.size * cutoff).round
- carry = proximity_array_for_content( doc, &block )
- carry = carry[0..icutoff-1]
+ carry = proximity_array_for_content(doc, &block)
+ carry = carry[0..icutoff - 1]
  votes = Hash.new(0.0)
  carry.each do |pair|
  @items[pair[0]].categories.each do |category|
@@ -273,31 +287,31 @@
  end
  end

- return votes.sort_by { |_, score| score }
+ votes.sort_by { |_, score| score }
  end

  # Prototype, only works on indexed documents.
  # I have no clue if this is going to work, but in theory
  # it's supposed to.
- def highest_ranked_stems( doc, count=3 )
- raise "Requested stem ranking on non-indexed content!" unless @items[doc]
- arr = node_for_content(doc).lsi_vector.to_a
- top_n = arr.sort.reverse[0..count-1]
- return top_n.collect { |x| @word_list.word_for_index(arr.index(x))}
+ def highest_ranked_stems(doc, count = 3)
+ raise 'Requested stem ranking on non-indexed content!' unless @items[doc]
+ content_vector_array = node_for_content(doc).lsi_vector.to_a
+ top_n = content_vector_array.sort.reverse[0..count - 1]
+ top_n.collect { |x| @word_list.word_for_index(content_vector_array.index(x)) }
  end

  private
- def build_reduced_matrix( matrix, cutoff=0.75 )
+
+ def build_reduced_matrix(matrix, cutoff = 0.75)
  # TODO: Check that M>=N on these dimensions! Transpose helps assure this
  u, v, s = matrix.SV_decomp
-
  # TODO: Better than 75% term, please. :\
  s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
  s.size.times do |ord|
  s[ord] = 0.0 if s[ord] < s_cutoff
  end
  # Reconstruct the term document matrix, only with reduced rank
- u * ($GSL ? GSL::Matrix : ::Matrix).diag( s ) * v.trans
+ u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
  end

  def node_for_content(item, &block)
@@ -306,23 +320,21 @@
  else
  clean_word_hash = Hasher.clean_word_hash((block ? block.call(item) : item.to_s), @language)

- cn = ContentNode.new(clean_word_hash, &block) # make the node and extract the data
+ content_node = ContentNode.new(clean_word_hash, &block) # make the node and extract the data

  unless needs_rebuild?
- cn.raw_vector_with( @word_list ) # make the lsi raw and norm vectors
+ content_node.raw_vector_with(@word_list) # make the lsi raw and norm vectors
  end
  end

- return cn
+ content_node
  end

  def make_word_list
  @word_list = WordList.new
  @items.each_value do |node|
- node.word_hash.each_key { |key| @word_list.add_word key }
+ node.word_hash.each_key { |key| @word_list.add_word(key) }
  end
  end
-
  end
  end
-
lib/classifier-reborn/lsi/cached_content_node.rb CHANGED
@@ -3,7 +3,6 @@
  # License:: LGPL

  module ClassifierReborn
-
  # Subclass of ContentNode which caches the search_vector transpositions.
  # Its great because its much faster for large indexes, but at the cost of more ram. Additionally,
  # if you Marshal your classifier and want to keep the size down, you'll need to manually
@@ -16,7 +15,7 @@
  end
  end

- def initialize( word_hash, *categories )
+ def initialize(word_hash, *categories)
  clear_cache!
  super
  end
@@ -29,13 +28,13 @@
  def transposed_search_vector
  @transposed_search_vector ||= super
  end
-
+
  # Clear the cache before we continue on
- def raw_vector_with( word_list )
+ def raw_vector_with(word_list)
  clear_cache!
  super
  end
-
+
  # We don't want the cached_data here
  def marshal_dump
  [@lsi_vector, @lsi_norm, @raw_vector, @raw_norm, @categories, @word_hash]
lib/classifier-reborn/lsi/content_node.rb CHANGED
@@ -3,10 +3,9 @@
  # License:: LGPL

  module ClassifierReborn
-
- # This is an internal data structure class for the LSI node. Save for
- # raw_vector_with, it should be fairly straightforward to understand.
- # You should never have to use it directly.
+ # This is an internal data structure class for the LSI node. Save for
+ # raw_vector_with, it should be fairly straightforward to understand.
+ # You should never have to use it directly.
  class ContentNode
  attr_accessor :raw_vector, :raw_norm,
  :lsi_vector, :lsi_norm,
@@ -15,7 +14,7 @@
  attr_reader :word_hash
  # If text_proc is not specified, the source will be duck-typed
  # via source.to_s
- def initialize( word_hash, *categories )
+ def initialize(word_hash, *categories)
  @categories = categories || []
  @word_hash = word_hash
  @lsi_norm, @lsi_vector = nil
@@ -38,11 +37,11 @@

  # Creates the raw vector out of word_hash using word_list as the
  # key for mapping the vector space.
- def raw_vector_with( word_list )
+ def raw_vector_with(word_list)
  if $GSL
- vec = GSL::Vector.alloc(word_list.size)
+ vec = GSL::Vector.alloc(word_list.size)
  else
- vec = Array.new(word_list.size, 0)
+ vec = Array.new(word_list.size, 0)
  end

  @word_hash.each_key do |word|
@@ -52,7 +51,7 @@
  # Perform the scaling transform and force floating point arithmetic
  if $GSL
  sum = 0.0
- vec.each {|v| sum += v }
+ vec.each { |v| sum += v }
  total_words = sum
  else
  total_words = vec.reduce(0, :+).to_f
@@ -63,7 +62,7 @@
  if $GSL
  vec.each { |word| total_unique_words += 1 if word != 0.0 }
  else
- total_unique_words = vec.count{ |word| word != 0 }
+ total_unique_words = vec.count { |word| word != 0 }
  end

  # Perform first-order association transform if this vector has more
@@ -71,9 +70,9 @@
  if total_words > 1.0 && total_unique_words > 1
  weighted_total = 0.0
  # Cache calculations, this takes too long on large indexes
- cached_calcs = Hash.new { |hash, term|
- hash[term] = (( term / total_words ) * Math.log( term / total_words ))
- }
+ cached_calcs = Hash.new do |hash, term|
+ hash[term] = ((term / total_words) * Math.log(term / total_words))
+ end

  vec.each do |term|
  weighted_total += cached_calcs[term] if term > 0.0
@@ -81,12 +80,12 @@

  # Cache calculations, this takes too long on large indexes
  cached_calcs = Hash.new do |hash, val|
- hash[val] = Math.log( val + 1 ) / -weighted_total
+ hash[val] = Math.log(val + 1) / -weighted_total
  end

- vec.collect! { |val|
+ vec.collect! do |val|
  cached_calcs[val]
- }
+ end
  end

  if $GSL
@@ -97,7 +96,5 @@
  @raw_vector = Vector[*vec]
  end
  end
-
  end
-
  end
lib/classifier-reborn/lsi/summarizer.rb CHANGED
@@ -4,13 +4,13 @@

  module ClassifierReborn
  module Summarizer
- extend self
+ module_function

- def summary( str, count=10, separator=" [...] " )
+ def summary(str, count = 10, separator = ' [...] ')
  perform_lsi split_sentences(str), count, separator
  end

- def paragraph_summary( str, count=1, separator=" [...] " )
+ def paragraph_summary(str, count = 1, separator = ' [...] ')
  perform_lsi split_paragraphs(str), count, separator
  end

@@ -23,11 +23,11 @@
  end

  def perform_lsi(chunks, count, separator)
- lsi = ClassifierReborn::LSI.new :auto_rebuild => false
+ lsi = ClassifierReborn::LSI.new auto_rebuild: false
  chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
  lsi.build_index
  summaries = lsi.highest_relative_content count
- return summaries.reject { |chunk| !summaries.include? chunk }.map { |x| x.strip }.join(separator)
+ summaries.reject { |chunk| !summaries.include? chunk }.map(&:strip).join(separator)
  end
  end
  end
lib/classifier-reborn/lsi/word_list.rb CHANGED
@@ -8,7 +8,7 @@ module ClassifierReborn

  class WordList
  def initialize
- @location_table = Hash.new
+ @location_table = {}
  end

  # Adds a word (if it is new) and assigns it a unique dimension.
@@ -31,6 +31,5 @@
  def size
  @location_table.size
  end
-
  end
  end
lib/classifier-reborn/version.rb CHANGED
@@ -1,3 +1,3 @@
  module ClassifierReborn
- VERSION = '2.0.4'
+ VERSION = '2.1.0'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: classifier-reborn
  version: !ruby/object:Gem::Version
- version: 2.0.4
+ version: 2.1.0
  platform: ruby
  authors:
  - Lucas Carlson
@@ -10,7 +10,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-10-31 00:00:00.000000000 Z
+ date: 2017-01-01 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: fast-stemmer
@@ -55,7 +55,49 @@ dependencies:
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
- name: test-unit
+ name: minitest
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ - !ruby/object:Gem::Dependency
+ name: minitest-reporters
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ - !ruby/object:Gem::Dependency
+ name: rubocop
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ - !ruby/object:Gem::Dependency
+ name: pry
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
@@ -134,9 +176,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.4.8
+ rubygems_version: 2.5.2
  signing_key:
  specification_version: 2
  summary: A general classifier module to allow Bayesian and other types of classifications.
  test_files: []
- has_rdoc: true