basset 1.0.0

@@ -0,0 +1,3 @@
+ == 1.0.0 / 2008-01-08
+
+ * Initial release
@@ -0,0 +1,20 @@
+ Copyright (c) 2007 Paul Dix
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,21 @@
+ History.txt
+ License.txt
+ Manifest.txt
+ README.txt
+ Rakefile
+ lib/basset.rb
+ lib/basset/classification_evaluator.rb
+ lib/basset/core_extensions.rb
+ lib/basset/document.rb
+ lib/basset/document_override_example.rb
+ lib/basset/feature.rb
+ lib/basset/feature_extractor.rb
+ lib/basset/feature_selector.rb
+ lib/basset/naive_bayes.rb
+ lib/basset/yaml_serialization.rb
+ spec/spec.opts
+ spec/spec_helper.rb
+ spec/unit/document_spec.rb
+ spec/unit/feature_extractor_spec.rb
+ spec/unit/feature_selector_spec.rb
+ spec/unit/feature_spec.rb
@@ -0,0 +1,31 @@
+ Author:: Paul Dix (mailto:paul@pauldix.net)
+
+ =Summary
+ This is a library for running machine learning tasks.
+ These include a generic document representation class, a feature selector, a feature extractor, a naive Bayes classifier, and a classification evaluator for running tests. The goal was to create a general framework that is easy to modify for specific problems. I also tried to design the system to be extensible so that more classification and clustering algorithms can be added as I get deeper into my studies of machine learning.
+
+ =What You Could Use This For
+ In case you don't know what machine learning or classification is, here's a quick example scenario and an explanation of the process. The most popular task is spam identification. To do this you'll first need a set of training documents: a number of documents that you have labeled as either spam or not. With training sets, bigger is better. You should have at least 100 of each type (spam and not spam); 1,000 of each would be better, and 10,000 of each would be ideal. Once you have the training set, the process with this library flows like this:
+
+ * Create each document as a Document (a class in this library)
+ * Pass those documents into the FeatureSelector
+ * Get the best features and pass those into the FeatureExtractor
+ * Extract features from each document using the extractor
+ * Pass those extracted features to NaiveBayes as part of the training set
+ * Save the FeatureExtractor and NaiveBayes to files
+
+ That represents the process of selecting features and training the classifier. Once you've done that, you can predict whether a new, previously unseen document is spam by doing the following:
+
+ * Load the feature extractor and naive Bayes classifier from their files
+ * Create a new Document object from your unseen document
+ * Extract the features from that document using the feature extractor
+ * Pass those to the classify method of the naive Bayes classifier
+
+ Before doing real classification you'll probably want to test things. Use the ClassificationEvaluator for this. The evaluator takes your training documents and runs a series of cross validation tests to estimate how successful the classifier will be at predicting unseen documents. Easy classification tasks will generally be > 90% accurate, while others can be much harder. Each classification task is different, and most of the time you won't know until you actually test it.
+
+ =Contact
+ I love machine learning and classification, so if you have a problem that is giving you trouble, don't hesitate to get a hold of me. The same applies if you want to write additional classifiers or better document representations, or just to tell me my code is amateur.
+
+ Author:: Paul Dix (mailto:paul@pauldix.net)
+ Site:: http://www.pauldix.net
+ Freenode:: pauldix in #nyc.rb
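The train-then-classify flow described above can be condensed into a self-contained toy sketch. This is plain Ruby with no stemming or feature selection; `ToyBayes` and its methods are illustrative stand-ins for the idea, not basset's API:

```ruby
# A minimal word-count naive Bayes, mirroring the flow described above:
# train on labeled texts, then classify unseen text.
class ToyBayes
  def initialize
    @doc_counts  = Hash.new(0)                             # class => documents seen
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # class => word => count
  end

  def train(klass, text)
    @doc_counts[klass] += 1
    text.downcase.scan(/[a-z]+/).each { |w| @word_counts[klass][w] += 1 }
  end

  def classify(text)
    vocab      = @word_counts.values.flat_map(&:keys).uniq.size
    total_docs = @doc_counts.values.inject(:+).to_f
    scores = @doc_counts.keys.map do |klass|
      total_words = @word_counts[klass].values.inject(0, :+).to_f
      # log of the class prior, plus add-one-smoothed log likelihoods
      score = Math.log(@doc_counts[klass] / total_docs)
      text.downcase.scan(/[a-z]+/).each do |w|
        score += Math.log((@word_counts[klass][w] + 1) / (total_words + vocab))
      end
      [score, klass]
    end
    scores.max_by(&:first).last
  end
end

nb = ToyBayes.new
nb.train(:spam, "cheap viagra cheap pills")
nb.train(:ham, "lunch at noon tomorrow")
nb.classify("cheap pills now")  # => :spam
```

With a realistic training set you would also select features first, as the steps above describe; this sketch simply uses every word.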
@@ -0,0 +1,30 @@
+ # -*- ruby -*-
+
+ require 'rubygems'
+ require 'hoe'
+ require 'spec/rake/spectask'
+ require './lib/basset.rb'
+
+ Hoe.new('basset', Basset::VERSION) do |p|
+   p.summary = 'A library for running machine learning algorithms for classification, feature selection and evaluation'
+   p.url = 'http://basset.rubyforge.org/'
+
+   p.author = ['Paul Dix', 'Bryan Helmkamp']
+   p.email = 'paul@pauldix.net'
+
+   p.description = p.paragraphs_of('README.txt', 2..5).join("\n\n")
+   p.changes = p.paragraphs_of('History.txt', 0..1).join("\n\n")
+   p.remote_rdoc_dir = '' # Release to root
+   p.extra_deps << ['stemmer', '>= 1.0.1']
+ end
+
+ desc "Run all of the specs"
+ Spec::Rake::SpecTask.new do |t|
+   t.spec_opts = ['--options', "\"spec/spec.opts\""]
+ end
+
+ desc "Run all specs with RCov"
+ Spec::Rake::SpecTask.new(:coverage) do |t|
+   t.rcov = true
+   t.rcov_opts = ['--exclude', 'spec']
+ end
@@ -0,0 +1,7 @@
+ Dir[File.join(File.dirname(__FILE__), "basset", "*.rb")].each do |file|
+   require file
+ end
+
+ module Basset
+   VERSION = "1.0.0"
+ end
@@ -0,0 +1,175 @@
+ module Basset
+   # Class for running evaluation tests on a classifier and document
+   # representation.
+   # On initialization, pass in the training documents (an array of arrays:
+   # one array of documents per class) and an optional args hash:
+   #   :output         => true (the default) to print results to the console
+   #   :folding_amount => the number of cross validation folds (defaults to 10)
+   class ClassificationEvaluator
+     def initialize(training_documents, args = {})
+       args[:output] = true unless args.has_key?(:output)
+       @output_to_console = args[:output]
+       @folding_amount = args[:folding_amount] || 10
+       @total_documents_trained = 0
+       @document_sets = split_documents_into_cross_validation_sets(training_documents, @folding_amount)
+     end
+
+     # Classifiers should be an array of basset classifier objects to run cross validation tests on
+     def test_with_basset_classifiers(classifiers)
+     end
+
+     # Classifiers should be an array of basset classifier objects to run cross validation tests on.
+     # chi_value will be passed on to the feature_selector. The default value of 0 will select all features.
+     # The block will get called and passed each training_set and test_set from the document_sets. It should
+     # run some external classifier and return the number of documents from the test_set that were correctly classified.
+     def compare_against_basset_classifiers(classifiers, chi_value = 0, &block)
+       # initialize the results hash
+       results = {"External" => {:correct => 0, :total => 0, :time => 0.0}}
+       classifiers.each {|classifier| results[classifier.class] = {:correct => 0, :total => 0, :time => 0.0}}
+
+       # run on each doc set
+       @document_sets.each_with_index do |document_set, iteration|
+         puts "iteration #{iteration + 1} of #{@document_sets.size}" if @output_to_console
+         feature_extractor = nil
+         feature_extractor_time = time_execution { feature_extractor = create_feature_extractor(document_set[:training_set], chi_value) }
+         number_of_test_documents = document_set[:test_set].size
+
+         # do a test run on each classifier
+         classifiers.each do |classifier|
+           correct = 0
+           time = time_execution { correct = test_run(document_set[:training_set], document_set[:test_set], feature_extractor, classifier) } + feature_extractor_time
+           results[classifier.class][:time] += time
+           results[classifier.class][:correct] += correct
+           results[classifier.class][:total] += number_of_test_documents
+           output_results(correct, number_of_test_documents, time, classifier.class) if @output_to_console
+         end
+
+         # now run the external classifier and gather results
+         correct = 0
+         time = time_execution { correct = block.call(document_set[:training_set], document_set[:test_set]) }
+         results["External"][:time] += time
+         results["External"][:correct] += correct
+         results["External"][:total] += number_of_test_documents
+         output_results(correct, number_of_test_documents, time, "External") if @output_to_console
+       end
+
+       if @output_to_console
+         puts "\nFinal Results\n---------------------------------------------------------------------------------------"
+         puts "Trained on #{@total_documents_trained} documents on #{@folding_amount} cross validation runs."
+         results.each_pair {|classifier, results_numbers| output_results(results_numbers[:correct], results_numbers[:total], results_numbers[:time], classifier)}
+       end
+
+       return results
+     end
+
+     # Runs cross validation for a single classifier. For 10 fold validation it
+     # will feature select and train on 9 chunks and test on the remaining one,
+     # iterating so that each chunk serves once as the test set, and combine
+     # the results.
+     def test_with_cross_validation(classifier, chi_value = 0)
+       total_correct, total_documents = 0, 0
+       time = time_execution do
+         @document_sets.each_with_index do |document_set, iteration|
+           puts "iteration #{iteration + 1} of #{@document_sets.size}" if @output_to_console
+           feature_extractor = create_feature_extractor(document_set[:training_set], chi_value)
+           total_correct += test_run(document_set[:training_set], document_set[:test_set], feature_extractor, classifier)
+           total_documents += document_set[:test_set].size
+         end
+       end
+
+       output_results(total_correct, total_documents, time, classifier.class) if @output_to_console
+       return [total_correct, total_documents]
+     end
+
+     private
+
+     # Splits the entire training set for cross validation. 10 fold is the
+     # default, which splits the training set into 10 equal size chunks.
+     # training_documents is an array of arrays: each class to be considered
+     # has its own array of documents.
+     def split_documents_into_cross_validation_sets(training_documents, folding_amount = 10)
+       document_sets = []
+       # make sure it's not in some order
+       training_documents.each {|class_documents| class_documents.randomize!}
+
+       # the folding amount determines how big the test set is. for 10 fold it's 10% and we run 10 times.
+       # there's some tricky code here to make sure that the training and test sets have an equal percentage
+       # of docs from each class for each iteration.
+       folding_amount.times do |iteration|
+         test_set = []
+         training_documents.each do |class_documents|
+           test_set_size = (class_documents.size / folding_amount).to_i
+           test_set << class_documents.slice(iteration * test_set_size, test_set_size)
+         end
+         training_set = []
+         training_documents.each_with_index {|class_documents, i| training_set += (class_documents - test_set[i])}
+         test_set = test_set.flatten
+         @total_documents_trained += training_set.size
+         document_sets << {:training_set => training_set, :test_set => test_set}
+       end
+       return document_sets
+     end
+
+     # returns a feature extractor for the passed in training_set using the chi_value
+     def create_feature_extractor(training_set, chi_value)
+       feature_selector = FeatureSelector.new
+       # select features based on the training set
+       training_set.each do |document|
+         feature_selector.add_document(document)
+       end
+       if chi_value == 0
+         selected_features = feature_selector.all_feature_names
+       else
+         selected_features = feature_selector.select_features(chi_value)
+       end
+       puts "selected #{selected_features.size} of #{feature_selector.number_of_features} features for this iteration" if @output_to_console
+       return FeatureExtractor.new(selected_features)
+     end
+
+     # a single run on a training and test set: train the classifier on the
+     # extracted features, then classify the test set and count the hits.
+     def test_run(training_set, testing_set, feature_extractor, classifier)
+       puts "training #{classifier.class} on #{training_set.size} documents..." if @output_to_console
+       # now train the classifier
+       training_set.each do |document|
+         classifier.add_document(document.classification, feature_extractor.extract_numbered(document))
+       end
+
+       puts "running #{classifier.class} on #{testing_set.size} documents..." if @output_to_console
+       # now classify the test set
+       number_correctly_classified = 0
+       testing_set.each do |document|
+         score, predicted_classification = classifier.classify(feature_extractor.extract_numbered(document))
+         number_correctly_classified += 1 if document.classification == predicted_classification
+       end
+
+       return number_correctly_classified
+     end
+
+     def output_results(correct, total, time, classifier_name)
+       puts "#{classifier_name} classified #{correct} of #{total} correctly for #{(correct/total.to_f * 100).to_s_decimal_places(2)}% accuracy. Executed run in #{time.to_s_decimal_places(1)} seconds."
+     end
+
+     def time_execution(&block)
+       start_time = Time.now
+       yield
+       end_time = Time.now
+       return end_time - start_time
+     end
+   end
+ end
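The per-class splitting that `split_documents_into_cross_validation_sets` performs can be sketched stand-alone. `cross_validation_sets` is an illustrative helper written for this example, not part of basset:

```ruby
# Build k cross validation folds where every test fold draws an equal
# fraction of documents from each class, as the evaluator above does.
def cross_validation_sets(docs_by_class, folds = 10)
  docs_by_class = docs_by_class.map(&:shuffle)  # make sure it's not in some order
  Array.new(folds) do |i|
    test_set = docs_by_class.flat_map do |docs|
      size = docs.size / folds
      docs.slice(i * size, size)
    end
    { :training_set => docs_by_class.flatten - test_set, :test_set => test_set }
  end
end

spam = (1..20).to_a
ham  = (21..40).to_a
sets = cross_validation_sets([spam, ham], 10)
sets.first[:test_set].size      # => 4  (2 spam + 2 ham)
sets.first[:training_set].size  # => 36
```

Keeping the class proportions equal in every fold matters when one class is much rarer than the other; otherwise some folds could end up with no examples of the rare class at all.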
@@ -0,0 +1,56 @@
+ # This file contains extensions to built in Ruby classes.
+
+ require 'rubygems'
+ require 'stemmer'
+
+ # Extensions to the Array class.
+ class Array
+   # Returns a new array that contains everything except the first element of this one. (just like in lisp)
+   def rest
+     self.slice(1, size)
+   end
+
+   # Returns the second item in the array
+   def second
+     self[1]
+   end
+
+   # Returns a random item from the array
+   def pick_random
+     self[rand(self.size)]
+   end
+
+   # Returns a randomized copy of the array
+   def randomize
+     self.sort_by { rand }
+   end
+
+   def sum
+     inject(0) { |sum, val| sum + val }
+   end
+
+   # Randomizes the array in place
+   def randomize!
+     self.replace(self.randomize)
+   end
+ end
+
+ class Float
+   # Returns the float as a string truncated (not rounded) to the given
+   # number of decimal places.
+   def to_s_decimal_places(decimal_places)
+     pattern = "[0-9]*\\."
+     decimal_places.times { pattern << "[0-9]" }
+     match = self.to_s.match(Regexp.new(pattern))
+     match ? match[0] : self.to_s
+   end
+ end
+
+ class Symbol
+   def to_proc
+     proc { |obj, *args| obj.send(self, *args) }
+   end
+ end
+
+ # Extensions to the String class.
+ # We're just including the stemmable module into String. This adds the .stem method.
+ class String
+   include Stemmable
+ end
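One subtlety in the `Float` extension above: it truncates rather than rounds, because it matches against the string form instead of doing arithmetic. A quick stand-alone illustration (re-defining the method so the snippet runs on its own):

```ruby
class Float
  # Truncate (not round) to the given number of decimal places, as a string.
  def to_s_decimal_places(decimal_places)
    pattern = "[0-9]*\\."
    decimal_places.times { pattern << "[0-9]" }
    match = to_s.match(Regexp.new(pattern))
    match ? match[0] : to_s
  end
end

99.876.to_s_decimal_places(2)  # => "99.87" (truncated, not "99.88")
```

For display of accuracy percentages and timings, as the evaluator uses it, truncation versus rounding makes no practical difference.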
@@ -0,0 +1,51 @@
+ module Basset
+
+   # A class for representing a document as a vector of features. It takes the text
+   # of the document and the classification. The vector of features representation is
+   # just a basic bag of words approach.
+   class Document
+     attr_reader :text, :classification
+
+     def initialize(text, classification = nil)
+       @text = text
+       @classification = classification
+     end
+
+     def vector_of_features
+       @feature_vector ||= vector_of_features_from_terms_hash( terms_hash_from_words_array( stemmed_words ) )
+     end
+
+     private
+
+     # returns a hash with each word as a key and the value is the number of times
+     # the word appears in the passed in words array
+     def terms_hash_from_words_array(words)
+       terms = Hash.new(0)
+       words.each do |term|
+         terms[term] += 1
+       end
+       return terms
+     end
+
+     def vector_of_features_from_terms_hash(terms)
+       terms.collect do |term, frequency|
+         Feature.new(term, frequency)
+       end
+     end
+
+     def stemmed_words
+       words.collect(&:stem)
+     end
+
+     def words
+       clean_text.split(" ")
+     end
+
+     # Remove punctuation, numbers and symbols
+     def clean_text
+       text.tr("'@_", '').gsub(/\W/, ' ').gsub(/[0-9]/, '')
+       # text.tr( ',?.!;:"#$%^&*()_=+[]{}\|<>/`~', " " ) .tr( "@'\-", "")
+     end
+
+   end
+ end
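The cleaning-and-counting pipeline above, as a stand-alone sketch. `bag_of_words` is an illustrative helper; stemming is omitted (it needs the stemmer gem) and replaced here by a plain `downcase`:

```ruby
# Bag of words: clean the text, then count how often each term occurs,
# mirroring Document#vector_of_features above.
def bag_of_words(text)
  words = text.tr("'@_", '')      # drop apostrophes, at-signs and underscores
              .gsub(/\W/, ' ')    # punctuation and symbols become spaces
              .gsub(/[0-9]/, '')  # strip digits
              .downcase
              .split
  words.each_with_object(Hash.new(0)) { |w, counts| counts[w] += 1 }
end

bag_of_words("Spam, spam, eggs!")  # => {"spam"=>2, "eggs"=>1}
```

The real class wraps each term/count pair in a `Feature` object; the hash here carries the same information.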
@@ -0,0 +1,11 @@
+ module Basset
+   # This class is an example of how to do custom document representations. In this
+   # example, I change the way text is cleaned and don't stem the words. It would also
+   # be easy to put in additional hard coded features.
+   # The important thing to note is that the new document class only needs one method: vector_of_features
+   class DocumentOverrideExample < Document
+     def vector_of_features
+       @vector_of_features ||= vector_of_features_from_terms_hash( terms_hash_from_words_array( text.gsub(/\W/, ' ').split(' ') ) )
+     end
+   end
+ end
@@ -0,0 +1,26 @@
+ module Basset
+
+   # A class to hold a feature which consists of a name and a value. In the basic sense
+   # of document classification the name would be the word and the value would be the
+   # number of times that word appeared in the document.
+   class Feature
+     attr_accessor :name, :value
+
+     def initialize(name, value = 0)
+       @name = name
+       @value = value
+     end
+
+     def <=>(other)
+       ret = self.name <=> other.name
+       ret = self.value <=> other.value if ret.zero?
+       ret
+     end
+
+     def ==(other)
+       ret = self.name == other.name
+       ret = self.value == other.value if ret
+       ret
+     end
+   end
+ end
@@ -0,0 +1,52 @@
+ require File.join(File.dirname(__FILE__), "yaml_serialization")
+
+ module Basset
+
+   # Extracts features from a document. On initialization it expects the set of features that
+   # are to be extracted from documents. The extracted features will just be numbered in
+   # ascending order. This makes it easy to output feature sets for libraries like svmlight.
+   class FeatureExtractor
+     include YamlSerialization
+
+     # The constructor takes an array of feature names. These are the features that will be
+     # extracted from documents. All others will be ignored.
+     def initialize(feature_names)
+       @feature_names = {}
+       feature_names.each_with_index {|feature_name, index| @feature_names[feature_name] = index + 1}
+     end
+
+     def number_of_features
+       @feature_names.size
+     end
+
+     # returns an array of features, but with their names replaced with an integer identifier,
+     # sorted in ascending identifier order. This is a generic representation that works
+     # well with other machine learning packages like svm_light.
+     def extract_numbered(document)
+       numbered_features = extract(document).collect do |feature|
+         Feature.new(@feature_names[feature.name], feature.value)
+       end
+       numbered_features.sort
+     end
+
+     # just returns the features from the document that the extractor is interested in
+     def extract(document)
+       document.vector_of_features.find_all do |feature|
+         @feature_names[feature.name]
+       end
+     end
+
+     # def extract_with_duplicate_removal(document)
+     #   features = extract(document)
+     #   # now remove the unigrams that dupe bigram features
+     #   # first grab an array of the bigram ones
+     #   bigram_features = []
+     #   sorted_features.each {|feature| bigram_features << feature if feature.name =~ /.*_AND_.*/}
+     #   # now remove all the ones that have a match in the bigram features
+     #   sorted_features.each_with_index do |feature, index|
+     #     sorted_features.delete_at(index) if (feature.name !~ /_AND_/ and bigram_features.detect {|bf| bf.name =~ /^#{feature.name}_|_#{feature.name}$/})
+     #   end
+     # end
+
+   end
+ end
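The numbering scheme above can be sketched stand-alone on plain hashes. This `extract_numbered` is an illustrative re-implementation over `name => count` hashes rather than `Feature` objects:

```ruby
# Map selected feature names to stable 1-based ids, drop everything else,
# and sort by id -- the shape expected by tools like svm_light.
def extract_numbered(feature_names, counts)
  ids = {}
  feature_names.each_with_index { |name, i| ids[name] = i + 1 }
  counts.select { |name, _| ids[name] }
        .map    { |name, value| [ids[name], value] }
        .sort
end

extract_numbered(%w[cheap pills], {"cheap" => 2, "hello" => 1, "pills" => 1})
# => [[1, 2], [2, 1]]
```

Because the ids are assigned once from the selected feature list, every document in a run is numbered consistently, which is what makes the output usable across training and test sets.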
@@ -0,0 +1,126 @@
+ module Basset
+
+   # This class is the feature selector. All documents in the training set should be added
+   # to the selector. Once they are in, a number of features may be selected based on the
+   # chi square value. When in doubt, just call select_features with its default
+   # chi value. It will return all features that have at least some statistical significance
+   # and occur in more than one document.
+   class FeatureSelector
+     attr_reader :docs
+
+     def initialize
+       @docs = 0
+       @docs_in_class = Hash.new(0)
+       @features = Hash.new { |h, k| h[k] = FeatureValues.new }
+     end
+
+     # Adds a document to the feature selector. The document should respond_to a
+     # method vector_of_features which returns a vector of unique features.
+     def add_document(document)
+       @docs += 1
+       @docs_in_class[document.classification] += 1
+
+       document.vector_of_features.each do |feature|
+         @features[feature.name].add_document_with_class(document.classification)
+       end
+     end
+
+     # returns all features, regardless of chi_square or frequency
+     def all_feature_names
+       @features.keys
+     end
+
+     def number_of_features
+       @features.size
+     end
+
+     # returns an array of the best features for a given classification
+     def best_features(count = 10, classification = nil)
+       select_features(1.0, classification).first(count)
+     end
+
+     def features_with_chi(classification)
+       @features.keys.map do |feature_name|
+         Feature.new(feature_name, chi_squared(feature_name, classification))
+       end
+     end
+
+     # returns an array of features that have a minimum or better chi_square value.
+     def select_features(chi_value = 1.0, classification = nil)
+       classification ||= @docs_in_class.keys.first
+
+       selected_features = features_with_chi(classification).select do |feature|
+         (docs_with_feature(feature.name) > 1) && (feature.value >= chi_value)
+       end
+
+       selected_features.sort_by(&:value).reverse.collect(&:name)
+     end
+
+     private
+
+     def docs_with_feature_and_class(feature_name, classification)
+       @features[feature_name].docs_with_class(classification)
+     end
+
+     def docs_with_feature_and_not_class(feature_name, classification)
+       @features[feature_name].docs_with_feature - @features[feature_name].docs_with_class(classification)
+     end
+
+     def docs_with_class_and_not_feature(classification, feature_name)
+       @docs_in_class[classification] - @features[feature_name].docs_with_class(classification)
+     end
+
+     def docs_without_feature_or_class(feature_name, classification)
+       @docs - @docs_in_class[classification] - docs_with_feature_and_not_class(feature_name, classification)
+     end
+
+     def docs_with_feature(feature_name)
+       @features[feature_name].docs_with_feature
+     end
+
+     def docs_with_class(classification)
+       @docs_in_class[classification]
+     end
+
+     # Returns the chi_squared value for this feature with the passed classification.
+     # This is formula 13.14 on page 215 of An Introduction to Information Retrieval by
+     # Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.
+     def chi_squared(feature_name, classification)
+       chi_squared_algo(
+         docs_with_feature_and_class(feature_name, classification),
+         docs_with_class_and_not_feature(classification, feature_name),
+         docs_with_feature_and_not_class(feature_name, classification),
+         docs_without_feature_or_class(feature_name, classification)
+       )
+     end
+
+     def chi_squared_algo(o11, o10, o01, o00)
+       denominator = ((o11 + o01) * (o11 + o10) * (o10 + o00) * (o01 + o00))
+       numerator = ((o11 + o10 + o01 + o00) * ((o11 * o00 - o10 * o01)**2))
+       # Checking for zero to avoid producing Infinity
+       denominator.zero? ? 0.0 : numerator.to_f / denominator.to_f
+     end
+
+     # A class to hold the values associated with a feature. These values are
+     # important for feature selection.
+     class FeatureValues
+       attr_accessor :docs_with_feature
+
+       def initialize
+         @classes = Hash.new(0)
+         @docs_with_feature = 0
+       end
+
+       def add_document_with_class(classification)
+         @classes[classification] += 1
+         @docs_with_feature += 1
+       end
+
+       def docs_with_class(classification)
+         @classes[classification]
+       end
+
+     end
+
+   end
+ end
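The chi squared statistic above (formula 13.14 in Manning, Raghavan and Schütze) can be checked stand-alone with a single 2x2 contingency table of document counts; the counts in the example are made up for illustration:

```ruby
# Chi squared over a 2x2 table: o11 = docs with feature and class,
# o10 = class without feature, o01 = feature without class, o00 = neither.
def chi_squared(o11, o10, o01, o00)
  n = o11 + o10 + o01 + o00
  denominator = (o11 + o01) * (o11 + o10) * (o10 + o00) * (o01 + o00)
  # zero denominator means the feature or class never varies; report no association
  denominator.zero? ? 0.0 : (n * (o11 * o00 - o10 * o01)**2).to_f / denominator
end

# "viagra" appears in 40 of 50 spam docs but only 2 of 50 ham docs:
chi_squared(40, 10, 2, 48)  # => ~59.28, far above the 10.83 cutoff for 99.9% significance
```

A word spread evenly across classes scores near zero, which is why thresholding on this value discards features that carry no class information.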
@@ -0,0 +1,109 @@
+ require File.join(File.dirname(__FILE__), "yaml_serialization")
+
+ module Basset
+
+   # A class for running Naive Bayes classification.
+   # Documents are added to the classifier. Once they are added
+   # it can be used to classify new documents.
+   class NaiveBayes
+     include YamlSerialization
+
+     def initialize
+       @number_of_documents = 0
+       @number_of_documents_in_class = Hash.new(0)
+       @features = []
+       reset_cached_probabilities
+     end
+
+     # takes a classification, which can be a string, and
+     # a vector of numbered features.
+     def add_document(classification, feature_vector)
+       reset_cached_probabilities
+
+       @number_of_documents_in_class[classification] += 1
+       @number_of_documents += 1
+
+       feature_vector.each do |feature|
+         @features[feature.name] ||= FeatureCount.new
+         @features[feature.name].add_count_for_class(feature.value, classification)
+       end
+     end
+
+     # returns the most likely class given a vector of features
+     def classify(feature_vector)
+       class_probabilities = []
+
+       @number_of_documents_in_class.keys.each do |classification|
+         class_probability = Math.log10(probability_of_class(classification))
+         feature_vector.each do |feature|
+           class_probability += Math.log10(probability_of_feature_given_class(feature.name, classification)) * feature.value
+         end
+         class_probabilities << [class_probability, classification]
+       end
+
+       # start from a random item so that ties between classes are broken randomly
+       max = class_probabilities.pick_random
+       class_probabilities.each do |cp|
+         max = cp if cp.first > max.first
+       end
+       max
+     end
+
+     private
+
+     # probabilities are cached when the classification is run. This method resets
+     # the cached probabilities.
+     def reset_cached_probabilities
+       @occurences_of_every_feature_in_class = Hash.new
+     end
+
+     # The number of times every feature occurs for a given class.
+     def number_of_occurences_of_every_feature_in_class(classification)
+       # return the cached value, if there is one
+       return @occurences_of_every_feature_in_class[classification] if @occurences_of_every_feature_in_class[classification]
+
+       # we drop the first element (since there is no 0 feature) and sum over the rest.
+       # compact removes nil entries for features that were never initialized, which
+       # can happen with the removal of redundant unigrams.
+       @occurences_of_every_feature_in_class[classification] = @features.rest.compact.inject(0) do |sum, feature_count|
+         sum + feature_count.count_for_class(classification)
+       end
+     end
+
+     # returns the probability of a given class
+     def probability_of_class(classification)
+       @number_of_documents_in_class[classification] / @number_of_documents.to_f
+     end
+
+     # returns the probability of a feature given the class, with add-one smoothing
+     def probability_of_feature_given_class(feature, classification)
+       # the rescue handles features that were never initialized, which can happen
+       # with the removal of redundant unigrams
+       ((@features[feature].count_for_class(classification) rescue 0) + 1) / number_of_occurences_of_every_feature_in_class(classification).to_f
+     end
+
+     # A class to store feature counts
+     class FeatureCount
+
+       def initialize
+         @classes = {}
+       end
+
+       def add_count_for_class(count, classification)
+         @classes[classification] ||= 0
+         @classes[classification] += count
+       end
+
+       def count_for_class(classification)
+         @classes[classification] || 0
+       end
+
+       def count
+         @classes.values.sum
+       end
+
+     end
+
+   end
+ end
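Why `classify` sums logs instead of multiplying the probabilities directly: the product of many small probabilities underflows floating point to zero, while the sum of their logs stays well-behaved. A quick stand-alone illustration:

```ruby
probs = [1e-5] * 100

# Multiplying 100 small probabilities underflows Float to exactly 0.0...
product = probs.inject(:*)

# ...while summing their logs (as the classifier above does) stays usable.
log_sum = probs.inject(0.0) { |sum, p| sum + Math.log10(p) }

product  # => 0.0
log_sum  # close to -500.0
```

Since log is monotonic, comparing log scores picks the same winning class that comparing raw probabilities would, without ever forming the tiny product.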
@@ -0,0 +1,21 @@
+ require "yaml"
+
+ module YamlSerialization
+
+   def self.included(base)
+     base.extend ClassMethods
+   end
+
+   module ClassMethods
+     def load_from_file(file_name)
+       YAML.load_file(file_name)
+     end
+   end
+
+   def save_to_file(file_name)
+     File.open(file_name, 'w') do |file|
+       YAML.dump(self, file)
+     end
+   end
+
+ end
File without changes
@@ -0,0 +1,4 @@
+ require 'spec'
+ require File.join(File.dirname(__FILE__), "..", "lib", "basset")
+
+ include Basset
@@ -0,0 +1,28 @@
+ require File.join(File.dirname(__FILE__), "..", "spec_helper")
+
+ describe Document do
+   it "should remove punctuation from words" do
+     Document.new("abc.").vector_of_features.should == [Feature.new("abc", 1)]
+   end
+
+   it "should remove numbers from words" do
+     Document.new("abc1").vector_of_features.should == [Feature.new("abc", 1)]
+   end
+
+   it "should remove symbols from words" do
+     Document.new("abc%").vector_of_features.should == [Feature.new("abc", 1)]
+   end
+
+   it "should lowercase text" do
+     Document.new("ABC").vector_of_features.should == [Feature.new("abc", 1)]
+   end
+
+   it "should stem words" do
+     Document.new("testing").vector_of_features.should == [Feature.new("test", 1)]
+   end
+
+   it "should count feature occurrences" do
+     Document.new("test doc test", :test).vector_of_features.should ==
+       [Feature.new("doc", 1), Feature.new("test", 2)]
+   end
+ end
@@ -0,0 +1,32 @@
1
+ require File.join(File.dirname(__FILE__), "..", "spec_helper")
2
+
3
+ describe FeatureExtractor do
4
+ DocumentMock = Struct.new(:vector_of_features)
5
+
6
+ it "should save to file"
7
+ it "should be loadable from file"
8
+
9
+ it "should return number of features" do
10
+ FeatureExtractor.new(%w[one two]).number_of_features.should == 2
11
+ end
12
+
13
+ it "should throw away extra features" do
14
+ doc = DocumentMock.new([Feature.new("keep"), Feature.new("throwaway")])
15
+ FeatureExtractor.new(%w[keep]).extract(doc).should == [Feature.new("keep")]
16
+ end
17
+
18
+ it "should extract no features from a doc with no features" do
19
+ FeatureExtractor.new(%w[keep]).extract(DocumentMock.new([])).should == []
20
+ end
21
+
22
+ it "should extract numbered features" do
23
+ doc = DocumentMock.new([Feature.new("keep", 0)])
24
+ FeatureExtractor.new(%w[keep]).extract_numbered(doc).should == [Feature.new(1, 0)]
25
+ end
26
+
27
+ it "should sort extracted numbered features" do
28
+ feature_extractor = FeatureExtractor.new(%w[keep1 keep2])
29
+ doc = DocumentMock.new([Feature.new("keep2", 10), Feature.new("keep1", 20)])
30
+ feature_extractor.extract_numbered(doc).should == [Feature.new(1, 20), Feature.new(2, 10)]
31
+ end
32
+ end
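The extractor contract these specs exercise — discard unknown features, and number kept features by their 1-based position in the sorted feature list — can be sketched with plain arrays. `MiniExtractor` is a hypothetical stand-in for the gem's `FeatureExtractor`, using `[name, value]` pairs instead of `Feature` objects.

```ruby
# Hypothetical minimal extractor consistent with the specs above.
class MiniExtractor
  def initialize(feature_names)
    @names = feature_names.sort   # sorted so positions are stable
  end

  # Keep only features whose names were given at construction time.
  def extract(features)           # features: array of [name, value] pairs
    features.select { |name, _| @names.include?(name) }
  end

  # Replace each kept name with its 1-based index, sorted by index.
  def extract_numbered(features)
    extract(features)
      .map { |name, value| [@names.index(name) + 1, value] }
      .sort
  end
end

e = MiniExtractor.new(%w[keep1 keep2])
e.extract_numbered([["keep2", 10], ["keep1", 20]])  # => [[1, 20], [2, 10]]
```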
@@ -0,0 +1,108 @@
1
+ require File.join(File.dirname(__FILE__), "..", "spec_helper")
2
+
3
+ describe FeatureSelector do
4
+ DocumentMock = Struct.new(:vector_of_features, :classification)
5
+
6
+ it "should count documents" do
7
+ feature_selector = FeatureSelector.new
8
+ feature_selector.docs.should == 0
9
+ feature_selector.add_document(DocumentMock.new([]))
10
+ feature_selector.docs.should == 1
11
+ end
12
+
13
+ it "should return all feature names" do
14
+ feature_selector = FeatureSelector.new
15
+ feature_selector.all_feature_names.should == []
16
+ feature_selector.add_document(DocumentMock.new([Feature.new("a")]))
17
+ feature_selector.add_document(DocumentMock.new([Feature.new("b")]))
18
+ feature_selector.all_feature_names.should == %w[a b]
19
+ end
20
+
21
+ # TODO
22
+ # it "should return_all_features_as_best
23
+ # feature_selector = FeatureSelector.new
24
+ # feature_selector.add_document(DocumentMock.new([Feature.new("a")], :test))
25
+ # assert_equal %w[a], feature_selector.best_features_for_classification(:test, 10)
26
+ # end
27
+
28
+ it "should count docs with feature and class" do
29
+ feature_selector = FeatureSelector.new
30
+ feature_selector.add_document(doc([Feature.new("viagra", 1)], :spam))
31
+ feature_selector.add_document(doc([Feature.new("puppy", 1)], :ham))
32
+ feature_selector.docs_with_feature_and_class("viagra", :spam).should == 1
33
+ feature_selector.docs_with_feature_and_class("viagra", :ham).should == 0
34
+ end
35
+
36
+ it "should count docs with feature and not class" do
37
+ feature_selector = FeatureSelector.new
38
+ feature_selector.add_document(doc([Feature.new("viagra", 1)], :spam))
39
+ feature_selector.add_document(doc([Feature.new("puppy", 1)], :ham))
40
+ feature_selector.docs_with_feature_and_not_class("puppy", :spam).should == 1
41
+ feature_selector.docs_with_feature_and_not_class("puppy", :ham).should == 0
42
+ end
43
+
44
+ it "should count docs with class and not feature" do
45
+ feature_selector = FeatureSelector.new
46
+ feature_selector.add_document(doc([Feature.new("viagra", 1)], :spam))
47
+ feature_selector.add_document(doc([Feature.new("puppy", 1)], :ham))
48
+ feature_selector.docs_with_class_and_not_feature(:spam, "puppy").should == 1
49
+ feature_selector.docs_with_class_and_not_feature(:spam, "viagra").should == 0
50
+ end
51
+
52
+ it "should count docs without feature or class" do
53
+ feature_selector = FeatureSelector.new
54
+ feature_selector.add_document(doc([Feature.new("viagra", 1)], :spam))
55
+ feature_selector.add_document(doc([Feature.new("puppy", 1)], :ham))
56
+ feature_selector.docs_without_feature_or_class("viagra", :spam).should == 1
57
+ feature_selector.docs_without_feature_or_class("viagra", :ham).should == 0
58
+ end
59
+
60
+ it "should return zero chi if all docs contain feature" do
61
+ feature_selector = FeatureSelector.new
62
+ the = Feature.new("the", 1)
63
+ feature_selector.add_document(doc([the], :spam))
64
+ feature_selector.add_document(doc([the], :ham))
65
+ feature_selector.features_with_chi(:spam).should == [Feature.new("the", 0.0)]
66
+ end
67
+
68
+ it "should compute chi squared" do
69
+ feature_selector = FeatureSelector.new
70
+ feature_selector.add_document(doc([Feature.new("viagra", 1)], :spam))
71
+ feature_selector.add_document(doc([Feature.new("puppy", 1)], :ham))
72
+ feature_selector.features_with_chi(:spam).should == [Feature.new("viagra", 2.0), Feature.new("puppy", 2.0)]
73
+ end
74
+
75
+ it "should not select any features if they are all insignificant" do
76
+ feature_selector = FeatureSelector.new
77
+ feature_selector.add_document(doc([Feature.new("viagra", 1)], :spam))
78
+ feature_selector.add_document(doc([Feature.new("puppy", 1)], :ham))
79
+ feature_selector.select_features.should == []
80
+ end
81
+
82
+ it "should not select features in only one doc" do
83
+ feature_selector = FeatureSelector.new
84
+ the = Feature.new("the", 1)
85
+ feature_selector.add_document(doc([the, Feature.new("viagra", 1)], :spam))
86
+ feature_selector.add_document(doc([the, Feature.new("puppy", 1)], :ham))
87
+ feature_selector.select_features.should == []
88
+ end
89
+
90
+ it "should select significant features occuring in more than one doc" do
91
+ feature_selector = FeatureSelector.new
92
+ the = Feature.new("the", 1)
93
+ feature_selector.add_document(doc([the, Feature.new("viagra", 1)], :spam))
94
+ feature_selector.add_document(doc([Feature.new("viagra", 1)], :spam))
95
+ feature_selector.add_document(doc([the, Feature.new("puppy", 1)], :ham))
96
+ feature_selector.select_features.should == %w[viagra]
97
+ end
98
+
99
+ it "should return selected features sorted by chi squared descending"
100
+ it "should select based on first feature by default"
101
+ it "should select with a chi squared of 1 by default"
102
+
103
+ private
104
+
105
+ def doc(*args)
106
+ DocumentMock.new(*args)
107
+ end
108
+ end
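The chi-squared expectations above (2.0 for a feature appearing only in the one `:spam` doc, 0.0 when every doc contains the feature) are reproduced by the standard Pearson chi-squared statistic on a 2x2 feature/class contingency table. The `FeatureSelector` internals are not shown in this diff, so the function below is a standalone sketch assuming that formula, not the gem's actual implementation.

```ruby
# Pearson's chi-squared for a 2x2 feature/class contingency table.
# a = docs with feature and class,  b = docs with feature, not class,
# c = docs with class, not feature, d = docs with neither.
def chi_squared(a, b, c, d)
  n = a + b + c + d
  denominator = (a + b) * (a + c) * (b + d) * (c + d)
  return 0.0 if denominator.zero?  # degenerate table, e.g. every doc has the feature
  n * (a * d - b * c)**2 / denominator.to_f
end

# "viagra" in the one :spam doc, "puppy" in the one :ham doc:
chi_squared(1, 0, 0, 1)  # => 2.0, matching the spec expectation above
# a feature present in every document carries no class information:
chi_squared(1, 1, 0, 0)  # => 0.0
```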
@@ -0,0 +1,40 @@
1
+ require File.join(File.dirname(__FILE__), "..", "spec_helper")
2
+
3
+ describe Feature do
4
+ it "should store name" do
5
+ Feature.new("hello").name.should == "hello"
6
+ end
7
+
8
+ it "should require name" do
9
+ lambda { Feature.new }.should raise_error(ArgumentError)
10
+ end
11
+
12
+ it "should store values" do
13
+ Feature.new("name", 2).value.should ==2
14
+ end
15
+
16
+ it "should default value to zero" do
17
+ Feature.new("name").value.should == 0
18
+ end
19
+
20
+ it "should be equal with same name and no value" do
21
+ Feature.new("hello").should == Feature.new("hello")
22
+ end
23
+
24
+ it "should be equal with same name and same value" do
25
+ Feature.new("hello", 1).should == Feature.new("hello", 1)
26
+ end
27
+
28
+ it "should not be equal with different name" do
29
+ Feature.new("hello").should_not == Feature.new("test")
30
+ end
31
+
32
+ it "should not be equal with same name and different value" do
33
+ Feature.new("hello", 1).should_not == Feature.new("hello", 2)
34
+ end
35
+
36
+ it "should sort by name ascending then value ascending" do
37
+ [Feature.new("b", 3), Feature.new("a", 2), Feature.new("a", 1)].sort.should ==
38
+ [Feature.new("a", 1), Feature.new("a", 2), Feature.new("b", 3)]
39
+ end
40
+ end
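The `Feature` behavior these specs describe — a required name, a value defaulting to zero, value-based equality, and sorting by name then value — fits in a few lines of Ruby. This is a hypothetical minimal version; the gem's real class may be implemented differently.

```ruby
# Minimal Feature consistent with the specs above: Struct gives the
# accessors, Comparable derives == and sort order from <=>.
Feature = Struct.new(:name, :value) do
  include Comparable

  def initialize(name, value = 0)  # name required, value defaults to 0
    super(name, value)
  end

  def <=>(other)                   # name ascending, then value ascending
    [name, value] <=> [other.name, other.value]
  end
end

Feature.new("hello", 1) == Feature.new("hello", 1)  # => true
[Feature.new("b", 3), Feature.new("a", 2), Feature.new("a", 1)].sort
# => [Feature.new("a", 1), Feature.new("a", 2), Feature.new("b", 3)]
```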
metadata ADDED
@@ -0,0 +1,88 @@
1
+ --- !ruby/object:Gem::Specification
2
+ rubygems_version: 0.9.4
3
+ specification_version: 1
4
+ name: basset
5
+ version: !ruby/object:Gem::Version
6
+ version: 1.0.0
7
+ date: 2008-01-08 00:00:00 +00:00
8
+ summary: A library for running machine learning algorithms for classification, feature selection and evaluation
9
+ require_paths:
10
+ - lib
11
+ email: paul@pauldix.net
12
+ homepage: http://basset.rubyforge.org/
13
+ rubyforge_project: basset
14
+ description: "=What You Could Use This For Just in case you don't have a clue what machine learning or classification is, here's a quick example scenario and an explanation of the process. The most popular task is spam identification. To do this you'll first need a set of training documents. This would consist of a number of documents which you have labeled as either spam or not. With training sets, bigger is better. You should probably have at least 100 of each type (spam and not spam). Really 1,000 of each type would be better and 10,000 of each would be super sweet. Once you have the training set the process with this library flows like this: * Create each as a Document (a class in this library) * Pass those documents into the FeatureSelector * Get the best features and pass those into the FeatureExtractor * Now extract features from each document using the extractor and * Pass those extracted features to NaiveBayes as part of the training set * Now you can save the FeatureExtractor and NaiveBayes to a file That represents the process of selecting features and training the classifier. Once you've done that you can predict if a new previously unseen document is spam or not by just doing the following: * Load the feature extractor and naive bayes from their files * Create a new document object from your new unseen document * Extract the features from that document using the feature extractor and * Pass those to the classify method of the naive bayes classifier"
15
+ autorequire:
16
+ default_executable:
17
+ bindir: bin
18
+ has_rdoc: true
19
+ required_ruby_version: !ruby/object:Gem::Version::Requirement
20
+ requirements:
21
+ - - ">"
22
+ - !ruby/object:Gem::Version
23
+ version: 0.0.0
24
+ version:
25
+ platform: ruby
26
+ signing_key:
27
+ cert_chain:
28
+ post_install_message:
29
+ authors:
30
+ - Paul Dix
31
+ - Bryan Helmkamp
32
+ files:
33
+ - History.txt
34
+ - License.txt
35
+ - Manifest.txt
36
+ - README.txt
37
+ - Rakefile
38
+ - lib/basset.rb
39
+ - lib/basset/classification_evaluator.rb
40
+ - lib/basset/core_extensions.rb
41
+ - lib/basset/document.rb
42
+ - lib/basset/document_override_example.rb
43
+ - lib/basset/feature.rb
44
+ - lib/basset/feature_extractor.rb
45
+ - lib/basset/feature_selector.rb
46
+ - lib/basset/naive_bayes.rb
47
+ - lib/basset/yaml_serialization.rb
48
+ - spec/spec.opts
49
+ - spec/spec_helper.rb
50
+ - spec/unit/document_spec.rb
51
+ - spec/unit/feature_extractor_spec.rb
52
+ - spec/unit/feature_selector_spec.rb
53
+ - spec/unit/feature_spec.rb
54
+ test_files: []
55
+
56
+ rdoc_options:
57
+ - --main
58
+ - README.txt
59
+ extra_rdoc_files:
60
+ - History.txt
61
+ - License.txt
62
+ - Manifest.txt
63
+ - README.txt
64
+ executables: []
65
+
66
+ extensions: []
67
+
68
+ requirements: []
69
+
70
+ dependencies:
71
+ - !ruby/object:Gem::Dependency
72
+ name: stemmer
73
+ version_requirement:
74
+ version_requirements: !ruby/object:Gem::Version::Requirement
75
+ requirements:
76
+ - - ">="
77
+ - !ruby/object:Gem::Version
78
+ version: 1.0.1
79
+ version:
80
+ - !ruby/object:Gem::Dependency
81
+ name: hoe
82
+ version_requirement:
83
+ version_requirements: !ruby/object:Gem::Version::Requirement
84
+ requirements:
85
+ - - ">="
86
+ - !ruby/object:Gem::Version
87
+ version: 1.4.0
88
+ version: