RubyGems - cross_validation - Versions diffs - 0.0.1 → 0.0.2 - Mend

cross_validation 0.0.1 → 0.0.2

Files changed (14)

checksums.yaml +4 -4
data/README.md +89 -17
data/lib/cross_validation.rb +5 -1
data/lib/cross_validation/confusion_matrix.rb +3 -1
data/lib/cross_validation/partitioner.rb +34 -0
data/lib/cross_validation/runner.rb +57 -9
data/lib/cross_validation/sample.rb +15 -0
data/test/support/spam_classifier.rb +24 -0
data/test/test_confusion_matrix.rb +3 -2
data/test/test_helper.rb +0 -9
data/test/test_partitioner.rb +28 -0
data/test/test_runner.rb +17 -22
data/test/test_sample.rb +29 -0
metadata +10 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: e36ebf38a97ebf8665474186f2c1ae3c82a3855d
-  data.tar.gz: 770c131f2d8e3cf359e7e26450cb7a24c704239c
+  metadata.gz: 5f5fee4f5040fd1ad3562518031b850f7cf8c4b0
+  data.tar.gz: 7076cc8ff37d8967c84f04a493a5dff8729e0202
 SHA512:
-  metadata.gz: 34ce7b4484db03a2d09aeb2b1f40c586a84a63b6ba22d355220ee6cba8bf588f3c8a2ac867a82524d43aa9fba06863f4f44ca09af8a17cb7f56409d9105dce7f
-  data.tar.gz: d55546305d845c1f2c977f825f2d69a77b32369c3afc13b5262dd170b523f4efffa6f7fa9f6d9f7d9215b6b008156ff1ae2cb13993625268bc8d3e1f0b4b2f4c
+  metadata.gz: f27ea09189c51f97db89aaee88dee9cca8adefbc503fe596f91297ed9acc6273be0c3414dfd6b1004b9e32ff2cb98cc22edafaa3484ddf062e965a609bdff39b
+  data.tar.gz: 04e0bf2d0863102dadddcb053d053c8d4dd5bbb0098acc58b3205a09438d56c5c3077aa3529302017f709f1bab8cf7a7440b690d047b6bd1cb36277daf8eb810

data/README.md CHANGED

@@ -4,7 +4,9 @@
 [![Code Climate](https://codeclimate.com/github/jmdeldin/cross_validation.png)](https://codeclimate.com/github/jmdeldin/cross_validation)
 This gem provides a k-fold cross-validation routine and confusion matrix
-for evaluating machine learning classifiers.
+for evaluating machine learning classifiers. See [below](#usage) for
+usage or jump to the
+[documentation](http://rubydoc.info/github/jmdeldin/cross_validation/frames).
 ## Installation
@@ -22,19 +24,89 @@ Or install it yourself as:
 ## Usage
-Cross-validation:
-Confusion-matrix:
-## Contributing
-1. Fork it
-2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Add some feature'`)
-4. Push to the branch (`git push origin my-new-feature`)
-5. Create new Pull Request
-## Questions
-Send me an email, `dev@jmdeldin.com`
+To cross-validate your classifier, you need to configure a run as
+follows:
+```ruby
+require 'cross_validation'
+runner = CrossValidation::Runner.create do |r|
+  r.documents = my_array_of_documents
+  r.folds = 10
+  # or if you'd rather test on 10%
+  # r.percentage = 0.1
+  r.classifier = lambda { SpamClassifier.new }
+  r.fetch_sample_class = lambda { |sample| sample.klass }
+  r.fetch_sample_value = lambda { |sample| sample.value }
+  r.matrix = CrossValidation::ConfusionMatrix.new(method(:keys_for))
+  r.training = lambda { |classifier, doc|
+    classifier.train doc.klass, doc.value
+  }
+  r.classifying = lambda { |classifier, doc|
+    classifier.classify doc
+  }
+end
+```
+With the run configured, just invoke `#run` to return a confusion matrix:
+```ruby
+mat = runner.run
+```
+With a confusion matrix in hand, you can compute many statistics about
+your classifier:
+- `mat.accuracy`
+- `mat.f1`
+- `mat.fscore(beta)`
+- `mat.precision`
+- `mat.recall`
+Please see the
+[respective documentation](http://rubydoc.info/github/jmdeldin/cross_validation/CrossValidation/ConfusionMatrix)
+for each method for more details.
+### Defining `keys_for`
+The ConfusionMatrix class requires a `keys_for` `Proc` that returns a
+symbol. In this method, you specify what constitutes a true positive
+(`:tp`), true negative (`:tn`), false positive (`:fp`), and false
+negative (`:fn`). For example, in spam classification, you can construct
+the following table to write the keys_for method:
+                            actual
+              +---------------------------------
+     expected | correct        | not correct
+    ----------+----------------+----------------
+     spam     | true positive  | false positive
+     ham      | true negative  | false negative
+You can then implement this table with nested hashes or just a few
+conditionals:
+```ruby
+def keys_for(expected, actual)
+  if expected == :spam
+    actual == :spam ? :tp : :fp
+  elsif expected == :ham
+    actual == :ham ? :tn : :fn
+  end
+end
+```
+Once you have your `keys_for` method implemented, pass it into the
+ConfusionMatrix with `method(:keys_for)`, or if it's a class-method,
+`MyClass.method(:keys_for)`. (You can also implement the method as a
+lambda.)
+## Roadmap
+For v1.0:
+- Implement configurable, parallel cross-validation
+- Include more complete examples
+## Author
+Jon-Michael Deldin, `dev@jmdeldin.com`
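
Note on the README hunk: it says `keys_for` can be built from nested hashes or written as a lambda, but only shows the conditional form. A minimal sketch of the nested-hash variant follows, assuming exactly two labels (`:spam` and `:ham`). The names `outcomes` and `keys_for` are illustrative and not part of the gem. Unlike the conditional version, `fetch` raises `KeyError` for a label outside the table rather than falling through.

```ruby
# Illustrative only: `outcomes` is a hypothetical lookup table.
# Outer key is the expected label, inner key is the classifier's output,
# mirroring the mapping used by the README's conditional version.
outcomes = {
  spam: { spam: :tp, ham: :fp },
  ham:  { ham: :tn, spam: :fn }
}

# A lambda responds to #call, which is how ConfusionMatrix#store invokes it.
keys_for = lambda do |expected, actual|
  outcomes.fetch(expected).fetch(actual)
end

keys_for.call(:spam, :spam) # => :tp
keys_for.call(:ham, :spam)  # => :fn
```

A lambda like this could presumably be passed straight to `CrossValidation::ConfusionMatrix.new` in place of `method(:keys_for)`.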

data/lib/cross_validation.rb CHANGED

@@ -1,5 +1,9 @@
 $LOAD_PATH.unshift File.dirname(__FILE__)
 module CrossValidation
-  VERSION = '0.0.1'
+  VERSION = '0.0.2'
+end
+%w(confusion_matrix runner).each do |fn|
+  require File.join('cross_validation', fn)
 end

data/lib/cross_validation/confusion_matrix.rb CHANGED

@@ -34,7 +34,7 @@ module CrossValidation
     # @param [Object] truth   The known, expected value
     # @return [self]
     def store(actual, truth)
-      key = @keys_for.call(actual, truth)
+      key = @keys_for.call(truth, actual)
       if @values.key?(key)
         @values[key] += 1
@@ -83,6 +83,8 @@ module CrossValidation
     end
     # Returns the classifier's error
+    #
+    # @return [Float]
     def error
       1.0 - accuracy()
     end
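
Note on the `store` change: the callback now receives the known label first and the prediction second, which matches the `keys_for(expected, actual)` signature documented in the README hunk. The sketch below uses a stand-in lambda with the README's mapping (not the gem's own code) to show why the order matters. Calling an `(expected, actual)` callback with the arguments reversed leaves correct predictions alone but exchanges `:fp` and `:fn`.

```ruby
# Stand-in callback written for the (expected, actual) signature.
keys_for = lambda do |expected, actual|
  if expected == :spam
    actual == :spam ? :tp : :fp
  elsif expected == :ham
    actual == :ham ? :tn : :fn
  end
end

truth      = :spam # known label
prediction = :ham  # what a classifier returned

# Order used by the 0.0.2 version of #store.
matching_order = keys_for.call(truth, prediction)

# Reversed order, as the 0.0.1 call site would have passed them.
reversed_order = keys_for.call(prediction, truth)

# A correct prediction is unaffected by the order.
correct = keys_for.call(:spam, :spam)
```

Here `matching_order` is `:fp` and `reversed_order` is `:fn`, so the callback signature and the call site need to change together, as they do across this diff.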

data/lib/cross_validation/partitioner.rb ADDED

@@ -0,0 +1,34 @@
+module CrossValidation
+  # Provides helper methods for data partitioning.
+  #
+  module Partitioner
+    # Splits the array into +k+-sized subsets.
+    #
+    # For example, calling this method for the array +%w(foo bar baz qux)+
+    # with +k=2+ results in an array of arrays: +[[foo, bar], [baz, qux]]+.
+    #
+    # @param [Array]  ary    documents to work with
+    # @param [Fixnum] k      size of each subset
+    # @return [Array]        array of arrays
+    # @raise [ArgumentError] if the length of the documents array is not
+    #                        evenly divisible by k
+    def self.subset(ary, k)
+      if ary.length % k > 0
+        fail ArgumentError, "Can't create equal subsets when k=#{k}"
+      end
+      ary.each_slice(k).to_a
+    end
+    # Returns a flattened copy of the original array without an element at
+    # index +i+.
+    #
+    # @param [Array]  ary  subsets to work with (e.g., array of arrays)
+    # @param [Fixnum] i    index to remove
+    # @return [Array]
+    def self.exclude_index(ary, i)
+      ary.rotate(i).drop(1).flatten
+    end
+  end
+end
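
Note on the partitioner: a quick way to see what the two helpers return is to run the same expressions on a small array. The sketch below copies the method bodies from the hunk above into two lambdas (`subset` and `exclude_index` here are local stand-ins, used so the snippet runs without the gem). It also shows that the `exclude_index` expression drops a single element when given a flat array and a whole subset when given the nested result of `subset`.

```ruby
# Local stand-ins for Partitioner.subset and Partitioner.exclude_index,
# with bodies copied from the diff above.
subset = lambda do |ary, k|
  if ary.length % k > 0
    raise ArgumentError, "Can't create equal subsets when k=#{k}"
  end
  ary.each_slice(k).to_a
end

exclude_index = lambda do |ary, i|
  # rotate(i) moves element i to the front, drop(1) removes it,
  # flatten merges any nested arrays back into one list.
  ary.rotate(i).drop(1).flatten
end

docs    = %w(foo bar baz qux)
subsets = subset.call(docs, 2)             # => [["foo", "bar"], ["baz", "qux"]]

from_flat   = exclude_index.call(docs, 1)    # => ["baz", "qux", "foo"]
from_nested = exclude_index.call(subsets, 1) # => ["foo", "bar"]
```

Neither call mutates `docs` or `subsets`, since `rotate`, `drop`, and `flatten` all return new arrays.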

data/lib/cross_validation/runner.rb CHANGED

@@ -1,4 +1,6 @@
 require_relative '../cross_validation'
+require_relative 'partitioner'
+require_relative 'sample'
 module CrossValidation
   class Runner
@@ -43,6 +45,17 @@ module CrossValidation
     #                document and should return the document's class.
     attr_accessor :fetch_sample_class
+    # @return [Array] Array of which attributes are empty
+    attr_reader :errors
+    def initialize
+      @fetch_sample_value = lambda { |sample| sample.value }
+      @fetch_sample_class = lambda { |sample| sample.klass }
+      @critical_keys = [:documents, :classifier, :matrix, :training,
+                        :classifying, :fetch_sample_value, :fetch_sample_class]
+    end
     # Returns the number of folds to partition the documents into.
     #
     # @return [Fixnum]
@@ -50,6 +63,24 @@ module CrossValidation
       @k ||= percentage ? (documents.size * percentage) : folds
     end
+    # Checks if all of the required run parameters are set.
+    #
+    # @return [Boolean]
+    def valid?
+      @errors = []
+      @critical_keys.each do |k|
+        any_error = public_send(k).nil?
+        @errors << k if any_error
+      end
+      @errors.size == 0
+    end
+    # @see #valid?
+    def invalid?
+      !valid?
+    end
     # Performs k-fold cross-validation and returns a confusion matrix.
     #
     # The algorithm is as follows (Mitchell, 1997, p147):
@@ -61,24 +92,23 @@ module CrossValidation
     #     classify(partitions[i])
     #   output confusion matrix
     #
+    # @raise [ArgumentError] if the runner is missing required attributes
+    # @return [ConfusionMatrix]
     def run
-      partitions = documents.each_slice(k).to_a
+      fail_if_invalid
+      partitions = Partitioner.subset(documents, k)
       results = partitions.map.with_index do |part, i|
-        # Array#rotate puts the element i first, so all we have to do is rotate
-        # then remove that element to get the training set. Array#drop does not
-        # mutate the original array either. Array#flatten is needed to coalesce
-        # our list of lists into one list again.
-        training_samples = partitions.rotate(i).drop(1).flatten
+        training_samples = Partitioner.exclude_index(documents, i)
         classifier_instance = classifier.call()
-        # train it
-        training_samples.each { |doc| training.call(classifier_instance, doc) }
+        train(classifier_instance, training_samples)
         # fetch confusion keys
         part.each do |x|
-          prediction = classifying.call(classifier_instance, fetch_sample_value.call(x))
+          prediction = classify(classifier_instance, x)
           matrix.store(prediction, fetch_sample_class.call(x))
         end
       end
@@ -91,5 +121,23 @@ module CrossValidation
     def self.create
       new.tap { |r| yield(r) }
     end
+    private
+    def fail_if_invalid
+      return nil if valid?
+      msg = "The following attribute(s) must be specified: #{errors.join(', ')}"
+      fail ArgumentError, msg
+    end
+    def train(classifier_instance, samples)
+      samples.each do |doc|
+        training.call(classifier_instance, doc)
+      end
+    end
+    def classify(classifier_instance, sample)
+      classifying.call(classifier_instance, fetch_sample_value.call(sample))
+    end
   end
 end
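
Note on `run`: the loop holds out one partition at a time, trains on the rest, and tallies each prediction on the held-out partition. The sketch below walks through that loop in plain Ruby with toy data. Every name in it (`doc`, `classify`, `keys_for`, `counts`) is a hypothetical stand-in rather than the gem's API. One detail worth checking against the hunk above: the 0.0.2 code passes `documents` to `Partitioner.exclude_index`, whereas the removed 0.0.1 line rotated `partitions`. The sketch follows the 0.0.1 form, and the diff alone does not show whether the change is intended.

```ruby
# Toy data structure and classifier; hypothetical stand-ins only.
doc      = Struct.new(:klass, :value)
classify = ->(text) { text =~ /viagra/ ? :spam : :ham }

keys_for = lambda do |expected, actual|
  if expected == :spam
    actual == :spam ? :tp : :fp
  else
    actual == :ham ? :tn : :fn
  end
end

docs = [
  doc.new(:spam, 'Buy some viagra!'),
  doc.new(:ham,  'Buy some penicillin!'),
  doc.new(:spam, 'Would you like some viagra!'),
  doc.new(:ham,  'Would you like some penicillin!')
]

partitions     = docs.each_slice(2).to_a # two partitions of two documents
counts         = Hash.new(0)             # tallies :tp, :tn, :fp, :fn
training_sizes = []

partitions.each_with_index do |held_out, i|
  # Everything except partition i forms the training set.
  training = partitions.rotate(i).drop(1).flatten
  training_sizes << training.size
  # A real classifier would be trained on `training` here.

  held_out.each do |sample|
    prediction = classify.call(sample.value)
    counts[keys_for.call(sample.klass, prediction)] += 1
  end
end
```

With this toy classifier every document is labelled correctly, so `counts` ends up holding two `:tp` and two `:tn`, and each training set contains the two documents from the other partition.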

data/lib/cross_validation/sample.rb ADDED

@@ -0,0 +1,15 @@
+module CrossValidation
+  # Represents a datum and its class (e.g., "spam").
+  #
+  # This is an optional data structure that simplifies definining training
+  # methods in cross-validation runs.
+  Sample = Struct.new(:klass, :value)
+  # Converts an array of +[class, value]+ into a `Sample` object.
+  #
+  # @param [Array] tuple
+  # @return [Sample]
+  def self.Sample(tuple)
+    Sample.new(tuple.fetch(0), tuple.fetch(1))
+  end
+end
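
Note on `Sample`: the conversion method uses `fetch` rather than `[]`, so a tuple that is too short raises `IndexError` instead of producing a sample with `nil` fields. A minimal sketch of that behaviour, using local stand-ins (`sample_struct` and `to_sample` are hypothetical names) so it runs without the gem:

```ruby
# Local stand-ins mirroring the Struct and the conversion method above.
sample_struct = Struct.new(:klass, :value)
to_sample     = ->(tuple) { sample_struct.new(tuple.fetch(0), tuple.fetch(1)) }

sample = to_sample.call([:ham, 'Buy some penicillin!'])
sample.klass # => :ham
sample.value # => "Buy some penicillin!"

# An incomplete tuple fails loudly instead of yielding nil fields.
error = begin
  to_sample.call([:ham])
  nil
rescue IndexError => e
  e
end
```

Since the runner's default `fetch_sample_class` and `fetch_sample_value` lambdas call `klass` and `value`, documents converted this way should work without setting those two attributes.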

data/test/support/spam_classifier.rb ADDED

@@ -0,0 +1,24 @@
+# A toy classifier. As long as you can tell the CrossValidation gem how to
+# invoke your training and classifying methods, then you can do whatever you
+# want in your classifier.
+class SpamClassifier
+  def train(klass, document)
+    # don't bother, we're that good (in reality, you should probably do some
+    # work here)
+  end
+  def classify(document)
+    document =~ /viagra/ ? :spam : :ham
+  end
+  # Dummy method for use in testing confusion matrices. Used to determine
+  # whether a class is a true positive|negative or a false positive|negative.
+  # This is used when configuring a confusion matrix.
+  def self.keys_for(expected, actual)
+    if expected == :spam
+      actual == :spam ? :tp : :fp
+    elsif expected == :ham
+      actual == :ham ? :tn : :fn
+    end
+  end
+end

data/test/test_confusion_matrix.rb CHANGED

@@ -1,4 +1,5 @@
 require_relative 'test_helper'
+require_relative 'support/spam_classifier'
 require_relative '../lib/cross_validation/confusion_matrix'
 class TestConfusionMatrix < MiniTest::Unit::TestCase
@@ -7,7 +8,7 @@ class TestConfusionMatrix < MiniTest::Unit::TestCase
   end
   def setup
-    @mat = CrossValidation::ConfusionMatrix.new(method(:keys_for))
+    @mat = CrossValidation::ConfusionMatrix.new(SpamClassifier.method(:keys_for))
   end
   def test_true_positives
@@ -31,7 +32,7 @@ class TestConfusionMatrix < MiniTest::Unit::TestCase
   end
   def test_store_raises_index_error_on_bad_key
-    bad_keys_for = ->(actual, expected) { :bad }
+    bad_keys_for = ->(expected, actual) { :bad }
     mat = CrossValidation::ConfusionMatrix.new(bad_keys_for)
     assert_raises IndexError do
       mat.store(:ham, :spam)

data/test/test_helper.rb CHANGED

@@ -1,10 +1 @@
 require 'minitest/autorun'
-# Dummy method for use in testing confusion matrices.
-def keys_for(actual, expected)
-  if actual == :spam
-    expected == :spam ? :tp : :fn
-  elsif actual == :ham
-    expected == :ham ? :tn : :fp
-  end
-end

data/test/test_partitioner.rb ADDED

@@ -0,0 +1,28 @@
+require_relative 'test_helper'
+require_relative '../lib/cross_validation/partitioner'
+class TestPartitioner < MiniTest::Unit::TestCase
+  def setup
+    @docs = %w(foo bar baz qux)
+    @p    = CrossValidation::Partitioner
+  end
+  def test_create_equal_subsets_returns_equal_subsets
+    subsets = @p.subset(@docs, 2)
+    assert_equal %w(foo bar), subsets.first
+    assert_equal %w(baz qux), subsets.last
+  end
+  def test_create_equal_subsets_prevents_unequal_subsets
+    e = assert_raises ArgumentError do
+      @p.subset(@docs, 3)
+    end
+    assert_equal "Can't create equal subsets when k=3", e.message
+  end
+  def test_exclude_by_index
+    samples = @p.exclude_index(@docs, 1)
+    assert_equal %w(baz qux foo), samples
+  end
+end

data/test/test_runner.rb CHANGED

@@ -1,24 +1,9 @@
 require_relative 'test_helper'
+require_relative 'support/spam_classifier'
 require_relative '../lib/cross_validation/confusion_matrix'
+require_relative '../lib/cross_validation/sample'
 require_relative '../lib/cross_validation/runner'
-# A stupid classifier
-class SpamClassifier
-  def train(klass, document)
-    # don't bother, we're that good (in reality, you should probably do some
-    # work here)
-  end
-  def classify(document)
-    document =~ /viagra/ ? :spam : :ham
-  end
-end
-# We just need to associate a class with a value. Feel free to use whatever
-# data structure you like -- this is only used in user-defined training and
-# classifying closures.
-Sample = Struct.new(:klass, :value)
 # Asserts the DSL's getter and setters work.
 def check_dsl(attribute, value)
   runner = CrossValidation::Runner.create { |r|
@@ -28,13 +13,17 @@ def check_dsl(attribute, value)
   define_method("test_#{attribute}_getter") {
     assert_equal :value, runner.public_send(attribute)
   }
+  define_method("test_runner_is_invalid_with_only_#{attribute}_set") {
+    assert runner.invalid?
+  }
 end
 class TestRunner < MiniTest::Unit::TestCase
   def setup
     tpl = ['Buy some...', 'Would you like some...']
-    @spam = tpl.map { |pfx| Sample.new(:spam, pfx + 'viagra!') }
-    @ham = tpl.map { |pfx| Sample.new(:ham, pfx + 'penicillin!') }
+    @spam = tpl.map { |pfx| CrossValidation::Sample.new(:spam, pfx + 'viagra!') }
+    @ham = tpl.map { |pfx| CrossValidation::Sample.new(:ham, pfx + 'penicillin!') }
     @corpus = @spam + @ham
     @corpus *= 25 # 100 is easier to deal with
   end
@@ -44,9 +33,7 @@ class TestRunner < MiniTest::Unit::TestCase
       r.documents = @corpus
       r.folds = 10
       r.classifier = lambda { SpamClassifier.new }
-      r.fetch_sample_class = lambda { |sample| sample.klass }
-      r.fetch_sample_value = lambda { |sample| sample.value }
-      r.matrix = CrossValidation::ConfusionMatrix.new(method(:keys_for))
+      r.matrix = CrossValidation::ConfusionMatrix.new(SpamClassifier.method(:keys_for))
       r.training = lambda { |classifier, doc|
         classifier.train doc.klass, doc.value
       }
@@ -83,4 +70,12 @@ class TestRunner < MiniTest::Unit::TestCase
   ].each do |attribute|
     check_dsl(attribute, :foo)
   end
+  def test_invalid_runner_raises_error
+    runner = CrossValidation::Runner.create {}
+    exception = assert_raises ArgumentError do
+      runner.run
+    end
+    assert_match(/must be specified/, exception.message)
+  end
 end

data/test/test_sample.rb ADDED

@@ -0,0 +1,29 @@
+require_relative 'test_helper'
+require_relative '../lib/cross_validation/sample'
+class TestSample < MiniTest::Unit::TestCase
+  def setup
+    @sample = CrossValidation::Sample.new(:spam, :spammy_msg)
+  end
+  def test_klass
+    assert_equal :spam, @sample.klass
+  end
+  def test_value
+    assert_equal :spammy_msg, @sample.value
+  end
+  def test_casting_a_tuple_to_sample
+    tuple = [:ham, :some_value]
+    sample = CrossValidation::Sample(tuple)
+    assert_equal :ham, sample.klass
+    assert_equal :some_value, sample.value
+  end
+  def test_casting_an_incomplete_tuple_to_sample_fails
+    assert_raises IndexError do
+      CrossValidation::Sample([])
+    end
+  end
+end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: cross_validation
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.0.2
 platform: ruby
 authors:
 - Jon-Michael Deldin
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-04-06 00:00:00.000000000 Z
+date: 2013-04-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake
@@ -40,10 +40,15 @@ files:
 - cross_validation.gemspec
 - lib/cross_validation.rb
 - lib/cross_validation/confusion_matrix.rb
+- lib/cross_validation/partitioner.rb
 - lib/cross_validation/runner.rb
+- lib/cross_validation/sample.rb
+- test/support/spam_classifier.rb
 - test/test_confusion_matrix.rb
 - test/test_helper.rb
+- test/test_partitioner.rb
 - test/test_runner.rb
+- test/test_sample.rb
 homepage: https://github.com/jmdeldin/cross_validation
 licenses: []
 metadata: {}
@@ -68,6 +73,9 @@ signing_key:
 specification_version: 4
 summary: Performs k-fold cross-validation on machine learning classifiers.
 test_files:
+- test/support/spam_classifier.rb
 - test/test_confusion_matrix.rb
 - test/test_helper.rb
+- test/test_partitioner.rb
 - test/test_runner.rb
+- test/test_sample.rb