supervised_learning 0.0.1 → 0.0.2
- checksums.yaml +4 -4
- data/README.md +32 -10
- data/lib/supervised_learning/version.rb +1 -1
- data/lib/supervised_learning.rb +63 -77
- data/spec/linear_regression_spec.rb +2 -3
- data/spec/spec_helper.rb +1 -0
- data/supervised_learning.gemspec +3 -2
- metadata +18 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a22158ac095b48c40b8dcb61e29c7047668271ed
+  data.tar.gz: 2696ec504a12c9b389ddeece4bec1d91af5a857a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 71a672546ab53a46d71c6193b3a934e6777efdd73fdf5e4ae05fe1120b80823629590cfddd45d6b888c07035b187d0582b51566f6b96b3ccd9c7c3744ea55791
+  data.tar.gz: 27f4b46caa452db3fe2d0903fdf648e74d917cdd481a71f9e19cf51d86ee44669f8a5437bd8f7cb8c02161e35574f47f17b91a9195d9b335256634dc9028f687
data/README.md
CHANGED
@@ -1,8 +1,8 @@
-#
+# Supervised Learning
 
 Supervised learning is the machine learning task of inferring a function from labeled training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
 
-Credits for
+Credits for some of the algorithms used go to [Andrew Ng](http://cs.stanford.edu/people/ang/) at Stanford University.
 
 ## Example
 
@@ -35,11 +35,11 @@ Or install it yourself as:
 
 ## Usage
 
-1. Create a matrix of the training data
+### 1. Create a matrix of the training data
 
-The **output value** (the type of value you want to predict) needs to be in the **last column**. The matrix has to have a) at least two columns (one feature and one output) and b) one row. The more data you feed it, the more accurate the prediction.
+The **output value** (the type of value you want to predict) needs to be in the **last column**. The matrix has to have a) at least two columns (one feature and one output) and b) at least one row. The more data you feed it, the more accurate the prediction.
 
-Consult the [Ruby API](http://www.ruby-doc.org/stdlib-2.1.2/libdoc/matrix/rdoc/Matrix.html) for information on how to build matrices
+Consult the [Ruby API](http://www.ruby-doc.org/stdlib-2.1.2/libdoc/matrix/rdoc/Matrix.html) for information on how to build matrices. You can use either arrays of rows or arrays of columns to create a matrix.
 
 ```ruby
 require 'matrix'
@@ -47,21 +47,43 @@ require 'matrix'
 training_set = Matrix[ [2104, 3, 399900], [1600, 3, 329900], [3000, 4, 539900], [1940, 4, 239999] ]
 ```
 
-2. Instantiate an object with the training data
+### 2. Instantiate an object with the training data
 
 ```ruby
+require 'supervised_learning' # if not automatically loaded
+
 program = SupervisedLearning::LinearRegression.new(training_set)
 ```
 
-3. Create a prediction in form of a matrix
+### 3. Create a prediction in the form of a matrix
 
 This matrix has one row and the **columns follow the order of the training set**. It has one column fewer than the training set since the output value (the last column of the training set) is the value we want to predict.
 
 ```ruby
-# Predict the price of a house of
-prediction_set = Matrix[ [
+# Predict the price of a house of 2200 square meters with 3 bedrooms
+prediction_set = Matrix[ [2200, 3] ]
+
 program.predict(prediction_set)
-=>
+=> 454115.66
+```
+
+### 4. Options
+
+The lower-bound running time of the `#predict` method is cubic / Ω(n³). Hence, on very large datasets (more than 1,000 columns / features), the `#predict` method can be slow. If this happens, use the `#predict_advanced` method, which uses gradient descent (approximation) instead of the normal equation (direct calculation). Gradient descent can be slightly less accurate and also takes some optional parameters.
+
+The `#predict_advanced` method takes 4 parameters in the following order:
+
+1. The prediction set, as for the `#predict` method discussed above
+2. The learning rate (how big a "jump" each iteration makes). The smaller the learning rate, the fewer errors are to be expected. Unfortunately, a smaller learning rate requires more iterations (see below). Start at 0.001 and increase the parameter to 0.01, 0.1, 1, etc. if the calculation is too slow.
+3. The number of iterations. The more iterations, the more accurate the result will be, but the slower the algorithm will run.
+4. Whether to print out values to see at which point convergence happens.
+
+```ruby
+# Predict the price of a house of 2200 square meters with 3 bedrooms
+prediction_set = Matrix[ [2200, 3] ]
+
+program.predict_advanced(prediction_set, 0.01, 600, false)
+=> 443693.16
 ```
 
 ## Contributing
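The `#predict` flow documented in the README can be sketched with nothing but Ruby's stdlib `Matrix`, using the README's own four-row training set. This is the normal equation, θ = (XᵀX)⁻¹Xᵀy, a direct least-squares fit; the variable names below are illustrative, not the gem's API.

```ruby
require 'matrix'

# Stdlib-only sketch of what #predict computes via the normal equation
# theta = (X'X)^-1 X'y (no gem required; names are illustrative).
training_set = Matrix[ [2104, 3, 399900], [1600, 3, 329900], [3000, 4, 539900], [1940, 4, 239999] ]

# split features (all but the last column) from the output (last column)
columns = training_set.column_vectors.map(&:to_a)
output = Matrix.column_vector(columns.last)
features = columns[0..-2]

# prepend a column of ones for the intercept term
x = Matrix.columns([Array.new(training_set.row_size, 1)] + features)

theta = (x.transpose * x).inverse * x.transpose * output

# predict the price of a house of 2200 square meters with 3 bedrooms
prediction = (Matrix[[1, 2200, 3]] * theta)[0, 0].to_f
puts prediction.round(2) # => 454115.66
```

The result matches the README's `#predict` example, which suggests the published 454115.66 was produced from this same four-row training set.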
data/lib/supervised_learning.rb
CHANGED
@@ -1,6 +1,7 @@
 require 'supervised_learning/version'
 require 'matrix'
 require 'descriptive_statistics'
+require 'matrix_extensions'
 
 module SupervisedLearning
 
@@ -12,16 +13,18 @@ module SupervisedLearning
   class LinearRegression
     # Initializes a LinearRegression object with a training set
     # @param training_set [Matrix] training_set, each feature/dimension has one column and the last column is the output column (type of value #predict will return)
-    # @raise [ArgumentError] if training_set is not a Matrix
+    # @raise [ArgumentError] if training_set is not a Matrix or does not have at least two columns and one row
     def initialize(training_set)
       @training_set = training_set
       raise ArgumentError, 'input is not a Matrix' unless @training_set.is_a? Matrix
       raise ArgumentError, 'Matrix must have at least 2 columns and 1 row' unless @training_set.column_size > 1
 
-      @
-      @number_of_features = @number_of_training_set_columns - 1
+      @number_of_features = @training_set.column_size - 1
       @number_of_training_examples = @training_set.row_size
 
+      @feature_set = @training_set.clone
+      @feature_set.hpop # remove output set
+
       @output_set = @training_set.column_vectors.last
     end
 
@@ -30,7 +33,8 @@ module SupervisedLearning
     # sets (more than 1000 columns) it might take too long to calculate.
     # @param prediction [Matrix] prediction
     def predict(prediction)
-
+      # add ones to feature set
+      feature_set = Matrix.hconcat(Matrix.one(@number_of_training_examples, 1), @feature_set)
 
       validate_prediction_input(prediction)
 
@@ -38,7 +42,7 @@ module SupervisedLearning
       theta = (transposed_feature_set * feature_set).inverse * transposed_feature_set * @output_set
 
       # add column of ones to prediction
-      prediction =
+      prediction = Matrix.hconcat(Matrix.one(prediction.row_size, 1), prediction)
 
       result_vectorized = prediction * theta
       result = result_vectorized.to_a.first.to_f
@@ -51,14 +55,12 @@ module SupervisedLearning
     def predict_advanced(prediction, learning_rate = 0.01, iterations = 1000, debug = false)
       validate_prediction_input(prediction)
 
-      feature_set =
-      feature_set = normalize_feature_set(feature_set)
+      feature_set = normalize_feature_set(@feature_set)
       # add ones to feature set after normalization
-      feature_set =
+      feature_set = Matrix.hconcat(Matrix.one(@number_of_training_examples, 1), feature_set)
 
       # prepare theta column vector with zeros
-      theta =
-      theta = Matrix.columns([theta])
+      theta = Matrix.zero(@number_of_features + 1, 1)
 
       iterations.times do
         theta = theta - (learning_rate * (1.0/@number_of_training_examples) * (feature_set * theta - @output_set).transpose * feature_set).transpose
 
@@ -72,7 +74,7 @@ module SupervisedLearning
       prediction = normalize_prediction(prediction)
 
       # add column of ones to prediction
-      prediction =
+      prediction = Matrix.hconcat(Matrix.one(prediction.row_size, 1), prediction)
 
       result_vectorized = prediction * theta
       result = result_vectorized[0,0]
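The `iterations.times` loop above is plain batch gradient descent: each pass moves theta against the gradient of the squared-error cost, θ := θ − α·(1/m)·Xᵀ(Xθ − y). A self-contained sketch on a toy data set where y = 2x, so theta should converge toward intercept 0 and slope 2 (illustrative names, not the gem's API):

```ruby
require 'matrix'

# Batch gradient descent on toy data where y = 2x, so theta should
# approach [0, 2] (intercept, slope). Illustrative only, not the gem's API.
x = Matrix[[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]] # leading ones = intercept column
y = Matrix.column_vector([2.0, 4.0, 6.0, 8.0])
m = x.row_size
theta = Matrix.zero(2, 1)
learning_rate = 0.1

1000.times do
  # gradient of the squared-error cost: (1/m) * X^T (X theta - y)
  gradient = (1.0 / m) * x.transpose * (x * theta - y)
  theta -= learning_rate * gradient
end

puts theta[1, 0].round(3) # => 2.0
```

A learning rate that is too large makes this loop diverge instead of converge, which is why the README recommends starting small and increasing only if the run is too slow.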
@@ -80,34 +82,6 @@ module SupervisedLearning
 
     private
 
-    # Returns a feature set without output set (last column of training set)
-    # and optionally adds a leading column of ones to a Matrix.
-    # This column of ones is the first dimension of theta to easily calculate
-    # the output of a function a*1 + b*theta_1 + c*theta_2 etc.
-    # Ruby's Matrix class has no built-in function for prepending,
-    # hence some manual work is required.
-    # @see http://stackoverflow.com/questions/9710628/how-do-i-add-columns-and-rows-to-a-matrix-in-ruby
-    # @param matrix [Matrix] matrix
-    # @param leading_ones [Boolean] whether to prepend a column of leading ones
-    # @return [Matrix] matrix
-    def get_feature_set(matrix, leading_ones = false)
-      # get array of columns
-      existing_columns = matrix.column_vectors
-
-      columns = []
-      columns << Array.new(existing_columns.first.size, 1) if leading_ones
-      # add remaining columns
-      existing_columns.each_with_index do |column, index|
-        # output column (last column of @training_set) needs to be skipped
-        # when called from #get_feature_set, matrix includes output column
-        # when called from #prediction, matrix does not include output column
-        break if index + 1 > @number_of_features
-        columns << column.to_a
-      end
-
-      Matrix.columns(columns)
-    end
-
     # Validates prediction input.
     # @param prediction [Matrix] prediction
     # @raise [ArgumentError] if prediction is not a Matrix
@@ -123,59 +97,43 @@ module SupervisedLearning
     # @param feature_set [Matrix] feature set
     # @return [Matrix] normalized feature set
     def normalize_feature_set(feature_set)
-      #
+      # get mean for each column
       mean = []
       feature_set.column_vectors.each do |feature_set_column|
-
-        mean << Array.new(@number_of_training_examples, feature_set_column.mean)
+        mean << feature_set_column.mean
       end
-
-
-
-      @mean = Matrix[mean.row(0)]
+      # convert Array into Matrix of same dimension as feature_set for subtraction
+      # save for later usage
+      @mean = Matrix[mean].vcopy(@number_of_training_examples - 1)
 
       # subtract mean from feature set
-      feature_set = feature_set - mean
+      feature_set = feature_set - @mean
 
-      #
+      # get standard deviation for each column
       standard_deviation = []
       feature_set.column_vectors.each do |feature_set_column|
-
-        standard_deviation << [feature_set_column.standard_deviation]
-      end
-      # save for later usage
-      @standard_deviation = Matrix.columns(standard_deviation)
-
-      # Dividing these non-square matrices has to be done manually
-      # (non-square matrices have no inverse and can't be divided in Ruby)
-      # iterate through each column
-      columns = []
-      feature_set.column_vectors.each_with_index do |feature_set_column, index|
-        # manually divide each row within column with standard deviation for that row
-        columns << feature_set_column.to_a.collect { |value| value / @standard_deviation[0,index] }
+        standard_deviation << feature_set_column.standard_deviation
       end
-      #
-
-
+      # convert Array into Matrix of same dimension as feature_set for subtraction
+      # save for later usage
+      @standard_deviation = Matrix[standard_deviation].vcopy(@number_of_training_examples - 1)
+
+      # divide feature set by standard deviation
+      feature_set = feature_set.element_division @standard_deviation
     end
 
     # Normalizes prediction.
     # @param prediction [Matrix] prediction
     # @return [Matrix] normalized prediction
-    def normalize_prediction(prediction)
+    def normalize_prediction(prediction)
+      # convert prediction into Matrix of same dimension as @mean for subtraction
+      prediction = prediction.vcopy(@number_of_training_examples - 1)
+
       # subtract mean
-      prediction = prediction - @mean
-
-      #
-
-      # iterate through each column
-      columns = []
-      prediction.column_vectors.each_with_index do |prediction_column, index|
-        # manually divide each row within column with standard deviation for that row
-        columns << prediction_column / @standard_deviation[0,index]
-      end
-      # reconstruct prediction
-      prediction = Matrix.columns(columns)
+      prediction = prediction - @mean
+
+      # divide prediction by standard deviation
+      prediction = prediction.element_division @standard_deviation
     end
 
     # Calculates cost of current theta.
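The rewritten `normalize_feature_set` is z-score scaling: subtract each column's mean, then divide by its standard deviation, so all features land on a comparable scale before gradient descent. A stdlib-only sketch of the same arithmetic; the gem instead leans on `matrix_extensions`/`descriptive_statistics` helpers (`vcopy`, `element_division`, `#mean`), and the use of the population standard deviation here is an assumption:

```ruby
require 'matrix'

# z-score normalization per column: (x - mean) / standard deviation.
# Stdlib-only sketch; population standard deviation is assumed.
def normalize_columns(matrix)
  normalized = matrix.column_vectors.map do |column|
    values = column.to_a.map(&:to_f)
    mean = values.sum / values.size
    variance = values.sum { |v| (v - mean)**2 } / values.size
    std = Math.sqrt(variance)
    values.map { |v| (v - mean) / std }
  end
  Matrix.columns(normalized)
end

scaled = normalize_columns(Matrix[[2104, 3], [1600, 3], [3000, 4], [1940, 4]])
# each column of `scaled` now has mean 0 and standard deviation 1
```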
@@ -188,4 +146,32 @@ module SupervisedLearning
       cost_vectorized[0,0]
     end
   end
+
+  # This class uses logistic regression to make discrete predictions (true or false) based on a training set.
+  # The algorithms in #predict were provided by Andrew Ng (Stanford University).
+  # @author Michael Imstepf
+  class LogisticRegression
+    # Initializes a LogisticRegression object with a training set
+    # @param training_set [Matrix] training_set, each feature/dimension has one column and the last column is the output column (type of value #predict will return)
+    # @raise [ArgumentError] if training_set is not a Matrix or does not have at least two columns and one row
+    def initialize(training_set)
+      @training_set = training_set
+      raise ArgumentError, 'input is not a Matrix' unless @training_set.is_a? Matrix
+      raise ArgumentError, 'Matrix must have at least 2 columns and 1 row' unless @training_set.column_size > 1
+    end
+
+    private
+
+    def calculate_sigmoid(z)
+      matrix_with_ones = Matrix.one(1, z.column_size)
+      matrix_with_eulers_number = Matrix.build(1, z.column_size) { Math::E }
+      z_negative = z * -1
+      matrix_with_ones.element_division(matrix_with_ones + matrix_with_eulers_number.element_exponentiation(z_negative))
+    end
+
+    def calculate_cost
+    end
+  end
+
 end
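The added `calculate_sigmoid` evaluates the logistic function σ(z) = 1/(1 + e⁻ᶻ) elementwise, squashing any real input into (0, 1) so outputs can be read as probabilities for the true/false prediction. A plain stdlib sketch of the same function; the gem builds it from `matrix_extensions` primitives (`element_division`, `element_exponentiation`) instead:

```ruby
require 'matrix'

# Elementwise logistic function: sigma(z) = 1 / (1 + e^(-z)).
# Stdlib sketch of what calculate_sigmoid computes on a row vector.
def sigmoid(z)
  z.map { |value| 1.0 / (1.0 + Math.exp(-value)) }
end

row = Matrix[[-Float::INFINITY, 0.0, Float::INFINITY]]
p sigmoid(row).to_a.first # => [0.0, 0.5, 1.0]
```

The extremes show why the function suits classification: large negative inputs map toward 0 (false), large positive inputs toward 1 (true), and 0 maps to exactly 0.5.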
data/spec/linear_regression_spec.rb
CHANGED
@@ -1,5 +1,4 @@
 require 'spec_helper'
-require 'pry'
 
 describe SupervisedLearning::LinearRegression do
   training_set_one_feature = Matrix[ [2104,399900], [1600,329900], [2400,369000], [1416,232000], [3000,539900], [1985,299900], [1534,314900], [1427,198999], [1380,212000], [1494,242500], [1940,239999], [2000,347000], [1890,329999], [4478,699900], [1268,259900], [2300,449900], [1320,299900], [1236,199900], [2609,499998], [3031,599000], [1767,252900], [1888,255000], [1604,242900], [1962,259900], [3890,573900], [1100,249900], [1458,464500], [2526,469000], [2200,475000], [2637,299900], [1839,349900], [1000,169900], [2040,314900], [3137,579900], [1811,285900], [1437,249900], [1239,229900], [2132,345000], [4215,549000], [2162,287000], [1664,368500], [2238,329900], [2567,314000], [1200,299000], [852,179900], [1852,299900], [1203,239500] ]
@@ -104,13 +103,13 @@ describe SupervisedLearning::LinearRegression do
   context 'when prediction has correct amount of columns' do
     context 'when training set has one feature' do
       it 'returns correct prediction' do
-        expect(program_one_feature.predict_advanced(Matrix[ [1650] ], 0.
+        expect(program_one_feature.predict_advanced(Matrix[ [1650] ], 0.01, 600, false).to_i).to be_within(2000).of(293237)
       end
     end
 
     context 'when training set has two features' do
       it 'returns correct prediction' do
-        expect(program_two_features.predict_advanced(Matrix[ [1650, 3] ], 0.
+        expect(program_two_features.predict_advanced(Matrix[ [1650, 3] ], 0.01, 600, true).to_i).to be_within(2000).of(293237)
       end
     end
   end
data/spec/spec_helper.rb
CHANGED
data/supervised_learning.gemspec
CHANGED
@@ -21,6 +21,7 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency "bundler", "~> 1.6"
   spec.add_development_dependency "rake"
   spec.add_development_dependency "rspec"
-  spec.add_development_dependency "
-  spec.add_development_dependency "
+  spec.add_development_dependency "descriptive_statistics"
+  spec.add_development_dependency "matrix_extensions"
+  spec.add_development_dependency "pry"
 end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: supervised_learning
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.2
 platform: ruby
 authors:
 - Michael Imstepf
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-
+date: 2014-08-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -53,7 +53,7 @@ dependencies:
   - !ruby/object:Gem::Version
     version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: descriptive_statistics
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -67,7 +67,21 @@ dependencies:
     - !ruby/object:Gem::Version
       version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: matrix_extensions
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: pry
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="