supervised_learning 0.0.1 → 0.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: f8b579d3f7901496e6f92152443e2af2a5ae1e48
- data.tar.gz: c1b532c91505c321fbc3d0de82c0f2e19dd847dc
+ metadata.gz: a22158ac095b48c40b8dcb61e29c7047668271ed
+ data.tar.gz: 2696ec504a12c9b389ddeece4bec1d91af5a857a
  SHA512:
- metadata.gz: 595acd0096de69b602bdddf01f8e6f2c83da4cb0c40adb0d896d4232325d462edd3eea913fad21504b0d50f7c70b0bd79b7a6986c0cab20a7a1d6623890482fd
- data.tar.gz: d64319d9bf16c941b62939cab66f18da2f5c92d7a9934c02d08c4724d11ee6107f6a51c879401bc9d6c6803d41bc3d90cec4572ae3118c6b08bae84fd4c4888e
+ metadata.gz: 71a672546ab53a46d71c6193b3a934e6777efdd73fdf5e4ae05fe1120b80823629590cfddd45d6b888c07035b187d0582b51566f6b96b3ccd9c7c3744ea55791
+ data.tar.gz: 27f4b46caa452db3fe2d0903fdf648e74d917cdd481a71f9e19cf51d86ee44669f8a5437bd8f7cb8c02161e35574f47f17b91a9195d9b335256634dc9028f687
data/README.md CHANGED
@@ -1,8 +1,8 @@
- # SupervisedLearning
+ # Supervised Learning
 
  Supervised learning is the machine learning task of inferring a function from labeled training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
 
- Credits for the underlying algorithms of the functions that make predictions go to [Andrew Ng](http://cs.stanford.edu/people/ang/) at Stanford University.
+ Credits for some of the algorithms used go to [Andrew Ng](http://cs.stanford.edu/people/ang/) at Stanford University.
 
  ## Example
 
@@ -35,11 +35,11 @@ Or install it yourself as:
 
  ## Usage
 
- 1. Create a matrix of the training data.
+ ### 1. Create a matrix of the training data
 
- The **output value** (the type of value you want to predict) needs to be in the **last column**. The matrix has to have a) at least two columns (one feature and one output) and b) one row. The more data you feed it, the more accurate the prediction.
+ The **output value** (the type of value you want to predict) needs to be in the **last column**. The matrix has to have a) at least two columns (one feature and one output) and b) at least one row. The more data you feed it, the more accurate the prediction.
 
- Consult the [Ruby API](http://www.ruby-doc.org/stdlib-2.1.2/libdoc/matrix/rdoc/Matrix.html) for information on how to build matrices for instances based on arrays of rows or columns.
+ Consult the [Ruby API](http://www.ruby-doc.org/stdlib-2.1.2/libdoc/matrix/rdoc/Matrix.html) for information on how to build matrices. You can use either arrays of rows or arrays of columns to create a matrix (both are sketched below).
 
  ```ruby
  require 'matrix'
@@ -47,21 +47,43 @@ require 'matrix'
  training_set = Matrix[ [2104, 3, 399900], [1600, 3, 329900], [3000, 4, 539900], [1940, 4, 239999] ]
  ```
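For illustration (a sketch, not part of the gem itself), the same training set can be built from rows or from columns with Ruby's standard `Matrix` class:

```ruby
require 'matrix'

# Row-based construction (as above): each inner array is one training example.
by_rows = Matrix[ [2104, 3, 399900], [1600, 3, 329900], [3000, 4, 539900], [1940, 4, 239999] ]

# Column-based construction: each inner array is one column (a feature or the output).
by_columns = Matrix.columns([ [2104, 1600, 3000, 1940], [3, 3, 4, 4], [399900, 329900, 539900, 239999] ])

by_rows == by_columns
# => true
```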
 
- 2. Instantiate an object with the training data.
+ ### 2. Instantiate an object with the training data
 
  ```ruby
+ require 'supervised_learning' # if not automatically loaded
+
  program = SupervisedLearning::LinearRegression.new(training_set)
  ```
 
- 3. Create a prediction in form of a matrix.
+ ### 3. Create a prediction in form of a matrix
 
  This matrix has one row and the **columns follow the order of the training set**. It has one column less than the training set since the output value (the last column of the training set) is the value we want to predict.
 
  ```ruby
- # Predict the price of a house of 2000 square meters with 3 bedrooms
- prediction_set = Matrix[ [2000, 3] ]
+ # Predict the price of a house of 2200 square meters with 3 bedrooms
+ prediction_set = Matrix[ [2200, 3] ]
+
  program.predict(prediction_set)
- =>
+ => 454115.66
+ ```
+
+ ### 4. Options
+
+ The lower bound running time of the `#predict` method is cubic / Ω(n³). Hence, on very large datasets (more than 1,000 columns / features), the `#predict` method can be slow. If this happens, use the `#predict_advanced` method, which uses gradient descent (approximation) instead of the normal equation (direct calculation). Gradient descent can be a bit less accurate and also requires some optional parameters.
+
+ The `#predict_advanced` method takes 4 parameters in the following order:
+
+ 1. The prediction_set, similar to the `#predict` method discussed above
+ 2. The learning rate (how big of a "jump" each iteration makes). The smaller the learning rate, the fewer errors are to be expected. Unfortunately, a smaller learning rate requires more iterations (see below). Start at 0.001 and increase the parameter to 0.01, 0.1, 1, etc. if the calculation is too slow (a tuning sketch follows the example below).
+ 3. The number of iterations. The more iterations, the more accurate the result will be, but the slower the algorithm will run.
+ 4. Whether you want to print out values to see at which point convergence happens.
+
+ ```ruby
+ # Predict the price of a house of 2200 square meters with 3 bedrooms
+ prediction_set = Matrix[ [2200, 3] ]
+
+ program.predict_advanced(prediction_set, 0.01, 600, false)
+ => 443693.16
  ```
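If the gradient descent result looks off, the learning rate and the number of iterations can be tuned together. A minimal sketch, reusing `program` and `prediction_set` from the examples above:

```ruby
# Try a few learning rates at a fixed iteration count and compare the results.
[0.001, 0.01, 0.1].each do |learning_rate|
  result = program.predict_advanced(prediction_set, learning_rate, 600, false)
  puts "learning rate #{learning_rate}: #{result}"
end

# Pass true as the last argument to print values during the run,
# so you can see at which point convergence happens (option 4 above).
program.predict_advanced(prediction_set, 0.01, 600, true)
```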
 
  ## Contributing
@@ -1,3 +1,3 @@
  module SupervisedLearning
- VERSION = "0.0.1"
+ VERSION = "0.0.2"
  end
@@ -1,6 +1,7 @@
  require 'supervised_learning/version'
  require 'matrix'
  require 'descriptive_statistics'
+ require 'matrix_extensions'
 
  module SupervisedLearning
 
@@ -12,16 +13,18 @@ module SupervisedLearning
  class LinearRegression
  # Initializes a LinearRegression object with a training set
  # @param training_set [Matrix] training_set, each feature/dimension has one column and the last column is the output column (type of value #predict will return)
- # @raise [ArgumentError] if training_set is not a Matrix and has at least two columns and one row
+ # @raise [ArgumentError] if training_set is not a Matrix or does not have at least two columns and one row
  def initialize(training_set)
  @training_set = training_set
  raise ArgumentError, 'input is not a Matrix' unless @training_set.is_a? Matrix
  raise ArgumentError, 'Matrix must have at least 2 columns and 1 row' unless @training_set.column_size > 1
 
- @number_of_training_set_columns = @training_set.column_size
- @number_of_features = @number_of_training_set_columns - 1
+ @number_of_features = @training_set.column_size - 1
  @number_of_training_examples = @training_set.row_size
 
+ @feature_set = @training_set.clone
+ @feature_set.hpop # remove output set
+
  @output_set = @training_set.column_vectors.last
  end
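The two guard clauses above mean that anything other than a `Matrix` with at least two columns is rejected at construction time. A minimal sketch of what a caller sees (the inspect output is illustrative):

```ruby
require 'matrix'
require 'supervised_learning'

SupervisedLearning::LinearRegression.new([[2104, 399900]])
# => raises ArgumentError, "input is not a Matrix"

SupervisedLearning::LinearRegression.new(Matrix[ [2104] ])
# => raises ArgumentError, "Matrix must have at least 2 columns and 1 row"

SupervisedLearning::LinearRegression.new(Matrix[ [2104, 399900] ])
# => #<SupervisedLearning::LinearRegression ...>
```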
 
@@ -30,7 +33,8 @@ module SupervisedLearning
  # sets (more than 1000 columns) it might take too long to calculate.
  # @param prediction [Matrix] prediction
  def predict(prediction)
- feature_set = get_feature_set(@training_set, true)
+ # add ones to feature set
+ feature_set = Matrix.hconcat(Matrix.one(@number_of_training_examples, 1), @feature_set)
 
  validate_prediction_input(prediction)
 
@@ -38,7 +42,7 @@ module SupervisedLearning
  theta = (transposed_feature_set * feature_set).inverse * transposed_feature_set * @output_set
 
  # add column of ones to prediction
- prediction = get_feature_set(prediction, true)
+ prediction = Matrix.hconcat(Matrix.one(prediction.row_size, 1), prediction)
 
  result_vectorized = prediction * theta
  result = result_vectorized.to_a.first.to_f
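In matrix notation, the `theta` computed above is the closed-form least-squares solution (the normal equation), where X is the feature set with the leading column of ones and y is the output column:

$$\theta = (X^\top X)^{-1} X^\top y$$

Inverting X^T X is what makes this direct approach roughly cubic, which is why the README recommends `#predict_advanced` for very wide training sets.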
@@ -51,14 +55,12 @@ module SupervisedLearning
  def predict_advanced(prediction, learning_rate = 0.01, iterations = 1000, debug = false)
  validate_prediction_input(prediction)
 
- feature_set = get_feature_set(@training_set, false)
- feature_set = normalize_feature_set(feature_set)
+ feature_set = normalize_feature_set(@feature_set)
  # add ones to feature set after normalization
- feature_set = get_feature_set(feature_set, true)
+ feature_set = Matrix.hconcat(Matrix.one(@number_of_training_examples, 1), feature_set)
 
  # prepare theta column vector with zeros
- theta = Array.new(@number_of_training_set_columns, 0)
- theta = Matrix.columns([theta])
+ theta = Matrix.zero(@number_of_features+1, 1)
 
  iterations.times do
  theta = theta - (learning_rate * (1.0/@number_of_training_examples) * (feature_set * theta - @output_set).transpose * feature_set).transpose
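The line inside the loop is the standard batch gradient descent update, with α the learning rate and m the number of training examples:

$$\theta := \theta - \frac{\alpha}{m} X^\top (X\theta - y)$$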
@@ -72,7 +74,7 @@ module SupervisedLearning
  prediction = normalize_prediction(prediction)
 
  # add column of ones to prediction
- prediction = get_feature_set(prediction, true)
+ prediction = Matrix.hconcat(Matrix.one(prediction.row_size, 1), prediction)
 
  result_vectorized = prediction * theta
  result = result_vectorized[0,0]
@@ -80,34 +82,6 @@
 
  private
 
- # Returns a feature set without output set (last column of training set)
- # and optionally adds a leading column of ones to a Matrix.
- # This column of ones is the first dimension of theta to easily calculate
- # the output of a function a*1 + b*theta_1 + c*theta_2 etc.
- # Ruby's Matrix class has not built-in function for prepending,
- # hence some manual work is required.
- # @see http://stackoverflow.com/questions/9710628/how-do-i-add-columns-and-rows-to-a-matrix-in-ruby
- # @param matrix [Matrix] matrix
- # @param leading_ones [Boolean] whether to prepend a column of leading ones
- # @return [Matrix] matrix
- def get_feature_set(matrix, leading_ones = false)
- # get array of columns
- existing_columns = matrix.column_vectors
-
- columns = []
- columns << Array.new(existing_columns.first.size, 1) if leading_ones
- # add remaining columns
- existing_columns.each_with_index do |column, index|
- # output column (last column of @training_set) needs to be skipped
- # when called from #get_feature_set, matrix includes output column
- # when called from #prediction, matrix does not inlcude output column
- break if index + 1 > @number_of_features
- columns << column.to_a
- end
-
- Matrix.columns(columns)
- end
-
  # Validates prediction input.
  # @param prediction [Matrix] prediction
  # @raise [ArgumentError] if prediction is not a Matrix
@@ -123,59 +97,43 @@ module SupervisedLearning
  # @param feature_set [Matrix] feature set
  # @return [Matrix] normalized feature set
  def normalize_feature_set(feature_set)
- # create Matrix with mean
+ # get mean for each column
  mean = []
  feature_set.column_vectors.each do |feature_set_column|
- # create Matrix of length of training examples for later substraction
- mean << Array.new(@number_of_training_examples, feature_set_column.mean)
+ mean << feature_set_column.mean
  end
- mean = Matrix.columns(mean)
-
- # save for later usage as Matrix and not as Vector
- @mean = Matrix[mean.row(0)]
+ # convert Array into Matrix of same dimension as feature_set for subtraction
+ # save for later usage
+ @mean = Matrix[mean].vcopy(@number_of_training_examples - 1)
 
  # substract mean from feature set
- feature_set = feature_set - mean
+ feature_set = feature_set - @mean
 
- # create Matrix with standard deviation
+ # get standard deviation for each column
  standard_deviation = []
  feature_set.column_vectors.each do |feature_set_column|
- # create row vector with standard deviation
- standard_deviation << [feature_set_column.standard_deviation]
- end
- # save for later usage
- @standard_deviation = Matrix.columns(standard_deviation)
-
- # Dividing these non-square matrices has to be done manually
- # (non square matrices have no inverse and can't be divided in Ruby)
- # iterate through each column
- columns = []
- feature_set.column_vectors.each_with_index do |feature_set_column, index|
- # manually divide each row within column with standard deviation for that row
- columns << feature_set_column.to_a.collect { |value| value / @standard_deviation[0,index] }
+ standard_deviation << feature_set_column.standard_deviation
  end
- # reconstruct training set
- feature_set = Matrix.columns(columns)
- feature_set
+ # convert Array into Matrix of same dimension as feature_set for division
+ # save for later usage
+ @standard_deviation = Matrix[standard_deviation].vcopy(@number_of_training_examples - 1)
+
+ # divide feature set by standard deviation
+ feature_set = feature_set.element_division @standard_deviation
  end
 
  # Normalizes prediction.
  # @param prediction [Matrix] prediction
  # @return [Matrix] normalized prediction
- def normalize_prediction(prediction)
+ def normalize_prediction(prediction)
+ # convert prediction into Matrix of same dimension as @mean for subtraction
+ prediction = prediction.vcopy(@number_of_training_examples - 1)
+
  # substract mean
- prediction = prediction - @mean
-
- # Dividing these non-square matrices has to be done manually
- # (non square matrices have no inverse and can't be divided in Ruby)
- # iterate through each column
- columns = []
- prediction.column_vectors.each_with_index do |prediction_column, index|
- # manually divide each row within column with standard deviation for that row
- columns << prediction_column / @standard_deviation[0,index]
- end
- # reconstruct prediction
- prediction = Matrix.columns(columns)
+ prediction = prediction - @mean
+
+ # divide prediction by standard deviation
+ prediction = prediction.element_division @standard_deviation
  end
 
  # Calculates cost of current theta.
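The `normalize_feature_set` and `normalize_prediction` methods above implement standard feature scaling: each column is shifted by its mean and divided by its standard deviation, and the prediction input is scaled with the same stored `@mean` and `@standard_deviation` so that it matches the training data:

$$x_{\text{norm}} = \frac{x - \mu}{\sigma}$$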
@@ -188,4 +146,32 @@ module SupervisedLearning
  cost_vectorized[0,0]
  end
  end
+
+ # This class uses logistic regression to make discrete predictions (true or false) based on a training set.
+ # The algorithms in #predict were provided by Andrew Ng (Stanford University).
+ # @author Michael Imstepf
+ class LogisticRegression
+ # Initializes a LogisticRegression object with a training set
+ # @param training_set [Matrix] training_set, each feature/dimension has one column and the last column is the output column (type of value #predict will return)
+ # @raise [ArgumentError] if training_set is not a Matrix or does not have at least two columns and one row
+ def initialize(training_set)
+ @training_set = training_set
+ raise ArgumentError, 'input is not a Matrix' unless @training_set.is_a? Matrix
+ raise ArgumentError, 'Matrix must have at least 2 columns and 1 row' unless @training_set.column_size > 1
+ end
+
+ private
+
+ def calculate_sigmoid(z)
+ matrix_with_ones = Matrix.one(1, z.column_size)
+ matrix_with_eulers_number = Matrix.build(1, z.column_size) {Math::E}
+ z_negative = z * -1
+ matrix_with_ones.element_division (matrix_with_ones + matrix_with_eulers_number.element_exponentiation(z_negative))
+ end
+
+ def calculate_cost
+
+ end
+ end
+
  end
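`calculate_sigmoid` above evaluates the logistic function element-wise, mapping each linear score z to a value between 0 and 1 for the planned true/false predictions:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$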
@@ -1,5 +1,4 @@
  require 'spec_helper'
- require 'pry'
 
  describe SupervisedLearning::LinearRegression do
  training_set_one_feature = Matrix[ [2104,399900], [1600,329900], [2400,369000], [1416,232000], [3000,539900], [1985,299900], [1534,314900], [1427,198999], [1380,212000], [1494,242500], [1940,239999], [2000,347000], [1890,329999], [4478,699900], [1268,259900], [2300,449900], [1320,299900], [1236,199900], [2609,499998], [3031,599000], [1767,252900], [1888,255000], [1604,242900], [1962,259900], [3890,573900], [1100,249900], [1458,464500], [2526,469000], [2200,475000], [2637,299900], [1839,349900], [1000,169900], [2040,314900], [3137,579900], [1811,285900], [1437,249900], [1239,229900], [2132,345000], [4215,549000], [2162,287000], [1664,368500], [2238,329900], [2567,314000], [1200,299000], [852,179900], [1852,299900], [1203,239500] ]
@@ -104,13 +103,13 @@ describe SupervisedLearning::LinearRegression do
  context 'when prediction has correct amount of columns' do
  context 'when training set has one feature' do
  it 'returns correct prediction' do
- expect(program_one_feature.predict_advanced(Matrix[ [1650] ], 0.1, 600, false).to_i).to be_within(200).of(293237)
+ expect(program_one_feature.predict_advanced(Matrix[ [1650] ], 0.01, 600, false).to_i).to be_within(2000).of(293237)
  end
  end
 
  context 'when training set has two features' do
  it 'returns correct prediction' do
- expect(program_two_features.predict_advanced(Matrix[ [1650, 3] ], 0.1, 600, false).to_i).to be_within(200).of(293237)
+ expect(program_two_features.predict_advanced(Matrix[ [1650, 3] ], 0.01, 600, true).to_i).to be_within(2000).of(293237)
  end
  end
  end
data/spec/spec_helper.rb CHANGED
@@ -1,4 +1,5 @@
  require 'supervised_learning'
+ require 'pry'
 
  RSpec.configure do |config|
  # Run specs in random order to surface order dependencies. If you find an
@@ -21,6 +21,7 @@ Gem::Specification.new do |spec|
  spec.add_development_dependency "bundler", "~> 1.6"
  spec.add_development_dependency "rake"
  spec.add_development_dependency "rspec"
- spec.add_development_dependency "pry"
- spec.add_development_dependency "descriptive_statistics"
+ spec.add_development_dependency "descriptive_statistics"
+ spec.add_development_dependency "matrix_extensions"
+ spec.add_development_dependency "pry"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: supervised_learning
  version: !ruby/object:Gem::Version
- version: 0.0.1
+ version: 0.0.2
  platform: ruby
  authors:
  - Michael Imstepf
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2014-07-21 00:00:00.000000000 Z
+ date: 2014-08-11 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: bundler
@@ -53,7 +53,7 @@ dependencies:
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
- name: pry
+ name: descriptive_statistics
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
@@ -67,7 +67,21 @@ dependencies:
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
- name: descriptive_statistics
+ name: matrix_extensions
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ - !ruby/object:Gem::Dependency
+ name: pry
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - ">="