supervised_learning 0.0.1 → 0.0.2
- checksums.yaml +4 -4
- data/README.md +32 -10
- data/lib/supervised_learning/version.rb +1 -1
- data/lib/supervised_learning.rb +63 -77
- data/spec/linear_regression_spec.rb +2 -3
- data/spec/spec_helper.rb +1 -0
- data/supervised_learning.gemspec +3 -2
- metadata +18 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a22158ac095b48c40b8dcb61e29c7047668271ed
+  data.tar.gz: 2696ec504a12c9b389ddeece4bec1d91af5a857a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 71a672546ab53a46d71c6193b3a934e6777efdd73fdf5e4ae05fe1120b80823629590cfddd45d6b888c07035b187d0582b51566f6b96b3ccd9c7c3744ea55791
+  data.tar.gz: 27f4b46caa452db3fe2d0903fdf648e74d917cdd481a71f9e19cf51d86ee44669f8a5437bd8f7cb8c02161e35574f47f17b91a9195d9b335256634dc9028f687
data/README.md
CHANGED
@@ -1,8 +1,8 @@
-#
+# Supervised Learning
 
 Supervised learning is the machine learning task of inferring a function from labeled training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
 
-Credits for
+Credits for some of the algorithms used go to [Andrew Ng](http://cs.stanford.edu/people/ang/) at Stanford University.
 
 ## Example
 
@@ -35,11 +35,11 @@ Or install it yourself as:
 
 ## Usage
 
-1. Create a matrix of the training data
+### 1. Create a matrix of the training data
 
-The **output value** (the type of value you want to predict) needs to be in the **last column**. The matrix has to have a) at least two columns (one feature and one output) and b) one row. The more data you feed it, the more accurate the prediction.
+The **output value** (the type of value you want to predict) needs to be in the **last column**. The matrix has to have a) at least two columns (one feature and one output) and b) at least one row. The more data you feed it, the more accurate the prediction.
 
-Consult the [Ruby API](http://www.ruby-doc.org/stdlib-2.1.2/libdoc/matrix/rdoc/Matrix.html) for information on how to build matrices
+Consult the [Ruby API](http://www.ruby-doc.org/stdlib-2.1.2/libdoc/matrix/rdoc/Matrix.html) for information on how to build matrices. You can use either arrays of rows or arrays of columns to create a matrix.
 
 ```ruby
 require 'matrix'
@@ -47,21 +47,43 @@ require 'matrix'
 training_set = Matrix[ [2104, 3, 399900], [1600, 3, 329900], [3000, 4, 539900], [1940, 4, 239999] ]
 ```
 
-2. Instantiate an object with the training data
+### 2. Instantiate an object with the training data
 
 ```ruby
+require 'supervised_learning' # if not automatically loaded
+
 program = SupervisedLearning::LinearRegression.new(training_set)
 ```
 
-3. Create a prediction in form of a matrix
+### 3. Create a prediction in the form of a matrix
 
 This matrix has one row and the **columns follow the order of the training set**. It has one column fewer than the training set since the output value (the last column of the training set) is the value we want to predict.
 
 ```ruby
-# Predict the price of a house of
-prediction_set = Matrix[ [
+# Predict the price of a house of 2200 square meters with 3 bedrooms
+prediction_set = Matrix[ [2200, 3] ]
+
 program.predict(prediction_set)
-=>
+=> 454115.66
+```
+
+### 4. Options
+
+The lower-bound running time of the `#predict` method is cubic / Ω(n³). Hence, on very large datasets (more than 1,000 columns / features), the `#predict` method can be slow. If this happens, use the `#predict_advanced` method, which uses gradient descent (approximation) instead of the normal equation (direct calculation). Gradient descent can be slightly less accurate and also takes some optional parameters.
+
+The `#predict_advanced` method takes 4 parameters in the following order:
+
+1. The prediction set, as for the `#predict` method discussed above
+2. The learning rate (how big a "jump" each iteration makes). The smaller the learning rate, the fewer errors are to be expected. Unfortunately, a smaller learning rate requires more iterations (see below). Start at 0.001 and increase the parameter to 0.01, 0.1, 1, etc. if the calculation is too slow.
+3. The number of iterations. The more iterations, the more accurate the result will be, but the slower the algorithm will run.
+4. Whether to print out values to see at which point convergence happens.
+
+```ruby
+# Predict the price of a house of 2200 square meters with 3 bedrooms
+prediction_set = Matrix[ [2200, 3] ]
+
+program.predict_advanced(prediction_set, 0.01, 600, false)
+=> 443693.16
 ```
 
 ## Contributing
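The `#predict` flow documented in the README can be sketched with nothing but Ruby's stdlib `Matrix`, using the README's own four-row training set. This is the normal equation, θ = (XᵀX)⁻¹Xᵀy, a direct least-squares fit; the variable names below are illustrative, not the gem's API.

```ruby
require 'matrix'

# Stdlib-only sketch of what #predict computes via the normal equation
# theta = (X'X)^-1 X'y (no gem required; names are illustrative).
training_set = Matrix[ [2104, 3, 399900], [1600, 3, 329900], [3000, 4, 539900], [1940, 4, 239999] ]

# split features (all but the last column) from the output (last column)
columns = training_set.column_vectors.map(&:to_a)
output = Matrix.column_vector(columns.last)
features = columns[0..-2]

# prepend a column of ones for the intercept term
x = Matrix.columns([Array.new(training_set.row_size, 1)] + features)

theta = (x.transpose * x).inverse * x.transpose * output

# predict the price of a house of 2200 square meters with 3 bedrooms
prediction = (Matrix[[1, 2200, 3]] * theta)[0, 0].to_f
puts prediction.round(2) # => 454115.66
```

The result matches the README's `#predict` example, which suggests the published 454115.66 was produced from this same four-row training set.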
data/lib/supervised_learning.rb
CHANGED
@@ -1,6 +1,7 @@
 require 'supervised_learning/version'
 require 'matrix'
 require 'descriptive_statistics'
+require 'matrix_extensions'
 
 module SupervisedLearning
 
@@ -12,16 +13,18 @@ module SupervisedLearning
   class LinearRegression
     # Initializes a LinearRegression object with a training set
     # @param training_set [Matrix] training_set, each feature/dimension has one column and the last column is the output column (type of value #predict will return)
-    # @raise [ArgumentError] if training_set is not a Matrix
+    # @raise [ArgumentError] if training_set is not a Matrix or does not have at least two columns and one row
     def initialize(training_set)
       @training_set = training_set
       raise ArgumentError, 'input is not a Matrix' unless @training_set.is_a? Matrix
       raise ArgumentError, 'Matrix must have at least 2 columns and 1 row' unless @training_set.column_size > 1
 
-      @
-      @number_of_features = @number_of_training_set_columns - 1
+      @number_of_features = @training_set.column_size - 1
       @number_of_training_examples = @training_set.row_size
 
+      @feature_set = @training_set.clone
+      @feature_set.hpop # remove output set
+
       @output_set = @training_set.column_vectors.last
     end
 
@@ -30,7 +33,8 @@ module SupervisedLearning
     # sets (more than 1000 columns) it might take too long to calculate.
     # @param prediction [Matrix] prediction
     def predict(prediction)
-
+      # add ones to feature set
+      feature_set = Matrix.hconcat(Matrix.one(@number_of_training_examples, 1), @feature_set)
 
       validate_prediction_input(prediction)
 
@@ -38,7 +42,7 @@ module SupervisedLearning
       theta = (transposed_feature_set * feature_set).inverse * transposed_feature_set * @output_set
 
       # add column of ones to prediction
-      prediction =
+      prediction = Matrix.hconcat(Matrix.one(prediction.row_size, 1), prediction)
 
       result_vectorized = prediction * theta
       result = result_vectorized.to_a.first.to_f
@@ -51,14 +55,12 @@ module SupervisedLearning
     def predict_advanced(prediction, learning_rate = 0.01, iterations = 1000, debug = false)
       validate_prediction_input(prediction)
 
-      feature_set =
-      feature_set = normalize_feature_set(feature_set)
+      feature_set = normalize_feature_set(@feature_set)
       # add ones to feature set after normalization
-      feature_set =
+      feature_set = Matrix.hconcat(Matrix.one(@number_of_training_examples, 1), feature_set)
 
       # prepare theta column vector with zeros
-      theta =
-      theta = Matrix.columns([theta])
+      theta = Matrix.zero(@number_of_features + 1, 1)
 
       iterations.times do
         theta = theta - (learning_rate * (1.0/@number_of_training_examples) * (feature_set * theta - @output_set).transpose * feature_set).transpose
 
@@ -72,7 +74,7 @@ module SupervisedLearning
       prediction = normalize_prediction(prediction)
 
       # add column of ones to prediction
-      prediction =
+      prediction = Matrix.hconcat(Matrix.one(prediction.row_size, 1), prediction)
 
       result_vectorized = prediction * theta
       result = result_vectorized[0,0]
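The `iterations.times` loop above is plain batch gradient descent: each pass moves theta against the gradient of the squared-error cost, θ := θ − α·(1/m)·Xᵀ(Xθ − y). A self-contained sketch on a toy data set where y = 2x, so theta should converge toward intercept 0 and slope 2 (illustrative names, not the gem's API):

```ruby
require 'matrix'

# Batch gradient descent on toy data where y = 2x, so theta should
# approach [0, 2] (intercept, slope). Illustrative only, not the gem's API.
x = Matrix[[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]] # leading ones = intercept column
y = Matrix.column_vector([2.0, 4.0, 6.0, 8.0])
m = x.row_size
theta = Matrix.zero(2, 1)
learning_rate = 0.1

1000.times do
  # gradient of the squared-error cost: (1/m) * X^T (X theta - y)
  gradient = (1.0 / m) * x.transpose * (x * theta - y)
  theta -= learning_rate * gradient
end

puts theta[1, 0].round(3) # => 2.0
```

A learning rate that is too large makes this loop diverge instead of converge, which is why the README recommends starting small and increasing only if the run is too slow.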
@@ -80,34 +82,6 @@ module SupervisedLearning
 
     private
 
-    # Returns a feature set without output set (last column of training set)
-    # and optionally adds a leading column of ones to a Matrix.
-    # This column of ones is the first dimension of theta to easily calculate
-    # the output of a function a*1 + b*theta_1 + c*theta_2 etc.
-    # Ruby's Matrix class has no built-in function for prepending,
-    # hence some manual work is required.
-    # @see http://stackoverflow.com/questions/9710628/how-do-i-add-columns-and-rows-to-a-matrix-in-ruby
-    # @param matrix [Matrix] matrix
-    # @param leading_ones [Boolean] whether to prepend a column of leading ones
-    # @return [Matrix] matrix
-    def get_feature_set(matrix, leading_ones = false)
-      # get array of columns
-      existing_columns = matrix.column_vectors
-
-      columns = []
-      columns << Array.new(existing_columns.first.size, 1) if leading_ones
-      # add remaining columns
-      existing_columns.each_with_index do |column, index|
-        # output column (last column of @training_set) needs to be skipped
-        # when called from #get_feature_set, matrix includes output column
-        # when called from #prediction, matrix does not include output column
-        break if index + 1 > @number_of_features
-        columns << column.to_a
-      end
-
-      Matrix.columns(columns)
-    end
-
     # Validates prediction input.
     # @param prediction [Matrix] prediction
     # @raise [ArgumentError] if prediction is not a Matrix
@@ -123,59 +97,43 @@ module SupervisedLearning
     # @param feature_set [Matrix] feature set
     # @return [Matrix] normalized feature set
     def normalize_feature_set(feature_set)
-      #
+      # get mean for each column
       mean = []
       feature_set.column_vectors.each do |feature_set_column|
-
-        mean << Array.new(@number_of_training_examples, feature_set_column.mean)
+        mean << feature_set_column.mean
       end
-
-
-
-      @mean = Matrix[mean.row(0)]
+      # convert Array into Matrix of same dimension as feature_set for subtraction
+      # save for later usage
+      @mean = Matrix[mean].vcopy(@number_of_training_examples - 1)
 
       # subtract mean from feature set
-      feature_set = feature_set - mean
+      feature_set = feature_set - @mean
 
-      #
+      # get standard deviation for each column
       standard_deviation = []
       feature_set.column_vectors.each do |feature_set_column|
-
-        standard_deviation << [feature_set_column.standard_deviation]
-      end
-      # save for later usage
-      @standard_deviation = Matrix.columns(standard_deviation)
-
-      # Dividing these non-square matrices has to be done manually
-      # (non-square matrices have no inverse and can't be divided in Ruby)
-      # iterate through each column
-      columns = []
-      feature_set.column_vectors.each_with_index do |feature_set_column, index|
-        # manually divide each row within column with standard deviation for that row
-        columns << feature_set_column.to_a.collect { |value| value / @standard_deviation[0,index] }
+        standard_deviation << feature_set_column.standard_deviation
       end
-      #
-
-
+      # convert Array into Matrix of same dimension as feature_set for subtraction
+      # save for later usage
+      @standard_deviation = Matrix[standard_deviation].vcopy(@number_of_training_examples - 1)
+
+      # divide feature set by standard deviation
+      feature_set = feature_set.element_division @standard_deviation
     end
 
     # Normalizes prediction.
     # @param prediction [Matrix] prediction
     # @return [Matrix] normalized prediction
-    def normalize_prediction(prediction)
+    def normalize_prediction(prediction)
+      # convert prediction into Matrix of same dimension as @mean for subtraction
+      prediction = prediction.vcopy(@number_of_training_examples - 1)
+
       # subtract mean
-      prediction = prediction - @mean
-
-      #
-
-      # iterate through each column
-      columns = []
-      prediction.column_vectors.each_with_index do |prediction_column, index|
-        # manually divide each row within column with standard deviation for that row
-        columns << prediction_column / @standard_deviation[0,index]
-      end
-      # reconstruct prediction
-      prediction = Matrix.columns(columns)
+      prediction = prediction - @mean
+
+      # divide prediction by standard deviation
+      prediction = prediction.element_division @standard_deviation
     end
 
     # Calculates cost of current theta.
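The rewritten `normalize_feature_set` is z-score scaling: subtract each column's mean, then divide by its standard deviation, so all features land on a comparable scale before gradient descent. A stdlib-only sketch of the same arithmetic; the gem instead leans on `matrix_extensions`/`descriptive_statistics` helpers (`vcopy`, `element_division`, `#mean`), and the use of the population standard deviation here is an assumption:

```ruby
require 'matrix'

# z-score normalization per column: (x - mean) / standard deviation.
# Stdlib-only sketch; population standard deviation is assumed.
def normalize_columns(matrix)
  normalized = matrix.column_vectors.map do |column|
    values = column.to_a.map(&:to_f)
    mean = values.sum / values.size
    variance = values.sum { |v| (v - mean)**2 } / values.size
    std = Math.sqrt(variance)
    values.map { |v| (v - mean) / std }
  end
  Matrix.columns(normalized)
end

scaled = normalize_columns(Matrix[[2104, 3], [1600, 3], [3000, 4], [1940, 4]])
# each column of `scaled` now has mean 0 and standard deviation 1
```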
@@ -188,4 +146,32 @@ module SupervisedLearning
       cost_vectorized[0,0]
     end
   end
+
+  # This class uses logistic regression to make discrete predictions (true or false) based on a training set.
+  # The algorithms in #predict were provided by Andrew Ng (Stanford University).
+  # @author Michael Imstepf
+  class LogisticRegression
+    # Initializes a LogisticRegression object with a training set
+    # @param training_set [Matrix] training_set, each feature/dimension has one column and the last column is the output column (type of value #predict will return)
+    # @raise [ArgumentError] if training_set is not a Matrix or does not have at least two columns and one row
+    def initialize(training_set)
+      @training_set = training_set
+      raise ArgumentError, 'input is not a Matrix' unless @training_set.is_a? Matrix
+      raise ArgumentError, 'Matrix must have at least 2 columns and 1 row' unless @training_set.column_size > 1
+    end
+
+    private
+
+    def calculate_sigmoid(z)
+      matrix_with_ones = Matrix.one(1, z.column_size)
+      matrix_with_eulers_number = Matrix.build(1, z.column_size) { Math::E }
+      z_negative = z * -1
+      matrix_with_ones.element_division(matrix_with_ones + matrix_with_eulers_number.element_exponentiation(z_negative))
+    end
+
+    def calculate_cost
+    end
+  end
+
 end
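The added `calculate_sigmoid` evaluates the logistic function σ(z) = 1/(1 + e⁻ᶻ) elementwise, squashing any real input into (0, 1) so outputs can be read as probabilities for the true/false prediction. A plain stdlib sketch of the same function; the gem builds it from `matrix_extensions` primitives (`element_division`, `element_exponentiation`) instead:

```ruby
require 'matrix'

# Elementwise logistic function: sigma(z) = 1 / (1 + e^(-z)).
# Stdlib sketch of what calculate_sigmoid computes on a row vector.
def sigmoid(z)
  z.map { |value| 1.0 / (1.0 + Math.exp(-value)) }
end

row = Matrix[[-Float::INFINITY, 0.0, Float::INFINITY]]
p sigmoid(row).to_a.first # => [0.0, 0.5, 1.0]
```

The extremes show why the function suits classification: large negative inputs map toward 0 (false), large positive inputs toward 1 (true), and 0 maps to exactly 0.5.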
data/spec/linear_regression_spec.rb
CHANGED
@@ -1,5 +1,4 @@
 require 'spec_helper'
-require 'pry'
 
 describe SupervisedLearning::LinearRegression do
   training_set_one_feature = Matrix[ [2104,399900], [1600,329900], [2400,369000], [1416,232000], [3000,539900], [1985,299900], [1534,314900], [1427,198999], [1380,212000], [1494,242500], [1940,239999], [2000,347000], [1890,329999], [4478,699900], [1268,259900], [2300,449900], [1320,299900], [1236,199900], [2609,499998], [3031,599000], [1767,252900], [1888,255000], [1604,242900], [1962,259900], [3890,573900], [1100,249900], [1458,464500], [2526,469000], [2200,475000], [2637,299900], [1839,349900], [1000,169900], [2040,314900], [3137,579900], [1811,285900], [1437,249900], [1239,229900], [2132,345000], [4215,549000], [2162,287000], [1664,368500], [2238,329900], [2567,314000], [1200,299000], [852,179900], [1852,299900], [1203,239500] ]
@@ -104,13 +103,13 @@ describe SupervisedLearning::LinearRegression do
   context 'when prediction has correct amount of columns' do
     context 'when training set has one feature' do
       it 'returns correct prediction' do
-        expect(program_one_feature.predict_advanced(Matrix[ [1650] ], 0.
+        expect(program_one_feature.predict_advanced(Matrix[ [1650] ], 0.01, 600, false).to_i).to be_within(2000).of(293237)
       end
     end
 
     context 'when training set has two features' do
       it 'returns correct prediction' do
-        expect(program_two_features.predict_advanced(Matrix[ [1650, 3] ], 0.
+        expect(program_two_features.predict_advanced(Matrix[ [1650, 3] ], 0.01, 600, true).to_i).to be_within(2000).of(293237)
       end
     end
   end
data/spec/spec_helper.rb
CHANGED
data/supervised_learning.gemspec
CHANGED
@@ -21,6 +21,7 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency "bundler", "~> 1.6"
   spec.add_development_dependency "rake"
   spec.add_development_dependency "rspec"
-  spec.add_development_dependency "
-  spec.add_development_dependency "
+  spec.add_development_dependency "descriptive_statistics"
+  spec.add_development_dependency "matrix_extensions"
+  spec.add_development_dependency "pry"
 end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: supervised_learning
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.2
 platform: ruby
 authors:
 - Michael Imstepf
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-
+date: 2014-08-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -53,7 +53,7 @@ dependencies:
   - !ruby/object:Gem::Version
     version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: descriptive_statistics
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -67,7 +67,21 @@ dependencies:
     - !ruby/object:Gem::Version
       version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: matrix_extensions
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: pry
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="