rumale 0.13.0 → 0.13.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +2 -2
- data/lib/rumale.rb +1 -0
- data/lib/rumale/clustering/dbscan.rb +25 -13
- data/lib/rumale/clustering/k_medoids.rb +2 -2
- data/lib/rumale/clustering/snn.rb +76 -0
- data/lib/rumale/decomposition/pca.rb +2 -1
- data/lib/rumale/linear_model/linear_regression.rb +54 -15
- data/lib/rumale/linear_model/ridge.rb +57 -17
- data/lib/rumale/pairwise_metric.rb +18 -5
- data/lib/rumale/version.rb +1 -1
- data/rumale.gemspec +1 -1
- metadata +4 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: ce88d7170fd676377227427a0be90f8bdb1a9c97
|
4
|
+
data.tar.gz: 04f0d07e6d098768eda726fc82f864420678e427
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 203444f0e7d833946f67c2ee922e02a48b7174c20eac84480e190f8749e150e0c5ed18e3d7b7d30480e565483b5a5b51d1990cced7e09b5db027d8c508fa4313
|
7
|
+
data.tar.gz: e608c97fc0d29c018c778f9cc96cd53b0edff927c5631bd3b0cb606ee93f4e8c647ed2c76e7835b49f1933c0e5aeccb1ffbda4fe9aec59a2689f7bde4a28e103
|
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,11 @@
|
|
1
|
+
# 0.13.1
|
2
|
+
- Add class for Shared Neareset Neighbor clustering.
|
3
|
+
- Add function for calculation of manhattan distance to Rumale::PairwiseMetric.
|
4
|
+
- Add metric parameter that specifies distance metric to Rumale::Clustering::DBSCAN.
|
5
|
+
- Add the solver parameter that specifies the optimization algorithm to Rumale::LinearModel::LinearRegression.
|
6
|
+
- Add the solver parameter that specifies the optimization algorithm to Rumale::LinearModel::Ridge.
|
7
|
+
- Fix bug that the ndim of NArray of 1-dimensional principal components is not 1.
|
8
|
+
|
1
9
|
# 0.13.0
|
2
10
|
- Introduce [Numo::Linalg](https://github.com/ruby-numo/numo-linalg) to use linear algebra algorithms on the optimization.
|
3
11
|
- Add the solver parameter that specifies the optimization algorithm to Rumale::Decomposition::PCA.
|
data/README.md
CHANGED
@@ -6,14 +6,14 @@
|
|
6
6
|
[](https://coveralls.io/github/yoshoku/rumale?branch=master)
|
7
7
|
[](https://badge.fury.io/rb/rumale)
|
8
8
|
[](https://github.com/yoshoku/rumale/blob/master/LICENSE.txt)
|
9
|
-
[](https://www.rubydoc.info/gems/rumale/0.13.
|
9
|
+
[](https://www.rubydoc.info/gems/rumale/0.13.1)
|
10
10
|
|
11
11
|
Rumale (**Ru**by **ma**chine **le**arning) is a machine learning library in Ruby.
|
12
12
|
Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
|
13
13
|
Rumale supports Linear / Kernel Support Vector Machine,
|
14
14
|
Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
|
15
15
|
Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor classifier,
|
16
|
-
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, Power Iteration Clustering,
|
16
|
+
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, SNN, Power Iteration Clustering,
|
17
17
|
Mutidimensional Scaling, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
|
18
18
|
|
19
19
|
This project was formerly known as "SVMKit".
|
data/lib/rumale.rb
CHANGED
@@ -60,6 +60,7 @@ require 'rumale/clustering/k_means'
|
|
60
60
|
require 'rumale/clustering/k_medoids'
|
61
61
|
require 'rumale/clustering/gaussian_mixture'
|
62
62
|
require 'rumale/clustering/dbscan'
|
63
|
+
require 'rumale/clustering/snn'
|
63
64
|
require 'rumale/clustering/power_iteration'
|
64
65
|
require 'rumale/decomposition/pca'
|
65
66
|
require 'rumale/decomposition/nmf'
|
data/lib/rumale/clustering/dbscan.rb
CHANGED
@@ -7,7 +7,6 @@ require 'rumale/pairwise_metric'
|
|
7
7
|
module Rumale
|
8
8
|
module Clustering
|
9
9
|
# DBSCAN is a class that implements DBSCAN cluster analysis.
|
10
|
-
# The current implementation uses the Euclidean distance for analyzing the clusters.
|
11
10
|
#
|
12
11
|
# @example
|
13
12
|
# analyzer = Rumale::Clustering::DBSCAN.new(eps: 0.5, min_samples: 5)
|
@@ -31,12 +30,17 @@ module Rumale
|
|
31
30
|
#
|
32
31
|
# @param eps [Float] The radius of neighborhood.
|
33
32
|
# @param min_samples [Integer] The number of neighbor samples to be used for the criterion whether a point is a core point.
|
34
|
-
|
33
|
+
# @param metric [String] The metric to calculate the distances.
|
34
|
+
# If metric is 'euclidean', Euclidean distance is calculated for distance between points.
|
35
|
+
# If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
|
36
|
+
def initialize(eps: 0.5, min_samples: 5, metric: 'euclidean')
|
35
37
|
check_params_float(eps: eps)
|
36
38
|
check_params_integer(min_samples: min_samples)
|
39
|
+
check_params_string(metric: metric)
|
37
40
|
@params = {}
|
38
41
|
@params[:eps] = eps
|
39
42
|
@params[:min_samples] = min_samples
|
43
|
+
@params[:metric] = metric == 'precomputed' ? 'precomputed' : 'euclidean'
|
40
44
|
@core_sample_ids = nil
|
41
45
|
@labels = nil
|
42
46
|
end
|
@@ -46,19 +50,23 @@ module Rumale
|
|
46
50
|
# @overload fit(x) -> DBSCAN
|
47
51
|
#
|
48
52
|
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for cluster analysis.
|
53
|
+
# If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
|
49
54
|
# @return [DBSCAN] The learned cluster analyzer itself.
|
50
55
|
def fit(x, _y = nil)
|
51
56
|
check_sample_array(x)
|
57
|
+
raise ArgumentError, 'Expect the input distance matrix to be square.' if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
|
52
58
|
partial_fit(x)
|
53
59
|
self
|
54
60
|
end
|
55
61
|
|
56
62
|
# Analysis clusters and assign samples to clusters.
|
57
63
|
#
|
58
|
-
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The
|
64
|
+
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The samples to be used for cluster analysis.
|
65
|
+
# If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
|
59
66
|
# @return [Numo::Int32] (shape: [n_samples]) Predicted cluster label per sample.
|
60
67
|
def fit_predict(x)
|
61
68
|
check_sample_array(x)
|
69
|
+
raise ArgumentError, 'Expect the input distance matrix to be square.' if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
|
62
70
|
partial_fit(x)
|
63
71
|
labels
|
64
72
|
end
|
@@ -84,19 +92,24 @@ module Rumale
|
|
84
92
|
|
85
93
|
def partial_fit(x)
|
86
94
|
cluster_id = 0
|
87
|
-
|
95
|
+
metric_mat = calc_pairwise_metrics(x)
|
96
|
+
n_samples = metric_mat.shape[0]
|
88
97
|
@core_sample_ids = []
|
89
98
|
@labels = Numo::Int32.zeros(n_samples) - 2
|
90
|
-
n_samples.times do |
|
91
|
-
next if @labels[
|
92
|
-
cluster_id += 1 if expand_cluster(
|
99
|
+
n_samples.times do |query_id|
|
100
|
+
next if @labels[query_id] >= -1
|
101
|
+
cluster_id += 1 if expand_cluster(metric_mat, query_id, cluster_id)
|
93
102
|
end
|
94
103
|
@core_sample_ids = Numo::Int32[*@core_sample_ids.flatten]
|
95
104
|
nil
|
96
105
|
end
|
97
106
|
|
98
|
-
def
|
99
|
-
|
107
|
+
def calc_pairwise_metrics(x)
|
108
|
+
@params[:metric] == 'precomputed' ? x : Rumale::PairwiseMetric.euclidean_distance(x)
|
109
|
+
end
|
110
|
+
|
111
|
+
def expand_cluster(metric_mat, query_id, cluster_id)
|
112
|
+
target_ids = region_query(metric_mat[query_id, true])
|
100
113
|
if target_ids.size < @params[:min_samples]
|
101
114
|
@labels[query_id] = -1
|
102
115
|
false
|
@@ -105,7 +118,7 @@ module Rumale
|
|
105
118
|
@core_sample_ids.push(target_ids.dup)
|
106
119
|
target_ids.delete(query_id)
|
107
120
|
while (m = target_ids.shift)
|
108
|
-
neighbor_ids = region_query(
|
121
|
+
neighbor_ids = region_query(metric_mat[m, true])
|
109
122
|
next if neighbor_ids.size < @params[:min_samples]
|
110
123
|
neighbor_ids.each do |n|
|
111
124
|
target_ids.push(n) if @labels[n] < -1
|
@@ -116,9 +129,8 @@ module Rumale
|
|
116
129
|
end
|
117
130
|
end
|
118
131
|
|
119
|
-
def region_query(
|
120
|
-
|
121
|
-
distance_arr.lt(@params[:eps]).where.to_a
|
132
|
+
def region_query(metric_arr)
|
133
|
+
metric_arr.lt(@params[:eps]).where.to_a
|
122
134
|
end
|
123
135
|
end
|
124
136
|
end
|
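The DBSCAN change above routes every neighborhood lookup through a distance matrix, so a precomputed matrix can be passed in directly. The following is a minimal pure-Ruby sketch of that flow on plain nested Arrays, not the gem's implementation: the names `dbscan_precomputed` and `region_query` (as a free function) are hypothetical, and the cluster expansion loop is an assumption, since the diff elides part of `expand_cluster`.

```ruby
# Sketch only: DBSCAN over a precomputed distance matrix (nested Arrays).
# Labels follow the convention visible in the diff:
# -2 = unvisited, -1 = noise, 0 and up = cluster id.

# Indices whose distance to the query point is strictly below eps.
def region_query(distances, eps)
  distances.each_index.select { |i| distances[i] < eps }
end

def dbscan_precomputed(dist_mat, eps:, min_samples:)
  unless dist_mat.all? { |row| row.size == dist_mat.size }
    raise ArgumentError, 'Expect the input distance matrix to be square.'
  end

  labels = Array.new(dist_mat.size, -2)
  cluster_id = 0
  dist_mat.each_index do |query_id|
    next if labels[query_id] >= -1 # already visited

    targets = region_query(dist_mat[query_id], eps)
    if targets.size < min_samples
      labels[query_id] = -1 # not a core point: mark as noise for now
      next
    end

    # Core point: start a new cluster and grow it (assumed expansion logic).
    targets.each { |t| labels[t] = cluster_id }
    targets.delete(query_id)
    while (m = targets.shift)
      neighbors = region_query(dist_mat[m], eps)
      next if neighbors.size < min_samples

      neighbors.each do |n|
        targets.push(n) if labels[n] < -1        # unvisited: expand later
        labels[n] = cluster_id if labels[n] <= -1 # unvisited or noise: adopt
      end
    end
    cluster_id += 1
  end
  labels
end

# Two tight groups on a line plus one outlier.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
dist_mat = points.map { |a| points.map { |b| (a - b).abs } }
labels = dbscan_precomputed(dist_mat, eps: 0.5, min_samples: 3)
```

With the gem itself, the equivalent call would presumably be `Rumale::Clustering::DBSCAN.new(eps: 0.5, min_samples: 3, metric: 'precomputed').fit_predict(distance_matrix)`, where `distance_matrix` is a square `Numo::DFloat`.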
data/lib/rumale/clustering/k_medoids.rb
CHANGED
@@ -29,8 +29,8 @@ module Rumale
|
|
29
29
|
# Create a new cluster analyzer with K-Medoids method.
|
30
30
|
#
|
31
31
|
# @param n_clusters [Integer] The number of clusters.
|
32
|
-
# @param metric [String] The metric to calculate the distances
|
33
|
-
# If metric is 'euclidean', Euclidean distance is calculated for distance
|
32
|
+
# @param metric [String] The metric to calculate the distances.
|
33
|
+
# If metric is 'euclidean', Euclidean distance is calculated for distance between points.
|
34
34
|
# If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
|
35
35
|
# @param init [String] The initialization method for centroids ('random' or 'k-means++').
|
36
36
|
# @param max_iter [Integer] The maximum number of iterations.
|
data/lib/rumale/clustering/snn.rb
ADDED
@@ -0,0 +1,76 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require 'rumale/pairwise_metric'
|
4
|
+
require 'rumale/clustering/dbscan'
|
5
|
+
|
6
|
+
module Rumale
|
7
|
+
module Clustering
|
8
|
+
# SNN is a class that implements Shared Nearest Neighbor cluster analysis.
|
9
|
+
# The SNN method is a variation of DBSCAN that uses similarity based on k-nearest neighbors as a metric.
|
10
|
+
#
|
11
|
+
# @example
|
12
|
+
# analyzer = Rumale::Clustering::SNN.new(n_neighbros: 10, eps: 5, min_samples: 5)
|
13
|
+
# cluster_labels = analyzer.fit_predict(samples)
|
14
|
+
#
|
15
|
+
# *Reference*
|
16
|
+
# - L. Ertoz, M. Steinbach, and V. Kumar, "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data," Proc. SDM'03, pp. 47--58, 2003.
|
17
|
+
# - M E. Houle, H-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek, "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?," Proc. SSDBM'10, pp. 482--500, 2010.
|
18
|
+
class SNN < DBSCAN
|
19
|
+
# Create a new cluster analyzer with Shared Neareset Neighbor method.
|
20
|
+
#
|
21
|
+
# @param n_neighbors [Integer] The number of neighbors to be used for finding k-nearest neighbors.
|
22
|
+
# @param eps [Integer] The threshold value for finding connected components based on similarity.
|
23
|
+
# @param min_samples [Integer] The number of neighbor samples to be used for the criterion whether a point is a core point.
|
24
|
+
# @param metric [String] The metric to calculate the distances.
|
25
|
+
# If metric is 'euclidean', Euclidean distance is calculated for distance between points.
|
26
|
+
# If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
|
27
|
+
def initialize(n_neighbors: 10, eps: 5, min_samples: 5, metric: 'euclidean')
|
28
|
+
check_params_integer(n_neighbors: n_neighbors, min_samples: min_samples)
|
29
|
+
check_params_string(metric: metric)
|
30
|
+
@params = {}
|
31
|
+
@params[:n_neighbors] = n_neighbors
|
32
|
+
@params[:eps] = eps
|
33
|
+
@params[:min_samples] = min_samples
|
34
|
+
@params[:metric] = metric == 'precomputed' ? 'precomputed' : 'euclidean'
|
35
|
+
@core_sample_ids = nil
|
36
|
+
@labels = nil
|
37
|
+
end
|
38
|
+
|
39
|
+
# Analysis clusters with given training data.
|
40
|
+
#
|
41
|
+
# @overload fit(x) -> SNN
|
42
|
+
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for cluster analysis.
|
43
|
+
# If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
|
44
|
+
# @return [SNN] The learned cluster analyzer itself.
|
45
|
+
def fit(x, _y = nil)
|
46
|
+
super
|
47
|
+
end
|
48
|
+
|
49
|
+
# Analysis clusters and assign samples to clusters.
|
50
|
+
#
|
51
|
+
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The samples to be used for cluster analysis.
|
52
|
+
# If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
|
53
|
+
# @return [Numo::Int32] (shape: [n_samples]) Predicted cluster label per sample.
|
54
|
+
def fit_predict(x)
|
55
|
+
super
|
56
|
+
end
|
57
|
+
|
58
|
+
private
|
59
|
+
|
60
|
+
def calc_pairwise_metrics(x)
|
61
|
+
distance_mat = @params[:metric] == 'precomputed' ? x : Rumale::PairwiseMetric.euclidean_distance(x)
|
62
|
+
n_samples = distance_mat.shape[0]
|
63
|
+
adjacency_mat = Numo::DFloat.zeros(n_samples, n_samples)
|
64
|
+
n_samples.times do |n|
|
65
|
+
neighbor_ids = distance_mat[n, true].sort_index[0...@params[:n_neighbors]]
|
66
|
+
adjacency_mat[n, neighbor_ids] = 1
|
67
|
+
end
|
68
|
+
adjacency_mat.dot(adjacency_mat.transpose)
|
69
|
+
end
|
70
|
+
|
71
|
+
def region_query(similarity_arr)
|
72
|
+
similarity_arr.gt(@params[:eps]).where.to_a
|
73
|
+
end
|
74
|
+
end
|
75
|
+
end
|
76
|
+
end
|
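The new SNN class replaces the distance matrix with a similarity matrix: each sample marks its k nearest neighbors in an adjacency matrix, and the product of that matrix with its transpose counts the neighbors two samples share. A minimal pure-Ruby sketch of that idea follows, using nested Arrays and set intersection in place of the Numo matrix product. The function names `snn_similarity` and `similar_ids` are hypothetical, and ties are broken by index here, which is an assumption rather than the gem's behavior.

```ruby
# Sketch only: shared-nearest-neighbor similarity on plain nested Arrays.

# For each sample, take the indices of its n_neighbors closest samples
# (the sample itself included, at distance 0), then count the overlap
# between every pair of neighbor lists. The overlap size equals the dot
# product of the two 0/1 adjacency rows.
def snn_similarity(dist_mat, n_neighbors)
  neighbor_sets = dist_mat.map do |row|
    row.each_index.sort_by { |i| [row[i], i] }.first(n_neighbors)
  end
  neighbor_sets.map do |a|
    neighbor_sets.map { |b| (a & b).size }
  end
end

# Counterpart of the overridden region_query: similarity must EXCEED eps,
# whereas DBSCAN keeps distances BELOW eps.
def similar_ids(similarity_row, eps)
  similarity_row.each_index.select { |i| similarity_row[i] > eps }
end

points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist_mat = points.map { |a| points.map { |b| (a - b).abs } }
similarity = snn_similarity(dist_mat, 3)
close_to_first = similar_ids(similarity[0], 2)
```

Because similarity grows as samples get closer, `eps` acts as a lower bound here, which is why the class only needs to override `calc_pairwise_metrics` and `region_query` while inheriting the rest from DBSCAN.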
data/lib/rumale/decomposition/pca.rb
CHANGED
@@ -80,7 +80,8 @@ module Rumale
|
|
80
80
|
covariance_mat = centered_x.transpose.dot(centered_x) / (n_samples - 1)
|
81
81
|
if @params[:solver] == 'evd' && enable_linalg?
|
82
82
|
_, evecs = Numo::Linalg.eigh(covariance_mat, vals_range: (n_features - @params[:n_components])...n_features)
|
83
|
-
|
83
|
+
comps = evecs.reverse(1).transpose
|
84
|
+
@components = @params[:n_components] == 1 ? comps[0, true].dup : comps.dup
|
84
85
|
else
|
85
86
|
@params[:n_components].times do
|
86
87
|
comp_vec = Rumale::Utils.rand_uniform(n_features, sub_rng)
|
data/lib/rumale/linear_model/linear_regression.rb
CHANGED
@@ -6,7 +6,7 @@ require 'rumale/base/regressor'
|
|
6
6
|
module Rumale
|
7
7
|
module LinearModel
|
8
8
|
# LinearRegression is a class that implements ordinary least square linear regression
|
9
|
-
# with mini-batch stochastic gradient descent optimization.
|
9
|
+
# with mini-batch stochastic gradient descent optimization or singular value decomposition.
|
10
10
|
#
|
11
11
|
# @example
|
12
12
|
# estimator =
|
@@ -14,6 +14,11 @@ module Rumale
|
|
14
14
|
# estimator.fit(training_samples, traininig_values)
|
15
15
|
# results = estimator.predict(testing_samples)
|
16
16
|
#
|
17
|
+
# # If Numo::Linalg is installed, you can specify 'svd' for the solver option.
|
18
|
+
# require 'numo/linalg/autoloader'
|
19
|
+
# estimator = Rumale::LinearModel::LinearRegression.new(solver: 'svd')
|
20
|
+
# estimator.fit(training_samples, traininig_values)
|
21
|
+
# results = estimator.predict(testing_samples)
|
17
22
|
class LinearRegression < BaseLinearModel
|
18
23
|
include Base::Regressor
|
19
24
|
|
@@ -34,23 +39,32 @@ module Rumale
|
|
34
39
|
# @param fit_bias [Boolean] The flag indicating whether to fit the bias term.
|
35
40
|
# @param bias_scale [Float] The scale of the bias term.
|
36
41
|
# @param max_iter [Integer] The maximum number of iterations.
|
42
|
+
# If solver = 'svd', this parameter is ignored.
|
37
43
|
# @param batch_size [Integer] The size of the mini batches.
|
44
|
+
# If solver = 'svd', this parameter is ignored.
|
38
45
|
# @param optimizer [Optimizer] The optimizer to calculate adaptive learning rate.
|
39
46
|
# If nil is given, Nadam is used.
|
47
|
+
# If solver = 'svd', this parameter is ignored.
|
48
|
+
# @param solver [String] The algorithm to calculate weights. ('sgd' or 'svd').
|
49
|
+
# 'sgd' uses the stochastic gradient descent optimization.
|
50
|
+
# 'svd' performs singular value decomposition of samples.
|
40
51
|
# @param n_jobs [Integer] The number of jobs for running the fit method in parallel.
|
41
52
|
# If nil is given, the method does not execute in parallel.
|
42
53
|
# If zero or less is given, it becomes equal to the number of processors.
|
43
54
|
# This parameter is ignored if the Parallel gem is not loaded.
|
44
55
|
# @param random_seed [Integer] The seed value using to initialize the random generator.
|
45
56
|
def initialize(fit_bias: false, bias_scale: 1.0, max_iter: 1000, batch_size: 10, optimizer: nil,
|
46
|
-
n_jobs: nil, random_seed: nil)
|
57
|
+
solver: 'sgd', n_jobs: nil, random_seed: nil)
|
47
58
|
check_params_float(bias_scale: bias_scale)
|
48
59
|
check_params_integer(max_iter: max_iter, batch_size: batch_size)
|
49
60
|
check_params_boolean(fit_bias: fit_bias)
|
61
|
+
check_params_string(solver: solver)
|
50
62
|
check_params_type_or_nil(Integer, n_jobs: n_jobs, random_seed: random_seed)
|
51
63
|
check_params_positive(max_iter: max_iter, batch_size: batch_size)
|
52
64
|
keywd_args = method(:initialize).parameters.map { |_t, arg| [arg, binding.local_variable_get(arg)] }.to_h.merge(reg_param: 0.0)
|
65
|
+
keywd_args.delete(:solver)
|
53
66
|
super(keywd_args)
|
67
|
+
@params[:solver] = solver != 'svd' ? 'sgd' : 'svd'
|
54
68
|
end
|
55
69
|
|
56
70
|
# Fit the model with given training data.
|
@@ -63,20 +77,10 @@ module Rumale
|
|
63
77
|
check_tvalue_array(y)
|
64
78
|
check_sample_tvalue_size(x, y)
|
65
79
|
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
if n_outputs > 1
|
70
|
-
@weight_vec = Numo::DFloat.zeros(n_outputs, n_features)
|
71
|
-
@bias_term = Numo::DFloat.zeros(n_outputs)
|
72
|
-
if enable_parallel?
|
73
|
-
models = parallel_map(n_outputs) { |n| partial_fit(x, y[true, n]) }
|
74
|
-
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = models[n] }
|
75
|
-
else
|
76
|
-
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = partial_fit(x, y[true, n]) }
|
77
|
-
end
|
80
|
+
if @params[:solver] == 'svd' && enable_linalg?
|
81
|
+
fit_svd(x, y)
|
78
82
|
else
|
79
|
-
|
83
|
+
fit_sgd(x, y)
|
80
84
|
end
|
81
85
|
|
82
86
|
self
|
@@ -112,6 +116,41 @@ module Rumale
|
|
112
116
|
|
113
117
|
private
|
114
118
|
|
119
|
+
def fit_svd(x, y)
|
120
|
+
samples = @params[:fit_bias] ? expand_feature(x) : x
|
121
|
+
|
122
|
+
s, u, vt = Numo::Linalg.svd(samples, driver: 'sdd', job: 'S')
|
123
|
+
d = (s / s**2).diag
|
124
|
+
w = vt.transpose.dot(d).dot(u.transpose).dot(y)
|
125
|
+
|
126
|
+
is_single_target_vals = y.shape[1].nil?
|
127
|
+
if @params[:fit_bias]
|
128
|
+
@weight_vec = is_single_target_vals ? w[0...-1].dup : w[0...-1, true].dup
|
129
|
+
@bias_term = is_single_target_vals ? w[-1] : w[-1, true].dup
|
130
|
+
else
|
131
|
+
@weight_vec = w.dup
|
132
|
+
@bias_term = is_single_target_vals ? 0 : Numo::DFloat.zeros(y.shape[1])
|
133
|
+
end
|
134
|
+
end
|
135
|
+
|
136
|
+
def fit_sgd(x, y)
|
137
|
+
n_outputs = y.shape[1].nil? ? 1 : y.shape[1]
|
138
|
+
n_features = x.shape[1]
|
139
|
+
|
140
|
+
if n_outputs > 1
|
141
|
+
@weight_vec = Numo::DFloat.zeros(n_outputs, n_features)
|
142
|
+
@bias_term = Numo::DFloat.zeros(n_outputs)
|
143
|
+
if enable_parallel?
|
144
|
+
models = parallel_map(n_outputs) { |n| partial_fit(x, y[true, n]) }
|
145
|
+
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = models[n] }
|
146
|
+
else
|
147
|
+
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = partial_fit(x, y[true, n]) }
|
148
|
+
end
|
149
|
+
else
|
150
|
+
@weight_vec, @bias_term = partial_fit(x, y)
|
151
|
+
end
|
152
|
+
end
|
153
|
+
|
115
154
|
def calc_loss_gradient(x, y, weight)
|
116
155
|
2.0 * (x.dot(weight) - y)
|
117
156
|
end
|
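The `fit_svd` method above computes the weights as `vt.transpose.dot(d).dot(u.transpose).dot(y)` with `d = (s / s**2).diag`. A short derivation of why this matches the ordinary least squares solution is sketched below. It assumes the sample matrix X has full column rank, so every singular value is nonzero; the symbols X, U, Σ, V and ŵ are notation introduced here, not names from the source.

```latex
\begin{align}
X &= U \Sigma V^{\top}, \qquad \Sigma = \operatorname{diag}(s_1, \dots, s_r), \quad s_i > 0 \\
X^{\top} X &= V \Sigma U^{\top} U \Sigma V^{\top} = V \Sigma^{2} V^{\top} \\
\hat{w} &= (X^{\top} X)^{-1} X^{\top} y
         = V \Sigma^{-2} V^{\top} \, V \Sigma U^{\top} y
         = V \Sigma^{-1} U^{\top} y \\
        &= V \operatorname{diag}\!\left( \frac{s_i}{s_i^{2}} \right) U^{\top} y
\end{align}
```

The last line is the form used in the code. If a singular value were exactly zero, `s / s**2` would divide zero by zero, so the sketch above does not cover rank-deficient inputs.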
data/lib/rumale/linear_model/ridge.rb
CHANGED
@@ -6,7 +6,7 @@ require 'rumale/base/regressor'
|
|
6
6
|
module Rumale
|
7
7
|
module LinearModel
|
8
8
|
# Ridge is a class that implements Ridge Regression
|
9
|
-
# with mini-batch stochastic gradient descent optimization.
|
9
|
+
# with mini-batch stochastic gradient descent optimization or singular value decomposition.
|
10
10
|
#
|
11
11
|
# @example
|
12
12
|
# estimator =
|
@@ -14,6 +14,11 @@ module Rumale
|
|
14
14
|
# estimator.fit(training_samples, traininig_values)
|
15
15
|
# results = estimator.predict(testing_samples)
|
16
16
|
#
|
17
|
+
# # If Numo::Linalg is installed, you can specify 'svd' for the solver option.
|
18
|
+
# require 'numo/linalg/autoloader'
|
19
|
+
# estimator = Rumale::LinearModel::Ridge.new(reg_param: 0.1, solver: 'svd')
|
20
|
+
# estimator.fit(training_samples, traininig_values)
|
21
|
+
# results = estimator.predict(testing_samples)
|
17
22
|
class Ridge < BaseLinearModel
|
18
23
|
include Base::Regressor
|
19
24
|
|
@@ -35,22 +40,32 @@ module Rumale
|
|
35
40
|
# @param fit_bias [Boolean] The flag indicating whether to fit the bias term.
|
36
41
|
# @param bias_scale [Float] The scale of the bias term.
|
37
42
|
# @param max_iter [Integer] The maximum number of iterations.
|
43
|
+
# If solver = 'svd', this parameter is ignored.
|
38
44
|
# @param batch_size [Integer] The size of the mini batches.
|
45
|
+
# If solver = 'svd', this parameter is ignored.
|
39
46
|
# @param optimizer [Optimizer] The optimizer to calculate adaptive learning rate.
|
40
47
|
# If nil is given, Nadam is used.
|
48
|
+
# If solver = 'svd', this parameter is ignored.
|
49
|
+
# @param solver [String] The algorithm to calculate weights. ('sgd' or 'svd').
|
50
|
+
# 'sgd' uses the stochastic gradient descent optimization.
|
51
|
+
# 'svd' performs singular value decomposition of samples.
|
41
52
|
# @param n_jobs [Integer] The number of jobs for running the fit method in parallel.
|
42
53
|
# If nil is given, the method does not execute in parallel.
|
43
54
|
# If zero or less is given, it becomes equal to the number of processors.
|
44
|
-
# This parameter is ignored if the Parallel gem is not loaded.
|
55
|
+
# This parameter is ignored if the Parallel gem is not loaded or the solver is 'svd'.
|
45
56
|
# @param random_seed [Integer] The seed value using to initialize the random generator.
|
46
57
|
def initialize(reg_param: 1.0, fit_bias: false, bias_scale: 1.0, max_iter: 1000, batch_size: 10, optimizer: nil,
|
47
|
-
n_jobs: nil, random_seed: nil)
|
58
|
+
solver: 'sgd', n_jobs: nil, random_seed: nil)
|
48
59
|
check_params_float(reg_param: reg_param, bias_scale: bias_scale)
|
49
60
|
check_params_integer(max_iter: max_iter, batch_size: batch_size)
|
50
61
|
check_params_boolean(fit_bias: fit_bias)
|
62
|
+
check_params_string(solver: solver)
|
51
63
|
check_params_type_or_nil(Integer, n_jobs: n_jobs, random_seed: random_seed)
|
52
64
|
check_params_positive(reg_param: reg_param, max_iter: max_iter, batch_size: batch_size)
|
53
|
-
|
65
|
+
keywd_args = method(:initialize).parameters.map { |_t, arg| [arg, binding.local_variable_get(arg)] }.to_h
|
66
|
+
keywd_args.delete(:solver)
|
67
|
+
super(keywd_args)
|
68
|
+
@params[:solver] = solver != 'svd' ? 'sgd' : 'svd'
|
54
69
|
end
|
55
70
|
|
56
71
|
# Fit the model with given training data.
|
@@ -63,20 +78,10 @@ module Rumale
|
|
63
78
|
check_tvalue_array(y)
|
64
79
|
check_sample_tvalue_size(x, y)
|
65
80
|
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
if n_outputs > 1
|
70
|
-
@weight_vec = Numo::DFloat.zeros(n_outputs, n_features)
|
71
|
-
@bias_term = Numo::DFloat.zeros(n_outputs)
|
72
|
-
if enable_parallel?
|
73
|
-
models = parallel_map(n_outputs) { |n| partial_fit(x, y[true, n]) }
|
74
|
-
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = models[n] }
|
75
|
-
else
|
76
|
-
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = partial_fit(x, y[true, n]) }
|
77
|
-
end
|
81
|
+
if @params[:solver] == 'svd' && enable_linalg?
|
82
|
+
fit_svd(x, y)
|
78
83
|
else
|
79
|
-
|
84
|
+
fit_sgd(x, y)
|
80
85
|
end
|
81
86
|
|
82
87
|
self
|
@@ -112,6 +117,41 @@ module Rumale
|
|
112
117
|
|
113
118
|
private
|
114
119
|
|
120
|
+
def fit_svd(x, y)
|
121
|
+
samples = @params[:fit_bias] ? expand_feature(x) : x
|
122
|
+
|
123
|
+
s, u, vt = Numo::Linalg.svd(samples, driver: 'sdd', job: 'S')
|
124
|
+
d = (s / (s**2 + @params[:reg_param])).diag
|
125
|
+
w = vt.transpose.dot(d).dot(u.transpose).dot(y)
|
126
|
+
|
127
|
+
is_single_target_vals = y.shape[1].nil?
|
128
|
+
if @params[:fit_bias]
|
129
|
+
@weight_vec = is_single_target_vals ? w[0...-1].dup : w[0...-1, true].dup
|
130
|
+
@bias_term = is_single_target_vals ? w[-1] : w[-1, true].dup
|
131
|
+
else
|
132
|
+
@weight_vec = w.dup
|
133
|
+
@bias_term = is_single_target_vals ? 0 : Numo::DFloat.zeros(y.shape[1])
|
134
|
+
end
|
135
|
+
end
|
136
|
+
|
137
|
+
def fit_sgd(x, y)
|
138
|
+
n_outputs = y.shape[1].nil? ? 1 : y.shape[1]
|
139
|
+
n_features = x.shape[1]
|
140
|
+
|
141
|
+
if n_outputs > 1
|
142
|
+
@weight_vec = Numo::DFloat.zeros(n_outputs, n_features)
|
143
|
+
@bias_term = Numo::DFloat.zeros(n_outputs)
|
144
|
+
if enable_parallel?
|
145
|
+
models = parallel_map(n_outputs) { |n| partial_fit(x, y[true, n]) }
|
146
|
+
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = models[n] }
|
147
|
+
else
|
148
|
+
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = partial_fit(x, y[true, n]) }
|
149
|
+
end
|
150
|
+
else
|
151
|
+
@weight_vec, @bias_term = partial_fit(x, y)
|
152
|
+
end
|
153
|
+
end
|
154
|
+
|
115
155
|
def calc_loss_gradient(x, y, weight)
|
116
156
|
2.0 * (x.dot(weight) - y)
|
117
157
|
end
|
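In the Ridge `fit_svd` above, the only difference from the linear regression version is the diagonal term `s / (s**2 + reg_param)`. As a sanity check of that formula, the following pure-Ruby sketch compares it with the normal-equation solution in the simplest possible case: one feature and no bias term, where the thin SVD can be written by hand. The function names are hypothetical and the example does not use Numo::Linalg.

```ruby
# Sketch only: single-feature ridge regression without a bias term.

# Normal-equation form: w = x^T y / (x^T x + reg_param).
def ridge_weight_normal_equation(x, y, reg_param)
  xty = x.zip(y).sum { |a, b| a * b }
  xtx = x.sum { |a| a * a }
  xty / (xtx + reg_param)
end

# SVD form. For an n x 1 matrix the thin SVD is known in closed form:
# s = ||x||, u = x / ||x||, v = [1].
def ridge_weight_svd(x, y, reg_param)
  s = Math.sqrt(x.sum { |a| a * a })
  u = x.map { |a| a / s }
  d = s / (s**2 + reg_param)         # mirrors (s / (s**2 + reg_param)).diag
  d * u.zip(y).sum { |a, b| a * b }  # v * d * u^T * y with v = 1
end

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w_normal = ridge_weight_normal_equation(x, y, 0.1)
w_svd = ridge_weight_svd(x, y, 0.1)
w_unregularized = ridge_weight_svd(x, y, 0.0)
```

Both forms agree up to floating-point rounding, and setting `reg_param` to zero recovers the unregularized least squares weight, which for this data is 2.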
data/lib/rumale/pairwise_metric.rb
CHANGED
@@ -18,6 +18,24 @@ module Rumale
|
|
18
18
|
Numo::NMath.sqrt(squared_error(x, y).abs)
|
19
19
|
end
|
20
20
|
|
21
|
+
# Calculate the pairwise manhattan distances between x and y.
|
22
|
+
#
|
23
|
+
# @param x [Numo::DFloat] (shape: [n_samples_x, n_features])
|
24
|
+
# @param y [Numo::DFloat] (shape: [n_samples_y, n_features])
|
25
|
+
# @return [Numo::DFloat] (shape: [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y] if y is given)
|
26
|
+
def manhattan_distance(x, y = nil)
|
27
|
+
y = x if y.nil?
|
28
|
+
Rumale::Validation.check_sample_array(x)
|
29
|
+
Rumale::Validation.check_sample_array(y)
|
30
|
+
n_samples_x = x.shape[0]
|
31
|
+
n_samples_y = y.shape[0]
|
32
|
+
distance_mat = Numo::DFloat.zeros(n_samples_x, n_samples_y)
|
33
|
+
n_samples_x.times do |n|
|
34
|
+
distance_mat[n, true] = (y - x[n, true]).abs.sum(axis: 1)
|
35
|
+
end
|
36
|
+
distance_mat
|
37
|
+
end
|
38
|
+
|
21
39
|
# Calculate the pairwise squared errors between x and y.
|
22
40
|
#
|
23
41
|
# @param x [Numo::DFloat] (shape: [n_samples_x, n_features])
|
@@ -27,11 +45,6 @@ module Rumale
|
|
27
45
|
y = x if y.nil?
|
28
46
|
Rumale::Validation.check_sample_array(x)
|
29
47
|
Rumale::Validation.check_sample_array(y)
|
30
|
-
# sum_x_vec = (x**2).sum(1)
|
31
|
-
# sum_y_vec = (y**2).sum(1)
|
32
|
-
# dot_xy_mat = x.dot(y.transpose)
|
33
|
-
# dot_xy_mat * -2.0 + sum_x_vec.tile(y.shape[0], 1).transpose + sum_y_vec.tile(x.shape[0], 1)
|
34
|
-
#
|
35
48
|
n_features = x.shape[1]
|
36
49
|
one_vec = Numo::DFloat.ones(n_features).expand_dims(1)
|
37
50
|
sum_x_vec = (x**2).dot(one_vec)
|
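The `manhattan_distance` function added above fills one row of the result per sample in x by summing absolute coordinate differences against every row of y. A minimal pure-Ruby sketch of the same computation on nested Arrays is given below. It is an illustration only: the free function here is not the gem's `Rumale::PairwiseMetric.manhattan_distance`, and it skips the input validation the gem performs.

```ruby
# Sketch only: pairwise Manhattan (L1) distances on plain nested Arrays.
# Returns an Array of shape [x.size][y.size]; y defaults to x.
def manhattan_distance(x, y = nil)
  y = x if y.nil?
  x.map do |row_x|
    y.map { |row_y| row_x.zip(row_y).sum { |a, b| (a - b).abs } }
  end
end

x = [[0.0, 0.0], [1.0, 2.0]]
y = [[3.0, 4.0]]
self_distances = manhattan_distance(x)
cross_distances = manhattan_distance(x, y)
```

When y is omitted the result is square and symmetric with zeros on the diagonal, which is the shape a 'precomputed' metric expects.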
data/lib/rumale/version.rb
CHANGED
data/rumale.gemspec
CHANGED
@@ -19,7 +19,7 @@ Gem::Specification.new do |spec|
|
|
19
19
|
Rumale currently supports Linear / Kernel Support Vector Machine,
|
20
20
|
Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
|
21
21
|
Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
|
22
|
-
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, Power Iteration Clustering,
|
22
|
+
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, SNN, Power Iteration Clustering,
|
23
23
|
Multidimensional Scaling, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
|
24
24
|
MSG
|
25
25
|
spec.homepage = 'https://github.com/yoshoku/rumale'
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: rumale
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.13.
|
4
|
+
version: 0.13.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- yoshoku
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2019-
|
11
|
+
date: 2019-09-01 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: numo-narray
|
@@ -128,7 +128,7 @@ description: |
|
|
128
128
|
Rumale currently supports Linear / Kernel Support Vector Machine,
|
129
129
|
Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
|
130
130
|
Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
|
131
|
-
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, Power Iteration Clustering,
|
131
|
+
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, SNN, Power Iteration Clustering,
|
132
132
|
Multidimensional Scaling, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
|
133
133
|
email:
|
134
134
|
- yoshoku@outlook.com
|
@@ -166,6 +166,7 @@ files:
|
|
166
166
|
- lib/rumale/clustering/k_means.rb
|
167
167
|
- lib/rumale/clustering/k_medoids.rb
|
168
168
|
- lib/rumale/clustering/power_iteration.rb
|
169
|
+
- lib/rumale/clustering/snn.rb
|
169
170
|
- lib/rumale/dataset.rb
|
170
171
|
- lib/rumale/decomposition/nmf.rb
|
171
172
|
- lib/rumale/decomposition/pca.rb
|