rumale 0.13.0 → 0.13.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +2 -2
- data/lib/rumale.rb +1 -0
- data/lib/rumale/clustering/dbscan.rb +25 -13
- data/lib/rumale/clustering/k_medoids.rb +2 -2
- data/lib/rumale/clustering/snn.rb +76 -0
- data/lib/rumale/decomposition/pca.rb +2 -1
- data/lib/rumale/linear_model/linear_regression.rb +54 -15
- data/lib/rumale/linear_model/ridge.rb +57 -17
- data/lib/rumale/pairwise_metric.rb +18 -5
- data/lib/rumale/version.rb +1 -1
- data/rumale.gemspec +1 -1
- metadata +4 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: ce88d7170fd676377227427a0be90f8bdb1a9c97
|
4
|
+
data.tar.gz: 04f0d07e6d098768eda726fc82f864420678e427
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 203444f0e7d833946f67c2ee922e02a48b7174c20eac84480e190f8749e150e0c5ed18e3d7b7d30480e565483b5a5b51d1990cced7e09b5db027d8c508fa4313
|
7
|
+
data.tar.gz: e608c97fc0d29c018c778f9cc96cd53b0edff927c5631bd3b0cb606ee93f4e8c647ed2c76e7835b49f1933c0e5aeccb1ffbda4fe9aec59a2689f7bde4a28e103
|
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,11 @@
|
|
1
|
+
# 0.13.1
|
2
|
+
- Add class for Shared Nearest Neighbor clustering.
|
3
|
+
- Add function for calculation of manhattan distance to Rumale::PairwiseMetric.
|
4
|
+
- Add metric parameter that specifies distance metric to Rumale::Clustering::DBSCAN.
|
5
|
+
- Add the solver parameter that specifies the optimization algorithm to Rumale::LinearModel::LinearRegression.
|
6
|
+
- Add the solver parameter that specifies the optimization algorithm to Rumale::LinearModel::Ridge.
|
7
|
+
- Fix bug that the ndim of NArray of 1-dimensional principal components is not 1.
|
8
|
+
|
1
9
|
# 0.13.0
|
2
10
|
- Introduce [Numo::Linalg](https://github.com/ruby-numo/numo-linalg) to use linear algebra algorithms on the optimization.
|
3
11
|
- Add the solver parameter that specifies the optimization algorithm to Rumale::Decomposition::PCA.
|
data/README.md
CHANGED
@@ -6,14 +6,14 @@
|
|
6
6
|
[![Coverage Status](https://coveralls.io/repos/github/yoshoku/rumale/badge.svg?branch=master)](https://coveralls.io/github/yoshoku/rumale?branch=master)
|
7
7
|
[![Gem Version](https://badge.fury.io/rb/rumale.svg)](https://badge.fury.io/rb/rumale)
|
8
8
|
[![BSD 2-Clause License](https://img.shields.io/badge/License-BSD%202--Clause-orange.svg)](https://github.com/yoshoku/rumale/blob/master/LICENSE.txt)
|
9
|
-
[![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](https://www.rubydoc.info/gems/rumale/0.13.
|
9
|
+
[![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](https://www.rubydoc.info/gems/rumale/0.13.1)
|
10
10
|
|
11
11
|
Rumale (**Ru**by **ma**chine **le**arning) is a machine learning library in Ruby.
|
12
12
|
Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
|
13
13
|
Rumale supports Linear / Kernel Support Vector Machine,
|
14
14
|
Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
|
15
15
|
Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor classifier,
|
16
|
-
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, Power Iteration Clustering,
|
16
|
+
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, SNN, Power Iteration Clustering,
|
17
17
|
Multidimensional Scaling, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
|
18
18
|
|
19
19
|
This project was formerly known as "SVMKit".
|
data/lib/rumale.rb
CHANGED
@@ -60,6 +60,7 @@ require 'rumale/clustering/k_means'
|
|
60
60
|
require 'rumale/clustering/k_medoids'
|
61
61
|
require 'rumale/clustering/gaussian_mixture'
|
62
62
|
require 'rumale/clustering/dbscan'
|
63
|
+
require 'rumale/clustering/snn'
|
63
64
|
require 'rumale/clustering/power_iteration'
|
64
65
|
require 'rumale/decomposition/pca'
|
65
66
|
require 'rumale/decomposition/nmf'
|
data/lib/rumale/clustering/dbscan.rb
CHANGED
@@ -7,7 +7,6 @@ require 'rumale/pairwise_metric'
|
|
7
7
|
module Rumale
|
8
8
|
module Clustering
|
9
9
|
# DBSCAN is a class that implements DBSCAN cluster analysis.
|
10
|
-
# The current implementation uses the Euclidean distance for analyzing the clusters.
|
11
10
|
#
|
12
11
|
# @example
|
13
12
|
# analyzer = Rumale::Clustering::DBSCAN.new(eps: 0.5, min_samples: 5)
|
@@ -31,12 +30,17 @@ module Rumale
|
|
31
30
|
#
|
32
31
|
# @param eps [Float] The radius of neighborhood.
|
33
32
|
# @param min_samples [Integer] The number of neighbor samples to be used for the criterion whether a point is a core point.
|
34
|
-
|
33
|
+
# @param metric [String] The metric to calculate the distances.
|
34
|
+
# If metric is 'euclidean', Euclidean distance is calculated for distance between points.
|
35
|
+
# If metric is 'precomputed', the fit and fit_predict methods expect to be given a distance matrix.
|
36
|
+
def initialize(eps: 0.5, min_samples: 5, metric: 'euclidean')
|
35
37
|
check_params_float(eps: eps)
|
36
38
|
check_params_integer(min_samples: min_samples)
|
39
|
+
check_params_string(metric: metric)
|
37
40
|
@params = {}
|
38
41
|
@params[:eps] = eps
|
39
42
|
@params[:min_samples] = min_samples
|
43
|
+
@params[:metric] = metric == 'precomputed' ? 'precomputed' : 'euclidean'
|
40
44
|
@core_sample_ids = nil
|
41
45
|
@labels = nil
|
42
46
|
end
|
@@ -46,19 +50,23 @@ module Rumale
|
|
46
50
|
# @overload fit(x) -> DBSCAN
|
47
51
|
#
|
48
52
|
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for cluster analysis.
|
53
|
+
# If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
|
49
54
|
# @return [DBSCAN] The learned cluster analyzer itself.
|
50
55
|
def fit(x, _y = nil)
|
51
56
|
check_sample_array(x)
|
57
|
+
raise ArgumentError, 'Expect the input distance matrix to be square.' if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
|
52
58
|
partial_fit(x)
|
53
59
|
self
|
54
60
|
end
|
55
61
|
|
56
62
|
# Analysis clusters and assign samples to clusters.
|
57
63
|
#
|
58
|
-
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The
|
64
|
+
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The samples to be used for cluster analysis.
|
65
|
+
# If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
|
59
66
|
# @return [Numo::Int32] (shape: [n_samples]) Predicted cluster label per sample.
|
60
67
|
def fit_predict(x)
|
61
68
|
check_sample_array(x)
|
69
|
+
raise ArgumentError, 'Expect the input distance matrix to be square.' if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
|
62
70
|
partial_fit(x)
|
63
71
|
labels
|
64
72
|
end
|
@@ -84,19 +92,24 @@ module Rumale
|
|
84
92
|
|
85
93
|
def partial_fit(x)
|
86
94
|
cluster_id = 0
|
87
|
-
|
95
|
+
metric_mat = calc_pairwise_metrics(x)
|
96
|
+
n_samples = metric_mat.shape[0]
|
88
97
|
@core_sample_ids = []
|
89
98
|
@labels = Numo::Int32.zeros(n_samples) - 2
|
90
|
-
n_samples.times do |
|
91
|
-
next if @labels[
|
92
|
-
cluster_id += 1 if expand_cluster(
|
99
|
+
n_samples.times do |query_id|
|
100
|
+
next if @labels[query_id] >= -1
|
101
|
+
cluster_id += 1 if expand_cluster(metric_mat, query_id, cluster_id)
|
93
102
|
end
|
94
103
|
@core_sample_ids = Numo::Int32[*@core_sample_ids.flatten]
|
95
104
|
nil
|
96
105
|
end
|
97
106
|
|
98
|
-
def
|
99
|
-
|
107
|
+
def calc_pairwise_metrics(x)
|
108
|
+
@params[:metric] == 'precomputed' ? x : Rumale::PairwiseMetric.euclidean_distance(x)
|
109
|
+
end
|
110
|
+
|
111
|
+
def expand_cluster(metric_mat, query_id, cluster_id)
|
112
|
+
target_ids = region_query(metric_mat[query_id, true])
|
100
113
|
if target_ids.size < @params[:min_samples]
|
101
114
|
@labels[query_id] = -1
|
102
115
|
false
|
@@ -105,7 +118,7 @@ module Rumale
|
|
105
118
|
@core_sample_ids.push(target_ids.dup)
|
106
119
|
target_ids.delete(query_id)
|
107
120
|
while (m = target_ids.shift)
|
108
|
-
neighbor_ids = region_query(
|
121
|
+
neighbor_ids = region_query(metric_mat[m, true])
|
109
122
|
next if neighbor_ids.size < @params[:min_samples]
|
110
123
|
neighbor_ids.each do |n|
|
111
124
|
target_ids.push(n) if @labels[n] < -1
|
@@ -116,9 +129,8 @@ module Rumale
|
|
116
129
|
end
|
117
130
|
end
|
118
131
|
|
119
|
-
def region_query(
|
120
|
-
|
121
|
-
distance_arr.lt(@params[:eps]).where.to_a
|
132
|
+
def region_query(metric_arr)
|
133
|
+
metric_arr.lt(@params[:eps]).where.to_a
|
122
134
|
end
|
123
135
|
end
|
124
136
|
end
|
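The `metric: 'precomputed'` path added to DBSCAN above can be sketched in plain Ruby. This is only an illustration under stated assumptions: nested arrays stand in for `Numo::DFloat`, and `check_square!` is a hypothetical helper name, not part of Rumale. It mirrors the square-matrix guard added to `fit` / `fit_predict` and the strict less-than comparison in `region_query`.

```ruby
# Illustrative sketch: plain Ruby arrays stand in for Numo::DFloat,
# and check_square! is a hypothetical name (not a Rumale method).

# Mirrors the guard added to fit / fit_predict for precomputed input.
def check_square!(dist)
  n = dist.size
  return if dist.all? { |row| row.size == n }

  raise ArgumentError, 'Expect the input distance matrix to be square.'
end

# Mirrors region_query: indices whose distance to the query point is
# strictly below eps (the query point itself is included, distance 0).
def region_query(dist_row, eps)
  dist_row.each_index.select { |i| dist_row[i] < eps }
end

dist = [
  [0.0, 0.3, 2.0],
  [0.3, 0.0, 1.8],
  [2.0, 1.8, 0.0]
]

check_square!(dist)
neighbors = region_query(dist[0], 0.5)
puts neighbors.inspect
```

With `eps: 0.5`, only the first two samples fall inside each other's neighborhood; the third sample has no neighbor other than itself.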
data/lib/rumale/clustering/k_medoids.rb
CHANGED
@@ -29,8 +29,8 @@ module Rumale
|
|
29
29
|
# Create a new cluster analyzer with K-Medoids method.
|
30
30
|
#
|
31
31
|
# @param n_clusters [Integer] The number of clusters.
|
32
|
-
# @param metric [String] The metric to calculate the distances
|
33
|
-
# If metric is 'euclidean', Euclidean distance is calculated for distance
|
32
|
+
# @param metric [String] The metric to calculate the distances.
|
33
|
+
# If metric is 'euclidean', Euclidean distance is calculated for distance between points.
|
34
34
|
# If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
|
35
35
|
# @param init [String] The initialization method for centroids ('random' or 'k-means++').
|
36
36
|
# @param max_iter [Integer] The maximum number of iterations.
|
data/lib/rumale/clustering/snn.rb
ADDED
@@ -0,0 +1,76 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require 'rumale/pairwise_metric'
|
4
|
+
require 'rumale/clustering/dbscan'
|
5
|
+
|
6
|
+
module Rumale
|
7
|
+
module Clustering
|
8
|
+
# SNN is a class that implements Shared Nearest Neighbor cluster analysis.
|
9
|
+
# The SNN method is a variation of DBSCAN that uses similarity based on k-nearest neighbors as a metric.
|
10
|
+
#
|
11
|
+
# @example
|
12
|
+
# analyzer = Rumale::Clustering::SNN.new(n_neighbors: 10, eps: 5, min_samples: 5)
|
13
|
+
# cluster_labels = analyzer.fit_predict(samples)
|
14
|
+
#
|
15
|
+
# *Reference*
|
16
|
+
# - L. Ertoz, M. Steinbach, and V. Kumar, "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data," Proc. SDM'03, pp. 47--58, 2003.
|
17
|
+
# - M E. Houle, H-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek, "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?," Proc. SSDBM'10, pp. 482--500, 2010.
|
18
|
+
class SNN < DBSCAN
|
19
|
+
# Create a new cluster analyzer with Shared Nearest Neighbor method.
|
20
|
+
#
|
21
|
+
# @param n_neighbors [Integer] The number of neighbors to be used for finding k-nearest neighbors.
|
22
|
+
# @param eps [Integer] The threshold value for finding connected components based on similarity.
|
23
|
+
# @param min_samples [Integer] The number of neighbor samples to be used for the criterion whether a point is a core point.
|
24
|
+
# @param metric [String] The metric to calculate the distances.
|
25
|
+
# If metric is 'euclidean', Euclidean distance is calculated for distance between points.
|
26
|
+
# If metric is 'precomputed', the fit and fit_predict methods expect to be given a distance matrix.
|
27
|
+
def initialize(n_neighbors: 10, eps: 5, min_samples: 5, metric: 'euclidean')
|
28
|
+
check_params_integer(n_neighbors: n_neighbors, min_samples: min_samples)
|
29
|
+
check_params_string(metric: metric)
|
30
|
+
@params = {}
|
31
|
+
@params[:n_neighbors] = n_neighbors
|
32
|
+
@params[:eps] = eps
|
33
|
+
@params[:min_samples] = min_samples
|
34
|
+
@params[:metric] = metric == 'precomputed' ? 'precomputed' : 'euclidean'
|
35
|
+
@core_sample_ids = nil
|
36
|
+
@labels = nil
|
37
|
+
end
|
38
|
+
|
39
|
+
# Analysis clusters with given training data.
|
40
|
+
#
|
41
|
+
# @overload fit(x) -> SNN
|
42
|
+
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for cluster analysis.
|
43
|
+
# If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
|
44
|
+
# @return [SNN] The learned cluster analyzer itself.
|
45
|
+
def fit(x, _y = nil)
|
46
|
+
super
|
47
|
+
end
|
48
|
+
|
49
|
+
# Analysis clusters and assign samples to clusters.
|
50
|
+
#
|
51
|
+
# @param x [Numo::DFloat] (shape: [n_samples, n_features]) The samples to be used for cluster analysis.
|
52
|
+
# If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
|
53
|
+
# @return [Numo::Int32] (shape: [n_samples]) Predicted cluster label per sample.
|
54
|
+
def fit_predict(x)
|
55
|
+
super
|
56
|
+
end
|
57
|
+
|
58
|
+
private
|
59
|
+
|
60
|
+
def calc_pairwise_metrics(x)
|
61
|
+
distance_mat = @params[:metric] == 'precomputed' ? x : Rumale::PairwiseMetric.euclidean_distance(x)
|
62
|
+
n_samples = distance_mat.shape[0]
|
63
|
+
adjacency_mat = Numo::DFloat.zeros(n_samples, n_samples)
|
64
|
+
n_samples.times do |n|
|
65
|
+
neighbor_ids = distance_mat[n, true].sort_index[0...@params[:n_neighbors]]
|
66
|
+
adjacency_mat[n, neighbor_ids] = 1
|
67
|
+
end
|
68
|
+
adjacency_mat.dot(adjacency_mat.transpose)
|
69
|
+
end
|
70
|
+
|
71
|
+
def region_query(similarity_arr)
|
72
|
+
similarity_arr.gt(@params[:eps]).where.to_a
|
73
|
+
end
|
74
|
+
end
|
75
|
+
end
|
76
|
+
end
|
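The similarity that `calc_pairwise_metrics` builds in snn.rb (a 0/1 k-nearest-neighbor adjacency matrix multiplied by its transpose) is the number of neighbors two samples share. A minimal sketch in plain Ruby follows, assuming nested arrays instead of `Numo::DFloat`; the helper names `knn_ids` and `snn_similarity` are hypothetical and only used for illustration.

```ruby
# Illustrative sketch: knn_ids and snn_similarity are hypothetical names.

# Indices of the k smallest distances per row. Each sample counts as its own
# neighbor because its self-distance of 0 sorts first.
def knn_ids(dist, k)
  dist.map { |row| row.each_index.sort_by { |i| row[i] }.first(k) }
end

# Entry [i][j] is the size of the intersection of the two neighbor lists,
# which equals (A * A^T)[i][j] for the 0/1 adjacency matrix A.
def snn_similarity(dist, k)
  ids = knn_ids(dist, k)
  ids.map { |a| ids.map { |b| (a & b).size } }
end

# Four points on a line, chosen so that no two distances in a row tie.
points = [0.0, 1.0, 3.0, 10.0]
dist = points.map { |a| points.map { |b| (a - b).abs } }

sim = snn_similarity(dist, 2)
sim.each { |row| puts row.inspect }

# SNN's region_query keeps samples whose similarity is strictly above eps.
close_to_first = sim[0].each_index.select { |j| sim[0][j] > 1 }
puts close_to_first.inspect
```

Because the measure is a similarity rather than a distance, the comparison flips from `lt` in DBSCAN to `gt` in SNN, which is why the subclass overrides `region_query`.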
data/lib/rumale/decomposition/pca.rb
CHANGED
@@ -80,7 +80,8 @@ module Rumale
|
|
80
80
|
covariance_mat = centered_x.transpose.dot(centered_x) / (n_samples - 1)
|
81
81
|
if @params[:solver] == 'evd' && enable_linalg?
|
82
82
|
_, evecs = Numo::Linalg.eigh(covariance_mat, vals_range: (n_features - @params[:n_components])...n_features)
|
83
|
-
|
83
|
+
comps = evecs.reverse(1).transpose
|
84
|
+
@components = @params[:n_components] == 1 ? comps[0, true].dup : comps.dup
|
84
85
|
else
|
85
86
|
@params[:n_components].times do
|
86
87
|
comp_vec = Rumale::Utils.rand_uniform(n_features, sub_rng)
|
data/lib/rumale/linear_model/linear_regression.rb
CHANGED
@@ -6,7 +6,7 @@ require 'rumale/base/regressor'
|
|
6
6
|
module Rumale
|
7
7
|
module LinearModel
|
8
8
|
# LinearRegression is a class that implements ordinary least square linear regression
|
9
|
-
# with mini-batch stochastic gradient descent optimization.
|
9
|
+
# with mini-batch stochastic gradient descent optimization or singular value decomposition.
|
10
10
|
#
|
11
11
|
# @example
|
12
12
|
# estimator =
|
@@ -14,6 +14,11 @@ module Rumale
|
|
14
14
|
# estimator.fit(training_samples, traininig_values)
|
15
15
|
# results = estimator.predict(testing_samples)
|
16
16
|
#
|
17
|
+
# # If Numo::Linalg is installed, you can specify 'svd' for the solver option.
|
18
|
+
# require 'numo/linalg/autoloader'
|
19
|
+
# estimator = Rumale::LinearModel::LinearRegression.new(solver: 'svd')
|
20
|
+
# estimator.fit(training_samples, traininig_values)
|
21
|
+
# results = estimator.predict(testing_samples)
|
17
22
|
class LinearRegression < BaseLinearModel
|
18
23
|
include Base::Regressor
|
19
24
|
|
@@ -34,23 +39,32 @@ module Rumale
|
|
34
39
|
# @param fit_bias [Boolean] The flag indicating whether to fit the bias term.
|
35
40
|
# @param bias_scale [Float] The scale of the bias term.
|
36
41
|
# @param max_iter [Integer] The maximum number of iterations.
|
42
|
+
# If solver = 'svd', this parameter is ignored.
|
37
43
|
# @param batch_size [Integer] The size of the mini batches.
|
44
|
+
# If solver = 'svd', this parameter is ignored.
|
38
45
|
# @param optimizer [Optimizer] The optimizer to calculate adaptive learning rate.
|
39
46
|
# If nil is given, Nadam is used.
|
47
|
+
# If solver = 'svd', this parameter is ignored.
|
48
|
+
# @param solver [String] The algorithm to calculate weights. ('sgd' or 'svd').
|
49
|
+
# 'sgd' uses the stochastic gradient descent optimization.
|
50
|
+
# 'svd' performs singular value decomposition of samples.
|
40
51
|
# @param n_jobs [Integer] The number of jobs for running the fit method in parallel.
|
41
52
|
# If nil is given, the method does not execute in parallel.
|
42
53
|
# If zero or less is given, it becomes equal to the number of processors.
|
43
54
|
# This parameter is ignored if the Parallel gem is not loaded.
|
44
55
|
# @param random_seed [Integer] The seed value using to initialize the random generator.
|
45
56
|
def initialize(fit_bias: false, bias_scale: 1.0, max_iter: 1000, batch_size: 10, optimizer: nil,
|
46
|
-
n_jobs: nil, random_seed: nil)
|
57
|
+
solver: 'sgd', n_jobs: nil, random_seed: nil)
|
47
58
|
check_params_float(bias_scale: bias_scale)
|
48
59
|
check_params_integer(max_iter: max_iter, batch_size: batch_size)
|
49
60
|
check_params_boolean(fit_bias: fit_bias)
|
61
|
+
check_params_string(solver: solver)
|
50
62
|
check_params_type_or_nil(Integer, n_jobs: n_jobs, random_seed: random_seed)
|
51
63
|
check_params_positive(max_iter: max_iter, batch_size: batch_size)
|
52
64
|
keywd_args = method(:initialize).parameters.map { |_t, arg| [arg, binding.local_variable_get(arg)] }.to_h.merge(reg_param: 0.0)
|
65
|
+
keywd_args.delete(:solver)
|
53
66
|
super(keywd_args)
|
67
|
+
@params[:solver] = solver != 'svd' ? 'sgd' : 'svd'
|
54
68
|
end
|
55
69
|
|
56
70
|
# Fit the model with given training data.
|
@@ -63,20 +77,10 @@ module Rumale
|
|
63
77
|
check_tvalue_array(y)
|
64
78
|
check_sample_tvalue_size(x, y)
|
65
79
|
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
if n_outputs > 1
|
70
|
-
@weight_vec = Numo::DFloat.zeros(n_outputs, n_features)
|
71
|
-
@bias_term = Numo::DFloat.zeros(n_outputs)
|
72
|
-
if enable_parallel?
|
73
|
-
models = parallel_map(n_outputs) { |n| partial_fit(x, y[true, n]) }
|
74
|
-
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = models[n] }
|
75
|
-
else
|
76
|
-
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = partial_fit(x, y[true, n]) }
|
77
|
-
end
|
80
|
+
if @params[:solver] == 'svd' && enable_linalg?
|
81
|
+
fit_svd(x, y)
|
78
82
|
else
|
79
|
-
|
83
|
+
fit_sgd(x, y)
|
80
84
|
end
|
81
85
|
|
82
86
|
self
|
@@ -112,6 +116,41 @@ module Rumale
|
|
112
116
|
|
113
117
|
private
|
114
118
|
|
119
|
+
def fit_svd(x, y)
|
120
|
+
samples = @params[:fit_bias] ? expand_feature(x) : x
|
121
|
+
|
122
|
+
s, u, vt = Numo::Linalg.svd(samples, driver: 'sdd', job: 'S')
|
123
|
+
d = (s / s**2).diag
|
124
|
+
w = vt.transpose.dot(d).dot(u.transpose).dot(y)
|
125
|
+
|
126
|
+
is_single_target_vals = y.shape[1].nil?
|
127
|
+
if @params[:fit_bias]
|
128
|
+
@weight_vec = is_single_target_vals ? w[0...-1].dup : w[0...-1, true].dup
|
129
|
+
@bias_term = is_single_target_vals ? w[-1] : w[-1, true].dup
|
130
|
+
else
|
131
|
+
@weight_vec = w.dup
|
132
|
+
@bias_term = is_single_target_vals ? 0 : Numo::DFloat.zeros(y.shape[1])
|
133
|
+
end
|
134
|
+
end
|
135
|
+
|
136
|
+
def fit_sgd(x, y)
|
137
|
+
n_outputs = y.shape[1].nil? ? 1 : y.shape[1]
|
138
|
+
n_features = x.shape[1]
|
139
|
+
|
140
|
+
if n_outputs > 1
|
141
|
+
@weight_vec = Numo::DFloat.zeros(n_outputs, n_features)
|
142
|
+
@bias_term = Numo::DFloat.zeros(n_outputs)
|
143
|
+
if enable_parallel?
|
144
|
+
models = parallel_map(n_outputs) { |n| partial_fit(x, y[true, n]) }
|
145
|
+
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = models[n] }
|
146
|
+
else
|
147
|
+
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = partial_fit(x, y[true, n]) }
|
148
|
+
end
|
149
|
+
else
|
150
|
+
@weight_vec, @bias_term = partial_fit(x, y)
|
151
|
+
end
|
152
|
+
end
|
153
|
+
|
115
154
|
def calc_loss_gradient(x, y, weight)
|
116
155
|
2.0 * (x.dot(weight) - y)
|
117
156
|
end
|
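For reference, the `fit_svd` method added to LinearRegression corresponds to the pseudo-inverse solution of ordinary least squares. The derivation below is a sketch that assumes every singular value $s_i$ is strictly positive (the code divides by $s_i^2$, so a zero singular value would not be handled); the symbols $X$, $U$, $\Sigma$, $V$ and $\hat{w}$ are notation introduced here, matching `samples`, `u`, `s`, `vt.transpose` and `w` in the code.

```latex
\begin{align}
X &= U \Sigma V^{\top}, \qquad \Sigma = \operatorname{diag}(s_1, \dots, s_r) \\
D &= \operatorname{diag}\!\left(\frac{s_1}{s_1^{2}}, \dots, \frac{s_r}{s_r^{2}}\right)
   = \operatorname{diag}\!\left(\frac{1}{s_1}, \dots, \frac{1}{s_r}\right) = \Sigma^{-1} \\
\hat{w} &= V D\, U^{\top} y = X^{+} y
\end{align}
```

When `fit_bias` is true, $X$ is the expanded feature matrix, so the last entry (or last row, for multiple targets) of $\hat{w}$ is split off as the bias term.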
data/lib/rumale/linear_model/ridge.rb
CHANGED
@@ -6,7 +6,7 @@ require 'rumale/base/regressor'
|
|
6
6
|
module Rumale
|
7
7
|
module LinearModel
|
8
8
|
# Ridge is a class that implements Ridge Regression
|
9
|
-
# with mini-batch stochastic gradient descent optimization.
|
9
|
+
# with mini-batch stochastic gradient descent optimization or singular value decomposition.
|
10
10
|
#
|
11
11
|
# @example
|
12
12
|
# estimator =
|
@@ -14,6 +14,11 @@ module Rumale
|
|
14
14
|
# estimator.fit(training_samples, traininig_values)
|
15
15
|
# results = estimator.predict(testing_samples)
|
16
16
|
#
|
17
|
+
# # If Numo::Linalg is installed, you can specify 'svd' for the solver option.
|
18
|
+
# require 'numo/linalg/autoloader'
|
19
|
+
# estimator = Rumale::LinearModel::Ridge.new(reg_param: 0.1, solver: 'svd')
|
20
|
+
# estimator.fit(training_samples, traininig_values)
|
21
|
+
# results = estimator.predict(testing_samples)
|
17
22
|
class Ridge < BaseLinearModel
|
18
23
|
include Base::Regressor
|
19
24
|
|
@@ -35,22 +40,32 @@ module Rumale
|
|
35
40
|
# @param fit_bias [Boolean] The flag indicating whether to fit the bias term.
|
36
41
|
# @param bias_scale [Float] The scale of the bias term.
|
37
42
|
# @param max_iter [Integer] The maximum number of iterations.
|
43
|
+
# If solver = 'svd', this parameter is ignored.
|
38
44
|
# @param batch_size [Integer] The size of the mini batches.
|
45
|
+
# If solver = 'svd', this parameter is ignored.
|
39
46
|
# @param optimizer [Optimizer] The optimizer to calculate adaptive learning rate.
|
40
47
|
# If nil is given, Nadam is used.
|
48
|
+
# If solver = 'svd', this parameter is ignored.
|
49
|
+
# @param solver [String] The algorithm to calculate weights. ('sgd' or 'svd').
|
50
|
+
# 'sgd' uses the stochastic gradient descent optimization.
|
51
|
+
# 'svd' performs singular value decomposition of samples.
|
41
52
|
# @param n_jobs [Integer] The number of jobs for running the fit method in parallel.
|
42
53
|
# If nil is given, the method does not execute in parallel.
|
43
54
|
# If zero or less is given, it becomes equal to the number of processors.
|
44
|
-
# This parameter is ignored if the Parallel gem is not loaded.
|
55
|
+
# This parameter is ignored if the Parallel gem is not loaded or the solver is 'svd'.
|
45
56
|
# @param random_seed [Integer] The seed value using to initialize the random generator.
|
46
57
|
def initialize(reg_param: 1.0, fit_bias: false, bias_scale: 1.0, max_iter: 1000, batch_size: 10, optimizer: nil,
|
47
|
-
n_jobs: nil, random_seed: nil)
|
58
|
+
solver: 'sgd', n_jobs: nil, random_seed: nil)
|
48
59
|
check_params_float(reg_param: reg_param, bias_scale: bias_scale)
|
49
60
|
check_params_integer(max_iter: max_iter, batch_size: batch_size)
|
50
61
|
check_params_boolean(fit_bias: fit_bias)
|
62
|
+
check_params_string(solver: solver)
|
51
63
|
check_params_type_or_nil(Integer, n_jobs: n_jobs, random_seed: random_seed)
|
52
64
|
check_params_positive(reg_param: reg_param, max_iter: max_iter, batch_size: batch_size)
|
53
|
-
|
65
|
+
keywd_args = method(:initialize).parameters.map { |_t, arg| [arg, binding.local_variable_get(arg)] }.to_h
|
66
|
+
keywd_args.delete(:solver)
|
67
|
+
super(keywd_args)
|
68
|
+
@params[:solver] = solver != 'svd' ? 'sgd' : 'svd'
|
54
69
|
end
|
55
70
|
|
56
71
|
# Fit the model with given training data.
|
@@ -63,20 +78,10 @@ module Rumale
|
|
63
78
|
check_tvalue_array(y)
|
64
79
|
check_sample_tvalue_size(x, y)
|
65
80
|
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
if n_outputs > 1
|
70
|
-
@weight_vec = Numo::DFloat.zeros(n_outputs, n_features)
|
71
|
-
@bias_term = Numo::DFloat.zeros(n_outputs)
|
72
|
-
if enable_parallel?
|
73
|
-
models = parallel_map(n_outputs) { |n| partial_fit(x, y[true, n]) }
|
74
|
-
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = models[n] }
|
75
|
-
else
|
76
|
-
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = partial_fit(x, y[true, n]) }
|
77
|
-
end
|
81
|
+
if @params[:solver] == 'svd' && enable_linalg?
|
82
|
+
fit_svd(x, y)
|
78
83
|
else
|
79
|
-
|
84
|
+
fit_sgd(x, y)
|
80
85
|
end
|
81
86
|
|
82
87
|
self
|
@@ -112,6 +117,41 @@ module Rumale
|
|
112
117
|
|
113
118
|
private
|
114
119
|
|
120
|
+
def fit_svd(x, y)
|
121
|
+
samples = @params[:fit_bias] ? expand_feature(x) : x
|
122
|
+
|
123
|
+
s, u, vt = Numo::Linalg.svd(samples, driver: 'sdd', job: 'S')
|
124
|
+
d = (s / (s**2 + @params[:reg_param])).diag
|
125
|
+
w = vt.transpose.dot(d).dot(u.transpose).dot(y)
|
126
|
+
|
127
|
+
is_single_target_vals = y.shape[1].nil?
|
128
|
+
if @params[:fit_bias]
|
129
|
+
@weight_vec = is_single_target_vals ? w[0...-1].dup : w[0...-1, true].dup
|
130
|
+
@bias_term = is_single_target_vals ? w[-1] : w[-1, true].dup
|
131
|
+
else
|
132
|
+
@weight_vec = w.dup
|
133
|
+
@bias_term = is_single_target_vals ? 0 : Numo::DFloat.zeros(y.shape[1])
|
134
|
+
end
|
135
|
+
end
|
136
|
+
|
137
|
+
def fit_sgd(x, y)
|
138
|
+
n_outputs = y.shape[1].nil? ? 1 : y.shape[1]
|
139
|
+
n_features = x.shape[1]
|
140
|
+
|
141
|
+
if n_outputs > 1
|
142
|
+
@weight_vec = Numo::DFloat.zeros(n_outputs, n_features)
|
143
|
+
@bias_term = Numo::DFloat.zeros(n_outputs)
|
144
|
+
if enable_parallel?
|
145
|
+
models = parallel_map(n_outputs) { |n| partial_fit(x, y[true, n]) }
|
146
|
+
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = models[n] }
|
147
|
+
else
|
148
|
+
n_outputs.times { |n| @weight_vec[n, true], @bias_term[n] = partial_fit(x, y[true, n]) }
|
149
|
+
end
|
150
|
+
else
|
151
|
+
@weight_vec, @bias_term = partial_fit(x, y)
|
152
|
+
end
|
153
|
+
end
|
154
|
+
|
115
155
|
def calc_loss_gradient(x, y, weight)
|
116
156
|
2.0 * (x.dot(weight) - y)
|
117
157
|
end
|
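The only difference between Ridge's `fit_svd` and the unregularized version is the factor `s / (s**2 + reg_param)`. A single-feature case makes the effect easy to check by hand, because an n x 1 matrix has exactly one singular value (the Euclidean norm of the column) and its right singular vector is 1. The sketch below is plain Ruby under that assumption; `ridge_weight_1d` is a hypothetical helper, not a Rumale method.

```ruby
# Illustrative sketch for one feature and no bias term.
# ridge_weight_1d is a hypothetical name, not part of Rumale.
def ridge_weight_1d(x, y, reg_param)
  s = Math.sqrt(x.sum { |v| v * v })      # the single singular value, ||x||
  u = x.map { |v| v / s }                 # left singular vector, x / ||x||
  d = s / (s**2 + reg_param)              # same factor as in Ridge#fit_svd
  d * u.zip(y).sum { |ui, yi| ui * yi }   # w = v * d * (u^T y) with v = 1
end

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0] # y = 2 * x exactly

w_plain = ridge_weight_1d(x, y, 0.0)   # no regularization: least-squares slope
w_ridge = ridge_weight_1d(x, y, 14.0)  # reg_param equal to x.x = 14

puts w_plain
puts w_ridge
```

Here x·x = 14 and x·y = 28, so the weight is 28 / (14 + reg_param): roughly 2 without regularization and roughly 1 when `reg_param` equals x·x, up to floating-point rounding.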
data/lib/rumale/pairwise_metric.rb
CHANGED
@@ -18,6 +18,24 @@ module Rumale
|
|
18
18
|
Numo::NMath.sqrt(squared_error(x, y).abs)
|
19
19
|
end
|
20
20
|
|
21
|
+
# Calculate the pairwise manhattan distances between x and y.
|
22
|
+
#
|
23
|
+
# @param x [Numo::DFloat] (shape: [n_samples_x, n_features])
|
24
|
+
# @param y [Numo::DFloat] (shape: [n_samples_y, n_features])
|
25
|
+
# @return [Numo::DFloat] (shape: [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y] if y is given)
|
26
|
+
def manhattan_distance(x, y = nil)
|
27
|
+
y = x if y.nil?
|
28
|
+
Rumale::Validation.check_sample_array(x)
|
29
|
+
Rumale::Validation.check_sample_array(y)
|
30
|
+
n_samples_x = x.shape[0]
|
31
|
+
n_samples_y = y.shape[0]
|
32
|
+
distance_mat = Numo::DFloat.zeros(n_samples_x, n_samples_y)
|
33
|
+
n_samples_x.times do |n|
|
34
|
+
distance_mat[n, true] = (y - x[n, true]).abs.sum(axis: 1)
|
35
|
+
end
|
36
|
+
distance_mat
|
37
|
+
end
|
38
|
+
|
21
39
|
# Calculate the pairwise squared errors between x and y.
|
22
40
|
#
|
23
41
|
# @param x [Numo::DFloat] (shape: [n_samples_x, n_features])
|
@@ -27,11 +45,6 @@ module Rumale
|
|
27
45
|
y = x if y.nil?
|
28
46
|
Rumale::Validation.check_sample_array(x)
|
29
47
|
Rumale::Validation.check_sample_array(y)
|
30
|
-
# sum_x_vec = (x**2).sum(1)
|
31
|
-
# sum_y_vec = (y**2).sum(1)
|
32
|
-
# dot_xy_mat = x.dot(y.transpose)
|
33
|
-
# dot_xy_mat * -2.0 + sum_x_vec.tile(y.shape[0], 1).transpose + sum_y_vec.tile(x.shape[0], 1)
|
34
|
-
#
|
35
48
|
n_features = x.shape[1]
|
36
49
|
one_vec = Numo::DFloat.ones(n_features).expand_dims(1)
|
37
50
|
sum_x_vec = (x**2).dot(one_vec)
|
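What the new `manhattan_distance` computes can be checked with a small plain-Ruby equivalent. This is a sketch that assumes nested arrays in place of `Numo::DFloat` and skips the `check_sample_array` validation; it reuses the method name only for readability and is not the library implementation.

```ruby
# Illustrative plain-Ruby equivalent; rows are samples, columns are features.
# When y is omitted, distances are computed between the rows of x itself.
def manhattan_distance(x, y = nil)
  y = x if y.nil?
  x.map do |a|
    y.map { |b| a.zip(b).sum { |p, q| (p - q).abs } }
  end
end

x = [[0.0, 0.0], [1.0, 2.0], [3.0, -1.0]]
d = manhattan_distance(x)
d.each { |row| puts row.inspect }
```

The result has shape [n_samples_x, n_samples_y], with zeros on the diagonal when `y` is omitted, matching the `@return` description in the diff.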
data/lib/rumale/version.rb
CHANGED
data/rumale.gemspec
CHANGED
@@ -19,7 +19,7 @@ Gem::Specification.new do |spec|
|
|
19
19
|
Rumale currently supports Linear / Kernel Support Vector Machine,
|
20
20
|
Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
|
21
21
|
Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
|
22
|
-
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, Power Iteration Clustering,
|
22
|
+
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, SNN, Power Iteration Clustering,
|
23
23
|
Multidimensional Scaling, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
|
24
24
|
MSG
|
25
25
|
spec.homepage = 'https://github.com/yoshoku/rumale'
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: rumale
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.13.
|
4
|
+
version: 0.13.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- yoshoku
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2019-
|
11
|
+
date: 2019-09-01 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: numo-narray
|
@@ -128,7 +128,7 @@ description: |
|
|
128
128
|
Rumale currently supports Linear / Kernel Support Vector Machine,
|
129
129
|
Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
|
130
130
|
Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
|
131
|
-
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, Power Iteration Clustering,
|
131
|
+
K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, SNN, Power Iteration Clustering,
|
132
132
|
Multidimensional Scaling, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
|
133
133
|
email:
|
134
134
|
- yoshoku@outlook.com
|
@@ -166,6 +166,7 @@ files:
|
|
166
166
|
- lib/rumale/clustering/k_means.rb
|
167
167
|
- lib/rumale/clustering/k_medoids.rb
|
168
168
|
- lib/rumale/clustering/power_iteration.rb
|
169
|
+
- lib/rumale/clustering/snn.rb
|
169
170
|
- lib/rumale/dataset.rb
|
170
171
|
- lib/rumale/decomposition/nmf.rb
|
171
172
|
- lib/rumale/decomposition/pca.rb
|