rumale 0.9.2 → 0.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 1d73f16bcd1d149babe18c1db66d3f72bb9a1206
- data.tar.gz: 247fd7d548563ef27622c293073236468f634b7d
+ metadata.gz: 939e53a9d8c52f86c1d112d1cfbca17802aa94d8
+ data.tar.gz: cc1454ad92767b7ea27f9ec8d27cc7e7bbf97441
  SHA512:
- metadata.gz: 6a4a92d08ee0a8295d96a930a46fb67a9299a9e0beb717d52186347fef3b70727e35a2375e6c50f5da37ab699132fe0d3c3beeeb0a9730a158e3a5864f6b8364
- data.tar.gz: a614c5002c750f9091a0b7b80b678115ea6b65a1a7d0de621431ee942f8f1678d36c64a271cdb1cc0c4a68c49d20bacfe934d844381ad78361c09e762e02e872
+ metadata.gz: bed2b567c09ec4e79e81181ca9978598cb89696463330c4ac6aa153375aaa7a86d2a6e0eb2af190037fd49d08aa720835af10ff3b021b851c285dc564607acaa
+ data.tar.gz: 6b9faf79edecfad12ce46d3572810fdaa2648a3c1df090b99964c2e0ccef2553dcecc1cca7d12e1a9dd9b69108b91ff9fe68d94f74738f1a8f57ccdb4436c69c
data/CHANGELOG.md CHANGED
@@ -1,3 +1,7 @@
+ # 0.10.0
+ - Add class for t-distributed Stochastic Neighbor Embedding.
+ - Fix bug of zero division on min-max scaling class.
+
  # 0.9.2
  - Add class for Gradient tree boosting classifier.
  - Add class for Gradient tree boosting regressor.
data/README.md CHANGED
@@ -6,14 +6,14 @@
  [![Coverage Status](https://coveralls.io/repos/github/yoshoku/rumale/badge.svg?branch=master)](https://coveralls.io/github/yoshoku/rumale?branch=master)
  [![Gem Version](https://badge.fury.io/rb/rumale.svg)](https://badge.fury.io/rb/rumale)
  [![BSD 2-Clause License](https://img.shields.io/badge/License-BSD%202--Clause-orange.svg)](https://github.com/yoshoku/rumale/blob/master/LICENSE.txt)
- [![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](https://www.rubydoc.info/gems/rumale/)
+ [![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](https://www.rubydoc.info/gems/rumale/0.10.0)

- Rumale (**Ru**by **ma**chine **le**arning) is a machine learninig library in Ruby.
+ Rumale (**Ru**by **ma**chine **le**arning) is a machine learning library in Ruby.
  Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
  Rumale supports Linear / Kernel Support Vector Machine,
  Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
  Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor classifier,
- K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
+ K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.

  This project was formerly known as "SVMKit".
  If you are using SVMKit, please install Rumale and replace `SVMKit` constants with `Rumale`.
@@ -60,6 +60,7 @@ require 'rumale/clustering/k_means'
  require 'rumale/clustering/dbscan'
  require 'rumale/decomposition/pca'
  require 'rumale/decomposition/nmf'
+ require 'rumale/manifold/tsne'
  require 'rumale/preprocessing/l2_normalizer'
  require 'rumale/preprocessing/min_max_scaler'
  require 'rumale/preprocessing/max_abs_scaler'
data/lib/rumale/manifold/tsne.rb ADDED
@@ -0,0 +1,238 @@
+ # frozen_string_literal: true
+
+ require 'rumale/base/base_estimator'
+ require 'rumale/base/transformer'
+ require 'rumale/utils'
+ require 'rumale/pairwise_metric'
+ require 'rumale/decomposition/pca'
+
+ module Rumale
+   # Module for data embedding algorithms.
+   module Manifold
+     # TSNE is a class that implements t-Distributed Stochastic Neighbor Embedding (t-SNE)
+     # with a fixed-point optimization algorithm.
+     # The fixed-point algorithm usually converges faster than gradient descent and
+     # does not need learning parameters such as the learning rate and momentum.
+     #
+     # @example
+     #   tsne = Rumale::Manifold::TSNE.new(perplexity: 40.0, init: 'pca', max_iter: 500, random_seed: 1)
+     #   representations = tsne.fit_transform(samples)
+     #
+     # *Reference*
+     # - L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. of Machine Learning Research, vol. 9, pp. 2579--2605, 2008.
+     # - Z. Yang, I. King, Z. Xu, and E. Oja, "Heavy-Tailed Symmetric Stochastic Neighbor Embedding," Proc. NIPS'09, pp. 2169--2177, 2009.
+     class TSNE
+       include Base::BaseEstimator
+       include Base::Transformer
+
+       # Return the data in representation space.
+       # @return [Numo::DFloat] (shape: [n_samples, n_components])
+       attr_reader :embedding
+
+       # Return the Kullback-Leibler divergence after optimization.
+       # @return [Float]
+       attr_reader :kl_divergence
+
+       # Return the number of iterations run for optimization.
+       # @return [Integer]
+       attr_reader :n_iter
+
+       # Return the random generator.
+       # @return [Random]
+       attr_reader :rng
+
+       # Create a new transformer with t-SNE.
+       #
+       # @param n_components [Integer] The number of dimensions of the representation space.
+       # @param perplexity [Float] The effective number of neighbors for each point. Perplexity is typically set between 5 and 50.
+       # @param metric [String] The metric used to calculate distances in the original space.
+       #   If metric is 'euclidean', the Euclidean distance is calculated in the original space.
+       #   If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
+       # @param init [String] The method used to initialize the representation space.
+       #   If init is 'random', the representation space is initialized with normal random variables.
+       #   If init is 'pca', the result of principal component analysis is used as the initial value of the representation space.
+       # @param max_iter [Integer] The maximum number of iterations.
+       # @param tol [Float] The tolerance of KL divergence for terminating optimization.
+       #   If tol is nil, KL divergence is not used as a criterion for terminating the optimization.
+       # @param verbose [Boolean] The flag indicating whether to output KL divergence during iteration.
+       # @param random_seed [Integer] The seed value used to initialize the random generator.
+       def initialize(n_components: 2, perplexity: 30.0, metric: 'euclidean', init: 'random',
+                      max_iter: 500, tol: nil, verbose: false, random_seed: nil)
+         check_params_integer(n_components: n_components, max_iter: max_iter)
+         check_params_float(perplexity: perplexity)
+         check_params_string(metric: metric, init: init)
+         check_params_boolean(verbose: verbose)
+         check_params_type_or_nil(Float, tol: tol)
+         check_params_type_or_nil(Integer, random_seed: random_seed)
+         check_params_positive(n_components: n_components, perplexity: perplexity, max_iter: max_iter)
+         @params = {}
+         @params[:n_components] = n_components
+         @params[:perplexity] = perplexity
+         @params[:max_iter] = max_iter
+         @params[:tol] = tol
+         @params[:metric] = metric
+         @params[:init] = init
+         @params[:verbose] = verbose
+         @params[:random_seed] = random_seed
+         @params[:random_seed] ||= srand
+         @rng = Random.new(@params[:random_seed])
+         @embedding = nil
+         @kl_divergence = nil
+         @n_iter = nil
+       end
+
+       # Fit the model with given training data.
+       #
+       # @overload fit(x) -> TSNE
+       #
+       # @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
+       #   If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
+       # @return [TSNE] The learned transformer itself.
+       def fit(x, _not_used = nil)
+         check_sample_array(x)
+         raise ArgumentError, 'Expect the input distance matrix to be square.' if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
+         # initialize some variables.
+         @n_iter = 0
+         distance_mat = @params[:metric] == 'precomputed' ? x**2 : Rumale::PairwiseMetric.squared_error(x)
+         hi_prob_mat = gaussian_distributed_probability_matrix(distance_mat)
+         y = init_embedding(x)
+         lo_prob_mat = t_distributed_probability_matrix(y)
+         # perform fixed-point optimization.
+         one_vec = Numo::DFloat.ones(x.shape[0]).expand_dims(1)
+         @params[:max_iter].times do |t|
+           break if terminate?(hi_prob_mat, lo_prob_mat)
+           a = hi_prob_mat * lo_prob_mat
+           b = lo_prob_mat * lo_prob_mat
+           y = (b.dot(one_vec) * y + (a - b).dot(y)) / a.dot(one_vec)
+           lo_prob_mat = t_distributed_probability_matrix(y)
+           @n_iter = t + 1
+           puts "[t-SNE] KL divergence after #{@n_iter} iterations: #{cost(hi_prob_mat, lo_prob_mat)}" if @params[:verbose] && (@n_iter % 100).zero?
+         end
+         # store results.
+         @embedding = y
+         @kl_divergence = cost(hi_prob_mat, lo_prob_mat)
+         self
+       end
+
+       # Fit the model with training data, and then transform them with the learned model.
+       #
+       # @overload fit_transform(x) -> Numo::DFloat
+       #
+       # @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
+       #   If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
+       # @return [Numo::DFloat] (shape: [n_samples, n_components]) The transformed data.
+       def fit_transform(x, _not_used = nil)
+         fit(x)
+         @embedding.dup
+       end
+
+       # Dump marshal data.
+       # @return [Hash] The marshal data.
+       def marshal_dump
+         { params: @params,
+           embedding: @embedding,
+           kl_divergence: @kl_divergence,
+           n_iter: @n_iter,
+           rng: @rng }
+       end
+
+       # Load marshal data.
+       # @return [nil]
+       def marshal_load(obj)
+         @params = obj[:params]
+         @embedding = obj[:embedding]
+         @kl_divergence = obj[:kl_divergence]
+         @n_iter = obj[:n_iter]
+         @rng = obj[:rng]
+         nil
+       end
+
+       private
+
+       def init_embedding(x)
+         if @params[:init] == 'pca' && @params[:metric] == 'euclidean'
+           pca = Rumale::Decomposition::PCA.new(n_components: @params[:n_components], random_seed: @params[:random_seed])
+           pca.fit_transform(x)
+         else
+           n_samples = x.shape[0]
+           Rumale::Utils.rand_normal([n_samples, @params[:n_components]], @rng, 0, 0.0001)
+         end
+       end
+
+       def gaussian_distributed_probability_matrix(distance_mat)
+         # initialize some variables.
+         n_samples = distance_mat.shape[0]
+         prob_mat = Numo::DFloat.zeros(n_samples, n_samples)
+         sum_beta = 0.0
+         # calculate conditional probabilities.
+         n_samples.times do |n|
+           beta, probs = optimal_probabilities(n, distance_mat[n, true])
+           prob_mat[n, true] = probs
+           sum_beta += beta
+           puts "[t-SNE] Computed conditional probabilities for sample #{n + 1} / #{n_samples}" if @params[:verbose] && ((n + 1) % 1000).zero?
+         end
+         puts "[t-SNE] Mean sigma: #{Math.sqrt(n_samples.fdiv(sum_beta))}" if @params[:verbose]
+         # symmetrize and normalize probability matrix.
+         prob_mat[prob_mat.diag_indices(0)] = 0.0
+         prob_mat = 0.5 * (prob_mat + prob_mat.transpose)
+         prob_mat / prob_mat.sum
+       end
+
+       def optimal_probabilities(sample_id, distance_vec, max_iter = 100)
+         # initialize some variables.
+         probs = nil
+         beta = 1.0
+         betamin = Float::MIN
+         betamax = Float::MAX
+         init_entropy = Math.log(@params[:perplexity])
+         # calculate optimal beta and conditional probabilities with binary search.
+         max_iter.times do
+           entropy, probs = gaussian_distributed_probability_vector(sample_id, distance_vec, beta)
+           diff_entropy = entropy - init_entropy
+           break if diff_entropy.abs <= 1e-5
+           if diff_entropy.positive?
+             betamin = beta
+             if betamax == Float::MAX
+               beta *= 2.0
+             else
+               beta = 0.5 * (beta + betamax)
+             end
+           else
+             betamax = beta
+             if betamin == Float::MIN
+               beta /= 2.0
+             else
+               beta = 0.5 * (beta + betamin)
+             end
+           end
+         end
+         [beta, probs]
+       end
+
+       def gaussian_distributed_probability_vector(n, distance_vec, beta)
+         probs = Numo::NMath.exp(-beta * distance_vec)
+         probs[n] = 0.0
+         sum_probs = probs.sum
+         probs /= sum_probs
+         entropy = Math.log(sum_probs) + beta * (distance_vec * probs).sum
+         [entropy, probs]
+       end
+
+       def t_distributed_probability_matrix(y)
+         distance_mat = Rumale::PairwiseMetric.squared_error(y)
+         prob_mat = 1.0 / (1.0 + distance_mat)
+         prob_mat[prob_mat.diag_indices(0)] = 0.0
+         prob_mat / prob_mat.sum
+       end
+
+       def cost(p, q)
+         (p * Numo::NMath.log(Numo::DFloat.maximum(1e-20, p) / Numo::DFloat.maximum(1e-20, q))).sum
+       end
+
+       def terminate?(p, q)
+         return false if @params[:tol].nil?
+         cost(p, q) <= @params[:tol]
+       end
+     end
+   end
+ end
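
The fixed-point update in `fit` follows Yang et al. (2009): with A = P∘Q and B = Q∘Q (elementwise products of the input-space and embedding-space probability matrices), each iteration replaces Y by ((B·1)∘Y + (A − B)·Y) / (A·1), which is why no learning rate or momentum parameter appears. A minimal usage sketch of the new class follows; the random sample matrix stands in for real data and the parameter values are illustrative, not recommendations.

```ruby
require 'numo/narray'
require 'rumale/manifold/tsne'

# Random data stands in for a real dataset: 200 samples, 50 features.
samples = Numo::DFloat.new(200, 50).rand

# PCA initialization with a KL-divergence tolerance for early stopping.
tsne = Rumale::Manifold::TSNE.new(n_components: 2, perplexity: 30.0,
                                  init: 'pca', max_iter: 500, tol: 1.0e-7,
                                  verbose: true, random_seed: 1)
low_dim = tsne.fit_transform(samples)

puts "embedding shape: #{low_dim.shape.inspect}" # => [200, 2]
puts "KL divergence:   #{tsne.kl_divergence}"
puts "iterations run:  #{tsne.n_iter}"
```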
data/lib/rumale/pairwise_metric.rb CHANGED
@@ -15,13 +15,29 @@ module Rumale
        y = x if y.nil?
        Rumale::Validation.check_sample_array(x)
        Rumale::Validation.check_sample_array(y)
-       sum_x_vec = (x**2).sum(1)
-       sum_y_vec = (y**2).sum(1)
+       Numo::NMath.sqrt(squared_error(x, y).abs)
+     end
+
+     # Calculate the pairwise squared errors between x and y.
+     #
+     # @param x [Numo::DFloat] (shape: [n_samples_x, n_features])
+     # @param y [Numo::DFloat] (shape: [n_samples_y, n_features])
+     # @return [Numo::DFloat] (shape: [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y] if y is given)
+     def squared_error(x, y = nil)
+       y = x if y.nil?
+       Rumale::Validation.check_sample_array(x)
+       Rumale::Validation.check_sample_array(y)
+       # sum_x_vec = (x**2).sum(1)
+       # sum_y_vec = (y**2).sum(1)
+       # dot_xy_mat = x.dot(y.transpose)
+       # dot_xy_mat * -2.0 + sum_x_vec.tile(y.shape[0], 1).transpose + sum_y_vec.tile(x.shape[0], 1)
+       #
+       n_features = x.shape[1]
+       one_vec = Numo::DFloat.ones(n_features).expand_dims(1)
+       sum_x_vec = (x**2).dot(one_vec)
+       sum_y_vec = (y**2).dot(one_vec).transpose
        dot_xy_mat = x.dot(y.transpose)
-       distance_matrix = dot_xy_mat * -2.0 +
-                         sum_x_vec.tile(y.shape[0], 1).transpose +
-                         sum_y_vec.tile(x.shape[0], 1)
-       Numo::NMath.sqrt(distance_matrix.abs)
+       dot_xy_mat * -2.0 + sum_x_vec + sum_y_vec
      end

      # Calculate the rbf kernel between x and y.
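
The rewritten method replaces `tile` with a matrix–ones-vector product and broadcasting, using the identity ‖xᵢ − yⱼ‖² = ‖xᵢ‖² − 2 xᵢ·yⱼ + ‖yⱼ‖². The sketch below checks the vectorized helper against a naive double loop; the matrix sizes are arbitrary illustrations.

```ruby
require 'numo/narray'
require 'rumale/pairwise_metric'

x = Numo::DFloat.new(5, 3).rand
y = Numo::DFloat.new(4, 3).rand

# Vectorized pairwise squared errors from the new helper.
sq = Rumale::PairwiseMetric.squared_error(x, y)

# Naive double loop computing ||x_i - y_j||^2 for comparison.
naive = Numo::DFloat.zeros(5, 4)
5.times do |i|
  4.times { |j| naive[i, j] = ((x[i, true] - y[j, true])**2).sum }
end

puts (sq - naive).abs.max # tiny (~1e-15), differing only by floating-point error
```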
data/lib/rumale/preprocessing/min_max_scaler.rb CHANGED
@@ -67,6 +67,7 @@ module Rumale
        check_sample_array(x)
        n_samples, = x.shape
        dif_vec = @max_vec - @min_vec
+       dif_vec[dif_vec.eq(0)] = 1.0
        nx = (x - @min_vec.tile(n_samples, 1)) / dif_vec.tile(n_samples, 1)
        nx * (@params[:feature_range][1] - @params[:feature_range][0]) + @params[:feature_range][0]
      end
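
The one-line guard replaces a zero range with 1.0 before dividing, so a constant feature no longer produces the 0/0 noted in the changelog and instead collapses to the lower bound of `feature_range`. A minimal sketch of the behavior, with made-up data:

```ruby
require 'numo/narray'
require 'rumale/preprocessing/min_max_scaler'

# The second feature is constant, so max - min is zero for that column.
samples = Numo::DFloat[[0.0, 5.0], [1.0, 5.0], [2.0, 5.0]]

scaler = Rumale::Preprocessing::MinMaxScaler.new(feature_range: [0.0, 1.0])
normalized = scaler.fit_transform(samples)

# Before this fix the constant column became NaN (0.0 / 0.0); with the guard
# it maps to the lower bound of feature_range instead.
pp normalized.to_a # => [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
```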
data/lib/rumale/version.rb CHANGED
@@ -3,5 +3,5 @@
  # Rumale is a machine learning library in Ruby.
  module Rumale
    # The version of Rumale you are using.
-   VERSION = '0.9.2'
+   VERSION = '0.10.0'
  end
data/rumale.gemspec CHANGED
@@ -9,16 +9,16 @@ Gem::Specification.new do |spec|
    spec.email = ['yoshoku@outlook.com']

    spec.summary = <<MSG
-   Rumale is a machine learninig library in Ruby.
+   Rumale is a machine learning library in Ruby.
    Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
  MSG
    spec.description = <<MSG
-   Rumale is a machine learninig library in Ruby.
+   Rumale is a machine learning library in Ruby.
    Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
    Rumale currently supports Linear / Kernel Support Vector Machine,
    Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
    Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
-   K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
+   K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
  MSG
    spec.homepage = 'https://github.com/yoshoku/rumale'
    spec.license = 'BSD-2-Clause'
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: rumale
  version: !ruby/object:Gem::Version
-   version: 0.9.2
+   version: 0.10.0
  platform: ruby
  authors:
  - yoshoku
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2019-05-11 00:00:00.000000000 Z
+ date: 2019-05-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: numo-narray
@@ -95,12 +95,12 @@ dependencies:
  - !ruby/object:Gem::Version
    version: '3.0'
  description: |
-   Rumale is a machine learninig library in Ruby.
+   Rumale is a machine learning library in Ruby.
    Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
    Rumale currently supports Linear / Kernel Support Vector Machine,
    Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
    Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
-   K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
+   K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
  email:
  - yoshoku@outlook.com
  executables: []
@@ -172,6 +172,7 @@ files:
  - lib/rumale/linear_model/ridge.rb
  - lib/rumale/linear_model/svc.rb
  - lib/rumale/linear_model/svr.rb
+ - lib/rumale/manifold/tsne.rb
  - lib/rumale/model_selection/cross_validation.rb
  - lib/rumale/model_selection/grid_search_cv.rb
  - lib/rumale/model_selection/k_fold.rb
@@ -236,6 +237,6 @@ rubyforge_project:
  rubygems_version: 2.5.2.3
  signing_key:
  specification_version: 4
- summary: Rumale is a machine learninig library in Ruby. Rumale provides machine learning
+ summary: Rumale is a machine learning library in Ruby. Rumale provides machine learning
    algorithms with interfaces similar to Scikit-Learn in Python.
  test_files: []