rumale 0.9.2 → 0.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 1d73f16bcd1d149babe18c1db66d3f72bb9a1206
4
- data.tar.gz: 247fd7d548563ef27622c293073236468f634b7d
3
+ metadata.gz: 939e53a9d8c52f86c1d112d1cfbca17802aa94d8
4
+ data.tar.gz: cc1454ad92767b7ea27f9ec8d27cc7e7bbf97441
5
5
  SHA512:
6
- metadata.gz: 6a4a92d08ee0a8295d96a930a46fb67a9299a9e0beb717d52186347fef3b70727e35a2375e6c50f5da37ab699132fe0d3c3beeeb0a9730a158e3a5864f6b8364
7
- data.tar.gz: a614c5002c750f9091a0b7b80b678115ea6b65a1a7d0de621431ee942f8f1678d36c64a271cdb1cc0c4a68c49d20bacfe934d844381ad78361c09e762e02e872
6
+ metadata.gz: bed2b567c09ec4e79e81181ca9978598cb89696463330c4ac6aa153375aaa7a86d2a6e0eb2af190037fd49d08aa720835af10ff3b021b851c285dc564607acaa
7
+ data.tar.gz: 6b9faf79edecfad12ce46d3572810fdaa2648a3c1df090b99964c2e0ccef2553dcecc1cca7d12e1a9dd9b69108b91ff9fe68d94f74738f1a8f57ccdb4436c69c
@@ -1,3 +1,7 @@
1
+ # 0.10.0
2
+ - Add class for t-distributed Stochastic Neighbor Embedding.
3
+ - Fix bug of zero division on min-max scaling class.
4
+
1
5
  # 0.9.2
2
6
  - Add class for Gradient tree boosting classifier.
3
7
  - Add class for Gradient tree boosting regressor.
data/README.md CHANGED
@@ -6,14 +6,14 @@
6
6
  [![Coverage Status](https://coveralls.io/repos/github/yoshoku/rumale/badge.svg?branch=master)](https://coveralls.io/github/yoshoku/rumale?branch=master)
7
7
  [![Gem Version](https://badge.fury.io/rb/rumale.svg)](https://badge.fury.io/rb/rumale)
8
8
  [![BSD 2-Clause License](https://img.shields.io/badge/License-BSD%202--Clause-orange.svg)](https://github.com/yoshoku/rumale/blob/master/LICENSE.txt)
9
- [![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](https://www.rubydoc.info/gems/rumale/)
9
+ [![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](https://www.rubydoc.info/gems/rumale/0.10.0)
10
10
 
11
- Rumale (**Ru**by **ma**chine **le**arning) is a machine learninig library in Ruby.
11
+ Rumale (**Ru**by **ma**chine **le**arning) is a machine learning library in Ruby.
12
12
  Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
13
13
  Rumale supports Linear / Kernel Support Vector Machine,
14
14
  Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
15
15
  Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor classifier,
16
- K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
16
+ K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
17
17
 
18
18
  This project was formerly known as "SVMKit".
19
19
  If you are using SVMKit, please install Rumale and replace `SVMKit` constants with `Rumale`.
@@ -60,6 +60,7 @@ require 'rumale/clustering/k_means'
60
60
  require 'rumale/clustering/dbscan'
61
61
  require 'rumale/decomposition/pca'
62
62
  require 'rumale/decomposition/nmf'
63
+ require 'rumale/manifold/tsne'
63
64
  require 'rumale/preprocessing/l2_normalizer'
64
65
  require 'rumale/preprocessing/min_max_scaler'
65
66
  require 'rumale/preprocessing/max_abs_scaler'
@@ -0,0 +1,238 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'rumale/base/base_estimator'
4
+ require 'rumale/base/transformer'
5
+ require 'rumale/utils'
6
+ require 'rumale/pairwise_metric'
7
+ require 'rumale/decomposition/pca'
8
+
9
+ module Rumale
10
+ # Module for data embedding algorithms.
11
+ module Manifold
12
+ # TSNE is a class that implements t-Distributed Stochastic Neighbor Embedding (t-SNE)
13
+ # with fixed-point optimization algorithm.
14
+ # The fixed-point algorithm usually converges faster than the gradient descent method and
15
+ # does not need learning parameters such as the learning rate and momentum.
16
+ #
17
+ # @example
18
+ # tsne = Rumale::Manifold::TSNE.new(perplexity: 40.0, init: 'pca', max_iter: 500, random_seed: 1)
19
+ # representations = tsne.fit_transform(samples)
20
+ #
21
+ # *Reference*
22
+ # - L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. of Machine Learning Research, vol. 9, pp. 2579--2605, 2008.
23
+ # - Z. Yang, I. King, Z. Xu, and E. Oja, "Heavy-Tailed Symmetric Stochastic Neighbor Embedding," Proc. NIPS'09, pp. 2169--2177, 2009.
24
+ class TSNE
25
+ include Base::BaseEstimator
26
+ include Base::Transformer
27
+
28
+ # Return the data in representation space.
29
+ # @return [Numo::DFloat] (shape: [n_samples, n_components])
30
+ attr_reader :embedding
31
+
32
+ # Return the Kullback-Leibler divergence after optimization.
33
+ # @return [Float]
34
+ attr_reader :kl_divergence
35
+
36
+ # Return the number of iterations run for optimization
37
+ # @return [Integer]
38
+ attr_reader :n_iter
39
+
40
+ # Return the random generator.
41
+ # @return [Random]
42
+ attr_reader :rng
43
+
44
+ # Create a new transformer with t-SNE.
45
+ #
46
+ # @param n_components [Integer] The number of dimensions on representation space.
47
+ # @param perplexity [Float] The effective number of neighbors for each point. Perplexity is typically set between 5 and 50.
48
+ # @param metric [String] The metric to calculate the distances in original space.
49
+ # If metric is 'euclidean', Euclidean distance is calculated for distance in original space.
50
+ # If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
51
+ # @param init [String] The method used to initialize the representation space.
52
+ # If init is 'random', the representation space is initialized with normal random variables.
53
+ # If init is 'pca', the result of principal component analysis is used as the initial value of the representation space.
54
+ # @param max_iter [Integer] The maximum number of iterations.
55
+ # @param tol [Float] The tolerance of KL-divergence for terminating optimization.
56
+ # If tol is nil, it does not use KL divergence as a criterion for terminating the optimization.
57
+ # @param verbose [Boolean] The flag indicating whether to output KL divergence during iteration.
58
+ # @param random_seed [Integer] The seed value used to initialize the random generator.
59
+ def initialize(n_components: 2, perplexity: 30.0, metric: 'euclidean', init: 'random',
60
+ max_iter: 500, tol: nil, verbose: false, random_seed: nil)
61
+ check_params_integer(n_components: n_components, max_iter: max_iter)
62
+ check_params_float(perplexity: perplexity)
63
+ check_params_string(metric: metric, init: init)
64
+ check_params_boolean(verbose: verbose)
65
+ check_params_type_or_nil(Float, tol: tol)
66
+ check_params_type_or_nil(Integer, random_seed: random_seed)
67
+ check_params_positive(n_components: n_components, perplexity: perplexity, max_iter: max_iter)
68
+ @params = {}
69
+ @params[:n_components] = n_components
70
+ @params[:perplexity] = perplexity
71
+ @params[:max_iter] = max_iter
72
+ @params[:tol] = tol
73
+ @params[:metric] = metric
74
+ @params[:init] = init
75
+ @params[:verbose] = verbose
76
+ @params[:random_seed] = random_seed
77
+ @params[:random_seed] ||= srand
78
+ @rng = Random.new(@params[:random_seed])
79
+ @embedding = nil
80
+ @kl_divergence = nil
81
+ @n_iter = nil
82
+ end
83
+
84
+ # Fit the model with given training data.
85
+ #
86
+ # @overload fit(x) -> TSNE
87
+ #
88
+ # @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
89
+ # If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
90
+ # @return [TSNE] The learned transformer itself.
91
+ def fit(x, _not_used = nil)
92
+ check_sample_array(x)
93
+ raise ArgumentError, 'Expect the input distance matrix to be square.' if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
94
+ # initialize some variables.
95
+ @n_iter = 0
96
+ distance_mat = @params[:metric] == 'precomputed' ? x**2 : Rumale::PairwiseMetric.squared_error(x)
97
+ hi_prob_mat = gaussian_distributed_probability_matrix(distance_mat)
98
+ y = init_embedding(x)
99
+ lo_prob_mat = t_distributed_probability_matrix(y)
100
+ # perform fixed-point optimization.
101
+ one_vec = Numo::DFloat.ones(x.shape[0]).expand_dims(1)
102
+ @params[:max_iter].times do |t|
103
+ break if terminate?(hi_prob_mat, lo_prob_mat)
104
+ a = hi_prob_mat * lo_prob_mat
105
+ b = lo_prob_mat * lo_prob_mat
106
+ y = (b.dot(one_vec) * y + (a - b).dot(y)) / a.dot(one_vec)
107
+ lo_prob_mat = t_distributed_probability_matrix(y)
108
+ @n_iter = t + 1
109
+ puts "[t-SNE] KL divergence after #{@n_iter} iterations: #{cost(hi_prob_mat, lo_prob_mat)}" if @params[:verbose] && (@n_iter % 100).zero?
110
+ end
111
+ # store results.
112
+ @embedding = y
113
+ @kl_divergence = cost(hi_prob_mat, lo_prob_mat)
114
+ self
115
+ end
116
+
117
+ # Fit the model with training data, and then transform them with the learned model.
118
+ #
119
+ # @overload fit_transform(x) -> Numo::DFloat
120
+ #
121
+ # @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
122
+ # If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
123
+ # @return [Numo::DFloat] (shape: [n_samples, n_components]) The transformed data
124
+ def fit_transform(x, _not_used = nil)
125
+ fit(x)
126
+ @embedding.dup
127
+ end
128
+
129
+ # Dump marshal data.
130
+ # @return [Hash] The marshal data.
131
+ def marshal_dump
132
+ { params: @params,
133
+ embedding: @embedding,
134
+ kl_divergence: @kl_divergence,
135
+ n_iter: @n_iter,
136
+ rng: @rng }
137
+ end
138
+
139
+ # Load marshal data.
140
+ # @return [nil]
141
+ def marshal_load(obj)
142
+ @params = obj[:params]
143
+ @embedding = obj[:embedding]
144
+ @kl_divergence = obj[:kl_divergence]
145
+ @n_iter = obj[:n_iter]
146
+ @rng = obj[:rng]
147
+ nil
148
+ end
149
+
150
+ private
151
+
152
+ def init_embedding(x)
153
+ if @params[:init] == 'pca' && @params[:metric] == 'euclidean'
154
+ pca = Rumale::Decomposition::PCA.new(n_components: @params[:n_components], random_seed: @params[:random_seed])
155
+ pca.fit_transform(x)
156
+ else
157
+ n_samples = x.shape[0]
158
+ Rumale::Utils.rand_normal([n_samples, @params[:n_components]], @rng, 0, 0.0001)
159
+ end
160
+ end
161
+
162
+ def gaussian_distributed_probability_matrix(distance_mat)
163
+ # initialize some variables.
164
+ n_samples = distance_mat.shape[0]
165
+ prob_mat = Numo::DFloat.zeros(n_samples, n_samples)
166
+ sum_beta = 0.0
167
+ # calculate conditional probabilities.
168
+ n_samples.times do |n|
169
+ beta, probs = optimal_probabilities(n, distance_mat[n, true])
170
+ prob_mat[n, true] = probs
171
+ sum_beta += beta
172
+ puts "[t-SNE] Computed conditional probabilities for sample #{n + 1} / #{n_samples}" if @params[:verbose] && ((n + 1) % 1000).zero?
173
+ end
174
+ puts "[t-SNE] Mean sigma: #{Math.sqrt(n_samples.fdiv(sum_beta))}" if @params[:verbose]
175
+ # symmetrize and normalize probability matrix.
176
+ prob_mat[prob_mat.diag_indices(0)] = 0.0
177
+ prob_mat = 0.5 * (prob_mat + prob_mat.transpose)
178
+ prob_mat / prob_mat.sum
179
+ end
180
+
181
+ def optimal_probabilities(sample_id, distance_vec, max_iter = 100)
182
+ # initialize some variables.
183
+ probs = nil
184
+ beta = 1.0
185
+ betamin = Float::MIN
186
+ betamax = Float::MAX
187
+ init_entropy = Math.log(@params[:perplexity])
188
+ # calculate optimal beta and conditional probabilities with binary search.
189
+ max_iter.times do
190
+ entropy, probs = gaussian_distributed_probability_vector(sample_id, distance_vec, beta)
191
+ diff_entropy = entropy - init_entropy
192
+ break if diff_entropy.abs <= 1e-5
193
+ if diff_entropy.positive?
194
+ betamin = beta
195
+ if betamax == Float::MAX
196
+ beta *= 2.0
197
+ else
198
+ beta = 0.5 * (beta + betamax)
199
+ end
200
+ else
201
+ betamax = beta
202
+ if betamin == Float::MIN
203
+ beta /= 2.0
204
+ else
205
+ beta = 0.5 * (beta + betamin)
206
+ end
207
+ end
208
+ end
209
+ [beta, probs]
210
+ end
211
+
212
+ def gaussian_distributed_probability_vector(n, distance_vec, beta)
213
+ probs = Numo::NMath.exp(-beta * distance_vec)
214
+ probs[n] = 0.0
215
+ sum_probs = probs.sum
216
+ probs /= sum_probs
217
+ entropy = Math.log(sum_probs) + beta * (distance_vec * probs).sum
218
+ [entropy, probs]
219
+ end
220
+
221
+ def t_distributed_probability_matrix(y)
222
+ distance_mat = Rumale::PairwiseMetric.squared_error(y)
223
+ prob_mat = 1.0 / (1.0 + distance_mat)
224
+ prob_mat[prob_mat.diag_indices(0)] = 0.0
225
+ prob_mat / prob_mat.sum
226
+ end
227
+
228
+ def cost(p, q)
229
+ (p * Numo::NMath.log(Numo::DFloat.maximum(1e-20, p) / Numo::DFloat.maximum(1e-20, q))).sum
230
+ end
231
+
232
+ def terminate?(p, q)
233
+ return false if @params[:tol].nil?
234
+ cost(p, q) <= @params[:tol]
235
+ end
236
+ end
237
+ end
238
+ end
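The `optimal_probabilities` method added above searches for a precision `beta` whose conditional distribution has entropy equal to `log(perplexity)`. The following is a minimal plain-Ruby sketch of that search, not the gem's code: it uses ordinary arrays instead of `Numo::DFloat`, `nil` bounds instead of the `Float::MIN` / `Float::MAX` sentinels, and the helper names `conditional_probs` and `search_beta` are hypothetical.

```ruby
# Entropy (in nats) and conditional probabilities of one sample for a given beta.
# sq_dists holds squared distances from the sample to every point, itself included.
def conditional_probs(sample_id, sq_dists, beta)
  weights = sq_dists.each_with_index.map { |d, j| j == sample_id ? 0.0 : Math.exp(-beta * d) }
  total = weights.sum
  probs = weights.map { |w| w / total }
  # H = log(Z) + beta * sum(d * p), which equals -sum(p * log(p)).
  entropy = Math.log(total) + beta * sq_dists.zip(probs).sum { |d, p| d * p }
  [entropy, probs]
end

# Search for the beta whose entropy matches log(perplexity).
def search_beta(sample_id, sq_dists, perplexity, max_iter = 100)
  target = Math.log(perplexity)
  beta = 1.0
  lower = nil # largest beta known to give too much entropy
  upper = nil # smallest beta known to give too little entropy
  probs = nil
  max_iter.times do
    entropy, probs = conditional_probs(sample_id, sq_dists, beta)
    diff = entropy - target
    break if diff.abs <= 1e-5
    if diff.positive?
      # Distribution is too flat: raise beta.
      lower = beta
      beta = upper.nil? ? beta * 2.0 : 0.5 * (beta + upper)
    else
      # Distribution is too peaked: lower beta.
      upper = beta
      beta = lower.nil? ? beta / 2.0 : 0.5 * (beta + lower)
    end
  end
  [beta, probs]
end

sq_dists = [0.0, 1.0, 4.0, 9.0, 16.0]
beta, probs = search_beta(0, sq_dists, 2.0)
# Perplexity recovered from the resulting distribution (zero entries are skipped).
perplexity = Math.exp(-probs.reject(&:zero?).sum { |p| p * Math.log(p) })
```

Because the entropy decreases monotonically as `beta` grows, doubling or halving until both bounds are known and then bisecting is enough to reach the tolerance.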
@@ -15,13 +15,29 @@ module Rumale
15
15
  y = x if y.nil?
16
16
  Rumale::Validation.check_sample_array(x)
17
17
  Rumale::Validation.check_sample_array(y)
18
- sum_x_vec = (x**2).sum(1)
19
- sum_y_vec = (y**2).sum(1)
18
+ Numo::NMath.sqrt(squared_error(x, y).abs)
19
+ end
20
+
21
+ # Calculate the pairwise squared errors between x and y.
22
+ #
23
+ # @param x [Numo::DFloat] (shape: [n_samples_x, n_features])
24
+ # @param y [Numo::DFloat] (shape: [n_samples_y, n_features])
25
+ # @return [Numo::DFloat] (shape: [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y] if y is given)
26
+ def squared_error(x, y = nil)
27
+ y = x if y.nil?
28
+ Rumale::Validation.check_sample_array(x)
29
+ Rumale::Validation.check_sample_array(y)
30
+ # sum_x_vec = (x**2).sum(1)
31
+ # sum_y_vec = (y**2).sum(1)
32
+ # dot_xy_mat = x.dot(y.transpose)
33
+ # dot_xy_mat * -2.0 + sum_x_vec.tile(y.shape[0], 1).transpose + sum_y_vec.tile(x.shape[0], 1)
34
+ #
35
+ n_features = x.shape[1]
36
+ one_vec = Numo::DFloat.ones(n_features).expand_dims(1)
37
+ sum_x_vec = (x**2).dot(one_vec)
38
+ sum_y_vec = (y**2).dot(one_vec).transpose
20
39
  dot_xy_mat = x.dot(y.transpose)
21
- distance_matrix = dot_xy_mat * -2.0 +
22
- sum_x_vec.tile(y.shape[0], 1).transpose +
23
- sum_y_vec.tile(x.shape[0], 1)
24
- Numo::NMath.sqrt(distance_matrix.abs)
40
+ dot_xy_mat * -2.0 + sum_x_vec + sum_y_vec
25
41
  end
26
42
 
27
43
  # Calculate the rbf kernel between x and y.
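The new `squared_error` method in the hunk above relies on the expansion ||x - y||^2 = ||x||^2 - 2 x·y + ||y||^2. A small plain-Ruby sketch of the same identity follows; it uses nested arrays rather than `Numo::DFloat`, and `pairwise_squared_error` is a hypothetical helper name, not part of Rumale.

```ruby
# Pairwise squared Euclidean distances via ||x||^2 - 2 x.y + ||y||^2.
# x and y are arrays of rows with the same number of features.
def pairwise_squared_error(x, y = nil)
  y ||= x
  sum_x = x.map { |row| row.sum { |v| v * v } } # squared norm of each row of x
  sum_y = y.map { |row| row.sum { |v| v * v } } # squared norm of each row of y
  x.each_with_index.map do |x_row, i|
    y.each_with_index.map do |y_row, j|
      dot = x_row.zip(y_row).sum { |a, b| a * b }
      -2.0 * dot + sum_x[i] + sum_y[j]
    end
  end
end

x = [[0.0, 0.0], [3.0, 4.0]]
d = pairwise_squared_error(x)
# d[0][1] is the squared distance between (0, 0) and (3, 4), i.e. 25.0.
```

Floating-point rounding in this expansion can presumably yield tiny negative values for near-identical rows, which would explain why the Euclidean distance in the hunk takes `.abs` before the square root.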
@@ -67,6 +67,7 @@ module Rumale
67
67
  check_sample_array(x)
68
68
  n_samples, = x.shape
69
69
  dif_vec = @max_vec - @min_vec
70
+ dif_vec[dif_vec.eq(0)] = 1.0
70
71
  nx = (x - @min_vec.tile(n_samples, 1)) / dif_vec.tile(n_samples, 1)
71
72
  nx * (@params[:feature_range][1] - @params[:feature_range][0]) + @params[:feature_range][0]
72
73
  end
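The one-line fix above replaces zero entries of `dif_vec` with 1.0, so a feature whose minimum equals its maximum no longer divides by zero. A plain-Ruby sketch of the same guard, under the assumption that samples are nested arrays of floats (`min_max_scale` is a hypothetical helper, not Rumale's `MinMaxScaler`):

```ruby
# Min-max scaling with a guard for constant features.
def min_max_scale(rows, feature_range = [0.0, 1.0])
  low, high = feature_range
  columns = rows.transpose
  mins = columns.map(&:min)
  maxs = columns.map(&:max)
  # A constant feature has max == min; divide by 1.0 instead of 0.0.
  difs = mins.zip(maxs).map { |mn, mx| (mx - mn).zero? ? 1.0 : (mx - mn).to_f }
  rows.map do |row|
    row.each_with_index.map { |v, j| (v - mins[j]) / difs[j] * (high - low) + low }
  end
end

# The second feature is constant across all samples.
samples = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
scaled = min_max_scale(samples)
# scaled is [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
```

Without the guard, the constant column would evaluate 0.0 / 0.0 and produce NaN; with it, the column maps to the lower bound of the feature range.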
@@ -3,5 +3,5 @@
3
3
  # Rumale is a machine learning library in Ruby.
4
4
  module Rumale
5
5
  # The version of Rumale you are using.
6
- VERSION = '0.9.2'
6
+ VERSION = '0.10.0'
7
7
  end
@@ -9,16 +9,16 @@ Gem::Specification.new do |spec|
9
9
  spec.email = ['yoshoku@outlook.com']
10
10
 
11
11
  spec.summary = <<MSG
12
- Rumale is a machine learninig library in Ruby.
12
+ Rumale is a machine learning library in Ruby.
13
13
  Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
14
14
  MSG
15
15
  spec.description = <<MSG
16
- Rumale is a machine learninig library in Ruby.
16
+ Rumale is a machine learning library in Ruby.
17
17
  Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
18
18
  Rumale currently supports Linear / Kernel Support Vector Machine,
19
19
  Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
20
20
  Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
21
- K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
21
+ K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
22
22
  MSG
23
23
  spec.homepage = 'https://github.com/yoshoku/rumale'
24
24
  spec.license = 'BSD-2-Clause'
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: rumale
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.9.2
4
+ version: 0.10.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - yoshoku
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2019-05-11 00:00:00.000000000 Z
11
+ date: 2019-05-17 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: numo-narray
@@ -95,12 +95,12 @@ dependencies:
95
95
  - !ruby/object:Gem::Version
96
96
  version: '3.0'
97
97
  description: |
98
- Rumale is a machine learninig library in Ruby.
98
+ Rumale is a machine learning library in Ruby.
99
99
  Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
100
100
  Rumale currently supports Linear / Kernel Support Vector Machine,
101
101
  Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
102
102
  Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
103
- K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
103
+ K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
104
104
  email:
105
105
  - yoshoku@outlook.com
106
106
  executables: []
@@ -172,6 +172,7 @@ files:
172
172
  - lib/rumale/linear_model/ridge.rb
173
173
  - lib/rumale/linear_model/svc.rb
174
174
  - lib/rumale/linear_model/svr.rb
175
+ - lib/rumale/manifold/tsne.rb
175
176
  - lib/rumale/model_selection/cross_validation.rb
176
177
  - lib/rumale/model_selection/grid_search_cv.rb
177
178
  - lib/rumale/model_selection/k_fold.rb
@@ -236,6 +237,6 @@ rubyforge_project:
236
237
  rubygems_version: 2.5.2.3
237
238
  signing_key:
238
239
  specification_version: 4
239
- summary: Rumale is a machine learninig library in Ruby. Rumale provides machine learning
240
+ summary: Rumale is a machine learning library in Ruby. Rumale provides machine learning
240
241
  algorithms with interfaces similar to Scikit-Learn in Python.
241
242
  test_files: []