rumale 0.9.2 → 0.10.0 (RubyGems)

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +4 -0
data/README.md +3 -3
data/lib/rumale.rb +1 -0
data/lib/rumale/manifold/tsne.rb +238 -0
data/lib/rumale/pairwise_metric.rb +22 -6
data/lib/rumale/preprocessing/min_max_scaler.rb +1 -0
data/lib/rumale/version.rb +1 -1
data/rumale.gemspec +3 -3
metadata +6 -5

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 1d73f16bcd1d149babe18c1db66d3f72bb9a1206
-  data.tar.gz: 247fd7d548563ef27622c293073236468f634b7d
+  metadata.gz: 939e53a9d8c52f86c1d112d1cfbca17802aa94d8
+  data.tar.gz: cc1454ad92767b7ea27f9ec8d27cc7e7bbf97441
 SHA512:
-  metadata.gz: 6a4a92d08ee0a8295d96a930a46fb67a9299a9e0beb717d52186347fef3b70727e35a2375e6c50f5da37ab699132fe0d3c3beeeb0a9730a158e3a5864f6b8364
-  data.tar.gz: a614c5002c750f9091a0b7b80b678115ea6b65a1a7d0de621431ee942f8f1678d36c64a271cdb1cc0c4a68c49d20bacfe934d844381ad78361c09e762e02e872
+  metadata.gz: bed2b567c09ec4e79e81181ca9978598cb89696463330c4ac6aa153375aaa7a86d2a6e0eb2af190037fd49d08aa720835af10ff3b021b851c285dc564607acaa
+  data.tar.gz: 6b9faf79edecfad12ce46d3572810fdaa2648a3c1df090b99964c2e0ccef2553dcecc1cca7d12e1a9dd9b69108b91ff9fe68d94f74738f1a8f57ccdb4436c69c

data/CHANGELOG.md CHANGED

@@ -1,3 +1,7 @@
+# 0.10.0
+- Add class for t-distributed Stochastic Neighborhood Embedding.
+- Fix bug of zero division on min-max scaling class.
 # 0.9.2
 - Add class for Gradient tree boosting classifier.
 - Add class for Gradient tree boosting regressor.

data/README.md CHANGED

@@ -6,14 +6,14 @@
 [![Coverage Status](https://coveralls.io/repos/github/yoshoku/rumale/badge.svg?branch=master)](https://coveralls.io/github/yoshoku/rumale?branch=master)
 [![Gem Version](https://badge.fury.io/rb/rumale.svg)](https://badge.fury.io/rb/rumale)
 [![BSD 2-Clause License](https://img.shields.io/badge/License-BSD%202--Clause-orange.svg)](https://github.com/yoshoku/rumale/blob/master/LICENSE.txt)
-[![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](https://www.rubydoc.info/gems/rumale/)
+[![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](https://www.rubydoc.info/gems/rumale/0.10.0)
-Rumale (**Ru**by **ma**chine **le**arning) is a machine learninig library in Ruby.
+Rumale (**Ru**by **ma**chine **le**arning) is a machine learning library in Ruby.
 Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
 Rumale supports Linear / Kernel Support Vector Machine,
 Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
 Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor classifier,
-K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
+K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
 This project was formerly known as "SVMKit".
 If you are using SVMKit, please install Rumale and replace `SVMKit` constants with `Rumale`.

data/lib/rumale.rb CHANGED

@@ -60,6 +60,7 @@ require 'rumale/clustering/k_means'
 require 'rumale/clustering/dbscan'
 require 'rumale/decomposition/pca'
 require 'rumale/decomposition/nmf'
+require 'rumale/manifold/tsne'
 require 'rumale/preprocessing/l2_normalizer'
 require 'rumale/preprocessing/min_max_scaler'
 require 'rumale/preprocessing/max_abs_scaler'

data/lib/rumale/manifold/tsne.rb ADDED

@@ -0,0 +1,238 @@
+# frozen_string_literal: true
+require 'rumale/base/base_estimator'
+require 'rumale/base/transformer'
+require 'rumale/utils'
+require 'rumale/pairwise_metric'
+require 'rumale/decomposition/pca'
+module Rumale
+  # Module for data embedding algorithms.
+  module Manifold
+    # TSNE is a class that implements t-Distributed Stochastic Neighbor Embedding (t-SNE)
+    # with fixed-point optimization algorithm.
+    # Fixed-point algorithm usually converges faster than gradient descent method and
+    # do not need the learning parameters such as the learning rate and momentum.
+    #
+    # @example
+    #   tsne = Rumale::Manifold::TSNE.new(perplexity: 40.0, init: 'pca', max_iter: 500, random_seed: 1)
+    #   representations = tsne.fit_transform(samples)
+    #
+    # *Reference*
+    # - L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. of Machine Learning Research, vol. 9, pp. 2579--2605, 2008.
+    # - Z. Yang, I. King, Z. Xu, and E. Oja, "Heavy-Tailed Symmetric Stochastic Neighbor Embedding," Proc. NIPS'09, pp. 2169--2177, 2009.
+    class TSNE
+      include Base::BaseEstimator
+      include Base::Transformer
+      # Return the data in representation space.
+      # @return [Numo::DFloat] (shape: [n_samples, n_components])
+      attr_reader :embedding
+      # Return the Kullback-Leibler divergence after optimization.
+      # @return [Float]
+      attr_reader :kl_divergence
+      # Return the number of iterations run for optimization
+      # @return [Integer]
+      attr_reader :n_iter
+      # Return the random generator.
+      # @return [Random]
+      attr_reader :rng
+      # Create a new transformer with t-SNE.
+      #
+      # @param n_components [Integer] The number of dimensions on representation space.
+      # @param perplexity [Float] The effective number of neighbors for each point. Perplexity are typically set from 5 to 50.
+      # @param metric [String] The metric to calculate the distances in original space.
+      #   If metric is 'euclidean', Euclidean distance is calculated for distance in original space.
+      #   If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
+      # @param init [String] The init is a method to initialize the representaion space.
+      #   If init is 'random', the representaion space is initialized with normal random variables.
+      #   If init is 'pca', the result of principal component analysis as the initial value of the representation space.
+      # @param max_iter [Integer] The maximum number of iterations.
+      # @param tol [Float] The tolerance of KL-divergence for terminating optimization.
+      #   If tol is nil,  it does not use KL divergence as a criterion for terminating the optimization.
+      # @param verbose [Boolean] The flag indicating whether to output KL divergence during iteration.
+      # @param random_seed [Integer] The seed value using to initialize the random generator.
+      def initialize(n_components: 2, perplexity: 30.0, metric: 'euclidean', init: 'random',
+                     max_iter: 500, tol: nil, verbose: false, random_seed: nil)
+        check_params_integer(n_components: n_components, max_iter: max_iter)
+        check_params_float(perplexity: perplexity)
+        check_params_string(metric: metric, init: init)
+        check_params_boolean(verbose: verbose)
+        check_params_type_or_nil(Float, tol: tol)
+        check_params_type_or_nil(Integer, random_seed: random_seed)
+        check_params_positive(n_components: n_components, perplexity: perplexity, max_iter: max_iter)
+        @params = {}
+        @params[:n_components] = n_components
+        @params[:perplexity] = perplexity
+        @params[:max_iter] = max_iter
+        @params[:tol] = tol
+        @params[:metric] = metric
+        @params[:init] = init
+        @params[:verbose] = verbose
+        @params[:random_seed] = random_seed
+        @params[:random_seed] ||= srand
+        @rng = Random.new(@params[:random_seed])
+        @embedding = nil
+        @kl_divergence = nil
+        @n_iter = nil
+      end
+      # Fit the model with given training data.
+      #
+      # @overload fit(x) -> TSNE
+      #
+      # @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
+      #   If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
+      # @return [TSNE] The learned transformer itself.
+      def fit(x, _not_used = nil)
+        check_sample_array(x)
+        raise ArgumentError, 'Expect the input distance matrix to be square.' if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
+        # initialize some varibales.
+        @n_iter = 0
+        distance_mat = @params[:metric] == 'precomputed' ? x**2 : Rumale::PairwiseMetric.squared_error(x)
+        hi_prob_mat = gaussian_distributed_probability_matrix(distance_mat)
+        y = init_embedding(x)
+        lo_prob_mat = t_distributed_probability_matrix(y)
+        # perform fixed-point optimization.
+        one_vec = Numo::DFloat.ones(x.shape[0]).expand_dims(1)
+        @params[:max_iter].times do |t|
+          break if terminate?(hi_prob_mat, lo_prob_mat)
+          a = hi_prob_mat * lo_prob_mat
+          b = lo_prob_mat * lo_prob_mat
+          y = (b.dot(one_vec) * y + (a - b).dot(y)) / a.dot(one_vec)
+          lo_prob_mat = t_distributed_probability_matrix(y)
+          @n_iter = t + 1
+          puts "[t-SNE] KL divergence after #{@n_iter} iterations: #{cost(hi_prob_mat, lo_prob_mat)}" if @params[:verbose] && (@n_iter % 100).zero?
+        end
+        # store results.
+        @embedding = y
+        @kl_divergence = cost(hi_prob_mat, lo_prob_mat)
+        self
+      end
+      # Fit the model with training data, and then transform them with the learned model.
+      #
+      # @overload fit_transform(x) -> Numo::DFloat
+      #
+      # @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
+      #   If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
+      # @return [Numo::DFloat] (shape: [n_samples, n_components]) The transformed data
+      def fit_transform(x, _not_used = nil)
+        fit(x)
+        @embedding.dup
+      end
+      # Dump marshal data.
+      # @return [Hash] The marshal data.
+      def marshal_dump
+        { params: @params,
+          embedding: @embedding,
+          kl_divergence: @kl_divergence,
+          n_iter: @n_iter,
+          rng: @rng }
+      end
+      # Load marshal data.
+      # @return [nil]
+      def marshal_load(obj)
+        @params = obj[:params]
+        @embedding = obj[:embedding]
+        @kl_divergence = obj[:kl_divergence]
+        @n_iter = obj[:n_iter]
+        @rng = obj[:rng]
+        nil
+      end
+      private
+      def init_embedding(x)
+        if @params[:init] == 'pca' && @params[:metric] == 'euclidean'
+          pca = Rumale::Decomposition::PCA.new(n_components: @params[:n_components], random_seed: @params[:random_seed])
+          pca.fit_transform(x)
+        else
+          n_samples = x.shape[0]
+          Rumale::Utils.rand_normal([n_samples, @params[:n_components]], @rng, 0, 0.0001)
+        end
+      end
+      def gaussian_distributed_probability_matrix(distance_mat)
+        # initialize some variables.
+        n_samples = distance_mat.shape[0]
+        prob_mat = Numo::DFloat.zeros(n_samples, n_samples)
+        sum_beta = 0.0
+        # calculate conditional probabilities.
+        n_samples.times do |n|
+          beta, probs = optimal_probabilities(n, distance_mat[n, true])
+          prob_mat[n, true] = probs
+          sum_beta += beta
+          puts "[t-SNE] Computed conditional probabilities for sample #{n + 1} / #{n_samples}" if @params[:verbose] && ((n + 1) % 1000).zero?
+        end
+        puts "[t-SNE] Mean sigma: #{Math.sqrt(n_samples.fdiv(sum_beta))}" if @params[:verbose]
+        # symmetrize and normalize probability matrix.
+        prob_mat[prob_mat.diag_indices(0)] = 0.0
+        prob_mat = 0.5 * (prob_mat + prob_mat.transpose)
+        prob_mat / prob_mat.sum
+      end
+      def optimal_probabilities(sample_id, distance_vec, max_iter = 100)
+        # initialize some variables.
+        probs = nil
+        beta = 1.0
+        betamin = Float::MIN
+        betamax = Float::MAX
+        init_entropy = Math.log(@params[:perplexity])
+        # calculate optimal beta and conditional probabilities with binary search.
+        max_iter.times do
+          entropy, probs = gaussian_distributed_probability_vector(sample_id, distance_vec, beta)
+          diff_entropy = entropy - init_entropy
+          break if diff_entropy.abs <= 1e-5
+          if diff_entropy.positive?
+            betamin = beta
+            if betamax == Float::MAX
+              beta *= 2.0
+            else
+              beta = 0.5 * (beta + betamax)
+            end
+          else
+            betamax = beta
+            if betamin == Float::MIN
+              beta /= 2.0
+            else
+              beta = 0.5 * (beta + betamin)
+            end
+          end
+        end
+        [beta, probs]
+      end
+      def gaussian_distributed_probability_vector(n, distance_vec, beta)
+        probs = Numo::NMath.exp(-beta * distance_vec)
+        probs[n] = 0.0
+        sum_probs = probs.sum
+        probs /= sum_probs
+        entropy = Math.log(sum_probs) + beta * (distance_vec * probs).sum
+        [entropy, probs]
+      end
+      def t_distributed_probability_matrix(y)
+        distance_mat = Rumale::PairwiseMetric.squared_error(y)
+        prob_mat = 1.0 / (1.0 + distance_mat)
+        prob_mat[prob_mat.diag_indices(0)] = 0.0
+        prob_mat / prob_mat.sum
+      end
+      def cost(p, q)
+        (p * Numo::NMath.log(Numo::DFloat.maximum(1e-20, p) / Numo::DFloat.maximum(1e-20, q))).sum
+      end
+      def terminate?(p, q)
+        return false if @params[:tol].nil?
+        cost(p, q) <= @params[:tol]
+      end
+    end
+  end
+end
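
The perplexity calibration in optimal_probabilities above can be illustrated without Numo. The following is a minimal sketch in plain Ruby, assuming squared distances are given as an Array; the helper name probabilities_for_perplexity and its keyword arguments are hypothetical and not part of Rumale. It uses nil instead of the Float::MIN / Float::MAX sentinels to mark an unset bound, but otherwise follows the same doubling and bisection scheme.

```ruby
# Hypothetical standalone helper (not part of Rumale): for one sample, search
# for the precision beta whose conditional distribution over the other samples
# has entropy log(perplexity).
def probabilities_for_perplexity(sq_dists, self_index, perplexity, max_iter: 100, tol: 1e-5)
  target = Math.log(perplexity)
  beta = 1.0
  lower = nil # largest beta seen so far that gave too high an entropy
  upper = nil # smallest beta seen so far that gave too low an entropy
  probs = nil
  max_iter.times do
    # Gaussian weights; the sample never picks itself as a neighbor.
    weights = sq_dists.each_with_index.map { |d, i| i == self_index ? 0.0 : Math.exp(-beta * d) }
    total = weights.sum
    probs = weights.map { |w| w / total }
    # Shannon entropy, written as log(Z) + beta * E[d] like the code in the diff.
    entropy = Math.log(total) + beta * sq_dists.zip(probs).sum { |d, p| d * p }
    diff = entropy - target
    break if diff.abs <= tol

    if diff.positive?
      # Distribution is too flat: increase beta.
      lower = beta
      beta = upper.nil? ? beta * 2.0 : 0.5 * (beta + upper)
    else
      # Distribution is too peaked: decrease beta.
      upper = beta
      beta = lower.nil? ? beta / 2.0 : 0.5 * (beta + lower)
    end
  end
  [beta, probs]
end

# Squared distances from sample 0 to five samples (including itself).
sq_dists = [0.0, 1.0, 4.0, 9.0, 16.0]
beta, probs = probabilities_for_perplexity(sq_dists, 0, 2.0)
# Recompute the perplexity directly from the definition as a check.
perplexity = Math.exp(-probs.reject(&:zero?).sum { |p| p * Math.log(p) })
```

With four candidate neighbors the reachable perplexity lies between 1 and 4, so a target of 2.0 is attainable here; a target outside that range would simply run until max_iter is exhausted.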

data/lib/rumale/pairwise_metric.rb CHANGED

@@ -15,13 +15,29 @@ module Rumale
         y = x if y.nil?
         Rumale::Validation.check_sample_array(x)
         Rumale::Validation.check_sample_array(y)
-        sum_x_vec = (x**2).sum(1)
-        sum_y_vec = (y**2).sum(1)
+        Numo::NMath.sqrt(squared_error(x, y).abs)
+      end
+      # Calculate the pairwise squared errors between x and y.
+      #
+      # @param x [Numo::DFloat] (shape: [n_samples_x, n_features])
+      # @param y [Numo::DFloat] (shape: [n_samples_y, n_features])
+      # @return [Numo::DFloat] (shape: [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y] if y is given)
+      def squared_error(x, y = nil)
+        y = x if y.nil?
+        Rumale::Validation.check_sample_array(x)
+        Rumale::Validation.check_sample_array(y)
+        # sum_x_vec = (x**2).sum(1)
+        # sum_y_vec = (y**2).sum(1)
+        # dot_xy_mat = x.dot(y.transpose)
+        # dot_xy_mat * -2.0 + sum_x_vec.tile(y.shape[0], 1).transpose + sum_y_vec.tile(x.shape[0], 1)
+        #
+        n_features = x.shape[1]
+        one_vec = Numo::DFloat.ones(n_features).expand_dims(1)
+        sum_x_vec = (x**2).dot(one_vec)
+        sum_y_vec = (y**2).dot(one_vec).transpose
         dot_xy_mat = x.dot(y.transpose)
-        distance_matrix = dot_xy_mat * -2.0 +
-                          sum_x_vec.tile(y.shape[0], 1).transpose +
-                          sum_y_vec.tile(x.shape[0], 1)
-        Numo::NMath.sqrt(distance_matrix.abs)
+        dot_xy_mat * -2.0 + sum_x_vec + sum_y_vec
       end
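
The new squared_error method relies on the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, and euclidean_distance then takes the square root of its absolute value. A minimal sketch of the same arithmetic in plain Ruby, assuming nested Arrays in place of Numo::DFloat (the function name pairwise_squared_error is hypothetical):

```ruby
# Hypothetical pure-Ruby counterpart of the expansion used in the diff.
# x and y are Arrays of rows; y defaults to x.
def pairwise_squared_error(x, y = nil)
  y ||= x
  sq_norm_x = x.map { |row| row.sum { |v| v * v } }
  sq_norm_y = y.map { |row| row.sum { |v| v * v } }
  x.each_with_index.map do |xi, i|
    y.each_with_index.map do |yj, j|
      dot = xi.zip(yj).sum { |a, b| a * b }
      # -2 x.y + ||x||^2 + ||y||^2
      -2.0 * dot + sq_norm_x[i] + sq_norm_y[j]
    end
  end
end

x = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
sq = pairwise_squared_error(x)
# Rounding can leave tiny negative values, hence abs before sqrt,
# as in the euclidean_distance line of the diff.
dist = sq.map { |row| row.map { |v| Math.sqrt(v.abs) } }
```

In the diff the row and column norms are combined by broadcasting a column vector against a row vector, which avoids the explicit tile calls of the previous version; the nested loops above are only meant to show the arithmetic.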
       # Calculate the rbf kernel between x and y.

data/lib/rumale/preprocessing/min_max_scaler.rb CHANGED

@@ -67,6 +67,7 @@ module Rumale
         check_sample_array(x)
         n_samples, = x.shape
         dif_vec = @max_vec - @min_vec
+        dif_vec[dif_vec.eq(0)] = 1.0
         nx = (x - @min_vec.tile(n_samples, 1)) / dif_vec.tile(n_samples, 1)
         nx * (@params[:feature_range][1] - @params[:feature_range][0]) + @params[:feature_range][0]
       end
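
The added line replaces zero ranges with 1.0 so that a feature with identical minimum and maximum no longer divides by zero. A minimal sketch of the effect in plain Ruby, assuming nested Arrays instead of Numo::DFloat (min_max_scale is a hypothetical helper, not the Rumale API):

```ruby
# Hypothetical pure-Ruby illustration of min-max scaling with the
# zero-range guard from the diff.
def min_max_scale(rows, feature_range = [0.0, 1.0])
  columns = rows.transpose
  mins = columns.map(&:min)
  ranges = columns.map { |c| c.max - c.min }
  # Same idea as dif_vec[dif_vec.eq(0)] = 1.0: a constant column has range 0,
  # and 0.0 / 0.0 would otherwise produce NaN.
  ranges = ranges.map { |r| r.zero? ? 1.0 : r }
  low, high = feature_range
  rows.map do |row|
    row.each_with_index.map { |v, j| (v - mins[j]) / ranges[j] * (high - low) + low }
  end
end

# The second feature is constant across all samples.
scaled = min_max_scale([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
```

With the guard in place, the constant feature maps to the lower bound of the feature range for every sample.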

data/lib/rumale/version.rb CHANGED

@@ -3,5 +3,5 @@
 # Rumale is a machine learning library in Ruby.
 module Rumale
   # The version of Rumale you are using.
-  VERSION = '0.9.2'
+  VERSION = '0.10.0'
 end

data/rumale.gemspec CHANGED

@@ -9,16 +9,16 @@ Gem::Specification.new do |spec|
   spec.email         = ['yoshoku@outlook.com']
   spec.summary       = <<MSG
-Rumale is a machine learninig library in Ruby.
+Rumale is a machine learning library in Ruby.
 Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
 MSG
   spec.description   = <<MSG
-Rumale is a machine learninig library in Ruby.
+Rumale is a machine learning library in Ruby.
 Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
 Rumale currently supports Linear / Kernel Support Vector Machine,
 Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
 Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
-K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
+K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
 MSG
   spec.homepage      = 'https://github.com/yoshoku/rumale'
   spec.license       = 'BSD-2-Clause'

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: rumale
 version: !ruby/object:Gem::Version
-  version: 0.9.2
+  version: 0.10.0
 platform: ruby
 authors:
 - yoshoku
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2019-05-11 00:00:00.000000000 Z
+date: 2019-05-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: numo-narray
@@ -95,12 +95,12 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '3.0'
 description: |
-  Rumale is a machine learninig library in Ruby.
+  Rumale is a machine learning library in Ruby.
   Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python.
   Rumale currently supports Linear / Kernel Support Vector Machine,
   Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine,
   Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor algorithm,
-  K-Means, DBSCAN, Principal Component Analysis, and Non-negative Matrix Factorization.
+  K-Means, DBSCAN, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
 email:
 - yoshoku@outlook.com
 executables: []
@@ -172,6 +172,7 @@ files:
 - lib/rumale/linear_model/ridge.rb
 - lib/rumale/linear_model/svc.rb
 - lib/rumale/linear_model/svr.rb
+- lib/rumale/manifold/tsne.rb
 - lib/rumale/model_selection/cross_validation.rb
 - lib/rumale/model_selection/grid_search_cv.rb
 - lib/rumale/model_selection/k_fold.rb
@@ -236,6 +237,6 @@ rubyforge_project:
 rubygems_version: 2.5.2.3
 signing_key:
 specification_version: 4
-summary: Rumale is a machine learninig library in Ruby. Rumale provides machine learning
+summary: Rumale is a machine learning library in Ruby. Rumale provides machine learning
   algorithms with interfaces similar to Scikit-Learn in Python.
 test_files: []