rumale-manifold 0.24.0
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +7 -0
- data/LICENSE.txt +27 -0
- data/README.md +34 -0
- data/lib/rumale/manifold/mds.rb +152 -0
- data/lib/rumale/manifold/tsne.rb +218 -0
- data/lib/rumale/manifold/version.rb +10 -0
- data/lib/rumale/manifold.rb +7 -0
- metadata +98 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA256:
  metadata.gz: 6c0ef66c09b01569728f5b856236aac366cd0048055edb91e26f4d206fe44033
  data.tar.gz: dc5fc6ad6423e6c9d65adbbee74cb79817b94001a1c8e0753b62f3fbfd5d8ded
SHA512:
  metadata.gz: 4601ea832a9583c8987a457ea7b4c9e4d9d094e041041903d05f554c8a4ec34324ed9d7c34282f9825a68eca67d5bc89e0082489362b93f173f4f944ddc896f5
  data.tar.gz: 20eaeed3f214929b208e4faa04929270b6a28c0b89b835f0098ce90a8877f912a5a476fa2dcebd2f62a06228a783cd0eda55fa955982cbb70dbcfaf165cf1042
data/LICENSE.txt
ADDED
@@ -0,0 +1,27 @@
Copyright (c) 2022 Atsushi Tatsuma
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.md
ADDED
@@ -0,0 +1,34 @@
|
# Rumale::Manifold

[![Gem Version](https://badge.fury.io/rb/rumale-manifold.svg)](https://badge.fury.io/rb/rumale-manifold)
[![BSD 3-Clause License](https://img.shields.io/badge/License-BSD%203--Clause-orange.svg)](https://gitlab.com/yoshoku/rumale/-/blob/main/rumale-manifold/LICENSE.txt)
[![Documentation](https://img.shields.io/badge/api-reference-blue.svg)](https://yoshoku.github.io/rumale/doc/Rumale/Manifold.html)

Rumale is a machine learning library in Ruby.
Rumale::Manifold provides data embedding algorithms,
such as Multi-dimensional Scaling and t-distributed Stochastic Neighbor Embedding,
with Rumale interface.

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'rumale-manifold'
```

And then execute:

    $ bundle install

Or install it yourself as:

    $ gem install rumale-manifold

## Documentation

- [Rumale API Documentation - Manifold](https://yoshoku.github.io/rumale/doc/Rumale/Manifold.html)

## License

The gem is available as open source under the terms of the [BSD-3-Clause License](https://opensource.org/licenses/BSD-3-Clause).
@@ -0,0 +1,152 @@
|
|
1
|
+
# frozen_string_literal: true

require 'rumale/base/estimator'
require 'rumale/base/transformer'
require 'rumale/utils'
require 'rumale/validation'
require 'rumale/pairwise_metric'
require 'rumale/decomposition/pca'

module Rumale
  module Manifold
    # MDS is a class that implements Metric Multidimensional Scaling (MDS)
    # with Scaling by MAjorizing a COmplicated Function (SMACOF) algorithm.
    #
    # @example
    #   require 'rumale/manifold/mds'
    #
    #   mds = Rumale::Manifold::MDS.new(init: 'pca', max_iter: 500, random_seed: 1)
    #   representations = mds.fit_transform(samples)
    #
    # *Reference*
    # - Groenen, P J. F. and van de Velden, M., "Multidimensional Scaling by Majorization: A Review," J. of Statistical Software, Vol. 73 (8), 2016.
    class MDS < ::Rumale::Base::Estimator
      include ::Rumale::Base::Transformer

      # Return the data in representation space.
      # @return [Numo::DFloat] (shape: [n_samples, n_components])
      attr_reader :embedding

      # Return the stress function value after optimization.
      # @return [Float]
      attr_reader :stress

      # Return the number of iterations run for optimization.
      # @return [Integer]
      attr_reader :n_iter

      # Return the random generator.
      # @return [Random]
      attr_reader :rng

      # Create a new transformer with MDS.
      #
      # @param n_components [Integer] The number of dimensions on representation space.
      # @param metric [String] The metric to calculate the distances in original space.
      #   If metric is 'euclidean', Euclidean distance is calculated for distance in original space.
      #   If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
      # @param init [String] The method used to initialize the representation space.
      #   If init is 'random', the representation space is initialized with uniform random variables.
      #   If init is 'pca', the result of principal component analysis is used as the initial value of the representation space.
      # @param max_iter [Integer] The maximum number of iterations.
      # @param tol [Float] The tolerance of stress value for terminating optimization.
      #   If tol is nil, the stress value is not used as a criterion for terminating the optimization.
      # @param verbose [Boolean] The flag indicating whether to output stress value during iteration.
      # @param random_seed [Integer] The seed value using to initialize the random generator.
      def initialize(n_components: 2, metric: 'euclidean', init: 'random',
                     max_iter: 300, tol: nil, verbose: false, random_seed: nil)
        super()
        @params = {
          n_components: n_components,
          max_iter: max_iter,
          tol: tol,
          metric: metric,
          init: init,
          verbose: verbose,
          random_seed: random_seed || srand
        }
        @rng = Random.new(@params[:random_seed])
      end

      # Fit the model with given training data.
      #
      # @overload fit(x) -> MDS
      #   @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
      #     If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
      # @return [MDS] The learned transformer itself.
      def fit(x, _not_used = nil)
        x = ::Rumale::Validation.check_convert_sample_array(x)
        if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
          raise ArgumentError, 'Expect the input distance matrix to be square.'
        end

        # Prepare the distance matrix in original space and an initial embedding.
        n_samples = x.shape[0]
        hi_distance_mat = @params[:metric] == 'precomputed' ? x : ::Rumale::PairwiseMetric.euclidean_distance(x)
        @embedding = init_embedding(x)
        lo_distance_mat = ::Rumale::PairwiseMetric.euclidean_distance(@embedding)
        @stress = calc_stress(hi_distance_mat, lo_distance_mat)
        @n_iter = 0
        # Optimize the embedding with the SMACOF majorization updates.
        @params[:max_iter].times do |iter|
          # Guttman transform: build the weighted update matrix from the
          # ratio of original-space to embedding-space distances.
          weight_mat = hi_distance_mat / lo_distance_mat
          weight_mat[weight_mat.diag_indices] = 0.0
          weight_mat[lo_distance_mat.eq(0)] = 0.0
          guttman_mat = -weight_mat
          guttman_mat[guttman_mat.diag_indices] += weight_mat.sum(axis: 1)
          @embedding = 1.fdiv(n_samples) * guttman_mat.dot(@embedding)
          lo_distance_mat = ::Rumale::PairwiseMetric.euclidean_distance(@embedding)
          # Check convergence on the stress value.
          updated_stress = calc_stress(hi_distance_mat, lo_distance_mat)
          if terminate?(@stress, updated_stress)
            @stress = updated_stress
            break
          end
          # Carry the state over to the next step.
          @n_iter = iter + 1
          @stress = updated_stress
          puts "[MDS] stress function after #{@n_iter} iterations: #{@stress}" if @params[:verbose] && (@n_iter % 100).zero?
        end
        self
      end

      # Fit the model with training data, and then transform them with the learned model.
      #
      # @overload fit_transform(x) -> Numo::DFloat
      #   @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
      #     If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
      # @return [Numo::DFloat] (shape: [n_samples, n_components]) The transformed data.
      def fit_transform(x, _not_used = nil)
        x = ::Rumale::Validation.check_convert_sample_array(x)

        fit(x)
        @embedding.dup
      end

      private

      # Build the initial embedding: PCA projection when requested and
      # applicable, otherwise uniform random coordinates in [-0.5, 0.5).
      def init_embedding(x)
        if @params[:init] == 'pca' && @params[:metric] == 'euclidean'
          ::Rumale::Decomposition::PCA.new(n_components: @params[:n_components],
                                           random_seed: @params[:random_seed]).fit_transform(x)
        else
          sub_rng = @rng.dup
          ::Rumale::Utils.rand_uniform([x.shape[0], @params[:n_components]], sub_rng) - 0.5
        end
      end

      # Decide whether optimization should stop based on the stress change.
      # Termination is disabled when tol is nil or no previous stress exists.
      def terminate?(old_stress, new_stress)
        return false if @params[:tol].nil? || old_stress.nil?

        (old_stress - new_stress).abs <= @params[:tol]
      end

      # Raw stress: half the sum of squared differences between the
      # original-space and embedding-space distance matrices.
      def calc_stress(hi_distance_mat, lo_distance_mat)
        ((hi_distance_mat - lo_distance_mat)**2).sum.fdiv(2)
      end
    end
  end
end
@@ -0,0 +1,218 @@
|
|
1
|
+
# frozen_string_literal: true

require 'rumale/base/estimator'
require 'rumale/base/transformer'
require 'rumale/utils'
require 'rumale/validation'
require 'rumale/pairwise_metric'
require 'rumale/decomposition/pca'

module Rumale
  module Manifold
    # TSNE is a class that implements t-Distributed Stochastic Neighbor Embedding (t-SNE)
    # with fixed-point optimization algorithm.
    # Fixed-point algorithm usually converges faster than gradient descent method and
    # does not need the learning parameters such as the learning rate and momentum.
    #
    # @example
    #   require 'rumale/manifold/tsne'
    #
    #   tsne = Rumale::Manifold::TSNE.new(perplexity: 40.0, init: 'pca', max_iter: 500, random_seed: 1)
    #   representations = tsne.fit_transform(samples)
    #
    # *Reference*
    # - van der Maaten, L., and Hinton, G., "Visualizing data using t-SNE," J. of Machine Learning Research, vol. 9, pp. 2579--2605, 2008.
    # - Yang, Z., King, I., Xu, Z., and Oja, E., "Heavy-Tailed Symmetric Stochastic Neighbor Embedding," Proc. NIPS'09, pp. 2169--2177, 2009.
    class TSNE < ::Rumale::Base::Estimator
      include ::Rumale::Base::Transformer

      # Return the data in representation space.
      # @return [Numo::DFloat] (shape: [n_samples, n_components])
      attr_reader :embedding

      # Return the Kullback-Leibler divergence after optimization.
      # @return [Float]
      attr_reader :kl_divergence

      # Return the number of iterations run for optimization.
      # @return [Integer]
      attr_reader :n_iter

      # Return the random generator.
      # @return [Random]
      attr_reader :rng

      # Create a new transformer with t-SNE.
      #
      # @param n_components [Integer] The number of dimensions on representation space.
      # @param perplexity [Float] The effective number of neighbors for each point. Perplexity are typically set from 5 to 50.
      # @param metric [String] The metric to calculate the distances in original space.
      #   If metric is 'euclidean', Euclidean distance is calculated for distance in original space.
      #   If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
      # @param init [String] The method used to initialize the representation space.
      #   If init is 'random', the representation space is initialized with normal random variables.
      #   If init is 'pca', the result of principal component analysis is used as the initial value of the representation space.
      # @param max_iter [Integer] The maximum number of iterations.
      # @param tol [Float] The tolerance of KL-divergence for terminating optimization.
      #   If tol is nil, KL divergence is not used as a criterion for terminating the optimization.
      # @param verbose [Boolean] The flag indicating whether to output KL divergence during iteration.
      # @param random_seed [Integer] The seed value using to initialize the random generator.
      def initialize(n_components: 2, perplexity: 30.0, metric: 'euclidean', init: 'random',
                     max_iter: 500, tol: nil, verbose: false, random_seed: nil)
        super()
        @params = {
          n_components: n_components,
          perplexity: perplexity,
          max_iter: max_iter,
          tol: tol,
          metric: metric,
          init: init,
          verbose: verbose,
          random_seed: random_seed || srand
        }
        @rng = Random.new(@params[:random_seed])
      end

      # Fit the model with given training data.
      #
      # @overload fit(x) -> TSNE
      #   @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
      #     If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
      # @return [TSNE] The learned transformer itself.
      def fit(x, _not_used = nil)
        x = ::Rumale::Validation.check_convert_sample_array(x)
        if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
          raise ArgumentError, 'Expect the input distance matrix to be square.'
        end

        # Prepare the high-dimensional affinities and an initial embedding.
        @n_iter = 0
        distance_mat = @params[:metric] == 'precomputed' ? x**2 : ::Rumale::PairwiseMetric.squared_error(x)
        hi_prob_mat = gaussian_distributed_probability_matrix(distance_mat)
        embedded = init_embedding(x)
        lo_prob_mat = t_distributed_probability_matrix(embedded)
        # Perform fixed-point optimization of the embedding.
        one_vec = Numo::DFloat.ones(x.shape[0]).expand_dims(1)
        @params[:max_iter].times do |iter|
          break if terminate?(hi_prob_mat, lo_prob_mat)

          pq_mat = hi_prob_mat * lo_prob_mat
          qq_mat = lo_prob_mat**2
          embedded = (qq_mat.dot(one_vec) * embedded + (pq_mat - qq_mat).dot(embedded)) / pq_mat.dot(one_vec)
          lo_prob_mat = t_distributed_probability_matrix(embedded)
          @n_iter = iter + 1
          if @params[:verbose] && (@n_iter % 100).zero?
            puts "[t-SNE] KL divergence after #{@n_iter} iterations: #{cost(hi_prob_mat, lo_prob_mat)}"
          end
        end
        # Store the results.
        @embedding = embedded
        @kl_divergence = cost(hi_prob_mat, lo_prob_mat)
        self
      end

      # Fit the model with training data, and then transform them with the learned model.
      #
      # @overload fit_transform(x) -> Numo::DFloat
      #   @param x [Numo::DFloat] (shape: [n_samples, n_features]) The training data to be used for fitting the model.
      #     If the metric is 'precomputed', x must be a square distance matrix (shape: [n_samples, n_samples]).
      # @return [Numo::DFloat] (shape: [n_samples, n_components]) The transformed data.
      def fit_transform(x, _not_used = nil)
        x = ::Rumale::Validation.check_convert_sample_array(x)

        fit(x)
        @embedding.dup
      end

      private

      # Build the initial embedding: PCA projection when requested and
      # applicable, otherwise small-variance Gaussian random coordinates.
      def init_embedding(x)
        if @params[:init] == 'pca' && @params[:metric] == 'euclidean'
          ::Rumale::Decomposition::PCA.new(n_components: @params[:n_components],
                                           random_seed: @params[:random_seed]).fit_transform(x)
        else
          sub_rng = @rng.dup
          ::Rumale::Utils.rand_normal([x.shape[0], @params[:n_components]], sub_rng, 0, 0.0001)
        end
      end

      # Compute the symmetrized, normalized Gaussian affinity matrix P from
      # the squared distance matrix in original space.
      def gaussian_distributed_probability_matrix(distance_mat)
        n_samples = distance_mat.shape[0]
        prob_mat = Numo::DFloat.zeros(n_samples, n_samples)
        sum_beta = 0.0
        # Find the per-sample precision (beta) matching the target perplexity.
        n_samples.times do |n|
          beta, conditional_probs = optimal_probabilities(n, distance_mat[n, true])
          prob_mat[n, true] = conditional_probs
          sum_beta += beta
          puts "[t-SNE] Computed conditional probabilities for sample #{n + 1} / #{n_samples}" if @params[:verbose] && ((n + 1) % 1000).zero?
        end
        puts "[t-SNE] Mean sigma: #{Math.sqrt(n_samples.fdiv(sum_beta))}" if @params[:verbose]
        # Symmetrize and normalize the probability matrix.
        prob_mat[prob_mat.diag_indices(0)] = 0.0
        prob_mat = 0.5 * (prob_mat + prob_mat.transpose)
        prob_mat / prob_mat.sum
      end

      # Binary-search the Gaussian precision (beta) for one sample so that the
      # entropy of its conditional distribution matches log(perplexity).
      def optimal_probabilities(sample_id, distance_vec, max_iter = 100)
        probs = nil
        beta = 1.0
        beta_min = Float::MIN
        beta_max = Float::MAX
        init_entropy = Math.log(@params[:perplexity])
        max_iter.times do
          entropy, probs = gaussian_distributed_probability_vector(sample_id, distance_vec, beta)
          diff_entropy = entropy - init_entropy
          break if diff_entropy.abs <= 1e-5

          if diff_entropy.positive?
            # Entropy too high: increase precision.
            beta_min = beta
            beta = beta_max == Float::MAX ? beta * 2.0 : 0.5 * (beta + beta_max)
          else
            # Entropy too low: decrease precision.
            beta_max = beta
            beta = beta_min == Float::MIN ? beta / 2.0 : 0.5 * (beta + beta_min)
          end
        end
        [beta, probs]
      end

      # Conditional Gaussian distribution (excluding self-affinity) and its
      # Shannon entropy for the given precision beta.
      def gaussian_distributed_probability_vector(n, distance_vec, beta)
        probs = Numo::NMath.exp(-beta * distance_vec)
        probs[n] = 0.0
        sum_probs = probs.sum
        probs /= sum_probs
        entropy = Math.log(sum_probs) + beta * (distance_vec * probs).sum
        [entropy, probs]
      end

      # Student's t-distribution affinity matrix Q over the embedding.
      def t_distributed_probability_matrix(y)
        distance_mat = ::Rumale::PairwiseMetric.squared_error(y)
        prob_mat = 1.0 / (1.0 + distance_mat)
        prob_mat[prob_mat.diag_indices(0)] = 0.0
        prob_mat / prob_mat.sum
      end

      # Kullback-Leibler divergence KL(P || Q), clipped to avoid log(0).
      def cost(p, q)
        (p * Numo::NMath.log(Numo::DFloat.maximum(1e-20, p) / Numo::DFloat.maximum(1e-20, q))).sum
      end

      # Decide whether optimization should stop based on the KL divergence.
      # Termination is disabled when tol is nil.
      def terminate?(p, q)
        return false if @params[:tol].nil?

        cost(p, q) <= @params[:tol]
      end
    end
  end
end
metadata
ADDED
@@ -0,0 +1,98 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: rumale-manifold
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.24.0
|
5
|
+
platform: ruby
|
6
|
+
authors:
|
7
|
+
- yoshoku
|
8
|
+
autorequire:
|
9
|
+
bindir: exe
|
10
|
+
cert_chain: []
|
11
|
+
date: 2022-12-31 00:00:00.000000000 Z
|
12
|
+
dependencies:
|
13
|
+
- !ruby/object:Gem::Dependency
|
14
|
+
name: numo-narray
|
15
|
+
requirement: !ruby/object:Gem::Requirement
|
16
|
+
requirements:
|
17
|
+
- - ">="
|
18
|
+
- !ruby/object:Gem::Version
|
19
|
+
version: 0.9.1
|
20
|
+
type: :runtime
|
21
|
+
prerelease: false
|
22
|
+
version_requirements: !ruby/object:Gem::Requirement
|
23
|
+
requirements:
|
24
|
+
- - ">="
|
25
|
+
- !ruby/object:Gem::Version
|
26
|
+
version: 0.9.1
|
27
|
+
- !ruby/object:Gem::Dependency
|
28
|
+
name: rumale-core
|
29
|
+
requirement: !ruby/object:Gem::Requirement
|
30
|
+
requirements:
|
31
|
+
- - "~>"
|
32
|
+
- !ruby/object:Gem::Version
|
33
|
+
version: 0.24.0
|
34
|
+
type: :runtime
|
35
|
+
prerelease: false
|
36
|
+
version_requirements: !ruby/object:Gem::Requirement
|
37
|
+
requirements:
|
38
|
+
- - "~>"
|
39
|
+
- !ruby/object:Gem::Version
|
40
|
+
version: 0.24.0
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: rumale-decomposition
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - "~>"
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: 0.24.0
|
48
|
+
type: :runtime
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - "~>"
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: 0.24.0
|
55
|
+
description: |
|
56
|
+
Rumale::Manifold provides data embedding algorithms,
|
57
|
+
such as Multi-dimensional Scaling and t-distributed Stochastic Neighbor Embedding,
|
58
|
+
with Rumale interface.
|
59
|
+
email:
|
60
|
+
- yoshoku@outlook.com
|
61
|
+
executables: []
|
62
|
+
extensions: []
|
63
|
+
extra_rdoc_files: []
|
64
|
+
files:
|
65
|
+
- LICENSE.txt
|
66
|
+
- README.md
|
67
|
+
- lib/rumale/manifold.rb
|
68
|
+
- lib/rumale/manifold/mds.rb
|
69
|
+
- lib/rumale/manifold/tsne.rb
|
70
|
+
- lib/rumale/manifold/version.rb
|
71
|
+
homepage: https://gitlab.com/yoshoku/rumale
|
72
|
+
licenses:
|
73
|
+
- BSD-3-Clause
|
74
|
+
metadata:
|
75
|
+
homepage_uri: https://gitlab.com/yoshoku/rumale
|
76
|
+
source_code_uri: https://gitlab.com/yoshoku/rumale/-/tree/main/rumale-manifold
|
77
|
+
changelog_uri: https://gitlab.com/yoshoku/rumale/-/blob/main/CHANGELOG.md
|
78
|
+
rubygems_mfa_required: 'true'
|
79
|
+
post_install_message:
|
80
|
+
rdoc_options: []
|
81
|
+
require_paths:
|
82
|
+
- lib
|
83
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
84
|
+
requirements:
|
85
|
+
- - ">="
|
86
|
+
- !ruby/object:Gem::Version
|
87
|
+
version: '0'
|
88
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
89
|
+
requirements:
|
90
|
+
- - ">="
|
91
|
+
- !ruby/object:Gem::Version
|
92
|
+
version: '0'
|
93
|
+
requirements: []
|
94
|
+
rubygems_version: 3.3.26
|
95
|
+
signing_key:
|
96
|
+
specification_version: 4
|
97
|
+
summary: Rumale::Manifold provides data embedding algorithms with Rumale interface.
|
98
|
+
test_files: []
|