RubyGems - k_means_pp - Versions diffs - 0.0.1 - Mend

k_means_pp 0.0.1


Files changed (24)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 25114f424713579b656eddb1a275a59103a860c5
+  data.tar.gz: f3ef959b9a7044a048903c44b2c55a4c0bc11583
+SHA512:
+  metadata.gz: ea81b5f48c62e0654cee9a635ffbc16e829c5129e1e26a02477217287a517416ce5adfe41a150199acc87dec26d335b9d89d738ea94bd31441ce419002d23625
+  data.tar.gz: 9587a9d3a4d0c7d6d3e14d945d92bab9c81e9779699ef4f53dc8e232ea56f073d2fae1f0baad6c96ea2af6fc0ae01bd709964491ed9333f6e379e530acaf6bc4

data/.gitignore ADDED

@@ -0,0 +1,16 @@
+/*.gem
+/examples/report-*.*
+/.bundle/
+/.yardoc
+/Gemfile.lock
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+*.bundle
+*.so
+*.o
+*.a
+mkmf.log

data/.rspec ADDED

@@ -0,0 +1,2 @@
+--color
+--format documentation

data/.yardopts ADDED

@@ -0,0 +1,7 @@
+--protected
+--private
+--embed-mixins
+lib/**/*.rb
+-
+README.md
+LICENSE.txt

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in k_means_pp.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2014 Oldrich Vetesnik
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,161 @@
+# KMeansPP
+## What's this?
+This is a Ruby implementation of the k-means++ algorithm for data clustering.
+In other words: Grouping a bunch of X, Y points into K groups.
+The code is a port of the Python version on [rosettacode.org][rosetta].
+### K-means++ (from [Wikipedia][kmeans++])
+> In data mining, k-means++ is an algorithm for choosing the initial values (or
+> "seeds") for the k-means clustering algorithm. It was proposed in 2007 by
+> David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the
+> NP-hard k-means problem—a way of avoiding the sometimes poor clusterings found
+> by the standard k-means algorithm.
+>
+> [...]
+>
+> The k-means problem is to find cluster centers that minimize the intra-class
+> variance, i.e. the sum of squared distances from each data point being
+> clustered to its cluster center (the center that is closest to it). Although
+> finding an exact solution to the k-means problem for arbitrary input is
+> NP-hard the standard approach to finding an approximate solution (often
+> called [Lloyd's algorithm][lloyd] or the k-means algorithm) is used widely and
+> frequently finds reasonable solutions quickly.
+### K-means (from [Wikipedia][kmeans])
+> k-means clustering is a method of vector quantization, originally from signal
+> processing, that is popular for cluster analysis in data mining. k-means
+> clustering aims to partition n observations into k clusters in which each
+> observation belongs to the cluster with the nearest mean, serving as a
+> prototype of the cluster. This results in a partitioning of the data space
+> into Voronoi cells.
+## Usage
+See examples, too.
+```ruby
+points = [
+  [0.3968, 1.9431],
+  [9.3348, 6.7843],
+  [9.2882, 8.1347],
+  [7.6768, 2.7362],
+  [3.4434, 4.1910],
+  [1.8097, 5.0884],
+  [7.0698, 3.9285],
+  [9.3820, 7.6790],
+  [8.6092, 0.9651],
+  [9.1981, 7.7493]
+]
+clusters = KMeansPP.clusters(points, 3)
+plot clusters
+puts clusters
+# Cluster (7.785266666666668, 2.5432666666666663): [
+#   [7.6768, 2.7362],
+#   [7.0698, 3.9285],
+#   [8.6092, 0.9651],
+# ]
+# Cluster (9.300774999999998, 7.586824999999999): [
+#   [9.3348, 6.7843],
+#   [9.2882, 8.1347],
+#   [9.382, 7.679],
+#   [9.1981, 7.7493],
+# ]
+# Cluster (1.8833, 3.7408333333333332): [
+#   [0.3968, 1.9431],
+#   [3.4434, 4.191],
+#   [1.8097, 5.0884],
+# ]
+cluster = clusters.first
+p cluster.centroid.x # 7.785266666666668
+p cluster.centroid.y # 2.5432666666666663
+p cluster.points     # [[7.6768, 2.7362], [7.0698, 3.9285], [8.6092, 0.9651]]
+```
+Or with custom structure:
+```ruby
+points = [
+  { x: 0.3968, y: 1.9431 },
+  { x: 9.3348, y: 6.7843 },
+  { x: 9.2882, y: 8.1347 },
+  { x: 7.6768, y: 2.7362 },
+  { x: 3.4434, y: 4.1910 },
+  { x: 1.8097, y: 5.0884 },
+  { x: 7.0698, y: 3.9285 },
+  { x: 9.3820, y: 7.6790 },
+  { x: 8.6092, y: 0.9651 },
+  { x: 9.1981, y: 7.7493 }
+]
+clusters = KMeansPP.clusters(points, 3) do |point|
+  [point[:x], point[:y]]
+end
+puts clusters
+# Cluster (9.300774999999998, 7.586824999999999): [
+#   {:x=>9.3348, :y=>6.7843},
+#   {:x=>9.2882, :y=>8.1347},
+#   {:x=>9.382, :y=>7.679},
+#   {:x=>9.1981, :y=>7.7493},
+# ]
+# Cluster (1.8833, 3.7408333333333332): [
+#   {:x=>0.3968, :y=>1.9431},
+#   {:x=>3.4434, :y=>4.191},
+#   {:x=>1.8097, :y=>5.0884},
+# ]
+# Cluster (7.785266666666668, 2.5432666666666663): [
+#   {:x=>7.6768, :y=>2.7362},
+#   {:x=>7.0698, :y=>3.9285},
+#   {:x=>8.6092, :y=>0.9651},
+# ]
+```
+## Running examples
+To run the examples, you need the `gnuplot` program and the `gnuplot` gem.
+Don't forget the `--with-x` flag; otherwise nothing will be displayed.
+    $ brew install gnuplot --with-x # Assuming OS X
+    $ gem install gnuplot
+    $ cd examples
+    $ ruby example_simple.rb
+    $ ruby example_block.rb
+    $ ruby example_csv.rb
+    $ ruby example_huge.rb
+    $ ruby example_debug.rb # Generates profiler reports
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'k_means_pp'
+```
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install k_means_pp
+## Contributing
+1. Fork it (https://github.com/ollie/k_means_pp/fork)
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create a new Pull Request
+[rosetta]:  http://rosettacode.org/wiki/K-means%2B%2B_clustering#Python
+[kmeans++]: https://en.wikipedia.org/wiki/K-means%2B%2B
+[kmeans]:   https://en.wikipedia.org/wiki/K-means_clustering
+[lloyd]:    https://en.wikipedia.org/wiki/Lloyd%27s_algorithm
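
The README's Wikipedia excerpt defines the k-means objective as the sum of squared distances from each point to its cluster center. Below is a minimal Ruby sketch of that quantity, using the first cluster from the README's sample output. It is not part of the gem: `mean_point` and `sum_of_squares` are hypothetical helper names, and `Array#sum` assumes Ruby 2.4 or newer.

```ruby
# Points of the first cluster shown in the README's sample output.
cluster_points = [[7.6768, 2.7362], [7.0698, 3.9285], [8.6092, 0.9651]]

# Hypothetical helper: the centroid is the per-axis mean of the points.
def mean_point(points)
  n = points.size.to_f
  [points.sum { |p| p[0] } / n, points.sum { |p| p[1] } / n]
end

# Hypothetical helper: sum of squared distances from each point to `center`.
def sum_of_squares(points, center)
  points.sum { |(x, y)| (x - center[0])**2 + (y - center[1])**2 }
end

center = mean_point(cluster_points)
wcss   = sum_of_squares(cluster_points, center)
puts "center=#{center.inspect} within-cluster sum of squares=#{wcss}"
```

The computed center should agree with the centroid printed in the README up to floating-point rounding. Moving the center anywhere else increases the sum, which is why Lloyd's algorithm recomputes each centroid as the mean of its points.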

data/Rakefile ADDED

@@ -0,0 +1,18 @@
+task default: :combo
+desc 'Run tests, rubocop and generate documentation'
+task :combo do
+  sh 'bundle exec rspec'
+  sh('bundle exec rubocop') {} # ignore status > 0
+  sh 'bundle exec yardoc'
+end
+desc 'Same as :combo but build a gem, too'
+task mega_combo: :combo do
+  sh 'gem build k_means_pp.gemspec'
+end
+desc 'Start a console'
+task :console do
+  sh 'bundle exec pry -I ./lib -r ./lib/k_means_pp.rb'
+end

data/examples/common.rb ADDED

@@ -0,0 +1,34 @@
+require 'gnuplot'
+# Plot and display data on the screen.
+#
+# @param clusters [Array<Cluster>]
+def plot(clusters)
+  # Graph output by running gnuplot pipe
+  Gnuplot.open do |gp|
+    # Start a new plot
+    Gnuplot::Plot.new(gp) do |plot|
+      # Plot each cluster's points
+      clusters.each do |cluster|
+        # Collect all x and y coords for this cluster
+        x = cluster.points.map { |p| p[0] }
+        y = cluster.points.map { |p| p[1] }
+        # Plot w/o a title (clutters things up)
+        plot.data << Gnuplot::DataSet.new([x, y]) do |ds|
+          ds.notitle
+        end
+        # Centroid point as bigger black points
+        x = [cluster.centroid.x]
+        y = [cluster.centroid.y]
+        plot.data << Gnuplot::DataSet.new([x, y]) do |ds|
+          ds.notitle
+          ds.linecolor = '000000'
+          ds.linewidth = 3
+        end
+      end
+    end
+  end
+end

data/examples/example_block.rb ADDED

@@ -0,0 +1,23 @@
+$LOAD_PATH.unshift('../lib')
+require 'bundler/setup'
+require 'k_means_pp'
+points = [
+  { x: 0.3968, y: 1.9431 },
+  { x: 9.3348, y: 6.7843 },
+  { x: 9.2882, y: 8.1347 },
+  { x: 7.6768, y: 2.7362 },
+  { x: 3.4434, y: 4.1910 },
+  { x: 1.8097, y: 5.0884 },
+  { x: 7.0698, y: 3.9285 },
+  { x: 9.3820, y: 7.6790 },
+  { x: 8.6092, y: 0.9651 },
+  { x: 9.1981, y: 7.7493 }
+]
+clusters = KMeansPP.clusters(points, 3) do |point|
+  [point[:x], point[:y]]
+end
+puts clusters

data/examples/example_csv.rb ADDED

@@ -0,0 +1,15 @@
+$LOAD_PATH.unshift('../lib')
+require 'bundler/setup'
+require 'k_means_pp'
+require './common'
+require 'csv'
+points = CSV.foreach('points.csv').map do |row|
+  [row[0].to_f, row[1].to_f]
+end
+clusters = KMeansPP.clusters(points, 3)
+plot clusters
+puts clusters

data/examples/example_debug.rb ADDED

@@ -0,0 +1,47 @@
+$LOAD_PATH.unshift('../lib')
+require 'bundler/setup'
+require 'k_means_pp'
+# require './common'
+require 'ruby-prof'
+# Generate an array of random n points around origin.
+#
+# @param n      [Fixnum] Number of points to generate.
+# @param radius [Fixnum] How far to go from origin.
+#
+# @return [Array<Array>]
+def generate_points(n, radius)
+  n.times.map do
+    random_radius = rand * radius
+    random_angle  = rand * 2 * Math::PI
+    x             = random_radius * Math.cos(random_angle)
+    y             = random_radius * Math.sin(random_angle)
+    [x, y]
+  end
+end
+clusters = nil
+result = RubyProf.profile do
+  points   = generate_points(100, 10)
+  clusters = KMeansPP.clusters(points, 5)
+end
+printer = RubyProf::FlatPrinter.new(result)
+printer.print(File.open('report-flat.txt', 'w'), min_percent: 2)
+printer = RubyProf::GraphPrinter.new(result)
+printer.print(File.open('report-graph.txt', 'w'), min_percent: 2)
+printer = RubyProf::GraphHtmlPrinter.new(result)
+printer.print(File.open('report-graph.html', 'w'), min_percent: 2)
+printer = RubyProf::DotPrinter.new(result)
+printer.print(File.open('report-dot.dot', 'w'), min_percent: 2)
+# Then run:
+# dot -Tpng report-dot.dot > report-graph.png
+# plot(clusters)

data/examples/example_huge.rb ADDED

@@ -0,0 +1,27 @@
+$LOAD_PATH.unshift('../lib')
+require 'bundler/setup'
+require 'k_means_pp'
+require './common'
+# Generate an array of random n points around origin.
+#
+# @param n      [Fixnum] Number of points to generate.
+# @param radius [Fixnum] How far to go from origin.
+#
+# @return [Array<Array>]
+def generate_points(n, radius)
+  n.times.map do
+    random_radius = rand * radius
+    random_angle  = rand * 2 * Math::PI
+    x             = random_radius * Math.cos(random_angle)
+    y             = random_radius * Math.sin(random_angle)
+    [x, y]
+  end
+end
+points   = generate_points(30_000, 10)
+clusters = KMeansPP.clusters(points, 7)
+plot clusters

data/examples/example_simple.rb ADDED

@@ -0,0 +1,28 @@
+$LOAD_PATH.unshift('../lib')
+require 'bundler/setup'
+require 'k_means_pp'
+require './common'
+points = [
+  [0.3968, 1.9431],
+  [9.3348, 6.7843],
+  [9.2882, 8.1347],
+  [7.6768, 2.7362],
+  [3.4434, 4.1910],
+  [1.8097, 5.0884],
+  [7.0698, 3.9285],
+  [9.3820, 7.6790],
+  [8.6092, 0.9651],
+  [9.1981, 7.7493]
+]
+clusters = KMeansPP.clusters(points, 3)
+plot clusters
+puts clusters
+cluster = clusters.first
+p cluster.centroid.x
+p cluster.centroid.y
+p cluster.points

data/examples/points.csv ADDED

@@ -0,0 +1,100 @@
+48.2641334571,86.4516903905
+0.114004262656,35.8368597414
+97.4319168245,92.8009240744
+24.4614031388,18.3292584382
+36.2367675367,32.8294024271
+75.5836860736,68.30729977
+38.6577034445,25.7701728584
+28.2607136287,64.4493377817
+61.5358486771,61.2195232194
+1.52352224798,38.5083779618
+11.6392182793,68.2369021579
+53.9486870607,53.9136556533
+14.6671651772,26.0132534731
+65.9506725878,82.5639317581
+58.3682872339,51.6414580337
+12.6918921252,2.28888447759
+31.7587852231,18.1368234166
+63.6631115204,24.933301389
+29.1652289905,34.456759171
+44.3830953085,70.4813875779
+47.0571691145,65.3507625811
+74.0584537502,98.2271944247
+55.8929146157,86.6196265477
+20.4744253473,12.0025149302
+14.2867767281,40.2850440995
+40.43551369,94.5410407116
+87.6178871195,12.4700151639
+47.2703048197,93.0636237124
+59.7895104175,69.2621288413
+80.8612333922,42.9183411179
+31.1271795535,55.6669044656
+78.9671049353,65.833739365
+39.8324533414,63.0343115139
+79.126343548,14.9128874133
+65.8152400306,77.5202358013
+75.2762752704,42.4858435609
+29.6475948493,61.2068411763
+67.421857106,54.8955604259
+10.4652931501,29.7954139372
+32.0272462745,99.5422900971
+80.1520927001,84.2710379142
+2.27240208403,41.2138854089
+44.4601509555,1.72563901513
+16.8676021068,35.3415636277
+58.1977544121,29.2752085455
+24.6119080085,39.9440735137
+63.0759798755,60.9841014448
+30.9289119657,95.0173219502
+8.54972950047,41.7384441737
+61.2606910793,4.06738902059
+83.2302091964,11.6373312879
+89.4443065362,42.5694882801
+24.5619318152,97.7947977804
+50.3134024475,40.6429336223
+58.1422402033,36.1112632557
+32.0668520827,29.9924151435
+89.6057447137,84.9532177777
+9.8876440816,18.2540486261
+17.9670383961,47.596032257
+50.2977668282,93.6851189223
+98.0700386253,86.5816924579
+10.8175290981,26.4344732252
+34.7463851288,24.4154447141
+92.5470100593,17.3595513748
+79.0426629356,4.59850018907
+89.9791366918,29.523946842
+3.89920214563,91.3650215111
+35.4669861576,62.1865368798
+2.78150918086,24.5280230552
+50.0390951889,57.0414421682
+64.4521660758,48.4962172448
+94.4915452316,56.6508179406
+47.1655534769,15.8292055671
+94.2027011374,45.6802385454
+30.5846324871,54.783635876
+57.7043252948,0.286661610381
+41.7908674949,14.7206014023
+59.6689465934,64.8849831965
+92.2553335495,55.9096460272
+48.493467262,69.4766837809
+23.1837859581,71.4406867443
+29.0737623652,66.9391416961
+95.7442323112,89.4677505059
+68.7707275828,40.9900140055
+84.5445737133,32.1707309618
+67.4126251988,56.6710579117
+10.688352016,28.1745892928
+56.7620324155,18.3034334207
+50.6751320678,86.6916908032
+74.6185482896,34.022483532
+20.7011996002,32.855295357
+11.479054664,1.59204297586
+51.6805387648,25.4063026358
+84.4109522357,47.237632645
+90.6395051745,57.7917166935
+58.6159601042,84.1226173848
+46.2184509277,28.559934585
+97.0302485783,41.3135022812
+31.3144587058,87.2459910122
+5.93357833962,95.6812831872

data/k_means_pp.gemspec ADDED

@@ -0,0 +1,38 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'k_means_pp/version'
+Gem::Specification.new do |spec|
+  spec.name          = 'k_means_pp'
+  spec.version       = KMeansPP::VERSION
+  spec.authors       = ['Oldrich Vetesnik']
+  spec.email         = ['oldrich.vetesnik@gmail.com']
+  spec.summary       = 'K-means++ Algorithm Implementation.'
+  spec.description   = 'This is a Ruby implementation of the k-means++ ' \
+                       'algorithm for data clustering. In other words: ' \
+                       'Grouping a bunch of X, Y points into K groups.'
+  spec.homepage      = 'https://github.com/ollie/k_means_pp'
+  spec.license       = 'MIT'
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(/^bin\//) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(/^(test|spec|features)\//)
+  spec.require_paths = ['lib']
+  # System
+  spec.add_development_dependency 'bundler', '~> 1.7'
+  # Test
+  spec.add_development_dependency 'rspec',     '~> 3.1'
+  spec.add_development_dependency 'simplecov', '~> 0.9'
+  # Code style, debugging, docs
+  spec.add_development_dependency 'yard',       '~> 0.8'
+  spec.add_development_dependency 'rake',       '~> 10.3'
+  spec.add_development_dependency 'rubocop',    '~> 0.26'
+  spec.add_development_dependency 'pry',        '~> 0.10'
+  spec.add_development_dependency 'pry-byebug', '~> 2.0'
+  spec.add_development_dependency 'ruby-prof',  '~> 0.15'
+  spec.add_development_dependency 'gnuplot',    '~> 2.6'
+end

data/lib/k_means_pp.rb ADDED

@@ -0,0 +1,240 @@
+require 'k_means_pp/version'
+require 'k_means_pp/point'
+require 'k_means_pp/cluster'
+# Cluster data with the k-means++, k-means and Lloyd algorithm.
+class KMeansPP
+  # Source data set of points.
+  #
+  # @return [Array<Point>]
+  attr_accessor :points
+  # Centroid points
+  #
+  # @return [Array<Centroid>]
+  attr_accessor :centroids
+  # Take an array of things and group them into K clusters.
+  #
+  # If no block was given, an array of arrays (of two numbers) is expected.
+  # At the end an array of +Cluster+s is returned, each wrapping
+  # an array of arrays (of two numbers).
+  #
+  # If a block was given, the +points+ is likely an array of other things
+  # like hashes or objects. The block is expected to return an array of two
+  # numbers. At the end an array of +Cluster+s is returned, each wrapping
+  # an array of original objects.
+  #
+  # @param points         [Array]  Source data set of points.
+  # @param clusters_count [Fixnum] Number of clusters ("k").
+  # @yieldreturn [Array<Numeric>]
+  #
+  # @return [Array<Cluster>]
+  def self.clusters(points, clusters_count, &block)
+    instance = new(points, clusters_count, &block)
+    instance.group_points
+    instance.centroids.map do |centroid|
+      cluster_for_centroid(centroid, points, &block)
+    end
+  end
+  # Computed points are a flat structure so this nests each point
+  # in an array.
+  #
+  # @param centroid [Centroid] Centroid of the cluster.
+  #
+  # @return [Cluster]
+  def self.cluster_for_centroid(centroid, points, &block)
+    cluster_points = points.select { |p| p.group == centroid }
+    if block
+      cluster_points.map!(&:original)
+    else
+      cluster_points.map! { |p| [p.x, p.y] }
+    end
+    Cluster.new(centroid, cluster_points)
+  end
+  # Find nearest centroid for a given point in given centroids.
+  #
+  # @param point     [Point]           Measure distance of this point
+  # @param centroids [Array<Centroid>] to those cluster centers
+  #
+  # @return [Centroid]
+  def self.find_nearest_centroid(point, centroids)
+    find_nearest_centroid_and_distance(point, centroids)[0]
+  end
+  # Find distance to the nearest centroid for a given point in given centroids.
+  #
+  # @param point     [Point]           Measure distance of this point
+  # @param centroids [Array<Centroid>] to those cluster centers
+  #
+  # @return [Float]
+  def self.find_nearest_centroid_distance(point, centroids)
+    find_nearest_centroid_and_distance(point, centroids)[1]
+  end
+  # Find the nearest centroid in given centroids.
+  #
+  # @param point     [Point]           Measure distance of this point
+  # @param centroids [Array<Centroid>] to those cluster centers
+  #
+  # @return [Array]
+  def self.find_nearest_centroid_and_distance(point, centroids)
+    # Assume the current centroid is the closest.
+    nearest_centroid = point.group
+    nearest_distance = Float::INFINITY
+    centroids.each do |centroid|
+      distance = centroid.squared_distance_to(point)
+      next if distance >= nearest_distance
+      nearest_distance = distance
+      nearest_centroid = centroid
+    end
+    [nearest_centroid, nearest_distance]
+  end
+  # Take an array of things and group them into K clusters.
+  #
+  # If no block was given, an array of arrays (of two numbers) is expected.
+  # Internally we map them with +Point+ objects.
+  #
+  # If a block was given, the +points+ is likely an array of other things
+  # like hashes or objects. In this case we will keep the original object
+  # in a property and once we are done, we will swap those objects.
+  # The block is expected to return an array of two numbers.
+  #
+  # @param points         [Array]  Source data set of points.
+  # @param clusters_count [Fixnum] Number of clusters ("k").
+  # @yieldreturn [Array<Numeric>]
+  def initialize(points, clusters_count)
+    if block_given?
+      points.map! do |point_obj|
+        point_ary      = yield(point_obj)
+        point          = Point.new(point_ary[0], point_ary[1])
+        point.original = point_obj
+        point
+      end
+    else
+      points.map! do |point_ary|
+        Point.new(point_ary[0], point_ary[1])
+      end
+    end
+    self.points    = points
+    self.centroids = Array.new(clusters_count)
+  end
+  # Group points into clusters.
+  def group_points
+    define_initial_clusters
+    fine_tune_clusters
+  end
+  protected
+  # K-means++ algorithm.
+  #
+  # Find initial centroids and assign points to their nearest centroid,
+  # forming cells.
+  def define_initial_clusters
+    # Randomly choose a point as the first centroid.
+    centroids[0] = Centroid.new(points.sample)
+    # Initialize an array of distances of every point.
+    distances = points.size.times.map { 0.0 }
+    centroids.each_with_index do |_, centroid_i|
+      # Skip the first centroid as it's already picked but keep the index.
+      next if centroid_i == 0
+      # Sum points' distances to their nearest centroid
+      distances_sum = 0.0
+      points.each_with_index do |point, point_i|
+        distance = self.class.find_nearest_centroid_distance(
+          point,
+          centroids[0...centroid_i]
+        )
+        distances[point_i] = distance
+        distances_sum     += distance
+      end
+      # Randomly cut it.
+      distances_sum *= rand
+      # Keep subtracting those distances until we hit a zero (or lower)
+      # in which case we found a new centroid.
+      distances.each_with_index do |distance, point_i|
+        distances_sum -= distance
+        next if distances_sum > 0
+        centroids[centroid_i] = Centroid.new(points[point_i])
+        break
+      end
+    end
+    # Assign each point its nearest centroid.
+    points.each do |point|
+      point.group = self.class.find_nearest_centroid(point, centroids)
+    end
+  end
+  # This is Lloyd's algorithm
+  # https://en.wikipedia.org/wiki/Lloyd%27s_algorithm
+  #
+  # At this point we have our points already assigned into cells.
+  #
+  # 1. We calculate a new center for each cell.
+  # 2. For each point find its nearest center and re-assign it if it changed.
+  # 3. Repeat until a threshold has been reached.
+  def fine_tune_clusters
+    # When the number of changed points drops to this number or below, we are done.
+    changed_threshold = points.size >> 10
+    loop do
+      calculate_new_centroids
+      changed = reassign_points
+      # Stop when 99.9% of points are good
+      break if changed <= changed_threshold
+    end
+  end
+  # For each cell calculate its center.
+  # This is done by averaging X and Y coordinates.
+  def calculate_new_centroids
+    # Clear centroids.
+    centroids.each(&:reset)
+    # Sum all X and Y coords into each point's centroid.
+    points.each do |point|
+      centroid = point.group
+      centroid.add(point)
+    end
+    # And then average it to find a center.
+    centroids.each(&:average)
+  end
+  # Loop through all the points and find their nearest centroid.
+  # If it's a different one than current, change it and take a note.
+  #
+  # @return [Fixnum] Number of changed points.
+  def reassign_points
+    changed = 0
+    points.each do |point|
+      centroid = self.class.find_nearest_centroid(point, centroids)
+      next if centroid == point.group
+      changed += 1
+      point.group = centroid
+    end
+    changed
+  end
+end
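
The seeding step in `define_initial_clusters` picks each new centroid with probability proportional to a point's squared distance to its nearest existing centroid: it scales the total by a random number, then subtracts distances until the remainder reaches zero. Below is a standalone sketch of that selection, not the gem's code. `pick_weighted_index` and `squared_distance` are hypothetical names, the injectable `rng` parameter is an assumption added for testability, and the final fallback line is an addition of this sketch.

```ruby
# Hypothetical helper: choose an index with probability proportional to
# its weight. `rng` only needs to respond to `rand` and return a Float
# in [0, 1).
def pick_weighted_index(weights, rng = Random.new)
  remaining = weights.sum * rng.rand
  weights.each_with_index do |weight, index|
    remaining -= weight
    return index if remaining <= 0
  end
  weights.size - 1 # Fallback for floating-point leftovers (sketch only).
end

# Hypothetical helper: squared Euclidean distance between two [x, y] pairs.
def squared_distance(a, b)
  (a[0] - b[0])**2 + (a[1] - b[1])**2
end

points  = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0]]
seeds   = [points[0]]

# Weight of each point = squared distance to its nearest chosen seed.
weights = points.map { |p| seeds.map { |s| squared_distance(p, s) }.min }

next_seed = points[pick_weighted_index(weights)]
puts "weights=#{weights.inspect} next seed=#{next_seed.inspect}"
```

Because the far point carries almost all of the weight here, it is chosen in nearly every run. This is the property that spreads the initial centroids apart.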

data/lib/k_means_pp/cluster.rb ADDED

@@ -0,0 +1,32 @@
+class KMeansPP
+  # Cluster has a centroid and a group of related points.
+  class Cluster
+    # Center of the data set ("centroid").
+    #
+    # @return [Centroid]
+    attr_accessor :centroid
+    # Points in this cluster.
+    #
+    # @return [Array<Point>]
+    attr_accessor :points
+    # Create a new cluster with a centroid and points.
+    #
+    # @param centroid [Centroid]     Center point of the data set.
+    # @param points   [Array<Point>] Points in this cluster.
+    def initialize(centroid, points = [])
+      self.centroid = centroid
+      self.points   = points
+    end
+    # A string representation of the cluster.
+    def to_s
+      o = ''
+      o << "Cluster #{ centroid }: [\n"
+      points.each { |p| o << "  #{ p },\n" }
+      o << "]\n"
+      o
+    end
+  end
+end

data/lib/k_means_pp/point.rb ADDED

@@ -0,0 +1,95 @@
+class KMeansPP
+  # Common methods for +Point+ and +Centroid+.
+  class BasePoint
+    # X coordinate of the point.
+    #
+    # @return [Float]
+    attr_accessor :x
+    # Y coordinate of the point.
+    #
+    # @return [Float]
+    attr_accessor :y
+    # Measure a 2D squared distance between two points.
+    #
+    # @param point [BasePoint]
+    #
+    # @return [Float]
+    def squared_distance_to(point)
+      distance_x       = x - point.x
+      distance_y       = y - point.y
+      squared_distance = distance_x**2 + distance_y**2
+      squared_distance
+    end
+    # A string representation of the point.
+    def to_s
+      "(#{ x }, #{ y })"
+    end
+  end
+  # Point of the data set.
+  class Point < BasePoint
+    # Group is a centroid point.
+    #
+    # @return [Centroid]
+    attr_accessor :group
+    # The original object (could be anything from Hash to an Object).
+    #
+    # @return [Object]
+    attr_accessor :original
+    # Create a new point (data set point or a centroid).
+    #
+    # @param x     [Float]    X coordinate of the point.
+    # @param y     [Float]    Y coordinate of the point.
+    # @param group [Centroid] Group is a centroid point.
+    def initialize(x = 0.0, y = 0.0, group = nil)
+      self.x     = x
+      self.y     = y
+      self.group = group
+    end
+  end
+  # Centroid of a cluster.
+  class Centroid < BasePoint
+    # How many points are in this cluster?
+    #
+    # @return [Fixnum]
+    attr_accessor :counter
+    # Create a new centroid point.
+    #
+    # @param point [Point] Copy point's X and Y coords.
+    def initialize(point)
+      self.x = point.x
+      self.y = point.y
+    end
+    # Prepare centroid for a new iteration, zero-ing everything.
+    def reset
+      self.x       = 0.0
+      self.y       = 0.0
+      self.counter = 0
+    end
+    # Add this point's X and Y coords into the sum (for later average).
+    #
+    # @param point [Point]
+    def add(point)
+      self.counter += 1
+      self.x       += point.x
+      self.y       += point.y
+    end
+    # At this point X and Y properties will contain sums of all the point
+    # coords, counter will contain number of those points.
+    # By averaging the coords we find a new center.
+    def average
+      self.x /= counter
+      self.y /= counter
+    end
+  end
+end
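
`Centroid` recomputes a cluster center in three steps: `reset` zeroes the sums, `add` accumulates coordinates, and `average` divides by the counter. Below is a simplified standalone sketch of that pattern. `MeanAccumulator` is a hypothetical class, not part of the gem, and returning `nil` for an empty accumulator is a design choice of this sketch. In the diff's version, a centroid with no assigned points would divide `0.0` by `0`, which in Ruby yields `NaN` rather than raising.

```ruby
# Hypothetical stand-in for the reset/add/average cycle shown above.
class MeanAccumulator
  attr_reader :x, :y, :count

  def initialize
    reset
  end

  # Zero the running sums before a new iteration.
  def reset
    @x = 0.0
    @y = 0.0
    @count = 0
    self
  end

  # Add one point's coordinates to the running sums.
  def add(px, py)
    @count += 1
    @x += px
    @y += py
    self
  end

  # Mean of the accumulated points, or nil when nothing was added.
  def average
    return nil if @count.zero?

    [@x / @count, @y / @count]
  end
end

acc = MeanAccumulator.new
acc.add(1.0, 2.0).add(3.0, 4.0)
puts acc.average.inspect
```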

data/lib/k_means_pp/version.rb ADDED

@@ -0,0 +1,4 @@
+class KMeansPP
+  # Version number, happy now?
+  VERSION = '0.0.1'
+end

data/spec/lib/k_means_pp_spec.rb ADDED

@@ -0,0 +1,62 @@
+require 'spec_helper'
+require 'csv'
+RSpec.describe 'Superman' do
+  it 'does it again' do
+    data = CSV.foreach('./spec/resources/points.csv').map do |row|
+      [row[0].to_f, row[1].to_f]
+    end
+    clusters = KMeansPP.clusters(data, 3)
+    clusters.each do |cluster|
+      expect(cluster.points.size).to be > 0
+      expect(cluster.centroid.x).to_not eq(0)
+      expect(cluster.centroid.y).to_not eq(0)
+      expect(cluster.to_s).to_not be_empty
+    end
+    expect(clusters.size).to eq(3)
+  end
+  it 'array of arrays' do
+    data = [
+      [0.3968, 1.9431],
+      [9.3348, 6.7843],
+      [9.2882, 8.1347],
+      [7.6768, 2.7362],
+      [3.4434, 4.1910],
+      [1.8097, 5.0884],
+      [7.0698, 3.9285],
+      [9.3820, 7.6790],
+      [8.6092, 0.9651],
+      [9.1981, 7.7493]
+    ]
+    clusters = KMeansPP.clusters(data, 3)
+    expect(clusters.size).to eq(3)
+    expect(clusters.first.points.first).to be_a(Array)
+  end
+  it 'array of anything else with block' do
+    data = [
+      { x: 0.3968, y: 1.9431 },
+      { x: 9.3348, y: 6.7843 },
+      { x: 9.2882, y: 8.1347 },
+      { x: 7.6768, y: 2.7362 },
+      { x: 3.4434, y: 4.1910 },
+      { x: 1.8097, y: 5.0884 },
+      { x: 7.0698, y: 3.9285 },
+      { x: 9.3820, y: 7.6790 },
+      { x: 8.6092, y: 0.9651 },
+      { x: 9.1981, y: 7.7493 }
+    ]
+    clusters = KMeansPP.clusters(data, 3) do |point|
+      [point[:x], point[:y]]
+    end
+    expect(clusters.size).to eq(3)
+    expect(clusters.first.points.first).to be_a(Hash)
+  end
+end

data/spec/resources/points.csv ADDED

@@ -0,0 +1,100 @@
+48.2641334571,86.4516903905
+0.114004262656,35.8368597414
+97.4319168245,92.8009240744
+24.4614031388,18.3292584382
+36.2367675367,32.8294024271
+75.5836860736,68.30729977
+38.6577034445,25.7701728584
+28.2607136287,64.4493377817
+61.5358486771,61.2195232194
+1.52352224798,38.5083779618
+11.6392182793,68.2369021579
+53.9486870607,53.9136556533
+14.6671651772,26.0132534731
+65.9506725878,82.5639317581
+58.3682872339,51.6414580337
+12.6918921252,2.28888447759
+31.7587852231,18.1368234166
+63.6631115204,24.933301389
+29.1652289905,34.456759171
+44.3830953085,70.4813875779
+47.0571691145,65.3507625811
+74.0584537502,98.2271944247
+55.8929146157,86.6196265477
+20.4744253473,12.0025149302
+14.2867767281,40.2850440995
+40.43551369,94.5410407116
+87.6178871195,12.4700151639
+47.2703048197,93.0636237124
+59.7895104175,69.2621288413
+80.8612333922,42.9183411179
+31.1271795535,55.6669044656
+78.9671049353,65.833739365
+39.8324533414,63.0343115139
+79.126343548,14.9128874133
+65.8152400306,77.5202358013
+75.2762752704,42.4858435609
+29.6475948493,61.2068411763
+67.421857106,54.8955604259
+10.4652931501,29.7954139372
+32.0272462745,99.5422900971
+80.1520927001,84.2710379142
+2.27240208403,41.2138854089
+44.4601509555,1.72563901513
+16.8676021068,35.3415636277
+58.1977544121,29.2752085455
+24.6119080085,39.9440735137
+63.0759798755,60.9841014448
+30.9289119657,95.0173219502
+8.54972950047,41.7384441737
+61.2606910793,4.06738902059
+83.2302091964,11.6373312879
+89.4443065362,42.5694882801
+24.5619318152,97.7947977804
+50.3134024475,40.6429336223
+58.1422402033,36.1112632557
+32.0668520827,29.9924151435
+89.6057447137,84.9532177777
+9.8876440816,18.2540486261
+17.9670383961,47.596032257
+50.2977668282,93.6851189223
+98.0700386253,86.5816924579
+10.8175290981,26.4344732252
+34.7463851288,24.4154447141
+92.5470100593,17.3595513748
+79.0426629356,4.59850018907
+89.9791366918,29.523946842
+3.89920214563,91.3650215111
+35.4669861576,62.1865368798
+2.78150918086,24.5280230552
+50.0390951889,57.0414421682
+64.4521660758,48.4962172448
+94.4915452316,56.6508179406
+47.1655534769,15.8292055671
+94.2027011374,45.6802385454
+30.5846324871,54.783635876
+57.7043252948,0.286661610381
+41.7908674949,14.7206014023
+59.6689465934,64.8849831965
+92.2553335495,55.9096460272
+48.493467262,69.4766837809
+23.1837859581,71.4406867443
+29.0737623652,66.9391416961
+95.7442323112,89.4677505059
+68.7707275828,40.9900140055
+84.5445737133,32.1707309618
+67.4126251988,56.6710579117
+10.688352016,28.1745892928
+56.7620324155,18.3034334207
+50.6751320678,86.6916908032
+74.6185482896,34.022483532
+20.7011996002,32.855295357
+11.479054664,1.59204297586
+51.6805387648,25.4063026358
+84.4109522357,47.237632645
+90.6395051745,57.7917166935
+58.6159601042,84.1226173848
+46.2184509277,28.559934585
+97.0302485783,41.3135022812
+31.3144587058,87.2459910122
+5.93357833962,95.6812831872

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,10 @@
+require 'bundler/setup'
+require 'simplecov'
+# Coverage tool, needs to be started as soon as possible
+SimpleCov.start do
+  add_filter '/spec/' # Ignore spec directory
+end
+require 'k_means_pp'

metadata ADDED

@@ -0,0 +1,211 @@
+--- !ruby/object:Gem::Specification
+name: k_means_pp
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Oldrich Vetesnik
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2014-10-03 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.7'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.7'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.1'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.1'
+- !ruby/object:Gem::Dependency
+  name: simplecov
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.8'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.8'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.3'
+- !ruby/object:Gem::Dependency
+  name: rubocop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.26'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.26'
+- !ruby/object:Gem::Dependency
+  name: pry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.10'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.10'
+- !ruby/object:Gem::Dependency
+  name: pry-byebug
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.0'
+- !ruby/object:Gem::Dependency
+  name: ruby-prof
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.15'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.15'
+- !ruby/object:Gem::Dependency
+  name: gnuplot
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.6'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.6'
+description: 'This is a Ruby implementation of the k-means++ algorithm for data clustering.
+  In other words: Grouping a bunch of X, Y points into K groups.'
+email:
+- oldrich.vetesnik@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".rspec"
+- ".yardopts"
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- examples/common.rb
+- examples/example_block.rb
+- examples/example_csv.rb
+- examples/example_debug.rb
+- examples/example_huge.rb
+- examples/example_simple.rb
+- examples/points.csv
+- k_means_pp.gemspec
+- lib/k_means_pp.rb
+- lib/k_means_pp/cluster.rb
+- lib/k_means_pp/point.rb
+- lib/k_means_pp/version.rb
+- spec/lib/k_means_pp_spec.rb
+- spec/resources/points.csv
+- spec/spec_helper.rb
+homepage: https://github.com/ollie/k_means_pp
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.4.1
+signing_key:
+specification_version: 4
+summary: K-means++ Algorithm Implementation.
+test_files:
+- spec/lib/k_means_pp_spec.rb
+- spec/resources/points.csv
+- spec/spec_helper.rb
+has_rdoc:
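
The gem description in the metadata above summarises k-means++ as grouping X, Y points into K groups. As a rough illustration of the seeding step that gives the algorithm its name, here is a minimal sketch. It is not taken from the gem's source: `squared_distance` and `seed_centers` are hypothetical names, the sample points are copied from the CSV listing earlier in this diff, and `Array#sum` assumes Ruby 2.4 or newer.

```ruby
# Illustrative sketch only; not the gem's code. All method names are hypothetical.

# Squared Euclidean distance between two [x, y] pairs.
def squared_distance(a, b)
  (a[0] - b[0])**2 + (a[1] - b[1])**2
end

# k-means++ seeding: the first centre is picked uniformly at random; each
# later centre is picked with probability proportional to its squared
# distance from the nearest centre chosen so far.
def seed_centers(points, k, rng = Random.new)
  centers = [points[rng.rand(points.size)]]
  while centers.size < k
    weights = points.map { |p| centers.map { |c| squared_distance(p, c) }.min }
    total = weights.sum
    break if total.zero? # every remaining point coincides with a centre

    threshold = rng.rand * total
    index = 0
    cumulative = 0.0
    weights.each_with_index do |w, i|
      next if w.zero? # never re-pick a point that is already a centre

      cumulative += w
      index = i
      break if cumulative > threshold
    end
    centers << points[index]
  end
  centers
end

# A few rows from the points.csv listing above.
points = [
  [78.9671049353, 65.833739365],
  [39.8324533414, 63.0343115139],
  [79.126343548, 14.9128874133],
  [10.4652931501, 29.7954139372],
  [32.0272462745, 99.5422900971]
]

centers = seed_centers(points, 3, Random.new(42))
puts centers.inspect
```

Passing a seeded `Random` makes the choice reproducible for a given Ruby version. The usual assignment and centroid-update iterations of k-means would follow this step and are left out here.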