onedclusterer 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: bb699416393553f57991d0375be6ff278bbddffc
4
+ data.tar.gz: c7f00767cca21af5fd2946668fc31b0bf91df5ce
5
+ SHA512:
6
+ metadata.gz: 6c5afee47f692194509869fe7bd0531248c44a3b8119e345c5a1cd219184790f5e1053a81d3412aed6214391aaa7e10b58484c578387f6d5e5e129b3c568cb54
7
+ data.tar.gz: e9afec6e1500240174c3693fb7c9091018b14029bbb6f29b71213a027e32363749af4dba6c738954da98c511eba07965fb1cbb9f713bd7e9e9c83f2bccb58730
data/.gitignore ADDED
@@ -0,0 +1,10 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ .idea
data/.travis.yml ADDED
@@ -0,0 +1,4 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.1.0
4
+ before_install: gem install bundler -v 1.10.6
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,13 @@
1
+ # Contributor Code of Conduct
2
+
3
+ As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
4
+
5
+ We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
6
+
7
+ Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
8
+
9
+ Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
10
+
11
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
12
+
13
+ This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org), version 1.0.0, available at [http://contributor-covenant.org/version/1/0/0/](http://contributor-covenant.org/version/1/0/0/)
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in onedclusterer.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,165 @@
1
+ GNU LESSER GENERAL PUBLIC LICENSE
2
+ Version 3, 29 June 2007
3
+
4
+ Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
5
+ Everyone is permitted to copy and distribute verbatim copies
6
+ of this license document, but changing it is not allowed.
7
+
8
+
9
+ This version of the GNU Lesser General Public License incorporates
10
+ the terms and conditions of version 3 of the GNU General Public
11
+ License, supplemented by the additional permissions listed below.
12
+
13
+ 0. Additional Definitions.
14
+
15
+ As used herein, "this License" refers to version 3 of the GNU Lesser
16
+ General Public License, and the "GNU GPL" refers to version 3 of the GNU
17
+ General Public License.
18
+
19
+ "The Library" refers to a covered work governed by this License,
20
+ other than an Application or a Combined Work as defined below.
21
+
22
+ An "Application" is any work that makes use of an interface provided
23
+ by the Library, but which is not otherwise based on the Library.
24
+ Defining a subclass of a class defined by the Library is deemed a mode
25
+ of using an interface provided by the Library.
26
+
27
+ A "Combined Work" is a work produced by combining or linking an
28
+ Application with the Library. The particular version of the Library
29
+ with which the Combined Work was made is also called the "Linked
30
+ Version".
31
+
32
+ The "Minimal Corresponding Source" for a Combined Work means the
33
+ Corresponding Source for the Combined Work, excluding any source code
34
+ for portions of the Combined Work that, considered in isolation, are
35
+ based on the Application, and not on the Linked Version.
36
+
37
+ The "Corresponding Application Code" for a Combined Work means the
38
+ object code and/or source code for the Application, including any data
39
+ and utility programs needed for reproducing the Combined Work from the
40
+ Application, but excluding the System Libraries of the Combined Work.
41
+
42
+ 1. Exception to Section 3 of the GNU GPL.
43
+
44
+ You may convey a covered work under sections 3 and 4 of this License
45
+ without being bound by section 3 of the GNU GPL.
46
+
47
+ 2. Conveying Modified Versions.
48
+
49
+ If you modify a copy of the Library, and, in your modifications, a
50
+ facility refers to a function or data to be supplied by an Application
51
+ that uses the facility (other than as an argument passed when the
52
+ facility is invoked), then you may convey a copy of the modified
53
+ version:
54
+
55
+ a) under this License, provided that you make a good faith effort to
56
+ ensure that, in the event an Application does not supply the
57
+ function or data, the facility still operates, and performs
58
+ whatever part of its purpose remains meaningful, or
59
+
60
+ b) under the GNU GPL, with none of the additional permissions of
61
+ this License applicable to that copy.
62
+
63
+ 3. Object Code Incorporating Material from Library Header Files.
64
+
65
+ The object code form of an Application may incorporate material from
66
+ a header file that is part of the Library. You may convey such object
67
+ code under terms of your choice, provided that, if the incorporated
68
+ material is not limited to numerical parameters, data structure
69
+ layouts and accessors, or small macros, inline functions and templates
70
+ (ten or fewer lines in length), you do both of the following:
71
+
72
+ a) Give prominent notice with each copy of the object code that the
73
+ Library is used in it and that the Library and its use are
74
+ covered by this License.
75
+
76
+ b) Accompany the object code with a copy of the GNU GPL and this license
77
+ document.
78
+
79
+ 4. Combined Works.
80
+
81
+ You may convey a Combined Work under terms of your choice that,
82
+ taken together, effectively do not restrict modification of the
83
+ portions of the Library contained in the Combined Work and reverse
84
+ engineering for debugging such modifications, if you also do each of
85
+ the following:
86
+
87
+ a) Give prominent notice with each copy of the Combined Work that
88
+ the Library is used in it and that the Library and its use are
89
+ covered by this License.
90
+
91
+ b) Accompany the Combined Work with a copy of the GNU GPL and this license
92
+ document.
93
+
94
+ c) For a Combined Work that displays copyright notices during
95
+ execution, include the copyright notice for the Library among
96
+ these notices, as well as a reference directing the user to the
97
+ copies of the GNU GPL and this license document.
98
+
99
+ d) Do one of the following:
100
+
101
+ 0) Convey the Minimal Corresponding Source under the terms of this
102
+ License, and the Corresponding Application Code in a form
103
+ suitable for, and under terms that permit, the user to
104
+ recombine or relink the Application with a modified version of
105
+ the Linked Version to produce a modified Combined Work, in the
106
+ manner specified by section 6 of the GNU GPL for conveying
107
+ Corresponding Source.
108
+
109
+ 1) Use a suitable shared library mechanism for linking with the
110
+ Library. A suitable mechanism is one that (a) uses at run time
111
+ a copy of the Library already present on the user's computer
112
+ system, and (b) will operate properly with a modified version
113
+ of the Library that is interface-compatible with the Linked
114
+ Version.
115
+
116
+ e) Provide Installation Information, but only if you would otherwise
117
+ be required to provide such information under section 6 of the
118
+ GNU GPL, and only to the extent that such information is
119
+ necessary to install and execute a modified version of the
120
+ Combined Work produced by recombining or relinking the
121
+ Application with a modified version of the Linked Version. (If
122
+ you use option 4d0, the Installation Information must accompany
123
+ the Minimal Corresponding Source and Corresponding Application
124
+ Code. If you use option 4d1, you must provide the Installation
125
+ Information in the manner specified by section 6 of the GNU GPL
126
+ for conveying Corresponding Source.)
127
+
128
+ 5. Combined Libraries.
129
+
130
+ You may place library facilities that are a work based on the
131
+ Library side by side in a single library together with other library
132
+ facilities that are not Applications and are not covered by this
133
+ License, and convey such a combined library under terms of your
134
+ choice, if you do both of the following:
135
+
136
+ a) Accompany the combined library with a copy of the same work based
137
+ on the Library, uncombined with any other library facilities,
138
+ conveyed under the terms of this License.
139
+
140
+ b) Give prominent notice with the combined library that part of it
141
+ is a work based on the Library, and explaining where to find the
142
+ accompanying uncombined form of the same work.
143
+
144
+ 6. Revised Versions of the GNU Lesser General Public License.
145
+
146
+ The Free Software Foundation may publish revised and/or new versions
147
+ of the GNU Lesser General Public License from time to time. Such new
148
+ versions will be similar in spirit to the present version, but may
149
+ differ in detail to address new problems or concerns.
150
+
151
+ Each version is given a distinguishing version number. If the
152
+ Library as you received it specifies that a certain numbered version
153
+ of the GNU Lesser General Public License "or any later version"
154
+ applies to it, you have the option of following the terms and
155
+ conditions either of that published version or of any later version
156
+ published by the Free Software Foundation. If the Library as you
157
+ received it does not specify a version number of the GNU Lesser
158
+ General Public License, you may choose any version of the GNU Lesser
159
+ General Public License ever published by the Free Software Foundation.
160
+
161
+ If the Library as you received it specifies that a proxy can decide
162
+ whether future versions of the GNU Lesser General Public License shall
163
+ apply, that proxy's public statement of acceptance of any version is
164
+ permanent authorization for you to choose that version for the
165
+ Library.
data/README.md ADDED
@@ -0,0 +1,64 @@
1
+ # Onedclusterer
2
+
3
+ A tiny Ruby library for one-dimensional clustering methods.
4
+
5
+ ## Usage
6
+
7
+ ### Ckmeans.1d.dp
8
+
9
+ A dynamic programming algorithm for optimal one-dimensional k-means clustering. The algorithm minimizes the sum of squares of within-cluster distances. As an alternative to the standard heuristic k-means algorithm, this algorithm guarantees optimality and repeatability.
10
+ https://cran.r-project.org/web/packages/Ckmeans.1d.dp/index.html
11
+
12
+ ```ruby
13
+ require 'onedclusterer'
14
+ data = [1259.61,2024.82,1855.75,1559.04,1707.65,1107.1,2155.8]
15
+ ckmeans = OnedClusterer::Ckmeans.new(data, 1, 7) # chooses an optimal number of clusters between 1 and 7
16
+ p ckmeans.bounds # => [0, 1259.61, 1855.75, 2155.8]
17
+ p ckmeans.clusters # => [[1107.1, 1259.61], [1559.04, 1707.65, 1855.75], [2024.82, 2155.8]]
18
+
19
+ # an exact number of clusters can be requested instead of a min and max
20
+ ckmeans = OnedClusterer::Ckmeans.new(data, 4)
21
+ p ckmeans.bounds # => [0, 1259.61, 1559.04, 1855.75, 2155.8]
22
+ p ckmeans.clusters # => [[1107.1, 1259.61], [1559.04], [1707.65, 1855.75], [2024.82, 2155.8]]
23
+ ```
24
+
25
+ ### Jenks natural breaks
26
+
27
+ The Jenks natural breaks classification method seeks to reduce the variance within classes and maximize the variance between classes.
28
+ http://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization
29
+ http://www.macwright.org/2013/02/18/literate-jenks.html
30
+
31
+ ```ruby
32
+ require 'onedclusterer'
33
+ data = [1259.61,2024.82,1855.75,1559.04,1707.65,1107.1,2155.8]
34
+ jenks = OnedClusterer::Jenks.new(data, 4)
35
+ p jenks.bounds # => [0, 1259.61, 1559.04, 1855.75, 2155.8]
36
+ p jenks.clusters # => [[1107.1, 1259.61], [1559.04], [1707.65, 1855.75], [2024.82, 2155.8]]
37
+ ```
38
+
39
+ ## Installation
40
+
41
+ Add this line to your application's Gemfile:
42
+
43
+ ```ruby
44
+ gem 'onedclusterer'
45
+ ```
46
+
47
+ And then execute:
48
+
49
+ $ bundle
50
+
51
+ Or install it yourself as:
52
+
53
+ $ gem install onedclusterer
54
+
55
+ ## Development
56
+
57
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
58
+
59
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
60
+
61
+ ## Contributing
62
+
63
+ Bug reports and pull requests are welcome on GitHub at https://github.com/Hamdiakoguz/onedclusterer. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
64
+
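
The objective named in the README, the within-cluster sum of squares, can be illustrated with a short standalone sketch. It is not part of the gem: `within_cluster_ss` is a hypothetical helper name, and the two partitions are copied from the README examples above.

```ruby
# Hypothetical helper (not part of onedclusterer): sum of squared
# deviations of every value from the mean of its own cluster.
def within_cluster_ss(clusters)
  clusters.reduce(0.0) do |total, cluster|
    mean = cluster.reduce(0.0, :+) / cluster.size
    total + cluster.reduce(0.0) { |acc, x| acc + (x - mean)**2 }
  end
end

# Partitions as printed in the README examples.
README_THREE = [[1107.1, 1259.61], [1559.04, 1707.65, 1855.75], [2024.82, 2155.8]].freeze
README_FOUR  = [[1107.1, 1259.61], [1559.04], [1707.65, 1855.75], [2024.82, 2155.8]].freeze

puts within_cluster_ss(README_THREE)
puts within_cluster_ss(README_FOUR)
```

The four-cluster partition only splits one cluster of the three-cluster partition, so its sum of squares cannot be larger. This is why a raw sum of squares alone cannot pick the number of clusters, and a separate criterion is needed when a range is given.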
data/Rakefile ADDED
@@ -0,0 +1,10 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new(:test) do |t|
5
+ t.libs << "test"
6
+ t.libs << "lib"
7
+ t.test_files = FileList['test/**/*_test.rb']
8
+ end
9
+
10
+ task :default => :test
data/bin/console ADDED
@@ -0,0 +1,7 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "onedclusterer"
5
+
6
+ require "pry"
7
+ Pry.start
data/bin/setup ADDED
@@ -0,0 +1,7 @@
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+
5
+ bundle install
6
+
7
+ # Do any other automated setup that you need to do here
data/lib/onedclusterer.rb ADDED
@@ -0,0 +1,7 @@
1
+ require "onedclusterer/version"
2
+ require "onedclusterer/jenks"
3
+ require "onedclusterer/ckmeans"
4
+
5
+
6
+ module OnedClusterer
7
+ end
data/lib/onedclusterer/ckmeans.rb ADDED
@@ -0,0 +1,231 @@
1
+ require 'matrix'
2
+ require_relative 'clusterer'
3
+
4
+ module OnedClusterer
5
+
6
+ # Ckmeans clustering is an improvement on heuristic-based clustering
7
+ # approaches like Jenks. The algorithm was developed by
8
+ # [Haizhou Wang and Mingzhou Song](http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf)
9
+ # as a [dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming) approach
10
+ # to the problem of clustering numeric data into groups with the least
11
+ # within-group sum-of-squared-deviations.
12
+ #
13
+ # Minimizing the difference within groups - what Wang & Song refer to as
14
+ # `withinss`, or within sum-of-squares, means that groups are optimally
15
+ # homogeneous within and the data is split into representative groups.
16
+ # This is very useful for visualization, where you may want to represent
17
+ # a continuous variable in discrete color or style groups. This function
18
+ # can provide groups that emphasize differences between data.
19
+ #
20
+ # Being a dynamic programming approach, this algorithm is based on two matrices that
21
+ # store incrementally-computed values for squared deviations and backtracking
22
+ # indexes.
23
+ #
24
+ # This implementation is ported from the original C++ implementation.
25
+ #
26
+ ## References
27
+ # _Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic
28
+ # Programming_ Haizhou Wang and Mingzhou Song ISSN 2073-4859
29
+ #
30
+ # from [The R Journal Vol. 3/2, December 2011](http://journal.r-project.org/archive/2011-2/RJournal_2011-2.pdf)
31
+
32
+ class Ckmeans
33
+ include Clusterer
34
+
35
+ attr_reader :data, :kmin, :kmax, :cluster_details
36
+
37
+ # Input:
38
+ # data -- a vector of numbers, not necessarily sorted
39
+ # kmin -- the minimum number of clusters expected
40
+ # kmax -- the maximum number of clusters expected
41
+ # If only kmin is given, exactly kmin clusters will be returned;
42
+ # otherwise the algorithm chooses an optimal number between kmin and kmax.
43
+ def initialize(data, kmin, kmax = kmin)
44
+ @data_size = data.size
45
+ @data = data.sort # All arrays here are treated as starting at position 1; position 0 is not used.
46
+ @kmin = kmin
47
+ @unique = data.uniq.size
48
+ @kmax = @unique < kmax ? @unique : kmax
49
+
50
+ raise ArgumentError, "kmin can not be greater than kmax." if kmin > kmax
51
+ raise ArgumentError, "kmin can not be greater than data size." if kmin > @data_size
52
+ raise ArgumentError, "kmax can not be greater than data size." if kmax > @data_size
53
+ end
54
+
55
+ # returns clustered data as array and sets :cluster_details
56
+ def clusters
57
+ @clusters_result ||= begin
58
+ if @unique <= 1 # A single cluster that contains all elements
59
+ return [data]
60
+ end
61
+
62
+ rows = @data_size
63
+ cols = kmax
64
+
65
+ distance = *Matrix.zero(cols + 1, rows + 1) # 'D'
66
+ backtrack = *Matrix.zero(cols + 1, rows + 1) # 'B'
67
+
68
+ fill_dp_matrix(data.insert(0, nil), distance, backtrack)
69
+
70
+ # Choose an optimal number of levels between Kmin and Kmax
71
+ kopt = select_levels(data, backtrack, kmin, kmax)
72
+ backtrack = backtrack[0..kopt]
73
+
74
+ results = []
75
+ backtrack(backtrack) do |k, left, right|
76
+ results[k] = data[left..right]
77
+ end
78
+ results.drop(1)
79
+ end
80
+ end
81
+
82
+ def bounds
83
+ @bounds ||= clusters.map { |cluster| cluster.last }.insert(0, 0)
84
+ end
85
+
86
+ private
87
+
88
+ def backtrack(matrix)
89
+ right = matrix[0].size - 1
90
+
91
+ for k in (matrix.size - 1).downto 1
92
+ left = matrix[k][right]
93
+
94
+ yield k, left, right
95
+
96
+ if k > 1
97
+ right = left - 1
98
+ end
99
+ end
100
+ end
101
+
102
+ def fill_dp_matrix(data, distance, backtrack)
103
+ for i in 1..kmax
104
+ distance[i][1] = 0.0
105
+ backtrack[i][1] = 1
106
+ end
107
+
108
+ for k in 1..kmax
109
+ mean_x1 = data[1]
110
+
111
+ for i in ([2,k].max)..@data_size
112
+ if k == 1
113
+ distance[k][i] = distance[k][i-1] + (i-1) / Float(i) * (data[i] - mean_x1) ** 2
114
+ mean_x1 = ((i - 1) * mean_x1 + data[i]) / Float(i)
115
+ backtrack[1][i] = 1
116
+ else
117
+ d = 0.0 # the sum of squared distances from x_j, ..., x_i to their mean
118
+ mean_xj = 0.0
119
+
120
+ for j in i.downto k
121
+ d = d + (i - j) / Float(i - j + 1) * (data[j] - mean_xj) ** 2
122
+ mean_xj = (data[j] + (i - j) * mean_xj) / Float(i - j + 1)
123
+
124
+ if j == i
125
+ distance[k][i] = d
126
+ backtrack[k][i] = j
127
+ distance[k][i] += distance[k - 1][j - 1] unless j == 1
128
+ else
129
+ if j == 1
130
+ if d <= distance[k][i]
131
+ distance[k][i] = d
132
+ backtrack[k][i] = j
133
+ end
134
+ elsif d + distance[k - 1][j - 1] < distance[k][i]
135
+ distance[k][i] = d + distance[k - 1][j - 1]
136
+ backtrack[k][i] = j
137
+ end
138
+ end
139
+ end
140
+ end
141
+ end
142
+ end
143
+
144
+ end
145
+
146
+ # Choose an optimal number of levels between Kmin and Kmax
147
+ def select_levels(data, backtrack, kmin, kmax)
148
+ return kmin if kmin == kmax
149
+
150
+ method = :normal # :uniform or :normal
151
+
152
+ kopt = kmin
153
+
154
+ base = 1 # The position of first element in x: 1 or 0.
155
+ n = data.size - base
156
+
157
+ max_bic = 0.0
158
+
159
+ for k in kmin..kmax
160
+ cluster_sizes = []
161
+ kbacktrack = backtrack[0..k]
162
+ backtrack(kbacktrack) do |cluster, left, right|
163
+ cluster_sizes[cluster] = right - left + 1
164
+ end
165
+
166
+ index_left = base
167
+ index_right = 0
168
+
169
+ likelihood = 0
170
+ bin_left = bin_right = 0
171
+ for i in 0..(k-1)
172
+ points_in_bin = cluster_sizes[i + base]
173
+ index_right = index_left + points_in_bin - 1
174
+
175
+ if data[index_left] < data[index_right]
176
+ bin_left = data[index_left]
177
+ bin_right = data[index_right]
178
+ elsif data[index_left] == data[index_right]
179
+ bin_left = index_left == base ? data[base] : (data[index_left-1] + data[index_left]) / 2
180
+ bin_right = index_right < n-1+base ? (data[index_right] + data[index_right+1]) / 2 : data[n-1+base]
181
+ else
182
+ raise "ERROR: binLeft > binRight"
183
+ end
184
+
185
+ bin_width = bin_right - bin_left
186
+ if method == :uniform
187
+ likelihood += points_in_bin * Math.log(points_in_bin / bin_width / n)
188
+ else
189
+ mean = 0.0
190
+ variance = 0.0
191
+
192
+ for j in index_left..index_right
193
+ mean += data[j]
194
+ variance += data[j] ** 2
195
+ end
196
+ mean /= points_in_bin
197
+ variance = (variance - points_in_bin * mean ** 2) / (points_in_bin - 1) if points_in_bin > 1
198
+
199
+ if variance > 0
200
+ for j in index_left..index_right
201
+ likelihood += - (data[j] - mean) ** 2 / (2.0 * variance)
202
+ end
203
+ likelihood += points_in_bin * (Math.log(points_in_bin / Float(n)) -
204
+ 0.5 * Math.log(2 * Math::PI * variance))
205
+ else
206
+ likelihood += points_in_bin * Math.log(1.0 / bin_width / n)
207
+ end
208
+ end
209
+
210
+ index_left = index_right + 1
211
+ end
212
+
213
+ # Compute the Bayesian information criterion
214
+ bic = 2 * likelihood - (3 * k - 1) * Math.log(Float(n))
215
+
216
+ if k == kmin
217
+ max_bic = bic
218
+ kopt = kmin
219
+ elsif bic > max_bic
220
+ max_bic = bic
221
+ kopt = k
222
+ end
223
+
224
+ end
225
+
226
+ kopt
227
+ end
228
+
229
+ end
230
+
231
+ end
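
The recurrence that `fill_dp_matrix` and `backtrack` implement can be sketched in a more compact, standalone form. The version below is an illustration under stated assumptions, not the gem's code: `optimal_1d_kmeans` is a hypothetical name, it handles only a fixed k (no BIC-based choice between kmin and kmax), it uses 0-based arrays, and it computes segment costs from prefix sums rather than the incremental mean update used above.

```ruby
# Illustrative sketch only: optimal 1-D k-means for a fixed k via dynamic
# programming. Segment costs come from prefix sums, which differs from the
# incremental update in the gem's fill_dp_matrix.
def optimal_1d_kmeans(values, k)
  xs = values.sort
  n = xs.size
  raise ArgumentError, "k must be between 1 and #{n}" unless k.between?(1, n)

  # prefix[i] and prefix_sq[i] hold sums over the first i sorted values.
  prefix = [0.0]
  prefix_sq = [0.0]
  xs.each do |x|
    prefix << prefix.last + x
    prefix_sq << prefix_sq.last + x * x
  end

  # Sum of squared deviations of xs[from..to] (0-based, inclusive).
  cost = lambda do |from, to|
    count = to - from + 1
    sum = prefix[to + 1] - prefix[from]
    (prefix_sq[to + 1] - prefix_sq[from]) - sum * sum / count
  end

  # best[c][i]: minimal cost of splitting the first i values into c clusters.
  # start[c][i]: 0-based index where the last of those c clusters begins.
  best = Array.new(k + 1) { Array.new(n + 1, Float::INFINITY) }
  start = Array.new(k + 1) { Array.new(n + 1, 0) }
  best[0][0] = 0.0

  (1..k).each do |c|
    (c..n).each do |i|
      ((c - 1)...i).each do |j|
        candidate = best[c - 1][j] + cost.call(j, i - 1)
        if candidate < best[c][i]
          best[c][i] = candidate
          start[c][i] = j
        end
      end
    end
  end

  # Walk the stored start indexes backwards to recover the clusters.
  clusters = []
  right = n
  k.downto(1) do |c|
    left = start[c][right]
    clusters.unshift(xs[left...right])
    right = left
  end
  clusters
end

p optimal_1d_kmeans([1, 2, 3, 10, 11, 12, 20, 21], 3) # => [[1, 2, 3], [10, 11, 12], [20, 21]]
```

The triple loop makes this sketch O(k·n²), which is fine for small inputs. It is meant to show the shape of the recurrence, not to replace the class above.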
data/lib/onedclusterer/clusterer.rb ADDED
@@ -0,0 +1,20 @@
1
+ module OnedClusterer
2
+
3
+ # Common methods for all clusterers
4
+ module Clusterer
5
+
6
+ # Returns the zero-based index of the cluster that a value belongs to.
7
+ # The value must be in the data array.
8
+ def classify(value)
9
+ raise ArgumentError, "value: #{value} must be in data array" unless @data.include?(value)
10
+
11
+ bounds[1..-1].index { |bound| value <= bound }
12
+ end
13
+
14
+ # Returns inclusive interval limits
15
+ def intervals
16
+ first, *rest = bounds.each_cons(2).to_a
17
+ [first, *rest.map {|lower, upper| [data[data.rindex(lower) + 1] , upper] }]
18
+ end
19
+ end
20
+ end
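
The two helpers in `Clusterer` depend only on a sorted data array and a bounds array, so their behaviour can be tried in isolation. The sketch below restates them as standalone functions. The names `classify_with_bounds` and `intervals_from` are hypothetical, and the bounds are the ones printed in the README's three-cluster Ckmeans example.

```ruby
# Standalone restatement of Clusterer#classify: the first upper bound that
# is >= value identifies the cluster (zero-based).
def classify_with_bounds(bounds, value)
  bounds.drop(1).index { |upper| value <= upper }
end

# Standalone restatement of Clusterer#intervals: the first pair is kept as
# is; every later lower limit is the data value that follows the previous
# upper bound.
def intervals_from(sorted, bounds)
  first, *rest = bounds.each_cons(2).to_a
  [first, *rest.map { |lower, upper| [sorted[sorted.rindex(lower) + 1], upper] }]
end

README_SORTED = [1107.1, 1259.61, 1559.04, 1707.65, 1855.75, 2024.82, 2155.8].freeze
# The leading 0 is the placeholder that #bounds always puts first.
README_BOUNDS = [0, 1259.61, 1855.75, 2155.8].freeze

p classify_with_bounds(README_BOUNDS, 1107.1)  # => 0
p classify_with_bounds(README_BOUNDS, 1707.65) # => 1
p classify_with_bounds(README_BOUNDS, 3000.0)  # => nil (above every bound)
p intervals_from(README_SORTED, README_BOUNDS)
# => [[0, 1259.61], [1559.04, 1855.75], [2024.82, 2155.8]]
```

Note that the first interval starts at the placeholder 0 rather than at the smallest data value, which follows directly from how `bounds` is built.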
data/lib/onedclusterer/jenks.rb ADDED
@@ -0,0 +1,135 @@
1
+ require 'matrix'
2
+ require_relative 'clusterer'
3
+
4
+ module OnedClusterer
5
+ # [Jenks natural breaks optimization](http://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization)
6
+ #
7
+ # Adapted from the JavaScript implementation: https://gist.github.com/tmcw/4977508
8
+ class Jenks
9
+ include Clusterer
10
+
11
+ attr_reader :data, :n_classes
12
+
13
+ # @param data one dimensional numerical array
14
+ # @param n_classes number of classes
15
+ def initialize(data, n_classes)
16
+ @data = data.sort
17
+ @n_classes = n_classes
18
+
19
+ raise ArgumentError, "Number of classes can not be greater than size of data array." if n_classes > data.size
20
+ raise ArgumentError, "Number of classes can not be less than 1." if n_classes < 1
21
+
22
+ @lower_class_limits, @variance_combinations = matrices
23
+ end
24
+
25
+ # get clustered array with `n` number of clusters
26
+ def clusters(n = n_classes)
27
+ bounds_iter = bounds(n).drop(1).each_with_index
28
+ result = Array.new(n) { [] }
29
+
30
+ data.each do |value|
31
+ bound, index = bounds_iter.peek
32
+ if value > bound
33
+ bounds_iter.next
34
+ index += 1
35
+ end
36
+ result[index].push(value)
37
+ end
38
+
39
+ result
40
+ end
41
+
42
+ # get bounds array for `n` number of classes
43
+ def bounds(n = n_classes)
44
+ raise ArgumentError, "n must be less than or equal to n_classes: #{n_classes}" if n > n_classes
45
+
46
+ k = data.size
47
+ bounds = []
48
+
49
+ # the calculation of classes will never include the upper and
50
+ # lower bounds, so we need to explicitly set them
51
+ bounds[n] = data.last
52
+ bounds[0] = 0
53
+
54
+ for countNum in n.downto 2
55
+ id = @lower_class_limits[k][countNum]
56
+ bounds[countNum - 1] = data[id - 2]
57
+ k = id - 1
58
+ end
59
+
60
+ bounds
61
+ end
62
+
63
+ private
64
+
65
+ # Compute the matrices required for Jenks breaks. These matrices
66
+ # can be used for any classing of data with `classes <= n_classes`
67
+ def matrices
68
+ rows = data.size
69
+ cols = n_classes
70
+
71
+ # in the original implementation, these matrices are referred to
72
+ # as `LC` and `OP`
73
+ # * lower_class_limits (LC): optimal lower class limits
74
+ # * variance_combinations (OP): optimal variance combinations for all classes
75
+ lower_class_limits = *Matrix.zero(rows + 1, cols + 1)
76
+ variance_combinations = *Matrix.zero(rows + 1, cols + 1)
77
+
78
+ # the variance, as computed at each step in the calculation
79
+ variance = 0
80
+
81
+ for i in 1..cols
82
+ lower_class_limits[1][i] = 1
83
+ variance_combinations[1][i] = 0
84
+ for j in 2..rows
85
+ variance_combinations[j][i] = Float::INFINITY
86
+ end
87
+ end
88
+
89
+ for l in 2..rows
90
+ sum = 0 # `SZ` originally. this is the sum of the values seen thus far when calculating variance.
91
+ sum_squares = 0 # `ZSQ` originally. the sum of squares of values seen thus far
92
+ w = 0 # `WT` originally. This is the number of data points considered so far.
93
+
94
+ for m in 1..l
95
+ lower_class_limit = l - m + 1 # `III` originally
96
+ val = data[lower_class_limit - 1]
97
+
98
+ # here we're estimating variance for each potential classing
99
+ # of the data, for each potential number of classes. `w`
100
+ # is the number of data points considered so far.
101
+ w += 1
102
+
103
+ # increase the current sum and sum-of-squares
104
+ sum += val
105
+ sum_squares += (val ** 2)
106
+
107
+ # the variance at this point in the sequence is the sum of squares
108
+ # minus the squared total divided by the number of samples, i.e.
109
+ # the sum of squared deviations from the mean.
110
+ variance = sum_squares - (sum ** 2) / w.to_f
111
+
112
+ i4 = lower_class_limit - 1 # `IV` originally
113
+ if i4 != 0
114
+ for j in 2..cols
115
+ # if adding this element to an existing class
116
+ # will increase its variance beyond the limit, break
117
+ # the class at this point, setting the lower_class_limit
118
+ # at this point.
119
+ if variance_combinations[l][j] >= (variance + variance_combinations[i4][j - 1])
120
+ lower_class_limits[l][j] = lower_class_limit
121
+ variance_combinations[l][j] = variance +
122
+ variance_combinations[i4][j - 1]
123
+ end
124
+ end
125
+ end
126
+ end
127
+
128
+ lower_class_limits[l][1] = 1
129
+ variance_combinations[l][1] = variance
130
+ end
131
+
132
+ [lower_class_limits, variance_combinations]
133
+ end
134
+ end
135
+ end
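
The quantity that `matrices` calls `variance` is computed from running totals. A small standalone check, assuming nothing beyond plain Ruby, shows that `sum_squares - sum**2 / w` equals the sum of squared deviations from the mean. The function names `ssd_direct` and `ssd_running` are hypothetical.

```ruby
# Sum of squared deviations computed directly from the mean.
def ssd_direct(values)
  mean = values.reduce(0.0, :+) / values.size
  values.reduce(0.0) { |acc, x| acc + (x - mean)**2 }
end

# The same quantity from running totals, in the style of Jenks#matrices.
# Totals start at 0.0 so that the division is never Integer division.
def ssd_running(values)
  sum = 0.0
  sum_squares = 0.0
  values.each do |x|
    sum += x
    sum_squares += x**2
  end
  sum_squares - sum**2 / values.size
end

sample = [2, 4, 4, 4, 5, 5, 7, 9]
p ssd_direct(sample)  # => 32.0
p ssd_running(sample) # => 32.0

# With Integer totals, `/` truncates and the result is off:
p 5 - 3**2 / 2   # => 1   (values [1, 2] with Integer arithmetic)
p 5 - 3**2 / 2.0 # => 0.5 (the correct sum of squared deviations)
```

The running form is what lets the inner loop extend a candidate class by one value in constant time instead of recomputing the mean.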
data/lib/onedclusterer/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module OnedClusterer
2
+ VERSION = "0.1.0"
3
+ end
data/onedclusterer.gemspec ADDED
@@ -0,0 +1,24 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'onedclusterer/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "onedclusterer"
8
+ spec.version = OnedClusterer::VERSION
9
+ spec.authors = ["Hamdi Akoguz"]
10
+ spec.email = ["hamdiakoguz@gmail.com"]
11
+
12
+ spec.summary = %q{a tiny ruby library for one-dimensional clustering methods.}
13
+ spec.homepage = "https://github.com/Hamdiakoguz/onedclusterer"
14
+
15
+ spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
16
+ spec.bindir = "exe"
17
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
18
+ spec.require_paths = ["lib"]
19
+
20
+ spec.add_development_dependency "bundler", "~> 1.10"
21
+ spec.add_development_dependency "rake", "~> 10.0"
22
+ spec.add_development_dependency "minitest"
23
+ spec.add_development_dependency "pry"
24
+ end
metadata ADDED
@@ -0,0 +1,114 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: onedclusterer
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Hamdi Akoguz
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2015-10-10 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.10'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.10'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '10.0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '10.0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: minitest
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: pry
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ">="
67
+ - !ruby/object:Gem::Version
68
+ version: '0'
69
+ description:
70
+ email:
71
+ - hamdiakoguz@gmail.com
72
+ executables: []
73
+ extensions: []
74
+ extra_rdoc_files: []
75
+ files:
76
+ - ".gitignore"
77
+ - ".travis.yml"
78
+ - CODE_OF_CONDUCT.md
79
+ - Gemfile
80
+ - LICENSE.txt
81
+ - README.md
82
+ - Rakefile
83
+ - bin/console
84
+ - bin/setup
85
+ - lib/onedclusterer.rb
86
+ - lib/onedclusterer/ckmeans.rb
87
+ - lib/onedclusterer/clusterer.rb
88
+ - lib/onedclusterer/jenks.rb
89
+ - lib/onedclusterer/version.rb
90
+ - onedclusterer.gemspec
91
+ homepage: https://github.com/Hamdiakoguz/onedclusterer
92
+ licenses: []
93
+ metadata: {}
94
+ post_install_message:
95
+ rdoc_options: []
96
+ require_paths:
97
+ - lib
98
+ required_ruby_version: !ruby/object:Gem::Requirement
99
+ requirements:
100
+ - - ">="
101
+ - !ruby/object:Gem::Version
102
+ version: '0'
103
+ required_rubygems_version: !ruby/object:Gem::Requirement
104
+ requirements:
105
+ - - ">="
106
+ - !ruby/object:Gem::Version
107
+ version: '0'
108
+ requirements: []
109
+ rubyforge_project:
110
+ rubygems_version: 2.2.3
111
+ signing_key:
112
+ specification_version: 4
113
+ summary: a tiny ruby library for one-dimensional clustering methods.
114
+ test_files: []