weka 0.1.0-java

Files changed (55)
  1. checksums.yaml +7 -0
  2. data/.gitignore +10 -0
  3. data/.rspec +2 -0
  4. data/.travis.yml +15 -0
  5. data/CODE_OF_CONDUCT.md +13 -0
  6. data/Gemfile +4 -0
  7. data/Jarfile +1 -0
  8. data/Jarfile.lock +17 -0
  9. data/MIT-LICENSE.txt +19 -0
  10. data/README.md +687 -0
  11. data/Rakefile +21 -0
  12. data/bin/console +14 -0
  13. data/bin/setup +7 -0
  14. data/lib/weka.rb +32 -0
  15. data/lib/weka/attribute_selection.rb +1 -0
  16. data/lib/weka/attribute_selection/attribute_selection.rb +11 -0
  17. data/lib/weka/attribute_selection/evaluator.rb +29 -0
  18. data/lib/weka/attribute_selection/search.rb +14 -0
  19. data/lib/weka/class_builder.rb +88 -0
  20. data/lib/weka/classifiers.rb +1 -0
  21. data/lib/weka/classifiers/bayes.rb +16 -0
  22. data/lib/weka/classifiers/evaluation.rb +37 -0
  23. data/lib/weka/classifiers/functions.rb +21 -0
  24. data/lib/weka/classifiers/lazy.rb +13 -0
  25. data/lib/weka/classifiers/meta.rb +29 -0
  26. data/lib/weka/classifiers/rules.rb +16 -0
  27. data/lib/weka/classifiers/trees.rb +18 -0
  28. data/lib/weka/classifiers/utils.rb +138 -0
  29. data/lib/weka/clusterers.rb +16 -0
  30. data/lib/weka/clusterers/cluster_evaluation.rb +14 -0
  31. data/lib/weka/clusterers/utils.rb +103 -0
  32. data/lib/weka/concerns.rb +18 -0
  33. data/lib/weka/concerns/buildable.rb +19 -0
  34. data/lib/weka/concerns/describable.rb +30 -0
  35. data/lib/weka/concerns/optionizable.rb +49 -0
  36. data/lib/weka/concerns/persistent.rb +16 -0
  37. data/lib/weka/core.rb +6 -0
  38. data/lib/weka/core/attribute.rb +24 -0
  39. data/lib/weka/core/converters.rb +17 -0
  40. data/lib/weka/core/dense_instance.rb +68 -0
  41. data/lib/weka/core/instances.rb +199 -0
  42. data/lib/weka/core/loader.rb +32 -0
  43. data/lib/weka/core/saver.rb +34 -0
  44. data/lib/weka/exceptions.rb +6 -0
  45. data/lib/weka/filters.rb +1 -0
  46. data/lib/weka/filters/filter.rb +9 -0
  47. data/lib/weka/filters/supervised/attribute.rb +26 -0
  48. data/lib/weka/filters/supervised/instance.rb +16 -0
  49. data/lib/weka/filters/unsupervised/attribute.rb +67 -0
  50. data/lib/weka/filters/unsupervised/instance.rb +25 -0
  51. data/lib/weka/filters/utils.rb +17 -0
  52. data/lib/weka/jars.rb +19 -0
  53. data/lib/weka/version.rb +3 -0
  54. data/weka.gemspec +32 -0
  55. metadata +183 -0
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: bedc1a7e9457ec42c3343b05a68e7e2598e9f0f9
4
+ data.tar.gz: ac3ade8d787979d9127d72daa6879b38cb58dfd4
5
+ SHA512:
6
+ metadata.gz: b5f1f0754f075545e460504e3a343b030bd130f27edd78c4ff312f96f134505df2a0f32a3a1b7461a1c588953764766c5d3acbf06acfa7d4ed76f44781a096a6
7
+ data.tar.gz: 1a457c4160b77baea4b7f471af5229062f48a791aae8f3a7a36dc901fb07b9582ffc3fdaf77a689598a411d054b698623ef583311b22df93c4f813d1b4419e7c
@@ -0,0 +1,10 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ jars
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format documentation
2
+ --color
@@ -0,0 +1,15 @@
1
+ sudo: false
2
+ language: ruby
3
+
4
+ rvm:
5
+ - jruby-9000
6
+
7
+ cache:
8
+ - bundler
9
+
10
+ before_install:
11
+ - gem install bundler
12
+ - rvm get head
13
+ - rvm use jruby-9.0.1.0 --install
14
+
15
+ script: bundle exec rake spec
@@ -0,0 +1,13 @@
1
+ # Contributor Code of Conduct
2
+
3
+ As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
4
+
5
+ We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
6
+
7
+ Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
8
+
9
+ Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
10
+
11
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
12
+
13
+ This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org), version 1.0.0, available at [http://contributor-covenant.org/version/1/0/0/](http://contributor-covenant.org/version/1/0/0/)
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in weka.gemspec
4
+ gemspec
data/Jarfile ADDED
@@ -0,0 +1 @@
1
+ jar 'nz.ac.waikato.cms.weka:weka-dev:jar:3.7.13'
@@ -0,0 +1,17 @@
1
+ ---
2
+ version: 0.13.0
3
+ groups:
4
+ default:
5
+ dependencies:
6
+ - nz.ac.waikato.cms.weka.thirdparty:java-cup-11b-runtime:jar:2015.03.26
7
+ - nz.ac.waikato.cms.weka.thirdparty:java-cup-11b:jar:2015.03.26
8
+ - nz.ac.waikato.cms.weka:weka-dev:jar:3.7.13
9
+ - org.pentaho.pentaho-commons:pentaho-package-manager:jar:1.0.11
10
+ artifacts:
11
+ - jar:nz.ac.waikato.cms.weka:weka-dev:jar:3.7.13:
12
+ transitive:
13
+ nz.ac.waikato.cms.weka.thirdparty:java-cup-11b:jar:2015.03.26: {}
14
+ org.pentaho.pentaho-commons:pentaho-package-manager:jar:1.0.11: {}
15
+ nz.ac.waikato.cms.weka.thirdparty:java-cup-11b-runtime:jar:2015.03.26: {}
16
+ remote_repositories:
17
+ - http://repo1.maven.org/maven2/
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2015 Paul Götze
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in
11
+ all copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19
+ THE SOFTWARE.
@@ -0,0 +1,687 @@
1
+ # Weka
2
+
3
+ [![Gem Version](https://badge.fury.io/rb/weka.svg)](http://badge.fury.io/rb/weka)
4
+ [![Travis Build](https://travis-ci.org/paulgoetze/weka-jruby.svg)](https://travis-ci.org/paulgoetze/weka-jruby)
5
+
6
+ Machine Learning & Data Mining with JRuby based on the [Weka](http://www.cs.waikato.ac.nz/~ml/weka/index.html) Java library.
7
+
8
+ ## Installation
9
+
10
+ Add this line to your application's Gemfile:
11
+
12
+ ```ruby
13
+ gem 'weka'
14
+ ```
15
+
16
+ And then execute:
17
+
18
+ $ bundle install
19
+
20
+ Or install it yourself as:
21
+
22
+ $ gem install weka
23
+
24
+ ## Usage
25
+
26
+ Start using Weka's Machine Learning and Data Mining algorithms by requiring the gem:
27
+
28
+ ```ruby
29
+ require 'weka'
30
+ ```
31
+
32
+ The weka gem tries to carry over the namespaces defined in Weka and enhances some interfaces in order to allow a more Ruby-ish programming style when using the Weka library.
33
+
34
+ The idea behind keeping the namespaces is that you can also use the [Weka documentation](http://weka.sourceforge.net/doc.dev/) to look up functionality and classes.
35
+
36
+ Analogous to the Weka documentation, you can find the following namespaces:
37
+
38
+ | Namespace | Description |
39
+ |----------------------------|------------------------------------------------------------------|
40
+ | `Weka::Core` | defines base classes for loading, saving, creating, and editing a dataset |
41
+ | `Weka::Classifiers` | defines classifier classes in different sub-modules (`Bayes`, `Functions`, `Lazy`, `Meta`, `Rules`, and `Trees` ) |
42
+ | `Weka::Filters` | defines filter classes for processing datasets in the `Supervised` or `Unsupervised`, and `Attribute` or `Instance` sub-modules |
43
+ | `Weka::Clusterers` | defines clusterer classes |
44
+ | `Weka::AttributeSelection` | defines classes for selecting attributes from a dataset |
45
+
46
+ ### Instances
47
+
48
+ Instances objects hold the dataset that is used to train a classifier or that
49
+ should be classified based on training data.
50
+
51
+ Instances can be loaded from files and saved to files.
52
+ Supported formats are *ARFF*, *CSV*, and *JSON*.
53
+
54
+ #### Loading Instances from a file
55
+
56
+ Instances can be loaded from ARFF, CSV, and JSON files:
57
+
58
+ ```ruby
59
+ instances = Weka::Core::Instances.from_arff('weather.arff')
60
+ instances = Weka::Core::Instances.from_csv('weather.csv')
61
+ instances = Weka::Core::Instances.from_json('weather.json')
62
+ ```
63
+
64
+ #### Creating Instances
65
+
66
+ Attributes of an Instances object can be defined in a block using the `with_attributes` method. The class attribute can be set on the fly by passing the `class_attribute: true` option when defining an attribute.
67
+
68
+ ```ruby
69
+ # create instances with relation name 'weather' and attributes
70
+ instances = Weka::Core::Instances.new(relation_name: 'weather').with_attributes do
71
+ nominal :outlook, values: ['sunny', 'overcast', 'rainy']
72
+ numeric :temperature
73
+ numeric :humidity
74
+ nominal :windy, values: [true, false]
75
+ date :last_storm, 'yyyy-MM-dd'
76
+ nominal :play, values: [:yes, :no], class_attribute: true
77
+ end
78
+ ```
79
+
80
+ You can also pass an array of Attributes when instantiating new Instances.
81
+ This is useful if you want to create a new empty Instances object with the same
82
+ attributes as an already existing one:
83
+
84
+ ```ruby
85
+ # Take attributes from existing instances
86
+ attributes = instances.attributes
87
+
88
+ # create an empty Instances object with the given attributes
89
+ test_instances = Weka::Core::Instances.new(attributes: attributes)
90
+ ```
91
+
92
+ #### Saving Instances as files
93
+
94
+ You can save Instances as an ARFF, CSV, or JSON file.
95
+
96
+ ```ruby
97
+ instances.to_arff('weather.arff')
98
+ instances.to_csv('weather.csv')
99
+ instances.to_json('weather.json')
100
+ ```
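For orientation, the ARFF layout that a call like `to_arff` produces can be sketched in plain Ruby. This is not how the gem writes files (it relies on the Weka Java library); `arff_header` and the attribute hash layout are hypothetical names used only for illustration, and only nominal and numeric attributes are covered.

```ruby
# Illustrative only: renders an ARFF header from a plain Ruby description.
# `arff_header` and the attribute hash layout are hypothetical, not gem API.
def arff_header(relation, attributes)
  lines = ["@relation #{relation}", '']

  attributes.each do |name, type|
    declaration =
      case type
      when Array then "{#{type.join(',')}}" # nominal: list of allowed values
      when :numeric then 'numeric'
      else raise ArgumentError, "unsupported attribute type: #{type.inspect}"
      end

    lines << "@attribute #{name} #{declaration}"
  end

  (lines << '' << '@data').join("\n")
end

puts arff_header('weather',
                 outlook: %w[sunny overcast rainy],
                 temperature: :numeric,
                 windy: %w[TRUE FALSE])
```

The data rows would follow the `@data` line as comma-separated values, one instance per line.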
101
+
102
+ #### Adding additional attributes
103
+
104
+ You can add additional attributes to the Instances after its initialization.
105
+ All records that are already in the dataset will get an unknown value (`?`) for
106
+ the new attribute.
107
+
108
+ ```ruby
109
+ instances.add_numeric_attribute(:pressure)
110
+ instances.add_nominal_attribute(:grandma_says, values: [:hm, :bad, :terrible])
111
+ instances.add_date_attribute(:last_rain, 'yyyy-MM-dd HH:mm')
112
+ ```
113
+
114
+ #### Adding a data instance
115
+
116
+ You can add a data instance to the Instances by using the `add_instance` method:
117
+
118
+ ```ruby
119
+ data = [:sunny, 70, 80, true, '2015-12-06', :yes, 1.1, :hm, '2015-12-24 20:00']
120
+ instances.add_instance(data)
121
+
122
+ # with custom weight:
123
+ instances.add_instance(data, weight: 2.0)
124
+ ```
125
+
126
+ Multiple instances can be added with the `add_instances` method:
127
+
128
+ ```ruby
129
+ data = [
130
+ [:sunny, 70, 80, true, '2015-12-06', :yes, 1.1, :hm, '2015-12-24 20:00'],
131
+ [:overcast, 80, 85, false, '2015-11-11', :no, 0.9, :bad, '2015-12-25 18:13']
132
+ ]
133
+
134
+ instances.add_instances(data, weight: 2.0)
135
+ ```
136
+
137
+ If the `weight` argument is not given, then a default weight of 1.0 is used.
138
+ The weight in `add_instances` is used for all the added instances.
139
+
140
+ #### Setting a class attribute
141
+
142
+ You can set an earlier defined attribute as the class attribute of the dataset.
143
+ This allows classifiers to use the class for building a classification model while training.
144
+
145
+ ```ruby
146
+ instances.add_nominal_attribute(:size, values: ['L', 'XL'])
147
+ instances.class_attribute = :size
148
+ ```
149
+
150
+ The added attribute can also be directly set as the class attribute:
151
+
152
+ ```ruby
153
+ instances.add_nominal_attribute(:size, values: ['L', 'XL'], class_attribute: true)
154
+ ```
155
+
156
+ Keep in mind that you can only assign existing attributes to be the class attribute.
157
+ The class attribute will no longer appear in `instances.attributes`; it can be accessed with the `class_attribute` method instead.
158
+
159
+
160
+ #### Alias methods
161
+
162
+ `Weka::Core::Instances` has the following alias methods:
163
+
164
+ | method | alias |
165
+ |-----------------------|-------------------------|
166
+ | `numeric` | `add_numeric_attribute` |
167
+ | `nominal` | `add_nominal_attribute` |
168
+ | `date` | `add_date_attribute` |
169
+ | `string` | `add_string_attribute` |
170
+ | `set_class_attribute` | `class_attribute=` |
171
+ | `with_attributes` | `add_attributes` |
172
+
173
+ The methods on the left side are meant to be used when defining
174
+ attributes in a block when using `#with_attributes` (or `#add_attributes`).
175
+
176
+ The alias methods are meant to be used for explicitly adding
177
+ attributes to an Instances object or defining its class attribute later on.
178
+
179
+ ## Filters
180
+
181
+ Filters are used to preprocess datasets.
182
+
183
+ There are two categories of filters which are also reflected by the namespaces:
184
+
185
+ * *supervised* – The filter requires a class attribute to be set
186
+ * *unsupervised* – A class attribute is not required to be present
187
+
188
+ In each category there are two sub-categories:
189
+
190
+ * *attribute-based* – Attributes (columns) are processed
191
+ * *instance-based* – Instances (rows) are processed
192
+
193
+ Thus, Filter classes are organized in the following four namespaces:
194
+
195
+ ```ruby
196
+ Weka::Filters::Supervised::Attribute
197
+ Weka::Filters::Supervised::Instance
198
+
199
+ Weka::Filters::Unsupervised::Attribute
200
+ Weka::Filters::Unsupervised::Instance
201
+ ```
202
+
203
+ #### Filtering Instances
204
+
205
+ Filters can be used directly to filter Instances:
206
+
207
+ ```ruby
208
+ # create filter
209
+ filter = Weka::Filters::Unsupervised::Attribute::Normalize.new
210
+
211
+ # filter instances
212
+ filtered_data = filter.filter(instances)
213
+ ```
214
+
215
+ You can also apply a Filter on an Instances object:
216
+
217
+ ```ruby
218
+ # create filter
219
+ filter = Weka::Filters::Unsupervised::Attribute::Normalize.new
220
+
221
+ # apply filter on instances
222
+ filtered_data = instances.apply_filter(filter)
223
+ ```
224
+
225
+ With this approach, it is possible to chain multiple filters on a dataset:
226
+
227
+ ```ruby
228
+ # create filters
229
+ include Weka::Filters::Unsupervised::Attribute
230
+
231
+ normalize = Normalize.new
232
+ discretize = Discretize.new
233
+
234
+ # apply a filter chain on instances
235
+ filtered_data = instances.apply_filter(normalize).apply_filter(discretize)
236
+ ```
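Chaining works because each `apply_filter` call returns a dataset that can receive the next call. A minimal plain-Ruby sketch of that pattern follows; `Dataset` and the lambda filters are hypothetical stand-ins, not the gem's classes.

```ruby
# Illustrative only: `Dataset` and the lambda filters are hypothetical stand-ins.
class Dataset
  attr_reader :rows

  def initialize(rows)
    @rows = rows
  end

  # Returns a new Dataset instead of mutating the receiver, so calls can be chained.
  def apply_filter(filter)
    Dataset.new(filter.call(rows))
  end
end

drop_missing = ->(rows) { rows.reject { |row| row.any?(&:nil?) } }
round_values = ->(rows) { rows.map { |row| row.map(&:round) } }

data     = Dataset.new([[1.4, 2.6], [nil, 3.0], [0.4, 9.9]])
filtered = data.apply_filter(drop_missing).apply_filter(round_values)

p filtered.rows  # => [[1, 3], [0, 10]]
p data.rows.size # => 3 (the original dataset is unchanged)
```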
237
+
238
+ #### Setting Filter options
239
+
240
+ Every Filter has several options. You can print a description of all options of a filter:
241
+
242
+ ```ruby
243
+ puts Weka::Filters::Unsupervised::Attribute::Normalize.options
244
+ # -S <num> The scaling factor for the output range.
245
+ # (default: 1.0)
246
+ # -T <num> The translation of the output range.
247
+ # (default: 0.0)
248
+ # -unset-class-temporarily Unsets the class index temporarily before the filter is
249
+ # applied to the data.
250
+ # (default: no)
251
+ ```
252
+
253
+ To get the default option set of a Filter you can run `.default_options`:
254
+
255
+ ```ruby
256
+ Weka::Filters::Unsupervised::Attribute::Normalize.default_options
257
+ # => '-S 1.0 -T 0.0'
258
+ ```
259
+
260
+ Options can be set while building a Filter:
261
+
262
+ ```ruby
263
+ filter = Weka::Filters::Unsupervised::Attribute::Normalize.build do
264
+ use_options '-S 0.5'
265
+ end
266
+ ```
267
+
268
+ Or they can be set or changed after you created the Filter:
269
+
270
+ ```ruby
271
+ filter = Weka::Filters::Unsupervised::Attribute::Normalize.new
272
+ filter.use_options('-S 0.5')
273
+ ```
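To make the `-S` (scale) and `-T` (translation) options more concrete, here is a rough sketch of min-max normalization into the range from `T` to `T + S`. The helper `normalize` is hypothetical and written for this illustration only; mapping a constant column to the lower bound is an assumption.

```ruby
# Illustrative only: min-max normalization of one numeric column.
# `normalize` is a hypothetical helper, not part of the gem or of Weka.
def normalize(values, scale: 1.0, translation: 0.0)
  min, max = values.minmax
  range = max - min

  values.map do |value|
    # Assumption: a constant column maps to the lower bound of the range.
    unit = range.zero? ? 0.0 : (value - min) / range.to_f
    unit * scale + translation
  end
end

temperatures = [64, 70, 85]

p normalize(temperatures)             # mirrors the defaults '-S 1.0 -T 0.0'
p normalize(temperatures, scale: 0.5) # mirrors '-S 0.5'
```

With the defaults the smallest value maps to 0.0 and the largest to 1.0; with a scale of 0.5 the largest value maps to 0.5.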
274
+
275
+ ## Attribute selection
276
+
277
+ Selecting attributes (features) from a set of instances is important
278
+ for getting the best result out of a classification or clustering.
279
+ Attribute selection reduces the number of attributes and thereby can speed up
280
+ the runtime of the algorithms.
281
+ It also avoids processing too many attributes when only a certain subset is essential
282
+ for building a good model.
283
+
284
+ For attribute selection you need to apply a search and an evaluation method on a dataset.
285
+
286
+ Search methods are defined in the `Weka::AttributeSelection::Search` module.
287
+ There are search methods for subset search and individual attribute search.
288
+
289
+ Evaluators are defined in the `Weka::AttributeSelection::Evaluator` module.
290
+ Corresponding to the search method types, there are two evaluator types: one for subset search and one for individual attribute search.
291
+
292
+ The search methods and evaluators from each category can be combined to perform an attribute selection.
293
+
294
+ **Classes for attribute *subset* selection:**
295
+
296
+ | Search | Evaluators |
297
+ |-------------------------------|------------------------------|
298
+ | `BestFirst`, `GreedyStepwise` | `CfsSubset`, `WrapperSubset` |
299
+
300
+ **Classes for *individual* attribute selection:**
301
+
302
+ | Search | Evaluators |
303
+ |----------|------------|
304
+ | `Ranker` | `CorrelationAttribute`, `GainRatioAttribute`, `InfoGainAttribute`, `OneRAttribute`, `ReliefFAttribute`, `SymmetricalUncertAttribute` |
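Individual attribute selection scores each attribute on its own and then sorts the attributes by score. The following plain-Ruby sketch uses information gain as the score to show the idea; the helper names and the tiny dataset are made up for illustration and this is not Weka's implementation.

```ruby
# Illustrative only: ranks nominal attributes by information gain.
# The dataset and all helper names are made up for this sketch.
def entropy(labels)
  counts = labels.group_by { |label| label }.values.map(&:size)
  counts.reduce(0.0) do |sum, count|
    share = count.to_f / labels.size
    sum - share * Math.log2(share)
  end
end

def info_gain(rows, attribute, class_attribute)
  subsets = rows.group_by { |row| row[attribute] }.values
  remainder = subsets.reduce(0.0) do |sum, subset|
    labels = subset.map { |row| row[class_attribute] }
    sum + subset.size.to_f / rows.size * entropy(labels)
  end

  entropy(rows.map { |row| row[class_attribute] }) - remainder
end

# Highest-scoring attribute first.
def rank_attributes(rows, class_attribute)
  attributes = rows.first.keys - [class_attribute]
  attributes.sort_by { |attribute| -info_gain(rows, attribute, class_attribute) }
end

rows = [
  { outlook: :sunny,    windy: true,  play: :no  },
  { outlook: :sunny,    windy: false, play: :no  },
  { outlook: :overcast, windy: true,  play: :yes },
  { outlook: :overcast, windy: false, play: :yes },
  { outlook: :rainy,    windy: true,  play: :no  },
  { outlook: :rainy,    windy: false, play: :yes }
]

p rank_attributes(rows, :play) # => [:outlook, :windy]
```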
305
+
306
+ An attribute selection can either be performed with the `Weka::AttributeSelection::AttributeSelection` class:
307
+
308
+ ```ruby
309
+ instances = Weka::Core::Instances.from_arff('weather.arff')
310
+
311
+ selection = Weka::AttributeSelection::AttributeSelection.new
312
+ selection.search = Weka::AttributeSelection::Search::Ranker.new
313
+ selection.evaluator = Weka::AttributeSelection::Evaluator::PrincipalComponents.new
314
+
315
+ selection.select_attributes(instances)
316
+ puts selection.summary
317
+ ```
318
+
319
+ Or you can use the supervised `AttributeSelection` filter to directly filter instances:
320
+
321
+ ```ruby
322
+ instances = Weka::Core::Instances.from_arff('weather.arff')
323
+ search = Weka::AttributeSelection::Search::Ranker.new
324
+ evaluator = Weka::AttributeSelection::Evaluator::PrincipalComponents.new
325
+
326
+ filter = Weka::Filters::Supervised::Attribute::AttributeSelection.build do
327
+ use_search search
328
+ use_evaluator evaluator
329
+ end
330
+
331
+ filtered_instances = instances.apply_filter(filter)
332
+ ```
333
+
334
+ ## Classifiers
335
+
336
+ Weka's classification and regression algorithms can be found in the `Weka::Classifiers`
337
+ namespace.
338
+
339
+ The classifier classes are organised in the following submodules:
340
+
341
+ ```ruby
342
+ Weka::Classifiers::Bayes
343
+ Weka::Classifiers::Functions
344
+ Weka::Classifiers::Lazy
345
+ Weka::Classifiers::Meta
346
+ Weka::Classifiers::Rules
347
+ Weka::Classifiers::Trees
348
+ ```
349
+
350
+ #### Getting information about a classifier
351
+
352
+ To get a description about the classifier class and its available options
353
+ you can use the class methods `.description` and `.options` on each classifier:
354
+
355
+ ```ruby
356
+ puts Weka::Classifiers::Trees::RandomForest.description
357
+ # Class for constructing a forest of random trees.
358
+ # For more information see:
359
+ # Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
360
+
361
+ puts Weka::Classifiers::Trees::RandomForest.options
362
+ # -I <number of trees> Number of trees to build.
363
+ # (default 100)
364
+ # -K <number of features> Number of features to consider (<1=int(log_2(#predictors)+1)).
365
+ # (default 0)
366
+ # ...
367
+
368
+ ```
369
+
370
+ The default options that are used for a classifier can be displayed with:
371
+
372
+ ```ruby
373
+ Weka::Classifiers::Trees::RandomForest.default_options
374
+ # => "-I 100 -K 0 -S 1 -num-slots 1"
375
+ ```
376
+
377
+ #### Creating a new classifier
378
+
379
+ To build a new classifier model based on training instances you can use
380
+ the following syntax:
381
+
382
+ ```ruby
383
+ instances = Weka::Core::Instances.from_arff('weather.arff')
384
+ instances.class_attribute = :play
385
+
386
+ classifier = Weka::Classifiers::Trees::RandomForest.new
387
+ classifier.use_options('-I 200 -K 5')
388
+ classifier.train_with_instances(instances)
389
+ ```
390
+ You can also build a classifier by using the block syntax:
391
+
392
+ ```ruby
393
+ classifier = Weka::Classifiers::Trees::RandomForest.build do
394
+ use_options '-I 200 -K 5'
395
+ train_with_instances instances
396
+ end
397
+
398
+ ```
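The block syntax can be understood as "create an object, then run the block in its context". A minimal sketch under that assumption is shown here; the `Buildable` module and `ToyClassifier` below are hypothetical illustrations and not the gem's actual code.

```ruby
# Illustrative only: this `Buildable` and `ToyClassifier` are hypothetical sketches.
module Buildable
  # Creates an instance and runs the optional block in its context.
  def build(&block)
    instance = new
    instance.instance_eval(&block) if block
    instance
  end
end

class ToyClassifier
  extend Buildable

  attr_reader :options, :training_size

  def use_options(options)
    @options = options
  end

  def train_with_instances(instances)
    @training_size = instances.size
  end
end

instances = [[:sunny, :no], [:rainy, :yes]] # stand-in for a real dataset

classifier = ToyClassifier.build do
  use_options '-I 200 -K 5'
  train_with_instances instances # local variables of the caller stay visible
end

p classifier.options       # => "-I 200 -K 5"
p classifier.training_size # => 2
```

With this pattern, `self` inside the block is the new object. Local variables from the surrounding scope remain reachable, but the caller's instance variables and methods do not, so it helps to assign them to locals first.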
399
+
400
+ #### Evaluating a classifier model
401
+
402
+ You can evaluate the trained classifier using [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)):
403
+
404
+ ```ruby
405
+ # default number of folds is 3
406
+ evaluation = classifier.cross_validate
407
+
408
+ # with a custom number of folds
409
+ evaluation = classifier.cross_validate(folds: 10)
410
+ ```
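The fold mechanics behind a call such as `cross_validate(folds: 3)` can be sketched in plain Ruby. This is a simplified illustration without stratification; `k_fold_splits` and its `seed` parameter are hypothetical and do not reflect the gem's or Weka's internals.

```ruby
# Illustrative only: plain k-fold splitting without stratification.
# `k_fold_splits` is a hypothetical helper, not the gem's or Weka's code.
def k_fold_splits(size, folds: 3, seed: 1)
  indices = (0...size).to_a.shuffle(random: Random.new(seed))

  # Deal the shuffled indices round-robin into `folds` buckets.
  buckets = Array.new(folds) { [] }
  indices.each_with_index { |index, position| buckets[position % folds] << index }

  # Each bucket serves once as the test set; the rest is the training set.
  buckets.map do |test|
    { train: (indices - test).sort, test: test.sort }
  end
end

k_fold_splits(14, folds: 3).each do |split|
  puts "train on #{split[:train].size}, test on #{split[:test].size}"
end
```

With 14 instances and 3 folds, the test sets contain 5, 5, and 4 instances, and every instance is used for testing exactly once.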
411
+
412
+ The cross-validation returns a `Weka::Classifiers::Evaluation` object which can be used to get details about the accuracy of the trained classification model:
413
+
414
+ ```ruby
415
+ puts evaluation.summary
416
+ #
417
+ # Correctly Classified Instances 10 71.4286 %
418
+ # Incorrectly Classified Instances 4 28.5714 %
419
+ # Kappa statistic 0.3778
420
+ # Mean absolute error 0.4098
421
+ # Root mean squared error 0.4657
422
+ # Relative absolute error 87.4588 %
423
+ # Root relative squared error 96.2945 %
424
+ # Coverage of cases (0.95 level) 100 %
425
+ # Mean rel. region size (0.95 level) 96.4286 %
426
+ # Total Number of Instances 14
427
+ ```
428
+
429
+ The evaluation holds detailed information about a number of different measures of interest,
430
+ like the [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall), the FP/FN/TP/TN-rates, [F-Measure](https://en.wikipedia.org/wiki/F1_score) and the areas under PRC and [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) curve.
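As a worked example of how such measures relate to each other, the sketch below derives accuracy, the kappa statistic, precision, and recall from a confusion matrix. The matrix is an assumption: it is one 2x2 matrix that is consistent with the summary above (14 instances, 10 correct, kappa 0.3778), but the README does not print it. All helper names are hypothetical.

```ruby
# Illustrative only: metrics derived from a 2x2 confusion matrix.
# The matrix is an assumption chosen to match the summary above.
# Rows are actual classes, columns are predicted classes.
def accuracy(matrix)
  correct = matrix.each_index.map { |i| matrix[i][i] }.reduce(:+)
  correct.to_f / matrix.flatten.reduce(:+)
end

def kappa(matrix)
  total = matrix.flatten.reduce(:+).to_f
  row_totals = matrix.map { |row| row.reduce(:+) }
  column_totals = matrix.transpose.map { |column| column.reduce(:+) }

  # Agreement expected by chance, from the marginal totals.
  expected = row_totals.zip(column_totals).map { |r, c| r * c }.reduce(:+) / (total * total)
  (accuracy(matrix) - expected) / (1 - expected)
end

def precision(matrix, index)
  matrix[index][index].to_f / matrix.transpose[index].reduce(:+)
end

def recall(matrix, index)
  matrix[index][index].to_f / matrix[index].reduce(:+)
end

#            predicted: yes  no
confusion = [[7, 2],  # actual yes
             [2, 3]]  # actual no

puts format('accuracy:  %.4f %%', accuracy(confusion) * 100) # 71.4286 %
puts format('kappa:     %.4f', kappa(confusion))             # 0.3778
puts format('precision: %.4f', precision(confusion, 0))      # 0.7778
puts format('recall:    %.4f', recall(confusion, 0))         # 0.7778
```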
431
+
432
+ If your trained classifier should be evaluated against a set of *test instances*,
433
+ you can use `evaluate`:
434
+
435
+ ```ruby
436
+ test_instances = Weka::Core::Instances.from_arff('test_data.arff')
437
+ test_instances.class_attribute = :play
438
+
439
+ evaluation = classifier.evaluate(test_instances)
440
+ ```
441
+
442
+ #### Classifying new data
443
+
444
+ Each classifier implements either a `classify` method or a `distribution_for` method, or both.
445
+
446
+ The `classify` method takes a Weka::Core::DenseInstance or an Array of values as argument and returns the predicted class value:
447
+
448
+ ```ruby
449
+ instances = Weka::Core::Instances.from_arff('unclassified_data.arff')
450
+
451
+ # with an instance as argument
452
+ instances.map do |instance|
453
+ classifier.classify(instance)
454
+ end
455
+ # => ['no', 'yes', 'yes', ...]
456
+
457
+ # with an Array of values as argument
458
+ classifier.classify [:sunny, 80, 80, :FALSE, '?']
459
+ # => 'yes'
460
+ ```
461
+
462
+ The `distribution_for` method takes a Weka::Core::DenseInstance or an Array of values as argument as well and returns a hash with the distributions per class value:
463
+
464
+ ```ruby
465
+ instances = Weka::Core::Instances.from_arff('unclassified_data.arff')
466
+
467
+ # with an instance as argument
468
+ classifier.distribution_for(instances.first)
469
+ # => { "yes" => 0.26, "no" => 0.74 }
470
+
471
+ # with an Array of values as argument
472
+ classifier.distribution_for [:sunny, 80, 80, :FALSE, '?']
473
+ # => { "yes" => 0.62, "no" => 0.38 }
474
+ ```
475
+
476
+ ## Clusterers
477
+
478
+ Clustering is an unsupervised machine learning technique which tries to find patterns in data and group sets of data. Clustering algorithms work without class attributes.
479
+
480
+ Weka's clustering algorithms can be found in the `Weka::Clusterers` namespace.
481
+
482
+ The following clusterer classes are available:
483
+
484
+ ```ruby
485
+ Weka::Clusterers::Canopy
486
+ Weka::Clusterers::Cobweb
487
+ Weka::Clusterers::EM
488
+ Weka::Clusterers::FarthestFirst
489
+ Weka::Clusterers::HierarchicalClusterer
490
+ Weka::Clusterers::SimpleKMeans
491
+ ```
492
+
493
+ #### Getting information about a clusterer
494
+
495
+ To get a description about the clusterer class and its available options
496
+ you can use the class methods `.description` and `.options` on each clusterer:
497
+
498
+ ```ruby
499
+ puts Weka::Clusterers::SimpleKMeans.description
500
+ # Cluster data using the k means algorithm.
501
+ # ...
502
+
503
+ puts Weka::Clusterers::SimpleKMeans.options
504
+ # -N <num> Number of clusters.
505
+ # (default 2).
506
+ # -init Initialization method to use.
507
+ # 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first.
508
+ # (default = 0)
509
+ # ...
510
+ ```
511
+
512
+ The default options that are used for a clusterer can be displayed with:
513
+
514
+ ```ruby
515
+ Weka::Clusterers::SimpleKMeans.default_options
516
+ # => "-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25
517
+ # -t2 -1.0 -N 2 -A weka.core.EuclideanDistance -R first-last -I 500 -num-slots 1 -S 10"
518
+ ```
519
+
520
+ #### Creating a new Clusterer
521
+
522
+ To build a new clusterer model based on training instances you can use the following syntax:
523
+
524
+ ```ruby
525
+ instances = Weka::Core::Instances.from_arff('weather.arff')
526
+
527
+ clusterer = Weka::Clusterers::SimpleKMeans.new
528
+ clusterer.use_options('-N 3 -I 600')
529
+ clusterer.train_with_instances(instances)
530
+ ```
531
+
532
+ You can also build a clusterer by using the block syntax:
533
+
534
+ ```ruby
535
+ clusterer = Weka::Clusterers::SimpleKMeans.build do
536
+ use_options '-N 5 -I 600'
537
+ train_with_instances instances
538
+ end
539
+ ```
540
+
541
+ #### Evaluating a clusterer model
542
+
543
+ You can evaluate a trained density-based clusterer using [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). The only density-based clusterer in the Weka library is currently `EM`.
544
+
545
+ The cross-validation returns the cross-validated log-likelihood:
546
+
547
+ ```ruby
548
+ # default number of folds is 3
549
+ log_likelihood = clusterer.cross_validate
550
+ # => -10.556166997137497
551
+
552
+ # with a custom number of folds
553
+ log_likelihood = clusterer.cross_validate(folds: 10)
554
+ # => -10.262696653333032
555
+ ```
556
+
557
+ If your trained clusterer should be evaluated against a set of *test instances*,
558
+ you can use `evaluate`.
559
+ The evaluation returns a `Weka::Clusterers::ClusterEvaluation` object which can be used to get details about the accuracy of the trained clusterer model:
560
+
561
+ ```ruby
562
+ test_instances = Weka::Core::Instances.from_arff('test_data.arff')
563
+ evaluation = clusterer.evaluate(test_instances)
564
+
565
+ puts evaluation.summary
566
+ # EM
567
+ # ==
568
+ #
569
+ # Number of clusters: 2
570
+ # Number of iterations performed: 7
571
+ #
572
+ # Cluster
573
+ # Attribute 0 1
574
+ # (0.35) (0.65)
575
+ # ==============================
576
+ # outlook
577
+ # sunny 3.8732 3.1268
578
+ # overcast 1.7746 4.2254
579
+ # rainy 2.1889 4.8111
580
+ # [total] 7.8368 12.1632
581
+ # ...
582
+ ```
583
+
584
+ #### Clustering new data
585
+
586
+ Similar to classifiers, clusterers come with either a `cluster` method or a `distribution_for` method, both of which take a Weka::Core::DenseInstance or an Array of values as argument.
587
+
588
+ The `cluster` method returns the index of the predicted cluster:
589
+
590
+ ```ruby
591
+ instances = Weka::Core::Instances.from_arff('unlabeled_data.arff')
592
+
593
+ clusterer = Weka::Clusterers::Canopy.build do
594
+ train_with_instances instances
595
+ end
596
+
597
+ # with an instance as argument
598
+ instances.map do |instance|
599
+ clusterer.cluster(instance)
600
+ end
601
+ # => [3, 3, 4, 0, 0, 1, 2, 3, 0, 0, 2, 2, 4, 1]
602
+
603
+ # with an Array of values as argument
604
+ clusterer.cluster [:sunny, 80, 80, :FALSE]
605
+ # => 4
606
+ ```
607
+
608
+ The `distribution_for` method returns an Array that holds each cluster's probability at that cluster's index:
609
+
610
+ ```ruby
611
+ # with an instance as argument
612
+ clusterer.distribution_for(instances.first)
613
+ # => [0.17229465277140552, 0.1675583309853506, 0.15089102301329346, 0.3274056122786787, 0.18185038095127165]
614
+
615
+ # with an Array of values as argument
616
+ clusterer.distribution_for [:sunny, 80, 80, :FALSE]
617
+ # => [0.21517055355632506, 0.16012256401406233, 0.17890840384466453, 0.2202344150907843, 0.2255640634941639]
618
+ ```
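The two methods appear to be related in a simple way: in the outputs shown above, the index returned by `cluster` is the position of the largest value in the distribution. A small sketch of that relationship, assuming the most probable cluster is the one assigned (`most_likely_cluster` is a hypothetical helper):

```ruby
# Illustrative only: `most_likely_cluster` is a hypothetical helper.
# Assumption: the assigned cluster is the one with the highest probability.
def most_likely_cluster(distribution)
  distribution.each_with_index.max_by { |probability, _index| probability }.last
end

first_instance = [0.17229465277140552, 0.1675583309853506, 0.15089102301329346,
                  0.3274056122786787, 0.18185038095127165]

p most_likely_cluster(first_instance) # => 3
```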
619
+
620
+ #### Adding a cluster attribute to a dataset
621
+
622
+ After building and training a clusterer with training instances you can use the clusterer
623
+ in the unsupervised attribute filter `AddCluster` to assign a cluster to each instance of a dataset:
624
+
625
+ ```ruby
626
+ filter = Weka::Filters::Unsupervised::Attribute::AddCluster.new
627
+ filter.clusterer = clusterer
628
+
629
+ instances = Weka::Core::Instances.from_arff('unlabeled_data.arff')
630
+ clustered_instances = instances.apply_filter(filter)
631
+
632
+ puts clustered_instances.to_s
633
+ ```
634
+
635
+ `clustered_instances` now has a nominal `cluster` attribute as the last attribute.
636
+ The values of the cluster attribute are the *N* cluster names, e.g. with *N = 2* clusters, the ARFF representation looks like:
637
+
638
+ ```
639
+ ...
640
+ @attribute outlook {sunny,overcast,rainy}
641
+ @attribute temperature numeric
642
+ @attribute humidity numeric
643
+ @attribute windy {TRUE,FALSE}
644
+ @attribute cluster {cluster1,cluster2}
645
+ ...
646
+ ```
647
+
648
+ Each instance is now assigned to a cluster, e.g.:
649
+
650
+ ```
651
+ ...
652
+ @data
653
+ sunny,85,85,FALSE,cluster1
654
+ sunny,80,90,TRUE,cluster1
655
+ ...
656
+ ```
657
+
658
+ ## Development
659
+
660
+ After checking out the repo, run `bin/setup` to install dependencies.
661
+ To install this gem onto your local machine, run `bundle exec rake install`.
662
+
663
+ Then, run `rake spec` to run the tests. You can also run `bin/console` or `rake irb` for an interactive prompt that will allow you to experiment.
664
+
665
+ ## Contributing
666
+
667
+ Bug reports and pull requests are welcome on GitHub at https://github.com/paulgoetze/weka-jruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant code of conduct](http://contributor-covenant.org/version/1/2/0).
668
+
669
+ For development we use the [git branching model](http://nvie.com/posts/a-successful-git-branching-model/) described by [nvie](https://github.com/nvie).
670
+
671
+ Here's how to contribute:
672
+
673
+ 1. Fork it ( https://github.com/paulgoetze/weka-jruby/fork )
674
+ 2. Create your feature branch (`git checkout -b feature/my-new-feature develop`)
675
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
676
+ 4. Push to the branch (`git push origin feature/my-new-feature`)
677
+ 5. Create a new Pull Request
678
+
679
+ Please try to add RSpec tests along with your new features. This will ensure that your code does not break existing functionality and that your feature is working as expected.
680
+
681
+ ## Acknowledgement
682
+
683
+ The original ideas for wrapping Weka in JRuby come from [@arrigonialberto86](https://github.com/arrigonialberto86) and his [ruby-band](https://github.com/arrigonialberto86/ruby-band) gem. Many thanks!
684
+
685
+ ## License
686
+
687
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).