daru 0.0.5 → 0.1.0

Files changed (48)
  1. checksums.yaml +4 -4
  2. data/.build.sh +14 -0
  3. data/.travis.yml +26 -4
  4. data/CONTRIBUTING.md +31 -0
  5. data/Gemfile +1 -2
  6. data/{History.txt → History.md} +110 -44
  7. data/README.md +21 -288
  8. data/Rakefile +1 -0
  9. data/daru.gemspec +12 -8
  10. data/lib/daru.rb +36 -1
  11. data/lib/daru/accessors/array_wrapper.rb +8 -3
  12. data/lib/daru/accessors/gsl_wrapper.rb +113 -0
  13. data/lib/daru/accessors/nmatrix_wrapper.rb +6 -17
  14. data/lib/daru/core/group_by.rb +0 -1
  15. data/lib/daru/dataframe.rb +1192 -83
  16. data/lib/daru/extensions/rserve.rb +21 -0
  17. data/lib/daru/index.rb +14 -0
  18. data/lib/daru/io/io.rb +170 -8
  19. data/lib/daru/maths/arithmetic/dataframe.rb +4 -3
  20. data/lib/daru/maths/arithmetic/vector.rb +4 -4
  21. data/lib/daru/maths/statistics/dataframe.rb +48 -27
  22. data/lib/daru/maths/statistics/vector.rb +215 -33
  23. data/lib/daru/monkeys.rb +53 -7
  24. data/lib/daru/multi_index.rb +21 -4
  25. data/lib/daru/plotting/dataframe.rb +83 -25
  26. data/lib/daru/plotting/vector.rb +9 -10
  27. data/lib/daru/vector.rb +596 -61
  28. data/lib/daru/version.rb +3 -0
  29. data/spec/accessors/wrappers_spec.rb +51 -0
  30. data/spec/core/group_by_spec.rb +0 -2
  31. data/spec/daru_spec.rb +58 -0
  32. data/spec/dataframe_spec.rb +768 -73
  33. data/spec/extensions/rserve_spec.rb +52 -0
  34. data/spec/fixtures/bank2.dat +200 -0
  35. data/spec/fixtures/repeated_fields.csv +7 -0
  36. data/spec/fixtures/scientific_notation.csv +4 -0
  37. data/spec/fixtures/test_xls.xls +0 -0
  38. data/spec/io/io_spec.rb +161 -24
  39. data/spec/math/arithmetic/dataframe_spec.rb +26 -7
  40. data/spec/math/arithmetic/vector_spec.rb +8 -0
  41. data/spec/math/statistics/dataframe_spec.rb +16 -1
  42. data/spec/math/statistics/vector_spec.rb +215 -47
  43. data/spec/spec_helper.rb +21 -2
  44. data/spec/vector_spec.rb +368 -12
  45. metadata +99 -16
  46. data/lib/version.rb +0 -3
  47. data/notebooks/grouping_splitting_pivots.ipynb +0 -529
  48. data/notebooks/intro_with_music_data_.ipynb +0 -303
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: fd2dec0795f15ca1e45bdad5238fb7dbe33e1089
- data.tar.gz: 634ff6e6b533cad019893a6e248706c824933e1d
+ metadata.gz: 6e48778067b94afc9f1060d7d6d4212029b421f2
+ data.tar.gz: 5d0ed9cc2fcf70562e0fcf2767c593e1f8fbfa54
  SHA512:
- metadata.gz: 2c4aed326afacb2fe2324dd720e302564ab973b7fe69e17daf8f4902fecf7a2bbe34a26b0681dc42eaef14bd511a439a2717a115a7f577f700212d0d605d6dee
- data.tar.gz: be1bc452b188d233a6c668a008ed9f9e4cd77cf9b24a574559bf27c8c28ab34b0c23d51cc4321ab49c1416a53b0b74571afca698cdf2106d407f744204191362
+ metadata.gz: 778ad55b592865e08388eac0001cdbce6bc01f58fa77ed8e2a2b72e44d8a54fc2d289f241352affd98b6424326416634ef729889163175f9eb64c83e471fb7e2
+ data.tar.gz: c498252daf63597adc0255810d3eb7b60c102ef086117bf451de821faaa8e196933570b06f88e6415b51c1ef6ea2ee6f60afce18b65691483741158954c73d0b
@@ -0,0 +1,14 @@
+ #!/bin/bash
+
+ git clone https://github.com/SciRuby/nmatrix.git
+ cd nmatrix
+ gem build nmatrix.gemspec
+ gem install nmatrix-0.1.0.gem
+ cd ..
+ rm -rf nmatrix
+ git clone https://github.com/v0dro/gsl-nmatrix
+ cd gsl-nmatrix
+ gem build gsl-nmatrix.gemspec
+ gem install gsl-nmatrix-1.17.gem
+ cd ..
+ rm -rf gsl-nmatrix
@@ -1,5 +1,27 @@
- language: ruby
+ language:
+ ruby
+
+ env:
+ - CPLUS_INCLUDE_PATH=/usr/include/atlas C_INCLUDE_PATH=/usr/include/atlas
+
  rvm:
- - 1.9.3
- - 2.0.0
- - 2.1.1
+ - '2.0'
+ - '2.1'
+ - '2.2'
+
+ matrix:
+ fast_finish:
+ true
+
+ script: "bundle exec rspec"
+
+ install:
+ - gem install bundler
+ - ./.build.sh
+ - bundle install
+
+ before_install:
+ - sudo apt-get update -qq
+ - sudo apt-get install -qq libatlas-base-dev
+ - sudo apt-get install -y libgsl0-dev r-base r-base-dev
+ - sudo Rscript -e "install.packages(c('Rserve','irr'),,'http://cran.us.r-project.org')"
@@ -0,0 +1,31 @@
+ # Contributing guide
+
+ ## Installing daru development dependencies
+
+ If you want to run the full rspec suite, you will need the latest unreleased nmatrix and gsl-nmatrix ruby gems. They will be released upstream soon, but please follow this procedure for now.
+
+ Keep in mind that neither nmatrix nor gsl-nmatrix is necessary for using daru. They are only required for an optional speed-up.
+
+ To install dependencies, execute the following commands:
+
+ `export CPLUS_INCLUDE_PATH=/usr/include/atlas`
+ `export C_INCLUDE_PATH=/usr/include/atlas`
+ `sudo apt-get update -qq`
+ `sudo apt-get install -qq libatlas-base-dev`
+ `sudo apt-get --purge remove liblapack-dev liblapack3 liblapack3gf`
+ `sudo apt-get install -y libgsl0-dev r-base r-base-dev`
+ `sudo Rscript -e "install.packages(c('Rserve','irr'),,'http://cran.us.r-project.org')"`
+
+ Then execute the .build.sh script to clone and install the latest nmatrix and gsl-nmatrix on your system:
+
+ `./.build.sh`
+
+ Then install the remaining dependencies:
+
+ `bundle install`
+
+ And run the test suite (should be all green with pending tests):
+
+ `bundle exec rspec`
+
+ If you have problems installing nmatrix, please consult the [nmatrix installation wiki](https://github.com/SciRuby/nmatrix/wiki/Installation) or the [mailing list](https://groups.google.com/forum/#!forum/sciruby-dev).
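The contributing guide describes nmatrix and gsl-nmatrix as optional speed-ups. A common Ruby pattern for optional dependencies is to attempt the `require` and fall back when it fails. The sketch below only illustrates that pattern: the `optional_require` helper and the `BACKEND` constant are hypothetical names, not taken from daru's source.

```ruby
# Hypothetical helper: try to load a library and report whether it is available.
def optional_require(name)
  require name
  true
rescue LoadError
  false
end

# Choose a storage backend based on what could be loaded.
# The symbols are illustrative labels, not daru's configuration API.
BACKEND =
  if optional_require('gsl')
    :gsl
  elsif optional_require('nmatrix')
    :nmatrix
  else
    :array # a plain Ruby Array needs no extra gems
  end
```

With neither gem installed, `BACKEND` falls through to `:array`, which mirrors the guide's point that the gems are not required for basic use.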
data/Gemfile CHANGED
@@ -1,3 +1,2 @@
  source 'https://rubygems.org'
-
- gemspec
+ gemspec
@@ -1,52 +1,74 @@
1
- == 0.0.1
2
- * Added classes for DataFrame and Vector alongwith some super-basic functions to get off the ground
3
-
4
- == 0.0.2
5
- * Added iterators for dataframe and vector alongwith printing functions (to_html) to interface properly with iRuby notebook.
6
-
7
- == 0.0.2.1
8
- * Fixed bugs with previous code and more iterators
9
-
10
- == 0.0.2.2
11
- * Added test cases and multiple column access through the [] operator on DataFrames
12
-
13
- == 0.0.2.3
14
- * Added #filter\_rows and #delete_row to DataFrame and changed #row to return a row containing a Hash of column name and value.
15
- * Vector objects passed into a DataFrame are now duplicated so that any changes dont affect the original vector.
16
- * Added an optional opts argument to DataFrame.
17
- * Sending more fields than vectors in DataFrame will cause addition of nil vectors.
18
- * Init a DataFrame without having to convert explicitly to vectors.
19
-
20
- == 0.0.2.4
21
- * Initialize dataframe from an array which looks like [{a: 10, b: 20}, {a: 11, b: 12}]. Works for parsed JSON.
22
- * Over-riding vectors in DataFrame will still preserve order.
23
- * Any re-assignment of rows in #each_row and #each_row_with_index will reflect in the DataFrame.
24
- * Added #to_a and #to_json to DataFrame.
1
+ # 0.1.0
25
2
 
26
- == 0.0.3
27
- * This release is a complete rewrite of the entire gem to accomodate index values.
3
+ * Fixes
4
+ - Update documentation and fix it in other places.
5
+ - Fix Vector#sum_of_squares and #ranked.
6
+ - Fixed some tests that were giving RSpec warnings
7
+ - Fixed a bug where nyaplot not being present would raise a warning.
8
+ - Fixed a bug in DataFrame row assignment.
9
+ * Enhancements
10
+ - Wrote a proper .travis.yml
11
+ - Added optional GSL dependency gsl-nmatrix
12
+ - Added Marshalling and unMarshalling capabilities to Vector, Index and DataFrame.
13
+ - Added new method Daru::IO.load for loading data from files by marshalling.
14
+ - Lots of documentation and new notebooks.
15
+ - Added data loading and writing from and to CSV, Excel, plain text and SQL databases.
16
+ - Daru::DataFrame and Vector have now completely replaced Statsample::Dataset and Vector.
17
+ - Vector
18
+ - #center
19
+ - #standardize
20
+ - #vector_percentile
21
+ - Added a new wrapper class Daru::Accessors::GSLWrapper for wrapping around GSL::Vector, which works similarly to NMatrixWrapper or ArrayWrapper.
22
+ - Added a host of statistical methods to GSLWrapper in Daru::Accessors::GSLStatistics that call the relevant GSL::Vector functions for super-fast C level computations.
23
+ - More stats functions - #vector_standardized_compute, #vector_centered_compute, #sample_with_replacement, #sample_without_replacement
24
+ - #only_valid for creating a Vector with only non-nil data.
25
+ - #only_missing for creating a Vector of only missing data.
26
+ - #only_numeric to create Vector of only numerical data.
27
+ - Ported many Statsample::Vector stat methods to Daru::Vector. These are: #percentile, #factors, etc.
28
+ - Added .new_with_size for creating vectors by specifying a size for the
29
+ vector and a block for generating values.
30
+ - Added Vector#verify, #recode! and #recode.
31
+ - Added #save, #jackknife and #bootstrap.
32
+ - Added #missing_values= that will allow setting values for treating data as 'missing'.
33
+ - Added #split_by_separator, #split_by_separator_freq and #splitted.
34
+ - Added #reset_index!
35
+ - Added #any? and #all?
36
+ - Added #db_type for guessing the SQL type of the data contained in the vector.
37
+ - Added and tested plotting support for histogram and box plot.
38
+ - DataFrame
39
+ - #dup_only_valid
40
+ - #clone, #clone_only_valid, #clone_structure
41
+ - #[]= does not clone the vector if it has the same index as the DataFrame.
42
+ - Added a :clone option to initialize that will not clone Daru::Vectors passed into the constructor.
43
+ - Added #save.
44
+ - Added #only_numerics.
45
+ - Added better iterators and changed some behaviour of previous ones to make them more ruby-like. New iterators are #map, #map!, #each, #recode and #collect.
46
+ - Added #vector_sum and #vector_mean.
47
+ - Added #to_gsl to convert to GSL::Matrix.
48
+ - Added #has_missing_data? and #missing_values_rows.
49
+ - Added #compute and #verify.
50
+ - Added .crosstab_by_assignation to generate data frame from row, column and value vectors.
51
+ - Added #filter_vector.
52
+ - Added #standardize and added argument option to #dup.
53
+ - Added #any? and #all? for vector and row axis.
54
+ - Better creation of empty data frames.
55
+ - Added #merge, #one_to_many, #add_vectors_by_split_recode
56
+ - Added constant SPLIT_TOKEN and methods #add_vectors_by_split, .[], #summary.
57
+ - Added #bootstrap.
58
+ - Added a #filter method to wrap around #filter_vectors and #filter_rows.
59
+ - Greatly improved plotting function.
60
+ - Added a lazy update feature that will allow users to delay updating the missing positions index until the last possible moment.
61
+ - Added interoperability with the Rserve client, which makes it possible to convert daru data to R data and perform computations there.
62
+ * Changes
63
+ - Changed Vector#nil_positions to Vector#missing_positions so that future changes for accommodating different values for missing data can be made easily.
64
+ - Changed History.txt to History.md
28
65
 
29
- == 0.0.3.1
30
- * Added aritmetic methods for vector aritmetic by taking the index of values into account.
31
66
 
32
- == 0.0.4
33
- * Added wrappers for Array, NMatrix and MDArray such that the external implementation is completely transparent of the data type being used internally.
34
- * Added statistics methods for vectors for ArrayWrapper. These are compatible with statsample methods.
35
- * Added plotting functions for DataFrame and Vector using Nyaplot.
36
- * Create a DataFrame by specifying the rows with the ".rows" class method.
37
- * Create a Vector from a Hash.
38
- * Call a Vector element by specfying the index name as a method call (method_missing logic).
39
- * Retrive multiple rows of a DataFrame by specfying a Range or an Array with multiple index names.
40
- * #head and #tail for DataFrame.
41
- * #uniq for Vector.
42
- * #max for Vector can return a Vector object with the index set to the index of the max value.
43
- * Tonnes of documentation for most methods.
44
-
45
- == 0.0.5
67
+ # 0.0.5
46
68
 
47
69
  * Easy accessors for some methods
48
70
  * Faster CSV loading.
49
- * Changed vector #is\_valid? to #exists?
71
+ * Changed vector #is_valid? to #exists?
50
72
  * Revamped dtype specifiers for Vector. Now specify :array/:nmatrix for changing underlying data implementation. Specify nm\_dtype for specifying the data type of the NMatrix object.
51
73
  * #sort for Vector. Quick sort algorithm with preservation of original indexes.
52
74
  * Removed #re\_index and #to\_index from Daru::Index.
@@ -75,4 +97,48 @@
75
97
  * Added #describe to DataFrame for producing multiple statistics data of numerical vectors in one shot.
76
98
  * Monkey patched Ruby Matrix to include #elementwise_division.
77
99
  * Added #covariance to calculate the covariance between numbers of a DataFrame and #correlation to calculate correlation.
78
- * Enumerators return Enumerator objects if there is no block.
100
+ * Enumerators return Enumerator objects if there is no block.
101
+
102
+ # 0.0.4
103
+ * Added wrappers for Array, NMatrix and MDArray such that the external implementation is completely transparent of the data type being used internally.
104
+ * Added statistics methods for vectors for ArrayWrapper. These are compatible with statsample methods.
105
+ * Added plotting functions for DataFrame and Vector using Nyaplot.
106
+ * Create a DataFrame by specifying the rows with the ".rows" class method.
107
+ * Create a Vector from a Hash.
108
+ * Call a Vector element by specifying the index name as a method call (method_missing logic).
109
+ * Retrieve multiple rows of a DataFrame by specifying a Range or an Array with multiple index names.
110
+ * #head and #tail for DataFrame.
111
+ * #uniq for Vector.
112
+ * #max for Vector can return a Vector object with the index set to the index of the max value.
113
+ * Tonnes of documentation for most methods.
114
+
115
+ # 0.0.3.1
116
+ * Added arithmetic methods for vector arithmetic by taking the index of values into account.
117
+
118
+ # 0.0.3
119
+ * This release is a complete rewrite of the entire gem to accommodate index values.
120
+
121
+ # 0.0.2.4
122
+ * Initialize dataframe from an array which looks like [{a: 10, b: 20}, {a: 11, b: 12}]. Works for parsed JSON.
123
+ * Over-riding vectors in DataFrame will still preserve order.
124
+ * Any re-assignment of rows in #each_row and #each_row_with_index will reflect in the DataFrame.
125
+ * Added #to_a and #to_json to DataFrame.
126
+
127
+ # 0.0.2.3
128
+ * Added #filter\_rows and #delete_row to DataFrame and changed #row to return a row containing a Hash of column name and value.
129
+ * Vector objects passed into a DataFrame are now duplicated so that any changes don't affect the original vector.
130
+ * Added an optional opts argument to DataFrame.
131
+ * Sending more fields than vectors in DataFrame will cause addition of nil vectors.
132
+ * Init a DataFrame without having to convert explicitly to vectors.
133
+
134
+ # 0.0.2.2
135
+ * Added test cases and multiple column access through the [] operator on DataFrames
136
+
137
+ # 0.0.2.1
138
+ * Fixed bugs with previous code and more iterators
139
+
140
+ # 0.0.2
141
+ * Added iterators for dataframe and vector along with printing functions (to_html) to interface properly with iRuby notebook.
142
+
143
+ # 0.0.1
144
+ * Added classes for DataFrame and Vector along with some super-basic functions to get off the ground
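The 0.1.0 entry lists Vector#center and #standardize. As a rough illustration of what these operations compute, here is a plain-Ruby sketch on arrays. It does not call daru, the helper names are hypothetical, and using the sample standard deviation (n - 1 denominator) is an assumption rather than something the changelog states.

```ruby
# Arithmetic mean of a numeric array.
def mean(values)
  values.sum.to_f / values.size
end

# Sample standard deviation (n - 1 denominator).
# Assumption: the changelog does not say which estimator is used.
def sample_sd(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

# Centering subtracts the mean from every element.
def center(values)
  m = mean(values)
  values.map { |v| v - m }
end

# Standardizing centers the data and rescales it to unit standard deviation.
def standardize(values)
  m = mean(values)
  sd = sample_sd(values)
  values.map { |v| (v - m) / sd }
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
centered = center(data)
standardized = standardize(data)
```

After centering, the values sum to zero. After standardizing, they also have a sample standard deviation of one.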
data/README.md CHANGED
@@ -4,33 +4,45 @@ daru
4
4
  Data Analysis in RUby
5
5
 
6
6
  [![Gem Version](https://badge.fury.io/rb/daru.svg)](http://badge.fury.io/rb/daru)
7
+ [![Build Status](https://travis-ci.org/v0dro/daru.svg)](https://travis-ci.org/v0dro/daru)
7
8
 
8
9
  ## Introduction
9
10
 
10
11
  daru (Data Analysis in RUby) is a library for storage, analysis, manipulation and visualization of data.
11
12
 
12
- daru is inspired by `Statsample::Dataset` and pandas, a very mature solution in Python.
13
+ daru is inspired by pandas, a very mature solution in Python.
13
14
 
14
- Written in pure Ruby so should work with all ruby implementations.
15
+ Written in pure Ruby so should work with all ruby implementations. Tested with MRI 2.0, 2.1, 2.2.
15
16
 
16
17
  ## Features
17
18
 
18
19
  * Data structures:
19
20
  - Vector - A basic 1-D vector.
20
- - DataFrame - A 2-D table-like structure which is internally composed of named `Vectors`.
21
- * Compatible with [IRuby notebook](https://github.com/minad/iruby) and [statsample](https://github.com/clbustos/statsample).
21
+ - DataFrame - A 2-D spreadsheet-like structure for manipulating and storing data sets. This is daru's primary data structure.
22
+ * Compatible with [IRuby notebook](https://github.com/SciRuby/iruby) and [statsample](https://github.com/SciRuby/statsample).
22
23
  * Singly and hierarchically indexed data structures.
23
24
  * Flexible and intuitive API for manipulation and analysis of data.
24
25
  * Easy plotting, statistics and arithmetic.
25
26
  * Plentiful iterators.
26
- * Optional speed and space optimization on MRI with [NMatrix](https://github.com/SciRuby/nmatrix).
27
+ * Optional speed and space optimization on MRI with [NMatrix](https://github.com/SciRuby/nmatrix) and GSL.
27
28
  * Easy splitting, aggregation and grouping of data.
28
29
  * Quickly reducing data with pivot tables for quick data summary.
30
+ * Imports and exports datasets from and to Excel, CSV, databases and plain text files.
29
31
 
30
32
  ## Notebooks
31
33
 
32
- * [Analysis and plotting of a data set comprising of music listening habits of a last.fm user](http://nbviewer.ipython.org/github/v0dro/daru/blob/master/notebooks/intro_with_music_data_.ipynb)
33
- * [Basic splitting, grouping and aggregating of data](http://nbviewer.ipython.org/github/v0dro/daru/blob/master/notebooks/grouping_splitting_pivots.ipynb)
34
+ ### Usage
35
+
36
+ * [Basic Creation of Vectors and DataFrame](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Creation%20of%20Vector%20and%20DataFrame.ipynb)
37
+ * [Detailed Usage of Daru::Vector](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Usage%20of%20Vector.ipynb)
38
+ * [Detailed Usage of Daru::DataFrame](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Usage%20of%20DataFrame.ipynb)
39
+ * [Visualizing Data With Daru::DataFrame](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Visualization/Visualizing%20data%20with%20daru%20DataFrame.ipynb)
40
+ * [Grouping, Splitting and Pivoting Data](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Grouping%2C%20Splitting%20and%20Pivoting.ipynb)
41
+
42
+ ### Case Studies
43
+
44
+ * [Logistic Regression Analysis with daru and statsample-glm](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Logistic%20Regression%20with%20daru%20and%20statsample-glm.ipynb)
45
+ * [Finding and Plotting most heard artists from a Last.fm dataset](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Finding%20and%20plotting%20the%20most%20heard%20artists%20on%20last%20fm.ipynb)
34
46
 
35
47
  ## Blog Posts
36
48
 
@@ -41,295 +53,18 @@ Written in pure Ruby so should work with all ruby implementations.
41
53
 
42
54
  Docs can be found [here](https://rubygems.org/gems/daru).
43
55
 
44
- ## Basic Usage
45
-
46
- #### Initialization of DataFrame
47
-
48
- A basic DataFrame can be initialized like this:
49
-
50
- ```ruby
51
-
52
- df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5]}, order: [:a, :b], index: [:one, :two, :three, :four, :five])
53
- df
54
-
55
- # =>
56
- # # <Daru::DataFrame:87274040 @name = 7308c587-4073-4e7d-b3ca-3679d1dcc946 # @size = 5>
57
- # a b
58
- # one 1 11
59
- # two 2 12
60
- # three 3 13
61
- # four 4 14
62
- # five 5 15
63
- ```
64
- Daru will automatically align the vectors correctly according to the specified index and then create the DataFrame. Thus, elements having the same index will show up in the same row. The indexes will be arranged alphabetically if vectors with unaligned indexes are supplied.
65
-
66
- The vectors of the DataFrame will be arranged according to the array specified in the (optional) second argument. Otherwise the vectors are ordered alphabetically.
67
-
68
- ```ruby
69
-
70
- df = Daru::DataFrame.new({
71
- b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
72
- a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
73
- }, order: [:a, :b]
74
- )
75
- df
76
-
77
- # =>
78
- # #<Daru::DataFrame:87363700 @name = 75ba0a14-8291-48ac-ac30-35017e4d6c5f # @size = 5>
79
- # a b
80
- # five 5 14
81
- # four 4 13
82
- # one 2 12
83
- # three 3 15
84
- # two 1 11
85
- ```
86
-
87
- If an index for the DataFrame is supplied (third argument), then the indexes of the individual vectors will be matched to the DataFrame index. If any of the indexes do not match, nils will be inserted instead:
88
-
89
- ```ruby
90
-
91
- df = Daru::DataFrame.new({
92
- b: [11] .dv(nil, [:one]),
93
- a: [1,2,3] .dv(nil, [:one, :two, :three]),
94
- c: [11,22,33,44,55] .dv(nil, [:one, :two, :three, :four, :five]),
95
- d: [49,69,89,99,108,44].dv(nil, [:one, :two, :three, :four, :five, :six])
96
- }, order: [:a, :b, :c, :d], index: [:one, :two, :three, :four, :five, :six])
97
- df
98
- # =>
99
- # #<Daru::DataFrame:87523270 @name = bda4eb68-afdd-4404-9981-708edab14201 #@size = 6>
100
- # a b c d
101
- # one 1 11 11 49
102
- # two 2 nil 22 69
103
- # three 3 nil 33 89
104
- # four nil nil 44 99
105
- # five nil nil 55 108
106
- # six nil nil nil 44
107
- ```
108
-
109
- If some of the supplied vectors do not contain certain indexes that are contained in other vectors, they are added to those vectors and the correspoding elements are set to `nil`.
110
-
111
- ```ruby
112
-
113
- df = Daru::DataFrame.new({
114
- b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
115
- a: [1,2,3] .dv(:a, [:two,:one,:three])
116
- }, order: [:a, :b])
117
- df
118
-
119
- # =>
120
- # #<Daru::DataFrame:87612510 @name = 1e904c15-e095-4dce-bfdf-c07ee4d6e4a4 # @size = 5>
121
- # a b
122
- # five nil 14
123
- # four nil 13
124
- # one 2 12
125
- # three 3 15
126
- # two 1 11
127
- ```
128
-
129
- #### Initialization of Vector
130
-
131
- The `Vector` data structure is also named and indexed. It accepts arguments name, source, index (in that order).
132
-
133
- In the simplest case it can be constructed like this:
134
-
135
- ```ruby
136
-
137
- dv = Daru::Vector.new [1,2,3,4,5], name: ravan, index: [:ek, :don, :teen, :char, :pach]
138
- dv
139
- # =>
140
- # #<Daru::Vector:87630270 @name = ravan @size = 5 >
141
- # ravan
142
- # ek 1
143
- # don 2
144
- # teen 3
145
- # char 4
146
- # pach 5
147
- ```
148
-
149
- Initializing a vector with indexes will insert nils in places where elements dont exist:
150
-
151
- ```ruby
152
-
153
- dv = Daru::Vector.new [1,2,3], name: yoga, index: [0,1,2,3,4]
154
- dv
155
- # =>
156
- # #<Daru::Vector:87890840 @name = yoga @size = 5 >
157
- # y
158
- # 0 1
159
- # 1 2
160
- # 2 3
161
- # 3 nil
162
- # 4 nil
163
- ```
164
-
165
- #### Basic Selection Operations
166
-
167
- Initialize a dataframe:
168
-
169
- ```ruby
170
-
171
- df = Daru::DataFrame.new({
172
- b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
173
- a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
174
- }, order: [:a, :b])
175
-
176
- # =>
177
- # #<Daru::DataFrame:87455010 @name = b3d14e23-98c2-4741-a563-92e8f1fd0f13 # @size = 5>
178
- # a b
179
- # five 5 14
180
- # four 4 13
181
- # one 2 12
182
- # three 3 15
183
- # two 1 11
184
-
185
- ```
186
- Select a row from a DataFrame:
187
-
188
- ```ruby
189
-
190
- df.row[:one]
191
-
192
- # =>
193
- # #<Daru::Vector:87432070 @name = one @size = 2 >
194
- # one
195
- # a 2
196
- # b 12
197
- ```
198
- A row or a vector is returned as a `Daru::Vector` object, so any manipulations supported by `Daru::Vector` can be performed on the chosen row as well.
199
-
200
- Select multiple rows with a Range and get a DataFrame in return:
201
-
202
- ``` ruby
203
-
204
- df.row[1..3] # OR df.row[:four..:three]
205
- # =>
206
- #<Daru::DataFrame:85361520 @name = d6582f66-5a55-473e-ba57-cb2ba974da6a @size #= 3>
207
- # a b
208
- # four 4 13
209
- # one 2 12
210
- # three 3 15
211
-
212
- ```
213
-
214
- Select a single vector:
215
-
216
- ```ruby
217
-
218
- df.vector[:a] # or simply df.a
219
-
220
- # =>
221
- # #<Daru::Vector:87454270 @name = a @size = 5 >
222
- # a
223
- # five 5
224
- # four 4
225
- # one 2
226
- # three 3
227
- # two 1
228
- ```
229
-
230
- Select multiple vectors and return a DataFrame in the specified order:
231
-
232
- ```ruby
233
-
234
- df.vector[:b, :a]
235
- # =>
236
- # #<Daru::DataFrame:87835960 @name = e80902cc-cff9-4b23-9eca-5da36ebc88a8 # @size = 5>
237
- # b a
238
- # five 14 5
239
- # four 13 4
240
- # one 12 2
241
- # three 15 3
242
- # two 11 1
243
- ```
244
-
245
- Keep/remove row according to a specified condition:
246
-
247
- ```ruby
248
-
249
- df = df.filter_rows do |row|
250
- row[:a] == 5
251
- end
252
- df
253
- # =>
254
- # #<Daru::DataFrame:87455010 @name = b3d14e23-98c2-4741-a563-92e8f1fd0f13 # @size = 1>
255
- # a b
256
- # five 5 14
257
- ```
258
- The same can be applied to vectors using `filter_vectors`.
259
-
260
- To change the values of a row/vector while iterating through the DataFrame, use `map_rows` or `map_vectors`:
261
-
262
- ```ruby
263
-
264
- df.map_rows do |row|
265
- row = row * row
266
- end
267
-
268
- df
269
- # =>
270
- # #<Daru::DataFrame:86826830 @name = b092ca5b-7b83-4dbe-a469-124f7f25a568 # @size = 5>
271
- # a b
272
- # five 25 196
273
- # four 16 169
274
- # one 4 144
275
- # three 9 225
276
- # two 1 121
277
- ```
278
-
279
- #### Basic Maths Operations
280
-
281
- Performing a binary arithmetic operation on two `Daru::Vector` objects will return a `Vector` object in which the operation will be performed on elements of the same index.
282
-
283
- ```ruby
284
-
285
- dv1 = Daru::Vector.new [1,2,3,4], name: :boozy, index: [:a, :b, :c, :d]
286
- dv2 = Daru::Vector.new [1,2,3,4], name: :mayer, index: [:e, :f, :b, :d]
287
- dv1 * dv2
288
-
289
- # #<Daru::Vector:80924700 @name = boozy @size = 2 >
290
- # boozy
291
- # b 6
292
- # d 16
293
- ```
294
-
295
- Arithmetic operators applied on a single Numeric will perform the operation with that number against the entire vector.
296
-
297
- Same applies to DataFrame as well.
298
-
299
- #### Splitting and aggregation of data
300
-
301
- `Daru::DataFrame` provides the `#group_by` method to split or aggregate data. Its very similar to SQL GROUP BY. Check the [blog post]() for details.
302
-
303
- You can also generate Excel-style pivot tables with `#pivot_table`.
304
-
305
- #### Plotting
306
-
307
- daru uses [Nyaplot](https://github.com/domitry/nyaplot) for plotting and an example of this can be found in the [notebook](http://nbviewer.ipython.org/github/v0dro/daru/blob/master/notebooks/intro_with_music_data_.ipynb) or [blog post](http://v0dro.github.io/blog/2014/11/25/data-analysis-in-ruby-basic-data-manipulation-and-plotting/).
308
-
309
- Head over to the tutorials and notebooks listed above for more examples.
310
-
311
- #### Working with missing data
312
-
313
- Missing data is an integral part of any data analysis operation and [this blog post](http://v0dro.github.io/blog/2015/02/24/data-analysis-in-ruby-part-2/) provides details on dealing with missing data.
314
-
315
56
  ## Roadmap
316
57
 
317
58
  * Automate testing for both MRI and JRuby.
318
59
  * Enable creation of DataFrame by only specifying an NMatrix/MDArray in initialize. Vector naming happens automatically (alphabetic) or is specified in an Array.
319
- * Destructive map iterators for DataFrame.
320
60
  * Completely test all functionality for MDArray.
321
61
  * Basic Data manipulation and analysis operations:
322
- - Different kinds of join operations
323
- - Dataframe/vector merge (left, right, inner, outer)
324
- - Verification of data in a vector
325
62
  - DF concat
326
63
  * Option to express a DataFrame as an NMatrix or MDArray so as to use more efficient storage techniques.
327
64
  * Assignment of a column to a single number should set the entire column to that number.
328
65
  * == between daru_vector and string/number.
329
66
  * Multiple column assignment with []=
330
67
  * Multiple value assignment for vectors with []=.
331
- * Load DataFrame from multiple sources (excel, SQL, etc.).
332
- * Deletion of elements from Vector should only modify the index and leave the vector as it is so that compacting is not needed and things are faster.
333
68
  * #find\_max function which will evaluate a block and return the row for which the value of the block is maximal.
334
69
  * Function to check if a value of a row/vector is within a specified range.
335
70
  * Create a new vector in map_rows if any of the already present rows don't match the one assigned in the block.
@@ -338,19 +73,17 @@ Missing data is an integral part of any data analysis operation and [this blog p
338
73
  * Cumulative sum.
339
74
  * Time series support.
340
75
  * Calculate percentage change.
341
- * Working with missing data - drop\_missing\_data, dropping rows with missing data.
342
76
  * Have some sample data sets for users to play around with. Should be able to load these from the code itself.
343
77
  * Sorting with missing data present.
344
- * Make vectors aware of the data frame that they are a part of.
345
78
  * re_index should re establish previous index values in the newly supplied index.
346
- * Reset index.
347
79
 
348
80
  ## Contributing
349
81
 
350
- Pick a feature from the Roadmap above or think of your own and send me a Pull Request!
82
+ Pick a feature from the Roadmap or the issue tracker or think of your own and send me a Pull Request!
351
83
 
352
84
  ## Acknowledgements
353
85
 
86
+ * Google and the Ruby Science Foundation for the Google Summer of Code 2015 grant for further developing daru and integrating it with other ruby gems.
354
87
  * Thank you [last.fm](http://www.last.fm/) for making user data accessible to the public.
355
88
 
356
89
  Copyright (c) 2015, Sameer Deshmukh
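The "Basic Maths Operations" section removed from the README in this release described binary arithmetic on two vectors as operating on elements that share an index. The idea can be sketched without daru by modelling each vector as a Hash from index label to value. The `aligned_op` helper is hypothetical, and keeping only the labels common to both operands follows the removed README example. How 0.1.0 treats unmatched labels is not shown in this diff.

```ruby
# Model an indexed vector as a Hash of index label => value.
# Apply a binary operation to the values whose labels appear in both operands.
def aligned_op(left, right)
  common = left.keys & right.keys
  common.map { |label| [label, yield(left[label], right[label])] }.to_h
end

dv1 = { a: 1, b: 2, c: 3, d: 4 }
dv2 = { e: 1, f: 2, b: 3, d: 4 }

product = aligned_op(dv1, dv2) { |x, y| x * y }
# product is { b: 6, d: 16 }, matching the values in the removed README example
```

Labels `:a`, `:c`, `:e` and `:f` appear in only one operand, so they are absent from the result.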