daru 0.0.5 → 0.1.0
- checksums.yaml +4 -4
- data/.build.sh +14 -0
- data/.travis.yml +26 -4
- data/CONTRIBUTING.md +31 -0
- data/Gemfile +1 -2
- data/{History.txt → History.md} +110 -44
- data/README.md +21 -288
- data/Rakefile +1 -0
- data/daru.gemspec +12 -8
- data/lib/daru.rb +36 -1
- data/lib/daru/accessors/array_wrapper.rb +8 -3
- data/lib/daru/accessors/gsl_wrapper.rb +113 -0
- data/lib/daru/accessors/nmatrix_wrapper.rb +6 -17
- data/lib/daru/core/group_by.rb +0 -1
- data/lib/daru/dataframe.rb +1192 -83
- data/lib/daru/extensions/rserve.rb +21 -0
- data/lib/daru/index.rb +14 -0
- data/lib/daru/io/io.rb +170 -8
- data/lib/daru/maths/arithmetic/dataframe.rb +4 -3
- data/lib/daru/maths/arithmetic/vector.rb +4 -4
- data/lib/daru/maths/statistics/dataframe.rb +48 -27
- data/lib/daru/maths/statistics/vector.rb +215 -33
- data/lib/daru/monkeys.rb +53 -7
- data/lib/daru/multi_index.rb +21 -4
- data/lib/daru/plotting/dataframe.rb +83 -25
- data/lib/daru/plotting/vector.rb +9 -10
- data/lib/daru/vector.rb +596 -61
- data/lib/daru/version.rb +3 -0
- data/spec/accessors/wrappers_spec.rb +51 -0
- data/spec/core/group_by_spec.rb +0 -2
- data/spec/daru_spec.rb +58 -0
- data/spec/dataframe_spec.rb +768 -73
- data/spec/extensions/rserve_spec.rb +52 -0
- data/spec/fixtures/bank2.dat +200 -0
- data/spec/fixtures/repeated_fields.csv +7 -0
- data/spec/fixtures/scientific_notation.csv +4 -0
- data/spec/fixtures/test_xls.xls +0 -0
- data/spec/io/io_spec.rb +161 -24
- data/spec/math/arithmetic/dataframe_spec.rb +26 -7
- data/spec/math/arithmetic/vector_spec.rb +8 -0
- data/spec/math/statistics/dataframe_spec.rb +16 -1
- data/spec/math/statistics/vector_spec.rb +215 -47
- data/spec/spec_helper.rb +21 -2
- data/spec/vector_spec.rb +368 -12
- metadata +99 -16
- data/lib/version.rb +0 -3
- data/notebooks/grouping_splitting_pivots.ipynb +0 -529
- data/notebooks/intro_with_music_data_.ipynb +0 -303
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 6e48778067b94afc9f1060d7d6d4212029b421f2
|
4
|
+
data.tar.gz: 5d0ed9cc2fcf70562e0fcf2767c593e1f8fbfa54
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 778ad55b592865e08388eac0001cdbce6bc01f58fa77ed8e2a2b72e44d8a54fc2d289f241352affd98b6424326416634ef729889163175f9eb64c83e471fb7e2
|
7
|
+
data.tar.gz: c498252daf63597adc0255810d3eb7b60c102ef086117bf451de821faaa8e196933570b06f88e6415b51c1ef6ea2ee6f60afce18b65691483741158954c73d0b
|
data/.build.sh
ADDED
@@ -0,0 +1,14 @@
|
|
1
|
+
#!/bin/bash
|
2
|
+
|
3
|
+
git clone https://github.com/SciRuby/nmatrix.git
|
4
|
+
cd nmatrix
|
5
|
+
gem build nmatrix.gemspec
|
6
|
+
gem install nmatrix-0.1.0.gem
|
7
|
+
cd ..
|
8
|
+
rm -rf nmatrix
|
9
|
+
git clone https://github.com/v0dro/gsl-nmatrix
|
10
|
+
cd gsl-nmatrix
|
11
|
+
gem build gsl-nmatrix.gemspec
|
12
|
+
gem install gsl-nmatrix-1.17.gem
|
13
|
+
cd ..
|
14
|
+
rm -rf gsl-nmatrix
|
data/.travis.yml
CHANGED
@@ -1,5 +1,27 @@
|
|
1
|
-
language:
|
1
|
+
language:
|
2
|
+
ruby
|
3
|
+
|
4
|
+
env:
|
5
|
+
- CPLUS_INCLUDE_PATH=/usr/include/atlas C_INCLUDE_PATH=/usr/include/atlas
|
6
|
+
|
2
7
|
rvm:
|
3
|
-
-
|
4
|
-
- 2.
|
5
|
-
- 2.
|
8
|
+
- '2.0'
|
9
|
+
- '2.1'
|
10
|
+
- '2.2'
|
11
|
+
|
12
|
+
matrix:
|
13
|
+
fast_finish:
|
14
|
+
true
|
15
|
+
|
16
|
+
script: "bundle exec rspec"
|
17
|
+
|
18
|
+
install:
|
19
|
+
- gem install bundler
|
20
|
+
- ./.build.sh
|
21
|
+
- bundle install
|
22
|
+
|
23
|
+
before_install:
|
24
|
+
- sudo apt-get update -qq
|
25
|
+
- sudo apt-get install -qq libatlas-base-dev
|
26
|
+
- sudo apt-get install -y libgsl0-dev r-base r-base-dev
|
27
|
+
- sudo Rscript -e "install.packages(c('Rserve','irr'),,'http://cran.us.r-project.org')"
|
data/CONTRIBUTING.md
CHANGED
@@ -0,0 +1,31 @@
|
|
1
|
+
# Contributing guide
|
2
|
+
|
3
|
+
## Installing daru development dependencies
|
4
|
+
|
5
|
+
If you want to run the full rspec suite, you will need the latest unreleased nmatrix and gsl-nmatrix ruby gems. They will be released upstream soon, but please follow this procedure for now.
|
6
|
+
|
7
|
+
Keep in mind that neither nmatrix nor gsl-nmatrix is necessary for using daru. They are only required for an optional speed-up.
|
8
|
+
|
9
|
+
To install dependencies, execute the following commands:
|
10
|
+
|
11
|
+
`export CPLUS_INCLUDE_PATH=/usr/include/atlas`
|
12
|
+
`export C_INCLUDE_PATH=/usr/include/atlas`
|
13
|
+
`sudo apt-get update -qq`
|
14
|
+
`sudo apt-get install -qq libatlas-base-dev`
|
15
|
+
`sudo apt-get --purge remove liblapack-dev liblapack3 liblapack3gf`
|
16
|
+
`sudo apt-get install -y libgsl0-dev r-base r-base-dev`
|
17
|
+
`sudo Rscript -e "install.packages(c('Rserve','irr'),,'http://cran.us.r-project.org')"`
|
18
|
+
|
19
|
+
Then execute the .build.sh script to clone and install the latest nmatrix and gsl-nmatrix on your system:
|
20
|
+
|
21
|
+
`./.build.sh`
|
22
|
+
|
23
|
+
Then finally install remaining dependencies:
|
24
|
+
|
25
|
+
`bundle install`
|
26
|
+
|
27
|
+
And run the test suite (should be all green with pending tests):
|
28
|
+
|
29
|
+
`bundle exec rspec`
|
30
|
+
|
31
|
+
If you have problems installing nmatrix, please consult the [nmatrix installation wiki](https://github.com/SciRuby/nmatrix/wiki/Installation) or the [mailing list](https://groups.google.com/forum/#!forum/sciruby-dev).
|
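The contributing guide above notes that nmatrix and gsl-nmatrix are optional accelerators rather than hard requirements. One common way for a Ruby library to treat a gem as optional is to attempt the `require` and fall back when it raises `LoadError`. The sketch below shows that pattern in plain Ruby. The helper names `optional_library?` and `preferred_backend`, and the selection order, are hypothetical illustrations and not daru's actual implementation.

```ruby
# Hypothetical helper (not part of daru): try to load a library and
# report whether it is available, instead of failing hard.
def optional_library?(name)
  require name
  true
rescue LoadError
  false
end

# Pick an accelerated backend when one can be loaded, otherwise fall
# back to plain Ruby arrays. The symbols and their priority order are
# assumptions made for illustration only.
def preferred_backend
  return :gsl     if optional_library?('gsl')
  return :nmatrix if optional_library?('nmatrix')
  :array
end

puts "Using backend: #{preferred_backend}"
```

With neither gem installed, the sketch falls back to `:array`, which matches the guide's point that the library remains usable without the native extensions.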
data/Gemfile
CHANGED
data/{History.txt → History.md}
RENAMED
@@ -1,52 +1,74 @@
|
|
1
|
-
|
2
|
-
* Added classes for DataFrame and Vector along with some super-basic functions to get off the ground
|
3
|
-
|
4
|
-
== 0.0.2
|
5
|
-
* Added iterators for dataframe and vector along with printing functions (to_html) to interface properly with iRuby notebook.
|
6
|
-
|
7
|
-
== 0.0.2.1
|
8
|
-
* Fixed bugs with previous code and more iterators
|
9
|
-
|
10
|
-
== 0.0.2.2
|
11
|
-
* Added test cases and multiple column access through the [] operator on DataFrames
|
12
|
-
|
13
|
-
== 0.0.2.3
|
14
|
-
* Added #filter\_rows and #delete_row to DataFrame and changed #row to return a row containing a Hash of column name and value.
|
15
|
-
* Vector objects passed into a DataFrame are now duplicated so that any changes don't affect the original vector.
|
16
|
-
* Added an optional opts argument to DataFrame.
|
17
|
-
* Sending more fields than vectors in DataFrame will cause addition of nil vectors.
|
18
|
-
* Init a DataFrame without having to convert explicitly to vectors.
|
19
|
-
|
20
|
-
== 0.0.2.4
|
21
|
-
* Initialize dataframe from an array which looks like [{a: 10, b: 20}, {a: 11, b: 12}]. Works for parsed JSON.
|
22
|
-
* Over-riding vectors in DataFrame will still preserve order.
|
23
|
-
* Any re-assignment of rows in #each_row and #each_row_with_index will reflect in the DataFrame.
|
24
|
-
* Added #to_a and #to_json to DataFrame.
|
1
|
+
# 0.1.0
|
25
2
|
|
26
|
-
|
27
|
-
|
3
|
+
* Fixes
|
4
|
+
- Update documentation and fix it in other places.
|
5
|
+
- Fix Vector#sum_of_squares and #ranked.
|
6
|
+
- Fixed some tests that were giving RSpec warnings
|
7
|
+
- Fixed a bug where nyaplot not being present would raise a warning.
|
8
|
+
- Fixed a bug in DataFrame row assignment.
|
9
|
+
* Enhancements
|
10
|
+
- Wrote a proper .travis.yml
|
11
|
+
- Added optional GSL dependency gsl-nmatrix
|
12
|
+
- Added Marshalling and unMarshalling capabilities to Vector, Index and DataFrame.
|
13
|
+
- Added new method Daru::IO.load for loading data from files by marshalling.
|
14
|
+
- Lots of documentation and new notebooks.
|
15
|
+
- Added data loading and writing from and to CSV, Excel, plain text and SQL databases.
|
16
|
+
- Daru::DataFrame and Vector have now completely replaced Statsample::Dataset and Vector.
|
17
|
+
- Vector
|
18
|
+
- #center
|
19
|
+
- #standardize
|
20
|
+
- #vector_percentile
|
21
|
+
- Added a new wrapper class Daru::Accessors::GSLWrapper for wrapping around GSL::Vector, which works similarly to NMatrixWrapper or ArrayWrapper.
|
22
|
+
- Added a host of statistical methods to GSLWrapper in Daru::Accessors::GSLStatistics that call the relevant GSL::Vector functions for super-fast C level computations.
|
23
|
+
- More stats functions - #vector_standardized_compute, #vector_centered_compute, #sample_with_replacement, #sample_without_replacement
|
24
|
+
- #only_valid for creating a Vector with only non-nil data.
|
25
|
+
- #only_missing for creating a Vector of only missing data.
|
26
|
+
- #only_numeric to create Vector of only numerical data.
|
27
|
+
- Ported many Statsample::Vector stat methods to Daru::Vector. These are: #percentile, #factors, etc.
|
28
|
+
- Added .new_with_size for creating vectors by specifying a size for the
|
29
|
+
vector and a block for generating values.
|
30
|
+
- Added Vector#verify, #recode! and #recode.
|
31
|
+
- Added #save, #jackknife and #bootstrap.
|
32
|
+
- Added #missing_values= that will allow setting values for treating data as 'missing'.
|
33
|
+
- Added #split_by_separator, #split_by_separator_freq and #splitted.
|
34
|
+
- Added #reset_index!
|
35
|
+
- Added #any? and #all?
|
36
|
+
- Added #db_type for guessing the SQL type of the data contained in the vector.
|
37
|
+
- Added and tested plotting support for histogram and box plot.
|
38
|
+
- DataFrame
|
39
|
+
- #dup_only_valid
|
40
|
+
- #clone, #clone_only_valid, #clone_structure
|
41
|
+
- #[]= does not clone the vector if it has the same index as the DataFrame.
|
42
|
+
- Added a :clone option to initialize that will not clone Daru::Vectors passed into the constructor.
|
43
|
+
- Added #save.
|
44
|
+
- Added #only_numerics.
|
45
|
+
- Added better iterators and changed some behaviour of previous ones to make them more ruby-like. New iterators are #map, #map!, #each, #recode and #collect.
|
46
|
+
- Added #vector_sum and #vector_mean.
|
47
|
+
- Added #to_gsl to convert to GSL::Matrix.
|
48
|
+
- Added #has_missing_data? and #missing_values_rows.
|
49
|
+
- Added #compute and #verify.
|
50
|
+
- Added .crosstab_by_assignation to generate data frame from row, column and value vectors.
|
51
|
+
- Added #filter_vector.
|
52
|
+
- Added #standardize and added argument option to #dup.
|
53
|
+
- Added #any? and #all? for vector and row axis.
|
54
|
+
- Better creation of empty data frames.
|
55
|
+
- Added #merge, #one_to_many, #add_vectors_by_split_recode
|
56
|
+
- Added constant SPLIT_TOKEN and methods #add_vectors_by_split, .[], #summary.
|
57
|
+
- Added #bootstrap.
|
58
|
+
- Added a #filter method to wrap around #filter_vectors and #filter_rows.
|
59
|
+
- Greatly improved plotting function.
|
60
|
+
- Added a lazy update feature that will allow users to delay updating the missing positions index until the last possible moment.
|
61
|
+
- Added interoperability with rserve client which makes it possible to change daru data to R data and perform computation there.
|
62
|
+
* Changes
|
63
|
+
- Changed Vector#nil_positions to Vector#missing_positions so that future changes for accommodating different values for missing data can be made easily.
|
64
|
+
- Changed History.txt to History.md
|
28
65
|
|
29
|
-
== 0.0.3.1
|
30
|
-
* Added arithmetic methods for vector arithmetic by taking the index of values into account.
|
31
66
|
|
32
|
-
|
33
|
-
* Added wrappers for Array, NMatrix and MDArray such that the external implementation is completely transparent of the data type being used internally.
|
34
|
-
* Added statistics methods for vectors for ArrayWrapper. These are compatible with statsample methods.
|
35
|
-
* Added plotting functions for DataFrame and Vector using Nyaplot.
|
36
|
-
* Create a DataFrame by specifying the rows with the ".rows" class method.
|
37
|
-
* Create a Vector from a Hash.
|
38
|
-
* Call a Vector element by specifying the index name as a method call (method_missing logic).
|
39
|
-
* Retrieve multiple rows of a DataFrame by specifying a Range or an Array with multiple index names.
|
40
|
-
* #head and #tail for DataFrame.
|
41
|
-
* #uniq for Vector.
|
42
|
-
* #max for Vector can return a Vector object with the index set to the index of the max value.
|
43
|
-
* Tonnes of documentation for most methods.
|
44
|
-
|
45
|
-
== 0.0.5
|
67
|
+
# 0.0.5
|
46
68
|
|
47
69
|
* Easy accessors for some methods
|
48
70
|
* Faster CSV loading.
|
49
|
-
* Changed vector #
|
71
|
+
* Changed vector #is_valid? to #exists?
|
50
72
|
* Revamped dtype specifiers for Vector. Now specify :array/:nmatrix for changing underlying data implementation. Specify nm\_dtype for specifying the data type of the NMatrix object.
|
51
73
|
* #sort for Vector. Quick sort algorithm with preservation of original indexes.
|
52
74
|
* Removed #re\_index and #to\_index from Daru::Index.
|
@@ -75,4 +97,48 @@
|
|
75
97
|
* Added #describe to DataFrame for producing multiple statistics data of numerical vectors in one shot.
|
76
98
|
* Monkey patched Ruby Matrix to include #elementwise_division.
|
77
99
|
* Added #covariance to calculate the covariance between numbers of a DataFrame and #correlation to calculate correlation.
|
78
|
-
* Enumerators return Enumerator objects if there is no block.
|
100
|
+
* Enumerators return Enumerator objects if there is no block.
|
101
|
+
|
102
|
+
# 0.0.4
|
103
|
+
* Added wrappers for Array, NMatrix and MDArray such that the external implementation is completely transparent of the data type being used internally.
|
104
|
+
* Added statistics methods for vectors for ArrayWrapper. These are compatible with statsample methods.
|
105
|
+
* Added plotting functions for DataFrame and Vector using Nyaplot.
|
106
|
+
* Create a DataFrame by specifying the rows with the ".rows" class method.
|
107
|
+
* Create a Vector from a Hash.
|
108
|
+
* Call a Vector element by specifying the index name as a method call (method_missing logic).
|
109
|
+
* Retrieve multiple rows of a DataFrame by specifying a Range or an Array with multiple index names.
|
110
|
+
* #head and #tail for DataFrame.
|
111
|
+
* #uniq for Vector.
|
112
|
+
* #max for Vector can return a Vector object with the index set to the index of the max value.
|
113
|
+
* Tonnes of documentation for most methods.
|
114
|
+
|
115
|
+
# 0.0.3.1
|
116
|
+
* Added arithmetic methods for vector arithmetic by taking the index of values into account.
|
117
|
+
|
118
|
+
# 0.0.3
|
119
|
+
* This release is a complete rewrite of the entire gem to accommodate index values.
|
120
|
+
|
121
|
+
# 0.0.2.4
|
122
|
+
* Initialize dataframe from an array which looks like [{a: 10, b: 20}, {a: 11, b: 12}]. Works for parsed JSON.
|
123
|
+
* Over-riding vectors in DataFrame will still preserve order.
|
124
|
+
* Any re-assignment of rows in #each_row and #each_row_with_index will reflect in the DataFrame.
|
125
|
+
* Added #to_a and #to_json to DataFrame.
|
126
|
+
|
127
|
+
# 0.0.2.3
|
128
|
+
* Added #filter\_rows and #delete_row to DataFrame and changed #row to return a row containing a Hash of column name and value.
|
129
|
+
* Vector objects passed into a DataFrame are now duplicated so that any changes don't affect the original vector.
|
130
|
+
* Added an optional opts argument to DataFrame.
|
131
|
+
* Sending more fields than vectors in DataFrame will cause addition of nil vectors.
|
132
|
+
* Init a DataFrame without having to convert explicitly to vectors.
|
133
|
+
|
134
|
+
# 0.0.2.2
|
135
|
+
* Added test cases and multiple column access through the [] operator on DataFrames
|
136
|
+
|
137
|
+
# 0.0.2.1
|
138
|
+
* Fixed bugs with previous code and more iterators
|
139
|
+
|
140
|
+
# 0.0.2
|
141
|
+
* Added iterators for dataframe and vector along with printing functions (to_html) to interface properly with iRuby notebook.
|
142
|
+
|
143
|
+
# 0.0.1
|
144
|
+
* Added classes for DataFrame and Vector along with some super-basic functions to get off the ground
|
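The 0.1.0 changelog above lists `#center` and `#standardize` among the new Vector methods. As a rough illustration of what those operations compute, here is a plain-Ruby sketch on ordinary arrays. It is not daru's implementation: the free-standing helper functions are hypothetical, the use of the sample standard deviation (dividing by n - 1) is an assumption, and so is the choice to skip `nil` values during computation while keeping them in the output.

```ruby
# Plain-Ruby sketch of centering and standardizing a numeric series.
# Assumption: nil marks missing data; it is ignored when computing the
# mean and standard deviation and preserved in the result.

def mean(values)
  valid = values.compact
  valid.reduce(0.0) { |sum, v| sum + v } / valid.size
end

# Sample standard deviation (divides by n - 1); an assumption here.
def sample_sd(values)
  valid = values.compact
  m = mean(valid)
  squares = valid.reduce(0.0) { |acc, v| acc + (v - m)**2 }
  Math.sqrt(squares / (valid.size - 1))
end

# Subtract the mean from every non-missing value.
def center(values)
  m = mean(values)
  values.map { |v| v.nil? ? nil : v - m }
end

# Subtract the mean and divide by the standard deviation.
def standardize(values)
  m  = mean(values)
  sd = sample_sd(values)
  values.map { |v| v.nil? ? nil : (v - m) / sd }
end

data = [1, 2, nil, 3, 4, 5]
p center(data)       # => [-2.0, -1.0, nil, 0.0, 1.0, 2.0]
p standardize(data)
```

After standardizing, the non-missing values have a mean of approximately 0 and a sample standard deviation of approximately 1, which is a quick sanity check for any implementation of these two methods.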
data/README.md
CHANGED
@@ -4,33 +4,45 @@ daru
|
|
4
4
|
Data Analysis in RUby
|
5
5
|
|
6
6
|
[![Gem Version](https://badge.fury.io/rb/daru.svg)](http://badge.fury.io/rb/daru)
|
7
|
+
[![Build Status](https://travis-ci.org/v0dro/daru.svg)](https://travis-ci.org/v0dro/daru)
|
7
8
|
|
8
9
|
## Introduction
|
9
10
|
|
10
11
|
daru (Data Analysis in RUby) is a library for storage, analysis, manipulation and visualization of data.
|
11
12
|
|
12
|
-
daru is inspired by
|
13
|
+
daru is inspired by pandas, a very mature solution in Python.
|
13
14
|
|
14
|
-
Written in pure Ruby so should work with all ruby implementations.
|
15
|
+
Written in pure Ruby so should work with all ruby implementations. Tested with MRI 2.0, 2.1, 2.2.
|
15
16
|
|
16
17
|
## Features
|
17
18
|
|
18
19
|
* Data structures:
|
19
20
|
- Vector - A basic 1-D vector.
|
20
|
-
- DataFrame - A 2-D
|
21
|
-
* Compatible with [IRuby notebook](https://github.com/
|
21
|
+
- DataFrame - A 2-D spreadsheet-like structure for manipulating and storing data sets. This is daru's primary data structure.
|
22
|
+
* Compatible with [IRuby notebook](https://github.com/SciRuby/iruby) and [statsample](https://github.com/SciRuby/statsample).
|
22
23
|
* Singly and hierarchically indexed data structures.
|
23
24
|
* Flexible and intuitive API for manipulation and analysis of data.
|
24
25
|
* Easy plotting, statistics and arithmetic.
|
25
26
|
* Plentiful iterators.
|
26
|
-
* Optional speed and space optimization on MRI with [NMatrix](https://github.com/SciRuby/nmatrix).
|
27
|
+
* Optional speed and space optimization on MRI with [NMatrix](https://github.com/SciRuby/nmatrix) and GSL.
|
27
28
|
* Easy splitting, aggregation and grouping of data.
|
28
29
|
* Quickly reducing data with pivot tables for quick data summary.
|
30
|
+
* Imports and exports datasets from and to Excel, CSV, databases and plain text files.
|
29
31
|
|
30
32
|
## Notebooks
|
31
33
|
|
32
|
-
|
33
|
-
|
34
|
+
### Usage
|
35
|
+
|
36
|
+
* [Basic Creation of Vectors and DataFrame](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Creation%20of%20Vector%20and%20DataFrame.ipynb)
|
37
|
+
* [Detailed Usage of Daru::Vector](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Usage%20of%20Vector.ipynb)
|
38
|
+
* [Detailed Usage of Daru::DataFrame](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Usage%20of%20DataFrame.ipynb)
|
39
|
+
* [Visualizing Data With Daru::DataFrame](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Visualization/Visualizing%20data%20with%20daru%20DataFrame.ipynb)
|
40
|
+
* [Grouping, Splitting and Pivoting Data](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Grouping%2C%20Splitting%20and%20Pivoting.ipynb)
|
41
|
+
|
42
|
+
### Case Studies
|
43
|
+
|
44
|
+
* [Logistic Regression Analysis with daru and statsample-glm](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Logistic%20Regression%20with%20daru%20and%20statsample-glm.ipynb)
|
45
|
+
* [Finding and Plotting most heard artists from a Last.fm dataset](http://nbviewer.ipython.org/github/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Finding%20and%20plotting%20the%20most%20heard%20artists%20on%20last%20fm.ipynb)
|
34
46
|
|
35
47
|
## Blog Posts
|
36
48
|
|
@@ -41,295 +53,18 @@ Written in pure Ruby so should work with all ruby implementations.
|
|
41
53
|
|
42
54
|
Docs can be found [here](https://rubygems.org/gems/daru).
|
43
55
|
|
44
|
-
## Basic Usage
|
45
|
-
|
46
|
-
#### Initialization of DataFrame
|
47
|
-
|
48
|
-
A basic DataFrame can be initialized like this:
|
49
|
-
|
50
|
-
```ruby
|
51
|
-
|
52
|
-
df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5]}, order: [:a, :b], index: [:one, :two, :three, :four, :five])
|
53
|
-
df
|
54
|
-
|
55
|
-
# =>
|
56
|
-
# # <Daru::DataFrame:87274040 @name = 7308c587-4073-4e7d-b3ca-3679d1dcc946 # @size = 5>
|
57
|
-
# a b
|
58
|
-
# one 1 11
|
59
|
-
# two 2 12
|
60
|
-
# three 3 13
|
61
|
-
# four 4 14
|
62
|
-
# five 5 15
|
63
|
-
```
|
64
|
-
Daru will automatically align the vectors correctly according to the specified index and then create the DataFrame. Thus, elements having the same index will show up in the same row. The indexes will be arranged alphabetically if vectors with unaligned indexes are supplied.
|
65
|
-
|
66
|
-
The vectors of the DataFrame will be arranged according to the array specified in the (optional) second argument. Otherwise the vectors are ordered alphabetically.
|
67
|
-
|
68
|
-
```ruby
|
69
|
-
|
70
|
-
df = Daru::DataFrame.new({
|
71
|
-
b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
|
72
|
-
a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
|
73
|
-
}, order: [:a, :b]
|
74
|
-
)
|
75
|
-
df
|
76
|
-
|
77
|
-
# =>
|
78
|
-
# #<Daru::DataFrame:87363700 @name = 75ba0a14-8291-48ac-ac30-35017e4d6c5f # @size = 5>
|
79
|
-
# a b
|
80
|
-
# five 5 14
|
81
|
-
# four 4 13
|
82
|
-
# one 2 12
|
83
|
-
# three 3 15
|
84
|
-
# two 1 11
|
85
|
-
```
|
86
|
-
|
87
|
-
If an index for the DataFrame is supplied (third argument), then the indexes of the individual vectors will be matched to the DataFrame index. If any of the indexes do not match, nils will be inserted instead:
|
88
|
-
|
89
|
-
```ruby
|
90
|
-
|
91
|
-
df = Daru::DataFrame.new({
|
92
|
-
b: [11] .dv(nil, [:one]),
|
93
|
-
a: [1,2,3] .dv(nil, [:one, :two, :three]),
|
94
|
-
c: [11,22,33,44,55] .dv(nil, [:one, :two, :three, :four, :five]),
|
95
|
-
d: [49,69,89,99,108,44].dv(nil, [:one, :two, :three, :four, :five, :six])
|
96
|
-
}, order: [:a, :b, :c, :d], index: [:one, :two, :three, :four, :five, :six])
|
97
|
-
df
|
98
|
-
# =>
|
99
|
-
# #<Daru::DataFrame:87523270 @name = bda4eb68-afdd-4404-9981-708edab14201 #@size = 6>
|
100
|
-
# a b c d
|
101
|
-
# one 1 11 11 49
|
102
|
-
# two 2 nil 22 69
|
103
|
-
# three 3 nil 33 89
|
104
|
-
# four nil nil 44 99
|
105
|
-
# five nil nil 55 108
|
106
|
-
# six nil nil nil 44
|
107
|
-
```
|
108
|
-
|
109
|
-
If some of the supplied vectors do not contain certain indexes that are contained in other vectors, they are added to those vectors and the corresponding elements are set to `nil`.
|
110
|
-
|
111
|
-
```ruby
|
112
|
-
|
113
|
-
df = Daru::DataFrame.new({
|
114
|
-
b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
|
115
|
-
a: [1,2,3] .dv(:a, [:two,:one,:three])
|
116
|
-
}, order: [:a, :b])
|
117
|
-
df
|
118
|
-
|
119
|
-
# =>
|
120
|
-
# #<Daru::DataFrame:87612510 @name = 1e904c15-e095-4dce-bfdf-c07ee4d6e4a4 # @size = 5>
|
121
|
-
# a b
|
122
|
-
# five nil 14
|
123
|
-
# four nil 13
|
124
|
-
# one 2 12
|
125
|
-
# three 3 15
|
126
|
-
# two 1 11
|
127
|
-
```
|
128
|
-
|
129
|
-
#### Initialization of Vector
|
130
|
-
|
131
|
-
The `Vector` data structure is also named and indexed. It accepts arguments name, source, index (in that order).
|
132
|
-
|
133
|
-
In the simplest case it can be constructed like this:
|
134
|
-
|
135
|
-
```ruby
|
136
|
-
|
137
|
-
dv = Daru::Vector.new [1,2,3,4,5], name: :ravan, index: [:ek, :don, :teen, :char, :pach]
|
138
|
-
dv
|
139
|
-
# =>
|
140
|
-
# #<Daru::Vector:87630270 @name = ravan @size = 5 >
|
141
|
-
# ravan
|
142
|
-
# ek 1
|
143
|
-
# don 2
|
144
|
-
# teen 3
|
145
|
-
# char 4
|
146
|
-
# pach 5
|
147
|
-
```
|
148
|
-
|
149
|
-
Initializing a vector with indexes will insert nils in places where elements don't exist:
|
150
|
-
|
151
|
-
```ruby
|
152
|
-
|
153
|
-
dv = Daru::Vector.new [1,2,3], name: :yoga, index: [0,1,2,3,4]
|
154
|
-
dv
|
155
|
-
# =>
|
156
|
-
# #<Daru::Vector:87890840 @name = yoga @size = 5 >
|
157
|
-
# y
|
158
|
-
# 0 1
|
159
|
-
# 1 2
|
160
|
-
# 2 3
|
161
|
-
# 3 nil
|
162
|
-
# 4 nil
|
163
|
-
```
|
164
|
-
|
165
|
-
#### Basic Selection Operations
|
166
|
-
|
167
|
-
Initialize a dataframe:
|
168
|
-
|
169
|
-
```ruby
|
170
|
-
|
171
|
-
df = Daru::DataFrame.new({
|
172
|
-
b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
|
173
|
-
a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
|
174
|
-
}, order: [:a, :b])
|
175
|
-
|
176
|
-
# =>
|
177
|
-
# #<Daru::DataFrame:87455010 @name = b3d14e23-98c2-4741-a563-92e8f1fd0f13 # @size = 5>
|
178
|
-
# a b
|
179
|
-
# five 5 14
|
180
|
-
# four 4 13
|
181
|
-
# one 2 12
|
182
|
-
# three 3 15
|
183
|
-
# two 1 11
|
184
|
-
|
185
|
-
```
|
186
|
-
Select a row from a DataFrame:
|
187
|
-
|
188
|
-
```ruby
|
189
|
-
|
190
|
-
df.row[:one]
|
191
|
-
|
192
|
-
# =>
|
193
|
-
# #<Daru::Vector:87432070 @name = one @size = 2 >
|
194
|
-
# one
|
195
|
-
# a 2
|
196
|
-
# b 12
|
197
|
-
```
|
198
|
-
A row or a vector is returned as a `Daru::Vector` object, so any manipulations supported by `Daru::Vector` can be performed on the chosen row as well.
|
199
|
-
|
200
|
-
Select multiple rows with a Range and get a DataFrame in return:
|
201
|
-
|
202
|
-
``` ruby
|
203
|
-
|
204
|
-
df.row[1..3] # OR df.row[:four..:three]
|
205
|
-
# =>
|
206
|
-
#<Daru::DataFrame:85361520 @name = d6582f66-5a55-473e-ba57-cb2ba974da6a @size #= 3>
|
207
|
-
# a b
|
208
|
-
# four 4 13
|
209
|
-
# one 2 12
|
210
|
-
# three 3 15
|
211
|
-
|
212
|
-
```
|
213
|
-
|
214
|
-
Select a single vector:
|
215
|
-
|
216
|
-
```ruby
|
217
|
-
|
218
|
-
df.vector[:a] # or simply df.a
|
219
|
-
|
220
|
-
# =>
|
221
|
-
# #<Daru::Vector:87454270 @name = a @size = 5 >
|
222
|
-
# a
|
223
|
-
# five 5
|
224
|
-
# four 4
|
225
|
-
# one 2
|
226
|
-
# three 3
|
227
|
-
# two 1
|
228
|
-
```
|
229
|
-
|
230
|
-
Select multiple vectors and return a DataFrame in the specified order:
|
231
|
-
|
232
|
-
```ruby
|
233
|
-
|
234
|
-
df.vector[:b, :a]
|
235
|
-
# =>
|
236
|
-
# #<Daru::DataFrame:87835960 @name = e80902cc-cff9-4b23-9eca-5da36ebc88a8 # @size = 5>
|
237
|
-
# b a
|
238
|
-
# five 14 5
|
239
|
-
# four 13 4
|
240
|
-
# one 12 2
|
241
|
-
# three 15 3
|
242
|
-
# two 11 1
|
243
|
-
```
|
244
|
-
|
245
|
-
Keep/remove row according to a specified condition:
|
246
|
-
|
247
|
-
```ruby
|
248
|
-
|
249
|
-
df = df.filter_rows do |row|
|
250
|
-
row[:a] == 5
|
251
|
-
end
|
252
|
-
df
|
253
|
-
# =>
|
254
|
-
# #<Daru::DataFrame:87455010 @name = b3d14e23-98c2-4741-a563-92e8f1fd0f13 # @size = 1>
|
255
|
-
# a b
|
256
|
-
# five 5 14
|
257
|
-
```
|
258
|
-
The same can be applied to vectors using `filter_vectors`.
|
259
|
-
|
260
|
-
To change the values of a row/vector while iterating through the DataFrame, use `map_rows` or `map_vectors`:
|
261
|
-
|
262
|
-
```ruby
|
263
|
-
|
264
|
-
df.map_rows do |row|
|
265
|
-
row = row * row
|
266
|
-
end
|
267
|
-
|
268
|
-
df
|
269
|
-
# =>
|
270
|
-
# #<Daru::DataFrame:86826830 @name = b092ca5b-7b83-4dbe-a469-124f7f25a568 # @size = 5>
|
271
|
-
# a b
|
272
|
-
# five 25 196
|
273
|
-
# four 16 169
|
274
|
-
# one 4 144
|
275
|
-
# three 9 225
|
276
|
-
# two 1 121
|
277
|
-
```
|
278
|
-
|
279
|
-
#### Basic Maths Operations
|
280
|
-
|
281
|
-
Performing a binary arithmetic operation on two `Daru::Vector` objects will return a `Vector` object in which the operation will be performed on elements of the same index.
|
282
|
-
|
283
|
-
```ruby
|
284
|
-
|
285
|
-
dv1 = Daru::Vector.new [1,2,3,4], name: :boozy, index: [:a, :b, :c, :d]
|
286
|
-
dv2 = Daru::Vector.new [1,2,3,4], name: :mayer, index: [:e, :f, :b, :d]
|
287
|
-
dv1 * dv2
|
288
|
-
|
289
|
-
# #<Daru::Vector:80924700 @name = boozy @size = 2 >
|
290
|
-
# boozy
|
291
|
-
# b 6
|
292
|
-
# d 16
|
293
|
-
```
|
294
|
-
|
295
|
-
Arithmetic operators applied on a single Numeric will perform the operation with that number against the entire vector.
|
296
|
-
|
297
|
-
Same applies to DataFrame as well.
|
298
|
-
|
299
|
-
#### Splitting and aggregation of data
|
300
|
-
|
301
|
-
`Daru::DataFrame` provides the `#group_by` method to split or aggregate data. It's very similar to SQL GROUP BY. Check the [blog post]() for details.
|
302
|
-
|
303
|
-
You can also generate Excel-style pivot tables with `#pivot_table`.
|
304
|
-
|
305
|
-
#### Plotting
|
306
|
-
|
307
|
-
daru uses [Nyaplot](https://github.com/domitry/nyaplot) for plotting and an example of this can be found in the [notebook](http://nbviewer.ipython.org/github/v0dro/daru/blob/master/notebooks/intro_with_music_data_.ipynb) or [blog post](http://v0dro.github.io/blog/2014/11/25/data-analysis-in-ruby-basic-data-manipulation-and-plotting/).
|
308
|
-
|
309
|
-
Head over to the tutorials and notebooks listed above for more examples.
|
310
|
-
|
311
|
-
#### Working with missing data
|
312
|
-
|
313
|
-
Missing data is an integral part of any data analysis operation and [this blog post](http://v0dro.github.io/blog/2015/02/24/data-analysis-in-ruby-part-2/) provides details on dealing with missing data.
|
314
|
-
|
315
56
|
## Roadmap
|
316
57
|
|
317
58
|
* Automate testing for both MRI and JRuby.
|
318
59
|
* Enable creation of DataFrame by only specifying an NMatrix/MDArray in initialize. Vector naming happens automatically (alphabetic) or is specified in an Array.
|
319
|
-
* Destructive map iterators for DataFrame.
|
320
60
|
* Completely test all functionality for MDArray.
|
321
61
|
* Basic Data manipulation and analysis operations:
|
322
|
-
- Different kinds of join operations
|
323
|
-
- Dataframe/vector merge (left, right, inner, outer)
|
324
|
-
- Verification of data in a vector
|
325
62
|
- DF concat
|
326
63
|
* Option to express a DataFrame as an NMatrix or MDArray so as to use more efficient storage techniques.
|
327
64
|
* Assignment of a column to a single number should set the entire column to that number.
|
328
65
|
* == between daru_vector and string/number.
|
329
66
|
* Multiple column assignment with []=
|
330
67
|
* Multiple value assignment for vectors with []=.
|
331
|
-
* Load DataFrame from multiple sources (excel, SQL, etc.).
|
332
|
-
* Deletion of elements from Vector should only modify the index and leave the vector as it is so that compacting is not needed and things are faster.
|
333
68
|
* #find\_max function which will evaluate a block and return the row for which the value of the block is max.
|
334
69
|
* Function to check if a value of a row/vector is within a specified range.
|
335
70
|
* Create a new vector in map_rows if any of the already present rows don't match the one assigned in the block.
|
@@ -338,19 +73,17 @@ Missing data is an integral part of any data analysis operation and [this blog p
|
|
338
73
|
* Cumulative sum.
|
339
74
|
* Time series support.
|
340
75
|
* Calculate percentage change.
|
341
|
-
* Working with missing data - drop\_missing\_data, dropping rows with missing data.
|
342
76
|
* Have some sample data sets for users to play around with. Should be able to load these from the code itself.
|
343
77
|
* Sorting with missing data present.
|
344
|
-
* Make vectors aware of the data frame that they are a part of.
|
345
78
|
* re_index should re establish previous index values in the newly supplied index.
|
346
|
-
* Reset index.
|
347
79
|
|
348
80
|
## Contributing
|
349
81
|
|
350
|
-
Pick a feature from the Roadmap
|
82
|
+
Pick a feature from the Roadmap or the issue tracker or think of your own and send me a Pull Request!
|
351
83
|
|
352
84
|
## Acknowledgements
|
353
85
|
|
86
|
+
* Google and the Ruby Science Foundation for the Google Summer of Code 2015 grant for further developing daru and integrating it with other ruby gems.
|
354
87
|
* Thank you [last.fm](http://www.last.fm/) for making user data accessible to the public.
|
355
88
|
|
356
89
|
Copyright (c) 2015, Sameer Deshmukh
|
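The "Basic Maths Operations" text removed from the README in this release described binary arithmetic on two vectors as operating only on elements that share an index. The plain-Ruby sketch below illustrates that idea with a Hash of index => value standing in for `Daru::Vector`. The function `aligned_multiply` is a hypothetical name for illustration and not part of daru. The input values are taken from the removed README example.

```ruby
# Plain-Ruby sketch of index-aligned multiplication. A Hash of
# index => value stands in for Daru::Vector. Only indexes present in
# both operands appear in the result.
def aligned_multiply(left, right)
  shared = left.keys & right.keys   # keeps the order of the left operand
  shared.each_with_object({}) do |key, out|
    out[key] = left[key] * right[key]
  end
end

dv1 = { a: 1, b: 2, c: 3, d: 4 }
dv2 = { e: 1, f: 2, b: 3, d: 4 }

aligned_multiply(dv1, dv2).each { |index, value| puts "#{index} #{value}" }
# prints "b 6" then "d 16"
```

The result agrees with the output shown in the removed README example, where only the shared indexes `b` and `d` survive the operation.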