daru 0.2.1 → 0.2.2
- checksums.yaml +5 -5
- data/.gitignore +1 -0
- data/.travis.yml +4 -1
- data/CONTRIBUTING.md +9 -0
- data/History.md +16 -1
- data/README.md +22 -3
- data/benchmarks/db_loading.rb +34 -0
- data/daru.gemspec +4 -2
- data/lib/daru/category.rb +13 -4
- data/lib/daru/core/group_by.rb +40 -31
- data/lib/daru/dataframe.rb +200 -54
- data/lib/daru/index/index.rb +12 -11
- data/lib/daru/index/multi_index.rb +8 -3
- data/lib/daru/io/io.rb +5 -17
- data/lib/daru/iruby/templates/dataframe.html.erb +1 -1
- data/lib/daru/vector.rb +20 -6
- data/lib/daru/version.rb +1 -1
- data/spec/core/group_by_spec.rb +6 -1
- data/spec/dataframe_spec.rb +110 -0
- data/spec/index/index_spec.rb +26 -0
- data/spec/index/multi_index_spec.rb +18 -0
- metadata +10 -10
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
|
-
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: 617e082fd3366f695622071cf630690d102552821e82926af81a7007bb09093d
|
4
|
+
data.tar.gz: b6b995e35e8124768a15a3e32d1fc38515aecc55f070510d7a51b45945520eb7
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8ae029cac761e4a7164b472ad6ef5275d18aa7d6dace9f61ab4553cc288a1d95e80412804e28a4405444a97dfd1e850ce00783d9b45126c0cb8e0c4dafa09e63
|
7
|
+
data.tar.gz: e8aa0aed6c05ec54ba4f5083ed3baa47b6eb02c3304bb0152beea1e57e181aafe45ebc779bbdec76b22b6c86b29f71dc16d013250870589cb93eeae0b8ca0917
|
data/.gitignore
CHANGED
data/.travis.yml
CHANGED
@@ -22,7 +22,10 @@ script:
|
|
22
22
|
- bundle exec yard-junk
|
23
23
|
|
24
24
|
install:
|
25
|
-
-
|
25
|
+
- if [ $TRAVIS_RUBY_VERSION == '2.2' ] || [ $TRAVIS_RUBY_VERSION == '2.1' ] || [ $TRAVIS_RUBY_VERSION == '2.0' ];
|
26
|
+
then gem install bundler -v '~> 1.6';
|
27
|
+
else gem install bundler;
|
28
|
+
fi
|
26
29
|
- gem install rainbow -v '2.2.1'
|
27
30
|
- bundle install
|
28
31
|
|
data/CONTRIBUTING.md
CHANGED
@@ -22,12 +22,21 @@ And run the test suite (should be all green with pending tests):
|
|
22
22
|
|
23
23
|
If you have problems installing nmatrix, please consult the [nmatrix installation wiki](https://github.com/SciRuby/nmatrix/wiki/Installation) or the [mailing list](https://groups.google.com/forum/#!forum/sciruby-dev).
|
24
24
|
|
25
|
+
**NOTE**: `Daru` is compatible with Ruby versions < 2.5; for later Ruby versions it breaks, returning the following error in versions >= 2.5.
|
26
|
+
```
|
27
|
+
/gems/packable-1.3.10/lib/packable/extensions/io.rb:86:in `pos': Illegal seek @ rb_io_tell - <STDOUT> (Errno::ESPIPE)
|
28
|
+
```
|
29
|
+
To reproduce this issue or explore this error further, head over to
|
30
|
+
[issue #500](https://github.com/SciRuby/daru/issues/500),
|
31
|
+
[issue #503](https://github.com/SciRuby/daru/issues/503). Also, if you want to fix this issue, then please discuss it here : [#505](https://github.com/SciRuby/daru/issues/500)
|
32
|
+
|
25
33
|
While preparing your pull requests, don't forget to check your code with Rubocop:
|
26
34
|
|
27
35
|
`bundle exec rubocop`
|
28
36
|
|
29
37
|
[Optional] Install all Ruby versions which Daru currently supports with `rake spec setup`.
|
30
38
|
|
39
|
+
|
31
40
|
## Basic Development Flow
|
32
41
|
|
33
42
|
1. Create a new branch with `git checkout -b <branch_name>`.
|
data/History.md
CHANGED
@@ -1,3 +1,18 @@
|
|
1
|
+
# 0.2.2 (8 August 2019)
|
2
|
+
|
3
|
+
* Minor Enhancements
|
4
|
+
- DataFrame#set_index can take column name array, which results in multi-index https://github.com/SciRuby/daru/pull/471 (by @Yuki-Inoue)
|
5
|
+
- implements DataFrame#reset_index https://github.com/SciRuby/daru/pull/473 (by @Yuki-Inoue)
|
6
|
+
- Make DataFrame.from_activerecord faster https://github.com/SciRuby/daru/pull/464 (by @paisible-wanderer )
|
7
|
+
- Added access_row_tuples_by_indexs method https://github.com/SciRuby/daru/pull/463 (by @Prakriti-nith )
|
8
|
+
|
9
|
+
* Fixes
|
10
|
+
- Fix reindex vector on argument error https://github.com/SciRuby/daru/pull/470 (by @Yuki-Inoue)
|
11
|
+
- Optimize aggregation https://github.com/SciRuby/daru/pull/464 (by @paisible-wanderer)
|
12
|
+
- Index#dup should copy reference to name too https://github.com/SciRuby/daru/pull/477 (by @Yuki-Inoue)
|
13
|
+
- Should support bundler version 2.x.x https://github.com/SciRuby/daru/pull/483/ (by @Shekharrajak )
|
14
|
+
- fix table style https://github.com/SciRuby/daru/pull/489 (by @kojix2 )
|
15
|
+
|
1
16
|
# 0.2.1 (02 July 2018)
|
2
17
|
|
3
18
|
* Minor Enhancements
|
@@ -116,7 +131,7 @@
|
|
116
131
|
- Support formatting empty dataframes. They were returning an error before. (@gnilrets)
|
117
132
|
- method_missing in Daru::DataFrame would not detect the correct vector if it was a String. Fixed that. (@lokeshh)
|
118
133
|
- Fix docs of contrast_code to specify that the default value is false. (@v0dro)
|
119
|
-
- Fix occurence of SystemStackError due to faulty
|
134
|
+
- Fix occurence of SystemStackError due to faulty argument passing to Array#values_at. (@v0dro)
|
120
135
|
- Fix `DataFrame#pivot_table` regression that raised an ArgumentError if the `:index` option was not specified. (@zverok)
|
121
136
|
- Fix `DateFrame.rows` to accept empty argument. (@zverok)
|
122
137
|
- Fix bug with false values on dataframe create. DataFrame from an Array of hashes wasn't being created properly when some of the values were `false`. (@gnilrets)
|
data/README.md
CHANGED
@@ -11,6 +11,25 @@ daru (Data Analysis in RUby) is a library for storage, analysis, manipulation an
|
|
11
11
|
|
12
12
|
daru makes it easy and intuitive to process data predominantly through 2 data structures: `Daru::DataFrame` and `Daru::Vector`. Written in pure Ruby works with all ruby implementations. Tested with MRI 2.0, 2.1, 2.2, 2.3, and 2.4.
|
13
13
|
|
14
|
+
|
15
|
+
## daru plugin gems
|
16
|
+
|
17
|
+
- **[daru-view](https://github.com/SciRuby/daru-view)**
|
18
|
+
|
19
|
+
daru-view is for easy and interactive plotting in web application & IRuby
|
20
|
+
notebook. It can work in any Ruby web application frameworks like Rails, Sinatra, Nanoc and hopefully in others too.
|
21
|
+
|
22
|
+
Articles/Blogs, that summarize powerful features of daru-view:
|
23
|
+
|
24
|
+
* [GSoC 2017 daru-view](http://sciruby.com/blog/2017/09/01/gsoc-2017-data-visualization-using-daru-view/)
|
25
|
+
* [GSoC 2018 Progress Report](https://github.com/SciRuby/daru-view/wiki/GSoC-2018---Progress-Report)
|
26
|
+
* [HighCharts Official blog post regarding daru-view](https://www.highcharts.com/blog/post/i-am-ruby-developer-how-can-i-use-highcharts/)
|
27
|
+
|
28
|
+
- **[daru-io](https://github.com/SciRuby/daru-io)**
|
29
|
+
|
30
|
+
This gem extends support for many Import and Export methods of `Daru::DataFrame`. This gem is intended to help Rubyists who are into Data Analysis or Web Development, by serving as a general purpose conversion library that takes input in one format (say, JSON) and converts it another format (say, Avro) while also making it incredibly easy to getting started on analyzing data with daru. One can read more in [SciRuby/blog/daru-io](http://sciruby.com/blog/2017/08/29/gsoc-2017-support-to-import-export-of-more-formats/).
|
31
|
+
|
32
|
+
|
14
33
|
## Features
|
15
34
|
|
16
35
|
* Data structures:
|
@@ -83,9 +102,9 @@ $ gem install daru
|
|
83
102
|
|
84
103
|
### Categorical Data
|
85
104
|
|
86
|
-
* [Categorical Index](http://lokeshh.github.io/blog/2016/06/14/categorical-index/)
|
87
|
-
* [Categorical Data](http://lokeshh.github.io/blog/2016/06/21/categorical-data/)
|
88
|
-
* [Visualization with Categorical Data](http://lokeshh.github.io/blog/2016/07/02/visualization/)
|
105
|
+
* [Categorical Index](http://lokeshh.github.io/gsoc2016/blog/2016/06/14/categorical-index/)
|
106
|
+
* [Categorical Data](http://lokeshh.github.io/gsoc2016/blog/2016/06/21/categorical-data/)
|
107
|
+
* [Visualization with Categorical Data](http://lokeshh.github.io/gsoc2016/blog/2016/07/02/visualization/)
|
89
108
|
|
90
109
|
## Basic Usage
|
91
110
|
|
data/benchmarks/db_loading.rb
ADDED
@@ -0,0 +1,34 @@
|
|
1
|
+
$:.unshift File.expand_path("../../lib", __FILE__)
|
2
|
+
|
3
|
+
require 'benchmark'
|
4
|
+
require 'daru'
|
5
|
+
require 'sqlite3'
|
6
|
+
require 'dbi'
|
7
|
+
require 'active_record'
|
8
|
+
|
9
|
+
db_name = 'daru_test.sqlite'
|
10
|
+
FileUtils.rm(db_name) if File.file?(db_name)
|
11
|
+
|
12
|
+
SQLite3::Database.new(db_name).tap do |db|
|
13
|
+
db.execute "create table accounts(id integer, name varchar, age integer, primary key(id))"
|
14
|
+
|
15
|
+
values = 1.upto(100_000).map { |i| %!(#{i},"name_#{i}",#{rand(100)})! }.join(",")
|
16
|
+
db.execute "insert into accounts values #{values}"
|
17
|
+
end
|
18
|
+
|
19
|
+
ActiveRecord::Base.establish_connection("sqlite3:#{db_name}")
|
20
|
+
ActiveRecord::Base.connection
|
21
|
+
|
22
|
+
class Account < ActiveRecord::Base; end
|
23
|
+
|
24
|
+
Benchmark.bm do |x|
|
25
|
+
x.report("DataFrame.from_sql") do
|
26
|
+
Daru::DataFrame.from_sql(ActiveRecord::Base.connection, "SELECT * FROM accounts")
|
27
|
+
end
|
28
|
+
|
29
|
+
x.report("DataFrame.from_activerecord") do
|
30
|
+
Daru::DataFrame.from_activerecord(Account.all)
|
31
|
+
end
|
32
|
+
end
|
33
|
+
|
34
|
+
FileUtils.rm(db_name)
|
data/daru.gemspec
CHANGED
@@ -33,7 +33,7 @@ Gem::Specification.new do |spec|
|
|
33
33
|
spec.add_runtime_dependency 'packable', '~> 1.3.9'
|
34
34
|
|
35
35
|
spec.add_development_dependency 'spreadsheet', '~> 1.1.1'
|
36
|
-
spec.add_development_dependency 'bundler', '
|
36
|
+
spec.add_development_dependency 'bundler', '>= 1.10'
|
37
37
|
spec.add_development_dependency 'rake', '~>10.5'
|
38
38
|
spec.add_development_dependency 'pry', '~> 0.10'
|
39
39
|
spec.add_development_dependency 'pry-byebug'
|
@@ -49,7 +49,9 @@ Gem::Specification.new do |spec|
|
|
49
49
|
spec.add_development_dependency 'dbi'
|
50
50
|
spec.add_development_dependency 'activerecord', '~> 4.0'
|
51
51
|
spec.add_development_dependency 'mechanize'
|
52
|
-
|
52
|
+
# issue : https://github.com/SciRuby/daru/issues/493 occured
|
53
|
+
# with latest version of sqlite3
|
54
|
+
spec.add_development_dependency 'sqlite3', '~> 1.3.13'
|
53
55
|
spec.add_development_dependency 'rubocop', '~> 0.49.0'
|
54
56
|
spec.add_development_dependency 'ruby-prof'
|
55
57
|
spec.add_development_dependency 'simplecov'
|
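The gemspec hunks above use RubyGems constraint operators (`>= 1.10` for bundler, `~> 1.3.13` for sqlite3), and `.travis.yml` pins bundler to `~> 1.6` on Ruby 2.0 to 2.2. A minimal sketch of how such constraints are evaluated, using `Gem::Requirement` from RubyGems; the helper name `satisfies?` and the version numbers being checked are illustrative assumptions, not part of the diff.

```ruby
require 'rubygems'

# Hypothetical helper: true when `version` satisfies the constraint string.
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

BUNDLER_DEV = '>= 1.10'    # bundler development dependency after this release
SQLITE_DEV  = '~> 1.3.13'  # sqlite3 pin added in this release
TRAVIS_OLD  = '~> 1.6'     # bundler pin used for Ruby 2.0-2.2 on Travis

satisfies?(BUNDLER_DEV, '2.0.2')   # => true  (bundler 2.x is now accepted)
satisfies?(SQLITE_DEV,  '1.3.13')  # => true
satisfies?(SQLITE_DEV,  '1.4.0')   # => false (~> 1.3.13 means >= 1.3.13 and < 1.4)
satisfies?(TRAVIS_OLD,  '2.0.2')   # => false (~> 1.6 means >= 1.6 and < 2.0)
```

The pessimistic operator `~>` fixes every segment except the last, which is why the sqlite3 pin excludes 1.4.x while still accepting later 1.3 patch releases.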
data/lib/daru/category.rb
CHANGED
@@ -74,6 +74,13 @@ module Daru
|
|
74
74
|
end
|
75
75
|
end
|
76
76
|
|
77
|
+
# this method is overwritten: see Daru::Category#plotting_library=
|
78
|
+
def plot(*args, **options, &b)
|
79
|
+
init_plotting_library
|
80
|
+
|
81
|
+
plot(*args, **options, &b)
|
82
|
+
end
|
83
|
+
|
77
84
|
alias_method :rename, :name=
|
78
85
|
|
79
86
|
# Returns an enumerator that enumerates on categorical data
|
@@ -174,7 +181,7 @@ module Daru
|
|
174
181
|
# Returns vector for indexes/positions specified
|
175
182
|
# @param [Array] indexes for which values has to be retrived
|
176
183
|
# @note Since it accepts both indexes and postions. In case of collision,
|
177
|
-
#
|
184
|
+
# argument will be treated as index
|
178
185
|
# @return vector containing values specified at specified indexes/positions
|
179
186
|
# @example
|
180
187
|
# dv = Daru::Vector.new [:a, 1, :a, 1, :c],
|
@@ -748,6 +755,11 @@ module Daru
|
|
748
755
|
|
749
756
|
private
|
750
757
|
|
758
|
+
# Will lazily load the plotting library being used
|
759
|
+
def init_plotting_library
|
760
|
+
self.plotting_library = Daru.plotting_library
|
761
|
+
end
|
762
|
+
|
751
763
|
def validate_categories input_categories
|
752
764
|
raise ArgumentError, 'Input categories and speculated categories mismatch' unless
|
753
765
|
(categories - input_categories).empty?
|
@@ -768,9 +780,6 @@ module Daru
|
|
768
780
|
# To link every instance to its category,
|
769
781
|
# it stores integer for every instance representing its category
|
770
782
|
@array = map_cat_int.values_at(*data)
|
771
|
-
|
772
|
-
# Include plotting functionality
|
773
|
-
self.plotting_library = Daru.plotting_library
|
774
783
|
end
|
775
784
|
|
776
785
|
def category_from_position position
|
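The `plot` method added above looks recursive, but its comment says it is "overwritten" by `plotting_library=`. A minimal sketch of that lazy-loading pattern in plain Ruby, assuming the setter mixes the backend into the individual object with `extend`; `FakeBackend` and `LazyPlottable` are hypothetical names used only for illustration and are not daru classes.

```ruby
# Hypothetical stand-in for a plotting backend module.
module FakeBackend
  def plot(*args, **options)
    [:fake_plot, args, options]
  end
end

class LazyPlottable
  attr_reader :backend_loaded

  # First call: load the backend, then call plot again.
  # `extend` puts FakeBackend ahead of the class in method lookup, so the
  # second call dispatches to FakeBackend#plot instead of recursing.
  def plot(*args, **options, &block)
    init_plotting_library
    plot(*args, **options, &block)
  end

  private

  # Deferred setup: nothing is loaded until plot is first called.
  def init_plotting_library
    extend FakeBackend
    @backend_loaded = true
  end
end

obj = LazyPlottable.new
obj.backend_loaded            # => nil, nothing loaded at construction
obj.plot(:a, type: :bar)      # loads the backend, then delegates to it
obj.backend_loaded            # => true
```

The design moves the cost of selecting a plotting library out of the constructor, so objects that are never plotted never pay for it.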
data/lib/daru/core/group_by.rb
CHANGED
@@ -2,6 +2,7 @@ module Daru
|
|
2
2
|
module Core
|
3
3
|
class GroupBy
|
4
4
|
class << self
|
5
|
+
# @private
|
5
6
|
def get_positions_group_map_on(indexes_with_positions, sort: false)
|
6
7
|
group_map = {}
|
7
8
|
|
@@ -17,6 +18,7 @@ module Daru
|
|
17
18
|
group_map
|
18
19
|
end
|
19
20
|
|
21
|
+
# @private
|
20
22
|
def get_positions_group_for_aggregation(multi_index, level=-1)
|
21
23
|
raise unless multi_index.is_a?(Daru::MultiIndex)
|
22
24
|
|
@@ -26,16 +28,19 @@ module Daru
|
|
26
28
|
get_positions_group_map_on(new_index.each_with_index)
|
27
29
|
end
|
28
30
|
|
31
|
+
# @private
|
29
32
|
def get_positions_group_map_for_df(df, group_by_keys, sort: true)
|
30
33
|
indexes_with_positions = df[*group_by_keys].to_df.each_row.map(&:to_a).each_with_index
|
31
34
|
|
32
35
|
get_positions_group_map_on(indexes_with_positions, sort: sort)
|
33
36
|
end
|
34
37
|
|
38
|
+
# @private
|
35
39
|
def group_map_from_positions_to_indexes(positions_group_map, index)
|
36
40
|
positions_group_map.map { |k, positions| [k, positions.map { |pos| index.at(pos) }] }.to_h
|
37
41
|
end
|
38
42
|
|
43
|
+
# @private
|
39
44
|
def df_from_group_map(df, group_map, remaining_vectors, from_position: true)
|
40
45
|
return nil if group_map == {}
|
41
46
|
|
@@ -52,7 +57,17 @@ module Daru
|
|
52
57
|
end
|
53
58
|
end
|
54
59
|
|
55
|
-
attr_reader
|
60
|
+
# lazy accessor/attr_reader for the attribute groups
|
61
|
+
def groups
|
62
|
+
@groups ||= GroupBy.group_map_from_positions_to_indexes(@groups_by_pos, @context.index)
|
63
|
+
end
|
64
|
+
alias :groups_by_idx :groups
|
65
|
+
|
66
|
+
# lazy accessor/attr_reader for the attribute df
|
67
|
+
def df
|
68
|
+
@df ||= GroupBy.df_from_group_map(@context, @groups_by_pos, @non_group_vectors)
|
69
|
+
end
|
70
|
+
alias :grouped_df :df
|
56
71
|
|
57
72
|
# Iterate over each group created by group_by. A DataFrame is yielded in
|
58
73
|
# block.
|
@@ -75,8 +90,11 @@ module Daru
|
|
75
90
|
end
|
76
91
|
|
77
92
|
def initialize context, names
|
93
|
+
@group_vectors = names
|
78
94
|
@non_group_vectors = context.vectors.to_a - names
|
79
|
-
|
95
|
+
|
96
|
+
@context = context # TODO: maybe rename in @original_df or @grouped_db
|
97
|
+
|
80
98
|
# FIXME: It feels like we don't want to sort here. Ruby's #group_by
|
81
99
|
# never sorts:
|
82
100
|
#
|
@@ -84,22 +102,14 @@ module Daru
|
|
84
102
|
# # => {4=>["test"], 2=>["me"], 6=>["please"]}
|
85
103
|
#
|
86
104
|
# - zverok, 2016-09-12
|
87
|
-
|
88
|
-
|
89
|
-
@groups = GroupBy.group_map_from_positions_to_indexes(positions_groups, @context.index)
|
90
|
-
@df = GroupBy.df_from_group_map(@context, positions_groups, @non_group_vectors)
|
105
|
+
@groups_by_pos = GroupBy.get_positions_group_map_for_df(@context, @group_vectors, sort: true)
|
91
106
|
end
|
92
107
|
|
93
108
|
# Get a Daru::Vector of the size of each group.
|
94
109
|
def size
|
95
|
-
index =
|
96
|
-
if multi_indexed_grouping?
|
97
|
-
Daru::MultiIndex.from_tuples @groups.keys
|
98
|
-
else
|
99
|
-
Daru::Index.new @groups.keys.flatten
|
100
|
-
end
|
110
|
+
index = get_grouped_index
|
101
111
|
|
102
|
-
values = @
|
112
|
+
values = @groups_by_pos.values.map(&:size)
|
103
113
|
Daru::Vector.new(values, index: index, name: :size)
|
104
114
|
end
|
105
115
|
|
@@ -246,7 +256,7 @@ module Daru
|
|
246
256
|
# # a b c d
|
247
257
|
# # 5 bar two 6 66
|
248
258
|
def get_group group
|
249
|
-
indexes =
|
259
|
+
indexes = groups_by_idx[group]
|
250
260
|
elements = @context.each_vector.map(&:to_a)
|
251
261
|
transpose = elements.transpose
|
252
262
|
rows = indexes.each.map { |idx| transpose[idx] }
|
@@ -273,7 +283,7 @@ module Daru
|
|
273
283
|
# # a ACE
|
274
284
|
# # b BDF
|
275
285
|
def reduce(init=nil)
|
276
|
-
result_hash =
|
286
|
+
result_hash = groups_by_idx.each_with_object({}) do |(group, indices), h|
|
277
287
|
group_indices = indices.map { |v| @context.index.to_a[v] }
|
278
288
|
|
279
289
|
grouped_result = init
|
@@ -284,18 +294,13 @@ module Daru
|
|
284
294
|
h[group] = grouped_result
|
285
295
|
end
|
286
296
|
|
287
|
-
index =
|
288
|
-
if multi_indexed_grouping?
|
289
|
-
Daru::MultiIndex.from_tuples result_hash.keys
|
290
|
-
else
|
291
|
-
Daru::Index.new result_hash.keys.flatten
|
292
|
-
end
|
297
|
+
index = get_grouped_index(result_hash.keys)
|
293
298
|
|
294
299
|
Daru::Vector.new(result_hash.values, index: index)
|
295
300
|
end
|
296
301
|
|
297
302
|
def inspect
|
298
|
-
|
303
|
+
grouped_df.inspect
|
299
304
|
end
|
300
305
|
|
301
306
|
# Function to use for aggregating the data.
|
@@ -335,7 +340,9 @@ module Daru
|
|
335
340
|
# Ram Hyderabad,Mumbai
|
336
341
|
#
|
337
342
|
def aggregate(options={})
|
338
|
-
|
343
|
+
new_index = get_grouped_index
|
344
|
+
|
345
|
+
@context.aggregate(options) { [@groups_by_pos.values, new_index] }
|
339
346
|
end
|
340
347
|
|
341
348
|
private
|
@@ -344,7 +351,7 @@ module Daru
|
|
344
351
|
selection = @context
|
345
352
|
rows, indexes = [], []
|
346
353
|
|
347
|
-
|
354
|
+
groups_by_idx.each_value do |index|
|
348
355
|
index.send(method, quantity).each do |idx|
|
349
356
|
rows << selection.row[idx].to_a
|
350
357
|
indexes << idx
|
@@ -360,29 +367,31 @@ module Daru
|
|
360
367
|
method_type == :numeric && @context[ngvec].type == :numeric
|
361
368
|
end
|
362
369
|
|
363
|
-
rows =
|
370
|
+
rows = groups_by_idx.map do |_group, indexes|
|
364
371
|
order.map do |ngvector|
|
365
372
|
slice = @context[ngvector][*indexes]
|
366
373
|
slice.is_a?(Daru::Vector) ? slice.send(method) : slice
|
367
374
|
end
|
368
375
|
end
|
369
376
|
|
370
|
-
index =
|
377
|
+
index = get_grouped_index
|
371
378
|
order = Daru::Index.new(order)
|
372
379
|
Daru::DataFrame.new(rows.transpose, index: index, order: order)
|
373
380
|
end
|
374
381
|
|
375
|
-
def
|
382
|
+
def get_grouped_index(index_tuples=nil)
|
383
|
+
index_tuples = @groups_by_pos.keys if index_tuples.nil?
|
384
|
+
|
376
385
|
if multi_indexed_grouping?
|
377
|
-
Daru::MultiIndex.from_tuples(
|
386
|
+
Daru::MultiIndex.from_tuples(index_tuples)
|
378
387
|
else
|
379
|
-
Daru::Index.new(
|
388
|
+
Daru::Index.new(index_tuples.flatten)
|
380
389
|
end
|
381
390
|
end
|
382
391
|
|
383
392
|
def multi_indexed_grouping?
|
384
|
-
return false unless @
|
385
|
-
@
|
393
|
+
return false unless @groups_by_pos.keys[0]
|
394
|
+
@groups_by_pos.keys[0].size > 1
|
386
395
|
end
|
387
396
|
end
|
388
397
|
end
|
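The `group_by.rb` changes above replace eager computation of `@groups` and `@df` in `initialize` with accessors memoized via `||=`, while `size` works from `@groups_by_pos` alone. A minimal sketch of that idea in plain Ruby; `LazyGrouping`, its row format (an array of hashes), and the `computations` counter are assumptions made for illustration, not daru's implementation.

```ruby
class LazyGrouping
  attr_reader :computations

  def initialize(rows, key)
    @rows = rows
    @computations = 0
    # Cheap part, done eagerly: map each group key to its row positions.
    @groups_by_pos = Hash.new { |hash, k| hash[k] = [] }
    rows.each_with_index { |row, pos| @groups_by_pos[row[key]] << pos }
  end

  # Expensive part, deferred until first use and then cached with ||=.
  def groups
    @groups ||= begin
      @computations += 1
      @groups_by_pos.map { |k, positions| [k, positions.map { |p| @rows[p] }] }.to_h
    end
  end

  # Needs only the positions, so it never triggers #groups.
  def size
    @groups_by_pos.map { |k, positions| [k, positions.size] }.to_h
  end
end

rows = [{ a: 'x', n: 1 }, { a: 'y', n: 2 }, { a: 'x', n: 3 }]
grouping = LazyGrouping.new(rows, :a)
grouping.size           # => {"x"=>2, "y"=>1}
grouping.computations   # => 0, the groups were never materialized
```

Callers that only ask for sizes or aggregates skip the materialization step entirely, which appears to be the motivation for the "Optimize aggregation" entry in History.md.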
data/lib/daru/dataframe.rb
CHANGED
@@ -10,7 +10,8 @@ module Daru
|
|
10
10
|
include Daru::Maths::Arithmetic::DataFrame
|
11
11
|
include Daru::Maths::Statistics::DataFrame
|
12
12
|
# TODO: Remove this line but its causing erros due to unkown reason
|
13
|
-
|
13
|
+
Daru.has_nyaplot?
|
14
|
+
|
14
15
|
extend Gem::Deprecate
|
15
16
|
|
16
17
|
class << self
|
@@ -346,20 +347,19 @@ module Daru
|
|
346
347
|
@name = opts[:name]
|
347
348
|
|
348
349
|
case source
|
349
|
-
when
|
350
|
-
|
351
|
-
@index = Index.coerce index
|
352
|
-
create_empty_vectors
|
350
|
+
when [], {}
|
351
|
+
create_empty_vectors(vectors, index)
|
353
352
|
when Array
|
354
353
|
initialize_from_array source, vectors, index, opts
|
355
354
|
when Hash
|
356
355
|
initialize_from_hash source, vectors, index, opts
|
356
|
+
when ->(s) { s.empty? } # TODO: likely want to remove this case
|
357
|
+
create_empty_vectors(vectors, index)
|
357
358
|
end
|
358
359
|
|
359
360
|
set_size
|
360
361
|
validate
|
361
362
|
update
|
362
|
-
self.plotting_library = Daru.plotting_library
|
363
363
|
end
|
364
364
|
|
365
365
|
def plotting_library= lib
|
@@ -372,11 +372,18 @@ module Daru
|
|
372
372
|
)
|
373
373
|
end
|
374
374
|
else
|
375
|
-
raise
|
375
|
+
raise ArgumentError, "Plotting library #{lib} not supported. "\
|
376
376
|
'Supported libraries are :nyaplot and :gruff'
|
377
377
|
end
|
378
378
|
end
|
379
379
|
|
380
|
+
# this method is overwritten: see Daru::DataFrame#plotting_library=
|
381
|
+
def plot(*args, **options, &b)
|
382
|
+
init_plotting_library
|
383
|
+
|
384
|
+
plot(*args, **options, &b)
|
385
|
+
end
|
386
|
+
|
380
387
|
# Access row or vector. Specify name of row/vector followed by axis(:row, :vector).
|
381
388
|
# Defaults to *:vector*. Use of this method is not recommended for accessing
|
382
389
|
# rows. Use df.row[:a] for accessing row with index ':a'.
|
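The `case source` in the hunk above mixes three kinds of `when` matchers: literal values (`[]`, `{}`), classes (`Array`, `Hash`), and a lambda. Each is tested with `===`, which means equality for the literals, `is_a?` for the classes, and a call for the lambda. A minimal sketch of that dispatch in plain Ruby; `classify_source`, the returned symbols, and the `else` branch are hypothetical additions for illustration.

```ruby
# Hypothetical classifier mirroring the order of the `when` clauses above.
def classify_source(source)
  case source
  when [], {}             then :empty_literal  # [] === source is source == []
  when Array              then :array          # Array === source is source.is_a?(Array)
  when Hash               then :hash
  when ->(s) { s.empty? } then :other_empty    # Proc#=== calls the lambda with source
  else :unhandled                              # not present in the original case
  end
end

classify_source([])          # => :empty_literal
classify_source([[1, 2]])    # => :array
classify_source({ a: [1] })  # => :hash
classify_source('')          # => :other_empty
```

Order matters here: the empty literals must come before `Array` and `Hash`, otherwise an empty array would take the `Array` branch. The lambda clause also assumes the source responds to `empty?`; an object that does not would raise `NoMethodError` at that point.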
@@ -404,13 +411,11 @@ module Daru
|
|
404
411
|
validate_positions(*positions, nrows)
|
405
412
|
|
406
413
|
if positions.is_a? Integer
|
407
|
-
|
408
|
-
|
414
|
+
row = get_rows_for([positions])
|
415
|
+
Daru::Vector.new row, index: @vectors
|
409
416
|
else
|
410
|
-
new_rows =
|
411
|
-
|
412
|
-
index: @index.at(*original_positions),
|
413
|
-
order: @vectors
|
417
|
+
new_rows = get_rows_for(original_positions)
|
418
|
+
Daru::DataFrame.new new_rows, index: @index.at(*original_positions), order: @vectors
|
414
419
|
end
|
415
420
|
end
|
416
421
|
|
@@ -621,7 +626,7 @@ module Daru
|
|
621
626
|
deprecate :dup_only_valid, :reject_values, 2016, 10
|
622
627
|
|
623
628
|
# Returns a dataframe in which rows with any of the mentioned values
|
624
|
-
#
|
629
|
+
# are ignored.
|
625
630
|
# @param [Array] values to reject to form the new dataframe
|
626
631
|
# @return [Daru::DataFrame] Data Frame with only rows which doesn't
|
627
632
|
# contain the mentioned values
|
@@ -752,7 +757,7 @@ module Daru
|
|
752
757
|
# 3 4 d
|
753
758
|
#
|
754
759
|
def uniq(*vtrs)
|
755
|
-
vecs = vtrs.empty? ? vectors.
|
760
|
+
vecs = vtrs.empty? ? vectors.to_a : Array(vtrs)
|
756
761
|
grouped = group_by(vecs)
|
757
762
|
indexes = grouped.groups.values.map { |v| v[0] }.sort
|
758
763
|
row[*indexes]
|
@@ -1011,6 +1016,7 @@ module Daru
|
|
1011
1016
|
case method
|
1012
1017
|
when Symbol then df.send(method)
|
1013
1018
|
when Proc then method.call(df)
|
1019
|
+
when Array then method.map(&:to_proc).map { |proc| proc.call(df) } # works with Array of both Symbol and/or Proc
|
1014
1020
|
else raise
|
1015
1021
|
end
|
1016
1022
|
end
|
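The new `when Array` branch above relies on `to_proc` being defined for both `Symbol` and `Proc`, so a single array can mix the two. A minimal sketch in plain Ruby, applied to an ordinary array instead of a sub-DataFrame; `apply_method` and the `ArgumentError` in the `else` branch are illustrative assumptions (the original uses a bare `raise`).

```ruby
# Hypothetical dispatcher with the same three branches as the hunk above.
def apply_method(method, data)
  case method
  when Symbol then data.send(method)
  when Proc   then method.call(data)
  # Symbol#to_proc builds a proc that sends the symbol to its argument;
  # Proc#to_proc returns the proc itself, so both kinds can share one array.
  when Array  then method.map(&:to_proc).map { |proc| proc.call(data) }
  else raise ArgumentError, "unsupported aggregation: #{method.inspect}"
  end
end

values = [4, 2, 9]
apply_method(:max, values)                        # => 9
apply_method(->(a) { a.size }, values)            # => 3
apply_method([:min, ->(a) { a.size }], values)    # => [2, 3]
```

Returning one result per entry is what lets the row-wise aggregation path evaluate several methods against a single sub-frame in one pass.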
@@ -1489,7 +1495,7 @@ module Daru
|
|
1489
1495
|
def reindex_vectors new_vectors
|
1490
1496
|
unless new_vectors.is_a?(Daru::Index)
|
1491
1497
|
raise ArgumentError, 'Must pass the new index of type Index or its '\
|
1492
|
-
"subclasses, not #{
|
1498
|
+
"subclasses, not #{new_vectors.class}"
|
1493
1499
|
end
|
1494
1500
|
|
1495
1501
|
cl = Daru::DataFrame.new({}, order: new_vectors, index: @index, name: @name)
|
@@ -1527,14 +1533,52 @@ module Daru
|
|
1527
1533
|
df
|
1528
1534
|
end
|
1529
1535
|
|
1536
|
+
module SetSingleIndexStrategy
|
1537
|
+
def self.uniq_size(df, col)
|
1538
|
+
df[col].uniq.size
|
1539
|
+
end
|
1540
|
+
|
1541
|
+
def self.new_index(df, col)
|
1542
|
+
Daru::Index.new(df[col].to_a)
|
1543
|
+
end
|
1544
|
+
|
1545
|
+
def self.delete_vector(df, col)
|
1546
|
+
df.delete_vector(col)
|
1547
|
+
end
|
1548
|
+
end
|
1549
|
+
|
1550
|
+
module SetMultiIndexStrategy
|
1551
|
+
def self.uniq_size(df, cols)
|
1552
|
+
df[*cols].uniq.size
|
1553
|
+
end
|
1554
|
+
|
1555
|
+
def self.new_index(df, cols)
|
1556
|
+
Daru::MultiIndex.from_arrays(df[*cols].map_vectors(&:to_a)).tap do |mi|
|
1557
|
+
mi.name = cols
|
1558
|
+
mi
|
1559
|
+
end
|
1560
|
+
end
|
1561
|
+
|
1562
|
+
def self.delete_vector(df, cols)
|
1563
|
+
df.delete_vectors(*cols)
|
1564
|
+
end
|
1565
|
+
end
|
1566
|
+
|
1530
1567
|
# Set a particular column as the new DF
|
1531
|
-
def set_index
|
1532
|
-
|
1533
|
-
|
1568
|
+
def set_index new_index_col, opts={}
|
1569
|
+
if new_index_col.respond_to?(:to_a)
|
1570
|
+
strategy = SetMultiIndexStrategy
|
1571
|
+
new_index_col = new_index_col.to_a
|
1572
|
+
else
|
1573
|
+
strategy = SetSingleIndexStrategy
|
1574
|
+
end
|
1534
1575
|
|
1535
|
-
|
1536
|
-
|
1576
|
+
uniq_size = strategy.uniq_size(self, new_index_col)
|
1577
|
+
raise ArgumentError, 'All elements in new index must be unique.' if
|
1578
|
+
@size != uniq_size
|
1537
1579
|
|
1580
|
+
self.index = strategy.new_index(self, new_index_col)
|
1581
|
+
strategy.delete_vector(self, new_index_col) unless opts[:keep]
|
1538
1582
|
self
|
1539
1583
|
end
|
1540
1584
|
|
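`set_index` above picks between two strategy modules that expose the same three methods, so the body of `set_index` stays identical for a single column and for an array of columns. A minimal sketch of that selection in plain Ruby over an array of hashes; `SingleKeyStrategy`, `MultiKeyStrategy`, and `build_index` are hypothetical names, and the sketch omits the column-deletion step.

```ruby
module SingleKeyStrategy
  def self.uniq_size(rows, col)
    rows.map { |r| r[col] }.uniq.size
  end

  def self.new_index(rows, col)
    rows.map { |r| r[col] }
  end
end

module MultiKeyStrategy
  def self.uniq_size(rows, cols)
    rows.map { |r| r.values_at(*cols) }.uniq.size
  end

  def self.new_index(rows, cols)
    rows.map { |r| r.values_at(*cols) }  # one tuple per row
  end
end

# Same shape as set_index: choose a strategy, check uniqueness, build the index.
def build_index(rows, key)
  if key.respond_to?(:to_a)
    strategy = MultiKeyStrategy
    key = key.to_a
  else
    strategy = SingleKeyStrategy
  end

  raise ArgumentError, 'All elements in new index must be unique.' if
    rows.size != strategy.uniq_size(rows, key)

  strategy.new_index(rows, key)
end

rows = [{ a: 1, b: 'x' }, { a: 1, b: 'y' }, { a: 2, b: 'x' }]
build_index(rows, [:a, :b])   # => [[1, "x"], [1, "y"], [2, "x"]]
```

Column `:a` alone contains a duplicate, so `build_index(rows, :a)` raises `ArgumentError`, while the pair `[:a, :b]` is unique. One caveat of the `respond_to?(:to_a)` test is that `nil` also responds to `to_a`, so a `nil` key would be routed to the multi-key strategy.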
@@ -1572,11 +1616,24 @@ module Daru
|
|
1572
1616
|
end
|
1573
1617
|
end
|
1574
1618
|
|
1619
|
+
def reset_index
|
1620
|
+
index_df = index.to_df
|
1621
|
+
names = index.name
|
1622
|
+
names = [names] unless names.instance_of?(Array)
|
1623
|
+
new_vectors = names + vectors.to_a
|
1624
|
+
self.index = index_df.index
|
1625
|
+
names.each do |name|
|
1626
|
+
self[name] = index_df[name]
|
1627
|
+
end
|
1628
|
+
self.order = new_vectors
|
1629
|
+
self
|
1630
|
+
end
|
1631
|
+
|
1575
1632
|
# Reassign index with a new index of type Daru::Index or any of its subclasses.
|
1576
1633
|
#
|
1577
1634
|
# @param [Daru::Index] idx New index object on which the rows of the dataframe
|
1578
1635
|
# are to be indexed.
|
1579
|
-
# @example
|
1636
|
+
# @example Reassigining index of a DataFrame
|
1580
1637
|
# df = Daru::DataFrame.new({a: [1,2,3,4], b: [11,22,33,44]})
|
1581
1638
|
# df.index.to_a #=> [0,1,2,3]
|
1582
1639
|
#
|
@@ -2088,7 +2145,7 @@ module Daru
|
|
2088
2145
|
|
2089
2146
|
# Write this DataFrame to a CSV file.
|
2090
2147
|
#
|
2091
|
-
# ==
|
2148
|
+
# == Arguments
|
2092
2149
|
#
|
2093
2150
|
# * filename - Path of CSV file where the DataFrame is to be saved.
|
2094
2151
|
#
|
@@ -2264,7 +2321,7 @@ module Daru
|
|
2264
2321
|
# # 2 3]
|
2265
2322
|
def split_by_category cat_name
|
2266
2323
|
cat_dv = self[cat_name]
|
2267
|
-
raise
|
2324
|
+
raise ArgumentError, "#{cat_name} is not a category vector" unless
|
2268
2325
|
cat_dv.category?
|
2269
2326
|
|
2270
2327
|
cat_dv.categories.map do |cat|
|
@@ -2274,6 +2331,50 @@ module Daru
|
|
2274
2331
|
end
|
2275
2332
|
end
|
2276
2333
|
|
2334
|
+
# @param indexes [Array] index(s) at which row tuples are retrieved
|
2335
|
+
# @return [Array] returns array of row tuples at given index(s)
|
2336
|
+
# @example Using Daru::Index
|
2337
|
+
# df = Daru::DataFrame.new({
|
2338
|
+
# a: [1, 2, 3],
|
2339
|
+
# b: ['a', 'a', 'b']
|
2340
|
+
# })
|
2341
|
+
#
|
2342
|
+
# df.access_row_tuples_by_indexs(1,2)
|
2343
|
+
# # => [[2, "a"], [3, "b"]]
|
2344
|
+
#
|
2345
|
+
# df.index = Daru::Index.new([:one,:two,:three])
|
2346
|
+
# df.access_row_tuples_by_indexs(:one,:three)
|
2347
|
+
# # => [[1, "a"], [3, "b"]]
|
2348
|
+
#
|
2349
|
+
# @example Using Daru::MultiIndex
|
2350
|
+
# mi_idx = Daru::MultiIndex.from_tuples [
|
2351
|
+
# [:a,:one,:bar],
|
2352
|
+
# [:a,:one,:baz],
|
2353
|
+
# [:b,:two,:bar],
|
2354
|
+
# [:a,:two,:baz],
|
2355
|
+
# ]
|
2356
|
+
# df_mi = Daru::DataFrame.new({
|
2357
|
+
# a: 1..4,
|
2358
|
+
# b: 'a'..'d'
|
2359
|
+
# }, index: mi_idx )
|
2360
|
+
#
|
2361
|
+
# df_mi.access_row_tuples_by_indexs(:b, :two, :bar)
|
2362
|
+
# # => [[3, "c"]]
|
2363
|
+
# df_mi.access_row_tuples_by_indexs(:a)
|
2364
|
+
# # => [[1, "a"], [2, "b"], [4, "d"]]
|
2365
|
+
def access_row_tuples_by_indexs *indexes
|
2366
|
+
return get_sub_dataframe(indexes, by_position: false).map_rows(&:to_a) if
|
2367
|
+
@index.is_a?(Daru::MultiIndex)
|
2368
|
+
positions = @index.pos(*indexes)
|
2369
|
+
if positions.is_a? Numeric
|
2370
|
+
row = get_rows_for([positions])
|
2371
|
+
row.first.is_a?(Array) ? row : [row]
|
2372
|
+
else
|
2373
|
+
new_rows = get_rows_for(indexes, by_position: false)
|
2374
|
+
indexes.map { |index| new_rows.map { |r| r[index] } }
|
2375
|
+
end
|
2376
|
+
end
|
2377
|
+
|
2277
2378
|
# Function to use for aggregating the data.
|
2278
2379
|
#
|
2279
2380
|
# @param options [Hash] options for column, you want in resultant dataframe
|
@@ -2322,25 +2423,28 @@ module Daru
|
|
2322
2423
|
# Note: `GroupBy` class `aggregate` method uses this `aggregate` method
|
2323
2424
|
# internally.
|
2324
2425
|
def aggregate(options={}, multi_index_level=-1)
|
2325
|
-
|
2426
|
+
if block_given?
|
2427
|
+
positions_tuples, new_index = yield(@index) # note: use of yield is private for now
|
2428
|
+
else
|
2429
|
+
positions_tuples, new_index = group_index_for_aggregation(@index, multi_index_level)
|
2430
|
+
end
|
2326
2431
|
|
2327
2432
|
colmn_value = aggregate_by_positions_tuples(options, positions_tuples)
|
2328
2433
|
|
2329
2434
|
Daru::DataFrame.new(colmn_value, index: new_index, order: options.keys)
|
2330
2435
|
end
|
2331
2436
|
|
2332
|
-
# Is faster than using group_by followed by aggregate (because it doesn't generate an intermediary dataframe)
|
2333
2437
|
def group_by_and_aggregate(*group_by_keys, **aggregation_map)
|
2334
|
-
|
2335
|
-
|
2336
|
-
new_index = Daru::MultiIndex.from_tuples(positions_groups.keys).coerce_index
|
2337
|
-
colmn_value = aggregate_by_positions_tuples(aggregation_map, positions_groups.values)
|
2338
|
-
|
2339
|
-
Daru::DataFrame.new(colmn_value, index: new_index, order: aggregation_map.keys)
|
2438
|
+
group_by(*group_by_keys).aggregate(aggregation_map)
|
2340
2439
|
end
|
2341
2440
|
|
2342
2441
|
private
|
2343
2442
|
|
2443
|
+
# Will lazily load the plotting library being used for this dataframe
|
2444
|
+
def init_plotting_library
|
2445
|
+
self.plotting_library = Daru.plotting_library
|
2446
|
+
end
|
2447
|
+
|
2344
2448
|
def headers
|
2345
2449
|
Daru::Index.new(Array(index.name) + @vectors.to_a)
|
2346
2450
|
end
|
@@ -2452,19 +2556,30 @@ module Daru
|
|
2452
2556
|
positions = @index.pos(*indexes)
|
2453
2557
|
|
2454
2558
|
if positions.is_a? Numeric
|
2455
|
-
|
2456
|
-
|
2457
|
-
name: indexes.first
|
2559
|
+
row = get_rows_for([positions])
|
2560
|
+
Daru::Vector.new row, index: @vectors, name: indexes.first
|
2458
2561
|
else
|
2459
|
-
new_rows =
|
2460
|
-
|
2461
|
-
index: @index.subset(*indexes),
|
2462
|
-
order: @vectors
|
2562
|
+
new_rows = get_rows_for(indexes, by_position: false)
|
2563
|
+
Daru::DataFrame.new new_rows, index: @index.subset(*indexes), order: @vectors
|
2463
2564
|
end
|
2464
2565
|
end
|
2465
2566
|
|
2466
|
-
|
2467
|
-
|
2567
|
+
# @param keys [Array] can be an array of positions (if by_position is true) or indexes (if by_position if false)
|
2568
|
+
# because of coercion by Daru::Vector#at and Daru::Vector#[], can return either an Array of
|
2569
|
+
# values (representing a row) or an array of Vectors (that can be seen as rows)
|
2570
|
+
def get_rows_for(keys, by_position: true)
|
2571
|
+
raise unless keys.is_a?(Array)
|
2572
|
+
|
2573
|
+
if by_position
|
2574
|
+
pos = keys
|
2575
|
+
@data.map { |vector| vector.at(*pos) }
|
2576
|
+
else
|
2577
|
+
# TODO: for now (2018-07-27), it is different than using
|
2578
|
+
# get_rows_for(@index.pos(*keys))
|
2579
|
+
# because Daru::Vector#at and Daru::Vector#[] don't handle Daru::MultiIndex the same way
|
2580
|
+
indexes = keys
|
2581
|
+
@data.map { |vec| vec[*indexes] }
|
2582
|
+
end
|
2468
2583
|
end
|
2469
2584
|
|
2470
2585
|
def insert_or_modify_vector name, vector
|
@@ -2565,7 +2680,10 @@ module Daru
|
|
2565
2680
|
set_size
|
2566
2681
|
end
|
2567
2682
|
|
2568
|
-
def create_empty_vectors
|
2683
|
+
def create_empty_vectors(vectors, index)
|
2684
|
+
@vectors = Index.coerce vectors
|
2685
|
+
@index = Index.coerce index
|
2686
|
+
|
2569
2687
|
@data = @vectors.map do |name|
|
2570
2688
|
Daru::Vector.new([], name: coerce_name(name), index: @index)
|
2571
2689
|
end
|
@@ -2885,7 +3003,6 @@ module Daru
|
|
2885
3003
|
|
2886
3004
|
# Raises IndexError when one of the positions is not a valid position
|
2887
3005
|
def validate_positions *positions, size
|
2888
|
-
positions = [positions] if positions.is_a? Integer
|
2889
3006
|
positions.each do |pos|
|
2890
3007
|
raise IndexError, "#{pos} is not a valid position." if pos >= size
|
2891
3008
|
end
|
@@ -2910,28 +3027,57 @@ module Daru
|
|
2910
3027
|
end
|
2911
3028
|
|
2912
3029
|
def aggregate_by_positions_tuples(options, positions_tuples)
|
2913
|
-
options
|
2914
|
-
|
2915
|
-
|
3030
|
+
agg_over_vectors_only, options = cast_aggregation_options(options)
|
3031
|
+
|
3032
|
+
if agg_over_vectors_only
|
3033
|
+
options.map do |vect_name, method|
|
3034
|
+
vect = self[vect_name]
|
2916
3035
|
|
2917
3036
|
positions_tuples.map do |positions|
|
2918
3037
|
vect.apply_method_on_sub_vector(method, keys: positions)
|
2919
3038
|
end
|
2920
|
-
else
|
2921
|
-
positions_tuples.map do |positions|
|
2922
|
-
apply_method_on_sub_df(method, keys: positions)
|
2923
|
-
end
|
2924
3039
|
end
|
3040
|
+
else
|
3041
|
+
methods = options.values
|
3042
|
+
|
3043
|
+
# note: because we aggregate over rows, we don't have to re-get sub-dfs for each method (which is expensive)
|
3044
|
+
rows = positions_tuples.map do |positions|
|
3045
|
+
apply_method_on_sub_df(methods, keys: positions)
|
3046
|
+
end
|
3047
|
+
|
3048
|
+
rows.transpose
|
3049
|
+
end
|
3050
|
+
end
|
3051
|
+
|
3052
|
+
# convert operations over sub-vectors to operations over sub-dfs when it improves perf
|
3053
|
+
# note: we don't always "cast" because aggregation over a single vector / a few vector is faster
|
3054
|
+
# than aggregation over (sub-)dfs
|
3055
|
+
def cast_aggregation_options(options)
|
3056
|
+
vects, non_vects = options.keys.partition { |k| @vectors.include?(k) }
|
3057
|
+
|
3058
|
+
over_vectors = true
|
3059
|
+
|
3060
|
+
if non_vects.any?
|
3061
|
+
options = options.clone
|
3062
|
+
|
3063
|
+
vects.each do |name|
|
3064
|
+
proc_on_vect = options[name].to_proc
|
3065
|
+
options[name] = ->(sub_df) { proc_on_vect.call(sub_df[name]) }
|
3066
|
+
end
|
3067
|
+
|
3068
|
+
over_vectors = false
|
2925
3069
|
end
|
3070
|
+
|
3071
|
+
[over_vectors, options]
|
2926
3072
|
end
|
2927
3073
|
|
2928
3074
|
def group_index_for_aggregation(index, multi_index_level=-1)
|
2929
3075
|
case index
|
2930
3076
|
when Daru::MultiIndex
|
2931
|
-
|
2932
|
-
new_index, pos_tuples = groups.keys, groups.values
|
3077
|
+
groups_by_pos = Daru::Core::GroupBy.get_positions_group_for_aggregation(index, multi_index_level)
|
2933
3078
|
|
2934
|
-
new_index = Daru::MultiIndex.from_tuples(
|
3079
|
+
new_index = Daru::MultiIndex.from_tuples(groups_by_pos.keys).coerce_index
|
3080
|
+
pos_tuples = groups_by_pos.values
|
2935
3081
|
when Daru::Index, Daru::CategoricalIndex
|
2936
3082
|
new_index = Array(index).uniq
|
2937
3083
|
pos_tuples = new_index.map { |idx| [*index.pos(idx)] }
|
@@ -2950,7 +3096,7 @@ module Daru
|
|
2950
3096
|
when Range
|
2951
3097
|
size.times.to_a[positions.first]
|
2952
3098
|
else
|
2953
|
-
raise ArgumentError, '
|
3099
|
+
raise ArgumentError, 'Unknown position type.'
|
2954
3100
|
end
|
2955
3101
|
else
|
2956
3102
|
positions
|
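`cast_aggregation_options` above decides whether an aggregation can run column by column or has to run over sub-frames: if every key names an existing vector, options pass through untouched; otherwise the per-vector methods are wrapped so that they also accept a sub-frame. A minimal sketch of that wrapping in plain Ruby, with a sub-frame modeled as an array of hashes; `cast_options` and the sample data are assumptions for illustration.

```ruby
# Returns [over_columns_only, options], mirroring the pair in the hunk above.
def cast_options(column_names, options)
  cols, non_cols = options.keys.partition { |k| column_names.include?(k) }
  return [true, options] if non_cols.empty?

  options = options.dup  # leave the caller's hash untouched
  cols.each do |name|
    on_column = options[name].to_proc
    # Wrap: extract the column from the sub-frame, then apply the original method.
    options[name] = ->(sub_rows) { on_column.call(sub_rows.map { |r| r[name] }) }
  end
  [false, options]
end

rows  = [{ price: 2, qty: 3 }, { price: 5, qty: 1 }]
total = ->(rs) { rs.map { |r| r[:price] * r[:qty] }.inject(:+) }

_, opts = cast_options([:price, :qty], { price: :max, total: total })
opts[:price].call(rows)   # => 5
opts[:total].call(rows)   # => 11
```

After wrapping, every entry has the same calling convention, which is what allows the row-wise branch to fetch each sub-frame once and apply all methods to it.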