red_amber 0.1.7 → 0.2.1
- checksums.yaml +4 -4
- data/.rubocop.yml +12 -2
- data/.rubocop_todo.yml +2 -15
- data/.yardopts +1 -0
- data/CHANGELOG.md +164 -2
- data/Gemfile +2 -1
- data/README.md +246 -17
- data/doc/DataFrame.md +392 -129
- data/doc/Vector.md +37 -19
- data/doc/examples_of_red_amber.ipynb +8979 -0
- data/lib/red_amber/data_frame.rb +138 -24
- data/lib/red_amber/data_frame_displayable.rb +35 -18
- data/lib/red_amber/data_frame_reshaping.rb +85 -0
- data/lib/red_amber/data_frame_selectable.rb +53 -9
- data/lib/red_amber/data_frame_variable_operation.rb +130 -50
- data/lib/red_amber/group.rb +29 -27
- data/lib/red_amber/vector.rb +1 -1
- data/lib/red_amber/vector_functions.rb +65 -23
- data/lib/red_amber/vector_selectable.rb +12 -9
- data/lib/red_amber/vector_updatable.rb +22 -1
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +1 -1
- data/red_amber.gemspec +1 -1
- metadata +7 -5
- data/doc/47_examples_of_red_amber.ipynb +0 -4872
data/doc/DataFrame.md
CHANGED
@@ -155,7 +155,25 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
155
155
|
|
156
156
|
### `indices`, `indexes`
|
157
157
|
|
158
|
-
- Returns
|
158
|
+
- Returns indexes in an Array.
|
159
|
+
Accepts an option `start` as the first of indexes.
|
160
|
+
|
161
|
+
```ruby
|
162
|
+
df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
|
163
|
+
df.indices
|
164
|
+
|
165
|
+
# =>
|
166
|
+
[0, 1, 2, 3, 4]
|
167
|
+
|
168
|
+
df.indices(1)
|
169
|
+
|
170
|
+
# =>
|
171
|
+
[1, 2, 3, 4, 5]
|
172
|
+
|
173
|
+
df.indices(:a)
|
174
|
+
# =>
|
175
|
+
[:a, :b, :c, :d, :e]
|
176
|
+
```
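To make the `start` option concrete, here is a minimal plain-Ruby sketch. It does not call RedAmber, and `indices_sketch` is a hypothetical helper name. It assumes the sequence is produced by stepping from `start` with `succ`, which matches the three outputs shown above but may differ from the library's internals.

```ruby
# Plain-Ruby model of the documented behavior; not RedAmber's implementation.
# `indices_sketch` is a hypothetical helper name.
def indices_sketch(size, start = 0)
  # Walk `size` steps from `start`, using #succ to get each next index.
  Array.new(size) do
    current = start
    start = start.succ
    current
  end
end

indices_sketch(5)      # => [0, 1, 2, 3, 4]
indices_sketch(5, 1)   # => [1, 2, 3, 4, 5]
indices_sketch(5, :a)  # => [:a, :b, :c, :d, :e]
```

Any start value that responds to `succ` (Integer, Symbol, String) works in this sketch.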
|
159
177
|
|
160
178
|
### `to_h`
|
161
179
|
|
@@ -167,6 +185,11 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
167
185
|
|
168
186
|
If you need a column-oriented full array, use `.to_h.to_a`
|
169
187
|
|
188
|
+
### `each_row`
|
189
|
+
|
190
|
+
Yield each row in a `{ key => row}` Hash.
|
191
|
+
Returns Enumerator if block is not given.
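The diff gives no example for `each_row`, so the following plain-Ruby sketch only models the description. `RowSketch` and its column Hash input are hypothetical stand-ins for a DataFrame, and the shape of each yielded Hash (one `key => value` pair per column) is an assumption.

```ruby
# Plain-Ruby model of the documented behavior; not RedAmber's implementation.
# `RowSketch` and its column Hash input are hypothetical stand-ins.
module RowSketch
  def self.each_row(columns)
    # Without a block, hand back an Enumerator, as the text describes.
    return enum_for(:each_row, columns) unless block_given?

    n_rows = columns.values.map(&:size).max || 0
    n_rows.times do |i|
      # Build one { key => value } Hash per row.
      yield columns.transform_values { |values| values[i] }
    end
  end
end

columns = { x: [1, 2, 3], y: %w[A B C] }
RowSketch.each_row(columns) { |row| p row }
rows = RowSketch.each_row(columns).to_a
# rows.first => { x: 1, y: "A" }
```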
|
192
|
+
|
170
193
|
### `schema`
|
171
194
|
|
172
195
|
- Returns column name and data type in a Hash.
|
@@ -202,7 +225,22 @@ puts penguins.to_s
|
|
202
225
|
`inspect` uses `to_s` output and also shows shape and object_id.
|
203
226
|
|
204
227
|
|
205
|
-
### `summary`, `describe`
|
228
|
+
### `summary`, `describe`
|
229
|
+
|
230
|
+
`DataFrame#summary` or `DataFrame#describe` shows summary statistics in a DataFrame.
|
231
|
+
|
232
|
+
```ruby
|
233
|
+
puts penguins.summary.to_s(width: 82) # needs more width to show all stats in this example
|
234
|
+
|
235
|
+
# =>
|
236
|
+
variables count mean std min 25% median 75% max
|
237
|
+
<dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double>
|
238
|
+
1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6
|
239
|
+
2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5
|
240
|
+
3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0
|
241
|
+
4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0
|
242
|
+
5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0
|
243
|
+
```
|
206
244
|
|
207
245
|
### `to_rover`
|
208
246
|
|
@@ -352,13 +390,13 @@ penguins.to_rover
|
|
352
390
|
|
353
391
|
### `pick ` - pick up variables by key label -
|
354
392
|
|
355
|
-
Pick up some
|
393
|
+
Pick up some columns (variables) to create a sub DataFrame.
|
356
394
|
|
357
395
|
![pick method image](doc/../image/dataframe/pick.png)
|
358
396
|
|
359
397
|
- Keys as arguments
|
360
398
|
|
361
|
-
`pick(keys)` accepts keys as arguments in an Array.
|
399
|
+
`pick(keys)` accepts keys as arguments in an Array or a Range.
|
362
400
|
|
363
401
|
```ruby
|
364
402
|
penguins.pick(:species, :bill_length_mm)
|
@@ -378,9 +416,31 @@ penguins.to_rover
|
|
378
416
|
344 Gentoo 49.9
|
379
417
|
```
|
380
418
|
|
381
|
-
-
|
419
|
+
- Indices as arguments
|
420
|
+
|
421
|
+
`pick(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
|
422
|
+
|
423
|
+
```ruby
|
424
|
+
penguins.pick(0..2, -1)
|
425
|
+
|
426
|
+
# =>
|
427
|
+
#<RedAmber::DataFrame : 344 x 4 Vectors, 0x0000000000055ce4>
|
428
|
+
species island bill_length_mm year
|
429
|
+
<string> <string> <double> <uint16>
|
430
|
+
1 Adelie Torgersen 39.1 2007
|
431
|
+
2 Adelie Torgersen 39.5 2007
|
432
|
+
3 Adelie Torgersen 40.3 2007
|
433
|
+
4 Adelie Torgersen (nil) 2007
|
434
|
+
5 Adelie Torgersen 36.7 2007
|
435
|
+
: : : : :
|
436
|
+
342 Gentoo Biscoe 50.4 2009
|
437
|
+
343 Gentoo Biscoe 45.2 2009
|
438
|
+
344 Gentoo Biscoe 49.9 2009
|
439
|
+
```
|
440
|
+
|
441
|
+
- Booleans as arguments
|
382
442
|
|
383
|
-
`pick(booleans)` accepts booleans as
|
443
|
+
`pick(booleans)` accepts booleans as arguments in an Array. Booleans must be the same length as `n_keys`.
|
384
444
|
|
385
445
|
```ruby
|
386
446
|
penguins.pick(penguins.types.map { |type| type == :string })
|
@@ -400,9 +460,9 @@ penguins.to_rover
|
|
400
460
|
344 Gentoo Biscoe male
|
401
461
|
```
|
402
462
|
|
403
|
-
|
463
|
+
- Keys or booleans by a block
|
404
464
|
|
405
|
-
`pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
|
465
|
+
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. Block is called in the context of self.
|
406
466
|
|
407
467
|
```ruby
|
408
468
|
penguins.pick { keys.map { |key| key.end_with?('mm') } }
|
@@ -424,21 +484,25 @@ penguins.to_rover
|
|
424
484
|
|
425
485
|
### `drop ` - pick and drop -
|
426
486
|
|
427
|
-
Drop some
|
487
|
+
Drop some columns (variables) to create a remainder DataFrame.
|
428
488
|
|
429
489
|
![drop method image](doc/../image/dataframe/drop.png)
|
430
490
|
|
431
491
|
- Keys as arguments
|
432
492
|
|
433
|
-
`drop(keys)` accepts keys as arguments in an Array.
|
493
|
+
`drop(keys)` accepts keys as arguments in an Array or a Range.
|
494
|
+
|
495
|
+
- Indices as arguments
|
496
|
+
|
497
|
+
`drop(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
|
434
498
|
|
435
|
-
- Booleans as
|
499
|
+
- Booleans as arguments
|
436
500
|
|
437
|
-
`drop(booleans)` accepts booleans as
|
501
|
+
`drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
|
438
502
|
|
439
503
|
- Keys or booleans by a block
|
440
504
|
|
441
|
-
`drop {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
|
505
|
+
`drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. Block is called in the context of self.
|
442
506
|
|
443
507
|
- Notice for nil
|
444
508
|
|
@@ -473,9 +537,20 @@ penguins.to_rover
|
|
473
537
|
[1, 2, 3]
|
474
538
|
```
|
475
539
|
|
540
|
+
A simple key name is usable as a method of the DataFrame if the key name is acceptable as a method name.
|
541
|
+
It returns a Vector same as `[]`.
|
542
|
+
|
543
|
+
```ruby
|
544
|
+
df.a
|
545
|
+
|
546
|
+
# =>
|
547
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
|
548
|
+
[1, 2, 3]
|
549
|
+
```
|
550
|
+
|
476
551
|
### `slice ` - to cut vertically is slice -
|
477
552
|
|
478
|
-
Slice and select
|
553
|
+
Slice and select rows (observations) to create a sub DataFrame.
|
479
554
|
|
480
555
|
![slice method image](doc/../image/dataframe/slice.png)
|
481
556
|
|
@@ -506,7 +581,7 @@ penguins.to_rover
|
|
506
581
|
|
507
582
|
- Booleans as an argument
|
508
583
|
|
509
|
-
`slice(booleans)` accepts booleans as
|
584
|
+
`slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
|
510
585
|
|
511
586
|
```ruby
|
512
587
|
vector = penguins[:bill_length_mm]
|
@@ -583,7 +658,7 @@ penguins.to_rover
|
|
583
658
|
|
584
659
|
### `remove`
|
585
660
|
|
586
|
-
Slice and reject
|
661
|
+
Slice and reject rows (observations) to create a remainder DataFrame.
|
587
662
|
|
588
663
|
![remove method image](doc/../image/dataframe/remove.png)
|
589
664
|
|
@@ -612,7 +687,7 @@ penguins.to_rover
|
|
612
687
|
|
613
688
|
- Booleans as an argument
|
614
689
|
|
615
|
-
`remove(booleans)` accepts booleans as
|
690
|
+
`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
|
616
691
|
|
617
692
|
```ruby
|
618
693
|
# remove all observation contains nil
|
@@ -640,10 +715,12 @@ penguins.to_rover
|
|
640
715
|
|
641
716
|
```ruby
|
642
717
|
penguins.remove do
|
643
|
-
|
644
|
-
|
645
|
-
|
646
|
-
|
718
|
+
# We will use another style shown in slice
|
719
|
+
# self.bill_length_mm returns Vector
|
720
|
+
mean = bill_length_mm.mean
|
721
|
+
min = mean - bill_length_mm.std
|
722
|
+
max = mean + bill_length_mm.std
|
723
|
+
bill_length_mm.to_a.map { |e| (min..max).include? e }
|
647
724
|
end
|
648
725
|
|
649
726
|
# =>
|
@@ -660,6 +737,7 @@ penguins.to_rover
|
|
660
737
|
139 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
661
738
|
140 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
662
739
|
```
|
740
|
+
|
663
741
|
- Notice for nil
|
664
742
|
- When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
|
665
743
|
|
@@ -704,7 +782,7 @@ penguins.to_rover
|
|
704
782
|
|
705
783
|
- Key pairs as arguments
|
706
784
|
|
707
|
-
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
|
785
|
+
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`.
|
708
786
|
|
709
787
|
```ruby
|
710
788
|
df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
|
@@ -721,7 +799,11 @@ penguins.to_rover
|
|
721
799
|
|
722
800
|
- Key pairs by a block
|
723
801
|
|
724
|
-
`rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. Block is called in the context of self.
|
802
|
+
`rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. Block is called in the context of self.
|
803
|
+
|
804
|
+
- Not existing keys
|
805
|
+
|
806
|
+
If the specified `existing_key` does not exist, a `DataFrameArgumentError` is raised.
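The accepted `key_pairs` forms and the missing-key check can be sketched in plain Ruby. This is a model under stated assumptions, not RedAmber code: `rename_sketch` is a hypothetical helper, a Hash of columns stands in for the DataFrame, and `ArgumentError` stands in for `DataFrameArgumentError`. Unlike the library, the sketch does not treat Symbol and String keys as the same key.

```ruby
# Plain-Ruby model of the documented behavior; not RedAmber's implementation.
# `rename_sketch` is hypothetical, and ArgumentError stands in for
# RedAmber's DataFrameArgumentError.
def rename_sketch(columns, key_pairs)
  # Array#to_h turns [[old, new], ...] into a Hash; a Hash passes through.
  pairs = key_pairs.to_h
  missing = pairs.keys - columns.keys
  raise ArgumentError, "keys do not exist: #{missing}" unless missing.empty?

  # Rebuild the Hash so that column order is preserved.
  columns.to_h { |key, values| [pairs.fetch(key, key), values] }
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
rename_sketch(columns, { age: :years })
# => { name: ["Yasuko", "Rui", "Hinata"], years: [68, 49, 28] }
rename_sketch(columns, [[:age, :years]]) # same result from an Array of Arrays
```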
|
725
807
|
|
726
808
|
- Key type
|
727
809
|
|
@@ -729,16 +811,16 @@ penguins.to_rover
|
|
729
811
|
|
730
812
|
### `assign`
|
731
813
|
|
732
|
-
Assign new or updated
|
814
|
+
Assign new or updated columns (variables) and create an updated DataFrame.
|
733
815
|
|
734
|
-
- Variables with new keys will append new
|
816
|
+
- Variables with new keys will append new columns from the right.
|
735
817
|
- Variables with existing keys will update corresponding vectors.
|
736
818
|
|
737
819
|
![assign method image](doc/../image/dataframe/assign.png)
|
738
820
|
|
739
821
|
- Variables as arguments
|
740
822
|
|
741
|
-
`assign(key_pairs)` accepts pairs of key and values as
|
823
|
+
`assign(key_pairs)` accepts pairs of key and values as parameters. `key_pairs` should be a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`.
|
742
824
|
|
743
825
|
```ruby
|
744
826
|
df = RedAmber::DataFrame.new(
|
@@ -748,15 +830,19 @@ penguins.to_rover
|
|
748
830
|
|
749
831
|
# =>
|
750
832
|
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
|
751
|
-
name age
|
752
|
-
<string> <uint8>
|
753
|
-
1 Yasuko 68
|
754
|
-
2 Rui 49
|
833
|
+
name age
|
834
|
+
<string> <uint8>
|
835
|
+
1 Yasuko 68
|
836
|
+
2 Rui 49
|
755
837
|
3 Hinata 28
|
756
838
|
|
757
839
|
# update :age and add :brother
|
758
|
-
|
759
|
-
|
840
|
+
df.assign do
|
841
|
+
{
|
842
|
+
age: age + 29,
|
843
|
+
brother: ['Santa', nil, 'Momotaro']
|
844
|
+
}
|
845
|
+
end
|
760
846
|
|
761
847
|
# =>
|
762
848
|
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
|
@@ -769,13 +855,14 @@ penguins.to_rover
|
|
769
855
|
|
770
856
|
- Key pairs by a block
|
771
857
|
|
772
|
-
`assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and values as a Hash of `{key =>
|
858
|
+
`assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`. The block is called in the context of self.
|
773
859
|
|
774
860
|
```ruby
|
775
861
|
df = RedAmber::DataFrame.new(
|
776
862
|
index: [0, 1, 2, 3, nil],
|
777
863
|
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
778
|
-
string: ['A', 'B', 'C', 'D', nil]
|
864
|
+
string: ['A', 'B', 'C', 'D', nil]
|
865
|
+
)
|
779
866
|
df
|
780
867
|
|
781
868
|
# =>
|
@@ -788,29 +875,27 @@ penguins.to_rover
|
|
788
875
|
4 3 NaN D
|
789
876
|
5 (nil) (nil) (nil)
|
790
877
|
|
791
|
-
# update
|
878
|
+
# update :float
|
879
|
+
# assigner by an Array
|
792
880
|
df.assign do
|
793
|
-
|
794
|
-
|
795
|
-
assigner[keys[i]] = v * -1 if v.numeric?
|
796
|
-
end
|
797
|
-
assigner
|
881
|
+
vectors.select(&:float?)
|
882
|
+
.map { |v| [v.key, -v] }
|
798
883
|
end
|
799
884
|
|
800
885
|
# =>
|
801
|
-
#<RedAmber::DataFrame : 5 x 3 Vectors,
|
802
|
-
|
803
|
-
<
|
804
|
-
1
|
805
|
-
2
|
806
|
-
3
|
807
|
-
4
|
808
|
-
5
|
809
|
-
|
810
|
-
# Or
|
886
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
|
887
|
+
index float string
|
888
|
+
<uint8> <double> <string>
|
889
|
+
1 0 -0.0 A
|
890
|
+
2 1 -1.1 B
|
891
|
+
3 2 -2.2 C
|
892
|
+
4 3 NaN D
|
893
|
+
5 (nil) (nil) (nil)
|
894
|
+
|
895
|
+
# Or we can use assigner by a Hash
|
811
896
|
df.assign do
|
812
|
-
|
813
|
-
assigner[key] =
|
897
|
+
vectors.select.with_object({}) do |v, assigner|
|
898
|
+
assigner[v.key] = -v if v.float?
|
814
899
|
end
|
815
900
|
end
|
816
901
|
|
@@ -821,6 +906,96 @@ penguins.to_rover
|
|
821
906
|
|
822
907
|
Symbol key and String key are considered as the same key.
|
823
908
|
|
909
|
+
- Empty assignment
|
910
|
+
|
911
|
+
If assigner is empty or nil, returns self.
|
912
|
+
|
913
|
+
- Append from left
|
914
|
+
|
915
|
+
The `assign_left` method accepts the same parameters and block as `assign`, but appends new columns on the left side.
|
916
|
+
|
917
|
+
```ruby
|
918
|
+
df.assign_left(new_index: df.indices(1))
|
919
|
+
|
920
|
+
# =>
|
921
|
+
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
|
922
|
+
new_index index float string
|
923
|
+
<uint8> <uint8> <double> <string>
|
924
|
+
1 1 0 0.0 A
|
925
|
+
2 2 1 1.1 B
|
926
|
+
3 3 2 2.2 C
|
927
|
+
4 4 3 NaN D
|
928
|
+
5 5 (nil) (nil) (nil)
|
929
|
+
```
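The difference in column order between `assign` and `assign_left` can be modeled with ordered Hashes. The sketch below is plain Ruby, not RedAmber code. `assign_sketch` and `assign_left_sketch` are hypothetical helpers, and it is an assumption that existing keys keep their position in the left-append case.

```ruby
# Plain-Ruby model of column ordering; not RedAmber's implementation.
# `assign_sketch` and `assign_left_sketch` are hypothetical helpers.
def assign_sketch(columns, assigner)
  # An empty or nil assigner returns the input unchanged.
  return columns if assigner.nil? || assigner.empty?

  # Hash#merge updates existing keys in place and appends new keys at the end.
  columns.merge(assigner.to_h)
end

def assign_left_sketch(columns, assigner)
  return columns if assigner.nil? || assigner.empty?

  updates = assigner.to_h
  new_columns = updates.reject { |key, _| columns.key?(key) }
  # New keys go first; existing keys keep their position but take new values.
  new_columns.merge(columns).merge(updates)
end

df = { index: [0, 1, 2], float: [0.0, 1.1, 2.2] }
assign_sketch(df, { new_index: [1, 2, 3] }).keys       # => [:index, :float, :new_index]
assign_left_sketch(df, { new_index: [1, 2, 3] }).keys  # => [:new_index, :index, :float]
```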
|
930
|
+
|
931
|
+
### `slice_by(key, keep_key: false) { block }`
|
932
|
+
|
933
|
+
`slice_by` accepts a key and a block to select rows.
|
934
|
+
|
935
|
+
(Since 0.2.1)
|
936
|
+
|
937
|
+
```ruby
|
938
|
+
df = RedAmber::DataFrame.new(
|
939
|
+
index: [0, 1, 2, 3, nil],
|
940
|
+
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
941
|
+
string: ['A', 'B', 'C', 'D', nil]
|
942
|
+
)
|
943
|
+
df
|
944
|
+
|
945
|
+
# =>
|
946
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
|
947
|
+
index float string
|
948
|
+
<uint8> <double> <string>
|
949
|
+
1 0 0.0 A
|
950
|
+
2 1 1.1 B
|
951
|
+
3 2 2.2 C
|
952
|
+
4 3 NaN D
|
953
|
+
5 (nil) (nil) (nil)
|
954
|
+
|
955
|
+
df.slice_by(:string) { ["A", "C"] }
|
956
|
+
|
957
|
+
# =>
|
958
|
+
#<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001b1ac>
|
959
|
+
index float
|
960
|
+
<uint8> <double>
|
961
|
+
1 0 0.0
|
962
|
+
2 2 2.2
|
963
|
+
```
|
964
|
+
|
965
|
+
This behaves the same as:
|
966
|
+
|
967
|
+
```ruby
|
968
|
+
df.slice { [string.index("A"), string.index("C")] }.drop(:string)
|
969
|
+
```
|
970
|
+
|
971
|
+
`slice_by` also accepts a Range.
|
972
|
+
|
973
|
+
```ruby
|
974
|
+
df.slice_by(:string) { "A".."C" }
|
975
|
+
|
976
|
+
# =>
|
977
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000069668>
|
978
|
+
index float
|
979
|
+
<uint8> <double>
|
980
|
+
1 0 0.0
|
981
|
+
2 1 1.1
|
982
|
+
3 2 2.2
|
983
|
+
```
|
984
|
+
|
985
|
+
When the option `keep_key: true` is used, the column `key` will be preserved.
|
986
|
+
|
987
|
+
```ruby
|
988
|
+
df.slice_by(:string, keep_key: true) { "A".."C" }
|
989
|
+
|
990
|
+
# =>
|
991
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000073c44>
|
992
|
+
index float string
|
993
|
+
<uint8> <double> <string>
|
994
|
+
1 0 0.0 A
|
995
|
+
2 1 1.1 B
|
996
|
+
3 2 2.2 C
|
997
|
+
```
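The `slice_by` examples above can be summarized as a plain-Ruby sketch. It is a model, not RedAmber code: `slice_by_sketch` is a hypothetical helper working on a Hash of columns. It assumes a Range selects rows from the first match of its begin to the first match of its end, which agrees with the outputs shown but is not confirmed by this diff.

```ruby
# Plain-Ruby model of the documented behavior; not RedAmber's implementation.
# `slice_by_sketch` is hypothetical. A Range is assumed to mean "rows from the
# first match of its begin to the first match of its end".
def slice_by_sketch(columns, key, keep_key: false)
  labels = columns.fetch(key)
  slicer = yield

  positions =
    if slicer.is_a?(Range)
      (labels.index(slicer.begin)..labels.index(slicer.end)).to_a
    else
      slicer.map { |label| labels.index(label) }
    end

  # Take the selected rows from every column.
  sliced = columns.transform_values { |values| values.values_at(*positions) }
  keep_key ? sliced : sliced.reject { |k, _| k == key }
end

df = {
  index: [0, 1, 2, 3, nil],
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
  string: ['A', 'B', 'C', 'D', nil]
}
slice_by_sketch(df, :string) { %w[A C] }
# => { index: [0, 2], float: [0.0, 2.2] }
slice_by_sketch(df, :string, keep_key: true) { 'A'..'C' }
# => { index: [0, 1, 2], float: [0.0, 1.1, 2.2], string: ["A", "B", "C"] }
```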
|
998
|
+
|
824
999
|
## Updating
|
825
1000
|
|
826
1001
|
### `sort`
|
@@ -830,11 +1005,11 @@ penguins.to_rover
|
|
830
1005
|
- "-key" denotes descending order
|
831
1006
|
|
832
1007
|
```ruby
|
833
|
-
df = RedAmber::DataFrame.new(
|
1008
|
+
df = RedAmber::DataFrame.new(
|
834
1009
|
index: [1, 1, 0, nil, 0],
|
835
1010
|
string: ['C', 'B', nil, 'A', 'B'],
|
836
1011
|
bool: [nil, true, false, true, false],
|
837
|
-
|
1012
|
+
)
|
838
1013
|
df.sort(:index, '-bool')
|
839
1014
|
|
840
1015
|
# =>
|
@@ -860,16 +1035,10 @@ penguins.to_rover
|
|
860
1035
|
|
861
1036
|
## Grouping
|
862
1037
|
|
863
|
-
### `group(
|
864
|
-
|
865
|
-
(
|
866
|
-
This API will change in the future version. Especcially I want to change:
|
867
|
-
- Order of the column of the result (aggregation_keys should be the first)
|
868
|
-
- DataFrame#group will accept a block (heronshoes/red_amber #28)
|
869
|
-
)
|
1038
|
+
### `group(group_keys)`
|
870
1039
|
|
871
1040
|
`group` creates a class `Group` object. `Group` accepts functions below as a method.
|
872
|
-
Method accepts options as `
|
1041
|
+
Method accepts options as `group_keys`.
|
873
1042
|
|
874
1043
|
Available functions are:
|
875
1044
|
|
@@ -889,8 +1058,8 @@ penguins.to_rover
|
|
889
1058
|
- [ ] tdigest
|
890
1059
|
- ✓ variance
|
891
1060
|
|
892
|
-
For the each group of `
|
893
|
-
|
1061
|
+
For each group of `group_keys`, the aggregation `function` is applied and returns a new dataframe with aggregated keys according to `summary_keys`.
|
1062
|
+
Summary key names are provided by `function(summary_keys)` style.
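As a rough picture of what grouping and aggregation do, the following plain-Ruby sketch groups rows by one key and computes a count and a mean per group. It is not RedAmber's `Group` class. `group_sketch` and the sample rows are hypothetical, and counting all members of a group (rather than only non-nil values) is a simplification.

```ruby
# Plain-Ruby model of grouping and aggregation; not RedAmber's Group class.
# `group_sketch` and the sample rows are hypothetical.
def group_sketch(rows, group_key, summary_key)
  rows.group_by { |row| row[group_key] }.map do |group_value, members|
    values = members.map { |row| row[summary_key] }.compact # skip nils
    mean = values.empty? ? nil : values.sum.to_f / values.size
    { group_key => group_value,
      count: members.size,
      :"mean(#{summary_key})" => mean }
  end
end

rows = [
  { species: 'Human', mass: 77.0 },
  { species: 'Droid', mass: 75.0 },
  { species: 'Droid', mass: 32.0 },
  { species: 'Human', mass: nil }
]
group_sketch(rows, :species, :mass)
# => [{ species: "Human", count: 2, "mean(mass)": 77.0 },
#     { species: "Droid", count: 2, "mean(mass)": 53.5 }]
```

The `:"mean(mass)"` key mirrors the `function(summary_keys)` naming style described above.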
|
894
1063
|
|
895
1064
|
This is an example of grouping of famous STARWARS dataset.
|
896
1065
|
|
@@ -900,18 +1069,18 @@ penguins.to_rover
|
|
900
1069
|
starwars
|
901
1070
|
|
902
1071
|
# =>
|
903
|
-
#<RedAmber::DataFrame : 87 x 12 Vectors,
|
904
|
-
|
905
|
-
|
906
|
-
|
907
|
-
|
908
|
-
|
909
|
-
|
910
|
-
|
911
|
-
|
912
|
-
|
913
|
-
|
914
|
-
|
1072
|
+
#<RedAmber::DataFrame : 87 x 12 Vectors, 0x0000000000005a50>
|
1073
|
+
unnamed1 name height mass hair_color skin_color eye_color ... species
|
1074
|
+
<int64> <string> <int64> <double> <string> <string> <string> ... <string>
|
1075
|
+
1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
|
1076
|
+
2 2 C-3PO 167 75.0 NA gold yellow ... Droid
|
1077
|
+
3 3 R2-D2 96 32.0 NA white, blue red ... Droid
|
1078
|
+
4 4 Darth Vader 202 136.0 none white yellow ... Human
|
1079
|
+
5 5 Leia Organa 150 49.0 brown light brown ... Human
|
1080
|
+
: : : : : : : : ... :
|
1081
|
+
85 85 BB8 (nil) (nil) none none black ... Droid
|
1082
|
+
86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
|
1083
|
+
87 87 Padmé Amidala 165 45.0 brown light brown ... Human
|
915
1084
|
|
916
1085
|
starwars.tdr(12)
|
917
1086
|
|
@@ -919,7 +1088,7 @@ penguins.to_rover
|
|
919
1088
|
RedAmber::DataFrame : 87 x 12 Vectors
|
920
1089
|
Vectors : 4 numeric, 8 strings
|
921
1090
|
# key type level data_preview
|
922
|
-
1 :
|
1091
|
+
1 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ]
|
923
1092
|
2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
|
924
1093
|
3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
|
925
1094
|
4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
|
@@ -933,82 +1102,176 @@ penguins.to_rover
|
|
933
1102
|
12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
|
934
1103
|
```
|
935
1104
|
|
936
|
-
We can
|
1105
|
+
We can group by `:species` and calculate the count.
|
1106
|
+
|
1107
|
+
```ruby
|
1108
|
+
starwars.group(:species).count(:species)
|
1109
|
+
|
1110
|
+
# =>
|
1111
|
+
#<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0>
|
1112
|
+
species count
|
1113
|
+
<string> <int64>
|
1114
|
+
1 Human 35
|
1115
|
+
2 Droid 6
|
1116
|
+
3 Wookiee 2
|
1117
|
+
4 Rodian 1
|
1118
|
+
5 Hutt 1
|
1119
|
+
: : :
|
1120
|
+
36 Kaleesh 1
|
1121
|
+
37 Pau'an 1
|
1122
|
+
38 Kel Dor 1
|
1123
|
+
```
|
1124
|
+
|
1125
|
+
We can also calculate the mean of `:mass` and `:height` together.
|
937
1126
|
|
938
1127
|
```ruby
|
939
|
-
grouped = starwars.group(:species)
|
940
|
-
grouped
|
1128
|
+
grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
|
941
1129
|
|
942
1130
|
# =>
|
943
|
-
#<RedAmber::DataFrame : 38 x
|
944
|
-
mean(
|
945
|
-
|
946
|
-
1 82.8
|
947
|
-
2 69.8
|
948
|
-
3 124.0
|
949
|
-
4 74.0
|
950
|
-
5 1358.0
|
951
|
-
:
|
952
|
-
36 159.0
|
953
|
-
37 80.0
|
954
|
-
38 80.0
|
1131
|
+
#<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc>
|
1132
|
+
species count mean(height) mean(mass)
|
1133
|
+
<string> <int64> <double> <double>
|
1134
|
+
1 Human 35 176.6 82.8
|
1135
|
+
2 Droid 6 131.2 69.8
|
1136
|
+
3 Wookiee 2 231.0 124.0
|
1137
|
+
4 Rodian 1 173.0 74.0
|
1138
|
+
5 Hutt 1 175.0 1358.0
|
1139
|
+
: : : : :
|
1140
|
+
36 Kaleesh 1 216.0 159.0
|
1141
|
+
37 Pau'an 1 206.0 80.0
|
1142
|
+
38 Kel Dor 1 188.0 80.0
|
955
1143
|
```
|
956
1144
|
|
957
1145
|
Select rows for count > 1.
|
958
1146
|
|
959
1147
|
```ruby
|
960
|
-
|
961
|
-
grouped = grouped.slice(count > 1)
|
1148
|
+
grouped.slice(grouped[:count] > 1)
|
962
1149
|
|
963
1150
|
# =>
|
964
|
-
#<RedAmber::DataFrame : 9 x
|
965
|
-
mean(
|
966
|
-
|
967
|
-
1 82.8
|
968
|
-
2 69.8
|
969
|
-
3 124.0
|
970
|
-
4 74.0
|
971
|
-
5 48.0
|
972
|
-
:
|
973
|
-
7 55.0
|
974
|
-
8
|
975
|
-
9
|
1151
|
+
#<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000004c270>
|
1152
|
+
species count mean(height) mean(mass)
|
1153
|
+
<string> <int64> <double> <double>
|
1154
|
+
1 Human 35 176.6 82.8
|
1155
|
+
2 Droid 6 131.2 69.8
|
1156
|
+
3 Wookiee 2 231.0 124.0
|
1157
|
+
4 Gungan 3 208.7 74.0
|
1158
|
+
5 NA 4 181.3 48.0
|
1159
|
+
: : : : :
|
1160
|
+
7 Twi'lek 2 179.0 55.0
|
1161
|
+
8 Mirialan 2 168.0 53.1
|
1162
|
+
9 Kaminoan 2 221.0 88.0
|
976
1163
|
```
|
977
1164
|
|
978
|
-
|
1165
|
+
## Reshape
|
1166
|
+
|
1167
|
+
### `transpose`
|
1168
|
+
|
1169
|
+
Creates a transposed DataFrame from a wide (messy) DataFrame.
|
979
1170
|
|
980
1171
|
```ruby
|
981
|
-
|
1172
|
+
import_cars = RedAmber::DataFrame.load('test/entity/import_cars.tsv')
|
1173
|
+
|
1174
|
+
# =>
|
1175
|
+
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520>
|
1176
|
+
Year Audi BMW BMW_MINI Mercedes-Benz VW
|
1177
|
+
<int64> <int64> <int64> <int64> <int64> <int64>
|
1178
|
+
1 2017 28336 52527 25427 68221 49040
|
1179
|
+
2 2018 26473 50982 25984 67554 51961
|
1180
|
+
3 2019 24222 46814 23813 66553 46794
|
1181
|
+
4 2020 22304 35712 20196 57041 36576
|
1182
|
+
5 2021 22535 35905 18211 51722 35215
|
1183
|
+
import_cars.transpose(:Manufacturer)
|
1184
|
+
|
1185
|
+
# =>
|
1186
|
+
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74>
|
1187
|
+
Manufacturer 2017 2018 2019 2020 2021
|
1188
|
+
<dictionary> <uint32> <uint32> <uint32> <uint16> <uint16>
|
1189
|
+
1 Audi 28336 26473 24222 22304 22535
|
1190
|
+
2 BMW 52527 50982 46814 35712 35905
|
1191
|
+
3 BMW_MINI 25427 25984 23813 20196 18211
|
1192
|
+
4 Mercedes-Benz 68221 67554 66553 57041 51722
|
1193
|
+
5 VW 49040 51961 46794 36576 35215
|
1194
|
+
```
|
982
1195
|
|
1196
|
+
The leftmost column is created by original keys. Key name of the column is
|
1197
|
+
named by parameter `:name`. If `:name` is not specified, `:N` is used for the key.
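The reshaping done by `transpose` can be modeled on a Hash of columns. The plain-Ruby sketch below is not RedAmber code. `transpose_sketch` is a hypothetical helper, and it assumes the first column holds the labels that become the new keys, as `Year` does in the example above.

```ruby
# Plain-Ruby model of the documented behavior; not RedAmber's implementation.
# `transpose_sketch` is hypothetical; it assumes the first column holds the
# labels that become the new keys.
def transpose_sketch(columns, name: :N)
  label_key, *value_keys = columns.keys
  new_keys = columns[label_key].map { |label| label.to_s.to_sym }
  # Array#transpose turns per-column arrays into per-row arrays.
  rows = value_keys.map { |key| columns[key] }.transpose

  # The leftmost column is built from the original keys.
  { name => value_keys }.merge(new_keys.zip(rows).to_h)
end

cars = { Year: [2017, 2018], Audi: [28336, 26473], BMW: [52527, 50982] }
transpose_sketch(cars, name: :Manufacturer)
# => { Manufacturer: [:Audi, :BMW], "2017": [28336, 52527], "2018": [26473, 50982] }
```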
|
1198
|
+
|
1199
|
+
### `to_long(*keep_keys)`
|
1200
|
+
|
1201
|
+
Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
|
1202
|
+
|
1203
|
+
- Parameter `keep_keys` specifies the key names to keep.
|
1204
|
+
|
1205
|
+
```ruby
|
1206
|
+
import_cars.to_long(:Year)
|
1207
|
+
|
983
1208
|
# =>
|
984
|
-
#<RedAmber::DataFrame :
|
985
|
-
|
986
|
-
|
987
|
-
|
988
|
-
|
989
|
-
|
990
|
-
|
991
|
-
|
992
|
-
|
993
|
-
|
994
|
-
|
995
|
-
|
1209
|
+
#<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
|
1210
|
+
Year N V
|
1211
|
+
<uint16> <dictionary> <uint32>
|
1212
|
+
1 2017 Audi 28336
|
1213
|
+
2 2017 BMW 52527
|
1214
|
+
3 2017 BMW_MINI 25427
|
1215
|
+
4 2017 Mercedes-Benz 68221
|
1216
|
+
5 2017 VW 49040
|
1217
|
+
: : : :
|
1218
|
+
23 2021 BMW_MINI 18211
|
1219
|
+
24 2021 Mercedes-Benz 51722
|
1220
|
+
25 2021 VW 35215
|
996
1221
|
```
|
997
1222
|
|
998
|
-
|
1223
|
+
- Option `:name` is the key of the column which came **from key names**.
|
1224
|
+
- Option `:value` is the key of the column which came **from values**.
|
999
1225
|
|
1000
|
-
|
1226
|
+
```ruby
|
1227
|
+
import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
|
1001
1228
|
|
1002
|
-
|
1229
|
+
# =>
|
1230
|
+
#<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
|
1231
|
+
Year Manufacturer Num_of_imported
|
1232
|
+
<uint16> <dictionary> <uint32>
|
1233
|
+
1 2017 Audi 28336
|
1234
|
+
2 2017 BMW 52527
|
1235
|
+
3 2017 BMW_MINI 25427
|
1236
|
+
4 2017 Mercedes-Benz 68221
|
1237
|
+
5 2017 VW 49040
|
1238
|
+
: : : :
|
1239
|
+
23 2021 BMW_MINI 18211
|
1240
|
+
24 2021 Mercedes-Benz 51722
|
1241
|
+
25 2021 VW 35215
|
1242
|
+
```
|
1003
1243
|
|
1004
|
-
|
1244
|
+
### `to_wide`
|
1005
1245
|
|
1006
|
-
|
1246
|
+
Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
|
1007
1247
|
|
1008
|
-
|
1248
|
+
- Option `:name` is the key of the column which will be expanded **to key names**.
|
1249
|
+
- Option `:value` is the key of the column which will be expanded **to values**.
|
1009
1250
|
|
1010
|
-
|
1251
|
+
```ruby
|
1252
|
+
import_cars.to_long(:Year).to_wide
|
1253
|
+
# import_cars.to_long(:Year).to_wide(name: :N, value: :V)
|
1254
|
+
# is also OK
|
1011
1255
|
|
1012
|
-
|
1256
|
+
# =>
|
1257
|
+
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
|
1258
|
+
Year Audi BMW BMW_MINI Mercedes-Benz VW
|
1259
|
+
<uint16> <uint16> <uint16> <uint16> <uint32> <uint16>
|
1260
|
+
1 2017 28336 52527 25427 68221 49040
|
1261
|
+
2 2018 26473 50982 25984 67554 51961
|
1262
|
+
3 2019 24222 46814 23813 66553 46794
|
1263
|
+
4 2020 22304 35712 20196 57041 36576
|
1264
|
+
5 2021 22535 35905 18211 51722 35215
|
1265
|
+
|
1266
|
+
# == import_cars
|
1267
|
+
```
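The round trip between wide and long form can be sketched in plain Ruby. This is a model, not RedAmber code. `to_long_sketch` and `to_wide_sketch` are hypothetical helpers working on a Hash of columns, and the wide-form sketch takes the kept key explicitly, whereas the documented `to_wide` does not.

```ruby
# Plain-Ruby model of the reshaping; not RedAmber's implementation.
# `to_long_sketch` and `to_wide_sketch` are hypothetical helpers.
def to_long_sketch(columns, keep_key, name: :N, value: :V)
  long = { keep_key => [], name => [], value => [] }
  columns[keep_key].each_with_index do |kept, row|
    columns.each do |key, values|
      next if key == keep_key

      # One long-form row per (kept value, original key) pair.
      long[keep_key] << kept
      long[name] << key
      long[value] << values[row]
    end
  end
  long
end

def to_wide_sketch(long, keep_key, name: :N, value: :V)
  wide = { keep_key => long[keep_key].uniq }
  # Each name becomes a column; its values are collected in row order.
  long[name].zip(long[value]).each do |key, v|
    (wide[key] ||= []) << v
  end
  wide
end

cars = { Year: [2017, 2018], Audi: [28336, 26473], BMW: [52527, 50982] }
long = to_long_sketch(cars, :Year)
# => { Year: [2017, 2017, 2018, 2018],
#      N: [:Audi, :BMW, :Audi, :BMW],
#      V: [28336, 52527, 26473, 50982] }
to_wide_sketch(long, :Year) == cars # => true
```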
|
1268
|
+
|
1269
|
+
## Combine
|
1270
|
+
|
1271
|
+
- [ ] Combining dataframes
|
1013
1272
|
|
1014
|
-
- [ ]
|
1273
|
+
- [ ] Join
|
1274
|
+
|
1275
|
+
## Encoding
|
1276
|
+
|
1277
|
+
- [ ] One-hot encoding
|