red_amber 0.1.7 → 0.2.1
- checksums.yaml +4 -4
- data/.rubocop.yml +12 -2
- data/.rubocop_todo.yml +2 -15
- data/.yardopts +1 -0
- data/CHANGELOG.md +164 -2
- data/Gemfile +2 -1
- data/README.md +246 -17
- data/doc/DataFrame.md +392 -129
- data/doc/Vector.md +37 -19
- data/doc/examples_of_red_amber.ipynb +8979 -0
- data/lib/red_amber/data_frame.rb +138 -24
- data/lib/red_amber/data_frame_displayable.rb +35 -18
- data/lib/red_amber/data_frame_reshaping.rb +85 -0
- data/lib/red_amber/data_frame_selectable.rb +53 -9
- data/lib/red_amber/data_frame_variable_operation.rb +130 -50
- data/lib/red_amber/group.rb +29 -27
- data/lib/red_amber/vector.rb +1 -1
- data/lib/red_amber/vector_functions.rb +65 -23
- data/lib/red_amber/vector_selectable.rb +12 -9
- data/lib/red_amber/vector_updatable.rb +22 -1
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +1 -1
- data/red_amber.gemspec +1 -1
- metadata +7 -5
- data/doc/47_examples_of_red_amber.ipynb +0 -4872
data/doc/DataFrame.md
CHANGED
@@ -155,7 +155,25 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
155
155
|
|
156
156
|
### `indices`, `indexes`
|
157
157
|
|
158
|
-
- Returns
|
158
|
+
- Returns indexes in an Array.
|
159
|
+
Accepts an optional `start` argument as the first index.
|
160
|
+
|
161
|
+
```ruby
|
162
|
+
df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
|
163
|
+
df.indices
|
164
|
+
|
165
|
+
# =>
|
166
|
+
[0, 1, 2, 3, 4]
|
167
|
+
|
168
|
+
df.indices(1)
|
169
|
+
|
170
|
+
# =>
|
171
|
+
[1, 2, 3, 4, 5]
|
172
|
+
|
173
|
+
df.indices(:a)
|
174
|
+
# =>
|
175
|
+
[:a, :b, :c, :d, :e]
|
176
|
+
```
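The `start` behavior shown above can be modeled in plain Ruby. This is only a sketch of the documented semantics, not RedAmber's implementation: `indices_from` is a hypothetical helper, and it assumes `start` responds to `succ`.

```ruby
# Hypothetical helper: build `size` consecutive indexes beginning at `start`.
# Assumes `start` responds to #succ (Integer and Symbol both do).
def indices_from(size, start = 0)
  result = []
  current = start
  size.times do
    result << current
    current = current.succ
  end
  result
end

indices_from(5)      # => [0, 1, 2, 3, 4]
indices_from(5, 1)   # => [1, 2, 3, 4, 5]
indices_from(5, :a)  # => [:a, :b, :c, :d, :e]
```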
|
159
177
|
|
160
178
|
### `to_h`
|
161
179
|
|
@@ -167,6 +185,11 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
167
185
|
|
168
186
|
If you need a column-oriented full array, use `.to_h.to_a`
|
169
187
|
|
188
|
+
### `each_row`
|
189
|
+
|
190
|
+
Yields each row as a `{ key => row }` Hash.
|
191
|
+
Returns an Enumerator if no block is given.
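The block-or-Enumerator contract described for `each_row` can be sketched in plain Ruby. This is not RedAmber's code: `RowSketch` and its Hash-of-Arrays input are hypothetical stand-ins for a DataFrame, and the sketch assumes each yielded Hash maps a column key to that row's value.

```ruby
# Hypothetical stand-in for a DataFrame: a Hash of column key => Array.
module RowSketch
  module_function

  # Yields one Hash per row; returns an Enumerator when no block is given.
  def each_row(columns)
    return enum_for(:each_row, columns) unless block_given?

    size = columns.values.first.size
    size.times do |i|
      yield columns.transform_values { |values| values[i] }
    end
  end
end

columns = { x: [1, 2, 3], y: %w[A B C] }
RowSketch.each_row(columns).to_a
# => [{ x: 1, y: "A" }, { x: 2, y: "B" }, { x: 3, y: "C" }]
```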
|
192
|
+
|
170
193
|
### `schema`
|
171
194
|
|
172
195
|
- Returns column name and data type in a Hash.
|
@@ -202,7 +225,22 @@ puts penguins.to_s
|
|
202
225
|
`inspect` uses `to_s` output and also shows shape and object_id.
|
203
226
|
|
204
227
|
|
205
|
-
### `summary`, `describe`
|
228
|
+
### `summary`, `describe`
|
229
|
+
|
230
|
+
`DataFrame#summary` or `DataFrame#describe` shows summary statistics in a DataFrame.
|
231
|
+
|
232
|
+
```ruby
|
233
|
+
puts penguins.summary.to_s(width: 82) # needs more width to show all stats in this example
|
234
|
+
|
235
|
+
# =>
|
236
|
+
variables count mean std min 25% median 75% max
|
237
|
+
<dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double>
|
238
|
+
1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6
|
239
|
+
2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5
|
240
|
+
3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0
|
241
|
+
4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0
|
242
|
+
5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0
|
243
|
+
```
|
206
244
|
|
207
245
|
### `to_rover`
|
208
246
|
|
@@ -352,13 +390,13 @@ penguins.to_rover
|
|
352
390
|
|
353
391
|
### `pick ` - pick up variables by key label -
|
354
392
|
|
355
|
-
Pick up some
|
393
|
+
Pick up some columns (variables) to create a sub DataFrame.
|
356
394
|
|
357
395
|

|
358
396
|
|
359
397
|
- Keys as arguments
|
360
398
|
|
361
|
-
`pick(keys)` accepts keys as arguments in an Array.
|
399
|
+
`pick(keys)` accepts keys as arguments in an Array or a Range.
|
362
400
|
|
363
401
|
```ruby
|
364
402
|
penguins.pick(:species, :bill_length_mm)
|
@@ -378,9 +416,31 @@ penguins.to_rover
|
|
378
416
|
344 Gentoo 49.9
|
379
417
|
```
|
380
418
|
|
381
|
-
-
|
419
|
+
- Indices as arguments
|
420
|
+
|
421
|
+
`pick(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
|
422
|
+
|
423
|
+
```ruby
|
424
|
+
penguins.pick(0..2, -1)
|
425
|
+
|
426
|
+
# =>
|
427
|
+
#<RedAmber::DataFrame : 344 x 4 Vectors, 0x0000000000055ce4>
|
428
|
+
species island bill_length_mm year
|
429
|
+
<string> <string> <double> <uint16>
|
430
|
+
1 Adelie Torgersen 39.1 2007
|
431
|
+
2 Adelie Torgersen 39.5 2007
|
432
|
+
3 Adelie Torgersen 40.3 2007
|
433
|
+
4 Adelie Torgersen (nil) 2007
|
434
|
+
5 Adelie Torgersen 36.7 2007
|
435
|
+
: : : : :
|
436
|
+
342 Gentoo Biscoe 50.4 2009
|
437
|
+
343 Gentoo Biscoe 45.2 2009
|
438
|
+
344 Gentoo Biscoe 49.9 2009
|
439
|
+
```
|
440
|
+
|
441
|
+
- Booleans as arguments
|
382
442
|
|
383
|
-
`pick(booleans)` accepts booleans as
|
443
|
+
`pick(booleans)` accepts booleans as arguments in an Array. Booleans must be the same length as `n_keys`.
|
384
444
|
|
385
445
|
```ruby
|
386
446
|
penguins.pick(penguins.types.map { |type| type == :string })
|
@@ -400,9 +460,9 @@ penguins.to_rover
|
|
400
460
|
344 Gentoo Biscoe male
|
401
461
|
```
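The three argument styles for `pick` (keys, indices, booleans) can be compared side by side in a plain-Ruby sketch. This models only how column keys might be resolved and is not RedAmber's implementation; `pick_keys` is a hypothetical helper operating on an Array of keys.

```ruby
# Hypothetical helper: resolve selectors into column keys.
# Selectors may be a boolean mask, Integers/Ranges of Integers, or keys.
def pick_keys(keys, selectors)
  if selectors.all? { |s| [true, false].include?(s) }
    raise ArgumentError, 'boolean mask must have one entry per key' unless selectors.size == keys.size

    keys.select.with_index { |_, i| selectors[i] }
  elsif selectors.all? { |s| s.is_a?(Integer) || s.is_a?(Range) }
    selectors.flat_map { |s| s.is_a?(Range) ? keys[s] : [keys.fetch(s)] }
  else
    selectors.map(&:to_sym)
  end
end

keys = %i[species island bill_length_mm bill_depth_mm
          flipper_length_mm body_mass_g sex year]

pick_keys(keys, %i[species bill_length_mm])
# => [:species, :bill_length_mm]
pick_keys(keys, [0..2, -1])
# => [:species, :island, :bill_length_mm, :year]
pick_keys(keys, [true, true, false, false, false, false, true, false])
# => [:species, :island, :sex]
```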
|
402
462
|
|
403
|
-
|
463
|
+
- Keys or booleans by a block
|
404
464
|
|
405
|
-
`pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
|
465
|
+
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
406
466
|
|
407
467
|
```ruby
|
408
468
|
penguins.pick { keys.map { |key| key.end_with?('mm') } }
|
@@ -424,21 +484,25 @@ penguins.to_rover
|
|
424
484
|
|
425
485
|
### `drop ` - pick and drop -
|
426
486
|
|
427
|
-
Drop some
|
487
|
+
Drop some columns (variables) to create a remainder DataFrame.
|
428
488
|
|
429
489
|

|
430
490
|
|
431
491
|
- Keys as arguments
|
432
492
|
|
433
|
-
`drop(keys)` accepts keys as arguments in an Array.
|
493
|
+
`drop(keys)` accepts keys as arguments in an Array or a Range.
|
494
|
+
|
495
|
+
- Indices as arguments
|
496
|
+
|
497
|
+
`drop(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
|
434
498
|
|
435
|
-
- Booleans as
|
499
|
+
- Booleans as arguments
|
436
500
|
|
437
|
-
`drop(booleans)` accepts booleans as
|
501
|
+
`drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
|
438
502
|
|
439
503
|
- Keys or booleans by a block
|
440
504
|
|
441
|
-
`drop {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
|
505
|
+
`drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
442
506
|
|
443
507
|
- Notice for nil
|
444
508
|
|
@@ -473,9 +537,20 @@ penguins.to_rover
|
|
473
537
|
[1, 2, 3]
|
474
538
|
```
|
475
539
|
|
540
|
+
A simple key name is usable as a method of the DataFrame if the key name is acceptable as a method name.
|
541
|
+
It returns the same Vector as `[]`.
|
542
|
+
|
543
|
+
```ruby
|
544
|
+
df.a
|
545
|
+
|
546
|
+
# =>
|
547
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
|
548
|
+
[1, 2, 3]
|
549
|
+
```
|
550
|
+
|
476
551
|
### `slice ` - to cut vertically is slice -
|
477
552
|
|
478
|
-
Slice and select
|
553
|
+
Slice and select rows (observations) to create a sub DataFrame.
|
479
554
|
|
480
555
|

|
481
556
|
|
@@ -506,7 +581,7 @@ penguins.to_rover
|
|
506
581
|
|
507
582
|
- Booleans as an argument
|
508
583
|
|
509
|
-
`slice(booleans)` accepts booleans as
|
584
|
+
`slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
|
510
585
|
|
511
586
|
```ruby
|
512
587
|
vector = penguins[:bill_length_mm]
|
@@ -583,7 +658,7 @@ penguins.to_rover
|
|
583
658
|
|
584
659
|
### `remove`
|
585
660
|
|
586
|
-
Slice and reject
|
661
|
+
Slice and reject rows (observations) to create a remainder DataFrame.
|
587
662
|
|
588
663
|

|
589
664
|
|
@@ -612,7 +687,7 @@ penguins.to_rover
|
|
612
687
|
|
613
688
|
- Booleans as an argument
|
614
689
|
|
615
|
-
`remove(booleans)` accepts booleans as
|
690
|
+
`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
|
616
691
|
|
617
692
|
```ruby
|
618
693
|
# remove all observations containing nil
|
@@ -640,10 +715,12 @@ penguins.to_rover
|
|
640
715
|
|
641
716
|
```ruby
|
642
717
|
penguins.remove do
|
643
|
-
|
644
|
-
|
645
|
-
|
646
|
-
|
718
|
+
# We will use another style shown in slice
|
719
|
+
# self.bill_length_mm returns Vector
|
720
|
+
mean = bill_length_mm.mean
|
721
|
+
min = mean - bill_length_mm.std
|
722
|
+
max = mean + bill_length_mm.std
|
723
|
+
bill_length_mm.to_a.map { |e| (min..max).include? e }
|
647
724
|
end
|
648
725
|
|
649
726
|
# =>
|
@@ -660,6 +737,7 @@ penguins.to_rover
|
|
660
737
|
139 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
661
738
|
140 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
662
739
|
```
|
740
|
+
|
663
741
|
- Notice for nil
|
664
742
|
- When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
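The nil handling noted above can be illustrated with a plain-Ruby sketch. It is not RedAmber's implementation; `remove_rows` is a hypothetical helper working on an Array of rows and a boolean mask of the same length.

```ruby
# Hypothetical helper: reject rows whose mask entry is truthy.
# A nil entry is falsy, so the corresponding row is kept.
def remove_rows(rows, mask)
  raise ArgumentError, 'mask must have one entry per row' unless mask.size == rows.size

  rows.reject.with_index { |_, i| mask[i] }
end

remove_rows(%w[a b c d], [true, false, nil, true]) # => ["b", "c"]
```

The row paired with `nil` survives because `nil` behaves like `false` in the mask.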
|
665
743
|
|
@@ -704,7 +782,7 @@ penguins.to_rover
|
|
704
782
|
|
705
783
|
- Key pairs as arguments
|
706
784
|
|
707
|
-
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
|
785
|
+
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`.
|
708
786
|
|
709
787
|
```ruby
|
710
788
|
df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
|
@@ -721,7 +799,11 @@ penguins.to_rover
|
|
721
799
|
|
722
800
|
- Key pairs by a block
|
723
801
|
|
724
|
-
`rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. Block is called in the context of self.
|
802
|
+
`rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. Block is called in the context of self.
|
803
|
+
|
804
|
+
- Not existing keys
|
805
|
+
|
806
|
+
If the specified `existing_key` does not exist, a `DataFrameArgumentError` is raised.
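A plain-Ruby sketch of the two accepted `key_pairs` shapes and the missing-key check follows. It is an illustration under assumptions, not RedAmber's code: `rename_keys` is a hypothetical helper on a Hash of columns, and `ArgumentError` stands in for `DataFrameArgumentError`.

```ruby
# Hypothetical helper: `key_pairs` may be a Hash of { existing_key => new_key }
# or an Array of [existing_key, new_key] pairs.
# ArgumentError is a stand-in for RedAmber's DataFrameArgumentError.
def rename_keys(columns, key_pairs)
  pairs = key_pairs.to_h
  missing = pairs.keys - columns.keys
  raise ArgumentError, "keys do not exist: #{missing}" unless missing.empty?

  columns.transform_keys { |key| pairs.fetch(key, key) }
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }

rename_keys(columns, { age: :age_in_1993 }).keys  # => [:name, :age_in_1993]
rename_keys(columns, [%i[age age_in_1993]]).keys  # => [:name, :age_in_1993]
```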
|
725
807
|
|
726
808
|
- Key type
|
727
809
|
|
@@ -729,16 +811,16 @@ penguins.to_rover
|
|
729
811
|
|
730
812
|
### `assign`
|
731
813
|
|
732
|
-
Assign new or updated
|
814
|
+
Assign new or updated columns (variables) and create an updated DataFrame.
|
733
815
|
|
734
|
-
- Variables with new keys will append new
|
816
|
+
- Variables with new keys will append new columns from the right.
|
735
817
|
- Variables with existing keys will update corresponding vectors.
|
736
818
|
|
737
819
|

|
738
820
|
|
739
821
|
- Variables as arguments
|
740
822
|
|
741
|
-
`assign(key_pairs)` accepts pairs of key and values as
|
823
|
+
`assign(key_pairs)` accepts pairs of key and values as parameters. `key_pairs` should be a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`.
|
742
824
|
|
743
825
|
```ruby
|
744
826
|
df = RedAmber::DataFrame.new(
|
@@ -748,15 +830,19 @@ penguins.to_rover
|
|
748
830
|
|
749
831
|
# =>
|
750
832
|
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
|
751
|
-
name age
|
752
|
-
<string> <uint8>
|
753
|
-
1 Yasuko 68
|
754
|
-
2 Rui 49
|
833
|
+
name age
|
834
|
+
<string> <uint8>
|
835
|
+
1 Yasuko 68
|
836
|
+
2 Rui 49
|
755
837
|
3 Hinata 28
|
756
838
|
|
757
839
|
# update :age and add :brother
|
758
|
-
|
759
|
-
|
840
|
+
df.assign do
|
841
|
+
{
|
842
|
+
age: age + 29,
|
843
|
+
brother: ['Santa', nil, 'Momotaro']
|
844
|
+
}
|
845
|
+
end
|
760
846
|
|
761
847
|
# =>
|
762
848
|
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
|
@@ -769,13 +855,14 @@ penguins.to_rover
|
|
769
855
|
|
770
856
|
- Key pairs by a block
|
771
857
|
|
772
|
-
`assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and values as a Hash of `{key =>
|
858
|
+
`assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`. The block is called in the context of self.
|
773
859
|
|
774
860
|
```ruby
|
775
861
|
df = RedAmber::DataFrame.new(
|
776
862
|
index: [0, 1, 2, 3, nil],
|
777
863
|
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
778
|
-
string: ['A', 'B', 'C', 'D', nil]
|
864
|
+
string: ['A', 'B', 'C', 'D', nil]
|
865
|
+
)
|
779
866
|
df
|
780
867
|
|
781
868
|
# =>
|
@@ -788,29 +875,27 @@ penguins.to_rover
|
|
788
875
|
4 3 NaN D
|
789
876
|
5 (nil) (nil) (nil)
|
790
877
|
|
791
|
-
# update
|
878
|
+
# update :float
|
879
|
+
# assigner by an Array
|
792
880
|
df.assign do
|
793
|
-
|
794
|
-
|
795
|
-
assigner[keys[i]] = v * -1 if v.numeric?
|
796
|
-
end
|
797
|
-
assigner
|
881
|
+
vectors.select(&:float?)
|
882
|
+
.map { |v| [v.key, -v] }
|
798
883
|
end
|
799
884
|
|
800
885
|
# =>
|
801
|
-
#<RedAmber::DataFrame : 5 x 3 Vectors,
|
802
|
-
|
803
|
-
<
|
804
|
-
1
|
805
|
-
2
|
806
|
-
3
|
807
|
-
4
|
808
|
-
5
|
809
|
-
|
810
|
-
# Or
|
886
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
|
887
|
+
index float string
|
888
|
+
<uint8> <double> <string>
|
889
|
+
1 0 -0.0 A
|
890
|
+
2 1 -1.1 B
|
891
|
+
3 2 -2.2 C
|
892
|
+
4 3 NaN D
|
893
|
+
5 (nil) (nil) (nil)
|
894
|
+
|
895
|
+
# Or we can use assigner by a Hash
|
811
896
|
df.assign do
|
812
|
-
|
813
|
-
assigner[key] =
|
897
|
+
vectors.select.with_object({}) do |v, assigner|
|
898
|
+
assigner[v.key] = -v if v.float?
|
814
899
|
end
|
815
900
|
end
|
816
901
|
|
@@ -821,6 +906,96 @@ penguins.to_rover
|
|
821
906
|
|
822
907
|
Symbol key and String key are considered as the same key.
|
823
908
|
|
909
|
+
- Empty assignment
|
910
|
+
|
911
|
+
If assigner is empty or nil, returns self.
|
912
|
+
|
913
|
+
- Append from left
|
914
|
+
|
915
|
+
The `assign_left` method accepts the same parameters and block as `assign`, but appends new columns from the left side.
|
916
|
+
|
917
|
+
```ruby
|
918
|
+
df.assign_left(new_index: df.indices(1))
|
919
|
+
|
920
|
+
# =>
|
921
|
+
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
|
922
|
+
new_index index float string
|
923
|
+
<uint8> <uint8> <double> <string>
|
924
|
+
1 1 0 0.0 A
|
925
|
+
2 2 1 1.1 B
|
926
|
+
3 3 2 2.2 C
|
927
|
+
4 4 3 NaN D
|
928
|
+
5 5 (nil) (nil) (nil)
|
929
|
+
```
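The difference in column order between `assign` and `assign_left` can be modeled with ordered Hashes in plain Ruby. This sketch covers ordering and the empty-assigner case only and is not RedAmber's implementation; `assign_columns` and `assign_columns_left` are hypothetical helpers.

```ruby
# Hypothetical helpers modelling column order only.
# Existing keys are updated in place; new keys go to the right or to the left.
def assign_columns(columns, assigner)
  return columns if assigner.nil? || assigner.empty?

  columns.merge(assigner.to_h)
end

def assign_columns_left(columns, assigner)
  return columns if assigner.nil? || assigner.empty?

  pairs = assigner.to_h
  new_pairs = pairs.reject { |key, _| columns.key?(key) }
  new_pairs.merge(columns).merge(pairs)
end

columns = { index: [0, 1, 2], float: [0.0, 1.1, 2.2] }

assign_columns(columns, { float: [0.0, -1.1, -2.2], tag: %w[A B C] }).keys
# => [:index, :float, :tag]
assign_columns_left(columns, [[:new_index, [1, 2, 3]]]).keys
# => [:new_index, :index, :float]
```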
|
930
|
+
|
931
|
+
### `slice_by(key, keep_key: false) { block }`
|
932
|
+
|
933
|
+
`slice_by` accepts a key and a block to select rows.
|
934
|
+
|
935
|
+
(Since 0.2.1)
|
936
|
+
|
937
|
+
```ruby
|
938
|
+
df = RedAmber::DataFrame.new(
|
939
|
+
index: [0, 1, 2, 3, nil],
|
940
|
+
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
941
|
+
string: ['A', 'B', 'C', 'D', nil]
|
942
|
+
)
|
943
|
+
df
|
944
|
+
|
945
|
+
# =>
|
946
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
|
947
|
+
index float string
|
948
|
+
<uint8> <double> <string>
|
949
|
+
1 0 0.0 A
|
950
|
+
2 1 1.1 B
|
951
|
+
3 2 2.2 C
|
952
|
+
4 3 NaN D
|
953
|
+
5 (nil) (nil) (nil)
|
954
|
+
|
955
|
+
df.slice_by(:string) { ["A", "C"] }
|
956
|
+
|
957
|
+
# =>
|
958
|
+
#<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001b1ac>
|
959
|
+
index float
|
960
|
+
<uint8> <double>
|
961
|
+
1 0 0.0
|
962
|
+
2 2 2.2
|
963
|
+
```
|
964
|
+
|
965
|
+
This behaves the same as:
|
966
|
+
|
967
|
+
```ruby
|
968
|
+
df.slice { [string.index("A"), string.index("C")] }.drop(:string)
|
969
|
+
```
|
970
|
+
|
971
|
+
`slice_by` also accepts a Range.
|
972
|
+
|
973
|
+
```ruby
|
974
|
+
df.slice_by(:string) { "A".."C" }
|
975
|
+
|
976
|
+
# =>
|
977
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000069668>
|
978
|
+
index float
|
979
|
+
<uint8> <double>
|
980
|
+
1 0 0.0
|
981
|
+
2 1 1.1
|
982
|
+
3 2 2.2
|
983
|
+
```
|
984
|
+
|
985
|
+
When the option `keep_key: true` is used, the column `key` will be preserved.
|
986
|
+
|
987
|
+
```ruby
|
988
|
+
df.slice_by(:string, keep_key: true) { "A".."C" }
|
989
|
+
|
990
|
+
# =>
|
991
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000073c44>
|
992
|
+
index float string
|
993
|
+
<uint8> <double> <string>
|
994
|
+
1 0 0.0 A
|
995
|
+
2 1 1.1 B
|
996
|
+
3 2 2.2 C
|
997
|
+
```
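The `slice_by` examples above can be approximated in plain Ruby. This is a sketch under assumptions, not RedAmber's implementation: `slice_by` here is a hypothetical function on a Hash of columns, an Array from the block is read as a list of labels to look up, and a Range is read as the row positions from its first label to its last.

```ruby
# Hypothetical model of slice_by on a Hash of column key => Array.
# The block returns either an Array of labels or a Range of labels.
def slice_by(columns, key, keep_key: false)
  labels = columns.fetch(key)
  wanted = yield
  row_ids =
    if wanted.is_a?(Range)
      # Assumption: a Range covers rows from its first label to its last.
      (labels.index(wanted.begin)..labels.index(wanted.end)).to_a
    else
      wanted.map { |label| labels.index(label) }
    end
  kept = keep_key ? columns : columns.reject { |k, _| k == key }
  kept.transform_values { |values| values.values_at(*row_ids) }
end

columns = {
  index: [0, 1, 2, 3, nil],
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
  string: ['A', 'B', 'C', 'D', nil]
}

slice_by(columns, :string) { %w[A C] }
# => { index: [0, 2], float: [0.0, 2.2] }
slice_by(columns, :string, keep_key: true) { 'A'..'C' }
# => { index: [0, 1, 2], float: [0.0, 1.1, 2.2], string: ["A", "B", "C"] }
```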
|
998
|
+
|
824
999
|
## Updating
|
825
1000
|
|
826
1001
|
### `sort`
|
@@ -830,11 +1005,11 @@ penguins.to_rover
|
|
830
1005
|
- "-key" denotes descending order
|
831
1006
|
|
832
1007
|
```ruby
|
833
|
-
df = RedAmber::DataFrame.new(
|
1008
|
+
df = RedAmber::DataFrame.new(
|
834
1009
|
index: [1, 1, 0, nil, 0],
|
835
1010
|
string: ['C', 'B', nil, 'A', 'B'],
|
836
1011
|
bool: [nil, true, false, true, false],
|
837
|
-
|
1012
|
+
)
|
838
1013
|
df.sort(:index, '-bool')
|
839
1014
|
|
840
1015
|
# =>
|
@@ -860,16 +1035,10 @@ penguins.to_rover
|
|
860
1035
|
|
861
1036
|
## Grouping
|
862
1037
|
|
863
|
-
### `group(
|
864
|
-
|
865
|
-
(
|
866
|
-
This API will change in the future version. Especcially I want to change:
|
867
|
-
- Order of the column of the result (aggregation_keys should be the first)
|
868
|
-
- DataFrame#group will accept a block (heronshoes/red_amber #28)
|
869
|
-
)
|
1038
|
+
### `group(group_keys)`
|
870
1039
|
|
871
1040
|
`group` creates a `Group` object. `Group` accepts the functions below as methods.
|
872
|
-
Method accepts options as `
|
1041
|
+
`group` accepts `group_keys` as arguments.
|
873
1042
|
|
874
1043
|
Available functions are:
|
875
1044
|
|
@@ -889,8 +1058,8 @@ penguins.to_rover
|
|
889
1058
|
- [ ] tdigest
|
890
1059
|
- ✓ variance
|
891
1060
|
|
892
|
-
For the each group of `
|
893
|
-
|
1061
|
+
For each group of `group_keys`, the aggregation `function` is applied and returns a new DataFrame with aggregated keys according to `summary_keys`.
|
1062
|
+
Summary key names are provided by `function(summary_keys)` style.
|
894
1063
|
|
895
1064
|
This is an example of grouping the famous STARWARS dataset.
|
896
1065
|
|
@@ -900,18 +1069,18 @@ penguins.to_rover
|
|
900
1069
|
starwars
|
901
1070
|
|
902
1071
|
# =>
|
903
|
-
#<RedAmber::DataFrame : 87 x 12 Vectors,
|
904
|
-
|
905
|
-
|
906
|
-
|
907
|
-
|
908
|
-
|
909
|
-
|
910
|
-
|
911
|
-
|
912
|
-
|
913
|
-
|
914
|
-
|
1072
|
+
#<RedAmber::DataFrame : 87 x 12 Vectors, 0x0000000000005a50>
|
1073
|
+
unnamed1 name height mass hair_color skin_color eye_color ... species
|
1074
|
+
<int64> <string> <int64> <double> <string> <string> <string> ... <string>
|
1075
|
+
1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
|
1076
|
+
2 2 C-3PO 167 75.0 NA gold yellow ... Droid
|
1077
|
+
3 3 R2-D2 96 32.0 NA white, blue red ... Droid
|
1078
|
+
4 4 Darth Vader 202 136.0 none white yellow ... Human
|
1079
|
+
5 5 Leia Organa 150 49.0 brown light brown ... Human
|
1080
|
+
: : : : : : : : ... :
|
1081
|
+
85 85 BB8 (nil) (nil) none none black ... Droid
|
1082
|
+
86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
|
1083
|
+
87 87 Padmé Amidala 165 45.0 brown light brown ... Human
|
915
1084
|
|
916
1085
|
starwars.tdr(12)
|
917
1086
|
|
@@ -919,7 +1088,7 @@ penguins.to_rover
|
|
919
1088
|
RedAmber::DataFrame : 87 x 12 Vectors
|
920
1089
|
Vectors : 4 numeric, 8 strings
|
921
1090
|
# key type level data_preview
|
922
|
-
1 :
|
1091
|
+
1 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ]
|
923
1092
|
2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
|
924
1093
|
3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
|
925
1094
|
4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
|
@@ -933,82 +1102,176 @@ penguins.to_rover
|
|
933
1102
|
12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
|
934
1103
|
```
|
935
1104
|
|
936
|
-
We can
|
1105
|
+
We can group by `:species` and calculate the count.
|
1106
|
+
|
1107
|
+
```ruby
|
1108
|
+
starwars.group(:species).count(:species)
|
1109
|
+
|
1110
|
+
# =>
|
1111
|
+
#<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0>
|
1112
|
+
species count
|
1113
|
+
<string> <int64>
|
1114
|
+
1 Human 35
|
1115
|
+
2 Droid 6
|
1116
|
+
3 Wookiee 2
|
1117
|
+
4 Rodian 1
|
1118
|
+
5 Hutt 1
|
1119
|
+
: : :
|
1120
|
+
36 Kaleesh 1
|
1121
|
+
37 Pau'an 1
|
1122
|
+
38 Kel Dor 1
|
1123
|
+
```
|
1124
|
+
|
1125
|
+
We can also calculate the mean of `:mass` and `:height` together.
|
937
1126
|
|
938
1127
|
```ruby
|
939
|
-
grouped = starwars.group(:species)
|
940
|
-
grouped
|
1128
|
+
grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
|
941
1129
|
|
942
1130
|
# =>
|
943
|
-
#<RedAmber::DataFrame : 38 x
|
944
|
-
mean(
|
945
|
-
|
946
|
-
1 82.8
|
947
|
-
2 69.8
|
948
|
-
3 124.0
|
949
|
-
4 74.0
|
950
|
-
5 1358.0
|
951
|
-
:
|
952
|
-
36 159.0
|
953
|
-
37 80.0
|
954
|
-
38 80.0
|
1131
|
+
#<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc>
|
1132
|
+
species count mean(height) mean(mass)
|
1133
|
+
<string> <int64> <double> <double>
|
1134
|
+
1 Human 35 176.6 82.8
|
1135
|
+
2 Droid 6 131.2 69.8
|
1136
|
+
3 Wookiee 2 231.0 124.0
|
1137
|
+
4 Rodian 1 173.0 74.0
|
1138
|
+
5 Hutt 1 175.0 1358.0
|
1139
|
+
: : : : :
|
1140
|
+
36 Kaleesh 1 216.0 159.0
|
1141
|
+
37 Pau'an 1 206.0 80.0
|
1142
|
+
38 Kel Dor 1 188.0 80.0
|
955
1143
|
```
|
956
1144
|
|
957
1145
|
Select rows for count > 1.
|
958
1146
|
|
959
1147
|
```ruby
|
960
|
-
|
961
|
-
grouped = grouped.slice(count > 1)
|
1148
|
+
grouped.slice(grouped[:count] > 1)
|
962
1149
|
|
963
1150
|
# =>
|
964
|
-
#<RedAmber::DataFrame : 9 x
|
965
|
-
mean(
|
966
|
-
|
967
|
-
1 82.8
|
968
|
-
2 69.8
|
969
|
-
3 124.0
|
970
|
-
4 74.0
|
971
|
-
5 48.0
|
972
|
-
:
|
973
|
-
7 55.0
|
974
|
-
8
|
975
|
-
9
|
1151
|
+
#<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000004c270>
|
1152
|
+
species count mean(height) mean(mass)
|
1153
|
+
<string> <int64> <double> <double>
|
1154
|
+
1 Human 35 176.6 82.8
|
1155
|
+
2 Droid 6 131.2 69.8
|
1156
|
+
3 Wookiee 2 231.0 124.0
|
1157
|
+
4 Gungan 3 208.7 74.0
|
1158
|
+
5 NA 4 181.3 48.0
|
1159
|
+
: : : : :
|
1160
|
+
7 Twi'lek 2 179.0 55.0
|
1161
|
+
8 Mirialan 2 168.0 53.1
|
1162
|
+
9 Kaminoan 2 221.0 88.0
|
976
1163
|
```
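The group, count, mean and filter steps above can be mirrored in plain Ruby on the first five STARWARS rows shown earlier. This is a sketch, not RedAmber's `Group` implementation: `count_and_mean` and `mean` are hypothetical helpers, and skipping nil values in the mean is an assumption.

```ruby
# Hypothetical helper: mean of the non-nil values (an assumption about nil handling).
def mean(values)
  present = values.compact
  present.sum.to_f / present.size
end

# Hypothetical helper: one summary Hash per group, with a count and
# a "mean(key)" entry for every requested value key.
def count_and_mean(rows, group_key, *value_keys)
  rows.group_by { |row| row[group_key] }.map do |group, members|
    summary = { group_key => group, count: members.size }
    value_keys.each do |key|
      summary[:"mean(#{key})"] = mean(members.map { |m| m[key] })
    end
    summary
  end
end

rows = [
  { species: 'Human', height: 172, mass: 77.0 },
  { species: 'Droid', height: 167, mass: 75.0 },
  { species: 'Droid', height: 96,  mass: 32.0 },
  { species: 'Human', height: 202, mass: 136.0 },
  { species: 'Human', height: 150, mass: 49.0 }
]

grouped = count_and_mean(rows, :species, :height, :mass)
# The Droid group has count 2, mean(height) 131.5 and mean(mass) 53.5.
grouped.select { |row| row[:count] > 2 }.map { |row| row[:species] }
# => ["Human"]
```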
|
977
1164
|
|
978
|
-
|
1165
|
+
## Reshape
|
1166
|
+
|
1167
|
+
### `transpose`
|
1168
|
+
|
1169
|
+
Creates a transposed DataFrame from a wide (messy) DataFrame.
|
979
1170
|
|
980
1171
|
```ruby
|
981
|
-
|
1172
|
+
import_cars = RedAmber::DataFrame.load('test/entity/import_cars.tsv')
|
1173
|
+
|
1174
|
+
# =>
|
1175
|
+
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520>
|
1176
|
+
Year Audi BMW BMW_MINI Mercedes-Benz VW
|
1177
|
+
<int64> <int64> <int64> <int64> <int64> <int64>
|
1178
|
+
1 2017 28336 52527 25427 68221 49040
|
1179
|
+
2 2018 26473 50982 25984 67554 51961
|
1180
|
+
3 2019 24222 46814 23813 66553 46794
|
1181
|
+
4 2020 22304 35712 20196 57041 36576
|
1182
|
+
5 2021 22535 35905 18211 51722 35215
|
1183
|
+
import_cars.transpose(:Manufacturer)
|
1184
|
+
|
1185
|
+
# =>
|
1186
|
+
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74>
|
1187
|
+
Manufacturer 2017 2018 2019 2020 2021
|
1188
|
+
<dictionary> <uint32> <uint32> <uint32> <uint16> <uint16>
|
1189
|
+
1 Audi 28336 26473 24222 22304 22535
|
1190
|
+
2 BMW 52527 50982 46814 35712 35905
|
1191
|
+
3 BMW_MINI 25427 25984 23813 20196 18211
|
1192
|
+
4 Mercedes-Benz 68221 67554 66553 57041 51722
|
1193
|
+
5 VW 49040 51961 46794 36576 35215
|
1194
|
+
```
|
982
1195
|
|
1196
|
+
The leftmost column is created from the original keys. The key name of the column is
|
1197
|
+
named by parameter `:name`. If `:name` is not specified, `:N` is used for the key.
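The reshaping done by `transpose` can be sketched in plain Ruby on a Hash of columns. This is not RedAmber's implementation: `transpose_columns` is a hypothetical helper, and it assumes the values of the first column become the new keys, as the output above suggests.

```ruby
# Hypothetical helper: the first column supplies the new keys; the remaining
# original keys become the values of a new leftmost column called `name`.
def transpose_columns(columns, name: :N)
  first_key, *other_keys = columns.keys
  result = { name => other_keys }
  columns[first_key].each_with_index do |new_key, row|
    result[new_key] = other_keys.map { |key| columns[key][row] }
  end
  result
end

cars = { Year: [2017, 2018], Audi: [28_336, 26_473], BMW: [52_527, 50_982] }

transpose_columns(cars, name: :Manufacturer)
# => { Manufacturer: [:Audi, :BMW], 2017 => [28336, 52527], 2018 => [26473, 50982] }
```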
|
1198
|
+
|
1199
|
+
### `to_long(*keep_keys)`
|
1200
|
+
|
1201
|
+
Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
|
1202
|
+
|
1203
|
+
- Parameter `keep_keys` specifies the key names to keep.
|
1204
|
+
|
1205
|
+
```ruby
|
1206
|
+
import_cars.to_long(:Year)
|
1207
|
+
|
983
1208
|
# =>
|
984
|
-
#<RedAmber::DataFrame :
|
985
|
-
|
986
|
-
|
987
|
-
|
988
|
-
|
989
|
-
|
990
|
-
|
991
|
-
|
992
|
-
|
993
|
-
|
994
|
-
|
995
|
-
|
1209
|
+
#<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
|
1210
|
+
Year N V
|
1211
|
+
<uint16> <dictionary> <uint32>
|
1212
|
+
1 2017 Audi 28336
|
1213
|
+
2 2017 BMW 52527
|
1214
|
+
3 2017 BMW_MINI 25427
|
1215
|
+
4 2017 Mercedes-Benz 68221
|
1216
|
+
5 2017 VW 49040
|
1217
|
+
: : : :
|
1218
|
+
23 2021 BMW_MINI 18211
|
1219
|
+
24 2021 Mercedes-Benz 51722
|
1220
|
+
25 2021 VW 35215
|
996
1221
|
```
|
997
1222
|
|
998
|
-
|
1223
|
+
- Option `:name` is the key of the column which came **from key names**.
|
1224
|
+
- Option `:value` is the key of the column which came **from values**.
|
999
1225
|
|
1000
|
-
|
1226
|
+
```ruby
|
1227
|
+
import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
|
1001
1228
|
|
1002
|
-
|
1229
|
+
# =>
|
1230
|
+
#<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
|
1231
|
+
Year Manufacturer Num_of_imported
|
1232
|
+
<uint16> <dictionary> <uint32>
|
1233
|
+
1 2017 Audi 28336
|
1234
|
+
2 2017 BMW 52527
|
1235
|
+
3 2017 BMW_MINI 25427
|
1236
|
+
4 2017 Mercedes-Benz 68221
|
1237
|
+
5 2017 VW 49040
|
1238
|
+
: : : :
|
1239
|
+
23 2021 BMW_MINI 18211
|
1240
|
+
24 2021 Mercedes-Benz 51722
|
1241
|
+
25 2021 VW 35215
|
1242
|
+
```
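The wide-to-long reshaping can be sketched in plain Ruby. This is an illustration, not RedAmber's implementation: `to_long` here is a hypothetical function on a Hash of columns, with `name:` and `value:` defaulting to `:N` and `:V` as in the output above.

```ruby
# Hypothetical model of to_long on a Hash of column key => Array.
# Every non-kept column contributes one output row per input row.
def to_long(columns, *keep_keys, name: :N, value: :V)
  melt_keys = columns.keys - keep_keys
  n_rows = columns.values.first.size
  result = (keep_keys + [name, value]).to_h { |key| [key, []] }
  n_rows.times do |row|
    melt_keys.each do |key|
      keep_keys.each { |keep| result[keep] << columns[keep][row] }
      result[name] << key
      result[value] << columns[key][row]
    end
  end
  result
end

cars = { Year: [2017, 2018], Audi: [28_336, 26_473], BMW: [52_527, 50_982] }

to_long(cars, :Year)
# => { Year: [2017, 2017, 2018, 2018],
#      N:    [:Audi, :BMW, :Audi, :BMW],
#      V:    [28336, 52527, 26473, 50982] }
```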
|
1003
1243
|
|
1004
|
-
|
1244
|
+
### `to_wide`
|
1005
1245
|
|
1006
|
-
|
1246
|
+
Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
|
1007
1247
|
|
1008
|
-
|
1248
|
+
- Option `:name` is the key of the column which will be expanded **to key names**.
|
1249
|
+
- Option `:value` is the key of the column which will be expanded **to values**.
|
1009
1250
|
|
1010
|
-
|
1251
|
+
```ruby
|
1252
|
+
import_cars.to_long(:Year).to_wide
|
1253
|
+
# import_cars.to_long(:Year).to_wide(name: :N, value: :V)
|
1254
|
+
# is also OK
|
1011
1255
|
|
1012
|
-
|
1256
|
+
# =>
|
1257
|
+
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
|
1258
|
+
Year Audi BMW BMW_MINI Mercedes-Benz VW
|
1259
|
+
<uint16> <uint16> <uint16> <uint16> <uint32> <uint16>
|
1260
|
+
1 2017 28336 52527 25427 68221 49040
|
1261
|
+
2 2018 26473 50982 25984 67554 51961
|
1262
|
+
3 2019 24222 46814 23813 66553 46794
|
1263
|
+
4 2020 22304 35712 20196 57041 36576
|
1264
|
+
5 2021 22535 35905 18211 51722 35215
|
1265
|
+
|
1266
|
+
# == import_cars
|
1267
|
+
```
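The inverse reshaping can be sketched the same way. This is not RedAmber's implementation: `to_wide` here is a hypothetical function on a Hash of columns, and it assumes every combination of kept keys and names is present exactly once.

```ruby
# Hypothetical model of to_wide on a Hash of column key => Array.
# Values in the `name` column become new keys; `value` fills their cells.
def to_wide(columns, name: :N, value: :V)
  keep_keys = columns.keys - [name, value]
  wide_rows = {}
  columns[name].size.times do |row|
    id = keep_keys.map { |key| columns[key][row] }
    wide_rows[id] ||= {}
    wide_rows[id][columns[name][row]] = columns[value][row]
  end
  result = keep_keys.to_h { |key| [key, []] }
  wide_rows.each do |id, cells|
    keep_keys.each_with_index { |key, i| result[key] << id[i] }
    cells.each { |new_key, cell| (result[new_key] ||= []) << cell }
  end
  result
end

long = {
  Year: [2017, 2017, 2018, 2018],
  N: %i[Audi BMW Audi BMW],
  V: [28_336, 52_527, 26_473, 50_982]
}

to_wide(long)
# => { Year: [2017, 2018], Audi: [28336, 26473], BMW: [52527, 50982] }
```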
|
1268
|
+
|
1269
|
+
## Combine
|
1270
|
+
|
1271
|
+
- [ ] Combining dataframes
|
1013
1272
|
|
1014
|
-
- [ ]
|
1273
|
+
- [ ] Join
|
1274
|
+
|
1275
|
+
## Encoding
|
1276
|
+
|
1277
|
+
- [ ] One-hot encoding
|