red_amber 0.1.7 → 0.2.1

data/doc/DataFrame.md CHANGED
@@ -155,7 +155,25 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
155
155
 
156
156
  ### `indices`, `indexes`
157
157
 
158
- - Returns all indexes in an Array.
158
+ - Returns indexes in an Array.
159
+ Accepts an optional `start` argument, which is used as the first index.
160
+
161
+ ```ruby
162
+ df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
163
+ df.indices
164
+
165
+ # =>
166
+ [0, 1, 2, 3, 4]
167
+
168
+ df.indices(1)
169
+
170
+ # =>
171
+ [1, 2, 3, 4, 5]
172
+
173
+ df.indices(:a)
174
+ # =>
175
+ [:a, :b, :c, :d, :e]
176
+ ```
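The `start` behaviour shown above can be imitated in plain Ruby. This is only a sketch under the assumption that each index is the successor (`succ`) of the previous one; `indices_from` is a hypothetical helper, not part of RedAmber.

```ruby
# Hypothetical helper: builds `size` indexes beginning at `start`.
# Assumes `start` responds to #succ (Integer, Symbol and String all do).
def indices_from(start, size)
  result = []
  current = start
  size.times do
    result << current
    current = current.succ
  end
  result
end

p indices_from(0, 5)   # => [0, 1, 2, 3, 4]
p indices_from(1, 5)   # => [1, 2, 3, 4, 5]
p indices_from(:a, 5)  # => [:a, :b, :c, :d, :e]
```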
159
177
 
160
178
  ### `to_h`
161
179
 
@@ -167,6 +185,11 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
167
185
 
168
186
  If you need a column-oriented full array, use `.to_h.to_a`
169
187
 
188
+ ### `each_row`
189
+
190
+ Yields each row as a `{ key => row }` Hash.
191
+ Returns an Enumerator if no block is given.
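The block-or-Enumerator contract described for `each_row` follows a common Ruby idiom. The sketch below uses a plain Hash of columns as a stand-in for a DataFrame and assumes each row is yielded as a Hash of column key to that row's value; `each_row_sketch` and `COLUMNS` are hypothetical names, not RedAmber's implementation.

```ruby
# Hypothetical stand-in for a DataFrame: a Hash of column key => Array.
COLUMNS = { x: [1, 2, 3], y: %w[A B C] }.freeze

# Yields one Hash per row; returns an Enumerator when no block is given.
def each_row_sketch(columns)
  return enum_for(:each_row_sketch, columns) unless block_given?

  n_rows = columns.values.map(&:size).max.to_i
  n_rows.times do |i|
    yield columns.transform_values { |values| values[i] }
  end
end

# With a block: visit every row.
each_row_sketch(COLUMNS) { |row| puts row[:y] * row[:x] }

# Without a block: chain Enumerable methods.
big_rows = each_row_sketch(COLUMNS).select { |row| row[:x] > 1 }
puts big_rows.size
```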
192
+
170
193
  ### `schema`
171
194
 
172
195
  - Returns column name and data type in a Hash.
@@ -202,7 +225,22 @@ puts penguins.to_s
202
225
  `inspect` uses `to_s` output and also shows shape and object_id.
203
226
 
204
227
 
205
- ### `summary`, `describe` (not implemented)
228
+ ### `summary`, `describe`
229
+
230
+ `DataFrame#summary` or `DataFrame#describe` shows summary statistics in a DataFrame.
231
+
232
+ ```ruby
233
+ puts penguins.summary.to_s(width: 82) # needs more width to show all stats in this example
234
+
235
+ # =>
236
+ variables count mean std min 25% median 75% max
237
+ <dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double>
238
+ 1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6
239
+ 2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5
240
+ 3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0
241
+ 4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0
242
+ 5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0
243
+ ```
206
244
 
207
245
  ### `to_rover`
208
246
 
@@ -352,13 +390,13 @@ penguins.to_rover
352
390
 
353
391
  ### `pick ` - pick up variables by key label -
354
392
 
355
- Pick up some variables (columns) to create a sub DataFrame.
393
+ Pick up some columns (variables) to create a sub DataFrame.
356
394
 
357
395
  ![pick method image](doc/../image/dataframe/pick.png)
358
396
 
359
397
  - Keys as arguments
360
398
 
361
- `pick(keys)` accepts keys as arguments in an Array.
399
+ `pick(keys)` accepts keys as arguments in an Array or a Range.
362
400
 
363
401
  ```ruby
364
402
  penguins.pick(:species, :bill_length_mm)
@@ -378,9 +416,31 @@ penguins.to_rover
378
416
  344 Gentoo 49.9
379
417
  ```
380
418
 
381
- - Booleans as a argument
419
+ - Indices as arguments
420
+
421
+ `pick(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
422
+
423
+ ```ruby
424
+ penguins.pick(0..2, -1)
425
+
426
+ # =>
427
+ #<RedAmber::DataFrame : 344 x 4 Vectors, 0x0000000000055ce4>
428
+ species island bill_length_mm year
429
+ <string> <string> <double> <uint16>
430
+ 1 Adelie Torgersen 39.1 2007
431
+ 2 Adelie Torgersen 39.5 2007
432
+ 3 Adelie Torgersen 40.3 2007
433
+ 4 Adelie Torgersen (nil) 2007
434
+ 5 Adelie Torgersen 36.7 2007
435
+ : : : : :
436
+ 342 Gentoo Biscoe 50.4 2009
437
+ 343 Gentoo Biscoe 45.2 2009
438
+ 344 Gentoo Biscoe 49.9 2009
439
+ ```
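How `0..2, -1` ends up selecting `species`, `island`, `bill_length_mm` and `year` can be sketched with ordinary Array indexing. The key list below is assumed to be the column order of the penguins dataset, and `keys_at` is a hypothetical helper rather than RedAmber code.

```ruby
# Column keys of the penguins dataset, as assumed for this sketch.
KEYS = %i[species island bill_length_mm bill_depth_mm
          flipper_length_mm body_mass_g sex year].freeze

# Hypothetical helper: expands Integers and Ranges into the keys they select.
# Negative integers count from the end, as with Array#[].
def keys_at(keys, *indices)
  indices.flat_map do |index|
    index.is_a?(Range) ? keys[index] : [keys[index]]
  end
end

p keys_at(KEYS, 0..2, -1)
# => [:species, :island, :bill_length_mm, :year]
```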
440
+
441
+ - Booleans as arguments
382
442
 
383
- `pick(booleans)` accepts booleans as a argument in an Array. Booleans must be same length as `n_keys`.
443
+ `pick(booleans)` accepts booleans as arguments in an Array. Booleans must be the same length as `n_keys`.
384
444
 
385
445
  ```ruby
386
446
  penguins.pick(penguins.types.map { |type| type == :string })
@@ -400,9 +460,9 @@ penguins.to_rover
400
460
  344 Gentoo Biscoe male
401
461
  ```
402
462
 
403
- - Keys or booleans by a block
463
+ - Keys or booleans by a block
404
464
 
405
- `pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
465
+ `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
406
466
 
407
467
  ```ruby
408
468
  penguins.pick { keys.map { |key| key.end_with?('mm') } }
@@ -424,21 +484,25 @@ penguins.to_rover
424
484
 
425
485
  ### `drop ` - pick and drop -
426
486
 
427
- Drop some variables (columns) to create a remainer DataFrame.
487
+ Drop some columns (variables) to create a remainder DataFrame.
428
488
 
429
489
  ![drop method image](doc/../image/dataframe/drop.png)
430
490
 
431
491
  - Keys as arguments
432
492
 
433
- `drop(keys)` accepts keys as arguments in an Array.
493
+ `drop(keys)` accepts keys as arguments in an Array or a Range.
494
+
495
+ - Indices as arguments
496
+
497
+ `drop(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
434
498
 
435
- - Booleans as a argument
499
+ - Booleans as arguments
436
500
 
437
- `drop(booleans)` accepts booleans as a argument in an Array. Booleans must be same length as `n_keys`.
501
+ `drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
438
502
 
439
503
  - Keys or booleans by a block
440
504
 
441
- `drop {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
505
+ `drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
442
506
 
443
507
  - Notice for nil
444
508
 
@@ -473,9 +537,20 @@ penguins.to_rover
473
537
  [1, 2, 3]
474
538
  ```
475
539
 
540
+ A simple key name is usable as a method of the DataFrame if the key name is acceptable as a method name.
541
+ It returns the same Vector as `[]`.
542
+
543
+ ```ruby
544
+ df.a
545
+
546
+ # =>
547
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
548
+ [1, 2, 3]
549
+ ```
550
+
476
551
  ### `slice ` - to cut vertically is slice -
477
552
 
478
- Slice and select observations (rows) to create a sub DataFrame.
553
+ Slice and select rows (observations) to create a sub DataFrame.
479
554
 
480
555
  ![slice method image](doc/../image/dataframe/slice.png)
481
556
 
@@ -506,7 +581,7 @@ penguins.to_rover
506
581
 
507
582
  - Booleans as an argument
508
583
 
509
- `slice(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
584
+ `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
510
585
 
511
586
  ```ruby
512
587
  vector = penguins[:bill_length_mm]
@@ -583,7 +658,7 @@ penguins.to_rover
583
658
 
584
659
  ### `remove`
585
660
 
586
- Slice and reject observations (rows) to create a remainer DataFrame.
661
+ Slice and reject rows (observations) to create a remainder DataFrame.
587
662
 
588
663
  ![remove method image](doc/../image/dataframe/remove.png)
589
664
 
@@ -612,7 +687,7 @@ penguins.to_rover
612
687
 
613
688
  - Booleans as an argument
614
689
 
615
- `remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
690
+ `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
616
691
 
617
692
  ```ruby
618
693
  # remove all observation contains nil
@@ -640,10 +715,12 @@ penguins.to_rover
640
715
 
641
716
  ```ruby
642
717
  penguins.remove do
643
- vector = self[:bill_length_mm]
644
- min = vector.mean - vector.std
645
- max = vector.mean + vector.std
646
- vector.to_a.map { |e| (min..max).include? e }
718
+ # We will use another style shown in slice
719
+ # self.bill_length_mm returns Vector
720
+ mean = bill_length_mm.mean
721
+ min = mean - bill_length_mm.std
722
+ max = mean + bill_length_mm.std
723
+ bill_length_mm.to_a.map { |e| (min..max).include? e }
647
724
  end
648
725
 
649
726
  # =>
@@ -660,6 +737,7 @@ penguins.to_rover
660
737
  139 Gentoo Biscoe 50.4 15.7 222 ... 2009
661
738
  140 Gentoo Biscoe 49.9 16.1 213 ... 2009
662
739
  ```
740
+
663
741
  - Notice for nil
664
742
  - When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
665
743
 
@@ -704,7 +782,7 @@ penguins.to_rover
704
782
 
705
783
  - Key pairs as arguments
706
784
 
707
- `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
785
+ `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`.
708
786
 
709
787
  ```ruby
710
788
  df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
@@ -721,7 +799,11 @@ penguins.to_rover
721
799
 
722
800
  - Key pairs by a block
723
801
 
724
- `rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. Block is called in the context of self.
802
+ `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. The block is called in the context of self.
803
+
804
+ - Not existing keys
805
+
806
+ If the specified `existing_key` does not exist, a `DataFrameArgumentError` is raised.
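The two accepted shapes of `key_pairs` and the error on an unknown key can be illustrated with a plain Hash of columns. This is a sketch only: `rename_keys` is a hypothetical helper, and the standard `ArgumentError` stands in for RedAmber's `DataFrameArgumentError`.

```ruby
# Hypothetical helper working on a Hash of column key => values.
# `key_pairs` may be a Hash or an Array of [existing_key, new_key] pairs.
def rename_keys(columns, key_pairs)
  pairs = key_pairs.to_h
  missing = pairs.keys - columns.keys
  # ArgumentError stands in for RedAmber's DataFrameArgumentError here.
  raise ArgumentError, "keys not found: #{missing.inspect}" unless missing.empty?

  columns.to_h { |key, values| [pairs.fetch(key, key), values] }
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }

p rename_keys(columns, { age: :years }).keys    # => [:name, :years]
p rename_keys(columns, [[:age, :years]]).keys   # => [:name, :years]

begin
  rename_keys(columns, { height: :h })
rescue ArgumentError => e
  puts e.message
end
```

Converting both shapes with `to_h` first keeps the rest of the helper independent of how the pairs were supplied.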
725
807
 
726
808
  - Key type
727
809
 
@@ -729,16 +811,16 @@ penguins.to_rover
729
811
 
730
812
  ### `assign`
731
813
 
732
- Assign new or updated variables (columns) and create a updated DataFrame.
814
+ Assign new or updated columns (variables) and create an updated DataFrame.
733
815
 
734
- - Variables with new keys will append new variables at bottom (right in the table).
816
+ - Variables with new keys will append new columns from the right.
735
817
  - Variables with existing keys will update corresponding vectors.
736
818
 
737
819
  ![assign method image](doc/../image/dataframe/assign.png)
738
820
 
739
821
  - Variables as arguments
740
822
 
741
- `assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
823
+ `assign(key_pairs)` accepts pairs of key and values as parameters. `key_pairs` should be a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either a `Vector`, an `Array` or an `Arrow::Array`.
742
824
 
743
825
  ```ruby
744
826
  df = RedAmber::DataFrame.new(
@@ -748,15 +830,19 @@ penguins.to_rover
748
830
 
749
831
  # =>
750
832
  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
751
- name age
752
- <string> <uint8>
753
- 1 Yasuko 68
754
- 2 Rui 49
833
+ name age
834
+ <string> <uint8>
835
+ 1 Yasuko 68
836
+ 2 Rui 49
755
837
  3 Hinata 28
756
838
 
757
839
  # update :age and add :brother
758
- assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
759
- df.assign(assigner)
840
+ df.assign do
841
+ {
842
+ age: age + 29,
843
+ brother: ['Santa', nil, 'Momotaro']
844
+ }
845
+ end
760
846
 
761
847
  # =>
762
848
  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
@@ -769,13 +855,14 @@ penguins.to_rover
769
855
 
770
856
  - Key pairs by a block
771
857
 
772
- `assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. Block is called in the context of self.
858
+ `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either a `Vector`, an `Array` or an `Arrow::Array`. The block is called in the context of self.
773
859
 
774
860
  ```ruby
775
861
  df = RedAmber::DataFrame.new(
776
862
  index: [0, 1, 2, 3, nil],
777
863
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
778
- string: ['A', 'B', 'C', 'D', nil])
864
+ string: ['A', 'B', 'C', 'D', nil]
865
+ )
779
866
  df
780
867
 
781
868
  # =>
@@ -788,29 +875,27 @@ penguins.to_rover
788
875
  4 3 NaN D
789
876
  5 (nil) (nil) (nil)
790
877
 
791
- # update numeric variables
878
+ # update :float
879
+ # assigner by an Array
792
880
  df.assign do
793
- assigner = {}
794
- vectors.each_with_index do |v, i|
795
- assigner[keys[i]] = v * -1 if v.numeric?
796
- end
797
- assigner
881
+ vectors.select(&:float?)
882
+ .map { |v| [v.key, -v] }
798
883
  end
799
884
 
800
885
  # =>
801
- #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000006e000>
802
- index float string
803
- <int8> <double> <string>
804
- 1 0 -0.0 A
805
- 2 -1 -1.1 B
806
- 3 -2 -2.2 C
807
- 4 -3 NaN D
808
- 5 (nil) (nil) (nil)
809
-
810
- # Or it ’s shorter like this:
886
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
887
+ index float string
888
+ <uint8> <double> <string>
889
+ 1 0 -0.0 A
890
+ 2 1 -1.1 B
891
+ 3 2 -2.2 C
892
+ 4 3 NaN D
893
+ 5 (nil) (nil) (nil)
894
+
895
+ # Or we can use assigner by a Hash
811
896
  df.assign do
812
- variables.select.with_object({}) do |(key, vector), assigner|
813
- assigner[key] = vector * -1 if vector.numeric?
897
+ vectors.select.with_object({}) do |v, assigner|
898
+ assigner[v.key] = -v if v.float?
814
899
  end
815
900
  end
816
901
 
@@ -821,6 +906,96 @@ penguins.to_rover
821
906
 
822
907
  Symbol key and String key are considered as the same key.
823
908
 
909
+ - Empty assignment
910
+
911
+ If assigner is empty or nil, returns self.
912
+
913
+ - Append from left
914
+
915
+ The `assign_left` method accepts the same parameters and block as `assign`, but appends new columns from the left side.
916
+
917
+ ```ruby
918
+ df.assign_left(new_index: df.indices(1))
919
+
920
+ # =>
921
+ #<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
922
+ new_index index float string
923
+ <uint8> <uint8> <double> <string>
924
+ 1 1 0 0.0 A
925
+ 2 2 1 1.1 B
926
+ 3 3 2 2.2 C
927
+ 4 4 3 NaN D
928
+ 5 5 (nil) (nil) (nil)
929
+ ```
930
+
931
+ ### `slice_by(key, keep_key: false) { block }`
932
+
933
+ `slice_by` accepts a key and a block to select rows.
934
+
935
+ (Since 0.2.1)
936
+
937
+ ```ruby
938
+ df = RedAmber::DataFrame.new(
939
+ index: [0, 1, 2, 3, nil],
940
+ float: [0.0, 1.1, 2.2, Float::NAN, nil],
941
+ string: ['A', 'B', 'C', 'D', nil]
942
+ )
943
+ df
944
+
945
+ # =>
946
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
947
+ index float string
948
+ <uint8> <double> <string>
949
+ 1 0 0.0 A
950
+ 2 1 1.1 B
951
+ 3 2 2.2 C
952
+ 4 3 NaN D
953
+ 5 (nil) (nil) (nil)
954
+
955
+ df.slice_by(:string) { ["A", "C"] }
956
+
957
+ # =>
958
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001b1ac>
959
+ index float
960
+ <uint8> <double>
961
+ 1 0 0.0
962
+ 2 2 2.2
963
+ ```
964
+
965
+ This behaves the same as:
966
+
967
+ ```ruby
968
+ df.slice { [string.index("A"), string.index("C")] }.drop(:string)
969
+ ```
970
+
971
+ `slice_by` also accepts a Range.
972
+
973
+ ```ruby
974
+ df.slice_by(:string) { "A".."C" }
975
+
976
+ # =>
977
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000069668>
978
+ index float
979
+ <uint8> <double>
980
+ 1 0 0.0
981
+ 2 1 1.1
982
+ 3 2 2.2
983
+ ```
984
+
985
+ When the option `keep_key: true` is used, the column `key` will be preserved.
986
+
987
+ ```ruby
988
+ df.slice_by(:string, keep_key: true) { "A".."C" }
989
+
990
+ # =>
991
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000073c44>
992
+ index float string
993
+ <uint8> <double> <string>
994
+ 1 0 0.0 A
995
+ 2 1 1.1 B
996
+ 3 2 2.2 C
997
+ ```
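The `slice_by` examples above can be approximated on a plain Hash of columns. This sketch assumes that labels are looked up by position in the key column and that a Range selects every row from its first label to its last; both `slice_by_sketch` and `TABLE` are hypothetical and say nothing about how RedAmber implements the method.

```ruby
# Hypothetical stand-in for the DataFrame above: a Hash of key => Array.
TABLE = {
  index: [0, 1, 2, 3, nil],
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
  string: ['A', 'B', 'C', 'D', nil]
}.freeze

# Sketch: find row positions by label in the `key` column, select those
# rows, then drop the `key` column unless keep_key is true.
def slice_by_sketch(table, key, keep_key: false)
  labels = table.fetch(key)
  selector = yield
  positions =
    if selector.is_a?(Range)
      # Assumption: a Range covers all rows between its first and last label.
      (labels.index(selector.begin)..labels.index(selector.end)).to_a
    else
      Array(selector).map { |label| labels.index(label) }
    end
  sliced = table.transform_values { |values| values.values_at(*positions) }
  keep_key ? sliced : sliced.reject { |k, _| k == key }
end

p slice_by_sketch(TABLE, :string) { %w[A C] }.keys
# => [:index, :float]
p slice_by_sketch(TABLE, :string, keep_key: true) { 'A'..'C' }[:string]
# => ["A", "B", "C"]
```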
998
+
824
999
  ## Updating
825
1000
 
826
1001
  ### `sort`
@@ -830,11 +1005,11 @@ penguins.to_rover
830
1005
  - "-key" denotes descending order
831
1006
 
832
1007
  ```ruby
833
- df = RedAmber::DataFrame.new({
1008
+ df = RedAmber::DataFrame.new(
834
1009
  index: [1, 1, 0, nil, 0],
835
1010
  string: ['C', 'B', nil, 'A', 'B'],
836
1011
  bool: [nil, true, false, true, false],
837
- })
1012
+ )
838
1013
  df.sort(:index, '-bool')
839
1014
 
840
1015
  # =>
@@ -860,16 +1035,10 @@ penguins.to_rover
860
1035
 
861
1036
  ## Grouping
862
1037
 
863
- ### `group(aggregating_keys)`
864
-
865
- (
866
- This API will change in the future version. Especcially I want to change:
867
- - Order of the column of the result (aggregation_keys should be the first)
868
- - DataFrame#group will accept a block (heronshoes/red_amber #28)
869
- )
1038
+ ### `group(group_keys)`
870
1039
 
871
1040
  `group` creates a class `Group` object. `Group` accepts functions below as a method.
872
- Method accepts options as `summary_keys`.
1041
+ Method accepts options as `group_keys`.
873
1042
 
874
1043
  Available functions are:
875
1044
 
@@ -889,8 +1058,8 @@ penguins.to_rover
889
1058
  - [ ] tdigest
890
1059
  - ✓ variance
891
1060
 
892
- For the each group of `aggregation_keys`, the aggregation `function` is applied and returns a new dataframe with aggregated keys according to `summary_keys`.
893
- Aggregated key name is `function(summary_key)` style.
1061
+ For each group of `group_keys`, the aggregation `function` is applied and returns a new DataFrame with aggregated keys according to `summary_keys`.
1062
+ Summary keys are named in the `function(summary_key)` style.
894
1063
 
895
1064
  This is an example of grouping of famous STARWARS dataset.
896
1065
 
@@ -900,18 +1069,18 @@ penguins.to_rover
900
1069
  starwars
901
1070
 
902
1071
  # =>
903
- #<RedAmber::DataFrame : 87 x 12 Vectors, 0x00000000000773bc>
904
- species name height mass hair_color skin_color eye_color ... homeworld
905
- <string> <string> <int64> <double> <string> <string> <string> ... <string>
906
- Human 1 Luke Skywalker 172 77.0 blond fair blue ... Tatooine
907
- Droid 2 C-3PO 167 75.0 NA gold yellow ... Tatooine
908
- Droid 3 R2-D2 96 32.0 NA white, blue red ... Naboo
909
- Human 4 Darth Vader 202 136.0 none white yellow ... Tatooine
910
- Human 5 Leia Organa 150 49.0 brown light brown ... Alderaan
911
- : : : : : : : : ... :
912
- Droid 85 BB8 (nil) (nil) none none black ... NA
913
- NA 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
914
- Human 87 Padmé Amidala 165 45.0 brown light brown ... Naboo
1072
+ #<RedAmber::DataFrame : 87 x 12 Vectors, 0x0000000000005a50>
1073
+ unnamed1 name height mass hair_color skin_color eye_color ... species
1074
+ <int64> <string> <int64> <double> <string> <string> <string> ... <string>
1075
+ 1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
1076
+ 2 2 C-3PO 167 75.0 NA gold yellow ... Droid
1077
+ 3 3 R2-D2 96 32.0 NA white, blue red ... Droid
1078
+ 4 4 Darth Vader 202 136.0 none white yellow ... Human
1079
+ 5 5 Leia Organa 150 49.0 brown light brown ... Human
1080
+ : : : : : : : : ... :
1081
+ 85 85 BB8 (nil) (nil) none none black ... Droid
1082
+ 86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
1083
+ 87 87 Padmé Amidala 165 45.0 brown light brown ... Human
915
1084
 
916
1085
  starwars.tdr(12)
917
1086
 
@@ -919,7 +1088,7 @@ penguins.to_rover
919
1088
  RedAmber::DataFrame : 87 x 12 Vectors
920
1089
  Vectors : 4 numeric, 8 strings
921
1090
  # key type level data_preview
922
- 1 :"" int64 87 [1, 2, 3, 4, 5, ... ]
1091
+ 1 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ]
923
1092
  2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
924
1093
  3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
925
1094
  4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
@@ -933,82 +1102,176 @@ penguins.to_rover
933
1102
  12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
934
1103
  ```
935
1104
 
936
- We can aggregate for `:species` and calculate the mean of `:mass` and `:height`.
1105
+ We can group by `:species` and calculate the count.
1106
+
1107
+ ```ruby
1108
+ starwars.group(:species).count(:species)
1109
+
1110
+ # =>
1111
+ #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0>
1112
+ species count
1113
+ <string> <int64>
1114
+ 1 Human 35
1115
+ 2 Droid 6
1116
+ 3 Wookiee 2
1117
+ 4 Rodian 1
1118
+ 5 Hutt 1
1119
+ : : :
1120
+ 36 Kaleesh 1
1121
+ 37 Pau'an 1
1122
+ 38 Kel Dor 1
1123
+ ```
1124
+
1125
+ We can also calculate the mean of `:mass` and `:height` together.
937
1126
 
938
1127
  ```ruby
939
- grouped = starwars.group(:species).mean(:mass, :height)
940
- grouped
1128
+ grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
941
1129
 
942
1130
  # =>
943
- #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000008e620>
944
- mean(mass) mean(height) species
945
- <double> <double> <string>
946
- 1 82.8 176.6 Human
947
- 2 69.8 131.2 Droid
948
- 3 124.0 231.0 Wookiee
949
- 4 74.0 173.0 Rodian
950
- 5 1358.0 175.0 Hutt
951
- : : : :
952
- 36 159.0 216.0 Kaleesh
953
- 37 80.0 206.0 Pau'an
954
- 38 80.0 188.0 Kel Dor
1131
+ #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc>
1132
+ species count mean(height) mean(mass)
1133
+ <string> <int64> <double> <double>
1134
+ 1 Human 35 176.6 82.8
1135
+ 2 Droid 6 131.2 69.8
1136
+ 3 Wookiee 2 231.0 124.0
1137
+ 4 Rodian 1 173.0 74.0
1138
+ 5 Hutt 1 175.0 1358.0
1139
+ : : : : :
1140
+ 36 Kaleesh 1 216.0 159.0
1141
+ 37 Pau'an 1 206.0 80.0
1142
+ 38 Kel Dor 1 188.0 80.0
955
1143
  ```
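The count-plus-mean aggregation above can be mirrored with `Enumerable#group_by` in plain Ruby. The rows below are a small hypothetical sample, not the starwars dataset, and the sketch assumes that nil values are skipped when averaging; it illustrates the idea of grouped aggregation rather than the `Group` class itself.

```ruby
# Hypothetical rows standing in for a few records.
ROWS = [
  { species: 'Human', height: 172, mass: 77.0 },
  { species: 'Droid', height: 167, mass: 75.0 },
  { species: 'Droid', height: 96,  mass: 32.0 },
  { species: 'Human', height: 202, mass: 136.0 },
  { species: 'Droid', height: nil, mass: nil }
].freeze

# Mean that skips nil values; returns nil when nothing is left.
def mean(values)
  present = values.compact
  present.empty? ? nil : present.sum.to_f / present.size
end

# One summary Hash per group, with keys in `function(key)` style.
summary = ROWS.group_by { |row| row[:species] }.map do |species, members|
  {
    species: species,
    count: members.size,
    'mean(height)': mean(members.map { |m| m[:height] }),
    'mean(mass)': mean(members.map { |m| m[:mass] })
  }
end

summary.each do |group|
  puts [group[:species], group[:count], group[:'mean(height)'], group[:'mean(mass)']].join(' ')
end
# Human 2 187.0 106.5
# Droid 3 131.5 53.5
```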
956
1144
 
957
1145
  Select rows for count > 1.
958
1146
 
959
1147
  ```ruby
960
- count = starwars.group(:species).count(:species)[:'count(species)'] # => Vector
961
- grouped = grouped.slice(count > 1)
1148
+ grouped.slice(grouped[:count] > 1)
962
1149
 
963
1150
  # =>
964
- #<RedAmber::DataFrame : 9 x 3 Vectors, 0x0000000000098260>
965
- mean(mass) mean(height) species
966
- <double> <double> <string>
967
- 1 82.8 176.6 Human
968
- 2 69.8 131.2 Droid
969
- 3 124.0 231.0 Wookiee
970
- 4 74.0 208.7 Gungan
971
- 5 48.0 181.3 NA
972
- : : : :
973
- 7 55.0 179.0 Twi'lek
974
- 8 53.1 168.0 Mirialan
975
- 9 88.0 221.0 Kaminoan
1151
+ #<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000004c270>
1152
+ species count mean(height) mean(mass)
1153
+ <string> <int64> <double> <double>
1154
+ 1 Human 35 176.6 82.8
1155
+ 2 Droid 6 131.2 69.8
1156
+ 3 Wookiee 2 231.0 124.0
1157
+ 4 Gungan 3 208.7 74.0
1158
+ 5 NA 4 181.3 48.0
1159
+ : : : : :
1160
+ 7 Twi'lek 2 179.0 55.0
1161
+ 8 Mirialan 2 168.0 53.1
1162
+ 9 Kaminoan 2 221.0 88.0
976
1163
  ```
977
1164
 
978
- Assemble the result and change the order of columns.
1165
+ ## Reshape
1166
+
1167
+ ### `transpose`
1168
+
1169
+ Creates a transposed DataFrame from a wide (messy) DataFrame.
979
1170
 
980
1171
  ```ruby
981
- grouped.assign(count: count[count > 1]).pick { [2,3,0,1].map{ |i| keys[i] } }
1172
+ import_cars = RedAmber::DataFrame.load('test/entity/import_cars.tsv')
1173
+
1174
+ # =>
1175
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520>
1176
+ Year Audi BMW BMW_MINI Mercedes-Benz VW
1177
+ <int64> <int64> <int64> <int64> <int64> <int64>
1178
+ 1 2017 28336 52527 25427 68221 49040
1179
+ 2 2018 26473 50982 25984 67554 51961
1180
+ 3 2019 24222 46814 23813 66553 46794
1181
+ 4 2020 22304 35712 20196 57041 36576
1182
+ 5 2021 22535 35905 18211 51722 35215
1183
+ import_cars.transpose(:Manufacturer)
1184
+
1185
+ # =>
1186
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74>
1187
+ Manufacturer 2017 2018 2019 2020 2021
1188
+ <dictionary> <uint32> <uint32> <uint32> <uint16> <uint16>
1189
+ 1 Audi 28336 26473 24222 22304 22535
1190
+ 2 BMW 52527 50982 46814 35712 35905
1191
+ 3 BMW_MINI 25427 25984 23813 20196 18211
1192
+ 4 Mercedes-Benz 68221 67554 66553 57041 51722
1193
+ 5 VW 49040 51961 46794 36576 35215
1194
+ ```
982
1195
 
1196
+ The leftmost column is created from the original keys. The key of that column is
1197
+ set by the parameter `:name`. If `:name` is not specified, `:N` is used as the key.
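The reshaping done by `transpose` can be sketched on a plain Hash of columns. The sketch assumes, as the output above suggests, that the values of the first column become the new keys and that the remaining original keys fill the leftmost column; `transpose_sketch` and `WIDE_CARS` are hypothetical names.

```ruby
# Hypothetical stand-in for part of import_cars: a Hash of key => Array.
WIDE_CARS = {
  Year: [2017, 2018, 2019],
  Audi: [28_336, 26_473, 24_222],
  BMW: [52_527, 50_982, 46_814]
}.freeze

# Sketch: values of the first column become the new keys; the remaining
# original keys become the values of a new leftmost column called `name`.
def transpose_sketch(table, name: :N)
  first_key, *other_keys = table.keys
  transposed = { name => other_keys }
  table.fetch(first_key).each_with_index do |new_key, i|
    transposed[new_key] = other_keys.map { |key| table[key][i] }
  end
  transposed
end

result = transpose_sketch(WIDE_CARS, name: :Manufacturer)
p result.keys   # => [:Manufacturer, 2017, 2018, 2019]
p result[2017]  # => [28336, 52527]
```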
1198
+
1199
+ ### `to_long(*keep_keys)`
1200
+
1201
+ Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
1202
+
1203
+ - Parameter `keep_keys` specifies the key names to keep.
1204
+
1205
+ ```ruby
1206
+ import_cars.to_long(:Year)
1207
+
983
1208
  # =>
984
- #<RedAmber::DataFrame : 9 x 4 Vectors, 0x0000000000141838>
985
- species count mean(mass) mean(height)
986
- <string> <uint8> <double> <double>
987
- 1 Human 35 82.8 176.6
988
- 2 Droid 6 69.8 131.2
989
- 3 Wookiee 2 124.0 231.0
990
- 4 Gungan 3 74.0 208.7
991
- 5 NA 4 48.0 181.3
992
- : : : : :
993
- 7 Twi'lek 2 55.0 179.0
994
- 8 Mirialan 2 53.1 168.0
995
- 9 Kaminoan 2 88.0 221.0
1209
+ #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
1210
+ Year N V
1211
+ <uint16> <dictionary> <uint32>
1212
+ 1 2017 Audi 28336
1213
+ 2 2017 BMW 52527
1214
+ 3 2017 BMW_MINI 25427
1215
+ 4 2017 Mercedes-Benz 68221
1216
+ 5 2017 VW 49040
1217
+ : : : :
1218
+ 23 2021 BMW_MINI 18211
1219
+ 24 2021 Mercedes-Benz 51722
1220
+ 25 2021 VW 35215
996
1221
  ```
997
1222
 
998
- ## Combining DataFrames
1223
+ - Option `:name` is the key of the column which came **from key names**.
1224
+ - Option `:value` is the key of the column which came **from values**.
999
1225
 
1000
- - [ ] Combining rows to a dataframe
1226
+ ```ruby
1227
+ import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
1001
1228
 
1002
- - [ ] Add vars
1229
+ # =>
1230
+ #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
1231
+ Year Manufacturer Num_of_imported
1232
+ <uint16> <dictionary> <uint32>
1233
+ 1 2017 Audi 28336
1234
+ 2 2017 BMW 52527
1235
+ 3 2017 BMW_MINI 25427
1236
+ 4 2017 Mercedes-Benz 68221
1237
+ 5 2017 VW 49040
1238
+ : : : :
1239
+ 23 2021 BMW_MINI 18211
1240
+ 24 2021 Mercedes-Benz 51722
1241
+ 25 2021 VW 35215
1242
+ ```
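What `to_long` does with `keep_keys`, `:name` and `:value` can be shown on a plain Hash of columns. This is a minimal sketch under the assumption that rows are emitted in the order seen above (all melted keys for the first row, then the second row, and so on); `to_long_sketch` and `WIDE_SAMPLE` are hypothetical.

```ruby
# Hypothetical two-year excerpt in wide form: a Hash of key => Array.
WIDE_SAMPLE = {
  Year: [2017, 2018],
  Audi: [28_336, 26_473],
  BMW: [52_527, 50_982]
}.freeze

# Sketch: every non-kept key becomes a value in the `name` column,
# and its cell becomes a value in the `value` column.
def to_long_sketch(wide, *keep_keys, name: :N, value: :V)
  long = keep_keys.to_h { |key| [key, []] }
  long[name] = []
  long[value] = []
  melt_keys = wide.keys - keep_keys
  n_rows = wide.values.map(&:size).max.to_i
  n_rows.times do |i|
    melt_keys.each do |key|
      keep_keys.each { |keep| long[keep] << wide[keep][i] }
      long[name] << key
      long[value] << wide[key][i]
    end
  end
  long
end

long = to_long_sketch(WIDE_SAMPLE, :Year, name: :Manufacturer, value: :Num_of_imported)
p long[:Year]             # => [2017, 2017, 2018, 2018]
p long[:Manufacturer]     # => [:Audi, :BMW, :Audi, :BMW]
p long[:Num_of_imported]  # => [28336, 52527, 26473, 50982]
```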
1003
1243
 
1004
- - [ ] Inner join
1244
+ ### `to_wide`
1005
1245
 
1006
- - [ ] Left join
1246
+ Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
1007
1247
 
1008
- ## Encoding
1248
+ - Option `:name` is the key of the column which will be expanded **to key names**.
1249
+ - Option `:value` is the key of the column which will be expanded **to values**.
1009
1250
 
1010
- - [ ] One-hot encoding
1251
+ ```ruby
1252
+ import_cars.to_long(:Year).to_wide
1253
+ # import_cars.to_long(:Year).to_wide(name: :N, value: :V)
1254
+ # is also OK
1011
1255
 
1012
- ## Iteration (not impremented)
1256
+ # =>
1257
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
1258
+ Year Audi BMW BMW_MINI Mercedes-Benz VW
1259
+ <uint16> <uint16> <uint16> <uint16> <uint32> <uint16>
1260
+ 1 2017 28336 52527 25427 68221 49040
1261
+ 2 2018 26473 50982 25984 67554 51961
1262
+ 3 2019 24222 46814 23813 66553 46794
1263
+ 4 2020 22304 35712 20196 57041 36576
1264
+ 5 2021 22535 35905 18211 51722 35215
1265
+
1266
+ # == import_cars
1267
+ ```
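The inverse reshaping can be sketched in the same plain-Ruby style. The sketch assumes that every column other than `name` and `value` identifies a row of the wide result; `to_wide_sketch` and `LONG_SAMPLE` are hypothetical names, not RedAmber's implementation.

```ruby
# Hypothetical long-form excerpt: a Hash of key => Array.
LONG_SAMPLE = {
  Year: [2017, 2017, 2018, 2018],
  N: %i[Audi BMW Audi BMW],
  V: [28_336, 52_527, 26_473, 50_982]
}.freeze

# Sketch: values of the `name` column become keys, values of the `value`
# column fill the cells, and the remaining columns identify each wide row.
def to_wide_sketch(long, name: :N, value: :V)
  keep_keys = long.keys - [name, value]
  new_keys = long[name].uniq
  rows = {}
  long[name].each_index do |i|
    id = keep_keys.map { |key| long[key][i] }
    (rows[id] ||= {})[long[name][i]] = long[value][i]
  end
  wide = (keep_keys + new_keys).to_h { |key| [key, []] }
  rows.each do |id, cells|
    keep_keys.each_with_index { |key, j| wide[key] << id[j] }
    new_keys.each { |key| wide[key] << cells[key] }
  end
  wide
end

wide = to_wide_sketch(LONG_SAMPLE)
p wide.keys    # => [:Year, :Audi, :BMW]
p wide[:Audi]  # => [28336, 26473]
```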
1268
+
1269
+ ## Combine
1270
+
1271
+ - [ ] Combining dataframes
1013
1272
 
1014
- - [ ] each_rows
1273
+ - [ ] Join
1274
+
1275
+ ## Encoding
1276
+
1277
+ - [ ] One-hot encoding