red_amber 0.1.7 → 0.2.1

data/doc/DataFrame.md CHANGED
@@ -155,7 +155,25 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
155
155
 
156
156
  ### `indices`, `indexes`
157
157
 
158
- - Returns all indexes in an Array.
158
+ - Returns indexes in an Array.
159
+ Accepts an optional `start` argument as the first index.
160
+
161
+ ```ruby
162
+ df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
163
+ df.indices
164
+
165
+ # =>
166
+ [0, 1, 2, 3, 4]
167
+
168
+ df.indices(1)
169
+
170
+ # =>
171
+ [1, 2, 3, 4, 5]
172
+
173
+ df.indices(:a)
174
+ # =>
175
+ [:a, :b, :c, :d, :e]
176
+ ```
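As a rough plain-Ruby model of the three calls above (not RedAmber's actual implementation), the index list can be built by calling `succ` repeatedly, beginning at `start`. The helper name `indices_from` is hypothetical and exists only for this sketch.

```ruby
# Plain-Ruby sketch; `indices_from` is a hypothetical helper, not a RedAmber method.
# Builds `size` consecutive index labels, beginning at `start`, by calling `succ`.
def indices_from(size, start = 0)
  current = start
  Array.new(size) do
    value = current
    current = current.succ
    value
  end
end

p indices_from(5)      # => [0, 1, 2, 3, 4]
p indices_from(5, 1)   # => [1, 2, 3, 4, 5]
p indices_from(5, :a)  # => [:a, :b, :c, :d, :e]
```

Any object that responds to `succ` (Integer, String, Symbol) works as `start` in this model.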
159
177
 
160
178
  ### `to_h`
161
179
 
@@ -167,6 +185,11 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
167
185
 
168
186
  If you need a column-oriented full array, use `.to_h.to_a`
169
187
 
188
+ ### `each_row`
189
+
190
+ Yields each row in a `{ key => row }` Hash.
191
+ Returns an Enumerator if no block is given.
192
+
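The block/Enumerator contract described for `each_row` can be sketched in plain Ruby over a Hash of columns. This is only a model, assuming each yielded Hash maps a column key to that row's value; `RowModel` is a hypothetical name and not part of RedAmber.

```ruby
# Plain-Ruby sketch over a Hash of columns; `RowModel` is a hypothetical name.
module RowModel
  # Yields one Hash per row, or returns an Enumerator when no block is given.
  def self.each_row(columns)
    return enum_for(:each_row, columns) unless block_given?

    size = columns.values.map(&:size).max || 0
    size.times do |i|
      yield columns.transform_values { |values| values[i] }
    end
  end
end

columns = { x: [1, 2, 3], y: %w[A B C] }

# With a block: each row is yielded as a Hash.
RowModel.each_row(columns) { |row| p row }

# Without a block: an Enumerator is returned, so Enumerable methods can be chained.
p RowModel.each_row(columns).map { |row| row[:x] * 10 }  # => [10, 20, 30]
```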
170
193
  ### `schema`
171
194
 
172
195
  - Returns column name and data type in a Hash.
@@ -202,7 +225,22 @@ puts penguins.to_s
202
225
  `inspect` uses `to_s` output and also shows shape and object_id.
203
226
 
204
227
 
205
- ### `summary`, `describe` (not implemented)
228
+ ### `summary`, `describe`
229
+
230
+ `DataFrame#summary` or `DataFrame#describe` shows summary statistics in a DataFrame.
231
+
232
+ ```ruby
233
+ puts penguins.summary.to_s(width: 82) # needs more width to show all stats in this example
234
+
235
+ # =>
236
+ variables count mean std min 25% median 75% max
237
+ <dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double>
238
+ 1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6
239
+ 2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5
240
+ 3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0
241
+ 4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0
242
+ 5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0
243
+ ```
206
244
 
207
245
  ### `to_rover`
208
246
 
@@ -352,13 +390,13 @@ penguins.to_rover
352
390
 
353
391
  ### `pick ` - pick up variables by key label -
354
392
 
355
- Pick up some variables (columns) to create a sub DataFrame.
393
+ Pick up some columns (variables) to create a sub DataFrame.
356
394
 
357
395
  ![pick method image](doc/../image/dataframe/pick.png)
358
396
 
359
397
  - Keys as arguments
360
398
 
361
- `pick(keys)` accepts keys as arguments in an Array.
399
+ `pick(keys)` accepts keys as arguments in an Array or a Range.
362
400
 
363
401
  ```ruby
364
402
  penguins.pick(:species, :bill_length_mm)
@@ -378,9 +416,31 @@ penguins.to_rover
378
416
  344 Gentoo 49.9
379
417
  ```
380
418
 
381
- - Booleans as a argument
419
+ - Indices as arguments
420
+
421
+ `pick(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
422
+
423
+ ```ruby
424
+ penguins.pick(0..2, -1)
425
+
426
+ # =>
427
+ #<RedAmber::DataFrame : 344 x 4 Vectors, 0x0000000000055ce4>
428
+ species island bill_length_mm year
429
+ <string> <string> <double> <uint16>
430
+ 1 Adelie Torgersen 39.1 2007
431
+ 2 Adelie Torgersen 39.5 2007
432
+ 3 Adelie Torgersen 40.3 2007
433
+ 4 Adelie Torgersen (nil) 2007
434
+ 5 Adelie Torgersen 36.7 2007
435
+ : : : : :
436
+ 342 Gentoo Biscoe 50.4 2009
437
+ 343 Gentoo Biscoe 45.2 2009
438
+ 344 Gentoo Biscoe 49.9 2009
439
+ ```
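How indices, Ranges and boolean masks could resolve to column keys can be modeled with `Array#values_at` and `Array#select`. This is a plain-Ruby sketch under assumptions, not RedAmber's code: `resolve_pick` is a hypothetical helper, and the key list is assumed to be the eight columns of the penguins dataset.

```ruby
# Plain-Ruby sketch; `resolve_pick` is a hypothetical helper, not RedAmber code.
# Column keys of the penguins dataset (assumed here for illustration).
KEYS = %i[species island bill_length_mm bill_depth_mm
          flipper_length_mm body_mass_g sex year].freeze

# Turns indices, Ranges of indices, or a boolean mask into a list of keys.
def resolve_pick(keys, *selectors)
  selectors = selectors.flatten
  if selectors.all? { |s| s == true || s == false }
    raise ArgumentError, 'mask must match the number of keys' unless selectors.size == keys.size

    keys.select.with_index { |_, i| selectors[i] }
  else
    keys.values_at(*selectors)
  end
end

p resolve_pick(KEYS, 0..2, -1)
# => [:species, :island, :bill_length_mm, :year]
p resolve_pick(KEYS, KEYS.map { |k| k.end_with?('mm') })
# => [:bill_length_mm, :bill_depth_mm, :flipper_length_mm]
```

In this model, `drop` would be the complement: `keys - resolve_pick(keys, ...)`.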
440
+
441
+ - Booleans as arguments
382
442
 
383
- `pick(booleans)` accepts booleans as a argument in an Array. Booleans must be same length as `n_keys`.
443
+ `pick(booleans)` accepts booleans as arguments in an Array. Booleans must be the same length as `n_keys`.
384
444
 
385
445
  ```ruby
386
446
  penguins.pick(penguins.types.map { |type| type == :string })
@@ -400,9 +460,9 @@ penguins.to_rover
400
460
  344 Gentoo Biscoe male
401
461
  ```
402
462
 
403
- - Keys or booleans by a block
463
+ - Keys or booleans by a block
404
464
 
405
- `pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
465
+ `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
406
466
 
407
467
  ```ruby
408
468
  penguins.pick { keys.map { |key| key.end_with?('mm') } }
@@ -424,21 +484,25 @@ penguins.to_rover
424
484
 
425
485
  ### `drop ` - pick and drop -
426
486
 
427
- Drop some variables (columns) to create a remainer DataFrame.
487
+ Drop some columns (variables) to create a remainder DataFrame.
428
488
 
429
489
  ![drop method image](doc/../image/dataframe/drop.png)
430
490
 
431
491
  - Keys as arguments
432
492
 
433
- `drop(keys)` accepts keys as arguments in an Array.
493
+ `drop(keys)` accepts keys as arguments in an Array or a Range.
494
+
495
+ - Indices as arguments
496
+
497
+ `drop(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
434
498
 
435
- - Booleans as a argument
499
+ - Booleans as arguments
436
500
 
437
- `drop(booleans)` accepts booleans as a argument in an Array. Booleans must be same length as `n_keys`.
501
+ `drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
438
502
 
439
503
  - Keys or booleans by a block
440
504
 
441
- `drop {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
505
+ `drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
442
506
 
443
507
  - Notice for nil
444
508
 
@@ -473,9 +537,20 @@ penguins.to_rover
473
537
  [1, 2, 3]
474
538
  ```
475
539
 
540
+ A simple key name is usable as a method of the DataFrame if the key name is acceptable as a method name.
541
+ It returns the same Vector as `[]`.
542
+
543
+ ```ruby
544
+ df.a
545
+
546
+ # =>
547
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
548
+ [1, 2, 3]
549
+ ```
550
+
476
551
  ### `slice ` - to cut vertically is slice -
477
552
 
478
- Slice and select observations (rows) to create a sub DataFrame.
553
+ Slice and select rows (observations) to create a sub DataFrame.
479
554
 
480
555
  ![slice method image](doc/../image/dataframe/slice.png)
481
556
 
@@ -506,7 +581,7 @@ penguins.to_rover
506
581
 
507
582
  - Booleans as an argument
508
583
 
509
- `slice(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
584
+ `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
510
585
 
511
586
  ```ruby
512
587
  vector = penguins[:bill_length_mm]
@@ -583,7 +658,7 @@ penguins.to_rover
583
658
 
584
659
  ### `remove`
585
660
 
586
- Slice and reject observations (rows) to create a remainer DataFrame.
661
+ Slice and reject rows (observations) to create a remainder DataFrame.
587
662
 
588
663
  ![remove method image](doc/../image/dataframe/remove.png)
589
664
 
@@ -612,7 +687,7 @@ penguins.to_rover
612
687
 
613
688
  - Booleans as an argument
614
689
 
615
- `remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
690
+ `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
616
691
 
617
692
  ```ruby
618
693
  # remove all observation contains nil
@@ -640,10 +715,12 @@ penguins.to_rover
640
715
 
641
716
  ```ruby
642
717
  penguins.remove do
643
- vector = self[:bill_length_mm]
644
- min = vector.mean - vector.std
645
- max = vector.mean + vector.std
646
- vector.to_a.map { |e| (min..max).include? e }
718
+ # We will use another style shown in slice
719
+ # self.bill_length_mm returns Vector
720
+ mean = bill_length_mm.mean
721
+ min = mean - bill_length_mm.std
722
+ max = mean + bill_length_mm.std
723
+ bill_length_mm.to_a.map { |e| (min..max).include? e }
647
724
  end
648
725
 
649
726
  # =>
@@ -660,6 +737,7 @@ penguins.to_rover
660
737
  139 Gentoo Biscoe 50.4 15.7 222 ... 2009
661
738
  140 Gentoo Biscoe 49.9 16.1 213 ... 2009
662
739
  ```
740
+
663
741
  - Notice for nil
664
742
  - When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
665
743
 
@@ -704,7 +782,7 @@ penguins.to_rover
704
782
 
705
783
  - Key pairs as arguments
706
784
 
707
- `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
785
+ `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`.
708
786
 
709
787
  ```ruby
710
788
  df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
@@ -721,7 +799,11 @@ penguins.to_rover
721
799
 
722
800
  - Key pairs by a block
723
801
 
724
- `rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. Block is called in the context of self.
802
+ `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. The block is called in the context of self.
803
+
804
+ - Not existing keys
805
+
806
+ If a specified `existing_key` does not exist, a `DataFrameArgumentError` is raised.
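The accepted key-pair shapes and the missing-key check can be sketched in plain Ruby. This is only a model of the rules described for `rename`, not RedAmber's implementation; `rename_keys` and `MissingKeyError` are hypothetical names (RedAmber itself raises `DataFrameArgumentError`).

```ruby
# Plain-Ruby sketch; `rename_keys` and `MissingKeyError` are hypothetical names.
class MissingKeyError < ArgumentError; end

# Accepts key pairs as a Hash or as an Array of [existing_key, new_key] Arrays.
def rename_keys(keys, key_pairs)
  pairs = key_pairs.to_h
  missing = pairs.keys - keys
  raise MissingKeyError, "key not found: #{missing.inspect}" unless missing.empty?

  keys.map { |key| pairs.fetch(key, key) }
end

keys = %i[name age]
p rename_keys(keys, { age: :age_in_1993 })   # => [:name, :age_in_1993]
p rename_keys(keys, [%i[name first_name]])   # => [:first_name, :age]

begin
  rename_keys(keys, { height: :tall })
rescue MissingKeyError => e
  puts "rejected: #{e.message}"
end
```

Calling `to_h` on the argument is what lets both the Hash form and the Array-of-Arrays form share one code path in this sketch.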
725
807
 
726
808
  - Key type
727
809
 
@@ -729,16 +811,16 @@ penguins.to_rover
729
811
 
730
812
  ### `assign`
731
813
 
732
- Assign new or updated variables (columns) and create a updated DataFrame.
814
+ Assign new or updated columns (variables) and create an updated DataFrame.
733
815
 
734
- - Variables with new keys will append new variables at bottom (right in the table).
816
+ - Variables with new keys will append new columns from the right.
735
817
  - Variables with existing keys will update corresponding vectors.
736
818
 
737
819
  ![assign method image](doc/../image/dataframe/assign.png)
738
820
 
739
821
  - Variables as arguments
740
822
 
741
- `assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
823
+ `assign(key_pairs)` accepts pairs of key and values as parameters. `key_pairs` should be a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`.
742
824
 
743
825
  ```ruby
744
826
  df = RedAmber::DataFrame.new(
@@ -748,15 +830,19 @@ penguins.to_rover
748
830
 
749
831
  # =>
750
832
  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
751
- name age
752
- <string> <uint8>
753
- 1 Yasuko 68
754
- 2 Rui 49
833
+ name age
834
+ <string> <uint8>
835
+ 1 Yasuko 68
836
+ 2 Rui 49
755
837
  3 Hinata 28
756
838
 
757
839
  # update :age and add :brother
758
- assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
759
- df.assign(assigner)
840
+ df.assign do
841
+ {
842
+ age: age + 29,
843
+ brother: ['Santa', nil, 'Momotaro']
844
+ }
845
+ end
760
846
 
761
847
  # =>
762
848
  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
@@ -769,13 +855,14 @@ penguins.to_rover
769
855
 
770
856
  - Key pairs by a block
771
857
 
772
- `assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. Block is called in the context of self.
858
+ `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`. The block is called in the context of self.
773
859
 
774
860
  ```ruby
775
861
  df = RedAmber::DataFrame.new(
776
862
  index: [0, 1, 2, 3, nil],
777
863
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
778
- string: ['A', 'B', 'C', 'D', nil])
864
+ string: ['A', 'B', 'C', 'D', nil]
865
+ )
779
866
  df
780
867
 
781
868
  # =>
@@ -788,29 +875,27 @@ penguins.to_rover
788
875
  4 3 NaN D
789
876
  5 (nil) (nil) (nil)
790
877
 
791
- # update numeric variables
878
+ # update :float
879
+ # assigner by an Array
792
880
  df.assign do
793
- assigner = {}
794
- vectors.each_with_index do |v, i|
795
- assigner[keys[i]] = v * -1 if v.numeric?
796
- end
797
- assigner
881
+ vectors.select(&:float?)
882
+ .map { |v| [v.key, -v] }
798
883
  end
799
884
 
800
885
  # =>
801
- #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000006e000>
802
- index float string
803
- <int8> <double> <string>
804
- 1 0 -0.0 A
805
- 2 -1 -1.1 B
806
- 3 -2 -2.2 C
807
- 4 -3 NaN D
808
- 5 (nil) (nil) (nil)
809
-
810
- # Or it ’s shorter like this:
886
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
887
+ index float string
888
+ <uint8> <double> <string>
889
+ 1 0 -0.0 A
890
+ 2 1 -1.1 B
891
+ 3 2 -2.2 C
892
+ 4 3 NaN D
893
+ 5 (nil) (nil) (nil)
894
+
895
+ # Or we can use assigner by a Hash
811
896
  df.assign do
812
- variables.select.with_object({}) do |(key, vector), assigner|
813
- assigner[key] = vector * -1 if vector.numeric?
897
+ vectors.select.with_object({}) do |v, assigner|
898
+ assigner[v.key] = -v if v.float?
814
899
  end
815
900
  end
816
901
 
@@ -821,6 +906,96 @@ penguins.to_rover
821
906
 
822
907
  Symbol key and String key are considered as the same key.
823
908
 
909
+ - Empty assignment
910
+
911
+ If assigner is empty or nil, returns self.
912
+
913
+ - Append from left
914
+
915
+ The `assign_left` method accepts the same parameters and block as `assign`, but appends new columns from the left.
916
+
917
+ ```ruby
918
+ df.assign_left(new_index: df.indices(1))
919
+
920
+ # =>
921
+ #<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
922
+ new_index index float string
923
+ <uint8> <uint8> <double> <string>
924
+ 1 1 0 0.0 A
925
+ 2 2 1 1.1 B
926
+ 3 3 2 2.2 C
927
+ 4 4 3 NaN D
928
+ 5 5 (nil) (nil) (nil)
929
+ ```
930
+
931
+ ### `slice_by(key, keep_key: false) { block }`
932
+
933
+ `slice_by` accepts a key and a block to select rows.
934
+
935
+ (Since 0.2.1)
936
+
937
+ ```ruby
938
+ df = RedAmber::DataFrame.new(
939
+ index: [0, 1, 2, 3, nil],
940
+ float: [0.0, 1.1, 2.2, Float::NAN, nil],
941
+ string: ['A', 'B', 'C', 'D', nil]
942
+ )
943
+ df
944
+
945
+ # =>
946
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
947
+ index float string
948
+ <uint8> <double> <string>
949
+ 1 0 0.0 A
950
+ 2 1 1.1 B
951
+ 3 2 2.2 C
952
+ 4 3 NaN D
953
+ 5 (nil) (nil) (nil)
954
+
955
+ df.slice_by(:string) { ["A", "C"] }
956
+
957
+ # =>
958
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001b1ac>
959
+ index float
960
+ <uint8> <double>
961
+ 1 0 0.0
962
+ 2 2 2.2
963
+ ```
964
+
965
+ This behaves the same as:
966
+
967
+ ```ruby
968
+ df.slice { [string.index("A"), string.index("C")] }.drop(:string)
969
+ ```
970
+
971
+ `slice_by` also accepts a Range.
972
+
973
+ ```ruby
974
+ df.slice_by(:string) { "A".."C" }
975
+
976
+ # =>
977
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000069668>
978
+ index float
979
+ <uint8> <double>
980
+ 1 0 0.0
981
+ 2 1 1.1
982
+ 3 2 2.2
983
+ ```
984
+
985
+ When the option `keep_key: true` is used, the column `key` will be preserved.
986
+
987
+ ```ruby
988
+ df.slice_by(:string, keep_key: true) { "A".."C" }
989
+
990
+ # =>
991
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000073c44>
992
+ index float string
993
+ <uint8> <double> <string>
994
+ 1 0 0.0 A
995
+ 2 1 1.1 B
996
+ 3 2 2.2 C
997
+ ```
998
+
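The selection rules shown for `slice_by` (labels in an Array, a Range of labels, and `keep_key`) can be modeled in plain Ruby over a Hash of columns. This is a sketch under assumptions, not RedAmber's implementation: `slice_by_model` is a hypothetical helper, and it assumes every requested label exists in the key column.

```ruby
# Plain-Ruby sketch over a Hash of columns; not RedAmber's implementation.
# The block returns labels (an Array) or a Range of labels found in column `key`.
def slice_by_model(columns, key, keep_key: false)
  labels = columns.fetch(key)
  selector = yield
  positions =
    if selector.is_a?(Range)
      # A Range selects every row from the first label through the last label.
      (labels.index(selector.begin)..labels.index(selector.end)).to_a
    else
      Array(selector).map { |label| labels.index(label) }
    end
  sliced = columns.transform_values { |values| values.values_at(*positions) }
  keep_key ? sliced : sliced.reject { |k, _| k == key }
end

columns = {
  index: [0, 1, 2, 3, nil],
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
  string: ['A', 'B', 'C', 'D', nil]
}

p slice_by_model(columns, :string) { %w[A C] }
p slice_by_model(columns, :string) { 'A'..'C' }
p slice_by_model(columns, :string, keep_key: true) { 'A'..'C' }
```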
824
999
  ## Updating
825
1000
 
826
1001
  ### `sort`
@@ -830,11 +1005,11 @@ penguins.to_rover
830
1005
  - "-key" denotes descending order
831
1006
 
832
1007
  ```ruby
833
- df = RedAmber::DataFrame.new({
1008
+ df = RedAmber::DataFrame.new(
834
1009
  index: [1, 1, 0, nil, 0],
835
1010
  string: ['C', 'B', nil, 'A', 'B'],
836
1011
  bool: [nil, true, false, true, false],
837
- })
1012
+ )
838
1013
  df.sort(:index, '-bool')
839
1014
 
840
1015
  # =>
@@ -860,16 +1035,10 @@ penguins.to_rover
860
1035
 
861
1036
  ## Grouping
862
1037
 
863
- ### `group(aggregating_keys)`
864
-
865
- (
866
- This API will change in the future version. Especcially I want to change:
867
- - Order of the column of the result (aggregation_keys should be the first)
868
- - DataFrame#group will accept a block (heronshoes/red_amber #28)
869
- )
1038
+ ### `group(group_keys)`
870
1039
 
871
1040
  `group` creates a class `Group` object. `Group` accepts functions below as a method.
872
- Method accepts options as `summary_keys`.
1041
+ Method accepts options as `group_keys`.
873
1042
 
874
1043
  Available functions are:
875
1044
 
@@ -889,8 +1058,8 @@ penguins.to_rover
889
1058
  - [ ] tdigest
890
1059
  - ✓ variance
891
1060
 
892
- For the each group of `aggregation_keys`, the aggregation `function` is applied and returns a new dataframe with aggregated keys according to `summary_keys`.
893
- Aggregated key name is `function(summary_key)` style.
1061
+ For each group of `group_keys`, the aggregation `function` is applied, and a new DataFrame is returned with aggregated keys according to `summary_keys`.
1062
+ Summary key names are provided by `function(summary_keys)` style.
894
1063
 
895
1064
  This is an example of grouping of famous STARWARS dataset.
896
1065
 
@@ -900,18 +1069,18 @@ penguins.to_rover
900
1069
  starwars
901
1070
 
902
1071
  # =>
903
- #<RedAmber::DataFrame : 87 x 12 Vectors, 0x00000000000773bc>
904
- species name height mass hair_color skin_color eye_color ... homeworld
905
- <string> <string> <int64> <double> <string> <string> <string> ... <string>
906
- Human 1 Luke Skywalker 172 77.0 blond fair blue ... Tatooine
907
- Droid 2 C-3PO 167 75.0 NA gold yellow ... Tatooine
908
- Droid 3 R2-D2 96 32.0 NA white, blue red ... Naboo
909
- Human 4 Darth Vader 202 136.0 none white yellow ... Tatooine
910
- Human 5 Leia Organa 150 49.0 brown light brown ... Alderaan
911
- : : : : : : : : ... :
912
- Droid 85 BB8 (nil) (nil) none none black ... NA
913
- NA 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
914
- Human 87 Padmé Amidala 165 45.0 brown light brown ... Naboo
1072
+ #<RedAmber::DataFrame : 87 x 12 Vectors, 0x0000000000005a50>
1073
+ unnamed1 name height mass hair_color skin_color eye_color ... species
1074
+ <int64> <string> <int64> <double> <string> <string> <string> ... <string>
1075
+ 1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
1076
+ 2 2 C-3PO 167 75.0 NA gold yellow ... Droid
1077
+ 3 3 R2-D2 96 32.0 NA white, blue red ... Droid
1078
+ 4 4 Darth Vader 202 136.0 none white yellow ... Human
1079
+ 5 5 Leia Organa 150 49.0 brown light brown ... Human
1080
+ : : : : : : : : ... :
1081
+ 85 85 BB8 (nil) (nil) none none black ... Droid
1082
+ 86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
1083
+ 87 87 Padmé Amidala 165 45.0 brown light brown ... Human
915
1084
 
916
1085
  starwars.tdr(12)
917
1086
 
@@ -919,7 +1088,7 @@ penguins.to_rover
919
1088
  RedAmber::DataFrame : 87 x 12 Vectors
920
1089
  Vectors : 4 numeric, 8 strings
921
1090
  # key type level data_preview
922
- 1 :"" int64 87 [1, 2, 3, 4, 5, ... ]
1091
+ 1 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ]
923
1092
  2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
924
1093
  3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
925
1094
  4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
@@ -933,82 +1102,176 @@ penguins.to_rover
933
1102
  12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
934
1103
  ```
935
1104
 
936
- We can aggregate for `:species` and calculate the mean of `:mass` and `:height`.
1105
+ We can group by `:species` and calculate the count.
1106
+
1107
+ ```ruby
1108
+ starwars.group(:species).count(:species)
1109
+
1110
+ # =>
1111
+ #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0>
1112
+ species count
1113
+ <string> <int64>
1114
+ 1 Human 35
1115
+ 2 Droid 6
1116
+ 3 Wookiee 2
1117
+ 4 Rodian 1
1118
+ 5 Hutt 1
1119
+ : : :
1120
+ 36 Kaleesh 1
1121
+ 37 Pau'an 1
1122
+ 38 Kel Dor 1
1123
+ ```
1124
+
1125
+ We can also calculate the mean of `:mass` and `:height` together.
937
1126
 
938
1127
  ```ruby
939
- grouped = starwars.group(:species).mean(:mass, :height)
940
- grouped
1128
+ grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
941
1129
 
942
1130
  # =>
943
- #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000008e620>
944
- mean(mass) mean(height) species
945
- <double> <double> <string>
946
- 1 82.8 176.6 Human
947
- 2 69.8 131.2 Droid
948
- 3 124.0 231.0 Wookiee
949
- 4 74.0 173.0 Rodian
950
- 5 1358.0 175.0 Hutt
951
- : : : :
952
- 36 159.0 216.0 Kaleesh
953
- 37 80.0 206.0 Pau'an
954
- 38 80.0 188.0 Kel Dor
1131
+ #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc>
1132
+ species count mean(height) mean(mass)
1133
+ <string> <int64> <double> <double>
1134
+ 1 Human 35 176.6 82.8
1135
+ 2 Droid 6 131.2 69.8
1136
+ 3 Wookiee 2 231.0 124.0
1137
+ 4 Rodian 1 173.0 74.0
1138
+ 5 Hutt 1 175.0 1358.0
1139
+ : : : : :
1140
+ 36 Kaleesh 1 216.0 159.0
1141
+ 37 Pau'an 1 206.0 80.0
1142
+ 38 Kel Dor 1 188.0 80.0
955
1143
  ```
956
1144
 
957
1145
  Select rows for count > 1.
958
1146
 
959
1147
  ```ruby
960
- count = starwars.group(:species).count(:species)[:'count(species)'] # => Vector
961
- grouped = grouped.slice(count > 1)
1148
+ grouped.slice(grouped[:count] > 1)
962
1149
 
963
1150
  # =>
964
- #<RedAmber::DataFrame : 9 x 3 Vectors, 0x0000000000098260>
965
- mean(mass) mean(height) species
966
- <double> <double> <string>
967
- 1 82.8 176.6 Human
968
- 2 69.8 131.2 Droid
969
- 3 124.0 231.0 Wookiee
970
- 4 74.0 208.7 Gungan
971
- 5 48.0 181.3 NA
972
- : : : :
973
- 7 55.0 179.0 Twi'lek
974
- 8 53.1 168.0 Mirialan
975
- 9 88.0 221.0 Kaminoan
1151
+ #<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000004c270>
1152
+ species count mean(height) mean(mass)
1153
+ <string> <int64> <double> <double>
1154
+ 1 Human 35 176.6 82.8
1155
+ 2 Droid 6 131.2 69.8
1156
+ 3 Wookiee 2 231.0 124.0
1157
+ 4 Gungan 3 208.7 74.0
1158
+ 5 NA 4 181.3 48.0
1159
+ : : : : :
1160
+ 7 Twi'lek 2 179.0 55.0
1161
+ 8 Mirialan 2 168.0 53.1
1162
+ 9 Kaminoan 2 221.0 88.0
976
1163
  ```
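The group, count and mean steps can be illustrated with a plain-Ruby sketch built on `Enumerable#group_by`. It is not RedAmber's implementation; the rows are the first five records from the preview above, and skipping `nil` values in the mean is an assumption about how missing data is treated.

```ruby
# Plain-Ruby sketch of "group, then count and mean"; not RedAmber's implementation.
# Rows are the first five records of the starwars preview.
rows = [
  { name: 'Luke Skywalker', height: 172, mass: 77.0, species: 'Human' },
  { name: 'C-3PO', height: 167, mass: 75.0, species: 'Droid' },
  { name: 'R2-D2', height: 96, mass: 32.0, species: 'Droid' },
  { name: 'Darth Vader', height: 202, mass: 136.0, species: 'Human' },
  { name: 'Leia Organa', height: 150, mass: 49.0, species: 'Human' }
]

# Mean that skips nil values (an assumption about how missing data is handled).
def mean_of(values)
  present = values.compact
  present.empty? ? nil : present.sum.to_f / present.size
end

# One summary row per species, with keys named in `function(summary_key)` style.
summary = rows.group_by { |row| row[:species] }.map do |species, members|
  {
    species: species,
    count: members.size,
    'mean(height)': mean_of(members.map { |m| m[:height] }),
    'mean(mass)': mean_of(members.map { |m| m[:mass] })
  }
end

summary.each { |row| p row }

# Keep only groups with more than one member, like `slice(grouped[:count] > 1)`.
p summary.select { |row| row[:count] > 1 }.map { |row| row[:species] }
# => ["Human", "Droid"]
```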
977
1164
 
978
- Assemble the result and change the order of columns.
1165
+ ## Reshape
1166
+
1167
+ ### `transpose`
1168
+
1169
+ Creates a transposed DataFrame from a wide (messy) DataFrame.
979
1170
 
980
1171
  ```ruby
981
- grouped.assign(count: count[count > 1]).pick { [2,3,0,1].map{ |i| keys[i] } }
1172
+ import_cars = RedAmber::DataFrame.load('test/entity/import_cars.tsv')
1173
+
1174
+ # =>
1175
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520>
1176
+ Year Audi BMW BMW_MINI Mercedes-Benz VW
1177
+ <int64> <int64> <int64> <int64> <int64> <int64>
1178
+ 1 2017 28336 52527 25427 68221 49040
1179
+ 2 2018 26473 50982 25984 67554 51961
1180
+ 3 2019 24222 46814 23813 66553 46794
1181
+ 4 2020 22304 35712 20196 57041 36576
1182
+ 5 2021 22535 35905 18211 51722 35215
1183
+ import_cars.transpose(:Manufacturer)
1184
+
1185
+ # =>
1186
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74>
1187
+ Manufacturer 2017 2018 2019 2020 2021
1188
+ <dictionary> <uint32> <uint32> <uint32> <uint16> <uint16>
1189
+ 1 Audi 28336 26473 24222 22304 22535
1190
+ 2 BMW 52527 50982 46814 35712 35905
1191
+ 3 BMW_MINI 25427 25984 23813 20196 18211
1192
+ 4 Mercedes-Benz 68221 67554 66553 57041 51722
1193
+ 5 VW 49040 51961 46794 36576 35215
1194
+ ```
982
1195
 
1196
+ The leftmost column is created from the original keys. The key name of the column is
1197
+ set by the parameter `:name`. If `:name` is not specified, `:N` is used for the key.
1198
+
1199
+ ### `to_long(*keep_keys)`
1200
+
1201
+ Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
1202
+
1203
+ - Parameter `keep_keys` specifies the key names to keep.
1204
+
1205
+ ```ruby
1206
+ import_cars.to_long(:Year)
1207
+
983
1208
  # =>
984
- #<RedAmber::DataFrame : 9 x 4 Vectors, 0x0000000000141838>
985
- species count mean(mass) mean(height)
986
- <string> <uint8> <double> <double>
987
- 1 Human 35 82.8 176.6
988
- 2 Droid 6 69.8 131.2
989
- 3 Wookiee 2 124.0 231.0
990
- 4 Gungan 3 74.0 208.7
991
- 5 NA 4 48.0 181.3
992
- : : : : :
993
- 7 Twi'lek 2 55.0 179.0
994
- 8 Mirialan 2 53.1 168.0
995
- 9 Kaminoan 2 88.0 221.0
1209
+ #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
1210
+ Year N V
1211
+ <uint16> <dictionary> <uint32>
1212
+ 1 2017 Audi 28336
1213
+ 2 2017 BMW 52527
1214
+ 3 2017 BMW_MINI 25427
1215
+ 4 2017 Mercedes-Benz 68221
1216
+ 5 2017 VW 49040
1217
+ : : : :
1218
+ 23 2021 BMW_MINI 18211
1219
+ 24 2021 Mercedes-Benz 51722
1220
+ 25 2021 VW 35215
996
1221
  ```
997
1222
 
998
- ## Combining DataFrames
1223
+ - Option `:name` is the key of the column which came **from key names**.
1224
+ - Option `:value` is the key of the column which came **from values**.
999
1225
 
1000
- - [ ] Combining rows to a dataframe
1226
+ ```ruby
1227
+ import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
1001
1228
 
1002
- - [ ] Add vars
1229
+ # =>
1230
+ #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
1231
+ Year Manufacturer Num_of_imported
1232
+ <uint16> <dictionary> <uint32>
1233
+ 1 2017 Audi 28336
1234
+ 2 2017 BMW 52527
1235
+ 3 2017 BMW_MINI 25427
1236
+ 4 2017 Mercedes-Benz 68221
1237
+ 5 2017 VW 49040
1238
+ : : : :
1239
+ 23 2021 BMW_MINI 18211
1240
+ 24 2021 Mercedes-Benz 51722
1241
+ 25 2021 VW 35215
1242
+ ```
1003
1243
 
1004
- - [ ] Inner join
1244
+ ### `to_wide`
1005
1245
 
1006
- - [ ] Left join
1246
+ Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
1007
1247
 
1008
- ## Encoding
1248
+ - Option `:name` is the key of the column which will be expanded **to key names**.
1249
+ - Option `:value` is the key of the column which will be expanded **to values**.
1009
1250
 
1010
- - [ ] One-hot encoding
1251
+ ```ruby
1252
+ import_cars.to_long(:Year).to_wide
1253
+ # import_cars.to_long(:Year).to_wide(name: :N, value: :V)
1254
+ # is also OK
1011
1255
 
1012
- ## Iteration (not impremented)
1256
+ # =>
1257
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
1258
+ Year Audi BMW BMW_MINI Mercedes-Benz VW
1259
+ <uint16> <uint16> <uint16> <uint16> <uint32> <uint16>
1260
+ 1 2017 28336 52527 25427 68221 49040
1261
+ 2 2018 26473 50982 25984 67554 51961
1262
+ 3 2019 24222 46814 23813 66553 46794
1263
+ 4 2020 22304 35712 20196 57041 36576
1264
+ 5 2021 22535 35905 18211 51722 35215
1265
+
1266
+ # == import_cars
1267
+ ```
1268
+
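The round trip between the wide and long layouts can be modeled in plain Ruby with Arrays of row Hashes. This is only a sketch of the reshaping idea, not RedAmber's implementation: `to_long_model` and `to_wide_model` are hypothetical helpers, and the data is cut down to two years and two manufacturers from the table above.

```ruby
# Plain-Ruby sketch of wide <-> long reshaping; not RedAmber's implementation.
# Rows are the first two years of the import_cars table, limited to two makers.
wide = [
  { Year: 2017, Audi: 28336, BMW: 52527 },
  { Year: 2018, Audi: 26473, BMW: 50982 }
]

# Every non-kept key becomes one row holding its key name and its value.
def to_long_model(rows, keep_key, name: :N, value: :V)
  rows.flat_map do |row|
    row.reject { |key, _| key == keep_key }.map do |key, val|
      { keep_key => row[keep_key], name => key, value => val }
    end
  end
end

# Inverse operation: rows sharing the same kept value are merged into one wide row.
def to_wide_model(rows, keep_key, name: :N, value: :V)
  rows.group_by { |row| row[keep_key] }.map do |kept, members|
    members.each_with_object({ keep_key => kept }) do |row, wide_row|
      wide_row[row[name]] = row[value]
    end
  end
end

long = to_long_model(wide, :Year)
long.each { |row| p row }
p long.size                            # => 4
p to_wide_model(long, :Year) == wide   # => true
```

The `name:` and `value:` keywords mirror the `:name` and `:value` options described above, with `:N` and `:V` as the defaults.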
1269
+ ## Combine
1270
+
1271
+ - [ ] Combining dataframes
1013
1272
 
1014
- - [ ] each_rows
1273
+ - [ ] Join
1274
+
1275
+ ## Encoding
1276
+
1277
+ - [ ] One-hot encoding