red_amber 0.1.6 → 0.1.7

data/doc/DataFrame.md CHANGED
@@ -9,8 +9,6 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
9
9
 
10
10
  ![dataframe model image](doc/../image/dataframe_model.png)
11
11
 
12
- (No change in this model in v0.1.6 .)
13
-
14
12
  ## Constructors and saving
15
13
 
16
14
  ### `new` from a Hash
@@ -37,6 +35,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
37
35
 
38
36
 
39
37
  ```ruby
38
+ require 'rover'
39
+
40
40
  rover = Rover::DataFrame.new(x: [1, 2, 3])
41
41
  RedAmber::DataFrame.new(rover)
42
42
  ```
@@ -61,6 +61,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
61
61
  - from a Parquet file
62
62
 
63
63
  ```ruby
64
+ require 'parquet'
65
+
64
66
  dataframe = RedAmber::DataFrame.load("file.parquet")
65
67
  ```
66
68
 
@@ -75,6 +77,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
75
77
  - to a Parquet file
76
78
 
77
79
  ```ruby
80
+ require 'parquet'
81
+
78
82
  dataframe.save("file.parquet")
79
83
  ```
80
84
 
@@ -175,12 +179,41 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
175
179
 
176
180
  ### `to_s`
177
181
 
182
+ `to_s` returns a preview of the Table.
183
+
184
+ ```ruby
185
+ puts penguins.to_s
186
+
187
+ # =>
188
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
189
+ <string> <string> <double> <double> <uint8> ... <uint16>
190
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
191
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
192
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
193
+ 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
194
+ 5 Adelie Torgersen 36.7 19.3 193 ... 2007
195
+ : : : : : : ... :
196
+ 342 Gentoo Biscoe 50.4 15.7 222 ... 2009
197
+ 343 Gentoo Biscoe 45.2 14.8 212 ... 2009
198
+ 344 Gentoo Biscoe 49.9 16.1 213 ... 2009
199
+ ```
200
+ ### `inspect`
201
+
202
+ `inspect` uses `to_s` output and also shows shape and object_id.
203
+
204
+
178
205
  ### `summary`, `describe` (not implemented)
179
206
 
180
207
  ### `to_rover`
181
208
 
182
209
  - Returns a `Rover::DataFrame`.
183
210
 
211
+ ```ruby
212
+ require 'rover'
213
+
214
+ penguins.to_rover
215
+ ```
216
+
184
217
  ### `to_iruby`
185
218
 
186
219
  - Show the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
@@ -196,6 +229,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
196
229
 
197
230
  penguins = Datasets::Penguins.new.to_arrow
198
231
  RedAmber::DataFrame.new(penguins).tdr
232
+
199
233
  # =>
200
234
  RedAmber::DataFrame : 344 x 8 Vectors
201
235
  Vectors : 5 numeric, 3 strings
@@ -214,22 +248,6 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
214
248
  - tally: max level to use tally mode.
215
249
  - elements: max number of elements to show as values in each observation.
216
250
 
217
- ### `inspect`
218
-
219
- - Returns the information of self as `tdr(3)`, and also shows object id.
220
-
221
- ```ruby
222
- puts penguins.inspect
223
- # =>
224
- #<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
225
- Vectors : 5 numeric, 3 strings
226
- # key type level data_preview
227
- 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
228
- 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
229
- 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
230
- ... 5 more Vectors ...
231
- ```
232
-
233
251
  ## Selecting
234
252
 
235
253
  ### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
@@ -250,19 +268,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
250
268
  hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
251
269
  df = RedAmber::DataFrame.new(hash)
252
270
  df[:b..:c, "a"]
271
+
253
272
  # =>
254
- #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
255
- Vectors : 2 numeric, 1 string
256
- # key type level data_preview
257
- 1 :b string 3 ["A", "B", "C"]
258
- 2 :c double 3 [1.0, 2.0, 3.0]
259
- 3 :a uint8 3 [1, 2, 3]
273
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000328fc>
274
+ b c a
275
+ <string> <double> <uint8>
276
+ 1 A 1.0 1
277
+ 2 B 2.0 2
278
+ 3 C 3.0 3
260
279
  ```
261
280
 
262
281
  If `#[]` selects a single variable (column), it returns a Vector object.
263
282
 
264
283
  ```ruby
265
284
  df[:a]
285
+
266
286
  # =>
267
287
  #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
268
288
  [1, 2, 3]
@@ -271,6 +291,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
271
291
 
272
292
  ```ruby
273
293
  df.v(:a)
294
+
274
295
  # =>
275
296
  #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
276
297
  [1, 2, 3]
@@ -294,14 +315,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
294
315
  ```ruby
295
316
  hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
296
317
  df = RedAmber::DataFrame.new(hash)
297
- df[:b..:c, "a"].tdr(tally_level: 0)
318
+ df[2, 0..]
319
+
298
320
  # =>
299
- RedAmber::DataFrame : 4 x 3 Vectors
300
- Vectors : 2 numeric, 1 string
301
- # key type level data_preview
302
- 1 :a uint8 3 [3, 1, 2, 3]
303
- 2 :b string 3 ["C", "A", "B", "C"]
304
- 3 :c double 3 [3.0, 1.0, 2.0, 3.0]
321
+ #<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000033270>
322
+ a b c
323
+ <uint8> <string> <double>
324
+ 1 3 C 3.0
325
+ 2 1 A 1.0
326
+ 3 2 B 2.0
327
+ 4 3 C 3.0
305
328
  ```
306
329
 
307
330
  - Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
@@ -313,13 +336,12 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
313
336
  df[true, false, nil] # or
314
337
  df[[true, false, nil]] # or
315
338
  df[RedAmber::Vector.new([true, false, nil])]
339
+
316
340
  # =>
317
- #<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
318
- Vectors : 2 numeric, 1 string
319
- # key type level data_preview
320
- 1 :a uint8 1 [1]
321
- 2 :b string 1 ["A"]
322
- 3 :c double 1 [1.0]
341
+ #<RedAmber::DataFrame : 1 x 3 Vectors, 0x00000000000353e0>
342
+ a b c
343
+ <uint8> <string> <double>
344
+ 1 1 A 1.0
323
345
  ```
324
346
 
325
347
  ### Select rows from top or from bottom
@@ -340,12 +362,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
340
362
 
341
363
  ```ruby
342
364
  penguins.pick(:species, :bill_length_mm)
365
+
343
366
  # =>
344
- #<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
345
- Vectors : 1 numeric, 1 string
346
- # key type level data_preview
347
- 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
348
- 2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
367
+ #<RedAmber::DataFrame : 344 x 2 Vectors, 0x0000000000035ebc>
368
+ species bill_length_mm
369
+ <string> <double>
370
+ 1 Adelie 39.1
371
+ 2 Adelie 39.5
372
+ 3 Adelie 40.3
373
+ 4 Adelie (nil)
374
+ 5 Adelie 36.7
375
+ : : :
376
+ 342 Gentoo 50.4
377
+ 343 Gentoo 45.2
378
+ 344 Gentoo 49.9
349
379
  ```
350
380
 
351
381
  - Booleans as an argument
@@ -354,13 +384,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
354
384
 
355
385
  ```ruby
356
386
  penguins.pick(penguins.types.map { |type| type == :string })
387
+
357
388
  # =>
358
- #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
359
- Vectors : 3 strings
360
- # key type level data_preview
361
- 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
362
- 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
363
- 3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
389
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x00000000000387ac>
390
+ species island sex
391
+ <string> <string> <string>
392
+ 1 Adelie Torgersen male
393
+ 2 Adelie Torgersen female
394
+ 3 Adelie Torgersen female
395
+ 4 Adelie Torgersen (nil)
396
+ 5 Adelie Torgersen female
397
+ : : : :
398
+ 342 Gentoo Biscoe male
399
+ 343 Gentoo Biscoe female
400
+ 344 Gentoo Biscoe male
364
401
  ```
365
402
 
366
403
  - Keys or booleans by a block
@@ -368,15 +405,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
368
405
  `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
369
406
 
370
407
  ```ruby
371
- # It is ok to write `keys ...` in the block, not `penguins.keys ...`
372
408
  penguins.pick { keys.map { |key| key.end_with?('mm') } }
409
+
373
410
  # =>
374
- #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
375
- Vectors : 3 numeric
376
- # key type level data_preview
377
- 1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
378
- 2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
379
- 3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
411
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003dd4c>
412
+ bill_length_mm bill_depth_mm flipper_length_mm
413
+ <double> <double> <uint8>
414
+ 1 39.1 18.7 181
415
+ 2 39.5 17.4 186
416
+ 3 40.3 18.0 195
417
+ 4 (nil) (nil) (nil)
418
+ 5 36.7 19.3 193
419
+ : : : :
420
+ 342 50.4 15.7 222
421
+ 343 45.2 14.8 212
422
+ 344 49.9 16.1 213
380
423
  ```
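As a rough illustration of the selection rules above, the following plain-Ruby sketch mimics `pick` on a Hash of arrays. It does not use RedAmber: `COLUMNS` and `pick_columns` are hypothetical names introduced here, and the real method may validate its arguments differently.

```ruby
# Plain-Ruby sketch of column picking; a Hash of arrays stands in for a DataFrame.
COLUMNS = {
  species: %w[Adelie Adelie],
  bill_length_mm: [39.1, 39.5],
  bill_depth_mm: [18.7, 17.4],
  year: [2007, 2007]
}.freeze

# Hypothetical helper: accepts either a list of keys or a boolean mask.
def pick_columns(columns, selector)
  keys = columns.keys
  if selector.all? { |s| s == true || s == false || s.nil? }
    # A boolean mask must have one entry per key.
    raise ArgumentError, 'mask size must equal number of keys' unless selector.size == keys.size

    keys = keys.select.with_index { |_, i| selector[i] }
  else
    keys = selector
  end
  columns.slice(*keys)
end

# Same idea as `pick { keys.map { |key| key.end_with?('mm') } }` above.
mask = COLUMNS.keys.map { |key| key.end_with?('mm') }
p pick_columns(COLUMNS, mask).keys
# => [:bill_length_mm, :bill_depth_mm]
```

Passing keys instead of a mask, as in `pick_columns(COLUMNS, [:species, :year])`, goes through the other branch of the same helper.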
381
424
 
382
425
  ### `drop ` - pick and drop -
@@ -414,13 +457,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
414
457
  df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
415
458
  df.pick(:a) # or
416
459
  df.drop(:b, :c)
460
+
417
461
  # =>
418
- #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
419
- Vector : 1 numeric
420
- # key type level data_preview
421
- 1 :a uint8 3 [1, 2, 3]
462
+ #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000003f4bc>
463
+ a
464
+ <uint8>
465
+ 1 1
466
+ 2 2
467
+ 3 3
422
468
 
423
469
  df[:a]
470
+
424
471
  # =>
425
472
  #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
426
473
  [1, 2, 3]
@@ -441,14 +488,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
441
488
  ```ruby
442
489
  # returns 5 obs. at start and 5 obs. from end
443
490
  penguins.slice(0...5, -5..-1)
491
+
444
492
  # =>
445
- #<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
446
- Vectors : 5 numeric, 3 strings
447
- # key type level data_preview
448
- 1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
449
- 2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
450
- 3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
451
- ... 5 more Vectors ...
493
+ #<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
494
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
495
+ <string> <string> <double> <double> <uint8> ... <uint16>
496
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
497
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
498
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
499
+ 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
500
+ 5 Adelie Torgersen 36.7 19.3 193 ... 2007
501
+ : : : : : : ... :
502
+ 8 Gentoo Biscoe 50.4 15.7 222 ... 2009
503
+ 9 Gentoo Biscoe 45.2 14.8 212 ... 2009
504
+ 10 Gentoo Biscoe 49.9 16.1 213 ... 2009
452
505
  ```
453
506
 
454
507
  - Booleans as an argument
@@ -458,14 +511,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
458
511
  ```ruby
459
512
  vector = penguins[:bill_length_mm]
460
513
  penguins.slice(vector >= 40)
514
+
461
515
  # =>
462
- #<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
463
- Vectors : 5 numeric, 3 strings
464
- # key type level data_preview
465
- 1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
466
- 2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
467
- 3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
468
- ... 5 more Vectors ...
516
+ #<RedAmber::DataFrame : 242 x 8 Vectors, 0x0000000000043d3c>
517
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
518
+ <string> <string> <double> <double> <uint8> ... <uint16>
519
+ 1 Adelie Torgersen 40.3 18.0 195 ... 2007
520
+ 2 Adelie Torgersen 42.0 20.2 190 ... 2007
521
+ 3 Adelie Torgersen 41.1 17.6 182 ... 2007
522
+ 4 Adelie Torgersen 42.5 20.7 197 ... 2007
523
+ 5 Adelie Torgersen 46.0 21.5 194 ... 2007
524
+ : : : : : : ... :
525
+ 240 Gentoo Biscoe 50.4 15.7 222 ... 2009
526
+ 241 Gentoo Biscoe 45.2 14.8 212 ... 2009
527
+ 242 Gentoo Biscoe 49.9 16.1 213 ... 2009
469
528
  ```
470
529
 
471
530
  - Indices or booleans by a block
@@ -482,13 +541,18 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
482
541
  end
483
542
 
484
543
  # =>
485
- #<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
486
- Vectors : 5 numeric, 3 strings
487
- # key type level data_preview
488
- 1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
489
- 2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
490
- 3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
491
- ... 5 more Vectors ...
544
+ #<RedAmber::DataFrame : 204 x 8 Vectors, 0x0000000000047a40>
545
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
546
+ <string> <string> <double> <double> <uint8> ... <uint16>
547
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
548
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
549
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
550
+ 4 Adelie Torgersen 39.3 20.6 190 ... 2007
551
+ 5 Adelie Torgersen 38.9 17.8 181 ... 2007
552
+ : : : : : : ... :
553
+ 202 Gentoo Biscoe 47.2 13.7 214 ... 2009
554
+ 203 Gentoo Biscoe 46.8 14.3 215 ... 2009
555
+ 204 Gentoo Biscoe 45.2 14.8 212 ... 2009
492
556
  ```
493
557
 
494
558
  - Notice: nil option
@@ -498,6 +562,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
498
562
  hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
499
563
  table = Arrow::Table.new(hash)
500
564
  table.slice([true, false, nil])
565
+
501
566
  # =>
502
567
  #<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
503
568
  a b c
@@ -509,6 +574,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
509
574
 
510
575
  ```ruby
511
576
  RedAmber::DataFrame.new(table).slice([true, false, nil]).table
577
+
512
578
  # =>
513
579
  #<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
514
580
  a b c
@@ -528,14 +594,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
528
594
  ```ruby
529
595
  # returns 6th to 339th obs.
530
596
  penguins.remove(0...5, -5..-1)
597
+
531
598
  # =>
532
- #<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
533
- Vectors : 5 numeric, 3 strings
534
- # key type level data_preview
535
- 1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
536
- 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
537
- 3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
538
- ... 5 more Vectors ...
599
+ #<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4>
600
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
601
+ <string> <string> <double> <double> <uint8> ... <uint16>
602
+ 1 Adelie Torgersen 39.3 20.6 190 ... 2007
603
+ 2 Adelie Torgersen 38.9 17.8 181 ... 2007
604
+ 3 Adelie Torgersen 39.2 19.6 195 ... 2007
605
+ 4 Adelie Torgersen 34.1 18.1 193 ... 2007
606
+ 5 Adelie Torgersen 42.0 20.2 190 ... 2007
607
+ : : : : : : ... :
608
+ 332 Gentoo Biscoe 44.5 15.7 217 ... 2009
609
+ 333 Gentoo Biscoe 48.8 16.2 222 ... 2009
610
+ 334 Gentoo Biscoe 47.2 13.7 214 ... 2009
539
611
  ```
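The index arithmetic behind `slice(0...5, -5..-1)` and `remove(0...5, -5..-1)` can be checked with a small plain-Ruby sketch. It assumes only that negative indices count from the end, as in Ruby arrays; `resolve_indices` is a hypothetical helper, not part of RedAmber.

```ruby
# Plain-Ruby sketch: integers and ranges are resolved against 0...size.
# `resolve_indices` is a hypothetical helper, not a RedAmber method.
def resolve_indices(size, *selectors)
  all = (0...size).to_a
  selectors.flat_map { |s| s.is_a?(Range) ? all[s] : [all[s]] }
end

# Indices kept by slice(0...5, -5..-1) on 344 observations.
sliced = resolve_indices(344, 0...5, -5..-1)

# remove keeps the complement of the same indices.
removed = (0...344).to_a - sliced

p sliced        # => [0, 1, 2, 3, 4, 339, 340, 341, 342, 343]
p removed.size  # => 334
p removed.first # => 5 (the 6th observation)
p removed.last  # => 338 (the 339th observation)
```

The sizes agree with the 10 x 8 and 334 x 8 results shown for the penguins examples.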
540
612
 
541
613
  - Booleans as an argument
@@ -545,19 +617,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
545
617
  ```ruby
546
618
  # remove all observations containing nil
547
619
  removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
548
- removed.tdr
620
+ removed
621
+
549
622
  # =>
550
- RedAmber::DataFrame : 333 x 8 Vectors
551
- Vectors : 5 numeric, 3 strings
552
- # key type level data_preview
553
- 1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
554
- 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
555
- 3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
556
- 4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
557
- 5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
558
- 6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
559
- 7 :sex string 2 {"male"=>168, "female"=>165}
560
- 8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
623
+ #<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac>
624
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
625
+ <string> <string> <double> <double> <uint8> ... <uint16>
626
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
627
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
628
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
629
+ 4 Adelie Torgersen 36.7 19.3 193 ... 2007
630
+ 5 Adelie Torgersen 39.3 20.6 190 ... 2007
631
+ : : : : : : ... :
632
+ 331 Gentoo Biscoe 50.4 15.7 222 ... 2009
633
+ 332 Gentoo Biscoe 45.2 14.8 212 ... 2009
634
+ 333 Gentoo Biscoe 49.9 16.1 213 ... 2009
561
635
  ```
562
636
 
563
637
  - Indices or booleans by a block
@@ -571,14 +645,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
571
645
  max = vector.mean + vector.std
572
646
  vector.to_a.map { |e| (min..max).include? e }
573
647
  end
648
+
574
649
  # =>
575
- #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
576
- Vectors : 5 numeric, 3 strings
577
- # key type level data_preview
578
- 1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
579
- 2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
580
- 3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
581
- ... 5 more Vectors ...
650
+ #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000004de40>
651
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
652
+ <string> <string> <double> <double> <uint8> ... <uint16>
653
+ 1 Adelie Torgersen (nil) (nil) (nil) ... 2007
654
+ 2 Adelie Torgersen 36.7 19.3 193 ... 2007
655
+ 3 Adelie Torgersen 34.1 18.1 193 ... 2007
656
+ 4 Adelie Torgersen 37.8 17.1 186 ... 2007
657
+ 5 Adelie Torgersen 37.8 17.3 180 ... 2007
658
+ : : : : : : ... :
659
+ 138 Gentoo Biscoe (nil) (nil) (nil) ... 2009
660
+ 139 Gentoo Biscoe 50.4 15.7 222 ... 2009
661
+ 140 Gentoo Biscoe 49.9 16.1 213 ... 2009
582
662
  ```
583
663
  - Notice for nil
584
664
  - When `remove` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
@@ -586,28 +666,34 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
586
666
  ```ruby
587
667
  df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
588
668
  booleans = df[:a] < 2
669
+ booleans
670
+
589
671
  # =>
590
672
  #<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
591
673
  [true, false, nil]
592
674
 
593
675
  booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
676
+
594
677
  df.slice(booleans) == df.remove(booleans_invert) # => true
595
678
  ```
679
+
596
680
  - In contrast, `Vector#invert` returns nil for nil elements, which leads to a different result.
597
681
 
598
682
  ```ruby
599
683
  booleans.invert
684
+
600
685
  # =>
601
686
  #<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
602
687
  [false, true, nil]
603
688
 
604
689
  df.remove(booleans.invert)
605
- #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
606
- Vectors : 2 numeric, 1 string
607
- # key type level data_preview
608
- 1 :a uint8 2 [1, nil], 1 nil
609
- 2 :b string 2 ["A", "C"]
610
- 3 :c double 2 [1.0, 3.0]
690
+
691
+ # =>
692
+ #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000005df98>
693
+ a b c
694
+ <uint8> <string> <double>
695
+ 1 1 A 1.0
696
+ 2 (nil) C 3.0
611
697
  ```
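The two inversion rules can be compared side by side without RedAmber. The sketch below uses plain arrays in place of vectors and rows; `ruby_invert`, `null_preserving_invert`, `slice_rows`, and `remove_rows` are hypothetical helpers written only to mirror the behavior described in this notice.

```ruby
# Plain-Ruby sketch; arrays stand in for RedAmber vectors and rows.
ROWS = %w[A B C].freeze
BOOLEANS = [true, false, nil].freeze

# Ruby-style negation: nil is falsy, so !nil is true.
def ruby_invert(flags)
  flags.map(&:!)
end

# Null-preserving negation, as described for Vector#invert.
def null_preserving_invert(flags)
  flags.map { |f| f.nil? ? nil : !f }
end

# Keep rows whose flag is exactly true (nil counts as false).
def slice_rows(rows, flags)
  rows.select.with_index { |_, i| flags[i] == true }
end

# Drop rows whose flag is exactly true (nil counts as false).
def remove_rows(rows, flags)
  rows.reject.with_index { |_, i| flags[i] == true }
end

p slice_rows(ROWS, BOOLEANS)                          # => ["A"]
p remove_rows(ROWS, ruby_invert(BOOLEANS))            # => ["A"]
p remove_rows(ROWS, null_preserving_invert(BOOLEANS)) # => ["A", "C"]
```

The last line keeps the row whose flag was nil, which corresponds to the 2 x 3 result shown for `df.remove(booleans.invert)`.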
612
698
 
613
699
  ### `rename`
@@ -621,15 +707,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
621
707
  `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
622
708
 
623
709
  ```ruby
624
- h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
625
- df = RedAmber::DataFrame.new(h)
710
+ df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
626
711
  df.rename(:age => :age_in_1993)
712
+
627
713
  # =>
628
- #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
629
- Vectors : 1 numeric, 1 string
630
- # key type level data_preview
631
- 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
632
- 2 :age_in_1993 uint8 3 [68, 49, 28]
714
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000060838>
715
+ name age_in_1993
716
+ <string> <uint8>
717
+ 1 Yasuko 68
718
+ 2 Rui 49
719
+ 3 Hinata 28
633
720
  ```
634
721
 
635
722
  - Key pairs by a block
@@ -655,25 +742,29 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
655
742
 
656
743
  ```ruby
657
744
  df = RedAmber::DataFrame.new(
658
- 'name' => %w[Yasuko Rui Hinata],
659
- 'age' => [68, 49, 28])
745
+ name: %w[Yasuko Rui Hinata],
746
+ age: [68, 49, 28])
747
+ df
748
+
660
749
  # =>
661
- #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
662
- Vectors : 1 numeric, 1 string
663
- # key type level data_preview
664
- 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
665
- 2 :age uint8 3 [68, 49, 28]
750
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
751
+ name age
752
+ <string> <uint8>
753
+ 1 Yasuko 68
754
+ 2 Rui 49
755
+ 3 Hinata 28
666
756
 
667
757
  # update :age and add :brother
668
758
  assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
669
759
  df.assign(assigner)
760
+
670
761
  # =>
671
- #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
672
- Vectors : 1 numeric, 2 strings
673
- # key type level data_preview
674
- 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
675
- 2 :age uint8 3 [97, 78, 57]
676
- 3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
762
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
763
+ name age brother
764
+ <string> <uint8> <string>
765
+ 1 Yasuko 97 Santa
766
+ 2 Rui 78 (nil)
767
+ 3 Hinata 57 Momotaro
677
768
  ```
678
769
 
679
770
  - Key pairs by a block
@@ -685,13 +776,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
685
776
  index: [0, 1, 2, 3, nil],
686
777
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
687
778
  string: ['A', 'B', 'C', 'D', nil])
779
+ df
780
+
688
781
  # =>
689
- #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
690
- Vectors : 2 numeric, 1 string
691
- # key type level data_preview
692
- 1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
693
- 2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
694
- 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
782
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
783
+ index float string
784
+ <uint8> <double> <string>
785
+ 1 0 0.0 A
786
+ 2 1 1.1 B
787
+ 3 2 2.2 C
788
+ 4 3 NaN D
789
+ 5 (nil) (nil) (nil)
695
790
 
696
791
  # update numeric variables
697
792
  df.assign do
@@ -701,13 +796,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
701
796
  end
702
797
  assigner
703
798
  end
799
+
704
800
  # =>
705
- #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
706
- Vectors : 2 numeric, 1 string
707
- # key type level data_preview
708
- 1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
709
- 2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
710
- 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
801
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000006e000>
802
+ index float string
803
+ <int8> <double> <string>
804
+ 1 0 -0.0 A
805
+ 2 -1 -1.1 B
806
+ 3 -2 -2.2 C
807
+ 4 -3 NaN D
808
+ 5 (nil) (nil) (nil)
711
809
 
712
810
  # Or it's shorter like this:
713
811
  df.assign do
@@ -715,6 +813,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
715
813
  assigner[key] = vector * -1 if vector.numeric?
716
814
  end
717
815
  end
816
+
718
817
  # => same as above
719
818
  ```
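The update-or-append behavior of `assign` resembles `Hash#merge`. The following plain-Ruby sketch shows that analogy under the assumption that a Hash of arrays is an adequate stand-in for a DataFrame; `assign_columns` and `negate_numeric` are hypothetical helpers and say nothing about how RedAmber implements `assign`.

```ruby
# Plain-Ruby sketch of assign; a Hash of arrays stands in for a DataFrame.
# `assign_columns` and `negate_numeric` are hypothetical helpers.
def assign_columns(columns, assigner)
  # Existing keys keep their position with new values; new keys are appended.
  columns.merge(assigner)
end

# Build an assigner Hash that multiplies every numeric column by -1.
def negate_numeric(columns)
  columns.each_with_object({}) do |(key, values), assigner|
    next unless values.compact.all?(Numeric)

    assigner[key] = values.map { |v| v.nil? ? nil : v * -1 }
  end
end

people = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
updated = assign_columns(people, { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] })
p updated.keys # => [:name, :age, :brother]

table = { index: [0, 1, 2, 3, nil], string: ['A', 'B', 'C', 'D', nil] }
p assign_columns(table, negate_numeric(table))[:index] # => [0, -1, -2, -3, nil]
```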
720
819
 
@@ -736,14 +835,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
736
835
  string: ['C', 'B', nil, 'A', 'B'],
737
836
  bool: [nil, true, false, true, false],
738
837
  })
739
- df.sort(:index, '-bool').tdr(tally: 0)
838
+ df.sort(:index, '-bool')
839
+
740
840
  # =>
741
- RedAmber::DataFrame : 5 x 3 Vectors
742
- Vectors : 1 numeric, 1 string, 1 boolean
743
- # key type level data_preview
744
- 1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
745
- 2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
746
- 3 :bool boolean 3 [false, false, true, nil, true], 1 nil
841
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000009b03c>
842
+ index string bool
843
+ <uint8> <string> <boolean>
844
+ 1 0 (nil) false
845
+ 2 0 B false
846
+ 3 1 B true
847
+ 4 1 C (nil)
848
+ 5 (nil) A true
747
849
  ```
748
850
 
749
851
  - [ ] Clamp
@@ -758,66 +860,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
758
860
 
759
861
  ## Grouping
760
862
 
761
- ### `group(aggregating_keys, function, target_keys)`
762
-
763
- (This is a temporary API and may change in the future version.)
863
+ ### `group(aggregating_keys)`
764
864
 
765
- Create grouped dataframe by `aggregation_keys` and apply `function` to each group and returns in `target_keys`. Aggregated key name is `function(key)` style.
865
+ (
866
+ This API will change in a future version. Especially, I want to change:
867
+ - Order of the columns in the result (aggregation_keys should come first)
868
+ - DataFrame#group will accept a block (heronshoes/red_amber #28)
869
+ )
766
870
 
767
- (The current implementation is not intuitive. Needs improvement.)
768
-
769
- ```ruby
770
- ds = Datasets::Rdatasets.new('dplyr', 'starwars')
771
- starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
772
- starwars.tdr(11)
773
- # =>
774
- RedAmber::DataFrame : 87 x 11 Vectors
775
- Vectors : 3 numeric, 8 strings
776
- # key type level data_preview
777
- 1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
778
- 2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
779
- 3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
780
- 4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
781
- 5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
782
- 6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
783
- 7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
784
- 8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
785
- 9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
786
- 10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
787
- 11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
788
-
789
- grouped = starwars.group(:species, :mean, [:mass, :height])
790
- # =>
791
- #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
792
- Vectors : 2 numeric, 1 string
793
- # key type level data_preview
794
- 1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
795
- 2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
796
- 3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
797
-
798
- count = starwars.group(:species, :count, :species)[:"count(species)"]
799
- df = grouped.slice(count > 1)
800
- # =>
801
- #<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
802
- Vectors : 2 numeric, 1 string
803
- # key type level data_preview
804
- 1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
805
- 2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
806
- 3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
807
-
808
- df.table
809
- # =>
810
- #<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
811
- mean(mass) mean(height) species
812
- 0 82.781818 176.645161 Human
813
- 1 69.750000 131.200000 Droid
814
- 2 124.000000 231.000000 Wookiee
815
- 3 74.000000 208.666667 Gungan
816
- 4 80.000000 173.000000 Zabrak
817
- 5 55.000000 179.000000 Twi'lek
818
- 6 53.100000 168.000000 Mirialan
819
- 7 88.000000 221.000000 Kaminoan
820
- ```
871
+ `group` creates a `Group` object. `Group` accepts the functions below as methods.
872
+ Each method accepts `summary_keys` as its arguments.
821
873
 
822
874
  Available functions are:
823
875
 
@@ -837,9 +889,115 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
837
889
  - [ ] tdigest
838
890
  - ✓ variance
839
891
 
892
+ For each group of `aggregation_keys`, the aggregation `function` is applied, and a new dataframe is returned with aggregated keys according to `summary_keys`.
893
+ Aggregated key names take the form `function(summary_key)`.
894
+
895
+ This is an example of grouping the famous STARWARS dataset.
896
+
897
+ ```ruby
898
+ starwars =
899
+ RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
900
+ starwars
901
+
902
+ # =>
903
+ #<RedAmber::DataFrame : 87 x 12 Vectors, 0x00000000000773bc>
904
+ species name height mass hair_color skin_color eye_color ... homeworld
905
+ <string> <string> <int64> <double> <string> <string> <string> ... <string>
906
+ Human 1 Luke Skywalker 172 77.0 blond fair blue ... Tatooine
907
+ Droid 2 C-3PO 167 75.0 NA gold yellow ... Tatooine
908
+ Droid 3 R2-D2 96 32.0 NA white, blue red ... Naboo
909
+ Human 4 Darth Vader 202 136.0 none white yellow ... Tatooine
910
+ Human 5 Leia Organa 150 49.0 brown light brown ... Alderaan
911
+ : : : : : : : : ... :
912
+ Droid 85 BB8 (nil) (nil) none none black ... NA
913
+ NA 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
914
+ Human 87 Padmé Amidala 165 45.0 brown light brown ... Naboo
915
+
916
+ starwars.tdr(12)
917
+
918
+ # =>
919
+ RedAmber::DataFrame : 87 x 12 Vectors
920
+ Vectors : 4 numeric, 8 strings
921
+ # key type level data_preview
922
+ 1 :"" int64 87 [1, 2, 3, 4, 5, ... ]
923
+ 2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
924
+ 3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
925
+ 4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
926
+ 5 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ]
927
+ 6 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
928
+ 7 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
929
+ 8 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
930
+ 9 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, "NA"=>4}
931
+ 10 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4}
932
+ 11 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ]
933
+ 12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
934
+ ```
935
+
936
+ We can aggregate for `:species` and calculate the mean of `:mass` and `:height`.
937
+
938
+ ```ruby
939
+ grouped = starwars.group(:species).mean(:mass, :height)
940
+ grouped
941
+
942
+ # =>
943
+ #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000008e620>
944
+ mean(mass) mean(height) species
945
+ <double> <double> <string>
946
+ 1 82.8 176.6 Human
947
+ 2 69.8 131.2 Droid
948
+ 3 124.0 231.0 Wookiee
949
+ 4 74.0 173.0 Rodian
950
+ 5 1358.0 175.0 Hutt
951
+ : : : :
952
+ 36 159.0 216.0 Kaleesh
953
+ 37 80.0 206.0 Pau'an
954
+ 38 80.0 188.0 Kel Dor
955
+ ```
956
+
957
+ Select rows for count > 1.
958
+
959
+ ```ruby
960
+ count = starwars.group(:species).count(:species)[:'count(species)'] # => Vector
961
+ grouped = grouped.slice(count > 1)
962
+
963
+ # =>
964
+ #<RedAmber::DataFrame : 9 x 3 Vectors, 0x0000000000098260>
965
+ mean(mass) mean(height) species
966
+ <double> <double> <string>
967
+ 1 82.8 176.6 Human
968
+ 2 69.8 131.2 Droid
969
+ 3 124.0 231.0 Wookiee
970
+ 4 74.0 208.7 Gungan
971
+ 5 48.0 181.3 NA
972
+ : : : :
973
+ 7 55.0 179.0 Twi'lek
974
+ 8 53.1 168.0 Mirialan
975
+ 9 88.0 221.0 Kaminoan
976
+ ```
977
+
978
+ Assemble the result and change the order of columns.
979
+
980
+ ```ruby
981
+ grouped.assign(count: count[count > 1]).pick { [2,3,0,1].map{ |i| keys[i] } }
982
+
983
+ # =>
984
+ #<RedAmber::DataFrame : 9 x 4 Vectors, 0x0000000000141838>
985
+ species count mean(mass) mean(height)
986
+ <string> <uint8> <double> <double>
987
+ 1 Human 35 82.8 176.6
988
+ 2 Droid 6 69.8 131.2
989
+ 3 Wookiee 2 124.0 231.0
990
+ 4 Gungan 3 74.0 208.7
991
+ 5 NA 4 48.0 181.3
992
+ : : : : :
993
+ 7 Twi'lek 2 55.0 179.0
994
+ 8 Mirialan 2 53.1 168.0
995
+ 9 Kaminoan 2 88.0 221.0
996
+ ```
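To make the group-then-aggregate flow and the `function(summary_key)` naming concrete, here is a plain-Ruby sketch built on `Enumerable#group_by`. It is not RedAmber's implementation: `RECORDS` holds only the first five rows shown above, and `group_mean` is a hypothetical helper that assumes missing values are skipped when averaging.

```ruby
# Plain-Ruby sketch of group-then-aggregate; not RedAmber's implementation.
# First five starwars rows from the preview above.
RECORDS = [
  { species: 'Human', mass: 77.0, height: 172 },
  { species: 'Droid', mass: 75.0, height: 167 },
  { species: 'Droid', mass: 32.0, height: 96 },
  { species: 'Human', mass: 136.0, height: 202 },
  { species: 'Human', mass: 49.0, height: 150 }
].freeze

# Hypothetical helper: group records by one key and average the summary keys.
# Output keys follow the `function(summary_key)` naming.
def group_mean(records, group_key, *summary_keys)
  records.group_by { |r| r[group_key] }.map do |group_value, members|
    row = {}
    summary_keys.each do |key|
      values = members.map { |m| m[key] }.compact # assumption: skip missing values
      row[:"mean(#{key})"] = values.empty? ? nil : values.sum.to_f / values.size
    end
    row[group_key] = group_value # group key comes last, as in the output above
    row
  end
end

group_mean(RECORDS, :species, :mass, :height).each { |row| p row }
```

Because only five rows are used, the means differ from the full-dataset values in the output above; the point is the shape of the result and the naming of the aggregated keys.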
997
+
840
998
  ## Combining DataFrames
841
999
 
842
- - [ ] obs
1000
+ - [ ] Combining rows to a dataframe
843
1001
 
844
1002
  - [ ] Add vars
845
1003
 
@@ -852,3 +1010,5 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
852
1010
  - [ ] One-hot encoding
853
1011
 
854
1012
  ## Iteration (not implemented)
1013
+
1014
+ - [ ] each_rows