red_amber 0.1.6 → 0.1.7
- checksums.yaml +4 -4
- data/.rubocop.yml +3 -0
- data/CHANGELOG.md +44 -18
- data/Gemfile +4 -1
- data/README.md +51 -76
- data/Rakefile +1 -0
- data/benchmark/csv_load_penguins.yml +1 -1
- data/doc/47_examples_of_red_amber.ipynb +4872 -0
- data/doc/DataFrame.md +370 -210
- data/doc/Vector.md +68 -15
- data/doc/image/dataframe/assign.png +0 -0
- data/doc/image/dataframe/drop.png +0 -0
- data/doc/image/dataframe/pick.png +0 -0
- data/doc/image/dataframe/remove.png +0 -0
- data/doc/image/dataframe/rename.png +0 -0
- data/doc/image/dataframe/slice.png +0 -0
- data/doc/image/dataframe_model.png +0 -0
- data/doc/image/vector/binary_element_wise.png +0 -0
- data/doc/image/vector/unary_aggregation.png +0 -0
- data/doc/image/vector/unary_aggregation_w_option.png +0 -0
- data/doc/image/vector/unary_element_wise.png +0 -0
- data/lib/red-amber.rb +1 -25
- data/lib/red_amber/data_frame.rb +9 -7
- data/lib/red_amber/data_frame_displayable.rb +79 -4
- data/lib/red_amber/group.rb +61 -0
- data/lib/red_amber/vector.rb +17 -3
- data/lib/red_amber/vector_functions.rb +22 -20
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +27 -1
- data/red_amber.gemspec +0 -2
- metadata +4 -31
- data/lib/red_amber/data_frame_observation_operation.rb +0 -11
data/doc/DataFrame.md
CHANGED
@@ -9,8 +9,6 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
9
9
|
|
10
10
|

|
11
11
|
|
12
|
-
(No change in this model in v0.1.6 .)
|
13
|
-
|
14
12
|
## Constructors and saving
|
15
13
|
|
16
14
|
### `new` from a Hash
|
@@ -37,6 +35,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
37
35
|
|
38
36
|
|
39
37
|
```ruby
|
38
|
+
require 'rover'
|
39
|
+
|
40
40
|
rover = Rover::DataFrame.new(x: [1, 2, 3])
|
41
41
|
RedAmber::DataFrame.new(rover)
|
42
42
|
```
|
@@ -61,6 +61,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
61
61
|
- from a Parquet file
|
62
62
|
|
63
63
|
```ruby
|
64
|
+
require 'parquet'
|
65
|
+
|
64
66
|
dataframe = RedAmber::DataFrame.load("file.parquet")
|
65
67
|
```
|
66
68
|
|
@@ -75,6 +77,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
75
77
|
- to a Parquet file
|
76
78
|
|
77
79
|
```ruby
|
80
|
+
require 'parquet'
|
81
|
+
|
78
82
|
dataframe.save("file.parquet")
|
79
83
|
```
|
80
84
|
|
@@ -175,12 +179,41 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
175
179
|
|
176
180
|
### `to_s`
|
177
181
|
|
182
|
+
`to_s` returns a preview of the Table.
|
183
|
+
|
184
|
+
```ruby
|
185
|
+
puts penguins.to_s
|
186
|
+
|
187
|
+
# =>
|
188
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
189
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
190
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
191
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
192
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
193
|
+
4 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
194
|
+
5 Adelie Torgersen 36.7 19.3 193 ... 2007
|
195
|
+
: : : : : : ... :
|
196
|
+
342 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
197
|
+
343 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
198
|
+
344 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
199
|
+
```
|
200
|
+
### `inspect`
|
201
|
+
|
202
|
+
`inspect` uses `to_s` output and also shows shape and object_id.
|
203
|
+
|
204
|
+
|
178
205
|
### `summary`, `describe` (not implemented)
|
179
206
|
|
180
207
|
### `to_rover`
|
181
208
|
|
182
209
|
- Returns a `Rover::DataFrame`.
|
183
210
|
|
211
|
+
```ruby
|
212
|
+
require 'rover'
|
213
|
+
|
214
|
+
penguins.to_rover
|
215
|
+
```
|
216
|
+
|
184
217
|
### `to_iruby`
|
185
218
|
|
186
219
|
- Show the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
|
@@ -196,6 +229,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
196
229
|
|
197
230
|
penguins = Datasets::Penguins.new.to_arrow
|
198
231
|
RedAmber::DataFrame.new(penguins).tdr
|
232
|
+
|
199
233
|
# =>
|
200
234
|
RedAmber::DataFrame : 344 x 8 Vectors
|
201
235
|
Vectors : 5 numeric, 3 strings
|
@@ -214,22 +248,6 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
214
248
|
- tally: max level to use tally mode.
|
215
249
|
- elements: max number of element values to show in each observation.
|
216
250
|
|
217
|
-
### `inspect`
|
218
|
-
|
219
|
-
- Returns the information of self as `tdr(3)`, and also shows object id.
|
220
|
-
|
221
|
-
```ruby
|
222
|
-
puts penguins.inspect
|
223
|
-
# =>
|
224
|
-
#<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
|
225
|
-
Vectors : 5 numeric, 3 strings
|
226
|
-
# key type level data_preview
|
227
|
-
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
228
|
-
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
229
|
-
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
230
|
-
... 5 more Vectors ...
|
231
|
-
```
|
232
|
-
|
233
251
|
## Selecting
|
234
252
|
|
235
253
|
### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
|
@@ -250,19 +268,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
250
268
|
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
251
269
|
df = RedAmber::DataFrame.new(hash)
|
252
270
|
df[:b..:c, "a"]
|
271
|
+
|
253
272
|
# =>
|
254
|
-
#<RedAmber::DataFrame : 3 x 3 Vectors,
|
255
|
-
|
256
|
-
|
257
|
-
1
|
258
|
-
2
|
259
|
-
3
|
273
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000328fc>
|
274
|
+
b c a
|
275
|
+
<string> <double> <uint8>
|
276
|
+
1 A 1.0 1
|
277
|
+
2 B 2.0 2
|
278
|
+
3 C 3.0 3
|
260
279
|
```
|
261
280
|
|
262
281
|
If `#[]` selects a single variable (column), it returns a Vector object.
|
263
282
|
|
264
283
|
```ruby
|
265
284
|
df[:a]
|
285
|
+
|
266
286
|
# =>
|
267
287
|
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
268
288
|
[1, 2, 3]
|
@@ -271,6 +291,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
271
291
|
|
272
292
|
```ruby
|
273
293
|
df.v(:a)
|
294
|
+
|
274
295
|
# =>
|
275
296
|
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
276
297
|
[1, 2, 3]
|
@@ -294,14 +315,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
294
315
|
```ruby
|
295
316
|
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
296
317
|
df = RedAmber::DataFrame.new(hash)
|
297
|
-
df[
|
318
|
+
df[2, 0..]
|
319
|
+
|
298
320
|
# =>
|
299
|
-
RedAmber::DataFrame : 4 x 3 Vectors
|
300
|
-
|
301
|
-
|
302
|
-
1
|
303
|
-
2
|
304
|
-
3
|
321
|
+
#<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000033270>
|
322
|
+
a b c
|
323
|
+
<uint8> <string> <double>
|
324
|
+
1 3 C 3.0
|
325
|
+
2 1 A 1.0
|
326
|
+
3 2 B 2.0
|
327
|
+
4 3 C 3.0
|
305
328
|
```
|
306
329
|
|
307
330
|
- Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
|
@@ -313,13 +336,12 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
313
336
|
df[true, false, nil] # or
|
314
337
|
df[[true, false, nil]] # or
|
315
338
|
df[RedAmber::Vector.new([true, false, nil])]
|
339
|
+
|
316
340
|
# =>
|
317
|
-
#<RedAmber::DataFrame : 1 x 3 Vectors,
|
318
|
-
|
319
|
-
|
320
|
-
1
|
321
|
-
2 :b string 1 ["A"]
|
322
|
-
3 :c double 1 [1.0]
|
341
|
+
#<RedAmber::DataFrame : 1 x 3 Vectors, 0x00000000000353e0>
|
342
|
+
a b c
|
343
|
+
<uint8> <string> <double>
|
344
|
+
1 1 A 1.0
|
323
345
|
```
|
324
346
|
|
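The index and boolean forms of `[]` shown above can be pictured with plain Ruby arrays. This is only a sketch of the selection semantics, not RedAmber's implementation; `rows`, `by_index`, and `by_mask` are names made up for illustration, and the rows are the three observations of the small `df` used in the examples.

```ruby
# Each inner array stands for one observation (a, b, c) of the small example df.
rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]

# Index selection: a single index and a range can be mixed, and rows may repeat.
by_index = rows.values_at(2, 0..2)

# Boolean selection: the mask has the same size as the data; nil counts as false.
mask = [true, false, nil]
by_mask = rows.select.with_index { |_, i| mask[i] }

p by_index
p by_mask
```

Here `by_index` holds four rows (the third row, then all three in order) and `by_mask` holds only the first row, mirroring the `4 x 3` and `1 x 3` results in the examples.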
325
347
|
### Select rows from top or from bottom
|
@@ -340,12 +362,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
340
362
|
|
341
363
|
```ruby
|
342
364
|
penguins.pick(:species, :bill_length_mm)
|
365
|
+
|
343
366
|
# =>
|
344
|
-
#<RedAmber::DataFrame : 344 x 2 Vectors,
|
345
|
-
|
346
|
-
|
347
|
-
|
348
|
-
|
367
|
+
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x0000000000035ebc>
|
368
|
+
species bill_length_mm
|
369
|
+
<string> <double>
|
370
|
+
1 Adelie 39.1
|
371
|
+
2 Adelie 39.5
|
372
|
+
3 Adelie 40.3
|
373
|
+
4 Adelie (nil)
|
374
|
+
5 Adelie 36.7
|
375
|
+
: : :
|
376
|
+
342 Gentoo 50.4
|
377
|
+
343 Gentoo 45.2
|
378
|
+
344 Gentoo 49.9
|
349
379
|
```
|
350
380
|
|
351
381
|
- Booleans as an argument
|
@@ -354,13 +384,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
354
384
|
|
355
385
|
```ruby
|
356
386
|
penguins.pick(penguins.types.map { |type| type == :string })
|
387
|
+
|
357
388
|
# =>
|
358
|
-
#<RedAmber::DataFrame : 344 x 3 Vectors,
|
359
|
-
|
360
|
-
|
361
|
-
|
362
|
-
|
363
|
-
|
389
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x00000000000387ac>
|
390
|
+
species island sex
|
391
|
+
<string> <string> <string>
|
392
|
+
1 Adelie Torgersen male
|
393
|
+
2 Adelie Torgersen female
|
394
|
+
3 Adelie Torgersen female
|
395
|
+
4 Adelie Torgersen (nil)
|
396
|
+
5 Adelie Torgersen female
|
397
|
+
: : : :
|
398
|
+
342 Gentoo Biscoe male
|
399
|
+
343 Gentoo Biscoe female
|
400
|
+
344 Gentoo Biscoe male
|
364
401
|
```
|
365
402
|
|
366
403
|
- Keys or booleans by a block
|
@@ -368,15 +405,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
368
405
|
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
369
406
|
|
370
407
|
```ruby
|
371
|
-
# It is ok to write `keys ...` in the block, not `penguins.keys ...`
|
372
408
|
penguins.pick { keys.map { |key| key.end_with?('mm') } }
|
409
|
+
|
373
410
|
# =>
|
374
|
-
#<RedAmber::DataFrame : 344 x 3 Vectors,
|
375
|
-
|
376
|
-
|
377
|
-
|
378
|
-
|
379
|
-
|
411
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003dd4c>
|
412
|
+
bill_length_mm bill_depth_mm flipper_length_mm
|
413
|
+
<double> <double> <uint8>
|
414
|
+
1 39.1 18.7 181
|
415
|
+
2 39.5 17.4 186
|
416
|
+
3 40.3 18.0 195
|
417
|
+
4 (nil) (nil) (nil)
|
418
|
+
5 36.7 19.3 193
|
419
|
+
: : : :
|
420
|
+
342 50.4 15.7 222
|
421
|
+
343 45.2 14.8 212
|
422
|
+
344 49.9 16.1 213
|
380
423
|
```
|
381
424
|
|
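The boolean form of `pick` can be thought of as pairing each key with a flag and keeping the flagged keys. The following is a minimal plain-Ruby sketch of that idea, not RedAmber's code; `keys` is a hand-typed subset of the penguins keys, and `mask`, `picked`, and `dropped` are illustrative names.

```ruby
# A hand-typed subset of the penguins keys (assumption for illustration).
keys = %i[species island bill_length_mm bill_depth_mm flipper_length_mm sex year]

# The mask must have one flag per key, like the block result in `pick { ... }`.
mask = keys.map { |key| key.end_with?('mm') }
raise ArgumentError, 'mask size must equal the number of keys' unless mask.size == keys.size

# Keep the keys whose flag is true; the remaining keys are what `drop` would keep.
picked = keys.zip(mask).select { |_, flag| flag }.map(&:first)
dropped = keys - picked

p picked
p dropped
```

Seen this way, `pick` and `drop` are complements of each other over the same mask.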
382
425
|
### `drop ` - pick and drop -
|
@@ -414,13 +457,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
414
457
|
df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
|
415
458
|
df.pick(:a) # or
|
416
459
|
df.drop(:b, :c)
|
460
|
+
|
417
461
|
# =>
|
418
|
-
#<RedAmber::DataFrame : 3 x 1 Vector,
|
419
|
-
|
420
|
-
|
421
|
-
1
|
462
|
+
#<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000003f4bc>
|
463
|
+
a
|
464
|
+
<uint8>
|
465
|
+
1 1
|
466
|
+
2 2
|
467
|
+
3 3
|
422
468
|
|
423
469
|
df[:a]
|
470
|
+
|
424
471
|
# =>
|
425
472
|
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
|
426
473
|
[1, 2, 3]
|
@@ -441,14 +488,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
441
488
|
```ruby
|
442
489
|
# returns 5 obs. at start and 5 obs. from end
|
443
490
|
penguins.slice(0...5, -5..-1)
|
491
|
+
|
444
492
|
# =>
|
445
|
-
#<RedAmber::DataFrame : 10 x 8 Vectors,
|
446
|
-
|
447
|
-
|
448
|
-
|
449
|
-
|
450
|
-
|
451
|
-
|
493
|
+
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
|
494
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
495
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
496
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
497
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
498
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
499
|
+
4 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
500
|
+
5 Adelie Torgersen 36.7 19.3 193 ... 2007
|
501
|
+
: : : : : : ... :
|
502
|
+
8 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
503
|
+
9 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
504
|
+
10 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
452
505
|
```
|
453
506
|
|
454
507
|
- Booleans as an argument
|
@@ -458,14 +511,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
458
511
|
```ruby
|
459
512
|
vector = penguins[:bill_length_mm]
|
460
513
|
penguins.slice(vector >= 40)
|
514
|
+
|
461
515
|
# =>
|
462
|
-
#<RedAmber::DataFrame : 242 x 8 Vectors,
|
463
|
-
|
464
|
-
|
465
|
-
|
466
|
-
|
467
|
-
|
468
|
-
|
516
|
+
#<RedAmber::DataFrame : 242 x 8 Vectors, 0x0000000000043d3c>
|
517
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
518
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
519
|
+
1 Adelie Torgersen 40.3 18.0 195 ... 2007
|
520
|
+
2 Adelie Torgersen 42.0 20.2 190 ... 2007
|
521
|
+
3 Adelie Torgersen 41.1 17.6 182 ... 2007
|
522
|
+
4 Adelie Torgersen 42.5 20.7 197 ... 2007
|
523
|
+
5 Adelie Torgersen 46.0 21.5 194 ... 2007
|
524
|
+
: : : : : : ... :
|
525
|
+
240 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
526
|
+
241 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
527
|
+
242 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
469
528
|
```
|
470
529
|
|
471
530
|
- Indices or booleans by a block
|
@@ -482,13 +541,18 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
482
541
|
end
|
483
542
|
|
484
543
|
# =>
|
485
|
-
#<RedAmber::DataFrame : 204 x 8 Vectors,
|
486
|
-
|
487
|
-
|
488
|
-
|
489
|
-
|
490
|
-
|
491
|
-
|
544
|
+
#<RedAmber::DataFrame : 204 x 8 Vectors, 0x0000000000047a40>
|
545
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
546
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
547
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
548
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
549
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
550
|
+
4 Adelie Torgersen 39.3 20.6 190 ... 2007
|
551
|
+
5 Adelie Torgersen 38.9 17.8 181 ... 2007
|
552
|
+
: : : : : : ... :
|
553
|
+
202 Gentoo Biscoe 47.2 13.7 214 ... 2009
|
554
|
+
203 Gentoo Biscoe 46.8 14.3 215 ... 2009
|
555
|
+
204 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
492
556
|
```
|
493
557
|
|
494
558
|
- Notice: nil option
|
@@ -498,6 +562,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
498
562
|
hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
|
499
563
|
table = Arrow::Table.new(hash)
|
500
564
|
table.slice([true, false, nil])
|
565
|
+
|
501
566
|
# =>
|
502
567
|
#<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
|
503
568
|
a b c
|
@@ -509,6 +574,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
509
574
|
|
510
575
|
```ruby
|
511
576
|
RedAmber::DataFrame.new(table).slice([true, false, nil]).table
|
577
|
+
|
512
578
|
# =>
|
513
579
|
#<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
|
514
580
|
a b c
|
@@ -528,14 +594,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
528
594
|
```ruby
|
529
595
|
# returns 6th to 339th obs.
|
530
596
|
penguins.remove(0...5, -5..-1)
|
597
|
+
|
531
598
|
# =>
|
532
|
-
#<RedAmber::DataFrame : 334 x 8 Vectors,
|
533
|
-
|
534
|
-
|
535
|
-
|
536
|
-
|
537
|
-
|
538
|
-
|
599
|
+
#<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4>
|
600
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
601
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
602
|
+
1 Adelie Torgersen 39.3 20.6 190 ... 2007
|
603
|
+
2 Adelie Torgersen 38.9 17.8 181 ... 2007
|
604
|
+
3 Adelie Torgersen 39.2 19.6 195 ... 2007
|
605
|
+
4 Adelie Torgersen 34.1 18.1 193 ... 2007
|
606
|
+
5 Adelie Torgersen 42.0 20.2 190 ... 2007
|
607
|
+
: : : : : : ... :
|
608
|
+
332 Gentoo Biscoe 44.5 15.7 217 ... 2009
|
609
|
+
333 Gentoo Biscoe 48.8 16.2 222 ... 2009
|
610
|
+
334 Gentoo Biscoe 47.2 13.7 214 ... 2009
|
539
611
|
```
|
540
612
|
|
541
613
|
- Booleans as an argument
|
@@ -545,19 +617,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
545
617
|
```ruby
|
546
618
|
# remove all observations that contain nil
|
547
619
|
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
|
548
|
-
removed
|
620
|
+
removed
|
621
|
+
|
549
622
|
# =>
|
550
|
-
RedAmber::DataFrame : 333 x 8 Vectors
|
551
|
-
|
552
|
-
|
553
|
-
|
554
|
-
|
555
|
-
|
556
|
-
|
557
|
-
|
558
|
-
|
559
|
-
7
|
560
|
-
8
|
623
|
+
#<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac>
|
624
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
625
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
626
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
627
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
628
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
629
|
+
4 Adelie Torgersen 36.7 19.3 193 ... 2007
|
630
|
+
5 Adelie Torgersen 39.3 20.6 190 ... 2007
|
631
|
+
: : : : : : ... :
|
632
|
+
331 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
633
|
+
332 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
634
|
+
333 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
561
635
|
```
|
562
636
|
|
563
637
|
- Indices or booleans by a block
|
@@ -571,14 +645,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
571
645
|
max = vector.mean + vector.std
|
572
646
|
vector.to_a.map { |e| (min..max).include? e }
|
573
647
|
end
|
648
|
+
|
574
649
|
# =>
|
575
|
-
#<RedAmber::DataFrame : 140 x 8 Vectors,
|
576
|
-
|
577
|
-
|
578
|
-
|
579
|
-
|
580
|
-
|
581
|
-
|
650
|
+
#<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000004de40>
|
651
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
652
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
653
|
+
1 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
654
|
+
2 Adelie Torgersen 36.7 19.3 193 ... 2007
|
655
|
+
3 Adelie Torgersen 34.1 18.1 193 ... 2007
|
656
|
+
4 Adelie Torgersen 37.8 17.1 186 ... 2007
|
657
|
+
5 Adelie Torgersen 37.8 17.3 180 ... 2007
|
658
|
+
: : : : : : ... :
|
659
|
+
138 Gentoo Biscoe (nil) (nil) (nil) ... 2009
|
660
|
+
139 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
661
|
+
140 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
582
662
|
```
|
583
663
|
- Notice for nil
|
584
664
|
- When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
|
@@ -586,28 +666,34 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
586
666
|
```ruby
|
587
667
|
df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
|
588
668
|
booleans = df[:a] < 2
|
669
|
+
booleans
|
670
|
+
|
589
671
|
# =>
|
590
672
|
#<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
|
591
673
|
[true, false, nil]
|
592
674
|
|
593
675
|
booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
|
676
|
+
|
594
677
|
df.slice(booleans) == df.remove(booleans_invert) # => true
|
595
678
|
```
|
679
|
+
|
596
680
|
- In contrast, `Vector#invert` returns nil for nil elements, which produces a different result.
|
597
681
|
|
598
682
|
```ruby
|
599
683
|
booleans.invert
|
684
|
+
|
600
685
|
# =>
|
601
686
|
#<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
|
602
687
|
[false, true, nil]
|
603
688
|
|
604
689
|
df.remove(booleans.invert)
|
605
|
-
|
606
|
-
|
607
|
-
|
608
|
-
|
609
|
-
|
610
|
-
|
690
|
+
|
691
|
+
# =>
|
692
|
+
#<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000005df98>
|
693
|
+
a b c
|
694
|
+
<uint8> <string> <double>
|
695
|
+
1 1 A 1.0
|
696
|
+
2 (nil) C 3.0
|
611
697
|
```
|
612
698
|
|
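The two inversions discussed above can be reproduced with plain Ruby arrays. This is a sketch of the semantics only, assuming that `remove` drops a row exactly when its mask value is `true`; `ruby_style`, `nil_preserving`, and the `remove` lambda are hypothetical names, not part of RedAmber.

```ruby
booleans = [true, false, nil]

# Ruby's `!` turns nil into true, so nil behaves like false before inversion.
ruby_style = booleans.map(&:!)

# A nil-preserving inversion, analogous to what the text says Vector#invert returns.
nil_preserving = booleans.map { |b| b.nil? ? nil : !b }

# Rows (a, b) of the example df; a row is removed only if its mask value is true.
rows = [[1, 'A'], [2, 'B'], [nil, 'C']]
remove = ->(mask) { rows.reject.with_index { |_, i| mask[i] == true } }

p remove.call(ruby_style)
p remove.call(nil_preserving)
```

With the Ruby-style mask only the first row survives, while the nil-preserving mask also keeps the row whose flag is nil, which matches the `2 x 3` result shown in the example.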
613
699
|
### `rename`
|
@@ -621,15 +707,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
621
707
|
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
|
622
708
|
|
623
709
|
```ruby
|
624
|
-
|
625
|
-
df = RedAmber::DataFrame.new(h)
|
710
|
+
df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
|
626
711
|
df.rename(:age => :age_in_1993)
|
712
|
+
|
627
713
|
# =>
|
628
|
-
#<RedAmber::DataFrame : 3 x 2 Vectors,
|
629
|
-
|
630
|
-
|
631
|
-
1
|
632
|
-
2
|
714
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000060838>
|
715
|
+
name age_in_1993
|
716
|
+
<string> <uint8>
|
717
|
+
1 Yasuko 68
|
718
|
+
2 Rui 49
|
719
|
+
3 Hinata 28
|
633
720
|
```
|
634
721
|
|
635
722
|
- Key pairs by a block
|
@@ -655,25 +742,29 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
655
742
|
|
656
743
|
```ruby
|
657
744
|
df = RedAmber::DataFrame.new(
|
658
|
-
|
659
|
-
|
745
|
+
name: %w[Yasuko Rui Hinata],
|
746
|
+
age: [68, 49, 28])
|
747
|
+
df
|
748
|
+
|
660
749
|
# =>
|
661
|
-
#<RedAmber::DataFrame : 3 x 2 Vectors,
|
662
|
-
|
663
|
-
|
664
|
-
1
|
665
|
-
2
|
750
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
|
751
|
+
name age
|
752
|
+
<string> <uint8>
|
753
|
+
1 Yasuko 68
|
754
|
+
2 Rui 49
|
755
|
+
3 Hinata 28
|
666
756
|
|
667
757
|
# update :age and add :brother
|
668
758
|
assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
|
669
759
|
df.assign(assigner)
|
760
|
+
|
670
761
|
# =>
|
671
|
-
#<RedAmber::DataFrame : 3 x 3 Vectors,
|
672
|
-
|
673
|
-
|
674
|
-
1
|
675
|
-
2
|
676
|
-
3
|
762
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
|
763
|
+
name age brother
|
764
|
+
<string> <uint8> <string>
|
765
|
+
1 Yasuko 97 Santa
|
766
|
+
2 Rui 78 (nil)
|
767
|
+
3 Hinata 57 Momotaro
|
677
768
|
```
|
678
769
|
|
679
770
|
- Key pairs by a block
|
@@ -685,13 +776,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
685
776
|
index: [0, 1, 2, 3, nil],
|
686
777
|
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
687
778
|
string: ['A', 'B', 'C', 'D', nil])
|
779
|
+
df
|
780
|
+
|
688
781
|
# =>
|
689
|
-
#<RedAmber::DataFrame : 5 x 3 Vectors,
|
690
|
-
|
691
|
-
|
692
|
-
1
|
693
|
-
2
|
694
|
-
3
|
782
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
|
783
|
+
index float string
|
784
|
+
<uint8> <double> <string>
|
785
|
+
1 0 0.0 A
|
786
|
+
2 1 1.1 B
|
787
|
+
3 2 2.2 C
|
788
|
+
4 3 NaN D
|
789
|
+
5 (nil) (nil) (nil)
|
695
790
|
|
696
791
|
# update numeric variables
|
697
792
|
df.assign do
|
@@ -701,13 +796,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
701
796
|
end
|
702
797
|
assigner
|
703
798
|
end
|
799
|
+
|
704
800
|
# =>
|
705
|
-
#<RedAmber::DataFrame : 5 x 3 Vectors,
|
706
|
-
|
707
|
-
|
708
|
-
1
|
709
|
-
2
|
710
|
-
3
|
801
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000006e000>
|
802
|
+
index float string
|
803
|
+
<int8> <double> <string>
|
804
|
+
1 0 -0.0 A
|
805
|
+
2 -1 -1.1 B
|
806
|
+
3 -2 -2.2 C
|
807
|
+
4 -3 NaN D
|
808
|
+
5 (nil) (nil) (nil)
|
711
809
|
|
712
810
|
# Or it's shorter like this:
|
713
811
|
df.assign do
|
@@ -715,6 +813,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
715
813
|
assigner[key] = vector * -1 if vector.numeric?
|
716
814
|
end
|
717
815
|
end
|
816
|
+
|
718
817
|
# => same as above
|
719
818
|
```
|
720
819
|
|
@@ -736,14 +835,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
736
835
|
string: ['C', 'B', nil, 'A', 'B'],
|
737
836
|
bool: [nil, true, false, true, false],
|
738
837
|
})
|
739
|
-
df.sort(:index, '-bool')
|
838
|
+
df.sort(:index, '-bool')
|
839
|
+
|
740
840
|
# =>
|
741
|
-
RedAmber::DataFrame : 5 x 3 Vectors
|
742
|
-
|
743
|
-
|
744
|
-
1
|
745
|
-
2
|
746
|
-
3
|
841
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000009b03c>
|
842
|
+
index string bool
|
843
|
+
<uint8> <string> <boolean>
|
844
|
+
1 0 (nil) false
|
845
|
+
2 0 B false
|
846
|
+
3 1 B true
|
847
|
+
4 1 C (nil)
|
848
|
+
5 (nil) A true
|
747
849
|
```
|
748
850
|
|
749
851
|
- [ ] Clamp
|
@@ -758,66 +860,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
758
860
|
|
759
861
|
## Grouping
|
760
862
|
|
761
|
-
### `group(aggregating_keys
|
762
|
-
|
763
|
-
(This is a temporary API and may change in the future version.)
|
863
|
+
### `group(aggregating_keys)`
|
764
864
|
|
765
|
-
|
865
|
+
(
|
866
|
+
This API will change in a future version. In particular, I want to change:
|
867
|
+
- Order of the columns in the result (aggregation_keys should come first)
|
868
|
+
- DataFrame#group will accept a block (heronshoes/red_amber #28)
|
869
|
+
)
|
766
870
|
|
767
|
-
|
768
|
-
|
769
|
-
```ruby
|
770
|
-
ds = Datasets::Rdatasets.new('dplyr', 'starwars')
|
771
|
-
starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
|
772
|
-
starwars.tdr(11)
|
773
|
-
# =>
|
774
|
-
RedAmber::DataFrame : 87 x 11 Vectors
|
775
|
-
Vectors : 3 numeric, 8 strings
|
776
|
-
# key type level data_preview
|
777
|
-
1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
|
778
|
-
2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
|
779
|
-
3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
|
780
|
-
4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
|
781
|
-
5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", .. . ]
|
782
|
-
6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
|
783
|
-
7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
|
784
|
-
8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
|
785
|
-
9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
|
786
|
-
10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
|
787
|
-
11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
|
788
|
-
|
789
|
-
grouped = starwars.group(:species, :mean, [:mass, :height])
|
790
|
-
# =>
|
791
|
-
#<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
|
792
|
-
Vectors : 2 numeric, 1 string
|
793
|
-
# key type level data_preview
|
794
|
-
1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
|
795
|
-
2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
|
796
|
-
3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
|
797
|
-
|
798
|
-
count = starwars.group(:species, :count, :species)[:"count(species)"]
|
799
|
-
df = grouped.slice(count > 1)
|
800
|
-
# =>
|
801
|
-
#<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
|
802
|
-
Vectors : 2 numeric, 1 string
|
803
|
-
# key type level data_preview
|
804
|
-
1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
|
805
|
-
2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
|
806
|
-
3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
|
807
|
-
|
808
|
-
df.table
|
809
|
-
# =>
|
810
|
-
#<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
|
811
|
-
mean(mass) mean(height) species
|
812
|
-
0 82.781818 176.645161 Human
|
813
|
-
1 69.750000 131.200000 Droid
|
814
|
-
2 124.000000 231.000000 Wookiee
|
815
|
-
3 74.000000 208.666667 Gungan
|
816
|
-
4 80.000000 173.000000 Zabrak
|
817
|
-
5 55.000000 179.000000 Twi'lek
|
818
|
-
6 53.100000 168.000000 Mirialan
|
819
|
-
7 88.000000 221.000000 Kaminoan
|
820
|
-
```
|
871
|
+
`group` creates an instance of class `Group`. `Group` accepts the functions below as methods.
|
872
|
+
Each method accepts `summary_keys` as its arguments.
|
821
873
|
|
822
874
|
Available functions are:
|
823
875
|
|
@@ -837,9 +889,115 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
837
889
|
- [ ] tdigest
|
838
890
|
- ✓ variance
|
839
891
|
|
892
|
+
For each group of `aggregation_keys`, the aggregation `function` is applied, and a new DataFrame is returned with aggregated keys according to `summary_keys`.
|
893
|
+
Aggregated key name is `function(summary_key)` style.
|
894
|
+
|
895
|
+
This is an example of grouping the well-known `starwars` dataset.
|
896
|
+
|
897
|
+
```ruby
|
898
|
+
starwars =
|
899
|
+
RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
|
900
|
+
starwars
|
901
|
+
|
902
|
+
# =>
|
903
|
+
#<RedAmber::DataFrame : 87 x 12 Vectors, 0x00000000000773bc>
|
904
|
+
species name height mass hair_color skin_color eye_color ... homeworld
|
905
|
+
<string> <string> <int64> <double> <string> <string> <string> ... <string>
|
906
|
+
Human 1 Luke Skywalker 172 77.0 blond fair blue ... Tatooine
|
907
|
+
Droid 2 C-3PO 167 75.0 NA gold yellow ... Tatooine
|
908
|
+
Droid 3 R2-D2 96 32.0 NA white, blue red ... Naboo
|
909
|
+
Human 4 Darth Vader 202 136.0 none white yellow ... Tatooine
|
910
|
+
Human 5 Leia Organa 150 49.0 brown light brown ... Alderaan
|
911
|
+
: : : : : : : : ... :
|
912
|
+
Droid 85 BB8 (nil) (nil) none none black ... NA
|
913
|
+
NA 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
|
914
|
+
Human 87 Padmé Amidala 165 45.0 brown light brown ... Naboo
|
915
|
+
|
916
|
+
starwars.tdr(12)
|
917
|
+
|
918
|
+
# =>
|
919
|
+
RedAmber::DataFrame : 87 x 12 Vectors
|
920
|
+
Vectors : 4 numeric, 8 strings
|
921
|
+
# key type level data_preview
|
922
|
+
1 :"" int64 87 [1, 2, 3, 4, 5, ... ]
|
923
|
+
2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
|
924
|
+
3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
|
925
|
+
4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
|
926
|
+
5 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ]
|
927
|
+
6 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
|
928
|
+
7 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
|
929
|
+
8 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
|
930
|
+
9 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, "NA"=>4}
|
931
|
+
10 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4}
|
932
|
+
11 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ]
|
933
|
+
12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
|
934
|
+
```
|
935
|
+
|
936
|
+
We can group by `:species` and calculate the mean of `:mass` and `:height`.
|
937
|
+
|
938
|
+
```ruby
|
939
|
+
grouped = starwars.group(:species).mean(:mass, :height)
|
940
|
+
grouped
|
941
|
+
|
942
|
+
# =>
|
943
|
+
#<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000008e620>
|
944
|
+
mean(mass) mean(height) species
|
945
|
+
<double> <double> <string>
|
946
|
+
1 82.8 176.6 Human
|
947
|
+
2 69.8 131.2 Droid
|
948
|
+
3 124.0 231.0 Wookiee
|
949
|
+
4 74.0 173.0 Rodian
|
950
|
+
5 1358.0 175.0 Hutt
|
951
|
+
: : : :
|
952
|
+
36 159.0 216.0 Kaleesh
|
953
|
+
37 80.0 206.0 Pau'an
|
954
|
+
38 80.0 188.0 Kel Dor
|
955
|
+
```
|
956
|
+
|
957
|
+
Select rows for count > 1.
|
958
|
+
|
959
|
+
```ruby
|
960
|
+
count = starwars.group(:species).count(:species)[:'count(species)'] # => Vector
|
961
|
+
grouped = grouped.slice(count > 1)
|
962
|
+
|
963
|
+
# =>
|
964
|
+
#<RedAmber::DataFrame : 9 x 3 Vectors, 0x0000000000098260>
|
965
|
+
mean(mass) mean(height) species
|
966
|
+
<double> <double> <string>
|
967
|
+
1 82.8 176.6 Human
|
968
|
+
2 69.8 131.2 Droid
|
969
|
+
3 124.0 231.0 Wookiee
|
970
|
+
4 74.0 208.7 Gungan
|
971
|
+
5 48.0 181.3 NA
|
972
|
+
: : : :
|
973
|
+
7 55.0 179.0 Twi'lek
|
974
|
+
8 53.1 168.0 Mirialan
|
975
|
+
9 88.0 221.0 Kaminoan
|
976
|
+
```
|
977
|
+
|
978
|
+
Assemble the result and change the order of columns.
|
979
|
+
|
980
|
+
```ruby
|
981
|
+
grouped.assign(count: count[count > 1]).pick { [2,3,0,1].map{ |i| keys[i] } }
|
982
|
+
|
983
|
+
# =>
|
984
|
+
#<RedAmber::DataFrame : 9 x 4 Vectors, 0x0000000000141838>
|
985
|
+
species count mean(mass) mean(height)
|
986
|
+
<string> <uint8> <double> <double>
|
987
|
+
1 Human 35 82.8 176.6
|
988
|
+
2 Droid 6 69.8 131.2
|
989
|
+
3 Wookiee 2 124.0 231.0
|
990
|
+
4 Gungan 3 74.0 208.7
|
991
|
+
5 NA 4 48.0 181.3
|
992
|
+
: : : : :
|
993
|
+
7 Twi'lek 2 55.0 179.0
|
994
|
+
8 Mirialan 2 53.1 168.0
|
995
|
+
9 Kaminoan 2 88.0 221.0
|
996
|
+
```
|
997
|
+
|
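The group-then-aggregate flow above, including the `function(summary_key)` naming of the result keys, can be mimicked in plain Ruby. The sketch below is not how RedAmber implements `Group`; `group_mean` is a hypothetical helper, the rows are the first five observations shown in the `starwars` preview, and skipping nil values is an assumption of this sketch.

```ruby
# First five observations from the starwars preview (species, mass, height only).
rows = [
  { species: 'Human', mass: 77.0,  height: 172 },
  { species: 'Droid', mass: 75.0,  height: 167 },
  { species: 'Droid', mass: 32.0,  height: 96 },
  { species: 'Human', mass: 136.0, height: 202 },
  { species: 'Human', mass: 49.0,  height: 150 }
]

# Hypothetical helper: group rows by `group_key`, then average each summary key.
# Result keys follow the `function(summary_key)` style; the group key comes last.
def group_mean(rows, group_key, *summary_keys)
  rows.group_by { |row| row[group_key] }.map do |group_value, members|
    means = summary_keys.to_h do |key|
      values = members.map { |row| row[key] }.compact # nil values are skipped
      [:"mean(#{key})", values.empty? ? nil : values.sum.to_f / values.size]
    end
    means.merge(group_key => group_value)
  end
end

result = group_mean(rows, :species, :mass, :height)
result.each { |row| p row }
```

Each result row carries the keys `:"mean(mass)"`, `:"mean(height)"`, and `:species`, in the same column order as the grouped output above.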
840
998
|
## Combining DataFrames
|
841
999
|
|
842
|
-
- [ ]
|
1000
|
+
- [ ] Combining rows to a dataframe
|
843
1001
|
|
844
1002
|
- [ ] Add vars
|
845
1003
|
|
@@ -852,3 +1010,5 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
852
1010
|
- [ ] One-hot encoding
|
853
1011
|
|
854
1012
|
## Iteration (not implemented)
|
1013
|
+
|
1014
|
+
- [ ] each_rows
|