red_amber 0.1.6 → 0.1.7
- checksums.yaml +4 -4
- data/.rubocop.yml +3 -0
- data/CHANGELOG.md +44 -18
- data/Gemfile +4 -1
- data/README.md +51 -76
- data/Rakefile +1 -0
- data/benchmark/csv_load_penguins.yml +1 -1
- data/doc/47_examples_of_red_amber.ipynb +4872 -0
- data/doc/DataFrame.md +370 -210
- data/doc/Vector.md +68 -15
- data/doc/image/dataframe/assign.png +0 -0
- data/doc/image/dataframe/drop.png +0 -0
- data/doc/image/dataframe/pick.png +0 -0
- data/doc/image/dataframe/remove.png +0 -0
- data/doc/image/dataframe/rename.png +0 -0
- data/doc/image/dataframe/slice.png +0 -0
- data/doc/image/dataframe_model.png +0 -0
- data/doc/image/vector/binary_element_wise.png +0 -0
- data/doc/image/vector/unary_aggregation.png +0 -0
- data/doc/image/vector/unary_aggregation_w_option.png +0 -0
- data/doc/image/vector/unary_element_wise.png +0 -0
- data/lib/red-amber.rb +1 -25
- data/lib/red_amber/data_frame.rb +9 -7
- data/lib/red_amber/data_frame_displayable.rb +79 -4
- data/lib/red_amber/group.rb +61 -0
- data/lib/red_amber/vector.rb +17 -3
- data/lib/red_amber/vector_functions.rb +22 -20
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +27 -1
- data/red_amber.gemspec +0 -2
- metadata +4 -31
- data/lib/red_amber/data_frame_observation_operation.rb +0 -11
data/doc/DataFrame.md
CHANGED
@@ -9,8 +9,6 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
9
9
|
|
10
10
|
![dataframe model image](doc/../image/dataframe_model.png)
|
11
11
|
|
12
|
-
(No change in this model in v0.1.6 .)
|
13
|
-
|
14
12
|
## Constructors and saving
|
15
13
|
|
16
14
|
### `new` from a Hash
|
@@ -37,6 +35,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
37
35
|
|
38
36
|
|
39
37
|
```ruby
|
38
|
+
require 'rover'
|
39
|
+
|
40
40
|
rover = Rover::DataFrame.new(x: [1, 2, 3])
|
41
41
|
RedAmber::DataFrame.new(rover)
|
42
42
|
```
|
@@ -61,6 +61,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
61
61
|
- from a Parquet file
|
62
62
|
|
63
63
|
```ruby
|
64
|
+
require 'parquet'
|
65
|
+
|
64
66
|
dataframe = RedAmber::DataFrame.load("file.parquet")
|
65
67
|
```
|
66
68
|
|
@@ -75,6 +77,8 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
75
77
|
- to a Parquet file
|
76
78
|
|
77
79
|
```ruby
|
80
|
+
require 'parquet'
|
81
|
+
|
78
82
|
dataframe.save("file.parquet")
|
79
83
|
```
|
80
84
|
|
@@ -175,12 +179,41 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
175
179
|
|
176
180
|
### `to_s`
|
177
181
|
|
182
|
+
`to_s` returns a preview of the Table.
|
183
|
+
|
184
|
+
```ruby
|
185
|
+
puts penguins.to_s
|
186
|
+
|
187
|
+
# =>
|
188
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
189
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
190
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
191
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
192
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
193
|
+
4 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
194
|
+
5 Adelie Torgersen 36.7 19.3 193 ... 2007
|
195
|
+
: : : : : : ... :
|
196
|
+
342 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
197
|
+
343 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
198
|
+
344 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
199
|
+
```
|
200
|
+
### `inspect`
|
201
|
+
|
202
|
+
`inspect` uses `to_s` output and also shows shape and object_id.
|
203
|
+
|
204
|
+
|
178
205
|
### `summary`, `describe` (not implemented)
|
179
206
|
|
180
207
|
### `to_rover`
|
181
208
|
|
182
209
|
- Returns a `Rover::DataFrame`.
|
183
210
|
|
211
|
+
```ruby
|
212
|
+
require 'rover'
|
213
|
+
|
214
|
+
penguins.to_rover
|
215
|
+
```
|
216
|
+
|
184
217
|
### `to_iruby`
|
185
218
|
|
186
219
|
- Show the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
|
@@ -196,6 +229,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
196
229
|
|
197
230
|
penguins = Datasets::Penguins.new.to_arrow
|
198
231
|
RedAmber::DataFrame.new(penguins).tdr
|
232
|
+
|
199
233
|
# =>
|
200
234
|
RedAmber::DataFrame : 344 x 8 Vectors
|
201
235
|
Vectors : 5 numeric, 3 strings
|
@@ -214,22 +248,6 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
214
248
|
- tally: max level to use tally mode.
|
215
249
|
- elements: max number of elements to show as values in each observation.
|
216
250
|
|
217
|
-
### `inspect`
|
218
|
-
|
219
|
-
- Returns the information of self as `tdr(3)`, and also shows object id.
|
220
|
-
|
221
|
-
```ruby
|
222
|
-
puts penguins.inspect
|
223
|
-
# =>
|
224
|
-
#<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
|
225
|
-
Vectors : 5 numeric, 3 strings
|
226
|
-
# key type level data_preview
|
227
|
-
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
228
|
-
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
229
|
-
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
230
|
-
... 5 more Vectors ...
|
231
|
-
```
|
232
|
-
|
233
251
|
## Selecting
|
234
252
|
|
235
253
|
### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
|
@@ -250,19 +268,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
250
268
|
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
251
269
|
df = RedAmber::DataFrame.new(hash)
|
252
270
|
df[:b..:c, "a"]
|
271
|
+
|
253
272
|
# =>
|
254
|
-
#<RedAmber::DataFrame : 3 x 3 Vectors,
|
255
|
-
|
256
|
-
|
257
|
-
1
|
258
|
-
2
|
259
|
-
3
|
273
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000328fc>
|
274
|
+
b c a
|
275
|
+
<string> <double> <uint8>
|
276
|
+
1 A 1.0 1
|
277
|
+
2 B 2.0 2
|
278
|
+
3 C 3.0 3
|
260
279
|
```
|
261
280
|
|
262
281
|
If `#[]` selects a single variable (column), it returns a Vector object.
|
263
282
|
|
264
283
|
```ruby
|
265
284
|
df[:a]
|
285
|
+
|
266
286
|
# =>
|
267
287
|
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
268
288
|
[1, 2, 3]
|
@@ -271,6 +291,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
271
291
|
|
272
292
|
```ruby
|
273
293
|
df.v(:a)
|
294
|
+
|
274
295
|
# =>
|
275
296
|
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
276
297
|
[1, 2, 3]
|
@@ -294,14 +315,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
294
315
|
```ruby
|
295
316
|
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
296
317
|
df = RedAmber::DataFrame.new(hash)
|
297
|
-
df[
|
318
|
+
df[2, 0..]
|
319
|
+
|
298
320
|
# =>
|
299
|
-
RedAmber::DataFrame : 4 x 3 Vectors
|
300
|
-
|
301
|
-
|
302
|
-
1
|
303
|
-
2
|
304
|
-
3
|
321
|
+
#<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000033270>
|
322
|
+
a b c
|
323
|
+
<uint8> <string> <double>
|
324
|
+
1 3 C 3.0
|
325
|
+
2 1 A 1.0
|
326
|
+
3 2 B 2.0
|
327
|
+
4 3 C 3.0
|
305
328
|
```
|
306
329
|
|
307
330
|
- Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
|
@@ -313,13 +336,12 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
313
336
|
df[true, false, nil] # or
|
314
337
|
df[[true, false, nil]] # or
|
315
338
|
df[RedAmber::Vector.new([true, false, nil])]
|
339
|
+
|
316
340
|
# =>
|
317
|
-
#<RedAmber::DataFrame : 1 x 3 Vectors,
|
318
|
-
|
319
|
-
|
320
|
-
1
|
321
|
-
2 :b string 1 ["A"]
|
322
|
-
3 :c double 1 [1.0]
|
341
|
+
#<RedAmber::DataFrame : 1 x 3 Vectors, 0x00000000000353e0>
|
342
|
+
a b c
|
343
|
+
<uint8> <string> <double>
|
344
|
+
1 1 A 1.0
|
323
345
|
```
|
324
346
|
|
325
347
|
### Select rows from top or from bottom
|
@@ -340,12 +362,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
340
362
|
|
341
363
|
```ruby
|
342
364
|
penguins.pick(:species, :bill_length_mm)
|
365
|
+
|
343
366
|
# =>
|
344
|
-
#<RedAmber::DataFrame : 344 x 2 Vectors,
|
345
|
-
|
346
|
-
|
347
|
-
|
348
|
-
|
367
|
+
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x0000000000035ebc>
|
368
|
+
species bill_length_mm
|
369
|
+
<string> <double>
|
370
|
+
1 Adelie 39.1
|
371
|
+
2 Adelie 39.5
|
372
|
+
3 Adelie 40.3
|
373
|
+
4 Adelie (nil)
|
374
|
+
5 Adelie 36.7
|
375
|
+
: : :
|
376
|
+
342 Gentoo 50.4
|
377
|
+
343 Gentoo 45.2
|
378
|
+
344 Gentoo 49.9
|
349
379
|
```
|
350
380
|
|
351
381
|
- Booleans as an argument
|
@@ -354,13 +384,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
354
384
|
|
355
385
|
```ruby
|
356
386
|
penguins.pick(penguins.types.map { |type| type == :string })
|
387
|
+
|
357
388
|
# =>
|
358
|
-
#<RedAmber::DataFrame : 344 x 3 Vectors,
|
359
|
-
|
360
|
-
|
361
|
-
|
362
|
-
|
363
|
-
|
389
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x00000000000387ac>
|
390
|
+
species island sex
|
391
|
+
<string> <string> <string>
|
392
|
+
1 Adelie Torgersen male
|
393
|
+
2 Adelie Torgersen female
|
394
|
+
3 Adelie Torgersen female
|
395
|
+
4 Adelie Torgersen (nil)
|
396
|
+
5 Adelie Torgersen female
|
397
|
+
: : : :
|
398
|
+
342 Gentoo Biscoe male
|
399
|
+
343 Gentoo Biscoe female
|
400
|
+
344 Gentoo Biscoe male
|
364
401
|
```
|
365
402
|
|
366
403
|
- Keys or booleans by a block
|
@@ -368,15 +405,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
368
405
|
`pick {block}` is also acceptable. Arguments and a block cannot be used at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
369
406
|
|
370
407
|
```ruby
|
371
|
-
# It is ok to write `keys ...` in the block, not `penguins.keys ...`
|
372
408
|
penguins.pick { keys.map { |key| key.end_with?('mm') } }
|
409
|
+
|
373
410
|
# =>
|
374
|
-
#<RedAmber::DataFrame : 344 x 3 Vectors,
|
375
|
-
|
376
|
-
|
377
|
-
|
378
|
-
|
379
|
-
|
411
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003dd4c>
|
412
|
+
bill_length_mm bill_depth_mm flipper_length_mm
|
413
|
+
<double> <double> <uint8>
|
414
|
+
1 39.1 18.7 181
|
415
|
+
2 39.5 17.4 186
|
416
|
+
3 40.3 18.0 195
|
417
|
+
4 (nil) (nil) (nil)
|
418
|
+
5 36.7 19.3 193
|
419
|
+
: : : :
|
420
|
+
342 50.4 15.7 222
|
421
|
+
343 45.2 14.8 212
|
422
|
+
344 49.9 16.1 213
|
380
423
|
```
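The block form of `pick` described above can be sketched in plain Ruby. This is only a toy illustration: `ToyFrame` is a hypothetical stand-in class, and using `instance_eval` to run the block "in the context of self" is an assumption about one possible mechanism, not RedAmber's actual implementation.

```ruby
# Toy stand-in for a data frame (hypothetical; not RedAmber's code).
class ToyFrame
  attr_reader :columns

  def initialize(columns)
    @columns = columns
  end

  def keys
    @columns.keys
  end

  # Accepts keys or booleans as arguments, or a block, but not both.
  def pick(*args, &block)
    raise ArgumentError, 'give arguments or a block, not both' if block && !args.empty?

    # Assumption: the block runs with self as receiver, so `keys` works unqualified.
    selector = (block ? Array(instance_eval(&block)) : args).flatten

    picked =
      if selector.all? { |s| [true, false, nil].include?(s) }
        # Boolean selector: must have one flag per key.
        raise ArgumentError, 'size mismatch' unless selector.size == keys.size

        keys.zip(selector).select { |_, flag| flag }.map(&:first)
      else
        selector
      end
    ToyFrame.new(@columns.slice(*picked))
  end
end

frame = ToyFrame.new(bill_length_mm: [39.1], bill_depth_mm: [18.7], species: ['Adelie'])
picked = frame.pick { keys.map { |key| key.end_with?('mm') } }
```

Here `picked.keys` contains only the two keys ending in `mm`, because the block could call `keys` without naming the receiver.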
|
381
424
|
|
382
425
|
### `drop ` - pick and drop -
|
@@ -414,13 +457,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
414
457
|
df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
|
415
458
|
df.pick(:a) # or
|
416
459
|
df.drop(:b, :c)
|
460
|
+
|
417
461
|
# =>
|
418
|
-
#<RedAmber::DataFrame : 3 x 1 Vector,
|
419
|
-
|
420
|
-
|
421
|
-
1
|
462
|
+
#<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000003f4bc>
|
463
|
+
a
|
464
|
+
<uint8>
|
465
|
+
1 1
|
466
|
+
2 2
|
467
|
+
3 3
|
422
468
|
|
423
469
|
df[:a]
|
470
|
+
|
424
471
|
# =>
|
425
472
|
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
|
426
473
|
[1, 2, 3]
|
@@ -441,14 +488,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
441
488
|
```ruby
|
442
489
|
# returns 5 obs. at start and 5 obs. from end
|
443
490
|
penguins.slice(0...5, -5..-1)
|
491
|
+
|
444
492
|
# =>
|
445
|
-
#<RedAmber::DataFrame : 10 x 8 Vectors,
|
446
|
-
|
447
|
-
|
448
|
-
|
449
|
-
|
450
|
-
|
451
|
-
|
493
|
+
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
|
494
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
495
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
496
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
497
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
498
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
499
|
+
4 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
500
|
+
5 Adelie Torgersen 36.7 19.3 193 ... 2007
|
501
|
+
: : : : : : ... :
|
502
|
+
8 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
503
|
+
9 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
504
|
+
10 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
452
505
|
```
|
453
506
|
|
454
507
|
- Booleans as an argument
|
@@ -458,14 +511,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
458
511
|
```ruby
|
459
512
|
vector = penguins[:bill_length_mm]
|
460
513
|
penguins.slice(vector >= 40)
|
514
|
+
|
461
515
|
# =>
|
462
|
-
#<RedAmber::DataFrame : 242 x 8 Vectors,
|
463
|
-
|
464
|
-
|
465
|
-
|
466
|
-
|
467
|
-
|
468
|
-
|
516
|
+
#<RedAmber::DataFrame : 242 x 8 Vectors, 0x0000000000043d3c>
|
517
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
518
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
519
|
+
1 Adelie Torgersen 40.3 18.0 195 ... 2007
|
520
|
+
2 Adelie Torgersen 42.0 20.2 190 ... 2007
|
521
|
+
3 Adelie Torgersen 41.1 17.6 182 ... 2007
|
522
|
+
4 Adelie Torgersen 42.5 20.7 197 ... 2007
|
523
|
+
5 Adelie Torgersen 46.0 21.5 194 ... 2007
|
524
|
+
: : : : : : ... :
|
525
|
+
240 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
526
|
+
241 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
527
|
+
242 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
469
528
|
```
|
470
529
|
|
471
530
|
- Indices or booleans by a block
|
@@ -482,13 +541,18 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
482
541
|
end
|
483
542
|
|
484
543
|
# =>
|
485
|
-
#<RedAmber::DataFrame : 204 x 8 Vectors,
|
486
|
-
|
487
|
-
|
488
|
-
|
489
|
-
|
490
|
-
|
491
|
-
|
544
|
+
#<RedAmber::DataFrame : 204 x 8 Vectors, 0x0000000000047a40>
|
545
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
546
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
547
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
548
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
549
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
550
|
+
4 Adelie Torgersen 39.3 20.6 190 ... 2007
|
551
|
+
5 Adelie Torgersen 38.9 17.8 181 ... 2007
|
552
|
+
: : : : : : ... :
|
553
|
+
202 Gentoo Biscoe 47.2 13.7 214 ... 2009
|
554
|
+
203 Gentoo Biscoe 46.8 14.3 215 ... 2009
|
555
|
+
204 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
492
556
|
```
|
493
557
|
|
494
558
|
- Notice: nil option
|
@@ -498,6 +562,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
498
562
|
hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
|
499
563
|
table = Arrow::Table.new(hash)
|
500
564
|
table.slice([true, false, nil])
|
565
|
+
|
501
566
|
# =>
|
502
567
|
#<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
|
503
568
|
a b c
|
@@ -509,6 +574,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
509
574
|
|
510
575
|
```ruby
|
511
576
|
RedAmber::DataFrame.new(table).slice([true, false, nil]).table
|
577
|
+
|
512
578
|
# =>
|
513
579
|
#<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
|
514
580
|
a b c
|
@@ -528,14 +594,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
528
594
|
```ruby
|
529
595
|
# returns 6th to 339th obs.
|
530
596
|
penguins.remove(0...5, -5..-1)
|
597
|
+
|
531
598
|
# =>
|
532
|
-
#<RedAmber::DataFrame : 334 x 8 Vectors,
|
533
|
-
|
534
|
-
|
535
|
-
|
536
|
-
|
537
|
-
|
538
|
-
|
599
|
+
#<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4>
|
600
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
601
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
602
|
+
1 Adelie Torgersen 39.3 20.6 190 ... 2007
|
603
|
+
2 Adelie Torgersen 38.9 17.8 181 ... 2007
|
604
|
+
3 Adelie Torgersen 39.2 19.6 195 ... 2007
|
605
|
+
4 Adelie Torgersen 34.1 18.1 193 ... 2007
|
606
|
+
5 Adelie Torgersen 42.0 20.2 190 ... 2007
|
607
|
+
: : : : : : ... :
|
608
|
+
332 Gentoo Biscoe 44.5 15.7 217 ... 2009
|
609
|
+
333 Gentoo Biscoe 48.8 16.2 222 ... 2009
|
610
|
+
334 Gentoo Biscoe 47.2 13.7 214 ... 2009
|
539
611
|
```
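The relationship between `slice(0...5, -5..-1)` and `remove(0...5, -5..-1)` in the examples above can be checked with a small plain-Ruby sketch. The helper `resolve_indices` is a hypothetical name introduced here for illustration; it only mimics how ranges with negative ends map to row indices and says nothing about how RedAmber resolves them internally.

```ruby
# Resolve integer ranges (negative ends count from the tail) to 0-based row indices.
def resolve_indices(n_rows, *ranges)
  all = (0...n_rows).to_a
  ranges.flat_map { |range| all[range] }.uniq
end

n_rows = 344 # number of observations in the penguins example

# Rows kept by a slice-like selection: first five and last five.
sliced = resolve_indices(n_rows, 0...5, -5..-1)

# Rows kept by a remove-like selection: everything else.
removed = (0...n_rows).to_a - sliced
```

With 344 rows, `sliced` holds 10 indices and `removed` holds 334, running from index 5 to index 338, which corresponds to the 6th through 339th observations in 1-based counting.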
|
540
612
|
|
541
613
|
- Booleans as an argument
|
@@ -545,19 +617,21 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
545
617
|
```ruby
|
546
618
|
# remove all observations that contain nil
|
547
619
|
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
|
548
|
-
removed
|
620
|
+
removed
|
621
|
+
|
549
622
|
# =>
|
550
|
-
RedAmber::DataFrame : 333 x 8 Vectors
|
551
|
-
|
552
|
-
|
553
|
-
|
554
|
-
|
555
|
-
|
556
|
-
|
557
|
-
|
558
|
-
|
559
|
-
7
|
560
|
-
8
|
623
|
+
#<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac>
|
624
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
625
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
626
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
627
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
628
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
629
|
+
4 Adelie Torgersen 36.7 19.3 193 ... 2007
|
630
|
+
5 Adelie Torgersen 39.3 20.6 190 ... 2007
|
631
|
+
: : : : : : ... :
|
632
|
+
331 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
633
|
+
332 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
634
|
+
333 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
561
635
|
```
|
562
636
|
|
563
637
|
- Indices or booleans by a block
|
@@ -571,14 +645,20 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
571
645
|
max = vector.mean + vector.std
|
572
646
|
vector.to_a.map { |e| (min..max).include? e }
|
573
647
|
end
|
648
|
+
|
574
649
|
# =>
|
575
|
-
#<RedAmber::DataFrame : 140 x 8 Vectors,
|
576
|
-
|
577
|
-
|
578
|
-
|
579
|
-
|
580
|
-
|
581
|
-
|
650
|
+
#<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000004de40>
|
651
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
652
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
653
|
+
1 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
654
|
+
2 Adelie Torgersen 36.7 19.3 193 ... 2007
|
655
|
+
3 Adelie Torgersen 34.1 18.1 193 ... 2007
|
656
|
+
4 Adelie Torgersen 37.8 17.1 186 ... 2007
|
657
|
+
5 Adelie Torgersen 37.8 17.3 180 ... 2007
|
658
|
+
: : : : : : ... :
|
659
|
+
138 Gentoo Biscoe (nil) (nil) (nil) ... 2009
|
660
|
+
139 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
661
|
+
140 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
582
662
|
```
|
583
663
|
- Notice for nil
|
584
664
|
- When `remove` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
|
@@ -586,28 +666,34 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
586
666
|
```ruby
|
587
667
|
df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
|
588
668
|
booleans = df[:a] < 2
|
669
|
+
booleans
|
670
|
+
|
589
671
|
# =>
|
590
672
|
#<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
|
591
673
|
[true, false, nil]
|
592
674
|
|
593
675
|
booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
|
676
|
+
|
594
677
|
df.slice(booleans) == df.remove(booleans_invert) # => true
|
595
678
|
```
|
679
|
+
|
596
680
|
- In contrast, `Vector#invert` returns nil for nil elements, which leads to a different result.
|
597
681
|
|
598
682
|
```ruby
|
599
683
|
booleans.invert
|
684
|
+
|
600
685
|
# =>
|
601
686
|
#<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
|
602
687
|
[false, true, nil]
|
603
688
|
|
604
689
|
df.remove(booleans.invert)
|
605
|
-
|
606
|
-
|
607
|
-
|
608
|
-
|
609
|
-
|
610
|
-
|
690
|
+
|
691
|
+
# =>
|
692
|
+
#<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000005df98>
|
693
|
+
a b c
|
694
|
+
<uint8> <string> <double>
|
695
|
+
1 1 A 1.0
|
696
|
+
2 (nil) C 3.0
|
611
697
|
```
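The two nil-handling behaviors above can be reproduced with plain Ruby arrays. This is a minimal sketch: `keep` and `drop` are hypothetical helpers that only mimic the documented outcomes of `slice` and `remove`, and the nil-preserving inversion imitates the `[false, true, nil]` output shown for `Vector#invert`.

```ruby
booleans = [true, false, nil]

# Ruby's `!` maps nil to true, as in `booleans.to_a.map(&:!)` above.
ruby_invert = booleans.map(&:!)

# Nil-preserving inversion: nil stays nil.
nil_preserving_invert = booleans.map { |b| b.nil? ? nil : !b }

rows = %w[row1 row2 row3]

# Keep rows whose flag is true (nil counts as false).
def keep(rows, flags)
  rows.zip(flags).select { |_, flag| flag }.map(&:first)
end

# Drop rows whose flag is true (nil counts as false, so that row stays).
def drop(rows, flags)
  rows.zip(flags).reject { |_, flag| flag }.map(&:first)
end

same = keep(rows, booleans) == drop(rows, ruby_invert)
different = drop(rows, nil_preserving_invert)
```

In this sketch `same` is true, while `different` retains both the first and the third row, mirroring the two-row result shown for `df.remove(booleans.invert)`.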
|
612
698
|
|
613
699
|
### `rename`
|
@@ -621,15 +707,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
621
707
|
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
|
622
708
|
|
623
709
|
```ruby
|
624
|
-
|
625
|
-
df = RedAmber::DataFrame.new(h)
|
710
|
+
df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
|
626
711
|
df.rename(:age => :age_in_1993)
|
712
|
+
|
627
713
|
# =>
|
628
|
-
#<RedAmber::DataFrame : 3 x 2 Vectors,
|
629
|
-
|
630
|
-
|
631
|
-
1
|
632
|
-
2
|
714
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000060838>
|
715
|
+
name age_in_1993
|
716
|
+
<string> <uint8>
|
717
|
+
1 Yasuko 68
|
718
|
+
2 Rui 49
|
719
|
+
3 Hinata 28
|
633
720
|
```
|
634
721
|
|
635
722
|
- Key pairs by a block
|
@@ -655,25 +742,29 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
655
742
|
|
656
743
|
```ruby
|
657
744
|
df = RedAmber::DataFrame.new(
|
658
|
-
|
659
|
-
|
745
|
+
name: %w[Yasuko Rui Hinata],
|
746
|
+
age: [68, 49, 28])
|
747
|
+
df
|
748
|
+
|
660
749
|
# =>
|
661
|
-
#<RedAmber::DataFrame : 3 x 2 Vectors,
|
662
|
-
|
663
|
-
|
664
|
-
1
|
665
|
-
2
|
750
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
|
751
|
+
name age
|
752
|
+
<string> <uint8>
|
753
|
+
1 Yasuko 68
|
754
|
+
2 Rui 49
|
755
|
+
3 Hinata 28
|
666
756
|
|
667
757
|
# update :age and add :brother
|
668
758
|
assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
|
669
759
|
df.assign(assigner)
|
760
|
+
|
670
761
|
# =>
|
671
|
-
#<RedAmber::DataFrame : 3 x 3 Vectors,
|
672
|
-
|
673
|
-
|
674
|
-
1
|
675
|
-
2
|
676
|
-
3
|
762
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
|
763
|
+
name age brother
|
764
|
+
<string> <uint8> <string>
|
765
|
+
1 Yasuko 97 Santa
|
766
|
+
2 Rui 78 (nil)
|
767
|
+
3 Hinata 57 Momotaro
|
677
768
|
```
|
678
769
|
|
679
770
|
- Key pairs by a block
|
@@ -685,13 +776,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
685
776
|
index: [0, 1, 2, 3, nil],
|
686
777
|
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
687
778
|
string: ['A', 'B', 'C', 'D', nil])
|
779
|
+
df
|
780
|
+
|
688
781
|
# =>
|
689
|
-
#<RedAmber::DataFrame : 5 x 3 Vectors,
|
690
|
-
|
691
|
-
|
692
|
-
1
|
693
|
-
2
|
694
|
-
3
|
782
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
|
783
|
+
index float string
|
784
|
+
<uint8> <double> <string>
|
785
|
+
1 0 0.0 A
|
786
|
+
2 1 1.1 B
|
787
|
+
3 2 2.2 C
|
788
|
+
4 3 NaN D
|
789
|
+
5 (nil) (nil) (nil)
|
695
790
|
|
696
791
|
# update numeric variables
|
697
792
|
df.assign do
|
@@ -701,13 +796,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
701
796
|
end
|
702
797
|
assigner
|
703
798
|
end
|
799
|
+
|
704
800
|
# =>
|
705
|
-
#<RedAmber::DataFrame : 5 x 3 Vectors,
|
706
|
-
|
707
|
-
|
708
|
-
1
|
709
|
-
2
|
710
|
-
3
|
801
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000006e000>
|
802
|
+
index float string
|
803
|
+
<int8> <double> <string>
|
804
|
+
1 0 -0.0 A
|
805
|
+
2 -1 -1.1 B
|
806
|
+
3 -2 -2.2 C
|
807
|
+
4 -3 NaN D
|
808
|
+
5 (nil) (nil) (nil)
|
711
809
|
|
712
810
|
# Or it's shorter like this:
|
713
811
|
df.assign do
|
@@ -715,6 +813,7 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
715
813
|
assigner[key] = vector * -1 if vector.numeric?
|
716
814
|
end
|
717
815
|
end
|
816
|
+
|
718
817
|
# => same as above
|
719
818
|
```
|
720
819
|
|
@@ -736,14 +835,17 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
736
835
|
string: ['C', 'B', nil, 'A', 'B'],
|
737
836
|
bool: [nil, true, false, true, false],
|
738
837
|
})
|
739
|
-
df.sort(:index, '-bool')
|
838
|
+
df.sort(:index, '-bool')
|
839
|
+
|
740
840
|
# =>
|
741
|
-
RedAmber::DataFrame : 5 x 3 Vectors
|
742
|
-
|
743
|
-
|
744
|
-
1
|
745
|
-
2
|
746
|
-
3
|
841
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000009b03c>
|
842
|
+
index string bool
|
843
|
+
<uint8> <string> <boolean>
|
844
|
+
1 0 (nil) false
|
845
|
+
2 0 B false
|
846
|
+
3 1 B true
|
847
|
+
4 1 C (nil)
|
848
|
+
5 (nil) A true
|
747
849
|
```
|
748
850
|
|
749
851
|
- [ ] Clamp
|
@@ -758,66 +860,16 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
758
860
|
|
759
861
|
## Grouping
|
760
862
|
|
761
|
-
### `group(aggregating_keys
|
762
|
-
|
763
|
-
(This is a temporary API and may change in the future version.)
|
863
|
+
### `group(aggregating_keys)`
|
764
864
|
|
765
|
-
|
865
|
+
(
|
866
|
+
This API will change in a future version. In particular, I want to change:
|
867
|
+
- Order of the columns in the result (aggregation_keys should come first)
|
868
|
+
- DataFrame#group will accept a block (heronshoes/red_amber #28)
|
869
|
+
)
|
766
870
|
|
767
|
-
|
768
|
-
|
769
|
-
```ruby
|
770
|
-
ds = Datasets::Rdatasets.new('dplyr', 'starwars')
|
771
|
-
starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
|
772
|
-
starwars.tdr(11)
|
773
|
-
# =>
|
774
|
-
RedAmber::DataFrame : 87 x 11 Vectors
|
775
|
-
Vectors : 3 numeric, 8 strings
|
776
|
-
# key type level data_preview
|
777
|
-
1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
|
778
|
-
2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
|
779
|
-
3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
|
780
|
-
4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
|
781
|
-
5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", .. . ]
|
782
|
-
6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
|
783
|
-
7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
|
784
|
-
8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
|
785
|
-
9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
|
786
|
-
10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
|
787
|
-
11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
|
788
|
-
|
789
|
-
grouped = starwars.group(:species, :mean, [:mass, :height])
|
790
|
-
# =>
|
791
|
-
#<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
|
792
|
-
Vectors : 2 numeric, 1 string
|
793
|
-
# key type level data_preview
|
794
|
-
1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
|
795
|
-
2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
|
796
|
-
3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
|
797
|
-
|
798
|
-
count = starwars.group(:species, :count, :species)[:"count(species)"]
|
799
|
-
df = grouped.slice(count > 1)
|
800
|
-
# =>
|
801
|
-
#<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
|
802
|
-
Vectors : 2 numeric, 1 string
|
803
|
-
# key type level data_preview
|
804
|
-
1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
|
805
|
-
2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
|
806
|
-
3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
|
807
|
-
|
808
|
-
df.table
|
809
|
-
# =>
|
810
|
-
#<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
|
811
|
-
mean(mass) mean(height) species
|
812
|
-
0 82.781818 176.645161 Human
|
813
|
-
1 69.750000 131.200000 Droid
|
814
|
-
2 124.000000 231.000000 Wookiee
|
815
|
-
3 74.000000 208.666667 Gungan
|
816
|
-
4 80.000000 173.000000 Zabrak
|
817
|
-
5 55.000000 179.000000 Twi'lek
|
818
|
-
6 53.100000 168.000000 Mirialan
|
819
|
-
7 88.000000 221.000000 Kaminoan
|
820
|
-
```
|
871
|
+
`group` creates an instance of the class `Group`. A `Group` accepts the functions below as methods.
|
872
|
+
Each method accepts `summary_keys` as its arguments.
|
821
873
|
|
822
874
|
Available functions are:
|
823
875
|
|
@@ -837,9 +889,115 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
837
889
|
- [ ] tdigest
|
838
890
|
- ✓ variance
|
839
891
|
|
892
|
+
For each group of `aggregation_keys`, the aggregation `function` is applied, and a new dataframe is returned with aggregated keys according to `summary_keys`.
|
893
|
+
Aggregated key names follow the `function(summary_key)` style.
|
894
|
+
|
895
|
+
This is an example of grouping the famous `starwars` dataset.
|
896
|
+
|
897
|
+
```ruby
|
898
|
+
starwars =
|
899
|
+
RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
|
900
|
+
starwars
|
901
|
+
|
902
|
+
# =>
|
903
|
+
#<RedAmber::DataFrame : 87 x 12 Vectors, 0x00000000000773bc>
|
904
|
+
species name height mass hair_color skin_color eye_color ... homeworld
|
905
|
+
<string> <string> <int64> <double> <string> <string> <string> ... <string>
|
906
|
+
Human 1 Luke Skywalker 172 77.0 blond fair blue ... Tatooine
|
907
|
+
Droid 2 C-3PO 167 75.0 NA gold yellow ... Tatooine
|
908
|
+
Droid 3 R2-D2 96 32.0 NA white, blue red ... Naboo
|
909
|
+
Human 4 Darth Vader 202 136.0 none white yellow ... Tatooine
|
910
|
+
Human 5 Leia Organa 150 49.0 brown light brown ... Alderaan
|
911
|
+
: : : : : : : : ... :
|
912
|
+
Droid 85 BB8 (nil) (nil) none none black ... NA
|
913
|
+
NA 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
|
914
|
+
Human 87 Padmé Amidala 165 45.0 brown light brown ... Naboo
|
915
|
+
|
916
|
+
starwars.tdr(12)
|
917
|
+
|
918
|
+
# =>
|
919
|
+
RedAmber::DataFrame : 87 x 12 Vectors
|
920
|
+
Vectors : 4 numeric, 8 strings
|
921
|
+
# key type level data_preview
|
922
|
+
1 :"" int64 87 [1, 2, 3, 4, 5, ... ]
|
923
|
+
2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
|
924
|
+
3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
|
925
|
+
4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
|
926
|
+
5 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ]
|
927
|
+
6 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
|
928
|
+
7 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
|
929
|
+
8 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
|
930
|
+
9 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, "NA"=>4}
|
931
|
+
10 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4}
|
932
|
+
11 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ]
|
933
|
+
12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
|
934
|
+
```
|
935
|
+
|
936
|
+
We can aggregate for `:species` and calculate the mean of `:mass` and `:height`.
|
937
|
+
|
938
|
+
```ruby
|
939
|
+
grouped = starwars.group(:species).mean(:mass, :height)
|
940
|
+
grouped
|
941
|
+
|
942
|
+
# =>
|
943
|
+
#<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000008e620>
|
944
|
+
mean(mass) mean(height) species
|
945
|
+
<double> <double> <string>
|
946
|
+
1 82.8 176.6 Human
|
947
|
+
2 69.8 131.2 Droid
|
948
|
+
3 124.0 231.0 Wookiee
|
949
|
+
4 74.0 173.0 Rodian
|
950
|
+
5 1358.0 175.0 Hutt
|
951
|
+
: : : :
|
952
|
+
36 159.0 216.0 Kaleesh
|
953
|
+
37 80.0 206.0 Pau'an
|
954
|
+
38 80.0 188.0 Kel Dor
|
955
|
+
```
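The grouping-and-mean step above can be approximated in plain Ruby to show how the `function(summary_key)` key names and the column order arise. This is a sketch under stated assumptions: `group_mean` is a hypothetical helper, the rows are made-up sample data rather than the real dataset, and skipping nils when averaging is assumed here rather than confirmed by the text.

```ruby
# Made-up sample rows (not the real starwars data).
rows = [
  { species: 'Human', mass: 77.0, height: 172 },
  { species: 'Droid', mass: 75.0, height: 167 },
  { species: 'Droid', mass: 32.0, height: 96 },
  { species: 'Human', mass: nil,  height: 150 }
]

# Group rows by group_key and average each summary key within a group.
def group_mean(rows, group_key, *summary_keys)
  rows.group_by { |row| row[group_key] }.map do |group_value, members|
    result = {}
    summary_keys.each do |key|
      # Assumption: nil values are skipped when computing the mean.
      values = members.map { |row| row[key] }.compact
      result[:"mean(#{key})"] = values.empty? ? nil : values.sum.to_f / values.size
    end
    # The grouping key is placed last, matching the column order shown above.
    result[group_key] = group_value
    result
  end
end

grouped = group_mean(rows, :species, :mass, :height)
```

Each element of `grouped` has the keys `:"mean(mass)"`, `:"mean(height)"`, and `:species`, in that order, which is the same layout as the `grouped` dataframe in the example.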
|
956
|
+
|
957
|
+
Select rows for count > 1.
|
958
|
+
|
959
|
+
```ruby
|
960
|
+
count = starwars.group(:species).count(:species)[:'count(species)'] # => Vector
|
961
|
+
grouped = grouped.slice(count > 1)
|
962
|
+
|
963
|
+
# =>
|
964
|
+
#<RedAmber::DataFrame : 9 x 3 Vectors, 0x0000000000098260>
|
965
|
+
mean(mass) mean(height) species
|
966
|
+
<double> <double> <string>
|
967
|
+
1 82.8 176.6 Human
|
968
|
+
2 69.8 131.2 Droid
|
969
|
+
3 124.0 231.0 Wookiee
|
970
|
+
4 74.0 208.7 Gungan
|
971
|
+
5 48.0 181.3 NA
|
972
|
+
: : : :
|
973
|
+
7 55.0 179.0 Twi'lek
|
974
|
+
8 53.1 168.0 Mirialan
|
975
|
+
9 88.0 221.0 Kaminoan
|
976
|
+
```
|
977
|
+
|
978
|
+
Assemble the result and change the order of columns.
|
979
|
+
|
980
|
+
```ruby
|
981
|
+
grouped.assign(count: count[count > 1]).pick { [2,3,0,1].map{ |i| keys[i] } }
|
982
|
+
|
983
|
+
# =>
|
984
|
+
#<RedAmber::DataFrame : 9 x 4 Vectors, 0x0000000000141838>
|
985
|
+
species count mean(mass) mean(height)
|
986
|
+
<string> <uint8> <double> <double>
|
987
|
+
1 Human 35 82.8 176.6
|
988
|
+
2 Droid 6 69.8 131.2
|
989
|
+
3 Wookiee 2 124.0 231.0
|
990
|
+
4 Gungan 3 74.0 208.7
|
991
|
+
5 NA 4 48.0 181.3
|
992
|
+
: : : : :
|
993
|
+
7 Twi'lek 2 55.0 179.0
|
994
|
+
8 Mirialan 2 53.1 168.0
|
995
|
+
9 Kaminoan 2 88.0 221.0
|
996
|
+
```
|
997
|
+
|
840
998
|
## Combining DataFrames
|
841
999
|
|
842
|
-
- [ ]
|
1000
|
+
- [ ] Combining rows to a dataframe
|
843
1001
|
|
844
1002
|
- [ ] Add vars
|
845
1003
|
|
@@ -852,3 +1010,5 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
852
1010
|
- [ ] One-hot encoding
|
853
1011
|
|
854
1012
|
## Iteration (not implemented)
|
1013
|
+
|
1014
|
+
- [ ] each_rows
|