red_amber 0.4.2 → 0.5.1

Files changed (42)
  1. checksums.yaml +4 -4
  2. data/.devcontainer/Dockerfile +75 -0
  3. data/.devcontainer/devcontainer.json +38 -0
  4. data/.devcontainer/onCreateCommand.sh +22 -0
  5. data/.rubocop.yml +11 -5
  6. data/CHANGELOG.md +141 -17
  7. data/Gemfile +5 -6
  8. data/README.ja.md +271 -0
  9. data/README.md +52 -31
  10. data/Rakefile +55 -0
  11. data/benchmark/group.yml +12 -5
  12. data/doc/Dev_Containers.ja.md +290 -0
  13. data/doc/Dev_Containers.md +292 -0
  14. data/doc/qmd/examples_of_red_amber.qmd +4596 -0
  15. data/doc/qmd/red-amber.qmd +90 -0
  16. data/docker/Dockerfile +2 -2
  17. data/docker/Gemfile +8 -3
  18. data/docker/docker-compose.yml +1 -1
  19. data/docker/readme.md +5 -5
  20. data/lib/red_amber/data_frame.rb +78 -4
  21. data/lib/red_amber/data_frame_combinable.rb +147 -119
  22. data/lib/red_amber/data_frame_displayable.rb +7 -6
  23. data/lib/red_amber/data_frame_loadsave.rb +1 -1
  24. data/lib/red_amber/data_frame_selectable.rb +51 -2
  25. data/lib/red_amber/data_frame_variable_operation.rb +6 -6
  26. data/lib/red_amber/group.rb +476 -127
  27. data/lib/red_amber/helper.rb +26 -0
  28. data/lib/red_amber/subframes.rb +18 -11
  29. data/lib/red_amber/vector.rb +45 -25
  30. data/lib/red_amber/vector_aggregation.rb +26 -0
  31. data/lib/red_amber/vector_selectable.rb +124 -40
  32. data/lib/red_amber/vector_string_function.rb +279 -0
  33. data/lib/red_amber/vector_unary_element_wise.rb +4 -0
  34. data/lib/red_amber/vector_updatable.rb +28 -0
  35. data/lib/red_amber/version.rb +1 -1
  36. data/lib/red_amber.rb +2 -1
  37. data/red_amber.gemspec +3 -3
  38. metadata +19 -14
  39. data/docker/Gemfile.lock +0 -80
  40. data/docker/example +0 -74
  41. data/docker/notebook/examples_of_red_amber.ipynb +0 -8562
  42. data/docker/notebook/red-amber.ipynb +0 -188
@@ -0,0 +1,4596 @@
1
+ ---
2
+ title: 127 examples of Red Amber
3
+ author: heronshoes
4
+ date: '2023-08-11'
5
+ format:
6
+ pdf:
7
+ code-fold: false
8
+ jupyter: ruby
9
+ format:
10
+ pdf:
11
+ toc: true
12
+ fontfamily: libertinus
13
+ colorlinks: true
14
+ ---
15
+
16
+ For RedAmber version 0.5.1-HEAD and Arrow version 12.0.1.
17
+
18
+ ## 1. Install
19
+
20
+ Install requirements before you install RedAmber.
21
+
22
+ - Ruby (>= 3.0)
23
+
24
+ - Apache Arrow (~> 12.0.0)
25
+ - Apache Arrow GLib (~> 12.0.0)
26
+ - Apache Parquet GLib (~> 12.0.0) # if you need IO from/to Parquet resource.
27
+
28
+ See [Apache Arrow install document](https://arrow.apache.org/install/).
29
+
30
+ - Minimum installation example for the latest Ubuntu:
31
+ ```shell
32
+ sudo apt update
33
+ sudo apt install -y -V ca-certificates lsb-release wget
34
+ wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
35
+ sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
36
+ sudo apt update
37
+ sudo apt install -y -V libarrow-dev
38
+ sudo apt install -y -V libarrow-glib-dev
39
+ ```
40
+ - On Fedora 38 (Rawhide):
41
+ ```shell
42
+ sudo dnf update
43
+ sudo dnf -y install gcc-c++ libarrow-devel libarrow-glib-devel ruby-devel
44
+ ```
45
+
46
+ - On macOS, you can install Apache Arrow C++ library using Homebrew:
47
+ ```shell
48
+ brew install apache-arrow
49
+ ```
50
+
51
+ and GLib (C) package with:
52
+ ```shell
53
+ brew install apache-arrow-glib
54
+ ```
55
+
56
+ If you prepared Apache Arrow, add these lines to your Gemfile:
57
+
58
+ ```ruby
59
+ gem 'red-arrow', '~> 12.0.0'
60
+ gem 'red_amber'
61
+ gem 'red-arrow-numo-narray' # Optional, recommended if you use inputs from Numo::NArray
62
+ # or use random sampling feature.
63
+ gem 'red-parquet', '~> 12.0.0' # Optional, if you use IO from/to parquet
64
+ gem 'red-datasets-arrow' # Optional, recommended if you use Red Datasets
65
+ gem 'red-arrow-activerecord' # Optional, if you use Active Record
66
+ gem 'rover-df' # Optional, if you use IO from/to Rover::DataFrame.
67
+ ```
68
+
69
+ And then execute `bundle install` or install it yourself as `gem install red_amber`.
70
+
71
+ ## 2. Require
72
+
73
+ ```{ruby}
74
+ #| tags: []
75
+ require 'red_amber' # require 'red-amber' is also OK
76
+ include RedAmber
77
+ {RedAmber: VERSION, Arrow: Arrow::VERSION}
78
+ ```
79
+
80
+ ## 3. Initialize
81
+
82
+ There are several ways to initialize a DataFrame.
83
+
84
+ ```{ruby}
85
+ #| tags: []
86
+ # From a Hash
87
+ DataFrame.new(x: [1, 2, 3], y: %w[A B C])
88
+ ```
89
+
90
+ ```{ruby}
91
+ #| tags: []
92
+ # From a schema and a row-oriented array
93
+ DataFrame.new({ x: :uint8, y: :string }, [[1, 'A'], [2, 'B'], [3, 'C']])
94
+ ```
95
+
96
+ ```{ruby}
97
+ #| tags: []
98
+ # From an Arrow::Table
99
+ table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C])
100
+ DataFrame.new(table)
101
+ ```
102
+
103
+ ```{ruby}
104
+ #| tags: []
105
+ # From a Rover::DataFrame
106
+ require 'rover'
107
+ rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C])
108
+ DataFrame.new(rover)
109
+ ```
110
+
111
+ ```{ruby}
112
+ #| tags: []
113
+ # From a dataset in Red Datasets
114
+ require 'datasets-arrow'
115
+ dataset = Datasets::Penguins.new
116
+ penguins = DataFrame.new(dataset) # Since 0.2.2. With older versions, use `dataset.to_arrow`.
117
+ ```
118
+
119
+ ```{ruby}
120
+ #| tags: []
121
+ dataset = Datasets::Rdatasets.new('datasets', 'mtcars')
122
+ mtcars = DataFrame.new(dataset)
123
+ ```
124
+
125
+ (New from 0.2.3 with Arrow 10.0.0) It is possible to initialize from objects that respond to `to_arrow`. Arrays in Numo::NArray respond to `to_arrow` with the `red-arrow-numo-narray` gem. This feature was proposed by the Red Data Tools member @kojix2 and implemented by @kou in Arrow 10.0.0 and Red Arrow Numo::NArray 0.0.6. Thanks!
126
+
127
+ ```{ruby}
128
+ #| tags: []
129
+ require 'arrow-numo-narray'
130
+
131
+ DataFrame.new(numo: Numo::DFloat.new(3).rand)
132
+ ```
133
+
134
+ Another example by Numo::NArray is [#77. Introduce columns from numo/narray](#77.-Introduce-columns-from-numo/narray).
135
+
136
+ ## 4. Load
137
+
138
+ `RedAmber::DataFrame` delegates `#load` to `Arrow::Table#load`. We can load from `[.arrow, .arrows, .csv, .csv.gz, .tsv]` files.
139
+
140
+ `load` accepts the following options:
141
+
142
+ `load(input, format: nil, compression: nil, schema: nil, skip_lines: nil)`
143
+
144
+ - `format` [:arrow_file, :batch, :arrows, :arrow_stream, :stream, :csv, :tsv]
145
+ - `compression` [:gzip, nil]
146
+ - `schema` [Arrow::Schema]
147
+ - `skip_lines` [Regexp]
148
+
149
+ Load from a file 'comecome.csv':
150
+
151
+ ```{ruby}
152
+ #| tags: []
153
+ file = Tempfile.open(['comecome', '.csv']) do |f|
154
+ f.puts(<<~CSV)
155
+ name,age
156
+ Yasuko,68
157
+ Rui,49
158
+ Hinata,28
159
+ CSV
160
+ f
161
+ end
162
+
163
+ DataFrame.load(file)
164
+ ```
165
+
166
+ Load from a Buffer:
167
+
168
+ ```{ruby}
169
+ #| tags: []
170
+ DataFrame.load(Arrow::Buffer.new(<<~BUFFER), format: :csv)
171
+ name,age
172
+ Yasuko,68
173
+ Rui,49
174
+ Hinata,28
175
+ BUFFER
176
+ ```
177
+
178
+ Load from a Buffer, skipping comment lines:
179
+
180
+ ```{ruby}
181
+ #| tags: []
182
+ DataFrame.load(Arrow::Buffer.new(<<~BUFFER), format: :csv, skip_lines: /^#/)
183
+ # comment
184
+ name,age
185
+ Yasuko,68
186
+ Rui,49
187
+ Hinata,28
188
+ BUFFER
189
+ ```
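The effect of `skip_lines` can be pictured in plain Ruby, without Arrow. The following is only a sketch of the idea (drop every line that matches the Regexp before parsing). The helper name `parse_simple_csv` is hypothetical; the real option is handled by Red Arrow's loader, which also infers column types.

```ruby
# Hypothetical helper: a naive CSV reader that drops lines matching skip_lines.
# It only illustrates the idea; it handles neither quoting nor type inference.
def parse_simple_csv(text, skip_lines: nil)
  lines = text.lines(chomp: true)
  lines = lines.reject { |line| line.match?(skip_lines) } if skip_lines
  header, *rows = lines.map { |line| line.split(',') }
  # Transpose the row-oriented data into key => column pairs.
  header.map(&:to_sym).zip(rows.transpose).to_h
end

csv = <<~BUFFER
  # comment
  name,age
  Yasuko,68
  Rui,49
  Hinata,28
BUFFER

columns = parse_simple_csv(csv, skip_lines: /^#/)
p columns
```

All values stay Strings in this sketch, whereas the loader in the document returns typed columns.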
190
+
191
+ ## 5. Load from a URI
192
+
193
+ ```{ruby}
194
+ #| tags: []
195
+ uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
196
+ DataFrame.load(uri)
197
+ ```
198
+
199
+ ## 6. Save
200
+
201
+ `#save` accepts the same options as `#load`. See [#4. Load](#4.-Load).
202
+
203
+ ```{ruby}
204
+ #| tags: []
205
+ penguins.save("penguins.arrow")
206
+ penguins.save("penguins.arrows")
207
+ penguins.save("penguins.csv")
208
+ penguins.save("penguins.csv.gz")
209
+ penguins.save("penguins.tsv")
210
+ penguins.save("penguins.feather")
211
+ ```
212
+
213
+ (Since 0.3.0) `DataFrame#save` returns self.
214
+
215
+ ## 7. to_s/inspect
216
+
217
+ `to_s` or `inspect` (it uses to_s inside) shows a preview of the dataframe.
218
+
219
+ It shows the first 5 and the last 3 rows if the dataframe has many rows. Columns are also omitted if a line exceeds 80 characters.
220
+
221
+ ```{ruby}
222
+ #| tags: []
223
+ df = DataFrame.new(
224
+ x: [1, 2, 3, 4, 5],
225
+ y: [1, 2, 3, 0/0.0, nil],
226
+ s: %w[A B C D] << nil,
227
+ b: [true, false, true, false, nil]
228
+ )
229
+ ```
230
+
231
+ ```{ruby}
232
+ #| tags: []
233
+ p penguins; nil
234
+ ```
235
+
236
+ ## 8. Show table
237
+
238
+ `#table` shows the Arrow::Table object. The alias is `#to_arrow`.
239
+
240
+ ```{ruby}
241
+ #| tags: []
242
+ df.table
243
+ ```
244
+
245
+ ```{ruby}
246
+ #| tags: []
247
+ penguins.to_arrow
248
+ ```
249
+
250
+ ```{ruby}
251
+ #| tags: []
252
+ # This is a Red Arrow feature
253
+ puts df.table.to_s(format: :column)
254
+ ```
255
+
256
+ ```{ruby}
257
+ #| tags: []
258
+ # This is also a Red Arrow feature
259
+ puts df.table.to_s(format: :list)
260
+ ```
261
+
262
+ ## 9. TDR
263
+
264
+ TDR means 'Transposed Dataframe Representation'. It shows columns laterally, in the same shape as when initializing from a Hash. TDR carries some information that is useful for exploratory data processing.
265
+
266
+ - DataFrame shape: n_rows x n_columns
267
+ - Data types
268
+ - Levels: number of unique elements
269
+ - Data preview: identical values are aggregated if the number of levels is small (tally mode)
270
+ - Counts of abnormal elements: NaN and nil
271
+
272
+ It is similar to dplyr's (or Polars's) `glimpse()` so we have an alias `#glimpse` (since 0.4.0).
273
+
274
+ ```{ruby}
275
+ #| tags: []
276
+ df.tdr
277
+ ```
278
+
279
+ ```{ruby}
280
+ #| tags: []
281
+ penguins.tdr
282
+ ```
283
+
284
+ `#tdr` has some options:
285
+
286
+ `limit` : limits the number of variables to show. Default value is `limit=10`.
287
+
288
+ ```{ruby}
289
+ #| tags: []
290
+ penguins.tdr(3)
291
+ ```
292
+
293
+ By default `#tdr` shows 10 variables at maximum. `#tdr(:all)` will show all variables.
294
+
295
+ ```{ruby}
296
+ #| tags: []
297
+ mtcars.tdr(:all)
298
+ ```
299
+
300
+ (Since 0.4.0) The `#tdra` method is a shortcut for `#tdr(:all)`.
301
+
302
+ ```{ruby}
303
+ #| tags: []
304
+ mtcars.tdra
305
+ ```
306
+
307
+ `elements` : max number of elements to show in observations. Default value is `elements: 5`.
308
+
309
+ ```{ruby}
310
+ #| tags: []
311
+ penguins.tdr(elements: 3) # Show first 3 items in data
312
+ ```
313
+
314
+ `tally` : max level to use tally mode. Level means size of `tally`ed hash. Default value is `tally: 5`.
315
+
316
+ ```{ruby}
317
+ #| tags: []
318
+ penguins.tdr(tally: 0) # Don't use tally mode
319
+ ```
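The interplay of the `tally` and `elements` options can be sketched in plain Ruby. This is an illustration under an assumption, not RedAmber's implementation: the hypothetical helper `preview` switches to tally mode when the number of unique values (the level) is at most `tally`, and otherwise shows the first `elements` values. The exact comparison used by `#tdr` may differ.

```ruby
# Hypothetical preview helper. Assumption: tally mode is used when the
# number of unique values (the level) is at most `tally`.
def preview(values, tally: 5, elements: 5)
  counts = values.tally
  if counts.size <= tally
    counts                  # tally mode: value => count
  else
    values.first(elements)  # plain mode: the first few elements
  end
end

species = %w[Adelie Adelie Gentoo Chinstrap Gentoo Adelie]
tallied = preview(species)            # 3 levels, so a Hash of counts
listed = preview((1..10).to_a)        # 10 levels, so the first 5 elements
untallied = preview(species, tally: 0) # tally mode disabled
p tallied, listed, untallied
```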
320
+
321
+ `#tdr_str` returns a String. `#tdr` does the same thing as `puts tdr_str`.
322
+
323
+ ```{ruby}
324
+ #| tags: []
325
+ puts penguins.tdr_str
326
+ ```
327
+
328
+ (Since 0.4.0) `#glimpse` is an alias for `#tdr`.
329
+
330
+ ```{ruby}
331
+ #| tags: []
332
+ mtcars.glimpse(:all, elements: 10)
333
+ ```
334
+
335
+ ## 10. Size and shape
336
+
337
+ ```{ruby}
338
+ #| tags: []
339
+ # same as n_rows, n_obs
340
+ df.size
341
+ ```
342
+
343
+ ```{ruby}
344
+ #| tags: []
345
+ # same as n_cols, n_vars
346
+ df.n_keys
347
+ ```
348
+
349
+ ```{ruby}
350
+ #| tags: []
351
+ # [df.size, df.n_keys], [df.n_rows, df.n_cols]
352
+ df.shape
353
+ ```
354
+
355
+ ## 11. Keys
356
+
357
+ ```{ruby}
358
+ #| tags: []
359
+ df.keys
360
+ ```
361
+
362
+ ```{ruby}
363
+ #| tags: []
364
+ penguins.keys
365
+ ```
366
+
367
+ ## 12. Types
368
+
369
+ ```{ruby}
370
+ #| tags: []
371
+ df.types
372
+ ```
373
+
374
+ ```{ruby}
375
+ #| tags: []
376
+ penguins.types
377
+ ```
378
+
379
+ ## 13. Data type classes
380
+
381
+ ```{ruby}
382
+ #| tags: []
383
+ df.type_classes
384
+ ```
385
+
386
+ ```{ruby}
387
+ #| tags: []
388
+ penguins.type_classes
389
+ ```
390
+
391
+ ## 14. Indices
392
+
393
+ Another example of `indices` is in [66. Custom index](#66.-Custom-index).
394
+
395
+ ```{ruby}
396
+ #| tags: []
397
+ df.indexes
398
+ # or
399
+ df.indices
400
+ ```
401
+
402
+ (Since 0.2.3) `#indices` returns a Vector.
403
+
404
+ ## 15. To an Array or a Hash
405
+
406
+ DataFrame#to_a returns an array of row-oriented data without a header.
407
+
408
+ ```{ruby}
409
+ #| tags: []
410
+ df.to_a
411
+ ```
412
+
413
+ If you need a column-oriented array with keys, use `.to_h.to_a`.
414
+
415
+ ```{ruby}
416
+ #| tags: []
417
+ df.to_h
418
+ ```
419
+
420
+ ```{ruby}
421
+ #| tags: []
422
+ df.to_h.to_a
423
+ ```
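The difference between the row-oriented and the column-oriented shapes can be reproduced with a plain Hash. This sketch uses only core Ruby and the same data as the Hash example in section 3; it does not call RedAmber.

```ruby
columns = { x: [1, 2, 3], y: %w[A B C] }

# Row-oriented: one inner Array per row, without a header.
rows = columns.values.transpose
# Column-oriented with keys: one [key, values] pair per column.
pairs = columns.to_a

p rows
p pairs
```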
424
+
425
+ ## 16. Schema
426
+
427
+ A schema is a Hash of key and value-type pairs.
428
+
429
+ ```{ruby}
430
+ #| tags: []
431
+ df.schema
432
+ ```
433
+
434
+ ## 17. Vector
435
+
436
+ Each variable (column in the table) is represented by a Vector object.
437
+
438
+ ```{ruby}
439
+ #| tags: []
440
+ df[:x] # This syntax is explained later
441
+ ```
442
+
443
+ Or create a new Vector with the constructor.
444
+
445
+ ```{ruby}
446
+ #| tags: []
447
+ Vector.new(1, 2, 3, 4, 5)
448
+ ```
449
+
450
+ ```{ruby}
451
+ #| tags: []
452
+ Vector.new(1..5)
453
+ ```
454
+
455
+ ```{ruby}
456
+ #| tags: []
457
+ Vector.new([1, 2, 3], [4, 5])
458
+ ```
459
+
460
+ ```{ruby}
461
+ #| tags: []
462
+ array = Arrow::Array.new([1, 2, 3, 4, 5])
463
+ Vector.new(array)
464
+ ```
465
+
466
+ (Since 0.4.2) A new constructor `Vector[*array_like]` has been introduced.
467
+
468
+ ```{ruby}
469
+ #| tags: []
470
+ Vector[1, 2, 3, 4, 5]
471
+ ```
472
+
473
+ ## 18. Vectors
474
+
475
+ Returns an Array of the Vectors in the DataFrame.
476
+
477
+ ```{ruby}
478
+ #| tags: []
479
+ df.vectors
480
+ ```
481
+
482
+ ## 19. Variables
483
+
484
+ Returns key and Vector pairs as a Hash.
485
+
486
+ ```{ruby}
487
+ #| tags: []
488
+ df.variables
489
+ ```
490
+
491
+ ## 20. Select columns by #[ ]
492
+
493
+ `DataFrame#[]` is overloaded for column operations and row operations.
494
+
495
+ - For columns (variables)
496
+ - Key in a Symbol: `df[:symbol]`
497
+ - Key in a String: `df["string"]`
498
+ - Keys in an Array: `df[:symbol1, "string", :symbol2]`
499
+ - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
500
+
501
+ ```{ruby}
502
+ #| tags: []
503
+ # Keys in a Symbol and a String
504
+ df[:x, 'y']
505
+ ```
506
+
507
+ ```{ruby}
508
+ #| tags: []
509
+ # Keys in a Range
510
+ df[:x..:y]
511
+ ```
512
+
513
+ ```{ruby}
514
+ #| tags: []
515
+ # Keys with an index Range, and a symbol
516
+ df[df.keys[2..], :x]
517
+ ```
518
+
519
+ ## 21. Select rows by #[ ]
520
+ `DataFrame#[]` is overloaded for column operations and row operations.
521
+
522
+ - For rows (observations)
523
+ - Select rows by an Index: `df[index]`
524
+ - Select rows by Indices: `df[indices]` # Array, Arrow::Array, Vectors are acceptable for indices
525
+ - Select rows by Ranges: `df[range]`
526
+ - Select rows by Booleans: `df[booleans]` # Array, Arrow::Array, Vectors are acceptable for booleans
527
+
528
+ ```{ruby}
529
+ #| tags: []
530
+ # indices
531
+ df[0, 2, 1]
532
+ ```
533
+
534
+ ```{ruby}
535
+ #| tags: []
536
+ # including a Range
537
+ # negative indices are also acceptable
538
+ df[1..2, -1]
539
+ ```
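The selection rules for rows (indices, Ranges, negative indices and booleans) resemble those of a Ruby Array. As a rough analogy in core Ruby only, with made-up row labels, `Array#values_at` accepts the same mix of Integers and Ranges:

```ruby
rows = %w[r0 r1 r2 r3 r4]

# Indices in any order, like df[0, 2, 1]
by_indices = rows.values_at(0, 2, 1)
# A Range together with a negative index, like df[1..2, -1]
with_range = rows.values_at(1..2, -1)
# A boolean mask with the same length as rows, like df[false, true, true, false, true]
mask = [false, true, true, false, true]
by_booleans = rows.zip(mask).select { |_, m| m }.map(&:first)

p by_indices, with_range, by_booleans
```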
540
+
541
+ ```{ruby}
542
+ #| tags: []
543
+ # booleans
544
+ # length of boolean should be the same as self
545
+ df[false, true, true, false, true]
546
+ ```
547
+
548
+ ```{ruby}
549
+ #| tags: []
550
+ # Arrow::Array
551
+ indices = Arrow::UInt8Array.new([0,2,4])
552
+ df[indices]
553
+ ```
554
+
555
+ ```{ruby}
556
+ #| tags: []
557
+ # By a Vector as indices
558
+ indices = Vector.new(df.indices)
559
+ # indices > 1 returns a boolean Vector
560
+ df[indices > 1]
561
+ ```
562
+
563
+ ```{ruby}
564
+ #| tags: []
565
+ # By a Vector as booleans
566
+ booleans = df[:b]
567
+ ```
568
+
569
+ ```{ruby}
570
+ #| tags: []
571
+ df[booleans]
572
+ ```
573
+
574
+ ## 22. empty?
575
+
576
+ ```{ruby}
577
+ #| tags: []
578
+ df.empty?
579
+ ```
580
+
581
+ ```{ruby}
582
+ #| tags: []
583
+ DataFrame.new
584
+ ```
585
+
586
+ ```{ruby}
587
+ #| tags: []
588
+ DataFrame.new.empty?
589
+ ```
590
+
591
+ ## 23. Select columns by pick
592
+
593
+ `DataFrame#pick` accepts an Array of keys to pick up columns (variables) and creates a new DataFrame. You can change the order of columns at the same time.
594
+
595
+ The name `pick` comes from the action of picking variables (columns) according to the label keys.
596
+
597
+ ```{ruby}
598
+ #| tags: []
599
+ df.pick(:s, :y)
600
+ # or
601
+ df.pick([:s, :y]) # OK too.
602
+ ```
603
+
604
+ Or use a boolean Array of length `n_keys` to `pick`. This style preserves the order of variables.
605
+
606
+ ```{ruby}
607
+ #| tags: []
608
+ df.pick(false, true, true, false)
609
+ # or
610
+ df.pick([false, true, true, false])
611
+ # or
612
+ df.pick(Vector.new([false, true, true, false]))
613
+ ```
614
+
615
+ `#pick` also accepts a block in the context of self.
616
+
617
+ The next example picks up numeric variables.
618
+
619
+ ```{ruby}
620
+ #| tags: []
621
+ # receiver is required with the argument style
622
+ df.pick(df.vectors.map(&:numeric?))
623
+
624
+ # with a block
625
+ df.pick { vectors.map(&:numeric?) }
626
+ ```
627
+
628
+ `pick` also accepts numeric indexes.
629
+
630
+ (Since 0.2.1)
631
+
632
+ ```{ruby}
633
+ #| tags: []
634
+ df.pick(0, 3)
635
+ ```
636
+
637
+ ## 24. Reject columns by drop
638
+
639
+ `DataFrame#drop` accepts an Array of keys to drop columns (variables) and creates a DataFrame from the remainder.
640
+
641
+ The name `drop` was chosen as the counterpart of `pick`.
642
+
643
+ ```{ruby}
644
+ #| tags: []
645
+ df.drop(:x, :b)
646
+ # df.drop([:x, :b]) # is OK too.
647
+ ```
648
+
649
+ Or use a boolean Array of length `n_keys` to `drop`.
650
+
651
+ ```{ruby}
652
+ #| tags: []
653
+ df.drop(true, false, false, true)
654
+ # df.drop([true, false, false, true]) # is OK too
655
+ ```
656
+
657
+ `#drop` also accepts a block in the context of self.
658
+
659
+ Next example will drop variables which have nil or NaN values.
660
+
661
+ ```{ruby}
662
+ #| tags: []
663
+ df.drop { vectors.map { |v| v.is_na.any } }
664
+ ```
665
+
666
+ Argument style is also acceptable but it requires the receiver 'df'.
667
+
668
+ ```{ruby}
669
+ #| tags: []
670
+ df.drop(df.vectors.map { |v| v.is_na.any })
671
+ ```
672
+
673
+ `drop` also accepts numeric indexes.
674
+
675
+ (Since 0.2.1)
676
+
677
+ ```{ruby}
678
+ #| tags: []
679
+ df.drop(0, 3)
680
+ ```
681
+
682
+ ## 25. Pick/drop and nil
683
+
684
+ When `pick` or `drop` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `BasicObject#!`.
685
+
686
+ ```{ruby}
687
+ #| tags: []
688
+ booleans = [true, true, false, nil]
689
+ booleans_invert = booleans.map(&:!) # => [false, false, true, true] because nil.! is true
690
+ df.pick(booleans) == df.drop(booleans_invert)
691
+ ```
692
+
693
+ ## 26. Vector#invert, #primitive_invert
694
+
695
+ For the boolean Vector:
696
+
697
+ ```{ruby}
698
+ #| tags: []
699
+ vector = Vector.new(booleans)
700
+ ```
701
+
702
+ `Vector#invert` keeps nil as nil.
703
+
704
+ ```{ruby}
705
+ #| tags: []
706
+ vector.invert
707
+ # or
708
+ !vector
709
+ ```
710
+
711
+ So `df.pick(booleans) != df.drop(booleans.invert)` when booleans have any nils.
712
+
713
+ On the other hand, `Vector#primitive_invert` follows Ruby's `BasicObject#!`'s behavior. Then pick and drop keep 'MECE' behavior.
714
+
715
+ ```{ruby}
716
+ #| tags: []
717
+ vector.primitive_invert
718
+ ```
719
+
720
+ ```{ruby}
721
+ #| tags: []
722
+ df.pick(vector) == df.drop(vector.primitive_invert)
723
+ ```
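The two kinds of inversion can be modelled on plain Arrays to see why only the primitive one keeps pick and drop complementary. This is a sketch under assumptions: the lambdas `pick` and `drop` are hypothetical stand-ins that treat nil in a mask as false, as the text describes, and the keys are those of the example DataFrame.

```ruby
keys = %i[x y s b]
booleans = [true, true, false, nil]

# Models the nil-preserving inversion described for Vector#invert.
nil_preserving = booleans.map { |b| b.nil? ? nil : !b }
# Models Vector#primitive_invert, which follows BasicObject#! (nil.! is true).
primitive = booleans.map(&:!)

# Hypothetical stand-ins for pick/drop with a boolean mask; nil counts as false.
pick = ->(mask) { keys.zip(mask).select { |_, b| b }.map(&:first) }
drop = ->(mask) { keys.zip(mask).reject { |_, b| b }.map(&:first) }

picked = pick.call(booleans)
dropped_nil_preserving = drop.call(nil_preserving) # :b survives because nil does not drop it
dropped_primitive = drop.call(primitive)

p picked, dropped_nil_preserving, dropped_primitive
```

Only the primitive inversion yields the same keys for `pick` and `drop`, which is the 'MECE' property mentioned above.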
724
+
725
+ ## 27. Pick/drop, #[] and #v
726
+
727
+ When `pick` or `drop` selects a single column (variable), it returns a `DataFrame` with one column (variable).
728
+
729
+ ```{ruby}
730
+ #| tags: []
731
+ df.pick(:x) # or
732
+ df.drop(:y, :s, :b)
733
+ ```
734
+
735
+ In contrast, when `[]` selects a single column (variable), it returns a `Vector`.
736
+
737
+ ```{ruby}
738
+ #| tags: []
739
+ df[:x]
740
+ ```
741
+
742
+ This behavior may be useful to use with DataFrame manipulation verbs (like pick, drop, slice, remove, assign, rename).
743
+
744
+ ```{ruby}
745
+ #| tags: []
746
+ df.pick { keys.select { |key| df[key].numeric? } }
747
+ ```
748
+
749
+ The `df#v` method is the same as `df#[]` for picking a Vector, but it is a little faster and easier to use in a block.
750
+
751
+ ```{ruby}
752
+ #| tags: []
753
+ df.v(:x)
754
+ ```
755
+
756
+ ## 28. Slice
757
+
758
+ Another example of `slice` is [#70. Row index label by slice_by](#70.-Row-index-label-by-slice_by).
759
+
760
+ `slice` selects rows (records) to create a subset of a DataFrame.
761
+
762
+ `slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. A negative index counting from the tail, like in Ruby's Array, is also acceptable.
763
+
764
+ ```{ruby}
765
+ #| tags: []
766
+ # returns 5 rows from the start and 5 rows from the end
767
+ penguins.slice(0...5, -5..-1)
768
+ ```
769
+
770
+ ```{ruby}
771
+ #| tags: []
772
+ # slice accepts Float index
773
+ # 33% of 344 observations in index => 113.52 th data ??
774
+ indexed_penguins = penguins.assign_left { [:index, indexes] } # #assign_left and assigner by Array is 0.2.0 feature
775
+ indexed_penguins.slice(penguins.size * 0.33)
776
+ ```
777
+
778
+ Indices in Vectors or Arrow::Arrays are also acceptable.
779
+
780
+ Another way to select in `slice` is to use booleans. An alias for this feature is `filter`.
781
+ - Booleans can be an Array, an Arrow::Array, a Vector or an Array of them.
782
+ - Each data type must be boolean.
783
+ - Size of booleans must be same as the size of self.
784
+
785
+ ```{ruby}
786
+ #| tags: []
787
+ # make boolean Vector to check over 40
788
+ booleans = penguins[:bill_length_mm] > 40
789
+ ```
790
+
791
+ ```{ruby}
792
+ #| tags: []
793
+ penguins.slice(booleans)
794
+ ```
795
+
796
+ `slice` accepts a block.
797
+ - We can't use both arguments and a block at the same time.
798
+ - The block should return indices of any length or a boolean Array with the same length as `size`.
799
+ - The block is called in the context of self, so the receiver 'self' can be omitted in the block.
800
+
801
+ ```{ruby}
802
+ #| tags: []
803
+ # return a DataFrame with bill_length_mm is in 2*std range around mean
804
+ penguins.slice do
805
+ min = bill_length_mm.mean - bill_length_mm.std
806
+ max = bill_length_mm.mean + bill_length_mm.std
807
+ bill_length_mm.to_a.map { |e| (min..max).include? e }
808
+ end
809
+ ```
810
+
811
+ ## 29. Slice and nil option
812
+
813
+ `Arrow::Table#slice` uses the `#filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row.
814
+
815
+ ```{ruby}
816
+ #| tags: []
817
+ hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
818
+ table = Arrow::Table.new(hash)
819
+ table.slice([true, false, nil])
820
+ ```
821
+
822
+ Whereas in RedAmber, nil in the booleans passed to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.
823
+
824
+ ```{ruby}
825
+ #| tags: []
826
+ RedAmber::DataFrame.new(table).slice([true, false, nil]).table
827
+ ```
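The two null-selection behaviors can be modelled on plain Ruby data to make the difference concrete. This is a sketch of the semantics described above, not a call into Arrow; rows are represented as Hashes for illustration only.

```ruby
rows = [{ a: 1, b: 'A' }, { a: 2, b: 'B' }, { a: 3, b: 'C' }]
mask = [true, false, nil]

# :emit_null keeps a row of nils wherever the mask is nil.
emit_null = rows.zip(mask).filter_map do |row, m|
  if m.nil?
    row.transform_values { nil }
  elsif m
    row
  end
end

# :drop treats nil in the mask like false.
drop = rows.zip(mask).filter_map { |row, m| row if m }

p emit_null
p drop
```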
828
+
829
+ ## 30. Remove
830
+
831
+ Slice and reject rows (observations) to create a DataFrame from the remainder.
832
+
833
+ `#remove(indices)` accepts indices as arguments. Indices should be an Integer or a Range of Integers.
834
+
835
+ ```{ruby}
836
+ #| tags: []
837
+ # returns 6th to 339th obs. Remainder of penguins.slice(0...5, -5..-1)
838
+ penguins.remove(0...5, -5..-1)
839
+ ```
840
+
841
+ `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `#size`.
842
+
843
+ ```{ruby}
844
+ #| tags: []
845
+ # remove all observations containing nil
846
+ removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
847
+ ```
848
+
849
+ `remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as size. The block is called in the context of self.
850
+
851
+ ```{ruby}
852
+ #| tags: []
853
+ # Remove data in 2*std range around mean
854
+ penguins.remove do
855
+ vector = self[:bill_length_mm]
856
+ min = vector.mean - vector.std
857
+ max = vector.mean + vector.std
858
+ vector.to_a.map { |e| (min..max).include? e }
859
+ end
860
+ ```
861
+
862
+ ## 31. Remove and nil
863
+
864
+ When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
865
+
866
+ ```{ruby}
867
+ #| tags: []
868
+ df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
869
+ ```
870
+
871
+ ```{ruby}
872
+ #| tags: []
873
+ booleans = df[:a] < 2
874
+ ```
875
+
876
+ ```{ruby}
877
+ #| tags: []
878
+ booleans_invert = booleans.to_a.map(&:!)
879
+ ```
880
+
881
+ ```{ruby}
882
+ #| tags: []
883
+ df.slice(booleans) == df.remove(booleans_invert)
884
+ ```
885
+
886
+ Whereas `Vector#invert` returns nil for nil elements. This leads to a different result. (See #26)
887
+
888
+ ```{ruby}
889
+ #| tags: []
890
+ booleans.invert
891
+ ```
892
+
893
+ ```{ruby}
894
+ #| tags: []
895
+ df.remove(booleans.invert)
896
+ ```
897
+
898
+ We have `#primitive_invert` method in Vector. This method returns the same result as `.to_a.map(&:!)` above.
899
+
900
+ ```{ruby}
901
+ #| tags: []
902
+ booleans.primitive_invert
903
+ ```
904
+
905
+ ```{ruby}
906
+ #| tags: []
907
+ df.remove(booleans.primitive_invert)
908
+ ```
909
+
910
+ ```{ruby}
911
+ #| tags: []
912
+ df.slice(booleans) == df.remove(booleans.primitive_invert)
913
+ ```
914
+
915
+ ## 32. Remove nil
916
+
917
+ Remove any observations containing nil.
918
+
919
+ ```{ruby}
920
+ #| tags: []
921
+ penguins.remove_nil
922
+ ```
923
+
924
+ The roundabout way for this is to use `#remove`.
925
+
926
+ ```{ruby}
927
+ #| tags: []
928
+ penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
929
+ ```
930
+
931
+ ## 33. Rename
932
+
933
+ Rename keys (column names) to create an updated DataFrame.
934
+
935
+ `#rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays `[[existing_key, new_key], ...]`.
936
+
937
+ ```{ruby}
938
+ #| tags: []
939
+ h = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
940
+ comecome = RedAmber::DataFrame.new(h)
941
+ ```
942
+
943
+ ```{ruby}
944
+ #| tags: []
945
+ comecome.rename(age: :age_in_1993)
946
+ # comecome.rename(:age, :age_in_1993) # is also OK
947
+ # comecome.rename([:age, :age_in_1993]) # is also OK
948
+ ```
949
+
950
+ `#rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays `[[existing_key, new_key], ...]`. The block is called in the context of self.
951
+
952
+ Symbol key and String key are distinguished.
953
+
954
+ ## 34. Assign
955
+
956
+ Other examples of `assign` are [#68. Assign revised](#68.-Assign-revised) and [#69. Variations of assign](#69.-Variations-of-assign).
957
+
958
+ Assign new or updated columns (variables) and create an updated DataFrame.
959
+
960
+ - Columns with new keys will append new variables at right (bottom in TDR).
961
+ - Columns with existing keys will update corresponding vectors.
962
+
963
+ `#assign(key_pairs)` accepts pairs of key and array_like values as arguments. The pairs should be a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`.
964
+
965
+ ```{ruby}
966
+ #| tags: []
967
+ comecome = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28] )
968
+ ```
969
+
970
+ ```{ruby}
971
+ #| tags: []
972
+ # update :age and add :brother
973
+ assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
974
+ comecome.assign(assigner)
975
+ ```
976
+
977
+ `#assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and array_like values as a Hash of `{key => array_like}` or an Array of Arrays `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. The block is called in the context of self.
978
+
979
+ ```{ruby}
980
+ #| tags: []
981
+ df = RedAmber::DataFrame.new(
982
+ index: [0, 1, 2, 3, nil],
983
+ float: [0.0, 1.1, 2.2, Float::NAN, nil],
984
+ string: ['A', 'B', 'C', 'D', nil])
985
+ ```
986
+
987
+ ```{ruby}
988
+ #| tags: []
989
+ # update numeric variables
990
+ df.assign do
991
+ vectors.select(&:numeric?).map { |v| [v.key, -v] }
992
+ end
993
+ ```
994
+
995
+ In this example, columns :index and :float are updated. Column :index returns complements for the #negate method because :index is of :uint8 type.
996
+
997
+ ```{ruby}
998
+ #| tags: []
999
+ df.types
1000
+ ```
1001
+
1002
+ ## 35. Coerce in Vector
1003
+
1004
+ Vector has a `coerce` method.
1005
+
1006
+ ```{ruby}
1007
+ #| tags: []
1008
+ vector = RedAmber::Vector.new(1,2,3)
1009
+ ```
1010
+
1011
+ ```{ruby}
1012
+ #| tags: []
1013
+ # Vector's `#*` method
1014
+ vector * -1
1015
+ ```
1016
+
1017
+ ```{ruby}
1018
+ #| tags: []
1019
+ # coerced calculation
1020
+ -1 * vector
1021
+ ```
1022
+
1023
+ ```{ruby}
1024
+ #| tags: []
1025
+ # `-@` operator (unary minus)
1026
+ -vector
1027
+ ```
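Ruby's coercion protocol is what lets `-1 * vector` work even though the receiver is an Integer. The class below is a hypothetical minimal stand-in (`TinyVector` is not RedAmber's Vector and says nothing about its internals); it only shows how defining `coerce` makes a left-hand Integer defer to the right-hand object.

```ruby
# Hypothetical minimal class used only to illustrate Ruby's coerce protocol.
class TinyVector
  attr_reader :values

  def initialize(*values)
    @values = values.flatten
  end

  # Element-wise multiplication by a scalar.
  def *(other)
    TinyVector.new(values.map { |v| v * other })
  end

  # Unary minus, defined through multiplication.
  def -@
    self * -1
  end

  # Integer#* calls this when its argument is not Numeric, then evaluates
  # first * second on the returned pair. Swapping the operands is valid
  # here because scalar multiplication is commutative.
  def coerce(other)
    [self, other]
  end
end

vector = TinyVector.new(1, 2, 3)
right = vector * -1
left = -1 * vector   # goes through TinyVector#coerce
negated = -vector
p right.values, left.values, negated.values
```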
1028
+
1029
+ ## 36. Vector#to_ary
1030
+
1031
+ `Vector#to_ary` will enable implicit conversion to an Array.
1032
+
1033
+ ```{ruby}
1034
+ #| tags: []
1035
+ Array(Vector.new([3, 4, 5]))
1036
+ ```
1037
+
1038
+ ```{ruby}
1039
+ #| tags: []
1040
+ [1, 2] + Vector.new([3, 4, 5])
1041
+ ```
1042
+
1043
+ ```{ruby}
1044
+ #| tags: []
1045
+ [1, 2, Vector.new([3, 4, 5])].flatten
1046
+ ```
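Implicit conversion is a core Ruby mechanism: any object that defines `to_ary` is accepted wherever an Array is expected. The hypothetical class `TinyList` below is not part of RedAmber; it only shows which core operations consult `to_ary`.

```ruby
# Hypothetical class used only to illustrate implicit Array conversion.
class TinyList
  def initialize(items)
    @items = items
  end

  # Defining to_ary marks the object as implicitly convertible to an Array.
  def to_ary
    @items.dup
  end
end

list = TinyList.new([3, 4, 5])

wrapped = Array(list)           # Kernel#Array tries to_ary first
joined = [1, 2] + list          # Array#+ converts its argument with to_ary
flat = [1, 2, list].flatten     # flatten expands elements that respond to to_ary
first, second, third = list     # multiple assignment also uses to_ary

p wrapped, joined, flat, [first, second, third]
```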
1047
+
1048
+ ## 37. Vector#fill_nil
1049
+
1050
+ `Vector#fill_nil_forward` or `Vector#fill_nil_backward` will
1051
+ propagate the last valid observation forward (or backward).
1052
+ nil is preserved if all previous values are nil or if it is at the end.
1053
+
1054
+ ```{ruby}
1055
+ #| tags: []
1056
+ integer = Vector.new([0, 1, nil, 3, nil])
1057
+ integer.fill_nil_forward
1058
+ ```
1059
+
1060
+ ```{ruby}
1061
+ #| tags: []
1062
+ integer.fill_nil_backward
1063
+ ```
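The forward and backward fill rules can be written out on a plain Array. This is a sketch of the described behavior in core Ruby, with hypothetical helper names `fill_forward` and `fill_backward`; it is not how RedAmber or Arrow implement it.

```ruby
# Hypothetical helper: replace each nil with the last non-nil value seen so far.
# A leading nil stays nil because there is nothing to propagate yet.
def fill_forward(values)
  last = nil
  values.map { |v| v.nil? ? last : (last = v) }
end

# Hypothetical helper: backward fill is forward fill on the reversed sequence.
def fill_backward(values)
  fill_forward(values.reverse).reverse
end

integer = [0, 1, nil, 3, nil]
forward = fill_forward(integer)
backward = fill_backward(integer)  # the trailing nil has no later value, so it remains
p forward, backward
```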
1064
+
1065
+ (Since 0.4.2) `Vector#fill_nil(value)` replaces each `nil` in self with `value`.
1066
+
1067
+ ```{ruby}
1068
+ #| tags: []
1069
+ integer.fill_nil(-1)
1070
+ ```
1071
+
1072
+ If value has a wider type, self will automatically be upcast.
1073
+ Int16 will be cast to double in the next example.
1074
+
1075
+ ```{ruby}
1076
+ #| tags: []
1077
+ integer.fill_nil(0.1)
1078
+ ```
1079
+
1080
+ ## 38. Vector#all?/any?
1081
+
1082
+ `Vector#all?` returns true if all elements are true.
1083
+
1084
+ `Vector#any?` returns true if any element is true.
1085
+
1086
+ These are unary aggregation functions.
1087
+
1088
+ ```{ruby}
1089
+ #| tags: []
1090
+ booleans = Vector.new([true, true, nil])
1091
+ booleans.all?
1092
+ ```
1093
+
1094
+ ```{ruby}
1095
+ #| tags: []
1096
+ booleans.any?
1097
+ ```
1098
+
1099
+ If these methods are used with the option `skip_nulls: false`, nil is taken into account.
1100
+
1101
+ ```{ruby}
1102
+ #| tags: []
1103
+ booleans.all?(skip_nulls: false)
1104
+ ```
1105
+
1106
+ ```{ruby}
1107
+ #| tags: []
1108
+ booleans.any?(skip_nulls: false)
1109
+ ```
1110
+
1111
+ ## 39. Vector#count/count_uniq
1112
+
1113
+ `Vector#count` counts elements.
1114
+
1115
+ `Vector#count_uniq` counts unique elements. `#count_distinct` is an alias (Arrow's name).
1116
+
1117
+ These are unary aggregation functions.
1118
+
1119
+ ```{ruby}
1120
+ #| tags: []
1121
+ string = Vector.new(%w[A B A])
1122
+ string.count
1123
+ ```
1124
+
1125
+ ```{ruby}
1126
+ #| tags: []
1127
+ string.count_uniq # count_distinct is also OK
1128
+ ```
1129
+
1130
+ ## 40. Vector#stddev/variance
1131
+
1132
+ These are unary aggregation functions.
1133
+
1134
+ For biased standard deviation:
1135
+
1136
+ ```{ruby}
1137
+ #| tags: []
1138
+ integers = Vector.new([1, 2, 3, nil])
1139
+ integers.stddev
1140
+ ```
1141
+
1142
+ For unbiased standard deviation:
1143
+
1144
+ ```{ruby}
1145
+ #| tags: []
1146
+ integers.sd
1147
+ ```
1148
+
1149
+ For biased variance:
1150
+
1151
+ ```{ruby}
1152
+ #| tags: []
1153
+ integers.variance
1154
+ ```
1155
+
1156
+ For unbiased variance:
1157
+
1158
+ ```{ruby}
1159
+ #| tags: []
1160
+ integers.var
1161
+ ```
1162
+
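The difference between the biased and unbiased versions is only the divisor. A minimal plain-Ruby sketch, assuming the usual definition with a `ddof` (delta degrees of freedom) parameter; `variance_of` is a hypothetical helper, not RedAmber's API, and nil is skipped as in the examples above.

```ruby
# Plain-Ruby sketch; `variance_of` is a hypothetical helper, not RedAmber's API.
# ddof: 0 divides by n (biased), ddof: 1 divides by n - 1 (unbiased).
def variance_of(values, ddof: 0)
  valid = values.compact # ignore nil
  mean = valid.sum.to_f / valid.size
  valid.sum { |v| (v - mean)**2 } / (valid.size - ddof)
end

data = [1, 2, 3, nil]
biased_variance   = variance_of(data)          # 2.0 / 3
unbiased_variance = variance_of(data, ddof: 1) # 2.0 / 2 => 1.0
biased_stddev     = Math.sqrt(biased_variance)
unbiased_stddev   = Math.sqrt(unbiased_variance) # => 1.0
```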
1163
+ ## 41. Vector#negate
1164
+
1165
+ This is a unary element-wise function.
1166
+
1167
+ ```{ruby}
1168
+ #| tags: []
1169
+ double = Vector.new([1.0, -2, 3])
1170
+ double.negate
1171
+ ```
1172
+
1173
+ Same as `#negate`:
1174
+
1175
+ ```{ruby}
1176
+ #| tags: []
1177
+ -double
1178
+ ```
1179
+
1180
+ ## 42. Vector#round
1181
+
1182
+ Options for `#round`:
1183
+
1184
+ - `:n_digits` The number of digits to show.
1185
+ - `:mode` Specify the rounding mode.
1186
+
1187
+ This is a unary element-wise function.
1188
+
1189
+ ```{ruby}
1190
+ #| tags: []
1191
+ double = RedAmber::Vector.new([15.15, 2.5, 3.5, -4.5, -5.5])
1192
+ ```
1193
+
1194
+ ```{ruby}
1195
+ #| tags: []
1196
+ double.round
1197
+ ```
1198
+
1199
+ ```{ruby}
1200
+ #| tags: []
1201
+ double.round(mode: :half_to_even)
1202
+ ```
1203
+
1204
+ ```{ruby}
1205
+ #| tags: []
1206
+ double.round(mode: :towards_infinity)
1207
+ ```
1208
+
1209
+ ```{ruby}
1210
+ #| tags: []
1211
+ double.round(mode: :half_up)
1212
+ ```
1213
+
1214
+ ```{ruby}
1215
+ #| tags: []
1216
+ double.round(mode: :half_towards_zero)
1217
+ ```
1218
+
1219
+ ```{ruby}
1220
+ #| tags: []
1221
+ double.round(mode: :half_towards_infinity)
1222
+ ```
1223
+
1224
+ ```{ruby}
1225
+ #| tags: []
1226
+ double.round(mode: :half_to_odd)
1227
+ ```
1228
+
1229
+ ```{ruby}
1230
+ #| tags: []
1231
+ double.round(n_digits: 0)
1232
+ ```
1233
+
1234
+ ```{ruby}
1235
+ #| tags: []
1236
+ double.round(n_digits: 1)
1237
+ ```
1238
+
1239
+ ```{ruby}
1240
+ #| tags: []
1241
+ double.round(n_digits: -1)
1242
+ ```
1243
+
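For comparison, Ruby's own `Float#round` has a `half:` option that covers three of the tie-breaking rules. The sketch below uses only core Ruby, not RedAmber; the mapping to Arrow's mode names given in the comments is an assumption based on their descriptions.

```ruby
# Core Ruby only; the Arrow mode names in the comments are an assumed mapping.
values = [15.15, 2.5, 3.5, -4.5, -5.5]

half_even = values.map { |v| v.round(half: :even) } # like :half_to_even
half_away = values.map { |v| v.round(half: :up) }   # like :half_towards_infinity (away from zero)
half_zero = values.map { |v| v.round(half: :down) } # like :half_towards_zero

half_even # => [15, 2, 4, -4, -6]
half_away # => [15, 3, 4, -5, -6]
half_zero # => [15, 2, 3, -4, -5]
```

Only the ties (values ending in .5) differ between the modes; 15.15 rounds to 15 in all of them.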
1244
+ ## 43. Vector#and/or
1245
+
1246
+ RedAmber selects `and_kleene`/`or_kleene` as the default `&`/`|` methods.
1247
+
1248
+ These are binary element-wise functions.
1249
+
1250
+ ```{ruby}
1251
+ #| tags: []
1252
+ bool_self = Vector.new([true, true, true, false, false, false, nil, nil, nil])
1253
+ bool_other = Vector.new([true, false, nil, true, false, nil, true, false, nil])
1254
+
1255
+ bool_self & bool_other # same as bool_self.and_kleene(bool_other)
1256
+ ```
1257
+
1258
+ ```{ruby}
1259
+ #| tags: []
1260
+ # Ruby's primitive `&&`
1261
+ bool_self && bool_other
1262
+ ```
1263
+
1264
+ ```{ruby}
1265
+ #| tags: []
1266
+ # Arrow's default `and`
1267
+ bool_self.and_org(bool_other)
1268
+ ```
1269
+
1270
+ ```{ruby}
1271
+ #| tags: []
1272
+ bool_self | bool_other # same as bool_self.or_kleene(bool_other)
1273
+ ```
1274
+
1275
+ ```{ruby}
1276
+ #| tags: []
1277
+ # Ruby's primitive `||`
1278
+ bool_self || bool_other
1279
+ ```
1280
+
1281
+ ```{ruby}
1282
+ #| tags: []
1283
+ # Arrow's default `or`
1284
+ bool_self.or_org(bool_other)
1285
+ ```
1286
+
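The Kleene logic used by `&` and `|` above treats nil as "unknown": the result is nil only when the known operand cannot decide the answer. A plain-Ruby sketch on Arrays follows; `kleene_and` and `kleene_or` are hypothetical helpers, not RedAmber methods.

```ruby
# Plain-Ruby sketch of three-valued (Kleene) logic; nil means "unknown".
# `kleene_and` / `kleene_or` are hypothetical helpers, not RedAmber methods.
def kleene_and(a, b)
  return false if a == false || b == false # one false decides the result
  return nil if a.nil? || b.nil?           # otherwise unknown stays unknown

  true
end

def kleene_or(a, b)
  return true if a == true || b == true # one true decides the result
  return nil if a.nil? || b.nil?

  false
end

bool_self  = [true, true, true, false, false, false, nil, nil, nil]
bool_other = [true, false, nil, true, false, nil, true, false, nil]

and_result = bool_self.zip(bool_other).map { |a, b| kleene_and(a, b) }
# => [true, false, nil, false, false, false, nil, false, nil]
or_result = bool_self.zip(bool_other).map { |a, b| kleene_or(a, b) }
# => [true, true, true, true, false, nil, true, nil, nil]
```

In contrast, a non-Kleene `and`/`or` returns nil whenever either operand is nil, so `false & nil` would be nil rather than false.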
1287
+ ## 44. Vector#is_finite/is_nan/is_nil/is_na
1288
+
1289
+ These are unary element-wise functions.
1290
+
1291
+ ```{ruby}
1292
+ #| tags: []
1293
+ double = Vector.new([Math::PI, Float::INFINITY, -Float::INFINITY, Float::NAN, nil])
1294
+ ```
1295
+
1296
+ ```{ruby}
1297
+ #| tags: []
1298
+ double.is_finite
1299
+ ```
1300
+
1301
+ ```{ruby}
1302
+ #| tags: []
1303
+ double.is_inf
1304
+ ```
1305
+
1306
+ ```{ruby}
1307
+ #| tags: []
1308
+ double.is_na
1309
+ ```
1310
+
1311
+ ```{ruby}
1312
+ #| tags: []
1313
+ double.is_nil
1314
+ ```
1315
+
1316
+ ```{ruby}
1317
+ #| tags: []
1318
+ double.is_valid
1319
+ ```
1320
+
1321
+ ## 45. Prime-th rows
1322
+
1323
+ ```{ruby}
1324
+ #| tags: []
1325
+ # prime-th rows ... Don't ask me what it means.
1326
+ require 'prime'
1327
+ penguins.assign_left(:index, penguins.indices + 1) # since 0.2.0
1328
+ .slice { Vector.new(Prime.each(size).to_a) - 1 }
1329
+ ```
1330
+
1331
+ ## 46. Slice by Enumerator
1332
+
1333
+ `slice` accepts an Enumerator.
1334
+
1335
+ ```{ruby}
1336
+ #| tags: []
1337
+ # Select every 10th sample
1338
+ penguins.assign_left(index: penguins.indices) # 0.2.0 feature
1339
+ .slice(0.step(by: 10, to: 340))
1340
+ ```
1341
+
1342
+ ```{ruby}
1343
+ #| tags: []
1344
+ # Select 2 consecutive samples at every step of 100
1345
+ penguins.assign_left(index: penguins.indices) # 0.2.0 feature
1346
+ .slice { 0.step(by: 100, to: 300).map { |i| i..(i+1) } }
1347
+ ```
1348
+
1349
+ ## 47. Output mode
1350
+
1351
+ The output mode of `DataFrame#inspect` and `DataFrame#to_iruby` is Table mode by default. If you prefer another mode, set the environment variable `RED_AMBER_OUTPUT_MODE`.
1352
+
1353
+ ```{ruby}
1354
+ #| tags: []
1355
+ ENV['RED_AMBER_OUTPUT_MODE'] = 'Table' # or nil (default)
1356
+ penguins # Almost same as `puts penguins.to_s` in any mode
1357
+ ```
1358
+
1359
+ ```{ruby}
1360
+ #| tags: []
1361
+ penguins[:species]
1362
+ ```
1363
+
1364
+ ```{ruby}
1365
+ #| tags: []
1366
+ ENV['RED_AMBER_OUTPUT_MODE'] = 'Plain' # Since 0.2.2
1367
+ penguins
1368
+ ```
1369
+
1370
+ ```{ruby}
1371
+ #| tags: []
1372
+ penguins[:species]
1373
+ ```
1374
+
1375
+ ```{ruby}
1376
+ #| tags: []
1377
+ ENV['RED_AMBER_OUTPUT_MODE'] = 'Minimum' # Since 0.2.2
1378
+ penguins
1379
+ ```
1380
+
1381
+ ```{ruby}
1382
+ #| tags: []
1383
+ penguins[:species]
1384
+ ```
1385
+
1386
+ ```{ruby}
1387
+ #| tags: []
1388
+ ENV['RED_AMBER_OUTPUT_MODE'] = 'TDR'
1389
+ penguins
1390
+ ```
1391
+
1392
+ ```{ruby}
1393
+ #| tags: []
1394
+ penguins[:species]
1395
+ ```
1396
+
1397
+ ```{ruby}
1398
+ #| tags: []
1399
+ ENV['RED_AMBER_OUTPUT_MODE'] = nil
1400
+ ```
1401
+
1402
+ ## 48. Empty key
1403
+
1404
+ Empty key `:""` will be automatically renamed to `:unnamed1`.
1405
+
1406
+ If `:unnamed1` is already in use, `:unnamed1.succ` will be used.
1407
+
1408
+ (Since 0.1.8)
1409
+
1410
+ ```{ruby}
1411
+ #| tags: []
1412
+ df = DataFrame.new("": [1, 2], unnamed1: [3, 4])
1413
+ ```
1414
+
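The renaming rule can be sketched in plain Ruby with `Symbol#succ`. `rename_empty_keys` is a hypothetical helper written for illustration; how RedAmber handles this internally is not shown in this document.

```ruby
# Plain-Ruby sketch; `rename_empty_keys` is a hypothetical helper, not RedAmber's code.
def rename_empty_keys(keys)
  used = keys.reject(&:empty?) # names that are already taken
  keys.map do |key|
    next key unless key.empty?

    candidate = :unnamed1
    candidate = candidate.succ while used.include?(candidate) # :unnamed1 -> :unnamed2 -> ...
    used << candidate
    candidate
  end
end

rename_empty_keys([:"", :unnamed1]) # => [:unnamed2, :unnamed1]
rename_empty_keys([:a, :""])        # => [:a, :unnamed1]
```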
1415
+ ## 49. Grouping
1416
+
1417
+ `DataFrame#group` takes group_keys as arguments, and creates a `Group` object.
1418
+
1419
+ Inspecting a `Group` shows the count of each unique element.
1420
+
1421
+ (Since 0.1.7)
1422
+
1423
+ ```{ruby}
1424
+ #| tags: []
1425
+ group = penguins.group(:species)
1426
+ ```
1427
+
1428
+ An instance of the `Group` class provides summary functions as methods.
1429
+
1430
+ They return summarized columns named in `function(key)` style.
1431
+
1432
+ ```{ruby}
1433
+ #| tags: []
1434
+ group.count
1435
+ ```
1436
+
1437
+ If the count results are the same across multiple columns, the count columns are aggregated into a single column `:count`.
1438
+
1439
+ ```{ruby}
1440
+ #| tags: []
1441
+ penguins.pick(:species, :bill_length_mm, :bill_depth_mm).group(:species).count
1442
+ ```
1443
+
1444
+ Grouping key comes first (leftmost) in the columns.
1445
+
1446
+ ## 50. Grouping with a block
1447
+
1448
+ `DataFrame#group` takes a block and we can specify multiple functions.
1449
+
1450
+ The block is evaluated in the context of the Group instance, so we can call summary functions without a receiver.
1451
+
1452
+ (Since 0.1.8)
1453
+
1454
+ ```{ruby}
1455
+ #| tags: []
1456
+ penguins.group(:species) { [count(:species), mean(:body_mass_g)] }
1457
+ ```
1458
+
1459
+ `Group#summarize` accepts the same block as `DataFrame#group`.
1460
+
1461
+ ```{ruby}
1462
+ #| tags: []
1463
+ group.summarize { [count(:species), mean] }
1464
+ ```
1465
+
1466
+ ## 51. Group#count family
1467
+
1468
+ `Group#count` counts the number of non-nil values in each group.
1469
+ If counts are the same (and do not include NaN or nil), columns for counts are unified.
1470
+
1471
+ ```{ruby}
1472
+ dataframe = DataFrame.new(
1473
+ x: [*1..6],
1474
+ y: %w[A A B B B C],
1475
+ z: [false, true, false, nil, true, false])
1476
+ ```
1477
+
1478
+ Non-nil counts in columns x and z are different.
1479
+
1480
+ ```{ruby}
1481
+ dataframe.group(:y).count
1482
+ ```
1483
+
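The per-column counting behind `group(:y).count` can be sketched in plain Ruby with `group_by`. `count_by` is a hypothetical helper operating on an Array of row Hashes, not RedAmber's implementation; it uses the same data as the DataFrame above.

```ruby
# Plain-Ruby sketch; `count_by` is a hypothetical helper, not RedAmber's implementation.
rows = [1, 2, 3, 4, 5, 6]
       .zip(%w[A A B B B C], [false, true, false, nil, true, false])
       .map { |x, y, z| { x: x, y: y, z: z } }

def count_by(rows, group_key)
  rows.group_by { |row| row[group_key] }.transform_values do |members|
    (members.first.keys - [group_key]).to_h do |key|
      [key, members.count { |row| !row[key].nil? }] # count non-nil values only
    end
  end
end

counts_by_y = count_by(rows, :y)
# group "A": x 2, z 2 / group "B": x 3, z 2 / group "C": x 1, z 1
```

Group "B" contains a nil in `z`, so the counts for `x` and `z` differ and cannot be merged into one column.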
1484
+ Non-nil counts in columns x and y are the same, so only one column is emitted.
1485
+
1486
+ ```{ruby}
1487
+ dataframe.group(:z).count
1488
+ ```
1489
+
1490
+ `Group#count_all` returns the number of records in each group as a DataFrame. `Group#group_count` is an alias.
1491
+
1492
+ ```{ruby}
1493
+ dataframe.group(:y).count_all
1494
+ ```
1495
+
1496
+ `Group#count_uniq` counts the unique values in each group and returns them as a DataFrame. `Group#count_distinct` is an alias.
1497
+
1498
+ ```{ruby}
1499
+ dataframe.group(:y).count_uniq
1500
+ ```
1501
+
1502
+ ## 52. Group#one
1503
+
1504
+ `Group#one` gets one value from each group.
1505
+
1506
+ ```{ruby}
1507
+ dataframe.group(:y).one
1508
+ ```
1509
+
1510
+ ## 53. Group aggregation functions
1511
+
1512
+ `Group#all` emits aggregated booleans indicating whether all elements in each group evaluate to true.
1513
+
1514
+ ```{ruby}
1515
+ dataframe.group(:y).all
1516
+ ```
1517
+
1518
+ `Group#any` emits aggregated booleans indicating whether any element in each group evaluates to true.
1519
+
1520
+ ```{ruby}
1521
+ dataframe.group(:y).any
1522
+ ```
1523
+
1524
+ `Group#max` computes maximum of values in each group for numeric columns.
1525
+
1526
+ ```{ruby}
1527
+ dataframe.group(:y).max
1528
+ ```
1529
+
1530
+ `Group#mean` computes mean of values in each group for numeric columns.
1531
+
1532
+ ```{ruby}
1533
+ dataframe.group(:y).mean
1534
+ ```
1535
+
1536
+ `Group#median` computes median of values in each group for numeric columns.
1537
+
1538
+ ```{ruby}
1539
+ dataframe.group(:y).median
1540
+ ```
1541
+
1542
+ `Group#min` computes minimum of values in each group for numeric columns.
1543
+
1544
+ ```{ruby}
1545
+ dataframe.group(:y).min
1546
+ ```
1547
+
1548
+ `Group#product` computes product of values in each group for numeric columns.
1549
+
1550
+ ```{ruby}
1551
+ dataframe.group(:y).product
1552
+ ```
1553
+
1554
+ `Group#stddev` computes standard deviation of values in each group for numeric columns.
1555
+
1556
+ ```{ruby}
1557
+ dataframe.group(:y).stddev
1558
+ ```
1559
+
1560
+ `Group#sum` computes sum of values in each group for numeric columns.
1561
+
1562
+ ```{ruby}
1563
+ dataframe.group(:y).sum
1564
+ ```
1565
+
1566
+ `Group#variance` computes variance of values in each group for numeric columns.
1567
+
1568
+ ```{ruby}
1569
+ dataframe.group(:y).variance
1570
+ ```
1571
+
1572
+ ## 54. Group#grouped_frame
1573
+
1574
+ `Group#grouped_frame` returns a grouped DataFrame containing only the group keys. The alias is `#none`.
1575
+
1576
+ ```{ruby}
1577
+ dataframe.group(:y).grouped_frame
1578
+ ```
1579
+
1580
+ ## 55. Vector#shift
1581
+
1582
+ `Vector#shift(amount = 1, fill: nil)`
1583
+
1584
+ Shifts the vector's values by the specified `amount`. The vacated positions are filled with the value `fill`.
1585
+
1586
+ (Since 0.1.8)
1587
+
1588
+ ```{ruby}
1589
+ #| tags: []
1590
+ vector = RedAmber::Vector.new([1, 2, 3, 4, 5])
1591
+ vector.shift
1592
+ ```
1593
+
1594
+ ```{ruby}
1595
+ #| tags: []
1596
+ vector.shift(-2)
1597
+ ```
1598
+
1599
+ ```{ruby}
1600
+ #| tags: []
1601
+ vector.shift(fill: Float::NAN)
1602
+ ```
1603
+
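The shifting rule can be sketched on a plain Array. `shift_values` is a hypothetical helper, not RedAmber's implementation; a positive `amount` moves values toward the end and a negative one toward the start.

```ruby
# Plain-Ruby sketch; `shift_values` is a hypothetical helper, not RedAmber's implementation.
def shift_values(values, amount = 1, fill: nil)
  padding = [fill] * [amount.abs, values.size].min
  kept = values.size - padding.size
  if amount.positive?
    padding + values.first(kept) # values move toward the end
  else
    values.last(kept) + padding  # values move toward the start
  end
end

shift_values([1, 2, 3, 4, 5])          # => [nil, 1, 2, 3, 4]
shift_values([1, 2, 3, 4, 5], -2)      # => [3, 4, 5, nil, nil]
shift_values([1, 2, 3], 1, fill: 0)    # => [0, 1, 2]
```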
1604
+ ## 56. From the Pandas cookbook - if-then
1605
+
1606
+ https://pandas.pydata.org/docs/user_guide/cookbook.html#if-then
1607
+
1608
+ ```python
1609
+ # by Python Pandas
1610
+ df = pd.DataFrame(
1611
+ {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
1612
+ )
1613
+ df.loc[df.AAA >= 5, "BBB"] = -1
1614
+
1615
+ # returns =>
1616
+ AAA BBB CCC
1617
+ 0 4 10 100
1618
+ 1 5 -1 50
1619
+ 2 6 -1 -30
1620
+ 3 7 -1 -50
1621
+ ```
1622
+
1623
+ ```{ruby}
1624
+ #| tags: []
1625
+ # RedAmber
1626
+ df = DataFrame.new(
1627
+ "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50] # You can omit {}
1628
+ )
1629
+
1630
+ df.assign(BBB: df[:BBB].replace(df[:AAA] >= 5, -1))
1631
+ ```
1632
+
1633
+ To replace both `:BBB` and `:CCC`:
1634
+
1635
+ ```{ruby}
1636
+ #| tags: []
1637
+ df.assign do
1638
+ replacer = v(:AAA) >= 5 # Boolean Vector
1639
+ {
1640
+ BBB: v(:BBB).replace(replacer, -1),
1641
+ CCC: v(:CCC).replace(replacer, -2)
1642
+ }
1643
+ end
1644
+ ```
1645
+
1646
+ ## 57. From the Pandas cookbook - Splitting
1647
+ Split a frame with a boolean criterion
1648
+
1649
+ https://pandas.pydata.org/docs/user_guide/cookbook.html#splitting
1650
+
1651
+ ```python
1652
+ # by Python Pandas
1653
+ df = pd.DataFrame(
1654
+ {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
1655
+ )
1656
+ df[df.AAA <= 5]
1657
+
1658
+ # returns =>
1659
+ AAA BBB CCC
1660
+ 0 4 10 100
1661
+ 1 5 20 50
1662
+
1663
+ df[df.AAA > 5]
1664
+
1665
+ # returns =>
1666
+ AAA BBB CCC
1667
+ 2 6 30 -30
1668
+ 3 7 40 -50
1669
+ ```
1670
+
1671
+ ```{ruby}
1672
+ #| tags: []
1673
+ # RedAmber
1674
+ df = DataFrame.new(
1675
+ # You can omit outer {}
1676
+ "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]
1677
+ )
1678
+
1679
+ df.slice(df[:AAA] <= 5)
1680
+ # df[df[:AAA] <= 5] # is also OK
1681
+ ```
1682
+
1683
+ ```{ruby}
1684
+ #| tags: []
1685
+ df.remove(df[:AAA] <= 5)
1686
+ # df.slice(df[:AAA] > 5) # do the same thing
1687
+ ```
1688
+
1689
+ ## 58. From the Pandas cookbook - Building criteria
1690
+ Select with multi-column criteria
1691
+
1692
+ https://pandas.pydata.org/docs/user_guide/cookbook.html#building-criteria
1693
+
1694
+ ```python
1695
+ # by Python Pandas
1696
+ df = pd.DataFrame(
1697
+ {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
1698
+ )
1699
+
1700
+ # and
1701
+ df.loc[(df["BBB"] < 25) & (df["CCC"] >= -40), "AAA"]
1702
+
1703
+ # returns a series =>
1704
+ 0 4
1705
+ 1 5
1706
+ Name: AAA, dtype: int64
1707
+
1708
+ # or
1709
+ df.loc[(df["BBB"] > 25) | (df["CCC"] >= -40), "AAA"]
1710
+
1711
+ # returns a series =>
1712
+ 0 4
1713
+ 1 5
1714
+ 2 6
1715
+ 3 7
1716
+ Name: AAA, dtype: int64
1717
+ ```
1718
+
1719
+ ```{ruby}
1720
+ #| tags: []
1721
+ # RedAmber
1722
+ df = DataFrame.new(
1723
+ # You can omit {}
1724
+ "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]
1725
+ )
1726
+
1727
+ df.slice( (df[:BBB] < 25) & (df[:CCC] >= -40) ).pick(:AAA)
1728
+ ```
1729
+
1730
+ ```{ruby}
1731
+ #| tags: []
1732
+ df.slice( (df[:BBB] > 25) | (df[:CCC] >= -40) ).pick(:AAA)
1733
+ # df[ (df[:BBB] > 25) | (df[:CCC] >= -40) ][:AAA] # also OK
1734
+ ```
1735
+
1736
+ ```python
1737
+ # by Python Pandas
1738
+ # or (with assignment)
1739
+ df.loc[(df["BBB"] > 25) | (df["CCC"] >= 75), "AAA"] = 0.1
1740
+ df
1741
+
1742
+ # returns a dataframe =>
1743
+ AAA BBB CCC
1744
+ 0 0.1 10 100
1745
+ 1 5.0 20 50
1746
+ 2 0.1 30 -30
1747
+ 3 0.1 40 -50
1748
+ ```
1749
+
1750
+ ```{ruby}
1751
+ #| tags: []
1752
+ # df.assign(AAA: df[:AAA].replace((df[:BBB] > 25) | (df[:CCC] >= 75), 0.1)) # by one liner
1753
+
1754
+ booleans = (df[:BBB] > 25) | (df[:CCC] >= 75)
1755
+ replaced = df[:AAA].replace(booleans, 0.1)
1756
+ df.assign(AAA: replaced)
1757
+ ```
1758
+
1759
+ ```python
1760
+ # by Python Pandas
1761
+ # Select rows with data closest to certain value using argsort
1762
+ df = pd.DataFrame(
1763
+ {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
1764
+ )
1765
+ aValue = 43.0
1766
+ df.loc[(df.CCC - aValue).abs().argsort()]
1767
+
1768
+ # returns a dataframe =>
1769
+ AAA BBB CCC
1770
+ 1 5 20 50
1771
+ 0 4 10 100
1772
+ 2 6 30 -30
1773
+ 3 7 40 -50
1774
+ ```
1775
+
1776
+ ```{ruby}
1777
+ #| tags: []
1778
+ a_value = 43
1779
+ df[(df[:CCC] - a_value).abs.sort_indexes]
1780
+ # df.slice (df[:CCC] - a_value).abs.sort_indexes # also OK
1781
+ ```
1782
+
1783
+ ```python
1784
+ # by Python Pandas
1785
+ # Dynamically reduce a list of criteria using a binary operators
1786
+ df = pd.DataFrame(
1787
+ {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
1788
+ )
1789
+ Crit1 = df.AAA <= 5.5
1790
+ Crit2 = df.BBB == 10.0
1791
+ Crit3 = df.CCC > -40.0
1792
+ AllCrit = Crit1 & Crit2 & Crit3
1793
+
1794
+ import functools
1795
+
1796
+ CritList = [Crit1, Crit2, Crit3]
1797
+ AllCrit = functools.reduce(lambda x, y: x & y, CritList)
1798
+ df[AllCrit]
1799
+
1800
+ # returns a dataframe =>
1801
+ AAA BBB CCC
1802
+ 0 4 10 100
1803
+ ```
1804
+
1805
+ ```{ruby}
1806
+ #| tags: []
1807
+ crit1 = df[:AAA] <= 5.5
1808
+ crit2 = df[:BBB] == 10.0
1809
+ crit3 = df[:CCC] > -40.0
1810
+ df[crit1 & crit2 & crit3]
1811
+ ```
1812
+
1813
+ ## 59. From the Pandas cookbook - Dataframes
1814
+
1815
+ https://pandas.pydata.org/docs/user_guide/cookbook.html#dataframes
1816
+
1817
+ ```python
1818
+ # by Python Pandas
1819
+ # Using both row labels and value conditionals
1820
+ df = pd.DataFrame(
1821
+ {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
1822
+ )
1823
+ df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
1824
+
1825
+ # returns =>
1826
+ AAA BBB CCC
1827
+ 0 4 10 100
1828
+ 2 6 30 -30
1829
+ ```
1830
+
1831
+ ```{ruby}
1832
+ #| tags: []
1833
+ # RedAmber
1834
+ df = DataFrame.new(
1835
+ "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]
1836
+ )
1837
+
1838
+ df[(df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }]
1839
+ ```
1840
+
1841
+ ```python
1842
+ # by Python Pandas
1843
+ # Use loc for label-oriented slicing and iloc positional slicing GH2904
1844
+ df = pd.DataFrame(
1845
+ {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]},
1846
+ index=["foo", "bar", "boo", "kar"],
1847
+ )
1848
+
1849
+ # There are 2 explicit slicing methods, with a third general case
1850
+ # 1. Positional-oriented (Python slicing style : exclusive of end)
1851
+ # 2. Label-oriented (Non-Python slicing style : inclusive of end)
1852
+ # 3. General (Either slicing style : depends on if the slice contains labels or positions)
1853
+
1854
+ df.loc["bar":"kar"] # Label
1855
+ # returns =>
1856
+ AAA BBB CCC
1857
+ bar 5 20 50
1858
+ boo 6 30 -30
1859
+ kar 7 40 -50
1860
+
1861
+ # Generic
1862
+ df[0:3]
1863
+ # returns =>
1864
+ AAA BBB CCC
1865
+ foo 4 10 100
1866
+ bar 5 20 50
1867
+ boo 6 30 -30
1868
+
1869
+ df["bar":"kar"]
1870
+ # returns =>
1871
+ AAA BBB CCC
1872
+ bar 5 20 50
1873
+ boo 6 30 -30
1874
+ kar 7 40 -50
1875
+ ```
1876
+
1877
+ ```{ruby}
1878
+ #| tags: []
1879
+ # RedAmber does not have a row index. Use a new column as the index.
1880
+ labeled = df.assign_left(index: %w[foo bar boo kar])
1881
+ # labeled = df.assign(index: %w[foo bar boo kar]).pick { [keys[-1], keys[0...-1]] } # until v0.1.8
1882
+ ```
1883
+
1884
+ ```{ruby}
1885
+ #| tags: []
1886
+ labeled[1..3]
1887
+ ```
1888
+
1889
+ ```{ruby}
1890
+ #| tags: []
1891
+ labeled.slice do
1892
+ v = v(:index)
1893
+ v.index("bar")..v.index("kar")
1894
+ end
1895
+ ```
1896
+
1897
+ `slice_by` returns the same result as above.
1898
+
1899
+ (Since 0.2.1)
1900
+
1901
+ ```{ruby}
1902
+ #| tags: []
1903
+ labeled.slice_by(:index, keep_key: true) { "bar".."kar" }
1904
+ ```
1905
+
1906
+ ```python
1907
+ # by Python Pandas
1908
+ # Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment.
1909
+ df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4]) # Note index starts at 1.
1910
+
1911
+ df2.iloc[1:3] # Position-oriented
1912
+ # returns =>
1913
+ AAA BBB CCC
1914
+ 2 5 20 50
1915
+ 3 6 30 -30
1916
+
1917
+ df2.loc[1:3] # Label-oriented
1918
+ # returns =>
1919
+ AAA BBB CCC
1920
+ 1 4 10 100
1921
+ 2 5 20 50
1922
+ 3 6 30 -30
1923
+ ```
1924
+
1925
+ ```{ruby}
1926
+ #| tags: []
1927
+ # RedAmber only has an implicit integer index 0...size,
1928
+ # so no ambiguity arises unless you create a new column and use it as an index :-).
1929
+ ```
1930
+
1931
+ ```python
1932
+ # by Python Pandas
1933
+ # Using inverse operator (~) to take the complement of a mask
1934
+ df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
1935
+
1936
+ # returns =>
1937
+ AAA BBB CCC
1938
+ 1 5 20 50
1939
+ 3 7 40 -50
1940
+ ```
1941
+
1942
+ ```{ruby}
1943
+ #| tags: []
1944
+ # RedAmber offers #! method for boolean Vector.
1945
+ df[!((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i })]
1946
+
1947
+ # or
1948
+ # df[((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }).invert]
1949
+ ```
1950
+
1951
+ If you have `nil` in your data, consider `#primitive_invert` for a consistent result. See example #26.
1952
+
1953
+ ## 60. From the Pandas cookbook - New columns
1954
+
1955
+ https://pandas.pydata.org/docs/user_guide/cookbook.html#new-columns
1956
+
1957
+ ```python
1958
+ # by Python Pandas
1959
+ # Efficiently and dynamically creating new columns using applymap
1960
+ df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]})
1961
+ df
1962
+
1963
+ # returns =>
1964
+ AAA BBB CCC
1965
+ 0 1 1 2
1966
+ 1 2 1 1
1967
+ 2 1 2 3
1968
+ 3 3 2 1
1969
+
1970
+ source_cols = df.columns # Or some subset would work too
1971
+ new_cols = [str(x) + "_cat" for x in source_cols]
1972
+ categories = {1: "Alpha", 2: "Beta", 3: "Charlie"}
1973
+ df[new_cols] = df[source_cols].applymap(categories.get)
1974
+ df
1975
+
1976
+ # returns =>
1977
+ AAA BBB CCC AAA_cat BBB_cat CCC_cat
1978
+ 0 1 1 2 Alpha Alpha Beta
1979
+ 1 2 1 1 Beta Alpha Alpha
1980
+ 2 1 2 3 Alpha Beta Charlie
1981
+ 3 3 2 1 Charlie Beta Alpha
1982
+ ```
1983
+
1984
+ ```{ruby}
1985
+ #| tags: []
1986
+ # RedAmber
1987
+ df = DataFrame.new({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]})
1988
+ ```
1989
+
1990
+ ```{ruby}
1991
+ #| tags: []
1992
+ categories = {1 => "Alpha", 2 => "Beta", 3 => "Charlie"}
1993
+
1994
+ # Creating a Hash from keys
1995
+ df.assign do
1996
+ keys.each_with_object({}) do |key, h|
1997
+ h["#{key}_cat"] = v(key).to_a.map { |x| categories[x] }
1998
+ end
1999
+ end
2000
+
2001
+ # Creating an Array from vectors, from v0.2.0
2002
+ df.assign do
2003
+ vectors.map do |v|
2004
+ ["#{v.key}_cat", v.to_a.map { |x| categories[x] } ]
2005
+ end
2006
+ end
2007
+ ```
2008
+
2009
+ ```python
2010
+ # by Python Pandas
2011
+ # Keep other columns when using min() with groupby
2012
+ df = pd.DataFrame(
2013
+ {"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]}
2014
+ )
2015
+ df
2016
+
2017
+ # returns =>
2018
+ AAA BBB
2019
+ 0 1 2
2020
+ 1 1 1
2021
+ 2 1 3
2022
+ 3 2 4
2023
+ 4 2 5
2024
+ 5 2 1
2025
+ 6 3 2
2026
+ 7 3 3
2027
+
2028
+ # Method 1 : idxmin() to get the index of the minimums
2029
+ df.loc[df.groupby("AAA")["BBB"].idxmin()]
2030
+
2031
+ # returns =>
2032
+ AAA BBB
2033
+ 1 1 1
2034
+ 5 2 1
2035
+ 6 3 2
2036
+
2037
+ # Method 2 : sort then take first of each
2038
+ df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
2039
+
2040
+ # returns =>
2041
+ AAA BBB
2042
+ 0 1 1
2043
+ 1 2 1
2044
+ 2 3 2
2045
+
2046
+ # Notice the same results, with the exception of the index.
2047
+ ```
2048
+
2049
+ ```{ruby}
2050
+ #| tags: []
2051
+ # RedAmber
2052
+ df = DataFrame.new(AAA: [1, 1, 1, 2, 2, 2, 3, 3], BBB: [2, 1, 3, 4, 5, 1, 2, 3])
2053
+ ```
2054
+
2055
+ ```{ruby}
2056
+ #| tags: []
2057
+ df.group(:AAA).min
2058
+
2059
+ # Add `.rename { [keys[-1], :BBB] }` if you want.
2060
+ ```
2061
+
2062
+ ## 61. Summary/describe
2063
+
2064
+ ```{ruby}
2065
+ #| tags: []
2066
+ penguins.summary
2067
+ # or
2068
+ penguins.describe
2069
+ ```
2070
+
2071
+ If you need the variables in rows, use `transpose`. (Since 0.2.0)
2072
+
2073
+ ```{ruby}
2074
+ #| tags: []
2075
+ penguins.summary.transpose(name: :stats)
2076
+ ```
2077
+
2078
+ ## 62. Quantile/Quantiles
2079
+
2080
+ `Vector#quantile(prob)` returns the quantile at probability `prob`.
2081
+
2082
+ (Since 0.2.0)
2083
+
2084
+ ```{ruby}
2085
+ #| tags: []
2086
+ penguins[:bill_depth_mm].quantile # default is prob = 0.5
2087
+ ```
2088
+
2089
+ `Vector#quantiles` accepts an Array for multiple quantiles. Returns a DataFrame.
2090
+
2091
+ ```{ruby}
2092
+ #| tags: []
2093
+ penguins[:bill_depth_mm].quantiles([0.05, 0.95])
2094
+ ```
2095
+
2096
+ ## 63. Transpose
2097
+
2098
+ `DataFrame#transpose` creates a transposed DataFrame from a wide-type DataFrame.
2099
+
2100
+ (Since 0.2.0)
2101
+
2102
+ ```{ruby}
2103
+ #| tags: []
2104
+ uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv")
2105
+ import_cars = RedAmber::DataFrame.load(uri)
2106
+ ```
2107
+
2108
+ ```{ruby}
2109
+ #| tags: []
2110
+ import_cars.transpose
2111
+ ```
2112
+
2113
+ The default name of the created column is `:NAME`.
2114
+
2115
+ We can name the column created from the original keys with the option `name:`.
2116
+
2117
+ ```{ruby}
2118
+ #| tags: []
2119
+ import_cars.transpose(key: :Year, name: :Manufacturer)
2120
+ ```
2121
+
2122
+ You can specify the index column with the option `:key` even if it is in the middle of the original DataFrame.
2123
+
2124
+ ```{ruby}
2125
+ #| tags: []
2126
+ # locate `:Year` in the middle
2127
+ df = import_cars.pick(1..2, 0, 3..)
2128
+ ```
2129
+
2130
+ ```{ruby}
2131
+ #| tags: []
2132
+ df.transpose(key: :Year)
2133
+ ```
2134
+
2135
+ ## 64. To_long
2136
+
2137
+ `DataFrame#to_long(*keep_keys)` reshapes a wide DataFrame to a long DataFrame.
2138
+
2139
+ - Parameter `keep_keys` specifies the key names to keep.
2140
+
2141
+ (Since 0.2.0)
2142
+
2143
+ ```{ruby}
2144
+ #| tags: []
2145
+ uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv")
2146
+ import_cars = RedAmber::DataFrame.load(uri)
2147
+ ```
2148
+
2149
+ ```{ruby}
2150
+ #| tags: []
2151
+ import_cars.to_long(:Year)
2152
+ ```
2153
+
2154
+ - Option `:name` specifies the key of the column that comes **from key names**. Default is `:NAME`.
2155
+ - Option `:value` specifies the key of the column that comes **from values**. Default is `:VALUE`.
2156
+
2157
+ ```{ruby}
2158
+ #| tags: []
2159
+ import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
2160
+ ```
2161
+
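The wide-to-long reshaping can be sketched on a plain Hash of columns. `to_long_rows` is a hypothetical helper and the data is made up for illustration; the row order is an assumption and may differ from RedAmber's output.

```ruby
# Plain-Ruby sketch; `to_long_rows` is a hypothetical helper and `wide` is made-up data.
def to_long_rows(columns, keep_key, name: :NAME, value: :VALUE)
  other_keys = columns.keys - [keep_key]
  columns[keep_key].each_with_index.flat_map do |kept, i|
    # One output row per (kept value, other column) pair.
    other_keys.map { |key| { keep_key => kept, name => key, value => columns[key][i] } }
  end
end

wide = { Year: [2020, 2021], A: [1, 2], B: [3, 4] }
long = to_long_rows(wide, :Year)
# 4 rows: (2020, :A, 1), (2020, :B, 3), (2021, :A, 2), (2021, :B, 4)
```

Each former column key becomes a value in the `:NAME` column, and each cell becomes a value in the `:VALUE` column.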
2162
+ ## 65. To_wide
2163
+
2164
+ `DataFrame#to_wide(*keep_keys)` reshapes a long DataFrame to a wide DataFrame.
2165
+
2166
+ - Option `:name` specifies the key of the column which will be expanded **to key names**. Default is `:NAME`.
2167
+ - Option `:value` specifies the key of the column which will be expanded **to values**. Default is `:VALUE`.
2168
+
2169
+ (Since 0.2.0)
2170
+
2171
+ ```{ruby}
2172
+ #| tags: []
2173
+ import_cars.to_long(:Year).to_wide
2174
+ ```
2175
+
2176
+ ```{ruby}
2177
+ #| tags: []
2178
+ import_cars.to_long(:Year).to_wide(name: :NAME, value: :VALUE)
2179
+ # is also OK
2180
+ ```
2181
+
2182
+ ## 66. Custom index
2183
+
2184
+ Another example of `indices` is [14. Indices](#14.-Indices).
2185
+
2186
+ We can set the start of the indices by passing the first value as an argument.
2187
+
2188
+ (Since 0.2.1)
2189
+
2190
+ ```{ruby}
2191
+ #| tags: []
2192
+ df = DataFrame.new(x: [0, 1, 2, 3, 4])
2193
+ df.indices
2194
+ ```
2195
+
2196
+ ```{ruby}
2197
+ #| tags: []
2198
+ df.indices(1)
2199
+ ```
2200
+
2201
+ The first value can be any object that responds to `#succ`.
2202
+
2203
+ ```{ruby}
2204
+ #| tags: []
2205
+ df.indices("a")
2206
+ ```
2207
+
2208
+ ## 67. Method missing
2209
+
2210
+ `RedAmber::DataFrame` implements `#method_missing` so that key names can be called as methods.
2211
+
2212
+ This feature is limited to keys that can be called as a method (`:key` is OK; keys such as `:Key`, `:"key.1"`, `:"1key"` are not allowed). Even so, it is convenient in many cases.
2213
+
2214
+ (Since 0.2.1)
2215
+
2216
+ ```{ruby}
2217
+ #| tags: []
2218
+ df = DataFrame.new(x: [1, 2, 3])
2219
+ df.x.sum
2220
+ ```
2221
+
2222
+ ```{ruby}
2223
+ #| tags: []
2224
+ # Some ways to pull a Vector
2225
+ df[:x] # Formal style
2226
+
2227
+ df.v(:x) # #v method
2228
+
2229
+ df.x # method
2230
+ ```
2231
+
2232
+ ```{ruby}
2233
+ #| tags: []
2234
+ df.x.sum
2235
+ ```
2236
+
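The mechanism can be sketched with a tiny stand-alone class. `TinyFrame` is hypothetical and only illustrates the `method_missing` pattern; it is not RedAmber's implementation.

```ruby
# Plain-Ruby sketch; `TinyFrame` is a hypothetical class, not RedAmber::DataFrame.
class TinyFrame
  def initialize(columns)
    @columns = columns
  end

  def [](key)
    @columns.fetch(key)
  end

  # Treat a call without arguments as a column lookup when the key exists.
  def method_missing(name, *args, &block)
    return @columns[name] if args.empty? && @columns.key?(name)

    super
  end

  # Keep `respond_to?` consistent with `method_missing`.
  def respond_to_missing?(name, include_private = false)
    @columns.key?(name) || super
  end
end

frame = TinyFrame.new(x: [1, 2, 3])
frame.x.sum           # => 6
frame.respond_to?(:x) # => true
```

Unknown names still raise `NoMethodError` because the call falls through to `super`.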
2237
+ ## 68. Assign revised
2238
+
2239
+ Other examples of `assign` are [#34. Assign](#34.-Assign) and [#69. Variations of assign](#69.-Variations-of-assign).
2240
+
2241
+ ```{ruby}
2242
+ #| tags: []
2243
+ df = DataFrame.new(x: [1, 2, 3])
2244
+
2245
+ # Assign by a Hash
2246
+ df.assign(y: df.x / 10.0)
2247
+ ```
2248
+
2249
+ ```{ruby}
2250
+ #| tags: []
2251
+ # Assign by separated key and value
2252
+ df.assign(:y) { x / 10.0 }
2253
+ ```
2254
+
2255
+ ```{ruby}
2256
+ #| tags: []
2257
+ # Separated keys and values
2258
+ df.assign(:y, :z) { [x * 10, x / 10.0] }
2259
+ ```
2260
+
2261
+ ## 69. Variations of assign
2262
+
2263
+ Other examples of `assign` are [#34. Assign](#34.-Assign) and [#68. Assign revised](#68.-Assign-revised).
2264
+
2265
+ ```{ruby}
2266
+ #| tags: []
2267
+ df = DataFrame.new(x: [1, 2, 3])
2268
+ ```
2269
+
2270
+ ```{ruby}
2271
+ #| tags: []
2272
+ # Hash args
2273
+ df.assign(y: df[:x] * 10, z: df[:x] / 10.0)
2274
+
2275
+ # Hash
2276
+ hash = {y: df[:x] * 10, z: df[:x] / 10.0}
2277
+ df.assign(hash)
2278
+
2279
+ # Array
2280
+ array = [[:y, df[:x] * 10], [:z, df[:x] / 10.0]]
2281
+ df.assign(array)
2282
+
2283
+ # Array
2284
+ df.assign [
2285
+ [:y, df[:x] * 10],
2286
+ [:z, df[:x] / 10.0]
2287
+ ]
2288
+
2289
+ # Hash
2290
+ df.assign({
2291
+ y: df[:x] * 10,
2292
+ z: df[:x] / 10.0
2293
+ })
2294
+
2295
+ # Block, Hash
2296
+ df.assign { {y: df[:x] * 10, z: df[:x] / 10.0} }
2297
+
2298
+ # Block, Array
2299
+ df.assign { [[:y, df[:x] * 10], [:z, df[:x] / 10.0]] }
2300
+
2301
+ # Block, Array, method
2302
+ #df.assign { [[:y, x * 10], [:z, x / 10.0]] }
2303
+
2304
+ # Separated
2305
+ #df.assign(:y, :z) { [x * 10, x / 10.0] }
2306
+ ```
2307
+
2308
+ ## 70. Row index label by slice_by
2309
+
2310
+ Another example of `slice` is [#28. Slice](#28.-Slice).
2311
+
2312
+ (Since 0.2.1)
2313
+
2314
+ ```{ruby}
2315
+ #| tags: []
2316
+ df = DataFrame.new(num: [1.1, 2.2, 3.3, 4.4, 5.5])
2317
+ .assign_left(:label) { indices("a") }
2318
+ ```
2319
+
2320
+ `slice_by(key) { row_selector }` selects rows in column `key` with `row_selector`.
2321
+
2322
+ ```{ruby}
2323
+ #| tags: []
2324
+ df.slice_by(:label) { "b".."d" }
2325
+ ```
2326
+
2327
+ ```{ruby}
2328
+ #| tags: []
2329
+ df.slice_by(:label) { ["c", "b", "e"] }
2330
+ ```
2331
+
2332
+ If the option `keep_key:` is set to `true`, the index label column is preserved.
2333
+
2334
+ ```{ruby}
2335
+ #| tags: []
2336
+ df.slice_by(:label, keep_key: true) { "b".."d" }
2337
+ ```
2338
+
2339
+ ## 71. Simpson's paradox in COVID-19 data
2340
+
2341
+ https://www.rdocumentation.org/packages/openintro/versions/2.3.0/topics/simpsons_paradox_covid
2342
+
2343
+ ```{ruby}
2344
+ #| tags: []
2345
+ require 'datasets-arrow'
2346
+
2347
+ ds = Datasets::Rdatasets.new('openintro', 'simpsons_paradox_covid')
2348
+ df = RedAmber::DataFrame.new(ds.to_arrow)
2349
+ ```
2350
+
2351
+ Create group and count by vaccine status and outcome.
2352
+
2353
+ ```{ruby}
2354
+ #| tags: []
2355
+ count = df.group(:vaccine_status, :outcome).count
2356
+ ```
2357
+
2358
+ Reshape to a human-readable wide table.
2359
+
2360
+ ```{ruby}
2361
+ #| tags: []
2362
+ all_count = count.to_wide(name: :vaccine_status, value: :count)
2363
+ ```
2364
+
2365
+ Compute the ratio of deaths and survivals for each vaccine status.
2366
+
2367
+ ```{ruby}
2368
+ #| tags: []
2369
+ all_count.assign do
2370
+ {
2371
+ "vaccinated_%": 100.0 * vaccinated / vaccinated.sum,
2372
+ "unvaccinated_%": 100.0 * unvaccinated / unvaccinated.sum
2373
+ }
2374
+ end
2375
+ ```
2376
+
2377
+ The death ratio for the vaccinated is higher than for the unvaccinated. Is it true?
2378
+
2379
+ Next, do the same as above for each age group. Temporarily define a method.
2380
+
2381
+ ```{ruby}
2382
+ #| tags: []
2383
+ def make_covid_table(df)
2384
+ df.group(:vaccine_status, :outcome)
2385
+ .count
2386
+ .to_wide(name: :vaccine_status, value: :count)
2387
+ .assign do
2388
+ {
2389
+ "vaccinated_%": (100.0 * vaccinated / vaccinated.sum).round(n_digits: 3),
2390
+ "unvaccinated_%": (100.0 * unvaccinated / unvaccinated.sum).round(n_digits: 3)
2391
+ }
2392
+ end
2393
+ end
2394
+ ```
2395
+
2396
+ ```{ruby}
2397
+ #| tags: []
2398
+ # under 50
2399
+ make_covid_table(df[df[:age_group] == "under 50"])
2400
+ ```
2401
+
2402
+ ```{ruby}
2403
+ #| tags: []
2404
+ # 50 +
2405
+ make_covid_table(df[df[:age_group] == "50 +"])
2406
+ ```
2407
+
2408
+ Within each age group, the death ratio for the vaccinated is lower than for the unvaccinated. This is an example of "Simpson's paradox".
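
The reversal can be reproduced with plain arithmetic. The following is a minimal sketch in plain Ruby (no RedAmber); the counts are hypothetical numbers chosen for illustration, not the openintro data, and `death_rate` is a helper name introduced here.

```ruby
# Hypothetical [deaths, total] counts per age group and vaccine status.
# These are made-up numbers, NOT the simpsons_paradox_covid dataset.
counts = {
  "under 50" => { vaccinated: [2, 1_000],  unvaccinated: [10, 4_000] },
  "50 +"     => { vaccinated: [80, 4_000], unvaccinated: [30, 1_000] }
}

# Death rate in percent, rounded like the notebook's make_covid_table.
def death_rate(deaths, total)
  (100.0 * deaths / total).round(3)
end

# Rate inside each age group.
per_group = counts.transform_values do |group|
  group.transform_values { |(deaths, total)| death_rate(deaths, total) }
end

# Rate after pooling both age groups together.
overall = %i[vaccinated unvaccinated].to_h do |status|
  deaths = counts.values.sum { |group| group[status][0] }
  total  = counts.values.sum { |group| group[status][1] }
  [status, death_rate(deaths, total)]
end

puts per_group.inspect
puts overall.inspect
```

In this made-up table the vaccinated rate is lower in both age groups, yet higher in the pooled table, because most vaccinated people fall in the older, higher-risk group.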
2409
+
2410
+ ```{ruby}
2411
+ #| tags: []
2412
+ # Vaccine status vs age
2413
+ # 50+ is highly vaccinated.
2414
+ df.group(:vaccine_status, :age_group).count.to_wide(name: :age_group, value: :count)
2415
+ ```
2416
+
2417
+ ```{ruby}
2418
+ #| tags: []
2419
+ # Outcome vs age
2420
+ # 50+ also has higher death rate.
2421
+ df.group(:outcome, :age_group).count.to_wide(name: :age_group, value: :count)
2422
+ ```
2423
+
2424
+ ## 72. Clean up dirty data
2425
+
2426
+ ```{ruby}
2427
+ #| tags: []
2428
+ file = Tempfile.open(['dirty_data', '.csv']) do |f|
2429
+ f.puts(<<~CSV)
2430
+ height,weight
2431
+ 154.9,52.2
2432
+ 156.8cm,51.1kg
2433
+ 152,49
2434
+ 148.5cm,45.4kg
2435
+ 155cm,
2436
+ ,49.9kg
2437
+ 1.58m,49.8kg
2438
+ 166.8cm,53.6kg
2439
+ CSV
2440
+ f
2441
+ end
2442
+
2443
+ df = DataFrame.load(file)
2444
+ ```
2445
+
2446
+ It was loaded as String Vectors.
2447
+
2448
+ ```{ruby}
2449
+ #| tags: []
2450
+ df.schema
2451
+ ```
2452
+
2453
+ Start with the `:weight` column. Replacing "" with NaN causes the column to be cast to Float.
2454
+
2455
+ ```{ruby}
2456
+ #| tags: []
2457
+ df.assign do
2458
+ {
2459
+ weight: weight.replace(weight == "", Float::NAN)
2460
+ }
2461
+ end
2462
+ ```
2463
+
2464
+ Apply the same conversion to `:height`, followed by a unit conversion (metres to centimetres) with `if_else`.
2465
+
2466
+ ```{ruby}
2467
+ #| tags: []
2468
+ df = df.assign do
2469
+ {
2470
+ weight: weight.replace(weight == '', Float::NAN),
2471
+ height: height.replace(height == '', Float::NAN)
2472
+ .then { |h| (h < 10).if_else(h * 100, h) }
2473
+ }
2474
+ end
2475
+ puts df.schema
2476
+ df
2477
+ ```
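
The unit rule used above (any height below 10 is assumed to be in metres) can be written per value. This is a plain-Ruby sketch of the same idea, not how RedAmber performs the cast internally; `height_cm` is a hypothetical helper name.

```ruby
# Parse one height string into centimetres.
# Empty or missing input becomes Float::NAN, mirroring the replace step above.
def height_cm(text)
  return Float::NAN if text.nil? || text.empty?

  value = text.to_f                   # String#to_f ignores trailing text such as "cm" or "m"
  value < 10 ? value * 100 : value    # values below 10 are assumed to be metres
end

heights = ["154.9", "156.8cm", "", "1.58m"].map { |s| height_cm(s) }
puts heights.inspect
```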
2478
+
2479
+ Now that the data is clean, compute BMI as a new column.
2480
+
2481
+ ```{ruby}
2482
+ #| tags: []
2483
+ df.assign(:BMI) { (weight / height ** 2 * 10000).round(n_digits: 1) }
2484
+ ```
2485
+
2486
+ ## 73. From the Pandas cookbook - Multiindexing
2487
+
2488
+ (Updated on v0.3.0)
2489
+ https://pandas.pydata.org/docs/user_guide/cookbook.html#multiindexing
2490
+
2491
+ ```python
2492
+ # by Python Pandas
2493
+ # Efficiently and dynamically creating new columns using applymap
2494
+
2495
+ df = pd.DataFrame(
2496
+ {
2497
+ "row": [0, 1, 2],
2498
+ "One_X": [1.1, 1.1, 1.1],
2499
+ "One_Y": [1.2, 1.2, 1.2],
2500
+ "Two_X": [1.11, 1.11, 1.11],
2501
+ "Two_Y": [1.22, 1.22, 1.22],
2502
+ }
2503
+ )
2504
+ df
2505
+
2506
+ # =>
2507
+ row One_X One_Y Two_X Two_Y
2508
+ 0 0 1.1 1.2 1.11 1.22
2509
+ 1 1 1.1 1.2 1.11 1.22
2510
+ 2 2 1.1 1.2 1.11 1.22
2511
+
2512
+ # As Labelled Index
2513
+ df = df.set_index("row")
2514
+ df
2515
+
2516
+ # =>
2517
+ One_X One_Y Two_X Two_Y
2518
+ row
2519
+ 0 1.1 1.2 1.11 1.22
2520
+ 1 1.1 1.2 1.11 1.22
2521
+ 2 1.1 1.2 1.11 1.22
2522
+
2523
+ # With Hierarchical Columns
2524
+ df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns])
2525
+ df
2526
+
2527
+ # =>
2528
+ One Two
2529
+ X Y X Y
2530
+ row
2531
+ 0 1.1 1.2 1.11 1.22
2532
+ 1 1.1 1.2 1.11 1.22
2533
+ 2 1.1 1.2 1.11 1.22
2534
+
2535
+ # Now stack & Reset
2536
+ df = df.stack(0).reset_index(1)
2537
+ df
2538
+
2539
+ # =>
2540
+ level_1 X Y
2541
+ row
2542
+ 0 One 1.10 1.20
2543
+ 0 Two 1.11 1.22
2544
+ 1 One 1.10 1.20
2545
+ 1 Two 1.11 1.22
2546
+ 2 One 1.10 1.20
2547
+ 2 Two 1.11 1.22
2548
+
2549
+ # And fix the labels (Notice the label 'level_1' got added automatically)
2550
+ df.columns = ["Sample", "All_X", "All_Y"]
2551
+ df
2552
+
2553
+ # =>
2554
+ Sample All_X All_Y
2555
+ row
2556
+ 0 One 1.10 1.20
2557
+ 0 Two 1.11 1.22
2558
+ 1 One 1.10 1.20
2559
+ 1 Two 1.11 1.22
2560
+ 2 One 1.10 1.20
2561
+ 2 Two 1.11 1.22
2562
+ ```
2563
+
2564
+ (Until 0.2.3)
2565
+ This example predates the introduction of `Vector#split_*`. See [88. Vector#split_to_columns](#88.-Vector#split_to_columns).
2566
+
2567
+ ```{ruby}
2568
+ #| tags: []
2569
+ df = RedAmber::DataFrame.new(
2570
+ "row": [0, 1, 2],
2571
+ "One_X": [1.1, 1.1, 1.1],
2572
+ "One_Y": [1.2, 1.2, 1.2],
2573
+ "Two_X": [1.11, 1.11, 1.11],
2574
+ "Two_Y": [1.22, 1.22, 1.22],
2575
+ )
2576
+ ```
2577
+
2578
+ ```{ruby}
2579
+ #| tags: []
2580
+ df_x = df.pick(:row, :One_X, :Two_X)
2581
+ .to_long(:row, name: :Sample, value: :All_X)
2582
+ ```
2583
+
2584
+ ```{ruby}
2585
+ #| tags: []
2586
+ df_y = df.pick(:row, :One_Y, :Two_Y)
2587
+ .to_long(:row, name: :Sample, value: :All_Y)
2588
+ ```
2589
+
2590
+ ```{ruby}
2591
+ #| tags: []
2592
+ df_x.pick(:row)
2593
+ .assign [
2594
+ [:Sample, df_x[:Sample].each.map { |x| x.split("_").first }],
2595
+ [:All_X, df_x[:All_X]],
2596
+ [:All_Y, df_y[:All_Y]]
2597
+ ]
2598
+ ```
2599
+
2600
+ (Since 0.3.0)
2601
+ This example will use `Vector#split_to_columns`.
2602
+
2603
+ ```{ruby}
2604
+ #| tags: []
2605
+ df = RedAmber::DataFrame.new(
2606
+ "row": [0, 1, 2],
2607
+ "One_X": [1.1, 1.1, 1.1],
2608
+ "One_Y": [1.2, 1.2, 1.2],
2609
+ "Two_X": [1.11, 1.11, 1.11],
2610
+ "Two_Y": [1.22, 1.22, 1.22],
2611
+ )
2612
+ ```
2613
+
2614
+ ```{ruby}
2615
+ #| tags: []
2616
+ df.to_long(:row)
2617
+ ```
2618
+
2619
+ `Vector#split_to_columns` returns two split Vectors.
2620
+
2621
+ ```{ruby}
2622
+ #| tags: []
2623
+ df.to_long(:row, name: :Sample)
2624
+ .assign(:Sample, :xy) { v(:Sample).split_to_columns('_') }
2625
+ ```
2626
+
2627
+ ```{ruby}
2628
+ #| tags: []
2629
+ df.to_long(:row, name: :Sample)
2630
+ .assign(:Sample, :xy) { v(:Sample).split_to_columns('_') }
2631
+ .to_wide(name: :xy, value: :VALUE)
2632
+ ```
2633
+
2634
+ ## 74. From the Pandas cookbook - Arithmetic
2635
+
2636
+ https://pandas.pydata.org/docs/user_guide/cookbook.html#arithmetic
2637
+
2638
+ ```python
2639
+ # by Python Pandas
2640
+ cols = pd.MultiIndex.from_tuples(
2641
+ [(x, y) for x in ["A", "B", "C"] for y in ["O", "I"]]
2642
+ )
2643
+
2644
+ df = pd.DataFrame(np.random.randn(2, 6), index=["n", "m"], columns=cols)
2645
+ df
2646
+
2647
+ # =>
2648
+ A B C
2649
+ O I O I O I
2650
+ n 0.469112 -0.282863 -1.509059 -1.135632 1.212112 -0.173215
2651
+ m 0.119209 -1.044236 -0.861849 -2.104569 -0.494929 1.071804
2652
+
2653
+ df = df.div(df["C"], level=1)
2654
+ df
2655
+
2656
+ # =>
2657
+ A B C
2658
+ O I O I O I
2659
+ n 0.387021 1.633022 -1.244983 6.556214 1.0 1.0
2660
+ m -0.240860 -0.974279 1.741358 -1.963577 1.0 1.0
2661
+ ```
2662
+
2663
+ This is a tentative example. It may be refined once a coming feature makes multi-key headers easier to handle.
2664
+
2665
+ ```{ruby}
2666
+ #| tags: []
2667
+ require "arrow-numo-narray"
2668
+
2669
+ values = Numo::DFloat.new(6, 2).rand_norm
2670
+ ```
2671
+
2672
+ For consistency with the pandas result, we will use the same data.
2673
+
2674
+ ```{ruby}
2675
+ #| tags: []
2676
+ values = [
2677
+ [0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215],
2678
+ [0.119209, -1.044236, -0.861849, -2.104569, -0.494929, 1.071804]
2679
+ ].transpose
2680
+ ```
2681
+
2682
+ ```{ruby}
2683
+ #| tags: []
2684
+ keys = %w[A B C].product(%w[O I]).map(&:join)
2685
+ ```
2686
+
2687
+ ```{ruby}
2688
+ #| tags: []
2689
+ df = RedAmber::DataFrame.new(index: %w[n m])
2690
+ .assign(*keys) { values }
2691
+ ```
2692
+
2693
+ ```{ruby}
2694
+ #| tags: []
2695
+ df.assign do
2696
+ assigner = {}
2697
+ %w[A B C].each do |abc|
2698
+ %w[O I].each do |oi|
2699
+ key = "#{abc}#{oi}".to_sym
2700
+ assigner[key] = v(key) / v("C#{oi}".to_sym)
2701
+ end
2702
+ end
2703
+ assigner
2704
+ end
2705
+ ```
2706
+
2707
+ ```{ruby}
2708
+ #| tags: []
2709
+ coords = [["AA", "one"], ["AA", "six"], ["BB", "one"], ["BB", "two"], ["BB", "six"]].transpose
2710
+ df = RedAmber::DataFrame.new(MyData: [11, 22, 33, 44, 55])
2711
+ .assign_left(:label1, :label2) { coords }
2712
+ ```
2713
+
2714
+ ## 75. From the Pandas cookbook - Slicing
2715
+
2716
+ https://pandas.pydata.org/docs/user_guide/cookbook.html#slicing
2717
+
2718
+ ```python
2719
+ # by Python Pandas
2720
+ coords = [("AA", "one"), ("AA", "six"), ("BB", "one"), ("BB", "two"), ("BB", "six")]
2721
+ index = pd.MultiIndex.from_tuples(coords)
2722
+ df = pd.DataFrame([11, 22, 33, 44, 55], index, ["MyData"])
2723
+ df
2724
+
2725
+ # =>
2726
+ MyData
2727
+ AA one 11
2728
+ six 22
2729
+ BB one 33
2730
+ two 44
2731
+ six 55
2732
+ ```
2733
+
2734
+ To take the cross section of the 1st level and 1st axis the index:
2735
+
2736
+ ```python
2737
+ # by Python Pandas
2738
+ # Note : level and axis are optional, and default to zero
2739
+ df.xs("BB", level=0, axis=0)
2740
+
2741
+ # =>
2742
+ MyData
2743
+ one 33
2744
+ two 44
2745
+ six 55
2746
+ ```
2747
+
2748
+ ```{ruby}
2749
+ #| tags: []
2750
+ df.slice { label1 == "BB" }.drop(:label1)
2751
+ ```
2752
+
2753
+ …and now the 2nd level of the 1st axis.
2754
+
2755
+ ```python
2756
+ # by Python Pandas
2757
+ df.xs("six", level=1, axis=0)
2758
+
2759
+ # =>
2760
+ MyData
2761
+ AA 22
2762
+ BB 55
2763
+ ```
2764
+
2765
+ ```{ruby}
2766
+ #| tags: []
2767
+ df.slice { label2 == "six" }.drop(:label2)
2768
+ ```
2769
+
2770
+ ```python
2771
+ # by Python Pandas
2772
+ import itertools
2773
+
2774
+ index = list(itertools.product(["Ada", "Quinn", "Violet"], ["Comp", "Math", "Sci"]))
2775
+ headr = list(itertools.product(["Exams", "Labs"], ["I", "II"]))
2776
+ indx = pd.MultiIndex.from_tuples(index, names=["Student", "Course"])
2777
+ cols = pd.MultiIndex.from_tuples(headr) # Notice these are un-named
2778
+ data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
2779
+ df = pd.DataFrame(data, indx, cols)
2780
+ df
2781
+
2782
+ # =>
2783
+ Exams Labs
2784
+ I II I II
2785
+ Student Course
2786
+ Ada Comp 70 71 72 73
2787
+ Math 71 73 75 74
2788
+ Sci 72 75 75 75
2789
+ Quinn Comp 73 74 75 76
2790
+ Math 74 76 78 77
2791
+ Sci 75 78 78 78
2792
+ Violet Comp 76 77 78 79
2793
+ Math 77 79 81 80
2794
+ Sci 78 81 81 81
2795
+ ```
2796
+
2797
+ ```{ruby}
2798
+ #| tags: []
2799
+ indexes = %w[Ada Quinn Violet].product(%w[Comp Math Sci]).transpose
2800
+ df = RedAmber::DataFrame.new(%w[Student Course].zip(indexes))
2801
+ .assign do
2802
+ assigner = {}
2803
+ keys = %w[Exams Labs].product(%w[I II]).map { |a| a.join("/") }
2804
+ keys.each.with_index do |key, x|
2805
+ assigner[key] = (0...9).map { |y| 70 + x + y + (x * y) % 3 }
2806
+ end
2807
+ assigner
2808
+ end
2809
+ ```
2810
+
2811
+ ```python
2812
+ # by Python Pandas
2813
+ All = slice(None)
2814
+
2815
+ df.loc["Violet"]
2816
+
2817
+ # =>
2818
+ Exams Labs
2819
+ I II I II
2820
+ Course
2821
+ Comp 76 77 78 79
2822
+ Math 77 79 81 80
2823
+ Sci 78 81 81 81
2824
+ ```
2825
+
2826
+ ```{ruby}
2827
+ #| tags: []
2828
+ df.slice(df[:Student] == "Violet").drop(:Student)
2829
+ ```
2830
+
2831
+ ```python
2832
+ # by Python Pandas
2833
+ df.loc[(All, "Math"), All]
2834
+
2835
+ # =>
2836
+ Exams Labs
2837
+ I II I II
2838
+ Student Course
2839
+ Ada Math 71 73 75 74
2840
+ Quinn Math 74 76 78 77
2841
+ Violet Math 77 79 81 80
2842
+ ```
2843
+
2844
+ ```{ruby}
2845
+ #| tags: []
2846
+ df.slice(df[:Course] == "Math")
2847
+ ```
2848
+
2849
+ ```python
2850
+ # by Python Pandas
2851
+ df.loc[(slice("Ada", "Quinn"), "Math"), All]
2852
+
2853
+ # =>
2854
+ Exams Labs
2855
+ I II I II
2856
+ Student Course
2857
+ Ada Math 71 73 75 74
2858
+ Quinn Math 74 76 78 77
2859
+ ```
2860
+
2861
+ ```{ruby}
2862
+ #| tags: []
2863
+ df.slice(df[:Course] == "Math")
2864
+ .slice { (v(:Student) == "Ada") | (v(:Student) == "Quinn") }
2865
+ ```
2866
+
2867
+ ```python
2868
+ # by Python Pandas
2869
+ df.loc[(All, "Math"), ("Exams")]
2870
+
2871
+ # =>
2872
+ I II
2873
+ Student Course
2874
+ Ada Math 71 73
2875
+ Quinn Math 74 76
2876
+ Violet Math 77 79
2877
+ ```
2878
+
2879
+ ```{ruby}
2880
+ #| tags: []
2881
+ df.slice(df[:Course] == "Math")
2882
+ .pick {
2883
+ [:Student, :Course].concat keys.select { |key| key.to_s.start_with?("Exams") }
2884
+ }
2885
+ ```
2886
+
2887
+ ```python
2888
+ # by Python Pandas
2889
+ df.loc[(All, "Math"), (All, "II")]
2890
+
2891
+ # =>
2892
+ Exams Labs
2893
+ II II
2894
+ Student Course
2895
+ Ada Math 73 74
2896
+ Quinn Math 76 77
2897
+ Violet Math 79 80
2898
+ ```
2899
+
2900
+ ```{ruby}
2901
+ #| tags: []
2902
+ df.slice(df[:Course] == "Math")
2903
+ .pick {
2904
+ [:Student, :Course].concat keys.select { |key| key.to_s.end_with?("II") }
2905
+ }
2906
+ ```
2907
+
2908
+ ## 76. Vector#map
2909
+
2910
+ The `Vector#map` method accepts a block and returns the results yielded from the block as a Vector.
2911
+
2912
+ ```{ruby}
2913
+ #| tags: []
2914
+ v = Vector.new(1, 2, 3, 4)
2915
+ v.map { |x| x / 100.0 }
2916
+ ```
2917
+
2918
+ If no block is given, it returns an Enumerator.
2919
+
2920
+ ```{ruby}
2921
+ #| tags: []
2922
+ v.map
2923
+ ```
2924
+
2925
+ If you need Ruby's own `map` (returning an Array) from a Vector, try `.each.map`.
2926
+
2927
+ ```{ruby}
2928
+ #| tags: []
2929
+ v.each.map { |x| x / 100.0 }
2930
+ ```
2931
+
2932
+ Alias for `#map` is `#collect`
2933
+
2934
+ Similar method is `Vector#filter/#select`.
2935
+
2936
+ ## 77. Introduce columns from numo/narray
2937
+
2938
+ (Until 0.2.2 w/Arrow 9.0.0) We couldn't construct a DataFrame directly from Numo::NArray, but the following trick makes it possible.
2939
+
2940
+ ```{ruby}
2941
+ #| tags: []
2942
+ DataFrame.new(index: Array(1..10))
2943
+ .assign do
2944
+ {
2945
+ x0: Numo::DFloat.new(size).rand_norm(0, 2),
2946
+ x1: Numo::DFloat.new(size).rand_norm(5, 2),
2947
+ x2: Numo::DFloat.new(size).rand_norm(10, 2),
2948
+ y0: Numo::DFloat.new(size).rand_norm(100, 10),
2949
+ y1: Numo::DFloat.new(size).rand_norm(200, 10),
2950
+ y2: Numo::DFloat.new(size).rand_norm(300, 10)
2951
+ }
2952
+ end
2953
+ ```
2954
+
2955
+ If you do not need the index column, try this.
2956
+
2957
+ ```{ruby}
2958
+ #| tags: []
2959
+ DataFrame.new(_: Array(1..10))
2960
+ .assign do
2961
+ {
2962
+ x0: Numo::DFloat.new(size).rand_norm(0, 2),
2963
+ x1: Numo::DFloat.new(size).rand_norm(5, 2),
2964
+ x2: Numo::DFloat.new(size).rand_norm(10, 2),
2965
+ y0: Numo::DFloat.new(size).rand_norm(100, 10),
2966
+ y1: Numo::DFloat.new(size).rand_norm(200, 10),
2967
+ y2: Numo::DFloat.new(size).rand_norm(300, 10)
2968
+ }
2969
+ end
2970
+ .drop(:_)
2971
+ ```
2972
+
2973
+ (New from 0.2.3 with Arrow 10.0.0) Since 0.2.3, a DataFrame can be initialized from objects that respond to `to_arrow`. Numo::NArray arrays respond to `to_arrow` when the `red-arrow-numo-narray` gem is loaded. This feature was proposed by the Red Data Tools member @kojix2 and implemented by @kou in Arrow 10.0.0 and Red Arrow Numo::NArray 0.0.6. Thanks!
2974
+
2975
+ ```{ruby}
2976
+ #| tags: []
2977
+ require 'arrow-numo-narray'
2978
+
2979
+ size = 10
2980
+ DataFrame.new(
2981
+ x0: Numo::DFloat.new(size).rand_norm(0, 2),
2982
+ x1: Numo::DFloat.new(size).rand_norm(5, 2),
2983
+ x2: Numo::DFloat.new(size).rand_norm(10, 2),
2984
+ y0: Numo::DFloat.new(size).rand_norm(100, 10),
2985
+ y1: Numo::DFloat.new(size).rand_norm(200, 10),
2986
+ y2: Numo::DFloat.new(size).rand_norm(300, 10)
2987
+ )
2988
+ ```
2989
+
2990
+ ## 78. Join (mutating joins)
2991
+
2992
+ (Since 0.2.3)
2993
+
2994
+ ```{ruby}
2995
+ #| tags: []
2996
+ df = DataFrame.new(
2997
+ KEY: %w[A B C],
2998
+ X1: [1, 2, 3]
2999
+ )
3000
+ ```
3001
+
3002
+ ```{ruby}
3003
+ #| tags: []
3004
+ other = DataFrame.new(
3005
+ KEY: %w[A B D],
3006
+ X2: [true, false, nil]
3007
+ )
3008
+ ```
3009
+
3010
+ Inner join will join data leaving only the matching records.
3011
+
3012
+ ```{ruby}
3013
+ #| tags: []
3014
+ df.inner_join(other, :KEY)
3015
+ ```
3016
+
3017
+ If we omit join keys, common keys are automatically chosen (natural key).
3018
+
3019
+ ```{ruby}
3020
+ #| tags: []
3021
+ df.inner_join(other)
3022
+ ```
3023
+
3024
+ Full join will join data leaving all records.
3025
+
3026
+ ```{ruby}
3027
+ #| tags: []
3028
+ df.full_join(other)
3029
+ ```
3030
+
3031
+ Left join will join matching values to self from other (type: left_outer).
3032
+
3033
+ ```{ruby}
3034
+ #| tags: []
3035
+ df.left_join(other)
3036
+ ```
3037
+
3038
+ Right join will join matching values from self to other (type: right_outer).
3039
+
3040
+ ```{ruby}
3041
+ #| tags: []
3042
+ df.right_join(other)
3043
+ ```
3044
+
3045
+ Left join will join matching values to self from other.
3046
+
3047
+ ```{ruby}
3048
+ #| tags: []
3049
+ df.left_join(other)
3050
+ ```
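
To make the row semantics of the mutating joins concrete, here is a minimal sketch in plain Ruby over arrays of hashes. It is not RedAmber's implementation; it assumes join keys are unique on each side, and the names `left_rows`, `right_rows` and the `*_sketch` results are introduced here for illustration.

```ruby
# The same records as `df` and `other` above, as plain hashes.
left_rows  = [{ KEY: "A", X1: 1 }, { KEY: "B", X1: 2 }, { KEY: "C", X1: 3 }]
right_rows = [{ KEY: "A", X2: true }, { KEY: "B", X2: false }, { KEY: "D", X2: nil }]

# Lookup tables by key (assumes each key appears once per side).
left_by_key  = left_rows.to_h  { |row| [row[:KEY], row] }
right_by_key = right_rows.to_h { |row| [row[:KEY], row] }

# Inner join: keep only rows whose key exists on both sides.
inner_sketch = left_rows.filter_map do |l|
  r = right_by_key[l[:KEY]]
  l.merge(r) if r
end

# Left join: keep every left row; unmatched rows get nil for X2.
left_sketch = left_rows.map do |l|
  l.merge(X2: nil).merge(right_by_key.fetch(l[:KEY], {}))
end

# Full join: the left join plus right rows that had no partner.
right_only  = right_rows.reject { |r| left_by_key.key?(r[:KEY]) }
full_sketch = left_sketch + right_only.map { |r| { KEY: r[:KEY], X1: nil, X2: r[:X2] } }

puts inner_sketch.inspect
puts left_sketch.inspect
puts full_sketch.inspect
```

A right join is the mirror image of the left join with the two sides swapped.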
3051
+
3052
+ ## 79. Join (filtering joins)
3053
+
3054
+ (Since 0.2.3)
3055
+
3056
+ Semi join will return records of self that have a match in other.
3057
+
3058
+ ```{ruby}
3059
+ #| tags: []
3060
+ df.semi_join(other)
3061
+ ```
3062
+
3063
+ Anti join will return records of self that do not have a match in other.
3064
+
3065
+ ```{ruby}
3066
+ #| tags: []
3067
+ df.anti_join(other)
3068
+ ```
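
Filtering joins only decide which rows of self survive; no columns from other are added. A plain-Ruby sketch of that idea follows (not RedAmber's implementation; variable names are hypothetical).

```ruby
# Rows of self and the join keys present in other.
self_rows  = [{ KEY: "A", X1: 1 }, { KEY: "B", X1: 2 }, { KEY: "C", X1: 3 }]
other_keys = %w[A B D]

# Semi join: rows of self whose key has a match in other.
semi_sketch = self_rows.select { |row| other_keys.include?(row[:KEY]) }

# Anti join: rows of self whose key has no match in other.
anti_sketch = self_rows.reject { |row| other_keys.include?(row[:KEY]) }

puts semi_sketch.inspect
puts anti_sketch.inspect
```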
3069
+
3070
+ ## 80. Partial joins
3071
+
3072
+ (Since 0.2.3)
3073
+
3074
+ ```{ruby}
3075
+ #| tags: []
3076
+ df2 = DataFrame.new(
3077
+ KEY1: %w[A B C],
3078
+ KEY2: %w[s t u],
3079
+ X: [1, 2, 3]
3080
+ )
3081
+ ```
3082
+
3083
+ ```{ruby}
3084
+ #| tags: []
3085
+ other2 = DataFrame.new(
3086
+ KEY1: %w[A B D],
3087
+ KEY2: %w[s u v],
3088
+ Y: [3, 2, 1]
3089
+ )
3090
+ ```
3091
+
3092
+ ```{ruby}
3093
+ #| tags: []
3094
+ # natural join
3095
+ df2.inner_join(other2)
3096
+ # Same as df2.inner_join(other2, [:KEY1, :KEY2])
3097
+ ```
3098
+
3099
+ A partial join uses only some of the common keys as join keys.
3100
+
3101
+ Common keys of other that are not used as join keys will be renamed with `:suffix`. The default suffix is '.1'.
3102
+
3103
+ ```{ruby}
3104
+ #| tags: []
3105
+ # partial join
3106
+ df2.inner_join(other2, :KEY1)
3107
+ ```
3108
+
3109
+ ```{ruby}
3110
+ #| tags: []
3111
+ df2.inner_join(other2, :KEY1, suffix: '_')
3112
+ ```
3113
+
3114
+ ## 81. Order of record in join
3115
+
3116
+ Order of records is not guaranteed to be preserved before or after join. This is a similar property to RDB. Records behave like a set.
3117
+
3118
+ If you want to preserve the order of records, it is recommended to add an index or sort.
3119
+
3120
+ (Since 0.2.3)
3121
+
3122
+ ```{ruby}
3123
+ #| tags: []
3124
+ df2
3125
+ ```
3126
+
3127
+ ```{ruby}
3128
+ #| tags: []
3129
+ other2
3130
+ ```
3131
+
3132
+ ```{ruby}
3133
+ #| tags: []
3134
+ df2.full_join(other2, :KEY2)
3135
+ ```
3136
+
3137
+ ## 82. Set operations
3138
+
3139
+ Keys in self and other must be the same in set operations.
3140
+
3141
+ (Since 0.2.3)
3142
+
3143
+ ```{ruby}
3144
+ #| tags: []
3145
+ df = DataFrame.new(
3146
+ KEY1: %w[A B C],
3147
+ KEY2: [1, 2, 3]
3148
+ )
3149
+ ```
3150
+
3151
+ ```{ruby}
3152
+ #| tags: []
3153
+ other = DataFrame.new(
3154
+ KEY1: %w[A B D],
3155
+ KEY2: [1, 4, 5]
3156
+ )
3157
+ ```
3158
+
3159
+ Intersect will select records appearing in both self and other.
3160
+
3161
+ ```{ruby}
3162
+ #| tags: []
3163
+ df.intersect(other)
3164
+ ```
3165
+
3166
+ Union will select records appearing in either self or other.
3167
+
3168
+ ```{ruby}
3169
+ #| tags: []
3170
+ df.union(other)
3171
+ ```
3172
+
3173
+ Difference will select records appearing in self but not in other.
3174
+
3175
+ It has an alias `#setdiff`.
3176
+
3177
+ ```{ruby}
3178
+ #| tags: []
3179
+ df.difference(other)
3180
+ ```
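
The three set operations treat each record as one element of a set. Ruby's Array operators behave the same way on whole rows, which gives a quick way to reason about the expected records. This is a plain-Ruby sketch, not RedAmber code; row order in a real DataFrame result may differ.

```ruby
# Each row is [KEY1, KEY2], matching `df` and `other` above.
self_rows  = [["A", 1], ["B", 2], ["C", 3]]
other_rows = [["A", 1], ["B", 4], ["D", 5]]

intersect_sketch  = self_rows & other_rows   # rows present in both
union_sketch      = self_rows | other_rows   # rows present in either, without duplicates
difference_sketch = self_rows - other_rows   # rows present in self only

puts intersect_sketch.inspect
puts union_sketch.inspect
puts difference_sketch.inspect
```

Note that `["B", 2]` and `["B", 4]` are different records: the whole row is compared, not just the first key.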
3181
+
3182
+ ## 83. Join (big method)
3183
+
3184
+ Undocumented big method `join` supports all mutating joins, filtering joins and set operations.
3185
+
3186
+ |category|method of RedAmber|:type in join method|requirement|
3187
+ |-|-|-|-|
3188
+ |mutating joins|#inner_join|:inner||
3189
+ |mutating joins|#full_join|:full_outer||
3190
+ |mutating joins|#left_join|:left_outer||
3191
+ |mutating joins|#right_join|:right_outer||
3192
+ |-|-|:right_semi||
3193
+ |-|-|:right_anti||
3194
+ |filtering joins|#semi_join|:left_semi||
3195
+ |filtering joins|#anti_join|:left_anti||
3196
+ |set operations|#intersect|:inner|must have same keys with self and other|
3197
+ |set operations|#union|:full_outer|must have same keys with self and other|
3198
+ |set operations|#difference|:left_anti|must have same keys with self and other|
3199
+
3200
+ (Since 0.2.3)
3201
+
3202
+ ```{ruby}
3203
+ #| tags: []
3204
+ df = DataFrame.new(
3205
+ KEY: %w[A B C],
3206
+ X1: [1, 2, 3]
3207
+ )
3208
+ ```
3209
+
3210
+ ```{ruby}
3211
+ #| tags: []
3212
+ other = DataFrame.new(
3213
+ KEY: %w[A B D],
3214
+ X2: [true, false, nil]
3215
+ )
3216
+ ```
3217
+
3218
+ ```{ruby}
3219
+ #| tags: []
3220
+ df.join(other, :KEY, type: :inner)
3221
+ # Same as df.inner_join(other)
3222
+ ```
3223
+
3224
+ (Since 0.5.0) `#join` will not force ordering of original column by default.
3225
+
3226
+ ## 84. Force order for #join
3227
+
3228
+ We can use `:force_order` option to ensure unique order for `join` families.
3229
+ This option is true by default in `#inner_join`, `#full_join`, `#left_join`, `#right_join`, `#semi_join` and `#anti_join`.
3230
+ It will append index to the source and sort after joining. It will cause some degradation in performance.
3231
+ (Since 0.4.0)
3232
+
3233
+ (Since 0.5.0) `#join` will not force ordering of original column by default.
3234
+
3235
+ ```{ruby}
3236
+ #| tags: []
3237
+ df2 = DataFrame.new(
3238
+ KEY1: %w[A B C],
3239
+ KEY2: %w[s t u],
3240
+ X: [1, 2, 3]
3241
+ )
3242
+ ```
3243
+
3244
+ ```{ruby}
3245
+ #| tags: []
3246
+ right2 = DataFrame.new(
3247
+ KEY1: %w[A B D],
3248
+ KEY2: %w[s u v],
3249
+ Y: [3, 2, 1]
3250
+ )
3251
+ ```
3252
+
3253
+ ```{ruby}
3254
+ #| tags: []
3255
+ df2.full_join(right2, :KEY2)
3256
+ ```
3257
+
3258
+ ```{ruby}
3259
+ #| tags: []
3260
+ df2.full_join(right2, :KEY2, force_order: false)
3261
+ ```
3262
+
3263
+ ```{ruby}
3264
+ #| tags: []
3265
+ df2.full_join(right2, { left: :KEY2, right: 'KEY2' })
3266
+ ```
3267
+
3268
+ ```{ruby}
3269
+ #| tags: []
3270
+ df2.full_join(right2, { left: :KEY2, right: 'KEY2' }, force_order: false)
3271
+ ```
3272
+
3273
+ ## 85. Binding DataFrames in vertical (concatenate)
3274
+
3275
+ Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self.
3276
+
3277
+ The alias is `concat`.
3278
+ (Since 0.2.3)
3279
+
3280
+ ```{ruby}
3281
+ #| tags: []
3282
+ df = DataFrame.new(x: [1, 2], y: ['A', 'B'])
3283
+ ```
3284
+
3285
+ ```{ruby}
3286
+ #| tags: []
3287
+ other = DataFrame.new(x: [3, 4], y: ['C', 'D'])
3288
+ ```
3289
+
3290
+ ```{ruby}
3291
+ #| tags: []
3292
+ df.concatenate(other)
3293
+ ```
3294
+
3295
+ ## 86. Binding DataFrames in lateral (merge)
3296
+
3297
+ Merge another DataFrame or Table onto the right side of self. The number of rows of other must be the same as self.
3298
+
3299
+ (Since 0.2.3)
3300
+
3301
+ ```{ruby}
3302
+ #| tags: []
3303
+ df = DataFrame.new(x: [1, 2], y: [3, 4])
3304
+ ```
3305
+
3306
+ ```{ruby}
3307
+ #| tags: []
3308
+ other = DataFrame.new(a: ['A', 'B'], b: ['C', 'D'])
3309
+ ```
3310
+
3311
+ ```{ruby}
3312
+ #| tags: []
3313
+ df.merge(other)
3314
+ ```
3315
+
3316
+ ## 87. Join - larger example by nycflights13
3317
+
3318
+ (Since 0.2.3)
3319
+
3320
+ 'nycflights13' dataset is a large dataset. It will take a while for the first run to fetch and prepare red-datasets cache.
3321
+
3322
+ ```{ruby}
3323
+ #| tags: []
3324
+ require 'datasets-arrow'
3325
+
3326
+ package = 'nycflights13'
3327
+
3328
+ airlines = DataFrame.new(Datasets::Rdatasets.new(package, 'airlines'))
3329
+ airlines
3330
+ ```
3331
+
3332
+ Creating `Datasets::Rdatasets.new('nycflights13', 'flights')` is very slow because Red Datasets uses Ruby's standard CSV library as its CSV parser. We can parse the CSV with Arrow's faster parser instead.
3333
+
3334
+ ```{ruby}
3335
+ uri = URI('https://vincentarelbundock.github.io/Rdatasets/csv/nycflights13/flights.csv')
3336
+ flights = DataFrame.load(uri)
3337
+ .pick(%i[month day carrier flight tailnum origin dest air_time distance])
3338
+ flights
3339
+ ```
3340
+
3341
+ ```{ruby}
3342
+ # inner join
3343
+ flights.inner_join(airlines, :carrier)
3344
+ # flights.inner_join(airlines) # natural join (same result)
3345
+ ```
3346
+
3347
+ ## 88. Vector#split_to_columns
3348
+
3349
+ Another example, used within a DataFrame operation, is in [73. From the Pandas cookbook - Multiindexing](#73.-From-the-Pandas-cookbook---Multiindexing).
3350
+
3351
+ `self` must be a String type Vector.
3352
+
3353
+ (Since 0.3.0)
3354
+
3355
+ ```{ruby}
3356
+ #| tags: []
3357
+ v = Vector.new(['a b', 'c d', 'e f'])
3358
+ ```
3359
+
3360
+ ```{ruby}
3361
+ #| tags: []
3362
+ v.split_to_columns
3363
+ ```
3364
+
3365
+ `#split_to_columns` accepts a `sep` argument as a separator. `sep` is passed to `String#split(sep)`.
3366
+
3367
+ ```{ruby}
3368
+ #| tags: []
3369
+ Vector.new('ab', 'cd', 'ef')
3370
+ .split_to_columns('')
3371
+ ```
3372
+
3373
+ nil will be separated as nil.
3374
+
3375
+ ```{ruby}
3376
+ #| tags: []
3377
+ Vector.new(nil, 'c d', 'e f')
3378
+ .split_to_columns
3379
+ ```
3380
+
3381
+ ## 89. Vector#split_to_rows
3382
+
3383
+ `#split_to_rows` will separate strings and flatten them into rows.
3384
+
3385
+ (Since 0.3.0)
3386
+
3387
+ ```{ruby}
3388
+ #| tags: []
3389
+ v = Vector.new(['a b', 'c d', 'e f'])
3390
+ ```
3391
+
3392
+ ```{ruby}
3393
+ #| tags: []
3394
+ v.split_to_rows
3395
+ ```
3396
+
3397
+ ## 90. Vector#merge
3398
+ (Since 0.3.0)
3399
+
3400
+ `Vector#merge(other)` merges `self` and `other` if they are String Vector.
3401
+
3402
+ ```{ruby}
3403
+ #| tags: []
3404
+ vector = Vector.new(%w[a c e])
3405
+ other = Vector.new(%w[b d f])
3406
+ vector.merge(other)
3407
+ ```
3408
+
3409
+ If `other` is a scalar, it will be appended to each element of `self`.
3410
+
3411
+ ```{ruby}
3412
+ #| tags: []
3413
+ vector.merge('x')
3414
+ ```
3415
+
3416
+ Option `:sep` is the separator used when concatenating elements. Its default value is ' '.
3417
+
3418
+ ```{ruby}
3419
+ #| tags: []
3420
+ vector.merge('x', sep: '')
3421
+ ```
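
The element-wise behaviour described above can be pictured with plain Ruby arrays. This is a sketch of the idea only, not RedAmber's implementation; `merge_sketch` is a hypothetical name, and nil handling is left out.

```ruby
# Join each element of `left` with the matching element of `other`,
# or with `other` itself when it is a single (scalar) string.
def merge_sketch(left, other, sep: " ")
  others = other.is_a?(Array) ? other : Array.new(left.size, other)
  left.zip(others).map { |a, b| [a, b].join(sep) }
end

puts merge_sketch(%w[a c e], %w[b d f]).inspect
puts merge_sketch(%w[a c e], "x").inspect
puts merge_sketch(%w[a c e], "x", sep: "").inspect
```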
3422
+
3423
+ ## 91. Separate a variable (column) in a DataFrame
3424
+ (Since 0.3.0)
3425
+
3426
+ R's separate operation.
3427
+
3428
+ https://tidyr.tidyverse.org/reference/separate.html
3429
+
3430
+ ```{ruby}
3431
+ #| tags: []
3432
+ df = DataFrame.new(xyz: [nil, 'x.y', 'x.z', 'y.z'])
3433
+ ```
3434
+
3435
+ Instead of `separate(:xyz, [:a, :b])` we will do:
3436
+
3437
+ ```{ruby}
3438
+ #| tags: []
3439
+ df.assign(:A, :B) { xyz.split_to_columns('.') }
3440
+ .drop(:xyz)
3441
+ ```
3442
+
3443
+ If you need :B only, instead of `separate(:xyz, [nil, :B])` we will do:
3444
+
3445
+ ```{ruby}
3446
+ #| tags: []
3447
+ df.assign(:A, :B) { xyz.split_to_columns('.') }
3448
+ .pick(:B)
3449
+ ```
3450
+
3451
+ When the split lengths differ, `split_to_columns` returns an Array of Vectors sized to the longest split, with missing elements filled with nil.
3452
+
3453
+ ```{ruby}
3454
+ #| tags: []
3455
+ df = DataFrame.new(xyz: ['x', 'x y', 'x y z', nil])
3456
+ df.assign(:x, :y, :z) { xyz.split_to_columns }
3457
+ ```
3458
+
3459
+ Split into at most 2 elements.
3460
+
3461
+ ```{ruby}
3462
+ #| tags: []
3463
+ df.assign(:x, :yz) { xyz.split_to_columns(' ', 2) }
3464
+ ```
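
The padding and the limit can both be pictured with `String#split` on plain arrays. The sketch below is an assumption about the observable behaviour described in this section, not RedAmber's implementation; `split_to_columns_sketch` is a hypothetical name.

```ruby
# Split every string, pad short results with nil, then transpose rows into columns.
# A nil input yields nil in every column. limit = 0 means "no limit" for String#split.
def split_to_columns_sketch(strings, sep = " ", limit = 0)
  rows  = strings.map { |s| s&.split(sep, limit) }
  width = rows.compact.map(&:size).max || 0
  rows.map { |parts| Array.new(width) { |i| parts && parts[i] } }.transpose
end

data = ["x", "x y", "x y z", nil]
puts split_to_columns_sketch(data).inspect          # three columns, padded with nil
puts split_to_columns_sketch(data, " ", 2).inspect  # two columns, remainder kept together
```

With a limit of 2, `"x y z"` splits into `"x"` and `"y z"`, which is why the second assigned column above is named `:yz`.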
3465
+
3466
+ Another example:
3467
+
3468
+ ```{ruby}
3469
+ #| tags: []
3470
+ df = DataFrame.new(id: 1..3, 'month-year': %w[8-2022 9-2022 10-2022])
3471
+ .assign(:month, :year) { v(:'month-year').split_to_columns('-') }
3472
+ ```
3473
+
3474
+ Split between the letters.
3475
+
3476
+ ```{ruby}
3477
+ #| tags: []
3478
+ df = DataFrame.new(id: 1..3, yearmonth: %w[202209 202210 202211])
3479
+ .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) }
3480
+ ```
3481
+
3482
+ ## 92. Unite variables (columns) in a DataFrame
3483
+ (Since 0.3.0)
3484
+
3485
+ R's unite operation.
3486
+
3487
+ ```{ruby}
3488
+ #| tags: []
3489
+ df = DataFrame.new(id: 1..3, year: %w[2022 2022 2022], month: %w[09 10 11])
3490
+ ```
3491
+
3492
+ ```{ruby}
3493
+ #| tags: []
3494
+ df.assign(:yearmonth) { year.merge(month, sep: '') }
3495
+ .pick(:id, :yearmonth)
3496
+ ```
3497
+
3498
+ ```{ruby}
3499
+ #| tags: []
3500
+ # Or directly create:
3501
+ DataFrame.new(id: 1..3, yearmonth: df.year.merge(df.month, sep: ''))
3502
+ ```
3503
+
3504
+ ## 93. Separate variable and lengthen into several rows.
3505
+ (Since 0.3.0)
3506
+
3507
+ R's separate_rows operation.
3508
+
3509
+ ```{ruby}
3510
+ #| tags: []
3511
+ df = DataFrame.new(id: 1..3, yearmonth: %w[202209 202210 202211])
3512
+ .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) }
3513
+ .drop(:yearmonth)
3514
+ .to_long(:id)
3515
+ ```
3516
+
3517
+ Another example with different list size.
3518
+
3519
+ ```{ruby}
3520
+ #| tags: []
3521
+ df = DataFrame.new(
3522
+ x: 1..3,
3523
+ y: ['a', 'd,e,f', 'g,h'],
3524
+ z: ['1', '2,3,4', '5,6'],
3525
+ )
3526
+ ```
3527
+
3528
+ ```{ruby}
3529
+ #| tags: []
3530
+ sizes = df.y.split(',').list_sizes
3531
+ a = sizes.to_a.map.with_index(1) { |n, i| [i] * n }.flatten
3532
+ ```
3533
+
3534
+ ```{ruby}
3535
+ #| tags: []
3536
+ DataFrame.new(
3537
+ x: a,
3538
+ y: df.y.split_to_rows(','),
3539
+ z: df.z.split_to_rows(',')
3540
+ )
3541
+ ```
3542
+
3543
+ Another way to use `#split_to_columns`.
3544
+
3545
+ ```{ruby}
3546
+ #| tags: []
3547
+ xy = df.pick(:x, :y)
3548
+ .assign(:y, :y1, :y2) { v(:y).split_to_columns(',') }
3549
+ .to_long(:x, value: :y)
3550
+ .remove_nil
3551
+ ```
3552
+
3553
+ ```{ruby}
3554
+ #| tags: []
3555
+ xz = df.pick(:x, :z)
3556
+ .assign(:z, :z1, :z2) { v(:z).split_to_columns(',') }
3557
+ .to_long(:x, value: :z)
3558
+ .remove_nil
3559
+ ```
3560
+
3561
+ ```{ruby}
3562
+ #| tags: []
3563
+ xy.pick(:x, :y).merge(xz.pick(:z))
3564
+ ```
3565
+
3566
+ Get all combinations of :y and :z.
3567
+
3568
+ ```{ruby}
3569
+ #| tags: []
3570
+ df.assign(:y, :y1, :y2) { v(:y).split_to_columns(',') }
3571
+ .to_long(:x, :z, value: :y)
3572
+ .drop(:NAME)
3573
+ .assign(:z, :z1, :z2) { v(:z).split_to_columns(',') }
3574
+ .to_long(:x, :y, value: :z)
3575
+ .drop(:NAME)
3576
+ .drop_nil
3577
+ ```
3578
+
3579
+ ## 94. Vector#propagate
3580
+
3581
+ Spread the return value of an aggregate function as if it were an element-wise function.
3582
+
3583
+ It has an alias `#expand`.
3584
+
3585
+ (Since 0.4.0)
3586
+
3587
+ ```{ruby}
3588
+ #| tags: []
3589
+ vec = Vector.new(1, 2, 3, 4)
3590
+ vec.propagate(:mean)
3591
+ ```
3592
+
3593
+ Block is also available.
3594
+
3595
+ ```{ruby}
3596
+ #| tags: []
3597
+ vec.propagate { |v| v.mean.round }
3598
+ ```
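
Conceptually, propagation computes one scalar and repeats it to the original length. A plain-Ruby sketch of that idea follows (not RedAmber's implementation; `propagate_sketch` is a hypothetical name).

```ruby
# Compute a single value from the whole array, then repeat it for every position.
def propagate_sketch(values)
  scalar = yield(values)
  Array.new(values.size, scalar)
end

means = propagate_sketch([1, 2, 3, 4]) { |a| a.sum.to_f / a.size }
puts means.inspect
```

The result has the same length as the input, so it can sit next to the original column, for example to subtract a column mean from each element.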
3599
+
3600
+ ## 95. DataFrame#propagate
3601
+
3602
+ Returns a Vector such that all elements have value `scalar` and have same size as self.
3603
+
3604
+ (Since 0.5.0)
3605
+
3606
+ ```{ruby}
3607
+ #| tags: []
3608
+ df
3609
+ ```
3610
+
3611
+ ```{ruby}
3612
+ #| tags: []
3613
+ df.assign(:sum_x) { propagate(x.sum) }
3614
+ ```
3615
+
3616
+ With a block.
3617
+
3618
+ ```{ruby}
3619
+ #| tags: []
3620
+ df.assign(:range) { propagate { x.max - x.min } }
3621
+ ```
3622
+
3623
+ ## 96. Vector#sort / #sort_indices
3624
+
3625
+ `#sort` will arrange values in Vector.
3626
+
3627
+ Accepts :sort order option:
3628
+ - `:+`, `:ascending` or without argument will sort in increasing order.
3629
+ - `:-` or `:descending` will sort in decreasing order.
3630
+
3631
+ (Since 0.4.0)
3632
+
3633
+ ```{ruby}
3634
+ #| tags: []
3635
+ vector = Vector.new(%w[B D A E C])
3636
+ vector.sort
3637
+ # same as vector.sort(:+)
3638
+ # same as vector.sort(:ascending)
3639
+ ```
3640
+
3641
+ Sort in decreasing order;
3642
+
3643
+ ```{ruby}
3644
+ #| tags: []
3645
+ vector.sort(:-)
3646
+ # same as vector.sort(:descending)
3647
+ ```
3648
+
3649
+ ## 97. Vector#rank
3650
+
3651
+ Returns 1-based numerical rank of self.
3652
+
3653
+ - Nil values are considered greater than any value.
3654
+ - NaN values are considered greater than any value but smaller than nil values.
3655
+ - Sort order can be controlled by the option `order`.
3656
+ * `:ascending` or `+` will compute rank in ascending order (default).
3657
+ * `:descending` or `-` will compute rank in descending order.
3658
+ - Tiebreakers will configure how ties between equal values are handled.
3659
+ * `tie: :first` : Ranks are assigned in order of when ties appear in the input (default).
3660
+ * `tie: :min` : Ties get the smallest possible rank in the sorted order.
3661
+ * `tie: :max` : Ties get the largest possible rank in the sorted order.
3662
+ * `tie: :dense` : The ranks span a dense [1, M] interval where M is the number of distinct values in the input.
3663
+ - Placement of nil and NaN is controlled by the option `null_placement`.
3664
+ * `null_placement: :at_end` : place nulls at end (default).
3665
+ * `null_placement: :at_start` : place nulls at the top of Vector.
3666
+
3667
+ (Since 0.4.0, revised in 0.5.1)
3668
+
3669
+ Rank of float Vector;
3670
+
3671
+ ```{ruby}
3672
+ #| tags: []
3673
+ float = Vector[1, 0, nil, Float::NAN, Float::INFINITY, -Float::INFINITY, 3, 2]
3674
+ ```
3675
+
3676
+ ```{ruby}
3677
+ #| tags: []
3678
+ # Same as float.rank(:ascending, tie: :first, null_placement: :at_end)
3679
+ float.rank
3680
+ ```
3681
+
3682
+ With sort order;
3683
+
3684
+ ```{ruby}
3685
+ #| tags: []
3686
+ float.rank(:descending) # or float.rank('-')
3687
+ ```
3688
+
3689
+ With null placement;
3690
+
3691
+ ```{ruby}
3692
+ #| tags: []
3693
+ float.rank(null_placement: :at_start)
3694
+ ```
3695
+
3696
+ Rank of string Vector with tiebreakers;
3697
+
3698
+ ```{ruby}
3699
+ #| tags: []
3700
+ string = Vector['A', 'A', nil, nil, 'C', 'B']
3701
+ ```
3702
+
3703
+ ```{ruby}
3704
+ #| tags: []
3705
+ string.rank # same as string.rank(tie: :first)
3706
+ ```
3707
+
3708
+ ```{ruby}
3709
+ #| tags: []
3710
+ string.rank(tie: :min)
3711
+ ```
3712
+
3713
+ ```{ruby}
3714
+ #| tags: []
3715
+ string.rank(tie: :max)
3716
+ ```
3717
+
3718
+ ```{ruby}
3719
+ #| tags: []
3720
+ string.rank(tie: :dense)
3721
+ ```
3722
+
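The four tiebreakers described above can be illustrated with a plain-Ruby sketch that does not use RedAmber. It assumes ascending order and leaves nil handling out; the variable names are hypothetical and this is not the library's implementation.

```ruby
# Plain-Ruby sketch of rank tiebreakers (ascending order, no nils).
values = %w[A A C B]

# :first - ties are ranked in order of appearance
# (sort positions by value, then by original position).
order = values.each_index.sort_by { |i| [values[i], i] }
first = Array.new(values.size)
order.each_with_index { |i, r| first[i] = r + 1 }

# :min - every tie gets the smallest rank in its group.
min = values.map { |v| values.count { |w| w < v } + 1 }

# :max - every tie gets the largest rank in its group.
max = values.map { |v| values.count { |w| w <= v } }

# :dense - ranks span 1..M, where M is the number of distinct values.
distinct = values.uniq.sort
dense = values.map { |v| distinct.index(v) + 1 }

p first # => [1, 2, 4, 3]
p min   # => [1, 1, 4, 3]
p max   # => [2, 2, 4, 3]
p dense # => [1, 1, 3, 2]
```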
3723
+ ## 98. Vector#sample
3724
+ Pick up elements at random.
3725
+
3726
+ (Since 0.4.0)
3727
+
3728
+ Return a randomly selected element. This is one of the aggregation functions.
3729
+
3730
+ ```{ruby}
3731
+ #| tags: []
3732
+ v = Vector.new('A'..'H')
3733
+ ```
3734
+
3735
+ Returns scalar without any arguments.
3736
+
3737
+ ```{ruby}
3738
+ #| tags: []
3739
+ v.sample
3740
+ ```
3741
+
3742
+ `sample(n)` will pick up `n` elements at random. `n` is a positive number of elements to pick.
3743
+
3744
+ If n is smaller than or equal to `size`, elements are picked without repetition.
3745
+
3746
+ If n == 1 (in case of `sample(1)`), it returns a Vector of size == 1 not a scalar.
3747
+
3748
+ ```{ruby}
3749
+ #| tags: []
3750
+ v.sample(1)
3751
+ ```
3752
+
3753
+ Sample the same size as self: every element is picked in random order.
3754
+
3755
+ ```{ruby}
3756
+ #| tags: []
3757
+ v.sample(8)
3758
+ ```
3759
+
3760
+ If n is greater than `size`, some elements are picked repeatedly.
3761
+
3762
+ ```{ruby}
3763
+ #| tags: []
3764
+ v.sample(9)
3765
+ ```
3766
+
3767
+ `sample(prop)` will pick up elements by proportion `prop` at random. `prop` must be a positive float.
3768
+ - The absolute number of elements to pick, `prop*size`, is truncated.
3769
+ - If prop is smaller than or equal to 1.0, elements are picked without repetition.
3770
+
3771
+ ```{ruby}
3772
+ #| tags: []
3773
+ v.sample(0.7)
3774
+ ```
3775
+
3776
+ If only one element is picked, it returns a Vector of size == 1, not a scalar.
3777
+
3778
+ ```{ruby}
3779
+ #| tags: []
3780
+ v.sample(0.1)
3781
+ ```
3782
+
3783
+ Sample the same size as self: every element is picked in random order.
3784
+
3785
+ ```{ruby}
3786
+ #| tags: []
3787
+ v.sample(1.0)
3788
+ ```
3789
+
3790
+ If prop is greater than 1.0, some elements are picked repeatedly.
3791
+
3792
+ ```{ruby}
3793
+ #| tags: []
3794
+ # 2 times over sampling
3795
+ sampled = v.sample(2.0)
3796
+ ```
3797
+
3798
+ ```{ruby}
3799
+ #| tags: []
3800
+ sampled.tally
3801
+ ```
3802
+
3803
+ ## 99. DataFrame#sample/shuffle
3804
+
3805
+ (Since 0.5.0)
3806
+
3807
+ Select records randomly to create a DataFrame.
3808
+
3809
+ ```{ruby}
3810
+ #| tags: []
3811
+ penguins.sample(0.1)
3812
+ ```
3813
+
3814
+ Returns a DataFrame with shuffled rows.
3815
+
3816
+ ```{ruby}
3817
+ #| tags: []
3818
+ penguins.shuffle
3819
+ ```
3820
+
3821
+ ## 100. Vector#concatenate
3822
+
3823
+ Concatenate other array-like to self.
3824
+
3825
+ (Since 0.4.0)
3826
+
3827
+ Concatenate to string;
3828
+
3829
+ ```{ruby}
3830
+ #| tags: []
3831
+ string = Vector.new(%w[A B])
3832
+ ```
3833
+
3834
+ ```{ruby}
3835
+ #| tags: []
3836
+ string.concatenate([1, 2])
3837
+ ```
3838
+
3839
+ Concatenate to integer;
3840
+
3841
+ ```{ruby}
3842
+ #| tags: []
3843
+ integer = Vector.new(1, 2)
3844
+ ```
3845
+
3846
+ ```{ruby}
3847
+ #| tags: []
3848
+ integer.concatenate(["A", "B"])
3849
+ ```
3850
+
3851
+ ## 101. Vector#resolve
3852
+
3853
+ Return other as a Vector which has the same data type as self.
3854
+
3855
+ (Since 0.4.0)
3856
+
3857
+ Integer to String;
3858
+
3859
+ ```{ruby}
3860
+ #| tags: []
3861
+ Vector.new('A').resolve([1, 2])
3862
+ ```
3863
+
3864
+ String to Integer;
3865
+
3866
+ ```{ruby}
3867
+ #| tags: []
3868
+ Vector.new(1).resolve(['A'])
3869
+ ```
3870
+
3871
+ Upcast to uint16;
3872
+
3873
+ ```{ruby}
3874
+ #| tags: []
3875
+ vector = Vector.new(256)
3876
+ ```
3877
+
3878
+ Not a uint8 Vector;
3879
+
3880
+ ```{ruby}
3881
+ #| tags: []
3882
+ vector.resolve([1, 2])
3883
+ ```
3884
+
3885
+ ## 102. Vector#cast
3886
+
3887
+ Cast self to `type`.
3888
+
3889
+ (since 0.4.2)
3890
+
3891
+ ```{ruby}
3892
+ #| tags: []
3893
+ vector = Vector.new(1, 2, nil)
3894
+ vector.cast(:int16)
3895
+ ```
3896
+
3897
+ ```{ruby}
3898
+ #| tags: []
3899
+ vector.cast(:double)
3900
+ ```
3901
+
3902
+ ```{ruby}
3903
+ #| tags: []
3904
+ vector.cast(:string)
3905
+ ```
3906
+
3907
+ ## 103. Vector#one
3908
+
3909
+ Get a non-nil element in self. If all elements are nil, return nil.
3910
+
3911
+ (since 0.4.2)
3912
+
3913
+ ```{ruby}
3914
+ #| tags: []
3915
+ vector = Vector.new([nil, 1, 3])
3916
+ vector.one
3917
+ ```
3918
+
3919
+ ## 104. SubFrames
3920
+
3921
+ `SubFrames` is a new concept of DataFrame collection. It represents ordered subsets of a DataFrame collected by some rules. It includes both grouping and windowing concepts in a unified manner, and also covers broader cases more flexibly.
3922
+
3923
+ (Since 0.4.0)
3924
+
3925
+ ```{ruby}
3926
+ #| tags: []
3927
+ dataframe = DataFrame.new(
3928
+ x: [*1..6],
3929
+ y: %w[A A B B B C],
3930
+ z: [false, true, false, nil, true, false]
3931
+ )
3932
+ p dataframe; nil
3933
+ ```
3934
+
3935
+ ```{ruby}
3936
+ #| tags: []
3937
+ sf = SubFrames.new(dataframe, [[0, 1], [2, 3, 4], [5]])
3938
+ ```
3939
+
3940
+ Source DataFrame (universal set).
3941
+
3942
+ ```{ruby}
3943
+ #| tags: []
3944
+ sf.baseframe
3945
+ ```
3946
+
3947
+ Number of subsets.
3948
+
3949
+ ```{ruby}
3950
+ #| tags: []
3951
+ sf.size
3952
+ ```
3953
+
3954
+ Size of each subset.
3955
+
3956
+ ```{ruby}
3957
+ #| tags: []
3958
+ sf.sizes
3959
+ ```
3960
+
3961
+ `#each` returns an Enumerator when called without a block; with a block, it iterates over each subset as a DataFrame.
3962
+
3963
+ ```{ruby}
3964
+ #| tags: []
3965
+ sf.each
3966
+ ```
3967
+
3968
+ ```{ruby}
3969
+ #| tags: []
3970
+ sf.each.next
3971
+ ```
3972
+
3973
+ `SubFrames.new` also accepts a block.
3974
+
3975
+ ```{ruby}
3976
+ #| tags: []
3977
+ usf = SubFrames.new(dataframe) { |df| [df.indices] }
3978
+ ```
3979
+
3980
+ `#universal?` tests if self is a universal set.
3981
+
3982
+ ```{ruby}
3983
+ #| tags: []
3984
+ usf.universal?
3985
+ ```
3986
+
3987
+ `#empty?` tests if self is an empty set.
3988
+
3989
+ ```{ruby}
3990
+ #| tags: []
3991
+ esf = SubFrames.new(dataframe, [])
3992
+ ```
3993
+
3994
+ ```{ruby}
3995
+ #| scrolled: true
3996
+ #| tags: []
3997
+ esf.empty?
3998
+ ```
3999
+
4000
+ `#take(n)` takes n sub DataFrames and returns them as a SubFrames. If n >= size, it returns self.
4001
+
4002
+ ```{ruby}
4003
+ sf.take(2)
4004
+ ```
4005
+
4006
+ `#offset_indices` returns the indices at the top of each sub DataFrame.
4007
+
4008
+ ```{ruby}
4009
+ sf.offset_indices
4010
+ ```
4011
+
4012
+ `#frames` returns an Array of sub DataFrames.
4013
+
4014
+ ```{ruby}
4015
+ sf.frames
4016
+ ```
4017
+
4018
+ `SubFrames.new` also accepts boolean filters, including ones returned from the block.
4019
+
4020
+ ```{ruby}
4021
+ #| tags: []
4022
+ small = dataframe.x < 4
4023
+ large = !small
4024
+ small_large = SubFrames.new(dataframe) { [small, large] }
4025
+ ```
4026
+
4027
+ ## 105. SubFrames#concatenate
4028
+
4029
+ `SubFrames#concatenate` (or alias `#concat`) will concatenate SubFrames to create a DataFrame.
4030
+
4031
+ (Since 0.4.0)
4032
+
4033
+ ```{ruby}
4034
+ #| tags: []
4035
+ sf.concatenate
4036
+ ```
4037
+
4038
+ ## 106. SubFrames.by_group
4039
+
4040
+ Create SubFrames from a Group object.
4041
+
4042
+ (Since 0.4.0)
4043
+
4044
+ ```{ruby}
4045
+ #| tags: []
4046
+ p dataframe; nil
4047
+ ```
4048
+
4049
+ ```{ruby}
4050
+ #| tags: []
4051
+ group = Group.new(dataframe, [:y])
4052
+ sf = SubFrames.by_group(group)
4053
+ ```
4054
+
4055
+ ## 107. SubFrames.by_indices/.by_filters
4056
+
4057
+ `SubFrames.by_indices(dataframe, subset_indices)` creates a new SubFrames object from a DataFrame and an array of indices.
4058
+
4059
+ ```{ruby}
4060
+ SubFrames.by_indices(dataframe, [[0, 2, 4], [1, 3, 5]])
4061
+ ```
4062
+
4063
+ `SubFrames.by_filters(dataframe, subset_filters)` creates a new SubFrames object from a DataFrame and an array of filters.
4064
+
4065
+ ```{ruby}
4066
+ #| scrolled: true
4067
+ SubFrames.by_filters(dataframe, [[true, false, true, false, nil, false], [true, true, false, false, nil, false]])
4068
+ ```
4069
+
4070
+ ## 108. SubFrames.by_dataframes
4071
+
4072
+ `SubFrames.by_dataframes(dataframes)` creates a new SubFrames from an Array of DataFrames.
4073
+
4074
+ ```{ruby}
4075
+ dataframes = [
4076
+ DataFrame.new(x: [1, 2, 3], y: %w[A A B], z: [false, true, false]),
4077
+ DataFrame.new(x: [4, 5, 6], y: %w[B B C], z: [nil, true, false])
4078
+ ]
4079
+ ```
4080
+
4081
+ ```{ruby}
4082
+ SubFrames.by_dataframes(dataframes)
4083
+ ```
4084
+
4085
+ ## 109. DataFrame#sub_by_value
4086
+
4087
+ `sub_by_value(*keys)` makes SubFrames by value. It corresponds to Group processing.
4088
+
4089
+ Create SubFrames from keys and group by values in columns specified by the key.
4090
+
4091
+ (Since 0.4.0)
4092
+
4093
+ ```{ruby}
4094
+ #| tags: []
4095
+ dataframe.sub_by_value(:y)
4096
+ ```
4097
+
4098
+ ## 110. DataFrame#sub_by_window
4099
+
4100
+ Create SubFrames by window in `size` rolling `from` by `step`.
4101
+
4102
+ Default values are `from: 0`, `size: nil` and `step: 1`.
4103
+
4104
+ (Since 0.4.0)
4105
+
4106
+ ```{ruby}
4107
+ #| tags: []
4108
+ dataframe.sub_by_window(size: 4, step: 2)
4109
+ ```
4110
+
4111
+ ## 111. DataFrame#sub_by_enum
4112
+
4113
+ Create SubFrames by grouping/windowing by position. The position is specified by one of `Array`'s enumerator methods such as `each_slice` or `each_cons`.
4114
+
4115
+ (Since 0.4.0)
4116
+
4117
+ Create a SubFrames object sliced by 3 rows. This is MECE (Mutually Exclusive and Collectively Exhaustive) SubFrames.
4118
+
4119
+ ```{ruby}
4120
+ #| tags: []
4121
+ dataframe.sub_by_enum(:each_slice, 3)
4122
+ ```
4123
+
4124
+ Create a SubFrames object for each 4 consecutive rows.
4125
+
4126
+ ```{ruby}
4127
+ #| tags: []
4128
+ dataframe.sub_by_enum(:each_cons, 4)
4129
+ ```
4130
+
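To see which row positions each enumerator method yields, here is a plain-Ruby sketch on the positions of a 6-row table. It only shows how `each_slice` and `each_cons` behave on an Array; the variable names are hypothetical and RedAmber is not required.

```ruby
# Row positions of a 6-row table.
row_indices = (0...6).to_a

# each_slice(3): non-overlapping chunks, every row appears exactly once (MECE).
slices = row_indices.each_slice(3).to_a

# each_cons(4): overlapping windows of 4 consecutive rows, sliding by 1.
windows = row_indices.each_cons(4).to_a

p slices  # => [[0, 1, 2], [3, 4, 5]]
p windows # => [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
```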
4131
+ ## 112. DataFrame#sub_by_kernel
4132
+
4133
+ Create SubFrames by windowing with a kernel and step.
4134
+ Kernel is a boolean Array and it behaves like a masked window.
4135
+
4136
+ (Since 0.4.0)
4137
+
4138
+ ```{ruby}
4139
+ #| tags: []
4140
+ kernel = [true, false, false, true]
4141
+ dataframe.sub_by_kernel(kernel, step: 2)
4142
+ ```
4143
+
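A masked window can be sketched in plain Ruby as follows. This assumes the kernel slides from position 0 by `step` for as long as it fits entirely within the rows; that assumption, and the names `n_rows`, `offsets`, `starts` and `subsets`, are illustrative and are not taken from the library.

```ruby
# Plain-Ruby sketch of a kernel (masked window) sliding over 6 rows.
kernel = [true, false, false, true]
n_rows = 6
step = 2

# Offsets inside the window that the kernel keeps.
offsets = kernel.each_index.select { |k| kernel[k] }

# Start positions where the whole kernel still fits.
starts = 0.step(n_rows - kernel.size, step).to_a

# Row positions selected by each placement of the kernel.
subsets = starts.map { |s| offsets.map { |k| s + k } }

p subsets # => [[0, 3], [2, 5]]
```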
4144
+ ## 113. DataFrame#build_subframes
4145
+
4146
+ Generic builder of sub-dataframe from self.
4147
+
4148
+ (Since 0.4.0)
4149
+
4150
+ ```{ruby}
4151
+ #| tags: []
4152
+ dataframe.build_subframes([[0, 2, 4], [1, 3, 5]])
4153
+ ```
4154
+
4155
+ `#build_subframes` also accepts a block.
4156
+
4157
+ ```{ruby}
4158
+ #| tags: []
4159
+ dataframe.build_subframes do |df|
4160
+ even = df.indices.map(&:even?)
4161
+ [even, !even]
4162
+ end
4163
+ ```
4164
+
4165
+ ## 114. SubFrames#aggregate
4166
+
4167
+ Aggregate SubFrames to create a DataFrame. There are 4 APIs in this method.
4168
+
4169
+ (Since 0.4.0)
4170
+
4171
+ - `#aggregate(keys) { columns }`
4172
+
4173
+ Aggregate SubFrames creating DataFrame with label `keys` and its column values by block.
4174
+
4175
+ ```{ruby}
4176
+ #| tags: []
4177
+ sf = dataframe.sub_by_value(:y)
4178
+ ```
4179
+
4180
+ ```{ruby}
4181
+ sf.aggregate(:y, :sum_x) { [y.one, x.sum] } # sf.aggregate([:y, :sum_x]) { [y.one, x.sum] } is also acceptable
4182
+ ```
4183
+
4184
+ - `#aggregate { key_and_aggregated_values }`
4185
+
4186
+ Aggregate SubFrames creating DataFrame with pairs of key and aggregated values in Hash from the block.
4187
+
4188
+ ```{ruby}
4189
+ sf.aggregate do
4190
+ { y: y.one, sum_x: x.sum }
4191
+ end
4192
+ ```
4193
+
4194
+ - `#aggregate { [keys, values] }`
4195
+
4196
+ Aggregate SubFrames creating DataFrame with an Array of key and aggregated value from the block.
4197
+
4198
+ ```{ruby}
4199
+ #| tags: []
4200
+ sf.aggregate do
4201
+ [[:y, y.one], [:sum_x, x.sum]]
4202
+ end
4203
+ ```
4204
+
4205
+ - `#aggregate(group_keys, aggregations)`
4206
+
4207
+ Aggregate SubFrames for first values of the columns of `group_keys` and the aggregated results of key-function pairs.
4208
+ [Experimental] This API may be changed in the future.
4209
+
4210
+ ```{ruby}
4211
+ #| tags: []
4212
+ sf.aggregate(:y, { x: :sum, z: :count })
4213
+ ```
4214
+
4215
+ ## 115. SubFrames#map/#assign
4216
+
4217
+ `#map` returns a SubFrames containing DataFrames returned by the block. It has an alias `collect`.
4218
+
4219
+ ```{ruby}
4220
+ sf
4221
+ ```
4222
+
4223
+ This example assigns a new column.
4224
+
4225
+ ```{ruby}
4226
+ sf.map { |df| df.assign(x_plus1: df[:x] + 1) }
4227
+ ```
4228
+
4229
+ There is a shortcut of `map { assign }`. We can use `assign(key) { updated_column }`.
4230
+
4231
+ ```{ruby}
4232
+ sf.assign(:x_plus1) { x + 1 }
4233
+ ```
4234
+
4235
+ We can use `assign(keys) { updated_columns }` for multiple columns.
4236
+
4237
+ ```{ruby}
4238
+ sf.assign(:sum_x, :frac_x) do
4239
+ group_sum = x.sum
4240
+ [[group_sum] * x.size, x / group_sum.to_f]
4241
+ end
4242
+ ```
4243
+
4244
+ Also `assign { keys_and_columns }` is possible.
4245
+
4246
+ ```{ruby}
4247
+ sf.assign do
4248
+ { 'x*z': x * z.if_else(1, 0) }
4249
+ end
4250
+ ```
4251
+
4252
+ (Notice) `SubFrames#assign` has the same syntax as `DataFrame#assign`.
4253
+
4254
+ If you need an Array of DataFrames (not a SubFrames), use `each.map` instead.
4255
+
4256
+ ```{ruby}
4257
+ sf.each.map { |df| df.assign(x_plus1: df[:x] + 1) }
4258
+ ```
4259
+
4260
+ ## 116. SubFrames#select/#reject
4261
+
4262
+ `#select` returns a SubFrames containing DataFrames selected by the block.
4263
+
4264
+ ```{ruby}
4265
+ sf.select { |df| df[:z].any? }
4266
+ ```
4267
+
4268
+ `#select` has aliases `#filter` and `#find_all`.
4269
+
4270
+ `#reject` returns a SubFrames containing the DataFrames for which the block returns a falsy value.
4271
+
4272
+ ```{ruby}
4273
+ sf.reject { |df| df[:z].any? }
4274
+ ```
4275
+
4276
+ ## 117. SubFrames#filter_map
4277
+
4278
+ It returns a SubFrames containing truthy DataFrames returned by the block.
4279
+
4280
+ ```{ruby}
4281
+ sf.filter_map do |df|
4282
+ if df.size > 1
4283
+ df.assign(:y) do
4284
+ y.merge(indices('1'), sep: '')
4285
+ end
4286
+ end
4287
+ end
4288
+ ```
4289
+
4290
+ ## 118. Vector#modulo
4291
+
4292
+ (Since 0.4.1)
4293
+
4294
+ `#%` is an alias of `#modulo`.
4295
+
4296
+ ```{ruby}
4297
+ #| tags: []
4298
+ vector = Vector.new(5, -3, 1)
4299
+ vector % 3
4300
+ ```
4301
+
4302
+ `#%` and `#modulo` are equivalent to `self-divisor*(self/divisor).floor`.
4303
+
4304
+ ```{ruby}
4305
+ #| tags: []
4306
+ vector.modulo(-2)
4307
+ ```
4308
+
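The floored-division formula above can be checked on plain Ruby numbers. This is a sketch, not the library's implementation; `floored_modulo` is a hypothetical helper, and the division is done in floating point so that `floor` applies to the true quotient.

```ruby
# Plain-Ruby sketch of: self - divisor * (self / divisor).floor
def floored_modulo(value, divisor)
  value - divisor * (value.to_f / divisor).floor
end

values = [5, -3, 1]

by_three = values.map { |v| floored_modulo(v, 3) }
by_minus_two = values.map { |v| floored_modulo(v, -2) }

p by_three     # => [2, 0, 1]
p by_minus_two # => [-1, -1, -1]
```

The result always takes the sign of the divisor, which matches Ruby's own `Integer#%`.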
4309
+ ## 119. Vector#mode
4310
+
4311
+ Compute the most common value and its occurrence count.
4312
+
4313
+ (since 0.5.0) ModeOptions are not supported in 0.5.0. Only one mode value is returned.
4314
+
4315
+ ```{ruby}
4316
+ #| tags: []
4317
+ Vector[true, true, false, nil].mode
4318
+ ```
4319
+
4320
+ ```{ruby}
4321
+ #| tags: []
4322
+ Vector[0, 1, 1, 2, nil].mode
4323
+ ```
4324
+
4325
+ ```{ruby}
4326
+ #| tags: []
4327
+ Vector[1, 0/0.0, -1/0.0, 1/0.0, nil].mode
4328
+ ```
4329
+
4330
+ ## 120. Vector#end_with/start_with
4331
+
4332
+ Check if elements in self end/start with a literal pattern.
4333
+
4334
+ (since 0.5.0)
4335
+
4336
+ ```{ruby}
4337
+ #| tags: []
4338
+ v = Vector['array', 'Arrow', 'carrot', nil, 'window']
4339
+ ```
4340
+
4341
+ Emits true if the element ends (or starts) with `string`, false if not. Nil inputs emit nil.
4342
+
4343
+ ```{ruby}
4344
+ #| tags: []
4345
+ v.end_with('ow')
4346
+ ```
4347
+
4348
+ ```{ruby}
4349
+ #| tags: []
4350
+ v.start_with('arr')
4351
+ ```
4352
+
4353
+ ## 121. Vector#match_substring
4354
+
4355
+ For each string in self, emit true if it contains a given pattern.
4356
+
4357
+ (since 0.5.0)
4358
+
4359
+ ```{ruby}
4360
+ #| tags: []
4361
+ v = Vector['array', 'Arrow', 'carrot', nil, 'window']
4362
+ ```
4363
+
4364
+ Emits true if it contains `string`. Emits false if not found. Nil inputs emit nil.
4365
+
4366
+ ```{ruby}
4367
+ #| tags: []
4368
+ v.match_substring('arr')
4369
+ ```
4370
+
4371
+ Otherwise, use it with a Regexp pattern. It calls the Arrow compute function `match_substring_regex`, which uses the re2 library.
4372
+
4373
+ ```{ruby}
4374
+ #| tags: []
4375
+ v.match_substring(/arr/)
4376
+ ```
4377
+
4378
+ You can ignore case if you use a regexp with the `i` option, or `ignore_case: true`.
4379
+
4380
+ ```{ruby}
4381
+ #| tags: []
4382
+ v.match_substring(/arr/i) # same as v.match_substring(/arr/, ignore_case: true)
4383
+ ```
4384
+
4385
+ ## 122. Vector#match_like
4386
+
4387
+ Match elements of self against an SQL-style LIKE pattern. The pattern matches a given pattern at any position.
4388
+
4389
+ - '%' will match any number of characters,
4390
+ - '_' will match exactly one character, and any other character matches itself.
4391
+ - To match a literal '%', '_', or '\', precede the character with a backslash.
4392
+
4393
+ (since 0.5.0)
4394
+
4395
+ ```{ruby}
4396
+ #| tags: []
4397
+ v = Vector['array', 'Arrow', 'carrot', nil, 'window']
4398
+ ```
4399
+
4400
+ Emits true if the element matches the pattern, false if not. Nil inputs emit nil.
4401
+
4402
+ ```{ruby}
4403
+ #| tags: []
4404
+ v.match_like('_arr%')
4405
+ ```
4406
+
4407
+ You can ignore case if you use the option `ignore_case: true`.
4408
+
4409
+ ```{ruby}
4410
+ #| tags: []
4411
+ v.match_like('arr%', ignore_case: true)
4412
+ ```
4413
+
4414
+ ## 123. Vector#find_substring
4415
+
4416
+ Find first occurrence of substring in string Vector.
4417
+
4418
+ (since 0.5.1)
4419
+
4420
+ ```{ruby}
4421
+ #| tags: []
4422
+ v = Vector['array', 'Arrow', 'carrot', nil, 'window']
4423
+ ```
4424
+
4425
+ You can find indices of a literal string. Emits -1 if not found. Nil inputs emit nil.
4426
+
4427
+ ```{ruby}
4428
+ #| tags: []
4429
+ v.find_substring('arr')
4430
+ ```
4431
+
4432
+ Otherwise, use it with a Regexp pattern. It calls the Arrow compute function `find_substring_regex`, which uses the re2 library.
4433
+
4434
+ ```{ruby}
4435
+ #| tags: []
4436
+ v.find_substring(/arr/)
4437
+ ```
4438
+
4439
+ You can ignore case if you use a regexp with the `i` option, or `ignore_case: true`.
4440
+
4441
+ ```{ruby}
4442
+ #| tags: []
4443
+ v.find_substring(/arr/i) # same as v.find_substring(/arr/, ignore_case: true)
4444
+ ```
4445
+
4446
+ ## 124. Vector#count_substring
4447
+
4448
+ For each string in self, count occurrences of the given pattern.
4449
+
4450
+ (since 0.5.0)
4451
+
4452
+ ```{ruby}
4453
+ #| tags: []
4454
+ v = Vector['amber', 'Amazon', 'banana', nil]
4455
+ ```
4456
+
4457
+ You can count occurrences of a literal string. Nil inputs emit nil.
4458
+
4459
+ ```{ruby}
4460
+ #| tags: []
4461
+ v.count_substring('an')
4462
+ ```
4463
+
4464
+ Otherwise, use it with a Regexp pattern. It calls the Arrow compute function `count_substring_regex`, which uses the re2 library.
4465
+
4466
+ ```{ruby}
4467
+ #| tags: []
4468
+ v.count_substring(/a[mn]/)
4469
+ ```
4470
+
4471
+ You can ignore case if you use a regexp with the `i` option, or `ignore_case: true`.
4472
+
4473
+ ```{ruby}
4474
+ #| tags: []
4475
+ v.count_substring(/a[mn]/i) # same as v.count_substring(/a[mn]/, ignore_case: true)
4476
+ ```
4477
+
4478
+ ## 125. Grouped DataFrame as a list
4479
+
4480
+ This API was introduced in 0.2.3, and supplies a new DataFrame group (experimental).
4481
+
4482
+ This additional API will treat a grouped DataFrame as a list of DataFrames. I think this API has pros such as:
4483
+
4484
+ - API is easy to understand and flexible.
4485
+ - It has good compatibility with Ruby's primitive Enumerables.
4486
+ - We can only use non hash-ed aggregation functions.
4487
+ - Do not need grouped DataFrame state, nor `#ungroup` method.
4488
+ - May be useful for concurrent operations.
4489
+
4490
+ This feature is implemented in Ruby, so it is pretty slow and experimental. Use the original Group API for practical purposes.
4491
+
4492
+ (Since 0.2.3, experimental feature => This was upgraded to SubFrames feature)
4493
+
4494
+ ```{ruby}
4495
+ enum = penguins.group(:island).each
4496
+ ```
4497
+
4498
+ ```{ruby}
4499
+ enum.to_a
4500
+ ```
4501
+
4502
+ ```{ruby}
4503
+ array = enum.map do |df|
4504
+ DataFrame.new(island: [df.island[0]]).assign do
4505
+ df.variables.each_with_object({}) do |(key, vec), hash|
4506
+ next unless vec.numeric?
4507
+ hash["mean(#{key})"] = [vec.mean]
4508
+ end
4509
+ end
4510
+ end
4511
+ ```
4512
+
4513
+ ```{ruby}
4514
+ array.reduce { |a, df| a.concat df }
4515
+ ```
4516
+
4517
+ ## 126. ArrowFunction helpers
4518
+
4519
+ The `ArrowFunction` module adds two helper methods.
4520
+
4521
+ `ArrowFunction.find(function_name)` returns an Arrow Function object from the Arrow C++ Compute Functions.
4522
+
4523
+ ```{ruby}
4524
+ ArrowFunction.find(:mean)
4525
+ ```
4526
+
4527
+ To execute this function,
4528
+
4529
+ ```{ruby}
4530
+ ArrowFunction.find(:mean).execute([[1, 2, 3, 4]]).value.value
4531
+ ```
4532
+
4533
+ `ArrowFunction.arrow_doc(function_name)` returns a document of Arrow C++ Compute Function in a string.
4534
+
4535
+ ```{ruby}
4536
+ puts ArrowFunction.arrow_doc(:mean)
4537
+ ```
4538
+
4539
+ ## 127. DataFrame.auto_cast
4540
+
4541
+ A data set for planetary data in https://nssdc.gsfc.nasa.gov/planetary/factsheet/ is used here. Let's manually copy the data in the html table and get the tab separated text values.
4542
+
4543
+ ```{ruby}
4544
+ tsv = ' MERCURY VENUS EARTH MOON MARS JUPITER SATURN URANUS NEPTUNE PLUTO
4545
+ Mass (1024kg) 0.330 4.87 5.97 0.073 0.642 1898 568 86.8 102 0.0130
4546
+ Diameter (km) 4879 12,104 12,756 3475 6792 142,984 120,536 51,118 49,528 2376
4547
+ Density (kg/m3) 5429 5243 5514 3340 3934 1326 687 1270 1638 1850
4548
+ Gravity (m/s2) 3.7 8.9 9.8 1.6 3.7 23.1 9.0 8.7 11.0 0.7
4549
+ Escape Velocity (km/s) 4.3 10.4 11.2 2.4 5.0 59.5 35.5 21.3 23.5 1.3
4550
+ Rotation Period (hours) 1407.6 -5832.5 23.9 655.7 24.6 9.9 10.7 -17.2 16.1 -153.3
4551
+ Length of Day (hours) 4222.6 2802.0 24.0 708.7 24.7 9.9 10.7 17.2 16.1 153.3
4552
+ Distance from Sun (106 km) 57.9 108.2 149.6 0.384* 228.0 778.5 1432.0 2867.0 4515.0 5906.4
4553
+ Perihelion (106 km) 46.0 107.5 147.1 0.363* 206.7 740.6 1357.6 2732.7 4471.1 4436.8
4554
+ Aphelion (106 km) 69.8 108.9 152.1 0.406* 249.3 816.4 1506.5 3001.4 4558.9 7375.9
4555
+ Orbital Period (days) 88.0 224.7 365.2 27.3* 687.0 4331 10,747 30,589 59,800 90,560
4556
+ Orbital Velocity (km/s) 47.4 35.0 29.8 1.0* 24.1 13.1 9.7 6.8 5.4 4.7
4557
+ Orbital Inclination (degrees) 7.0 3.4 0.0 5.1 1.8 1.3 2.5 0.8 1.8 17.2
4558
+ Orbital Eccentricity 0.206 0.007 0.017 0.055 0.094 0.049 0.052 0.047 0.010 0.244
4559
+ Obliquity to Orbit (degrees) 0.034 177.4 23.4 6.7 25.2 3.1 26.7 97.8 28.3 122.5
4560
+ Mean Temperature (C) 167 464 15 -20 -65 -110 -140 -195 -200 -225
4561
+ Surface Pressure (bars) 0 92 1 0 0.01 Unknown* Unknown* Unknown* Unknown* 0.00001
4562
+ Number of Moons 0 0 1 0 2 79 82 27 14 5
4563
+ Ring System? No No No No No Yes Yes Yes Yes No
4564
+ Global Magnetic Field? Yes No Yes No No Yes Yes Yes Yes Unknown
4565
+ '
4566
+
4567
+ raw_dataframe = DataFrame.load(Arrow::Buffer.new(tsv), format: :tsv)
4568
+
4569
+ ENV['RED_AMBER_OUTPUT_MODE'] = 'plain'
4570
+ raw_dataframe
4571
+ ```
4572
+
4573
+ This dataframe has row-oriented values, so we must transpose the dataframe.
4574
+
4575
+ ```{ruby}
4576
+ transposed = raw_dataframe.transpose
4577
+ ```
4578
+
4579
+ This dataframe has string columns. We could cast each numeric column individually, but the recommended way is to use `#auto_cast`. `#auto_cast` saves the dataframe to a temporary tsv file and re-opens it to get a casted dataframe.
4580
+
4581
+ ```{ruby}
4582
+ transposed.auto_cast
4583
+ ```
4584
+
4585
+ There is still some dirty data to be cleaned in this dataframe, but we don't touch it here. If you are interested, give it a try!
4586
+
4587
+ - Rename a column 'NAME' to 'Planet_name'.
4588
+ - Remove preceding/trailing spaces in 'Planet_name' values.
4589
+ - Capitalize 'Planet_name' values.
4590
+ - Remove data for 'Moon' and 'Pluto' to create the Table for planets.
4591
+ - Convert 'Unknown*' to nil.
4592
+ - Change 'Yes' / 'No' values to true / false (change column type to boolean).
4593
+ - Remove commas in numeric values. They prevent the columns from being numeric.
4594
+ - Correct cell values which have '*'. They prevent the columns from being numeric.
4595
+ - Add the missing '^' to units in labels.
4596
+