red_amber 0.1.2 → 0.1.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +21 -10
  3. data/CHANGELOG.md +162 -6
  4. data/Gemfile +3 -0
  5. data/README.md +89 -303
  6. data/benchmark/csv_load_penguins.yml +15 -0
  7. data/benchmark/drop_nil.yml +11 -0
  8. data/doc/DataFrame.md +840 -0
  9. data/doc/Vector.md +317 -0
  10. data/doc/image/arrow_table_new.png +0 -0
  11. data/doc/image/dataframe/assign.png +0 -0
  12. data/doc/image/dataframe/drop.png +0 -0
  13. data/doc/image/dataframe/pick.png +0 -0
  14. data/doc/image/dataframe/remove.png +0 -0
  15. data/doc/image/dataframe/rename.png +0 -0
  16. data/doc/image/dataframe/slice.png +0 -0
  17. data/doc/image/dataframe_model.png +0 -0
  18. data/doc/image/example_in_red_arrow.png +0 -0
  19. data/doc/image/tdr.png +0 -0
  20. data/doc/image/tdr_and_table.png +0 -0
  21. data/doc/image/tidy_data_in_TDR.png +0 -0
  22. data/doc/image/vector/binary_element_wise.png +0 -0
  23. data/doc/image/vector/unary_aggregation.png +0 -0
  24. data/doc/image/vector/unary_aggregation_w_option.png +0 -0
  25. data/doc/image/vector/unary_element_wise.png +0 -0
  26. data/doc/tdr.md +56 -0
  27. data/doc/tdr_ja.md +56 -0
  28. data/lib/red_amber/data_frame.rb +68 -35
  29. data/lib/red_amber/data_frame_displayable.rb +132 -0
  30. data/lib/red_amber/data_frame_helper.rb +64 -0
  31. data/lib/red_amber/data_frame_indexable.rb +38 -0
  32. data/lib/red_amber/data_frame_observation_operation.rb +83 -0
  33. data/lib/red_amber/data_frame_selectable.rb +34 -43
  34. data/lib/red_amber/data_frame_variable_operation.rb +133 -0
  35. data/lib/red_amber/vector.rb +58 -6
  36. data/lib/red_amber/vector_compensable.rb +68 -0
  37. data/lib/red_amber/vector_functions.rb +147 -68
  38. data/lib/red_amber/version.rb +1 -1
  39. data/lib/red_amber.rb +9 -1
  40. data/red_amber.gemspec +3 -6
  41. metadata +36 -9
  42. data/lib/red_amber/data_frame_output.rb +0 -116
data/doc/DataFrame.md ADDED
@@ -0,0 +1,840 @@
1
+ # DataFrame
2
+
3
+ Class `RedAmber::DataFrame` represents 2D data. A `DataFrame` consists of:
4
+ - A collection of data with the same data type. We call it a `Vector`.
5
+ - A label attached to each `Vector`. We call it a `key`.
6
+ - A `Vector` and its associated `key`, grouped as a `variable`.
7
+ - `variable`s with the same vector length, aligned and arranged to form a `DataFrame`.
8
+ - A set of related data at the same position in each `Vector` of a `DataFrame`. We call it an `observation`.
9
+
10
+ ![dataframe model image](doc/../image/dataframe_model.png)
11
+
12
+ ## Constructors and saving
13
+
14
+ ### `new` from a Hash
15
+
16
+ ```ruby
17
+ RedAmber::DataFrame.new(x: [1, 2, 3])
18
+ ```
19
+
20
+ ### `new` from a schema (by Hash) and data (by Array)
21
+
22
+ ```ruby
23
+ RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
24
+ ```
25
+
26
+ ### `new` from an Arrow::Table
27
+
28
+
29
+ ```ruby
30
+ table = Arrow::Table.new(x: [1, 2, 3])
31
+ RedAmber::DataFrame.new(table)
32
+ ```
33
+
34
+ ### `new` from a Rover::DataFrame
35
+
36
+
37
+ ```ruby
38
+ rover = Rover::DataFrame.new(x: [1, 2, 3])
39
+ RedAmber::DataFrame.new(rover)
40
+ ```
41
+
42
+ ### `load` (class method)
43
+
44
+ - from a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
45
+
46
+ ```ruby
47
+ RedAmber::DataFrame.load("test/entity/with_header.csv")
48
+ ```
49
+
50
+ - from a string buffer
51
+
52
+ - from a URI
53
+
54
+ ```ruby
55
+ uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
56
+ RedAmber::DataFrame.load(uri)
57
+ ```
58
+
59
+ - from a Parquet file
60
+
61
+ ```ruby
62
+ dataframe = RedAmber::DataFrame.load("file.parquet")
63
+ ```
64
+
65
+ ### `save` (instance method)
66
+
67
+ - to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
68
+
69
+ - to a string buffer
70
+
71
+ - to a URI
72
+
73
+ - to a Parquet file
74
+
75
+ ```ruby
76
+ dataframe.save("file.parquet")
77
+ ```
78
+
79
+ ## Properties
80
+
81
+ ### `table`, `to_arrow`
82
+
83
+ - Reader for the `Arrow::Table` object inside.
84
+
85
+ ### `size`, `n_obs`, `n_rows`
86
+
87
+ - Returns the size of each Vector (number of observations).
88
+
89
+ ### `n_keys`, `n_vars`, `n_cols`
90
+
91
+ - Returns the number of keys (number of variables).
92
+
93
+ ### `shape`
94
+
95
+ - Returns the shape as an Array `[n_rows, n_cols]`.
96
+
97
+ ### `variables`
98
+
99
+ - Returns key names and Vectors pair in a Hash.
100
+
101
+ This is convenient in a block when both the key and the vector are required. We can write:
102
+
103
+ ```ruby
104
+ # update numeric variables
105
+ df.assign do
106
+   variables.select.with_object({}) do |(key, vector), assigner|
107
+     assigner[key] = vector * -1 if vector.numeric?
108
+   end
109
+ end
110
+ ```
111
+
112
+ Instead of:
113
+ ```ruby
114
+ df.assign do
115
+   assigner = {}
116
+   vectors.each_with_index do |vector, i|
117
+     assigner[keys[i]] = vector * -1 if vector.numeric?
118
+   end
119
+   assigner
120
+ end
121
+ ```
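The shorter form above relies on `Hash#select` returning an Enumerator when called without a block, and on `with_object` returning its accumulator once iteration finishes. A minimal plain-Ruby sketch of that idiom follows. It uses an ordinary Hash as a stand-in for `variables`; the Hash contents and the `all?(Numeric)` check are illustrative assumptions, not RedAmber API.

```ruby
# Plain Hash standing in for `variables` (key => values); hypothetical data.
variables = { index: [0, 1, 2], float: [0.0, 1.1, 2.2], string: %w[A B C] }

# Hash#select without a block returns an Enumerator. with_object threads the
# accumulator through the block and returns it when iteration finishes.
assigner = variables.select.with_object({}) do |(key, values), acc|
  # Stand-in for `vector.numeric?`: negate only all-numeric columns.
  acc[key] = values.map { |v| v * -1 } if values.all?(Numeric)
end

p assigner
```

Only the numeric entries end up in `assigner`, which is the shape of Hash that an `assign` block is expected to return.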
122
+
123
+ ### `keys`, `var_names`, `column_names`
124
+
125
+ - Returns key names in an Array.
126
+
127
+ When used with vectors, `Vector#key` is useful for getting the key of a vector inside a DataFrame.
128
+
129
+ ```ruby
130
+ # update numeric variables, another solution
131
+ df.assign do
132
+   vectors.each_with_object({}) do |vector, assigner|
133
+     assigner[vector.key] = vector * -1 if vector.numeric?
134
+   end
135
+ end
136
+ ```
137
+
138
+ ### `types`
139
+
140
+ - Returns types of vectors in an Array of Symbols.
141
+
142
+ ### `type_classes`
143
+
144
+ - Returns types of vectors in an Array of `Arrow::DataType`.
145
+
146
+ ### `vectors`
147
+
148
+ - Returns an Array of Vectors.
149
+
150
+ ### `indexes`, `indices`
151
+
152
+ - Returns all indexes in a Range.
153
+
154
+ ### `to_h`
155
+
156
+ - Returns column-oriented data in a Hash.
157
+
158
+ ### `to_a`, `raw_records`
159
+
160
+ - Returns an array of row-oriented data without header.
161
+
162
+ If you need a column-oriented full array, use `.to_h.to_a`.
163
+
164
+ ### `schema`
165
+
166
+ - Returns column name and data type in a Hash.
167
+
168
+ ### `==`
169
+
170
+ ### `empty?`
171
+
172
+ ## Output
173
+
174
+ ### `to_s`
175
+
176
+ ### `summary`, `describe` (not implemented)
177
+
178
+ ### `to_rover`
179
+
180
+ - Returns a `Rover::DataFrame`.
181
+
182
+ ### `tdr(limit = 10, tally: 5, elements: 5)`
183
+
184
+ - Shows some information about self in a transposed style.
185
+ - `tdr_str` returns the same information as a String.
186
+
187
+ ```ruby
188
+ require 'red_amber'
189
+ require 'datasets-arrow'
190
+
191
+ penguins = RedAmber::DataFrame.new(Datasets::Penguins.new.to_arrow)
192
+ penguins.tdr
193
+ # =>
194
+ RedAmber::DataFrame : 344 x 8 Vectors
195
+ Vectors : 5 numeric, 3 strings
196
+ # key type level data_preview
197
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
198
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
199
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
200
+ 4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
201
+ 5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
202
+ 6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
203
+ 7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
204
+ 8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
205
+ ```
206
+
207
+ - limit: maximum number of variables to show. The default is 10.
208
+ - tally: maximum level for which tally mode is used.
209
+ - elements: maximum number of elements to show in each data preview.
210
+
211
+ ### `inspect`
212
+
213
+ - Returns the information of self as `tdr(3)`, and also shows object id.
214
+
215
+ ```ruby
216
+ puts penguins.inspect
217
+ # =>
218
+ #<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
219
+ Vectors : 5 numeric, 3 strings
220
+ # key type level data_preview
221
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
222
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
223
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
224
+ ... 5 more Vectors ...
225
+ ```
226
+
227
+ ## Selecting
228
+
229
+ ### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
230
+ - Key in a Symbol: `df[:symbol]`
231
+ - Key in a String: `df["string"]`
232
+ - Keys in an Array: `df[:symbol1, "string", :symbol2]`
233
+ - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
234
+
235
+ Key indices are used via `keys[i]` because bare numbers select observations (rows).
236
+
237
+ - Keys by a Range:
238
+
239
+ If the keys can be represented by a Range, the Range can be included in the arguments. See the example below.
240
+
241
+ - You can exchange the order of variables (columns).
242
+
243
+ ```ruby
244
+ hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
245
+ df = RedAmber::DataFrame.new(hash)
246
+ df[:b..:c, "a"]
247
+ # =>
248
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
249
+ Vectors : 2 numeric, 1 string
250
+ # key type level data_preview
251
+ 1 :b string 3 ["A", "B", "C"]
252
+ 2 :c double 3 [1.0, 2.0, 3.0]
253
+ 3 :a uint8 3 [1, 2, 3]
254
+ ```
255
+
256
+ If `#[]` selects a single variable (column), it returns a Vector object.
257
+
258
+ ```ruby
259
+ df[:a]
260
+ # =>
261
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
262
+ [1, 2, 3]
263
+ ```
264
+ The `#v` method also returns a Vector for a key.
265
+
266
+ ```ruby
267
+ df.v(:a)
268
+ # =>
269
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
270
+ [1, 2, 3]
271
+ ```
272
+
273
+ This is useful in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`.
274
+
275
+ ### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
276
+
277
+ - Select an obs. by index: `df[0]`
278
+ - Select obs. by indices in a Range: `df[1..2]`
279
+
280
+ An endless or a beginless Range can be used to represent indices.
281
+
282
+ - Select obs. by indices in an Array: `df[1, 2]`
283
+ - Mixed case: `df[2, 0..]`
284
+
285
+ ```ruby
286
+ hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
287
+ df = RedAmber::DataFrame.new(hash)
288
+ df[2, 0..].tdr(tally: 0)
289
+ # =>
290
+ RedAmber::DataFrame : 4 x 3 Vectors
291
+ Vectors : 2 numeric, 1 string
292
+ # key type level data_preview
293
+ 1 :a uint8 3 [3, 1, 2, 3]
294
+ 2 :b string 3 ["C", "A", "B", "C"]
295
+ 3 :c double 3 [3.0, 1.0, 2.0, 3.0]
296
+ ```
297
+
298
+ - Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
299
+
300
+ It returns a sub DataFrame with the observations where the boolean is true.
301
+
302
+ ```ruby
303
+ # with the same dataframe `df` above
304
+ df[true, false, nil] # or
305
+ df[[true, false, nil]] # or
306
+ df[RedAmber::Vector.new([true, false, nil])]
307
+ # =>
308
+ #<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
309
+ Vectors : 2 numeric, 1 string
310
+ # key type level data_preview
311
+ 1 :a uint8 1 [1]
312
+ 2 :b string 1 ["A"]
313
+ 3 :c double 1 [1.0]
314
+ ```
315
+
316
+ ### Select rows from top or from bottom
317
+
318
+ `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
319
+
320
+ ## Sub DataFrame manipulations
321
+
322
+ ### `pick ` - pick up variables by key label -
323
+
324
+ Pick up some variables (columns) to create a sub DataFrame.
325
+
326
+ ![pick method image](doc/../image/dataframe/pick.png)
327
+
328
+ - Keys as arguments
329
+
330
+ `pick(keys)` accepts keys as arguments in an Array.
331
+
332
+ ```ruby
333
+ penguins.pick(:species, :bill_length_mm)
334
+ # =>
335
+ #<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
336
+ Vectors : 1 numeric, 1 string
337
+ # key type level data_preview
338
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
339
+ 2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
340
+ ```
341
+
342
+ - Booleans as an argument
343
+
344
+ `pick(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
345
+
346
+ ```ruby
347
+ penguins.pick(penguins.types.map { |type| type == :string })
348
+ # =>
349
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
350
+ Vectors : 3 strings
351
+ # key type level data_preview
352
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
353
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
354
+ 3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
355
+ ```
356
+
357
+ - Keys or booleans by a block
358
+
359
+ `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
360
+
361
+ ```ruby
362
+ # It is ok to write `keys ...` in the block, not `penguins.keys ...`
363
+ penguins.pick { keys.map { |key| key.end_with?('mm') } }
364
+ # =>
365
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
366
+ Vectors : 3 numeric
367
+ # key type level data_preview
368
+ 1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
369
+ 2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
370
+ 3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
371
+ ```
372
+
373
+ ### `drop ` - pick and drop -
374
+
375
+ Drop some variables (columns) to create a DataFrame of the remaining variables.
376
+
377
+ ![drop method image](doc/../image/dataframe/drop.png)
378
+
379
+ - Keys as arguments
380
+
381
+ `drop(keys)` accepts keys as arguments in an Array.
382
+
383
+ - Booleans as an argument
384
+
385
+ `drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
386
+
387
+ - Keys or booleans by a block
388
+
389
+ `drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
390
+
391
+ - Notice for nil
392
+
393
+ When used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
394
+
395
+ ```ruby
396
+ booleans = [true, false, nil]
397
+ booleans_invert = booleans.map(&:!) # => [false, true, true]
398
+ df.pick(booleans) == df.drop(booleans_invert) # => true
399
+ ```
400
+ - Difference between `pick`/`drop` and `[]`
401
+
402
+ When `pick` or `drop` selects a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`. This behavior is useful in a block of DataFrame manipulations.
403
+
404
+ ```ruby
405
+ df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
406
+ df.pick(:a) # or
407
+ df.drop(:b, :c)
408
+ # =>
409
+ #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
410
+ Vector : 1 numeric
411
+ # key type level data_preview
412
+ 1 :a uint8 3 [1, 2, 3]
413
+
414
+ df[:a]
415
+ # =>
416
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
417
+ [1, 2, 3]
418
+ ```
419
+
420
+ ### `slice ` - to cut vertically is slice -
421
+
422
+ Slice and select observations (rows) to create a sub DataFrame.
423
+
424
+ ![slice method image](doc/../image/dataframe/slice.png)
425
+
426
+ - Indices as arguments
427
+
428
+ `slice(indices)` accepts indices as arguments. Each index should be an Integer or a Range of Integers.
429
+
430
+ ```ruby
431
+ # returns 5 obs. at start and 5 obs. from end
432
+ penguins.slice(0...5, -5..-1)
433
+ # =>
434
+ #<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
435
+ Vectors : 5 numeric, 3 strings
436
+ # key type level data_preview
437
+ 1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
438
+ 2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
439
+ 3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
440
+ ... 5 more Vectors ...
441
+ ```
442
+
443
+ - Booleans as an argument
444
+
445
+ `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
446
+
447
+ ```ruby
448
+ vector = penguins[:bill_length_mm]
449
+ penguins.slice(vector >= 40)
450
+ # =>
451
+ #<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
452
+ Vectors : 5 numeric, 3 strings
453
+ # key type level data_preview
454
+ 1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
455
+ 2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
456
+ 3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
457
+ ... 5 more Vectors ...
458
+ ```
459
+
460
+ - Indices or booleans by a block
461
+
462
+ `slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
463
+
464
+ ```ruby
465
+ # returns a DataFrame whose bill_length_mm lies between mean - std and mean + std
466
+ penguins.slice do
467
+   vector = self[:bill_length_mm]
468
+   min = vector.mean - vector.std
469
+   max = vector.mean + vector.std
470
+   vector.to_a.map { |e| (min..max).include? e }
471
+ end
472
+ # =>
473
+ #<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
474
+ Vectors : 5 numeric, 3 strings
475
+ # key type level data_preview
476
+ 1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
477
+ 2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
478
+ 3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
479
+ ... 5 more Vectors ...
480
+ ```
481
+
482
+ - Notice: nil option
483
+ - `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This propagates nil at the same row.
484
+
485
+ ```ruby
486
+ hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
487
+ table = Arrow::Table.new(hash)
488
+ table.slice([true, false, nil])
489
+ # =>
490
+ #<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
491
+ a b c
492
+ 0 1 A 1.000000
493
+ 1 (null) (null) (null)
494
+ ```
495
+
496
+ - Whereas in RedAmber, nil in the booleans passed to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.
497
+
498
+ ```ruby
499
+ RedAmber::DataFrame.new(table).slice([true, false, nil]).table
500
+ # =>
501
+ #<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
502
+ a b c
503
+ 0 1 A 1.000000
504
+ ```
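The two null-selection behaviors described above can be imitated in plain Ruby. The sketch below is only a model under stated assumptions: rows are ordinary Hashes, and `filter_rows` is a hypothetical helper written for this illustration, not part of Red Arrow or RedAmber.

```ruby
# Rows modelled as plain Hashes (hypothetical data mirroring the example above).
rows = [
  { a: 1, b: 'A', c: 1.0 },
  { a: 2, b: 'B', c: 2.0 },
  { a: 3, b: 'C', c: 3.0 }
]

# Hypothetical helper: keep rows whose mask entry is true. A nil mask entry
# either emits an all-nil row (:emit_null) or is dropped like false (:drop).
def filter_rows(rows, mask, null_selection_behavior: :drop)
  rows.zip(mask).each_with_object([]) do |(row, flag), kept|
    if flag
      kept << row
    elsif flag.nil? && null_selection_behavior == :emit_null
      kept << row.transform_values { nil }
    end
  end
end

mask = [true, false, nil]
emitted = filter_rows(rows, mask, null_selection_behavior: :emit_null)
dropped = filter_rows(rows, mask)

p emitted # two rows: the first row, then an all-nil row
p dropped # one row: only the first row
```

With `:emit_null` the nil mask entry yields a row of nils, which corresponds to the `(null)` row in the `Arrow::Table#slice` output; with `:drop` only the row marked true remains.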
505
+
506
+ ### `remove`
507
+
508
+ Slice and reject observations (rows) to create a DataFrame of the remaining observations.
509
+
510
+ ![remove method image](doc/../image/dataframe/remove.png)
511
+
512
+ - Indices as arguments
513
+
514
+ `remove(indices)` accepts indices as arguments. Each index should be an Integer or a Range of Integers.
515
+
516
+ ```ruby
517
+ # returns 6th to 339th obs.
518
+ penguins.remove(0...5, -5..-1)
519
+ # =>
520
+ #<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
521
+ Vectors : 5 numeric, 3 strings
522
+ # key type level data_preview
523
+ 1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
524
+ 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
525
+ 3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
526
+ ... 5 more Vectors ...
527
+ ```
528
+
529
+ - Booleans as an argument
530
+
531
+ `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
532
+
533
+ ```ruby
534
+ # remove all observations containing nil
535
+ removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
536
+ removed.tdr
537
+ # =>
538
+ RedAmber::DataFrame : 333 x 8 Vectors
539
+ Vectors : 5 numeric, 3 strings
540
+ # key type level data_preview
541
+ 1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
542
+ 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
543
+ 3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
544
+ 4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
545
+ 5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
546
+ 6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
547
+ 7 :sex string 2 {"male"=>168, "female"=>165}
548
+ 8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
549
+ ```
550
+
551
+ - Indices or booleans by a block
552
+
553
+ `remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
554
+
555
+ ```ruby
556
+ penguins.remove do
557
+   vector = self[:bill_length_mm]
558
+   min = vector.mean - vector.std
559
+   max = vector.mean + vector.std
560
+   vector.to_a.map { |e| (min..max).include? e }
561
+ end
562
+ # =>
563
+ #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
564
+ Vectors : 5 numeric, 3 strings
565
+ # key type level data_preview
566
+ 1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
567
+ 2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
568
+ 3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
569
+ ... 5 more Vectors ...
570
+ ```
571
+ - Notice for nil
572
+ - When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
573
+
574
+ ```ruby
575
+ df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
576
+ booleans = df[:a] < 2
577
+ # =>
578
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
579
+ [true, false, nil]
580
+
581
+ booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
582
+ df.slice(booleans) == df.remove(booleans_invert) # => true
583
+ ```
584
+ - In contrast, `Vector#invert` returns nil for nil elements. This gives a different result.
585
+
586
+ ```ruby
587
+ booleans.invert
588
+ # =>
589
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
590
+ [false, true, nil]
591
+
592
+ df.remove(booleans.invert)
593
+ #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
594
+ Vectors : 2 numeric, 1 string
595
+ # key type level data_preview
596
+ 1 :a uint8 2 [1, nil], 1 nil
597
+ 2 :b string 2 ["A", "C"]
598
+ 3 :c double 2 [1.0, 3.0]
599
+ ```
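The difference between Ruby's `!` and a nil-preserving inversion can be reproduced with plain Arrays. This is a sketch under assumptions: `keep` and `reject` are hypothetical lambdas that treat nil as false, imitating how `slice` and `remove` are described above, and `nil_preserving_not` only mimics the documented result of `Vector#invert`.

```ruby
booleans = [true, false, nil]

# Ruby's `!` maps nil to true, so the nil position becomes selected.
ruby_not = booleans.map(&:!)

# A nil-preserving inversion keeps nil as nil (as described for Vector#invert).
nil_preserving_not = booleans.map { |b| b.nil? ? nil : !b }

# Hypothetical stand-ins for slice/remove over row labels; nil counts as false.
rows = %w[A B C]
keep   = ->(mask) { rows.zip(mask).select { |_, flag| flag }.map(&:first) }
reject = ->(mask) { rows.zip(mask).reject { |_, flag| flag }.map(&:first) }

p keep.call(booleans)             # rows kept by the original mask
p reject.call(ruby_not)           # the same rows
p reject.call(nil_preserving_not) # additionally keeps the row whose mask is nil
```

Rejecting with the `!`-inverted mask reproduces the kept rows, while rejecting with the nil-preserving mask also retains the row at the nil position, matching the two-row result shown above.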
600
+
601
+ ### `rename`
602
+
603
+ Rename keys (column names) to create an updated DataFrame.
604
+
605
+ ![rename method image](doc/../image/dataframe/rename.png)
606
+
607
+ - Key pairs as arguments
608
+
609
+ `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
610
+
611
+ ```ruby
612
+ h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
613
+ df = RedAmber::DataFrame.new(h)
614
+ df.rename(:age => :age_in_1993)
615
+ # =>
616
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
617
+ Vectors : 1 numeric, 1 string
618
+ # key type level data_preview
619
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
620
+ 2 :age_in_1993 uint8 3 [68, 49, 28]
621
+ ```
622
+
623
+ - Key pairs by a block
624
+
625
+ `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. The block is called in the context of self.
626
+
627
+ - Key type
628
+
629
+ Symbol key and String key are distinguished.
630
+
631
+ ### `assign`
632
+
633
+ Assign new or updated variables (columns) and create an updated DataFrame.
634
+
635
+ - Variables with new keys are appended as new variables at the bottom (right in the table).
636
+ - Variables with existing keys update the corresponding vectors.
637
+
638
+ ![assign method image](doc/../image/dataframe/assign.png)
639
+
640
+ - Variables as arguments
641
+
642
+ `assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
643
+
644
+ ```ruby
645
+ df = RedAmber::DataFrame.new(
646
+   'name' => %w[Yasuko Rui Hinata],
647
+   'age' => [68, 49, 28])
648
+ # =>
649
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
650
+ Vectors : 1 numeric, 1 string
651
+ # key type level data_preview
652
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
653
+ 2 :age uint8 3 [68, 49, 28]
654
+
655
+ # update :age and add :brother
656
+ assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
657
+ df.assign(assigner)
658
+ # =>
659
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
660
+ Vectors : 1 numeric, 2 strings
661
+ # key type level data_preview
662
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
663
+ 2 :age uint8 3 [97, 78, 57]
664
+ 3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
665
+ ```
666
+
667
+ - Key pairs by a block
668
+
669
+ `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. The block is called in the context of self.
670
+
671
+ ```ruby
672
+ df = RedAmber::DataFrame.new(
673
+   index: [0, 1, 2, 3, nil],
674
+   float: [0.0, 1.1, 2.2, Float::NAN, nil],
675
+   string: ['A', 'B', 'C', 'D', nil])
676
+ # =>
677
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
678
+ Vectors : 2 numeric, 1 string
679
+ # key type level data_preview
680
+ 1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
681
+ 2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
682
+ 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
683
+
684
+ # update numeric variables
685
+ df.assign do
686
+   assigner = {}
687
+   vectors.each_with_index do |v, i|
688
+     assigner[keys[i]] = v * -1 if v.numeric?
689
+   end
690
+   assigner
691
+ end
692
+ # =>
693
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
694
+ Vectors : 2 numeric, 1 string
695
+ # key type level data_preview
696
+ 1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
697
+ 2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
698
+ 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
699
+
700
+ # Or, more concisely:
701
+ df.assign do
702
+   variables.select.with_object({}) do |(key, vector), assigner|
703
+     assigner[key] = vector * -1 if vector.numeric?
704
+   end
705
+ end
706
+ # => same as above
707
+ ```
708
+
709
+ - Key type
710
+
711
+ Symbol key and String key are considered as the same key.
712
+
713
+ ## Updating
714
+
715
+ ### `sort`
716
+
717
+ `sort` accepts sort_keys as parameters, thanks to the Red Arrow feature.
718
+ - :key, "key" or "+key" denotes ascending order
719
+ - "-key" denotes descending order
720
+
721
+ ```ruby
722
+ df = RedAmber::DataFrame.new({
723
+   index: [1, 1, 0, nil, 0],
724
+   string: ['C', 'B', nil, 'A', 'B'],
725
+   bool: [nil, true, false, true, false],
726
+ })
727
+ df.sort(:index, '-bool').tdr(tally: 0)
728
+ # =>
729
+ RedAmber::DataFrame : 5 x 3 Vectors
730
+ Vectors : 1 numeric, 1 string, 1 boolean
731
+ # key type level data_preview
732
+ 1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
733
+ 2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
734
+ 3 :bool boolean 3 [false, false, true, nil, true], 1 nil
735
+ ```
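The sort-key convention above (`:key`, `"key"` or `"+key"` for ascending, `"-key"` for descending) can be sketched in plain Ruby. Everything here is an assumption for illustration: `parse_sort_key`, `compare_values` and `sort_rows` are hypothetical helpers, rows are plain Hashes, and placing nil last in both orders is inferred from the example output above rather than taken from the Red Arrow implementation.

```ruby
# Hypothetical helper: turn :key, "key", "+key" or "-key" into [key, order].
def parse_sort_key(sort_key)
  name = sort_key.to_s
  case name[0]
  when '-' then [name[1..].to_sym, :descending]
  when '+' then [name[1..].to_sym, :ascending]
  else [name.to_sym, :ascending]
  end
end

# Compare two values, placing nil last regardless of order (an assumption).
def compare_values(left, right, order)
  return 0 if left.nil? && right.nil?
  return 1 if left.nil?
  return -1 if right.nil?

  order == :descending ? right <=> left : left <=> right
end

# Sort rows (plain Hashes) by several sort keys, earlier keys taking priority.
def sort_rows(rows, *sort_keys)
  specs = sort_keys.map { |sort_key| parse_sort_key(sort_key) }
  rows.sort do |x, y|
    result = 0
    specs.each do |key, order|
      result = compare_values(x[key], y[key], order)
      break unless result.zero?
    end
    result
  end
end

rows = [
  { index: 1, string: 'C' }, { index: 1, string: 'B' },
  { index: 0, string: nil }, { index: nil, string: 'A' },
  { index: 0, string: 'B' }
]
sorted = sort_rows(rows, :index, '-string')
p sorted.map { |row| row[:index] }
p sorted.map { |row| row[:string] }
```

Rows are ordered by `:index` ascending first; ties are then broken by `:string` descending, with nil values kept at the end of each run.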
736
+
737
+ - [ ] Clamp
738
+
739
+ - [ ] Clear data
740
+
741
+ ## Treat na data
742
+
743
+ ### `remove_nil`
744
+
745
+ Remove any observations containing nil.
746
+
747
+ ## Grouping
748
+
749
+ ### `group(aggregating_keys, function, target_keys)`
750
+
751
+ Creates a grouped DataFrame by `aggregating_keys`, applies `function` to the `target_keys` of each group, and returns the result. Aggregated key names are in `function(key)` style.
752
+
753
+ (The current implementation is not intuitive. Needs improvement.)
754
+
755
+ ```ruby
756
+ ds = Datasets::Rdatasets.new('dplyr', 'starwars')
757
+ starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
758
+ starwars.tdr(11)
759
+ # =>
760
+ RedAmber::DataFrame : 87 x 11 Vectors
761
+ Vectors : 3 numeric, 8 strings
762
+ # key type level data_preview
763
+ 1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
764
+ 2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
765
+ 3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
766
+ 4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
767
+ 5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
768
+ 6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
769
+ 7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
770
+ 8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
771
+ 9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
772
+ 10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
773
+ 11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
774
+
775
+ grouped = starwars.group(:species, :mean, [:mass, :height])
776
+ # =>
777
+ #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
778
+ Vectors : 2 numeric, 1 string
779
+ # key type level data_preview
780
+ 1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
781
+ 2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
782
+ 3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
783
+
784
+ count = starwars.group(:species, :count, :species)[:"count(species)"]
785
+ df = grouped.slice(count > 1)
786
+ # =>
787
+ #<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
788
+ Vectors : 2 numeric, 1 string
789
+ # key type level data_preview
790
+ 1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
791
+ 2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
792
+ 3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
793
+
794
+ df.table
795
+ # =>
796
+ #<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
797
+ mean(mass) mean(height) species
798
+ 0 82.781818 176.645161 Human
799
+ 1 69.750000 131.200000 Droid
800
+ 2 124.000000 231.000000 Wookiee
801
+ 3 74.000000 208.666667 Gungan
802
+ 4 80.000000 173.000000 Zabrak
803
+ 5 55.000000 179.000000 Twi'lek
804
+ 6 53.100000 168.000000 Mirialan
805
+ 7 88.000000 221.000000 Kaminoan
806
+ ```
807
+
808
+ Available functions are:
809
+
810
+ - [ ] all
811
+ - [ ] any
812
+ - [ ] approximate_median
813
+ - ✓ count
814
+ - [ ] count_distinct
815
+ - [ ] distinct
816
+ - ✓ max
817
+ - ✓ mean
818
+ - ✓ min
819
+ - [ ] min_max
820
+ - ✓ product
821
+ - ✓ stddev
822
+ - ✓ sum
823
+ - [ ] tdigest
824
+ - ✓ variance
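The grouping step and the `function(key)` naming can be modelled in plain Ruby. This is a sketch under assumptions: `group_mean` is a hypothetical helper over Arrays of Hashes, the sample rows are made up, and skipping nil values (returning nil for an all-nil group) is an assumption suggested by the nils in the example output, not a statement about the RedAmber implementation.

```ruby
# Hypothetical helper: group rows by one key and take the mean of each target
# key per group. Result keys follow the "mean(key)" naming style.
def group_mean(rows, group_key, target_keys)
  rows.group_by { |row| row[group_key] }.map do |group_value, members|
    result = {}
    target_keys.each do |key|
      values = members.map { |row| row[key] }.compact # ignore nil (assumption)
      mean = values.empty? ? nil : values.sum.to_f / values.size
      result[:"mean(#{key})"] = mean
    end
    result[group_key] = group_value
    result
  end
end

# Made-up sample rows, not taken from the starwars dataset.
rows = [
  { species: 'Human', mass: 80, height: 180 },
  { species: 'Human', mass: nil, height: 170 },
  { species: 'Droid', mass: 32, height: 96 }
]
grouped = group_mean(rows, :species, %i[mass height])
grouped.each { |row| p row }
```

Each result row carries the aggregated columns first and the grouping key last, the same column order as in the `group` output above.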
825
+
826
+ ## Combining DataFrames
827
+
828
+ - [ ] obs
829
+
830
+ - [ ] Add vars
831
+
832
+ - [ ] Inner join
833
+
834
+ - [ ] Left join
835
+
836
+ ## Encoding
837
+
838
+ - [ ] One-hot encoding
839
+
840
+ ## Iteration (not implemented)