red_amber 0.1.3 → 0.1.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +31 -7
  3. data/CHANGELOG.md +214 -10
  4. data/Gemfile +4 -0
  5. data/README.md +117 -342
  6. data/benchmark/csv_load_penguins.yml +15 -0
  7. data/benchmark/drop_nil.yml +11 -0
  8. data/doc/DataFrame.md +854 -0
  9. data/doc/Vector.md +449 -0
  10. data/doc/image/arrow_table_new.png +0 -0
  11. data/doc/image/dataframe/assign.png +0 -0
  12. data/doc/image/dataframe/drop.png +0 -0
  13. data/doc/image/dataframe/pick.png +0 -0
  14. data/doc/image/dataframe/remove.png +0 -0
  15. data/doc/image/dataframe/rename.png +0 -0
  16. data/doc/image/dataframe/slice.png +0 -0
  17. data/doc/image/dataframe_model.png +0 -0
  18. data/doc/image/example_in_red_arrow.png +0 -0
  19. data/doc/image/tdr.png +0 -0
  20. data/doc/image/tdr_and_table.png +0 -0
  21. data/doc/image/tidy_data_in_TDR.png +0 -0
  22. data/doc/image/vector/binary_element_wise.png +0 -0
  23. data/doc/image/vector/unary_aggregation.png +0 -0
  24. data/doc/image/vector/unary_aggregation_w_option.png +0 -0
  25. data/doc/image/vector/unary_element_wise.png +0 -0
  26. data/doc/tdr.md +56 -0
  27. data/doc/tdr_ja.md +56 -0
  28. data/lib/red-amber.rb +27 -0
  29. data/lib/red_amber/data_frame.rb +91 -37
  30. data/lib/red_amber/{data_frame_output.rb → data_frame_displayable.rb} +49 -41
  31. data/lib/red_amber/data_frame_indexable.rb +38 -0
  32. data/lib/red_amber/data_frame_observation_operation.rb +11 -0
  33. data/lib/red_amber/data_frame_selectable.rb +155 -48
  34. data/lib/red_amber/data_frame_variable_operation.rb +137 -0
  35. data/lib/red_amber/helper.rb +61 -0
  36. data/lib/red_amber/vector.rb +69 -16
  37. data/lib/red_amber/vector_functions.rb +80 -45
  38. data/lib/red_amber/vector_selectable.rb +124 -0
  39. data/lib/red_amber/vector_updatable.rb +104 -0
  40. data/lib/red_amber/version.rb +1 -1
  41. data/lib/red_amber.rb +1 -16
  42. data/red_amber.gemspec +3 -6
  43. metadata +38 -9
data/doc/DataFrame.md ADDED
@@ -0,0 +1,854 @@
1
+ # DataFrame
2
+
3
+ Class `RedAmber::DataFrame` represents 2D data. A `DataFrame` consists of:
4
+ - A collection of data of the same data type. We call it a `Vector`.
5
+ - A label attached to a `Vector`. We call it a `key`.
6
+ - A `Vector` and its associated `key` are grouped as a `variable`.
7
+ - `variable`s with the same vector length are aligned and arranged to form a `DataFrame`.
8
+ - Each `Vector` in a `DataFrame` holds related data at the same position. We call the set of values at one position an `observation`.
9
+
10
+ ![dataframe model image](doc/../image/dataframe_model.png)
11
+
12
+ (No change in this model in v0.1.6.)
13
+
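The model above can be sketched in plain Ruby. This is only an illustration under simplifying assumptions, not RedAmber's implementation: ordinary Arrays stand in for Vectors, and `observations_of` is a hypothetical helper name.

```ruby
# Plain-Ruby model of the DataFrame structure (illustration only).
# `variables` maps each key to its vector (here an ordinary Array).
def observations_of(variables)
  sizes = variables.values.map(&:size).uniq
  raise ArgumentError, 'vectors must have the same length' if sizes.size > 1

  # An observation is the set of values found at the same position
  # in every vector, i.e. one row of the table.
  variables.values.transpose
end

variables = { x: [1, 2, 3], y: %w[A B C] }
p observations_of(variables) # prints one inner Array per observation
```

Keeping the data column-oriented and deriving observations on demand mirrors the idea that a variable (key plus Vector) is the primary unit.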
14
+ ## Constructors and saving
15
+
16
+ ### `new` from a Hash
17
+
18
+ ```ruby
19
+ RedAmber::DataFrame.new(x: [1, 2, 3])
20
+ ```
21
+
22
+ ### `new` from a schema (by Hash) and data (by Array)
23
+
24
+ ```ruby
25
+ RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
26
+ ```
27
+
28
+ ### `new` from an Arrow::Table
29
+
30
+
31
+ ```ruby
32
+ table = Arrow::Table.new(x: [1, 2, 3])
33
+ RedAmber::DataFrame.new(table)
34
+ ```
35
+
36
+ ### `new` from a Rover::DataFrame
37
+
38
+
39
+ ```ruby
40
+ rover = Rover::DataFrame.new(x: [1, 2, 3])
41
+ RedAmber::DataFrame.new(rover)
42
+ ```
43
+
44
+ ### `load` (class method)
45
+
46
+ - from a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
47
+
48
+ ```ruby
49
+ RedAmber::DataFrame.load("test/entity/with_header.csv")
50
+ ```
51
+
52
+ - from a string buffer
53
+
54
+ - from a URI
55
+
56
+ ```ruby
57
+ uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
58
+ RedAmber::DataFrame.load(uri)
59
+ ```
60
+
61
+ - from a Parquet file
62
+
63
+ ```ruby
64
+ dataframe = RedAmber::DataFrame.load("file.parquet")
65
+ ```
66
+
67
+ ### `save` (instance method)
68
+
69
+ - to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
70
+
71
+ - to a string buffer
72
+
73
+ - to a URI
74
+
75
+ - to a Parquet file
76
+
77
+ ```ruby
78
+ dataframe.save("file.parquet")
79
+ ```
80
+
81
+ ## Properties
82
+
83
+ ### `table`, `to_arrow`
84
+
85
+ - Reader for the internal `Arrow::Table` object.
86
+
87
+ ### `size`, `n_obs`, `n_rows`
88
+
89
+ - Returns the size of a Vector (number of observations).
90
+
91
+ ### `n_keys`, `n_vars`, `n_cols`
92
+
93
+ - Returns the number of keys (number of variables).
94
+
95
+ ### `shape`
96
+
97
+ - Returns the shape as an Array `[n_rows, n_cols]`.
98
+
99
+ ### `variables`
100
+
101
+ - Returns pairs of key names and Vectors in a Hash.
102
+
103
+ It is convenient in a block when both the key and the vector are required. We can write:
104
+
105
+ ```ruby
106
+ # update numeric variables
107
+ df.assign do
108
+   variables.select.with_object({}) do |(key, vector), assigner|
109
+     assigner[key] = vector * -1 if vector.numeric?
110
+   end
111
+ end
112
+ ```
113
+
114
+ Instead of:
115
+ ```ruby
116
+ df.assign do
117
+   assigner = {}
118
+   vectors.each_with_index do |vector, i|
119
+     assigner[keys[i]] = vector * -1 if vector.numeric?
120
+   end
121
+   assigner
122
+ end
123
+ ```
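The traversal used above can be tried without the library. In this sketch a plain Hash of Arrays stands in for `variables`, and both method names are hypothetical. It suggests that `select.with_object({})` collects results the same way as the more common `each_with_object({})`.

```ruby
# Plain-Ruby illustration: a Hash of Arrays stands in for `variables`.
# Negate every numeric column and collect the results in a new Hash.
def negate_numeric(variables)
  variables.each_with_object({}) do |(key, vector), assigner|
    assigner[key] = vector.map { |e| -e } if vector.all?(Numeric)
  end
end

# The same traversal written in the style of the example above.
def negate_numeric_with_select(variables)
  variables.select.with_object({}) do |(key, vector), assigner|
    assigner[key] = vector.map { |e| -e } if vector.all?(Numeric)
  end
end

data = { index: [0, 1, 2], string: %w[A B C] }
p negate_numeric(data) == negate_numeric_with_select(data)
```

Only the numeric column appears in the returned Hash, which is the shape `assign` expects from its block.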
124
+
125
+ ### `keys`, `var_names`, `column_names`
126
+
127
+ - Returns key names in an Array.
128
+
129
+ When iterating over vectors, `Vector#key` is useful for getting the key of each vector within the DataFrame.
130
+
131
+ ```ruby
132
+ # update numeric variables, another solution
133
+ df.assign do
134
+   vectors.each_with_object({}) do |vector, assigner|
135
+     assigner[vector.key] = vector * -1 if vector.numeric?
136
+   end
137
+ end
138
+ ```
139
+
140
+ ### `types`
141
+
142
+ - Returns types of vectors in an Array of Symbols.
143
+
144
+ ### `type_classes`
145
+
146
+ - Returns types of vectors in an Array of `Arrow::DataType`.
147
+
148
+ ### `vectors`
149
+
150
+ - Returns an Array of Vectors.
151
+
152
+ ### `indices`, `indexes`
153
+
154
+ - Returns all indexes in an Array.
155
+
156
+ ### `to_h`
157
+
158
+ - Returns column-oriented data in a Hash.
159
+
160
+ ### `to_a`, `raw_records`
161
+
162
+ - Returns an array of row-oriented data without header.
163
+
164
+ If you need a column-oriented full array, use `.to_h.to_a`.
165
+
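The difference between the row-oriented and column-oriented forms can be sketched with a plain Hash. This is an illustration only; `row_oriented` and `column_oriented` are hypothetical helper names, and the Hash plays the role of the `to_h` result.

```ruby
# Plain-Ruby illustration of column-oriented vs. row-oriented data.
# `columns` plays the role of the Hash returned by `to_h` (key => values).
def row_oriented(columns)
  columns.values.transpose # like `to_a`: one Array per observation, no header
end

def column_oriented(columns)
  columns.to_a # like `.to_h.to_a`: one [key, values] pair per variable
end

columns = { a: [1, 2, 3], b: %w[A B C] }
p row_oriented(columns)
p column_oriented(columns)
```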
166
+ ### `schema`
167
+
168
+ - Returns column names and data types in a Hash.
169
+
170
+ ### `==`
171
+
172
+ ### `empty?`
173
+
174
+ ## Output
175
+
176
+ ### `to_s`
177
+
178
+ ### `summary`, `describe` (not implemented)
179
+
180
+ ### `to_rover`
181
+
182
+ - Returns a `Rover::DataFrame`.
183
+
184
+ ### `to_iruby`
185
+
186
+ - Show the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
187
+
188
+ ### `tdr(limit = 10, tally: 5, elements: 5)`
189
+
190
+ - Shows some information about self in a transposed style.
191
+ - `tdr_str` returns the same info as a String.
192
+
193
+ ```ruby
194
+ require 'red_amber'
195
+ require 'datasets-arrow'
196
+
197
+ penguins = RedAmber::DataFrame.new(Datasets::Penguins.new.to_arrow)
198
+ penguins.tdr
199
+ # =>
200
+ RedAmber::DataFrame : 344 x 8 Vectors
201
+ Vectors : 5 numeric, 3 strings
202
+ # key type level data_preview
203
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
204
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
205
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
206
+ 4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
207
+ 5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
208
+ 6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
209
+ 7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
210
+ 8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
211
+ ```
212
+
213
+ - limit: maximum number of variables to show. Default value is 10.
214
+ - tally: maximum level (number of distinct values) for which tally mode is used.
215
+ - elements: maximum number of element values to show for each variable.
216
+
217
+ ### `inspect`
218
+
219
+ - Returns the information of self as `tdr(3)`, and also shows object id.
220
+
221
+ ```ruby
222
+ puts penguins.inspect
223
+ # =>
224
+ #<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
225
+ Vectors : 5 numeric, 3 strings
226
+ # key type level data_preview
227
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
228
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
229
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
230
+ ... 5 more Vectors ...
231
+ ```
232
+
233
+ ## Selecting
234
+
235
+ ### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
236
+ - Key in a Symbol: `df[:symbol]`
237
+ - Key in a String: `df["string"]`
238
+ - Keys in an Array: `df[:symbol1, "string", :symbol2]`
239
+ - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
240
+
241
+ Key indices must be used via `keys[i]` because numbers are used to select observations (rows).
242
+
243
+ - Keys by a Range:
244
+
245
+ If keys can be represented by a Range, the Range can be included in the arguments. See the example below.
246
+
247
+ - You can exchange the order of variables (columns).
248
+
249
+ ```ruby
250
+ hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
251
+ df = RedAmber::DataFrame.new(hash)
252
+ df[:b..:c, "a"]
253
+ # =>
254
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
255
+ Vectors : 2 numeric, 1 string
256
+ # key type level data_preview
257
+ 1 :b string 3 ["A", "B", "C"]
258
+ 2 :c double 3 [1.0, 2.0, 3.0]
259
+ 3 :a uint8 3 [1, 2, 3]
260
+ ```
261
+
262
+ If `#[]` selects a single variable (column), it returns a Vector object.
263
+
264
+ ```ruby
265
+ df[:a]
266
+ # =>
267
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
268
+ [1, 2, 3]
269
+ ```
270
+ The `#v` method also returns a Vector for a key.
271
+
272
+ ```ruby
273
+ df.v(:a)
274
+ # =>
275
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
276
+ [1, 2, 3]
277
+ ```
278
+
279
+ This may be useful in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`.
280
+
281
+ ### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
282
+
283
+ - Select an obs. by index: `df[0]`
284
+ - Select obs. by indices in a Range: `df[1..2]`
285
+
286
+ An endless or a beginless Range can be used to represent indices.
287
+
288
+ - Select obs. by indices in an Array: `df[1, 2]`
289
+
290
+ - You can use float indices.
291
+
292
+ - Mixed case: `df[2, 0..]`
293
+
294
+ ```ruby
295
+ hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
296
+ df = RedAmber::DataFrame.new(hash)
297
+ df[2, 0..].tdr(tally: 0)
298
+ # =>
299
+ RedAmber::DataFrame : 4 x 3 Vectors
300
+ Vectors : 2 numeric, 1 string
301
+ # key type level data_preview
302
+ 1 :a uint8 3 [3, 1, 2, 3]
303
+ 2 :b string 3 ["C", "A", "B", "C"]
304
+ 3 :c double 3 [3.0, 1.0, 2.0, 3.0]
305
+ ```
306
+
307
+ - Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
308
+
309
+ It returns a sub DataFrame containing the observations where the boolean is true.
310
+
311
+ ```ruby
312
+ # with the same dataframe `df` above
313
+ df[true, false, nil] # or
314
+ df[[true, false, nil]] # or
315
+ df[RedAmber::Vector.new([true, false, nil])]
316
+ # =>
317
+ #<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
318
+ Vectors : 2 numeric, 1 string
319
+ # key type level data_preview
320
+ 1 :a uint8 1 [1]
321
+ 2 :b string 1 ["A"]
322
+ 3 :c double 1 [1.0]
323
+ ```
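The selection rules for observations described above (integers, ranges, mixed arguments such as `df[2, 0..]`, and boolean masks) can be modeled on a plain Array of rows. This is a sketch under stated assumptions, not RedAmber's code: `select_rows` is a hypothetical helper, rows are ordinary Arrays, and nil in a mask is assumed to drop the row, as in the example above.

```ruby
# Plain-Ruby model of row selection by indices, ranges or booleans.
# `rows` is an Array of observations; this is not RedAmber's code.
def select_rows(rows, *args)
  args = args.flatten
  if !args.empty? && args.all? { |a| [true, false, nil].include?(a) }
    raise ArgumentError, 'booleans must have the same size as rows' unless args.size == rows.size

    # Keep rows whose flag is true; false and nil both drop the row.
    return rows.select.with_index { |_, i| args[i] }
  end

  all_indices = (0...rows.size).to_a
  # Integers (negative allowed) and Ranges are expanded in the given order.
  indices = args.flat_map { |arg| Array(all_indices[arg]) }
  rows.values_at(*indices)
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
p select_rows(rows, 2, 0..)         # row 2 first, then every row from 0
p select_rows(rows, true, false, nil)
```

Expanding every argument against the full index list keeps the order the caller asked for, which is why the same row can appear more than once.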
324
+
325
+ ### Select rows from top or from bottom
326
+
327
+ `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
328
+
329
+ ## Sub DataFrame manipulations
330
+
331
+ ### `pick` - pick up variables by key label -
332
+
333
+ Pick up some variables (columns) to create a sub DataFrame.
334
+
335
+ ![pick method image](doc/../image/dataframe/pick.png)
336
+
337
+ - Keys as arguments
338
+
339
+ `pick(keys)` accepts keys as arguments in an Array.
340
+
341
+ ```ruby
342
+ penguins.pick(:species, :bill_length_mm)
343
+ # =>
344
+ #<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
345
+ Vectors : 1 numeric, 1 string
346
+ # key type level data_preview
347
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
348
+ 2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
349
+ ```
350
+
351
+ - Booleans as an argument
352
+
353
+ `pick(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
354
+
355
+ ```ruby
356
+ penguins.pick(penguins.types.map { |type| type == :string })
357
+ # =>
358
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
359
+ Vectors : 3 strings
360
+ # key type level data_preview
361
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
362
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
363
+ 3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
364
+ ```
365
+
366
+ - Keys or booleans by a block
367
+
368
+ `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
369
+
370
+ ```ruby
371
+ # It is ok to write `keys ...` in the block, not `penguins.keys ...`
372
+ penguins.pick { keys.map { |key| key.end_with?('mm') } }
373
+ # =>
374
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
375
+ Vectors : 3 numeric
376
+ # key type level data_preview
377
+ 1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
378
+ 2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
379
+ 3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
380
+ ```
381
+
382
+ ### `drop` - pick and drop -
383
+
384
+ Drop some variables (columns) to create a remainder DataFrame.
385
+
386
+ ![drop method image](doc/../image/dataframe/drop.png)
387
+
388
+ - Keys as arguments
389
+
390
+ `drop(keys)` accepts keys as arguments in an Array.
391
+
392
+ - Booleans as an argument
393
+
394
+ `drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
395
+
396
+ - Keys or booleans by a block
397
+
398
+ `drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
399
+
400
+ - Notice for nil
401
+
402
+ When used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
403
+
404
+ ```ruby
405
+ booleans = [true, false, nil]
406
+ booleans_invert = booleans.map(&:!) # => [false, true, true]
407
+ df.pick(booleans) == df.drop(booleans_invert) # => true
408
+ ```
409
+ - Difference between `pick`/`drop` and `[]`
410
+
411
+ If `pick` or `drop` selects a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`. This behavior may be useful in a block of DataFrame manipulations.
412
+
413
+ ```ruby
414
+ df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
415
+ df.pick(:a) # or
416
+ df.drop(:b, :c)
417
+ # =>
418
+ #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
419
+ Vector : 1 numeric
420
+ # key type level data_preview
421
+ 1 :a uint8 3 [1, 2, 3]
422
+
423
+ df[:a]
424
+ # =>
425
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
426
+ [1, 2, 3]
427
+ ```
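The relationship between `pick`, `drop` and nil described above can be modeled with a plain Hash of columns. This is a sketch, not the library's implementation: `pick_columns` and `drop_columns` are hypothetical helpers, and the returned sub-Hash plays the role of the sub DataFrame.

```ruby
# Plain-Ruby model of `pick` and `drop` on a Hash of columns.
# Hypothetical helpers for illustration; not RedAmber's implementation.
def pick_columns(columns, booleans)
  raise ArgumentError, 'booleans must match the number of keys' unless booleans.size == columns.size

  flags = columns.keys.zip(booleans).to_h
  columns.select { |key, _| flags[key] } # nil counts as false
end

def drop_columns(columns, booleans)
  # `!nil` is true, so a column flagged nil is kept rather than dropped.
  pick_columns(columns, booleans.map(&:!))
end

columns = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2.0, 3.0] }
booleans = [true, false, nil]
p pick_columns(columns, booleans) == drop_columns(columns, booleans.map(&:!))
```

Note that picking a single column still returns a one-entry Hash here, while `columns[:a]` returns the bare Array, which parallels the `DataFrame` versus `Vector` distinction.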
428
+
429
+ ### `slice` - to cut vertically is slice -
430
+
431
+ Slice and select observations (rows) to create a sub DataFrame.
432
+
433
+ ![slice method image](doc/../image/dataframe/slice.png)
434
+
435
+ - Indices as arguments
436
+
437
+ `slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
438
+
439
+ Negative indices counting from the tail, like Ruby's Array, are also acceptable.
440
+
441
+ ```ruby
442
+ # returns 5 obs. at start and 5 obs. from end
443
+ penguins.slice(0...5, -5..-1)
444
+ # =>
445
+ #<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
446
+ Vectors : 5 numeric, 3 strings
447
+ # key type level data_preview
448
+ 1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
449
+ 2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
450
+ 3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
451
+ ... 5 more Vectors ...
452
+ ```
453
+
454
+ - Booleans as an argument
455
+
456
+ `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
457
+
458
+ ```ruby
459
+ vector = penguins[:bill_length_mm]
460
+ penguins.slice(vector >= 40)
461
+ # =>
462
+ #<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
463
+ Vectors : 5 numeric, 3 strings
464
+ # key type level data_preview
465
+ 1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
466
+ 2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
467
+ 3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
468
+ ... 5 more Vectors ...
469
+ ```
470
+
471
+ - Indices or booleans by a block
472
+
473
+ `slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
474
+
475
+ ```ruby
476
+ # return a DataFrame where bill_length_mm is within mean ± std (a 2*std-wide range around the mean)
477
+ penguins.slice do
478
+   vector = self[:bill_length_mm]
479
+   min = vector.mean - vector.std
480
+   max = vector.mean + vector.std
481
+   vector.to_a.map { |e| (min..max).include? e }
482
+ end
483
+
484
+ # =>
485
+ #<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
486
+ Vectors : 5 numeric, 3 strings
487
+ # key type level data_preview
488
+ 1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
489
+ 2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
490
+ 3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
491
+ ... 5 more Vectors ...
492
+ ```
493
+
494
+ - Notice: nil option
495
+ - `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This propagates nil at the same row.
496
+
497
+ ```ruby
498
+ hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
499
+ table = Arrow::Table.new(hash)
500
+ table.slice([true, false, nil])
501
+ # =>
502
+ #<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
503
+ a b c
504
+ 0 1 A 1.000000
505
+ 1 (null) (null) (null)
506
+ ```
507
+
508
+ - Whereas in RedAmber, nil in the booleans passed to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.
509
+
510
+ ```ruby
511
+ RedAmber::DataFrame.new(table).slice([true, false, nil]).table
512
+ # =>
513
+ #<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
514
+ a b c
515
+ 0 1 A 1.000000
516
+ ```
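The two nil policies contrasted above can be modeled in plain Ruby. This is an illustrative sketch only: `filter_rows` is a hypothetical helper, the keyword name simply mirrors the option mentioned above, and a row of nils stands in for Arrow's `(null)` row.

```ruby
# Plain-Ruby model of the two ways a boolean filter can treat nil.
# `:emit_null` and `:drop` mirror the option values mentioned above.
def filter_rows(rows, booleans, null_selection_behavior: :drop)
  rows.zip(booleans).each_with_object([]) do |(row, flag), kept|
    if flag
      kept << row
    elsif flag.nil? && null_selection_behavior == :emit_null
      kept << Array.new(row.size) # a row of nils stands in for (null)
    end
  end
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
p filter_rows(rows, [true, false, nil], null_selection_behavior: :emit_null)
p filter_rows(rows, [true, false, nil]) # nil behaves like false
```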
517
+
518
+ ### `remove`
519
+
520
+ Slice and reject observations (rows) to create a remainder DataFrame.
521
+
522
+ ![remove method image](doc/../image/dataframe/remove.png)
523
+
524
+ - Indices as arguments
525
+
526
+ `remove(indices)` accepts indices as arguments. Indices should be Integers or Ranges of Integers.
527
+
528
+ ```ruby
529
+ # returns 6th to 339th obs.
530
+ penguins.remove(0...5, -5..-1)
531
+ # =>
532
+ #<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
533
+ Vectors : 5 numeric, 3 strings
534
+ # key type level data_preview
535
+ 1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
536
+ 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
537
+ 3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
538
+ ... 5 more Vectors ...
539
+ ```
540
+
541
+ - Booleans as an argument
542
+
543
+ `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
544
+
545
+ ```ruby
546
+ # remove all observations containing nil
547
+ removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
548
+ removed.tdr
549
+ # =>
550
+ RedAmber::DataFrame : 333 x 8 Vectors
551
+ Vectors : 5 numeric, 3 strings
552
+ # key type level data_preview
553
+ 1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
554
+ 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
555
+ 3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
556
+ 4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
557
+ 5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
558
+ 6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
559
+ 7 :sex string 2 {"male"=>168, "female"=>165}
560
+ 8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
561
+ ```
562
+
563
+ - Indices or booleans by a block
564
+
565
+ `remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
566
+
567
+ ```ruby
568
+ penguins.remove do
569
+   vector = self[:bill_length_mm]
570
+   min = vector.mean - vector.std
571
+   max = vector.mean + vector.std
572
+   vector.to_a.map { |e| (min..max).include? e }
573
+ end
574
+ # =>
575
+ #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
576
+ Vectors : 5 numeric, 3 strings
577
+ # key type level data_preview
578
+ 1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
579
+ 2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
580
+ 3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
581
+ ... 5 more Vectors ...
582
+ ```
583
+ - Notice for nil
584
+ - When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
585
+
586
+ ```ruby
587
+ df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
588
+ booleans = df[:a] < 2
589
+ # =>
590
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
591
+ [true, false, nil]
592
+
593
+ booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
594
+ df.slice(booleans) == df.remove(booleans_invert) # => true
595
+ ```
596
+ - Whereas `Vector#invert` returns nil for nil elements. This will give a different result.
597
+
598
+ ```ruby
599
+ booleans.invert
600
+ # =>
601
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
602
+ [false, true, nil]
603
+
604
+ df.remove(booleans.invert)
605
+ #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
606
+ Vectors : 2 numeric, 1 string
607
+ # key type level data_preview
608
+ 1 :a uint8 2 [1, nil], 1 nil
609
+ 2 :b string 2 ["A", "C"]
610
+ 3 :c double 2 [1.0, 3.0]
611
+ ```
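The difference between Ruby's `!` and a nil-preserving inversion, and its effect on `remove`, can be reproduced with plain Arrays. This is a sketch under stated assumptions: `remove_rows` and `invert` are hypothetical helpers that model the behavior described above, not RedAmber methods.

```ruby
# Plain-Ruby model: `remove` drops rows flagged true; nil is treated as false.
def remove_rows(rows, booleans)
  rows.reject.with_index { |_, i| booleans[i] }
end

# Three-valued inversion: nil stays nil (the behavior described for Vector#invert).
def invert(booleans)
  booleans.map { |b| b.nil? ? nil : !b }
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [nil, 'C', 3.0]]
booleans = [true, false, nil]

p remove_rows(rows, booleans.map(&:!)) # `!nil` is true, so the nil row is removed
p remove_rows(rows, invert(booleans))  # nil stays nil, so the nil row is kept
```

The second call keeps the row whose flag is nil, which corresponds to the two-row result shown above.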
612
+
613
+ ### `rename`
614
+
615
+ Rename keys (column names) to create an updated DataFrame.
616
+
617
+ ![rename method image](doc/../image/dataframe/rename.png)
618
+
619
+ - Key pairs as arguments
620
+
621
+ `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
622
+
623
+ ```ruby
624
+ h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
625
+ df = RedAmber::DataFrame.new(h)
626
+ df.rename(:age => :age_in_1993)
627
+ # =>
628
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
629
+ Vectors : 1 numeric, 1 string
630
+ # key type level data_preview
631
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
632
+ 2 :age_in_1993 uint8 3 [68, 49, 28]
633
+ ```
634
+
635
+ - Key pairs by a block
636
+
637
+ `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. The block is called in the context of self.
638
+
639
+ - Key type
640
+
641
+ Symbol key and String key are distinguished.
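The `{existing_key => new_key}` mapping can be illustrated on a plain Hash of columns. This is only a model: `rename_keys` is a hypothetical helper, and how the library reacts to keys that do not exist is not modeled here.

```ruby
# Plain-Ruby model of `rename` on a Hash of columns (illustration only).
def rename_keys(columns, key_pairs)
  # Keys without a replacement are kept as they are; column order is preserved.
  # Lookup is exact, so :age and 'age' are treated as different keys.
  columns.transform_keys { |key| key_pairs.fetch(key, key) }
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
p rename_keys(columns, { age: :age_in_1993 }).keys
```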
642
+
643
+ ### `assign`
644
+
645
+ Assign new or updated variables (columns) and create an updated DataFrame.
646
+
647
+ - Variables with new keys will be appended as new variables at the bottom (right in the table).
648
+ - Variables with existing keys will update the corresponding vectors.
649
+
650
+ ![assign method image](doc/../image/dataframe/assign.png)
651
+
652
+ - Variables as arguments
653
+
654
+ `assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
655
+
656
+ ```ruby
657
+ df = RedAmber::DataFrame.new(
658
+   'name' => %w[Yasuko Rui Hinata],
659
+   'age' => [68, 49, 28])
660
+ # =>
661
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
662
+ Vectors : 1 numeric, 1 string
663
+ # key type level data_preview
664
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
665
+ 2 :age uint8 3 [68, 49, 28]
666
+
667
+ # update :age and add :brother
668
+ assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
669
+ df.assign(assigner)
670
+ # =>
671
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
672
+ Vectors : 1 numeric, 2 strings
673
+ # key type level data_preview
674
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
675
+ 2 :age uint8 3 [97, 78, 57]
676
+ 3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
677
+ ```
678
+
679
+ - Key pairs by a block
680
+
681
+ `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. The block is called in the context of self.
682
+
683
+ ```ruby
684
+ df = RedAmber::DataFrame.new(
685
+   index: [0, 1, 2, 3, nil],
686
+   float: [0.0, 1.1, 2.2, Float::NAN, nil],
687
+   string: ['A', 'B', 'C', 'D', nil])
688
+ # =>
689
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
690
+ Vectors : 2 numeric, 1 string
691
+ # key type level data_preview
692
+ 1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
693
+ 2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
694
+ 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
695
+
696
+ # update numeric variables
697
+ df.assign do
698
+   assigner = {}
699
+   vectors.each_with_index do |v, i|
700
+     assigner[keys[i]] = v * -1 if v.numeric?
701
+   end
702
+   assigner
703
+ end
704
+ # =>
705
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
706
+ Vectors : 2 numeric, 1 string
707
+ # key type level data_preview
708
+ 1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
709
+ 2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
710
+ 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
711
+
712
+ # Or, more concisely:
713
+ df.assign do
714
+   variables.select.with_object({}) do |(key, vector), assigner|
715
+     assigner[key] = vector * -1 if vector.numeric?
716
+   end
717
+ end
718
+ # => same as above
719
+ ```
720
+
721
+ - Key type
722
+
723
+ Symbol key and String key are considered as the same key.
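The update-or-append rule of `assign` can be sketched with `Hash#merge` on a plain Hash of columns. This is an illustration under stated assumptions, not the library's code: `assign_columns` is a hypothetical helper, and keys are normalized to Symbols to model the rule that Symbol and String keys are the same key.

```ruby
# Plain-Ruby model of `assign` on a Hash of columns (illustration only).
def assign_columns(columns, assigner)
  normalized = assigner.transform_keys(&:to_sym) # 'age' and :age are the same key
  # Hash#merge replaces the values of existing keys in place and
  # appends new keys at the end, which matches the rules above.
  columns.merge(normalized)
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
assigner = { 'age' => [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
p assign_columns(columns, assigner).keys
```

Because `merge` returns a new Hash, the original columns stay untouched, in the same way that `assign` creates an updated DataFrame.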
724
+
725
+ ## Updating
726
+
727
+ ### `sort`
728
+
729
+ `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature.
730
+ - :key, "key" or "+key" denotes ascending order
731
+ - "-key" denotes descending order
732
+
733
+ ```ruby
734
+ df = RedAmber::DataFrame.new({
735
+   index: [1, 1, 0, nil, 0],
736
+   string: ['C', 'B', nil, 'A', 'B'],
737
+   bool: [nil, true, false, true, false],
738
+ })
739
+ df.sort(:index, '-bool').tdr(tally: 0)
740
+ # =>
741
+ RedAmber::DataFrame : 5 x 3 Vectors
742
+ Vectors : 1 numeric, 1 string, 1 boolean
743
+ # key type level data_preview
744
+ 1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
745
+ 2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
746
+ 3 :bool boolean 3 [false, false, true, nil, true], 1 nil
747
+ ```
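The sort-key notation can be modeled in plain Ruby on an Array of row Hashes. This is a sketch under stated assumptions, not how Red Arrow sorts internally: all method names are hypothetical, nils are placed last to match the output shown above, and ties are broken by original position.

```ruby
# Plain-Ruby model of sort keys such as :index, "+index" and "-bool".
def parse_sort_key(sort_key)
  name = sort_key.to_s
  descending = name.start_with?('-')
  [name.sub(/\A[+-]/, '').to_sym, descending]
end

# Booleans are not Comparable in Ruby, so map them to 0 and 1 first.
def sortable(value)
  case value
  when true then 1
  when false then 0
  else value
  end
end

def compare_values(left, right, descending)
  return 0 if left.nil? && right.nil?
  return 1 if left.nil? # nil sorts after any value
  return -1 if right.nil?

  order = sortable(left) <=> sortable(right)
  descending ? -order : order
end

def sort_rows(rows, *sort_keys)
  parsed = sort_keys.map { |sort_key| parse_sort_key(sort_key) }
  indexed = rows.each_with_index.to_a
  sorted = indexed.sort do |(row_a, i), (row_b, j)|
    orders = parsed.map { |name, desc| compare_values(row_a[name], row_b[name], desc) }
    orders.find(&:nonzero?) || (i <=> j) # original position breaks ties
  end
  sorted.map(&:first)
end

rows = [
  { index: 1, string: 'C', bool: nil },
  { index: 1, string: 'B', bool: true },
  { index: 0, string: nil, bool: false },
  { index: nil, string: 'A', bool: true },
  { index: 0, string: 'B', bool: false }
]
p sort_rows(rows, :index, '-bool').map { |row| row[:string] }
```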
748
+
749
+ - [ ] Clamp
750
+
751
+ - [ ] Clear data
752
+
753
+ ## Treat na data
754
+
755
+ ### `remove_nil`
756
+
757
+ Remove any observations containing nil.
758
+
759
+ ## Grouping
760
+
761
+ ### `group(aggregating_keys, function, target_keys)`
762
+
763
+ (This is a temporary API and may change in a future version.)
764
+
765
+ Creates a grouped DataFrame by `aggregating_keys`, applies `function` to each group, and returns the results for `target_keys`. Aggregated key names are in `function(key)` style.
766
+
767
+ (The current implementation is not intuitive. Needs improvement.)
768
+
769
+ ```ruby
770
+ ds = Datasets::Rdatasets.new('dplyr', 'starwars')
771
+ starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
772
+ starwars.tdr(11)
773
+ # =>
774
+ RedAmber::DataFrame : 87 x 11 Vectors
775
+ Vectors : 3 numeric, 8 strings
776
+ # key type level data_preview
777
+ 1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
778
+ 2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
779
+ 3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
780
+ 4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
781
+ 5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
782
+ 6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
783
+ 7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
784
+ 8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
785
+ 9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
786
+ 10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
787
+ 11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
788
+
789
+ grouped = starwars.group(:species, :mean, [:mass, :height])
790
+ # =>
791
+ #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
792
+ Vectors : 2 numeric, 1 string
793
+ # key type level data_preview
794
+ 1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
795
+ 2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
796
+ 3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
797
+
798
+ count = starwars.group(:species, :count, :species)[:"count(species)"]
799
+ df = grouped.slice(count > 1)
800
+ # =>
801
+ #<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
802
+ Vectors : 2 numeric, 1 string
803
+ # key type level data_preview
804
+ 1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
805
+ 2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
806
+ 3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
807
+
808
+ df.table
809
+ # =>
810
+ #<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
811
+ mean(mass) mean(height) species
812
+ 0 82.781818 176.645161 Human
813
+ 1 69.750000 131.200000 Droid
814
+ 2 124.000000 231.000000 Wookiee
815
+ 3 74.000000 208.666667 Gungan
816
+ 4 80.000000 173.000000 Zabrak
817
+ 5 55.000000 179.000000 Twi'lek
818
+ 6 53.100000 168.000000 Mirialan
819
+ 7 88.000000 221.000000 Kaminoan
820
+ ```
821
+
822
+ Available functions are:
823
+
824
+ - [ ] all
825
+ - [ ] any
826
+ - [ ] approximate_median
827
+ - ✓ count
828
+ - [ ] count_distinct
829
+ - [ ] distinct
830
+ - ✓ max
831
+ - ✓ mean
832
+ - ✓ min
833
+ - [ ] min_max
834
+ - ✓ product
835
+ - ✓ stddev
836
+ - ✓ sum
837
+ - [ ] tdigest
838
+ - ✓ variance
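The grouping step and the `function(key)` naming can be sketched in plain Ruby for the `mean` case. This is an illustration only: `group_mean` is a hypothetical helper working on an Array of row Hashes, and skipping nils when averaging is an assumption of this sketch.

```ruby
# Plain-Ruby model of grouping rows by one key and averaging target keys.
# Hypothetical helper; nils are skipped, and a group with no values yields nil.
def group_mean(rows, group_key, target_keys)
  rows.group_by { |row| row[group_key] }.map do |group_value, members|
    means = target_keys.to_h do |key|
      values = members.map { |row| row[key] }.compact
      mean = values.empty? ? nil : values.sum.to_f / values.size
      [:"mean(#{key})", mean] # aggregated key in `function(key)` style
    end
    means.merge(group_key => group_value)
  end
end

rows = [
  { species: 'Human', mass: 77.0, height: 172 },
  { species: 'Droid', mass: 75.0, height: 167 },
  { species: 'Droid', mass: 32.0, height: 96 },
  { species: 'Human', mass: nil, height: 202 }
]
p group_mean(rows, :species, %i[mass height])
```

The aggregated columns come first and the grouping key last, which follows the column order of the output shown above.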
839
+
840
+ ## Combining DataFrames
841
+
842
+ - [ ] obs
843
+
844
+ - [ ] Add vars
845
+
846
+ - [ ] Inner join
847
+
848
+ - [ ] Left join
849
+
850
+ ## Encoding
851
+
852
+ - [ ] One-hot encoding
853
+
854
+ ## Iteration (not implemented)