red_amber 0.1.3 → 0.1.6

Files changed (43)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +31 -7
  3. data/CHANGELOG.md +214 -10
  4. data/Gemfile +4 -0
  5. data/README.md +117 -342
  6. data/benchmark/csv_load_penguins.yml +15 -0
  7. data/benchmark/drop_nil.yml +11 -0
  8. data/doc/DataFrame.md +854 -0
  9. data/doc/Vector.md +449 -0
  10. data/doc/image/arrow_table_new.png +0 -0
  11. data/doc/image/dataframe/assign.png +0 -0
  12. data/doc/image/dataframe/drop.png +0 -0
  13. data/doc/image/dataframe/pick.png +0 -0
  14. data/doc/image/dataframe/remove.png +0 -0
  15. data/doc/image/dataframe/rename.png +0 -0
  16. data/doc/image/dataframe/slice.png +0 -0
  17. data/doc/image/dataframe_model.png +0 -0
  18. data/doc/image/example_in_red_arrow.png +0 -0
  19. data/doc/image/tdr.png +0 -0
  20. data/doc/image/tdr_and_table.png +0 -0
  21. data/doc/image/tidy_data_in_TDR.png +0 -0
  22. data/doc/image/vector/binary_element_wise.png +0 -0
  23. data/doc/image/vector/unary_aggregation.png +0 -0
  24. data/doc/image/vector/unary_aggregation_w_option.png +0 -0
  25. data/doc/image/vector/unary_element_wise.png +0 -0
  26. data/doc/tdr.md +56 -0
  27. data/doc/tdr_ja.md +56 -0
  28. data/lib/red-amber.rb +27 -0
  29. data/lib/red_amber/data_frame.rb +91 -37
  30. data/lib/red_amber/{data_frame_output.rb → data_frame_displayable.rb} +49 -41
  31. data/lib/red_amber/data_frame_indexable.rb +38 -0
  32. data/lib/red_amber/data_frame_observation_operation.rb +11 -0
  33. data/lib/red_amber/data_frame_selectable.rb +155 -48
  34. data/lib/red_amber/data_frame_variable_operation.rb +137 -0
  35. data/lib/red_amber/helper.rb +61 -0
  36. data/lib/red_amber/vector.rb +69 -16
  37. data/lib/red_amber/vector_functions.rb +80 -45
  38. data/lib/red_amber/vector_selectable.rb +124 -0
  39. data/lib/red_amber/vector_updatable.rb +104 -0
  40. data/lib/red_amber/version.rb +1 -1
  41. data/lib/red_amber.rb +1 -16
  42. data/red_amber.gemspec +3 -6
  43. metadata +38 -9
data/doc/DataFrame.md ADDED
@@ -0,0 +1,854 @@
1
+ # DataFrame
2
+
3
+ Class `RedAmber::DataFrame` represents 2D data. A `DataFrame` consists of:
4
+ - A collection of data that all share the same data type. We call it a `Vector`.
5
+ - A label attached to a `Vector`. We call it a `key`.
6
+ - A `Vector` and its associated `key` are grouped as a `variable`.
7
+ - `variable`s with the same vector length are aligned and arranged to form a `DataFrame`.
8
+ - The values at the same position in each `Vector` of a `DataFrame` form a set of related data. We call it an `observation`.
9
+
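The model above can be sketched in plain Ruby. This is only an illustration, not the RedAmber API: a Hash stands in for the `DataFrame`, Arrays stand in for `Vector`s, and `observations_of` is a hypothetical helper name.

```ruby
# Plain-Ruby model of the data model above; this is not the RedAmber API.
# `variables` maps each key to an Array that stands in for a Vector.
def observations_of(variables)
  lengths = variables.values.map(&:size).uniq
  raise ArgumentError, 'vectors must have the same length' if lengths.size > 1

  # An observation collects the values found at one position in every vector.
  variables.values.transpose.map { |row| variables.keys.zip(row).to_h }
end

observations_of({ x: [1, 2, 3], y: %w[A B C] }).first # => {:x=>1, :y=>"A"}
```

Each element of the result is one observation, keyed by the variable it came from.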
10
+ ![dataframe model image](doc/../image/dataframe_model.png)
11
+
12
+ (No change in this model in v0.1.6.)
13
+
14
+ ## Constructors and saving
15
+
16
+ ### `new` from a Hash
17
+
18
+ ```ruby
19
+ RedAmber::DataFrame.new(x: [1, 2, 3])
20
+ ```
21
+
22
+ ### `new` from a schema (by Hash) and data (by Array)
23
+
24
+ ```ruby
25
+ RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
26
+ ```
27
+
28
+ ### `new` from an Arrow::Table
29
+
30
+
31
+ ```ruby
32
+ table = Arrow::Table.new(x: [1, 2, 3])
33
+ RedAmber::DataFrame.new(table)
34
+ ```
35
+
36
+ ### `new` from a Rover::DataFrame
37
+
38
+
39
+ ```ruby
40
+ rover = Rover::DataFrame.new(x: [1, 2, 3])
41
+ RedAmber::DataFrame.new(rover)
42
+ ```
43
+
44
+ ### `load` (class method)
45
+
46
+ - from a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
47
+
48
+ ```ruby
49
+ RedAmber::DataFrame.load("test/entity/with_header.csv")
50
+ ```
51
+
52
+ - from a string buffer
53
+
54
+ - from a URI
55
+
56
+ ```ruby
57
+ uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
58
+ RedAmber::DataFrame.load(uri)
59
+ ```
60
+
61
+ - from a Parquet file
62
+
63
+ ```ruby
64
+ dataframe = RedAmber::DataFrame.load("file.parquet")
65
+ ```
66
+
67
+ ### `save` (instance method)
68
+
69
+ - to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
70
+
71
+ - to a string buffer
72
+
73
+ - to a URI
74
+
75
+ - to a Parquet file
76
+
77
+ ```ruby
78
+ dataframe.save("file.parquet")
79
+ ```
80
+
81
+ ## Properties
82
+
83
+ ### `table`, `to_arrow`
84
+
85
+ - Returns the `Arrow::Table` object held inside.
86
+
87
+ ### `size`, `n_obs`, `n_rows`
88
+
89
+ - Returns the size of the Vectors (number of observations).
90
+
91
+ ### `n_keys`, `n_vars`, `n_cols`
92
+
93
+ - Returns the number of keys (number of variables).
94
+
95
+ ### `shape`
96
+
97
+ - Returns the shape as an Array `[n_rows, n_cols]`.
98
+
99
+ ### `variables`
100
+
101
+ - Returns pairs of key name and Vector in a Hash.
102
+
103
+ This is convenient in a block when both the key and the vector are required. We can write:
104
+
105
+ ```ruby
106
+ # update numeric variables
107
+ df.assign do
108
+ variables.select.with_object({}) do |(key, vector), assigner|
109
+ assigner[key] = vector * -1 if vector.numeric?
110
+ end
111
+ end
112
+ ```
113
+
114
+ Instead of:
115
+ ```ruby
116
+ df.assign do
117
+ assigner = {}
118
+ vectors.each_with_index do |vector, i|
119
+ assigner[keys[i]] = vector * -1 if vector.numeric?
120
+ end
121
+ assigner
122
+ end
123
+ ```
124
+
125
+ ### `keys`, `var_names`, `column_names`
126
+
127
+ - Returns key names in an Array.
128
+
129
+ When iterating over vectors, `Vector#key` is useful to get the key of the vector inside the DataFrame.
130
+
131
+ ```ruby
132
+ # update numeric variables, another solution
133
+ df.assign do
134
+ vectors.each_with_object({}) do |vector, assigner|
135
+ assigner[vector.key] = vector * -1 if vector.numeric?
136
+ end
137
+ end
138
+ ```
139
+
140
+ ### `types`
141
+
142
+ - Returns types of vectors in an Array of Symbols.
143
+
144
+ ### `type_classes`
145
+
146
+ - Returns the types of the vectors in an Array of `Arrow::DataType`.
147
+
148
+ ### `vectors`
149
+
150
+ - Returns an Array of Vectors.
151
+
152
+ ### `indices`, `indexes`
153
+
154
+ - Returns all indexes in an Array.
155
+
156
+ ### `to_h`
157
+
158
+ - Returns column-oriented data in a Hash.
159
+
160
+ ### `to_a`, `raw_records`
161
+
162
+ - Returns an array of row-oriented data without header.
163
+
164
+ If you need a column-oriented full array, use `.to_h.to_a`.
165
+
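The difference between the row-oriented and column-oriented forms can be seen with a plain Hash of columns. This is a sketch with hypothetical helper names (`row_oriented`, `column_oriented`); the Hash stands in for the result of `to_h`, and RedAmber itself is not involved.

```ruby
# Plain-Ruby model: a Hash of columns stands in for the result of `to_h`.
def row_oriented(columns)
  columns.values.transpose # rows without a header, in the spirit of `to_a`
end

def column_oriented(columns)
  columns.to_a # [key, values] pairs, in the spirit of `.to_h.to_a`
end

columns = { a: [1, 2, 3], b: %w[A B C] }
row_oriented(columns)    # => [[1, "A"], [2, "B"], [3, "C"]]
column_oriented(columns) # => [[:a, [1, 2, 3]], [:b, ["A", "B", "C"]]]
```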
166
+ ### `schema`
167
+
168
+ - Returns column names and data types in a Hash.
169
+
170
+ ### `==`
171
+
172
+ ### `empty?`
173
+
174
+ ## Output
175
+
176
+ ### `to_s`
177
+
178
+ ### `summary`, `describe` (not implemented)
179
+
180
+ ### `to_rover`
181
+
182
+ - Returns a `Rover::DataFrame`.
183
+
184
+ ### `to_iruby`
185
+
186
+ - Shows the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
187
+
188
+ ### `tdr(limit = 10, tally: 5, elements: 5)`
189
+
190
+ - Shows some information about self in a transposed style.
191
+ - `tdr_str` returns the same information as a String.
192
+
193
+ ```ruby
194
+ require 'red_amber'
195
+ require 'datasets-arrow'
196
+
197
+ penguins = Datasets::Penguins.new.to_arrow
198
+ RedAmber::DataFrame.new(penguins).tdr
199
+ # =>
200
+ RedAmber::DataFrame : 344 x 8 Vectors
201
+ Vectors : 5 numeric, 3 strings
202
+ # key type level data_preview
203
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
204
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
205
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
206
+ 4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
207
+ 5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
208
+ 6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
209
+ 7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
210
+ 8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
211
+ ```
212
+
213
+ - limit: limit of variables to show. Default value is 10.
214
+ - tally: maximum level for which tally mode is used. Default value is 5.
215
+ - elements: maximum number of elements shown in each data preview. Default value is 5.
216
+
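How `tally` and `elements` interact can be sketched as follows. This is a hypothetical helper, not RedAmber's implementation: the rule used here (tally when the number of unique values is at most `tally` and there are more values than `elements`) is an assumption chosen only because it agrees with the outputs shown in this document.

```ruby
# Hypothetical helper, not RedAmber's implementation. It returns a tally Hash
# for low-level data and a truncated element list (as a String) otherwise.
def preview(values, tally: 5, elements: 5)
  return values.tally if values.uniq.size <= tally && values.size > elements

  shown = values.first(elements)
  return shown.inspect if shown.size == values.size

  "#{shown.inspect[0..-2]}, ... ]"
end

preview(%w[a b a c a b])           # => {"a"=>3, "b"=>2, "c"=>1}
preview((1..10).to_a)              # => "[1, 2, 3, 4, 5, ... ]"
preview(%w[a b a c a b], tally: 0) # tally mode disabled, elements are listed
```

Passing `tally: 0` turns tally mode off in this sketch, in the same way the `tdr(tally: 0)` calls in this document show element lists only.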
217
+ ### `inspect`
218
+
219
+ - Returns the same information as `tdr(3)`, together with the object id.
220
+
221
+ ```ruby
222
+ puts penguins.inspect
223
+ # =>
224
+ #<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
225
+ Vectors : 5 numeric, 3 strings
226
+ # key type level data_preview
227
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
228
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
229
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
230
+ ... 5 more Vectors ...
231
+ ```
232
+
233
+ ## Selecting
234
+
235
+ ### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
236
+ - Key in a Symbol: `df[:symbol]`
237
+ - Key in a String: `df["string"]`
238
+ - Keys in an Array: `df[:symbol1, "string", :symbol2]`
239
+ - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
240
+
241
+ Key indices must go through `keys[i]` because bare numbers are used to select observations (rows).
242
+
243
+ - Keys by a Range:
244
+
245
+ If the keys can be represented by a Range, the Range can be included in the arguments. See the example below.
246
+
247
+ - You can exchange the order of variables (columns).
248
+
249
+ ```ruby
250
+ hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
251
+ df = RedAmber::DataFrame.new(hash)
252
+ df[:b..:c, "a"]
253
+ # =>
254
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
255
+ Vectors : 2 numeric, 1 string
256
+ # key type level data_preview
257
+ 1 :b string 3 ["A", "B", "C"]
258
+ 2 :c double 3 [1.0, 2.0, 3.0]
259
+ 3 :a uint8 3 [1, 2, 3]
260
+ ```
261
+
262
+ If `#[]` selects a single variable (column), it returns a Vector object.
263
+
264
+ ```ruby
265
+ df[:a]
266
+ # =>
267
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
268
+ [1, 2, 3]
269
+ ```
270
+ The `#v` method also returns a Vector for a key.
271
+
272
+ ```ruby
273
+ df.v(:a)
274
+ # =>
275
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
276
+ [1, 2, 3]
277
+ ```
278
+
279
+ This may be useful in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`.
280
+
281
+ ### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
282
+
283
+ - Select an obs. by index: `df[0]`
284
+ - Select obs. by indices in a Range: `df[1..2]`
285
+
286
+ An endless or a beginless Range can be used to represent indices.
287
+
288
+ - Select obs. by indices in an Array: `df[1, 2]`
289
+
290
+ - You can use float indices.
291
+
292
+ - Mixed case: `df[2, 0..]`
293
+
294
+ ```ruby
295
+ hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
296
+ df = RedAmber::DataFrame.new(hash)
297
+ df[2, 0..].tdr(tally: 0)
298
+ # =>
299
+ RedAmber::DataFrame : 4 x 3 Vectors
300
+ Vectors : 2 numeric, 1 string
301
+ # key type level data_preview
302
+ 1 :a uint8 3 [3, 1, 2, 3]
303
+ 2 :b string 3 ["C", "A", "B", "C"]
304
+ 3 :c double 3 [3.0, 1.0, 2.0, 3.0]
305
+ ```
306
+
307
+ - Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
308
+
309
+ It returns a sub dataframe with the observations where the boolean is true.
310
+
311
+ ```ruby
312
+ # with the same dataframe `df` above
313
+ df[true, false, nil] # or
314
+ df[[true, false, nil]] # or
315
+ df[RedAmber::Vector.new([true, false, nil])]
316
+ # =>
317
+ #<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
318
+ Vectors : 2 numeric, 1 string
319
+ # key type level data_preview
320
+ 1 :a uint8 1 [1]
321
+ 2 :b string 1 ["A"]
322
+ 3 :c double 1 [1.0]
323
+ ```
324
+
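The mixed Integer and Range selection described above (for example `df[2, 0..]`) can be modeled in plain Ruby. This is a sketch, not RedAmber's implementation: `expand_indices` is a hypothetical helper, and Float indices are left out because their rounding rule is not stated here.

```ruby
# Plain-Ruby sketch, not RedAmber's implementation: expand mixed Integer and
# Range arguments into a flat list of row positions for a frame of `size` rows.
def expand_indices(args, size)
  positions = (0...size).to_a
  args.flat_map do |arg|
    picked = positions[arg] # Array#[] resolves negative and open-ended ranges
    raise IndexError, "index #{arg.inspect} out of range" if picked.nil?

    picked
  end
end

expand_indices([2, 0..], 3)          # => [2, 0, 1, 2]
expand_indices([0...5, -5..-1], 344) # first five and last five positions
```

The first call yields four positions for a three-row frame, which is why a mixed selection can return more observations than the source holds.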
325
+ ### Select rows from top or from bottom
326
+
327
+ `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
328
+
329
+ ## Sub DataFrame manipulations
330
+
331
+ ### `pick ` - pick up variables by key label -
332
+
333
+ Pick up some variables (columns) to create a sub DataFrame.
334
+
335
+ ![pick method image](doc/../image/dataframe/pick.png)
336
+
337
+ - Keys as arguments
338
+
339
+ `pick(keys)` accepts keys as arguments in an Array.
340
+
341
+ ```ruby
342
+ penguins.pick(:species, :bill_length_mm)
343
+ # =>
344
+ #<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
345
+ Vectors : 1 numeric, 1 string
346
+ # key type level data_preview
347
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
348
+ 2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
349
+ ```
350
+
351
+ - Booleans as an argument
352
+
353
+ `pick(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
354
+
355
+ ```ruby
356
+ penguins.pick(penguins.types.map { |type| type == :string })
357
+ # =>
358
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
359
+ Vectors : 3 strings
360
+ # key type level data_preview
361
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
362
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
363
+ 3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
364
+ ```
365
+
366
+ - Keys or booleans by a block
367
+
368
+ `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
369
+
370
+ ```ruby
371
+ # It is ok to write `keys ...` in the block, not `penguins.keys ...`
372
+ penguins.pick { keys.map { |key| key.end_with?('mm') } }
373
+ # =>
374
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
375
+ Vectors : 3 numeric
376
+ # key type level data_preview
377
+ 1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
378
+ 2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
379
+ 3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
380
+ ```
381
+
382
+ ### `drop ` - pick and drop -
383
+
384
+ Drop some variables (columns) to create a DataFrame of the remaining variables.
385
+
386
+ ![drop method image](doc/../image/dataframe/drop.png)
387
+
388
+ - Keys as arguments
389
+
390
+ `drop(keys)` accepts keys as arguments in an Array.
391
+
392
+ - Booleans as an argument
393
+
394
+ `drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
395
+
396
+ - Keys or booleans by a block
397
+
398
+ `drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
399
+
400
+ - Notice for nil
401
+
402
+ When used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `NilClass#!` (`!nil` is `true`).
403
+
404
+ ```ruby
405
+ booleans = [true, false, nil]
406
+ booleans_invert = booleans.map(&:!) # => [false, true, true]
407
+ df.pick(booleans) == df.drop(booleans_invert) # => true
408
+ ```
409
+ - Difference between `pick`/`drop` and `[]`
410
+
411
+ If `pick` or `drop` selects a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`. This behavior may be useful in a block of DataFrame manipulations.
412
+
413
+ ```ruby
414
+ df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
415
+ df.pick(:a) # or
416
+ df.drop(:b, :c)
417
+ # =>
418
+ #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
419
+ Vector : 1 numeric
420
+ # key type level data_preview
421
+ 1 :a uint8 3 [1, 2, 3]
422
+
423
+ df[:a]
424
+ # =>
425
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
426
+ [1, 2, 3]
427
+ ```
428
+
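The nil rule for boolean `pick` and `drop` described above can be checked with a plain-Ruby model. This is only a sketch: a Hash of columns stands in for the DataFrame, and the two methods below are hypothetical stand-ins that share names with the RedAmber verbs but not their implementation.

```ruby
# Plain-Ruby model of boolean `pick` and `drop` on a Hash of columns.
# nil counts as false on both sides, so the two verbs stay complementary.
def pick(columns, booleans)
  raise ArgumentError, 'size mismatch' unless booleans.size == columns.size

  kept = columns.keys.zip(booleans).select { |_key, flag| flag }.map(&:first)
  columns.slice(*kept)
end

def drop(columns, booleans)
  raise ArgumentError, 'size mismatch' unless booleans.size == columns.size

  kept = columns.keys.zip(booleans).reject { |_key, flag| flag }.map(&:first)
  columns.slice(*kept)
end

columns = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2.0, 3.0] }
booleans = [true, false, nil]
pick(columns, booleans) == drop(columns, booleans.map(&:!)) # => true
```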
429
+ ### `slice ` - to cut vertically is slice -
430
+
431
+ Slice and select observations (rows) to create a sub DataFrame.
432
+
433
+ ![slice method image](doc/../image/dataframe/slice.png)
434
+
435
+ - Indices as arguments
436
+
437
+ `slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
438
+
439
+ A negative index counted from the tail, as in Ruby's Array, is also acceptable.
440
+
441
+ ```ruby
442
+ # returns 5 obs. at start and 5 obs. from end
443
+ penguins.slice(0...5, -5..-1)
444
+ # =>
445
+ #<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
446
+ Vectors : 5 numeric, 3 strings
447
+ # key type level data_preview
448
+ 1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
449
+ 2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
450
+ 3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
451
+ ... 5 more Vectors ...
452
+ ```
453
+
454
+ - Booleans as an argument
455
+
456
+ `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
457
+
458
+ ```ruby
459
+ vector = penguins[:bill_length_mm]
460
+ penguins.slice(vector >= 40)
461
+ # =>
462
+ #<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
463
+ Vectors : 5 numeric, 3 strings
464
+ # key type level data_preview
465
+ 1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
466
+ 2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
467
+ 3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
468
+ ... 5 more Vectors ...
469
+ ```
470
+
471
+ - Indices or booleans by a block
472
+
473
+ `slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
474
+
475
+ ```ruby
476
+ # return a DataFrame whose bill_length_mm is within one std of the mean (a 2*std-wide range)
477
+ penguins.slice do
478
+ vector = self[:bill_length_mm]
479
+ min = vector.mean - vector.std
480
+ max = vector.mean + vector.std
481
+ vector.to_a.map { |e| (min..max).include? e }
482
+ end
483
+
484
+ # =>
485
+ #<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
486
+ Vectors : 5 numeric, 3 strings
487
+ # key type level data_preview
488
+ 1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
489
+ 2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
490
+ 3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
491
+ ... 5 more Vectors ...
492
+ ```
493
+
494
+ - Notice: nil option
495
+ - `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This propagates nil at the same row.
496
+
497
+ ```ruby
498
+ hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
499
+ table = Arrow::Table.new(hash)
500
+ table.slice([true, false, nil])
501
+ # =>
502
+ #<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
503
+ a b c
504
+ 0 1 A 1.000000
505
+ 1 (null) (null) (null)
506
+ ```
507
+
508
+ - In RedAmber, by contrast, nil in the booleans passed to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.
509
+
510
+ ```ruby
511
+ RedAmber::DataFrame.new(table).slice([true, false, nil]).table
512
+ # =>
513
+ #<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
514
+ a b c
515
+ 0 1 A 1.000000
516
+ ```
517
+
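The two null-selection behaviors contrasted above can be modeled in plain Ruby. This is a sketch rather than the Arrow API: `filter_rows` and its `null_selection:` keyword are hypothetical names, and rows are plain Arrays.

```ruby
# Plain-Ruby model of the two null-selection behaviors; not the Arrow API.
def filter_rows(rows, booleans, null_selection: :drop)
  rows.zip(booleans).flat_map do |row, flag|
    if flag
      [row]                # keep the row
    elsif flag.nil? && null_selection == :emit_null
      [row.map { nil }]    # emit a row of nils in its place
    else
      []                   # false, or nil under :drop
    end
  end
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
filter_rows(rows, [true, false, nil])
# => [[1, "A", 1.0]]
filter_rows(rows, [true, false, nil], null_selection: :emit_null)
# => [[1, "A", 1.0], [nil, nil, nil]]
```

The `:drop` result corresponds to the one-row table shown for `DataFrame#slice`, and the `:emit_null` result to the table with a `(null)` row.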
518
+ ### `remove`
519
+
520
+ Slice and reject observations (rows) to create a DataFrame of the remaining observations.
521
+
522
+ ![remove method image](doc/../image/dataframe/remove.png)
523
+
524
+ - Indices as arguments
525
+
526
+ `remove(indices)` accepts indices as arguments. Indices should be Integers or Ranges of Integers.
527
+
528
+ ```ruby
529
+ # returns 6th to 339th obs.
530
+ penguins.remove(0...5, -5..-1)
531
+ # =>
532
+ #<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
533
+ Vectors : 5 numeric, 3 strings
534
+ # key type level data_preview
535
+ 1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
536
+ 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
537
+ 3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
538
+ ... 5 more Vectors ...
539
+ ```
540
+
541
+ - Booleans as an argument
542
+
543
+ `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
544
+
545
+ ```ruby
546
+ # remove all observations containing nil
547
+ removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
548
+ removed.tdr
549
+ # =>
550
+ RedAmber::DataFrame : 333 x 8 Vectors
551
+ Vectors : 5 numeric, 3 strings
552
+ # key type level data_preview
553
+ 1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
554
+ 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
555
+ 3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
556
+ 4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
557
+ 5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
558
+ 6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
559
+ 7 :sex string 2 {"male"=>168, "female"=>165}
560
+ 8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
561
+ ```
562
+
563
+ - Indices or booleans by a block
564
+
565
+ `remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
566
+
567
+ ```ruby
568
+ penguins.remove do
569
+ vector = self[:bill_length_mm]
570
+ min = vector.mean - vector.std
571
+ max = vector.mean + vector.std
572
+ vector.to_a.map { |e| (min..max).include? e }
573
+ end
574
+ # =>
575
+ #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
576
+ Vectors : 5 numeric, 3 strings
577
+ # key type level data_preview
578
+ 1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
579
+ 2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
580
+ 3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
581
+ ... 5 more Vectors ...
582
+ ```
583
+ - Notice for nil
584
+ - When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `NilClass#!` (`!nil` is `true`).
585
+
586
+ ```ruby
587
+ df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
588
+ booleans = df[:a] < 2
589
+ # =>
590
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
591
+ [true, false, nil]
592
+
593
+ booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
594
+ df.slice(booleans) == df.remove(booleans_invert) # => true
595
+ ```
596
+ - In contrast, `Vector#invert` returns nil for nil elements, which gives a different result.
597
+
598
+ ```ruby
599
+ booleans.invert
600
+ # =>
601
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
602
+ [false, true, nil]
603
+
604
+ df.remove(booleans.invert)
605
+ #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
606
+ Vectors : 2 numeric, 1 string
607
+ # key type level data_preview
608
+ 1 :a uint8 2 [1, nil], 1 nil
609
+ 2 :b string 2 ["A", "C"]
610
+ 3 :c double 2 [1.0, 3.0]
611
+ ```
612
+
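The two ways of inverting a boolean mask discussed above differ only in how nil is handled. A plain-Ruby sketch, with Arrays standing in for boolean Vectors and hypothetical helper names (`ruby_invert`, `nil_preserving_invert`, `remove_rows`), shows why `remove` then gives different results:

```ruby
# Plain-Ruby model; Arrays stand in for boolean Vectors.
def ruby_invert(booleans)
  booleans.map(&:!) # nil becomes true
end

def nil_preserving_invert(booleans)
  booleans.map { |b| b.nil? ? nil : !b } # nil stays nil, as described for Vector#invert
end

# Rows whose flag is true are removed; false and nil both keep the row.
def remove_rows(rows, booleans)
  rows.zip(booleans).reject { |_row, flag| flag }.map(&:first)
end

rows = [[1, 'A'], [2, 'B'], [nil, 'C']]
booleans = [true, false, nil]
remove_rows(rows, ruby_invert(booleans))           # => [[1, "A"]]
remove_rows(rows, nil_preserving_invert(booleans)) # => [[1, "A"], [nil, "C"]]
```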
613
+ ### `rename`
614
+
615
+ Rename keys (column names) to create an updated DataFrame.
616
+
617
+ ![rename method image](doc/../image/dataframe/rename.png)
618
+
619
+ - Key pairs as arguments
620
+
621
+ `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
622
+
623
+ ```ruby
624
+ h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
625
+ df = RedAmber::DataFrame.new(h)
626
+ df.rename(:age => :age_in_1993)
627
+ # =>
628
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
629
+ Vectors : 1 numeric, 1 string
630
+ # key type level data_preview
631
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
632
+ 2 :age_in_1993 uint8 3 [68, 49, 28]
633
+ ```
634
+
635
+ - Key pairs by a block
636
+
637
+ `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. The block is called in the context of self.
638
+
639
+ - Key type
640
+
641
+ Symbol key and String key are distinguished.
642
+
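Renaming by a `{existing_key => new_key}` Hash can be modeled on a plain Hash of columns. This sketch is not RedAmber's implementation; in particular, raising on an unknown key is an assumption made for the example.

```ruby
# Plain-Ruby model of `rename` on a Hash of columns (Symbol keys assumed).
def rename(columns, key_pairs)
  unknown = key_pairs.keys - columns.keys
  raise KeyError, "unknown keys: #{unknown.inspect}" unless unknown.empty?

  # Keys without an entry in key_pairs are kept; column order is preserved.
  columns.transform_keys { |key| key_pairs.fetch(key, key) }
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
rename(columns, { age: :age_in_1993 }).keys # => [:name, :age_in_1993]
```

`Hash#transform_keys` returns a new Hash, so the original columns are left untouched, in line with the "create an updated DataFrame" wording.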
643
+ ### `assign`
644
+
645
+ Assign new or updated variables (columns) and create an updated DataFrame.
646
+
647
+ - Variables with new keys are appended at the bottom (right in the table).
648
+ - Variables with existing keys will update corresponding vectors.
649
+
650
+ ![assign method image](doc/../image/dataframe/assign.png)
651
+
652
+ - Variables as arguments
653
+
654
+ `assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
655
+
656
+ ```ruby
657
+ df = RedAmber::DataFrame.new(
658
+ 'name' => %w[Yasuko Rui Hinata],
659
+ 'age' => [68, 49, 28])
660
+ # =>
661
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
662
+ Vectors : 1 numeric, 1 string
663
+ # key type level data_preview
664
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
665
+ 2 :age uint8 3 [68, 49, 28]
666
+
667
+ # update :age and add :brother
668
+ assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
669
+ df.assign(assigner)
670
+ # =>
671
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
672
+ Vectors : 1 numeric, 2 strings
673
+ # key type level data_preview
674
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
675
+ 2 :age uint8 3 [97, 78, 57]
676
+ 3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
677
+ ```
678
+
679
+ - Key pairs by a block
680
+
681
+ `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. The block is called in the context of self.
682
+
683
+ ```ruby
684
+ df = RedAmber::DataFrame.new(
685
+ index: [0, 1, 2, 3, nil],
686
+ float: [0.0, 1.1, 2.2, Float::NAN, nil],
687
+ string: ['A', 'B', 'C', 'D', nil])
688
+ # =>
689
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
690
+ Vectors : 2 numeric, 1 string
691
+ # key type level data_preview
692
+ 1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
693
+ 2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
694
+ 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
695
+
696
+ # update numeric variables
697
+ df.assign do
698
+ assigner = {}
699
+ vectors.each_with_index do |v, i|
700
+ assigner[keys[i]] = v * -1 if v.numeric?
701
+ end
702
+ assigner
703
+ end
704
+ # =>
705
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
706
+ Vectors : 2 numeric, 1 string
707
+ # key type level data_preview
708
+ 1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
709
+ 2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
710
+ 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
711
+
712
+ # Or, more concisely:
713
+ df.assign do
714
+ variables.select.with_object({}) do |(key, vector), assigner|
715
+ assigner[key] = vector * -1 if vector.numeric?
716
+ end
717
+ end
718
+ # => same as above
719
+ ```
720
+
721
+ - Key type
722
+
723
+ Symbol keys and String keys are considered to be the same key.
724
+
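The update-or-append behavior of `assign` resembles `Hash#merge` on a Hash of columns. The sketch below is a plain-Ruby model, not RedAmber's implementation; the length check and the normalization of String keys to Symbols are assumptions made to mirror the rules stated above.

```ruby
# Plain-Ruby model of `assign` on a Hash of columns (Symbol keys assumed).
def assign(columns, assigner)
  size = columns.values.first&.size
  assigner.each do |key, values|
    next if size.nil? || values.size == size

    raise ArgumentError, "#{key}: expected #{size} values, got #{values.size}"
  end
  # Existing keys are updated in place; new keys are appended at the end.
  columns.merge(assigner.transform_keys(&:to_sym))
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
updated = assign(columns, { 'age' => [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] })
updated.keys # => [:name, :age, :brother]
```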
725
+ ## Updating
726
+
727
+ ### `sort`
728
+
729
+ `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature.
730
+ - :key, "key" or "+key" denotes ascending order
731
+ - "-key" denotes descending order
732
+
733
+ ```ruby
734
+ df = RedAmber::DataFrame.new({
735
+ index: [1, 1, 0, nil, 0],
736
+ string: ['C', 'B', nil, 'A', 'B'],
737
+ bool: [nil, true, false, true, false],
738
+ })
739
+ df.sort(:index, '-bool').tdr(tally: 0)
740
+ # =>
741
+ RedAmber::DataFrame : 5 x 3 Vectors
742
+ Vectors : 1 numeric, 1 string, 1 boolean
743
+ # key type level data_preview
744
+ 1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
745
+ 2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
746
+ 3 :bool boolean 3 [false, false, true, nil, true], 1 nil
747
+ ```
748
+
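The sort-key convention above (`:key`, `"key"` or `"+key"` ascending, `"-key"` descending) can be sketched in plain Ruby. This is not Red Arrow's implementation: the helper names are hypothetical, and placing nil last in both orders and breaking ties by original position are assumptions chosen to reproduce the output shown in the example.

```ruby
# Plain-Ruby sketch of the sort-key convention; not Red Arrow's implementation.
def parse_sort_key(key)
  name = key.to_s
  return [name[1..].to_sym, :descending] if name.start_with?('-')

  [name.delete_prefix('+').to_sym, :ascending]
end

# Compare two values; nil sorts last in either order, false sorts before true.
def compare(a, b, order)
  return 0 if a.nil? && b.nil?
  return 1 if a.nil?
  return -1 if b.nil?

  a, b = [a, b].map { |v| v == true ? 1 : (v == false ? 0 : v) }
  order == :descending ? b <=> a : a <=> b
end

def sort_columns(columns, *sort_keys)
  specs = sort_keys.map { |key| parse_sort_key(key) }
  rows = columns.values.transpose.each_with_index.to_a
  sorted = rows.sort do |(row_a, i), (row_b, j)|
    diffs = specs.map do |name, order|
      at = columns.keys.index(name)
      compare(row_a[at], row_b[at], order)
    end
    diffs.find(&:nonzero?) || (i <=> j) # fall back to the original position
  end
  columns.keys.zip(sorted.map(&:first).transpose).to_h
end

columns = {
  index: [1, 1, 0, nil, 0],
  string: ['C', 'B', nil, 'A', 'B'],
  bool: [nil, true, false, true, false]
}
sort_columns(columns, :index, '-bool')[:string] # => [nil, "B", "B", "C", "A"]
```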
749
+ - [ ] Clamp
750
+
751
+ - [ ] Clear data
752
+
753
+ ## Treat na data
754
+
755
+ ### `remove_nil`
756
+
757
+ Remove any observations containing nil.
758
+
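In terms of plain data, removing observations that contain nil amounts to rejecting every row with at least one nil. A minimal sketch on a Hash of columns follows; it is a model only and shares nothing but the name with `DataFrame#remove_nil`.

```ruby
# Plain-Ruby model of `remove_nil` on a Hash of columns.
def remove_nil(columns)
  rows = columns.values.transpose.reject { |row| row.any?(&:nil?) }
  # Rebuild each column from the surviving rows (works for zero rows too).
  columns.keys.each_with_index.to_h { |key, i| [key, rows.map { |row| row[i] }] }
end

remove_nil({ a: [1, nil, 3], b: ['A', 'B', nil] }) # => {:a=>[1], :b=>["A"]}
```

Note that `Float::NAN` is not nil, so a row holding only NaN survives in this model.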
759
+ ## Grouping
760
+
761
+ ### `group(aggregating_keys, function, target_keys)`
762
+
763
+ (This is a temporary API and may change in a future version.)
764
+
765
+ Creates a grouped dataframe by `aggregating_keys`, applies `function` to each group, and returns the results for `target_keys`. Aggregated key names are in `function(key)` style.
766
+
767
+ (The current implementation is not intuitive. Needs improvement.)
768
+
769
+ ```ruby
770
+ ds = Datasets::Rdatasets.new('dplyr', 'starwars')
771
+ starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
772
+ starwars.tdr(11)
773
+ # =>
774
+ RedAmber::DataFrame : 87 x 11 Vectors
775
+ Vectors : 3 numeric, 8 strings
776
+ # key type level data_preview
777
+ 1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
778
+ 2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
779
+ 3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
780
+ 4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
781
+ 5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
782
+ 6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
783
+ 7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
784
+ 8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
785
+ 9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
786
+ 10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
787
+ 11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
788
+
789
+ grouped = starwars.group(:species, :mean, [:mass, :height])
790
+ # =>
791
+ #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
792
+ Vectors : 2 numeric, 1 string
793
+ # key type level data_preview
794
+ 1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
795
+ 2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
796
+ 3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
797
+
798
+ count = starwars.group(:species, :count, :species)[:"count(species)"]
799
+ df = grouped.slice(count > 1)
800
+ # =>
801
+ #<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
802
+ Vectors : 2 numeric, 1 string
803
+ # key type level data_preview
804
+ 1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
805
+ 2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
806
+ 3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
807
+
808
+ df.table
809
+ # =>
810
+ #<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
811
+ mean(mass) mean(height) species
812
+ 0 82.781818 176.645161 Human
813
+ 1 69.750000 131.200000 Droid
814
+ 2 124.000000 231.000000 Wookiee
815
+ 3 74.000000 208.666667 Gungan
816
+ 4 80.000000 173.000000 Zabrak
817
+ 5 55.000000 179.000000 Twi'lek
818
+ 6 53.100000 168.000000 Mirialan
819
+ 7 88.000000 221.000000 Kaminoan
820
+ ```
821
+
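The shape of the `group` result (aggregated columns named `function(key)`, followed by the grouping key) can be reproduced with a plain-Ruby model. This is a sketch under stated assumptions, not RedAmber's implementation: only `count`, `sum` and `mean` are modeled, nil values are skipped, and the sample numbers below are illustrative rather than taken from the starwars dataset.

```ruby
# Plain-Ruby model of group(aggregating_key, function, target_keys).
def group(columns, group_key, function, target_keys)
  reducers = {
    count: ->(xs) { xs.size },
    sum: ->(xs) { xs.sum },
    mean: ->(xs) { xs.empty? ? nil : xs.sum.to_f / xs.size }
  }
  reducer = reducers.fetch(function)
  # Map each distinct group value to the row positions where it occurs.
  positions = columns[group_key].each_index.group_by { |i| columns[group_key][i] }
  result = Array(target_keys).to_h do |key|
    values = positions.values.map { |idx| reducer.call(columns[key].values_at(*idx).compact) }
    [:"#{function}(#{key})", values]
  end
  result.merge(group_key => positions.keys)
end

columns = {
  species: %w[Human Droid Human Droid],
  mass: [77.0, 75.0, 136.0, 32.0],
  height: [172, 167, 202, 96]
}
group(columns, :species, :mean, %i[mass height])
# => {:"mean(mass)"=>[106.5, 53.5], :"mean(height)"=>[187.0, 131.5], :species=>["Human", "Droid"]}
```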
822
+ Available functions are:
823
+
824
+ - [ ] all
825
+ - [ ] any
826
+ - [ ] approximate_median
827
+ - ✓ count
828
+ - [ ] count_distinct
829
+ - [ ] distinct
830
+ - ✓ max
831
+ - ✓ mean
832
+ - ✓ min
833
+ - [ ] min_max
834
+ - ✓ product
835
+ - ✓ stddev
836
+ - ✓ sum
837
+ - [ ] tdigest
838
+ - ✓ variance
839
+
840
+ ## Combining DataFrames
841
+
842
+ - [ ] obs
843
+
844
+ - [ ] Add vars
845
+
846
+ - [ ] Inner join
847
+
848
+ - [ ] Left join
849
+
850
+ ## Encoding
851
+
852
+ - [ ] One-hot encoding
853
+
854
+ ## Iteration (not implemented)