red_amber 0.1.3 → 0.1.4

data/doc/DataFrame.md ADDED
@@ -0,0 +1,690 @@
1
+ # DataFrame
2
+
3
+ Class `RedAmber::DataFrame` represents 2D data. A `DataFrame` consists of:
4
+ - A collection of data of the same data type. We call it a `Vector`.
5
+ - A label attached to a `Vector`. We call it a `key`.
6
+ - A `Vector` and its associated `key`, grouped as a `variable`.
7
+ - `variable`s with the same vector length, aligned and arranged to form a `DataFrame`.
8
+ - The elements at the same position in each `Vector` of a `DataFrame`, which form a set of related data. We call it an `observation`.
9
+
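The terms above can be sketched with plain Ruby collections. This is only an illustrative model (a Hash of Arrays), not how RedAmber stores data; the variable names are made up for the example.

```ruby
# Plain-Ruby model of the terms above; this is not RedAmber's implementation.
variables = {
  x: [1, 2, 3],   # key :x with its vector
  y: %w[A B C]    # key :y with its vector
}

keys = variables.keys        # => [:x, :y]
vectors = variables.values   # => [[1, 2, 3], ["A", "B", "C"]]

# All vectors must have the same length to line up as a table.
lengths = vectors.map(&:size).uniq
raise ArgumentError, 'vectors differ in length' unless lengths.size == 1

# An observation is the set of elements at the same position in every vector.
observations = vectors.transpose         # => [[1, "A"], [2, "B"], [3, "C"]]
shape = [observations.size, keys.size]   # => [3, 2]
```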
10
+ ![dataframe model image](doc/../image/dataframe_model.png)
11
+
12
+ ## Constructors and saving
13
+
14
+ ### `new` from a columnar Hash
15
+
16
+ ```ruby
17
+ RedAmber::DataFrame.new(x: [1, 2, 3])
18
+ ```
19
+
20
+ ### `new` from a schema (by Hash) and rows (by Array)
21
+
22
+ ```ruby
23
+ RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
24
+ ```
25
+
26
+ ### `new` from an Arrow::Table
27
+
28
+
29
+ ```ruby
30
+ table = Arrow::Table.new(x: [1, 2, 3])
31
+ RedAmber::DataFrame.new(table)
32
+ ```
33
+
34
+ ### `new` from a Rover::DataFrame
35
+
36
+
37
+ ```ruby
38
+ rover = Rover::DataFrame.new(x: [1, 2, 3])
39
+ RedAmber::DataFrame.new(rover)
40
+ ```
41
+
42
+ ### `load` (class method)
43
+
44
+ - from a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
45
+
46
+ ```ruby
47
+ RedAmber::DataFrame.load("test/entity/with_header.csv")
48
+ ```
49
+
50
+ - from a string buffer
51
+
52
+ - from a URI
53
+
54
+ ```ruby
55
+ uri = URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv")
56
+ RedAmber::DataFrame.load(uri)
57
+ ```
58
+
59
+ - from a Parquet file
60
+
61
+ ```ruby
62
+ dataframe = RedAmber::DataFrame.load("file.parquet")
63
+ ```
64
+
65
+ ### `save` (instance method)
66
+
67
+ - to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
68
+
69
+ - to a string buffer
70
+
71
+ - to a URI
72
+
73
+ - to a Parquet file
74
+
75
+ ```ruby
76
+ dataframe.save("file.parquet")
77
+ ```
78
+
79
+ ## Properties
80
+
81
+ ### `table`
82
+
83
+ - Returns the underlying `Arrow::Table` object.
84
+
85
+ ### `size`, `n_obs`, `n_rows`
86
+
87
+ - Returns the size of each Vector (the number of observations).
88
+
89
+ ### `n_keys`, `n_vars`, `n_cols`
90
+
91
+ - Returns the number of keys (the number of variables).
92
+
93
+ ### `shape`
94
+
95
+ - Returns the shape as an Array `[n_rows, n_cols]`.
96
+
97
+ ### `keys`, `var_names`, `column_names`
98
+
99
+ - Returns key names in an Array.
100
+
101
+ ### `types`
102
+
103
+ - Returns types of vectors in an Array of Symbols.
104
+
105
+ ### `data_types`
106
+
107
+ - Returns the types of the vectors in an Array of `Arrow::DataType`.
108
+
109
+ ### `vectors`
110
+
111
+ - Returns an Array of Vectors.
112
+
113
+ ### `indexes`, `indices`
114
+
115
+ - Returns all indexes in a Range.
116
+
117
+ ### `to_h`
118
+
119
+ - Returns column-oriented data in a Hash.
120
+
121
+ ### `to_a`, `raw_records`
122
+
123
+ - Returns an array of row-oriented data without header.
124
+
125
+ If you need a column-oriented full array, use `.to_h.to_a`.
126
+
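The difference between the row-oriented and column-oriented results can be sketched with a plain Hash standing in for the result of `to_h`. This is an analogy under that assumption, not RedAmber code.

```ruby
# A plain Hash standing in for what `to_h` is described as returning.
columns = { a: [1, 2, 3], b: %w[A B C] }

# Row-oriented records without a header, as described for `to_a` / `raw_records`.
rows = columns.values.transpose
# => [[1, "A"], [2, "B"], [3, "C"]]

# Column-oriented array that keeps the keys: the plain-Hash analogue of `.to_h.to_a`.
column_array = columns.to_a
# => [[:a, [1, 2, 3]], [:b, ["A", "B", "C"]]]
```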
127
+ ### `schema`
128
+
129
+ - Returns column name and data type in a Hash.
130
+
131
+ ### `==`
132
+
133
+ ### `empty?`
134
+
135
+ ## Output
136
+
137
+ ### `to_s`
138
+
139
+ ### `summary`, `describe` (not implemented)
140
+
141
+ ### `to_rover`
142
+
143
+ - Returns a `Rover::DataFrame`.
144
+
145
+ ### `tdr(limit = 10, tally: 5, elements: 5)`
146
+
147
+ - Shows some information about self in a transposed style.
148
+ - `tdr_str` returns the same information as a String.
149
+
150
+ ```ruby
151
+ require 'red_amber'
152
+ require 'datasets-arrow'
153
+
154
+ penguins = RedAmber::DataFrame.new(Datasets::Penguins.new.to_arrow)
155
+ penguins.tdr
156
+ # =>
157
+ RedAmber::DataFrame : 344 x 8 Vectors
158
+ Vectors : 5 numeric, 3 strings
159
+ # key type level data_preview
160
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
161
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
162
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
163
+ 4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
164
+ 5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
165
+ 6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
166
+ 7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
167
+ 8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
168
+ ```
169
+
170
+ - limit: maximum number of variables to show. Default is 10.
171
+ - tally: maximum level at which tally mode is used. Default is 5.
172
+ - elements: maximum number of elements to show in each data preview. Default is 5.
173
+
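One way to picture how `tally` and `elements` interact is the sketch below. The helper `preview` is hypothetical and written in plain Ruby; in particular, the rule that tally mode applies when the number of distinct values is at most `tally` is an assumption inferred from the sample output, not taken from RedAmber's source.

```ruby
# Hypothetical helper, not part of RedAmber: picks the data shown in a preview.
def preview(values, tally: 5, elements: 5)
  counts = values.tally
  # Assumption: tally mode applies when the number of distinct values <= tally.
  return counts if counts.size <= tally

  # Otherwise show only the leading elements.
  values.first(elements)
end

species = %w[Adelie Adelie Gentoo Chinstrap Gentoo]
tallied = preview(species)            # => {"Adelie"=>2, "Gentoo"=>2, "Chinstrap"=>1}
listed  = preview(species, tally: 0)  # => the first 5 elements as an Array
```

With `tally: 0` no vector can qualify for tally mode, which is why that setting always lists elements.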
174
+ ### `inspect`
175
+
176
+ - Returns the same information as `tdr(3)`, and also shows the object id.
177
+
178
+ ```ruby
179
+ puts penguins.inspect
180
+ # =>
181
+ #<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
182
+ Vectors : 5 numeric, 3 strings
183
+ # key type level data_preview
184
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
185
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
186
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
187
+ ... 5 more Vectors ...
188
+ ```
189
+
190
+ ## Selecting
191
+
192
+ ### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
193
+ - Key in a Symbol: `df[:symbol]`
194
+ - Key in a String: `df["string"]`
195
+ - Keys in an Array: `df[:symbol1, "string", :symbol2]`
196
+ - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
197
+
198
+ Key indices must be given via `keys[i]` because bare numbers select observations (rows).
199
+
200
+ - Keys by a Range:
201
+
202
+ If the keys can be expressed as a Range, the Range can be included in the arguments. See the example below.
203
+
204
+ - You can also change the order of variables (columns).
205
+
206
+ ```ruby
207
+ hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
208
+ df = RedAmber::DataFrame.new(hash)
209
+ df[:b..:c, "a"]
210
+ # =>
211
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
212
+ Vectors : 2 numeric, 1 string
213
+ # key type level data_preview
214
+ 1 :b string 3 ["A", "B", "C"]
215
+ 2 :c double 3 [1.0, 2.0, 3.0]
216
+ 3 :a uint8 3 [1, 2, 3]
217
+ ```
218
+
219
+ If `#[]` selects a single variable (column), it returns a Vector object.
220
+
221
+ ```ruby
222
+ df[:a]
223
+ # =>
224
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
225
+ [1, 2, 3]
226
+ ```
227
+ This can be useful inside a block of DataFrame manipulations.
228
+
229
+ ### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
230
+
231
+ - Select an obs. by index: `df[0]`
232
+ - Select obs. by indices in a Range: `df[1..2]`
233
+
234
+ An endless or a beginless Range can be used to represent indices.
235
+
236
+ - Select obs. by indices in an Array: `df[1, 2]`
237
+ - Mixed case: `df[2, 0..]`
238
+
239
+ ```ruby
240
+ hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
241
+ df = RedAmber::DataFrame.new(hash)
242
+ df[2, 0..].tdr(tally: 0)
243
+ # =>
244
+ RedAmber::DataFrame : 4 x 3 Vectors
245
+ Vectors : 2 numeric, 1 string
246
+ # key type level data_preview
247
+ 1 :a uint8 3 [3, 1, 2, 3]
248
+ 2 :b string 3 ["C", "A", "B", "C"]
249
+ 3 :c double 3 [3.0, 1.0, 2.0, 3.0]
250
+ ```
251
+
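How a mixed argument list such as `df[2, 0..]` turns into row indices can be sketched in plain Ruby. `expand_indices` is a hypothetical helper written for this illustration; it is not RedAmber's implementation.

```ruby
# Hypothetical helper, not RedAmber's code: expand Integer and Range arguments
# into explicit row indices for a table with `size` rows.
def expand_indices(size, *args)
  all = (0...size).to_a
  args.flat_map do |arg|
    case arg
    when Range   then all[arg]          # Array#[] accepts endless and negative ranges
    when Integer then [all.fetch(arg)]  # raises IndexError when out of bounds
    else raise ArgumentError, "unsupported index: #{arg.inspect}"
    end
  end
end

mixed = expand_indices(3, 2, 0..)              # => [2, 0, 1, 2]
edges = expand_indices(344, 0...5, -5..-1)     # => [0, 1, 2, 3, 4, 339, 340, 341, 342, 343]
```

The first result repeats index 2, which matches the 4-row output shown for the mixed case.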
252
+ - Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
253
+
254
+ It returns a sub DataFrame containing the observations where the boolean is true.
255
+
256
+ ```ruby
257
+ # with the same dataframe `df` above
258
+ df[true, false, nil] # or
259
+ df[[true, false, nil]] # or
260
+ df[RedAmber::Vector.new([true, false, nil])]
261
+ # =>
262
+ #<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
263
+ Vectors : 2 numeric, 1 string
264
+ # key type level data_preview
265
+ 1 :a uint8 1 [1]
266
+ 2 :b string 1 ["A"]
267
+ 3 :c double 1 [1.0]
268
+ ```
269
+
270
+ ### Select rows from top or bottom
271
+
272
+ `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
273
+
274
+ ## Sub DataFrame manipulations
275
+
276
+ ### `pick`
277
+
278
+ Pick some variables (columns) to create a sub DataFrame.
279
+
280
+ ![pick method image](doc/../image/dataframe/pick.png)
281
+
282
+ - Keys as arguments
283
+
284
+ `pick(keys)` accepts keys as arguments in an Array.
285
+
286
+ ```ruby
287
+ penguins.pick(:species, :bill_length_mm)
288
+ # =>
289
+ #<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
290
+ Vectors : 1 numeric, 1 string
291
+ # key type level data_preview
292
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
293
+ 2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
294
+ ```
295
+
296
+ - Booleans as an argument
297
+
298
+ `pick(booleans)` accepts booleans as an argument in an Array. Booleans must have the same length as `n_keys`.
299
+
300
+ ```ruby
301
+ penguins.pick(penguins.types.map { |type| type == :string })
302
+ # =>
303
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
304
+ Vectors : 3 strings
305
+ # key type level data_preview
306
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
307
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
308
+ 3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
309
+ ```
310
+
311
+ - Keys or booleans by a block
312
+
313
+ `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
314
+
315
+ ```ruby
316
+ penguins.pick { keys.map { |key| key.end_with?('mm') } }
317
+ # =>
318
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
319
+ Vectors : 3 numeric
320
+ # key type level data_preview
321
+ 1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
322
+ 2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
323
+ 3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
324
+ ```
325
+
326
+ ### `drop`
327
+
328
+ Drop some variables (columns) to create a DataFrame of the remaining variables.
329
+
330
+ ![drop method image](doc/../image/dataframe/drop.png)
331
+
332
+ - Keys as arguments
333
+
334
+ `drop(keys)` accepts keys as arguments in an Array.
335
+
336
+ - Booleans as an argument
337
+
338
+ `drop(booleans)` accepts booleans as an argument in an Array. Booleans must have the same length as `n_keys`.
339
+
340
+ - Keys or booleans by a block
341
+
342
+ `drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
343
+
344
+ - Notice for nil
345
+
346
+ When used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
347
+
348
+ ```ruby
349
+ booleans = [true, false, nil]
350
+ booleans_invert = booleans.map(&:!) # => [false, true, true]
351
+ df.pick(booleans) == df.drop(booleans_invert) # => true
352
+ ```
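The complementarity of `pick` and `drop` under this nil rule can be checked with a plain-Ruby model. The helpers `pick_by` and `drop_by` below are hypothetical and operate on a Hash of Arrays; they only mimic the described behavior.

```ruby
# Plain-Ruby model, not RedAmber's code: a table as a Hash of key => values.
table = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2.0, 3.0] }

# Keys whose flag is truthy; a nil flag counts as false.
def flagged_keys(table, booleans)
  table.keys.zip(booleans).select { |_key, flag| flag }.map(&:first)
end

# Hypothetical helper: keep only the flagged columns.
def pick_by(table, booleans)
  table.slice(*flagged_keys(table, booleans))
end

# Hypothetical helper: reject the flagged columns.
def drop_by(table, booleans)
  dropped = flagged_keys(table, booleans)
  table.reject { |key, _values| dropped.include?(key) }
end

booleans = [true, false, nil]
inverted = booleans.map(&:!)                    # => [false, true, true]
picked   = pick_by(table, booleans)             # keeps only :a
same     = picked == drop_by(table, inverted)   # => true
```

Because nil never flags a column, `drop_by(table, booleans)` keeps both `:b` and `:c`.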
353
+ - Difference between `pick`/`drop` and `[]`
354
+
355
+ If `pick` or `drop` selects a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`.
356
+
357
+ ```ruby
358
+ df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
359
+ df[:a]
360
+ # =>
361
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
362
+ [1, 2, 3]
363
+
364
+ df.pick(:a) # or
365
+ df.drop(:b, :c)
366
+ # =>
367
+ #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
368
+ Vector : 1 numeric
369
+ # key type level data_preview
370
+ 1 :a uint8 3 [1, 2, 3]
371
+ ```
372
+
373
+ ### `slice`
374
+
375
+ Slice and select observations (rows) to create a sub DataFrame.
376
+
377
+ ![slice method image](doc/../image/dataframe/slice.png)
378
+
379
+ - Indices as arguments
380
+
381
+ `slice(indices)` accepts indices as arguments. Indices should be an Integer or a Range of Integer.
382
+
383
+ ```ruby
384
+ # returns 5 obs. at start and 5 obs. from end
385
+ penguins.slice(0...5, -5..-1)
386
+ # =>
387
+ #<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
388
+ Vectors : 5 numeric, 3 strings
389
+ # key type level data_preview
390
+ 1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
391
+ 2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
392
+ 3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
393
+ ... 5 more Vectors ...
394
+ ```
395
+
396
+ - Booleans as an argument
397
+
398
+ `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must have the same length as `size`.
399
+
400
+ ```ruby
401
+ vector = penguins[:bill_length_mm]
402
+ penguins.slice(vector >= 40)
403
+ # =>
404
+ #<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
405
+ Vectors : 5 numeric, 3 strings
406
+ # key type level data_preview
407
+ 1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
408
+ 2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
409
+ 3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
410
+ ... 5 more Vectors ...
411
+ ```
412
+
413
+ - Indices or booleans by a block
414
+
415
+ `slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
416
+
417
+ ```ruby
418
+ # returns a DataFrame whose bill_length_mm is within a 2*std range around the mean
419
+ penguins.slice do
420
+ vector = self[:bill_length_mm]
421
+ min = vector.mean - vector.std
422
+ max = vector.mean + vector.std
423
+ vector.to_a.map { |e| (min..max).include? e }
424
+ end
425
+ # =>
426
+ #<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
427
+ Vectors : 5 numeric, 3 strings
428
+ # key type level data_preview
429
+ 1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
430
+ 2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
431
+ 3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
432
+ ... 5 more Vectors ...
433
+ ```
434
+
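The boolean mask built inside such a block can be sketched without RedAmber. The values are made up, and `Vector#mean` and `Vector#std` are replaced by hand-written arithmetic; using the sample standard deviation is an assumption of this sketch.

```ruby
# Plain-Ruby sketch with made-up values; not RedAmber's Vector API.
values = [39.1, 39.5, 40.3, nil, 36.7, 46.0, 50.2]

present = values.compact
mean = present.sum / present.size
# Assumption: sample standard deviation (n - 1 in the denominator).
std = Math.sqrt(present.sum { |v| (v - mean)**2 } / (present.size - 1))

min = mean - std
max = mean + std

# One boolean per observation; nil is mapped to false explicitly.
mask = values.map { |v| !v.nil? && v.between?(min, max) }
selected = values.zip(mask).select { |_value, keep| keep }.map(&:first)
# selected => [39.1, 39.5, 40.3, 46.0]
```

The mask has the same length as the data, which is the requirement stated for booleans passed to `slice`.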
435
+ - Notice: nil option
436
+ - `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This propagates nil to the same row.
437
+
438
+ ```ruby
439
+ hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
440
+ table = Arrow::Table.new(hash)
441
+ table.slice([true, false, nil])
442
+ # =>
443
+ #<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
444
+ a b c
445
+ 0 1 A 1.000000
446
+ 1 (null) (null) (null)
447
+ ```
448
+
449
+ - Whereas in RedAmber, `DataFrame#slice` treats nil in booleans as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.
450
+
451
+ ```ruby
452
+ RedAmber::DataFrame.new(table).slice([true, false, nil]).table
453
+ # =>
454
+ #<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
455
+ a b c
456
+ 0 1 A 1.000000
457
+ ```
458
+
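The two null-selection behaviors contrasted above can be modeled in plain Ruby. `filter_rows` is a hypothetical helper over an Array of rows; it imitates the described outcomes and does not call Arrow.

```ruby
# Plain-Ruby model of the two null-selection behaviors; this is not Arrow's code.
rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
mask = [true, false, nil]

# Hypothetical helper: filter rows by a boolean mask that may contain nil.
def filter_rows(rows, mask, null_selection: :drop)
  rows.zip(mask).each_with_object([]) do |(row, flag), out|
    if flag
      out << row
    elsif flag.nil? && null_selection == :emit_null
      out << Array.new(row.size)   # a row of nils stands in for (null)
    end
  end
end

emitted = filter_rows(rows, mask, null_selection: :emit_null)
# => [[1, "A", 1.0], [nil, nil, nil]]
dropped = filter_rows(rows, mask)
# => [[1, "A", 1.0]]
```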
459
+ ### `remove`
460
+
461
+ Slice and reject observations (rows) to create a DataFrame of the remaining observations.
462
+
463
+ ![remove method image](doc/../image/dataframe/remove.png)
464
+
465
+ - Indices as arguments
466
+
467
+ `remove(indices)` accepts indices as arguments. Indices should be an Integer or a Range of Integer.
468
+
469
+ ```ruby
470
+ # returns 6th to 339th obs.
471
+ penguins.remove(0...5, -5..-1)
472
+ # =>
473
+ #<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
474
+ Vectors : 5 numeric, 3 strings
475
+ # key type level data_preview
476
+ 1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
477
+ 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
478
+ 3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
479
+ ... 5 more Vectors ...
480
+ ```
481
+
482
+ - Booleans as an argument
483
+
484
+ `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must have the same length as `size`.
485
+
486
+ ```ruby
487
+ # remove all observations containing nil
488
+ removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
489
+ removed.tdr
490
+ # =>
491
+ RedAmber::DataFrame : 342 x 8 Vectors
492
+ Vectors : 5 numeric, 3 strings
493
+ # key type level data_preview
494
+ 1 :species string 3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
495
+ 2 :island string 3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
496
+ 3 :bill_length_mm double 164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
497
+ 4 :bill_depth_mm double 80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
498
+ 5 :flipper_length_mm int64 55 [181, 186, 195, 193, 190, ... ]
499
+ 6 :body_mass_g int64 94 [3750, 3800, 3250, 3450, 3650, ... ]
500
+ 7 :sex string 3 {"male"=>168, "female"=>165, ""=>9}
501
+ 8 :year int64 3 {2007=>109, 2008=>114, 2009=>119}
502
+ ```
503
+
504
+ - Indices or booleans by a block
505
+
506
+ `remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
507
+
508
+ ```ruby
509
+ penguins.remove do
510
+ vector = self[:bill_length_mm]
511
+ min = vector.mean - vector.std
512
+ max = vector.mean + vector.std
513
+ vector.to_a.map { |e| (min..max).include? e }
514
+ end
515
+ # =>
516
+ #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
517
+ Vectors : 5 numeric, 3 strings
518
+ # key type level data_preview
519
+ 1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
520
+ 2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
521
+ 3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
522
+ ... 5 more Vectors ...
523
+ ```
524
+ - Notice for nil
525
+ - When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
526
+
527
+ ```ruby
528
+ df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
529
+ booleans = df[:a] < 2
530
+ # =>
531
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
532
+ [true, false, nil]
533
+
534
+ booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
535
+ df.slice(booleans) == df.remove(booleans_invert) # => true
536
+ ```
537
+ - In contrast, `Vector#invert` returns nil for nil elements. This produces a different result.
538
+
539
+ ```ruby
540
+ booleans.invert
541
+ # =>
542
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
543
+ [false, true, nil]
544
+
545
+ df.remove(booleans.invert)
546
+ #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
547
+ Vectors : 2 numeric, 1 string
548
+ # key type level data_preview
549
+ 1 :a uint8 2 [1, nil], 1 nil
550
+ 2 :b string 2 ["A", "C"]
551
+ 3 :c double 2 [1.0, 3.0]
552
+ ```
553
+
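The two kinds of inversion can be compared in plain Ruby. `remove_rows` is a hypothetical helper that treats nil flags as false, as described above; the nil-preserving inversion is written by hand to mirror what `Vector#invert` is said to return.

```ruby
booleans = [true, false, nil]

# Ruby's `!` turns nil into true.
ruby_invert = booleans.map(&:!)                          # => [false, true, true]
# A nil-preserving inversion keeps nil as nil.
kleene_invert = booleans.map { |b| b.nil? ? nil : !b }   # => [false, true, nil]

# Hypothetical helper: reject the rows whose flag is true; nil counts as false.
def remove_rows(rows, flags)
  rows.zip(flags).reject { |_row, flag| flag }.map(&:first)
end

rows = [[1, 'A'], [2, 'B'], [nil, 'C']]
by_ruby   = remove_rows(rows, ruby_invert)     # => [[1, "A"]]
by_kleene = remove_rows(rows, kleene_invert)   # => [[1, "A"], [nil, "C"]]
```

The row whose flag is nil survives only in the second case, which is the difference shown in the output above.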
554
+ ### `rename`
555
+
556
+ Rename keys (column names) to create an updated DataFrame.
557
+
558
+ ![rename method image](doc/../image/dataframe/rename.png)
559
+
560
+ - Key pairs as arguments
561
+
562
+ `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
563
+
564
+ ```ruby
565
+ h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
566
+ df = RedAmber::DataFrame.new(h)
567
+ df.rename(:age => :age_in_1993)
568
+ # =>
569
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
570
+ Vectors : 1 numeric, 1 string
571
+ # key type level data_preview
572
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
573
+ 2 :age_in_1993 uint8 3 [68, 49, 28]
574
+ ```
575
+
576
+ - Key pairs by a block
577
+
578
+ `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. The block is called in the context of self.
579
+
580
+ - Key type
581
+
582
+ Symbol key and String key are distinguished.
583
+
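As a rough analogy, renaming with `{existing_key => new_key}` pairs resembles transforming the keys of a plain Hash. The sketch below uses only core Ruby and assumes Symbol keys; it does not show how RedAmber itself resolves keys.

```ruby
# A plain Hash standing in for a DataFrame's columns (assumed Symbol keys).
columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
key_pairs = { age: :age_in_1993 }

# Keys missing from key_pairs are kept as they are; column order is preserved.
renamed = columns.transform_keys { |key| key_pairs.fetch(key, key) }
renamed.keys   # => [:name, :age_in_1993]

# In a plain Hash lookup a String key does not match a Symbol key,
# so these pairs leave the columns unchanged.
string_pairs = { 'age' => :age_in_1993 }
unchanged = columns.transform_keys { |key| string_pairs.fetch(key, key) }
unchanged.keys   # => [:name, :age]
```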
584
+ ### `assign`
585
+
586
+ Assign new variables (columns) and create an updated DataFrame.
587
+
588
+ - Variables with new keys will append new variables at the bottom (on the right in the table).
589
+ - Variables with existing keys will update the corresponding vectors.
590
+
591
+ ![assign method image](doc/../image/dataframe/assign.png)
592
+
593
+ - Variables as arguments
594
+
595
+ `assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
596
+
597
+ ```ruby
598
+ df = RedAmber::DataFrame.new(
599
+ 'name' => %w[Yasuko Rui Hinata],
600
+ 'age' => [68, 49, 28])
601
+ # =>
602
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
603
+ Vectors : 1 numeric, 1 string
604
+ # key type level data_preview
605
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
606
+ 2 :age uint8 3 [68, 49, 28]
607
+
608
+ # update :age and add :brother
609
+ assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
610
+ df.assign(assigner)
611
+ # =>
612
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
613
+ Vectors : 1 numeric, 2 strings
614
+ # key type level data_preview
615
+ 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
616
+ 2 :age uint8 3 [97, 78, 57]
617
+ 3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
618
+ ```
619
+
620
+ - Key pairs by a block
621
+
622
+ `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. The block is called in the context of self.
623
+
624
+ ```ruby
625
+ df = RedAmber::DataFrame.new(
626
+ index: [0, 1, 2, 3, nil],
627
+ float: [0.0, 1.1, 2.2, Float::NAN, nil],
628
+ string: ['A', 'B', 'C', 'D', nil])
629
+ # =>
630
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
631
+ Vectors : 2 numeric, 1 string
632
+ # key type level data_preview
633
+ 1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
634
+ 2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
635
+ 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
636
+
637
+ # update numeric variables
638
+ df.assign do
639
+ assigner = {}
640
+ vectors.each_with_index do |v, i|
641
+ assigner[keys[i]] = v * -1 if v.numeric?
642
+ end
643
+ assigner
644
+ end
645
+ # =>
646
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
647
+ Vectors : 2 numeric, 1 string
648
+ # key type level data_preview
649
+ 1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
650
+ 2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
651
+ 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
652
+ ```
653
+
654
+ - Key type
655
+
656
+ Symbol key and String key are considered as the same key.
657
+
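The update-or-append rule of `assign` behaves much like `Hash#merge` on a plain Hash of columns. The sketch below is an analogy in core Ruby, with keys normalized to Symbols to mimic the statement that Symbol and String keys count as the same key; it is not RedAmber's implementation.

```ruby
# A plain Hash standing in for a DataFrame's columns.
columns  = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
assigner = { 'age' => [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }

# Normalizing keys to Symbols makes 'age' and :age refer to the same variable.
normalized = assigner.transform_keys(&:to_sym)

# Hash#merge keeps existing keys in place and appends new keys at the end.
assigned = columns.merge(normalized)
assigned.keys   # => [:name, :age, :brother]
assigned[:age]  # => [97, 78, 57]
```

`merge` returns a new Hash and leaves `columns` untouched, in line with `assign` creating an updated DataFrame.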
658
+ ## Updating
659
+
660
+ - [ ] Update elements matching a condition
661
+
662
+ - [ ] Clamp
663
+
664
+ - [ ] Sort rows
665
+
666
+ - [ ] Clear data
667
+
668
+ ## Treat na data
669
+
670
+ - [ ] Drop na (NaN, nil)
671
+
672
+ - [ ] Replace na with value
673
+
674
+ - [ ] Interpolate na with convolution array
675
+
676
+ ## Combining DataFrames
677
+
678
+ - [ ] obs
679
+
680
+ - [ ] Add vars
681
+
682
+ - [ ] Inner join
683
+
684
+ - [ ] Left join
685
+
686
+ ## Encoding
687
+
688
+ - [ ] One-hot encoding
689
+
690
+ ## Iteration (not implemented)