red_amber 0.1.3 → 0.1.4
- checksums.yaml +4 -4
- data/.rubocop.yml +9 -4
- data/CHANGELOG.md +60 -8
- data/README.md +41 -349
- data/doc/DataFrame.md +690 -0
- data/doc/Vector.md +195 -0
- data/doc/image/TDR_operations.pdf +0 -0
- data/doc/image/arrow_table_new.png +0 -0
- data/doc/image/dataframe/assign.png +0 -0
- data/doc/image/dataframe/drop.png +0 -0
- data/doc/image/dataframe/pick.png +0 -0
- data/doc/image/dataframe/remove.png +0 -0
- data/doc/image/dataframe/rename.png +0 -0
- data/doc/image/dataframe/slice.png +0 -0
- data/doc/image/dataframe_model.png +0 -0
- data/doc/image/example_in_red_arrow.png +0 -0
- data/doc/image/tdr.png +0 -0
- data/doc/image/tdr_and_table.png +0 -0
- data/doc/image/tidy_data_in_TDR.png +0 -0
- data/doc/image/vector/binary_element_wise.png +0 -0
- data/doc/image/vector/unary_aggregation.png +0 -0
- data/doc/image/vector/unary_aggregation_w_option.png +0 -0
- data/doc/image/vector/unary_element_wise.png +0 -0
- data/doc/tdr.md +53 -0
- data/doc/tdr_ja.md +53 -0
- data/lib/red_amber/data_frame.rb +22 -15
- data/lib/red_amber/{data_frame_output.rb → data_frame_displayable.rb} +44 -37
- data/lib/red_amber/data_frame_helper.rb +64 -0
- data/lib/red_amber/data_frame_observation_operation.rb +72 -0
- data/lib/red_amber/data_frame_selectable.rb +21 -43
- data/lib/red_amber/data_frame_variable_operation.rb +133 -0
- data/lib/red_amber/vector_functions.rb +54 -29
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +4 -1
- metadata +27 -3
data/doc/DataFrame.md
ADDED
@@ -0,0 +1,690 @@
|
|
1
|
+
# DataFrame
|
2
|
+
|
3
|
+
Class `RedAmber::DataFrame` represents 2D data. A `DataFrame` consists of:
|
4
|
+
- A collection of data that all share the same data type. We call it a `Vector`.
|
5
|
+
- A label is attached to each `Vector`. We call it a `key`.
|
6
|
+
- A `Vector` and its associated `key` are grouped as a `variable`.
|
7
|
+
- `variable`s with the same vector length are aligned and arranged to form a `DataFrame`.
|
8
|
+
- The data at the same position in each `Vector` of a `DataFrame` form a related set. We call it an `observation`.
|
9
|
+
|
10
|
+

|
11
|
+
|
12
|
+
## Constructors and saving
|
13
|
+
|
14
|
+
### `new` from a columnar Hash
|
15
|
+
|
16
|
+
```ruby
|
17
|
+
RedAmber::DataFrame.new(x: [1, 2, 3])
|
18
|
+
```
|
19
|
+
|
20
|
+
### `new` from a schema (by Hash) and rows (by Array)
|
21
|
+
|
22
|
+
```ruby
|
23
|
+
RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
|
24
|
+
```
|
25
|
+
|
26
|
+
### `new` from an Arrow::Table
|
27
|
+
|
28
|
+
|
29
|
+
```ruby
|
30
|
+
table = Arrow::Table.new(x: [1, 2, 3])
|
31
|
+
RedAmber::DataFrame.new(table)
|
32
|
+
```
|
33
|
+
|
34
|
+
### `new` from a Rover::DataFrame
|
35
|
+
|
36
|
+
|
37
|
+
```ruby
|
38
|
+
rover = Rover::DataFrame.new(x: [1, 2, 3])
|
39
|
+
RedAmber::DataFrame.new(rover)
|
40
|
+
```
|
41
|
+
|
42
|
+
### `load` (class method)
|
43
|
+
|
44
|
+
- from a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
|
45
|
+
|
46
|
+
```ruby
|
47
|
+
RedAmber::DataFrame.load("test/entity/with_header.csv")
|
48
|
+
```
|
49
|
+
|
50
|
+
- from a string buffer
|
51
|
+
|
52
|
+
- from a URI
|
53
|
+
|
54
|
+
```ruby
|
55
|
+
uri = URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv")
|
56
|
+
RedAmber::DataFrame.load(uri)
|
57
|
+
```
|
58
|
+
|
59
|
+
- from a Parquet file
|
60
|
+
|
61
|
+
```ruby
|
62
|
+
dataframe = RedAmber::DataFrame.load("file.parquet")
|
63
|
+
```
|
64
|
+
|
65
|
+
### `save` (instance method)
|
66
|
+
|
67
|
+
- to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
|
68
|
+
|
69
|
+
- to a string buffer
|
70
|
+
|
71
|
+
- to a URI
|
72
|
+
|
73
|
+
- to a Parquet file
|
74
|
+
|
75
|
+
```ruby
|
76
|
+
dataframe.save("file.parquet")
|
77
|
+
```
|
78
|
+
|
79
|
+
## Properties
|
80
|
+
|
81
|
+
### `table`
|
82
|
+
|
83
|
+
- Returns the `Arrow::Table` object held inside.
|
84
|
+
|
85
|
+
### `size`, `n_obs`, `n_rows`
|
86
|
+
|
87
|
+
- Returns the size of each Vector (the number of observations).
|
88
|
+
|
89
|
+
### `n_keys`, `n_vars`, `n_cols`
|
90
|
+
|
91
|
+
- Returns the number of keys (the number of variables).
|
92
|
+
|
93
|
+
### `shape`
|
94
|
+
|
95
|
+
- Returns the shape as an Array `[n_rows, n_cols]`.
|
96
|
+
|
97
|
+
### `keys`, `var_names`, `column_names`
|
98
|
+
|
99
|
+
- Returns key names in an Array.
|
100
|
+
|
101
|
+
### `types`
|
102
|
+
|
103
|
+
- Returns types of vectors in an Array of Symbols.
|
104
|
+
|
105
|
+
### `data_types`
|
106
|
+
|
107
|
+
- Returns the types of the vectors in an Array of `Arrow::DataType`.
|
108
|
+
|
109
|
+
### `vectors`
|
110
|
+
|
111
|
+
- Returns an Array of Vectors.
|
112
|
+
|
113
|
+
### `indexes`, `indices`
|
114
|
+
|
115
|
+
- Returns all indexes in a Range.
|
116
|
+
|
117
|
+
### `to_h`
|
118
|
+
|
119
|
+
- Returns column-oriented data in a Hash.
|
120
|
+
|
121
|
+
### `to_a`, `raw_records`
|
122
|
+
|
123
|
+
- Returns an array of row-oriented data without header.
|
124
|
+
|
125
|
+
If you need a column-oriented full array, use `.to_h.to_a`.
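
The difference between the two layouts can be sketched in plain Ruby. This does not call RedAmber; the `columns` Hash is a hypothetical stand-in for a DataFrame's data, used only to show what "row-oriented" and "column-oriented" mean here.

```ruby
# Hypothetical column-oriented data, modeled as an ordinary Hash.
columns = { a: [1, 2, 3], b: %w[A B C] }

# Row-oriented records without a header: transpose the column arrays.
rows = columns.values.transpose
# rows => [[1, "A"], [2, "B"], [3, "C"]]

# Column-oriented full array: each entry pairs a key with its whole column.
column_pairs = columns.to_a
# column_pairs => [[:a, [1, 2, 3]], [:b, ["A", "B", "C"]]]
```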
|
126
|
+
|
127
|
+
### `schema`
|
128
|
+
|
129
|
+
- Returns column name and data type in a Hash.
|
130
|
+
|
131
|
+
### `==`
|
132
|
+
|
133
|
+
### `empty?`
|
134
|
+
|
135
|
+
## Output
|
136
|
+
|
137
|
+
### `to_s`
|
138
|
+
|
139
|
+
### `summary`, `describe` (not implemented)
|
140
|
+
|
141
|
+
### `to_rover`
|
142
|
+
|
143
|
+
- Returns a `Rover::DataFrame`.
|
144
|
+
|
145
|
+
### `tdr(limit = 10, tally: 5, elements: 5)`
|
146
|
+
|
147
|
+
- Shows some information about self in a transposed style.
|
148
|
+
- `tdr_str` returns the same information as a String.
|
149
|
+
|
150
|
+
```ruby
|
151
|
+
require 'red_amber'
|
152
|
+
require 'datasets-arrow'
|
153
|
+
|
154
|
+
penguins = Datasets::Penguins.new.to_arrow
|
155
|
+
RedAmber::DataFrame.new(penguins).tdr
|
156
|
+
# =>
|
157
|
+
RedAmber::DataFrame : 344 x 8 Vectors
|
158
|
+
Vectors : 5 numeric, 3 strings
|
159
|
+
# key type level data_preview
|
160
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
161
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
162
|
+
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
163
|
+
4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
|
164
|
+
5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
|
165
|
+
6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
|
166
|
+
7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
|
167
|
+
8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
|
168
|
+
```
|
169
|
+
|
170
|
+
- limit: maximum number of variables to show. The default is 10.
|
171
|
+
- tally: maximum level at which tally mode is used.
|
172
|
+
- elements: maximum number of elements to show for each variable.
|
173
|
+
|
174
|
+
### `inspect`
|
175
|
+
|
176
|
+
- Returns the same information as `tdr(3)`, along with the object id.
|
177
|
+
|
178
|
+
```ruby
|
179
|
+
puts penguins.inspect
|
180
|
+
# =>
|
181
|
+
#<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
|
182
|
+
Vectors : 5 numeric, 3 strings
|
183
|
+
# key type level data_preview
|
184
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
185
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
186
|
+
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
187
|
+
... 5 more Vectors ...
|
188
|
+
```
|
189
|
+
|
190
|
+
## Selecting
|
191
|
+
|
192
|
+
### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
|
193
|
+
- Key in a Symbol: `df[:symbol]`
|
194
|
+
- Key in a String: `df["string"]`
|
195
|
+
- Keys in an Array: `df[:symbol1, "string", :symbol2]`
|
196
|
+
- Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
|
197
|
+
|
198
|
+
Key indices must be passed via `keys[i]`, because bare numbers are used to select observations (rows).
|
199
|
+
|
200
|
+
- Keys by a Range:
|
201
|
+
|
202
|
+
If the keys can be expressed as a Range, the Range can be included in the arguments. See the example below.
|
203
|
+
|
204
|
+
- You can exchange the order of variables (columns).
|
205
|
+
|
206
|
+
```ruby
|
207
|
+
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
208
|
+
df = RedAmber::DataFrame.new(hash)
|
209
|
+
df[:b..:c, "a"]
|
210
|
+
# =>
|
211
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
|
212
|
+
Vectors : 2 numeric, 1 string
|
213
|
+
# key type level data_preview
|
214
|
+
1 :b string 3 ["A", "B", "C"]
|
215
|
+
2 :c double 3 [1.0, 2.0, 3.0]
|
216
|
+
3 :a uint8 3 [1, 2, 3]
|
217
|
+
```
|
218
|
+
|
219
|
+
If `#[]` selects a single variable (column), it returns a `Vector` object.
|
220
|
+
|
221
|
+
```ruby
|
222
|
+
df[:a]
|
223
|
+
# =>
|
224
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
225
|
+
[1, 2, 3]
|
226
|
+
```
|
227
|
+
This can be useful inside a block of DataFrame manipulations.
|
228
|
+
|
229
|
+
### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
|
230
|
+
|
231
|
+
- Select an obs. by index: `df[0]`
|
232
|
+
- Select obs. by indices in a Range: `df[1..2]`
|
233
|
+
|
234
|
+
An endless or a beginless Range can be used to represent indices.
|
235
|
+
|
236
|
+
- Select obs. by indices in an Array: `df[1, 2]`
|
237
|
+
- Mixed case: `df[2, 0..]`
|
238
|
+
|
239
|
+
```ruby
|
240
|
+
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
241
|
+
df = RedAmber::DataFrame.new(hash)
|
242
|
+
df[2, 0..].tdr(tally: 0)
|
243
|
+
# =>
|
244
|
+
RedAmber::DataFrame : 4 x 3 Vectors
|
245
|
+
Vectors : 2 numeric, 1 string
|
246
|
+
# key type level data_preview
|
247
|
+
1 :a uint8 3 [3, 1, 2, 3]
|
248
|
+
2 :b string 3 ["C", "A", "B", "C"]
|
249
|
+
3 :c double 3 [3.0, 1.0, 2.0, 3.0]
|
250
|
+
```
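
The index forms listed above (a single Integer, a Range, several indices, or a mix) can be pictured as expanding into one flat list of row positions. The sketch below is plain Ruby and is not RedAmber's implementation; `expand_indices` is a hypothetical helper introduced only for illustration.

```ruby
# Hypothetical helper: expand Integer and Range arguments into row positions
# for a table with `size` rows. Ranges may be endless or beginless.
def expand_indices(args, size)
  all = (0...size).to_a
  args.flat_map { |arg| arg.is_a?(Range) ? all[arg] : [arg] }
end

a = [1, 2, 3]

# The mixed case: index 2 followed by the endless Range 0..
positions = expand_indices([2, (0..)], a.size)
# positions => [2, 0, 1, 2]

selected = a.values_at(*positions)
# selected => [3, 1, 2, 3]
```

A position may appear more than once, which is why the preview of `:a` above has four elements even though the source has three observations.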
|
251
|
+
|
252
|
+
- Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
|
253
|
+
|
254
|
+
It returns a sub DataFrame containing the observations where the boolean is true.
|
255
|
+
|
256
|
+
```ruby
|
257
|
+
# with the same dataframe `df` above
|
258
|
+
df[true, false, nil] # or
|
259
|
+
df[[true, false, nil]] # or
|
260
|
+
df[RedAmber::Vector.new([true, false, nil])]
|
261
|
+
# =>
|
262
|
+
#<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
|
263
|
+
Vectors : 2 numeric, 1 string
|
264
|
+
# key type level data_preview
|
265
|
+
1 :a uint8 1 [1]
|
266
|
+
2 :b string 1 ["A"]
|
267
|
+
3 :c double 1 [1.0]
|
268
|
+
```
|
269
|
+
|
270
|
+
### Select rows from top or bottom
|
271
|
+
|
272
|
+
`head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
|
273
|
+
|
274
|
+
## Sub DataFrame manipulations
|
275
|
+
|
276
|
+
### `pick`
|
277
|
+
|
278
|
+
Pick up some variables (columns) to create a sub DataFrame.
|
279
|
+
|
280
|
+

|
281
|
+
|
282
|
+
- Keys as arguments
|
283
|
+
|
284
|
+
`pick(keys)` accepts keys as arguments in an Array.
|
285
|
+
|
286
|
+
```ruby
|
287
|
+
penguins.pick(:species, :bill_length_mm)
|
288
|
+
# =>
|
289
|
+
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
|
290
|
+
Vectors : 1 numeric, 1 string
|
291
|
+
# key type level data_preview
|
292
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
293
|
+
2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
294
|
+
```
|
295
|
+
|
296
|
+
- Booleans as an argument
|
297
|
+
|
298
|
+
`pick(booleans)` accepts booleans as an argument in an Array. The booleans must have the same length as `n_keys`.
|
299
|
+
|
300
|
+
```ruby
|
301
|
+
penguins.pick(penguins.types.map { |type| type == :string })
|
302
|
+
# =>
|
303
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
|
304
|
+
Vectors : 3 strings
|
305
|
+
# key type level data_preview
|
306
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
307
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
308
|
+
3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
|
309
|
+
```
|
310
|
+
|
311
|
+
- Keys or booleans by a block
|
312
|
+
|
313
|
+
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
314
|
+
|
315
|
+
```ruby
|
316
|
+
penguins.pick { keys.map { |key| key.end_with?('mm') } }
|
317
|
+
# =>
|
318
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
|
319
|
+
Vectors : 3 numeric
|
320
|
+
# key type level data_preview
|
321
|
+
1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
322
|
+
2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
|
323
|
+
3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
|
324
|
+
```
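
How a boolean mask over the keys splits the variables can be sketched in plain Ruby. This only models the selection on a list of keys and does not call RedAmber; the `keys` Array copies the penguins key names shown above, and `mask`, `picked` and `dropped` are illustrative names.

```ruby
keys = %i[species island bill_length_mm bill_depth_mm
          flipper_length_mm body_mass_g sex year]

# One boolean per key, the same length as the key list.
mask = keys.map { |key| key.to_s.end_with?('mm') }

# Keys where the mask is true (what a pick-style selection keeps).
picked = keys.select.with_index { |_, i| mask[i] }
# picked => [:bill_length_mm, :bill_depth_mm, :flipper_length_mm]

# Keys where the mask is not true (what a drop-style selection keeps).
dropped = keys.reject.with_index { |_, i| mask[i] }
# dropped => [:species, :island, :body_mass_g, :sex, :year]
```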
|
325
|
+
|
326
|
+
### `drop`
|
327
|
+
|
328
|
+
Drop some variables (columns) to create a remainder DataFrame.
|
329
|
+
|
330
|
+

|
331
|
+
|
332
|
+
- Keys as arguments
|
333
|
+
|
334
|
+
`drop(keys)` accepts keys as arguments in an Array.
|
335
|
+
|
336
|
+
- Booleans as an argument
|
337
|
+
|
338
|
+
`drop(booleans)` accepts booleans as an argument in an Array. The booleans must have the same length as `n_keys`.
|
339
|
+
|
340
|
+
- Keys or booleans by a block
|
341
|
+
|
342
|
+
`drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
343
|
+
|
344
|
+
- Notice for nil
|
345
|
+
|
346
|
+
When used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
|
347
|
+
|
348
|
+
```ruby
|
349
|
+
booleans = [true, false, nil]
|
350
|
+
booleans_invert = booleans.map(&:!) # => [false, true, true]
|
351
|
+
df.pick(booleans) == df.drop(booleans_invert) # => true
|
352
|
+
```
|
353
|
+
- Difference between `pick`/`drop` and `[]`
|
354
|
+
|
355
|
+
If `pick` or `drop` selects a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`.
|
356
|
+
|
357
|
+
```ruby
|
358
|
+
df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
|
359
|
+
df[:a]
|
360
|
+
# =>
|
361
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
|
362
|
+
[1, 2, 3]
|
363
|
+
|
364
|
+
df.pick(:a) # or
|
365
|
+
df.drop(:b, :c)
|
366
|
+
# =>
|
367
|
+
#<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
|
368
|
+
Vector : 1 numeric
|
369
|
+
# key type level data_preview
|
370
|
+
1 :a uint8 3 [1, 2, 3]
|
371
|
+
```
|
372
|
+
|
373
|
+
### `slice`
|
374
|
+
|
375
|
+
Slice and select observations (rows) to create a sub DataFrame.
|
376
|
+
|
377
|
+

|
378
|
+
|
379
|
+
- Indices as arguments
|
380
|
+
|
381
|
+
`slice(indices)` accepts indices as arguments. Each index should be an Integer or a Range of Integers.
|
382
|
+
|
383
|
+
```ruby
|
384
|
+
# returns 5 obs. at start and 5 obs. from end
|
385
|
+
penguins.slice(0...5, -5..-1)
|
386
|
+
# =>
|
387
|
+
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
|
388
|
+
Vectors : 5 numeric, 3 strings
|
389
|
+
# key type level data_preview
|
390
|
+
1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
|
391
|
+
2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
|
392
|
+
3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
393
|
+
... 5 more Vectors ...
|
394
|
+
```
|
395
|
+
|
396
|
+
- Booleans as an argument
|
397
|
+
|
398
|
+
`slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. The booleans must have the same length as `size`.
|
399
|
+
|
400
|
+
```ruby
|
401
|
+
vector = penguins[:bill_length_mm]
|
402
|
+
penguins.slice(vector >= 40)
|
403
|
+
# =>
|
404
|
+
#<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
|
405
|
+
Vectors : 5 numeric, 3 strings
|
406
|
+
# key type level data_preview
|
407
|
+
1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
|
408
|
+
2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
|
409
|
+
3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
|
410
|
+
... 5 more Vectors ...
|
411
|
+
```
|
412
|
+
|
413
|
+
- Indices or booleans by a block
|
414
|
+
|
415
|
+
`slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
|
416
|
+
|
417
|
+
```ruby
|
418
|
+
# return a DataFrame with bill_length_mm is in 2*std range around mean
|
419
|
+
penguins.slice do
|
420
|
+
vector = self[:bill_length_mm]
|
421
|
+
min = vector.mean - vector.std
|
422
|
+
max = vector.mean + vector.std
|
423
|
+
vector.to_a.map { |e| (min..max).include? e }
|
424
|
+
end
|
425
|
+
# =>
|
426
|
+
#<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
|
427
|
+
Vectors : 5 numeric, 3 strings
|
428
|
+
# key type level data_preview
|
429
|
+
1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
|
430
|
+
2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
|
431
|
+
3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
|
432
|
+
... 5 more Vectors ...
|
433
|
+
```
|
434
|
+
|
435
|
+
- Notice: nil option
|
436
|
+
- `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This propagates nil at the same row.
|
437
|
+
|
438
|
+
```ruby
|
439
|
+
hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
|
440
|
+
table = Arrow::Table.new(hash)
|
441
|
+
table.slice([true, false, nil])
|
442
|
+
# =>
|
443
|
+
#<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
|
444
|
+
a b c
|
445
|
+
0 1 A 1.000000
|
446
|
+
1 (null) (null) (null)
|
447
|
+
```
|
448
|
+
|
449
|
+
- In RedAmber, by contrast, nil in the booleans passed to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.
|
450
|
+
|
451
|
+
```ruby
|
452
|
+
RedAmber::DataFrame.new(table).slice([true, false, nil]).table
|
453
|
+
# =>
|
454
|
+
#<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
|
455
|
+
a b c
|
456
|
+
0 1 A 1.000000
|
457
|
+
```
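
The two null-selection behaviors described above can be modeled in plain Ruby. This is only a sketch, assuming rows are Arrays and the mask is an Array of `true`, `false` or `nil`; `filter_rows` and its `null_selection:` keyword are hypothetical names, not the Arrow or RedAmber API.

```ruby
# Hypothetical model of filtering rows with a boolean mask.
# :drop      - a nil in the mask drops the row (treated like false).
# :emit_null - a nil in the mask emits a row of nils in its place.
def filter_rows(rows, mask, null_selection: :drop)
  rows.zip(mask).each_with_object([]) do |(row, flag), out|
    if flag
      out << row
    elsif flag.nil? && null_selection == :emit_null
      out << Array.new(row.size) # all-nil row standing in for (null)
    end
  end
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
mask = [true, false, nil]

dropped = filter_rows(rows, mask)
# dropped => [[1, "A", 1.0]]

emitted = filter_rows(rows, mask, null_selection: :emit_null)
# emitted => [[1, "A", 1.0], [nil, nil, nil]]
```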
|
458
|
+
|
459
|
+
### `remove`
|
460
|
+
|
461
|
+
Slice and reject observations (rows) to create a remainder DataFrame.
|
462
|
+
|
463
|
+

|
464
|
+
|
465
|
+
- Indices as arguments
|
466
|
+
|
467
|
+
`remove(indices)` accepts indices as arguments. Each index should be an Integer or a Range of Integers.
|
468
|
+
|
469
|
+
```ruby
|
470
|
+
# returns 6th to 339th obs.
|
471
|
+
penguins.remove(0...5, -5..-1)
|
472
|
+
# =>
|
473
|
+
#<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
|
474
|
+
Vectors : 5 numeric, 3 strings
|
475
|
+
# key type level data_preview
|
476
|
+
1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
|
477
|
+
2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
|
478
|
+
3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
|
479
|
+
... 5 more Vectors ...
|
480
|
+
```
|
481
|
+
|
482
|
+
- Booleans as an argument
|
483
|
+
|
484
|
+
`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. The booleans must have the same length as `size`.
|
485
|
+
|
486
|
+
```ruby
|
487
|
+
# remove all observations that contain nil
|
488
|
+
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
|
489
|
+
removed.tdr
|
490
|
+
# =>
|
491
|
+
RedAmber::DataFrame : 342 x 8 Vectors
|
492
|
+
Vectors : 5 numeric, 3 strings
|
493
|
+
# key type level data_preview
|
494
|
+
1 :species string 3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
|
495
|
+
2 :island string 3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
|
496
|
+
3 :bill_length_mm double 164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
|
497
|
+
4 :bill_depth_mm double 80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
|
498
|
+
5 :flipper_length_mm int64 55 [181, 186, 195, 193, 190, ... ]
|
499
|
+
6 :body_mass_g int64 94 [3750, 3800, 3250, 3450, 3650, ... ]
|
500
|
+
7 :sex string 3 {"male"=>168, "female"=>165, ""=>9}
|
501
|
+
8 :year int64 3 {2007=>109, 2008=>114, 2009=>119}
|
502
|
+
```
|
503
|
+
|
504
|
+
- Indices or booleans by a block
|
505
|
+
|
506
|
+
`remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
|
507
|
+
|
508
|
+
```ruby
|
509
|
+
penguins.remove do
|
510
|
+
vector = self[:bill_length_mm]
|
511
|
+
min = vector.mean - vector.std
|
512
|
+
max = vector.mean + vector.std
|
513
|
+
vector.to_a.map { |e| (min..max).include? e }
|
514
|
+
end
|
515
|
+
# =>
|
516
|
+
#<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
|
517
|
+
Vectors : 5 numeric, 3 strings
|
518
|
+
# key type level data_preview
|
519
|
+
1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
|
520
|
+
2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
|
521
|
+
3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
|
522
|
+
... 5 more Vectors ...
|
523
|
+
```
|
524
|
+
- Notice for nil
|
525
|
+
- When `remove` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
|
526
|
+
|
527
|
+
```ruby
|
528
|
+
df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
|
529
|
+
booleans = df[:a] < 2
|
530
|
+
# =>
|
531
|
+
#<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
|
532
|
+
[true, false, nil]
|
533
|
+
|
534
|
+
booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
|
535
|
+
df.slice(booleans) == df.remove(booleans_invert) # => true
|
536
|
+
```
|
537
|
+
- In contrast, `Vector#invert` returns nil for nil elements, which produces a different result.
|
538
|
+
|
539
|
+
```ruby
|
540
|
+
booleans.invert
|
541
|
+
# =>
|
542
|
+
#<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
|
543
|
+
[false, true, nil]
|
544
|
+
|
545
|
+
df.remove(booleans.invert)
|
546
|
+
#<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
|
547
|
+
Vectors : 2 numeric, 1 string
|
548
|
+
# key type level data_preview
|
549
|
+
1 :a uint8 2 [1, nil], 1 nil
|
550
|
+
2 :b string 2 ["A", "C"]
|
551
|
+
3 :c double 2 [1.0, 3.0]
|
552
|
+
```
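
The difference between the two inversions can be reproduced with plain Arrays. The sketch below does not use RedAmber; `rows` is a hypothetical row-oriented copy of the data above, and `remove_rows` is an illustrative lambda that rejects the rows whose mask entry is true, so nil counts as false.

```ruby
booleans = [true, false, nil]

# Ruby's `!` turns nil into true.
ruby_invert = booleans.map(&:!)
# ruby_invert => [false, true, true]

# Null-preserving inversion: nil stays nil.
null_invert = booleans.map { |b| b.nil? ? nil : !b }
# null_invert => [false, true, nil]

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [nil, 'C', 3.0]]

# Reject rows whose mask entry is true; a nil entry keeps the row.
remove_rows = ->(mask) { rows.reject.with_index { |_, i| mask[i] } }

with_ruby_invert = remove_rows.call(ruby_invert)
# with_ruby_invert => [[1, "A", 1.0]]

with_null_invert = remove_rows.call(null_invert)
# with_null_invert => [[1, "A", 1.0], [nil, "C", 3.0]]
```

The row whose mask entry is nil survives only in the null-preserving case, which matches the two-observation result shown above.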
|
553
|
+
|
554
|
+
### `rename`
|
555
|
+
|
556
|
+
Rename keys (column names) to create an updated DataFrame.
|
557
|
+
|
558
|
+

|
559
|
+
|
560
|
+
- Key pairs as arguments
|
561
|
+
|
562
|
+
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
|
563
|
+
|
564
|
+
```ruby
|
565
|
+
h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
|
566
|
+
df = RedAmber::DataFrame.new(h)
|
567
|
+
df.rename(:age => :age_in_1993)
|
568
|
+
# =>
|
569
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
|
570
|
+
Vectors : 1 numeric, 1 string
|
571
|
+
# key type level data_preview
|
572
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
573
|
+
2 :age_in_1993 uint8 3 [68, 49, 28]
|
574
|
+
```
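
The `{existing_key => new_key}` mapping can be pictured with an ordinary Hash. This is a plain-Ruby sketch rather than RedAmber code; `variables` is a hypothetical stand-in for the DataFrame's key-to-values pairs.

```ruby
variables = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
key_pairs = { age: :age_in_1993 }

# Replace each key that has an entry in key_pairs; leave the others as they are.
# Hash#transform_keys returns a new Hash and preserves the original order.
renamed = variables.transform_keys { |key| key_pairs.fetch(key, key) }
# renamed.keys => [:name, :age_in_1993]
```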
|
575
|
+
|
576
|
+
- Key pairs by a block
|
577
|
+
|
578
|
+
`rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. The block is called in the context of self.
|
579
|
+
|
580
|
+
- Key type
|
581
|
+
|
582
|
+
Symbol keys and String keys are distinguished.
|
583
|
+
|
584
|
+
### `assign`
|
585
|
+
|
586
|
+
Assign new variables (columns) to create an updated DataFrame.
|
587
|
+
|
588
|
+
- Variables with new keys are appended at the bottom (the right side of the table).
|
589
|
+
- Variables with existing keys update the corresponding vectors.
|
590
|
+
|
591
|
+

|
592
|
+
|
593
|
+
- Variables as arguments
|
594
|
+
|
595
|
+
`assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
|
596
|
+
|
597
|
+
```ruby
|
598
|
+
df = RedAmber::DataFrame.new(
|
599
|
+
'name' => %w[Yasuko Rui Hinata],
|
600
|
+
'age' => [68, 49, 28])
|
601
|
+
# =>
|
602
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
|
603
|
+
Vectors : 1 numeric, 1 string
|
604
|
+
# key type level data_preview
|
605
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
606
|
+
2 :age uint8 3 [68, 49, 28]
|
607
|
+
|
608
|
+
# update :age and add :brother
|
609
|
+
assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
|
610
|
+
df.assign(assigner)
|
611
|
+
# =>
|
612
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
|
613
|
+
Vectors : 1 numeric, 2 strings
|
614
|
+
# key type level data_preview
|
615
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
616
|
+
2 :age uint8 3 [97, 78, 57]
|
617
|
+
3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
|
618
|
+
```
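
The update-or-append rule has the same shape as Ruby's `Hash#merge`, which makes for a compact mental model. The sketch below is plain Ruby, not RedAmber's implementation; `variables` is a hypothetical Hash standing in for the DataFrame.

```ruby
variables = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
assigner  = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }

# Hash#merge returns a new Hash: existing keys keep their position and
# take the new values, and new keys are appended at the end.
updated = variables.merge(assigner)
# updated.keys => [:name, :age, :brother]
# updated[:age] => [97, 78, 57]

# The original Hash is left unchanged.
# variables[:age] => [68, 49, 28]
```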
|
619
|
+
|
620
|
+
- Key pairs by a block
|
621
|
+
|
622
|
+
`assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. The block is called in the context of self.
|
623
|
+
|
624
|
+
```ruby
|
625
|
+
df = RedAmber::DataFrame.new(
|
626
|
+
index: [0, 1, 2, 3, nil],
|
627
|
+
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
628
|
+
string: ['A', 'B', 'C', 'D', nil])
|
629
|
+
# =>
|
630
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
|
631
|
+
Vectors : 2 numeric, 1 string
|
632
|
+
# key type level data_preview
|
633
|
+
1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
|
634
|
+
2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
|
635
|
+
3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
|
636
|
+
|
637
|
+
# update numeric variables
|
638
|
+
df.assign do
|
639
|
+
assigner = {}
|
640
|
+
vectors.each_with_index do |v, i|
|
641
|
+
assigner[keys[i]] = v * -1 if v.numeric?
|
642
|
+
end
|
643
|
+
assigner
|
644
|
+
end
|
645
|
+
# =>
|
646
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
|
647
|
+
Vectors : 2 numeric, 1 string
|
648
|
+
# key type level data_preview
|
649
|
+
1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
|
650
|
+
2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
|
651
|
+
3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
|
652
|
+
```
|
653
|
+
|
654
|
+
- Key type
|
655
|
+
|
656
|
+
Symbol keys and String keys are treated as the same key.
|
657
|
+
|
658
|
+
## Updating
|
659
|
+
|
660
|
+
- [ ] Update elements matching a condition
|
661
|
+
|
662
|
+
- [ ] Clamp
|
663
|
+
|
664
|
+
- [ ] Sort rows
|
665
|
+
|
666
|
+
- [ ] Clear data
|
667
|
+
|
668
|
+
## Treat na data
|
669
|
+
|
670
|
+
- [ ] Drop na (NaN, nil)
|
671
|
+
|
672
|
+
- [ ] Replace na with value
|
673
|
+
|
674
|
+
- [ ] Interpolate na with convolution array
|
675
|
+
|
676
|
+
## Combining DataFrames
|
677
|
+
|
678
|
+
- [ ] obs
|
679
|
+
|
680
|
+
- [ ] Add vars
|
681
|
+
|
682
|
+
- [ ] Inner join
|
683
|
+
|
684
|
+
- [ ] Left join
|
685
|
+
|
686
|
+
## Encoding
|
687
|
+
|
688
|
+
- [ ] One-hot encoding
|
689
|
+
|
690
|
+
## Iteration (not implemented)
|