red_amber 0.1.3 → 0.1.6
- checksums.yaml +4 -4
- data/.rubocop.yml +31 -7
- data/CHANGELOG.md +214 -10
- data/Gemfile +4 -0
- data/README.md +117 -342
- data/benchmark/csv_load_penguins.yml +15 -0
- data/benchmark/drop_nil.yml +11 -0
- data/doc/DataFrame.md +854 -0
- data/doc/Vector.md +449 -0
- data/doc/image/arrow_table_new.png +0 -0
- data/doc/image/dataframe/assign.png +0 -0
- data/doc/image/dataframe/drop.png +0 -0
- data/doc/image/dataframe/pick.png +0 -0
- data/doc/image/dataframe/remove.png +0 -0
- data/doc/image/dataframe/rename.png +0 -0
- data/doc/image/dataframe/slice.png +0 -0
- data/doc/image/dataframe_model.png +0 -0
- data/doc/image/example_in_red_arrow.png +0 -0
- data/doc/image/tdr.png +0 -0
- data/doc/image/tdr_and_table.png +0 -0
- data/doc/image/tidy_data_in_TDR.png +0 -0
- data/doc/image/vector/binary_element_wise.png +0 -0
- data/doc/image/vector/unary_aggregation.png +0 -0
- data/doc/image/vector/unary_aggregation_w_option.png +0 -0
- data/doc/image/vector/unary_element_wise.png +0 -0
- data/doc/tdr.md +56 -0
- data/doc/tdr_ja.md +56 -0
- data/lib/red-amber.rb +27 -0
- data/lib/red_amber/data_frame.rb +91 -37
- data/lib/red_amber/{data_frame_output.rb → data_frame_displayable.rb} +49 -41
- data/lib/red_amber/data_frame_indexable.rb +38 -0
- data/lib/red_amber/data_frame_observation_operation.rb +11 -0
- data/lib/red_amber/data_frame_selectable.rb +155 -48
- data/lib/red_amber/data_frame_variable_operation.rb +137 -0
- data/lib/red_amber/helper.rb +61 -0
- data/lib/red_amber/vector.rb +69 -16
- data/lib/red_amber/vector_functions.rb +80 -45
- data/lib/red_amber/vector_selectable.rb +124 -0
- data/lib/red_amber/vector_updatable.rb +104 -0
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +1 -16
- data/red_amber.gemspec +3 -6
- metadata +38 -9
data/doc/DataFrame.md
ADDED
@@ -0,0 +1,854 @@
|
|
1
|
+
# DataFrame
|
2
|
+
|
3
|
+
Class `RedAmber::DataFrame` represents 2D data. A `DataFrame` consists of:
|
4
|
+
- A collection of data that all share the same data type. We call it a `Vector`.
|
5
|
+
- A label attached to a `Vector`. We call it a `key`.
|
6
|
+
- A `Vector` and its associated `key` are grouped as a `variable`.
|
7
|
+
- `variable`s with the same vector length are aligned and arranged to form a `DataFrame`.
|
8
|
+
- The data at the same position across the `Vector`s of a `DataFrame` form a set of related values. We call it an `observation`.
|
9
|
+
|
10
|
+
![dataframe model image](doc/../image/dataframe_model.png)
|
11
|
+
|
12
|
+
(No change in this model in v0.1.6.)
|
13
|
+
|
14
|
+
## Constructors and saving
|
15
|
+
|
16
|
+
### `new` from a Hash
|
17
|
+
|
18
|
+
```ruby
|
19
|
+
RedAmber::DataFrame.new(x: [1, 2, 3])
|
20
|
+
```
|
21
|
+
|
22
|
+
### `new` from a schema (by Hash) and data (by Array)
|
23
|
+
|
24
|
+
```ruby
|
25
|
+
RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
|
26
|
+
```
|
27
|
+
|
28
|
+
### `new` from an Arrow::Table
|
29
|
+
|
30
|
+
|
31
|
+
```ruby
|
32
|
+
table = Arrow::Table.new(x: [1, 2, 3])
|
33
|
+
RedAmber::DataFrame.new(table)
|
34
|
+
```
|
35
|
+
|
36
|
+
### `new` from a Rover::DataFrame
|
37
|
+
|
38
|
+
|
39
|
+
```ruby
|
40
|
+
rover = Rover::DataFrame.new(x: [1, 2, 3])
|
41
|
+
RedAmber::DataFrame.new(rover)
|
42
|
+
```
|
43
|
+
|
44
|
+
### `load` (class method)
|
45
|
+
|
46
|
+
- from a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
|
47
|
+
|
48
|
+
```ruby
|
49
|
+
RedAmber::DataFrame.load("test/entity/with_header.csv")
|
50
|
+
```
|
51
|
+
|
52
|
+
- from a string buffer
|
53
|
+
|
54
|
+
- from a URI
|
55
|
+
|
56
|
+
```ruby
|
57
|
+
uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
|
58
|
+
RedAmber::DataFrame.load(uri)
|
59
|
+
```
|
60
|
+
|
61
|
+
- from a Parquet file
|
62
|
+
|
63
|
+
```ruby
|
64
|
+
dataframe = RedAmber::DataFrame.load("file.parquet")
|
65
|
+
```
|
66
|
+
|
67
|
+
### `save` (instance method)
|
68
|
+
|
69
|
+
- to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
|
70
|
+
|
71
|
+
- to a string buffer
|
72
|
+
|
73
|
+
- to a URI
|
74
|
+
|
75
|
+
- to a Parquet file
|
76
|
+
|
77
|
+
```ruby
|
78
|
+
dataframe.save("file.parquet")
|
79
|
+
```
|
80
|
+
|
81
|
+
## Properties
|
82
|
+
|
83
|
+
### `table`, `to_arrow`
|
84
|
+
|
85
|
+
- Returns the `Arrow::Table` object held inside.
|
86
|
+
|
87
|
+
### `size`, `n_obs`, `n_rows`
|
88
|
+
|
89
|
+
- Returns the size of the Vectors (number of observations).
|
90
|
+
|
91
|
+
### `n_keys`, `n_vars`, `n_cols`
|
92
|
+
|
93
|
+
- Returns the number of keys (number of variables).
|
94
|
+
|
95
|
+
### `shape`
|
96
|
+
|
97
|
+
- Returns the shape as an Array `[n_rows, n_cols]`.
|
98
|
+
|
99
|
+
### `variables`
|
100
|
+
|
101
|
+
- Returns pairs of key name and Vector in a Hash.
|
102
|
+
|
103
|
+
It is convenient in a block when both the key and the vector are required. We can write:
|
104
|
+
|
105
|
+
```ruby
|
106
|
+
# update numeric variables
|
107
|
+
df.assign do
|
108
|
+
variables.select.with_object({}) do |(key, vector), assigner|
|
109
|
+
assigner[key] = vector * -1 if vector.numeric?
|
110
|
+
end
|
111
|
+
end
|
112
|
+
```
|
113
|
+
|
114
|
+
Instead of:
|
115
|
+
```ruby
|
116
|
+
df.assign do
|
117
|
+
assigner = {}
|
118
|
+
vectors.each_with_index do |vector, i|
|
119
|
+
assigner[keys[i]] = vector * -1 if vector.numeric?
|
120
|
+
end
|
121
|
+
assigner
|
122
|
+
end
|
123
|
+
```
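
Both forms above iterate over key and vector pairs while filling an accumulator Hash. The following plain-Ruby sketch shows the same pattern on an ordinary Hash of Arrays. The `columns` data and the `all?(Numeric)` check are illustrative stand-ins for `variables` and `Vector#numeric?`, not RedAmber API.

```ruby
# Stand-in for `variables`: a Hash of key => values (plain Arrays, not Vectors).
columns = { index: [0, 1, 2], float: [1.5, 2.5, 3.5], string: %w[A B C] }

# Build an assigner Hash that holds only the numeric columns, negated.
assigner = columns.each_with_object({}) do |(key, values), result|
  result[key] = values.map { |v| v * -1 } if values.all?(Numeric)
end

p assigner.keys # prints [:index, :float]
```

The destructuring `|(key, values), result|` is what lets the block see the key and the values at once, which is the convenience `variables` offers over `vectors` plus `keys[i]`.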
|
124
|
+
|
125
|
+
### `keys`, `var_names`, `column_names`
|
126
|
+
|
127
|
+
- Returns key names in an Array.
|
128
|
+
|
129
|
+
When iterating over vectors, `Vector#key` is useful for getting the key that a vector has inside the DataFrame.
|
130
|
+
|
131
|
+
```ruby
|
132
|
+
# update numeric variables, another solution
|
133
|
+
df.assign do
|
134
|
+
vectors.each_with_object({}) do |vector, assigner|
|
135
|
+
assigner[vector.key] = vector * -1 if vector.numeric?
|
136
|
+
end
|
137
|
+
end
|
138
|
+
```
|
139
|
+
|
140
|
+
### `types`
|
141
|
+
|
142
|
+
- Returns types of vectors in an Array of Symbols.
|
143
|
+
|
144
|
+
### `type_classes`
|
145
|
+
|
146
|
+
- Returns the types of the vectors in an Array of `Arrow::DataType`.
|
147
|
+
|
148
|
+
### `vectors`
|
149
|
+
|
150
|
+
- Returns an Array of Vectors.
|
151
|
+
|
152
|
+
### `indices`, `indexes`
|
153
|
+
|
154
|
+
- Returns all indexes in an Array.
|
155
|
+
|
156
|
+
### `to_h`
|
157
|
+
|
158
|
+
- Returns column-oriented data in a Hash.
|
159
|
+
|
160
|
+
### `to_a`, `raw_records`
|
161
|
+
|
162
|
+
- Returns an array of row-oriented data without header.
|
163
|
+
|
164
|
+
If you need a column-oriented full array, use `.to_h.to_a`.
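
As a rough illustration of the two orientations, here is a plain-Ruby sketch that uses an ordinary Hash in place of the result of `to_h`. The `columns` data is made up for the example.

```ruby
# A plain Hash of columns, standing in for what `to_h` returns.
columns = { x: [1, 2, 3], y: %w[A B C] }

# Column-oriented: one [key, values] pair per variable.
column_oriented = columns.to_a
# => [[:x, [1, 2, 3]], [:y, ["A", "B", "C"]]]

# Row-oriented without header: one Array per observation.
row_oriented = columns.values.transpose
# => [[1, "A"], [2, "B"], [3, "C"]]
```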
|
165
|
+
|
166
|
+
### `schema`
|
167
|
+
|
168
|
+
- Returns column name and data type in a Hash.
|
169
|
+
|
170
|
+
### `==`
|
171
|
+
|
172
|
+
### `empty?`
|
173
|
+
|
174
|
+
## Output
|
175
|
+
|
176
|
+
### `to_s`
|
177
|
+
|
178
|
+
### `summary`, `describe` (not implemented)
|
179
|
+
|
180
|
+
### `to_rover`
|
181
|
+
|
182
|
+
- Returns a `Rover::DataFrame`.
|
183
|
+
|
184
|
+
### `to_iruby`
|
185
|
+
|
186
|
+
- Show the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
|
187
|
+
|
188
|
+
### `tdr(limit = 10, tally: 5, elements: 5)`
|
189
|
+
|
190
|
+
- Shows some information about self in a transposed style.
|
191
|
+
- `tdr_str` returns the same info as a String.
|
192
|
+
|
193
|
+
```ruby
|
194
|
+
require 'red_amber'
|
195
|
+
require 'datasets-arrow'
|
196
|
+
|
197
|
+
penguins = Datasets::Penguins.new.to_arrow
|
198
|
+
RedAmber::DataFrame.new(penguins).tdr
|
199
|
+
# =>
|
200
|
+
RedAmber::DataFrame : 344 x 8 Vectors
|
201
|
+
Vectors : 5 numeric, 3 strings
|
202
|
+
# key type level data_preview
|
203
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
204
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
205
|
+
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
206
|
+
4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
|
207
|
+
5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
|
208
|
+
6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
|
209
|
+
7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
|
210
|
+
8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
|
211
|
+
```
|
212
|
+
|
213
|
+
- limit: limit of variables to show. Default value is 10.
|
214
|
+
- tally: maximum level (number of distinct values) at which the preview is shown as a tally.
|
215
|
+
- elements: maximum number of elements shown in each data preview.
|
216
|
+
|
217
|
+
### `inspect`
|
218
|
+
|
219
|
+
- Returns the information of self as `tdr(3)`, and also shows object id.
|
220
|
+
|
221
|
+
```ruby
|
222
|
+
puts penguins.inspect
|
223
|
+
# =>
|
224
|
+
#<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
|
225
|
+
Vectors : 5 numeric, 3 strings
|
226
|
+
# key type level data_preview
|
227
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
228
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
229
|
+
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
230
|
+
... 5 more Vectors ...
|
231
|
+
```
|
232
|
+
|
233
|
+
## Selecting
|
234
|
+
|
235
|
+
### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
|
236
|
+
- Key in a Symbol: `df[:symbol]`
|
237
|
+
- Key in a String: `df["string"]`
|
238
|
+
- Keys in an Array: `df[:symbol1, "string", :symbol2]`
|
239
|
+
- Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
|
240
|
+
|
241
|
+
Key indices must be used via `keys[i]`, because numbers passed directly to `[]` select observations (rows).
|
242
|
+
|
243
|
+
- Keys by a Range:
|
244
|
+
|
245
|
+
If the keys can be expressed as a Range, the Range can be included in the arguments. See the example below.
|
246
|
+
|
247
|
+
- You can exchange the order of variables (columns).
|
248
|
+
|
249
|
+
```ruby
|
250
|
+
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
251
|
+
df = RedAmber::DataFrame.new(hash)
|
252
|
+
df[:b..:c, "a"]
|
253
|
+
# =>
|
254
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
|
255
|
+
Vectors : 2 numeric, 1 string
|
256
|
+
# key type level data_preview
|
257
|
+
1 :b string 3 ["A", "B", "C"]
|
258
|
+
2 :c double 3 [1.0, 2.0, 3.0]
|
259
|
+
3 :a uint8 3 [1, 2, 3]
|
260
|
+
```
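
One way to picture a Range of keys is as a slice of the key order. The sketch below is plain Ruby. `expand_key_range` is a hypothetical helper written for this illustration, and it is only an assumption about how such a Range could be resolved, not RedAmber's implementation.

```ruby
# Hypothetical helper: expand a Range of keys into the keys it covers,
# based on their positions in the DataFrame's key order.
def expand_key_range(keys, range)
  first = keys.index(range.begin.to_sym)
  last = keys.index(range.end.to_sym)
  raise ArgumentError, "unknown key in #{range}" if first.nil? || last.nil?

  range.exclude_end? ? keys[first...last] : keys[first..last]
end

keys = %i[a b c]
selected = expand_key_range(keys, :b..:c) + [:a]
# => [:b, :c, :a]
```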
|
261
|
+
|
262
|
+
If `#[]` selects a single variable (column), it returns a Vector object.
|
263
|
+
|
264
|
+
```ruby
|
265
|
+
df[:a]
|
266
|
+
# =>
|
267
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
268
|
+
[1, 2, 3]
|
269
|
+
```
|
270
|
+
The `#v` method also returns a Vector for a key.
|
271
|
+
|
272
|
+
```ruby
|
273
|
+
df.v(:a)
|
274
|
+
# =>
|
275
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
276
|
+
[1, 2, 3]
|
277
|
+
```
|
278
|
+
|
279
|
+
This may be useful in a block passed to a DataFrame manipulation verb. We can write `v(:a)` rather than `self[:a]` or `df[:a]`.
|
280
|
+
|
281
|
+
### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
|
282
|
+
|
283
|
+
- Select an obs. by index: `df[0]`
|
284
|
+
- Select obs. by indices in a Range: `df[1..2]`
|
285
|
+
|
286
|
+
An endless or a beginless Range can be used to represent indices.
|
287
|
+
|
288
|
+
- Select obs. by indices in an Array: `df[1, 2]`
|
289
|
+
|
290
|
+
- You can use float indices.
|
291
|
+
|
292
|
+
- Mixed case: `df[2, 0..]`
|
293
|
+
|
294
|
+
```ruby
|
295
|
+
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
296
|
+
df = RedAmber::DataFrame.new(hash)
|
297
|
+
df[2, 0..].tdr(tally: 0)
|
298
|
+
# =>
|
299
|
+
RedAmber::DataFrame : 4 x 3 Vectors
|
300
|
+
Vectors : 2 numeric, 1 string
|
301
|
+
# key type level data_preview
|
302
|
+
1 :a uint8 3 [3, 1, 2, 3]
|
303
|
+
2 :b string 3 ["C", "A", "B", "C"]
|
304
|
+
3 :c double 3 [3.0, 1.0, 2.0, 3.0]
|
305
|
+
```
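
To see how mixed Integer and Range arguments map to row positions, consider the following plain-Ruby sketch. `expand_indices` is a hypothetical helper for illustration only. It ignores Float indices and is not how RedAmber resolves indices internally.

```ruby
# Hypothetical helper: turn Integer and Range arguments into a flat list of
# row positions for a table with `size` rows. Floats are not handled here.
def expand_indices(size, *args)
  positions = (0...size).to_a
  args.flat_map do |arg|
    case arg
    when Range then positions[arg]
    when Integer then [positions.fetch(arg)] # negative values count from the end
    else raise ArgumentError, "unsupported index: #{arg.inspect}"
    end
  end
end

expand_indices(3, 2, 0..)          # => [2, 0, 1, 2]
expand_indices(344, 0...5, -5..-1) # => [0, 1, 2, 3, 4, 339, 340, 341, 342, 343]
```

The first call mirrors the mixed case `df[2, 0..]`: the row at index 2 comes first, followed by every row from index 0 onward, so a 3-row table yields 4 rows.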
|
306
|
+
|
307
|
+
- Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
|
308
|
+
|
309
|
+
It returns a sub DataFrame containing the observations where the boolean is true.
|
310
|
+
|
311
|
+
```ruby
|
312
|
+
# with the same dataframe `df` above
|
313
|
+
df[true, false, nil] # or
|
314
|
+
df[[true, false, nil]] # or
|
315
|
+
df[RedAmber::Vector.new([true, false, nil])]
|
316
|
+
# =>
|
317
|
+
#<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
|
318
|
+
Vectors : 2 numeric, 1 string
|
319
|
+
# key type level data_preview
|
320
|
+
1 :a uint8 1 [1]
|
321
|
+
2 :b string 1 ["A"]
|
322
|
+
3 :c double 1 [1.0]
|
323
|
+
```
|
324
|
+
|
325
|
+
### Select rows from top or from bottom
|
326
|
+
|
327
|
+
`head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
|
328
|
+
|
329
|
+
## Sub DataFrame manipulations
|
330
|
+
|
331
|
+
### `pick` - pick up variables by key label -
|
332
|
+
|
333
|
+
Pick up some variables (columns) to create a sub DataFrame.
|
334
|
+
|
335
|
+
![pick method image](doc/../image/dataframe/pick.png)
|
336
|
+
|
337
|
+
- Keys as arguments
|
338
|
+
|
339
|
+
`pick(keys)` accepts keys as arguments in an Array.
|
340
|
+
|
341
|
+
```ruby
|
342
|
+
penguins.pick(:species, :bill_length_mm)
|
343
|
+
# =>
|
344
|
+
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
|
345
|
+
Vectors : 1 numeric, 1 string
|
346
|
+
# key type level data_preview
|
347
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
348
|
+
2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
349
|
+
```
|
350
|
+
|
351
|
+
- Booleans as an argument
|
352
|
+
|
353
|
+
`pick(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
|
354
|
+
|
355
|
+
```ruby
|
356
|
+
penguins.pick(penguins.types.map { |type| type == :string })
|
357
|
+
# =>
|
358
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
|
359
|
+
Vectors : 3 strings
|
360
|
+
# key type level data_preview
|
361
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
362
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
363
|
+
3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
|
364
|
+
```
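
Selecting by booleans pairs each key with a flag at the same position. Below is a plain-Ruby sketch of that pairing. The `keys` and `types` Arrays are a shortened, made-up subset of the penguins columns, and the code models the idea rather than RedAmber's internals.

```ruby
keys  = %i[species island bill_length_mm sex]
types = %i[string string double string]

# One boolean per key; true marks the keys to pick.
booleans = types.map { |type| type == :string }

picked  = keys.select.with_index { |_key, i| booleans[i] }
dropped = keys.reject.with_index { |_key, i| booleans[i] }
# picked  => [:species, :island, :sex]
# dropped => [:bill_length_mm]
```

Because the flags are matched by position, the boolean Array has to be as long as the list of keys.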
|
365
|
+
|
366
|
+
- Keys or booleans by a block
|
367
|
+
|
368
|
+
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
369
|
+
|
370
|
+
```ruby
|
371
|
+
# It is ok to write `keys ...` in the block, not `penguins.keys ...`
|
372
|
+
penguins.pick { keys.map { |key| key.end_with?('mm') } }
|
373
|
+
# =>
|
374
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
|
375
|
+
Vectors : 3 numeric
|
376
|
+
# key type level data_preview
|
377
|
+
1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
378
|
+
2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
|
379
|
+
3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
|
380
|
+
```
|
381
|
+
|
382
|
+
### `drop` - pick and drop -
|
383
|
+
|
384
|
+
Drop some variables (columns) to create a DataFrame of the remaining variables.
|
385
|
+
|
386
|
+
![drop method image](doc/../image/dataframe/drop.png)
|
387
|
+
|
388
|
+
- Keys as arguments
|
389
|
+
|
390
|
+
`drop(keys)` accepts keys as arguments in an Array.
|
391
|
+
|
392
|
+
- Booleans as an argument
|
393
|
+
|
394
|
+
`drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
|
395
|
+
|
396
|
+
- Keys or booleans by a block
|
397
|
+
|
398
|
+
`drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
399
|
+
|
400
|
+
- Notice for nil
|
401
|
+
|
402
|
+
When used with booleans, a nil in the booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
|
403
|
+
|
404
|
+
```ruby
|
405
|
+
booleans = [true, false, nil]
|
406
|
+
booleans_invert = booleans.map(&:!) # => [false, true, true]
|
407
|
+
df.pick(booleans) == df.drop(booleans_invert) # => true
|
408
|
+
```
|
409
|
+
- Difference between `pick`/`drop` and `[]`
|
410
|
+
|
411
|
+
If `pick` or `drop` selects a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`. This behavior may be useful in a block of DataFrame manipulations.
|
412
|
+
|
413
|
+
```ruby
|
414
|
+
df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
|
415
|
+
df.pick(:a) # or
|
416
|
+
df.drop(:b, :c)
|
417
|
+
# =>
|
418
|
+
#<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
|
419
|
+
Vector : 1 numeric
|
420
|
+
# key type level data_preview
|
421
|
+
1 :a uint8 3 [1, 2, 3]
|
422
|
+
|
423
|
+
df[:a]
|
424
|
+
# =>
|
425
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
|
426
|
+
[1, 2, 3]
|
427
|
+
```
|
428
|
+
|
429
|
+
### `slice` - to cut vertically is slice -
|
430
|
+
|
431
|
+
Slice and select observations (rows) to create a sub DataFrame.
|
432
|
+
|
433
|
+
![slice method image](doc/../image/dataframe/slice.png)
|
434
|
+
|
435
|
+
- Indices as arguments
|
436
|
+
|
437
|
+
`slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
|
438
|
+
|
439
|
+
Negative indices counting from the tail, as in Ruby's Array, are also acceptable.
|
440
|
+
|
441
|
+
```ruby
|
442
|
+
# returns 5 obs. at start and 5 obs. from end
|
443
|
+
penguins.slice(0...5, -5..-1)
|
444
|
+
# =>
|
445
|
+
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
|
446
|
+
Vectors : 5 numeric, 3 strings
|
447
|
+
# key type level data_preview
|
448
|
+
1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
|
449
|
+
2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
|
450
|
+
3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
451
|
+
... 5 more Vectors ...
|
452
|
+
```
|
453
|
+
|
454
|
+
- Booleans as an argument
|
455
|
+
|
456
|
+
`slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
|
457
|
+
|
458
|
+
```ruby
|
459
|
+
vector = penguins[:bill_length_mm]
|
460
|
+
penguins.slice(vector >= 40)
|
461
|
+
# =>
|
462
|
+
#<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
|
463
|
+
Vectors : 5 numeric, 3 strings
|
464
|
+
# key type level data_preview
|
465
|
+
1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
|
466
|
+
2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
|
467
|
+
3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
|
468
|
+
... 5 more Vectors ...
|
469
|
+
```
|
470
|
+
|
471
|
+
- Indices or booleans by a block
|
472
|
+
|
473
|
+
`slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
|
474
|
+
|
475
|
+
```ruby
|
476
|
+
# return a DataFrame where bill_length_mm is within one std of the mean (a range 2*std wide)
|
477
|
+
penguins.slice do
|
478
|
+
vector = self[:bill_length_mm]
|
479
|
+
min = vector.mean - vector.std
|
480
|
+
max = vector.mean + vector.std
|
481
|
+
vector.to_a.map { |e| (min..max).include? e }
|
482
|
+
end
|
483
|
+
|
484
|
+
# =>
|
485
|
+
#<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
|
486
|
+
Vectors : 5 numeric, 3 strings
|
487
|
+
# key type level data_preview
|
488
|
+
1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
|
489
|
+
2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
|
490
|
+
3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
|
491
|
+
... 5 more Vectors ...
|
492
|
+
```
|
493
|
+
|
494
|
+
- Notice: nil option
|
495
|
+
- `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This propagates nil into the corresponding row.
|
496
|
+
|
497
|
+
```ruby
|
498
|
+
hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
|
499
|
+
table = Arrow::Table.new(hash)
|
500
|
+
table.slice([true, false, nil])
|
501
|
+
# =>
|
502
|
+
#<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
|
503
|
+
a b c
|
504
|
+
0 1 A 1.000000
|
505
|
+
1 (null) (null) (null)
|
506
|
+
```
|
507
|
+
|
508
|
+
- In RedAmber, by contrast, a nil in the booleans passed to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.
|
509
|
+
|
510
|
+
```ruby
|
511
|
+
RedAmber::DataFrame.new(table).slice([true, false, nil]).table
|
512
|
+
# =>
|
513
|
+
#<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
|
514
|
+
a b c
|
515
|
+
0 1 A 1.000000
|
516
|
+
```
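
The two null-selection behaviors can be modeled in plain Ruby on an Array of rows. `filter_rows` and its `null_selection:` keyword are hypothetical names chosen for this sketch. They imitate the `:drop` and `:emit_null` outcomes shown above and do not call Arrow.

```ruby
# Hypothetical model of the two behaviors on an Array of rows.
def filter_rows(rows, booleans, null_selection: :drop)
  rows.zip(booleans).each_with_object([]) do |(row, flag), kept|
    if flag
      kept << row
    elsif flag.nil? && null_selection == :emit_null
      kept << Array.new(row.size) # a row of nils
    end
  end
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]

filter_rows(rows, [true, false, nil])
# => [[1, "A", 1.0]]
filter_rows(rows, [true, false, nil], null_selection: :emit_null)
# => [[1, "A", 1.0], [nil, nil, nil]]
```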
|
517
|
+
|
518
|
+
### `remove`
|
519
|
+
|
520
|
+
Slice and reject observations (rows) to create a DataFrame of the remaining observations.
|
521
|
+
|
522
|
+
![remove method image](doc/../image/dataframe/remove.png)
|
523
|
+
|
524
|
+
- Indices as arguments
|
525
|
+
|
526
|
+
`remove(indices)` accepts indices as arguments. Indices should be Integers or Ranges of Integers.
|
527
|
+
|
528
|
+
```ruby
|
529
|
+
# returns 6th to 339th obs.
|
530
|
+
penguins.remove(0...5, -5..-1)
|
531
|
+
# =>
|
532
|
+
#<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
|
533
|
+
Vectors : 5 numeric, 3 strings
|
534
|
+
# key type level data_preview
|
535
|
+
1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
|
536
|
+
2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
|
537
|
+
3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
|
538
|
+
... 5 more Vectors ...
|
539
|
+
```
|
540
|
+
|
541
|
+
- Booleans as an argument
|
542
|
+
|
543
|
+
`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
|
544
|
+
|
545
|
+
```ruby
|
546
|
+
# remove all observations containing nil
|
547
|
+
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
|
548
|
+
removed.tdr
|
549
|
+
# =>
|
550
|
+
RedAmber::DataFrame : 333 x 8 Vectors
|
551
|
+
Vectors : 5 numeric, 3 strings
|
552
|
+
# key type level data_preview
|
553
|
+
1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
|
554
|
+
2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
|
555
|
+
3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
|
556
|
+
4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
|
557
|
+
5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
|
558
|
+
6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
|
559
|
+
7 :sex string 2 {"male"=>168, "female"=>165}
|
560
|
+
8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
|
561
|
+
```
|
562
|
+
|
563
|
+
- Indices or booleans by a block
|
564
|
+
|
565
|
+
`remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
|
566
|
+
|
567
|
+
```ruby
|
568
|
+
penguins.remove do
|
569
|
+
vector = self[:bill_length_mm]
|
570
|
+
min = vector.mean - vector.std
|
571
|
+
max = vector.mean + vector.std
|
572
|
+
vector.to_a.map { |e| (min..max).include? e }
|
573
|
+
end
|
574
|
+
# =>
|
575
|
+
#<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
|
576
|
+
Vectors : 5 numeric, 3 strings
|
577
|
+
# key type level data_preview
|
578
|
+
1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
|
579
|
+
2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
|
580
|
+
3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
|
581
|
+
... 5 more Vectors ...
|
582
|
+
```
|
583
|
+
- Notice for nil
|
584
|
+
- When `remove` is used with booleans, a nil in the booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
|
585
|
+
|
586
|
+
```ruby
|
587
|
+
df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
|
588
|
+
booleans = df[:a] < 2
|
589
|
+
# =>
|
590
|
+
#<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
|
591
|
+
[true, false, nil]
|
592
|
+
|
593
|
+
booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
|
594
|
+
df.slice(booleans) == df.remove(booleans_invert) # => true
|
595
|
+
```
|
596
|
+
- In contrast, `Vector#invert` returns nil for nil elements, which leads to a different result.
|
597
|
+
|
598
|
+
```ruby
|
599
|
+
booleans.invert
|
600
|
+
# =>
|
601
|
+
#<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
|
602
|
+
[false, true, nil]
|
603
|
+
|
604
|
+
df.remove(booleans.invert)
|
605
|
+
#<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
|
606
|
+
Vectors : 2 numeric, 1 string
|
607
|
+
# key type level data_preview
|
608
|
+
1 :a uint8 2 [1, nil], 1 nil
|
609
|
+
2 :b string 2 ["A", "C"]
|
610
|
+
3 :c double 2 [1.0, 3.0]
|
611
|
+
```
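
The difference between the two inversions can be reproduced with plain Ruby Arrays. In this sketch the nil-preserving inversion and the `remove` lambda are illustrative models of the results shown above, not RedAmber methods.

```ruby
booleans = [true, false, nil]

# Ruby's `!` maps nil to true.
ruby_not = booleans.map(&:!)
# => [false, true, true]

# A nil-preserving inversion, modeling the `Vector#invert` result shown above.
nil_preserving_not = booleans.map { |b| b.nil? ? nil : !b }
# => [false, true, nil]

# Removing rows where the flag is true (nil counts as false):
rows = [[1, 'A'], [2, 'B'], [nil, 'C']]
remove = ->(flags) { rows.reject.with_index { |_row, i| flags[i] } }

remove.call(ruby_not)           # => [[1, "A"]]
remove.call(nil_preserving_not) # => [[1, "A"], [nil, "C"]]
```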
|
612
|
+
|
613
|
+
### `rename`
|
614
|
+
|
615
|
+
Rename keys (column names) to create an updated DataFrame.
|
616
|
+
|
617
|
+
![rename method image](doc/../image/dataframe/rename.png)
|
618
|
+
|
619
|
+
- Key pairs as arguments
|
620
|
+
|
621
|
+
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
|
622
|
+
|
623
|
+
```ruby
|
624
|
+
h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
|
625
|
+
df = RedAmber::DataFrame.new(h)
|
626
|
+
df.rename(:age => :age_in_1993)
|
627
|
+
# =>
|
628
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
|
629
|
+
Vectors : 1 numeric, 1 string
|
630
|
+
# key type level data_preview
|
631
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
632
|
+
2 :age_in_1993 uint8 3 [68, 49, 28]
|
633
|
+
```
|
634
|
+
|
635
|
+
- Key pairs by a block
|
636
|
+
|
637
|
+
`rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. The block is called in the context of self.
|
638
|
+
|
639
|
+
- Key type
|
640
|
+
|
641
|
+
Symbol key and String key are distinguished.
|
642
|
+
|
643
|
+
### `assign`
|
644
|
+
|
645
|
+
Assign new or updated variables (columns) and create an updated DataFrame.
|
646
|
+
|
647
|
+
- Variables with new keys are appended as new variables at the bottom (the right side of the table).
|
648
|
+
- Variables with existing keys will update the corresponding vectors.
|
649
|
+
|
650
|
+
![assign method image](doc/../image/dataframe/assign.png)
|
651
|
+
|
652
|
+
- Variables as arguments
|
653
|
+
|
654
|
+
`assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
|
655
|
+
|
656
|
+
```ruby
|
657
|
+
df = RedAmber::DataFrame.new(
|
658
|
+
'name' => %w[Yasuko Rui Hinata],
|
659
|
+
'age' => [68, 49, 28])
|
660
|
+
# =>
|
661
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
|
662
|
+
Vectors : 1 numeric, 1 string
|
663
|
+
# key type level data_preview
|
664
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
665
|
+
2 :age uint8 3 [68, 49, 28]
|
666
|
+
|
667
|
+
# update :age and add :brother
|
668
|
+
assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
|
669
|
+
df.assign(assigner)
|
670
|
+
# =>
|
671
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
|
672
|
+
Vectors : 1 numeric, 2 strings
|
673
|
+
# key type level data_preview
|
674
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
675
|
+
2 :age uint8 3 [97, 78, 57]
|
676
|
+
3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
|
677
|
+
```
|
678
|
+
|
679
|
+
- Key pairs by a block
|
680
|
+
|
681
|
+
`assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. The block is called in the context of self.
|
682
|
+
|
683
|
+
```ruby
|
684
|
+
df = RedAmber::DataFrame.new(
|
685
|
+
index: [0, 1, 2, 3, nil],
|
686
|
+
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
687
|
+
string: ['A', 'B', 'C', 'D', nil])
|
688
|
+
# =>
|
689
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
|
690
|
+
Vectors : 2 numeric, 1 string
|
691
|
+
# key type level data_preview
|
692
|
+
1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
|
693
|
+
2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
|
694
|
+
3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
|
695
|
+
|
696
|
+
# update numeric variables
|
697
|
+
df.assign do
|
698
|
+
assigner = {}
|
699
|
+
vectors.each_with_index do |v, i|
|
700
|
+
assigner[keys[i]] = v * -1 if v.numeric?
|
701
|
+
end
|
702
|
+
assigner
|
703
|
+
end
|
704
|
+
# =>
|
705
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
|
706
|
+
Vectors : 2 numeric, 1 string
|
707
|
+
# key type level data_preview
|
708
|
+
1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
|
709
|
+
2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
|
710
|
+
3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
|
711
|
+
|
712
|
+
# Or, more concisely:
|
713
|
+
df.assign do
|
714
|
+
variables.select.with_object({}) do |(key, vector), assigner|
|
715
|
+
assigner[key] = vector * -1 if vector.numeric?
|
716
|
+
end
|
717
|
+
end
|
718
|
+
# => same as above
|
719
|
+
```
|
720
|
+
|
721
|
+
- Key type
|
722
|
+
|
723
|
+
Symbol key and String key are considered as the same key.
|
724
|
+
|
725
|
+
## Updating
|
726
|
+
|
727
|
+
### `sort`
|
728
|
+
|
729
|
+
`sort` accepts sort keys as parameters, thanks to the amazing Red Arrow feature.
|
730
|
+
- :key, "key" or "+key" denotes ascending order
|
731
|
+
- "-key" denotes descending order
|
732
|
+
|
733
|
+
```ruby
|
734
|
+
df = RedAmber::DataFrame.new({
|
735
|
+
index: [1, 1, 0, nil, 0],
|
736
|
+
string: ['C', 'B', nil, 'A', 'B'],
|
737
|
+
bool: [nil, true, false, true, false],
|
738
|
+
})
|
739
|
+
df.sort(:index, '-bool').tdr(tally: 0)
|
740
|
+
# =>
|
741
|
+
RedAmber::DataFrame : 5 x 3 Vectors
|
742
|
+
Vectors : 1 numeric, 1 string, 1 boolean
|
743
|
+
# key type level data_preview
|
744
|
+
1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
|
745
|
+
2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
|
746
|
+
3 :bool boolean 3 [false, false, true, nil, true], 1 nil
|
747
|
+
```
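
The sort key notation can be read as a column name with an optional `+` or `-` prefix. The plain-Ruby sketch below shows one way to interpret it. `parse_sort_key` is a hypothetical helper for illustration and is not the parser used by Red Arrow.

```ruby
# Hypothetical helper: split a sort key into [column, order].
def parse_sort_key(key)
  name = key.to_s
  case name[0]
  when '-' then [name[1..].to_sym, :descending]
  when '+' then [name[1..].to_sym, :ascending]
  else [name.to_sym, :ascending]
  end
end

[:index, '-bool'].map { |key| parse_sort_key(key) }
# => [[:index, :ascending], [:bool, :descending]]
```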
|
748
|
+
|
749
|
+
- [ ] Clamp
|
750
|
+
|
751
|
+
- [ ] Clear data
|
752
|
+
|
753
|
+
## Treat na data
|
754
|
+
|
755
|
+
### `remove_nil`
|
756
|
+
|
757
|
+
Remove any observations containing nil.
|
758
|
+
|
759
|
+
## Grouping
|
760
|
+
|
761
|
+
### `group(aggregating_keys, function, target_keys)`
|
762
|
+
|
763
|
+
(This is a temporary API and may change in a future version.)
|
764
|
+
|
765
|
+
Creates a DataFrame grouped by `aggregating_keys`, applies `function` to the `target_keys` in each group, and returns the result. Each aggregated key is named in the `function(key)` style.
|
766
|
+
|
767
|
+
(The current implementation is not intuitive. Needs improvement.)
|
768
|
+
|
769
|
+
```ruby
|
770
|
+
ds = Datasets::Rdatasets.new('dplyr', 'starwars')
|
771
|
+
starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
|
772
|
+
starwars.tdr(11)
|
773
|
+
# =>
|
774
|
+
RedAmber::DataFrame : 87 x 11 Vectors
|
775
|
+
Vectors : 3 numeric, 8 strings
|
776
|
+
# key type level data_preview
|
777
|
+
1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
|
778
|
+
2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
|
779
|
+
3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
|
780
|
+
4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
|
781
|
+
5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
|
782
|
+
6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
|
783
|
+
7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
|
784
|
+
8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
|
785
|
+
9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
|
786
|
+
10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
|
787
|
+
11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
|
788
|
+
|
789
|
+
grouped = starwars.group(:species, :mean, [:mass, :height])
|
790
|
+
# =>
|
791
|
+
#<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
|
792
|
+
Vectors : 2 numeric, 1 string
|
793
|
+
# key type level data_preview
|
794
|
+
1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
|
795
|
+
2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
|
796
|
+
3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
|
797
|
+
|
798
|
+
count = starwars.group(:species, :count, :species)[:"count(species)"]
|
799
|
+
df = grouped.slice(count > 1)
|
800
|
+
# =>
|
801
|
+
#<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
|
802
|
+
Vectors : 2 numeric, 1 string
|
803
|
+
# key type level data_preview
|
804
|
+
1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
|
805
|
+
2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
|
806
|
+
3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
|
807
|
+
|
808
|
+
df.table
|
809
|
+
# =>
|
810
|
+
#<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
|
811
|
+
mean(mass) mean(height) species
|
812
|
+
0 82.781818 176.645161 Human
|
813
|
+
1 69.750000 131.200000 Droid
|
814
|
+
2 124.000000 231.000000 Wookiee
|
815
|
+
3 74.000000 208.666667 Gungan
|
816
|
+
4 80.000000 173.000000 Zabrak
|
817
|
+
5 55.000000 179.000000 Twi'lek
|
818
|
+
6 53.100000 168.000000 Mirialan
|
819
|
+
7 88.000000 221.000000 Kaminoan
|
820
|
+
```
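
The grouping step can be pictured with plain Ruby on a few made-up records. This sketch assumes that nil values are skipped when averaging and that a group with no values yields nil. Both are assumptions of the illustration, and the code does not use RedAmber or Arrow.

```ruby
# Made-up records; only the shape of the data follows the example above.
records = [
  { species: 'Human', mass: 77.0 },
  { species: 'Droid', mass: 75.0 },
  { species: 'Droid', mass: 32.0 },
  { species: 'Human', mass: 136.0 },
  { species: 'Human', mass: nil }
]

# Group by :species, then average :mass within each group, skipping nils.
grouped = records.group_by { |r| r[:species] }.map do |species, rows|
  masses = rows.map { |r| r[:mass] }.compact
  mean = masses.empty? ? nil : masses.sum / masses.size
  { 'mean(mass)': mean, species: species }
end

grouped.each { |row| puts "#{row[:species]}: #{row[:'mean(mass)']}" }
# prints "Human: 106.5" and "Droid: 53.5"
```

The result key `:"mean(mass)"` follows the `function(key)` naming style described above.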
|
821
|
+
|
822
|
+
Available functions are:
|
823
|
+
|
824
|
+
- [ ] all
|
825
|
+
- [ ] any
|
826
|
+
- [ ] approximate_median
|
827
|
+
- ✓ count
|
828
|
+
- [ ] count_distinct
|
829
|
+
- [ ] distinct
|
830
|
+
- ✓ max
|
831
|
+
- ✓ mean
|
832
|
+
- ✓ min
|
833
|
+
- [ ] min_max
|
834
|
+
- ✓ product
|
835
|
+
- ✓ stddev
|
836
|
+
- ✓ sum
|
837
|
+
- [ ] tdigest
|
838
|
+
- ✓ variance
|
839
|
+
|
840
|
+
## Combining DataFrames
|
841
|
+
|
842
|
+
- [ ] obs
|
843
|
+
|
844
|
+
- [ ] Add vars
|
845
|
+
|
846
|
+
- [ ] Inner join
|
847
|
+
|
848
|
+
- [ ] Left join
|
849
|
+
|
850
|
+
## Encoding
|
851
|
+
|
852
|
+
- [ ] One-hot encoding
|
853
|
+
|
854
|
+
## Iteration (not implemented)
|