red_amber 0.1.3 → 0.1.6
- checksums.yaml +4 -4
- data/.rubocop.yml +31 -7
- data/CHANGELOG.md +214 -10
- data/Gemfile +4 -0
- data/README.md +117 -342
- data/benchmark/csv_load_penguins.yml +15 -0
- data/benchmark/drop_nil.yml +11 -0
- data/doc/DataFrame.md +854 -0
- data/doc/Vector.md +449 -0
- data/doc/image/arrow_table_new.png +0 -0
- data/doc/image/dataframe/assign.png +0 -0
- data/doc/image/dataframe/drop.png +0 -0
- data/doc/image/dataframe/pick.png +0 -0
- data/doc/image/dataframe/remove.png +0 -0
- data/doc/image/dataframe/rename.png +0 -0
- data/doc/image/dataframe/slice.png +0 -0
- data/doc/image/dataframe_model.png +0 -0
- data/doc/image/example_in_red_arrow.png +0 -0
- data/doc/image/tdr.png +0 -0
- data/doc/image/tdr_and_table.png +0 -0
- data/doc/image/tidy_data_in_TDR.png +0 -0
- data/doc/image/vector/binary_element_wise.png +0 -0
- data/doc/image/vector/unary_aggregation.png +0 -0
- data/doc/image/vector/unary_aggregation_w_option.png +0 -0
- data/doc/image/vector/unary_element_wise.png +0 -0
- data/doc/tdr.md +56 -0
- data/doc/tdr_ja.md +56 -0
- data/lib/red-amber.rb +27 -0
- data/lib/red_amber/data_frame.rb +91 -37
- data/lib/red_amber/{data_frame_output.rb → data_frame_displayable.rb} +49 -41
- data/lib/red_amber/data_frame_indexable.rb +38 -0
- data/lib/red_amber/data_frame_observation_operation.rb +11 -0
- data/lib/red_amber/data_frame_selectable.rb +155 -48
- data/lib/red_amber/data_frame_variable_operation.rb +137 -0
- data/lib/red_amber/helper.rb +61 -0
- data/lib/red_amber/vector.rb +69 -16
- data/lib/red_amber/vector_functions.rb +80 -45
- data/lib/red_amber/vector_selectable.rb +124 -0
- data/lib/red_amber/vector_updatable.rb +104 -0
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +1 -16
- data/red_amber.gemspec +3 -6
- metadata +38 -9
data/doc/DataFrame.md
ADDED
@@ -0,0 +1,854 @@
|
|
1
|
+
# DataFrame
|
2
|
+
|
3
|
+
Class `RedAmber::DataFrame` represents 2D data. A `DataFrame` consists of:
|
4
|
+
- A collection of data of the same data type. We call it a `Vector`.
|
5
|
+
- A label attached to a `Vector`. We call it a `key`.
|
6
|
+
- A `Vector` and its associated `key` are grouped as a `variable`.
|
7
|
+
- `variable`s with the same vector length are aligned and arranged to form a `DataFrame`.
|
8
|
+
- The set of related data at the same position in each `Vector` of a `DataFrame` is called an `observation`.
|
9
|
+
|
10
|
+

|
11
|
+
|
12
|
+
(No change in this model in v0.1.6.)
|
13
|
+
|
14
|
+
## Constructors and saving
|
15
|
+
|
16
|
+
### `new` from a Hash
|
17
|
+
|
18
|
+
```ruby
|
19
|
+
RedAmber::DataFrame.new(x: [1, 2, 3])
|
20
|
+
```
|
21
|
+
|
22
|
+
### `new` from a schema (by Hash) and data (by Array)
|
23
|
+
|
24
|
+
```ruby
|
25
|
+
RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
|
26
|
+
```
|
27
|
+
|
28
|
+
### `new` from an Arrow::Table
|
29
|
+
|
30
|
+
|
31
|
+
```ruby
|
32
|
+
table = Arrow::Table.new(x: [1, 2, 3])
|
33
|
+
RedAmber::DataFrame.new(table)
|
34
|
+
```
|
35
|
+
|
36
|
+
### `new` from a Rover::DataFrame
|
37
|
+
|
38
|
+
|
39
|
+
```ruby
|
40
|
+
rover = Rover::DataFrame.new(x: [1, 2, 3])
|
41
|
+
RedAmber::DataFrame.new(rover)
|
42
|
+
```
|
43
|
+
|
44
|
+
### `load` (class method)
|
45
|
+
|
46
|
+
- from a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
|
47
|
+
|
48
|
+
```ruby
|
49
|
+
RedAmber::DataFrame.load("test/entity/with_header.csv")
|
50
|
+
```
|
51
|
+
|
52
|
+
- from a string buffer
|
53
|
+
|
54
|
+
- from a URI
|
55
|
+
|
56
|
+
```ruby
|
57
|
+
uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
|
58
|
+
RedAmber::DataFrame.load(uri)
|
59
|
+
```
|
60
|
+
|
61
|
+
- from a Parquet file
|
62
|
+
|
63
|
+
```ruby
|
64
|
+
dataframe = RedAmber::DataFrame.load("file.parquet")
|
65
|
+
```
|
66
|
+
|
67
|
+
### `save` (instance method)
|
68
|
+
|
69
|
+
- to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
|
70
|
+
|
71
|
+
- to a string buffer
|
72
|
+
|
73
|
+
- to a URI
|
74
|
+
|
75
|
+
- to a Parquet file
|
76
|
+
|
77
|
+
```ruby
|
78
|
+
dataframe.save("file.parquet")
|
79
|
+
```
|
80
|
+
|
81
|
+
## Properties
|
82
|
+
|
83
|
+
### `table`, `to_arrow`
|
84
|
+
|
85
|
+
- Returns the `Arrow::Table` object held inside.
|
86
|
+
|
87
|
+
### `size`, `n_obs`, `n_rows`
|
88
|
+
|
89
|
+
- Returns the size of the Vectors (number of observations).
|
90
|
+
|
91
|
+
### `n_keys`, `n_vars`, `n_cols`
|
92
|
+
|
93
|
+
- Returns the number of keys (number of variables).
|
94
|
+
|
95
|
+
### `shape`
|
96
|
+
|
97
|
+
- Returns the shape as an Array `[n_rows, n_cols]`.
|
98
|
+
|
99
|
+
### `variables`
|
100
|
+
|
101
|
+
- Returns pairs of key name and Vector in a Hash.
|
102
|
+
|
103
|
+
It is convenient in a block when both the key and the vector are required. We can write:
|
104
|
+
|
105
|
+
```ruby
|
106
|
+
# update numeric variables
|
107
|
+
df.assign do
|
108
|
+
variables.select.with_object({}) do |(key, vector), assigner|
|
109
|
+
assigner[key] = vector * -1 if vector.numeric?
|
110
|
+
end
|
111
|
+
end
|
112
|
+
```
|
113
|
+
|
114
|
+
Instead of:
|
115
|
+
```ruby
|
116
|
+
df.assign do
|
117
|
+
assigner = {}
|
118
|
+
vectors.each_with_index do |vector, i|
|
119
|
+
assigner[keys[i]] = vector * -1 if vector.numeric?
|
120
|
+
end
|
121
|
+
assigner
|
122
|
+
end
|
123
|
+
```
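The shorter form relies on a plain-Ruby idiom: `Hash#select` called without a block returns an Enumerator, and `with_object` then yields each `[key, value]` pair together with a memo object and returns that memo. A minimal sketch with an ordinary Hash standing in for `variables` (the sample data and the `Numeric` check are illustrative assumptions, not RedAmber API):

```ruby
# An ordinary Hash stands in for DataFrame#variables (key => values).
variables = { index: [0, 1, 2], name: %w[A B C], ratio: [0.5, 1.5, 2.5] }

# Hash#select without a block returns an Enumerator; with_object yields
# each [key, values] pair plus the memo Hash, and returns the memo.
negated = variables.select.with_object({}) do |(key, values), assigner|
  # Stand-in for Vector#numeric?: every element is a Numeric.
  assigner[key] = values.map { |v| v * -1 } if values.all?(Numeric)
end

p negated # only :index and :ratio are present, with negated values
```

Because the block's return value is not used for filtering here, `each_with_object` on the Hash would behave the same way.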
|
124
|
+
|
125
|
+
### `keys`, `var_names`, `column_names`
|
126
|
+
|
127
|
+
- Returns key names in an Array.
|
128
|
+
|
129
|
+
When working with vectors, `Vector#key` is useful for getting the key of a vector inside a DataFrame.
|
130
|
+
|
131
|
+
```ruby
|
132
|
+
# update numeric variables, another solution
|
133
|
+
df.assign do
|
134
|
+
vectors.each_with_object({}) do |vector, assigner|
|
135
|
+
assigner[vector.key] = vector * -1 if vector.numeric?
|
136
|
+
end
|
137
|
+
end
|
138
|
+
```
|
139
|
+
|
140
|
+
### `types`
|
141
|
+
|
142
|
+
- Returns types of vectors in an Array of Symbols.
|
143
|
+
|
144
|
+
### `type_classes`
|
145
|
+
|
146
|
+
- Returns types of vectors in an Array of `Arrow::DataType`.
|
147
|
+
|
148
|
+
### `vectors`
|
149
|
+
|
150
|
+
- Returns an Array of Vectors.
|
151
|
+
|
152
|
+
### `indices`, `indexes`
|
153
|
+
|
154
|
+
- Returns all indexes in an Array.
|
155
|
+
|
156
|
+
### `to_h`
|
157
|
+
|
158
|
+
- Returns column-oriented data in a Hash.
|
159
|
+
|
160
|
+
### `to_a`, `raw_records`
|
161
|
+
|
162
|
+
- Returns an array of row-oriented data without header.
|
163
|
+
|
164
|
+
If you need a column-oriented full array, use `.to_h.to_a`
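The two orientations can be illustrated with a plain Hash of columns. This is only a sketch of the data shapes, assuming `to_h` yields a `{key => Array}` Hash; it does not call RedAmber itself.

```ruby
# Column-oriented data, as a Hash of key => values.
columns = { a: [1, 2, 3], b: %w[A B C] }

# Row-oriented records without a header (the shape described for to_a).
rows = columns.values.transpose

# Column-oriented array of [key, values] pairs (the shape of to_h.to_a).
pairs = columns.to_a

p rows  # [[1, "A"], [2, "B"], [3, "C"]]
p pairs # [[:a, [1, 2, 3]], [:b, ["A", "B", "C"]]]
```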
|
165
|
+
|
166
|
+
### `schema`
|
167
|
+
|
168
|
+
- Returns column name and data type in a Hash.
|
169
|
+
|
170
|
+
### `==`
|
171
|
+
|
172
|
+
### `empty?`
|
173
|
+
|
174
|
+
## Output
|
175
|
+
|
176
|
+
### `to_s`
|
177
|
+
|
178
|
+
### `summary`, `describe` (not implemented)
|
179
|
+
|
180
|
+
### `to_rover`
|
181
|
+
|
182
|
+
- Returns a `Rover::DataFrame`.
|
183
|
+
|
184
|
+
### `to_iruby`
|
185
|
+
|
186
|
+
- Shows the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
|
187
|
+
|
188
|
+
### `tdr(limit = 10, tally: 5, elements: 5)`
|
189
|
+
|
190
|
+
- Shows some information about self in a transposed style.
|
191
|
+
- `tdr_str` returns the same information as a String.
|
192
|
+
|
193
|
+
```ruby
|
194
|
+
require 'red_amber'
|
195
|
+
require 'datasets-arrow'
|
196
|
+
|
197
|
+
penguins = Datasets::Penguins.new.to_arrow
|
198
|
+
RedAmber::DataFrame.new(penguins).tdr
|
199
|
+
# =>
|
200
|
+
RedAmber::DataFrame : 344 x 8 Vectors
|
201
|
+
Vectors : 5 numeric, 3 strings
|
202
|
+
# key type level data_preview
|
203
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
204
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
205
|
+
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
206
|
+
4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
|
207
|
+
5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
|
208
|
+
6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
|
209
|
+
7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
|
210
|
+
8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
|
211
|
+
```
|
212
|
+
|
213
|
+
- limit: limit of variables to show. Default value is 10.
|
214
|
+
- tally: max level to use tally mode.
|
215
|
+
- elements: maximum number of elements shown in each data preview.
|
216
|
+
|
217
|
+
### `inspect`
|
218
|
+
|
219
|
+
- Returns the information of self as `tdr(3)`, and also shows object id.
|
220
|
+
|
221
|
+
```ruby
|
222
|
+
puts penguins.inspect
|
223
|
+
# =>
|
224
|
+
#<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
|
225
|
+
Vectors : 5 numeric, 3 strings
|
226
|
+
# key type level data_preview
|
227
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
228
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
229
|
+
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
230
|
+
... 5 more Vectors ...
|
231
|
+
```
|
232
|
+
|
233
|
+
## Selecting
|
234
|
+
|
235
|
+
### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
|
236
|
+
- Key in a Symbol: `df[:symbol]`
|
237
|
+
- Key in a String: `df["string"]`
|
238
|
+
- Keys in an Array: `df[:symbol1, "string", :symbol2]`
|
239
|
+
- Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
|
240
|
+
|
241
|
+
Key indices must be passed through `keys[i]`, because bare numbers select observations (rows).
|
242
|
+
|
243
|
+
- Keys by a Range:
|
244
|
+
|
245
|
+
If keys can be expressed as a Range, the Range can be included in the arguments. See the example below.
|
246
|
+
|
247
|
+
- You can exchange the order of variables (columns).
|
248
|
+
|
249
|
+
```ruby
|
250
|
+
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
251
|
+
df = RedAmber::DataFrame.new(hash)
|
252
|
+
df[:b..:c, "a"]
|
253
|
+
# =>
|
254
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
|
255
|
+
Vectors : 2 numeric, 1 string
|
256
|
+
# key type level data_preview
|
257
|
+
1 :b string 3 ["A", "B", "C"]
|
258
|
+
2 :c double 3 [1.0, 2.0, 3.0]
|
259
|
+
3 :a uint8 3 [1, 2, 3]
|
260
|
+
```
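One way to picture how a Range of keys could be resolved is to map its endpoints to positions in the key list. The helper below is hypothetical (`expand_keys` is not part of RedAmber, and the library's actual implementation is not shown in this document); it handles only inclusive Ranges and omits error handling.

```ruby
# Hypothetical helper: expand Symbols, Strings and Ranges of keys into a
# flat list of Symbol keys, using positions in the DataFrame's key order.
def expand_keys(all_keys, *args)
  args.flat_map do |arg|
    if arg.is_a?(Range)
      first = all_keys.index(arg.begin.to_sym)
      last = all_keys.index(arg.end.to_sym)
      all_keys[first..last]
    else
      [arg.to_sym]
    end
  end
end

keys = %i[a b c]
p expand_keys(keys, :b..:c, 'a') # [:b, :c, :a]
```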
|
261
|
+
|
262
|
+
If `#[]` selects a single variable (column), it returns a Vector object.
|
263
|
+
|
264
|
+
```ruby
|
265
|
+
df[:a]
|
266
|
+
# =>
|
267
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
268
|
+
[1, 2, 3]
|
269
|
+
```
|
270
|
+
The `#v` method also returns a Vector for a key.
|
271
|
+
|
272
|
+
```ruby
|
273
|
+
df.v(:a)
|
274
|
+
# =>
|
275
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
|
276
|
+
[1, 2, 3]
|
277
|
+
```
|
278
|
+
|
279
|
+
This can be useful in a block of DataFrame manipulation verbs: we can write `v(:a)` rather than `self[:a]` or `df[:a]`.
|
280
|
+
|
281
|
+
### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
|
282
|
+
|
283
|
+
- Select an obs. by index: `df[0]`
|
284
|
+
- Select obs. by indices in a Range: `df[1..2]`
|
285
|
+
|
286
|
+
An endless or a beginless Range can be used to represent indices.
|
287
|
+
|
288
|
+
- Select obs. by indices in an Array: `df[1, 2]`
|
289
|
+
|
290
|
+
- You can use float indices.
|
291
|
+
|
292
|
+
- Mixed case: `df[2, 0..]`
|
293
|
+
|
294
|
+
```ruby
|
295
|
+
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
|
296
|
+
df = RedAmber::DataFrame.new(hash)
|
297
|
+
df[2, 0..].tdr(tally: 0)
|
298
|
+
# =>
|
299
|
+
RedAmber::DataFrame : 4 x 3 Vectors
|
300
|
+
Vectors : 2 numeric, 1 string
|
301
|
+
# key type level data_preview
|
302
|
+
1 :a uint8 3 [3, 1, 2, 3]
|
303
|
+
2 :b string 3 ["C", "A", "B", "C"]
|
304
|
+
3 :c double 3 [3.0, 1.0, 2.0, 3.0]
|
305
|
+
```
|
306
|
+
|
307
|
+
- Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
|
308
|
+
|
309
|
+
It returns a sub DataFrame containing the observations where the boolean is true.
|
310
|
+
|
311
|
+
```ruby
|
312
|
+
# with the same dataframe `df` above
|
313
|
+
df[true, false, nil] # or
|
314
|
+
df[[true, false, nil]] # or
|
315
|
+
df[RedAmber::Vector.new([true, false, nil])]
|
316
|
+
# =>
|
317
|
+
#<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
|
318
|
+
Vectors : 2 numeric, 1 string
|
319
|
+
# key type level data_preview
|
320
|
+
1 :a uint8 1 [1]
|
321
|
+
2 :b string 1 ["A"]
|
322
|
+
3 :c double 1 [1.0]
|
323
|
+
```
|
324
|
+
|
325
|
+
### Select rows from top or from bottom
|
326
|
+
|
327
|
+
`head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
|
328
|
+
|
329
|
+
## Sub DataFrame manipulations
|
330
|
+
|
331
|
+
### `pick ` - pick up variables by key label -
|
332
|
+
|
333
|
+
Pick up some variables (columns) to create a sub DataFrame.
|
334
|
+
|
335
|
+

|
336
|
+
|
337
|
+
- Keys as arguments
|
338
|
+
|
339
|
+
`pick(keys)` accepts keys as arguments in an Array.
|
340
|
+
|
341
|
+
```ruby
|
342
|
+
penguins.pick(:species, :bill_length_mm)
|
343
|
+
# =>
|
344
|
+
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
|
345
|
+
Vectors : 1 numeric, 1 string
|
346
|
+
# key type level data_preview
|
347
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
348
|
+
2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
349
|
+
```
|
350
|
+
|
351
|
+
- Booleans as an argument
|
352
|
+
|
353
|
+
`pick(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
|
354
|
+
|
355
|
+
```ruby
|
356
|
+
penguins.pick(penguins.types.map { |type| type == :string })
|
357
|
+
# =>
|
358
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
|
359
|
+
Vectors : 3 strings
|
360
|
+
# key type level data_preview
|
361
|
+
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
|
362
|
+
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
|
363
|
+
3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
|
364
|
+
```
|
365
|
+
|
366
|
+
- Keys or booleans by a block
|
367
|
+
|
368
|
+
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
369
|
+
|
370
|
+
```ruby
|
371
|
+
# It is ok to write `keys ...` in the block, not `penguins.keys ...`
|
372
|
+
penguins.pick { keys.map { |key| key.end_with?('mm') } }
|
373
|
+
# =>
|
374
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
|
375
|
+
Vectors : 3 numeric
|
376
|
+
# key type level data_preview
|
377
|
+
1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
378
|
+
2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
|
379
|
+
3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
|
380
|
+
```
|
381
|
+
|
382
|
+
### `drop ` - pick and drop -
|
383
|
+
|
384
|
+
Drop some variables (columns) to create a remainder DataFrame.
|
385
|
+
|
386
|
+

|
387
|
+
|
388
|
+
- Keys as arguments
|
389
|
+
|
390
|
+
`drop(keys)` accepts keys as arguments in an Array.
|
391
|
+
|
392
|
+
- Booleans as an argument
|
393
|
+
|
394
|
+
`drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
|
395
|
+
|
396
|
+
- Keys or booleans by a block
|
397
|
+
|
398
|
+
`drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
|
399
|
+
|
400
|
+
- Notice for nil
|
401
|
+
|
402
|
+
When used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil.!`, which returns true.
|
403
|
+
|
404
|
+
```ruby
|
405
|
+
booleans = [true, false, nil]
|
406
|
+
booleans_invert = booleans.map(&:!) # => [false, true, true]
|
407
|
+
df.pick(booleans) == df.drop(booleans_invert) # => true
|
408
|
+
```
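The complement relation between `pick` and `drop` can be modelled without the library. In this sketch a Hash of columns stands in for a DataFrame, and `pick` and `drop` are hypothetical stand-ins written only to show how a nil in the mask behaves as false.

```ruby
# Hypothetical stand-ins operating on a Hash of columns.
def pick(columns, booleans)
  columns.each_with_index.select { |_pair, i| booleans[i] }.map(&:first).to_h
end

def drop(columns, booleans)
  columns.each_with_index.reject { |_pair, i| booleans[i] }.map(&:first).to_h
end

columns = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2.0, 3.0] }
booleans = [true, false, nil]
inverted = booleans.map(&:!) # nil.! is true, so nil acts as false in the mask

p pick(columns, booleans).keys                       # [:a]
p pick(columns, booleans) == drop(columns, inverted) # true
```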
|
409
|
+
- Difference between `pick`/`drop` and `[]`
|
410
|
+
|
411
|
+
If `pick` or `drop` selects a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`. This behavior can be useful in a block of DataFrame manipulations.
|
412
|
+
|
413
|
+
```ruby
|
414
|
+
df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
|
415
|
+
df.pick(:a) # or
|
416
|
+
df.drop(:b, :c)
|
417
|
+
# =>
|
418
|
+
#<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
|
419
|
+
Vector : 1 numeric
|
420
|
+
# key type level data_preview
|
421
|
+
1 :a uint8 3 [1, 2, 3]
|
422
|
+
|
423
|
+
df[:a]
|
424
|
+
# =>
|
425
|
+
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
|
426
|
+
[1, 2, 3]
|
427
|
+
```
|
428
|
+
|
429
|
+
### `slice ` - to cut vertically is slice -
|
430
|
+
|
431
|
+
Slice and select observations (rows) to create a sub DataFrame.
|
432
|
+
|
433
|
+

|
434
|
+
|
435
|
+
- Indices as arguments
|
436
|
+
|
437
|
+
`slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
|
438
|
+
|
439
|
+
Negative indices counting from the tail, as in Ruby's Array, are also acceptable.
|
440
|
+
|
441
|
+
```ruby
|
442
|
+
# returns 5 obs. at start and 5 obs. from end
|
443
|
+
penguins.slice(0...5, -5..-1)
|
444
|
+
# =>
|
445
|
+
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
|
446
|
+
Vectors : 5 numeric, 3 strings
|
447
|
+
# key type level data_preview
|
448
|
+
1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
|
449
|
+
2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
|
450
|
+
3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
|
451
|
+
... 5 more Vectors ...
|
452
|
+
```
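The index semantics referred to here are those of Ruby's Array. As a rough illustration (plain Ruby only; RedAmber's own index handling is not shown and may differ in detail), `Array#values_at` resolves the same mix of Ranges and negative indices:

```ruby
# Row positions of a 344-row table stand in for the observations.
positions = (0...344).to_a

# Array#values_at accepts Integers and Ranges, including negative ones,
# and resolves them against the array in the order given.
selected = positions.values_at(0...5, -5..-1)
p selected # [0, 1, 2, 3, 4, 339, 340, 341, 342, 343]

# The complement is the set of rows left once those positions are removed.
remaining = positions - selected
p remaining.size # 334
```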
|
453
|
+
|
454
|
+
- Booleans as an argument
|
455
|
+
|
456
|
+
`slice(booleans)` accepts booleans as an argument in an Array, a Vector or an `Arrow::BooleanArray`. Booleans must be the same length as `size`.
|
457
|
+
|
458
|
+
```ruby
|
459
|
+
vector = penguins[:bill_length_mm]
|
460
|
+
penguins.slice(vector >= 40)
|
461
|
+
# =>
|
462
|
+
#<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
|
463
|
+
Vectors : 5 numeric, 3 strings
|
464
|
+
# key type level data_preview
|
465
|
+
1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
|
466
|
+
2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
|
467
|
+
3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
|
468
|
+
... 5 more Vectors ...
|
469
|
+
```
|
470
|
+
|
471
|
+
- Indices or booleans by a block
|
472
|
+
|
473
|
+
`slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
|
474
|
+
|
475
|
+
```ruby
|
476
|
+
# returns a DataFrame whose bill_length_mm lies within mean ± std (a range 2*std wide)
|
477
|
+
penguins.slice do
|
478
|
+
vector = self[:bill_length_mm]
|
479
|
+
min = vector.mean - vector.std
|
480
|
+
max = vector.mean + vector.std
|
481
|
+
vector.to_a.map { |e| (min..max).include? e }
|
482
|
+
end
|
483
|
+
|
484
|
+
# =>
|
485
|
+
#<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
|
486
|
+
Vectors : 5 numeric, 3 strings
|
487
|
+
# key type level data_preview
|
488
|
+
1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
|
489
|
+
2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
|
490
|
+
3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
|
491
|
+
... 5 more Vectors ...
|
492
|
+
```
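The boolean mask built inside such a block can be checked in isolation. The following sketch uses a plain Array in place of a Vector and computes a sample standard deviation by hand; the data and the choice of `ddof = 1` are assumptions made for illustration, not a statement about `Vector#std`.

```ruby
# Sample data with a nil, standing in for a numeric Vector.
values = [39.1, 39.5, 40.3, nil, 36.7, 46.0, 50.2]

present = values.compact
mean = present.sum / present.size
# Sample standard deviation (ddof = 1); an assumption for this sketch.
variance = present.sum { |v| (v - mean)**2 } / (present.size - 1)
std = Math.sqrt(variance)

min = mean - std
max = mean + std

# nil never falls inside the band, so it maps to false.
mask = values.map { |v| !v.nil? && v.between?(min, max) }
p mask # [true, true, true, false, false, true, false]
```

A mask like this has the same length as the data, which is what `slice` and `remove` expect from a block returning booleans.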
|
493
|
+
|
494
|
+
- Notice: nil option
|
495
|
+
- `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This propagates nil into the same row.
|
496
|
+
|
497
|
+
```ruby
|
498
|
+
hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
|
499
|
+
table = Arrow::Table.new(hash)
|
500
|
+
table.slice([true, false, nil])
|
501
|
+
# =>
|
502
|
+
#<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
|
503
|
+
a b c
|
504
|
+
0 1 A 1.000000
|
505
|
+
1 (null) (null) (null)
|
506
|
+
```
|
507
|
+
|
508
|
+
- Whereas in RedAmber, nil in the booleans passed to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.
|
509
|
+
|
510
|
+
```ruby
|
511
|
+
RedAmber::DataFrame.new(table).slice([true, false, nil]).table
|
512
|
+
# =>
|
513
|
+
#<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
|
514
|
+
a b c
|
515
|
+
0 1 A 1.000000
|
516
|
+
```
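The two null-selection behaviors can be contrasted in a few lines of plain Ruby. `filter_rows` and its `null_selection:` keyword are hypothetical names for this sketch; they mimic the outputs shown above and are not Arrow or RedAmber API.

```ruby
# Hypothetical row filter showing two ways to treat nil in a mask.
def filter_rows(rows, mask, null_selection: :drop)
  rows.zip(mask).each_with_object([]) do |(row, flag), kept|
    if flag
      kept << row
    elsif flag.nil? && null_selection == :emit_null
      # Emit a row of nils in place of the undecidable row.
      kept << Array.new(row.size)
    end
  end
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
mask = [true, false, nil]

p filter_rows(rows, mask, null_selection: :emit_null) # [[1, "A", 1.0], [nil, nil, nil]]
p filter_rows(rows, mask, null_selection: :drop)      # [[1, "A", 1.0]]
```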
|
517
|
+
|
518
|
+
### `remove`
|
519
|
+
|
520
|
+
Slice and reject observations (rows) to create a remainder DataFrame.
|
521
|
+
|
522
|
+

|
523
|
+
|
524
|
+
- Indices as arguments
|
525
|
+
|
526
|
+
`remove(indices)` accepts indices as arguments. Indices should be Integers or Ranges of Integers.
|
527
|
+
|
528
|
+
```ruby
|
529
|
+
# returns 6th to 339th obs.
|
530
|
+
penguins.remove(0...5, -5..-1)
|
531
|
+
# =>
|
532
|
+
#<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
|
533
|
+
Vectors : 5 numeric, 3 strings
|
534
|
+
# key type level data_preview
|
535
|
+
1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
|
536
|
+
2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
|
537
|
+
3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
|
538
|
+
... 5 more Vectors ...
|
539
|
+
```
|
540
|
+
|
541
|
+
- Booleans as an argument
|
542
|
+
|
543
|
+
`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an `Arrow::BooleanArray`. Booleans must be the same length as `size`.
|
544
|
+
|
545
|
+
```ruby
|
546
|
+
# remove all observations containing nil
|
547
|
+
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
|
548
|
+
removed.tdr
|
549
|
+
# =>
|
550
|
+
RedAmber::DataFrame : 333 x 8 Vectors
|
551
|
+
Vectors : 5 numeric, 3 strings
|
552
|
+
# key type level data_preview
|
553
|
+
1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
|
554
|
+
2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
|
555
|
+
3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
|
556
|
+
4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
|
557
|
+
5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
|
558
|
+
6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
|
559
|
+
7 :sex string 2 {"male"=>168, "female"=>165}
|
560
|
+
8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
|
561
|
+
```
|
562
|
+
|
563
|
+
- Indices or booleans by a block
|
564
|
+
|
565
|
+
`remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
|
566
|
+
|
567
|
+
```ruby
|
568
|
+
penguins.remove do
|
569
|
+
vector = self[:bill_length_mm]
|
570
|
+
min = vector.mean - vector.std
|
571
|
+
max = vector.mean + vector.std
|
572
|
+
vector.to_a.map { |e| (min..max).include? e }
|
573
|
+
end
|
574
|
+
# =>
|
575
|
+
#<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
|
576
|
+
Vectors : 5 numeric, 3 strings
|
577
|
+
# key type level data_preview
|
578
|
+
1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
|
579
|
+
2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
|
580
|
+
3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
|
581
|
+
... 5 more Vectors ...
|
582
|
+
```
|
583
|
+
- Notice for nil
|
584
|
+
- When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil.!`, which returns true.
|
585
|
+
|
586
|
+
```ruby
|
587
|
+
df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
|
588
|
+
booleans = df[:a] < 2
|
589
|
+
# =>
|
590
|
+
#<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
|
591
|
+
[true, false, nil]
|
592
|
+
|
593
|
+
booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
|
594
|
+
df.slice(booleans) == df.remove(booleans_invert) # => true
|
595
|
+
```
|
596
|
+
- Whereas `Vector#invert` returns nil for nil elements. This gives a different result.
|
597
|
+
|
598
|
+
```ruby
|
599
|
+
booleans.invert
|
600
|
+
# =>
|
601
|
+
#<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
|
602
|
+
[false, true, nil]
|
603
|
+
|
604
|
+
df.remove(booleans.invert)
|
605
|
+
#<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
|
606
|
+
Vectors : 2 numeric, 1 string
|
607
|
+
# key type level data_preview
|
608
|
+
1 :a uint8 2 [1, nil], 1 nil
|
609
|
+
2 :b string 2 ["A", "C"]
|
610
|
+
3 :c double 2 [1.0, 3.0]
|
611
|
+
```
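The difference between the two kinds of inversion can be reproduced with plain Arrays. In this sketch the lambda `remove` is a hypothetical stand-in that rejects rows whose flag is truthy; it mirrors the results above but is not RedAmber code.

```ruby
# Plain Arrays stand in for a boolean Vector and the rows of a DataFrame.
rows = [[1, 'A'], [2, 'B'], [nil, 'C']]
booleans = [true, false, nil]

# Ruby's `!` turns nil into true; a nil-preserving invert keeps it nil.
ruby_not = booleans.map(&:!)
nil_keeping = booleans.map { |b| b.nil? ? nil : !b }

# Hypothetical `remove`: drop rows whose flag is truthy (nil counts as false).
remove = ->(flags) { rows.reject.with_index { |_row, i| flags[i] } }

p remove.call(ruby_not)    # [[1, "A"]]
p remove.call(nil_keeping) # [[1, "A"], [nil, "C"]]
```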
|
612
|
+
|
613
|
+
### `rename`
|
614
|
+
|
615
|
+
Rename keys (column names) to create an updated DataFrame.
|
616
|
+
|
617
|
+

|
618
|
+
|
619
|
+
- Key pairs as arguments
|
620
|
+
|
621
|
+
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
|
622
|
+
|
623
|
+
```ruby
|
624
|
+
h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
|
625
|
+
df = RedAmber::DataFrame.new(h)
|
626
|
+
df.rename(:age => :age_in_1993)
|
627
|
+
# =>
|
628
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
|
629
|
+
Vectors : 1 numeric, 1 string
|
630
|
+
# key type level data_preview
|
631
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
632
|
+
2 :age_in_1993 uint8 3 [68, 49, 28]
|
633
|
+
```
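Conceptually, renaming maps each existing key to a new one while leaving the column order and data untouched. A minimal sketch over a Hash of columns, where `rename` is a hypothetical stand-in rather than the library method:

```ruby
# Hypothetical `rename` over a Hash of columns; column order is preserved.
def rename(columns, key_pairs)
  columns.transform_keys { |key| key_pairs.fetch(key, key) }
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
renamed = rename(columns, { age: :age_in_1993 })

p renamed.keys          # [:name, :age_in_1993]
p renamed[:age_in_1993] # [68, 49, 28]
```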
|
634
|
+
|
635
|
+
- Key pairs by a block
|
636
|
+
|
637
|
+
`rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. The block is called in the context of self.
|
638
|
+
|
639
|
+
- Key type
|
640
|
+
|
641
|
+
Symbol key and String key are distinguished.
|
642
|
+
|
643
|
+
### `assign`
|
644
|
+
|
645
|
+
Assign new or updated variables (columns) and create an updated DataFrame.
|
646
|
+
|
647
|
+
- Variables with new keys are appended at the bottom (at the right in the table).
|
648
|
+
- Variables with existing keys will update the corresponding vectors.
|
649
|
+
|
650
|
+

|
651
|
+
|
652
|
+
- Variables as arguments
|
653
|
+
|
654
|
+
`assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
|
655
|
+
|
656
|
+
```ruby
|
657
|
+
df = RedAmber::DataFrame.new(
|
658
|
+
'name' => %w[Yasuko Rui Hinata],
|
659
|
+
'age' => [68, 49, 28])
|
660
|
+
# =>
|
661
|
+
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
|
662
|
+
Vectors : 1 numeric, 1 string
|
663
|
+
# key type level data_preview
|
664
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
665
|
+
2 :age uint8 3 [68, 49, 28]
|
666
|
+
|
667
|
+
# update :age and add :brother
|
668
|
+
assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
|
669
|
+
df.assign(assigner)
|
670
|
+
# =>
|
671
|
+
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
|
672
|
+
Vectors : 1 numeric, 2 strings
|
673
|
+
# key type level data_preview
|
674
|
+
1 :name string 3 ["Yasuko", "Rui", "Hinata"]
|
675
|
+
2 :age uint8 3 [97, 78, 57]
|
676
|
+
3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
|
677
|
+
```
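The update-or-append rule matches what `Hash#merge` does in Ruby: existing keys keep their position and receive the new value, while new keys are appended at the end. The sketch below uses a hypothetical `assign` over a Hash of columns and normalizes keys to Symbols, on the assumption that String and Symbol keys address the same variable.

```ruby
# Hypothetical `assign` over a Hash of columns.
def assign(columns, assigner)
  # Normalize keys so that 'age' and :age address the same variable.
  columns.transform_keys(&:to_sym).merge(assigner.transform_keys(&:to_sym))
end

columns = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
updated = assign(columns, { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] })

p updated.keys  # [:name, :age, :brother]
p updated[:age] # [97, 78, 57]
```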
|
678
|
+
|
679
|
+
- Key pairs by a block
|
680
|
+
|
681
|
+
`assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. The block is called in the context of self.
|
682
|
+
|
683
|
+
```ruby
|
684
|
+
df = RedAmber::DataFrame.new(
|
685
|
+
index: [0, 1, 2, 3, nil],
|
686
|
+
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
687
|
+
string: ['A', 'B', 'C', 'D', nil])
|
688
|
+
# =>
|
689
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
|
690
|
+
Vectors : 2 numeric, 1 string
|
691
|
+
# key type level data_preview
|
692
|
+
1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
|
693
|
+
2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
|
694
|
+
3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
|
695
|
+
|
696
|
+
# update numeric variables
|
697
|
+
df.assign do
|
698
|
+
assigner = {}
|
699
|
+
vectors.each_with_index do |v, i|
|
700
|
+
assigner[keys[i]] = v * -1 if v.numeric?
|
701
|
+
end
|
702
|
+
assigner
|
703
|
+
end
|
704
|
+
# =>
|
705
|
+
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
|
706
|
+
Vectors : 2 numeric, 1 string
|
707
|
+
# key type level data_preview
|
708
|
+
1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
|
709
|
+
2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
|
710
|
+
3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
|
711
|
+
|
712
|
+
# Or, more concisely:
|
713
|
+
df.assign do
|
714
|
+
variables.select.with_object({}) do |(key, vector), assigner|
|
715
|
+
assigner[key] = vector * -1 if vector.numeric?
|
716
|
+
end
|
717
|
+
end
|
718
|
+
# => same as above
|
719
|
+
```
|
720
|
+
|
721
|
+
- Key type
|
722
|
+
|
723
|
+
Symbol key and String key are considered as the same key.
|
724
|
+
|
725
|
+
## Updating
|
726
|
+
|
727
|
+
### `sort`
|
728
|
+
|
729
|
+
`sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature.
|
730
|
+
- :key, "key" or "+key" denotes ascending order
|
731
|
+
- "-key" denotes descending order
|
732
|
+
|
733
|
+
```ruby
|
734
|
+
df = RedAmber::DataFrame.new({
|
735
|
+
index: [1, 1, 0, nil, 0],
|
736
|
+
string: ['C', 'B', nil, 'A', 'B'],
|
737
|
+
bool: [nil, true, false, true, false],
|
738
|
+
})
|
739
|
+
df.sort(:index, '-bool').tdr(tally: 0)
|
740
|
+
# =>
|
741
|
+
RedAmber::DataFrame : 5 x 3 Vectors
|
742
|
+
Vectors : 1 numeric, 1 string, 1 boolean
|
743
|
+
# key type level data_preview
|
744
|
+
1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
|
745
|
+
2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
|
746
|
+
3 :bool boolean 3 [false, false, true, nil, true], 1 nil
|
747
|
+
```
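The sort-key notation can be imitated in plain Ruby to see how the ordering above comes about. Everything in this sketch is hypothetical (`parse_sort_key`, `compare`, `sort_rows`); in particular, placing nil last for both directions and breaking ties by original position are inferred from the sample output, not taken from Arrow's implementation.

```ruby
# Parse :key, 'key', '+key' or '-key' into [Symbol, :ascending | :descending].
def parse_sort_key(key)
  name = key.to_s
  order = name.start_with?('-') ? :descending : :ascending
  [name.sub(/\A[+-]/, '').to_sym, order]
end

# Compare two values, placing nil last regardless of the order.
def compare(left, right, order)
  return 0 if left.nil? && right.nil?
  return 1 if left.nil?
  return -1 if right.nil?

  # Booleans have no <=>, so rank false below true.
  rank = ->(v) { v == true ? 1 : (v == false ? 0 : v) }
  result = rank.call(left) <=> rank.call(right)
  order == :descending ? -result : result
end

def sort_rows(rows, *sort_keys)
  parsed = sort_keys.map { |key| parse_sort_key(key) }
  rows.each_with_index.sort do |(a, i), (b, j)|
    diff = parsed.map { |name, order| compare(a[name], b[name], order) }
                 .find { |c| !c.zero? }
    diff || (i <=> j) # original position breaks ties
  end.map(&:first)
end

rows = [
  { index: 1,   string: 'C', bool: nil },
  { index: 1,   string: 'B', bool: true },
  { index: 0,   string: nil, bool: false },
  { index: nil, string: 'A', bool: true },
  { index: 0,   string: 'B', bool: false }
]

sorted = sort_rows(rows, :index, '-bool')
p sorted.map { |row| row[:string] } # [nil, "B", "B", "C", "A"]
```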
|
748
|
+
|
749
|
+
- [ ] Clamp
|
750
|
+
|
751
|
+
- [ ] Clear data
|
752
|
+
|
753
|
+
## Treat na data
|
754
|
+
|
755
|
+
### `remove_nil`
|
756
|
+
|
757
|
+
Remove any observations containing nil.
|
758
|
+
|
759
|
+
## Grouping
|
760
|
+
|
761
|
+
### `group(aggregating_keys, function, target_keys)`
|
762
|
+
|
763
|
+
(This is a temporary API and may change in a future version.)
|
764
|
+
|
765
|
+
Creates a DataFrame grouped by `aggregating_keys`, applies `function` to each group, and returns the result for `target_keys`. Aggregated key names take the form `function(key)`.
|
766
|
+
|
767
|
+
(The current implementation is not intuitive. Needs improvement.)
|
768
|
+
|
769
|
+
```ruby
|
770
|
+
ds = Datasets::Rdatasets.new('dplyr', 'starwars')
|
771
|
+
starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
|
772
|
+
starwars.tdr(11)
|
773
|
+
# =>
|
774
|
+
RedAmber::DataFrame : 87 x 11 Vectors
|
775
|
+
Vectors : 3 numeric, 8 strings
|
776
|
+
# key type level data_preview
|
777
|
+
1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
|
778
|
+
2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
|
779
|
+
3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
|
780
|
+
4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
|
781
|
+
5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
|
782
|
+
6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
|
783
|
+
7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
|
784
|
+
8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
|
785
|
+
9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
|
786
|
+
10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
|
787
|
+
11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
|
788
|
+
|
789
|
+
grouped = starwars.group(:species, :mean, [:mass, :height])
|
790
|
+
# =>
|
791
|
+
#<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
|
792
|
+
Vectors : 2 numeric, 1 string
|
793
|
+
# key type level data_preview
|
794
|
+
1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
|
795
|
+
2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
|
796
|
+
3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
|
797
|
+
|
798
|
+
count = starwars.group(:species, :count, :species)[:"count(species)"]
|
799
|
+
df = grouped.slice(count > 1)
|
800
|
+
# =>
|
801
|
+
#<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
|
802
|
+
Vectors : 2 numeric, 1 string
|
803
|
+
# key type level data_preview
|
804
|
+
1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
|
805
|
+
2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
|
806
|
+
3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
|
807
|
+
|
808
|
+
df.table
|
809
|
+
# =>
|
810
|
+
#<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
|
811
|
+
mean(mass) mean(height) species
|
812
|
+
0 82.781818 176.645161 Human
|
813
|
+
1 69.750000 131.200000 Droid
|
814
|
+
2 124.000000 231.000000 Wookiee
|
815
|
+
3 74.000000 208.666667 Gungan
|
816
|
+
4 80.000000 173.000000 Zabrak
|
817
|
+
5 55.000000 179.000000 Twi'lek
|
818
|
+
6 53.100000 168.000000 Mirialan
|
819
|
+
7 88.000000 221.000000 Kaminoan
|
820
|
+
```
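To make the `function(key)` naming and the per-group aggregation concrete, here is a small plain-Ruby sketch of a grouped mean. `group_mean` is a hypothetical helper, the rows are made-up sample data, and skipping nil values when averaging is an assumption inferred from the output above.

```ruby
# Hypothetical grouped mean over an Array of row Hashes.
def group_mean(rows, group_key, target_keys)
  rows.group_by { |row| row[group_key] }.map do |group, members|
    means = target_keys.to_h do |key|
      values = members.map { |row| row[key] }.compact
      # A group with no usable values yields nil, like the nils in the output.
      mean = values.empty? ? nil : values.sum.to_f / values.size
      [:"mean(#{key})", mean]
    end
    means.merge(group_key => group)
  end
end

rows = [
  { species: 'Human', mass: 77.0, height: 172 },
  { species: 'Droid', mass: 75.0, height: 167 },
  { species: 'Droid', mass: 32.0, height: 96 },
  { species: 'Human', mass: nil,  height: 150 }
]

result = group_mean(rows, :species, %i[mass height])
p result.map { |row| row[:species] }       # ["Human", "Droid"]
p result.map { |row| row[:"mean(mass)"] }  # [77.0, 53.5]
```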
|
821
|
+
|
822
|
+
Available functions are:
|
823
|
+
|
824
|
+
- [ ] all
|
825
|
+
- [ ] any
|
826
|
+
- [ ] approximate_median
|
827
|
+
- ✓ count
|
828
|
+
- [ ] count_distinct
|
829
|
+
- [ ] distinct
|
830
|
+
- ✓ max
|
831
|
+
- ✓ mean
|
832
|
+
- ✓ min
|
833
|
+
- [ ] min_max
|
834
|
+
- ✓ product
|
835
|
+
- ✓ stddev
|
836
|
+
- ✓ sum
|
837
|
+
- [ ] tdigest
|
838
|
+
- ✓ variance
|
839
|
+
|
840
|
+
## Combining DataFrames
|
841
|
+
|
842
|
+
- [ ] obs
|
843
|
+
|
844
|
+
- [ ] Add vars
|
845
|
+
|
846
|
+
- [ ] Inner join
|
847
|
+
|
848
|
+
- [ ] Left join
|
849
|
+
|
850
|
+
## Encoding
|
851
|
+
|
852
|
+
- [ ] One-hot encoding
|
853
|
+
|
854
|
+
## Iteration (not implemented)
|