red_amber 0.1.3 → 0.1.6

Files changed (43)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +31 -7
  3. data/CHANGELOG.md +214 -10
  4. data/Gemfile +4 -0
  5. data/README.md +117 -342
  6. data/benchmark/csv_load_penguins.yml +15 -0
  7. data/benchmark/drop_nil.yml +11 -0
  8. data/doc/DataFrame.md +854 -0
  9. data/doc/Vector.md +449 -0
  10. data/doc/image/arrow_table_new.png +0 -0
  11. data/doc/image/dataframe/assign.png +0 -0
  12. data/doc/image/dataframe/drop.png +0 -0
  13. data/doc/image/dataframe/pick.png +0 -0
  14. data/doc/image/dataframe/remove.png +0 -0
  15. data/doc/image/dataframe/rename.png +0 -0
  16. data/doc/image/dataframe/slice.png +0 -0
  17. data/doc/image/dataframe_model.png +0 -0
  18. data/doc/image/example_in_red_arrow.png +0 -0
  19. data/doc/image/tdr.png +0 -0
  20. data/doc/image/tdr_and_table.png +0 -0
  21. data/doc/image/tidy_data_in_TDR.png +0 -0
  22. data/doc/image/vector/binary_element_wise.png +0 -0
  23. data/doc/image/vector/unary_aggregation.png +0 -0
  24. data/doc/image/vector/unary_aggregation_w_option.png +0 -0
  25. data/doc/image/vector/unary_element_wise.png +0 -0
  26. data/doc/tdr.md +56 -0
  27. data/doc/tdr_ja.md +56 -0
  28. data/lib/red-amber.rb +27 -0
  29. data/lib/red_amber/data_frame.rb +91 -37
  30. data/lib/red_amber/{data_frame_output.rb → data_frame_displayable.rb} +49 -41
  31. data/lib/red_amber/data_frame_indexable.rb +38 -0
  32. data/lib/red_amber/data_frame_observation_operation.rb +11 -0
  33. data/lib/red_amber/data_frame_selectable.rb +155 -48
  34. data/lib/red_amber/data_frame_variable_operation.rb +137 -0
  35. data/lib/red_amber/helper.rb +61 -0
  36. data/lib/red_amber/vector.rb +69 -16
  37. data/lib/red_amber/vector_functions.rb +80 -45
  38. data/lib/red_amber/vector_selectable.rb +124 -0
  39. data/lib/red_amber/vector_updatable.rb +104 -0
  40. data/lib/red_amber/version.rb +1 -1
  41. data/lib/red_amber.rb +1 -16
  42. data/red_amber.gemspec +3 -6
  43. metadata +38 -9
data/doc/Vector.md ADDED
@@ -0,0 +1,449 @@
1
+ # Vector
2
+
3
+ Class `RedAmber::Vector` represents a series of data in the DataFrame.
4
+
5
+ ## Constructor
6
+
7
+ ### Create from a column in a DataFrame
8
+
9
+ ```ruby
10
+ df = RedAmber::DataFrame.new(x: [1, 2, 3])
11
+ df[:x]
12
+ # =>
13
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f4ec>
14
+ [1, 2, 3]
15
+ ```
16
+
17
+ ### New from an Array
18
+
19
+ ```ruby
20
+ vector = RedAmber::Vector.new([1, 2, 3])
21
+ # or
22
+ vector = RedAmber::Vector.new(1, 2, 3)
23
+ # or
24
+ vector = RedAmber::Vector.new(1..3)
25
+ # or
26
+ vector = RedAmber::Vector.new(Arrow::Array.new([1, 2, 3]))
27
+
28
+ # =>
29
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f514>
30
+ [1, 2, 3]
31
+ ```
32
+
33
+ ## Properties
34
+
35
+ ### `to_s`
36
+
37
+ ### `values`, `to_a`, `entries`
38
+
39
+ ### `indices`, `indexes`, `indeces`
40
+
41
+ Returns the indices of the Vector as an Array.
42
+
43
+ ### `to_ary`
44
+ Vector has `#to_ary`. It implicitly converts a Vector to an Array when required.
45
+
46
+ ```ruby
47
+ [1, 2] + RedAmber::Vector.new([3, 4])
48
+
49
+ # =>
50
+ [1, 2, 3, 4]
51
+ ```
52
+
53
+ ### `size`, `length`, `n_rows`, `nrow`
54
+
55
+ ### `empty?`
56
+
57
+ ### `type`
58
+
59
+ ### `boolean?`, `numeric?`, `string?`, `temporal?`
60
+
61
+ ### `type_class`
62
+
63
+ ### [ ] `each` (not implemented yet)
64
+
65
+ ### [ ] `chunked?` (not implemented yet)
66
+
67
+ ### [ ] `n_chunks` (not implemented yet)
68
+
69
+ ### [ ] `each_chunk` (not implemented yet)
70
+
71
+ ### `n_nils`, `n_nans`
72
+
73
+ - `n_nulls` is an alias of `n_nils`
74
+
75
+ ### `has_nil?`
76
+
77
+ Returns `true` if self has any `nil`. Otherwise returns `false`.
78
+
79
+ ### `inspect(limit: 80)`
80
+
81
+ - `limit` sets the size limit used when displaying a long array.
82
+
83
+ ```ruby
84
+ vector = RedAmber::Vector.new((1..50).to_a)
85
+ # =>
86
+ #<RedAmber::Vector(:uint8, size=50):0x000000000000f528>
87
+ [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, ... ]
88
+ ```
89
+
90
+ ## Selecting Values
91
+
92
+ ### `take(indices)`, `[](indices)`
93
+
94
+ - Acceptable class for indices:
95
+ - Integer, Float
96
+ - Vector of integer or float
97
+ - Arrow::Array of integer or float
98
+ - Negative indices are also accepted, as with Ruby's built-in Array.
99
+
100
+ ```ruby
101
+ array = RedAmber::Vector.new(%w[A B C D E])
102
+ indices = RedAmber::Vector.new([0.1, -0.5, -5.1])
103
+ array.take(indices)
104
+ # or
105
+ array[indices]
106
+
107
+ # =>
108
+ #<RedAmber::Vector(:string, size=3):0x000000000000f820>
109
+ ["A", "E", "A"]
110
+ ```
111
+
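Reviewer note: the `take` example above returns `["A", "E", "A"]` for the Float indices `[0.1, -0.5, -5.1]`. One reading consistent with that output is "shift negative indices by the size, then truncate toward zero". The plain-Ruby sketch below models that reading only; `take_model` is a hypothetical helper, not RedAmber's implementation, and the order of the two steps is an assumption inferred from the documented result.

```ruby
# Plain-Ruby model of index normalization for `take`.
# Assumption: negative indices are shifted by the size first,
# and Float indices are then truncated toward zero.
def take_model(values, indices)
  size = values.size
  indices.map do |index|
    shifted = index.negative? ? index + size : index
    values.fetch(shifted.to_i) # fetch raises IndexError when out of range
  end
end

p take_model(%w[A B C D E], [0.1, -0.5, -5.1])
```

With this model, `-0.5` becomes `4.5` and then `4`, which selects the last element, matching the documented output.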
112
+ ### `filter(booleans)`, `[](booleans)`
113
+
114
+ - Acceptable class for booleans:
115
+ - An array of true, false, or nil
116
+ - Boolean Vector
117
+ - Arrow::BooleanArray
118
+
119
+ ```ruby
120
+ array = RedAmber::Vector.new(%w[A B C D E])
121
+ booleans = [true, false, nil, false, true]
122
+ array.filter(booleans)
123
+ # or
124
+ array[booleans]
125
+
126
+ # =>
127
+ #<RedAmber::Vector(:string, size=2):0x000000000000f21c>
128
+ ["A", "E"]
129
+ ```
130
+
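Reviewer note: in the `filter` example above, the element paired with `nil` is dropped just like the ones paired with `false`. A minimal plain-Ruby sketch of that behavior follows; `filter_model` is a hypothetical name used for illustration and does not call into RedAmber or Arrow.

```ruby
# Plain-Ruby model of boolean filtering: keep a value only when its
# flag is true; false and nil both drop the value.
def filter_model(values, booleans)
  values.zip(booleans).select { |_value, flag| flag }.map(&:first)
end

p filter_model(%w[A B C D E], [true, false, nil, false, true])
```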
131
+ ## Functions
132
+
133
+ ### Unary aggregations: `vector.func => scalar`
134
+
135
+ ![unary aggregation](doc/image/../../image/vector/unary_aggregation_w_option.png)
136
+
137
+ | Method |Boolean|Numeric|String|Options|Remarks|
138
+ | ----------- | --- | --- | --- | --- | --- |
139
+ | ✓ `all?` | ✓ | | | ✓ ScalarAggregate| alias `all` |
140
+ | ✓ `any?` | ✓ | | | ✓ ScalarAggregate| alias `any` |
141
+ | ✓ `approximate_median`| |✓| | ✓ ScalarAggregate| alias `median`|
142
+ | ✓ `count` | ✓ | ✓ | ✓ | ✓ Count | |
143
+ | ✓ `count_distinct`| ✓ | ✓ | ✓ | ✓ Count |alias `count_uniq`|
144
+ |[ ]`index` | [ ] | [ ] | [ ] |[ ] Index | |
145
+ | ✓ `max` | ✓ | ✓ | ✓ | ✓ ScalarAggregate| |
146
+ | ✓ `mean` | ✓ | ✓ | | ✓ ScalarAggregate| |
147
+ | ✓ `min` | ✓ | ✓ | ✓ | ✓ ScalarAggregate| |
148
+ | ✓ `min_max` | ✓ | ✓ | ✓ | ✓ ScalarAggregate| |
149
+ |[ ]`mode` | | [ ] | |[ ] Mode | |
150
+ | ✓ `product` | ✓ | ✓ | | ✓ ScalarAggregate| |
151
+ |[ ]`quantile`| | [ ] | |[ ] Quantile| |
152
+ | ✓ `sd ` | | ✓ | | |ddof: 1 at `stddev`|
153
+ | ✓ `stddev` | | ✓ | | ✓ Variance|ddof: 0 by default|
154
+ | ✓ `sum` | ✓ | ✓ | | ✓ ScalarAggregate| |
155
+ |[ ]`tdigest` | | [ ] | |[ ] TDigest | |
156
+ | ✓ `var `| | ✓ | | |ddof: 1 at `variance`<br>alias `unbiased_variance`|
157
+ | ✓ `variance`| | ✓ | | ✓ Variance|ddof: 0 by default|
158
+
159
+
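Reviewer note: the table above distinguishes `variance`/`stddev` (ddof: 0 by default) from `var`/`sd` (ddof: 1). The sketch below shows what the `ddof` (delta degrees of freedom) divisor changes, using plain Ruby arrays. `variance_model` and `stddev_model` are hypothetical helpers written for this note, not the library's code.

```ruby
# Variance with a configurable ddof: the sum of squared deviations
# is divided by (n - ddof). ddof: 0 is the population variance,
# ddof: 1 is the unbiased sample variance.
def variance_model(values, ddof: 0)
  mean = values.sum.to_f / values.size
  squared_deviations = values.sum { |v| (v - mean)**2 }
  squared_deviations / (values.size - ddof)
end

def stddev_model(values, ddof: 0)
  Math.sqrt(variance_model(values, ddof: ddof))
end

data = [1, 2, 3, 4]
p variance_model(data)          # corresponds to `variance` (ddof: 0)
p variance_model(data, ddof: 1) # corresponds to `var` (ddof: 1)
```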
160
+ Options can be used as follows.
161
+ See the [documentation of the Arrow C++ compute functions](https://arrow.apache.org/docs/cpp/compute.html) for details.
162
+
163
+ ```ruby
164
+ double = RedAmber::Vector.new([1, 0/0.0, -1/0.0, 1/0.0, nil, ""])
165
+ #=>
166
+ #<RedAmber::Vector(:double, size=6):0x000000000000f910>
167
+ [1.0, NaN, -Infinity, Infinity, nil, 0.0]
168
+
169
+ double.count #=> 5
170
+ double.count(opts: {mode: :only_valid}) #=> 5, default
171
+ double.count(opts: {mode: :only_null}) #=> 1
172
+ double.count(opts: {mode: :all}) #=> 6
173
+
174
+ boolean = RedAmber::Vector.new([true, true, nil])
175
+ #=>
176
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f924>
177
+ [true, true, nil]
178
+
179
+ boolean.all #=> true
180
+ boolean.all(opts: {skip_nulls: true}) #=> true
181
+ boolean.all(opts: {skip_nulls: false}) #=> false
182
+ ```
183
+
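Reviewer note: the three `count` modes shown above differ only in how `nil` is treated; NaN and the infinities count as valid values. A plain-Ruby sketch of that distinction, where `count_model` is a hypothetical helper and not the Arrow implementation:

```ruby
# Plain-Ruby model of the Count option `mode`.
def count_model(values, mode: :only_valid)
  case mode
  when :only_valid then values.count { |v| !v.nil? }
  when :only_null  then values.count(&:nil?)
  when :all        then values.size
  else raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end

values = [1.0, Float::NAN, -Float::INFINITY, Float::INFINITY, nil, 0.0]
p count_model(values)                   # 5
p count_model(values, mode: :only_null) # 1
p count_model(values, mode: :all)       # 6
```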
184
+ ### Unary element-wise: `vector.func => vector`
185
+
186
+ ![unary element-wise](doc/image/../../image/vector/unary_element_wise.png)
187
+
188
+ | Method |Boolean|Numeric|String|Options|Remarks|
189
+ | ------------ | --- | --- | --- | --- | ----- |
190
+ | ✓ `-@` | | ✓ | | |as `-vector`|
191
+ | ✓ `negate` | | ✓ | | |`-@` |
192
+ | ✓ `abs` | | ✓ | | | |
193
+ |[ ]`acos` | | [ ] | | | |
194
+ |[ ]`asin` | | [ ] | | | |
195
+ | ✓ `atan` | | ✓ | | | |
196
+ | ✓ `bit_wise_not`| | (✓) | | |integer only|
197
+ | ✓ `ceil` | | ✓ | | | |
198
+ | ✓ `cos` | | ✓ | | | |
199
+ | ✓`fill_nil_backward`| ✓ | ✓ | ✓ | | |
200
+ | ✓`fill_nil_forward` | ✓ | ✓ | ✓ | | |
201
+ | ✓ `floor` | | ✓ | | | |
202
+ | ✓ `invert` | ✓ | | | |`!`, alias `not`|
203
+ |[ ]`ln` | | [ ] | | | |
204
+ |[ ]`log10` | | [ ] | | | |
205
+ |[ ]`log1p` | | [ ] | | | |
206
+ |[ ]`log2` | | [ ] | | | |
207
+ | ✓ `round` | | ✓ | | ✓ Round (:mode, :n_digits)| |
208
+ | ✓ `round_to_multiple`| | ✓ | | ✓ RoundToMultiple :mode, :multiple| multiple must be an Arrow::Scalar|
209
+ | ✓ `sign` | | ✓ | | | |
210
+ | ✓ `sin` | | ✓ | | | |
211
+ | ✓`sort_indexes`| ✓ | ✓ | ✓ |:order|alias `sort_indices`|
212
+ | ✓ `tan` | | ✓ | | | |
213
+ | ✓ `trunc` | | ✓ | | | |
214
+
215
+ ### Binary element-wise: `vector.func(vector) => vector`
216
+
217
+ ![binary element-wise](doc/image/../../image/vector/binary_element_wise.png)
218
+
219
+ | Method |Boolean|Numeric|String|Options|Remarks|
220
+ | ----------------- | --- | --- | --- | --- | ----- |
221
+ | ✓ `add` | | ✓ | | | `+` |
222
+ | ✓ `atan2` | | ✓ | | | |
223
+ | ✓ `and_kleene` | ✓ | | | | `&` |
224
+ | ✓ `and_org ` | ✓ | | | |`and` in Red Arrow|
225
+ | ✓ `and_not` | ✓ | | | | |
226
+ | ✓ `and_not_kleene`| ✓ | | | | |
227
+ | ✓ `bit_wise_and` | | (✓) | | |integer only|
228
+ | ✓ `bit_wise_or` | | (✓) | | |integer only|
229
+ | ✓ `bit_wise_xor` | | (✓) | | |integer only|
230
+ | ✓ `divide` | | ✓ | | | `/` |
231
+ | ✓ `equal` | ✓ | ✓ | ✓ | |`==`, alias `eq`|
232
+ | ✓ `greater` | ✓ | ✓ | ✓ | |`>`, alias `gt`|
233
+ | ✓ `greater_equal` | ✓ | ✓ | ✓ | |`>=`, alias `ge`|
234
+ | ✓ `is_finite` | | ✓ | | | |
235
+ | ✓ `is_inf` | | ✓ | | | |
236
+ | ✓ `is_na` | ✓ | ✓ | ✓ | | |
237
+ | ✓ `is_nan` | | ✓ | | | |
238
+ |[ ]`is_nil` | ✓ | ✓ | ✓ |[ ] Null|alias `is_null`|
239
+ | ✓ `is_valid` | ✓ | ✓ | ✓ | | |
240
+ | ✓ `less` | ✓ | ✓ | ✓ | |`<`, alias `lt`|
241
+ | ✓ `less_equal` | ✓ | ✓ | ✓ | |`<=`, alias `le`|
242
+ |[ ]`logb` | | [ ] | | | |
243
+ |[ ]`mod` | | [ ] | | | `%` |
244
+ | ✓ `multiply` | | ✓ | | | `*` |
245
+ | ✓ `not_equal` | ✓ | ✓ | ✓ | |`!=`, alias `ne`|
246
+ | ✓ `or_kleene` | ✓ | | | | `\|` |
247
+ | ✓ `or_org` | ✓ | | | |`or` in Red Arrow|
248
+ | ✓ `power` | | ✓ | | | `**` |
249
+ | ✓ `subtract` | | ✓ | | | `-` |
250
+ | ✓ `shift_left` | | (✓) | | |`<<`, integer only|
251
+ | ✓ `shift_right` | | (✓) | | |`>>`, integer only|
252
+ | ✓ `xor` | ✓ | | | | `^` |
253
+
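Reviewer note: the table above lists both Kleene (`and_kleene`, `or_kleene`) and non-Kleene (`and_org`, `or_org`) variants. The difference only shows up when an operand is `nil`. The sketch below models single-element truth tables in plain Ruby, following the Kleene semantics described in the Arrow compute documentation. The method names mirror the table for readability, but these are standalone model functions, not RedAmber methods.

```ruby
# Kleene AND: a definite false wins even when the other side is unknown.
def and_kleene(a, b)
  return false if a == false || b == false
  return nil if a.nil? || b.nil?
  true
end

# Non-Kleene AND: any nil operand makes the result nil.
def and_org(a, b)
  return nil if a.nil? || b.nil?
  a && b
end

# Kleene OR: a definite true wins even when the other side is unknown.
def or_kleene(a, b)
  return true if a == true || b == true
  return nil if a.nil? || b.nil?
  false
end

p and_kleene(false, nil) # false
p and_org(false, nil)    # nil
p or_kleene(true, nil)   # true
```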
254
+ ### `uniq`
255
+
256
+ Returns a new array with distinct elements.
257
+
258
+ (Functions not implemented yet)
259
+
260
+ ### `tally` and `value_counts`
261
+
262
+ Compute counts of unique elements and return a Hash.
263
+
264
+ Returns almost the same result as Ruby's `tally`, except that these methods treat all NaNs as the same value.
265
+
266
+ ```ruby
267
+ array = [0.0/0, Float::NAN]
268
+ array.tally #=> {NaN=>1, NaN=>1}
269
+
270
+ vector = RedAmber::Vector.new(array)
271
+ vector.tally #=> {NaN=>2}
272
+ vector.value_counts #=> {NaN=>2}
273
+ ```
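Reviewer note: Ruby's `Array#tally` keeps the two NaNs apart because NaN never compares equal to another NaN object. One way to get the merged count shown for `Vector#tally` in plain Ruby is to map every NaN onto a single key object first. `tally_model` below is a hypothetical sketch of that idea, not the library's implementation.

```ruby
# Tally that folds every Float NaN into one key.
# Float::NAN is a single constant object, so a Hash finds it again
# by identity even though NaN == NaN is false.
def tally_model(values)
  values.each_with_object(Hash.new(0)) do |value, counts|
    key = value.is_a?(Float) && value.nan? ? Float::NAN : value
    counts[key] += 1
  end
end

p tally_model([0.0 / 0, Float::NAN])
```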
274
+ ### `index(element)`
275
+
276
+ Returns the index of the specified element.
277
+
278
+ ### `sort_indexes`, `sort_indices`, `array_sort_indices`
279
+
280
+ ### [ ] `sort`, `sort_by`
281
+ ### [ ] argmin, argmax
282
+ ### [ ] (array functions)
283
+ ### [ ] (strings functions)
284
+ ### [ ] (temporal functions)
285
+ ### [ ] (conditional functions)
286
+ ### [ ] (index functions)
287
+ ### [ ] (other functions)
288
+
289
+ ## Coerce (not implemented)
290
+
291
+ ## Update vector's value
292
+ ### `replace(specifier, replacer)` => vector
293
+
294
+ - Accepts a Scalar, Range of Integer, Vector, Array, or Arrow::Array as a specifier.
295
+ - Accepts a Scalar, Vector, Array, or Arrow::Array as a replacer.
296
+ - A boolean specifier marks the positions to replace with `true`.
297
+ - An index specifier gives the positions to replace as indices.
298
+ - The replacer supplies the new values.
299
+ - The number of `true` values in booleans must equal the length of the replacer.
300
+
301
+ ```ruby
302
+ vector = RedAmber::Vector.new([1, 2, 3])
303
+ booleans = [true, false, true]
304
+ replacer = [4, 5]
305
+ vector.replace(booleans, replacer)
306
+ # =>
307
+ #<RedAmber::Vector(:uint8, size=3):0x000000000001ee10>
308
+ [4, 2, 5]
309
+ ```
310
+
311
+ - A scalar replacer is broadcast to every selected position.
312
+
313
+ ```ruby
314
+ replacer = 0
315
+ vector.replace(booleans, replacer)
316
+ # =>
317
+ #<RedAmber::Vector(:uint8, size=3):0x000000000001ee10>
318
+ [0, 2, 0]
319
+ ```
320
+
321
+ - The returned data type is automatically upcast to fit the replacer.
322
+
323
+ ```ruby
324
+ replacer = 1.0
325
+ vector.replace(booleans, replacer)
326
+ # =>
327
+ #<RedAmber::Vector(:double, size=3):0x0000000000025d78>
328
+ [1.0, 2.0, 1.0]
329
+ ```
330
+
331
+ - Positions where booleans is nil are replaced with nil.
332
+
333
+ ```ruby
334
+ booleans = [true, false, nil]
335
+ replacer = -1
336
+ vector.replace(booleans, replacer)
337
+ # =>
338
+ #<RedAmber::Vector(:int8, size=3):0x00000000000304d0>
339
+ [-1, 2, nil]
340
+ ```
341
+
342
+ - The replacer may contain nil.
343
+
344
+ ```ruby
345
+ booleans = [true, false, true]
346
+ replacer = [nil]
347
+ vector.replace(booleans, replacer)
348
+ # =>
349
+ #<RedAmber::Vector(:int8, size=3):0x00000000000304d0>
350
+ [nil, 2, nil]
351
+ ```
352
+
353
+ - If no replacer is specified, it is the same as specifying nil.
354
+
355
+ ```ruby
356
+ booleans = [true, false, true]
357
+ vector.replace(booleans)
358
+ # =>
359
+ #<RedAmber::Vector(:int8, size=3):0x00000000000304d0>
360
+ [nil, 2, nil]
361
+ ```
362
+
363
+ - An example that replaces 'NA' with nil.
364
+
365
+ ```ruby
366
+ vector = RedAmber::Vector.new(['A', 'B', 'NA'])
367
+ vector.replace(vector == 'NA', nil)
368
+ # =>
369
+ #<RedAmber::Vector(:string, size=3):0x000000000000f8ac>
370
+ ["A", "B", nil]
371
+ ```
372
+
373
+ - Specifier given as indices.
374
+
375
+ Specified indices are applied in sorted order, so the order of the indices and the order of the replacer values may not correspond.
376
+
377
+ ```ruby
378
+ vector = RedAmber::Vector.new([1, 2, 3])
379
+ indices = [2, 1]
380
+ replacer = [4, 5]
381
+ vector.replace(indices, replacer)
382
+ # =>
383
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f244>
384
+ [1, 4, 5] # not [1, 5, 4]
385
+ ```
386
+
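Reviewer note: the `replace` rules above (boolean mask, scalar broadcast, nil propagation, indices used "as sorted") can be summarized in one plain-Ruby model. The sketch assumes that index specifiers are first turned into a boolean mask, which would explain why `[2, 1]` with `[4, 5]` yields `[1, 4, 5]`. That conversion is an inference from the documented output, and `replace_model` is a hypothetical helper, not RedAmber code. Type upcasting is not modeled.

```ruby
# Plain-Ruby model of Vector#replace as documented above.
def replace_model(values, specifier, replacer = nil)
  booleans =
    if specifier.all? { |s| s.is_a?(Integer) }
      # Assumption: indices become a mask, so their order is lost.
      mask = Array.new(values.size, false)
      specifier.each { |i| mask[i] = true }
      mask
    else
      specifier
    end

  # An Array replacer is consumed left to right; a scalar is broadcast.
  queue = replacer.is_a?(Array) ? replacer.dup : nil

  values.zip(booleans).map do |value, flag|
    if flag.nil?
      nil
    elsif flag
      queue ? queue.shift : replacer
    else
      value
    end
  end
end

p replace_model([1, 2, 3], [true, false, true], [4, 5]) # [4, 2, 5]
p replace_model([1, 2, 3], [2, 1], [4, 5])              # [1, 4, 5]
```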
387
+
388
+ ### `fill_nil_forward`, `fill_nil_backward` => vector
389
+
390
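+ Propagate the last valid observation forward (or the next valid observation backward) to fill nil values.
391
+ A nil is preserved when there is no valid value to propagate, that is, when all preceding values are nil (forward) or when it is at the end (backward).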
392
+
393
+ ```ruby
394
+ integer = RedAmber::Vector.new([0, 1, nil, 3, nil])
395
+ integer.fill_nil_forward
396
+ # =>
397
+ #<RedAmber::Vector(:uint8, size=5):0x000000000000f960>
398
+ [0, 1, 1, 3, 3]
399
+
400
+ integer.fill_nil_backward
401
+ # =>
402
+ #<RedAmber::Vector(:uint8, size=5):0x000000000000f974>
403
+ [0, 1, 3, 3, nil]
404
+ ```
405
+
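Reviewer note: the two fill directions above can be modeled in a few lines of plain Ruby. The helpers below are hypothetical stand-ins that operate on Arrays and are shown only to make the propagation rule concrete.

```ruby
# Forward fill: remember the last non-nil value and reuse it for nils.
def fill_nil_forward_model(values)
  last = nil
  values.map { |v| v.nil? ? last : (last = v) }
end

# Backward fill is forward fill applied to the reversed sequence.
def fill_nil_backward_model(values)
  fill_nil_forward_model(values.reverse).reverse
end

p fill_nil_forward_model([0, 1, nil, 3, nil])  # [0, 1, 1, 3, 3]
p fill_nil_backward_model([0, 1, nil, 3, nil]) # [0, 1, 3, 3, nil]
```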
406
+ ### `boolean_vector.if_else(true_choice, false_choice)` => vector
407
+
408
+ Choose values based on self. Self must be a boolean Vector.
409
+
410
+ `true_choice` and `false_choice` must have the same data type; each can be a scalar, an Array, or a Vector.
411
+ `nil` values in self are propagated to the output.
412
+
413
+ This example will normalize negative indices to positive ones.
414
+
415
+ ```ruby
416
+ indices = RedAmber::Vector.new([1, -1, 3, -4])
417
+ array_size = 10
418
+ normalized_indices = (indices < 0).if_else(indices + array_size, indices)
419
+
420
+ # =>
421
+ #<RedAmber::Vector(:int16, size=4):0x000000000000f85c>
422
+ [1, 9, 3, 6]
423
+ ```
424
+
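Reviewer note: the element-wise selection done by `if_else` can be sketched with plain Arrays. `if_else_model` is a hypothetical helper that accepts either a scalar or an Array for each choice, and it reproduces the index normalization example above without the library.

```ruby
# Plain-Ruby model of boolean_vector.if_else(true_choice, false_choice).
def if_else_model(conditions, true_choice, false_choice)
  conditions.each_with_index.map do |condition, i|
    next nil if condition.nil? # nil conditions propagate to the output

    choice = condition ? true_choice : false_choice
    choice.is_a?(Array) ? choice[i] : choice
  end
end

indices = [1, -1, 3, -4]
array_size = 10
negative = indices.map(&:negative?)
shifted = indices.map { |i| i + array_size }
p if_else_model(negative, shifted, indices) # [1, 9, 3, 6]
```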
425
+ ### `is_in(values)` => boolean vector
426
+
427
+ For each element in self, return true if it is found in given `values`, false otherwise.
428
+ By default, nulls are matched against the value set. (This can be changed with SetLookupOptions, which is not implemented yet.)
429
+
430
+ ```ruby
431
+ vector = RedAmber::Vector.new %W[A B C D]
432
+ values = ['A', 'C', 'X']
433
+ vector.is_in(values)
434
+
435
+ # =>
436
+ #<RedAmber::Vector(:boolean, size=4):0x000000000000f2a8>
437
+ [true, false, true, false]
438
+ ```
439
+
440
+ `values` are cast to the same type as the Vector.
441
+
442
+ ```ruby
443
+ vector = RedAmber::Vector.new([1, 2, 255])
444
+ vector.is_in(1, -1)
445
+
446
+ # =>
447
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f320>
448
+ [true, false, true]
449
+ ```
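Reviewer note: in the last example the Vector is `:uint8`, and `-1` matches `255`. That result is consistent with `-1` wrapping around to `255` when cast to an unsigned 8-bit integer. The sketch below models only that reading in plain Ruby. `is_in_uint8_model` is a hypothetical helper, and the wrap-around cast is an assumption inferred from the documented output.

```ruby
# Plain-Ruby model of is_in for a uint8 vector.
# Assumption: lookup values are cast to uint8 with wrap-around,
# modeled here by keeping the low 8 bits.
def is_in_uint8_model(vector, values)
  lookup = values.map { |v| v & 0xFF }
  vector.map { |element| lookup.include?(element) }
end

p is_in_uint8_model([1, 2, 255], [1, -1]) # [true, false, true]
```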
data/doc/image/tdr.png ADDED
Binary file
data/doc/tdr.md ADDED
@@ -0,0 +1,56 @@
1
+ # TDR (Transposed DataFrame Representation)
2
+
3
+ ([Japanese version](tdr_ja.md) of this document is available)
4
+
5
+ TDR is a presentation style for 2D data. It shows columnar vector values as *row Vectors* and observations as *columns*, just like a **transposed** table.
6
+
7
+ ![TDR Image](image/tdr.png)
8
+
9
+ A row-oriented data table (1) and a columnar data table (2) are laid out differently in memory in the context of the Arrow Columnar Format, but we picture both with the same placement of rows and columns.
10
+
11
+ TDR (3) is a logical view of data placement that transposes the rows and columns of a columnar table (2).
12
+
13
+ ![TDR and Table Image](image/tdr_and_table.png)
14
+
15
+ TDR is not an implementation in software but a logical image in our mind.
16
+
17
+ TDR is consistent with the tidy data concept in 'transposed' form. The only care needed is to avoid the positional words 'row' and 'column'.
18
+
19
+ ![tidy data in TDR](image/tidy_data_in_TDR.png)
20
+
21
+ TDR is a simple way to create a DataFrame object in many libraries. For example, we can initialize an Arrow::Table in Red Arrow as shown on the right of the figure below and get the table shown on the left.
22
+
23
+ ![Arrow Table New](image/arrow_table_new.png)
24
+
25
+ We already use TDR-style code naturally. Other examples:
26
+ - Ruby: Daru::DataFrame and Rover::DataFrame accept the same arguments.
27
+ - Python: Pandas uses a similar style in pd.DataFrame(data_in_dict).
28
+ - R: tidyr uses a similar style in tibble(x = 1:3, y = c("A", "B", "C")).
29
+
30
+ There are other ways to initialize a data frame, but they are less intuitive.
31
+
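Reviewer note: the TDR idea above (each key labels one Vector laid out laterally, observations are vertical slices) maps directly onto a Ruby Hash of Arrays. The sketch below uses only core Ruby, and `shape_model` and `observations_model` are hypothetical helpers for illustration. The `[size, n_keys]` order follows the comparison table in this document.

```ruby
# A TDR-style literal: one key per variable, one Array per Vector.
TDR_EXAMPLE = { x: [1, 2, 3], y: %w[A B C] }.freeze

# shape = [size, n_keys], in the same order as the table in this document.
def shape_model(tdr)
  [tdr.values.first.size, tdr.size]
end

# Transposing the stacked Vectors recovers the row-style observations.
def observations_model(tdr)
  tdr.values.transpose
end

p shape_model(TDR_EXAMPLE)        # [3, 2]
p observations_model(TDR_EXAMPLE) # [[1, "A"], [2, "B"], [3, "C"]]
```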
32
+ ## Table and TDR API
33
+
34
+ The TDR-based API is a draft, and RedAmber is a small experiment to test the TDR concept. The following table compares Table and TDR (draft).
35
+
36
+ | |Basic Table|Transposed DataFrame|Comment for TDR|
37
+ |-----------|---------|------------|---|
38
+ |name in TDR|`Table`|`TDR`|**T**ransposed **D**ataFrame **R**epresentation|
39
+ |variable |located in a column|a key and a `Vector` in lateral|select by keys|
40
+ |observation|located in a row|sliced in vertical|select by indices|
41
+ |number of variables|n_columns etc. |`n_keys` |`n_cols` is available as an alias|
42
+ |number of observations|n_rows etc. |`size` |`n_rows` is available as an alias|
43
+ |shape |[n_rows, n_columns] |`shape`=`[size, n_keys]` |same order as Table|
44
+ |Select variables|select, filter, [ ], etc.|`pick` or `[keys]` |accepts arguments or a block|
45
+ |Reject variables|drop, etc.|`drop` |accepts arguments or a block|
46
+ |Select observations|slice, [ ], iloc, etc.|`slice` or `[indices]` |accepts arguments or a block|
47
+ |Reject observations|drop, etc.|`remove` |accepts arguments or a block|
48
+ |Add variables|mutate, assign, etc.|`assign` |accepts arguments or a block|
49
+ |Update variables|transmute, [ ]=, etc.|`assign` |accepts arguments or a block|
50
+ |inner join| inner_join(a,b)<br>merge(a, b, how='inner')|`a.inner_join(b)` |with the option `on:`|
51
+ |left join| left_join(a,b)<br>merge(a, b, how='left')|`a.join(b)` |naturally join from bottom<br>with the option `on:`|
52
+ |right join| right_join(a,b)<br>merge(a, b, how='right')|`b.join(a)` |naturally join from bottom<br>with the option `on:`|
53
+
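Reviewer note: the four verbs in the table (`pick`, `drop`, `slice`, `remove`) split cleanly into "by key" and "by index" operations. The sketch below models them on a Hash of Arrays in plain Ruby. The `*_model` helpers are hypothetical, they take only positional arguments (no block form), and they are not the RedAmber methods themselves.

```ruby
# Variables are selected or rejected by key.
def pick_model(tdr, *keys)
  tdr.select { |key, _vector| keys.include?(key) }
end

def drop_model(tdr, *keys)
  tdr.reject { |key, _vector| keys.include?(key) }
end

# Observations are selected or rejected by index, across every Vector.
def slice_model(tdr, *indices)
  tdr.transform_values { |vector| vector.values_at(*indices) }
end

def remove_model(tdr, *indices)
  tdr.transform_values do |vector|
    vector.reject.with_index { |_value, i| indices.include?(i) }
  end
end

table = { x: [1, 2, 3], y: %w[A B C] }
p pick_model(table, :x)    # {:x=>[1, 2, 3]} (inspect format varies by Ruby version)
p slice_model(table, 0, 2)
```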
54
+ ## Q and A for TDR
55
+
56
+ (Not prepared yet)
data/doc/tdr_ja.md ADDED
@@ -0,0 +1,56 @@
1
+ # TDR (Transposed DataFrame Representation)
2
+
3
+ (An [English version](tdr.md) is also available)
4
+
5
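+ TDR is the name given to a way of representing 2D data. In TDR, as in the figure below, data of the same type is labeled with a key and laid out horizontally, and these are stacked vertically to represent the data.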
6
+
7
+ ![TDR Image](image/tdr.png)
8
+
9
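+ In the Arrow Columnar Format, in contrast to conventional row-oriented data such as CSV (1), we handle data that is contiguous in the column direction (2). The words 'row' and 'column' shape our mental image, so when we think of the structure of a data frame we probably picture something like (1) or (2). However, since the essence lies in the contiguous placement of data, we should be free to swap rows and columns in our minds, as in (3).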
10
+
11
+ ![TDR and Table Image](image/tdr_and_table.png)
12
+
13
+ The important point is that TDR is a logical image in our mind, not an architecture of the implementation.
14
+
15
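+ TDR does not conflict with the idea of tidy data. Tidy data in TDR represents exactly the same data with rows and columns swapped. The one thing to be careful about is that, to avoid confusion, we should avoid the positional and directional words 'row' and 'column'.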
16
+
17
+ ![tidy data in TDR](image/tidy_data_in_TDR.png)
18
+
19
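+ TDR is a notation that already makes it easy to initialize 2D data, and it is used quite naturally. For example, in Red Arrow we can write the code on the right of the figure below when initializing an Arrow::Table.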
20
+
21
+ ![Arrow Table New](image/arrow_table_new.png)
22
+
23
+ This is a very natural way to write it, and its shape matches that of TDR. Other examples:
24
+ - Ruby: Daru::DataFrame and Rover::DataFrame can be written in the same way as above.
25
+ - Python: Pandas is the same when a dict is used, as in pd.DataFrame(data_in_dict).
26
+ - R: tidyr lets us write tibble(x = 1:3, y = c("A", "B", "C")).
27
+
28
+ This is not the only way to initialize a data frame in each of these libraries, but the other ways feel somewhat roundabout.
29
+
30
+ That thinking in TDR works a little better is merely a hypothesis, but I suspect the reason is that "on this planet, we write code horizontally."
31
+
32
+ ## Table and TDR API
33
+
34
+ The API based on TDR is still at a provisional stage, and we regard RedAmber as a testing ground for TDR. The table below compares the APIs of TDR and the row-by-column Table (provisional).
35
+
36
+ | |Conventional Table|Transposed DataFrame|Comment for TDR|
37
+ |-----------|---------|------------|---|
38
+ |name in TDR|`Table`|`TDR`|short for **T**ransposed **D**ataFrame **R**epresentation|
39
+ |variable |located in a column|`variables`<br>laid out horizontally as a key and a `Vector`|select by key|
40
+ |observation |located in a row|`observations`<br>each vertical cut is a slice|select by index or the `slice` method|
41
+ |number of variables (columns)|ncol, n_columns, etc. |`n_keys` |`n_cols` is set as an alias|
42
+ |number of observations (rows)|nrow, n_rows, etc. |`size` |`n_rows` is set as an alias|
43
+ |shape |[nrow, ncol] |`shape`=`[size, n_keys]` |same order of rows and columns|
44
+ |select variables (columns)|select, filter, [ ], etc.|`pick` or `[keys]` |specified by arguments or a block|
45
+ |reject variables (columns)|drop, etc.|`drop` |specified by arguments or a block|
46
+ |select observations (rows)|slice, [ ], iloc, etc.|`slice` or `[indices]` |specified by arguments or a block|
47
+ |reject observations (rows)|drop, etc.|`remove` |specified by arguments or a block|
48
+ |add variables (columns)|mutate, assign, etc.|`assign` |specified by arguments or a block|
49
+ |update variables (columns)|transmute, [ ]=, etc.|`assign` |specified by arguments or a block|
50
+ |inner join| inner_join(a,b)<br>merge(a, b, how='inner')|`a.inner_join(b)` |option `on:`|
51
+ |left join| left_join(a,b)<br>merge(a, b, how='left')|`a.join(b)` |attaches naturally at the bottom<br>option `on:`|
52
+ |right join| right_join(a,b)<br>merge(a, b, how='right')|`b.join(a)` |attaches naturally at the bottom<br>option `on:`|
53
+
54
+ ## Q and A for TDR
55
+
56
+ (In preparation)
data/lib/red-amber.rb ADDED
@@ -0,0 +1,27 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'arrow'
4
+ require 'rover-df'
5
+
6
+ require_relative 'red_amber/helper'
7
+ require_relative 'red_amber/data_frame_displayable'
8
+ require_relative 'red_amber/data_frame_indexable'
9
+ require_relative 'red_amber/data_frame_selectable'
10
+ require_relative 'red_amber/data_frame_observation_operation'
11
+ require_relative 'red_amber/data_frame_variable_operation'
12
+ require_relative 'red_amber/data_frame'
13
+ require_relative 'red_amber/vector_functions'
14
+ require_relative 'red_amber/vector_updatable'
15
+ require_relative 'red_amber/vector_selectable'
16
+ require_relative 'red_amber/vector'
17
+ require_relative 'red_amber/version'
18
+
19
+ module RedAmber
20
+ class Error < StandardError; end
21
+
22
+ class DataFrameArgumentError < ArgumentError; end
23
+ class DataFrameTypeError < TypeError; end
24
+
25
+ class VectorArgumentError < ArgumentError; end
26
+ class VectorTypeError < TypeError; end
27
+ end
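Reviewer note: the error classes added above inherit from `ArgumentError` and `TypeError` rather than from `RedAmber::Error`, so callers can rescue them with the standard classes, but a bare `rescue RedAmber::Error` will not catch them. The sketch below copies the five class definitions from the hunk so it runs without the gem. `rescued_as` is a hypothetical helper added only for this demonstration.

```ruby
# Class definitions copied from the diff above so the snippet is standalone.
module RedAmber
  class Error < StandardError; end

  class DataFrameArgumentError < ArgumentError; end
  class DataFrameTypeError < TypeError; end

  class VectorArgumentError < ArgumentError; end
  class VectorTypeError < TypeError; end
end

# Raises the given class and reports which rescue clause caught it.
def rescued_as(error_class)
  raise error_class, 'example'
rescue ArgumentError
  :argument_error
rescue TypeError
  :type_error
rescue RedAmber::Error
  :red_amber_error
end

p rescued_as(RedAmber::VectorArgumentError) # :argument_error
p rescued_as(RedAmber::DataFrameTypeError)  # :type_error
p rescued_as(RedAmber::Error)               # :red_amber_error
```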