red_amber 0.1.3 → 0.1.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +31 -7
  3. data/CHANGELOG.md +214 -10
  4. data/Gemfile +4 -0
  5. data/README.md +117 -342
  6. data/benchmark/csv_load_penguins.yml +15 -0
  7. data/benchmark/drop_nil.yml +11 -0
  8. data/doc/DataFrame.md +854 -0
  9. data/doc/Vector.md +449 -0
  10. data/doc/image/arrow_table_new.png +0 -0
  11. data/doc/image/dataframe/assign.png +0 -0
  12. data/doc/image/dataframe/drop.png +0 -0
  13. data/doc/image/dataframe/pick.png +0 -0
  14. data/doc/image/dataframe/remove.png +0 -0
  15. data/doc/image/dataframe/rename.png +0 -0
  16. data/doc/image/dataframe/slice.png +0 -0
  17. data/doc/image/dataframe_model.png +0 -0
  18. data/doc/image/example_in_red_arrow.png +0 -0
  19. data/doc/image/tdr.png +0 -0
  20. data/doc/image/tdr_and_table.png +0 -0
  21. data/doc/image/tidy_data_in_TDR.png +0 -0
  22. data/doc/image/vector/binary_element_wise.png +0 -0
  23. data/doc/image/vector/unary_aggregation.png +0 -0
  24. data/doc/image/vector/unary_aggregation_w_option.png +0 -0
  25. data/doc/image/vector/unary_element_wise.png +0 -0
  26. data/doc/tdr.md +56 -0
  27. data/doc/tdr_ja.md +56 -0
  28. data/lib/red-amber.rb +27 -0
  29. data/lib/red_amber/data_frame.rb +91 -37
  30. data/lib/red_amber/{data_frame_output.rb → data_frame_displayable.rb} +49 -41
  31. data/lib/red_amber/data_frame_indexable.rb +38 -0
  32. data/lib/red_amber/data_frame_observation_operation.rb +11 -0
  33. data/lib/red_amber/data_frame_selectable.rb +155 -48
  34. data/lib/red_amber/data_frame_variable_operation.rb +137 -0
  35. data/lib/red_amber/helper.rb +61 -0
  36. data/lib/red_amber/vector.rb +69 -16
  37. data/lib/red_amber/vector_functions.rb +80 -45
  38. data/lib/red_amber/vector_selectable.rb +124 -0
  39. data/lib/red_amber/vector_updatable.rb +104 -0
  40. data/lib/red_amber/version.rb +1 -1
  41. data/lib/red_amber.rb +1 -16
  42. data/red_amber.gemspec +3 -6
  43. metadata +38 -9
data/doc/Vector.md ADDED
@@ -0,0 +1,449 @@
1
+ # Vector
2
+
3
+ Class `RedAmber::Vector` represents a series of data in the DataFrame.
4
+
5
+ ## Constructor
6
+
7
+ ### Create from a column in a DataFrame
8
+
9
+ ```ruby
10
+ df = RedAmber::DataFrame.new(x: [1, 2, 3])
11
+ df[:x]
12
+ # =>
13
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f4ec>
14
+ [1, 2, 3]
15
+ ```
16
+
17
+ ### New from an Array
18
+
19
+ ```ruby
20
+ vector = RedAmber::Vector.new([1, 2, 3])
21
+ # or
22
+ vector = RedAmber::Vector.new(1, 2, 3)
23
+ # or
24
+ vector = RedAmber::Vector.new(1..3)
25
+ # or
26
+ vector = RedAmber::Vector.new(Arrow::Array.new([1, 2, 3]))
27
+
28
+ # =>
29
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f514>
30
+ [1, 2, 3]
31
+ ```
32
+
33
+ ## Properties
34
+
35
+ ### `to_s`
36
+
37
+ ### `values`, `to_a`, `entries`
38
+
39
+ ### `indices`, `indexes`, `indeces`
40
+
41
+ Returns the indices as an Array.
42
+
43
+ ### `to_ary`
44
+ Vector has `#to_ary`. It implicitly converts a Vector to an Array when required.
45
+
46
+ ```ruby
47
+ [1, 2] + Vector.new([3, 4])
48
+
49
+ # =>
50
+ [1, 2, 3, 4]
51
+ ```
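The implicit conversion relies on a general Ruby protocol: `Array#+` calls `#to_ary` on an argument that is not already an Array. The following is a minimal plain-Ruby sketch of that protocol; `MiniVector` is a hypothetical stand-in used only for illustration and is not part of RedAmber.

```ruby
# MiniVector is a hypothetical stand-in, not RedAmber::Vector itself.
class MiniVector
  def initialize(values)
    @values = values
  end

  # Array#+ calls #to_ary on a non-Array argument to convert it implicitly.
  def to_ary
    @values
  end
end

combined = [1, 2] + MiniVector.new([3, 4])
p combined
```

Without `#to_ary`, the same expression would raise a `TypeError`, which is why defining it makes a Vector usable wherever an Array is expected.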
52
+
53
+ ### `size`, `length`, `n_rows`, `nrow`
54
+
55
+ ### `empty?`
56
+
57
+ ### `type`
58
+
59
+ ### `boolean?`, `numeric?`, `string?`, `temporal?`
60
+
61
+ ### `type_class`
62
+
63
+ ### [ ] `each` (not implemented yet)
64
+
65
+ ### [ ] `chunked?` (not implemented yet)
66
+
67
+ ### [ ] `n_chunks` (not implemented yet)
68
+
69
+ ### [ ] `each_chunk` (not implemented yet)
70
+
71
+ ### `n_nils`, `n_nans`
72
+
73
+ - `n_nulls` is an alias of `n_nils`
74
+
75
+ ### `has_nil?`
76
+
77
+ Returns `true` if self has any `nil`. Otherwise returns `false`.
78
+
79
+ ### `inspect(limit: 80)`
80
+
81
+ - `limit` sets the size limit used when displaying a long array.
82
+
83
+ ```ruby
84
+ vector = RedAmber::Vector.new((1..50).to_a)
85
+ # =>
86
+ #<RedAmber::Vector(:uint8, size=50):0x000000000000f528>
87
+ [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, ... ]
88
+ ```
89
+
90
+ ## Selecting Values
91
+
92
+ ### `take(indices)`, `[](indices)`
93
+
94
+ - Acceptable class for indices:
95
+ - Integer, Float
96
+ - Vector of integer or float
97
+ - Arrow::Array of integer or float
98
+ - Negative indices are also accepted, as with Ruby's built-in Array.
99
+
100
+ ```ruby
101
+ array = RedAmber::Vector.new(%w[A B C D E])
102
+ indices = RedAmber::Vector.new([0.1, -0.5, -5.1])
103
+ array.take(indices)
104
+ # or
105
+ array[indices]
106
+
107
+ # =>
108
+ #<RedAmber::Vector(:string, size=3):0x000000000000f820>
109
+ ["A", "E", "A"]
110
+ ```
111
+
112
+ ### `filter(booleans)`, `[](booleans)`
113
+
114
+ - Acceptable class for booleans:
115
+ - An array of true, false, or nil
116
+ - Boolean Vector
117
+ - Arrow::BooleanArray
118
+
119
+ ```ruby
120
+ array = RedAmber::Vector.new(%w[A B C D E])
121
+ booleans = [true, false, nil, false, true]
122
+ array.filter(booleans)
123
+ # or
124
+ array[booleans]
125
+
126
+ # =>
127
+ #<RedAmber::Vector(:string, size=2):0x000000000000f21c>
128
+ ["A", "E"]
129
+ ```
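The selection rule shown above can be modeled in plain Ruby without the gem. This is only a sketch of the semantics visible in the example (elements are kept where the mask is `true`, and dropped where it is `false` or `nil`); it does not reflect how RedAmber implements `filter` internally.

```ruby
values   = %w[A B C D E]
booleans = [true, false, nil, false, true]

# Keep an element only when its mask entry is exactly true;
# false and nil both drop the element.
filtered = values.zip(booleans).filter_map { |value, keep| value if keep == true }
p filtered
```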
130
+
131
+ ## Functions
132
+
133
+ ### Unary aggregations: `vector.func => scalar`
134
+
135
+ ![unary aggregation](doc/image/../../image/vector/unary_aggregation_w_option.png)
136
+
137
+ | Method |Boolean|Numeric|String|Options|Remarks|
138
+ | ----------- | --- | --- | --- | --- | --- |
139
+ | ✓ `all?` | ✓ | | | ✓ ScalarAggregate| alias `all` |
140
+ | ✓ `any?` | ✓ | | | ✓ ScalarAggregate| alias `any` |
141
+ | ✓ `approximate_median`| |✓| | ✓ ScalarAggregate| alias `median`|
142
+ | ✓ `count` | ✓ | ✓ | ✓ | ✓ Count | |
143
+ | ✓ `count_distinct`| ✓ | ✓ | ✓ | ✓ Count |alias `count_uniq`|
144
+ |[ ]`index` | [ ] | [ ] | [ ] |[ ] Index | |
145
+ | ✓ `max` | ✓ | ✓ | ✓ | ✓ ScalarAggregate| |
146
+ | ✓ `mean` | ✓ | ✓ | | ✓ ScalarAggregate| |
147
+ | ✓ `min` | ✓ | ✓ | ✓ | ✓ ScalarAggregate| |
148
+ | ✓ `min_max` | ✓ | ✓ | ✓ | ✓ ScalarAggregate| |
149
+ |[ ]`mode` | | [ ] | |[ ] Mode | |
150
+ | ✓ `product` | ✓ | ✓ | | ✓ ScalarAggregate| |
151
+ |[ ]`quantile`| | [ ] | |[ ] Quantile| |
152
+ | ✓ `sd ` | | ✓ | | |ddof: 1 at `stddev`|
153
+ | ✓ `stddev` | | ✓ | | ✓ Variance|ddof: 0 by default|
154
+ | ✓ `sum` | ✓ | ✓ | | ✓ ScalarAggregate| |
155
+ |[ ]`tdigest` | | [ ] | |[ ] TDigest | |
156
+ | ✓ `var `| | ✓ | | |ddof: 1 at `variance`<br>alias `unbiased_variance`|
157
+ | ✓ `variance`| | ✓ | | ✓ Variance|ddof: 0 by default|
158
+
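The `ddof` remarks in the table distinguish the population statistics (`stddev`, `variance`, ddof 0) from the sample statistics (`sd`, `var`, ddof 1). The plain-Ruby sketch below shows the arithmetic behind that option; the helper names `variance` and `stddev` here are local illustrations, not the RedAmber methods.

```ruby
# Variance with a configurable delta degrees of freedom (ddof).
# ddof: 0 divides by n (population), ddof: 1 divides by n - 1 (sample).
def variance(values, ddof: 0)
  n = values.size
  raise ArgumentError, 'need more values than ddof' if n <= ddof

  mean = values.sum.to_f / n
  values.sum { |v| (v - mean)**2 } / (n - ddof)
end

def stddev(values, ddof: 0)
  Math.sqrt(variance(values, ddof: ddof))
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
population_sd = stddev(data)            # divides by 8
sample_var    = variance(data, ddof: 1) # divides by 7
```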
159
+
160
+ Options can be used as follows.
161
+ See the [Arrow C++ compute function documentation](https://arrow.apache.org/docs/cpp/compute.html) for details.
162
+
163
+ ```ruby
164
+ double = RedAmber::Vector.new([1, 0/0.0, -1/0.0, 1/0.0, nil, ""])
165
+ #=>
166
+ #<RedAmber::Vector(:double, size=6):0x000000000000f910>
167
+ [1.0, NaN, -Infinity, Infinity, nil, 0.0]
168
+
169
+ double.count #=> 5
170
+ double.count(opts: {mode: :only_valid}) #=> 5, default
171
+ double.count(opts: {mode: :only_null}) #=> 1
172
+ double.count(opts: {mode: :all}) #=> 6
173
+
174
+ boolean = RedAmber::Vector.new([true, true, nil])
175
+ #=>
176
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f924>
177
+ [true, true, nil]
178
+
179
+ boolean.all #=> true
180
+ boolean.all(opts: {skip_nulls: true}) #=> true
181
+ boolean.all(opts: {skip_nulls: false}) #=> false
182
+ ```
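The three `count` modes above can be modeled on a plain Ruby Array. This sketch assumes only what the example shows: NaN and infinities count as valid values, and only `nil` counts as null. `count_with_mode` is a hypothetical helper name.

```ruby
values = [1.0, Float::NAN, -Float::INFINITY, Float::INFINITY, nil, 0.0]

# Hypothetical helper mirroring the :mode option of count.
def count_with_mode(values, mode: :only_valid)
  case mode
  when :only_valid then values.count { |v| !v.nil? }
  when :only_null  then values.count(&:nil?)
  when :all        then values.size
  else raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end

valid_count = count_with_mode(values)
null_count  = count_with_mode(values, mode: :only_null)
all_count   = count_with_mode(values, mode: :all)
```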
183
+
184
+ ### Unary element-wise: `vector.func => vector`
185
+
186
+ ![unary element-wise](doc/image/../../image/vector/unary_element_wise.png)
187
+
188
+ | Method |Boolean|Numeric|String|Options|Remarks|
189
+ | ------------ | --- | --- | --- | --- | ----- |
190
+ | ✓ `-@` | | ✓ | | |as `-vector`|
191
+ | ✓ `negate` | | ✓ | | |`-@` |
192
+ | ✓ `abs` | | ✓ | | | |
193
+ |[ ]`acos` | | [ ] | | | |
194
+ |[ ]`asin` | | [ ] | | | |
195
+ | ✓ `atan` | | ✓ | | | |
196
+ | ✓ `bit_wise_not`| | (✓) | | |integer only|
197
+ | ✓ `ceil` | | ✓ | | | |
198
+ | ✓ `cos` | | ✓ | | | |
199
+ | ✓`fill_nil_backward`| ✓ | ✓ | ✓ | | |
200
+ | ✓`fill_nil_forward` | ✓ | ✓ | ✓ | | |
201
+ | ✓ `floor` | | ✓ | | | |
202
+ | ✓ `invert` | ✓ | | | |`!`, alias `not`|
203
+ |[ ]`ln` | | [ ] | | | |
204
+ |[ ]`log10` | | [ ] | | | |
205
+ |[ ]`log1p` | | [ ] | | | |
206
+ |[ ]`log2` | | [ ] | | | |
207
+ | ✓ `round` | | ✓ | | ✓ Round (:mode, :n_digits)| |
208
+ | ✓ `round_to_multiple`| | ✓ | | ✓ RoundToMultiple :mode, :multiple| multiple must be an Arrow::Scalar|
209
+ | ✓ `sign` | | ✓ | | | |
210
+ | ✓ `sin` | | ✓ | | | |
211
+ | ✓`sort_indexes`| ✓ | ✓ | ✓ |:order|alias `sort_indices`|
212
+ | ✓ `tan` | | ✓ | | | |
213
+ | ✓ `trunc` | | ✓ | | | |
214
+
215
+ ### Binary element-wise: `vector.func(vector) => vector`
216
+
217
+ ![binary element-wise](doc/image/../../image/vector/binary_element_wise.png)
218
+
219
+ | Method |Boolean|Numeric|String|Options|Remarks|
220
+ | ----------------- | --- | --- | --- | --- | ----- |
221
+ | ✓ `add` | | ✓ | | | `+` |
222
+ | ✓ `atan2` | | ✓ | | | |
223
+ | ✓ `and_kleene` | ✓ | | | | `&` |
224
+ | ✓ `and_org ` | ✓ | | | |`and` in Red Arrow|
225
+ | ✓ `and_not` | ✓ | | | | |
226
+ | ✓ `and_not_kleene`| ✓ | | | | |
227
+ | ✓ `bit_wise_and` | | (✓) | | |integer only|
228
+ | ✓ `bit_wise_or` | | (✓) | | |integer only|
229
+ | ✓ `bit_wise_xor` | | (✓) | | |integer only|
230
+ | ✓ `divide` | | ✓ | | | `/` |
231
+ | ✓ `equal` | ✓ | ✓ | ✓ | |`==`, alias `eq`|
232
+ | ✓ `greater` | ✓ | ✓ | ✓ | |`>`, alias `gt`|
233
+ | ✓ `greater_equal` | ✓ | ✓ | ✓ | |`>=`, alias `ge`|
234
+ | ✓ `is_finite` | | ✓ | | | |
235
+ | ✓ `is_inf` | | ✓ | | | |
236
+ | ✓ `is_na` | ✓ | ✓ | ✓ | | |
237
+ | ✓ `is_nan` | | ✓ | | | |
238
+ |[ ]`is_nil` | ✓ | ✓ | ✓ |[ ] Null|alias `is_null`|
239
+ | ✓ `is_valid` | ✓ | ✓ | ✓ | | |
240
+ | ✓ `less` | ✓ | ✓ | ✓ | |`<`, alias `lt`|
241
+ | ✓ `less_equal` | ✓ | ✓ | ✓ | |`<=`, alias `le`|
242
+ |[ ]`logb` | | [ ] | | | |
243
+ |[ ]`mod` | | [ ] | | | `%` |
244
+ | ✓ `multiply` | | ✓ | | | `*` |
245
+ | ✓ `not_equal` | ✓ | ✓ | ✓ | |`!=`, alias `ne`|
246
+ | ✓ `or_kleene` | ✓ | | | | `\|` |
247
+ | ✓ `or_org` | ✓ | | | |`or` in Red Arrow|
248
+ | ✓ `power` | | ✓ | | | `**` |
249
+ | ✓ `subtract` | | ✓ | | | `-` |
250
+ | ✓ `shift_left` | | (✓) | | |`<<`, integer only|
251
+ | ✓ `shift_right` | | (✓) | | |`>>`, integer only|
252
+ | ✓ `xor` | ✓ | | | | `^` |
253
+
254
+ ### `uniq`
255
+
256
+ Returns a new array with distinct elements.
257
+
258
+ (Not implemented functions)
259
+
260
+ ### `tally` and `value_counts`
261
+
262
+ Compute counts of unique elements and return a Hash.
263
+
264
+ They return almost the same result as Ruby's `tally`, except that these methods treat all NaNs as the same value.
265
+
266
+ ```ruby
267
+ array = [0.0/0, Float::NAN]
268
+ array.tally #=> {NaN=>1, NaN=>1}
269
+
270
+ vector = RedAmber::Vector.new(array)
271
+ vector.tally #=> {NaN=>2}
272
+ vector.value_counts #=> {NaN=>2}
273
+ ```
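The difference comes from Ruby's Hash semantics: two distinct NaN objects are never `eql?`, so `Array#tally` keeps them as separate keys. A plain-Ruby sketch of a NaN-merging tally follows; `tally_merging_nans` is a hypothetical helper, and mapping every NaN to the single `Float::NAN` object is just one way to get the merged behavior.

```ruby
# Count elements like Array#tally, but treat every NaN as one key.
def tally_merging_nans(values)
  values.each_with_object(Hash.new(0)) do |value, counts|
    # Use the one Float::NAN object as the canonical key for all NaNs,
    # so the Hash finds it again by object identity.
    key = value.is_a?(Float) && value.nan? ? Float::NAN : value
    counts[key] += 1
  end
end

array  = [0.0 / 0, Float::NAN]
merged = tally_merging_nans(array)
p merged
```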
274
+ ### `index(element)`
275
+
276
+ Returns index of specified element.
277
+
278
+ ### `sort_indexes`, `sort_indices`, `array_sort_indices`
279
+
280
+ ### [ ] `sort`, `sort_by`
281
+ ### [ ] argmin, argmax
282
+ ### [ ] (array functions)
283
+ ### [ ] (strings functions)
284
+ ### [ ] (temporal functions)
285
+ ### [ ] (conditional functions)
286
+ ### [ ] (index functions)
287
+ ### [ ] (other functions)
288
+
289
+ ## Coerce (not implemented)
290
+
291
+ ## Update vector's value
292
+ ### `replace(specifier, replacer)` => vector
293
+
294
+ - Accepts a Scalar, a Range of Integers, a Vector, an Array, or an Arrow::Array as the specifier.
294
+ - Accepts a Scalar, a Vector, an Array, or an Arrow::Array as the replacer.
295
+ - A boolean specifier marks the positions to replace with `true`.
296
+ - An index specifier lists the positions to replace.
297
+ - The replacer supplies the new values for those positions.
298
+ - The number of `true` values in a boolean specifier must equal the length of the replacer.
300
+
301
+ ```ruby
302
+ vector = RedAmber::Vector.new([1, 2, 3])
303
+ booleans = [true, false, true]
304
+ replacer = [4, 5]
305
+ vector.replace(booleans, replacer)
306
+ # =>
307
+ #<RedAmber::Vector(:uint8, size=3):0x000000000001ee10>
308
+ [4, 2, 5]
309
+ ```
310
+
311
+ - A scalar replacer is broadcast to every selected position.
312
+
313
+ ```ruby
314
+ replacer = 0
315
+ vector.replace(booleans, replacer)
316
+ # =>
317
+ #<RedAmber::Vector(:uint8, size=3):0x000000000001ee10>
318
+ [0, 2, 0]
319
+ ```
320
+
321
+ - The returned data type is automatically upcast to accommodate the replacer.
322
+
323
+ ```ruby
324
+ replacer = 1.0
325
+ vector.replace(booleans, replacer)
326
+ # =>
327
+ #<RedAmber::Vector(:double, size=3):0x0000000000025d78>
328
+ [1.0, 2.0, 1.0]
329
+ ```
330
+
331
+ - Positions where the boolean specifier is nil become nil in the result.
332
+
333
+ ```ruby
334
+ booleans = [true, false, nil]
335
+ replacer = -1
336
+ vector.replace(booleans, replacer)
337
+ # =>
338
+ #<RedAmber::Vector(:int8, size=3):0x00000000000304d0>
339
+ [-1, 2, nil]
340
+ ```
341
+
342
+ - The replacer can contain nil.
343
+
344
+ ```ruby
345
+ booleans = [true, false, true]
346
+ replacer = [nil]
347
+ vector.replace(booleans, replacer)
348
+ # =>
349
+ #<RedAmber::Vector(:int8, size=3):0x00000000000304d0>
350
+ [nil, 2, nil]
351
+ ```
352
+
353
+ - If no replacer is specified, it is the same as specifying nil.
354
+
355
+ ```ruby
356
+ booleans = [true, false, true]
357
+ vector.replace(booleans)
358
+ # =>
359
+ #<RedAmber::Vector(:int8, size=3):0x00000000000304d0>
360
+ [nil, 2, nil]
361
+ ```
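The boolean-specifier cases documented above can be modeled on plain Ruby Arrays. This is a sketch under stated assumptions: `replace_by_mask` is a hypothetical helper, a one-element Array replacer is treated like a scalar (as the `[nil]` example suggests), and the automatic type upcasting is not modeled.

```ruby
# Hypothetical model of replace with a boolean specifier.
def replace_by_mask(values, booleans, replacer = nil)
  n_true = booleans.count(true)
  # Treat a one-element Array like a scalar so it can be broadcast.
  replacer = replacer.first if replacer.is_a?(Array) && replacer.size == 1

  replacements =
    if replacer.is_a?(Array)
      unless replacer.size == n_true
        raise ArgumentError, 'replacer size must equal the number of true values'
      end
      replacer.dup
    else
      Array.new(n_true, replacer) # broadcast a scalar (or nil)
    end

  values.zip(booleans).map do |value, flag|
    case flag
    when true  then replacements.shift # take the next replacement
    when false then value              # keep the original value
    end                                # nil in the mask yields nil
  end
end

by_array  = replace_by_mask([1, 2, 3], [true, false, true], [4, 5])
by_scalar = replace_by_mask([1, 2, 3], [true, false, true], 0)
with_nil  = replace_by_mask([1, 2, 3], [true, false, nil], -1)
no_arg    = replace_by_mask([1, 2, 3], [true, false, true])
```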
362
+
363
+ - An example that replaces 'NA' with nil.
364
+
365
+ ```ruby
366
+ vector = RedAmber::Vector.new(['A', 'B', 'NA'])
367
+ vector.replace(vector == 'NA', nil)
368
+ # =>
369
+ #<RedAmber::Vector(:string, size=3):0x000000000000f8ac>
370
+ ["A", "B", nil]
371
+ ```
372
+
373
+ - Using indices as the specifier.
374
+
375
+ The specified indices are applied in sorted order, so the positions in `indices` do not necessarily correspond one-to-one with the elements of `replacer`.
376
+
377
+ ```ruby
378
+ vector = RedAmber::Vector.new([1, 2, 3])
379
+ indices = [2, 1]
380
+ replacer = [4, 5]
381
+ vector.replace(indices, replacer)
382
+ # =>
383
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f244>
384
+ [1, 4, 5] # not [1, 5, 4]
385
+ ```
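The 'as sorted' behavior can be reproduced on a plain Array by sorting the indices before pairing them with the replacer. `replace_by_indices` is a hypothetical helper that only models the result shown above.

```ruby
# Hypothetical model: indices are applied in ascending order,
# regardless of the order in which they were given.
def replace_by_indices(values, indices, replacer)
  result = values.dup
  indices.sort.zip(replacer).each { |index, new_value| result[index] = new_value }
  result
end

replaced = replace_by_indices([1, 2, 3], [2, 1], [4, 5])
p replaced
```

To place a value at a specific index, pass the indices already in ascending order so that each index lines up with its replacement.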
386
+
387
+
388
+ ### `fill_nil_forward`, `fill_nil_backward` => vector
389
+
390
+ Propagates the last valid observation forward (or the next valid one backward).
391
+ A nil is preserved when no valid value precedes it (forward) or follows it (backward).
392
+
393
+ ```ruby
394
+ integer = RedAmber::Vector.new([0, 1, nil, 3, nil])
395
+ integer.fill_nil_forward
396
+ # =>
397
+ #<RedAmber::Vector(:uint8, size=5):0x000000000000f960>
398
+ [0, 1, 1, 3, 3]
399
+
400
+ integer.fill_nil_backward
401
+ # =>
402
+ #<RedAmber::Vector(:uint8, size=5):0x000000000000f974>
403
+ [0, 1, 3, 3, nil]
404
+ ```
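Both directions can be sketched on plain Arrays. The helpers below share names with the documented methods for readability, but they are stand-alone illustrations of the semantics, not the gem's implementation.

```ruby
# Carry the most recent non-nil value forward; leading nils stay nil.
def fill_nil_forward(values)
  last = nil
  values.map { |v| v.nil? ? last : (last = v) }
end

# Filling backward is filling forward on the reversed sequence.
def fill_nil_backward(values)
  fill_nil_forward(values.reverse).reverse
end

integers = [0, 1, nil, 3, nil]
forward  = fill_nil_forward(integers)
backward = fill_nil_backward(integers)
```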
405
+
406
+ ### `boolean_vector.if_else(true_choice, false_choice)` => vector
407
+
408
+ Choose values based on self. Self must be a boolean Vector.
409
+
410
+ `true_choice` and `false_choice` must have the same type; each may be a scalar, an Array, or a Vector.
411
+ `nil` values in self (the condition) are propagated to the output.
412
+
413
+ This example will normalize negative indices to positive ones.
414
+
415
+ ```ruby
416
+ indices = RedAmber::Vector.new([1, -1, 3, -4])
417
+ array_size = 10
418
+ normalized_indices = (indices < 0).if_else(indices + array_size, indices)
419
+
420
+ # =>
421
+ #<RedAmber::Vector(:int16, size=4):0x000000000000f85c>
422
+ [1, 9, 3, 6]
423
+ ```
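The element-wise selection can be written out in plain Ruby to make the nil propagation explicit. This sketch handles only Array arguments of equal length; the free function `if_else` here is an illustration and takes the condition as its first argument rather than as the receiver.

```ruby
# Element-wise choice: nil in the condition yields nil in the output.
def if_else(cond, true_choice, false_choice)
  cond.each_with_index.map do |flag, i|
    next nil if flag.nil?

    flag ? true_choice[i] : false_choice[i]
  end
end

indices    = [1, -1, 3, -4]
array_size = 10
negative   = indices.map { |i| i < 0 }
shifted    = indices.map { |i| i + array_size }

normalized  = if_else(negative, shifted, indices)
with_a_nil  = if_else([true, nil, false], [1, 2, 3], [4, 5, 6])
```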
424
+
425
+ ### `is_in(values)` => boolean vector
426
+
427
+ For each element in self, return true if it is found in given `values`, false otherwise.
428
+ By default, nulls are matched against the value set. (This will become configurable through SetLookupOptions, which is not implemented yet.)
429
+
430
+ ```ruby
431
+ vector = RedAmber::Vector.new %W[A B C D]
432
+ values = ['A', 'C', 'X']
433
+ vector.is_in(values)
434
+
435
+ # =>
436
+ #<RedAmber::Vector(:boolean, size=4):0x000000000000f2a8>
437
+ [true, false, true, false]
438
+ ```
439
+
440
+ `values` are cast to the same type as the Vector.
441
+
442
+ ```ruby
443
+ vector = RedAmber::Vector.new([1, 2, 255])
444
+ vector.is_in(1, -1)
445
+
446
+ # =>
447
+ #<RedAmber::Vector(:boolean, size=3):0x000000000000f320>
448
+ [true, false, true]
449
+ ```
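The result above is consistent with `-1` wrapping around to `255` when cast to an unsigned 8-bit integer. The sketch below models that in plain Ruby; the wrap-around cast is an assumption inferred from the output shown, and `cast_to_uint8` and `is_in_uint8` are hypothetical helpers.

```ruby
# Assumed cast: keep only the low 8 bits, so -1 becomes 255.
def cast_to_uint8(value)
  value & 0xFF
end

# Cast the lookup values first, then test membership element-wise.
def is_in_uint8(vector, *values)
  set = values.flatten.map { |v| cast_to_uint8(v) }
  vector.map { |v| set.include?(v) }
end

matches = is_in_uint8([1, 2, 255], 1, -1)
p matches
```

This is why a lookup value outside the Vector's range can still produce a match, which may be surprising when mixing signed and unsigned data.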
Binary file
Binary file
Binary file
Binary file
Binary file
Binary file
Binary file
Binary file
Binary file
data/doc/image/tdr.png ADDED
Binary file
Binary file
Binary file
data/doc/tdr.md ADDED
@@ -0,0 +1,56 @@
1
+ # TDR (Transposed DataFrame Representation)
2
+
3
+ ([Japanese version](tdr_ja.md) of this document is available)
4
+
5
+ TDR is a way of presenting 2D data. It lays out each variable's values as a *row Vector* and each observation as a *column*, just like a **transposed** table.
6
+
7
+ ![TDR Image](image/tdr.png)
8
+
9
+ In the context of the Arrow Columnar Format, a row-oriented data table (1) and a columnar data table (2) are laid out differently in memory, but we picture them with the same placement of data (in rows and columns).
10
+
11
+ TDR (3) is a logical concept of data placement to transpose rows and columns in a columnar table (2).
12
+
13
+ ![TDR and Table Image](image/tdr_and_table.png)
14
+
15
+ TDR is not an implementation in software but a logical image in our mind.
16
+
17
+ TDR is consistent with the 'transposed' tidy data concept. The only thing to take care of is to avoid the positional words 'row' and 'column'.
18
+
19
+ ![tidy data in TDR](image/tidy_data_in_TDR.png)
20
+
21
+ TDR is already a simple way to create a DataFrame object in many libraries. For example, in Red Arrow we can initialize an Arrow::Table with the code shown on the right below and get the table shown on the left.
22
+
23
+ ![Arrow Table New](image/arrow_table_new.png)
24
+
25
+ We already write TDR-style code naturally. Other examples:
26
+ - Ruby: Daru::DataFrame, Rover::DataFrame accept same arguments.
27
+ - Python: similar style in Pandas for pd.DataFrame(data_in_dict)
28
+ - R: similar style in tidyr for tibble(x = 1:3, y = c("A", "B", "C"))
29
+
30
+ There are other ways to initialize a data frame, but they are less intuitive.
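The relationship between the TDR layout and a conventional table can be shown with a plain Ruby Hash of Arrays. This is only an illustration of the concept, using no DataFrame library: each key holds one variable, and transposing the values recovers the observations.

```ruby
# TDR-style data: one key per variable, values laid out laterally.
tdr = { x: [1, 2, 3], y: %w[A B C] }

# Transposing the values gives one entry per observation,
# which is the row-oriented view of the same data.
observations = tdr.values.transpose.map { |row| tdr.keys.zip(row).to_h }

shape = [tdr.values.first.size, tdr.size] # [observations, variables]
p observations
```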
31
+
32
+ ## Table and TDR API
33
+
34
+ The API based on TDR is still a draft, and RedAmber is a small experiment to test the TDR concept. The following is a comparison of Table and TDR (draft).
35
+
36
+ | |Basic Table|Transposed DataFrame|Comment for TDR|
37
+ |-----------|---------|------------|---|
38
+ |name in TDR|`Table`|`TDR`|**T**ransposed **D**ataFrame **R**epresentation|
39
+ |variable |located in a column|a key and a `Vector` in lateral|select by keys|
40
+ |observation|located in a row|sliced in vertical|select by indices|
41
+ |number of variables|n_columns etc. |`n_keys` |`n_cols` is available as an alias|
42
+ |number of observations|n_rows etc. |`size` |`n_rows` is available as an alias|
43
+ |shape |[n_rows, n_columns] |`shape`=`[size, n_keys]` |same order as Table|
44
+ |Select variables|select, filter, [ ], etc.|`pick` or `[keys]` |accepts arguments or a block|
45
+ |Reject variables|drop, etc.|`drop` |accepts arguments or a block|
46
+ |Select observations|slice, [ ], iloc, etc.|`slice` or `[indices]` |accepts arguments or a block|
47
+ |Reject observations|drop, etc.|`remove` |accepts arguments or a block|
48
+ |Add variables|mutate, assign, etc.|`assign` |accepts arguments or a block|
49
+ |Update variables|transmute, [ ]=, etc.|`assign` |accepts arguments or a block|
50
+ |inner join| inner_join(a,b)<br>merge(a, b, how='inner')|`a.inner_join(b)` |with the `on:` option|
51
+ |left join| left_join(a,b)<br>merge(a, b, how='left')|`a.join(b)` |naturally join from bottom<br>with the `on:` option|
52
+ |right join| right_join(a,b)<br>merge(a, b, how='right')|`b.join(a)` |naturally join from bottom<br>with the `on:` option|
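The four selection verbs in the table (`pick`, `drop`, `slice`, `remove`) can be modeled on a Hash of Arrays to show which axis each one acts on. The helpers below are hypothetical plain-Ruby functions that accept only positional arguments; they do not reproduce RedAmber's block forms or its actual signatures.

```ruby
# Variables (keys) are selected or rejected by key.
def pick(df, *keys)
  df.slice(*keys)
end

def drop(df, *keys)
  df.reject { |key, _| keys.include?(key) }
end

# Observations are selected or rejected by index, across every variable.
def slice_observations(df, *indices)
  df.transform_values { |values| values.values_at(*indices) }
end

def remove_observations(df, *indices)
  df.transform_values do |values|
    values.reject.with_index { |_, i| indices.include?(i) }
  end
end

tdr = { x: [1, 2, 3], y: %w[A B C] }
picked  = pick(tdr, :x)
dropped = drop(tdr, :x)
sliced  = slice_observations(tdr, 0, 2)
removed = remove_observations(tdr, 1)
```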
53
+
54
+ ## Q and A for TDR
55
+
56
+ (Not prepared yet)
data/doc/tdr_ja.md ADDED
@@ -0,0 +1,56 @@
1
+ # TDR (Transposed DataFrame Representation)
2
+
3
+ (An [English version](tdr.md) of this document is also available)
4
+
5
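+ TDR is the name given to a way of representing two-dimensional data. As shown in the figure below, TDR labels each series of same-typed data with a key, lays the series out horizontally, and stacks them vertically to represent the data.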
6
+
7
+ ![TDR Image](image/tdr.png)
8
+
9
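+ In contrast to conventional row-oriented data such as CSV (1), the Arrow Columnar Format handles data that is contiguous in the column direction (2). The words 'row' and 'column' shape our mental image, so when we think of the structure of a data frame we probably picture something like (1) or (2). However, since the essence lies in the contiguous arrangement of the data, we should be free to swap rows and columns in our heads, as in (3).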
10
+
11
+ ![TDR and Table Image](image/tdr_and_table.png)
12
+
13
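+ The important point is that TDR is a logical image in our mind, not an implementation architecture.
14
+
15
+ TDR does not conflict with the idea of tidy data either. Tidy data in TDR represents exactly the same data with rows and columns swapped. The one thing to be careful about is that, to avoid confusion, we should avoid the positional and directional words 'row' and 'column'.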
16
+
17
+ ![tidy data in TDR](image/tidy_data_in_TDR.png)
18
+
19
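+ TDR is already used quite naturally as a notation that makes it easy to initialize two-dimensional data. For example, in Red Arrow you can initialize an Arrow::Table as shown on the right of the figure below.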
20
+
21
+ ![Arrow Table New](image/arrow_table_new.png)
22
+
23
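+ This is a very natural way to write it, and its form matches TDR. Other examples:
24
+ - Ruby: Daru::DataFrame and Rover::DataFrame can be written in the same way as above.
25
+ - Python: In Pandas, using a dict as in pd.DataFrame(data_in_dict) is the same.
26
+ - R: In tidyr, you can write tibble(x = 1:3, y = c("A", "B", "C")).
27
+
28
+ This is not the only way to initialize a data frame in each of these libraries, but the other ways feel somewhat roundabout.
29
+
30
+ That thinking in TDR works a little better is merely a hypothesis, but I suspect the reason is that 'on this planet, code is written horizontally'.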
31
+
32
+ ## Table and TDR API
33
+
34
+ The TDR-based API is still at a provisional stage, and we regard RedAmber as a place to experiment with TDR. The table below compares the API of TDR with that of a row-by-column Table (provisional).
35
+
36
+ | |Conventional Table|Transposed DataFrame|Comment for TDR|
37
+ |-----------|---------|------------|---|
38
+ |Name in TDR|`Table`|`TDR`|Short for **T**ransposed **D**ataFrame **R**epresentation|
39
+ |Variable |placed in a column|`variables`<br>placed laterally as a key and a `Vector`|select by key|
40
+ |Observation |placed in a row|`observations`<br>each vertical cut is a slice|select by index or the `slice` method|
41
+ |Number of variables (columns)|ncol, n_columns, etc. |`n_keys` |`n_cols` is set as an alias|
42
+ |Number of observations (rows)|nrow, n_rows, etc. |`size` |`n_rows` is set as an alias|
43
+ |Shape |[nrow, ncol] |`shape`=`[size, n_keys]` |same order: rows, then columns|
44
+ |Select variables (columns)|select, filter, [ ], etc.|`pick` or `[keys]` |specified by arguments or a block|
45
+ |Reject variables (columns)|drop, etc.|`drop` |specified by arguments or a block|
46
+ |Select observations (rows)|slice, [ ], iloc, etc.|`slice` or `[indices]` |specified by arguments or a block|
47
+ |Reject observations (rows)|drop, etc.|`remove` |specified by arguments or a block|
48
+ |Add variables (columns)|mutate, assign, etc.|`assign` |specified by arguments or a block|
49
+ |Update variables (columns)|transmute, [ ]=, etc.|`assign` |specified by arguments or a block|
50
+ |Inner join| inner_join(a,b)<br>merge(a, b, how='inner')|`a.inner_join(b)` |option `on:`|
51
+ |Left join| left_join(a,b)<br>merge(a, b, how='left')|`a.join(b)` |naturally attaches below<br>option `on:`|
52
+ |Right join| right_join(a,b)<br>merge(a, b, how='right')|`b.join(a)` |naturally attaches below<br>option `on:`|
53
+
54
+ ## Q and A for TDR
55
+
56
+ (In preparation)
data/lib/red-amber.rb ADDED
@@ -0,0 +1,27 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'arrow'
4
+ require 'rover-df'
5
+
6
+ require_relative 'red_amber/helper'
7
+ require_relative 'red_amber/data_frame_displayable'
8
+ require_relative 'red_amber/data_frame_indexable'
9
+ require_relative 'red_amber/data_frame_selectable'
10
+ require_relative 'red_amber/data_frame_observation_operation'
11
+ require_relative 'red_amber/data_frame_variable_operation'
12
+ require_relative 'red_amber/data_frame'
13
+ require_relative 'red_amber/vector_functions'
14
+ require_relative 'red_amber/vector_updatable'
15
+ require_relative 'red_amber/vector_selectable'
16
+ require_relative 'red_amber/vector'
17
+ require_relative 'red_amber/version'
18
+
19
+ module RedAmber
20
+ class Error < StandardError; end
21
+
22
+ class DataFrameArgumentError < ArgumentError; end
23
+ class DataFrameTypeError < TypeError; end
24
+
25
+ class VectorArgumentError < ArgumentError; end
26
+ class VectorTypeError < TypeError; end
27
+ end
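One consequence of the hierarchy above is that the argument and type errors inherit from Ruby's built-in `ArgumentError` and `TypeError`, not from `RedAmber::Error`. The stand-alone sketch below copies two of the class definitions so it runs without the gem; `classify` is a hypothetical helper used only to show which `rescue` clause matches.

```ruby
# Stand-alone copy of two of the definitions above, for illustration only.
module RedAmber
  class Error < StandardError; end
  class VectorArgumentError < ArgumentError; end
end

# Report which rescue clause handles the raised exception.
def classify
  yield
rescue RedAmber::Error
  :red_amber_error
rescue ArgumentError
  :argument_error
end

result = classify { raise RedAmber::VectorArgumentError, 'bad index' }
p result
```

Callers that want to catch these errors would therefore rescue `ArgumentError` or `TypeError` (or the specific class) rather than `RedAmber::Error`.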