red_amber 0.1.2 → 0.1.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +21 -10
  3. data/CHANGELOG.md +162 -6
  4. data/Gemfile +3 -0
  5. data/README.md +89 -303
  6. data/benchmark/csv_load_penguins.yml +15 -0
  7. data/benchmark/drop_nil.yml +11 -0
  8. data/doc/DataFrame.md +840 -0
  9. data/doc/Vector.md +317 -0
  10. data/doc/image/arrow_table_new.png +0 -0
  11. data/doc/image/dataframe/assign.png +0 -0
  12. data/doc/image/dataframe/drop.png +0 -0
  13. data/doc/image/dataframe/pick.png +0 -0
  14. data/doc/image/dataframe/remove.png +0 -0
  15. data/doc/image/dataframe/rename.png +0 -0
  16. data/doc/image/dataframe/slice.png +0 -0
  17. data/doc/image/dataframe_model.png +0 -0
  18. data/doc/image/example_in_red_arrow.png +0 -0
  19. data/doc/image/tdr.png +0 -0
  20. data/doc/image/tdr_and_table.png +0 -0
  21. data/doc/image/tidy_data_in_TDR.png +0 -0
  22. data/doc/image/vector/binary_element_wise.png +0 -0
  23. data/doc/image/vector/unary_aggregation.png +0 -0
  24. data/doc/image/vector/unary_aggregation_w_option.png +0 -0
  25. data/doc/image/vector/unary_element_wise.png +0 -0
  26. data/doc/tdr.md +56 -0
  27. data/doc/tdr_ja.md +56 -0
  28. data/lib/red_amber/data_frame.rb +68 -35
  29. data/lib/red_amber/data_frame_displayable.rb +132 -0
  30. data/lib/red_amber/data_frame_helper.rb +64 -0
  31. data/lib/red_amber/data_frame_indexable.rb +38 -0
  32. data/lib/red_amber/data_frame_observation_operation.rb +83 -0
  33. data/lib/red_amber/data_frame_selectable.rb +34 -43
  34. data/lib/red_amber/data_frame_variable_operation.rb +133 -0
  35. data/lib/red_amber/vector.rb +58 -6
  36. data/lib/red_amber/vector_compensable.rb +68 -0
  37. data/lib/red_amber/vector_functions.rb +147 -68
  38. data/lib/red_amber/version.rb +1 -1
  39. data/lib/red_amber.rb +9 -1
  40. data/red_amber.gemspec +3 -6
  41. metadata +36 -9
  42. data/lib/red_amber/data_frame_output.rb +0 -116
data/README.md CHANGED
@@ -3,18 +3,27 @@
3
3
  A simple dataframe library for Ruby (experimental)
4
4
 
5
5
  - Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
6
- - Simple API similar to [Rover-df](https://github.com/ankane/rover)
6
+ - Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
7
7
 
8
8
  ## Requirements
9
9
 
10
10
  ```ruby
11
- gem 'red-arrow', '>= 7.0.0'
12
- gem 'red-parquet', '>= 7.0.0' # if you use IO from/to parquet
11
+ gem 'red-arrow', '>= 8.0.0'
12
+ gem 'red-parquet', '>= 8.0.0' # if you use IO from/to parquet
13
13
  gem 'rover-df', '~> 0.3.0' # if you use IO from/to Rover::DataFrame
14
14
  ```
15
15
 
16
16
  ## Installation
17
17
 
18
+ Install requirements before you install Red Amber.
19
+
20
+ - Apache Arrow GLib (>= 8.0.0)
21
+ - Apache Parquet GLib (>= 8.0.0)
22
+
23
+ See [Apache Arrow install document](https://arrow.apache.org/install/).
24
+
25
+ A minimal installation example for the latest Ubuntu is in the ['Prepare the Apache Arrow' section of the CI test workflow](https://github.com/heronshoes/red_amber/blob/master/.github/workflows/test.yml) of Red Amber.
26
+
18
27
  Add this line to your Gemfile:
19
28
 
20
29
  ```ruby
@@ -23,339 +32,116 @@ gem 'red_amber'
23
32
 
24
33
  And then execute:
25
34
 
26
- $ bundle install
35
+ ```shell
36
+ bundle install
37
+ ```
27
38
 
28
39
  Or install it yourself as:
29
40
 
30
- $ gem install red_amber
41
+ ```shell
42
+ gem install red_amber
43
+ ```
31
44
 
32
45
  ## `RedAmber::DataFrame`
33
46
 
34
- ### Constructors and saving
35
-
36
- - [x] `new` from a columnar Hash
37
- - `RedAmber::DataFrame.new(x: [1, 2, 3])`
38
-
39
- - [x] `new` from a schema (by Hash) and rows (by Array)
40
- - `RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])`
41
-
42
- - [x] `new` from an Arrow::Table
43
- - `RedAmber::DataFrame.new(Arrow::Table.new(x: [1, 2, 3]))`
44
-
45
- - [x] `new` from a Rover::DataFrame
46
- - `RedAmber::DataFrame.new(Rover::DataFrame.new(x: [1, 2, 3]))`
47
-
48
- - [ ] `load` (class method)
49
-
50
- - [x] from a [`.arrow`, `.arrows`, `.csv`, `.csv.gz`, `.tsv`] file
51
- - `RedAmber::DataFrame.load("test/entity/with_header.csv")`
52
-
53
- - [x] from a string buffer
54
-
55
- - [x] from a URI
56
- - `RedAmber::DataFrame.load(URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv"))`
57
-
58
- - [ ] from a parquet file
59
-
60
- - [ ] `save` (instance method)
61
-
62
- - [x] to a [`.arrow`, `.arrows`, `.csv`, `.csv.gz`, `.tsv`] file
63
-
64
- - [x] to a string buffer
65
-
66
- - [x] to a URI
67
-
68
- - [ ] to a parquet file
69
-
70
- ### Properties
71
-
72
- - [x] `table`
73
-
74
- Reader of Arrow::Table object inside.
75
-
76
- - [x] `n_rows`, `nrow`, `size`, `length`
77
-
78
- Returns num of rows (data size).
79
-
80
- - [x] `n_columns`, `ncol`, `width`
81
-
82
- Returns num of columns (num of vectors).
83
-
84
- - [x] `shape`
85
-
86
- Returns shape in an Array[n_rows, n_cols].
87
-
88
- - [x] `column_names`, `keys`
89
-
90
- Returns num of column names by an Array.
91
-
92
- - [x] `types`
93
-
94
- Returns types of columns by an Array of Symbols.
95
-
96
- - [x] `data_types`
97
-
98
- Returns types of columns by an Array of `Arrow::DataType`.
99
-
100
- - [x] `vectors`
101
-
102
- Returns an Array of Vectors.
103
-
104
- - [x] `to_h`
105
-
106
- Returns column-oriented data in a Hash.
107
-
108
- - [x] `to_a`, `raw_records`
47
+ Represents a set of data in a 2D shape.
109
48
 
110
- Returns an array of row-oriented data without header. If you need a column-oriented full array, use `.to_h.to_a`
111
-
112
- - [x] `schema`
113
-
114
- Returns column name and data type in a Hash.
115
-
116
- - [x] `==`
117
-
118
- - [x] `empty?`
119
-
120
- ### Output
121
-
122
- - [x] `to_s`
123
-
124
- - [ ] summary, describe
125
-
126
- - [x] `to_rover`
49
+ ```ruby
50
+ require 'red_amber'
51
+ require 'datasets-arrow'
127
52
 
128
- Returns a `Rover::DataFrame`.
53
+ arrow = Datasets::Penguins.new.to_arrow
54
+ penguins = RedAmber::DataFrame.new(arrow)
55
+ penguins.tdr
56
+ # =>
57
+ RedAmber::DataFrame : 344 x 8 Vectors
58
+ Vectors : 5 numeric, 3 strings
59
+ # key type level data_preview
60
+ 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
61
+ 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
62
+ 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
63
+ 4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
64
+ 5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
65
+ 6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
66
+ 7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
67
+ 8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
68
+ ```
129
69
 
130
- - [x] `inspect(tally_level: 5, max_element: 5)`
70
+ ### DataFrame model
71
+ ![dataframe model of RedAmber](doc/image/dataframe_model.png)
131
72
 
132
- Shows some information about self.
73
+ For example, `DataFrame#pick` accepts keys as an argument and returns a sub DataFrame.
133
74
 
134
75
  ```ruby
135
- hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
136
- RedAmber::DataFrame.new(hash)
76
+ df = penguins.pick(:body_mass_g)
137
77
  # =>
138
- RedAmber::DataFrame : 3 observations(rows) of 3 variables(columns)
139
- Variables : 2 numeric, 1 string
140
- # key type level data_preview
141
- 1 :a uint8 3 [1, 2, 3]
142
- 2 :b string 3 [A, B, C]
143
- 3 :c double 3 [1.0, 2.0, 3.0]
78
+ #<RedAmber::DataFrame : 344 x 1 Vector, 0x000000000000fa14>
79
+ Vector : 1 numeric
80
+ # key type level data_preview
81
+ 1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
144
82
  ```
145
83
 
146
- - tally_level: max level to use tally mode
147
- - max_element: max num of element to show values in each row
84
+ `DataFrame#assign` creates new variables (columns in the table).
148
85
 
149
- ### Selecting
150
-
151
- - [x] Select columns by `[]` as `[key]`, `[keys]`, `[keys[index]]`
152
- - Key in a Symbol: `df[:symbol]`
153
- - Key in a String: `df["string"]`
154
- - Keys in an Array: `df[:symbol1`, `"string"`, `:symbol2`
155
- - Keys in indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
156
- - Keys in a Range:
157
- A end-less Range can be used to represent keys.
158
86
  ```ruby
159
- hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
160
- df = RedAmber::DataFrame.new(hash)
161
- df[:b..:c, "a"]
87
+ df.assign(:body_mass_kg => df[:body_mass_g] / 1000.0)
162
88
  # =>
163
- RedAmber::DataFrame : 3 observations(rows) of 3 variables(columns)
164
- Variables : 2 numeric, 1 string
165
- # key type level data_preview
166
- 1 :b string 3 [A, B, C]
167
- 2 :c double 3 [1.0, 2.0, 3.0]
168
- 3 :a uint8 3 [1, 2, 3]
89
+ #<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000fa28>
90
+ Vectors : 2 numeric
91
+ # key type level data_preview
92
+ 1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
93
+ 2 :body_mass_kg double 95 [3.75, 3.8, 3.25, nil, 3.45, ... ], 2 nils
169
94
  ```
170
95
 
171
- - [x] Select rows by `[]` as `[index]`, `[range]`, `[array]`
172
- - Select a row by index: `df[0]`
173
- - Select rows by indeces in a Range: `df[1..2]`
174
- - Select rows by indeces in an Array: `df[1, 2]`
175
- - Mixed case: `df[2, 0..]`
176
-
177
- - [x] Select rows from top or bottom
178
-
179
- `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
180
-
181
- - [ ] slice
182
-
183
- ### Updating
184
-
185
- - [ ] Add a new column
186
-
187
- - [ ] Update a single element
188
-
189
- - [ ] Update multiple elements
190
-
191
- - [ ] Update all elements
96
+ DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove`, `rename` and `assign` accept a block.
192
97
 
193
- - [ ] Update elements matching a condition
98
+ This is an example that eliminates observations (rows in the table) containing nil.
194
99
 
195
- - [ ] Clamp
196
-
197
- - [ ] Delete columns
198
-
199
- - [ ] Rename a column
200
-
201
- - [ ] Sort rows
202
-
203
- - [ ] Clear data
204
-
205
- ### Treat na data
206
-
207
- - [ ] Drop na (NaN, nil)
208
-
209
- - [ ] Replace na with value
210
-
211
- - [ ] Interpolate na with convolution array
212
-
213
- ### Combining DataFrames
100
+ ```ruby
101
+ # remove all observations containing nil
102
+ nil_removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
103
+ nil_removed.tdr
104
+ # =>
105
+ RedAmber::DataFrame : 342 x 8 Vectors
106
+ Vectors : 5 numeric, 3 strings
107
+ # key type level data_preview
108
+ 1 :species string 3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
109
+ 2 :island string 3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
110
+ 3 :bill_length_mm double 164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
111
+ 4 :bill_depth_mm double 80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
112
+ 5 :flipper_length_mm int64 55 [181, 186, 195, 193, 190, ... ]
113
+ 6 :body_mass_g int64 94 [3750, 3800, 3250, 3450, 3650, ... ]
114
+ 7 :sex string 3 {"male"=>168, "female"=>165, ""=>9}
115
+ 8 :year int64 3 {2007=>109, 2008=>114, 2009=>119}
116
+ ```
214
117
 
215
- - [ ] Add rows
118
+ This frequently needed task can be done much more simply.
216
119
 
217
- - [ ] Add columns
120
+ ```ruby
121
+ penguins.remove_nil # => same result as above
122
+ ```
218
123
 
219
- - [ ] Inner join
124
+ See [DataFrame.md](doc/DataFrame.md) for details.
220
125
 
221
- - [ ] Left join
222
126
 
223
- ### Encoding
127
+ ## `RedAmber::Vector`
224
128
 
225
- - [ ] One-hot encoding
129
+ Class `RedAmber::Vector` represents a series of data in the DataFrame.
226
130
 
227
- ### Iteration (not impremented)
131
+ ```ruby
132
+ penguins[:bill_length_mm]
133
+ # =>
134
+ #<RedAmber::Vector(:double, size=344):0x000000000000f8fc>
135
+ [39.1, 39.5, 40.3, nil, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, ... ]
136
+ ```
228
137
 
229
- ### Filtering (not impremented)
138
+ Vectors accept some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).
230
139
 
140
+ See [Vector.md](doc/Vector.md) for details.
231
141
 
232
- ## `RedAmber::Vector`
233
- ### Constructor
234
-
235
- - [x] Create from a column in a DataFrame
236
-
237
- - [x] New from an Array
238
-
239
- ### Properties
240
-
241
- - [x] `to_s`
242
-
243
- - [x] `values`, `to_a`, `entries`
244
-
245
- - [x] `size`, `length`, `n_rows`, `nrow`
246
-
247
- - [x] `type`
248
-
249
- - [x] `data_type`
250
-
251
- - [ ] `each`
252
-
253
- - [ ] `chunked?`
254
-
255
- - [ ] `n_chunks`
256
-
257
- - [ ] `each_chunk`
258
-
259
- - [x] `tally`
260
-
261
- - [ ] `n_nulls`
262
-
263
- ### Functions
264
- #### Unary aggregations: vector.func => Scalar
265
-
266
- | Method |Boolean|Numeric|String|Remarks|
267
- | ------------ | --- | --- | --- | ----- |
268
- |[x] `all` | [x] | | | |
269
- |[x] `any` | [x] | | | |
270
- |[x] `approximate_median`| | [x] | | |
271
- |[x] `count` | [x] | [x] | [x] | |
272
- |[x] `count_distinct`| [x] | [x] | [x] | |
273
- |[x] `count_uniq` | [x] | [x] | [x] |an alias of `count_distinct`|
274
- |[ ] `index` | | | | |
275
- |[x] `max` | [x] | [x] | [x] | |
276
- |[x] `mean` | [x] | [x] | | |
277
- |[x] `min` | [x] | [x] | [x] | |
278
- |[ ] `min_max` | | | | |
279
- |[ ] `mode` | | | | |
280
- |[x] `product` | [x] | [x] | | |
281
- |[ ] `quantile`| | | | |
282
- |[x] `stddev` | | [x] | | |
283
- |[x] `sum` | [x] | [x] | | |
284
- |[ ] `tdigest` | | | | |
285
- |[x] `variance`| | [x] | | |
286
-
287
- #### Unary element-wise: vector.func => Vector
288
-
289
- | Method |Boolean|Numeric|String|Remarks|
290
- | ------------ | --- | --- | --- | ----- |
291
- |[x] `-@` | | [x] | |as `-vector`|
292
- |[x] `negate` | | [x] | |`-@` |
293
- |[x] `abs` | | [x] | | |
294
- |[ ] `acos` | | [ ] | | |
295
- |[ ] `asin` | | [ ] | | |
296
- |[x] `atan` | | [x] | | |
297
- |[ ] `ceil` | | [x] | | |
298
- |[x] `cos` | | [x] | | |
299
- |[ ] `floor` | | [x] | | |
300
- |[ ] `ln` | | [ ] | | |
301
- |[ ] `log10` | | [ ] | | |
302
- |[ ] `log1p` | | [ ] | | |
303
- |[ ] `log2` | | [ ] | | |
304
- |[x] `sign` | | [x] | | |
305
- |[x] `sin` | | [x] | | |
306
- |[x] `tan` | | [x] | | |
307
- |[ ] `trunc` | | [x] | | |
308
-
309
- #### Binary element-wise: vector.func(vector) => Vector
310
-
311
- | Method |Boolean|Numeric|String|Remarks|
312
- | ------------------ | --- | --- | --- | ----- |
313
- |[x] `add` | | [x] | | `+` |
314
- |[x] `atan2` | | [x] | | |
315
- |[x] `and` | [x] | | | |
316
- |[x] `and_kleene` | [x] | | | |
317
- |[x] `and_not` | [x] | | | |
318
- |[x] `and_not_kleene`| [x] | | | |
319
- |[x] `bit_wise_and` | |([x])| |`&`, integer only|
320
- |[ ] `bit_wise_not` | |([x])| |`!`, integer only|
321
- |[x] `bit_wise_or` | |([x])| |`|`, integer only|
322
- |[x] `bit_wise_xor` | |([x])| |`^`, integer only|
323
- |[x] `divide` | | [x] | | `/` |
324
- |[x] `equal` | [x] | [x] | [x] |`==`, alias `eq`|
325
- |[x] `greater` | [x] | [x] | [x] |`>`, alias `gt`|
326
- |[x] `greater_equal` | [x] | [x] | [x] |`>=`, alias `ge`|
327
- |[x] `less` | [x] | [x] | [x] |`<`, alias `lt`|
328
- |[x] `less_equal` | [x] | [x] | [x] |`<=`, alias `le`|
329
- |[ ] `logb` | | [ ] | | |
330
- |[ ] `mod` | | [ ] | | |
331
- |[x] `multiply` | | [x] | | `*` |
332
- |[x] `not_equal` | [x] | [x] | [x] |`!=`, alias `ne`|
333
- |[x] `or` | [x] | | | |
334
- |[x] `or_kleene` | [x] | | | |
335
- |[x] `power` | | [x] | | `**` |
336
- |[x] `subtract` | | [x] | | `-` |
337
- |[x] `shift_left` | |([x])| |`<<`, integer only|
338
- |[x] `shift_right` | |([x])| |`>>`, integer only|
339
- |[x] `xor` | [x] | | | |
340
-
341
- ##### (Not impremented)
342
- - [ ] invert, round, round_to_multiple
343
- - [ ] sort, sort_index
344
- - [ ] minmax, var, median, quantile
345
- - [ ] argmin, argmax
346
- - [ ] (array functions)
347
- - [ ] (strings functions)
348
- - [ ] (temporal functions)
349
- - [ ] (conditional functions)
350
- - [ ] (index functions)
351
- - [ ] (other functions)
352
-
353
- ### Coerce (not impremented)
354
-
355
- ### Updating (not impremented)
356
-
357
- ### DSL in a block for faster calculation ?
142
+ ## TDR concept
358
143
 
144
+ I named the data frame representation style in the model above TDR (Transposed DataFrame Representation). See [TDR.md](doc/tdr.md) for details.
359
145
 
360
146
  ## Development
361
147
 
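The `pick` and `assign` steps shown in the README diff above can be pictured with plain Ruby. The following is a minimal sketch, assuming a data frame is modeled as a column-oriented Hash of Arrays. It does not use RedAmber itself, and the helper names `pick_columns` and `assign_column` are hypothetical.

```ruby
# Plain-Ruby illustration only; RedAmber itself is built on Red Arrow.
# A "data frame" is modeled here as a Hash of column name => Array of values.
penguins = {
  species: %w[Adelie Adelie Adelie Adelie Adelie],
  body_mass_g: [3750, 3800, 3250, nil, 3450]
}

# Hypothetical helper: keep only the named columns (the idea behind `pick`).
def pick_columns(frame, *keys)
  frame.slice(*keys)
end

# Hypothetical helper: return a new frame with one more column
# (the idea behind `assign`). The input frame is not modified.
def assign_column(frame, key, values)
  frame.merge(key => values)
end

df = pick_columns(penguins, :body_mass_g)
# Convert grams to kilograms, passing nil through unchanged.
kg = df[:body_mass_g].map { |g| g && g / 1000.0 }
df = assign_column(df, :body_mass_kg, kg)

p df[:body_mass_kg] # => [3.75, 3.8, 3.25, nil, 3.45]
```

Both helpers return new Hashes rather than mutating their input, which mirrors how the README examples return a new DataFrame from each call.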
data/benchmark/csv_load_penguins.yml ADDED
@@ -0,0 +1,15 @@
1
+ prelude: |
2
+ require 'datasets-arrow'
3
+ require 'rover'
4
+ require 'red_amber'
5
+
6
+ penguins_csv = 'benchmark/cache/penguins.csv'
7
+
8
+ unless File.exist?(penguins_csv)
9
+ arrow = Datasets::Penguins.new.to_arrow
10
+ RedAmber::DataFrame.new(arrow).save(penguins_csv)
11
+ end
12
+
13
+ benchmark:
14
+ 'penguins by Rover': Rover.read_csv(penguins_csv)
15
+ 'penguins by RedAmber': RedAmber::DataFrame.load(penguins_csv)
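The benchmark definition above compares CSV loading in Rover and RedAmber. As a rough, dependency-free analogue, the sketch below times a naive CSV parse with Ruby's monotonic clock. The sample data and the `parse_csv` and `measure` helpers are assumptions for illustration, and the naive `split` does not handle quoted fields.

```ruby
# Illustration only: times a naive in-memory CSV parse using core Ruby.
# The rows below are made-up sample data, not the real penguins dataset.
CSV_TEXT = <<~CSV
  species,island,body_mass_g
  Adelie,Torgersen,3750
  Adelie,Torgersen,3800
  Gentoo,Biscoe,5000
CSV

# Hypothetical helper: parse CSV text into a column-oriented Hash.
# Naive: no support for quoted fields or embedded commas.
def parse_csv(text)
  header, *rows = text.lines(chomp: true).map { |line| line.split(',') }
  header.map(&:to_sym).zip(rows.transpose).to_h
end

# Hypothetical helper: run the block and return [result, elapsed seconds].
def measure
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

frame, seconds = measure { parse_csv(CSV_TEXT) }
puts "parsed #{frame[:species].size} rows in #{format('%.6f', seconds)} s"
```

A single timed run like this is noisy. A benchmark harness repeats the measurement many times, which is why the package defines the comparison in a benchmark file rather than timing it inline.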
data/benchmark/drop_nil.yml ADDED
@@ -0,0 +1,11 @@
1
+ prelude: |
2
+ require 'datasets-arrow'
3
+ require 'red_amber'
4
+
5
+ penguins = RedAmber::DataFrame.new(Datasets::Penguins.new.to_arrow)
6
+
7
+ def drop_nil(penguins)
8
+ penguins.remove { vectors.map { |v| v.is_nil} }
9
+ end
10
+
11
+ benchmark: drop_nil(penguins)
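The README expression `vectors.map(&:is_nil).reduce(&:|)` builds one nil mask per column and ORs them together, so that a row is flagged when any of its columns is nil. Here is a minimal plain-Ruby sketch of that idea, assuming a column-oriented Hash in place of a DataFrame. The sample values and variable names are hypothetical, and RedAmber is not used.

```ruby
# Illustration only: a column-oriented Hash standing in for a data frame.
frame = {
  bill_length_mm: [39.1, 39.5, nil, 36.7],
  sex: ['male', 'female', nil, nil]
}

# One boolean mask per column: true where the value is nil.
nil_masks = frame.values.map { |column| column.map(&:nil?) }

# Combine the masks element-wise with OR, so a row is flagged
# if any of its columns holds nil.
row_has_nil = nil_masks.reduce { |a, b| a.zip(b).map { |x, y| x | y } }

# Keep only the rows that are not flagged.
cleaned = frame.transform_values do |column|
  column.reject.with_index { |_, i| row_has_nil[i] }
end

p row_has_nil              # => [false, false, true, true]
p cleaned[:bill_length_mm] # => [39.1, 39.5]
p cleaned[:sex]            # => ["male", "female"]
```

The last row is dropped even though its `bill_length_mm` is present, because the OR-combined mask flags a row as soon as one column is nil.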