red_amber 0.1.5 → 0.1.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.rubocop.yml +33 -5
- data/.rubocop_todo.yml +2 -15
- data/.yardopts +1 -0
- data/CHANGELOG.md +164 -18
- data/Gemfile +6 -1
- data/README.md +247 -33
- data/Rakefile +1 -0
- data/benchmark/csv_load_penguins.yml +1 -1
- data/doc/DataFrame.md +383 -219
- data/doc/Vector.md +247 -37
- data/doc/examples_of_red_amber.ipynb +5454 -0
- data/doc/image/dataframe/assign.png +0 -0
- data/doc/image/dataframe/drop.png +0 -0
- data/doc/image/dataframe/pick.png +0 -0
- data/doc/image/dataframe/remove.png +0 -0
- data/doc/image/dataframe/rename.png +0 -0
- data/doc/image/dataframe/slice.png +0 -0
- data/doc/image/dataframe_model.png +0 -0
- data/doc/image/vector/binary_element_wise.png +0 -0
- data/doc/image/vector/unary_aggregation.png +0 -0
- data/doc/image/vector/unary_aggregation_w_option.png +0 -0
- data/doc/image/vector/unary_element_wise.png +0 -0
- data/lib/red-amber.rb +3 -0
- data/lib/red_amber/data_frame.rb +62 -10
- data/lib/red_amber/data_frame_displayable.rb +86 -9
- data/lib/red_amber/data_frame_selectable.rb +151 -32
- data/lib/red_amber/data_frame_variable_operation.rb +4 -0
- data/lib/red_amber/group.rb +59 -0
- data/lib/red_amber/helper.rb +61 -0
- data/lib/red_amber/vector.rb +59 -15
- data/lib/red_amber/vector_functions.rb +47 -38
- data/lib/red_amber/vector_selectable.rb +126 -0
- data/lib/red_amber/vector_updatable.rb +125 -0
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +6 -3
- data/red_amber.gemspec +0 -2
- metadata +9 -33
- data/lib/red_amber/data_frame_helper.rb +0 -64
- data/lib/red_amber/data_frame_observation_operation.rb +0 -83
- data/lib/red_amber/vector_compensable.rb +0 -68
data/README.md
CHANGED
@@ -1,16 +1,20 @@
|
|
1
1
|
# RedAmber
|
2
2
|
|
3
|
-
|
3
|
+
[Gem Version](https://badge.fury.io/rb/red_amber)
|
4
|
+
[Test workflow](https://github.com/heronshoes/red_amber/actions/workflows/test.yml)
|
4
5
|
|
5
|
-
|
6
|
+
A simple dataframe library for Ruby (experimental).
|
7
|
+
|
8
|
+
- Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) [Gitter chat](https://gitter.im/red-data-tools/en)
|
6
9
|
- Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
|
7
10
|
|
8
11
|
## Requirements
|
9
12
|
|
10
13
|
```ruby
|
11
14
|
gem 'red-arrow', '>= 8.0.0'
|
12
|
-
|
13
|
-
gem 'rover-df', '~> 0.3.0' # if you use IO from/to Rover::DataFrame
|
15
|
+
|
16
|
+
gem 'red-parquet', '>= 8.0.0' # Optional, if you use IO from/to parquet
|
17
|
+
gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
|
14
18
|
```
|
15
19
|
|
16
20
|
## Installation
|
@@ -18,7 +22,8 @@ gem 'rover-df', '~> 0.3.0' # if you use IO from/to Rover::DataFrame
|
|
18
22
|
Install requirements before you install Red Amber.
|
19
23
|
|
20
24
|
- Apache Arrow GLib (>= 8.0.0)
|
21
|
-
|
25
|
+
|
26
|
+
- Apache Parquet GLib (>= 8.0.0) # If you use IO from/to parquet
|
22
27
|
|
23
28
|
See [Apache Arrow install document](https://arrow.apache.org/install/).
|
24
29
|
|
@@ -44,27 +49,28 @@ gem install red_amber
|
|
44
49
|
|
45
50
|
## `RedAmber::DataFrame`
|
46
51
|
|
47
|
-
Represents a set of data in 2D-shape.
|
52
|
+
Represents a set of data in a 2D shape. The underlying entity is a Red Arrow Table object.
|
48
53
|
|
49
54
|
```ruby
|
50
|
-
require 'red_amber'
|
55
|
+
require 'red_amber' # require 'red-amber' is also OK.
|
51
56
|
require 'datasets-arrow'
|
52
57
|
|
53
58
|
arrow = Datasets::Penguins.new.to_arrow
|
54
59
|
penguins = RedAmber::DataFrame.new(arrow)
|
55
|
-
|
60
|
+
|
56
61
|
# =>
|
57
|
-
RedAmber::DataFrame : 344 x 8 Vectors
|
58
|
-
|
59
|
-
|
60
|
-
1
|
61
|
-
2
|
62
|
-
3
|
63
|
-
4
|
64
|
-
5
|
65
|
-
|
66
|
-
7
|
67
|
-
8
|
62
|
+
#<RedAmber::DataFrame : 344 x 8 Vectors, 0x0000000000013790>
|
63
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
64
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
65
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
66
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
67
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
68
|
+
4 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
69
|
+
5 Adelie Torgersen 36.7 19.3 193 ... 2007
|
70
|
+
: : : : : : ... :
|
71
|
+
342 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
72
|
+
343 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
73
|
+
344 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
68
74
|
```
|
69
75
|
|
70
76
|
### DataFrame model
|
@@ -72,33 +78,179 @@ Vectors : 5 numeric, 3 strings
|
|
72
78
|
|
73
79
|
For example, `DataFrame#pick` accepts keys as an argument and returns a sub DataFrame.
|
74
80
|
|
81
|
+

|
82
|
+
|
75
83
|
```ruby
|
76
|
-
|
84
|
+
penguins.keys
|
85
|
+
# =>
|
86
|
+
[:species,
|
87
|
+
:island,
|
88
|
+
:bill_length_mm,
|
89
|
+
:bill_depth_mm,
|
90
|
+
:flipper_length_mm,
|
91
|
+
:body_mass_g,
|
92
|
+
:sex,
|
93
|
+
:year]
|
94
|
+
|
95
|
+
df = penguins.pick(:species, :island, :body_mass_g)
|
96
|
+
df
|
97
|
+
|
98
|
+
# =>
|
99
|
+
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003cc1c>
|
100
|
+
species island body_mass_g
|
101
|
+
<string> <string> <uint16>
|
102
|
+
1 Adelie Torgersen 3750
|
103
|
+
2 Adelie Torgersen 3800
|
104
|
+
3 Adelie Torgersen 3250
|
105
|
+
4 Adelie Torgersen (nil)
|
106
|
+
5 Adelie Torgersen 3450
|
107
|
+
: : : :
|
108
|
+
342 Gentoo Biscoe 5750
|
109
|
+
343 Gentoo Biscoe 5200
|
110
|
+
344 Gentoo Biscoe 5400
|
111
|
+
```
|
112
|
+
|
113
|
+
`DataFrame#drop` drops some columns and returns a DataFrame with the remaining columns.
|
114
|
+
|
115
|
+

|
116
|
+
|
117
|
+
You can specify the columns to drop by keys or by a boolean array (the same size as n_keys).
|
118
|
+
|
119
|
+
```ruby
|
120
|
+
# Same as df.drop(:species, :island)
|
121
|
+
df = df.drop(true, true, false)
|
122
|
+
|
77
123
|
# =>
|
78
|
-
#<RedAmber::DataFrame : 344 x 1 Vector,
|
79
|
-
|
80
|
-
|
81
|
-
1
|
124
|
+
#<RedAmber::DataFrame : 344 x 1 Vector, 0x0000000000048760>
|
125
|
+
body_mass_g
|
126
|
+
<uint16>
|
127
|
+
1 3750
|
128
|
+
2 3800
|
129
|
+
3 3250
|
130
|
+
4 (nil)
|
131
|
+
5 3450
|
132
|
+
: :
|
133
|
+
342 5750
|
134
|
+
343 5200
|
135
|
+
344 5400
|
82
136
|
```
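The boolean-array form of `drop` can be pictured with a small plain-Ruby model. This is only a sketch of the selection rule (one boolean per key, `true` meaning "drop this column"). It does not use RedAmber, and `drop_by_mask` is a hypothetical helper name introduced here for illustration.

```ruby
# Plain-Ruby model of dropping columns with a boolean mask.
# `drop_by_mask` is a hypothetical helper, not part of RedAmber.
def drop_by_mask(columns, mask)
  raise ArgumentError, 'mask size must equal the number of keys' unless mask.size == columns.size

  # Pair each key with its mask entry and keep the keys marked false.
  keep = columns.keys.zip(mask).reject { |_key, drop| drop }.map(&:first)
  columns.slice(*keep)
end

columns = {
  species: %w[Adelie Adelie Gentoo],
  island: %w[Torgersen Torgersen Biscoe],
  body_mass_g: [3750, 3800, 5400]
}

remaining = drop_by_mask(columns, [true, true, false])
p remaining.keys # => [:body_mass_g]
```

The input hash is left untouched and a new hash is returned, which mirrors the "always return a new object" behavior described for the DataFrame methods.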
|
83
137
|
|
138
|
+
Arrow data is immutable, so these methods always return a new object.
|
139
|
+
|
84
140
|
`DataFrame#assign` creates new variables (columns in the table).
|
85
141
|
|
142
|
+

|
143
|
+
|
86
144
|
```ruby
|
145
|
+
# New column is created because ':body_mass_kg' is a new key.
|
87
146
|
df.assign(:body_mass_kg => df[:body_mass_g] / 1000.0)
|
147
|
+
|
148
|
+
# =>
|
149
|
+
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x00000000000212f0>
|
150
|
+
body_mass_g body_mass_kg
|
151
|
+
<uint16> <double>
|
152
|
+
1 3750 3.8
|
153
|
+
2 3800 3.8
|
154
|
+
3 3250 3.3
|
155
|
+
4 (nil) (nil)
|
156
|
+
5 3450 3.5
|
157
|
+
: : :
|
158
|
+
342 5750 5.8
|
159
|
+
343 5200 5.2
|
160
|
+
344 5400 5.4
|
161
|
+
```
|
162
|
+
|
163
|
+
`DataFrame#slice` selects rows (observations) to create a sub DataFrame.
|
164
|
+
|
165
|
+

|
166
|
+
|
167
|
+
```ruby
|
168
|
+
# returns 5 rows at the start and 5 rows from the end
|
169
|
+
penguins.slice(0...5, -5..-1)
|
170
|
+
|
88
171
|
# =>
|
89
|
-
#<RedAmber::DataFrame :
|
90
|
-
|
91
|
-
|
92
|
-
1
|
93
|
-
2
|
172
|
+
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
|
173
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
174
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
175
|
+
1 Adelie Torgersen 39.1 18.7 181 ... 2007
|
176
|
+
2 Adelie Torgersen 39.5 17.4 186 ... 2007
|
177
|
+
3 Adelie Torgersen 40.3 18.0 195 ... 2007
|
178
|
+
4 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
179
|
+
5 Adelie Torgersen 36.7 19.3 193 ... 2007
|
180
|
+
: : : : : : ... :
|
181
|
+
8 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
182
|
+
9 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
183
|
+
10 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
184
|
+
```
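How several ranges combine into one selection can be shown with ordinary Ruby arrays. The sketch below is an assumption-level model of the row selection only. `slice_rows` is a hypothetical helper, and the array of integers stands in for the 344 row positions.

```ruby
# Plain-Ruby model of selecting rows by several ranges at once.
# `slice_rows` is a hypothetical helper, not RedAmber's implementation.
def slice_rows(rows, *ranges)
  # Each range is resolved against the same array; results are concatenated in order.
  ranges.flat_map { |range| rows[range] }
end

rows = (1..344).to_a
selected = slice_rows(rows, 0...5, -5..-1)
p selected # => [1, 2, 3, 4, 5, 340, 341, 342, 343, 344]
```

Negative indices count from the end, so `-5..-1` picks the last five positions, just as in the `penguins.slice(0...5, -5..-1)` call.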
|
185
|
+
|
186
|
+
`DataFrame#remove` rejects rows (observations) and returns a DataFrame with the remaining rows.
|
187
|
+
|
188
|
+

|
189
|
+
|
190
|
+
```ruby
|
191
|
+
# penguins[:bill_length_mm] < 40 returns a boolean Vector
|
192
|
+
penguins.remove(penguins[:bill_length_mm] < 40)
|
193
|
+
|
194
|
+
# =>
|
195
|
+
#<RedAmber::DataFrame : 244 x 8 Vectors, 0x000000000007d6f4>
|
196
|
+
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
|
197
|
+
<string> <string> <double> <double> <uint8> ... <uint16>
|
198
|
+
1 Adelie Torgersen 40.3 18.0 195 ... 2007
|
199
|
+
2 Adelie Torgersen (nil) (nil) (nil) ... 2007
|
200
|
+
3 Adelie Torgersen 42.0 20.2 190 ... 2007
|
201
|
+
4 Adelie Torgersen 41.1 17.6 182 ... 2007
|
202
|
+
5 Adelie Torgersen 42.5 20.7 197 ... 2007
|
203
|
+
: : : : : : ... :
|
204
|
+
242 Gentoo Biscoe 50.4 15.7 222 ... 2009
|
205
|
+
243 Gentoo Biscoe 45.2 14.8 212 ... 2009
|
206
|
+
244 Gentoo Biscoe 49.9 16.1 213 ... 2009
|
94
207
|
```
|
95
208
|
|
96
209
|
DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove`, `rename` and `assign` accept a block.
|
97
210
|
|
98
|
-
This is
|
211
|
+
This example uses a block to update numeric columns.
|
212
|
+
|
213
|
+
```ruby
|
214
|
+
df = RedAmber::DataFrame.new(
|
215
|
+
integer: [0, 1, 2, 3, nil],
|
216
|
+
float: [0.0, 1.1, 2.2, Float::NAN, nil],
|
217
|
+
string: ['A', 'B', 'C', 'D', nil],
|
218
|
+
boolean: [true, false, true, false, nil])
|
219
|
+
df
|
220
|
+
|
221
|
+
# =>
|
222
|
+
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000003131c>
|
223
|
+
integer float string boolean
|
224
|
+
<uint8> <double> <string> <boolean>
|
225
|
+
1 0 0.0 A true
|
226
|
+
2 1 1.1 B false
|
227
|
+
3 2 2.2 C true
|
228
|
+
4 3 NaN D false
|
229
|
+
5 (nil) (nil) (nil) (nil)
|
230
|
+
|
231
|
+
df.assign do
|
232
|
+
vectors.each_with_object({}) do |v, h|
|
233
|
+
h[v.key] = -v if v.numeric?
|
234
|
+
end
|
235
|
+
end
|
236
|
+
|
237
|
+
# =>
|
238
|
+
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000009a1b4>
|
239
|
+
integer float string boolean
|
240
|
+
<uint8> <double> <string> <boolean>
|
241
|
+
1 0 -0.0 A true
|
242
|
+
2 255 -1.1 B false
|
243
|
+
3 254 -2.2 C true
|
244
|
+
4 253 NaN D false
|
245
|
+
5 (nil) (nil) (nil) (nil)
|
246
|
+
```
|
247
|
+
|
248
|
+
The negate (`-@`) method of an unsigned integer Vector returns the complement, so values wrap around (for `uint8`, 1 becomes 255).
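The wrap-around seen in the `integer` column can be reproduced with modular arithmetic. The following is a plain-Ruby sketch that assumes 8-bit unsigned values. `negate_uint8` is a hypothetical helper and not a RedAmber method.

```ruby
# Plain-Ruby model of negating unsigned 8-bit values; nil stays nil.
# `negate_uint8` is a hypothetical helper, not a RedAmber method.
def negate_uint8(values)
  # Ruby's % always returns a non-negative result for a positive modulus,
  # so -1 maps to 255, -2 to 254, and so on.
  values.map { |x| x.nil? ? nil : (-x) % 256 }
end

p negate_uint8([0, 1, 2, 3, nil]) # => [0, 255, 254, 253, nil]
```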
|
249
|
+
|
250
|
+
The next example eliminates observations (rows in the table) containing nil.
|
99
251
|
|
100
252
|
```ruby
|
101
|
-
# remove all
|
253
|
+
# remove all observations containing nil
|
102
254
|
nil_removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
|
103
255
|
nil_removed.tdr
|
104
256
|
# =>
|
@@ -121,12 +273,51 @@ For this frequently needed task, we can do it much simpler.
|
|
121
273
|
penguins.remove_nil # => same result as above
|
122
274
|
```
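The block `vectors.map(&:is_nil).reduce(&:|)` builds one boolean mask per column and ORs them element by element, so a row is marked when any of its values is nil. A plain-Ruby sketch of that idea is shown below. It uses arrays instead of Vectors, and `nil_row_mask` and `remove_nil_rows` are hypothetical helper names.

```ruby
# Plain-Ruby model of building a "row contains nil" mask and removing those rows.
# Both helpers are hypothetical; they do not call RedAmber.
def nil_row_mask(columns)
  columns
    .map { |column| column.map(&:nil?) }                       # one boolean mask per column
    .reduce { |acc, mask| acc.zip(mask).map { |a, b| a | b } } # element-wise OR
end

def remove_nil_rows(table)
  mask = nil_row_mask(table.values)
  table.transform_values do |column|
    column.each_with_index.reject { |_value, i| mask[i] }.map(&:first)
  end
end

table = { integer: [0, 1, nil], string: ['A', nil, 'C'] }
p nil_row_mask(table.values) # => [false, true, true]
p remove_nil_rows(table)     # => {:integer=>[0], :string=>["A"]}
```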
|
123
275
|
|
276
|
+
The `DataFrame#group` method can be used for grouping tasks.
|
277
|
+
|
278
|
+
```ruby
|
279
|
+
starwars = RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
|
280
|
+
starwars
|
281
|
+
|
282
|
+
# =>
|
283
|
+
#<RedAmber::DataFrame : 87 x 12 Vectors, 0x000000000000607c>
|
284
|
+
unnamed1 name height mass hair_color skin_color eye_color ... species
|
285
|
+
<int64> <string> <int64> <double> <string> <string> <string> ... <string>
|
286
|
+
1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
|
287
|
+
2 2 C-3PO 167 75.0 NA gold yellow ... Droid
|
288
|
+
3 3 R2-D2 96 32.0 NA white, blue red ... Droid
|
289
|
+
4 4 Darth Vader 202 136.0 none white yellow ... Human
|
290
|
+
5 5 Leia Organa 150 49.0 brown light brown ... Human
|
291
|
+
: : : : : : : : ... :
|
292
|
+
85 85 BB8 (nil) (nil) none none black ... Droid
|
293
|
+
86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
|
294
|
+
87 87 Padmé Amidala 165 45.0 brown light brown ... Human
|
295
|
+
|
296
|
+
grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
|
297
|
+
grouped.slice { v(:count) > 1 }
|
298
|
+
|
299
|
+
# =>
|
300
|
+
#<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000006e848>
|
301
|
+
species count mean(height) mean(mass)
|
302
|
+
<string> <int64> <double> <double>
|
303
|
+
1 Human 35 176.6 82.8
|
304
|
+
2 Droid 6 131.2 69.8
|
305
|
+
3 Wookiee 2 231.0 124.0
|
306
|
+
4 Gungan 3 208.7 74.0
|
307
|
+
5 NA 4 181.3 48.0
|
308
|
+
: : : : :
|
309
|
+
7 Twi'lek 2 179.0 55.0
|
310
|
+
8 Mirialan 2 168.0 53.1
|
311
|
+
9 Kaminoan 2 221.0 88.0
|
312
|
+
```
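The group-then-aggregate flow can be modeled with `Enumerable#group_by`. This is a rough sketch on a few hand-written rows, not the starwars dataset. `group_summary` is a hypothetical helper, and two choices in it are assumptions made for this model rather than statements about RedAmber: the count is the number of rows in each group, and the mean skips nil values.

```ruby
# Plain-Ruby model of grouping rows by a key, then counting and averaging.
# `group_summary` is a hypothetical helper; it does not use RedAmber.
def group_summary(rows, key, value)
  rows.group_by { |row| row[key] }.map do |group, members|
    values = members.map { |row| row[value] }.compact # skip nil values
    mean = values.empty? ? nil : values.sum.to_f / values.size
    { key => group, count: members.size, mean: mean }
  end
end

rows = [
  { species: 'Human',  height: 172 },
  { species: 'Droid',  height: 167 },
  { species: 'Droid',  height: 96 },
  { species: 'Human',  height: 202 },
  { species: 'Gungan', height: nil }
]

summary = group_summary(rows, :species, :height)
frequent = summary.select { |row| row[:count] > 1 }
p frequent
# => [{:species=>"Human", :count=>2, :mean=>187.0},
#     {:species=>"Droid", :count=>2, :mean=>131.5}]
```

The final `select` plays the role of `grouped.slice { v(:count) > 1 }`: it keeps only the groups with more than one member.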
|
313
|
+
|
124
314
|
See [DataFrame.md](doc/DataFrame.md) for details.
|
125
315
|
|
126
316
|
|
127
317
|
## `RedAmber::Vector`
|
128
318
|
|
129
319
|
Class `RedAmber::Vector` represents a series of data in the DataFrame.
|
320
|
+
Method `RedAmber::DataFrame#[key]` returns a Vector with the key `key`.
|
130
321
|
|
131
322
|
```ruby
|
132
323
|
penguins[:bill_length_mm]
|
@@ -137,11 +328,34 @@ penguins[:bill_length_mm]
|
|
137
328
|
|
138
329
|
Vectors accept some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).
|
139
330
|
|
331
|
+
This is an element-wise comparison and returns a boolean Vector of the same size.
|
332
|
+
|
333
|
+

|
334
|
+
|
335
|
+
```ruby
|
336
|
+
penguins[:bill_length_mm] < 40
|
337
|
+
|
338
|
+
# =>
|
339
|
+
#<RedAmber::Vector(:boolean, size=344):0x000000000007e7ac>
|
340
|
+
[true, true, false, nil, true, true, true, true, true, false, true, true, false, ... ]
|
341
|
+
```
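The result above contains `nil` where the input is nil, so the comparison is applied per element and missing values stay missing. A plain-Ruby sketch of that rule, using the first five `bill_length_mm` values from the table shown earlier, might look as follows. `less_than` is a hypothetical helper.

```ruby
# Plain-Ruby model of an element-wise comparison that keeps nil as nil.
# `less_than` is a hypothetical helper, not a RedAmber method.
def less_than(values, threshold)
  values.map { |x| x.nil? ? nil : x < threshold }
end

p less_than([39.1, 39.5, 40.3, nil, 36.7], 40) # => [true, true, false, nil, true]
```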
|
342
|
+
|
343
|
+
The next example returns an aggregated result.
|
344
|
+
|
345
|
+

|
346
|
+
|
347
|
+
```ruby
|
348
|
+
penguins[:bill_length_mm].mean
|
349
|
+
# =>
|
350
|
+
43.92192982456141
|
351
|
+
|
352
|
+
```
|
353
|
+
|
140
354
|
See [Vector.md](doc/Vector.md) for details.
|
141
355
|
|
142
|
-
##
|
356
|
+
## Jupyter notebook
|
143
357
|
|
144
|
-
|
358
|
+
[53 Examples of Red Amber](doc/examples_of_red_amber.ipynb)
|
145
359
|
|
146
360
|
## Development
|
147
361
|
|
data/benchmark/csv_load_penguins.yml
CHANGED
@@ -1,11 +1,11 @@
|
|
1
1
|
prelude: |
|
2
|
-
require 'datasets-arrow'
|
3
2
|
require 'rover'
|
4
3
|
require 'red_amber'
|
5
4
|
|
6
5
|
penguins_csv = 'benchmark/cache/penguins.csv'
|
7
6
|
|
8
7
|
unless File.exist?(penguins_csv)
|
8
|
+
require 'datasets-arrow'
|
9
9
|
arrow = Datasets::Penguins.new.to_arrow
|
10
10
|
RedAmber::DataFrame.new(arrow).save(penguins_csv)
|
11
11
|
end
|