red_amber 0.1.5 → 0.1.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +33 -5
  3. data/.rubocop_todo.yml +2 -15
  4. data/.yardopts +1 -0
  5. data/CHANGELOG.md +164 -18
  6. data/Gemfile +6 -1
  7. data/README.md +247 -33
  8. data/Rakefile +1 -0
  9. data/benchmark/csv_load_penguins.yml +1 -1
  10. data/doc/DataFrame.md +383 -219
  11. data/doc/Vector.md +247 -37
  12. data/doc/examples_of_red_amber.ipynb +5454 -0
  13. data/doc/image/dataframe/assign.png +0 -0
  14. data/doc/image/dataframe/drop.png +0 -0
  15. data/doc/image/dataframe/pick.png +0 -0
  16. data/doc/image/dataframe/remove.png +0 -0
  17. data/doc/image/dataframe/rename.png +0 -0
  18. data/doc/image/dataframe/slice.png +0 -0
  19. data/doc/image/dataframe_model.png +0 -0
  20. data/doc/image/vector/binary_element_wise.png +0 -0
  21. data/doc/image/vector/unary_aggregation.png +0 -0
  22. data/doc/image/vector/unary_aggregation_w_option.png +0 -0
  23. data/doc/image/vector/unary_element_wise.png +0 -0
  24. data/lib/red-amber.rb +3 -0
  25. data/lib/red_amber/data_frame.rb +62 -10
  26. data/lib/red_amber/data_frame_displayable.rb +86 -9
  27. data/lib/red_amber/data_frame_selectable.rb +151 -32
  28. data/lib/red_amber/data_frame_variable_operation.rb +4 -0
  29. data/lib/red_amber/group.rb +59 -0
  30. data/lib/red_amber/helper.rb +61 -0
  31. data/lib/red_amber/vector.rb +59 -15
  32. data/lib/red_amber/vector_functions.rb +47 -38
  33. data/lib/red_amber/vector_selectable.rb +126 -0
  34. data/lib/red_amber/vector_updatable.rb +125 -0
  35. data/lib/red_amber/version.rb +1 -1
  36. data/lib/red_amber.rb +6 -3
  37. data/red_amber.gemspec +0 -2
  38. metadata +9 -33
  39. data/lib/red_amber/data_frame_helper.rb +0 -64
  40. data/lib/red_amber/data_frame_observation_operation.rb +0 -83
  41. data/lib/red_amber/vector_compensable.rb +0 -68
data/README.md CHANGED
@@ -1,16 +1,20 @@
1
1
  # RedAmber
2
2
 
3
- A simple dataframe library for Ruby (experimental)
3
+ [![Gem Version](https://badge.fury.io/rb/red_amber.svg)](https://badge.fury.io/rb/red_amber)
4
+ [![Ruby](https://github.com/heronshoes/red_amber/actions/workflows/test.yml/badge.svg)](https://github.com/heronshoes/red_amber/actions/workflows/test.yml)
4
5
 
5
- - Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
6
+ A simple dataframe library for Ruby (experimental).
7
+
8
+ - Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) [![Gitter Chat](https://badges.gitter.im/red-data-tools/en.svg)](https://gitter.im/red-data-tools/en)
6
9
  - Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
7
10
 
8
11
  ## Requirements
9
12
 
10
13
  ```ruby
11
14
  gem 'red-arrow', '>= 8.0.0'
12
- gem 'red-parquet', '>= 8.0.0' # if you use IO from/to parquet
13
- gem 'rover-df', '~> 0.3.0' # if you use IO from/to Rover::DataFrame
15
+
16
+ gem 'red-parquet', '>= 8.0.0' # Optional, if you use IO from/to parquet
17
+ gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
14
18
  ```
15
19
 
16
20
  ## Installation
@@ -18,7 +22,8 @@ gem 'rover-df', '~> 0.3.0' # if you use IO from/to Rover::DataFrame
18
22
  Install requirements before you install Red Amber.
19
23
 
20
24
  - Apache Arrow GLib (>= 8.0.0)
21
- - Apache Parquet GLib (>= 8.0.0)
25
+
26
+ - Apache Parquet GLib (>= 8.0.0) # If you use IO from/to parquet
22
27
 
23
28
  See [Apache Arrow install document](https://arrow.apache.org/install/).
24
29
 
@@ -44,27 +49,28 @@ gem install red_amber
44
49
 
45
50
  ## `RedAmber::DataFrame`
46
51
 
47
- Represents a set of data in 2D-shape.
52
+ Represents a set of data in a 2D shape. The underlying entity is a Red Arrow Table object.
48
53
 
49
54
  ```ruby
50
- require 'red_amber'
55
+ require 'red_amber' # require 'red-amber' is also OK.
51
56
  require 'datasets-arrow'
52
57
 
53
58
  arrow = Datasets::Penguins.new.to_arrow
54
59
  penguins = RedAmber::DataFrame.new(arrow)
55
- penguins.tdr
60
+
56
61
  # =>
57
- RedAmber::DataFrame : 344 x 8 Vectors
58
- Vectors : 5 numeric, 3 strings
59
- # key type level data_preview
60
- 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
61
- 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
62
- 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
63
- 4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
64
- 5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
65
- 6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
66
- 7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
67
- 8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
62
+ #<RedAmber::DataFrame : 344 x 8 Vectors, 0x0000000000013790>
63
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
64
+ <string> <string> <double> <double> <uint8> ... <uint16>
65
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
66
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
67
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
68
+ 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
69
+ 5 Adelie Torgersen 36.7 19.3 193 ... 2007
70
+ : : : : : : ... :
71
+ 342 Gentoo Biscoe 50.4 15.7 222 ... 2009
72
+ 343 Gentoo Biscoe 45.2 14.8 212 ... 2009
73
+ 344 Gentoo Biscoe 49.9 16.1 213 ... 2009
68
74
  ```
69
75
 
70
76
  ### DataFrame model
@@ -72,33 +78,179 @@ Vectors : 5 numeric, 3 strings
72
78
 
73
79
  For example, `DataFrame#pick` accepts keys as an argument and returns a sub DataFrame.
74
80
 
81
+ ![pick method image](doc/image/dataframe/pick.png)
82
+
75
83
  ```ruby
76
- df = penguins.pick(:body_mass_g)
84
+ penguins.keys
85
+ # =>
86
+ [:species,
87
+ :island,
88
+ :bill_length_mm,
89
+ :bill_depth_mm,
90
+ :flipper_length_mm,
91
+ :body_mass_g,
92
+ :sex,
93
+ :year]
94
+
95
+ df = penguins.pick(:species, :island, :body_mass_g)
96
+ df
97
+
98
+ # =>
99
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003cc1c>
100
+ species island body_mass_g
101
+ <string> <string> <uint16>
102
+ 1 Adelie Torgersen 3750
103
+ 2 Adelie Torgersen 3800
104
+ 3 Adelie Torgersen 3250
105
+ 4 Adelie Torgersen (nil)
106
+ 5 Adelie Torgersen 3450
107
+ : : : :
108
+ 342 Gentoo Biscoe 5750
109
+ 343 Gentoo Biscoe 5200
110
+ 344 Gentoo Biscoe 5400
111
+ ```
112
+
113
+ `DataFrame#drop` drops some columns to create a remainder DataFrame.
114
+
115
+ ![drop method image](doc/image/dataframe/drop.png)
116
+
117
+ You can specify by keys or a boolean array (same size as n_keys).
118
+
119
+ ```ruby
120
+ # Same as df.drop(:species, :island)
121
+ df = df.drop(true, true, false)
122
+
77
123
  # =>
78
- #<RedAmber::DataFrame : 344 x 1 Vector, 0x000000000000fa14>
79
- Vector : 1 numeric
80
- # key type level data_preview
81
- 1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
124
+ #<RedAmber::DataFrame : 344 x 1 Vector, 0x0000000000048760>
125
+ body_mass_g
126
+ <uint16>
127
+ 1 3750
128
+ 2 3800
129
+ 3 3250
130
+ 4 (nil)
131
+ 5 3450
132
+ : :
133
+ 342 5750
134
+ 343 5200
135
+ 344 5400
82
136
  ```
83
137
 
138
+ Arrow data is immutable, so these methods always return a new object.
139
+
84
140
  `DataFrame#assign` creates new variables (columns in the table).
85
141
 
142
+ ![assign method image](doc/image/dataframe/assign.png)
143
+
86
144
  ```ruby
145
+ # New column is created because ':body_mass_kg' is a new key.
87
146
  df.assign(:body_mass_kg => df[:body_mass_g] / 1000.0)
147
+
148
+ # =>
149
+ #<RedAmber::DataFrame : 344 x 2 Vectors, 0x00000000000212f0>
150
+ body_mass_g body_mass_kg
151
+ <uint16> <double>
152
+ 1 3750 3.8
153
+ 2 3800 3.8
154
+ 3 3250 3.3
155
+ 4 (nil) (nil)
156
+ 5 3450 3.5
157
+ : : :
158
+ 342 5750 5.8
159
+ 343 5200 5.2
160
+ 344 5400 5.4
161
+ ```
162
+
163
+ `DataFrame#slice` selects rows (observations) to create a sub DataFrame.
164
+
165
+ ![slice method image](doc/image/dataframe/slice.png)
166
+
167
+ ```ruby
168
+ # returns 5 rows at the start and 5 rows from the end
169
+ penguins.slice(0...5, -5..-1)
170
+
88
171
  # =>
89
- #<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000fa28>
90
- Vectors : 2 numeric
91
- # key type level data_preview
92
- 1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
93
- 2 :body_mass_kg double 95 [3.75, 3.8, 3.25, nil, 3.45, ... ], 2 nils
172
+ #<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
173
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
174
+ <string> <string> <double> <double> <uint8> ... <uint16>
175
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
176
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
177
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
178
+ 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
179
+ 5 Adelie Torgersen 36.7 19.3 193 ... 2007
180
+ : : : : : : ... :
181
+ 8 Gentoo Biscoe 50.4 15.7 222 ... 2009
182
+ 9 Gentoo Biscoe 45.2 14.8 212 ... 2009
183
+ 10 Gentoo Biscoe 49.9 16.1 213 ... 2009
184
+ ```
185
+
186
+ `DataFrame#remove` rejects rows (observations) to create a remainder DataFrame.
187
+
188
+ ![remove method image](doc/image/dataframe/remove.png)
189
+
190
+ ```ruby
191
+ # penguins[:bill_length_mm] < 40 returns a boolean Vector
192
+ penguins.remove(penguins[:bill_length_mm] < 40)
193
+
194
+ # =>
195
+ #<RedAmber::DataFrame : 244 x 8 Vectors, 0x000000000007d6f4>
196
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
197
+ <string> <string> <double> <double> <uint8> ... <uint16>
198
+ 1 Adelie Torgersen 40.3 18.0 195 ... 2007
199
+ 2 Adelie Torgersen (nil) (nil) (nil) ... 2007
200
+ 3 Adelie Torgersen 42.0 20.2 190 ... 2007
201
+ 4 Adelie Torgersen 41.1 17.6 182 ... 2007
202
+ 5 Adelie Torgersen 42.5 20.7 197 ... 2007
203
+ : : : : : : ... :
204
+ 242 Gentoo Biscoe 50.4 15.7 222 ... 2009
205
+ 243 Gentoo Biscoe 45.2 14.8 212 ... 2009
206
+ 244 Gentoo Biscoe 49.9 16.1 213 ... 2009
94
207
  ```
95
208
 
96
209
  DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove`, `rename` and `assign` accept a block.
97
210
 
98
- This is an exaple to eliminate observations (row in the table) containing nil.
211
+ This example uses a block to update numeric columns.
212
+
213
+ ```ruby
214
+ df = RedAmber::DataFrame.new(
215
+ integer: [0, 1, 2, 3, nil],
216
+ float: [0.0, 1.1, 2.2, Float::NAN, nil],
217
+ string: ['A', 'B', 'C', 'D', nil],
218
+ boolean: [true, false, true, false, nil])
219
+ df
220
+
221
+ # =>
222
+ #<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000003131c>
223
+ integer float string boolean
224
+ <uint8> <double> <string> <boolean>
225
+ 1 0 0.0 A true
226
+ 2 1 1.1 B false
227
+ 3 2 2.2 C true
228
+ 4 3 NaN D false
229
+ 5 (nil) (nil) (nil) (nil)
230
+
231
+ df.assign do
232
+ vectors.each_with_object({}) do |v, h|
233
+ h[v.key] = -v if v.numeric?
234
+ end
235
+ end
236
+
237
+ # =>
238
+ #<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000009a1b4>
239
+ integer float string boolean
240
+ <uint8> <double> <string> <boolean>
241
+ 1 0 -0.0 A true
242
+ 2 255 -1.1 B false
243
+ 3 254 -2.2 C true
244
+ 4 253 NaN D false
245
+ 5 (nil) (nil) (nil) (nil)
246
+ ```
247
+
248
+ The negate method (`-@`) of an unsigned integer Vector returns the complement, so values wrap around (1 becomes 255 for uint8).
249
+
250
+ The next example eliminates observations (rows in the table) containing nil.
99
251
 
100
252
  ```ruby
101
- # remove all observation contains nil
253
+ # remove all observations containing nil
102
254
  nil_removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
103
255
  nil_removed.tdr
104
256
  # =>
@@ -121,12 +273,51 @@ For this frequently needed task, we can do it much simpler.
121
273
  penguins.remove_nil # => same result as above
122
274
  ```
123
275
 
276
+ The `DataFrame#group` method can be used for grouping tasks.
277
+
278
+ ```ruby
279
+ starwars = RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
280
+ starwars
281
+
282
+ # =>
283
+ #<RedAmber::DataFrame : 87 x 12 Vectors, 0x000000000000607c>
284
+ unnamed1 name height mass hair_color skin_color eye_color ... species
285
+ <int64> <string> <int64> <double> <string> <string> <string> ... <string>
286
+ 1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
287
+ 2 2 C-3PO 167 75.0 NA gold yellow ... Droid
288
+ 3 3 R2-D2 96 32.0 NA white, blue red ... Droid
289
+ 4 4 Darth Vader 202 136.0 none white yellow ... Human
290
+ 5 5 Leia Organa 150 49.0 brown light brown ... Human
291
+ : : : : : : : : ... :
292
+ 85 85 BB8 (nil) (nil) none none black ... Droid
293
+ 86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
294
+ 87 87 Padmé Amidala 165 45.0 brown light brown ... Human
295
+
296
+ grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
297
+ grouped.slice { v(:count) > 1 }
298
+
299
+ # =>
300
+ #<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000006e848>
301
+ species count mean(height) mean(mass)
302
+ <string> <int64> <double> <double>
303
+ 1 Human 35 176.6 82.8
304
+ 2 Droid 6 131.2 69.8
305
+ 3 Wookiee 2 231.0 124.0
306
+ 4 Gungan 3 208.7 74.0
307
+ 5 NA 4 181.3 48.0
308
+ : : : : :
309
+ 7 Twi'lek 2 179.0 55.0
310
+ 8 Mirialan 2 168.0 53.1
311
+ 9 Kaminoan 2 221.0 88.0
312
+ ```
313
+
124
314
  See [DataFrame.md](doc/DataFrame.md) for details.
125
315
 
126
316
 
127
317
  ## `RedAmber::Vector`
128
318
 
129
319
  Class `RedAmber::Vector` represents a series of data in the DataFrame.
320
+ Method `RedAmber::DataFrame#[key]` returns a Vector with the key `key`.
130
321
 
131
322
  ```ruby
132
323
  penguins[:bill_length_mm]
@@ -137,11 +328,34 @@ penguins[:bill_length_mm]
137
328
 
138
329
  Vectors accept some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).
139
330
 
331
+ This is an element-wise comparison that returns a boolean Vector of the same size.
332
+
333
+ ![unary element-wise](doc/image/vector/unary_element_wise.png)
334
+
335
+ ```ruby
336
+ penguins[:bill_length_mm] < 40
337
+
338
+ # =>
339
+ #<RedAmber::Vector(:boolean, size=344):0x000000000007e7ac>
340
+ [true, true, false, nil, true, true, true, true, true, false, true, true, false, ... ]
341
+ ```
342
+
343
+ The next example returns an aggregated result.
344
+
345
+ ![unary aggregation](doc/image/vector/unary_aggregation.png)
346
+
347
+ ```ruby
348
+ penguins[:bill_length_mm].mean
349
+
350
+ # =>
351
+ 43.92192982456141
352
+ ```
353
+
140
354
  See [Vector.md](doc/Vector.md) for details.
141
355
 
142
- ## TDR concept
356
+ ## Jupyter notebook
143
357
 
144
- I named the data frame representation style in the model above as TDR (Transposed DataFrame Representation). See [TDR.md](doc/tdr.md) for details.
358
+ [53 Examples of Red Amber](doc/examples_of_red_amber.ipynb)
145
359
 
146
360
  ## Development
147
361
 
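The README diff above shows that negating a `uint8` column turns 1, 2, 3 into 255, 254, 253. The following is a minimal plain-Ruby sketch of that wraparound, assuming the behavior is ordinary modulo-256 arithmetic; it does not call RedAmber or Arrow, and `UINT8_MODULUS` and `negate_uint8` are hypothetical names used only for illustration.

```ruby
# Plain-Ruby illustration of unsigned 8-bit negation (wraps modulo 256).
# UINT8_MODULUS and negate_uint8 are hypothetical helpers, not RedAmber API.
UINT8_MODULUS = 256

def negate_uint8(values)
  # nil entries stay nil, mirroring how missing values pass through.
  values.map { |v| v.nil? ? nil : (-v) % UINT8_MODULUS }
end

result = negate_uint8([0, 1, 2, 3, nil])
p result # => [0, 255, 254, 253, nil]
```

The result matches the `integer` column in the README output, which suggests that a signed or wider integer type is the safer choice when true negative values are wanted.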
data/Rakefile CHANGED
@@ -7,6 +7,7 @@ Rake::TestTask.new(:test) do |t|
7
7
  t.libs << 'test'
8
8
  t.libs << 'lib'
9
9
  t.test_files = FileList['test/**/test_*.rb']
10
+ t.warning = false
10
11
  end
11
12
 
12
13
  require 'rubocop/rake_task'
data/benchmark/csv_load_penguins.yml CHANGED
@@ -1,11 +1,11 @@
1
1
  prelude: |
2
- require 'datasets-arrow'
3
2
  require 'rover'
4
3
  require 'red_amber'
5
4
 
6
5
  penguins_csv = 'benchmark/cache/penguins.csv'
7
6
 
8
7
  unless File.exist?(penguins_csv)
8
+ require 'datasets-arrow'
9
9
  arrow = Datasets::Penguins.new.to_arrow
10
10
  RedAmber::DataFrame.new(arrow).save(penguins_csv)
11
11
  end
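The benchmark prelude change above moves `require 'datasets-arrow'` inside the `unless File.exist?` guard, so the dataset library is loaded only when the cached file has to be created. Below is a minimal plain-Ruby sketch of the same pattern. It assumes a temporary JSON file as a stand-in cache and `json` as a stand-in for the heavier dependency; both are illustrative choices, not part of the package.

```ruby
require 'tmpdir'

# Hypothetical cache location; the benchmark itself uses 'benchmark/cache/penguins.csv'.
cache_path = File.join(Dir.mktmpdir, 'penguins_sample.json')

builds = 0
2.times do
  next if File.exist?(cache_path)

  # Loaded only when the cache is missing ('json' stands in for 'datasets-arrow').
  require 'json'
  File.write(cache_path, JSON.generate([{ 'species' => 'Adelie', 'body_mass_g' => 3750 }]))
  builds += 1
end

puts "cache built #{builds} time(s)"
```

On the second pass the file already exists, so both the build step and the `require` are skipped.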