red_amber 0.2.2 → 0.3.0
- checksums.yaml +4 -4
- data/.rubocop.yml +114 -39
- data/CHANGELOG.md +203 -31
- data/Gemfile +5 -2
- data/README.md +62 -29
- data/benchmark/basic.yml +86 -0
- data/benchmark/combine.yml +62 -0
- data/benchmark/dataframe.yml +62 -0
- data/benchmark/drop_nil.yml +15 -3
- data/benchmark/group.yml +39 -0
- data/benchmark/reshape.yml +31 -0
- data/benchmark/{csv_load_penguins.yml → rover/csv_load_penguins.yml} +3 -3
- data/benchmark/rover/flights.yml +23 -0
- data/benchmark/rover/penguins.yml +23 -0
- data/benchmark/rover/planes.yml +23 -0
- data/benchmark/rover/weather.yml +23 -0
- data/benchmark/vector.yml +60 -0
- data/doc/DataFrame.md +335 -53
- data/doc/Vector.md +91 -0
- data/doc/image/dataframe/join.png +0 -0
- data/doc/image/dataframe/set_and_bind.png +0 -0
- data/doc/image/dataframe_model.png +0 -0
- data/lib/red_amber/data_frame.rb +167 -51
- data/lib/red_amber/data_frame_combinable.rb +486 -0
- data/lib/red_amber/data_frame_displayable.rb +6 -4
- data/lib/red_amber/data_frame_indexable.rb +2 -2
- data/lib/red_amber/data_frame_loadsave.rb +4 -1
- data/lib/red_amber/data_frame_reshaping.rb +35 -10
- data/lib/red_amber/data_frame_selectable.rb +221 -116
- data/lib/red_amber/data_frame_variable_operation.rb +146 -82
- data/lib/red_amber/group.rb +108 -18
- data/lib/red_amber/helper.rb +53 -43
- data/lib/red_amber/refinements.rb +199 -0
- data/lib/red_amber/vector.rb +56 -46
- data/lib/red_amber/vector_functions.rb +23 -83
- data/lib/red_amber/vector_selectable.rb +116 -69
- data/lib/red_amber/vector_updatable.rb +189 -65
- data/lib/red_amber/version.rb +1 -1
- data/lib/red_amber.rb +3 -0
- data/red_amber.gemspec +4 -3
- metadata +24 -10
data/README.md
CHANGED
@@ -1,28 +1,31 @@
 # RedAmber
 
 [![Gem Version](https://badge.fury.io/rb/red_amber.svg)](https://badge.fury.io/rb/red_amber)
-[![
+[![CI](https://github.com/heronshoes/red_amber/actions/workflows/ci.yml/badge.svg)](https://github.com/heronshoes/red_amber/actions/workflows/ci.yml)
+[![Maintainability](https://api.codeclimate.com/v1/badges/b8a745047045d2f49daa/maintainability)](https://codeclimate.com/github/heronshoes/red_amber/maintainability)
+[![Test coverage](https://api.codeclimate.com/v1/badges/b8a745047045d2f49daa/test_coverage)](https://codeclimate.com/github/heronshoes/red_amber/test_coverage)
+[![Doc](https://img.shields.io/badge/docs-latest-blue)](https://heronshoes.github.io/red_amber/)
 [![Discussions](https://img.shields.io/github/discussions/heronshoes/red_amber)](https://github.com/heronshoes/red_amber/discussions)
 
 A simple dataframe library for Ruby.
 
-- Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
+- Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
+[![Gitter Chat](https://badges.gitter.im/red-data-tools/en.svg)](https://gitter.im/red-data-tools/en)
 - Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
 
-![screenshot from jupyterlab](doc/image/screenshot.png)
+![screenshot from jupyterlab](https://raw.githubusercontent.com/heronshoes/red_amber/main/doc/image/screenshot.png)
 
 ## Requirements
 
-Supported Ruby version is >=
+Supported Ruby version is >= 3.0 (since RedAmber 0.3.0).
 
-
-I recommend Ruby 3 for performance.
+- I decided to remove Ruby 2.7 without waiting for EOL because it cannot solve the problem of simultaneous use of Hash and keyword arguments when implementing DataFrame#join.
 
 ```ruby
 # Libraries required
-gem 'red-arrow', '
+gem 'red-arrow', '~> 10.0.0' # Requires Apache Arrow (see installation below)
 
-gem 'red-parquet', '
+gem 'red-parquet', '~> 10.0.0' # Optional, if you use IO from/to parquet
 gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
 ```
 
@@ -30,37 +33,61 @@ gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
 
 Install requirements before you install Red Amber.
 
-- Apache Arrow
-
-- Apache Parquet GLib (
+- Apache Arrow (~> 10.0.0)
+- Apache Arrow GLib (~> 10.0.0)
+- Apache Parquet GLib (~> 10.0.0) # If you use IO from/to parquet
 
 See [Apache Arrow install document](https://arrow.apache.org/install/).
 
-Minimum installation example for the latest Ubuntu
+- Minimum installation example for the latest Ubuntu:
 
-
+```
+sudo apt update
+sudo apt install -y -V ca-certificates lsb-release wget
+wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+sudo apt update
+sudo apt install -y -V libarrow-dev
+sudo apt install -y -V libarrow-glib-dev
+```
 
-
-gem 'red_amber'
-```
+- On Fedora 38 (Rawhide):
 
-
+```
+sudo dnf update
+sudo dnf -y install gcc-c++ libarrow-devel libarrow-glib-devel ruby-devel
+```
 
-
-bundle install
-```
+- On macOS, you can install Apache Arrow C++ library using Homebrew:
 
-
+```
+brew install apache-arrow
+```
 
-
-
+and GLib (C) package with:
+
+```
+brew install apache-arrow-glib
+```
+
+If you prepared Apache Arrow, add these lines to your Gemfile:
+
+```ruby
+gem 'red-arrow', '~> 10.0.0'
+gem 'red_amber'
+gem 'red-parquet', '~> 10.0.0' # Optional, if you use IO from/to parquet
+gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
+gem 'red-datasets-arrow' # Optional, recommended if you use Red Datasets
+gem 'red-arrow-numo-narray' # Optional, recommended if you use inputs from Numo::NArray
 ```
 
+And then execute `bundle install` or install it yourself as `gem install red_amber`.
+
 ## Docker image and Jupyter Notebook
 
 [RubyData Docker Stacks](https://github.com/RubyData/docker-stacks) is available as a ready-to-run Docker image containing Jupyter and useful data tools as well as RedAmber (Thanks to @mrkn).
 
-Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=
+Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb).
 [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb)
 
 
@@ -69,9 +96,9 @@ Also you can try the contents of this README interactively by [Binder](https://m
 Class `RedAmber::DataFrame` represents a set of data in 2D-shape.
 The entity is a Red Arrow's Table object.
 
-![dataframe model of RedAmber](doc/image/dataframe_model.png)
+![dataframe model of RedAmber](https://raw.githubusercontent.com/heronshoes/red_amber/main/doc/image/dataframe_model.png)
 
-
+Let's load the library and try some examples.
 
 ```ruby
 require 'red_amber' # require 'red-amber' is also OK.
@@ -80,6 +107,11 @@ include RedAmber
 
 ### Example: diamonds dataset
 
+First do (if you do not installed) `
+gem install red-datasets-arrow
+`
+then
+
 ```ruby
 require 'datasets-arrow' # to load sample data
 
@@ -101,7 +133,7 @@ diamonds = DataFrame.new(dataset) # from v0.2.2, should be `dataset.to_arrow` if
 53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 ... 3.64
 ```
 
-For example, we can compute mean prices per
+For example, we can compute mean prices per cut for the data larger than 1 carat.
 
 ```ruby
 df = diamonds
@@ -125,7 +157,7 @@ Arrow data is immutable, so these methods always return new objects.
 Next example will rename a column and create a new column by simple calcuration.
 
 ```ruby
-usdjpy = 110.0
+usdjpy = 110.0 # when the yen was stronger
 
 df.rename('mean(price)': :mean_price_USD)
   .assign(:mean_price_JPY) { mean_price_USD * usdjpy }
@@ -181,7 +213,8 @@ See [Vector.md](doc/Vector.md) for details.
 
 ## Jupyter notebook
 
-[
+[89 Examples of Red Amber](https://github.com/heronshoes/docker-stacks/blob/RedAmber-binder/binder/examples_of_red_amber.ipynb)
+([raw file](https://raw.githubusercontent.com/heronshoes/docker-stacks/RedAmber-binder/binder/examples_of_red_amber.ipynb)) shows more examples in jupyter notebook.
 
 You can try this notebook on [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb).
 [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb)
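The README change above drops Ruby 2.7 because `DataFrame#join` needs to accept a Hash and keyword arguments at the same time. The following is a minimal sketch of the Ruby 3 behavior this relies on, where a braced Hash stays positional and is not converted to keywords. The method `join_sketch` and its parameters are hypothetical and are not RedAmber's actual signature.

```ruby
# Hypothetical method for illustration only; NOT RedAmber's DataFrame#join.
# It takes one optional positional argument (which may be a Hash) plus keywords.
def join_sketch(other, join_keys = nil, type: :inner, suffix: '.1')
  { other: other, join_keys: join_keys, type: type, suffix: suffix }
end

# In Ruby 3, a braced Hash is always a positional argument.
HASH_ONLY = join_sketch(:right_table, { left: :id, right: :key })

# A positional Hash and keyword arguments can be passed together.
HASH_AND_KEYWORDS = join_sketch(:right_table, { left: :id, right: :key }, type: :left_outer)
```

With this separation, the Hash of join keys can never be mistaken for the `type:` or `suffix:` options.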
data/benchmark/basic.yml
ADDED
@@ -0,0 +1,86 @@
+loop_count: 3
+
+contexts:
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+  - name: 0.2.3
+    gems:
+      red_amber: 0.2.3
+  - name: 0.2.0
+    gems:
+      red_amber: 0.2.0
+  - name: 0.1.5
+    gems:
+      red_amber: 0.1.5
+
+prelude: |
+  require 'red_amber'
+  require 'datasets-arrow'
+
+  ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+  df = RedAmber::DataFrame.new(ds.to_arrow)
+
+  slicer = df[:distance] > 1000
+  distance_km = df[:distance] * 1.852
+
+benchmark:
+  'B01: Pick([]) by a key name': |
+    df[:flight]
+
+  'B02a: Pick([]) by key names': |
+    df[:carrier, :flight]
+
+  'B03: Pick by key names': |
+    df.pick(:carrier, :flight)
+
+  'B04: Drop by key names': |
+    df.drop(:year, :month, :day)
+
+  'B05: Pick by booleans': |
+    df.pick(df.vectors.map(&:string?))
+
+  'B06: Pick by a block': |
+    df.pick { keys.map { |key| key.end_with?('time') } }
+
+  'B07: Slice([]) by a index': |
+    df[877]
+
+  'B08: Slice by indeces': |
+    df.slice(0...5, -5..-1)
+
+  'B09: Slice([]) by booleans': |
+    df[slicer]
+
+  'B10: Slice by booleans': |
+    df.slice(slicer)
+
+  'B11: Remove by booleans': |
+    df.remove(slicer)
+
+  'B12: Slice by a block': |
+    df.slice { slicer }
+
+  'B13: Rename by Hash': |
+    df.rename(distance: :distance_mile)
+
+  'B14: Assign an existing variable': |
+    df.assign(distance: distance_km)
+
+  'B15: Assign a new variable': |
+    df.assign(distance_km: distance_km)
+
+  'B16: Sort by a key': |
+    df.sort(:distance)
+
+  'B17: Sort by keys': |
+    df.sort(:origin, '-distance')
+
+  'B18: Convert to a Hash': |
+    df.to_h
+
+  'B19: Output in TDR style': |
+    df.tdr
+
+  'B20: Inspect': |
+    df.inspect
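The benchmark entries above distinguish picking columns, slicing rows, and removing rows by booleans. As a rough sketch of those semantics only, the plain-Ruby code below works on a Hash of Arrays. It is not RedAmber code, the helper names are hypothetical, and the table values are made up.

```ruby
# Plain-Ruby stand-in for a column-oriented table; toy values, not real data.
TABLE = {
  carrier: %w[UA AA B6 DL],
  flight: [1545, 1714, 725, 461],
  distance: [1400, 1416, 1576, 762]
}.freeze

# Pick: keep the columns (keys) whose boolean is true.
def pick_by_booleans(table, booleans)
  keys = table.keys.zip(booleans).select { |_key, keep| keep }.map(&:first)
  table.slice(*keys)
end

# Slice: keep the rows whose boolean is true.
def slice_by_booleans(table, booleans)
  table.transform_values do |column|
    column.zip(booleans).select { |_value, keep| keep }.map(&:first)
  end
end

# Remove: the complement of slice.
def remove_by_booleans(table, booleans)
  slice_by_booleans(table, booleans.map(&:!))
end

# Row mask comparable in spirit to `df[:distance] > 1000` in the prelude.
SLICER = TABLE[:distance].map { |distance| distance > 1000 }
```

Pick takes one boolean per column, while slice and remove take one boolean per row.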
data/benchmark/combine.yml
ADDED
@@ -0,0 +1,62 @@
+loop_count: 3
+
+contexts:
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+  - name: 0.2.3
+    gems:
+      red_amber: 0.2.3
+
+prelude: |
+  require 'red_amber'
+  include RedAmber
+  require 'datasets-arrow'
+
+  package = 'nycflights13'
+  airlines = DataFrame.new(Datasets::Rdatasets.new(package, 'airlines'))
+  airports = DataFrame.new(Datasets::Rdatasets.new(package, 'airports'))
+  flights = DataFrame.new(Datasets::Rdatasets.new(package, 'flights'))
+    .pick(%i[month day carrier flight tailnum origin dest air_time distance])
+  planes = DataFrame.new(Datasets::Rdatasets.new(package, 'planes'))
+  weather = DataFrame.new(Datasets::Rdatasets.new(package, 'weather'))
+
+  flights_Q1 = flights.slice { month <= 3 }
+  flights_Q2 = flights.slice { month > 3 }
+
+  flights_1_2 = flights_Q1.slice { month.is_in(1, 2) }
+  flights_1_3 = flights_Q1.slice { month.is_in(1, 3) }
+
+  flights_left = flights_Q1.pick(...5)
+  flights_right = flights_Q1.pick(5..)
+
+benchmark:
+  'C01: Inner join on flights_Q1 by carrier': |
+    flights_Q1.inner_join(airlines, :carrier)
+
+  'C02: Full join on flights_Q1 by planes': |
+    flights_Q1.full_join(planes, :tailnum)
+
+  'C03: Left join on flights_Q1 by planes': |
+    flights_Q1.left_join(planes, :tailnum)
+
+  'C04: Semi join on flights_Q1 by planes': |
+    flights_Q1.semi_join(planes, :tailnum)
+
+  'C05: Anti join on flights_Q1 by planes': |
+    flights_Q1.anti_join(planes, :tailnum)
+
+  'C06: Intersection of flights_1_2 and flights_1_3': |
+    flights_1_2.intersect(flights_1_3)
+
+  'C07: Union of flights_1_2 and flights_1_3': |
+    flights_1_2.union(flights_1_3)
+
+  'C08: Difference between flights_1_2 and flights_1_3': |
+    flights_1_2.difference(flights_1_3)
+
+  'C09: Concatenate flight_Q1 on flight_Q2': |
+    flights_Q1.concatenate(flights_Q2)
+
+  'C10: Merge flights_Q1_right on flights_Q1_left': |
+    flights_left.merge(flights_right)
|
|
1
|
+
loop_count: 3
|
2
|
+
|
3
|
+
contexts:
|
4
|
+
- name: HEAD
|
5
|
+
prelude: |
|
6
|
+
$LOAD_PATH.unshift(File.expand_path('lib'))
|
7
|
+
- name: 0.2.3
|
8
|
+
gems:
|
9
|
+
red_amber: 0.2.3
|
10
|
+
- name: 0.2.0
|
11
|
+
gems:
|
12
|
+
red_amber: 0.2.0
|
13
|
+
|
14
|
+
prelude: |
|
15
|
+
require 'red_amber'
|
16
|
+
require 'datasets-arrow'
|
17
|
+
|
18
|
+
diamonds = RedAmber::DataFrame.new(Datasets::Diamonds.new.to_arrow)
|
19
|
+
|
20
|
+
starwars = RedAmber::DataFrame.new(Datasets::Rdataset.new('dplyr', 'starwars').to_arrow)
|
21
|
+
|
22
|
+
uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv")
|
23
|
+
import_cars = RedAmber::DataFrame.load(uri)
|
24
|
+
|
25
|
+
ds = Datasets::Rdataset.new('openintro', 'simpsons_paradox_covid')
|
26
|
+
simpsons_paradox_covid = RedAmber::DataFrame.new(ds.to_arrow)
|
27
|
+
|
28
|
+
benchmark:
|
29
|
+
'D01: Diamonds test': |
|
30
|
+
diamonds
|
31
|
+
.slice { v(:carat) > 1 }
|
32
|
+
.pick(:cut, :price)
|
33
|
+
.group(:cut)
|
34
|
+
.mean
|
35
|
+
.sort('-mean(price)')
|
36
|
+
.rename('mean(price)': :mean_price_USD)
|
37
|
+
.assign { [:mean_price_JPY, v(:mean_price_USD) * 110.0] }
|
38
|
+
|
39
|
+
'D02: Starwars test': |
|
40
|
+
starwars
|
41
|
+
.drop { keys.select { |key| key.end_with?('color') } }
|
42
|
+
.remove { v(:species) == 'NA' }
|
43
|
+
.group(:species) { [count(:species), mean(:height, :mass)] }
|
44
|
+
.slice { v(:count) > 1 }
|
45
|
+
|
46
|
+
'D03: Inport cars test': |
|
47
|
+
import_cars
|
48
|
+
.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
|
49
|
+
.to_wide(name: :Manufacturer, value: :Num_of_imported)
|
50
|
+
.transpose
|
51
|
+
|
52
|
+
'D04: Simpsons paradox test': |
|
53
|
+
simpsons_paradox_covid[simpsons_paradox_covid[:age_group] == 'under 50']
|
54
|
+
.group(:vaccine_status, :outcome)
|
55
|
+
.count
|
56
|
+
.then { |df| df.to_wide(name: :vaccine_status, value: df.keys[-1]) }
|
57
|
+
.assign do
|
58
|
+
[
|
59
|
+
[:'vaccinated_%', (100.0 * v(:vaccinated) / v(:vaccinated).sum)],
|
60
|
+
[:'unvaccinated_%', (100.0 * v(:unvaccinated) / v(:unvaccinated).sum)]
|
61
|
+
]
|
62
|
+
end
|
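The 'D01: Diamonds test' pipeline above filters rows, groups by cut, takes the mean price, sorts in descending order, and adds a converted price column. The same steps can be sketched in plain Ruby as follows. The rows are made-up toy data and the method name is hypothetical; the 110.0 rate is the one used in the benchmark.

```ruby
# Toy rows; not the real diamonds dataset.
DIAMONDS = [
  { carat: 1.2, cut: 'Ideal', price: 6000 },
  { carat: 0.5, cut: 'Ideal', price: 1500 },
  { carat: 1.5, cut: 'Fair', price: 7000 },
  { carat: 1.1, cut: 'Fair', price: 5000 },
  { carat: 2.0, cut: 'Ideal', price: 10_000 }
].freeze

USDJPY = 110.0 # same rate as in the benchmark

def mean_price_by_cut(rows)
  rows.select { |row| row[:carat] > 1 }  # slice { v(:carat) > 1 }
      .group_by { |row| row[:cut] }      # group(:cut)
      .map do |cut, group|               # mean
        { cut: cut, mean_price_usd: group.sum { |r| r[:price] }.fdiv(group.size) }
      end
      .sort_by { |row| -row[:mean_price_usd] } # sort('-mean(price)')
      .map { |row| row.merge(mean_price_jpy: row[:mean_price_usd] * USDJPY) } # assign
end
```

Every step returns a new collection and leaves `DIAMONDS` untouched, which mirrors the README remark that these methods always return new objects.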
data/benchmark/drop_nil.yml
CHANGED
@@ -1,11 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.1.8
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+
 prelude: |
   require 'datasets-arrow'
   require 'red_amber'
 
   penguins = RedAmber::DataFrame.new(Datasets::Penguins.new.to_arrow)
 
-  def
-    penguins.remove { vectors.map
+  def remove_nil(penguins)
+    penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
   end
 
-benchmark:
+benchmark:
+  'Remove and reduce': remove_nil(penguins)
+  'remove_nil method': penguins.remove_nil
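The `remove_nil(penguins)` helper above builds one nil mask per column with `vectors.map(&:is_nil)` and combines them with `reduce(&:|)`, so a row is dropped if any of its values is nil. The plain-Ruby sketch below shows that idea on a Hash of Arrays. The helper names and values are hypothetical and this is not RedAmber code.

```ruby
# Toy columns standing in for DataFrame vectors; values are made up.
COLUMNS = {
  species: ['Adelie', 'Adelie', nil],
  body_mass_g: [3750, nil, 3250],
  year: [2007, 2007, 2007]
}.freeze

# One boolean mask per column, in the spirit of `vectors.map(&:is_nil)`.
def nil_masks(columns)
  columns.values.map { |column| column.map(&:nil?) }
end

# Element-wise OR across the masks, in the spirit of `.reduce(&:|)`.
def any_nil_mask(columns)
  nil_masks(columns).reduce { |acc, mask| acc.zip(mask).map { |a, b| a | b } }
end

# Drop every row whose combined mask is true.
def remove_nil_rows(columns)
  drop = any_nil_mask(columns)
  columns.transform_values { |column| column.reject.with_index { |_, i| drop[i] } }
end
```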
data/benchmark/group.yml
ADDED
@@ -0,0 +1,39 @@
+loop_count: 3
+
+contexts:
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+  - name: 0.2.3
+    gems:
+      red_amber: 0.2.3
+  - name: 0.2.2
+    gems:
+      red_amber: 0.2.2
+
+prelude: |
+  require 'red_amber'
+  require 'datasets-arrow'
+
+  ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+  df = RedAmber::DataFrame.new(ds.to_arrow)
+    .assign(:flight) { flight.map(&:to_s) }
+
+  slicer = df[:distance] > 1000
+  distance_km = df[:distance] * 1.852
+
+benchmark:
+  'G01: sum distance by destination': |
+    df.group(:dest).sum(:distance)
+
+  'G02: sum arr_delay by month and day': |
+    df.group(:month, :day).sum(:arr_delay)
+
+  'G03: sum arr_delay, mean distance by flight': |
+    df.group(:flight) { [sum(:arr_delay), mean(:distance)] }
+
+  'G04: mean air_time, distance by flight': |
+    df.group(:flight).mean(:air_time, :distance)
+
+  'G05: sum dep_delay, arr_delay by carrer': |
+    df.group(:carrier).sum(:dep_delay, :arr_delay)
data/benchmark/reshape.yml
ADDED
@@ -0,0 +1,31 @@
+loop_count: 3
+
+contexts:
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+  - name: 0.2.3
+    gems:
+      red_amber: 0.2.3
+  - name: 0.2.2
+    gems:
+      red_amber: 0.2.2
+
+prelude: |
+  require 'red_amber'
+  require 'datasets-arrow'
+
+  ds = Datasets::Rdatasets.new('tidyr', 'billboard')
+  df = RedAmber::DataFrame.new(ds.to_arrow)
+  sub_df = df.pick(:track, df.keys.select{ |k| k.start_with? 'wk' })
+  long_df = df.to_long(:artist, :track, :'date.entered', name: :week, value: :rank)
+
+benchmark:
+  'R01: Transpose a DataFrame': |
+    sub_df.transpose(name: :week)
+
+  'R02: Reshape to longer DataFrame': |
+    df.to_long(:artist, :track, :'date.entered', name: :week, value: :rank)
+
+  'R03: Reshape to wider DataFrame': |
+    long_df.to_wide(name: :week, value: :rank)
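The R02 and R03 benchmarks above reshape a table between wide and long layouts. As a rough illustration of what that means, the plain-Ruby sketch below converts an Array of Hashes in both directions. The helper names and rows are hypothetical, and RedAmber's own `to_long` and `to_wide` may differ in details such as ordering and nil handling.

```ruby
# Toy wide-format rows; values are made up.
WIDE = [
  { track: 'A', wk1: 87, wk2: 82 },
  { track: 'B', wk1: 91, wk2: nil }
].freeze

# Wide -> long: one output row per kept key set and former column name.
def long_rows(rows, *keep, name:, value:)
  rows.flat_map do |row|
    (row.keys - keep).map do |column|
      row.slice(*keep).merge(name => column, value => row[column])
    end
  end
end

# Long -> wide: spread the `name` column back into one column per distinct name.
def wide_rows(rows, name:, value:)
  rows.group_by { |row| row.reject { |key, _| [name, value].include?(key) } }
      .map { |kept, group| group.reduce(kept) { |acc, row| acc.merge(row[name] => row[value]) } }
end

LONG = long_rows(WIDE, :track, name: :week, value: :rank)
```

In this sketch the two helpers are inverses of each other, so `wide_rows(LONG, name: :week, value: :rank)` gives back `WIDE`.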
data/benchmark/{csv_load_penguins.yml → rover/csv_load_penguins.yml}
@@ -2,12 +2,12 @@ prelude: |
   require 'rover'
   require 'red_amber'
 
-  penguins_csv = '
+  penguins_csv = 'tmp/penguins.csv'
 
   unless File.exist?(penguins_csv)
     require 'datasets-arrow'
-
-    RedAmber::DataFrame.new(
+    ds = Datasets::Penguins.new
+    RedAmber::DataFrame.new(ds).save(penguins_csv)
   end
 
 benchmark:
data/benchmark/rover/flights.yml
ADDED
@@ -0,0 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+
+prelude: |
+  require 'rover'
+  require 'datasets-arrow'
+  ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+  df = RedAmber::DataFrame.new(ds)
+  rover = Rover::DataFrame.new(df.to_h)
+  group_keys = [:month, :origin]
+  summary_key = :air_time
+
+benchmark:
+  'penguins Group by Rover': |
+    rover.group(group_keys).count
+
+  'penguins Group by RedAmber': |
+    df.group(group_keys).count
data/benchmark/rover/penguins.yml
ADDED
@@ -0,0 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+
+prelude: |
+  require 'rover'
+  require 'datasets-arrow'
+  ds = Datasets::Penguins.new
+  df = RedAmber::DataFrame.new(ds)
+  rover = Rover::DataFrame.new(df.to_h)
+  group_keys = [:species, :island]
+  summary_key = :body_mass_g
+
+benchmark:
+  'penguins Group by Rover': |
+    rover.group(group_keys).mean(summary_key)
+
+  'penguins Group by RedAmber': |
+    df.group(group_keys).mean(summary_key)
data/benchmark/rover/planes.yml
ADDED
@@ -0,0 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+
+prelude: |
+  require 'rover'
+  require 'datasets-arrow'
+  ds = Datasets::Rdatasets.new('nycflights13', 'planes')
+  df = RedAmber::DataFrame.new(ds)
+  rover = Rover::DataFrame.new(df.to_h)
+  group_keys = [:engines, :engine]
+  summary_key = :seats
+
+benchmark:
+  'penguins Group by Rover': |
+    rover.group(group_keys).mean(summary_key)
+
+  'penguins Group by RedAmber': |
+    df.group(group_keys).mean(summary_key)
data/benchmark/rover/weather.yml
ADDED
@@ -0,0 +1,23 @@
+contexts:
+  - gems:
+      red_amber: 0.2.2
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+      require 'red_amber'
+
+prelude: |
+  require 'rover'
+  require 'datasets-arrow'
+  ds = Datasets::Rdatasets.new('nycflights13', 'weather')
+  df = RedAmber::DataFrame.new(ds)
+  rover = Rover::DataFrame.new(df.to_h)
+  group_keys = [:month, :origin]
+  summary_key = :temp
+
+benchmark:
+  'penguins Group by Rover': |
+    rover.group(group_keys).mean(summary_key)
+
+  'penguins Group by RedAmber': |
+    df.group(group_keys).mean(summary_key)
data/benchmark/vector.yml
ADDED
@@ -0,0 +1,60 @@
+loop_count: 10
+
+contexts:
+  - name: HEAD
+    prelude: |
+      $LOAD_PATH.unshift(File.expand_path('lib'))
+  - name: 0.2.0
+    gems:
+      red_amber: 0.2.0
+
+prelude: |
+  require 'red_amber'
+  include RedAmber
+  require 'datasets-arrow'
+
+  ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+  flights = RedAmber::DataFrame.new(ds.to_arrow)
+  df = flights.slice { flights[:month] <= 6 }
+
+  tailnum_vector = df[:tailnum]
+  distance_vector = df[:distance]
+
+  strings = tailnum_vector.to_a
+  arrow_array = tailnum_vector.data
+  integers = df[:dep_delay].to_a
+  boolean_vector = df[:air_time].is_nil
+  index_vector = Vector.new(0...boolean_vector.size).filter(boolean_vector)
+  replacer = index_vector.data.map(&:to_s)
+  booleans = boolean_vector.to_a
+
+benchmark:
+  'V01: Vector.new from integer Array': |
+    Vector.new(integers)
+
+  'V02: Vector.new from string Array': |
+    Vector.new(strings)
+
+  'V03: Vector.new from boolean Vector': |
+    Vector.new(boolean_vector)
+
+  'V04: Vector#sum': |
+    distance_vector.mean
+
+  'V05: Vector#*': |
+    distance_vector * 1.852
+
+  'V06: Vector#[booleans]': |
+    tailnum_vector[booleans]
+
+  'V07: Vector#[boolean_vector]': |
+    tailnum_vector[boolean_vector]
+
+  'V08: Vector#[index_vector]': |
+    tailnum_vector[index_vector]
+
+  'V09: Vector#replace': |
+    tailnum_vector.replace(booleans, replacer)
+
+  'V10: Vector#replace with broad casting': |
+    tailnum_vector.replace(booleans, 'x')
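Benchmarks V09 and V10 above replace the selected elements of a vector either with an Array of replacements or with a single broadcast value. The plain-Ruby sketch below illustrates that distinction. The method `replace_sketch` is hypothetical, the values are made up, and RedAmber's `Vector#replace` may behave differently in edge cases.

```ruby
# Replace elements where the mask is true. A scalar replacer is broadcast to
# every selected position; an Array replacer is consumed in order.
def replace_sketch(values, booleans, replacer)
  replacements = replacer.is_a?(Array) ? replacer.each : nil
  values.zip(booleans).map do |value, selected|
    next value unless selected

    replacements ? replacements.next : replacer
  end
end

# Toy data: nil marks the positions to replace.
TAILNUMS = ['N1', nil, 'N3', nil].freeze
MASK = TAILNUMS.map(&:nil?)
# Indices of the selected positions as strings, like `replacer` in the prelude.
INDEX_STRINGS = MASK.each_index.select { |i| MASK[i] }.map(&:to_s)

REPLACED_BY_ARRAY = replace_sketch(TAILNUMS, MASK, INDEX_STRINGS)
REPLACED_BY_SCALAR = replace_sketch(TAILNUMS, MASK, 'x')
```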