red_amber 0.2.2 → 0.3.0

Files changed (41)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +114 -39
  3. data/CHANGELOG.md +203 -31
  4. data/Gemfile +5 -2
  5. data/README.md +62 -29
  6. data/benchmark/basic.yml +86 -0
  7. data/benchmark/combine.yml +62 -0
  8. data/benchmark/dataframe.yml +62 -0
  9. data/benchmark/drop_nil.yml +15 -3
  10. data/benchmark/group.yml +39 -0
  11. data/benchmark/reshape.yml +31 -0
  12. data/benchmark/{csv_load_penguins.yml → rover/csv_load_penguins.yml} +3 -3
  13. data/benchmark/rover/flights.yml +23 -0
  14. data/benchmark/rover/penguins.yml +23 -0
  15. data/benchmark/rover/planes.yml +23 -0
  16. data/benchmark/rover/weather.yml +23 -0
  17. data/benchmark/vector.yml +60 -0
  18. data/doc/DataFrame.md +335 -53
  19. data/doc/Vector.md +91 -0
  20. data/doc/image/dataframe/join.png +0 -0
  21. data/doc/image/dataframe/set_and_bind.png +0 -0
  22. data/doc/image/dataframe_model.png +0 -0
  23. data/lib/red_amber/data_frame.rb +167 -51
  24. data/lib/red_amber/data_frame_combinable.rb +486 -0
  25. data/lib/red_amber/data_frame_displayable.rb +6 -4
  26. data/lib/red_amber/data_frame_indexable.rb +2 -2
  27. data/lib/red_amber/data_frame_loadsave.rb +4 -1
  28. data/lib/red_amber/data_frame_reshaping.rb +35 -10
  29. data/lib/red_amber/data_frame_selectable.rb +221 -116
  30. data/lib/red_amber/data_frame_variable_operation.rb +146 -82
  31. data/lib/red_amber/group.rb +108 -18
  32. data/lib/red_amber/helper.rb +53 -43
  33. data/lib/red_amber/refinements.rb +199 -0
  34. data/lib/red_amber/vector.rb +56 -46
  35. data/lib/red_amber/vector_functions.rb +23 -83
  36. data/lib/red_amber/vector_selectable.rb +116 -69
  37. data/lib/red_amber/vector_updatable.rb +189 -65
  38. data/lib/red_amber/version.rb +1 -1
  39. data/lib/red_amber.rb +3 -0
  40. data/red_amber.gemspec +4 -3
  41. metadata +24 -10
data/README.md CHANGED
@@ -1,28 +1,31 @@
  # RedAmber
 
  [![Gem Version](https://badge.fury.io/rb/red_amber.svg)](https://badge.fury.io/rb/red_amber)
- [![Ruby](https://github.com/heronshoes/red_amber/actions/workflows/test.yml/badge.svg)](https://github.com/heronshoes/red_amber/actions/workflows/test.yml)
+ [![CI](https://github.com/heronshoes/red_amber/actions/workflows/ci.yml/badge.svg)](https://github.com/heronshoes/red_amber/actions/workflows/ci.yml)
+ [![Maintainability](https://api.codeclimate.com/v1/badges/b8a745047045d2f49daa/maintainability)](https://codeclimate.com/github/heronshoes/red_amber/maintainability)
+ [![Test coverage](https://api.codeclimate.com/v1/badges/b8a745047045d2f49daa/test_coverage)](https://codeclimate.com/github/heronshoes/red_amber/test_coverage)
+ [![Doc](https://img.shields.io/badge/docs-latest-blue)](https://heronshoes.github.io/red_amber/)
  [![Discussions](https://img.shields.io/github/discussions/heronshoes/red_amber)](https://github.com/heronshoes/red_amber/discussions)
 
  A simple dataframe library for Ruby.
 
- - Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) [![Gitter Chat](https://badges.gitter.im/red-data-tools/en.svg)](https://gitter.im/red-data-tools/en)
+ - Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
+ [![Gitter Chat](https://badges.gitter.im/red-data-tools/en.svg)](https://gitter.im/red-data-tools/en)
  - Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
 
- ![screenshot from jupyterlab](doc/image/screenshot.png)
+ ![screenshot from jupyterlab](https://raw.githubusercontent.com/heronshoes/red_amber/main/doc/image/screenshot.png)
 
  ## Requirements
 
- Supported Ruby version is >= 2.7.
+ Supported Ruby version is >= 3.0 (since RedAmber 0.3.0).
 
- Since v0.2.0, this library uses pattern matching which is an experimental feature in 2.7 . It is usable but a warning message will be shown in 2.7 .
- I recommend Ruby 3 for performance.
+ - I decided to remove Ruby 2.7 without waiting for EOL because it cannot solve the problem of simultaneous use of Hash and keyword arguments when implementing DataFrame#join.
 
  ```ruby
  # Libraries required
- gem 'red-arrow', '>= 9.0.0'
+ gem 'red-arrow', '~> 10.0.0' # Requires Apache Arrow (see installation below)
 
- gem 'red-parquet', '>= 9.0.0' # Optional, if you use IO from/to parquet
+ gem 'red-parquet', '~> 10.0.0' # Optional, if you use IO from/to parquet
  gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
  ```
 
@@ -30,37 +33,61 @@ gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
 
  Install requirements before you install Red Amber.
 
- - Apache Arrow GLib (>= 9.0.0)
-
- - Apache Parquet GLib (>= 9.0.0) # If you use IO from/to parquet
+ - Apache Arrow (~> 10.0.0)
+ - Apache Arrow GLib (~> 10.0.0)
+ - Apache Parquet GLib (~> 10.0.0) # If you use IO from/to parquet
 
  See [Apache Arrow install document](https://arrow.apache.org/install/).
 
- Minimum installation example for the latest Ubuntu is in the ['Prepare the Apache Arrow' section in ci test](https://github.com/heronshoes/red_amber/blob/master/.github/workflows/test.yml) of Red Amber.
+ - Minimum installation example for the latest Ubuntu:
 
- Add this line to your Gemfile:
+ ```
+ sudo apt update
+ sudo apt install -y -V ca-certificates lsb-release wget
+ wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+ sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+ sudo apt update
+ sudo apt install -y -V libarrow-dev
+ sudo apt install -y -V libarrow-glib-dev
+ ```
 
- ```ruby
- gem 'red_amber'
- ```
+ - On Fedora 38 (Rawhide):
 
- And then execute:
+ ```
+ sudo dnf update
+ sudo dnf -y install gcc-c++ libarrow-devel libarrow-glib-devel ruby-devel
+ ```
 
- ```shell
- bundle install
- ```
+ - On macOS, you can install Apache Arrow C++ library using Homebrew:
 
- Or install it yourself as:
+ ```
+ brew install apache-arrow
+ ```
 
- ```shell
- gem install red_amber
+ and GLib (C) package with:
+
+ ```
+ brew install apache-arrow-glib
+ ```
+
+ If you prepared Apache Arrow, add these lines to your Gemfile:
+
+ ```ruby
+ gem 'red-arrow', '~> 10.0.0'
+ gem 'red_amber'
+ gem 'red-parquet', '~> 10.0.0' # Optional, if you use IO from/to parquet
+ gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
+ gem 'red-datasets-arrow' # Optional, recommended if you use Red Datasets
+ gem 'red-arrow-numo-narray' # Optional, recommended if you use inputs from Numo::NArray
  ```
 
+ And then execute `bundle install` or install it yourself as `gem install red_amber`.
+
  ## Docker image and Jupyter Notebook
 
  [RubyData Docker Stacks](https://github.com/RubyData/docker-stacks) is available as a ready-to-run Docker image containing Jupyter and useful data tools as well as RedAmber (Thanks to @mrkn).
 
- Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=README.ipynb).
+ Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb).
  [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb)
 
 
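Note on the constraint change in the hunks above: the Gemfile lines move from `>= 9.0.0` to the pessimistic `~> 10.0.0`. A minimal sketch of what that means, using RubyGems' own `Gem::Requirement` and `Gem::Version`; the candidate version numbers are chosen here purely for illustration and are not taken from the diff.

```ruby
require 'rubygems' # normally already loaded; required here for clarity

old_requirement = Gem::Requirement.new('>= 9.0.0')
new_requirement = Gem::Requirement.new('~> 10.0.0')

# Illustrative candidates only: '~> 10.0.0' accepts 10.0.x and nothing else,
# while '>= 9.0.0' accepts every later release.
%w[9.0.0 10.0.0 10.0.1 10.1.0 11.0.0].each do |candidate|
  version = Gem::Version.new(candidate)
  puts "#{candidate}: old=#{old_requirement.satisfied_by?(version)} " \
       "new=#{new_requirement.satisfied_by?(version)}"
end
```

The pessimistic form pins the gem to the 10.0 series, which matches the Apache Arrow (~> 10.0.0) system requirement listed in the same hunk.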
@@ -69,9 +96,9 @@ Also you can try the contents of this README interactively by [Binder](https://m
  Class `RedAmber::DataFrame` represents a set of data in 2D-shape.
  The entity is a Red Arrow's Table object.
 
- ![dataframe model of RedAmber](doc/image/dataframe_model.png)
+ ![dataframe model of RedAmber](https://raw.githubusercontent.com/heronshoes/red_amber/main/doc/image/dataframe_model.png)
 
- Load the library.
+ Let's load the library and try some examples.
 
  ```ruby
  require 'red_amber' # require 'red-amber' is also OK.
@@ -80,6 +107,11 @@ include RedAmber
 
  ### Example: diamonds dataset
 
+ First do (if you do not installed) `
+ gem install red-datasets-arrow
+ `
+ then
+
  ```ruby
  require 'datasets-arrow' # to load sample data
 
@@ -101,7 +133,7 @@ diamonds = DataFrame.new(dataset) # from v0.2.2, should be `dataset.to_arrow` if
  53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 ... 3.64
  ```
 
- For example, we can compute mean prices per 'cut' for the data larger than 1 carat.
+ For example, we can compute mean prices per cut for the data larger than 1 carat.
 
  ```ruby
  df = diamonds
@@ -125,7 +157,7 @@ Arrow data is immutable, so these methods always return new objects.
  Next example will rename a column and create a new column by simple calcuration.
 
  ```ruby
- usdjpy = 110.0
+ usdjpy = 110.0 # when the yen was stronger
 
  df.rename('mean(price)': :mean_price_USD)
    .assign(:mean_price_JPY) { mean_price_USD * usdjpy }
@@ -181,7 +213,8 @@ See [Vector.md](doc/Vector.md) for details.
 
  ## Jupyter notebook
 
- [73 Examples of Red Amber](binder/examples_of_red_amber.ipynb) shows more examples in jupyter notebook.
+ [89 Examples of Red Amber](https://github.com/heronshoes/docker-stacks/blob/RedAmber-binder/binder/examples_of_red_amber.ipynb)
+ ([raw file](https://raw.githubusercontent.com/heronshoes/docker-stacks/RedAmber-binder/binder/examples_of_red_amber.ipynb)) shows more examples in jupyter notebook.
 
  You can try this notebook on [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb).
  [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb)
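Note on the Requirements change in the README diff above: Ruby 2.7 was dropped because of how Hash and keyword arguments interact in `DataFrame#join`. The sketch below only illustrates the Ruby 3 behaviour that change relies on; `join_sketch` and its parameters are hypothetical names invented for this note and are not RedAmber's actual signature. On Ruby 3.0 and later a braced Hash stays a positional argument, while bare `key: value` pairs bind to keyword parameters. Ruby 2.7 could still convert between the two, with a deprecation warning.

```ruby
# Hypothetical method for illustration only; not RedAmber's DataFrame#join.
# It accepts an optional positional argument (which may be a Hash)
# together with a keyword option.
def join_sketch(other, join_keys = nil, suffix: '.1')
  { other: other, join_keys: join_keys, suffix: suffix }
end

# Ruby 3.0+: the braced Hash is passed positionally as join_keys.
positional = join_sketch(:right_table, { left: :id, right: :key })

# Ruby 3.0+: bare key/value pairs are keyword arguments.
keywords = join_sketch(:right_table, suffix: '_r')

p positional
p keywords
```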
data/benchmark/basic.yml +86 -0
@@ -0,0 +1,86 @@
+ loop_count: 3
+
+ contexts:
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+   - name: 0.2.3
+     gems:
+       red_amber: 0.2.3
+   - name: 0.2.0
+     gems:
+       red_amber: 0.2.0
+   - name: 0.1.5
+     gems:
+       red_amber: 0.1.5
+
+ prelude: |
+   require 'red_amber'
+   require 'datasets-arrow'
+
+   ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+   df = RedAmber::DataFrame.new(ds.to_arrow)
+
+   slicer = df[:distance] > 1000
+   distance_km = df[:distance] * 1.852
+
+ benchmark:
+   'B01: Pick([]) by a key name': |
+     df[:flight]
+
+   'B02a: Pick([]) by key names': |
+     df[:carrier, :flight]
+
+   'B03: Pick by key names': |
+     df.pick(:carrier, :flight)
+
+   'B04: Drop by key names': |
+     df.drop(:year, :month, :day)
+
+   'B05: Pick by booleans': |
+     df.pick(df.vectors.map(&:string?))
+
+   'B06: Pick by a block': |
+     df.pick { keys.map { |key| key.end_with?('time') } }
+
+   'B07: Slice([]) by a index': |
+     df[877]
+
+   'B08: Slice by indeces': |
+     df.slice(0...5, -5..-1)
+
+   'B09: Slice([]) by booleans': |
+     df[slicer]
+
+   'B10: Slice by booleans': |
+     df.slice(slicer)
+
+   'B11: Remove by booleans': |
+     df.remove(slicer)
+
+   'B12: Slice by a block': |
+     df.slice { slicer }
+
+   'B13: Rename by Hash': |
+     df.rename(distance: :distance_mile)
+
+   'B14: Assign an existing variable': |
+     df.assign(distance: distance_km)
+
+   'B15: Assign a new variable': |
+     df.assign(distance_km: distance_km)
+
+   'B16: Sort by a key': |
+     df.sort(:distance)
+
+   'B17: Sort by keys': |
+     df.sort(:origin, '-distance')
+
+   'B18: Convert to a Hash': |
+     df.to_h
+
+   'B19: Output in TDR style': |
+     df.tdr
+
+   'B20: Inspect': |
+     df.inspect
data/benchmark/combine.yml +62 -0
@@ -0,0 +1,62 @@
+ loop_count: 3
+
+ contexts:
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+   - name: 0.2.3
+     gems:
+       red_amber: 0.2.3
+
+ prelude: |
+   require 'red_amber'
+   include RedAmber
+   require 'datasets-arrow'
+
+   package = 'nycflights13'
+   airlines = DataFrame.new(Datasets::Rdatasets.new(package, 'airlines'))
+   airports = DataFrame.new(Datasets::Rdatasets.new(package, 'airports'))
+   flights = DataFrame.new(Datasets::Rdatasets.new(package, 'flights'))
+     .pick(%i[month day carrier flight tailnum origin dest air_time distance])
+   planes = DataFrame.new(Datasets::Rdatasets.new(package, 'planes'))
+   weather = DataFrame.new(Datasets::Rdatasets.new(package, 'weather'))
+
+   flights_Q1 = flights.slice { month <= 3 }
+   flights_Q2 = flights.slice { month > 3 }
+
+   flights_1_2 = flights_Q1.slice { month.is_in(1, 2) }
+   flights_1_3 = flights_Q1.slice { month.is_in(1, 3) }
+
+   flights_left = flights_Q1.pick(...5)
+   flights_right = flights_Q1.pick(5..)
+
+ benchmark:
+   'C01: Inner join on flights_Q1 by carrier': |
+     flights_Q1.inner_join(airlines, :carrier)
+
+   'C02: Full join on flights_Q1 by planes': |
+     flights_Q1.full_join(planes, :tailnum)
+
+   'C03: Left join on flights_Q1 by planes': |
+     flights_Q1.left_join(planes, :tailnum)
+
+   'C04: Semi join on flights_Q1 by planes': |
+     flights_Q1.semi_join(planes, :tailnum)
+
+   'C05: Anti join on flights_Q1 by planes': |
+     flights_Q1.anti_join(planes, :tailnum)
+
+   'C06: Intersection of flights_1_2 and flights_1_3': |
+     flights_1_2.intersect(flights_1_3)
+
+   'C07: Union of flights_1_2 and flights_1_3': |
+     flights_1_2.union(flights_1_3)
+
+   'C08: Difference between flights_1_2 and flights_1_3': |
+     flights_1_2.difference(flights_1_3)
+
+   'C09: Concatenate flight_Q1 on flight_Q2': |
+     flights_Q1.concatenate(flights_Q2)
+
+   'C10: Merge flights_Q1_right on flights_Q1_left': |
+     flights_left.merge(flights_right)
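Note on the join benchmarks above: for readers unfamiliar with the join kinds being timed, here is a rough plain-Ruby sketch of inner, semi and anti joins on Arrays of Hashes. It does not use RedAmber at all; the tiny tables are made up for illustration, and the lookup assumes the right-hand key is unique.

```ruby
# Made-up stand-ins for the flights and airlines tables.
flights = [
  { carrier: 'AA', flight: 1 },
  { carrier: 'UA', flight: 2 },
  { carrier: 'ZZ', flight: 3 } # no matching airline
]
airlines = [
  { carrier: 'AA', name: 'American' },
  { carrier: 'UA', name: 'United' }
]

# Index the right-hand table by the join key (assumes unique keys).
by_carrier = airlines.to_h { |row| [row[:carrier], row] }

# Inner join: keep left rows that match and add the right-hand columns.
inner = flights.filter_map do |row|
  match = by_carrier[row[:carrier]]
  row.merge(match) if match
end

# Semi join: keep left rows that match, left columns only.
semi = flights.select { |row| by_carrier.key?(row[:carrier]) }

# Anti join: keep left rows that have no match.
anti = flights.reject { |row| by_carrier.key?(row[:carrier]) }

p inner, semi, anti
```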
data/benchmark/dataframe.yml +62 -0
@@ -0,0 +1,62 @@
+ loop_count: 3
+
+ contexts:
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+   - name: 0.2.3
+     gems:
+       red_amber: 0.2.3
+   - name: 0.2.0
+     gems:
+       red_amber: 0.2.0
+
+ prelude: |
+   require 'red_amber'
+   require 'datasets-arrow'
+
+   diamonds = RedAmber::DataFrame.new(Datasets::Diamonds.new.to_arrow)
+
+   starwars = RedAmber::DataFrame.new(Datasets::Rdataset.new('dplyr', 'starwars').to_arrow)
+
+   uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv")
+   import_cars = RedAmber::DataFrame.load(uri)
+
+   ds = Datasets::Rdataset.new('openintro', 'simpsons_paradox_covid')
+   simpsons_paradox_covid = RedAmber::DataFrame.new(ds.to_arrow)
+
+ benchmark:
+   'D01: Diamonds test': |
+     diamonds
+       .slice { v(:carat) > 1 }
+       .pick(:cut, :price)
+       .group(:cut)
+       .mean
+       .sort('-mean(price)')
+       .rename('mean(price)': :mean_price_USD)
+       .assign { [:mean_price_JPY, v(:mean_price_USD) * 110.0] }
+
+   'D02: Starwars test': |
+     starwars
+       .drop { keys.select { |key| key.end_with?('color') } }
+       .remove { v(:species) == 'NA' }
+       .group(:species) { [count(:species), mean(:height, :mass)] }
+       .slice { v(:count) > 1 }
+
+   'D03: Inport cars test': |
+     import_cars
+       .to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
+       .to_wide(name: :Manufacturer, value: :Num_of_imported)
+       .transpose
+
+   'D04: Simpsons paradox test': |
+     simpsons_paradox_covid[simpsons_paradox_covid[:age_group] == 'under 50']
+       .group(:vaccine_status, :outcome)
+       .count
+       .then { |df| df.to_wide(name: :vaccine_status, value: df.keys[-1]) }
+       .assign do
+         [
+           [:'vaccinated_%', (100.0 * v(:vaccinated) / v(:vaccinated).sum)],
+           [:'unvaccinated_%', (100.0 * v(:unvaccinated) / v(:unvaccinated).sum)]
+         ]
+       end
data/benchmark/drop_nil.yml +15 -3
@@ -1,11 +1,23 @@
+ contexts:
+   - gems:
+       red_amber: 0.1.8
+   - gems:
+       red_amber: 0.2.2
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+       require 'red_amber'
+
  prelude: |
    require 'datasets-arrow'
    require 'red_amber'
 
    penguins = RedAmber::DataFrame.new(Datasets::Penguins.new.to_arrow)
 
-   def drop_nil(penguins)
-     penguins.remove { vectors.map { |v| v.is_nil} }
+   def remove_nil(penguins)
+     penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
    end
 
- benchmark: drop_nil(penguins)
+ benchmark:
+   'Remove and reduce': remove_nil(penguins)
+   'remove_nil method': penguins.remove_nil
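Note on the rewritten helper above: `vectors.map(&:is_nil).reduce(&:|)` collapses one nil mask per column into a single row mask. The sketch below shows the same idea with plain Ruby Arrays standing in for RedAmber vectors; the column names and values are invented for illustration, and the element-wise OR is spelled out by hand.

```ruby
# Made-up columns; nil marks a missing value.
columns = {
  bill_length_mm: [39.1, nil, 40.3],
  body_mass_g: [3750, 3800, nil]
}

# One boolean mask per column: true where the value is nil.
masks = columns.values.map { |column| column.map(&:nil?) }

# Element-wise OR across all masks (the role played by reduce(&:|) on vectors).
any_nil = masks.reduce { |acc, mask| acc.zip(mask).map { |a, b| a | b } }

# Row indices that survive after removing every row with a nil.
kept_rows = any_nil.each_index.reject { |i| any_nil[i] }

p any_nil
p kept_rows
```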
data/benchmark/group.yml +39 -0
@@ -0,0 +1,39 @@
+ loop_count: 3
+
+ contexts:
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+   - name: 0.2.3
+     gems:
+       red_amber: 0.2.3
+   - name: 0.2.2
+     gems:
+       red_amber: 0.2.2
+
+ prelude: |
+   require 'red_amber'
+   require 'datasets-arrow'
+
+   ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+   df = RedAmber::DataFrame.new(ds.to_arrow)
+     .assign(:flight) { flight.map(&:to_s) }
+
+   slicer = df[:distance] > 1000
+   distance_km = df[:distance] * 1.852
+
+ benchmark:
+   'G01: sum distance by destination': |
+     df.group(:dest).sum(:distance)
+
+   'G02: sum arr_delay by month and day': |
+     df.group(:month, :day).sum(:arr_delay)
+
+   'G03: sum arr_delay, mean distance by flight': |
+     df.group(:flight) { [sum(:arr_delay), mean(:distance)] }
+
+   'G04: mean air_time, distance by flight': |
+     df.group(:flight).mean(:air_time, :distance)
+
+   'G05: sum dep_delay, arr_delay by carrer': |
+     df.group(:carrier).sum(:dep_delay, :arr_delay)
data/benchmark/reshape.yml +31 -0
@@ -0,0 +1,31 @@
+ loop_count: 3
+
+ contexts:
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+   - name: 0.2.3
+     gems:
+       red_amber: 0.2.3
+   - name: 0.2.2
+     gems:
+       red_amber: 0.2.2
+
+ prelude: |
+   require 'red_amber'
+   require 'datasets-arrow'
+
+   ds = Datasets::Rdatasets.new('tidyr', 'billboard')
+   df = RedAmber::DataFrame.new(ds.to_arrow)
+   sub_df = df.pick(:track, df.keys.select{ |k| k.start_with? 'wk' })
+   long_df = df.to_long(:artist, :track, :'date.entered', name: :week, value: :rank)
+
+ benchmark:
+   'R01: Transpose a DataFrame': |
+     sub_df.transpose(name: :week)
+
+   'R02: Reshape to longer DataFrame': |
+     df.to_long(:artist, :track, :'date.entered', name: :week, value: :rank)
+
+   'R03: Reshape to wider DataFrame': |
+     long_df.to_wide(name: :week, value: :rank)
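Note on the reshape benchmarks above: as a rough picture of what "longer" and "wider" mean, here is a plain-Ruby sketch on Arrays of Hashes. It does not call RedAmber's `to_long` or `to_wide`; the two rows are invented and only loosely shaped like the billboard data.

```ruby
# Made-up wide table: one row per track, one column per week.
wide = [
  { track: 'A', wk1: 10, wk2: 8 },
  { track: 'B', wk1: 20, wk2: 15 }
]

# Wide to long: one output row per (track, week) pair.
long = wide.flat_map do |row|
  row.reject { |key, _| key == :track }.map do |week, rank|
    { track: row[:track], week: week, rank: rank }
  end
end

# Long back to wide: gather the week/rank pairs under each track again.
back = long.group_by { |row| row[:track] }.map do |track, rows|
  rows.each_with_object({ track: track }) { |row, acc| acc[row[:week]] = row[:rank] }
end

p long
p back
```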
data/benchmark/{csv_load_penguins.yml → rover/csv_load_penguins.yml} +3 -3
@@ -2,12 +2,12 @@ prelude: |
    require 'rover'
    require 'red_amber'
 
-   penguins_csv = 'benchmark/cache/penguins.csv'
+   penguins_csv = 'tmp/penguins.csv'
 
    unless File.exist?(penguins_csv)
      require 'datasets-arrow'
-     arrow = Datasets::Penguins.new.to_arrow
-     RedAmber::DataFrame.new(arrow).save(penguins_csv)
+     ds = Datasets::Penguins.new
+     RedAmber::DataFrame.new(ds).save(penguins_csv)
    end
 
  benchmark:
data/benchmark/rover/flights.yml +23 -0
@@ -0,0 +1,23 @@
+ contexts:
+   - gems:
+       red_amber: 0.2.2
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+       require 'red_amber'
+
+ prelude: |
+   require 'rover'
+   require 'datasets-arrow'
+   ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+   df = RedAmber::DataFrame.new(ds)
+   rover = Rover::DataFrame.new(df.to_h)
+   group_keys = [:month, :origin]
+   summary_key = :air_time
+
+ benchmark:
+   'penguins Group by Rover': |
+     rover.group(group_keys).count
+
+   'penguins Group by RedAmber': |
+     df.group(group_keys).count
data/benchmark/rover/penguins.yml +23 -0
@@ -0,0 +1,23 @@
+ contexts:
+   - gems:
+       red_amber: 0.2.2
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+       require 'red_amber'
+
+ prelude: |
+   require 'rover'
+   require 'datasets-arrow'
+   ds = Datasets::Penguins.new
+   df = RedAmber::DataFrame.new(ds)
+   rover = Rover::DataFrame.new(df.to_h)
+   group_keys = [:species, :island]
+   summary_key = :body_mass_g
+
+ benchmark:
+   'penguins Group by Rover': |
+     rover.group(group_keys).mean(summary_key)
+
+   'penguins Group by RedAmber': |
+     df.group(group_keys).mean(summary_key)
data/benchmark/rover/planes.yml +23 -0
@@ -0,0 +1,23 @@
+ contexts:
+   - gems:
+       red_amber: 0.2.2
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+       require 'red_amber'
+
+ prelude: |
+   require 'rover'
+   require 'datasets-arrow'
+   ds = Datasets::Rdatasets.new('nycflights13', 'planes')
+   df = RedAmber::DataFrame.new(ds)
+   rover = Rover::DataFrame.new(df.to_h)
+   group_keys = [:engines, :engine]
+   summary_key = :seats
+
+ benchmark:
+   'penguins Group by Rover': |
+     rover.group(group_keys).mean(summary_key)
+
+   'penguins Group by RedAmber': |
+     df.group(group_keys).mean(summary_key)
data/benchmark/rover/weather.yml +23 -0
@@ -0,0 +1,23 @@
+ contexts:
+   - gems:
+       red_amber: 0.2.2
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+       require 'red_amber'
+
+ prelude: |
+   require 'rover'
+   require 'datasets-arrow'
+   ds = Datasets::Rdatasets.new('nycflights13', 'weather')
+   df = RedAmber::DataFrame.new(ds)
+   rover = Rover::DataFrame.new(df.to_h)
+   group_keys = [:month, :origin]
+   summary_key = :temp
+
+ benchmark:
+   'penguins Group by Rover': |
+     rover.group(group_keys).mean(summary_key)
+
+   'penguins Group by RedAmber': |
+     df.group(group_keys).mean(summary_key)
data/benchmark/vector.yml +60 -0
@@ -0,0 +1,60 @@
+ loop_count: 10
+
+ contexts:
+   - name: HEAD
+     prelude: |
+       $LOAD_PATH.unshift(File.expand_path('lib'))
+   - name: 0.2.0
+     gems:
+       red_amber: 0.2.0
+
+ prelude: |
+   require 'red_amber'
+   include RedAmber
+   require 'datasets-arrow'
+
+   ds = Datasets::Rdatasets.new('nycflights13', 'flights')
+   flights = RedAmber::DataFrame.new(ds.to_arrow)
+   df = flights.slice { flights[:month] <= 6 }
+
+   tailnum_vector = df[:tailnum]
+   distance_vector = df[:distance]
+
+   strings = tailnum_vector.to_a
+   arrow_array = tailnum_vector.data
+   integers = df[:dep_delay].to_a
+   boolean_vector = df[:air_time].is_nil
+   index_vector = Vector.new(0...boolean_vector.size).filter(boolean_vector)
+   replacer = index_vector.data.map(&:to_s)
+   booleans = boolean_vector.to_a
+
+ benchmark:
+   'V01: Vector.new from integer Array': |
+     Vector.new(integers)
+
+   'V02: Vector.new from string Array': |
+     Vector.new(strings)
+
+   'V03: Vector.new from boolean Vector': |
+     Vector.new(boolean_vector)
+
+   'V04: Vector#sum': |
+     distance_vector.mean
+
+   'V05: Vector#*': |
+     distance_vector * 1.852
+
+   'V06: Vector#[booleans]': |
+     tailnum_vector[booleans]
+
+   'V07: Vector#[boolean_vector]': |
+     tailnum_vector[boolean_vector]
+
+   'V08: Vector#[index_vector]': |
+     tailnum_vector[index_vector]
+
+   'V09: Vector#replace': |
+     tailnum_vector.replace(booleans, replacer)
+
+   'V10: Vector#replace with broad casting': |
+     tailnum_vector.replace(booleans, 'x')