red_amber 0.2.2 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41) hide show
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +114 -39
  3. data/CHANGELOG.md +203 -31
  4. data/Gemfile +5 -2
  5. data/README.md +62 -29
  6. data/benchmark/basic.yml +86 -0
  7. data/benchmark/combine.yml +62 -0
  8. data/benchmark/dataframe.yml +62 -0
  9. data/benchmark/drop_nil.yml +15 -3
  10. data/benchmark/group.yml +39 -0
  11. data/benchmark/reshape.yml +31 -0
  12. data/benchmark/{csv_load_penguins.yml → rover/csv_load_penguins.yml} +3 -3
  13. data/benchmark/rover/flights.yml +23 -0
  14. data/benchmark/rover/penguins.yml +23 -0
  15. data/benchmark/rover/planes.yml +23 -0
  16. data/benchmark/rover/weather.yml +23 -0
  17. data/benchmark/vector.yml +60 -0
  18. data/doc/DataFrame.md +335 -53
  19. data/doc/Vector.md +91 -0
  20. data/doc/image/dataframe/join.png +0 -0
  21. data/doc/image/dataframe/set_and_bind.png +0 -0
  22. data/doc/image/dataframe_model.png +0 -0
  23. data/lib/red_amber/data_frame.rb +167 -51
  24. data/lib/red_amber/data_frame_combinable.rb +486 -0
  25. data/lib/red_amber/data_frame_displayable.rb +6 -4
  26. data/lib/red_amber/data_frame_indexable.rb +2 -2
  27. data/lib/red_amber/data_frame_loadsave.rb +4 -1
  28. data/lib/red_amber/data_frame_reshaping.rb +35 -10
  29. data/lib/red_amber/data_frame_selectable.rb +221 -116
  30. data/lib/red_amber/data_frame_variable_operation.rb +146 -82
  31. data/lib/red_amber/group.rb +108 -18
  32. data/lib/red_amber/helper.rb +53 -43
  33. data/lib/red_amber/refinements.rb +199 -0
  34. data/lib/red_amber/vector.rb +56 -46
  35. data/lib/red_amber/vector_functions.rb +23 -83
  36. data/lib/red_amber/vector_selectable.rb +116 -69
  37. data/lib/red_amber/vector_updatable.rb +189 -65
  38. data/lib/red_amber/version.rb +1 -1
  39. data/lib/red_amber.rb +3 -0
  40. data/red_amber.gemspec +4 -3
  41. metadata +24 -10
data/README.md CHANGED
@@ -1,28 +1,31 @@
1
1
  # RedAmber
2
2
 
3
3
  [![Gem Version](https://badge.fury.io/rb/red_amber.svg)](https://badge.fury.io/rb/red_amber)
4
- [![Ruby](https://github.com/heronshoes/red_amber/actions/workflows/test.yml/badge.svg)](https://github.com/heronshoes/red_amber/actions/workflows/test.yml)
4
+ [![CI](https://github.com/heronshoes/red_amber/actions/workflows/ci.yml/badge.svg)](https://github.com/heronshoes/red_amber/actions/workflows/ci.yml)
5
+ [![Maintainability](https://api.codeclimate.com/v1/badges/b8a745047045d2f49daa/maintainability)](https://codeclimate.com/github/heronshoes/red_amber/maintainability)
6
+ [![Test coverage](https://api.codeclimate.com/v1/badges/b8a745047045d2f49daa/test_coverage)](https://codeclimate.com/github/heronshoes/red_amber/test_coverage)
7
+ [![Doc](https://img.shields.io/badge/docs-latest-blue)](https://heronshoes.github.io/red_amber/)
5
8
  [![Discussions](https://img.shields.io/github/discussions/heronshoes/red_amber)](https://github.com/heronshoes/red_amber/discussions)
6
9
 
7
10
  A simple dataframe library for Ruby.
8
11
 
9
- - Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) [![Gitter Chat](https://badges.gitter.im/red-data-tools/en.svg)](https://gitter.im/red-data-tools/en)
12
+ - Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
13
+ [![Gitter Chat](https://badges.gitter.im/red-data-tools/en.svg)](https://gitter.im/red-data-tools/en)
10
14
  - Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
11
15
 
12
- ![screenshot from jupyterlab](doc/image/screenshot.png)
16
+ ![screenshot from jupyterlab](https://raw.githubusercontent.com/heronshoes/red_amber/main/doc/image/screenshot.png)
13
17
 
14
18
  ## Requirements
15
19
 
16
- Supported Ruby version is >= 2.7.
20
+ Supported Ruby version is >= 3.0 (since RedAmber 0.3.0).
17
21
 
18
- Since v0.2.0, this library uses pattern matching which is an experimental feature in 2.7 . It is usable but a warning message will be shown in 2.7 .
19
- I recommend Ruby 3 for performance.
22
+ - I decided to remove Ruby 2.7 without waiting for EOL because it cannot solve the problem of simultaneous use of Hash and keyword arguments when implementing DataFrame#join.
20
23
 
21
24
  ```ruby
22
25
  # Libraries required
23
- gem 'red-arrow', '>= 9.0.0'
26
+ gem 'red-arrow', '~> 10.0.0' # Requires Apache Arrow (see installation below)
24
27
 
25
- gem 'red-parquet', '>= 9.0.0' # Optional, if you use IO from/to parquet
28
+ gem 'red-parquet', '~> 10.0.0' # Optional, if you use IO from/to parquet
26
29
  gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
27
30
  ```
28
31
 
@@ -30,37 +33,61 @@ gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
30
33
 
31
34
  Install requirements before you install Red Amber.
32
35
 
33
- - Apache Arrow GLib (>= 9.0.0)
34
-
35
- - Apache Parquet GLib (>= 9.0.0) # If you use IO from/to parquet
36
+ - Apache Arrow (~> 10.0.0)
37
+ - Apache Arrow GLib (~> 10.0.0)
38
+ - Apache Parquet GLib (~> 10.0.0) # If you use IO from/to parquet
36
39
 
37
40
  See [Apache Arrow install document](https://arrow.apache.org/install/).
38
41
 
39
- Minimum installation example for the latest Ubuntu is in the ['Prepare the Apache Arrow' section in ci test](https://github.com/heronshoes/red_amber/blob/master/.github/workflows/test.yml) of Red Amber.
42
+ - Minimum installation example for the latest Ubuntu:
40
43
 
41
- Add this line to your Gemfile:
44
+ ```
45
+ sudo apt update
46
+ sudo apt install -y -V ca-certificates lsb-release wget
47
+ wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
48
+ sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
49
+ sudo apt update
50
+ sudo apt install -y -V libarrow-dev
51
+ sudo apt install -y -V libarrow-glib-dev
52
+ ```
42
53
 
43
- ```ruby
44
- gem 'red_amber'
45
- ```
54
+ - On Fedora 38 (Rawhide):
46
55
 
47
- And then execute:
56
+ ```
57
+ sudo dnf update
58
+ sudo dnf -y install gcc-c++ libarrow-devel libarrow-glib-devel ruby-devel
59
+ ```
48
60
 
49
- ```shell
50
- bundle install
51
- ```
61
+ - On macOS, you can install Apache Arrow C++ library using Homebrew:
52
62
 
53
- Or install it yourself as:
63
+ ```
64
+ brew install apache-arrow
65
+ ```
54
66
 
55
- ```shell
56
- gem install red_amber
67
+ and GLib (C) package with:
68
+
69
+ ```
70
+ brew install apache-arrow-glib
71
+ ```
72
+
73
+ If you prepared Apache Arrow, add these lines to your Gemfile:
74
+
75
+ ```ruby
76
+ gem 'red-arrow', '~> 10.0.0'
77
+ gem 'red_amber'
78
+ gem 'red-parquet', '~> 10.0.0' # Optional, if you use IO from/to parquet
79
+ gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
80
+ gem 'red-datasets-arrow' # Optional, recommended if you use Red Datasets
81
+ gem 'red-arrow-numo-narray' # Optional, recommended if you use inputs from Numo::NArray
57
82
  ```
58
83
 
84
+ And then execute `bundle install` or install it yourself as `gem install red_amber`.
85
+
59
86
  ## Docker image and Jupyter Notebook
60
87
 
61
88
  [RubyData Docker Stacks](https://github.com/RubyData/docker-stacks) is available as a ready-to-run Docker image containing Jupyter and useful data tools as well as RedAmber (Thanks to @mrkn).
62
89
 
63
- Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=README.ipynb).
90
+ Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb).
64
91
  [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb)
65
92
 
66
93
 
@@ -69,9 +96,9 @@ Also you can try the contents of this README interactively by [Binder](https://m
69
96
  Class `RedAmber::DataFrame` represents a set of data in 2D-shape.
70
97
  The entity is a Red Arrow's Table object.
71
98
 
72
- ![dataframe model of RedAmber](doc/image/dataframe_model.png)
99
+ ![dataframe model of RedAmber](https://raw.githubusercontent.com/heronshoes/red_amber/main/doc/image/dataframe_model.png)
73
100
 
74
- Load the library.
101
+ Let's load the library and try some examples.
75
102
 
76
103
  ```ruby
77
104
  require 'red_amber' # require 'red-amber' is also OK.
@@ -80,6 +107,11 @@ include RedAmber
80
107
 
81
108
  ### Example: diamonds dataset
82
109
 
110
+ First do (if you do not installed) `
111
+ gem install red-datasets-arrow
112
+ `
113
+ then
114
+
83
115
  ```ruby
84
116
  require 'datasets-arrow' # to load sample data
85
117
 
@@ -101,7 +133,7 @@ diamonds = DataFrame.new(dataset) # from v0.2.2, should be `dataset.to_arrow` if
101
133
  53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 ... 3.64
102
134
  ```
103
135
 
104
- For example, we can compute mean prices per 'cut' for the data larger than 1 carat.
136
+ For example, we can compute mean prices per cut for the data larger than 1 carat.
105
137
 
106
138
  ```ruby
107
139
  df = diamonds
@@ -125,7 +157,7 @@ Arrow data is immutable, so these methods always return new objects.
125
157
  Next example will rename a column and create a new column by simple calcuration.
126
158
 
127
159
  ```ruby
128
- usdjpy = 110.0
160
+ usdjpy = 110.0 # when the yen was stronger
129
161
 
130
162
  df.rename('mean(price)': :mean_price_USD)
131
163
  .assign(:mean_price_JPY) { mean_price_USD * usdjpy }
@@ -181,7 +213,8 @@ See [Vector.md](doc/Vector.md) for details.
181
213
 
182
214
  ## Jupyter notebook
183
215
 
184
- [73 Examples of Red Amber](binder/examples_of_red_amber.ipynb) shows more examples in jupyter notebook.
216
+ [89 Examples of Red Amber](https://github.com/heronshoes/docker-stacks/blob/RedAmber-binder/binder/examples_of_red_amber.ipynb)
217
+ ([raw file](https://raw.githubusercontent.com/heronshoes/docker-stacks/RedAmber-binder/binder/examples_of_red_amber.ipynb)) shows more examples in jupyter notebook.
185
218
 
186
219
  You can try this notebook on [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb).
187
220
  [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb)
@@ -0,0 +1,86 @@
1
+ loop_count: 3
2
+
3
+ contexts:
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ - name: 0.2.3
8
+ gems:
9
+ red_amber: 0.2.3
10
+ - name: 0.2.0
11
+ gems:
12
+ red_amber: 0.2.0
13
+ - name: 0.1.5
14
+ gems:
15
+ red_amber: 0.1.5
16
+
17
+ prelude: |
18
+ require 'red_amber'
19
+ require 'datasets-arrow'
20
+
21
+ ds = Datasets::Rdatasets.new('nycflights13', 'flights')
22
+ df = RedAmber::DataFrame.new(ds.to_arrow)
23
+
24
+ slicer = df[:distance] > 1000
25
+ distance_km = df[:distance] * 1.852
26
+
27
+ benchmark:
28
+ 'B01: Pick([]) by a key name': |
29
+ df[:flight]
30
+
31
+ 'B02a: Pick([]) by key names': |
32
+ df[:carrier, :flight]
33
+
34
+ 'B03: Pick by key names': |
35
+ df.pick(:carrier, :flight)
36
+
37
+ 'B04: Drop by key names': |
38
+ df.drop(:year, :month, :day)
39
+
40
+ 'B05: Pick by booleans': |
41
+ df.pick(df.vectors.map(&:string?))
42
+
43
+ 'B06: Pick by a block': |
44
+ df.pick { keys.map { |key| key.end_with?('time') } }
45
+
46
+ 'B07: Slice([]) by a index': |
47
+ df[877]
48
+
49
+ 'B08: Slice by indeces': |
50
+ df.slice(0...5, -5..-1)
51
+
52
+ 'B09: Slice([]) by booleans': |
53
+ df[slicer]
54
+
55
+ 'B10: Slice by booleans': |
56
+ df.slice(slicer)
57
+
58
+ 'B11: Remove by booleans': |
59
+ df.remove(slicer)
60
+
61
+ 'B12: Slice by a block': |
62
+ df.slice { slicer }
63
+
64
+ 'B13: Rename by Hash': |
65
+ df.rename(distance: :distance_mile)
66
+
67
+ 'B14: Assign an existing variable': |
68
+ df.assign(distance: distance_km)
69
+
70
+ 'B15: Assign a new variable': |
71
+ df.assign(distance_km: distance_km)
72
+
73
+ 'B16: Sort by a key': |
74
+ df.sort(:distance)
75
+
76
+ 'B17: Sort by keys': |
77
+ df.sort(:origin, '-distance')
78
+
79
+ 'B18: Convert to a Hash': |
80
+ df.to_h
81
+
82
+ 'B19: Output in TDR style': |
83
+ df.tdr
84
+
85
+ 'B20: Inspect': |
86
+ df.inspect
@@ -0,0 +1,62 @@
1
+ loop_count: 3
2
+
3
+ contexts:
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ - name: 0.2.3
8
+ gems:
9
+ red_amber: 0.2.3
10
+
11
+ prelude: |
12
+ require 'red_amber'
13
+ include RedAmber
14
+ require 'datasets-arrow'
15
+
16
+ package = 'nycflights13'
17
+ airlines = DataFrame.new(Datasets::Rdatasets.new(package, 'airlines'))
18
+ airports = DataFrame.new(Datasets::Rdatasets.new(package, 'airports'))
19
+ flights = DataFrame.new(Datasets::Rdatasets.new(package, 'flights'))
20
+ .pick(%i[month day carrier flight tailnum origin dest air_time distance])
21
+ planes = DataFrame.new(Datasets::Rdatasets.new(package, 'planes'))
22
+ weather = DataFrame.new(Datasets::Rdatasets.new(package, 'weather'))
23
+
24
+ flights_Q1 = flights.slice { month <= 3 }
25
+ flights_Q2 = flights.slice { month > 3 }
26
+
27
+ flights_1_2 = flights_Q1.slice { month.is_in(1, 2) }
28
+ flights_1_3 = flights_Q1.slice { month.is_in(1, 3) }
29
+
30
+ flights_left = flights_Q1.pick(...5)
31
+ flights_right = flights_Q1.pick(5..)
32
+
33
+ benchmark:
34
+ 'C01: Inner join on flights_Q1 by carrier': |
35
+ flights_Q1.inner_join(airlines, :carrier)
36
+
37
+ 'C02: Full join on flights_Q1 by planes': |
38
+ flights_Q1.full_join(planes, :tailnum)
39
+
40
+ 'C03: Left join on flights_Q1 by planes': |
41
+ flights_Q1.left_join(planes, :tailnum)
42
+
43
+ 'C04: Semi join on flights_Q1 by planes': |
44
+ flights_Q1.semi_join(planes, :tailnum)
45
+
46
+ 'C05: Anti join on flights_Q1 by planes': |
47
+ flights_Q1.anti_join(planes, :tailnum)
48
+
49
+ 'C06: Intersection of flights_1_2 and flights_1_3': |
50
+ flights_1_2.intersect(flights_1_3)
51
+
52
+ 'C07: Union of flights_1_2 and flights_1_3': |
53
+ flights_1_2.union(flights_1_3)
54
+
55
+ 'C08: Difference between flights_1_2 and flights_1_3': |
56
+ flights_1_2.difference(flights_1_3)
57
+
58
+ 'C09: Concatenate flight_Q1 on flight_Q2': |
59
+ flights_Q1.concatenate(flights_Q2)
60
+
61
+ 'C10: Merge flights_Q1_right on flights_Q1_left': |
62
+ flights_left.merge(flights_right)
@@ -0,0 +1,62 @@
1
+ loop_count: 3
2
+
3
+ contexts:
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ - name: 0.2.3
8
+ gems:
9
+ red_amber: 0.2.3
10
+ - name: 0.2.0
11
+ gems:
12
+ red_amber: 0.2.0
13
+
14
+ prelude: |
15
+ require 'red_amber'
16
+ require 'datasets-arrow'
17
+
18
+ diamonds = RedAmber::DataFrame.new(Datasets::Diamonds.new.to_arrow)
19
+
20
+ starwars = RedAmber::DataFrame.new(Datasets::Rdataset.new('dplyr', 'starwars').to_arrow)
21
+
22
+ uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv")
23
+ import_cars = RedAmber::DataFrame.load(uri)
24
+
25
+ ds = Datasets::Rdataset.new('openintro', 'simpsons_paradox_covid')
26
+ simpsons_paradox_covid = RedAmber::DataFrame.new(ds.to_arrow)
27
+
28
+ benchmark:
29
+ 'D01: Diamonds test': |
30
+ diamonds
31
+ .slice { v(:carat) > 1 }
32
+ .pick(:cut, :price)
33
+ .group(:cut)
34
+ .mean
35
+ .sort('-mean(price)')
36
+ .rename('mean(price)': :mean_price_USD)
37
+ .assign { [:mean_price_JPY, v(:mean_price_USD) * 110.0] }
38
+
39
+ 'D02: Starwars test': |
40
+ starwars
41
+ .drop { keys.select { |key| key.end_with?('color') } }
42
+ .remove { v(:species) == 'NA' }
43
+ .group(:species) { [count(:species), mean(:height, :mass)] }
44
+ .slice { v(:count) > 1 }
45
+
46
+ 'D03: Inport cars test': |
47
+ import_cars
48
+ .to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
49
+ .to_wide(name: :Manufacturer, value: :Num_of_imported)
50
+ .transpose
51
+
52
+ 'D04: Simpsons paradox test': |
53
+ simpsons_paradox_covid[simpsons_paradox_covid[:age_group] == 'under 50']
54
+ .group(:vaccine_status, :outcome)
55
+ .count
56
+ .then { |df| df.to_wide(name: :vaccine_status, value: df.keys[-1]) }
57
+ .assign do
58
+ [
59
+ [:'vaccinated_%', (100.0 * v(:vaccinated) / v(:vaccinated).sum)],
60
+ [:'unvaccinated_%', (100.0 * v(:unvaccinated) / v(:unvaccinated).sum)]
61
+ ]
62
+ end
@@ -1,11 +1,23 @@
1
+ contexts:
2
+ - gems:
3
+ red_amber: 0.1.8
4
+ - gems:
5
+ red_amber: 0.2.2
6
+ - name: HEAD
7
+ prelude: |
8
+ $LOAD_PATH.unshift(File.expand_path('lib'))
9
+ require 'red_amber'
10
+
1
11
  prelude: |
2
12
  require 'datasets-arrow'
3
13
  require 'red_amber'
4
14
 
5
15
  penguins = RedAmber::DataFrame.new(Datasets::Penguins.new.to_arrow)
6
16
 
7
- def drop_nil(penguins)
8
- penguins.remove { vectors.map { |v| v.is_nil} }
17
+ def remove_nil(penguins)
18
+ penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
9
19
  end
10
20
 
11
- benchmark: drop_nil(penguins)
21
+ benchmark:
22
+ 'Remove and reduce': remove_nil(penguins)
23
+ 'remove_nil method': penguins.remove_nil
@@ -0,0 +1,39 @@
1
+ loop_count: 3
2
+
3
+ contexts:
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ - name: 0.2.3
8
+ gems:
9
+ red_amber: 0.2.3
10
+ - name: 0.2.2
11
+ gems:
12
+ red_amber: 0.2.2
13
+
14
+ prelude: |
15
+ require 'red_amber'
16
+ require 'datasets-arrow'
17
+
18
+ ds = Datasets::Rdatasets.new('nycflights13', 'flights')
19
+ df = RedAmber::DataFrame.new(ds.to_arrow)
20
+ .assign(:flight) { flight.map(&:to_s) }
21
+
22
+ slicer = df[:distance] > 1000
23
+ distance_km = df[:distance] * 1.852
24
+
25
+ benchmark:
26
+ 'G01: sum distance by destination': |
27
+ df.group(:dest).sum(:distance)
28
+
29
+ 'G02: sum arr_delay by month and day': |
30
+ df.group(:month, :day).sum(:arr_delay)
31
+
32
+ 'G03: sum arr_delay, mean distance by flight': |
33
+ df.group(:flight) { [sum(:arr_delay), mean(:distance)] }
34
+
35
+ 'G04: mean air_time, distance by flight': |
36
+ df.group(:flight).mean(:air_time, :distance)
37
+
38
+ 'G05: sum dep_delay, arr_delay by carrer': |
39
+ df.group(:carrier).sum(:dep_delay, :arr_delay)
@@ -0,0 +1,31 @@
1
+ loop_count: 3
2
+
3
+ contexts:
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ - name: 0.2.3
8
+ gems:
9
+ red_amber: 0.2.3
10
+ - name: 0.2.2
11
+ gems:
12
+ red_amber: 0.2.2
13
+
14
+ prelude: |
15
+ require 'red_amber'
16
+ require 'datasets-arrow'
17
+
18
+ ds = Datasets::Rdatasets.new('tidyr', 'billboard')
19
+ df = RedAmber::DataFrame.new(ds.to_arrow)
20
+ sub_df = df.pick(:track, df.keys.select{ |k| k.start_with? 'wk' })
21
+ long_df = df.to_long(:artist, :track, :'date.entered', name: :week, value: :rank)
22
+
23
+ benchmark:
24
+ 'R01: Transpose a DataFrame': |
25
+ sub_df.transpose(name: :week)
26
+
27
+ 'R02: Reshape to longer DataFrame': |
28
+ df.to_long(:artist, :track, :'date.entered', name: :week, value: :rank)
29
+
30
+ 'R03: Reshape to wider DataFrame': |
31
+ long_df.to_wide(name: :week, value: :rank)
@@ -2,12 +2,12 @@ prelude: |
2
2
  require 'rover'
3
3
  require 'red_amber'
4
4
 
5
- penguins_csv = 'benchmark/cache/penguins.csv'
5
+ penguins_csv = 'tmp/penguins.csv'
6
6
 
7
7
  unless File.exist?(penguins_csv)
8
8
  require 'datasets-arrow'
9
- arrow = Datasets::Penguins.new.to_arrow
10
- RedAmber::DataFrame.new(arrow).save(penguins_csv)
9
+ ds = Datasets::Penguins.new
10
+ RedAmber::DataFrame.new(ds).save(penguins_csv)
11
11
  end
12
12
 
13
13
  benchmark:
@@ -0,0 +1,23 @@
1
+ contexts:
2
+ - gems:
3
+ red_amber: 0.2.2
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ require 'red_amber'
8
+
9
+ prelude: |
10
+ require 'rover'
11
+ require 'datasets-arrow'
12
+ ds = Datasets::Rdatasets.new('nycflights13', 'flights')
13
+ df = RedAmber::DataFrame.new(ds)
14
+ rover = Rover::DataFrame.new(df.to_h)
15
+ group_keys = [:month, :origin]
16
+ summary_key = :air_time
17
+
18
+ benchmark:
19
+ 'penguins Group by Rover': |
20
+ rover.group(group_keys).count
21
+
22
+ 'penguins Group by RedAmber': |
23
+ df.group(group_keys).count
@@ -0,0 +1,23 @@
1
+ contexts:
2
+ - gems:
3
+ red_amber: 0.2.2
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ require 'red_amber'
8
+
9
+ prelude: |
10
+ require 'rover'
11
+ require 'datasets-arrow'
12
+ ds = Datasets::Penguins.new
13
+ df = RedAmber::DataFrame.new(ds)
14
+ rover = Rover::DataFrame.new(df.to_h)
15
+ group_keys = [:species, :island]
16
+ summary_key = :body_mass_g
17
+
18
+ benchmark:
19
+ 'penguins Group by Rover': |
20
+ rover.group(group_keys).mean(summary_key)
21
+
22
+ 'penguins Group by RedAmber': |
23
+ df.group(group_keys).mean(summary_key)
@@ -0,0 +1,23 @@
1
+ contexts:
2
+ - gems:
3
+ red_amber: 0.2.2
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ require 'red_amber'
8
+
9
+ prelude: |
10
+ require 'rover'
11
+ require 'datasets-arrow'
12
+ ds = Datasets::Rdatasets.new('nycflights13', 'planes')
13
+ df = RedAmber::DataFrame.new(ds)
14
+ rover = Rover::DataFrame.new(df.to_h)
15
+ group_keys = [:engines, :engine]
16
+ summary_key = :seats
17
+
18
+ benchmark:
19
+ 'penguins Group by Rover': |
20
+ rover.group(group_keys).mean(summary_key)
21
+
22
+ 'penguins Group by RedAmber': |
23
+ df.group(group_keys).mean(summary_key)
@@ -0,0 +1,23 @@
1
+ contexts:
2
+ - gems:
3
+ red_amber: 0.2.2
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ require 'red_amber'
8
+
9
+ prelude: |
10
+ require 'rover'
11
+ require 'datasets-arrow'
12
+ ds = Datasets::Rdatasets.new('nycflights13', 'weather')
13
+ df = RedAmber::DataFrame.new(ds)
14
+ rover = Rover::DataFrame.new(df.to_h)
15
+ group_keys = [:month, :origin]
16
+ summary_key = :temp
17
+
18
+ benchmark:
19
+ 'penguins Group by Rover': |
20
+ rover.group(group_keys).mean(summary_key)
21
+
22
+ 'penguins Group by RedAmber': |
23
+ df.group(group_keys).mean(summary_key)
@@ -0,0 +1,60 @@
1
+ loop_count: 10
2
+
3
+ contexts:
4
+ - name: HEAD
5
+ prelude: |
6
+ $LOAD_PATH.unshift(File.expand_path('lib'))
7
+ - name: 0.2.0
8
+ gems:
9
+ red_amber: 0.2.0
10
+
11
+ prelude: |
12
+ require 'red_amber'
13
+ include RedAmber
14
+ require 'datasets-arrow'
15
+
16
+ ds = Datasets::Rdatasets.new('nycflights13', 'flights')
17
+ flights = RedAmber::DataFrame.new(ds.to_arrow)
18
+ df = flights.slice { flights[:month] <= 6 }
19
+
20
+ tailnum_vector = df[:tailnum]
21
+ distance_vector = df[:distance]
22
+
23
+ strings = tailnum_vector.to_a
24
+ arrow_array = tailnum_vector.data
25
+ integers = df[:dep_delay].to_a
26
+ boolean_vector = df[:air_time].is_nil
27
+ index_vector = Vector.new(0...boolean_vector.size).filter(boolean_vector)
28
+ replacer = index_vector.data.map(&:to_s)
29
+ booleans = boolean_vector.to_a
30
+
31
+ benchmark:
32
+ 'V01: Vector.new from integer Array': |
33
+ Vector.new(integers)
34
+
35
+ 'V02: Vector.new from string Array': |
36
+ Vector.new(strings)
37
+
38
+ 'V03: Vector.new from boolean Vector': |
39
+ Vector.new(boolean_vector)
40
+
41
+ 'V04: Vector#sum': |
42
+ distance_vector.mean
43
+
44
+ 'V05: Vector#*': |
45
+ distance_vector * 1.852
46
+
47
+ 'V06: Vector#[booleans]': |
48
+ tailnum_vector[booleans]
49
+
50
+ 'V07: Vector#[boolean_vector]': |
51
+ tailnum_vector[boolean_vector]
52
+
53
+ 'V08: Vector#[index_vector]': |
54
+ tailnum_vector[index_vector]
55
+
56
+ 'V09: Vector#replace': |
57
+ tailnum_vector.replace(booleans, replacer)
58
+
59
+ 'V10: Vector#replace with broad casting': |
60
+ tailnum_vector.replace(booleans, 'x')