red_amber 0.2.3 → 0.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.rubocop.yml +133 -51
- data/.yardopts +2 -0
- data/CHANGELOG.md +203 -1
- data/Gemfile +2 -1
- data/LICENSE +1 -1
- data/README.md +61 -45
- data/benchmark/basic.yml +11 -4
- data/benchmark/combine.yml +3 -4
- data/benchmark/dataframe.yml +62 -0
- data/benchmark/group.yml +7 -1
- data/benchmark/reshape.yml +6 -2
- data/benchmark/vector.yml +63 -0
- data/doc/DataFrame.md +35 -12
- data/doc/DataFrame_Comparison.md +65 -0
- data/doc/SubFrames.md +11 -0
- data/doc/Vector.md +295 -1
- data/doc/yard-templates/default/fulldoc/html/css/common.css +6 -0
- data/lib/red_amber/data_frame.rb +537 -68
- data/lib/red_amber/data_frame_combinable.rb +776 -123
- data/lib/red_amber/data_frame_displayable.rb +248 -18
- data/lib/red_amber/data_frame_indexable.rb +122 -19
- data/lib/red_amber/data_frame_loadsave.rb +81 -10
- data/lib/red_amber/data_frame_reshaping.rb +216 -21
- data/lib/red_amber/data_frame_selectable.rb +781 -120
- data/lib/red_amber/data_frame_variable_operation.rb +561 -85
- data/lib/red_amber/group.rb +195 -21
- data/lib/red_amber/helper.rb +114 -32
- data/lib/red_amber/refinements.rb +206 -0
- data/lib/red_amber/subframes.rb +1066 -0
- data/lib/red_amber/vector.rb +435 -58
- data/lib/red_amber/vector_aggregation.rb +312 -0
- data/lib/red_amber/vector_binary_element_wise.rb +387 -0
- data/lib/red_amber/vector_selectable.rb +321 -69
- data/lib/red_amber/vector_unary_element_wise.rb +436 -0
- data/lib/red_amber/vector_updatable.rb +397 -24
- data/lib/red_amber/version.rb +2 -1
- data/lib/red_amber.rb +15 -1
- data/red_amber.gemspec +4 -3
- metadata +19 -11
- data/doc/image/dataframe/reshaping_DataFrames.png +0 -0
- data/lib/red_amber/vector_functions.rb +0 -294
data/README.md
CHANGED
|
@@ -1,28 +1,29 @@
|
|
|
1
1
|
# RedAmber
|
|
2
2
|
|
|
3
|
-
[](https://rubygems.org/gems/red_amber)
|
|
4
|
+
[](https://github.com/heronshoes/red_amber/actions/workflows/ci.yml)
|
|
5
|
+
[](https://codeclimate.com/github/heronshoes/red_amber/maintainability)
|
|
6
|
+
[](https://codeclimate.com/github/heronshoes/red_amber/test_coverage)
|
|
7
|
+
[](https://heronshoes.github.io/red_amber/)
|
|
5
8
|
[](https://github.com/heronshoes/red_amber/discussions)
|
|
6
9
|
|
|
7
10
|
A simple dataframe library for Ruby.
|
|
8
11
|
|
|
9
|
-
- Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
|
|
12
|
+
- Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
|
|
13
|
+
[](https://gitter.im/red-data-tools/en) [](https://rubygems.org/gems/red-arrow)
|
|
10
14
|
- Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
|
|
11
15
|
|
|
12
|
-

|
|
16
|
+

|
|
13
17
|
|
|
14
18
|
## Requirements
|
|
19
|
+
### Ruby
|
|
20
|
+
Supported Ruby version is >= 3.0 (since RedAmber 0.3.0).
|
|
21
|
+
- I decided to drop support for Ruby 2.7 without waiting for its EOL. See [Release note for v0.3.0](https://github.com/heronshoes/red_amber/discussions/162) for details.
|
|
15
22
|
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
Since v0.2.0, this library uses pattern matching which is an experimental feature in 2.7 . It is usable but a warning message will be shown in 2.7 .
|
|
19
|
-
I recommend Ruby 3 for performance.
|
|
20
|
-
|
|
23
|
+
### Libraries
|
|
21
24
|
```ruby
|
|
22
|
-
#
|
|
23
|
-
gem 'red-
|
|
24
|
-
|
|
25
|
-
gem 'red-parquet', '~> 10.0.0' # Optional, if you use IO from/to parquet
|
|
25
|
+
gem 'red-arrow', '~> 11.0.0' # Requires Apache Arrow (see installation below)
|
|
26
|
+
gem 'red-parquet', '~> 11.0.0' # Optional, if you use IO from/to parquet
|
|
26
27
|
gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
|
|
27
28
|
```
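The `~> 11.0.0` constraints above are RubyGems pessimistic version constraints. As a minimal sketch of what such a constraint accepts, the following uses `Gem::Requirement` and `Gem::Version` from RubyGems itself. The helper name `accepted?` and the candidate version numbers are made up for illustration and do not come from the package.

```ruby
require 'rubygems' # Gem::Requirement and Gem::Version ship with Ruby

# Hypothetical helper: does `version` satisfy the given constraint string?
def accepted?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Candidate versions are illustrative only.
%w[10.0.1 11.0.0 11.0.3 11.1.0 12.0.0].each do |version|
  status = accepted?('~> 11.0.0', version) ? 'accepted' : 'rejected'
  puts "#{version}: #{status}"
end
```

Under this constraint the 11.0.x versions are accepted, while 10.0.1, 11.1.0 and 12.0.0 are rejected.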
|
|
28
29
|
|
|
@@ -30,61 +31,71 @@ gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
|
|
|
30
31
|
|
|
31
32
|
Install requirements before you install Red Amber.
|
|
32
33
|
|
|
33
|
-
- Apache Arrow (~>
|
|
34
|
-
- Apache Arrow GLib (~>
|
|
35
|
-
- Apache Parquet GLib (~>
|
|
34
|
+
- Apache Arrow (~> 11.0.0)
|
|
35
|
+
- Apache Arrow GLib (~> 11.0.0)
|
|
36
|
+
- Apache Parquet GLib (~> 11.0.0) # If you use IO from/to parquet
|
|
36
37
|
|
|
37
|
-
|
|
38
|
+
See [Apache Arrow install document](https://arrow.apache.org/install/).
|
|
38
39
|
|
|
39
40
|
- Minimum installation example for the latest Ubuntu:
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
41
|
+
|
|
42
|
+
```
|
|
43
|
+
sudo apt update
|
|
44
|
+
sudo apt install -y -V ca-certificates lsb-release wget
|
|
45
|
+
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
|
|
46
|
+
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
|
|
47
|
+
sudo apt update
|
|
48
|
+
sudo apt install -y -V libarrow-dev
|
|
49
|
+
sudo apt install -y -V libarrow-glib-dev
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
- On Fedora 38 (Rawhide):
|
|
53
|
+
|
|
54
|
+
```
|
|
55
|
+
sudo dnf update
|
|
56
|
+
sudo dnf -y install gcc-c++ libarrow-devel libarrow-glib-devel ruby-devel
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
- On macOS, using Homebrew:
|
|
60
|
+
|
|
61
|
+
```
|
|
62
|
+
brew install apache-arrow
|
|
63
|
+
brew install apache-arrow-glib
|
|
64
|
+
```
|
|
60
65
|
|
|
61
66
|
If you prepared Apache Arrow, add these lines to your Gemfile:
|
|
62
67
|
|
|
63
68
|
```ruby
|
|
64
|
-
gem 'red-arrow', '~>
|
|
69
|
+
gem 'red-arrow', '~> 11.0.0'
|
|
65
70
|
gem 'red_amber'
|
|
66
|
-
gem 'red-parquet', '~>
|
|
71
|
+
gem 'red-parquet', '~> 11.0.0' # Optional, if you use IO from/to parquet
|
|
67
72
|
gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
|
|
68
73
|
gem 'red-datasets-arrow' # Optional, recommended if you use Red Datasets
|
|
69
74
|
gem 'red-arrow-numo-narray' # Optional, recommended if you use inputs from Numo::NArray
|
|
70
75
|
```
|
|
71
76
|
|
|
72
|
-
And then execute `bundle install` or install
|
|
77
|
+
Then execute `bundle install`, or install the gems yourself, for example with `gem install red_amber`.
|
|
73
78
|
|
|
74
79
|
## Docker image and Jupyter Notebook
|
|
75
80
|
|
|
76
|
-
[RubyData Docker Stacks](https://github.com/RubyData/docker-stacks) is available as a ready-to-run Docker image containing Jupyter and useful data tools as well as RedAmber (Thanks to
|
|
81
|
+
[RubyData Docker Stacks](https://github.com/RubyData/docker-stacks) is available as a ready-to-run Docker image containing Jupyter and useful data tools as well as RedAmber (Thanks to Kenta Murata).
|
|
77
82
|
|
|
78
83
|
Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb).
|
|
79
84
|
[](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb)
|
|
80
85
|
|
|
86
|
+
## Comparison of DataFrames
|
|
87
|
+
|
|
88
|
+
Comparison of basic features of RedAmber with Python
|
|
89
|
+
[pandas](https://pandas.pydata.org/),
|
|
90
|
+
R [Tidyverse](https://www.tidyverse.org/) and
|
|
91
|
+
Julia [Dataframes](https://dataframes.juliadata.org/stable/) is [here](doc/DataFrame_Comparison.md) (Thanks to Benson Muite).
|
|
81
92
|
|
|
82
93
|
## Data frame in `RedAmber`
|
|
83
94
|
|
|
84
95
|
Class `RedAmber::DataFrame` represents a set of data in 2D-shape.
|
|
85
96
|
The entity is a Red Arrow's Table object.
|
|
86
97
|
|
|
87
|
-

|
|
98
|
+

|
|
88
99
|
|
|
89
100
|
Let's load the library and try some examples.
|
|
90
101
|
|
|
@@ -95,6 +106,11 @@ include RedAmber
|
|
|
95
106
|
|
|
96
107
|
### Example: diamonds dataset
|
|
97
108
|
|
|
109
|
+
First, if you have not installed it yet, run `
|
|
110
|
+
gem install red-datasets-arrow
|
|
111
|
+
`
|
|
112
|
+
then
|
|
113
|
+
|
|
98
114
|
```ruby
|
|
99
115
|
require 'datasets-arrow' # to load sample data
|
|
100
116
|
|
|
@@ -120,7 +136,7 @@ For example, we can compute mean prices per cut for the data larger than 1 carat
|
|
|
120
136
|
|
|
121
137
|
```ruby
|
|
122
138
|
df = diamonds
|
|
123
|
-
.slice { carat > 1 }
|
|
139
|
+
.slice { carat > 1 } # or use #filter instead of #slice
|
|
124
140
|
.group(:cut)
|
|
125
141
|
.mean(:price) # `pick` prior to `group` is not required if `:price` is specified here.
|
|
126
142
|
.sort('-mean(price)')
|
|
@@ -169,7 +185,7 @@ starwars
|
|
|
169
185
|
.drop(0) # delete unnecessary index column
|
|
170
186
|
.remove { species == "NA" } # delete unnecessary rows
|
|
171
187
|
.group(:species) { [count(:species), mean(:height, :mass)] }
|
|
172
|
-
.slice { count > 1 }
|
|
188
|
+
.slice { count > 1 } # or use #filter instead of #slice
|
|
173
189
|
|
|
174
190
|
# =>
|
|
175
191
|
#<RedAmber::DataFrame : 8 x 4 Vectors, 0x000000000000f848>
|
|
@@ -196,7 +212,7 @@ See [Vector.md](doc/Vector.md) for details.
|
|
|
196
212
|
|
|
197
213
|
## Jupyter notebook
|
|
198
214
|
|
|
199
|
-
[
|
|
215
|
+
[Examples of Red Amber](https://github.com/heronshoes/docker-stacks/blob/RedAmber-binder/binder/examples_of_red_amber.ipynb)
|
|
200
216
|
([raw file](https://raw.githubusercontent.com/heronshoes/docker-stacks/RedAmber-binder/binder/examples_of_red_amber.ipynb)) shows more examples in jupyter notebook.
|
|
201
217
|
|
|
202
218
|
You can try this notebook on [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb).
|
data/benchmark/basic.yml
CHANGED
|
@@ -1,10 +1,17 @@
|
|
|
1
|
+
loop_count: 3
|
|
2
|
+
|
|
1
3
|
contexts:
|
|
2
4
|
- name: HEAD
|
|
3
5
|
prelude: |
|
|
4
6
|
$LOAD_PATH.unshift(File.expand_path('lib'))
|
|
5
|
-
-
|
|
7
|
+
- name: 0.3.0
|
|
8
|
+
gems:
|
|
9
|
+
red_amber: 0.3.0
|
|
10
|
+
- name: 0.2.0
|
|
11
|
+
gems:
|
|
6
12
|
red_amber: 0.2.0
|
|
7
|
-
-
|
|
13
|
+
- name: 0.1.5
|
|
14
|
+
gems:
|
|
8
15
|
red_amber: 0.1.5
|
|
9
16
|
|
|
10
17
|
prelude: |
|
|
@@ -21,8 +28,8 @@ benchmark:
|
|
|
21
28
|
'B01: Pick([]) by a key name': |
|
|
22
29
|
df[:flight]
|
|
23
30
|
|
|
24
|
-
'
|
|
25
|
-
df[
|
|
31
|
+
'B02a: Pick([]) by key names': |
|
|
32
|
+
df[:carrier, :flight]
|
|
26
33
|
|
|
27
34
|
'B03: Pick by key names': |
|
|
28
35
|
df.pick(:carrier, :flight)
|
data/benchmark/combine.yml
CHANGED
data/benchmark/dataframe.yml
ADDED
|
@@ -0,0 +1,62 @@
|
|
|
1
|
+
loop_count: 3
|
|
2
|
+
|
|
3
|
+
contexts:
|
|
4
|
+
- name: HEAD
|
|
5
|
+
prelude: |
|
|
6
|
+
$LOAD_PATH.unshift(File.expand_path('lib'))
|
|
7
|
+
- name: 0.3.0
|
|
8
|
+
gems:
|
|
9
|
+
red_amber: 0.3.0
|
|
10
|
+
- name: 0.2.0
|
|
11
|
+
gems:
|
|
12
|
+
red_amber: 0.2.0
|
|
13
|
+
|
|
14
|
+
prelude: |
|
|
15
|
+
require 'red_amber'
|
|
16
|
+
require 'datasets-arrow'
|
|
17
|
+
|
|
18
|
+
diamonds = RedAmber::DataFrame.new(Datasets::Diamonds.new.to_arrow)
|
|
19
|
+
|
|
20
|
+
starwars = RedAmber::DataFrame.new(Datasets::Rdataset.new('dplyr', 'starwars').to_arrow)
|
|
21
|
+
|
|
22
|
+
uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv")
|
|
23
|
+
import_cars = RedAmber::DataFrame.load(uri)
|
|
24
|
+
|
|
25
|
+
ds = Datasets::Rdataset.new('openintro', 'simpsons_paradox_covid')
|
|
26
|
+
simpsons_paradox_covid = RedAmber::DataFrame.new(ds.to_arrow)
|
|
27
|
+
|
|
28
|
+
benchmark:
|
|
29
|
+
'D01: Diamonds test': |
|
|
30
|
+
diamonds
|
|
31
|
+
.slice { v(:carat) > 1 }
|
|
32
|
+
.pick(:cut, :price)
|
|
33
|
+
.group(:cut)
|
|
34
|
+
.mean
|
|
35
|
+
.sort('-mean(price)')
|
|
36
|
+
.rename('mean(price)': :mean_price_USD)
|
|
37
|
+
.assign { [:mean_price_JPY, v(:mean_price_USD) * 110.0] }
|
|
38
|
+
|
|
39
|
+
'D02: Starwars test': |
|
|
40
|
+
starwars
|
|
41
|
+
.drop { keys.select { |key| key.end_with?('color') } }
|
|
42
|
+
.remove { v(:species) == 'NA' }
|
|
43
|
+
.group(:species) { [count(:species), mean(:height, :mass)] }
|
|
44
|
+
.slice { v(:count) > 1 }
|
|
45
|
+
|
|
46
|
+
'D03: Import cars test': |
|
|
47
|
+
import_cars
|
|
48
|
+
.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
|
|
49
|
+
.to_wide(name: :Manufacturer, value: :Num_of_imported)
|
|
50
|
+
.transpose
|
|
51
|
+
|
|
52
|
+
'D04: Simpsons paradox test': |
|
|
53
|
+
simpsons_paradox_covid[simpsons_paradox_covid[:age_group] == 'under 50']
|
|
54
|
+
.group(:vaccine_status, :outcome)
|
|
55
|
+
.count
|
|
56
|
+
.then { |df| df.to_wide(name: :vaccine_status, value: df.keys[-1]) }
|
|
57
|
+
.assign do
|
|
58
|
+
[
|
|
59
|
+
[:'vaccinated_%', (100.0 * v(:vaccinated) / v(:vaccinated).sum)],
|
|
60
|
+
[:'unvaccinated_%', (100.0 * v(:unvaccinated) / v(:unvaccinated).sum)]
|
|
61
|
+
]
|
|
62
|
+
end
|
data/benchmark/group.yml
CHANGED
data/benchmark/reshape.yml
CHANGED
data/benchmark/vector.yml
ADDED
|
@@ -0,0 +1,63 @@
|
|
|
1
|
+
loop_count: 10
|
|
2
|
+
|
|
3
|
+
contexts:
|
|
4
|
+
- name: HEAD
|
|
5
|
+
prelude: |
|
|
6
|
+
$LOAD_PATH.unshift(File.expand_path('lib'))
|
|
7
|
+
- name: 0.3.0
|
|
8
|
+
gems:
|
|
9
|
+
red_amber: 0.3.0
|
|
10
|
+
- name: 0.2.0
|
|
11
|
+
gems:
|
|
12
|
+
red_amber: 0.2.0
|
|
13
|
+
|
|
14
|
+
prelude: |
|
|
15
|
+
require 'red_amber'
|
|
16
|
+
include RedAmber
|
|
17
|
+
require 'datasets-arrow'
|
|
18
|
+
|
|
19
|
+
ds = Datasets::Rdatasets.new('nycflights13', 'flights')
|
|
20
|
+
flights = RedAmber::DataFrame.new(ds.to_arrow)
|
|
21
|
+
df = flights.slice { flights[:month] <= 6 }
|
|
22
|
+
|
|
23
|
+
tailnum_vector = df[:tailnum]
|
|
24
|
+
distance_vector = df[:distance]
|
|
25
|
+
|
|
26
|
+
strings = tailnum_vector.to_a
|
|
27
|
+
arrow_array = tailnum_vector.data
|
|
28
|
+
integers = df[:dep_delay].to_a
|
|
29
|
+
boolean_vector = df[:air_time].is_nil
|
|
30
|
+
index_vector = Vector.new(0...boolean_vector.size).filter(boolean_vector)
|
|
31
|
+
replacer = index_vector.data.map(&:to_s)
|
|
32
|
+
booleans = boolean_vector.to_a
|
|
33
|
+
|
|
34
|
+
benchmark:
|
|
35
|
+
'V01: Vector.new from integer Array': |
|
|
36
|
+
Vector.new(integers)
|
|
37
|
+
|
|
38
|
+
'V02: Vector.new from string Array': |
|
|
39
|
+
Vector.new(strings)
|
|
40
|
+
|
|
41
|
+
'V03: Vector.new from boolean Vector': |
|
|
42
|
+
Vector.new(boolean_vector)
|
|
43
|
+
|
|
44
|
+
'V04: Vector#mean': |
|
|
45
|
+
distance_vector.mean
|
|
46
|
+
|
|
47
|
+
'V05: Vector#*': |
|
|
48
|
+
distance_vector * 1.852
|
|
49
|
+
|
|
50
|
+
'V06: Vector#[booleans]': |
|
|
51
|
+
tailnum_vector[booleans]
|
|
52
|
+
|
|
53
|
+
'V07: Vector#[boolean_vector]': |
|
|
54
|
+
tailnum_vector[boolean_vector]
|
|
55
|
+
|
|
56
|
+
'V08: Vector#[index_vector]': |
|
|
57
|
+
tailnum_vector[index_vector]
|
|
58
|
+
|
|
59
|
+
'V09: Vector#replace': |
|
|
60
|
+
tailnum_vector.replace(booleans, replacer)
|
|
61
|
+
|
|
62
|
+
'V10: Vector#replace with broadcasting': |
|
|
63
|
+
tailnum_vector.replace(booleans, 'x')
|
data/doc/DataFrame.md
CHANGED
|
@@ -57,6 +57,10 @@ Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
|
|
|
57
57
|
```ruby
|
|
58
58
|
RedAmber::DataFrame.load("test/entity/with_header.csv")
|
|
59
59
|
```
|
|
60
|
+
|
|
61
|
+
```ruby
|
|
62
|
+
RedAmber::DataFrame.load("test/entity/without_header.csv", headers: [:x, :y, :z])
|
|
63
|
+
```
|
|
60
64
|
|
|
61
65
|
- from a string buffer
|
|
62
66
|
|
|
@@ -275,6 +279,7 @@ penguins.to_rover
|
|
|
275
279
|
|
|
276
280
|
- Shows some information about self in a transposed style.
|
|
277
281
|
- `tdr_str` returns same info as a String.
|
|
282
|
+
- `glimpse` is an alias. It is similar to dplyr's (or Polars's) `glimpse()`.
|
|
278
283
|
|
|
279
284
|
```ruby
|
|
280
285
|
require 'red_amber'
|
|
@@ -568,7 +573,7 @@ penguins.to_rover
|
|
|
568
573
|
[1, 2, 3]
|
|
569
574
|
```
|
|
570
575
|
|
|
571
|
-
### `slice ` -
|
|
576
|
+
### `slice ` - cut into slices of records -
|
|
572
577
|
|
|
573
578
|
Slice and select records (rows) to create a sub DataFrame.
|
|
574
579
|
|
|
@@ -601,11 +606,14 @@ penguins.to_rover
|
|
|
601
606
|
|
|
602
607
|
- Booleans as an argument
|
|
603
608
|
|
|
604
|
-
`slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
|
|
609
|
+
`filter(booleans)` or `slice(booleans)` accepts booleans as an argument, given as an Array, a Vector or an Arrow::BooleanArray. The booleans must have the same length as `size`.
|
|
610
|
+
|
|
611
|
+
note: `slice(booleans)` is acceptable for orthogonality of `slice`/`remove`.
|
|
605
612
|
|
|
606
613
|
```ruby
|
|
607
614
|
vector = penguins[:bill_length_mm]
|
|
608
|
-
penguins.
|
|
615
|
+
penguins.filter(vector >= 40)
|
|
616
|
+
# penguins.slice(vector >= 40) is also acceptable
|
|
609
617
|
|
|
610
618
|
# =>
|
|
611
619
|
#<RedAmber::DataFrame : 242 x 8 Vectors, 0x0000000000043d3c>
|
|
@@ -833,14 +841,14 @@ penguins.to_rover
|
|
|
833
841
|
|
|
834
842
|
Assign new or updated variables (columns) and create an updated DataFrame.
|
|
835
843
|
|
|
836
|
-
- Variables with new keys will append new columns from
|
|
844
|
+
- Variables with new keys append new columns on the right.
|
|
837
845
|
- Variables with existing keys will update corresponding vectors.
|
|
838
846
|
|
|
839
847
|

|
|
840
848
|
|
|
841
849
|
- Variables as arguments
|
|
842
850
|
|
|
843
|
-
`assign(
|
|
851
|
+
`assign(key_value_pairs)` accepts pairs of key and values as parameters. `key_value_pairs` should be a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either a `Vector`, an `Array` or an `Arrow::Array`.
|
|
844
852
|
|
|
845
853
|
```ruby
|
|
846
854
|
df = RedAmber::DataFrame.new(
|
|
@@ -857,12 +865,12 @@ penguins.to_rover
|
|
|
857
865
|
2 Hinata 28
|
|
858
866
|
|
|
859
867
|
# update :age and add :brother
|
|
860
|
-
df.assign
|
|
868
|
+
df.assign(
|
|
861
869
|
{
|
|
862
870
|
age: age + 29,
|
|
863
871
|
brother: ['Santa', nil, 'Momotaro']
|
|
864
872
|
}
|
|
865
|
-
|
|
873
|
+
)
|
|
866
874
|
|
|
867
875
|
# =>
|
|
868
876
|
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
|
|
@@ -932,7 +940,7 @@ penguins.to_rover
|
|
|
932
940
|
|
|
933
941
|
- Append from left
|
|
934
942
|
|
|
935
|
-
`assign_left` method accepts the same parameters and block as `assign`, but append new columns from
|
|
943
|
+
`assign_left` method accepts the same parameters and block as `assign`, but appends new columns on the left.
|
|
936
944
|
|
|
937
945
|
```ruby
|
|
938
946
|
df.assign_left(new_index: df.indices(1))
|
|
@@ -1302,7 +1310,10 @@ When the option `keep_key: true` used, the column `key` will be preserved.
|
|
|
1302
1310
|
- `join_keys` are keys shared by self and other to match with them.
|
|
1303
1311
|
- If `join_keys` are empty, common keys in self and other are chosen (natural join).
|
|
1304
1312
|
- If (common keys) > `join_keys`, duplicated keys are renamed by `suffix`.
|
|
1313
|
+
- If you want to match the columns with different names,
|
|
1314
|
+
use a Hash for `join_keys` such as `{ left: :KEY1, right: :KEY2 }`.
|
|
1305
1315
|
|
|
1316
|
+
These are dataframes to use in the examples of joins.
|
|
1306
1317
|
```ruby
|
|
1307
1318
|
df = DataFrame.new(
|
|
1308
1319
|
KEY: %w[A B C],
|
|
@@ -1450,6 +1461,8 @@ When the option `keep_key: true` used, the column `key` will be preserved.
|
|
|
1450
1461
|
1 B 4
|
|
1451
1462
|
2 D 5
|
|
1452
1463
|
```
|
|
1464
|
+
##### `set_operable?(other)`
|
|
1465
|
+
Check if `types` of self and other are the same.
|
|
1453
1466
|
|
|
1454
1467
|
##### `intersect(other)`
|
|
1455
1468
|
|
|
@@ -1495,15 +1508,23 @@ When the option `keep_key: true` used, the column `key` will be preserved.
|
|
|
1495
1508
|
<string> <uint8>
|
|
1496
1509
|
1 B 2
|
|
1497
1510
|
2 C 3
|
|
1511
|
+
|
|
1512
|
+
other.difference(df)
|
|
1513
|
+
#=>
|
|
1514
|
+
#<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000040e0c>
|
|
1515
|
+
KEY1 KEY2
|
|
1516
|
+
<string> <uint8>
|
|
1517
|
+
0 B 4
|
|
1518
|
+
1 D 5
|
|
1498
1519
|
```
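The set operations above treat each row as one element of a set. A minimal plain-Ruby sketch of the same idea, using Arrays of Hashes instead of DataFrames, is shown here. The rows are inferred from the outputs in this section and should be treated as illustrative; the method names merely echo the documented ones and are not RedAmber's implementation.

```ruby
# Illustrative rows; each Hash stands for one record.
DF = [
  { KEY1: 'A', KEY2: 1 },
  { KEY1: 'B', KEY2: 2 },
  { KEY1: 'C', KEY2: 3 }
].freeze

OTHER = [
  { KEY1: 'A', KEY2: 1 },
  { KEY1: 'B', KEY2: 4 },
  { KEY1: 'D', KEY2: 5 }
].freeze

# Rows that appear in both left and right.
def intersect(left, right)
  left & right
end

# Rows that appear in left or right, without duplicates.
def union(left, right)
  left | right
end

# Rows that appear in left but not in right.
def difference(left, right)
  left - right
end

p intersect(DF, OTHER)
p difference(DF, OTHER)
p difference(OTHER, DF)
p union(DF, OTHER).size
```

Note that `difference` is not symmetric: swapping the receiver and the argument returns the rows unique to the other side, which is what the two `difference` examples above demonstrate.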
|
|
1499
1520
|
|
|
1500
1521
|
## Binding
|
|
1501
1522
|
|
|
1502
1523
|
### `concatenate(other)`
|
|
1503
1524
|
|
|
1504
|
-
Concatenate another DataFrame or Table onto the bottom of self. The
|
|
1525
|
+
Concatenate another DataFrame or Table onto the bottom of self. The types of other must be the same as self.
|
|
1505
1526
|
|
|
1506
|
-
The alias is `concat`.
|
|
1527
|
+
The aliases are `concat` and `bind_rows`.
|
|
1507
1528
|
|
|
1508
1529
|
An array of DataFrames or Tables is also acceptable as other.
|
|
1509
1530
|
|
|
@@ -1535,9 +1556,11 @@ When the option `keep_key: true` used, the column `key` will be preserved.
|
|
|
1535
1556
|
3 4 D
|
|
1536
1557
|
```
|
|
1537
1558
|
|
|
1538
|
-
### `merge(other)`
|
|
1559
|
+
### `merge(*other)`
|
|
1560
|
+
|
|
1561
|
+
Merge another DataFrame or Table onto the right side of self, column-wise. The size of other must be the same as self. Self and other must not share the same key.
|
|
1539
1562
|
|
|
1540
|
-
|
|
1563
|
+
The alias is `bind_cols`.
|
|
1541
1564
|
|
|
1542
1565
|
```ruby
|
|
1543
1566
|
df
|
data/doc/DataFrame_Comparison.md
ADDED
|
@@ -0,0 +1,65 @@
|
|
|
1
|
+
# Comparison of DataFrames
|
|
2
|
+
|
|
3
|
+
Compare basic features of RedAmber with Python
|
|
4
|
+
[pandas](https://pandas.pydata.org/),
|
|
5
|
+
R [Tidyverse](https://www.tidyverse.org/) and
|
|
6
|
+
Julia [Dataframes](https://dataframes.juliadata.org/stable/).
|
|
7
|
+
|
|
8
|
+
## Select columns (variables)
|
|
9
|
+
|
|
10
|
+
| Features | RedAmber | Tidyverse | pandas | DataFrames.jl |
|
|
11
|
+
|--- |--- |--- |--- |--- |
|
|
12
|
+
| Select columns as a dataframe | pick, drop, [] | dplyr::select, dplyr::select_if | [], loc[], iloc[], drop, select_dtypes | [], select |
|
|
13
|
+
| Select a column as a vector | [], v | dplyr::pull, [, x] | [], loc[], iloc[] | [!, :x] |
|
|
14
|
+
| Move columns to a new position | pick, [] | relocate | [], reindex, loc[], iloc[] | select,transform |
|
|
15
|
+
|
|
16
|
+
## Select rows (records, observations)
|
|
17
|
+
|
|
18
|
+
| Features | RedAmber | Tidyverse | pandas | DataFrames.jl |
|
|
19
|
+
|--- |--- |--- |--- |--- |
|
|
20
|
+
| Select rows that meet logical criteria as a dataframe | slice, remove, [] | dplyr::filter | [], filter, query, loc[] | filter |
|
|
21
|
+
| Select rows by position as a dataframe | slice, remove, [] | dplyr::slice | iloc[], drop | subset |
|
|
22
|
+
| Move rows to a new position | slice, [] | dplyr::filter, dplyr::slice | reindex, loc[], iloc[] | permute |
|
|
23
|
+
|
|
24
|
+
## Update columns / create new columns
|
|
25
|
+
|
|
26
|
+
|Features | RedAmber | Tidyverse | pandas | DataFrames.jl |
|
|
27
|
+
|--- |--- |--- |--- |--- |
|
|
28
|
+
| Update existing columns | assign | dplyr::mutate | assign, []= | mapcols |
|
|
29
|
+
| Create new columns | assign, assign_left | dplyr::mutate | apply | insertcols,.+ |
|
|
30
|
+
| Compute new columns, drop others | new | transmute | (dfply:)transmute | transform,insertcols,mapcols |
|
|
31
|
+
| Rename columns | rename | dplyr::rename, dplyr::rename_with, purrr::set_names | rename, set_axis | rename |
|
|
32
|
+
| Sort dataframe | sort | dplyr::arrange | sort_values | sort |
|
|
33
|
+
|
|
34
|
+
## Reshape dataframe
|
|
35
|
+
|
|
36
|
+
| Features | RedAmber | Tidyverse | pandas | DataFrames.jl |
|
|
37
|
+
|--- |--- |--- |--- |--- |
|
|
38
|
+
| Gather columns into rows (create a longer dataframe) | to_long | tidyr::pivot_longer | melt | stack |
|
|
39
|
+
| Spread rows into columns (create a wider dataframe) | to_wide | tidyr::pivot_wider | pivot | unstack |
|
|
40
|
+
| Transpose a wide dataframe | transpose | transpose, t | transpose, T | permutedims |
|
|
41
|
+
|
|
42
|
+
## Grouping
|
|
43
|
+
|
|
44
|
+
| Features | RedAmber | Tidyverse | pandas | DataFrames.jl |
|
|
45
|
+
|--- |--- |--- |--- |--- |
|
|
46
|
+
|Grouping | group, group.summarize | dplyr::group_by %>% dplyr::summarise | groupby.agg | combine,groupby |
|
|
47
|
+
|
|
48
|
+
## Combine dataframes or tables
|
|
49
|
+
|
|
50
|
+
| Features | RedAmber | Tidyverse | pandas | DataFrames.jl |
|
|
51
|
+
|--- |--- |--- |--- |--- |
|
|
52
|
+
| Combine additional columns | merge, bind_cols | dplyr::bind_cols | concat | combine |
|
|
53
|
+
| Combine additional rows | concatenate, concat, bind_rows | dplyr::bind_rows | concat | transform |
|
|
54
|
+
| Join right to left, leaving only the matching rows| join, inner_join | dplyr::inner_join | merge | innerjoin |
|
|
55
|
+
| Join right to left, leaving all rows | join, full_join, outer_join | dplyr::full_join | merge | outerjoin |
|
|
56
|
+
| Join matching values to left from right | join, left_join | dplyr::left_join | merge | leftjoin |
|
|
57
|
+
| Join matching values from left to right | join, right_join | dplyr::right_join | merge | rightjoin |
|
|
58
|
+
| Return rows of left that have a match in right | join, semi_join | dplyr::semi_join | [isin] | semijoin |
|
|
59
|
+
| Return rows of left that do not have a match in right | join, anti_join | dplyr::anti_join | [isin] | antijoin |
|
|
60
|
+
| Collect rows that appear in left or right | union | dplyr::union | merge | |
|
|
61
|
+
| Collect rows that appear in both left and right | intersect | dplyr::intersect | merge | |
|
|
62
|
+
| Collect rows that appear in left but not right | difference, setdiff | dplyr::setdiff | merge | |
|
|
63
|
+
|
|
64
|
+
|
|
65
|
+
|
data/doc/SubFrames.md
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
1
|
+
# SubFrames
|
|
2
|
+
|
|
3
|
+
`SubFrames` represents a collection of subsets of a DataFrame.
|
|
4
|
+
It holds an Array of indices, `#subset_indices`, from which an Array of sub DataFrames can be created.
|
|
5
|
+
The concept covers the `group` operation of a DataFrame and rolling window operations, and has broader capabilities.
|
|
6
|
+
|
|
7
|
+
This feature is experimental. It may be removed or changed in the future.
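To make the idea of subset indices concrete, here is a plain-Ruby sketch in which both grouping and rolling windows reduce to lists of row indices, and each list yields one sub frame. It does not use the SubFrames API; the column data and the helpers `subframes`, `group_indices` and `window_indices` are hypothetical.

```ruby
# Hypothetical column-oriented data; not taken from the package.
FRAME = { x: [1, 2, 3, 4, 5, 6], y: %w[A A B B B C] }.freeze

# Build one small frame per list of row indices.
def subframes(frame, subset_indices)
  subset_indices.map do |indices|
    frame.transform_values { |column| column.values_at(*indices) }
  end
end

# Grouping as index lists: one list per distinct value of `key`.
def group_indices(frame, key)
  frame[key].each_index.group_by { |i| frame[key][i] }.values
end

# Rolling windows as index lists: `width` consecutive rows per window.
def window_indices(size, width)
  (0..size - width).map { |start| (start...start + width).to_a }
end

p group_indices(FRAME, :y)
p window_indices(FRAME[:x].size, 3)
subframes(FRAME, group_indices(FRAME, :y)).each { |sub| p sub }
```

Because both operations produce the same kind of index lists, a single `subframes` step serves grouping and windowing alike, which is one way to read the claim that the concept is broader than `group`.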
|
|
8
|
+
|
|
9
|
+
## Create SubFrames
|
|
10
|
+
|
|
11
|
+
## Properties of SubFrames
|