galaaz 0.4.9 → 0.4.10
- checksums.yaml +4 -4
- data/README.md +798 -285
- data/blogs/galaaz_ggplot/galaaz_ggplot.Rmd +3 -12
- data/blogs/galaaz_ggplot/galaaz_ggplot.aux +5 -7
- data/blogs/galaaz_ggplot/galaaz_ggplot.html +69 -29
- data/blogs/galaaz_ggplot/galaaz_ggplot.pdf +0 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-html/midwest_rb.png +0 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-html/scatter_plot_rb.png +0 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-latex/midwest_rb.pdf +0 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-latex/scatter_plot_rb.pdf +0 -0
- data/blogs/galaaz_ggplot/midwest.Rmd +1 -9
- data/blogs/gknit/gknit.Rmd +37 -40
- data/blogs/gknit/gknit.html +32 -30
- data/blogs/gknit/gknit.md +36 -37
- data/blogs/gknit/gknit.pdf +0 -0
- data/blogs/gknit/gknit.tex +35 -37
- data/blogs/manual/manual.Rmd +548 -125
- data/blogs/manual/manual.html +509 -286
- data/blogs/manual/manual.md +798 -285
- data/blogs/manual/manual.pdf +0 -0
- data/blogs/manual/manual.tex +2816 -0
- data/blogs/manual/manual_files/figure-latex/diverging_bar.pdf +0 -0
- data/blogs/nse_dplyr/nse_dplyr.Rmd +240 -74
- data/blogs/nse_dplyr/nse_dplyr.html +191 -87
- data/blogs/nse_dplyr/nse_dplyr.md +361 -107
- data/blogs/nse_dplyr/nse_dplyr.pdf +0 -0
- data/blogs/nse_dplyr/nse_dplyr.tex +1373 -0
- data/blogs/ruby_plot/ruby_plot.Rmd +61 -81
- data/blogs/ruby_plot/ruby_plot.html +54 -57
- data/blogs/ruby_plot/ruby_plot.md +48 -67
- data/blogs/ruby_plot/ruby_plot.pdf +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/dose_len.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_delivery.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_dose.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color2.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_jitter.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_points.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/final_box_plot.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/final_violin_plot.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/violin_with_jitter.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/dose_len.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/facet_by_delivery.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/facet_by_dose.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_by_delivery_color.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_by_delivery_color2.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_with_decorations.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_with_jitter.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_with_points.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/final_box_plot.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/final_violin_plot.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-latex/violin_with_jitter.png +0 -0
- data/lib/R_interface/rdata_frame.rb +0 -12
- data/lib/R_interface/robject.rb +14 -14
- data/lib/R_interface/ruby_extensions.rb +3 -31
- data/lib/R_interface/rvector.rb +0 -12
- data/lib/gknit/knitr_engine.rb +5 -3
- data/lib/util/exec_ruby.rb +22 -61
- data/specs/tmp.rb +26 -12
- data/version.rb +1 -1
- metadata +22 -17
- data/bin/gknit_old_r +0 -236
- data/blogs/dev/dev.Rmd +0 -23
- data/blogs/dev/dev.md +0 -58
- data/blogs/dev/dev2.Rmd +0 -65
- data/blogs/dev/model.rb +0 -41
- data/blogs/dplyr/dplyr.Rmd +0 -29
- data/blogs/dplyr/dplyr.html +0 -433
- data/blogs/dplyr/dplyr.md +0 -58
- data/blogs/dplyr/dplyr.rb +0 -63
- data/blogs/galaaz_ggplot/galaaz_ggplot.log +0 -640
- data/blogs/galaaz_ggplot/galaaz_ggplot.md +0 -431
- data/blogs/galaaz_ggplot/galaaz_ggplot.tex +0 -481
- data/blogs/galaaz_ggplot/midwest.png +0 -0
- data/blogs/galaaz_ggplot/scatter_plot.png +0 -0
- data/blogs/ruby_plot/ruby_plot.tex +0 -1077
@@ -4,7 +4,7 @@ author:
|
|
4
4
|
- "Rodrigo Botafogo"
|
5
5
|
- "Daniel Mossé - University of Pittsburgh"
|
6
6
|
tags: [Tech, Data Science, Ruby, R, GraalVM]
|
7
|
-
date: "
|
7
|
+
date: "10/05/2019"
|
8
8
|
output:
|
9
9
|
html_document:
|
10
10
|
self_contained: true
|
@@ -13,27 +13,41 @@ output:
|
|
13
13
|
includes:
|
14
14
|
in_header: ["../../sty/galaaz.sty"]
|
15
15
|
number_sections: yes
|
16
|
+
toc: true
|
17
|
+
toc_depth: 2
|
18
|
+
md_document:
|
19
|
+
variant: markdown_github
|
20
|
+
fontsize: 11pt
|
16
21
|
---
|
17
22
|
|
18
23
|
|
19
24
|
|
20
25
|
# Introduction
|
21
26
|
|
22
|
-
In this post we will see how to program with
|
27
|
+
In this post we will see how to program with _dplyr_ in Galaaz.
|
23
28
|
|
24
|
-
|
29
|
+
## But first, what is Galaaz??
|
25
30
|
|
26
31
|
Galaaz is a system for tightly coupling Ruby and R. Ruby is a powerful language, with
|
27
32
|
a large community, a very large set of libraries and great for web development. However,
|
28
33
|
it lacks libraries for data science, statistics, scientific plotting and machine learning.
|
29
34
|
On the other hand, R is considered one of the most powerful languages for solving all of the
|
30
35
|
above problems. Maybe the strongest competitor to R is Python with libraries such as NumPy,
|
31
|
-
|
36
|
+
Pandas, SciPy, SciKit-Learn and many more.
|
32
37
|
|
33
38
|
With Galaaz we do not intend to re-implement any of the scientific libraries in R. However, we
|
34
39
|
allow for very tight coupling between the two languages to the point that the Ruby
|
35
|
-
developer does not need to know that there is an R engine running.
|
36
|
-
|
40
|
+
developer does not need to know that there is an R engine running. Also, from the point of
|
41
|
+
view of the R user/developer Galaaz looks a lot like R, with just minor syntactic differences,
|
42
|
+
so there is almost no learning curve for the R developer. And as we will see in this
|
43
|
+
post, programming with _dplyr_ is easier in Galaaz than in R.
|
44
|
+
|
45
|
+
R users are probably quite knowledgeable about _dplyr_. For the Ruby developer, _dplyr_ and
|
46
|
+
the _tidyverse_ libraries are a set of libraries for data manipulation in R, developed by
|
47
|
+
Hadley Wickham, chief scientist at RStudio and a prolific R coder and writer.
|
48
|
+
|
49
|
+
For the coupling of Ruby and R we use new technologies provided by Oracle: GraalVM,
|
50
|
+
TruffleRuby and FastR:
|
37
51
|
|
38
52
|
GraalVM is a universal virtual machine for running applications
|
39
53
|
written in JavaScript, Python 3, Ruby, R, JVM-based languages like Java,
|
@@ -68,10 +82,16 @@ Interested readers should also check out the following sites:
|
|
68
82
|
* [TruffleRuby](https://github.com/oracle/truffleruby)
|
69
83
|
* [FastR](https://github.com/oracle/fastr)
|
70
84
|
* [Faster R with FastR](https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb)
|
85
|
+
* [How to make Beautiful Ruby Plots with Galaaz](https://medium.freecodecamp.org/how-to-make-beautiful-ruby-plots-with-galaaz-320848058857)
|
86
|
+
* [Ruby Plotting with Galaaz: An example of tightly coupling Ruby and R in GraalVM](https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021)
|
87
|
+
* [How to do reproducible research in Ruby with gKnit](https://towardsdatascience.com/how-to-do-reproducible-research-in-ruby-with-gknit-c26d2684d64e)
|
88
|
+
* [R for Data Science](https://r4ds.had.co.nz/)
|
89
|
+
* [Advanced R](https://adv-r.hadley.nz/)
|
71
90
|
|
72
|
-
|
91
|
+
## Programming with dplyr
|
73
92
|
|
74
|
-
|
93
|
+
This post will follow closely the work done in https://dplyr.tidyverse.org/articles/programming.html,
|
94
|
+
by Hadley Wickham. In it, Hadley states:
|
75
95
|
|
76
96
|
> Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that
|
77
97
|
> means they don’t follow the usual R rules of evaluation. Instead, they capture the
|
@@ -80,7 +100,7 @@ According to Hardley (https://dplyr.tidyverse.org/articles/programming.html)
|
|
80
100
|
|
81
101
|
> Operations on data frames can be expressed succinctly because you don’t need to repeat
|
82
102
|
> the name of the data frame. For example, you can write filter(df, x == 1, y == 2, z == 3)
|
83
|
-
> instead of df[df
|
103
|
+
> instead of df[df\$x == 1 & df\$y ==2 & df\$z == 3, ].
|
84
104
|
|
85
105
|
> dplyr can choose to compute results in a different way to base R. This is important for
|
86
106
|
> database backends because dplyr itself doesn’t do any work, but instead generates the SQL
|
@@ -92,29 +112,9 @@ According to Hardley (https://dplyr.tidyverse.org/articles/programming.html)
|
|
92
112
|
> with a seemingly equivalent object that you’ve defined elsewhere. In other words, this code:
|
93
113
|
|
94
114
|
|
95
|
-
|
96
115
|
```r
|
97
116
|
df <- data.frame(x = 1:3, y = 3:1)
|
98
|
-
print(df)
|
99
|
-
```
|
100
|
-
|
101
|
-
```
|
102
|
-
## x y
|
103
|
-
## 1 1 3
|
104
|
-
## 2 2 2
|
105
|
-
## 3 3 1
|
106
|
-
```
|
107
|
-
|
108
|
-
```r
|
109
117
|
print(filter(df, x == 1))
|
110
|
-
```
|
111
|
-
|
112
|
-
```
|
113
|
-
## x y
|
114
|
-
## 1 1 3
|
115
|
-
```
|
116
|
-
|
117
|
-
```r
|
118
118
|
#> # A tibble: 1 x 2
|
119
119
|
#> x y
|
120
120
|
#> <int> <int>
|
@@ -131,15 +131,22 @@ filter(df, my_var == 1)
|
|
131
131
|
```
|
132
132
|
> This makes it hard to create functions with arguments that change how dplyr verbs are computed.
|
133
133
|
|
134
|
+
In this post we will see that programming with _dplyr_ in Galaaz does not require knowledge of
|
135
|
+
non-standard evaluation in R and can be accomplished by utilizing normal Ruby constructs.
|
136
|
+
|
134
137
|
# Writing Expressions in Galaaz
|
135
138
|
|
136
|
-
Galaaz extends Ruby to work with
|
137
|
-
(base R) or 'quo' (tidyverse).
|
139
|
+
Galaaz extends Ruby to work with expressions, similar to R's expressions built with 'quote'
|
140
|
+
(base R) or 'quo' (tidyverse). Expressions in this context are like mathematical expressions or
|
141
|
+
formulae. For instance, in mathematics, the expression $y = sin(x)$ describes a function but cannot
|
142
|
+
be computed unless the value of $x$ is bound to some value.
|
143
|
+
|
144
|
+
Let's take a look at some of those expressions in Ruby:
|
138
145
|
|
139
146
|
## Expressions from operators
|
140
147
|
|
141
|
-
The code bellow
|
142
|
-
|
148
|
+
The code below creates an expression summing two symbols. Note that :a and :b are Ruby symbols and
|
149
|
+
are not bound to any value at the time of expression definition:
|
143
150
|
|
144
151
|
|
145
152
|
```ruby
|
@@ -150,7 +157,7 @@ puts exp1
|
|
150
157
|
```
|
151
158
|
## a + b
|
152
159
|
```
|
153
|
-
We can build any complex mathematical expression
|
160
|
+
We can build any complex mathematical expression such as:
|
154
161
|
|
155
162
|
|
156
163
|
```ruby
|
@@ -161,8 +168,9 @@ puts exp2
|
|
161
168
|
```
|
162
169
|
## (a + b) * 2 + c^2L/z
|
163
170
|
```
|
171
|
+
The 'L' after two indicates that 2 is an integer.
|
164
172
|
|
165
|
-
It is also possible to use inequality operators in building expressions
|
173
|
+
It is also possible to use inequality operators in building expressions:
|
166
174
|
|
167
175
|
|
168
176
|
```ruby
|
@@ -173,6 +181,19 @@ puts exp3
|
|
173
181
|
```
|
174
182
|
## a + b >= z
|
175
183
|
```
|
184
|
+
Expressions' definition can also make use of normal Ruby variables without any problem:
|
185
|
+
|
186
|
+
|
187
|
+
```ruby
|
188
|
+
x = 20
|
189
|
+
y = 30
|
190
|
+
exp_var = (:a + :b) * x <= :z - y
|
191
|
+
puts exp_var
|
192
|
+
```
|
193
|
+
|
194
|
+
```
|
195
|
+
## (a + b) * 20L <= z - 30L
|
196
|
+
```
|
176
197
|
|
177
198
|
Galaaz provides both symbolic representations for operators, such as (>, <, !=), and functional
|
178
199
|
notation for those operators such as (.gt, .ge, etc.). So the same expression written
|
@@ -188,8 +209,9 @@ puts exp4
|
|
188
209
|
## a + b >= z
|
189
210
|
```
|
190
211
|
|
191
|
-
Two type of expression can only be created with the functional representation
|
192
|
-
those are expressions involving '==', and '='. In order to write an
|
212
|
+
Two types of expressions, however, can only be created with the functional representation
|
213
|
+
of the operators: those are expressions involving '==' and '='. In order to write an
|
214
|
+
expression involving '==' we
|
193
215
|
need to use the method '.eq' and for '=' we need the function '.assign'
|
194
216
|
|
195
217
|
|
@@ -228,17 +250,16 @@ puts exp_wrong
|
|
228
250
|
```
|
229
251
|
and it might be difficult to understand what is going on here. The problem lies with the fact that
|
230
252
|
when using '==' we are comparing expression (:a + :b) to expression :z with '=='. When the
|
231
|
-
comparison is executed, the system tries to evaluate :a, :b and :z, and those symbols
|
232
|
-
this time are not bound to anything and we get a "object 'a' not found" message.
|
233
|
-
If we only use functional notation, this type of error will
|
253
|
+
comparison is executed, the system tries to evaluate :a, :b and :z, and those symbols at
|
254
|
+
this time are not bound to anything and we get an "object 'a' not found" message.
|
255
|
+
If we only use functional notation, this type of error will not occur.
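
The difference between the two notations can be sketched in plain Ruby. The `ToyExpr` class below is a hypothetical toy model written only for illustration; it is not how Galaaz is implemented and it does not talk to R. It shows that a method such as `eq` can return a new, unevaluated expression, while Ruby's `==` is executed on the spot. In Galaaz that immediate execution is what leads to the "object 'a' not found" error; in this toy it simply returns `false`.

```ruby
# Toy model for illustration only; this is NOT Galaaz's implementation.
class ToyExpr
  attr_reader :text

  def initialize(text)
    @text = text.to_s
  end

  # Accept symbols, numbers or other ToyExpr objects as operands.
  def self.wrap(operand)
    operand.is_a?(ToyExpr) ? operand : new(operand)
  end

  # '+' returns a new, still unevaluated, expression.
  def +(other)
    ToyExpr.new("#{text} + #{ToyExpr.wrap(other).text}")
  end

  # Functional notation: builds the comparison instead of executing it.
  def eq(other)
    ToyExpr.new("#{text} == #{ToyExpr.wrap(other).text}")
  end

  def to_s
    text
  end
end

sum = ToyExpr.wrap(:a) + :b
deferred = sum.eq(:z)                  # an expression object, nothing is compared yet
immediate = (sum == ToyExpr.wrap(:z))  # plain Ruby '==' is executed right away

puts deferred
puts immediate.inspect
```

The first `puts` shows the deferred comparison as text, the second shows an ordinary boolean, which is no longer an expression that could be handed to a function such as 'filter'.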
|
234
256
|
|
235
257
|
## Expressions with R methods
|
236
258
|
|
237
259
|
It is often necessary to create an expression that uses a method or function. For instance, in
|
238
260
|
mathematics, it's quite natural to write an expression such as $y = sin(x)$. In this case, the
|
239
|
-
'sin' function is part of the expression and should not immediately executed.
|
240
|
-
|
241
|
-
When we want the function to be part of the expression, we call the function preceeding it
|
261
|
+
'sin' function is part of the expression and should not immediately be executed. When we want
|
262
|
+
the function to be part of the expression, we call the function preceding it
|
242
263
|
by the letter E, such as 'E.sin(x)'
|
243
264
|
|
244
265
|
|
@@ -250,28 +271,144 @@ puts exp7
|
|
250
271
|
```
|
251
272
|
## y <- sin(x)
|
252
273
|
```
|
253
|
-
|
254
|
-
|
274
|
+
|
275
|
+
Expressions can also be written using '.' notation:
|
255
276
|
|
256
277
|
|
257
278
|
```ruby
|
258
|
-
|
259
|
-
exp8 = :y.assign R.sin(x)
|
279
|
+
exp8 = :y.assign :x.sin
|
260
280
|
puts exp8
|
261
281
|
```
|
262
282
|
|
283
|
+
```
|
284
|
+
## y <- sin(x)
|
285
|
+
```
|
286
|
+
|
287
|
+
When a function has multiple arguments, the first one can be used before the '.':
|
288
|
+
|
289
|
+
|
290
|
+
```ruby
|
291
|
+
exp9 = :x.c(:y)
|
292
|
+
puts exp9
|
293
|
+
```
|
294
|
+
|
295
|
+
```
|
296
|
+
## c(x, y)
|
297
|
+
```
|
298
|
+
|
299
|
+
## Evaluating an Expression
|
300
|
+
|
301
|
+
Expressions can be evaluated by calling function 'eval' with a binding. A binding can be provided
|
302
|
+
with a list:
|
303
|
+
|
304
|
+
|
305
|
+
```ruby
|
306
|
+
exp = (:a + :b) * 2.0 + :c ** 2 / :z
|
307
|
+
puts exp.eval(R.list(a: 10, b: 20, c: 30, z: 40))
|
308
|
+
```
|
309
|
+
|
310
|
+
```
|
311
|
+
## [1] 82.5
|
312
|
+
```
|
313
|
+
|
314
|
+
... with a data frame:
|
315
|
+
|
316
|
+
|
317
|
+
```ruby
|
318
|
+
df = R.data__frame(
|
319
|
+
a: R.c(1, 2, 3),
|
320
|
+
b: R.c(10, 20, 30),
|
321
|
+
c: R.c(100, 200, 300),
|
322
|
+
z: R.c(1000, 2000, 3000))
|
323
|
+
|
324
|
+
puts exp.eval(df)
|
325
|
+
```
|
326
|
+
|
327
|
+
```
|
328
|
+
## [1] 32 64 96
|
329
|
+
```
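
The two results above can be checked by hand. The sketch below is plain Ruby with no Galaaz involved; the method name `formula` is made up for this check. It applies the same arithmetic once to the list values and once per row of the data frame. Note that the division is forced to floating point, since R divides as doubles while Ruby's integer division would truncate.

```ruby
# Plain-Ruby stand-in for the expression (:a + :b) * 2.0 + :c ** 2 / :z.
# 'formula' is a hypothetical helper, not part of Galaaz.
def formula(a:, b:, c:, z:)
  (a + b) * 2.0 + c**2 / z.to_f
end

# Same values as R.list(a: 10, b: 20, c: 30, z: 40)
single = formula(a: 10, b: 20, c: 30, z: 40)

# Same values as the data frame, one array per row: [a, b, c, z]
rows = [[1, 10, 100, 1000], [2, 20, 200, 2000], [3, 30, 300, 3000]]
per_row = rows.map { |a, b, c, z| formula(a: a, b: b, c: c, z: z) }

puts single
puts per_row.inspect
```

For the list, (10 + 20) * 2.0 = 60 and 30^2 / 40 = 22.5, which gives 82.5; the per-row values work out to 32, 64 and 96, matching the output printed by 'eval'.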
|
330
|
+
|
331
|
+
# Using Galaaz to call R functions
|
332
|
+
|
333
|
+
Galaaz tries to emulate as closely as possible the way R functions are called and migrating from
|
334
|
+
R to Galaaz should be quite easy requiring only minor syntactic changes to an R script. In
|
335
|
+
this post, we do not have enough space to write a complete manual on Galaaz
|
336
|
+
(a short manual can be found at: https://www.rubydoc.info/gems/galaaz/0.4.9), so we will
|
337
|
+
present only a few example scripts using Galaaz.
|
338
|
+
|
339
|
+
Basically, to call an R function from Ruby with Galaaz, one only needs to precede the function
|
340
|
+
with 'R.'. For instance, to create a vector in R, the 'c' function is used. From Galaaz, a
|
341
|
+
vector can be created by using 'R.c':
|
342
|
+
|
343
|
+
|
344
|
+
```ruby
|
345
|
+
vec = R.c(1.0, 2, 3)
|
346
|
+
puts vec
|
347
|
+
```
|
348
|
+
|
349
|
+
```
|
350
|
+
## [1] 1 2 3
|
351
|
+
```
|
352
|
+
A list is created in R with the 'list' function, so in Galaaz we do:
|
353
|
+
|
354
|
+
|
355
|
+
```ruby
|
356
|
+
list = R.list(a: 1.0, b: 2, c: 3)
|
357
|
+
puts list
|
358
|
+
```
|
359
|
+
|
360
|
+
```
|
361
|
+
## $a
|
362
|
+
## [1] 1
|
363
|
+
##
|
364
|
+
## $b
|
365
|
+
## [1] 2
|
366
|
+
##
|
367
|
+
## $c
|
368
|
+
## [1] 3
|
369
|
+
```
|
370
|
+
Note that we can use named arguments in our list. The same code in R would be:
|
371
|
+
|
372
|
+
|
373
|
+
```r
|
374
|
+
lst = list(a = 1, b = 2L, c = 3L)
|
375
|
+
print(lst)
|
376
|
+
```
|
377
|
+
|
378
|
+
```
|
379
|
+
## $a
|
380
|
+
## [1] 1
|
381
|
+
##
|
382
|
+
## $b
|
383
|
+
## [1] 2
|
384
|
+
##
|
385
|
+
## $c
|
386
|
+
## [1] 3
|
387
|
+
```
|
388
|
+
Now, let's say that 'x' is an angle of 45 radians and we actually want to create
|
389
|
+
the expression $y = sin(45)$, which is $y = 0.850...$ (R's 'sin' works in radians). In this case,
|
390
|
+
we will use 'R.sin':
|
391
|
+
|
392
|
+
|
393
|
+
```ruby
|
394
|
+
exp10 = :y.assign R.sin(45)
|
395
|
+
puts exp10
|
396
|
+
```
|
397
|
+
|
263
398
|
```
|
264
399
|
## y <- 0.850903524534118
|
265
400
|
```
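
A note on units: R's 'sin', like Ruby's `Math.sin`, takes its argument in radians, so the value 0.8509... above is the sine of 45 radians. The sine of 45 degrees is about 0.7071. A quick check in plain Ruby, without Galaaz (the variable names are only illustrative):

```ruby
# Math.sin, like R's sin, interprets its argument as radians.
radians_result = Math.sin(45)                   # sine of 45 radians
degrees_result = Math.sin(45 * Math::PI / 180)  # sine of 45 degrees

puts radians_result.round(6)
puts degrees_result.round(6)
```

To get the sine of an angle given in degrees, convert it to radians first, as in the second line.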
|
401
|
+
|
266
402
|
# Filtering using expressions
|
267
403
|
|
268
|
-
Now that we
|
269
|
-
Let's first start by creating
|
404
|
+
Now that we know how to write expressions and call R functions, let's do some data manipulation in
|
405
|
+
Galaaz. Let's first start by creating the same data frame that we created previously in section
|
406
|
+
"Programming with dplyr":
|
270
407
|
|
271
408
|
|
272
409
|
```ruby
|
273
|
-
|
274
|
-
puts
|
410
|
+
df = R.data__frame(x: (1..3), y: (3..1))
|
411
|
+
puts df
|
275
412
|
```
|
276
413
|
|
277
414
|
```
|
@@ -280,12 +417,17 @@ puts @df
|
|
280
417
|
## 2 2 2
|
281
418
|
## 3 3 1
|
282
419
|
```
|
283
|
-
|
284
|
-
|
420
|
+
The 'filter' function can be called on this data frame either by using 'R.filter(df, ...)' or
|
421
|
+
by using dot notation. We prefer to use dot notation as shown below. The argument to 'filter'
|
422
|
+
in Galaaz should be an expression. Note that if we gave to filter a Ruby expression such as
|
423
|
+
'x == 1', we would get an error, since there is no variable 'x' defined and if 'x' was a variable
|
424
|
+
then 'x == 1' would either be 'true' or 'false'. Our goal is to filter our data frame returning
|
425
|
+
all rows in which the 'x' value is equal to 1. To express this we want: ':x.eq 1', where :x will
|
426
|
+
be interpreted by filter as the 'x' column.
|
285
427
|
|
286
428
|
|
287
429
|
```ruby
|
288
|
-
puts
|
430
|
+
puts df.filter(:x.eq 1)
|
289
431
|
```
|
290
432
|
|
291
433
|
```
|
@@ -294,7 +436,7 @@ puts @df.filter(:x.eq 1)
|
|
294
436
|
```
|
295
437
|
|
296
438
|
In R, and when coding with 'tidyverse', arguments to a function are usually not
|
297
|
-
*referencially transparent*. That is,
|
439
|
+
*referentially transparent*. That is, you can’t replace a value with a seemingly equivalent
|
298
440
|
object that you’ve defined elsewhere. In other words, this code
|
299
441
|
|
300
442
|
|
@@ -304,8 +446,8 @@ filter(df, my_var == 1)
|
|
304
446
|
```
|
305
447
|
Generates the following error: "object 'x' not found".
|
306
448
|
|
307
|
-
However, in
|
308
|
-
code bellow. Note
|
449
|
+
However, in Galaaz, arguments are referentially transparent as can be seen by the
|
450
|
+
code below. Note initially that 'my_var = :x' will not give the error "object 'x' not found"
|
309
451
|
since ':x' is treated as an expression and assigned to my\_var. Then when doing (my\_var.eq 1),
|
310
452
|
my\_var is a variable that resolves to ':x' and it becomes equivalent to (:x.eq 1) which is
|
311
453
|
what we want.
|
@@ -313,7 +455,7 @@ what we want.
|
|
313
455
|
|
314
456
|
```ruby
|
315
457
|
my_var = :x
|
316
|
-
puts
|
458
|
+
puts df.filter(my_var.eq 1)
|
317
459
|
```
|
318
460
|
|
319
461
|
```
|
@@ -333,17 +475,17 @@ df[x == y, ]
|
|
333
475
|
```
|
334
476
|
In galaaz this ambiguity does not exist, filter(df, x.eq y) is not a valid expression as
|
335
477
|
expressions are built with symbols. In doing filter(df, :x.eq y) we are looking for elements
|
336
|
-
of the 'x' column that are equal to a previously defined y variable. Finally
|
478
|
+
of the 'x' column that are equal to a previously defined y variable. Finally, in
|
337
479
|
filter(df, :x.eq :y) we are looking for elements in which the 'x' column value is equal to
|
338
480
|
the 'y' column value. This can be seen in the following two chunks of code:
|
339
481
|
|
340
482
|
|
341
483
|
```ruby
|
342
|
-
|
343
|
-
|
484
|
+
y = 1
|
485
|
+
x = 2
|
344
486
|
|
345
487
|
# looking for values where the 'x' column is equal to the 'y' column
|
346
|
-
puts
|
488
|
+
puts df.filter(:x.eq :y)
|
347
489
|
```
|
348
490
|
|
349
491
|
```
|
@@ -355,7 +497,7 @@ puts @df.filter(:x.eq :y)
|
|
355
497
|
```ruby
|
356
498
|
# looking for values where the 'x' column is equal to the 'y' variable
|
357
499
|
# in this case, the number 1
|
358
|
-
puts
|
500
|
+
puts df.filter(:x.eq y)
|
359
501
|
```
|
360
502
|
|
361
503
|
```
|
@@ -364,7 +506,11 @@ puts @df.filter(:x.eq @y)
|
|
364
506
|
```
|
365
507
|
# Writing a function that applies to different data sets
|
366
508
|
|
509
|
+
Let's suppose that we want to write a function that receives as the first argument a data frame
|
510
|
+
and as second argument an expression that adds a column to the data frame that is equal to the
|
511
|
+
sum of elements in column 'a' plus 'x'.
|
367
512
|
|
513
|
+
Here is the intended behaviour using the 'mutate' function of 'dplyr':
|
368
514
|
|
369
515
|
```
|
370
516
|
mutate(df1, y = a + x)
|
@@ -372,8 +518,18 @@ mutate(df2, y = a + x)
|
|
372
518
|
mutate(df3, y = a + x)
|
373
519
|
mutate(df4, y = a + x)
|
374
520
|
```
|
521
|
+
The naive approach to writing an R function to solve this problem is:
|
522
|
+
|
523
|
+
```
|
524
|
+
mutate_y <- function(df) {
|
525
|
+
mutate(df, y = a + x)
|
526
|
+
}
|
527
|
+
```
|
528
|
+
Unfortunately, in R, this function can fail silently if one of the variables isn’t present
|
529
|
+
in the data frame, but is present in the global environment. We will not go through here how
|
530
|
+
to solve this problem in R.
|
375
531
|
|
376
|
-
|
532
|
+
In Galaaz the method mutate_y below will work fine and will never fail silently.
|
377
533
|
|
378
534
|
|
379
535
|
```ruby
|
@@ -381,14 +537,27 @@ def mutate_y(df)
|
|
381
537
|
df.mutate(:y.assign :a + :x)
|
382
538
|
end
|
383
539
|
```
|
384
|
-
|
385
|
-
Note that contrary to what happens in R, method mutate_y will fail independetly from the fact
|
386
|
-
that variable 'a' is defined or not.
|
540
|
+
Here we create a data frame that has only one column named 'x':
|
387
541
|
|
388
542
|
|
389
543
|
```ruby
|
390
544
|
df1 = R.data__frame(x: (1..3))
|
391
545
|
puts df1
|
546
|
+
```
|
547
|
+
|
548
|
+
```
|
549
|
+
## x
|
550
|
+
## 1 1
|
551
|
+
## 2 2
|
552
|
+
## 3 3
|
553
|
+
```
|
554
|
+
|
555
|
+
Note that method mutate_y will fail independently of the fact that variable 'a' is defined and
|
556
|
+
in the scope of the method. Variable 'a' has no relationship with the symbol ':a' used in the
|
557
|
+
definition of 'mutate\_y' above:
|
558
|
+
|
559
|
+
|
560
|
+
```ruby
|
392
561
|
a = 10
|
393
562
|
mutate_y(df1)
|
394
563
|
```
|
@@ -402,11 +571,17 @@ mutate_y(df1)
|
|
402
571
|
## mismatched protect/unprotect (unprotect with empty protect stack) (RError)
|
403
572
|
## Translated to internal error
|
404
573
|
```
|
405
|
-
|
406
574
|
# Different expressions
|
407
575
|
|
576
|
+
Let's move to the next problem as presented by Hadley, where trying to write a function in R
|
577
|
+
that will receive two arguments, the first a variable and the second an expression, is not trivial.
|
578
|
+
Below we create a data frame and we want to write a function that groups data by a variable and
|
579
|
+
summarises it by an expression:
|
580
|
+
|
408
581
|
|
409
582
|
```r
|
583
|
+
set.seed(123)
|
584
|
+
|
410
585
|
df <- data.frame(
|
411
586
|
g1 = c(1, 1, 2, 2, 2),
|
412
587
|
g2 = c(1, 2, 1, 2, 1),
|
@@ -414,6 +589,19 @@ df <- data.frame(
|
|
414
589
|
b = sample(5)
|
415
590
|
)
|
416
591
|
|
592
|
+
as.data.frame(df)
|
593
|
+
```
|
594
|
+
|
595
|
+
```
|
596
|
+
## g1 g2 a b
|
597
|
+
## 1 1 1 2 1
|
598
|
+
## 2 1 2 4 3
|
599
|
+
## 3 2 1 5 4
|
600
|
+
## 4 2 2 3 2
|
601
|
+
## 5 2 1 1 5
|
602
|
+
```
|
603
|
+
|
604
|
+
```r
|
417
605
|
d2 <- df %>%
|
418
606
|
group_by(g1) %>%
|
419
607
|
summarise(a = mean(a))
|
@@ -437,13 +625,11 @@ as.data.frame(d2)
|
|
437
625
|
|
438
626
|
```
|
439
627
|
## g2 a
|
440
|
-
## 1 1
|
441
|
-
## 2 2
|
628
|
+
## 1 1 2.666667
|
629
|
+
## 2 2 3.500000
|
442
630
|
```
|
443
631
|
|
444
|
-
|
445
|
-
the second an expression is not trivia. As shown by Hardley, one might expect this function
|
446
|
-
to do the trick:
|
632
|
+
As shown by Hadley, one might expect this function to do the trick:
|
447
633
|
|
448
634
|
|
449
635
|
```r
|
@@ -458,11 +644,13 @@ my_summarise <- function(df, group_var) {
|
|
458
644
|
```
|
459
645
|
|
460
646
|
In order to solve this problem, coding with dplyr requires the introduction of many new concepts
|
461
|
-
and functions such as 'quo', 'quos', 'enquo', 'enquos', '!!' (bang bang), '!!!' (triple bang).
|
647
|
+
and functions such as 'quo', 'quos', 'enquo', 'enquos', '!!' (bang bang), '!!!' (triple bang).
|
648
|
+
Again, we'll leave to Hadley the explanation on how to use all those functions.
|
462
649
|
|
463
650
|
Now, let's try to implement the same function in galaaz. The next code block first prints the
|
464
|
-
'df' data frame define previously in R
|
465
|
-
|
651
|
+
'df' data frame defined previously in R (to access an R variable from Galaaz, we use the tilde
|
652
|
+
operator '~' applied to the R variable name as a symbol, i.e., ':df'). We then create the
|
653
|
+
'my_summarize' method and call it passing the R data frame and the group by variable ':g1':
|
466
654
|
|
467
655
|
|
468
656
|
```ruby
|
@@ -471,35 +659,35 @@ print "\n"
|
|
471
659
|
|
472
660
|
def my_summarize(df, group_var)
|
473
661
|
df.group_by(group_var).
|
474
|
-
summarize(a:
|
662
|
+
summarize(a: :a.mean)
|
475
663
|
end
|
476
664
|
|
477
|
-
puts my_summarize(
|
665
|
+
puts my_summarize(:df, :g1).as__data__frame
|
478
666
|
```
|
479
667
|
|
480
668
|
```
|
481
669
|
## g1 g2 a b
|
482
|
-
## 1 1 1
|
483
|
-
## 2 1 2
|
484
|
-
## 3 2 1
|
485
|
-
## 4 2 2 3
|
486
|
-
## 5 2 1
|
670
|
+
## 1 1 1 2 1
|
671
|
+
## 2 1 2 4 3
|
672
|
+
## 3 2 1 5 4
|
673
|
+
## 4 2 2 3 2
|
674
|
+
## 5 2 1 1 5
|
487
675
|
##
|
488
676
|
## g1 a
|
489
677
|
## 1 1 3
|
490
678
|
## 2 2 3
|
491
679
|
```
|
492
|
-
It works!!! Well let's make sure this was not just some coincidence
|
680
|
+
It works!!! Well, let's make sure this was not just some coincidence
|
493
681
|
|
494
682
|
|
495
683
|
```ruby
|
496
|
-
puts my_summarize(
|
684
|
+
puts my_summarize(:df, :g2).as__data__frame
|
497
685
|
```
|
498
686
|
|
499
687
|
```
|
500
688
|
## g2 a
|
501
|
-
## 1 1
|
502
|
-
## 2 2
|
689
|
+
## 1 1 2.666667
|
690
|
+
## 2 2 3.500000
|
503
691
|
```
|
504
692
|
|
505
693
|
Great, everything is fine! No magic, no new functions, no complexities, just normal, standard Ruby
|
@@ -508,7 +696,7 @@ code. If you've ever done NSE in R, this certainly feels much safer and easy to
|
|
508
696
|
# Different input variables
|
509
697
|
|
510
698
|
In the previous section we've managed to get rid of all NSE formulation for a simple example, but
|
511
|
-
does this remain true for more complex examples, or will the
|
699
|
+
does this remain true for more complex examples, or will the Galaaz way prove impractical for
|
512
700
|
more complex code?
|
513
701
|
|
514
702
|
In the next example Hadley proposes that we write a function that given an expression such as 'a'
|
@@ -526,7 +714,7 @@ summarise(df, mean = mean(a * b), sum = sum(a * b), n = n())
|
|
526
714
|
#> # A tibble: 1 x 3
|
527
715
|
#> mean sum n
|
528
716
|
#> <dbl> <int> <int>
|
529
|
-
#> 1 9
|
717
|
+
#> 1 9 45 5
|
530
718
|
```
|
531
719
|
|
532
720
|
Let's try it in galaaz:
|
@@ -549,11 +737,11 @@ puts my_summarise2((~:df), :a * :b)
|
|
549
737
|
## mean sum n
|
550
738
|
## 1 3 15 5
|
551
739
|
## mean sum n
|
552
|
-
## 1
|
740
|
+
## 1 9 45 5
|
553
741
|
```
|
554
742
|
|
555
743
|
Once again, there is no need to use any special theory or functions. The only point to be
|
556
|
-
careful about is the use of 'E' to build
|
744
|
+
careful about is the use of 'E' to build expressions from functions 'mean', 'sum' and 'n'.
|
557
745
|
|
558
746
|
# Different input and output variable
|
559
747
|
|
@@ -583,8 +771,10 @@ mutate(df, mean_b = mean(b), sum_b = sum(b))
|
|
583
771
|
#> 4 2 2 5 4 3 15
|
584
772
|
#> # … with 1 more row
|
585
773
|
```
|
774
|
+
In order to solve this problem in R, Hadley needs to introduce some more new functions and notations:
|
775
|
+
'quo_name' and the ':=' operator from package 'rlang'
|
586
776
|
|
587
|
-
Here is our Ruby code
|
777
|
+
Here is our Ruby code:
|
588
778
|
|
589
779
|
|
590
780
|
```ruby
|
@@ -602,17 +792,17 @@ puts my_mutate((~:df), :b)
|
|
602
792
|
|
603
793
|
```
|
604
794
|
## g1 g2 a b mean_a sum_a
|
605
|
-
## 1 1 1
|
606
|
-
## 2 1 2
|
607
|
-
## 3 2 1
|
608
|
-
## 4 2 2 3
|
609
|
-
## 5 2 1
|
795
|
+
## 1 1 1 2 1 3 15
|
796
|
+
## 2 1 2 4 3 3 15
|
797
|
+
## 3 2 1 5 4 3 15
|
798
|
+
## 4 2 2 3 2 3 15
|
799
|
+
## 5 2 1 1 5 3 15
|
610
800
|
## g1 g2 a b mean_b sum_b
|
611
|
-
## 1 1 1
|
612
|
-
## 2 1 2
|
613
|
-
## 3 2 1
|
614
|
-
## 4 2 2 3
|
615
|
-
## 5 2 1
|
801
|
+
## 1 1 1 2 1 3 15
|
802
|
+
## 2 1 2 4 3 3 15
|
803
|
+
## 3 2 1 5 4 3 15
|
804
|
+
## 4 2 2 3 2 3 15
|
805
|
+
## 5 2 1 1 5 3 15
|
616
806
|
```
|
617
807
|
It really seems that "Non Standard Evaluation" is actually quite standard in Galaaz! But, you
|
618
808
|
might have noticed a small change in the way the arguments to the mutate method were called.
|
@@ -624,6 +814,12 @@ and variable mean\_name is not followed by ':' but by '=>'. This is standard Ru
|
|
624
814
|
|
625
815
|
# Capturing multiple variables
|
626
816
|
|
817
|
+
Moving on with new complexities, Hadley proposes that we solve the problem in which the
|
818
|
+
summarise function will receive any number of grouping variables.
|
819
|
+
|
820
|
+
This again is quite standard Ruby. In order to receive an undefined number of parameters
|
821
|
+
the parameter is preceded by '*':
|
822
|
+
|
627
823
|
|
628
824
|
```ruby
|
629
825
|
def my_summarise3(df, *group_vars)
|
@@ -636,14 +832,58 @@ puts my_summarise3((~:df), :g1, :g2).as__data__frame
|
|
636
832
|
|
637
833
|
```
|
638
834
|
## g1 g2 a
|
639
|
-
## 1 1 1
|
640
|
-
## 2 1 2
|
835
|
+
## 1 1 1 2
|
836
|
+
## 2 1 2 4
|
641
837
|
## 3 2 1 3
|
642
838
|
## 4 2 2 3
|
643
839
|
```
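
The splat parameter itself can be tried without Galaaz. The sketch below is plain Ruby: the rows are copied from the 'df' data frame shown earlier, and `mean_of_a_by` is a hypothetical helper that only mimics 'group_by' followed by 'summarise' using Ruby hashes. It is not the Galaaz or dplyr implementation.

```ruby
# Rows copied from the 'df' data frame printed earlier in this post.
ROWS = [
  { g1: 1, g2: 1, a: 2, b: 1 },
  { g1: 1, g2: 2, a: 4, b: 3 },
  { g1: 2, g2: 1, a: 5, b: 4 },
  { g1: 2, g2: 2, a: 3, b: 2 },
  { g1: 2, g2: 1, a: 1, b: 5 }
].freeze

# Hypothetical helper: '*group_vars' collects any number of grouping symbols
# into an array, which is then used to build the grouping key of each row.
def mean_of_a_by(rows, *group_vars)
  rows.group_by { |row| row.values_at(*group_vars) }
      .transform_values { |group| group.sum { |row| row[:a] }.to_f / group.size }
end

by_two_vars = mean_of_a_by(ROWS, :g1, :g2)
by_one_var  = mean_of_a_by(ROWS, :g2)

by_two_vars.each { |key, mean| puts "#{key.inspect} -> #{mean}" }
by_one_var.each  { |key, mean| puts "#{key.inspect} -> #{mean.round(6)}" }
```

Called with ':g1, :g2' the means are 2, 4, 3 and 3, the same values as in the output above; called with ':g2' alone they are roughly 2.666667 and 3.5, as in the earlier 'my_summarize' example.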
|
644
840
|
|
841
|
+
# Why does R require NSE and Galaaz does not?
|
842
|
+
|
843
|
+
NSE introduces a number of new concepts, such as 'quoting', 'quasiquotation', 'unquoting' and
|
844
|
+
'unquote-splicing', while in Galaaz none of those concepts are needed. What gives?
|
845
|
+
|
846
|
+
R is an extremely flexible language and it has lazy evaluation of parameters. When in R a
|
847
|
+
function is called as 'summarise(df, a = b)', the summarise function receives the literal
|
848
|
+
'a = b' parameter and can work with this as if it were a string. In R, it is not clear what
|
849
|
+
a and b are: they can be expressions or they can be variables, and it is up to the function to
|
850
|
+
decide what 'a = b' means.
|
851
|
+
|
852
|
+
In Ruby, there is no lazy evaluation of parameters and 'a' is always a variable and so is 'b'.
|
853
|
+
Variables assume their value as soon as they are used, so 'x = a' is immediately evaluated and
|
854
|
+
variable 'x' will receive the value of variable 'a' as soon as the Ruby statement is executed.
|
855
|
+
Ruby also provides the notion of a symbol; ':a' is a symbol and does not evaluate to anything.
|
856
|
+
Galaaz uses Ruby symbols to build expressions that are not bound to anything: ':a.eq :b' is
|
857
|
+
clearly an expression and has no relationship whatsoever with the statement 'a = b'. By using
|
858
|
+
symbols, variables and expressions all the possible ambiguities that are found in R are
|
859
|
+
eliminated in Galaaz.
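
The difference between eagerly evaluated variables and symbols can be sketched in plain Ruby, without Galaaz. The helper `show` below is hypothetical and only reports what a method actually receives:

```ruby
# Hypothetical helper for illustration only; not part of Galaaz.
def show(arg)
  "received #{arg.inspect} (#{arg.class})"
end

a = 1
b = 2

# Arguments are evaluated before the call: the method sees only the
# resulting value, never the text 'a == b'.
puts show(a == b)

# Assignment is evaluated immediately: x takes the current value of a
# and is not affected when a changes later.
x = a
a = 10
puts show(x)

# A symbol is just a name; Ruby does not look it up as a variable.
puts show(:a)
```

Since the method only ever sees a value or a symbol, there is nothing left for it to "quote" or "unquote".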
|
860
|
+
|
861
|
+
The main problem that remains is that in R, functions are not clearly documented as to what type
|
862
|
+
of input they are expecting: they might be expecting regular variables or they might be
|
863
|
+
expecting expressions and the R function will know how to deal with an input of the form
|
864
|
+
'a = b'. For the Ruby developer, however, it might not be immediately clear if it should call the
|
865
|
+
function passing the value 'true' if variable 'a' is equal to variable 'b' or if it should
|
866
|
+
call the function passing the expression ':a.eq :b'.
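
The two possible calls can be contrasted in a plain-Ruby sketch. The `Expr` struct and the `describe` method are hypothetical stand-ins, not Galaaz's actual expression objects; they only show that a value and an unevaluated expression are distinct things by the time they reach a method:

```ruby
# Hypothetical stand-in for an unevaluated expression; Galaaz's real
# expression objects are not shown here.
Expr = Struct.new(:op, :lhs, :rhs) do
  def to_s
    "#{lhs} #{op} #{rhs}"
  end
end

# Hypothetical receiver that accepts either kind of input
def describe(arg)
  case arg
  when Expr then "expression: #{arg}"
  else "value: #{arg.inspect}"
  end
end

a = 1
b = 1

# Passing the comparison: it is evaluated first, so only 'true' arrives
puts describe(a == b)

# Passing an expression built from symbols: nothing is evaluated
puts describe(Expr.new("==", :a, :b))
```

Which of the two the underlying R function expects still has to be checked in that function's documentation.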
|
867
|
+
|
868
|
+
|
645
869
|
# Advanced dplyr features
|
646
|
-
|
870
|
+
|
871
|
+
In the blog post "Programming with dplyr by using dplyr" (https://www.r-bloggers.com/programming-with-dplyr-by-using-dplyr/), Iñaki Úcar shows surprise that some R users are trying to code in dplyr avoiding
|
872
|
+
the use of NSE. For instance he says:
|
873
|
+
|
874
|
+
> Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to
|
875
|
+
> program over dplyr without having “to bring in (or study) any deep-theory or
|
876
|
+
> heavy-weight tools such as rlang/tidyeval”.
|
877
|
+
|
878
|
+
For me, there isn't really any surprise that users are trying to avoid dplyr deep-theory. R
|
879
|
+
users frequently are not programmers, and learning to code is already hard business; on top
|
880
|
+
of that, having to learn how to 'quote' or 'enquo' or 'quos' or 'enquos' is not necessarily
|
881
|
+
a 'piece of cake'. So much so, that 'tidyeval' has some more advanced functions that instead
|
882
|
+
of using quoted expressions, use strings as arguments.
|
883
|
+
|
884
|
+
In the following examples, we show the use of functions 'group\_by\_at', 'summarise\_at' and
|
885
|
+
'rename\_at' that receive strings as arguments. The data frame used is 'starwars', which describes
|
886
|
+
features of characters in the Star Wars movies:
|
647
887
|
|
648
888
|
|
649
889
|
```ruby
|
@@ -680,6 +920,8 @@ puts (~:starwars).head.as__data__frame
|
|
680
920
|
## 5 Imperial Speeder Bike
|
681
921
|
## 6
|
682
922
|
```
|
923
|
+
The grouped_mean function below will receive a grouping variable and calculate summaries for
|
924
|
+
the value\_variables given:
|
683
925
|
|
684
926
|
|
685
927
|
```r
|
@@ -716,6 +958,8 @@ as.data.frame(gm)
|
|
716
958
|
## 15 yellow 81.11111 76.38000 11
|
717
959
|
```
|
718
960
|
|
961
|
+
The same code with Galaaz becomes:
|
962
|
+
|
719
963
|
|
720
964
|
```ruby
|
721
965
|
def grouped_mean(data, grouping_variables, value_variables)
|
@@ -723,10 +967,10 @@ def grouped_mean(data, grouping_variables, value_variables)
|
|
723
967
|
group_by_at(grouping_variables).
|
724
968
|
mutate(count: E.n).
|
725
969
|
summarise_at(E.c(value_variables, "count"), ~:mean, na__rm: true).
|
726
|
-
rename_at(value_variables,
|
970
|
+
rename_at(value_variables, E.funs(E.paste0("mean_", value_variables)))
|
727
971
|
end
|
728
972
|
|
729
|
-
puts grouped_mean((~:starwars), "eye_color",
|
973
|
+
puts grouped_mean((~:starwars), "eye_color", E.c("mass", "birth_year")).as__data__frame
|
730
974
|
```
|
731
975
|
|
732
976
|
```
|
@@ -747,3 +991,13 @@ puts grouped_mean((~:starwars), "eye_color", R.c("mass", "birth_year")).as__data
|
|
747
991
|
## 14 white 48.00000 NaN 1
|
748
992
|
## 15 yellow 81.11111 76.38000 11
|
749
993
|
```
|
994
|
+
|
995
|
+
# Conclusion
|
996
|
+
|
997
|
+
Ruby and Galaaz provide a nice framework for developing code that uses R functions. Although R is
|
998
|
+
a very powerful and flexible language, sometimes too much flexibility makes life harder for
|
999
|
+
the casual user. We believe, however, that even for the advanced user, Ruby integrated
|
1000
|
+
with R through Galaaz makes a powerful environment for data analysis. In this blog post we
|
1001
|
+
showed how Galaaz's consistent syntax eliminates the need for complex constructs such as quoting,
|
1002
|
+
enquoting, quasiquotation, etc. This simplification comes from the fact that expressions and
|
1003
|
+
variables are clearly separated objects, which is not the case in the R language.
|