galaaz 0.4.6 → 0.5.0
- checksums.yaml +5 -5
- data/README.md +3575 -118
- data/Rakefile +21 -4
- data/bin/gknit +152 -6
- data/bin/gknit-draft +105 -0
- data/bin/gknit-draft.rb +28 -0
- data/bin/gknit_Rscript +127 -0
- data/bin/grun +27 -1
- data/bin/gstudio +47 -4
- data/bin/{gstudio.rb → gstudio_irb.rb} +0 -0
- data/bin/gstudio_pry.rb +7 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot.Rmd +3 -12
- data/blogs/galaaz_ggplot/galaaz_ggplot.html +77 -222
- data/blogs/galaaz_ggplot/galaaz_ggplot.md +4 -31
- data/blogs/galaaz_ggplot/galaaz_ggplot.pdf +0 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-html/midwest_rb.png +0 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-html/scatter_plot_rb.png +0 -0
- data/blogs/galaaz_ggplot/midwest.Rmd +1 -9
- data/blogs/gknit/gknit.Rmd +232 -123
- data/blogs/{dev/dev.html → gknit/gknit.html} +1897 -33
- data/blogs/gknit/gknit.pdf +0 -0
- data/blogs/gknit/lst.rds +0 -0
- data/blogs/gknit/stats.bib +27 -0
- data/blogs/manual/lst.rds +0 -0
- data/blogs/manual/manual.Rmd +1893 -47
- data/blogs/manual/manual.html +3153 -347
- data/blogs/manual/manual.md +3575 -118
- data/blogs/manual/manual.pdf +0 -0
- data/blogs/manual/manual.tex +4026 -0
- data/blogs/manual/manual_files/figure-html/bubble-1.png +0 -0
- data/blogs/manual/manual_files/figure-html/diverging_bar.png +0 -0
- data/blogs/manual/manual_files/figure-latex/bubble-1.png +0 -0
- data/blogs/manual/manual_files/figure-latex/diverging_bar.pdf +0 -0
- data/blogs/{dev → manual}/model.rb +0 -0
- data/blogs/nse_dplyr/nse_dplyr.Rmd +849 -0
- data/blogs/nse_dplyr/nse_dplyr.html +878 -0
- data/blogs/nse_dplyr/nse_dplyr.md +1198 -0
- data/blogs/nse_dplyr/nse_dplyr.pdf +0 -0
- data/blogs/oh_my/oh_my.html +274 -386
- data/blogs/oh_my/oh_my.md +208 -205
- data/blogs/ruby_plot/ruby_plot.Rmd +64 -84
- data/blogs/ruby_plot/ruby_plot.html +235 -208
- data/blogs/ruby_plot/ruby_plot.md +239 -34
- data/blogs/ruby_plot/ruby_plot.pdf +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/dose_len.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_delivery.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_dose.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color2.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_decorations.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_jitter.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_points.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/final_box_plot.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/final_violin_plot.png +0 -0
- data/blogs/ruby_plot/ruby_plot_files/figure-html/violin_with_jitter.png +0 -0
- data/examples/Bibliography/master.bib +50 -0
- data/examples/Bibliography/stats.bib +72 -0
- data/examples/islr/ch2.spec.rb +1 -1
- data/examples/islr/ch3_boston.rb +4 -4
- data/examples/islr/x_y_rnorm.jpg +0 -0
- data/examples/latex_templates/Test-acm_article/Makefile +16 -0
- data/examples/latex_templates/Test-acm_article/Test-acm_article.Rmd +65 -0
- data/examples/latex_templates/Test-acm_article/acm_proc_article-sp.cls +1670 -0
- data/examples/latex_templates/Test-acm_article/sensys-abstract.cls +703 -0
- data/examples/latex_templates/Test-acm_article/sigproc.bib +59 -0
- data/examples/latex_templates/Test-acs_article/Test-acs_article.Rmd +260 -0
- data/examples/latex_templates/Test-acs_article/Test-acs_article.pdf +0 -0
- data/examples/latex_templates/Test-acs_article/acs-Test-acs_article.bib +11 -0
- data/examples/latex_templates/Test-acs_article/acs-my_output.bib +11 -0
- data/examples/latex_templates/Test-acs_article/acstest.bib +17 -0
- data/examples/latex_templates/Test-aea_article/AEA.cls +1414 -0
- data/examples/latex_templates/Test-aea_article/BibFile.bib +0 -0
- data/examples/latex_templates/Test-aea_article/Test-aea_article.Rmd +108 -0
- data/examples/latex_templates/Test-aea_article/Test-aea_article.pdf +0 -0
- data/examples/latex_templates/Test-aea_article/aea.bst +1269 -0
- data/examples/latex_templates/Test-aea_article/multicol.sty +853 -0
- data/examples/latex_templates/Test-aea_article/references.bib +0 -0
- data/examples/latex_templates/Test-aea_article/setspace.sty +546 -0
- data/examples/latex_templates/Test-amq_article/Test-amq_article.Rmd +256 -0
- data/examples/latex_templates/Test-amq_article/Test-amq_article.pdf +0 -0
- data/examples/latex_templates/Test-amq_article/Test-amq_article.pdfsync +3397 -0
- data/examples/latex_templates/Test-amq_article/pics/Figure2.pdf +0 -0
- data/examples/latex_templates/Test-ams_article/Test-ams_article.Rmd +215 -0
- data/examples/latex_templates/Test-ams_article/amstest.bib +436 -0
- data/examples/latex_templates/Test-asa_article/Test-asa_article.Rmd +153 -0
- data/examples/latex_templates/Test-asa_article/Test-asa_article.pdf +0 -0
- data/examples/latex_templates/Test-asa_article/agsm.bst +1353 -0
- data/examples/latex_templates/Test-asa_article/bibliography.bib +233 -0
- data/examples/latex_templates/Test-ieee_article/IEEEtran.bst +2409 -0
- data/examples/latex_templates/Test-ieee_article/IEEEtran.cls +6346 -0
- data/examples/latex_templates/Test-ieee_article/Test-ieee_article.Rmd +175 -0
- data/examples/latex_templates/Test-ieee_article/Test-ieee_article.pdf +0 -0
- data/examples/latex_templates/Test-ieee_article/mybibfile.bib +20 -0
- data/examples/latex_templates/Test-rjournal_article/RJournal.sty +335 -0
- data/examples/latex_templates/Test-rjournal_article/RJreferences.bib +18 -0
- data/examples/latex_templates/Test-rjournal_article/RJwrapper.pdf +0 -0
- data/examples/latex_templates/Test-rjournal_article/Test-rjournal_article.Rmd +52 -0
- data/examples/latex_templates/Test-springer_article/Test-springer_article.Rmd +65 -0
- data/examples/latex_templates/Test-springer_article/Test-springer_article.pdf +0 -0
- data/examples/latex_templates/Test-springer_article/bibliography.bib +26 -0
- data/examples/latex_templates/Test-springer_article/spbasic.bst +1658 -0
- data/examples/latex_templates/Test-springer_article/spmpsci.bst +1512 -0
- data/examples/latex_templates/Test-springer_article/spphys.bst +1443 -0
- data/examples/latex_templates/Test-springer_article/svglov3.clo +113 -0
- data/examples/latex_templates/Test-springer_article/svjour3.cls +1431 -0
- data/examples/misc/moneyball.rb +1 -1
- data/examples/misc/subsetting.rb +37 -37
- data/examples/rmarkdown/svm-rmarkdown-anon-ms-example/svm-rmarkdown-anon-ms-example.Rmd +73 -0
- data/examples/rmarkdown/svm-rmarkdown-anon-ms-example/svm-rmarkdown-anon-ms-example.pdf +0 -0
- data/examples/rmarkdown/svm-rmarkdown-article-example/svm-rmarkdown-article-example.Rmd +382 -0
- data/examples/rmarkdown/svm-rmarkdown-article-example/svm-rmarkdown-article-example.pdf +0 -0
- data/examples/rmarkdown/svm-rmarkdown-beamer-example/svm-rmarkdown-beamer-example.Rmd +164 -0
- data/examples/rmarkdown/svm-rmarkdown-beamer-example/svm-rmarkdown-beamer-example.pdf +0 -0
- data/examples/rmarkdown/svm-rmarkdown-cv/svm-rmarkdown-cv.Rmd +92 -0
- data/examples/rmarkdown/svm-rmarkdown-cv/svm-rmarkdown-cv.pdf +0 -0
- data/examples/rmarkdown/svm-rmarkdown-syllabus-example/attend-grade-relationships.csv +482 -0
- data/examples/rmarkdown/svm-rmarkdown-syllabus-example/svm-rmarkdown-syllabus-example.Rmd +280 -0
- data/examples/rmarkdown/svm-rmarkdown-syllabus-example/svm-rmarkdown-syllabus-example.pdf +0 -0
- data/examples/rmarkdown/svm-xaringan-example/svm-xaringan-example.Rmd +386 -0
- data/lib/R_interface/r.rb +2 -2
- data/lib/R_interface/r_libs.R +6 -1
- data/lib/R_interface/r_methods.rb +12 -2
- data/lib/R_interface/rdata_frame.rb +8 -17
- data/lib/R_interface/rindexed_object.rb +1 -2
- data/lib/R_interface/rlist.rb +1 -0
- data/lib/R_interface/robject.rb +20 -23
- data/lib/R_interface/rpkg.rb +15 -6
- data/lib/R_interface/rsupport.rb +13 -19
- data/lib/R_interface/ruby_extensions.rb +14 -18
- data/lib/R_interface/rvector.rb +0 -12
- data/lib/gknit.rb +2 -0
- data/lib/gknit/draft.rb +105 -0
- data/lib/gknit/knitr_engine.rb +6 -37
- data/lib/util/exec_ruby.rb +22 -84
- data/lib/util/inline_file.rb +7 -3
- data/specs/figures/bg.jpeg +0 -0
- data/specs/figures/bg.png +0 -0
- data/specs/figures/bg.svg +2 -2
- data/specs/figures/dose_len.png +0 -0
- data/specs/figures/no_args.jpeg +0 -0
- data/specs/figures/no_args.png +0 -0
- data/specs/figures/no_args.svg +2 -2
- data/specs/figures/width_height.jpeg +0 -0
- data/specs/figures/width_height.png +0 -0
- data/specs/figures/width_height_units1.jpeg +0 -0
- data/specs/figures/width_height_units1.png +0 -0
- data/specs/figures/width_height_units2.jpeg +0 -0
- data/specs/figures/width_height_units2.png +0 -0
- data/specs/r_dataframe.spec.rb +184 -11
- data/specs/r_list.spec.rb +4 -4
- data/specs/r_list_apply.spec.rb +11 -10
- data/specs/ruby_expression.spec.rb +3 -11
- data/specs/tmp.rb +106 -34
- data/version.rb +1 -1
- metadata +96 -33
- data/bin/gknit_old_r +0 -236
- data/blogs/dev/dev.Rmd +0 -77
- data/blogs/dev/dev.md +0 -87
- data/blogs/dev/dev_files/figure-html/bubble-1.png +0 -0
- data/blogs/dev/dev_files/figure-html/diverging_bar. +0 -0
- data/blogs/dev/dev_files/figure-html/diverging_bar.png +0 -0
- data/blogs/dplyr/dplyr.rb +0 -63
- data/blogs/galaaz_ggplot/galaaz_ggplot.aux +0 -43
- data/blogs/galaaz_ggplot/galaaz_ggplot.log +0 -640
- data/blogs/galaaz_ggplot/galaaz_ggplot.out +0 -10
- data/blogs/galaaz_ggplot/galaaz_ggplot.tex +0 -481
- data/blogs/galaaz_ggplot/midwest.png +0 -0
- data/blogs/galaaz_ggplot/scatter_plot.png +0 -0
- data/blogs/ruby_plot/ruby_plot.Rmd_external_figs +0 -662
- data/blogs/ruby_plot/ruby_plot.tex +0 -1077
- data/blogs/ruby_plot/ruby_plot_files/figure-html/dose_len.svg +0 -57
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_delivery.svg +0 -106
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_dose.svg +0 -110
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color.svg +0 -174
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color2.svg +0 -236
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_jitter.svg +0 -296
- data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_points.svg +0 -236
- data/blogs/ruby_plot/ruby_plot_files/figure-html/final_box_plot.svg +0 -218
- data/blogs/ruby_plot/ruby_plot_files/figure-html/final_violin_plot.svg +0 -128
- data/blogs/ruby_plot/ruby_plot_files/figure-html/violin_with_jitter.svg +0 -150
- data/examples/paper/paper.rb +0 -36
Binary file
|
Binary file
|
Binary file
|
Binary file
|
File without changes
|
@@ -0,0 +1,849 @@
|
|
1
|
+
---
|
2
|
+
title: "Non Standard Evaluation in dplyr with Galaaz"
|
3
|
+
author:
|
4
|
+
- "Rodrigo Botafogo"
|
5
|
+
- "Daniel Mossé - University of Pittsburgh"
|
6
|
+
tags: [Tech, Data Science, Ruby, R, GraalVM]
|
7
|
+
date: "10/05/2019"
|
8
|
+
output:
|
9
|
+
html_document:
|
10
|
+
self_contained: true
|
11
|
+
keep_md: true
|
12
|
+
pdf_document:
|
13
|
+
includes:
|
14
|
+
in_header: ["../../sty/galaaz.sty"]
|
15
|
+
number_sections: yes
|
16
|
+
toc: true
|
17
|
+
toc_depth: 2
|
18
|
+
md_document:
|
19
|
+
variant: markdown_github
|
20
|
+
fontsize: 11pt
|
21
|
+
---
|
22
|
+
|
23
|
+
```{r setup, echo=FALSE, message = FALSE}
|
24
|
+
#R.options(crayon__enabled: false)
|
25
|
+
options(crayon.enabled = FALSE)
|
26
|
+
library('dplyr')
|
27
|
+
library('tibble')
|
28
|
+
```
|
29
|
+
|
30
|
+
# Introduction
|
31
|
+
|
32
|
+
According to Steven Sagaert's answer on Quora to "Is programming language R overrated?":
|
33
|
+
|
34
|
+
> R is a sophisticated language with an unusual (i.e. non-mainstream) set of features. It‘s
|
35
|
+
> an impure functional programming language with sophisticated metaprogramming and 3
|
36
|
+
> different OO systems.
|
37
|
+
|
38
|
+
> Just like common lisp you can completely customise how things work via metaprogramming.
|
39
|
+
> The biggest example is the tidyverse: by creating it’s own evaluation system (tidyeval)
|
40
|
+
> was able to create a custom syntax for dplyr.
|
41
|
+
|
42
|
+
> Mastering R (the language) and its ecosystem is not a matter of weeks or months but
|
43
|
+
> takes years. The rabbit hole goes pretty deep…
|
44
|
+
|
45
|
+
Although having a highly configurable language might give extreme power to the programmer,
|
46
|
+
it can also take, as stated above, years to master. Programming with _dplyr_
|
47
|
+
for instance, requires learning a set of complex concepts and rules that are not easily
|
48
|
+
accessible to casual users or _unsophisticated_ programmers, as many users of R are. Being
|
49
|
+
_unsophisticated_ is NOT used here in a negative sense, as R was built for statisticians and
|
50
|
+
not programmers, who need to solve real problems, often in a short time span, and are not
|
51
|
+
concerned about creating complex computer systems.
|
52
|
+
|
53
|
+
Unfortunately, if this _unsophisticated_ programmer decides to move on to more sophisticated
|
54
|
+
coding, the learning curve might become a serious impediment.
|
55
|
+
|
56
|
+
In this post we will see how to program with _dplyr_ in Galaaz and how Ruby can simplify
|
57
|
+
the learning curve of mastering _dplyr_ coding.
|
58
|
+
|
59
|
+
# But first, what is Galaaz?
|
60
|
+
|
61
|
+
Galaaz is a system for tightly coupling Ruby and R. Ruby is a powerful language, with
|
62
|
+
a large community, a very large set of libraries and great for web development. It is also
|
63
|
+
easy to learn. However,
|
64
|
+
it lacks libraries for data science, statistics, scientific plotting and machine learning.
|
65
|
+
On the other hand, R is considered one of the most powerful languages for solving all of the
|
66
|
+
above problems. Maybe the strongest competitor to R is Python with libraries such as NumPy,
|
67
|
+
Pandas, SciPy, scikit-learn and many more. We will not get into the discussion of R
|
68
|
+
versus Python; both are excellent languages with powerful features, benefits and drawbacks.
|
69
|
+
Our interest is to bring to yet another excellent language, Ruby, the data science libraries
|
70
|
+
that it lacks.
|
71
|
+
|
72
|
+
With Galaaz we do not intend to re-implement any of the scientific libraries in R. However, we
|
73
|
+
allow for very tight coupling between the two languages to the point that the Ruby
|
74
|
+
developer does not need to know that there is an R engine running. Also, from the point of
|
75
|
+
view of the R user/developer, Galaaz looks a lot like R, with just minor syntactic differences,
|
76
|
+
so there is almost no learning curve for the R developer. And as we will see in this
|
77
|
+
post, programming with _dplyr_ is easier in Galaaz than in R.
|
78
|
+
|
79
|
+
R users are probably quite knowledgeable about _dplyr_. For the Ruby developer, _dplyr_ and
|
80
|
+
the _tidyverse_ libraries are a set of libraries for data manipulation in R, developed by
|
81
|
+
Hadley Wickham, chief scientist at RStudio and a prolific R coder and writer.
|
82
|
+
|
83
|
+
For the coupling of Ruby and R, we use new technologies provided by Oracle: GraalVM,
|
84
|
+
TruffleRuby and FastR. The GraalVM home page had the following definition:
|
85
|
+
|
86
|
+
GraalVM is a universal virtual machine for running applications
|
87
|
+
written in JavaScript, Python 3, Ruby, R, JVM-based languages like Java,
|
88
|
+
Scala, Kotlin, and LLVM-based languages such as C and C++.
|
89
|
+
|
90
|
+
GraalVM removes the isolation between programming languages and enables
|
91
|
+
interoperability in a shared runtime. It can run either standalone or in
|
92
|
+
the context of OpenJDK, Node.js, Oracle Database, or MySQL.
|
93
|
+
|
94
|
+
GraalVM allows you to write polyglot applications with a seamless way to
|
95
|
+
pass values from one language to another. With GraalVM there is no copying
|
96
|
+
or marshaling necessary as it is with other polyglot systems. This lets
|
97
|
+
you achieve high performance when language boundaries are crossed. Most
|
98
|
+
of the time there is no additional cost for crossing a language boundary
|
99
|
+
at all.
|
100
|
+
|
101
|
+
Often developers have to make uncomfortable compromises that require them
|
102
|
+
to rewrite their software in other languages. For example:
|
103
|
+
|
104
|
+
* “That library is not available in my language. I need to rewrite it.”
|
105
|
+
* “That language would be the perfect fit for my problem, but we cannot
|
106
|
+
run it in our environment.”
|
107
|
+
* “That problem is already solved in my language, but the language is
|
108
|
+
too slow.”
|
109
|
+
|
110
|
+
With GraalVM we aim to allow developers to freely choose the right language
|
111
|
+
for the task at hand without making compromises.
|
112
|
+
|
113
|
+
|
114
|
+
# Tidyverse and dplyr
|
115
|
+
|
116
|
+
In [What is the tidyverse?](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/) the
|
117
|
+
tidyverse is explained as follows:
|
118
|
+
|
119
|
+
> The tidyverse is a coherent system of packages for data manipulation, exploration and
|
120
|
+
> visualization that share a common design philosophy. These were mostly developed by
|
121
|
+
> Hadley Wickham himself, but they are now being expanded by several contributors. Tidyverse
|
122
|
+
> packages are intended to make statisticians and data scientists more productive by
|
123
|
+
> guiding them through workflows that facilitate communication, and result in reproducible
|
124
|
+
> work products. Fundamentally, the tidyverse is about the connections between the tools
|
125
|
+
> that make the workflow possible.
|
126
|
+
|
127
|
+
_dplyr_ is one of the many packages that are part of the tidyverse. It is:
|
128
|
+
|
129
|
+
> a grammar of data manipulation, providing a consistent set of verbs that help you solve
|
130
|
+
> the most common data manipulation challenges:
|
131
|
+
|
132
|
+
> 1. mutate() adds new variables that are functions of existing variables
|
133
|
+
> 2. select() picks variables based on their names.
|
134
|
+
> 3. filter() picks cases based on their values.
|
135
|
+
> 4. summarise() reduces multiple values down to a single summary.
|
136
|
+
> 5. arrange() changes the ordering of the rows.
|
137
|
+
|
138
|
+
Very often R is used interactively and users use _dplyr_ to manipulate a single dataset
|
139
|
+
without programming. When users want to replicate their work for
|
140
|
+
multiple datasets, programming becomes necessary.
|
141
|
+
|
142
|
+
# Programming with dplyr
|
143
|
+
|
144
|
+
In the vignette ["Programming with dplyr"](https://dplyr.tidyverse.org/articles/programming.html),
|
145
|
+
Hadley Wickham states:
|
146
|
+
|
147
|
+
> Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that
|
148
|
+
> means they don’t follow the usual R rules of evaluation. Instead, they capture the
|
149
|
+
> expression that you typed and evaluate it in a custom way. This has two main
|
150
|
+
> benefits for dplyr code:
|
151
|
+
|
152
|
+
> Operations on data frames can be expressed succinctly because you don’t need to repeat
|
153
|
+
> the name of the data frame. For example, you can write filter(df, x == 1, y == 2, z == 3)
|
154
|
+
> instead of df[df\$x == 1 & df\$y ==2 & df\$z == 3, ].
|
155
|
+
|
156
|
+
> dplyr can choose to compute results in a different way to base R. This is important for
|
157
|
+
> database backends because dplyr itself doesn’t do any work, but instead generates the SQL
|
158
|
+
> that tells the database what to do.
|
159
|
+
|
160
|
+
But then he goes on:
|
161
|
+
|
162
|
+
> Unfortunately these benefits do not come for free. There are two main drawbacks:
|
163
|
+
|
164
|
+
> Most dplyr arguments are not referentially transparent. That means you can’t replace a value
|
165
|
+
> with a seemingly equivalent object that you’ve defined elsewhere. In other words, this code:
|
166
|
+
|
167
|
+
```{r tibble, eval=FALSE}
|
168
|
+
df <- data.frame(x = 1:3, y = 3:1)
|
169
|
+
print(filter(df, x == 1))
|
170
|
+
#> # A tibble: 1 x 2
|
171
|
+
#> x y
|
172
|
+
#> <int> <int>
|
173
|
+
#> 1 1 3
|
174
|
+
```
|
175
|
+
> Is not equivalent to this code:
|
176
|
+
|
177
|
+
```{r my_var_err, eval = FALSE}
|
178
|
+
my_var <- x
|
179
|
+
#> Error in eval(expr, envir, enclos): object 'x' not found
|
180
|
+
filter(df, my_var == 1)
|
181
|
+
#> Error: object 'my_var' not found
|
182
|
+
```
|
183
|
+
> This makes it hard to create functions with arguments that change how dplyr verbs are computed.
|
184
|
+
|
185
|
+
As a result of this, programming with _dplyr_ requires learning a set of new ideas and concepts.
|
186
|
+
In this vignette Hadley goes on to show how to program ever more difficult problems with _dplyr_,
|
187
|
+
showing the problems it faces and the new concepts needed to solve them.
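
As a point of comparison, here is a minimal plain-Ruby sketch of the property the vignette says _dplyr_ arguments lack. It uses neither _dplyr_ nor Galaaz: 'toy_filter' and 'toy_rows' are hypothetical names invented for this illustration. Because a Ruby symbol is an ordinary value, naming a column directly or through a variable gives the same result:

```ruby
# Plain-Ruby analogy only; no dplyr or Galaaz involved.
# toy_filter and toy_rows are hypothetical names for illustration.
def toy_filter(rows, column, value)
  rows.select { |row| row[column] == value }
end

toy_rows = [{ x: 1, y: 3 }, { x: 2, y: 2 }, { x: 3, y: 1 }]

# The column is named directly...
direct = toy_filter(toy_rows, :x, 1)

# ...or through a variable that holds the same symbol.
my_var = :x
indirect = toy_filter(toy_rows, my_var, 1)

puts direct == indirect   # prints: true
puts indirect.length      # prints: 1
```

This only illustrates referential transparency in ordinary Ruby code; it says nothing about how _dplyr_ or Galaaz evaluate their arguments internally.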
|
188
|
+
|
189
|
+
In this blog, we will look at all the problems presented by Hadley in the vignette and show how
|
190
|
+
those same problems can be solved using Galaaz and the Ruby language.
|
191
|
+
|
192
|
+
This blog is organized as follows: first we show how to write expressions using Galaaz.
|
193
|
+
Expressions are a fundamental concept in _dplyr_ and are not part of basic Ruby. We extend
|
194
|
+
the Ruby language to create and manipulate expressions that will be used by _dplyr_ functions.
|
195
|
+
|
196
|
+
Then we show very succinctly how Ruby and R can be integrated and how R functions are
|
197
|
+
transparently called from Ruby. The Galaaz [user manual](https://github.com/rbotafogo/galaaz/wiki)
|
198
|
+
(still in development) goes into much more detail about this integration.
|
199
|
+
|
200
|
+
Next, in section "Data manipulation with _dplyr_", we go through all the problems in the
|
201
|
+
_dplyr_ vignette and look at how they are solved in Galaaz. We then discuss why programming
|
202
|
+
with Galaaz and _dplyr_ is easier than programming with _dplyr_ in plain R.
|
203
|
+
|
204
|
+
The following section looks at another more advanced problem and shows that Galaaz can still
|
205
|
+
handle it without any difficulty. We then provide further reading and concluding remarks.
|
206
|
+
|
207
|
+
# Writing Expressions in Galaaz
|
208
|
+
|
209
|
+
Galaaz extends Ruby to work with expressions, similar to R's expressions built with 'quote'
|
210
|
+
(base R) or 'quo' (tidyverse). Expressions in this context are like mathematical expressions or
|
211
|
+
formulae. For instance, in mathematics, the expression $y = sin(x)$ describes a function but cannot
|
212
|
+
be computed unless the value of $x$ is bound to some value.
|
213
|
+
|
214
|
+
Expressions are fundamental in _dplyr_ programming as they are the input to _dplyr_ functions.
|
215
|
+
For instance, as we will see shortly, if a data frame has a column named 'x' and we want
|
216
|
+
to add another column, 'y', to this data frame that has the values of 'x' times 2, then we would
|
217
|
+
call a _dplyr_ function with the expression 'y = x * 2'.
|
218
|
+
|
219
|
+
## A note on notation
|
220
|
+
|
221
|
+
This blog was written in Rmarkdown and automatically converted to HTML or PDF (depending on
|
222
|
+
where you are reading this blog) with gKnit (a tool provided by Galaaz). In Rmarkdown, it is
|
223
|
+
possible to write text and code blocks that are executed to generate the final report. Code
|
224
|
+
blocks appear inside a 'box' and the result of their execution appears either in another type
|
225
|
+
of 'box' with a different background (HTML) or as normal text (PDF). Every output line from
|
226
|
+
the code execution is preceded by '##'.
|
227
|
+
|
228
|
+
## Expressions from operators
|
229
|
+
|
230
|
+
The code below creates an expression summing two symbols. Note that :a and :b are Ruby symbols and
|
231
|
+
are not bound to any values at the time of expression definition:
|
232
|
+
|
233
|
+
```{ruby expressions}
|
234
|
+
exp1 = :a + :b
|
235
|
+
puts exp1
|
236
|
+
```
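
To give an intuition for how an operator applied to symbols can return an expression instead of a value, here is a toy sketch in plain Ruby. It is not Galaaz's implementation: the 'ToyExpr' class and the operators added to 'Symbol' are assumptions made only for this illustration, and they should not be loaded alongside Galaaz.

```ruby
# Toy model only: NOT how Galaaz is implemented.
# A ToyExpr records an operator and its operands instead of computing a value.
class ToyExpr
  OPERATORS = %i[+ - * /].freeze

  def initialize(op, lhs, rhs)
    @op = op
    @lhs = lhs
    @rhs = rhs
  end

  # Each operator returns a new, larger expression.
  OPERATORS.each do |op|
    define_method(op) { |other| ToyExpr.new(op, self, other) }
  end

  def to_s
    "#{wrap(@lhs)} #{@op} #{wrap(@rhs)}"
  end

  private

  # Parenthesize nested expressions so the grouping stays visible.
  def wrap(operand)
    operand.is_a?(ToyExpr) ? "(#{operand})" : operand.to_s
  end
end

# Symbol has no arithmetic operators by default, so adding them here
# (for this toy only) does not replace any existing behaviour.
class Symbol
  ToyExpr::OPERATORS.each do |op|
    define_method(op) { |other| ToyExpr.new(op, self, other) }
  end
end

toy_sum = :a + :b
toy_nested = (:a + :b) * 2

puts toy_sum      # prints: a + b
puts toy_nested   # prints: (a + b) * 2
```

Neither ':a' nor ':b' needs a value when these lines run; the symbols are simply stored inside the expression object.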
|
237
|
+
In Galaaz, we can build any complex mathematical expression such as:
|
238
|
+
|
239
|
+
```{ruby expr2}
|
240
|
+
exp2 = (:a + :b) * 2.0 + :c ** 2 / :z
|
241
|
+
puts exp2
|
242
|
+
```
|
243
|
+
Expressions are printed with the same format as the equivalent R expressions. The 'L' after
|
244
|
+
2 indicates that 2 is an integer.
|
245
|
+
|
246
|
+
The R developer should note that in R, if she writes the
|
247
|
+
number '2', the R interpreter will treat it as a float. In order to get an integer she
|
248
|
+
should write '2L'. Galaaz follows Ruby notation and '2' is an integer, while '2.0' is a
|
249
|
+
float.
|
250
|
+
|
251
|
+
It is also possible to use inequality operators in building expressions:
|
252
|
+
|
253
|
+
```{ruby expr3}
|
254
|
+
exp3 = (:a + :b) >= :z
|
255
|
+
puts exp3
|
256
|
+
```
|
257
|
+
Expression definitions can also make use of normal Ruby variables without any problem:
|
258
|
+
|
259
|
+
```{ruby expr_with_var}
|
260
|
+
x = 20
|
261
|
+
y = 30.0
|
262
|
+
exp_var = (:a + :b) * x <= :z - y
|
263
|
+
puts exp_var
|
264
|
+
```
|
265
|
+
|
266
|
+
Galaaz provides both symbolic representations for operators, such as (>, <, !=), and functional
|
267
|
+
notation for those operators such as (.gt, .ge, etc.). So the same expression written
|
268
|
+
above can also be written as
|
269
|
+
|
270
|
+
```{ruby expr4}
|
271
|
+
exp4 = (:a + :b).ge :z
|
272
|
+
puts exp4
|
273
|
+
```
|
274
|
+
|
275
|
+
Two types of expressions, however, can only be created with the functional representation
|
276
|
+
of the operators. Those are expressions involving '==', and '='. This is the case since
|
277
|
+
those symbols have special meaning in Ruby and should not be redefined.
|
278
|
+
|
279
|
+
In order to write an expression involving '==' we
|
280
|
+
need to use the method '.eq' and for '=' we need the function '.assign':
|
281
|
+
|
282
|
+
```{ruby expr5}
|
283
|
+
exp5 = (:a + :b).eq :z
|
284
|
+
puts exp5
|
285
|
+
```
|
286
|
+
|
287
|
+
```{ruby expr6}
|
288
|
+
exp6 = :y.assign :a + :b
|
289
|
+
puts exp6
|
290
|
+
```
|
291
|
+
Users should be careful when writing expressions not to inadvertently use '==' or '=' as
|
292
|
+
this will generate an error that might be a bit cryptic (in future releases of Galaaz, we
|
293
|
+
plan to improve the error message).
|
294
|
+
|
295
|
+
```{ruby exp_wrong, warning=FALSE}
|
296
|
+
exp_wrong = (:a + :b) == :z
|
297
|
+
puts exp_wrong
|
298
|
+
```
|
299
|
+
The problem lies with the fact that
|
300
|
+
when using '==' we are comparing expression (:a + :b) to expression :z with '=='. When this
|
301
|
+
comparison is executed, the system tries to evaluate :a, :b and :z, and those symbols, at
|
302
|
+
this time, are not bound to anything, giving the "object 'a' not found" message.
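
The plain-Ruby sketch below illustrates the general point that '==' is evaluated immediately, whereas a named method is free to return a deferred comparison. It does not reproduce Galaaz's behaviour or its error message: 'ToyComparison' and 'toy_eq' are hypothetical names used only here.

```ruby
# Plain Ruby, no Galaaz: '==' on two symbols is answered right away.
immediate = (:a == :z)
puts immediate.inspect   # prints: false (a Boolean, not an expression)

# A hypothetical deferred comparison: it only remembers both sides.
class ToyComparison
  def initialize(lhs, rhs)
    @lhs = lhs
    @rhs = rhs
  end

  def to_s
    "#{@lhs} == #{@rhs}"
  end
end

class Symbol
  # Named method standing in for '==' (toy counterpart of the '.eq' idea).
  def toy_eq(other)
    ToyComparison.new(self, other)
  end
end

puts :a.toy_eq(:z)       # prints: a == z
```

Keeping '==' untouched means ordinary Ruby code that compares symbols keeps working, which is one reason to prefer a named method for building the expression.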
|
303
|
+
|
304
|
+
## Expressions with R methods
|
305
|
+
|
306
|
+
It is often necessary to create an expression that uses a method or function. For instance, in
|
307
|
+
mathematics, it's quite natural to write an expression such as $y = sin(x)$. In this case, the
|
308
|
+
'sin' function is part of the expression and should not be immediately executed. When we want
|
309
|
+
the function to be part of the expression, we call the function preceding it
|
310
|
+
by the letter E, such as 'E.sin(x)':
|
311
|
+
|
312
|
+
```{ruby method_expression}
|
313
|
+
exp7 = :y.assign E.sin(:x)
|
314
|
+
puts exp7
|
315
|
+
```
|
316
|
+
Function expressions can also be written using '.' notation:
|
317
|
+
|
318
|
+
```{ruby expression_with_dot}
|
319
|
+
exp8 = :y.assign :x.sin
|
320
|
+
puts exp8
|
321
|
+
```
|
322
|
+
When a function has multiple arguments, the first one can be used before the '.'. For instance,
|
323
|
+
the R concatenate function 'c', which concatenates two or more arguments, can be part of
|
324
|
+
an expression as:
|
325
|
+
|
326
|
+
```{ruby expression_multiple_args}
|
327
|
+
exp9 = :x.c(:y)
|
328
|
+
puts exp9
|
329
|
+
```
|
330
|
+
Note that this gives an OO feeling to the code, as if we were saying 'x' concatenates 'y'. As a
|
331
|
+
side note, '.' notation can be used as the R pipe operator '%>%', but is more general than the
|
332
|
+
pipe.
|
333
|
+
|
334
|
+
## Evaluating an Expression
|
335
|
+
|
336
|
+
Although we are mainly focusing on expressions to pass them to _dplyr_ functions, expressions
|
337
|
+
can be evaluated by calling function 'eval' with a binding.
|
338
|
+
|
339
|
+
A binding can be provided with a list or a data frame as shown below:
|
340
|
+
|
341
|
+
```{ruby eval_expression_list}
|
342
|
+
exp = (:a + :b) * 2.0 + :c ** 2 / :z
|
343
|
+
puts exp.eval(R.list(a: 10, b: 20, c: 30, z: 40))
|
344
|
+
```
|
345
|
+
|
346
|
+
Or with a data frame:
|
347
|
+
|
348
|
+
```{ruby eval_expression_df}
|
349
|
+
df = R.data__frame(
|
350
|
+
a: R.c(1, 2, 3),
|
351
|
+
b: R.c(10, 20, 30),
|
352
|
+
c: R.c(100, 200, 300),
|
353
|
+
z: R.c(1000, 2000, 3000))
|
354
|
+
|
355
|
+
puts exp.eval(df)
|
356
|
+
```
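
The idea of evaluating an expression against a binding can be sketched in plain Ruby as follows. This is a toy model, not Galaaz's 'eval': the nested-array representation, the 'toy_eval' method and the hash-based stand-ins for an R list and a data frame are all assumptions made for the illustration. Floats are used throughout because R treats plain numbers as doubles, whereas Ruby would otherwise perform integer division.

```ruby
# Toy model only: an expression is a nested array [operator, left, right].
# Symbols are looked up in the environment only when toy_eval runs.
def toy_eval(node, env)
  case node
  when Array
    op, lhs, rhs = node
    toy_eval(lhs, env).public_send(op, toy_eval(rhs, env))
  when Symbol
    env.fetch(node) # raises KeyError when the symbol is unbound
  else
    node # numeric literal
  end
end

# Same shape as (:a + :b) * 2.0 + :c ** 2 / :z
toy_exp = [:+, [:*, [:+, :a, :b], 2.0], [:/, [:**, :c, 2], :z]]

# Binding given as a hash (standing in for an R list).
puts toy_eval(toy_exp, { a: 10.0, b: 20.0, c: 30.0, z: 40.0 })   # prints: 82.5

# Binding given column-wise (standing in for a data frame): evaluate row by row.
toy_df = {
  a: [1.0, 2.0, 3.0],
  b: [10.0, 20.0, 30.0],
  c: [100.0, 200.0, 300.0],
  z: [1000.0, 2000.0, 3000.0]
}
toy_rows = toy_df.values.transpose.map { |values| toy_df.keys.zip(values).to_h }
puts toy_rows.map { |row| toy_eval(toy_exp, row) }.inspect   # prints: [32.0, 64.0, 96.0]
```

R evaluates such an expression on whole columns at once; the row-by-row loop here is just the simplest way to show the same arithmetic in plain Ruby.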
|
357
|
+
|
358
|
+
# Using Galaaz to call R functions
|
359
|
+
|
360
|
+
Galaaz tries to emulate as closely as possible the way R functions are called and migrating from
|
361
|
+
R to Galaaz should be quite easy, requiring only minor syntactic changes to an R script. In
|
362
|
+
this post, we do not have enough space to write a complete manual on Galaaz
|
363
|
+
(a short manual can be found at: https://www.rubydoc.info/gems/galaaz/0.4.9), so we will
|
364
|
+
present only a few example scripts using Galaaz.
|
365
|
+
|
366
|
+
Basically, to call an R function from Ruby with Galaaz, one only needs to precede the function
|
367
|
+
with 'R.'. For instance, to create a vector in R, the 'c' function is used. In Galaaz, a
|
368
|
+
vector can be created by using 'R.c':
|
369
|
+
|
370
|
+
```{ruby vector}
|
371
|
+
vec = R.c(1.0, 2, 3)
|
372
|
+
puts vec
|
373
|
+
```
|
374
|
+
A list is created in R with the 'list' function, so in Galaaz we do:
|
375
|
+
|
376
|
+
```{ruby list}
|
377
|
+
list = R.list(a: 1.0, b: 2, c: 3)
|
378
|
+
puts list
|
379
|
+
```
|
380
|
+
Note that we can use named arguments in our list. The same code in R would be:
|
381
|
+
|
382
|
+
```{r list2}
|
383
|
+
lst = list(a = 1, b = 2L, c = 3L)
|
384
|
+
print(lst)
|
385
|
+
```
|
386
|
+
Now, let's say that 'x' is 45 and we actually want to create
|
387
|
+
the expression $y = sin(45)$, which is $y = 0.850...$ (R's 'sin' takes its argument in radians, so this is the sine of 45 radians, not of 45$^\circ$). In this case,
|
388
|
+
we will use 'R.sin':
|
389
|
+
|
390
|
+
```{ruby eval_sin}
|
391
|
+
exp10 = :y.assign R.sin(45)
|
392
|
+
puts exp10
|
393
|
+
```
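
R's 'sin' works in radians, and so does Ruby's own 'Math.sin', so the value 0.850... above is the sine of 45 radians. The short plain-Ruby check below (no Galaaz needed) shows the difference between the two readings of 45:

```ruby
# 45 taken as radians, which is how sin interprets its argument.
as_radians = Math.sin(45)

# 45 degrees converted to radians before taking the sine.
as_degrees = Math.sin(45 * Math::PI / 180)

puts as_radians.round(3)   # prints: 0.851
puts as_degrees.round(3)   # prints: 0.707
```

To work with an angle given in degrees, convert it first, for instance by multiplying by pi / 180 as done here.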
|
394
|
+
|
395
|
+
# Data manipulation with _dplyr_
|
396
|
+
|
397
|
+
In this section we will give a brief tour of _dplyr_'s usage in Galaaz and how to manipulate
|
398
|
+
data in Ruby with it. This section will follow [_dplyr_'s vignette](https://dplyr.tidyverse.org/articles/dplyr.html) that explores the nycflights13 data set. This dataset contains all 336776
|
399
|
+
flights that departed from New York City in 2013. The data comes from the US Bureau of
|
400
|
+
Transportation Statistics.
|
401
|
+
|
402
|
+
Let's start by taking a look at this dataset:
|
403
|
+
|
404
|
+
```{ruby nycflights13}
|
405
|
+
R.library('nycflights13')
|
406
|
+
# check its dimensions
|
407
|
+
puts ~:flights.dim
|
408
|
+
# and the structure
|
409
|
+
~:flights.str
|
410
|
+
```
|
411
|
+
|
412
|
+
Now, let's use a first verb of _dplyr_: 'filter'. This verb, obviously, will filter the data
|
413
|
+
by the given expression. In the next block, we filter by columns 'month' and 'day'. The
|
414
|
+
first argument to the filter function is symbol ':flights'. A Ruby symbol, when given to
|
415
|
+
an R function, is converted to the R variable of the same name, in this case 'flights', which
|
416
|
+
holds the nycflights13 data frame.
|
417
|
+
|
418
|
+
The second and third arguments are expressions that will be used by the filter function to
|
419
|
+
filter by columns, looking for entries in which the month and day are equal to 1.
|
420
|
+
|
421
|
+
```{ruby filter}
|
422
|
+
puts R.filter(:flights, (:month.eq 1), (:day.eq 1))
|
423
|
+
```
|
424
|
+
|
425
|
+
|
426
|
+
## Programming with _dplyr_: problems and how to solve them in Galaaz
|
427
|
+
|
428
|
+
In this section we look at the list of problems that Hadley describes in the "Programming with dplyr"
|
429
|
+
vignette and show how those problems are solved and coded with Galaaz. Readers interested in
|
430
|
+
how those problems are treated in _dplyr_ should read the vignette and use it as a comparison with
|
431
|
+
this blog.
|
432
|
+
|
433
|
+
## Filtering using expressions
|
434
|
+
|
435
|
+
Now that we know how to write expressions and call R functions, let's do some data manipulation in
|
436
|
+
Galaaz. Let's first start by creating a data frame. In R, the 'data.frame' function creates a
|
437
|
+
data frame. In Ruby, writing 'data.frame' will not parse as a single object. To call R
|
438
|
+
functions that have a '.' in them, we need to substitute the '.' with '__'. So, method
|
439
|
+
'data.frame' in R, is called in Galaaz as 'R.data\_\_frame':
|
440
|
+
|
441
|
+
```{ruby df}
|
442
|
+
df = R.data__frame(x: (1..3), y: (3..1))
|
443
|
+
puts df
|
444
|
+
```
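
Why the '.' has to go: in Ruby, 'R.data.frame' would parse as calling 'data' on 'R' and then 'frame' on the result. The sketch below shows one way such a name translation could be done with 'method\_missing'. The 'ToyR' module is hypothetical and only reports the R name a call would map to; it is not Galaaz's implementation.

```ruby
# Toy stand-in for an R bridge: it only reports which R function name
# a Ruby call would map to. This is NOT how Galaaz is implemented.
module ToyR
  def self.method_missing(name, *_args)
    r_name = name.to_s.gsub('__', '.')   # 'data__frame' -> 'data.frame'
    "R call: #{r_name}"
  end

  def self.respond_to_missing?(_name, _include_private = false)
    true
  end
end

puts ToyR.data__frame(x: 1)   # R call: data.frame
puts ToyR.list(a: 1)          # R call: list
```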
|
445
|
+
|
446
|
+
_dplyr_ provides the 'filter' function, which filters data in a data frame. The 'filter'
|
447
|
+
function can be called on this data frame either by using 'R.filter(df, ...)' or
|
448
|
+
by using dot notation.
|
449
|
+
|
450
|
+
|
451
|
+
|
452
|
+
We prefer to use dot notation as shown below. The argument to 'filter' should be an
|
453
|
+
expression. Note that if we gave filter a Ruby expression such as
|
454
|
+
'x == 1', we would get an error, since there is no variable 'x' defined and if 'x' was a variable
|
455
|
+
then 'x == 1' would either be 'true' or 'false'. Our goal is to filter our data frame returning
|
456
|
+
all rows in which the 'x' value is equal to 1. To express this we want: ':x.eq 1', where :x will
|
457
|
+
be interpreted by filter as the 'x' column.
|
458
|
+
|
459
|
+
```{ruby filter_exp}
|
460
|
+
puts df.filter(:x.eq 1)
|
461
|
+
```
|
462
|
+
In R, and when coding with 'tidyverse', arguments to a function are usually not
|
463
|
+
*referentially transparent*. That is, you can’t replace a value with a seemingly equivalent
|
464
|
+
object that you’ve defined elsewhere. In other words, this code
|
465
|
+
|
466
|
+
```{r not_transp, eval=FALSE}
|
467
|
+
my_var <- x
|
468
|
+
filter(df, my_var == 1)
|
469
|
+
```
|
470
|
+
generates the following error: "object 'x' not found".
|
471
|
+
|
472
|
+
However, in Galaaz, arguments are referentially transparent as can be seen by the
|
473
|
+
code below. Note initially that 'my_var = :x' will not give the error "object 'x' not found"
|
474
|
+
since ':x' is treated as an expression and assigned to my\_var. Then when doing (my\_var.eq 1),
|
475
|
+
my\_var is a variable that resolves to ':x' and it becomes equivalent to (:x.eq 1) which is
|
476
|
+
what we want.
|
477
|
+
|
478
|
+
```{ruby my_var}
|
479
|
+
my_var = :x
|
480
|
+
puts df.filter(my_var.eq 1)
|
481
|
+
```
|
482
|
+
As stated by Hadley:
|
483
|
+
|
484
|
+
> dplyr code is ambiguous. Depending on what variables are defined where,
|
485
|
+
> filter(df, x == y) could be equivalent to any of:
|
486
|
+
|
487
|
+
```
|
488
|
+
df[df$x == df$y, ]
|
489
|
+
df[df$x == y, ]
|
490
|
+
df[x == df$y, ]
|
491
|
+
df[x == y, ]
|
492
|
+
```
|
493
|
+
In Galaaz this ambiguity does not exist: filter(df, x.eq y) is not a valid expression as
|
494
|
+
expressions are built with symbols. In doing filter(df, :x.eq y) we are looking for elements
|
495
|
+
of the 'x' column that are equal to a previously defined y variable. Finally in
|
496
|
+
filter(df, :x.eq :y) we are looking for elements in which the 'x' column value is equal to
|
497
|
+
the 'y' column value. This can be seen in the following two chunks of code:
|
498
|
+
|
499
|
+
```{ruby disamb1}
|
500
|
+
y = 1
|
501
|
+
x = 2
|
502
|
+
|
503
|
+
# looking for values where the 'x' column is equal to the 'y' column
|
504
|
+
puts df.filter(:x.eq :y)
|
505
|
+
```
|
506
|
+
|
507
|
+
```{ruby disamb2}
|
508
|
+
# looking for values where the 'x' column is equal to the 'y' variable
|
509
|
+
# in this case, the number 1
|
510
|
+
puts df.filter(:x.eq y)
|
511
|
+
```
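
The distinction between a symbol and a variable can be sketched in plain Ruby. The 'toy\_eq' helper below is hypothetical and is not how Galaaz builds expressions; it only labels each operand as a column (when it is a symbol) or as an ordinary value.

```ruby
# Toy expression builder (hypothetical, not Galaaz's implementation):
# symbols stand for column names, anything else is an ordinary Ruby value.
def toy_eq(lhs, rhs)
  render = ->(v) { v.is_a?(Symbol) ? "column(#{v})" : "value(#{v.inspect})" }
  "#{render.call(lhs)} == #{render.call(rhs)}"
end

y = 1
puts toy_eq(:x, :y)   # column(x) == column(y)
puts toy_eq(:x, y)    # column(x) == value(1)
```

Because Ruby evaluates 'y' before the call, the helper never sees the name 'y', only the number 1; the symbol ':y' is the only way to refer to a column.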
|
512
|
+
## Writing a function that applies to different data sets
|
513
|
+
|
514
|
+
Let's suppose that we want to write a function that receives as the first argument a data frame
|
515
|
+
and as second argument an expression that adds a column to the data frame that is equal to the
|
516
|
+
sum of elements in column 'a' plus 'x'.
|
517
|
+
|
518
|
+
Here is the intended behaviour using the 'mutate' function of 'dplyr':
|
519
|
+
|
520
|
+
```
|
521
|
+
mutate(df1, y = a + x)
|
522
|
+
mutate(df2, y = a + x)
|
523
|
+
mutate(df3, y = a + x)
|
524
|
+
mutate(df4, y = a + x)
|
525
|
+
```
|
526
|
+
The naive approach to writing an R function to solve this problem is:
|
527
|
+
|
528
|
+
```
|
529
|
+
mutate_y <- function(df) {
|
530
|
+
mutate(df, y = a + x)
|
531
|
+
}
|
532
|
+
```
|
533
|
+
Unfortunately, in R, this function can fail silently if one of the variables isn’t present
|
534
|
+
in the data frame, but is present in the global environment. We will not go through here how
|
535
|
+
to solve this problem in R.
|
536
|
+
|
537
|
+
In Galaaz the method mutate_y below will work fine and will never fail silently.
|
538
|
+
|
539
|
+
```{ruby mutate_y, warning=FALSE}
|
540
|
+
def mutate_y(df)
|
541
|
+
df.mutate(:y.assign :a + :x)
|
542
|
+
end
|
543
|
+
```
|
544
|
+
Here we create a data frame that has only one column named 'x':
|
545
|
+
|
546
|
+
```{ruby data_frame_no_a_column, warning=FALSE}
|
547
|
+
df1 = R.data__frame(x: (1..3))
|
548
|
+
puts df1
|
549
|
+
```
|
550
|
+
|
551
|
+
Note that method mutate_y will fail regardless of whether a variable 'a' is defined and
|
552
|
+
in the scope of the method. Variable 'a' has no relationship with the symbol ':a' used in the
|
553
|
+
definition of 'mutate\_y' above:
|
554
|
+
|
555
|
+
```{ruby call_mutate_y, warning = FALSE}
|
556
|
+
a = 10
|
557
|
+
mutate_y(df1)
|
558
|
+
```
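
The same point can be made without Galaaz. In this plain-Ruby sketch, rows are hashes keyed by column name, and 'toy\_mutate\_y' is a hypothetical stand-in for the method above. A local variable 'a' does not rescue the call when the ':a' column is missing.

```ruby
# Plain-Ruby sketch: rows are hashes keyed by column name (symbols).
# toy_mutate_y is a hypothetical stand-in for the Galaaz method above.
def toy_mutate_y(rows)
  rows.map do |row|
    # Hash#fetch raises KeyError when the column is missing
    row.merge(y: row.fetch(:a) + row.fetch(:x))
  end
end

a = 10                       # local variable; unrelated to the :a column
rows = [{ x: 1 }, { x: 2 }]  # no :a column

begin
  toy_mutate_y(rows)
rescue KeyError => e
  puts "failed loudly: #{e.message}"
end
```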
|
559
|
+
## Different expressions
|
560
|
+
|
561
|
+
Let's move to the next problem as presented by Hadley, where trying to write a function in R
|
562
|
+
that will receive two arguments, the first a variable and the second an expression, is not trivial.
|
563
|
+
Below we create a data frame and we want to write a function that groups data by a variable and
|
564
|
+
summarises it by an expression:
|
565
|
+
|
566
|
+
```{r diff_expr}
|
567
|
+
set.seed(123)
|
568
|
+
|
569
|
+
df <- data.frame(
|
570
|
+
g1 = c(1, 1, 2, 2, 2),
|
571
|
+
g2 = c(1, 2, 1, 2, 1),
|
572
|
+
a = sample(5),
|
573
|
+
b = sample(5)
|
574
|
+
)
|
575
|
+
|
576
|
+
as.data.frame(df)
|
577
|
+
|
578
|
+
d2 <- df %>%
|
579
|
+
group_by(g1) %>%
|
580
|
+
summarise(a = mean(a))
|
581
|
+
|
582
|
+
as.data.frame(d2)
|
583
|
+
|
584
|
+
d2 <- df %>%
|
585
|
+
group_by(g2) %>%
|
586
|
+
summarise(a = mean(a))
|
587
|
+
|
588
|
+
as.data.frame(d2)
|
589
|
+
```
|
590
|
+
|
591
|
+
As shown by Hadley, one might expect this function to do the trick:
|
592
|
+
|
593
|
+
```{r diff_exp_fnc}
|
594
|
+
my_summarise <- function(df, group_var) {
|
595
|
+
df %>%
|
596
|
+
group_by(group_var) %>%
|
597
|
+
summarise(a = mean(a))
|
598
|
+
}
|
599
|
+
|
600
|
+
# my_summarise(df, g1)
|
601
|
+
#> Error: Column `group_var` is unknown
|
602
|
+
```
|
603
|
+
|
604
|
+
In order to solve this problem, coding with dplyr requires the introduction of many new concepts
|
605
|
+
and functions such as 'quo', 'quos', 'enquo', 'enquos', '!!' (bang bang), '!!!' (triple bang).
|
606
|
+
Again, we'll leave to Hadley the explanation of how to use all those functions.
|
607
|
+
|
608
|
+
Now, let's try to implement the same function in Galaaz. The next code block first prints the
|
609
|
+
'df' data frame defined previously in R (to access an R variable from Galaaz, we use the tilde
|
610
|
+
operator '~' applied to the R variable name as a symbol, i.e., ':df'). We then create the
|
611
|
+
'my_summarize' method and call it passing the R data frame and the group by variable ':g1':
|
612
|
+
|
613
|
+
```{ruby diff_exp_ruby_func}
|
614
|
+
puts ~:df
|
615
|
+
print "\n"
|
616
|
+
|
617
|
+
def my_summarize(df, group_var)
|
618
|
+
df.group_by(group_var).
|
619
|
+
summarize(a: :a.mean)
|
620
|
+
end
|
621
|
+
|
622
|
+
puts my_summarize(:df, :g1)
|
623
|
+
```
|
624
|
+
It works!!! Well, let's make sure this was not just some coincidence:
|
625
|
+
|
626
|
+
```{ruby group_g2}
|
627
|
+
puts my_summarize(:df, :g2)
|
628
|
+
```
|
629
|
+
|
630
|
+
Great, everything is fine! No magic, no new functions, no complexities, just normal, standard Ruby
|
631
|
+
code. If you've ever done NSE in R, this certainly feels much safer and easier to implement.
|
632
|
+
|
633
|
+
## Different input variables
|
634
|
+
|
635
|
+
In the previous section we've managed to get rid of all NSE formulation for a simple example, but
|
636
|
+
does this remain true for more complex examples, or will the Galaaz way prove impractical for
|
637
|
+
more complex code?
|
638
|
+
|
639
|
+
In the next example Hadley asks us to write a function that, given an expression such as 'a'
|
640
|
+
or 'a * b', calculates three summaries. What we want is a function that does the same as these R
|
641
|
+
statements:
|
642
|
+
|
643
|
+
```
|
644
|
+
summarise(df, mean = mean(a), sum = sum(a), n = n())
|
645
|
+
#> # A tibble: 1 x 3
|
646
|
+
#> mean sum n
|
647
|
+
#> <dbl> <int> <int>
|
648
|
+
#> 1 3 15 5
|
649
|
+
|
650
|
+
summarise(df, mean = mean(a * b), sum = sum(a * b), n = n())
|
651
|
+
#> # A tibble: 1 x 3
|
652
|
+
#> mean sum n
|
653
|
+
#> <dbl> <int> <int>
|
654
|
+
#> 1 9 45 5
|
655
|
+
```
|
656
|
+
|
657
|
+
Let's try it in Galaaz:
|
658
|
+
|
659
|
+
```{ruby summarize_method}
|
660
|
+
def my_summarise2(df, expr)
|
661
|
+
df.summarize(
|
662
|
+
mean: E.mean(expr),
|
663
|
+
sum: E.sum(expr),
|
664
|
+
n: E.n
|
665
|
+
)
|
666
|
+
end
|
667
|
+
|
668
|
+
puts my_summarise2((~:df), :a)
|
669
|
+
puts my_summarise2((~:df), :a * :b)
|
670
|
+
```
|
671
|
+
|
672
|
+
Once again, there is no need to use any special theory or functions. The only point to be
|
673
|
+
careful about is the use of 'E' to build expressions from functions 'mean', 'sum' and 'n'.
|
674
|
+
|
675
|
+
## Different input and output variable
|
676
|
+
|
677
|
+
Now the next challenge presented by Hadley is to vary the name of the output variables based on
|
678
|
+
the received expression. So, if the input expression is 'a', we want our data frame columns to
|
679
|
+
be named 'mean\_a' and 'sum\_a'. Now, if the input expression is 'b', columns
|
680
|
+
should be named 'mean\_b' and 'sum\_b'.
|
681
|
+
|
682
|
+
```
|
683
|
+
mutate(df, mean_a = mean(a), sum_a = sum(a))
|
684
|
+
#> # A tibble: 5 x 6
|
685
|
+
#> g1 g2 a b mean_a sum_a
|
686
|
+
#> <dbl> <dbl> <int> <int> <dbl> <int>
|
687
|
+
#> 1 1 1 1 3 3 15
|
688
|
+
#> 2 1 2 4 2 3 15
|
689
|
+
#> 3 2 1 2 1 3 15
|
690
|
+
#> 4 2 2 5 4 3 15
|
691
|
+
#> # … with 1 more row
|
692
|
+
|
693
|
+
mutate(df, mean_b = mean(b), sum_b = sum(b))
|
694
|
+
#> # A tibble: 5 x 6
|
695
|
+
#> g1 g2 a b mean_b sum_b
|
696
|
+
#> <dbl> <dbl> <int> <int> <dbl> <int>
|
697
|
+
#> 1 1 1 1 3 3 15
|
698
|
+
#> 2 1 2 4 2 3 15
|
699
|
+
#> 3 2 1 2 1 3 15
|
700
|
+
#> 4 2 2 5 4 3 15
|
701
|
+
#> # … with 1 more row
|
702
|
+
```
|
703
|
+
In order to solve this problem in R, Hadley needs to introduce some more new functions and notations:
|
704
|
+
'quo_name' and the ':=' operator from package 'rlang'.
|
705
|
+
|
706
|
+
Here is our Ruby code:
|
707
|
+
|
708
|
+
```{ruby name_change}
|
709
|
+
def my_mutate(df, expr)
|
710
|
+
mean_name = "mean_#{expr.to_s}"
|
711
|
+
sum_name = "sum_#{expr.to_s}"
|
712
|
+
|
713
|
+
df.mutate(mean_name => E.mean(expr),
|
714
|
+
sum_name => E.sum(expr))
|
715
|
+
end
|
716
|
+
|
717
|
+
puts my_mutate((~:df), :a)
|
718
|
+
puts my_mutate((~:df), :b)
|
719
|
+
```
|
720
|
+
It really seems that "Non Standard Evaluation" is actually quite standard in Galaaz! But, you
|
721
|
+
might have noticed a small change in the way the arguments to the mutate method were called.
|
722
|
+
In a previous example we used df.summarize(mean: E.mean(expr), ...) where the column name was
|
723
|
+
followed by a ':' (colon). In this example, we have df.mutate(mean_name => E.mean(expr), ...)
|
724
|
+
and variable mean\_name is not followed by ':' but by '=>'. This is standard Ruby notation.
|
725
|
+
|
726
|
+
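In Ruby, 'mean: value' is shorthand for a hash entry whose key is the literal symbol ':mean', so the key cannot come from a variable. With '=>' the key can be any expression, so 'mean\_name => value' uses the string held in mean\_name as the column name.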
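
The difference between the two hash notations can be checked in plain Ruby, without Galaaz. In this minimal sketch the values are placeholders (the integer 1) rather than expressions; with 'name:' the key is always the literal symbol, while with '=>' the key is whatever the variable holds.

```ruby
expr = :a
mean_name = "mean_#{expr}"

literal_key  = { mean_name: 1 }     # key is the symbol :mean_name
computed_key = { mean_name => 1 }   # key is the string "mean_a"

p literal_key.keys    # [:mean_name]
p computed_key.keys   # ["mean_a"]
```

This is why 'my\_mutate' has to use '=>': the column name is only known once 'expr' has been received.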
|
727
|
+
|
728
|
+
## Capturing multiple variables
|
729
|
+
|
730
|
+
Moving on with new complexities, Hadley asks us to solve the problem in which the
|
731
|
+
summarise function will receive any number of grouping variables.
|
732
|
+
|
733
|
+
This again is quite standard Ruby. In order to receive an undefined number of parameters
|
734
|
+
the parameter is preceded by '*':
|
735
|
+
|
736
|
+
```{ruby multiple_vars}
|
737
|
+
def my_summarise3(df, *group_vars)
|
738
|
+
df.group_by(*group_vars).
|
739
|
+
summarise(a: E.mean(:a))
|
740
|
+
end
|
741
|
+
|
742
|
+
puts my_summarise3((~:df), :g1, :g2)
|
743
|
+
```
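
For readers less familiar with the splat operator, here is a plain-Ruby sketch of the same idea on an array of hashes. The data and the 'toy\_group\_means' helper are hypothetical; the helper groups by any number of keys and averages column ':a', loosely mirroring what the Galaaz method asks dplyr to do.

```ruby
# Plain-Ruby sketch with hypothetical data; rows are hashes.
# *group_vars collects any number of grouping keys into an array.
def toy_group_means(rows, *group_vars)
  rows.group_by { |row| row.values_at(*group_vars) }
      .transform_values { |grp| grp.sum { |r| r[:a] }.to_f / grp.size }
end

rows = [
  { g1: 1, g2: 1, a: 2 },
  { g1: 1, g2: 2, a: 4 },
  { g1: 2, g2: 1, a: 6 },
  { g1: 2, g2: 1, a: 8 }
]

p toy_group_means(rows, :g1)        # two groups, keyed by [g1]
p toy_group_means(rows, :g1, :g2)   # three groups, keyed by [g1, g2]
```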
|
744
|
+
|
745
|
+
# Why does R require NSE and Galaaz does not?
|
746
|
+
|
747
|
+
NSE introduces a number of new concepts, such as 'quoting', 'quasiquotation', 'unquoting' and
|
748
|
+
'unquote-splicing', while in Galaaz none of those concepts are needed. What gives?
|
749
|
+
|
750
|
+
R is an extremely flexible language and it has lazy evaluation of parameters. When in R a
|
751
|
+
function is called as 'summarise(df, a = b)', the summarise function receives the literal
|
752
|
+
'a = b' parameter and can work with this as if it were a string. In R, it is not clear what
|
753
|
+
a and b are, they can be expressions or they can be variables, it is up to the function to
|
754
|
+
decide what 'a = b' means.
|
755
|
+
|
756
|
+
In Ruby, there is no lazy evaluation of parameters and 'a' is always a variable and so is 'b'.
|
757
|
+
Variables assume their value as soon as they are used, so 'x = a' is immediately evaluated and
|
758
|
+
variable 'x' will receive the value of variable 'a' as soon as the Ruby statement is executed.
|
759
|
+
Ruby also provides the notion of a symbol; ':a' is a symbol and does not evaluate to anything.
|
760
|
+
Galaaz uses Ruby symbols to build expressions that are not bound to anything: ':a.eq :b' is
|
761
|
+
clearly an expression and has no relationship whatsoever with the statement 'a = b'. By using
|
762
|
+
symbols, variables and expressions all the possible ambiguities that are found in R are
|
763
|
+
eliminated in Galaaz.
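
A few lines of plain Ruby illustrate the two behaviours described above: assignment evaluates its right-hand side immediately, while a symbol is just a name that is not looked up anywhere. The variable names are ours and serve only as an illustration.

```ruby
a = 5
x = a        # evaluated immediately: x holds 5, not a reference to "a"
a = 99       # later changes to a do not affect x

sym = :a     # a symbol is just a name; no variable lookup happens

puts x             # 5
puts sym.inspect   # :a
puts sym == :a     # true
```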
|
764
|
+
|
765
|
+
The main problem that remains is that in R, functions are not clearly documented as to what type
|
766
|
+
of input they are expecting, they might be expecting regular variables or they might be
|
767
|
+
expecting expressions and the R function will know how to deal with an input of the form
|
768
|
+
'a = b', now for the Ruby developer it might not be immediately clear if it should call the
|
769
|
+
function passing the value 'true' if variable 'a' is equal to variable 'b' or if it should
|
770
|
+
call the function passing the expression ':a.eq :b'.
|
771
|
+
|
772
|
+
|
773
|
+
# Advanced dplyr features
|
774
|
+
|
775
|
+
In the blog: [Programming with dplyr by using dplyr](https://www.r-bloggers.com/programming-with-dplyr-by-using-dplyr/) Iñaki Úcar shows surprise that some R users are trying to code in dplyr avoiding
|
776
|
+
the use of NSE. For instance he says:
|
777
|
+
|
778
|
+
> Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to
|
779
|
+
> program over dplyr without having “to bring in (or study) any deep-theory or
|
780
|
+
> heavy-weight tools such as rlang/tidyeval”.
|
781
|
+
|
782
|
+
For me, there isn't really any surprise that users are trying to avoid dplyr deep-theory. R
|
783
|
+
users frequently are not programmers and learning to code is already hard business, on top
|
784
|
+
of that, having to learn how to 'quote' or 'enquo' or 'quos' or 'enquos' is not necessarily
|
785
|
+
a 'piece of cake'. So much so, that 'tidyeval' has some more advanced functions that instead
|
786
|
+
of using quoted expressions, use strings as arguments.
|
787
|
+
|
788
|
+
In the following examples, we show the use of functions 'group\_by\_at', 'summarise\_at' and
|
789
|
+
'rename\_at' that receive strings as arguments. The data frame used is 'starwars', which describes
|
790
|
+
features of characters in the Star Wars movies:
|
791
|
+
|
792
|
+
```{ruby starwars}
|
793
|
+
puts (~:starwars).head
|
794
|
+
```
|
795
|
+
The grouped_mean function below will receive a grouping variable and calculate summaries for
|
796
|
+
the value\_variables given:
|
797
|
+
|
798
|
+
```{r grouped_mean}
|
799
|
+
grouped_mean <- function(data, grouping_variables, value_variables) {
|
800
|
+
data %>%
|
801
|
+
group_by_at(grouping_variables) %>%
|
802
|
+
mutate(count = n()) %>%
|
803
|
+
summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>%
|
804
|
+
rename_at(value_variables, funs(paste0("mean_", .)))
|
805
|
+
}
|
806
|
+
|
807
|
+
gm = starwars %>%
|
808
|
+
grouped_mean("eye_color", c("mass", "birth_year"))
|
809
|
+
|
810
|
+
as.data.frame(gm)
|
811
|
+
```
|
812
|
+
|
813
|
+
The same code with Galaaz becomes:
|
814
|
+
|
815
|
+
```{ruby advanced_starwars}
|
816
|
+
def grouped_mean(data, grouping_variables, value_variables)
|
817
|
+
data.
|
818
|
+
group_by_at(grouping_variables).
|
819
|
+
mutate(count: E.n).
|
820
|
+
summarise_at(E.c(value_variables, "count"), ~:mean, na__rm: true).
|
821
|
+
rename_at(value_variables, E.funs(E.paste0("mean_", value_variables)))
|
822
|
+
end
|
823
|
+
|
824
|
+
puts grouped_mean((~:starwars), "eye_color", E.c("mass", "birth_year"))
|
825
|
+
```
|
826
|
+
|
827
|
+
# Further reading
|
828
|
+
|
829
|
+
For more information on GraalVM, TruffleRuby, fastR, R and Galaaz check out the following sites/posts:
|
830
|
+
|
831
|
+
* [GraalVM Home](https://www.graalvm.org/)
|
832
|
+
* [TruffleRuby](https://github.com/oracle/truffleruby)
|
833
|
+
* [FastR](https://github.com/oracle/fastr)
|
834
|
+
* [Faster R with FastR](https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb)
|
835
|
+
* [How to make Beautiful Ruby Plots with Galaaz](https://medium.freecodecamp.org/how-to-make-beautiful-ruby-plots-with-galaaz-320848058857)
|
836
|
+
* [Ruby Plotting with Galaaz: An example of tightly coupling Ruby and R in GraalVM](https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021)
|
837
|
+
* [How to do reproducible research in Ruby with gKnit](https://towardsdatascience.com/how-to-do-reproducible-research-in-ruby-with-gknit-c26d2684d64e)
|
838
|
+
* [R for Data Science](https://r4ds.had.co.nz/)
|
839
|
+
* [Advanced R](https://adv-r.hadley.nz/)
|
840
|
+
|
841
|
+
# Conclusion
|
842
|
+
|
843
|
+
Ruby and Galaaz provide a nice framework for developing code that uses R functions. Although R is
|
844
|
+
a very powerful and flexible language, sometimes, too much flexibility makes life harder for
|
845
|
+
the casual user. We believe however, that even for the advanced user, Ruby integrated
|
846
|
+
with R through Galaaz, makes a powerful environment for data analysis. In this blog post we
|
847
|
+
showed how Galaaz's consistent syntax eliminates the need for complex constructs such as quoting,
|
848
|
+
enquoting, quasiquotation, etc. This simplification comes from the fact that expressions and
|
849
|
+
variables are clearly separated objects, which is not the case in the R language.
|