galaaz 0.4.6 → 0.5.0

Files changed (181)
  1. checksums.yaml +5 -5
  2. data/README.md +3575 -118
  3. data/Rakefile +21 -4
  4. data/bin/gknit +152 -6
  5. data/bin/gknit-draft +105 -0
  6. data/bin/gknit-draft.rb +28 -0
  7. data/bin/gknit_Rscript +127 -0
  8. data/bin/grun +27 -1
  9. data/bin/gstudio +47 -4
  10. data/bin/{gstudio.rb → gstudio_irb.rb} +0 -0
  11. data/bin/gstudio_pry.rb +7 -0
  12. data/blogs/galaaz_ggplot/galaaz_ggplot.Rmd +3 -12
  13. data/blogs/galaaz_ggplot/galaaz_ggplot.html +77 -222
  14. data/blogs/galaaz_ggplot/galaaz_ggplot.md +4 -31
  15. data/blogs/galaaz_ggplot/galaaz_ggplot.pdf +0 -0
  16. data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-html/midwest_rb.png +0 -0
  17. data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-html/scatter_plot_rb.png +0 -0
  18. data/blogs/galaaz_ggplot/midwest.Rmd +1 -9
  19. data/blogs/gknit/gknit.Rmd +232 -123
  20. data/blogs/{dev/dev.html → gknit/gknit.html} +1897 -33
  21. data/blogs/gknit/gknit.pdf +0 -0
  22. data/blogs/gknit/lst.rds +0 -0
  23. data/blogs/gknit/stats.bib +27 -0
  24. data/blogs/manual/lst.rds +0 -0
  25. data/blogs/manual/manual.Rmd +1893 -47
  26. data/blogs/manual/manual.html +3153 -347
  27. data/blogs/manual/manual.md +3575 -118
  28. data/blogs/manual/manual.pdf +0 -0
  29. data/blogs/manual/manual.tex +4026 -0
  30. data/blogs/manual/manual_files/figure-html/bubble-1.png +0 -0
  31. data/blogs/manual/manual_files/figure-html/diverging_bar.png +0 -0
  32. data/blogs/manual/manual_files/figure-latex/bubble-1.png +0 -0
  33. data/blogs/manual/manual_files/figure-latex/diverging_bar.pdf +0 -0
  34. data/blogs/{dev → manual}/model.rb +0 -0
  35. data/blogs/nse_dplyr/nse_dplyr.Rmd +849 -0
  36. data/blogs/nse_dplyr/nse_dplyr.html +878 -0
  37. data/blogs/nse_dplyr/nse_dplyr.md +1198 -0
  38. data/blogs/nse_dplyr/nse_dplyr.pdf +0 -0
  39. data/blogs/oh_my/oh_my.html +274 -386
  40. data/blogs/oh_my/oh_my.md +208 -205
  41. data/blogs/ruby_plot/ruby_plot.Rmd +64 -84
  42. data/blogs/ruby_plot/ruby_plot.html +235 -208
  43. data/blogs/ruby_plot/ruby_plot.md +239 -34
  44. data/blogs/ruby_plot/ruby_plot.pdf +0 -0
  45. data/blogs/ruby_plot/ruby_plot_files/figure-html/dose_len.png +0 -0
  46. data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_delivery.png +0 -0
  47. data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_dose.png +0 -0
  48. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color.png +0 -0
  49. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color2.png +0 -0
  50. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_decorations.png +0 -0
  51. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_jitter.png +0 -0
  52. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_points.png +0 -0
  53. data/blogs/ruby_plot/ruby_plot_files/figure-html/final_box_plot.png +0 -0
  54. data/blogs/ruby_plot/ruby_plot_files/figure-html/final_violin_plot.png +0 -0
  55. data/blogs/ruby_plot/ruby_plot_files/figure-html/violin_with_jitter.png +0 -0
  56. data/examples/Bibliography/master.bib +50 -0
  57. data/examples/Bibliography/stats.bib +72 -0
  58. data/examples/islr/ch2.spec.rb +1 -1
  59. data/examples/islr/ch3_boston.rb +4 -4
  60. data/examples/islr/x_y_rnorm.jpg +0 -0
  61. data/examples/latex_templates/Test-acm_article/Makefile +16 -0
  62. data/examples/latex_templates/Test-acm_article/Test-acm_article.Rmd +65 -0
  63. data/examples/latex_templates/Test-acm_article/acm_proc_article-sp.cls +1670 -0
  64. data/examples/latex_templates/Test-acm_article/sensys-abstract.cls +703 -0
  65. data/examples/latex_templates/Test-acm_article/sigproc.bib +59 -0
  66. data/examples/latex_templates/Test-acs_article/Test-acs_article.Rmd +260 -0
  67. data/examples/latex_templates/Test-acs_article/Test-acs_article.pdf +0 -0
  68. data/examples/latex_templates/Test-acs_article/acs-Test-acs_article.bib +11 -0
  69. data/examples/latex_templates/Test-acs_article/acs-my_output.bib +11 -0
  70. data/examples/latex_templates/Test-acs_article/acstest.bib +17 -0
  71. data/examples/latex_templates/Test-aea_article/AEA.cls +1414 -0
  72. data/examples/latex_templates/Test-aea_article/BibFile.bib +0 -0
  73. data/examples/latex_templates/Test-aea_article/Test-aea_article.Rmd +108 -0
  74. data/examples/latex_templates/Test-aea_article/Test-aea_article.pdf +0 -0
  75. data/examples/latex_templates/Test-aea_article/aea.bst +1269 -0
  76. data/examples/latex_templates/Test-aea_article/multicol.sty +853 -0
  77. data/examples/latex_templates/Test-aea_article/references.bib +0 -0
  78. data/examples/latex_templates/Test-aea_article/setspace.sty +546 -0
  79. data/examples/latex_templates/Test-amq_article/Test-amq_article.Rmd +256 -0
  80. data/examples/latex_templates/Test-amq_article/Test-amq_article.pdf +0 -0
  81. data/examples/latex_templates/Test-amq_article/Test-amq_article.pdfsync +3397 -0
  82. data/examples/latex_templates/Test-amq_article/pics/Figure2.pdf +0 -0
  83. data/examples/latex_templates/Test-ams_article/Test-ams_article.Rmd +215 -0
  84. data/examples/latex_templates/Test-ams_article/amstest.bib +436 -0
  85. data/examples/latex_templates/Test-asa_article/Test-asa_article.Rmd +153 -0
  86. data/examples/latex_templates/Test-asa_article/Test-asa_article.pdf +0 -0
  87. data/examples/latex_templates/Test-asa_article/agsm.bst +1353 -0
  88. data/examples/latex_templates/Test-asa_article/bibliography.bib +233 -0
  89. data/examples/latex_templates/Test-ieee_article/IEEEtran.bst +2409 -0
  90. data/examples/latex_templates/Test-ieee_article/IEEEtran.cls +6346 -0
  91. data/examples/latex_templates/Test-ieee_article/Test-ieee_article.Rmd +175 -0
  92. data/examples/latex_templates/Test-ieee_article/Test-ieee_article.pdf +0 -0
  93. data/examples/latex_templates/Test-ieee_article/mybibfile.bib +20 -0
  94. data/examples/latex_templates/Test-rjournal_article/RJournal.sty +335 -0
  95. data/examples/latex_templates/Test-rjournal_article/RJreferences.bib +18 -0
  96. data/examples/latex_templates/Test-rjournal_article/RJwrapper.pdf +0 -0
  97. data/examples/latex_templates/Test-rjournal_article/Test-rjournal_article.Rmd +52 -0
  98. data/examples/latex_templates/Test-springer_article/Test-springer_article.Rmd +65 -0
  99. data/examples/latex_templates/Test-springer_article/Test-springer_article.pdf +0 -0
  100. data/examples/latex_templates/Test-springer_article/bibliography.bib +26 -0
  101. data/examples/latex_templates/Test-springer_article/spbasic.bst +1658 -0
  102. data/examples/latex_templates/Test-springer_article/spmpsci.bst +1512 -0
  103. data/examples/latex_templates/Test-springer_article/spphys.bst +1443 -0
  104. data/examples/latex_templates/Test-springer_article/svglov3.clo +113 -0
  105. data/examples/latex_templates/Test-springer_article/svjour3.cls +1431 -0
  106. data/examples/misc/moneyball.rb +1 -1
  107. data/examples/misc/subsetting.rb +37 -37
  108. data/examples/rmarkdown/svm-rmarkdown-anon-ms-example/svm-rmarkdown-anon-ms-example.Rmd +73 -0
  109. data/examples/rmarkdown/svm-rmarkdown-anon-ms-example/svm-rmarkdown-anon-ms-example.pdf +0 -0
  110. data/examples/rmarkdown/svm-rmarkdown-article-example/svm-rmarkdown-article-example.Rmd +382 -0
  111. data/examples/rmarkdown/svm-rmarkdown-article-example/svm-rmarkdown-article-example.pdf +0 -0
  112. data/examples/rmarkdown/svm-rmarkdown-beamer-example/svm-rmarkdown-beamer-example.Rmd +164 -0
  113. data/examples/rmarkdown/svm-rmarkdown-beamer-example/svm-rmarkdown-beamer-example.pdf +0 -0
  114. data/examples/rmarkdown/svm-rmarkdown-cv/svm-rmarkdown-cv.Rmd +92 -0
  115. data/examples/rmarkdown/svm-rmarkdown-cv/svm-rmarkdown-cv.pdf +0 -0
  116. data/examples/rmarkdown/svm-rmarkdown-syllabus-example/attend-grade-relationships.csv +482 -0
  117. data/examples/rmarkdown/svm-rmarkdown-syllabus-example/svm-rmarkdown-syllabus-example.Rmd +280 -0
  118. data/examples/rmarkdown/svm-rmarkdown-syllabus-example/svm-rmarkdown-syllabus-example.pdf +0 -0
  119. data/examples/rmarkdown/svm-xaringan-example/svm-xaringan-example.Rmd +386 -0
  120. data/lib/R_interface/r.rb +2 -2
  121. data/lib/R_interface/r_libs.R +6 -1
  122. data/lib/R_interface/r_methods.rb +12 -2
  123. data/lib/R_interface/rdata_frame.rb +8 -17
  124. data/lib/R_interface/rindexed_object.rb +1 -2
  125. data/lib/R_interface/rlist.rb +1 -0
  126. data/lib/R_interface/robject.rb +20 -23
  127. data/lib/R_interface/rpkg.rb +15 -6
  128. data/lib/R_interface/rsupport.rb +13 -19
  129. data/lib/R_interface/ruby_extensions.rb +14 -18
  130. data/lib/R_interface/rvector.rb +0 -12
  131. data/lib/gknit.rb +2 -0
  132. data/lib/gknit/draft.rb +105 -0
  133. data/lib/gknit/knitr_engine.rb +6 -37
  134. data/lib/util/exec_ruby.rb +22 -84
  135. data/lib/util/inline_file.rb +7 -3
  136. data/specs/figures/bg.jpeg +0 -0
  137. data/specs/figures/bg.png +0 -0
  138. data/specs/figures/bg.svg +2 -2
  139. data/specs/figures/dose_len.png +0 -0
  140. data/specs/figures/no_args.jpeg +0 -0
  141. data/specs/figures/no_args.png +0 -0
  142. data/specs/figures/no_args.svg +2 -2
  143. data/specs/figures/width_height.jpeg +0 -0
  144. data/specs/figures/width_height.png +0 -0
  145. data/specs/figures/width_height_units1.jpeg +0 -0
  146. data/specs/figures/width_height_units1.png +0 -0
  147. data/specs/figures/width_height_units2.jpeg +0 -0
  148. data/specs/figures/width_height_units2.png +0 -0
  149. data/specs/r_dataframe.spec.rb +184 -11
  150. data/specs/r_list.spec.rb +4 -4
  151. data/specs/r_list_apply.spec.rb +11 -10
  152. data/specs/ruby_expression.spec.rb +3 -11
  153. data/specs/tmp.rb +106 -34
  154. data/version.rb +1 -1
  155. metadata +96 -33
  156. data/bin/gknit_old_r +0 -236
  157. data/blogs/dev/dev.Rmd +0 -77
  158. data/blogs/dev/dev.md +0 -87
  159. data/blogs/dev/dev_files/figure-html/bubble-1.png +0 -0
  160. data/blogs/dev/dev_files/figure-html/diverging_bar. +0 -0
  161. data/blogs/dev/dev_files/figure-html/diverging_bar.png +0 -0
  162. data/blogs/dplyr/dplyr.rb +0 -63
  163. data/blogs/galaaz_ggplot/galaaz_ggplot.aux +0 -43
  164. data/blogs/galaaz_ggplot/galaaz_ggplot.log +0 -640
  165. data/blogs/galaaz_ggplot/galaaz_ggplot.out +0 -10
  166. data/blogs/galaaz_ggplot/galaaz_ggplot.tex +0 -481
  167. data/blogs/galaaz_ggplot/midwest.png +0 -0
  168. data/blogs/galaaz_ggplot/scatter_plot.png +0 -0
  169. data/blogs/ruby_plot/ruby_plot.Rmd_external_figs +0 -662
  170. data/blogs/ruby_plot/ruby_plot.tex +0 -1077
  171. data/blogs/ruby_plot/ruby_plot_files/figure-html/dose_len.svg +0 -57
  172. data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_delivery.svg +0 -106
  173. data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_dose.svg +0 -110
  174. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color.svg +0 -174
  175. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color2.svg +0 -236
  176. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_jitter.svg +0 -296
  177. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_points.svg +0 -236
  178. data/blogs/ruby_plot/ruby_plot_files/figure-html/final_box_plot.svg +0 -218
  179. data/blogs/ruby_plot/ruby_plot_files/figure-html/final_violin_plot.svg +0 -128
  180. data/blogs/ruby_plot/ruby_plot_files/figure-html/violin_with_jitter.svg +0 -150
  181. data/examples/paper/paper.rb +0 -36
@@ -0,0 +1,1198 @@
1
+ ---
2
+ title: "Non Standard Evaluation in dplyr with Galaaz"
3
+ author:
4
+ - "Rodrigo Botafogo"
5
+ - "Daniel Mossé - University of Pittsburgh"
6
+ tags: [Tech, Data Science, Ruby, R, GraalVM]
7
+ date: "10/05/2019"
8
+ output:
9
+ html_document:
10
+ self_contained: true
11
+ keep_md: true
12
+ pdf_document:
13
+ includes:
14
+ in_header: ["../../sty/galaaz.sty"]
15
+ number_sections: yes
16
+ toc: true
17
+ toc_depth: 2
18
+ md_document:
19
+ variant: markdown_github
20
+ fontsize: 11pt
21
+ ---
22
+
23
+
24
+
25
+ # Introduction
26
+
27
+ According to Steven Sagaert's answer on Quora to the question "Is programming language R overrated?":
28
+
29
+ > R is a sophisticated language with an unusual (i.e. non-mainstream) set of features. It’s
30
+ > an impure functional programming language with sophisticated metaprogramming and 3
31
+ > different OO systems.
32
+
33
+ > Just like common lisp you can completely customise how things work via metaprogramming.
34
+ > The biggest example is the tidyverse: by creating it’s own evaluation system (tidyeval)
35
+ > was able to create a custom syntax for dplyr.
36
+
37
+ > Mastering R (the language) and its ecosystem is not a matter of weeks or months but
38
+ > takes years. The rabbit hole goes pretty deep…
39
+
40
+ Although a highly configurable language can give great power to the programmer,
41
+ it can also, as stated above, take years to master. Programming with _dplyr_,
42
+ for instance, requires learning a set of complex concepts and rules that are not easily
43
+ accessible to casual users or _unsophisticated_ programmers, as many users of R are. The word
44
+ _unsophisticated_ is NOT used here in a negative sense: R was built for statisticians, who
45
+ need to solve real problems, often in a short time span, and who are not
46
+ concerned with creating complex computer systems, rather than for programmers.
47
+
48
+ Unfortunately, if this _unsophisticated_ programmer decides to move on to more sophisticated
49
+ coding, the learning curve might become a serious impediment.
50
+
51
+ In this post we will see how to program with _dplyr_ in Galaaz and how Ruby can simplify
52
+ the learning curve of mastering _dplyr_ coding.
53
+
54
+ # But first, what is Galaaz?
55
+
56
+ Galaaz is a system for tightly coupling Ruby and R. Ruby is a powerful language, with
57
+ a large community, a very large set of libraries and great for web development. It is also
58
+ easy to learn. However,
59
+ it lacks libraries for data science, statistics, scientific plotting and machine learning.
60
+ On the other hand, R is considered one of the most powerful languages for solving all of the
61
+ above problems. Maybe the strongest competitor to R is Python with libraries such as NumPy,
62
+ Pandas, SciPy, scikit-learn and many more. We will not get into the R versus Python
63
+ discussion here; both are excellent languages with powerful features, benefits and drawbacks.
64
+ Our interest is to bring to yet another excellent language, Ruby, the data science libraries
65
+ that it lacks.
66
+
67
+ With Galaaz we do not intend to re-implement any of the scientific libraries in R. However, we
68
+ allow for very tight coupling between the two languages to the point that the Ruby
69
+ developer does not need to know that there is an R engine running. Also, from the point of
70
+ view of the R user/developer, Galaaz looks a lot like R, with just minor syntactic differences,
71
+ so there is almost no learning curve for the R developer. And, as we will see in this
72
+ post, programming with _dplyr_ is easier in Galaaz than in R.
73
+
74
+ R users are probably quite knowledgeable about _dplyr_. For the Ruby developer, _dplyr_ and
75
+ the _tidyverse_ libraries are a set of libraries for data manipulation in R, developed by
76
+ Hadley Wickham, chief scientist at RStudio and a prolific R coder and writer.
77
+
78
+ For the coupling of Ruby and R, we use new technologies provided by Oracle: GraalVM,
79
+ TruffleRuby and FastR. The GraalVM home page gave the following definition:
80
+
81
+ GraalVM is a universal virtual machine for running applications
82
+ written in JavaScript, Python 3, Ruby, R, JVM-based languages like Java,
83
+ Scala, Kotlin, and LLVM-based languages such as C and C++.
84
+
85
+ GraalVM removes the isolation between programming languages and enables
86
+ interoperability in a shared runtime. It can run either standalone or in
87
+ the context of OpenJDK, Node.js, Oracle Database, or MySQL.
88
+
89
+ GraalVM allows you to write polyglot applications with a seamless way to
90
+ pass values from one language to another. With GraalVM there is no copying
91
+ or marshaling necessary as it is with other polyglot systems. This lets
92
+ you achieve high performance when language boundaries are crossed. Most
93
+ of the time there is no additional cost for crossing a language boundary
94
+ at all.
95
+
96
+ Often developers have to make uncomfortable compromises that require them
97
+ to rewrite their software in other languages. For example:
98
+
99
+ * “That library is not available in my language. I need to rewrite it.”
100
+ * “That language would be the perfect fit for my problem, but we cannot
101
+ run it in our environment.”
102
+ * “That problem is already solved in my language, but the language is
103
+ too slow.”
104
+
105
+ With GraalVM we aim to allow developers to freely choose the right language
106
+ for the task at hand without making compromises.
107
+
108
+
109
+ # Tidyverse and dplyr
110
+
111
+ In [What is the tidyverse?](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/) the
112
+ tidyverse is explained as follows:
113
+
114
+ > The tidyverse is a coherent system of packages for data manipulation, exploration and
115
+ > visualization that share a common design philosophy. These were mostly developed by
116
+ > Hadley Wickham himself, but they are now being expanded by several contributors. Tidyverse
117
+ > packages are intended to make statisticians and data scientists more productive by
118
+ > guiding them through workflows that facilitate communication, and result in reproducible
119
+ > work products. Fundamentally, the tidyverse is about the connections between the tools
120
+ > that make the workflow possible.
121
+
122
+ _dplyr_ is one of the many packages that are part of the tidyverse. It is:
123
+
124
+ > a grammar of data manipulation, providing a consistent set of verbs that help you solve
125
+ > the most common data manipulation challenges:
126
+
127
+ > 1. mutate() adds new variables that are functions of existing variables
128
+ > 2. select() picks variables based on their names.
129
+ > 3. filter() picks cases based on their values.
130
+ > 4. summarise() reduces multiple values down to a single summary.
131
+ > 5. arrange() changes the ordering of the rows.
132
+
133
+ Very often R is used interactively and users use _dplyr_ to manipulate a single dataset
134
+ without programming. When users want to replicate their work for
135
+ multiple datasets, programming becomes necessary.
136
+
137
+ # Programming with dplyr
138
+
139
+ In the vignette ["Programming with dplyr"](https://dplyr.tidyverse.org/articles/programming.html),
140
+ Hadley Wickham states:
141
+
142
+ > Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that
143
+ > means they don’t follow the usual R rules of evaluation. Instead, they capture the
144
+ > expression that you typed and evaluate it in a custom way. This has two main
145
+ > benefits for dplyr code:
146
+
147
+ > Operations on data frames can be expressed succinctly because you don’t need to repeat
148
+ > the name of the data frame. For example, you can write filter(df, x == 1, y == 2, z == 3)
149
+ > instead of df[df\$x == 1 & df\$y ==2 & df\$z == 3, ].
150
+
151
+ > dplyr can choose to compute results in a different way to base R. This is important for
152
+ > database backends because dplyr itself doesn’t do any work, but instead generates the SQL
153
+ > that tells the database what to do.
154
+
155
+ But then he goes on:
156
+
157
+ > Unfortunately these benefits do not come for free. There are two main drawbacks:
158
+
159
+ > Most dplyr arguments are not referentially transparent. That means you can’t replace a value
160
+ > with a seemingly equivalent object that you’ve defined elsewhere. In other words, this code:
161
+
162
+
163
+ ```r
164
+ df <- data.frame(x = 1:3, y = 3:1)
165
+ print(filter(df, x == 1))
166
+ #> # A tibble: 1 x 2
167
+ #> x y
168
+ #> <int> <int>
169
+ #> 1 1 3
170
+ ```
171
+ > Is not equivalent to this code:
172
+
173
+
174
+ ```r
175
+ my_var <- x
176
+ #> Error in eval(expr, envir, enclos): object 'x' not found
177
+ filter(df, my_var == 1)
178
+ #> Error: object 'my_var' not found
179
+ ```
180
+ > This makes it hard to create functions with arguments that change how dplyr verbs are computed.
181
+
182
+ As a result of this, programming with _dplyr_ requires learning a set of new ideas and concepts.
183
+ In this vignette Hadley goes on to show how to program ever more difficult problems with _dplyr_,
184
+ showing the problems it faces and the new concepts needed to solve them.
185
+
186
+ In this blog, we will look at all the problems presented by Hadley in the vignette and show how
187
+ those same problems can be solved using Galaaz and the Ruby language.
188
+
189
+ This blog is organized as follows: first we show how to write expressions using Galaaz.
190
+ Expressions are a fundamental concept in _dplyr_ and are not part of basic Ruby. We extend
191
+ the Ruby language to create and manipulate expressions that will be used by _dplyr_ functions.
192
+
193
+ Then we show very succinctly how Ruby and R can be integrated and how R functions are
194
+ transparently called from Ruby. The Galaaz [user manual](https://github.com/rbotafogo/galaaz/wiki)
195
+ (still in development) goes into much more detail about this integration.
196
+
197
+ Next, in section "Data manipulation with _dplyr_", we go through all the problems in the
198
+ _dplyr_ vignette and look at how they are solved in Galaaz. We then discuss why programming
199
+ with Galaaz and _dplyr_ is easier than programming with _dplyr_ in plain R.
200
+
201
+ The following section looks at another more advanced problem and shows that Galaaz can still
202
+ handle it without any difficulty. We then provide further reading and concluding remarks.
203
+
204
+ # Writing Expressions in Galaaz
205
+
206
+ Galaaz extends Ruby to work with expressions, similar to R's expressions built with 'quote'
207
+ (base R) or 'quo' (tidyverse). Expressions in this context are like mathematical expressions or
208
+ formulae. For instance, in mathematics, the expression $y = sin(x)$ describes a function but cannot
209
+ be computed unless $x$ is bound to some value.
210
+
211
+ Expressions are fundamental in _dplyr_ programming as they are the input to _dplyr_ functions.
212
+ For instance, as we will see shortly, if a data frame has a column named 'x' and we want
213
+ to add another column, y, to this dataframe that has the values of 'x' times 2, then we would
214
+ call a _dplyr_ function with the expression 'y = x * 2'.
215
+
216
+ ## A note on notation
217
+
218
+ This blog was written in Rmarkdown and automatically converted to HTML or PDF (depending on
219
+ where you are reading this blog) with gKnit (a tool provided by Galaaz). In Rmarkdown, it is
220
+ possible to write text and code blocks that are executed to generate the final report. Code
221
+ blocks appear inside a 'box' and the result of their execution appears either in another type
221
+ of 'box' with a different background (HTML) or as normal text (PDF). Every output line from
222
+ the code execution is preceded by '##'.
224
+
225
+ ## Expressions from operators
226
+
227
+ The code below creates an expression summing two symbols. Note that :a and :b are Ruby symbols and
228
+ are not bound to any values at the time of expression definition:
229
+
230
+
231
+ ```ruby
232
+ exp1 = :a + :b
233
+ puts exp1
234
+ ```
235
+
236
+ ```
237
+ ## a + b
238
+ ```
239
+ In Galaaz, we can build any complex mathematical expression such as:
240
+
241
+
242
+ ```ruby
243
+ exp2 = (:a + :b) * 2.0 + :c ** 2 / :z
244
+ puts exp2
245
+ ```
246
+
247
+ ```
248
+ ## (a + b) * 2 + c^2L/z
249
+ ```
250
+ Expressions are printed with the same format as the equivalent R expressions. The 'L' after
251
+ 2 indicates that 2 is an integer.
252
+
253
+ The R developer should note that in R, if she writes the
254
+ number '2', the R interpreter will treat it as a float. In order to get an integer she
255
+ should write '2L'. Galaaz follows Ruby notation and '2' is an integer, while '2.0' is a
256
+ float.
257
+
258
+ It is also possible to use inequality operators in building expressions:
259
+
260
+
261
+ ```ruby
262
+ exp3 = (:a + :b) >= :z
263
+ puts exp3
264
+ ```
265
+
266
+ ```
267
+ ## a + b >= z
268
+ ```
269
+ Expression definitions can also make use of normal Ruby variables without any problem:
270
+
271
+
272
+ ```ruby
273
+ x = 20
274
+ y = 30.0
275
+ exp_var = (:a + :b) * x <= :z - y
276
+ puts exp_var
277
+ ```
278
+
279
+ ```
280
+ ## (a + b) * 20L <= z - 30
281
+ ```
282
+
283
+ Galaaz provides both symbolic representations for operators, such as (>, <, !=), and functional
284
+ notation for those operators, such as (.gt, .ge, etc.). So the same expression written
285
+ above can also be written as
286
+
287
+
288
+ ```ruby
289
+ exp4 = (:a + :b).ge :z
290
+ puts exp4
291
+ ```
292
+
293
+ ```
294
+ ## a + b >= z
295
+ ```
296
+
297
+ Two types of expressions, however, can only be created with the functional representation
298
+ of the operators. Those are expressions involving '==', and '='. This is the case since
299
+ those symbols have special meaning in Ruby and should not be redefined.
300
+
301
+ In order to write an expression involving '==' we
302
+ need to use the method '.eq' and for '=' we need the function '.assign':
303
+
304
+
305
+ ```ruby
306
+ exp5 = (:a + :b).eq :z
307
+ puts exp5
308
+ ```
309
+
310
+ ```
311
+ ## a + b == z
312
+ ```
313
+
314
+
315
+ ```ruby
316
+ exp6 = :y.assign :a + :b
317
+ puts exp6
318
+ ```
319
+
320
+ ```
321
+ ## y <- a + b
322
+ ```
323
+ Users should be careful when writing expressions not to inadvertently use '==' or '=', as
324
+ this will generate an error that might be a bit cryptic (in future releases of Galaaz, we
325
+ plan to improve the error message).
326
+
327
+
328
+ ```ruby
329
+ exp_wrong = (:a + :b) == :z
330
+ puts exp_wrong
331
+ ```
332
+
333
+ ```
334
+ ## Message:
335
+ ## Error in function (x, y, num.eq = TRUE, single.NA = TRUE, attrib.as.set = TRUE, :
336
+ ## object 'a' not found (RError)
337
+ ## Translated to internal error
338
+ ```
339
+ The problem is that
340
+ '==' does not build a new expression: it immediately compares expression (:a + :b) to expression :z. When this
341
+ comparison is executed, the system tries to evaluate :a, :b and :z, and those symbols, at
342
+ this time, are not bound to anything, which gives the "object 'a' not found" message.
343
+
344
+ ## Expressions with R methods
345
+
346
+ It is often necessary to create an expression that uses a method or function. For instance, in
347
+ mathematics, it's quite natural to write an expression such as $y = sin(x)$. In this case, the
348
+ 'sin' function is part of the expression and should not be immediately executed. When we want
349
+ the function to be part of the expression, we call the function preceded
350
+ by 'E.', such as 'E.sin(x)':
351
+
352
+
353
+ ```ruby
354
+ exp7 = :y.assign E.sin(:x)
355
+ puts exp7
356
+ ```
357
+
358
+ ```
359
+ ## y <- sin(x)
360
+ ```
361
+ Function expressions can also be written using '.' notation:
362
+
363
+
364
+ ```ruby
365
+ exp8 = :y.assign :x.sin
366
+ puts exp8
367
+ ```
368
+
369
+ ```
370
+ ## y <- sin(x)
371
+ ```
372
+ When a function has multiple arguments, the first one can be used before the '.'. For instance,
373
+ the R concatenate function 'c', which concatenates two or more arguments, can be part of
374
+ an expression as:
375
+
376
+
377
+ ```ruby
378
+ exp9 = :x.c(:y)
379
+ puts exp9
380
+ ```
381
+
382
+ ```
383
+ ## c(x, y)
384
+ ```
385
+ Note that this gives an OO feeling to the code, as if we were saying 'x' concatenates 'y'. As a
386
+ side note, '.' notation can be used like the R pipe operator '%>%', but is more general than the
387
+ pipe.
388
+
389
+ ## Evaluating an Expression
390
+
391
+ Although we are mainly focusing on expressions to pass them to _dplyr_ functions, expressions
392
+ can be evaluated by calling function 'eval' with a binding.
393
+
394
+ A binding can be provided with a list or a data frame as shown below:
395
+
396
+
397
+ ```ruby
398
+ exp = (:a + :b) * 2.0 + :c ** 2 / :z
399
+ puts exp.eval(R.list(a: 10, b: 20, c: 30, z: 40))
400
+ ```
401
+
402
+ ```
403
+ ## [1] 82.5
404
+ ```
405
+
406
+ with a data frame:
407
+
408
+
409
+ ```ruby
410
+ df = R.data__frame(
411
+ a: R.c(1, 2, 3),
412
+ b: R.c(10, 20, 30),
413
+ c: R.c(100, 200, 300),
414
+ z: R.c(1000, 2000, 3000))
415
+
416
+ puts exp.eval(df)
417
+ ```
418
+
419
+ ```
420
+ ## [1] 32 64 96
421
+ ```
422
+
423
+ # Using Galaaz to call R functions
424
+
425
+ Galaaz tries to emulate as closely as possible the way R functions are called, so migrating from
426
+ R to Galaaz should be quite easy, requiring only minor syntactic changes to an R script. In
427
+ this post, we do not have enough space to write a complete manual on Galaaz
428
+ (a short manual can be found at: https://www.rubydoc.info/gems/galaaz/0.4.9), so we will
429
+ present only a few example scripts using Galaaz.
430
+
431
+ Basically, to call an R function from Ruby with Galaaz, one only needs to precede the function
432
+ with 'R.'. For instance, to create a vector in R, the 'c' function is used. In Galaaz, a
433
+ vector can be created by using 'R.c':
434
+
435
+
436
+ ```ruby
437
+ vec = R.c(1.0, 2, 3)
438
+ puts vec
439
+ ```
440
+
441
+ ```
442
+ ## [1] 1 2 3
443
+ ```
444
+ A list is created in R with the 'list' function, so in Galaaz we do:
445
+
446
+
447
+ ```ruby
448
+ list = R.list(a: 1.0, b: 2, c: 3)
449
+ puts list
450
+ ```
451
+
452
+ ```
453
+ ## $a
454
+ ## [1] 1
455
+ ##
456
+ ## $b
457
+ ## [1] 2
458
+ ##
459
+ ## $c
460
+ ## [1] 3
461
+ ```
462
+ Note that we can use named arguments in our list. The same code in R would be:
463
+
464
+
465
+ ```r
466
+ lst = list(a = 1, b = 2L, c = 3L)
467
+ print(lst)
468
+ ```
469
+
470
+ ```
471
+ ## $a
472
+ ## [1] 1
473
+ ##
474
+ ## $b
475
+ ## [1] 2
476
+ ##
477
+ ## $c
478
+ ## [1] 3
479
+ ```
480
+ Now, let's say that 'x' is an angle of 45 radians (R's 'sin' takes its argument in radians) and we actually want to create
481
+ the expression $y = sin(45)$ with 'sin' evaluated immediately, which is $y = 0.850...$. In this case,
482
+ we will use 'R.sin':
483
+
484
+
485
+ ```ruby
486
+ exp10 = :y.assign R.sin(45)
487
+ puts exp10
488
+ ```
489
+
490
+ ```
491
+ ## y <- 0.850903524534118
492
+ ```
493
+
494
+ # Data manipulation with _dplyr_
495
+
496
+ In this section we will give a brief tour of _dplyr_'s usage in Galaaz and show how to manipulate
497
+ data in Ruby with it. This section follows [_dplyr_'s vignette](https://dplyr.tidyverse.org/articles/dplyr.html), which explores the nycflights13 data set. This dataset contains all 336776
498
+ flights that departed from New York City in 2013. The data comes from the US Bureau of
499
+ Transportation Statistics.
500
+
501
+ Let's start by taking a look at this dataset:
502
+
503
+
504
+ ```ruby
505
+ R.library('nycflights13')
506
+ # check its dimensions
507
+ puts ~:flights.dim
508
+ # and the structure
509
+ ~:flights.str
510
+ ```
511
+
512
+ ```
513
+ ## Message:
514
+ ## Method ~ not found in R environment
515
+ ```
516
+
517
+ ```
518
+ ## Message:
519
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/rsupport.rb:90:in `eval'
520
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/rsupport.rb:270:in `exec_function_name'
521
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/robject.rb:166:in `method_missing'
522
+ ## /home/rbotafogo/desenv/galaaz/lib/util/exec_ruby.rb:105:in `get_binding'
523
+ ## /home/rbotafogo/desenv/galaaz/lib/util/exec_ruby.rb:102:in `eval'
524
+ ## /home/rbotafogo/desenv/galaaz/lib/util/exec_ruby.rb:102:in `exec_ruby'
525
+ ## /home/rbotafogo/desenv/galaaz/lib/gknit/knitr_engine.rb:650:in `block in initialize'
526
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/ruby_callback.rb:77:in `call'
527
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/ruby_callback.rb:77:in `callback'
528
+ ## (eval):3:in `function(...) {\n rb_method(...)'
529
+ ## unknown.r:1:in `in_dir'
530
+ ## unknown.r:1:in `block_exec'
531
+ ## /usr/local/lib/graalvm-ce-java11-20.0.0/languages/R/library/knitr/R/block.R:92:in `call_block'
532
+ ## /usr/local/lib/graalvm-ce-java11-20.0.0/languages/R/library/knitr/R/block.R:6:in `process_group.block'
533
+ ## /usr/local/lib/graalvm-ce-java11-20.0.0/languages/R/library/knitr/R/block.R:3:in `<no source>'
534
+ ## unknown.r:1:in `withCallingHandlers'
535
+ ## unknown.r:1:in `process_file'
536
+ ## unknown.r:1:in `<no source>'
537
+ ## unknown.r:1:in `<no source>'
538
+ ## <REPL>:4:in `<repl wrapper>'
539
+ ## <REPL>:1
540
+ ```
541
+
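The "Method ~ not found" message above is probably a matter of Ruby operator precedence rather than of the data set: a method call with '.' binds tighter than the unary '~', so '~:flights.dim' is read as '~(:flights.dim)'. The sketch below shows this in plain Ruby, without Galaaz; the 'Probe' class is a hypothetical stand-in that only records the order in which it is called.

```ruby
# Hypothetical stand-in for an object that responds to unary '~'.
# This is plain Ruby and does not use Galaaz.
class Probe
  attr_reader :log

  def initialize(log = [])
    @log = log
  end

  # Record that 'dim' was called and return a new Probe.
  def dim
    Probe.new(@log + [:dim])
  end

  # Unary tilde: record the call and return a new Probe.
  def ~
    Probe.new(@log + [:tilde])
  end
end

flights = Probe.new

# Without inner parentheses the method call binds tighter than '~',
# so 'dim' runs first and '~' is applied to its result.
unparenthesized = (~flights.dim).log

# With parentheses '~' runs first and 'dim' is called on its result.
parenthesized = (~flights).dim.log

puts unparenthesized.inspect
puts parenthesized.inspect
```

Under this reading, the chunk above would need '(~:flights).dim' and '(~:flights).str', which matches the '(~:starwars).head' form used later in this manual. We have not re-run the chunk, so take this as a likely explanation rather than a verified fix.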
542
+ Now, let's use a first verb of _dplyr_: 'filter'. This verb, obviously, will filter the data
543
+ by the given expression. In the next block, we filter by columns 'month' and 'day'. The
544
+ first argument to the filter function is symbol ':flights'. A Ruby symbol, when given to
545
+ an R function will convert to the R variable of the same name, in this case 'flights', that
546
+ holds the nycflights13 data frame.
547
+
548
+ The second and third arguments are expressions that will be used by the filter function to
549
+ filter by columns, looking for entries in which the month and day are equal to 1.
550
+
551
+
552
+ ```ruby
553
+ puts R.filter(:flights, (:month.eq 1), (:day.eq 1))
554
+ ```
555
+
556
+ ```
557
+ ## # A tibble: 842 x 19
558
+ ## year month day dep_time sched_dep_time dep_delay arr_time
559
+ ## <int> <int> <int> <int> <int> <dbl> <int>
560
+ ## 1 2013 1 1 517 515 2 830
561
+ ## 2 2013 1 1 533 529 4 850
562
+ ## 3 2013 1 1 542 540 2 923
563
+ ## 4 2013 1 1 544 545 -1 1004
564
+ ## 5 2013 1 1 554 600 -6 812
565
+ ## 6 2013 1 1 554 558 -4 740
566
+ ## 7 2013 1 1 555 600 -5 913
567
+ ## 8 2013 1 1 557 600 -3 709
568
+ ## 9 2013 1 1 557 600 -3 838
569
+ ## 10 2013 1 1 558 600 -2 753
570
+ ## # … with 832 more rows, and 12 more variables: sched_arr_time <int>,
571
+ ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
572
+ ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
573
+ ## # minute <dbl>, time_hour <dttm>
574
+ ```
575
+
576
+
577
+ ## Programming with _dplyr_: problems and how to solve them in Galaaz
578
+
579
+ In this section we look at the list of problems that Hadley describes in the "Programming with dplyr"
580
+ vignette and show how those problems are solved and coded with Galaaz. Readers interested in
581
+ how those problems are treated in _dplyr_ should read the vignette and use it as a comparison with
582
+ this blog.
583
+
584
+ ## Filtering using expressions
585
+
586
+ Now that we know how to write expressions and call R functions, let's do some data manipulation in
587
+ Galaaz. Let's first start by creating a data frame. In R, the 'data.frame' function creates a
588
+ data frame. In Ruby, writing 'data.frame' will not parse as a single object. To call R
589
+ functions that have a '.' in them, we need to substitute the '.' with '__'. So, method
590
+ 'data.frame' in R, is called in Galaaz as 'R.data\_\_frame':
591
+
592
+
593
+ ```ruby
594
+ df = R.data__frame(x: (1..3), y: (3..1))
595
+ puts df
596
+ ```
597
+
598
+ ```
599
+ ## x y
600
+ ## 1 1 3
601
+ ## 2 2 2
602
+ ## 3 3 1
603
+ ```
604
+
605
+ _dplyr_ provides the 'filter' function, which filters data in a data frame. The 'filter'
606
+ function can be called on this data frame either by using 'R.filter(df, ...)' or
607
+ by using dot notation.
608
+
611
+ We prefer to use dot notation as shown below. The argument to 'filter' should be an
612
+ expression. Note that if we gave to filter a Ruby expression such as
613
+ 'x == 1', we would get an error, since there is no variable 'x' defined and if 'x' was a variable
614
+ then 'x == 1' would either be 'true' or 'false'. Our goal is to filter our data frame returning
615
+ all rows in which the 'x' value is equal to 1. To express this we want: ':x.eq 1', where :x will
616
+ be interpreted by filter as the 'x' column.
617
+
618
+
619
+ ```ruby
620
+ puts df.filter(:x.eq 1)
621
+ ```
622
+
623
+ ```
624
+ ## x y
625
+ ## 1 1 3
626
+ ```
627
+ In R, and when coding with 'tidyverse', arguments to a function are usually not
628
+ *referentially transparent*. That is, you can’t replace a value with a seemingly equivalent
629
+ object that you’ve defined elsewhere. In other words, this code
630
+
631
+
632
+ ```r
633
+ my_var <- x
634
+ filter(df, my_var == 1)
635
+ ```
636
+ generates the following error: "object 'x' not found".
637
+
638
+ However, in Galaaz, arguments are referentially transparent as can be seen by the
638
+ code below. Note initially that 'my_var = :x' will not give the error "object 'x' not found"
640
+ since ':x' is treated as an expression and assigned to my\_var. Then when doing (my\_var.eq 1),
641
+ my\_var is a variable that resolves to ':x' and it becomes equivalent to (:x.eq 1) which is
642
+ what we want.
643
+
644
+
645
+ ```ruby
646
+ my_var = :x
647
+ puts df.filter(my_var.eq 1)
648
+ ```
649
+
650
+ ```
651
+ ## x y
652
+ ## 1 1 3
653
+ ```
654
+ As stated by Hadley:
655
+
656
+ > dplyr code is ambiguous. Depending on what variables are defined where,
657
+ > filter(df, x == y) could be equivalent to any of:
658
+
659
+ ```
660
+ df[df$x == df$y, ]
661
+ df[df$x == y, ]
662
+ df[x == df$y, ]
663
+ df[x == y, ]
664
+ ```
665
+ In Galaaz this ambiguity does not exist: filter(df, x.eq y) is not a valid expression as
666
+ expressions are built with symbols. In doing filter(df, :x.eq y) we are looking for elements
667
+ of the 'x' column that are equal to a previously defined y variable. Finally in
668
+ filter(df, :x.eq :y) we are looking for elements in which the 'x' column value is equal to
669
+ the 'y' column value. This can be seen in the following two chunks of code:
670
+
671
+
672
+ ```ruby
673
+ y = 1
674
+ x = 2
675
+
676
+ # looking for values where the 'x' column is equal to the 'y' column
677
+ puts df.filter(:x.eq :y)
678
+ ```
679
+
680
+ ```
681
+ ## x y
682
+ ## 1 2 2
683
+ ```
684
+
685
+
686
+ ```ruby
687
+ # looking for values where the 'x' column is equal to the 'y' variable
688
+ # in this case, the number 1
689
+ puts df.filter(:x.eq y)
690
+ ```
691
+
692
+ ```
693
+ ## x y
694
+ ## 1 1 3
695
+ ```
696
+ ## Writing a function that applies to different data sets
697
+
698
+ Let's suppose that we want to write a function that receives as the first argument a data frame
699
+ and as second argument an expression that adds a column to the data frame that is equal to the
700
+ sum of elements in column 'a' plus 'x'.
701
+
702
+ Here is the intended behaviour using the 'mutate' function of 'dplyr':
703
+
704
+ ```
705
+ mutate(df1, y = a + x)
706
+ mutate(df2, y = a + x)
707
+ mutate(df3, y = a + x)
708
+ mutate(df4, y = a + x)
709
+ ```
710
+ The naive approach to writing an R function to solve this problem is:
711
+
712
+ ```
713
+ mutate_y <- function(df) {
714
+ mutate(df, y = a + x)
715
+ }
716
+ ```
717
+ Unfortunately, in R, this function can fail silently if one of the variables isn’t present
718
+ in the data frame, but is present in the global environment. We will not go through here how
719
+ to solve this problem in R.
720
+
721
+ In Galaaz the method mutate_y below will work fine and will never fail silently.
722
+
723
+
724
+ ```ruby
725
+ def mutate_y(df)
726
+ df.mutate(:y.assign :a + :x)
727
+ end
728
+ ```
729
+ Here we create a data frame that has only one column named 'x':
730
+
731
+
732
+ ```ruby
733
+ df1 = R.data__frame(x: (1..3))
734
+ puts df1
735
+ ```
736
+
737
+ ```
738
+ ## x
739
+ ## 1 1
740
+ ## 2 2
741
+ ## 3 3
742
+ ```
743
+
744
+ Note that method mutate_y will fail regardless of whether a variable 'a' is defined and
745
+ in the scope of the method. Variable 'a' has no relationship with the symbol ':a' used in the
746
+ definition of 'mutate\_y' above:
747
+
748
+
749
+ ```ruby
750
+ a = 10
751
+ mutate_y(df1)
752
+ ```
753
+
754
+ ```
755
+ ## Message:
756
+ ## Error in mutate_impl(.data, dots) :
757
+ ## Evaluation error: object 'a' not found.
758
+ ## In addition: Warning message:
759
+ ## In mutate_impl(.data, dots) :
760
+ ## mismatched protect/unprotect (unprotect with empty protect stack) (RError)
761
+ ## Translated to internal error
762
+ ```
763
+ ## Different expressions
764
+
765
+ Let's move to the next problem as presented by Hadley, where trying to write a function in R
765
+ that will receive two arguments, the first a variable and the second an expression, is not trivial.
766
+ Below we create a data frame and we want to write a function that groups data by a variable and
768
+ summarises it by an expression:
769
+
770
+
771
+ ```r
772
+ set.seed(123)
773
+
774
+ df <- data.frame(
775
+ g1 = c(1, 1, 2, 2, 2),
776
+ g2 = c(1, 2, 1, 2, 1),
777
+ a = sample(5),
778
+ b = sample(5)
779
+ )
780
+
781
+ as.data.frame(df)
782
+ ```
783
+
784
+ ```
785
+ ## g1 g2 a b
786
+ ## 1 1 1 3 3
787
+ ## 2 1 2 2 1
788
+ ## 3 2 1 5 2
789
+ ## 4 2 2 4 5
790
+ ## 5 2 1 1 4
791
+ ```
792
+
793
+ ```r
794
+ d2 <- df %>%
795
+ group_by(g1) %>%
796
+ summarise(a = mean(a))
797
+
798
+ as.data.frame(d2)
799
+ ```
800
+
801
+ ```
802
+ ## g1 a
803
+ ## 1 1 2.500000
804
+ ## 2 2 3.333333
805
+ ```
806
+
807
+ ```r
808
+ d2 <- df %>%
809
+ group_by(g2) %>%
810
+ summarise(a = mean(a))
811
+
812
+ as.data.frame(d2)
813
+ ```
814
+
815
+ ```
816
+ ## g2 a
817
+ ## 1 1 3
818
+ ## 2 2 3
819
+ ```
820
+
821
+ As shown by Hadley, one might expect this function to do the trick:
822
+
823
+
824
+ ```r
825
+ my_summarise <- function(df, group_var) {
826
+ df %>%
827
+ group_by(group_var) %>%
828
+ summarise(a = mean(a))
829
+ }
830
+
831
+ # my_summarise(df, g1)
832
+ #> Error: Column `group_var` is unknown
833
+ ```
834
+
835
+ In order to solve this problem, coding with dplyr requires the introduction of many new concepts
836
+ and functions such as 'quo', 'quos', 'enquo', 'enquos', '!!' (bang bang), '!!!' (triple bang).
837
+ Again, we'll leave to Hadley the explanation on how to use all those functions.
838
+
839
+ Now, let's try to implement the same function in Galaaz. The next code block first prints the
839
+ 'df' data frame defined previously in R (to access an R variable from Galaaz, we use the tilde
840
+ operator '~' applied to the R variable name as symbol, i.e., ':df'). We then create the
841
+ 'my_summarize' method and call it passing the R data frame and the group by variable ':g1':
843
+
844
+
845
+ ```ruby
846
+ puts ~:df
847
+ print "\n"
848
+
849
+ def my_summarize(df, group_var)
850
+ df.group_by(group_var).
851
+ summarize(a: :a.mean)
852
+ end
853
+
854
+ puts my_summarize(:df, :g1)
855
+ ```
856
+
857
+ ```
858
+ ## g1 g2 a b
859
+ ## 1 1 1 3 3
860
+ ## 2 1 2 2 1
861
+ ## 3 2 1 5 2
862
+ ## 4 2 2 4 5
863
+ ## 5 2 1 1 4
864
+ ##
865
+ ## # A tibble: 2 x 2
866
+ ## g1 a
867
+ ## <dbl> <dbl>
868
+ ## 1 1 2.5
869
+ ## 2 2 3.33
870
+ ```
871
+ It works!!! Well, let's make sure this was not just some coincidence:
872
+
873
+
874
+ ```ruby
875
+ puts my_summarize(:df, :g2)
876
+ ```
877
+
878
+ ```
879
+ ## # A tibble: 2 x 2
880
+ ## g2 a
881
+ ## <dbl> <dbl>
882
+ ## 1 1 3
883
+ ## 2 2 3
884
+ ```
885
+
886
+ Great, everything is fine! No magic, no new functions, no complexities, just normal, standard Ruby
887
+ code. If you've ever done NSE in R, this certainly feels much safer and easier to implement.
888
+
889
+ ## Different input variables
890
+
891
+ In the previous section we've managed to get rid of all NSE formulation for a simple example, but
892
+ does this remain true for more complex examples, or will the Galaaz way prove impractical for
893
+ more complex code?
894
+
895
+ In the next example Hadley asks us to write a function that, given an expression such as 'a'
895
+ or 'a * b', calculates three summaries. What we want is a function that does the same as these R
897
+ statements:
898
+
899
+ ```
900
+ summarise(df, mean = mean(a), sum = sum(a), n = n())
901
+ #> # A tibble: 1 x 3
902
+ #> mean sum n
903
+ #> <dbl> <int> <int>
904
+ #> 1 3 15 5
905
+
906
+ summarise(df, mean = mean(a * b), sum = sum(a * b), n = n())
907
+ #> # A tibble: 1 x 3
908
+ #> mean sum n
909
+ #> <dbl> <int> <int>
910
+ #> 1 9 45 5
911
+ ```
912
+
913
+ Let's try it in Galaaz:
914
+
915
+
916
+ ```ruby
917
+ def my_summarise2(df, expr)
918
+ df.summarize(
919
+ mean: E.mean(expr),
920
+ sum: E.sum(expr),
921
+ n: E.n
922
+ )
923
+ end
924
+
925
+ puts my_summarise2((~:df), :a)
926
+ puts my_summarise2((~:df), :a * :b)
927
+ ```
928
+
929
+ ```
930
+ ## mean sum n
931
+ ## 1 3 15 5
932
+ ## mean sum n
933
+ ## 1 9 45 5
934
+ ```
935
+
936
+ Once again, there is no need to use any special theory or functions. The only point to be
937
+ careful about is the use of 'E' to build expressions from functions 'mean', 'sum' and 'n'.
938
+
939
+ ## Different input and output variable
940
+
941
+ Now the next challenge presented by Hadley is to vary the name of the output variables based on
942
+ the received expression. So, if the input expression is 'a', we want our data frame columns to
943
+ be named 'mean\_a' and 'sum\_a'. Now, if the input expression is 'b', columns
944
+ should be named 'mean\_b' and 'sum\_b'.
945
+
946
+ ```
947
+ mutate(df, mean_a = mean(a), sum_a = sum(a))
948
+ #> # A tibble: 5 x 6
949
+ #> g1 g2 a b mean_a sum_a
950
+ #> <dbl> <dbl> <int> <int> <dbl> <int>
951
+ #> 1 1 1 1 3 3 15
952
+ #> 2 1 2 4 2 3 15
953
+ #> 3 2 1 2 1 3 15
954
+ #> 4 2 2 5 4 3 15
955
+ #> # … with 1 more row
956
+
957
+ mutate(df, mean_b = mean(b), sum_b = sum(b))
958
+ #> # A tibble: 5 x 6
959
+ #> g1 g2 a b mean_b sum_b
960
+ #> <dbl> <dbl> <int> <int> <dbl> <int>
961
+ #> 1 1 1 1 3 3 15
962
+ #> 2 1 2 4 2 3 15
963
+ #> 3 2 1 2 1 3 15
964
+ #> 4 2 2 5 4 3 15
965
+ #> # … with 1 more row
966
+ ```
967
+ In order to solve this problem in R, Hadley needs to introduce some more new functions and notations:
967
+ 'quo_name' and the ':=' operator from package 'rlang'.
969
+
970
+ Here is our Ruby code:
971
+
972
+
973
+ ```ruby
974
+ def my_mutate(df, expr)
975
+ mean_name = "mean_#{expr.to_s}"
976
+ sum_name = "sum_#{expr.to_s}"
977
+
978
+ df.mutate(mean_name => E.mean(expr),
979
+ sum_name => E.sum(expr))
980
+ end
981
+
982
+ puts my_mutate((~:df), :a)
983
+ puts my_mutate((~:df), :b)
984
+ ```
985
+
986
+ ```
987
+ ## g1 g2 a b mean_a sum_a
988
+ ## 1 1 1 3 3 3 15
989
+ ## 2 1 2 2 1 3 15
990
+ ## 3 2 1 5 2 3 15
991
+ ## 4 2 2 4 5 3 15
992
+ ## 5 2 1 1 4 3 15
993
+ ## g1 g2 a b mean_b sum_b
994
+ ## 1 1 1 3 3 3 15
995
+ ## 2 1 2 2 1 3 15
996
+ ## 3 2 1 5 2 3 15
997
+ ## 4 2 2 4 5 3 15
998
+ ## 5 2 1 1 4 3 15
999
+ ```
1000
+ It really seems that "Non Standard Evaluation" is actually quite standard in Galaaz! But, you
1001
+ might have noticed a small change in the way the arguments to the mutate method were called.
1002
+ In a previous example we used df.summarise(mean: E.mean(:a), ...) where the column name was
1003
+ followed by a ':' colon. In this example, we have df.mutate(mean_name => E.mean(expr), ...)
1004
+ and variable mean\_name is not followed by ':' but by '=>'. This is standard Ruby notation.
1005
+
1006
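+ With the 'key: value' form Ruby always takes the key to be the literal symbol (here ':mean'), while 'key => value' evaluates its left-hand side, so the content of variable mean\_name becomes the column name.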
1007
+
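The difference between the two hash notations can be seen in plain Ruby, without Galaaz. The following is a minimal sketch; the variable names 'literal_key' and 'computed_key' are ours and only serve the illustration.

```ruby
expr = :a
mean_name = "mean_#{expr}"

# 'key: value' always creates the literal symbol key, here :mean_name,
# no matter what the variable mean_name contains.
literal_key = { mean_name: 1 }

# 'key => value' evaluates the left side, so the key is the string "mean_a".
computed_key = { mean_name => 1 }

puts literal_key.keys.inspect
puts computed_key.keys.inspect
```

This is why 'my\_mutate' has to use '=>': with 'mean_name:' the new column would simply be called 'mean\_name' for every input expression.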
1008
+ ## Capturing multiple variables
1009
+
1010
+ Moving on with new complexities, Hadley asks us to solve the problem in which the
1011
+ summarise function will receive any number of grouping variables.
1012
+
1013
+ This again is quite standard Ruby. In order to receive an undefined number of parameters
1013
+ the parameter is preceded by '*':
1015
+
1016
+
1017
+ ```ruby
1018
+ def my_summarise3(df, *group_vars)
1019
+ df.group_by(*group_vars).
1020
+ summarise(a: E.mean(:a))
1021
+ end
1022
+
1023
+ puts my_summarise3((~:df), :g1, :g2)
1024
+ ```
1025
+
1026
+ ```
1027
+ ## # A tibble: 4 x 3
1028
+ ## # Groups: g1 [?]
1029
+ ## g1 g2 a
1030
+ ## <dbl> <dbl> <dbl>
1031
+ ## 1 1 1 3
1032
+ ## 2 1 2 2
1033
+ ## 3 2 1 3
1034
+ ## 4 2 2 4
1035
+ ```
1036
+
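For readers less familiar with the splat operator, here is a small plain Ruby sketch of the mechanism used in 'my\_summarise3'. The helper 'describe_groups' is hypothetical and only builds a string; it does not call Galaaz or R.

```ruby
# Hypothetical helper, plain Ruby only: collects any number of grouping
# names after the first argument and reports what it received.
def describe_groups(table_name, *group_vars)
  "#{table_name} grouped by #{group_vars.join(', ')}"
end

one = describe_groups("df", :g1)
two = describe_groups("df", :g1, :g2)

# A splat at the call site spreads an existing array into arguments,
# which is what 'df.group_by(*group_vars)' does in the method above.
vars = [:g1, :g2]
spread = describe_groups("df", *vars)

puts one
puts two
puts spread
```

Inside the method 'group_vars' is an ordinary array, so it can be inspected or filtered with normal Ruby code before being passed on.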
1037
+ # Why does R require NSE and Galaaz does not?
1038
+
1039
+ NSE introduces a number of new concepts, such as 'quoting', 'quasiquotation', 'unquoting' and
1040
+ 'unquote-splicing', while in Galaaz none of those concepts are needed. What gives?
1041
+
1042
+ R is an extremely flexible language and it has lazy evaluation of parameters. When in R a
1043
+ function is called as 'summarise(df, a = b)', the summarise function receives the literal
1044
+ 'a = b' parameter and can work with this as if it were a string. In R, it is not clear what
1045
+ a and b are, they can be expressions or they can be variables, it is up to the function to
1046
+ decide what 'a = b' means.
1047
+
1048
+ In Ruby, there is no lazy evaluation of parameters and 'a' is always a variable and so is 'b'.
1049
+ Variables assume their value as soon as they are used, so 'x = a' is immediately evaluated and
1050
+ variable 'x' will receive the value of variable 'a' as soon as the Ruby statement is executed.
1051
+ Ruby also provides the notion of a symbol; ':a' is a symbol and does not evaluate to anything.
1052
+ Galaaz uses Ruby symbols to build expressions that are not bound to anything: ':a.eq :b' is
1053
+ clearly an expression and has no relationship whatsoever with the statement 'a = b'. By using
1054
+ symbols, variables and expressions all the possible ambiguities that are found in R are
1055
+ eliminated in Galaaz.
1056
+
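The separation between variables and symbols can be sketched in plain Ruby. The 'Expr' struct and the 'eq' helper below are hypothetical and much simpler than whatever Galaaz does internally; they only illustrate that a variable is evaluated at once while a symbol stays a name.

```ruby
# Hypothetical, much simplified expression object; Galaaz's real
# implementation is not shown here and may differ.
Expr = Struct.new(:lhs, :op, :rhs) do
  def to_s
    "#{lhs} #{op} #{rhs}"
  end
end

# Build an unevaluated equality expression from two operands.
def eq(lhs, rhs)
  Expr.new(lhs, "==", rhs)
end

a = 1
b = 2

# Variables are evaluated right away: this is just the value false.
eager = (a == b)

# Symbols stand for names, so the result describes a comparison of columns.
columns = eq(:a, :b).to_s

# Mixing both: column 'a' compared with the current value of variable b.
mixed = eq(:a, b).to_s

puts eager
puts columns
puts mixed
```

The three results correspond to the three readings that are ambiguous in 'filter(df, a == b)' in R; in Ruby each one has to be written differently, so the reader can always tell which is meant.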
1057
+ The main problem that remains is that R functions do not always clearly document what type
1057
+ of input they expect. They might expect regular variables or they might expect
1058
+ expressions, and the R function will know how to deal with an input of the form
1059
+ 'a = b'. For the Ruby developer, however, it might not be immediately clear whether to call the
1060
+ function passing the value 'true' (when variable 'a' is equal to variable 'b') or to
1061
+ call the function passing the expression ':a.eq :b'.
1063
+
1064
+
1065
+ # Advanced dplyr features
1066
+
1067
+ In the blog: [Programming with dplyr by using dplyr](https://www.r-bloggers.com/programming-with-dplyr-by-using-dplyr/) Iñaki Úcar shows surprise that some R users are trying to code in dplyr avoiding
1068
+ the use of NSE. For instance he says:
1069
+
1070
+ > Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to
1071
+ > program over dplyr without having “to bring in (or study) any deep-theory or
1072
+ > heavy-weight tools such as rlang/tidyeval”.
1073
+
1074
+ For me, there isn't really any surprise that users are trying to avoid dplyr deep-theory. R
1075
+ users frequently are not programmers and learning to code is already hard business, on top
1076
+ of that, having to learn how to 'quote' or 'enquo' or 'quos' or 'enquos' is not necessarily
1077
+ a 'piece of cake'. So much so, that 'tidyeval' has some more advanced functions that, instead
1077
+ of using quoted expressions, use strings as arguments.
1079
+
1080
+ In the following examples, we show the use of functions 'group\_by\_at', 'summarise\_at' and
1081
+ 'rename\_at' that receive strings as argument. The data frame used is 'starwars', which describes
1081
+ features of characters in the Star Wars movies:
1083
+
1084
+
1085
+ ```ruby
1086
+ puts (~:starwars).head
1087
+ ```
1088
+
1089
+ ```
1090
+ ## # A tibble: 6 x 13
1091
+ ## name height mass hair_color skin_color eye_color birth_year gender
1092
+ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1093
+ ## 1 Luke… 172 77 blond fair blue 19 male
1094
+ ## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
1095
+ ## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
1096
+ ## 4 Dart… 202 136 none white yellow 41.9 male
1097
+ ## 5 Leia… 150 49 brown light brown 19 female
1098
+ ## 6 Owen… 178 120 brown, gr… light blue 52 male
1099
+ ## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
1100
+ ## # vehicles <list>, starships <list>
1101
+ ```
1102
+ The grouped_mean function below will receive a grouping variable and calculate summaries for
1103
+ the value\_variables given:
1104
+
1105
+
1106
+ ```r
1107
+ grouped_mean <- function(data, grouping_variables, value_variables) {
1108
+ data %>%
1109
+ group_by_at(grouping_variables) %>%
1110
+ mutate(count = n()) %>%
1111
+ summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>%
1112
+ rename_at(value_variables, funs(paste0("mean_", .)))
1113
+ }
1114
+
1115
+ gm = starwars %>%
1116
+ grouped_mean("eye_color", c("mass", "birth_year"))
1117
+
1118
+ as.data.frame(gm)
1119
+ ```
1120
+
1121
+ ```
1122
+ ## eye_color mean_mass mean_birth_year count
1123
+ ## 1 black 76.28571 33.00000 10
1124
+ ## 2 blue 86.51667 67.06923 19
1125
+ ## 3 blue-gray 77.00000 57.00000 1
1126
+ ## 4 brown 66.09231 108.96429 21
1127
+ ## 5 dark NaN NaN 1
1128
+ ## 6 gold NaN NaN 1
1129
+ ## 7 green, yellow 159.00000 NaN 1
1130
+ ## 8 hazel 66.00000 34.50000 3
1131
+ ## 9 orange 282.33333 231.00000 8
1132
+ ## 10 pink NaN NaN 1
1133
+ ## 11 red 81.40000 33.66667 5
1134
+ ## 12 red, blue NaN NaN 1
1135
+ ## 13 unknown 31.50000 NaN 3
1136
+ ## 14 white 48.00000 NaN 1
1137
+ ## 15 yellow 81.11111 76.38000 11
1138
+ ```
1139
+
1140
+ The same code with Galaaz, becomes:
1141
+
1142
+
1143
+ ```ruby
1144
+ def grouped_mean(data, grouping_variables, value_variables)
1145
+ data.
1146
+ group_by_at(grouping_variables).
1147
+ mutate(count: E.n).
1148
+ summarise_at(E.c(value_variables, "count"), ~:mean, na__rm: true).
1149
+ rename_at(value_variables, E.funs(E.paste0("mean_", value_variables)))
1150
+ end
1151
+
1152
+ puts grouped_mean((~:starwars), "eye_color", E.c("mass", "birth_year"))
1153
+ ```
1154
+
1155
+ ```
1156
+ ## # A tibble: 15 x 4
1157
+ ## eye_color mean_mass mean_birth_year count
1158
+ ## <chr> <dbl> <dbl> <dbl>
1159
+ ## 1 black 76.3 33 10
1160
+ ## 2 blue 86.5 67.1 19
1161
+ ## 3 blue-gray 77 57 1
1162
+ ## 4 brown 66.1 109. 21
1163
+ ## 5 dark NaN NaN 1
1164
+ ## 6 gold NaN NaN 1
1165
+ ## 7 green, yellow 159 NaN 1
1166
+ ## 8 hazel 66 34.5 3
1167
+ ## 9 orange 282. 231 8
1168
+ ## 10 pink NaN NaN 1
1169
+ ## 11 red 81.4 33.7 5
1170
+ ## 12 red, blue NaN NaN 1
1171
+ ## 13 unknown 31.5 NaN 3
1172
+ ## 14 white 48 NaN 1
1173
+ ## 15 yellow 81.1 76.4 11
1174
+ ```
1175
+
1176
+ # Further reading
1177
+
1178
+ For more information on GraalVM, TruffleRuby, fastR, R and Galaaz check out the following sites/posts:
1179
+
1180
+ * [GraalVM Home](https://www.graalvm.org/)
1181
+ * [TruffleRuby](https://github.com/oracle/truffleruby)
1182
+ * [FastR](https://github.com/oracle/fastr)
1183
+ * [Faster R with FastR](https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb)
1184
+ * [How to make Beautiful Ruby Plots with Galaaz](https://medium.freecodecamp.org/how-to-make-beautiful-ruby-plots-with-galaaz-320848058857)
1185
+ * [Ruby Plotting with Galaaz: An example of tightly coupling Ruby and R in GraalVM](https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021)
1186
+ * [How to do reproducible research in Ruby with gKnit](https://towardsdatascience.com/how-to-do-reproducible-research-in-ruby-with-gknit-c26d2684d64e)
1187
+ * [R for Data Science](https://r4ds.had.co.nz/)
1188
+ * [Advanced R](https://adv-r.hadley.nz/)
1189
+
1190
+ # Conclusion
1191
+
1192
+ Ruby and Galaaz provide a nice framework for developing code that uses R functions. Although R is
1193
+ a very powerful and flexible language, sometimes, too much flexibility makes life harder for
1194
+ the casual user. We believe however, that even for the advanced user, Ruby integrated
1195
+ with R through Galaaz, makes a powerful environment for data analysis. In this blog post we
1195
+ showed how Galaaz's consistent syntax eliminates the need for complex constructs such as quoting,
1197
+ enquoting, quasiquotation, etc. This simplification comes from the fact that expressions and
1198
+ variables are clearly separated objects, which is not the case in the R language.