galaaz 0.4.9 → 0.4.10

Files changed (76)
  1. checksums.yaml +4 -4
  2. data/README.md +798 -285
  3. data/blogs/galaaz_ggplot/galaaz_ggplot.Rmd +3 -12
  4. data/blogs/galaaz_ggplot/galaaz_ggplot.aux +5 -7
  5. data/blogs/galaaz_ggplot/galaaz_ggplot.html +69 -29
  6. data/blogs/galaaz_ggplot/galaaz_ggplot.pdf +0 -0
  7. data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-html/midwest_rb.png +0 -0
  8. data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-html/scatter_plot_rb.png +0 -0
  9. data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-latex/midwest_rb.pdf +0 -0
  10. data/blogs/galaaz_ggplot/galaaz_ggplot_files/figure-latex/scatter_plot_rb.pdf +0 -0
  11. data/blogs/galaaz_ggplot/midwest.Rmd +1 -9
  12. data/blogs/gknit/gknit.Rmd +37 -40
  13. data/blogs/gknit/gknit.html +32 -30
  14. data/blogs/gknit/gknit.md +36 -37
  15. data/blogs/gknit/gknit.pdf +0 -0
  16. data/blogs/gknit/gknit.tex +35 -37
  17. data/blogs/manual/manual.Rmd +548 -125
  18. data/blogs/manual/manual.html +509 -286
  19. data/blogs/manual/manual.md +798 -285
  20. data/blogs/manual/manual.pdf +0 -0
  21. data/blogs/manual/manual.tex +2816 -0
  22. data/blogs/manual/manual_files/figure-latex/diverging_bar.pdf +0 -0
  23. data/blogs/nse_dplyr/nse_dplyr.Rmd +240 -74
  24. data/blogs/nse_dplyr/nse_dplyr.html +191 -87
  25. data/blogs/nse_dplyr/nse_dplyr.md +361 -107
  26. data/blogs/nse_dplyr/nse_dplyr.pdf +0 -0
  27. data/blogs/nse_dplyr/nse_dplyr.tex +1373 -0
  28. data/blogs/ruby_plot/ruby_plot.Rmd +61 -81
  29. data/blogs/ruby_plot/ruby_plot.html +54 -57
  30. data/blogs/ruby_plot/ruby_plot.md +48 -67
  31. data/blogs/ruby_plot/ruby_plot.pdf +0 -0
  32. data/blogs/ruby_plot/ruby_plot_files/figure-html/dose_len.png +0 -0
  33. data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_delivery.png +0 -0
  34. data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_dose.png +0 -0
  35. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color.png +0 -0
  36. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color2.png +0 -0
  37. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_jitter.png +0 -0
  38. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_points.png +0 -0
  39. data/blogs/ruby_plot/ruby_plot_files/figure-html/final_box_plot.png +0 -0
  40. data/blogs/ruby_plot/ruby_plot_files/figure-html/final_violin_plot.png +0 -0
  41. data/blogs/ruby_plot/ruby_plot_files/figure-html/violin_with_jitter.png +0 -0
  42. data/blogs/ruby_plot/ruby_plot_files/figure-latex/dose_len.png +0 -0
  43. data/blogs/ruby_plot/ruby_plot_files/figure-latex/facet_by_delivery.png +0 -0
  44. data/blogs/ruby_plot/ruby_plot_files/figure-latex/facet_by_dose.png +0 -0
  45. data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_by_delivery_color.png +0 -0
  46. data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_by_delivery_color2.png +0 -0
  47. data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_with_decorations.png +0 -0
  48. data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_with_jitter.png +0 -0
  49. data/blogs/ruby_plot/ruby_plot_files/figure-latex/facets_with_points.png +0 -0
  50. data/blogs/ruby_plot/ruby_plot_files/figure-latex/final_box_plot.png +0 -0
  51. data/blogs/ruby_plot/ruby_plot_files/figure-latex/final_violin_plot.png +0 -0
  52. data/blogs/ruby_plot/ruby_plot_files/figure-latex/violin_with_jitter.png +0 -0
  53. data/lib/R_interface/rdata_frame.rb +0 -12
  54. data/lib/R_interface/robject.rb +14 -14
  55. data/lib/R_interface/ruby_extensions.rb +3 -31
  56. data/lib/R_interface/rvector.rb +0 -12
  57. data/lib/gknit/knitr_engine.rb +5 -3
  58. data/lib/util/exec_ruby.rb +22 -61
  59. data/specs/tmp.rb +26 -12
  60. data/version.rb +1 -1
  61. metadata +22 -17
  62. data/bin/gknit_old_r +0 -236
  63. data/blogs/dev/dev.Rmd +0 -23
  64. data/blogs/dev/dev.md +0 -58
  65. data/blogs/dev/dev2.Rmd +0 -65
  66. data/blogs/dev/model.rb +0 -41
  67. data/blogs/dplyr/dplyr.Rmd +0 -29
  68. data/blogs/dplyr/dplyr.html +0 -433
  69. data/blogs/dplyr/dplyr.md +0 -58
  70. data/blogs/dplyr/dplyr.rb +0 -63
  71. data/blogs/galaaz_ggplot/galaaz_ggplot.log +0 -640
  72. data/blogs/galaaz_ggplot/galaaz_ggplot.md +0 -431
  73. data/blogs/galaaz_ggplot/galaaz_ggplot.tex +0 -481
  74. data/blogs/galaaz_ggplot/midwest.png +0 -0
  75. data/blogs/galaaz_ggplot/scatter_plot.png +0 -0
  76. data/blogs/ruby_plot/ruby_plot.tex +0 -1077
@@ -4,7 +4,7 @@ author:
4
4
  - "Rodrigo Botafogo"
5
5
  - "Daniel Mossé - University of Pittsburgh"
6
6
  tags: [Tech, Data Science, Ruby, R, GraalVM]
7
- date: "20/02/2019"
7
+ date: "10/05/2019"
8
8
  output:
9
9
  html_document:
10
10
  self_contained: true
@@ -13,27 +13,41 @@ output:
13
13
  includes:
14
14
  in_header: ["../../sty/galaaz.sty"]
15
15
  number_sections: yes
16
+ toc: true
17
+ toc_depth: 2
18
+ md_document:
19
+ variant: markdown_github
20
+ fontsize: 11pt
16
21
  ---
17
22
 
18
23
 
19
24
 
20
25
  # Introduction
21
26
 
22
- In this post we will see how to program with dplyr in Galaaz.
27
+ In this post we will see how to program with _dplyr_ in Galaaz.
23
28
 
24
- ### But first, what is Galaaz??
29
+ ## But first, what is Galaaz??
25
30
 
26
31
  Galaaz is a system for tightly coupling Ruby and R. Ruby is a powerful language, with
27
32
  a large community, a very large set of libraries and great for web development. However,
28
33
  it lacks libraries for data science, statistics, scientific plotting and machine learning.
29
34
  On the other hand, R is considered one of the most powerful languages for solving all of the
30
35
  above problems. Maybe the strongest competitor to R is Python with libraries such as NumPy,
31
- Panda, SciPy, SciKit-Learn and a couple more.
36
+ Pandas, SciPy, SciKit-Learn and many more.
32
37
 
33
38
  With Galaaz we do not intend to re-implement any of the scientific libraries in R. However, we
34
39
  allow for very tight coupling between the two languages to the point that the Ruby
35
- developer does not need to know that there is an R engine running. For this to happen we
36
- use new technologies provided by Oracle: GraalVM, TruffleRuby and FastR:
40
+ developer does not need to know that there is an R engine running. Also, from the point of
41
+ view of the R user/developer Galaaz looks a lot like R, with just minor syntactic difference,
42
+ so there is almost no learning courve for the R developer. And as we will see in this
43
+ post, programming with _dplyr_ is easier in Galaaz than in R.
44
+
45
+ R users are probably quite knowledgeable about _dplyr_, for the Ruby developer, _dplyr_ and
46
+ the _tidyverse_ libraries are a set of libraries for data manipulation in R, developed by
47
+ Hardley Wickham, chief scientis at RStudio and a prolific R coder and writer.
48
+
49
+ For the coupling of Ruby and R we use new technologies provided by Oracle: GraalVM,
50
+ TruffleRuby and FastR:
37
51
 
38
52
  GraalVM is a universal virtual machine for running applications
39
53
  written in JavaScript, Python 3, Ruby, R, JVM-based languages like Java,
@@ -68,10 +82,16 @@ Interested readers should also check out the following sites:
68
82
  * [TruffleRuby](https://github.com/oracle/truffleruby)
69
83
  * [FastR](https://github.com/oracle/fastr)
70
84
  * [Faster R with FastR](https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb)
85
+ * [How to make Beautiful Ruby Plots with Galaaz](https://medium.freecodecamp.org/how-to-make-beautiful-ruby-plots-with-galaaz-320848058857)
86
+ * [Ruby Plotting with Galaaz: An example of tightly coupling Ruby and R in GraalVM](https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021)
87
+ * [How to do reproducible research in Ruby with gKnit](https://towardsdatascience.com/how-to-do-reproducible-research-in-ruby-with-gknit-c26d2684d64e)
88
+ * [R for Data Science](https://r4ds.had.co.nz/)
89
+ * [Advanced R](https://adv-r.hadley.nz/)
71
90
 
72
- ### Now to programming with dplyr
91
+ ## Programming with dplyr
73
92
 
74
- According to Hardley (https://dplyr.tidyverse.org/articles/programming.html)
93
+ This post will follow closely the work done in https://dplyr.tidyverse.org/articles/programming.html,
94
+ by Hardley Wickham. In it, Hardley states:
75
95
 
76
96
  > Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that
77
97
  > means they don’t follow the usual R rules of evaluation. Instead, they capture the
@@ -80,7 +100,7 @@ According to Hardley (https://dplyr.tidyverse.org/articles/programming.html)
80
100
 
81
101
  > Operations on data frames can be expressed succinctly because you don’t need to repeat
82
102
  > the name of the data frame. For example, you can write filter(df, x == 1, y == 2, z == 3)
83
- > instead of df[df$x == 1 & df$y ==2 & df$z == 3, ].
103
+ > instead of df[df\$x == 1 & df\$y ==2 & df\$z == 3, ].
84
104
 
85
105
  > dplyr can choose to compute results in a different way to base R. This is important for
86
106
  > database backends because dplyr itself doesn’t do any work, but instead generates the SQL
@@ -92,29 +112,9 @@ According to Hardley (https://dplyr.tidyverse.org/articles/programming.html)
92
112
  > with a seemingly equivalent object that you’ve defined elsewhere. In other words, this code:
93
113
 
94
114
 
95
-
96
115
  ```r
97
116
  df <- data.frame(x = 1:3, y = 3:1)
98
- print(df)
99
- ```
100
-
101
- ```
102
- ## x y
103
- ## 1 1 3
104
- ## 2 2 2
105
- ## 3 3 1
106
- ```
107
-
108
- ```r
109
117
  print(filter(df, x == 1))
110
- ```
111
-
112
- ```
113
- ## x y
114
- ## 1 1 3
115
- ```
116
-
117
- ```r
118
118
  #> # A tibble: 1 x 2
119
119
  #> x y
120
120
  #> <int> <int>
@@ -131,15 +131,22 @@ filter(df, my_var == 1)
131
131
  ```
132
132
  > This makes it hard to create functions with arguments that change how dplyr verbs are computed.
133
133
 
134
+ In this post we will see that programming with _dplyr_ in Galaaz does not require knowledge of
135
+ non-standard evaluation in R and can be accomplished by utilizing normal Ruby constructs.
136
+
134
137
  # Writing Expressions in Galaaz
135
138
 
136
- Galaaz extends Ruby to work with complex expressions, similar to R's expressions build with 'quote'
137
- (base R) or 'quo' (tidyverse). Let's take a look at some of those expressions.
139
+ Galaaz extends Ruby to work with expressions, similar to R's expressions build with 'quote'
140
+ (base R) or 'quo' (tidyverse). Expressions in this context are like mathematical expressions or
141
+ formulae. For instance, in mathematics, the expression $y = sin(x)$ describes a function but cannot
142
+ be computed unless the value of $x$ is bound to some value.
143
+
144
+ Let's take a look at some of those expressions in Ruby:
138
145
 
139
146
  ## Expressions from operators
140
147
 
141
- The code bellow
142
- creates an expression summing two symbols
148
+ The code bellow creates an expression summing two symbols. Note that :a and :b are Ruby symbols and
149
+ are not bound to any value at the time of expression definition:
143
150
 
144
151
 
145
152
  ```ruby
@@ -150,7 +157,7 @@ puts exp1
150
157
  ```
151
158
  ## a + b
152
159
  ```
153
- We can build any complex mathematical expression
160
+ We can build any complex mathematical expression such as:
154
161
 
155
162
 
156
163
  ```ruby
@@ -161,8 +168,9 @@ puts exp2
161
168
  ```
162
169
  ## (a + b) * 2 + c^2L/z
163
170
  ```
171
+ The 'L' after two indicates that 2 is an integer.
164
172
 
165
- It is also possible to use inequality operators in building expressions
173
+ It is also possible to use inequality operators in building expressions:
166
174
 
167
175
 
168
176
  ```ruby
@@ -173,6 +181,19 @@ puts exp3
173
181
  ```
174
182
  ## a + b >= z
175
183
  ```
184
+ Expressions' definition can also make use of normal Ruby variables without any problem:
185
+
186
+
187
+ ```ruby
188
+ x = 20
189
+ y = 30
190
+ exp_var = (:a + :b) * x <= :z - y
191
+ puts exp_var
192
+ ```
193
+
194
+ ```
195
+ ## (a + b) * 20L <= z - 30L
196
+ ```
176
197
 
177
198
  Galaaz provides both symbolic representations for operators, such as (>, <, !=) as functional
178
199
  notation for those operators such as (.gt, .ge, etc.). So the same expression written
@@ -188,8 +209,9 @@ puts exp4
188
209
  ## a + b >= z
189
210
  ```
190
211
 
191
- Two type of expression can only be created with the functional representation of the operators,
192
- those are expressions involving '==', and '='. In order to write an expression involving '==' we
212
+ Two type of expression, however, can only be created with the functional representation
213
+ of the operators, those are expressions involving '==', and '='. In order to write an
214
+ expression involving '==' we
193
215
  need to use the method '.eq' and for '=' we need the function '.assign'
194
216
 
195
217
 
@@ -228,17 +250,16 @@ puts exp_wrong
228
250
  ```
229
251
  and it might be difficult to understand what is going on here. The problem lies with the fact that
230
252
  when using '==' we are comparing expression (:a + :b) to expression :z with '=='. When the
231
- comparison is executed, the system tries to evaluate :a, :b and :z, and those symbols, at
232
- this time are not bound to anything and we get a "object 'a' not found" message.
233
- If we only use functional notation, this type of error will never occur.
253
+ comparison is executed, the system tries to evaluate :a, :b and :z, and those symbols at
254
+ this time are not bound to anything and we get a "object 'a' not found" message.
255
+ If we only use functional notation, this type of error will not occur.
234
256
 
235
257
  ## Expressions with R methods
236
258
 
237
259
  It is often necessary to create an expression that uses a method or function. For instance, in
238
260
  mathematics, it's quite natural to write an expressin such as $y = sin(x)$. In this case, the
239
- 'sin' function is part of the expression and should not immediately executed. Now, let's say
240
- that 'x' is an angle of 45$^\circ$ and we acttually want our expression to be $y = 0.850...$.
241
- When we want the function to be part of the expression, we call the function preceeding it
261
+ 'sin' function is part of the expression and should not immediately be executed. When we want
262
+ the function to be part of the expression, we call the function preceeding it
242
263
  by the letter E, such as 'E.sin(x)'
243
264
 
244
265
 
@@ -250,28 +271,144 @@ puts exp7
250
271
  ```
251
272
  ## y <- sin(x)
252
273
  ```
253
- However, if we want the function to be evaluated, then
254
- we use the normal call to function with R as 'R.sin(x)'.
274
+
275
+ Expressions can also be written using '.' notation:
255
276
 
256
277
 
257
278
  ```ruby
258
- x = 45
259
- exp8 = :y.assign R.sin(x)
279
+ exp8 = :y.assign :x.sin
260
280
  puts exp8
261
281
  ```
262
282
 
283
+ ```
284
+ ## y <- sin(x)
285
+ ```
286
+
287
+ When a function has multiple arguments, the first one can be used before the '.':
288
+
289
+
290
+ ```ruby
291
+ exp9 = :x.c(:y)
292
+ puts exp9
293
+ ```
294
+
295
+ ```
296
+ ## c(x, y)
297
+ ```
298
+
299
+ ## Evaluating an Expression
300
+
301
+ Expressions can be evaluated by calling function 'eval' with a binding. A binding can be provided
302
+ with a list:
303
+
304
+
305
+ ```ruby
306
+ exp = (:a + :b) * 2.0 + :c ** 2 / :z
307
+ puts exp.eval(R.list(a: 10, b: 20, c: 30, z: 40))
308
+ ```
309
+
310
+ ```
311
+ ## [1] 82.5
312
+ ```
313
+
314
+ ... with a data frame:
315
+
316
+
317
+ ```ruby
318
+ df = R.data__frame(
319
+ a: R.c(1, 2, 3),
320
+ b: R.c(10, 20, 30),
321
+ c: R.c(100, 200, 300),
322
+ z: R.c(1000, 2000, 3000))
323
+
324
+ puts exp.eval(df)
325
+ ```
326
+
327
+ ```
328
+ ## [1] 32 64 96
329
+ ```
330
+
331
+ # Using Galaaz to call R functions
332
+
333
+ Galaaz tries to emulate as closely as possible the way R functions are called and migrating from
334
+ R to Galaaz should be quite easy requiring only minor syntactic changes to an R script. In
335
+ this post, we do not have enough space to write a complete manual on Galaaz
336
+ (a short manual can be found at: https://www.rubydoc.info/gems/galaaz/0.4.9), so we will
337
+ present only a few examples scripts using Galaaz.
338
+
339
+ Basically, to call an R function from Ruby with Galaaz, one only needs to preceed the function
340
+ with 'R.'. For instance, to create a vector in R, the 'c' function is used. From Galaaz, a
341
+ vector can be created by using 'R.c':
342
+
343
+
344
+ ```ruby
345
+ vec = R.c(1.0, 2, 3)
346
+ puts vec
347
+ ```
348
+
349
+ ```
350
+ ## [1] 1 2 3
351
+ ```
352
+ A list is created in R with the 'list' function, so in Galaaz we do:
353
+
354
+
355
+ ```ruby
356
+ list = R.list(a: 1.0, b: 2, c: 3)
357
+ puts list
358
+ ```
359
+
360
+ ```
361
+ ## $a
362
+ ## [1] 1
363
+ ##
364
+ ## $b
365
+ ## [1] 2
366
+ ##
367
+ ## $c
368
+ ## [1] 3
369
+ ```
370
+ Note that we can use named arguments in our list. The same code in R would be:
371
+
372
+
373
+ ```r
374
+ lst = list(a = 1, b = 2L, c = 3L)
375
+ print(lst)
376
+ ```
377
+
378
+ ```
379
+ ## $a
380
+ ## [1] 1
381
+ ##
382
+ ## $b
383
+ ## [1] 2
384
+ ##
385
+ ## $c
386
+ ## [1] 3
387
+ ```
388
+ Now, let's say that 'x' is an angle of 45$^\circ$ and we acttually want to create
389
+ the expression $y = sin(45^\circ)$, which is $y = 0.850...$. In this case,
390
+ we will use 'R.sin':
391
+
392
+
393
+ ```ruby
394
+ exp10 = :y.assign R.sin(45)
395
+ puts exp10
396
+ ```
397
+
263
398
  ```
264
399
  ## y <- 0.850903524534118
265
400
  ```
401
+
266
402
  # Filtering using expressions
267
403
 
268
- Now that we now how to write expression, we can use then to filter a data frame by expressions.
269
- Let's first start by creating a simple data frame with two columns named 'x' and 'y'
404
+ Now that we know how to write expression and call R functions let's do some data manipulation in
405
+ Galaaz. Let's first start by creating the same data frame that we created previously in section
406
+ "Programming with dplyr":
270
407
 
271
408
 
272
409
  ```ruby
273
- @df = R.data__frame(x: (1..3), y: (3..1))
274
- puts @df
410
+ df = R.data__frame(x: (1..3), y: (3..1))
411
+ puts df
275
412
  ```
276
413
 
277
414
  ```
@@ -280,12 +417,17 @@ puts @df
280
417
  ## 2 2 2
281
418
  ## 3 3 1
282
419
  ```
283
- In the code bellow we want to filter the data frame by rows in which the value of 'x' is
284
- equal to 1.
420
+ The 'filter' function can be called on this data frame either by using 'R.filter(df, ...)' or
421
+ by using dot notation. We prefer to use dot notation as shown bellow. The argument to 'filter'
422
+ in Galaaz should be an expression. Note that if we gave to filter a Ruby expression such as
423
+ 'x == 1', we would get an error, since there is no variable 'x' defined and if 'x' was a variable
424
+ then 'x == 1' would either be 'true' or 'false'. Our goal is to filter our data frame returning
425
+ all rows in which the 'x' value is equal to 1. To express this we want: ':x.eq 1', where :x will
426
+ be interpreted by filter as the 'x' column.
285
427
 
286
428
 
287
429
  ```ruby
288
- puts @df.filter(:x.eq 1)
430
+ puts df.filter(:x.eq 1)
289
431
  ```
290
432
 
291
433
  ```
@@ -294,7 +436,7 @@ puts @df.filter(:x.eq 1)
294
436
  ```
295
437
 
296
438
  In R, and when coding with 'tidyverse', arguments to a function are usually not
297
- *referencially transparent*. That is, ou can’t replace a value with a seemingly equivalent
439
+ *referencially transparent*. That is, you can’t replace a value with a seemingly equivalent
298
440
  object that you’ve defined elsewhere. In other words, this code
299
441
 
300
442
 
@@ -304,8 +446,8 @@ filter(df, my_var == 1)
304
446
  ```
305
447
  Generates the following error: "object 'x' not found.
306
448
 
307
- However, in Ruby and Galaaz, arguments are referencially transparent as can be seen by the
308
- code bellow. Note, initally that 'my_var = :x' will not give the error "object 'x' not found"
449
+ However, in Galaaz, arguments are referencially transparent as can be seen by the
450
+ code bellow. Note initally that 'my_var = :x' will not give the error "object 'x' not found"
309
451
  since ':x' is treated as an expression and assigned to my\_var. Then when doing (my\_var.eq 1),
310
452
  my\_var is a variable that resolves to ':x' and it becomes equivalent to (:x.eq 1) which is
311
453
  what we want.
@@ -313,7 +455,7 @@ what we want.
313
455
 
314
456
  ```ruby
315
457
  my_var = :x
316
- puts @df.filter(my_var.eq 1)
458
+ puts df.filter(my_var.eq 1)
317
459
  ```
318
460
 
319
461
  ```
@@ -333,17 +475,17 @@ df[x == y, ]
333
475
  ```
334
476
  In galaaz this ambiguity does not exist, filter(df, x.eq y) is not a valid expression as
335
477
  expressions are build with symbols. In doing filter(df, :x.eq y) we are looking for elements
336
- of the 'x' column that are equal to a previously defined y variable. Finally,
478
+ of the 'x' column that are equal to a previously defined y variable. Finally in
337
479
  filter(df, :x.eq :y) we are looking for elements in which the 'x' column value is equal to
338
480
  the 'y' column value. This can be seen in the following two chunks of code:
339
481
 
340
482
 
341
483
  ```ruby
342
- @y = 1
343
- @x = 2
484
+ y = 1
485
+ x = 2
344
486
 
345
487
  # looking for values where the 'x' column is equal to the 'y' column
346
- puts @df.filter(:x.eq :y)
488
+ puts df.filter(:x.eq :y)
347
489
  ```
348
490
 
349
491
  ```
@@ -355,7 +497,7 @@ puts @df.filter(:x.eq :y)
355
497
  ```ruby
356
498
  # looking for values where the 'x' column is equal to the 'y' variable
357
499
  # in this case, the number 1
358
- puts @df.filter(:x.eq @y)
500
+ puts df.filter(:x.eq y)
359
501
  ```
360
502
 
361
503
  ```
@@ -364,7 +506,11 @@ puts @df.filter(:x.eq @y)
364
506
  ```
365
507
  # Writing a function that applies to different data sets
366
508
 
509
+ Let's suppose that we want to write a function that receives as the first argument a data frame
510
+ and as second argument an expression that adds a column to the data frame that is equal to the
511
+ sum of elements in column 'a' plus 'x'.
367
512
 
513
+ Here is the intended behaviour using the 'mutate' function of 'dplyr':
368
514
 
369
515
  ```
370
516
  mutate(df1, y = a + x)
@@ -372,8 +518,18 @@ mutate(df2, y = a + x)
372
518
  mutate(df3, y = a + x)
373
519
  mutate(df4, y = a + x)
374
520
  ```
521
+ The naive approach to writing an R function to solve this problem is:
522
+
523
+ ```
524
+ mutate_y <- function(df) {
525
+ mutate(df, y = a + x)
526
+ }
527
+ ```
528
+ Unfortunately, in R, this function can fail silently if one of the variables isn’t present
529
+ in the data frame, but is present in the global environment. We will not go through here how
530
+ to solve this problem in R.
375
531
 
376
- Here we create a mutate_y Ruby method.
532
+ In Galaaz the method mutate_y bellow will work fine and will never fail silently.
377
533
 
378
534
 
379
535
  ```ruby
@@ -381,14 +537,27 @@ def mutate_y(df)
381
537
  df.mutate(:y.assign :a + :x)
382
538
  end
383
539
  ```
384
-
385
- Note that contrary to what happens in R, method mutate_y will fail independetly from the fact
386
- that variable 'a' is defined or not.
540
+ Here we create a data frame that has only one column named 'x':
387
541
 
388
542
 
389
543
  ```ruby
390
544
  df1 = R.data__frame(x: (1..3))
391
545
  puts df1
546
+ ```
547
+
548
+ ```
549
+ ## x
550
+ ## 1 1
551
+ ## 2 2
552
+ ## 3 3
553
+ ```
554
+
555
+ Note that method mutate_y will fail independetly from the fact that variable 'a' is defined and
556
+ in the scope of the method. Variable 'a' has no relationship with the symbol ':a' used in the
557
+ definition of 'mutate\_y' above:
558
+
559
+
560
+ ```ruby
392
561
  a = 10
393
562
  mutate_y(df1)
394
563
  ```
@@ -402,11 +571,17 @@ mutate_y(df1)
402
571
  ## mismatched protect/unprotect (unprotect with empty protect stack) (RError)
403
572
  ## Translated to internal error
404
573
  ```
405
-
406
574
  # Different expressions
407
575
 
576
+ Let's move to the next problem as presented by Hardley where trying to write a function in R
577
+ that will receive two argumens, the first a variable and the second an expression is not trivial.
578
+ Bellow we create a data frame and we want to write a function that groups data by a variable and
579
+ summarises it by an expression:
580
+
408
581
 
409
582
  ```r
583
+ set.seed(123)
584
+
410
585
  df <- data.frame(
411
586
  g1 = c(1, 1, 2, 2, 2),
412
587
  g2 = c(1, 2, 1, 2, 1),
@@ -414,6 +589,19 @@ df <- data.frame(
414
589
  b = sample(5)
415
590
  )
416
591
 
592
+ as.data.frame(df)
593
+ ```
594
+
595
+ ```
596
+ ## g1 g2 a b
597
+ ## 1 1 1 2 1
598
+ ## 2 1 2 4 3
599
+ ## 3 2 1 5 4
600
+ ## 4 2 2 3 2
601
+ ## 5 2 1 1 5
602
+ ```
603
+
604
+ ```r
417
605
  d2 <- df %>%
418
606
  group_by(g1) %>%
419
607
  summarise(a = mean(a))
@@ -437,13 +625,11 @@ as.data.frame(d2)
437
625
 
438
626
  ```
439
627
  ## g2 a
440
- ## 1 1 3.666667
441
- ## 2 2 2.000000
628
+ ## 1 1 2.666667
629
+ ## 2 2 3.500000
442
630
  ```
443
631
 
444
- Trying to write a function in R that will receive two argumens, the first a variable and
445
- the second an expression is not trivia. As shown by Hardley, one might expect this function
446
- to do the trick:
632
+ As shown by Hardley, one might expect this function to do the trick:
447
633
 
448
634
 
449
635
  ```r
@@ -458,11 +644,13 @@ my_summarise <- function(df, group_var) {
458
644
  ```
459
645
 
460
646
  In order to solve this problem, coding with dplyr requires the introduction of many new concepts
461
- and functions such as 'quo', 'quos', 'enquo', 'enquos', '!!' (bang bang), '!!!' (triple bang).
647
+ and functions such as 'quo', 'quos', 'enquo', 'enquos', '!!' (bang bang), '!!!' (triple bang).
648
+ Again, we'll leave to Hardley the explanation on how to use all those functions.
462
649
 
463
650
  Now, let's try to implement the same function in galaaz. The next code block first prints the
464
- 'df' data frame define previously in R, then creates the my_summarize function and calls it
465
- passing the R data frame and the group by variable ':g1'
651
+ 'df' data frame define previously in R (to access an R variable from Galaaz, we use the tilda
652
+ operator '~' applied to the R variable name as symbol, i.e., ':df'. We then create the
653
+ 'my_summarize' method and call it passing the R data frame and the group by variable ':g1':
466
654
 
467
655
 
468
656
  ```ruby
@@ -471,35 +659,35 @@ print "\n"
471
659
 
472
660
  def my_summarize(df, group_var)
473
661
  df.group_by(group_var).
474
- summarize(a: E.mean(:a))
662
+ summarize(a: :a.mean)
475
663
  end
476
664
 
477
- puts my_summarize((~:df), :g1).as__data__frame
665
+ puts my_summarize(:df, :g1).as__data__frame
478
666
  ```
479
667
 
480
668
  ```
481
669
  ## g1 g2 a b
482
- ## 1 1 1 5 2
483
- ## 2 1 2 1 5
484
- ## 3 2 1 2 4
485
- ## 4 2 2 3 1
486
- ## 5 2 1 4 3
670
+ ## 1 1 1 2 1
671
+ ## 2 1 2 4 3
672
+ ## 3 2 1 5 4
673
+ ## 4 2 2 3 2
674
+ ## 5 2 1 1 5
487
675
  ##
488
676
  ## g1 a
489
677
  ## 1 1 3
490
678
  ## 2 2 3
491
679
  ```
492
- It works!!! Well let's make sure this was not just some coincidence
680
+ It works!!! Well, let's make sure this was not just some coincidence
493
681
 
494
682
 
495
683
  ```ruby
496
- puts my_summarize((~:df), :g2).as__data__frame
684
+ puts my_summarize(:df, :g2).as__data__frame
497
685
  ```
498
686
 
499
687
  ```
500
688
  ## g2 a
501
- ## 1 1 3.666667
502
- ## 2 2 2.000000
689
+ ## 1 1 2.666667
690
+ ## 2 2 3.500000
503
691
  ```
504
692
 
505
693
  Great, everything is fine! No magic, no new functions, no complexities, just normal, standard Ruby
@@ -508,7 +696,7 @@ code. If you've ever done NSE in R, this certainly feels much safer and easy to
508
696
  # Different input variables
509
697
 
510
698
  In the previous section we've managed to get rid of all NSE formulation for a simple example, but
511
- does this remain true for more complex examples, or will the Ruby way prove inpractical for
699
+ does this remain true for more complex examples, or will the Galaaz way prove inpractical for
512
700
  more complex code?
513
701
 
514
702
  In the next example Hardley proposes us to write a function that given an expression such as 'a'
@@ -526,7 +714,7 @@ summarise(df, mean = mean(a * b), sum = sum(a * b), n = n())
526
714
  #> # A tibble: 1 x 3
527
715
  #> mean sum n
528
716
  #> <dbl> <int> <int>
529
- #> 1 9.6 48 5
717
+ #> 1 9 45 5
530
718
  ```
531
719
 
532
720
  Let's try it in galaaz:
@@ -549,11 +737,11 @@ puts my_summarise2((~:df), :a * :b)
549
737
  ## mean sum n
550
738
  ## 1 3 15 5
551
739
  ## mean sum n
552
- ## 1 7.6 38 5
740
+ ## 1 9 45 5
553
741
  ```
554
742
 
555
743
  Once again, there is no need to use any special theory or functions. The only point to be
556
- careful about is the use of 'E' to build an expression that uses the mean, sum and n.
744
+ careful about is the use of 'E' to build expressions from functions 'mean', 'sum' and 'n'.
557
745
 
558
746
  # Different input and output variable
559
747
 
@@ -583,8 +771,10 @@ mutate(df, mean_b = mean(b), sum_b = sum(b))
583
771
  #> 4 2 2 5 4 3 15
584
772
  #> # … with 1 more row
585
773
  ```
774
+ In order to solve this problem in R, Hardley needs to introduce some more new functions and notations:
775
+ 'quo_name' and the ':=' operator from package 'rlang'
586
776
 
587
- Here is our Ruby code
777
+ Here is our Ruby code:
588
778
 
589
779
 
590
780
  ```ruby
@@ -602,17 +792,17 @@ puts my_mutate((~:df), :b)
602
792
 
603
793
  ```
604
794
  ## g1 g2 a b mean_a sum_a
605
- ## 1 1 1 5 2 3 15
606
- ## 2 1 2 1 5 3 15
607
- ## 3 2 1 2 4 3 15
608
- ## 4 2 2 3 1 3 15
609
- ## 5 2 1 4 3 3 15
795
+ ## 1 1 1 2 1 3 15
796
+ ## 2 1 2 4 3 3 15
797
+ ## 3 2 1 5 4 3 15
798
+ ## 4 2 2 3 2 3 15
799
+ ## 5 2 1 1 5 3 15
610
800
  ## g1 g2 a b mean_b sum_b
611
- ## 1 1 1 5 2 3 15
612
- ## 2 1 2 1 5 3 15
613
- ## 3 2 1 2 4 3 15
614
- ## 4 2 2 3 1 3 15
615
- ## 5 2 1 4 3 3 15
801
+ ## 1 1 1 2 1 3 15
802
+ ## 2 1 2 4 3 3 15
803
+ ## 3 2 1 5 4 3 15
804
+ ## 4 2 2 3 2 3 15
805
+ ## 5 2 1 1 5 3 15
616
806
  ```
617
807
  It really seems that "Non Standard Evaluation" is actually quite standard in Galaaz! But, you
618
808
  might have noticed a small change in the way the arguments to the mutate method were called.
@@ -624,6 +814,12 @@ and variable mean\_name is not followed by ':' but by '=>'. This is standard Ru
624
814
 
625
815
  # Capturing multiple variables
626
816
 
817
+ Moving on with new complexities, Hardley proposes us to solve the problem in which the
818
+ summarise function will receive any number of grouping variables.
819
+
820
+ This again is quite standard Ruby. In order to receive an undefined number of paramenters
821
+ the paramenter is preceded by '*':
822
+
627
823
 
628
824
  ```ruby
629
825
  def my_summarise3(df, *group_vars)
@@ -636,14 +832,58 @@ puts my_summarise3((~:df), :g1, :g2).as__data__frame
636
832
 
637
833
  ```
638
834
  ## g1 g2 a
639
- ## 1 1 1 5
640
- ## 2 1 2 1
835
+ ## 1 1 1 2
836
+ ## 2 1 2 4
641
837
  ## 3 2 1 3
642
838
  ## 4 2 2 3
643
839
  ```
644
840
 
841
+ # Why does R require NSE and Galaaz does not?
842
+
843
+ NSE introduces a number of new concepts, such as 'quoting', 'quasiquotation', 'unquoting' and
844
+ 'unquote-splicing', while in Galaaz none of those concepts are needed. What gives?
845
+
846
+ R is an extremely flexible language and it has lazy evaluation of parameters. When in R a
847
+ function is called as 'summarise(df, a = b)', the summarise function receives the litteral
848
+ 'a = b' parameter and can work with this as if it were a string. In R, it is not clear what
849
+ a and b are, they can be expressions or they can be variables, it is up to the function to
850
+ decide what 'a = b' means.
851
+
852
+ In Ruby, there is no lazy evaluation of parameters and 'a' is always a variable and so is 'b'.
853
+ Variables assume their value as soon as they are used, so 'x = a' is immediately evaluated and
854
+ variable 'x' will receive the value of variable 'a' as soon as the Ruby statement is executed.
855
+ Ruby also provides the notion of a symbol; ':a' is a symbol and does not evaluate to anything.
856
+ Galaaz uses Ruby symbols to build expressions that are not bound to anything: ':a.eq :b' is
857
+ clearly an expression and has no relationship whatsoever with the statement 'a = b'. By using
858
+ symbols, variables and expressions, all the possible ambiguities that are found in R are
859
+ eliminated in Galaaz.
860
+
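The difference between eager evaluation and symbols can be seen in plain Ruby, without Galaaz. The sketch below is illustrative only: the `Expr` struct and the `eq` helper are hypothetical names that stand in for an unevaluated expression, and they are not Galaaz's implementation of ':a.eq :b'.

```ruby
# Plain Ruby, no Galaaz required.

# Eager evaluation: the right-hand side is evaluated when the statement runs.
a = 10
x = a        # x receives the current value of a
a = 20       # rebinding a later does not affect x
puts x       # prints 10

# A symbol is just a name; it is not bound to any value.
sym = :a
puts sym.inspect   # prints :a

# Hypothetical stand-in for an unevaluated expression (NOT Galaaz's API):
# it only records the operator and the operand names.
Expr = Struct.new(:op, :lhs, :rhs) do
  def to_s
    "#{lhs} #{op} #{rhs}"
  end
end

# Hypothetical helper that builds the expression without evaluating anything.
def eq(lhs, rhs)
  Expr.new(:==, lhs, rhs)
end

# No variable named b exists here, yet the expression can still be built.
expr = eq(:a, :b)
puts expr    # prints a == b
```

Because the expression is built from symbols, nothing is looked up or compared when it is created; deciding what 'a' and 'b' refer to is left to whoever later receives the expression.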
861
+ The main problem that remains is that R functions do not clearly document what type
862
+ of input they are expecting: they might be expecting regular variables or they might be
863
+ expecting expressions and the R function will know how to deal with an input of the form
864
+ 'a = b'. For the Ruby developer, however, it might not be immediately clear if it should call the
865
+ function passing the value 'true' if variable 'a' is equal to variable 'b' or if it should
866
+ call the function passing the expression ':a.eq :b'.
867
+
868
+
645
869
  # Advanced dplyr features
646
- https://www.r-bloggers.com/programming-with-dplyr-by-using-dplyr/
870
+
871
+ In the blog post "Programming with dplyr by using dplyr" (https://www.r-bloggers.com/programming-with-dplyr-by-using-dplyr/), Iñaki Úcar shows surprise that some R users are trying to code in dplyr while avoiding
872
+ the use of NSE. For instance, he says:
873
+
874
+ > Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to
875
+ > program over dplyr without having “to bring in (or study) any deep-theory or
876
+ > heavy-weight tools such as rlang/tidyeval”.
877
+
878
+ For me, there isn't really any surprise that users are trying to avoid dplyr deep-theory. R
879
+ users frequently are not programmers, and learning to code is already hard work. On top
880
+ of that, having to learn how to 'quote' or 'enquo' or 'quos' or 'enquos' is not necessarily
881
+ a 'piece of cake'. So much so that 'tidyeval' has some more advanced functions that, instead
882
+ of using quoted expressions, use strings as arguments.
883
+
884
+ In the following examples, we show the use of functions 'group\_by\_at', 'summarise\_at' and
885
+ 'rename\_at', which receive strings as arguments. The data frame used is 'starwars', which describes
886
+ features of characters in the Star Wars movies:
647
887
 
648
888
 
649
889
  ```ruby
@@ -680,6 +920,8 @@ puts (~:starwars).head.as__data__frame
680
920
  ## 5 Imperial Speeder Bike
681
921
  ## 6
682
922
  ```
923
+ The grouped\_mean function below receives a grouping variable and calculates summaries for
924
+ the value\_variables given:
683
925
 
684
926
 
685
927
  ```r
@@ -716,6 +958,8 @@ as.data.frame(gm)
716
958
  ## 15 yellow 81.11111 76.38000 11
717
959
  ```
718
960
 
961
+ The same code with Galaaz becomes:
962
+
719
963
 
720
964
  ```ruby
721
965
  def grouped_mean(data, grouping_variables, value_variables)
@@ -723,10 +967,10 @@ def grouped_mean(data, grouping_variables, value_variables)
723
967
  group_by_at(grouping_variables).
724
968
  mutate(count: E.n).
725
969
  summarise_at(E.c(value_variables, "count"), ~:mean, na__rm: true).
726
- rename_at(value_variables, R.funs(E.paste0("mean_", value_variables)))
970
+ rename_at(value_variables, E.funs(E.paste0("mean_", value_variables)))
727
971
  end
728
972
 
729
- puts grouped_mean((~:starwars), "eye_color", R.c("mass", "birth_year")).as__data__frame
973
+ puts grouped_mean((~:starwars), "eye_color", E.c("mass", "birth_year")).as__data__frame
730
974
  ```
731
975
 
732
976
  ```
@@ -747,3 +991,13 @@ puts grouped_mean((~:starwars), "eye_color", R.c("mass", "birth_year")).as__data
747
991
  ## 14 white 48.00000 NaN 1
748
992
  ## 15 yellow 81.11111 76.38000 11
749
993
  ```
994
+
995
+ # Conclusion
996
+
997
+ Ruby and Galaaz provide a nice framework for developing code that uses R functions. Although R is
998
+ a very powerful and flexible language, sometimes, too much flexibility makes life harder for
999
+ the casual user. We believe, however, that even for the advanced user, Ruby integrated
1000
+ with R through Galaaz makes a powerful environment for data analysis. In this blog post we
1001
+ showed how Galaaz's consistent syntax eliminates the need for complex constructs such as quoting,
1002
+ enquoting, quasiquotation, etc. This simplification comes from the fact that expressions and
1003
+ variables are clearly separated objects, which is not the case in the R language.