galaaz 0.4.6 → 0.4.7
- checksums.yaml +5 -5
- data/Rakefile +1 -1
- data/bin/grun +1 -1
- data/bin/gstudio +1 -1
- data/blogs/gknit/gknit.Rmd +36 -13
- data/blogs/gknit/gknit.html +955 -0
- data/blogs/gknit/gknit.md +861 -0
- data/blogs/gknit/gknit_files/figure-html/bubble-1.png +0 -0
- data/blogs/gknit/gknit_files/figure-html/diverging_bar.png +0 -0
- data/blogs/nse_dplyr/nse_dplyr.Rmd +75 -0
- data/blogs/nse_dplyr/nse_dplyr.html +421 -0
- data/blogs/nse_dplyr/nse_dplyr.md +136 -0
- data/lib/R_interface/r_libs.R +5 -0
- data/lib/R_interface/robject.rb +8 -0
- data/lib/R_interface/rsupport.rb +2 -4
- data/lib/R_interface/ruby_extensions.rb +9 -0
- data/lib/util/exec_ruby.rb +4 -1
- data/specs/figures/bg.svg +2 -2
- data/specs/figures/no_args.svg +2 -2
- data/specs/r_list_apply.spec.rb +11 -10
- data/specs/tmp.rb +34 -30
- data/version.rb +1 -1
- metadata +10 -4
@@ -0,0 +1,861 @@
|
|
1
|
+
---
|
2
|
+
title: "How to do reproducible research in Ruby with gKnit"
|
3
|
+
author:
|
4
|
+
- "Rodrigo Botafogo"
|
5
|
+
- "Daniel Mossé - University of Pittsburgh"
|
6
|
+
tags: [Tech, Data Science, Ruby, R, GraalVM]
|
7
|
+
date: "20/02/2019"
|
8
|
+
output:
|
9
|
+
html_document:
|
10
|
+
self_contained: true
|
11
|
+
keep_md: true
|
12
|
+
pdf_document:
|
13
|
+
includes:
|
14
|
+
in_header: ["../../sty/galaaz.sty"]
|
15
|
+
number_sections: yes
|
16
|
+
---
|
17
|
+
|
18
|
+
|
19
|
+
|
20
|
+
# Introduction
|
21
|
+
|
22
|
+
The idea of "literate programming" was first introduced by Donald Knuth in the 1980s.
|
23
|
+
The main intention of this approach was to develop software interspersing macro snippets,
|
24
|
+
traditional source code, and a natural language such as English in a document
|
25
|
+
that could be compiled into
|
26
|
+
executable code and at the same time easily read by a human developer. According to Knuth
|
27
|
+
"The practitioner of
|
28
|
+
literate programming can be regarded as an essayist, whose main concern is with exposition
|
29
|
+
and excellence of style."
|
30
|
+
|
31
|
+
The idea of literate programming evolved into the idea of reproducible research, in which
|
32
|
+
all the data, software code, documentation, graphics etc. needed to reproduce the research
|
33
|
+
and its reports could be included in a
|
34
|
+
single document or set of documents that when distributed to peers could be rerun generating
|
35
|
+
the same output and reports.
|
36
|
+
|
37
|
+
The R community has put a great deal of effort into reproducible research. In 2002, Sweave was
|
38
|
+
introduced, and it allowed mixing R code with LaTeX to generate high-quality PDF documents. Those
|
39
|
+
documents could include the code, the result of executing the code, graphics and text. This
|
40
|
+
contained the whole narrative to reproduce the research. But Sweave had many problems and in
|
41
|
+
2012, Knitr, developed by Yihui Xie from RStudio, was released, solving many of the long-lasting
|
42
|
+
problems from Sweave and including in one single package many extensions and add-on packages that
|
43
|
+
were necessary for Sweave.
|
44
|
+
|
45
|
+
With Knitr, R markdown was also developed, an extension to the
|
46
|
+
Markdown format. With R markdown and Knitr it is possible to generate reports in a multitude
|
47
|
+
of formats such as HTML, markdown, LaTeX, PDF, dvi, etc. R markdown also allows the use of
|
48
|
+
multiple programming languages in the same document. In R markdown text is interspersed with
|
49
|
+
code chunks that can be executed and both the code and its results can become
|
50
|
+
part of the final report. Although R markdown allows multiple programming languages in the
|
51
|
+
same document, only R and Python (with
|
52
|
+
the reticulate package) can persist variables between chunks. For other languages, such as
|
53
|
+
Ruby, every chunk will start a new process and thus all data is lost between chunks, unless it
|
54
|
+
is somehow stored in a data file that is read by the next chunk.
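The workaround described above can be sketched in plain Ruby. This is only an illustration: the two methods are hypothetical stand-ins for two chunks that each run in a fresh process, and the JSON state file is an assumption, not something R markdown provides.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical stand-ins for two chunks that each start a new process:
# nothing survives in memory, so state has to round-trip through a file.
def first_chunk(state_file)
  data = { 'vec' => [1, 2, 3] }
  File.write(state_file, JSON.generate(data))   # save before the process ends
end

def second_chunk(state_file)
  data = JSON.parse(File.read(state_file))      # reload in the next process
  data['vec'].sum
end

state_file = File.join(Dir.mktmpdir, 'chunk_state.json')
first_chunk(state_file)
total = second_chunk(state_file)
puts total   # 6
```

Every chunk that needs earlier results has to repeat this save and reload step, which is the overhead that interrupts the narrative.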
|
55
|
+
|
56
|
+
Being able to persist data
|
57
|
+
between chunks is critical for literate programming; otherwise the flow of the narrative is lost
|
58
|
+
by all the effort of having to save data and then reload it. Probably, because of
|
59
|
+
this impossibility,
|
60
|
+
it is very rare to see any R markdown document in the Ruby community. Also, the use of
|
61
|
+
R markdown in the Ruby community would require the Ruby developer to download R and
|
62
|
+
have some minimal knowledge of Knitr.
|
63
|
+
|
64
|
+
In the Python community, the same effort to have code and text in an integrated environment
|
65
|
+
started in the first decade of the 2000s. In 2006 iPython 0.7.2 was released. In 2014,
|
66
|
+
Fernando Pérez spun off project Jupyter from iPython, creating a web-based interactive
|
67
|
+
computation environment. Jupyter can now be used with many languages, including Ruby with the
|
68
|
+
iruby gem (https://github.com/SciRuby/iruby). I am not sure if multiple languages can be used
|
69
|
+
in a Jupyter notebook and if variables can persist between chunks.
|
70
|
+
|
71
|
+
# gKnitting a Document
|
72
|
+
|
73
|
+
This document describes gKnit. gKnit uses Knitr and R markdown to knit a document in Ruby or R
|
74
|
+
and output it in any of the available formats for R markdown.
|
75
|
+
gKnit runs atop GraalVM and Galaaz (an integration
|
76
|
+
library between Ruby and R). In gKnit, Ruby variables are persisted between chunks, making
|
77
|
+
it an ideal solution for literate programming in this language. Also, since it is based on
|
78
|
+
Galaaz, Ruby chunks can have access to R variables and Polyglot Programming with Ruby and R
|
79
|
+
is quite natural.
|
80
|
+
|
81
|
+
Galaaz has already been described in the following posts:
|
82
|
+
|
83
|
+
* https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021
|
84
|
+
* https://medium.freecodecamp.org/how-to-make-beautiful-ruby-plots-with-galaaz-320848058857
|
85
|
+
|
86
|
+
This is not a blog post on R markdown, and the interested user is directed to the following links
|
87
|
+
for detailed information on its capabilities and use.
|
88
|
+
|
89
|
+
* https://rmarkdown.rstudio.com/ or
|
90
|
+
* https://bookdown.org/yihui/rmarkdown/
|
91
|
+
|
92
|
+
Here, we will briefly describe the main aspects of R markdown, so the user can start gKnitting
|
93
|
+
Ruby and R documents quickly.
|
94
|
+
|
95
|
+
## The Yaml header
|
96
|
+
|
97
|
+
An R markdown document should start with a Yaml header and be stored in a file with
|
98
|
+
'.Rmd' extension. This document has the following header for gKnitting an HTML document.
|
99
|
+
|
100
|
+
```
|
101
|
+
---
|
102
|
+
title: "How to do reproducible research in Ruby with gKnit"
|
103
|
+
author:
|
104
|
+
- "Rodrigo Botafogo"
|
105
|
+
- "Daniel Mossé - University of Pittsburgh"
|
106
|
+
tags: [Tech, Data Science, Ruby, R, GraalVM]
|
107
|
+
date: "20/02/2019"
|
108
|
+
output:
|
109
|
+
html_document:
|
110
|
+
self_contained: true
|
111
|
+
keep_md: true
|
112
|
+
pdf_document:
|
113
|
+
includes:
|
114
|
+
in_header: ["../../sty/galaaz.sty"]
|
115
|
+
number_sections: yes
|
116
|
+
---
|
117
|
+
```
|
118
|
+
|
119
|
+
For more information on the options in the Yaml header, check https://bookdown.org/yihui/rmarkdown/html-document.html.
|
120
|
+
|
121
|
+
## R Markdown formatting
|
122
|
+
|
123
|
+
Document formatting can be done with simple markups such as:
|
124
|
+
|
125
|
+
### Headers
|
126
|
+
|
127
|
+
```
|
128
|
+
# Header 1
|
129
|
+
|
130
|
+
## Header 2
|
131
|
+
|
132
|
+
### Header 3
|
133
|
+
|
134
|
+
```
|
135
|
+
|
136
|
+
### Lists
|
137
|
+
|
138
|
+
```
|
139
|
+
Unordered lists:
|
140
|
+
|
141
|
+
* Item 1
|
142
|
+
* Item 2
|
143
|
+
+ Item 2a
|
144
|
+
+ Item 2b
|
145
|
+
```
|
146
|
+
|
147
|
+
```
|
148
|
+
Ordered Lists
|
149
|
+
|
150
|
+
1. Item 1
|
151
|
+
2. Item 2
|
152
|
+
3. Item 3
|
153
|
+
+ Item 3a
|
154
|
+
+ Item 3b
|
155
|
+
```
|
156
|
+
|
157
|
+
For more R markdown formatting, see https://rmarkdown.rstudio.com/authoring_basics.html.
|
158
|
+
|
159
|
+
### R chunks
|
160
|
+
|
161
|
+
Running and executing Ruby and R code is what really interests us in this blog.
|
162
|
+
Inserting a code chunk is done by adding code in a block delimited by three back ticks
|
163
|
+
followed by an open
|
164
|
+
curly brace ('{') followed by the engine name (r, ruby, rb, include, ...), and
|
165
|
+
any optional chunk_label and options, as shown below:
|
166
|
+
|
167
|
+
````
|
168
|
+
```{engine_name [chunk_label], [chunk_options]}
|
169
|
+
```
|
170
|
+
````
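To make the header layout concrete, here is a rough sketch of how such a header could be split into its parts. The `parse_chunk_header` method is hypothetical. It is not knitr's or gKnit's parser, and it only handles simple `key = value` options.

```ruby
# Hypothetical parser for a header such as "{r second_r_chunk, echo = FALSE}".
# Assumes a non-empty header; option values are kept as plain strings.
def parse_chunk_header(header)
  inner = header.strip.sub(/\A\{/, '').sub(/\}\z/, '')
  head, *option_parts = inner.split(',').map(&:strip)
  engine, label = head.split(/\s+/, 2)          # label is nil when omitted
  options = option_parts.to_h do |part|
    key, value = part.split('=', 2).map(&:strip)
    [key, value]
  end
  { engine: engine, label: label, options: options }
end

parsed = parse_chunk_header('{r second_r_chunk, echo = FALSE}')
p parsed
```

The engine name always comes first, the label is whatever follows it before the first comma, and everything after that is treated as an option.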
|
171
|
+
|
172
|
+
For instance, let's add an R chunk to the document labeled 'first_r_chunk'. This is
|
173
|
+
a very simple piece of code that just creates a variable and prints it out. The code block should
|
174
|
+
be defined as follows:
|
175
|
+
|
176
|
+
````
|
177
|
+
```{r first_r_chunk}
|
178
|
+
vec <- c(1, 2, 3)
|
179
|
+
print(vec)
|
180
|
+
```
|
181
|
+
````
|
182
|
+
|
183
|
+
If this block is added to an R markdown document and gKnitted, the result will be:
|
184
|
+
|
185
|
+
|
186
|
+
```r
|
187
|
+
vec <- c(1, 2, 3)
|
188
|
+
print(vec)
|
189
|
+
```
|
190
|
+
|
191
|
+
```
|
192
|
+
## [1] 1 2 3
|
193
|
+
```
|
194
|
+
|
195
|
+
Now let's say that we want to do some analysis in the code, but just print the result and not the
|
196
|
+
code itself. For this, we need to add the option 'echo = FALSE'.
|
197
|
+
|
198
|
+
````
|
199
|
+
```{r second_r_chunk, echo = FALSE}
|
200
|
+
vec2 <- c(10, 20, 30)
|
201
|
+
vec3 <- vec * vec2
|
202
|
+
print(vec3)
|
203
|
+
```
|
204
|
+
````
|
205
|
+
Here is how this block will show up in the document. Observe that the code is not shown
|
206
|
+
and we only see the execution result in a white box.
|
207
|
+
|
208
|
+
|
209
|
+
```
|
210
|
+
## [1] 10 40 90
|
211
|
+
```
|
212
|
+
|
213
|
+
A description of the available chunk options can be found in the documentation cited above.
|
214
|
+
|
215
|
+
Let's add another R chunk with a function definition. In this example, a vector
|
216
|
+
'r_vec' is created and
|
217
|
+
a new function 'reduce_sum' is defined. The chunk specification is:
|
218
|
+
|
219
|
+
````
|
220
|
+
```{r data_creation}
|
221
|
+
r_vec <- c(1, 2, 3, 4, 5)
|
222
|
+
|
223
|
+
reduce_sum <- function(...) {
|
224
|
+
Reduce(sum, as.list(...))
|
225
|
+
}
|
226
|
+
```
|
227
|
+
````
|
228
|
+
|
229
|
+
and this is how it will look once executed. From now on, we will not
|
230
|
+
show the chunk definition any longer.
|
231
|
+
|
232
|
+
|
233
|
+
|
234
|
+
```r
|
235
|
+
r_vec <- c(1, 2, 3, 4, 5)
|
236
|
+
|
237
|
+
reduce_sum <- function(...) {
|
238
|
+
Reduce(sum, as.list(...))
|
239
|
+
}
|
240
|
+
```
|
241
|
+
|
242
|
+
We can, possibly in another chunk, access the vector and call the function as follows:
|
243
|
+
|
244
|
+
|
245
|
+
```r
|
246
|
+
print(r_vec)
|
247
|
+
```
|
248
|
+
|
249
|
+
```
|
250
|
+
## [1] 1 2 3 4 5
|
251
|
+
```
|
252
|
+
|
253
|
+
```r
|
254
|
+
print(reduce_sum(r_vec))
|
255
|
+
```
|
256
|
+
|
257
|
+
```
|
258
|
+
## [1] 15
|
259
|
+
```
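For readers coming from Ruby, the same reduction can be written in plain Ruby. This is only a rough analogue of the R 'reduce_sum' function above, using a Ruby Array in place of the R vector.

```ruby
# Plain-Ruby analogue of reduce_sum: fold the elements together with '+'.
def reduce_sum(values)
  values.reduce(0) { |acc, v| acc + v }
end

r_vec = [1, 2, 3, 4, 5]
total = reduce_sum(r_vec)
puts total   # 15
```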
|
260
|
+
### R Graphics with ggplot
|
261
|
+
|
262
|
+
In the following chunk, we create a bubble chart in R using ggplot and include it in
|
263
|
+
this document. Note that there is no directive in the code to include the image; this
|
264
|
+
occurs automatically. The 'mpg' dataframe is natively available to R and to Galaaz as
|
265
|
+
well.
|
266
|
+
|
267
|
+
|
268
|
+
```r
|
269
|
+
# load package and data
|
270
|
+
library(ggplot2)
|
271
|
+
data(mpg, package="ggplot2")
|
272
|
+
# mpg <- read.csv("http://goo.gl/uEeRGu")
|
273
|
+
|
274
|
+
mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]
|
275
|
+
|
276
|
+
# Scatterplot
|
277
|
+
theme_set(theme_bw()) # pre-set the bw theme.
|
278
|
+
g <- ggplot(mpg_select, aes(displ, cty)) +
|
279
|
+
labs(subtitle="mpg: Displacement vs City Mileage",
|
280
|
+
title="Bubble chart")
|
281
|
+
|
282
|
+
g + geom_jitter(aes(col=manufacturer, size=hwy)) +
|
283
|
+
geom_smooth(aes(col=manufacturer), method="lm", se=F)
|
284
|
+
```
|
285
|
+
|
286
|
+
![](/home/rbotafogo/desenv/galaaz/blogs/gknit/gknit_files/figure-html/bubble-1.png)<!-- -->
|
287
|
+
|
288
|
+
### Ruby chunks
|
289
|
+
|
290
|
+
|
291
|
+
Including a Ruby chunk is just as easy as including an R chunk in the document: just
|
292
|
+
change the name of the engine to 'ruby'. It is also possible to pass chunk options
|
293
|
+
to the Ruby engine; however, this version does not accept all the options that are
|
294
|
+
available to R chunks. Future versions will add those options.
|
295
|
+
|
296
|
+
````
|
297
|
+
```{ruby first_ruby_chunk}
|
298
|
+
```
|
299
|
+
````
|
300
|
+
|
301
|
+
In this example, the ruby chunk is called 'first_ruby_chunk'. One important
|
302
|
+
aspect of chunk labels is that they cannot be duplicated. If a chunk label is
|
303
|
+
duplicated, gKnitting will stop with an error.
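A quick way to check a document for this problem before knitting is to look for repeated labels. The helper below is a hypothetical sketch in plain Ruby and is not part of gKnit.

```ruby
# Returns the labels that occur more than once, in first-seen order.
def duplicate_labels(labels)
  labels.tally.select { |_label, count| count > 1 }.keys
end

dups = duplicate_labels(%w[first_r_chunk data_creation first_r_chunk heading])
puts dups.inspect   # ["first_r_chunk"]
```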
|
304
|
+
|
305
|
+
Another relevant point with Ruby chunks is that they are evaluated in the scope
|
306
|
+
of a class called RubyChunk. To make sure that variables are
|
307
|
+
available between chunks, they should be made instance variables of the
|
308
|
+
RubyChunk class. In the following chunk, variables '\@a', '\@b' and '\@c'
|
309
|
+
are standard Ruby variables and '\@vec' and '\@vec2' are two vectors created
|
310
|
+
by calling the 'c' method on the R module.
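The reason instance variables persist while local variables do not can be sketched with plain Ruby. The `ChunkScope` class here is a hypothetical stand-in, assuming that each chunk is evaluated with `instance_eval` against one shared object. It is not the actual RubyChunk implementation.

```ruby
# Hypothetical stand-in for the class in whose scope chunks are evaluated.
class ChunkScope
  def run(code)
    instance_eval(code)   # every chunk is evaluated against the same object
  end
end

scope = ChunkScope.new
scope.run('@a = [1, 2, 3]; b = "local only"')   # first chunk
persisted = scope.run('@a')                      # second chunk still sees @a
local_lost = scope.run('defined?(b).nil?')       # the local variable b is gone

puts persisted.inspect   # [1, 2, 3]
puts local_lost          # true
```

Instance variables live on the object, so they outlast each evaluation. Local variables belong to a single evaluation and disappear with it.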
|
311
|
+
|
312
|
+
In Galaaz, the R module allows us to access R functions transparently. The 'c'
|
313
|
+
function in R concatenates its arguments, making a vector.
|
314
|
+
Calling the 'c' method in the R module is automatically converted to calling the
|
315
|
+
'c' function in R, which, through Galaaz and the Truffle interface, creates the
|
316
|
+
vector.
|
317
|
+
|
318
|
+
It
|
319
|
+
should be clear that there is no requirement in gKnit to call or use any R
|
320
|
+
functions. gKnit will knit standard Ruby code, or even general text without
|
321
|
+
any code.
|
322
|
+
|
323
|
+
|
324
|
+
```ruby
|
325
|
+
@a = [1, 2, 3]
|
326
|
+
@b = "US$ 250.000"
|
327
|
+
@c = "The 'outputs' function"
|
328
|
+
|
329
|
+
@vec = R.c(1, 2, 3)
|
330
|
+
@vec2 = R.c(10, 20, 30)
|
331
|
+
```
|
332
|
+
|
333
|
+
In this next block, variables '\@a', '\@vec' and '\@vec2' are used and printed.
|
334
|
+
|
335
|
+
|
336
|
+
```ruby
|
337
|
+
puts @a
|
338
|
+
puts @vec * @vec2
|
339
|
+
```
|
340
|
+
|
341
|
+
```
|
342
|
+
## [1, 2, 3]
|
343
|
+
## [1] 10 40 90
|
344
|
+
```
|
345
|
+
|
346
|
+
Note that @a is a standard Ruby Array and @vec and @vec2 are vectors that behave accordingly,
|
347
|
+
where multiplication works as expected.
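For contrast, here is what the same operation looks like on plain Ruby Arrays, without Galaaz. `Array#*` with an Integer repeats the array, so the element-by-element product has to be written out explicitly.

```ruby
a = [1, 2, 3]
b = [10, 20, 30]

repeated = a * 2                               # Array#* with an Integer repeats the array
elementwise = a.zip(b).map { |x, y| x * y }    # explicit element-by-element product

puts repeated.inspect      # [1, 2, 3, 1, 2, 3]
puts elementwise.inspect   # [10, 40, 90]
```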
|
348
|
+
|
349
|
+
|
350
|
+
### Accessing R from Ruby
|
351
|
+
|
352
|
+
One of the nice aspects of Galaaz on GraalVM is that variables and functions defined in R can
|
353
|
+
be easily accessed from Ruby. This next chunk reads data from R and uses the 'reduce_sum'
|
354
|
+
function defined previously. To access an R variable from Ruby the '~' function should be
|
355
|
+
applied to the Ruby symbol representing the R variable. Since the R variable is called 'r_vec',
|
356
|
+
in Ruby, the symbol to access it is ':r_vec' and thus '~:r_vec' retrieves the value of the
|
357
|
+
variable.
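The '~:r_vec' syntax works because unary '~' is an ordinary Ruby method that can be defined on Symbol. The sketch below only illustrates that mechanism: `FAKE_R_ENV` is a hypothetical Hash standing in for the R environment, and this is not how Galaaz performs the lookup.

```ruby
# Hypothetical lookup table standing in for the R global environment.
FAKE_R_ENV = { r_vec: [1, 2, 3, 4, 5] }.freeze

class Symbol
  # Unary '~' is a regular method, so '~:name' calls this on the symbol.
  def ~
    FAKE_R_ENV.fetch(self) { raise NameError, "no variable named #{self}" }
  end
end

value = ~:r_vec
puts value.inspect   # [1, 2, 3, 4, 5]
```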
|
358
|
+
|
359
|
+
|
360
|
+
```ruby
|
361
|
+
puts ~:r_vec
|
362
|
+
```
|
363
|
+
|
364
|
+
```
|
365
|
+
## [1] 1 2 3 4 5
|
366
|
+
```
|
367
|
+
|
368
|
+
In order to call an R function, the 'R.' module is used as follows:
|
369
|
+
|
370
|
+
|
371
|
+
```ruby
|
372
|
+
puts R.reduce_sum(~:r_vec)
|
373
|
+
```
|
374
|
+
|
375
|
+
```
|
376
|
+
## [1] 15
|
377
|
+
```
|
378
|
+
|
379
|
+
### Ruby Plotting
|
380
|
+
|
381
|
+
We have seen an example of plotting with R. Plotting with Ruby does not require
|
382
|
+
anything different from plotting with R. In the following example we plot a
|
383
|
+
diverging bar graph using the 'mtcars' dataframe from R. This data was extracted
|
384
|
+
from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects
|
385
|
+
of automobile design and performance for 32 automobiles (1973–74 models). The
|
386
|
+
variables in the dataset are:
|
387
|
+
|
388
|
+
* mpg: Miles/(US) gallon
|
389
|
+
* cyl: Number of cylinders
|
390
|
+
* disp: Displacement (cu.in.)
|
391
|
+
* hp: Gross horsepower
|
392
|
+
* drat: Rear axle ratio
|
393
|
+
* wt: Weight (1000 lbs)
|
394
|
+
* qsec: 1/4 mile time
|
395
|
+
* vs: Engine (0 = V-shaped, 1 = straight)
|
396
|
+
* am: Transmission (0 = automatic, 1 = manual)
|
397
|
+
* gear: Number of forward gears
|
398
|
+
* carb: Number of carburetors
|
399
|
+
|
400
|
+
|
401
|
+
|
402
|
+
```ruby
|
403
|
+
require 'ggplot'
|
404
|
+
|
405
|
+
mtcars = ~:mtcars
|
406
|
+
|
407
|
+
mtcars.car_name = mtcars.rownames # create new column for car names
|
408
|
+
mtcars.mpg_z = ((mtcars.mpg - mtcars.mpg.mean) / mtcars.mpg.sd).round 2
|
409
|
+
mtcars.mpg_type = (mtcars.mpg_z < 0).ifelse('below', 'above')
|
410
|
+
mtcars = mtcars[mtcars.mpg_z.order, :all]
|
411
|
+
mtcars.car_name = R.factor(mtcars.car_name, levels: mtcars.car_name)
|
412
|
+
|
413
|
+
puts mtcars.ggplot(E.aes(x: :car_name, y: :mpg_z, label: :mpg_z)) +
|
414
|
+
R.geom_bar(E.aes(fill: :mpg_type), stat: 'identity', width: 0.5) +
|
415
|
+
R.scale_fill_manual(name: 'Mileage',
|
416
|
+
labels: R.c('Above Average', 'Below Average'),
|
417
|
+
values: R.c('above': '#00ba38', 'below': '#f8766d')) +
|
418
|
+
R.labs(subtitle: "Normalised mileage from 'mtcars'",
|
419
|
+
title: "Diverging Bars") +
|
420
|
+
R.coord_flip
|
421
|
+
```
|
422
|
+
|
423
|
+
|
424
|
+
![](/home/rbotafogo/desenv/galaaz/blogs/gknit/gknit_files/figure-html/diverging_bar.png)<!-- -->
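The 'mpg_z' column in the chunk above is a z-score: each value minus the mean, divided by the standard deviation, rounded to two places. Here is a minimal plain-Ruby sketch of that arithmetic. It uses made-up numbers rather than the mtcars data and assumes the sample standard deviation (dividing by n - 1), which is what R's `sd` computes.

```ruby
def z_scores(values)
  mean = values.sum / values.size.to_f
  # Sample standard deviation: divide the squared deviations by n - 1.
  sd = Math.sqrt(values.sum { |v| (v - mean)**2 } / (values.size - 1))
  values.map { |v| ((v - mean) / sd).round(2) }
end

mpg = [30.0, 10.0, 20.0]   # made-up numbers, not the mtcars data
z = z_scores(mpg)
types = z.map { |score| score < 0 ? 'below' : 'above' }

puts z.inspect       # [1.0, -1.0, 0.0]
puts types.inspect   # ["above", "below", "above"]
```

As in the chunk, a score of exactly zero is labeled 'above', because only negative scores are labeled 'below'.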
|
425
|
+
|
426
|
+
### Inline Ruby code
|
427
|
+
|
428
|
+
When using a Ruby chunk, the code and the output are formatted in blocks as seen above.
|
429
|
+
This formatting is not always desired. Sometimes, we want to have the results of the
|
430
|
+
Ruby evaluation included in the middle of a phrase. gKnit allows adding inline Ruby code
|
431
|
+
with the 'rb' engine. The following chunk specification will
|
432
|
+
create an inline Ruby text:
|
433
|
+
|
434
|
+
````
|
435
|
+
This is some text with inline Ruby accessing variable \@b which has value:
|
436
|
+
```{rb puts @b}
|
437
|
+
```
|
438
|
+
and is followed by some other text!
|
439
|
+
````
|
440
|
+
|
441
|
+
Note that it is important not to add any new line before or after the code
|
442
|
+
block if we want everything to be in only one line, resulting in the following sentence
|
443
|
+
with inline Ruby code:
|
444
|
+
|
445
|
+
<div style="margin-bottom:30px;">
|
446
|
+
</div>
|
447
|
+
|
448
|
+
This is some text with inline Ruby accessing variable \@b which has value:
|
449
|
+
US$ 250.000
|
450
|
+
and is followed by some other text!
|
451
|
+
|
452
|
+
<div style="margin-bottom:30px;">
|
453
|
+
</div>
|
454
|
+
|
455
|
+
|
456
|
+
### The 'outputs' function
|
457
|
+
|
458
|
+
We have previously used the standard 'puts' method in Ruby chunks in order to get some
|
459
|
+
output. As can be seen, the result of a 'puts' is formatted inside a white box that
|
460
|
+
follows the code block. Many times however, we would like to do some processing in the
|
461
|
+
Ruby chunk and have the result of this processing generate an output that is
|
462
|
+
'included' in the document as if we had typed it in R markdown.
|
463
|
+
|
464
|
+
For example, suppose we want to create a new 'heading' in our document, but the heading
|
465
|
+
phrase is the result of some code processing: maybe it's the first line of a file we are
|
466
|
+
going to read. Method 'outputs' adds its output as if typed in the R markdown document.
|
467
|
+
|
468
|
+
Now take a look at variable '@c', which was defined in a previous block above as
|
469
|
+
'@c = "The 'outputs' function"'. "The 'outputs' function" is actually the name of this
|
470
|
+
section and it was created using the 'outputs' function inside a Ruby chunk.
|
471
|
+
|
472
|
+
The ruby chunk to generate this heading is:
|
473
|
+
|
474
|
+
````
|
475
|
+
```{ruby heading}
|
476
|
+
outputs "### #{@c}"
|
477
|
+
```
|
478
|
+
````
|
479
|
+
|
480
|
+
The three '###' are the way we add a Heading 3 in R markdown.
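The difference between the two kinds of output can be sketched as two small rendering functions. Both are hypothetical and only mimic what is visible in this document: printed output appears in a fenced block whose lines start with '## ', while 'outputs' text is passed through untouched. This is not gKnit's actual rendering code.

```ruby
FENCE = '`' * 3   # three back ticks, built here to keep this example readable

# Mimics how 'puts' output is shown: inside a box, each line prefixed by "## ".
def render_boxed(text)
  body = text.lines.map { |line| "## #{line.chomp}" }.join("\n")
  "#{FENCE}\n#{body}\n#{FENCE}"
end

# Mimics 'outputs': the text goes into the document exactly as given.
def render_raw(text)
  text
end

boxed = render_boxed('[1] 15')
raw = render_raw("### The 'outputs' function")
puts boxed
puts raw
```

Because the raw text is left as it is, the markdown processor sees the leading '###' and turns the line into a heading.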
|
481
|
+
|
482
|
+
|
483
|
+
### HTML Output from Ruby Chunks
|
484
|
+
|
485
|
+
We've just seen the use of method 'outputs' to add text to the R markdown
|
486
|
+
document. This technique can also be used to add HTML code to the document. In R
|
487
|
+
markdown any HTML code typed directly in the document will be properly rendered.
|
488
|
+
Here, for instance, is a table definition in HTML and its output in the document:
|
489
|
+
|
490
|
+
```
|
491
|
+
<table style="width:100%">
|
492
|
+
<tr>
|
493
|
+
<th>Firstname</th>
|
494
|
+
<th>Lastname</th>
|
495
|
+
<th>Age</th>
|
496
|
+
</tr>
|
497
|
+
<tr>
|
498
|
+
<td>Jill</td>
|
499
|
+
<td>Smith</td>
|
500
|
+
<td>50</td>
|
501
|
+
</tr>
|
502
|
+
<tr>
|
503
|
+
<td>Eve</td>
|
504
|
+
<td>Jackson</td>
|
505
|
+
<td>94</td>
|
506
|
+
</tr>
|
507
|
+
</table>
|
508
|
+
```
|
509
|
+
<div style="margin-bottom:30px;">
|
510
|
+
</div>
|
511
|
+
|
512
|
+
<table style="width:100%">
|
513
|
+
<tr>
|
514
|
+
<th>Firstname</th>
|
515
|
+
<th>Lastname</th>
|
516
|
+
<th>Age</th>
|
517
|
+
</tr>
|
518
|
+
<tr>
|
519
|
+
<td>Jill</td>
|
520
|
+
<td>Smith</td>
|
521
|
+
<td>50</td>
|
522
|
+
</tr>
|
523
|
+
<tr>
|
524
|
+
<td>Eve</td>
|
525
|
+
<td>Jackson</td>
|
526
|
+
<td>94</td>
|
527
|
+
</tr>
|
528
|
+
</table>
|
529
|
+
|
530
|
+
<div style="margin-bottom:30px;">
|
531
|
+
</div>
|
532
|
+
|
533
|
+
But manually creating HTML output is not always easy or desirable. The above
|
534
|
+
table certainly looks ugly. The 'kableExtra' library is a great library for
|
535
|
+
creating beautiful tables. Take a look at https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
|
536
|
+
|
537
|
+
In the next chunk, we output the 'mtcars' dataframe from R in a nicely formatted
|
538
|
+
table. Note that we retrieve the mtcars dataframe by using '~:mtcars'.
|
539
|
+
|
540
|
+
|
541
|
+
```ruby
|
542
|
+
R.install_and_loads('kableExtra')
|
543
|
+
outputs (~:mtcars).kable.kable_styling
|
544
|
+
```
|
545
|
+
|
546
|
+
```
|
547
|
+
## Message:
|
548
|
+
## Method kable_styling not found in R environment
|
549
|
+
```
|
550
|
+
|
551
|
+
```
|
552
|
+
## Message:
|
553
|
+
## /home/rbotafogo/desenv/galaaz/lib/R_interface/rsupport.rb:92:in `eval'
|
554
|
+
## /home/rbotafogo/desenv/galaaz/lib/R_interface/rsupport.rb:272:in `exec_function_name'
|
555
|
+
## /home/rbotafogo/desenv/galaaz/lib/R_interface/robject.rb:177:in `method_missing'
|
556
|
+
## (eval):2:in `exec_ruby'
|
557
|
+
## /home/rbotafogo/desenv/galaaz/lib/util/exec_ruby.rb:141:in `instance_eval'
|
558
|
+
## /home/rbotafogo/desenv/galaaz/lib/util/exec_ruby.rb:141:in `exec_ruby'
|
559
|
+
## /home/rbotafogo/desenv/galaaz/lib/gknit/knitr_engine.rb:650:in `block in initialize'
|
560
|
+
## /home/rbotafogo/desenv/galaaz/lib/R_interface/ruby_callback.rb:77:in `call'
|
561
|
+
## /home/rbotafogo/desenv/galaaz/lib/R_interface/ruby_callback.rb:77:in `callback'
|
562
|
+
## (eval):3:in `function(...) {\n rb_method(...)'
|
563
|
+
## unknown.r:1:in `in_dir'
|
564
|
+
## unknown.r:1:in `block_exec'
|
565
|
+
## /home/rbotafogo/lib/graalvm-ce-1.0.0-rc14/jre/languages/R/library/knitr/R/block.R:92:in `call_block'
|
566
|
+
## /home/rbotafogo/lib/graalvm-ce-1.0.0-rc14/jre/languages/R/library/knitr/R/block.R:6:in `process_group.block'
|
567
|
+
## /home/rbotafogo/lib/graalvm-ce-1.0.0-rc14/jre/languages/R/library/knitr/R/block.R:3:in `<no source>'
|
568
|
+
## unknown.r:1:in `withCallingHandlers'
|
569
|
+
## unknown.r:1:in `process_file'
|
570
|
+
## unknown.r:1:in `<no source>'
|
571
|
+
## unknown.r:1:in `<no source>'
|
572
|
+
## <REPL>:5:in `<repl wrapper>'
|
573
|
+
## <REPL>:1
|
574
|
+
```
|
575
|
+
|
576
|
+
### Including Ruby files
|
577
|
+
|
578
|
+
R is a language that was created to be easy and fast for statisticians to use. As far
|
579
|
+
as I know (and please correct me if you think otherwise), it was not designed as a
|
580
|
+
language to be used for developing large systems. Of course, there are large systems and
|
581
|
+
libraries in R, but the focus of the language is on developing statistical models and
|
582
|
+
distributing them to peers.
|
583
|
+
|
584
|
+
Ruby, on the other hand, is a language for large software development. Systems written in
|
585
|
+
Ruby will have dozens, hundreds or even thousands of files. In order to document a
|
586
|
+
large system with
|
587
|
+
literate programming we cannot expect the developer to add all the files in a single '.Rmd'
|
588
|
+
file. gKnit provides the 'include' chunk engine to include a Ruby file as if it had been
|
589
|
+
typed in the '.Rmd' file.
|
590
|
+
|
591
|
+
To include a file, the following chunk should be created, where <filename> is the name of
|
592
|
+
the file to be included and where the extension, if it is '.rb', does not need to be added.
|
593
|
+
If the 'relative' option is not included, then it is treated as TRUE. When 'relative' is
|
594
|
+
true, 'require_relative' semantics is used to load the file; when false, Ruby's \$LOAD_PATH
|
595
|
+
is searched to find the file and it is 'require'd.
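The lookup rule described above can be sketched as a small path resolver. The `resolve_include` helper and its parameters are hypothetical and are not gKnit's code. It only computes which file would be picked and does not load it.

```ruby
require 'tmpdir'

# Hypothetical helper: returns the path an 'include' chunk would load.
def resolve_include(name, base_dir:, relative: true, load_path: $LOAD_PATH)
  file = File.extname(name).empty? ? "#{name}.rb" : name   # '.rb' may be omitted
  if relative
    File.expand_path(file, base_dir)                        # like require_relative
  else
    dir = load_path.find { |d| File.exist?(File.join(d, file)) }
    dir && File.join(dir, file)                             # like require; nil if not found
  end
end

dir = Dir.mktmpdir
File.write(File.join(dir, 'model.rb'), "# placeholder\n")

rel_path = resolve_include('model', base_dir: dir)
lp_path = resolve_include('model', base_dir: '/elsewhere', relative: false, load_path: [dir])
puts rel_path
puts lp_path
```

With 'relative' left at its default, the file is looked up next to the document. With 'relative' set to false, the base directory is ignored and only the load path is searched.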
|
596
|
+
|
597
|
+
````
|
598
|
+
```{include <filename>, relative = <TRUE/FALSE>}
|
599
|
+
```
|
600
|
+
````
|
601
|
+
|
602
|
+
Here we include file 'model.rb' which is in the same directory as this blog.
|
603
|
+
This code uses R's 'caret' package to split a dataset into training and test sets.
|
604
|
+
The 'caret' package is a very important and useful package for doing Data Analysis;
|
605
|
+
it has hundreds of functions for all steps of the Data Analysis workflow. To
|
606
|
+
use it just to split a dataset is to use the proverbial cannon to kill a fly. We use
|
607
|
+
it here only to show that integrating Ruby and R and using even a very complex
|
608
|
+
package such as 'caret' is trivial with Galaaz.
|
609
|
+
|
610
|
+
A word of advice: the 'caret' package has lots of dependencies and installing
|
611
|
+
it in a Linux system is a time consuming operation. Method 'R.install_and_loads'
|
612
|
+
will install the package if it is not already installed and can take a while.
|
613
|
+
|
614
|
+
````
|
615
|
+
```{include model}
|
616
|
+
```
|
617
|
+
````
|
618
|
+
|
619
|
+
|
620
|
+
```include
|
621
|
+
require 'galaaz'
|
622
|
+
|
623
|
+
# Loads the R 'caret' package. If not present, installs it
|
624
|
+
R.install_and_loads 'caret'
|
625
|
+
|
626
|
+
class Model
|
627
|
+
|
628
|
+
attr_reader :data
|
629
|
+
attr_reader :test
|
630
|
+
attr_reader :train
|
631
|
+
|
632
|
+
#==========================================================
|
633
|
+
#
|
634
|
+
#==========================================================
|
635
|
+
|
636
|
+
def initialize(data, percent_train:, seed: 123)
|
637
|
+
|
638
|
+
R.set__seed(seed)
|
639
|
+
@data = data
|
640
|
+
@percent_train = percent_train
|
641
|
+
@seed = seed
|
642
|
+
|
643
|
+
end
|
644
|
+
|
645
|
+
#==========================================================
|
646
|
+
#
|
647
|
+
#==========================================================
|
648
|
+
|
649
|
+
def partition(field)
|
650
|
+
|
651
|
+
train_index =
|
652
|
+
R.createDataPartition(@data.send(field), p: @percent_train,
|
653
|
+
list: false, times: 1)
|
654
|
+
@train = @data[train_index, :all]
|
655
|
+
@test = @data[-train_index, :all]
|
656
|
+
|
657
|
+
end
|
658
|
+
|
659
|
+
end
|
660
|
+
|
661
|
+
```
|
662
|
+
|
663
|
+
|
664
|
+
```ruby
|
665
|
+
mtcars = ~:mtcars
|
666
|
+
model = Model.new(mtcars, percent_train: 0.8)
|
667
|
+
model.partition(:mpg)
|
668
|
+
puts model.train.head
|
669
|
+
puts model.test.head
|
670
|
+
```
|
671
|
+
|
672
|
+
```
|
673
|
+
## mpg cyl disp hp drat wt qsec vs am gear carb
|
674
|
+
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
|
675
|
+
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
|
676
|
+
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
|
677
|
+
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
|
678
|
+
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
|
679
|
+
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
|
680
|
+
## mpg cyl disp hp drat wt qsec vs am gear carb
|
681
|
+
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
|
682
|
+
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
|
683
|
+
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
|
684
|
+
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
|
685
|
+
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
|
686
|
+
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
|
687
|
+
```
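For readers without 'caret' at hand, the idea of a seeded train/test split can be sketched in plain Ruby. This is a simple random split under stated assumptions: `split_rows` is a hypothetical helper, and it does not reproduce caret's `createDataPartition`, which also balances the split on the outcome variable.

```ruby
# Simple seeded random split; no stratification on the outcome is attempted.
def split_rows(rows, percent_train:, seed: 123)
  shuffled = rows.shuffle(random: Random.new(seed))   # same seed, same order
  cut = (rows.size * percent_train).round
  [shuffled.first(cut), shuffled.drop(cut)]
end

rows = (1..10).to_a
train, held_out = split_rows(rows, percent_train: 0.8)
puts train.size      # 8
puts held_out.size   # 2
```

Fixing the seed makes the split repeatable, which matters for reproducible research: rerunning the document yields the same training and test rows.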
|
688
|
+
|
689
|
+
### Documenting Gems
|
690
|
+
|
691
|
+
gKnit also allows developers to document and load files that are not in the same directory
|
692
|
+
as the '.Rmd' file. When using 'relative = FALSE' in a chunk header, gKnit will look for the
|
693
|
+
file in Ruby's \$LOAD_PATH and load it if found.
|
694
|
+
|
695
|
+
Here is an example of loading the 'find.rb' file from TruffleRuby.
|
696
|
+
|
697
|
+
````
|
698
|
+
```{include find, relative = FALSE}
|
699
|
+
```
|
700
|
+
````
|
701
|
+
|
702
|
+
|
703
|
+
```include
|
704
|
+
# frozen_string_literal: true
|
705
|
+
#
|
706
|
+
# find.rb: the Find module for processing all files under a given directory.
|
707
|
+
#
|
708
|
+
|
709
|
+
#
|
710
|
+
# The +Find+ module supports the top-down traversal of a set of file paths.
|
711
|
+
#
|
712
|
+
# For example, to total the size of all files under your home directory,
|
713
|
+
# ignoring anything in a "dot" directory (e.g. $HOME/.ssh):
|
714
|
+
#
|
715
|
+
# require 'find'
|
716
|
+
#
|
717
|
+
# total_size = 0
|
718
|
+
#
|
719
|
+
# Find.find(ENV["HOME"]) do |path|
|
720
|
+
# if FileTest.directory?(path)
|
721
|
+
# if File.basename(path)[0] == ?.
|
722
|
+
# Find.prune # Don't look any further into this directory.
|
723
|
+
# else
|
724
|
+
# next
|
725
|
+
# end
|
726
|
+
# else
|
727
|
+
# total_size += FileTest.size(path)
|
728
|
+
# end
|
729
|
+
# end
|
730
|
+
#
|
731
|
+
module Find
|
732
|
+
|
733
|
+
#
|
734
|
+
# Calls the associated block with the name of every file and directory listed
|
735
|
+
# as arguments, then recursively on their subdirectories, and so on.
|
736
|
+
#
|
737
|
+
# Returns an enumerator if no block is given.
|
738
|
+
#
|
739
|
+
# See the +Find+ module documentation for an example.
|
740
|
+
#
|
741
|
+
def find(*paths, ignore_error: true) # :yield: path
|
742
|
+
block_given? or return enum_for(__method__, *paths, ignore_error: ignore_error)
|
743
|
+
|
744
|
+
fs_encoding = Encoding.find("filesystem")
|
745
|
+
|
746
|
+
paths.collect!{|d| raise Errno::ENOENT, d unless File.exist?(d); d.dup}.each do |path|
|
747
|
+
path = path.to_path if path.respond_to? :to_path
|
748
|
+
enc = path.encoding == Encoding::US_ASCII ? fs_encoding : path.encoding
|
749
|
+
ps = [path]
|
750
|
+
while file = ps.shift
|
751
|
+
catch(:prune) do
|
752
|
+
yield file.dup.taint
|
753
|
+
begin
|
754
|
+
s = File.lstat(file)
|
755
|
+
rescue Errno::ENOENT, Errno::EACCES, Errno::ENOTDIR, Errno::ELOOP, Errno::ENAMETOOLONG
|
756
|
+
raise unless ignore_error
|
757
|
+
next
|
758
|
+
end
|
759
|
+
if s.directory? then
|
760
|
+
begin
|
761
|
+
fs = Dir.children(file, encoding: enc)
|
762
|
+
rescue Errno::ENOENT, Errno::EACCES, Errno::ENOTDIR, Errno::ELOOP, Errno::ENAMETOOLONG
|
763
|
+
raise unless ignore_error
|
764
|
+
next
|
765
|
+
end
|
766
|
+
fs.sort!
|
767
|
+
fs.reverse_each {|f|
|
768
|
+
f = File.join(file, f)
|
769
|
+
ps.unshift f.untaint
|
770
|
+
}
|
771
|
+
end
|
772
|
+
end
|
773
|
+
end
|
774
|
+
end
|
775
|
+
nil
|
776
|
+
end
|
777
|
+
|
778
|
+
#
|
779
|
+
# Skips the current file or directory, restarting the loop with the next
|
780
|
+
# entry. If the current file is a directory, that directory will not be
|
781
|
+
# recursively entered. Meaningful only within the block associated with
|
782
|
+
# Find::find.
|
783
|
+
#
|
784
|
+
# See the +Find+ module documentation for an example.
|
785
|
+
#
|
786
|
+
def prune
|
787
|
+
throw :prune
|
788
|
+
end
|
789
|
+
|
790
|
+
module_function :find, :prune
|
791
|
+
end
|
792
|
+
```
|
793
|
+
|
794
|
+
## Converting to PDF
|
795
|
+
|
796
|
+
One of the beauties of knitr is that the same input can be converted to many different outputs.
|
797
|
+
One very useful format is, of course, PDF. In order to convert an R markdown file to PDF,
|
798
|
+
it is necessary to have LaTeX installed on the system. We will not explain here how to
|
799
|
+
install LaTeX as there are plenty of documents on the web showing how to proceed.
|
800
|
+
|
801
|
+
gKnit comes with a simple LaTeX style file for gKnitting this blog as a PDF document. Here is
|
802
|
+
the Yaml header to generate this blog in PDF format instead of HTML:
|
803
|
+
|
804
|
+
```
|
805
|
+
---
|
806
|
+
title: "gKnit - Ruby and R Knitting with Galaaz in GraalVM"
|
807
|
+
author: "Rodrigo Botafogo"
|
808
|
+
tags: [Galaaz, Ruby, R, TruffleRuby, FastR, GraalVM, knitr, gknit]
|
809
|
+
date: "29 October 2018"
|
810
|
+
output:
|
811
|
+
pdf_document:
|
812
|
+
includes:
|
813
|
+
in_header: ["../../sty/galaaz.sty"]
|
814
|
+
number_sections: yes
|
815
|
+
---
|
816
|
+
```
|
817
|
+
|
818
|
+
# Conclusion
|
819
|
+
|
820
|
+
One of the promises of GraalVM is that users/developers will be able to use the best tool
|
821
|
+
for their task at hand, independently of the programming language the tool was written in. Galaaz
|
822
|
+
and gKnit are not trivial implementations atop the GraalVM and Truffle interop messages;
|
823
|
+
however, the time and effort it took to wrap Ruby over R - Galaaz - (not finished yet) or to
|
824
|
+
wrap Knitr with gKnit is a fraction of a fraction of a fraction of the time required to
|
825
|
+
implement the original tools. Trying to reimplement all R packages in Ruby would require the
|
826
|
+
same effort it is taking Python to implement NumPy, pandas and all supporting libraries and it
|
827
|
+
is unlikely that this effort would ever be done. GraalVM has allowed Ruby to profit "almost
|
828
|
+
for free" from this huge set of libraries and tools that make R one of the most used
|
829
|
+
languages for data analysis and machine learning.
|
830
|
+
|
831
|
+
More interesting though than being able to wrap the R libraries with Ruby, is that Ruby adds
|
832
|
+
value to R, by allowing developers to use powerful and modern constructs for code reuse that
|
833
|
+
are not the strong points of R. As shown in this blog, R and Ruby can easily communicate
|
834
|
+
and R can be structured in classes and modules in a way that greatly expands its power and
|
835
|
+
readability.
|
836
|
+
|
837
|
+
# Installing gKnit
|
838
|
+
|
839
|
+
## Prerequisites
|
840
|
+
|
841
|
+
* GraalVM (>= rc8)
|
842
|
+
* TruffleRuby
|
843
|
+
* FastR
|
844
|
+
|
845
|
+
The following R packages will be automatically installed when necessary, but could be installed prior
|
846
|
+
to using gKnit if desired:
|
847
|
+
|
848
|
+
* ggplot2
|
849
|
+
* gridExtra
|
850
|
+
* knitr
|
851
|
+
|
852
|
+
Installation of R packages requires a development environment and can be time consuming. In Linux,
|
853
|
+
the GNU compiler and tools should be enough. I am not sure what is needed on the Mac.
|
854
|
+
|
855
|
+
## Preparation
|
856
|
+
|
857
|
+
* gem install galaaz
|
858
|
+
|
859
|
+
## Usage
|
860
|
+
|
861
|
+
* gknit \<filename\>
|