galaaz 0.4.6 → 0.4.7

@@ -0,0 +1,861 @@
1
+ ---
2
+ title: "How to do reproducible research in Ruby with gKnit"
3
+ author:
4
+ - "Rodrigo Botafogo"
5
+ - "Daniel Mossé - University of Pittsburgh"
6
+ tags: [Tech, Data Science, Ruby, R, GraalVM]
7
+ date: "20/02/2019"
8
+ output:
9
+ html_document:
10
+ self_contained: true
11
+ keep_md: true
12
+ pdf_document:
13
+ includes:
14
+ in_header: ["../../sty/galaaz.sty"]
15
+ number_sections: yes
16
+ ---
17
+
18
+
19
+
20
+ # Introduction
21
+
22
+ The idea of "literate programming" was first introduced by Donald Knuth in the 1980s.
23
+ The main intention of this approach was to develop software interspersing macro snippets,
24
+ traditional source code, and a natural language such as English in a document
25
+ that could be compiled into
26
+ executable code and at the same time easily read by a human developer. According to Knuth,
27
+ "The practitioner of
28
+ literate programming can be regarded as an essayist, whose main concern is with exposition
29
+ and excellence of style."
30
+
31
+ The idea of literate programming evolved into the idea of reproducible research, in which
32
+ all the data, software code, documentation, graphics etc. needed to reproduce the research
33
+ and its reports could be included in a
34
+ single document or set of documents that when distributed to peers could be rerun generating
35
+ the same output and reports.
36
+
37
+ The R community has put a great deal of effort into reproducible research. In 2002, Sweave was
38
+ introduced, and it allowed mixing R code with LaTeX, generating high-quality PDF documents. Those
39
+ documents could include the code, the result of executing the code, graphics and text. This
40
+ contained the whole narrative to reproduce the research. But Sweave had many problems and in
41
+ 2012, Knitr, developed by Yihui Xie from RStudio, was released, solving many of the long-lasting
42
+ problems from Sweave and including in one single package many extensions and add-on packages that
43
+ were necessary for Sweave.
44
+
45
+ With Knitr, R markdown was also developed, an extension to the
46
+ Markdown format. With R markdown and Knitr it is possible to generate reports in a multitude
47
+ of formats such as HTML, markdown, LaTeX, PDF, dvi, etc. R markdown also allows the use of
48
+ multiple programming languages in the same document. In R markdown, text is interspersed with
49
+ code chunks that can be executed and both the code and its results can become
50
+ part of the final report. Although R markdown allows multiple programming languages in the
51
+ same document, only R and Python (with
52
+ the reticulate package) can persist variables between chunks. For other languages, such as
53
+ Ruby, every chunk will start a new process and thus all data is lost between chunks, unless it
54
+ is somehow stored in a data file that is read by the next chunk.
55
+
56
+ Being able to persist data
57
+ between chunks is critical for literate programming; otherwise the flow of the narrative is lost
58
+ by all the effort of having to save data and then reload it. Probably because of
59
+ this impossibility,
60
+ it is very rare to see any R markdown document in the Ruby community. Also, the use of
61
+ R markdown for the Ruby community would require the Ruby developer to download R and
62
+ have some minimal knowledge of Knitr.
63
+
64
+ In the Python community, the same effort to have code and text in an integrated environment
65
+ started in the first decade of the 2000s. In 2006, IPython 0.7.2 was released. In 2014,
66
+ Fernando Pérez spun off Project Jupyter from IPython, creating a web-based interactive
67
+ computation environment. Jupyter can now be used with many languages, including Ruby with the
68
+ iruby gem (https://github.com/SciRuby/iruby). I am not sure if multiple languages can be used
69
+ in a Jupyter notebook and if variables can persist between chunks.
70
+
71
+ # gKnitting a Document
72
+
73
+ This document describes gKnit. gKnit uses Knitr and R markdown to knit a document in Ruby or R
74
+ and output it in any of the available formats for R markdown.
75
+ gKnit runs atop GraalVM and Galaaz (an integration
76
+ library between Ruby and R). In gKnit, Ruby variables are persisted between chunks, making
77
+ it an ideal solution for literate programming in this language. Also, since it is based on
78
+ Galaaz, Ruby chunks can have access to R variables and Polyglot Programming with Ruby and R
79
+ is quite natural.
80
+
81
+ Galaaz has already been described in the following posts:
82
+
83
+ * https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021
84
+ * https://medium.freecodecamp.org/how-to-make-beautiful-ruby-plots-with-galaaz-320848058857
85
+
86
+ This is not a blog post on R markdown, and the interested user is directed to the following links
87
+ for detailed information on its capabilities and use.
88
+
89
+ * https://rmarkdown.rstudio.com/ or
90
+ * https://bookdown.org/yihui/rmarkdown/
91
+
92
+ Here, we will briefly describe the main aspects of R markdown, so the user can start gKnitting
93
+ Ruby and R documents quickly.
94
+
95
+ ## The Yaml header
96
+
97
+ An R markdown document should start with a Yaml header and be stored in a file with
98
+ the '.Rmd' extension. This document has the following header for gKnitting an HTML document.
99
+
100
+ ```
101
+ ---
102
+ title: "How to do reproducible research in Ruby with gKnit"
103
+ author:
104
+ - "Rodrigo Botafogo"
105
+ - "Daniel Mossé - University of Pittsburgh"
106
+ tags: [Tech, Data Science, Ruby, R, GraalVM]
107
+ date: "20/02/2019"
108
+ output:
109
+ html_document:
110
+ self_contained: true
111
+ keep_md: true
112
+ pdf_document:
113
+ includes:
114
+ in_header: ["../../sty/galaaz.sty"]
115
+ number_sections: yes
116
+ ---
117
+ ```
118
+
119
+ For more information on the options in the Yaml header, check https://bookdown.org/yihui/rmarkdown/html-document.html.
120
+
121
+ ## R Markdown formatting
122
+
123
+ Document formatting can be done with simple markups such as:
124
+
125
+ ### Headers
126
+
127
+ ```
128
+ # Header 1
129
+
130
+ ## Header 2
131
+
132
+ ### Header 3
133
+
134
+ ```
135
+
136
+ ### Lists
137
+
138
+ ```
139
+ Unordered lists:
140
+
141
+ * Item 1
142
+ * Item 2
143
+ + Item 2a
144
+ + Item 2b
145
+ ```
146
+
147
+ ```
148
+ Ordered Lists
149
+
150
+ 1. Item 1
151
+ 2. Item 2
152
+ 3. Item 3
153
+ + Item 3a
154
+ + Item 3b
155
+ ```
156
+
157
+ Please go to https://rmarkdown.rstudio.com/authoring_basics.html for more on R markdown formatting.
158
+
159
+ ### R chunks
160
+
161
+ Executing Ruby and R code is what really interests us in this blog.
162
+ Inserting a code chunk is done by adding code in a block delimited by three backticks
163
+ followed by an open
164
+ curly brace ('{') followed by the engine name (r, ruby, rb, include, ...) and
165
+ any optional chunk_label and options, as shown below:
166
+
167
+ ````
168
+ ```{engine_name [chunk_label], [chunk_options]}
169
+ ```
170
+ ````
171
+
172
+ For instance, let's add an R chunk to the document labeled 'first_r_chunk'. This is
173
+ a very simple piece of code that just creates a variable and prints it out. The code block should
174
+ be defined as follows:
175
+
176
+ ````
177
+ ```{r first_r_chunk}
178
+ vec <- c(1, 2, 3)
179
+ print(vec)
180
+ ```
181
+ ````
182
+
183
+ If this block is added to an R markdown document and gKnitted, the result will be:
184
+
185
+
186
+ ```r
187
+ vec <- c(1, 2, 3)
188
+ print(vec)
189
+ ```
190
+
191
+ ```
192
+ ## [1] 1 2 3
193
+ ```
194
+
195
+ Now let's say that we want to do some analysis in the code, but just print the result and not the
196
+ code itself. For this, we need to add the option 'echo = FALSE'.
197
+
198
+ ````
199
+ ```{r second_r_chunk, echo = FALSE}
200
+ vec2 <- c(10, 20, 30)
201
+ vec3 <- vec * vec2
202
+ print(vec3)
203
+ ```
204
+ ````
205
+ Here is how this block will show up in the document. Observe that the code is not shown
206
+ and we only see the execution result in a white box.
207
+
208
+
209
+ ```
210
+ ## [1] 10 40 90
211
+ ```
212
+
213
+ A description of the available chunk options can be found in the documentation cited above.
214
+
215
+ Let's add another R chunk with a function definition. In this example, a vector
216
+ 'r_vec' is created and
217
+ a new function 'reduce_sum' is defined. The chunk specification is
218
+
219
+ ````
220
+ ```{r data_creation}
221
+ r_vec <- c(1, 2, 3, 4, 5)
222
+
223
+ reduce_sum <- function(...) {
224
+ Reduce(sum, as.list(...))
225
+ }
226
+ ```
227
+ ````
228
+
229
+ and this is how it will look once executed. From now on, we will not
230
+ show the chunk definition any longer.
231
+
232
+
233
+
234
+ ```r
235
+ r_vec <- c(1, 2, 3, 4, 5)
236
+
237
+ reduce_sum <- function(...) {
238
+ Reduce(sum, as.list(...))
239
+ }
240
+ ```
241
+
242
+ We can, possibly in another chunk, access the vector and call the function as follows:
243
+
244
+
245
+ ```r
246
+ print(r_vec)
247
+ ```
248
+
249
+ ```
250
+ ## [1] 1 2 3 4 5
251
+ ```
252
+
253
+ ```r
254
+ print(reduce_sum(r_vec))
255
+ ```
256
+
257
+ ```
258
+ ## [1] 15
259
+ ```
260
+ ### R Graphics with ggplot
261
+
262
+ In the following chunk, we create a bubble chart in R using ggplot and include it in
263
+ this document. Note that there is no directive in the code to include the image; this
264
+ occurs automatically. The 'mpg' dataframe is natively available to R and to Galaaz as
265
+ well.
266
+
267
+
268
+ ```r
269
+ # load package and data
270
+ library(ggplot2)
271
+ data(mpg, package="ggplot2")
272
+ # mpg <- read.csv("http://goo.gl/uEeRGu")
273
+
274
+ mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]
275
+
276
+ # Scatterplot
277
+ theme_set(theme_bw()) # pre-set the bw theme.
278
+ g <- ggplot(mpg_select, aes(displ, cty)) +
279
+ labs(subtitle="mpg: Displacement vs City Mileage",
280
+ title="Bubble chart")
281
+
282
+ g + geom_jitter(aes(col=manufacturer, size=hwy)) +
283
+ geom_smooth(aes(col=manufacturer), method="lm", se=F)
284
+ ```
285
+
286
+ ![](/home/rbotafogo/desenv/galaaz/blogs/gknit/gknit_files/figure-html/bubble-1.png)<!-- -->
287
+
288
+ ### Ruby chunks
289
+
290
+
291
+ Including a Ruby chunk is just as easy as including an R chunk in the document: just
292
+ change the name of the engine to 'ruby'. It is also possible to pass chunk options
293
+ to the Ruby engine; however, this version does not accept all the options that are
294
+ available to R chunks. Future versions will add those options.
295
+
296
+ ````
297
+ ```{ruby first_ruby_chunk}
298
+ ```
299
+ ````
300
+
301
+ In this example, the ruby chunk is called 'first_ruby_chunk'. One important
302
+ aspect of chunk labels is that they cannot be duplicated. If a chunk label is
303
+ duplicated, gKnitting will stop with an error.
304
+
305
+ Another relevant point with Ruby chunks is that they are evaluated in the scope
306
+ of a class called RubyChunk. To make sure that variables are
307
+ available between chunks, they should be made instance variables of the
308
+ RubyChunk class. In the following chunk, variables '\@a', '\@b' and '\@c'
309
+ are standard Ruby variables and '\@vec' and '\@vec2' are two vectors created
310
+ by calling the 'c' method on the R module.
311
+
312
+ In Galaaz, the R module allows us to access R functions transparently. The 'c'
313
+ function in R is a function that concatenates its arguments, making a vector.
314
+ Calling the 'c' method in the R module is automatically converted to calling the
315
+ 'c' function in R, which, through Galaaz and the Truffle interface, creates the
316
+ vector.
317
+
318
+ It
319
+ should be clear that there is no requirement in gKnit to call or use any R
320
+ functions. gKnit will knit standard Ruby code, or even general text without
321
+ any code.
322
+
323
+
324
+ ```ruby
325
+ @a = [1, 2, 3]
326
+ @b = "US$ 250.000"
327
+ @c = "The 'outputs' function"
328
+
329
+ @vec = R.c(1, 2, 3)
330
+ @vec2 = R.c(10, 20, 30)
331
+ ```
332
+
333
+ In this next block, variables '\@a', '\@vec' and '\@vec2' are used and printed.
334
+
335
+
336
+ ```ruby
337
+ puts @a
338
+ puts @vec * @vec2
339
+ ```
340
+
341
+ ```
342
+ ## [1, 2, 3]
343
+ ## [1] 10 40 90
344
+ ```
345
+
346
+ Note that @a is a standard Ruby Array and @vec and @vec2 are vectors that behave accordingly,
347
+ where multiplication works as expected.
348
+
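The reason instance variables persist while local variables do not can be illustrated in plain Ruby. The following is only a sketch, assuming that each chunk body is evaluated with `instance_eval` against one long-lived object; the `ChunkScope` class and the chunk strings are hypothetical stand-ins, not the Galaaz implementation.

```ruby
# Hypothetical stand-in for the object in whose scope chunks are evaluated.
# This is an illustration only, not the Galaaz implementation.
class ChunkScope; end

scope = ChunkScope.new

# Each string plays the role of the body of one Ruby chunk.
first_chunk  = "@a = [1, 2, 3]\nlocal_total = @a.sum"
second_chunk = "[@a, defined?(local_total)]"

scope.instance_eval(first_chunk)
persisted, local_seen = scope.instance_eval(second_chunk)

puts persisted.inspect   # the instance variable set by the first chunk
puts local_seen.inspect  # nil: the local variable did not survive
```

Under this assumption, the instance variable lives on the shared object and is visible to every later chunk, whereas a local variable exists only for the duration of the evaluation that created it.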
349
+
350
+ ### Accessing R from Ruby
351
+
352
+ One of the nice aspects of Galaaz on GraalVM is that variables and functions defined in R can
353
+ be easily accessed from Ruby. This next chunk reads data from R and uses the 'reduce_sum'
354
+ function defined previously. To access an R variable from Ruby the '~' function should be
355
+ applied to the Ruby symbol representing the R variable. Since the R variable is called 'r_vec',
356
+ in Ruby, the symbol to access it is ':r_vec' and thus '~:r_vec' retrieves the value of the
357
+ variable.
358
+
359
+
360
+ ```ruby
361
+ puts ~:r_vec
362
+ ```
363
+
364
+ ```
365
+ ## [1] 1 2 3 4 5
366
+ ```
367
+
368
+ In order to call an R function, the 'R.' module is used as follows:
369
+
370
+
371
+ ```ruby
372
+ puts R.reduce_sum(~:r_vec)
373
+ ```
374
+
375
+ ```
376
+ ## [1] 15
377
+ ```
378
+
379
+ ### Ruby Plotting
380
+
381
+ We have seen an example of plotting with R. Plotting with Ruby does not require
382
+ anything different from plotting with R. In the following example we plot a
383
+ diverging bar graph using the 'mtcars' dataframe from R. This data was extracted
384
+ from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects
385
+ of automobile design and performance for 32 automobiles (1973–74 models). The
386
+ variables (fuel consumption plus the ten aspects) are:
387
+
388
+ * mpg: Miles/(US) gallon
389
+ * cyl: Number of cylinders
390
+ * disp: Displacement (cu.in.)
391
+ * hp: Gross horsepower
392
+ * drat: Rear axle ratio
393
+ * wt: Weight (1000 lbs)
394
+ * qsec: 1/4 mile time
395
+ * vs: Engine (0 = V-shaped, 1 = straight)
396
+ * am: Transmission (0 = automatic, 1 = manual)
397
+ * gear: Number of forward gears
398
+ * carb: Number of carburetors
399
+
400
+
401
+
402
+ ```ruby
403
+ require 'ggplot'
404
+
405
+ mtcars = ~:mtcars
406
+
407
+ mtcars.car_name = mtcars.rownames # create new column for car names
408
+ mtcars.mpg_z = ((mtcars.mpg - mtcars.mpg.mean) / mtcars.mpg.sd).round 2
409
+ mtcars.mpg_type = (mtcars.mpg_z < 0).ifelse('below', 'above')
410
+ mtcars = mtcars[mtcars.mpg_z.order, :all]
411
+ mtcars.car_name = R.factor(mtcars.car_name, levels: mtcars.car_name)
412
+
413
+ puts mtcars.ggplot(E.aes(x: :car_name, y: :mpg_z, label: :mpg_z)) +
414
+ R.geom_bar(E.aes(fill: :mpg_type), stat: 'identity', width: 0.5) +
415
+ R.scale_fill_manual(name: 'Mileage',
416
+ labels: R.c('Above Average', 'Below Average'),
417
+ values: R.c('above': '#00ba38', 'below': '#f8766d')) +
418
+ R.labs(subtitle: "Normalised mileage from 'mtcars'",
419
+ title: "Diverging Bars") +
420
+ R.coord_flip
421
+ ```
422
+
423
+
424
+ ![](/home/rbotafogo/desenv/galaaz/blogs/gknit/gknit_files/figure-html/diverging_bar.png)<!-- -->
425
+
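For readers who want to follow the arithmetic behind the 'mpg_z' column without R, here is a minimal plain-Ruby sketch of the same normalisation. The three mileage values and the car names are made-up sample data, not rows of 'mtcars', and the hash-based rows are only an assumed stand-in for a dataframe.

```ruby
# Plain-Ruby sketch of the normalisation used for the diverging bars.
# The three mileage values are made-up sample data, not rows of 'mtcars'.
mpg = { 'car_a' => 10.0, 'car_b' => 20.0, 'car_c' => 30.0 }

values = mpg.values
mean = values.sum / values.size

# Sample standard deviation (divides by n - 1), which is what R's sd() computes.
variance = values.sum { |v| (v - mean)**2 } / (values.size - 1)
sd = Math.sqrt(variance)

rows = mpg.map do |name, value|
  z = ((value - mean) / sd).round(2)
  { car_name: name, mpg_z: z, mpg_type: z < 0 ? 'below' : 'above' }
end

# Sort by z-score, as 'mtcars[mtcars.mpg_z.order, :all]' does.
rows.sort_by! { |row| row[:mpg_z] }
rows.each { |row| puts row.inspect }
```

With this sample, the mean is 20 and the standard deviation is 10, so the z-scores are -1.0, 0.0 and 1.0. A car exactly at the mean is labelled 'above' because the test is strictly 'less than zero'.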
426
+ ### Inline Ruby code
427
+
428
+ When using a Ruby chunk, the code and the output are formatted in blocks as seen above.
429
+ This formatting is not always desired. Sometimes, we want to have the results of the
430
+ Ruby evaluation included in the middle of a phrase. gKnit allows adding inline Ruby code
431
+ with the 'rb' engine. The following chunk specification will
432
+ create an inline Ruby text:
433
+
434
+ ````
435
+ This is some text with inline Ruby accessing variable \@b which has value:
436
+ ```{rb puts @b}
437
+ ```
438
+ and is followed by some other text!
439
+ ````
440
+
441
+ Note that it is important not to add any new line before or after the code
442
+ block if we want everything to be in only one line, resulting in the following sentence
443
+ with inline Ruby code:
444
+
445
+ <div style="margin-bottom:30px;">
446
+ </div>
447
+
448
+ This is some text with inline Ruby accessing variable \@b which has value:
449
+ US$ 250.000
450
+ and is followed by some other text!
451
+
452
+ <div style="margin-bottom:30px;">
453
+ </div>
454
+
455
+
456
+ ### The 'outputs' function
457
+
458
+ We have previously used the standard 'puts' method in Ruby chunks in order to get some
459
+ output. As can be seen, the result of a 'puts' is formatted inside a white box that
460
+ follows the code block. Many times, however, we would like to do some processing in the
461
+ Ruby chunk and have the result of this processing generate an output that is
462
+ 'included' in the document as if we had typed it in R markdown.
463
+
464
+ For example, suppose we want to create a new 'heading' in our document, but the heading
465
+ phrase is the result of some code processing: maybe it's the first line of a file we are
466
+ going to read. Method 'outputs' adds its output as if typed in the R markdown document.
467
+
468
+ Now take a look at variable '@c' (it was defined in a previous block above) as
469
+ '@c = "The 'outputs' function"'. "The 'outputs' function" is actually the name of this
470
+ section and it was created using the 'outputs' function inside a Ruby chunk.
471
+
472
+ The ruby chunk to generate this heading is:
473
+
474
+ ````
475
+ ```{ruby heading}
476
+ outputs "### #{@c}"
477
+ ```
478
+ ````
479
+
480
+ The three '###' are the way we add a Heading 3 in R markdown.
481
+
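The difference between 'puts' and 'outputs' can be pictured with a small plain-Ruby model. This is a sketch under stated assumptions, not gKnit's code: the `ChunkOutput` class, its two buffers and the `to_markdown` method are hypothetical names chosen to mirror the behaviour described above, where 'puts' text lands in an output block and 'outputs' text is inserted as if typed in the document.

```ruby
# Hypothetical renderer contrasting the two ways a chunk can emit text.
# Method names mirror 'puts' and 'outputs', but this is not gKnit's code.
class ChunkOutput
  FENCE = '`' * 3   # three backticks, built here to keep this example readable

  def initialize
    @captured = []   # text printed with 'puts', shown in an output block
    @raw = []        # text emitted with 'outputs', inserted as markdown
  end

  def puts(text)
    @captured << text.to_s
  end

  def outputs(text)
    @raw << text.to_s
  end

  # Builds the markdown that would follow the chunk in the knitted document.
  def to_markdown
    parts = []
    unless @captured.empty?
      body = @captured.map { |line| "## #{line}" }.join("\n")
      parts << "#{FENCE}\n#{body}\n#{FENCE}"
    end
    parts.concat(@raw)
    parts.join("\n\n")
  end
end

chunk = ChunkOutput.new
chunk.outputs "### The 'outputs' function"
print chunk.to_markdown
```

In this model, the heading text reaches the document unchanged, so the markdown processor later renders it as a real Heading 3 rather than as boxed program output.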
482
+
483
+ ### HTML Output from Ruby Chunks
484
+
485
+ We've just seen the use of method 'outputs' to add text to the R markdown
486
+ document. This technique can also be used to add HTML code to the document. In R
487
+ markdown, any HTML code typed directly in the document will be properly rendered.
488
+ Here, for instance, is a table definition in HTML and its output in the document:
489
+
490
+ ```
491
+ <table style="width:100%">
492
+ <tr>
493
+ <th>Firstname</th>
494
+ <th>Lastname</th>
495
+ <th>Age</th>
496
+ </tr>
497
+ <tr>
498
+ <td>Jill</td>
499
+ <td>Smith</td>
500
+ <td>50</td>
501
+ </tr>
502
+ <tr>
503
+ <td>Eve</td>
504
+ <td>Jackson</td>
505
+ <td>94</td>
506
+ </tr>
507
+ </table>
508
+ ```
509
+ <div style="margin-bottom:30px;">
510
+ </div>
511
+
512
+ <table style="width:100%">
513
+ <tr>
514
+ <th>Firstname</th>
515
+ <th>Lastname</th>
516
+ <th>Age</th>
517
+ </tr>
518
+ <tr>
519
+ <td>Jill</td>
520
+ <td>Smith</td>
521
+ <td>50</td>
522
+ </tr>
523
+ <tr>
524
+ <td>Eve</td>
525
+ <td>Jackson</td>
526
+ <td>94</td>
527
+ </tr>
528
+ </table>
529
+
530
+ <div style="margin-bottom:30px;">
531
+ </div>
532
+
533
+ But manually creating HTML output is not always easy or desirable. The above
534
+ table certainly looks ugly. The 'kableExtra' library is a great library for
535
+ creating beautiful tables. Take a look at https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
536
+
537
+ In the next chunk, we output the 'mtcars' dataframe from R in a nicely formatted
538
+ table. Note that we retrieve the mtcars dataframe by using '~:mtcars'.
539
+
540
+
541
+ ```ruby
542
+ R.install_and_loads('kableExtra')
543
+ outputs (~:mtcars).kable.kable_styling
544
+ ```
545
+
546
+ ```
547
+ ## Message:
548
+ ## Method kable_styling not found in R environment
549
+ ```
550
+
551
+ ```
552
+ ## Message:
553
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/rsupport.rb:92:in `eval'
554
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/rsupport.rb:272:in `exec_function_name'
555
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/robject.rb:177:in `method_missing'
556
+ ## (eval):2:in `exec_ruby'
557
+ ## /home/rbotafogo/desenv/galaaz/lib/util/exec_ruby.rb:141:in `instance_eval'
558
+ ## /home/rbotafogo/desenv/galaaz/lib/util/exec_ruby.rb:141:in `exec_ruby'
559
+ ## /home/rbotafogo/desenv/galaaz/lib/gknit/knitr_engine.rb:650:in `block in initialize'
560
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/ruby_callback.rb:77:in `call'
561
+ ## /home/rbotafogo/desenv/galaaz/lib/R_interface/ruby_callback.rb:77:in `callback'
562
+ ## (eval):3:in `function(...) {\n rb_method(...)'
563
+ ## unknown.r:1:in `in_dir'
564
+ ## unknown.r:1:in `block_exec'
565
+ ## /home/rbotafogo/lib/graalvm-ce-1.0.0-rc14/jre/languages/R/library/knitr/R/block.R:92:in `call_block'
566
+ ## /home/rbotafogo/lib/graalvm-ce-1.0.0-rc14/jre/languages/R/library/knitr/R/block.R:6:in `process_group.block'
567
+ ## /home/rbotafogo/lib/graalvm-ce-1.0.0-rc14/jre/languages/R/library/knitr/R/block.R:3:in `<no source>'
568
+ ## unknown.r:1:in `withCallingHandlers'
569
+ ## unknown.r:1:in `process_file'
570
+ ## unknown.r:1:in `<no source>'
571
+ ## unknown.r:1:in `<no source>'
572
+ ## <REPL>:5:in `<repl wrapper>'
573
+ ## <REPL>:1
574
+ ```
575
+
576
+ ### Including Ruby files
577
+
578
+ R is a language that was created to be easy and fast for statisticians to use. As far
579
+ as I know (and please correct me if you think otherwise), it was not a
580
+ language to be used for developing large systems. Of course, there are large systems and
581
+ libraries in R, but the focus of the language is on developing statistical models and
582
+ distributing them to peers.
583
+
584
+ Ruby, on the other hand, is a language for large software development. Systems written in
585
+ Ruby will have dozens, hundreds or even thousands of files. In order to document a
586
+ large system with
587
+ literate programming we cannot expect the developer to add all the files in a single '.Rmd'
588
+ file. gKnit provides the 'include' chunk engine to include a Ruby file as if it had been
589
+ typed in the '.Rmd' file.
590
+
591
+ To include a file, the following chunk should be created, where <filename> is the name of
592
+ the file to be included and where the extension, if it is '.rb', does not need to be added.
593
+ If the 'relative' option is not included, then it is treated as TRUE. When 'relative' is
594
+ true, 'require_relative' semantics are used to load the file; when false, Ruby's \$LOAD_PATH
595
+ is searched to find the file and it is 'require'd.
596
+
597
+ ````
598
+ ```{include <filename>, relative = <TRUE/FALSE>}
599
+ ```
600
+ ````
601
+
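The lookup rules just described can be sketched in plain Ruby. This is only an illustration of the two resolution modes; `resolve_include` and its `base_dir` and `load_path` parameters are hypothetical names introduced here, not part of the Galaaz API.

```ruby
# Hypothetical helper: maps the 'include' chunk options onto a file path.
# 'resolve_include' and its parameters are illustrative names, not Galaaz API.
def resolve_include(filename, relative: true, base_dir: Dir.pwd, load_path: $LOAD_PATH)
  # The '.rb' extension may be omitted in the chunk header.
  filename += '.rb' if File.extname(filename).empty?

  if relative
    # 'require_relative'-like lookup: resolve against the document's directory.
    File.expand_path(filename, base_dir)
  else
    # 'require'-like lookup: first match in the load path, or nil.
    load_path.map { |dir| File.join(dir, filename) }.find { |path| File.file?(path) }
  end
end

puts resolve_include('model', base_dir: '/tmp/blog')
```

Defaulting `relative` to true mirrors the statement that an omitted 'relative' option is treated as TRUE.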
602
+ Here we include file 'model.rb' which is in the same directory as this blog.
603
+ This code uses R's 'caret' package to split a dataset into train and test sets.
604
+ The 'caret' package is a very important and useful package for doing Data Analysis;
605
+ it has hundreds of functions for all steps of the Data Analysis workflow. To
606
+ just split a dataset, it is the proverbial cannon to kill the fly. We use
607
+ it here only to show that integrating Ruby and R and using even a very complex
608
+ package such as 'caret' is trivial with Galaaz.
609
+
610
+ A word of advice: the 'caret' package has lots of dependencies and installing
611
+ it in a Linux system is a time consuming operation. Method 'R.install_and_loads'
612
+ will install the package if it is not already installed and can take a while.
613
+
614
+ ````
615
+ ```{include model}
616
+ ```
617
+ ````
618
+
619
+
620
+ ```include
621
+ require 'galaaz'
622
+
623
+ # Loads the R 'caret' package. If not present, installs it
624
+ R.install_and_loads 'caret'
625
+
626
+ class Model
627
+
628
+ attr_reader :data
629
+ attr_reader :test
630
+ attr_reader :train
631
+
632
+ #==========================================================
633
+ #
634
+ #==========================================================
635
+
636
+ def initialize(data, percent_train:, seed: 123)
637
+
638
+ R.set__seed(seed)
639
+ @data = data
640
+ @percent_train = percent_train
641
+ @seed = seed
642
+
643
+ end
644
+
645
+ #==========================================================
646
+ #
647
+ #==========================================================
648
+
649
+ def partition(field)
650
+
651
+ train_index =
652
+ R.createDataPartition(@data.send(field), p: @percent_train,
653
+ list: false, times: 1)
654
+ @train = @data[train_index, :all]
655
+ @test = @data[-train_index, :all]
656
+
657
+ end
658
+
659
+ end
660
+
661
+ ```
662
+
663
+
664
+ ```ruby
665
+ mtcars = ~:mtcars
666
+ model = Model.new(mtcars, percent_train: 0.8)
667
+ model.partition(:mpg)
668
+ puts model.train.head
669
+ puts model.test.head
670
+ ```
671
+
672
+ ```
673
+ ## mpg cyl disp hp drat wt qsec vs am gear carb
674
+ ## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
675
+ ## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
676
+ ## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
677
+ ## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
678
+ ## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
679
+ ## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
680
+ ## mpg cyl disp hp drat wt qsec vs am gear carb
681
+ ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
682
+ ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
683
+ ## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
684
+ ## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
685
+ ## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
686
+ ## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
687
+ ```
688
+
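To make the train/test idea concrete without R, here is a minimal plain-Ruby sketch of a seeded random split. It is a swapped-in, simpler technique: caret's 'createDataPartition' samples within groups of the outcome variable, which this sketch does not attempt. The `split_rows` helper and the integer rows are assumptions made for illustration.

```ruby
# Simple random split in plain Ruby. Note that caret's createDataPartition
# does more than this; the sketch only shows the train/test idea.
def split_rows(rows, percent_train:, seed: 123)
  shuffled = rows.shuffle(random: Random.new(seed))
  cut = (rows.size * percent_train).round
  [shuffled.first(cut), shuffled.drop(cut)]
end

rows = (1..10).to_a            # stand-in for the rows of a data frame
train, test = split_rows(rows, percent_train: 0.8)

puts "train: #{train.size} rows, test: #{test.size} rows"
```

Fixing the seed, as the 'Model' class does with 'R.set__seed', makes the split reproducible from one knit to the next.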
689
+ ### Documenting Gems
690
+
691
+ gKnit also allows developers to document and load files that are not in the same directory
692
+ as the '.Rmd' file. When using 'relative = FALSE' in a chunk header, gKnit will look for the
693
+ file in Ruby's \$LOAD_PATH and load it if found.
694
+
695
+ Here is an example of loading the 'find.rb' file from TruffleRuby.
696
+
697
+ ````
698
+ ```{include find, relative = FALSE}
699
+ ```
700
+ ````
701
+
702
+
703
+ ```include
704
+ # frozen_string_literal: true
705
+ #
706
+ # find.rb: the Find module for processing all files under a given directory.
707
+ #
708
+
709
+ #
710
+ # The +Find+ module supports the top-down traversal of a set of file paths.
711
+ #
712
+ # For example, to total the size of all files under your home directory,
713
+ # ignoring anything in a "dot" directory (e.g. $HOME/.ssh):
714
+ #
715
+ # require 'find'
716
+ #
717
+ # total_size = 0
718
+ #
719
+ # Find.find(ENV["HOME"]) do |path|
720
+ # if FileTest.directory?(path)
721
+ # if File.basename(path)[0] == ?.
722
+ # Find.prune # Don't look any further into this directory.
723
+ # else
724
+ # next
725
+ # end
726
+ # else
727
+ # total_size += FileTest.size(path)
728
+ # end
729
+ # end
730
+ #
731
+ module Find
732
+
733
+ #
734
+ # Calls the associated block with the name of every file and directory listed
735
+ # as arguments, then recursively on their subdirectories, and so on.
736
+ #
737
+ # Returns an enumerator if no block is given.
738
+ #
739
+ # See the +Find+ module documentation for an example.
740
+ #
741
+ def find(*paths, ignore_error: true) # :yield: path
742
+ block_given? or return enum_for(__method__, *paths, ignore_error: ignore_error)
743
+
744
+ fs_encoding = Encoding.find("filesystem")
745
+
746
+ paths.collect!{|d| raise Errno::ENOENT, d unless File.exist?(d); d.dup}.each do |path|
747
+ path = path.to_path if path.respond_to? :to_path
748
+ enc = path.encoding == Encoding::US_ASCII ? fs_encoding : path.encoding
749
+ ps = [path]
750
+ while file = ps.shift
751
+ catch(:prune) do
752
+ yield file.dup.taint
753
+ begin
754
+ s = File.lstat(file)
755
+ rescue Errno::ENOENT, Errno::EACCES, Errno::ENOTDIR, Errno::ELOOP, Errno::ENAMETOOLONG
756
+ raise unless ignore_error
757
+ next
758
+ end
759
+ if s.directory? then
760
+ begin
761
+ fs = Dir.children(file, encoding: enc)
762
+ rescue Errno::ENOENT, Errno::EACCES, Errno::ENOTDIR, Errno::ELOOP, Errno::ENAMETOOLONG
763
+ raise unless ignore_error
764
+ next
765
+ end
766
+ fs.sort!
767
+ fs.reverse_each {|f|
768
+ f = File.join(file, f)
769
+ ps.unshift f.untaint
770
+ }
771
+ end
772
+ end
773
+ end
774
+ end
775
+ nil
776
+ end
777
+
778
+ #
779
+ # Skips the current file or directory, restarting the loop with the next
780
+ # entry. If the current file is a directory, that directory will not be
781
+ # recursively entered. Meaningful only within the block associated with
782
+ # Find::find.
783
+ #
784
+ # See the +Find+ module documentation for an example.
785
+ #
786
+ def prune
787
+ throw :prune
788
+ end
789
+
790
+ module_function :find, :prune
791
+ end
792
+ ```
793
+
794
+ ## Converting to PDF
795
+
796
+ One of the beauties of knitr is that the same input can be converted to many different outputs.
797
+ One very useful format is, of course, PDF. In order to convert an R markdown file to PDF,
798
+ it is necessary to have LaTeX installed on the system. We will not explain here how to
799
+ install LaTeX as there are plenty of documents on the web showing how to proceed.
800
+
801
+ gKnit comes with a simple LaTeX style file for gknitting this blog as a PDF document. Here is
802
+ the Yaml header to generate this blog in PDF format instead of HTML:
803
+
804
+ ```
805
+ ---
806
+ title: "gKnit - Ruby and R Knitting with Galaaz in GraalVM"
807
+ author: "Rodrigo Botafogo"
808
+ tags: [Galaaz, Ruby, R, TruffleRuby, FastR, GraalVM, knitr, gknit]
809
+ date: "29 October 2018"
810
+ output:
811
+ pdf_document:
812
+ includes:
813
+ in_header: ["../../sty/galaaz.sty"]
814
+ number_sections: yes
815
+ ---
816
+ ```
817
+
818
+ # Conclusion
819
+
820
+ One of the promises of GraalVM is that users/developers will be able to use the best tool
821
+ for their task at hand, independently of the programming language the tool was written in. Galaaz
822
+ and gKnit are not trivial implementations atop the GraalVM and Truffle interop messages;
823
+ however, the time and effort it took to wrap Ruby over R - Galaaz - (not finished yet) or to
824
+ wrap Knitr with gKnit is a fraction of a fraction of a fraction of the time required to
825
+ implement the original tools. Trying to reimplement all R packages in Ruby would require the
826
+ same effort it is taking Python to implement NumPy, pandas and all supporting libraries and it
827
+ is unlikely that this effort would ever be done. GraalVM has allowed Ruby to profit "almost
828
+ for free" from this huge set of libraries and tools that make R one of the most used
829
+ languages for data analysis and machine learning.
830
+
831
+ More interesting, though, than being able to wrap the R libraries with Ruby is that Ruby adds
832
+ value to R, by allowing developers to use powerful and modern constructs for code reuse that
833
+ are not the strong points of R. As shown in this blog, R and Ruby can easily communicate
834
+ and R can be structured in classes and modules in a way that greatly expands its power and
835
+ readability.
836
+
837
+ # Installing gKnit
838
+
839
+ ## Prerequisites
840
+
841
+ * GraalVM (>= rc8)
842
+ * TruffleRuby
843
+ * FastR
844
+
845
+ The following R packages will be automatically installed when necessary, but could be installed prior
846
+ to using gKnit if desired:
847
+
848
+ * ggplot2
849
+ * gridExtra
850
+ * knitr
851
+
852
+ Installation of R packages requires a development environment and can be time-consuming. On Linux,
853
+ the GNU compiler and tools should be enough. I am not sure what is needed on the Mac.
854
+
855
+ ## Preparation
856
+
857
+ * gem install galaaz
858
+
859
+ ## Usage
860
+
861
+ * gknit \<filename\>