galaaz 0.4.1 → 0.4.2

Files changed (105)
  1. checksums.yaml +4 -4
  2. data/Rakefile +29 -0
  3. data/bin/gknit +208 -10
  4. data/bin/gknit2 +14 -0
  5. data/bin/gknit2~ +6 -0
  6. data/bin/prepareR.rb +3 -0
  7. data/bin/prepareR.rb~ +1 -0
  8. data/bin/tmp.py +51 -0
  9. data/blogs/dev/dev.Rmd +70 -0
  10. data/blogs/dev/dev.Rmd~ +104 -0
  11. data/blogs/dev/dev.html +209 -0
  12. data/blogs/dev/dev.md +72 -0
  13. data/blogs/dev/dev_files/figure-html/bubble-1.png +0 -0
  14. data/blogs/dev/model.rb +41 -0
  15. data/blogs/galaaz_ggplot/galaaz_ggplot.Rmd +55 -27
  16. data/blogs/galaaz_ggplot/galaaz_ggplot.aux +44 -0
  17. data/blogs/galaaz_ggplot/galaaz_ggplot.dvi +0 -0
  18. data/blogs/galaaz_ggplot/galaaz_ggplot.html +17 -4
  19. data/blogs/galaaz_ggplot/galaaz_ggplot.out +10 -0
  20. data/blogs/galaaz_ggplot/galaaz_ggplot.pdf +0 -0
  21. data/blogs/galaaz_ggplot/galaaz_ggplot.tex +630 -0
  22. data/blogs/galaaz_ggplot/midwest.Rmd +1 -1
  23. data/blogs/galaaz_ggplot/midwest_external_png +13 -0
  24. data/blogs/galaaz_ggplot/midwest_external_png~ +1 -0
  25. data/blogs/gknit/gknit.Rmd +500 -0
  26. data/blogs/gknit/gknit.Rmd~ +184 -0
  27. data/blogs/gknit/gknit.Rnd~ +17 -0
  28. data/blogs/gknit/gknit.html +528 -0
  29. data/blogs/gknit/gknit.md +628 -0
  30. data/blogs/gknit/gknit.pdf +0 -0
  31. data/blogs/gknit/gknit.tex +745 -0
  32. data/blogs/gknit/gknit_files/figure-html/bubble-1.png +0 -0
  33. data/blogs/gknit/gknit_files/figure-html/diverging_bar.png +0 -0
  34. data/blogs/gknit/model.rb +41 -0
  35. data/blogs/gknit/model.rb~ +46 -0
  36. data/blogs/ruby_plot/figures/dose_len.png +0 -0
  37. data/blogs/ruby_plot/figures/facet_by_delivery.png +0 -0
  38. data/blogs/ruby_plot/figures/facet_by_dose.png +0 -0
  39. data/blogs/ruby_plot/figures/facets_by_delivery_color.png +0 -0
  40. data/blogs/ruby_plot/figures/facets_by_delivery_color2.png +0 -0
  41. data/blogs/ruby_plot/figures/facets_with_decorations.png +0 -0
  42. data/blogs/ruby_plot/figures/facets_with_jitter.png +0 -0
  43. data/blogs/ruby_plot/figures/facets_with_points.png +0 -0
  44. data/blogs/ruby_plot/figures/final_box_plot.png +0 -0
  45. data/blogs/ruby_plot/figures/final_violin_plot.png +0 -0
  46. data/blogs/ruby_plot/figures/violin_with_jitter.png +0 -0
  47. data/blogs/ruby_plot/ruby_plot.Rmd +680 -0
  48. data/blogs/ruby_plot/ruby_plot.Rmd~ +215 -0
  49. data/blogs/ruby_plot/ruby_plot.html +563 -0
  50. data/blogs/ruby_plot/ruby_plot.md +731 -0
  51. data/blogs/ruby_plot/ruby_plot.pdf +0 -0
  52. data/blogs/ruby_plot/ruby_plot.tex +458 -0
  53. data/examples/sthda_ggplot/all.rb +0 -6
  54. data/examples/sthda_ggplot/two_variables_cont_bivariate/geom_hex.rb +1 -1
  55. data/examples/sthda_ggplot/two_variables_cont_cont/misc.rb +1 -1
  56. data/examples/sthda_ggplot/two_variables_disc_cont/geom_bar.rb +2 -2
  57. data/examples/sthda_ggplot/two_variables_disc_disc/geom_jitter.rb +0 -1
  58. data/lib/R/eng_ruby.R +62 -0
  59. data/lib/R/eng_ruby.R~ +63 -0
  60. data/lib/R_interface/capture_plot.rb~ +23 -0
  61. data/lib/{R → R_interface}/expression.rb +0 -0
  62. data/lib/{R → R_interface}/r.rb +10 -1
  63. data/lib/{R → R_interface}/r.rb~ +0 -0
  64. data/lib/{R → R_interface}/r_methods.rb +21 -5
  65. data/lib/{R → R_interface}/rbinary_operators.rb +6 -1
  66. data/lib/R_interface/rclosure.rb +38 -0
  67. data/lib/{R → R_interface}/rdata_frame.rb +0 -0
  68. data/lib/R_interface/rdevices.R +31 -0
  69. data/lib/R_interface/rdevices.rb +225 -0
  70. data/lib/{R/rclosure.rb → R_interface/rdevices.rb~} +3 -10
  71. data/lib/{R → R_interface}/renvironment.rb +0 -0
  72. data/lib/{R → R_interface}/rexpression.rb +0 -0
  73. data/lib/{R → R_interface}/rindexed_object.rb +0 -0
  74. data/lib/{R → R_interface}/rlanguage.rb +0 -0
  75. data/lib/{R → R_interface}/rlist.rb +0 -0
  76. data/lib/{R → R_interface}/rmatrix.rb +0 -0
  77. data/lib/{R → R_interface}/rmd_indexed_object.rb +0 -0
  78. data/lib/{R → R_interface}/robject.rb +5 -0
  79. data/lib/{R → R_interface}/rpkg.rb +0 -0
  80. data/lib/{R → R_interface}/rsupport.rb +49 -13
  81. data/lib/{R → R_interface}/rsupport_scope.rb +0 -0
  82. data/lib/{R → R_interface}/rsymbol.rb +1 -0
  83. data/lib/{R → R_interface}/ruby_callback.rb +0 -0
  84. data/lib/{R → R_interface}/ruby_extensions.rb +2 -1
  85. data/lib/{R → R_interface}/runary_operators.rb +0 -0
  86. data/lib/{R → R_interface}/rvector.rb +0 -0
  87. data/lib/galaaz.rb +4 -2
  88. data/lib/gknit.rb +27 -0
  89. data/lib/gknit.rb~ +26 -0
  90. data/lib/gknit/knitr_engine.rb +120 -0
  91. data/lib/gknit/knitr_engine.rb~ +102 -0
  92. data/lib/gknit/ruby_engine.rb +70 -0
  93. data/lib/gknit/ruby_engine.rb~ +72 -0
  94. data/lib/util/exec_ruby.rb +8 -7
  95. data/lib/util/inline_file.rb +70 -0
  96. data/lib/util/inline_file.rb~ +23 -0
  97. data/r_requires/ggplot.rb +1 -8
  98. data/r_requires/knitr.rb +27 -0
  99. data/r_requires/knitr.rb~ +4 -0
  100. data/specs/r_language.spec.rb +22 -0
  101. data/specs/r_plots.spec.rb +72 -0
  102. data/specs/r_plots.spec.rb~ +37 -0
  103. data/specs/tmp.rb +255 -1
  104. data/version.rb +1 -1
  105. metadata +89 -39
@@ -0,0 +1,731 @@
1
+ ---
2
+ title: "How to make Beautiful Ruby Plots with Galaaz"
3
+ author: "Rodrigo Botafogo"
4
+ tags: [Tech, Data Science, Ruby, R, GraalVM]
5
+ date: "November 19th, 2018"
6
+ output:
7
+ html_document:
8
+ self_contained: true
9
+ keep_md: true
10
+ pdf_document:
11
+ includes:
12
+ in_header: ["../../sty/galaaz.sty"]
13
+ number_sections: yes
14
+ ---
15
+
16
+
17
+
18
+ # Introduction
19
+
20
+ According to Wikipedia "Ruby is a dynamic, interpreted, reflective, object-oriented,
21
+ general-purpose programming language. It was designed and developed in the mid-1990s by Yukihiro
22
+ "Matz" Matsumoto in Japan." It reached high popularity with the development of Ruby on Rails
23
+ (RoR) by David Heinemeier Hansson. RoR is a web application framework first released
24
+ around 2005. It makes extensive use of Ruby's metaprogramming features. With RoR,
25
+ Ruby became very popular. According to [Ruby's Tiobe index](https://www.tiobe.com/tiobe-index/ruby/)
26
+ it peaked in popularity around 2008. Then its popularity
27
+ declined until 2015 when it started picking up again. At the time of
28
+ this writing (November 2018), the Tiobe index puts Ruby in 16th position.
29
+
30
+ Python, a similar language to Ruby, ranks 4th in the index. Java, C and C++ take the
31
+ first three positions. Ruby is often criticized for its focus on web applications.
32
+ But Ruby can do [much more](https://github.com/markets/awesome-ruby) than just web applications.
33
+ Yet, for scientific computing, Ruby lags way behind Python and R. Python has the
34
+ Django framework for the web, NumPy for numerical arrays, and Pandas for data analysis.
35
+ R is a free software environment for statistical computing and graphics with thousands
36
+ of libraries for data analysis.
37
+
38
+ Until recently, there was no real prospect for Ruby to bridge this gap.
39
+ Implementing a complete scientific computing infrastructure would take too long.
40
+ This is where GraalVM comes into the picture:
41
+
42
+ > GraalVM is a universal virtual machine for running applications written in
43
+ > JavaScript, Python 3, Ruby, R, JVM-based languages like Java, Scala, Kotlin,
44
+ > and LLVM-based languages such as C and C++.
45
+ >
46
+ > GraalVM removes the isolation between programming languages and enables
47
+ > interoperability in a shared runtime. It can run either standalone or in the
48
+ > context of OpenJDK, Node.js, Oracle Database, or MySQL.
49
+ >
50
+ > GraalVM allows you to write polyglot applications with a seamless way to pass
51
+ > values from one language to another. With GraalVM there is no copying or
52
+ > marshaling necessary as it is with other polyglot systems. This lets you
53
+ > achieve high performance when language boundaries are crossed. Most of the time
54
+ > there is no additional cost for crossing a language boundary at all.
55
+ >
56
+ > Often developers have to make uncomfortable compromises that require them
57
+ > to rewrite their software in other languages. For example:
58
+ >
59
+ > * That library is not available in my language. I need to rewrite it.
60
+ > * That language would be the perfect fit for my problem, but we cannot
61
+ > run it in our environment.
62
+ > * That problem is already solved in my language, but the language is
63
+ > too slow.
64
+ >
65
+ > With GraalVM we aim to allow developers to freely choose the right language for
66
+ > the task at hand without making compromises.
67
+
68
+ As stated above, GraalVM is a _universal_ virtual machine that allows Ruby and R (and other
69
+ languages) to run on the same environment. GraalVM allows polyglot applications to
70
+ _seamlessly_ interact with one another and pass values from one language to the other.
71
+ Galaaz, a gem for Ruby, intends to tightly couple Ruby and R
72
+ and allow those languages to interact in a way that the user will be unaware
73
+ of such interaction.
74
+
75
+ Library wrapping is a common way of bringing features from one language into another.
76
+ To improve performance, Python often wraps more efficient C libraries. For the
77
+ Python developer, the existence of such C libraries is of no concern. The problem with
78
+ library wrapping is that for any new library, there is the need to handcraft a new
79
+ wrapper.
80
+
81
+ Galaaz, instead of wrapping a single C or R library, wraps the whole of
82
+ the R language in Ruby. As a result, all of R's thousands of libraries are available to
83
+ Ruby developers. Also any new library developed in R will be available without a
84
+ new wrapping effort.
85
+
86
+ This article shows how Ruby can use R's ggplot2 library transparently, and
87
+ bring to Ruby the power of high quality scientific plotting. It also shows that
88
+ migrating from R to Ruby with Galaaz is a matter of small syntactic changes.
89
+ Using Ruby, the R developer can use all of Ruby's powerful OO features. It also
90
+ becomes much easier to move code from the analysis phase to the production phase.
91
+
92
+ In this article we will explore the R ToothGrowth dataset. In doing so, we will
93
+ create some boxplots. A primer on boxplots is available in
94
+ [this article](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51).
95
+
96
+ We will also create a Corporate Template ensuring that plots will have a consistent
97
+ visualization. This template is built using a Ruby module. There is a way of building
98
+ ggplot themes that will work the same as the Ruby module. Yet, writing a new theme
99
+ requires specific knowledge. Ruby modules are standard to the language and don't
100
+ need special knowledge.
101
+
102
+ In [this blog](https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021) we show a scatter plot in Ruby also with Galaaz.
103
+
104
+ # gKnit
105
+
106
+ _Knitr_ is an application that converts text written in rmarkdown to many
107
+ different output formats. For instance, a writer can convert an rmarkdown document
108
+ to HTML, LaTeX, docx and many other formats. Rmarkdown documents can contain
109
+ text and _code chunks_. Knitr formats code chunks in a grayed box in the output document.
110
+ It also executes the code chunks and formats the output in a white box. Every line of
111
+ output from the executed code is preceded by '##'.
112
+
113
+ Knitr allows code chunks to be in R, Python,
114
+ Ruby and dozens of other languages. Yet, while R and Python chunks can share data, in other
115
+ languages, chunks are independent. This means that a variable defined in one chunk
116
+ cannot be used in another chunk.
117
+
118
+ With _gKnit_, Ruby code chunks can share data. In gKnit each
119
+ Ruby chunk executes in its own scope and thus local variables defined in a chunk are
120
+ not accessible by other chunks. Yet, all chunks execute in the scope of a 'chunk'
121
+ class, and instance variables ('@') are available in all chunks.
122
+
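The scoping rule described above can be illustrated in plain Ruby, without gKnit. The following is a minimal sketch under stated assumptions: `ChunkRunner` is a hypothetical stand-in, not gKnit's actual engine, and it simply evaluates each chunk with `instance_eval` on one shared object. Local variables vanish when a chunk ends, while instance variables stay on the object.

```ruby
# Hypothetical stand-in for a chunk runner (NOT gKnit's real implementation).
class ChunkRunner
  # Evaluate a chunk of Ruby source in the scope of this object.
  # Every call gets a fresh local scope, but instance variables live on
  # the runner itself and therefore survive between calls.
  def run(source)
    instance_eval(source)
  end
end

runner = ChunkRunner.new

# First chunk: defines one local variable and one instance variable.
runner.run(<<~RUBY)
  local_total = 10
  @shared_total = 20
RUBY

# Second chunk: the instance variable is visible, the local one is not.
shared = runner.run("@shared_total")
local_visible = runner.run("defined?(local_total) ? true : false")

puts "shared: #{shared}"               # shared: 20
puts "local visible: #{local_visible}" # local visible: false
```

This is why the chunks in this article store the dataset and the plots in variables such as '@tooth_growth' and '@bp'.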
123
+ # Exploring the Dataset
124
+
125
+ Let's start by exploring our selected dataset. ToothGrowth is an R dataset. A dataset
126
+ is like an Excel spreadsheet, but in which each column has only one type of data.
127
+ For instance, one column can hold floats, another integers, and a third strings.
128
+ This dataset analyses the length of odontoblasts (cells responsible for tooth growth)
129
+ in 60 guinea pigs, where each animal received one of three dose levels of Vitamin C
130
+ (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice (OJ) or ascorbic acid
131
+ (a form of vitamin C and coded as VC).
132
+
133
+ The ToothGrowth dataset contains three columns: 'len', 'supp' and 'dose'. Let's
134
+ take a look at a few rows of this dataset. In Galaaz, to have access to an R variable
135
+ we use the corresponding Ruby symbol preceded by the tilde ('~') function. Note in the
136
+ following chunk that Ruby's '@tooth_growth' is assigned the value of '~:ToothGrowth'.
137
+ 'ToothGrowth' is the R variable containing the dataset of interest.
138
+
139
+
140
+ ```ruby
141
+ # Read the R ToothGrowth variable and assign it to the
142
+ # Ruby instance variable @tooth_growth that will be
143
+ # available to all Ruby chunks in this document.
144
+ @tooth_growth = ~:ToothGrowth
145
+ # print the first few elements of the dataset
146
+ puts @tooth_growth.head
147
+ ```
148
+
149
+ ```
150
+ ## len supp dose
151
+ ## 1 4.2 VC 0.5
152
+ ## 2 11.5 VC 0.5
153
+ ## 3 7.3 VC 0.5
154
+ ## 4 5.8 VC 0.5
155
+ ## 5 6.4 VC 0.5
156
+ ## 6 10.0 VC 0.5
157
+ ```
158
+
159
+ Great! We've managed to read the ToothGrowth dataset and take a look at its elements.
160
+ We see here the first 6 rows of the dataset. To access a column, follow the dataset name
161
+ with a dot ('.') and the name of the column. Also use dot notation to chain methods
162
+ in usual Ruby style.
163
+
164
+
165
+ ```ruby
166
+ # Access the tooth_growth 'len' column and print the first few
167
+ # elements of this column with the 'head' method.
168
+ puts @tooth_growth.len.head
169
+ ```
170
+
171
+ ```
172
+ ## [1] 4.2 11.5 7.3 5.8 6.4 10.0
173
+ ```
174
+
175
+ The 'dose' column contains numeric values: 0.5, 1 or 2. Although those are
176
+ numbers, they are better interpreted as a [factor or category](https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors/). So, let's convert our 'dose' column from numeric to 'factor'.
177
+ In R, the function 'as.factor' is used to convert data in a vector to factors. To use this
178
+ function from Galaaz, the dot ('.') in the function name is replaced by '__' (double underscore).
179
+ The function 'as.factor' becomes 'R.as__factor' or just 'as__factor' when chaining.
180
+
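As an illustration of this naming rule only, the mapping from a Ruby method name to an R function name can be sketched as a string substitution. `r_function_name` is a hypothetical helper invented for this example; how Galaaz resolves names internally is not shown in this article.

```ruby
# Hypothetical helper, NOT part of the Galaaz API: it only illustrates the
# rule that a double underscore in Ruby stands for a dot in R.
def r_function_name(ruby_name)
  ruby_name.to_s.gsub("__", ".")
end

puts r_function_name(:as__factor)         # as.factor
puts r_function_name(:dev__off)           # dev.off
puts r_function_name(:panel__grid__major) # panel.grid.major
puts r_function_name(:geom_boxplot)       # geom_boxplot
```

A single underscore is left untouched, so R names such as 'geom_boxplot' are written the same way in both languages.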
181
+
182
+ ```ruby
183
+ # convert the dose to a factor
184
+ @tooth_growth.dose = @tooth_growth.dose.as__factor
185
+ ```
186
+
187
+ Let's explore some more details of this dataset. In particular, let's look at its dimensions,
188
+ structure and summary statistics.
189
+
190
+
191
+ ```ruby
192
+ puts @tooth_growth.dim
193
+ ```
194
+
195
+ ```
196
+ ## [1] 60 3
197
+ ```
198
+
199
+ This dataset has 60 rows, one for each subject and 3 columns, as we have already seen.
200
+
201
+ Note that we do not call 'puts' when using the 'str' function. This function does not
202
+ return anything and prints the structure of the dataset as a side effect.
203
+
204
+
205
+ ```ruby
206
+ @tooth_growth.str
207
+ ```
208
+
209
+ ```
210
+ ## 'data.frame': 60 obs. of 3 variables:
211
+ ## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
212
+ ## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
213
+ ## $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
214
+ ```
215
+ Observe that both variables 'supp' and 'dose' are factors. The system made variable 'supp'
216
+ a factor automatically, since it contains two strings OJ and VC.
217
+
218
+ Finally, using the summary method, we get the statistical summary for the dataset:
219
+
220
+
221
+ ```ruby
222
+ puts @tooth_growth.summary
223
+ ```
224
+
225
+ ```
226
+ ## len supp dose
227
+ ## Min. : 4.20 OJ:30 0.5:20
228
+ ## 1st Qu.:13.07 VC:30 1 :20
229
+ ## Median :19.25 2 :20
230
+ ## Mean :18.81
231
+ ## 3rd Qu.:25.27
232
+ ## Max. :33.90
233
+ ```
234
+
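To make the 'len' column of the summary more concrete, the sketch below computes quantiles in plain Ruby for the six 'len' values printed by 'head' earlier. It assumes linear interpolation between order statistics, which to the best of our knowledge is R's default quantile method (type 7). The `quantile` helper is hypothetical and not part of Galaaz, and the results apply to these six values only, not to the full dataset.

```ruby
# Quantile by linear interpolation between order statistics.
# Assumption: this mirrors R's default quantile method (type 7).
def quantile(values, prob)
  sorted = values.sort
  position = (sorted.length - 1) * prob
  lower = position.floor
  upper = position.ceil
  weight = position - lower
  sorted[lower] + weight * (sorted[upper] - sorted[lower])
end

# First six 'len' values of the ToothGrowth dataset, as printed above.
lengths = [4.2, 11.5, 7.3, 5.8, 6.4, 10.0]

# Minimum, first quartile, median, third quartile and maximum.
[0.0, 0.25, 0.5, 0.75, 1.0].each do |prob|
  puts format("%4.2f quantile: %.3f", prob, quantile(lengths, prob))
end
```

The quartiles and the median are the same kind of statistics that the boxplots in the next section summarize graphically.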
235
+ # Doing the Data Analysis
236
+
237
+ ## Quick plot for seeing the data
238
+
239
+ Let's now create our first plot with the given data by accessing ggplot2 from Ruby. For Rubyists
240
+ who have never seen or used ggplot2, here is the description of ggplot2 found on its home page:
241
+
242
+ > "ggplot2 is a system for declaratively creating graphics, based on _The Grammar of Graphics_.
243
+ > You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical
244
+ > primitives to use, and it takes care of the details."
245
+
246
+ This description might be a bit cryptic and it is best to see it at work to understand it.
247
+ Basically, in the _grammar of graphics_ developers add layers of components such as grid,
248
+ axis, data, title, subtitle and also graphical primitives such as _bar plot_, _box plot_,
249
+ to form the final graphics.
250
+
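The idea of adding layers can be sketched in plain Ruby with an overloaded '+' operator. This is a minimal illustration under stated assumptions: `PlotSpec` is a hypothetical class, layers are plain strings, and nothing here reflects how ggplot2 or Galaaz are implemented.

```ruby
# Hypothetical miniature of a layered plot specification. It is NOT how
# ggplot2 or Galaaz work internally; it only shows the composition idea.
class PlotSpec
  attr_reader :layers

  def initialize(layers = [])
    @layers = layers.freeze
  end

  # Adding a layer returns a new specification and leaves the receiver alone.
  def +(layer)
    PlotSpec.new(layers + [layer])
  end

  def to_s
    layers.join(" + ")
  end
end

base = PlotSpec.new(["data", "aes(x, y)"])
boxplot = base + "geom_boxplot"
with_title = boxplot + "labs(title)"

puts base        # data + aes(x, y)
puts boxplot     # data + aes(x, y) + geom_boxplot
puts with_title  # data + aes(x, y) + geom_boxplot + labs(title)
```

Because each addition produces a new object in this sketch, a base specification can be reused for several variants of a plot.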
251
+ In order to make a plot, we apply the 'ggplot' function to the dataset. In R, this would be
252
+ written as ```ggplot(<dataset>, ...)```. In Galaaz, use either ```R.ggplot(<dataset>, ...)```,
253
+ or ```<dataset>.ggplot(...)```. In the graph specification below, we use the second notation,
254
+ which looks more Ruby-like. The plot specifies 'dose' on the $x$ axis and 'len' on
255
+ the $y$ axis with the 'aes' method: 'E.aes(x: :dose, y: :len)'. To specify the type of plot to
256
+ create, add a geom to the plot. For a boxplot, the geom is 'R.geom_boxplot'.
257
+
258
+ Note also that we have a call to 'R.png' before plotting and 'R.dev__off' after the print
259
+ statement. 'R.png' opens a 'png' device for outputting the plot. 'R.dev__off'
260
+ closes the device and creates the 'png' file. If we do not pass a name to the 'png' function, the
261
+ image gets a default name of 'Rplot\<nnn\>' where \<nnn\> is the number of the plot. We can
262
+ then include the generated 'png' file in the document by adding an rmarkdown directive.
263
+
264
+
265
+ ```ruby
266
+ require 'ggplot'
267
+
268
+ R.png("figures/dose_len.png")
269
+
270
+ e = @tooth_growth.ggplot(E.aes(x: :dose, y: :len))
271
+ print e + R.geom_boxplot
272
+
273
+ R.dev__off
274
+ ```
275
+
276
+ [//]: # (Including the 'png' file generated above. In future releases)
277
+ [//]: # (of gKnit, the figures should be automatically saved and the name)
278
+ [//]: # (taken from the chunk 'label' and possibly chunk parameters)
279
+
280
+ ![](figures/dose_len.png)
281
+
282
+ Great! We've just managed to create and save our first plot in Ruby with only
283
+ four lines of code. We can see with this plot a clear trend: as the dose of the supplement
284
+ is increased, so is the length of teeth.
285
+
286
+ ## Faceting the plot
287
+
288
+ This first plot shows a trend, but our data has information about two different forms
289
+ of delivery method, either orange juice (OJ) or ascorbic acid (VC).
290
+ Let's then try to create a plot that makes explicit the effect of each delivery method. This next
291
+ plot is a _faceted_ plot where each delivery method gets its own plot.
292
+ On the left side, the plot shows the OJ delivery method. On the right side, we see the
293
+ VC delivery method. To obtain this plot, we use the 'R.facet_grid' function, that
294
+ automatically creates the facets based on the delivery method factors. The parameter to
295
+ the 'facet_grid' method is a [_formula_](https://thomasleeper.com/Rcourse/Tutorials/formulae.html).
296
+
297
+ In Galaaz, formulas are written a bit differently than in R. The following changes are
298
+ necessary:
299
+
300
+ * R symbols are represented by the same Ruby symbol prefixed with the '+' method. The
301
+ symbol ```x``` in R becomes ```+:x``` in Ruby;
302
+ * The '~' operator in R becomes '=~' in Ruby. The formula ```x ~ y``` in R is written as
303
+ ```+:x =~ +:y``` in Ruby;
304
+ * The '.' symbol in R becomes '+:all'
305
+
306
+ Another way of writing a formula is to use the 'formula' function with the actual formula as
307
+ a string. The formula ```x ~ y``` in R can be written as ```R.formula("x ~ y")```. For more
308
+ complex formulas, the use of the 'formula' function is preferred.
309
+
310
+ The formula ```+:all =~ +:supp``` indicates to the 'facet_grid' function that it needs to
311
+ facet the plot based on the ```supp``` variable and split the plot vertically. Changing
312
+ the formula to ```+:supp =~ +:all``` would split the plot horizontally.
313
+
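To show that this notation is ordinary Ruby operator overloading, here is a minimal sketch that turns the Ruby spelling back into the text of an R formula. Everything in it is an assumption made for illustration: `FormulaTerm` is a hypothetical class, and reopening `Symbol` is done only to keep the example short. It is not Galaaz's implementation.

```ruby
# Hypothetical sketch, NOT Galaaz's implementation: it only shows how Ruby
# operators can be overloaded to spell an R formula.
class FormulaTerm
  attr_reader :text

  def initialize(text)
    @text = text
  end

  # term =~ term builds the text of the R formula "lhs ~ rhs".
  def =~(other)
    "#{text} ~ #{other.text}"
  end
end

# Reopening Symbol is done here only to keep the sketch short.
class Symbol
  # Unary plus turns a symbol into a term; :all stands for R's '.'.
  def +@
    FormulaTerm.new(self == :all ? "." : to_s)
  end
end

puts(+:all =~ +:supp)  # . ~ supp
puts(+:supp =~ +:all)  # supp ~ .
puts(+:x =~ +:y)       # x ~ y
```

The resulting strings have the same form as the argument of 'R.formula' mentioned above.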
314
+
315
+ ```ruby
316
+ R.png("figures/facet_by_delivery.png")
317
+
318
+ @base_tooth = @tooth_growth.ggplot(E.aes(x: :dose, y: :len, group: :dose))
319
+
320
+ @bp = @base_tooth + R.geom_boxplot +
321
+ # Split in vertical direction
322
+ R.facet_grid(+:all =~ +:supp)
323
+
324
+ puts @bp
325
+
326
+ R.dev__off
327
+ ```
328
+
329
+ ![](figures/facet_by_delivery.png)
330
+
331
+ It now becomes clear that although both methods of delivery have a direct
332
+ impact on tooth growth, the (OJ) method is non-linear, having a higher impact with smaller
333
+ doses of ascorbic acid and reducing its impact as the dose increases. With the
334
+ (VC) approach, the impact seems to be more linear.
335
+
336
+ ## Adding Color
337
+
338
+ If this paper were about data analysis, we would analyze the trends more carefully and
339
+ improve the statistical analysis. But we are interested in working with ggplot
340
+ in Ruby. So, let's add some color to this plot to make the trend and comparison more
341
+ visible. In the following plot, the boxes are color coded by dose. To add color, it is
342
+ enough to add ```fill: :dose``` to the aesthetic of boxplot. With this command each 'dose'
343
+ factor gets its own color.
344
+
345
+
346
+ ```ruby
347
+ R.png("figures/facets_by_delivery_color.png")
348
+
349
+ @bp = @bp + R.geom_boxplot(E.aes(fill: :dose))
350
+ puts @bp
351
+
352
+ R.dev__off
353
+ ```
354
+
355
+ ![](figures/facets_by_delivery_color.png)
356
+
357
+ Faceting helps us compare the general trends in the (OJ) and (VC) delivery methods.
358
+ Adding color allows us to compare specifically how each dosage impacts tooth growth.
359
+ It is possible to observe that with smaller doses, up to 1mg, (OJ) performs better
360
+ than (VC) (red color). For 2mg, both (OJ) and (VC) have the same median, but (OJ) is
361
+ less dispersed (blue color).
362
+ For 1mg (green color), (OJ) is significantly better than (VC). By this very quick analysis,
363
+ it seems that (OJ) is a better delivery method than (VC).
364
+
365
+ ## Clarifying the data
366
+
367
+ Boxplots give us a nice idea of the distribution of data, but looking at those plots with
368
+ large colored boxes leaves us wondering what is going on in those boxes. According to
369
+ Edward Tufte in Envisioning Information:
370
+
371
+ > Thin data rightly prompts suspicions: "What are they leaving out? Is that really everything
372
+ > they know? What are they hiding? Is that all they did?" Now and then it is claimed
373
+ > that vacant space is "friendly" (anthropomorphizing an inherently murky idea) but
374
+ > _it is not how much empty space there is, but rather how it is used. It is not how much
375
+ > information there is, but rather how effectively it is arranged._
376
+
377
+ And he states:
378
+
379
+ > A most unconventional design strategy is revealed: _to clarify, add detail._
380
+
381
+ Let's then use this wisdom and add yet another layer of data to our plot, so that we clarify
382
+ it with detail and do not leave large empty boxes. In this next plot, we add data points for
383
+ each of the 60 pigs in the experiment. For that, add the function 'R.geom_point' to the
384
+ plot.
385
+
386
+
387
+ ```ruby
388
+ R.png("figures/facets_with_points.png")
389
+
390
+ # Add one point for each observation
391
+ @bp = @bp + R.geom_point
392
+
393
+ puts @bp
394
+
395
+ R.dev__off
396
+ ```
397
+
398
+ ![](figures/facets_with_points.png)
399
+
400
+ Now we can see the actual distribution of all 60 subjects. Actually, this is not
401
+ totally true. We have a hard time seeing all 60 subjects. It seems that some points
402
+ might be placed one over the other hiding useful information.
403
+
404
+ But no sweat! Another layer might solve the problem. In the following plot a new layer
405
+ called 'geom_jitter' is added to the plot. This adds randomness to the position of
406
+ the points, making it easier to see all of them and preventing data hiding. We also add
407
+ color and change the shape of the points, making them even easier to see.
408
+
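What jittering does can be sketched in plain Ruby. This is an illustration under stated assumptions, not ggplot2's code: the hypothetical `jitter` helper shifts each value by a random amount drawn uniformly from [-width, width], and it handles a single coordinate only.

```ruby
# Hypothetical helper, NOT ggplot2's code: shift each value by a random
# amount drawn uniformly from [-width, width].
def jitter(values, width: 0.4, rng: Random.new)
  values.map { |value| value + rng.rand(-width..width) }
end

# Five subjects that share the same position would be drawn on top of
# each other.
positions = [1.0, 1.0, 1.0, 1.0, 1.0]

# A seeded generator makes the sketch reproducible.
jittered = jitter(positions, width: 0.4, rng: Random.new(42))

puts "distinct positions before: #{positions.uniq.length}"
puts "distinct positions after:  #{jittered.uniq.length}"
```

The shifts are small relative to the spacing between groups, so each point stays visually attached to its own box while no longer hiding its neighbors.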
409
+
410
+ ```ruby
411
+ R.png("figures/facets_with_jitter.png")
412
+
413
+ # Add jittered points on top of the boxes
414
+ puts @bp + R.geom_jitter(shape: 23, color: "cyan3", size: 1)
415
+
416
+ R.dev__off
417
+ ```
418
+
419
+ ![](figures/facets_with_jitter.png)
420
+
421
+ Now we can see all 60 points in the graph. We have here a much higher information density
422
+ and we can see outliers and the distribution of subjects.
423
+
424
+ # Preparing the Plot for Presentation
425
+
426
+ We have come a long way since our first plot. As was already said, this is not
427
+ an article about data analysis and the focus is on the
428
+ integration of Ruby and ggplot. So, let's assume that the analysis is now done. Yet,
429
+ ending the analysis does not mean that the work is done. On the contrary, the hardest
430
+ part is yet to come!
431
+
432
+ After the analysis it is necessary to communicate it by making a final plot for
433
+ presentation. The last plot has all the information we want to share, but it is not very
434
+ pleasing to the eye.
435
+
436
+ ## Improving Colors
437
+
438
+ Let's start by trying to improve colors. For now, we will not use the jitter layer.
439
+ The previous plot has three bright colors that have no relationship between them. Is
440
+ there any obvious, or non-obvious for that matter, interpretation for the colors?
441
+ Clearly, they are just random colors selected automatically by our software. Although
442
+ those colors helped us understand the data, for a final presentation random colors
443
+ can distract the viewer.
444
+
445
+ In the following plot we use the function 'scale_fill_manual' to change
446
+ the colors of the boxes and order of labels. For colors we use shades of blue for
447
+ each dosage, with light blue ('cyan')
448
+ representing the lower dose and deep blue ('deepskyblue4') the higher dose. Also
449
+ the smaller value (0.5) is on
450
+ the bottom of the labels and (2) at the top. This ordering seems more natural and
451
+ matches with the actual order of the colors in the plot.
452
+
453
+
454
+ ```ruby
455
+ R.png("figures/facets_by_delivery_color2.png")
456
+
457
+ @bp = @bp +
458
+ R.scale_fill_manual(values: R.c("cyan", "deepskyblue", "deepskyblue4"),
459
+ breaks: R.c("2","1","0.5"))
460
+
461
+ puts @bp
462
+
463
+ R.dev__off
464
+ ```
465
+
466
+ ![](figures/facets_by_delivery_color2.png)
467
+
468
+
469
+ ## Violin Plot and Jitter
470
+
471
+ The boxplot with jitter did look a bit overwhelming. The next plot uses a variation of
472
+ a boxplot known as a _violin plot_ with jittered data.
473
+
474
+ [From Wikipedia](https://en.wikipedia.org/wiki/Violin_plot)
475
+
476
+
477
+ > A violin plot is a method of plotting numeric data. It is similar to a box plot with
478
+ > a rotated kernel density plot on each side.
479
+ >
480
+ > A violin plot has four layers. The outer shape represents all possible results, with
481
+ > thickness indicating how common. (Thus the thickest section represents the mode average.)
482
+ > The next layer inside represents the values that occur 95% of the time.
483
+ > The next layer (if it exists) inside represents the values that occur 50% of the time.
484
+ > The central dot represents the median average value.
485
+
486
+
487
+ ```ruby
488
+ R.png("figures/violin_with_jitter.png")
489
+
490
+ @violin = @base_tooth + R.geom_violin(E.aes(fill: :dose)) +
491
+ R.facet_grid(+:all =~ +:supp) +
492
+ R.geom_jitter(shape: 23, color: "cyan3", size: 1) +
493
+ R.scale_fill_manual(values: R.c("cyan", "deepskyblue", "deepskyblue4"),
494
+ breaks: R.c("2","1","0.5"))
495
+
496
+ puts @violin
497
+
498
+ R.dev__off
499
+ ```
500
+
501
+ ![](figures/violin_with_jitter.png)
502
+
503
+ This plot is an alternative to the original boxplot. For the final presentation, it is
504
+ important to think which graphics will be best understood by our audience. A violin plot
505
+ is a lesser-known plot and could add mental overhead, yet, in my opinion, it does look a little
506
+ bit better than the boxplot and provides even more information than the boxplot with jitter.
507
+
508
+ ## Adding Decoration
509
+
510
+ Our final plot is starting to take shape, but a presentation plot should have at least a
511
+ title, labels on the axis and maybe some other decorations. Let's start adding those.
512
+ Since decoration requires more graph area, this new plot has a 'width' and 'height'
513
+ specification. When there is no specification, the default values for width and height are
514
+ 480.
515
+
516
+ The 'labs' function adds the required decorations. In this example we use 'title', 'subtitle',
517
+ 'x' for the $x$ axis label and 'y', for the $y$ axis label, and 'caption' for information
518
+ about the plot.
519
+
520
+
521
+ ```ruby
522
+ R.png("figures/facets_with_decorations.png", width: 540, height: 560)
523
+
524
+ caption = <<-EOT
525
+ Length of odontoblasts in 60 guinea pigs.
526
+ Each animal received one of three dose levels of vitamin C.
527
+ EOT
528
+
529
+ @decorations =
530
+ R.labs(title: "Tooth Growth: Length by Dose",
531
+ subtitle: "Faceted by delivery method, (OJ) or (VC)",
532
+ x: "Dose (mg)", y: "Teeth length",
533
+ caption: caption)
534
+
535
+ puts @bp + @decorations
536
+
537
+ R.dev__off
538
+ ```
539
+
540
+ ![](figures/facets_with_decorations.png)
541
+
542
+
543
+ ## The Corp Theme
544
+
545
+ We are almost done. But the plot does not yet look nice to the eye. We are still distracted
546
+ by many aspects of the graph. First, the black font color does not look good. Then, the
547
+ plot background, borders, and grids all add clutter to the plot.
548
+
549
+ We will now define our corporate theme. In this theme, we remove borders and grids. The
550
+ background is left for faceted plots but removed for non-faceted plots. Font colors are
551
+ a shade of blue (color: '#000080'). Axis labels are moved near the end of the axis and
552
+ written in 'bold'.
553
+
554
+
555
+ ```ruby
556
+ module CorpTheme
557
+
558
+ R.install_and_loads 'RColorBrewer'
559
+
560
+ #---------------------------------------------------------------------------------
561
+ # face can be (1=plain, 2=bold, 3=italic, 4=bold-italic)
562
+ #---------------------------------------------------------------------------------
563
+
564
+ def self.text_element(size, face: "plain", hjust: nil)
565
+ E.element_text(color: "#000080",
566
+ face: face,
567
+ size: size,
568
+ hjust: hjust)
569
+ end
570
+
571
+ #---------------------------------------------------------------------------------
572
+ # Defines the plot theme (visualization). In this theme we remove major and minor
573
+ # grids, borders and background. We also turn-off scientific notation.
574
+ #---------------------------------------------------------------------------------
575
+
576
+ def self.global_theme(faceted: false)
577
+
578
+ R.options(scipen: 999) # turn-off scientific notation like 1e+48
579
+ # R.theme_set(R.theme_bw)
580
+
581
+ # remove major grids
582
+ gb = R.theme(panel__grid__major: E.element_blank())
583
+ # remove minor grids
584
+ gb = gb + R.theme(panel__grid__minor: E.element_blank)
585
+ # gb = R.theme(panel__grid__minor: E.element_blank)
586
+ # remove border
587
+ gb = gb + R.theme(panel__border: E.element_blank)
588
+ # remove background. When working with faceted graphs, the background makes
589
+ # it easier to see each facet, so leave it
590
+ gb = gb + R.theme(panel__background: E.element_blank) if !faceted
591
+ # Change axis font
592
+ gb = gb + R.theme(axis__text: text_element(8))
593
+ # change axis title font
594
+ gb = gb + R.theme(axis__title: text_element(10, face: "bold", hjust: 1))
595
+ # change font of title
596
+ gb = gb + R.theme(title: text_element(12, face: "bold"))
597
+ # change font of subtitle
598
+ gb = gb + R.theme(plot__subtitle: text_element(9))
599
+ # change font of captions
600
+ gb = gb + R.theme(plot__caption: text_element(8))
601
+
602
+ end
603
+
604
+ end
605
+ ```
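The method above builds the theme by repeating `gb = gb + R.theme(...)`. The same accumulation can be expressed as a fold over a list of pieces. The plain-Ruby sketch below uses a hypothetical `ThemeSketch` class as a stand-in for the R theme object. It only models the idea that settings accumulate and that a later setting wins on a clash, and it is not how Galaaz or ggplot2 implement themes.

```ruby
# Hypothetical stand-in for a theme object: a bag of settings combined with +.
class ThemeSketch
  attr_reader :settings

  def initialize(settings = {})
    @settings = settings
  end

  # Combining two themes keeps every setting; on a clash the right-hand side wins.
  def +(other)
    ThemeSketch.new(settings.merge(other.settings))
  end
end

pieces = [
  ThemeSketch.new(panel__grid__major: :blank),
  ThemeSketch.new(panel__grid__minor: :blank),
  ThemeSketch.new(panel__border: :blank),
  ThemeSketch.new(axis__text: { size: 8, color: "#000080" })
]

# Same effect as writing gb = gb + piece once per piece.
theme = pieces.reduce(:+)
puts theme.settings.keys.inspect
```

Keeping the pieces in a list makes it easy to add or drop one setting, for example leaving out the background piece for faceted plots.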
606
+
607
+ ## Final Box Plot
608
+
609
+ Here is our final boxplot, without jitter.
610
+
611
+
612
+ ```ruby
613
+ R.png("figures/final_box_plot.png", width: 540, height: 560)
614
+
615
+ puts @bp + @decorations + CorpTheme.global_theme(faceted: true)
616
+
617
+ R.dev__off
618
+ ```
619
+
620
+ ![](figures/final_box_plot.png)
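The call above passes `faceted: true`. Whether that value arrives as a boolean depends on how the receiving method declares its parameter. The plain-Ruby sketch below (hypothetical method names, no Galaaz required) contrasts an optional positional parameter with a keyword parameter.

```ruby
# Optional positional parameter: `faceted: false` arrives as the Hash
# {faceted: false}, and any Hash is truthy in Ruby.
def positional_theme(faceted = false)
  faceted ? :keep_background : :remove_background
end

# Keyword parameter: `faceted: false` arrives as the boolean false.
def keyword_theme(faceted: false)
  faceted ? :keep_background : :remove_background
end

puts positional_theme(faceted: false)  # prints keep_background
puts keyword_theme(faceted: false)     # prints remove_background
```

With a keyword parameter, `faceted: true` and `faceted: false` behave as the call site suggests. With a positional parameter, both calls take the "faceted" branch.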
621
+
622
+ ## Final Violin Plot
623
+
624
+ Here is the final violin plot, with jitter and the same look and feel as the corporate
625
+ boxplot.
626
+
627
+
628
+ ```ruby
629
+ R.png("figures/final_violin_plot.png", width: 540, height: 560)
630
+
631
+ puts @violin + @decorations + CorpTheme.global_theme(faceted: true)
632
+
633
+ R.dev__off
634
+ ```
635
+
636
+ ![](figures/final_violin_plot.png)
637
+
638
+ ## Another View
639
+
640
+ Finally, here is one last plot, with the same look and feel as before but faceted by
641
+ dose and not by supplement.
642
+
643
+
644
+ ```ruby
645
+ R.png("figures/facet_by_dose.png", width: 540, height: 560)
646
+
647
+ caption = <<-EOT
648
+ Length of odontoblasts in 60 guinea pigs.
649
+ Each animal received one of three dose levels of vitamin C.
650
+ EOT
651
+
652
+ @bp = @tooth_growth.ggplot(E.aes(x: :supp, y: :len, group: :supp)) +
653
+ R.geom_boxplot(E.aes(fill: :supp)) + R.facet_grid(+:all =~ +:dose) +
654
+ R.scale_fill_manual(values: R.c("cyan", "deepskyblue4")) +
655
+ R.labs(title: "Tooth Growth: Length by Dose",
656
+ subtitle: "Faceted by dose",
657
+ x: "Delivery method", y: "Teeth length",
658
+ caption: caption) +
659
+ CorpTheme.global_theme(faceted: true)
660
+ puts @bp
661
+
662
+ R.dev__off
663
+ ```
664
+
665
+ ![](figures/facet_by_dose.png)
666
+
667
+ # Conclusion
668
+
669
+ Galaaz couples Ruby and R so tightly that Ruby developers do not need to be aware
670
+ of the executing R engine. For the Ruby developer, the existence of R
671
+ is of no consequence: she is just coding in Ruby. On the other hand, for the R
672
+ developer, migration to Ruby is a matter of small syntactic changes and a very gentle
673
+ learning curve. As the R developer becomes more proficient in Ruby, he can start using
674
+ 'classes', 'modules', 'procs' and 'lambdas'.
675
+
676
+ This coupling shows the power of the GraalVM and Truffle polyglot environment. Trying to
677
+ bring the power of R to Ruby starting from scratch would be an enormous endeavour and would
678
+ probably never be accomplished. Today's data scientists would certainly stick with either
679
+ Python or R. Now, both the Ruby and R communities might benefit from this marriage. Also,
680
+ the process used to couple Ruby and R can also be used to couple Ruby and JavaScript and
681
+ maybe also Ruby and Python. In a polyglot world a *uniglot* language might be extremely
682
+ relevant.
683
+
684
+ From the perspective of performance, GraalVM and Truffle promise speedups that could
685
+ exceed 10 times, both for [FastR](https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb)
686
+ and for [TruffleRuby](https://rubykaigi.org/2018/presentations/eregontp.html).
687
+
688
+ This article has shown how to improve a plot step by step. Starting from a very simple
689
+ boxplot with all default configurations, we moved slowly to our final plot. The important
690
+ point here is not whether the final plot is actually beautiful, but that there is a process
691
+ of small, incremental improvements that can be followed to get a final plot ready for
692
+ presentation.
693
+
694
+ Finally, this whole article was written in R Markdown and compiled to HTML by _gknit_, an
695
+ application that wraps _knitr_ and allows documenting Ruby code. This application can
696
+ be of great help for any Rubyist trying to write articles, blogs or documentation for Ruby.
697
+
698
+
699
+ # Installing Galaaz
700
+
701
+ ## Prerequisites
702
+
703
+ * GraalVM (>= rc8): https://github.com/oracle/graal/releases
704
+ * TruffleRuby
705
+ * FastR
706
+
707
+ The following R packages will be installed automatically when necessary, but can be installed before
708
+ using gKnit if desired:
709
+
710
+ * ggplot2
711
+ * gridExtra
712
+ * knitr
713
+
714
+ Installation of R packages requires a development environment and can be time-consuming. On Linux,
715
+ the GNU compiler and tools should be enough. I am not sure what is needed on the Mac.
716
+
717
+ ## Preparation
718
+
719
+ * gem install galaaz
720
+
721
+ ## Usage
722
+
723
+ * gknit <filename>
724
+ * In a script, add: require 'galaaz'
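A minimal sketch of the second usage form, assuming Galaaz is installed and the script runs under GraalVM with TruffleRuby and FastR. The `galaaz_loaded?` helper and the fallback branch are additions made for this example so that the script fails gracefully elsewhere. The only Galaaz call used is `R.c`, which already appears in the plots above.

```ruby
# Hypothetical helper: returns true when the galaaz gem can be loaded,
# false when it is not installed on the running Ruby.
def galaaz_loaded?
  require 'galaaz'
  true
rescue LoadError
  false
end

if galaaz_loaded?
  # R.c builds an R vector, as in the scale_fill_manual call earlier.
  colors = R.c("cyan", "deepskyblue4")
  puts colors
else
  warn "galaaz is not available; run this script under GraalVM with TruffleRuby"
end
```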
725
+
726
+
727
+ And now that you’ve read this far, here’s how to submit your story to the freeCodeCamp
728
+ publication: send an email to submit at freecodecamp org. Include the URL for your story on
729
+ Medium (preferably an unpublished draft) and the word “bananas” so that we’ll know that you
730
+ have read all this. Only send one story URL per email. There’s no need to add anything
731
+ further to your email — we just read the stories and judge them based on their own merits.