galaaz 0.4.2 → 0.4.5

Files changed (114)
  1. checksums.yaml +4 -4
  2. data/LICENSE +25 -0
  3. data/Rakefile +8 -0
  4. data/bin/gknit +9 -5
  5. data/bin/gstudio +4 -2
  6. data/bin/gstudio.rb +32 -2
  7. data/blogs/dev/dev.html +219 -34
  8. data/blogs/dev/dev.md +26 -26
  9. data/blogs/dev/dev_files/figure-html/bubble-1.png +0 -0
  10. data/blogs/dev/dev_files/figure-html/diverging_bar.png +0 -0
  11. data/blogs/dplyr/dplyr.rb +63 -0
  12. data/blogs/galaaz_ggplot/galaaz_ggplot.Rmd +38 -26
  13. data/blogs/galaaz_ggplot/galaaz_ggplot.aux +16 -17
  14. data/blogs/galaaz_ggplot/galaaz_ggplot.pdf +0 -0
  15. data/blogs/galaaz_ggplot/galaaz_ggplot.tex +65 -31
  16. data/blogs/oh_my/not_so.rb +2342 -0
  17. data/blogs/oh_my/oh_my.Rmd +493 -0
  18. data/blogs/oh_my/oh_my.html +680 -0
  19. data/blogs/oh_my/oh_my.md +597 -0
  20. data/blogs/oh_my/old.Rmd +2100 -0
  21. data/blogs/ruby_plot/figures/facets_with_decorations.png +0 -0
  22. data/blogs/ruby_plot/figures/facets_with_jitter.png +0 -0
  23. data/blogs/ruby_plot/figures/final_box_plot.png +0 -0
  24. data/blogs/ruby_plot/figures/final_violin_plot.png +0 -0
  25. data/blogs/ruby_plot/figures/violin_with_jitter.png +0 -0
  26. data/blogs/ruby_plot/ruby_plot.Rmd +147 -122
  27. data/blogs/ruby_plot/ruby_plot.Rmd_external_figs +662 -0
  28. data/blogs/ruby_plot/ruby_plot.html +49 -54
  29. data/blogs/ruby_plot/ruby_plot.md +147 -122
  30. data/blogs/ruby_plot/ruby_plot.pdf +0 -0
  31. data/blogs/ruby_plot/ruby_plot.tex +776 -157
  32. data/blogs/ruby_plot/ruby_plot_files/figure-html/dose_len.svg +57 -0
  33. data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_delivery.svg +106 -0
  34. data/blogs/ruby_plot/ruby_plot_files/figure-html/facet_by_dose.svg +110 -0
  35. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color.svg +174 -0
  36. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_by_delivery_color2.svg +236 -0
  37. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_decorations.png +0 -0
  38. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_jitter.svg +296 -0
  39. data/blogs/ruby_plot/ruby_plot_files/figure-html/facets_with_points.svg +236 -0
  40. data/blogs/ruby_plot/ruby_plot_files/figure-html/final_box_plot.svg +218 -0
  41. data/blogs/ruby_plot/ruby_plot_files/figure-html/final_violin_plot.svg +128 -0
  42. data/blogs/ruby_plot/ruby_plot_files/figure-html/violin_with_jitter.svg +150 -0
  43. data/examples/islr/ch2.spec.rb +21 -18
  44. data/examples/islr/ch3_boston.rb +14 -5
  45. data/examples/islr/ch3_multiple_regression.rb +2 -3
  46. data/examples/islr/ch6.spec.rb +1 -1
  47. data/examples/islr/x_y_rnorm.jpg +0 -0
  48. data/lib/R_interface/r.rb +14 -10
  49. data/lib/R_interface/r_libs.R +9 -0
  50. data/lib/R_interface/r_methods.rb +77 -6
  51. data/lib/R_interface/{expression.rb → r_module_s.rb} +13 -14
  52. data/lib/R_interface/rbinary_operators.rb +58 -71
  53. data/lib/R_interface/rdata_frame.rb +2 -1
  54. data/lib/R_interface/rdevices.R +4 -0
  55. data/lib/R_interface/rdevices.rb +1 -1
  56. data/lib/R_interface/renvironment.rb +34 -1
  57. data/lib/R_interface/rexpression.rb +108 -2
  58. data/lib/R_interface/rindexed_object.rb +3 -1
  59. data/lib/R_interface/rlanguage.rb +18 -2
  60. data/lib/R_interface/rmatrix.rb +14 -0
  61. data/lib/R_interface/rmd_indexed_object.rb +5 -1
  62. data/lib/R_interface/robject.rb +61 -23
  63. data/lib/R_interface/rsupport.rb +111 -53
  64. data/lib/R_interface/rsymbol.rb +6 -5
  65. data/lib/R_interface/ruby_extensions.rb +130 -4
  66. data/lib/R_interface/runary_operators.rb +35 -3
  67. data/lib/R_interface/rvector.rb +1 -0
  68. data/lib/galaaz.rb +0 -2
  69. data/lib/gknit/knitr_engine.rb +58 -4
  70. data/lib/gknit/ruby_engine.rb +5 -6
  71. data/lib/util/exec_ruby.rb +55 -9
  72. data/specs/all.rb +13 -3
  73. data/specs/figures/dose_len.png +0 -0
  74. data/specs/r_dataframe.spec.rb +49 -26
  75. data/specs/r_environment.spec.rb +140 -0
  76. data/specs/r_eval.spec.rb +0 -15
  77. data/specs/r_formula.spec.rb +232 -0
  78. data/specs/r_function.spec.rb +7 -8
  79. data/specs/r_list.spec.rb +4 -0
  80. data/specs/r_list_apply.spec.rb +11 -11
  81. data/specs/r_matrix.spec.rb +3 -3
  82. data/specs/{r_plots.spec.rb~ → r_nse.spec.rb} +29 -6
  83. data/specs/r_vector_creation.spec.rb +6 -0
  84. data/specs/r_vector_object.spec.rb +2 -2
  85. data/specs/r_vector_operators.spec.rb +3 -3
  86. data/specs/r_vector_subsetting.spec.rb +4 -4
  87. data/specs/ruby_expression.spec.rb +324 -0
  88. data/specs/tmp.rb +12 -524
  89. data/sty/galaaz.sty +71 -0
  90. data/version.rb +1 -1
  91. metadata +31 -41
  92. data/bin/gknit2~ +0 -6
  93. data/bin/ogk~ +0 -4
  94. data/bin/prepareR.rb~ +0 -1
  95. data/blogs/dev/dev.Rmd~ +0 -104
  96. data/blogs/galaaz_ggplot/galaaz_ggplot.dvi +0 -0
  97. data/blogs/galaaz_ggplot/midwest_external_png~ +0 -1
  98. data/blogs/gknit/gknit.Rmd~ +0 -184
  99. data/blogs/gknit/gknit.Rnd~ +0 -17
  100. data/blogs/gknit/model.rb~ +0 -46
  101. data/blogs/ruby_plot/ruby_plot.Rmd~ +0 -215
  102. data/examples/islr/Figure.jpg +0 -0
  103. data/examples/misc/moneyball.rb~ +0 -16
  104. data/examples/misc/subsetting.rb~ +0 -372
  105. data/lib/R/eng_ruby.R~ +0 -63
  106. data/lib/R_interface/capture_plot.rb~ +0 -23
  107. data/lib/R_interface/r.rb~ +0 -121
  108. data/lib/R_interface/rdevices.rb~ +0 -27
  109. data/lib/gknit.rb~ +0 -26
  110. data/lib/gknit/knitr_engine.rb~ +0 -102
  111. data/lib/gknit/ruby_engine.rb~ +0 -72
  112. data/lib/util/inline_file.rb~ +0 -23
  113. data/r_requires/knitr.rb~ +0 -4
  114. data/specs/r_language.spec.rb +0 -157
@@ -0,0 +1,662 @@
1
+ ---
2
+ title: "How to make Beautiful Ruby Plots with Galaaz"
3
+ author:
4
+ - "Rodrigo Botafogo"
5
+ - "Daniel Mossé - University of Pittsburgh"
6
+ tags: [Tech, Data Science, Ruby, R, GraalVM]
7
+ date: "November 19th, 2018"
8
+ output:
9
+ html_document:
10
+ self_contained: true
11
+ keep_md: true
12
+ pdf_document:
13
+ includes:
14
+ in_header: "../../sty/galaaz.sty"
15
+ keep_tex: yes
16
+ number_sections: yes
17
+ toc: true
18
+ toc_depth: 2
19
+ fontsize: 11pt
20
+ ---
21
+
22
+ ```{r setup, echo=FALSE}
23
+ # set global chunk options. We want all figures to be 'svg'
24
+ knitr::opts_chunk$set(fig.width=7, fig.height=7, dev="svg")
25
+ ```
26
+
27
+ According to Wikipedia "Ruby is a dynamic, interpreted, reflective, object-oriented,
28
+ general-purpose programming language. It was designed and developed in the mid-1990s by Yukihiro
29
+ "Matz" Matsumoto in Japan." It reached high popularity with the development of Ruby on Rails
30
+ (RoR) by David Heinemeier Hansson. RoR is a web application framework first released
31
+ around 2005. It makes extensive use of Ruby's metaprogramming features. With RoR,
32
+ Ruby became very popular. According to [Ruby's Tiobe index](https://www.tiobe.com/tiobe-index/ruby/)
33
+ it peaked in popularity around 2008, then declined until 2015 when it started picking up again.
34
+ At the time of this writing (November 2018), the Tiobe index ranks Ruby as the 16th
34
+ most popular language.
36
+
37
+ Python, a language similar to Ruby, ranks 4th in the index. Java, C and C++ take the
38
+ first three positions. Ruby is often criticized for its focus on web applications.
39
+ But Ruby can do [much more](https://github.com/markets/awesome-ruby) than just web applications.
40
+ Yet, for scientific computing, Ruby lags way behind Python and R. Python has
41
+ the Django framework for the web, NumPy for numerical arrays, and Pandas for data analysis.
42
+ R is a free software environment for statistical computing and graphics with thousands
43
+ of libraries for data analysis.
44
+
45
+ Until recently, there was no realistic prospect for Ruby to bridge this gap.
46
+ Implementing a complete scientific computing infrastructure would take too long.
47
+ Enter [Oracle's GraalVM](https://www.graalvm.org/):
48
+
49
+ > GraalVM is a universal virtual machine for running applications written in
50
+ > JavaScript, Python 3, Ruby, R, JVM-based languages like Java, Scala, Kotlin,
51
+ > and LLVM-based languages such as C and C++.
52
+ >
53
+ > GraalVM removes the isolation between programming languages and enables
54
+ > interoperability in a shared runtime. It can run either standalone or in the
55
+ > context of OpenJDK, Node.js, Oracle Database, or MySQL.
56
+ >
57
+ > GraalVM allows you to write polyglot applications with a seamless way to pass
58
+ > values from one language to another. With GraalVM there is no copying or
59
+ > marshaling necessary as it is with other polyglot systems. This lets you
60
+ > achieve high performance when language boundaries are crossed. Most of the time
61
+ > there is no additional cost for crossing a language boundary at all.
62
+ >
63
+ > Often developers have to make uncomfortable compromises that require them
64
+ > to rewrite their software in other languages. For example:
65
+ >
66
+ > * That library is not available in my language. I need to rewrite it.
67
+ > * That language would be the perfect fit for my problem, but we cannot
68
+ > run it in our environment.
69
+ > * That problem is already solved in my language, but the language is
70
+ > too slow.
71
+ >
72
+ > With GraalVM we aim to allow developers to freely choose the right language for
73
+ > the task at hand without making compromises.
74
+
75
+ As stated above, GraalVM is a _universal_ virtual machine that allows Ruby and R (and other
76
+ languages) to run on the same environment. GraalVM allows polyglot applications to
77
+ _seamlessly_ interact with one another and pass values from one language to the other.
78
+ Although a great idea, GraalVM still requires application writers to know several languages.
79
+ To eliminate that requirement, we built Galaaz, a gem for Ruby, to tightly couple
80
+ Ruby and R and allow those languages to interact in a way that the user will be unaware
81
+ of such interaction. In other words, a Ruby programmer will be able to use all
82
+ the capabilities of R without knowing the R syntax.
83
+
84
+ Library wrapping is a usual way of bringing features from one language into another.
85
+ To improve performance, Python often wraps more efficient C libraries. For the
86
+ Python developer, the existence of such C libraries is hidden. The problem with
87
+ library wrapping is that for any new library, there is the need to handcraft a new
88
+ wrapper.
89
+
90
+ Galaaz, instead of wrapping a single C or R library, wraps the whole R language
91
+ in Ruby. As a result, all of the thousands of R libraries are available immediately
92
+ to Ruby developers without any new wrapping effort.
93
+
94
+ To show the power of Galaaz, we show in this article how Ruby can use R's ggplot2
95
+ library transparently, bringing to Ruby the power of high-quality scientific plotting.
96
+ We also show that migrating from R to Ruby with Galaaz is a matter of small
97
+ syntactic changes. By using Ruby, the R developer can use all of Ruby's powerful
98
+ object-oriented features. Also, with Ruby, it becomes much easier to move code
99
+ from the analysis phase to the production phase.
100
+
101
+ In this article we will explore the R ToothGrowth dataset. To illustrate, we will
102
+ create some boxplots. A primer on boxplots is available in
103
+ [this article](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51).
104
+
105
+ We will also create a Corporate Template ensuring that plots will have a consistent
106
+ visualization. This template is built using a Ruby module. There is a way of building
107
+ ggplot themes that will work the same as the Ruby module. Yet, writing a new theme
108
+ requires specific knowledge on theme writing. Ruby modules are standard to the
109
+ language and don't need special knowledge.
110
+
111
+ [Here](https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021) we show a scatter plot in Ruby also with Galaaz.
112
+
113
+ # gKnit
114
+
115
+ _Knitr_ is an application that converts text written in rmarkdown to many
116
+ different output formats. For instance, a writer can convert an rmarkdown document
117
+ to HTML, LaTeX, docx and many other formats. Rmarkdown documents can contain
118
+ text and _code chunks_. Knitr formats code chunks in a grayed box in the output document.
119
+ It also executes the code chunks and formats the output in a white box. Every line of
120
+ output from the execution code is preceded by '##'.
121
+
122
+ Knitr allows code chunks to be in R, Python,
123
+ Ruby and dozens of other languages. Yet, while R and Python chunks can share data, in other
124
+ languages, chunks are independent. This means that a variable defined in one chunk
125
+ cannot be used in another chunk.
126
+
127
+ With _gKnit_ Ruby code chunks can share data. In gKnit each
128
+ Ruby chunk executes in its own scope and thus, local variables defined in a chunk are
128
+ not accessible by other chunks. Yet, all chunks execute in the scope of a 'chunk'
129
+ class, and instance variables ('@') are available in all chunks.
131
+
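The scoping rule described above can be illustrated in plain Ruby. This is only a minimal sketch, assuming chunks are evaluated against one shared object; `ChunkScope` and its `run` method are hypothetical names for illustration and are not gKnit's actual implementation.

```ruby
# Hypothetical stand-in for the shared 'chunk' object; not gKnit's real class.
class ChunkScope
  # Evaluate one chunk of Ruby source with self set to this object.
  def run(source)
    instance_eval(source)
  end
end

scope = ChunkScope.new

# First chunk: defines one local variable and one instance variable.
scope.run("local_total = 40; @shared_total = local_total + 2")

# Second chunk: the instance variable is still there, the local is gone.
puts scope.run("@shared_total")                   # prints 42
puts scope.run("defined?(local_total)").inspect   # prints nil
```

Each call to `run` gets a fresh local scope, while instance variables live on the shared object and so persist between calls.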
132
+ # Exploring the Dataset
133
+
134
+ Let's start by exploring our selected dataset. ToothGrowth is an R dataset. A dataset
135
+ is like a simple Excel spreadsheet, in which each column has only one type of data.
136
+ For instance, one column can hold floats, another integers, and a third strings.
137
+ This dataset analyzes the length of odontoblasts (cells responsible for tooth growth)
138
+ in 60 guinea pigs, where each animal received one of three dose levels of Vitamin C
139
+ (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice OJ or ascorbic acid
140
+ (a form of vitamin C and coded as VC).
141
+
142
+ The ToothGrowth dataset contains three columns: 'len', 'supp' and 'dose'. Let's
143
+ take a look at a few rows of this dataset. In Galaaz, R variables are accessed
144
+ by using the corresponding Ruby symbol preceded by the tilde ('~') function. Note in the
145
+ following chunk that 'ToothGrowth' is the R variable and Ruby's '@tooth_growth' is
146
+ assigned the value of '~:ToothGrowth'.
147
+
148
+ ```{ruby head}
149
+ # Read the R ToothGrowth variable and assign it to the
150
+ # Ruby instance variable @tooth_growth that will be
151
+ # available to all Ruby chunks in this document.
152
+ @tooth_growth = ~:ToothGrowth
153
+ # print the first few elements of the dataset
154
+ puts @tooth_growth.head
155
+ ```
156
+
157
+ Great! We've managed to read the ToothGrowth dataset and take a look at its elements.
158
+ We see here the first 6 rows of the dataset. To access a column, follow the dataset name
159
+ with a dot ('.') and the name of the column. Also use dot notation to chain methods
160
+ in usual Ruby style.
161
+
162
+ ```{ruby dataset_columns}
163
+ # Access the tooth_growth 'len' column and print the first few
164
+ # elements of this column with the 'head' method.
165
+ puts @tooth_growth.len.head
166
+ ```
167
+
168
+ The 'dose' column contains numeric values of either 0.5, 1 or 2, although the
169
+ first 6 rows as seen above only contain the 0.5 values. Even though those are
170
+ numbers, they are better interpreted as a [factor or category](https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors/). So, let's convert our 'dose' column from numeric to 'factor'.
171
+ In R, the function 'as.factor' is used to convert data in a vector to factors. To use this
172
+ function from Galaaz, the dot ('.') in the function name is replaced by '__' (double underscore).
173
+ The function 'as.factor' becomes 'R.as__factor' or just 'as__factor' when chaining.
174
+
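To make the naming rule concrete, here is a minimal plain-Ruby sketch of how a double underscore in a method name could be mapped back to a dot. `RCallSketch` is a hypothetical class written for illustration; it only builds the text of an R call and is not how Galaaz dispatches to R.

```ruby
# Hypothetical recorder: builds the text of an R call and never runs R.
class RCallSketch
  # Translate a Ruby-safe name into the R name: '__' stands for '.'.
  def self.r_name(ruby_name)
    ruby_name.to_s.gsub("__", ".")
  end

  # Any unknown method is rendered as the R call it would correspond to.
  def method_missing(name, *args)
    "#{self.class.r_name(name)}(#{args.map(&:inspect).join(', ')})"
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

r = RCallSketch.new
puts r.as__factor(0.5)   # prints as.factor(0.5)
puts r.dev__off          # prints dev.off()
```

The dot cannot appear inside a Ruby method name, which is why a substitute such as '__' is needed in the first place.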
175
+ ```{ruby tooth_growth}
176
+ # convert the dose to a factor
177
+ @tooth_growth.dose = @tooth_growth.dose.as__factor
178
+ ```
179
+
180
+ Let's explore some more details of this dataset. In particular, let's look at its dimensions,
181
+ structure and summary statistics.
182
+
183
+ ```{ruby dim}
184
+ puts @tooth_growth.dim
185
+ ```
186
+
187
+ This dataset has 60 rows, one for each subject, and 3 columns, as we have already seen.
188
+
189
+ Note that we do not need to call 'puts' when using the 'str' function. This
190
+ function does not return anything and prints the structure of the dataset
191
+ as a side effect.
192
+
193
+ ```{ruby str}
194
+ @tooth_growth.str
195
+ ```
196
+ Observe that both variables 'supp' and 'dose' are factors. The system made variable 'supp'
197
+ a factor automatically, since it contains two strings, OJ and VC.
198
+
199
+ Finally, using the summary method, we get the statistical summary for the dataset:
200
+
201
+ ```{ruby summary}
202
+ puts @tooth_growth.summary
203
+ ```
204
+
205
+ # Doing the Data Analysis
206
+
207
+ ## Quick plot for seeing the data
208
+
209
+ Let's now create our first plot with the given data by accessing ggplot2 from Ruby.
210
+ For Rubyists that have never seen or used ggplot2, here is the description of ggplot
211
+ found on its home page:
212
+
213
+ > "ggplot2 is a system for declaratively creating graphics, based on _The Grammar of Graphics_.
214
+ > You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical
215
+ > primitives to use, and it takes care of the details."
216
+
217
+ This description might be a bit cryptic and it is best to see it at work to understand it.
218
+ Basically, in the _grammar of graphics_ developers add layers of components such as grid,
219
+ axis, data, title, subtitle and also graphical primitives such as _bar plot_, _box plot_,
220
+ to form the final graphics.
221
+
222
+ In order to make a plot, we apply the 'ggplot' function to the dataset. In R, this would be
223
+ written as ```ggplot(<dataset>, ...)```. Galaaz gives you the flexibility to use
224
+ either ```R.ggplot(<dataset>, ...)``` or ```<dataset>.ggplot(...)```. In the graph
225
+ specification below, we use the second notation
226
+ that looks more like Ruby. ggplot uses the ‘aes’ method to specify
227
+ x and y axes; in this case, the 'dose' on the $x$ axis and the 'len' on
228
+ the $y$ axis: 'E.aes(x: :dose, y: :len)'. To specify the type of plot add a geom to
229
+ the plot. For a boxplot, the geom is R.geom_boxplot.
230
+
231
+ Note also that we have a call to 'R.png' before plotting and 'R.dev__off' after the print
232
+ statement. 'R.png' opens a 'png device' for outputting the plot. If we do not pass a
233
+ name to the 'png' function, the
234
+ image gets a default name of 'Rplot\<nnn\>' where \<nnn\> is the number of the plot.
235
+ 'R.dev__off'
236
+ closes the device and creates the 'png' file. We can
237
+ then include the generated 'png' file in the document by adding an rmarkdown directive.
238
+
239
+ ```{ruby dose_len}
240
+ require 'ggplot'
241
+
242
+ e = @tooth_growth.ggplot(E.aes(x: :dose, y: :len))
243
+ print e + R.geom_boxplot
244
+ ```
245
+
246
+ [//]: # (Including the 'png' file generated above. In future releases)
247
+ [//]: # (of gKnit, the figures should be automatically saved and the name)
248
+ [//]: # (taken from the chunk 'label' and possibly chunk parameters)
249
+
250
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/96db2729e02ced0f9336216d87d14af141c1e81b/dose_len.png)
251
+
252
+ Great! We've just managed to create and save our first plot in Ruby with only
253
+ four lines of code. We can now easily see with this plot a clear trend: as the
254
+ dose of the supplement
255
+ is increased, so is the length of teeth.
256
+
257
+ ## Faceting the plot
258
+
259
+ This first plot shows a trend, but our data has information about two different forms
260
+ of delivery method, either by orange juice (OJ) or by ascorbic acid (VC).
261
+ Let's then try to create a plot that helps us discern the effect of each
262
+ delivery method. This next
263
+ plot is a _faceted_ plot where each delivery method gets its own plot.
264
+ On the left side, the plot shows the OJ delivery method. On the right side,
265
+ we see the VC delivery method. To obtain this plot, we use the
266
+ 'R.facet_grid' function, that
267
+ automatically creates the facets based on the delivery method factors. The parameter to
268
+ the 'facet_grid' method is a [_formula_](https://thomasleeper.com/Rcourse/Tutorials/formulae.html).
269
+
270
+ In Galaaz we give programmers the flexibility to use two different ways to write formulas.
271
+ In the first way, the following changes from writing formulas (for example 'x ~ y')
272
+ in R are necessary:
273
+
274
+ * R symbols are represented by the same Ruby symbol prefixed with the '+' method. The
275
+ symbol ```x``` in R becomes ```+:x``` in Ruby;
276
+ * The '~' operator in R becomes '=~' in Ruby. The formula ```x ~ y``` in R is written as
277
+ ```+:x =~ +:y``` in Ruby;
278
+ * The '.' symbol in R becomes '+:all'
279
+
280
+ Another way of writing a formula is to use the 'formula' function with the actual formula as
281
+ a string. The formula ```x ~ y``` in R can be written as ```R.formula("x ~ y")```. For more
282
+ complex formulas, the use of the 'formula' function is preferred.
283
+
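The operator-based notation can be mimicked in plain Ruby to show why ```+:x =~ +:y``` is a valid expression. The following is a rough sketch under stated assumptions: `FormulaTerm` is a hypothetical class, the `Symbol#+@` patch is added here purely for illustration, and the result is only the formula text, not an R formula object as Galaaz would produce.

```ruby
# Hypothetical value object holding the text of a formula fragment.
class FormulaTerm
  attr_reader :text

  def initialize(text)
    @text = text
  end

  # 'lhs =~ rhs' in Ruby stands for 'lhs ~ rhs' in R.
  def =~(other)
    FormulaTerm.new("#{text} ~ #{other.text}")
  end

  def to_s
    text
  end
end

# Illustration only: unary plus on a Symbol yields a formula term.
# The special symbol :all stands for R's '.'.
class Symbol
  def +@
    FormulaTerm.new(self == :all ? "." : to_s)
  end
end

puts(+:x =~ +:y)        # prints x ~ y
puts(+:all =~ +:supp)   # prints . ~ supp
```

Unary plus binds more tightly than `=~`, so both sides are turned into terms before they are combined.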
284
+ The formula ```+:all =~ +:supp``` indicates to the 'facet_grid' function that it needs to
285
+ facet the plot based on the ```supp``` variable and split the plot vertically. Changing
286
+ the formula to ```+:supp =~ +:all``` would split the plot horizontally.
287
+
288
+ ```{ruby facet_by_delivery}
289
+ @base_tooth = @tooth_growth.ggplot(E.aes(x: :dose, y: :len, group: :dose))
290
+
291
+ @bp = @base_tooth + R.geom_boxplot +
292
+ # Split in vertical direction
293
+ R.facet_grid(+:all =~ +:supp)
294
+
295
+ puts @bp
296
+ ```
297
+
298
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/96db2729e02ced0f9336216d87d14af141c1e81b/facet_by_delivery.png)
299
+
300
+ It now becomes clear that although both methods of delivery have a direct
301
+ impact on tooth growth, method (OJ) is non-linear having a higher impact with smaller
302
+ doses of ascorbic acid and reducing its impact as the dose increases. With the
303
+ (VC) approach, the impact seems to be more linear.
304
+
305
+ ## Adding Color
306
+
307
+ If we were writing about data analysis, we would make a better analysis of the trends and
308
+ improve the statistical analysis. But here we are interested in working with ggplot
309
+ in Ruby. So, let's add some color to this plot to make the trend and comparison more
310
+ visible. In the following plot, the boxes are color coded by dose. To add color, it is
311
+ enough to add ```fill: :dose``` to the aesthetic of boxplot. With this command each 'dose'
312
+ factor gets its own color.
313
+
314
+ ```{ruby facets_by_delivery_color}
315
+ @bp = @bp + R.geom_boxplot(E.aes(fill: :dose))
316
+ puts @bp
317
+ ```
318
+
319
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/96db2729e02ced0f9336216d87d14af141c1e81b/facets_by_delivery_color.png)
320
+
321
+ Facetting helps us compare the general trends for each delivery method.
322
+ Adding color allows us to compare specifically how each dosage impacts the tooth growth.
323
+ It is possible to observe that with smaller doses, up to 1mg, OJ performs better
324
+ than VC (red color). For 2mg, both OJ and VC have the same median, but OJ is
325
+ less dispersed (blue color).
326
+ For 1mg (green color), OJ is significantly better than VC. By this very quick
327
+ visual analysis, it seems that OJ is a better delivery method than VC.
328
+
329
+ ## Clarifying the data
330
+
331
+ Boxplots give us a nice idea of the distribution of data, but looking at those plots with
332
+ large colored boxes leaves us wondering what else is going on. According to
333
+ Edward Tufte in Envisioning Information:
334
+
335
+ > Thin data rightly prompts suspicions: "What are they leaving out? Is that really everything
336
+ > they know? What are they hiding? Is that all they did?" Now and then it is claimed
337
+ > that vacant space is "friendly" (anthropomorphizing an inherently murky idea) but
338
+ > _it is not how much empty space there is, but rather how it is used. It is not how much
339
+ > information there is, but rather how effectively it is arranged._
340
+
341
+ And he states:
342
+
343
+ > A most unconventional design strategy is revealed: _to clarify, add detail._
344
+
345
+ Let's use this wisdom and add yet another layer of data to our plot, so that we clarify
346
+ it with detail and do not leave large empty boxes. In this next plot, we add data points for
347
+ each of the 60 pigs in the experiment. For that, add the function 'R.geom_point' to the
348
+ plot.
349
+
350
+ ```{ruby facets_with_points}
351
+ # Add a layer with one point per subject
352
+ @bp = @bp + R.geom_point
353
+
354
+ puts @bp
355
+ ```
356
+
357
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/96db2729e02ced0f9336216d87d14af141c1e81b/facets_with_points.png)
358
+
359
+ Now we can see the actual distribution of all the 60 subjects. Actually, this is not
360
+ totally true. We have a hard time seeing all 60 subjects. It seems that some points
361
+ might be placed one over the other hiding useful information.
362
+
363
+ But no sweat! Another layer might solve the problem. In the following plot a new layer
364
+ called 'geom_jitter' is added to the plot. Jitter adds a small amount of random variation
365
+ to the location of each point, and is a useful way of handling overplotting caused by
366
+ discreteness in smaller datasets. This makes it easier to see all of the points and
367
+ prevents data hiding. We also add
368
+ color and change the shape of the points, making them even easier to see.
369
+
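The idea behind jittering can be sketched in a few lines of plain Ruby. This is an illustration of the general technique only, not ggplot2's implementation; the `jitter` helper, its `width` default and the fixed seed are assumptions made for the example.

```ruby
# Shift each value by a uniform random offset in [-width, width].
def jitter(values, width: 0.2, rng: Random.new)
  values.map { |v| v + rng.rand(-width..width) }
end

# Several subjects share the same dose, so their points would overlap.
doses = [0.5, 0.5, 0.5, 1.0, 1.0, 2.0]

# A seeded generator keeps the example reproducible.
jittered = jitter(doses, width: 0.1, rng: Random.new(42))

jittered.each_with_index do |value, i|
  puts format("%.1f -> %.3f", doses[i], value)
end
```

Because each offset is small relative to the gap between dose levels, points stay visually attached to their group while no longer sitting exactly on top of each other.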
370
+ ```{ruby facets_with_jitter}
371
+ # Add a jitter layer so overlapping points become visible
372
+ puts @bp + R.geom_jitter(shape: 23, color: "cyan3", size: 1)
373
+ ```
374
+
375
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/96db2729e02ced0f9336216d87d14af141c1e81b/facets_with_jitter.png)
376
+
377
+ Now we can see all 60 points in the graph. We have here a much higher information density
378
+ and we can see outliers and the distribution of subjects.
379
+
380
+ # Preparing the Plot for Presentation
381
+
382
+ We have come a long way since our first plot. As we already said, this is not
383
+ an article about data analysis and the focus is on the
384
+ integration of Ruby and ggplot. So, let's assume that the analysis is now done. Yet,
385
+ ending the analysis does not mean that the work is done. On the contrary, the hardest
386
+ part is yet to come!
387
+
388
+ After the analysis it is necessary to communicate it by making a final plot for
389
+ presentation. The last plot has all the information we want to share, but it is not very
390
+ pleasing to the eye.
391
+
392
+ ## Improving Colors
393
+
394
+ Let's start by trying to improve colors. For now, we will not use the jitter layer.
395
+ The previous plot has three bright colors that have no relationship between them. Is
396
+ there any obvious, or non-obvious for that matter, interpretation for the colors?
397
+ Clearly, they are just random colors selected automatically by our software. Although
398
+ those colors helped us understand the data, for a final presentation random colors
399
+ can distract the viewer.
400
+
401
+ In the following plot we use the function 'scale_fill_manual' to change
402
+ the colors of the boxes and order of labels. For colors, we use shades of blue for
403
+ each dosage, with light blue ('cyan')
404
+ representing the lower dose and deep blue ('deepskyblue4') the higher dose.
405
+ Also, the legend could be improved: we use the ‘breaks’ parameter to put
406
+ the smallest value (0.5) at the bottom of the labels and the largest (2) at the top.
407
+ This ordering seems more natural and
408
+ matches with the actual order of the colors in the plot.
409
+
410
+ ```{ruby facets_by_delivery_color2}
411
+ @bp = @bp +
412
+ R.scale_fill_manual(values: R.c("cyan", "deepskyblue", "deepskyblue4"),
413
+ breaks: R.c("2","1","0.5"))
414
+
415
+ puts @bp
416
+ ```
417
+
418
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/96db2729e02ced0f9336216d87d14af141c1e81b/facets_by_delivery_color2.png)
419
+
420
+
421
+ ## Violin Plot and Jitter
422
+
423
+ The boxplot with jitter did look a bit overwhelming. The next plot uses a variation of
424
+ a boxplot known as a _violin plot_ with jittered data.
425
+
426
+ [From Wikipedia](https://en.wikipedia.org/wiki/Violin_plot)
427
+
428
+
429
+ > A violin plot is a method of plotting numeric data. It is similar to a box plot with
430
+ > a rotated kernel density plot on each side.
431
+ >
432
+ > A violin plot has four layers. The outer shape represents all possible results, with
433
+ > thickness indicating how common. (Thus the thickest section represents the mode average.)
434
+ > The next layer inside represents the values that occur 95% of the time.
435
+ > The next layer (if it exists) inside represents the values that occur 50% of the time.
436
+ > The central dot represents the median average value.
437
+
438
+ ```{ruby violin_with_jitter}
439
+ @violin = @base_tooth + R.geom_violin(E.aes(fill: :dose)) +
440
+ R.facet_grid(+:all =~ +:supp) +
441
+ R.geom_jitter(shape: 23, color: "cyan3", size: 1) +
442
+ R.scale_fill_manual(values: R.c("cyan", "deepskyblue", "deepskyblue4"),
443
+ breaks: R.c("2","1","0.5"))
444
+
445
+ puts @violin
446
+ ```
447
+
448
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/96db2729e02ced0f9336216d87d14af141c1e81b/violin_with_jitter.png)
449
+
450
+ This plot is an alternative to the original boxplot. For the final presentation, it is
451
+ important to think which graphics will be best understood by our audience. A violin plot
452
+ is a lesser-known plot and could add mental overhead, yet, in my opinion, it does look a little
453
+ bit better than the boxplot and provides even more information than the boxplot with jitter.
454
+
455
+ ## Adding Decoration
456
+
457
+ Our final plot is starting to take shape, but a presentation plot should have at least a
458
+ title, labels on the axes and maybe some other decorations. Let's start adding those.
459
+ Since decoration requires more graph area, this new plot has a 'width' and 'height'
460
+ specification. When there is no specification, the default values from R for width and
461
+ height are 480.
462
+
463
+ The 'labs' function adds the required decoration. In this example we use 'title',
464
+ 'subtitle', 'x' for the $x$ axis label and 'y', for the $y$ axis label, and 'caption'
465
+ for information about the plot (for clarity, we defined a caption variable using Ruby's
466
+ Here Doc style).
467
+
468
+ ```{ruby facets_with_decorations, dev = "png", fig.width = 540, fig.height = 560, units = "px"}
469
+ caption = <<-EOT
470
+ Length of odontoblasts in 60 guinea pigs.
471
+ Each animal received one of three dose levels of vitamin C.
472
+ EOT
473
+
474
+ @decorations =
475
+ R.labs(title: "Tooth Growth: Length vs Vitamin C Dose",
476
+ subtitle: "Faceted by delivery method, OJ or VC",
477
+ x: "Dose (mg)", y: "Teeth length",
478
+ caption: caption)
479
+
480
+ puts @bp + @decorations
481
+ ```
482
+
483
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/225058450f4e69e5e82a01e22f69725554746893/facets_with_decorations.png)
484
+
485
+ ## The Corp Theme
486
+
487
+ We are almost done. But the default plot configuration does not yet look
488
+ nice to the eye. We are still distracted
489
+ by many aspects of the graph. First, the black font color does not look good. Then
490
+ plot background, borders, grids all add clutter to the plot.
491
+
492
+ We will now define our corporate theme in a module that can be used/loaded for all
493
+ plots, similar to CSS or any other style definition.
494
+
495
+ In this theme, we remove borders and grids. The
496
+ background is left for faceted plots but removed for non-faceted plots. Font colors are
497
+ a shade of blue (color: '#000080'). Axis labels are moved near the end of the axis and
498
+ written in 'bold'.
499
+
500
+ ```{ruby coorp_theme}
501
+ module CorpTheme
502
+
503
+ R.install_and_loads 'RColorBrewer'
504
+
505
+ #---------------------------------------------------------------------------------
506
+ # face can be "plain", "bold", "italic" or "bold.italic"
507
+ #---------------------------------------------------------------------------------
508
+
509
+ def self.text_element(size, face: "plain", hjust: nil)
510
+ E.element_text(color: "#000080",
511
+ face: face,
512
+ size: size,
513
+ hjust: hjust)
514
+ end
515
+
516
+ #---------------------------------------------------------------------------------
517
+ # Defines the plot theme (visualization). In this theme we remove major and minor
518
+ # grids, borders and background. We also turn off scientific notation.
519
+ #---------------------------------------------------------------------------------
520
+
521
+ def self.global_theme(faceted: false)
522
+
523
+ R.options(scipen: 999) # turn-off scientific notation like 1e+48
524
+ # R.theme_set(R.theme_bw)
525
+
526
+ # remove major grids
527
+ gb = R.theme(panel__grid__major: E.element_blank())
528
+ # remove minor grids
529
+ gb = gb + R.theme(panel__grid__minor: E.element_blank)
530
+ # gb = R.theme(panel__grid__minor: E.element_blank)
531
+ # remove border
532
+ gb = gb + R.theme(panel__border: E.element_blank)
533
+ # remove background. When working with faceted graphs, the background makes
534
+ # it easier to see each facet, so leave it
535
+ gb = gb + R.theme(panel__background: E.element_blank) if !faceted
536
+ # Change axis font
537
+ gb = gb + R.theme(axis__text: text_element(8))
538
+ # change axis title font
539
+ gb = gb + R.theme(axis__title: text_element(10, face: "bold", hjust: 1))
540
+ # change font of title
541
+ gb = gb + R.theme(title: text_element(12, face: "bold"))
542
+ # change font of subtitle
543
+ gb = gb + R.theme(plot__subtitle: text_element(9))
544
+ # change font of captions
545
+ gb = gb + R.theme(plot__caption: text_element(8))
546
+
547
+ end
548
+
549
+ end
550
+ ```
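The theme keys in the module above, such as `panel__grid__major`, follow Galaaz's naming convention: R names that contain a dot are written in Ruby with a double underscore, so `panel__grid__major` corresponds to ggplot2's `panel.grid.major`. The following plain-Ruby sketch only illustrates that convention; `r_name` is a hypothetical helper and not Galaaz's internal implementation.

```ruby
# Hypothetical helper, for illustration only: it mimics the naming
# convention and says nothing about how Galaaz does the translation.
def r_name(ruby_name)
  ruby_name.to_s.gsub("__", ".")
end

# Print the R-side name for a few of the theme keys used in CorpTheme.
%i[panel__grid__major panel__border plot__caption].each do |key|
  puts "#{key} -> #{r_name(key)}"
end
```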
551
+
552
+ ## Final Box Plot
553
+
554
+ We can now easily make our final boxplot and violin plot. All the layers for the plot were
555
+ added in order to expose our understanding of the data and the need to present the result
556
+ to our audience.
557
+
558
+ The final specification is just the addition of all layers built up to this point (@bp), plus
559
+ the decorations (@decorations), plus the corporate theme.
560
+
561
+ Here is our final boxplot, without jitter.
562
+
563
+ ```{ruby final_box_plot}
564
+ puts @bp + @decorations + CorpTheme.global_theme(faceted: true)
565
+ ```
566
+
567
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/225058450f4e69e5e82a01e22f69725554746893/final_box_plot.png)
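The call above passes `faceted: true`. How Ruby binds that argument depends on whether the method declares a positional or a keyword parameter. The following plain-Ruby sketch shows the difference; both method names are hypothetical and exist only for this illustration.

```ruby
# Positional parameter with a default value.
def positional_flag(faceted = false)
  faceted
end

# Keyword parameter with a default value.
def keyword_flag(faceted: false)
  faceted
end

# With a positional parameter the whole Hash is bound to `faceted`.
p positional_flag(faceted: false)
# With a keyword parameter only the value is bound.
p keyword_flag(faceted: false)
```

With a positional parameter, a check such as `if !faceted` never sees `false` when the caller writes `faceted: false`, because a Hash is always truthy. A keyword parameter avoids that surprise.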
568
+
569
+ And here is the final violin plot, with jitter and the same look and feel as the corporate
570
+ boxplot.
571
+
572
+ ```{ruby final_violin_plot}
573
+ puts @violin + @decorations + CorpTheme.global_theme(faceted: true)
574
+ ```
575
+
576
+
577
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/225058450f4e69e5e82a01e22f69725554746893/final_violin_plot.png)
578
+
579
+ ## Another View
580
+
581
+ We now make another plot, with the same look and feel as before but faceted by
582
+ dose and not by supplement. This shows how easy it is to create new plots by just
583
+ changing a small statement in the _grammar of graphics_.
584
+
585
+ ```{ruby facet_by_dose}
586
+ caption = <<-EOT
587
+ Length of odontoblasts in 60 guinea pigs.
588
+ Each animal received one of three dose levels of vitamin C.
589
+ EOT
590
+
591
+ @bp = @tooth_growth.ggplot(E.aes(x: :supp, y: :len, group: :supp)) +
592
+ R.geom_boxplot(E.aes(fill: :supp)) + R.facet_grid(+:all =~ +:dose) +
593
+ R.scale_fill_manual(values: R.c("cyan", "deepskyblue4")) +
594
+ R.labs(title: "Tooth Growth: Length by Dose",
595
+ subtitle: "Faceted by dose",
596
+ x: "Delivery method", y: "Teeth length",
597
+ caption: caption) +
598
+ CorpTheme.global_theme(faceted: true)
599
+ puts @bp
600
+ ```
601
+
602
+ ![](https://gist.githubusercontent.com/rbotafogo/5538d6c679a59f4d56179b2c030e8d28/raw/96db2729e02ced0f9336216d87d14af141c1e81b/facet_by_dose.png)
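One rough way to picture why swapping the facet is so cheap: a plot specification behaves like a sum of independent layers, so replacing one term leaves the others untouched. The sketch below is plain Ruby under that simplifying assumption; `Spec` is a hypothetical stand-in and does not reflect how ggplot2 or Galaaz actually store a plot.

```ruby
# Hypothetical stand-in for a plot specification; not part of Galaaz or ggplot2.
class Spec
  attr_reader :layers

  def initialize(layers = [])
    @layers = layers.freeze
  end

  # Adding two specs returns a new spec; neither operand is modified.
  def +(other)
    Spec.new(layers + other.layers)
  end
end

base        = Spec.new(["boxplot"])
decorations = Spec.new(["labs"])

# Only the facet term differs between the two specifications.
by_supp = base + Spec.new(["facet: supp"]) + decorations
by_dose = base + Spec.new(["facet: dose"]) + decorations

puts by_supp.layers.inspect
puts by_dose.layers.inspect
```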
603
+
604
+ # Conclusion
605
+
606
+ In this article, we introduce Galaaz and show how to tightly couple Ruby and R
607
+ in a way that Ruby developers do not need to be aware
608
+ of the executing R engine. For the Ruby developer the existence of R
609
+ is of no consequence; she is just coding in Ruby. On the other hand, for the R
610
+ developer, migration to Ruby is a matter of small syntactic changes with a very gentle
611
+ learning curve. As the R developer becomes more proficient in Ruby, he can start using
612
+ 'classes', 'modules', 'procs', 'lambdas'.
613
+
614
+ Trying to bring to Ruby the power of R starting from scratch is an enormous endeavour
615
+ and would probably never be accomplished. Today's data scientists would certainly
616
+ stick with either Python or R. Now, both the Ruby and R communities can benefit
617
+ from this marriage, provided by Galaaz on top of GraalVM and Truffle's
618
+ polyglot environment. We presented
619
+ the process to couple Ruby and R, but this process can also be done to couple Ruby
620
+ and JavaScript or Ruby and Python. In a polyglot world a *uniglot* language might
621
+ be extremely relevant.
622
+
623
+ From the perspective of performance, GraalVM and Truffle promise improvements that could
624
+ reach over 10 times, both for [FastR](https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb)
625
+ and for [TruffleRuby](https://rubykaigi.org/2018/presentations/eregontp.html).
626
+
627
+ This article has shown how to improve a plot step-by-step. Starting from a very simple
628
+ boxplot with all default configurations, we moved slowly to our final plot. The important
629
+ point here is not if the final plot is actually beautiful (as beauty is in the eye of
630
+ the beholder), but that there is a process of small, incremental improvements that can be followed
631
+ to get a final plot ready for presentation.
632
+
633
+ Finally, this whole article was written in rmarkdown and compiled to HTML by _gknit_, an
634
+ application that wraps _knitr_ and allows documenting Ruby code. This application can
635
+ be of great help for any Rubyist trying to write articles, blogs or documentation for Ruby.
636
+
637
+ # Installing Galaaz
638
+
639
+ ## Prerequisites
640
+
641
+ * GraalVM (>= rc8): https://github.com/oracle/graal/releases
642
+ * TruffleRuby
643
+ * FastR
644
+
645
+ The following R packages will be automatically installed when necessary, but could be installed prior
646
+ to using gKnit if desired:
647
+
648
+ * ggplot2
649
+ * gridExtra
650
+ * knitr
651
+
652
+ Installation of R packages requires a development environment and can be time consuming. In Linux,
653
+ the gnu compiler and tools should be enough. I am not sure what is needed on the Mac.
654
+
655
+ ## Preparation
656
+
657
+ * gem install galaaz
658
+
659
+ ## Usage
660
+
661
+ * gknit <filename>
662
+ * In a script, add: require 'galaaz'