galaaz 0.4.1 → 0.4.2
- checksums.yaml +4 -4
- data/Rakefile +29 -0
- data/bin/gknit +208 -10
- data/bin/gknit2 +14 -0
- data/bin/gknit2~ +6 -0
- data/bin/prepareR.rb +3 -0
- data/bin/prepareR.rb~ +1 -0
- data/bin/tmp.py +51 -0
- data/blogs/dev/dev.Rmd +70 -0
- data/blogs/dev/dev.Rmd~ +104 -0
- data/blogs/dev/dev.html +209 -0
- data/blogs/dev/dev.md +72 -0
- data/blogs/dev/dev_files/figure-html/bubble-1.png +0 -0
- data/blogs/dev/model.rb +41 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot.Rmd +55 -27
- data/blogs/galaaz_ggplot/galaaz_ggplot.aux +44 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot.dvi +0 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot.html +17 -4
- data/blogs/galaaz_ggplot/galaaz_ggplot.out +10 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot.pdf +0 -0
- data/blogs/galaaz_ggplot/galaaz_ggplot.tex +630 -0
- data/blogs/galaaz_ggplot/midwest.Rmd +1 -1
- data/blogs/galaaz_ggplot/midwest_external_png +13 -0
- data/blogs/galaaz_ggplot/midwest_external_png~ +1 -0
- data/blogs/gknit/gknit.Rmd +500 -0
- data/blogs/gknit/gknit.Rmd~ +184 -0
- data/blogs/gknit/gknit.Rnd~ +17 -0
- data/blogs/gknit/gknit.html +528 -0
- data/blogs/gknit/gknit.md +628 -0
- data/blogs/gknit/gknit.pdf +0 -0
- data/blogs/gknit/gknit.tex +745 -0
- data/blogs/gknit/gknit_files/figure-html/bubble-1.png +0 -0
- data/blogs/gknit/gknit_files/figure-html/diverging_bar.png +0 -0
- data/blogs/gknit/model.rb +41 -0
- data/blogs/gknit/model.rb~ +46 -0
- data/blogs/ruby_plot/figures/dose_len.png +0 -0
- data/blogs/ruby_plot/figures/facet_by_delivery.png +0 -0
- data/blogs/ruby_plot/figures/facet_by_dose.png +0 -0
- data/blogs/ruby_plot/figures/facets_by_delivery_color.png +0 -0
- data/blogs/ruby_plot/figures/facets_by_delivery_color2.png +0 -0
- data/blogs/ruby_plot/figures/facets_with_decorations.png +0 -0
- data/blogs/ruby_plot/figures/facets_with_jitter.png +0 -0
- data/blogs/ruby_plot/figures/facets_with_points.png +0 -0
- data/blogs/ruby_plot/figures/final_box_plot.png +0 -0
- data/blogs/ruby_plot/figures/final_violin_plot.png +0 -0
- data/blogs/ruby_plot/figures/violin_with_jitter.png +0 -0
- data/blogs/ruby_plot/ruby_plot.Rmd +680 -0
- data/blogs/ruby_plot/ruby_plot.Rmd~ +215 -0
- data/blogs/ruby_plot/ruby_plot.html +563 -0
- data/blogs/ruby_plot/ruby_plot.md +731 -0
- data/blogs/ruby_plot/ruby_plot.pdf +0 -0
- data/blogs/ruby_plot/ruby_plot.tex +458 -0
- data/examples/sthda_ggplot/all.rb +0 -6
- data/examples/sthda_ggplot/two_variables_cont_bivariate/geom_hex.rb +1 -1
- data/examples/sthda_ggplot/two_variables_cont_cont/misc.rb +1 -1
- data/examples/sthda_ggplot/two_variables_disc_cont/geom_bar.rb +2 -2
- data/examples/sthda_ggplot/two_variables_disc_disc/geom_jitter.rb +0 -1
- data/lib/R/eng_ruby.R +62 -0
- data/lib/R/eng_ruby.R~ +63 -0
- data/lib/R_interface/capture_plot.rb~ +23 -0
- data/lib/{R → R_interface}/expression.rb +0 -0
- data/lib/{R → R_interface}/r.rb +10 -1
- data/lib/{R → R_interface}/r.rb~ +0 -0
- data/lib/{R → R_interface}/r_methods.rb +21 -5
- data/lib/{R → R_interface}/rbinary_operators.rb +6 -1
- data/lib/R_interface/rclosure.rb +38 -0
- data/lib/{R → R_interface}/rdata_frame.rb +0 -0
- data/lib/R_interface/rdevices.R +31 -0
- data/lib/R_interface/rdevices.rb +225 -0
- data/lib/{R/rclosure.rb → R_interface/rdevices.rb~} +3 -10
- data/lib/{R → R_interface}/renvironment.rb +0 -0
- data/lib/{R → R_interface}/rexpression.rb +0 -0
- data/lib/{R → R_interface}/rindexed_object.rb +0 -0
- data/lib/{R → R_interface}/rlanguage.rb +0 -0
- data/lib/{R → R_interface}/rlist.rb +0 -0
- data/lib/{R → R_interface}/rmatrix.rb +0 -0
- data/lib/{R → R_interface}/rmd_indexed_object.rb +0 -0
- data/lib/{R → R_interface}/robject.rb +5 -0
- data/lib/{R → R_interface}/rpkg.rb +0 -0
- data/lib/{R → R_interface}/rsupport.rb +49 -13
- data/lib/{R → R_interface}/rsupport_scope.rb +0 -0
- data/lib/{R → R_interface}/rsymbol.rb +1 -0
- data/lib/{R → R_interface}/ruby_callback.rb +0 -0
- data/lib/{R → R_interface}/ruby_extensions.rb +2 -1
- data/lib/{R → R_interface}/runary_operators.rb +0 -0
- data/lib/{R → R_interface}/rvector.rb +0 -0
- data/lib/galaaz.rb +4 -2
- data/lib/gknit.rb +27 -0
- data/lib/gknit.rb~ +26 -0
- data/lib/gknit/knitr_engine.rb +120 -0
- data/lib/gknit/knitr_engine.rb~ +102 -0
- data/lib/gknit/ruby_engine.rb +70 -0
- data/lib/gknit/ruby_engine.rb~ +72 -0
- data/lib/util/exec_ruby.rb +8 -7
- data/lib/util/inline_file.rb +70 -0
- data/lib/util/inline_file.rb~ +23 -0
- data/r_requires/ggplot.rb +1 -8
- data/r_requires/knitr.rb +27 -0
- data/r_requires/knitr.rb~ +4 -0
- data/specs/r_language.spec.rb +22 -0
- data/specs/r_plots.spec.rb +72 -0
- data/specs/r_plots.spec.rb~ +37 -0
- data/specs/tmp.rb +255 -1
- data/version.rb +1 -1
- metadata +89 -39
@@ -0,0 +1,731 @@
|
|
1
|
+
---
|
2
|
+
title: "How to make Beautiful Ruby Plots with Galaaz"
|
3
|
+
author: "Rodrigo Botafogo"
|
4
|
+
tags: [Tech, Data Science, Ruby, R, GraalVM]
|
5
|
+
date: "November 19th, 2018"
|
6
|
+
output:
|
7
|
+
html_document:
|
8
|
+
self_contained: true
|
9
|
+
keep_md: true
|
10
|
+
pdf_document:
|
11
|
+
includes:
|
12
|
+
in_header: ["../../sty/galaaz.sty"]
|
13
|
+
number_sections: yes
|
14
|
+
---
|
15
|
+
|
16
|
+
|
17
|
+
|
18
|
+
# Introduction
|
19
|
+
|
20
|
+
According to Wikipedia "Ruby is a dynamic, interpreted, reflective, object-oriented,
|
21
|
+
general-purpose programming language. It was designed and developed in the mid-1990s by Yukihiro
|
22
|
+
"Matz" Matsumoto in Japan." It reached high popularity with the development of Ruby on Rails
|
23
|
+
(RoR) by David Heinemeier Hansson. RoR is a web application framework first released
|
24
|
+
around 2005. It makes extensive use of Ruby's metaprogramming features. With RoR,
|
25
|
+
Ruby became very popular. According to [Ruby's Tiobe index](https://www.tiobe.com/tiobe-index/ruby/)
|
26
|
+
it peaked in popularity around 2008. Then its popularity
|
27
|
+
declined until 2015 when it started picking up again. At the time of
|
28
|
+
this writing (November 2018), the Tiobe index puts Ruby in 16th position.
|
29
|
+
|
30
|
+
Python, a similar language to Ruby, ranks 4th in the index. Java, C and C++ take the
|
31
|
+
first three positions. Ruby is often criticized for its focus on web applications.
|
32
|
+
But Ruby can do [much more](https://github.com/markets/awesome-ruby) than just web applications.
|
33
|
+
Yet, for scientific computing, Ruby lags way behind Python and R. Python has the
|
34
|
+
Django framework for the web, NumPy for numerical arrays, and Pandas for data analysis.
|
35
|
+
R is a free software environment for statistical computing and graphics with thousands
|
36
|
+
of libraries for data analysis.
|
37
|
+
|
38
|
+
Until recently, there was no realistic prospect of Ruby bridging this gap.
|
39
|
+
Implementing a complete scientific computing infrastructure would take too long.
|
40
|
+
This is where GraalVM comes into the picture:
|
41
|
+
|
42
|
+
> GraalVM is a universal virtual machine for running applications written in
|
43
|
+
> JavaScript, Python 3, Ruby, R, JVM-based languages like Java, Scala, Kotlin,
|
44
|
+
> and LLVM-based languages such as C and C++.
|
45
|
+
>
|
46
|
+
> GraalVM removes the isolation between programming languages and enables
|
47
|
+
> interoperability in a shared runtime. It can run either standalone or in the
|
48
|
+
> context of OpenJDK, Node.js, Oracle Database, or MySQL.
|
49
|
+
>
|
50
|
+
> GraalVM allows you to write polyglot applications with a seamless way to pass
|
51
|
+
> values from one language to another. With GraalVM there is no copying or
|
52
|
+
> marshaling necessary as it is with other polyglot systems. This lets you
|
53
|
+
> achieve high performance when language boundaries are crossed. Most of the time
|
54
|
+
> there is no additional cost for crossing a language boundary at all.
|
55
|
+
>
|
56
|
+
> Often developers have to make uncomfortable compromises that require them
|
57
|
+
> to rewrite their software in other languages. For example:
|
58
|
+
>
|
59
|
+
> * That library is not available in my language. I need to rewrite it.
|
60
|
+
> * That language would be the perfect fit for my problem, but we cannot
|
61
|
+
> run it in our environment.
|
62
|
+
> * That problem is already solved in my language, but the language is
|
63
|
+
> too slow.
|
64
|
+
>
|
65
|
+
> With GraalVM we aim to allow developers to freely choose the right language for
|
66
|
+
> the task at hand without making compromises.
|
67
|
+
|
68
|
+
As stated above, GraalVM is a _universal_ virtual machine that allows Ruby and R (and other
|
69
|
+
languages) to run on the same environment. GraalVM allows polyglot applications to
|
70
|
+
_seamlessly_ interact with one another and pass values from one language to the other.
|
71
|
+
Galaaz, a gem for Ruby, intends to tightly couple Ruby and R
|
72
|
+
and allow those languages to interact in a way that the user will be unaware
|
73
|
+
of such interaction.
|
74
|
+
|
75
|
+
Library wrapping is a common way of bringing features from one language into another.
|
76
|
+
To improve performance, Python often wraps more efficient C libraries. For the
|
77
|
+
Python developer, the existence of such C libraries is of no concern. The problem with
|
78
|
+
library wrapping is that for any new library, there is the need to handcraft a new
|
79
|
+
wrapper.
|
80
|
+
|
81
|
+
Galaaz, instead of wrapping a single C or R library, wraps the whole of
|
82
|
+
the R language in Ruby. As a result, all of the thousands of R libraries are available to
|
83
|
+
Ruby developers. Also, any new library developed in R will be available without a
|
84
|
+
new wrapping effort.
|
85
|
+
|
86
|
+
This article shows how Ruby can use R's ggplot2 library transparently, and
|
87
|
+
bring to Ruby the power of high-quality scientific plotting. It also shows that
|
88
|
+
migrating from R to Ruby with Galaaz is a matter of small syntactic changes.
|
89
|
+
Using Ruby, the R developer can use all of Ruby's powerful OO features. It also
|
90
|
+
becomes much easier to move code from the analysis phase to the production phase.
|
91
|
+
|
92
|
+
In this article we will explore the R ToothGrowth dataset. In doing so, we will
|
93
|
+
create some boxplots. A primer on boxplots is available in
|
94
|
+
[this article](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51).
|
95
|
+
|
96
|
+
We will also create a Corporate Template ensuring that plots will have a consistent
|
97
|
+
visualization. This template is built using a Ruby module. There is a way of building
|
98
|
+
ggplot themes that will work the same as the Ruby module. Yet, writing a new theme
|
99
|
+
requires specific knowledge. Ruby modules are standard to the language and don't
|
100
|
+
need special knowledge.
|
101
|
+
|
102
|
+
In [this blog post](https://towardsdatascience.com/ruby-plotting-with-galaaz-an-example-of-tightly-coupling-ruby-and-r-in-graalvm-520b69e21021) we show a scatter plot made in Ruby, also with Galaaz.
|
103
|
+
|
104
|
+
# gKnit
|
105
|
+
|
106
|
+
_Knitr_ is an application that converts text written in rmarkdown to many
|
107
|
+
different output formats. For instance, a writer can convert an rmarkdown document
|
108
|
+
to HTML, LaTeX, docx and many other formats. Rmarkdown documents can contain
|
109
|
+
text and _code chunks_. Knitr formats code chunks in a grayed box in the output document.
|
110
|
+
It also executes the code chunks and formats the output in a white box. Every line of
|
111
|
+
output from the executed code is preceded by '##'.
|
112
|
+
|
113
|
+
Knitr allows code chunks to be in R, Python,
|
114
|
+
Ruby and dozens of other languages. Yet, while R and Python chunks can share data, in other
|
115
|
+
languages, chunks are independent. This means that a variable defined in one chunk
|
116
|
+
cannot be used in another chunk.
|
117
|
+
|
118
|
+
With _gKnit_ Ruby code chunks can share data. In gKnit each
|
119
|
+
Ruby chunk executes in its own scope and thus, local variables defined in a chunk are
|
120
|
+
not accessible by other chunks. Yet, all chunks execute in the scope of a 'chunk'
|
121
|
+
class, and instance variables ('@') are available in all chunks.
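
The scoping rule can be illustrated in plain Ruby, without gKnit. The sketch below is only a model of the behavior described above: `ChunkScope` and `run_two_chunks` are hypothetical names, and evaluating each chunk with `instance_eval` on one shared object is an assumption about how such sharing could work, not a description of gKnit's internals.

```ruby
# Hypothetical stand-in for the object in whose scope every chunk runs.
class ChunkScope; end

# Runs two "chunks" against the same object and reports what the second
# chunk can see from the first one.
def run_two_chunks
  scope = ChunkScope.new

  # Chunk 1: defines one local variable and one instance variable.
  scope.instance_eval do
    local_total = 10                 # local: gone when this chunk ends
    @shared_total = local_total * 2  # instance variable: stays on 'scope'
  end

  # Chunk 2: a new block, hence a new local scope on the same object.
  scope.instance_eval do
    {
      shared_total: @shared_total,
      local_visible: binding.local_variable_defined?(:local_total)
    }
  end
end

p run_two_chunks
```

The second chunk reads the instance variable set by the first one (20), while the local variable of the first chunk is not defined in it.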
|
122
|
+
|
123
|
+
# Exploring the Dataset
|
124
|
+
|
125
|
+
Let's start by exploring our selected dataset. ToothGrowth is an R dataset. A dataset
|
126
|
+
is like an Excel spreadsheet, but in which each column has only one type of data.
|
127
|
+
For instance, one column can hold floats, another integers, and a third strings.
|
128
|
+
This dataset analyses the length of odontoblasts (cells responsible for tooth growth)
|
129
|
+
in 60 guinea pigs, where each animal received one of three dose levels of Vitamin C
|
130
|
+
(0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice (OJ) or ascorbic acid
|
131
|
+
(a form of vitamin C and coded as VC).
|
132
|
+
|
133
|
+
The ToothGrowth dataset contains three columns: 'len', 'supp' and 'dose'. Let's
|
134
|
+
take a look at a few rows of this dataset. In Galaaz, to have access to an R variable
|
135
|
+
we use the corresponding Ruby symbol preceded by the tilde ('~') function. Note in the
|
136
|
+
following chunk that Ruby's '@tooth_growth' is assigned the value of '~:ToothGrowth'.
|
137
|
+
'ToothGrowth' is the R variable containing the dataset of interest.
|
138
|
+
|
139
|
+
|
140
|
+
```ruby
|
141
|
+
# Read the R ToothGrowth variable and assign it to the
|
142
|
+
# Ruby instance variable @tooth_growth that will be
|
143
|
+
# available to all Ruby chunks in this document.
|
144
|
+
@tooth_growth = ~:ToothGrowth
|
145
|
+
# print the first few elements of the dataset
|
146
|
+
puts @tooth_growth.head
|
147
|
+
```
|
148
|
+
|
149
|
+
```
|
150
|
+
## len supp dose
|
151
|
+
## 1 4.2 VC 0.5
|
152
|
+
## 2 11.5 VC 0.5
|
153
|
+
## 3 7.3 VC 0.5
|
154
|
+
## 4 5.8 VC 0.5
|
155
|
+
## 5 6.4 VC 0.5
|
156
|
+
## 6 10.0 VC 0.5
|
157
|
+
```
|
158
|
+
|
159
|
+
Great! We've managed to read the ToothGrowth dataset and take a look at its elements.
|
160
|
+
We see here the first 6 rows of the dataset. To access a column, follow the dataset name
|
161
|
+
with a dot ('.') and the name of the column. Also use dot notation to chain methods
|
162
|
+
in usual Ruby style.
|
163
|
+
|
164
|
+
|
165
|
+
```ruby
|
166
|
+
# Access the tooth_growth 'len' column and print the first few
|
167
|
+
# elements of this column with the 'head' method.
|
168
|
+
puts @tooth_growth.len.head
|
169
|
+
```
|
170
|
+
|
171
|
+
```
|
172
|
+
## [1] 4.2 11.5 7.3 5.8 6.4 10.0
|
173
|
+
```
|
174
|
+
|
175
|
+
The 'dose' column contains a numeric value of either 0.5, 1 or 2. Although those are
|
176
|
+
numbers, they are better interpreted as a [factor or category](https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors/). So, let's convert our 'dose' column from numeric to 'factor'.
|
177
|
+
In R, the function 'as.factor' is used to convert data in a vector to factors. To use this
|
178
|
+
function from Galaaz, the dot ('.') in the function name is replaced by '__' (double underscore).
|
179
|
+
The function 'as.factor' becomes 'R.as__factor' or just 'as__factor' when chaining.
|
180
|
+
|
181
|
+
|
182
|
+
```ruby
|
183
|
+
# convert the dose to a factor
|
184
|
+
@tooth_growth.dose = @tooth_growth.dose.as__factor
|
185
|
+
```
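
The naming rule is easy to state as code. The helper below is a plain-Ruby sketch of the '__' convention described above; `r_function_name` is a hypothetical name used only for illustration and is not part of Galaaz.

```ruby
# Hypothetical helper: maps a Ruby-side method name to the R function name
# it stands for, following the '__' convention (double underscore = dot).
def r_function_name(ruby_name)
  ruby_name.to_s.gsub("__", ".")
end

puts r_function_name(:as__factor)   # as.factor
puts r_function_name(:dev__off)     # dev.off
puts r_function_name(:geom_boxplot) # geom_boxplot (single underscores are kept)
```

Single underscores are left alone, which is why R names such as 'geom_boxplot' can be used unchanged.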
|
186
|
+
|
187
|
+
Let's explore some more details of this dataset. In particular, let's look at its dimensions,
|
188
|
+
structure and summary statistics.
|
189
|
+
|
190
|
+
|
191
|
+
```ruby
|
192
|
+
puts @tooth_growth.dim
|
193
|
+
```
|
194
|
+
|
195
|
+
```
|
196
|
+
## [1] 60 3
|
197
|
+
```
|
198
|
+
|
199
|
+
This dataset has 60 rows, one for each subject, and 3 columns, as we have already seen.
|
200
|
+
|
201
|
+
Note that we do not call 'puts' when using the 'str' function. This function does not
|
202
|
+
return anything and prints the structure of the dataset as a side effect.
|
203
|
+
|
204
|
+
|
205
|
+
```ruby
|
206
|
+
@tooth_growth.str
|
207
|
+
```
|
208
|
+
|
209
|
+
```
|
210
|
+
## 'data.frame': 60 obs. of 3 variables:
|
211
|
+
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
|
212
|
+
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
|
213
|
+
## $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
|
214
|
+
```
|
215
|
+
Observe that both variables 'supp' and 'dose' are factors. The system made variable 'supp'
|
216
|
+
a factor automatically, since it contains only the two strings OJ and VC.
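
To make the 'str' output above easier to read, here is a toy model of a factor in plain Ruby: a sorted list of distinct levels plus one integer code per value. This is an illustration only, not how R stores factors internally; `encode_as_factor` is a hypothetical helper and the short `supp` vector is made up.

```ruby
# Encodes a list of values the way 'str' displays a factor:
# a sorted list of distinct levels plus one 1-based integer code per value.
def encode_as_factor(values)
  levels = values.uniq.sort
  codes = values.map { |value| levels.index(value) + 1 }
  { levels: levels, codes: codes }
end

supp = %w[VC VC VC OJ OJ] # made-up sample, not the real column
factor = encode_as_factor(supp)
puts factor[:levels].inspect # ["OJ", "VC"]
puts factor[:codes].inspect  # [2, 2, 2, 1, 1]
```

This matches the shape of the 'supp' line printed by 'str', where every 'VC' value shows up as code 2 because 'VC' is the second level.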
|
217
|
+
|
218
|
+
Finally, using the summary method, we get the statistical summary for the dataset.
|
219
|
+
|
220
|
+
|
221
|
+
```ruby
|
222
|
+
puts @tooth_growth.summary
|
223
|
+
```
|
224
|
+
|
225
|
+
```
|
226
|
+
## len supp dose
|
227
|
+
## Min. : 4.20 OJ:30 0.5:20
|
228
|
+
## 1st Qu.:13.07 VC:30 1 :20
|
229
|
+
## Median :19.25 2 :20
|
230
|
+
## Mean :18.81
|
231
|
+
## 3rd Qu.:25.27
|
232
|
+
## Max. :33.90
|
233
|
+
```
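
For readers who want to see where the numbers in such a summary come from, the sketch below computes the same six statistics in plain Ruby. It uses linear interpolation between order statistics, which is the default rule of R's 'quantile' function (type 7). The helper names `quantile` and `five_number_summary` are hypothetical, and the input is only the six 'len' values printed earlier by 'head', not the full column.

```ruby
# Quantile with linear interpolation between order statistics
# (the default rule of R's 'quantile', known as type 7).
def quantile(values, prob)
  sorted = values.sort
  position = (sorted.size - 1) * prob
  lower = position.floor
  upper = position.ceil
  sorted[lower] + (position - lower) * (sorted[upper] - sorted[lower])
end

# The six statistics that 'summary' reports for a numeric column.
def five_number_summary(values)
  {
    min: values.min,
    q1: quantile(values, 0.25),
    median: quantile(values, 0.5),
    mean: values.sum / values.size.to_f,
    q3: quantile(values, 0.75),
    max: values.max
  }
end

# The six 'len' values printed earlier by 'head'.
head_len = [4.2, 11.5, 7.3, 5.8, 6.4, 10.0]
five_number_summary(head_len).each do |name, value|
  puts format("%-6s %.2f", name, value)
end
```

A boxplot draws the box from the first to the third quartile and a line at the median, so these are the same quantities the plots in the next sections display.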
|
234
|
+
|
235
|
+
# Doing the Data Analysis
|
236
|
+
|
237
|
+
## Quick plot for seeing the data
|
238
|
+
|
239
|
+
Let's now create our first plot with the given data by accessing ggplot2 from Ruby. For Rubyists
|
240
|
+
who have never seen or used ggplot2, here is the description of ggplot found on its home page:
|
241
|
+
|
242
|
+
> "ggplot2 is a system for declaratively creating graphics, based on _The Grammar of Graphics_.
|
243
|
+
> You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical
|
244
|
+
> primitives to use, and it takes care of the details."
|
245
|
+
|
246
|
+
This description might be a bit cryptic and it is best to see it at work to understand it.
|
247
|
+
Basically, in the _grammar of graphics_ developers add layers of components such as grid,
|
248
|
+
axis, data, title, subtitle and also graphical primitives such as _bar plot_, _box plot_,
|
249
|
+
to form the final graphics.
|
250
|
+
|
251
|
+
In order to make a plot, we apply the 'ggplot' function to the dataset. In R, this would be
|
252
|
+
written as ```ggplot(<dataset>, ...)```. In Galaaz, use either ```R.ggplot(<dataset>, ...)```,
|
253
|
+
or ```<dataset>.ggplot(...)```. In the graph specification below, we use the second notation
|
254
|
+
that looks more Ruby-like. The plot maps 'dose' to the $x$ axis and 'len' to
|
255
|
+
the $y$ axis with the 'aes' method: 'E.aes(x: :dose, y: :len)'. To specify the type of plot to
|
256
|
+
create, add a geom to the plot. For a boxplot, the geom is 'R.geom_boxplot'.
|
257
|
+
|
258
|
+
Note also that we have a call to 'R.png' before plotting and 'R.dev__off' after the print
|
259
|
+
statement. 'R.png' opens a 'png' device for outputting the plot. 'R.dev__off'
|
260
|
+
closes the device and creates the 'png' file. If we do not pass a name to the 'png' function, the
|
261
|
+
image gets a default name of 'Rplot\<nnn\>' where \<nnn\> is the number of the plot. We can
|
262
|
+
then include the generated 'png' file in the document by adding an rmarkdown directive.
|
263
|
+
|
264
|
+
|
265
|
+
```ruby
|
266
|
+
require 'ggplot'
|
267
|
+
|
268
|
+
R.png("figures/dose_len.png")
|
269
|
+
|
270
|
+
e = @tooth_growth.ggplot(E.aes(x: :dose, y: :len))
|
271
|
+
print e + R.geom_boxplot
|
272
|
+
|
273
|
+
R.dev__off
|
274
|
+
```
|
275
|
+
|
276
|
+
[//]: # (Including the 'png' file generated above. In future releases)
|
277
|
+
[//]: # (of gKnit, the figures should be automatically saved and the name)
|
278
|
+
[//]: # (taken from the chunk 'label' and possibly chunk parameters)
|
279
|
+
|
280
|
+
![](figures/dose_len.png)
|
281
|
+
|
282
|
+
Great! We've just managed to create and save our first plot in Ruby with only
|
283
|
+
four lines of code. We can see with this plot a clear trend: as the dose of the supplement
|
284
|
+
is increased, so is the length of teeth.
|
285
|
+
|
286
|
+
## Faceting the plot
|
287
|
+
|
288
|
+
This first plot shows a trend, but our data has information about two different forms
|
289
|
+
of delivery method, either by orange juice (OJ) or by ascorbic acid (VC).
|
290
|
+
Let's then try to create a plot that makes the effect of each delivery method explicit. This next
|
291
|
+
plot is a _faceted_ plot where each delivery method gets its own plot.
|
292
|
+
On the left side, the plot shows the OJ delivery method. On the right side, we see the
|
293
|
+
VC delivery method. To obtain this plot, we use the 'R.facet_grid' function, which
|
294
|
+
automatically creates the facets based on the delivery method factors. The parameter to
|
295
|
+
the 'facet_grid' method is a [_formula_](https://thomasleeper.com/Rcourse/Tutorials/formulae.html).
|
296
|
+
|
297
|
+
In Galaaz, formulas are written a bit differently than in R. The following changes are
|
298
|
+
necessary:
|
299
|
+
|
300
|
+
* R symbols are represented by the same Ruby symbol prefixed with the '+' method. The
|
301
|
+
symbol ```x``` in R becomes ```+:x``` in Ruby;
|
302
|
+
* The '~' operator in R becomes '=~' in Ruby. The formula ```x ~ y``` in R is written as
|
303
|
+
```+:x =~ +:y``` in Ruby;
|
304
|
+
* The '.' symbol in R becomes '+:all'
|
305
|
+
|
306
|
+
Another way of writing a formula is to use the 'formula' function with the actual formula as
|
307
|
+
a string. The formula ```x ~ y``` in R can be written as ```R.formula("x ~ y")```. For more
|
308
|
+
complex formulas, the use of the 'formula' function is preferred.
|
309
|
+
|
310
|
+
The formula ```+:all =~ +:supp``` indicates to the 'facet_grid' function that it needs to
|
311
|
+
facet the plot based on the ```supp``` variable and split the plot vertically, placing the facets side by side. Changing
|
312
|
+
the formula to ```+:supp =~ +:all``` would split the plot horizontally, stacking the facets one above the other.
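
The notation works because both '+' (as a unary operator) and '=~' are ordinary Ruby methods that can be defined on any class. The toy model below shows the idea in plain Ruby by turning the Ruby expression into the text of the corresponding R formula. It is not Galaaz's implementation: `RSymbolSketch` and `r_text` are hypothetical names, and Galaaz builds real R objects rather than strings.

```ruby
# Toy model of the formula notation; NOT Galaaz's implementation.
# RSymbolSketch is a hypothetical wrapper around a Ruby symbol.
RSymbolSketch = Struct.new(:name) do
  # Text of this symbol on the R side; ':all' stands for R's '.'.
  def r_text
    name == :all ? "." : name.to_s
  end

  # 'lhs =~ rhs' builds the text of the R formula 'lhs ~ rhs'.
  def =~(other)
    "#{r_text} ~ #{other.r_text}"
  end
end

# Unary plus on a symbol wraps it (Symbol has no unary plus of its own).
class Symbol
  def +@
    RSymbolSketch.new(self)
  end
end

puts(+:all =~ +:supp) # . ~ supp
puts(+:supp =~ +:all) # supp ~ .
puts(+:len =~ +:dose) # len ~ dose
```

Unary plus binds more tightly than '=~', so `+:all =~ +:supp` needs no extra parentheses around each side.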
|
313
|
+
|
314
|
+
|
315
|
+
```ruby
|
316
|
+
R.png("figures/facet_by_delivery.png")
|
317
|
+
|
318
|
+
@base_tooth = @tooth_growth.ggplot(E.aes(x: :dose, y: :len, group: :dose))
|
319
|
+
|
320
|
+
@bp = @base_tooth + R.geom_boxplot +
|
321
|
+
# Split in vertical direction
|
322
|
+
R.facet_grid(+:all =~ +:supp)
|
323
|
+
|
324
|
+
puts @bp
|
325
|
+
|
326
|
+
R.dev__off
|
327
|
+
```
|
328
|
+
|
329
|
+
![](figures/facet_by_delivery.png)
|
330
|
+
|
331
|
+
It now becomes clear that although both methods of delivery have a direct
|
332
|
+
impact on tooth growth, the (OJ) method is non-linear, having a higher impact with smaller
|
333
|
+
doses of ascorbic acid and reducing its impact as the dose increases. With the
|
334
|
+
(VC) approach, the impact seems to be more linear.
|
335
|
+
|
336
|
+
## Adding Color
|
337
|
+
|
338
|
+
If this article were about data analysis, we should make a better analysis of the trends and
|
339
|
+
should improve the statistical analysis. But we are interested in working with ggplot
|
340
|
+
in Ruby. So, let's add some color to this plot to make the trend and comparison more
|
341
|
+
visible. In the following plot, the boxes are color coded by dose. To add color, it is
|
342
|
+
enough to add ```fill: :dose``` to the aesthetic of boxplot. With this command each 'dose'
|
343
|
+
factor gets its own color.
|
344
|
+
|
345
|
+
|
346
|
+
```ruby
|
347
|
+
R.png("figures/facets_by_delivery_color.png")
|
348
|
+
|
349
|
+
@bp = @bp + R.geom_boxplot(E.aes(fill: :dose))
|
350
|
+
puts @bp
|
351
|
+
|
352
|
+
R.dev__off
|
353
|
+
```
|
354
|
+
|
355
|
+
![](figures/facets_by_delivery_color.png)
|
356
|
+
|
357
|
+
Faceting helps us compare the general trends in the (OJ) and (VC) delivery methods.
|
358
|
+
Adding color allows us to compare specifically how each dosage impacts tooth growth.
|
359
|
+
It is possible to observe that with smaller doses, up to 1mg, (OJ) performs better
|
360
|
+
than (VC) (red color). For 2mg, both (OJ) and (VC) have the same median, but (OJ) is
|
361
|
+
less dispersed (blue color).
|
362
|
+
For 1mg (green color), (OJ) is significantly better than (VC). By this very quick analysis,
|
363
|
+
it seems that (OJ) is a better delivery method than (VC).
|
364
|
+
|
365
|
+
## Clarifying the data
|
366
|
+
|
367
|
+
Boxplots give us a nice idea of the distribution of data, but looking at those plots with
|
368
|
+
large colored boxes leaves us wondering what is going on inside those boxes. According to
|
369
|
+
Edward Tufte in _Envisioning Information_:
|
370
|
+
|
371
|
+
> Thin data rightly prompts suspicions: "What are they leaving out? Is that really everything
|
372
|
+
> they know? What are they hiding? Is that all they did?" Now and then it is claimed
|
373
|
+
> that vacant space is "friendly" (anthropomorphizing an inherently murky idea) but
|
374
|
+
> _it is not how much empty space there is, but rather how it is used. It is not how much
|
375
|
+
> information there is, but rather how effectively it is arranged._
|
376
|
+
|
377
|
+
And he states:
|
378
|
+
|
379
|
+
> A most unconventional design strategy is revealed: _to clarify, add detail._
|
380
|
+
|
381
|
+
Let's then use this wisdom and add yet another layer of data to our plot, so that we clarify
|
382
|
+
it with detail and do not leave large empty boxes. In this next plot, we add data points for
|
383
|
+
each of the 60 guinea pigs in the experiment. For that, add the function 'R.geom_point' to the
|
384
|
+
plot.
|
385
|
+
|
386
|
+
|
387
|
+
```ruby
|
388
|
+
R.png("figures/facets_with_points.png")
|
389
|
+
|
390
|
+
# Add one point per subject on top of the boxes
|
391
|
+
@bp = @bp + R.geom_point
|
392
|
+
|
393
|
+
puts @bp
|
394
|
+
|
395
|
+
R.dev__off
|
396
|
+
```
|
397
|
+
|
398
|
+
![](figures/facets_with_points.png)
|
399
|
+
|
400
|
+
Now we can see the actual distribution of all 60 subjects. Actually, this is not
|
401
|
+
totally true. We have a hard time seeing all 60 subjects. It seems that some points
|
402
|
+
might be placed one over the other hiding useful information.
|
403
|
+
|
404
|
+
But no sweat! Another layer might solve the problem. In the following plot a new layer
|
405
|
+
called 'geom_jitter' is added to the plot. This adds randomness to the position of
|
406
|
+
the points, making it easier to see all of them and preventing data hiding. We also add
|
407
|
+
color and change the shape of the points, making them even easier to see.
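
What jittering does can be shown in a few lines of plain Ruby. The sketch below adds a small uniform random offset to the horizontal position of each point. It is only an illustration of the idea: `jitter_x` is a hypothetical helper, the sample points are made up, and 'geom_jitter' has its own defaults for how far points move, which this sketch does not try to reproduce.

```ruby
# Adds a small uniform random offset to each x position, leaving y untouched.
# 'width' is the maximum offset on either side; 'rng' makes runs repeatable.
def jitter_x(points, width: 0.2, rng: Random.new(42))
  points.map do |x, y|
    [x + rng.rand(-width..width), y]
  end
end

# Three made-up subjects with the same dose and the same length overlap exactly...
overlapping = [[1.0, 25.5], [1.0, 25.5], [1.0, 25.5]]

# ...and are pulled apart horizontally once jittered.
jitter_x(overlapping).each { |x, y| puts format("x=%.3f y=%.1f", x, y) }
```

Because only the horizontal position changes, the measured value on the $y$ axis stays exact while coinciding points become distinguishable.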
|
408
|
+
|
409
|
+
|
410
|
+
```ruby
|
411
|
+
R.png("figures/facets_with_jitter.png")
|
412
|
+
|
413
|
+
# Add jittered points so that overlapping subjects become visible
|
414
|
+
puts @bp + R.geom_jitter(shape: 23, color: "cyan3", size: 1)
|
415
|
+
|
416
|
+
R.dev__off
|
417
|
+
```
|
418
|
+
|
419
|
+
![](figures/facets_with_jitter.png)
|
420
|
+
|
421
|
+
Now we can see all 60 points in the graph. We have here a much higher information density
|
422
|
+
and we can see outliers and the distribution of subjects.
|
423
|
+
|
424
|
+
# Preparing the Plot for Presentation
|
425
|
+
|
426
|
+
We have come a long way since our first plot. As was already said, this is not
|
427
|
+
an article about data analysis and the focus is on the
|
428
|
+
integration of Ruby and ggplot. So, let's assume that the analysis is now done. Yet,
|
429
|
+
ending the analysis does not mean that the work is done. On the contrary, the hardest
|
430
|
+
part is yet to come!
|
431
|
+
|
432
|
+
After the analysis it is necessary to communicate it by making a final plot for
|
433
|
+
presentation. The last plot has all the information we want to share, but it is not very
|
434
|
+
pleasing to the eye.
|
435
|
+
|
436
|
+
## Improving Colors
|
437
|
+
|
438
|
+
Let's start by trying to improve colors. For now, we will not use the jitter layer.
|
439
|
+
The previous plot has three bright colors that have no relationship between them. Is
|
440
|
+
there any obvious, or non-obvious for that matter, interpretation for the colors?
|
441
|
+
Clearly, they are just random colors selected automatically by our software. Although
|
442
|
+
those colors helped us understand the data, for a final presentation random colors
|
443
|
+
can distract the viewer.
|
444
|
+
|
445
|
+
In the following plot we use the function 'scale_fill_manual' to change
|
446
|
+
the colors of the boxes and order of labels. For colors we use shades of blue for
|
447
|
+
each dosage, with light blue ('cyan')
|
448
|
+
representing the lowest dose and deep blue ('deepskyblue4') the highest dose. Also,
|
449
|
+
the smaller value (0.5) is on
|
450
|
+
the bottom of the labels and (2) at the top. This ordering seems more natural and
|
451
|
+
matches with the actual order of the colors in the plot.
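
The two arguments play different roles, which the following plain-Ruby sketch tries to make explicit. It models, under my reading of ggplot2's behavior, that unnamed 'values' are paired with the factor levels in order, while 'breaks' only decides the order of the legend entries. The names `DOSE_LEVELS`, `fill_colors` and `legend_entries` are hypothetical and exist only for this illustration.

```ruby
# Factor levels of 'dose', in their natural order.
DOSE_LEVELS = %w[0.5 1 2].freeze

# 'values': colors are paired with levels position by position.
def fill_colors(levels, values)
  levels.zip(values).to_h
end

# 'breaks': order in which the levels are listed in the legend, top to bottom.
def legend_entries(colors_by_level, breaks)
  breaks.map { |level| [level, colors_by_level.fetch(level)] }
end

colors = fill_colors(DOSE_LEVELS, %w[cyan deepskyblue deepskyblue4])
legend_entries(colors, %w[2 1 0.5]).each do |level, color|
  puts "#{level.ljust(4)} #{color}"
end
```

In this model the lowest dose keeps the lightest color whatever the legend order is, and reversing 'breaks' only moves dose 2 to the top of the legend.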
|
452
|
+
|
453
|
+
|
454
|
+
```ruby
|
455
|
+
R.png("figures/facets_by_delivery_color2.png")
|
456
|
+
|
457
|
+
@bp = @bp +
|
458
|
+
R.scale_fill_manual(values: R.c("cyan", "deepskyblue", "deepskyblue4"),
|
459
|
+
breaks: R.c("2","1","0.5"))
|
460
|
+
|
461
|
+
puts @bp
|
462
|
+
|
463
|
+
R.dev__off
|
464
|
+
```
|
465
|
+
|
466
|
+
![](figures/facets_by_delivery_color2.png)
|
467
|
+
|
468
|
+
|
469
|
+
## Violin Plot and Jitter
|
470
|
+
|
471
|
+
The boxplot with jitter did look a bit overwhelming. The next plot uses a variation of
|
472
|
+
a boxplot known as a _violin plot_ with jittered data.
|
473
|
+
|
474
|
+
[From Wikipedia](https://en.wikipedia.org/wiki/Violin_plot)
|
475
|
+
|
476
|
+
|
477
|
+
> A violin plot is a method of plotting numeric data. It is similar to a box plot with
|
478
|
+
> a rotated kernel density plot on each side.
|
479
|
+
>
|
480
|
+
> A violin plot has four layers. The outer shape represents all possible results, with
|
481
|
+
> thickness indicating how common. (Thus the thickest section represents the mode average.)
|
482
|
+
> The next layer inside represents the values that occur 95% of the time.
|
483
|
+
> The next layer (if it exists) inside represents the values that occur 50% of the time.
|
484
|
+
> The central dot represents the median average value.
|
485
|
+
|
486
|
+
|
487
|
+
```ruby
|
488
|
+
R.png("figures/violin_with_jitter.png")
|
489
|
+
|
490
|
+
@violin = @base_tooth + R.geom_violin(E.aes(fill: :dose)) +
|
491
|
+
R.facet_grid(+:all =~ +:supp) +
|
492
|
+
R.geom_jitter(shape: 23, color: "cyan3", size: 1) +
|
493
|
+
R.scale_fill_manual(values: R.c("cyan", "deepskyblue", "deepskyblue4"),
|
494
|
+
breaks: R.c("2","1","0.5"))
|
495
|
+
|
496
|
+
puts @violin
|
497
|
+
|
498
|
+
R.dev__off
|
499
|
+
```
|
500
|
+
|
501
|
+
![](figures/violin_with_jitter.png)
|
502
|
+
|
503
|
+
This plot is an alternative to the original boxplot. For the final presentation, it is
|
504
|
+
important to think about which graphic will be best understood by our audience. A violin plot
|
505
|
+
is a lesser-known plot and could add mental overhead. Yet, in my opinion, it does look a little
|
506
|
+
bit better than the boxplot and provides even more information than the boxplot with jitter.
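
The outline of a violin is a kernel density estimate of the data, drawn on both sides of the axis. The sketch below computes such an estimate in plain Ruby with a Gaussian kernel, to show what the width of the violin encodes. It is an illustration only: `gaussian_kde` is a hypothetical helper, the bandwidth of 1.5 is an arbitrary choice, and the input is just the six 'len' values printed earlier by 'head'.

```ruby
# Gaussian kernel density estimate at position 'at'.
# 'bandwidth' controls the smoothing; larger values give a smoother outline.
def gaussian_kde(samples, at:, bandwidth:)
  scale = 1.0 / (samples.size * bandwidth * Math.sqrt(2 * Math::PI))
  scale * samples.sum { |s| Math.exp(-0.5 * ((at - s) / bandwidth)**2) }
end

# The six 'len' values printed earlier by 'head'.
len = [4.2, 11.5, 7.3, 5.8, 6.4, 10.0]

# Half-width of the violin at a few lengths: wider where values are denser.
[2.0, 6.0, 10.0, 14.0].each do |at|
  density = gaussian_kde(len, at: at, bandwidth: 1.5)
  puts format("len=%4.1f %s", at, "#" * (density * 200).round)
end
```

The bar is longest near a length of 6, where four of the six values cluster, which is where a violin drawn from these values would be thickest.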
|
507
|
+
|
508
|
+
## Adding Decoration
|
509
|
+
|
510
|
+
Our final plot is starting to take shape, but a presentation plot should have at least a
|
511
|
+
title, labels on the axis and maybe some other decorations. Let's start adding those.
|
512
|
+
Since decoration requires more graph area, this new plot has a 'width' and 'height'
|
513
|
+
specification. When there is no specification, the default values for width and height are
|
514
|
+
480 pixels.
|
515
|
+
|
516
|
+
The 'labs' function adds the required decorations. In this example we use 'title', 'subtitle',
|
517
|
+
'x' for the $x$ axis label, 'y' for the $y$ axis label, and 'caption' for information
|
518
|
+
about the plot.
|
519
|
+
|
520
|
+
|
521
|
+
```ruby
|
522
|
+
R.png("figures/facets_with_decorations.png", width: 540, height: 560)
|
523
|
+
|
524
|
+
caption = <<-EOT
|
525
|
+
Length of odontoblasts in 60 guinea pigs.
|
526
|
+
Each animal received one of three dose levels of vitamin C.
|
527
|
+
EOT
|
528
|
+
|
529
|
+
@decorations =
|
530
|
+
R.labs(title: "Tooth Growth: Length by Dose",
|
531
|
+
subtitle: "Faceted by delivery method, (OJ) or (VC)",
|
532
|
+
x: "Dose (mg)", y: "Teeth length",
|
533
|
+
caption: caption)
|
534
|
+
|
535
|
+
puts @bp + @decorations
|
536
|
+
|
537
|
+
R.dev__off
|
538
|
+
```
|
539
|
+
|
540
|
+
![](figures/facets_with_decorations.png)

## The Corp Theme

We are almost done. But the plot does not yet look nice to the eye. We are still distracted
by many aspects of the graph. First, the black font color does not look good. Then the
plot background, borders and grids all add clutter to the plot.

We will now define our corporate theme. In this theme, we remove borders and grids. The
background is left for faceted plots but removed for non-faceted plots. Font colors are
a shade of blue (color: '#000080'). Axis labels are moved near the end of the axis and
written in 'bold'.

```ruby
module CorpTheme

  R.install_and_loads 'RColorBrewer'

  #---------------------------------------------------------------------------------
  # Builds a text element in the corporate color. 'face' can be "plain", "bold",
  # "italic" or "bold.italic"
  #---------------------------------------------------------------------------------

  def self.text_element(size, face: "plain", hjust: nil)
    E.element_text(color: "#000080",
                   face: face,
                   size: size,
                   hjust: hjust)
  end

  #---------------------------------------------------------------------------------
  # Defines the plot theme (visualization). In this theme we remove major and minor
  # grids, borders and background. We also turn off scientific notation.
  # 'faceted' is a keyword argument, so that calls such as
  # global_theme(faceted: true) bind to it directly.
  #---------------------------------------------------------------------------------

  def self.global_theme(faceted: false)

    R.options(scipen: 999)      # turn off scientific notation like 1e+48
    # R.theme_set(R.theme_bw)

    # remove major grids
    gb = R.theme(panel__grid__major: E.element_blank())
    # remove minor grids
    gb = gb + R.theme(panel__grid__minor: E.element_blank)
    # remove border
    gb = gb + R.theme(panel__border: E.element_blank)
    # remove background. When working with faceted graphs, the background makes
    # it easier to see each facet, so leave it
    gb = gb + R.theme(panel__background: E.element_blank) unless faceted
    # change axis font
    gb = gb + R.theme(axis__text: text_element(8))
    # change axis title font
    gb = gb + R.theme(axis__title: text_element(10, face: "bold", hjust: 1))
    # change font of title
    gb = gb + R.theme(title: text_element(12, face: "bold"))
    # change font of subtitle
    gb = gb + R.theme(plot__subtitle: text_element(9))
    # change font of captions
    gb = gb + R.theme(plot__caption: text_element(8))

    # return the accumulated theme
    gb
  end

end
```
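
The plots that follow call the theme as `CorpTheme.global_theme(faceted: true)`. How Ruby binds that argument depends on whether the method declares `faceted` as a positional default or as a keyword. The sketch below is plain Ruby with hypothetical stand-in methods (not Galaaz code) that simply return whatever they received.

```ruby
# Hypothetical stand-ins for a theme builder; they only report their argument.

# Positional parameter with a default value.
def positional_theme(faceted = false)
  faceted
end

# Keyword parameter with a default value.
def keyword_theme(faceted: false)
  faceted
end

# With a positional parameter, 'faceted: false' arrives as a Hash,
# and any Hash is truthy in Ruby.
p positional_theme(faceted: false).class
# With a keyword parameter, the value itself is bound.
p keyword_theme(faceted: false)
```

With a positional parameter, `faceted: false` would still count as true in an `if` or `unless` test, so the keyword form is the safer choice whenever `false` may be passed explicitly.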

## Final Box Plot

Here is our final boxplot, without jitter.

```ruby
R.png("figures/final_box_plot.png", width: 540, height: 560)

puts @bp + @decorations + CorpTheme.global_theme(faceted: true)

R.dev__off
```

![](figures/final_box_plot.png)

## Final Violin Plot

Here is the final violin plot, with jitter and the same look and feel as the corporate
boxplot.

```ruby
R.png("figures/final_violin_plot.png", width: 540, height: 560)

puts @violin + @decorations + CorpTheme.global_theme(faceted: true)

R.dev__off
```

![](figures/final_violin_plot.png)

## Another View

Finally, here is a last plot, with the same look and feel as before but faceted by
dose and not by supplement.

```ruby
R.png("figures/facet_by_dose.png", width: 540, height: 560)

caption = <<-EOT
Length of odontoblasts in 60 guinea pigs.
Each animal received one of three dose levels of vitamin C.
EOT

@bp = @tooth_growth.ggplot(E.aes(x: :supp, y: :len, group: :supp)) +
  R.geom_boxplot(E.aes(fill: :supp)) + R.facet_grid(+:all =~ +:dose) +
  R.scale_fill_manual(values: R.c("cyan", "deepskyblue4")) +
  R.labs(title: "Tooth Growth: Length by Dose",
         subtitle: "Faceted by dose",
         x: "Delivery method", y: "Teeth length",
         caption: caption) +
  CorpTheme.global_theme(faceted: true)
puts @bp

R.dev__off
```

![](figures/facet_by_dose.png)

# Conclusion

Galaaz couples Ruby and R so tightly that Ruby developers do not need to be aware
of the executing R engine. For the Ruby developer, the existence of R
is of no consequence: they are just coding in Ruby. On the other hand, for the R
developer, migration to Ruby is a matter of small syntactic changes and a very gentle
learning curve. As R developers become more proficient in Ruby, they can start using
'classes', 'modules', 'procs' and 'lambdas'.

This coupling shows the power of GraalVM and the Truffle polyglot environment. Trying to
bring the power of R to Ruby from scratch would be an enormous endeavour and would
probably never be accomplished. Today's data scientists would certainly stick with either
Python or R. Now, both the Ruby and R communities might benefit from this marriage. Also,
the process used to couple Ruby and R can also be applied to couple Ruby and JavaScript and
maybe also Ruby and Python. In a polyglot world a *uniglot* language might be extremely
relevant.

From the perspective of performance, GraalVM and Truffle promise speedups that could
exceed 10 times, both for [FastR](https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb)
and for [TruffleRuby](https://rubykaigi.org/2018/presentations/eregontp.html).

This article has shown how to improve a plot step by step. Starting from a very simple
boxplot with all default configurations, we moved slowly to our final plot. The important
point here is not whether the final plot is actually beautiful, but that there is a process
of small, incremental improvements that can be followed to get a final plot ready for
presentation.

Finally, this whole article was written in rmarkdown and compiled to HTML by _gknit_, an
application that wraps _knitr_ and allows documenting Ruby code. This application can
be of great help for any Rubyist trying to write articles, blogs or documentation for Ruby.

# Installing Galaaz

## Prerequisites

* GraalVM (>= rc8): https://github.com/oracle/graal/releases
* TruffleRuby
* FastR

The following R packages will be automatically installed when necessary, but could be installed prior
to using gKnit if desired:

* ggplot2
* gridExtra
* knitr

Installation of R packages requires a development environment and can be time consuming. On Linux,
the GNU compiler and tools should be enough. I am not sure what is needed on the Mac.

## Preparation

* `gem install galaaz`

## Usage

* `gknit <filename>`
* In a script, add: `require 'galaaz'`
