raka 0.3.1 → 0.3.4
- checksums.yaml +4 -4
- data/README.md +343 -119
- data/VERSION +1 -1
- data/bin/raka +37 -12
- data/lib/{compile.rb → raka/compile.rb} +10 -4
- data/lib/{interface.rbs → raka/interface.rbs} +0 -0
- data/lib/{lang → raka/lang}/psql/impl.rb +0 -0
- data/lib/{lang → raka/lang}/python/impl.rb +0 -0
- data/lib/raka/lang/r/impl.rb +23 -0
- data/lib/{lang → raka/lang}/shell/impl.rb +0 -0
- data/lib/{output_type.rb → raka/output_type.rb} +0 -0
- data/lib/{protocol.rb → raka/protocol.rb} +0 -0
- data/lib/{token.rb → raka/token.rb} +0 -0
- data/lib/raka.rb +9 -4
- metadata +13 -14
- data/lib/lang/r/impl.rb +0 -38
- data/lib/lang/r/io.R +0 -113
- data/lib/temp.json +0 -9167
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 895e1c40801f38ff5168449c25333ee40e9aca718f6d082a90d8898a75513373
|
4
|
+
data.tar.gz: 9d8228de56582ef7202f30d10f32a285ad96197599344bd9a471dabb1699bdd3
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 3fe0e6223cc75ebe754f353155a1c78c96d25ae55bbffc108a6d8485038d06ad0cebda5be8093d50d0cb24e12b670ed0335b07b880816f5913e1f2712675e30c
|
7
|
+
data.tar.gz: 245dbe2ed2b8f0966b0b3a980af00dda5cf084838228d40144e020de2a1545e3b856d90ebe51ab0419c53ca5a93a41929023903eb5171500e22c65ca44e043b1
|
data/README.md
CHANGED
@@ -1,51 +1,31 @@
|
|
1
|
-
**Raka** is a **DSL**(Domain Specific Language) on top of **Rak**e for defining and running d**a**t**a** processing workflows. Raka is specifically designed for data processing with improved pattern matching, scopes, language extensions and lots of conventions to prevent verbosity.
|
2
|
-
|
3
|
-
## Why Raka
|
4
|
-
|
5
|
-
Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
|
6
|
-
|
7
|
-
1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
|
8
|
-
2. Extensible and context-aware protocol architecture.
|
9
|
-
3. Multilingual. Other programming languages can be easily embedded.
|
10
|
-
4. Auto dependency and naming by conventions.
|
11
|
-
5. Scopes to ease comparative studies.
|
12
|
-
6. Terser syntax.
|
13
|
-
|
14
|
-
... and more.
|
15
|
-
|
16
|
-
Compared to more comlex, GUI-based solutions (perhaps classified as scientific-workflow software) like Kepler, etc., Raka has the following advantages:
|
17
|
-
|
18
|
-
1. Lightweight and easy to setup, especially on platforms with ruby preinstalled.
|
19
|
-
2. Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
|
20
|
-
3. Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
|
21
|
-
4. Expressive so a few lines of code can replace many manual operations.
|
1
|
+
**Raka** is a **DSL** (Domain Specific Language) on top of **Rak**e for defining rules and running d**a**t**a** processing workflows. Raka is specifically designed for data processing, with improved pattern matching, scopes, language extensions and many conventions to reduce verbosity.
|
22
2
|
|
23
3
|
## Installation
|
24
4
|
|
25
|
-
Raka is a library based on rake. Though rake is cross platform, raka may not work on Windows since it relies some shell facilities.
|
5
|
+
Raka is a library based on rake. Though rake is cross-platform, raka may not work on Windows since it relies on some shell facilities. Ruby is available for most \*nix systems including Mac OS X, so the only step is to install raka:
|
26
6
|
|
27
7
|
```bash
|
28
8
|
gem install raka
|
29
9
|
```
|
30
10
|
|
31
|
-
##
|
11
|
+
## Quick Start
|
32
12
|
|
33
13
|
First create a file named `main.raka` and import & initialize the DSL
|
34
14
|
|
35
15
|
```ruby
|
36
16
|
require 'raka'
|
37
17
|
|
38
|
-
dsl =
|
39
|
-
output_types: [:txt
|
40
|
-
input_types: [:txt
|
18
|
+
dsl = Raka.new(self,
|
19
|
+
output_types: [:txt],
|
20
|
+
input_types: [:txt]
|
41
21
|
)
|
42
22
|
```
|
43
23
|
|
44
24
|
Then the code below will define two simple rules:
|
45
25
|
|
46
26
|
```ruby
|
47
|
-
txt.
|
48
|
-
txt.sort = [txt.input] | shell* "cat
|
27
|
+
txt._.first50 = shell* "cat $< | head -n 50 > $@"
|
28
|
+
txt.sort = [txt.input] | shell* "cat $(dep0) | sort -rn > $@"
|
49
29
|
```
|
50
30
|
|
51
31
|
For testing let's prepare an input file named `input.txt`:
|
@@ -54,106 +34,234 @@ For testing let's prepare an input file named `input.txt`:
|
|
54
34
|
seq 1000 > input.txt
|
55
35
|
```
|
56
36
|
|
57
|
-
|
37
|
+
Invoke:
|
38
|
+
|
39
|
+
```bash
|
40
|
+
raka first50__sort.txt
|
41
|
+
```
|
42
|
+
|
43
|
+
Raka will read data from *input.txt*, sort the numbers in descending order and copy the first 50 lines to *first50__sort.txt*.
|
58
44
|
|
59
45
|
The workflow here is as follows:
|
60
46
|
|
61
|
-
1. Try to find
|
62
|
-
2. Rule `txt.sort.first50` matched
|
63
|
-
3.
|
64
|
-
4. Rule `txt.sort` matched
|
65
|
-
5.
|
66
|
-
6.
|
67
|
-
7. Run rule `txt.sort` and create
|
68
|
-
8. Run rule `txt.sort.first50` and create
|
47
|
+
1. Try to find *first50__sort.txt*: it does not exist.
|
48
|
+
2. Rule with target `txt.sort.first50` matched.
|
49
|
+
3. Look for the input file *sort.txt*: it does not exist.
|
50
|
+
4. Rule with target `txt.sort` matched.
|
51
|
+
5. This rule has no input, but it depends on the target `txt.input`.
|
52
|
+
6. File *input.txt* exists. Use it.
|
53
|
+
7. Run rule `txt.sort` and create *sort.txt*.
|
54
|
+
8. Run rule `txt.sort.first50` and create *first50__sort.txt*.
|
69
55
|
|
70
|
-
|
56
|
+
We may want to skip the sort step, and invoke:
|
71
57
|
|
72
|
-
```
|
73
|
-
|
74
|
-
|
75
|
-
DB = 'osm'
|
76
|
-
HOST = 'localhost'
|
77
|
-
PORT = 5432
|
58
|
+
```bash
|
59
|
+
raka first50__input.txt
|
60
|
+
```
|
78
61
|
|
79
|
-
|
62
|
+
Raka will read data from *input.txt* and copy the first 50 lines to *first50__input.txt*.
|
80
63
|
|
81
|
-
|
64
|
+
This illustrates some basic ideas but may not be particularly interesting. The following is a slightly more complex example which covers more features.
|
65
|
+
|
66
|
+
```ruby
|
67
|
+
require 'raka'
|
82
68
|
|
83
|
-
|
84
|
-
|
85
|
-
|
86
|
-
|
69
|
+
dsl = Raka.new(self,
|
70
|
+
output_types: %i[csv pdf],
|
71
|
+
input_types: %i[csv],
|
72
|
+
lang: ['lang/shell', 'lang/python'])
|
73
|
+
|
74
|
+
py_template = <<~PYTHON
|
75
|
+
import os.path
|
76
|
+
import pandas as pd
|
77
|
+
|
78
|
+
def write_variety(input, output, variety):
|
79
|
+
print(variety)
|
80
|
+
folder = os.path.dirname(output)
|
81
|
+
if len(folder) > 0:
|
82
|
+
os.makedirs(folder, exist_ok=True)
|
83
|
+
df = pd.read_csv(input)
|
84
|
+
df[df['class'] == variety].to_csv(output)
|
85
|
+
|
86
|
+
<code>
|
87
|
+
PYTHON
|
88
|
+
py.config script_template: py_template
|
89
|
+
|
90
|
+
groups = %i[virginica versicolor]
|
91
|
+
|
92
|
+
csv(groups.join('|')).iris =
|
93
|
+
[csv.iris_all] | py* %(write_variety('$<', '$@', 'Iris-$(target_scope)'))
|
94
|
+
|
95
|
+
csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)
|
96
|
+
|
97
|
+
dsl.scope(*groups) do
|
98
|
+
pdf.iris.plot['plot_(\S+)_(\S+)'] = py do |rask|
|
99
|
+
<<-PYTHON
|
100
|
+
import seaborn as sns
|
101
|
+
from matplotlib import pyplot as plt
|
102
|
+
|
103
|
+
df = pd.read_csv('#{rask.input}')
|
104
|
+
ax = sns.displot(x=df['#{rask.captures.plot0}#{rask.captures.plot1}'])
|
105
|
+
ax.set_axis_labels('#{rask.captures.plot0} #{rask.captures.plot1}', 'frequency')
|
106
|
+
plt.savefig('#{rask.output}')
|
107
|
+
PYTHON
|
108
|
+
end
|
109
|
+
end
|
110
|
+
|
111
|
+
task figures: (groups.product(%w[sepal petal], %w[length width]).map do |info|
|
112
|
+
"_out/#{info[0]}/plot_#{info[1]}_#{info[2]}__iris.pdf"
|
113
|
+
end)
|
87
114
|
```
|
88
115
|
|
89
|
-
|
116
|
+
In this example, we download a classical dataset named *iris.csv*, use python code to extract two varieties, *virginica* and *versicolor*, and generate frequency histograms for both varieties.
|
90
117
|
|
91
|
-
|
92
|
-
draw_stat_snapshot <- function(d) { ... }
|
93
|
-
draw_user_trend <- function(d) { ... }
|
94
|
-
```
|
118
|
+
To invoke the script, we run in terminal:
|
95
119
|
|
96
|
-
|
120
|
+
```bash
|
121
|
+
raka -j 8 -v figures
|
122
|
+
```
|
97
123
|
|
98
|
-
|
99
|
-
|
100
|
-
|
124
|
+
The option `-j 8` indicates that we want to parallelize the tasks with at most 8 concurrent processes where possible. The option `-v` lets raka print detailed information so we can view the generated python code.
|
125
|
+
|
126
|
+
The tool will then proceed as follows:
|
127
|
+
|
128
|
+
1. Match `figures` with the last rule, which is a normal rake task.
|
129
|
+
2. The prerequisites include 8 figures, none of which exists yet. Take *_out/versicolor/plot_petal_length__iris.pdf* as an example from now on.
|
130
|
+
3. Rule `pdf.iris.plot['plot_(\S+)_(\S+)']...` is matched, where "petal" is bound to `plot0` and "length" is bound to `plot1`.
|
131
|
+
4. Neither of the 2 possible input files, *_out/versicolor/iris.csv* and *_out/versicolor/iris.pdf*, can be found. But the rule `csv(groups.join('|')).iris = ...` (`csv('virginica|versicolor').iris`) can be matched for the former, where the target scope is matched as `versicolor`.
|
132
|
+
5. The only dependency `csv.iris_all` is resolved as *_out/iris_all.csv*. The path does not contain `versicolor` since the target scope only applies to the target.
|
133
|
+
6. Rule `csv.iris_all` is matched without any dependencies.
|
134
|
+
7. The protocol `shell` replaces the automatic variable `$@` with `_out/iris_all.csv` to build a curl command and download the iris dataset from datahub.io.
|
135
|
+
8. Now raka goes back to generate output *_out/versicolor/iris.csv*, by executing the code generated by the `python` protocol, which extracts rows where the class field equals "Iris-versicolor".
|
136
|
+
9. Raka goes back to generate output *_out/versicolor/plot_petal_length__iris.pdf* by executing the code generated by the `python` protocol, which draws a histogram plot to depict the distribution of petal length.
|
137
|
+
10. Raka continues to generate plot files until all 8 figures exist.
|
138
|
+
|
139
|
+
As an example, the python code generated in step 9 is:
|
140
|
+
|
141
|
+
```python
|
142
|
+
import sys
|
143
|
+
import os.path
|
144
|
+
import pandas as pd
|
145
|
+
|
146
|
+
def write_variety(input, output, variety):
|
147
|
+
print(variety)
|
148
|
+
folder = os.path.dirname(output)
|
149
|
+
if len(folder) > 0:
|
150
|
+
os.makedirs(folder, exist_ok=True)
|
151
|
+
df = pd.read_csv(input)
|
152
|
+
df[df['class'] == variety].to_csv(output)
|
153
|
+
|
154
|
+
import seaborn as sns
|
155
|
+
from matplotlib import pyplot as plt
|
156
|
+
|
157
|
+
df = pd.read_csv('_out/versicolor/iris.csv')
|
158
|
+
ax = sns.displot(x=df['petallength'])
|
159
|
+
ax.set_axis_labels('petal length', 'frequency')
|
160
|
+
plt.savefig('_out/versicolor/plot_petal_length__iris.pdf')
|
101
161
|
```
|
102
162
|
|
103
|
-
|
163
|
+
The rule-based system, the strategy of executing tasks only when necessary, and the capable host language make it fairly easy to adjust the experiments during exploration. For example, if we also want to apply the experiments to the *setosa* class, we can just change the line
|
164
|
+
|
165
|
+
`groups = %i[virginica versicolor]`
|
166
|
+
|
167
|
+
to
|
168
|
+
|
169
|
+
`groups = %i[virginica versicolor setosa]`
|
170
|
+
|
171
|
+
The command `raka -j 8 -v figures` will generate 4 figures for the new class, without re-executing tasks for the other two classes.
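To see why only four new figures appear, the prerequisite list of the `figures` task can be reproduced in plain Ruby. This is a minimal sketch that mirrors the `product` call from the example above; `figure_paths` is a hypothetical helper name and not part of raka.

```ruby
# Hypothetical helper (not part of raka): builds the prerequisite list the
# same way the `figures` task in the example does.
def figure_paths(groups)
  groups.product(%w[sepal petal], %w[length width]).map do |group, part, measure|
    "_out/#{group}/plot_#{part}_#{measure}__iris.pdf"
  end
end

before = figure_paths(%i[virginica versicolor])
after  = figure_paths(%i[virginica versicolor setosa])

puts before.size     # 8 figures for two classes
puts(after - before) # the 4 paths that only exist for setosa
```

Since the files for the other two classes already exist and are up to date, only the paths in `after - before` need to be built.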
|
172
|
+
|
173
|
+
## Why Raka
|
174
|
+
|
175
|
+
Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
|
176
|
+
|
177
|
+
1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
|
178
|
+
2. Extensible and context-aware protocol architecture.
|
179
|
+
3. Multilingual. Other programming languages can be easily embedded.
|
180
|
+
4. Auto dependency and naming by conventions.
|
181
|
+
5. Scopes to ease comparative studies.
|
182
|
+
6. Terser syntax.
|
183
|
+
|
184
|
+
... and more.
|
185
|
+
|
186
|
+
Compared to more complex, GUI-based solutions (perhaps classified as scientific-workflow software) like Kepler, etc., Raka has the following advantages:
|
187
|
+
|
188
|
+
1. Lightweight and easy to set up, especially on platforms with ruby preinstalled.
|
189
|
+
2. Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
|
190
|
+
3. Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
|
191
|
+
4. Expressive so a few lines of code can replace many manual operations.
|
192
|
+
|
193
|
+
## Documentation
|
104
194
|
|
105
|
-
|
195
|
+
### Conceptual Model
|
106
196
|
|
107
|
-
|
108
|
-
2. Rule `pdf.buildings.func['(\S+)_graph']` matched. "stat_snapshot_graph" is bound to `func` and "stat_snapshot" is bound to `func0`.
|
109
|
-
3. None of the four possible input files: _de/buildings.table_, _de/buildings.txt_, _buildings.table_, _buildings.txt_ can be found. Rule `table.buildings` is matched and the only dependecy file _admin.csv_ is found.
|
110
|
-
4. The protocol `psqlf` finds the source file _src/buildings.sql_, intepolate the options with automatic variables (`$<` as "admin.csv"), run the sql, and create a placeholder file _de/buildings.table_ afterwards.
|
111
|
-
5. Run the post-job `idx_this`, according to the rule `idx._` it will find and run _buildings_idx.sql_, then create a placeholder file _de/buildings.idx_.
|
112
|
-
6. For rule `pdf.buildings.func['(\S+)_graph']`, the R code in `%[]` is interpolated with several automatic variables (`$(input_stem)` as "buildings", `$@` as "de/stat_snapshot_graph\_\_buildings.pdf") and the variables (`func`, `func0`) bound before.
|
113
|
-
7. Run the R code. The _buildings_ table is piped into the function `draw_snapshot_graph` and then output to `ggplot_output`, which writes the graph to the specified pdf file.
|
197
|
+
A raka rule consists of target, dependencies, actions and
|
114
198
|
|
115
|
-
|
199
|
+
### Syntax Definition
|
116
200
|
|
117
|
-
It is possible to use Raka with little knowledge of ruby / rake, though minimal understandings are highly recommended. The formal syntax of rule can be defined as follows (EBNF form):
|
201
|
+
It is possible to use Raka with little knowledge of ruby / rake, though a minimal understanding is highly recommended. The formal syntax of a rule can be defined as follows (W3C EBNF form):
|
118
202
|
|
119
203
|
```ebnf
|
120
|
-
rule
|
204
|
+
rule ::= target "=" (dependencies "|")* action ("|" post_target)*
|
121
205
|
|
122
|
-
target
|
206
|
+
target ::= ext "." ltoken ("." ltoken)*
|
123
207
|
|
124
|
-
|
208
|
+
dependencies ::= "[]" | "[" dependency ("," dependency)* "]"
|
125
209
|
|
126
|
-
|
127
|
-
rexpr = ext "." rtoken {"." rtoken};
|
210
|
+
dependency ::= rexpr | template
|
128
211
|
|
129
|
-
|
130
|
-
rtoken = word | word "(" template ")";
|
212
|
+
post_target ::= rexpr | template
|
131
213
|
|
132
|
-
|
214
|
+
rexpr ::= ext "." rtoken ("." rtoken)*
|
133
215
|
|
134
|
-
|
135
|
-
|
216
|
+
ltoken ::= word | word "[" pattern "]"
|
217
|
+
rtoken ::= word | word "(" template ")"
|
218
|
+
|
219
|
+
word ::= ("_" | letter) ( letter | digit | "_" )*
|
220
|
+
|
221
|
+
action ::= ("shell" | "r" | "psql" | "py" ) ("*" template | block ) | "run" block
|
136
222
|
```
|
137
223
|
|
138
224
|
The corresponding railroad diagrams are:
|
139
225
|
|
140
|
-
|
226
|
+
**rule**
|
227
|
+
|
228
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rule.svg)
|
229
|
+
|
230
|
+
**target**
|
231
|
+
|
232
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/target.svg)
|
233
|
+
|
234
|
+
**dependencies**
|
235
|
+
|
236
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/dependencies.svg)
|
237
|
+
|
238
|
+
**dependency**
|
239
|
+
|
240
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/dependency.svg)
|
241
|
+
|
242
|
+
**post_target**
|
141
243
|
|
142
|
-
![](https://cdn.rawgit.com/yarray/raka/master/doc/
|
244
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/post_target.svg)
|
143
245
|
|
144
|
-
|
246
|
+
**rexpr**
|
145
247
|
|
146
|
-
![](https://cdn.rawgit.com/yarray/raka/master/doc/
|
248
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rexpr.svg)
|
147
249
|
|
148
|
-
|
250
|
+
**ltoken**
|
149
251
|
|
150
|
-
![](https://cdn.rawgit.com/yarray/raka/master/doc/
|
252
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/ltoken.svg)
|
151
253
|
|
152
|
-
|
254
|
+
**rtoken**
|
153
255
|
|
154
|
-
![](https://cdn.rawgit.com/yarray/raka/master/doc/
|
256
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rtoken.svg)
|
155
257
|
|
156
|
-
|
258
|
+
**word**
|
259
|
+
|
260
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/word.svg)
|
261
|
+
|
262
|
+
**action**
|
263
|
+
|
264
|
+
![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/action.svg)
|
157
265
|
|
158
266
|
The definition is concise but several details are omitted for simplicity:
|
159
267
|
|
@@ -163,15 +271,15 @@ The definition is concise but several details are omitted for simplicity:
|
|
163
271
|
4. The listed protocols are merely what we offered now. It can be greatly extended.
|
164
272
|
5. Nearly any concept in the syntax can be replaced by a suitable ruby variable.
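As a small illustration of the grammar, the `word` production can be written as a Ruby regular expression. This is only a sketch: it assumes `letter` means an ASCII letter, which the grammar does not state, and `WORD` and `word?` are hypothetical names that do not come from raka.

```ruby
# word ::= ("_" | letter) ( letter | digit | "_" )*
# Assumption: "letter" is an ASCII letter.
WORD = /\A[A-Za-z_][A-Za-z0-9_]*\z/

# Returns true when the text is a valid `word` under the assumption above.
def word?(text)
  WORD.match?(text)
end

puts word?('iris_all') # true
puts word?('_')        # true
puts word?('50top')    # false: a word cannot start with a digit
```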
|
165
273
|
|
166
|
-
|
274
|
+
### Pattern matching and template resolving
|
167
275
|
|
168
|
-
When defined a rule like `
|
276
|
+
When a rule is defined as `target = <specification>`, the left side represents a pattern and the right side contains specifications for extra dependencies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules are examined one by one until a rule is matched. The matching process is based on regular expressions and also supports named captures, so that some variables can be extracted for use in the right side.
|
169
277
|
|
170
|
-
The specifications on the right side of a rule can
|
278
|
+
The specifications on the right side of a rule can contain templates. The "holes" in the templates will be filled by automatic variables and by variables captured when matching the left side.
|
171
279
|
|
172
|
-
|
280
|
+
#### Pattern matching
|
173
281
|
|
174
|
-
To match a given _file_ with a `
|
282
|
+
To match a given _file_ with a `target`, the extension will be matched first. The substrings of the file name between "\_\_" are mapped to tokens separated by `.`, in reverse order. After that, each substring is matched to the corresponding token or the regex in `[]`. For example, the rule
|
175
283
|
|
176
284
|
```ruby
|
177
285
|
pdf.buildings.indicator['\S+'].top['top_(\d+)']
|
@@ -186,26 +294,147 @@ can match "top_50\_\_node_num\_\_buildings.pdf". The logical process is:
|
|
186
294
|
- `top_(\d+) ~ top_50`
|
187
295
|
3. Two levels of captures are made. First, 'node_num' is captured as `indicator`, 'top_50' is captured as `top`; Second, '50' is captured as `top0` since `\d+` is wrapped in parenthesis and is the first.
|
188
296
|
|
189
|
-
One can write special token `_`
|
297
|
+
One can write the special token `_` to match any token. Since raka uses prefix matching, something like `token0['']` can also match any token and additionally capture it in `token0`. The end-of-line symbol `$` can be used to match the whole token, e.g., `token0['word$']` will not match `word_bench`.
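The matching steps described in this section can be sketched in plain Ruby. This is not raka's implementation: `match_target` is a hypothetical helper, the rule is passed as `[name, pattern]` pairs (with `nil` for a plain token), and capturing the whole substring under the token name is an assumption based on the example above.

```ruby
# Hypothetical helper, not raka's implementation.
# tokens: [name, pattern] pairs in rule order; pattern is nil for a plain token.
def match_target(file, ext, tokens)
  # The extension is matched first.
  return nil unless File.extname(file) == ".#{ext}"

  # Substrings between "__" map to the tokens in reverse order.
  parts = File.basename(file, ".#{ext}").split('__').reverse
  return nil unless parts.size == tokens.size

  captures = {}
  tokens.zip(parts).each do |(name, pattern), part|
    if pattern.nil?
      # A plain token must equal the substring; `_` matches anything.
      return nil unless name == '_' || name == part
      next
    end
    # Anchor only at the start of the substring: prefix matching.
    m = /\A(?:#{pattern})/.match(part)
    return nil unless m

    captures[name] = part
    m.captures.each_with_index { |text, i| captures["#{name}#{i}"] = text }
  end
  captures
end

rule = [['buildings', nil], ['indicator', '\S+'], ['top', 'top_(\d+)']]
p match_target('top_50__node_num__buildings.pdf', 'pdf', rule)
```

With this sketch, `[['token0', 'word']]` matches *word_bench.txt* because only the prefix is checked, while `[['token0', 'word$']]` does not.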
|
190
298
|
|
191
|
-
|
299
|
+
#### Template resolving
|
192
300
|
|
193
301
|
In some places of `rexpr`, templates can be written instead of strings, so that they can represent different values at runtime. There are two types of variables that can be used in templates. The first is automatic variables, which are just like `$@` in Make or `task.name` in Rake. We even preserve some Make conventions for easier migration. All automatic variables begin with `$`. The possible automatic variables are:
|
194
302
|
|
195
|
-
| symbol
|
196
|
-
|
|
197
|
-
|
|
198
|
-
|
|
199
|
-
| \$(
|
200
|
-
| \$(
|
303
|
+
| symbol | description |
|
304
|
+
| ----------------------------------------- | ------------------------------------------------------------------------------------------------------ |
|
305
|
+
| \$@, \$(output) | the output file |
|
306
|
+
| \$<, \$(input) | the input file defined in the chained target |
|
307
|
+
| \$^, \$(deps) | all dependencies joined by commas (including input) |
|
308
|
+
| \$(dep0), \$(dep1), ... | the i-th dependency (input is \$(dep0)) |
|
309
|
+
| \$(input_stem) | stem of the input file |
|
310
|
+
| \$(output_stem) | stem of the output file |
|
311
|
+
| \$(func) | the token added to input to generate output, e.g., stat in csv.data.stat |
|
312
|
+
| \$(ext) | extension of the output file |
|
313
|
+
| \$(scope) | scope for current task, i.e. the common directory for output, input and dependencies |
|
314
|
+
| \$(target_scope) | the inline scope defined in target |
|
315
|
+
| \$(target_scope0), \$(target_scope1), ... | the i-th captured value by inline scope defined in target |
|
316
|
+
| \$(rule_scope0), \$(rule_scope1), ... | the i-th scope defined at rule level by nested calls of the dsl.scope function (i increases from inside out) |
|
317
|
+
|
318
|
+
The other type of variables are those captured during pattern matching, which can be referred to using `%{var}`. In the example of the [pattern matching](#pattern-matching) section, `%{indicator}` will be replaced by `node_num`, `%{top}` will be replaced by `top_50` and `%{top0}` will be replaced by `50`. In that case, a template such as `'calculate top %{top0} of %{indicator} for $@'` will be resolved as `'calculate top 50 of node_num for top_50__node_num__buildings.pdf'`.
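The substitution can be sketched in a few lines of Ruby. This is an illustration only, not raka's resolver: `resolve_template` is a hypothetical name, captured variables and automatic variables are passed as plain hashes, and only `%{var}`, `$(name)`, `$@` and `$<` are handled.

```ruby
# Hypothetical helper, not raka's resolver.
# captures: variables bound during pattern matching
# auto:     automatic variables keyed by their long names ("output", "input", ...)
def resolve_template(template, captures, auto)
  text = template.gsub(/%\{(\w+)\}/) { captures.fetch(Regexp.last_match(1)) }
  text = text.gsub(/\$\((\w+)\)/) { auto.fetch(Regexp.last_match(1)) }
  # Short Make-style aliases for the output and input files.
  text.gsub('$@') { auto.fetch('output') }
      .gsub('$<') { auto.fetch('input') }
end

captures = { 'indicator' => 'node_num', 'top' => 'top_50', 'top0' => '50' }
auto     = { 'output' => 'top_50__node_num__buildings.pdf' }

puts resolve_template('calculate top %{top0} of %{indicator} for $@', captures, auto)
# calculate top 50 of node_num for top_50__node_num__buildings.pdf
```

Using `fetch` makes an unknown variable raise a `KeyError` instead of silently producing an empty string, which is a design choice of this sketch rather than documented raka behaviour.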
|
319
|
+
|
320
|
+
Templates can appear in various places. For dependencies and post targets, tokens with parentheses can contain templates, like `csv._('%{indicator}')`. The symbol of a token with parentheses is of no use and is generally replaced with an underscore. It is also possible to write a template literal directly, e.g. `'%{indicator}.csv'`. Templates can also be applied in actions, but that depends on the implementations of the protocols.
|
321
|
+
|
322
|
+
### Actions and protocols
|
323
|
+
|
324
|
+
Raka invokes **actions** when all inputs and dependencies are present. Generally, users define an action that generates the output. To maximize flexibility, users can feed code in an arbitrary programming language to the corresponding **protocol**. The protocol will then transform and execute the code. Raka natively supports the host (ruby) protocol and several foreign protocols including shell, python, psql, and r.
|
325
|
+
|
326
|
+
The host protocol is special and just executes the given ruby block. All other protocols accept either a templated code string given with the asterisk operator or a block producing a templated code string. Examples for each protocol are given below.
|
327
|
+
|
328
|
+
In the host protocol and the block versions of other protocols, a raka task (the *rask* variable) is provided, which offers the following properties:
|
329
|
+
|
330
|
+
| property | description |
|
331
|
+
| --------------------- | ------------------------------------------------------------------------------------ |
|
332
|
+
| output | the output file |
|
333
|
+
| input | the input file defined in the chained target |
|
334
|
+
| deps | the dependencies (input is deps[0]) |
|
335
|
+
| func | the token added to input to generate output, e.g., stat in csv.data.stat |
|
336
|
+
| ext | extension of the output file |
|
337
|
+
| captures | captured text during pattern matching, key-value |
|
338
|
+
| scope | scope for current task, i.e. the common directory for output, input and dependencies |
|
339
|
+
| target_scope | the inline scope defined in target |
|
340
|
+
| target_scope_captures | captured values by inline scope defined in target |
|
341
|
+
| rule_scopes | the scopes defined at rule level by nested calls of the dsl.scope function |
|
342
|
+
|
343
|
+
```ruby
|
344
|
+
require 'raka'
|
345
|
+
require 'csv'
require 'ostruct' # provides OpenStruct, used in the psql example below
|
346
|
+
|
347
|
+
dsl = Raka.new(
|
348
|
+
self, output_types: %i[table view csv],
|
349
|
+
lang: ['lang/psql', 'lang/shell', 'lang/python', 'lang/r']
|
350
|
+
)
|
351
|
+
|
352
|
+
csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)
|
353
|
+
|
354
|
+
# host(ruby) protocol
|
355
|
+
csv.rb_out = [csv.iris_all] | run do |rask|
|
356
|
+
in_f = File.open(rask.deps[0])
|
357
|
+
out_f = File.open(rask.output, 'w')
|
358
|
+
options = { headers: true, return_headers: true, write_headers: true }
|
359
|
+
CSV.filter(in_f, out_f, options) do |row|
|
360
|
+
row['class'] == 'Iris-versicolor'
|
361
|
+
end
|
362
|
+
end
|
363
|
+
|
364
|
+
# python protocol
|
365
|
+
csv.py_out = [csv.iris_all] | py* %(
|
366
|
+
import pandas as pd
|
367
|
+
df = pd.read_csv('$(dep0)')
|
368
|
+
df[df['class'] == 'Iris-versicolor'].to_csv('$@')
|
369
|
+
)
|
370
|
+
|
371
|
+
# python protocol (block)
|
372
|
+
csv.py_out2 = [csv.iris_all] | py do |rask|
|
373
|
+
<<-PYTHON
|
374
|
+
import pandas as pd
|
375
|
+
df = pd.read_csv('#{rask.deps[0]}')
|
376
|
+
df[df['class'] == 'Iris-versicolor'].to_csv('#{rask.output}')
|
377
|
+
PYTHON
|
378
|
+
end
|
379
|
+
|
380
|
+
# r protocol
|
381
|
+
csv.r_out = [csv.iris_all] | r* %(
|
382
|
+
df <- read.csv("$(dep0)")
|
383
|
+
write.csv(df[(df$class == "Iris-versicolor"),], file="$@")
|
384
|
+
)
|
201
385
|
|
202
|
-
|
386
|
+
# r protocol (block)
|
387
|
+
csv.r_out2 = [csv.iris_all] | r do |rask|
|
388
|
+
<<-R
|
389
|
+
df <- read.csv("#{rask.deps[0]}")
|
390
|
+
write.csv(df[(df$class == "Iris-versicolor"),], file="#{rask.output}")
|
391
|
+
R
|
392
|
+
end
|
393
|
+
|
394
|
+
# shell protocol
|
395
|
+
csv.shell_out = [csv.iris_all] | shell* %(
|
396
|
+
cat <(head -1 $(dep0)) <(grep "Iris-versicolor" $(dep0)) > $@
|
397
|
+
)
|
398
|
+
|
399
|
+
# shell protocol (block)
|
400
|
+
csv.shell_out2 = [csv.iris_all] | shell do |rask|
|
401
|
+
"cat <(head -1 #{rask.deps[0]}) <(grep 'Iris-versicolor' #{rask.deps[0]}) > #{rask.output}"
|
402
|
+
end
|
403
|
+
|
404
|
+
# psql protocol
|
405
|
+
pg = OpenStruct.new(
|
406
|
+
user: 'postgres',
|
407
|
+
port: 5433,
|
408
|
+
host: '127.0.0.1',
|
409
|
+
db: 'postgres',
|
410
|
+
password: 'postgres'
|
411
|
+
)
|
412
|
+
psql.config conn: pg, create: :mview
|
413
|
+
|
414
|
+
table.iris_all = [csv.iris_all] | psql(create: nil)* %(
|
415
|
+
DROP TABLE IF EXISTS $(output_stem);
|
416
|
+
CREATE TABLE $(output_stem) (
|
417
|
+
sepallength float,
|
418
|
+
sepalwidth float,
|
419
|
+
petallength float,
|
420
|
+
petalwidth float,
|
421
|
+
class varchar
|
422
|
+
);
|
423
|
+
\\COPY $(output_stem) FROM '$(dep0)' CSV HEADER;
|
424
|
+
)
|
203
425
|
|
204
|
-
|
426
|
+
table.psql_out = [table.iris_all] | psql* %(
|
427
|
+
SELECT * FROM $(dep0_stem) WHERE class='Iris-versicolor'
|
428
|
+
)
|
205
429
|
|
206
|
-
|
430
|
+
# psql protocol (block)
|
431
|
+
table.psql_out2 = [table.iris_all] | psql do |rask|
|
432
|
+
<<-SQL
|
433
|
+
SELECT * FROM #{dsl.stem(rask.deps[0])} WHERE class='Iris-versicolor'
|
434
|
+
SQL
|
435
|
+
end
|
207
436
|
|
208
|
-
|
437
|
+
```
|
209
438
|
|
210
439
|
### Initialization and options
|
211
440
|
|
@@ -215,13 +444,13 @@ These APIs are bounded to an instance of DSL, you can create the object at the t
|
|
215
444
|
dsl = DSL.new(<env>, <options>)
|
216
445
|
```
|
217
446
|
|
218
|
-
The argument `<env>` should be the
|
447
|
+
The argument `<env>` should be the *self* of a running Rakefile. In most cases you can directly write:
|
219
448
|
|
220
449
|
```ruby
|
221
450
|
dsl = DSL.new(self, <options>)
|
222
451
|
```
|
223
452
|
|
224
|
-
|
453
|
+
Two important fields of `options` are `output_types` and `input_types`. For each item in `output_types`, you will get a global function to bootstrap a rule. For example, with
|
225
454
|
|
226
455
|
```ruby
|
227
456
|
dsl = DSL.new(self, { output_types: [:csv, :pdf] })
|
@@ -234,24 +463,19 @@ csv.data = ...
|
|
234
463
|
pdf.graph = ...
|
235
464
|
```
|
236
465
|
|
237
|
-
which will
|
466
|
+
which will match *<dir>/data.csv* and *<dir>/graph.pdf*.
|
238
467
|
|
239
|
-
The `input_types` involves the strategy to find inputs. For example, raka will try to find both
|
468
|
+
The `input_types` option determines how inputs are found. All possible input types are tried when resolving an input file in a chained target. For example, raka will try to find both *numbers.csv* and *numbers.table* for a rule like `table.numbers.mean = …` if `input_types = [:csv, :table]`.
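The candidate list can be sketched in plain Ruby. This is a minimal illustration of the lookup described above, not raka's code; `input_candidates` is a hypothetical helper, and it assumes the input stem is everything after the first `__` in the output file name.

```ruby
# Hypothetical helper, not raka's implementation: lists the files that could
# serve as input for a chained target, one per configured input type.
def input_candidates(output, input_types)
  dir  = File.dirname(output)
  stem = File.basename(output, File.extname(output))
  # The first substring is the token added to the input (e.g. "mean").
  _func, input_stem = stem.split('__', 2)
  return [] if input_stem.nil?

  input_types.map do |type|
    name = "#{input_stem}.#{type}"
    dir == '.' ? name : File.join(dir, name)
  end
end

puts input_candidates('mean__numbers.table', %i[csv table])
# numbers.csv
# numbers.table
```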
|
240
469
|
|
241
470
|
### Scope
|
242
471
|
|
243
|
-
|
472
|
+
Scopes define constraints which help users create rules more precisely. A scope generally refers to a folder and can appear in several places.
|
244
473
|
|
245
|
-
|
474
|
+
**Task scope** is the scope when executing a task, a.k.a. **scope**. When a rule is matched given a desired output, a task is generated and its scope is the common folder of the output and all dependencies. For example, a rule `csv.out = [csv.in] | ...` can be matched given *out/out.csv*, and the task scope is resolved as *out/*. The task will thus search for *out/in.csv* as its dependency.
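A minimal sketch of this resolution in plain Ruby, assuming the task scope is simply the directory part of the requested output; `task_scope` and `dependency_path` are hypothetical names, not raka APIs.

```ruby
# Hypothetical helpers, not part of raka.
# The task scope is taken to be the folder of the requested output.
def task_scope(output)
  dir = File.dirname(output)
  dir == '.' ? '' : "#{dir}/"
end

# Dependencies are looked up inside the same scope as the output.
def dependency_path(output, dep_name)
  "#{task_scope(output)}#{dep_name}"
end

puts task_scope('out/out.csv')                # out/
puts dependency_path('out/out.csv', 'in.csv') # out/in.csv
```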
|
246
475
|
|
247
|
-
|
248
|
-
shell(base_dir='./')* code::templ_str { |task| ... }
|
249
|
-
psql(options={})* code::templ_str { |task| ... }
|
250
|
-
r(src:str, libs=[])* code::templ_str { |task| ... }
|
476
|
+
**Rule scope** is the scope to restrict possible task scope, given by `Raka::scope`. In the following example, the rule scopes are
|
251
477
|
|
252
|
-
|
253
|
-
psqlf(options={})
|
254
|
-
```
|
478
|
+
**Target scope.**
|
255
479
|
|
256
480
|
## Rakefile Template
|
257
481
|
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.3.
|
1
|
+
0.3.4
|