raka 0.3.3 → 0.3.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 4cb170d2a0de09c964495fb081a0ea03fa9b88985580317d31b5554c8aa15568
4
- data.tar.gz: 909ce5212754c0e55b47ff0b91d1a2dcb046d173662a3d7a50b4d9a48f8c5a86
3
+ metadata.gz: 895e1c40801f38ff5168449c25333ee40e9aca718f6d082a90d8898a75513373
4
+ data.tar.gz: 9d8228de56582ef7202f30d10f32a285ad96197599344bd9a471dabb1699bdd3
5
5
  SHA512:
6
- metadata.gz: f0c7c21dfcb9fab8013fac22e70fe666194b6774789ca4bb8cfbb273dc02e2e44d3859fef90521b41ab17388c60546c19224c392d1a1fd43d138382b345b22b2
7
- data.tar.gz: cf8d6ac072b32763f6f947de2a87f96ee834a6686777b7c57e1e21ac7e21685cc4499496b51a8870168a90394492ad0b11e524dbc6c1fdb79695dba6a41f4474
6
+ metadata.gz: 3fe0e6223cc75ebe754f353155a1c78c96d25ae55bbffc108a6d8485038d06ad0cebda5be8093d50d0cb24e12b670ed0335b07b880816f5913e1f2712675e30c
7
+ data.tar.gz: 245dbe2ed2b8f0966b0b3a980af00dda5cf084838228d40144e020de2a1545e3b856d90ebe51ab0419c53ca5a93a41929023903eb5171500e22c65ca44e043b1
data/README.md CHANGED
@@ -1,51 +1,31 @@
1
- **Raka** is a **DSL**(Domain Specific Language) on top of **Rak**e for defining and running d**a**t**a** processing workflows. Raka is specifically designed for data processing with improved pattern matching, scopes, language extensions and lots of conventions to prevent verbosity.
2
-
3
- ## Why Raka
4
-
5
- Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
6
-
7
- 1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
8
- 2. Extensible and context-aware protocol architecture.
9
- 3. Multilingual. Other programming languages can be easily embedded.
10
- 4. Auto dependency and naming by conventions.
11
- 5. Scopes to ease comparative studies.
12
- 6. Terser syntax.
13
-
14
- ... and more.
15
-
16
- Compared to more comlex, GUI-based solutions (perhaps classified as scientific-workflow software) like Kepler, etc., Raka has the following advantages:
17
-
18
- 1. Lightweight and easy to setup, especially on platforms with ruby preinstalled.
19
- 2. Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
20
- 3. Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
21
- 4. Expressive so a few lines of code can replace many manual operations.
1
+ **Raka** is a **DSL**(Domain Specific Language) on top of **Rak**e for defining rules and running d**a**t**a** processing workflows. Raka is specifically designed for data processing with improved pattern matching, scopes, language extensions and lots of conventions to prevent verbosity.
22
2
 
23
3
  ## Installation
24
4
 
25
- Raka is a library based on rake. Though rake is cross platform, raka may not work on Windows since it relies some shell facilities. To use raka, one has to install ruby and rake first. Ruby is available for most \*nix systems including Mac OSX so the only task is to install raka like:
5
+ Raka is a library based on rake. Though rake is cross-platform, raka may not work on Windows since it relies on some shell facilities. Ruby is available for most \*nix systems including Mac OSX, so the only task is to install raka like:
26
6
 
27
7
  ```bash
28
8
  gem install raka
29
9
  ```
30
10
 
31
- ## QuickStart
11
+ ## Quick Start
32
12
 
33
13
  First create a file named `main.raka` and import & initialize the DSL
34
14
 
35
15
  ```ruby
36
16
  require 'raka'
37
17
 
38
- dsl = DSL.new(self,
39
- output_types: [:txt, :table, :pdf, :idx],
40
- input_types: [:txt, :table]
18
+ dsl = Raka.new(self,
19
+ output_types: [:txt],
20
+ input_types: [:txt]
41
21
  )
42
22
  ```
43
23
 
44
24
  Then the code below will define two simple rules:
45
25
 
46
26
  ```ruby
47
- txt.sort.first50 = shell* "cat sort.txt | head -n 50 > $@"
48
- txt.sort = [txt.input] | shell* "cat $< | sort -rn > $@"
27
+ txt._.first50 = shell* "cat $< | head -n 50 > $@"
28
+ txt.sort = [txt.input] | shell* "cat $(dep0) | sort -rn > $@"
49
29
  ```
50
30
 
51
31
  For testing let's prepare an input file named `input.txt`:
@@ -54,106 +34,234 @@ For testing let's prepare an input file named `input.txt`:
54
34
  seq 1000 > input.txt
55
35
  ```
56
36
 
57
- We can then invoke `rake first50.txt`, the script will read data from _input.txt_, sort the numbers descendingly and get the first 50 lines.
37
+ Invoke:
38
+
39
+ ```bash
40
+ raka first50__sort.txt
41
+ ```
42
+
43
+ Raka will read data from *input.txt*, sort the numbers in descending order and copy the first 50 lines to *first50__sort.txt*.
58
44
 
59
45
  The workflow here is as follows:
60
46
 
61
- 1. Try to find _first50\_\_sort.txt_: not exists
62
- 2. Rule `txt.sort.first50` matched
63
- 3. For rule `txt.sort.first50`, find input file _sort.txt_ or _sort.table_. Neither exists
64
- 4. Rule `txt.sort` matched
65
- 5. Rule `txt.sort` has no input but a depended target `txt.input`
66
- 6. Find file _input.txt_ or _input.table_. Use the former
67
- 7. Run rule `txt.sort` and create _sort.txt_
68
- 8. Run rule `txt.sort.first50` and create _first50\_\_sort.txt_
47
+ 1. Try to find *first50__sort.txt*: it does not exist.
48
+ 2. Rule with target `txt.sort.first50` matched.
49
+ 3. Look for the input file *sort.txt*: it does not exist.
50
+ 4. Rule with target `txt.sort` matched.
51
+ 5. This rule has no input but depends on the target `txt.input`.
52
+ 6. File *input.txt* exists. Use it.
53
+ 7. Run rule `txt.sort` and create *sort.txt*.
54
+ 8. Run rule `txt.sort.first50` and create *first50__sort.txt*.
69
55
 
70
- This illustrates some basic ideas but may not be particularly interesting. Following is a much more sophisticated example from real world research which covers more features.
56
+ We may want to skip the sort step, and invoke:
71
57
 
72
- ```ruby
73
- SRC_DIR = File.absolute_path 'src'
74
- USER = 'postgres'
75
- DB = 'osm'
76
- HOST = 'localhost'
77
- PORT = 5432
58
+ ```bash
59
+ raka first50__input.txt
60
+ ```
78
61
 
79
- def idx_this() [idx._('$(output_stem)')] end
62
+ Raka will read data from *input.txt* and copy the first 50 lines to *first50__input.txt*.
80
63
 
81
- dsl.scope :de
64
+ This illustrates some basic ideas but may not be particularly interesting. Following is a slightly more complex example which covers more features.
65
+
66
+ ```ruby
67
+ require 'raka'
82
68
 
83
- idx._ = psqlf(script_name: '$stem_idx.sql')
84
- pdf.buildings.func['(\S+)_graph'] = r(:graph)* %[
85
- table_input("$(input_stem)") | draw_%{func0} | ggplot_output('$@') ]
86
- table.buildings = [csv.admin] | psqlf(admin: '$<') | idx_this
69
+ dsl = Raka.new(self,
70
+ output_types: %i[csv pdf],
71
+ input_types: %i[csv],
72
+ lang: ['lang/shell', 'lang/python'])
73
+
74
+ py_template = <<~PYTHON
75
+ import os.path
76
+ import pandas as pd
77
+
78
+ def write_variety(input, output, variety):
79
+ print(variety)
80
+ folder = os.path.dirname(output)
81
+ if len(folder) > 0:
82
+ os.makedirs(folder, exist_ok=True)
83
+ df = pd.read_csv(input)
84
+ df[df['class'] == variety].to_csv(output)
85
+
86
+ <code>
87
+ PYTHON
88
+ py.config script_template: py_template
89
+
90
+ groups = %i[virginica versicolor]
91
+
92
+ csv(groups.join('|')).iris =
93
+ [csv.iris_all] | py* %(write_variety('$<', '$@', 'Iris-$(target_scope)'))
94
+
95
+ csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)
96
+
97
+ dsl.scope(*groups) do
98
+ pdf.iris.plot['plot_(\S+)_(\S+)'] = py do |rask|
99
+ <<-PYTHON
100
+ import seaborn as sns
101
+ from matplotlib import pyplot as plt
102
+
103
+ df = pd.read_csv('#{rask.input}')
104
+ ax = sns.displot(x=df['#{rask.captures.plot0}#{rask.captures.plot1}'])
105
+ ax.set_axis_labels('#{rask.captures.plot0} #{rask.captures.plot1}', 'frequency')
106
+ plt.savefig('#{rask.output}')
107
+ PYTHON
108
+ end
109
+ end
110
+
111
+ task figures: (groups.product(%w[sepal petal], %w[length width]).map do |info|
112
+ "_out/#{info[0]}/plot_#{info[1]}_#{info[2]}__iris.pdf"
113
+ end)
87
114
  ```
88
115
 
89
- Assume that we have a schema named _de_ in database _osm_, have a input file _admin.csv_, and have _graph.R_ and _buildings.sql_ under _src/_. Now further assume that _graph.R_ contains two functions:
116
+ In this example, we download a classical dataset named *iris.csv*, use python code to extract two varieties including *virginica* and *versicolor*, and generate thematic plots of frequency histograms for both varieties.
90
117
 
91
- ```r
92
- draw_stat_snapshot <- function(d) { ... }
93
- draw_user_trend <- function(d) { ... }
94
- ```
118
+ To invoke the script, we run in terminal:
95
119
 
96
- ...and _buildings.sql_ contains table creation code like:
120
+ ```bash
121
+ raka -j 8 -v figures
122
+ ```
97
123
 
98
- ```sql
99
- DROP TABLE IF EXISTS buildings;
100
- CREATE TABLE buildings AS ( ... );
124
+ The option `-j 8` indicates that we want to parallelize the tasks with at most 8 concurrent processes where possible. The option `-v` lets raka print detailed information so we can view the generated python code.
125
+
126
+ The tool will then proceed as follows:
127
+
128
+ 1. Match `figures` with the last rule, which is a normal rake task.
129
+ 2. The prerequisites include 8 figures, none of which exists yet. Take *_out/versicolor/plot_petal_length__iris.pdf* as an example from now on.
130
+ 3. Rule `pdf.iris.plot['plot_(\S+)_(\S+)']...` is matched, where "petal" is bound to `plot0` and "length" is bound to `plot1`.
131
+ 4. Neither of the 2 possible input files, *_out/versicolor/iris.csv* and *_out/versicolor/iris.pdf*, can be found. But the rule `csv(groups.join('|')).iris = ...` (`csv('virginica|versicolor').iris`) can be matched for the former, where the target scope is matched as `versicolor`.
132
+ 5. The only dependency `csv.iris_all` is resolved as *_out/iris_all.csv*. The path does not contain `versicolor` since the target scope only applies to the target.
133
+ 6. Rule `csv.iris_all` is matched without any dependencies.
134
+ 7. The protocol `shell` replaces the automatic variable `$@` with `_out/iris_all.csv` to build a curl command and download the iris dataset from datahub.io.
135
+ 8. Now raka goes back to generate output *_out/versicolor/iris.csv*, by executing the code generated by the `python` protocol, which extracts rows where the class field equals "Iris-versicolor".
136
+ 9. Raka goes back to generate output *_out/versicolor/plot_petal_length__iris.pdf*, by executing the code generated by the `python` protocol, which draws a histogram plot to depict the distribution of petal length.
137
+ 10. Raka continues to generate plot files until all 8 figures exist.
138
+
139
+ As an example, the Python code generated in step 9 is:
140
+
141
+ ```python
142
+ import sys
143
+ import os.path
144
+ import pandas as pd
145
+
146
+ def write_variety(input, output, variety):
147
+ print(variety)
148
+ folder = os.path.dirname(output)
149
+ if len(folder) > 0:
150
+ os.makedirs(folder, exist_ok=True)
151
+ df = pd.read_csv(input)
152
+ df[df['class'] == variety].to_csv(output)
153
+
154
+ import seaborn as sns
155
+ from matplotlib import pyplot as plt
156
+
157
+ df = pd.read_csv('_out/versicolor/iris.csv')
158
+ ax = sns.displot(x=df['petallength'])
159
+ ax.set_axis_labels('petal length', 'frequency')
160
+ plt.savefig('_out/versicolor/plot_petal_length__iris.pdf')
101
161
  ```
102
162
 
103
- We may also have a _buildings_idx.sql_ to create index for the table.
163
+ The rule-based system, the strategy to execute tasks only when necessary, and the capable host language make it fairly easy to adjust the experiments during exploration. For example, to apply the experiments to the *setosa* class as well, we can just change the line
164
+
165
+ `groups = %i[virginica versicolor]`
166
+
167
+ to
168
+
169
+ `groups = %i[virginica versicolor setosa]`
170
+
171
+ The command `raka -j 8 -v figures` will generate 4 figures for the new class, without re-executing tasks for the other two classes.
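The counts above can be checked with a small plain-Ruby sketch that does not need raka at all. It only reuses the `groups.product` expression from the `figures` task; the helper name `figure_paths` is introduced here for illustration and is not part of raka.

```ruby
# Hypothetical helper (not part of raka): list the prerequisite paths
# in the same way the `figures` task in the example builds them.
def figure_paths(groups)
  groups.product(%w[sepal petal], %w[length width]).map do |info|
    "_out/#{info[0]}/plot_#{info[1]}_#{info[2]}__iris.pdf"
  end
end

before = figure_paths(%i[virginica versicolor])
after = figure_paths(%i[virginica versicolor setosa])
new_paths = after - before

puts before.size     # 8 figures for the original two classes
puts new_paths.size  # 4 additional figures for setosa
puts new_paths
```

Only the paths in `new_paths` are missing after the change, which is why the existing figures are not rebuilt.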
172
+
173
+ ## Why Raka
174
+
175
+ Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
176
+
177
+ 1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
178
+ 2. Extensible and context-aware protocol architecture.
179
+ 3. Multilingual. Other programming languages can be easily embedded.
180
+ 4. Auto dependency and naming by conventions.
181
+ 5. Scopes to ease comparative studies.
182
+ 6. Terser syntax.
183
+
184
+ ... and more.
185
+
186
+ Compared to more complex, GUI-based solutions (perhaps classified as scientific-workflow software) like Kepler, etc., Raka has the following advantages:
187
+
188
+ 1. Lightweight and easy to set up, especially on platforms with ruby preinstalled.
189
+ 2. Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
190
+ 3. Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
191
+ 4. Expressive so a few lines of code can replace many manual operations.
192
+
193
+ ## Documentation
104
194
 
105
- Then we can run either `rake de/stat_snapshot_graph__buildings.pdf` or `rake de/user_trend_graph__buildings.pdf`, which will do a bunch of things at first run (take the former as example):
195
+ ### Conceptual Model
106
196
 
107
- 1. Target file not found.
108
- 2. Rule `pdf.buildings.func['(\S+)_graph']` matched. "stat_snapshot_graph" is bound to `func` and "stat_snapshot" is bound to `func0`.
109
- 3. None of the four possible input files: _de/buildings.table_, _de/buildings.txt_, _buildings.table_, _buildings.txt_ can be found. Rule `table.buildings` is matched and the only dependecy file _admin.csv_ is found.
110
- 4. The protocol `psqlf` finds the source file _src/buildings.sql_, intepolate the options with automatic variables (`$<` as "admin.csv"), run the sql, and create a placeholder file _de/buildings.table_ afterwards.
111
- 5. Run the post-job `idx_this`, according to the rule `idx._` it will find and run _buildings_idx.sql_, then create a placeholder file _de/buildings.idx_.
112
- 6. For rule `pdf.buildings.func['(\S+)_graph']`, the R code in `%[]` is interpolated with several automatic variables (`$(input_stem)` as "buildings", `$@` as "de/stat_snapshot_graph\_\_buildings.pdf") and the variables (`func`, `func0`) bound before.
113
- 7. Run the R code. The _buildings_ table is piped into the function `draw_snapshot_graph` and then output to `ggplot_output`, which writes the graph to the specified pdf file.
197
+ A raka rule consists of target, dependencies, actions and
114
198
 
115
- ## Syntax of Rules
199
+ ### Syntax Definition
116
200
 
117
- It is possible to use Raka with little knowledge of ruby / rake, though minimal understandings are highly recommended. The formal syntax of rule can be defined as follows (EBNF form):
201
+ It is possible to use Raka with little knowledge of ruby / rake, though a minimal understanding is highly recommended. The formal syntax of a rule can be defined as follows (W3C EBNF form):
118
202
 
119
203
  ```ebnf
120
- rule = lexpr "=" {target_list "|"} protocol {"|" target_list};
204
+ rule ::= target "=" (dependencies "|")* action ("|" post_target)*
121
205
 
122
- target = rexpr | template;
206
+ target ::= ext "." ltoken ("." ltoken)*
123
207
 
124
- target_list = "[]" | "[" target {"," target} "]";
208
+ dependencies ::= "[]" | "[" dependency ("," dependency)* "]"
125
209
 
126
- lexpr = ext "." {ltoken "."} ltoken;
127
- rexpr = ext "." rtoken {"." rtoken};
210
+ dependency ::= rexpr | template
128
211
 
129
- ltoken = word | word "[" pattern "]";
130
- rtoken = word | word "(" template ")";
212
+ post_target ::= rexpr | template
131
213
 
132
- word = ("_" | letter) { letter | digit | "_" };
214
+ rexpr ::= ext "." rtoken ("." rtoken)*
133
215
 
134
- protocol = ("shell" | "r" | "psql") ("*" template | BLOCK )
135
- | "psqlf" | "psqlf" "(" HASH ")";
216
+ ltoken ::= word | word "[" pattern "]"
217
+ rtoken ::= word | word "(" template ")"
218
+
219
+ word ::= ("_" | letter) ( letter | digit | "_" )*
220
+
221
+ action ::= ("shell" | "r" | "psql" | "py" ) ("*" template | block ) | "run" block
136
222
  ```
137
223
 
138
224
  The corresponding railroad diagrams are:
139
225
 
140
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/rule.svg)
226
+ **rule**
227
+
228
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rule.svg)
229
+
230
+ **target**
231
+
232
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/target.svg)
233
+
234
+ **dependencies**
235
+
236
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/dependencies.svg)
237
+
238
+ **dependency**
239
+
240
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/dependency.svg)
241
+
242
+ **post_target**
141
243
 
142
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/target.svg)
244
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/post_target.svg)
143
245
 
144
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/target_list.svg)
246
+ **rexpr**
145
247
 
146
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/lexpr.svg)
248
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rexpr.svg)
147
249
 
148
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/rexpr.svg)
250
+ **ltoken**
149
251
 
150
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/ltoken.svg)
252
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/ltoken.svg)
151
253
 
152
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/rtoken.svg)
254
+ **rtoken**
153
255
 
154
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/word.svg)
256
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rtoken.svg)
155
257
 
156
- ![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/protocol.svg)
258
+ **word**
259
+
260
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/word.svg)
261
+
262
+ **action**
263
+
264
+ ![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/action.svg)
157
265
 
158
266
  The definition is concise but several details are omitted for simplicity:
159
267
 
@@ -163,15 +271,15 @@ The definition is concise but several details are omitted for simplicity:
163
271
  4. The listed protocols are merely what we offered now. It can be greatly extended.
164
272
  5. Nearly any concept in the syntax can be replaced by a suitable ruby variable.
165
273
 
166
- ## Pattern matching and template resolving
274
+ ### Pattern matching and template resolving
167
275
 
168
- When defined a rule like `lexpr = rexpr`, the left side represents a pattern and the right side contains specifications for extra dependecies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules will be examined one by one until a rule is matched. The matching process based on Regex also support named captures so that some varibales can be bound for use in the right side.
276
+ When a rule is defined like `target = <specification>`, the left side represents a pattern and the right side contains specifications for extra dependencies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules will be examined one by one until a rule is matched. The matching process, based on regular expressions, also supports named captures so that some variables can be extracted for use in the right side.
169
277
 
170
- The specifications on the right side of a rule can be incomplete from various aspects, that is, they can contains some templates. The "holes" in the templates will be fulfilled by automatic variables and variables bounded when matching the left side.
278
+ The specifications on the right side of a rule can contain templates. The "holes" in the templates will be filled by automatic variables and variables captured when matching the left side.
171
279
 
172
- ### Pattern matching
280
+ #### Pattern matching
173
281
 
174
- To match a given _file_ with a `lexpr`, asides the extension, the substrings of the file name between "\_\_" are mapped to tokens separated by `.`, in reverse order. After that, each substring is matched to the corresponding token or the regex in `[]`. For example, the rule
282
+ To match a given _file_ with a `target`, the extension will be matched first. The substrings of the file name between "\_\_" are mapped to tokens separated by `.`, in reverse order. After that, each substring is matched to the corresponding token or the regex in `[]`. For example, the rule
175
283
 
176
284
  ```ruby
177
285
  pdf.buildings.indicator['\S+'].top['top_(\d+)']
@@ -186,26 +294,147 @@ can match "top_50\_\_node_num\_\_buildings.pdf". The logical process is:
186
294
  - `top_(\d+) ~ top_50`
187
295
  3. Two levels of captures are made. First, 'node_num' is captured as `indicator`, 'top_50' is captured as `top`; Second, '50' is captured as `top0` since `\d+` is wrapped in parenthesis and is the first.
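The matching steps above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not raka's implementation: the helper `match_target` and the `[name, pattern]` token representation are hypothetical, and storing the whole substring as the first-level capture is an assumption that agrees with the example.

```ruby
# Hypothetical helper (not raka's API). `tokens` lists the rule's tokens after
# the extension, in rule order, as [name, pattern] pairs; pattern is nil for a
# plain word.
def match_target(file, ext, tokens)
  base = File.basename(file)
  return nil unless File.extname(base) == ".#{ext}"

  # Substrings between "__" map to tokens in reverse order.
  parts = File.basename(base, ".#{ext}").split('__').reverse
  return nil unless parts.size == tokens.size

  captures = {}
  tokens.zip(parts).each do |(name, pattern), part|
    next if name == '_' && pattern.nil? # `_` matches any token

    if pattern.nil?
      return nil unless part == name # plain word: literal comparison
      next
    end

    m = /\A(?:#{pattern})/.match(part) # prefix matching, anchored at the start
    return nil unless m

    captures[name] = part # first level: the substring itself (assumption)
    m.captures.each_with_index { |c, i| captures["#{name}#{i}"] = c } # second level
  end
  captures
end

rule = [['buildings', nil], ['indicator', '\S+'], ['top', 'top_(\d+)']]
p match_target('top_50__node_num__buildings.pdf', 'pdf', rule)
```

With the rule above, the call yields the captures `indicator`, `top` and `top0` described in step 3.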
188
296
 
189
- One can write special token `_` or `something[]` if the captured value is useful later, as the syntax sugar of `something['\S+']`.
297
+ One can write special token `_` to match any token. Since raka uses prefix matching, something like `token0['']` can also match any token and capture it in `token0` in addition. End-of-line symbol `$` can be used to match the whole token, e.g., `token0['word$']` will not match `word_bench`.
190
298
 
191
- ### Template resolving
299
+ #### Template resolving
192
300
 
193
301
  In some places of `rexpr`, templates can be written instead of strings, so that they can represent different values at runtime. There are two types of variables that can be used in templates. The first is automatic variables, which are just like `$@` in Make or `task.name` in Rake. We even preserve some Make conventions for easier migration. All automatic variables begin with `$`. The possible automatic variables are:
194
302
 
195
- | symbol | meaning | symbol | meaning |
196
- | -------------- | ---------------------- | --------------- | ------------------------------- |
197
- | \$@ | output file | \$^ | all dependecies (sep by spaces) |
198
- | \$< | first dependency | $0, $1, \$i | ith depdency |
199
- | \$(scope) | scope for current task | \$(output_stem) | stem of the output file |
200
- | \$(input_stem) | stem of the input file | | |
303
+ | symbol | description |
304
+ | ----------------------------------------- | ------------------------------------------------------------------------------------------------------ |
305
+ | \$@, \$(output) | the output file |
306
+ | \$<, \$(input) | the input file defined in the chained target |
307
+ | \$^, \$(deps) | all dependencies concatenated by comma (including input) |
308
+ | \$(dep0), \$(dep1), ... | the i-th dependency (input is $(dep0)) |
309
+ | \$(input_stem) | stem of the input file |
310
+ | \$(output_stem) | stem of the output file |
311
+ | \$(func) | the token added to input to generate output, e.g., stat in csv.data.stat |
312
+ | \$(ext) | extension of the output file |
313
+ | \$(scope) | scope for current task, i.e. the common directory for output, input and dependencies |
314
+ | \$(target_scope) | the inline scope defined in target |
315
+ | \$(target_scope0), \$(target_scope1), ... | the i-th captured value by inline scope defined in target |
316
+ | \$(rule_scope0), \$(rule_scope1), ... | the i-th scope defined in rule-level by nested calls of the dsl.scope function (i is larger insideout) |
317
+
318
+ The other type of variables are those captured during pattern matching, which can be referred to using `%{var}`. In the example of the [pattern matching](###pattern-matching) section, `%{indicator}` will be replaced by `node_num`, `%{top}` will be replaced by `top_50` and `%{top0}` will be replaced by `50`. In such case, a template as `'calculate top %{top0} of %{indicator} for $@'` will be resolved as `'calculate top 50 of node_num for top_50__node_num__buildings.pdf'`
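To make the two substitution passes concrete, here is a minimal plain-Ruby sketch. The helper `resolve_template` and its arguments are hypothetical and only mimic the behaviour described above; raka's own resolver may work differently.

```ruby
# Hypothetical helper (not raka's API): fill the "holes" of a template.
# `captures` holds values captured during pattern matching,
# `automatic` maps automatic-variable symbols to their values.
def resolve_template(template, captures, automatic)
  # Replace %{var} with the captured value of the same name.
  text = template.gsub(/%\{(\w+)\}/) { captures.fetch(Regexp.last_match(1)) }
  # Replace automatic variables such as $@ (matched literally).
  automatic.reduce(text) { |acc, (symbol, value)| acc.gsub(symbol) { value } }
end

captures = { 'indicator' => 'node_num', 'top' => 'top_50', 'top0' => '50' }
automatic = { '$@' => 'top_50__node_num__buildings.pdf' }

puts resolve_template('calculate top %{top0} of %{indicator} for $@', captures, automatic)
# calculate top 50 of node_num for top_50__node_num__buildings.pdf
```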
319
+
320
+ Templates can appear in various places. For dependencies and post targets, tokens with parentheses can contain templates, like `csv._('%{indicator}')`. The symbol of a token with parentheses is of no use and is generally replaced with an underscore. It is also possible to write a template literal directly, i.e. `'%{indicator}.csv'`. Templates can also be applied in actions but it depends on the implementations of protocols.
321
+
322
+ ### Actions and protocols
323
+
324
+ Raka invokes **actions** when the input and all dependencies are present. Generally, users define an action that generates the output. To maximize the flexibility, users can feed code in an arbitrary programming language to the corresponding **protocol**. The protocol will then transform and execute the code. Raka natively supports the host (ruby) protocol and several foreign protocols including shell, python, psql, and r.
325
+
326
+ The host protocol is special and just executes the given ruby block. All other protocols accept either a templated code string given after the asterisk operator or a block producing a templated code string. Examples for each protocol are illustrated below.
327
+
328
+ In the host protocol and the block versions of other protocols, a raka task (the *rask* variable) is provided, which offers the following properties:
329
+
330
+ | property | description |
331
+ | --------------------- | ------------------------------------------------------------------------------------ |
332
+ | output | the output file |
333
+ | input | the input file defined in the chained target |
334
+ | deps | the dependencies (input is deps[0]) |
335
+ | func | the token added to input to generate output, e.g., stat in csv.data.stat |
336
+ | ext | extension of the output file |
337
+ | captures | captured text during pattern matching, key-value |
338
+ | scope | scope for current task, i.e. the common directory for output, input and dependencies |
339
+ | target_scope | the inline scope defined in target |
340
+ | target_scope_captures | captured values by inline scope defined in target |
341
+ | rule_scopes | the scopes defined in rule-level by nested calls of the dsl.scope function |
342
+
343
+ ```ruby
344
+ require 'raka'
345
+ require 'csv'
346
+
347
+ dsl = Raka.new(
348
+ self, output_types: %i[table view csv],
349
+ lang: ['lang/psql', 'lang/shell', 'lang/python', 'lang/r']
350
+ )
351
+
352
+ csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)
353
+
354
+ # host(ruby) protocol
355
+ csv.rb_out = [csv.iris_all] | run do |rask|
356
+ in_f = File.open(rask.deps[0])
357
+ out_f = File.open(rask.output, 'w')
358
+ options = { headers: true, return_headers: true, write_headers: true }
359
+ CSV.filter(in_f, out_f, options) do |row|
360
+ row['class'] == 'Iris-versicolor'
361
+ end
362
+ end
363
+
364
+ # python protocol
365
+ csv.py_out = [csv.iris_all] | py* %(
366
+ import pandas as pd
367
+ df = pd.read_csv('$(dep0)')
368
+ df[df['class'] == 'Iris-versicolor'].to_csv('$@')
369
+ )
370
+
371
+ # python protocol (block)
372
+ csv.py_out2 = [csv.iris_all] | py do |rask|
373
+ <<-PYTHON
374
+ import pandas as pd
375
+ df = pd.read_csv('#{rask.deps[0]}')
376
+ df[df['class'] == 'Iris-versicolor'].to_csv('#{rask.output}')
377
+ PYTHON
378
+ end
379
+
380
+ # r protocol
381
+ csv.r_out = [csv.iris_all] | r* %(
382
+ df <- read.csv("$(dep0)")
383
+ write.csv(df[(df$class == "Iris-versicolor"),], file="$@")
384
+ )
201
385
 
202
- The other type of variables are those bounded during pattern matching,which can be referred to using `%{var}`. In the example of the [pattern matching](###pattern-matching) section, `%{indicator}` will be replaced by `node_num`, `%{top}` will be replaced by `top_50` and `%{top0}` will be replaced by `50`. In such case, a template as `'calculate top %{top0} of %{indicator} for $@'` will be resolved as `'calculate top 50 of node_num for top_50__node_num__buildings.pdf'`
386
+ # r protocol (block)
387
+ csv.r_out2 = [csv.iris_all] | r do |rask|
388
+ <<-R
389
+ df <- read.csv("#{rask.deps[0]}")
390
+ write.csv(df[(df$class == "Iris-versicolor"),], file="#{rask.output}")
391
+ R
392
+ end
393
+
394
+ # shell protocol
395
+ csv.shell_out = [csv.iris_all] | shell* %(
396
+ cat <(head -1 $(dep0)) <(grep "Iris-versicolor" $(dep0)) > $@
397
+ )
398
+
399
+ # shell protocol (block)
400
+ csv.shell_out2 = [csv.iris_all] | shell do |rask|
401
+ "cat <(head -1 #{rask.deps[0]}) <(grep 'Iris-versicolor' #{rask.deps[0]}) > #{rask.output}"
402
+ end
403
+
404
+ # psql protocol
405
+ pg = OpenStruct.new(
406
+ user: 'postgres',
407
+ port: 5433,
408
+ host: '127.0.0.1',
409
+ db: 'postgres',
410
+ password: 'postgres'
411
+ )
412
+ psql.config conn: pg, create: :mview
413
+
414
+ table.iris_all = [csv.iris_all] | psql(create: nil)* %(
415
+ DROP TABLE IF EXISTS $(output_stem);
416
+ CREATE TABLE $(output_stem) (
417
+ sepallength float,
418
+ sepalwidth float,
419
+ petallength float,
420
+ petalwidth float,
421
+ class varchar
422
+ );
423
+ \\COPY $(output_stem) FROM '$(dep0)' CSV HEADER;
424
+ )
203
425
 
204
- The replacement of variables happen before any process to the template string. So do not include the symbols for automatic variables or `%{<anything>}` in templates.
426
+ table.psql_out = [table.iris_all] | psql* %(
427
+ SELECT * FROM $(dep0_stem) WHERE class='Iris-versicolor'
428
+ )
205
429
 
206
- Templates can happen in various places. For depdencies and post jobs, tokens with parenthesis can wrap in templates, like `csv._('%{indicator}')`. The symbol of a token with parenthesis is of no use and is generally omitted. It is also possible to write template literal directly, i.e. `'%{indicator}.csv'`. Where templates can be applied in actions depends on the protocols and will be explained later in the [Protocols](###protocols) section
430
+ # psql protocol (block)
431
+ table.psql_out2 = [table.iris_all] | psql do |rask|
432
+ <<-SQL
433
+ SELECT * FROM #{dsl.stem(rask.deps[0])} WHERE class='Iris-versicolor'
434
+ SQL
435
+ end
207
436
 
208
- ## APIs
437
+ ```
209
438
 
210
439
  ### Initialization and options
211
440
 
@@ -215,13 +444,13 @@ These APIs are bounded to an instance of DSL, you can create the object at the t
215
444
  dsl = DSL.new(<env>, <options>)
216
445
  ```
217
446
 
218
- The argument `<env>` should be the _self_ of a running Rakefile. In most case you can directly write:
447
+ The argument `<env>` should be the *self* of a running Rakefile. In most case you can directly write:
219
448
 
220
449
  ```ruby
221
450
  dsl = DSL.new(self, <options>)
222
451
  ```
223
452
 
224
- The argument `options` currently support `output_types` and `input_types`. For each item in `output_types`, you will get an extra function to bootstrap a rule. For example, with
453
+ Two important fields of `options` are `output_types` and `input_types`. For each item in `output_types`, you will get a global function to bootstrap a rule. For example, with
225
454
 
226
455
  ```ruby
227
456
  dsl = DSL.new(self, { output_types: [:csv, :pdf] })
@@ -234,24 +463,19 @@ csv.data = ...
234
463
  pdf.graph = ...
235
464
  ```
236
465
 
237
- which will generate data.csv and graph.pdf
466
+ which will match *<dir>/data.csv* and *<dir>/graph.pdf*
238
467
 
239
- The `input_types` involves the strategy to find inputs. For example, raka will try to find both _numbers.csv_ and _numbers.table_ for a rule like `table.numbers.mean = …` if `input_type = [:csv, :table]`.
468
+ The `input_types` option controls how inputs are found. All possible input types will be tried when resolving an input file in a chained target. For example, raka will try to find both *numbers.csv* and *numbers.table* for a rule like `table.numbers.mean = …` if `input_types = [:csv, :table]`.
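As a rough illustration of this lookup, the candidate input files for a stem can be listed in plain Ruby. The helper `input_candidates` is hypothetical and only mirrors the description above; it does not call raka.

```ruby
# Hypothetical helper (not raka's API): file names that would be tried as the
# input of a chained target, one per configured input type.
def input_candidates(stem, input_types)
  input_types.map { |type| "#{stem}.#{type}" }
end

p input_candidates('numbers', %i[csv table])
# ["numbers.csv", "numbers.table"]
```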
240
469
 
241
470
  ### Scope
242
471
 
243
- ### Protocols
472
+ Scopes define constraints which help users create rules more precisely. A scope generally refers to a folder and can appear in several places.
244
473
 
245
- Currently Raka support 4 lang: shell, psql, r and psqlf.
474
+ **Task scope** is the scope when executing a task, a.k.a. **scope**. When a rule is matched given a desired output, a task is generated and its scope is the common folder of the output and all dependencies. For example, a rule `csv.out = [csv.in] | ...` can be matched given *out/out.csv* and the task scope is resolved as *out/*. The task will thus search for *out/in.csv* as the dependency.
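The path arithmetic behind the task scope can be sketched in plain Ruby. The helper `dependency_in_scope` is hypothetical, introduced only to illustrate the description above, and assumes the scope is simply the directory of the requested output.

```ruby
# Hypothetical helper (not raka's API): resolve a dependency inside the task
# scope, taken here to be the folder of the requested output.
def dependency_in_scope(output, dep_stem, dep_ext)
  scope = File.dirname(output) # "out" for "out/out.csv"
  File.join(scope, "#{dep_stem}.#{dep_ext}")
end

puts dependency_in_scope('out/out.csv', 'in', 'csv')
# out/in.csv
```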
246
475
 
247
- ```ruby
248
- shell(base_dir='./')* code::templ_str { |task| ... }
249
- psql(options={})* code::templ_str { |task| ... }
250
- r(src:str, libs=[])* code::templ_str { |task| ... }
476
+ **Rule scope** is the scope to restrict possible task scope, given by `Raka::scope`. In the following example, the rule scopes are
251
477
 
252
- # options = { script_name: , script_file: , params: }
253
- psqlf(options={})
254
- ```
478
+ **Target scope.**
255
479
 
256
480
  ## Rakefile Template
257
481
 
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.3.3
1
+ 0.3.4