RubyGems - raka - Versions diffs - 0.3.3 → 0.3.4 - Mend

raka 0.3.3 → 0.3.4

Files changed (18)

checksums.yaml +4 -4
data/README.md +343 -119
data/VERSION +1 -1
data/bin/raka +32 -12
data/lib/{compile.rb → raka/compile.rb} +10 -4
data/lib/{interface.rbs → raka/interface.rbs} +0 -0
data/lib/{lang → raka/lang}/psql/impl.rb +0 -0
data/lib/{lang → raka/lang}/python/impl.rb +0 -0
data/lib/raka/lang/r/impl.rb +23 -0
data/lib/{lang → raka/lang}/shell/impl.rb +0 -0
data/lib/{output_type.rb → raka/output_type.rb} +0 -0
data/lib/{protocol.rb → raka/protocol.rb} +0 -0
data/lib/{token.rb → raka/token.rb} +0 -0
data/lib/raka.rb +9 -4
metadata +11 -13
data/lib/lang/r/impl.rb +0 -38
data/lib/lang/r/io.R +0 -113
data/lib/temp.json +0 -9167

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4cb170d2a0de09c964495fb081a0ea03fa9b88985580317d31b5554c8aa15568
-  data.tar.gz: 909ce5212754c0e55b47ff0b91d1a2dcb046d173662a3d7a50b4d9a48f8c5a86
+  metadata.gz: 895e1c40801f38ff5168449c25333ee40e9aca718f6d082a90d8898a75513373
+  data.tar.gz: 9d8228de56582ef7202f30d10f32a285ad96197599344bd9a471dabb1699bdd3
 SHA512:
-  metadata.gz: f0c7c21dfcb9fab8013fac22e70fe666194b6774789ca4bb8cfbb273dc02e2e44d3859fef90521b41ab17388c60546c19224c392d1a1fd43d138382b345b22b2
-  data.tar.gz: cf8d6ac072b32763f6f947de2a87f96ee834a6686777b7c57e1e21ac7e21685cc4499496b51a8870168a90394492ad0b11e524dbc6c1fdb79695dba6a41f4474
+  metadata.gz: 3fe0e6223cc75ebe754f353155a1c78c96d25ae55bbffc108a6d8485038d06ad0cebda5be8093d50d0cb24e12b670ed0335b07b880816f5913e1f2712675e30c
+  data.tar.gz: 245dbe2ed2b8f0966b0b3a980af00dda5cf084838228d40144e020de2a1545e3b856d90ebe51ab0419c53ca5a93a41929023903eb5171500e22c65ca44e043b1
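
The digests above cover the two archives packed inside the `.gem` file. A minimal Ruby sketch of checking them locally is given below. It assumes the gem has already been unpacked and `checksums.yaml` decompressed into a local directory; the helper name `verify_checksums` and the directory name in the usage comment are hypothetical.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: compare every file listed under the SHA256 key of a
# checksums.yaml-style hash against its digest on disk.
# Returns the names of files that are missing or do not match.
def verify_checksums(checksums, dir)
  checksums.fetch('SHA256').reject do |name, expected|
    path = File.join(dir, name)
    File.exist?(path) && Digest::SHA256.file(path).hexdigest == expected
  end.keys
end

# Usage, assuming the archives were extracted into ./raka-0.3.4:
#   checksums  = YAML.safe_load(File.read('raka-0.3.4/checksums.yaml'))
#   mismatches = verify_checksums(checksums, 'raka-0.3.4')
#   puts mismatches.empty? ? 'all SHA256 digests match' : "mismatch: #{mismatches.join(', ')}"
```

An empty result means both `metadata.gz` and `data.tar.gz` match the recorded SHA256 values.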

data/README.md CHANGED

@@ -1,51 +1,31 @@
-**Raka** is a **DSL**(Domain Specific Language) on top of **Rak**e for defining and running d**a**t**a** processing workflows. Raka is specifically designed for data processing with improved pattern matching, scopes, language extensions and lots of conventions to prevent verbosity.
-## Why Raka
-Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
-1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
-2. Extensible and context-aware protocol architecture.
-3. Multilingual. Other programming languages can be easily embedded.
-4. Auto dependency and naming by conventions.
-5. Scopes to ease comparative studies.
-6. Terser syntax.
-... and more.
-Compared to more comlex, GUI-based solutions (perhaps classified as scientific-workflow software) like Kepler, etc., Raka has the following advantages:
-1. Lightweight and easy to setup, especially on platforms with ruby preinstalled.
-2. Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
-3. Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
-4. Expressive so a few lines of code can replace many manual operations.
+**Raka** is a **DSL**(Domain Specific Language) on top of **Rak**e for defining rules and running d**a**t**a** processing workflows. Raka is specifically designed for data processing with improved pattern matching, scopes, language extensions and lots of conventions to prevent verbosity.
 ## Installation
-Raka is a library based on rake. Though rake is cross platform, raka may not work on Windows since it relies some shell facilities. To use raka, one has to install ruby and rake first. Ruby is available for most \*nix systems including Mac OSX so the only task is to install raka like:
+Raka is a library based on rake. Though rake is cross-platform, raka may not work on Windows since it relies on some shell facilities. Ruby is available for most \*nix systems including Mac OSX, so the only task is to install raka like:
 ```bash
 gem install raka
 ```
-## QuickStart
+## Quick Start
 First create a file named `main.raka` and import & initialize the DSL
 ```ruby
 require 'raka'
-dsl = DSL.new(self,
-  output_types: [:txt, :table, :pdf, :idx],
-  input_types: [:txt, :table]
+dsl = Raka.new(self,
+  output_types: [:txt],
+  input_types: [:txt]
 )
 ```
 Then the code below will define two simple rules:
 ```ruby
-txt.sort.first50 = shell* "cat sort.txt | head -n 50 > $@"
-txt.sort = [txt.input] | shell* "cat $< | sort -rn > $@"
+txt._.first50 = shell* "cat $< | head -n 50 > $@"
+txt.sort = [txt.input] | shell* "cat $(dep0) | sort -rn > $@"
 ```
 For testing let's prepare an input file named `input.txt`:
@@ -54,106 +34,234 @@ For testing let's prepare an input file named `input.txt`:
 seq 1000 > input.txt
 ```
-We can then invoke `rake first50.txt`, the script will read data from _input.txt_, sort the numbers descendingly and get the first 50 lines.
+Invoke:
+```bash
+raka first50__sort.txt
+```
+Raka will read data from *input.txt*, sort the numbers in descending order and copy the first 50 lines to *first50__sort.txt*.
 The workflow here is as follows:
-1. Try to find _first50\_\_sort.txt_: not exists
-2. Rule `txt.sort.first50` matched
-3. For rule `txt.sort.first50`, find input file _sort.txt_ or _sort.table_. Neither exists
-4. Rule `txt.sort` matched
-5. Rule `txt.sort` has no input but a depended target `txt.input`
-6. Find file _input.txt_ or _input.table_. Use the former
-7. Run rule `txt.sort` and create _sort.txt_
-8. Run rule `txt.sort.first50` and create _first50\_\_sort.txt_
+1. Try to find *first50__sort.txt*: it does not exist.
+2. Rule with target `txt.sort.first50` is matched.
+3. Look for the input file *sort.txt*: it does not exist.
+4. Rule with target `txt.sort` is matched.
+5. This rule has no input but depends on the target `txt.input`.
+6. File *input.txt* exists. Use it.
+7. Run rule `txt.sort` and create *sort.txt*.
+8. Run rule `txt.sort.first50` and create *first50__sort.txt*.
-This illustrates some basic ideas but may not be particularly interesting. Following is a much more sophisticated example from real world research which covers more features.
+We may want to skip the sort step, and invoke:
-```ruby
-SRC_DIR = File.absolute_path 'src'
-USER = 'postgres'
-DB = 'osm'
-HOST = 'localhost'
-PORT = 5432
+```bash
+raka first50__input.txt
+```
-def idx_this() [idx._('$(output_stem)')] end
+Raka will read data from *input.txt* and copy the first 50 lines to *first50__input.txt*.
-dsl.scope :de
+This illustrates some basic ideas but may not be particularly interesting. Following is a slightly more complex example which covers more features.
+```ruby
+require 'raka'
-idx._ = psqlf(script_name: '$stem_idx.sql')
-pdf.buildings.func['(\S+)_graph'] = r(:graph)* %[
-  table_input("$(input_stem)") | draw_%{func0} | ggplot_output('$@') ]
-table.buildings = [csv.admin] | psqlf(admin: '$<') | idx_this
+dsl = Raka.new(self,
+               output_types: %i[csv pdf],
+               input_types: %i[csv],
+               lang: ['lang/shell', 'lang/python'])
+py_template = <<~PYTHON
+  import os.path
+  import pandas as pd
+  def write_variety(input, output, variety):
+    print(variety)
+    folder = os.path.dirname(output)
+    if len(folder) > 0:
+      os.makedirs(folder, exist_ok=True)
+    df = pd.read_csv(input)
+    df[df['class'] == variety].to_csv(output)
+  <code>
+PYTHON
+py.config script_template: py_template
+groups = %i[virginica versicolor]
+csv(groups.join('|')).iris =
+  [csv.iris_all] | py* %(write_variety('$<', '$@', 'Iris-$(target_scope)'))
+csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)
+dsl.scope(*groups) do
+  pdf.iris.plot['plot_(\S+)_(\S+)'] = py do |rask|
+    <<-PYTHON
+    import seaborn as sns
+    from matplotlib import pyplot as plt
+    df = pd.read_csv('#{rask.input}')
+    ax = sns.displot(x=df['#{rask.captures.plot0}#{rask.captures.plot1}'])
+    ax.set_axis_labels('#{rask.captures.plot0} #{rask.captures.plot1}', 'frequency')
+    plt.savefig('#{rask.output}')
+    PYTHON
+  end
+end
+task figures: (groups.product(%w[sepal petal], %w[length width]).map do |info|
+  "_out/#{info[0]}/plot_#{info[1]}_#{info[2]}__iris.pdf"
+end)
 ```
-Assume that we have a schema named _de_ in database _osm_, have a input file _admin.csv_, and have _graph.R_ and _buildings.sql_ under _src/_. Now further assume that _graph.R_ contains two functions:
+In this example, we download a classical dataset named *iris.csv*, use python code to extract two varieties including *virginica* and *versicolor*, and generate thematic plots of frequency histograms for both varieties.
-```r
-draw_stat_snapshot <- function(d) { ... }
-draw_user_trend <- function(d) { ... }
-```
+To invoke the script, we run in terminal:
-...and _buildings.sql_ contains table creation code like:
+```bash
+raka -j 8 -v figures
+```
-```sql
-DROP TABLE IF EXISTS buildings;
-CREATE TABLE buildings AS ( ... );
+The option `-j 8` indicates we want to parallelize the tasks with at most 8 concurrent processes where possible. The option `-v` lets raka print detailed information so we can view the generated python code.
+The tool will then proceed as follows:
+1. Match `figures` with the last rule, which is a normal rake task.
+2. The prerequisites include 8 figures, none of which exists yet. Take *_out/versicolor/plot_petal_length__iris.pdf* as an example from now on.
+3. Rule `pdf.iris.plot['plot_(\S+)_(\S+)']...` is matched, where "petal" is bound to `plot0` and "length" is bound to `plot1`.
+4. Neither of the 2 possible input files, *_out/versicolor/iris.csv* and *_out/versicolor/iris.pdf*, can be found. But the rule `csv(groups.join('|')).iris = ...` (`csv('virginica|versicolor').iris`) can be matched for the former, where the target scope is matched as `versicolor`.
+5. The only dependency `csv.iris_all` is resolved as *_out/iris_all.csv*. The path does not contain `versicolor` since the target scope only applies to the target.
+6. Rule `csv.iris_all` is matched without any dependencies.
+7. The protocol `shell` replaces the automatic variable `$@` with `_out/iris_all.csv` to build a curl command and download the iris dataset from datahub.io.
+8. Now raka goes back to generate output *_out/versicolor/iris.csv* by executing the code generated by the `python` protocol, which extracts rows where the class field equals "Iris-versicolor".
+9. Raka goes back to generate output *_out/versicolor/plot_petal_length__iris.pdf* by executing the code generated by the `python` protocol, which draws a histogram plot to depict the distribution of petal length.
+10. Raka continues to generate plot files until all 8 figures exist.
+As an example, the python code generated in step 9 is:
+```python
+import sys
+import os.path
+import pandas as pd
+def write_variety(input, output, variety):
+  print(variety)
+  folder = os.path.dirname(output)
+  if len(folder) > 0:
+    os.makedirs(folder, exist_ok=True)
+  df = pd.read_csv(input)
+  df[df['class'] == variety].to_csv(output)
+import seaborn as sns
+from matplotlib import pyplot as plt
+df = pd.read_csv('_out/versicolor/iris.csv')
+ax = sns.displot(x=df['petallength'])
+ax.set_axis_labels('petal length', 'frequency')
+plt.savefig('_out/versicolor/plot_petal_length__iris.pdf')
 ```
-We may also have a _buildings_idx.sql_ to create index for the table.
+The rule-based system, the strategy to execute tasks only when necessary, and the capable host language make it fairly easy to adjust the experiments during the exploration. For example, suppose we also want to apply the experiments to the *setosa* class; we can just change the line
+`groups = %i[virginica versicolor]`
+to
+`groups = %i[virginica versicolor setosa]`
+The command `raka -j 8 -v figures` will generate 4 figures for the new class, without re-executing tasks for the other two classes.
+## Why Raka
+Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
+1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
+2. Extensible and context-aware protocol architecture.
+3. Multilingual. Other programming languages can be easily embedded.
+4. Auto dependency and naming by conventions.
+5. Scopes to ease comparative studies.
+6. Terser syntax.
+... and more.
+Compared to more complex, GUI-based solutions (perhaps classified as scientific-workflow software) like Kepler, etc., Raka has the following advantages:
+1. Lightweight and easy to set up, especially on platforms with ruby preinstalled.
+2. Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
+3. Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
+4. Expressive so a few lines of code can replace many manual operations.
+## Documentation
-Then we can run either `rake de/stat_snapshot_graph__buildings.pdf` or `rake de/user_trend_graph__buildings.pdf`, which will do a bunch of things at first run (take the former as example):
+### Conceptual Model
-1. Target file not found.
-2. Rule `pdf.buildings.func['(\S+)_graph']` matched. "stat_snapshot_graph" is bound to `func` and "stat_snapshot" is bound to `func0`.
-3. None of the four possible input files: _de/buildings.table_, _de/buildings.txt_, _buildings.table_, _buildings.txt_ can be found. Rule `table.buildings` is matched and the only dependecy file _admin.csv_ is found.
-4. The protocol `psqlf` finds the source file _src/buildings.sql_, intepolate the options with automatic variables (`$<` as "admin.csv"), run the sql, and create a placeholder file _de/buildings.table_ afterwards.
-5. Run the post-job `idx_this`, according to the rule `idx._` it will find and run _buildings_idx.sql_, then create a placeholder file _de/buildings.idx_.
-6. For rule `pdf.buildings.func['(\S+)_graph']`, the R code in `%[]` is interpolated with several automatic variables (`$(input_stem)` as "buildings", `$@` as "de/stat_snapshot_graph\_\_buildings.pdf") and the variables (`func`, `func0`) bound before.
-7. Run the R code. The _buildings_ table is piped into the function `draw_snapshot_graph` and then output to `ggplot_output`, which writes the graph to the specified pdf file.
+A raka rule consists of target, dependencies, actions and
-## Syntax of Rules
+### Syntax Definition
-It is possible to use Raka with little knowledge of ruby / rake, though minimal understandings are highly recommended. The formal syntax of rule can be defined as follows (EBNF form):
+It is possible to use Raka with little knowledge of ruby / rake, though a minimal understanding is highly recommended. The formal syntax of a rule can be defined as follows (W3C EBNF form):
 ```ebnf
-rule = lexpr "=" {target_list "|"} protocol {"|" target_list};
+rule ::= target "=" (dependencies "|")* action ("|" post_target)*
-target = rexpr | template;
+target ::= ext "." ltoken ("." ltoken)*
-target_list = "[]" | "[" target {"," target} "]";
+dependencies ::= "[]" | "[" dependency ("," dependency)* "]"
-lexpr = ext "." {ltoken "."} ltoken;
-rexpr = ext "." rtoken {"." rtoken};
+dependency ::= rexpr | template
-ltoken = word | word "[" pattern "]";
-rtoken = word | word "(" template ")";
+post_target ::= rexpr | template
-word = ("_" | letter) { letter | digit | "_" };
+rexpr ::= ext "." rtoken ("." rtoken)*
-protocol = ("shell" | "r" | "psql") ("*" template | BLOCK )
-         | "psqlf" | "psqlf" "(" HASH ")";
+ltoken ::= word | word "[" pattern "]"
+rtoken ::= word | word "(" template ")"
+word ::= ("_" | letter) ( letter | digit | "_" )*
+action ::= ("shell" | "r" | "psql" | "py" ) ("*" template | block ) | "run" block
 ```
 The corresponding railroad diagrams are:
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/rule.svg)
+**rule**
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rule.svg)
+**target**
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/target.svg)
+**dependencies**
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/dependencies.svg)
+**dependency**
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/dependency.svg)
+**post_target**
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/target.svg)
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/post_target.svg)
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/target_list.svg)
+**rexpr**
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/lexpr.svg)
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rexpr.svg)
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/rexpr.svg)
+**ltoken**
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/ltoken.svg)
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/ltoken.svg)
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/rtoken.svg)
+**rtoken**
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/word.svg)
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/rtoken.svg)
-![](https://cdn.rawgit.com/yarray/raka/master/doc/figures/protocol.svg)
+**word**
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/word.svg)
+**action**
+![](https://cdn.rawgit.com/yarray/raka/master/doc/syntax/action.svg)
 The definition is concise but several details are omitted for simplicity:
@@ -163,15 +271,15 @@ The definition is concise but several details are omitted for simplicity:
 4. The listed protocols are merely what we offered now. It can be greatly extended.
 5. Nearly any concept in the syntax can be replaced by a suitable ruby variable.
-## Pattern matching and template resolving
+### Pattern matching and template resolving
-When defined a rule like `lexpr = rexpr`, the left side represents a pattern and the right side contains specifications for extra dependecies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules will be examined one by one until a rule is matched. The matching process based on Regex also support named captures so that some varibales can be bound for use in the right side.
+When a rule is defined like `target = <specification>`, the left side represents a pattern and the right side contains specifications for extra dependencies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules will be examined one by one until a rule is matched. The matching process, based on Regex, also supports named captures so that some variables can be extracted for use in the right side.
-The specifications on the right side of a rule can be incomplete from various aspects, that is, they can contains some templates. The "holes" in the templates will be fulfilled by automatic variables and variables bounded when matching the left side.
+The specifications on the right side of a rule can contain templates. The "holes" in the templates will be filled by automatic variables and variables captured when matching the left side.
-### Pattern matching
+#### Pattern matching
-To match a given _file_ with a `lexpr`, asides the extension, the substrings of the file name between "\_\_" are mapped to tokens separated by `.`, in reverse order. After that, each substring is matched to the corresponding token or the regex in `[]`. For example, the rule
+To match a given _file_ with a `target`, the extension will be matched first. The substrings of the file name between "\_\_" are mapped to tokens separated by `.`, in reverse order. After that, each substring is matched to the corresponding token or the regex in `[]`. For example, the rule
 ```ruby
 pdf.buildings.indicator['\S+'].top['top_(\d+)']
@@ -186,26 +294,147 @@ can match "top_50\_\_node_num\_\_buildings.pdf". The logical process is:
    - `top_(\d+) ~ top_50`
 3. Two levels of captures are made. First, 'node_num' is captured as `indicator`, 'top_50' is captured as `top`; Second, '50' is captured as `top0` since `\d+` is wrapped in parenthesis and is the first.
-One can write special token `_` or `something[]` if the captured value is useful later, as the syntax sugar of `something['\S+']`.
+One can write the special token `_` to match any token. Since raka uses prefix matching, something like `token0['']` can also match any token and additionally capture it in `token0`. The end-of-line symbol `$` can be used to match the whole token, e.g., `token0['word$']` will not match `word_bench`.
-### Template resolving
+#### Template resolving
 In some places of `rexpr`, templates can be written instead of strings, so that it can represent different values at runtime. There are two types of variables that can be used in templates. The first is automatic variables, which is just like `$@` in Make or `task.name` in Rake. We even preserve some Make conventions for easier migrations. All automatic varibales begin with `$`. The possible automatic variables are:
-| symbol         | meaning                | symbol          | meaning                         |
-| -------------- | ---------------------- | --------------- | ------------------------------- |
-| \$@            | output file            | \$^             | all dependecies (sep by spaces) |
-| \$<            | first dependency       | $0, $1, … \$i   | ith depdency                    |
-| \$(scope)      | scope for current task | \$(output_stem) | stem of the output file         |
-| \$(input_stem) | stem of the input file |                 |                                 |
+| symbol                                    | description                                                                                            |
+| ----------------------------------------- | ------------------------------------------------------------------------------------------------------ |
+| \$@, \$(output)                           | the output file                                                                                        |
+| \$<, \$(input)                            | the input file defined in the chained target                                                           |
+| \$^, \$(deps)                             | all dependencies concatenated by commas (including input)                                              |
+| \$(dep0), \$(dep1), ...                   | the i-th dependency (input is $(dep0))                                                                 |
+| \$(input_stem)                            | stem of the input file                                                                                 |
+| \$(output_stem)                           | stem of the output file                                                                                |
+| \$(func)                                  | the token added to input to generate output, e.g., stat in csv.data.stat                               |
+| \$(ext)                                   | extension of the output file                                                                           |
+| \$(scope)                                 | scope for current task, i.e. the common directory for output, input and dependencies                   |
+| \$(target_scope)                          | the inline scope defined in target                                                                     |
+| \$(target_scope0), \$(target_scope1), ... | the i-th captured value by inline scope defined in target                                              |
+| \$(rule_scope0), \$(rule_scope1), ...     | the i-th scope defined in rule-level by nested calls of the dsl.scope function (i is larger insideout) |
+The other type of variables are those captured during pattern matching, which can be referred to using `%{var}`. In the example of the [pattern matching](#pattern-matching) section, `%{indicator}` will be replaced by `node_num`, `%{top}` will be replaced by `top_50` and `%{top0}` will be replaced by `50`. In such case, a template as `'calculate top %{top0} of %{indicator} for $@'` will be resolved as `'calculate top 50 of node_num for top_50__node_num__buildings.pdf'`.
+Templates can appear in various places. For dependencies and post targets, tokens with parentheses can contain templates, like `csv._('%{indicator}')`. The symbol of a token with parentheses is of no use and is generally replaced with an underscore. It is also possible to write a template literal directly, e.g. `'%{indicator}.csv'`. Templates can also be applied in actions, but this depends on the implementations of the protocols.
+### Actions and protocols
+Raka invokes **actions** when all input and dependencies are present. Generally, users define an action that generates the output. To maximize the flexibility, users can feed code in an arbitrary programming language to the corresponding **protocol**. The protocol will then transform and execute the code. Raka natively supports the host (ruby) protocol and several foreign protocols including shell, python, psql, and r.
+The host protocol is special and just executes the given ruby block. All other protocols can accept a templated code string given with the asterisk operator, or a block producing a templated code string. The following illustrates examples for each protocol.
+In the host protocol and the block versions of other protocols, a raka task (the *rask* variable) is provided, which offers the following properties:
+| property              | description                                                                          |
+| --------------------- | ------------------------------------------------------------------------------------ |
+| output                | the output file                                                                      |
+| input                 | the input file defined in the chained target                                         |
+| deps                  | the dependencies (input is deps[0])                                                  |
+| func                  | the token added to input to generate output, e.g., stat in csv.data.stat             |
+| ext                   | extension of the output file                                                         |
+| captures              | captured text during pattern matching, key-value                                     |
+| scope                 | scope for current task, i.e. the common directory for output, input and dependencies |
+| target_scope          | the inline scope defined in target                                                   |
+| target_scope_captures | captured values by inline scope defined in target                                    |
+| rule_scopes           | the scopes defined at rule level by nested calls of dsl.scope                        |
+```ruby
+require 'raka'
+require 'csv'
+require 'ostruct'
+dsl = Raka.new(
+  self, output_types: %i[table view csv],
+        lang: ['lang/psql', 'lang/shell', 'lang/python', 'lang/r']
+)
+csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)
+# host(ruby) protocol
+csv.rb_out = [csv.iris_all] | run do |rask|
+  in_f = File.open(rask.deps[0])
+  out_f = File.open(rask.output, 'w')
+  options = { headers: true, return_headers: true, write_headers: true }
+  CSV.filter(in_f, out_f, options) do |row|
+    row['class'] == 'Iris-versicolor'
+  end
+end
+# python protocol
+csv.py_out = [csv.iris_all] | py* %(
+  import pandas as pd
+  df = pd.read_csv('$(dep0)')
+  df[df['class'] == 'Iris-versicolor'].to_csv('$@')
+)
+# python protocol (block)
+csv.py_out2 = [csv.iris_all] | py do |rask|
+  <<-PYTHON
+  import pandas as pd
+  df = pd.read_csv('#{rask.deps[0]}')
+  df[df['class'] == 'Iris-versicolor'].to_csv('#{rask.output}')
+  PYTHON
+end
+# r protocol
+csv.r_out = [csv.iris_all] | r* %(
+  df <- read.csv("$(dep0)")
+  write.csv(df[(df$class == "Iris-versicolor"),], file="$@")
+)
-The other type of variables are those bounded during pattern matching,which can be referred to using `%{var}`. In the example of the [pattern matching](###pattern-matching) section, `%{indicator}` will be replaced by `node_num`, `%{top}` will be replaced by `top_50` and `%{top0}` will be replaced by `50`. In such case, a template as `'calculate top %{top0} of %{indicator} for $@'` will be resolved as `'calculate top 50 of node_num for top_50__node_num__buildings.pdf'`
+# r protocol (block)
+csv.r_out2 = [csv.iris_all] | r do |rask|
+  <<-R
+  df <- read.csv("#{rask.deps[0]}")
+  write.csv(df[(df$class == "Iris-versicolor"),], file="#{rask.output}")
+  R
+end
+# shell protocol
+csv.shell_out = [csv.iris_all] | shell* %(
+  cat <(head -1 $(dep0)) <(grep "Iris-versicolor" $(dep0)) > $@
+)
+# shell protocol (block)
+csv.shell_out2 = [csv.iris_all] | shell do |rask|
+  "cat <(head -1 #{rask.deps[0]}) <(grep 'Iris-versicolor' #{rask.deps[0]}) > #{rask.output}"
+end
+# psql protocol
+pg = OpenStruct.new(
+  user: 'postgres',
+  port: 5433,
+  host: '127.0.0.1',
+  db: 'postgres',
+  password: 'postgres'
+)
+psql.config conn: pg, create: :mview
+table.iris_all = [csv.iris_all] | psql(create: nil)* %(
+  DROP TABLE IF EXISTS $(output_stem);
+  CREATE TABLE $(output_stem) (
+    sepallength float,
+    sepalwidth float,
+    petallength float,
+    petalwidth float,
+    class varchar
+  );
+  \\COPY $(output_stem) FROM '$(dep0)' CSV HEADER;
+)
-The replacement of variables happen before any process to the template string. So do not include the symbols for automatic variables or `%{<anything>}` in templates.
+table.psql_out = [table.iris_all] | psql* %(
+  SELECT * FROM $(dep0_stem) WHERE class='Iris-versicolor'
+)
-Templates can happen in various places. For depdencies and post jobs, tokens with parenthesis can wrap in templates, like `csv._('%{indicator}')`. The symbol of a token with parenthesis is of no use and is generally omitted. It is also possible to write template literal directly, i.e. `'%{indicator}.csv'`. Where templates can be applied in actions depends on the protocols and will be explained later in the [Protocols](###protocols) section
+# psql protocol (block)
+table.psql_out2 = [table.iris_all] | psql do |rask|
+  <<-SQL
+  SELECT * FROM #{dsl.stem(rask.deps[0])} WHERE class='Iris-versicolor'
+  SQL
+end
-## APIs
+```
 ### Initialization and options
@@ -215,13 +444,13 @@ These APIs are bounded to an instance of DSL, you can create the object at the t
 dsl = DSL.new(<env>, <options>)
 ```
-The argument `<env>` should be the _self_ of a running Rakefile. In most case you can directly write:
+The argument `<env>` should be the *self* of a running Rakefile. In most cases you can directly write:
 ```ruby
 dsl = DSL.new(self, <options>)
 ```
-The argument `options` currently support `output_types` and `input_types`. For each item in `output_types`, you will get an extra function to bootstrap a rule. For example, with
+Two important fields of `options` are `output_types` and `input_types`. For each item in `output_types`, you will get a global function to bootstrap a rule. For example, with
 ```ruby
 dsl = DSL.new(self, { output_types: [:csv, :pdf] })
@@ -234,24 +463,19 @@ csv.data = ...
 pdf.graph = ...
 ```
-which will generate data.csv and graph.pdf
+which will match *<dir>/data.csv* and *<dir>/graph.pdf*.
-The `input_types` involves the strategy to find inputs. For example, raka will try to find both _numbers.csv_ and _numbers.table_ for a rule like `table.numbers.mean = …` if `input_type = [:csv, :table]`.
+The `input_types` option determines the strategy to find inputs. All possible input types will be tried when resolving an input file in a chained target. For example, raka will try to find both *numbers.csv* and *numbers.table* for a rule like `table.numbers.mean = …` if `input_types: [:csv, :table]`.
 ### Scope
-### Protocols
+Scopes define constraints which help users create rules more precisely. A scope generally refers to a folder and can appear in several places.
-Currently Raka support 4 lang: shell, psql, r and psqlf.
+**Task scope** is the scope when executing a task, a.k.a. **scope**. When a rule is matched given a desired output, a task is generated and its scope is the common folder of the output and all dependencies. For example, a rule `csv.out = [csv.in] | ...` can be matched given *out/out.csv* and the task scope is resolved as *out/*. The task will thus search for *out/in.csv* as dependency.
-```ruby
-shell(base_dir='./')* code::templ_str { |task| ... }
-psql(options={})* code::templ_str { |task| ... }
-r(src:str, libs=[])* code::templ_str { |task| ... }
+**Rule scope** is the scope to restrict possible task scope, given by `Raka::scope`. In the following example, the rule scopes are
-# options = { script_name: , script_file: , params: }
-psqlf(options={})
-```
+**Target scope.**
 ## Rakefile Template
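
The pattern matching and template resolving rules described in the README diff above can be illustrated with a small standalone Ruby sketch. This is not raka's implementation: the method names `match_target` and `resolve_template`, the `[name, pattern]` token representation, and the choice to compare plain tokens by exact equality are all assumptions made for illustration. Only `%{var}` and `$@` are handled.

```ruby
# Hypothetical model of a target: an extension plus [name, pattern] pairs,
# where pattern is nil for a plain token. Not raka's internal representation.
def match_target(file, ext, tokens)
  base = File.basename(file)
  return nil unless File.extname(base) == ".#{ext}"

  # Substrings between "__" map to the tokens in reverse order.
  parts = File.basename(base, ".#{ext}").split('__').reverse
  return nil unless parts.size == tokens.size

  captures = {}
  tokens.zip(parts).each do |(name, pattern), part|
    if pattern.nil?
      # Assumption: a plain token must equal the substring exactly.
      return nil unless part == name
    else
      # Prefix matching: the regex is anchored at the start only.
      m = /\A(?:#{pattern})/.match(part)
      return nil unless m

      captures[name.to_sym] = part
      m.captures.each_with_index { |c, i| captures[:"#{name}#{i}"] = c }
    end
  end
  captures
end

# Fill %{var} holes from the captures, then the automatic variable $@.
def resolve_template(template, captures, output)
  filled = template.gsub(/%\{(\w+)\}/) { captures.fetch(Regexp.last_match(1).to_sym) }
  filled.gsub('$@') { output }
end

# Models pdf.buildings.indicator['\S+'].top['top_(\d+)'] from the README.
tokens = [['buildings', nil], ['indicator', '\S+'], ['top', 'top_(\d+)']]
file = 'top_50__node_num__buildings.pdf'
caps = match_target(file, 'pdf', tokens)
puts resolve_template('calculate top %{top0} of %{indicator} for $@', caps, file)
```

With the README's example file name, `caps` holds `indicator`, `top` and `top0`, the same two levels of captures that the pattern matching section describes.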

data/VERSION CHANGED

@@ -1 +1 @@
-0.3.3
+0.3.4
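
Only the last segment of the version changes here. As a small sketch using `Gem::Version` and `Gem::Requirement` from RubyGems, the following checks how the two releases order and whether a pessimistic constraint written against 0.3.3 would accept the new one. The constraint `~> 0.3.3` is an illustrative assumption, not something stated in the gem's metadata.

```ruby
require 'rubygems'

old_version = Gem::Version.new('0.3.3')
new_version = Gem::Version.new('0.3.4')

# Gem::Version is Comparable, so releases order naturally.
newer = new_version > old_version

# "~> 0.3.3" allows versions >= 0.3.3 and < 0.4.
accepted = Gem::Requirement.new('~> 0.3.3').satisfied_by?(new_version)

puts "newer: #{newer}, accepted by ~> 0.3.3: #{accepted}"
```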