raka 0.2.3
- checksums.yaml +7 -0
- data/LICENSE +21 -0
- data/README.md +261 -0
- data/VERSION +1 -0
- data/lib/compile.rb +173 -0
- data/lib/interface.rbs +23 -0
- data/lib/lang/psql/impl.rb +59 -0
- data/lib/lang/python/impl.rb +32 -0
- data/lib/lang/r/impl.rb +38 -0
- data/lib/lang/r/io.R +113 -0
- data/lib/lang/shell/impl.rb +17 -0
- data/lib/output_type.rb +36 -0
- data/lib/protocol.rb +122 -0
- data/lib/raka.rb +70 -0
- data/lib/temp.json +9167 -0
- data/lib/token.rb +165 -0
- metadata +158 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: a3391bad82ef017d820e266d897d570d32f6a62c3b09382b4b6ad8d1f8ead6fe
|
4
|
+
data.tar.gz: 39aa088fc9e87bda8a7ad14f60678b934ca006de0d78c893da5678f0a1082740
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: b0b426ad5076914630e192345d0796dfa1c30c580f3709d6c88773170ee8e18148c9795f6f7c2619bc5fa22983f0f50f4ed34e09880fc3a8004cbd773b138a5e
|
7
|
+
data.tar.gz: db00045e711049a23735dc9358b7faca10528a07b893307becd8eb4d73a90002430a19f6a4e5bb228366455b5066ee1e27b0b609b9b2192768fb8ea33131986a
|
data/LICENSE
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
MIT License
|
2
|
+
|
3
|
+
Copyright (c) 2016 yarray
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
7
|
+
in the Software without restriction, including without limitation the rights
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
10
|
+
furnished to do so, subject to the following conditions:
|
11
|
+
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
13
|
+
copies or substantial portions of the Software.
|
14
|
+
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
21
|
+
SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,261 @@
|
|
1
|
+
**Raka** is a **DSL** (Domain Specific Language) on top of **Rak**e for defining and running d**a**t**a** processing workflows. Raka is specifically designed for data processing, with improved pattern matching, scopes, language extensions and many conventions to reduce verbosity.
|
2
|
+
|
3
|
+
## Why Raka
|
4
|
+
|
5
|
+
Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
|
6
|
+
|
7
|
+
1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
|
8
|
+
2. Extensible and context-aware protocol architecture
|
9
|
+
3. Multilingual. Other programming languages can be easily embedded
|
10
|
+
4. Auto dependency and naming by conventions
|
11
|
+
5. Support scopes to ease comparative studies
|
12
|
+
6. Terser syntax
|
13
|
+
|
14
|
+
... and more.
|
15
|
+
|
16
|
+
## Usage
|
17
|
+
|
18
|
+
Raka is a drop-in library for rake. Though rake is cross-platform, raka may not work on Windows since it relies on some shell facilities. To use raka, one has to install ruby and rake first. Ruby is available for most \*nix systems including Mac OSX, so the only task is to install rake:
|
19
|
+
|
20
|
+
```bash
|
21
|
+
gem install rake
|
22
|
+
```
|
23
|
+
|
24
|
+
The next step is to clone this project to your local machine, `cd` into the directory, and install the gem:
|
25
|
+
|
26
|
+
```bash
|
27
|
+
gem install pkg/raka-0.1.0.gem
|
28
|
+
```
|
29
|
+
|
30
|
+
## For the Impatient
|
31
|
+
|
32
|
+
First create a file named `main.raka`, then import and initialize the DSL:
|
33
|
+
|
34
|
+
```ruby
|
35
|
+
require 'raka'
|
36
|
+
|
37
|
+
dsl = DSL.new(self,
|
38
|
+
output_types: [:txt, :table, :pdf, :idx],
|
39
|
+
input_types: [:txt, :table]
|
40
|
+
)
|
41
|
+
```
|
42
|
+
|
43
|
+
Then the code below will define two simple rules:
|
44
|
+
|
45
|
+
```ruby
|
46
|
+
txt.sort.first50 = shell* "cat sort.txt | head -n 50 > $@"
|
47
|
+
txt.sort = [txt.input] | shell* "cat $< | sort -rn > $@"
|
48
|
+
```
|
49
|
+
|
50
|
+
For testing let's prepare an input file named `input.txt`:
|
51
|
+
|
52
|
+
```bash
|
53
|
+
seq 1000 > input.txt
|
54
|
+
```
|
55
|
+
|
56
|
+
We can then invoke `rake first50__sort.txt`. The script will read data from _input.txt_, sort the numbers in descending order, and keep the first 50 lines.
|
57
|
+
|
58
|
+
The workflow here is as follows:
|
59
|
+
|
60
|
+
1. Try to find _first50\_\_sort.txt_: it does not exist
|
61
|
+
2. Rule `txt.sort.first50` matched
|
62
|
+
3. For rule `txt.sort.first50`, find input file _sort.txt_ or _sort.table_. Neither exists
|
63
|
+
4. Rule `txt.sort` matched
|
64
|
+
5. Rule `txt.sort` has no input file but depends on the target `txt.input`
|
65
|
+
6. Find file _input.txt_ or _input.table_. Use the former
|
66
|
+
7. Run rule `txt.sort` and create _sort.txt_
|
67
|
+
8. Run rule `txt.sort.first50` and create _first50\_\_sort.txt_
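
The naming convention used in the steps above (tokens joined in reverse order with `__`, followed by the extension) can be sketched in plain Ruby. This is only an illustration under that assumption; `target_file_name` is a hypothetical helper and not part of Raka's API.

```ruby
# Hypothetical helper (not part of Raka): derive the file name that a chain
# of rule tokens stands for. Tokens are listed in the order they are written
# in the rule, e.g. txt.sort.first50 -> %w[sort first50].
def target_file_name(ext, tokens)
  "#{tokens.reverse.join('__')}.#{ext}"
end

puts target_file_name(:txt, %w[sort first50]) # first50__sort.txt
puts target_file_name(:txt, %w[sort])         # sort.txt
```

Reading the file name from right to left therefore gives the processing order: _sort_ is produced first, then _first50_ is derived from it.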
|
68
|
+
|
69
|
+
This illustrates some basic ideas but may not be particularly interesting. The following is a more sophisticated example from real-world research that covers more features.
|
70
|
+
|
71
|
+
```ruby
|
72
|
+
SRC_DIR = File.absolute_path 'src'
|
73
|
+
USER = 'postgres'
|
74
|
+
DB = 'osm'
|
75
|
+
HOST = 'localhost'
|
76
|
+
PORT = 5432
|
77
|
+
|
78
|
+
def idx_this() [idx._('$(output_stem)')] end
|
79
|
+
|
80
|
+
dsl.scope :de
|
81
|
+
|
82
|
+
idx._ = psqlf(script_name: '$stem_idx.sql')
|
83
|
+
pdf.buildings.func['(\S+)_graph'] = r(:graph)* %[
|
84
|
+
table_input("$(input_stem)") | draw_%{func0} | ggplot_output('$@') ]
|
85
|
+
table.buildings = [csv.admin] | psqlf(admin: '$<') | idx_this
|
86
|
+
```
|
87
|
+
|
88
|
+
Assume that we have a schema named _de_ in database _osm_, have an input file _admin.csv_, and have _graph.R_ and _buildings.sql_ under _src/_. Now further assume that _graph.R_ contains two functions:
|
89
|
+
|
90
|
+
```r
|
91
|
+
draw_stat_snapshot <- function(d) { ... }
|
92
|
+
draw_user_trend <- function(d) { ... }
|
93
|
+
```
|
94
|
+
|
95
|
+
...and _buildings.sql_ contains table creation code like:
|
96
|
+
|
97
|
+
```sql
|
98
|
+
DROP TABLE IF EXISTS buildings;
|
99
|
+
CREATE TABLE buildings AS ( ... );
|
100
|
+
```
|
101
|
+
|
102
|
+
We may also have a _buildings_idx.sql_ to create an index for the table.
|
103
|
+
|
104
|
+
Then we can run either `rake de/stat_snapshot_graph__buildings.pdf` or `rake de/user_trend_graph__buildings.pdf`, which will do a bunch of things on the first run (taking the former as an example):
|
105
|
+
|
106
|
+
1. Target file not found.
|
107
|
+
2. Rule `pdf.buildings.func['(\S+)_graph']` matched. "stat_snapshot_graph" is bound to `func` and "stat_snapshot" is bound to `func0`.
|
108
|
+
3. None of the four possible input files (_de/buildings.table_, _de/buildings.txt_, _buildings.table_, _buildings.txt_) can be found. Rule `table.buildings` is matched and its only dependency file, _admin.csv_, is found.
|
109
|
+
4. The protocol `psqlf` finds the source file _src/buildings.sql_, interpolates the options with automatic variables (`$<` as "admin.csv"), runs the SQL, and creates a placeholder file _de/buildings.table_ afterwards.
|
110
|
+
5. Run the post-job `idx_this`. According to the rule `idx._`, it finds and runs _buildings_idx.sql_, then creates a placeholder file _de/buildings.idx_.
|
111
|
+
6. For rule `pdf.buildings.func['(\S+)_graph']`, the R code in `%[]` is interpolated with several automatic variables (`$(input_stem)` as "buildings", `$@` as "de/stat_snapshot_graph\_\_buildings.pdf") and the variables (`func`, `func0`) bound before.
|
112
|
+
7. Run the R code. The _buildings_ table is piped into the function `draw_stat_snapshot` and then into `ggplot_output`, which writes the graph to the specified pdf file.
|
113
|
+
|
114
|
+
## Syntax of Rules
|
115
|
+
|
116
|
+
It is possible to use Raka with little knowledge of ruby / rake, though a minimal understanding is highly recommended. The formal syntax of a rule can be defined as follows (in EBNF):
|
117
|
+
|
118
|
+
```ebnf
|
119
|
+
rule = lexpr "=" {target_list "|"} protocol {"|" target_list};
|
120
|
+
|
121
|
+
target = rexpr | template;
|
122
|
+
|
123
|
+
target_list = "[]" | "[" target {"," target} "]";
|
124
|
+
|
125
|
+
lexpr = ext "." {ltoken "."} ltoken;
|
126
|
+
rexpr = ext "." rtoken {"." rtoken};
|
127
|
+
|
128
|
+
ltoken = word | word "[" pattern "]";
|
129
|
+
rtoken = word | word "(" template ")";
|
130
|
+
|
131
|
+
word = ("_" | letter) { letter | digit | "_" };
|
132
|
+
|
133
|
+
protocol = ("shell" | "r" | "psql") ("*" template | BLOCK )
|
134
|
+
| "psqlf" | "psqlf" "(" HASH ")";
|
135
|
+
```
|
136
|
+
|
137
|
+
The corresponding railroad diagrams are:
|
138
|
+
|
139
|
+

|
140
|
+
|
141
|
+

|
142
|
+
|
143
|
+

|
144
|
+
|
145
|
+

|
146
|
+
|
147
|
+

|
148
|
+
|
149
|
+

|
150
|
+
|
151
|
+

|
152
|
+
|
153
|
+

|
154
|
+
|
155
|
+

|
156
|
+
|
157
|
+
The definition is concise but several details are omitted for simplicity:
|
158
|
+
|
159
|
+
1. **BLOCK** and **HASH** are ruby's block and hash objects.
|
160
|
+
2. A **template** is just a ruby string, but with some placeholders (see the next section for details)
|
161
|
+
3. A **pattern** is just a ruby string which represents regex (see the next section for details)
|
162
|
+
4. The listed protocols are merely the ones offered now. The set can be greatly extended.
|
163
|
+
5. Nearly any concept in the syntax can be replaced by a suitable ruby variable.
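
To make the shape `{target_list "|"} protocol {"|" target_list}` more concrete, here is a minimal sketch of how a flattened right-hand side could be split into extra dependencies, actions and post targets. It assumes, as an illustration only, that actions are the items responding to `call`; `split_rhs` is a hypothetical helper and not Raka's public API.

```ruby
# Hypothetical helper (not Raka's API): split a flattened right-hand side into
# leading dependencies, the run of callable actions, and trailing post targets.
def split_rhs(rhs)
  first_action = rhs.index { |item| item.respond_to?(:call) }
  return { deps: rhs, actions: [], posts: [] } if first_action.nil?

  deps = rhs.take(first_action)
  rest = rhs.drop(first_action)
  # Index of the first non-callable item after the actions, if any.
  first_post = rest.index { |item| !item.respond_to?(:call) } || rest.length
  { deps: deps, actions: rest.take(first_post), posts: rest.drop(first_post) }
end

action = proc { |task| puts "running #{task}" }
parts = split_rhs(['admin.csv', action, 'buildings.idx'])
puts parts[:deps].inspect  # ["admin.csv"]
puts parts[:posts].inspect # ["buildings.idx"]
```

Splitting by position rather than by type keeps the rule syntax terse: anything before the first action is a dependency, anything after the last action is raked afterwards.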
|
164
|
+
|
165
|
+
## Pattern matching and template resolving
|
166
|
+
|
167
|
+
When a rule is defined like `lexpr = rexpr`, the left side represents a pattern and the right side contains specifications for extra dependencies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules are examined one by one until a rule matches. The regex-based matching process also supports named captures, so that some variables can be bound for use on the right side.
|
168
|
+
|
169
|
+
The specifications on the right side of a rule can be incomplete in various respects, that is, they can contain templates. The "holes" in the templates are filled by automatic variables and by variables bound when matching the left side.
|
170
|
+
|
171
|
+
### Pattern matching
|
172
|
+
|
173
|
+
To match a given _file_ against a `lexpr`, aside from the extension, the substrings of the file name separated by "\_\_" are mapped, in reverse order, to the tokens separated by `.`. After that, each substring is matched against the corresponding token or the regex in `[]`. For example, the rule
|
174
|
+
|
175
|
+
```ruby
|
176
|
+
pdf.buildings.indicator['\S+'].top['top_(\d+)']
|
177
|
+
```
|
178
|
+
|
179
|
+
can match "top_50\_\_node_num\_\_buildings.pdf". The logical process is:
|
180
|
+
|
181
|
+
1. The extension `pdf` matches.
|
182
|
+
2. The substrings and the tokens are paired and they all match:
|
183
|
+
- `buildings ~ buildings`
|
184
|
+
- `'\S+' ~ node_num`
|
185
|
+
- `top_(\d+) ~ top_50`
|
186
|
+
3. Two levels of captures are made. First, 'node_num' is captured as `indicator` and 'top_50' is captured as `top`. Second, '50' is captured as `top0`, since `\d+` is wrapped in parentheses and is the first group.
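
The matching steps above can be sketched in plain Ruby. This is a simplified illustration, not Raka's implementation: `match_target` is a hypothetical helper, tokens are given as `[name, pattern]` pairs, and a `nil` pattern stands for a literal token.

```ruby
# Hypothetical matcher (not Raka's implementation). Returns a hash of captures
# on success and nil when the file does not match.
def match_target(file, ext, tokens)
  return nil unless File.extname(file) == ".#{ext}"

  # Substrings between "__" map to tokens in reverse order.
  parts = File.basename(file, '.*').split('__').reverse
  return nil unless parts.size == tokens.size

  captures = {}
  tokens.zip(parts).each do |(name, pattern), part|
    if pattern.nil?
      return nil unless part == name
    else
      m = /\A(?:#{pattern})\z/.match(part)
      return nil if m.nil?

      captures[name.to_sym] = part # first level: the whole substring
      m.captures.each_with_index { |c, i| captures[:"#{name}#{i}"] = c } # second level: groups
    end
  end
  captures
end

tokens = [['buildings', nil], ['indicator', '\S+'], ['top', 'top_(\d+)']]
captures = match_target('top_50__node_num__buildings.pdf', :pdf, tokens)
puts captures[:indicator] # node_num
puts captures[:top]       # top_50
puts captures[:top0]      # 50
```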
|
187
|
+
|
188
|
+
One can write the special token `_`, or `something[]` if the captured value is useful later, as syntactic sugar for `something['\S+']`.
|
189
|
+
|
190
|
+
### Template resolving
|
191
|
+
|
192
|
+
In some places of `rexpr`, templates can be written instead of strings, so that they can represent different values at runtime. There are two types of variables that can be used in templates. The first is automatic variables, which are just like `$@` in Make or `task.name` in Rake. We even preserve some Make conventions for easier migration. All automatic variables begin with `$`. The possible automatic variables are:
|
193
|
+
|
194
|
+
| symbol | meaning | symbol | meaning |
|
195
|
+
| -------------- | ---------------------- | ------------- | ------------------------------- |
|
196
|
+
| \$@ | output file | \$^ | all dependencies (sep by spaces) |
|
197
|
+
| \$< | first dependency | $0, $1, … \$i | ith dependency |
|
198
|
+
| \$(scope) | scope for current task | \$(output_stem) | stem of the output file |
|
199
|
+
| \$(input_stem) | stem of the input file | | |
|
200
|
+
|
201
|
+
The other type of variables are those bound during pattern matching, which can be referred to using `%{var}`. In the example of the [pattern matching](#pattern-matching) section, `%{indicator}` will be replaced by `node_num`, `%{top}` will be replaced by `top_50` and `%{top0}` will be replaced by `50`. In that case, a template such as `'calculate top %{top0} of %{indicator} for $@'` will be resolved as `'calculate top 50 of node_num for top_50__node_num__buildings.pdf'`.
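
The resolution of this template can be reproduced with a few lines of plain Ruby. This is a minimal sketch that handles only `$@` and `%{var}`; `resolve_template` is a hypothetical helper, not Raka's API.

```ruby
# Hypothetical helper (not Raka's API): substitute the automatic variable $@
# first, then fill the %{var} holes with the captures from pattern matching.
def resolve_template(template, output:, captures:)
  text = template.gsub('$@', output)
  text % captures # String#% fills %{name} from a hash with symbol keys
end

captures = { indicator: 'node_num', top: 'top_50', top0: '50' }
puts resolve_template(
  'calculate top %{top0} of %{indicator} for $@',
  output: 'top_50__node_num__buildings.pdf',
  captures: captures
)
# calculate top 50 of node_num for top_50__node_num__buildings.pdf
```

Note that `String#%` raises on a stray `%` that is not part of `%{...}`, so a real implementation has to protect such characters before formatting.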
|
202
|
+
|
203
|
+
The replacement of variables happens before any other processing of the template string, so do not include the symbols for automatic variables or `%{<anything>}` in templates unless they are meant to be replaced.
|
204
|
+
|
205
|
+
Templates can appear in various places. For dependencies and post jobs, tokens with parentheses can wrap templates, like `csv._('%{indicator}')`. The symbol of a token with parentheses is of no use and is generally omitted. It is also possible to write a template literal directly, e.g. `'%{indicator}.csv'`. Where templates can be applied in actions depends on the protocol and will be explained later in the [Protocols](#protocols) section.
|
206
|
+
|
207
|
+
## APIs
|
208
|
+
|
209
|
+
### Initialization and options
|
210
|
+
|
211
|
+
These APIs are bound to an instance of DSL. You can create the object at the top:
|
212
|
+
|
213
|
+
```ruby
|
214
|
+
dsl = DSL.new(<env>, <options>)
|
215
|
+
```
|
216
|
+
|
217
|
+
The argument `<env>` should be the _self_ of a running Rakefile. In most cases you can directly write:
|
218
|
+
|
219
|
+
```ruby
|
220
|
+
dsl = DSL.new(self, <options>)
|
221
|
+
```
|
222
|
+
|
223
|
+
The argument `options` currently supports `output_types` and `input_types`. For each item in `output_types`, you will get an extra function to bootstrap a rule. For example, with
|
224
|
+
|
225
|
+
```ruby
|
226
|
+
dsl = DSL.new(self, { output_types: [:csv, :pdf] })
|
227
|
+
```
|
228
|
+
|
229
|
+
you can write these rules like:
|
230
|
+
|
231
|
+
```ruby
|
232
|
+
csv.data = ...
|
233
|
+
pdf.graph = ...
|
234
|
+
```
|
235
|
+
|
236
|
+
which will generate data.csv and graph.pdf
|
237
|
+
|
238
|
+
The `input_types` option determines the strategy to find inputs. For example, raka will try to find both _numbers.csv_ and _numbers.table_ for a rule like `table.numbers.mean = …` if `input_types = [:csv, :table]`.
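
The lookup can be pictured as enumerating candidate file names. The sketch below is an illustration only: `input_candidates` is a hypothetical helper, and the order (scope directory first, then the root) is an assumption based on the list of candidate files given in the earlier example.

```ruby
# Hypothetical helper (not Raka's API): list the files that could serve as the
# input of a rule, given the input stem, the configured input types and an
# optional scope. The search order is an assumption.
def input_candidates(stem, input_types, scope = nil)
  dirs = scope.nil? ? [''] : ["#{scope}/", '']
  dirs.flat_map { |dir| input_types.map { |ext| "#{dir}#{stem}.#{ext}" } }
end

puts input_candidates('numbers', %i[csv table]).inspect
# ["numbers.csv", "numbers.table"]
puts input_candidates('buildings', %i[table txt], 'de').inspect
# ["de/buildings.table", "de/buildings.txt", "buildings.table", "buildings.txt"]
```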
|
239
|
+
|
240
|
+
### Scope
|
241
|
+
|
242
|
+
### Protocols
|
243
|
+
|
244
|
+
Currently Raka supports 4 protocols: shell, psql, r and psqlf.
|
245
|
+
|
246
|
+
```ruby
|
247
|
+
shell(base_dir='./')* code::templ_str { |task| ... }
|
248
|
+
psql(options={})* code::templ_str { |task| ... }
|
249
|
+
r(src:str, libs=[])* code::templ_str { |task| ... }
|
250
|
+
|
251
|
+
# options = { script_name: , script_file: , params: }
|
252
|
+
psqlf(options={})
|
253
|
+
```
|
254
|
+
|
255
|
+
## Rakefile Template
|
256
|
+
|
257
|
+
## Write your own protocols
|
258
|
+
|
259
|
+
## Compare to other tools
|
260
|
+
|
261
|
+
Raka borrows some ideas from Drake but not much (currently mainly the name "protocol"). Briefly, we have different visions and perhaps different suitable scenarios.
|
data/VERSION
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
0.2.3
|
data/lib/compile.rb
ADDED
@@ -0,0 +1,173 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require 'fileutils'
|
4
|
+
|
5
|
+
require_relative './token'
|
6
|
+
|
7
|
+
def array_to_hash(array)
|
8
|
+
array.nil? ? {} : Hash[((0...array.size).map { |i| i.to_s.to_sym }).zip array]
|
9
|
+
end
|
10
|
+
|
11
|
+
def protect_percent_symbol(text)
|
12
|
+
anchor = '-_-_-'
|
13
|
+
safe_text = text.gsub(/%(?=[^\s{]+)/, anchor) # replace % not in shape of %{ to special sign
|
14
|
+
safe_text = yield safe_text
|
15
|
+
safe_text.gsub(anchor, '%') # restore the protected % signs from the special sign
|
16
|
+
end
|
17
|
+
|
18
|
+
# compiles rule (lhs = rhs) to rake task
|
19
|
+
class DSLCompiler
|
20
|
+
attr_reader :env
|
21
|
+
|
22
|
+
# keep env as running environment of rake since we want to inject rules
|
23
|
+
def initialize(env, options)
|
24
|
+
@env = env
|
25
|
+
@options = options
|
26
|
+
end
|
27
|
+
|
28
|
+
# Raka task structure, input task is rake's task pushed into blocks
|
29
|
+
def dsl_task(token, task)
|
30
|
+
name = task.name
|
31
|
+
deps = task.prerequisites
|
32
|
+
|
33
|
+
output_info = token._parse_output_ name
|
34
|
+
task_info = {
|
35
|
+
name: name,
|
36
|
+
deps: deps,
|
37
|
+
deps_str: deps.join(','),
|
38
|
+
input: deps.first || '',
|
39
|
+
task: task
|
40
|
+
}
|
41
|
+
OpenStruct.new(output_info.to_h.merge(task_info))
|
42
|
+
end
|
43
|
+
|
44
|
+
# resolve auto variables with only output info,
|
45
|
+
# useful when resolve extra deps (task is not available yet)
|
46
|
+
def resolve_by_output(target, output_info)
|
47
|
+
info = output_info
|
48
|
+
text = target.respond_to?(:_template_) ? target._template_(info.scope).to_s : target.to_s
|
49
|
+
text = text
|
50
|
+
.gsub('$(scope)', info.scope.nil? ? '' : info.scope)
|
51
|
+
.gsub('$(target_scope)', info.target_scope.nil? ? '' : info.target_scope)
|
52
|
+
.gsub('$(output)', info.output)
|
53
|
+
.gsub('$(output_stem)', info.stem)
|
54
|
+
.gsub('$(input_stem)', info.input_stem.nil? ? '' : info.input_stem)
|
55
|
+
.gsub('$@', info.name)
|
56
|
+
|
57
|
+
protect_percent_symbol text do |safe_text|
|
58
|
+
safe_text = safe_text % (info.to_h.merge info.captures.to_h)
|
59
|
+
safe_text = safe_text.gsub(/\$\(rule_scope(\d+)\)/, '%{\1}') % array_to_hash(info.rule_scopes)
|
60
|
+
safe_text.gsub(/\$\(target_scope(\d+)\)/, '%{\1}') % array_to_hash(info.target_scope_captures)
|
61
|
+
end
|
62
|
+
end
|
63
|
+
|
64
|
+
# resolve auto variables with dsl task
|
65
|
+
def resolve(target, task)
|
66
|
+
# convert target to text whether it is expression or already text
|
67
|
+
text = resolve_by_output target, task
|
68
|
+
|
69
|
+
# convert $0, $1 to the universal shape of %{dep} as captures
|
70
|
+
text = text
|
71
|
+
.gsub('$^', task.deps_str)
|
72
|
+
.gsub('$<', task.input || '')
|
73
|
+
.gsub('$(deps)', task.deps_str)
|
74
|
+
.gsub('$(input)', task.input || '')
|
75
|
+
|
76
|
+
protect_percent_symbol text do |safe_text|
|
77
|
+
# add numbered auto variables like $(dep0), $(dep2) referring to the first and third deps
|
78
|
+
safe_text.gsub(/\$\(dep(\d+)\)/, '%{\1}') % array_to_hash(task.deps)
|
79
|
+
end
|
80
|
+
end
|
81
|
+
|
82
|
+
def rule_action(lhs, actions, extra_tasks, task)
|
83
|
+
return if actions.empty?
|
84
|
+
|
85
|
+
task = dsl_task(lhs, task)
|
86
|
+
@env.logger.info "raking #{task.name}"
|
87
|
+
unless task.scope.nil?
|
88
|
+
folder = task.scope
|
89
|
+
folder = File.join(task.scope, task.target_scope) unless task.target_scope.nil?
|
90
|
+
FileUtils.makedirs(folder)
|
91
|
+
end
|
92
|
+
actions.each do |action|
|
93
|
+
action.call @env, task do |code|
|
94
|
+
resolve(code, task)
|
95
|
+
end
|
96
|
+
end
|
97
|
+
|
98
|
+
extra_tasks.each do |templ|
|
99
|
+
Rake::Task[resolve(templ, task)].invoke
|
100
|
+
end
|
101
|
+
end
|
102
|
+
|
103
|
+
# build one rule
|
104
|
+
def create_rule(lhs, input_ext, actions, extra_deps, extra_tasks)
|
105
|
+
# the "rule" method is private, maybe here are better choices
|
106
|
+
@env.send(:rule, lhs._pattern_ => [proc do |target|
|
107
|
+
inputs = lhs._inputs_(target, input_ext)
|
108
|
+
output = lhs._parse_output_(target)
|
109
|
+
plain_extra_deps = extra_deps.map do |templ|
|
110
|
+
resolve_by_output(templ, output)
|
111
|
+
end
|
112
|
+
# main data source and extra dependencies
|
113
|
+
inputs + plain_extra_deps
|
114
|
+
end]) do |task|
|
115
|
+
# rake continue task even if dependencies not met, we handle ourselves
|
116
|
+
absence = task.prerequisites.find_index { |f| !File.exist? f }
|
117
|
+
unless absence.nil?
|
118
|
+
@env.logger.warn\
|
119
|
+
"Dependent #{task.prerequisites[absence]} does not exist, skip task #{task.name}"
|
120
|
+
next
|
121
|
+
end
|
122
|
+
rule_action(lhs, actions, extra_tasks, task)
|
123
|
+
end
|
124
|
+
end
|
125
|
+
|
126
|
+
# compile token = rhs to rake rule
|
127
|
+
# rubocop:disable Style/MethodLength
|
128
|
+
def compile(lhs, rhs)
|
129
|
+
unless @env.instance_of?(Object)
|
130
|
+
raise "DSL compile error: seems not a valid @env of rake with class #{@env.class}"
|
131
|
+
end
|
132
|
+
|
133
|
+
# the format is [dep, ...] | [action, ...] | [post, ...], where the posts
|
134
|
+
# are the targets that will be raked after the actions
|
135
|
+
actions_start = rhs.find_index { |item| item.respond_to?(:call) }
|
136
|
+
|
137
|
+
# case 1: has action
|
138
|
+
if actions_start
|
139
|
+
extra_deps = rhs[0, actions_start]
|
140
|
+
actions_end = rhs[actions_start, rhs.length].find_index do |item|
|
141
|
+
!item.respond_to?(:call)
|
142
|
+
end
|
143
|
+
|
144
|
+
# case 1.1: has post
|
145
|
+
if actions_end
|
146
|
+
actions_end += actions_start
|
147
|
+
actions = rhs[actions_start...actions_end] # actions_end is an absolute index here, so slice by range
|
148
|
+
extra_tasks = rhs[actions_end, rhs.length]
|
149
|
+
# case 1.2: no post
|
150
|
+
else
|
151
|
+
actions = rhs[actions_start, rhs.length]
|
152
|
+
extra_tasks = []
|
153
|
+
end
|
154
|
+
# case 2: no action
|
155
|
+
else
|
156
|
+
extra_deps = rhs
|
157
|
+
actions = []
|
158
|
+
extra_tasks = []
|
159
|
+
end
|
160
|
+
|
161
|
+
unless lhs._input_?
|
162
|
+
create_rule lhs, proc { [] }, actions, extra_deps, extra_tasks
|
163
|
+
return
|
164
|
+
end
|
165
|
+
|
166
|
+
# We generate a rule for each possible input type
|
167
|
+
@options.input_types.each do |ext|
|
168
|
+
# We find auto source from both THE scope and the root
|
169
|
+
create_rule lhs, ext, actions, extra_deps, extra_tasks
|
170
|
+
end
|
171
|
+
end
|
172
|
+
end
|
173
|
+
# rubocop:enable Style/MethodLength
|
data/lib/interface.rbs
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
class RakaTask
|
2
|
+
attr_reader name: String
|
3
|
+
attr_reader stem: String
|
4
|
+
attr_reader func: String?
|
5
|
+
attr_reader input_stem: String?
|
6
|
+
attr_reader scope: String?
|
7
|
+
attr_reader target_scope: String?
|
8
|
+
attr_reader scopes: Array[String]
|
9
|
+
attr_reader target_scope_captures: Array[String]
|
10
|
+
attr_reader captures: Hash[String, String]
|
11
|
+
attr_reader deps: Array[String]
|
12
|
+
attr_reader deps_str: String
|
13
|
+
attr_reader input: String
|
14
|
+
attr_reader task: Object # RakeTask
|
15
|
+
end
|
16
|
+
|
17
|
+
class RakaEnv
|
18
|
+
end
|
19
|
+
|
20
|
+
class LanguageImpl
|
21
|
+
def build: (String code, RakaTask task) -> String
|
22
|
+
def run_script: (RakaEnv env, String fname, RakaTask task) -> nil
|
23
|
+
end
|
data/lib/lang/psql/impl.rb
ADDED
@@ -0,0 +1,59 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require_relative '../../protocol'
|
4
|
+
|
5
|
+
def bash(env, cmd)
|
6
|
+
code = remove_common_indent(
|
7
|
+
%(set -e
|
8
|
+
set -o pipefail
|
9
|
+
|
10
|
+
#{cmd}
|
11
|
+
)
|
12
|
+
)
|
13
|
+
# puts code
|
14
|
+
env.send :sh, 'bash ' + create_tmp(code)
|
15
|
+
end
|
16
|
+
|
17
|
+
# postgresql protocol using psql, requires HOST, PORT, USER, DB
|
18
|
+
class Psql
|
19
|
+
# Sometimes we want to use the psql command with bash directly
|
20
|
+
def sh_cmd(scope)
|
21
|
+
c = @conn
|
22
|
+
env_vars = "PGOPTIONS='-c search_path=#{scope ? scope + ',' : ''}public' "
|
23
|
+
"PGPASSWORD=#{c.password} #{env_vars} psql -h #{c.host} -p #{c.port} -U #{c.user} -d #{c.db} -v ON_ERROR_STOP=1"
|
24
|
+
end
|
25
|
+
|
26
|
+
# 1. do not add required argument here, so psql.config will work or we can only use psql(conn: xxx).config
|
27
|
+
def initialize(conn: nil, create: 'mview', params: {})
|
28
|
+
@create = create
|
29
|
+
@params = params
|
30
|
+
@conn = conn
|
31
|
+
end
|
32
|
+
|
33
|
+
def build(code, _)
|
34
|
+
# 2. lazily check the argument only when used
|
35
|
+
raise 'argument conn required' if @conn.nil?
|
36
|
+
|
37
|
+
if @create.to_s == 'table'
|
38
|
+
'DROP TABLE IF EXISTS :_name_;' \
|
39
|
+
'CREATE TABLE :_name_ AS (' + code + ');'
|
40
|
+
elsif @create.to_s == 'mview'
|
41
|
+
'DROP MATERIALIZED VIEW IF EXISTS :_name_;' \
|
42
|
+
'CREATE MATERIALIZED VIEW :_name_ AS (' + code + ');'
|
43
|
+
else
|
44
|
+
code
|
45
|
+
end
|
46
|
+
end
|
47
|
+
|
48
|
+
def run_script(env, fname, task)
|
49
|
+
param_str = (@params || {}).map { |k, v| "-v #{k}=\"#{v}\"" }.join(' ')
|
50
|
+
|
51
|
+
bash env, %(
|
52
|
+
#{sh_cmd(task.scope)} #{param_str} -v _name_=#{task.stem} \
|
53
|
+
-f #{fname} | tee #{fname}.log
|
54
|
+
mv #{fname}.log #{task.name}
|
55
|
+
)
|
56
|
+
end
|
57
|
+
end
|
58
|
+
|
59
|
+
creator :psql, Psql
|
data/lib/lang/python/impl.rb
ADDED
@@ -0,0 +1,32 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require_relative '../../protocol'
|
4
|
+
|
5
|
+
# Binding for python language, allow specifying imports and paths
|
6
|
+
class Python
|
7
|
+
# @implements LanguageImpl
|
8
|
+
def initialize(libs: [], paths: [], runner: 'python')
|
9
|
+
common_aliases = {
|
10
|
+
pandas: :pd,
|
11
|
+
numpy: :np
|
12
|
+
}.freeze
|
13
|
+
|
14
|
+
libs = libs.map(&:to_s) # convert all to strings
|
15
|
+
@imports = libs.map { |lib| "import #{lib}" }
|
16
|
+
common_aliases.each do |name, short|
|
17
|
+
@imports.push("import #{name} as #{short}") if libs.include? name.to_s
|
18
|
+
end
|
19
|
+
@paths = ['import sys'] + paths.map { |path| "sys.path.append('#{path}')" }
|
20
|
+
@runner = runner
|
21
|
+
end
|
22
|
+
|
23
|
+
def build(code, _task)
|
24
|
+
(@paths + @imports + [code]).join "\n"
|
25
|
+
end
|
26
|
+
|
27
|
+
def run_script(env, fname, _task)
|
28
|
+
run_cmd(env, "#{@runner} #{fname}")
|
29
|
+
end
|
30
|
+
end
|
31
|
+
|
32
|
+
creator :py, Python
|
data/lib/lang/r/impl.rb
ADDED
@@ -0,0 +1,38 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require_relative '../../protocol'
|
4
|
+
|
5
|
+
# r language protocol
|
6
|
+
class R
|
7
|
+
def initialize(src, libs = [], **kwargs)
|
8
|
+
@src = src
|
9
|
+
@libs = libs
|
10
|
+
super(**kwargs)
|
11
|
+
end
|
12
|
+
|
13
|
+
def build(code, _)
|
14
|
+
libraries = ([
|
15
|
+
:pipeR
|
16
|
+
] + @libs).map { |name| "suppressPackageStartupMessages(library(#{name}))" }
|
17
|
+
|
18
|
+
sources = ["source('#{File.dirname(__FILE__)}/io.R')"] +
|
19
|
+
(@src ? [@src] : []).map { |name| "source('#{SRC_DIR}/#{name}.R')" }
|
20
|
+
|
21
|
+
extra = [
|
22
|
+
'`|` <- `%>>%`',
|
23
|
+
"conn_args <- list(host='#{HOST}', user='#{USER}', dbname='#{DB}', port='#{PORT}')",
|
24
|
+
'args <- commandArgs(trailingOnly = T)',
|
25
|
+
'sql_input <- init_sql_input(conn_args, args[1])',
|
26
|
+
'table_input <- init_table_input(conn_args, args[1])',
|
27
|
+
'table_output <- init_table_output(conn_args, args[1])'
|
28
|
+
]
|
29
|
+
|
30
|
+
[libraries, sources, extra, code].join "\n"
|
31
|
+
end
|
32
|
+
|
33
|
+
def run_script(env, fname, task)
|
34
|
+
env.send :sh, "Rscript #{fname} '#{task.scope || 'public'}'"
|
35
|
+
end
|
36
|
+
end
|
37
|
+
|
38
|
+
creator :r, R
|