metacrunch 3.0.1 → 3.0.2
- checksums.yaml +4 -4
- data/Readme.md +221 -41
- data/lib/metacrunch/db/writer.rb +29 -1
- data/lib/metacrunch/job.rb +10 -16
- data/lib/metacrunch/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 140352c3ee66626aef744b87358762a4130f6823
+  data.tar.gz: f9bd336d44ac985f5806045b852e219236e1d038
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 494530523e869e12ef00bd709ad139e840b1fd580d39d587575b85b1acdf029a573c330776eb65085a99c6714f4fd01afdad8a37d25dfe7197568682d229cde2
+  data.tar.gz: f09ba8cadfc1a10cb26b9a5125797da1dbb7dd9c24f4b11381183181b0916fa57a74abbf20713614f062e0026f7b0670f4cb3fa2a568e13b5b017f4d718a0c07
data/Readme.md
CHANGED
@@ -17,51 +17,64 @@ $ gem install metacrunch
 ```
 
 
-
-
+Creating ETL jobs
+-----------------
 
-The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data
+The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data to one or more **destinations** (load step).
 
-metacrunch provides you with a simple DSL to define such ETL jobs. Just create a text file with the extension `.metacrunch`. Note: The extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component
+metacrunch provides you with a simple DSL to define and run such ETL jobs. Just create a text file with the extension `.metacrunch`. *Note: The extension doesn't really matter, but you should avoid `.rb` so the files are not loaded by mistake by another Ruby component.*
 
-Let's
+Let's walk through the main steps of creating ETL jobs with metacrunch. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repo.
+
+#### It's Ruby
+
+Every `.metacrunch` job file is a regular Ruby file, so you can always use regular Ruby constructs, e.g. declaring methods, classes and variables, and requiring other Ruby files.
 
 ```ruby
 # File: my_etl_job.metacrunch
 
-# Every metacrunch job file is a regular Ruby file. So you can always use regular Ruby
-# stuff like declaring methods
 def my_helper
   # ...
 end
 
-# ... declaring classes
 class MyHelper
   # ...
 end
 
-
-foo = "bar"
+helper = MyHelper.new
 
-
+require "SomeGem"
 require_relative "./some/other/ruby/file"
+```
+
+#### Defining sources
+
+A source (aka a reader) is an object that reads data into the metacrunch processing pipeline. Use one of the built-in or 3rd party sources or implement your own. Implementing sources is easy – [see notes below](#implementing-sources). You can declare one or more sources. They are processed in the order they are defined.
+
+You must declare at least one source to allow a job to run.
+
+```ruby
+# File: my_etl_job.metacrunch
 
-
-# At least one source is required to allow the job to run.
+source Metacrunch::Fs::Reader.new(args)
 source MySource.new
-
-source MyOtherSource.new
+```
 
-
-
-
-
-
-
+This example uses a built-in file reader source. To learn more about the built-in sources see [notes below](#built-in-sources-and-destinations).
+
+#### Defining transformations
+
+To process, transform or manipulate data use the `#transformation` hook. A transformation can be implemented as a block, a lambda or as a (callable) object. To learn more about transformations check the section about [implementing transformations](#implementing-transformations) below.
+
+The current data object (the object that is currently read by the source) will be passed to the first transformation as a parameter. The return value of a transformation will then be passed to the next transformation - or to the destination if the current transformation is the last one.
+
+If you return nil the current data object will be dismissed and the next transformation (or destination) won't be called.
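
The hand-off described in the README text above (each return value feeds the next transformation, and `nil` dismisses the record) can be sketched in plain Ruby. This is a minimal sketch, not metacrunch's implementation: `run_pipeline` and the three lambdas are hypothetical names used only for illustration.

```ruby
# Plain-Ruby sketch of the hand-off between transformations.
# NOT metacrunch's implementation; `run_pipeline` is a hypothetical helper.
def run_pipeline(records, transformations)
  results = []
  records.each do |data|
    transformations.each do |transformation|
      # Each transformation receives the previous return value.
      data = transformation.call(data)
      # nil dismisses the record: the remaining transformations are skipped.
      break if data.nil?
    end
    # Only records that survived every transformation are "written".
    results << data unless data.nil?
  end
  results
end

double    = ->(n) { n * 2 }
drop_big  = ->(n) { n > 6 ? nil : n }   # dismiss values above 6
to_string = ->(n) { "value=#{n}" }

puts run_pipeline([1, 2, 3, 4, 5], [double, drop_big, to_string]).inspect
# → ["value=2", "value=4", "value=6"]
```

The records 4 and 5 are doubled to 8 and 10, dismissed by `drop_big`, and never reach `to_string`.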
+
+```ruby
+# File: my_etl_job.metacrunch
 
-# To process data use the #transformation hook.
 transformation do |data|
-  # Called for each data object that has been
+  # Called for each data object that has been read by a source.
 
   # Do your data transformation process here.
 
@@ -71,60 +84,227 @@ transformation do |data|
 end
 
 # Instead of passing a block to #transformation you can pass a
-# `callable` object (
-transformation
-  #
+# `callable` object (any object responding to #call).
+transformation ->(data) {
+  # Lambdas respond to #call
 }
 
 # MyTransformation defines #call
 transformation MyTransformation.new
+```
+
+#### Defining destinations
+
+A destination (aka a writer) is an object that writes the transformed data to an external system. Use one of the built-in or 3rd party destinations or implement your own. Implementing destinations is easy – [see notes below](#implementing-destinations). You can declare one or more destinations. They are processed in the order they are defined.
+
+```ruby
+# File: my_etl_job.metacrunch
 
-
+destination MyDestination.new
+```
+
+This example uses a custom destination. To learn more about the built-in destinations see [notes below](#built-in-sources-and-destinations).
+
+#### Pre/Post process
+
+To run arbitrary code before the first transformation use the
+`#pre_process` hook. To run arbitrary code after the last transformation use
+`#post_process`. Like transformations, `#post_process` and `#pre_process` can be called with a block, a lambda or a (callable) object.
+
+```ruby
 pre_process do
   # Called before the first transformation
 end
 
-# To run arbitrary code after the last transformation use the #post_process hook.
 post_process do
   # Called after the last transformation
 end
 
-
-#
-pre_process Proc.new {
-  # Procs and Lambdas responds to #call
+pre_process ->() {
+  # Lambdas respond to #call
 }
 
 # MyCallable class defines #call
 post_process MyCallable.new
-
 ```
 
+#### Defining options
 
-
-------------
+TBD.
 
-
+Running ETL jobs
+----------------
+
+metacrunch comes with a handy command line tool. In a terminal use
 
 
 ```
 $ metacrunch run my_etl_job.metacrunch
 ```
 
-to run
+to run a job.
+
+If you use [Bundler](http://bundler.io) to manage dependencies for your jobs make sure to change into the directory where your Gemfile is (or set the BUNDLE_GEMFILE environment variable) and run metacrunch with `bundle exec`.
+
+```
+$ bundle exec metacrunch run my_etl_job.metacrunch
+```
+
+Depending on your environment `bundle exec` may not be required (e.g. you have rubygems-bundler installed) but we recommend using it whenever you have a Gemfile you want to use. When using Bundler make sure to add `gem "metacrunch"` to the Gemfile.
+
+To pass options to the job, separate job options from the metacrunch command options using the `@@` separator.
+
+Use the following syntax
+
+```
+$ [bundle exec] metacrunch run [COMMAND_OPTIONS] JOB_FILE [@@ [JOB_OPTIONS] [JOB_ARGS...]]
+```
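
To make the role of the `@@` separator concrete, here is a minimal sketch of how an argument list could be split at that token. It is an illustration only: `split_job_args` is a hypothetical helper, `--limit` is a made-up job option, and nothing here claims to mirror how the metacrunch command line tool parses its arguments.

```ruby
# Hypothetical helper, not part of metacrunch: split an argument list at the
# first "@@" into command-side and job-side arguments.
def split_job_args(argv)
  index = argv.index("@@")
  # Without a separator everything belongs to the command side.
  return [argv.dup, []] if index.nil?
  [argv[0...index], argv[(index + 1)..-1]]
end

# `--limit 10` and `input.csv` are made-up job options and arguments.
command_args, job_args = split_job_args(%w[run my_etl_job.metacrunch @@ --limit 10 input.csv])
puts command_args.inspect
puts job_args.inspect
```

Everything before the separator stays with the command, everything after it would be handed to the job.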
+
 
 Implementing sources
 --------------------
 
-
+A source (aka a reader) is any Ruby object that responds to the `each` method that yields data objects one by one.
+
+The data is usually a `Hash` instance, but could be other structures as long as the rest of your pipeline is expecting it.
+
+Any `enumerable` object (e.g. `Array`) responds to `each` and can be used as a source in metacrunch.
+
+```ruby
+# File: my_etl_job.metacrunch
+source [1,2,3,4,5,6,7,8,9]
+```
+
+Usually you implement your sources as classes. Doing so you can unit test and reuse them.
+
+Here is a simple CSV source
+
+```ruby
+# File: my_csv_source.rb
+require 'csv'
+
+class MyCsvSource
+  def initialize(input_file)
+    @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
+  end
+
+  def each
+    @csv.each do |data|
+      yield(data.to_hash)
+    end
+    @csv.close
+  end
+end
+```
+
+You can then use that source in your job
+
+```ruby
+# File: my_etl_job.metacrunch
+require "my_csv_source"
+
+source MyCsvSource.new("my_data.csv")
+```
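
Since a source only has to respond to `each`, it can be unit tested without metacrunch at all. The following is a minimal sketch under that assumption; `RangeSource` is a hypothetical class, and including `Enumerable` is a convenience for testing, not something the README requires.

```ruby
# Hypothetical in-memory source, for illustration only.
class RangeSource
  include Enumerable # adds to_a, map, select ... on top of #each

  def initialize(first, last)
    @first = first
    @last = last
  end

  # Yields one Hash per record, one by one.
  def each
    # Return an Enumerator when called without a block (common Ruby idiom).
    return enum_for(:each) unless block_given?

    (@first..@last).each { |n| yield({ id: n, square: n * n }) }
  end
end

puts RangeSource.new(1, 3).to_a.inspect
```

In a job file such an object would be passed to `source` the same way as `MyCsvSource` above.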
+
 
 Implementing transformations
 ----------------------------
 
-
+Transformations can be implemented as blocks or as a `callable`. A `callable` in Ruby is any object that responds to the `call` method.
+
+### Transformations as a block
+
+When using the block syntax the current data row will be passed as a parameter.
+
+```ruby
+# File: my_etl_job.metacrunch
+
+transformation do |data|
+  # DO YOUR TRANSFORMATION HERE
+  data = ...
+
+  # Make sure to return the data to keep it in the pipeline. Dismiss the
+  # data conditionally by returning nil.
+  data
+end
+
+```
+
+### Transformations as a callable
+
+Procs and Lambdas in Ruby respond to `call`. They can be used to implement transformations similar to blocks.
+
+```ruby
+# File: my_etl_job.metacrunch
+
+transformation -> (data) do
+  # ...
+end
+
+```
+
+Like sources you can create classes to test and reuse transformation logic.
+
+```ruby
+# File: my_transformation.rb
+
+class MyTransformation
+
+  def call(data)
+    # ...
+  end
+
+end
+```
+
+You can use this transformation in your job
+
+```ruby
+# File: my_etl_job.metacrunch
+
+require "my_transformation"
+
+transformation MyTransformation.new
+
+```
+
+Implementing destinations
+-------------------------
+
+A destination (aka a writer) is any Ruby object that responds to `write(data)` and `close`.
+
+Like sources you are encouraged to implement destinations as classes.
+
+```ruby
+# File: my_destination.rb
+
+class MyDestination
+
+  def write(data)
+    # Write data to files, remote services, databases etc.
+  end
+
+  def close
+    # Use this method to close connections, files etc.
+  end
+
+end
+```
+
+In your job
+
+```ruby
+# File: my_etl_job.metacrunch
+
+require "my_destination"
+
+destination MyDestination.new
+
+```
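
The `write(data)` / `close` contract can be filled in with very little code. The sketch below is one possible destination, assuming records are Hashes that serialize to JSON; `JsonLinesDestination` is a hypothetical class name and is not one of metacrunch's built-in destinations.

```ruby
require "json"
require "stringio"

# Hypothetical destination: writes each record as one JSON line to an IO.
class JsonLinesDestination
  def initialize(io)
    @io = io
  end

  # Called once per data object that reaches the end of the pipeline.
  def write(data)
    @io.puts(JSON.generate(data))
  end

  # Called once after all sources have been processed.
  def close
    @io.close unless @io.closed?
  end
end

io = StringIO.new
destination = JsonLinesDestination.new(io)
destination.write({ id: 1, title: "first" })
destination.write({ id: 2, title: "second" })
destination.close
puts io.string
```

Taking the IO as a constructor argument keeps the class testable: a `StringIO` works in tests, a `File` opened for writing would work in a job.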
+
 
-
-
+Built in sources and destinations
+---------------------------------
 
 TBD.
 
data/lib/metacrunch/db/writer.rb
CHANGED
@@ -2,6 +2,9 @@ module Metacrunch
   class Db::Writer
 
     def initialize(database_connection_or_url, dataset_proc, options = {})
+      @use_upsert = options.delete(:use_upsert) || false
+      @id_key = options.delete(:id_key) || :id
+
       @db = if database_connection_or_url.is_a?(String)
         Sequel.connect(database_connection_or_url, options)
       else
@@ -12,12 +15,37 @@ module Metacrunch
     end
 
     def write(data)
-
+      if data.is_a?(Array)
+        @db.transaction do
+          data.each{|d| insert_or_upsert(d) }
+        end
+      else
+        insert_or_upsert(data)
+      end
     end
 
     def close
       @db.disconnect
     end
 
+    private
+
+    def insert_or_upsert(data)
+      @use_upsert ? upsert(data) : insert(data)
+    end
+
+    def insert(data)
+      @dataset.insert(data) if data
+    end
+
+    def upsert(data)
+      if data
+        rec = @dataset.where(id: data[@id_key])
+        if 1 != rec.update(data)
+          insert(data)
+        end
+      end
+    end
+
   end
 end
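
The `upsert` method added in this diff follows an "update first, insert if nothing was updated" pattern. The sketch below replays that control flow against an in-memory stand-in so it can be run without a database. `MemoryTable` and the top-level `upsert` helper are hypothetical; only the decision logic (insert unless exactly one row was updated, skip `nil` data) is taken from the diff.

```ruby
# In-memory stand-in for a database table; hypothetical, for illustration only.
class MemoryTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Returns the number of rows changed, as an SQL UPDATE would report.
  def update(id, data)
    matches = @rows.select { |row| row[:id] == id }
    matches.each { |row| row.merge!(data) }
    matches.size
  end

  def insert(data)
    @rows << data.dup
  end
end

# Same decision as in the diff: update first, insert unless exactly one row
# was updated. nil data is ignored.
def upsert(table, data, id_key: :id)
  return if data.nil?

  table.insert(data) unless table.update(data[id_key], data) == 1
end

table = MemoryTable.new
upsert(table, { id: 1, name: "old" })   # no row yet, so this inserts
upsert(table, { id: 1, name: "new" })   # row exists, so this updates
upsert(table, { id: 2, name: "other" }) # inserts a second row
puts table.rows.inspect
```

Note that an update followed by a separate insert is two statements, so with concurrent writers two processes could both see "no row updated" and both insert.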
data/lib/metacrunch/job.rb
CHANGED
@@ -110,37 +110,31 @@ module Metacrunch
     def run_transformations
       sources.each do |source|
         # sources are expected to respond to `each`
-        source.each do |
-
+        source.each do |data|
+          run_transformations_and_write_destinations(data)
         end
 
         # Run all transformations a last time to flush possible buffers
-
+        run_transformations_and_write_destinations(nil, flush_buffers: true)
       end
 
       # destination implementations are expected to respond to `close`
       destinations.each(&:close)
     end
 
-    def
+    def run_transformations_and_write_destinations(data, flush_buffers: false)
       transformations.each do |transformation|
-
-        if
-
-        else
-          transformation.buffer(row)
-        end
+        if transformation.is_a?(Buffer)
+          data = transformation.buffer(data) if data.present?
+          data = transformation.flush if flush_buffers
         else
-          transformation.call(
+          data = transformation.call(data) if data.present?
         end
-
-        break unless row
       end
 
-      if
+      if data.present?
         destinations.each do |destination|
-          # destinations are expected to respond to `write(
-          destination.write(row)
+          destination.write(data) # destinations are expected to respond to `write(data)`
         end
       end
     end
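
The reworked loop in this diff runs every record through the transformations and then makes one extra pass with `nil` and `flush_buffers: true` so that buffers can release what they still hold. The following sketch replays that flow in plain Ruby. It rests on assumptions: `SketchBuffer` is a hypothetical class modeled only on the two calls visible in the diff (`buffer(data)` and `flush`), `run_step` is a simplified stand-in for `run_transformations_and_write_destinations`, and plain `nil` checks replace ActiveSupport's `present?`.

```ruby
# Hypothetical buffer; the real Buffer class is not shown in the diff.
class SketchBuffer
  def initialize(size)
    @size = size
    @items = []
  end

  # Collects records; returns a full batch, or nil while still filling up.
  def buffer(data)
    @items << data
    flush if @items.size >= @size
  end

  # Returns whatever is buffered (nil when empty) and resets the buffer.
  def flush
    batch = @items
    @items = []
    batch.empty? ? nil : batch
  end
end

# Simplified stand-in for the method in the diff; `written` plays the
# role of the destinations.
def run_step(data, transformations, written, flush_buffers: false)
  transformations.each do |transformation|
    if transformation.is_a?(SketchBuffer)
      data = transformation.buffer(data) unless data.nil?
      data = transformation.flush if flush_buffers
    else
      data = transformation.call(data) unless data.nil?
    end
  end
  written << data unless data.nil?
end

transformations = [->(n) { n * 10 }, SketchBuffer.new(2)]
written = []
(1..5).each { |n| run_step(n, transformations, written) }
# Final pass: no new data, but buffers are asked to flush.
run_step(nil, transformations, written, flush_buffers: true)
puts written.inspect
# → [[10, 20], [30, 40], [50]]
```

Without the final pass the last record (50) would stay in the buffer and never be written.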
data/lib/metacrunch/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: metacrunch
 version: !ruby/object:Gem::Version
-  version: 3.0.
+  version: 3.0.2
 platform: ruby
 authors:
 - René Sprotte
@@ -10,7 +10,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-
+date: 2016-07-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport