metacrunch 3.0.1 → 3.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Readme.md +221 -41
- data/lib/metacrunch/db/writer.rb +29 -1
- data/lib/metacrunch/job.rb +10 -16
- data/lib/metacrunch/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 140352c3ee66626aef744b87358762a4130f6823
+  data.tar.gz: f9bd336d44ac985f5806045b852e219236e1d038
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 494530523e869e12ef00bd709ad139e840b1fd580d39d587575b85b1acdf029a573c330776eb65085a99c6714f4fd01afdad8a37d25dfe7197568682d229cde2
+  data.tar.gz: f09ba8cadfc1a10cb26b9a5125797da1dbb7dd9c24f4b11381183181b0916fa57a74abbf20713614f062e0026f7b0670f4cb3fa2a568e13b5b017f4d718a0c07
data/Readme.md
CHANGED
@@ -17,51 +17,64 @@ $ gem install metacrunch
 ```
 
 
-
-
+Creating ETL jobs
+-----------------
 
-The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data
+The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data to one or more **destinations** (load step).
 
-metacrunch provides you with a simple DSL to define such ETL jobs. Just create a text file with the extension `.metacrunch`. Note: The extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component
+metacrunch provides you with a simple DSL to define and run such ETL jobs. Just create a text file with the extension `.metacrunch`. *Note: The extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component.*
 
-Let's
+Let's walk through the main steps of creating ETL jobs with metacrunch. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repo.
+
+#### It's Ruby
+
+Every `.metacrunch` job file is a regular Ruby file. So you can always use regular stuff like e.g. declaring methods, classes, variable and requiring other Ruby files.
 
 ```ruby
 # File: my_etl_job.metacrunch
 
-# Every metacrunch job file is a regular Ruby file. So you can always use regular Ruby
-# stuff like declaring methods
 def my_helper
   # ...
 end
 
-# ... declaring classes
 class MyHelper
   # ...
 end
 
-
-foo = "bar"
+helper = MyHelper.new
 
-
+require "SomeGem"
 require_relative "./some/other/ruby/file"
+```
+
+#### Defining sources
+
+A source (aka. a reader) is an object that reads data into the metacrunch processing pipeline. Use one of the build-in or 3rd party sources or implement it by yourself. Implementing sources is easy – [see notes below](#implementing-sources). You can declare one or more sources. They are processed in the order they are defined.
+
+You must declare at least one source to allow a job to run.
+
+```ruby
+# File: my_etl_job.metacrunch
 
-
-# At least one source is required to allow the job to run.
+source Metacrunch::Fs::Reader.new(args)
 source MySource.new
-
-source MyOtherSource.new
+```
 
-
-
-
-
-
-
+This example uses a build-in file reader source. To learn more about the build-in sources see [notes below](#built-in-sources-and-destinations).
+
+#### Defining transformations
+
+To process, transform or manipulate data use the `#transformation` hook. A transformation can be implemented as a block, a lambda or as an (callable) object. To learn more about transformations check the section about [implementing transformations](#implementing-transformations) below.
+
+The current data object (the object that is currently read by the source) will be passed to the first transformation as a parameter. The return value of a transformation will then be passed to the next transformation - or to the destination if the current transformation is the last one.
+
+If you return nil the current data object will be dismissed and the next transformation (or destination) won't be called.
+
+```ruby
+# File: my_etl_job.metacrunch
 
-# To process data use the #transformation hook.
 transformation do |data|
-  # Called for each data object that has been
+  # Called for each data object that has been read by a source.
 
   # Do your data transformation process here.
 
@@ -71,60 +84,227 @@ transformation do |data|
 end
 
 # Instead of passing a block to #transformation you can pass a
-# `callable` object (
-transformation
-#
+# `callable` object (any object responding to #call).
+transformation ->(data) {
+  # Lambdas responds to #call
 }
 
 # MyTransformation defines #call
 transformation MyTransformation.new
+```
+
+#### Defining destinations
+
+A destination (aka. a writer) is an object that writes the transformed data to an external system. Use one of the build-in or 3rd party destinations or implement it by yourself. Implementing destinations is easy – [see notes below](#implementing-destinations). You can declare one or more destinations. They are processed in the order they are defined.
+
+```ruby
+# File: my_etl_job.metacrunch
 
-
+destination MyDestination.new
+```
+
+This example uses a custom destination. To learn more about the build-in destinations see [notes below](#built-in-sources-and-destinations).
+
+#### Pre/Post process
+
+To run arbitrary code before the first transformation use the
+`#pre_process` hook. To run arbitrary after the last transformation use
+`#post_process`. Like transformations, `#post_process` and `#pre_process` can be called with a block, a lambda or a (callable) object.
+
+```ruby
 pre_process do
   # Called before the first transformation
 end
 
-# To run arbitrary code after the last transformation use the #post_process hook.
 post_process do
   # Called after the last transformation
 end
 
-
-#
-pre_process Proc.new {
-  # Procs and Lambdas responds to #call
+pre_process ->() {
+  # Lambdas responds to #call
 }
 
 # MyCallable class defines #call
 post_process MyCallable.new
-
 ```
 
+#### Defining options
 
-
-------------
+TBD.
 
-
+Running ETL jobs
+----------------
+
+metacrunch comes with a handy command line tool. In a terminal use
 
 
 ```
 $ metacrunch run my_etl_job.metacrunch
 ```
 
-to run
+to run a job.
+
+If you use [Bundler](http://bundler.io) to manage dependencies for your jobs make sure to change into the directory where your Gemfile is (or set BUNDLE_GEMFILE environment variable) and run metacrunch with `bundle exec`.
+
+```
+$ bundle exec metacrunch run my_etl_job.metacrunch
+```
+
+Depending on your environment `bundle exec` may not be required (e.g. you have rubygems-bundler installed) but we recommend using it whenever you have a Gemfile you like to use. When using Bundler make sure to add `gem "metacrunch"` to the Gemfile.
+
+To pass options to the job, separate job options from the metacrunch command options using the `@@` separator.
+
+Use the following syntax
+
+```
+$ [bundle exec] metacrunch run [COMMAND_OPTIONS] JOB_FILE [@@ [JOB_OPTIONS] [JOB_ARGS...]]
+```
+
 
 Implementing sources
 --------------------
 
-
+A source (aka a reader) is any Ruby object that responds to the `each` method that yields data objects one by one.
+
+The data is usually a `Hash` instance, but could be other structures as long as the rest of your pipeline is expecting it.
+
+Any `enumerable` object (e.g. `Array`) responds to `each` and can be used as a source in metacrunch.
+
+```ruby
+# File: my_etl_job.metacrunch
+source [1,2,3,4,5,6,7,8,9]
+```
+
+Usually you implement your sources as classes. Doing so you can unit test and reuse them.
+
+Here is a simple CSV source
+
+```ruby
+# File: my_csv_source.rb
+require 'csv'
+
+class MyCsvSource
+  def initialize(input_file)
+    @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
+  end
+
+  def each
+    @csv.each do |data|
+      yield(data.to_hash)
+    end
+    @csv.close
+  end
+end
+```
+
+You can then use that source in your job
+
+```ruby
+# File: my_etl_job.metacrunch
+require "my_csv_source"
+
+source MyCsvSource.new("my_data.csv")
+```
+
 
 Implementing transformations
 ----------------------------
 
-
+Transformations can be implemented as blocks or as a `callable`. A `callable` in Ruby is any object that responds to the `call` method.
+
+### Transformations as a block
+
+When using the block syntax the current data row will be passed as a parameter.
+
+```ruby
+# File: my_etl_job.metacrunch
+
+transformation do |data|
+  # DO YOUR TRANSFORMATION HERE
+  data = ...
+
+  # Make sure to return the data to keep it in the pipeline. Dismiss the
+  # data conditionally by returning nil.
+  data
+end
+
+```
+
+### Transformations as a callable
+
+Procs and Lambdas in Ruby respond to `call`. They can be used to implement transformations similar to blocks.
+
+```ruby
+# File: my_etl_job.metacrunch
+
+transformation -> (data) do
+  # ...
+end
+
+```
+
+Like sources you can create classes to test and reuse transformation logic.
+
+```ruby
+# File: my_transformation.rb
+
+class MyTransformation
+
+  def call(data)
+    # ...
+  end
+
+end
+```
+
+You can use this transformation in your job
+
+```ruby
+# File: my_etl_job.metacrunch
+
+require "my_transformation"
+
+transformation MyTransformation.new
+
+```
+
+Implementing destinations
+-------------------------
+
+A destination (aka a writer) is any Ruby object that responds to `write(data)` and `close`.
+
+Like sources you are encouraged to implement destinations as classes.
+
+```ruby
+# File: my_destination.rb
+
+class MyDestination
+
+  def write(data)
+    # Write data to files, remote services, databases etc.
+  end
+
+  def close
+    # Use this method to close connections, files etc.
+  end
+
+end
+```
+
+In your job
+
+```ruby
+# File: my_etl_job.metacrunch
+
+require "my_destination"
+
+destination MyDestination.new
+
+```
+
 
-
-
+Built in sources and destinations
+---------------------------------
 
 TBD.
 
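The new "Running ETL jobs" section above stops at the abstract syntax line for the `@@` separator. As a hedged illustration of that syntax, in the style of the README's own CLI snippets (the `--limit` job option and its value are hypothetical, not flags defined by metacrunch itself):

```
$ bundle exec metacrunch run my_etl_job.metacrunch @@ --limit 100
```

Everything before `@@` is handled by the metacrunch command itself; everything after it is passed on to the job as its options and arguments.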
data/lib/metacrunch/db/writer.rb
CHANGED
@@ -2,6 +2,9 @@ module Metacrunch
   class Db::Writer
 
     def initialize(database_connection_or_url, dataset_proc, options = {})
+      @use_upsert = options.delete(:use_upsert) || false
+      @id_key = options.delete(:id_key) || :id
+
       @db = if database_connection_or_url.is_a?(String)
         Sequel.connect(database_connection_or_url, options)
       else
@@ -12,12 +15,37 @@ module Metacrunch
     end
 
     def write(data)
-
+      if data.is_a?(Array)
+        @db.transaction do
+          data.each{|d| insert_or_upsert(d) }
+        end
+      else
+        insert_or_upsert(data)
+      end
     end
 
     def close
       @db.disconnect
     end
 
+    private
+
+    def insert_or_upsert(data)
+      @use_upsert ? upsert(data) : insert(data)
+    end
+
+    def insert(data)
+      @dataset.insert(data) if data
+    end
+
+    def upsert(data)
+      if data
+        rec = @dataset.where(id: data[@id_key])
+        if 1 != rec.update(data)
+          insert(data)
+        end
+      end
+    end
+
   end
 end
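For readers of the `Db::Writer` change above: `initialize` now strips two new options, `:use_upsert` and `:id_key`, before the remaining options reach `Sequel.connect`, and `write` wraps `Array` input in a single transaction. A minimal usage sketch, assuming a `.metacrunch` job file and a Sequel-backed table; the connection URL, the `:records` table and the lambda shape of `dataset_proc` are illustrative assumptions, not taken from the gem's documentation:

```ruby
# File: my_etl_job.metacrunch
# Sketch only: the URL, the :records table and the dataset lambda below are assumptions.
require "metacrunch/db/writer"   # assumed require path, mirroring lib/metacrunch/db/writer.rb

destination Metacrunch::Db::Writer.new(
  "postgres://localhost/my_db",  # or an already connected Sequel database object
  ->(db) { db[:records] },       # dataset_proc: assumed to pick the dataset that is written to
  use_upsert: true,              # new in 3.0.2: update by id, fall back to insert when no row matches
  id_key: :id                    # new in 3.0.2: key of the data hash that holds the record id
)
```

With such a writer, passing an `Array` to `write` runs every row through `insert_or_upsert` inside one transaction; a single hash goes through `insert_or_upsert` directly.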
data/lib/metacrunch/job.rb
CHANGED
@@ -110,37 +110,31 @@ module Metacrunch
     def run_transformations
       sources.each do |source|
         # sources are expected to respond to `each`
-        source.each do |
-
+        source.each do |data|
+          run_transformations_and_write_destinations(data)
         end
 
         # Run all transformations a last time to flush possible buffers
-
+        run_transformations_and_write_destinations(nil, flush_buffers: true)
       end
 
       # destination implementations are expected to respond to `close`
       destinations.each(&:close)
     end
 
-    def
+    def run_transformations_and_write_destinations(data, flush_buffers: false)
       transformations.each do |transformation|
-
-        if
-
-        else
-          transformation.buffer(row)
-        end
+        if transformation.is_a?(Buffer)
+          data = transformation.buffer(data) if data.present?
+          data = transformation.flush if flush_buffers
         else
-          transformation.call(
+          data = transformation.call(data) if data.present?
         end
-
-        break unless row
       end
 
-      if
+      if data.present?
        destinations.each do |destination|
-          # destinations are expected to respond to `write(
-          destination.write(row)
+          destination.write(data) # destinations are expected to respond to `write(data)`
        end
      end
    end
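The `Job#run_transformations` rewrite above threads a single `data` value through every transformation, gives `Buffer` transformations a chance to collect and later flush rows, and only then writes to the destinations. The sketch below is a standalone paraphrase of that control flow, not the gem's code: plain `nil` checks stand in for ActiveSupport's `present?`, and the hypothetical `BufferLike` marker stands in for metacrunch's `Buffer` class.

```ruby
# Standalone paraphrase of the pipeline semantics shown in the diff above (not the gem's code).
module BufferLike; end   # hypothetical marker standing in for Metacrunch's Buffer

def run_pipeline(sources, transformations, destinations)
  process = lambda do |data, flush: false|
    transformations.each do |t|
      if t.is_a?(BufferLike)
        data = t.buffer(data) if data   # buffers collect rows while data flows in
        data = t.flush if flush         # the final pass emits what was buffered
      else
        data = t.call(data) if data     # returning nil dismisses the row
      end
    end
    destinations.each { |d| d.write(data) } if data
  end

  sources.each do |source|
    source.each { |data| process.call(data) }  # sources respond to #each
    process.call(nil, flush: true)             # flush buffers after each source is exhausted
  end

  destinations.each(&:close)                   # destinations respond to #close
end
```

The behavioural points carried over from the diff: a `nil` return from any transformation drops the row before it reaches a destination, and the extra pass with `flush_buffers: true` is what pushes buffered rows out once a source has been read completely.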
data/lib/metacrunch/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: metacrunch
 version: !ruby/object:Gem::Version
-  version: 3.0.1
+  version: 3.0.2
 platform: ruby
 authors:
 - René Sprotte
@@ -10,7 +10,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-
+date: 2016-07-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport