metacrunch 3.0.1 → 3.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
   ---
   SHA1:
 -   metadata.gz: 8c5c7308708c116022aab5aafb8c546b70430383
 -   data.tar.gz: 5c658db2d33ab7a31b28df026a158146d67d63a6
 +   metadata.gz: 140352c3ee66626aef744b87358762a4130f6823
 +   data.tar.gz: f9bd336d44ac985f5806045b852e219236e1d038
   SHA512:
 -   metadata.gz: c9f71280290fecd7ac65cec82b708e9e09b4aa07929f1a7dce9d7077c048ac8c8c2787972a4dc344da3041756c79cfff8f78e5f2b8c9f95d58383cc9dfcf0cd3
 -   data.tar.gz: f2bef55af9e464cf9c50b83bdcd15ad3b7552cb65520fbc1267228fe9ca9c4b500a6250f54236c87a63ba7d6fb1cb3325e884cd595205fabf4a68eeedffbea26
 +   metadata.gz: 494530523e869e12ef00bd709ad139e840b1fd580d39d587575b85b1acdf029a573c330776eb65085a99c6714f4fd01afdad8a37d25dfe7197568682d229cde2
 +   data.tar.gz: f09ba8cadfc1a10cb26b9a5125797da1dbb7dd9c24f4b11381183181b0916fa57a74abbf20713614f062e0026f7b0670f4cb3fa2a568e13b5b017f4d718a0c07
data/Readme.md CHANGED
@@ -17,51 +17,64 @@ $ gem install metacrunch
  ```


- Create ETL jobs
- ---------------
+ Creating ETL jobs
+ -----------------

- The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data back to one or more **destinations** (load step).
+ The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data to one or more **destinations** (load step).

- metacrunch provides you with a simple DSL to define such ETL jobs. Just create a text file with the extension `.metacrunch`. Note: The extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component.
+ metacrunch provides you with a simple DSL to define and run such ETL jobs. Just create a text file with the extension `.metacrunch`. *Note: The extension doesn't really matter, but you should avoid `.rb` so the files aren't loaded by mistake from another Ruby component.*

- Let's take a look at an example. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repo.
+ Let's walk through the main steps of creating ETL jobs with metacrunch. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repo.
+
+ #### It's Ruby
+
+ Every `.metacrunch` job file is a regular Ruby file, so you can always use regular Ruby features like declaring methods, classes and variables, or requiring other Ruby files.

  ```ruby
  # File: my_etl_job.metacrunch

- # Every metacrunch job file is a regular Ruby file. So you can always use regular Ruby
- # stuff like declaring methods
  def my_helper
    # ...
  end

- # ... declaring classes
  class MyHelper
    # ...
  end

- # ... declaring variables
- foo = "bar"
+ helper = MyHelper.new

- # ... or loading other ruby files
+ require "SomeGem"
  require_relative "./some/other/ruby/file"
+ ```
+
+ #### Defining sources
+
+ A source (a.k.a. a reader) is an object that reads data into the metacrunch processing pipeline. Use one of the built-in or 3rd-party sources or implement your own. Implementing sources is easy – [see notes below](#implementing-sources). You can declare one or more sources. They are processed in the order they are defined.
+
+ You must declare at least one source to allow a job to run.
+
+ ```ruby
+ # File: my_etl_job.metacrunch

- # Declare a source (use a build-in or 3rd party source or implement it – see notes below).
- # At least one source is required to allow the job to run.
+ source Metacrunch::Fs::Reader.new(args)
  source MySource.new
- # ... maybe another one. Sources are processed in the order they are defined.
- source MyOtherSource.new
+ ```

- # Declare a destination (use a build-in or 3rd party destination or implement it see notes below).
- # Technically a destination is optional, but a job that doesn't store it's
- # output doesn't really makes sense.
- destination MyDestination.new
- # ... you can have more destinations if you like
- destination MyOtherDestination.new
+ This example uses a built-in file reader source. To learn more about the built-in sources see [notes below](#built-in-sources-and-destinations).
+
+ #### Defining transformations
+
+ To process, transform or manipulate data use the `#transformation` hook. A transformation can be implemented as a block, a lambda or as a (callable) object. To learn more about transformations check the section about [implementing transformations](#implementing-transformations) below.
+
+ The current data object (the object that is currently read by the source) will be passed to the first transformation as a parameter. The return value of a transformation will then be passed to the next transformation, or to the destination if the current transformation is the last one.
+
+ If you return `nil`, the current data object will be dismissed and the next transformation (or destination) won't be called.
+
+ ```ruby
+ # File: my_etl_job.metacrunch

- # To process data use the #transformation hook.
  transformation do |data|
-   # Called for each data object that has been put in the pipeline by a source.
+   # Called for each data object that has been read by a source.

    # Do your data transformation process here.

@@ -71,60 +84,227 @@ transformation do |data|
  end

  # Instead of passing a block to #transformation you can pass a
- # `callable` object (an object responding to #call).
- transformation Proc.new {
-   # Procs and Lambdas responds to #call
+ # `callable` object (any object responding to #call).
+ transformation ->(data) {
+   # Lambdas respond to #call
  }

  # MyTransformation defines #call
  transformation MyTransformation.new
+ ```
+
+ #### Defining destinations
+
+ A destination (a.k.a. a writer) is an object that writes the transformed data to an external system. Use one of the built-in or 3rd-party destinations or implement your own. Implementing destinations is easy – [see notes below](#implementing-destinations). You can declare one or more destinations. They are processed in the order they are defined.
+
+ ```ruby
+ # File: my_etl_job.metacrunch

- # To run arbitrary code before the first transformation use the #pre_process hook.
+ destination MyDestination.new
+ ```
+
+ This example uses a custom destination. To learn more about the built-in destinations see [notes below](#built-in-sources-and-destinations).
+
+ #### Pre/Post process
+
+ To run arbitrary code before the first transformation use the
+ `#pre_process` hook. To run arbitrary code after the last transformation use
+ `#post_process`. Like transformations, `#post_process` and `#pre_process` can be called with a block, a lambda or a (callable) object.
+
+ ```ruby
  pre_process do
    # Called before the first transformation
  end

- # To run arbitrary code after the last transformation use the #post_process hook.
  post_process do
    # Called after the last transformation
  end

- # Instead of passing a block to #pre_process or #post_process you can pass a
- # `callable` object (an object responding to #call).
- pre_process Proc.new {
-   # Procs and Lambdas responds to #call
+ pre_process ->() {
+   # Lambdas respond to #call
  }

  # MyCallable class defines #call
  post_process MyCallable.new
-
  ```

+ #### Defining options

- Run ETL jobs
- ------------
+ TBD.

- metacrunch comes with a handy command line tool. In your terminal just call
+ Running ETL jobs
+ ----------------
+
+ metacrunch comes with a handy command line tool. In a terminal use


  ```
  $ metacrunch run my_etl_job.metacrunch
  ```

- to run the job.
+ to run a job.
+
+ If you use [Bundler](http://bundler.io) to manage dependencies for your jobs, make sure to change into the directory where your Gemfile is (or set the `BUNDLE_GEMFILE` environment variable) and run metacrunch with `bundle exec`.
+
+ ```
+ $ bundle exec metacrunch run my_etl_job.metacrunch
+ ```
+
+ Depending on your environment, `bundle exec` may not be required (e.g. if you have rubygems-bundler installed), but we recommend using it whenever there is a Gemfile you would like to use. When using Bundler, make sure to add `gem "metacrunch"` to the Gemfile.
+
+ To pass options to the job, separate job options from the metacrunch command options using the `@@` separator.
+
+ Use the following syntax:
+
+ ```
+ $ [bundle exec] metacrunch run [COMMAND_OPTIONS] JOB_FILE [@@ [JOB_OPTIONS] [JOB_ARGS...]]
+ ```
+

  Implementing sources
  --------------------

- TBD.
+ A source (a.k.a. a reader) is any Ruby object that responds to the `each` method and yields data objects one by one.
+
+ The data is usually a `Hash` instance, but it can be any other structure as long as the rest of your pipeline expects it.
+
+ Any `Enumerable` object (e.g. an `Array`) responds to `each` and can be used as a source in metacrunch.
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+ source [1,2,3,4,5,6,7,8,9]
+ ```
+
+ Usually you implement your sources as classes; doing so lets you unit test and reuse them.
+
+ Here is a simple CSV source:
+
+ ```ruby
+ # File: my_csv_source.rb
+ require 'csv'
+
+ class MyCsvSource
+   def initialize(input_file)
+     @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
+   end
+
+   def each
+     @csv.each do |data|
+       yield(data.to_hash)
+     end
+     @csv.close
+   end
+ end
+ ```
+
+ You can then use that source in your job:
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+ require "my_csv_source"
+
+ source MyCsvSource.new("my_data.csv")
+ ```
+

  Implementing transformations
  ----------------------------

- TBD.
+ Transformations can be implemented as blocks or as a `callable`. A `callable` in Ruby is any object that responds to the `call` method.
+
+ ### Transformations as a block
+
+ When using the block syntax, the current data row is passed as a parameter.
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+
+ transformation do |data|
+   # DO YOUR TRANSFORMATION HERE
+   data = ...
+
+   # Make sure to return the data to keep it in the pipeline. Dismiss the
+   # data conditionally by returning nil.
+   data
+ end
+
+ ```
+
+ ### Transformations as a callable
+
+ Procs and lambdas in Ruby respond to `call`. They can be used to implement transformations just like blocks.
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+
+ transformation ->(data) do
+   # ...
+ end
+
+ ```
+
+ As with sources, you can create classes to test and reuse transformation logic.
+
+ ```ruby
+ # File: my_transformation.rb
+
+ class MyTransformation
+
+   def call(data)
+     # ...
+   end
+
+ end
+ ```
+
+ You can use this transformation in your job:
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+
+ require "my_transformation"
+
+ transformation MyTransformation.new
+
+ ```
+
+ Implementing destinations
+ -------------------------
+
+ A destination (a.k.a. a writer) is any Ruby object that responds to `write(data)` and `close`.
+
+ As with sources, you are encouraged to implement destinations as classes.
+
+ ```ruby
+ # File: my_destination.rb
+
+ class MyDestination
+
+   def write(data)
+     # Write data to files, remote services, databases etc.
+   end
+
+   def close
+     # Use this method to close connections, files etc.
+   end
+
+ end
+ ```
+
+ In your job:
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+
+ require "my_destination"
+
+ destination MyDestination.new
+
+ ```
+

- Implementing writers
- ---------------------
+ Built-in sources and destinations
+ ---------------------------------

  TBD.
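As an aside on the `@@` separator documented in the README changes above, a concrete call could look like the line below. The `--source` option and the file argument are hypothetical; the options a job accepts are defined by the job itself.

```
$ bundle exec metacrunch run my_etl_job.metacrunch @@ --source ./data/records.csv
```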
 
@@ -2,6 +2,9 @@ module Metacrunch
    class Db::Writer

      def initialize(database_connection_or_url, dataset_proc, options = {})
+       @use_upsert = options.delete(:use_upsert) || false
+       @id_key = options.delete(:id_key) || :id
+
        @db = if database_connection_or_url.is_a?(String)
          Sequel.connect(database_connection_or_url, options)
        else
@@ -12,12 +15,37 @@ module Metacrunch
      end

      def write(data)
-       @dataset.insert(data)
+       if data.is_a?(Array)
+         @db.transaction do
+           data.each{|d| insert_or_upsert(d) }
+         end
+       else
+         insert_or_upsert(data)
+       end
      end

      def close
        @db.disconnect
      end

+     private
+
+     def insert_or_upsert(data)
+       @use_upsert ? upsert(data) : insert(data)
+     end
+
+     def insert(data)
+       @dataset.insert(data) if data
+     end
+
+     def upsert(data)
+       if data
+         rec = @dataset.where(id: data[@id_key])
+         if 1 != rec.update(data)
+           insert(data)
+         end
+       end
+     end
+
    end
  end
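For illustration, the new `:use_upsert` and `:id_key` options introduced above could be passed to the writer as follows. This is a minimal sketch: the connection URL, table name and id column are placeholders, and the role of `dataset_proc` (its call site is outside the hunks shown) is assumed to be selecting the target Sequel dataset.

```ruby
# Hypothetical job snippet – names and URL are placeholders.
destination Metacrunch::Db::Writer.new(
  "postgres://localhost/my_database",  # database_connection_or_url
  ->(db) { db[:records] },             # dataset_proc – assumed to pick the target dataset
  use_upsert: true,                    # try an update first, insert only if no row was updated
  id_key: :record_id                   # column used to match existing rows (defaults to :id)
)
```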
@@ -110,37 +110,31 @@ module Metacrunch
  def run_transformations
    sources.each do |source|
      # sources are expected to respond to `each`
-     source.each do |row|
-       _run_transformations(row)
+     source.each do |data|
+       run_transformations_and_write_destinations(data)
      end

      # Run all transformations a last time to flush possible buffers
-     _run_transformations(nil, flush_buffers: true)
+     run_transformations_and_write_destinations(nil, flush_buffers: true)
    end

    # destination implementations are expected to respond to `close`
    destinations.each(&:close)
  end

- def _run_transformations(row, flush_buffers: false)
+ def run_transformations_and_write_destinations(data, flush_buffers: false)
    transformations.each do |transformation|
-     row = if transformation.is_a?(Buffer)
-       if flush_buffers
-         transformation.flush
-       else
-         transformation.buffer(row)
-       end
+     if transformation.is_a?(Buffer)
+       data = transformation.buffer(data) if data.present?
+       data = transformation.flush if flush_buffers
      else
-       transformation.call(row) if row
+       data = transformation.call(data) if data.present?
      end
-
-     break unless row
    end

-   if row
+   if data.present?
      destinations.each do |destination|
-       # destinations are expected to respond to `write(row)`
-       destination.write(row)
+       destination.write(data) # destinations are expected to respond to `write(data)`
      end
    end
  end
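One behavioral detail of the refactored runner above: rows are now filtered with ActiveSupport's `#present?` instead of plain truthiness (activesupport is a declared dependency, see the metadata below), so empty strings and empty collections are dismissed along with `nil`. A quick illustration:

```ruby
require "active_support/core_ext/object/blank"  # provides #present? / #blank?

nil.present?        # => false  – row is dismissed
"".present?         # => false  – empty strings are now dismissed too
{}.present?         # => false  – empty hashes are now dismissed too
{ id: 1 }.present?  # => true   – row continues through the pipeline
```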
@@ -1,3 +1,3 @@
  module Metacrunch
-   VERSION = "3.0.1"
+   VERSION = "3.0.2"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: metacrunch
  version: !ruby/object:Gem::Version
-   version: 3.0.1
+   version: 3.0.2
  platform: ruby
  authors:
  - René Sprotte
@@ -10,7 +10,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2016-05-19 00:00:00.000000000 Z
+ date: 2016-07-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activesupport