metacrunch 3.0.1 → 3.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 8c5c7308708c116022aab5aafb8c546b70430383
-   data.tar.gz: 5c658db2d33ab7a31b28df026a158146d67d63a6
+   metadata.gz: 140352c3ee66626aef744b87358762a4130f6823
+   data.tar.gz: f9bd336d44ac985f5806045b852e219236e1d038
  SHA512:
-   metadata.gz: c9f71280290fecd7ac65cec82b708e9e09b4aa07929f1a7dce9d7077c048ac8c8c2787972a4dc344da3041756c79cfff8f78e5f2b8c9f95d58383cc9dfcf0cd3
-   data.tar.gz: f2bef55af9e464cf9c50b83bdcd15ad3b7552cb65520fbc1267228fe9ca9c4b500a6250f54236c87a63ba7d6fb1cb3325e884cd595205fabf4a68eeedffbea26
+   metadata.gz: 494530523e869e12ef00bd709ad139e840b1fd580d39d587575b85b1acdf029a573c330776eb65085a99c6714f4fd01afdad8a37d25dfe7197568682d229cde2
+   data.tar.gz: f09ba8cadfc1a10cb26b9a5125797da1dbb7dd9c24f4b11381183181b0916fa57a74abbf20713614f062e0026f7b0670f4cb3fa2a568e13b5b017f4d718a0c07
data/Readme.md CHANGED
@@ -17,51 +17,64 @@ $ gem install metacrunch
  ```


- Create ETL jobs
- ---------------
+ Creating ETL jobs
+ -----------------

- The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data back to one or more **destinations** (load step).
+ The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data to one or more **destinations** (load step).

- metacrunch provides you with a simple DSL to define such ETL jobs. Just create a text file with the extension `.metacrunch`. Note: The extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component.
+ metacrunch provides you with a simple DSL to define and run such ETL jobs. Just create a text file with the extension `.metacrunch`. *Note: The extension doesn't really matter, but you should avoid `.rb` so the job files aren't loaded by mistake from another Ruby component.*

- Let's take a look at an example. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repo.
+ Let's walk through the main steps of creating ETL jobs with metacrunch. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repo.
+
+ #### It's Ruby
+
+ Every `.metacrunch` job file is a regular Ruby file, so you can always use regular Ruby features such as declaring methods, classes and variables, and requiring other Ruby files.

  ```ruby
  # File: my_etl_job.metacrunch

- # Every metacrunch job file is a regular Ruby file. So you can always use regular Ruby
- # stuff like declaring methods
  def my_helper
    # ...
  end

- # ... declaring classes
  class MyHelper
    # ...
  end

- # ... declaring variables
- foo = "bar"
+ helper = MyHelper.new

- # ... or loading other ruby files
+ require "some_gem"
  require_relative "./some/other/ruby/file"
+ ```
+
+ #### Defining sources
+
+ A source (a.k.a. a reader) is an object that reads data into the metacrunch processing pipeline. Use one of the built-in or 3rd party sources, or implement one yourself. Implementing sources is easy – [see notes below](#implementing-sources). You can declare one or more sources. They are processed in the order they are defined.
+
+ You must declare at least one source to allow a job to run.
+
+ ```ruby
+ # File: my_etl_job.metacrunch

- # Declare a source (use a build-in or 3rd party source or implement it – see notes below).
- # At least one source is required to allow the job to run.
+ source Metacrunch::Fs::Reader.new(args)
  source MySource.new
- # ... maybe another one. Sources are processed in the order they are defined.
- source MyOtherSource.new
+ ```

- # Declare a destination (use a build-in or 3rd party destination or implement it see notes below).
- # Technically a destination is optional, but a job that doesn't store it's
- # output doesn't really makes sense.
- destination MyDestination.new
- # ... you can have more destinations if you like
- destination MyOtherDestination.new
+ This example uses a built-in file reader source. To learn more about the built-in sources see the [notes below](#built-in-sources-and-destinations).
+
+ #### Defining transformations
+
+ To process, transform or manipulate data use the `#transformation` hook. A transformation can be implemented as a block, a lambda or as a (callable) object. To learn more about transformations check the section about [implementing transformations](#implementing-transformations) below.
+
+ The current data object (the object that has just been read by a source) is passed to the first transformation as a parameter. The return value of a transformation is then passed to the next transformation – or to the destination if the current transformation is the last one.
+
+ If you return `nil`, the current data object is dismissed and the next transformation (or destination) won't be called.
+
+ ```ruby
+ # File: my_etl_job.metacrunch

- # To process data use the #transformation hook.
  transformation do |data|
-   # Called for each data object that has been put in the pipeline by a source.
+   # Called for each data object that has been read by a source.

    # Do your data transformation process here.

@@ -71,60 +84,227 @@ transformation do |data|
  end

  # Instead of passing a block to #transformation you can pass a
- # `callable` object (an object responding to #call).
- transformation Proc.new {
-   # Procs and Lambdas responds to #call
+ # `callable` object (any object responding to #call).
+ transformation ->(data) {
+   # Lambdas respond to #call
  }

  # MyTransformation defines #call
  transformation MyTransformation.new
+ ```
+
+ #### Defining destinations
+
+ A destination (a.k.a. a writer) is an object that writes the transformed data to an external system. Use one of the built-in or 3rd party destinations, or implement one yourself. Implementing destinations is easy – [see notes below](#implementing-destinations). You can declare one or more destinations. They are processed in the order they are defined.
+
+ ```ruby
+ # File: my_etl_job.metacrunch

- # To run arbitrary code before the first transformation use the #pre_process hook.
+ destination MyDestination.new
+ ```
+
+ This example uses a custom destination. To learn more about the built-in destinations see the [notes below](#built-in-sources-and-destinations).
+
+ #### Pre/Post process
+
+ To run arbitrary code before the first transformation use the
+ `#pre_process` hook. To run arbitrary code after the last transformation use
+ `#post_process`. Like transformations, `#pre_process` and `#post_process` can be called with a block, a lambda or a (callable) object.
+
+ ```ruby
  pre_process do
    # Called before the first transformation
  end

- # To run arbitrary code after the last transformation use the #post_process hook.
  post_process do
    # Called after the last transformation
  end

- # Instead of passing a block to #pre_process or #post_process you can pass a
- # `callable` object (an object responding to #call).
- pre_process Proc.new {
-   # Procs and Lambdas responds to #call
+ pre_process ->() {
+   # Lambdas respond to #call
  }

  # MyCallable class defines #call
  post_process MyCallable.new
-
  ```

+ #### Defining options

- Run ETL jobs
- ------------
+ TBD.

- metacrunch comes with a handy command line tool. In your terminal just call
+ Running ETL jobs
+ ----------------
+
+ metacrunch comes with a handy command line tool. In a terminal use


  ```
  $ metacrunch run my_etl_job.metacrunch
  ```

- to run the job.
+ to run a job.
+
+ If you use [Bundler](http://bundler.io) to manage dependencies for your jobs, make sure to change into the directory where your Gemfile is (or set the BUNDLE_GEMFILE environment variable) and run metacrunch with `bundle exec`.
+
+ ```
+ $ bundle exec metacrunch run my_etl_job.metacrunch
+ ```
+
+ Depending on your environment `bundle exec` may not be required (e.g. if you have rubygems-bundler installed), but we recommend using it whenever there is a Gemfile you'd like to use. When using Bundler make sure to add `gem "metacrunch"` to the Gemfile.
+
+ To pass options to a job, separate the job options from the metacrunch command options using the `@@` separator.
+
+ Use the following syntax:
+
+ ```
+ $ [bundle exec] metacrunch run [COMMAND_OPTIONS] JOB_FILE [@@ [JOB_OPTIONS] [JOB_ARGS...]]
+ ```
+

  Implementing sources
  --------------------

- TBD.
+ A source (a.k.a. a reader) is any Ruby object that responds to the `each` method and yields data objects one by one.
+
+ The data is usually a `Hash` instance, but it can be any other structure as long as the rest of your pipeline expects it.
+
+ Any `Enumerable` object (e.g. an `Array`) responds to `each` and can be used as a source in metacrunch.
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+ source [1,2,3,4,5,6,7,8,9]
+ ```
+
+ Usually you implement your sources as classes. Doing so lets you unit test and reuse them.
+
+ Here is a simple CSV source:
+
+ ```ruby
+ # File: my_csv_source.rb
+ require 'csv'
+
+ class MyCsvSource
+   def initialize(input_file)
+     @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
+   end
+
+   def each
+     @csv.each do |data|
+       yield(data.to_hash)
+     end
+     @csv.close
+   end
+ end
+ ```
+
+ You can then use that source in your job:
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+ require "my_csv_source"
+
+ source MyCsvSource.new("my_data.csv")
+ ```
+

  Implementing transformations
  ----------------------------

- TBD.
+ Transformations can be implemented as blocks or as a `callable`. A `callable` in Ruby is any object that responds to the `call` method.
+
+ ### Transformations as a block
+
+ When using the block syntax, the current data row is passed as a parameter.
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+
+ transformation do |data|
+   # DO YOUR TRANSFORMATION HERE
+   data = ...
+
+   # Make sure to return the data to keep it in the pipeline. Dismiss the
+   # data conditionally by returning nil.
+   data
+ end
+ ```
+
+ ### Transformations as a callable
+
+ Procs and lambdas in Ruby respond to `call`. They can be used to implement transformations similarly to blocks.
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+
+ transformation ->(data) do
+   # ...
+ end
+ ```
+
+ Like sources, you can create classes to test and reuse transformation logic.
+
+ ```ruby
+ # File: my_transformation.rb
+
+ class MyTransformation
+   def call(data)
+     # ...
+   end
+ end
+ ```
+
+ You can use this transformation in your job:
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+
+ require "my_transformation"
+
+ transformation MyTransformation.new
+ ```
+
+ Implementing destinations
+ -------------------------
+
+ A destination (a.k.a. a writer) is any Ruby object that responds to `write(data)` and `close`.
+
+ Like sources, you are encouraged to implement destinations as classes.
+
+ ```ruby
+ # File: my_destination.rb
+
+ class MyDestination
+   def write(data)
+     # Write data to files, remote services, databases etc.
+   end
+
+   def close
+     # Use this method to close connections, files etc.
+   end
+ end
+ ```
+
+ In your job:
+
+ ```ruby
+ # File: my_etl_job.metacrunch
+
+ require "my_destination"
+
+ destination MyDestination.new
+ ```
+

- Implementing writers
- ---------------------
+ Built-in sources and destinations
+ ---------------------------------

  TBD.

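The source/transformation/destination contract described in the Readme above can be exercised without metacrunch itself. The sketch below wires the Readme's CSV source into a hand-rolled runner; `run_pipeline` and `ArrayDestination` are illustrative stand-ins (metacrunch's own runner also handles buffers and pre/post hooks) and are not part of the metacrunch API.

```ruby
# Self-contained sketch of the source/transformation/destination contract.
require "csv"

# A source responds to #each and yields data objects one by one.
class MyCsvSource
  def initialize(input_file)
    @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
  end

  def each
    @csv.each { |row| yield(row.to_hash) }
    @csv.close
  end
end

# A destination responds to #write(data) and #close.
class ArrayDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write(data)
    @rows << data
  end

  def close
    # Nothing to clean up for an in-memory destination.
  end
end

# Pass each data object through the transformations; a nil return value
# dismisses the object, as described for #transformation above.
def run_pipeline(source, transformations, destination)
  source.each do |data|
    transformations.each { |t| data = t.call(data) if data }
    destination.write(data) if data
  end
  destination.close
end
```

With a `name,age` CSV and a transformation lambda such as `->(data) { data[:age].to_i >= 18 ? data : nil }`, only the rows surviving the transformation reach the destination.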
@@ -2,6 +2,9 @@ module Metacrunch
  class Db::Writer

    def initialize(database_connection_or_url, dataset_proc, options = {})
+     @use_upsert = options.delete(:use_upsert) || false
+     @id_key = options.delete(:id_key) || :id
+
      @db = if database_connection_or_url.is_a?(String)
        Sequel.connect(database_connection_or_url, options)
      else
@@ -12,12 +15,37 @@ module Metacrunch
    end

    def write(data)
-     @dataset.insert(data)
+     if data.is_a?(Array)
+       @db.transaction do
+         data.each{|d| insert_or_upsert(d) }
+       end
+     else
+       insert_or_upsert(data)
+     end
    end

    def close
      @db.disconnect
    end

+   private
+
+   def insert_or_upsert(data)
+     @use_upsert ? upsert(data) : insert(data)
+   end
+
+   def insert(data)
+     @dataset.insert(data) if data
+   end
+
+   def upsert(data)
+     if data
+       rec = @dataset.where(id: data[@id_key])
+       if 1 != rec.update(data)
+         insert(data)
+       end
+     end
+   end
+
  end
  end
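The upsert path added above issues an UPDATE scoped to the record's id and falls back to INSERT when no row was affected (Sequel's `Dataset#update` returns the number of updated rows). Below is a dependency-free sketch of that decision logic; `FakeDataset` is an illustrative stand-in for a Sequel dataset and is not part of Sequel or metacrunch.

```ruby
# Dependency-free sketch of the update-then-insert fallback added above.
class FakeDataset
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Mimics dataset.where(id: ...) by remembering the id filter.
  def where(id:)
    @filter_id = id
    self
  end

  # Mimics Sequel's #update: returns the count of updated rows.
  def update(data)
    return 0 unless @rows.key?(@filter_id)
    @rows[@filter_id] = data
    1
  end

  def insert(data)
    @rows[data[:id]] = data
  end
end

# Same control flow as the writer's #upsert: update first, insert only
# when the update touched no existing row.
def upsert(dataset, data, id_key: :id)
  return unless data

  rec = dataset.where(id: data[id_key])
  dataset.insert(data) if 1 != rec.update(data)
end
```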
@@ -110,37 +110,31 @@ module Metacrunch
  def run_transformations
    sources.each do |source|
      # sources are expected to respond to `each`
-     source.each do |row|
-       _run_transformations(row)
+     source.each do |data|
+       run_transformations_and_write_destinations(data)
      end

      # Run all transformations a last time to flush possible buffers
-     _run_transformations(nil, flush_buffers: true)
+     run_transformations_and_write_destinations(nil, flush_buffers: true)
    end

    # destination implementations are expected to respond to `close`
    destinations.each(&:close)
  end

- def _run_transformations(row, flush_buffers: false)
+ def run_transformations_and_write_destinations(data, flush_buffers: false)
    transformations.each do |transformation|
-     row = if transformation.is_a?(Buffer)
-       if flush_buffers
-         transformation.flush
-       else
-         transformation.buffer(row)
-       end
+     if transformation.is_a?(Buffer)
+       data = transformation.buffer(data) if data.present?
+       data = transformation.flush if flush_buffers
      else
-       transformation.call(row) if row
+       data = transformation.call(data) if data.present?
      end
-
-     break unless row
    end

-   if row
+   if data.present?
      destinations.each do |destination|
-       # destinations are expected to respond to `write(row)`
-       destination.write(row)
+       destination.write(data) # destinations are expected to respond to `write(data)`
      end
    end
  end
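The refactored runner above treats `Buffer` transformations specially: while data is flowing it only feeds the buffer, and on the final `flush_buffers: true` pass it flushes whatever was collected onward. The standalone sketch below illustrates that control flow; its `Buffer` is a simplified stand-in for metacrunch's Buffer class, and ActiveSupport's `present?` is approximated with plain nil/empty checks.

```ruby
# Standalone sketch of the buffering control flow in the runner above.
class Buffer
  def initialize
    @items = []
  end

  # Collect the item; nothing leaves the buffer until it is flushed.
  def buffer(data)
    @items << data
    nil
  end

  # Release everything collected so far. Returning nil when empty mimics
  # what a #present? check would filter out.
  def flush
    items, @items = @items, []
    items.empty? ? nil : items
  end
end

def run_step(transformations, destinations, data, flush_buffers: false)
  transformations.each do |transformation|
    if transformation.is_a?(Buffer)
      data = transformation.buffer(data) if data
      data = transformation.flush if flush_buffers
    else
      data = transformation.call(data) if data
    end
  end

  destinations.each { |destination| destination.write(data) } if data
end
```

Buffered items reach the destinations only on the final flush pass, which is why the runner calls itself once more with `nil` after the sources are exhausted.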
@@ -1,3 +1,3 @@
  module Metacrunch
-   VERSION = "3.0.1"
+   VERSION = "3.0.2"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: metacrunch
  version: !ruby/object:Gem::Version
-   version: 3.0.1
+   version: 3.0.2
  platform: ruby
  authors:
  - René Sprotte
@@ -10,7 +10,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2016-05-19 00:00:00.000000000 Z
+ date: 2016-07-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activesupport