ETL 0.0.1 → 1.0.0.rc

Sign up to get free protection for your applications and to get access to all the features.
data/.gitignore CHANGED
@@ -3,6 +3,7 @@
3
3
  .bundle
4
4
  .config
5
5
  .yardoc
6
+ .rvmrc
6
7
  Gemfile.lock
7
8
  InstalledFiles
8
9
  _yardoc
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --color
2
+ --format documentation
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
@@ -0,0 +1,14 @@
1
+ Contributing
2
+ ============
3
+
4
+ If you would like to contribute code to ETL you can do so through GitHub by
5
+ forking the repository and sending a pull request.
6
+
7
+ When submitting code, please make every effort to follow existing conventions
8
+ and style in order to keep the code as readable as possible.
9
+
10
+ Before your code can be accepted into the project you must also sign the
11
+ [Individual Contributor License Agreement (CLA)][1].
12
+
13
+
14
+ [1]: https://spreadsheets.google.com/spreadsheet/viewform?formkey=dDViT2xzUHAwRkI3X3k5Z0lQM091OGc6MQ&ndplr=1
data/Gemfile CHANGED
@@ -1,4 +1,4 @@
1
1
  source 'https://rubygems.org'
2
2
 
3
- # Specify your gem's dependencies in ETL.gemspec
3
+ # Specify your gem's dependencies in etl.gemspec
4
4
  gemspec
data/LICENSE CHANGED
@@ -1,22 +1,13 @@
1
- Copyright (c) 2012 Jeff Iacono
1
+ Copyright 2013 Square Inc.
2
2
 
3
- MIT License
3
+ Licensed under the Apache License, Version 2.0 (the "License");
4
+ you may not use this file except in compliance with the License.
5
+ You may obtain a copy of the License at
4
6
 
5
- Permission is hereby granted, free of charge, to any person obtaining
6
- a copy of this software and associated documentation files (the
7
- "Software"), to deal in the Software without restriction, including
8
- without limitation the rights to use, copy, modify, merge, publish,
9
- distribute, sublicense, and/or sell copies of the Software, and to
10
- permit persons to whom the Software is furnished to do so, subject to
11
- the following conditions:
7
+ http://www.apache.org/licenses/LICENSE-2.0
12
8
 
13
- The above copyright notice and this permission notice shall be
14
- included in all copies or substantial portions of the Software.
15
-
16
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
9
+ Unless required by applicable law or agreed to in writing, software
10
+ distributed under the License is distributed on an "AS IS" BASIS,
11
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ See the License for the specific language governing permissions and
13
+ limitations under the License.
data/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # ETL
2
2
 
3
- TODO: Write a gem description
3
+ Extract, transform, and load data with ruby!
4
4
 
5
5
  ## Installation
6
6
 
@@ -16,14 +16,370 @@ Or install it yourself as:
16
16
 
17
17
  $ gem install ETL
18
18
 
19
- ## Usage
19
+ ## ETL Dependencies
20
20
 
21
- TODO: Write usage instructions here
21
+ ETL depends on having a database connection object that __must__ respond
22
+ to `#query`. The [mysql2](https://github.com/brianmario/mysql2) gem is a good option.
23
+ You can also proxy another library using Ruby's `SimpleDelegator` and add a `#query`
24
+ method if need be.
25
+
26
+ The gem comes bundled with a default logger. If you'd like to write your own
27
+ just make sure that it implements `#debug` and `#info`. For more information
28
+ on what is logged and when, view the [logger details](#logger-details).
29
+
30
+ ### Basic ETL
31
+
32
+ Assume that we have a database connection represented by `connection`.
33
+
34
+ To run a basic ETL that is composed of sequential SQL statements, start by
35
+ creating a new ETL instance:
36
+
37
+ ```ruby
38
+ etl = ETL.new(description: "a description of what this ETL does",
39
+ connection: connection)
40
+ ```
41
+ which can then be configured:
42
+
43
+ ```ruby
44
+ etl.config do |etl|
45
+ etl.ensure_destination do |etl|
46
+ # For most ETLs you may want to ensure that the destination exists, so the
47
+ # #ensure_destination block is ideally suited to fulfill this requirement.
48
+ #
49
+ # By way of example:
50
+ #
51
+ etl.query %[
52
+ CREATE TABLE IF NOT EXISTS some_database.some_destination_table (
53
+ user_id INT UNSIGNED NOT NULL,
54
+ created_date DATE NOT NULL,
55
+ total_amount INT SIGNED NOT NULL,
56
+ message VARCHAR(100) DEFAULT NULL,
57
+ PRIMARY KEY (user_id),
58
+ KEY (user_id, created_date),
59
+ KEY (created_date)
60
+ )]
61
+ end
62
+
63
+ etl.before_etl do |etl|
64
+ # All pre-ETL work is performed in this block.
65
+ #
66
+ # This can be thought of as a before-ETL hook that will fire only once. When
67
+ # you are not leveraging the ETL iteration capabilities, the value of this
68
+ # block vs the #etl block is not very clear. We will see how and when to
69
+ # leverage this block effectively when we introduce iteration.
70
+ #
71
+ # As an example, let's say we want to get rid of all entries that have an
72
+ # amount less than zero before moving on to our actual etl:
73
+ #
74
+ etl.query %[DELETE FROM some_database.some_source_table WHERE amount < 0]
75
+ end
76
+
77
+ etl.etl do |etl|
78
+ # Here is where the magic happens! This block contains the main ETL
79
+ # operation.
80
+ #
81
+ # For example:
82
+ #
83
+ etl.query %[
84
+ REPLACE INTO some_database.some_destination_table
85
+ SELECT
86
+ user_id
87
+ , DATE(created_at) AS created_date
88
+ , SUM(amount) AS total_amount
89
+ FROM
90
+ some_database.some_source_table sst
91
+ GROUP BY
92
+ sst.user_id
93
+ , sst.DATE(created_at)]
94
+ end
95
+
96
+ etl.after_etl do |etl|
97
+ # All post-ETL work is performed in this block.
98
+ #
99
+ # Again, to finish up with an example:
100
+ #
101
+ etl.query %[
102
+ UPDATE some_database.some_destination_table
103
+ SET message = "WOW"
104
+ WHERE total_amount > 100]
105
+ end
106
+ end
107
+ ```
108
+
109
+ At this point it is possible to run the ETL instance via:
110
+
111
+ ```ruby
112
+ etl.run
113
+ ```
114
+ which executes `#ensure_destination`, `#before_etl`, `#etl`, and `#after_etl` in
115
+ that order.
116
+
117
+ ### ETL with iteration
118
+
119
+ To add in iteration, simply supply `#start`, `#step`, and `#stop` blocks. This
120
+ is useful when dealing with large data sets or when executing queries that,
121
+ while optimized, are still slow.
122
+
123
+ Again, to kick things off:
124
+
125
+ ```ruby
126
+ etl = ETL.new(description: "a description of what this ETL does",
127
+ connection: connection)
128
+ ```
129
+
130
+ where `connection` is the same as described above.
131
+
132
+ Next we can configure the ETL:
133
+
134
+ ```ruby
135
+ # assuming we have the ETL instance from above
136
+ etl.config do |etl|
137
+ etl.ensure_destination do |etl|
138
+ # For most ETLs you may want to ensure that the destination exists, so the
139
+ # #ensure_destination block is ideally suited to fulfill this requirement.
140
+ #
141
+ # By way of example:
142
+ #
143
+ etl.query %[
144
+ CREATE TABLE IF NOT EXISTS some_database.some_destination_table (
145
+ user_id INT UNSIGNED NOT NULL,
146
+ created_date DATE NOT NULL,
147
+ total_amount INT SIGNED NOT NULL,
148
+ message VARCHAR(100) DEFAULT NULL,
149
+ PRIMARY KEY (user_id),
150
+ KEY (user_id, created_date),
151
+ KEY (created_date)
152
+ )]
153
+ end
154
+
155
+ etl.before_etl do |etl|
156
+ # All pre-ETL work is performed in this block.
157
+ #
158
+ # Now that we are leveraging iteration the #before_etl block becomes
159
+ # more useful as a way to execute an operation once before we begin
160
+ # our iteration.
161
+ #
162
+ # As an example, let's say we want to get rid of all entries that have an
163
+ # amount less than zero before moving on to our actual etl:
164
+ #
165
+ etl.query %[
166
+ DELETE FROM some_database.some_source_table
167
+ WHERE amount < 0]
168
+ end
169
+
170
+ etl.start do |etl|
171
+ # This defines where the ETL should start. This can be a flat number
172
+ # or date, or even SQL / other code can be executed to produce a starting
173
+ # value.
174
+ #
175
+ # Usually, this is the last known entry for the destination table with
176
+ # some sensible default if the destination does not yet contain data.
177
+ #
178
+ # As an example:
179
+ #
180
+ res = etl.query %[
181
+ SELECT COALESCE(MAX(created_date), '1970-01-01') AS the_max
182
+ FROM some_database.some_destination_table]
183
+
184
+ res.to_a.first['the_max']
185
+ end
186
+
187
+ etl.step do |etl|
188
+ # The step block defines the size of the iteration block. To iterate by
189
+ # ten records, the step block should be set to return 10.
190
+ #
191
+ # As an alternative example, to set the iteration to go 10,000 units
192
+ # at a time, the following value should be provided:
193
+ #
194
+ # 10_000 (Note: an underscore is used for readability)
195
+ #
196
+ # As an example, to iterate 7 days at a time:
197
+ #
198
+ 7.days
199
+ end
200
+
201
+ etl.stop do |etl|
202
+ # The stop block defines when the iteration should halt.
203
+ # Again, this can be a flat value or code. Either way, one value *must* be
204
+ # returned.
205
+ #
206
+ # As a flat value:
207
+ #
208
+ # 1_000_000
209
+ #
210
+ # Or a date value:
211
+ #
212
+ # Time.now.to_date
213
+ #
214
+ # Or as a code example:
215
+ #
216
+ res = etl.query %[
217
+ SELECT DATE(MAX(created_at)) AS the_max
218
+ FROM some_database.some_source_table]
219
+
220
+ res.to_a.first['the_max']
221
+ end
222
+
223
+ etl.etl do |etl, lbound, ubound|
224
+ # The etl block is the main part of the framework. Note: there are
225
+ # two extra args with the iterator this time around: "lbound" and "ubound"
226
+ #
227
+ # "lbound" is the lower bound of the current iteration. When iterating
228
+ # from 0 to 10 and stepping by 2, the lbound would equal 2 on the
229
+ # second iteration.
230
+ #
231
+ # "ubound" is the upper bound of the current iteration. In continuing with the
232
+ # example above, when iterating from 0 to 10 and stepping by 2, the ubound would
233
+ # equal 4 on the second iteration.
234
+ #
235
+ # These args can be used to "window" SQL queries or other code operations.
236
+ #
237
+ # As a first example, to iterate over a set of ids:
238
+ #
239
+ # etl.query %[
240
+ # REPLACE INTO some_database.some_destination_table
241
+ # SELECT
242
+ # user_id
243
+ # , SUM(amount) AS total_amount
244
+ # FROM
245
+ # some_database.some_source_table sst
246
+ # WHERE
247
+ # sst.user_id > #{lbound} AND sst.user_id <= #{ubound}
248
+ # GROUP BY
249
+ # sst.user_id]
250
+ #
251
+ # To "window" a SQL query using dates:
252
+ #
253
+ etl.query %[
254
+ REPLACE INTO some_database.some_destination_table
255
+ SELECT
256
+ DATE(created_at)
257
+ , SUM(amount) AS total_amount
258
+ FROM
259
+ some_database.some_source_table sst
260
+ WHERE
261
+ -- Note the usage of quotes surrounding the lbound and ubound vars.
262
+ -- This is is required when dealing with dates / datetimes
263
+ sst.created_at >= '#{lbound}' AND sst.created_at < '#{ubound}'
264
+ GROUP BY
265
+ sst.user_id]
266
+
267
+ # Note that there is no sql sanitization here so there is *potential* for SQL
268
+ # injection. That being said you'll likely be using this gem in an internal
269
+ # tool so hopefully your co-workers are not looking to sabotage your ETL
270
+ # pipeline. Just be aware of this and handle it as you see fit.
271
+ end
272
+
273
+ etl.after_etl do |etl|
274
+ # All post-ETL work is performed in this block.
275
+ #
276
+ # Again, to finish up with an example:
277
+ #
278
+ etl.query %[
279
+ UPDATE some_database.some_destination_table
280
+ SET message = "WOW"
281
+ WHERE total_amount > 100]
282
+ end
283
+ end
284
+ ```
285
+
286
+ At this point it is possible to run the ETL instance via:
287
+
288
+ ```ruby
289
+ etl.run
290
+ ```
291
+ which executes `#ensure_destination`, `#before_etl`, `#etl`, and `#after_etl` in
292
+ that order.
293
+
294
+ Note that `#etl` executes `#start` and `#stop` once and memoizes the result for
295
+ each. It then begins to iterate from what `#start` evaluated to up until what `#stop`
296
+ evaluated to by what `#step` evaluates to.
297
+
298
+ ## Logger Details
299
+
300
+ A logger must support two methods: `#info` and `#warn`.
301
+
302
+ Both methods should accept a single hash argument. The argument will contain:
303
+
304
+ - `:emitter` => a reference to the ETL instance's `self`
305
+ - `:event_type` => a symbol that includes the type of event being logged. You
306
+ can use this value to derive which other data you'll have available
307
+
308
+ When `:event_type` is equal to `:query_start`, you'll have the following
309
+ available in the hash argument:
310
+
311
+ - `:sql` => the sql that is going to be run
312
+
313
+ These events are logged at the debug level.
314
+
315
+ When `:event_type` is equal to `:query_complete`, you'll have the following
316
+ available in the hash argument:
317
+
318
+ - `:sql` => the sql that was run
319
+ - `:runtime` => how long the query took to execute
320
+
321
+ These events are logged at the info level.
322
+
323
+ Following from this you could implement a simple logger as:
324
+
325
+ ```ruby
326
+ class PutsLogger
327
+ def info data
328
+ @data = data
329
+ write!
330
+ end
331
+
332
+ def debug data
333
+ @data = data
334
+ write!
335
+ end
336
+
337
+ private
338
+
339
+ def write!
340
+ case (event_type = @data.delete(:event_type))
341
+ when :query_start
342
+ output = "#{@data[:emitter].description} is about to run\n"
343
+ output += "#{@data[:sql]}\n"
344
+ when :query_complete
345
+ output = "#{@data[:emitter].description} executed:\n"
346
+ output += "#{@data[:sql]}\n"
347
+ output += "query completed at #{Time.now} and took #{@data[:runtime]}s\n"
348
+ else
349
+ output = "no special logging for #{event_type} event_type yet\n"
350
+ end
351
+ puts output
352
+ @data = nil
353
+ end
354
+ end
355
+ ```
22
356
 
23
357
  ## Contributing
24
358
 
25
- 1. Fork it
26
- 2. Create your feature branch (`git checkout -b my-new-feature`)
27
- 3. Commit your changes (`git commit -am 'Added some feature'`)
28
- 4. Push to the branch (`git push origin my-new-feature`)
29
- 5. Create new Pull Request
359
+ If you would like to contribute code to ETL you can do so through GitHub by
360
+ forking the repository and sending a pull request.
361
+
362
+ When submitting code, please make every effort to follow existing conventions
363
+ and style in order to keep the code as readable as possible.
364
+
365
+ Before your code can be accepted into the project you must also sign the
366
+ [Individual Contributor License Agreement (CLA)][1].
367
+
368
+
369
+ [1]: https://spreadsheets.google.com/spreadsheet/viewform?formkey=dDViT2xzUHAwRkI3X3k5Z0lQM091OGc6MQ&ndplr=1
370
+
371
+ ## License
372
+
373
+ Copyright 2013 Square Inc.
374
+
375
+ Licensed under the Apache License, Version 2.0 (the "License");
376
+ you may not use this file except in compliance with the License.
377
+ You may obtain a copy of the License at
378
+
379
+ http://www.apache.org/licenses/LICENSE-2.0
380
+
381
+ Unless required by applicable law or agreed to in writing, software
382
+ distributed under the License is distributed on an "AS IS" BASIS,
383
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
384
+ See the License for the specific language governing permissions and
385
+ limitations under the License.