kiba 0.5.0 → 0.6.0
- checksums.yaml +4 -4
- data/Changes.md +8 -0
- data/Gemfile +1 -1
- data/README.md +38 -7
- data/Rakefile +1 -1
- data/bin/kiba +1 -1
- data/kiba.gemspec +9 -9
- data/lib/kiba/cli.rb +3 -3
- data/lib/kiba/context.rb +8 -4
- data/lib/kiba/control.rb +5 -1
- data/lib/kiba/parser.rb +1 -1
- data/lib/kiba/runner.rb +16 -8
- data/lib/kiba/version.rb +2 -2
- data/test/helper.rb +1 -1
- data/test/support/test_enumerable_source.rb +1 -1
- data/test/test_cli.rb +2 -2
- data/test/test_integration.rb +24 -16
- data/test/test_parser.rb +26 -18
- data/test/test_runner.rb +30 -11
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: faa6cbb049d4b35cdd62c647c1510bc9d296cbb5
+  data.tar.gz: c6d522da427b0b2388771279b146fd48ac117605
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b26ef4488c4aa78c86f99fb001565260dacae1ef529f2c7b8c37533ef179e39c57dfc8c6e35fa1f31acf8e9e4ab6d22b418d90316317a6f4db898c3a93c22108
+  data.tar.gz: f4d7cc78c3ccdb04fc3b98310ac2f4f9163aa60112481b81f22119f4864c6485117087718ec3512559ac668f62d2f8bcc7b54275999303941d9373225851b620
data/Changes.md
CHANGED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,9 +1,13 @@
 Writing reliable, concise, well-tested & maintainable data-processing code is tricky.
 
-Kiba lets you define and run such high-quality ETL jobs, using Ruby.
+Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs, using Ruby (see [supported versions](#supported-ruby-versions)).
 
-
+Learn more on the [Kiba blog](http://thibautbarrere.com):
 
+* [Rubyists - are you doing ETL unknowingly?](http://thibautbarrere.com/2015/03/25/rubyists-are-you-doing-etl-unknowingly/)
+* [How to write solid data processing code](http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code/)
+
+[![Gem Version](https://badge.fury.io/rb/kiba.svg)](http://badge.fury.io/rb/kiba)
 [![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba) [![Dependency Status](https://gemnasium.com/thbar/kiba.svg)](https://gemnasium.com/thbar/kiba)
 
 ## How do you define ETL jobs with Kiba?
@@ -20,6 +24,11 @@ end
 # eg: commonly used sources / destinations / transforms, under unit-test
 require_relative 'common'
 
+# declare a pre-processor: a block called before the first row is read
+pre_process do
+  # do something
+end
+
 # declare a source where to take data from (you implement it - see notes below)
 source MyCsvSource, 'input.csv'
 
@@ -50,7 +59,7 @@ post_process do
 end
 ```
 
-The combination of sources, transforms, destinations and post-processors defines the data processing pipeline.
+The combination of pre-processors, sources, transforms, destinations and post-processors defines the data processing pipeline.
 
 Note: you are advised to store your ETL definitions as files with the extension `.etl` (rather than `.rb`). This will make sure you do not end up loading them by mistake from another component (eg: a Rails app).
 
@@ -74,7 +83,7 @@ Kiba.run(job_definition)
 `Kiba.parse` evaluates your ETL Ruby code to register sources, transforms, destinations and post-processors in a job definition. It is important to understand that you can use Ruby logic at the DSL parsing time. This means that such code is possible, provided the CSV files are available at parsing time:
 
 ```ruby
-Dir['to_be_processed/*.csv'].each do |
+Dir['to_be_processed/*.csv'].each do |file|
   source MyCsvSource, file
 end
 ```
@@ -191,14 +200,32 @@ class MyCsvDestination
 end
 ```
 
-## Implementing post-processors
+## Implementing pre and post-processors
 
-
-
+Pre-processors and post-processors are currently blocks, which get called only once per ETL run:
+- Pre-processors get called before the ETL starts reading rows from the sources.
+- Post-processors get invoked after the ETL successfully processed all the rows.
+
+Note that post-processors won't get called if an error occurred earlier.
 
 ```ruby
 count = 0
 
+def system!(cmd)
+  fail "Command #{cmd} failed" unless system(cmd)
+end
+
+file = 'my_file.csv'
+sample_file = 'my_file.sample.csv'
+
+pre_process do
+  # it's handy to work with a reduced data set. you can
+  # e.g. just keep one line of the CSV files + the headers
+  system! "sed -n \"1p;25706p\" #{file} > #{sample_file}"
+end
+
+source MyCsv, file: sample_file
+
 transform do |row|
   count += 1
   row
@@ -225,6 +252,10 @@ The ability to support that DSL, but also check command line arguments, environm
 
 Make sure to subscribe to my [Ruby ETL blog](http://thibautbarrere.com) where I'll demonstrate such techniques over time!
 
+## Supported Ruby versions
+
+Kiba currently supports Ruby 2.0+ and JRuby (with its default 1.9 syntax).
+
 ## History & Credits
 
 Wow, you're still there? Nice to meet you. I'm [Thibaut](http://thibautbarrere.com), author of Kiba.
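The ordering that the README changes describe (pre-processors once, then rows, then post-processors once, with post-processors skipped after an error) can be illustrated without Kiba itself. What follows is a minimal sketch in plain Ruby, not Kiba's implementation: `run_job`, `successful_run` and `failing_run` are hypothetical helpers introduced only for this illustration.

```ruby
# Hypothetical stand-in for an ETL run (not part of Kiba's API).
# pre_processes and post_processes are arrays of callables, rows is any
# enumerable, and transform is a callable applied to each row.
def run_job(pre_processes, rows, transform, post_processes)
  pre_processes.each(&:call)              # once, before any row is read
  rows.each { |row| transform.call(row) } # once per row
  post_processes.each(&:call)             # once, only reached if nothing raised
end

# A run where every row is processed successfully.
def successful_run
  events = []
  count = 0
  run_job(
    [-> { count += 100; events << :pre }],
    [1, 2, 3],
    ->(row) { count += 1; row },
    [-> { events << :post }]
  )
  [count, events]
end

# A run where the transform raises on the first row.
def failing_run
  events = []
  begin
    run_job(
      [-> { events << :pre }],
      [1],
      ->(_row) { fail 'FAIL' },
      [-> { events << :post }]
    )
  rescue RuntimeError => e
    events << e.message
  end
  events
end
```

In this sketch `successful_run` yields a count of 103 (100 from the pre-processor plus one per row), while `failing_run` never records `:post` because the error interrupts the run before the post-processors are reached.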
data/Rakefile
CHANGED
data/bin/kiba
CHANGED
data/kiba.gemspec
CHANGED
@@ -2,19 +2,19 @@
 require File.expand_path('../lib/kiba/version', __FILE__)
 
 Gem::Specification.new do |gem|
-  gem.authors = [
-  gem.email = [
-  gem.description = gem.summary =
-  gem.homepage =
-  gem.license =
+  gem.authors = ['Thibaut Barrère']
+  gem.email = ['thibaut.barrere@gmail.com']
+  gem.description = gem.summary = 'Lightweight ETL for Ruby'
+  gem.homepage = 'http://thbar.github.io/kiba/'
+  gem.license = 'LGPL-3.0'
   gem.files = `git ls-files | grep -Ev '^(examples)'`.split("\n")
   gem.test_files = `git ls-files -- test/*`.split("\n")
-  gem.name =
-  gem.require_paths = [
+  gem.name = 'kiba'
+  gem.require_paths = ['lib']
   gem.version = Kiba::VERSION
   gem.executables = ['kiba']
-
+
   gem.add_development_dependency 'rake'
   gem.add_development_dependency 'minitest'
   gem.add_development_dependency 'awesome_print'
-end
+end
data/lib/kiba/cli.rb
CHANGED
@@ -4,8 +4,8 @@ module Kiba
   class Cli
     def self.run(args)
       unless args.size == 1
-        puts
-        exit
+        puts 'Syntax: kiba your-script.etl'
+        exit(-1)
       end
       filename = args[0]
       script_content = IO.read(filename)
@@ -13,4 +13,4 @@ module Kiba
       Kiba.run(job_definition)
     end
   end
-end
+end
data/lib/kiba/context.rb
CHANGED
@@ -5,24 +5,28 @@ module Kiba
       @control = control
     end
 
+    def pre_process(&block)
+      @control.pre_processes << block
+    end
+
     def source(klass, *initialization_params)
-      @control.sources << {klass: klass, args: initialization_params}
+      @control.sources << { klass: klass, args: initialization_params }
     end
 
     def transform(klass = nil, *initialization_params, &block)
       if klass
-        @control.transforms << {klass: klass, args: initialization_params}
+        @control.transforms << { klass: klass, args: initialization_params }
       else
         @control.transforms << block
       end
     end
 
     def destination(klass, *initialization_params)
-      @control.destinations << {klass: klass, args: initialization_params}
+      @control.destinations << { klass: klass, args: initialization_params }
     end
 
     def post_process(&block)
       @control.post_processes << block
     end
   end
-end
+end
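The context methods above only record definitions; nothing is executed while the job is being declared. A minimal sketch of that registration pattern follows. It rests on assumptions: `SketchControl`, `SketchContext`, `sketch_parse` and `sketch_demo` are hypothetical names, the control object is modelled as lazily initialised arrays, and evaluating the DSL block with `instance_eval` is a guess at the general technique, since neither `control.rb` nor `parser.rb` is shown in this diff.

```ruby
# Hypothetical stand-ins; names are illustrative, not Kiba's classes.
class SketchControl
  def pre_processes
    @pre_processes ||= []
  end

  def sources
    @sources ||= []
  end
end

class SketchContext
  def initialize(control)
    @control = control
  end

  # Block form: the block itself is stored for later execution.
  def pre_process(&block)
    @control.pre_processes << block
  end

  # Class form: the class and its constructor arguments are stored.
  def source(klass, *initialization_params)
    @control.sources << { klass: klass, args: initialization_params }
  end
end

# Assumption: the DSL block is evaluated against the context object,
# so bare calls such as `source ...` reach SketchContext's methods.
def sketch_parse(&block)
  control = SketchControl.new
  SketchContext.new(control).instance_eval(&block)
  control
end

# Declares a tiny job and returns the control plus a log that the
# pre-processor would append to when it is eventually called.
def sketch_demo
  log = []
  control = sketch_parse do
    pre_process { log << :ran }
    source Array, 3
  end
  [control, log]
end
```

After `sketch_demo` returns, the log is still empty: the pre-processor block has been registered but not called, which is the separation between parsing time and run time that the README describes.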
data/lib/kiba/control.rb
CHANGED
data/lib/kiba/parser.rb
CHANGED
data/lib/kiba/runner.rb
CHANGED
@@ -1,15 +1,25 @@
 module Kiba
   module Runner
     def run(control)
+      # instantiate early so that errors are raised before any processing occurs
+      pre_processes = to_instances(control.pre_processes, true, false)
       sources = to_instances(control.sources)
       destinations = to_instances(control.destinations)
       transforms = to_instances(control.transforms, true)
-      # not using keyword args because JRuby defaults to 1.9 syntax currently
       post_processes = to_instances(control.post_processes, true, false)
 
+      pre_processes.each(&:call)
+      process_rows(sources, transforms, destinations)
+      destinations.each(&:close)
+      post_processes.each(&:call)
+    end
+
+    def process_rows(sources, transforms, destinations)
       sources.each do |source|
         source.each do |row|
-          transforms.
+          transforms.each do |transform|
+            # TODO: avoid the case completely by e.g. subclassing Proc
+            # and aliasing `process` to `call`. Benchmark needed first though.
             if transform.is_a?(Proc)
               row = transform.call(row)
             else
@@ -23,22 +33,20 @@ module Kiba
           end
         end
       end
-
-      destinations.each(&:close)
-      post_processes.each(&:call)
     end
 
+    # not using keyword args because JRuby defaults to 1.9 syntax currently
     def to_instances(definitions, allow_block = false, allow_class = true)
       definitions.map do |d|
         case d
         when Proc
-
+          fail 'Block form is not allowed here' unless allow_block
           d
         else
-
+          fail 'Class form is not allowed here' unless allow_class
           d[:klass].new(*d[:args])
         end
       end
     end
   end
-end
+end
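The `to_instances` helper in the runner diff above decides, per kind of definition, whether the block form and the class form are acceptable. The sketch below copies that method into a standalone script to show the effect of the two flags. `SketchSource` and `error_message_for` are hypothetical names added for the illustration and are not part of Kiba.

```ruby
# Standalone copy of the dispatch logic shown in the runner diff.
def to_instances(definitions, allow_block = false, allow_class = true)
  definitions.map do |d|
    case d
    when Proc
      fail 'Block form is not allowed here' unless allow_block
      d
    else
      fail 'Class form is not allowed here' unless allow_class
      d[:klass].new(*d[:args])
    end
  end
end

# Hypothetical source class used only for this example.
class SketchSource
  attr_reader :path

  def initialize(path)
    @path = path
  end
end

# Hypothetical helper: returns the RuntimeError message raised by the
# given block, or nil when the block completes normally.
def error_message_for
  yield
  nil
rescue RuntimeError => e
  e.message
end

# Default flags (sources, destinations): class form only.
def sketch_source_instance
  to_instances([{ klass: SketchSource, args: ['input.csv'] }]).first
end

# Flags used for pre-processors and post-processors: block form only.
def sketch_block_only(definitions)
  to_instances(definitions, true, false)
end
```

With the default flags a block definition is rejected, and with `(true, false)` a class definition is rejected, which matches how the runner treats sources versus pre-processors and post-processors.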
data/lib/kiba/version.rb
CHANGED
@@ -1,3 +1,3 @@
 module Kiba
-  VERSION =
-end
+  VERSION = '0.6.0'
+end
data/test/helper.rb
CHANGED
data/test/test_cli.rb
CHANGED
@@ -11,7 +11,7 @@ class TestCli < Kiba::Test
       Kiba::Cli.run([fixture('bogus.etl')])
     end
 
-    assert_match
+    assert_match(/uninitialized constant (.*)UnknownThing/, exception.message)
     assert_includes exception.backtrace.to_s, 'test/fixtures/bogus.etl:2:in'
   end
-end
+end
data/test/test_integration.rb
CHANGED
@@ -9,7 +9,8 @@ class TestIntegration < Kiba::Test
   let(:output_file) { 'test/tmp/output.csv' }
   let(:input_file) { 'test/tmp/input.csv' }
 
-  let(:sample_csv_data) do
+  let(:sample_csv_data) do
+    <<CSV
 first_name,last_name,sex
 John,Doe,M
 Mary,Johnson,F
@@ -26,17 +27,17 @@ CSV
   def teardown
     remove_files(input_file, output_file)
   end
-
+
   def test_csv_to_csv
-    # parse the ETL script (this won't run it)
+    # parse the ETL script (this won't run it)
     control = Kiba.parse do
       source TestCsvSource, 'test/tmp/input.csv'
 
       transform do |row|
         row[:sex] = case row[:sex]
-
-
-
+                    when 'M' then 'Male'
+                    when 'F' then 'Female'
+                    else 'Unknown'
                     end
         row # must be returned
       end
@@ -61,28 +62,35 @@ Mary,Johnson,Female
 Cindy,Backgammon,Female
 CSV
   end
-
+
   def test_variable_access
     message = nil
-
+
     control = Kiba.parse do
       source TestEnumerableSource, [1, 2, 3]
-
+
+      # assign a first value at parsing time
       count = 0
 
+      pre_process do
+        # then change it from there (run time)
+        count += 100
+      end
+
       transform do |r|
+        # increase it once per row
         count += 1
         r
       end
-
+
       post_process do
-
+        # and save so we can assert
+        message = "Count is now #{count}"
       end
     end
-
+
     Kiba.run(control)
-
-    assert_equal '
+
+    assert_equal 'Count is now 103', message
   end
-
-end
+end
data/test/test_parser.rb
CHANGED
@@ -10,11 +10,11 @@ class TestParser < Kiba::Test
     control = Kiba.parse do
       source DummyClass, 'has', 'args'
     end
-
+
     assert_equal DummyClass, control.sources[0][:klass]
-    assert_equal
+    assert_equal %w(has args), control.sources[0][:args]
   end
-
+
   def test_block_transform_definition
     control = Kiba.parse do
       transform { |row| row }
@@ -31,34 +31,42 @@ class TestParser < Kiba::Test
     assert_equal TestRenameFieldTransform, control.transforms[0][:klass]
     assert_equal [:last_name, :name], control.transforms[0][:args]
   end
-
+
   def test_destination_definition
     control = Kiba.parse do
       destination DummyClass, 'has', 'args'
     end
-
+
     assert_equal DummyClass, control.destinations[0][:klass]
-    assert_equal
+    assert_equal %w(has args), control.destinations[0][:args]
   end
-
+
   def test_block_post_process_definition
     control = Kiba.parse do
-      post_process {
+      post_process {}
     end
-
+
     assert_instance_of Proc, control.post_processes[0]
   end
 
+  def test_block_pre_process_definition
+    control = Kiba.parse do
+      pre_process {}
+    end
+
+    assert_instance_of Proc, control.pre_processes[0]
+  end
+
   def test_source_as_string_parsing
     control = Kiba.parse <<RUBY
       source DummyClass, 'from', 'file'
 RUBY
-
+
     assert_equal 1, control.sources.size
     assert_equal DummyClass, control.sources[0][:klass]
-    assert_equal
+    assert_equal %w(from file), control.sources[0][:args]
   end
-
+
   def test_source_as_file_doing_require
     IO.write 'test/tmp/etl-common.rb', <<RUBY
   def common_source_declaration
@@ -67,18 +75,18 @@ RUBY
 RUBY
     IO.write 'test/tmp/etl-main.rb', <<RUBY
   require './test/tmp/etl-common.rb'
-
+
   source DummyClass, 'from', 'main'
   common_source_declaration
 RUBY
     control = Kiba.parse IO.read('test/tmp/etl-main.rb')
-
+
     assert_equal 2, control.sources.size
 
-    assert_equal
-    assert_equal
-
+    assert_equal %w(from main), control.sources[0][:args]
+    assert_equal %w(from common), control.sources[1][:args]
+
   ensure
     remove_files('test/tmp/etl-common.rb', 'test/tmp/etl-main.rb')
   end
-end
+end
data/test/test_runner.rb
CHANGED
@@ -3,10 +3,20 @@ require_relative 'helper'
 require_relative 'support/test_enumerable_source'
 
 class TestRunner < Kiba::Test
+  let(:rows) do
+    [
+      { field: 'value' },
+      { field: 'other-value' }
+    ]
+  end
+
   let(:control) do
     control = Kiba::Control.new
     # this will yield a single row for testing
-    control.sources << {
+    control.sources << {
+      klass: TestEnumerableSource,
+      args: [rows]
+    }
     control
   end
 
@@ -18,23 +28,32 @@ class TestRunner < Kiba::Test
   end
 
   def test_dismissed_row_not_passed_to_next_transform
-    control.transforms << lambda { |
-    control.transforms << lambda { |
+    control.transforms << lambda { |_| nil }
+    control.transforms << lambda { |_| @called = true; nil }
     Kiba.run(control)
     assert_nil @called
   end
-
-  def
-
+
+  def test_post_process_runs_once
+    assert_equal 2, rows.size
+    @called = 0
+    control.post_processes << lambda { @called += 1 }
     Kiba.run(control)
-    assert_equal
+    assert_equal 1, @called
   end
-
+
   def test_post_process_not_called_after_row_failure
-    control.transforms << lambda { |
+    control.transforms << lambda { |_| fail 'FAIL' }
     control.post_processes << lambda { @called = true }
     assert_raises(RuntimeError, 'FAIL') { Kiba.run(control) }
     assert_nil @called
   end
-
-
+
+  def test_pre_process_runs_once
+    assert_equal 2, rows.size
+    @called = 0
+    control.pre_processes << lambda { @called += 1 }
+    Kiba.run(control)
+    assert_equal 1, @called
+  end
+end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kiba
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.6.0
 platform: ruby
 authors:
 - Thibaut Barrère
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-
+date: 2015-05-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake