kiba 0.5.0 → 0.6.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 502470fc246c67daaa681ca78fb5337899cca7fa
- data.tar.gz: a125ff166156c79e5a0b0d67bf9dfb980b7e0dba
+ metadata.gz: faa6cbb049d4b35cdd62c647c1510bc9d296cbb5
+ data.tar.gz: c6d522da427b0b2388771279b146fd48ac117605
  SHA512:
- metadata.gz: ccefac21a401ca860d34c89fdda2473e5a30b51d61223fc8cced50165786f41f328014144bd31486522db34c4e801190060d250cad20408745c691ca937ea1ea
- data.tar.gz: 6c0bee993d99fdec14504e6811549af4dce40cd8930be951a142f8793da69956283ed7ccf6acececa4a6a108e9e9f424d2190b6ba7d9e45f55207b3ee240418d
+ metadata.gz: b26ef4488c4aa78c86f99fb001565260dacae1ef529f2c7b8c37533ef179e39c57dfc8c6e35fa1f31acf8e9e4ab6d22b418d90316317a6f4db898c3a93c22108
+ data.tar.gz: f4d7cc78c3ccdb04fc3b98310ac2f4f9163aa60112481b81f22119f4864c6485117087718ec3512559ac668f62d2f8bcc7b54275999303941d9373225851b620
data/Changes.md CHANGED
@@ -1,3 +1,11 @@
+ Unreleased
+ ----------
+
+ 0.6.0
+ -----
+
+ - Add `pre_process` block support
+
  0.5.0
  -----
 
data/Gemfile CHANGED
@@ -1,3 +1,3 @@
  source 'https://rubygems.org'
 
- gemspec
+ gemspec
data/README.md CHANGED
@@ -1,9 +1,13 @@
  Writing reliable, concise, well-tested & maintainable data-processing code is tricky.
 
- Kiba lets you define and run such high-quality ETL jobs, using Ruby.
+ Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs, using Ruby (see [supported versions](#supported-ruby-versions)).
 
- **Note: this is EARLY WORK - the API/syntax may change at any time.**
+ Learn more on the [Kiba blog](http://thibautbarrere.com):
 
+ * [Rubyists - are you doing ETL unknowningly?](http://thibautbarrere.com/2015/03/25/rubyists-are-you-doing-etl-unknowingly/)
+ * [How to write solid data processing code](http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code/)
+
+ [![Gem Version](https://badge.fury.io/rb/kiba.svg)](http://badge.fury.io/rb/kiba)
  [![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba) [![Dependency Status](https://gemnasium.com/thbar/kiba.svg)](https://gemnasium.com/thbar/kiba)
 
  ## How do you define ETL jobs with Kiba?
@@ -20,6 +24,11 @@ end
  # eg: commonly used sources / destinations / transforms, under unit-test
  require_relative 'common'
 
+ # declare a pre-processor: a block called before the first row is read
+ pre_process do
+ # do something
+ end
+
  # declare a source where to take data from (you implement it - see notes below)
  source MyCsvSource, 'input.csv'
 
@@ -50,7 +59,7 @@ post_process do
  end
  ```
 
- The combination of sources, transforms, destinations and post-processors defines the data processing pipeline.
+ The combination of pre-processors, sources, transforms, destinations and post-processors defines the data processing pipeline.
 
  Note: you are advised to store your ETL definitions as files with the extension `.etl` (rather than `.rb`). This will make sure you do not end up loading them by mistake from another component (eg: a Rails app).
 
@@ -74,7 +83,7 @@ Kiba.run(job_definition)
  `Kiba.parse` evaluates your ETL Ruby code to register sources, transforms, destinations and post-processors in a job definition. It is important to understand that you can use Ruby logic at the DSL parsing time. This means that such code is possible, provided the CSV files are available at parsing time:
 
  ```ruby
- Dir['to_be_processed/*.csv'].each do |f|
+ Dir['to_be_processed/*.csv'].each do |file|
  source MyCsvSource, file
  end
  ```
@@ -191,14 +200,32 @@ class MyCsvDestination
  end
  ```
 
- ## Implementing post-processors
+ ## Implementing pre and post-processors
 
- Post-processors are currently blocks, which get called once, after the ETL
- successfully processed all the rows. It won't get called if an error occurred.
+ Pre-processors and post-processors are currently blocks, which get called only once per ETL run:
+ - Pre-processors get called before the ETL starts reading rows from the sources.
+ - Post-processors get invoked after the ETL successfully processed all the rows.
+
+ Note that post-processors won't get called if an error occurred earlier.
 
  ```ruby
  count = 0
 
+ def system!(cmd)
+ fail "Command #{cmd} failed" unless system(cmd)
+ end
+
+ file = 'my_file.csv'
+ sample_file = 'my_file.sample.csv'
+
+ pre_process do
+ # it's handy to work with a reduced data set. you can
+ # e.g. just keep one line of the CSV files + the headers
+ system! "sed -n \"1p;25706p\" #{file} > #{sample_file}"
+ end
+
+ source MyCsv, file: sample_file
+
  transform do |row|
  count += 1
  row
@@ -225,6 +252,10 @@ The ability to support that DSL, but also check command line arguments, environm
 
  Make sure to subscribe to my [Ruby ETL blog](http://thibautbarrere.com) where I'll demonstrate such techniques over time!
 
+ ## Supported Ruby versions
+
+ Kiba currently supports Ruby 2.0+ and JRuby (with its default 1.9 syntax).
+
  ## History & Credits
 
  Wow, you're still there? Nice to meet you. I'm [Thibaut](http://thibautbarrere.com), author of Kiba.
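Reviewer sketch (not part of the 0.6.0 diff): the README's new `pre_process` example shells out to `sed` to keep the header line plus one data line. Where `sed` is not available, the same sampling could be done in plain Ruby. `write_sample` is a hypothetical helper name introduced here for illustration only; it is not a Kiba API.

```ruby
# Hypothetical helper, not part of Kiba: copy the header line (line 1) plus
# the requested 1-based line numbers from `input` into `output`.
def write_sample(input, output, line_numbers)
  wanted = [1, *line_numbers].uniq
  File.open(output, 'w') do |out|
    # File.foreach without a block returns an enumerator, so the file is
    # streamed line by line instead of being loaded in memory.
    File.foreach(input).with_index(1) do |line, number|
      out.write(line) if wanted.include?(number)
    end
  end
end

# Possible use inside an ETL definition (assumes the file names from the README example):
#   pre_process { write_sample('my_file.csv', 'my_file.sample.csv', [25_706]) }
```

This scans the whole input once, which is acceptable for a one-off pre-processing step; it makes no attempt to handle CSV fields containing embedded newlines.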
data/Rakefile CHANGED
@@ -4,4 +4,4 @@ Rake::TestTask.new(:test) do |t|
  t.pattern = 'test/test_*.rb'
  end
 
- task :default => :test
+ task default: :test
data/bin/kiba CHANGED
@@ -2,4 +2,4 @@
 
  require_relative '../lib/kiba/cli'
 
- Kiba::Cli.run(ARGV)
+ Kiba::Cli.run(ARGV)
@@ -2,19 +2,19 @@
  require File.expand_path('../lib/kiba/version', __FILE__)
 
  Gem::Specification.new do |gem|
- gem.authors = ["Thibaut Barrère"]
- gem.email = ["thibaut.barrere@gmail.com"]
- gem.description = gem.summary = "Lightweight ETL for Ruby"
- gem.homepage = "http://thbar.github.io/kiba/"
- gem.license = "LGPL-3.0"
+ gem.authors = ['Thibaut Barrère']
+ gem.email = ['thibaut.barrere@gmail.com']
+ gem.description = gem.summary = 'Lightweight ETL for Ruby'
+ gem.homepage = 'http://thbar.github.io/kiba/'
+ gem.license = 'LGPL-3.0'
  gem.files = `git ls-files | grep -Ev '^(examples)'`.split("\n")
  gem.test_files = `git ls-files -- test/*`.split("\n")
- gem.name = "kiba"
- gem.require_paths = ["lib"]
+ gem.name = 'kiba'
+ gem.require_paths = ['lib']
  gem.version = Kiba::VERSION
  gem.executables = ['kiba']
-
+
  gem.add_development_dependency 'rake'
  gem.add_development_dependency 'minitest'
  gem.add_development_dependency 'awesome_print'
- end
+ end
@@ -4,8 +4,8 @@ module Kiba
  class Cli
  def self.run(args)
  unless args.size == 1
- puts "Syntax: kiba your-script.etl"
- exit -1
+ puts 'Syntax: kiba your-script.etl'
+ exit(-1)
  end
  filename = args[0]
  script_content = IO.read(filename)
@@ -13,4 +13,4 @@ module Kiba
  Kiba.run(job_definition)
  end
  end
- end
+ end
@@ -5,24 +5,28 @@ module Kiba
  @control = control
  end
 
+ def pre_process(&block)
+ @control.pre_processes << block
+ end
+
  def source(klass, *initialization_params)
- @control.sources << {klass: klass, args: initialization_params}
+ @control.sources << { klass: klass, args: initialization_params }
  end
 
  def transform(klass = nil, *initialization_params, &block)
  if klass
- @control.transforms << {klass: klass, args: initialization_params}
+ @control.transforms << { klass: klass, args: initialization_params }
  else
  @control.transforms << block
  end
  end
 
  def destination(klass, *initialization_params)
- @control.destinations << {klass: klass, args: initialization_params}
+ @control.destinations << { klass: klass, args: initialization_params }
  end
 
  def post_process(&block)
  @control.post_processes << block
  end
  end
- end
+ end
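Reviewer sketch (not part of the 0.6.0 diff): the hunk above shows each DSL keyword appending either a block or a `{ klass:, args: }` hash to the control object, so nothing is instantiated or executed at parse time. The standalone model below illustrates that idea. `MiniControl`, `MiniContext` and `mini_parse` are hypothetical names, and evaluating the block with `instance_eval` is an assumption, since the parser body does not appear in this diff.

```ruby
# Hypothetical stand-in for the control object: plain lazily-created arrays.
class MiniControl
  def pre_processes
    @pre_processes ||= []
  end

  def sources
    @sources ||= []
  end
end

# Hypothetical stand-in for the DSL context: each keyword only records a definition.
class MiniContext
  def initialize(control)
    @control = control
  end

  def pre_process(&block)
    @control.pre_processes << block
  end

  def source(klass, *initialization_params)
    @control.sources << { klass: klass, args: initialization_params }
  end
end

# Assumption: the DSL block runs with the context as `self`.
def mini_parse(&block)
  control = MiniControl.new
  MiniContext.new(control).instance_eval(&block)
  control
end

control = mini_parse do
  pre_process { puts 'starting' } # recorded, not called
  source Array, 3                 # recorded as { klass: Array, args: [3] }
end
```

After `mini_parse` returns, `control.pre_processes` holds one `Proc` and `control.sources` holds one hash; running them is left to a separate step.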
@@ -1,5 +1,9 @@
  module Kiba
  class Control
+ def pre_processes
+ @pre_processes ||= []
+ end
+
  def sources
  @sources ||= []
  end
@@ -16,4 +20,4 @@ module Kiba
  @post_processes ||= []
  end
  end
- end
+ end
@@ -12,4 +12,4 @@ module Kiba
  control
  end
  end
- end
+ end
@@ -1,15 +1,25 @@
  module Kiba
  module Runner
  def run(control)
+ # instantiate early so that error are raised before any processing occurs
+ pre_processes = to_instances(control.pre_processes, true, false)
  sources = to_instances(control.sources)
  destinations = to_instances(control.destinations)
  transforms = to_instances(control.transforms, true)
- # not using keyword args because JRuby defaults to 1.9 syntax currently
  post_processes = to_instances(control.post_processes, true, false)
 
+ pre_processes.each(&:call)
+ process_rows(sources, transforms, destinations)
+ destinations.each(&:close)
+ post_processes.each(&:call)
+ end
+
+ def process_rows(sources, transforms, destinations)
  sources.each do |source|
  source.each do |row|
- transforms.each_with_index do |transform, index|
+ transforms.each do |transform|
+ # TODO: avoid the case completely by e.g. subclassing Proc
+ # and aliasing `process` to `call`. Benchmark needed first though.
  if transform.is_a?(Proc)
  row = transform.call(row)
  else
@@ -23,22 +33,20 @@ module Kiba
  end
  end
  end
-
- destinations.each(&:close)
- post_processes.each(&:call)
  end
 
+ # not using keyword args because JRuby defaults to 1.9 syntax currently
  def to_instances(definitions, allow_block = false, allow_class = true)
  definitions.map do |d|
  case d
  when Proc
- raise "Block form is not allowed here" unless allow_block
+ fail 'Block form is not allowed here' unless allow_block
  d
  else
- raise "Class form is not allowed here" unless allow_class
+ fail 'Class form is not allowed here' unless allow_class
  d[:klass].new(*d[:args])
  end
  end
  end
  end
- end
+ end
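Reviewer sketch (not part of the 0.6.0 diff): the runner hunks above establish the order pre-processors, row processing, destination close, post-processors. The standalone model below mimics that order without depending on the gem. `MemoryDestination` and `mini_run` are hypothetical names, and dismissing a row when a transform returns `nil` is an assumption inferred from `test_dismissed_row_not_passed_to_next_transform`, because those runner lines fall outside the hunks shown.

```ruby
# Hypothetical destination: collects rows in memory and records when it is closed.
class MemoryDestination
  attr_reader :rows, :closed

  def initialize
    @rows = []
    @closed = false
  end

  def write(row)
    @rows << row
  end

  def close
    @closed = true
  end
end

# Standalone model of the run order (not Kiba's code). Transforms are lambdas here.
def mini_run(pre_processes, rows, transforms, destinations, post_processes)
  pre_processes.each(&:call)
  rows.each do |row|
    transforms.each do |transform|
      row = transform.call(row)
      break unless row # assumption: a nil row is dismissed
    end
    next unless row
    destinations.each { |destination| destination.write(row) }
  end
  destinations.each(&:close)
  # Only reached when no exception was raised while processing rows.
  post_processes.each(&:call)
end
```

Because nothing rescues exceptions inside `mini_run`, an error raised by a transform propagates to the caller before the post-processors are reached, which matches the behaviour asserted by `test_post_process_not_called_after_row_failure`.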
@@ -1,3 +1,3 @@
  module Kiba
- VERSION = "0.5.0"
- end
+ VERSION = '0.6.0'
+ end
@@ -7,7 +7,7 @@ class Kiba::Test < Minitest::Test
 
  def remove_files(*files)
  files.each do |file|
- File.delete(file) if File.exists?(file)
+ File.delete(file) if File.exist?(file)
  end
  end
 
@@ -8,4 +8,4 @@ class TestEnumerableSource
  yield row
  end
  end
- end
+ end
@@ -11,7 +11,7 @@ class TestCli < Kiba::Test
  Kiba::Cli.run([fixture('bogus.etl')])
  end
 
- assert_match /uninitialized constant (.*)UnknownThing/, exception.message
+ assert_match(/uninitialized constant (.*)UnknownThing/, exception.message)
  assert_includes exception.backtrace.to_s, 'test/fixtures/bogus.etl:2:in'
  end
- end
+ end
@@ -9,7 +9,8 @@ class TestIntegration < Kiba::Test
  let(:output_file) { 'test/tmp/output.csv' }
  let(:input_file) { 'test/tmp/input.csv' }
 
- let(:sample_csv_data) do <<CSV
+ let(:sample_csv_data) do
+ <<CSV
  first_name,last_name,sex
  John,Doe,M
  Mary,Johnson,F
@@ -26,17 +27,17 @@ CSV
  def teardown
  remove_files(input_file, output_file)
  end
-
+
  def test_csv_to_csv
- # parse the ETL script (this won't run it)
+ # parse the ETL script (this won't run it)
  control = Kiba.parse do
  source TestCsvSource, 'test/tmp/input.csv'
 
  transform do |row|
  row[:sex] = case row[:sex]
- when 'M'; 'Male'
- when 'F'; 'Female'
- else 'Unknown'
+ when 'M' then 'Male'
+ when 'F' then 'Female'
+ else 'Unknown'
  end
  row # must be returned
  end
@@ -61,28 +62,35 @@ Mary,Johnson,Female
  Cindy,Backgammon,Female
  CSV
  end
-
+
  def test_variable_access
  message = nil
-
+
  control = Kiba.parse do
  source TestEnumerableSource, [1, 2, 3]
-
+
+ # assign a first value at parsing time
  count = 0
 
+ pre_process do
+ # then change it from there (run time)
+ count += 100
+ end
+
  transform do |r|
+ # increase it once per row
  count += 1
  r
  end
-
+
  post_process do
- message = "#{count} rows processed"
+ # and save so we can assert
+ message = "Count is now #{count}"
  end
  end
-
+
  Kiba.run(control)
-
- assert_equal '3 rows processed', message
+
+ assert_equal 'Count is now 103', message
  end
-
- end
+ end
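Reviewer sketch (not part of the 0.6.0 diff): `test_variable_access` above relies on Ruby closures. `count` is assigned while the definition is parsed, and the pre-process, transform and post-process blocks all capture that same local variable, so changes made at run time are visible across them. The plain-Ruby version below reproduces the arithmetic without Kiba; the lambda names are illustrative only.

```ruby
# Assigned once, at "parse time".
count = 0
message = nil

# Each lambda closes over the same `count` local, as the DSL blocks do in the test.
pre_process  = -> { count += 100 }
transform    = lambda do |row|
  count += 1
  row
end
post_process = -> { message = "Count is now #{count}" }

# "Run time": one pre-process call, one transform call per row, one post-process call.
pre_process.call
[1, 2, 3].each { |row| transform.call(row) }
post_process.call

puts message # prints "Count is now 103" (0 + 100 + 3 rows)
```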
@@ -10,11 +10,11 @@ class TestParser < Kiba::Test
  control = Kiba.parse do
  source DummyClass, 'has', 'args'
  end
-
+
  assert_equal DummyClass, control.sources[0][:klass]
- assert_equal ['has', 'args'], control.sources[0][:args]
+ assert_equal %w(has args), control.sources[0][:args]
  end
-
+
  def test_block_transform_definition
  control = Kiba.parse do
  transform { |row| row }
@@ -31,34 +31,42 @@ class TestParser < Kiba::Test
  assert_equal TestRenameFieldTransform, control.transforms[0][:klass]
  assert_equal [:last_name, :name], control.transforms[0][:args]
  end
-
+
  def test_destination_definition
  control = Kiba.parse do
  destination DummyClass, 'has', 'args'
  end
-
+
  assert_equal DummyClass, control.destinations[0][:klass]
- assert_equal ['has', 'args'], control.destinations[0][:args]
+ assert_equal %w(has args), control.destinations[0][:args]
  end
-
+
  def test_block_post_process_definition
  control = Kiba.parse do
- post_process { }
+ post_process {}
  end
-
+
  assert_instance_of Proc, control.post_processes[0]
  end
 
+ def test_block_pre_process_definition
+ control = Kiba.parse do
+ pre_process {}
+ end
+
+ assert_instance_of Proc, control.pre_processes[0]
+ end
+
  def test_source_as_string_parsing
  control = Kiba.parse <<RUBY
  source DummyClass, 'from', 'file'
  RUBY
-
+
  assert_equal 1, control.sources.size
  assert_equal DummyClass, control.sources[0][:klass]
- assert_equal ['from', 'file'], control.sources[0][:args]
+ assert_equal %w(from file), control.sources[0][:args]
  end
-
+
  def test_source_as_file_doing_require
  IO.write 'test/tmp/etl-common.rb', <<RUBY
  def common_source_declaration
@@ -67,18 +75,18 @@ RUBY
  RUBY
  IO.write 'test/tmp/etl-main.rb', <<RUBY
  require './test/tmp/etl-common.rb'
-
+
  source DummyClass, 'from', 'main'
  common_source_declaration
  RUBY
  control = Kiba.parse IO.read('test/tmp/etl-main.rb')
-
+
  assert_equal 2, control.sources.size
 
- assert_equal ['from', 'main'], control.sources[0][:args]
- assert_equal ['from', 'common'], control.sources[1][:args]
-
+ assert_equal %w(from main), control.sources[0][:args]
+ assert_equal %w(from common), control.sources[1][:args]
+
  ensure
  remove_files('test/tmp/etl-common.rb', 'test/tmp/etl-main.rb')
  end
- end
+ end
@@ -3,10 +3,20 @@ require_relative 'helper'
  require_relative 'support/test_enumerable_source'
 
  class TestRunner < Kiba::Test
+ let(:rows) do
+ [
+ { field: 'value' },
+ { field: 'other-value' }
+ ]
+ end
+
  let(:control) do
  control = Kiba::Control.new
  # this will yield a single row for testing
- control.sources << {klass: TestEnumerableSource, args: [[{field: 'value'}]]}
+ control.sources << {
+ klass: TestEnumerableSource,
+ args: [rows]
+ }
  control
  end
 
@@ -18,23 +28,32 @@ class TestRunner < Kiba::Test
  end
 
  def test_dismissed_row_not_passed_to_next_transform
- control.transforms << lambda { |r| nil }
- control.transforms << lambda { |r| @called = true; nil}
+ control.transforms << lambda { |_| nil }
+ control.transforms << lambda { |_| @called = true; nil }
  Kiba.run(control)
  assert_nil @called
  end
-
- def test_post_process_runs
- control.post_processes << lambda { @called = true }
+
+ def test_post_process_runs_once
+ assert_equal 2, rows.size
+ @called = 0
+ control.post_processes << lambda { @called += 1 }
  Kiba.run(control)
- assert_equal true, @called
+ assert_equal 1, @called
  end
-
+
  def test_post_process_not_called_after_row_failure
- control.transforms << lambda { |r| raise 'FAIL' }
+ control.transforms << lambda { |_| fail 'FAIL' }
  control.post_processes << lambda { @called = true }
  assert_raises(RuntimeError, 'FAIL') { Kiba.run(control) }
  assert_nil @called
  end
-
- end
+
+ def test_pre_process_runs_once
+ assert_equal 2, rows.size
+ @called = 0
+ control.pre_processes << lambda { @called += 1 }
+ Kiba.run(control)
+ assert_equal 1, @called
+ end
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: kiba
  version: !ruby/object:Gem::Version
- version: 0.5.0
+ version: 0.6.0
  platform: ruby
  authors:
  - Thibaut Barrère
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-04-18 00:00:00.000000000 Z
+ date: 2015-05-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: rake