kiba 2.0.0.rc1 → 3.6.0

data/Pro-Changes.md CHANGED
@@ -1,13 +1,90 @@
1
1
  Kiba Pro Changelog
2
2
  ==================
3
3
 
4
- Kiba Pro is the commercial extension for Kiba. Documentation is available on the [Wiki](https://github.com/thbar/kiba/wiki).
4
+ Kiba Pro provides vendor-supported ETL extensions for Kiba. Your subscription funds the Open-Source development, thanks for considering it!
5
5
 
6
- HEAD
7
- -------
6
+ Learn more on the [Kiba website](https://www.kiba-etl.org/kiba-pro).
8
7
 
9
- 1.0.0.rc1
10
- ---------
8
+ Documentation is available on the [Wiki](https://github.com/thbar/kiba/wiki#kiba-pro).
9
+
10
+ 2.0.0
11
+ -----
12
+
13
+ - New: `SQLBulkLookup` transform lets you efficiently look up values in SQL tables. This is particularly useful in data warehouse scenarios (to replace unique business keys with surrogate keys), or when writing migrations of SQL databases. Instead of looking up each row individually, it works on large batches of rows, which avoids an "N+1"-like effect.
14
+ - New: `ParallelTransform` provides an easy way to process a group of ETL rows at the same time using a pool of threads. It can be used to accelerate ETL transforms doing IO operations such as HTTP queries, by going multithreaded.
15
+ - New: `FileLock` adds an easy way to avoid overlapping runs in ETL Jobs using a local file lock.
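
The Kiba Pro `FileLock` API itself is not shown in this diff. Purely as an illustration of the underlying idea (a local file lock that prevents overlapping runs), here is a minimal sketch using Ruby's standard `File#flock`. The helper name `with_file_lock` and the lock path are hypothetical and are not part of Kiba Pro.

```ruby
# Hypothetical helper (not Kiba Pro's API): run the block only if an
# exclusive, non-blocking lock can be taken on the given file.
# Returns true when the block ran, false when another run holds the lock.
def with_file_lock(path)
  File.open(path, File::RDWR | File::CREAT, 0o644) do |file|
    # LOCK_NB makes flock return false instead of waiting for the lock
    return false unless file.flock(File::LOCK_EX | File::LOCK_NB)

    begin
      yield
      true
    ensure
      file.flock(File::LOCK_UN)
    end
  end
end

# Usage (the path is an assumption for the example):
# with_file_lock("/tmp/my_etl_job.lock") { run_the_job } or warn("Job already running")
```

A non-blocking lock is used so that a second run exits immediately rather than queueing behind the first one.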
16
+
17
+ 1.5.0
18
+ -----
19
+
20
+ - Compatibility with Kiba v3
21
+ - BREAKING CHANGE: deprecate non-live Sequel connection passing (https://github.com/thbar/kiba/issues/79). Do not use `database: "connection_string"`; instead, pass your `Sequel` connection directly. This moves connection management out of the destination, which is a better pattern and provides better (block-based) resource closing.
22
+ - Official MySQL support:
23
+ - While the compatibility was already here, it is now tested for in our QA testing suite.
24
+ - MySQL 5.5-8.0 is supported & tested
25
+ - MariaDB should be supported (although not tested against in the QA testing suite)
26
+ - Amazon Aurora MySQL is also supposed to work (although not tested)
27
+ - `Kiba::Pro::Sources::SQL` supports both non-streaming and streaming use
28
+ - `Kiba::Pro::Destinations::SQLBulkInsert` supports:
29
+ - Bulk insert
30
+ - Bulk insert with ignore
31
+ - Bulk upsert (including with dynamically computed columns) via `ON DUPLICATE KEY UPDATE`
32
+ - Note that the `Kiba::Pro::Destinations::SQLUpsert` (row-by-row) is not MySQL compatible at the moment
33
+
34
+ 1.2.0
35
+ -----
36
+
37
+ - `SQL` source improvements:
38
+ - Deprecate use_cursor in favor of block query construct. The source could previously be configured with:
39
+
40
+ ```ruby
41
+ source Kiba::Pro::Sources::SQL,
42
+ query: "SELECT * FROM items",
43
+ use_cursor: true
44
+ ```
45
+
46
+ The `use_cursor` keyword is now deprecated. You can use the more powerful block query construct:
47
+
48
+ ```ruby
49
+ source Kiba::Pro::Sources::SQL,
50
+ query: -> (db) { db["SELECT * FROM items"].use_cursor },
51
+ ```
52
+
53
+ - Avoid bogus nested SQL calls when configuring the query via block/proc. A call with:
54
+
55
+ ```ruby
56
+ source Kiba::Pro::Sources::SQL,
57
+ query: -> (db) { db["SELECT * FROM items"] },
58
+ ```
59
+
60
+ would have previously generated a `SELECT * FROM (SELECT * FROM "items")`. This is now fixed.
61
+
62
+ - Add specs around streaming support (for both MySQL and Postgres).
63
+
64
+ For Postgres, streaming was [recommended by the author of Sequel](https://groups.google.com/d/msg/sequel-talk/olznPcmEf8M/hd5Ris0pYNwJ) over `use_cursor: true` (but do compare on your actual cases!). To enable streaming for Postgres:
65
+ - Add `sequel_pg` to your `Gemfile`
66
+ - Enable the extension in your `db` instance & add `.stream` to your dataset e.g.:
67
+
68
+ ```ruby
69
+ Sequel.connect(ENV.fetch('DATABASE_URL')) do |db|
70
+ db.extension(:pg_streaming)
71
+ Kiba.run(Kiba.parse do
72
+ source Kiba::Pro::Sources::SQL,
73
+ db: db,
74
+ query: -> (db) { db[:items].stream }
75
+ # SNIP
76
+ end)
77
+ ```
78
+
79
+ For MySQL, just add `.stream` to your dataset like above (no extension required).
80
+
81
+ 1.1.0
82
+ -----
83
+
84
+ - Improvement: `SQLBulkInsert` now supports Postgres `INSERT ON CONFLICT` for batch operations (bulk upsert, conditional upserts, ignore if exist etc) via new `dataset` keyword. See [documentation](https://github.com/thbar/kiba/wiki/SQL-Bulk-Insert-Destination).
85
+
86
+ 1.0.0
87
+ -----
11
88
 
12
89
  NOTE: documentation & requirements/compatibility are available on the [wiki](https://github.com/thbar/kiba/wiki).
13
90
 
data/README.md CHANGED
@@ -1,51 +1,31 @@
1
- **If you need help**, please [ask your question with tag kiba-etl on StackOverflow](http://stackoverflow.com/questions/ask?tags=kiba-etl) so that others can benefit from your contribution! I monitor this specific tag and will reply to you.
2
-
3
- Writing reliable, concise, well-tested & maintainable data-processing code is tricky.
4
-
5
- Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs using Ruby.
6
-
7
- Learn more on the [Wiki](https://github.com/thbar/kiba/wiki), on my [blog](http://thibautbarrere.com) and on [StackOverflow](http://stackoverflow.com/questions/tagged/kiba-etl).
1
+ # Kiba ETL
8
2
 
9
3
  [![Gem Version](https://badge.fury.io/rb/kiba.svg)](http://badge.fury.io/rb/kiba)
10
- [![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Build status](https://ci.appveyor.com/api/projects/status/v05jcyhpp1mueq9i?svg=true)](https://ci.appveyor.com/project/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba) [![Dependency Status](https://gemnasium.com/thbar/kiba.svg)](https://gemnasium.com/thbar/kiba)
4
+ [![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Build status](https://ci.appveyor.com/api/projects/status/v05jcyhpp1mueq9i?svg=true)](https://ci.appveyor.com/project/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba)
11
5
 
12
- ## Note on upcoming Kiba 2.0.0
6
+ Writing reliable, concise, well-tested & maintainable data-processing code is tricky.
13
7
 
14
- Kiba 2.0.0 (available on `master`) includes an improved engine called the `StreamingRunner`, which allows transforms to generate more than one output row for each input row. See [#44](https://github.com/thbar/kiba/pull/44) for documentation on benefits & how to activate.
8
+ Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs using Ruby.
15
9
 
16
10
  ## Getting Started
17
11
 
18
- * [How do you define ETL jobs with Kiba?](https://github.com/thbar/kiba/wiki/How-do-you-define-ETL-jobs-with-Kiba%3F)
19
- * [How do you run your ETL jobs?](https://github.com/thbar/kiba/wiki/How-do-you-run-your-ETL-jobs%3F)
20
- * [Implementing ETL sources](https://github.com/thbar/kiba/wiki/Implementing-ETL-sources).
21
- * [Implementing ETL transforms](https://github.com/thbar/kiba/wiki/Implementing-ETL-transforms).
22
- * [Implementing ETL destinations](https://github.com/thbar/kiba/wiki/Implementing-ETL-destinations).
23
- * [Implementing pre and post-processors](https://github.com/thbar/kiba/wiki/Implementing-pre-and-post-processors).
24
-
25
- ## Useful links
12
+ Head over to the [Wiki](https://github.com/thbar/kiba/wiki) for up-to-date documentation.
26
13
 
27
- * [Live Coding Session - Processing data with Kiba ETL](http://thibautbarrere.com/2015/11/09/video-processing-data-with-kiba-etl/)
28
- * [Rubyists - are you doing ETL unknowingly?](http://thibautbarrere.com/2015/03/25/rubyists-are-you-doing-etl-unknowingly/)
29
- * [How to write solid data processing code](http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code/)
30
- * [How to reformat CSV files with Kiba](http://thibautbarrere.com/2015/06/04/how-to-reformat-csv-files-with-kiba/) (in-depth, hands-on tutorial)
31
- * [How to explode multivalued attributes with Kiba ETL?](http://thibautbarrere.com/2015/06/25/how-to-explode-multivalued-attributes-with-kiba/)
32
- * [Common techniques to compute aggregates with Kiba](https://stackoverflow.com/questions/31145715/how-to-do-a-aggregation-transformation-in-a-kiba-etl-script-kiba-gem)
33
- * [How to run Kiba in a Rails environment?](http://thibautbarrere.com/2015/09/26/how-to-run-kiba-in-a-rails-environment/)
34
- * [How to pass parameters to the Kiba command line?](http://stackoverflow.com/questions/32959692/how-to-pass-parameters-into-your-etl-job)
14
+ **If you need help**, please [ask your question with tag kiba-etl on StackOverflow](http://stackoverflow.com/questions/ask?tags=kiba-etl) so that others can benefit from your contribution! I monitor this specific tag and will reply to you.
35
15
 
36
- ## Supported Ruby versions
16
+ [Kiba Pro](https://www.kiba-etl.org/kiba-pro) customers get priority private email support for any unforeseen issues and simple matters such as installation troubles. Our consulting services will also be prioritized to Kiba Pro subscribers. If you need any coaching on ETL & data pipeline implementation, please [reach out via email](mailto:info@logeek.fr) so we can discuss how to help you out.
37
17
 
38
- Kiba currently supports Ruby 2.0+ and JRuby (with its default 1.9 syntax). See [test matrix](https://travis-ci.org/thbar/kiba).
18
+ You can also check out the [author blog](https://thibautbarrere.com) and [StackOverflow answers](http://stackoverflow.com/questions/tagged/kiba-etl).
39
19
 
40
- ## Kiba Common
20
+ ## Supported Ruby versions
41
21
 
42
- I'm starting to add commonly used reusable helpers in a separate gem called [kiba-common](https://github.com/thbar/kiba-common), check it out (work-in-progress).
22
+ Kiba currently supports Ruby 2.4+, JRuby 9.2+ and TruffleRuby. See [test matrix](https://travis-ci.org/thbar/kiba).
43
23
 
44
24
  ## ETL consulting & commercial version
45
25
 
46
- **Consulting services**: if your organization needs help to implement a data pipeline or to build a data-intensive application, I provide consulting services. [More information](http://thibautbarrere.com/hire-me/).
26
+ **Consulting services**: if your organization needs guidance on Kiba / ETL implementations, we provide consulting services. Contact at [https://www.logeek.fr](https://www.logeek.fr).
47
27
 
48
- **Kiba Pro**: for more features & goodies, check out Kiba Pro ([Changelog & contact info](Pro-Changes.md)).
28
+ **Kiba Pro**: for vendor-backed ETL extensions, check out [Kiba Pro](https://www.kiba-etl.org/kiba-pro).
49
29
 
50
30
  ## License
51
31
 
data/Rakefile CHANGED
@@ -4,4 +4,9 @@ Rake::TestTask.new(:test) do |t|
4
4
  t.pattern = 'test/test_*.rb'
5
5
  end
6
6
 
7
- task default: :test
7
+ # A simple check to verify TruffleRuby installation trick is really in effect
8
+ task :show_ruby_version do
9
+ puts "Running with #{RUBY_DESCRIPTION}"
10
+ end
11
+
12
+ task default: [:show_ruby_version, :test]
data/appveyor.yml CHANGED
@@ -5,10 +5,13 @@ cache:
5
5
 
6
6
  environment:
7
7
  matrix:
8
+ # TODO: add RUBY_VERSION=30 when available (https://www.appveyor.com/updates/)
9
+ - RUBY_VERSION: 27
10
+ - RUBY_VERSION: 26
11
+ - RUBY_VERSION: 25
8
12
  - RUBY_VERSION: 24
9
- - RUBY_VERSION: 23
10
- - RUBY_VERSION: 22
11
- - RUBY_VERSION: 21
13
+ # NOTE: jruby doesn't seem to be supported on default images
14
+ # see https://www.appveyor.com/docs/build-environment/#ruby
12
15
 
13
16
  install:
14
17
  - set PATH=C:\Ruby%RUBY_VERSION%\bin;%PATH%
data/bin/kiba CHANGED
@@ -1,5 +1,15 @@
1
1
  #!/usr/bin/env ruby
2
2
 
3
- require_relative '../lib/kiba/cli'
3
+ STDERR.puts <<DOC
4
4
 
5
- Kiba::Cli.run(ARGV)
5
+ ##########################################################################
6
+
7
+ The 'kiba' CLI is deprecated and has been removed in Kiba ETL v3.
8
+
9
+ See release notes / changelog for help.
10
+
11
+ ##########################################################################
12
+
13
+ DOC
14
+
15
+ exit(1)
data/kiba.gemspec CHANGED
@@ -13,6 +13,10 @@ Gem::Specification.new do |gem|
13
13
  gem.require_paths = ['lib']
14
14
  gem.version = Kiba::VERSION
15
15
  gem.executables = ['kiba']
16
+ gem.metadata = {
17
+ 'source_code_uri' => 'https://github.com/thbar/kiba',
18
+ 'documentation_uri' => 'https://github.com/thbar/kiba/wiki',
19
+ }
16
20
 
17
21
  gem.add_development_dependency 'rake'
18
22
  gem.add_development_dependency 'minitest', '~> 5.9'
data/lib/kiba.rb CHANGED
@@ -11,9 +11,15 @@ require 'kiba/dsl_extensions/config'
11
11
  Kiba.extend(Kiba::Parser)
12
12
 
13
13
  module Kiba
14
- def self.run(job)
14
+ def self.run(job = nil, &block)
15
+ unless (job.nil? ^ block.nil?)
16
+ fail ArgumentError.new("Kiba.run takes either one argument (the job) or a block (defining the job)")
17
+ end
18
+
19
+ job ||= Kiba.parse { instance_exec(&block) }
20
+
15
21
  # NOTE: use Hash#dig when Ruby 2.2 reaches EOL
16
- runner = job.config.fetch(:kiba, {}).fetch(:runner, Kiba::Runner)
22
+ runner = job.config.fetch(:kiba, {}).fetch(:runner, Kiba::StreamingRunner)
17
23
  runner.run(job)
18
24
  end
19
25
  end
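
The hunk above lets `Kiba.run` accept either a parsed job or a block, but not both and not neither, by XOR-ing the two `nil?` checks. The following standalone sketch reproduces only that argument check. The method name `run_job` is hypothetical, and calling the block directly stands in for `Kiba.parse`.

```ruby
# Hypothetical stand-in for the argument handling shown in the diff:
# exactly one of `job` or `block` must be provided.
def run_job(job = nil, &block)
  # true ^ false => true; both nil or both present => false
  unless job.nil? ^ block.nil?
    raise ArgumentError, "run_job takes either one argument (the job) or a block (defining the job)"
  end

  # In Kiba the block would be parsed into a job definition; here we
  # simply call it to keep the sketch self-contained.
  job ||= block.call
  job
end

run_job(:prebuilt_job)       # job form
run_job { :job_from_block }  # block form
```

Calling `run_job` with no argument, or with both an argument and a block, raises `ArgumentError`.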
data/lib/kiba/context.rb CHANGED
@@ -23,5 +23,9 @@ module Kiba
23
23
  def post_process(&block)
24
24
  @control.post_processes << { block: block }
25
25
  end
26
+
27
+ [:source, :transform, :destination].each do |m|
28
+ ruby2_keywords(m) if respond_to?(:ruby2_keywords, true)
29
+ end
26
30
  end
27
31
  end
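
The `ruby2_keywords` call added above matters because the DSL methods capture their arguments with a splat and hand them to `klass.new(*args)` later. On Ruby 3, a trailing hash captured that way is no longer treated as keyword arguments unless the capturing method is flagged. A minimal sketch of the mechanism, with hypothetical class names (`MiniContext`, `KeywordComponent`) that are not part of Kiba:

```ruby
# Hypothetical component whose constructor takes keyword arguments.
class KeywordComponent
  attr_reader :values

  def initialize(mandatory:, optional: nil)
    @values = { mandatory: mandatory, optional: optional }
  end
end

# Hypothetical DSL context: stores class + args now, instantiates later.
class MiniContext
  def initialize
    @definitions = []
  end

  def source(klass = nil, *initialization_params)
    @definitions << { klass: klass, args: initialization_params }
  end
  # Flag the captured trailing hash as keywords (no-op check on old Rubies)
  ruby2_keywords(:source) if respond_to?(:ruby2_keywords, true)

  def instances
    @definitions.map { |d| d[:klass].new(*d[:args]) }
  end
end

context = MiniContext.new
context.source KeywordComponent, mandatory: "first"
component = context.instances.first
```

Without the `ruby2_keywords` line, the same call on Ruby 3 would pass the hash as a positional argument and `KeywordComponent.new` would raise `ArgumentError`.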
data/lib/kiba/parser.rb CHANGED
@@ -1,26 +1,10 @@
1
- # NOTE: using the "Kiba::Parser" declaration, as I discovered,
2
- # provides increased isolation to the declared ETL script, compared
3
- # to 2 nested modules.
4
- # Before that, a user creating entities named Control, Context
5
- # or DSLExtensions would see a conflict with Kiba own classes,
6
- # as by default instance_eval will resolve references by adding
7
- # the module containing the parser class (initially "Kiba").
8
- # Now, the classes appear to be further hidden from the user,
9
- # as Kiba::Parser is its own module.
10
- # This allows the user to create a Parser, Context, Control class
11
- # without it being interpreted as reopening Kiba::Parser, Kiba::Context,
12
- # etc.
13
- # See test in test_cli.rb (test_namespace_conflict)
14
- module Kiba::Parser
15
- def parse(source_as_string = nil, source_file = nil, &source_as_block)
16
- control = Kiba::Control.new
17
- context = Kiba::Context.new(control)
18
- if source_as_string
19
- # this somewhat weird construct allows to remove a nil source_file
20
- context.instance_eval(*[source_as_string, source_file].compact)
21
- else
1
+ module Kiba
2
+ module Parser
3
+ def parse(&source_as_block)
4
+ control = Kiba::Control.new
5
+ context = Kiba::Context.new(control)
22
6
  context.instance_eval(&source_as_block)
7
+ control
23
8
  end
24
- control
25
9
  end
26
10
  end
data/lib/kiba/runner.rb CHANGED
@@ -8,9 +8,6 @@ module Kiba
8
8
  end
9
9
 
10
10
  def run(control)
11
- # TODO: add a dry-run (not instantiating mode) to_instances call
12
- # that will validate the job definition from a syntax pov before
13
- # going any further. This could be shared with the parser.
14
11
  run_pre_processes(control)
15
12
  process_rows(
16
13
  to_instances(control.sources),
@@ -18,8 +15,6 @@ module Kiba
18
15
  destinations = to_instances(control.destinations)
19
16
  )
20
17
  close_destinations(destinations)
21
- # TODO: when I add post processes as class, I'll have to add a test to
22
- # make sure instantiation occurs after the main processing is done (#16)
23
18
  run_post_processes(control)
24
19
  end
25
20
 
@@ -63,15 +58,16 @@ module Kiba
63
58
  end
64
59
 
65
60
  def to_instance(klass, args, block, allow_block, allow_class)
66
- if klass
61
+ if klass && block
62
+ fail 'Class and block form cannot be used together at the moment'
63
+ elsif klass
67
64
  fail 'Class form is not allowed here' unless allow_class
68
65
  klass.new(*args)
69
66
  elsif block
70
67
  fail 'Block form is not allowed here' unless allow_block
71
68
  AliasingProc.new(&block)
72
69
  else
73
- # TODO: support block passing to a class form definition?
74
- fail 'Class and block form cannot be used together at the moment'
70
+ fail 'Nil parameters not allowed here'
75
71
  end
76
72
  end
77
73
  end
@@ -11,6 +11,11 @@ module Kiba
11
11
  end
12
12
  y << returned_row if returned_row
13
13
  end
14
+ if t.respond_to?(:close)
15
+ t.close do |close_row|
16
+ y << close_row
17
+ end
18
+ end
14
19
  end
15
20
  end
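
The hunk above gives class-based transforms an optional `close` hook that can yield extra rows once all input rows have been processed. One use for it is a buffering transform that must flush a partial batch at the end. The sketch below is self-contained and mimics the enumerator logic from the hunk; `BatchTransform` and `apply_transform` are hypothetical names, not Kiba classes.

```ruby
# Hypothetical transform: groups incoming rows into arrays of `size` rows.
class BatchTransform
  def initialize(size)
    @size = size
    @buffer = []
  end

  def process(row)
    @buffer << row
    if @buffer.size == @size
      yield @buffer
      @buffer = []
    end
    nil # nothing to return directly; full batches are yielded
  end

  # Called once after the last row: flush whatever is left.
  def close
    yield @buffer unless @buffer.empty?
  end
end

# Mirrors the structure of the hunk: yielded rows, returned rows,
# then rows yielded by `close` when the transform defines it.
def apply_transform(rows, transform)
  Enumerator.new do |y|
    rows.each do |row|
      returned_row = transform.process(row) { |yielded_row| y << yielded_row }
      y << returned_row if returned_row
    end
    if transform.respond_to?(:close)
      transform.close { |close_row| y << close_row }
    end
  end
end

batches = apply_transform(1..5, BatchTransform.new(2)).to_a
# batches is [[1, 2], [3, 4], [5]]
```

Without the `close` hook, the final partial batch (`[5]` here) would never leave the transform.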
16
21
 
data/lib/kiba/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Kiba
2
- VERSION = '2.0.0.rc1'
2
+ VERSION = '3.6.0'
3
3
  end
data/test/helper.rb CHANGED
@@ -3,9 +3,11 @@ require 'minitest/pride'
3
3
  require 'minitest/focus'
4
4
  require 'kiba'
5
5
 
6
- class Kiba::Test < Minitest::Test
7
- extend Minitest::Spec::DSL
6
+ if ENV['CI'] == 'true'
7
+ puts "Running with Minitest version #{Minitest::VERSION}"
8
+ end
8
9
 
10
+ class Kiba::Test < Minitest::Test
9
11
  def remove_files(*files)
10
12
  files.each do |file|
11
13
  File.delete(file) if File.exist?(file)
@@ -15,4 +17,10 @@ class Kiba::Test < Minitest::Test
15
17
  def fixture(file)
16
18
  File.join(File.dirname(__FILE__), 'fixtures', file)
17
19
  end
20
+
21
+ unless self.method_defined?(:assert_mock)
22
+ def assert_mock(mock)
23
+ mock.verify
24
+ end
25
+ end
18
26
  end
@@ -1,11 +1,8 @@
1
1
  require 'minitest/mock'
2
- require_relative '../support/test_enumerable_source'
2
+ require_relative 'support/test_enumerable_source'
3
+ require_relative 'support/test_destination_returning_nil'
3
4
 
4
5
  module SharedRunnerTests
5
- def kiba_run(job)
6
- Kiba.run(job)
7
- end
8
-
9
6
  def rows
10
7
  @rows ||= [
11
8
  { identifier: 'first-row' },
@@ -134,4 +131,98 @@ module SharedRunnerTests
134
131
  # and the second row should have been reformatted
135
132
  assert_equal [{new_identifier: 'second-row'}], @remaining_rows
136
133
  end
137
- end
134
+
135
+ def test_destination_returning_nil_does_not_remove_row_from_pipeline
136
+ # safeguard to avoid modification on the support code
137
+ assert_nil TestDestinationReturningNil.new.write("FOOBAR")
138
+
139
+ destinations = []
140
+ control = Kiba.parse do
141
+ source TestEnumerableSource, [{key: 'value'}]
142
+ 2.times do
143
+ destination TestDestinationReturningNil, on_init: lambda { |d| destinations << d }
144
+ end
145
+ end
146
+ kiba_run(control)
147
+ 2.times do |i|
148
+ assert_equal [{key: 'value'}], destinations[i].instance_variable_get(:@written_rows)
149
+ end
150
+ end
151
+
152
+ def test_nil_transform_error_message
153
+ control = Kiba.parse do
154
+ transform
155
+ end
156
+ assert_raises(RuntimeError, 'Nil parameters not allowed here') { kiba_run(control) }
157
+ end
158
+
159
+ def test_ruby_3_source_kwargs
160
+ # NOTE: before Ruby 3 kwargs support, a Ruby warning would
161
+ # be captured here with Ruby 2.7 & ensure we fail,
162
+ # and an error would be raised with Ruby 2.8.0-dev
163
+ # NOTE: only the first warning will be captured, though, but
164
+ # having 3 different tests is still better
165
+ storage = nil
166
+ assert_silent do
167
+ Kiba.run(Kiba.parse do
168
+ source TestKeywordArgumentsComponent,
169
+ mandatory: "first",
170
+ on_init: -> (values) { storage = values }
171
+ end)
172
+ end
173
+ assert_equal({
174
+ mandatory: "first",
175
+ optional: nil
176
+ }, storage)
177
+ end
178
+
179
+ def test_ruby_3_transform_kwargs
180
+ storage = nil
181
+ assert_silent do
182
+ Kiba.run(Kiba.parse do
183
+ transform TestKeywordArgumentsComponent,
184
+ mandatory: "first",
185
+ on_init: -> (values) { storage = values }
186
+ end)
187
+ end
188
+ assert_equal({
189
+ mandatory: "first",
190
+ optional: nil
191
+ }, storage)
192
+ end
193
+
194
+ def test_ruby_3_destination_kwargs
195
+ storage = nil
196
+ assert_silent do
197
+ Kiba.run(Kiba.parse do
198
+ destination TestKeywordArgumentsComponent,
199
+ mandatory: "first",
200
+ on_init: -> (values) { storage = values }
201
+ end)
202
+ end
203
+ assert_equal({
204
+ mandatory: "first",
205
+ optional: nil
206
+ }, storage)
207
+ end
208
+
209
+ def test_positional_plus_keyword_arguments
210
+ storage = nil
211
+ assert_silent do
212
+ Kiba.run(Kiba.parse do
213
+ source TestMixedArgumentsComponent,
214
+ "some positional argument",
215
+ mandatory: "first",
216
+ on_init: -> (values) {
217
+ storage = values
218
+ }
219
+ end)
220
+ end
221
+
222
+ assert_equal({
223
+ some_value: "some positional argument",
224
+ mandatory: "first",
225
+ optional: nil
226
+ }, storage)
227
+ end
228
+ end