kiba 1.0.0 → 3.5.0

Files changed (43)
  1. checksums.yaml +5 -5
  2. data/.github/FUNDING.yml +1 -0
  3. data/.travis.yml +11 -9
  4. data/COMM-LICENSE.md +348 -0
  5. data/Changes.md +28 -0
  6. data/ISSUE_TEMPLATE.md +7 -0
  7. data/LICENSE +7 -0
  8. data/Pro-Changes.md +108 -0
  9. data/README.md +15 -282
  10. data/Rakefile +6 -1
  11. data/appveyor.yml +19 -9
  12. data/bin/kiba +12 -2
  13. data/kiba.gemspec +6 -1
  14. data/lib/kiba.rb +10 -1
  15. data/lib/kiba/context.rb +4 -0
  16. data/lib/kiba/control.rb +4 -0
  17. data/lib/kiba/dsl_extensions/config.rb +9 -0
  18. data/lib/kiba/parser.rb +4 -9
  19. data/lib/kiba/runner.rb +14 -5
  20. data/lib/kiba/streaming_runner.rb +38 -0
  21. data/lib/kiba/version.rb +1 -1
  22. data/test/helper.rb +11 -2
  23. data/test/shared_runner_tests.rb +228 -0
  24. data/test/support/shared_tests.rb +10 -0
  25. data/test/support/test_aggregate_transform.rb +19 -0
  26. data/test/support/test_array_destination.rb +9 -0
  27. data/test/support/test_close_yielding_transform.rb +11 -0
  28. data/test/support/test_destination_returning_nil.rb +12 -0
  29. data/test/support/test_duplicate_row_transform.rb +9 -0
  30. data/test/support/test_keyword_arguments_component.rb +14 -0
  31. data/test/support/test_mixed_arguments_component.rb +14 -0
  32. data/test/support/test_non_closing_transform.rb +5 -0
  33. data/test/support/test_yielding_transform.rb +8 -0
  34. data/test/test_integration.rb +3 -3
  35. data/test/test_parser.rb +34 -29
  36. data/test/test_run.rb +12 -0
  37. data/test/test_runner.rb +5 -81
  38. data/test/test_streaming_runner.rb +70 -0
  39. metadata +57 -16
  40. data/lib/kiba/cli.rb +0 -16
  41. data/test/fixtures/bogus.etl +0 -2
  42. data/test/fixtures/valid.etl +0 -1
  43. data/test/test_cli.rb +0 -17
data/README.md CHANGED
@@ -1,306 +1,39 @@
- **Foreword - if you need help**: please [ask your question with tag kiba-etl on StackOverflow](http://stackoverflow.com/questions/ask?tags=kiba-etl) so that other can benefit from your contribution! I monitor this specific tag and will reply to you.
-
- Writing reliable, concise, well-tested & maintainable data-processing code is tricky.
-
- Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs, using Ruby (see [supported versions](#supported-ruby-versions)).
-
- Learn more on the [Kiba blog](http://thibautbarrere.com) and on [StackOverflow](http://stackoverflow.com/questions/tagged/kiba-etl):
-
- * [Live Coding Session - Processing data with Kiba ETL](http://thibautbarrere.com/2015/11/09/video-processing-data-with-kiba-etl/)
- * [Rubyists - are you doing ETL unknowningly?](http://thibautbarrere.com/2015/03/25/rubyists-are-you-doing-etl-unknowingly/)
- * [How to write solid data processing code](http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code/)
- * [How to reformat CSV files with Kiba](http://thibautbarrere.com/2015/06/04/how-to-reformat-csv-files-with-kiba/) (in-depth, hands-on tutorial)
- * [How to explode multivalued attributes with Kiba ETL?](http://thibautbarrere.com/2015/06/25/how-to-explode-multivalued-attributes-with-kiba/)
- * [Common techniques to compute aggregates with Kiba](https://stackoverflow.com/questions/31145715/how-to-do-a-aggregation-transformation-in-a-kiba-etl-script-kiba-gem)
- * [How to run Kiba in a Rails environment?](http://thibautbarrere.com/2015/09/26/how-to-run-kiba-in-a-rails-environment/)
-
- **Consulting services**: if your organization needs to leverage data processing to solve a given business problem, I'm available to help you out via consulting sessions. [More information](http://thibautbarrere.com/hire-me/).
-
- **Kiba Pro**: I'm working on a Pro version ([read more here](https://github.com/thbar/kiba/issues/20)) which will provide more advanced features and built-in goodies in exchange for a yearly subscription. This will also make sure I can support Kiba for the many years to come. [Chime in](https://github.com/thbar/kiba/issues/20) if your company is interested!
+ # Kiba ETL

  [![Gem Version](https://badge.fury.io/rb/kiba.svg)](http://badge.fury.io/rb/kiba)
- [![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba) [![Dependency Status](https://gemnasium.com/thbar/kiba.svg)](https://gemnasium.com/thbar/kiba)
-
- ## How do you define ETL jobs with Kiba?
-
- Kiba provides you with a DSL to define ETL jobs:
-
- ```ruby
- # declare a ruby method here, for quick reusable logic
- def parse_french_date(date)
- Date.strptime(date, '%d/%m/%Y')
- end
-
- # or better, include a ruby file which loads reusable assets
- # eg: commonly used sources / destinations / transforms, under unit-test
- require_relative 'common'
-
- # declare a pre-processor: a block called before the first row is read
- pre_process do
- # do something
- end
-
- # declare a source where to take data from (you implement it - see notes below)
- source MyCsvSource, 'input.csv'
-
- # declare a row transform to process a given field
- transform do |row|
- row[:birth_date] = parse_french_date(row[:birth_date])
- # return to keep in the pipeline
- row
- end
-
- # declare another row transform, dismissing rows conditionally by returning nil
- transform do |row|
- row[:birth_date].year < 2000 ? row : nil
- end
-
- # declare a row transform as a class, which can be tested properly
- transform ComplianceCheckTransform, eula: 2015
-
- # before declaring a definition, maybe you'll want to retrieve credentials
- config = YAML.load(IO.read('config.yml'))
-
- # declare a destination - like source, you implement it (see below)
- destination MyDatabaseDestination, config['my_database']
-
- # declare a post-processor: a block called after all rows are successfully processed
- post_process do
- # do something
- end
- ```
-
- The combination of pre-processors, sources, transforms, destinations and post-processors defines the data processing pipeline.
-
- Note: you are advised to store your ETL definitions as files with the extension `.etl` (rather than `.rb`). This will make sure you do not end up loading them by mistake from another component (eg: a Rails app).
-
- ## How do you run your ETL jobs?
-
- You can use the provided command-line:
-
- ```
- bundle exec kiba my-data-processing-script.etl
- ```
-
- This command essentially starts a two-step process:
-
- ```ruby
- script_content = IO.read(filename)
- # pass the filename to get for line numbers on errors
- job_definition = Kiba.parse(script_content, filename)
- Kiba.run(job_definition)
- ```
-
- `Kiba.parse` evaluates your ETL Ruby code to register sources, transforms, destinations and post-processors in a job definition. It is important to understand that you can use Ruby logic at the DSL parsing time. This means that such code is possible, provided the CSV files are available at parsing time:
-
- ```ruby
- Dir['to_be_processed/*.csv'].each do |file|
- source MyCsvSource, file
- end
- ```
-
- Once the job definition is loaded, `Kiba.run` will use that information to do the actual row-by-row processing. It currently uses a simple row-by-row, single-threaded processing that will stop at the first error encountered.
-
- ## Implementing ETL sources
-
- In Kiba, you are responsible for implementing the sources that do the extraction of data.
-
- Sources are classes implementing:
- - a constructor (to which Kiba will pass the provided arguments in the DSL)
- - the `each` method (which should yield rows one by one)
-
- Rows are usually `Hash` instances, but could be other structures as long as the rest of your pipeline is expecting it.
-
- Since sources are classes, you can (and are encouraged to) unit test them and reuse them.
-
- Here is a simple CSV source:
-
- ```ruby
- require 'csv'
-
- class MyCsvSource
- def initialize(input_file)
- @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
- end
-
- def each
- @csv.each do |row|
- yield(row.to_hash)
- end
- @csv.close
- end
- end
- ```
-
- ## Implementing row transforms
-
- Row transforms can implemented in two ways: as blocks, or as classes.
+ [![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Build status](https://ci.appveyor.com/api/projects/status/v05jcyhpp1mueq9i?svg=true)](https://ci.appveyor.com/project/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba)

- ### Row transform as a block
-
- When writing a row transform as a block, it will be passed the row as parameter:
-
- ```ruby
- transform do |row|
- row[:this_field] = row[:that_field] * 10
- # make sure to return the row to keep it in the pipeline
- row
- end
- ```
-
- To dismiss a row from the pipeline, simply return `nil` from a transform:
-
- ```ruby
- transform { |row| row[:index] % 2 == 0 ? row : nil }
- ```
-
- ### Row transform as a class
-
- If you implement the transform as a class, it must respond to `process(row)`:
-
- ```ruby
- class SamplingTransform
- def initialize(modulo_value)
- @modulo_value = modulo_value
- end
-
- def process(row)
- row[:index] % @modulo_value == 0 ? row : nil
- end
- end
- ```
-
- You'll use it this way in your ETL declaration (the parameters will be passed to initialize):
-
- ```ruby
- # only keep 1 row over 10
- transform SamplingTransform, 10
- ```
-
- Like the block form, it can return `nil` to dismiss the row. The class form allows better testability and reusability across your(s) ETL script(s).
-
- ## Implementing ETL destinations
-
- Like sources, destinations are classes that you are providing. Destinations must implement:
- - a constructor (to which Kiba will pass the provided arguments in the DSL)
- - a `write(row)` method that will be called for each non-dismissed row
- - an optional `close` method that will be called, if present, at the end of the processing (useful to tear down resources such as connections)
-
- Here is an example destination:
-
- ```ruby
- require 'csv'
-
- # simple destination assuming all rows have the same fields
- class MyCsvDestination
- def initialize(output_file)
- @csv = CSV.open(output_file, 'w')
- end
-
- def write(row)
- unless @headers_written
- @headers_written = true
- @csv << row.keys
- end
- @csv << row.values
- end
-
- def close
- @csv.close
- end
- end
- ```
-
- ## Implementing pre and post-processors
-
- Pre-processors and post-processors are currently blocks, which get called only once per ETL run:
- - Pre-processors get called before the ETL starts reading rows from the sources.
- - Post-processors get invoked after the ETL successfully processed all the rows.
-
- Note that post-processors won't get called if an error occurred earlier.
-
- ```ruby
- count = 0
-
- def system!(cmd)
- fail "Command #{cmd} failed" unless system(cmd)
- end
-
- file = 'my_file.csv'
- sample_file = 'my_file.sample.csv'
-
- pre_process do
- # it's handy to work with a reduced data set. you can
- # e.g. just keep one line of the CSV files + the headers
- system! "sed -n \"1p;25706p\" #{file} > #{sample_file}"
- end
-
- source MyCsv, file: sample_file
-
- transform do |row|
- count += 1
- row
- end
-
- post_process do
- Email.send(supervisor_address, "#{count} rows successfully processed")
- end
- ```
+ Writing reliable, concise, well-tested & maintainable data-processing code is tricky.

- ## Composability, reusability, testability of Kiba components
+ Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs using Ruby.

- The way Kiba works makes it easy to create reusable, well-tested ETL components and jobs.
+ ## Getting Started

- The main reason for this is that a Kiba ETL script can `require` shared Ruby code, which allows to:
- - create well-tested, reusable sources & destinations
- - create macro-transforms as methods, to be reused across sister scripts
- - substitute a component by another (e.g.: try a variant of a destination)
- - use a centralized place for configuration (credentials, IP addresses, etc.)
+ Head over to the [Wiki](https://github.com/thbar/kiba/wiki) for up-to-date documentation.

- The fact that the DSL evaluation "runs" the script also allows for simple meta-programming techniques, like pre-reading a source file to extract field names, to be used in transform definitions.
+ **If you need help**, please [ask your question with tag kiba-etl on StackOverflow](http://stackoverflow.com/questions/ask?tags=kiba-etl) so that other can benefit from your contribution! I monitor this specific tag and will reply to you.

- The ability to support that DSL, but also check command line arguments, environment variables and tweak behaviour as needed, or call other/faster specialized tools make Ruby an asset to implement ETL jobs.
+ [Kiba Pro](https://www.kiba-etl.org/kiba-pro) customers get priority private email support.

- Make sure to subscribe to my [Ruby ETL blog](http://thibautbarrere.com) where I'll demonstrate such techniques over time!
+ You can also check out the [author blog](https://thibautbarrere.com) and [StackOverflow answers](http://stackoverflow.com/questions/tagged/kiba-etl).

  ## Supported Ruby versions

- Kiba currently supports Ruby 2.0+ and JRuby (with its default 1.9 syntax). See [test matrix](https://travis-ci.org/thbar/kiba).
-
- ## History & Credits
+ Kiba currently supports Ruby 2.4+, JRuby 9.2+ and TruffleRuby. See [test matrix](https://travis-ci.org/thbar/kiba).

- Wow, you're still there? Nice to meet you. I'm [Thibaut](http://thibautbarrere.com), author of Kiba.
+ ## ETL consulting & commercial version

- I first met the idea of row-based syntax when I started using [Anthony Eden](https://github.com/aeden)'s [Activewarehouse-ETL](https://github.com/activewarehouse/activewarehouse-etl), first published around 2006 (I think), in which Anthony applied the core principles defined by Ralph Kimball in [The Data Warehouse ETL Toolkit](http://www.amazon.com/gp/product/0764567578).
+ **Consulting services**: if your organization needs guidance on Kiba / ETL implementations, we provide consulting services. Contact at [https://www.logeek.fr](https://www.logeek.fr).

- I've been writing and maintaining a number of production ETL systems using Activewarehouse-ETL, then later with an ancestor of Kiba which was named TinyTL.
-
- I took over the maintenance of Activewarehouse-ETL circa 2009/2010, but over time, I could not properly update & document it, given the gradual failure of a large number of dependencies and components. Ultimately in 2014 I had to stop maintaining it, after an already long hiatus.
-
- That said using Activewarehouse-ETL for so long made me realize the row-based processing syntax was great and provided some great assets for maintainability on long time-spans.
-
- Kiba is a completely fresh & minimalistic-on-purpose implementation of that row-based processing pattern.
-
- It is minimalistic to make it more likely that I will be able to maintain it over time.
-
- It makes strong simplicity assumptions (like letting you define the sources, transforms & destinations). MiniTest is an inspiration.
-
- As I developed Kiba, I realize how much this simplicity opens the road for interesting developments such as multi-threaded & multi-processes processing.
-
- Last word: Kiba is 100% sponsored by my company LoGeek SARL (also provider of [WiseCash, a lightweight cash-flow forecasting app](https://www.wisecashhq.com)).
+ **Kiba Pro**: for vendor-backed ETL extensions, check out [Kiba Pro](https://www.kiba-etl.org/kiba-pro).

  ## License

- Copyright (c) LoGeek SARL.
-
- Kiba is an Open Source project licensed under the terms of
- the LGPLv3 license. Please see <http://www.gnu.org/licenses/lgpl-3.0.html>
- for license text.
+ Copyright (c) LoGeek SARL. Kiba is an Open Source project licensed under the terms of
+ the LGPLv3 license. Please see <http://www.gnu.org/licenses/lgpl-3.0.html> for license text.

  ## Contributing & Legal

- Until the API is more stable, I can only accept documentation Pull Requests.
-
  (agreement below borrowed from [Sidekiq Legal](https://github.com/mperham/sidekiq/blob/master/Contributing.md))

  By submitting a Pull Request, you disavow any rights or claims to any changes submitted to the Kiba project and assign the copyright of those changes to LoGeek SARL.
data/Rakefile CHANGED
@@ -4,4 +4,9 @@ Rake::TestTask.new(:test) do |t|
  t.pattern = 'test/test_*.rb'
  end

- task default: :test
+ # A simple check to verify TruffleRuby installation trick is really in effect
+ task :show_ruby_version do
+ puts "Running with #{RUBY_DESCRIPTION}"
+ end
+
+ task default: [:show_ruby_version, :test]
data/appveyor.yml CHANGED
@@ -1,18 +1,28 @@
- version: '{build}'
+ version: 1.0.{build}-{branch}

- skip_tags: true
+ cache:
+ - vendor/bundle

  environment:
  matrix:
- - ruby_version: "21"
- - ruby_version: "21-x64"
+ - RUBY_VERSION: 26
+ - RUBY_VERSION: 25
+ - RUBY_VERSION: 24
+ - RUBY_VERSION: 23
+ # NOTE: jruby doesn't seem to be supported on default images
+ # see https://www.appveyor.com/docs/build-environment/#ruby

  install:
- - SET PATH=C:\Ruby%ruby_version%\bin;%PATH%
- - gem install bundler --no-document -v 1.10.5
- - bundle install --retry=3
+ - set PATH=C:\Ruby%RUBY_VERSION%\bin;%PATH%
+ - bundle config --local path vendor/bundle
+ - bundle install
+
+ build: off
+
+ before_test:
+ - ruby -v
+ - gem -v
+ - bundle -v

  test_script:
  - bundle exec rake
-
- build: off
data/bin/kiba CHANGED
@@ -1,5 +1,15 @@
  #!/usr/bin/env ruby

- require_relative '../lib/kiba/cli'
+ STDERR.puts <<DOC

- Kiba::Cli.run(ARGV)
+ ##########################################################################
+
+ The 'kiba' CLI is deprecated and has been removed in Kiba ETL v3.
+
+ See release notes / changelog for help.
+
+ ##########################################################################
+
+ DOC
+
+ exit(1)
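For reference, the removed CLI used to simply read the script, call `Kiba.parse`, then `Kiba.run` (as shown in the old README above). A minimal sketch of the programmatic equivalent under Kiba 3, assuming hypothetical `MyCsvSource` / `MyCsvDestination` components:

```ruby
require 'kiba'

# Declare the job with the block form of Kiba.parse,
# instead of shelling out to the removed `kiba` executable.
job = Kiba.parse do
  source MyCsvSource, 'input.csv'             # hypothetical source class
  transform { |row| row }                     # pass-through transform
  destination MyCsvDestination, 'output.csv'  # hypothetical destination class
end

# Execute the declared pipeline.
Kiba.run(job)
```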
data/kiba.gemspec CHANGED
@@ -13,8 +13,13 @@ Gem::Specification.new do |gem|
  gem.require_paths = ['lib']
  gem.version = Kiba::VERSION
  gem.executables = ['kiba']
+ gem.metadata = {
+ 'source_code_uri' => 'https://github.com/thbar/kiba',
+ 'documentation_uri' => 'https://github.com/thbar/kiba/wiki',
+ }

  gem.add_development_dependency 'rake'
- gem.add_development_dependency 'minitest'
+ gem.add_development_dependency 'minitest', '~> 5.9'
  gem.add_development_dependency 'awesome_print'
+ gem.add_development_dependency 'minitest-focus'
  end
data/lib/kiba.rb CHANGED
@@ -5,6 +5,15 @@ require 'kiba/control'
  require 'kiba/context'
  require 'kiba/parser'
  require 'kiba/runner'
+ require 'kiba/streaming_runner'
+ require 'kiba/dsl_extensions/config'

  Kiba.extend(Kiba::Parser)
- Kiba.extend(Kiba::Runner)
+
+ module Kiba
+ def self.run(job)
+ # NOTE: use Hash#dig when Ruby 2.2 reaches EOL
+ runner = job.config.fetch(:kiba, {}).fetch(:runner, Kiba::StreamingRunner)
+ runner.run(job)
+ end
+ end
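As the hunk above shows, `Kiba.run` now resolves the runner from the job's config and defaults to `Kiba::StreamingRunner`. A sketch of opting back into the legacy runner through the `config` DSL extension (see `data/lib/kiba/dsl_extensions/config.rb` further below); the source class is hypothetical:

```ruby
require 'kiba'

job = Kiba.parse do
  extend Kiba::DSLExtensions::Config

  # Stored under control.config[:kiba]; Kiba.run fetches :runner from there
  # and falls back to Kiba::StreamingRunner when the key is absent.
  config :kiba, runner: Kiba::Runner

  source MyCsvSource, 'input.csv' # hypothetical source class
end

Kiba.run(job) # runs with Kiba::Runner instead of the streaming default
```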
data/lib/kiba/context.rb CHANGED
@@ -23,5 +23,9 @@ module Kiba
  def post_process(&block)
  @control.post_processes << { block: block }
  end
+
+ [:source, :transform, :destination].each do |m|
+ ruby2_keywords(m) if respond_to?(:ruby2_keywords, true)
+ end
  end
  end
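The `ruby2_keywords` call keeps keyword arguments intact when `source` / `transform` / `destination` splat them through to component constructors on Ruby 2.7+. A sketch of the kind of component this supports (class and field names are made up):

```ruby
# Hypothetical transform whose constructor takes keyword arguments.
class RenameField
  def initialize(from:, to:)
    @from = from
    @to = to
  end

  def process(row)
    row[@to] = row.delete(@from)
    row
  end
end

# In a job definition, the keywords travel through the DSL unchanged:
#   transform RenameField, from: :birth_date, to: :dob
```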
data/lib/kiba/control.rb CHANGED
@@ -3,6 +3,10 @@ module Kiba
  def pre_processes
  @pre_processes ||= []
  end
+
+ def config
+ @config ||= {}
+ end

  def sources
  @sources ||= []
data/lib/kiba/dsl_extensions/config.rb ADDED
@@ -0,0 +1,9 @@
+ module Kiba
+ module DSLExtensions
+ module Config
+ def config(context, context_config)
+ (@control.config[context] ||= {}).merge!(context_config)
+ end
+ end
+ end
+ end