kiba 1.0.0 → 3.5.0
- checksums.yaml +5 -5
- data/.github/FUNDING.yml +1 -0
- data/.travis.yml +11 -9
- data/COMM-LICENSE.md +348 -0
- data/Changes.md +28 -0
- data/ISSUE_TEMPLATE.md +7 -0
- data/LICENSE +7 -0
- data/Pro-Changes.md +108 -0
- data/README.md +15 -282
- data/Rakefile +6 -1
- data/appveyor.yml +19 -9
- data/bin/kiba +12 -2
- data/kiba.gemspec +6 -1
- data/lib/kiba.rb +10 -1
- data/lib/kiba/context.rb +4 -0
- data/lib/kiba/control.rb +4 -0
- data/lib/kiba/dsl_extensions/config.rb +9 -0
- data/lib/kiba/parser.rb +4 -9
- data/lib/kiba/runner.rb +14 -5
- data/lib/kiba/streaming_runner.rb +38 -0
- data/lib/kiba/version.rb +1 -1
- data/test/helper.rb +11 -2
- data/test/shared_runner_tests.rb +228 -0
- data/test/support/shared_tests.rb +10 -0
- data/test/support/test_aggregate_transform.rb +19 -0
- data/test/support/test_array_destination.rb +9 -0
- data/test/support/test_close_yielding_transform.rb +11 -0
- data/test/support/test_destination_returning_nil.rb +12 -0
- data/test/support/test_duplicate_row_transform.rb +9 -0
- data/test/support/test_keyword_arguments_component.rb +14 -0
- data/test/support/test_mixed_arguments_component.rb +14 -0
- data/test/support/test_non_closing_transform.rb +5 -0
- data/test/support/test_yielding_transform.rb +8 -0
- data/test/test_integration.rb +3 -3
- data/test/test_parser.rb +34 -29
- data/test/test_run.rb +12 -0
- data/test/test_runner.rb +5 -81
- data/test/test_streaming_runner.rb +70 -0
- metadata +57 -16
- data/lib/kiba/cli.rb +0 -16
- data/test/fixtures/bogus.etl +0 -2
- data/test/fixtures/valid.etl +0 -1
- data/test/test_cli.rb +0 -17
data/README.md
CHANGED
@@ -1,306 +1,39 @@
-
-
-Writing reliable, concise, well-tested & maintainable data-processing code is tricky.
-
-Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs, using Ruby (see [supported versions](#supported-ruby-versions)).
-
-Learn more on the [Kiba blog](http://thibautbarrere.com) and on [StackOverflow](http://stackoverflow.com/questions/tagged/kiba-etl):
-
-* [Live Coding Session - Processing data with Kiba ETL](http://thibautbarrere.com/2015/11/09/video-processing-data-with-kiba-etl/)
-* [Rubyists - are you doing ETL unknowningly?](http://thibautbarrere.com/2015/03/25/rubyists-are-you-doing-etl-unknowingly/)
-* [How to write solid data processing code](http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code/)
-* [How to reformat CSV files with Kiba](http://thibautbarrere.com/2015/06/04/how-to-reformat-csv-files-with-kiba/) (in-depth, hands-on tutorial)
-* [How to explode multivalued attributes with Kiba ETL?](http://thibautbarrere.com/2015/06/25/how-to-explode-multivalued-attributes-with-kiba/)
-* [Common techniques to compute aggregates with Kiba](https://stackoverflow.com/questions/31145715/how-to-do-a-aggregation-transformation-in-a-kiba-etl-script-kiba-gem)
-* [How to run Kiba in a Rails environment?](http://thibautbarrere.com/2015/09/26/how-to-run-kiba-in-a-rails-environment/)
-
-**Consulting services**: if your organization needs to leverage data processing to solve a given business problem, I'm available to help you out via consulting sessions. [More information](http://thibautbarrere.com/hire-me/).
-
-**Kiba Pro**: I'm working on a Pro version ([read more here](https://github.com/thbar/kiba/issues/20)) which will provide more advanced features and built-in goodies in exchange for a yearly subscription. This will also make sure I can support Kiba for the many years to come. [Chime in](https://github.com/thbar/kiba/issues/20) if your company is interested!
+# Kiba ETL

 [![Gem Version](https://badge.fury.io/rb/kiba.svg)](http://badge.fury.io/rb/kiba)
-[![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![
-
-## How do you define ETL jobs with Kiba?
-
-Kiba provides you with a DSL to define ETL jobs:
-
-```ruby
-# declare a ruby method here, for quick reusable logic
-def parse_french_date(date)
-  Date.strptime(date, '%d/%m/%Y')
-end
-
-# or better, include a ruby file which loads reusable assets
-# eg: commonly used sources / destinations / transforms, under unit-test
-require_relative 'common'
-
-# declare a pre-processor: a block called before the first row is read
-pre_process do
-  # do something
-end
-
-# declare a source where to take data from (you implement it - see notes below)
-source MyCsvSource, 'input.csv'
-
-# declare a row transform to process a given field
-transform do |row|
-  row[:birth_date] = parse_french_date(row[:birth_date])
-  # return to keep in the pipeline
-  row
-end
-
-# declare another row transform, dismissing rows conditionally by returning nil
-transform do |row|
-  row[:birth_date].year < 2000 ? row : nil
-end
-
-# declare a row transform as a class, which can be tested properly
-transform ComplianceCheckTransform, eula: 2015
-
-# before declaring a definition, maybe you'll want to retrieve credentials
-config = YAML.load(IO.read('config.yml'))
-
-# declare a destination - like source, you implement it (see below)
-destination MyDatabaseDestination, config['my_database']
-
-# declare a post-processor: a block called after all rows are successfully processed
-post_process do
-  # do something
-end
-```
-
-The combination of pre-processors, sources, transforms, destinations and post-processors defines the data processing pipeline.
-
-Note: you are advised to store your ETL definitions as files with the extension `.etl` (rather than `.rb`). This will make sure you do not end up loading them by mistake from another component (eg: a Rails app).
-
-## How do you run your ETL jobs?
-
-You can use the provided command-line:
-
-```
-bundle exec kiba my-data-processing-script.etl
-```
-
-This command essentially starts a two-step process:
-
-```ruby
-script_content = IO.read(filename)
-# pass the filename to get for line numbers on errors
-job_definition = Kiba.parse(script_content, filename)
-Kiba.run(job_definition)
-```
-
-`Kiba.parse` evaluates your ETL Ruby code to register sources, transforms, destinations and post-processors in a job definition. It is important to understand that you can use Ruby logic at the DSL parsing time. This means that such code is possible, provided the CSV files are available at parsing time:
-
-```ruby
-Dir['to_be_processed/*.csv'].each do |file|
-  source MyCsvSource, file
-end
-```
-
-Once the job definition is loaded, `Kiba.run` will use that information to do the actual row-by-row processing. It currently uses a simple row-by-row, single-threaded processing that will stop at the first error encountered.
-
-## Implementing ETL sources
-
-In Kiba, you are responsible for implementing the sources that do the extraction of data.
-
-Sources are classes implementing:
-- a constructor (to which Kiba will pass the provided arguments in the DSL)
-- the `each` method (which should yield rows one by one)
-
-Rows are usually `Hash` instances, but could be other structures as long as the rest of your pipeline is expecting it.
-
-Since sources are classes, you can (and are encouraged to) unit test them and reuse them.
-
-Here is a simple CSV source:
-
-```ruby
-require 'csv'
-
-class MyCsvSource
-  def initialize(input_file)
-    @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
-  end
-
-  def each
-    @csv.each do |row|
-      yield(row.to_hash)
-    end
-    @csv.close
-  end
-end
-```
-
-## Implementing row transforms
-
-Row transforms can implemented in two ways: as blocks, or as classes.
+[![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Build status](https://ci.appveyor.com/api/projects/status/v05jcyhpp1mueq9i?svg=true)](https://ci.appveyor.com/project/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba)

-
-
-When writing a row transform as a block, it will be passed the row as parameter:
-
-```ruby
-transform do |row|
-  row[:this_field] = row[:that_field] * 10
-  # make sure to return the row to keep it in the pipeline
-  row
-end
-```
-
-To dismiss a row from the pipeline, simply return `nil` from a transform:
-
-```ruby
-transform { |row| row[:index] % 2 == 0 ? row : nil }
-```
-
-### Row transform as a class
-
-If you implement the transform as a class, it must respond to `process(row)`:
-
-```ruby
-class SamplingTransform
-  def initialize(modulo_value)
-    @modulo_value = modulo_value
-  end
-
-  def process(row)
-    row[:index] % @modulo_value == 0 ? row : nil
-  end
-end
-```
-
-You'll use it this way in your ETL declaration (the parameters will be passed to initialize):
-
-```ruby
-# only keep 1 row over 10
-transform SamplingTransform, 10
-```
-
-Like the block form, it can return `nil` to dismiss the row. The class form allows better testability and reusability across your(s) ETL script(s).
-
-## Implementing ETL destinations
-
-Like sources, destinations are classes that you are providing. Destinations must implement:
-- a constructor (to which Kiba will pass the provided arguments in the DSL)
-- a `write(row)` method that will be called for each non-dismissed row
-- an optional `close` method that will be called, if present, at the end of the processing (useful to tear down resources such as connections)
-
-Here is an example destination:
-
-```ruby
-require 'csv'
-
-# simple destination assuming all rows have the same fields
-class MyCsvDestination
-  def initialize(output_file)
-    @csv = CSV.open(output_file, 'w')
-  end
-
-  def write(row)
-    unless @headers_written
-      @headers_written = true
-      @csv << row.keys
-    end
-    @csv << row.values
-  end
-
-  def close
-    @csv.close
-  end
-end
-```
-
-## Implementing pre and post-processors
-
-Pre-processors and post-processors are currently blocks, which get called only once per ETL run:
-- Pre-processors get called before the ETL starts reading rows from the sources.
-- Post-processors get invoked after the ETL successfully processed all the rows.
-
-Note that post-processors won't get called if an error occurred earlier.
-
-```ruby
-count = 0
-
-def system!(cmd)
-  fail "Command #{cmd} failed" unless system(cmd)
-end
-
-file = 'my_file.csv'
-sample_file = 'my_file.sample.csv'
-
-pre_process do
-  # it's handy to work with a reduced data set. you can
-  # e.g. just keep one line of the CSV files + the headers
-  system! "sed -n \"1p;25706p\" #{file} > #{sample_file}"
-end
-
-source MyCsv, file: sample_file
-
-transform do |row|
-  count += 1
-  row
-end
-
-post_process do
-  Email.send(supervisor_address, "#{count} rows successfully processed")
-end
-```
+Writing reliable, concise, well-tested & maintainable data-processing code is tricky.

-
+Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs using Ruby.

-
+## Getting Started

-
-- create well-tested, reusable sources & destinations
-- create macro-transforms as methods, to be reused across sister scripts
-- substitute a component by another (e.g.: try a variant of a destination)
-- use a centralized place for configuration (credentials, IP addresses, etc.)
+Head over to the [Wiki](https://github.com/thbar/kiba/wiki) for up-to-date documentation.

-
+**If you need help**, please [ask your question with tag kiba-etl on StackOverflow](http://stackoverflow.com/questions/ask?tags=kiba-etl) so that other can benefit from your contribution! I monitor this specific tag and will reply to you.

-
+[Kiba Pro](https://www.kiba-etl.org/kiba-pro) customers get priority private email support.

-
+You can also check out the [author blog](https://thibautbarrere.com) and [StackOverflow answers](http://stackoverflow.com/questions/tagged/kiba-etl).

 ## Supported Ruby versions

-Kiba currently supports Ruby 2.
-
-## History & Credits
+Kiba currently supports Ruby 2.4+, JRuby 9.2+ and TruffleRuby. See [test matrix](https://travis-ci.org/thbar/kiba).

-
+## ETL consulting & commercial version

-
+**Consulting services**: if your organization needs guidance on Kiba / ETL implementations, we provide consulting services. Contact at [https://www.logeek.fr](https://www.logeek.fr).

-
-
-I took over the maintenance of Activewarehouse-ETL circa 2009/2010, but over time, I could not properly update & document it, given the gradual failure of a large number of dependencies and components. Ultimately in 2014 I had to stop maintaining it, after an already long hiatus.
-
-That said using Activewarehouse-ETL for so long made me realize the row-based processing syntax was great and provided some great assets for maintainability on long time-spans.
-
-Kiba is a completely fresh & minimalistic-on-purpose implementation of that row-based processing pattern.
-
-It is minimalistic to make it more likely that I will be able to maintain it over time.
-
-It makes strong simplicity assumptions (like letting you define the sources, transforms & destinations). MiniTest is an inspiration.
-
-As I developed Kiba, I realize how much this simplicity opens the road for interesting developments such as multi-threaded & multi-processes processing.
-
-Last word: Kiba is 100% sponsored by my company LoGeek SARL (also provider of [WiseCash, a lightweight cash-flow forecasting app](https://www.wisecashhq.com)).
+**Kiba Pro**: for vendor-backed ETL extensions, check out [Kiba Pro](https://www.kiba-etl.org/kiba-pro).

 ## License

-Copyright (c) LoGeek SARL.
-
-Kiba is an Open Source project licensed under the terms of
-the LGPLv3 license. Please see <http://www.gnu.org/licenses/lgpl-3.0.html>
-for license text.
+Copyright (c) LoGeek SARL. Kiba is an Open Source project licensed under the terms of
+the LGPLv3 license. Please see <http://www.gnu.org/licenses/lgpl-3.0.html> for license text.

 ## Contributing & Legal

-Until the API is more stable, I can only accept documentation Pull Requests.
-
 (agreement below borrowed from [Sidekiq Legal](https://github.com/mperham/sidekiq/blob/master/Contributing.md))

 By submitting a Pull Request, you disavow any rights or claims to any changes submitted to the Kiba project and assign the copyright of those changes to LoGeek SARL.
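The README sections removed above describe the component contract: a source yields rows from `each`, a transform returns the row (or `nil` to dismiss it) from `process`, and a destination receives `write(row)` plus an optional `close`. As a rough illustration only, that contract can be exercised without Kiba by a small stand-in loop. `ArraySource`, `ArrayDestination` and `run_pipeline` are hypothetical names made up for this sketch; this is not Kiba's runner.

```ruby
# Hypothetical stand-ins illustrating the source/transform/destination contract.
# NOT Kiba's runner: only a sketch of the protocol described in the old README.

class ArraySource
  def initialize(rows)
    @rows = rows
  end

  # A source yields rows one by one.
  def each
    @rows.each { |row| yield(row) }
  end
end

class SamplingTransform
  def initialize(modulo_value)
    @modulo_value = modulo_value
  end

  # Return the row to keep it, or nil to dismiss it.
  def process(row)
    row[:index] % @modulo_value == 0 ? row : nil
  end
end

class ArrayDestination
  attr_reader :rows, :closed

  def initialize
    @rows = []
    @closed = false
  end

  # Called once per non-dismissed row.
  def write(row)
    @rows << row
  end

  # Optional teardown hook, called once at the end.
  def close
    @closed = true
  end
end

# Assumed processing order: every transform in sequence, stop at the first nil.
def run_pipeline(source, transforms, destination)
  source.each do |row|
    transforms.each do |transform|
      row = transform.process(row)
      break if row.nil?
    end
    destination.write(row) unless row.nil?
  end
  destination.close if destination.respond_to?(:close)
  destination
end

source = ArraySource.new((0..5).map { |i| { index: i } })
destination = run_pipeline(source, [SamplingTransform.new(2)], ArrayDestination.new)
p destination.rows
```

Because each component is a plain class, it can be unit-tested on its own, which is the testability argument the old README made for the class form.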
data/Rakefile
CHANGED
@@ -4,4 +4,9 @@ Rake::TestTask.new(:test) do |t|
   t.pattern = 'test/test_*.rb'
 end

-
+# A simple check to verify TruffleRuby installation trick is really in effect
+task :show_ruby_version do
+  puts "Running with #{RUBY_DESCRIPTION}"
+end
+
+task default: [:show_ruby_version, :test]
data/appveyor.yml
CHANGED
@@ -1,18 +1,28 @@
-version:
+version: 1.0.{build}-{branch}

-
+cache:
+  - vendor/bundle

 environment:
   matrix:
-    -
-    -
+    - RUBY_VERSION: 26
+    - RUBY_VERSION: 25
+    - RUBY_VERSION: 24
+    - RUBY_VERSION: 23
+    # NOTE: jruby doesn't seem to be supported on default images
+    # see https://www.appveyor.com/docs/build-environment/#ruby

 install:
-  -
-  -
-  - bundle install
+  - set PATH=C:\Ruby%RUBY_VERSION%\bin;%PATH%
+  - bundle config --local path vendor/bundle
+  - bundle install
+
+build: off
+
+before_test:
+  - ruby -v
+  - gem -v
+  - bundle -v

 test_script:
   - bundle exec rake
-
-build: off
data/bin/kiba
CHANGED
@@ -1,5 +1,15 @@
 #!/usr/bin/env ruby

-
+STDERR.puts <<DOC

-
+##########################################################################
+
+The 'kiba' CLI is deprecated and has been removed in Kiba ETL v3.
+
+See release notes / changelog for help.
+
+##########################################################################
+
+DOC
+
+exit(1)
data/kiba.gemspec
CHANGED
@@ -13,8 +13,13 @@ Gem::Specification.new do |gem|
   gem.require_paths = ['lib']
   gem.version       = Kiba::VERSION
   gem.executables   = ['kiba']
+  gem.metadata = {
+    'source_code_uri' => 'https://github.com/thbar/kiba',
+    'documentation_uri' => 'https://github.com/thbar/kiba/wiki',
+  }

   gem.add_development_dependency 'rake'
-  gem.add_development_dependency 'minitest'
+  gem.add_development_dependency 'minitest', '~> 5.9'
   gem.add_development_dependency 'awesome_print'
+  gem.add_development_dependency 'minitest-focus'
 end
data/lib/kiba.rb
CHANGED
@@ -5,6 +5,15 @@ require 'kiba/control'
 require 'kiba/context'
 require 'kiba/parser'
 require 'kiba/runner'
+require 'kiba/streaming_runner'
+require 'kiba/dsl_extensions/config'

 Kiba.extend(Kiba::Parser)
-
+
+module Kiba
+  def self.run(job)
+    # NOTE: use Hash#dig when Ruby 2.2 reaches EOL
+    runner = job.config.fetch(:kiba, {}).fetch(:runner, Kiba::StreamingRunner)
+    runner.run(job)
+  end
+end
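The new `Kiba.run` picks its runner with two chained `fetch` calls, so a job with no `:kiba` config entry, or with one that sets no `:runner`, falls back to `Kiba::StreamingRunner`. A minimal sketch of that lookup on plain hashes follows. `StreamingRunner`, `LegacyRunner` and `pick_runner` are placeholder names for illustration, and the nested-Hash shape of `job.config` is an assumption read off the `fetch` chain in the diff.

```ruby
# Placeholder runners for illustration; the real classes live in the Kiba gem.
module StreamingRunner; end
module LegacyRunner; end

# Mirrors the chained-fetch lookup from the diff, applied to a plain Hash.
# Each fetch supplies a default, so a missing key never raises KeyError.
def pick_runner(config)
  config.fetch(:kiba, {}).fetch(:runner, StreamingRunner)
end

p pick_runner({})                                  # no :kiba entry
p pick_runner({ kiba: {} })                        # :kiba entry without :runner
p pick_runner({ kiba: { runner: LegacyRunner } })  # explicit override
```

`Hash#dig`, mentioned in the NOTE comment, returns `nil` for a missing key instead of accepting a default, so a `dig`-based version would need an explicit `|| StreamingRunner` fallback.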
data/lib/kiba/context.rb
CHANGED
data/lib/kiba/control.rb
CHANGED