kiba 0.5.0
- checksums.yaml +7 -0
- data/.gitignore +2 -0
- data/.travis.yml +6 -0
- data/Changes.md +4 -0
- data/Gemfile +3 -0
- data/README.md +268 -0
- data/Rakefile +7 -0
- data/bin/kiba +5 -0
- data/kiba.gemspec +20 -0
- data/lib/kiba.rb +10 -0
- data/lib/kiba/cli.rb +16 -0
- data/lib/kiba/context.rb +28 -0
- data/lib/kiba/control.rb +19 -0
- data/lib/kiba/parser.rb +15 -0
- data/lib/kiba/runner.rb +44 -0
- data/lib/kiba/version.rb +3 -0
- data/test/fixtures/bogus.etl +2 -0
- data/test/fixtures/valid.etl +1 -0
- data/test/helper.rb +17 -0
- data/test/support/test_csv_destination.rb +21 -0
- data/test/support/test_csv_source.rb +14 -0
- data/test/support/test_enumerable_source.rb +11 -0
- data/test/support/test_rename_field_transform.rb +11 -0
- data/test/test_cli.rb +17 -0
- data/test/test_integration.rb +88 -0
- data/test/test_parser.rb +84 -0
- data/test/test_runner.rb +40 -0
- data/test/tmp/.gitkeep +0 -0
- metadata +126 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 502470fc246c67daaa681ca78fb5337899cca7fa
  data.tar.gz: a125ff166156c79e5a0b0d67bf9dfb980b7e0dba
SHA512:
  metadata.gz: ccefac21a401ca860d34c89fdda2473e5a30b51d61223fc8cced50165786f41f328014144bd31486522db34c4e801190060d250cad20408745c691ca937ea1ea
  data.tar.gz: 6c0bee993d99fdec14504e6811549af4dce40cd8930be951a142f8793da69956283ed7ccf6acececa4a6a108e9e9f424d2190b6ba7d9e45f55207b3ee240418d
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Changes.md
ADDED
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,268 @@
Writing reliable, concise, well-tested & maintainable data-processing code is tricky.

Kiba lets you define and run such high-quality ETL jobs, using Ruby.

**Note: this is EARLY WORK - the API/syntax may change at any time.**

[![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba) [![Dependency Status](https://gemnasium.com/thbar/kiba.svg)](https://gemnasium.com/thbar/kiba)

## How do you define ETL jobs with Kiba?

Kiba provides you with a DSL to define ETL jobs:

```ruby
# declare a ruby method here, for quick reusable logic
def parse_french_date(date)
  Date.strptime(date, '%d/%m/%Y')
end

# or better, include a ruby file which loads reusable assets
# eg: commonly used sources / destinations / transforms, under unit-test
require_relative 'common'

# declare a source where to take data from (you implement it - see notes below)
source MyCsvSource, 'input.csv'

# declare a row transform to process a given field
transform do |row|
  row[:birth_date] = parse_french_date(row[:birth_date])
  # return to keep in the pipeline
  row
end

# declare another row transform, dismissing rows conditionally by returning nil
transform do |row|
  row[:birth_date].year < 2000 ? row : nil
end

# declare a row transform as a class, which can be tested properly
transform ComplianceCheckTransform, eula: 2015

# before declaring a definition, maybe you'll want to retrieve credentials
config = YAML.load(IO.read('config.yml'))

# declare a destination - like source, you implement it (see below)
destination MyDatabaseDestination, config['my_database']

# declare a post-processor: a block called after all rows are successfully processed
post_process do
  # do something
end
```

The combination of sources, transforms, destinations and post-processors defines the data processing pipeline.

Note: you are advised to store your ETL definitions as files with the extension `.etl` (rather than `.rb`). This will make sure you do not end up loading them by mistake from another component (eg: a Rails app).

## How do you run your ETL jobs?

You can use the provided command-line tool:

```
bundle exec kiba my-data-processing-script.etl
```

This command essentially starts a two-step process:

```ruby
script_content = IO.read(filename)
# pass the filename to get line numbers on errors
job_definition = Kiba.parse(script_content, filename)
Kiba.run(job_definition)
```

`Kiba.parse` evaluates your ETL Ruby code to register sources, transforms, destinations and post-processors in a job definition. It is important to understand that you can use Ruby logic at the DSL parsing time. This means that code like the following is possible, provided the CSV files are available at parsing time:

```ruby
Dir['to_be_processed/*.csv'].each do |file|
  source MyCsvSource, file
end
```

Once the job definition is loaded, `Kiba.run` will use that information to do the actual row-by-row processing. It currently uses simple, single-threaded, row-by-row processing that stops at the first error encountered.
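
To make the parse step more concrete, here is a minimal sketch of the idea, not Kiba's actual code. `SketchContext` and `SketchCsvSource` are hypothetical stand-ins invented for this example. The sketch only assumes that the script is evaluated with Ruby's `instance_eval`, which accepts the script text plus a file name used in error backtraces.

```ruby
# Hypothetical stand-in for a DSL context: it only records declarations.
class SketchContext
  attr_reader :sources

  def initialize
    @sources = []
  end

  # Same shape as a `source` declaration: a class plus its arguments.
  def source(klass, *args)
    @sources << { klass: klass, args: args }
  end
end

# Placeholder class; it is never instantiated while parsing.
class SketchCsvSource; end

script = <<~ETL
  %w[a.csv b.csv].each do |file|
    source SketchCsvSource, file
  end
ETL

context = SketchContext.new
# The second argument names the script, so errors raised inside it point at that name.
context.instance_eval(script, 'sketch.etl')
registered_files = context.sources.map { |s| s[:args].first }
# registered_files => ["a.csv", "b.csv"]

# A script referencing an unknown constant on its second line.
broken_script = "source SketchCsvSource, 'ok.csv'\nsource UnknownThing\n"
error_location = begin
  SketchContext.new.instance_eval(broken_script, 'broken.etl')
  nil
rescue NameError => e
  e.backtrace.first # mentions broken.etl and line 2
end
```

Nothing is opened or read while the script is evaluated: the loop only registers two declarations, which a later run phase would turn into instances.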

## Implementing ETL sources

In Kiba, you are responsible for implementing the sources that do the extraction of data.

Sources are classes implementing:
- a constructor (to which Kiba will pass the provided arguments in the DSL)
- the `each` method (which should yield rows one by one)

Rows are usually `Hash` instances, but could be other structures as long as the rest of your pipeline expects them.

Since sources are classes, you can (and are encouraged to) unit test them and reuse them.

Here is a simple CSV source:

```ruby
require 'csv'

class MyCsvSource
  def initialize(input_file)
    @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
  end

  def each
    @csv.each do |row|
      yield(row.to_hash)
    end
    @csv.close
  end
end
```
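
Because the contract is just a constructor plus `each`, a source does not have to read a file. The sketch below uses a hypothetical `ArraySource` (a name made up for this example) to show the contract on in-memory data and how a unit test could exercise it without running a pipeline.

```ruby
# Hypothetical in-memory source: same contract as a file-backed source
# (a constructor plus an `each` that yields rows one by one).
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each
    @rows.each { |row| yield(row) }
  end
end

# Exercise the source directly, the way a unit test would.
source = ArraySource.new([{ name: 'Mary' }, { name: 'John' }])
collected = []
source.each { |row| collected << row }
# collected => [{ name: 'Mary' }, { name: 'John' }]
```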

## Implementing row transforms

Row transforms can be implemented in two ways: as blocks, or as classes.

### Row transform as a block

A row transform written as a block receives the row as its parameter:

```ruby
transform do |row|
  row[:this_field] = row[:that_field] * 10
  # make sure to return the row to keep it in the pipeline
  row
end
```

To dismiss a row from the pipeline, simply return `nil` from a transform:

```ruby
transform { |row| row[:index] % 2 == 0 ? row : nil }
```

### Row transform as a class

If you implement the transform as a class, it must respond to `process(row)`:

```ruby
class SamplingTransform
  def initialize(modulo_value)
    @modulo_value = modulo_value
  end

  def process(row)
    row[:index] % @modulo_value == 0 ? row : nil
  end
end
```

You'll use it this way in your ETL declaration (the parameters will be passed to `initialize`):

```ruby
# only keep 1 row out of 10
transform SamplingTransform, 10
```

Like the block form, it can return `nil` to dismiss the row. The class form allows better testability and reusability across your ETL scripts.
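
As an illustration of that testability, a class transform can be checked by calling `process` directly, with no pipeline involved. The sketch below repeats the `SamplingTransform` class from this section so that it runs on its own. The sample rows are made up for the example.

```ruby
# Same class as in the section above, repeated so the sketch is self-contained.
class SamplingTransform
  def initialize(modulo_value)
    @modulo_value = modulo_value
  end

  def process(row)
    row[:index] % @modulo_value == 0 ? row : nil
  end
end

transform = SamplingTransform.new(10)
rows = (0...30).map { |i| { index: i } }

# Dismissed rows come back as nil; `compact` drops them.
kept = rows.map { |row| transform.process(row) }.compact
kept_indexes = kept.map { |row| row[:index] }
# kept_indexes => [0, 10, 20]
```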

## Implementing ETL destinations

Like sources, destinations are classes that you provide. Destinations must implement:
- a constructor (to which Kiba will pass the provided arguments in the DSL)
- a `write(row)` method that will be called for each non-dismissed row
- a `close` method that will be called at the end of the processing

Here is an example destination:

```ruby
require 'csv'

# simple destination assuming all rows have the same fields
class MyCsvDestination
  def initialize(output_file)
    @csv = CSV.open(output_file, 'w')
  end

  def write(row)
    unless @headers_written
      @headers_written = true
      @csv << row.keys
    end
    @csv << row.values
  end

  def close
    @csv.close
  end
end
```
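
The same three-method contract can be met by an in-memory object, which may be convenient when testing a pipeline. `ArrayDestination` below is a hypothetical class written for this sketch. It is not shipped with Kiba.

```ruby
# Hypothetical in-memory destination: collects rows instead of writing a file.
class ArrayDestination
  attr_reader :rows

  def initialize(rows = [])
    @rows = rows
    @closed = false
  end

  # Called once per non-dismissed row.
  def write(row)
    raise 'destination already closed' if @closed
    @rows << row
  end

  # Called once at the end of the processing.
  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

destination = ArrayDestination.new
destination.write({ name: 'Mary' })
destination.write({ name: 'Cindy' })
destination.close
```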

## Implementing post-processors

Post-processors are currently blocks, which get called once, after the ETL
has successfully processed all the rows. They won't get called if an error occurred.

```ruby
count = 0

transform do |row|
  count += 1
  row
end

post_process do
  Email.send(supervisor_address, "#{count} rows successfully processed")
end
```
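
The behaviour described in this section can be sketched with plain lambdas. `run_sketch` below is a simplified, hypothetical stand-in for the run phase (block transforms only, results collected in an array). It is not Kiba's runner.

```ruby
# Simplified stand-in for the run phase: block transforms only.
def run_sketch(rows, transforms, post_processes)
  output = []
  rows.each do |row|
    transforms.each do |transform|
      row = transform.call(row)
      break unless row # a nil row is dismissed; later transforms are skipped
    end
    output << row if row
  end
  # Only reached when no transform raised an error.
  post_processes.each(&:call)
  output
end

# The transform and the post-processor share `count` through their closures.
count = 0
counting = lambda { |row| count += 1; row }
report = nil
written = run_sketch([1, 2, 3], [counting], [lambda { report = "#{count} rows processed" }])
# report => "3 rows processed"

failed_report = nil
begin
  run_sketch([1], [lambda { |_row| raise 'FAIL' }], [lambda { failed_report = 'called' }])
rescue RuntimeError
  # the error propagates and the post-processor is skipped
end
# failed_report => nil
```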

## Composability, reusability, testability of Kiba components

The way Kiba works makes it easy to create reusable, well-tested ETL components and jobs.

The main reason for this is that a Kiba ETL script can `require` shared Ruby code, which allows you to:
- create well-tested, reusable sources & destinations
- create macro-transforms as methods, to be reused across sister scripts
- substitute one component for another (e.g.: try a variant of a destination)
- use a centralized place for configuration (credentials, IP addresses, etc.)

The fact that the DSL evaluation "runs" the script also allows for simple meta-programming techniques, like pre-reading a source file to extract field names, to be used in transform definitions.
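
Here is a rough sketch of that pre-reading technique, under a few assumptions: the input is a simple CSV file whose header contains no quoted commas, and the file name and field names are invented for the example. In a real ETL script each generated block would be handed to `transform`. Here the blocks are called directly so the sketch runs on its own.

```ruby
require 'tempfile'

# Hypothetical input, written to a temp file so the sketch is self-contained.
input = Tempfile.new(['people', '.csv'])
input.write("first_name,last_name\n Mary ,Johnson \n")
input.close

# "Parse time": read only the header line to discover the field names.
# Naive split; a header with quoted commas would need a real CSV parser.
field_names = File.open(input.path, &:readline).chomp.split(',').map(&:to_sym)

# Build one whitespace-stripping block per discovered field.
strip_transforms = field_names.map do |field|
  lambda do |row|
    row[field] = row[field].strip
    row
  end
end

# "Run time": apply the generated blocks to a row, one after the other.
row = { first_name: ' Mary ', last_name: 'Johnson ' }
cleaned = strip_transforms.reduce(row) { |current, transform| transform.call(current) }
# cleaned => { first_name: 'Mary', last_name: 'Johnson' }

input.unlink
```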

The ability to support that DSL, but also to check command-line arguments and environment variables, tweak behaviour as needed, or call other, faster specialized tools makes Ruby an asset for implementing ETL jobs.
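
For instance, an ETL script could read its settings from the environment before declaring anything. The variable names `ETL_INPUT_DIR` and `ETL_SAMPLE_RATE` below are hypothetical, chosen for this sketch. Kiba itself does not read them.

```ruby
# Hypothetical variable names; both fall back to a default when unset.
input_dir = ENV.fetch('ETL_INPUT_DIR', 'to_be_processed')
sample_rate = Integer(ENV.fetch('ETL_SAMPLE_RATE', '10'))

# The resulting values can then drive the declarations in the script,
# e.g. a glob of input files and a sampling modulo.
input_glob = File.join(input_dir, '*.csv')
```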

Make sure to subscribe to my [Ruby ETL blog](http://thibautbarrere.com) where I'll demonstrate such techniques over time!

## History & Credits

Wow, you're still there? Nice to meet you. I'm [Thibaut](http://thibautbarrere.com), author of Kiba.

I first met the idea of row-based syntax when I started using [Anthony Eden](https://github.com/aeden)'s [Activewarehouse-ETL](https://github.com/activewarehouse/activewarehouse-etl), first published around 2006 (I think), in which Anthony applied the core principles defined by Ralph Kimball in [The Data Warehouse ETL Toolkit](http://www.amazon.com/gp/product/0764567578).

I've been writing and maintaining a number of production ETL systems using Activewarehouse-ETL, then later with an ancestor of Kiba which was named TinyTL.

I took over the maintenance of Activewarehouse-ETL circa 2009/2010, but over time, I could not properly update & document it, given the gradual failure of a large number of dependencies and components. Ultimately in 2014 I had to stop maintaining it, after an already long hiatus.

That said, using Activewarehouse-ETL for so long made me realize the row-based processing syntax was great and provided some great assets for maintainability over long time-spans.

Kiba is a completely fresh & minimalistic-on-purpose implementation of that row-based processing pattern.

It is minimalistic to make it more likely that I will be able to maintain it over time.

It makes strong simplicity assumptions (like letting you define the sources, transforms & destinations). MiniTest is an inspiration.

As I developed Kiba, I realized how much this simplicity opens the road for interesting developments such as multi-threaded & multi-process processing.

Last word: Kiba is 100% sponsored by my company LoGeek SARL (also provider of [WiseCash, a lightweight cash-flow forecasting app](https://www.wisecashhq.com)).

## License

Copyright (c) LoGeek SARL.

Kiba is an Open Source project licensed under the terms of
the LGPLv3 license. Please see <http://www.gnu.org/licenses/lgpl-3.0.html>
for license text.

## Contributing & Legal

Until the API is more stable, I can only accept documentation Pull Requests.

(agreement below borrowed from [Sidekiq Legal](https://github.com/mperham/sidekiq/blob/master/Contributing.md))

By submitting a Pull Request, you disavow any rights or claims to any changes submitted to the Kiba project and assign the copyright of those changes to LoGeek SARL.

If you cannot or do not want to reassign those rights (your employment contract for your employer may not allow this), you should not submit a PR. Open an issue and someone else can do the work.

This is a legal way of saying "If you submit a PR to us, that code becomes ours". 99.9% of the time that's what you intend anyways; we hope it doesn't scare you away from contributing.
data/Rakefile
ADDED
data/bin/kiba
ADDED
data/kiba.gemspec
ADDED
@@ -0,0 +1,20 @@
# -*- encoding: utf-8 -*-
require File.expand_path('../lib/kiba/version', __FILE__)

Gem::Specification.new do |gem|
  gem.authors = ["Thibaut Barrère"]
  gem.email = ["thibaut.barrere@gmail.com"]
  gem.description = gem.summary = "Lightweight ETL for Ruby"
  gem.homepage = "http://thbar.github.io/kiba/"
  gem.license = "LGPL-3.0"
  gem.files = `git ls-files | grep -Ev '^(examples)'`.split("\n")
  gem.test_files = `git ls-files -- test/*`.split("\n")
  gem.name = "kiba"
  gem.require_paths = ["lib"]
  gem.version = Kiba::VERSION
  gem.executables = ['kiba']

  gem.add_development_dependency 'rake'
  gem.add_development_dependency 'minitest'
  gem.add_development_dependency 'awesome_print'
end
data/lib/kiba.rb
ADDED
data/lib/kiba/cli.rb
ADDED
@@ -0,0 +1,16 @@
require 'kiba'

module Kiba
  class Cli
    def self.run(args)
      unless args.size == 1
        puts "Syntax: kiba your-script.etl"
        exit -1
      end
      filename = args[0]
      script_content = IO.read(filename)
      job_definition = Kiba.parse(script_content, filename)
      Kiba.run(job_definition)
    end
  end
end
data/lib/kiba/context.rb
ADDED
@@ -0,0 +1,28 @@
module Kiba
  class Context
    def initialize(control)
      # TODO: forbid access to control from context? use cleanroom?
      @control = control
    end

    def source(klass, *initialization_params)
      @control.sources << {klass: klass, args: initialization_params}
    end

    def transform(klass = nil, *initialization_params, &block)
      if klass
        @control.transforms << {klass: klass, args: initialization_params}
      else
        @control.transforms << block
      end
    end

    def destination(klass, *initialization_params)
      @control.destinations << {klass: klass, args: initialization_params}
    end

    def post_process(&block)
      @control.post_processes << block
    end
  end
end
data/lib/kiba/control.rb
ADDED
data/lib/kiba/parser.rb
ADDED
@@ -0,0 +1,15 @@
module Kiba
  module Parser
    def parse(source_as_string = nil, source_file = nil, &source_as_block)
      control = Control.new
      context = Context.new(control)
      if source_as_string
        # this somewhat weird construct drops source_file from the arguments when it is nil
        context.instance_eval(*[source_as_string, source_file].compact)
      else
        context.instance_eval(&source_as_block)
      end
      control
    end
  end
end
data/lib/kiba/runner.rb
ADDED
@@ -0,0 +1,44 @@
module Kiba
  module Runner
    def run(control)
      sources = to_instances(control.sources)
      destinations = to_instances(control.destinations)
      transforms = to_instances(control.transforms, true)
      # not using keyword args because JRuby defaults to 1.9 syntax currently
      post_processes = to_instances(control.post_processes, true, false)

      sources.each do |source|
        source.each do |row|
          transforms.each_with_index do |transform, index|
            if transform.is_a?(Proc)
              row = transform.call(row)
            else
              row = transform.process(row)
            end
            break unless row
          end
          next unless row
          destinations.each do |destination|
            destination.write(row)
          end
        end
      end

      destinations.each(&:close)
      post_processes.each(&:call)
    end

    def to_instances(definitions, allow_block = false, allow_class = true)
      definitions.map do |d|
        case d
        when Proc
          raise "Block form is not allowed here" unless allow_block
          d
        else
          raise "Class form is not allowed here" unless allow_class
          d[:klass].new(*d[:args])
        end
      end
    end
  end
end
data/lib/kiba/version.rb
ADDED
data/test/fixtures/valid.etl
ADDED
@@ -0,0 +1 @@
# this does nothing
data/test/helper.rb
ADDED
@@ -0,0 +1,17 @@
require 'minitest/autorun'
require 'minitest/pride'
require 'kiba'

class Kiba::Test < Minitest::Test
  extend Minitest::Spec::DSL

  def remove_files(*files)
    files.each do |file|
      File.delete(file) if File.exist?(file)
    end
  end

  def fixture(file)
    File.join(File.dirname(__FILE__), 'fixtures', file)
  end
end
data/test/support/test_csv_destination.rb
ADDED
@@ -0,0 +1,21 @@
require 'csv'

# simple destination, not checking that each row has all the fields
class TestCsvDestination
  def initialize(output_file)
    @csv = CSV.open(output_file, 'w')
    @headers_written = false
  end

  def write(row)
    unless @headers_written
      @headers_written = true
      @csv << row.keys
    end
    @csv << row.values
  end

  def close
    @csv.close
  end
end
data/test/test_cli.rb
ADDED
@@ -0,0 +1,17 @@
require_relative 'helper'
require 'kiba/cli'

class TestCli < Kiba::Test
  def test_cli_launches
    Kiba::Cli.run([fixture('valid.etl')])
  end

  def test_cli_reports_filename_and_lineno
    exception = assert_raises(NameError) do
      Kiba::Cli.run([fixture('bogus.etl')])
    end

    assert_match /uninitialized constant (.*)UnknownThing/, exception.message
    assert_includes exception.backtrace.to_s, 'test/fixtures/bogus.etl:2:in'
  end
end
data/test/test_integration.rb
ADDED
@@ -0,0 +1,88 @@
require_relative 'helper'

require_relative 'support/test_csv_source'
require_relative 'support/test_csv_destination'
require_relative 'support/test_rename_field_transform'

# End-to-end tests go here
class TestIntegration < Kiba::Test
  let(:output_file) { 'test/tmp/output.csv' }
  let(:input_file) { 'test/tmp/input.csv' }

  let(:sample_csv_data) do <<CSV
first_name,last_name,sex
John,Doe,M
Mary,Johnson,F
Cindy,Backgammon,F
Patrick,McWire,M
CSV
  end

  def setup
    remove_files(input_file, output_file)
    IO.write(input_file, sample_csv_data)
  end

  def teardown
    remove_files(input_file, output_file)
  end

  def test_csv_to_csv
    # parse the ETL script (this won't run it)
    control = Kiba.parse do
      source TestCsvSource, 'test/tmp/input.csv'

      transform do |row|
        row[:sex] = case row[:sex]
                    when 'M'; 'Male'
                    when 'F'; 'Female'
                    else 'Unknown'
                    end
        row # must be returned
      end

      # returning nil dismisses the row
      transform do |row|
        row[:sex] == 'Female' ? row : nil
      end

      transform TestRenameFieldTransform, :sex, :sex_2015

      destination TestCsvDestination, 'test/tmp/output.csv'
    end

    # run the parsed ETL script
    Kiba.run(control)

    # verify the output
    assert_equal <<CSV, IO.read(output_file)
first_name,last_name,sex_2015
Mary,Johnson,Female
Cindy,Backgammon,Female
CSV
  end

  def test_variable_access
    message = nil

    control = Kiba.parse do
      source TestEnumerableSource, [1, 2, 3]

      count = 0

      transform do |r|
        count += 1
        r
      end

      post_process do
        message = "#{count} rows processed"
      end
    end

    Kiba.run(control)

    assert_equal '3 rows processed', message
  end

end
data/test/test_parser.rb
ADDED
@@ -0,0 +1,84 @@
require_relative 'helper'

require_relative 'support/test_rename_field_transform'

class DummyClass
end

class TestParser < Kiba::Test
  def test_source_definition
    control = Kiba.parse do
      source DummyClass, 'has', 'args'
    end

    assert_equal DummyClass, control.sources[0][:klass]
    assert_equal ['has', 'args'], control.sources[0][:args]
  end

  def test_block_transform_definition
    control = Kiba.parse do
      transform { |row| row }
    end

    assert_instance_of Proc, control.transforms[0]
  end

  def test_class_transform_definition
    control = Kiba.parse do
      transform TestRenameFieldTransform, :last_name, :name
    end

    assert_equal TestRenameFieldTransform, control.transforms[0][:klass]
    assert_equal [:last_name, :name], control.transforms[0][:args]
  end

  def test_destination_definition
    control = Kiba.parse do
      destination DummyClass, 'has', 'args'
    end

    assert_equal DummyClass, control.destinations[0][:klass]
    assert_equal ['has', 'args'], control.destinations[0][:args]
  end

  def test_block_post_process_definition
    control = Kiba.parse do
      post_process { }
    end

    assert_instance_of Proc, control.post_processes[0]
  end

  def test_source_as_string_parsing
    control = Kiba.parse <<RUBY
      source DummyClass, 'from', 'file'
RUBY

    assert_equal 1, control.sources.size
    assert_equal DummyClass, control.sources[0][:klass]
    assert_equal ['from', 'file'], control.sources[0][:args]
  end

  def test_source_as_file_doing_require
    IO.write 'test/tmp/etl-common.rb', <<RUBY
      def common_source_declaration
        source DummyClass, 'from', 'common'
      end
RUBY
    IO.write 'test/tmp/etl-main.rb', <<RUBY
      require './test/tmp/etl-common.rb'

      source DummyClass, 'from', 'main'
      common_source_declaration
RUBY
    control = Kiba.parse IO.read('test/tmp/etl-main.rb')

    assert_equal 2, control.sources.size

    assert_equal ['from', 'main'], control.sources[0][:args]
    assert_equal ['from', 'common'], control.sources[1][:args]

  ensure
    remove_files('test/tmp/etl-common.rb', 'test/tmp/etl-main.rb')
  end
end
data/test/test_runner.rb
ADDED
@@ -0,0 +1,40 @@
require_relative 'helper'

require_relative 'support/test_enumerable_source'

class TestRunner < Kiba::Test
  let(:control) do
    control = Kiba::Control.new
    # this will yield a single row for testing
    control.sources << {klass: TestEnumerableSource, args: [[{field: 'value'}]]}
    control
  end

  def test_block_transform_processing
    # is there a better way to assert a block was called in minitest?
    control.transforms << lambda { |r| @called = true; r }
    Kiba.run(control)
    assert_equal true, @called
  end

  def test_dismissed_row_not_passed_to_next_transform
    control.transforms << lambda { |r| nil }
    control.transforms << lambda { |r| @called = true; nil}
    Kiba.run(control)
    assert_nil @called
  end

  def test_post_process_runs
    control.post_processes << lambda { @called = true }
    Kiba.run(control)
    assert_equal true, @called
  end

  def test_post_process_not_called_after_row_failure
    control.transforms << lambda { |r| raise 'FAIL' }
    control.post_processes << lambda { @called = true }
    assert_raises(RuntimeError, 'FAIL') { Kiba.run(control) }
    assert_nil @called
  end

end
data/test/tmp/.gitkeep
ADDED
File without changes
metadata
ADDED
@@ -0,0 +1,126 @@
--- !ruby/object:Gem::Specification
name: kiba
version: !ruby/object:Gem::Version
  version: 0.5.0
platform: ruby
authors:
- Thibaut Barrère
autorequire:
bindir: bin
cert_chain: []
date: 2015-04-18 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: minitest
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: awesome_print
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
description: Lightweight ETL for Ruby
email:
- thibaut.barrere@gmail.com
executables:
- kiba
extensions: []
extra_rdoc_files: []
files:
- ".gitignore"
- ".travis.yml"
- Changes.md
- Gemfile
- README.md
- Rakefile
- bin/kiba
- kiba.gemspec
- lib/kiba.rb
- lib/kiba/cli.rb
- lib/kiba/context.rb
- lib/kiba/control.rb
- lib/kiba/parser.rb
- lib/kiba/runner.rb
- lib/kiba/version.rb
- test/fixtures/bogus.etl
- test/fixtures/valid.etl
- test/helper.rb
- test/support/test_csv_destination.rb
- test/support/test_csv_source.rb
- test/support/test_enumerable_source.rb
- test/support/test_rename_field_transform.rb
- test/test_cli.rb
- test/test_integration.rb
- test/test_parser.rb
- test/test_runner.rb
- test/tmp/.gitkeep
homepage: http://thbar.github.io/kiba/
licenses:
- LGPL-3.0
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 2.4.3
signing_key:
specification_version: 4
summary: Lightweight ETL for Ruby
test_files:
- test/fixtures/bogus.etl
- test/fixtures/valid.etl
- test/helper.rb
- test/support/test_csv_destination.rb
- test/support/test_csv_source.rb
- test/support/test_enumerable_source.rb
- test/support/test_rename_field_transform.rb
- test/test_cli.rb
- test/test_integration.rb
- test/test_parser.rb
- test/test_runner.rb
- test/tmp/.gitkeep