kiba 1.0.0 → 2.0.0.rc1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 622b10fe7f524f66152c10ab8618d6a661831776
4
- data.tar.gz: 7410d79e2dec5bb48e9f1079e94c87260226d66c
3
+ metadata.gz: ca18859887a38d8eee1afa9ae170ad8fc2a8b63f
4
+ data.tar.gz: d27a6985d90a1ab73315b4f859bca2e105ba66a7
5
5
  SHA512:
6
- metadata.gz: 5fe1c537d31ccc49446316f40c76630f724774431a3914019b870ce4cbdc3d34ed7eda4b66b622855d7d6972f435a0b8f831567f9b53b159e2836257b766671c
7
- data.tar.gz: 60c2e8929bdd350308c5f377b0a6c989c3473f147903a90dab7ab9b4b58f6ec06cdc038094c5bca2ebd593defe2f0e7994752bdef325bd4ccbd7e04ef27f86a9
6
+ metadata.gz: dc66fedc5922ee6b63a4cfdd647e71592af3ae8d2851600008a5af0c165252583c1fdf15b064ff691cf19b6ebe004edcf1062fb0091c45f7b2c7e3a5e7e5bc27
7
+ data.tar.gz: d02d8e6293dbfa6ca2b0372ff36ab3548ef3608ce5554f44ee82a04f1d0a17cbb86838de5745721260a357943c11429012bbd3dbccd59abff08f5b29226f2454
@@ -1,10 +1,15 @@
1
1
  language: ruby
2
2
  before_install:
3
+ # https://stackoverflow.com/a/47972768
4
+ - gem update --system
3
5
  - gem update bundler
4
6
  rvm:
5
- - 2.3.0 # 2.3 won't work here (RVM issue afaik)
7
+ - ruby-head
8
+ - 2.5.0
9
+ - 2.4.3
10
+ - 2.3
6
11
  - 2.2
7
12
  - 2.1
8
13
  - 2.0
9
14
  - jruby-1.7
10
- - jruby-9
15
+ - jruby-9.1.15.0
data/Changes.md CHANGED
@@ -1,3 +1,9 @@
1
+ 2.0.0.rc1 (unreleased)
2
+ ----------------------
3
+
4
+ - New (opt-in) StreamingRunner allows class transforms to generate more than one row. See [#44](https://github.com/thbar/kiba/pull/44) for rationale & how to activate.
5
+ - Potentially breaking change if you were using the internal class `Kiba::Parser` directly: ETL jobs parsing has been modified to improve the isolation between the evaluation scope and the Kiba classes. See [#46](https://github.com/thbar/kiba/pull/46) for more information.
6
+
1
7
  1.0.0
2
8
  -----
3
9
 
data/LICENSE ADDED
@@ -0,0 +1,5 @@
1
+ Copyright (c) LoGeek SARL
2
+
3
+ Kiba Common is an Open Source project licensed under the terms of
4
+ the LGPLv3 license. Please see <http://www.gnu.org/licenses/lgpl-3.0.html>
5
+ for license text.
@@ -0,0 +1,31 @@
1
+ Kiba Pro Changelog
2
+ ==================
3
+
4
+ Kiba Pro is the commercial extension for Kiba. Documentation is available on the [Wiki](https://github.com/thbar/kiba/wiki).
5
+
6
+ HEAD
7
+ -------
8
+
9
+ 1.0.0.rc1
10
+ ---------
11
+
12
+ NOTE: documentation & requirements/compatibility are available on the [wiki](https://github.com/thbar/kiba/wiki).
13
+
14
+ - New: `SQLUpsert` destination allowing row-by-row "insert or update".
15
+ - New: `SQL` source allowing efficient streaming of large volumes of SQL rows while controlling memory consumption.
16
+ - Improvement: `SQLBulkInsert` can now be used from a Sidekiq job.
17
+
18
+ 0.9.0
19
+ -----
20
+
21
+ - Multiple improvements to `SQLBulkInsert`:
22
+ - New flexible `row_pre_processor` option, which lets you either remove a row conditionally (useful to conditionally target a given destination amongst many) or replace it with N dynamically computed target rows.
23
+ - New callbacks: `after_initialize` & `before_flush` (useful to enforce dependent destinations flush & ensure required foreign keys constraints are respected).
24
+ - `logger` support.
25
+ - Bugfix: make sure to `disconnect` on `close`.
26
+ - Extra safety checks on row keys.
27
+
28
+ 0.4.0
29
+ -----
30
+
31
+ - Initial release of the `SQLBulkInsert` destination (providing fast SQL INSERT).
data/README.md CHANGED
@@ -1,306 +1,59 @@
1
- **Foreword - if you need help**: please [ask your question with tag kiba-etl on StackOverflow](http://stackoverflow.com/questions/ask?tags=kiba-etl) so that other can benefit from your contribution! I monitor this specific tag and will reply to you.
1
+ **If you need help**, please [ask your question with tag kiba-etl on StackOverflow](http://stackoverflow.com/questions/ask?tags=kiba-etl) so that others can benefit from your contribution! I monitor this specific tag and will reply to you.
2
2
 
3
3
  Writing reliable, concise, well-tested & maintainable data-processing code is tricky.
4
4
 
5
- Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs, using Ruby (see [supported versions](#supported-ruby-versions)).
5
+ Kiba lets you define and run such high-quality ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load)) jobs using Ruby.
6
6
 
7
- Learn more on the [Kiba blog](http://thibautbarrere.com) and on [StackOverflow](http://stackoverflow.com/questions/tagged/kiba-etl):
8
-
9
- * [Live Coding Session - Processing data with Kiba ETL](http://thibautbarrere.com/2015/11/09/video-processing-data-with-kiba-etl/)
10
- * [Rubyists - are you doing ETL unknowningly?](http://thibautbarrere.com/2015/03/25/rubyists-are-you-doing-etl-unknowingly/)
11
- * [How to write solid data processing code](http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code/)
12
- * [How to reformat CSV files with Kiba](http://thibautbarrere.com/2015/06/04/how-to-reformat-csv-files-with-kiba/) (in-depth, hands-on tutorial)
13
- * [How to explode multivalued attributes with Kiba ETL?](http://thibautbarrere.com/2015/06/25/how-to-explode-multivalued-attributes-with-kiba/)
14
- * [Common techniques to compute aggregates with Kiba](https://stackoverflow.com/questions/31145715/how-to-do-a-aggregation-transformation-in-a-kiba-etl-script-kiba-gem)
15
- * [How to run Kiba in a Rails environment?](http://thibautbarrere.com/2015/09/26/how-to-run-kiba-in-a-rails-environment/)
16
-
17
- **Consulting services**: if your organization needs to leverage data processing to solve a given business problem, I'm available to help you out via consulting sessions. [More information](http://thibautbarrere.com/hire-me/).
18
-
19
- **Kiba Pro**: I'm working on a Pro version ([read more here](https://github.com/thbar/kiba/issues/20)) which will provide more advanced features and built-in goodies in exchange for a yearly subscription. This will also make sure I can support Kiba for the many years to come. [Chime in](https://github.com/thbar/kiba/issues/20) if your company is interested!
7
+ Learn more on the [Wiki](https://github.com/thbar/kiba/wiki), on my [blog](http://thibautbarrere.com) and on [StackOverflow](http://stackoverflow.com/questions/tagged/kiba-etl).
20
8
 
21
9
  [![Gem Version](https://badge.fury.io/rb/kiba.svg)](http://badge.fury.io/rb/kiba)
22
- [![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba) [![Dependency Status](https://gemnasium.com/thbar/kiba.svg)](https://gemnasium.com/thbar/kiba)
23
-
24
- ## How do you define ETL jobs with Kiba?
25
-
26
- Kiba provides you with a DSL to define ETL jobs:
27
-
28
- ```ruby
29
- # declare a ruby method here, for quick reusable logic
30
- def parse_french_date(date)
31
- Date.strptime(date, '%d/%m/%Y')
32
- end
33
-
34
- # or better, include a ruby file which loads reusable assets
35
- # eg: commonly used sources / destinations / transforms, under unit-test
36
- require_relative 'common'
37
-
38
- # declare a pre-processor: a block called before the first row is read
39
- pre_process do
40
- # do something
41
- end
42
-
43
- # declare a source where to take data from (you implement it - see notes below)
44
- source MyCsvSource, 'input.csv'
45
-
46
- # declare a row transform to process a given field
47
- transform do |row|
48
- row[:birth_date] = parse_french_date(row[:birth_date])
49
- # return to keep in the pipeline
50
- row
51
- end
52
-
53
- # declare another row transform, dismissing rows conditionally by returning nil
54
- transform do |row|
55
- row[:birth_date].year < 2000 ? row : nil
56
- end
57
-
58
- # declare a row transform as a class, which can be tested properly
59
- transform ComplianceCheckTransform, eula: 2015
60
-
61
- # before declaring a definition, maybe you'll want to retrieve credentials
62
- config = YAML.load(IO.read('config.yml'))
63
-
64
- # declare a destination - like source, you implement it (see below)
65
- destination MyDatabaseDestination, config['my_database']
66
-
67
- # declare a post-processor: a block called after all rows are successfully processed
68
- post_process do
69
- # do something
70
- end
71
- ```
72
-
73
- The combination of pre-processors, sources, transforms, destinations and post-processors defines the data processing pipeline.
74
-
75
- Note: you are advised to store your ETL definitions as files with the extension `.etl` (rather than `.rb`). This will make sure you do not end up loading them by mistake from another component (eg: a Rails app).
76
-
77
- ## How do you run your ETL jobs?
78
-
79
- You can use the provided command-line:
80
-
81
- ```
82
- bundle exec kiba my-data-processing-script.etl
83
- ```
84
-
85
- This command essentially starts a two-step process:
86
-
87
- ```ruby
88
- script_content = IO.read(filename)
89
- # pass the filename to get for line numbers on errors
90
- job_definition = Kiba.parse(script_content, filename)
91
- Kiba.run(job_definition)
92
- ```
93
-
94
- `Kiba.parse` evaluates your ETL Ruby code to register sources, transforms, destinations and post-processors in a job definition. It is important to understand that you can use Ruby logic at the DSL parsing time. This means that such code is possible, provided the CSV files are available at parsing time:
95
-
96
- ```ruby
97
- Dir['to_be_processed/*.csv'].each do |file|
98
- source MyCsvSource, file
99
- end
100
- ```
101
-
102
- Once the job definition is loaded, `Kiba.run` will use that information to do the actual row-by-row processing. It currently uses a simple row-by-row, single-threaded processing that will stop at the first error encountered.
103
-
104
- ## Implementing ETL sources
105
-
106
- In Kiba, you are responsible for implementing the sources that do the extraction of data.
107
-
108
- Sources are classes implementing:
109
- - a constructor (to which Kiba will pass the provided arguments in the DSL)
110
- - the `each` method (which should yield rows one by one)
111
-
112
- Rows are usually `Hash` instances, but could be other structures as long as the rest of your pipeline is expecting it.
113
-
114
- Since sources are classes, you can (and are encouraged to) unit test them and reuse them.
115
-
116
- Here is a simple CSV source:
117
-
118
- ```ruby
119
- require 'csv'
120
-
121
- class MyCsvSource
122
- def initialize(input_file)
123
- @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
124
- end
125
-
126
- def each
127
- @csv.each do |row|
128
- yield(row.to_hash)
129
- end
130
- @csv.close
131
- end
132
- end
133
- ```
134
-
135
- ## Implementing row transforms
136
-
137
- Row transforms can implemented in two ways: as blocks, or as classes.
138
-
139
- ### Row transform as a block
10
+ [![Build Status](https://travis-ci.org/thbar/kiba.svg?branch=master)](https://travis-ci.org/thbar/kiba) [![Build status](https://ci.appveyor.com/api/projects/status/v05jcyhpp1mueq9i?svg=true)](https://ci.appveyor.com/project/thbar/kiba) [![Code Climate](https://codeclimate.com/github/thbar/kiba/badges/gpa.svg)](https://codeclimate.com/github/thbar/kiba) [![Dependency Status](https://gemnasium.com/thbar/kiba.svg)](https://gemnasium.com/thbar/kiba)
140
11
 
141
- When writing a row transform as a block, it will be passed the row as parameter:
12
+ ## Note on upcoming Kiba 2.0.0
142
13
 
143
- ```ruby
144
- transform do |row|
145
- row[:this_field] = row[:that_field] * 10
146
- # make sure to return the row to keep it in the pipeline
147
- row
148
- end
149
- ```
14
+ Kiba 2.0.0 (available on `master`) includes an improved engine called the `StreamingRunner`, which allows transforms to generate more than one output row for each input row. See [#44](https://github.com/thbar/kiba/pull/44) for documentation on benefits & how to activate.
150
15
 
151
- To dismiss a row from the pipeline, simply return `nil` from a transform:
16
+ ## Getting Started
152
17
 
153
- ```ruby
154
- transform { |row| row[:index] % 2 == 0 ? row : nil }
155
- ```
18
+ * [How do you define ETL jobs with Kiba?](https://github.com/thbar/kiba/wiki/How-do-you-define-ETL-jobs-with-Kiba%3F)
19
+ * [How do you run your ETL jobs?](https://github.com/thbar/kiba/wiki/How-do-you-run-your-ETL-jobs%3F)
20
+ * [Implementing ETL sources](https://github.com/thbar/kiba/wiki/Implementing-ETL-sources).
21
+ * [Implementing ETL transforms](https://github.com/thbar/kiba/wiki/Implementing-ETL-transforms).
22
+ * [Implementing ETL destinations](https://github.com/thbar/kiba/wiki/Implementing-ETL-destinations).
23
+ * [Implementing pre and post-processors](https://github.com/thbar/kiba/wiki/Implementing-pre-and-post-processors).
156
24
 
157
- ### Row transform as a class
25
+ ## Useful links
158
26
 
159
- If you implement the transform as a class, it must respond to `process(row)`:
160
-
161
- ```ruby
162
- class SamplingTransform
163
- def initialize(modulo_value)
164
- @modulo_value = modulo_value
165
- end
166
-
167
- def process(row)
168
- row[:index] % @modulo_value == 0 ? row : nil
169
- end
170
- end
171
- ```
172
-
173
- You'll use it this way in your ETL declaration (the parameters will be passed to initialize):
174
-
175
- ```ruby
176
- # only keep 1 row over 10
177
- transform SamplingTransform, 10
178
- ```
179
-
180
- Like the block form, it can return `nil` to dismiss the row. The class form allows better testability and reusability across your(s) ETL script(s).
181
-
182
- ## Implementing ETL destinations
183
-
184
- Like sources, destinations are classes that you are providing. Destinations must implement:
185
- - a constructor (to which Kiba will pass the provided arguments in the DSL)
186
- - a `write(row)` method that will be called for each non-dismissed row
187
- - an optional `close` method that will be called, if present, at the end of the processing (useful to tear down resources such as connections)
188
-
189
- Here is an example destination:
190
-
191
- ```ruby
192
- require 'csv'
193
-
194
- # simple destination assuming all rows have the same fields
195
- class MyCsvDestination
196
- def initialize(output_file)
197
- @csv = CSV.open(output_file, 'w')
198
- end
199
-
200
- def write(row)
201
- unless @headers_written
202
- @headers_written = true
203
- @csv << row.keys
204
- end
205
- @csv << row.values
206
- end
207
-
208
- def close
209
- @csv.close
210
- end
211
- end
212
- ```
213
-
214
- ## Implementing pre and post-processors
215
-
216
- Pre-processors and post-processors are currently blocks, which get called only once per ETL run:
217
- - Pre-processors get called before the ETL starts reading rows from the sources.
218
- - Post-processors get invoked after the ETL successfully processed all the rows.
219
-
220
- Note that post-processors won't get called if an error occurred earlier.
221
-
222
- ```ruby
223
- count = 0
224
-
225
- def system!(cmd)
226
- fail "Command #{cmd} failed" unless system(cmd)
227
- end
228
-
229
- file = 'my_file.csv'
230
- sample_file = 'my_file.sample.csv'
231
-
232
- pre_process do
233
- # it's handy to work with a reduced data set. you can
234
- # e.g. just keep one line of the CSV files + the headers
235
- system! "sed -n \"1p;25706p\" #{file} > #{sample_file}"
236
- end
237
-
238
- source MyCsv, file: sample_file
239
-
240
- transform do |row|
241
- count += 1
242
- row
243
- end
244
-
245
- post_process do
246
- Email.send(supervisor_address, "#{count} rows successfully processed")
247
- end
248
- ```
249
-
250
- ## Composability, reusability, testability of Kiba components
251
-
252
- The way Kiba works makes it easy to create reusable, well-tested ETL components and jobs.
253
-
254
- The main reason for this is that a Kiba ETL script can `require` shared Ruby code, which allows to:
255
- - create well-tested, reusable sources & destinations
256
- - create macro-transforms as methods, to be reused across sister scripts
257
- - substitute a component by another (e.g.: try a variant of a destination)
258
- - use a centralized place for configuration (credentials, IP addresses, etc.)
259
-
260
- The fact that the DSL evaluation "runs" the script also allows for simple meta-programming techniques, like pre-reading a source file to extract field names, to be used in transform definitions.
261
-
262
- The ability to support that DSL, but also check command line arguments, environment variables and tweak behaviour as needed, or call other/faster specialized tools make Ruby an asset to implement ETL jobs.
263
-
264
- Make sure to subscribe to my [Ruby ETL blog](http://thibautbarrere.com) where I'll demonstrate such techniques over time!
27
+ * [Live Coding Session - Processing data with Kiba ETL](http://thibautbarrere.com/2015/11/09/video-processing-data-with-kiba-etl/)
28
+ * [Rubyists - are you doing ETL unknowingly?](http://thibautbarrere.com/2015/03/25/rubyists-are-you-doing-etl-unknowingly/)
29
+ * [How to write solid data processing code](http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code/)
30
+ * [How to reformat CSV files with Kiba](http://thibautbarrere.com/2015/06/04/how-to-reformat-csv-files-with-kiba/) (in-depth, hands-on tutorial)
31
+ * [How to explode multivalued attributes with Kiba ETL?](http://thibautbarrere.com/2015/06/25/how-to-explode-multivalued-attributes-with-kiba/)
32
+ * [Common techniques to compute aggregates with Kiba](https://stackoverflow.com/questions/31145715/how-to-do-a-aggregation-transformation-in-a-kiba-etl-script-kiba-gem)
33
+ * [How to run Kiba in a Rails environment?](http://thibautbarrere.com/2015/09/26/how-to-run-kiba-in-a-rails-environment/)
34
+ * [How to pass parameters to the Kiba command line?](http://stackoverflow.com/questions/32959692/how-to-pass-parameters-into-your-etl-job)
265
35
 
266
36
  ## Supported Ruby versions
267
37
 
268
38
  Kiba currently supports Ruby 2.0+ and JRuby (with its default 1.9 syntax). See [test matrix](https://travis-ci.org/thbar/kiba).
269
39
 
270
- ## History & Credits
271
-
272
- Wow, you're still there? Nice to meet you. I'm [Thibaut](http://thibautbarrere.com), author of Kiba.
273
-
274
- I first met the idea of row-based syntax when I started using [Anthony Eden](https://github.com/aeden)'s [Activewarehouse-ETL](https://github.com/activewarehouse/activewarehouse-etl), first published around 2006 (I think), in which Anthony applied the core principles defined by Ralph Kimball in [The Data Warehouse ETL Toolkit](http://www.amazon.com/gp/product/0764567578).
275
-
276
- I've been writing and maintaining a number of production ETL systems using Activewarehouse-ETL, then later with an ancestor of Kiba which was named TinyTL.
40
+ ## Kiba Common
277
41
 
278
- I took over the maintenance of Activewarehouse-ETL circa 2009/2010, but over time, I could not properly update & document it, given the gradual failure of a large number of dependencies and components. Ultimately in 2014 I had to stop maintaining it, after an already long hiatus.
42
+ I'm starting to add commonly used reusable helpers in a separate gem called [kiba-common](https://github.com/thbar/kiba-common), check it out (work-in-progress).
279
43
 
280
- That said using Activewarehouse-ETL for so long made me realize the row-based processing syntax was great and provided some great assets for maintainability on long time-spans.
44
+ ## ETL consulting & commercial version
281
45
 
282
- Kiba is a completely fresh & minimalistic-on-purpose implementation of that row-based processing pattern.
46
+ **Consulting services**: if your organization needs help to implement a data pipeline or to build a data-intensive application, I provide consulting services. [More information](http://thibautbarrere.com/hire-me/).
283
47
 
284
- It is minimalistic to make it more likely that I will be able to maintain it over time.
285
-
286
- It makes strong simplicity assumptions (like letting you define the sources, transforms & destinations). MiniTest is an inspiration.
287
-
288
- As I developed Kiba, I realize how much this simplicity opens the road for interesting developments such as multi-threaded & multi-processes processing.
289
-
290
- Last word: Kiba is 100% sponsored by my company LoGeek SARL (also provider of [WiseCash, a lightweight cash-flow forecasting app](https://www.wisecashhq.com)).
48
+ **Kiba Pro**: for more features & goodies, check out Kiba Pro ([Changelog & contact info](Pro-Changes.md)).
291
49
 
292
50
  ## License
293
51
 
294
- Copyright (c) LoGeek SARL.
295
-
296
- Kiba is an Open Source project licensed under the terms of
297
- the LGPLv3 license. Please see <http://www.gnu.org/licenses/lgpl-3.0.html>
298
- for license text.
52
+ Copyright (c) LoGeek SARL. Kiba is an Open Source project licensed under the terms of
53
+ the LGPLv3 license. Please see <http://www.gnu.org/licenses/lgpl-3.0.html> for license text.
299
54
 
300
55
  ## Contributing & Legal
301
56
 
302
- Until the API is more stable, I can only accept documentation Pull Requests.
303
-
304
57
  (agreement below borrowed from [Sidekiq Legal](https://github.com/mperham/sidekiq/blob/master/Contributing.md))
305
58
 
306
59
  By submitting a Pull Request, you disavow any rights or claims to any changes submitted to the Kiba project and assign the copyright of those changes to LoGeek SARL.
@@ -1,18 +1,26 @@
1
- version: '{build}'
1
+ version: 1.0.{build}-{branch}
2
2
 
3
- skip_tags: true
3
+ cache:
4
+ - vendor/bundle
4
5
 
5
6
  environment:
6
7
  matrix:
7
- - ruby_version: "21"
8
- - ruby_version: "21-x64"
8
+ - RUBY_VERSION: 24
9
+ - RUBY_VERSION: 23
10
+ - RUBY_VERSION: 22
11
+ - RUBY_VERSION: 21
9
12
 
10
13
  install:
11
- - SET PATH=C:\Ruby%ruby_version%\bin;%PATH%
12
- - gem install bundler --no-document -v 1.10.5
13
- - bundle install --retry=3
14
+ - set PATH=C:\Ruby%RUBY_VERSION%\bin;%PATH%
15
+ - bundle config --local path vendor/bundle
16
+ - bundle install
17
+
18
+ build: off
19
+
20
+ before_test:
21
+ - ruby -v
22
+ - gem -v
23
+ - bundle -v
14
24
 
15
25
  test_script:
16
26
  - bundle exec rake
17
-
18
- build: off
@@ -15,6 +15,7 @@ Gem::Specification.new do |gem|
15
15
  gem.executables = ['kiba']
16
16
 
17
17
  gem.add_development_dependency 'rake'
18
- gem.add_development_dependency 'minitest'
18
+ gem.add_development_dependency 'minitest', '~> 5.9'
19
19
  gem.add_development_dependency 'awesome_print'
20
+ gem.add_development_dependency 'minitest-focus'
20
21
  end
@@ -5,6 +5,15 @@ require 'kiba/control'
5
5
  require 'kiba/context'
6
6
  require 'kiba/parser'
7
7
  require 'kiba/runner'
8
+ require 'kiba/streaming_runner'
9
+ require 'kiba/dsl_extensions/config'
8
10
 
9
11
  Kiba.extend(Kiba::Parser)
10
- Kiba.extend(Kiba::Runner)
12
+
13
+ module Kiba
14
+ def self.run(job)
15
+ # NOTE: use Hash#dig when Ruby 2.2 reaches EOL
16
+ runner = job.config.fetch(:kiba, {}).fetch(:runner, Kiba::Runner)
17
+ runner.run(job)
18
+ end
19
+ end
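The runner lookup in `Kiba.run` above can be tried in isolation. This is a minimal sketch, assuming hypothetical stand-ins (`DefaultRunner`, `FakeStreamingRunner`, `Job`, `pick_runner`) that are not part of Kiba; it only illustrates how the two nested `fetch` calls fall back to a default when the job carries no `:kiba` configuration.

```ruby
# Hypothetical stand-ins for illustration only; these are not Kiba's classes.
module DefaultRunner
  def self.run(_job)
    :default
  end
end

module FakeStreamingRunner
  def self.run(_job)
    :streaming
  end
end

# A job here is just something that exposes a config hash.
Job = Struct.new(:config)

# Same lookup shape as in the diff: an absent :kiba key or an absent
# :runner key both resolve to the default runner.
def pick_runner(job)
  job.config.fetch(:kiba, {}).fetch(:runner, DefaultRunner)
end

plain_job  = Job.new({})
opt_in_job = Job.new({ kiba: { runner: FakeStreamingRunner } })

default_result = pick_runner(plain_job).run(plain_job)
opt_in_result  = pick_runner(opt_in_job).run(opt_in_job)
```

Because the default is supplied at lookup time, existing jobs that never set a runner keep their previous behaviour, which is consistent with the changelog describing the new runner as opt-in.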
@@ -3,6 +3,10 @@ module Kiba
3
3
  def pre_processes
4
4
  @pre_processes ||= []
5
5
  end
6
+
7
+ def config
8
+ @config ||= {}
9
+ end
6
10
 
7
11
  def sources
8
12
  @sources ||= []
@@ -0,0 +1,9 @@
1
+ module Kiba
2
+ module DSLExtensions
3
+ module Config
4
+ def config(context, context_config)
5
+ (@control.config[context] ||= {}).merge!(context_config)
6
+ end
7
+ end
8
+ end
9
+ end
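The `config` method above accumulates settings per context rather than overwriting them. A minimal sketch of that behaviour, assuming a plain hash (`store`) and a lambda (`add_config`) as hypothetical stand-ins for `@control.config` and the DSL method:

```ruby
# Hypothetical stand-in for the control object's config hash.
store = {}

# Same expression shape as the `config` method in the diff:
# create the per-context hash on first use, then merge into it.
add_config = lambda do |context, context_config|
  (store[context] ||= {}).merge!(context_config)
end

add_config.call(:kiba, { runner: :streaming })
add_config.call(:kiba, { verbose: true })        # merged with the first call
add_config.call(:my_extension, { retries: 3 })   # separate context, separate hash
```

The keys `:verbose`, `:my_extension` and `:retries` are invented for the example; only the `:kiba` / `:runner` pair appears in the diff.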
@@ -1,15 +1,26 @@
1
- module Kiba
2
- module Parser
3
- def parse(source_as_string = nil, source_file = nil, &source_as_block)
4
- control = Control.new
5
- context = Context.new(control)
6
- if source_as_string
7
- # this somewhat weird construct allows to remove a nil source_file
8
- context.instance_eval(*[source_as_string, source_file].compact)
9
- else
10
- context.instance_eval(&source_as_block)
11
- end
12
- control
1
+ # NOTE: using the "Kiba::Parser" declaration, as I discovered,
2
+ # provides increased isolation to the declared ETL script, compared
3
+ # to 2 nested modules.
4
+ # Before that, a user creating entities named Control, Context
5
+ # or DSLExtensions would see a conflict with Kiba's own classes,
6
+ # as by default instance_eval will resolve references by adding
7
+ # the module containing the parser class (initially "Kiba").
8
+ # Now, the classes appear to be further hidden from the user,
9
+ # as Kiba::Parser is its own module.
10
+ # This allows the user to create a Parser, Context, Control class
11
+ # without it being interpreted as reopening Kiba::Parser, Kiba::Context,
12
+ # etc.
13
+ # See test in test_cli.rb (test_namespace_conflict)
14
+ module Kiba::Parser
15
+ def parse(source_as_string = nil, source_file = nil, &source_as_block)
16
+ control = Kiba::Control.new
17
+ context = Kiba::Context.new(control)
18
+ if source_as_string
19
+ # this somewhat weird construct allows to remove a nil source_file
20
+ context.instance_eval(*[source_as_string, source_file].compact)
21
+ else
22
+ context.instance_eval(&source_as_block)
13
23
  end
24
+ control
14
25
  end
15
26
  end
@@ -1,5 +1,7 @@
1
1
  module Kiba
2
2
  module Runner
3
+ extend self
4
+
3
5
  # allow to handle a block form just like a regular transform
4
6
  class AliasingProc < Proc
5
7
  alias_method :process, :call
@@ -13,8 +15,9 @@ module Kiba
13
15
  process_rows(
14
16
  to_instances(control.sources),
15
17
  to_instances(control.transforms, true),
16
- to_instances(control.destinations)
18
+ destinations = to_instances(control.destinations)
17
19
  )
20
+ close_destinations(destinations)
18
21
  # TODO: when I add post processes as class, I'll have to add a test to
19
22
  # make sure instantiation occurs after the main processing is done (#16)
20
23
  run_post_processes(control)
@@ -28,6 +31,12 @@ module Kiba
28
31
  to_instances(control.post_processes, true, false).each(&:call)
29
32
  end
30
33
 
34
+ def close_destinations(destinations)
35
+ destinations
36
+ .find_all { |d| d.respond_to?(:close) }
37
+ .each(&:close)
38
+ end
39
+
31
40
  def process_rows(sources, transforms, destinations)
32
41
  sources.each do |source|
33
42
  source.each do |row|
@@ -41,7 +50,6 @@ module Kiba
41
50
  end
42
51
  end
43
52
  end
44
- destinations.find_all { |d| d.respond_to?(:close) }.each(&:close)
45
53
  end
46
54
 
47
55
  # not using keyword args because JRuby defaults to 1.9 syntax currently
@@ -0,0 +1,33 @@
1
+ module Kiba
2
+ module StreamingRunner
3
+ include Runner
4
+ extend self
5
+
6
+ def transform_stream(stream, t)
7
+ Enumerator.new do |y|
8
+ stream.each do |input_row|
9
+ returned_row = t.process(input_row) do |yielded_row|
10
+ y << yielded_row
11
+ end
12
+ y << returned_row if returned_row
13
+ end
14
+ end
15
+ end
16
+
17
+ def source_stream(sources)
18
+ Enumerator.new do |y|
19
+ sources.each do |source|
20
+ source.each { |r| y << r }
21
+ end
22
+ end
23
+ end
24
+
25
+ def process_rows(sources, transforms, destinations)
26
+ stream = source_stream(sources)
27
+ recurser = lambda { |s,t| transform_stream(s, t) }
28
+ transforms.inject(stream, &recurser).each do |r|
29
+ destinations.each { |d| d.write(r) }
30
+ end
31
+ end
32
+ end
33
+ end
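To see why this lets a class transform emit several rows per input row, the `transform_stream` technique can be reproduced without loading Kiba. This is a standalone sketch: `ExplodeTags` and the sample rows are hypothetical, and the helper below copies the shape of the method in the diff rather than calling the gem.

```ruby
# Hypothetical class transform: yields one output row per tag.
class ExplodeTags
  def process(row)
    row.fetch(:tags).each { |tag| yield({ item: tag }) }
    nil # nothing extra to emit beyond the yielded rows
  end
end

# Same shape as StreamingRunner#transform_stream in the diff: rows yielded
# by the transform and its (non-nil) return value both go downstream.
def transform_stream(stream, transform)
  Enumerator.new do |y|
    stream.each do |input_row|
      returned_row = transform.process(input_row) { |yielded_row| y << yielded_row }
      y << returned_row if returned_row
    end
  end
end

source = [{ tags: %w[a b] }, { tags: %w[c] }]
output = transform_stream(source.each, ExplodeTags.new).to_a
```

Since each stage is an `Enumerator`, chaining several transforms with `inject` (as `process_rows` does) builds the pipeline first and only pulls rows through it when the final `each` runs.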
@@ -1,3 +1,3 @@
1
1
  module Kiba
2
- VERSION = '1.0.0'
2
+ VERSION = '2.0.0.rc1'
3
3
  end
@@ -0,0 +1,137 @@
1
+ require 'minitest/mock'
2
+ require_relative '../support/test_enumerable_source'
3
+
4
+ module SharedRunnerTests
5
+ def kiba_run(job)
6
+ Kiba.run(job)
7
+ end
8
+
9
+ def rows
10
+ @rows ||= [
11
+ { identifier: 'first-row' },
12
+ { identifier: 'second-row' }
13
+ ]
14
+ end
15
+
16
+ def control
17
+ @control ||= begin
18
+ control = Kiba::Control.new
19
+ # this will yield a single row for testing
20
+ control.sources << {
21
+ klass: TestEnumerableSource,
22
+ args: [rows]
23
+ }
24
+ control
25
+ end
26
+ end
27
+
28
+ def test_block_transform_processing
29
+ # is there a better way to assert a block was called in minitest?
30
+ control.transforms << { block: lambda { |r| @called = true; r } }
31
+ kiba_run(control)
32
+ assert_equal true, @called
33
+ end
34
+
35
+ def test_dismissed_row_not_passed_to_next_transform
36
+ @called = nil
37
+ control.transforms << { block: lambda { |_| nil } }
38
+ control.transforms << { block: lambda { |_| @called = true; nil } }
39
+ kiba_run(control)
40
+ assert_nil @called
41
+ end
42
+
43
+ def test_post_process_runs_once
44
+ assert_equal 2, rows.size
45
+ @called = 0
46
+ control.post_processes << { block: lambda { @called += 1 } }
47
+ kiba_run(control)
48
+ assert_equal 1, @called
49
+ end
50
+
51
+ def test_post_process_not_called_after_row_failure
52
+ @called = nil
53
+ control.transforms << { block: lambda { |_| fail 'FAIL' } }
54
+ control.post_processes << { block: lambda { @called = true } }
55
+ assert_raises(RuntimeError, 'FAIL') { kiba_run(control) }
56
+ assert_nil @called
57
+ end
58
+
59
+ def test_pre_process_runs_once
60
+ assert_equal 2, rows.size
61
+ @called = 0
62
+ control.pre_processes << { block: lambda { @called += 1 } }
63
+ kiba_run(control)
64
+ assert_equal 1, @called
65
+ end
66
+
67
+ def test_pre_process_runs_before_source_is_instantiated
68
+ calls = []
69
+
70
+ mock_source_class = MiniTest::Mock.new
71
+ mock_source_class.expect(:new, TestEnumerableSource.new([1, 2, 3])) do
72
+ calls << :source_instantiated
73
+ end
74
+
75
+ control = Kiba::Control.new
76
+ control.pre_processes << { block: lambda { calls << :pre_processor_executed } }
77
+ control.sources << { klass: mock_source_class }
78
+ kiba_run(control)
79
+
80
+ assert_equal [:pre_processor_executed, :source_instantiated], calls
81
+ assert_mock mock_source_class
82
+ end
83
+
84
+ def test_no_error_raised_if_destination_close_not_implemented
85
+ # NOTE: this fake destination does not implement `close`
86
+ destination_instance = MiniTest::Mock.new
87
+
88
+ mock_destination_class = MiniTest::Mock.new
89
+ mock_destination_class.expect(:new, destination_instance)
90
+
91
+ control = Kiba::Control.new
92
+ control.destinations << { klass: mock_destination_class }
93
+ kiba_run(control)
94
+ assert_mock mock_destination_class
95
+ end
96
+
97
+ def test_destination_close_called_if_defined
98
+ destination_instance = MiniTest::Mock.new
99
+ destination_instance.expect(:close, nil)
100
+ mock_destination_class = MiniTest::Mock.new
101
+ mock_destination_class.expect(:new, destination_instance)
102
+
103
+ control = Kiba::Control.new
104
+ control.destinations << { klass: mock_destination_class }
105
+ kiba_run(control)
106
+ assert_mock destination_instance
107
+ assert_mock mock_destination_class
108
+ end
109
+
110
+ def test_use_next_to_exit_early_from_block_transform
111
+ assert_equal 2, rows.size
112
+
113
+ # calling "return row" from a block is forbidden, but you can use "next" instead
114
+ b = lambda do |row|
115
+ if row.fetch(:identifier) == 'first-row'
116
+ # demonstrate how to remove a row from the pipeline via next
117
+ next
118
+ else
119
+ # demonstrate how you can reformat via next
120
+ next({new_identifier: row.fetch(:identifier)})
121
+ end
122
+ fail "This should not be called"
123
+ end
124
+ control.transforms << { block: b }
125
+
126
+ # keep track of the rows
127
+ @remaining_rows = []
128
+ checker = lambda { |row| @remaining_rows << row; row }
129
+ control.transforms << { block: checker }
130
+
131
+ kiba_run(control)
132
+
133
+ # the first row should have been removed
134
+ # and the second row should have been reformatted
135
+ assert_equal [{new_identifier: 'second-row'}], @remaining_rows
136
+ end
137
+ end
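The `next` behaviour exercised by `test_use_next_to_exit_early_from_block_transform` can be checked with plain Ruby. A minimal sketch, assuming a hypothetical `keep_or_reformat` proc and using `map` plus `compact` in place of the runner:

```ruby
# Hypothetical block transform: `next` ends the block early, and its
# argument (or nil when omitted) becomes the block's return value.
keep_or_reformat = proc do |row|
  next if row.fetch(:identifier) == 'first-row'       # dismiss the row
  next({ new_identifier: row.fetch(:identifier) })    # replace the row
end

rows = [{ identifier: 'first-row' }, { identifier: 'second-row' }]

# Stand-in for the runner: call the block per row and drop dismissed (nil) rows.
remaining = rows.map { |row| keep_or_reformat.call(row) }.compact
```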
@@ -0,0 +1,9 @@
+ fail "Context should not be visible without Kiba namespace" if defined?(Context)
+ fail "Control should not be visible without Kiba namespace" if defined?(Control)
+ fail "Parser should not be visible without Kiba namespace" if defined?(Parser)
+ fail "Config should not be visible without Kiba namespace" if defined?(DSLExtensions::Config)
+
+ # verify Kiba config (namespaced under Kiba::DSLExtensions::Config)
+ # isn't causing troubles to implementers using a top-level DSLExtensions module
+ require_relative 'some_extension'
+ extend DSLExtensions::SomeExtension
data/test/fixtures/some_extension.rb ADDED
@@ -0,0 +1,4 @@
+ module DSLExtensions
+   module SomeExtension
+   end
+ end
data/test/helper.rb CHANGED
@@ -1,5 +1,6 @@
  require 'minitest/autorun'
  require 'minitest/pride'
+ require 'minitest/focus'
  require 'kiba'

  class Kiba::Test < Minitest::Test
data/test/support/shared_tests.rb ADDED
@@ -0,0 +1,10 @@
+ module SharedTests
+   def shared_tests_for(desc, &block)
+     @@shared_tests ||= {}
+     @@shared_tests[desc] = block
+   end
+
+   def shared_tests(desc, *args)
+     self.class_exec(*args, &@@shared_tests.fetch(desc))
+   end
+ end
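The diff does not show `SharedTests` in use, so the following is only a sketch of how such a module could be consumed. The module body is copied from the hunk above to keep the example self-contained; the `Registry` and `ConsumerSuite` classes and the `'greeting'` description are hypothetical names chosen for illustration.

```ruby
# Copy of the SharedTests module from the diff, so the sketch is self-contained.
module SharedTests
  def shared_tests_for(desc, &block)
    @@shared_tests ||= {}
    @@shared_tests[desc] = block
  end

  def shared_tests(desc, *args)
    class_exec(*args, &@@shared_tests.fetch(desc))
  end
end

# Hypothetical class that registers a shared block under a description.
class Registry
  extend SharedTests

  # The block defines a method on whichever class later evaluates it.
  shared_tests_for 'greeting' do |name|
    define_method(:test_greeting) { "hello #{name}" }
  end
end

# Hypothetical class that pulls the shared block in.
class ConsumerSuite
  extend SharedTests

  # Evaluate the stored block in this class, passing 'kiba' as its argument.
  shared_tests 'greeting', 'kiba'
end

greeting = ConsumerSuite.new.test_greeting
```

Because the stored block runs through `class_exec`, the method ends up on the class that calls `shared_tests`, not on the class that registered the block.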
data/test/support/test_array_destination.rb ADDED
@@ -0,0 +1,9 @@
+ class TestArrayDestination
+   def initialize(array)
+     @array = array
+   end
+
+   def write(row)
+     @array << row
+   end
+ end
data/test/support/test_yielding_transform.rb ADDED
@@ -0,0 +1,8 @@
+ class TestYieldingTransform
+   def process(row)
+     row.fetch(:tags).each do |value|
+       yield({item: value})
+     end
+     {item: "classic-return-value"}
+   end
+ end
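A class transform that both yields rows and returns one needs a caller that passes a block to `process`. The sketch below shows one way a caller could gather both kinds of output. `collect_output_rows` is a hypothetical helper, not the implementation of `Kiba::StreamingRunner`, and `TagSplitter` simply mirrors the shape of `TestYieldingTransform`.

```ruby
# Same shape as TestYieldingTransform in the diff: yields one row per tag,
# then returns a final row the classic way.
class TagSplitter
  def process(row)
    row.fetch(:tags).each do |value|
      yield({ item: value })
    end
    { item: 'classic-return-value' }
  end
end

# Hypothetical helper (not Kiba's StreamingRunner): gathers yielded rows first,
# then appends the return value unless it is nil.
def collect_output_rows(transform, row)
  output = []
  returned = transform.process(row) { |yielded| output << yielded }
  output << returned unless returned.nil?
  output
end

rows = collect_output_rows(TagSplitter.new, { tags: %w[one two three] })
```

The yielded rows come first and the returned row last, which is the same ordering that `test_yielding_class_transform` asserts for the destination array.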
data/test/test_cli.rb CHANGED
@@ -14,4 +14,8 @@ class TestCli < Kiba::Test
      assert_match(/uninitialized constant(.*)UnknownThing/, exception.message)
      assert_includes exception.backtrace.to_s, 'test/fixtures/bogus.etl:2:in'
    end
+
+   def test_namespace_conflict
+     Kiba::Cli.run([fixture('namespace_conflict.etl')])
+   end
  end
data/test/test_parser.rb CHANGED
@@ -14,6 +14,17 @@ class TestParser < Kiba::Test
      assert_equal DummyClass, control.sources[0][:klass]
      assert_equal %w(has args), control.sources[0][:args]
    end
+
+   # NOTE: useful for anything not using the CLI (e.g. sidekiq)
+   def test_block_parsing_with_reference_to_outside_variable
+     some_variable = Object.new
+
+     control = Kiba.parse do
+       source DummyClass, some_variable
+     end
+
+     assert_equal [some_variable], control.sources[0][:args]
+   end

    def test_block_transform_definition
      control = Kiba.parse do
@@ -89,4 +100,31 @@ RUBY
    ensure
      remove_files('test/tmp/etl-common.rb', 'test/tmp/etl-main.rb')
    end
+
+   def test_config
+     control = Kiba.parse do
+       extend Kiba::DSLExtensions::Config
+
+       config :context, key: "value", other_key: "other_value"
+     end
+
+     assert_equal({ context: {
+       key: "value",
+       other_key: "other_value"
+     }}, control.config)
+   end
+
+   def test_config_override
+     control = Kiba.parse do
+       extend Kiba::DSLExtensions::Config
+
+       config :context, key: "value", other_key: "other_value"
+       config :context, key: "new_value"
+     end
+
+     assert_equal({ context: {
+       key: "new_value",
+       other_key: "other_value"
+     }}, control.config)
+   end
  end
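The two config tests above imply merge semantics: a second `config :context, ...` call overrides matching keys and keeps the others. A minimal sketch of that behaviour follows. `JobSettings` and `configure` are hypothetical names, and this is not the code in `lib/kiba/dsl_extensions/config.rb`, which this part of the diff does not show.

```ruby
# Hypothetical stand-in for a job definition (not Kiba's Control or Config code).
class JobSettings
  def config
    @config ||= {}
  end

  # Merge `values` into the named section, keeping keys that are not overridden.
  def configure(section, values)
    (config[section] ||= {}).merge!(values)
    self
  end
end

settings = JobSettings.new
settings.configure(:context, key: 'value', other_key: 'other_value')
settings.configure(:context, key: 'new_value')
```

After both calls, `settings.config` has `key` set to `'new_value'` while `other_key` keeps its original value, matching the expectation in `test_config_override`.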
data/test/test_runner.rb CHANGED
@@ -1,87 +1,6 @@
  require_relative 'helper'
- require 'minitest/mock'
- require_relative 'support/test_enumerable_source'
+ require_relative 'common/runner'

  class TestRunner < Kiba::Test
-   let(:rows) do
-     [
-       { field: 'value' },
-       { field: 'other-value' }
-     ]
-   end
-
-   let(:control) do
-     control = Kiba::Control.new
-     # this will yield a single row for testing
-     control.sources << {
-       klass: TestEnumerableSource,
-       args: [rows]
-     }
-     control
-   end
-
-   def test_block_transform_processing
-     # is there a better way to assert a block was called in minitest?
-     control.transforms << { block: lambda { |r| @called = true; r } }
-     Kiba.run(control)
-     assert_equal true, @called
-   end
-
-   def test_dismissed_row_not_passed_to_next_transform
-     control.transforms << { block: lambda { |_| nil } }
-     control.transforms << { block: lambda { |_| @called = true; nil } }
-     Kiba.run(control)
-     assert_nil @called
-   end
-
-   def test_post_process_runs_once
-     assert_equal 2, rows.size
-     @called = 0
-     control.post_processes << { block: lambda { @called += 1 } }
-     Kiba.run(control)
-     assert_equal 1, @called
-   end
-
-   def test_post_process_not_called_after_row_failure
-     control.transforms << { block: lambda { |_| fail 'FAIL' } }
-     control.post_processes << { block: lambda { @called = true } }
-     assert_raises(RuntimeError, 'FAIL') { Kiba.run(control) }
-     assert_nil @called
-   end
-
-   def test_pre_process_runs_once
-     assert_equal 2, rows.size
-     @called = 0
-     control.pre_processes << { block: lambda { @called += 1 } }
-     Kiba.run(control)
-     assert_equal 1, @called
-   end
-
-   def test_pre_process_runs_before_source_is_instantiated
-     calls = []
-
-     mock_source_class = MiniTest::Mock.new
-     mock_source_class.expect(:new, TestEnumerableSource.new([1, 2, 3])) do
-       calls << :source_instantiated
-     end
-
-     control = Kiba::Control.new
-     control.pre_processes << { block: lambda { calls << :pre_processor_executed } }
-     control.sources << { klass: mock_source_class }
-     Kiba.run(control)
-
-     assert_equal [:pre_processor_executed, :source_instantiated], calls
-   end
-
-   def test_no_error_raised_if_destination_close_not_implemented
-     # NOTE: this fake destination does not implement `close`
-     destination_instance = MiniTest::Mock.new
-
-     mock_destination_class = MiniTest::Mock.new
-     mock_destination_class.expect(:new, destination_instance)
-
-     control = Kiba::Control.new
-     control.destinations << { klass: mock_destination_class }
-     Kiba.run(control)
-   end
+   include SharedRunnerTests
  end
data/test/test_streaming_runner.rb ADDED
@@ -0,0 +1,33 @@
+ require_relative 'helper'
+ require_relative 'support/test_enumerable_source'
+ require_relative 'support/test_array_destination'
+ require_relative 'support/test_yielding_transform'
+ require_relative 'common/runner'
+
+ class TestStreamingRunner < Kiba::Test
+   include SharedRunnerTests
+
+   def test_yielding_class_transform
+     input_row = {tags: ["one", "two", "three"]}
+     destination_array = []
+
+     job = Kiba.parse do
+       extend Kiba::DSLExtensions::Config
+
+       config :kiba, runner: Kiba::StreamingRunner
+
+       source TestEnumerableSource, [input_row]
+       transform TestYieldingTransform
+       destination TestArrayDestination, destination_array
+     end
+
+     kiba_run(job)
+
+     assert_equal [
+       {item: 'one'},
+       {item: 'two'},
+       {item: 'three'},
+       {item: 'classic-return-value'}
+     ], destination_array
+   end
+ end
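The test above opts in to the streaming runner through `config :kiba, runner: Kiba::StreamingRunner`. How the gem dispatches on that setting is not part of this chunk of the diff, so the sketch below is an assumption about one plausible lookup. `DefaultRunner`, `OptInRunner`, and `run_job` are hypothetical names.

```ruby
# Hypothetical runners, standing in for a default and an opt-in implementation.
module DefaultRunner
  def self.run(_job)
    :default
  end
end

module OptInRunner
  def self.run(_job)
    :opt_in
  end
end

# Assumed lookup: pick the runner from the job's config hash and fall back to
# the default when no :kiba / :runner entry is present.
def run_job(job_config, job = nil)
  runner = job_config.fetch(:kiba, {}).fetch(:runner, DefaultRunner)
  runner.run(job)
end

without_opt_in = run_job({})
with_opt_in = run_job({ kiba: { runner: OptInRunner } })
```

Under this assumption, jobs that never set the option keep the default runner, which is consistent with the Changes.md entry describing the new runner as opt-in.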
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: kiba
  version: !ruby/object:Gem::Version
-   version: 1.0.0
+   version: 2.0.0.rc1
  platform: ruby
  authors:
  - Thibaut Barrère
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-12-01 00:00:00.000000000 Z
+ date: 2017-12-27 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rake
@@ -26,6 +26,20 @@ dependencies:
          version: '0'
  - !ruby/object:Gem::Dependency
    name: minitest
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.9'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.9'
+ - !ruby/object:Gem::Dependency
+   name: awesome_print
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
@@ -39,7 +53,7 @@ dependencies:
        - !ruby/object:Gem::Version
          version: '0'
  - !ruby/object:Gem::Dependency
-   name: awesome_print
+   name: minitest-focus
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
@@ -64,6 +78,8 @@ files:
  - ".travis.yml"
  - Changes.md
  - Gemfile
+ - LICENSE
+ - Pro-Changes.md
  - README.md
  - Rakefile
  - appveyor.yml
@@ -73,21 +89,30 @@ files:
  - lib/kiba/cli.rb
  - lib/kiba/context.rb
  - lib/kiba/control.rb
+ - lib/kiba/dsl_extensions/config.rb
  - lib/kiba/parser.rb
  - lib/kiba/runner.rb
+ - lib/kiba/streaming_runner.rb
  - lib/kiba/version.rb
+ - test/common/runner.rb
  - test/fixtures/bogus.etl
+ - test/fixtures/namespace_conflict.etl
+ - test/fixtures/some_extension.rb
  - test/fixtures/valid.etl
  - test/helper.rb
+ - test/support/shared_tests.rb
+ - test/support/test_array_destination.rb
  - test/support/test_csv_destination.rb
  - test/support/test_csv_source.rb
  - test/support/test_enumerable_source.rb
  - test/support/test_rename_field_transform.rb
  - test/support/test_source_that_reads_at_instantiation_time.rb
+ - test/support/test_yielding_transform.rb
  - test/test_cli.rb
  - test/test_integration.rb
  - test/test_parser.rb
  - test/test_runner.rb
+ - test/test_streaming_runner.rb
  - test/tmp/.gitkeep
  homepage: http://thbar.github.io/kiba/
  licenses:
@@ -104,26 +129,33 @@ required_ruby_version: !ruby/object:Gem::Requirement
        version: '0'
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
-   - - ">="
+   - - ">"
      - !ruby/object:Gem::Version
-       version: '0'
+       version: 1.3.1
  requirements: []
  rubyforge_project:
- rubygems_version: 2.4.8
+ rubygems_version: 2.6.14
  signing_key:
  specification_version: 4
  summary: Lightweight ETL for Ruby
  test_files:
+ - test/common/runner.rb
  - test/fixtures/bogus.etl
+ - test/fixtures/namespace_conflict.etl
+ - test/fixtures/some_extension.rb
  - test/fixtures/valid.etl
  - test/helper.rb
+ - test/support/shared_tests.rb
+ - test/support/test_array_destination.rb
  - test/support/test_csv_destination.rb
  - test/support/test_csv_source.rb
  - test/support/test_enumerable_source.rb
  - test/support/test_rename_field_transform.rb
  - test/support/test_source_that_reads_at_instantiation_time.rb
+ - test/support/test_yielding_transform.rb
  - test/test_cli.rb
  - test/test_integration.rb
  - test/test_parser.rb
  - test/test_runner.rb
+ - test/test_streaming_runner.rb
  - test/tmp/.gitkeep