metacrunch 4.2.0 → 4.2.1
- checksums.yaml +5 -5
- data/.circleci/config.yml +35 -0
- data/Gemfile +3 -2
- data/Readme.md +30 -29
- data/lib/metacrunch/cli.rb +1 -1
- data/lib/metacrunch/version.rb +1 -1
- metadata +4 -10
- data/.travis.yml +0 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
-
-  metadata.gz:
-  data.tar.gz:
+SHA256:
+  metadata.gz: 6c1facce15096151df3186f7d48245b1c06bebb231b9d6dabeb70d569c0bb06c
+  data.tar.gz: 41487b86683753e2f8eba95d1e9dbea47efa0af5c36b13c1bb343f0b59f714ab
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '04015927726756e1f5839d4b4bceac287400b729a1828a3652d15fc456720245ade702d8b1fddec83aaf41418df2c404006f0cd89805cdfc0aba8bd04e737579'
+  data.tar.gz: f6e8d9719618e8c1f6c8b68c8a28c008f9f710929b22d878352076cd3bf7f99e4503d7604e78472e8c70d052f3aa07eb8ccc0d778ed31dc19d040de893f4d484
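The digests in checksums.yaml can be checked locally. In the RubyGems package format they are hex digests of the `metadata.gz` and `data.tar.gz` members of the `.gem` archive. A minimal sketch, assuming the member has already been extracted to disk; the helper name `digest_matches?` is hypothetical and not part of RubyGems or metacrunch:

```ruby
require "digest"

# Hypothetical helper: compares a file's SHA256 hex digest with an expected
# value, such as one listed under `SHA256:` in checksums.yaml.
def digest_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Usage, assuming data.tar.gz was extracted into the current directory:
#   digest_matches?("data.tar.gz",
#     "41487b86683753e2f8eba95d1e9dbea47efa0af5c36b13c1bb343f0b59f714ab")
```

The same pattern would apply to the SHA512 entries by swapping in `Digest::SHA512`.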
data/.circleci/config.yml
ADDED
@@ -0,0 +1,35 @@
+version: 2.1
+orbs:
+  ruby: circleci/ruby@1.1.1
+
+jobs:
+  build:
+    docker:
+      - image: circleci/ruby:2.6-node-browsers
+
+    working_directory: ~/repo
+
+    steps:
+      - checkout
+
+      - run:
+          name: Install dependencies
+          command: bundle install --jobs=4 --retry=3 --path vendor/bundle
+
+      - run:
+          name: Install CodeClimate test coverage reporter
+          command: |
+            curl -L https://codeclimate.com/downloads/test-reporter/test-reporter-latest-linux-amd64 > ./cc-test-reporter
+            chmod +x ./cc-test-reporter
+            ./cc-test-reporter before-build
+
+      - run:
+          name: Run tests
+          command: |
+            mkdir /tmp/test-results
+            bundle exec rspec --format progress --format RspecJunitFormatter --out /tmp/test-results/rspec.xml
+
+      - run:
+          name: Upload test coverage report to CodeClimate
+          command: ./cc-test-reporter after-build --exit-code $?
+
data/Gemfile
CHANGED
@@ -5,7 +5,6 @@ gemspec
 group :development do
   gem "bundler", ">= 1.15"
   gem "rake", ">= 12.1"
-  gem "rspec", ">= 3.5.0", "< 4.0.0"
 
   if !ENV["CI"]
     gem "pry-byebug", ">= 3.5.0"
@@ -13,5 +12,7 @@ group :development do
 end
 
 group :test do
-  gem "
+  gem "rspec", ">= 3.5.0", "< 4.0.0"
+  gem "rspec_junit_formatter", ">= 0.3.0"
+  gem "simplecov", "= 0.17.1"
 end
data/Readme.md
CHANGED
@@ -3,7 +3,8 @@ metacrunch
 
 [](http://badge.fury.io/rb/metacrunch)
 [](https://codeclimate.com/github/ubpb/metacrunch)
-[](https://codeclimate.com/github/ubpb/metacrunch/coverage)
+[](https://circleci.com/gh/ubpb/metacrunch)
 
 metacrunch is a simple and lightweight data processing and ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load))
 toolkit for Ruby.
@@ -28,7 +29,7 @@ metacrunch gives you a simple DSL ([Domain-specific language](https://en.wikiped
 
 Let's walk through the main steps of creating ETL jobs with metacrunch. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repository.
 
-
+### It's Ruby
 
 Every `.metacrunch` job is a regular Ruby file and you can use any valid Ruby code like declaring methods, classes, variables, requiring other Ruby
 files and so on.
@@ -50,12 +51,14 @@ require "SomeGem"
 require_relative "./some/other/ruby/file"
 ```
 
-
+### Defining a source
 
-A source is an object that
+A source is an object that emits data objects (e.g. from a file or an external system) into the metacrunch processing pipeline. Implementing sources is easy – a source is a Ruby `Enumerable` (any object that responds to the `#each` method). For more information on how to implement sources [see notes below](#implementing-sources).
 
 You must declare a source to allow a job to run.
 
+A source iterates over it's entries and emits every entry as a data object into the transformation pipeline, by passing it to the first transformation.
+
 ```ruby
 # File: my_etl_job.metacrunch
 
@@ -66,15 +69,15 @@ source Metacrunch::File::Source.new(ARGV)
 source MySource.new
 ```
 
-
+### Defining transformations
 
-To process, transform or manipulate data use the `#transformation` hook. A transformation is implemented with a `callable` object (any Ruby object that responds to `#call`. E.g. a
+To process, transform or manipulate data use the `#transformation` hook. A transformation is implemented with a `callable` object (any Ruby object that responds to `#call`. E.g. a `Proc`). To learn more about transformations check the section about [implementing transformations](#implementing-transformations) below.
 
-The current data object (the
+The *current data object* (the current object emitted by the source) will be passed to the first transformation as a parameter. The return value of a transformation will then be passed to the next transformation and so on.
 
-There are two exceptions to that rule
+There are two exceptions to that rule:
 
-* If you return `nil` the current data object will be dismissed and the next transformation won't be called.
+* If you return `nil` the current data object will be dismissed and the next transformation won't be called. The process continues with the next data object that will be emitted by the source and the first transformation.
 * If you return an `Enumerator` the object will be expanded and the following transformations will be called with each element of the `Enumerator`.
 
 ```ruby
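The two exceptions described in this hunk (a `nil` return dismisses the data object, an `Enumerator` return is expanded) can be illustrated in plain Ruby. This is a minimal sketch under stated assumptions, not metacrunch's actual runner: `run_pipeline` is a hypothetical stand-in that applies a list of callables in order and collects whatever survives the last one.

```ruby
# Hypothetical stand-in for the pipeline runner; not metacrunch's own code.
# Applies each callable in order and follows the two rules from the Readme.
def run_pipeline(data, transformations)
  return [data] if transformations.empty?

  first, *rest = transformations
  result = first.call(data)

  case result
  when nil
    [] # dismissed: the remaining transformations are not called
  when Enumerator
    # expanded: the remaining transformations run once per element
    result.flat_map { |item| run_pipeline(item, rest) }
  else
    run_pipeline(result, rest)
  end
end

transformations = [
  ->(number) { number if number.odd? }, # returns nil for even numbers
  ->(odd)    { [odd, odd * 10].each },  # Array#each without a block returns an Enumerator
  ->(value)  { value + 1 }
]

results = [1, 2, 3].flat_map { |n| run_pipeline(n, transformations) }
# results == [2, 11, 4, 31]
```

Here `2` is dropped by the first lambda, while `1` and `3` are each expanded into two values before the last lambda runs.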
@@ -85,27 +88,29 @@ source [1,2,3,4,5,6,7,8,9]
 
 # A transformation is implemented with a `callable` object (any
 # object that responds to #call).
-#
+# Proc responds to #call
 transformation ->(number) {
-  # Called for each data object that has been
+  # Called for each data object that has been emitted by a source.
   # You must return the data to keep it in the pipeline. Dismiss the
   # data conditionally by returning nil.
   number if number.odd?
 }
 
+# Only called for odd numbers as even numbers gets dismissed in the previous
+# transformation.
 transformation ->(odd_number) {
   odd_number * 2
 }
 
-# MyTransformation implements #call
+# MyTransformation implements #call. Gets called with the prevous number times 2.
 transformation MyTransformation.new
 ```
 
-
+### Using a transformation buffer
 
 Sometimes it is useful to buffer data between transformation steps to allow a transformation to work on larger bulks of data. metacrunch uses a simple transformation buffer to achieve this.
 
-To use a transformation buffer add the `:buffer` option to your transformation. You can pass a positive integer value as a buffer size, or as an advanced option you can pass a `Proc` object. The buffer flushes every time the buffer reaches the given size or if the `Proc` returns `true`.
+To use a transformation buffer add the `:buffer` option to your transformation. You can pass a positive integer value as a buffer size, or as an advanced option you can pass a `Proc` object. The buffer flushes every time the buffer reaches the given size or if the `Proc` returns `true`. The buffer also flushes after the last data object was emitted by the source.
 
 ```ruby
 # File: my_etl_job.metacrunch
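The flush behaviour described in this hunk (flush at a given size, flush when a `Proc` returns `true`, and a final flush after the last data object) can be sketched in plain Ruby. `SketchBuffer` is a hypothetical class written for illustration, not metacrunch's implementation. In particular, passing the buffered items to the `Proc` is an assumption; the Readme excerpt does not say what the `Proc` receives.

```ruby
# Hypothetical buffer for illustration; not metacrunch's implementation.
# `size_or_proc` is either a positive Integer or a Proc. Passing the buffered
# items to the Proc is an assumption made for this sketch.
class SketchBuffer
  def initialize(size_or_proc)
    @flush_when = size_or_proc
    @items = []
  end

  # Adds an item; returns the buffered bulk when a flush is due, else nil.
  def push(item)
    @items << item
    flush if flush_due?
  end

  # Returns whatever is buffered (nil if empty) and resets the buffer.
  # A runner would call this once more after the source's last object.
  def flush
    return nil if @items.empty?

    bulk, @items = @items, []
    bulk
  end

  private

  def flush_due?
    if @flush_when.respond_to?(:call)
      @flush_when.call(@items) == true
    else
      @items.size >= @flush_when
    end
  end
end

buffer = SketchBuffer.new(2)
bulks = [1, 2, 3].map { |n| buffer.push(n) }.compact
bulks << buffer.flush # final flush after the last object
# bulks == [[1, 2], [3]]
```

The final `flush` call is what keeps the trailing `[3]` from being lost when the source runs out before the buffer fills.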
@@ -128,11 +133,9 @@ transformation ->(bulk) {
 }
 ```
 
-
-
-A destination is an object that writes the transformed data to an external system. Implementing destinations is easy – [see notes below](#implementing-destinations). A destination receives the return value from the last transformation as a parameter if the return value from the last transformation was not `nil`.
+### Defining a destination
 
-
+A destination is an object that writes the transformed data to an external system (e.g. a file, database etc.). Implementing destinations is easy – [see notes below](#implementing-destinations). A destination receives the return value from the last transformation as a parameter if the return value from the last transformation was not `nil`.
 
 ```ruby
 # File: my_etl_job.metacrunch
@@ -140,20 +143,20 @@ Using destinations is optional. In most cases using the last transformation to w
 destination MyDestination.new
 ```
 
-
+### Pre/Post process
 
 To run arbitrary code before the first transformation is run on the first data object use the `#pre_process` hook. To run arbitrary code after the last transformation is run on the last data object use `#post_process`. Like transformations, `#post_process` and `#pre_process` must be implemented using a `callable` object.
 
 ```ruby
 pre_process -> {
-  #
+  # Proc responds to #call
 }
 
 # MyCallable class defines #call
 post_process MyCallable.new
 ```
 
-
+### Defining job options
 
 metacrunch has build-in support to parameterize jobs. Using the `options` hook you can declare options that can be set/overridden by the CLI when [running your jobs](#running-etl-jobs).
 
@@ -191,9 +194,7 @@ Job options:
 REQUIRED
 ```
 
-
-
-#### Require non-option arguments
+### Require non-option arguments
 
 All non-option arguments that get passed to the job when running are available to the `ARGV` constant. If your job requires such arguments (e.g. if you work with a list of files) you can require it.
 
@@ -242,11 +243,11 @@ $ [bundle exec] metacrunch [options] JOB_FILE [job-options] [ARGS...]
 Implementing sources
 --------------------
 
-A metacrunch source is any Ruby object that responds to the
+A metacrunch source is any Ruby `Enumerable` object (an object that responds to the `#each` method) that yields data objects one by one.
 
 The data is usually a `Hash` instance, but could be other structures as long as the rest of your pipeline is expecting it.
 
-Any `
+Any `Enumerable` object (e.g. `Array`) responds to `#each` and can be used as a source in metacrunch.
 
 ```ruby
 # File: my_etl_job.metacrunch
@@ -288,9 +289,9 @@ source MyCsvSource.new("my_data.csv")
 Implementing transformations
 ----------------------------
 
-A metacrunch transformation is implemented as a `callable` object. A `callable` in Ruby is any object that responds to the
+A metacrunch transformation is implemented as a `callable` object. A `callable` in Ruby is any object that responds to the `#call` method.
 
-
+`Proc`s in Ruby respond to `#call`. They can be used to implement transformations inline.
 
 ```ruby
 # File: my_etl_job.metacrunch
@@ -329,7 +330,7 @@ transformation MyTransformation.new
 Implementing destinations
 -------------------------
 
-A destination is any Ruby object that responds to
+A destination is any Ruby object that responds to `#write(data)` and `#close`.
 
 Like sources you are encouraged to implement destinations as classes.
 
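The two contracts stated in the Readme changes (a source is an `Enumerable` that yields data objects from `#each`; a destination responds to `#write(data)` and `#close`) can be sketched without the gem. The class names `NumberSource` and `MemoryDestination` are hypothetical, and the loop at the end is only a stand-in for whatever the job runner does internally.

```ruby
# Hypothetical source: includes Enumerable and yields data objects from #each.
class NumberSource
  include Enumerable

  def initialize(limit)
    @limit = limit
  end

  def each
    return enum_for(:each) unless block_given?

    1.upto(@limit) { |n| yield({ "number" => n }) }
  end
end

# Hypothetical destination: responds to #write(data) and #close.
class MemoryDestination
  attr_reader :written, :closed

  def initialize
    @written = []
    @closed = false
  end

  def write(data)
    @written << data
  end

  def close
    @closed = true
  end
end

source = NumberSource.new(3)
destination = MemoryDestination.new

# Stand-in for the job runner: write every emitted object, then close.
source.each { |data| destination.write(data) }
destination.close
```

Because `NumberSource` includes `Enumerable`, helpers such as `map` and `select` also work on it, which is convenient when testing a source in isolation.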
data/lib/metacrunch/cli.rb
CHANGED
@@ -51,7 +51,7 @@ private
 def run!(job_file)
   if job_file.blank?
     error "You need to provide a job file."
-  elsif !File.
+  elsif !File.exist?(job_file)
     error "The file `#{job_file}` doesn't exist."
   else
     job_filename = File.expand_path(job_file)
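The added line uses `File.exist?`, which is the supported predicate in current Ruby; the `File.exists?` alias was deprecated and then removed in Ruby 3.2. The removed line is truncated in this view, so what it called is not shown. A minimal sketch of the same guard in plain Ruby follows. `job_file_error` is a hypothetical helper, and the nil/empty check stands in for ActiveSupport's `blank?`, which plain Ruby does not provide.

```ruby
# Hypothetical helper mirroring the guard in the hunk above. Returns an error
# message, or nil when the job file can be used. The nil/empty check is a
# stand-in for ActiveSupport's `blank?`.
def job_file_error(job_file)
  if job_file.nil? || job_file.strip.empty?
    "You need to provide a job file."
  elsif !File.exist?(job_file)
    "The file `#{job_file}` doesn't exist."
  end
end
```

Returning a message instead of calling `error` directly keeps the sketch free of CLI dependencies; the real method reports the error through the CLI.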
data/lib/metacrunch/version.rb
CHANGED
metadata
CHANGED
@@ -1,16 +1,15 @@
 --- !ruby/object:Gem::Specification
 name: metacrunch
 version: !ruby/object:Gem::Version
-  version: 4.2.
+  version: 4.2.1
 platform: ruby
 authors:
 - René Sprotte
 - Michael Sievers
 - Marcel Otto
-autorequire:
 bindir: exe
 cert_chain: []
-date:
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -40,16 +39,14 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: 0.8.1
-description:
-email:
 executables:
 - metacrunch
 extensions: []
 extra_rdoc_files: []
 files:
+- ".circleci/config.yml"
 - ".gitignore"
 - ".rspec"
-- ".travis.yml"
 - Gemfile
 - License.txt
 - Rakefile
@@ -73,7 +70,6 @@ homepage: http://github.com/ubpb/metacrunch
 licenses:
 - MIT
 metadata: {}
-post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -88,9 +84,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-
-rubygems_version: 2.6.11
-signing_key:
+rubygems_version: 3.6.9
 specification_version: 4
 summary: Data processing and ETL toolkit for Ruby
 test_files: []
data/.travis.yml
DELETED