cure 0.1.1 → 0.4.0
- checksums.yaml +4 -4
- data/.rubocop.yml +16 -3
- data/.tool-versions +1 -0
- data/Dockerfile +1 -1
- data/Gemfile +1 -0
- data/Gemfile.lock +25 -6
- data/README.md +59 -81
- data/docs/README.md +33 -0
- data/docs/about.md +219 -0
- data/docs/builder/add.md +52 -0
- data/docs/builder/black_white_list.md +83 -0
- data/docs/builder/copy.md +48 -0
- data/docs/builder/explode.md +70 -0
- data/docs/builder/main.md +43 -0
- data/docs/builder/remove.md +46 -0
- data/docs/examples/examples.md +164 -0
- data/docs/export/main.md +37 -0
- data/docs/extract/main.md +89 -0
- data/docs/metadata/main.md +29 -0
- data/docs/query/main.md +45 -0
- data/docs/sources/main.md +36 -0
- data/docs/transform/main.md +53 -0
- data/docs/validate/main.md +42 -0
- data/exe/cure +12 -37
- data/exe/cure.old +59 -0
- data/lib/cure/builder/base_builder.rb +151 -0
- data/lib/cure/builder/candidate.rb +56 -0
- data/lib/cure/cli/command.rb +105 -0
- data/lib/cure/cli/generate_command.rb +54 -0
- data/lib/cure/cli/new_command.rb +52 -0
- data/lib/cure/cli/run_command.rb +19 -0
- data/lib/cure/cli/templates/README.md.erb +1 -0
- data/lib/cure/cli/templates/gemfile.erb +5 -0
- data/lib/cure/cli/templates/gitignore.erb +181 -0
- data/lib/cure/cli/templates/new_template.rb.erb +31 -0
- data/lib/cure/cli/templates/tool-versions.erb +1 -0
- data/lib/cure/config.rb +151 -13
- data/lib/cure/coordinator.rb +108 -0
- data/lib/cure/database.rb +191 -0
- data/lib/cure/dsl/builder.rb +26 -0
- data/lib/cure/dsl/exporters.rb +45 -0
- data/lib/cure/dsl/extraction.rb +60 -0
- data/lib/cure/dsl/metadata.rb +33 -0
- data/lib/cure/dsl/queries.rb +36 -0
- data/lib/cure/dsl/source_files.rb +36 -0
- data/lib/cure/dsl/template.rb +131 -0
- data/lib/cure/dsl/transformations.rb +95 -0
- data/lib/cure/dsl/validator.rb +22 -0
- data/lib/cure/export/base_processor.rb +194 -0
- data/lib/cure/export/manager.rb +24 -0
- data/lib/cure/extract/base_processor.rb +47 -0
- data/lib/cure/extract/csv_lookup.rb +43 -0
- data/lib/cure/extract/extractor.rb +80 -0
- data/lib/cure/extract/filter.rb +118 -0
- data/lib/cure/extract/named_range.rb +94 -0
- data/lib/cure/extract/named_range_processor.rb +128 -0
- data/lib/cure/extract/variable.rb +25 -0
- data/lib/cure/extract/variable_processor.rb +57 -0
- data/lib/cure/generator/base_generator.rb +61 -0
- data/lib/cure/generator/case_generator.rb +32 -0
- data/lib/cure/generator/character_generator.rb +41 -0
- data/lib/cure/generator/erb_generator.rb +21 -0
- data/lib/cure/generator/eval_generator.rb +34 -0
- data/lib/cure/generator/faker_generator.rb +31 -0
- data/lib/cure/generator/guid_generator.rb +21 -0
- data/lib/cure/generator/hex_generator.rb +21 -0
- data/lib/cure/generator/imports.rb +16 -0
- data/lib/cure/generator/number_generator.rb +21 -0
- data/lib/cure/generator/placeholder_generator.rb +26 -0
- data/lib/cure/generator/proc_generator.rb +21 -0
- data/lib/cure/generator/redact_generator.rb +22 -0
- data/lib/cure/generator/static_generator.rb +21 -0
- data/lib/cure/generator/variable_generator.rb +26 -0
- data/lib/cure/helpers/file_helpers.rb +50 -0
- data/lib/cure/helpers/object_helpers.rb +17 -0
- data/lib/cure/helpers/perf_helpers.rb +30 -0
- data/lib/cure/helpers/string.rb +54 -0
- data/lib/cure/launcher.rb +125 -0
- data/lib/cure/log.rb +10 -3
- data/lib/cure/planner.rb +136 -0
- data/lib/cure/strategy/append_strategy.rb +28 -0
- data/lib/cure/strategy/base_strategy.rb +98 -0
- data/lib/cure/strategy/contain_strategy.rb +51 -0
- data/lib/cure/strategy/end_with_strategy.rb +52 -0
- data/lib/cure/strategy/full_strategy.rb +28 -0
- data/lib/cure/strategy/history/history_cache.rb +82 -0
- data/lib/cure/strategy/imports.rb +12 -0
- data/lib/cure/strategy/match_strategy.rb +48 -0
- data/lib/cure/strategy/prepend_strategy.rb +28 -0
- data/lib/cure/strategy/regex_strategy.rb +55 -0
- data/lib/cure/strategy/split_strategy.rb +58 -0
- data/lib/cure/strategy/start_with_strategy.rb +53 -0
- data/lib/cure/transformation/candidate.rb +47 -36
- data/lib/cure/transformation/transform.rb +29 -71
- data/lib/cure/validator/base_rule.rb +78 -0
- data/lib/cure/validator/candidate.rb +54 -0
- data/lib/cure/validator/manager.rb +21 -0
- data/lib/cure/validators.rb +71 -0
- data/lib/cure/version.rb +1 -1
- data/lib/cure.rb +19 -6
- data/templates/dsl_example.rb +48 -0
- data/templates/empty_template.rb +31 -0
- metadata +161 -23
- data/lib/cure/csv_helpers.rb +0 -6
- data/lib/cure/export/exporter.rb +0 -49
- data/lib/cure/file_helpers.rb +0 -38
- data/lib/cure/generator/base.rb +0 -148
- data/lib/cure/main.rb +0 -63
- data/lib/cure/object_helpers.rb +0 -27
- data/lib/cure/strategy/base.rb +0 -223
- data/templates/aws_cur_template.json +0 -143
- data/templates/example_template.json +0 -38
@@ -0,0 +1,83 @@
|
|
1
|
+
[... go back to build contents](main.md)
|
2
|
+
|
3
|
+
## Black/White List
|
4
|
+
|
5
|
+
### What is it?
|
6
|
+
|
7
|
+
These builders operate at a sheet level, bulk changing the columns provided.
|
8
|
+
- Blacklist will bulk remove any columns **in** the list provided.
|
9
|
+
- Whitelist will bulk remove any columns **not in** the list provided.
|
10
|
+
|
11
|
+
### Why would you need it?
|
12
|
+
|
13
|
+
If you have a large spreadsheet and want to keep or remove several columns at once,
|
14
|
+
the quickest way to do so is via this option.
|
15
|
+
|
16
|
+
### Full Configuration
|
17
|
+
|
18
|
+
```ruby
|
19
|
+
candidate(named_range: "mysheet") do
|
20
|
+
blacklist options: { columns: %w[col_a col_b] }
|
21
|
+
whitelist options: { columns: %w[col_c col_d] }
|
22
|
+
end
|
23
|
+
```
|
24
|
+
|
25
|
+
- `named_range`: specifies the named range holding the column; if no named range has been set, you can leave it blank.
- `options`:
  - `columns`: mandatory; the list of column names to filter on.
|
28
|
+
|
29
|
+
### Example
|
30
|
+
|
31
|
+
#### Blacklist
|
32
|
+
|
33
|
+
```ruby
|
34
|
+
candidate(named_range: "mysheet") do
|
35
|
+
blacklist options: { columns: %w[col_a col_b] }
|
36
|
+
end
|
37
|
+
```
|
38
|
+
|
39
|
+
Original input:
|
40
|
+
```
|
41
|
+
+-------+-------+-------+
|
42
|
+
| col_a | col_b | col_c |
|
43
|
+
+-------+-------+-------+
|
44
|
+
| a | b | c |
|
45
|
+
+-------+-------+-------+
|
46
|
+
```
|
47
|
+
|
48
|
+
changes to:
|
49
|
+
|
50
|
+
```
|
51
|
+
+-------+
|
52
|
+
| col_c |
|
53
|
+
+-------+
|
54
|
+
| c |
|
55
|
+
+-------+
|
56
|
+
```
|
57
|
+
|
58
|
+
#### Whitelist
|
59
|
+
|
60
|
+
```ruby
|
61
|
+
candidate(named_range: "mysheet") do
|
62
|
+
whitelist options: { columns: %w[col_a col_b] }
|
63
|
+
end
|
64
|
+
```
|
65
|
+
|
66
|
+
Original input:
|
67
|
+
```
|
68
|
+
+-------+-------+-------+
|
69
|
+
| col_a | col_b | col_c |
|
70
|
+
+-------+-------+-------+
|
71
|
+
| a | b | c |
|
72
|
+
+-------+-------+-------+
|
73
|
+
```
|
74
|
+
|
75
|
+
changes to:
|
76
|
+
|
77
|
+
```
|
78
|
+
+-------+-------+
|
79
|
+
| col_a | col_b |
|
80
|
+
+-------+-------+
|
81
|
+
| a | b |
|
82
|
+
+-------+-------+
|
83
|
+
```
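
As a rough illustration of the two filters above, the sketch below models a sheet as an array of row hashes in plain Ruby. It does not use Cure; the `blacklist` and `whitelist` methods here are hypothetical helpers written for this example and should not be read as the gem's implementation.

```ruby
# Illustrative only: a sheet modelled as an array of row hashes.
# These helpers are hypothetical and do not reflect Cure's internals.
def blacklist(rows, columns)
  # Drop every column whose name is in the list.
  rows.map { |row| row.reject { |name, _value| columns.include?(name) } }
end

def whitelist(rows, columns)
  # Keep only the columns whose names are in the list.
  rows.map { |row| row.select { |name, _value| columns.include?(name) } }
end

sheet = [{ "col_a" => "a", "col_b" => "b", "col_c" => "c" }]

puts blacklist(sheet, %w[col_a col_b]).first.keys.join(", ") # prints "col_c"
puts whitelist(sheet, %w[col_a col_b]).first.keys.join(", ") # prints "col_a, col_b"
```

Both helpers return new rows and leave the input untouched, which keeps the two filters easy to compare side by side.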
|
@@ -0,0 +1,48 @@
|
|
1
|
+
[... go back to build contents](main.md)
|
2
|
+
|
3
|
+
## Copy
|
4
|
+
|
5
|
+
### What is it?
|
6
|
+
|
7
|
+
Copy builder will copy an entire column from the spreadsheet.
|
8
|
+
|
9
|
+
### Why would you need it?
|
10
|
+
|
11
|
+
Copy a column from the spreadsheet, useful if you want to transform or manipulate the column in the transform stage.
|
12
|
+
|
13
|
+
### Full Configuration
|
14
|
+
|
15
|
+
```ruby
|
16
|
+
build do
|
17
|
+
candidate(column: "col_b", named_range: "_default") { copy options: { to_column: "col_b_copy" } }
|
18
|
+
end
|
19
|
+
```
|
20
|
+
- `column`: represents the column name, mandatory.
- `named_range`: specifies the named range holding the column; if no named range has been set, you can leave it blank.
- `options`:
  - `to_column`: name of the copied column; if not set, it defaults to `<column>_copy`.
|
24
|
+
|
25
|
+
### Example
|
26
|
+
|
27
|
+
```ruby
|
28
|
+
build do
|
29
|
+
candidate(column: "col_a") { copy options: { to_column: "col_a_copy" } }
|
30
|
+
end
|
31
|
+
```
|
32
|
+
|
33
|
+
Original input:
|
34
|
+
```
|
35
|
+
+-------+
|
36
|
+
| col_a |
|
37
|
+
+-------+
|
38
|
+
| a |
|
39
|
+
+-------+
|
40
|
+
```
|
41
|
+
changes to:
|
42
|
+
```
|
43
|
+
+-------+------------+
|
44
|
+
| col_a | col_a_copy |
|
45
|
+
+-------+------------+
|
46
|
+
| a | a |
|
47
|
+
+-------+------------+
|
48
|
+
```
|
@@ -0,0 +1,70 @@
|
|
1
|
+
[... go back to build contents](main.md)
|
2
|
+
|
3
|
+
## Explode
|
4
|
+
|
5
|
+
### What is it?
|
6
|
+
|
7
|
+
Explode takes a JSON string and breaks it out into one column per key.
|
8
|
+
|
9
|
+
### Why would you need it?
|
10
|
+
|
11
|
+
Use it when you have a JSON column and want each key to become its own column. This is popular in technical
|
12
|
+
|
13
|
+
### Full Configuration
|
14
|
+
|
15
|
+
```yaml
build:
  candidates:
    - column: "column_name"
      named_range: "default"
      action:
        type: "explode"
        options:
          filter:
            type: "whitelist|blacklist"
            values:
              - "example"
```
|
28
|
+
- `column`: represents the column name, mandatory.
- `named_range`: specifies the named range holding the column; if no named range has been set, you can leave it blank.
- `action`: represents the action that will be taken on the data.
  - `type`: specifies the type of action; in this instance, `explode`.
  - `options`:
    - `filter`: filters the candidate columns; you can use either a whitelist or a blacklist.
      - `values`: the names of the columns to filter by.
|
36
|
+
|
37
|
+
|
38
|
+
### Example
|
39
|
+
|
40
|
+
```yaml
build:
  candidates:
    - column: tags
      action:
        type: "explode"
```
|
47
|
+
|
48
|
+
Original input:
|
49
|
+
```
|
50
|
+
+---------------------------------+
|
51
|
+
| tags |
|
52
|
+
+---------------------------------+
|
53
|
+
| {"type":"string","name":"abcd"} |
|
54
|
+
+---------------------------------+
|
55
|
+
| {"tier":"high"} |
|
56
|
+
+---------------------------------+
|
57
|
+
```
|
58
|
+
|
59
|
+
changes to:
|
60
|
+
|
61
|
+
```
+---------------------------------+--------+------+------+
| tags                            | type   | name | tier |
+---------------------------------+--------+------+------+
| {"type":"string","name":"abcd"} | string | abcd |      |
| {"tier":"high"}                 |        |      | high |
+---------------------------------+--------+------+------+
```
|
69
|
+
|
70
|
+
**Note:** if you want to remove the original column (tags) you can do so with the [remove](remove.md) option.
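
To make the behaviour above concrete, here is a minimal plain-Ruby sketch of the idea: parse each JSON cell, collect every key seen across rows, and add one column per key. The `explode` method is a hypothetical helper for illustration only; it is not Cure's implementation and ignores the whitelist/blacklist filter.

```ruby
require "json"

# Illustrative only: expand a JSON column into one column per key.
# `explode` is a hypothetical helper, not Cure's implementation.
def explode(rows, column)
  parsed = rows.map { |row| JSON.parse(row[column]) }
  # Collect every key seen in any row, in first-seen order.
  keys = parsed.flat_map(&:keys).uniq
  rows.zip(parsed).map do |row, json|
    # Rows that lack a key get an empty cell for that column.
    row.merge(keys.to_h { |key| [key, json.fetch(key, "")] })
  end
end

rows = [
  { "tags" => '{"type":"string","name":"abcd"}' },
  { "tags" => '{"tier":"high"}' }
]

exploded = explode(rows, "tags")
puts exploded.first.keys.join(", ") # prints "tags, type, name, tier"
```

The original `tags` column is kept in this sketch, matching the note above that removing it is a separate step.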
|
@@ -0,0 +1,43 @@
|
|
1
|
+
Source > Extract > Validate > **Build** > Query > Transform > Export
|
2
|
+
|
3
|
+
Build
|
4
|
+
=======
|
5
|
+
|
6
|
+
### About
|
7
|
+
|
8
|
+
The build step immediately follows the **extract** step, and operates at a column level on the spreadsheet. It provides
|
9
|
+
an interface for manipulations that you may want to apply across columns. Individual build steps are called
|
10
|
+
candidates, and *multiple steps can be performed on a single column* if desired.
|
11
|
+
|
12
|
+
---
|
13
|
+
|
14
|
+
**When you should use this**: You have a spreadsheet that requires changes to the column structure of the data. This
|
15
|
+
may be as trivial as adding or removing a column, or *exploding* a JSON object (Key => Val) into individual columns.
|
16
|
+
|
17
|
+
See below an example configuration block:
|
18
|
+
|
19
|
+
```ruby
|
20
|
+
build do
|
21
|
+
# Whitelist/blacklist: no column needs to be passed to the candidate
|
22
|
+
candidate do
|
23
|
+
blacklist options: { columns: %w[col_a col_b] }
|
24
|
+
whitelist options: { columns: %w[col_c col_d] }
|
25
|
+
end
|
26
|
+
|
27
|
+
# Simple addition of new column
|
28
|
+
candidate(column: "full_name") { add options: { default_value: "ABC" } }
|
29
|
+
# Simple renaming of existing column
|
30
|
+
candidate(column: "Tags") { rename options: { new_name: "test" } }
|
31
|
+
end
|
32
|
+
```
|
33
|
+
|
34
|
+
- `column`: represents the column name, mandatory.
|
35
|
+
- `named_range`: specifies the named range holding the column; if no named range has been set, you can leave it blank.
|
36
|
+
|
37
|
+
### Components
|
38
|
+
|
39
|
+
There are four different types of operations you can perform in this step:
|
40
|
+
- [add](add.md)
|
41
|
+
- [remove](remove.md)
|
42
|
+
- [copy](copy.md)
|
43
|
+
- [black_white_list](black_white_list.md)
|
@@ -0,0 +1,46 @@
|
|
1
|
+
[... go back to build contents](main.md)
|
2
|
+
|
3
|
+
## Remove
|
4
|
+
|
5
|
+
### What is it?
|
6
|
+
|
7
|
+
Remove builder will remove an entire column from the spreadsheet.
|
8
|
+
|
9
|
+
### Why would you need it?
|
10
|
+
|
11
|
+
Removes a column from the spreadsheet, useful if you want to remove entire columns from the output.
|
12
|
+
|
13
|
+
### Full Configuration
|
14
|
+
|
15
|
+
```ruby
|
16
|
+
build do
|
17
|
+
candidate(column: "remove_this", named_range: "mysheet") { remove }
|
18
|
+
end
|
19
|
+
```
|
20
|
+
- `column`: represents the column name, mandatory.
|
21
|
+
- `named_range`: specifies the named range holding the column; if no named range has been set, you can leave it blank.
|
22
|
+
|
23
|
+
### Example
|
24
|
+
|
25
|
+
```ruby
|
26
|
+
build do
|
27
|
+
candidate(column: "col_b") { remove }
|
28
|
+
end
|
29
|
+
```
|
30
|
+
|
31
|
+
Original input:
|
32
|
+
```
|
33
|
+
+-------+-------+
|
34
|
+
| col_a | col_b |
|
35
|
+
+-------+-------+
|
36
|
+
| a | b |
|
37
|
+
+-------+-------+
|
38
|
+
```
|
39
|
+
changes to:
|
40
|
+
```
|
41
|
+
+-------+
|
42
|
+
| col_a |
|
43
|
+
+-------+
|
44
|
+
| a |
|
45
|
+
+-------+
|
46
|
+
```
|
@@ -0,0 +1,164 @@
|
|
1
|
+
## Examples
|
2
|
+
|
3
|
+
Below are some examples of Cure being used to transform odd CSV formats or to perform unusual tasks.
|
4
|
+
|
5
|
+
### Multi row grouping
|
6
|
+
|
7
|
+
In the example below we want to:
|
8
|
+
1. Group the rows on identifier
|
9
|
+
2. Change gender to a single letter
|
10
|
+
3. Create a full name column that joins first_name and last_name, and capitalizes them.
|
11
|
+
|
12
|
+
| id | identifier | first_name | last_name | age | gender |
|----|------------|------------|-----------|-----|--------|
| 1  | 1          | joe        | smith     | 20  |        |
| 2  | 1          |            |           |     | male   |
| 3  | 2          | lean       | davis     | 32  |        |
| 4  | 2          |            |           |     | female |
|
18
|
+
|
19
|
+
to
|
20
|
+
|
21
|
+
| id | identifier | first_name | last_name | age | gender | full_name  |
|----|------------|------------|-----------|-----|--------|------------|
| 1  | 1          | joe        | smith     | 20  | M      | Joe Smith  |
| 3  | 2          | lean       | davis     | 32  | F      | Lean Davis |
|
25
|
+
|
26
|
+
using
|
27
|
+
|
28
|
+
```ruby
build do
  candidate column: "full_name" do
    add options: { default_value: "" }
  end
end

transform do
  from query: <<-SQL
    SELECT
      id as id,
      identifier as identifier,
      group_concat(first_name, '') as first_name,
      group_concat(last_name, '') as last_name,
      group_concat(gender, '') as gender,
      group_concat(age, '') as age,
      full_name FROM _default
    GROUP BY identifier
  SQL

  candidate column: "gender" do
    with_translation { replace("full").with("case",
      statement: {
        switch: [
          {
            case: "male",
            return_value: "M"
          },{
            case: "female",
            return_value: "F"
          }
        ],
        else: [
          return_value: "<unknown gender>"
        ]
      })
    }
  end

  candidate column: "full_name" do
    with_translation { replace("full").with("erb",
      template: "<%= first_name.capitalize %> <%= last_name.capitalize %>")
    }
  end
end

export do
  terminal title: "Exported", limit_rows: 5
end
```
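
For readers who want to check the expected result without running Cure, the grouping, gender mapping and full-name steps can be approximated in plain Ruby. This is a sketch under assumptions: empty cells are modelled as empty strings, the first `id` of each group is kept, and none of the names below belong to Cure's API.

```ruby
# Illustrative only: a plain-Ruby approximation of the grouping, case and ERB steps.
# None of this is Cure's API; it mirrors the expected result of the template.
gender_codes = { "male" => "M", "female" => "F" }

rows = [
  { id: 1, identifier: 1, first_name: "joe",  last_name: "smith", age: "20", gender: "" },
  { id: 2, identifier: 1, first_name: "",     last_name: "",      age: "",   gender: "male" },
  { id: 3, identifier: 2, first_name: "lean", last_name: "davis", age: "32", gender: "" },
  { id: 4, identifier: 2, first_name: "",     last_name: "",      age: "",   gender: "female" }
]

grouped = rows.group_by { |row| row[:identifier] }.map do |identifier, group|
  # Concatenate each column across the group, similar to group_concat(col, '').
  merged = %i[first_name last_name age gender].to_h do |col|
    [col, group.map { |row| row[col] }.join]
  end
  # Map the gender to a single letter, falling back to a marker value.
  merged[:gender] = gender_codes.fetch(merged[:gender], "<unknown gender>")
  merged[:full_name] = "#{merged[:first_name].capitalize} #{merged[:last_name].capitalize}"
  # Keep the id of the first row in each group (an assumption of this sketch).
  { id: group.first[:id], identifier: identifier }.merge(merged)
end

grouped.each { |row| puts row.values.join(" | ") }
# prints:
# 1 | 1 | joe | smith | 20 | M | Joe Smith
# 3 | 2 | lean | davis | 32 | F | Lean Davis
```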
|
79
|
+
|
80
|
+
### AWS Cost and Usage Report Anonymization
|
81
|
+
|
82
|
+
Below is a small subset of the Cost and Usage Report provided by Amazon that holds information we want to transform.
|
83
|
+
|
84
|
+
Some thoughts:
|
85
|
+
- the **identity/LineItemId** column has seemingly random characters that may be the same (see row ids 9 and 10)
|
86
|
+
- **lineItem/ResourceId** has records that hold account numbers, we want to ensure that they are the same as **bill/PayerAccountId**
|
87
|
+
and **lineItem/UsageAccountId** for consistent data.
|
88
|
+
|
89
|
+
| id | identity/LineItemId | bill/PayerAccountId | lineItem/UsageAccountId | lineItem/ProductCode | lineItem/ResourceId |
|----|------------------------------------------------------|---------------------|-------------------------|----------------------|----------------------------------------------|
| 1 | mggj00y7rig8p3xjma6rpzkvtrn98q4a0ortz9ddgquu0pv3xshq | 9876543210 | 9876543210 | AmazonS3 | cloudtrail-9876543210 |
| 2 | t8ubihdw6ad39awf1748v98yim4uh6wyjzr59bziwwcfnyu4rxhf | 9876543210 | 9876543210 | AmazonS3 | cloudtrail-9876543210 |
| 3 | 8c8u2fcetmrz3f0x52coe4wgjv77ffxx2ivgitg1a1nacpo8menv | 9876543210 | 9876543210 | AmazonCloudFront | arn:aws:cloudfront::9876543210:Overhold |
| 4 | 9jqoasom8qnxma5rjqhawkncrhev0ocsp4ax5pngrp8l1yno03v3 | 9876543210 | 9876543210 | AmazonS3 | aws-cloudtrail-logs-9876543210 |
| 5 | 35znibzyuoisze9x45377jqkbd7o677w4mhgl8hyte8born5h1h3 | 9876543210 | 9876543210 | AmazonCloudFront | arn:aws:cloudfront::9876543210:Overhold |
| 6 | tb8qzhsrqu0z613jervo541l7p95b5pq2k80m7hcsnqjjjs6jnlx | 9876543210 | 9876543210 | awskms | arn:aws:kms:ap-southeast-2:9876543210:Zoolab |
| 7 | c0k9bpm5y5m1aoebsrlc2ozdgqoqjkyjy7z0hx7kv4y93gx8ioji | 9876543210 | 9876543210 | AWSLambda | arn:aws:lambda:Trippledex |
| 8 | ju8pmo0qqn5c2tapej4toy3c95w08ym6uar9hllyf3r0oj1hoiya | 9876543210 | 9876543210 | AmazonEC2 | vol-3ef2aece632 |
| 9 | f5kta3av4k5k2fve6l8g370bj41leqzkazsad28hjnu2xngn8f86 | 9876543210 | 9876543210 | AmazonS3 | cloudtrail-9876543210 |
| 10 | f5kta3av4k5k2fve6l8g370bj41leqzkazsad28hjnu2xngn8f86 | 9876543210 | 9876543210 | AmazonS3 | cloudtrail-9876543210 |
|
101
|
+
|
102
|
+
##### Configuration
|
103
|
+
```ruby
transform do
  # Operate on the "identity/LineItemId" column
  candidate column: "identity/LineItemId" do
    # Replace the full record with a random character string of length 52, consisting only of
    # lowercase letters and numbers.
    with_translation { replace("full").with("character", length: 52, types: %w[lowercase number]) }
  end

  candidate column: "bill/PayerAccountId" do
    # Replace the full record with a placeholder named :account_number (see the bottom of the file for placeholders)
    with_translation { replace("full").with("placeholder", name: :account_number) }
  end

  candidate column: "lineItem/UsageAccountId" do
    with_translation { replace("full").with("number", length: 10) }
  end

  candidate column: "lineItem/ResourceId" do
    # If there is a match (i-[my-group]), replace just the match group with a hex string of length 10
    with_translation { replace("regex", regex_cg: "^i-(.*)").with("hex", length: 10) }
    # If there is a match (vol-[my-group]), replace just the match group with a hex string of length 10
    with_translation { replace("regex", regex_cg: "^vol-(.*)").with("hex", length: 10) }
    # If the string contains a token :, replace the element at index 4 (zero-based) with the account_number placeholder.
    with_translation { replace("split", token: ":", index: 4).with("placeholder", name: :account_number) }
    # If the string contains a token -, replace the last element with the account_number placeholder.
    with_translation { replace("split", token: "-", index: -1).with("placeholder", name: :account_number) }
    # If the string contains a token :, replace the last element with a Faker value (Faker::App.name).
    with_translation { replace("split", token: ":", index: -1).with("faker", module: "App", method: "name") }

    # If no match is found, replace the whole value with the prefix hidden_ followed by a random 10-character hex string
    if_no_match { replace("full").with("hex", prefix: "hidden_", length: 10) }
  end

  # Hardcoded values that we may wish to reference
  place_holders({account_number: 1_234_567_890})
end

export do
  # Export to terminal a table with only 10 rows.
  terminal title: "Exported", row_count: 10
end
```
|
146
|
+
|
147
|
+
With these rules, the above file becomes:
|
148
|
+
|
149
|
+
Output:
|
150
|
+
|
151
|
+
| id | identity/LineItemId | bill/PayerAccountId | lineItem/UsageAccountId | lineItem/ProductCode | lineItem/ResourceId |
|----|------------------------------------------------------|---------------------|-------------------------|----------------------|----------------------------------------------|
| 1 | ozsmh5j4oqnfgnv7k82tx1yne4h62rt2rfiilo0clt306ts9ib9g | 1234567890 | 1234567890 | AmazonS3 | cloudtrail-1234567890 |
| 2 | soha1946igwsaz8iju4a6q9305yd1cj9gluqwxu6lmjor1wf4yb0 | 1234567890 | 1234567890 | AmazonS3 | cloudtrail-1234567890 |
| 3 | k5a29qle33aqoemi74m75pwmhv5xq4sau6e6pyc9pc93g6stzk8s | 1234567890 | 1234567890 | AmazonCloudFront | arn:aws:cloudfront::1234567890:Latlux |
| 4 | 9i0pxzj7mgfy2nnjhalxatck9xidqt55vvmopiotv23raaol9wh1 | 1234567890 | 1234567890 | AmazonS3 | aws-cloudtrail-logs-1234567890 |
| 5 | uvws7h5xqc8qov8ana6arxyr0urkhpgu9a0g3wzv1emq9z19bl9m | 1234567890 | 1234567890 | AmazonCloudFront | arn:aws:cloudfront::1234567890:Latlux |
| 6 | lhv6swfx2ulsfs8mpfrjutgq45kixouh0xjfvfo40g42757r7mje | 1234567890 | 1234567890 | awskms | arn:aws:kms:ap-southeast-2:1234567890:Sonair |
| 7 | zm6gwy8c5qxbe24du6oipdls3iyjp83a3000z6p1l26xo44e0swa | 1234567890 | 1234567890 | AWSLambda | arn:aws:lambda:Biodex |
| 8 | xcpy7jqbash47ckhyv8bnaqrf1tvsmrqq325vbebu550v7nnhef5 | 1234567890 | 1234567890 | AmazonEC2 | vol-1234567890 |
| 9 | o1b4h0yvkw0jkbrhewqr1s0cd9abyqol1r90jtitu7vcr2e6qvcb | 1234567890 | 1234567890 | AmazonS3 | cloudtrail-1234567890 |
| 10 | o1b4h0yvkw0jkbrhewqr1s0cd9abyqol1r90jtitu7vcr2e6qvcb | 1234567890 | 1234567890 | AmazonS3 | cloudtrail-1234567890 |
|
163
|
+
|
164
|
+
Note that rows 9 and 10 still share the same **identity/LineItemId**, and **lineItem/ResourceId** references our new made-up account number.
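
The split-based rules in the configuration can be pictured with a small plain-Ruby sketch: split the value on a token, swap one element, and join it back. `replace_split` is a hypothetical helper written for this example and is not Cure's split strategy; it assumes zero-based indexes, with negative indexes counting from the end, which is consistent with the input and output tables above.

```ruby
# Illustrative only: replace one element of a token-separated string.
# `replace_split` is a hypothetical helper, not Cure's split strategy.
def replace_split(value, token:, index:, with:)
  parts = value.split(token, -1) # -1 keeps trailing empty fields
  # Leave the value alone when the token is absent or the index is out of range.
  return value if parts.size < 2 || parts[index].nil?

  parts[index] = with
  parts.join(token)
end

account_number = "1234567890" # stand-in for the :account_number placeholder

puts replace_split("arn:aws:kms:ap-southeast-2:9876543210:Zoolab", token: ":", index: 4, with: account_number)
# prints "arn:aws:kms:ap-southeast-2:1234567890:Zoolab"
puts replace_split("cloudtrail-9876543210", token: "-", index: -1, with: account_number)
# prints "cloudtrail-1234567890"
```

A value without the token, such as `vol-3ef2aece632` split on `:`, is returned unchanged by this sketch, which is the case the `if_no_match` rule is meant to catch.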
|
data/docs/export/main.md
ADDED
@@ -0,0 +1,37 @@
|
|
1
|
+
Source > Extract > Validate > Build > Query > Transform > **Export**
|
2
|
+
|
3
|
+
Export
|
4
|
+
=======
|
5
|
+
|
6
|
+
### About
|
7
|
+
|
8
|
+
Exporting is the final step, where you are given each row after all previous steps have run. You can have multiple
|
9
|
+
exporters, each of which can point to a different named range or to the same one.
|
10
|
+
|
11
|
+
A common pattern is to export the first 10 rows to terminal, and export the larger dataset to a CSV.
|
12
|
+
|
13
|
+
---
|
14
|
+
|
15
|
+
**When you should use this**: You have transformed your data, and you want to save the results.
|
16
|
+
|
17
|
+
See below an example configuration block:
|
18
|
+
|
19
|
+
### Example
|
20
|
+
|
21
|
+
```ruby
|
22
|
+
export do
|
23
|
+
# Export to terminal window
|
24
|
+
terminal title: "Exported", limit_rows: 5, named_range: "mysheet"
|
25
|
+
|
26
|
+
# Export to a single CSV
|
27
|
+
csv file_name: "mysheet", directory: "/tmp/cure", named_range: "mysheet"
|
28
|
+
|
29
|
+
# Export to multiple CSVs each with 100 rows.
|
30
|
+
# These will be exported as 1_mysheet.csv, 2_mysheet.csv... n_mysheet.csv
|
31
|
+
chunk_csv file_name_prefix: "mysheet", directory: "/tmp/cure", chunk_size: 100, named_range: "mysheet"
|
32
|
+
|
33
|
+
# Yield out each row to a custom proc. This allows for the caller to do whatever they want
|
34
|
+
# with the row. You could use this to make an API call to insert data into a remote system.
|
35
|
+
yield_row named_range: "mysheet", proc: proc { |row| puts row }
|
36
|
+
end
|
37
|
+
```
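
The chunked export naming described in the comments above (`1_mysheet.csv`, `2_mysheet.csv`, and so on) can be sketched in plain Ruby. `chunk_files` is a hypothetical helper used only to show how rows map onto numbered files; it writes nothing to disk and is not Cure's `chunk_csv` exporter.

```ruby
# Illustrative only: how rows might be split into numbered chunk files.
# `chunk_files` is a hypothetical helper, not Cure's chunk_csv exporter.
def chunk_files(rows, file_name_prefix:, chunk_size:)
  # Number the chunks from 1 and pair each file name with its rows.
  rows.each_slice(chunk_size).map.with_index(1) do |chunk, number|
    ["#{number}_#{file_name_prefix}.csv", chunk]
  end.to_h
end

rows = (1..250).map { |n| [n, "row #{n}"] }
files = chunk_files(rows, file_name_prefix: "mysheet", chunk_size: 100)

files.each { |name, chunk| puts "#{name}: #{chunk.size} rows" }
# prints:
# 1_mysheet.csv: 100 rows
# 2_mysheet.csv: 100 rows
# 3_mysheet.csv: 50 rows
```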
|
@@ -0,0 +1,89 @@
|
|
1
|
+
Source > **Extract** > Validate > Build > Query > Transform > Export
|
2
|
+
|
3
|
+
Extract
|
4
|
+
=======
|
5
|
+
|
6
|
+
### About
|
7
|
+
|
8
|
+
The extract step is the first step that is undertaken on the spreadsheet. If the spreadsheet is already in the form you need
|
9
|
+
(where headers and rows are in the right place), this step is not necessary.
|
10
|
+
|
11
|
+
There are two main processes that are available in this section: named ranges and variables.
|
12
|
+
|
13
|
+
**Named ranges** are subsets of your spreadsheet's data. In some situations, spreadsheets may have more than one section
|
14
|
+
of data that you are interested in. Using named ranges, and simple notation (e.g. B2:G6), you can select as many ranges
|
15
|
+
as needed, and format them back together at the end.
|
16
|
+
|
17
|
+
**Variables** are single values (for example, the cell at E5) that are extracted into a hash and made available at the transform stage. A common use
|
18
|
+
for this would be to extract a value from somewhere in the spreadsheet to allow it to be added to each row.
|
19
|
+
|
20
|
+
---
|
21
|
+
|
22
|
+
**When you should use this**: You have a spreadsheet that has more data than you need, or is in a format that is not
|
23
|
+
strictly tabular. You may want to extract a part (or multiple parts) of the spreadsheet, and discard the
|
24
|
+
rest.
|
25
|
+
|
26
|
+
See below an example configuration block:
|
27
|
+
|
28
|
+
### Example
|
29
|
+
|
30
|
+
```ruby
|
31
|
+
extract do
|
32
|
+
named_range name: "main", at: "B2:D4", headers: "B2:B4", ref_name: "_default"
|
33
|
+
named_range name: "secondary", at: "A2:D3", ref_name: "_default"
|
34
|
+
named_range name: "full", at: -1, ref_name: "_default"
|
35
|
+
|
36
|
+
variable name: "my_string", at: "E5", ref_name: "_default"
|
37
|
+
end
|
38
|
+
```
|
39
|
+
|
40
|
+
- `name`: represents what you want to call the named range, mandatory.
|
41
|
+
- `at`: specifies the named range location in the sheet. -1 will collect the entire sheet.
|
42
|
+
- `headers`: specifies the location of the headers. Omit this unless the headers are not on the top row.
|
43
|
+
- `ref_name`: specifies the file to use to extract the named range. If you are only processing a single file you
|
44
|
+
do not need to supply it (the default ref_name is "_default").
|
45
|
+
|
46
|
+
If you do not supply any named range, a default named range called "_default" is created, which encompasses the entire sheet.
|
47
|
+
You do not need to reference it in other parts of the template; any named range that is not set defaults to "_default".
|
48
|
+
|
49
|
+
Original input:
|
50
|
+
```
|
51
|
+
+----+----+----+----+----+
|
52
|
+
| a1 | b1 | c1 | d1 | e1 |
|
53
|
+
| a2 | b2 | c2 | d2 | e2 |
|
54
|
+
| a3 | b3 | c3 | d3 | e3 |
|
55
|
+
| a4 | b4 | c4 | d4 | e4 |
|
56
|
+
| a5 | b5 | c5 | d5 | e5 |
|
57
|
+
+----+----+----+----+----+
|
58
|
+
```
|
59
|
+
changes to:
|
60
|
+
```
|
61
|
+
+--------------+
|
62
|
+
| main |
|
63
|
+
+----+----+----+
|
64
|
+
| b2 | c2 | d2 |
|
65
|
+
| b3 | c3 | d3 |
|
66
|
+
| b4 | c4 | d4 |
|
67
|
+
+----+----+----+
|
68
|
+
|
69
|
+
+----+----+----+----+
|
70
|
+
| secondary |
|
71
|
+
+----+----+----+----+
|
72
|
+
| a2 | b2 | c2 | d2 |
|
73
|
+
| a3 | b3 | c3 | d3 |
|
74
|
+
+----+----+----+----+
|
75
|
+
|
76
|
+
+----+----+----+----+----+
|
77
|
+
| full |
|
78
|
+
+----+----+----+----+----+
|
79
|
+
| a1 | b1 | c1 | d1 | e1 |
|
80
|
+
| a2 | b2 | c2 | d2 | e2 |
|
81
|
+
| a3 | b3 | c3 | d3 | e3 |
|
82
|
+
| a4 | b4 | c4 | d4 | e4 |
|
83
|
+
| a5 | b5 | c5 | d5 | e5 |
|
84
|
+
+----+----+----+----+----+
|
85
|
+
|
86
|
+
variables
|
87
|
+
- my_string => "e5"
|
88
|
+
```
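
To show how the A1-style notation maps onto rows and columns, here is a small plain-Ruby sketch that slices a two-dimensional array. `extract_range` is a hypothetical helper for illustration and is not Cure's named range code; it only handles single-letter columns (A to Z) and expects a full `start:end` range.

```ruby
# Illustrative only: slice a 2D sheet using A1-style range notation.
# `extract_range` is a hypothetical helper, not Cure's named range code.
# It handles single-letter columns (A-Z) and "start:end" ranges only.
def extract_range(sheet, range)
  return sheet if range == -1 # -1 selects the entire sheet

  first, last = range.split(":").map do |cell|
    letter, number = cell.match(/\A([A-Z])(\d+)\z/).captures
    # Convert to zero-based column and row indexes.
    { col: letter.ord - "A".ord, row: number.to_i - 1 }
  end
  sheet[first[:row]..last[:row]].map { |row| row[first[:col]..last[:col]] }
end

# Build the 5x5 sheet from the example: a1..e1 on the first row, a5..e5 on the last.
sheet = (1..5).map { |r| %w[a b c d e].map { |c| "#{c}#{r}" } }

extract_range(sheet, "B2:D4").each { |row| puts row.join(" ") }
# prints:
# b2 c2 d2
# b3 c3 d3
# b4 c4 d4
```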
|
89
|
+
|
@@ -0,0 +1,29 @@
|
|
1
|
+
**Metadata** > Source > Extract > Validate > Build > Query > Transform > Export
|
2
|
+
|
3
|
+
Metadata
|
4
|
+
=======
|
5
|
+
|
6
|
+
### About
|
7
|
+
|
8
|
+
The metadata step does not affect processing; it lets you record descriptive information about the template.
|
9
|
+
|
10
|
+
---
|
11
|
+
|
12
|
+
**When you should use this**: You want to record some information - version, author, date.
|
13
|
+
|
14
|
+
See below an example configuration block:
|
15
|
+
|
16
|
+
### Example
|
17
|
+
|
18
|
+
```ruby
|
19
|
+
metadata do
|
20
|
+
name "My Dataset"
|
21
|
+
version 1
|
22
|
+
comments "A useless comment"
|
23
|
+
additional data: {
|
24
|
+
created_date: "2023-01-01 00:00",
|
25
|
+
author: "william"
|
26
|
+
}
|
27
|
+
end
|
28
|
+
```
|
29
|
+
|
data/docs/query/main.md
ADDED
@@ -0,0 +1,45 @@
|
|
1
|
+
Source > Extract > Validate > Build > **Query** > Transform > Export
|
2
|
+
|
3
|
+
Query
|
4
|
+
=======
|
5
|
+
|
6
|
+
### About
|
7
|
+
|
8
|
+
The query step allows you to customise what data is returned from the extract step.
|
9
|
+
|
10
|
+
If this step is not provided, `SELECT * FROM _default` is run. Whatever you put in the SELECT (aliases etc) will
|
11
|
+
be returned to you for transforming.
|
12
|
+
|
13
|
+
---
|
14
|
+
|
15
|
+
**When you should use this**: You want to harness the full power of SQL to return a more tailored response.
|
16
|
+
|
17
|
+
See below an example configuration block:
|
18
|
+
|
19
|
+
### Example
|
20
|
+
|
21
|
+
```ruby
|
22
|
+
query do
|
23
|
+
with named_range: "data_log", query: <<-SQL
|
24
|
+
SELECT
|
25
|
+
*
|
26
|
+
FROM
|
27
|
+
data_log
|
28
|
+
WHERE
|
29
|
+
Equipment = 'Raw'
|
30
|
+
AND
|
31
|
+
(Division = 'O' OR Division = 'Open')
|
32
|
+
AND
|
33
|
+
Event = 'SBD'
|
34
|
+
AND
|
35
|
+
ParentFederation = 'IPF'
|
36
|
+
AND
|
37
|
+
Sex = 'F'
|
38
|
+
AND
|
39
|
+
strftime('%Y', Date) > '2014'
|
40
|
+
ORDER BY Date DESC
|
41
|
+
SQL
|
42
|
+
end
|
43
|
+
|
44
|
+
```
|
45
|
+
|