datamancer 0.0.2 → 0.0.3
- checksums.yaml +4 -4
- data/README.md +156 -3
- data/lib/datamancer/extract.rb +15 -4
- data/lib/datamancer/transform.rb +72 -10
- data/lib/datamancer/version.rb +1 -1
- data/spec/data/destination.sqlite3 +0 -0
- data/spec/extract_spec.rb +16 -4
- data/spec/load_spec.rb +4 -2
- data/spec/transform_spec.rb +39 -11
- metadata +2 -6
- data/spec/data/left_source.csv +0 -5
- data/spec/data/right_source.csv +0 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 7c61e73995d0dfa4c372c1142482044a3a55ceec
|
4
|
+
data.tar.gz: 71968fed5543e5653ebf74e4b1f48f7e4576d5ef
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 4657f9b28e44d82e6ceff39cc9d59f10964255138c2542e3d828a3fa5201858a1932eedc0b55918abcfd9e469591527bfe1c42f10b88a7629b0faa10965b5615
|
7
|
+
data.tar.gz: 382575e0b05be66f8f23b0c549895bc82e57a38575605bebf4cf196053f03d78b8ea9e52b7a4ff9f457ac8cb4917ee076178984ca315ad5b87e5635953f1f35a
|
data/README.md
CHANGED
@@ -2,9 +2,11 @@
|
|
2
2
|
|
3
3
|
**The Datamancer** is a form of sorcerer whose spells deal with data manipulation between databases.
|
4
4
|
|
5
|
-
|
5
|
+
Sources and destinations can be **databases** supported by ActiveRecord and **CSV files**. Multiple sources and destinations can be present in a single ETL process.
|
6
6
|
|
7
|
-
Datamancer relies on bulk SQL reading and writing and does not instantiate ActiveRecord objects; ActiveRecord is used solely to connect to databases.
|
7
|
+
To optimize performance, Datamancer relies on bulk SQL reading and writing and does not instantiate ActiveRecord objects; ActiveRecord is used solely to connect to databases.
|
8
|
+
|
9
|
+
It is also tested on JRuby.
|
8
10
|
|
9
11
|
## Installation
|
10
12
|
|
@@ -22,7 +24,158 @@ Or install it yourself as:
|
|
22
24
|
|
23
25
|
## Usage
|
24
26
|
|
25
|
-
|
27
|
+
### Datastreams
|
28
|
+
|
29
|
+
A **datastream** is a table of columns and rows, represented as an array of hashes.
|
30
|
+
Each element of the array (a hash) represents a row,
|
31
|
+
and each key-value pair of a hash represents an attribute (a column) and its value.
|
32
|
+
|
33
|
+
```ruby
|
34
|
+
people = [
|
35
|
+
{name: 'Foo', age: 27},
|
36
|
+
{name: 'Bar', age: 42}
|
37
|
+
]
|
38
|
+
```
|
39
|
+
|
40
|
+
### Extract
|
41
|
+
|
42
|
+
To get a datastream from a data source, use the extract() method, like this:
|
43
|
+
|
44
|
+
```ruby
|
45
|
+
people = extract(from: 'people.csv')
|
46
|
+
```
|
47
|
+
|
48
|
+
If the 'from' value is a string, extract() assumes that the source is a CSV file.
|
49
|
+
One option that can be used with this kind of source is 'separator', whose value
|
50
|
+
will be used to interpret the file (defaults to comma).
|
51
|
+
|
52
|
+
Similarly, if the 'from' value is a hash, it is passed to ActiveRecord to establish
|
53
|
+
a connection to the specified database and table. If the table is not indicated
|
54
|
+
there, the 'table' option can provide it.
|
55
|
+
|
56
|
+
The next example uses a YAML file that lists databases and shows further capabilities of extract().
|
57
|
+
|
58
|
+
*databases.yml*
|
59
|
+
|
60
|
+
```yaml
|
61
|
+
warehouse:
|
62
|
+
adapter: postgresql
|
63
|
+
host: 192.168.0.1
|
64
|
+
database: Warehouse
|
65
|
+
username: foo
|
66
|
+
password: 1234
|
67
|
+
```
|
68
|
+
|
69
|
+
```ruby
|
70
|
+
databases = YAML.load_file('databases.yml')
|
71
|
+
|
72
|
+
people = extract(from: databases['warehouse'], table: 'people') do
|
73
|
+
field :name
|
74
|
+
field :age
|
75
|
+
end
|
76
|
+
```
|
77
|
+
|
78
|
+
extract() can take a block in order to manipulate data being extracted. All table or file columns
|
79
|
+
are brought in by default; to get the opposite effect, use 'exclude: true'.
|
80
|
+
|
81
|
+
Inside the block, field() is used to refer to columns. With no arguments other than the column name
|
82
|
+
(as above) it performs no operation, but if columns are being excluded through the aforementioned
|
83
|
+
option, field() includes the referred columns. Both strings and symbols are valid column names.
|
84
|
+
|
85
|
+
To summarize, extract() options are:
|
86
|
+
|
87
|
+
* from: (hash or string)
|
88
|
+
* table: (string)
|
89
|
+
* exclude: (boolean)
|
90
|
+
* separator: (string)
|
91
|
+
|
92
|
+
The field() options available in the extraction context are:
|
93
|
+
|
94
|
+
* reject_if: (value or array of values) — Rejects row if condition is met.
|
95
|
+
* reject_unless: (value or array of values) — Rejects row unless condition is met.
|
96
|
+
* map: (string or symbol) — Alias column name in database.
|
97
|
+
* type: (class) — Casts data type on field.
|
98
|
+
* default: (value) — Set field to default value if nil.
|
99
|
+
* type_default: (class) — Casts data type even if field's value is nil. This way a nil
|
100
|
+
can be converted into a zero or an empty string.
|
101
|
+
* empty_default: (string) — Set field to default value if nil or empty.
|
102
|
+
* strip: (boolean) — Right and left strips the value to remove extra spaces in the string.
|
103
|
+
|
104
|
+
In addition to these options field() also takes blocks!
|
105
|
+
|
106
|
+
All together now:
|
107
|
+
|
108
|
+
```ruby
|
109
|
+
people = extract(from: databases['warehouse'], table: 'people') do
|
110
|
+
field 'name', map: 'Name', reject_if: nil
|
111
|
+
field 'age', map: 'Age', type: Integer
|
112
|
+
end
|
113
|
+
```
|
114
|
+
|
115
|
+
### Transform
|
116
|
+
|
117
|
+
Method options:
|
118
|
+
|
119
|
+
* exclude
|
120
|
+
* join
|
121
|
+
* on
|
122
|
+
* unique
|
123
|
+
|
124
|
+
Field options: Not allowed.
|
125
|
+
|
126
|
+
### Load
|
127
|
+
|
128
|
+
Method options:
|
129
|
+
|
130
|
+
* to
|
131
|
+
* table
|
132
|
+
* exclude
|
133
|
+
* append
|
134
|
+
* batch
|
135
|
+
|
136
|
+
Field options:
|
137
|
+
|
138
|
+
* map
|
139
|
+
|
140
|
+
## Example
|
141
|
+
|
142
|
+
```ruby
|
143
|
+
require 'bundler/setup'
|
144
|
+
require 'datamancer'
|
145
|
+
require 'active_record'
|
146
|
+
require 'csv'
|
147
|
+
|
148
|
+
include Datamancer
|
149
|
+
|
150
|
+
bases = YAML.load_file('/home/matias/proyectos/panel/bases_de_datos.yml')
|
151
|
+
|
152
|
+
países_ISO =
|
153
|
+
extract from: 'country-list/country/cldr/es_AR/country.csv' do
|
154
|
+
field :iso
|
155
|
+
field :nombre, map: 'name'
|
156
|
+
end
|
157
|
+
|
158
|
+
países_UN =
|
159
|
+
extract from: 'countries/countries.csv', separator: ';', exclude: true do
|
160
|
+
field :iso, map: 'cca2'
|
161
|
+
field :número, map: 'ccn3', type: Integer
|
162
|
+
end
|
163
|
+
|
164
|
+
países =
|
165
|
+
transform países_ISO, join: países_UN, on: :iso
|
166
|
+
|
167
|
+
load países, to: bases['panel'], table: 'lk_com_pais', append: true do
|
168
|
+
field :número, map: 'id_com_pais'
|
169
|
+
field :iso, map: 'cd_com_pais'
|
170
|
+
field :nombre, map: 'ds_com_pais'
|
171
|
+
end
|
172
|
+
```
|
173
|
+
|
174
|
+
## Future features
|
175
|
+
|
176
|
+
* Batch mode
|
177
|
+
* Error monitor
|
178
|
+
* Control files
|
26
179
|
|
27
180
|
## Contributing
|
28
181
|
|
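As a rough illustration of the datastream shape that the README describes, the following plain-Ruby sketch builds one by hand and reads rows and columns from it. It does not call Datamancer at all; the helper name `column_values` is hypothetical and exists only for this example.

```ruby
# A datastream: an array of hashes, one hash per row.
people = [
  { name: 'Foo', age: 27 },
  { name: 'Bar', age: 42 }
]

# Column names can be read from the keys of the first row.
columns = people.first.keys

# Hypothetical helper (not part of the gem): collect one column's
# values across all rows. Accepts a string or a symbol.
def column_values(datastream, column)
  datastream.map { |row| row[column.to_sym] }
end

ages = column_values(people, 'age')
```

Because a datastream is an ordinary array, the usual Enumerable methods (`map`, `select`, `group_by`) apply to it directly.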
data/lib/datamancer/extract.rb
CHANGED
@@ -43,6 +43,7 @@ module Datamancer
|
|
43
43
|
|
44
44
|
define_singleton_method :field do |name, actions = {}, &block|
|
45
45
|
actions[:type] ||= actions[:type_default]
|
46
|
+
actions[:default] ||= actions[:empty_default]
|
46
47
|
actions = default_actions.merge(actions)
|
47
48
|
mapping = actions[:map] || name.to_s
|
48
49
|
|
@@ -171,11 +172,21 @@ module Datamancer
|
|
171
172
|
|
172
173
|
# TODO: Test this. Test to not reject nil by default.
|
173
174
|
|
174
|
-
if actions[:reject_if]
|
175
|
-
|
176
|
-
|
175
|
+
if actions[:reject_if].is_a? Array
|
176
|
+
actions[:reject_if].each do |reject_value|
|
177
|
+
value = :reject if reject_value == value
|
178
|
+
end
|
179
|
+
else
|
180
|
+
value = :reject if actions[:reject_if] == value
|
181
|
+
end
|
177
182
|
|
178
|
-
|
183
|
+
if actions[:reject_unless].is_a? Array
|
184
|
+
actions[:reject_unless].each do |reject_value|
|
185
|
+
value = :reject if reject_value != value
|
186
|
+
end
|
187
|
+
else
|
188
|
+
value = :reject if actions[:reject_unless] != :ñil &&
|
189
|
+
actions[:reject_unless] != value
|
179
190
|
end
|
180
191
|
|
181
192
|
value
|
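Review note on the hunk above: in the array branch of `reject_unless`, a row is marked `:reject` as soon as any listed value differs from the field value, so a list with two distinct values appears to reject every row. The sketch below shows one possible reading of the README wording ("rejects row unless condition is met"), assuming the intent is "reject unless the value is one of the listed values". The method name `apply_rejections` and its keyword handling are hypothetical; the gem's own default sentinel is not reproduced here.

```ruby
# Hypothetical stand-alone sketch, not the gem's implementation.
# Returns :reject when the row should be dropped, otherwise the value.
def apply_rejections(value, **options)
  if options.key?(:reject_if)
    # Wrap a single value so both forms go through the same check.
    # A plain [x] wrapper is used (rather than Array()) so that
    # `reject_if: nil` still means "reject nil values".
    rejected = options[:reject_if]
    rejected = [rejected] unless rejected.is_a?(Array)
    return :reject if rejected.include?(value)
  end

  if options.key?(:reject_unless)
    allowed = options[:reject_unless]
    allowed = [allowed] unless allowed.is_a?(Array)
    # Assumption: a value passes if it matches ANY listed entry.
    return :reject unless allowed.include?(value)
  end

  value
end
```

Checking for the presence of the option key, instead of its truthiness, keeps `reject_if: nil` meaningful, which is the case the README example relies on.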
data/lib/datamancer/transform.rb
CHANGED
@@ -33,14 +33,40 @@ module Datamancer
|
|
33
33
|
output
|
34
34
|
end
|
35
35
|
|
36
|
+
def where input, attributes
|
37
|
+
|
38
|
+
end
|
39
|
+
|
40
|
+
def unique input, attribute
|
41
|
+
|
42
|
+
attribute = attribute.to_sym
|
43
|
+
output = Array.new
|
44
|
+
unique_values = Array.new
|
45
|
+
|
46
|
+
input.each do |row|
|
47
|
+
unless unique_values.include?(row[attribute])
|
48
|
+
output << row
|
49
|
+
unique_values << row[attribute]
|
50
|
+
end
|
51
|
+
end
|
52
|
+
|
53
|
+
output
|
54
|
+
end
|
55
|
+
|
36
56
|
def transform input, args = {}
|
37
57
|
|
38
58
|
if args[:join]
|
39
59
|
raise ArgumentError unless args[:on]
|
60
|
+
raise ArgumentError unless input.first.keys.include?(args[:on].to_sym)
|
61
|
+
raise ArgumentError unless args[:join].first.keys.include?(args[:on].to_sym)
|
40
62
|
|
41
63
|
input = join input, args[:join], args[:on]
|
42
64
|
end
|
43
65
|
|
66
|
+
if args[:unique]
|
67
|
+
input = unique input, args[:unique]
|
68
|
+
end
|
69
|
+
|
44
70
|
# TODO: Method-overriding safeguard.
|
45
71
|
|
46
72
|
input.first.each_key do |key|
|
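The `unique` method added in this hunk keeps the first row seen for each value of the given attribute. A compact sketch of the same idea, using Ruby's built-in `Array#uniq` with a block, could look like this. The name `unique_by` is hypothetical and is not part of the gem.

```ruby
# Hypothetical sketch: keep the first row for each distinct value of
# `attribute`. Array#uniq with a block preserves first occurrences.
def unique_by(rows, attribute)
  key = attribute.to_sym
  rows.uniq { |row| row[key] }
end

data = [
  { name: 'Foo', age: 27 },
  { name: 'Bar', age: 42 }
]

deduplicated = unique_by(data + data, 'name')
```

One difference worth noting: `Array#uniq` compares keys with `hash`/`eql?`, while the `include?` call in the hunk uses `==`, so values such as `1` and `1.0` would be treated as duplicates by the hunk's version but not by this sketch.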
@@ -55,43 +81,79 @@ module Datamancer
|
|
55
81
|
end
|
56
82
|
end
|
57
83
|
|
84
|
+
define_singleton_method :row_number do
|
85
|
+
@row_number
|
86
|
+
end
|
87
|
+
|
88
|
+
define_singleton_method :row do
|
89
|
+
@supplementary_row
|
90
|
+
end
|
91
|
+
|
92
|
+
define_singleton_method :count do
|
93
|
+
@count += 1
|
94
|
+
end
|
95
|
+
|
96
|
+
define_singleton_method :output do
|
97
|
+
@output
|
98
|
+
end
|
99
|
+
|
100
|
+
define_singleton_method :switch do |slot|
|
101
|
+
@slot = slot
|
102
|
+
end
|
103
|
+
|
58
104
|
define_singleton_method :field do |name, value = nil, *args|
|
59
105
|
raise MissingField,
|
60
|
-
"Required field '#{name}' was not found" unless
|
61
|
-
|
106
|
+
"Required field '#{name}' was not found" unless @input_row.include?(name.to_sym)
|
107
|
+
|
62
108
|
@output_row[name.to_sym] = if value.is_a?(Symbol)
|
63
|
-
send(name).send *args.unshift(value)
|
109
|
+
send(name.downcase).send *args.unshift(value)
|
64
110
|
else
|
65
|
-
value || send(name)
|
111
|
+
value || send(name.downcase)
|
66
112
|
end
|
67
113
|
end
|
68
114
|
|
69
115
|
define_singleton_method :del_field do |name|
|
70
116
|
raise MissingField,
|
71
|
-
"Filtered field '#{name}' was not found" unless
|
117
|
+
"Filtered field '#{name}' was not found" unless @input_row.include?(name.to_sym)
|
72
118
|
|
73
119
|
@output_row.delete(name.to_sym)
|
74
120
|
end
|
75
121
|
|
76
122
|
define_singleton_method :new_field do |name, value|
|
77
123
|
raise ExistingField,
|
78
|
-
"New field '#{name}' already exists" if
|
124
|
+
"New field '#{name}' already exists" if @input_row.include?(name.to_sym)
|
79
125
|
|
80
126
|
@output_row[name.to_sym] = value
|
81
127
|
end
|
82
128
|
|
83
|
-
|
129
|
+
# TODO: Test for count.
|
130
|
+
|
131
|
+
@count = 0
|
132
|
+
|
133
|
+
# TODO: Test for slots.
|
134
|
+
|
135
|
+
@output = Hash.new { |h, k| h[k] = [] }
|
136
|
+
|
137
|
+
input.each_with_index do |row, row_number|
|
138
|
+
|
139
|
+
# TODO: Test for row_number.
|
140
|
+
# TODO: Test for (supplementary) row.
|
141
|
+
|
142
|
+
@row_number = row_number
|
84
143
|
@input_row = row
|
144
|
+
@supplementary_row = @input_row.dup
|
85
145
|
@output_row = args[:exclude] ? {} : @input_row.dup
|
86
146
|
|
87
147
|
yield if block_given?
|
88
148
|
|
89
|
-
@output_row
|
149
|
+
@output[@slot] << @output_row
|
90
150
|
end
|
151
|
+
|
152
|
+
@output.length == 1 ? @output[nil] : @output
|
91
153
|
end
|
92
154
|
|
93
155
|
def aggregate input
|
94
|
-
|
156
|
+
|
95
157
|
define_singleton_method :dim do |name|
|
96
158
|
name = name.to_sym
|
97
159
|
@dimensions[name] = @row[name]
|
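The `switch`/`@output` additions in this hunk route each output row into a named slot and return a plain array when only one slot was used. The sketch below illustrates that routing idea in isolation. It is an assumption-laden simplification: the method name `route_rows` is hypothetical, and the slot is taken from the block's return value rather than from a `switch` call.

```ruby
# Hypothetical sketch of slot routing, not the gem's implementation.
# The block returns the slot for each row; nil is the default slot.
def route_rows(rows)
  output = Hash.new { |hash, slot| hash[slot] = [] }

  rows.each do |row|
    slot = yield(row)
    output[slot] << row
  end

  # Return a plain array only when the default (nil) slot is the sole
  # slot in use; otherwise return the hash of slots.
  output.keys == [nil] ? output[nil] : output
end

rows = [{ n: 1 }, { n: 2 }, { n: 3 }]
single = route_rows(rows) { nil }
split  = route_rows(rows) { |row| row[:n].odd? ? :odd : :even }
```

The final line differs slightly from the hunk on purpose: with the hunk's `@output.length == 1` test, a run in which every row went to one named slot would read `@output[nil]` and get an empty array back, so this sketch checks for the nil slot explicitly.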
@@ -104,7 +166,7 @@ module Datamancer
|
|
104
166
|
|
105
167
|
aggregated_input = Hash.new { |hash, key| hash[key] = Hash.new }
|
106
168
|
|
107
|
-
input.
|
169
|
+
input.each_with_index do |row, row_number|
|
108
170
|
@row = row
|
109
171
|
@dimensions = {}
|
110
172
|
@facts = {}
|
data/lib/datamancer/version.rb
CHANGED
Binary file
|
data/spec/extract_spec.rb
CHANGED
@@ -1,5 +1,16 @@
|
|
1
1
|
require 'spec_helper'
|
2
2
|
|
3
|
+
# TODO: Some wanted features
|
4
|
+
|
5
|
+
# allow table_name to be both string or symbol
|
6
|
+
# database.yml autoloading
|
7
|
+
# autoincremental id
|
8
|
+
# transform!
|
9
|
+
# reject by array and regex
|
10
|
+
# existent actions check
|
11
|
+
|
12
|
+
# TODO: case insensitivity
|
13
|
+
|
3
14
|
describe Datamancer do
|
4
15
|
|
5
16
|
context 'EXTRACT spell' do
|
@@ -60,10 +71,11 @@ describe Datamancer do
|
|
60
71
|
}.to raise_error(ArgumentError, 'Extract requires a source, i.e. extract(from: source)')
|
61
72
|
end
|
62
73
|
|
63
|
-
|
74
|
+
|
64
75
|
it 'raises an exception if a required field is missing' do
|
65
76
|
|
66
|
-
# TODO: Better explanation for this error.
|
77
|
+
# TODO: Better explanation for this error. Ensure that table appears
|
78
|
+
# even though it has been specified with :table option (outside :from).
|
67
79
|
|
68
80
|
# source = case source_type
|
69
81
|
# when 'CSV file' then "#{@source} file"
|
@@ -73,11 +85,11 @@ describe Datamancer do
|
|
73
85
|
expect {
|
74
86
|
|
75
87
|
extract(from: @source) do
|
76
|
-
field :
|
88
|
+
field :agE
|
77
89
|
end
|
78
90
|
|
79
91
|
}.to raise_error(MissingField,
|
80
|
-
"Required field '
|
92
|
+
"Required field 'agE' was not found in '#{@source}'")
|
81
93
|
end
|
82
94
|
|
83
95
|
|
data/spec/load_spec.rb
CHANGED
@@ -80,11 +80,13 @@ describe Datamancer do
|
|
80
80
|
expect {
|
81
81
|
|
82
82
|
load(@data, to: @destination) do
|
83
|
-
field :
|
83
|
+
field :agE
|
84
84
|
end
|
85
85
|
|
86
|
+
# TODO: Check for table.
|
87
|
+
|
86
88
|
}.to raise_error(MissingField,
|
87
|
-
"Required field '
|
89
|
+
"Required field 'agE' was not found in '#{@destination}'")
|
88
90
|
end
|
89
91
|
|
90
92
|
|
data/spec/transform_spec.rb
CHANGED
@@ -55,20 +55,20 @@ describe Datamancer do
|
|
55
55
|
expect {
|
56
56
|
|
57
57
|
transform(@data) do
|
58
|
-
field :
|
58
|
+
field :agE
|
59
59
|
end
|
60
60
|
|
61
61
|
}.to raise_error(MissingField,
|
62
|
-
"Required field '
|
62
|
+
"Required field 'agE' was not found")
|
63
63
|
|
64
64
|
expect {
|
65
65
|
|
66
66
|
transform(@data) do
|
67
|
-
del_field :
|
67
|
+
del_field :agE
|
68
68
|
end
|
69
69
|
|
70
70
|
}.to raise_error(MissingField,
|
71
|
-
"Filtered field '
|
71
|
+
"Filtered field 'agE' was not found")
|
72
72
|
end
|
73
73
|
|
74
74
|
|
@@ -149,11 +149,35 @@ describe Datamancer do
|
|
149
149
|
end
|
150
150
|
|
151
151
|
|
152
|
+
it 'drops duplicated rows' do
|
153
|
+
duplicated_data = @data + @data
|
154
|
+
|
155
|
+
expect(
|
156
|
+
transform(duplicated_data, unique: :name)
|
157
|
+
).to eq(@data)
|
158
|
+
|
159
|
+
expect(
|
160
|
+
transform(duplicated_data, unique: :name) do
|
161
|
+
field :name
|
162
|
+
end
|
163
|
+
).to eq(@data)
|
164
|
+
end
|
165
|
+
|
166
|
+
|
152
167
|
context 'combines records by' do
|
153
168
|
|
154
169
|
before(:all) do
|
155
|
-
@left_data =
|
156
|
-
|
170
|
+
@left_data =
|
171
|
+
[{name: 'Foo', some_id: 1},
|
172
|
+
{name: 'Bar', some_id: 2},
|
173
|
+
{name: 'Baz', some_id: 2},
|
174
|
+
{name: 'Foobar', some_id: nil}]
|
175
|
+
|
176
|
+
@right_data =
|
177
|
+
[{age: 0, some_id: nil},
|
178
|
+
{age: 27, some_id: 1},
|
179
|
+
{age: 33, some_id: 1},
|
180
|
+
{age: 42, some_id: 2}]
|
157
181
|
end
|
158
182
|
|
159
183
|
|
@@ -164,16 +188,20 @@ describe Datamancer do
|
|
164
188
|
expect {
|
165
189
|
transform(@left_data, join: @right_data)
|
166
190
|
}.to raise_error(ArgumentError)
|
191
|
+
|
192
|
+
expect {
|
193
|
+
transform(@left_data, join: @right_data, on: 'some_ID')
|
194
|
+
}.to raise_error(ArgumentError)
|
167
195
|
|
168
196
|
expect(
|
169
197
|
transform(@left_data, join: @right_data, on: 'some_id') do
|
170
198
|
del_field :some_id
|
171
|
-
new_field :namage, name.downcase + age
|
199
|
+
new_field :namage, name.downcase + age.to_s
|
172
200
|
end
|
173
|
-
).to eq([{name: 'Foo', age:
|
174
|
-
{name: 'Foo', age:
|
175
|
-
{name: 'Bar', age:
|
176
|
-
{name: 'Baz', age:
|
201
|
+
).to eq([{name: 'Foo', age: 27, namage: 'foo27'},
|
202
|
+
{name: 'Foo', age: 33, namage: 'foo33'},
|
203
|
+
{name: 'Bar', age: 42, namage: 'bar42'},
|
204
|
+
{name: 'Baz', age: 42, namage: 'baz42'}])
|
177
205
|
end
|
178
206
|
|
179
207
|
|
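The join expectations in this spec (four combined rows, with the `nil`-keyed 'Foobar' row absent) are consistent with an inner join that skips `nil` keys. A minimal sketch of that behavior, under that assumption, follows. The method name `inner_join` is hypothetical, and the gem's real `join` may differ in details.

```ruby
# Hypothetical sketch of an inner join over two datastreams.
# Assumption: rows whose join key is nil never match.
def inner_join(left, right, on)
  key = on.to_sym

  left.flat_map do |left_row|
    next [] if left_row[key].nil?

    right
      .select { |right_row| right_row[key] == left_row[key] }
      .map { |right_row| left_row.merge(right_row) }
  end
end

left_data = [
  { name: 'Foo',    some_id: 1 },
  { name: 'Bar',    some_id: 2 },
  { name: 'Baz',    some_id: 2 },
  { name: 'Foobar', some_id: nil }
]

right_data = [
  { age: 0,  some_id: nil },
  { age: 27, some_id: 1 },
  { age: 33, some_id: 1 },
  { age: 42, some_id: 2 }
]

joined = inner_join(left_data, right_data, 'some_id')
```

The result keeps left-row order and emits one combined row per matching right row, which mirrors the order of the rows the spec expects.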
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: datamancer
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.3
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Matías Battocchia
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2013-
|
11
|
+
date: 2013-12-19 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: bundler
|
@@ -114,8 +114,6 @@ files:
|
|
114
114
|
- lib/datamancer/version.rb
|
115
115
|
- spec/data/destination.csv
|
116
116
|
- spec/data/destination.sqlite3
|
117
|
-
- spec/data/left_source.csv
|
118
|
-
- spec/data/right_source.csv
|
119
117
|
- spec/data/simple.yml
|
120
118
|
- spec/data/source.csv
|
121
119
|
- spec/data/source.sqlite3
|
@@ -151,8 +149,6 @@ summary: ''
|
|
151
149
|
test_files:
|
152
150
|
- spec/data/destination.csv
|
153
151
|
- spec/data/destination.sqlite3
|
154
|
-
- spec/data/left_source.csv
|
155
|
-
- spec/data/right_source.csv
|
156
152
|
- spec/data/simple.yml
|
157
153
|
- spec/data/source.csv
|
158
154
|
- spec/data/source.sqlite3
|
data/spec/data/left_source.csv
DELETED