datamancer 0.0.2 → 0.0.3
- checksums.yaml +4 -4
- data/README.md +156 -3
- data/lib/datamancer/extract.rb +15 -4
- data/lib/datamancer/transform.rb +72 -10
- data/lib/datamancer/version.rb +1 -1
- data/spec/data/destination.sqlite3 +0 -0
- data/spec/extract_spec.rb +16 -4
- data/spec/load_spec.rb +4 -2
- data/spec/transform_spec.rb +39 -11
- metadata +2 -6
- data/spec/data/left_source.csv +0 -5
- data/spec/data/right_source.csv +0 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7c61e73995d0dfa4c372c1142482044a3a55ceec
+  data.tar.gz: 71968fed5543e5653ebf74e4b1f48f7e4576d5ef
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4657f9b28e44d82e6ceff39cc9d59f10964255138c2542e3d828a3fa5201858a1932eedc0b55918abcfd9e469591527bfe1c42f10b88a7629b0faa10965b5615
+  data.tar.gz: 382575e0b05be66f8f23b0c549895bc82e57a38575605bebf4cf196053f03d78b8ea9e52b7a4ff9f457ac8cb4917ee076178984ca315ad5b87e5635953f1f35a
data/README.md
CHANGED
@@ -2,9 +2,11 @@
 
 **The Datamancer** is a form of sorcerer whose spells deal with data manipulation between databases.
 
-
+Sources and destinations can be **databases** supported by ActiveRecord and **CSV files**. Multiple sources and destinations can be present in a single ETL process.
 
-Datamancer relies in bulk SQL reading and writing, and does not instantiate ActiveRecord objects, which is used for the sole purpose of connecting to databases.
+To optimize, Datamancer relies in bulk SQL reading and writing, and does not instantiate ActiveRecord objects, which is used for the sole purpose of connecting to databases.
+
+It is tested to run over JRuby as well.
 
 ## Installation
 
@@ -22,7 +24,158 @@ Or install it yourself as:
 
 ## Usage
 
-
+### Datastreams
+
+A **datastream** is a table with columns and rows made with an array of hashes.
+Each element of the array (a hash) represents a row,
+and each pair of a hash, an attribute (a column) and its value.
+
+```ruby
+people = [
+  {name: 'Foo', age: 27},
+  {name: 'Bar', age: 42}
+]
+```
+
+### Extract
+
+To get a datastream from a data source use the extract() method, like this:
+
+```ruby
+people = extract(from: 'people.csv')
+```
+
+If 'from' value is a string, extract will suppose that the source is a CSV file.
+One option that can be used along this kind of source is 'separator', which value
+will be used to interpret the file (defaults to comma).
+
+Similarly, if 'from' attribute is a hash, it will be given to ActiveRecord to establish
+a connection with the specified database and table. In case that the table is not indicated
+there, the 'table' option can provide it.
+
+Next, an example using a YAML file listing databases and showing further capacities of extract().
+
+*databases.yml*
+
+```yaml
+warehouse:
+  adapter: postgresql
+  host: 192.168.0.1
+  database: Warehouse
+  username: foo
+  password: 1234
+```
+
+```ruby
+databases = YAML.load_file('databases.yml')
+
+people = extract(from: databases['main'], table: 'people') do
+  field :name
+  field :age
+end
+```
+
+extract() can take a block in order to manipulate data being extracted. All table or file columns
+are brung by default; to get the opposite effect use 'exclude: true'.
+
+Inside the block, field() is used to refer to columns. Without more arguments than columns' name
+(as above) it performs no operation, but if columns are being excluded through the aforementioned
+option, field() includes the referred columns. Both string and symbol are valid for column names.
+
+To summarize, extract() options are:
+
+* from: (hash or string)
+* table: (string)
+* exclude: (boolean)
+* separator: (string)
+
+And now we introduce field() options under extraction context:
+
+* reject_if: (value or array of values) — Rejects row if condition is meet.
+* reject_unless: (value or array of values) — Rejects row unless condition is meet.
+* map: (string or symbol) — Alias column name in database.
+* type: (class) — Casts data type on field.
+* default: (value) — Set field to default value if nil.
+* type_default: (class) — Casts data type even if field's value is nil. This way a nil
+  can be converted into a zero or an empty string.
+* empty_default: (string) — Set field to default value if nil or empty.
+* strip: (boolean) — Right and left strips the value to remove extra spaces in the string.
+
+In addition to these options field() also takes blocks!
+
+All together now:
+
+```ruby
+people = extract(from: databases['main'], table: 'people') do
+  field 'name', map: 'Name', reject_if: nil
+  field 'age', map: 'Age', type: Integer
+end
+```
+
+### Transform
+
+Method options:
+
+* exclude
+* join
+* on
+* unique
+
+Field options: Not allowed.
+
+### Load
+
+Method options:
+
+* to
+* table
+* exclude
+* append
+* batch
+
+Field options:
+
+* map
+
+## Example
+
+```ruby
+require 'bundler/setup'
+require 'datamancer'
+require 'active_record'
+require 'csv'
+
+include Datamancer
+
+bases = YAML.load_file('/home/matias/proyectos/panel/bases_de_datos.yml')
+
+países_ISO =
+  extract from: 'country-list/country/cldr/es_AR/country.csv' do
+    field :iso
+    field :nombre, map: 'name'
+  end
+
+países_UN =
+  extract from: 'countries/countries.csv', separator: ';', exclude: true do
+    field :iso, map: 'cca2'
+    field :número, map: 'ccn3', type: Integer
+  end
+
+países =
+  transform países_ISO, join: países_UN, on: :iso
+
+load países, to: bases['panel'], table: 'lk_com_pais', append: true do
+  field :número, map: 'id_com_pais'
+  field :iso, map: 'cd_com_pais'
+  field :nombre, map: 'ds_com_pais'
+end
+```
+
+## Future features
+
+* Batch mode
+* Error monitor
+* Control files
 
 ## Contributing
 
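For orientation, some of the field() options listed in the README hunk above (map, strip, default, type) can be pictured with plain Ruby on an array of hashes. This is only an illustrative sketch and not Datamancer's code: `apply_field_options` is a hypothetical helper, the raw rows are made up, and the order in which the options are applied is an assumption.

```ruby
# Hypothetical helper, not part of Datamancer: applies a few README field()
# options (map, strip, default, type) to one raw row read from a source.
def apply_field_options(raw_row, name, map: nil, strip: false, default: nil, type: nil)
  value = raw_row[(map || name).to_s]                 # map: column name in the source
  value = value.strip if strip && value.is_a?(String) # strip: trim both ends
  value = default if value.nil?                       # default: used only when nil
  value = Integer(value) if type == Integer && !value.nil?
  { name.to_sym => value }
end

# Made-up raw rows, keyed by source column name.
raw_people = [
  { 'Name' => ' Foo ', 'Age' => '27' },
  { 'Name' => 'Bar',   'Age' => nil }
]

# Build a datastream: an array of hashes, one hash per row.
people = raw_people.map do |raw|
  apply_field_options(raw, :name, map: 'Name', strip: true)
    .merge(apply_field_options(raw, :age, map: 'Age', type: Integer, default: 0))
end
```

Here `people` ends up in the same shape as the README's datastream example, with the nil age replaced by the default.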
data/lib/datamancer/extract.rb
CHANGED
@@ -43,6 +43,7 @@ module Datamancer
 
     define_singleton_method :field do |name, actions = {}, &block|
       actions[:type] ||= actions[:type_default]
+      actions[:default] ||= actions[:empty_default]
       actions = default_actions.merge(actions)
       mapping = actions[:map] || name.to_s
 
@@ -171,11 +172,21 @@ module Datamancer
 
     # TODO: Test this. Test to not reject nil by default.
 
-    if actions[:reject_if]
-
-
+    if actions[:reject_if].is_a? Array
+      actions[:reject_if].each do |reject_value|
+        value = :reject if reject_value == value
+      end
+    else
+      value = :reject if actions[:reject_if] == value
+    end
 
-
+    if actions[:reject_unless].is_a? Array
+      actions[:reject_unless].each do |reject_value|
+        value = :reject if reject_value != value
+      end
+    else
+      value = :reject if actions[:reject_unless] != :ñil &&
+                         actions[:reject_unless] != value
     end
 
     value
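The hunk above lets reject_if and reject_unless take either a single value or an array. A minimal sketch of the semantics the README describes ("value or array of values"), in plain Ruby: `rejected?` and the `:__not_set__` sentinel are hypothetical names, and reading an array given to reject_unless as a list of allowed values is an assumption, not something the diff confirms.

```ruby
# Hypothetical predicate, not Datamancer's code. :__not_set__ is an assumed
# sentinel meaning "this option was not given", so nil stays a usable value.
def rejected?(value, reject_if: :__not_set__, reject_unless: :__not_set__)
  # Kernel#Array(nil) would return [], so wrap non-arrays by hand.
  wrap = ->(option) { option.is_a?(Array) ? option : [option] }

  # reject_if: reject when the value equals any listed value.
  return true if reject_if != :__not_set__ && wrap.call(reject_if).include?(value)

  # reject_unless (assumed): reject when the value equals none of the listed values.
  return true if reject_unless != :__not_set__ && !wrap.call(reject_unless).include?(value)

  false
end

rows = [{ name: 'Foo', age: 27 }, { name: nil, age: 42 }, { name: 'Bar', age: 0 }]

kept = rows.reject do |row|
  rejected?(row[:name], reject_if: nil) || rejected?(row[:age], reject_unless: [27, 42])
end
```

Using `include?` for both options keeps the single-value and array cases on one code path.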
data/lib/datamancer/transform.rb
CHANGED
@@ -33,14 +33,40 @@ module Datamancer
     output
   end
 
+  def where input, attributes
+
+  end
+
+  def unique input, attribute
+
+    attribute = attribute.to_sym
+    output = Array.new
+    unique_values = Array.new
+
+    input.each do |row|
+      unless unique_values.include?(row[attribute])
+        output << row
+        unique_values << row[attribute]
+      end
+    end
+
+    output
+  end
+
   def transform input, args = {}
 
     if args[:join]
       raise ArgumentError unless args[:on]
+      raise ArgumentError unless input.first.keys.include?(args[:on].to_sym)
+      raise ArgumentError unless args[:join].first.keys.include?(args[:on].to_sym)
 
       input = join input, args[:join], args[:on]
     end
 
+    if args[:unique]
+      input = unique input, args[:unique]
+    end
+
     # TODO: Method-overriding safeguard.
 
     input.first.each_key do |key|
@@ -55,43 +81,79 @@ module Datamancer
       end
     end
 
+    define_singleton_method :row_number do
+      @row_number
+    end
+
+    define_singleton_method :row do
+      @supplementary_row
+    end
+
+    define_singleton_method :count do
+      @count += 1
+    end
+
+    define_singleton_method :output do
+      @output
+    end
+
+    define_singleton_method :switch do |slot|
+      @slot = slot
+    end
+
     define_singleton_method :field do |name, value = nil, *args|
       raise MissingField,
-        "Required field '#{name}' was not found" unless
-
+        "Required field '#{name}' was not found" unless @input_row.include?(name.to_sym)
+
       @output_row[name.to_sym] = if value.is_a?(Symbol)
-        send(name).send *args.unshift(value)
+        send(name.downcase).send *args.unshift(value)
       else
-        value || send(name)
+        value || send(name.downcase)
       end
     end
 
     define_singleton_method :del_field do |name|
       raise MissingField,
-        "Filtered field '#{name}' was not found" unless
+        "Filtered field '#{name}' was not found" unless @input_row.include?(name.to_sym)
 
       @output_row.delete(name.to_sym)
     end
 
     define_singleton_method :new_field do |name, value|
       raise ExistingField,
-        "New field '#{name}' already exists" if
+        "New field '#{name}' already exists" if @input_row.include?(name.to_sym)
 
       @output_row[name.to_sym] = value
     end
 
-
+    # TODO: Test for count.
+
+    @count = 0
+
+    # TODO: Test for slots.
+
+    @output = Hash.new { |h, k| h[k] = [] }
+
+    input.each_with_index do |row, row_number|
+
+      # TODO: Test for row_number.
+      # TODO: Test for (supplementary) row.
+
+      @row_number = row_number
       @input_row = row
+      @supplementary_row = @input_row.dup
       @output_row = args[:exclude] ? {} : @input_row.dup
 
       yield if block_given?
 
-      @output_row
+      @output[@slot] << @output_row
     end
+
+    @output.length == 1? @output[nil] : @output
   end
 
   def aggregate input
-
+
     define_singleton_method :dim do |name|
       name = name.to_sym
       @dimensions[name] = @row[name]
@@ -104,7 +166,7 @@ module Datamancer
 
     aggregated_input = Hash.new { |hash, key| hash[key] = Hash.new }
 
-    input.
+    input.each_with_index do |row, row_number|
       @row = row
       @dimensions = {}
       @facts = {}
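Two ideas introduced in transform.rb, the `unique` option and the output "slots" chosen with `switch`, can be tried outside the gem with plain Ruby. The sketch below is an illustration under assumptions rather than the gem's implementation: `unique_rows` and `route_rows` are hypothetical names, and the rule for collapsing to a plain array (only the nil slot was used) is a deliberate simplification of the `@output.length == 1` check in the diff.

```ruby
# Same effect as the `unique` helper in the diff: keep the first row seen for
# each value of `attribute`. Array#uniq with a block keeps the first occurrence.
def unique_rows(input, attribute)
  attribute = attribute.to_sym
  input.uniq { |row| row[attribute] }
end

# Sketch of the slot idea: each row is appended to the bucket named by the
# block's return value. If only the default (nil) slot was used, return a
# plain array; otherwise return the hash of slots.
def route_rows(input)
  output = Hash.new { |hash, slot| hash[slot] = [] }
  input.each { |row| output[yield(row)] << row }
  output.keys == [nil] ? output[nil] : output
end

people = [{ name: 'Foo', age: 27 }, { name: 'Bar', age: 42 }, { name: 'Foo', age: 33 }]

deduplicated  = unique_rows(people, 'name')
by_generation = route_rows(people) { |row| row[:age] < 30 ? :young : :older }
unrouted      = route_rows(people) { |_row| nil }
```

`Array#uniq` with a block tracks seen values with a hash, so it avoids the repeated linear `include?` scan, which may matter on large datastreams.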
data/lib/datamancer/version.rb
CHANGED
Binary file
data/spec/extract_spec.rb
CHANGED
@@ -1,5 +1,16 @@
 require 'spec_helper'
 
+# TODO: Some wanted features
+
+# allow table_name to be both string or symbol
+# database.yml autoloading
+# autoincremental id
+# transform!
+# reject by array and regex
+# existent actions check
+
+# TODO: case insentivity
+
 describe Datamancer do
 
   context 'EXTRACT spell' do
@@ -60,10 +71,11 @@ describe Datamancer do
       }.to raise_error(ArgumentError, 'Extract requires a source, i.e. extract(from: source)')
     end
 
-
+
     it 'raises an exception if a required field is missing' do
 
-      # TODO: Better explanation for this error.
+      # TODO: Better explanation for this error. Assure that table appears
+      # even though it has been specified with :table option (outside :from).
 
       # source = case source_type
       # when 'CSV file' then "#{@source} file"
@@ -73,11 +85,11 @@ describe Datamancer do
       expect {
 
         extract(from: @source) do
-          field :
+          field :agE
         end
 
       }.to raise_error(MissingField,
-        "Required field '
+        "Required field 'agE' was not found in '#{@source}'")
     end
 
 
data/spec/load_spec.rb
CHANGED
@@ -80,11 +80,13 @@ describe Datamancer do
       expect {
 
         load(@data, to: @destination) do
-          field :
+          field :agE
         end
 
+        # TODO: Check for table.
+
       }.to raise_error(MissingField,
-        "Required field '
+        "Required field 'agE' was not found in '#{@destination}'")
     end
 
 
data/spec/transform_spec.rb
CHANGED
@@ -55,20 +55,20 @@ describe Datamancer do
       expect {
 
         transform(@data) do
-          field :
+          field :agE
         end
 
       }.to raise_error(MissingField,
-        "Required field '
+        "Required field 'agE' was not found")
 
       expect {
 
         transform(@data) do
-          del_field :
+          del_field :agE
         end
 
       }.to raise_error(MissingField,
-        "Filtered field '
+        "Filtered field 'agE' was not found")
     end
 
 
@@ -149,11 +149,35 @@ describe Datamancer do
     end
 
 
+    it 'drops duplicated rows' do
+      duplicated_data = @data + @data
+
+      expect(
+        transform(duplicated_data, unique: :name)
+      ).to eq(@data)
+
+      expect(
+        transform(duplicated_data, unique: :name) do
+          field :name
+        end
+      ).to eq(@data)
+    end
+
+
     context 'combines records by' do
 
       before(:all) do
-        @left_data =
-
+        @left_data =
+          [{name: 'Foo', some_id: 1},
+           {name: 'Bar', some_id: 2},
+           {name: 'Baz', some_id: 2},
+           {name: 'Foobar', some_id: nil}]
+
+        @right_data =
+          [{age: 0, some_id: nil},
+           {age: 27, some_id: 1},
+           {age: 33, some_id: 1},
+           {age: 42, some_id: 2}]
       end
 
 
@@ -164,16 +188,20 @@ describe Datamancer do
         expect {
           transform(@left_data, join: @right_data)
         }.to raise_error(ArgumentError)
+
+        expect {
+          transform(@left_data, join: @right_data, on: 'some_ID')
+        }.to raise_error(ArgumentError)
 
         expect(
           transform(@left_data, join: @right_data, on: 'some_id') do
             del_field :some_id
-            new_field :namage, name.downcase + age
+            new_field :namage, name.downcase + age.to_s
           end
-        ).to eq([{name: 'Foo', age:
-                 {name: 'Foo', age:
-                 {name: 'Bar', age:
-                 {name: 'Baz', age:
+        ).to eq([{name: 'Foo', age: 27, namage: 'foo27'},
+                 {name: 'Foo', age: 33, namage: 'foo33'},
+                 {name: 'Bar', age: 42, namage: 'bar42'},
+                 {name: 'Baz', age: 42, namage: 'baz42'}])
       end
 
 
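The join expectation in the spec above pairs each left row with every right row that has the same `some_id`, and leaves out rows whose key is nil. That is the result an inner join on the key would give. A minimal plain-Ruby sketch under that assumption follows; `inner_join` is a hypothetical name, not the gem's `join` method, whose body is not part of this diff.

```ruby
# Hypothetical inner join over two datastreams (arrays of hashes).
def inner_join(left, right, on)
  on = on.to_sym
  # Index right rows by key, skipping nil keys so nil never matches nil.
  right_by_key = right.reject { |row| row[on].nil? }.group_by { |row| row[on] }

  # Each left row is merged with every right row that shares its key.
  left.flat_map do |left_row|
    right_by_key.fetch(left_row[on], []).map { |right_row| left_row.merge(right_row) }
  end
end

left_data  = [{ name: 'Foo', some_id: 1 }, { name: 'Bar', some_id: 2 },
              { name: 'Baz', some_id: 2 }, { name: 'Foobar', some_id: nil }]
right_data = [{ age: 0, some_id: nil }, { age: 27, some_id: 1 },
              { age: 33, some_id: 1 }, { age: 42, some_id: 2 }]

# Mirror the spec's block: drop the key column and add a derived field.
joined = inner_join(left_data, right_data, 'some_id').map do |row|
  row.reject { |key, _value| key == :some_id }
     .merge(namage: row[:name].downcase + row[:age].to_s)
end
```

With the spec's fixture data, `joined` should hold the same four rows the spec expects.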
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: datamancer
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.3
 platform: ruby
 authors:
 - Matías Battocchia
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-
+date: 2013-12-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -114,8 +114,6 @@ files:
 - lib/datamancer/version.rb
 - spec/data/destination.csv
 - spec/data/destination.sqlite3
-- spec/data/left_source.csv
-- spec/data/right_source.csv
 - spec/data/simple.yml
 - spec/data/source.csv
 - spec/data/source.sqlite3
@@ -151,8 +149,6 @@ summary: ''
 test_files:
 - spec/data/destination.csv
 - spec/data/destination.sqlite3
-- spec/data/left_source.csv
-- spec/data/right_source.csv
 - spec/data/simple.yml
 - spec/data/source.csv
 - spec/data/source.sqlite3
data/spec/data/left_source.csv
DELETED