datamancer 0.0.2 → 0.0.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 5a09ebee71e5c2b54c0b790ae4dc2165e4e9e45a
4
- data.tar.gz: eabae34cb0d1b34ce072796f3b0edab7117325e2
3
+ metadata.gz: 7c61e73995d0dfa4c372c1142482044a3a55ceec
4
+ data.tar.gz: 71968fed5543e5653ebf74e4b1f48f7e4576d5ef
5
5
  SHA512:
6
- metadata.gz: 84af02d3f38e7e861086bf268eb6670ef8528f6b70035d2bba6f3098b99dd897b29b727ae1037ba80abad412ff34ae543eec4e8c57ed71f4a74a939f35be4b15
7
- data.tar.gz: 4c88628302bca9be9089ee2a6c1be2991c06201d66ce91bd88a62247b6ad45b64ff0cc90403dce9c1ebd0dcabb6fb807be58f52133394c0f57e5771473d558c2
6
+ metadata.gz: 4657f9b28e44d82e6ceff39cc9d59f10964255138c2542e3d828a3fa5201858a1932eedc0b55918abcfd9e469591527bfe1c42f10b88a7629b0faa10965b5615
7
+ data.tar.gz: 382575e0b05be66f8f23b0c549895bc82e57a38575605bebf4cf196053f03d78b8ea9e52b7a4ff9f457ac8cb4917ee076178984ca315ad5b87e5635953f1f35a
data/README.md CHANGED
@@ -2,9 +2,11 @@
2
2
 
3
3
  **The Datamancer** is a form of sorcerer whose spells deal with data manipulation between databases.
4
4
 
5
- Data targets (sources and destinations) can be **databases** supported by ActiveRecord and **CSV files**. Multiple targets can be present in a single ETL process.
5
+ Sources and destinations can be **databases** supported by ActiveRecord and **CSV files**. Multiple sources and destinations can be present in a single ETL process.
6
6
 
7
- Datamancer relies in bulk SQL reading and writing, and does not instantiate ActiveRecord objects, which is used for the sole purpose of connecting to databases.
7
+ To optimize performance, Datamancer relies on bulk SQL reading and writing and does not instantiate ActiveRecord objects; ActiveRecord is used solely to connect to databases.
8
+
9
+ It is also tested to run on JRuby.
8
10
 
9
11
  ## Installation
10
12
 
@@ -22,7 +24,158 @@ Or install it yourself as:
22
24
 
23
25
  ## Usage
24
26
 
25
- *Please see the specs, for now.*
27
+ ### Datastreams
28
+
29
+ A **datastream** is a table of rows and columns, represented as an array of hashes.
30
+ Each element of the array (a hash) represents a row,
31
+ and each key-value pair in a hash represents an attribute (a column) and its value.
32
+
33
+ ```ruby
34
+ people = [
35
+ {name: 'Foo', age: 27},
36
+ {name: 'Bar', age: 42}
37
+ ]
38
+ ```
39
+
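Because a datastream is a plain array of hashes, ordinary Ruby collection methods work on it directly. The sketch below uses only core Ruby and the sample rows from the README; none of it is part of Datamancer's API.

```ruby
# A datastream: each hash is a row, each key is a column.
people = [
  {name: 'Foo', age: 27},
  {name: 'Bar', age: 42}
]

# Column names, taken from the first row.
columns = people.first.keys

# All values of a single column.
ages = people.map { |row| row[:age] }

# Rows that satisfy a condition.
over_thirty = people.select { |row| row[:age] > 30 }
```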
40
+ ### Extract
41
+
42
+ To get a datastream from a data source, use the extract() method:
43
+
44
+ ```ruby
45
+ people = extract(from: 'people.csv')
46
+ ```
47
+
48
+ If the 'from' value is a string, extract() assumes that the source is a CSV file.
49
+ One option available for this kind of source is 'separator', whose value
50
+ is used to parse the file (defaults to a comma).
51
+
52
+ Similarly, if the 'from' value is a hash, it is passed to ActiveRecord to establish
53
+ a connection to the specified database and table. If the table is not specified
54
+ there, the 'table' option can provide it.
55
+
56
+ The following example uses a YAML file that lists databases and shows further capabilities of extract().
57
+
58
+ *databases.yml*
59
+
60
+ ```yaml
61
+ warehouse:
62
+ adapter: postgresql
63
+ host: 192.168.0.1
64
+ database: Warehouse
65
+ username: foo
66
+ password: 1234
67
+ ```
68
+
69
+ ```ruby
70
+ databases = YAML.load_file('databases.yml')
71
+
72
+ people = extract(from: databases['warehouse'], table: 'people') do
73
+ field :name
74
+ field :age
75
+ end
76
+ ```
77
+
78
+ extract() can take a block to manipulate the data being extracted. All table or file columns
79
+ are brought in by default; to get the opposite effect, use 'exclude: true'.
80
+
81
+ Inside the block, field() refers to a column. Given only the column name
82
+ (as above), it performs no operation, unless columns are being excluded through the aforementioned
83
+ option, in which case field() includes the named column. Both strings and symbols are valid column names.
84
+
85
+ To summarize, extract() options are:
86
+
87
+ * from: (hash or string)
88
+ * table: (string)
89
+ * exclude: (boolean)
90
+ * separator: (string)
91
+
92
+ The field() options available in an extraction context are:
93
+
94
+ * reject_if: (value or array of values) - Rejects the row if the condition is met.
95
+ * reject_unless: (value or array of values) - Rejects the row unless the condition is met.
96
+ * map: (string or symbol) - Name of the column in the database that the field aliases.
97
+ * type: (class) - Casts the field's value to the given data type.
98
+ * default: (value) - Sets the field to a default value if it is nil.
99
+ * type_default: (class) - Casts the data type even if the field's value is nil. This way a nil
100
+ can be converted into a zero or an empty string.
101
+ * empty_default: (string) - Sets the field to a default value if it is nil or empty.
102
+ * strip: (boolean) - Strips leading and trailing whitespace from the value.
103
+
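To make the option descriptions above concrete, here is a minimal plain-Ruby sketch of how a single value might pass through `strip`, `empty_default`, `default`, `type`, and `type_default`. The helper name `apply_field_options`, the order of the steps, and the set of supported classes are all assumptions made for illustration; this is not Datamancer's implementation.

```ruby
# Hypothetical helper (not part of Datamancer): applies the documented
# field() options to one value. The step order is an assumption.
def apply_field_options(value, options = {})
  # :strip removes leading and trailing whitespace from strings.
  value = value.strip if options[:strip] && value.respond_to?(:strip)

  # :empty_default replaces nil or an empty string.
  if options.key?(:empty_default) && (value.nil? || value == '')
    value = options[:empty_default]
  end

  # :default replaces nil only.
  value = options[:default] if value.nil? && options.key?(:default)

  # :type leaves nil untouched; :type_default casts nil as well.
  type = options[:type_default] || options[:type]
  if type && (!value.nil? || options[:type_default])
    value = case type.to_s
            when 'Integer' then value.to_i
            when 'Float'   then value.to_f
            when 'String'  then value.to_s
            else value
            end
  end

  value
end

apply_field_options('  Foo ', strip: true)      # whitespace removed
apply_field_options('42', type: Integer)        # cast to an integer
apply_field_options(nil, type_default: Integer) # nil cast to zero
```

The last call shows the difference the README describes: with `type` a nil stays nil, while `type_default` turns it into a zero or an empty string.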
104
+ In addition to these options, field() also takes blocks!
105
+
106
+ All together now:
107
+
108
+ ```ruby
109
+ people = extract(from: databases['warehouse'], table: 'people') do
110
+ field 'name', map: 'Name', reject_if: nil
111
+ field 'age', map: 'Age', type: Integer
112
+ end
113
+ ```
114
+
115
+ ### Transform
116
+
117
+ Method options:
118
+
119
+ * exclude
120
+ * join
121
+ * on
122
+ * unique
123
+
124
+ Field options: Not allowed.
125
+
126
+ ### Load
127
+
128
+ Method options:
129
+
130
+ * to
131
+ * table
132
+ * exclude
133
+ * append
134
+ * batch
135
+
136
+ Field options:
137
+
138
+ * map
139
+
140
+ ## Example
141
+
142
+ ```ruby
143
+ require 'bundler/setup'
144
+ require 'datamancer'
145
+ require 'active_record'
146
+ require 'csv'
147
+
148
+ include Datamancer
149
+
150
+ bases = YAML.load_file('/home/matias/proyectos/panel/bases_de_datos.yml')
151
+
152
+ países_ISO =
153
+ extract from: 'country-list/country/cldr/es_AR/country.csv' do
154
+ field :iso
155
+ field :nombre, map: 'name'
156
+ end
157
+
158
+ países_UN =
159
+ extract from: 'countries/countries.csv', separator: ';', exclude: true do
160
+ field :iso, map: 'cca2'
161
+ field :número, map: 'ccn3', type: Integer
162
+ end
163
+
164
+ países =
165
+ transform países_ISO, join: países_UN, on: :iso
166
+
167
+ load países, to: bases['panel'], table: 'lk_com_pais', append: true do
168
+ field :número, map: 'id_com_pais'
169
+ field :iso, map: 'cd_com_pais'
170
+ field :nombre, map: 'ds_com_pais'
171
+ end
172
+ ```
173
+
174
+ ## Future features
175
+
176
+ * Batch mode
177
+ * Error monitor
178
+ * Control files
26
179
 
27
180
  ## Contributing
28
181
 
@@ -43,6 +43,7 @@ module Datamancer
43
43
 
44
44
  define_singleton_method :field do |name, actions = {}, &block|
45
45
  actions[:type] ||= actions[:type_default]
46
+ actions[:default] ||= actions[:empty_default]
46
47
  actions = default_actions.merge(actions)
47
48
  mapping = actions[:map] || name.to_s
48
49
 
@@ -171,11 +172,21 @@ module Datamancer
171
172
 
172
173
  # TODO: Test this. Test to not reject nil by default.
173
174
 
174
- if actions[:reject_if] == value ||
175
- (actions[:reject_unless] != :ñil &&
176
- actions[:reject_unless] != value)
175
+ if actions[:reject_if].is_a? Array
176
+ actions[:reject_if].each do |reject_value|
177
+ value = :reject if reject_value == value
178
+ end
179
+ else
180
+ value = :reject if actions[:reject_if] == value
181
+ end
177
182
 
178
- value = :reject
183
+ if actions[:reject_unless].is_a? Array
184
+ unless actions[:reject_unless].include?(value)
185
+ value = :reject
186
+ end
187
+ else
188
+ value = :reject if actions[:reject_unless] != :ñil &&
189
+ actions[:reject_unless] != value
179
190
  end
180
191
 
181
192
  value
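The rejection rules in this hunk can be summarized in a small standalone sketch. It assumes the documented meaning of the options: `reject_if` rejects a value that matches any listed value, and `reject_unless` rejects a value that matches none of them. The names `UNSET`, `as_list`, and `reject_row?` are hypothetical; `UNSET` plays the role that the `:ñil` sentinel plays in the source, so that `nil` itself can be used as a value to reject.

```ruby
# Sentinel meaning "option not given", so that nil remains a usable value.
UNSET = Object.new.freeze

# Treats a single value and an array of values uniformly.
def as_list(option)
  option.is_a?(Array) ? option : [option]
end

# Hypothetical helper: true when the row holding `value` should be rejected.
def reject_row?(value, reject_if: UNSET, reject_unless: UNSET)
  return true if !reject_if.equal?(UNSET) && as_list(reject_if).include?(value)
  return true if !reject_unless.equal?(UNSET) && !as_list(reject_unless).include?(value)
  false
end

reject_row?(nil, reject_if: nil)                # rejected: value matches
reject_row?('BR', reject_unless: ['AR', 'UY'])  # rejected: value not listed
reject_row?('AR', reject_unless: ['AR', 'UY'])  # kept
```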
@@ -33,14 +33,40 @@ module Datamancer
33
33
  output
34
34
  end
35
35
 
36
+ def where input, attributes
37
+
38
+ end
39
+
40
+ def unique input, attribute
41
+
42
+ attribute = attribute.to_sym
43
+ output = Array.new
44
+ unique_values = Array.new
45
+
46
+ input.each do |row|
47
+ unless unique_values.include?(row[attribute])
48
+ output << row
49
+ unique_values << row[attribute]
50
+ end
51
+ end
52
+
53
+ output
54
+ end
55
+
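For comparison, the same "keep the first row for each distinct value" behavior can be sketched with core Ruby's `Array#uniq`, which accepts a block and tracks seen values in a hash rather than scanning an array for every row. The sample data below is illustrative only.

```ruby
people = [
  {name: 'Foo', age: 27},
  {name: 'Bar', age: 42},
  {name: 'Foo', age: 27}
]

# Keeps the first row seen for each distinct :name value.
deduplicated = people.uniq { |row| row[:name] }
```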
36
56
  def transform input, args = {}
37
57
 
38
58
  if args[:join]
39
59
  raise ArgumentError unless args[:on]
60
+ raise ArgumentError unless input.first.keys.include?(args[:on].to_sym)
61
+ raise ArgumentError unless args[:join].first.keys.include?(args[:on].to_sym)
40
62
 
41
63
  input = join input, args[:join], args[:on]
42
64
  end
43
65
 
66
+ if args[:unique]
67
+ input = unique input, args[:unique]
68
+ end
69
+
44
70
  # TODO: Method-overriding safeguard.
45
71
 
46
72
  input.first.each_key do |key|
@@ -55,43 +81,79 @@ module Datamancer
55
81
  end
56
82
  end
57
83
 
84
+ define_singleton_method :row_number do
85
+ @row_number
86
+ end
87
+
88
+ define_singleton_method :row do
89
+ @supplementary_row
90
+ end
91
+
92
+ define_singleton_method :count do
93
+ @count += 1
94
+ end
95
+
96
+ define_singleton_method :output do
97
+ @output
98
+ end
99
+
100
+ define_singleton_method :switch do |slot|
101
+ @slot = slot
102
+ end
103
+
58
104
  define_singleton_method :field do |name, value = nil, *args|
59
105
  raise MissingField,
60
- "Required field '#{name}' was not found" unless respond_to?(name)
61
-
106
+ "Required field '#{name}' was not found" unless @input_row.include?(name.to_sym)
107
+
62
108
  @output_row[name.to_sym] = if value.is_a?(Symbol)
63
- send(name).send *args.unshift(value)
109
+ send(name.downcase).send *args.unshift(value)
64
110
  else
65
- value || send(name)
111
+ value || send(name.downcase)
66
112
  end
67
113
  end
68
114
 
69
115
  define_singleton_method :del_field do |name|
70
116
  raise MissingField,
71
- "Filtered field '#{name}' was not found" unless respond_to?(name)
117
+ "Filtered field '#{name}' was not found" unless @input_row.include?(name.to_sym)
72
118
 
73
119
  @output_row.delete(name.to_sym)
74
120
  end
75
121
 
76
122
  define_singleton_method :new_field do |name, value|
77
123
  raise ExistingField,
78
- "New field '#{name}' already exists" if respond_to?(name)
124
+ "New field '#{name}' already exists" if @input_row.include?(name.to_sym)
79
125
 
80
126
  @output_row[name.to_sym] = value
81
127
  end
82
128
 
83
- input.map do |row|
129
+ # TODO: Test for count.
130
+
131
+ @count = 0
132
+
133
+ # TODO: Test for slots.
134
+
135
+ @output = Hash.new { |h, k| h[k] = [] }
136
+
137
+ input.each_with_index do |row, row_number|
138
+
139
+ # TODO: Test for row_number.
140
+ # TODO: Test for (supplementary) row.
141
+
142
+ @row_number = row_number
84
143
  @input_row = row
144
+ @supplementary_row = @input_row.dup
85
145
  @output_row = args[:exclude] ? {} : @input_row.dup
86
146
 
87
147
  yield if block_given?
88
148
 
89
- @output_row
149
+ @output[@slot] << @output_row
90
150
  end
151
+
152
+ @output.length == 1? @output[nil] : @output
91
153
  end
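The slot mechanism added here relies on a hash whose default block creates an empty array per key, so each row can be appended to whichever slot is current. The sketch below shows that pattern in plain Ruby; the slot names and the age rule are invented for illustration and do not come from Datamancer.

```ruby
rows = [
  {name: 'Foo', age: 27},
  {name: 'Bar', age: 42}
]

# Missing keys start as empty arrays, mirroring @output in the diff above.
output = Hash.new { |hash, key| hash[key] = [] }

rows.each do |row|
  # In the library, switch(slot) would choose the slot; here a rule does.
  slot = row[:age] >= 30 ? :seniors : :juniors
  output[slot] << row
end
```

One thing to check when reading the hunk: the final expression returns `@output[nil]` whenever exactly one slot exists, so a stream routed entirely to a single named slot appears to come back as an empty array.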
92
154
 
93
155
  def aggregate input
94
-
156
+
95
157
  define_singleton_method :dim do |name|
96
158
  name = name.to_sym
97
159
  @dimensions[name] = @row[name]
@@ -104,7 +166,7 @@ module Datamancer
104
166
 
105
167
  aggregated_input = Hash.new { |hash, key| hash[key] = Hash.new }
106
168
 
107
- input.each do |row|
169
+ input.each_with_index do |row, row_number|
108
170
  @row = row
109
171
  @dimensions = {}
110
172
  @facts = {}
@@ -1,3 +1,3 @@
1
1
  module Datamancer
2
- VERSION = "0.0.2"
2
+ VERSION = '0.0.3'
3
3
  end
Binary file
data/spec/extract_spec.rb CHANGED
@@ -1,5 +1,16 @@
1
1
  require 'spec_helper'
2
2
 
3
+ # TODO: Some wanted features
4
+
5
+ # allow table_name to be both string or symbol
6
+ # database.yml autoloading
7
+ # autoincremental id
8
+ # transform!
9
+ # reject by array and regex
10
+ # existent actions check
11
+
12
+ # TODO: case insensitivity
13
+
3
14
  describe Datamancer do
4
15
 
5
16
  context 'EXTRACT spell' do
@@ -60,10 +71,11 @@ describe Datamancer do
60
71
  }.to raise_error(ArgumentError, 'Extract requires a source, i.e. extract(from: source)')
61
72
  end
62
73
 
63
-
74
+
64
75
  it 'raises an exception if a required field is missing' do
65
76
 
66
- # TODO: Better explanation for this error.
77
+ # TODO: Better explanation for this error. Ensure that the table appears
78
+ # even when it has been specified with the :table option (outside :from).
67
79
 
68
80
  # source = case source_type
69
81
  # when 'CSV file' then "#{@source} file"
@@ -73,11 +85,11 @@ describe Datamancer do
73
85
  expect {
74
86
 
75
87
  extract(from: @source) do
76
- field :surname
88
+ field :agE
77
89
  end
78
90
 
79
91
  }.to raise_error(MissingField,
80
- "Required field 'surname' was not found in '#{@source}'")
92
+ "Required field 'agE' was not found in '#{@source}'")
81
93
  end
82
94
 
83
95
 
data/spec/load_spec.rb CHANGED
@@ -80,11 +80,13 @@ describe Datamancer do
80
80
  expect {
81
81
 
82
82
  load(@data, to: @destination) do
83
- field :surname
83
+ field :agE
84
84
  end
85
85
 
86
+ # TODO: Check for table.
87
+
86
88
  }.to raise_error(MissingField,
87
- "Required field 'surname' was not found in '#{@destination}'")
89
+ "Required field 'agE' was not found in '#{@destination}'")
88
90
  end
89
91
 
90
92
 
@@ -55,20 +55,20 @@ describe Datamancer do
55
55
  expect {
56
56
 
57
57
  transform(@data) do
58
- field :surname
58
+ field :agE
59
59
  end
60
60
 
61
61
  }.to raise_error(MissingField,
62
- "Required field 'surname' was not found")
62
+ "Required field 'agE' was not found")
63
63
 
64
64
  expect {
65
65
 
66
66
  transform(@data) do
67
- del_field :surname
67
+ del_field :agE
68
68
  end
69
69
 
70
70
  }.to raise_error(MissingField,
71
- "Filtered field 'surname' was not found")
71
+ "Filtered field 'agE' was not found")
72
72
  end
73
73
 
74
74
 
@@ -149,11 +149,35 @@ describe Datamancer do
149
149
  end
150
150
 
151
151
 
152
+ it 'drops duplicated rows' do
153
+ duplicated_data = @data + @data
154
+
155
+ expect(
156
+ transform(duplicated_data, unique: :name)
157
+ ).to eq(@data)
158
+
159
+ expect(
160
+ transform(duplicated_data, unique: :name) do
161
+ field :name
162
+ end
163
+ ).to eq(@data)
164
+ end
165
+
166
+
152
167
  context 'combines records by' do
153
168
 
154
169
  before(:all) do
155
- @left_data = extract from: $dir + '/left_source.csv'
156
- @right_data = extract from: $dir + '/right_source.csv'
170
+ @left_data =
171
+ [{name: 'Foo', some_id: 1},
172
+ {name: 'Bar', some_id: 2},
173
+ {name: 'Baz', some_id: 2},
174
+ {name: 'Foobar', some_id: nil}]
175
+
176
+ @right_data =
177
+ [{age: 0, some_id: nil},
178
+ {age: 27, some_id: 1},
179
+ {age: 33, some_id: 1},
180
+ {age: 42, some_id: 2}]
157
181
  end
158
182
 
159
183
 
@@ -164,16 +188,20 @@ describe Datamancer do
164
188
  expect {
165
189
  transform(@left_data, join: @right_data)
166
190
  }.to raise_error(ArgumentError)
191
+
192
+ expect {
193
+ transform(@left_data, join: @right_data, on: 'some_ID')
194
+ }.to raise_error(ArgumentError)
167
195
 
168
196
  expect(
169
197
  transform(@left_data, join: @right_data, on: 'some_id') do
170
198
  del_field :some_id
171
- new_field :namage, name.downcase + age
199
+ new_field :namage, name.downcase + age.to_s
172
200
  end
173
- ).to eq([{name: 'Foo', age: '27', namage: 'foo27'},
174
- {name: 'Foo', age: '33', namage: 'foo33'},
175
- {name: 'Bar', age: '42', namage: 'bar42'},
176
- {name: 'Baz', age: '42', namage: 'baz42'}])
201
+ ).to eq([{name: 'Foo', age: 27, namage: 'foo27'},
202
+ {name: 'Foo', age: 33, namage: 'foo33'},
203
+ {name: 'Bar', age: 42, namage: 'bar42'},
204
+ {name: 'Baz', age: 42, namage: 'baz42'}])
177
205
  end
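The expected rows in this spec are consistent with an inner join in which rows whose key is nil never match. The plain-Ruby sketch below reproduces that pairing for the spec's sample data. The helper `inner_join` is hypothetical and written only from the expected output shown above; it is not the library's `join` method.

```ruby
# Hypothetical helper: inner join of two datastreams on a shared key.
# Rows with a nil key are skipped, as the spec's expected output suggests.
def inner_join(left, right, key)
  right_by_key = right.group_by { |row| row[key] }

  left.flat_map do |left_row|
    next [] if left_row[key].nil?
    right_by_key.fetch(left_row[key], []).map { |right_row| left_row.merge(right_row) }
  end
end

left_data = [
  {name: 'Foo', some_id: 1},
  {name: 'Bar', some_id: 2},
  {name: 'Baz', some_id: 2},
  {name: 'Foobar', some_id: nil}
]

right_data = [
  {age: 0, some_id: nil},
  {age: 27, some_id: 1},
  {age: 33, some_id: 1},
  {age: 42, some_id: 2}
]

joined = inner_join(left_data, right_data, :some_id)
```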
178
206
 
179
207
 
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: datamancer
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
4
+ version: 0.0.3
5
5
  platform: ruby
6
6
  authors:
7
7
  - Matías Battocchia
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2013-11-22 00:00:00.000000000 Z
11
+ date: 2013-12-19 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -114,8 +114,6 @@ files:
114
114
  - lib/datamancer/version.rb
115
115
  - spec/data/destination.csv
116
116
  - spec/data/destination.sqlite3
117
- - spec/data/left_source.csv
118
- - spec/data/right_source.csv
119
117
  - spec/data/simple.yml
120
118
  - spec/data/source.csv
121
119
  - spec/data/source.sqlite3
@@ -151,8 +149,6 @@ summary: ''
151
149
  test_files:
152
150
  - spec/data/destination.csv
153
151
  - spec/data/destination.sqlite3
154
- - spec/data/left_source.csv
155
- - spec/data/right_source.csv
156
152
  - spec/data/simple.yml
157
153
  - spec/data/source.csv
158
154
  - spec/data/source.sqlite3
@@ -1,5 +0,0 @@
1
- name,some_id
2
- "Foo",1
3
- "Bar",2
4
- "Baz",2
5
- "Foobar",
@@ -1,5 +0,0 @@
1
- age,some_id
2
- 0,
3
- 27,1
4
- 33,1
5
- 42,2