datamancer 0.0.2 → 0.0.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 5a09ebee71e5c2b54c0b790ae4dc2165e4e9e45a
-   data.tar.gz: eabae34cb0d1b34ce072796f3b0edab7117325e2
+   metadata.gz: 7c61e73995d0dfa4c372c1142482044a3a55ceec
+   data.tar.gz: 71968fed5543e5653ebf74e4b1f48f7e4576d5ef
  SHA512:
-   metadata.gz: 84af02d3f38e7e861086bf268eb6670ef8528f6b70035d2bba6f3098b99dd897b29b727ae1037ba80abad412ff34ae543eec4e8c57ed71f4a74a939f35be4b15
-   data.tar.gz: 4c88628302bca9be9089ee2a6c1be2991c06201d66ce91bd88a62247b6ad45b64ff0cc90403dce9c1ebd0dcabb6fb807be58f52133394c0f57e5771473d558c2
+   metadata.gz: 4657f9b28e44d82e6ceff39cc9d59f10964255138c2542e3d828a3fa5201858a1932eedc0b55918abcfd9e469591527bfe1c42f10b88a7629b0faa10965b5615
+   data.tar.gz: 382575e0b05be66f8f23b0c549895bc82e57a38575605bebf4cf196053f03d78b8ea9e52b7a4ff9f457ac8cb4917ee076178984ca315ad5b87e5635953f1f35a
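Note: the SHA1 and SHA512 entries in this hunk are digests of the gem's `metadata.gz` and `data.tar.gz` members. Below is a minimal sketch of checking a byte string against such a pair with Ruby's standard `digest` library. The helper name `checksums_match?` is hypothetical and is not part of RubyGems or Datamancer.

```ruby
require 'digest'

# Hypothetical helper: returns true only when both digests of `bytes`
# equal the expected hex strings (as listed in checksums.yaml).
def checksums_match?(bytes, sha1:, sha512:)
  Digest::SHA1.hexdigest(bytes) == sha1 &&
    Digest::SHA512.hexdigest(bytes) == sha512
end
```

In practice `bytes` would come from something like `File.binread('data.tar.gz')` after unpacking the `.gem` archive.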
data/README.md CHANGED
@@ -2,9 +2,11 @@
 
  **The Datamancer** is a form of sorcerer whose spells deal with data manipulation between databases.
 
- Data targets (sources and destinations) can be **databases** supported by ActiveRecord and **CSV files**. Multiple targets can be present in a single ETL process.
+ Sources and destinations can be **databases** supported by ActiveRecord and **CSV files**. Multiple sources and destinations can be present in a single ETL process.
 
- Datamancer relies in bulk SQL reading and writing, and does not instantiate ActiveRecord objects, which is used for the sole purpose of connecting to databases.
+ To optimize, Datamancer relies in bulk SQL reading and writing, and does not instantiate ActiveRecord objects, which is used for the sole purpose of connecting to databases.
+
+ It is tested to run over JRuby as well.
 
  ## Installation
 
@@ -22,7 +24,158 @@ Or install it yourself as:
 
  ## Usage
 
- *Please see the specs, for now.*
+ ### Datastreams
+
+ A **datastream** is a table with columns and rows made with an array of hashes.
+ Each element of the array (a hash) represents a row,
+ and each pair of a hash, an attribute (a column) and its value.
+
+ ```ruby
+ people = [
+   {name: 'Foo', age: 27},
+   {name: 'Bar', age: 42}
+ ]
+ ```
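Note: since a datastream as described here is a plain array of hashes, ordinary `Array` and `Hash` methods apply to it. The following is a small sketch, not taken from the gem, of reading columns and filtering rows. The variable names `columns`, `ages` and `over_thirty` are illustrative only.

```ruby
people = [
  {name: 'Foo', age: 27},
  {name: 'Bar', age: 42}
]

# Column names are the keys of any row (here, the first one).
columns = people.first.keys

# A single column is the same key read from every row.
ages = people.map { |row| row[:age] }

# Filtering rows is a plain Array#select over the hashes.
over_thirty = people.select { |row| row[:age] > 30 }
```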
+
+ ### Extract
+
+ To get a datastream from a data source use the extract() method, like this:
+
+ ```ruby
+ people = extract(from: 'people.csv')
+ ```
+
+ If 'from' value is a string, extract will suppose that the source is a CSV file.
+ One option that can be used along this kind of source is 'separator', which value
+ will be used to interpret the file (defaults to comma).
+
+ Similarly, if 'from' attribute is a hash, it will be given to ActiveRecord to establish
+ a connection with the specified database and table. In case that the table is not indicated
+ there, the 'table' option can provide it.
+
+ Next, an example using a YAML file listing databases and showing further capacities of extract().
+
+ *databases.yml*
+
+ ```yaml
+ warehouse:
+   adapter: postgresql
+   host: 192.168.0.1
+   database: Warehouse
+   username: foo
+   password: 1234
+ ```
+
+ ```ruby
+ databases = YAML.load_file('databases.yml')
+
+ people = extract(from: databases['main'], table: 'people') do
+   field :name
+   field :age
+ end
+ ```
+
+ extract() can take a block in order to manipulate data being extracted. All table or file columns
+ are brung by default; to get the opposite effect use 'exclude: true'.
+
+ Inside the block, field() is used to refer to columns. Without more arguments than columns' name
+ (as above) it performs no operation, but if columns are being excluded through the aforementioned
+ option, field() includes the referred columns. Both string and symbol are valid for column names.
+
+ To summarize, extract() options are:
+
+ * from: (hash or string)
+ * table: (string)
+ * exclude: (boolean)
+ * separator: (string)
+
+ And now we introduce field() options under extraction context:
+
+ * reject_if: (value or array of values) — Rejects row if condition is meet.
+ * reject_unless: (value or array of values) — Rejects row unless condition is meet.
+ * map: (string or symbol) — Alias column name in database.
+ * type: (class) — Casts data type on field.
+ * default: (value) — Set field to default value if nil.
+ * type_default: (class) — Casts data type even if field's value is nil. This way a nil
+   can be converted into a zero or an empty string.
+ * empty_default: (string) — Set field to default value if nil or empty.
+ * strip: (boolean) — Right and left strips the value to remove extra spaces in the string.
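Note: the field() options listed in this part of the README describe a small per-value pipeline. The sketch below shows one plausible order of operations (strip, defaults, rejection, cast) as a standalone function. It is an assumption for illustration only: `apply_field_options` and `NO_REJECT` are hypothetical names, the order is a guess, and the gem's actual implementation is only partly visible in this diff.

```ruby
# Sentinel meaning "no reject_if given", so that nil can be a real reject value.
NO_REJECT = Object.new.freeze

# Hypothetical stand-in for per-value processing; not Datamancer's own code.
def apply_field_options(value, strip: false, default: nil, empty_default: nil,
                        type: nil, reject_if: NO_REJECT)
  # strip: remove leading and trailing whitespace from strings.
  value = value.strip if strip && value.respond_to?(:strip)

  # empty_default applies to nil or empty strings; default applies to nil only.
  value = empty_default if !empty_default.nil? && (value.nil? || value == '')
  value = default if value.nil?

  # reject_if accepts a single value or an array of values.
  unless reject_if.equal?(NO_REJECT)
    rejects = reject_if.is_a?(Array) ? reject_if : [reject_if]
    return :reject if rejects.include?(value)
  end

  # type: cast non-nil values to the requested class.
  return value if value.nil? || type.nil?
  if type == Integer then value.to_i
  elsif type == Float then value.to_f
  elsif type == String then value.to_s
  else value
  end
end
```

Returning the symbol `:reject` mirrors the marker the extract code in this diff assigns to rejected values.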
+
+ In addition to these options field() also takes blocks!
+
+ All together now:
+
+ ```ruby
+ people = extract(from: databases['main'], table: 'people') do
+   field 'name', map: 'Name', reject_if: nil
+   field 'age', map: 'Age', type: Integer
+ end
+ ```
+
+ ### Transform
+
+ Method options:
+
+ * exclude
+ * join
+ * on
+ * unique
+
+ Field options: Not allowed.
+
+ ### Load
+
+ Method options:
+
+ * to
+ * table
+ * exclude
+ * append
+ * batch
+
+ Field options:
+
+ * map
+
+ ## Example
+
+ ```ruby
+ require 'bundler/setup'
+ require 'datamancer'
+ require 'active_record'
+ require 'csv'
+
+ include Datamancer
+
+ bases = YAML.load_file('/home/matias/proyectos/panel/bases_de_datos.yml')
+
+ países_ISO =
+   extract from: 'country-list/country/cldr/es_AR/country.csv' do
+     field :iso
+     field :nombre, map: 'name'
+   end
+
+ países_UN =
+   extract from: 'countries/countries.csv', separator: ';', exclude: true do
+     field :iso, map: 'cca2'
+     field :número, map: 'ccn3', type: Integer
+   end
+
+ países =
+   transform países_ISO, join: países_UN, on: :iso
+
+ load países, to: bases['panel'], table: 'lk_com_pais', append: true do
+   field :número, map: 'id_com_pais'
+   field :iso, map: 'cd_com_pais'
+   field :nombre, map: 'ds_com_pais'
+ end
+ ```
+
+ ## Future features
+
+ * Batch mode
+ * Error monitor
+ * Control files
 
  ## Contributing
 
@@ -43,6 +43,7 @@ module Datamancer
 
  define_singleton_method :field do |name, actions = {}, &block|
    actions[:type] ||= actions[:type_default]
+   actions[:default] ||= actions[:empty_default]
    actions = default_actions.merge(actions)
    mapping = actions[:map] || name.to_s
 
@@ -171,11 +172,21 @@ module Datamancer
 
  # TODO: Test this. Test to not reject nil by default.
 
- if actions[:reject_if] == value ||
-   (actions[:reject_unless] != :ñil &&
-   actions[:reject_unless] != value)
+ if actions[:reject_if].is_a? Array
+   actions[:reject_if].each do |reject_value|
+     value = :reject if reject_value == value
+   end
+ else
+   value = :reject if actions[:reject_if] == value
+ end
 
-   value = :reject
+ if actions[:reject_unless].is_a? Array
+   actions[:reject_unless].each do |reject_value|
+     value = :reject if reject_value != value
+   end
+ else
+   value = :reject if actions[:reject_unless] != :ñil &&
+     actions[:reject_unless] != value
  end
 
  value
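Note: as written in this hunk, the array branch of `reject_unless` marks the value as `:reject` as soon as it differs from any single element, so a list holding two distinct values would appear to reject every row. If the intent is "reject unless the value is one of these", a membership test expresses it directly. The sketch below is an assumption about that intent, not the gem's code. `UNSET`, `wrap` and `reject?` are hypothetical names, with `UNSET` standing in for the `:ñil` sentinel.

```ruby
# Sentinel for "option not given", so that nil remains a legitimate reject value.
UNSET = Object.new.freeze

# Treat a single value and an array of values uniformly.
def wrap(option)
  option.is_a?(Array) ? option : [option]
end

# True when the value should be rejected under either option.
def reject?(value, reject_if: UNSET, reject_unless: UNSET)
  return true if !reject_if.equal?(UNSET) && wrap(reject_if).include?(value)
  return true if !reject_unless.equal?(UNSET) && !wrap(reject_unless).include?(value)
  false
end
```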
@@ -33,14 +33,40 @@ module Datamancer
    output
  end
 
+ def where input, attributes
+
+ end
+
+ def unique input, attribute
+
+   attribute = attribute.to_sym
+   output = Array.new
+   unique_values = Array.new
+
+   input.each do |row|
+     unless unique_values.include?(row[attribute])
+       output << row
+       unique_values << row[attribute]
+     end
+   end
+
+   output
+ end
+
  def transform input, args = {}
 
    if args[:join]
      raise ArgumentError unless args[:on]
+     raise ArgumentError unless input.first.keys.include?(args[:on].to_sym)
+     raise ArgumentError unless args[:join].first.keys.include?(args[:on].to_sym)
 
      input = join input, args[:join], args[:on]
    end
 
+   if args[:unique]
+     input = unique input, args[:unique]
+   end
+
    # TODO: Method-overriding safeguard.
 
    input.first.each_key do |key|
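Note: the `unique` method added in this hunk keeps the first row seen for each value of the given attribute, scanning an array of seen values for every row. The same first-occurrence behaviour is available from `Array#uniq` with a block, which tracks seen keys in a hash. This is only a sketch of an equivalent, and `unique_by` is a hypothetical name, not something the gem defines.

```ruby
# Keeps the first row for each distinct value of `attribute`.
# Accepts the attribute as a string or a symbol, like the method in the diff.
def unique_by(input, attribute)
  key = attribute.to_sym
  input.uniq { |row| row[key] }
end
```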
@@ -55,43 +81,79 @@ module Datamancer
      end
    end
 
+   define_singleton_method :row_number do
+     @row_number
+   end
+
+   define_singleton_method :row do
+     @supplementary_row
+   end
+
+   define_singleton_method :count do
+     @count += 1
+   end
+
+   define_singleton_method :output do
+     @output
+   end
+
+   define_singleton_method :switch do |slot|
+     @slot = slot
+   end
+
    define_singleton_method :field do |name, value = nil, *args|
      raise MissingField,
-       "Required field '#{name}' was not found" unless respond_to?(name)
-
+       "Required field '#{name}' was not found" unless @input_row.include?(name.to_sym)
+
      @output_row[name.to_sym] = if value.is_a?(Symbol)
-       send(name).send *args.unshift(value)
+       send(name.downcase).send *args.unshift(value)
      else
-       value || send(name)
+       value || send(name.downcase)
      end
    end
 
    define_singleton_method :del_field do |name|
      raise MissingField,
-       "Filtered field '#{name}' was not found" unless respond_to?(name)
+       "Filtered field '#{name}' was not found" unless @input_row.include?(name.to_sym)
 
      @output_row.delete(name.to_sym)
    end
 
    define_singleton_method :new_field do |name, value|
      raise ExistingField,
-       "New field '#{name}' already exists" if respond_to?(name)
+       "New field '#{name}' already exists" if @input_row.include?(name.to_sym)
 
      @output_row[name.to_sym] = value
    end
 
-   input.map do |row|
+   # TODO: Test for count.
+
+   @count = 0
+
+   # TODO: Test for slots.
+
+   @output = Hash.new { |h, k| h[k] = [] }
+
+   input.each_with_index do |row, row_number|
+
+     # TODO: Test for row_number.
+     # TODO: Test for (supplementary) row.
+
+     @row_number = row_number
      @input_row = row
+     @supplementary_row = @input_row.dup
      @output_row = args[:exclude] ? {} : @input_row.dup
 
      yield if block_given?
 
-     @output_row
+     @output[@slot] << @output_row
    end
+
+   @output.length == 1? @output[nil] : @output
  end
 
  def aggregate input
-
+
    define_singleton_method :dim do |name|
      name = name.to_sym
      @dimensions[name] = @row[name]
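Note: this hunk makes `transform` collect rows into "slots" through a hash whose default block creates an empty array per key, returning a plain array when only one slot exists. One thing to watch in the final line: when every row is switched into a single named slot, `@output.length == 1` holds but `@output[nil]` is empty. The sketch below shows the same routing idea with a check on the keys instead. It is an assumption-laden illustration: `route` is a hypothetical name, and it picks the slot per row from the block's return value, whereas the diff keeps the last slot set by `switch`.

```ruby
# Routes each row into a slot chosen by the block; nil is the default slot.
def route(rows)
  output = Hash.new { |hash, slot| hash[slot] = [] }

  rows.each do |row|
    slot = yield(row)
    output[slot] << row
  end

  # Return a plain array only when the default slot is the sole slot used.
  output.keys == [nil] ? output[nil] : output
end
```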
@@ -104,7 +166,7 @@ module Datamancer
 
  aggregated_input = Hash.new { |hash, key| hash[key] = Hash.new }
 
- input.each do |row|
+ input.each_with_index do |row, row_number|
    @row = row
    @dimensions = {}
    @facts = {}
@@ -1,3 +1,3 @@
  module Datamancer
-   VERSION = "0.0.2"
+   VERSION = '0.0.3'
  end
Binary file
data/spec/extract_spec.rb CHANGED
@@ -1,5 +1,16 @@
  require 'spec_helper'
 
+ # TODO: Some wanted features
+
+ # allow table_name to be both string or symbol
+ # database.yml autoloading
+ # autoincremental id
+ # transform!
+ # reject by array and regex
+ # existent actions check
+
+ # TODO: case insentivity
+
  describe Datamancer do
 
    context 'EXTRACT spell' do
@@ -60,10 +71,11 @@ describe Datamancer do
    }.to raise_error(ArgumentError, 'Extract requires a source, i.e. extract(from: source)')
  end
 
-
+
  it 'raises an exception if a required field is missing' do
 
-   # TODO: Better explanation for this error.
+   # TODO: Better explanation for this error. Assure that table appears
+   # even though it has been specified with :table option (outside :from).
 
    # source = case source_type
    # when 'CSV file' then "#{@source} file"
@@ -73,11 +85,11 @@ describe Datamancer do
    expect {
 
      extract(from: @source) do
-       field :surname
+       field :agE
      end
 
    }.to raise_error(MissingField,
-     "Required field 'surname' was not found in '#{@source}'")
+     "Required field 'agE' was not found in '#{@source}'")
  end
 
 
data/spec/load_spec.rb CHANGED
@@ -80,11 +80,13 @@ describe Datamancer do
    expect {
 
      load(@data, to: @destination) do
-       field :surname
+       field :agE
      end
 
+     # TODO: Check for table.
+
    }.to raise_error(MissingField,
-     "Required field 'surname' was not found in '#{@destination}'")
+     "Required field 'agE' was not found in '#{@destination}'")
  end
 
 
@@ -55,20 +55,20 @@ describe Datamancer do
    expect {
 
      transform(@data) do
-       field :surname
+       field :agE
      end
 
    }.to raise_error(MissingField,
-     "Required field 'surname' was not found")
+     "Required field 'agE' was not found")
 
    expect {
 
      transform(@data) do
-       del_field :surname
+       del_field :agE
      end
 
    }.to raise_error(MissingField,
-     "Filtered field 'surname' was not found")
+     "Filtered field 'agE' was not found")
  end
 
 
@@ -149,11 +149,35 @@ describe Datamancer do
  end
 
 
+ it 'drops duplicated rows' do
+   duplicated_data = @data + @data
+
+   expect(
+     transform(duplicated_data, unique: :name)
+   ).to eq(@data)
+
+   expect(
+     transform(duplicated_data, unique: :name) do
+       field :name
+     end
+   ).to eq(@data)
+ end
+
+
  context 'combines records by' do
 
    before(:all) do
-     @left_data = extract from: $dir + '/left_source.csv'
-     @right_data = extract from: $dir + '/right_source.csv'
+     @left_data =
+       [{name: 'Foo', some_id: 1},
+        {name: 'Bar', some_id: 2},
+        {name: 'Baz', some_id: 2},
+        {name: 'Foobar', some_id: nil}]
+
+     @right_data =
+       [{age: 0, some_id: nil},
+        {age: 27, some_id: 1},
+        {age: 33, some_id: 1},
+        {age: 42, some_id: 2}]
    end
 
 
@@ -164,16 +188,20 @@ describe Datamancer do
  expect {
    transform(@left_data, join: @right_data)
  }.to raise_error(ArgumentError)
+
+ expect {
+   transform(@left_data, join: @right_data, on: 'some_ID')
+ }.to raise_error(ArgumentError)
 
  expect(
    transform(@left_data, join: @right_data, on: 'some_id') do
      del_field :some_id
-     new_field :namage, name.downcase + age
+     new_field :namage, name.downcase + age.to_s
    end
- ).to eq([{name: 'Foo', age: '27', namage: 'foo27'},
-          {name: 'Foo', age: '33', namage: 'foo33'},
-          {name: 'Bar', age: '42', namage: 'bar42'},
-          {name: 'Baz', age: '42', namage: 'baz42'}])
+ ).to eq([{name: 'Foo', age: 27, namage: 'foo27'},
+          {name: 'Foo', age: 33, namage: 'foo33'},
+          {name: 'Bar', age: 42, namage: 'bar42'},
+          {name: 'Baz', age: 42, namage: 'baz42'}])
  end
 
 
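Note: the expectations in this spec imply an inner join on `some_id` in which rows with a nil key match nothing, and in which a left row paired with several right rows yields one output row per pair. The gem's own `join` is not shown in this diff, so the following is only a sketch of a join with those properties. `inner_join` is a hypothetical name.

```ruby
# Inner join of two datastreams (arrays of hashes) on a shared key.
# Rows whose key is nil never match, as the spec's expected output suggests.
def inner_join(left, right, on)
  key = on.to_sym

  # Index the right side by key so each left row is matched with one lookup.
  index = right.group_by { |row| row[key] }

  left.flat_map do |left_row|
    next [] if left_row[key].nil?
    index.fetch(left_row[key], []).map { |right_row| left_row.merge(right_row) }
  end
end
```

Output order follows the left datastream first and the right datastream second, which matches the order of the rows the spec expects.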
 
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: datamancer
  version: !ruby/object:Gem::Version
-   version: 0.0.2
+   version: 0.0.3
  platform: ruby
  authors:
  - Matías Battocchia
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-11-22 00:00:00.000000000 Z
+ date: 2013-12-19 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -114,8 +114,6 @@ files:
  - lib/datamancer/version.rb
  - spec/data/destination.csv
  - spec/data/destination.sqlite3
- - spec/data/left_source.csv
- - spec/data/right_source.csv
  - spec/data/simple.yml
  - spec/data/source.csv
  - spec/data/source.sqlite3
@@ -151,8 +149,6 @@ summary: ''
  test_files:
  - spec/data/destination.csv
  - spec/data/destination.sqlite3
- - spec/data/left_source.csv
- - spec/data/right_source.csv
  - spec/data/simple.yml
  - spec/data/source.csv
  - spec/data/source.sqlite3
@@ -1,5 +0,0 @@
- name,some_id
- "Foo",1
- "Bar",2
- "Baz",2
- "Foobar",
@@ -1,5 +0,0 @@
- age,some_id
- 0,
- 27,1
- 33,1
- 42,2