drudgery 0.0.1

data/LICENSE ADDED
@@ -0,0 +1,22 @@
+ Copyright (c) 2012 Jeremy Israelsen
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,306 @@
+ Drudgery [![Build Status](https://secure.travis-ci.org/jisraelsen/drudgery.png?branch=master)](http://travis-ci.org/jisraelsen/drudgery)
+ ========
+
+ A simple ETL library that supports the following sources/destinations:
+
+ * CSV and other delimited file formats (e.g. pipe, tab, etc)
+ * SQLite3
+ * ActiveRecord (bulk insert support using activerecord-import)
+
+ Supported Rubies:
+
+ * Ruby 1.9.2, 1.9.3
+
+ Install
+ -------
+
+ Install the gem directly:
+
+ ```bash
+ gem install drudgery
+ ```
+
+ Or, add it to your Gemfile:
+
+ ```ruby
+ gem 'drudgery'
+ ```
+
+ And, if using the `:sqlite3` extractor or loader:
+
+ ```ruby
+ gem 'sqlite3', '~> 1.3'
+ ```
+
+ And, if using the `:active_record` extractor or loader:
+
+ ```ruby
+ gem 'activerecord', '~> 3.0'
+ ```
+
+ And, if using the `:active_record_import` loader:
+
+ ```ruby
+ gem 'activerecord-import', '>= 0.2.9'
+ ```
+
+ Usage
+ -----
+
+ Extracting from CSV and loading into ActiveRecord:
+
+ ```ruby
+ m = Drudgery::Manager.new
+
+ m.prepare do |job|
+   job.extract :csv, 'src/addresses.csv'
+
+   job.transform do |data|
+     first_name, last_name = data.delete(:name).split(' ')
+
+     data[:first_name] = first_name
+     data[:last_name] = last_name
+     data[:state] = data.delete(:state_abbr)
+
+     data
+   end
+
+   job.load :active_record, Address
+ end
+
+ m.run
+ ```
+
+ Extracting from SQLite3 and bulk loading into ActiveRecord:
+
+ ```ruby
+ db = SQLite3::Database.new('db.sqlite3')
+
+ m = Drudgery::Manager.new
+
+ m.prepare do |job|
+   job.batch_size 5000
+
+   job.extract :sqlite3, db, 'addresses' do |extractor|
+     extractor.select(
+       'name',
+       'street_address',
+       'city',
+       'state_abbr AS state',
+       'zip'
+     )
+     extractor.where("state LIKE 'A%'")
+     extractor.order('name')
+   end
+
+   job.transform do |data|
+     first_name, last_name = data.delete(:name).split(' ')
+
+     data[:first_name] = first_name
+     data[:last_name] = last_name
+
+     data
+   end
+
+   job.load :active_record_import, Address
+ end
+
+ m.run
+ ```
+
+ Extractors
+ ----------
+
+ The following extractors are provided: `:csv`, `:sqlite3`, `:active_record`
+
+ You can use your own extractors if you would like. They need only
+ implement an `#extract` method that yields each record:
+
+ ```ruby
+ class ArrayExtractor
+   def initialize(source)
+     @source = source
+   end
+
+   def extract
+     @source.each do |record|
+       yield record
+     end
+   end
+ end
+
+ source = []
+
+ m = Drudgery::Manager.new
+ job = Drudgery::Job.new(:extractor => ArrayExtractor.new(source))
+
+ m.prepare(job) do |job|
+   job.load :csv, 'destination.csv'
+ end
+ ```
+
+ Or, if you define your custom extractor under the `Drudgery::Extractors`
+ namespace:
+
+ ```ruby
+ module Drudgery
+   module Extractors
+     class ArrayExtractor
+       def initialize(source)
+         @source = source
+       end
+
+       def extract
+         @source.each do |record|
+           yield record
+         end
+       end
+     end
+   end
+ end
+
+ source = []
+
+ m = Drudgery::Manager.new
+
+ m.prepare do |job|
+   job.extract :array, source
+   job.load :csv, 'destination.csv'
+ end
+ ```
+
+ Transformers
+ ------------
+
+ Drudgery comes with a basic Transformer class. It symbolizes the keys of
+ each record and allows you to register processors to process data. Registered
+ processors should implement a `#call` method that accepts the record and a
+ shared cache hash, and returns a `Hash` or `nil`.
+
+ ```ruby
+ custom_processor = Proc.new do |data, cache|
+   # Take the first letter of each name part, e.g. 'John Doe' => 'JD'
+   data[:initials] = data[:name].split(' ').map { |part| part[0] }.join
+   data
+ end
+
+ transformer = Drudgery::Transformer.new
+ transformer.register(custom_processor)
+
+ transformer.transform({ :name => 'John Doe' }) # == { :name => 'John Doe', :initials => 'JD' }
+ ```
+
+ You could also implement your own transformer if you need more custom
+ processing power. If you inherit from `Drudgery::Transformer`, you need
+ only implement the `#transform` method that accepts a hash as an
+ argument and returns a `Hash` or `nil`.
+
+ ```ruby
+ class CustomTransformer < Drudgery::Transformer
+   def transform(data)
+     # do custom processing here
+   end
+ end
+
+ m = Drudgery::Manager.new
+ job = Drudgery::Job.new(:transformer => CustomTransformer.new)
+
+ m.prepare(job) do |job|
+   job.extract :csv, 'source.csv'
+   job.load :csv, 'destination.csv'
+ end
+ ```
+
+ Loaders
+ -------
+
+ The following loaders are provided:
+
+ * `:csv`
+ * `:sqlite3`
+ * `:active_record`
+ * `:active_record_import`
+
+ You can use your own loaders if you would like. They need only
+ implement a `#load` method that accepts an array of records as an
+ argument and then writes/inserts them to the destination.
+
+ ```ruby
+ class ArrayLoader
+   def initialize(destination)
+     @destination = destination
+   end
+
+   def load(records)
+     @destination.push(*records)
+   end
+ end
+
+ destination = []
+
+ m = Drudgery::Manager.new
+ job = Drudgery::Job.new(:loader => ArrayLoader.new(destination))
+
+ m.prepare(job) do |job|
+   job.extract :csv, 'source.csv'
+ end
+ ```
+
+ Or, if you define your custom loader under the `Drudgery::Loaders`
+ namespace:
+
+ ```ruby
+ module Drudgery
+   module Loaders
+     class ArrayLoader
+       def initialize(destination)
+         @destination = destination
+       end
+
+       def load(records)
+         @destination.push(*records)
+       end
+     end
+   end
+ end
+
+ destination = []
+
+ m = Drudgery::Manager.new
+
+ m.prepare do |job|
+   job.extract :csv, 'source.csv'
+   job.load :array, destination
+ end
+ ```
+
+ Contributing
+ ------------
+
+ Pull requests are welcome. Just make sure to include tests!
+
+ To run tests, install some dependencies:
+
+ ```bash
+ bundle install
+ ```
+
+ Then, run tests with:
+
+ ```bash
+ rake test
+ ```
+
+ Or, if you want to check coverage:
+
+ ```bash
+ COVERAGE=true rake test
+ ```
+
+ Issues
+ ------
+
+ Please use GitHub's [issue tracker](http://github.com/jisraelsen/drudgery/issues).
+
+ Author
+ ------
+
+ [Jeremy Israelsen](http://github.com/jisraelsen)
@@ -0,0 +1,15 @@
+ module Drudgery
+   module Extractors
+     class ActiveRecordExtractor
+       def initialize(model)
+         @model = model
+       end
+
+       def extract
+         @model.find_each do |record|
+           yield record.attributes
+         end
+       end
+     end
+   end
+ end
@@ -0,0 +1,19 @@
+ require 'csv'
+
+ module Drudgery
+   module Extractors
+     class CSVExtractor
+       def initialize(filepath, options={})
+         @filepath = filepath
+         @options = { :headers => true }
+         @options.merge!(options)
+       end
+
+       def extract
+         CSV.foreach(@filepath, @options) do |row|
+           yield row.to_hash
+         end
+       end
+     end
+   end
+ end
@@ -0,0 +1,68 @@
+ module Drudgery
+   module Extractors
+     class SQLite3Extractor
+       def initialize(db, table)
+         @db = db
+         @db.results_as_hash = true
+         @db.type_translation = true
+
+         @table = table
+         @clauses = {}
+       end
+
+       def select(*expressions)
+         @clauses[:select] = expressions.join(', ')
+       end
+
+       def from(expression)
+         @clauses[:from] = expression
+       end
+
+       def joins(*clauses)
+         @clauses[:joins] = clauses
+       end
+
+       def where(condition)
+         @clauses[:where] = condition
+       end
+
+       def group(*expressions)
+         @clauses[:group] = expressions.join(', ')
+       end
+
+       def having(condition)
+         @clauses[:having] = condition
+       end
+
+       def order(*expressions)
+         @clauses[:order] = expressions.join(', ')
+       end
+
+       def extract
+         @db.execute(sql) do |row|
+           row.reject! { |key, value| key.kind_of?(Integer) }
+           yield row
+         end
+       end
+
+       private
+       def sql
+         clauses = [
+           "SELECT #{@clauses[:select] || '*'}",
+           "FROM #{@clauses[:from] || @table}"
+         ]
+
+         (@clauses[:joins] || []).each do |join|
+           clauses << join
+         end
+
+         clauses << "WHERE #{@clauses[:where]}" if @clauses[:where]
+         clauses << "GROUP BY #{@clauses[:group]}" if @clauses[:group]
+         clauses << "HAVING #{@clauses[:having]}" if @clauses[:having]
+         clauses << "ORDER BY #{@clauses[:order]}" if @clauses[:order]
+
+         clauses.join(' ')
+       end
+     end
+   end
+ end
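The private `#sql` method in the hunk above assembles the query from whichever clauses were set, in a fixed order. A minimal standalone sketch of that assembly follows; `build_select_sql` is a hypothetical helper written for illustration and is not part of Drudgery.

```ruby
# Standalone sketch of the clause ordering used by SQLite3Extractor#sql.
# build_select_sql is a hypothetical name; it takes the table and a hash of
# already-joined clause strings instead of reading instance variables.
def build_select_sql(table, clauses = {})
  parts = [
    "SELECT #{clauses[:select] || '*'}",   # default to every column
    "FROM #{clauses[:from] || table}"      # :from overrides the table name
  ]

  # JOIN clauses are passed through verbatim, in the order given.
  parts.concat(clauses[:joins] || [])

  parts << "WHERE #{clauses[:where]}"    if clauses[:where]
  parts << "GROUP BY #{clauses[:group]}" if clauses[:group]
  parts << "HAVING #{clauses[:having]}"  if clauses[:having]
  parts << "ORDER BY #{clauses[:order]}" if clauses[:order]

  parts.join(' ')
end

puts build_select_sql('addresses')
puts build_select_sql('addresses',
                      :select => 'name, zip',
                      :where  => "state_abbr LIKE 'A%'",
                      :order  => 'name')
```

Because every clause is interpolated into the string as-is, conditions built this way should only come from trusted input.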
@@ -0,0 +1,54 @@
+ module Drudgery
+   class Job
+     def initialize(options={})
+       @extractor = options[:extractor]
+       @loader = options[:loader]
+       @transformer = options[:transformer] || Drudgery::Transformer.new
+
+       @batch_size, @records = 1000, []
+     end
+
+     def batch_size(size)
+       @batch_size = size
+     end
+
+     def extract(type, *args)
+       @extractor = Drudgery::Extractors.instantiate(type, *args)
+     end
+
+     def transform(&processor)
+       @transformer.register(processor)
+     end
+
+     def load(type, *args)
+       @loader = Drudgery::Loaders.instantiate(type, *args)
+     end
+
+     def perform
+       extract_records do |record|
+         @records << record
+
+         if @records.size == @batch_size
+           load_records
+         end
+       end
+
+       load_records
+     end
+
+     private
+     def extract_records
+       @extractor.extract do |data|
+         record = @transformer.transform(data)
+         next if record.nil?
+
+         yield record
+       end
+     end
+
+     def load_records
+       @loader.load(@records)
+       @records.clear
+     end
+   end
+ end
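`Job#perform` above buffers records and flushes a batch whenever the buffer reaches `@batch_size`, then flushes once more at the end. As written, that final flush appears to run even when the buffer is empty (zero records, or a count that is an exact multiple of the batch size), and the loaders in this release that call `records.first.keys` would then fail on `nil`. The sketch below reproduces the loop in isolation and adds a guard; `each_batch` and `collect_batches` are hypothetical helpers, and the guard is an assumption about the intended behavior, not code from the gem.

```ruby
# Standalone sketch of the buffering loop in Job#perform.
# each_batch is a hypothetical helper, not part of Drudgery.
def each_batch(records, batch_size)
  buffer = []

  records.each do |record|
    buffer << record

    # Flush a full batch and start over, as Job#perform does.
    if buffer.size == batch_size
      yield buffer.dup
      buffer.clear
    end
  end

  # Assumed fix: only flush the remainder when something is buffered, so a
  # loader never receives an empty array.
  yield buffer.dup unless buffer.empty?
end

# Convenience wrapper (also hypothetical) that returns the batches as arrays.
def collect_batches(records, batch_size)
  batches = []
  each_batch(records, batch_size) { |batch| batches << batch }
  batches
end

p collect_batches([1, 2, 3, 4, 5], 2)
p collect_batches([1, 2, 3, 4], 2)
```

The same effect could be had inside the gem by skipping the final `load_records` call when `@records` is empty.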
@@ -0,0 +1,16 @@
+ module Drudgery
+   module Loaders
+     class ActiveRecordImportLoader
+       def initialize(model)
+         @model = model
+       end
+
+       def load(records)
+         columns = records.first.keys
+         values = records.map { |record| columns.map { |column| record[column] } }
+
+         @model.import(columns, values, :validate => false)
+       end
+     end
+   end
+ end
@@ -0,0 +1,15 @@
+ module Drudgery
+   module Loaders
+     class ActiveRecordLoader
+       def initialize(model)
+         @model = model
+       end
+
+       def load(records)
+         records.each do |record|
+           @model.new(record).save(:validate => false)
+         end
+       end
+     end
+   end
+ end
@@ -0,0 +1,29 @@
+ require 'csv'
+
+ module Drudgery
+   module Loaders
+     class CSVLoader
+       def initialize(filepath, options={})
+         @filepath = filepath
+         @options = options
+
+         @write_headers = true
+       end
+
+       def load(records)
+         columns = records.first.keys.sort { |a,b| a.to_s <=> b.to_s }
+
+         CSV.open(@filepath, 'a', @options) do |csv|
+           if @write_headers
+             csv << columns
+             @write_headers = false
+           end
+
+           records.each do |record|
+             csv << columns.map { |column| record[column] }
+           end
+         end
+       end
+     end
+   end
+ end
@@ -0,0 +1,25 @@
+ module Drudgery
+   module Loaders
+     class SQLite3Loader
+       def initialize(db, table)
+         @db = db
+         @table = table
+       end
+
+       def load(records)
+         columns = records.first.keys
+
+         @db.transaction do |db|
+           records.each do |record|
+             db.execute(sql(columns), columns.map { |column| record[column] })
+           end
+         end
+       end
+
+       private
+       def sql(columns)
+         "INSERT INTO #{@table} (#{columns.map { |column| column }.join(', ')}) VALUES (#{columns.map { |column| '?' }.join(', ')})"
+       end
+     end
+   end
+ end
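The `#sql` method of `SQLite3Loader` above builds one parameterized `INSERT` per batch: column names are interpolated, while values are bound through `?` placeholders. A small standalone sketch of that string construction follows; `build_insert_sql` is a hypothetical helper, not part of the gem.

```ruby
# Standalone sketch of the statement SQLite3Loader#sql produces.
# build_insert_sql is a hypothetical name used only for illustration.
def build_insert_sql(table, columns)
  # One '?' placeholder per column; the values are bound separately at
  # execute time rather than interpolated into the string.
  placeholders = Array.new(columns.size, '?').join(', ')

  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES (#{placeholders})"
end

puts build_insert_sql('addresses', [:name, :zip])
```

Note that the column list comes from the keys of the first record in the batch, so every record in a batch is assumed to share those keys.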
@@ -0,0 +1,17 @@
+ module Drudgery
+   class Manager
+     def initialize
+       @jobs = []
+     end
+
+     def prepare(job=Drudgery::Job.new)
+       yield job if block_given?
+
+       @jobs << job
+     end
+
+     def run
+       @jobs.each { |job| job.perform }
+     end
+   end
+ end
@@ -0,0 +1,30 @@
+ module Drudgery
+   class Transformer
+     def initialize
+       @processors = []
+       @cache = {}
+     end
+
+     def register(processor)
+       @processors << processor
+     end
+
+     def transform(data)
+       symbolize_keys!(data)
+
+       @processors.each do |processor|
+         data = processor.call(data, @cache)
+         break if data.nil?
+       end
+
+       data
+     end
+
+     private
+     def symbolize_keys!(data)
+       data.keys.each do |key|
+         data[(key.to_sym rescue key) || key] = data.delete(key)
+       end
+     end
+   end
+ end
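`Transformer#transform` above feeds each record through the registered processors in order, passing a cache hash that is shared across records, and stops as soon as a processor returns `nil` (which `Job` then treats as "skip this record"). The sketch below isolates that loop; `run_processors` and the three constants are hypothetical names chosen for illustration.

```ruby
# Standalone sketch of the processor chain in Transformer#transform.
# run_processors is a hypothetical helper, not part of Drudgery.
def run_processors(processors, data, cache)
  processors.each do |processor|
    data = processor.call(data, cache)
    break if data.nil?   # a nil result drops the record and stops the chain
  end

  data
end

# Drops records with an empty name by returning nil.
SKIP_BLANK = lambda { |data, _cache| data[:name].empty? ? nil : data }

# Counts the records that reached this step, using the shared cache.
COUNT_SEEN = lambda do |data, cache|
  cache[:seen] = cache.fetch(:seen, 0) + 1
  data
end

# Returns a new hash with the name upcased.
UPCASE_NAME = lambda { |data, _cache| data.merge(:name => data[:name].upcase) }

PROCESSORS = [SKIP_BLANK, COUNT_SEEN, UPCASE_NAME]

cache = {}
p run_processors(PROCESSORS, { :name => 'john' }, cache)
p run_processors(PROCESSORS, { :name => '' }, cache)
p cache
```

Because the chain short-circuits, processors registered after a filtering step never see the records it rejects.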
@@ -0,0 +1,3 @@
+ module Drudgery
+   VERSION = '0.0.1'
+ end
data/lib/drudgery.rb ADDED
@@ -0,0 +1,45 @@
+ require 'drudgery/version'
+ require 'drudgery/manager'
+ require 'drudgery/job'
+ require 'drudgery/transformer'
+
+ require 'drudgery/extractors/active_record_extractor'
+ require 'drudgery/extractors/csv_extractor'
+ require 'drudgery/extractors/sqlite3_extractor'
+
+ require 'drudgery/loaders/active_record_import_loader'
+ require 'drudgery/loaders/active_record_loader'
+ require 'drudgery/loaders/csv_loader'
+ require 'drudgery/loaders/sqlite3_loader'
+
+ module Drudgery
+   module Extractors
+     def self.instantiate(type, *args)
+       case type
+       when :csv
+         extractor = Drudgery::Extractors::CSVExtractor
+       when :sqlite3
+         extractor = Drudgery::Extractors::SQLite3Extractor
+       else
+         extractor = Drudgery::Extractors.const_get("#{type.to_s.split('_').map(&:capitalize).join}Extractor")
+       end
+
+       extractor.new(*args)
+     end
+   end
+
+   module Loaders
+     def self.instantiate(type, *args)
+       case type
+       when :csv
+         loader = Drudgery::Loaders::CSVLoader
+       when :sqlite3
+         loader = Drudgery::Loaders::SQLite3Loader
+       else
+         loader = Drudgery::Loaders.const_get("#{type.to_s.split('_').map(&:capitalize).join}Loader")
+       end
+
+       loader.new(*args)
+     end
+   end
+ end
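The two `instantiate` methods above turn a type symbol into a class name by capitalizing each underscore-separated part and appending a suffix, which is what lets `:array` find a user-defined `ArrayExtractor`. The sketch below shows the naming rule on its own, and why `:csv` and `:sqlite3` need explicit `when` branches; `component_class_name` is a hypothetical helper, not part of the gem.

```ruby
# Standalone sketch of the naming rule used in the else branches above.
# component_class_name is a hypothetical name used only for illustration.
def component_class_name(type, suffix)
  camelized = type.to_s.split('_').map(&:capitalize).join
  "#{camelized}#{suffix}"
end

puts component_class_name(:active_record_import, 'Loader')
puts component_class_name(:array, 'Extractor')

# The rule alone would not match the acronym-style class names, hence the
# dedicated branches for :csv and :sqlite3.
puts component_class_name(:csv, 'Extractor')
puts component_class_name(:sqlite3, 'Extractor')
```

A custom component therefore only needs to live in the right namespace and follow this naming rule to be picked up by its symbol.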