drudgery 0.0.1

data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2012 Jeremy Israelsen
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,306 @@
1
+ Drudgery [![Build Status](https://secure.travis-ci.org/jisraelsen/drudgery.png?branch=master)](http://travis-ci.org/jisraelsen/drudgery)
2
+ ========
3
+
4
+ A simple ETL library that supports the following sources/destinations:
5
+
6
+ * CSV and other delimited file formats (e.g. pipe, tab)
7
+ * SQLite3
8
+ * ActiveRecord (bulk insert support using activerecord-import)
9
+
10
+ Supported Rubies:
11
+
12
+ * Ruby 1.9.2, 1.9.3
13
+
14
+ Install
15
+ -------
16
+
17
+ Install the gem directly:
18
+
19
+ ```bash
20
+ gem install drudgery
21
+ ```
22
+
23
+ Or, add it to your Gemfile:
24
+
25
+ ```ruby
26
+ gem 'drudgery'
27
+ ```
28
+
29
+ And, if using the `:sqlite3` extractor or loader:
30
+
31
+ ```ruby
32
+ gem 'sqlite3', '~> 1.3'
33
+ ```
34
+
35
+ And, if using the `:active_record` extractor or loader:
36
+
37
+ ```ruby
38
+ gem 'activerecord', '~> 3.0'
39
+ ```
40
+
41
+ And, if using the `:active_record_import` loader:
42
+
43
+ ```ruby
44
+ gem 'activerecord-import', '>= 0.2.9'
45
+ ```
46
+
47
+ Usage
48
+ -----
49
+
50
+ Extracting from CSV and loading into ActiveRecord:
51
+
52
+ ```ruby
53
+ m = Drudgery::Manager.new
54
+
55
+ m.prepare do |job|
56
+ job.extract :csv, 'src/addresses.csv'
57
+
58
+ job.transform do |data|
59
+ first_name, last_name = data.delete(:name).split(' ')
60
+
61
+ data[:first_name] = first_name
62
+ data[:last_name] = last_name
63
+ data[:state] = data.delete(:state_abbr)
64
+
65
+ data
66
+ end
67
+
68
+ job.load :active_record, Address
69
+ end
70
+
71
+ m.run
72
+ ```
73
+
74
+ Extracting from SQLite3 and bulk loading into ActiveRecord:
75
+
76
+ ```ruby
77
+ db = SQLite3::Database.new('db.sqlite3')
78
+
79
+ m = Drudgery::Manager.new
80
+
81
+ m.prepare do |job|
82
+ job.batch_size 5000
83
+
84
+ job.extract :sqlite3, db, 'addresses' do |extractor|
85
+ extractor.select(
86
+ 'name',
87
+ 'street_address',
88
+ 'city',
89
+ 'state_abbr AS state',
90
+ 'zip'
91
+ )
92
+ extractor.where("state LIKE 'A%'")
93
+ extractor.order('name')
94
+ end
95
+
96
+ job.transform do |data|
97
+ first_name, last_name = data.delete(:name).split(' ')
98
+
99
+ data[:first_name] = first_name
100
+ data[:last_name] = last_name
101
+
102
+ data
103
+ end
104
+
105
+ job.load :active_record_import, Address
106
+ end
107
+
108
+ m.run
109
+ ```
110
+
111
+ Extractors
112
+ ----------
113
+
114
+ The following extractors are provided: `:csv`, `:sqlite3`, `:active_record`
115
+
116
+ You can use your own extractors if you would like. They need only
117
+ implement an `#extract` method that yields each record:
118
+
119
+ ```ruby
120
+ class ArrayExtractor
121
+ def initialize(source)
122
+ @source = source
123
+ end
124
+
125
+ def extract
126
+ @source.each do |record|
127
+ yield record
128
+ end
129
+ end
130
+ end
131
+
132
+ source = []
133
+
134
+ m = Drudgery::Manager.new
135
+ job = Drudgery::Job.new(:extractor => ArrayExtractor.new(source))
136
+
137
+ m.prepare(job) do |job|
138
+ job.load :csv, 'destination.csv'
139
+ end
140
+ ```
141
+
142
+ Or, if you define your custom extractor under the Drudgery::Extractors
143
+ namespace:
144
+
145
+ ```ruby
146
+ module Drudgery
147
+ module Extractors
148
+ class ArrayExtractor
149
+ def initialize(source)
150
+ @source = source
151
+ end
152
+
153
+ def extract
154
+ @source.each do |record|
155
+ yield record
156
+ end
157
+ end
158
+ end
159
+ end
160
+ end
161
+
162
+ source = []
163
+
164
+ m = Drudgery::Manager.new
165
+
166
+ m.prepare do |job|
167
+ job.extract :array, source
168
+ job.load :csv, 'destination.csv'
169
+ end
170
+ ```
171
+
172
+ Transformers
173
+ ------------
174
+
175
+ Drudgery comes with a basic Transformer class. It symbolizes the keys of
176
+ each record and allows you to register processors to process data. Registered
177
+ processors should implement a `#call` method and return a `Hash` or `nil`.
178
+
179
+ ```ruby
180
+ custom_processor = Proc.new do |data, cache|
181
+ data[:initials] = data[:name].split(' ').map { |part| part[0, 1].upcase }.join
182
+ data
183
+ end
184
+
185
+ transformer = Drudgery::Transformer.new
186
+ transformer.register(custom_processor)
187
+
188
+ transformer.transform({ :name => 'John Doe' }) # == { :name => 'John Doe', :initials => 'JD' }
189
+ ```
190
+
191
+ You could also implement your own transformer if you need more custom
192
+ processing power. If you inherit from `Drudgery::Transformer`, you need
193
+ only implement the `#transform` method that accepts a hash as an
194
+ argument and returns a `Hash` or `nil`.
195
+
196
+ ```ruby
197
+ class CustomTransformer < Drudgery::Transformer
198
+ def transform(data)
199
+ # do custom processing here
200
+ end
201
+ end
202
+
203
+ m = Drudgery::Manager.new
204
+ job = Drudgery::Job.new(:transformer => CustomTransformer.new)
205
+
206
+ m.prepare(job) do |job|
207
+ job.extract :csv, 'source.csv'
208
+ job.load :csv, 'destination.csv'
209
+ end
210
+ ```
211
+
212
+ Loaders
213
+ -------
214
+
215
+ The following loaders are provided:
216
+
217
+ * `:csv`
218
+ * `:sqlite3`
219
+ * `:active_record`
220
+ * `:active_record_import`
221
+
222
+ You can use your own loaders if you would like. They need only
223
+ implement a `#load` method that accepts an array of records as an
224
+ argument and then writes/inserts them to the destination.
225
+
226
+ ```ruby
227
+ class ArrayLoader
228
+ def initialize(destination)
229
+ @destination = destination
230
+ end
231
+
232
+ def load(records)
233
+ @destination.push(*records)
234
+ end
235
+ end
236
+
237
+ destination = []
238
+
239
+ m = Drudgery::Manager.new
240
+ job = Drudgery::Job.new(:loader => ArrayLoader.new(destination))
241
+
242
+ m.prepare(job) do |job|
243
+ job.extract :csv, 'source.csv'
244
+ end
245
+ ```
246
+
247
+ Or, if you define your custom loader under the Drudgery::Loaders
248
+ namespace:
249
+
250
+ ```ruby
251
+ module Drudgery
252
+ module Loaders
253
+ class ArrayLoader
254
+ def initialize(destination)
255
+ @destination = destination
256
+ end
257
+
258
+ def load(records)
259
+ @destination.push(*records)
260
+ end
261
+ end
262
+ end
263
+ end
264
+
265
+ destination = []
266
+
267
+ m = Drudgery::Manager.new
268
+
269
+ m.prepare do |job|
270
+ job.extract :csv, 'source.csv'
271
+ job.load :array, destination
272
+ end
273
+ ```
274
+
275
+ Contributing
276
+ ------------
277
+
278
+ Pull requests are welcome. Just make sure to include tests!
279
+
280
+ To run tests, install some dependencies:
281
+
282
+ ```bash
283
+ bundle install
284
+ ```
285
+
286
+ Then, run tests with:
287
+
288
+ ```bash
289
+ rake test
290
+ ```
291
+
292
+ Or, if you want to check coverage:
293
+
294
+ ```bash
295
+ COVERAGE=true rake test
296
+ ```
297
+
298
+ Issues
299
+ ------
300
+
301
+ Please use GitHub's [issue tracker](http://github.com/jisraelsen/drudgery/issues).
302
+
303
+ Author
304
+ ------
305
+
306
+ [Jeremy Israelsen](http://github.com/jisraelsen)
@@ -0,0 +1,15 @@
1
+ module Drudgery
2
+ module Extractors
3
+ class ActiveRecordExtractor
4
+ def initialize(model)
5
+ @model = model
6
+ end
7
+
8
+ def extract
9
+ @model.find_each do |record|
10
+ yield record.attributes
11
+ end
12
+ end
13
+ end
14
+ end
15
+ end
@@ -0,0 +1,19 @@
1
+ require 'csv'
2
+
3
+ module Drudgery
4
+ module Extractors
5
+ class CSVExtractor
6
+ def initialize(filepath, options={})
7
+ @filepath = filepath
8
+ @options = { :headers => true }
9
+ @options.merge!(options)
10
+ end
11
+
12
+ def extract
13
+ CSV.foreach(@filepath, @options) do |row|
14
+ yield row.to_hash
15
+ end
16
+ end
17
+ end
18
+ end
19
+ end
@@ -0,0 +1,68 @@
1
+ module Drudgery
2
+ module Extractors
3
+ class SQLite3Extractor
4
+ def initialize(db, table)
5
+ @db = db
6
+ @db.results_as_hash = true
7
+ @db.type_translation = true
8
+
9
+ @table = table
10
+ @clauses = {}
11
+ end
12
+
13
+ def select(*expressions)
14
+ @clauses[:select] = expressions.join(', ')
15
+ end
16
+
17
+ def from(expression)
18
+ @clauses[:from] = expression
19
+ end
20
+
21
+ def joins(*clauses)
22
+ @clauses[:joins] = clauses
23
+ end
24
+
25
+ def where(condition)
26
+ @clauses[:where] = condition
27
+ end
28
+
29
+ def group(*expressions)
30
+ @clauses[:group] = expressions.join(', ')
31
+ end
32
+
33
+ def having(condition)
34
+ @clauses[:having] = condition
35
+ end
36
+
37
+ def order(*expressions)
38
+ @clauses[:order] = expressions.join(', ')
39
+ end
40
+
41
+ def extract
42
+ @db.execute(sql) do |row|
43
+ row.reject! { |key, value| key.kind_of?(Integer) }
44
+ yield row
45
+ end
46
+ end
47
+
48
+ private
49
+ def sql
50
+ clauses = [
51
+ "SELECT #{@clauses[:select] || '*'}",
52
+ "FROM #{@clauses[:from] || @table}"
53
+ ]
54
+
55
+ (@clauses[:joins] || []).each do |join|
56
+ clauses << join
57
+ end
58
+
59
+ clauses << "WHERE #{@clauses[:where]}" if @clauses[:where]
60
+ clauses << "GROUP BY #{@clauses[:group]}" if @clauses[:group]
61
+ clauses << "HAVING #{@clauses[:having]}" if @clauses[:having]
62
+ clauses << "ORDER BY #{@clauses[:order]}" if @clauses[:order]
63
+
64
+ clauses.join(' ')
65
+ end
66
+ end
67
+ end
68
+ end
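To make the clause assembly in the SQLite3 extractor easier to follow, here is a minimal stand-alone sketch of the same idea. The `build_sql` helper and its arguments are hypothetical names used only for illustration; they are not part of the gem, and the sketch does not touch a database.

```ruby
# Hypothetical stand-alone helper mirroring the private #sql method above.
# Clauses are optional; SELECT and FROM fall back to '*' and the table name.
def build_sql(table, clauses = {})
  parts = [
    "SELECT #{clauses[:select] || '*'}",
    "FROM #{clauses[:from] || table}"
  ]

  # JOIN clauses are appended verbatim, in the order they were given.
  parts.concat(clauses[:joins] || [])

  parts << "WHERE #{clauses[:where]}"    if clauses[:where]
  parts << "GROUP BY #{clauses[:group]}" if clauses[:group]
  parts << "HAVING #{clauses[:having]}"  if clauses[:having]
  parts << "ORDER BY #{clauses[:order]}" if clauses[:order]

  parts.join(' ')
end

query = build_sql(
  'addresses',
  :select => ['name', 'zip'].join(', '),
  :where  => "state LIKE 'A%'",
  :order  => 'name'
)
puts query
# prints: SELECT name, zip FROM addresses WHERE state LIKE 'A%' ORDER BY name
```

Note that conditions are interpolated as raw strings, so callers of an extractor built this way would presumably need to ensure that any user-supplied values are escaped before they reach `where` or `having`.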
@@ -0,0 +1,54 @@
1
+ module Drudgery
2
+ class Job
3
+ def initialize(options={})
4
+ @extractor = options[:extractor]
5
+ @loader = options[:loader]
6
+ @transformer = options[:transformer] || Drudgery::Transformer.new
7
+
8
+ @batch_size, @records = 1000, []
9
+ end
10
+
11
+ def batch_size(size)
12
+ @batch_size = size
13
+ end
14
+
15
+ def extract(type, *args)
16
+ @extractor = Drudgery::Extractors.instantiate(type, *args)
17
+ end
18
+
19
+ def transform(&processor)
20
+ @transformer.register(processor)
21
+ end
22
+
23
+ def load(type, *args)
24
+ @loader = Drudgery::Loaders.instantiate(type, *args)
25
+ end
26
+
27
+ def perform
28
+ extract_records do |record|
29
+ @records << record
30
+
31
+ if @records.size == @batch_size
32
+ load_records
33
+ end
34
+ end
35
+
36
+ load_records
37
+ end
38
+
39
+ private
40
+ def extract_records
41
+ @extractor.extract do |data|
42
+ record = @transformer.transform(data)
43
+ next if record.nil?
44
+
45
+ yield record
46
+ end
47
+ end
48
+
49
+ def load_records
50
+ @loader.load(@records)
51
+ @records.clear
52
+ end
53
+ end
54
+ end
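The buffering in `Job#perform` can be illustrated with a small stand-alone sketch. `RecordingLoader` and `perform_in_batches` are hypothetical names invented for this example; they only re-create the loop shape shown above so the batch boundaries are visible.

```ruby
# Hypothetical loader that only records the size of each batch it receives.
class RecordingLoader
  attr_reader :batch_sizes

  def initialize
    @batch_sizes = []
  end

  def load(records)
    @batch_sizes << records.size
  end
end

# Stand-alone re-creation of the buffering loop in Job#perform.
def perform_in_batches(source, loader, batch_size)
  buffer = []

  source.each do |record|
    buffer << record

    if buffer.size == batch_size
      loader.load(buffer)
      buffer.clear
    end
  end

  # Final flush: runs unconditionally, even when the buffer is empty.
  loader.load(buffer)
  buffer.clear
end

odd = RecordingLoader.new
perform_in_batches((1..5).to_a, odd, 2)
p odd.batch_sizes  # => [2, 2, 1]

even = RecordingLoader.new
perform_in_batches((1..4).to_a, even, 2)
p even.batch_sizes # => [2, 2, 0]
```

The second run suggests an edge case worth keeping in mind: when the record count is an exact multiple of the batch size (or the source is empty), the final flush hands the loader an empty array. Loaders that call `records.first.keys`, as the CSV, SQLite3 and activerecord-import loaders in this diff do, would raise on such a batch. One possible guard, offered here only as a suggestion, is `loader.load(buffer) unless buffer.empty?`.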
@@ -0,0 +1,16 @@
1
+ module Drudgery
2
+ module Loaders
3
+ class ActiveRecordImportLoader
4
+ def initialize(model)
5
+ @model = model
6
+ end
7
+
8
+ def load(records)
9
+ columns = records.first.keys
10
+ values = records.map { |record| columns.map { |column| record[column] } }
11
+
12
+ @model.import(columns, values, :validate => false)
13
+ end
14
+ end
15
+ end
16
+ end
@@ -0,0 +1,15 @@
1
+ module Drudgery
2
+ module Loaders
3
+ class ActiveRecordLoader
4
+ def initialize(model)
5
+ @model = model
6
+ end
7
+
8
+ def load(records)
9
+ records.each do |record|
10
+ @model.new(record).save(:validate => false)
11
+ end
12
+ end
13
+ end
14
+ end
15
+ end
@@ -0,0 +1,29 @@
1
+ require 'csv'
2
+
3
+ module Drudgery
4
+ module Loaders
5
+ class CSVLoader
6
+ def initialize(filepath, options={})
7
+ @filepath = filepath
8
+ @options = options
9
+
10
+ @write_headers = true
11
+ end
12
+
13
+ def load(records)
14
+ columns = records.first.keys.sort { |a,b| a.to_s <=> b.to_s }
15
+
16
+ CSV.open(@filepath, 'a', @options) do |csv|
17
+ if @write_headers
18
+ csv << columns
19
+ @write_headers = false
20
+ end
21
+
22
+ records.each do |record|
23
+ csv << columns.map { |column| record[column] }
24
+ end
25
+ end
26
+ end
27
+ end
28
+ end
29
+ end
@@ -0,0 +1,25 @@
1
+ module Drudgery
2
+ module Loaders
3
+ class SQLite3Loader
4
+ def initialize(db, table)
5
+ @db = db
6
+ @table = table
7
+ end
8
+
9
+ def load(records)
10
+ columns = records.first.keys
11
+
12
+ @db.transaction do |db|
13
+ records.each do |record|
14
+ db.execute(sql(columns), columns.map { |column| record[column] })
15
+ end
16
+ end
17
+ end
18
+
19
+ private
20
+ def sql(columns)
21
+ "INSERT INTO #{@table} (#{columns.map { |column| column }.join(', ')}) VALUES (#{columns.map { |column| '?' }.join(', ')})"
22
+ end
23
+ end
24
+ end
25
+ end
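The parameterized INSERT statement built by the SQLite3 loader can be sketched on its own as follows. `insert_sql` is a hypothetical free-standing version of the private `#sql` method; the identity `map` over the column names is dropped here because it does not change the result.

```ruby
# Hypothetical stand-alone version of the loader's private #sql method.
# Column names are interpolated; values are left as '?' placeholders so the
# database driver can bind them.
def insert_sql(table, columns)
  placeholders = columns.map { '?' }.join(', ')
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES (#{placeholders})"
end

statement = insert_sql('addresses', [:name, :zip])
puts statement
# prints: INSERT INTO addresses (name, zip) VALUES (?, ?)
```

Because the column list is taken from the first record of each batch, this approach appears to assume that every record in a batch has the same keys.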
@@ -0,0 +1,17 @@
1
+ module Drudgery
2
+ class Manager
3
+ def initialize
4
+ @jobs = []
5
+ end
6
+
7
+ def prepare(job=Drudgery::Job.new)
8
+ yield job if block_given?
9
+
10
+ @jobs << job
11
+ end
12
+
13
+ def run
14
+ @jobs.each { |job| job.perform }
15
+ end
16
+ end
17
+ end
@@ -0,0 +1,30 @@
1
+ module Drudgery
2
+ class Transformer
3
+ def initialize
4
+ @processors = []
5
+ @cache = {}
6
+ end
7
+
8
+ def register(processor)
9
+ @processors << processor
10
+ end
11
+
12
+ def transform(data)
13
+ symbolize_keys!(data)
14
+
15
+ @processors.each do |processor|
16
+ data = processor.call(data, @cache)
17
+ break if data.nil?
18
+ end
19
+
20
+ data
21
+ end
22
+
23
+ private
24
+ def symbolize_keys!(data)
25
+ data.keys.each do |key|
26
+ data[(key.to_sym rescue key) || key] = data.delete(key)
27
+ end
28
+ end
29
+ end
30
+ end
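The processor chain in `Transformer#transform` can be illustrated without the gem. In the sketch below, `run_processors` and the three lambdas are hypothetical names made up for the example; the loop itself follows the same shape as the method above, including the early exit when a processor returns `nil`.

```ruby
# Stand-alone sketch of the processor chain (hypothetical helper, not part of the gem).
# Each processor receives the record and a shared cache, and returns a Hash or nil.
def run_processors(processors, data, cache = {})
  processors.each do |processor|
    data = processor.call(data, cache)
    break if data.nil? # a nil result drops the record and skips later processors
  end

  data
end

skip_blank  = lambda { |data, cache| data[:name].empty? ? nil : data }
upcase_name = lambda { |data, cache| data.merge(:name => data[:name].upcase) }
count_seen  = lambda do |data, cache|
  cache[:seen] = cache.fetch(:seen, 0) + 1
  data
end

cache = {}
chain = [skip_blank, upcase_name, count_seen]

kept    = run_processors(chain, { :name => 'john' }, cache)
dropped = run_processors(chain, { :name => '' }, cache)

p kept         # => {:name=>"JOHN"} (exact inspect format depends on the Ruby version)
p dropped      # => nil
p cache[:seen] # => 1
```

The shared cache persists across records, so a processor can use it for things like counters or memoized lookups. Since `Job` skips any record for which the transformer returns `nil`, returning `nil` from a processor works as a simple filter.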
@@ -0,0 +1,3 @@
1
+ module Drudgery
2
+ VERSION = '0.0.1'
3
+ end
data/lib/drudgery.rb ADDED
@@ -0,0 +1,45 @@
1
+ require 'drudgery/version'
2
+ require 'drudgery/manager'
3
+ require 'drudgery/job'
4
+ require 'drudgery/transformer'
5
+
6
+ require 'drudgery/extractors/active_record_extractor'
7
+ require 'drudgery/extractors/csv_extractor'
8
+ require 'drudgery/extractors/sqlite3_extractor'
9
+
10
+ require 'drudgery/loaders/active_record_import_loader'
11
+ require 'drudgery/loaders/active_record_loader'
12
+ require 'drudgery/loaders/csv_loader'
13
+ require 'drudgery/loaders/sqlite3_loader'
14
+
15
+ module Drudgery
16
+ module Extractors
17
+ def self.instantiate(type, *args)
18
+ case type
19
+ when :csv
20
+ extractor = Drudgery::Extractors::CSVExtractor
21
+ when :sqlite3
22
+ extractor = Drudgery::Extractors::SQLite3Extractor
23
+ else
24
+ extractor = Drudgery::Extractors.const_get("#{type.to_s.split('_').map(&:capitalize).join}Extractor")
25
+ end
26
+
27
+ extractor.new(*args)
28
+ end
29
+ end
30
+
31
+ module Loaders
32
+ def self.instantiate(type, *args)
33
+ case type
34
+ when :csv
35
+ loader = Drudgery::Loaders::CSVLoader
36
+ when :sqlite3
37
+ loader = Drudgery::Loaders::SQLite3Loader
38
+ else
39
+ loader = Drudgery::Loaders.const_get("#{type.to_s.split('_').map(&:capitalize).join}Loader")
40
+ end
41
+
42
+ loader.new(*args)
43
+ end
44
+ end
45
+ end
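The fallback branch of `instantiate` derives a class name from the type symbol. The sketch below isolates that step; `class_name_for` and the `Plugins` module are hypothetical names used only to show the convention, and no Drudgery classes are loaded.

```ruby
# Hypothetical helper reproducing the name derivation used in the else branches above.
def class_name_for(type, suffix)
  "#{type.to_s.split('_').map(&:capitalize).join}#{suffix}"
end

puts class_name_for(:active_record_import, 'Loader') # prints: ActiveRecordImportLoader
puts class_name_for(:array, 'Extractor')             # prints: ArrayExtractor

# String#capitalize lowercases everything after the first character, which is
# presumably why :csv and :sqlite3 are special-cased in the case statements.
puts class_name_for(:csv, 'Extractor')  # prints: CsvExtractor
puts class_name_for(:sqlite3, 'Loader') # prints: Sqlite3Loader

# Resolving the derived name against a namespace, as const_get does above.
module Plugins
  class ArrayLoader; end
end

resolved = Plugins.const_get(class_name_for(:array, 'Loader'))
p resolved # => Plugins::ArrayLoader
```

Under this convention, a custom class defined in the matching namespace is picked up by its underscored symbol, while an unknown type would surface as a `NameError` from `const_get`.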