davidrichards-etl 0.0.4

data/README.rdoc ADDED
@@ -0,0 +1,261 @@
+ == ETL
+
+ Projects always need data. The standard way to get data is in three stages:
+
+ * Extract
+ * Transform
+ * Load
+
+ This is true for my analytical work (Tegu) and for any web application (Rails). The problems I am solving are:
+
+ * Dealing with ETL while I have it wrong (I tend to get it wrong a few times before I get all the edge cases worked out)
+ * Keeping the granularity of the data matched to the granularity of my models (running it the right number of times, combining the data into a time series or a single model)
+ * Logging my efforts
+ * Storing some utilities and shortcuts that let me reuse older work
+
+ The philosophy of this gem is to create a simple core utility and then offer other utilities that can be useful. For example, you may want to just run:
+
+   require 'rubygems'
+   require 'etl'
+
+ In this case, you could create a class like this:
+
+   class MyETL < ETL
+     def extract
+       # ...
+     end
+
+     def transform
+       # ...
+     end
+
+     def load
+       # ...
+     end
+   end
+
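+ You then run it the way the later examples in this README do. A minimal sketch, assuming nothing beyond the process and data calls shown further down:
+
+   @etl = MyETL.new
+   @etl.process
+   @etl.data  # => the transformed data
+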
+ Whatever you want to do at each stage, you just do it. Usually, you'll set @raw to whatever data has been extracted or transformed. After the after_extract and after_transform callbacks have finished, the contents of @raw are copied into @data. So, you could do something like this:
+
+   def extract
+     @raw = 1
+   end
+
+   def transform
+     @raw = @data + 1
+   end
+
+   def load
+     raise StandardError, "Data wasn't transformed right" unless @data == 2
+   end
+
+ Or, more interestingly:
+
+   class ValidatingETL < ETL
+     after_extract :validate_array_data
+     before_transform :extract_header
+     after_transform :validate_addresses
+
+     def extract
+       # ...
+     end
+
+     def transform
+       # Something using @data and @header
+     end
+
+     def load
+       # ...
+     end
+
+     def validate_array_data
+       raise ExtractError, "Could not get an array of arrays" unless
+         @raw.is_a?(Array) and not @raw.empty? and @raw.first.is_a?(Array)
+     end
+
+     def extract_header
+       @header = @raw.shift
+     end
+
+     def validate_addresses
+       raise TransformError, "Did not get valid addresses" unless ...
+     end
+   end
+
+ Notice that we have special errors, just to make the logs easier to use. They are:
+
+ * ExtractError
+ * TransformError
+ * LoadingError (because LoadError means something else)
+
+ Also notice that I just raise errors instead of trying to do anything special when things don't match. That is the whole purpose of ETL: to make it easy to re-run the script until it is right.
+
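+ For example, a minimal sketch of catching those errors while you iterate, reusing the MyETL class from above:
+
+   begin
+     MyETL.new.process
+   rescue ExtractError, TransformError, LoadingError => e
+     # Fix whatever e.message complains about, then re-run.
+   end
+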
+ === When Something Goes Wrong
+
+ If something goes wrong, you can address the issue and just restart it:
+
+   @etl = SomethingThatWillBreak.new
+   @etl.process(:with => :options)
+   # Errors are raised
+   # Fix what was wrong
+   @etl.process
+
+ The original options are still stored in @etl, so you don't need to resend them. If you send them back in, they will be ignored anyway. If the problem was that you set up the options wrong, you can write something like:
+
+   @etl.options[:with] = :better_options
+
+ Because of the nature of the code, the stages that passed won't be re-run. In fact, you can take a completed ETL object and call it all day long, and it will never try to restart itself. If you need it to restart, or you want to restart at an earlier stage, just do something like:
+
+   @etl.rewind_to :some_previous_state
+
+ States to choose from are:
+
+ * :before_extract
+ * :extract
+ * :after_extract
+ * :before_transform
+ * :transform
+ * :after_transform
+ * :before_load
+ * :load
+ * :after_load
+ * :complete
+
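+ So a full recovery cycle might look like this (a sketch, assuming the failure happened in the transform stage):
+
+   @etl = ValidatingETL.new
+   @etl.process                      # raises TransformError, say
+   # Fix the transform code or the options
+   @etl.rewind_to :before_transform
+   @etl.process                      # picks up again at the transform stage
+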
+ === Logs
+
+ The logs are pretty useful, I think. I use them pretty aggressively with these tools, because most of the value of ETL done well is the knowledge that it was actually done right. You should find the logs where you expect them to be; here's how I infer where to put them:
+
+ * If you're in a Rails app, and you've got etl in RAILS_ROOT/vendor/gems, then it will log in RAILS_ROOT/logs
+ * If you're in a Rails app, and you've got a class in RAILS_ROOT/lib that uses ETL, it will log in RAILS_ROOT/logs
+ * If you're in any app, and there is a direct subdirectory named log, the logs will be held there
+ * If you're in any app, and you set ETL.logger_root = '/some/directory', then the logs will be in /some/directory for all ETL processes
+ * If you're in any app, and you set SomeClass.logger_root = '/some/directory', then there will be a file /some/directory/SomeClass.log that holds the logs for that class, but all other classes will follow the rules noted above
+
+ Basically, all I'm saying is that you should be able to get the logs to where you need them. I don't make any effort in this code to consolidate logs. There is so much that goes into that, I'm going to let you set your own conventions. If you really want to consolidate things, you might want to look at syslog-ng or similar open-source tools.
+
+ There is also a log that goes to standard error for any error at WARN or above. I use Log4r here, so its rich logging environment can be used if you have more robust logging needs.
+
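+ As a sketch, with the directory names being arbitrary:
+
+   ETL.logger_root = '/var/log/my_app'    # all ETL classes log under here
+   MyCSV.logger_root = '/tmp/csv_logs'    # except MyCSV, which logs to /tmp/csv_logs/MyCSV.log
+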
+ I'd like to take some time and write some more interesting examples. For example, I have been playing with downloading financial data for analysis, and that will be interesting. There are a number of interesting data sources that I generally work with: the filesystem, iGTD, and BaseCamp data. Maybe I can get those written up and into the examples folder.
+
+ === Utilities
+
+ I'll probably keep extracting generalized utilities and putting them into this gem. For instance, with just a little creativity, I can mix open-uri with FasterCSV and get pretty decent access to remote CSV files:
+
+   require 'etl/csv_et'
+   class MyCSV < CSV::ET
+     def load
+       # Do something with the array of arrays in @data
+     end
+   end
+
+ First, notice that I called the class CSV::ET instead of CSV::ETL. That's because it doesn't implement a load method; you supply that. Also notice that I required etl/csv_et explicitly.
+
+ There are a number of utilities that I have yet to export out of other projects and bring into this gem. If you have some that you think are generally useful, please send them along as a patch, a concept, an email, or a pull request.
+
+ If you want to load all the utilities to play with them, you might want to use the etl command-line utility:
+
+   davidrichards: etl
+   Loading ETL version: 0.0.2
+   >> class MyCSV < CSV::ET
+   >>   def load
+   >>     puts @data.inspect
+   >>   end
+   >> end
+   => nil
+   >> mycsv = MyCSV.process(:source => '/tmp/test.csv')
+   [["this", " and", " that"], [1, 2, 3], [1, 3, 4], [1, 4, 5]]
+   => #<MyCSV:0x233915c @options={:source=>"/tmp/test.csv"}, @state=:complete, @raw=nil, @block=nil, @data=[["this", " and", " that"], [1, 2, 3], [1, 3, 4], [1, 4, 5]]>
+   >> mycsv.data
+   => [["this", " and", " that"], [1, 2, 3], [1, 3, 4], [1, 4, 5]]
+
+ Finally, beware the XML stuff for the moment. I don't think there's much there. I'm finishing a SCORM ETL process tonight or in the morning; then I can more likely bring in something that's actually useful.
+
+ == Buckets
+
+ I have a Bucket class that was created to assist with managing the granularity of an ETL process. That class is well documented in the specs. I am working on a project that will use the Bucket utility and the TimeBucket to get a regular time series of data from many sources, so look to that for some of the more exciting changes to this gem.
+
+ A basic example might be:
+
+   require 'etl/bucket'
+   b = Bucket.new(:this => 1, :that => 2)
+   b.white_list = [:this]
+   b.filtered_data
+   # => {:this => 1}
+   b.add :this => 2, :something => :ignored
+   b.raw_data
+   # => {:this => 2}
+   b.filtered_data
+   # => {:this => 2}
+   b.white_list = [:this, :that]
+   b.filtered_data
+   # => {:this => 2}
+   b.add(:that => 2)
+   b.filtered_data
+   # => {:this => 2, :that => 2}
+   b.dump
+   # => {:this => 2, :that => 2}
+   b.filtered_data
+   # => {}
+   b.add :this => 1
+   b.add :that => 2
+   b.to_a
+   # => [1, 2]
+   b.to_open_struct
+   S = Struct.new(:this, :that)
+   b.to_struct(S)
+   class A
+     attr_reader :values
+     def initialize(*args)
+       @values = args
+     end
+   end
+   a = b.to_obj(A)
+   a.values
+   # => [1, 2]
+   b.filter_block = lambda{|hash| :not_a_useful_block}
+   b.dump
+   # => :not_a_useful_block
+
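+ One way this can fit into an ETL process is as a consolidation step during transform. A sketch, where ConsolidatingETL and its stand-in sources are hypothetical, not part of the gem:
+
+   require 'etl'
+   require 'etl/bucket'
+
+   class ConsolidatingETL < ETL
+     def extract
+       @raw = [{:this => 1}, {:that => 2}]  # stand-in for several partial sources
+     end
+
+     def transform
+       bucket = Bucket.new
+       bucket.white_list = [:this, :that]
+       @data.each { |partial| bucket.add(partial) }
+       @raw = bucket.dump  # one consolidated record
+     end
+
+     def load
+       # Do something with the consolidated hash in @data
+     end
+   end
+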
+ == Installation
+
+   sudo gem install davidrichards-etl
+
+ === Dependencies
+
+ The core library formally requires:
+
+ * activesupport, for callback support
+ * ostruct, for using OpenStruct to contain the data
+ * log4r, for logging
+ * fileutils, for file system management
+
+ Optionally, if you have these installed, they will be made available:
+
+ * tegu_gears, for composing ETL with other analytical work or distributing the ETL process
+ * data_frame, for more easily understanding your data as a named data grid
+ * babel_icious, for munging XML and hashes with some very useful transformation tools
+
+ The various ETL implementations will each have their own dependencies. The Bucket utility class, for instance, requires that the facets gem is installed (sudo gem install facets). I'll document those dependencies as I fill in that part of the gem.
+
+ == ActiveWarehouse-ETL
+
+ I really like some of what ActiveWarehouse-ETL does. If your target is a data warehouse, don't even start with ETL; start with ActiveWarehouse and ActiveWarehouse-ETL. There are a lot of tools you'd be re-creating with ETL that are available there for free. Some of the more general tools from ActiveWarehouse-ETL belong in this gem too, in our style of code. In particular:
+
+ * The XML and SAX support
+ * A table decoder
+ * String to date
+ * Webserver logs
+ * Date to string
+ * Time spans
+
+ == TODO
+
+ There are a number of things I'd like to do:
+
+ * Work out the TimeBucket implementation
+ * Get more ETL scripts gathered and well documented
+ * Integrate the ideas mentioned above from ActiveWarehouse-ETL
+ * Work out a better online-processing ETL, i.e., one that works on streaming data
+
+ == COPYRIGHT
+
+ Copyright (c) 2009 David Richards. See LICENSE for details.
data/VERSION.yml ADDED
@@ -0,0 +1,4 @@
+ ---
+ :major: 0
+ :minor: 0
+ :patch: 4
data/bin/etl ADDED
@@ -0,0 +1,27 @@
+ #!/usr/bin/env ruby -wKU
+ require 'yaml'
+
+ version_hash = YAML.load_file(File.join(File.dirname(__FILE__), %w(.. VERSION.yml)))
+ version = [version_hash[:major].to_s, version_hash[:minor].to_s, version_hash[:patch].to_s].join(".")
+ etl_file = File.join(File.dirname(__FILE__), %w(.. lib etl))
+ all = File.join(File.dirname(__FILE__), %w(.. lib all))
+
+ # Use irb.bat on Windows, plain irb everywhere else.
+ irb = RUBY_PLATFORM =~ /(?:mswin|mingw)/ ? 'irb.bat' : 'irb'
+
+ require 'optparse'
+ options = { :sandbox => false, :irb => irb, :without_stored_procedures => false }
+ OptionParser.new do |opt|
+   opt.banner = "Usage: console [environment] [options]"
+   opt.on("--irb=[#{irb}]", 'Invoke a different irb.') { |v| options[:irb] = v }
+   opt.parse!(ARGV)
+ end
+
+ libs = " -r irb/completion -r #{etl_file} -r #{all}"
+
+ puts "Loading ETL version: #{version}"
+
+ if options[:sandbox]
+   puts "I'll have to think about how the whole sandbox concept should work for the etl"
+ end
+
+ exec "#{options[:irb]} #{libs} --simple-prompt"
data/lib/all.rb ADDED
@@ -0,0 +1,4 @@
+ # Require every utility under lib/etl, skipping the core etl.rb itself.
+ Dir.glob("#{File.dirname(__FILE__)}/etl/*.rb").each do |file|
+   next if /etl.rb/ === file
+   require file
+ end
@@ -0,0 +1,50 @@
+ # # This is a base class that uses ETL and Zach Dennis' excellent ar-
+ # # extensions gem. To get the gem, just:
+ # #
+ # #   sudo gem install ar-extensions
+ # #
+ # # See:
+ # # http://www.igvita.com/2007/07/11/efficient-updates-data-import-in-rails/
+ # # http://agilewebdevelopment.com/plugins/ar_extensions
+ # # http://www.continuousthinking.com/
+ #
+ # # To use this, 1) set up an extract to find the data, and 2) a transform to
+ # # create an array of arrays, with the first array as the header. The
+ # # header and data should only contain values in the table to be imported.
+ # # The data_frame gem (sudo gem install davidrichards-data_frame)
+ # # may make the transform a LOT easier to do if there is a lot of column
+ # # munging to do. Chris Wycoff's babel_icious gem will go a long way in
+ # # the transform if you have XML data you are working with
+ # # (sudo gem install cwycoff-babel_icious).
+ #
+ # class ActiveRecordLoader < ETL
+ #
+ #   after_transform :ensure_array_of_arrays
+ #   before_load :ensure_class_defined
+ #   before_load :assert_header
+ #
+ #   protected
+ #
+ #   def ensure_array_of_arrays
+ #     # Not 100% sure whether I process raw_data before or after this method. I think before.
+ #     data = @raw || @data
+ #     raise ArgumentError,
+ #       "Expecting transformed data to be an array of arrays" unless
+ #       data.is_a?(Array) and data.first.is_a?(Array) and data.last.is_a?(Array)
+ #   end
+ #
+ #   def assert_header
+ #     @header ||= @data.shift
+ #     @header.symbolize_values!
+ #   end
+ #
+ #   def ensure_class_defined
+ #     raise ArgumentError,
+ #       "Must provide a class to import to. Try #{self.class}.instance.options[:class] = ModelClassName" unless
+ #       self.options[:class]
+ #   end
+ #
+ #   def load
+ #     options[:class].import(@header, @data)
+ #   end
+ # end
data/lib/etl/bucket.rb ADDED
@@ -0,0 +1,148 @@
+ require 'facets/dictionary'
+
+ # Sometimes I have data coming from several sources. I want to combine
+ # the sources and release a consolidated record. This is meant to work
+ # like that. For a weird example:
+ #   >> my_hash = {:surprise => 'me'}
+ #   => {:surprise=>"me"}
+ #   >> b = Bucket.new(my_hash) {|h| h.inject({}) {|hsh, e| hsh[e.first] = e.last % 3; hsh}}
+ #   => #<Bucket:0x232d230 @raw_data={:surprise=>"me"}, @filter_block=#<Proc:0x0232d26c@(irb):2>>
+ #   >> b.add :this => 1
+ #   => {:surprise=>"me", :this=>1}
+ #   >> b.add OpenStruct.new(:this => 6)
+ #   => {:surprise=>"me", :this=>6}
+ #   >> b.raw_data
+ #   => {:surprise=>"me", :this=>6}
+ #   >> b.filtered_data
+ #   => {:surprise=>"me", :this=>0}
+ #   >> b.dump
+ #   => {:surprise=>"me", :this=>0}
+ #   >> b.raw_data
+ #   => {}
+ # A more practical use that I have for this is with screen scraping:
+ # when I'm getting the source of some concept, I may ask the same site
+ # for information at different times, or ask complementary sites for
+ # overlaying data. A much more practical use of this is with the
+ # TimeBucket. That is a bucket that creates a time series from
+ # observations that may be on very different time schedules.
+ class Bucket
+
+   # The block used to filter the bucket. Useful for converting the data
+   # to a different data type.
+   # Examples:
+   #   Return a hash
+   #     b.filter_block = lambda{|o| o.table}
+   #   Return an array
+   #     b.filter_block = lambda{|o| o.table.values}
+   attr_accessor :filter_block
+
+   # The data in the bucket, as a Hash
+   attr_reader :raw_data
+
+   def initialize(obj=nil, &block)
+     @filter_block = block
+     reset_bucket
+     assert_object(obj) if obj
+   end
+
+   def add(obj)
+     assert_object(obj)
+   end
+
+   # Returns the filtered data and empties the bucket.
+   def dump
+     data = self.raw_data
+     reset_bucket
+     filter(data)
+   end
+
+   def filtered_data
+     filter(self.raw_data)
+   end
+
+   # Uses the facets/dictionary to deliver an ordered hash, in the order of
+   # the white list.
+   def ordered_data
+     return self.raw_data unless self.white_list
+     dictionary = Dictionary.new
+     self.white_list.each do |k|
+       dictionary[k] = self.raw_data[k]
+     end
+     dictionary
+   end
+
+   def to_a
+     self.ordered_data.values
+   end
+   alias :to_array :to_a
+
+   alias :to_hash :raw_data
+
+   # Initializes a class with the values of the raw data. Good for structs
+   # and struct-like classes.
+   def to_obj(klass, use_hash=false)
+     use_hash ? klass.new(self.raw_data) : klass.new(*self.raw_data.values)
+   end
+   alias :to_struct :to_obj
+
+   def to_open_struct
+     OpenStruct.new(self.raw_data)
+   end
+
+   # Reveals the white list. If this is set, it is an array, and it not
+   # only filters the data in the bucket, but also orders it.
+   attr_reader :white_list
+   alias :labels :white_list
+
+   # Sets the white list, if it's an array. Filters the raw data, in case
+   # there are illegal values in there.
+   def white_list=(value)
+     raise ArgumentError, "Must provide an array" unless value.is_a?(Array)
+     @white_list = value
+     @raw_data = filter_input(self.raw_data)
+   end
+   alias :labels= :white_list=
+
+   protected
+
+   # Filters the input through the filter_block. Use filtered_data if you
+   # just want the data in the bucket.
+   def filter(data)
+     self.filter_block ? self.filter_block.call(data) : data
+   end
+
+   def reset_bucket
+     @raw_data = Hash.new
+   end
+
+   # Merges a Hash, OpenStruct, or Struct into the bucket, applying the
+   # white list on the way in.
+   def assert_object(obj)
+     case obj
+     when Hash
+       obj = filter_input(obj)
+       self.raw_data.merge!(obj)
+     when OpenStruct
+       obj = filter_input(obj.table)
+       self.raw_data.merge!(obj)
+     when Struct
+       obj.each_pair do |k, v|
+         if self.white_list
+           self.raw_data[k] = v if self.white_list.include?(k)
+         else
+           self.raw_data[k] = v
+         end
+       end
+     else
+       raise ArgumentError, "Don't know how to use this data"
+     end
+   end
+
+   def filter_input(hash)
+     if self.white_list
+       hash = hash.inject({}) do |h, e|
+         h[e.first] = e.last if self.white_list.include?(e.first)
+         h
+       end
+     end
+     hash
+   end
+
+ end