davidrichards-etl 0.0.4
- data/README.rdoc +261 -0
- data/VERSION.yml +4 -0
- data/bin/etl +27 -0
- data/lib/all.rb +4 -0
- data/lib/etl/active_record_loader.rb +50 -0
- data/lib/etl/bucket.rb +148 -0
- data/lib/etl/csv_et.rb +64 -0
- data/lib/etl/etl.rb +273 -0
- data/lib/etl/time_bucket.rb +104 -0
- data/lib/etl/xml_et.rb +6 -0
- data/lib/etl.rb +36 -0
- data/lib/helpers/array.rb +11 -0
- data/lib/helpers/observation.rb +10 -0
- data/lib/helpers/open_struct.rb +18 -0
- data/lib/helpers/string.rb +6 -0
- data/lib/helpers/symbol.rb +6 -0
- data/spec/etl/bucket_spec.rb +112 -0
- data/spec/etl/csv_et_spec.rb +43 -0
- data/spec/etl/etl_spec.rb +237 -0
- data/spec/etl/xml_et_spec.rb +50 -0
- data/spec/etl_spec.rb +16 -0
- data/spec/fixtures/test_file.csv +3 -0
- data/spec/helpers/array_spec.rb +13 -0
- data/spec/helpers/observation_spec.rb +22 -0
- data/spec/helpers/open_struct_spec.rb +25 -0
- data/spec/helpers/string_spec.rb +8 -0
- data/spec/helpers/symbol_spec.rb +7 -0
- data/spec/spec_helper.rb +15 -0
- metadata +106 -0
data/README.rdoc
ADDED
@@ -0,0 +1,261 @@
== ETL

Projects always need data. The standard way to get data is in three stages:

* Extract
* Transform
* Load

So, for me, this is true for my analytical work (Tegu) and any web application (Rails). The problems I am solving are:

* Dealing with ETL while I have it wrong (I tend to get it wrong a few times before I get all the edge cases worked out)
* Keeping the granularity of the data matched to the granularity of my models (running it the right number of times, combining the data into a time series or a single model)
* Logging my efforts
* Storing some utilities and shortcuts to allow me to reuse older work

The philosophy of this gem is to create a simple core utility and then offer other utilities that can be useful. For example, you may want to just run:

  require 'rubygems'
  require 'etl'

In this case, you could create a class like this:

  class MyETL < ETL
    def extract
      # ...
    end

    def transform
      # ...
    end

    def load
      # ...
    end
  end

Whatever you want to do at every stage, you just do it. Usually, you'll set @raw to whatever data has been extracted or transformed. After the after_extract and after_transform callbacks are finished, the contents of @raw are copied to @data. So, you could do something like this:

  def extract
    @raw = 1
  end

  def transform
    @raw = @data + 1
  end

  def load
    raise StandardError, "Data wasn't transformed right" unless @data == 2
  end

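The @raw-to-@data hand-off can be sketched with a minimal stand-in (PipelineSketch and its promote helper are hypothetical, mimicking the promotion the callbacks perform; they are not the gem's internals):

```ruby
# Minimal stand-in for the ETL stage flow: after each stage (and its
# callbacks), whatever is in @raw is promoted to @data.
class PipelineSketch
  attr_reader :data

  def process
    extract;   promote
    transform; promote
    load
    self
  end

  def extract
    @raw = 1
  end

  def transform
    @raw = @data + 1
  end

  def load
    raise StandardError, "Data wasn't transformed right" unless @data == 2
  end

  protected

  # Stand-in for what happens after the after_extract/after_transform
  # callbacks finish.
  def promote
    @data = @raw
  end
end
```

Running PipelineSketch.new.process walks all three stages and leaves 2 in data.
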
Or, more interestingly:

  class ValidatingETL < ETL
    after_extract :validate_array_data
    before_transform :extract_header
    after_transform :validate_addresses

    def extract
      # ...
    end

    def transform
      # Something using @data and @header
    end

    def load
      # ...
    end

    def validate_array_data
      raise ExtractError, "Could not get an array of arrays" unless
        @raw.is_a?(Array) and not @raw.empty? and @raw.first.is_a?(Array)
    end

    def extract_header
      @header = @raw.shift
    end

    def validate_addresses
      raise TransformError, "Did not get valid addresses" unless ...
    end
  end

Notice how we have special errors, just to make the logs easier to use. They are:

* ExtractError
* TransformError
* LoadingError (because LoadError means something else in Ruby)

Also notice that I just raise errors instead of trying to do anything special when things don't match. This is the whole purpose of ETL: to make it easy to re-run the script until it is right.

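A sketch of how such stage-specific errors can be defined and surfaced (these class bodies and the standalone validate_array_data helper are assumptions for illustration; the gem ships its own definitions):

```ruby
# Hypothetical definitions; the gem provides its own versions of these.
class ExtractError   < StandardError; end
class TransformError < StandardError; end
class LoadingError   < StandardError; end  # LoadError is taken by Ruby itself

# A validation in the style of the callbacks above, written as a free
# function so it can run outside the gem.
def validate_array_data(raw)
  raise ExtractError, "Could not get an array of arrays" unless
    raw.is_a?(Array) && !raw.empty? && raw.first.is_a?(Array)
  raw
end

begin
  validate_array_data("not an array")
rescue ExtractError => e
  warn "extract failed: #{e.message}"  # would go to the log in the real gem
end
```

Because each stage has its own error class, a grep of the logs tells you immediately which stage to re-run.
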
=== When Something Goes Wrong

If something goes wrong, you can address the issue and just restart the process:

  @etl = SomethingThatWillBreak.new
  @etl.process(:with => :options)
  # Errors are raised
  # Fix what was wrong
  @etl.process

The original options are still stored in @etl, so you don't need to resend them. If you send them back in, they will be ignored anyway. If the problem was that you set up the options wrong, you can write something like:

  @etl.options[:with] = :better_options

Because of the nature of the code, the stages that passed won't be re-run. In fact, you can take a completed ETL object and call it all day long, and it will never try to restart itself. If you need it to restart, or you want to restart at an earlier stage, just do something like:

  @etl.rewind_to :some_previous_state

States to choose from are:

* :before_extract
* :extract
* :after_extract
* :before_transform
* :transform
* :after_transform
* :before_load
* :load
* :after_load
* :complete

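The skip-what-already-passed behavior amounts to a cursor over the state list above. A stand-in sketch (StageSketch is hypothetical, not the gem's implementation):

```ruby
# Stand-in for the resumable state machine: process runs only the
# stages after the current state, and rewind_to moves the cursor back
# so those stages run again.
class StageSketch
  STATES = [:before_extract, :extract, :after_extract,
            :before_transform, :transform, :after_transform,
            :before_load, :load, :after_load, :complete]

  attr_reader :state, :runs

  def initialize
    @state = :before_extract
    @runs  = []
  end

  def process
    while @state != :complete
      index = STATES.index(@state)
      @runs << @state                 # record the stage that just ran
      @state = STATES[index + 1]
    end
    self
  end

  def rewind_to(state)
    raise ArgumentError, "Unknown state" unless STATES.include?(state)
    @state = state
  end
end
```

Calling process on a completed object is a no-op; after rewind_to :transform, only the transform-and-later stages run again.
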
=== Logs

The logs are pretty useful, I think. I use the log aggressively when I use these tools, because most of the value of ETL done well is the knowledge that it was actually done right. You should find the logs where you expect them to be; here's how I infer where to put them:

* If you're in a Rails app, and you've got etl in RAILS_ROOT/vendor/gems, then it will log in RAILS_ROOT/logs
* If you're in a Rails app, and you've got a class in RAILS_ROOT/lib that uses ETL, it will log in RAILS_ROOT/logs
* If you're in any app, and there is a direct subdirectory named log, the log will be held there
* If you're in any app, and you set ETL.logger_root = '/some/directory', then the logs will be in /some/directory for all ETL processes.
* If you're in any app, and you set SomeClass.logger_root = '/some/directory', then there will be a file /some/directory/SomeClass.log that holds the logs for that class, but all other classes will follow the rules noted above.

Basically, all I'm saying is that you should be able to get the logs to where you need them. I don't make any effort in this code to consolidate logs. There is so much that goes on with that, I'm going to let you set your own conventions. If you really want to consolidate things, you might want to look at syslog-ng or similar open-source tools.

There is also a log that goes to standard error for any error at WARN or above. I use Log4r here, so that very rich logging environment can be used if you have more robust logging needs.

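The Log4r setup itself lives inside the gem. As a rough stdlib stand-in, per-class file logging with WARN-and-above mirrored to standard error looks something like this (the paths, the MyETL.log name, and the use of Logger instead of Log4r are all illustrative assumptions):

```ruby
require 'logger'
require 'tmpdir'

# Illustrative only: the gem uses Log4r and infers logger_root on its
# own. This mimics the observable behavior with the stdlib Logger:
# one file per class under a root directory.
logger_root = Dir.mktmpdir                  # stand-in for ETL.logger_root
log_path    = File.join(logger_root, 'MyETL.log')

file_log = Logger.new(log_path)
file_log.level = Logger::INFO

stderr_log = Logger.new($stderr)
stderr_log.level = Logger::WARN             # only WARN and above reach stderr

file_log.info   'extract complete'          # file only
file_log.warn   'load retried'              # file ...
stderr_log.warn 'load retried'              # ... and stderr
```
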
I'd like to take some time and write some more interesting examples. For example, I have been playing with downloading financial data for analysis, and that will be interesting. There are a number of interesting data sources that I generally work with: the filesystem, iGTD, and BaseCamp data. Maybe I can get those written up and into the examples folder.

=== Utilities

I'll probably keep extracting generalized utilities and putting them into this gem. For instance, with just a little creativity, I can mix open-uri with FasterCSV and get pretty decent access to remote CSV files:

  require 'etl/csv_et'
  class MyCSV < CSV::ET
    def load
      # Do something with the array of arrays in @data
    end
  end

First, notice that the base class is called CSV::ET, instead of CSV::ETL. That's because it doesn't implement any load method; that part is up to you. Also, notice that I required etl/csv_et explicitly.

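The open-uri + FasterCSV combination can be sketched without the gem: FasterCSV's API lives on as the standard csv library in modern Ruby, and open-uri supplies the remote read. Here a string literal stands in for the fetched file, and note that the real class also type-converts values, which this sketch does not:

```ruby
require 'csv'
# require 'open-uri'   # in real use: csv_text = URI.open(source).read

# Stand-in for a fetched remote CSV file.
csv_text = "this, and, that\n1,2,3\n1,3,4\n"

# Parse into an array of arrays, which is what @data holds in CSV::ET.
rows    = CSV.parse(csv_text)
header  = rows.first      # the first row is the header
records = rows[1..-1]     # everything else is data (still strings here)
```
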
There are a number of utilities that I have yet to export out of other projects and bring into this gem. If you have some that you think are generally useful, please send them along as a patch, a concept, an email, or a pull request.

If you want to load all the utilities to play with them, you might want to use the etl command-line utility:

  davidrichards: etl
  Loading ETL version: 0.0.2
  >> class MyCSV < CSV::ET
  >>   def load
  >>     puts @data.inspect
  >>   end
  >> end
  => nil
  >> mycsv = MyCSV.process(:source => '/tmp/test.csv')
  [["this", " and", " that"], [1, 2, 3], [1, 3, 4], [1, 4, 5]]
  => #<MyCSV:0x233915c @options={:source=>"/tmp/test.csv"}, @state=:complete, @raw=nil, @block=nil, @data=[["this", " and", " that"], [1, 2, 3], [1, 3, 4], [1, 4, 5]]>
  >> mycsv.data
  => [["this", " and", " that"], [1, 2, 3], [1, 3, 4], [1, 4, 5]]

Finally, beware the XML stuff for the moment. I don't think there's much there. I'm finishing a SCORM ETL process tonight or in the morning; then I can more likely bring in something that's actually useful.

== Buckets

I have a Bucket class that was created to assist with managing the granularity of an ETL process. That class is well documented in the specs. I am working on a project that will use the Bucket utility and the TimeBucket to get a regular time series of data from many sources. So, look to that for some of the more exciting changes to this gem.

A basic example might be:

  require 'etl/bucket'
  b = Bucket.new(:this => 1, :that => 2)
  b.white_list = [:this]
  b.filtered_data
  # => {:this => 1}
  b.add :this => 2, :something => :ignored
  b.raw_data
  # => {:this => 2}
  b.filtered_data
  # => {:this => 2}
  b.white_list = [:this, :that]
  b.filtered_data
  # => {:this => 2}
  b.add(:that => 2)
  b.filtered_data
  # => {:this => 2, :that => 2}
  b.dump
  # => {:this => 2, :that => 2}
  b.filtered_data
  # => {}
  b.add :this => 1
  b.add :that => 2
  b.to_a
  # => [1, 2]
  b.to_open_struct
  S = Struct.new(:this, :that)
  b.to_struct(S)
  class A
    attr_reader :values
    def initialize(*args)
      @values = args
    end
  end
  a = b.to_obj(A)
  a.values
  # => [1, 2]
  b.filter_block = lambda{|hash| :not_a_useful_block}
  b.dump
  # => :not_a_useful_block

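The whitelist behavior shown above boils down to a keep-only-these-keys merge. A plain-Ruby sketch of that core idea (filtered_merge is a hypothetical helper, not part of the gem):

```ruby
# Plain-Ruby core of Bucket's whitelist filtering: merge the update
# into the bucket, keeping only the whitelisted keys.
def filtered_merge(bucket, updates, white_list)
  bucket.merge(updates.select { |k, _| white_list.include?(k) })
end

bucket = filtered_merge({}, { :this => 1, :something => :ignored }, [:this])
# bucket now holds only :this
bucket = filtered_merge(bucket, { :this => 2, :that => 2 }, [:this, :that])
# later additions overwrite earlier values, just like Bucket#add
```
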
== Installation

  sudo gem install davidrichards-etl

=== Dependencies

The core library formally requires:

* activesupport, for callback support
* ostruct, for using OpenStruct to contain the data
* log4r, for logging
* fileutils, for file system management

Optionally, if you have these installed, they will be made available:

* tegu_gears, for composing ETL with other analytical work or distributing the ETL process
* data_frame, for more easily understanding your data as a named data grid
* babel_icious, for munging XML and hashes with some very useful transformation tools

The various ETL implementations will each have their own dependencies. The Bucket utility class, for instance, requires that the facets gem is installed (sudo gem install facets). I'll document those dependencies as I fill in that part of the gem.

== ActiveWarehouse-ETL

I really like some of what ActiveWarehouse-ETL does. If your target is a data warehouse, don't even start with ETL; start with ActiveWarehouse and ActiveWarehouse-ETL. There are a lot of tools you'd be re-creating with ETL that are available there for free. Some of the more general tools from ActiveWarehouse-ETL belong in this gem too, in our style of code. In particular:

* The XML and SAX support
* A table decoder
* String to date
* Webserver logs
* Date to string
* Time spans

== TODO

There are a number of things I'd like to do:

* Work out the TimeBucket implementation.
* Get more ETL scripts gathered and well documented.
* Integrate the ideas mentioned above from ActiveWarehouse-ETL.
* Work out a better online-processing ETL, i.e., one that works on streaming data.

== COPYRIGHT

Copyright (c) 2009 David Richards. See LICENSE for details.

data/VERSION.yml
ADDED
data/bin/etl
ADDED
@@ -0,0 +1,27 @@
#!/usr/bin/env ruby -wKU
require 'yaml'

version_hash = YAML.load_file(File.join(File.dirname(__FILE__), %w(.. VERSION.yml)))
version = [version_hash[:major].to_s, version_hash[:minor].to_s, version_hash[:patch].to_s].join(".")
etl_file = File.join(File.dirname(__FILE__), %w(.. lib etl))
all = File.join(File.dirname(__FILE__), %w(.. lib all))

irb = RUBY_PLATFORM =~ /(?:mswin|mingw)/ ? 'irb.bat' : 'irb'

require 'optparse'
options = { :sandbox => false, :irb => irb, :without_stored_procedures => false }
OptionParser.new do |opt|
  opt.banner = "Usage: console [environment] [options]"
  opt.on("--irb=[#{irb}]", 'Invoke a different irb.') { |v| options[:irb] = v }
  opt.parse!(ARGV)
end

libs = " -r irb/completion -r #{etl_file} -r #{all}"

puts "Loading ETL version: #{version}"

if options[:sandbox]
  puts "I'll have to think about how the whole sandbox concept should work for the etl"
end

exec "#{options[:irb]} #{libs} --simple-prompt"

data/lib/etl/active_record_loader.rb
ADDED
@@ -0,0 +1,50 @@
# # This is a base class that uses ETL and Zach Dennis' excellent
# # ar-extensions gem. To get the gem, just:
# #
# #   sudo gem install ar-extensions
# #
# # See:
# # http://www.igvita.com/2007/07/11/efficient-updates-data-import-in-rails/
# # http://agilewebdevelopment.com/plugins/ar_extensions
# # http://www.continuousthinking.com/
#
# # To use this, 1) set up an extract to find the data, and 2) a transform to
# # create an array of arrays, with the first array as the header. The
# # header and data should only contain values in the table to be imported.
# # The data_frame gem (sudo gem install davidrichards-data_frame)
# # may make the transform a LOT easier to do if there is a lot of column
# # munging to do. Chris Wycoff's babel_icious gem will go a long way in
# # the transform if you have XML data you are working with
# # (sudo gem install cwycoff-babel_icious).
#
# class ActiveRecordLoader < ETL
#
#   after_transform :ensure_array_of_arrays
#   before_load :ensure_class_defined
#   before_load :assert_header
#
#   protected
#
#   def ensure_array_of_arrays
#     # Not 100% sure whether I process raw_data before or after this method. I think before.
#     data = @raw || @data
#     raise ArgumentError,
#       "Expecting transformed data to be an array of arrays" unless
#       data.is_a?(Array) and data.first.is_a?(Array) and data.last.is_a?(Array)
#   end
#
#   def assert_header
#     @header ||= @data.shift
#     @header.symbolize_values!
#   end
#
#   def ensure_class_defined
#     raise ArgumentError,
#       "Must provide a class to import to. Try #{self.to_a}.instance.options[:class] = ModelClassName" unless
#       self.options[:class]
#   end
#
#   def load
#     options[:class].import(@header, @data)
#   end
# end

data/lib/etl/bucket.rb
ADDED
@@ -0,0 +1,148 @@
require 'facets/dictionary'

# Sometimes I have data coming from several sources. I want to combine
# the sources and release a consolidated record. This is meant to work
# like that. For a weird example:
#   >> my_hash = {:surprise => 'me'}
#   => {:surprise=>"me"}
#   >> b = Bucket.new(my_hash) {|h| h.inject({}) {|hsh, e| hsh[e.first] = e.last % 3; hsh}}
#   => #<Bucket:0x232d230 @raw_data={:surprise=>"me"}, @filter_block=#<Proc:0x0232d26c@(irb):2>>
#   >> b.add :this => 1
#   => {:surprise=>"me", :this=>1}
#   >> b.add OpenStruct.new(:this => 6)
#   => {:surprise=>"me", :this=>6}
#   >> b.raw_data
#   => {:surprise=>"me", :this=>6}
#   >> b.filtered_data
#   => {:surprise=>"me", :this=>0}
#   >> b.dump
#   => {:surprise=>"me", :this=>0}
#   >> b.raw_data
#   => {}
# A more practical use that I have for this is with screen scraping:
# when I'm getting the source of some concept, I may ask the same site
# for information at different times, or ask complementary sites for
# overlaying data. A much more practical use of this is with the
# TimeBucket. That is a bucket that creates a time series from
# observations that may be on very different time schedules.
class Bucket

  # The block used to filter the bucket. Useful for converting the data
  # to a different data type.
  # Examples:
  #   Return a hash
  #     b.filter_block = lambda{|o| o.table}
  #   Return an array
  #     b.filter_block = lambda{|o| o.table.values}
  attr_accessor :filter_block

  # The data in the bucket, as a hash
  attr_reader :raw_data

  def initialize(obj=nil, &block)
    @filter_block = block
    reset_bucket
    assert_object(obj) if obj
  end

  def add(obj)
    assert_object(obj)
  end

  def dump
    data = self.raw_data
    reset_bucket
    filter(data)
  end

  def filtered_data
    filter(self.raw_data)
  end

  # Uses the facets/dictionary to deliver an ordered hash, in the order of
  # the white list.
  def ordered_data
    return self.raw_data unless self.white_list
    dictionary = Dictionary.new
    self.white_list.each do |k|
      dictionary[k] = self.raw_data[k]
    end
    dictionary
  end

  def to_a
    self.ordered_data.values
  end
  alias :to_array :to_a

  alias :to_hash :raw_data

  # Initializes a class with the values of the raw data. Good for structs
  # and struct-like classes.
  def to_obj(klass, use_hash=false)
    use_hash ? klass.new(self.raw_data) : klass.new(*self.raw_data.values)
  end
  alias :to_struct :to_obj

  def to_open_struct
    OpenStruct.new(self.raw_data)
  end

  # Reveals the white list. If this is set, it is an array, and it not
  # only filters the data in the bucket but also orders it.
  attr_reader :white_list
  alias :labels :white_list

  # Sets the white list, if it's an array. Filters the raw data, in case
  # there are illegal values in there.
  def white_list=(value)
    raise ArgumentError, "Must provide an array" unless value.is_a?(Array)
    @white_list = value
    @raw_data = filter_input(self.raw_data)
  end
  alias :labels= :white_list=

  protected

  # Filters the input through the filter_block. Use filtered_data if you
  # just want the data in the bucket.
  def filter(data)
    self.filter_block ? self.filter_block.call(data) : data
  end

  def reset_bucket
    @raw_data = Hash.new
  end

  def assert_object(obj)
    case obj
    when Hash
      obj = filter_input(obj)
      self.raw_data.merge!(obj)
    when OpenStruct
      obj = filter_input(obj.table)
      self.raw_data.merge!(obj)
    when Struct
      obj.each_pair do |k, v|
        if self.white_list
          self.raw_data[k] = v if self.white_list.include?(k)
        else
          self.raw_data[k] = v
        end
      end
    else
      raise ArgumentError, "Don't know how to use this data"
    end
  end

  def filter_input(hash)
    if self.white_list
      hash = hash.inject({}) do |h, e|
        h[e.first] = e.last if self.white_list.include?(e.first)
        h
      end
    end
    hash
  end

end