dataduck 0.4.0 → 0.5.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: a4dabe01cff2c6455751ab08c520d4bfaee62139
- data.tar.gz: d20ef216bc631c445daad0767a51788b42b7f90f
+ metadata.gz: de529cfe949f8c1fb4a4cb36129188636ffbcb74
+ data.tar.gz: ebbcaa35d0babcbabdaef339f9ef72b061fc54d5
  SHA512:
- metadata.gz: d2eacaf08c612c25ae8bf9b1b1d46d4a0312fe0024211d0a8306faa5a810b972a5c2aa8386c4b05b04a26d73093bcae5a89d72bcadef98f6ed7e062054d40410
- data.tar.gz: 2c4c1aec2a0257ad3dcc4e9559436c39de0f747a6ec3fdb816afb7d096678d7c1f608269b6c8dc55d1f1aeac514153bdbec7dbb7d3c082699a89f28e16577b22
+ metadata.gz: 2958e2909631c314c7104fa340f0a587b47ab172417aa792f2b1a31377b1c455188acb0554c84ab656d057c4c060f9d7110bb6971fe45cedc8d4d3a117339d1e
+ data.tar.gz: 5e62d009d64ebe30b1ade7184c5e7f1041e8c58d3cbf5be39d3d1885e89a3126870197568694ead1673018985b079757ad00117d1aa5423d6413d027a292cd09
data/docs/README.md CHANGED
@@ -1,6 +1,6 @@
  # Documentation

- The documentation directory is viewable at (http://dataducketl.com/docs)[http://dataducketl.com/docs].
+ The documentation directory is viewable at http://dataducketl.com/docs.

  # Autogenerated

@@ -0,0 +1,12 @@
+ # Commands
+
+ Run any command with `dataduck commandname` from the project directory, assuming you've already
+ run `bundle install` to install the DataDuck gem.
+
+ The list of commands is:
+
+ - [console](/docs/commands/console)
+ - [dbconsole](/docs/commands/dbconsole)
+ - [etl](/docs/commands/etl)
+ - [quickstart](/docs/commands/quickstart)
+ - [show](/docs/commands/show)
@@ -0,0 +1,5 @@
+ # The `console` command
+
+ The `console` command places you in a Ruby console with DataDuck loaded, which can be useful for debugging. Run it with:
+
+ `$ dataduck console`
@@ -0,0 +1,16 @@
+ # The `dbconsole` command
+
+ The `dbconsole` command opens a database console connected to one of your databases, using the appropriate client
+ on your system (e.g. `mysql` or `psql`).
+
+ This connects you to the destination (e.g. Redshift):
+
+ `$ dataduck dbconsole`
+
+ You can also use one of these:
+
+ `$ dataduck dbconsole source`
+
+ `$ dataduck dbconsole destination`
+
+ `$ dataduck dbconsole [db_name]`
@@ -0,0 +1,11 @@
+ # The `etl` command
+
+ The `etl` command is the main command for running an ETL process. You can use it to ETL all the tables, or just one table at a time.
+
+ To ETL all tables, use:
+
+ `$ dataduck etl all`
+
+ To ETL just one table, use:
+
+ `$ dataduck etl my_table_name`
@@ -0,0 +1,7 @@
+ # The `quickstart` command
+
+ The `quickstart` command runs a wizard for getting started with DataDuck. Only use it with a brand new DataDuck project.
+
+ It will ask you for the credentials to your database, and then create the basic setup for your project. Once you are completely set up, your project's ETL can be run with `dataduck etl all`.
+
+ If you would like to run the ETL regularly, such as every night, it's recommended to use the [whenever](https://github.com/javan/whenever) gem to manage a cron job that runs the ETL.
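For instance, a minimal `config/schedule.rb` sketch using whenever's DSL (the path and time below are placeholders, not part of DataDuck):

```ruby
# config/schedule.rb -- a sketch; adjust the project path and schedule to taste.
every 1.day, at: '3:00 am' do
  command "cd /path/to/your/project && bundle exec dataduck etl all"
end
```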
@@ -0,0 +1,27 @@
+ # The `show` command
+
+ The `show` command lists the database tables that DataDuck is planning to ETL.
+
+ To show all table names:
+
+ `$ dataduck show`
+
+ To show info for just one table:
+
+ ```bash
+ $ dataduck show users
+ Table users
+
+ Sources from users on my_database
+ created_at
+ updated_at
+ id
+ username
+
+ Outputs
+ created_at datetime
+ updated_at datetime
+ id integer
+ username string
+ ```
data/docs/contents.yml CHANGED
@@ -2,5 +2,12 @@
    "Welcome": README
    "Getting Started": getting_started

+ "Commands":
+   "console": console
+   "dbconsole": dbconsole
+   "etl": etl
+   "quickstart": quickstart
+   "show": show
+
  "Tables":
    "The Table Class": README
@@ -23,6 +23,6 @@ Finally, run the quickstart command:

  $ dataduck quickstart

- It will ask you for the credentials to your database, and then create the basic setup for your project. After the setup, your project's ETL can be run by running `ruby src/main.rb`
+ It will ask you for the credentials to your database, and then create the basic setup for your project. Once you are completely set up, your project's ETL can be run with `dataduck etl all`.

  If you would like to run this regularly, such as every night, it's recommended to use the [whenever](https://github.com/javan/whenever) gem to manage a cron job that runs the ETL.
@@ -5,6 +5,49 @@ Each of these table files inherits from `DataDuck::Table`, the base table class.

  You may also define transformations with the `transforms` method and validations with the `validates` method.

+ ## Types of Loading Methods
+
+ There are a few different methods to load your table. You can load the whole table fresh with each ETL, or you can load
+ just the most recently changed rows (based on a column such as `updated_at`).
+
+ Loading just the rows that have changed is best for most tables, since it significantly reduces the amount of data you
+ transfer as well as the time your ETL process takes. Loading the whole table fresh each time is best if the table is
+ small or if rows may be deleted from the table by your main application. (In the case that rows are deleted, you need to reload
+ the whole table each ETL, since the ETL process wouldn't otherwise know which rows no longer exist.)
+
+ ## The `should_fully_reload?` method
+
+ If `should_fully_reload?` is true, the table will be fully reloaded on each ETL. By default, this is false.
+
+ ## The `extract_by_column` and `batch_size` methods
+
+ The alternative to fully reloading is to use an `extract_by_column`. By default, `extract_by_column` returns `updated_at`
+ if your table has an `updated_at` column. This way, only the rows that have changed need to be ETLed. This can give you
+ significant performance improvements, which is why it is the default.
+
+ If the `batch_size` method is set, the extract query will use a `LIMIT batch_size` clause. This is useful if your table
+ is fairly big and you are running DataDuck on a small EC2 instance or another computer without a lot of memory.
+
+ In order to use `batch_size`, you must also set `extract_by_column`.
+
+ An example of where you might want to override the default `extract_by_column` is if you are tracking visitor events in
+ a table and the visitor events are never modified. In this case, you might not even have an `updated_at` column. Instead,
+ you could use the `created_at` column or the `id` column (if ids are always generated in increasing order).
+
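As a sketch (not from the DataDuck source; the table name and batch size are made up), such an append-only events table might override these methods like this:

```ruby
# Hypothetical table of append-only visitor events -- a sketch, not DataDuck source code.
class VisitorEvents < DataDuck::Table
  # source/output definitions omitted; see the example table below.

  def extract_by_column
    'id' # rows are never modified, and ids only ever increase
  end

  def batch_size
    50_000 # extract at most 50,000 rows per query; requires extract_by_column to be set
  end
end
```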
+ ## The `etl!` method
+
+ The `etl!` method is what gets called when you run the `dataduck etl` command. It first extracts the
+ data from your source via the `extract!` method, transforms the data according to any transformations you've created in
+ the `transform!` method, and then loads the data into your destination with the `destination.load_table!` method.
+ You may override `etl!` if you have a custom ETL process; however, it is usually better to override the `extract!` method
+ and leave the rest of the process (and the Redshift loading) up to DataDuck.
+
+ ## The `extract!` method
+
+ The `extract!` method takes one argument: the destination. It extracts the data from the source that is needed to load
+ the destination. If you are writing your own table class against some custom third-party API, you will probably
+ want to override this method.
+
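For example, here is a rough sketch of a custom `extract!` (the API client below is hypothetical; the only DataDuck contract used is assigning the extracted rows to `self.data`, as the built-in `extract!` does):

```ruby
# Sketch only -- SomeSignupApiClient is a made-up third-party client, not a real library.
class ApiSignups < DataDuck::Table
  def extract!(destination = nil)
    self.errors ||= []
    # Populate self.data with one hash per row to be loaded into the warehouse.
    self.data = SomeSignupApiClient.fetch_recent_signups.map do |signup|
      { id: signup['id'], email: signup['email'], created_at: signup['created_at'] }
    end
  end
end
```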
  ## Example Table
  The following is an example table.
@@ -32,7 +32,7 @@ module DataDuck
32
32
  end
33
33
 
34
34
  def self.acceptable_commands
35
- ['console', 'quickstart']
35
+ ['console', 'dbconsole', 'etl', 'quickstart', 'show']
36
36
  end
37
37
 
38
38
  def self.route_command(args)
@@ -46,16 +46,92 @@ module DataDuck
46
46
  return DataDuck::Commands.help
47
47
  end
48
48
 
49
- DataDuck::Commands.public_send(command)
49
+ DataDuck::Commands.public_send(command, *args[1..-1])
50
50
  end
51
51
 
52
52
  def self.console
53
53
  require "irb"
54
+ ARGV.clear
54
55
  IRB.start
55
56
  end
56
57
 
58
+ def self.dbconsole(where = "destination")
59
+ which_database = nil
60
+ if where == "destination"
61
+ which_database = DataDuck::Destination.only_destination
62
+ elsif where == "source"
63
+ which_database = DataDuck::Source.only_source
64
+ else
65
+ found_source = DataDuck::Source.source(where, true)
66
+ found_destination = DataDuck::Destination.destination(where, true)
67
+ if found_source && found_destination
68
+ raise ArgumentError.new("Ambiguous call to dbconsole for #{ where } since there is both a source and destination named #{ where }.")
69
+ end
70
+
71
+ which_database = found_source if found_source
72
+ which_database = found_destination if found_destination
73
+ end
74
+
75
+ if which_database.nil?
76
+ raise ArgumentError.new("Could not find database '#{ where }'")
77
+ end
78
+
79
+ puts "Connecting to #{ where }..."
80
+ which_database.dbconsole
81
+ end
82
+
83
+ def self.etl(what = nil)
84
+ if what.nil?
85
+ puts "You need to specify a table name or 'all'. Usage: dataduck etl all OR datduck etl my_table_name"
86
+ return
87
+ end
88
+
89
+ only_destination = DataDuck::Destination.only_destination
90
+
91
+ if what == "all"
92
+ etl = ETL.new(destinations: [only_destination], autoload_tables: true)
93
+ etl.process!
94
+ else
95
+ table_name_camelized = DataDuck::Util.underscore_to_camelcase(what)
96
+ require DataDuck.project_root + "/src/tables/#{ what }.rb"
97
+ table_class = Object.const_get(table_name_camelized)
98
+ if !(table_class <= DataDuck::Table)
99
+ raise Exception.new("Table class #{ table_name_camelized } must inherit from DataDuck::Table")
100
+ end
101
+
102
+ table = table_class.new
103
+ etl = ETL.new(destinations: [only_destination], autoload_tables: false, tables: [table])
104
+ etl.process_table!(table)
105
+ end
106
+ end
107
+
57
108
  def self.help
58
109
  puts "Usage: dataduck commandname"
110
+ puts "Commands: #{ acceptable_commands.sort.join(' ') }"
111
+ end
112
+
113
+ def self.show(table_name = nil)
114
+ if table_name.nil?
115
+ Dir[DataDuck.project_root + "/src/tables/*.rb"].each do |file|
116
+ table_name_underscores = file.split("/").last.gsub(".rb", "")
117
+ table_name_camelized = DataDuck::Util.underscore_to_camelcase(table_name_underscores)
118
+ require file
119
+ table = Object.const_get(table_name_camelized)
120
+ if table <= DataDuck::Table
121
+ puts table_name_underscores
122
+ end
123
+ end
124
+ else
125
+ table_name_camelized = DataDuck::Util.underscore_to_camelcase(table_name)
126
+ require DataDuck.project_root + "/src/tables/#{ table_name }.rb"
127
+ table_class = Object.const_get(table_name_camelized)
128
+ if !(table_class <= DataDuck::Table)
129
+ raise Exception.new("Table class #{ table_name_camelized } must inherit from DataDuck::Table")
130
+ end
131
+
132
+ table = table_class.new
133
+ table.show
134
+ end
59
135
  end
60
136
 
61
137
  def self.quickstart
@@ -0,0 +1,81 @@
1
+ module DataDuck
2
+ class Database
3
+ attr_accessor :name
4
+
5
+ def initialize(name, *args)
6
+ self.name = name
7
+ end
8
+
9
+ def connection
10
+ raise Exception.new("Must implement connection in subclass.")
11
+ end
12
+
13
+ def query
14
+ raise Exception.new("Must implement query in subclass.")
15
+ end
16
+
17
+ def table_names
18
+ raise Exception.new("Must implement query in subclass.")
19
+ end
20
+
21
+ protected
22
+
23
+ def find_command_and_execute(commands, *args)
24
+ # This function was originally sourced from Rails
25
+ # https://github.com/rails/rails
26
+ #
27
+ # Licensed under the MIT license
28
+ # http://opensource.org/licenses/MIT
29
+ #
30
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
31
+ # of this software and associated documentation files (the "Software"), to deal
32
+ # in the Software without restriction, including without limitation the rights
33
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
34
+ # copies of the Software, and to permit persons to whom the Software is
35
+ # furnished to do so, subject to the following conditions:
36
+ #
37
+ # The above copyright notice and this permission notice shall be included in
38
+ # all copies or substantial portions of the Software.
39
+ #
40
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
41
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
42
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
43
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
44
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
45
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
46
+ # THE SOFTWARE.
47
+
48
+ commands = Array(commands)
49
+
50
+ dirs_on_path = ENV['PATH'].to_s.split(File::PATH_SEPARATOR)
51
+
52
+ full_path_command = nil
53
+ found = commands.detect do |cmd|
54
+ dirs_on_path.detect do |path|
55
+ full_path_command = File.join(path, cmd)
56
+ File.file?(full_path_command) && File.executable?(full_path_command)
57
+ end
58
+ end
59
+
60
+ if found
61
+ exec full_path_command, *args
62
+ else
63
+ abort("Couldn't find command: #{commands.join(', ')}. Check your $PATH and try again.")
64
+ end
65
+ end
66
+
67
+ def is_mutating_sql?(sql)
68
+ # This method is not exhaustive and should not be relied on as a guarantee; it is a
69
+ # sanity check to help ensure certain SQL is not mutating.
70
+
71
+ return true if sql.downcase.start_with?("drop table")
72
+ return true if sql.downcase.start_with?("create table")
73
+ return true if sql.downcase.start_with?("delete from")
74
+ return true if sql.downcase.start_with?("insert into")
75
+ return true if sql.downcase.start_with?("alter table")
76
+
77
+ false
78
+ end
79
+
80
+ end
81
+ end
@@ -1,5 +1,21 @@
1
1
  module DataDuck
2
- class Destination
2
+ class Destination < DataDuck::Database
3
+ def self.load_config!
4
+ all_config = DataDuck.config['destinations']
5
+ return if all_config.nil?
6
+
7
+ all_config.each_key do |destination_name|
8
+ configuration = all_config[destination_name]
9
+ destination_type = configuration['type']
10
+
11
+ if destination_type == "redshift"
12
+ DataDuck.destinations[destination_name] = DataDuck::RedshiftDestination.new(destination_name, configuration)
13
+ else
14
+ raise ArgumentError.new("Unknown type '#{ destination_type }' for destination #{ destination_name }.")
15
+ end
16
+ end
17
+ end
18
+
3
19
  def self.destination_config(name)
4
20
  if DataDuck.config['destinations'].nil? || DataDuck.config['destinations'][name.to_s].nil?
5
21
  raise Exception.new("Could not find destination #{ name } in destinations configs.")
@@ -12,21 +28,25 @@ module DataDuck
12
28
  raise Exception.new("Must implement load_table! in subclass")
13
29
  end
14
30
 
15
- def self.destination(destination_name)
16
- destination_name = destination_name.to_s
31
+ def self.destination(name, allow_nil = false)
32
+ name = name.to_s
17
33
 
18
- if DataDuck.destinations[destination_name]
19
- return DataDuck.destinations[destination_name]
34
+ if DataDuck.destinations[name]
35
+ return DataDuck.destinations[name]
36
+ elsif allow_nil
37
+ return nil
38
+ else
39
+ raise Exception.new("Could not find destination #{ name } in destination configs.")
20
40
  end
41
+ end
21
42
 
22
- destination_configuration = DataDuck::Destination.destination_config(destination_name)
23
- destination_type = destination_configuration['type']
24
- if destination_type == "redshift"
25
- DataDuck.destinations[destination_name] = DataDuck::RedshiftDestination.new(destination_configuration)
26
- return DataDuck.destinations[destination_name]
27
- else
28
- raise ArgumentError.new("Unknown type '#{ destination_type }' for destination #{ destination_name }.")
43
+ def self.only_destination
44
+ if DataDuck.destinations.keys.length != 1
45
+ raise ArgumentError.new("Must be exactly 1 destination.")
29
46
  end
47
+
48
+ destination_name = DataDuck.destinations.keys[0]
49
+ return DataDuck::Destination.destination(destination_name)
30
50
  end
31
51
  end
32
52
  end
data/lib/dataduck/etl.rb CHANGED
@@ -11,8 +11,13 @@ module DataDuck
11
11
  self.destinations << DataDuck::Destination.destination(destination_name)
12
12
  end
13
13
 
14
+ attr_accessor :destinations
15
+ attr_accessor :tables
16
+
14
17
  def initialize(options = {})
18
+ self.class.destinations ||= []
15
19
  @tables = options[:tables] || []
20
+ @destinations = options[:destinations] || []
16
21
 
17
22
  @autoload_tables = options[:autoload_tables].nil? ? true : options[:autoload_tables]
18
23
  if @autoload_tables
@@ -29,16 +34,28 @@ module DataDuck
29
34
  end
30
35
 
31
36
  def process!
32
- puts "Processing ETL..."
37
+ DataDuck::Logs.info "Processing ETL..."
38
+
39
+ destinations_to_use = []
40
+ destinations_to_use = destinations_to_use.concat(self.class.destinations)
41
+ destinations_to_use = destinations_to_use.concat(self.destinations)
42
+ destinations_to_use.uniq!
33
43
 
34
44
  @tables.each do |table_class|
35
45
  table_to_etl = table_class.new
36
- table_to_etl.extract!
37
- table_to_etl.transform!
38
- self.class.destinations.each do |destination|
39
- destination.load_table!(table_to_etl)
40
- end
46
+ table_to_etl.etl!(destinations_to_use)
41
47
  end
42
48
  end
49
+
50
+ def process_table!(table)
51
+ DataDuck::Logs.info "Processing ETL for table #{ table.name }..."
52
+
53
+ destinations_to_use = []
54
+ destinations_to_use = destinations_to_use.concat(self.class.destinations)
55
+ destinations_to_use = destinations_to_use.concat(self.destinations)
56
+ destinations_to_use.uniq!
57
+
58
+ table.etl!(destinations_to_use)
59
+ end
43
60
  end
44
61
  end
@@ -0,0 +1,34 @@
1
+ require 'logger'
2
+
3
+ module DataDuck
4
+ module Logs
5
+ @@ONE_MB_IN_BYTES = 1048576
6
+
7
+ @@logger = nil
8
+
9
+ def Logs.ensure_logger_exists!
10
+ log_file_path = DataDuck.project_root + '/log/dataduck.log'
11
+ DataDuck::Util.ensure_path_exists!(log_file_path)
12
+ @@logger ||= Logger.new(log_file_path, shift_age = 100, shift_size = 100 * @@ONE_MB_IN_BYTES)
13
+ end
14
+
15
+ def Logs.info(message)
16
+ self.ensure_logger_exists!
17
+ puts "[INFO] #{ message }"
18
+ @@logger.info(message)
19
+ end
20
+
21
+ def Logs.warn(message)
22
+ self.ensure_logger_exists!
23
+ puts "[WARN] #{ message }"
24
+ @@logger.warn(message)
25
+ end
26
+
27
+ def Logs.error(err, message = nil)
28
+ self.ensure_logger_exists!
29
+ message = err.to_s unless message
30
+ puts "[ERROR] #{ message }"
31
+ @@logger.error(message)
32
+ end
33
+ end
34
+ end
@@ -7,5 +7,16 @@ module DataDuck
7
7
  def db_type
8
8
  'mysql'
9
9
  end
10
+
11
+ def dbconsole(options = {})
12
+ args = []
13
+ args << "--host=#{ @host }"
14
+ args << "--user=#{ @username }"
15
+ args << "--database=#{ @database }"
16
+ args << "--port=#{ @port }"
17
+ args << "--password=#{ @password }"
18
+
19
+ self.find_command_and_execute("mysql", *args)
20
+ end
10
21
  end
11
22
  end
@@ -7,5 +7,23 @@ module DataDuck
7
7
  def db_type
8
8
  'postgres'
9
9
  end
10
+
11
+ def dbconsole(options = {})
12
+ args = []
13
+ args << "--host=#{ @host }"
14
+ args << "--username=#{ @username }"
15
+ args << "--dbname=#{ @database }"
16
+ args << "--port=#{ @port }"
17
+
18
+ ENV['PGPASSWORD'] = @password
19
+
20
+ self.find_command_and_execute("psql", *args)
21
+ end
22
+
23
+ def data_size_for_table(table_name)
24
+ size_in_bytes = self.query("SELECT pg_total_relation_size('#{ table_name }') AS size").first[:size].to_i
25
+ size_in_gb = size_in_bytes / 1_000_000_000.0
26
+ size_in_gb
27
+ end
10
28
  end
11
29
  end
@@ -2,7 +2,7 @@ require_relative 'destination.rb'
2
2
 
3
3
  module DataDuck
4
4
  class RedshiftDestination < DataDuck::Destination
5
- def initialize(config)
5
+ def initialize(name, config)
6
6
  @aws_key = config['aws_key']
7
7
  @aws_secret = config['aws_secret']
8
8
  @s3_bucket = config['s3_bucket']
@@ -14,6 +14,8 @@ module DataDuck
14
14
  @username = config['username']
15
15
  @password = config['password']
16
16
  @redshift_connection = nil
17
+
18
+ super
17
19
  end
18
20
 
19
21
  def connection
@@ -27,7 +29,7 @@ module DataDuck
27
29
  def copy_query(table, s3_path)
28
30
  properties_joined_string = "\"#{ table.output_column_names.join('","') }\""
29
31
  query_fragments = []
30
- query_fragments << "COPY #{ self.staging_table_name(table) } (#{ properties_joined_string })"
32
+ query_fragments << "COPY #{ table.staging_name } (#{ properties_joined_string })"
31
33
  query_fragments << "FROM '#{ s3_path }'"
32
34
  query_fragments << "CREDENTIALS 'aws_access_key_id=#{ @aws_key };aws_secret_access_key=#{ @aws_secret }'"
33
35
  query_fragments << "REGION '#{ @s3_region }'"
@@ -37,13 +39,13 @@ module DataDuck
37
39
  end
38
40
 
39
41
  def create_columns_on_data_warehouse!(table)
40
- columns = get_columns_in_data_warehouse(table)
42
+ columns = get_columns_in_data_warehouse(table.building_name)
41
43
  column_names = columns.map { |col| col[:name].to_s }
42
44
  table.output_schema.map do |name, data_type|
43
45
  if !column_names.include?(name.to_s)
44
46
  redshift_data_type = data_type.to_s
45
47
  redshift_data_type = 'varchar(255)' if redshift_data_type == 'string'
46
- self.run_query("ALTER TABLE #{ table.name } ADD #{ name } #{ redshift_data_type }")
48
+ self.query("ALTER TABLE #{ table.building_name } ADD #{ name } #{ redshift_data_type }")
47
49
  end
48
50
  end
49
51
  end
@@ -56,18 +58,21 @@ module DataDuck
56
58
  "\"#{ name }\" #{ redshift_data_type }"
57
59
  end
58
60
  props_string = props_array.join(', ')
59
- "CREATE TABLE IF NOT EXISTS #{ table_name } (#{ props_string })"
61
+
62
+ distribution_clause = table.distribution_key ? "DISTKEY(#{ table.distribution_key })" : ""
63
+ index_clause = table.indexes.length > 0 ? "INTERLEAVED SORTKEY (#{ table.indexes.join(',') })" : ""
64
+
65
+ "CREATE TABLE IF NOT EXISTS #{ table_name } (#{ props_string }) #{ distribution_clause } #{ index_clause }"
60
66
  end
61
67
 
62
- def create_output_table_on_data_warehouse!(table)
63
- self.run_query(self.create_table_query(table))
68
+ def create_output_tables!(table)
69
+ self.query(self.create_table_query(table, table.building_name))
64
70
  self.create_columns_on_data_warehouse!(table)
65
- end
66
71
 
67
- def create_staging_table!(table)
68
- table_name = self.staging_table_name(table)
69
- self.drop_staging_table!(table)
70
- self.run_query(self.create_table_query(table, table_name))
72
+ if table.building_name != table.staging_name
73
+ self.drop_staging_table!(table)
74
+ self.query(self.create_table_query(table, table.staging_name))
75
+ end
71
76
  end
72
77
 
73
78
  def data_as_csv_string(data, property_names)
@@ -94,13 +99,25 @@ module DataDuck
94
99
  return data_string_components.join
95
100
  end
96
101
 
102
+ def dbconsole(options = {})
103
+ args = []
104
+ args << "--host=#{ @host }"
105
+ args << "--username=#{ @username }"
106
+ args << "--dbname=#{ @database }"
107
+ args << "--port=#{ @port }"
108
+
109
+ ENV['PGPASSWORD'] = @password
110
+
111
+ self.find_command_and_execute("psql", *args)
112
+ end
113
+
97
114
  def drop_staging_table!(table)
98
- self.run_query("DROP TABLE IF EXISTS #{ self.staging_table_name(table) }")
115
+ self.query("DROP TABLE IF EXISTS #{ table.staging_name }")
99
116
  end
100
117
 
101
- def get_columns_in_data_warehouse(table)
102
- query = "SELECT pg_table_def.column as name, type as data_type, distkey, sortkey FROM pg_table_def WHERE tablename='#{ table.name }'"
103
- results = self.run_query(query)
118
+ def get_columns_in_data_warehouse(table_name)
119
+ cols_query = "SELECT pg_table_def.column AS name, type AS data_type, distkey, sortkey FROM pg_table_def WHERE tablename='#{ table_name }'"
120
+ results = self.query(cols_query)
104
121
 
105
122
  columns = []
106
123
  results.each do |result|
@@ -108,7 +125,7 @@ module DataDuck
108
125
  name: result[:name],
109
126
  data_type: result[:data_type],
110
127
  distkey: result[:distkey],
111
- sortkey: result[:sortkey]
128
+ sortkey: result[:sortkey],
112
129
  }
113
130
  end
114
131
 
@@ -116,20 +133,25 @@ module DataDuck
116
133
  end
117
134
 
118
135
  def merge_from_staging!(table)
136
+ if table.staging_name == table.building_name
137
+ return
138
+ end
139
+
119
140
  # Following guidelines in http://docs.aws.amazon.com/redshift/latest/dg/merge-examples.html
120
- staging_name = self.staging_table_name(table)
121
- delete_query = "DELETE FROM #{ table.name } USING #{ staging_name } WHERE #{ table.name }.id = #{ staging_name }.id" # TODO allow custom or multiple keys
122
- self.run_query(delete_query)
123
- insert_query = "INSERT INTO #{ table.name } (\"#{ table.output_column_names.join('","') }\") SELECT \"#{ table.output_column_names.join('","') }\" FROM #{ staging_name }"
124
- self.run_query(insert_query)
141
+ staging_name = table.staging_name
142
+ building_name = table.building_name
143
+ delete_query = "DELETE FROM #{ building_name } USING #{ staging_name } WHERE #{ building_name }.id = #{ staging_name }.id" # TODO allow custom or multiple keys
144
+ self.query(delete_query)
145
+ insert_query = "INSERT INTO #{ building_name } (\"#{ table.output_column_names.join('","') }\") SELECT \"#{ table.output_column_names.join('","') }\" FROM #{ staging_name }"
146
+ self.query(insert_query)
125
147
  end
126
148
 
127
- def run_query(sql)
149
+ def query(sql)
128
150
  self.connection[sql].map { |elem| elem }
129
151
  end
130
152
 
131
- def staging_table_name(table)
132
- "zz_dataduck_#{ table.name }"
153
+ def table_names
154
+ self.query("SELECT DISTINCT(tablename) AS name FROM pg_table_def WHERE schemaname='public' ORDER BY name").map { |item| item[:name] }
133
155
  end
134
156
 
135
157
  def upload_table_to_s3!(table)
@@ -144,14 +166,28 @@ module DataDuck
144
166
  return s3_obj
145
167
  end
146
168
 
169
+ def finish_fully_reloading_table!(table)
170
+ self.query("DROP TABLE IF EXISTS dataduck_zz_old_#{ table.name }")
171
+
172
+ table_already_exists = self.table_names.include?(table.name)
173
+ if table_already_exists
174
+ self.query("ALTER TABLE #{ table.name } RENAME TO dataduck_zz_old_#{ table.name }")
175
+ end
176
+
177
+ self.query("ALTER TABLE #{ table.staging_name } RENAME TO #{ table.name }")
178
+ self.query("DROP TABLE IF EXISTS dataduck_zz_old_#{ table.name }")
179
+ end
180
+
147
181
  def load_table!(table)
148
- puts "Loading table #{ table.name }..."
182
+ DataDuck::Logs.info "Loading table #{ table.name }..."
149
183
  s3_object = self.upload_table_to_s3!(table)
150
- self.create_staging_table!(table)
151
- self.create_output_table_on_data_warehouse!(table)
152
- self.run_query(self.copy_query(table, s3_object.s3_path))
153
- self.merge_from_staging!(table)
154
- self.drop_staging_table!(table)
184
+ self.create_output_tables!(table)
185
+ self.query(self.copy_query(table, s3_object.s3_path))
186
+
187
+ if table.staging_name != table.building_name
188
+ self.merge_from_staging!(table)
189
+ self.drop_staging_table!(table)
190
+ end
155
191
  end
156
192
 
157
193
  def self.value_to_string(value)
@@ -1,6 +1,23 @@
1
1
  module DataDuck
2
+ class Source < DataDuck::Database
3
+ def self.load_config!
4
+ all_sources = DataDuck.config['sources']
5
+ return if all_sources.nil?
6
+
7
+ all_sources.each_key do |source_name|
8
+ configuration = all_sources[source_name]
9
+ source_type = configuration['type']
10
+
11
+ if source_type == "postgresql"
12
+ DataDuck.sources[source_name] = DataDuck::PostgresqlSource.new(source_name, configuration)
13
+ elsif source_type == "mysql"
14
+ DataDuck.sources[source_name] = DataDuck::MysqlSource.new(source_name, configuration)
15
+ else
16
+ raise ArgumentError.new("Unknown type '#{ source_type }' for source #{ source_name }.")
17
+ end
18
+ end
19
+ end
2
20
 
3
- class Source
4
21
  def self.source_config(name)
5
22
  if DataDuck.config['sources'].nil? || DataDuck.config['sources'][name.to_s].nil?
6
23
  raise Exception.new("Could not find source #{ name } in source configs.")
@@ -9,33 +26,25 @@ module DataDuck
9
26
  DataDuck.config['sources'][name.to_s]
10
27
  end
11
28
 
12
- def self.source(name)
29
+ def self.source(name, allow_nil = false)
13
30
  name = name.to_s
14
31
 
15
32
  if DataDuck.sources[name]
16
33
  return DataDuck.sources[name]
17
- end
18
-
19
- configuration = DataDuck::Source.source_config(name)
20
- source_type = configuration['type']
21
-
22
- if source_type == "postgresql"
23
- DataDuck.sources[name] = DataDuck::PostgresqlSource.new(configuration)
24
- return DataDuck.sources[name]
25
- elsif source_type == "mysql"
26
- DataDuck.sources[name] = DataDuck::MysqlSource.new(configuration)
27
- return DataDuck.sources[name]
34
+ elsif allow_nil
35
+ return nil
28
36
  else
29
- raise ArgumentError.new("Unknown type '#{ source_type }' for source #{ name }.")
37
+ raise Exception.new("Could not find source #{ name } in source configs.")
30
38
  end
31
39
  end
32
40
 
33
- def connection
34
- raise Exception.new("Must implement connection in subclass.")
35
- end
41
+ def self.only_source
42
+ if DataDuck.sources.keys.length != 1
43
+ raise ArgumentError.new("Must be exactly 1 source.")
44
+ end
36
45
 
37
- def query
38
- raise Exception.new("Must implement query in subclass.")
46
+ source_name = DataDuck.sources.keys[0]
47
+ return DataDuck::Source.source(source_name)
39
48
  end
40
49
 
41
50
  def schema(table_name)
@@ -4,13 +4,15 @@ require 'sequel'
4
4
 
5
5
  module DataDuck
6
6
  class SqlDbSource < DataDuck::Source
7
- def initialize(data)
7
+ def initialize(name, data)
8
8
  @host = data['host']
9
9
  @port = data['port']
10
10
  @username = data['username']
11
11
  @password = data['password']
12
12
  @database = data['database']
13
13
  @initialized_db_type = data['db_type']
14
+
15
+ super
14
16
  end
15
17
 
16
18
  def connection
@@ -35,6 +37,10 @@ module DataDuck
35
37
  end
36
38
 
37
39
  def query(sql)
40
+ if self.is_mutating_sql?(sql)
41
+ raise ArgumentError.new("Database #{ self.name } must not run mutating sql: #{ sql }")
42
+ end
43
+
38
44
  self.connection.fetch(sql).all
39
45
  end
40
46
  end
@@ -46,38 +46,150 @@ module DataDuck
46
46
  self.class.actions
47
47
  end
48
48
 
49
- def output_schema
50
- self.class.output_schema
49
+ def check_table_valid!
50
+ if !self.batch_size.nil?
51
+ raise Exception.new("Table #{ self.name }'s batch_size must be > 0") unless self.batch_size > 0
52
+ raise Exception.new("Table #{ self.name } has batch_size defined but no extract_by_column") if self.extract_by_column.nil?
53
+ end
51
54
  end
52
55
 
53
- def output_column_names
54
- self.class.output_schema.keys.sort
56
+ def distribution_key
57
+ if self.output_column_names.include?("id")
58
+ "id"
59
+ else
60
+ nil
61
+ end
55
62
  end
56
63
 
57
- def extract!
58
- puts "Extracting table #{ self.name }..."
64
+ def etl!(destinations)
65
+ if destinations.length != 1
66
+ raise ArgumentError.new("DataDuck can only etl to one destination at a time for now.")
67
+ end
68
+ self.check_table_valid!
69
+ destination = destinations.first
70
+
71
+ if self.should_fully_reload?
72
+ destination.drop_staging_table!(self)
73
+ end
74
+
75
+ batch_number = 0
76
+ while batch_number < 1_000
77
+ batch_number += 1
78
+ self.extract!(destination)
79
+ self.transform!
80
+ destination.load_table!(self)
81
+
82
+ if self.batch_size.nil?
83
+ break
84
+ else
85
+ if self.batch_size == self.data.length
86
+ DataDuck::Logs.info "Finished batch #{ batch_number }, continuing with the next batch"
87
+ else
88
+ DataDuck::Logs.info "Finished batch #{ batch_number } (last batch)"
89
+ break
90
+ end
91
+ end
92
+ end
93
+
94
+ self.data = []
95
+
96
+ if self.should_fully_reload?
97
+ destination.finish_fully_reloading_table!(self)
98
+ end
99
+ end
100
+
101
+ def extract!(destination = nil)
102
+ DataDuck::Logs.info "Extracting table #{ self.name }"
59
103
 
60
104
  self.errors ||= []
61
105
  self.data = []
62
106
  self.class.sources.each do |source_spec|
63
107
  source = source_spec[:source]
64
- my_query = self.extract_query(source_spec)
108
+ my_query = self.extract_query(source_spec, destination)
65
109
  results = source.query(my_query)
66
110
  self.data = results
67
111
  end
68
112
  self.data
69
113
  end
70
114
 
71
- def extract_query(source_spec)
72
- if source_spec.has_key?(:query)
73
- query
74
- else
75
- "SELECT \"#{ source_spec[:columns].sort.join('","') }\" FROM #{ source_spec[:table_name] }"
115
+ def extract_query(source_spec, destination = nil)
116
+ base_query = source_spec.has_key?(:query) ? source_spec[:query] :
117
+ "SELECT \"#{ source_spec[:columns].sort.join('","') }\" FROM #{ source_spec[:table_name] }"
118
+
119
+ extract_by_clause = ""
120
+ limit_clause = ""
121
+
122
+ if self.extract_by_column
123
+ if destination.table_names.include?(self.building_name)
124
+ extract_by_value = destination.query("SELECT MAX(#{ self.extract_by_column }) AS val FROM #{ self.building_name }").first
125
+ extract_by_value = extract_by_value.nil? ? nil : extract_by_value[:val]
126
+
127
+ if extract_by_value
128
+ extract_by_clause = "WHERE #{ self.extract_by_column } >= '#{ extract_by_value }'"
129
+ end
130
+ end
131
+
132
+ limit_clause = self.batch_size ? "ORDER BY #{ self.extract_by_column } LIMIT #{ self.batch_size }" : ""
133
+ end
134
+
135
+ [base_query, extract_by_clause, limit_clause].join(' ').strip
136
+ end
137
+
138
+ def indexes
139
+ which_columns = []
140
+ which_columns << "id" if self.output_column_names.include?("id")
141
+ which_columns << "created_at" if self.output_column_names.include?("created_at")
142
+ which_columns
143
+ end
144
+
145
+ def batch_size
146
+ nil
147
+ end
148
+
149
+ def extract_by_column
150
+ return 'updated_at' if self.output_column_names.include?("updated_at")
151
+
152
+ nil
153
+ end
154
+
155
+ def should_fully_reload?
156
+ false # Set to true if you want to fully reload a table with each ETL
157
+ end
158
+
159
+ def building_name
160
+ self.should_fully_reload? ? self.staging_name : self.name
161
+ end
162
+
163
+ def staging_name
164
+ "zz_dataduck_#{ self.name }"
165
+ end
166
+
167
+ def output_schema
168
+ self.class.output_schema
169
+ end
170
+
171
+ def output_column_names
172
+ self.class.output_schema.keys.sort.map(&:to_s)
173
+ end
174
+
175
+ def show
176
+ puts "Table #{ self.name }"
177
+ self.class.sources.each do |source_spec|
178
+ puts "\nSources from #{ source_spec[:table_name] || source_spec[:query] } on #{ source_spec[:source].name }"
179
+ source_spec[:columns].each do |col_name|
180
+ puts " #{ col_name }"
181
+ end
182
+ end
183
+
184
+ puts "\nOutputs "
185
+ num_separators = self.output_schema.keys.map { |key| key.length }.max
186
+ self.output_schema.each_pair do |name, datatype|
187
+ puts " #{ name }#{ ' ' * (num_separators + 2 - name.length) }#{ datatype }"
76
188
  end
77
189
  end
78
190
 
79
191
  def transform!
80
- puts "Transforming table #{ self.name }..."
192
+ DataDuck::Logs.info "Transforming table #{ self.name }"
81
193
 
82
194
  self.errors ||= []
83
195
  self.class.actions ||= []
data/lib/dataduck/util.rb CHANGED
@@ -1,10 +1,20 @@
1
+ require 'fileutils'
2
+
1
3
  module DataDuck
2
- class Util
3
- def self.underscore_to_camelcase(str)
4
+ module Util
5
+ def Util.ensure_path_exists!(full_path)
6
+ split_paths = full_path.split('/')
7
+ just_file_path = split_paths.pop
8
+ directory_path = split_paths.join('/')
9
+ FileUtils.mkdir_p(directory_path)
10
+ FileUtils.touch("#{ directory_path }/#{ just_file_path }")
11
+ end
12
+
13
+ def Util.underscore_to_camelcase(str)
4
14
  str.split('_').map{ |chunk| chunk.capitalize }.join
5
15
  end
6
16
 
7
- def self.camelcase_to_underscore(str)
17
+ def Util.camelcase_to_underscore(str)
8
18
  str.gsub(/::/, '/')
9
19
  .gsub(/([A-Z]+)([A-Z][a-z])/,'\1_\2')
10
20
  .gsub(/([a-z\d])([A-Z])/,'\1_\2')
@@ -1,6 +1,6 @@
1
1
  module DataDuck
2
2
  VERSION_MAJOR = 0
3
- VERSION_MINOR = 4
3
+ VERSION_MINOR = 5
4
4
  VERSION_PATCH = 0
5
5
  VERSION = [VERSION_MAJOR, VERSION_MINOR, VERSION_PATCH].join('.')
6
6
  end
data/lib/dataduck.rb CHANGED
@@ -1,3 +1,5 @@
1
+ require 'yaml'
2
+
1
3
  Dir[File.dirname(__FILE__) + '/helpers/*.rb'].each do |file|
2
4
  require file
3
5
  end
@@ -6,13 +8,11 @@ Dir[File.dirname(__FILE__) + '/dataduck/*.rb'].each do |file|
6
8
  require file
7
9
  end
8
10
 
9
- require 'yaml'
10
-
11
11
  module DataDuck
12
12
  extend ModuleVars
13
13
 
14
14
  ENV['DATADUCK_ENV'] ||= "development"
15
- create_module_var("environment", ENV['DATADUCK_ENV'])
15
+ create_module_var("environment", ENV['DATADUCK_ENV'])
16
16
 
17
17
  spec = Gem::Specification.find_by_name("dataduck")
18
18
  create_module_var("gem_root", spec.gem_dir)
@@ -26,4 +26,13 @@ module DataDuck
26
26
 
27
27
  create_module_var("sources", {})
28
28
  create_module_var("destinations", {})
29
+
30
+ DataDuck::Source.load_config!
31
+ DataDuck::Destination.load_config!
32
+
33
+ Dir[DataDuck.project_root + "/src/tables/*.rb"].each do |file|
34
+ table_name_underscores = file.split("/").last.gsub(".rb", "")
35
+ table_name_camelized = DataDuck::Util.underscore_to_camelcase(table_name_underscores)
36
+ require file
37
+ end
29
38
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: dataduck
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.4.0
4
+ version: 0.5.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jeff Pickhardt
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2015-10-14 00:00:00.000000000 Z
11
+ date: 2015-10-19 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -143,6 +143,12 @@ files:
143
143
  - bin/setup
144
144
  - dataduck.gemspec
145
145
  - docs/README.md
146
+ - docs/commands/README.md
147
+ - docs/commands/console.md
148
+ - docs/commands/dbconsole.md
149
+ - docs/commands/etl.md
150
+ - docs/commands/quickstart.md
151
+ - docs/commands/show.md
146
152
  - docs/contents.yml
147
153
  - docs/overview/README.md
148
154
  - docs/overview/getting_started.md
@@ -157,8 +163,10 @@ files:
157
163
  - examples/example/src/tables/users.rb
158
164
  - lib/dataduck.rb
159
165
  - lib/dataduck/commands.rb
166
+ - lib/dataduck/database.rb
160
167
  - lib/dataduck/destination.rb
161
168
  - lib/dataduck/etl.rb
169
+ - lib/dataduck/logs.rb
162
170
  - lib/dataduck/mysql_source.rb
163
171
  - lib/dataduck/postgresql_source.rb
164
172
  - lib/dataduck/redshift_destination.rb