schema_transformer 0.1.0

data/README.markdown ADDED
@@ -0,0 +1,83 @@
+ Schema Transformer
+ =======
+
+ Summary
+ -------
+ This gem provides a way to alter database schemas on large tables with little downtime. You run 2 commands to ultimately alter the database.
+
+ First, you generate the schema transform definitions and commands to be run later on production. You will check these files into the rails project.
+
+ Second, you run 2 commands on production.
+
+ The first command creates a 'temporary' table with the altered schema and incrementally copies the data over until it is close to synced. You can run this command as many times as you want - it won't hurt. This first command is slow because it takes a while to copy the data over, especially if you have really large tables that are several GBs in size.
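The incremental copy is a batched `INSERT INTO ... SELECT` keyed on id. A rough sketch of the loop (`copy_sql_batches` is an illustrative helper, not the gem's API; the real code resumes from `MAX(id)` in the temp table, selects an explicit column list with defaults for added columns, and supports a stagger delay):

```ruby
# Sketch of the batched copy behind `schema_transformer sync`: walk the
# source table in id order, copying batch_size rows per INSERT..SELECT.
def copy_sql_batches(table, temp_table, max_id, batch_size: 10_000)
  sqls = []
  lower = 1
  while lower <= max_id
    upper = [lower + batch_size - 1, max_id].min
    sqls << "INSERT INTO #{temp_table} " \
            "(SELECT * FROM #{table} WHERE id >= #{lower} AND id <= #{upper})"
    lower = upper + 1
  end
  sqls
end

puts copy_sql_batches("tags", "tags_st_temp", 25_000)
```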
+
+ The second command does a switcheroo with the 'temporary' new table and the current table. It then removes the obsoleted table with the old schema structure. Because it does a rename (which can break replication on a heavily trafficked site), this second command should be run with a maintenance page up. This second command is fast because it only does a final incremental sync before quickly switching the new table into place.
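The switch step boils down to a final catch-up sync followed by two MySQL `RENAME TABLE` statements and a `DROP TABLE`. A minimal sketch of the SQL involved (the `_st_temp`/`_st_trash` suffixes match the gem's naming; `switch_sql` is an illustrative helper, not part of the gem's API):

```ruby
# Sketch of the table switcheroo done by `schema_transformer switch`.
# Only builds the SQL strings; the gem runs them over its ActiveRecord
# connection after the final incremental sync.
def switch_sql(table)
  temp  = "#{table}_st_temp"   # new table carrying the altered schema
  trash = "#{table}_st_trash"  # old table, dropped once the swap succeeds
  [
    "RENAME TABLE #{table} TO #{trash}",  # old table out of the way
    "RENAME TABLE #{temp} TO #{table}",   # new table takes its place
    "DROP TABLE #{trash}",                # discard the obsolete schema
  ]
end

puts switch_sql("tags")
```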
+
+ Install
+ -------
+
+ <pre>
+ gem install --no-ri --no-rdoc schema_transformer # sudo if you need to
+ </pre>
+
+ Usage
+ -------
+
+ Generate the schema transform definitions:
+
+ <pre>
+ tung@walle $ schema_transformer generate
+ What is the name of the table you want to alter?
+ > tags
+ What is the modification to the table?
+ Example 1:
+   ADD COLUMN smart tinyint(1) DEFAULT '0'
+ Example 2:
+   ADD INDEX idx_name (name)
+ Example 3:
+   ADD COLUMN smart tinyint(1) DEFAULT '0', DROP COLUMN full_name
+ > ADD COLUMN special tinyint(1) DEFAULT '0'
+ *** Thanks ***
+ Schema transform definitions have been generated and saved to:
+   config/schema_transformations/tags.json
+ Next you need to run 2 commands to alter the database. As explained in the README, the first
+ can be run with the site still up. The second command should be done with a maintenance page up.
+
+ Here are the 2 commands you'll need to run later after checking in the tags.json file
+ into your version control system:
+   $ schema_transformer sync tags   # can be run over and over, it will just keep syncing the data
+   $ schema_transformer switch tags # should be done with a maintenance page up, switches the tables
+ *** Thank you ***
+ tung@walle $ schema_transformer sync tags
+ Creating temp table and syncing the data... (tail log/schema_transformer.log for status)
+ *** Thanks ***
+ There is now a tags_st_temp table with the new table schema and the data has been synced.
+ Please run the next command after you put a maintenance page up:
+   $ schema_transformer switch tags
+ tung@walle $ schema_transformer switch tags
+ *** Thanks ***
+ The final sync ran and the table tags has been updated with the new schema.
+ Get rid of that maintenance page and re-enable your site.
+ Thank you. Have a very nice day.
+ tung@walle $
+ </pre>
+
+ FAQ
+ -------
+
+ Q: What table alterations are supported?
+ A: I've only tested adding columns and removing columns.
+
+ Q: Can I add and drop multiple columns and indexes at the same time?
+ A: Yes.
+
+ Cautionary Notes
+ -------
+ For speed reasons, the final sync uses the updated_at timestamp if available and syncs only
+ the data updated within the last day. Data modified before that will not get synced in the
+ final sync. So having an updated_at timestamp on the original table, and keeping it current,
+ is very important.
+
+ For tables that do not have an updated_at timestamp, the size of the final update still needs
+ to be limited, so it is capped at the last 100,000 records. That is not much at all, so it is
+ very important to have that updated_at timestamp.
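The limit condition described above can be sketched in plain Ruby (mirroring `get_limit_cond` in `lib/schema_transformer/base.rb`; `limit_condition` and its arguments are illustrative, not the gem's API):

```ruby
# Sketch of how the final sync bounds the rows it re-copies. With an
# updated_at column, only rows touched since yesterday are considered;
# otherwise the gem walks the newest 100,000 ids to find a lower bound.
def limit_condition(table, column_names, fallback_bound)
  if column_names.include?("updated_at")
    cutoff = (Time.now - 24 * 60 * 60).strftime("%Y-%m-%d")
    "#{table}.updated_at >= '#{cutoff}'"
  else
    # fallback_bound stands in for the id of the 100,000th-newest row, which
    # the gem finds via: SELECT id FROM <table> ORDER BY id DESC LIMIT 100000
    "#{table}.id >= #{fallback_bound}"
  end
end

puts limit_condition("tags", %w[id name updated_at], 0)
puts limit_condition("tags", %w[id name], 500_000)
```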
data/Rakefile ADDED
@@ -0,0 +1,29 @@
+ require 'rubygems'
+ require 'rake'
+ require 'rake/gempackagetask'
+ require 'spec/rake/spectask'
+ require 'gemspec'
+
+ desc "Generate gemspec"
+ task :gemspec do
+   File.open("#{Dir.pwd}/#{GEM_NAME}.gemspec", 'w') do |f|
+     f.write(GEM_SPEC.to_ruby)
+   end
+ end
+
+ desc "Install gem"
+ task :install do
+   Rake::Task['gem'].invoke
+   $stdout.puts "Installing gem..."
+   `gem install pkg/#{GEM_NAME}*.gem`
+   `rm -Rf pkg`
+ end
+
+ desc "Package gem"
+ Rake::GemPackageTask.new(GEM_SPEC) do |pkg|
+   pkg.gem_spec = GEM_SPEC
+ end
+
+ desc "Clean up the test project"
+ task :cleanup do
+ end
data/TODO ADDED
@@ -0,0 +1,11 @@
+ 1. create new table with schema
+ 2. batch copy data
+ 3. maintenance page
+ 4. batch copy final data
+ 5. rename tables
+ 6. remove maintenance page
+
+ * TODO:
+   * add logging again: schema_transformer.log
+   * use updated_at if it's available, and use a real time vs some guess
+   * clean up spec: use real mocks, get rid of $testing_books
@@ -0,0 +1,4 @@
+ #!/usr/bin/env ruby
+
+ require File.join(File.dirname(__FILE__), '..', 'lib', 'schema_transformer')
+ SchemaTransformer::CLI.run(ARGV)
data/gemspec.rb ADDED
@@ -0,0 +1,20 @@
+ require 'lib/schema_transformer/version'
+
+ GEM_NAME = 'schema_transformer'
+ GEM_FILES = FileList['**/*'] - FileList['coverage', 'coverage/**/*', 'pkg', 'pkg/**/*']
+ GEM_SPEC = Gem::Specification.new do |s|
+   # == CONFIGURE ==
+   s.author = "Tung Nguyen"
+   s.email = "tongueroo@gmail.com"
+   s.homepage = "http://github.com/tongueroo/#{GEM_NAME}"
+   s.summary = "Alter database schemas on large tables with little downtime"
+   # == CONFIGURE ==
+   s.executables += [GEM_NAME]
+   s.extra_rdoc_files = ["README.markdown"]
+   s.files = GEM_FILES.to_a
+   s.has_rdoc = false
+   s.name = GEM_NAME
+   s.platform = Gem::Platform::RUBY
+   s.require_path = "lib"
+   s.version = SchemaTransformer::VERSION
+ end
@@ -0,0 +1,260 @@
+ module SchemaTransformer
+   class UsageError < RuntimeError; end
+
+   class Base
+     include Help
+     @@stagger = 0
+     def self.run(options)
+       @@stagger = options[:stagger] || 0
+       @transformer = SchemaTransformer::Base.new(options[:base] || Dir.pwd)
+       @transformer.run(options)
+     end
+
+     attr_reader :options, :temp_table, :table
+     def initialize(base = File.expand_path("..", __FILE__), options = {})
+       @base = base
+       @db, @log, @mail = ActiveWrapper.setup(
+         :base => @base,
+         :env => ENV['RAILS_ENV'] || 'development',
+         :log => "schema_transformer"
+       )
+       @db.establish_connection
+       @conn = ActiveRecord::Base.connection
+
+       @batch_size = options[:batch_size] || 10_000
+     end
+
+     def run(options)
+       @action = options[:action].first
+       case @action
+       when "generate"
+         self.generate
+         help(:generate)
+       when "sync"
+         help(:sync_progress)
+         table = options[:action][1]
+         self.gather_info(table)
+         self.create
+         self.sync
+         help(:sync)
+       when "switch"
+         table = options[:action][1]
+         self.gather_info(table)
+         self.switch
+         self.cleanup
+         help(:switch)
+       else
+         raise UsageError, "Invalid action #{@action}"
+       end
+     end
+
+     def generate
+       data = {}
+       ask "What is the name of the table you want to alter?"
+       data[:table] = gets(:table)
+       ask <<-TXT
+ What is the modification to the table?
+ Example 1:
+   ADD COLUMN smart tinyint(1) DEFAULT '0'
+ Example 2:
+   ADD INDEX idx_name (name)
+ Example 3:
+   ADD COLUMN smart tinyint(1) DEFAULT '0', DROP COLUMN full_name
+       TXT
+       data[:mod] = gets(:mod)
+       path = transform_file(data[:table])
+       FileUtils.mkdir_p(File.dirname(path)) unless File.exist?(File.dirname(path))
+       File.open(path, "w") { |f| f << data.to_json }
+       @table = data[:table]
+       data
+     end
+
+     def gather_info(table)
+       if table.nil?
+         raise UsageError, "You need to specify the table name: schema_transformer #{@action} <table_name>"
+       end
+       data = JSON.parse(IO.read(transform_file(table)))
+       @table = data["table"]
+       @mod = data["mod"]
+       # variables needed for the rest of the program
+       @temp_table = "#{@table}_st_temp"
+       @trash_table = "#{@table}_st_trash"
+       @model = define_model(@table)
+     end
+
+     def create
+       if self.temp_table_exists?
+         @temp_model = define_model(@temp_table)
+       else
+         sql_create = %{CREATE TABLE #{@temp_table} LIKE #{@table}}
+         sql_mod = %{ALTER TABLE #{@temp_table} #{@mod}}
+         @conn.execute(sql_create)
+         @conn.execute(sql_mod)
+         @temp_model = define_model(@temp_table)
+       end
+       reset_column_info
+     end
+
+     def sync
+       res = @conn.execute("SELECT max(id) AS max_id FROM `#{@temp_table}`")
+       start = res.fetch_row[0].to_i + 1 # nil case is okay: [nil][0].to_i => 0
+       find_in_batches(@table, :start => start, :batch_size => @batch_size) do |batch|
+         lower = batch.first
+         upper = batch.last
+
+         columns = insert_columns_sql
+         sql = %Q{
+           INSERT INTO #{@temp_table} (
+             SELECT #{columns}
+             FROM #{@table} WHERE id >= #{lower} AND id <= #{upper}
+           )
+         }
+         @conn.execute(sql)
+
+         if @@stagger > 0
+           log("Staggering: delaying for #{@@stagger} seconds before next batch insert")
+           sleep(@@stagger)
+         end
+       end
+     end
+
+     def final_sync
+       @temp_model = define_model(@temp_table)
+       reset_column_info
+
+       sync
+       columns = subset_columns.collect { |x| "#{@temp_table}.`#{x}` = #{@table}.`#{x}`" }.join(", ")
+       # need to limit the final sync; updating the entire table takes too long
+       limit_cond = get_limit_cond
+       sql = %{
+         UPDATE #{@temp_table} INNER JOIN #{@table}
+           ON #{@temp_table}.id = #{@table}.id
+         SET #{columns}
+         WHERE #{limit_cond}
+       }
+       @conn.execute(sql)
+     end
+
+     def switch
+       final_sync
+       to_trash = %Q{RENAME TABLE #{@table} TO #{@trash_table}}
+       from_temp = %Q{RENAME TABLE #{@temp_table} TO #{@table}}
+       @conn.execute(to_trash)
+       @conn.execute(from_temp)
+     end
+
+     def cleanup
+       sql = %Q{DROP TABLE #{@trash_table}}
+       @conn.execute(sql)
+     end
+
+     def get_limit_cond
+       if @model.column_names.include?("updated_at")
+         "#{@table}.updated_at >= '#{1.day.ago.strftime("%Y-%m-%d")}'"
+       else
+         sql = "SELECT id FROM #{@table} ORDER BY id DESC LIMIT 100000"
+         resp = @conn.execute(sql)
+         bound = 0
+         while row = resp.fetch_row do
+           bound = row[0].to_i
+         end
+         "#{@table}.id >= #{bound}"
+       end
+     end
+
+     # the parameter is only for testing
+     def gets(name = nil)
+       STDIN.gets.strip
+     end
+
+     # columns present in both the old and new schema
+     def subset_columns
+       removed = @model.column_names - @temp_model.column_names
+       @model.column_names - removed
+     end
+
+     def insert_columns_sql
+       # existing subset
+       subset = subset_columns
+
+       # added columns get their default value selected in their place
+       added_s = @temp_model.column_names - @model.column_names
+       added = @temp_model.columns.
+         select { |c| added_s.include?(c.name) }.
+         collect { |c| "#{extract_default(c)} AS `#{c.name}`" }
+
+       # combine both
+       columns = subset.collect { |x| "`#{x}`" } + added
+       columns.join(", ")
+     end
+
+     # returns Array of record ids
+     def find(table, cond)
+       sql = "SELECT id FROM #{table} WHERE #{cond}"
+       response = @conn.execute(sql)
+       results = []
+       while row = response.fetch_row do
+         results << row[0].to_i
+       end
+       results
+     end
+
+     # lower-memory version of ActiveRecord's find_in_batches
+     def find_in_batches(table, options = {})
+       raise "You can't specify an order, it's forced to be ordered by id" if options[:order]
+       raise "You can't specify a limit, it's forced to be the batch_size" if options[:limit]
+
+       start = options.delete(:start).to_i
+       batch_size = options.delete(:batch_size) || 1000
+       order_limit = "ORDER BY id LIMIT #{batch_size}"
+
+       records = find(table, "id >= #{start} #{order_limit}")
+       while records.any?
+         yield records
+
+         break if records.size < batch_size
+         records = find(table, "id > #{records.last} #{order_limit}")
+       end
+     end
+
+     def define_model(table)
+       Object.class_eval(<<-code)
+         class #{table.classify} < ActiveRecord::Base
+           set_table_name "#{table}"
+         end
+       code
+       table.classify.constantize # returns the constant
+     end
+
+     def transform_file(table)
+       @base + "/config/schema_transformations/#{table}.json"
+     end
+
+     def temp_table_exists?
+       @conn.table_exists?(@temp_table)
+     end
+
+     def reset_column_info
+       @model.reset_column_information
+       @temp_model.reset_column_information
+     end
+
+     def log(msg)
+       @log.info(msg)
+     end
+
+     private
+
+     def ask(msg)
+       puts msg
+       print "> "
+     end
+
+     def extract_default(col)
+       @conn.quote(col.default)
+     end
+   end
+ end
@@ -0,0 +1,99 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'rubygems'
4
+ require 'active_wrapper'
5
+
6
+ module SchemaTransformer
7
+ class CLI
8
+
9
+ def self.run(args)
10
+ cli = new(args)
11
+ cli.parse_options!
12
+ cli.run
13
+ end
14
+
15
+ # The array of (unparsed) command-line options
16
+ attr_reader :args
17
+ # The hash of (parsed) command-line options
18
+ attr_reader :options
19
+
20
+ def initialize(args)
21
+ @args = args.dup
22
+ end
23
+
24
+ # Return an OptionParser instance that defines the acceptable command
25
+ # line switches for cloud_info, and what their corresponding behaviors
26
+ # are.
27
+ def option_parser
28
+ # @logger = Logger.new
29
+ @option_parser ||= OptionParser.new do |opts|
30
+ opts.banner = "Usage: #{File.basename($0)} [options] [action]"
31
+
32
+ opts.on("-h", "--help", "Display this help message.") do
33
+ puts help_message
34
+ puts opts
35
+ exit
36
+ end
37
+
38
+ opts.on("-v", "--verbose",
39
+ "Verbose mode"
40
+ ) { |value| options[:verbose] = true }
41
+
42
+ opts.on("-s", "--stagger",
43
+ "Number of seconds to wait inbetween each bulk insert. Default 0"
44
+ ) { |value| options[:stagger] = value }
45
+
46
+ opts.on("-V", "--version",
47
+ "Display the schema_transformer version, and exit."
48
+ ) do
49
+ require File.expand_path("../version", __FILE__)
50
+ puts "Schema Transformer v#{SchemaTransformer::VERSION}"
51
+ exit
52
+ end
53
+
54
+ end
55
+ end
56
+
57
+ def parse_options!
58
+ @options = {:action => nil}
59
+
60
+ if args.empty?
61
+ warn "Please specifiy an action to execute."
62
+ warn help_message
63
+ warn option_parser
64
+ exit 1
65
+ end
66
+
67
+ option_parser.parse!(args)
68
+ extract_environment_variables!
69
+
70
+ options[:action] = args # ignore remaining
71
+ end
72
+
73
+ # Extracts name=value pairs from the remaining command-line arguments
74
+ # and assigns them as environment variables.
75
+ def extract_environment_variables! #:nodoc:
76
+ args.delete_if do |arg|
77
+ next unless arg.match(/^(\w+)=(.*)$/)
78
+ ENV[$1] = $2
79
+ end
80
+ end
81
+
82
+ def run
83
+ begin
84
+ SchemaTransformer::Base.run(options)
85
+ rescue UsageError => e
86
+ puts "Usage Error: #{e.message}"
87
+ puts help_message
88
+ puts option_parser
89
+ end
90
+ end
91
+
92
+ private
93
+ def help_message
94
+ "Available actions: generate, sync, switch"
95
+ end
96
+ end
97
+
98
+ end
99
+
@@ -0,0 +1,43 @@
1
+ module SchemaTransformer
2
+ module Help
3
+ def help(action)
4
+ case action
5
+ when :generate
6
+ out =<<-HELP
7
+ ss
8
+ *** Thanks ***
9
+ Schema transform definitions have been generated and saved to:
10
+ config/schema_transformations/#{self.table}.json
11
+ Next you need to run 2 commands to alter the database. As explained in the README, the first
12
+ can be ran with the site still up. The second command should be done with a maintenance page up.
13
+
14
+ Here are the 2 commands you'll need to run later after checking in the #{self.table}.json file
15
+ into your version control system:
16
+ $ schema_transformer sync #{self.table} # can be ran over and over, it will just keep syncing the data
17
+ $ schema_transformer switch #{self.table} # should be done with a maintenance page up, switches the tables
18
+ *** Thank you ***
19
+ HELP
20
+ when :sync_progress
21
+ out =<<-TEXT
22
+ Creating temp table and syncing the data... (tail log/schema_transformer.log for status)
23
+ TEXT
24
+ when :sync
25
+ out =<<-TEXT
26
+ *** Thanks ***
27
+ There is now a #{self.temp_table} table with the new table schema and the data has been synced.
28
+ Please run the next command after you put a maintenance page up:
29
+ $ schema_transformer switch #{self.table}
30
+ TEXT
31
+ when :switch
32
+ out =<<-TEXT
33
+ *** Thanks ***
34
+ The final sync ran and the table #{self.table} has been updated with the new schema.
35
+ Get rid of that maintenance page and re-enable your site.
36
+ Thank you. Have a very nice day.
37
+ TEXT
38
+ end
39
+ puts out
40
+ end
41
+
42
+ end
43
+ end
@@ -0,0 +1,3 @@
1
+ module SchemaTransformer
2
+ VERSION = "0.1.0"
3
+ end
@@ -0,0 +1,10 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'rubygems'
4
+ require 'active_wrapper'
5
+ require 'pp'
6
+ require 'fileutils'
7
+ require File.expand_path('../schema_transformer/version', __FILE__)
8
+ require File.expand_path('../schema_transformer/help', __FILE__)
9
+ require File.expand_path('../schema_transformer/base', __FILE__)
10
+ require File.expand_path('../schema_transformer/cli', __FILE__)
data/notes/copier.rb ADDED
@@ -0,0 +1,14 @@
+ #!/usr/bin/env ruby
+
+ res = conn.execute("SELECT max(`article_revisions_new`.id) AS max_id FROM `article_revisions_new`")
+ start = res.fetch_row[0].to_i # nil case is okay: [nil][0].to_i => 0
+ Article::Revisions.find_in_batches(:start => start, :batch_size => 10_000) do |batch|
+   lower = batch.first.id
+   upper = batch.last.id
+   execute(%{
+     INSERT INTO article_revisions_new (
+       SELECT id, title, body, article_id, number, note, editor_id, created_at, blurb, teaser, source, slide_id
+       FROM article_revisions WHERE id >= #{lower} AND id <= #{upper}
+     );
+   })
+ end
@@ -0,0 +1,45 @@
+ #!/usr/bin/env ruby
+
+ ArticleRevision.find_in_batches
+
+ Activity
+
+ id, title, body, article_id, number, note, editor_id, created_at, blurb, teaser, source, slide_id, NULL test_id
+
+ def find_in_batches(options = {})
+   raise "You can't specify an order, it's forced to be #{batch_order}" if options[:order]
+   raise "You can't specify a limit, it's forced to be the batch_size" if options[:limit]
+
+   start = options.delete(:start).to_i
+   batch_size = options.delete(:batch_size) || 1000
+
+   with_scope(:find => options.merge(:order => batch_order, :limit => batch_size)) do
+     records = find(:all, :conditions => ["#{table_name}.#{primary_key} >= ?", start])
+
+     while records.any?
+       yield records
+
+       break if records.size < batch_size
+       records = find(:all, :conditions => ["#{table_name}.#{primary_key} > ?", records.last.id])
+     end
+   end
+ end
+
+ res = conn.execute("SELECT max(`article_revisions_new`.id) AS max_id FROM `article_revisions_new`")
+ start = res.fetch_row[0].to_i # nil case is okay: [nil][0].to_i => 0
+ Article::Revisions.find_in_batches(:start => start, :batch_size => 10_000) do |batch|
+   lower = batch.first.id
+   upper = batch.last.id
+   execute(%{
+     INSERT INTO article_revisions_new (
+       SELECT id, title, body, article_id, number, note, editor_id, created_at, blurb, teaser, source, slide_id
+       FROM article_revisions WHERE id >= #{lower} AND id <= #{upper}
+     );
+   })
+ end
+
+ pager = Pager.new(:per_page => 10_000, :lower => 300, :upper => 30_000)
+ pager.each do |page|
+   puts page.start_index
+ end