pg_shrink 0.0.2
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +7 -0
- data/.gitignore +24 -0
- data/.rspec +1 -0
- data/Gemfile +4 -0
- data/Guardfile +11 -0
- data/README.md +92 -0
- data/Rakefile +10 -0
- data/Shrinkfile.example +74 -0
- data/bin/pg_shrink +44 -0
- data/lib/pg_shrink/database/postgres.rb +91 -0
- data/lib/pg_shrink/database.rb +61 -0
- data/lib/pg_shrink/sub_table_filter.rb +21 -0
- data/lib/pg_shrink/sub_table_operator.rb +44 -0
- data/lib/pg_shrink/sub_table_sanitizer.rb +33 -0
- data/lib/pg_shrink/table.rb +159 -0
- data/lib/pg_shrink/table_filter.rb +14 -0
- data/lib/pg_shrink/table_sanitizer.rb +14 -0
- data/lib/pg_shrink/version.rb +3 -0
- data/lib/pg_shrink.rb +59 -0
- data/pg_shrink.gemspec +34 -0
- data/spec/Shrinkfile.basic +6 -0
- data/spec/pg_config.yml +6 -0
- data/spec/pg_shrink/database/postgres_spec.rb +86 -0
- data/spec/pg_shrink/database_spec.rb +26 -0
- data/spec/pg_shrink/table_spec.rb +158 -0
- data/spec/pg_shrink_spec.rb +459 -0
- data/spec/pg_spec_helper.rb +45 -0
- data/spec/spec_helper.rb +4 -0
- metadata +262 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 977a0cdcf3f266d4b4b80737915e4621f2f9c348
|
4
|
+
data.tar.gz: 88ef7101d4cfbbf425ac9d94e0a8319a4d7edd1a
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 30e52962da0e958fe6130a8ca74cf98ec1a61a2b7225426efeb4720ab4ad511788eee8679e5aba3ed120b0d3673ccb4b462be405ee8ca8c88d37b91481a37163
|
7
|
+
data.tar.gz: 846ccb47ae40bf656312601da0aa25f23610f6752bb1f06f7fe60de6a66beaeb2d1f4cab5716b358277e639bdcfc71ca30d505152c347161f2ebb6bc8e8d3583
|
data/.gitignore
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
*.gem
|
2
|
+
*.rbc
|
3
|
+
.bundle
|
4
|
+
.config
|
5
|
+
.yardoc
|
6
|
+
Gemfile.lock
|
7
|
+
InstalledFiles
|
8
|
+
_yardoc
|
9
|
+
coverage
|
10
|
+
doc/
|
11
|
+
lib/bundler/man
|
12
|
+
pkg
|
13
|
+
rdoc
|
14
|
+
spec/reports
|
15
|
+
test/tmp
|
16
|
+
test/version_tmp
|
17
|
+
tmp
|
18
|
+
*.bundle
|
19
|
+
*.so
|
20
|
+
*.o
|
21
|
+
*.a
|
22
|
+
mkmf.log
|
23
|
+
*.swp
|
24
|
+
spec/pg_config.yml
|
data/.rspec
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
--format Nc --format documentation
|
data/Gemfile
ADDED
data/Guardfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,92 @@
|
|
1
|
+
# PgShrink
|
2
|
+
|
3
|
+
The pg_shrink tool makes it easy to shrink and sanitize a postgres database,
|
4
|
+
allowing you to specify custom filtering and sanitization via a simple
|
5
|
+
DSL in a configuration file (Shrinkfile).
|
6
|
+
|
7
|
+
The pg_shrink tool takes two arguments, a url for a postgres database and
|
8
|
+
the path to a configuration file (will default to the Shrinkfile in the
|
9
|
+
current directory)
|
10
|
+
|
11
|
+
The simplest way to learn how to use pg_shrink is via an example.
|
12
|
+
|
13
|
+
## Usage
|
14
|
+
|
15
|
+
### Example Shrinkfile
|
16
|
+
This is a simple Ruby DSL that defines which tables are to be filtered and
|
17
|
+
sanitized in what way, and the relationships between those tables when filtering
|
18
|
+
or sanitization is to be propagated.
|
19
|
+
|
20
|
+
```ruby
|
21
|
+
filter_table :users do |f|
|
22
|
+
f.filter_by do |u|
|
23
|
+
u[:name].match(/save me/)
|
24
|
+
end
|
25
|
+
f.sanitize do |u|
|
26
|
+
u[:email] = "sanitized_email#{u[:id]}@fake.com"
|
27
|
+
u
|
28
|
+
end
|
29
|
+
|
30
|
+
f.filter_subtable(:user_preferences, :foreign_key => :user_id)
|
31
|
+
end
|
32
|
+
```
|
33
|
+
|
34
|
+
This particular example will filter the users table to contain only users with
|
35
|
+
a name matching the regular expression /save me/, sanitize the email field on
|
36
|
+
those users, and then filter the user_preferences table to contain only
|
37
|
+
preferences associated with those users.
|
38
|
+
|
39
|
+
### Full DSL
|
40
|
+
See the Shrinkfile.example file in this directory for a complete list of the
|
41
|
+
available DSL.
|
42
|
+
|
43
|
+
### Options
|
44
|
+
```
|
45
|
+
-u, --url URL *REQUIRED* Specify URL to postgres database.
|
46
|
+
WARNING: This database should be a backup and not
|
47
|
+
be changing at the time pg_shrink is run. It will
|
48
|
+
be modified in place.
|
49
|
+
-c, --config SHRINKFILE Specify a configuration file for how to shrink
|
50
|
+
--force Force run without confirmation.
|
51
|
+
-h, --help Show this message and exit
|
52
|
+
```
|
53
|
+
|
54
|
+
## How does it work?
|
55
|
+
|
56
|
+
The pg_shrink command runs through 4 major steps.
|
57
|
+
* 1. Options parsing.
|
58
|
+
* 2. Shrinkfile parsing and setting up the structure of tables, filters, sanitizers,
|
59
|
+
and their subtable relationships
|
60
|
+
* 3. Iterating through tables and doing a depth-first filter on them.
|
61
|
+
* 4. Iterating through tables and doing a depth-first sanitization on them.
|
62
|
+
|
63
|
+
**Step 1:** Option parsing is simple. pg_shrink uses `optparse`
|
64
|
+
|
65
|
+
**Step 2:** Before anything is run, the Shrinkfile is completely parsed, setting up a set of tables, the filters and sanitizers on those tables, and any subtable relationships
|
66
|
+
|
67
|
+
**Step 3:** For each table, the filters on that table are iterated through. For each filter, the records in the table are pulled out in batches, the filter is applied to that batch, and then any subtable filters are applied for records impacted within that batch.
|
68
|
+
|
69
|
+
**Step 4:** For each table, the sanitizers on that table are iterated through. For each sanitizer, the records in the table are pulled out in batches, the sanitizer is applied to that batch, and then any subtable sanitizers are applied for records impacted within that batch.
|
70
|
+
|
71
|
+
## Installation
|
72
|
+
|
73
|
+
Add this line to your application's Gemfile:
|
74
|
+
|
75
|
+
gem 'pg_shrink'
|
76
|
+
|
77
|
+
And then execute:
|
78
|
+
|
79
|
+
$ bundle
|
80
|
+
|
81
|
+
Or install it yourself as:
|
82
|
+
|
83
|
+
$ gem install pg_shrink
|
84
|
+
|
85
|
+
## Contributing
|
86
|
+
|
87
|
+
1. Fork it
|
88
|
+
2. Create your feature branch (`git checkout -b my-new-feature`)
|
89
|
+
3. Commit your changes (`git commit -am 'Add some feature'`)
|
90
|
+
4. Push to the branch (`git push origin my-new-feature`)
|
91
|
+
5. Create new Pull Request
|
92
|
+
|
data/Rakefile
ADDED
@@ -0,0 +1,10 @@
|
|
1
|
+
require 'rspec/core/rake_task'
|
2
|
+
require "bundler/gem_tasks"
|
3
|
+
|
4
|
+
# Default directory to look in is `/specs`
|
5
|
+
# Run with `rake spec`
|
6
|
+
RSpec::Core::RakeTask.new(:spec) do |task|
|
7
|
+
task.rspec_opts = ['--color', '--format', 'nested']
|
8
|
+
end
|
9
|
+
|
10
|
+
task :default => :spec
|
data/Shrinkfile.example
ADDED
@@ -0,0 +1,74 @@
|
|
1
|
+
filter_table :users do |f|
|
2
|
+
|
3
|
+
# filter_by takes a block and yields the fields of each record (as a hash)
|
4
|
+
# the block should return true to keep the record, false if not. For
|
5
|
+
# ease of use and extensibility, we allow multiple filter_by blocks
|
6
|
+
# rather than forcing all logic into one block.
|
7
|
+
f.filter_by do |u|
|
8
|
+
u[:id] % 1000 == 0
|
9
|
+
end
|
10
|
+
|
11
|
+
# lock takes a block and yields the fields of each record (as a hash of
|
12
|
+
# fieldname => value) If the block returns true this record is immune to all
|
13
|
+
# further filtering.
|
14
|
+
f.lock do |u|
|
15
|
+
u[:email].split('@').last == 'apartmentlist.com'
|
16
|
+
end
|
17
|
+
|
18
|
+
# sanitize takes a block, yields the fields of each record as a hash of
|
19
|
+
# fieldname => value and should return a new set of fields that has been
|
20
|
+
# sanitized however desired.
|
21
|
+
f.sanitize do |u|
|
22
|
+
u[:email] = "somerandomemail#{u[:id]}@foo.bar"
|
23
|
+
u
|
24
|
+
end
|
25
|
+
|
26
|
+
# filter_subtable indicates a child table to filter based upon the filtering
|
27
|
+
# done on this table.
|
28
|
+
f.filter_subtable(:favorites, :foreign_key => :user_id)
|
29
|
+
|
30
|
+
# if need be you can filter by a different key besides the id. All filtering
|
31
|
+
# will be done before all sanitization, so you don't need to worry about if
|
32
|
+
# these are getting munged.
|
33
|
+
f.filter_subtable(:email_preferences, :foreign_key => :user_email,
|
34
|
+
:primary_key => :email)
|
35
|
+
|
36
|
+
# You can also filter by a polymorphic reference by specifying the
|
37
|
+
# type_key and type.
|
38
|
+
f.filter_subtable(:polymorphic_referneces, :foreign_key => :context_id,
|
39
|
+
:type_key => :context_type,
|
40
|
+
:type => 'User')
|
41
|
+
|
42
|
+
# If it feels more natural, you can define additional filters
|
43
|
+
# or locks within a filter_subtable definition
|
44
|
+
f.filter_subtable(:lockable_table, :foreign_key => :user_id) do |sub|
|
45
|
+
sub.lock do |u|
|
46
|
+
u[:locked] == true
|
47
|
+
end
|
48
|
+
end
|
49
|
+
|
50
|
+
# To keep things consistent, if you're sanitizing something that also exists
|
51
|
+
# in other places (ie tables aren't fully normalized, and you have email in 2
|
52
|
+
# places), you probably need to be able to specify this somehow
|
53
|
+
f.sanitize_subtable(:email_preferences,
|
54
|
+
:local_field => :email,
|
55
|
+
:foreign_field => :user_email)
|
56
|
+
|
57
|
+
end
|
58
|
+
|
59
|
+
# If you have a chain of dependencies, ie users has favorites, favorites has
|
60
|
+
# some additional set of tables hanging off it, you can define the 2nd
|
61
|
+
# relationship in its own filter_table block, and the tool will figure out that
|
62
|
+
# going from users => favorites also implies
|
63
|
+
# favorites => favorite_related_table
|
64
|
+
filter_table :favorites do |f|
|
65
|
+
f.filter_subtable(:favorite_related_table, :foreign_key => :favorite_id)
|
66
|
+
end
|
67
|
+
|
68
|
+
# You can completely remove a table as well, or remove it minus a locked set of
|
69
|
+
# rows
|
70
|
+
remove_table :removables do |f|
|
71
|
+
f.lock do |u|
|
72
|
+
u[:name] == "Keep Me"
|
73
|
+
end
|
74
|
+
end
|
data/bin/pg_shrink
ADDED
@@ -0,0 +1,44 @@
|
|
1
|
+
#!/usr/bin/env ruby
# pg_shrink command-line entry point: parse flags, then hand off to PgShrink.run.
require 'optparse'

$:.unshift(File.join(File.dirname(__FILE__), "/../lib"))
require 'pg_shrink'

# Parses command-line options into the provided hash (mutated in place).
# Recognized keys: :url, :config, :force.
def parse_options!(options)
  OptionParser.new do |opts|
    banner = <<-TXT
pg_shrink helps you shrink and sanitize your psql database!
Please make sure you have a Shrinkfile or specify one using -c
    TXT
    opts.banner = banner

    url_desc = '*REQUIRED* Specify URL to postgres database. WARNING: ' +
               'This database should be a backup and not be changing at the ' +
               'time pg_shrink is run. It will be modified in place.'
    opts.on('-u', '--url URL', url_desc) do |url|
      options[:url] = url
    end

    # NOTE(fix): removed a dangling string literal ('time pg_shrink is run...')
    # that was left here as a dead expression from a copy of url_desc.
    config_desc = '(Optional) Specify configuration file for how to shrink. ' +
                  'Will default to Shrinkfile in directory command is being ' +
                  'run from'
    opts.on('-c', '--config Shrinkfile', config_desc) do |config|
      options[:config] = config
    end

    force_desc = 'Force run without confirmation.'
    opts.on('--force', force_desc) do
      options[:force] = true
    end

    opts.on('-h', '--help', 'Show this message and exit') do
      puts opts
      exit
    end
  end.parse!
end

options = PgShrink.blank_options
parse_options!(options)
PgShrink.run(options)
|
@@ -0,0 +1,91 @@
|
|
1
|
+
module PgShrink
  require 'pg'
  require 'sequel'
  # Postgres-backed implementation of Database; all record access goes
  # through a Sequel connection.
  class Database::Postgres < Database

    attr_accessor :connection

    DEFAULT_OPTS = {
      postgres_url: nil,
      host: 'localhost',
      port: nil,
      username: 'postgres',
      password: nil,
      database: 'test',
      batch_size: 10000
    }.freeze

    # Builds a postgres:// URL from @opts unless an explicit :postgres_url
    # was supplied.
    def connection_string
      if @opts[:postgres_url]
        @opts[:postgres_url]
      else
        # NOTE(fix): this previously read @opts[:user], but DEFAULT_OPTS
        # defines :username, so the user portion of the URL was always blank.
        str = "postgres://#{@opts[:username]}"
        str << ":#{@opts[:password]}" if @opts[:password]
        str << "@#{@opts[:host]}"
        str << ":#{@opts[:port]}" if @opts[:port]
        str << "/#{@opts[:database]}"
      end
    end

    # Number of primary-key values covered per batch in records_in_batches.
    def batch_size
      @opts[:batch_size]
    end

    def initialize(opts)
      @opts = DEFAULT_OPTS.merge(opts.symbolize_keys)
      @connection = Sequel.connect(connection_string)
    end

    # WARNING! This assumes the database is not changing during run. If
    # requirements change we may need to insert a lock.
    # Yields successive arrays of row hashes, each covering a primary-key
    # range of size batch_size.
    def records_in_batches(table_name)
      table = self.table(table_name)
      primary_key = table.primary_key
      max_id = self.connection["select max(#{primary_key}) from #{table_name}"].
        first[:max]
      # Empty table: max() returns NULL; nothing to yield.
      return if max_id.nil?
      i = 1
      # NOTE(fix): was `i < max_id`, which skipped the last batch whenever
      # max_id landed exactly on a batch boundary (e.g. a table whose only
      # row has id 1 was never yielded).
      while i <= max_id do
        sql = "select * from #{table_name} where " +
          "#{primary_key} >= #{i} and #{primary_key} < #{i + batch_size}"
        batch = self.connection[sql].all
        yield(batch)
        i = i + batch_size
      end
    end

    # Persists field-level changes between old_records and new_records.
    # Raises if new_records adds or drops any primary keys — this method
    # must not be used for deletion.
    def update_records(table_name, old_records, new_records)
      table = self.table(table_name)
      primary_key = table.primary_key

      old_records_by_key = old_records.index_by {|r| r[primary_key]}
      new_records_by_key = new_records.index_by {|r| r[primary_key]}

      if (new_records_by_key.keys - old_records_by_key.keys).size > 0
        raise "Bad voodoo! New records have primary keys not in old records!"
      end

      deleted_record_ids = old_records_by_key.keys - new_records_by_key.keys
      if deleted_record_ids.any?
        raise "Bad voodoo! Some records missing in new records!"
      end

      # TODO: This can be optimized if performance is too slow. Will impact
      # the speed of sanitizing the already-filtered dataset.
      new_records.each do |rec|
        # Only issue an UPDATE for rows that actually changed.
        if old_records_by_key[rec[primary_key]] != rec
          self.connection.from(table_name).
            where(primary_key => rec[primary_key]).
            update(rec)
        end
      end
    end

    # Returns all rows matching the Sequel filter hash `opts`.
    def get_records(table_name, opts)
      self.connection.from(table_name).where(opts).all
    end

    # Deletes all rows matching the Sequel filter hash/condition.
    def delete_records(table_name, condition_to_delete)
      self.connection.from(table_name).where(condition_to_delete).delete
    end
  end
end
|
@@ -0,0 +1,61 @@
|
|
1
|
+
module PgShrink
  # Abstract database. Holds the registry of Table objects being shrunk and
  # the DSL entry points (filter_table / remove_table); concrete record
  # access is implemented by subclasses (e.g. Database::Postgres).
  class Database
    # Lazily-initialized registry of Table objects, keyed by table name.
    def tables
      @tables ||= {}
    end

    # table should return a unique table representation for this database.
    def table(table_name)
      tables[table_name] ||= Table.new(self, table_name)
    end

    # DSL entry point: declare filters/sanitizers on a table via the block.
    def filter_table(table_name, opts = {})
      table = self.table(table_name)
      # we want to allow composability of filter specifications, so we always
      # update existing options rather than overriding
      table.update_options(opts)
      yield table if block_given?
    end

    # DSL entry point: remove all rows from a table (minus any locked rows).
    # Yields the table first so a block can declare locks.
    # NOTE(fix): the block was previously ignored, so the documented
    # `remove_table(:foo) { |f| f.lock { ... } }` form (see Shrinkfile.example)
    # silently never locked anything.
    def remove_table(table_name)
      table = self.table(table_name)
      yield table if block_given?
      table.mark_for_removal!
    end

    # records_in_batches should yield a series of batches # of records.
    def records_in_batches(table_name)
      raise "implement in subclass"
    end

    # get_records should take a table name and options hash and return a
    # specific set of records
    def get_records(table_name, opts)
      raise "implement in subclass"
    end

    # The update_records method takes a set of original records and a new
    # set of records. It should throw an error if there are any records missing,
    # so it should not be used for deletion.
    def update_records(table_name, old_records, new_records)
      raise "implement in subclass"
    end

    # The delete_records method takes a table name and a condition to delete on.
    def delete_records(table_name, condition)
      raise "implement in subclass"
    end

    # Run every registered table's filters.
    def filter!
      tables.values.each(&:filter!)
    end

    # Run every registered table's sanitizers (after all filtering is done).
    def sanitize!
      tables.values.each(&:sanitize!)
    end

    # Filter first, then sanitize.
    def shrink!
      filter!
      sanitize!
    end
  end
end
|
@@ -0,0 +1,21 @@
|
|
1
|
+
module PgShrink
  # Propagates a parent table's filtering onto a child table: child rows are
  # kept only when their foreign key points at a surviving parent row.
  class SubTableFilter < SubTableOperator

    def propagate!(old_parent_data, new_parent_data)
      primary_key = @opts[:primary_key]
      foreign_key = @opts[:foreign_key]

      keys_before = old_parent_data.map {|record| record[primary_key]}
      keys_after  = new_parent_data.map {|record| record[primary_key]}

      # Fetch every child row that referenced any parent in the old batch,
      # scoped by type for polymorphic associations.
      conditions = {foreign_key => keys_before}
      if @opts[:type_key] && @opts[:type]
        conditions[@opts[:type_key]] = @opts[:type]
      end
      affected = table.get_records(conditions)

      # Keep only children whose parent survived the filter.
      table.filter_batch(affected) do |record|
        keys_after.include?(record[foreign_key])
      end
    end

  end
end
|
@@ -0,0 +1,44 @@
|
|
1
|
+
module PgShrink
  # Base class for operations propagated from a parent table to a child
  # ("sub") table. Subclasses implement #propagate!.
  class SubTableOperator
    attr_accessor :parent, :table_name, :database

    def initialize(parent, table_name, opts = {})
      self.parent = parent
      self.table_name = table_name
      self.database = parent.database
      @opts = default_opts.merge(opts)
      validate_opts!(@opts)
    end

    # Defaults: child references parent via "<singular parent name>_id",
    # and the parent's primary key is :id.
    def default_opts
      singular = ActiveSupport::Inflector.singularize(parent.table_name.to_s)
      {
        :foreign_key => "#{singular}_id",
        :primary_key => :id
      }
    end

    # Human-readable identifier used in error messages.
    def name
      "#{table_name} #{self.class.name.demodulize} from #{parent.table_name}"
    end

    # The child Table object this operator acts on.
    def table
      database.table(table_name)
    end

    # :type_key and :type (polymorphic references) must come as a pair.
    def validate_opts!(opts)
      if opts[:type_key] && !opts[:type]
        raise "Error: #{name} has type_key set but no type"
      end
      if opts[:type] && !opts[:type_key]
        raise "Error: #{name} has type set but no type_key"
      end
    end

    def propagate!(old_parent_data, new_parent_data)
      raise "Implement in subclass"
    end

  end
end
|
44
|
+
|
@@ -0,0 +1,33 @@
|
|
1
|
+
module PgShrink
  # Propagates sanitization of a parent field into a denormalized copy held
  # on a child table (e.g. users.email duplicated as
  # email_preferences.user_email).
  class SubTableSanitizer < SubTableOperator

    # Requires :local_field (on the parent) and :foreign_field (on the child)
    # in addition to the base-class validations.
    def validate_opts!(opts)
      unless opts[:local_field] && opts[:foreign_field]
        raise "Error: #{name} must define :local_field and :foreign_field"
      end
      super(opts)
    end

    def propagate!(old_parent_data, new_parent_data)
      primary_key = @opts[:primary_key]
      foreign_key = @opts[:foreign_key]

      old_batch = old_parent_data.index_by {|record| record[primary_key]}
      new_batch = new_parent_data.index_by {|record| record[primary_key]}

      # Fetch children referencing any parent in the old batch, scoped by
      # type for polymorphic associations.
      conditions = {foreign_key => old_batch.keys}
      if @opts[:type_key] && @opts[:type]
        conditions[@opts[:type_key]] = @opts[:type]
      end

      parent_field = @opts[:local_field].to_sym
      child_field = @opts[:foreign_field].to_sym

      children = table.get_records(conditions)
      table.sanitize_batch(children) do |record|
        # NOTE(review): assumes every fetched child's parent key is present
        # in new_batch; a child referencing a parent absent from the new
        # batch would raise here — confirm against caller guarantees.
        parent_record = new_batch[record[foreign_key]]
        record[child_field] = parent_record[parent_field]
        record
      end
    end

  end
end
|
@@ -0,0 +1,159 @@
|
|
1
|
+
module PgShrink
  # Represents one database table plus the filters/sanitizers declared for it
  # and the subtable operations to propagate onto child tables.
  class Table
    attr_accessor :table_name
    attr_accessor :database
    attr_accessor :opts
    attr_reader :filters, :sanitizers, :subtable_filters, :subtable_sanitizers
    # TODO: Figure out, do we need to be able to support tables with no
    # keys? If so, how should we handle that?
    def initialize(database, table_name, opts = {})
      self.table_name = table_name
      self.database = database
      @opts = opts
      @filters = []
      @sanitizers = []
      @subtable_filters = []
      @subtable_sanitizers = []
    end

    # Merge in new options; later DSL declarations compose with earlier ones.
    def update_options(opts)
      @opts = @opts.merge(opts)
    end

    # Register a filter; the block receives a record hash and returns truthy
    # to keep the record.
    def filter_by(opts = {}, &block)
      self.filters << TableFilter.new(self, opts, &block)
    end

    # Declare a child table filtered in step with this one. Yields the child
    # Table when a block is given (for nested locks/filters).
    def filter_subtable(table_name, opts = {})
      filter = SubTableFilter.new(self, table_name, opts)
      self.subtable_filters << filter
      yield filter.table if block_given?
    end

    # Register the lock predicate: records for which it returns true are
    # immune to filtering and sanitization.
    def lock(opts = {}, &block)
      @lock = block
    end

    # Truthy when the record matches the lock predicate; nil when no lock
    # is defined.
    def locked?(record)
      if @lock
        @lock.call(record)
      end
    end

    # Register a sanitizer; the block receives a record hash and returns the
    # sanitized record.
    def sanitize(opts = {}, &block)
      self.sanitizers << TableSanitizer.new(self, opts, &block)
    end

    # Declare a child table holding a denormalized copy of a sanitized field.
    # Yields the child Table when a block is given.
    def sanitize_subtable(table_name, opts = {})
      sanitizer = SubTableSanitizer.new(self, table_name, opts)
      self.subtable_sanitizers << sanitizer
      yield sanitizer.table if block_given?
    end

    # Persist field changes between two record sets (no-op without a database).
    def update_records(original_records, new_records)
      if self.database
        database.update_records(self.table_name, original_records, new_records)
      end
    end

    # Delete the rows present in old_records but absent from new_records.
    def delete_records(old_records, new_records)
      if primary_key
        deleted_keys = old_records.map {|r| r[primary_key]} -
          new_records.map {|r| r[primary_key]}
        # NOTE(fix): skip the database round-trip entirely when the filter
        # kept every record (previously an empty-key delete was issued —
        # and with no database attached it crashed, unlike every other
        # nil-database path in this class).
        if deleted_keys.any?
          self.database.delete_records(table_name, primary_key => deleted_keys)
        end
      else
        # TODO: Do we need to speed this up? Or is this an unusual enough
        # case that we can leave it slow?
        deleted_records = old_records - new_records
        deleted_records.each do |rec|
          self.database.delete_records(table_name, rec)
        end
      end
    end

    # Delegate batch iteration to the database; yields one empty batch when
    # detached (keeps test/dry-run paths alive).
    def records_in_batches(&block)
      if self.database
        self.database.records_in_batches(self.table_name, &block)
      else
        yield []
      end
    end

    # Delegate record lookup to the database; empty when detached.
    def get_records(finder_options)
      if self.database
        self.database.get_records(self.table_name, finder_options)
      else
        []
      end
    end

    # Propagate a filtering change (old_set -> new_set) to child tables.
    def filter_subtables(old_set, new_set)
      self.subtable_filters.each do |subtable_filter|
        subtable_filter.propagate!(old_set, new_set)
      end
    end

    # Propagate a sanitization change (old_set -> new_set) to child tables.
    def sanitize_subtables(old_set, new_set)
      self.subtable_sanitizers.each do |subtable_sanitizer|
        subtable_sanitizer.propagate!(old_set, new_set)
      end
    end

    # Apply a filter block to one batch: locked records always survive;
    # deletions and subtable propagation happen immediately.
    def filter_batch(batch, &filter_block)
      new_set = batch.select do |record|
        locked?(record) || filter_block.call(record.dup)
      end
      delete_records(batch, new_set)
      filter_subtables(batch, new_set)
    end

    # Apply a sanitize block to one batch: locked records pass through
    # unchanged; updates and subtable propagation happen immediately.
    def sanitize_batch(batch, &sanitize_block)
      new_set = batch.map do |record|
        if locked?(record)
          record.dup
        else
          sanitize_block.call(record.dup)
        end
      end
      update_records(batch, new_set)
      sanitize_subtables(batch, new_set)
    end

    # Run every registered filter over the table, batch by batch.
    def filter!
      self.filters.each do |filter|
        self.records_in_batches do |batch|
          self.filter_batch(batch) do |record|
            filter.apply(record)
          end
        end
      end
    end

    # Run every registered sanitizer over the table, batch by batch.
    def sanitize!
      self.sanitizers.each do |sanitizer|
        self.records_in_batches do |batch|
          self.sanitize_batch(batch) do |record|
            sanitizer.apply(record)
          end
        end
      end
    end

    # We use a filter for this, so that all other dependencies etc behave
    # as would be expected.
    def mark_for_removal!
      self.filter_by { false }
    end

    # Check explicitly for nil because we want to be able to set primary_key
    # to false for e.g. join tables
    def primary_key
      opts[:primary_key].nil? ? :id : opts[:primary_key]
    end

    # Filter first, then sanitize.
    def shrink!
      filter!
      sanitize!
    end
  end
end
|