bigshift 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: fc84facadd8de03293a5ba461bce6653bb3f00aa
-  data.tar.gz: deb0e103ae33b5a9627feb3aa4ac617cfa54e342
+  metadata.gz: 5221e0948bc35adae3c09681be2de8529cf51630
+  data.tar.gz: ab8501193f724bed2288a3784719ef4cbbf16c26
 SHA512:
-  metadata.gz: ec259abd928ad95999f64fa9765776c659113a373257d840874d9864ff571bdec0744efa756d3aaf62c7599a5c689de5ca9cf77d66e04a441a4b0d22cdbb833e
-  data.tar.gz: 04cbba86814f2526260f24a4c6583180e55edb4faf6ef7b20a96a0b961ad48586b36c1145af4f49ae06f9735fe2a0c98654433b7ec79bd0520fd5d0d7924935b
+  metadata.gz: 914cdf7f5e432faba32a6d66661c9dd1b0b55edac2933438a46bfbdc6cc4476441d8fbe5e2858017ae012ef2bd0c559c07c75bd2b5fb1bc33754aebbf3dee4c8
+  data.tar.gz: f05dc703a91fb1dbc338e65a04473c8d29ed7df19b9bc2abb6e702aebc022df5f47e7c524b7abf0a52a288961b0f09cc26828646655ebc6312905ad73aff3dba
data/README.md CHANGED
@@ -22,15 +22,17 @@ The main interface to BigShift is the `bigshift` command line tool.
 
 BigShift can also be used as a library in a Ruby application. Look at the tests, and how the `bigshift` tool is built to figure out how.
 
+Because a transfer can take a long time, it's highly recommended that you run the command in `screen` or `tmux`, or use some other mechanism that ensures that the process isn't killed prematurely.
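For example, a long-running transfer can be started inside a named `tmux` session so it survives a dropped SSH connection. This is only an illustrative sketch: the session name is arbitrary and `...` stands for your usual `bigshift` arguments.

```
# start a named tmux session on the machine that will run the transfer
tmux new-session -s bigshift

# inside the session, run bigshift as usual ("..." = your normal arguments)
bigshift ...

# detach with Ctrl-b d, and reattach later to check on progress
tmux attach -t bigshift
```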
+
 ## Cost
 
 Please note that transferring large amounts of data between AWS and GCP is not free. [AWS charges for outgoing traffic from S3](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). There are also storage charges for the Redshift dumps on S3 and GCS, but since they are kept only until the BigQuery table has been loaded those should be negligible.
 
-BigShift tells Redshift to compress the dumps, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost.
+BigShift tells Redshift to compress the dumps by default, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost. However, depending on your setup and data, the individual files produced by Redshift might become larger than BigQuery's compressed file size limit of 4 GiB. In these cases you need to either uncompress the files manually on the GCP side (for example by running BigShift with just `--steps unload,transfer` to get the dumps to GCS), or dump and transfer uncompressed files (with `--no-compression`), at a higher bandwidth cost.
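To make the two workarounds concrete, here is a sketch of the corresponding invocations; the trailing `...` stands for the required Redshift, S3, GCS and BigQuery arguments, which are elided here.

```
# option 1: dump and transfer only, then handle decompression and loading separately
bigshift --steps unload,transfer ...

# option 2: skip compression entirely, at a higher bandwidth cost
bigshift --no-compression ...
```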
 
 ## Arguments
 
-Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix`, `--bq-table` and `--max-bad-records` are required.
+Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix`, `--bq-table`, `--max-bad-records`, `--steps` and `--[no-]compression` are required.
 
 ### GCP credentials
 
@@ -108,6 +110,12 @@ Because of how GCS' Transfer Service works the transferred files will have exact
 
 By default the BigQuery table ID will be the same as the Redshift table name, but the optional argument `--bq-table` can be used to tell BigShift to use another table ID.
 
+### Running only some steps
+
+Using the `--steps` argument it's possible to skip some parts of the transfer, or resume a failed transfer. The default is `--steps unload,transfer,load,cleanup`, but using for example `--steps unload,transfer` would dump the table to S3 and transfer the files and then stop.
+
+Another case might be if for some reason the BigShift process was killed during the transfer step. The transfer will still run in GCS, and you might not want to start over from scratch: it takes a long time to unload a big table, and an even longer time to transfer it, not to mention the bandwidth costs. You can then run the same command again, but add `--steps load,cleanup` to the arguments to skip the unloading and transferring steps.
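For example, if the unload and transfer already completed before the process was killed, a sketch of the resumed run looks like this (`...` stands for the same arguments as the original run):

```
# skip the unload and transfer steps; only load into BigQuery, then clean up
bigshift --steps load,cleanup ...
```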
+
 # How does it work?
 
 There are four main pieces to BigShift: the Redshift unloader, the transfer, the BigQuery load and the schema translation.
@@ -151,6 +159,16 @@ The certificates used by the Google APIs might not be installed on your system,
 export SSL_CERT_FILE="$(find $GEM_HOME/gems -name 'google-api-client-*' | tail -n 1)/lib/cacerts.pem"
 ```
 
+### BigQuery says my files are not splittable and too large
+
+For example:
+
+> Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 5838980665. Max allowed size is: 4294967296. Filename: gs://bigshift/foo/bar/foo-bar-0039_part_00.gz
+
+This happens when the (compressed) files exceed 4 GiB in size. Unfortunately it is not possible to control the size of the files produced by Redshift's `UNLOAD` command, and the size of the files will depend on the number of nodes in your cluster and the amount of data you're dumping.
+
+There are two options: either you use BigShift to get the dumps to GCS and then manually uncompress and load them (use `--steps unload,transfer`), or you dump without compression (use `--no-compression`). Keep in mind that without compression the bandwidth costs will be significantly higher.
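As an illustration only (not a procedure from the README), the manual route could look roughly like the following, using standard `gsutil` and `gunzip` commands. The bucket and prefix are placeholders, and the compressed originals are removed so that the subsequent load only matches the uncompressed files.

```
# download the compressed dumps from GCS (bucket and prefix are placeholders)
gsutil -m cp 'gs://my-cs-bucket/my_db/my_table/*.gz' .

# decompress locally and upload the uncompressed files back to the same prefix
gunzip *.gz
gsutil -m cp my_db-my_table-* 'gs://my-cs-bucket/my_db/my_table/'

# remove the compressed originals so they aren't picked up by the load as well
gsutil -m rm 'gs://my-cs-bucket/my_db/my_table/*.gz'

# finish with the remaining BigShift steps ("..." = the original arguments)
bigshift --steps load,cleanup ...
```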
+
 ### I get errors when the data is loaded into BigQuery
 
 This could be anything, but it could be things that aren't escaped properly when the data is dumped from Redshift. Try figuring out from the errors where the problem is and what the data looks like and open an issue. The more you can figure out yourself the more likely it is that you will get help. No one wants to trawl through your data, make an effort.
@@ -34,38 +34,66 @@ module BigShift
 
     private
 
+    def run?(step)
+      @config[:steps].include?(step)
+    end
+
     def setup
       @config = parse_args(@argv)
       @factory = @factory_factory.call(@config)
+      @logger = @factory.logger
     end
 
     def unload
-      s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}/"
-      @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: false)
-      @unload_manifest = UnloadManifest.new(@factory.s3_resource, @config[:s3_bucket_name], "#{s3_table_prefix}/")
+      if run?(:unload)
+        s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}"
+        @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: false, compression: @config[:compression])
+      else
+        @logger.debug('Skipping unload')
+      end
+      @unload_manifest = @factory.create_unload_manifest(@config[:s3_bucket_name], s3_table_prefix)
     end
 
     def transfer
-      description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
-      @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
+      if run?(:transfer)
+        description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
+        @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
+      else
+        @logger.debug('Skipping transfer')
+      end
     end
 
     def load
-      rs_table_schema = @factory.redshift_table_schema
-      bq_dataset = @factory.big_query_dataset
-      bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
-      gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}/*"
-      options = {}
-      options[:schema] = rs_table_schema.to_big_query
-      options[:allow_overwrite] = true
-      options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
-      bq_table.load(gcs_uri, options)
+      if run?(:load)
+        rs_table_schema = @factory.redshift_table_schema
+        bq_dataset = @factory.big_query_dataset
+        bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
+        gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}*"
+        options = {}
+        options[:schema] = rs_table_schema.to_big_query
+        options[:allow_overwrite] = true
+        options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
+        bq_table.load(gcs_uri, options)
+      else
+        @logger.debug('Skipping load')
+      end
     end
 
     def cleanup
-      @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
+      if run?(:cleanup)
+        @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
+      else
+        @logger.debug('Skipping cleanup')
+      end
     end
 
+    STEPS = [
+      :unload,
+      :transfer,
+      :load,
+      :cleanup
+    ].freeze
+
     ARGUMENTS = [
       ['--gcp-credentials', 'PATH', String, :gcp_credentials_path, :required],
       ['--aws-credentials', 'PATH', String, :aws_credentials_path, nil],
@@ -78,6 +106,8 @@ module BigShift
       ['--s3-prefix', 'PREFIX', String, :s3_prefix, nil],
       ['--cs-bucket', 'BUCKET_NAME', String, :cs_bucket_name, :required],
       ['--max-bad-records', 'N', Integer, :max_bad_records, nil],
+      ['--steps', 'STEPS', Array, :steps, nil],
+      ['--[no-]compression', nil, nil, :compression, nil],
     ]
 
     def parse_args(argv)
@@ -106,6 +136,11 @@ module BigShift
         end
       end
       config[:bq_table_id] ||= config[:rs_table_name]
+      if config[:steps] && !config[:steps].empty?
+        config[:steps] = STEPS.select { |s| config[:steps].include?(s.to_s) }
+      else
+        config[:steps] = STEPS
+      end
       unless config_errors.empty?
         raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
       end
@@ -113,12 +148,16 @@ module BigShift
     end
 
     def s3_table_prefix
-      components = @config.values_at(:rs_database_name, :rs_table_name)
-      if (prefix = @config[:s3_prefix])
-        prefix = prefix.gsub(%r{\A/|/\Z}, '')
-        components.unshift(prefix)
+      @s3_table_prefix ||= begin
+        db_name = @config[:rs_database_name]
+        table_name = @config[:rs_table_name]
+        prefix = "#{db_name}/#{table_name}/#{db_name}-#{table_name}-"
+        if (s3_prefix = @config[:s3_prefix])
+          s3_prefix = s3_prefix.gsub(%r{\A/|/\Z}, '')
+          prefix = "#{s3_prefix}/#{prefix}"
+        end
+        prefix
       end
-      File.join(*components)
     end
   end
 
@@ -154,12 +193,16 @@ module BigShift
       )
     end
 
-    private
-
     def logger
       @logger ||= Logger.new($stderr)
     end
 
+    def create_unload_manifest(s3_bucket_name, s3_table_prefix)
+      UnloadManifest.new(s3_resource, cs_service, @config[:s3_bucket_name], s3_table_prefix)
+    end
+
+    private
+
     def rs_connection
       @rs_connection ||= PG.connect(
         host: @config[:rs_credentials]['host'],
@@ -15,6 +15,7 @@ module BigShift
       transfer_job = @storage_transfer_service.create_transfer_job(transfer_job)
       @logger.info(sprintf('Transferring %d objects (%.2f GiB) from s3://%s/%s to gs://%s/%s', unload_manifest.count, unload_manifest.total_file_size.to_f/2**30, unload_manifest.bucket_name, unload_manifest.prefix, cloud_storage_bucket, unload_manifest.prefix))
       await_completion(transfer_job, poll_interval)
+      validate_transfer(unload_manifest, cloud_storage_bucket)
       nil
     end
 
@@ -45,7 +46,8 @@ module BigShift
             bucket_name: cloud_storage_bucket
           ),
           object_conditions: Google::Apis::StoragetransferV1::ObjectConditions.new(
-            include_prefixes: unload_manifest.keys,
+            include_prefixes: [unload_manifest.prefix],
+            exclude_prefixes: [unload_manifest.manifest_key]
           ),
           transfer_options: Google::Apis::StoragetransferV1::TransferOptions.new(
             overwrite_objects_already_existing_in_sink: !!allow_overwrite
@@ -100,5 +102,10 @@ module BigShift
         @logger.info(message)
       end
     end
+
+    def validate_transfer(unload_manifest, cloud_storage_bucket)
+      unload_manifest.validate_transfer(cloud_storage_bucket)
+      @logger.info('Transfer validated, all file sizes match')
+    end
   end
 end
@@ -17,8 +17,8 @@ module BigShift
       unload_sql << %Q< TO '#{s3_uri}'>
       unload_sql << %Q< CREDENTIALS '#{credentials_string}'>
       unload_sql << %q< MANIFEST>
-      unload_sql << %q< GZIP>
       unload_sql << %q< DELIMITER '\t'>
+      unload_sql << %q< GZIP> if options.fetch(:compression, true)
       unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
       @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
       @redshift_connection.exec(unload_sql)
@@ -1,9 +1,12 @@
 module BigShift
+  TransferValidationError = Class.new(BigShiftError)
+
   class UnloadManifest
     attr_reader :bucket_name, :prefix, :manifest_key
 
-    def initialize(s3_resource, bucket_name, prefix)
+    def initialize(s3_resource, cs_service, bucket_name, prefix)
       @s3_resource = s3_resource
+      @cs_service = cs_service
       @bucket_name = bucket_name
       @prefix = prefix
       @manifest_key = "#{@prefix}manifest"
@@ -23,14 +26,43 @@ module BigShift
     end
 
     def total_file_size
-      @total_file_size ||= begin
+      @total_file_size ||= file_sizes.values.reduce(:+)
+    end
+
+    def validate_transfer(cs_bucket_name)
+      objects = @cs_service.list_objects(cs_bucket_name, prefix: @prefix)
+      cs_file_sizes = objects.items.each_with_object({}) do |item, acc|
+        acc[item.name] = item.size.to_i
+      end
+      missing_files = (file_sizes.keys - cs_file_sizes.keys)
+      extra_files = cs_file_sizes.keys - file_sizes.keys
+      common_files = (cs_file_sizes.keys & file_sizes.keys)
+      size_mismatches = common_files.select { |name| file_sizes[name] != cs_file_sizes[name] }
+      errors = []
+      unless missing_files.empty?
+        errors << "missing files: #{missing_files.join(', ')}"
+      end
+      unless extra_files.empty?
+        errors << "extra files: #{extra_files.join(', ')}"
+      end
+      unless size_mismatches.empty?
+        messages = size_mismatches.map { |name| sprintf('%s (%d != %d)', name, cs_file_sizes[name], file_sizes[name]) }
+        errors << "size mismatches: #{messages.join(', ')}"
+      end
+      unless errors.empty?
+        raise TransferValidationError, "Transferred files don't match unload manifest: #{errors.join('; ')}"
+      end
+    end
+
+    private
+
+    def file_sizes
+      @file_sizes ||= begin
         bucket = @s3_resource.bucket(@bucket_name)
         objects = bucket.objects(prefix: @prefix)
-        objects.reduce(0) do |sum, object|
+        objects.each_with_object({}) do |object, acc|
           if keys.include?(object.key)
-            sum + object.size
-          else
-            sum
+            acc[object.key] = object.size
           end
         end
       end
@@ -1,3 +1,3 @@
 module BigShift
-  VERSION = '0.2.0'.freeze
+  VERSION = '0.3.0'.freeze
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bigshift
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 platform: ruby
 authors:
 - Theo Hultberg
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-04-14 00:00:00.000000000 Z
+date: 2016-05-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pg