bigshift 0.3.2 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: b0a1088f4c4d8c66c8af35a4e67a8377d8b6f805
- data.tar.gz: bff611a9528b2b08a3587177cb3a448b7dbea4de
+ metadata.gz: 60b5d96dd6c068f548e07446f970f8520a332516
+ data.tar.gz: 8216e56b7ce9ae684e1d17e2fb89a64c77a4157b
  SHA512:
- metadata.gz: 045ba2e30068a4259ac34763f3597d7087053ad8889443c077643576e5cb8df55ad02d0a01e50577b4865afa130a68311188bd4db05bc6c34a632d2ab9bfe39d
- data.tar.gz: 31cf2ec5852d2a1c200398a089b1be9c3093abb5e17a90cab3f130f43ef50df5d0c5749f117651f4ddc123d6504397fa1602a02dcf0e99d4fed511d320b803cf
+ metadata.gz: f1fbeea6fcb26d64a3416f376b5b324fcd086f2c62d1878589eefd9be18437a8ae2ea9d47116cd485ff0b86dd0c263ba9ad797f135581968ff237edd7d1e939b
+ data.tar.gz: 763dd96b31254c70b1d596500ee0ed28f7f9c4025ff3272051b9ffb018618742908ff52f0b0c35bac887ba0f970b0f791091138d293009364673d3c370efa18a
data/README.md CHANGED
@@ -36,7 +36,11 @@ Running `bigshift` without any arguments, or with `--help` will show the options
 
  ### GCP credentials
 
- The `--gcp-credentials` argument must be a path to a JSON file that contains a public/private key pair for a GCP user. The best way to obtain this is to create a new service account and chose JSON as the key type when prompted.
+ You can provide GCP credentials either with the environment variable `GOOGLE_APPLICATION_CREDENTIALS` or with the `--gcp-credentials` argument. In both cases the value must be the path to a JSON file that contains a public/private key pair for a GCP user. The best way to obtain one is to create a new service account and choose JSON as the key type when prompted. See the [GCP documentation](https://cloud.google.com/docs/authentication/production#obtaining_and_providing_service_account_credentials_manually) for more information.
+
+ If BigShift runs directly on Compute Engine, Kubernetes Engine or the App Engine flexible environment, the embedded service account is used instead. Note that the service account needs the `cloud-platform` authorization scope, as detailed in the [Storage Transfer Service documentation](https://cloud.google.com/storage-transfer/docs/create-client#scope).
+
+ If you haven't used the Storage Transfer Service with your destination bucket before, the bucket might not have the right permissions set up; see [Troubleshooting](#insufficientpermissionswhentransferringtogcs) below for more information.
 
  ### AWS credentials
 
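To sanity-check which credentials BigShift will pick up, here is a minimal sketch (not part of BigShift) that resolves the key file the same way, an explicit path first and then `GOOGLE_APPLICATION_CREDENTIALS`, and prints the account it belongs to. It only assumes the standard service-account key fields `client_email` and `project_id`:

```ruby
# Resolve the key file like BigShift does: explicit path first,
# then the GOOGLE_APPLICATION_CREDENTIALS environment variable.
require 'json'

key_path = ARGV[0] || ENV['GOOGLE_APPLICATION_CREDENTIALS']
abort 'No GCP key file given' unless key_path && File.exist?(key_path)

key = JSON.parse(File.read(key_path))
puts "Service account: #{key['client_email']}"
puts "Project:         #{key['project_id']}"
```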
@@ -163,19 +167,23 @@ The certificates used by the Google APIs might not be installed on your system,
  export SSL_CERT_FILE="$(find $GEM_HOME/gems -name 'google-api-client-*' | tail -n 1)/lib/cacerts.pem"
  ```
 
- ### BigQuery says my files are not splittable and too large
+ ### I get errors when the data is loaded into BigQuery
+
+ This could be almost anything, but a common cause is data that isn't escaped properly when it is dumped from Redshift. Try to work out from the errors where the problem is and what the data looks like, then open an issue. The more you can figure out yourself, the more likely it is that you will get help. No one wants to trawl through your data, so make an effort.
 
- For example:
+ ### Insufficient permissions when transferring to GCS
 
- > Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 5838980665. Max allowed size is: 4294967296. Filename: gs://bigshift/foo/bar/foo-bar-0039_part_00.gz
+ The Google Cloud Storage bucket needs permissions that allow the Storage Transfer Service's service account to write to it. If you haven't used the Storage Transfer Service with this bucket before, the bucket might not have the necessary permissions set up.
 
- This happens when the (compressed) files exceed 4 GiB in size. Unfortunately it is not possible to control the size of the files produced by Redshift's `UNLOAD` command, and the size of the files will depend on the number of nodes in your cluster and the amount of data you're dumping.
+ The easiest way to get the permission applied for now is to create a manual transfer request through the UI, at which point it will be added to the bucket automatically.
 
- There are two options: either you use BigShift to get the dumps to GCS and then manually uncompress and load them (use `--steps unload,transfer`) or you dump without compression (use `--no-compression`). Keep in mind that without compression the bandwidth costs will be significanly higher.
+ You can verify that this has been set up by inspecting the permissions for your bucket and checking that a user with a name like `storage-transfer-<ID>@partnercontent.gserviceaccount.com` is listed as a writer.
 
- ### I get errors when the data is loaded into BigQuery
+ If the permission on the bucket isn't there, the Storage Transfer Service won't be able to find the bucket and the transfer will fail. You might see an error like "Failed to obtain the location of the destination Google Cloud Storage (GCS) bucket due to insufficient permissions".
 
- This could be anything, but it could be things that aren't escaped properly when the data is dumped from Redshift. Try figuring out from the errors where the problem is and what the data looks like and open an issue. The more you can figure out yourself the more likely it is that you will get help. No one wants to trawl through your data, make an effort.
+ ### I get a NoMethodError: undefined method 'match' for nil:NilClass
+
+ This appears to be a bug in the AWS SDK that manifests when your [AWS credentials](#aws-credentials) have not been properly specified.
 
  # Copyright
 
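To check the destination bucket's permissions from Ruby instead of the Cloud Console, a sketch along these lines should work with the `google-api-client` gem that BigShift already depends on. The `list_bucket_access_controls` call follows the generated client's naming for `bucketAccessControls.list`, and the bucket name argument is a placeholder for your own:

```ruby
# Sketch: list the ACL entries on the destination bucket and look for the
# Storage Transfer Service account set up as a writer. Assumes the credentials
# in GOOGLE_APPLICATION_CREDENTIALS are allowed to read the bucket's ACL.
require 'google/apis/storage_v1'
require 'googleauth'

storage = Google::Apis::StorageV1::StorageService.new
storage.authorization = Google::Auth.get_application_default(
  [Google::Apis::StorageV1::AUTH_CLOUD_PLATFORM]
)

bucket_name = ARGV.fetch(0) # your destination GCS bucket
acls = storage.list_bucket_access_controls(bucket_name)
writer = Array(acls.items).find do |acl|
  acl.entity.include?('partnercontent.gserviceaccount.com') && acl.role == 'WRITER'
end

if writer
  puts "Transfer service account found: #{writer.entity}"
else
  puts 'No storage-transfer writer found; create a manual transfer through the UI first'
end
```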
@@ -1,7 +1,8 @@
  require 'google/apis/bigquery_v2'
  require 'google/apis/storagetransfer_v1'
  require 'google/apis/storage_v1'
- require 'aws-sdk'
+ require 'google/cloud/env'
+ require 'aws-sdk-s3'
 
  module BigShift
  BigShiftError = Class.new(StandardError)
@@ -18,6 +18,7 @@ module BigShift
  load_configuration[:source_format] = 'CSV'
  load_configuration[:field_delimiter] = '\t'
  load_configuration[:quote] = '"'
+ load_configuration[:allow_quoted_newlines] = true
  load_configuration[:destination_table] = @table_data.table_reference
  load_configuration[:max_bad_records] = options[:max_bad_records] if options[:max_bad_records]
  job = Google::Apis::BigqueryV2::Job.new(
@@ -36,11 +37,7 @@ module BigShift
  else
  job.status.errors.each do |error|
  message = %<Load error: "#{error.message}">
- if error.location
- file, line, field = error.location.split('/').map { |s| s.split(':').last.strip }
- message << " at file #{file}, line #{line}"
- message << ", field #{field}" if field
- end
+ message << " in #{error.location}" if error.location
  @logger.debug(message)
  end
  raise job.status.error_result.message
@@ -25,12 +25,18 @@ module BigShift
  end
 
  def run
- setup
- unload
- transfer
- load
- cleanup
- nil
+ begin
+ setup
+ unload
+ transfer
+ load
+ cleanup
+ nil
+ rescue Aws::Errors::MissingRegionError, Aws::Sigv4::Errors::MissingCredentialsError => e
+ raise CliError.new('AWS configuration missing or malformed: ' + e.message, e.backtrace, @usage)
+ rescue Signet::AuthorizationError => e
+ raise CliError.new('GCP configuration missing or malformed: ' + e.message, e.backtrace, @usage)
+ end
  end
 
  private
@@ -43,12 +49,15 @@ module BigShift
  @config = parse_args(@argv)
  @factory = @factory_factory.call(@config)
  @logger = @factory.logger
+ @logger.debug('Setup complete')
  end
 
  def unload
  if run?(:unload)
+ @logger.debug('Running unload')
  s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}"
  @factory.redshift_unloader.unload_to(@config[:rs_schema_name], @config[:rs_table_name], s3_uri, allow_overwrite: false, compression: @config[:compression])
+ @logger.debug('Unload complete')
  else
  @logger.debug('Skipping unload')
  end
@@ -57,8 +66,10 @@ module BigShift
 
  def transfer
  if run?(:transfer)
+ @logger.debug('Running transfer')
  description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_schema_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
  @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
+ @logger.debug('Transfer complete')
  else
  @logger.debug('Skipping transfer')
  end
@@ -66,6 +77,7 @@ module BigShift
 
  def load
  if run?(:load)
+ @logger.debug('Querying Redshift schema')
  rs_table_schema = @factory.redshift_table_schema
  bq_dataset = @factory.big_query_dataset
  bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
@@ -74,7 +86,9 @@ module BigShift
  options[:schema] = rs_table_schema.to_big_query
  options[:allow_overwrite] = true
  options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
+ @logger.debug('Running load')
  bq_table.load(gcs_uri, options)
+ @logger.debug('Load complete')
  else
  @logger.debug('Skipping load')
  end
@@ -82,7 +96,9 @@ module BigShift
 
  def cleanup
  if run?(:cleanup)
+ @logger.debug('Running cleanup')
  @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
+ @logger.debug('Cleanup complete')
  else
  @logger.debug('Skipping cleanup')
  end
@@ -96,7 +112,7 @@ module BigShift
  ].freeze
 
  ARGUMENTS = [
- ['--gcp-credentials', 'PATH', String, :gcp_credentials_path, :required],
+ ['--gcp-credentials', 'PATH', String, :gcp_credentials_path, nil],
  ['--aws-credentials', 'PATH', String, :aws_credentials_path, nil],
  ['--rs-credentials', 'PATH', String, :rs_credentials_path, :required],
  ['--rs-database', 'DB_NAME', String, :rs_database_name, :required],
@@ -125,6 +141,9 @@ module BigShift
  rescue OptionParser::InvalidOption => e
  config_errors << e.message
  end
+ if !config[:gcp_credentials_path] && ENV['GOOGLE_APPLICATION_CREDENTIALS']
+ config[:gcp_credentials_path] = ENV['GOOGLE_APPLICATION_CREDENTIALS']
+ end
  %w[gcp aws rs].each do |prefix|
  if (path = config["#{prefix}_credentials_path".to_sym]) && File.exist?(path)
  config["#{prefix}_credentials".to_sym] = YAML.load(File.read(path))
@@ -144,8 +163,9 @@ module BigShift
  else
  config[:steps] = STEPS
  end
+ @usage = parser.to_s
  unless config_errors.empty?
- raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
+ raise CliError.new('Configuration missing or malformed', config_errors, @usage)
  end
  config
  end
@@ -171,19 +191,19 @@ module BigShift
  end
 
  def redshift_unloader
- @redshift_unloader ||= RedshiftUnloader.new(rs_connection, aws_credentials, logger: logger)
+ @redshift_unloader ||= RedshiftUnloader.new(create_rs_connection, aws_credentials, logger: logger)
  end
 
  def cloud_storage_transfer
- @cloud_storage_transfer ||= CloudStorageTransfer.new(cs_transfer_service, raw_gcp_credentials['project_id'], aws_credentials, logger: logger)
+ @cloud_storage_transfer ||= CloudStorageTransfer.new(cs_transfer_service, gcp_project, aws_credentials, logger: logger)
  end
 
  def redshift_table_schema
- @redshift_table_schema ||= RedshiftTableSchema.new(@config[:rs_schema_name], @config[:rs_table_name], rs_connection)
+ @redshift_table_schema ||= RedshiftTableSchema.new(@config[:rs_schema_name], @config[:rs_table_name], create_rs_connection)
  end
 
  def big_query_dataset
- @big_query_dataset ||= BigQuery::Dataset.new(bq_service, raw_gcp_credentials['project_id'], @config[:bq_dataset_id], logger: logger)
+ @big_query_dataset ||= BigQuery::Dataset.new(bq_service, gcp_project, @config[:bq_dataset_id], logger: logger)
  end
 
  def cleaner
@@ -207,8 +227,8 @@ module BigShift
 
  private
 
- def rs_connection
- @rs_connection ||= PG.connect(
+ def create_rs_connection
+ rs_connection = PG.connect(
  host: @config[:rs_credentials]['host'],
  port: @config[:rs_credentials]['port'],
  dbname: @config[:rs_database_name],
@@ -216,13 +236,13 @@ module BigShift
  password: @config[:rs_credentials]['password'],
  sslmode: 'require'
  )
- socket = Socket.for_fd(@rs_connection.socket)
+ socket = Socket.for_fd(rs_connection.socket)
  socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, 1)
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT, 5)
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, 2)
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE, 2) if defined?(Socket::TCP_KEEPIDLE)
- @rs_connection.exec("SET search_path = \"#{@config[:rs_schema_name]}\"")
- @rs_connection
+ rs_connection.exec("SET search_path = \"#{@config[:rs_schema_name]}\"")
+ rs_connection
  end
 
  def cs_transfer_service
@@ -254,29 +274,44 @@ module BigShift
  if @config[:aws_credentials]
  credentials = Aws::Credentials.new(*@config[:aws_credentials].values_at('access_key_id', 'secret_access_key'))
  else
- credentials = nil
- end
- if (credentials = Aws::CredentialProviderChain.new(credentials).resolve)
- credentials
- else
- raise 'No AWS credentials found'
+ credentials = Aws::CredentialProviderChain.new.resolve
  end
  end
  end
 
  def aws_region
- @aws_region ||= ((awsc = @config[:aws_credentials]) && awsc['region']) || ENV['AWS_REGION'] || ENV['AWS_DEFAULT_REGION']
+ @aws_region ||= begin
+ if @config[:aws_credentials]
+ region = @config[:aws_credentials]['region']
+ else
+ region = ENV['AWS_REGION'] || ENV['AWS_DEFAULT_REGION']
+ end
+
+ if !region
+ raise BigShiftError.new('AWS Region not specified')
+ end
+ end
  end
 
- def raw_gcp_credentials
- @config[:gcp_credentials]
+ def gcp_project
+ if @config[:gcp_credentials]
+ @config[:gcp_credentials]['project_id']
+ else
+ Google::Cloud.env.project_id
+ end
  end
 
  def gcp_credentials
- @gcp_credentials ||= Google::Auth::ServiceAccountCredentials.make_creds(
- json_key_io: StringIO.new(JSON.dump(raw_gcp_credentials)),
- scope: Google::Apis::StoragetransferV1::AUTH_CLOUD_PLATFORM
- )
+ @gcp_credentials ||= begin
+ if @config[:gcp_credentials]
+ credentials = Google::Auth::ServiceAccountCredentials.make_creds(
+ json_key_io: StringIO.new(JSON.dump(@config[:gcp_credentials])),
+ scope: Google::Apis::StoragetransferV1::AUTH_CLOUD_PLATFORM
+ )
+ else
+ credentials = Google::Auth::GCECredentials.new
+ end
+ end
  end
  end
  end
@@ -24,15 +24,15 @@ module BigShift
  DEFAULT_POLL_INTERVAL = 30
 
  def create_transfer_job(unload_manifest, cloud_storage_bucket, description, allow_overwrite)
- now = @clock.now.utc
+ soon = @clock.now.utc + 60
  Google::Apis::StoragetransferV1::TransferJob.new(
  description: description,
  project_id: @project_id,
  status: 'ENABLED',
  schedule: Google::Apis::StoragetransferV1::Schedule.new(
- schedule_start_date: Google::Apis::StoragetransferV1::Date.new(year: now.year, month: now.month, day: now.day),
- schedule_end_date: Google::Apis::StoragetransferV1::Date.new(year: now.year, month: now.month, day: now.day),
- start_time_of_day: Google::Apis::StoragetransferV1::TimeOfDay.new(hours: now.hour, minutes: now.min + 1)
+ schedule_start_date: Google::Apis::StoragetransferV1::Date.new(year: soon.year, month: soon.month, day: soon.day),
+ schedule_end_date: Google::Apis::StoragetransferV1::Date.new(year: soon.year, month: soon.month, day: soon.day),
+ start_time_of_day: Google::Apis::StoragetransferV1::TimeOfDay.new(hours: soon.hour, minutes: soon.min)
  ),
  transfer_spec: Google::Apis::StoragetransferV1::TransferSpec.new(
  aws_s3_data_source: Google::Apis::StoragetransferV1::AwsS3Data.new(
@@ -8,7 +8,17 @@ module BigShift
 
  def columns
  @columns ||= begin
- rows = @redshift_connection.exec_params(%|SELECT "column", "type", "notnull" FROM "pg_table_def" WHERE "schemaname" = $1 AND "tablename" = $2|, [@schema_name, @table_name])
+ query = %{
+ SELECT "column", "type", "notnull"
+ FROM pg_table_def ptd, information_schema.columns isc
+ WHERE ptd.schemaname = isc.table_schema
+ AND ptd.tablename = isc.table_name
+ AND ptd.column = isc.column_name
+ AND schemaname = $1
+ AND tablename = $2
+ ORDER BY ordinal_position
+ }.gsub(/\s+/, ' ').strip
+ rows = @redshift_connection.exec_params(query, [@schema_name, @table_name])
  if rows.count == 0
  raise sprintf('Table %s for schema %s not found', @table_name.inspect, @schema_name.inspect)
  else
@@ -18,7 +28,6 @@ module BigShift
  nullable = row['notnull'] == 'f'
  Column.new(name, type, nullable)
  end
- columns.sort_by!(&:name)
  columns
  end
  end
@@ -51,12 +60,10 @@ module BigShift
 
  def to_sql
  case @type
- when /^numeric/, /int/, /^double/, 'real'
+ when /^numeric/, /int/, /^double/, 'real', /^timestamp/
  sprintf('"%s"', @name)
  when /^character/
  sprintf(%q<('"' || REPLACE(REPLACE(REPLACE("%s", '"', '""'), '\\n', '\\\\n'), '\\r', '\\\\r') || '"')>, @name)
- when /^timestamp/
- sprintf('(EXTRACT(epoch FROM "%s") + EXTRACT(milliseconds FROM "%s")/1000.0)', @name, @name)
  when 'date'
  sprintf(%q<(TO_CHAR("%s", 'YYYY-MM-DD'))>, @name)
  when 'boolean'
@@ -20,6 +20,7 @@ module BigShift
  unload_sql << %q< DELIMITER '\t'>
  unload_sql << %q< GZIP> if options[:compression] || options[:compression].nil?
  unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
+ unload_sql << %q< MAXFILESIZE 3.9 GB>
  @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
  @redshift_connection.exec(unload_sql)
  @logger.info(sprintf('Unload of %s complete', table_name))
@@ -1,3 +1,3 @@
  module BigShift
- VERSION = '0.3.2'.freeze
+ VERSION = '0.4.0'.freeze
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: bigshift
  version: !ruby/object:Gem::Version
- version: 0.3.2
+ version: 0.4.0
  platform: ruby
  authors:
  - Theo Hultberg
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-08-19 00:00:00.000000000 Z
+ date: 2019-01-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: pg
@@ -53,7 +53,21 @@ dependencies:
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
- name: aws-sdk
+ name: google-cloud-env
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ - !ruby/object:Gem::Dependency
+ name: aws-sdk-s3
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
@@ -112,7 +126,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.4.5
+ rubygems_version: 2.6.14
  signing_key:
  specification_version: 4
  summary: A tool for moving tables from Redshift to BigQuery