bigshift 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +65 -6
- data/lib/bigshift.rb +4 -0
- data/lib/bigshift/big_query/table.rb +1 -0
- data/lib/bigshift/cleaner.rb +31 -0
- data/lib/bigshift/cli.rb +69 -27
- data/lib/bigshift/cloud_storage_transfer.rb +8 -8
- data/lib/bigshift/redshift_unloader.rb +4 -2
- data/lib/bigshift/unload_manifest.rb +39 -0
- data/lib/bigshift/version.rb +1 -1
- metadata +18 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fc84facadd8de03293a5ba461bce6653bb3f00aa
+  data.tar.gz: deb0e103ae33b5a9627feb3aa4ac617cfa54e342
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ec259abd928ad95999f64fa9765776c659113a373257d840874d9864ff571bdec0744efa756d3aaf62c7599a5c689de5ca9cf77d66e04a441a4b0d22cdbb833e
+  data.tar.gz: 04cbba86814f2526260f24a4c6583180e55edb4faf6ef7b20a96a0b961ad48586b36c1145af4f49ae06f9735fe2a0c98654433b7ec79bd0520fd5d0d7924935b
data/README.md
CHANGED
@@ -1,5 +1,9 @@
 # BigShift
 
+[![Build Status](https://travis-ci.org/iconara/bigshift.png?branch=master)](https://travis-ci.org/iconara/bigshift)
+
+_If you're reading this on GitHub, please note that this is the readme for the development version and that some features described here might not yet have been released. You can find the readme for a specific version either through [rubydoc.info](http://rubydoc.info/find/gems?q=bigshift) or via the release tags ([here is an example](https://github.com/iconara/bigshift/tree/v0.1.1))._
+
 BigShift is a tool for moving tables from Redshift to BigQuery. It will create a table in BigQuery with a schema that matches the Redshift table, dump the data to S3, transfer it to GCS and finally load it into the BigQuery table.
 
 # Installation
@@ -18,9 +22,15 @@ The main interface to BigShift is the `bigshift` command line tool.
 
 BigShift can also be used as a library in a Ruby application. Look at the tests, and how the `bigshift` tool is built to figure out how.
 
+## Cost
+
+Please note that transferring large amounts of data between AWS and GCP is not free. [AWS charges for outgoing traffic from S3](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). There are also storage charges for the Redshift dumps on S3 and GCS, but since they are kept only until the BigQuery table has been loaded those should be negligible.
+
+BigShift tells Redshift to compress the dumps, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost.
+
 ## Arguments
 
-Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix` are required.
+Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix`, `--bq-table` and `--max-bad-records` are required.
 
 ### GCP credentials
 
@@ -28,16 +38,54 @@ The `--gcp-credentials` argument must be a path to a JSON file that contains a p
 
 ### AWS credentials
 
-
+You can provide AWS credentials the same way that you can for the AWS SDK, that is with environment variables and files in specific locations in the file system, etc. See the [AWS SDK documentation](http://aws.amazon.com/documentation/sdk-for-ruby/) for more information. You can't use temporary credentials, like instance role credentials, unfortunately, because GCS Transfer Service doesn't support session tokens.
+
+You can also use the optional `--aws-credentials` argument to point to a JSON or YAML file that contains `access_key_id` and `secret_access_key`, and optionally `region`.
 
 ```yaml
 ---
-
-
+access_key_id: AKXYZABC123FOOBARBAZ
+secret_access_key: eW91ZmlndXJlZG91dGl0d2FzYmFzZTY0ISEhCg
+region: eu-west-1
 ```
 
 These credentials need to be allowed to read and write the S3 location you specify with `--s3-bucket` and `--s3-prefix`.
 
+Here is a minimal IAM policy that should work:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Action": [
+        "s3:GetObject",
+        "s3:PutObject",
+        "s3:DeleteObject"
+      ],
+      "Resource": [
+        "arn:aws:s3:::THE-NAME-OF-THE-BUCKET/THE/PREFIX/*"
+      ],
+      "Effect": "Allow"
+    },
+    {
+      "Action": [
+        "s3:ListBucket",
+        "s3:GetBucketLocation"
+      ],
+      "Resource": [
+        "arn:aws:s3:::THE-NAME-OF-THE-BUCKET"
+      ],
+      "Effect": "Allow"
+    }
+  ]
+}
+```
+
+If you set `THE-NAME-OF-THE-BUCKET` to the same value as `--s3-bucket` and `THE/PREFIX` to the same value as `--s3-prefix` you're limiting the damage that BigShift can do, and unless you store something else at that location there is very little damage to be done.
+
+It is _strongly_ recommended that you create a specific IAM user with minimal permissions for use with BigShift. The nature of GCS Transfer Service means that these credentials are sent to and stored in GCP. The credentials are also used in the `UNLOAD` command sent to Redshift, and with the AWS SDK to work with the objects on S3.
+
 ### Redshift credentials
 
 The `--rs-credentials` argument must be a path to a JSON or YAML file that contains the `host` and `port` of the Redshift cluster, as well as the `username` and `password` required to connect.
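As a rough illustration of how a credentials file like the YAML above maps onto the AWS SDK, here is a minimal sketch based on the `aws_credentials` factory method added to `lib/bigshift/cli.rb` further down in this diff; the file name is a hypothetical stand-in for whatever you pass to `--aws-credentials`:

```ruby
require 'yaml'
require 'aws-sdk'

# Hypothetical path; use the file you pass to --aws-credentials
config = YAML.load_file('aws-credentials.yml')

# Only the key id and secret are handed to the SDK; the optional region is
# used separately when the S3 resource is built
credentials = Aws::Credentials.new(*config.values_at('access_key_id', 'secret_access_key'))
s3 = Aws::S3::Resource.new(region: config['region'], credentials: credentials)
```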
@@ -50,6 +98,16 @@ username: my_redshift_user
 password: dGhpc2lzYWxzb2Jhc2U2NAo
 ```
 
+### S3 prefix
+
+If you don't want to put the data dumped from Redshift directly into the root of the S3 bucket you can use the `--s3-prefix` to provide a prefix to where the dumps should be placed.
+
+Because of how GCS' Transfer Service works, the transferred files will have exactly the same keys in the destination bucket; this cannot be configured.
+
+### BigQuery table ID
+
+By default the BigQuery table ID will be the same as the Redshift table name, but the optional argument `--bq-table` can be used to tell BigShift to use another table ID.
+
 # How does it work?
 
 There are four main pieces to BigShift: the Redshift unloader, the transfer, the BigQuery load and the schema translation.
@@ -74,11 +132,12 @@ Once the data is in GCS, the BigQuery table can be created and loaded. At this p
 
 `NOT NULL` becomes `REQUIRED` in BigQuery, and `NULL` becomes `NULLABLE`.
 
+Finally, once the BigQuery table has been loaded BigShift will remove the data it dumped to S3 and the data it transferred to Cloud Storage.
+
 # What doesn't it do?
 
-* Currently BigShift doesn't delete the dumped table from S3 or from GCS. This is planned.
 * BigShift can't currently append to an existing BigQuery table. This feature would be possible to add.
-* The tool will
+* The tool will truncate the target table before loading the transferred data to it. This is convenient if you want to move the same data multiple times, but can also be considered very scary and unsafe. It would be possible to have options to fail if there is data in the target table, or to append to the target table.
 * There is no transformation or processing of the data. When moving to BigQuery you might want to split a string and use the pieces as values in a repeated field, but BigShift doesn't help you with that. You will almost always have to do some post processing in BigQuery once the data has been moved. Processing on the way would require a lot more complexity and involve either Hadoop or Dataflow, and that's beyond the scope of a tool like this.
 * BigShift can't move data back from BigQuery to Redshift. It can probably be done, but you would probably have to write a big part of the Redshift schema yourself since BigQuery's data model is so much simpler. Going from Redshift to BigQuery is simple: most of Redshift's datatypes map directly to one of BigQuery's, and there are no encodings, sort or dist keys to worry about. Going in the other direction the tool can't know whether a `STRING` column in BigQuery should be a `CHAR(12)` or a `VARCHAR(65535)`, whether it should be encoded as `LZO` or `BYTEDICT`, or what should be the primary, sort, and dist key of the table.
 
data/lib/bigshift.rb
CHANGED
@@ -1,5 +1,7 @@
 require 'google/apis/bigquery_v2'
 require 'google/apis/storagetransfer_v1'
+require 'google/apis/storage_v1'
+require 'aws-sdk'
 
 module BigShift
   BigShiftError = Class.new(StandardError)
@@ -27,3 +29,5 @@ require 'bigshift/big_query/table'
 require 'bigshift/redshift_table_schema'
 require 'bigshift/redshift_unloader'
 require 'bigshift/cloud_storage_transfer'
+require 'bigshift/unload_manifest'
+require 'bigshift/cleaner'

data/lib/bigshift/big_query/table.rb
CHANGED
@@ -19,6 +19,7 @@ module BigShift
         load_configuration[:field_delimiter] = '\t'
         load_configuration[:quote] = '"'
         load_configuration[:destination_table] = @table_data.table_reference
+        load_configuration[:max_bad_records] = options[:max_bad_records] if options[:max_bad_records]
         job = Google::Apis::BigqueryV2::Job.new(
           configuration: Google::Apis::BigqueryV2::JobConfiguration.new(
             load: Google::Apis::BigqueryV2::JobConfigurationLoad.new(load_configuration)
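For reference, a sketch (not the gem's exact code path) of the load configuration that reaches BigQuery when `--max-bad-records` is given; the destination table reference is a hypothetical stand-in and keys the gem sets elsewhere are omitted:

```ruby
require 'google/apis/bigquery_v2'

# Hypothetical destination; BigShift builds this from the --bq-dataset and --bq-table options
destination = Google::Apis::BigqueryV2::TableReference.new(project_id: 'my-project', dataset_id: 'my_dataset', table_id: 'my_table')

load_configuration = {
  field_delimiter: '\t',
  quote: '"',
  destination_table: destination,
  max_bad_records: 10 # only included when the option is given
}
Google::Apis::BigqueryV2::JobConfigurationLoad.new(load_configuration)
```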

data/lib/bigshift/cleaner.rb
ADDED
@@ -0,0 +1,31 @@
+module BigShift
+  class Cleaner
+    def initialize(s3_resource, cs_service, options={})
+      @s3_resource = s3_resource
+      @cs_service = cs_service
+      @logger = options[:logger] || NullLogger.new
+    end
+
+    def cleanup(unload_manifest, cs_bucket_name)
+      cleanup_s3(unload_manifest)
+      cleanup_cs(cs_bucket_name, unload_manifest)
+      nil
+    end
+
+    private
+
+    def cleanup_s3(unload_manifest)
+      objects = unload_manifest.keys.map { |k| {key: k} }
+      objects << {key: unload_manifest.manifest_key}
+      @logger.info(sprintf('Deleting %d files from s3://%s/%s (including the manifest file)', objects.size, unload_manifest.bucket_name, unload_manifest.prefix))
+      @s3_resource.bucket(unload_manifest.bucket_name).delete_objects(delete: {objects: objects})
+    end
+
+    def cleanup_cs(bucket_name, unload_manifest)
+      @logger.info(sprintf('Deleting %d files from gs://%s/%s', unload_manifest.count, bucket_name, unload_manifest.prefix))
+      unload_manifest.keys.each do |key|
+        @cs_service.delete_object(bucket_name, key)
+      end
+    end
+  end
+end
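A rough usage sketch of the new class, mirroring the `cleaner` factory method added to `lib/bigshift/cli.rb` later in this diff; the bucket names are made up and the credential objects are assumed to be set up elsewhere:

```ruby
require 'logger'
require 'bigshift'

s3 = Aws::S3::Resource.new(region: 'eu-west-1', credentials: aws_credentials)

cs = Google::Apis::StorageV1::StorageService.new
cs.authorization = gcp_credentials # an authorized GCP credential

manifest = BigShift::UnloadManifest.new(s3, 'my-s3-bucket', 'my_db/my_table/')

cleaner = BigShift::Cleaner.new(s3, cs, logger: Logger.new($stderr))
cleaner.cleanup(manifest, 'my-cs-bucket') # removes the dump from S3 and the transferred copies from GCS
```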
data/lib/bigshift/cli.rb
CHANGED
@@ -41,12 +41,13 @@ module BigShift
 
     def unload
       s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}/"
-      @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite:
+      @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: false)
+      @unload_manifest = UnloadManifest.new(@factory.s3_resource, @config[:s3_bucket_name], "#{s3_table_prefix}/")
     end
 
     def transfer
       description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
-      @factory.cloud_storage_transfer.copy_to_cloud_storage(@
+      @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
     end
 
     def load
@@ -54,30 +55,36 @@ module BigShift
       bq_dataset = @factory.big_query_dataset
       bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
       gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}/*"
-
+      options = {}
+      options[:schema] = rs_table_schema.to_big_query
+      options[:allow_overwrite] = true
+      options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
+      bq_table.load(gcs_uri, options)
     end
 
     def cleanup
+      @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
     end
 
     ARGUMENTS = [
-      ['--gcp-credentials', 'PATH', :gcp_credentials_path, :required],
-      ['--aws-credentials', 'PATH', :aws_credentials_path,
-      ['--rs-credentials', 'PATH', :rs_credentials_path, :required],
-      ['--rs-database', 'DB_NAME', :rs_database_name, :required],
-      ['--rs-table', 'TABLE_NAME', :rs_table_name, :required],
-      ['--bq-dataset', 'DATASET_ID', :bq_dataset_id, :required],
-      ['--bq-table', 'TABLE_ID', :bq_table_id,
-      ['--s3-bucket', 'BUCKET_NAME', :s3_bucket_name, :required],
-      ['--s3-prefix', 'PREFIX', :s3_prefix, nil],
-      ['--cs-bucket', 'BUCKET_NAME', :cs_bucket_name, :required],
+      ['--gcp-credentials', 'PATH', String, :gcp_credentials_path, :required],
+      ['--aws-credentials', 'PATH', String, :aws_credentials_path, nil],
+      ['--rs-credentials', 'PATH', String, :rs_credentials_path, :required],
+      ['--rs-database', 'DB_NAME', String, :rs_database_name, :required],
+      ['--rs-table', 'TABLE_NAME', String, :rs_table_name, :required],
+      ['--bq-dataset', 'DATASET_ID', String, :bq_dataset_id, :required],
+      ['--bq-table', 'TABLE_ID', String, :bq_table_id, nil],
+      ['--s3-bucket', 'BUCKET_NAME', String, :s3_bucket_name, :required],
+      ['--s3-prefix', 'PREFIX', String, :s3_prefix, nil],
+      ['--cs-bucket', 'BUCKET_NAME', String, :cs_bucket_name, :required],
+      ['--max-bad-records', 'N', Integer, :max_bad_records, nil],
     ]
 
     def parse_args(argv)
       config = {}
       parser = OptionParser.new do |p|
-        ARGUMENTS.each do |flag, value_name, config_key, _|
-          p.on("#{flag} #{value_name}") { |v| config[config_key] = v }
+        ARGUMENTS.each do |flag, value_name, type, config_key, _|
+          p.on("#{flag} #{value_name}", type) { |v| config[config_key] = v }
         end
       end
       config_errors = []
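The extra `String`/`Integer` column in `ARGUMENTS` is what lets OptionParser coerce values, so `--max-bad-records` arrives as a number rather than a string. A standalone sketch of the mechanism (not the gem's code; the flag values are made up):

```ruby
require 'optparse'

config = {}
parser = OptionParser.new do |p|
  # Passing a type to #on makes OptionParser coerce the argument
  p.on('--max-bad-records N', Integer) { |v| config[:max_bad_records] = v }
  p.on('--bq-table TABLE_ID', String) { |v| config[:bq_table_id] = v }
end
parser.parse(%w[--max-bad-records 10 --bq-table events_copy])
config # => {:max_bad_records=>10, :bq_table_id=>"events_copy"}
```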
@@ -93,11 +100,12 @@ module BigShift
           config_errors << sprintf('%s does not exist', path.inspect)
         end
       end
-      ARGUMENTS.each do |flag, _, config_key, required|
+      ARGUMENTS.each do |flag, _, _, config_key, required|
         if !config.include?(config_key) && required
           config_errors << "#{flag} is required"
         end
       end
+      config[:bq_table_id] ||= config[:rs_table_name]
       unless config_errors.empty?
         raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
       end
@@ -107,6 +115,7 @@ module BigShift
     def s3_table_prefix
       components = @config.values_at(:rs_database_name, :rs_table_name)
       if (prefix = @config[:s3_prefix])
+        prefix = prefix.gsub(%r{\A/|/\Z}, '')
         components.unshift(prefix)
       end
       File.join(*components)
@@ -123,7 +132,7 @@ module BigShift
     end
 
     def cloud_storage_transfer
-      @cloud_storage_transfer ||= CloudStorageTransfer.new(
+      @cloud_storage_transfer ||= CloudStorageTransfer.new(cs_transfer_service, raw_gcp_credentials['project_id'], aws_credentials, logger: logger)
     end
 
     def redshift_table_schema
@@ -134,6 +143,17 @@ module BigShift
       @big_query_dataset ||= BigQuery::Dataset.new(bq_service, raw_gcp_credentials['project_id'], @config[:bq_dataset_id], logger: logger)
     end
 
+    def cleaner
+      @cleaner ||= Cleaner.new(s3_resource, cs_service, logger: logger)
+    end
+
+    def s3_resource
+      @s3_resource ||= Aws::S3::Resource.new(
+        region: aws_region,
+        credentials: aws_credentials
+      )
+    end
+
     private
 
     def logger
@@ -142,24 +162,31 @@ module BigShift
 
     def rs_connection
       @rs_connection ||= PG.connect(
-        @config[:rs_credentials]['host'],
-        @config[:rs_credentials]['port'],
-
-
-        @config[:
-
-        @config[:rs_credentials]['password']
+        host: @config[:rs_credentials]['host'],
+        port: @config[:rs_credentials]['port'],
+        dbname: @config[:rs_database_name],
+        user: @config[:rs_credentials]['username'],
+        password: @config[:rs_credentials]['password'],
+        sslmode: 'require'
       )
     end
 
-    def
-    @
+    def cs_transfer_service
+      @cs_transfer_service ||= begin
         s = Google::Apis::StoragetransferV1::StoragetransferService.new
         s.authorization = gcp_credentials
         s
       end
     end
 
+    def cs_service
+      @cs_service ||= begin
+        s = Google::Apis::StorageV1::StorageService.new
+        s.authorization = gcp_credentials
+        s
+      end
+    end
+
     def bq_service
       @bq_service ||= begin
         s = Google::Apis::BigqueryV2::BigqueryService.new
@@ -169,7 +196,22 @@ module BigShift
     end
 
     def aws_credentials
-      @
+      @aws_credentials ||= begin
+        if @config[:aws_credentials]
+          credentials = Aws::Credentials.new(*@config[:aws_credentials].values_at('access_key_id', 'secret_access_key'))
+        else
+          credentials = nil
+        end
+        if (credentials = Aws::CredentialProviderChain.new(credentials).resolve)
+          credentials
+        else
+          raise 'No AWS credentials found'
+        end
+      end
+    end
+
+    def aws_region
+      @aws_region ||= ((awsc = @config[:aws_credentials]) && awsc['region']) || ENV['AWS_REGION'] || ENV['AWS_DEFAULT_REGION']
     end
 
     def raw_gcp_credentials

data/lib/bigshift/cloud_storage_transfer.rb
CHANGED
@@ -9,11 +9,11 @@ module BigShift
       @logger = options[:logger] || NullLogger::INSTANCE
     end
 
-    def copy_to_cloud_storage(
+    def copy_to_cloud_storage(unload_manifest, cloud_storage_bucket, options={})
       poll_interval = options[:poll_interval] || DEFAULT_POLL_INTERVAL
-      transfer_job = create_transfer_job(
+      transfer_job = create_transfer_job(unload_manifest, cloud_storage_bucket, options[:description], options[:allow_overwrite])
       transfer_job = @storage_transfer_service.create_transfer_job(transfer_job)
-      @logger.info(sprintf('Transferring objects from s3://%s/%s to gs://%s/%s',
+      @logger.info(sprintf('Transferring %d objects (%.2f GiB) from s3://%s/%s to gs://%s/%s', unload_manifest.count, unload_manifest.total_file_size.to_f/2**30, unload_manifest.bucket_name, unload_manifest.prefix, cloud_storage_bucket, unload_manifest.prefix))
       await_completion(transfer_job, poll_interval)
       nil
     end
@@ -22,7 +22,7 @@ module BigShift
 
     DEFAULT_POLL_INTERVAL = 30
 
-    def create_transfer_job(
+    def create_transfer_job(unload_manifest, cloud_storage_bucket, description, allow_overwrite)
       now = @clock.now.utc
       Google::Apis::StoragetransferV1::TransferJob.new(
         description: description,
@@ -35,17 +35,17 @@ module BigShift
         ),
         transfer_spec: Google::Apis::StoragetransferV1::TransferSpec.new(
           aws_s3_data_source: Google::Apis::StoragetransferV1::AwsS3Data.new(
-            bucket_name:
+            bucket_name: unload_manifest.bucket_name,
             aws_access_key: Google::Apis::StoragetransferV1::AwsAccessKey.new(
-              access_key_id: @aws_credentials
-              secret_access_key: @aws_credentials
+              access_key_id: @aws_credentials.access_key_id,
+              secret_access_key: @aws_credentials.secret_access_key,
             )
           ),
           gcs_data_sink: Google::Apis::StoragetransferV1::GcsData.new(
             bucket_name: cloud_storage_bucket
           ),
           object_conditions: Google::Apis::StoragetransferV1::ObjectConditions.new(
-            include_prefixes:
+            include_prefixes: unload_manifest.keys,
           ),
           transfer_options: Google::Apis::StoragetransferV1::TransferOptions.new(
             overwrite_objects_already_existing_in_sink: !!allow_overwrite
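Taken together with the CLI changes above, the transfer is now driven entirely by an `UnloadManifest`. A rough sketch of how the pieces fit (the project ID and bucket names are hypothetical, and the credential objects and manifest are assumed to exist already; see the factory methods in `lib/bigshift/cli.rb`):

```ruby
require 'bigshift'

storagetransfer = Google::Apis::StoragetransferV1::StoragetransferService.new
storagetransfer.authorization = gcp_credentials # an authorized GCP credential, set up elsewhere

transfer = BigShift::CloudStorageTransfer.new(storagetransfer, 'my-gcp-project', aws_credentials)
transfer.copy_to_cloud_storage(unload_manifest, 'my-cs-bucket', description: 'bigshift-my_db-my_table', allow_overwrite: false)
```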

data/lib/bigshift/redshift_unloader.rb
CHANGED
@@ -8,14 +8,16 @@ module BigShift
 
     def unload_to(table_name, s3_uri, options={})
       table_schema = RedshiftTableSchema.new(table_name, @redshift_connection)
-
+      credentials_string = "aws_access_key_id=#{@aws_credentials.access_key_id};aws_secret_access_key=#{@aws_credentials.secret_access_key}"
       select_sql = 'SELECT '
       select_sql << table_schema.columns.map(&:to_sql).join(', ')
       select_sql << %Q< FROM "#{table_name}">
       select_sql.gsub!('\'') { |s| '\\\'' }
       unload_sql = %Q<UNLOAD ('#{select_sql}')>
       unload_sql << %Q< TO '#{s3_uri}'>
-      unload_sql << %Q< CREDENTIALS '#{
+      unload_sql << %Q< CREDENTIALS '#{credentials_string}'>
+      unload_sql << %q< MANIFEST>
+      unload_sql << %q< GZIP>
       unload_sql << %q< DELIMITER '\t'>
       unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
       @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
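To make the change concrete, here is roughly the shape of the statement `unload_to` now builds; the table, bucket and column names are made up, the credentials are elided, and the real statement is assembled as a single string:

```ruby
# Illustrative only -- BigShift builds this string itself inside unload_to
unload_sql  = %q<UNLOAD ('SELECT "id", "name" FROM "my_table"')> # column list comes from RedshiftTableSchema
unload_sql << %q< TO 's3://my-bucket/my_db/my_table/'>
unload_sql << %q< CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'>
unload_sql << %q< MANIFEST> # new in 0.2.0: writes the manifest that UnloadManifest reads
unload_sql << %q< GZIP>     # new in 0.2.0: compress to keep the S3 to GCS transfer small
unload_sql << %q< DELIMITER '\t'>
```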

data/lib/bigshift/unload_manifest.rb
ADDED
@@ -0,0 +1,39 @@
+module BigShift
+  class UnloadManifest
+    attr_reader :bucket_name, :prefix, :manifest_key
+
+    def initialize(s3_resource, bucket_name, prefix)
+      @s3_resource = s3_resource
+      @bucket_name = bucket_name
+      @prefix = prefix
+      @manifest_key = "#{@prefix}manifest"
+    end
+
+    def keys
+      @keys ||= begin
+        bucket = @s3_resource.bucket(@bucket_name)
+        object = bucket.object(@manifest_key)
+        manifest = JSON.load(object.get.body)
+        manifest['entries'].map { |entry| entry['url'].sub(%r{\As3://[^/]+/}, '') }
+      end
+    end
+
+    def count
+      keys.size
+    end
+
+    def total_file_size
+      @total_file_size ||= begin
+        bucket = @s3_resource.bucket(@bucket_name)
+        objects = bucket.objects(prefix: @prefix)
+        objects.reduce(0) do |sum, object|
+          if keys.include?(object.key)
+            sum + object.size
+          else
+            sum
+          end
+        end
+      end
+    end
+  end
+end
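A small usage sketch of the new class; the bucket, prefix and credentials are hypothetical, and `UnloadManifest` only ever reads from S3:

```ruby
require 'bigshift'

s3 = Aws::S3::Resource.new(region: 'eu-west-1', credentials: aws_credentials) # credentials set up elsewhere
manifest = BigShift::UnloadManifest.new(s3, 'my-s3-bucket', 'my_db/my_table/')

manifest.manifest_key    # => "my_db/my_table/manifest"
manifest.keys            # object keys listed in the UNLOAD manifest, without the s3://bucket/ part
manifest.count           # number of dumped files
manifest.total_file_size # combined size of the dumped files, in bytes
```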
data/lib/bigshift/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bigshift
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Theo Hultberg
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-04-
+date: 2016-04-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pg
@@ -52,6 +52,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: aws-sdk
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description: |-
   BigShift is a tool for moving tables from Redshift
   to BigQuery. It will create a table in BigQuery with
@@ -71,10 +85,12 @@ files:
 - lib/bigshift.rb
 - lib/bigshift/big_query/dataset.rb
 - lib/bigshift/big_query/table.rb
+- lib/bigshift/cleaner.rb
 - lib/bigshift/cli.rb
 - lib/bigshift/cloud_storage_transfer.rb
 - lib/bigshift/redshift_table_schema.rb
 - lib/bigshift/redshift_unloader.rb
+- lib/bigshift/unload_manifest.rb
 - lib/bigshift/version.rb
 homepage: http://github.com/iconara/bigshift
 licenses: