bigshift 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 6b933f1227d7a30c817577db6ca2f1517111d0e2
- data.tar.gz: c53b1f16c4977e04c796a5f645d3d4ca600e3b13
+ metadata.gz: fc84facadd8de03293a5ba461bce6653bb3f00aa
+ data.tar.gz: deb0e103ae33b5a9627feb3aa4ac617cfa54e342
  SHA512:
- metadata.gz: dc549cf4e6ec70de381ff11118967f68c3d6868aa3892656d379d265d3669f787a81b1193b1e605c0f84f8b692e75a51c5bf45d15e68bc7b43843047c22650e0
- data.tar.gz: 3c5407a160e9389e478c2b9c2c4f8561ffdcb64d038514bf5e6b41c4dc78dc83ba412c6aa31b1a9a9a24217525eb43b3ff54622cdef95053949031a0fbf11096
+ metadata.gz: ec259abd928ad95999f64fa9765776c659113a373257d840874d9864ff571bdec0744efa756d3aaf62c7599a5c689de5ca9cf77d66e04a441a4b0d22cdbb833e
+ data.tar.gz: 04cbba86814f2526260f24a4c6583180e55edb4faf6ef7b20a96a0b961ad48586b36c1145af4f49ae06f9735fe2a0c98654433b7ec79bd0520fd5d0d7924935b
data/README.md CHANGED
@@ -1,5 +1,9 @@
  # BigShift
 
+ [![Build Status](https://travis-ci.org/iconara/bigshift.png?branch=master)](https://travis-ci.org/iconara/bigshift)
+
+ _If you're reading this on GitHub, please note that this is the readme for the development version and that some features described here might not yet have been released. You can find the readme for a specific version either through [rubydoc.info](http://rubydoc.info/find/gems?q=bigshift) or via the release tags ([here is an example](https://github.com/iconara/bigshift/tree/v0.1.1))._
+
  BigShift is a tool for moving tables from Redshift to BigQuery. It will create a table in BigQuery with a schema that matches the Redshift table, dump the data to S3, transfer it to GCS and finally load it into the BigQuery table.
 
  # Installation
@@ -18,9 +22,15 @@ The main interface to BigShift is the `bigshift` command line tool.
 
  BigShift can also be used as a library in a Ruby application. Look at the tests, and at how the `bigshift` tool is built, to figure out how; see the sketch below for the general shape.
 
+ ## Cost
+
+ Please note that transferring large amounts of data between AWS and GCP is not free. [AWS charges for outgoing traffic from S3](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). There are also storage charges for the Redshift dumps on S3 and GCS, but since they are kept only until the BigQuery table has been loaded those should be negligible.
+
+ BigShift tells Redshift to compress the dumps, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost.
+
  ## Arguments
 
- Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix` are required.
+ Running `bigshift` without any arguments, or with `--help`, will show the options. All except `--s3-prefix`, `--bq-table` and `--max-bad-records` are required.
 
  ### GCP credentials
 
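The hunk above notes that BigShift can be used as a library. Here is a rough, hedged sketch of what the 0.2.0 flow looks like when driven from Ruby, pieced together from the `cli.rb` hunks further down in this diff; the credential setup, the `RedshiftUnloader` constructor, and every concrete name (`my_database`, `my_table`, `my-s3-bucket`, `my-gcs-bucket`, `my-gcp-project`, `my_dataset`) are assumptions, not something this diff shows:

```ruby
require 'pg'
require 'googleauth'
require 'bigshift'

# Connections and credentials (all values are placeholders)
rs_connection = PG.connect(host: 'my-cluster.example.com', port: 5439,
                           dbname: 'my_database', user: 'my_user',
                           password: 'secret', sslmode: 'require')
aws_credentials = Aws::Credentials.new('AKXYZABC123FOOBARBAZ', 'my-secret-access-key')
s3_resource = Aws::S3::Resource.new(region: 'eu-west-1', credentials: aws_credentials)

# GCP credential loading is assumed here; BigShift's own loading is not part of this diff
gcp_credentials = Google::Auth::ServiceAccountCredentials.make_creds(
  json_key_io: File.open('gcp-credentials.json'),
  scope: 'https://www.googleapis.com/auth/cloud-platform'
)
bq_service = Google::Apis::BigqueryV2::BigqueryService.new
transfer_service = Google::Apis::StoragetransferV1::StoragetransferService.new
cs_service = Google::Apis::StorageV1::StorageService.new
[bq_service, transfer_service, cs_service].each { |s| s.authorization = gcp_credentials }

# 1. Unload the Redshift table to S3 and read back the UNLOAD manifest
s3_prefix = 'my_database/my_table/'
unloader = BigShift::RedshiftUnloader.new(rs_connection, aws_credentials) # constructor arguments assumed
unloader.unload_to('my_table', "s3://my-s3-bucket/#{s3_prefix}", allow_overwrite: false)
manifest = BigShift::UnloadManifest.new(s3_resource, 'my-s3-bucket', s3_prefix)

# 2. Transfer the dumped files from S3 to Cloud Storage
transfer = BigShift::CloudStorageTransfer.new(transfer_service, 'my-gcp-project', aws_credentials)
transfer.copy_to_cloud_storage(manifest, 'my-gcs-bucket', description: 'bigshift-example', allow_overwrite: false)

# 3. Create the BigQuery table if needed and load the transferred files into it
schema = BigShift::RedshiftTableSchema.new('my_table', rs_connection).to_big_query
dataset = BigShift::BigQuery::Dataset.new(bq_service, 'my-gcp-project', 'my_dataset')
table = dataset.table('my_table') || dataset.create_table('my_table')
table.load("gs://my-gcs-bucket/#{s3_prefix}*", schema: schema, allow_overwrite: true)

# 4. Remove the dumps from S3 and the transferred copies from GCS
BigShift::Cleaner.new(s3_resource, cs_service).cleanup(manifest, 'my-gcs-bucket')
```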
@@ -28,16 +38,54 @@ The `--gcp-credentials` argument must be a path to a JSON file that contains a p
 
  ### AWS credentials
 
- The `--aws-credentials` argument must be a path to a JSON or YAML file that contains `aws_access_key_id` and `aws_secret_access_key`, and optionally `token`.
+ You can provide AWS credentials the same way that you can for the AWS SDK, that is with environment variables, files in specific locations in the file system, etc. See the [AWS SDK documentation](http://aws.amazon.com/documentation/sdk-for-ruby/) for more information. Unfortunately, you can't use temporary credentials, like instance role credentials, because GCS Transfer Service doesn't support session tokens.
+
+ You can also use the optional `--aws-credentials` argument to point to a JSON or YAML file that contains `access_key_id` and `secret_access_key`, and optionally `region`.
 
  ```yaml
  ---
- aws_access_key_id: AKXYZABC123FOOBARBAZ
- aws_secret_access_key: eW91ZmlndXJlZG91dGl0d2FzYmFzZTY0ISEhCg
+ access_key_id: AKXYZABC123FOOBARBAZ
+ secret_access_key: eW91ZmlndXJlZG91dGl0d2FzYmFzZTY0ISEhCg
+ region: eu-west-1
  ```
 
  These credentials need to be allowed to read and write the S3 location you specify with `--s3-bucket` and `--s3-prefix`.
 
+ Here is a minimal IAM policy that should work:
+
+ ```json
+ {
+   "Version": "2012-10-17",
+   "Statement": [
+     {
+       "Action": [
+         "s3:GetObject",
+         "s3:PutObject",
+         "s3:DeleteObject"
+       ],
+       "Resource": [
+         "arn:aws:s3:::THE-NAME-OF-THE-BUCKET/THE/PREFIX/*"
+       ],
+       "Effect": "Allow"
+     },
+     {
+       "Action": [
+         "s3:ListBucket",
+         "s3:GetBucketLocation"
+       ],
+       "Resource": [
+         "arn:aws:s3:::THE-NAME-OF-THE-BUCKET"
+       ],
+       "Effect": "Allow"
+     }
+   ]
+ }
+ ```
+
+ If you set `THE-NAME-OF-THE-BUCKET` to the same value as `--s3-bucket` and `THE/PREFIX` to the same value as `--s3-prefix`, you're limiting the damage that BigShift can do, and unless you store something else at that location there is very little damage to be done.
+
+ It is _strongly_ recommended that you create a specific IAM user with minimal permissions for use with BigShift. The nature of GCS Transfer Service means that these credentials are sent to and stored in GCP. The credentials are also used in the `UNLOAD` command sent to Redshift, and with the AWS SDK to work with the objects on S3.
+
  ### Redshift credentials
 
  The `--rs-credentials` argument must be a path to a JSON or YAML file that contains the `host` and `port` of the Redshift cluster, as well as the `username` and `password` required to connect.
@@ -50,6 +98,16 @@ username: my_redshift_user
  password: dGhpc2lzYWxzb2Jhc2U2NAo
  ```
 
+ ### S3 prefix
+
+ If you don't want to put the data dumped from Redshift directly into the root of the S3 bucket, you can use `--s3-prefix` to provide a prefix under which the dumps should be placed (see the sketch below for how the resulting keys look).
+
+ Because of how GCS' Transfer Service works, the transferred files will have exactly the same keys in the destination bucket; this cannot be configured.
+
+ ### BigQuery table ID
+
+ By default the BigQuery table ID will be the same as the Redshift table name, but the optional argument `--bq-table` can be used to tell BigShift to use another table ID.
+
  # How does it work?
 
  There are four main pieces to BigShift: the Redshift unloader, the transfer, the BigQuery load and the schema translation.
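To make the `--s3-prefix` behaviour described above concrete, here is a small sketch of how the dump location is derived; it mirrors the `s3_table_prefix` helper in the `cli.rb` hunks further down, and the bucket, prefix, database and table names are made up:

```ruby
prefix = 'my/prefix/'.gsub(%r{\A/|/\Z}, '')   # leading and trailing slashes are stripped
components = ['my_database', 'my_table']
components.unshift(prefix)
s3_table_prefix = File.join(*components)       # => "my/prefix/my_database/my_table"

s3_uri  = "s3://my-s3-bucket/#{s3_table_prefix}/"   # where Redshift dumps the table
gcs_uri = "gs://my-gcs-bucket/#{s3_table_prefix}/*" # where BigQuery loads it from
# GCS Transfer Service keeps the S3 keys as-is, so the transferred files end up
# under the same prefix in the GCS bucket.
```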
@@ -74,11 +132,12 @@ Once the data is in GCS, the BigQuery table can be created and loaded. At this p
 
  `NOT NULL` becomes `REQUIRED` in BigQuery, and `NULL` becomes `NULLABLE` (see the sketch at the end of this section for an example).
 
+ Finally, once the BigQuery table has been loaded, BigShift will remove the data it dumped to S3 and the data it transferred to Cloud Storage.
+
  # What doesn't it do?
 
- * Currently BigShift doesn't delete the dumped table from S3 or from GCS. This is planned.
  * BigShift can't currently append to an existing BigQuery table. This feature would be possible to add.
- * The tool will happily overwrite any data on S3, GCS and in BigQuery that happen to be in the way (i.e. in the specified S3 or GCS location, or in the target table). This is convenient if you want to move the same data multiple times, but very scary and unsafe. To clobber everything will be an option in the future, but the default will be much safer.
+ * The tool will truncate the target table before loading the transferred data into it. This is convenient if you want to move the same data multiple times, but can also be considered very scary and unsafe. It would be possible to have options to fail if there is data in the target table, or to append to the target table.
  * There is no transformation or processing of the data. When moving to BigQuery you might want to split a string and use the pieces as values in a repeated field, but BigShift doesn't help you with that. You will almost always have to do some post-processing in BigQuery once the data has been moved. Processing on the way would require a lot more complexity and involve either Hadoop or Dataflow, and that's beyond the scope of a tool like this.
  * BigShift can't move data back from BigQuery to Redshift. It can probably be done, but you would probably have to write a big part of the Redshift schema yourself since BigQuery's data model is so much simpler. Going from Redshift to BigQuery is simple, most of Redshift's data types map directly to one of BigQuery's, and there are no encodings, sort keys or dist keys to worry about. Going in the other direction the tool can't know whether a `STRING` column in BigQuery should be a `CHAR(12)` or a `VARCHAR(65535)`, whether it should be encoded as `LZO` or `BYTEDICT`, or what should be the primary, sort, and dist key of the table.
 
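As a concrete illustration of the schema translation described above, the load schema sent to BigQuery can be sketched with the BigQuery API classes like this (the column names and types are made up, and how BigShift derives the real schema from the Redshift system tables is not part of this diff):

```ruby
require 'google/apis/bigquery_v2'

# A Redshift column declared NOT NULL becomes a REQUIRED field,
# a nullable column becomes NULLABLE.
schema = Google::Apis::BigqueryV2::TableSchema.new(
  fields: [
    Google::Apis::BigqueryV2::TableFieldSchema.new(name: 'id', type: 'INTEGER', mode: 'REQUIRED'),
    Google::Apis::BigqueryV2::TableFieldSchema.new(name: 'note', type: 'STRING', mode: 'NULLABLE')
  ]
)
```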
data/lib/bigshift.rb CHANGED
@@ -1,5 +1,7 @@
  require 'google/apis/bigquery_v2'
  require 'google/apis/storagetransfer_v1'
+ require 'google/apis/storage_v1'
+ require 'aws-sdk'
 
  module BigShift
    BigShiftError = Class.new(StandardError)
@@ -27,3 +29,5 @@ require 'bigshift/big_query/table'
  require 'bigshift/redshift_table_schema'
  require 'bigshift/redshift_unloader'
  require 'bigshift/cloud_storage_transfer'
+ require 'bigshift/unload_manifest'
+ require 'bigshift/cleaner'
@@ -19,6 +19,7 @@ module BigShift
  load_configuration[:field_delimiter] = '\t'
  load_configuration[:quote] = '"'
  load_configuration[:destination_table] = @table_data.table_reference
+ load_configuration[:max_bad_records] = options[:max_bad_records] if options[:max_bad_records]
  job = Google::Apis::BigqueryV2::Job.new(
    configuration: Google::Apis::BigqueryV2::JobConfiguration.new(
      load: Google::Apis::BigqueryV2::JobConfigurationLoad.new(load_configuration)
@@ -0,0 +1,31 @@
+ module BigShift
+   class Cleaner
+     def initialize(s3_resource, cs_service, options={})
+       @s3_resource = s3_resource
+       @cs_service = cs_service
+       @logger = options[:logger] || NullLogger.new
+     end
+
+     def cleanup(unload_manifest, cs_bucket_name)
+       cleanup_s3(unload_manifest)
+       cleanup_cs(cs_bucket_name, unload_manifest)
+       nil
+     end
+
+     private
+
+     def cleanup_s3(unload_manifest)
+       objects = unload_manifest.keys.map { |k| {key: k} }
+       objects << {key: unload_manifest.manifest_key}
+       @logger.info(sprintf('Deleting %d files from s3://%s/%s (including the manifest file)', objects.size, unload_manifest.bucket_name, unload_manifest.prefix))
+       @s3_resource.bucket(unload_manifest.bucket_name).delete_objects(delete: {objects: objects})
+     end
+
+     def cleanup_cs(bucket_name, unload_manifest)
+       @logger.info(sprintf('Deleting %d files from gs://%s/%s', unload_manifest.count, bucket_name, unload_manifest.prefix))
+       unload_manifest.keys.each do |key|
+         @cs_service.delete_object(bucket_name, key)
+       end
+     end
+   end
+ end
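A hedged sketch of how this new `Cleaner` is wired up and called, based on the `cli.rb` hunks later in this diff (the region, bucket names and prefix are made up, and the GCP credential object is assumed to be set up elsewhere):

```ruby
require 'logger'
require 'bigshift'

s3_resource = Aws::S3::Resource.new(region: 'eu-west-1') # AWS credentials assumed to come from the environment
cs_service = Google::Apis::StorageV1::StorageService.new
cs_service.authorization = gcp_credentials                # assumed: authorized GCP credentials

manifest = BigShift::UnloadManifest.new(s3_resource, 'my-s3-bucket', 'my_database/my_table/')
cleaner = BigShift::Cleaner.new(s3_resource, cs_service, logger: Logger.new($stderr))

# Deletes the dump files and the UNLOAD manifest from S3, then the transferred copies from GCS
cleaner.cleanup(manifest, 'my-gcs-bucket')
```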
data/lib/bigshift/cli.rb CHANGED
@@ -41,12 +41,13 @@ module BigShift
 
  def unload
    s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}/"
-   @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: true)
+   @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: false)
+   @unload_manifest = UnloadManifest.new(@factory.s3_resource, @config[:s3_bucket_name], "#{s3_table_prefix}/")
  end
 
  def transfer
    description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
-   @factory.cloud_storage_transfer.copy_to_cloud_storage(@config[:s3_bucket_name], "#{s3_table_prefix}/", @config[:cs_bucket_name], description: description, allow_overwrite: true)
+   @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
  end
 
  def load
@@ -54,30 +55,36 @@ module BigShift
    bq_dataset = @factory.big_query_dataset
    bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
    gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}/*"
-   bq_table.load(gcs_uri, schema: rs_table_schema.to_big_query, allow_overwrite: true)
+   options = {}
+   options[:schema] = rs_table_schema.to_big_query
+   options[:allow_overwrite] = true
+   options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
+   bq_table.load(gcs_uri, options)
  end
 
  def cleanup
+   @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
  end
 
  ARGUMENTS = [
-   ['--gcp-credentials', 'PATH', :gcp_credentials_path, :required],
-   ['--aws-credentials', 'PATH', :aws_credentials_path, :required],
-   ['--rs-credentials', 'PATH', :rs_credentials_path, :required],
-   ['--rs-database', 'DB_NAME', :rs_database_name, :required],
-   ['--rs-table', 'TABLE_NAME', :rs_table_name, :required],
-   ['--bq-dataset', 'DATASET_ID', :bq_dataset_id, :required],
-   ['--bq-table', 'TABLE_ID', :bq_table_id, :required],
-   ['--s3-bucket', 'BUCKET_NAME', :s3_bucket_name, :required],
-   ['--s3-prefix', 'PREFIX', :s3_prefix, nil],
-   ['--cs-bucket', 'BUCKET_NAME', :cs_bucket_name, :required],
+   ['--gcp-credentials', 'PATH', String, :gcp_credentials_path, :required],
+   ['--aws-credentials', 'PATH', String, :aws_credentials_path, nil],
+   ['--rs-credentials', 'PATH', String, :rs_credentials_path, :required],
+   ['--rs-database', 'DB_NAME', String, :rs_database_name, :required],
+   ['--rs-table', 'TABLE_NAME', String, :rs_table_name, :required],
+   ['--bq-dataset', 'DATASET_ID', String, :bq_dataset_id, :required],
+   ['--bq-table', 'TABLE_ID', String, :bq_table_id, nil],
+   ['--s3-bucket', 'BUCKET_NAME', String, :s3_bucket_name, :required],
+   ['--s3-prefix', 'PREFIX', String, :s3_prefix, nil],
+   ['--cs-bucket', 'BUCKET_NAME', String, :cs_bucket_name, :required],
+   ['--max-bad-records', 'N', Integer, :max_bad_records, nil],
  ]
 
  def parse_args(argv)
    config = {}
    parser = OptionParser.new do |p|
-     ARGUMENTS.each do |flag, value_name, config_key, _|
-       p.on("#{flag} #{value_name}") { |v| config[config_key] = v }
+     ARGUMENTS.each do |flag, value_name, type, config_key, _|
+       p.on("#{flag} #{value_name}", type) { |v| config[config_key] = v }
      end
    end
    config_errors = []
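The new third element in each `ARGUMENTS` entry is an `OptionParser` type; a tiny standalone illustration of the coercion it buys (the flag and value are arbitrary):

```ruby
require 'optparse'

config = {}
parser = OptionParser.new do |p|
  # With a type argument, OptionParser coerces the value before yielding it
  p.on('--max-bad-records N', Integer) { |v| config[:max_bad_records] = v }
end
parser.parse(%w[--max-bad-records 10])
config[:max_bad_records] # => 10 (an Integer, not the string "10")
```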
@@ -93,11 +100,12 @@ module BigShift
        config_errors << sprintf('%s does not exist', path.inspect)
      end
    end
-   ARGUMENTS.each do |flag, _, config_key, required|
+   ARGUMENTS.each do |flag, _, _, config_key, required|
      if !config.include?(config_key) && required
        config_errors << "#{flag} is required"
      end
    end
+   config[:bq_table_id] ||= config[:rs_table_name]
    unless config_errors.empty?
      raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
    end
@@ -107,6 +115,7 @@ module BigShift
  def s3_table_prefix
    components = @config.values_at(:rs_database_name, :rs_table_name)
    if (prefix = @config[:s3_prefix])
+     prefix = prefix.gsub(%r{\A/|/\Z}, '')
      components.unshift(prefix)
    end
    File.join(*components)
@@ -123,7 +132,7 @@ module BigShift
  end
 
  def cloud_storage_transfer
-   @cloud_storage_transfer ||= CloudStorageTransfer.new(gcs_transfer_service, raw_gcp_credentials['project_id'], aws_credentials, logger: logger)
+   @cloud_storage_transfer ||= CloudStorageTransfer.new(cs_transfer_service, raw_gcp_credentials['project_id'], aws_credentials, logger: logger)
  end
 
  def redshift_table_schema
@@ -134,6 +143,17 @@ module BigShift
    @big_query_dataset ||= BigQuery::Dataset.new(bq_service, raw_gcp_credentials['project_id'], @config[:bq_dataset_id], logger: logger)
  end
 
+ def cleaner
+   @cleaner ||= Cleaner.new(s3_resource, cs_service, logger: logger)
+ end
+
+ def s3_resource
+   @s3_resource ||= Aws::S3::Resource.new(
+     region: aws_region,
+     credentials: aws_credentials
+   )
+ end
+
  private
 
  def logger
@@ -142,24 +162,31 @@ module BigShift
 
  def rs_connection
    @rs_connection ||= PG.connect(
-     @config[:rs_credentials]['host'],
-     @config[:rs_credentials]['port'],
-     nil,
-     nil,
-     @config[:rs_database_name],
-     @config[:rs_credentials]['username'],
-     @config[:rs_credentials]['password']
+     host: @config[:rs_credentials]['host'],
+     port: @config[:rs_credentials]['port'],
+     dbname: @config[:rs_database_name],
+     user: @config[:rs_credentials]['username'],
+     password: @config[:rs_credentials]['password'],
+     sslmode: 'require'
    )
  end
 
- def gcs_transfer_service
-   @gcs_transfer_service ||= begin
+ def cs_transfer_service
+   @cs_transfer_service ||= begin
      s = Google::Apis::StoragetransferV1::StoragetransferService.new
      s.authorization = gcp_credentials
      s
    end
  end
 
+ def cs_service
+   @cs_service ||= begin
+     s = Google::Apis::StorageV1::StorageService.new
+     s.authorization = gcp_credentials
+     s
+   end
+ end
+
  def bq_service
    @bq_service ||= begin
      s = Google::Apis::BigqueryV2::BigqueryService.new
@@ -169,7 +196,22 @@ module BigShift
    end
  end
  def aws_credentials
-   @config[:aws_credentials]
+   @aws_credentials ||= begin
+     if @config[:aws_credentials]
+       credentials = Aws::Credentials.new(*@config[:aws_credentials].values_at('access_key_id', 'secret_access_key'))
+     else
+       credentials = nil
+     end
+     if (credentials = Aws::CredentialProviderChain.new(credentials).resolve)
+       credentials
+     else
+       raise 'No AWS credentials found'
+     end
+   end
+ end
+
+ def aws_region
+   @aws_region ||= ((awsc = @config[:aws_credentials]) && awsc['region']) || ENV['AWS_REGION'] || ENV['AWS_DEFAULT_REGION']
  end
 
  def raw_gcp_credentials
@@ -9,11 +9,11 @@ module BigShift
    @logger = options[:logger] || NullLogger::INSTANCE
  end
 
- def copy_to_cloud_storage(s3_bucket, s3_path_prefix, cloud_storage_bucket, options={})
+ def copy_to_cloud_storage(unload_manifest, cloud_storage_bucket, options={})
    poll_interval = options[:poll_interval] || DEFAULT_POLL_INTERVAL
-   transfer_job = create_transfer_job(s3_bucket, s3_path_prefix, cloud_storage_bucket, options[:description], options[:allow_overwrite])
+   transfer_job = create_transfer_job(unload_manifest, cloud_storage_bucket, options[:description], options[:allow_overwrite])
    transfer_job = @storage_transfer_service.create_transfer_job(transfer_job)
-   @logger.info(sprintf('Transferring objects from s3://%s/%s to gs://%s/%s', s3_bucket, s3_path_prefix, cloud_storage_bucket, s3_path_prefix))
+   @logger.info(sprintf('Transferring %d objects (%.2f GiB) from s3://%s/%s to gs://%s/%s', unload_manifest.count, unload_manifest.total_file_size.to_f/2**30, unload_manifest.bucket_name, unload_manifest.prefix, cloud_storage_bucket, unload_manifest.prefix))
    await_completion(transfer_job, poll_interval)
    nil
  end
@@ -22,7 +22,7 @@ module BigShift
 
  DEFAULT_POLL_INTERVAL = 30
 
- def create_transfer_job(s3_bucket, s3_path_prefix, cloud_storage_bucket, description, allow_overwrite)
+ def create_transfer_job(unload_manifest, cloud_storage_bucket, description, allow_overwrite)
    now = @clock.now.utc
    Google::Apis::StoragetransferV1::TransferJob.new(
      description: description,
@@ -35,17 +35,17 @@ module BigShift
      ),
      transfer_spec: Google::Apis::StoragetransferV1::TransferSpec.new(
        aws_s3_data_source: Google::Apis::StoragetransferV1::AwsS3Data.new(
-         bucket_name: s3_bucket,
+         bucket_name: unload_manifest.bucket_name,
          aws_access_key: Google::Apis::StoragetransferV1::AwsAccessKey.new(
-           access_key_id: @aws_credentials['aws_access_key_id'],
-           secret_access_key: @aws_credentials['aws_secret_access_key'],
+           access_key_id: @aws_credentials.access_key_id,
+           secret_access_key: @aws_credentials.secret_access_key,
          )
        ),
        gcs_data_sink: Google::Apis::StoragetransferV1::GcsData.new(
          bucket_name: cloud_storage_bucket
        ),
        object_conditions: Google::Apis::StoragetransferV1::ObjectConditions.new(
-         include_prefixes: [s3_path_prefix]
+         include_prefixes: unload_manifest.keys,
        ),
        transfer_options: Google::Apis::StoragetransferV1::TransferOptions.new(
          overwrite_objects_already_existing_in_sink: !!allow_overwrite
@@ -8,14 +8,16 @@ module BigShift
 
  def unload_to(table_name, s3_uri, options={})
    table_schema = RedshiftTableSchema.new(table_name, @redshift_connection)
-   credentials = @aws_credentials.map { |pair| pair.join('=') }.join(';')
+   credentials_string = "aws_access_key_id=#{@aws_credentials.access_key_id};aws_secret_access_key=#{@aws_credentials.secret_access_key}"
    select_sql = 'SELECT '
    select_sql << table_schema.columns.map(&:to_sql).join(', ')
    select_sql << %Q< FROM "#{table_name}">
    select_sql.gsub!('\'') { |s| '\\\'' }
    unload_sql = %Q<UNLOAD ('#{select_sql}')>
    unload_sql << %Q< TO '#{s3_uri}'>
-   unload_sql << %Q< CREDENTIALS '#{credentials}'>
+   unload_sql << %Q< CREDENTIALS '#{credentials_string}'>
+   unload_sql << %q< MANIFEST>
+   unload_sql << %q< GZIP>
    unload_sql << %q< DELIMITER '\t'>
    unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
    @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
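For orientation, roughly the kind of statement the unloader now issues, with the new `MANIFEST` and `GZIP` options (the table, bucket and column names are placeholders and the credentials are elided):

```ruby
# The manifest written by MANIFEST is what UnloadManifest (next hunk) reads back,
# and GZIP keeps the S3-to-GCS transfer as small as possible.
example_unload_sql =
  %q{UNLOAD ('SELECT "id", "name" FROM "my_table"')} +
  %q{ TO 's3://my-s3-bucket/my_database/my_table/'} +
  %q{ CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'} +
  %q{ MANIFEST GZIP DELIMITER '\t'}
```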
@@ -0,0 +1,39 @@
+ module BigShift
+   class UnloadManifest
+     attr_reader :bucket_name, :prefix, :manifest_key
+
+     def initialize(s3_resource, bucket_name, prefix)
+       @s3_resource = s3_resource
+       @bucket_name = bucket_name
+       @prefix = prefix
+       @manifest_key = "#{@prefix}manifest"
+     end
+
+     def keys
+       @keys ||= begin
+         bucket = @s3_resource.bucket(@bucket_name)
+         object = bucket.object(@manifest_key)
+         manifest = JSON.load(object.get.body)
+         manifest['entries'].map { |entry| entry['url'].sub(%r{\As3://[^/]+/}, '') }
+       end
+     end
+
+     def count
+       keys.size
+     end
+
+     def total_file_size
+       @total_file_size ||= begin
+         bucket = @s3_resource.bucket(@bucket_name)
+         objects = bucket.objects(prefix: @prefix)
+         objects.reduce(0) do |sum, object|
+           if keys.include?(object.key)
+             sum + object.size
+           else
+             sum
+           end
+         end
+       end
+     end
+   end
+ end
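A small usage sketch for the new `UnloadManifest`, mirroring how `cli.rb` constructs it (the region, bucket and prefix are made up):

```ruby
require 'bigshift'

s3 = Aws::S3::Resource.new(region: 'eu-west-1') # AWS credentials assumed to come from the environment
manifest = BigShift::UnloadManifest.new(s3, 'my-s3-bucket', 'my_database/my_table/')

manifest.manifest_key    # => "my_database/my_table/manifest"
manifest.keys            # keys of the dump files listed in the Redshift UNLOAD manifest
manifest.count           # number of dump files
manifest.total_file_size # their combined size on S3, in bytes
```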
@@ -1,3 +1,3 @@
  module BigShift
-   VERSION = '0.1.1'.freeze
+   VERSION = '0.2.0'.freeze
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: bigshift
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.2.0
  platform: ruby
  authors:
  - Theo Hultberg
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-04-08 00:00:00.000000000 Z
+ date: 2016-04-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: pg
@@ -52,6 +52,20 @@ dependencies:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
+ - !ruby/object:Gem::Dependency
+   name: aws-sdk
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
  description: |-
    BigShift is a tool for moving tables from Redshift
    to BigQuery. It will create a table in BigQuery with
@@ -71,10 +85,12 @@ files:
  - lib/bigshift.rb
  - lib/bigshift/big_query/dataset.rb
  - lib/bigshift/big_query/table.rb
+ - lib/bigshift/cleaner.rb
  - lib/bigshift/cli.rb
  - lib/bigshift/cloud_storage_transfer.rb
  - lib/bigshift/redshift_table_schema.rb
  - lib/bigshift/redshift_unloader.rb
+ - lib/bigshift/unload_manifest.rb
  - lib/bigshift/version.rb
  homepage: http://github.com/iconara/bigshift
  licenses: