bigshift 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +65 -6
- data/lib/bigshift.rb +4 -0
- data/lib/bigshift/big_query/table.rb +1 -0
- data/lib/bigshift/cleaner.rb +31 -0
- data/lib/bigshift/cli.rb +69 -27
- data/lib/bigshift/cloud_storage_transfer.rb +8 -8
- data/lib/bigshift/redshift_unloader.rb +4 -2
- data/lib/bigshift/unload_manifest.rb +39 -0
- data/lib/bigshift/version.rb +1 -1
- metadata +18 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fc84facadd8de03293a5ba461bce6653bb3f00aa
+  data.tar.gz: deb0e103ae33b5a9627feb3aa4ac617cfa54e342
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ec259abd928ad95999f64fa9765776c659113a373257d840874d9864ff571bdec0744efa756d3aaf62c7599a5c689de5ca9cf77d66e04a441a4b0d22cdbb833e
+  data.tar.gz: 04cbba86814f2526260f24a4c6583180e55edb4faf6ef7b20a96a0b961ad48586b36c1145af4f49ae06f9735fe2a0c98654433b7ec79bd0520fd5d0d7924935b
data/README.md
CHANGED
@@ -1,5 +1,9 @@
 # BigShift
 
+[![Build Status](https://travis-ci.org/iconara/bigshift.png?branch=master)](https://travis-ci.org/iconara/bigshift)
+
+_If you're reading this on GitHub, please note that this is the readme for the development version and that some features described here might not yet have been released. You can find the readme for a specific version either through [rubydoc.info](http://rubydoc.info/find/gems?q=bigshift) or via the release tags ([here is an example](https://github.com/iconara/bigshift/tree/v0.1.1))._
+
 BigShift is a tool for moving tables from Redshift to BigQuery. It will create a table in BigQuery with a schema that matches the Redshift table, dump the data to S3, transfer it to GCS and finally load it into the BigQuery table.
 
 # Installation
@@ -18,9 +22,15 @@ The main interface to BigShift is the `bigshift` command line tool.
 
 BigShift can also be used as a library in a Ruby application. Look at the tests, and how the `bigshift` tool is built to figure out how.
 
+## Cost
+
+Please note that transferring large amounts of data between AWS and GCP is not free. [AWS charges for outgoing traffic from S3](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). There are also storage charges for the Redshift dumps on S3 and GCS, but since they are kept only until the BigQuery table has been loaded those should be negligible.
+
+BigShift tells Redshift to compress the dumps, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost.
+
 ## Arguments
 
-Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix` are required.
+Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix`, `--bq-table` and `--max-bad-records` are required.
 
 ### GCP credentials
 
@@ -28,16 +38,54 @@ The `--gcp-credentials` argument must be a path to a JSON file that contains a p
 
 ### AWS credentials
 
-
+You can provide AWS credentials the same way that you can for the AWS SDK, that is with environment variables and files in specific locations in the file system, etc. See the [AWS SDK documentation](http://aws.amazon.com/documentation/sdk-for-ruby/) for more information. You can't use temporary credentials, like instance role credentials, unfortunately, because GCS Transfer Service doesn't support session tokens.
+
+You can also use the optional `--aws-credentials` argument to point to a JSON or YAML file that contains `access_key_id` and `secret_access_key`, and optionally `region`.
 
 ```yaml
 ---
-
-
+access_key_id: AKXYZABC123FOOBARBAZ
+secret_access_key: eW91ZmlndXJlZG91dGl0d2FzYmFzZTY0ISEhCg
+region: eu-west-1
 ```
 
 These credentials need to be allowed to read and write the S3 location you specify with `--s3-bucket` and `--s3-prefix`.
 
+Here is a minimal IAM policy that should work:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Action": [
+        "s3:GetObject",
+        "s3:PutObject",
+        "s3:DeleteObject"
+      ],
+      "Resource": [
+        "arn:aws:s3:::THE-NAME-OF-THE-BUCKET/THE/PREFIX/*"
+      ],
+      "Effect": "Allow"
+    },
+    {
+      "Action": [
+        "s3:ListBucket",
+        "s3:GetBucketLocation"
+      ],
+      "Resource": [
+        "arn:aws:s3:::THE-NAME-OF-THE-BUCKET"
+      ],
+      "Effect": "Allow"
+    }
+  ]
+}
+```
+
+If you set `THE-NAME-OF-THE-BUCKET` to the same value as `--s3-bucket` and `THE/PREFIX` to the same value as `--s3-prefix` you're limiting the damage that BigShift can do, and unless you store something else at that location there is very little damage to be done.
+
+It is _strongly_ recommended that you create a specific IAM user with minimal permissions for use with BigShift. The nature of GCS Transfer Service means that these credentials are sent to and stored in GCP. The credentials are also used in the `UNLOAD` command sent to Redshift, and with the AWS SDK to work with the objects on S3.
+
 ### Redshift credentials
 
 The `--rs-credentials` argument must be a path to a JSON or YAML file that contains the `host` and `port` of the Redshift cluster, as well as the `username` and `password` required to connect.
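As a rough illustration of how a credentials file like the YAML above maps onto the AWS SDK, here is a minimal sketch based on the `aws_credentials` factory method added to `lib/bigshift/cli.rb` further down in this diff; the file name is a hypothetical stand-in for whatever you pass to `--aws-credentials`:

```ruby
require 'yaml'
require 'aws-sdk'

# Hypothetical path; use the file you pass to --aws-credentials
config = YAML.load_file('aws-credentials.yml')

# Only the key id and secret are handed to the SDK; the optional region is
# used separately when the S3 resource is built
credentials = Aws::Credentials.new(*config.values_at('access_key_id', 'secret_access_key'))
s3 = Aws::S3::Resource.new(region: config['region'], credentials: credentials)
```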
@@ -50,6 +98,16 @@ username: my_redshift_user
 password: dGhpc2lzYWxzb2Jhc2U2NAo
 ```
 
+### S3 prefix
+
+If you don't want to put the data dumped from Redshift directly into the root of the S3 bucket you can use the `--s3-prefix` to provide a prefix to where the dumps should be placed.
+
+Because of how GCS' Transfer Service works, the transferred files will have exactly the same keys in the destination bucket; this cannot be configured.
+
+### BigQuery table ID
+
+By default the BigQuery table ID will be the same as the Redshift table name, but the optional argument `--bq-table` can be used to tell BigShift to use another table ID.
+
 # How does it work?
 
 There are four main pieces to BigShift: the Redshift unloader, the transfer, the BigQuery load and the schema translation.
@@ -74,11 +132,12 @@ Once the data is in GCS, the BigQuery table can be created and loaded. At this p
 
 `NOT NULL` becomes `REQUIRED` in BigQuery, and `NULL` becomes `NULLABLE`.
 
+Finally, once the BigQuery table has been loaded BigShift will remove the data it dumped to S3 and the data it transferred to Cloud Storage.
+
 # What doesn't it do?
 
-* Currently BigShift doesn't delete the dumped table from S3 or from GCS. This is planned.
 * BigShift can't currently append to an existing BigQuery table. This feature would be possible to add.
-* The tool will
+* The tool will truncate the target table before loading the transferred data to it. This is convenient if you want to move the same data multiple times, but can also be considered very scary and unsafe. It would be possible to have options to fail if there is data in the target table, or to append to the target table.
 * There is no transformation or processing of the data. When moving to BigQuery you might want to split a string and use the pieces as values in a repeated field, but BigShift doesn't help you with that. You will almost always have to do some post processing in BigQuery once the data has been moved. Processing on the way would require a lot more complexity and involve either Hadoop or Dataflow, and that's beyond the scope of a tool like this.
 * BigShift can't move data back from BigQuery to Redshift. It can probably be done, but you would probably have to write a big part of the Redshift schema yourself since BigQuery's data model is so much simpler. Going from Redshift to BigQuery is simple: most of Redshift's datatypes map directly to one of BigQuery's, and there are no encodings, sort or dist keys to worry about. Going in the other direction the tool can't know whether a `STRING` column in BigQuery should be a `CHAR(12)` or a `VARCHAR(65535)`, whether it should be encoded as `LZO` or `BYTEDICT`, or what should be the primary, sort, and dist key of the table.
 
data/lib/bigshift.rb
CHANGED
@@ -1,5 +1,7 @@
 require 'google/apis/bigquery_v2'
 require 'google/apis/storagetransfer_v1'
+require 'google/apis/storage_v1'
+require 'aws-sdk'
 
 module BigShift
   BigShiftError = Class.new(StandardError)
@@ -27,3 +29,5 @@ require 'bigshift/big_query/table'
 require 'bigshift/redshift_table_schema'
 require 'bigshift/redshift_unloader'
 require 'bigshift/cloud_storage_transfer'
+require 'bigshift/unload_manifest'
+require 'bigshift/cleaner'

data/lib/bigshift/big_query/table.rb
CHANGED
@@ -19,6 +19,7 @@ module BigShift
         load_configuration[:field_delimiter] = '\t'
         load_configuration[:quote] = '"'
         load_configuration[:destination_table] = @table_data.table_reference
+        load_configuration[:max_bad_records] = options[:max_bad_records] if options[:max_bad_records]
         job = Google::Apis::BigqueryV2::Job.new(
           configuration: Google::Apis::BigqueryV2::JobConfiguration.new(
             load: Google::Apis::BigqueryV2::JobConfigurationLoad.new(load_configuration)
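For reference, a sketch (not the gem's exact code path) of the load configuration that reaches BigQuery when `--max-bad-records` is given; the destination table reference is a hypothetical stand-in and keys the gem sets elsewhere are omitted:

```ruby
require 'google/apis/bigquery_v2'

# Hypothetical destination; BigShift builds this from the --bq-dataset and --bq-table options
destination = Google::Apis::BigqueryV2::TableReference.new(project_id: 'my-project', dataset_id: 'my_dataset', table_id: 'my_table')

load_configuration = {
  field_delimiter: '\t',
  quote: '"',
  destination_table: destination,
  max_bad_records: 10 # only included when the option is given
}
Google::Apis::BigqueryV2::JobConfigurationLoad.new(load_configuration)
```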

data/lib/bigshift/cleaner.rb
ADDED
@@ -0,0 +1,31 @@
+module BigShift
+  class Cleaner
+    def initialize(s3_resource, cs_service, options={})
+      @s3_resource = s3_resource
+      @cs_service = cs_service
+      @logger = options[:logger] || NullLogger.new
+    end
+
+    def cleanup(unload_manifest, cs_bucket_name)
+      cleanup_s3(unload_manifest)
+      cleanup_cs(cs_bucket_name, unload_manifest)
+      nil
+    end
+
+    private
+
+    def cleanup_s3(unload_manifest)
+      objects = unload_manifest.keys.map { |k| {key: k} }
+      objects << {key: unload_manifest.manifest_key}
+      @logger.info(sprintf('Deleting %d files from s3://%s/%s (including the manifest file)', objects.size, unload_manifest.bucket_name, unload_manifest.prefix))
+      @s3_resource.bucket(unload_manifest.bucket_name).delete_objects(delete: {objects: objects})
+    end
+
+    def cleanup_cs(bucket_name, unload_manifest)
+      @logger.info(sprintf('Deleting %d files from gs://%s/%s', unload_manifest.count, bucket_name, unload_manifest.prefix))
+      unload_manifest.keys.each do |key|
+        @cs_service.delete_object(bucket_name, key)
+      end
+    end
+  end
+end
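A rough usage sketch of the new class, mirroring the `cleaner` factory method added to `lib/bigshift/cli.rb` later in this diff; the bucket names are made up and the credential objects are assumed to be set up elsewhere:

```ruby
require 'logger'
require 'bigshift'

s3 = Aws::S3::Resource.new(region: 'eu-west-1', credentials: aws_credentials)

cs = Google::Apis::StorageV1::StorageService.new
cs.authorization = gcp_credentials # an authorized GCP credential

manifest = BigShift::UnloadManifest.new(s3, 'my-s3-bucket', 'my_db/my_table/')

cleaner = BigShift::Cleaner.new(s3, cs, logger: Logger.new($stderr))
cleaner.cleanup(manifest, 'my-cs-bucket') # removes the dump from S3 and the transferred copies from GCS
```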
data/lib/bigshift/cli.rb
CHANGED
@@ -41,12 +41,13 @@ module BigShift
 
     def unload
       s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}/"
-      @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite:
+      @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: false)
+      @unload_manifest = UnloadManifest.new(@factory.s3_resource, @config[:s3_bucket_name], "#{s3_table_prefix}/")
     end
 
     def transfer
       description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
-      @factory.cloud_storage_transfer.copy_to_cloud_storage(@
+      @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
     end
 
     def load
@@ -54,30 +55,36 @@ module BigShift
       bq_dataset = @factory.big_query_dataset
       bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
       gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}/*"
-
+      options = {}
+      options[:schema] = rs_table_schema.to_big_query
+      options[:allow_overwrite] = true
+      options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
+      bq_table.load(gcs_uri, options)
     end
 
     def cleanup
+      @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
     end
 
     ARGUMENTS = [
-      ['--gcp-credentials', 'PATH', :gcp_credentials_path, :required],
-      ['--aws-credentials', 'PATH', :aws_credentials_path,
-      ['--rs-credentials', 'PATH', :rs_credentials_path, :required],
-      ['--rs-database', 'DB_NAME', :rs_database_name, :required],
-      ['--rs-table', 'TABLE_NAME', :rs_table_name, :required],
-      ['--bq-dataset', 'DATASET_ID', :bq_dataset_id, :required],
-      ['--bq-table', 'TABLE_ID', :bq_table_id,
-      ['--s3-bucket', 'BUCKET_NAME', :s3_bucket_name, :required],
-      ['--s3-prefix', 'PREFIX', :s3_prefix, nil],
-      ['--cs-bucket', 'BUCKET_NAME', :cs_bucket_name, :required],
+      ['--gcp-credentials', 'PATH', String, :gcp_credentials_path, :required],
+      ['--aws-credentials', 'PATH', String, :aws_credentials_path, nil],
+      ['--rs-credentials', 'PATH', String, :rs_credentials_path, :required],
+      ['--rs-database', 'DB_NAME', String, :rs_database_name, :required],
+      ['--rs-table', 'TABLE_NAME', String, :rs_table_name, :required],
+      ['--bq-dataset', 'DATASET_ID', String, :bq_dataset_id, :required],
+      ['--bq-table', 'TABLE_ID', String, :bq_table_id, nil],
+      ['--s3-bucket', 'BUCKET_NAME', String, :s3_bucket_name, :required],
+      ['--s3-prefix', 'PREFIX', String, :s3_prefix, nil],
+      ['--cs-bucket', 'BUCKET_NAME', String, :cs_bucket_name, :required],
+      ['--max-bad-records', 'N', Integer, :max_bad_records, nil],
     ]
 
     def parse_args(argv)
       config = {}
       parser = OptionParser.new do |p|
-        ARGUMENTS.each do |flag, value_name, config_key, _|
-          p.on("#{flag} #{value_name}") { |v| config[config_key] = v }
+        ARGUMENTS.each do |flag, value_name, type, config_key, _|
+          p.on("#{flag} #{value_name}", type) { |v| config[config_key] = v }
         end
       end
       config_errors = []
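The extra `String`/`Integer` column in `ARGUMENTS` is what lets OptionParser coerce values, so `--max-bad-records` arrives as a number rather than a string. A standalone sketch of the mechanism (not the gem's code; the flag values are made up):

```ruby
require 'optparse'

config = {}
parser = OptionParser.new do |p|
  # Passing a type to #on makes OptionParser coerce the argument
  p.on('--max-bad-records N', Integer) { |v| config[:max_bad_records] = v }
  p.on('--bq-table TABLE_ID', String) { |v| config[:bq_table_id] = v }
end
parser.parse(%w[--max-bad-records 10 --bq-table events_copy])
config # => {:max_bad_records=>10, :bq_table_id=>"events_copy"}
```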
@@ -93,11 +100,12 @@ module BigShift
           config_errors << sprintf('%s does not exist', path.inspect)
         end
       end
-      ARGUMENTS.each do |flag, _, config_key, required|
+      ARGUMENTS.each do |flag, _, _, config_key, required|
         if !config.include?(config_key) && required
           config_errors << "#{flag} is required"
         end
       end
+      config[:bq_table_id] ||= config[:rs_table_name]
       unless config_errors.empty?
         raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
       end
@@ -107,6 +115,7 @@ module BigShift
     def s3_table_prefix
       components = @config.values_at(:rs_database_name, :rs_table_name)
       if (prefix = @config[:s3_prefix])
+        prefix = prefix.gsub(%r{\A/|/\Z}, '')
         components.unshift(prefix)
       end
       File.join(*components)
@@ -123,7 +132,7 @@ module BigShift
     end
 
     def cloud_storage_transfer
-      @cloud_storage_transfer ||= CloudStorageTransfer.new(
+      @cloud_storage_transfer ||= CloudStorageTransfer.new(cs_transfer_service, raw_gcp_credentials['project_id'], aws_credentials, logger: logger)
     end
 
     def redshift_table_schema
@@ -134,6 +143,17 @@ module BigShift
       @big_query_dataset ||= BigQuery::Dataset.new(bq_service, raw_gcp_credentials['project_id'], @config[:bq_dataset_id], logger: logger)
     end
 
+    def cleaner
+      @cleaner ||= Cleaner.new(s3_resource, cs_service, logger: logger)
+    end
+
+    def s3_resource
+      @s3_resource ||= Aws::S3::Resource.new(
+        region: aws_region,
+        credentials: aws_credentials
+      )
+    end
+
     private
 
     def logger
@@ -142,24 +162,31 @@ module BigShift
 
     def rs_connection
       @rs_connection ||= PG.connect(
-        @config[:rs_credentials]['host'],
-        @config[:rs_credentials]['port'],
-
-
-        @config[:
-
-        @config[:rs_credentials]['password']
+        host: @config[:rs_credentials]['host'],
+        port: @config[:rs_credentials]['port'],
+        dbname: @config[:rs_database_name],
+        user: @config[:rs_credentials]['username'],
+        password: @config[:rs_credentials]['password'],
+        sslmode: 'require'
       )
     end
 
-    def
-    @
+    def cs_transfer_service
+      @cs_transfer_service ||= begin
         s = Google::Apis::StoragetransferV1::StoragetransferService.new
         s.authorization = gcp_credentials
         s
       end
     end
 
+    def cs_service
+      @cs_service ||= begin
+        s = Google::Apis::StorageV1::StorageService.new
+        s.authorization = gcp_credentials
+        s
+      end
+    end
+
     def bq_service
       @bq_service ||= begin
         s = Google::Apis::BigqueryV2::BigqueryService.new
@@ -169,7 +196,22 @@ module BigShift
     end
 
     def aws_credentials
-      @
+      @aws_credentials ||= begin
+        if @config[:aws_credentials]
+          credentials = Aws::Credentials.new(*@config[:aws_credentials].values_at('access_key_id', 'secret_access_key'))
+        else
+          credentials = nil
+        end
+        if (credentials = Aws::CredentialProviderChain.new(credentials).resolve)
+          credentials
+        else
+          raise 'No AWS credentials found'
+        end
+      end
+    end
+
+    def aws_region
+      @aws_region ||= ((awsc = @config[:aws_credentials]) && awsc['region']) || ENV['AWS_REGION'] || ENV['AWS_DEFAULT_REGION']
     end
 
     def raw_gcp_credentials

data/lib/bigshift/cloud_storage_transfer.rb
CHANGED
@@ -9,11 +9,11 @@ module BigShift
       @logger = options[:logger] || NullLogger::INSTANCE
     end
 
-    def copy_to_cloud_storage(
+    def copy_to_cloud_storage(unload_manifest, cloud_storage_bucket, options={})
       poll_interval = options[:poll_interval] || DEFAULT_POLL_INTERVAL
-      transfer_job = create_transfer_job(
+      transfer_job = create_transfer_job(unload_manifest, cloud_storage_bucket, options[:description], options[:allow_overwrite])
       transfer_job = @storage_transfer_service.create_transfer_job(transfer_job)
-      @logger.info(sprintf('Transferring objects from s3://%s/%s to gs://%s/%s',
+      @logger.info(sprintf('Transferring %d objects (%.2f GiB) from s3://%s/%s to gs://%s/%s', unload_manifest.count, unload_manifest.total_file_size.to_f/2**30, unload_manifest.bucket_name, unload_manifest.prefix, cloud_storage_bucket, unload_manifest.prefix))
       await_completion(transfer_job, poll_interval)
       nil
     end
@@ -22,7 +22,7 @@ module BigShift
 
     DEFAULT_POLL_INTERVAL = 30
 
-    def create_transfer_job(
+    def create_transfer_job(unload_manifest, cloud_storage_bucket, description, allow_overwrite)
       now = @clock.now.utc
       Google::Apis::StoragetransferV1::TransferJob.new(
         description: description,
@@ -35,17 +35,17 @@ module BigShift
         ),
         transfer_spec: Google::Apis::StoragetransferV1::TransferSpec.new(
           aws_s3_data_source: Google::Apis::StoragetransferV1::AwsS3Data.new(
-            bucket_name:
+            bucket_name: unload_manifest.bucket_name,
             aws_access_key: Google::Apis::StoragetransferV1::AwsAccessKey.new(
-              access_key_id: @aws_credentials
-              secret_access_key: @aws_credentials
+              access_key_id: @aws_credentials.access_key_id,
+              secret_access_key: @aws_credentials.secret_access_key,
             )
           ),
           gcs_data_sink: Google::Apis::StoragetransferV1::GcsData.new(
             bucket_name: cloud_storage_bucket
           ),
           object_conditions: Google::Apis::StoragetransferV1::ObjectConditions.new(
-            include_prefixes:
+            include_prefixes: unload_manifest.keys,
           ),
           transfer_options: Google::Apis::StoragetransferV1::TransferOptions.new(
             overwrite_objects_already_existing_in_sink: !!allow_overwrite
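Taken together with the CLI changes above, the transfer is now driven entirely by an `UnloadManifest`. A rough sketch of how the pieces fit (the project ID and bucket names are hypothetical, and the credential objects and manifest are assumed to exist already; see the factory methods in `lib/bigshift/cli.rb`):

```ruby
require 'bigshift'

storagetransfer = Google::Apis::StoragetransferV1::StoragetransferService.new
storagetransfer.authorization = gcp_credentials # an authorized GCP credential, set up elsewhere

transfer = BigShift::CloudStorageTransfer.new(storagetransfer, 'my-gcp-project', aws_credentials)
transfer.copy_to_cloud_storage(unload_manifest, 'my-cs-bucket', description: 'bigshift-my_db-my_table', allow_overwrite: false)
```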

data/lib/bigshift/redshift_unloader.rb
CHANGED
@@ -8,14 +8,16 @@ module BigShift
 
     def unload_to(table_name, s3_uri, options={})
       table_schema = RedshiftTableSchema.new(table_name, @redshift_connection)
-
+      credentials_string = "aws_access_key_id=#{@aws_credentials.access_key_id};aws_secret_access_key=#{@aws_credentials.secret_access_key}"
       select_sql = 'SELECT '
       select_sql << table_schema.columns.map(&:to_sql).join(', ')
       select_sql << %Q< FROM "#{table_name}">
       select_sql.gsub!('\'') { |s| '\\\'' }
       unload_sql = %Q<UNLOAD ('#{select_sql}')>
       unload_sql << %Q< TO '#{s3_uri}'>
-      unload_sql << %Q< CREDENTIALS '#{
+      unload_sql << %Q< CREDENTIALS '#{credentials_string}'>
+      unload_sql << %q< MANIFEST>
+      unload_sql << %q< GZIP>
       unload_sql << %q< DELIMITER '\t'>
       unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
       @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
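To make the change concrete, here is roughly the shape of the statement `unload_to` now builds; the table, bucket and column names are made up, the credentials are elided, and the real statement is assembled as a single string:

```ruby
# Illustrative only -- BigShift builds this string itself inside unload_to
unload_sql  = %q<UNLOAD ('SELECT "id", "name" FROM "my_table"')> # column list comes from RedshiftTableSchema
unload_sql << %q< TO 's3://my-bucket/my_db/my_table/'>
unload_sql << %q< CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'>
unload_sql << %q< MANIFEST> # new in 0.2.0: writes the manifest that UnloadManifest reads
unload_sql << %q< GZIP>     # new in 0.2.0: compress to keep the S3 to GCS transfer small
unload_sql << %q< DELIMITER '\t'>
```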

data/lib/bigshift/unload_manifest.rb
ADDED
@@ -0,0 +1,39 @@
+module BigShift
+  class UnloadManifest
+    attr_reader :bucket_name, :prefix, :manifest_key
+
+    def initialize(s3_resource, bucket_name, prefix)
+      @s3_resource = s3_resource
+      @bucket_name = bucket_name
+      @prefix = prefix
+      @manifest_key = "#{@prefix}manifest"
+    end
+
+    def keys
+      @keys ||= begin
+        bucket = @s3_resource.bucket(@bucket_name)
+        object = bucket.object(@manifest_key)
+        manifest = JSON.load(object.get.body)
+        manifest['entries'].map { |entry| entry['url'].sub(%r{\As3://[^/]+/}, '') }
+      end
+    end
+
+    def count
+      keys.size
+    end
+
+    def total_file_size
+      @total_file_size ||= begin
+        bucket = @s3_resource.bucket(@bucket_name)
+        objects = bucket.objects(prefix: @prefix)
+        objects.reduce(0) do |sum, object|
+          if keys.include?(object.key)
+            sum + object.size
+          else
+            sum
+          end
+        end
+      end
+    end
+  end
+end
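A small usage sketch of the new class; the bucket, prefix and credentials are hypothetical, and `UnloadManifest` only ever reads from S3:

```ruby
require 'bigshift'

s3 = Aws::S3::Resource.new(region: 'eu-west-1', credentials: aws_credentials) # credentials set up elsewhere
manifest = BigShift::UnloadManifest.new(s3, 'my-s3-bucket', 'my_db/my_table/')

manifest.manifest_key    # => "my_db/my_table/manifest"
manifest.keys            # object keys listed in the UNLOAD manifest, without the s3://bucket/ part
manifest.count           # number of dumped files
manifest.total_file_size # combined size of the dumped files, in bytes
```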
data/lib/bigshift/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bigshift
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Theo Hultberg
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-04-
+date: 2016-04-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pg
@@ -52,6 +52,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: aws-sdk
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description: |-
   BigShift is a tool for moving tables from Redshift
   to BigQuery. It will create a table in BigQuery with
@@ -71,10 +85,12 @@ files:
 - lib/bigshift.rb
 - lib/bigshift/big_query/dataset.rb
 - lib/bigshift/big_query/table.rb
+- lib/bigshift/cleaner.rb
 - lib/bigshift/cli.rb
 - lib/bigshift/cloud_storage_transfer.rb
 - lib/bigshift/redshift_table_schema.rb
 - lib/bigshift/redshift_unloader.rb
+- lib/bigshift/unload_manifest.rb
 - lib/bigshift/version.rb
 homepage: http://github.com/iconara/bigshift
 licenses: