bigshift 0.1.1
- checksums.yaml +7 -0
- data/LICENSE.txt +12 -0
- data/README.md +101 -0
- data/bin/bigshift +18 -0
- data/lib/bigshift.rb +29 -0
- data/lib/bigshift/big_query/dataset.rb +41 -0
- data/lib/bigshift/big_query/table.rb +74 -0
- data/lib/bigshift/cli.rb +186 -0
- data/lib/bigshift/cloud_storage_transfer.rb +104 -0
- data/lib/bigshift/redshift_table_schema.rb +87 -0
- data/lib/bigshift/redshift_unloader.rb +26 -0
- data/lib/bigshift/version.rb +3 -0
- metadata +103 -0

checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 6b933f1227d7a30c817577db6ca2f1517111d0e2
  data.tar.gz: c53b1f16c4977e04c796a5f645d3d4ca600e3b13
SHA512:
  metadata.gz: dc549cf4e6ec70de381ff11118967f68c3d6868aa3892656d379d265d3669f787a81b1193b1e605c0f84f8b692e75a51c5bf45d15e68bc7b43843047c22650e0
  data.tar.gz: 3c5407a160e9389e478c2b9c2c4f8561ffdcb64d038514bf5e6b41c4dc78dc83ba412c6aa31b1a9a9a24217525eb43b3ff54622cdef95053949031a0fbf11096

data/LICENSE.txt
ADDED
@@ -0,0 +1,12 @@
Copyright (c) 2014, Burt AB
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

data/README.md
ADDED
@@ -0,0 +1,101 @@
# BigShift

BigShift is a tool for moving tables from Redshift to BigQuery. It will create a table in BigQuery with a schema that matches the Redshift table, dump the data to S3, transfer it to GCS and finally load it into the BigQuery table.

# Installation

```
$ gem install bigshift
```

# Requirements

On the AWS side you need a Redshift cluster and an S3 bucket, and credentials that let you read from Redshift and read and write to the S3 bucket (it doesn't have to be the whole bucket, a prefix works fine). On the GCP side you need a Cloud Storage bucket, a BigQuery dataset and credentials that allow reading and writing to the bucket and creating BigQuery tables.

# Usage

The main interface to BigShift is the `bigshift` command line tool.

BigShift can also be used as a library in a Ruby application. Look at the tests, and at how the `bigshift` tool is built, to figure out how.
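
As a rough illustration of library use (a minimal sketch, not a documented API; the connection details, credentials, table and bucket names below are placeholders):

```ruby
require 'pg'
require 'bigshift'

# Placeholder connection and credentials, replace with your own
rs_connection = PG.connect('my-cluster.abc123.eu-west-1.redshift.amazonaws.com', 5439, nil, nil, 'my_db', 'my_user', 'secret')
aws_credentials = {'aws_access_key_id' => '...', 'aws_secret_access_key' => '...'}

# Dump the Redshift table to S3 as BigQuery-compatible CSV
unloader = BigShift::RedshiftUnloader.new(rs_connection, aws_credentials)
unloader.unload_to('my_table', 's3://my-bucket/my_db/my_table/', allow_overwrite: true)

# The translated schema used when creating and loading the BigQuery table
big_query_schema = BigShift::RedshiftTableSchema.new('my_table', rs_connection).to_big_query
```

`BigShift::CloudStorageTransfer` and `BigShift::BigQuery::Dataset`/`Table` cover the remaining steps; the `Factory` class in `lib/bigshift/cli.rb` shows how the Google API services are created and wired together.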

## Arguments

Running `bigshift` without any arguments, or with `--help`, will show the options. All except `--s3-prefix` are required.
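
For example, a complete run could look something like this (all file names, table names and buckets below are placeholders):

```
$ bigshift \
    --gcp-credentials service-account.json \
    --aws-credentials aws.yml \
    --rs-credentials redshift.yml \
    --rs-database my_db \
    --rs-table my_table \
    --bq-dataset my_dataset \
    --bq-table my_table \
    --s3-bucket my-s3-bucket \
    --cs-bucket my-gcs-bucket
```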

### GCP credentials

The `--gcp-credentials` argument must be a path to a JSON file that contains a public/private key pair for a GCP user. The best way to obtain this is to create a new service account and choose JSON as the key type when prompted.

### AWS credentials

The `--aws-credentials` argument must be a path to a JSON or YAML file that contains `aws_access_key_id` and `aws_secret_access_key`, and optionally `token`.

```yaml
---
aws_access_key_id: AKXYZABC123FOOBARBAZ
aws_secret_access_key: eW91ZmlndXJlZG91dGl0d2FzYmFzZTY0ISEhCg
```

These credentials need to be allowed to read and write the S3 location you specify with `--s3-bucket` and `--s3-prefix`.

### Redshift credentials

The `--rs-credentials` argument must be a path to a JSON or YAML file that contains the `host` and `port` of the Redshift cluster, as well as the `username` and `password` required to connect.

```yaml
---
host: my-cluster.abc123.eu-west-1.redshift.amazonaws.com
port: 5439
username: my_redshift_user
password: dGhpc2lzYWxzb2Jhc2U2NAo
```

# How does it work?

There are four main pieces to BigShift: the Redshift unloader, the transfer, the BigQuery load and the schema translation.

In theory it's pretty simple: the Redshift table is dumped to S3 using Redshift's `UNLOAD` command, copied over to GCS and loaded into BigQuery – but the devil is in the details.

The CSV produced by Redshift's `UNLOAD` can't be loaded into BigQuery no matter what options you specify on either end. Redshift can quote _all_ fields or none, but BigQuery doesn't allow non-string fields to be quoted. The formats of booleans and timestamps are not compatible, and the two systems expect quotes in quoted fields to be escaped differently, to name a few things.

This means that a lot of what BigShift does is make sure that the data that is dumped from Redshift is compatible with BigQuery. To do this it reads the table schema and translates the different data types while the data is dumped. Quotes are escaped, timestamps formatted, and so on.

Once the data is on S3 it's fairly simple to move it over to GCS. GCS has a great service called Transfer Service that does the transfer for you. If this didn't exist you would have to stream all of the bytes through the machine that ran BigShift. As long as you've set up the credentials right in AWS IAM this works smoothly.

Once the data is in GCS, the BigQuery table can be created and loaded. At this point the Redshift table's schema is translated into a BigQuery schema. The Redshift data types are mapped to BigQuery data types and things like nullability are determined. The mapping is straightforward:

* `BOOLEAN` in Redshift becomes `BOOLEAN` in BigQuery
* all Redshift integer types are mapped to BigQuery's `INTEGER`
* all Redshift floating point types are mapped to BigQuery's `FLOAT`
* `DATE` in Redshift becomes `STRING` in BigQuery (formatted as YYYY-MM-DD)
* `NUMERIC` is mapped to `STRING`, because BigQuery doesn't have any equivalent data type and using `STRING` avoids losing precision
* `TIMESTAMP` in Redshift becomes `TIMESTAMP` in BigQuery, and the data is transferred as a UNIX timestamp with fractional seconds (to the limit of what Redshift's `TIMESTAMP` data type provides)
* `CHAR` and `VARCHAR` obviously become `STRING` in BigQuery

`NOT NULL` becomes `REQUIRED` in BigQuery, and `NULL` becomes `NULLABLE`.
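
To make the translation concrete, each column is handled by `BigShift::RedshiftTableSchema::Column`, which produces both the BigQuery field definition and the SQL expression used in the `UNLOAD` query. Roughly (the column name here is made up):

```ruby
require 'bigshift'

# A NOT NULL timestamp column, as reported by Redshift's pg_table_def
column = BigShift::RedshiftTableSchema::Column.new('created_at', 'timestamp without time zone', false)

column.to_big_query # => field named "created_at", type TIMESTAMP, mode REQUIRED
column.to_sql       # => '(EXTRACT(epoch FROM "created_at") + EXTRACT(milliseconds FROM "created_at")/1000.0)'
```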

# What doesn't it do?

* Currently BigShift doesn't delete the dumped table from S3 or from GCS. This is planned.
* BigShift can't currently append to an existing BigQuery table. This feature would be possible to add.
* The tool will happily overwrite any data on S3, GCS and in BigQuery that happens to be in the way (i.e. in the specified S3 or GCS location, or in the target table). This is convenient if you want to move the same data multiple times, but very scary and unsafe. Clobbering everything will become an opt-in in the future, and the default will be much safer.
* There is no transformation or processing of the data. When moving to BigQuery you might want to split a string and use the pieces as values in a repeated field, but BigShift doesn't help you with that. You will almost always have to do some post-processing in BigQuery once the data has been moved. Processing on the way would require a lot more complexity and involve either Hadoop or Dataflow, and that's beyond the scope of a tool like this.
* BigShift can't move data back from BigQuery to Redshift. It can probably be done, but you would probably have to write a big part of the Redshift schema yourself, since BigQuery's data model is so much simpler. Going from Redshift to BigQuery is simple: most of Redshift's data types map directly to one of BigQuery's, and there are no encodings, sort keys or dist keys to worry about. Going in the other direction the tool can't know whether a `STRING` column in BigQuery should be a `CHAR(12)` or a `VARCHAR(65535)`, whether it should be encoded as `LZO` or `BYTEDICT`, or what the primary, sort, and dist keys of the table should be.

# Troubleshooting

### I get SSL errors

The certificates used by the Google APIs might not be installed on your system; try this as a workaround:

```
export SSL_CERT_FILE="$(find $GEM_HOME/gems -name 'google-api-client-*' | tail -n 1)/lib/cacerts.pem"
```

### I get errors when the data is loaded into BigQuery

This could be anything, but it could be things that aren't escaped properly when the data is dumped from Redshift. Try to figure out from the errors where the problem is and what the data looks like, and open an issue. The more you can figure out yourself, the more likely it is that you will get help. No one wants to trawl through your data, so make an effort.

# Copyright

© 2016 Theo Hultberg and contributors, see LICENSE.txt (BSD 3-Clause).

data/bin/bigshift
ADDED
@@ -0,0 +1,18 @@
#!/usr/bin/env ruby

require 'bigshift/cli'

begin
  BigShift::Cli.new(ARGV).run
rescue BigShift::CliError => e
  $stderr.puts("#{e.message}:")
  $stderr.puts
  e.details.each do |detail|
    $stderr.puts("* #{detail}")
  end
  $stderr.puts
  $stderr.puts(e.usage)
  $stderr.puts
  exit(1)
end

data/lib/bigshift.rb
ADDED
@@ -0,0 +1,29 @@
require 'google/apis/bigquery_v2'
require 'google/apis/storagetransfer_v1'

module BigShift
  BigShiftError = Class.new(StandardError)

  class NullLogger
    def close(*); end
    def debug(*); end
    def debug?; false end
    def error(*); end
    def error?; false end
    def fatal(*); end
    def fatal?; false end
    def info(*); end
    def info?; false end
    def unknown(*); end
    def warn(*); end
    def warn?; false end

    INSTANCE = new
  end
end

require 'bigshift/big_query/dataset'
require 'bigshift/big_query/table'
require 'bigshift/redshift_table_schema'
require 'bigshift/redshift_unloader'
require 'bigshift/cloud_storage_transfer'

data/lib/bigshift/big_query/dataset.rb
ADDED
@@ -0,0 +1,41 @@
module BigShift
  module BigQuery
    class Dataset
      def initialize(big_query_service, project_id, dataset_id, options={})
        @big_query_service = big_query_service
        @project_id = project_id
        @dataset_id = dataset_id
        @logger = options[:logger] || NullLogger::INSTANCE
      end

      def table(table_name)
        table_data = @big_query_service.get_table(@project_id, @dataset_id, table_name)
        Table.new(@big_query_service, table_data, logger: @logger)
      rescue Google::Apis::ClientError => e
        if e.status_code == 404
          nil
        else
          raise
        end
      end

      def create_table(table_name, options={})
        table_reference = Google::Apis::BigqueryV2::TableReference.new(
          project_id: @project_id,
          dataset_id: @dataset_id,
          table_id: table_name
        )
        if options[:schema]
          fields = options[:schema]['fields'].map { |f| Google::Apis::BigqueryV2::TableFieldSchema.new(name: f['name'], type: f['type'], mode: f['mode']) }
          schema = Google::Apis::BigqueryV2::TableSchema.new(fields: fields)
        end
        table_spec = {}
        table_spec[:table_reference] = table_reference
        table_spec[:schema] = schema if schema
        table_data = Google::Apis::BigqueryV2::Table.new(table_spec)
        table_data = @big_query_service.insert_table(@project_id, @dataset_id, table_data)
        Table.new(@big_query_service, table_data, logger: @logger)
      end
    end
  end
end

data/lib/bigshift/big_query/table.rb
ADDED
@@ -0,0 +1,74 @@
module BigShift
  module BigQuery
    class Table
      def initialize(big_query_service, table_data, options={})
        @big_query_service = big_query_service
        @table_data = table_data
        @logger = options[:logger] || NullLogger::INSTANCE
        @thread = options[:thread] || Kernel
      end

      def load(uri, options={})
        poll_interval = options[:poll_interval] || DEFAULT_POLL_INTERVAL
        load_configuration = {}
        load_configuration[:source_uris] = [uri]
        load_configuration[:write_disposition] = options[:allow_overwrite] ? 'WRITE_TRUNCATE' : 'WRITE_EMPTY'
        load_configuration[:create_disposition] = 'CREATE_IF_NEEDED'
        load_configuration[:schema] = options[:schema] if options[:schema]
        load_configuration[:source_format] = 'CSV'
        load_configuration[:field_delimiter] = '\t'
        load_configuration[:quote] = '"'
        load_configuration[:destination_table] = @table_data.table_reference
        job = Google::Apis::BigqueryV2::Job.new(
          configuration: Google::Apis::BigqueryV2::JobConfiguration.new(
            load: Google::Apis::BigqueryV2::JobConfigurationLoad.new(load_configuration)
          )
        )
        job = @big_query_service.insert_job(@table_data.table_reference.project_id, job)
        @logger.info(sprintf('Loading rows from %s to the table %s.%s', uri, @table_data.table_reference.dataset_id, @table_data.table_reference.table_id))
        started = false
        loop do
          job = @big_query_service.get_job(@table_data.table_reference.project_id, job.job_reference.job_id)
          if job.status && job.status.state == 'DONE'
            if job.status.errors.nil? || job.status.errors.empty?
              break
            else
              job.status.errors.each do |error|
                message = %<Load error: "#{error.message}">
                if error.location
                  file, line, field = error.location.split('/').map { |s| s.split(':').last.strip }
                  message << " at file #{file}, line #{line}"
                  message << ", field #{field}" if field
                end
                @logger.debug(message)
              end
              raise job.status.error_result.message
            end
          else
            state = job.status && job.status.state
            if state == 'RUNNING' && !started
              @logger.info('Loading started')
              started = true
            else
              @logger.debug(sprintf('Waiting for job %s (status: %s)', job.job_reference.job_id.inspect, state ? state.inspect : 'unknown'))
            end
            @thread.sleep(poll_interval)
          end
        end
        report_complete(job)
        nil
      end

      private

      DEFAULT_POLL_INTERVAL = 30

      def report_complete(job)
        statistics = job.statistics.load
        input_size = statistics.input_file_bytes.to_f/2**30
        output_size = statistics.output_bytes.to_f/2**30
        @logger.info(sprintf('Loading complete, %.2f GiB loaded from %s files, %s rows created, table size %.2f GiB', input_size, statistics.input_files, statistics.output_rows, output_size))
      end
    end
  end
end

data/lib/bigshift/cli.rb
ADDED
@@ -0,0 +1,186 @@
require 'pg'
require 'yaml'
require 'json'
require 'stringio'
require 'logger'
require 'optparse'
require 'bigshift'

module BigShift
  class CliError < BigShiftError
    attr_reader :details, :usage

    def initialize(message, details, usage)
      super(message)
      @details = details
      @usage = usage
    end
  end

  class Cli
    def initialize(argv, options={})
      @argv = argv.dup
      @factory_factory = options[:factory_factory] || Factory.method(:new)
    end

    def run
      setup
      unload
      transfer
      load
      cleanup
      nil
    end

    private

    def setup
      @config = parse_args(@argv)
      @factory = @factory_factory.call(@config)
    end

    def unload
      s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}/"
      @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: true)
    end

    def transfer
      description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
      @factory.cloud_storage_transfer.copy_to_cloud_storage(@config[:s3_bucket_name], "#{s3_table_prefix}/", @config[:cs_bucket_name], description: description, allow_overwrite: true)
    end

    def load
      rs_table_schema = @factory.redshift_table_schema
      bq_dataset = @factory.big_query_dataset
      bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
      gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}/*"
      bq_table.load(gcs_uri, schema: rs_table_schema.to_big_query, allow_overwrite: true)
    end

    def cleanup
    end

    ARGUMENTS = [
      ['--gcp-credentials', 'PATH', :gcp_credentials_path, :required],
      ['--aws-credentials', 'PATH', :aws_credentials_path, :required],
      ['--rs-credentials', 'PATH', :rs_credentials_path, :required],
      ['--rs-database', 'DB_NAME', :rs_database_name, :required],
      ['--rs-table', 'TABLE_NAME', :rs_table_name, :required],
      ['--bq-dataset', 'DATASET_ID', :bq_dataset_id, :required],
      ['--bq-table', 'TABLE_ID', :bq_table_id, :required],
      ['--s3-bucket', 'BUCKET_NAME', :s3_bucket_name, :required],
      ['--s3-prefix', 'PREFIX', :s3_prefix, nil],
      ['--cs-bucket', 'BUCKET_NAME', :cs_bucket_name, :required],
    ]

    def parse_args(argv)
      config = {}
      parser = OptionParser.new do |p|
        ARGUMENTS.each do |flag, value_name, config_key, _|
          p.on("#{flag} #{value_name}") { |v| config[config_key] = v }
        end
      end
      config_errors = []
      begin
        parser.parse!(argv)
      rescue OptionParser::InvalidOption => e
        config_errors << e.message
      end
      %w[gcp aws rs].each do |prefix|
        if (path = config["#{prefix}_credentials_path".to_sym]) && File.exist?(path)
          config["#{prefix}_credentials".to_sym] = YAML.load(File.read(path))
        elsif path && !File.exist?(path)
          config_errors << sprintf('%s does not exist', path.inspect)
        end
      end
      ARGUMENTS.each do |flag, _, config_key, required|
        if !config.include?(config_key) && required
          config_errors << "#{flag} is required"
        end
      end
      unless config_errors.empty?
        raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
      end
      config
    end

    def s3_table_prefix
      components = @config.values_at(:rs_database_name, :rs_table_name)
      if (prefix = @config[:s3_prefix])
        components.unshift(prefix)
      end
      File.join(*components)
    end
  end

  class Factory
    def initialize(config)
      @config = config
    end

    def redshift_unloader
      @redshift_unloader ||= RedshiftUnloader.new(rs_connection, aws_credentials, logger: logger)
    end

    def cloud_storage_transfer
      @cloud_storage_transfer ||= CloudStorageTransfer.new(gcs_transfer_service, raw_gcp_credentials['project_id'], aws_credentials, logger: logger)
    end

    def redshift_table_schema
      @redshift_table_schema ||= RedshiftTableSchema.new(@config[:rs_table_name], rs_connection)
    end

    def big_query_dataset
      @big_query_dataset ||= BigQuery::Dataset.new(bq_service, raw_gcp_credentials['project_id'], @config[:bq_dataset_id], logger: logger)
    end

    private

    def logger
      @logger ||= Logger.new($stderr)
    end

    def rs_connection
      @rs_connection ||= PG.connect(
        @config[:rs_credentials]['host'],
        @config[:rs_credentials]['port'],
        nil,
        nil,
        @config[:rs_database_name],
        @config[:rs_credentials]['username'],
        @config[:rs_credentials]['password']
      )
    end

    def gcs_transfer_service
      @gcs_transfer_service ||= begin
        s = Google::Apis::StoragetransferV1::StoragetransferService.new
        s.authorization = gcp_credentials
        s
      end
    end

    def bq_service
      @bq_service ||= begin
        s = Google::Apis::BigqueryV2::BigqueryService.new
        s.authorization = gcp_credentials
        s
      end
    end

    def aws_credentials
      @config[:aws_credentials]
    end

    def raw_gcp_credentials
      @config[:gcp_credentials]
    end

    def gcp_credentials
      @gcp_credentials ||= Google::Auth::ServiceAccountCredentials.make_creds(
        json_key_io: StringIO.new(JSON.dump(raw_gcp_credentials)),
        scope: Google::Apis::StoragetransferV1::AUTH_CLOUD_PLATFORM
      )
    end
  end
end

data/lib/bigshift/cloud_storage_transfer.rb
ADDED
@@ -0,0 +1,104 @@
module BigShift
  class CloudStorageTransfer
    def initialize(storage_transfer_service, project_id, aws_credentials, options={})
      @storage_transfer_service = storage_transfer_service
      @project_id = project_id
      @aws_credentials = aws_credentials
      @clock = options[:clock] || Time
      @thread = options[:thread] || Kernel
      @logger = options[:logger] || NullLogger::INSTANCE
    end

    def copy_to_cloud_storage(s3_bucket, s3_path_prefix, cloud_storage_bucket, options={})
      poll_interval = options[:poll_interval] || DEFAULT_POLL_INTERVAL
      transfer_job = create_transfer_job(s3_bucket, s3_path_prefix, cloud_storage_bucket, options[:description], options[:allow_overwrite])
      transfer_job = @storage_transfer_service.create_transfer_job(transfer_job)
      @logger.info(sprintf('Transferring objects from s3://%s/%s to gs://%s/%s', s3_bucket, s3_path_prefix, cloud_storage_bucket, s3_path_prefix))
      await_completion(transfer_job, poll_interval)
      nil
    end

    private

    DEFAULT_POLL_INTERVAL = 30

    def create_transfer_job(s3_bucket, s3_path_prefix, cloud_storage_bucket, description, allow_overwrite)
      now = @clock.now.utc
      Google::Apis::StoragetransferV1::TransferJob.new(
        description: description,
        project_id: @project_id,
        status: 'ENABLED',
        schedule: Google::Apis::StoragetransferV1::Schedule.new(
          schedule_start_date: Google::Apis::StoragetransferV1::Date.new(year: now.year, month: now.month, day: now.day),
          schedule_end_date: Google::Apis::StoragetransferV1::Date.new(year: now.year, month: now.month, day: now.day),
          start_time_of_day: Google::Apis::StoragetransferV1::TimeOfDay.new(hours: now.hour, minutes: now.min + 1)
        ),
        transfer_spec: Google::Apis::StoragetransferV1::TransferSpec.new(
          aws_s3_data_source: Google::Apis::StoragetransferV1::AwsS3Data.new(
            bucket_name: s3_bucket,
            aws_access_key: Google::Apis::StoragetransferV1::AwsAccessKey.new(
              access_key_id: @aws_credentials['aws_access_key_id'],
              secret_access_key: @aws_credentials['aws_secret_access_key'],
            )
          ),
          gcs_data_sink: Google::Apis::StoragetransferV1::GcsData.new(
            bucket_name: cloud_storage_bucket
          ),
          object_conditions: Google::Apis::StoragetransferV1::ObjectConditions.new(
            include_prefixes: [s3_path_prefix]
          ),
          transfer_options: Google::Apis::StoragetransferV1::TransferOptions.new(
            overwrite_objects_already_existing_in_sink: !!allow_overwrite
          )
        )
      )
    end

    def await_completion(transfer_job, poll_interval)
      started = false
      loop do
        operation = nil
        failures = 0
        begin
          operations_response = @storage_transfer_service.list_transfer_operations('transferOperations', filter: JSON.dump({project_id: @project_id, job_names: [transfer_job.name]}))
          operation = operations_response.operations && operations_response.operations.first
        rescue Google::Apis::ServerError => e
          failures += 1
          if failures < 5
            @logger.debug(sprintf('Error while waiting for job %s, will retry: %s (%s)', transfer_job.name.inspect, e.message.inspect, e.class.name))
            @thread.sleep(poll_interval)
            retry
          else
            raise sprintf('Transfer failed: %s (%s)', e.message.inspect, e.class.name)
          end
        end
        if operation && operation.done?
          handle_completion(transfer_job, operation)
          break
        else
          status = operation && operation.metadata && operation.metadata['status']
          if status == 'IN_PROGRESS' && !started
            @logger.info(sprintf('Transfer %s started', transfer_job.description))
            started = true
          else
            @logger.debug(sprintf('Waiting for job %s (name: %s, status: %s)', transfer_job.description.inspect, transfer_job.name.inspect, status ? status.inspect : 'unknown'))
          end
          @thread.sleep(poll_interval)
        end
      end
    end

    def handle_completion(transfer_job, operation)
      if operation.metadata['status'] == 'FAILED'
        raise 'Transfer failed'
      else
        message = sprintf('Transfer %s complete', transfer_job.description)
        if (counters = operation.metadata['counters'])
          size_in_gib = counters['bytesCopiedToSink'].to_f / 2**30
          message << sprintf(', %s objects and %.2f GiB copied', counters['objectsCopiedToSink'], size_in_gib)
        end
        @logger.info(message)
      end
    end
  end
end

data/lib/bigshift/redshift_table_schema.rb
ADDED
@@ -0,0 +1,87 @@
module BigShift
  class RedshiftTableSchema
    def initialize(table_name, redshift_connection)
      @table_name = table_name
      @redshift_connection = redshift_connection
    end

    def columns
      @columns ||= begin
        rows = @redshift_connection.exec_params(%|SELECT "column", "type", "notnull" FROM "pg_table_def" WHERE "schemaname" = 'public' AND "tablename" = $1|, [@table_name])
        if rows.count == 0
          raise sprintf('Table not found: %s', @table_name.inspect)
        else
          columns = rows.map do |row|
            name = row['column']
            type = row['type']
            nullable = row['notnull'] == 'f'
            Column.new(name, type, nullable)
          end
          columns.sort_by!(&:name)
          columns
        end
      end
    end

    def to_big_query
      Google::Apis::BigqueryV2::TableSchema.new(fields: columns.map(&:to_big_query))
    end

    class Column
      attr_reader :name, :type

      def initialize(name, type, nullable)
        @name = name
        @type = type
        @nullable = nullable
      end

      def nullable?
        @nullable
      end

      def to_big_query
        Google::Apis::BigqueryV2::TableFieldSchema.new(
          name: @name,
          type: big_query_type,
          mode: @nullable ? 'NULLABLE' : 'REQUIRED'
        )
      end

      def to_sql
        case @type
        when /^numeric/, /int/, /^double/, 'real'
          sprintf('"%s"', @name)
        when /^character/
          sprintf(%q<('"' || REPLACE(REPLACE(REPLACE("%s", '"', '""'), '\\n', '\\\\n'), '\\r', '\\\\r') || '"')>, @name)
        when /^timestamp/
          sprintf('(EXTRACT(epoch FROM "%s") + EXTRACT(milliseconds FROM "%s")/1000.0)', @name, @name)
        when 'date'
          sprintf(%q<(TO_CHAR("%s", 'YYYY-MM-DD'))>, @name)
        when 'boolean'
          if nullable?
            sprintf('(CASE WHEN "%s" IS NULL THEN NULL WHEN "%s" THEN 1 ELSE 0 END)', @name, @name)
          else
            sprintf('(CASE WHEN "%s" THEN 1 ELSE 0 END)', @name)
          end
        else
          raise sprintf('Unsupported column type: %s', type.inspect)
        end
      end

      private

      def big_query_type
        case @type
        when /^character/, /^numeric/, 'date' then 'STRING'
        when /^timestamp/ then 'TIMESTAMP'
        when /int/ then 'INTEGER'
        when 'boolean' then 'BOOLEAN'
        when /^double/, 'real' then 'FLOAT'
        else
          raise sprintf('Unsupported column type: %s', type.inspect)
        end
      end
    end
  end
end

data/lib/bigshift/redshift_unloader.rb
ADDED
@@ -0,0 +1,26 @@
module BigShift
  class RedshiftUnloader
    def initialize(redshift_connection, aws_credentials, options={})
      @redshift_connection = redshift_connection
      @aws_credentials = aws_credentials
      @logger = options[:logger] || NullLogger::INSTANCE
    end

    def unload_to(table_name, s3_uri, options={})
      table_schema = RedshiftTableSchema.new(table_name, @redshift_connection)
      credentials = @aws_credentials.map { |pair| pair.join('=') }.join(';')
      select_sql = 'SELECT '
      select_sql << table_schema.columns.map(&:to_sql).join(', ')
      select_sql << %Q< FROM "#{table_name}">
      select_sql.gsub!('\'') { |s| '\\\'' }
      unload_sql = %Q<UNLOAD ('#{select_sql}')>
      unload_sql << %Q< TO '#{s3_uri}'>
      unload_sql << %Q< CREDENTIALS '#{credentials}'>
      unload_sql << %q< DELIMITER '\t'>
      unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
      @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
      @redshift_connection.exec(unload_sql)
      @logger.info(sprintf('Unload of %s complete', table_name))
    end
  end
end

metadata
ADDED
@@ -0,0 +1,103 @@
--- !ruby/object:Gem::Specification
name: bigshift
version: !ruby/object:Gem::Version
  version: 0.1.1
platform: ruby
authors:
- Theo Hultberg
autorequire:
bindir: bin
cert_chain: []
date: 2016-04-08 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: pg
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: google-api-client
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '0.9'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '0.9'
- !ruby/object:Gem::Dependency
  name: googleauth
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
description: |-
  BigShift is a tool for moving tables from Redshift
  to BigQuery. It will create a table in BigQuery with
  a schema that matches the Redshift table, dump the
  data to S3, transfer it to GCS and finally load it
  into the BigQuery table.
email:
- theo@iconara.net
executables:
- bigshift
extensions: []
extra_rdoc_files: []
files:
- LICENSE.txt
- README.md
- bin/bigshift
- lib/bigshift.rb
- lib/bigshift/big_query/dataset.rb
- lib/bigshift/big_query/table.rb
- lib/bigshift/cli.rb
- lib/bigshift/cloud_storage_transfer.rb
- lib/bigshift/redshift_table_schema.rb
- lib/bigshift/redshift_unloader.rb
- lib/bigshift/version.rb
homepage: http://github.com/iconara/bigshift
licenses:
- BSD-3-Clause
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: 1.9.3
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 2.4.8
signing_key:
specification_version: 4
summary: A tool for moving tables from Redshift to BigQuery
test_files: []