bigshift 0.1.1

checksums.yaml.gz ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 6b933f1227d7a30c817577db6ca2f1517111d0e2
+   data.tar.gz: c53b1f16c4977e04c796a5f645d3d4ca600e3b13
+ SHA512:
+   metadata.gz: dc549cf4e6ec70de381ff11118967f68c3d6868aa3892656d379d265d3669f787a81b1193b1e605c0f84f8b692e75a51c5bf45d15e68bc7b43843047c22650e0
+   data.tar.gz: 3c5407a160e9389e478c2b9c2c4f8561ffdcb64d038514bf5e6b41c4dc78dc83ba412c6aa31b1a9a9a24217525eb43b3ff54622cdef95053949031a0fbf11096
LICENSE.txt ADDED
@@ -0,0 +1,12 @@
+ Copyright (c) 2014, Burt AB
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+ 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
README.md ADDED
@@ -0,0 +1,101 @@
+ # BigShift
+
+ BigShift is a tool for moving tables from Redshift to BigQuery. It will create a table in BigQuery with a schema that matches the Redshift table, dump the data to S3, transfer it to GCS and finally load it into the BigQuery table.
+
+ # Installation
+
+ ```
+ $ gem install bigshift
+ ```
+
+ # Requirements
+
+ On the AWS side you need a Redshift cluster and an S3 bucket, and credentials that let you read from Redshift and read and write to the S3 bucket (the access doesn't have to cover the whole bucket, a prefix works fine). On the GCP side you need a Cloud Storage bucket, a BigQuery dataset, and credentials that allow reading and writing to the bucket and creating BigQuery tables.
+
+ # Usage
+
+ The main interface to BigShift is the `bigshift` command line tool.
+
+ BigShift can also be used as a library in a Ruby application. Look at the tests, and at how the `bigshift` tool is built, to figure out how; a rough sketch follows below.
+
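For orientation, here is an editor's sketch (not part of the gem's own README) of the same pipeline the `bigshift` tool drives, using the classes shipped in this release. The class and method names come straight from the sources below; the connection details, project, buckets, dataset and file names are placeholders:

```ruby
require 'pg'
require 'bigshift'

# Placeholder Redshift connection and AWS credentials (string keys, as the
# YAML credentials files are parsed).
rs = PG.connect('my-cluster.example.redshift.amazonaws.com', 5439, nil, nil, 'my_db', 'my_user', 'secret')
aws_credentials = {
  'aws_access_key_id' => 'AKXYZABC123FOOBARBAZ',
  'aws_secret_access_key' => '...'
}

# GCP service objects with service account authorization, mirroring how
# BigShift::Cli's Factory sets them up.
credentials = Google::Auth::ServiceAccountCredentials.make_creds(
  json_key_io: File.open('gcp-credentials.json'),
  scope: Google::Apis::StoragetransferV1::AUTH_CLOUD_PLATFORM
)
transfer_service = Google::Apis::StoragetransferV1::StoragetransferService.new
transfer_service.authorization = credentials
bq_service = Google::Apis::BigqueryV2::BigqueryService.new
bq_service.authorization = credentials

# 1. Dump the Redshift table to S3.
unloader = BigShift::RedshiftUnloader.new(rs, aws_credentials)
unloader.unload_to('my_table', 's3://my-s3-bucket/my_db/my_table/', allow_overwrite: true)

# 2. Copy the dump from S3 to Cloud Storage.
transfer = BigShift::CloudStorageTransfer.new(transfer_service, 'my-gcp-project', aws_credentials)
transfer.copy_to_cloud_storage('my-s3-bucket', 'my_db/my_table/', 'my-gcs-bucket', allow_overwrite: true)

# 3. Translate the schema and load the data into BigQuery.
schema = BigShift::RedshiftTableSchema.new('my_table', rs).to_big_query
dataset = BigShift::BigQuery::Dataset.new(bq_service, 'my-gcp-project', 'my_dataset')
table = dataset.table('my_table') || dataset.create_table('my_table')
table.load('gs://my-gcs-bucket/my_db/my_table/*', schema: schema, allow_overwrite: true)
```
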
+ ## Arguments
+
+ Running `bigshift` without any arguments, or with `--help`, will show the options. All options except `--s3-prefix` are required.
+
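As an illustration (the file names, database, table and buckets below are placeholders, and the optional `--s3-prefix` is left out), a full invocation looks something like this:

```
$ bigshift \
    --gcp-credentials gcp-credentials.json \
    --aws-credentials aws-credentials.yml \
    --rs-credentials rs-credentials.yml \
    --rs-database my_db \
    --rs-table my_table \
    --bq-dataset my_dataset \
    --bq-table my_table \
    --s3-bucket my-s3-bucket \
    --cs-bucket my-gcs-bucket
```
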
+ ### GCP credentials
+
+ The `--gcp-credentials` argument must be a path to a JSON file that contains a public/private key pair for a GCP service account. The best way to obtain this is to create a new service account and choose JSON as the key type when prompted.
+
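For reference, the downloaded key is the standard service account JSON key; it typically contains fields like the following (all values here are placeholders), and BigShift reads the `project_id` from it:

```json
{
  "type": "service_account",
  "project_id": "my-gcp-project",
  "private_key_id": "0123456789abcdef",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "bigshift@my-gcp-project.iam.gserviceaccount.com",
  "client_id": "123456789012345678901"
}
```
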
+ ### AWS credentials
+
+ The `--aws-credentials` argument must be a path to a JSON or YAML file that contains `aws_access_key_id` and `aws_secret_access_key`, and optionally `token`.
+
+ ```yaml
+ ---
+ aws_access_key_id: AKXYZABC123FOOBARBAZ
+ aws_secret_access_key: eW91ZmlndXJlZG91dGl0d2FzYmFzZTY0ISEhCg
+ ```
+
+ These credentials need to be allowed to read and write the S3 location you specify with `--s3-bucket` and `--s3-prefix`.
+
+ ### Redshift credentials
+
+ The `--rs-credentials` argument must be a path to a JSON or YAML file that contains the `host` and `port` of the Redshift cluster, as well as the `username` and `password` required to connect.
+
+ ```yaml
+ ---
+ host: my-cluster.abc123.eu-west-1.redshift.amazonaws.com
+ port: 5439
+ username: my_redshift_user
+ password: dGhpc2lzYWxzb2Jhc2U2NAo
+ ```
+
+ # How does it work?
+
+ There are four main pieces to BigShift: the Redshift unloader, the transfer, the BigQuery load and the schema translation.
+
+ In theory it's pretty simple: the Redshift table is dumped to S3 using Redshift's `UNLOAD` command, copied over to GCS and loaded into BigQuery – but the devil is in the details.
+
+ The CSV produced by Redshift's `UNLOAD` can't be loaded into BigQuery no matter what options you specify on either end. Redshift can quote _all_ fields or none, but BigQuery doesn't allow non-string fields to be quoted. The formats of booleans and timestamps are not compatible, and the two expect quotes in quoted fields to be escaped differently, to name a few things.
+
+ This means that a lot of what BigShift does is make sure that the data dumped from Redshift is compatible with BigQuery. To do this it reads the table schema and translates the different datatypes while the data is dumped. Quotes are escaped, timestamps are formatted, and so on.
+
+ Once the data is on S3 it's fairly simple to move it over to GCS. GCS has a great service called Transfer Service that does the transfer for you. If it didn't exist you would have to stream all of the bytes through the machine running BigShift. As long as you've set up the credentials correctly in AWS IAM, this works smoothly.
+
+ Once the data is in GCS, the BigQuery table can be created and loaded. At this point the Redshift table's schema is translated into a BigQuery schema. The Redshift datatypes are mapped to BigQuery datatypes and things like nullability are determined. The mapping is straightforward:
+
+ * `BOOLEAN` in Redshift becomes `BOOLEAN` in BigQuery
+ * all Redshift integer types are mapped to BigQuery's `INTEGER`
+ * all Redshift floating point types are mapped to BigQuery's `FLOAT`
+ * `DATE` in Redshift becomes `STRING` in BigQuery (formatted as YYYY-MM-DD)
+ * `NUMERIC` is mapped to `STRING`, because BigQuery doesn't have any equivalent data type and using `STRING` avoids losing precision
+ * `TIMESTAMP` in Redshift becomes `TIMESTAMP` in BigQuery, and the data is transferred as a UNIX timestamp with fractional seconds (to the limit of what Redshift's `TIMESTAMP` datatype provides)
+ * `CHAR` and `VARCHAR` obviously become `STRING` in BigQuery
+
+ `NOT NULL` becomes `REQUIRED` in BigQuery, and `NULL` becomes `NULLABLE`.
+
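To make the translation concrete, here is a small editor's sketch (the column name and type are made up) using the `RedshiftTableSchema::Column` class from this release, which produces both the BigQuery field definition and the `SELECT` expression used in the `UNLOAD`:

```ruby
require 'bigshift'

# A nullable VARCHAR column, as RedshiftTableSchema builds it from pg_table_def
# (column name, Redshift type, nullable flag).
column = BigShift::RedshiftTableSchema::Column.new('title', 'character varying(100)', true)

column.to_big_query
# => a TableFieldSchema with name: 'title', type: 'STRING', mode: 'NULLABLE'

column.to_sql
# => ('"' || REPLACE(REPLACE(REPLACE("title", '"', '""'), '\n', '\\n'), '\r', '\\r') || '"')
#    i.e. the value is quoted and embedded quotes, newlines and carriage returns
#    are escaped so that the dumped CSV can be parsed by BigQuery
```
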
+ # What doesn't it do?
+
+ * Currently BigShift doesn't delete the dumped data from S3 or from GCS. This is planned.
+ * BigShift can't currently append to an existing BigQuery table. This feature would be possible to add.
+ * The tool will happily overwrite any data on S3, on GCS and in BigQuery that happens to be in the way (i.e. in the specified S3 or GCS location, or in the target table). This is convenient if you want to move the same data multiple times, but very scary and unsafe. Clobbering everything will remain an option in the future, but the default will be much safer.
+ * There is no transformation or processing of the data. When moving to BigQuery you might want to split a string and use the pieces as values in a repeated field, but BigShift doesn't help you with that. You will almost always have to do some post-processing in BigQuery once the data has been moved. Processing on the way would require a lot more complexity and involve either Hadoop or Dataflow, and that's beyond the scope of a tool like this.
+ * BigShift can't move data back from BigQuery to Redshift. It can probably be done, but you would have to write a big part of the Redshift schema yourself, since BigQuery's data model is so much simpler. Going from Redshift to BigQuery is simple: most of Redshift's datatypes map directly to one of BigQuery's, and there are no encodings, sort keys or dist keys to worry about. Going in the other direction, the tool can't know whether a `STRING` column in BigQuery should be a `CHAR(12)` or a `VARCHAR(65535)`, whether it should be encoded as `LZO` or `BYTEDICT`, or what the primary, sort and dist keys of the table should be.
+
+ # Troubleshooting
+
+ ### I get SSL errors
+
+ The certificates used by the Google APIs might not be installed on your system; try this as a workaround:
+
+ ```
+ export SSL_CERT_FILE="$(find $GEM_HOME/gems -name 'google-api-client-*' | tail -n 1)/lib/cacerts.pem"
+ ```
+
+ ### I get errors when the data is loaded into BigQuery
+
+ This could be anything, but it's often data that isn't escaped properly when it's dumped from Redshift. Try to figure out from the errors where the problem is and what the data looks like, then open an issue. The more you can figure out yourself, the more likely it is that you will get help. No one wants to trawl through your data, so make an effort.
+
+ # Copyright
+
+ © 2016 Theo Hultberg and contributors, see LICENSE.txt (BSD 3-Clause).
bin/bigshift ADDED
@@ -0,0 +1,18 @@
+ #!/usr/bin/env ruby
+
+ require 'bigshift/cli'
+
+ begin
+   BigShift::Cli.new(ARGV).run
+ rescue BigShift::CliError => e
+   $stderr.puts("#{e.message}:")
+   $stderr.puts
+   e.details.each do |detail|
+     $stderr.puts("* #{detail}")
+   end
+   $stderr.puts
+   $stderr.puts(e.usage)
+   $stderr.puts
+   exit(1)
+ end
+
lib/bigshift.rb ADDED
@@ -0,0 +1,29 @@
+ require 'google/apis/bigquery_v2'
+ require 'google/apis/storagetransfer_v1'
+
+ module BigShift
+   BigShiftError = Class.new(StandardError)
+
+   class NullLogger
+     def close(*); end
+     def debug(*); end
+     def debug?; false end
+     def error(*); end
+     def error?; false end
+     def fatal(*); end
+     def fatal?; false end
+     def info(*); end
+     def info?; false end
+     def unknown(*); end
+     def warn(*); end
+     def warn?; false end
+
+     INSTANCE = new
+   end
+ end
+
+ require 'bigshift/big_query/dataset'
+ require 'bigshift/big_query/table'
+ require 'bigshift/redshift_table_schema'
+ require 'bigshift/redshift_unloader'
+ require 'bigshift/cloud_storage_transfer'
lib/bigshift/big_query/dataset.rb ADDED
@@ -0,0 +1,41 @@
+ module BigShift
+   module BigQuery
+     class Dataset
+       def initialize(big_query_service, project_id, dataset_id, options={})
+         @big_query_service = big_query_service
+         @project_id = project_id
+         @dataset_id = dataset_id
+         @logger = options[:logger] || NullLogger::INSTANCE
+       end
+
+       def table(table_name)
+         table_data = @big_query_service.get_table(@project_id, @dataset_id, table_name)
+         Table.new(@big_query_service, table_data, logger: @logger)
+       rescue Google::Apis::ClientError => e
+         if e.status_code == 404
+           nil
+         else
+           raise
+         end
+       end
+
+       def create_table(table_name, options={})
+         table_reference = Google::Apis::BigqueryV2::TableReference.new(
+           project_id: @project_id,
+           dataset_id: @dataset_id,
+           table_id: table_name
+         )
+         if options[:schema]
+           fields = options[:schema]['fields'].map { |f| Google::Apis::BigqueryV2::TableFieldSchema.new(name: f['name'], type: f['type'], mode: f['mode']) }
+           schema = Google::Apis::BigqueryV2::TableSchema.new(fields: fields)
+         end
+         table_spec = {}
+         table_spec[:table_reference] = table_reference
+         table_spec[:schema] = schema if schema
+         table_data = Google::Apis::BigqueryV2::Table.new(table_spec)
+         table_data = @big_query_service.insert_table(@project_id, @dataset_id, table_data)
+         Table.new(@big_query_service, table_data, logger: @logger)
+       end
+     end
+   end
+ end
lib/bigshift/big_query/table.rb ADDED
@@ -0,0 +1,74 @@
+ module BigShift
+   module BigQuery
+     class Table
+       def initialize(big_query_service, table_data, options={})
+         @big_query_service = big_query_service
+         @table_data = table_data
+         @logger = options[:logger] || NullLogger::INSTANCE
+         @thread = options[:thread] || Kernel
+       end
+
+       def load(uri, options={})
+         poll_interval = options[:poll_interval] || DEFAULT_POLL_INTERVAL
+         load_configuration = {}
+         load_configuration[:source_uris] = [uri]
+         load_configuration[:write_disposition] = options[:allow_overwrite] ? 'WRITE_TRUNCATE' : 'WRITE_EMPTY'
+         load_configuration[:create_disposition] = 'CREATE_IF_NEEDED'
+         load_configuration[:schema] = options[:schema] if options[:schema]
+         load_configuration[:source_format] = 'CSV'
+         load_configuration[:field_delimiter] = '\t'
+         load_configuration[:quote] = '"'
+         load_configuration[:destination_table] = @table_data.table_reference
+         job = Google::Apis::BigqueryV2::Job.new(
+           configuration: Google::Apis::BigqueryV2::JobConfiguration.new(
+             load: Google::Apis::BigqueryV2::JobConfigurationLoad.new(load_configuration)
+           )
+         )
+         job = @big_query_service.insert_job(@table_data.table_reference.project_id, job)
+         @logger.info(sprintf('Loading rows from %s to the table %s.%s', uri, @table_data.table_reference.dataset_id, @table_data.table_reference.table_id))
+         started = false
+         loop do
+           job = @big_query_service.get_job(@table_data.table_reference.project_id, job.job_reference.job_id)
+           if job.status && job.status.state == 'DONE'
+             if job.status.errors.nil? || job.status.errors.empty?
+               break
+             else
+               job.status.errors.each do |error|
+                 message = %<Load error: "#{error.message}">
+                 if error.location
+                   file, line, field = error.location.split('/').map { |s| s.split(':').last.strip }
+                   message << " at file #{file}, line #{line}"
+                   message << ", field #{field}" if field
+                 end
+                 @logger.debug(message)
+               end
+               raise job.status.error_result.message
+             end
+           else
+             state = job.status && job.status.state
+             if state == 'RUNNING' && !started
+               @logger.info('Loading started')
+               started = true
+             else
+               @logger.debug(sprintf('Waiting for job %s (status: %s)', job.job_reference.job_id.inspect, state ? state.inspect : 'unknown'))
+             end
+             @thread.sleep(poll_interval)
+           end
+         end
+         report_complete(job)
+         nil
+       end
+
+       private
+
+       DEFAULT_POLL_INTERVAL = 30
+
+       def report_complete(job)
+         statistics = job.statistics.load
+         input_size = statistics.input_file_bytes.to_f/2**30
+         output_size = statistics.output_bytes.to_f/2**30
+         @logger.info(sprintf('Loading complete, %.2f GiB loaded from %s files, %s rows created, table size %.2f GiB', input_size, statistics.input_files, statistics.output_rows, output_size))
+       end
+     end
+   end
+ end
lib/bigshift/cli.rb ADDED
@@ -0,0 +1,186 @@
+ require 'pg'
+ require 'yaml'
+ require 'json'
+ require 'stringio'
+ require 'logger'
+ require 'optparse'
+ require 'bigshift'
+
+ module BigShift
+   class CliError < BigShiftError
+     attr_reader :details, :usage
+
+     def initialize(message, details, usage)
+       super(message)
+       @details = details
+       @usage = usage
+     end
+   end
+
+   class Cli
+     def initialize(argv, options={})
+       @argv = argv.dup
+       @factory_factory = options[:factory_factory] || Factory.method(:new)
+     end
+
+     def run
+       setup
+       unload
+       transfer
+       load
+       cleanup
+       nil
+     end
+
+     private
+
+     def setup
+       @config = parse_args(@argv)
+       @factory = @factory_factory.call(@config)
+     end
+
+     def unload
+       s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}/"
+       @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: true)
+     end
+
+     def transfer
+       description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
+       @factory.cloud_storage_transfer.copy_to_cloud_storage(@config[:s3_bucket_name], "#{s3_table_prefix}/", @config[:cs_bucket_name], description: description, allow_overwrite: true)
+     end
+
+     def load
+       rs_table_schema = @factory.redshift_table_schema
+       bq_dataset = @factory.big_query_dataset
+       bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
+       gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}/*"
+       bq_table.load(gcs_uri, schema: rs_table_schema.to_big_query, allow_overwrite: true)
+     end
+
+     def cleanup
+     end
+
+     ARGUMENTS = [
+       ['--gcp-credentials', 'PATH', :gcp_credentials_path, :required],
+       ['--aws-credentials', 'PATH', :aws_credentials_path, :required],
+       ['--rs-credentials', 'PATH', :rs_credentials_path, :required],
+       ['--rs-database', 'DB_NAME', :rs_database_name, :required],
+       ['--rs-table', 'TABLE_NAME', :rs_table_name, :required],
+       ['--bq-dataset', 'DATASET_ID', :bq_dataset_id, :required],
+       ['--bq-table', 'TABLE_ID', :bq_table_id, :required],
+       ['--s3-bucket', 'BUCKET_NAME', :s3_bucket_name, :required],
+       ['--s3-prefix', 'PREFIX', :s3_prefix, nil],
+       ['--cs-bucket', 'BUCKET_NAME', :cs_bucket_name, :required],
+     ]
+
+     def parse_args(argv)
+       config = {}
+       parser = OptionParser.new do |p|
+         ARGUMENTS.each do |flag, value_name, config_key, _|
+           p.on("#{flag} #{value_name}") { |v| config[config_key] = v }
+         end
+       end
+       config_errors = []
+       begin
+         parser.parse!(argv)
+       rescue OptionParser::InvalidOption => e
+         config_errors << e.message
+       end
+       %w[gcp aws rs].each do |prefix|
+         if (path = config["#{prefix}_credentials_path".to_sym]) && File.exist?(path)
+           config["#{prefix}_credentials".to_sym] = YAML.load(File.read(path))
+         elsif path && !File.exist?(path)
+           config_errors << sprintf('%s does not exist', path.inspect)
+         end
+       end
+       ARGUMENTS.each do |flag, _, config_key, required|
+         if !config.include?(config_key) && required
+           config_errors << "#{flag} is required"
+         end
+       end
+       unless config_errors.empty?
+         raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
+       end
+       config
+     end
+
+     def s3_table_prefix
+       components = @config.values_at(:rs_database_name, :rs_table_name)
+       if (prefix = @config[:s3_prefix])
+         components.unshift(prefix)
+       end
+       File.join(*components)
+     end
+   end
+
+   class Factory
+     def initialize(config)
+       @config = config
+     end
+
+     def redshift_unloader
+       @redshift_unloader ||= RedshiftUnloader.new(rs_connection, aws_credentials, logger: logger)
+     end
+
+     def cloud_storage_transfer
+       @cloud_storage_transfer ||= CloudStorageTransfer.new(gcs_transfer_service, raw_gcp_credentials['project_id'], aws_credentials, logger: logger)
+     end
+
+     def redshift_table_schema
+       @redshift_table_schema ||= RedshiftTableSchema.new(@config[:rs_table_name], rs_connection)
+     end
+
+     def big_query_dataset
+       @big_query_dataset ||= BigQuery::Dataset.new(bq_service, raw_gcp_credentials['project_id'], @config[:bq_dataset_id], logger: logger)
+     end
+
+     private
+
+     def logger
+       @logger ||= Logger.new($stderr)
+     end
+
+     def rs_connection
+       @rs_connection ||= PG.connect(
+         @config[:rs_credentials]['host'],
+         @config[:rs_credentials]['port'],
+         nil,
+         nil,
+         @config[:rs_database_name],
+         @config[:rs_credentials]['username'],
+         @config[:rs_credentials]['password']
+       )
+     end
+
+     def gcs_transfer_service
+       @gcs_transfer_service ||= begin
+         s = Google::Apis::StoragetransferV1::StoragetransferService.new
+         s.authorization = gcp_credentials
+         s
+       end
+     end
+
+     def bq_service
+       @bq_service ||= begin
+         s = Google::Apis::BigqueryV2::BigqueryService.new
+         s.authorization = gcp_credentials
+         s
+       end
+     end
+
+     def aws_credentials
+       @config[:aws_credentials]
+     end
+
+     def raw_gcp_credentials
+       @config[:gcp_credentials]
+     end
+
+     def gcp_credentials
+       @gcp_credentials ||= Google::Auth::ServiceAccountCredentials.make_creds(
+         json_key_io: StringIO.new(JSON.dump(raw_gcp_credentials)),
+         scope: Google::Apis::StoragetransferV1::AUTH_CLOUD_PLATFORM
+       )
+     end
+   end
+ end
lib/bigshift/cloud_storage_transfer.rb ADDED
@@ -0,0 +1,104 @@
+ module BigShift
+   class CloudStorageTransfer
+     def initialize(storage_transfer_service, project_id, aws_credentials, options={})
+       @storage_transfer_service = storage_transfer_service
+       @project_id = project_id
+       @aws_credentials = aws_credentials
+       @clock = options[:clock] || Time
+       @thread = options[:thread] || Kernel
+       @logger = options[:logger] || NullLogger::INSTANCE
+     end
+
+     def copy_to_cloud_storage(s3_bucket, s3_path_prefix, cloud_storage_bucket, options={})
+       poll_interval = options[:poll_interval] || DEFAULT_POLL_INTERVAL
+       transfer_job = create_transfer_job(s3_bucket, s3_path_prefix, cloud_storage_bucket, options[:description], options[:allow_overwrite])
+       transfer_job = @storage_transfer_service.create_transfer_job(transfer_job)
+       @logger.info(sprintf('Transferring objects from s3://%s/%s to gs://%s/%s', s3_bucket, s3_path_prefix, cloud_storage_bucket, s3_path_prefix))
+       await_completion(transfer_job, poll_interval)
+       nil
+     end
+
+     private
+
+     DEFAULT_POLL_INTERVAL = 30
+
+     def create_transfer_job(s3_bucket, s3_path_prefix, cloud_storage_bucket, description, allow_overwrite)
+       now = @clock.now.utc
+       Google::Apis::StoragetransferV1::TransferJob.new(
+         description: description,
+         project_id: @project_id,
+         status: 'ENABLED',
+         schedule: Google::Apis::StoragetransferV1::Schedule.new(
+           schedule_start_date: Google::Apis::StoragetransferV1::Date.new(year: now.year, month: now.month, day: now.day),
+           schedule_end_date: Google::Apis::StoragetransferV1::Date.new(year: now.year, month: now.month, day: now.day),
+           start_time_of_day: Google::Apis::StoragetransferV1::TimeOfDay.new(hours: now.hour, minutes: now.min + 1)
+         ),
+         transfer_spec: Google::Apis::StoragetransferV1::TransferSpec.new(
+           aws_s3_data_source: Google::Apis::StoragetransferV1::AwsS3Data.new(
+             bucket_name: s3_bucket,
+             aws_access_key: Google::Apis::StoragetransferV1::AwsAccessKey.new(
+               access_key_id: @aws_credentials['aws_access_key_id'],
+               secret_access_key: @aws_credentials['aws_secret_access_key'],
+             )
+           ),
+           gcs_data_sink: Google::Apis::StoragetransferV1::GcsData.new(
+             bucket_name: cloud_storage_bucket
+           ),
+           object_conditions: Google::Apis::StoragetransferV1::ObjectConditions.new(
+             include_prefixes: [s3_path_prefix]
+           ),
+           transfer_options: Google::Apis::StoragetransferV1::TransferOptions.new(
+             overwrite_objects_already_existing_in_sink: !!allow_overwrite
+           )
+         )
+       )
+     end
+
+     def await_completion(transfer_job, poll_interval)
+       started = false
+       loop do
+         operation = nil
+         failures = 0
+         begin
+           operations_response = @storage_transfer_service.list_transfer_operations('transferOperations', filter: JSON.dump({project_id: @project_id, job_names: [transfer_job.name]}))
+           operation = operations_response.operations && operations_response.operations.first
+         rescue Google::Apis::ServerError => e
+           failures += 1
+           if failures < 5
+             @logger.debug(sprintf('Error while waiting for job %s, will retry: %s (%s)', transfer_job.name.inspect, e.message.inspect, e.class.name))
+             @thread.sleep(poll_interval)
+             retry
+           else
+             raise sprintf('Transfer failed: %s (%s)', e.message.inspect, e.class.name)
+           end
+         end
+         if operation && operation.done?
+           handle_completion(transfer_job, operation)
+           break
+         else
+           status = operation && operation.metadata && operation.metadata['status']
+           if status == 'IN_PROGRESS' && !started
+             @logger.info(sprintf('Transfer %s started', transfer_job.description))
+             started = true
+           else
+             @logger.debug(sprintf('Waiting for job %s (name: %s, status: %s)', transfer_job.description.inspect, transfer_job.name.inspect, status ? status.inspect : 'unknown'))
+           end
+           @thread.sleep(poll_interval)
+         end
+       end
+     end
+
+     def handle_completion(transfer_job, operation)
+       if operation.metadata['status'] == 'FAILED'
+         raise 'Transfer failed'
+       else
+         message = sprintf('Transfer %s complete', transfer_job.description)
+         if (counters = operation.metadata['counters'])
+           size_in_gib = counters['bytesCopiedToSink'].to_f / 2**30
+           message << sprintf(', %s objects and %.2f GiB copied', counters['objectsCopiedToSink'], size_in_gib)
+         end
+         @logger.info(message)
+       end
+     end
+   end
+ end
lib/bigshift/redshift_table_schema.rb ADDED
@@ -0,0 +1,87 @@
+ module BigShift
+   class RedshiftTableSchema
+     def initialize(table_name, redshift_connection)
+       @table_name = table_name
+       @redshift_connection = redshift_connection
+     end
+
+     def columns
+       @columns ||= begin
+         rows = @redshift_connection.exec_params(%|SELECT "column", "type", "notnull" FROM "pg_table_def" WHERE "schemaname" = 'public' AND "tablename" = $1|, [@table_name])
+         if rows.count == 0
+           raise sprintf('Table not found: %s', @table_name.inspect)
+         else
+           columns = rows.map do |row|
+             name = row['column']
+             type = row['type']
+             nullable = row['notnull'] == 'f'
+             Column.new(name, type, nullable)
+           end
+           columns.sort_by!(&:name)
+           columns
+         end
+       end
+     end
+
+     def to_big_query
+       Google::Apis::BigqueryV2::TableSchema.new(fields: columns.map(&:to_big_query))
+     end
+
+     class Column
+       attr_reader :name, :type
+
+       def initialize(name, type, nullable)
+         @name = name
+         @type = type
+         @nullable = nullable
+       end
+
+       def nullable?
+         @nullable
+       end
+
+       def to_big_query
+         Google::Apis::BigqueryV2::TableFieldSchema.new(
+           name: @name,
+           type: big_query_type,
+           mode: @nullable ? 'NULLABLE' : 'REQUIRED'
+         )
+       end
+
+       def to_sql
+         case @type
+         when /^numeric/, /int/, /^double/, 'real'
+           sprintf('"%s"', @name)
+         when /^character/
+           sprintf(%q<('"' || REPLACE(REPLACE(REPLACE("%s", '"', '""'), '\\n', '\\\\n'), '\\r', '\\\\r') || '"')>, @name)
+         when /^timestamp/
+           sprintf('(EXTRACT(epoch FROM "%s") + EXTRACT(milliseconds FROM "%s")/1000.0)', @name, @name)
+         when 'date'
+           sprintf(%q<(TO_CHAR("%s", 'YYYY-MM-DD'))>, @name)
+         when 'boolean'
+           if nullable?
+             sprintf('(CASE WHEN "%s" IS NULL THEN NULL WHEN "%s" THEN 1 ELSE 0 END)', @name, @name)
+           else
+             sprintf('(CASE WHEN "%s" THEN 1 ELSE 0 END)', @name)
+           end
+         else
+           raise sprintf('Unsupported column type: %s', type.inspect)
+         end
+       end
+
+       private
+
+       def big_query_type
+         case @type
+         when /^character/, /^numeric/, 'date' then 'STRING'
+         when /^timestamp/ then 'TIMESTAMP'
+         when /int/ then 'INTEGER'
+         when 'boolean' then 'BOOLEAN'
+         when /^double/, 'real' then 'FLOAT'
+         else
+           raise sprintf('Unsupported column type: %s', type.inspect)
+         end
+       end
+     end
+   end
+ end
lib/bigshift/redshift_unloader.rb ADDED
@@ -0,0 +1,26 @@
+ module BigShift
+   class RedshiftUnloader
+     def initialize(redshift_connection, aws_credentials, options={})
+       @redshift_connection = redshift_connection
+       @aws_credentials = aws_credentials
+       @logger = options[:logger] || NullLogger::INSTANCE
+     end
+
+     def unload_to(table_name, s3_uri, options={})
+       table_schema = RedshiftTableSchema.new(table_name, @redshift_connection)
+       credentials = @aws_credentials.map { |pair| pair.join('=') }.join(';')
+       select_sql = 'SELECT '
+       select_sql << table_schema.columns.map(&:to_sql).join(', ')
+       select_sql << %Q< FROM "#{table_name}">
+       select_sql.gsub!('\'') { |s| '\\\'' }
+       unload_sql = %Q<UNLOAD ('#{select_sql}')>
+       unload_sql << %Q< TO '#{s3_uri}'>
+       unload_sql << %Q< CREDENTIALS '#{credentials}'>
+       unload_sql << %q< DELIMITER '\t'>
+       unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
+       @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
+       @redshift_connection.exec(unload_sql)
+       @logger.info(sprintf('Unload of %s complete', table_name))
+     end
+   end
+ end
lib/bigshift/version.rb ADDED
@@ -0,0 +1,3 @@
+ module BigShift
+   VERSION = '0.1.1'.freeze
+ end
metadata ADDED
@@ -0,0 +1,103 @@
+ --- !ruby/object:Gem::Specification
+ name: bigshift
+ version: !ruby/object:Gem::Version
+   version: 0.1.1
+ platform: ruby
+ authors:
+ - Theo Hultberg
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2016-04-08 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: pg
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: google-api-client
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.9'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.9'
+ - !ruby/object:Gem::Dependency
+   name: googleauth
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ description: |-
+   BigShift is a tool for moving tables from Redshift
+   to BigQuery. It will create a table in BigQuery with
+   a schema that matches the Redshift table, dump the
+   data to S3, transfer it to GCS and finally load it
+   into the BigQuery table.
+ email:
+ - theo@iconara.net
+ executables:
+ - bigshift
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - LICENSE.txt
+ - README.md
+ - bin/bigshift
+ - lib/bigshift.rb
+ - lib/bigshift/big_query/dataset.rb
+ - lib/bigshift/big_query/table.rb
+ - lib/bigshift/cli.rb
+ - lib/bigshift/cloud_storage_transfer.rb
+ - lib/bigshift/redshift_table_schema.rb
+ - lib/bigshift/redshift_unloader.rb
+ - lib/bigshift/version.rb
+ homepage: http://github.com/iconara/bigshift
+ licenses:
+ - BSD-3-Clause
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: 1.9.3
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.4.8
+ signing_key:
+ specification_version: 4
+ summary: A tool for moving tables from Redshift to BigQuery
+ test_files: []