bigshift 0.2.0 → 0.3.0
- checksums.yaml +4 -4
- data/README.md +20 -2
- data/lib/bigshift/cli.rb +65 -22
- data/lib/bigshift/cloud_storage_transfer.rb +8 -1
- data/lib/bigshift/redshift_unloader.rb +1 -1
- data/lib/bigshift/unload_manifest.rb +38 -6
- data/lib/bigshift/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5221e0948bc35adae3c09681be2de8529cf51630
+  data.tar.gz: ab8501193f724bed2288a3784719ef4cbbf16c26
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 914cdf7f5e432faba32a6d66661c9dd1b0b55edac2933438a46bfbdc6cc4476441d8fbe5e2858017ae012ef2bd0c559c07c75bd2b5fb1bc33754aebbf3dee4c8
+  data.tar.gz: f05dc703a91fb1dbc338e65a04473c8d29ed7df19b9bc2abb6e702aebc022df5f47e7c524b7abf0a52a288961b0f09cc26828646655ebc6312905ad73aff3dba
data/README.md
CHANGED
@@ -22,15 +22,17 @@ The main interface to BigShift is the `bigshift` command line tool.
 
 BigShift can also be used as a library in a Ruby application. Look at the tests, and how the `bigshift` tool is built to figure out how.
 
+Because a transfer can take a long time, it's highly recommended that you run the command in `screen` or `tmux` or using some other mechanism that ensures that the process isn't killed prematurely.
+
 ## Cost
 
 Please note that transferring large amounts of data between AWS and GCP is not free. [AWS charges for outgoing traffic from S3](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). There are also storage charges for the Redshift dumps on S3 and GCS, but since they are kept only until the BigQuery table has been loaded those should be negligible.
 
-BigShift tells Redshift to compress the dumps, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost.
+BigShift tells Redshift to compress the dumps by default, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost. However, depending on your setup and data the individual files produced by Redshift might become larger than BigQuery's compressed file size limit of 4 GiB. In these cases you need to either uncompress the files manually on the GCP side (for example by running BigShift with just `--steps unload,transfer` to get the dumps to GCS), or dump and transfer uncompressed files (with `--no-compression`), at a higher bandwidth cost.
 
 ## Arguments
 
-Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix`, `--bq-table
+Running `bigshift` without any arguments, or with `--help`, will show the options. All except `--s3-prefix`, `--bq-table`, `--max-bad-records`, `--steps` and `--[no-]compression` are required.
 
 ### GCP credentials
 
@@ -108,6 +110,12 @@ Because of how GCS' Transfer Service works the transferred files will have exact
 
 By default the BigQuery table ID will be the same as the Redshift table name, but the optional argument `--bq-table` can be used to tell BigShift to use another table ID.
 
+### Running only some steps
+
+Using the `--steps` argument it's possible to skip some parts of the transfer, or resume a failed transfer. The default is `--steps unload,transfer,load,cleanup`, but using for example `--steps unload,transfer` would dump the table to S3, transfer the files and then stop.
+
+Another case might be that the BigShift process was killed during the transfer step. The transfer will still run in GCS, and you might not want to start over from the beginning: it takes a long time to unload a big table, and an even longer time to transfer it, not to mention bandwidth costs. You can then run the same command again, but add `--steps load,cleanup` to the arguments to skip the unloading and transferring steps.
+
 # How does it work?
 
 There are four main pieces to BigShift: the Redshift unloader, the transfer, the BigQuery load and the schema translation.
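The `--steps` handling described above is implemented in `cli.rb` later in this diff: the user's comma-separated list is filtered against the canonical step order, so the order you type doesn't matter and unknown names are dropped. The following standalone Ruby sketch illustrates that behaviour; `normalise_steps` is a made-up helper name, and the real CLI relies on OptionParser's `Array` coercion rather than a manual `split`.

```ruby
# Standalone sketch (not BigShift code) of how a --steps value is normalised.
STEPS = [:unload, :transfer, :load, :cleanup].freeze

def normalise_steps(raw)
  requested = raw ? raw.split(',').map(&:strip) : []
  # Selecting from the canonical list keeps the canonical order, whatever order the user typed.
  requested.empty? ? STEPS : STEPS.select { |step| requested.include?(step.to_s) }
end

p normalise_steps('load,cleanup')  # => [:load, :cleanup]
p normalise_steps('cleanup,load')  # => [:load, :cleanup]  (order comes from STEPS)
p normalise_steps(nil)             # => [:unload, :transfer, :load, :cleanup]
```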
@@ -151,6 +159,16 @@ The certificates used by the Google APIs might not be installed on your system,
 export SSL_CERT_FILE="$(find $GEM_HOME/gems -name 'google-api-client-*' | tail -n 1)/lib/cacerts.pem"
 ```
 
+### BigQuery says my files are not splittable and too large
+
+For example:
+
+> Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 5838980665. Max allowed size is: 4294967296. Filename: gs://bigshift/foo/bar/foo-bar-0039_part_00.gz
+
+This happens when the (compressed) files exceed 4 GiB in size. Unfortunately it is not possible to control the size of the files produced by Redshift's `UNLOAD` command, and the size of the files will depend on the number of nodes in your cluster and the amount of data you're dumping.
+
+There are two options: either you use BigShift to get the dumps to GCS and then manually uncompress and load them (use `--steps unload,transfer`), or you dump without compression (use `--no-compression`). Keep in mind that without compression the bandwidth costs will be significantly higher.
+
 ### I get errors when the data is loaded into BigQuery
 
 This could be anything, but it could be things that aren't escaped properly when the data is dumped from Redshift. Try figuring out from the errors where the problem is and what the data looks like, and open an issue. The more you can figure out yourself the more likely it is that you will get help. No one wants to trawl through your data; make an effort.
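To make the 4 GiB ceiling mentioned above concrete, here is a hypothetical pre-flight check (not part of BigShift) that flags compressed dump files over the limit, given a name-to-size map like the one the unload manifest builds; the file names and sizes below are made up to echo the error message quoted above.

```ruby
# Hypothetical pre-flight check, not part of BigShift.
BIG_QUERY_MAX_COMPRESSED_BYTES = 4 * 1024**3 # 4294967296, the limit quoted in the error above

def files_too_large_for_big_query(file_sizes)
  file_sizes.select { |_name, size| size > BIG_QUERY_MAX_COMPRESSED_BYTES }
end

sizes = {
  'foo/bar/foo-bar-0039_part_00.gz' => 5_838_980_665, # over the limit
  'foo/bar/foo-bar-0040_part_00.gz' => 1_073_741_824, # fine
}
files_too_large_for_big_query(sizes).each_key { |name| puts name }
```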
data/lib/bigshift/cli.rb
CHANGED
@@ -34,38 +34,66 @@ module BigShift
 
     private
 
+    def run?(step)
+      @config[:steps].include?(step)
+    end
+
     def setup
       @config = parse_args(@argv)
       @factory = @factory_factory.call(@config)
+      @logger = @factory.logger
     end
 
     def unload
-
-
-
+      if run?(:unload)
+        s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}"
+        @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: false, compression: @config[:compression])
+      else
+        @logger.debug('Skipping unload')
+      end
+      @unload_manifest = @factory.create_unload_manifest(@config[:s3_bucket_name], s3_table_prefix)
     end
 
     def transfer
-
-
+      if run?(:transfer)
+        description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
+        @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
+      else
+        @logger.debug('Skipping transfer')
+      end
     end
 
     def load
-
-
-
-
-
-
-
-
-
+      if run?(:load)
+        rs_table_schema = @factory.redshift_table_schema
+        bq_dataset = @factory.big_query_dataset
+        bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
+        gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}*"
+        options = {}
+        options[:schema] = rs_table_schema.to_big_query
+        options[:allow_overwrite] = true
+        options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
+        bq_table.load(gcs_uri, options)
+      else
+        @logger.debug('Skipping load')
+      end
     end
 
     def cleanup
-
+      if run?(:cleanup)
+        @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
+      else
+        @logger.debug('Skipping cleanup')
+      end
     end
 
+    STEPS = [
+      :unload,
+      :transfer,
+      :load,
+      :cleanup
+    ].freeze
+
     ARGUMENTS = [
       ['--gcp-credentials', 'PATH', String, :gcp_credentials_path, :required],
       ['--aws-credentials', 'PATH', String, :aws_credentials_path, nil],
@@ -78,6 +106,8 @@ module BigShift
       ['--s3-prefix', 'PREFIX', String, :s3_prefix, nil],
       ['--cs-bucket', 'BUCKET_NAME', String, :cs_bucket_name, :required],
       ['--max-bad-records', 'N', Integer, :max_bad_records, nil],
+      ['--steps', 'STEPS', Array, :steps, nil],
+      ['--[no-]compression', nil, nil, :compression, nil],
     ]
 
     def parse_args(argv)
@@ -106,6 +136,11 @@ module BigShift
         end
       end
       config[:bq_table_id] ||= config[:rs_table_name]
+      if config[:steps] && !config[:steps].empty?
+        config[:steps] = STEPS.select { |s| config[:steps].include?(s.to_s) }
+      else
+        config[:steps] = STEPS
+      end
       unless config_errors.empty?
         raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
       end
@@ -113,12 +148,16 @@ module BigShift
     end
 
     def s3_table_prefix
-
-
-
-
+      @s3_table_prefix ||= begin
+        db_name = @config[:rs_database_name]
+        table_name = @config[:rs_table_name]
+        prefix = "#{db_name}/#{table_name}/#{db_name}-#{table_name}-"
+        if (s3_prefix = @config[:s3_prefix])
+          s3_prefix = s3_prefix.gsub(%r{\A/|/\Z}, '')
+          prefix = "#{s3_prefix}/#{prefix}"
+        end
+        prefix
       end
-      File.join(*components)
     end
   end
 
@@ -154,12 +193,16 @@ module BigShift
       )
     end
 
-    private
-
     def logger
       @logger ||= Logger.new($stderr)
     end
 
+    def create_unload_manifest(s3_bucket_name, s3_table_prefix)
+      UnloadManifest.new(s3_resource, cs_service, @config[:s3_bucket_name], s3_table_prefix)
+    end
+
+    private
+
     def rs_connection
       @rs_connection ||= PG.connect(
         host: @config[:rs_credentials]['host'],
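To make the new `s3_table_prefix` logic in the hunk above concrete, here is a simplified standalone version; it is a hypothetical free function (the real method memoises into `@s3_table_prefix` and reads `@config`), and the database and table names are invented.

```ruby
# Simplified sketch of the prefix construction in cli.rb (not the actual method).
def s3_table_prefix(db_name, table_name, s3_prefix = nil)
  prefix = "#{db_name}/#{table_name}/#{db_name}-#{table_name}-"
  if s3_prefix
    # Leading and trailing slashes are stripped before prepending, as in the diff above.
    prefix = "#{s3_prefix.gsub(%r{\A/|/\Z}, '')}/#{prefix}"
  end
  prefix
end

p s3_table_prefix('analytics', 'events')
# => "analytics/events/analytics-events-"
p s3_table_prefix('analytics', 'events', '/dumps/')
# => "dumps/analytics/events/analytics-events-"
```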
data/lib/bigshift/cloud_storage_transfer.rb
CHANGED
@@ -15,6 +15,7 @@ module BigShift
       transfer_job = @storage_transfer_service.create_transfer_job(transfer_job)
       @logger.info(sprintf('Transferring %d objects (%.2f GiB) from s3://%s/%s to gs://%s/%s', unload_manifest.count, unload_manifest.total_file_size.to_f/2**30, unload_manifest.bucket_name, unload_manifest.prefix, cloud_storage_bucket, unload_manifest.prefix))
       await_completion(transfer_job, poll_interval)
+      validate_transfer(unload_manifest, cloud_storage_bucket)
       nil
     end
 
@@ -45,7 +46,8 @@ module BigShift
           bucket_name: cloud_storage_bucket
         ),
         object_conditions: Google::Apis::StoragetransferV1::ObjectConditions.new(
-          include_prefixes: unload_manifest.
+          include_prefixes: [unload_manifest.prefix],
+          exclude_prefixes: [unload_manifest.manifest_key]
         ),
         transfer_options: Google::Apis::StoragetransferV1::TransferOptions.new(
           overwrite_objects_already_existing_in_sink: !!allow_overwrite
@@ -100,5 +102,10 @@ module BigShift
         @logger.info(message)
       end
     end
+
+    def validate_transfer(unload_manifest, cloud_storage_bucket)
+      unload_manifest.validate_transfer(cloud_storage_bucket)
+      @logger.info('Transfer validated, all file sizes match')
+    end
   end
 end
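The include/exclude prefix change above means the transfer job should pick up every object under the table prefix except the Redshift manifest file itself. A rough standalone illustration of that prefix filtering follows; the object keys are made up, and the actual matching is done server-side by GCS's Transfer Service, not by BigShift.

```ruby
# Made-up object keys; shows which ones the include/exclude prefixes are meant to select.
prefix = 'analytics/events/analytics-events-'
manifest_key = "#{prefix}manifest"

keys = [
  'analytics/events/analytics-events-0000_part_00.gz',
  'analytics/events/analytics-events-0001_part_00.gz',
  'analytics/events/analytics-events-manifest',
]

transferred = keys.select { |key| key.start_with?(prefix) && !key.start_with?(manifest_key) }
p transferred
# => ["analytics/events/analytics-events-0000_part_00.gz",
#     "analytics/events/analytics-events-0001_part_00.gz"]
```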
data/lib/bigshift/redshift_unloader.rb
CHANGED
@@ -17,8 +17,8 @@ module BigShift
       unload_sql << %Q< TO '#{s3_uri}'>
       unload_sql << %Q< CREDENTIALS '#{credentials_string}'>
       unload_sql << %q< MANIFEST>
-      unload_sql << %q< GZIP>
       unload_sql << %q< DELIMITER '\t'>
+      unload_sql << %q< GZIP> if options.fetch(:compression, true)
       unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
       @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
       @redshift_connection.exec(unload_sql)
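The one-line change above makes the GZIP clause conditional on an option whose `Hash#fetch` default is true, so compression stays opt-out. A minimal sketch of just that default (hypothetical helper, not BigShift code):

```ruby
# Compression is opt-out: an absent option means GZIP, an explicit false disables it.
def gzip?(options = {})
  options.fetch(:compression, true) ? true : false
end

p gzip?                      # => true  (no option given)
p gzip?(compression: false)  # => false (e.g. --no-compression)
```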
data/lib/bigshift/unload_manifest.rb
CHANGED
@@ -1,9 +1,12 @@
 module BigShift
+  TransferValidationError = Class.new(BigShiftError)
+
   class UnloadManifest
     attr_reader :bucket_name, :prefix, :manifest_key
 
-    def initialize(s3_resource, bucket_name, prefix)
+    def initialize(s3_resource, cs_service, bucket_name, prefix)
       @s3_resource = s3_resource
+      @cs_service = cs_service
       @bucket_name = bucket_name
       @prefix = prefix
       @manifest_key = "#{@prefix}manifest"
@@ -23,14 +26,43 @@ module BigShift
     end
 
     def total_file_size
-      @total_file_size ||=
+      @total_file_size ||= file_sizes.values.reduce(:+)
+    end
+
+    def validate_transfer(cs_bucket_name)
+      objects = @cs_service.list_objects(cs_bucket_name, prefix: @prefix)
+      cs_file_sizes = objects.items.each_with_object({}) do |item, acc|
+        acc[item.name] = item.size.to_i
+      end
+      missing_files = (file_sizes.keys - cs_file_sizes.keys)
+      extra_files = cs_file_sizes.keys - file_sizes.keys
+      common_files = (cs_file_sizes.keys & file_sizes.keys)
+      size_mismatches = common_files.select { |name| file_sizes[name] != cs_file_sizes[name] }
+      errors = []
+      unless missing_files.empty?
+        errors << "missing files: #{missing_files.join(', ')}"
+      end
+      unless extra_files.empty?
+        errors << "extra files: #{extra_files.join(', ')}"
+      end
+      unless size_mismatches.empty?
+        messages = size_mismatches.map { |name| sprintf('%s (%d != %d)', name, cs_file_sizes[name], file_sizes[name]) }
+        errors << "size mismatches: #{messages.join(', ')}"
+      end
+      unless errors.empty?
+        raise TransferValidationError, "Transferred files don't match unload manifest: #{errors.join('; ')}"
+      end
+    end
+
+    private
+
+    def file_sizes
+      @file_sizes ||= begin
         bucket = @s3_resource.bucket(@bucket_name)
         objects = bucket.objects(prefix: @prefix)
-        objects.
+        objects.each_with_object({}) do |object, acc|
           if keys.include?(object.key)
-
-          else
-            sum
+            acc[object.key] = object.size
           end
         end
       end
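For orientation, here is a hedged sketch of how a caller might surface a failed validation; `manifest`, the bucket names and the `s3_resource`/`cs_service` clients are placeholders assumed to exist already, and the constructor arguments mirror the new signature above.

```ruby
# Hypothetical caller; assumes s3_resource and cs_service clients have been set up elsewhere.
manifest = BigShift::UnloadManifest.new(s3_resource, cs_service, 'my-s3-bucket', 'analytics/events/analytics-events-')

begin
  manifest.validate_transfer('my-gcs-bucket')
rescue BigShift::TransferValidationError => e
  # e.message lists missing files, extra files and size mismatches, e.g.
  # "Transferred files don't match unload manifest: missing files: ..."
  warn(e.message)
end
```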
data/lib/bigshift/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bigshift
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 platform: ruby
 authors:
 - Theo Hultberg
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-05-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pg