bigshift 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: fc84facadd8de03293a5ba461bce6653bb3f00aa
-  data.tar.gz: deb0e103ae33b5a9627feb3aa4ac617cfa54e342
+  metadata.gz: 5221e0948bc35adae3c09681be2de8529cf51630
+  data.tar.gz: ab8501193f724bed2288a3784719ef4cbbf16c26
 SHA512:
-  metadata.gz: ec259abd928ad95999f64fa9765776c659113a373257d840874d9864ff571bdec0744efa756d3aaf62c7599a5c689de5ca9cf77d66e04a441a4b0d22cdbb833e
-  data.tar.gz: 04cbba86814f2526260f24a4c6583180e55edb4faf6ef7b20a96a0b961ad48586b36c1145af4f49ae06f9735fe2a0c98654433b7ec79bd0520fd5d0d7924935b
+  metadata.gz: 914cdf7f5e432faba32a6d66661c9dd1b0b55edac2933438a46bfbdc6cc4476441d8fbe5e2858017ae012ef2bd0c559c07c75bd2b5fb1bc33754aebbf3dee4c8
+  data.tar.gz: f05dc703a91fb1dbc338e65a04473c8d29ed7df19b9bc2abb6e702aebc022df5f47e7c524b7abf0a52a288961b0f09cc26828646655ebc6312905ad73aff3dba
data/README.md CHANGED
@@ -22,15 +22,17 @@ The main interface to BigShift is the `bigshift` command line tool.
 
 BigShift can also be used as a library in a Ruby application. Look at the tests, and how the `bigshift` tool is built to figure out how.
 
+Because a transfer can take a long time, it's highly recommended that you run the command in `screen` or `tmux`, or use some other mechanism that ensures that the process isn't killed prematurely.
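For example, a long-running transfer can be started inside a named `tmux` session so it survives a dropped SSH connection. This is only an illustrative sketch: the session name is arbitrary and `...` stands for your usual `bigshift` arguments.

```
# start a named tmux session on the machine that will run the transfer
tmux new-session -s bigshift

# inside the session, run bigshift as usual ("..." = your normal arguments)
bigshift ...

# detach with Ctrl-b d, and reattach later to check on progress
tmux attach -t bigshift
```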
+
 ## Cost
 
 Please note that transferring large amounts of data between AWS and GCP is not free. [AWS charges for outgoing traffic from S3](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). There are also storage charges for the Redshift dumps on S3 and GCS, but since they are kept only until the BigQuery table has been loaded those should be negligible.
 
-BigShift tells Redshift to compress the dumps, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost.
+BigShift tells Redshift to compress the dumps by default, even if that means that the BigQuery load will be slower, in order to minimize the transfer cost. However, depending on your setup and data, the individual files produced by Redshift might become larger than BigQuery's compressed file size limit of 4 GiB. In these cases you need to either uncompress the files manually on the GCP side (for example by running BigShift with just `--steps unload,transfer` to get the dumps to GCS), or dump and transfer uncompressed files (with `--no-compression`), at a higher bandwidth cost.
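To make the two workarounds concrete, here is a sketch of the corresponding invocations; the trailing `...` stands for the required Redshift, S3, GCS and BigQuery arguments, which are elided here.

```
# option 1: dump and transfer only, then handle decompression and loading separately
bigshift --steps unload,transfer ...

# option 2: skip compression entirely, at a higher bandwidth cost
bigshift --no-compression ...
```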
 
 ## Arguments
 
-Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix`, `--bq-table` and `--max-bad-records` are required.
+Running `bigshift` without any arguments, or with `--help` will show the options. All except `--s3-prefix`, `--bq-table`, `--max-bad-records`, `--steps` and `--[no-]compression` are required.
 
 ### GCP credentials
 
@@ -108,6 +110,12 @@ Because of how GCS' Transfer Service works the transferred files will have exact
 
 By default the BigQuery table ID will be the same as the Redshift table name, but the optional argument `--bq-table` can be used to tell BigShift to use another table ID.
 
+### Running only some steps
+
+Using the `--steps` argument it's possible to skip some parts of the transfer, or resume a failed transfer. The default is `--steps unload,transfer,load,cleanup`, but using for example `--steps unload,transfer` would dump the table to S3 and transfer the files and then stop.
+
+Another case might be if for some reason the BigShift process was killed during the transfer step. The transfer will still run in GCS, and you might not want to start over from scratch: it takes a long time to unload a big table, and an even longer time to transfer it, not to mention the bandwidth costs. You can then run the same command again, but add `--steps load,cleanup` to the arguments to skip the unloading and transferring steps.
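For example, if the unload and transfer already completed before the process was killed, a sketch of the resumed run looks like this (`...` stands for the same arguments as the original run):

```
# skip the unload and transfer steps; only load into BigQuery, then clean up
bigshift --steps load,cleanup ...
```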
+
 # How does it work?
 
 There are four main pieces to BigShift: the Redshift unloader, the transfer, the BigQuery load and the schema translation.
@@ -151,6 +159,16 @@ The certificates used by the Google APIs might not be installed on your system,
 export SSL_CERT_FILE="$(find $GEM_HOME/gems -name 'google-api-client-*' | tail -n 1)/lib/cacerts.pem"
 ```
 
+### BigQuery says my files are not splittable and too large
+
+For example:
+
+> Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 5838980665. Max allowed size is: 4294967296. Filename: gs://bigshift/foo/bar/foo-bar-0039_part_00.gz
+
+This happens when the (compressed) files exceed 4 GiB in size. Unfortunately it is not possible to control the size of the files produced by Redshift's `UNLOAD` command, and the size of the files will depend on the number of nodes in your cluster and the amount of data you're dumping.
+
+There are two options: either you use BigShift to get the dumps to GCS and then manually uncompress and load them (use `--steps unload,transfer`), or you dump without compression (use `--no-compression`). Keep in mind that without compression the bandwidth costs will be significantly higher.
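As an illustration only (not a procedure from the README), the manual route could look roughly like the following, using standard `gsutil` and `gunzip` commands. The bucket and prefix are placeholders, and the compressed originals are removed so that the subsequent load only matches the uncompressed files.

```
# download the compressed dumps from GCS (bucket and prefix are placeholders)
gsutil -m cp 'gs://my-cs-bucket/my_db/my_table/*.gz' .

# decompress locally and upload the uncompressed files back to the same prefix
gunzip *.gz
gsutil -m cp my_db-my_table-* 'gs://my-cs-bucket/my_db/my_table/'

# remove the compressed originals so they aren't picked up by the load as well
gsutil -m rm 'gs://my-cs-bucket/my_db/my_table/*.gz'

# finish with the remaining BigShift steps ("..." = the original arguments)
bigshift --steps load,cleanup ...
```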
+
 ### I get errors when the data is loaded into BigQuery
 
 This could be anything, but it could be things that aren't escaped properly when the data is dumped from Redshift. Try figuring out from the errors where the problem is and what the data looks like and open an issue. The more you can figure out yourself the more likely it is that you will get help. No one wants to trawl through your data, make an effort.
@@ -34,38 +34,66 @@ module BigShift
 
     private
 
+    def run?(step)
+      @config[:steps].include?(step)
+    end
+
     def setup
       @config = parse_args(@argv)
       @factory = @factory_factory.call(@config)
+      @logger = @factory.logger
     end
 
     def unload
-      s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}/"
-      @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: false)
-      @unload_manifest = UnloadManifest.new(@factory.s3_resource, @config[:s3_bucket_name], "#{s3_table_prefix}/")
+      if run?(:unload)
+        s3_uri = "s3://#{@config[:s3_bucket_name]}/#{s3_table_prefix}"
+        @factory.redshift_unloader.unload_to(@config[:rs_table_name], s3_uri, allow_overwrite: false, compression: @config[:compression])
+      else
+        @logger.debug('Skipping unload')
+      end
+      @unload_manifest = @factory.create_unload_manifest(@config[:s3_bucket_name], s3_table_prefix)
     end
 
     def transfer
-      description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
-      @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
+      if run?(:transfer)
+        description = "bigshift-#{@config[:rs_database_name]}-#{@config[:rs_table_name]}-#{Time.now.utc.strftime('%Y%m%dT%H%M')}"
+        @factory.cloud_storage_transfer.copy_to_cloud_storage(@unload_manifest, @config[:cs_bucket_name], description: description, allow_overwrite: false)
+      else
+        @logger.debug('Skipping transfer')
+      end
     end
 
     def load
-      rs_table_schema = @factory.redshift_table_schema
-      bq_dataset = @factory.big_query_dataset
-      bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
-      gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}/*"
-      options = {}
-      options[:schema] = rs_table_schema.to_big_query
-      options[:allow_overwrite] = true
-      options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
-      bq_table.load(gcs_uri, options)
+      if run?(:load)
+        rs_table_schema = @factory.redshift_table_schema
+        bq_dataset = @factory.big_query_dataset
+        bq_table = bq_dataset.table(@config[:bq_table_id]) || bq_dataset.create_table(@config[:bq_table_id])
+        gcs_uri = "gs://#{@config[:cs_bucket_name]}/#{s3_table_prefix}*"
+        options = {}
+        options[:schema] = rs_table_schema.to_big_query
+        options[:allow_overwrite] = true
+        options[:max_bad_records] = @config[:max_bad_records] if @config[:max_bad_records]
+        bq_table.load(gcs_uri, options)
+      else
+        @logger.debug('Skipping load')
+      end
     end
 
     def cleanup
-      @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
+      if run?(:cleanup)
+        @factory.cleaner.cleanup(@unload_manifest, @config[:cs_bucket_name])
+      else
+        @logger.debug('Skipping cleanup')
+      end
     end
 
+    STEPS = [
+      :unload,
+      :transfer,
+      :load,
+      :cleanup
+    ].freeze
+
     ARGUMENTS = [
       ['--gcp-credentials', 'PATH', String, :gcp_credentials_path, :required],
       ['--aws-credentials', 'PATH', String, :aws_credentials_path, nil],
@@ -78,6 +106,8 @@ module BigShift
       ['--s3-prefix', 'PREFIX', String, :s3_prefix, nil],
       ['--cs-bucket', 'BUCKET_NAME', String, :cs_bucket_name, :required],
       ['--max-bad-records', 'N', Integer, :max_bad_records, nil],
+      ['--steps', 'STEPS', Array, :steps, nil],
+      ['--[no-]compression', nil, nil, :compression, nil],
     ]
 
     def parse_args(argv)
@@ -106,6 +136,11 @@ module BigShift
         end
       end
       config[:bq_table_id] ||= config[:rs_table_name]
+      if config[:steps] && !config[:steps].empty?
+        config[:steps] = STEPS.select { |s| config[:steps].include?(s.to_s) }
+      else
+        config[:steps] = STEPS
+      end
       unless config_errors.empty?
         raise CliError.new('Configuration missing or malformed', config_errors, parser.to_s)
       end
@@ -113,12 +148,16 @@ module BigShift
     end
 
     def s3_table_prefix
-      components = @config.values_at(:rs_database_name, :rs_table_name)
-      if (prefix = @config[:s3_prefix])
-        prefix = prefix.gsub(%r{\A/|/\Z}, '')
-        components.unshift(prefix)
+      @s3_table_prefix ||= begin
+        db_name = @config[:rs_database_name]
+        table_name = @config[:rs_table_name]
+        prefix = "#{db_name}/#{table_name}/#{db_name}-#{table_name}-"
+        if (s3_prefix = @config[:s3_prefix])
+          s3_prefix = s3_prefix.gsub(%r{\A/|/\Z}, '')
+          prefix = "#{s3_prefix}/#{prefix}"
+        end
+        prefix
       end
-      File.join(*components)
     end
   end
 
@@ -154,12 +193,16 @@ module BigShift
       )
     end
 
-    private
-
     def logger
       @logger ||= Logger.new($stderr)
     end
 
+    def create_unload_manifest(s3_bucket_name, s3_table_prefix)
+      UnloadManifest.new(s3_resource, cs_service, @config[:s3_bucket_name], s3_table_prefix)
+    end
+
+    private
+
     def rs_connection
       @rs_connection ||= PG.connect(
         host: @config[:rs_credentials]['host'],
@@ -15,6 +15,7 @@ module BigShift
       transfer_job = @storage_transfer_service.create_transfer_job(transfer_job)
       @logger.info(sprintf('Transferring %d objects (%.2f GiB) from s3://%s/%s to gs://%s/%s', unload_manifest.count, unload_manifest.total_file_size.to_f/2**30, unload_manifest.bucket_name, unload_manifest.prefix, cloud_storage_bucket, unload_manifest.prefix))
       await_completion(transfer_job, poll_interval)
+      validate_transfer(unload_manifest, cloud_storage_bucket)
       nil
     end
 
@@ -45,7 +46,8 @@ module BigShift
             bucket_name: cloud_storage_bucket
           ),
           object_conditions: Google::Apis::StoragetransferV1::ObjectConditions.new(
-            include_prefixes: unload_manifest.keys,
+            include_prefixes: [unload_manifest.prefix],
+            exclude_prefixes: [unload_manifest.manifest_key]
           ),
           transfer_options: Google::Apis::StoragetransferV1::TransferOptions.new(
             overwrite_objects_already_existing_in_sink: !!allow_overwrite
@@ -100,5 +102,10 @@ module BigShift
         @logger.info(message)
       end
     end
+
+    def validate_transfer(unload_manifest, cloud_storage_bucket)
+      unload_manifest.validate_transfer(cloud_storage_bucket)
+      @logger.info('Transfer validated, all file sizes match')
+    end
   end
 end
@@ -17,8 +17,8 @@ module BigShift
       unload_sql << %Q< TO '#{s3_uri}'>
       unload_sql << %Q< CREDENTIALS '#{credentials_string}'>
       unload_sql << %q< MANIFEST>
-      unload_sql << %q< GZIP>
       unload_sql << %q< DELIMITER '\t'>
+      unload_sql << %q< GZIP> if options.fetch(:compression, true)
       unload_sql << %q< ALLOWOVERWRITE> if options[:allow_overwrite]
       @logger.info(sprintf('Unloading Redshift table %s to %s', table_name, s3_uri))
       @redshift_connection.exec(unload_sql)
@@ -1,9 +1,12 @@
 module BigShift
+  TransferValidationError = Class.new(BigShiftError)
+
   class UnloadManifest
     attr_reader :bucket_name, :prefix, :manifest_key
 
-    def initialize(s3_resource, bucket_name, prefix)
+    def initialize(s3_resource, cs_service, bucket_name, prefix)
       @s3_resource = s3_resource
+      @cs_service = cs_service
       @bucket_name = bucket_name
       @prefix = prefix
       @manifest_key = "#{@prefix}manifest"
@@ -23,14 +26,43 @@ module BigShift
     end
 
     def total_file_size
-      @total_file_size ||= begin
+      @total_file_size ||= file_sizes.values.reduce(:+)
+    end
+
+    def validate_transfer(cs_bucket_name)
+      objects = @cs_service.list_objects(cs_bucket_name, prefix: @prefix)
+      cs_file_sizes = objects.items.each_with_object({}) do |item, acc|
+        acc[item.name] = item.size.to_i
+      end
+      missing_files = (file_sizes.keys - cs_file_sizes.keys)
+      extra_files = cs_file_sizes.keys - file_sizes.keys
+      common_files = (cs_file_sizes.keys & file_sizes.keys)
+      size_mismatches = common_files.select { |name| file_sizes[name] != cs_file_sizes[name] }
+      errors = []
+      unless missing_files.empty?
+        errors << "missing files: #{missing_files.join(', ')}"
+      end
+      unless extra_files.empty?
+        errors << "extra files: #{extra_files.join(', ')}"
+      end
+      unless size_mismatches.empty?
+        messages = size_mismatches.map { |name| sprintf('%s (%d != %d)', name, cs_file_sizes[name], file_sizes[name]) }
+        errors << "size mismatches: #{messages.join(', ')}"
+      end
+      unless errors.empty?
+        raise TransferValidationError, "Transferred files don't match unload manifest: #{errors.join('; ')}"
+      end
+    end
+
+    private
+
+    def file_sizes
+      @file_sizes ||= begin
         bucket = @s3_resource.bucket(@bucket_name)
         objects = bucket.objects(prefix: @prefix)
-        objects.reduce(0) do |sum, object|
+        objects.each_with_object({}) do |object, acc|
           if keys.include?(object.key)
-            sum + object.size
-          else
-            sum
+            acc[object.key] = object.size
           end
         end
       end
@@ -1,3 +1,3 @@
 module BigShift
-  VERSION = '0.2.0'.freeze
+  VERSION = '0.3.0'.freeze
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bigshift
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 platform: ruby
 authors:
 - Theo Hultberg
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-04-14 00:00:00.000000000 Z
+date: 2016-05-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pg