sluice 0.0.1

@@ -0,0 +1,17 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in sluice.gemspec
4
+ gemspec
@@ -0,0 +1,202 @@
1
+
2
+ Apache License
3
+ Version 2.0, January 2004
4
+ http://www.apache.org/licenses/
5
+
6
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7
+
8
+ 1. Definitions.
9
+
10
+ "License" shall mean the terms and conditions for use, reproduction,
11
+ and distribution as defined by Sections 1 through 9 of this document.
12
+
13
+ "Licensor" shall mean the copyright owner or entity authorized by
14
+ the copyright owner that is granting the License.
15
+
16
+ "Legal Entity" shall mean the union of the acting entity and all
17
+ other entities that control, are controlled by, or are under common
18
+ control with that entity. For the purposes of this definition,
19
+ "control" means (i) the power, direct or indirect, to cause the
20
+ direction or management of such entity, whether by contract or
21
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
22
+ outstanding shares, or (iii) beneficial ownership of such entity.
23
+
24
+ "You" (or "Your") shall mean an individual or Legal Entity
25
+ exercising permissions granted by this License.
26
+
27
+ "Source" form shall mean the preferred form for making modifications,
28
+ including but not limited to software source code, documentation
29
+ source, and configuration files.
30
+
31
+ "Object" form shall mean any form resulting from mechanical
32
+ transformation or translation of a Source form, including but
33
+ not limited to compiled object code, generated documentation,
34
+ and conversions to other media types.
35
+
36
+ "Work" shall mean the work of authorship, whether in Source or
37
+ Object form, made available under the License, as indicated by a
38
+ copyright notice that is included in or attached to the work
39
+ (an example is provided in the Appendix below).
40
+
41
+ "Derivative Works" shall mean any work, whether in Source or Object
42
+ form, that is based on (or derived from) the Work and for which the
43
+ editorial revisions, annotations, elaborations, or other modifications
44
+ represent, as a whole, an original work of authorship. For the purposes
45
+ of this License, Derivative Works shall not include works that remain
46
+ separable from, or merely link (or bind by name) to the interfaces of,
47
+ the Work and Derivative Works thereof.
48
+
49
+ "Contribution" shall mean any work of authorship, including
50
+ the original version of the Work and any modifications or additions
51
+ to that Work or Derivative Works thereof, that is intentionally
52
+ submitted to Licensor for inclusion in the Work by the copyright owner
53
+ or by an individual or Legal Entity authorized to submit on behalf of
54
+ the copyright owner. For the purposes of this definition, "submitted"
55
+ means any form of electronic, verbal, or written communication sent
56
+ to the Licensor or its representatives, including but not limited to
57
+ communication on electronic mailing lists, source code control systems,
58
+ and issue tracking systems that are managed by, or on behalf of, the
59
+ Licensor for the purpose of discussing and improving the Work, but
60
+ excluding communication that is conspicuously marked or otherwise
61
+ designated in writing by the copyright owner as "Not a Contribution."
62
+
63
+ "Contributor" shall mean Licensor and any individual or Legal Entity
64
+ on behalf of whom a Contribution has been received by Licensor and
65
+ subsequently incorporated within the Work.
66
+
67
+ 2. Grant of Copyright License. Subject to the terms and conditions of
68
+ this License, each Contributor hereby grants to You a perpetual,
69
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70
+ copyright license to reproduce, prepare Derivative Works of,
71
+ publicly display, publicly perform, sublicense, and distribute the
72
+ Work and such Derivative Works in Source or Object form.
73
+
74
+ 3. Grant of Patent License. Subject to the terms and conditions of
75
+ this License, each Contributor hereby grants to You a perpetual,
76
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77
+ (except as stated in this section) patent license to make, have made,
78
+ use, offer to sell, sell, import, and otherwise transfer the Work,
79
+ where such license applies only to those patent claims licensable
80
+ by such Contributor that are necessarily infringed by their
81
+ Contribution(s) alone or by combination of their Contribution(s)
82
+ with the Work to which such Contribution(s) was submitted. If You
83
+ institute patent litigation against any entity (including a
84
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
85
+ or a Contribution incorporated within the Work constitutes direct
86
+ or contributory patent infringement, then any patent licenses
87
+ granted to You under this License for that Work shall terminate
88
+ as of the date such litigation is filed.
89
+
90
+ 4. Redistribution. You may reproduce and distribute copies of the
91
+ Work or Derivative Works thereof in any medium, with or without
92
+ modifications, and in Source or Object form, provided that You
93
+ meet the following conditions:
94
+
95
+ (a) You must give any other recipients of the Work or
96
+ Derivative Works a copy of this License; and
97
+
98
+ (b) You must cause any modified files to carry prominent notices
99
+ stating that You changed the files; and
100
+
101
+ (c) You must retain, in the Source form of any Derivative Works
102
+ that You distribute, all copyright, patent, trademark, and
103
+ attribution notices from the Source form of the Work,
104
+ excluding those notices that do not pertain to any part of
105
+ the Derivative Works; and
106
+
107
+ (d) If the Work includes a "NOTICE" text file as part of its
108
+ distribution, then any Derivative Works that You distribute must
109
+ include a readable copy of the attribution notices contained
110
+ within such NOTICE file, excluding those notices that do not
111
+ pertain to any part of the Derivative Works, in at least one
112
+ of the following places: within a NOTICE text file distributed
113
+ as part of the Derivative Works; within the Source form or
114
+ documentation, if provided along with the Derivative Works; or,
115
+ within a display generated by the Derivative Works, if and
116
+ wherever such third-party notices normally appear. The contents
117
+ of the NOTICE file are for informational purposes only and
118
+ do not modify the License. You may add Your own attribution
119
+ notices within Derivative Works that You distribute, alongside
120
+ or as an addendum to the NOTICE text from the Work, provided
121
+ that such additional attribution notices cannot be construed
122
+ as modifying the License.
123
+
124
+ You may add Your own copyright statement to Your modifications and
125
+ may provide additional or different license terms and conditions
126
+ for use, reproduction, or distribution of Your modifications, or
127
+ for any such Derivative Works as a whole, provided Your use,
128
+ reproduction, and distribution of the Work otherwise complies with
129
+ the conditions stated in this License.
130
+
131
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
132
+ any Contribution intentionally submitted for inclusion in the Work
133
+ by You to the Licensor shall be under the terms and conditions of
134
+ this License, without any additional terms or conditions.
135
+ Notwithstanding the above, nothing herein shall supersede or modify
136
+ the terms of any separate license agreement you may have executed
137
+ with Licensor regarding such Contributions.
138
+
139
+ 6. Trademarks. This License does not grant permission to use the trade
140
+ names, trademarks, service marks, or product names of the Licensor,
141
+ except as required for reasonable and customary use in describing the
142
+ origin of the Work and reproducing the content of the NOTICE file.
143
+
144
+ 7. Disclaimer of Warranty. Unless required by applicable law or
145
+ agreed to in writing, Licensor provides the Work (and each
146
+ Contributor provides its Contributions) on an "AS IS" BASIS,
147
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148
+ implied, including, without limitation, any warranties or conditions
149
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150
+ PARTICULAR PURPOSE. You are solely responsible for determining the
151
+ appropriateness of using or redistributing the Work and assume any
152
+ risks associated with Your exercise of permissions under this License.
153
+
154
+ 8. Limitation of Liability. In no event and under no legal theory,
155
+ whether in tort (including negligence), contract, or otherwise,
156
+ unless required by applicable law (such as deliberate and grossly
157
+ negligent acts) or agreed to in writing, shall any Contributor be
158
+ liable to You for damages, including any direct, indirect, special,
159
+ incidental, or consequential damages of any character arising as a
160
+ result of this License or out of the use or inability to use the
161
+ Work (including but not limited to damages for loss of goodwill,
162
+ work stoppage, computer failure or malfunction, or any and all
163
+ other commercial damages or losses), even if such Contributor
164
+ has been advised of the possibility of such damages.
165
+
166
+ 9. Accepting Warranty or Additional Liability. While redistributing
167
+ the Work or Derivative Works thereof, You may choose to offer,
168
+ and charge a fee for, acceptance of support, warranty, indemnity,
169
+ or other liability obligations and/or rights consistent with this
170
+ License. However, in accepting such obligations, You may act only
171
+ on Your own behalf and on Your sole responsibility, not on behalf
172
+ of any other Contributor, and only if You agree to indemnify,
173
+ defend, and hold each Contributor harmless for any liability
174
+ incurred by, or claims asserted against, such Contributor by reason
175
+ of your accepting any such warranty or additional liability.
176
+
177
+ END OF TERMS AND CONDITIONS
178
+
179
+ APPENDIX: How to apply the Apache License to your work.
180
+
181
+ To apply the Apache License to your work, attach the following
182
+ boilerplate notice, with the fields enclosed by brackets "[]"
183
+ replaced with your own identifying information. (Don't include
184
+ the brackets!) The text should be enclosed in the appropriate
185
+ comment syntax for the file format. We also recommend that a
186
+ file or class name and description of purpose be included on the
187
+ same "printed page" as the copyright notice for easier
188
+ identification within third-party archives.
189
+
190
+ Copyright [yyyy] [name of copyright owner]
191
+
192
+ Licensed under the Apache License, Version 2.0 (the "License");
193
+ you may not use this file except in compliance with the License.
194
+ You may obtain a copy of the License at
195
+
196
+ http://www.apache.org/licenses/LICENSE-2.0
197
+
198
+ Unless required by applicable law or agreed to in writing, software
199
+ distributed under the License is distributed on an "AS IS" BASIS,
200
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201
+ See the License for the specific language governing permissions and
202
+ limitations under the License.
@@ -0,0 +1,64 @@
1
+ # Sluice
2
+
3
+ Sluice is a Ruby gem (built with [Bundler] [bundler]) to help you build cloud-friendly ETL (extract, transform, load) processes.
4
+
5
+ Sluice has been extracted from a pair of Ruby ETL applications built by the [SnowPlow Analytics] [snowplow-analytics] team, specifically:
6
+
7
+ 1. [EmrEtlRunner] [emr-etl-runner], a Ruby application to run the SnowPlow ETL process on Elastic MapReduce
8
+ 2. [StorageLoader] [storage-loader], a Ruby application to load SnowPlow event files from Amazon S3 into databases such as Infobright
9
+
10
+ ## Installation
11
+
12
+ $ gem install sluice
13
+
14
+ Or in your Gemfile:
15
+
16
+ gem 'sluice', '~> 0.0.1'
17
+
18
+ ## Usage
19
+
20
+ Rubydoc and usage examples to come.
21
+
22
+ ## Hacking and contributing
23
+
24
+ To hack on Sluice locally:
25
+
26
+ $ gem build sluice.gemspec
27
+ $ sudo gem install sluice-0.0.1.gem
28
+
29
+ To contribute:
30
+
31
+ 1. Fork it
32
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
33
+ 3. Commit your changes (`git commit -am 'Added some feature'`)
34
+ 4. Push to the branch (`git push origin my-new-feature`)
35
+ 5. Create new Pull Request
36
+
37
+ ## Credits and thanks
38
+
39
+ Sluice was developed by [Alex Dean] [alexanderdean] ([SnowPlow Analytics] [snowplow-analytics]) and [Michael Tibben] [mtibben] ([99designs] [99designs]).
40
+
41
+ ## Copyright and license
42
+
43
+ Sluice is copyright 2012 SnowPlow Analytics Ltd.
44
+
45
+ Licensed under the [Apache License, Version 2.0] [license] (the "License");
46
+ you may not use this software except in compliance with the License.
47
+
48
+ Unless required by applicable law or agreed to in writing, software
49
+ distributed under the License is distributed on an "AS IS" BASIS,
50
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
51
+ See the License for the specific language governing permissions and
52
+ limitations under the License.
53
+
54
+ [bundler]: http://gembundler.com/
55
+
56
+ [snowplow-analytics]: http://snowplowanalytics.com
57
+ [alexanderdean]: https://github.com/alexanderdean
58
+ [mtibben]: https://github.com/mtibben
59
+ [99designs]: http://99designs.com
60
+
61
+ [emr-etl-runner]: https://github.com/snowplow/snowplow/tree/master/3-etl/emr-etl-runner
62
+ [storage-loader]: https://github.com/snowplow/snowplow/tree/master/4-storage/storage-loader
63
+
64
+ [license]: http://www.apache.org/licenses/LICENSE-2.0
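The README defers usage examples. As a rough illustration only, the sketch below shows the kind of S3 location string that the gem's `Location` class (in `lib/sluice/storage/s3.rb`) is written to accept: an `s3://` or `s3n://` prefix, a bucket, an optional directory, and a trailing slash. `LocationSketch` is a hypothetical stand-in written for this example so that it runs without fog or AWS credentials; it is not part of the gem, and it anchors the pattern with `\A`/`\z` rather than `^`/`$`.

```ruby
# Hypothetical stand-in for Sluice::Storage::S3::Location (illustration only).
class LocationSketch
  attr_reader :bucket, :dir

  def initialize(s3_location)
    # Bucket is everything up to the first slash; the final slash is mandatory.
    match = s3_location.match(%r{\As3n?://([^/]+)/?(.*)/\z})
    raise ArgumentError, "Bad S3 location #{s3_location}" unless match

    @bucket = match[1]
    @dir    = match[2]
  end

  # Directory with a trailing slash, or an empty string for the bucket root.
  def dir_as_path
    @dir.empty? ? '' : "#{@dir}/"
  end
end

loc = LocationSketch.new('s3://my-bucket/in/logs/')
puts loc.bucket       # my-bucket
puts loc.dir          # in/logs
puts loc.dir_as_path  # in/logs/
```

A location without the trailing slash, such as `s3://my-bucket/in/logs`, is rejected with an `ArgumentError` in this sketch.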
@@ -0,0 +1,2 @@
1
+ #!/usr/bin/env rake
2
+ require "bundler/gem_tasks"
@@ -0,0 +1,23 @@
1
+ # Copyright (c) 2012 SnowPlow Analytics Ltd. All rights reserved.
2
+ #
3
+ # This program is licensed to you under the Apache License Version 2.0,
4
+ # and you may not use this file except in compliance with the Apache License Version 2.0.
5
+ # You may obtain a copy of the Apache License Version 2.0 at http://www.apache.org/licenses/LICENSE-2.0.
6
+ #
7
+ # Unless required by applicable law or agreed to in writing,
8
+ # software distributed under the Apache License Version 2.0 is distributed on an
9
+ # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the Apache License Version 2.0 for the specific language governing permissions and limitations there under.
11
+
12
+ # Author:: Alex Dean (mailto:support@snowplowanalytics.com)
13
+ # Copyright:: Copyright (c) 2012 SnowPlow Analytics Ltd
14
+ # License:: Apache License Version 2.0
15
+
16
+ require 'sluice/errors'
17
+ require 'sluice/storage/storage'
18
+ require 'sluice/storage/s3'
19
+
20
+ module Sluice
21
+ NAME = "sluice"
22
+ VERSION = "0.0.1"
23
+ end
@@ -0,0 +1,26 @@
1
+ # Copyright (c) 2012 SnowPlow Analytics Ltd. All rights reserved.
2
+ #
3
+ # This program is licensed to you under the Apache License Version 2.0,
4
+ # and you may not use this file except in compliance with the Apache License Version 2.0.
5
+ # You may obtain a copy of the Apache License Version 2.0 at http://www.apache.org/licenses/LICENSE-2.0.
6
+ #
7
+ # Unless required by applicable law or agreed to in writing,
8
+ # software distributed under the Apache License Version 2.0 is distributed on an
9
+ # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the Apache License Version 2.0 for the specific language governing permissions and limitations there under.
11
+
12
+ # Author:: Alex Dean (mailto:support@snowplowanalytics.com)
13
+ # Copyright:: Copyright (c) 2012 SnowPlow Analytics Ltd
14
+ # License:: Apache License Version 2.0
15
+
16
+ # All errors
17
+ module Sluice
18
+
19
+ # The base error class for all <tt>Sluice</tt> error classes.
20
+ class Error < StandardError
21
+ end
22
+
23
+ # If a storage operation is not supported or incorrect
24
+ class StorageOperationError < Error
25
+ end
26
+ end
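The two classes above form a small hierarchy, so callers can rescue either the specific storage error or any Sluice error. The sketch below mirrors that hierarchy under a hypothetical `SluiceSketch` namespace (so it runs without the gem installed) and shows the `raise Klass, "message"` form, which needs the comma between the class and the message. `describe_failure` is a helper invented for this example.

```ruby
# Hypothetical mirror of the gem's error hierarchy (illustration only).
module SluiceSketch
  class Error < StandardError; end
  class StorageOperationError < Error; end
end

# Runs the block and reports which rescue clause handled any failure.
def describe_failure
  yield
  'ok'
rescue SluiceSketch::StorageOperationError => e
  "storage operation failed: #{e.message}"
rescue SluiceSketch::Error => e
  "other sluice error: #{e.message}"
end

puts describe_failure {
  raise SluiceSketch::StorageOperationError, 'File operation %s is unsupported' % :rename
}
puts describe_failure { raise SluiceSketch::Error, 'something else' }
puts describe_failure { :nothing_raised }
```

Listing the more specific class first matters: Ruby tries `rescue` clauses in order, and the base class would otherwise catch both.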
@@ -0,0 +1,281 @@
1
+ # Copyright (c) 2012 SnowPlow Analytics Ltd. All rights reserved.
2
+ #
3
+ # This program is licensed to you under the Apache License Version 2.0,
4
+ # and you may not use this file except in compliance with the Apache License Version 2.0.
5
+ # You may obtain a copy of the Apache License Version 2.0 at http://www.apache.org/licenses/LICENSE-2.0.
6
+ #
7
+ # Unless required by applicable law or agreed to in writing,
8
+ # software distributed under the Apache License Version 2.0 is distributed on an
9
+ # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the Apache License Version 2.0 for the specific language governing permissions and limitations there under.
11
+
12
+ # Authors:: Alex Dean (mailto:support@snowplowanalytics.com), Michael Tibben
13
+ # Copyright:: Copyright (c) 2012 SnowPlow Analytics Ltd
14
+ # License:: Apache License Version 2.0
15
+
16
+ require 'fog'
17
+ require 'thread'
18
+
19
+ module Sluice
20
+ module Storage
21
+ module S3
22
+
23
+ # TODO: figure out logging instead of puts (https://github.com/snowplow/sluice/issues/2)
24
+ # TODO: consider moving to OO structure (https://github.com/snowplow/sluice/issues/3)
25
+
26
+ # Constants
27
+ CONCURRENCY = 10 # Threads
28
+ RETRIES = 3 # Attempts
29
+ RETRY_WAIT = 10 # Seconds
30
+
31
+ # Class to describe an S3 location
32
+ # TODO: if we are going to impose trailing slashes on
33
+ # bucket locations, maybe we should make that clearer?
34
+ class Location
35
+ attr_reader :bucket, :dir, :s3_location
36
+
37
+ # Parameters:
38
+ # +s3_location+:: the S3 location config string, e.g. "s3://bucket/directory/" (the trailing slash is required)
39
+ def initialize(s3_location)
40
+ @s3_location = s3_location
41
+
42
+ s3_location_match = s3_location.match('^s3n?://([^/]+)/?(.*)/$')
43
+ raise ArgumentError, 'Bad S3 location %s' % s3_location unless s3_location_match
44
+
45
+ @bucket = s3_location_match[1]
46
+ @dir = s3_location_match[2]
47
+ end
48
+
49
+ def dir_as_path
50
+ if @dir.length > 0
51
+ return @dir+'/'
52
+ else
53
+ return ''
54
+ end
55
+ end
56
+
57
+ def to_s
58
+ @s3_location
59
+ end
60
+ end
61
+
62
+ # Helper function to instantiate a new Fog::Storage
63
+ # for S3 based on our config options
64
+ #
65
+ # Parameters:
66
+ # +region+:: Amazon S3 region we will be working with
67
+ # +access_key_id+:: AWS access key ID
68
+ # +secret_access_key+:: AWS secret access key
69
+ def new_fog_s3_from(region, access_key_id, secret_access_key)
70
+ Fog::Storage.new({
71
+ :provider => 'AWS',
72
+ :region => region,
73
+ :aws_access_key_id => access_key_id,
74
+ :aws_secret_access_key => secret_access_key
75
+ })
76
+ end
77
+ module_function :new_fog_s3_from
78
+
79
+ # Determine if a bucket is empty
80
+ #
81
+ # Parameters:
82
+ # +s3+:: A Fog::Storage s3 connection
83
+ # +location+:: The location to check
84
+ def is_empty?(s3, location)
85
+ s3.directories.get(location.bucket, :prefix => location.dir).files().length <= 1 # assumes a lone key is the directory placeholder itself
86
+ end
87
+ module_function :is_empty?
88
+
89
+ # Delete files from S3 locations concurrently
90
+ #
91
+ # Parameters:
92
+ # +s3+:: A Fog::Storage s3 connection
93
+ # +from_location+:: Location to delete files from
94
+ # +match_regex+:: a regex string to match the files to delete
95
+ def delete_files(s3, from_location, match_regex='.+')
96
+
97
+ puts " deleting files from #{from_location}"
98
+ process_files(:delete, s3, from_location, match_regex)
99
+ end
100
+ module_function :delete_files
101
+
102
+ # Copies files between S3 locations concurrently
103
+ #
104
+ # Parameters:
105
+ # +s3+:: A Fog::Storage s3 connection
106
+ # +from_location+:: Location to copy files from
106
+ # +to_location+:: Location to copy files to
108
+ # +match_regex+:: a regex string to match the files to copy
109
+ # +alter_filename_lambda+:: lambda to alter the written filename
110
+ def copy_files(s3, from_location, to_location, match_regex='.+', alter_filename_lambda=false)
111
+
112
+ puts " copying files from #{from_location} to #{to_location}"
113
+ process_files(:copy, s3, from_location, match_regex, to_location, alter_filename_lambda)
114
+ end
115
+ module_function :copy_files
116
+
117
+ # Moves files between S3 locations concurrently
118
+ #
119
+ # Parameters:
120
+ # +s3+:: A Fog::Storage s3 connection
121
+ # +from_location+:: Location to move files from
121
+ # +to_location+:: Location to move files to
123
+ # +match_regex+:: a regex string to match the files to move
124
+ # +alter_filename_lambda+:: lambda to alter the written filename
125
+ def move_files(s3, from_location, to_location, match_regex='.+', alter_filename_lambda=false)
126
+
127
+ puts " moving files from #{from_location} to #{to_location}"
128
+ process_files(:move, s3, from_location, match_regex, to_location, alter_filename_lambda)
129
+ end
130
+ module_function :move_files
131
+
132
+ private
133
+
134
+ # Concurrent file operations between S3 locations. Supports:
135
+ # - Copy
136
+ # - Delete
137
+ # - Move (= Copy + Delete)
138
+ #
139
+ # Parameters:
140
+ # +operation+:: Operation to perform. :copy, :delete, :move supported
141
+ # +s3+:: A Fog::Storage s3 connection
142
+ # +from_location+:: Location to process files from
143
+ # +match_regex+:: a regex string to match the files to process
144
+ # +to_location+:: Location to process files to (nil for :delete)
145
+ # +alter_filename_lambda+:: lambda to alter the written filename
146
+ def process_files(operation, s3, from_location, match_regex='.+', to_location=nil, alter_filename_lambda=false)
147
+
148
+ # Validate that the file operation makes sense
149
+ case operation
150
+ when :copy, :move
151
+ if to_location.nil?
152
+ raise StorageOperationError, "File operation %s requires a to_location to be set" % operation
153
+ end
154
+ when :delete
155
+ unless to_location.nil?
156
+ raise StorageOperationError, "File operation %s does not support the to_location argument" % operation
157
+ end
158
+ if alter_filename_lambda.class == Proc
159
+ raise StorageOperationError, "File operation %s does not support the alter_filename_lambda argument" % operation
160
+ end
161
+ else
162
+ raise StorageOperationError, "File operation %s is unsupported. Try :copy, :delete or :move" % operation
163
+ end
164
+
165
+ files_to_process = []
166
+ threads = []
167
+ mutex = Mutex.new
168
+ complete = false
169
+ marker_opts = {}
170
+
171
+ # If an exception is thrown in a thread that isn't handled, die quickly
172
+ Thread.abort_on_exception = true
173
+
174
+ # Create ruby threads to concurrently execute s3 operations
175
+ CONCURRENCY.times do # block form, so the retry counter `i` below stays local to each thread
176
+
177
+ # Each thread pops a file off the files_to_process array, and moves it.
178
+ # We loop until there are no more files
179
+ threads << Thread.new do
180
+ loop do
181
+ file = false
182
+ match = false
183
+
184
+ # Critical section:
185
+ # only allow one thread to modify the array at any time
186
+ mutex.synchronize do
187
+
188
+ while !complete && !match do
189
+ if files_to_process.size == 0
190
+ # s3 batches 1000 files per request
191
+ # we load up our array with the files to move
192
+ files_to_process = s3.directories.get(from_location.bucket, :prefix => from_location.dir).files.all(marker_opts)
193
+ # if we don't have any files after the s3 request, we're complete
194
+ if files_to_process.size == 0
195
+ complete = true
196
+ next
197
+ else
198
+ marker_opts['marker'] = files_to_process.last.key
199
+
200
+ # By reversing the array we can use pop and get FIFO behaviour
201
+ # instead of the performance penalty incurred by unshift
202
+ files_to_process = files_to_process.reverse
203
+ end
204
+ end
205
+
206
+ file = files_to_process.pop
207
+ match = if match_regex.is_a? NegativeRegex
208
+ !file.key.match(match_regex.regex)
209
+ else
210
+ file.key.match(match_regex)
211
+ end
212
+ end
213
+ end
214
+
215
+ # If we don't have a match, then we must be complete
216
+ break unless match # exit the thread
217
+
218
+ # Match the filename, ignoring directory
219
+ file_match = file.key.match('([^/]+)$')
220
+
221
+ # Silently skip any sub-directories in the list
222
+ next unless file_match
223
+
224
+ if alter_filename_lambda.class == Proc
225
+ filename = alter_filename_lambda.call(file_match[1])
226
+ else
227
+ filename = file_match[1]
228
+ end
229
+
230
+ # What are we doing?
231
+ case operation
232
+ when :move
233
+ puts " MOVE #{from_location.bucket}/#{file.key} -> #{to_location.bucket}/#{to_location.dir_as_path}#{filename}"
234
+ when :copy
235
+ puts " COPY #{from_location.bucket}/#{file.key} +-> #{to_location.bucket}/#{to_location.dir_as_path}#{filename}"
236
+ when :delete
237
+ puts " DELETE x #{from_location.bucket}/#{file.key}"
238
+ end
239
+
240
+ # A move or copy starts with a copy file
241
+ if [:move, :copy].include? operation
242
+ i = 0
243
+ begin
244
+ file.copy(to_location.bucket, to_location.dir_as_path + filename)
245
+ puts " +-> #{to_location.bucket}/#{to_location.dir_as_path}#{filename}"
246
+ rescue
247
+ raise unless i < RETRIES
248
+ puts "Problem copying #{file.key}. Retrying.", $!, $@
249
+ sleep(RETRY_WAIT) # give us a bit of time before retrying
250
+ i += 1
251
+ retry
252
+ end
253
+ end
254
+
255
+ # A move or delete ends with a delete
256
+ if [:move, :delete].include? operation
257
+ i = 0
258
+ begin
259
+ file.destroy()
260
+ puts " x #{from_location.bucket}/#{file.key}"
261
+ rescue
262
+ raise unless i < RETRIES
263
+ puts "Problem destroying #{file.key}. Retrying.", $!, $@
264
+ sleep(RETRY_WAIT) # Give us a bit of time before retrying
265
+ i += 1
266
+ retry
267
+ end
268
+ end
269
+ end
270
+ end
271
+ end
272
+
273
+ # Wait for threads to finish
274
+ threads.each { |aThread| aThread.join }
275
+
276
+ end
277
+ module_function :process_files
278
+
279
+ end
280
+ end
281
+ end
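The core of `process_files` is a fixed pool of threads that share one work list behind a mutex, refill it a page at a time, and retry each failed operation a bounded number of times. The sketch below isolates that pattern from S3 so it can be run on its own. `drain_concurrently`, `fetch_page`, and the parameter names are hypothetical and chosen for this example; the page source is any callable that returns the next array of items, or an empty array when nothing is left. It assumes items are never `nil`.

```ruby
require 'thread'

# Hypothetical helper (illustration only): process paged work with a thread pool.
def drain_concurrently(fetch_page, concurrency = 4, retries = 3, retry_wait = 0, &work)
  queue    = []
  complete = false
  results  = []
  mutex    = Mutex.new

  threads = Array.new(concurrency) do
    Thread.new do
      loop do
        item = nil

        # Critical section: only one thread refills or pops the shared list.
        mutex.synchronize do
          if queue.empty? && !complete
            page = fetch_page.call
            if page.empty?
              complete = true
            else
              # Reversed so that pop hands items out in their original order.
              queue = page.reverse
            end
          end
          item = queue.pop
        end

        break if item.nil? # nothing left to do: exit this thread

        attempts = 0 # local to this thread's block
        begin
          outcome = work.call(item)
          mutex.synchronize { results << outcome }
        rescue StandardError
          attempts += 1
          raise if attempts > retries
          sleep(retry_wait) # pause before trying again
          retry
        end
      end
    end
  end

  threads.each(&:join)
  results
end

pages = [%w[a.log b.log c.log], %w[d.log]]
done  = drain_concurrently(lambda { pages.shift || [] }, 3) { |name| name.upcase }
puts done.sort.inspect
```

The results come back in completion order, which varies between runs, so the usage line sorts them before printing. Keeping the retry counter inside the thread's block matters: a counter shared between threads would let one thread's failures consume another thread's retries.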
@@ -0,0 +1,103 @@
1
+ # Copyright (c) 2012 SnowPlow Analytics Ltd. All rights reserved.
2
+ #
3
+ # This program is licensed to you under the Apache License Version 2.0,
4
+ # and you may not use this file except in compliance with the Apache License Version 2.0.
5
+ # You may obtain a copy of the Apache License Version 2.0 at http://www.apache.org/licenses/LICENSE-2.0.
6
+ #
7
+ # Unless required by applicable law or agreed to in writing,
8
+ # software distributed under the Apache License Version 2.0 is distributed on an
9
+ # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the Apache License Version 2.0 for the specific language governing permissions and limitations there under.
11
+
12
+ # Authors:: Alex Dean (mailto:support@snowplowanalytics.com), Michael Tibben
13
+ # Copyright:: Copyright (c) 2012 SnowPlow Analytics Ltd
14
+ # License:: Apache License Version 2.0
15
+
16
+ require 'date'
+
+ module Sluice
17
+ module Storage
18
+
19
+ # To handle negative file matching
20
+ NegativeRegex = Struct.new(:regex)
21
+
22
+ # Find files within the given date range
23
+ # (inclusive).
24
+ #
25
+ # Parameters:
26
+ # +start_date+:: start date
27
+ # +end_date+:: end date
28
+ # +date_format+:: format of date in filenames
29
+ # +file_ext+:: extension on files (if any)
30
+ def files_between(start_date, end_date, date_format, file_ext=nil)
31
+
32
+ dates = []
33
+ Date.parse(start_date).upto(Date.parse(end_date)) do |day|
34
+ dates << day.strftime(date_format)
35
+ end
36
+
37
+ '(' + dates.join('|') + ')[^/]+%s$' % regexify.call(file_ext)
38
+ end
39
+ module_function :files_between
40
+
41
+ # Add a trailing slash to a path if missing
42
+ #
43
+ # Parameters:
44
+ # +path+:: path to add a trailing slash to
45
+ def trail_slash(path)
46
+
47
+ path[-1].chr != '/' ? path << '/' : path
48
+ end
49
+ module_function :trail_slash
50
+
51
+ # Find files up to (and including) the given date.
52
+ #
53
+ # Returns a regex in a NegativeRegex so that the
54
+ # matcher can negate the match.
55
+ #
56
+ # Parameters:
57
+ # +end_date+:: end date
58
+ # +date_format+:: format of date in filenames
58
+ # +file_ext+:: extension on files (if any)
60
+ def files_up_to(end_date, date_format, file_ext=nil)
61
+
62
+ # Let's create a black list from the day
63
+ # after the end_date up to today
64
+ day_after = Date.parse(end_date) + 1
65
+ today = Date.today
66
+
67
+ dates = []
68
+ day_after.upto(today) do |day|
69
+ dates << day.strftime(date_format) # Black list
70
+ end
71
+
72
+ NegativeRegex.new('(' + dates.join('|') + ')[^/]+%s$' % regexify.call(file_ext))
73
+ end
74
+ module_function :files_up_to
75
+
76
+ # Find files starting from the given date.
77
+ #
78
+ # Parameters:
79
+ # +start_date+:: start date
80
+ # +date_format+:: format of date in filenames
80
+ # +file_ext+:: extension on files (if any); include period
82
+ def files_from(start_date, date_format, file_ext=nil)
83
+
84
+ # Let's create a white list from the start_date to today
85
+ today = Date.today
86
+
87
+ dates = []
88
+ Date.parse(start_date).upto(today) do |day|
89
+ dates << day.strftime(date_format)
90
+ end
91
+
92
+ '(' + dates.join('|') + ')[^/]+%s$' % regexify.call(file_ext)
93
+ end
94
+ module_function :files_from
95
+
96
+ private
97
+
98
+ # Make a file extension regular expression friendly,
99
+ # adding a starting period (.) if missing
100
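+ # (Defined as a method returning a lambda: a module-level local variable
+ # would not be visible inside the methods above.)
+ def regexify
+ lambda { |fe| fe.nil? ? nil : (fe[0].chr != '.' ? '\\.' + fe : '\\' + fe) }
+ end
+ module_function :regexify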
101
+
102
+ end
103
+ end
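The helpers in this file turn a date range into a regex string with one alternative per day, and wrap the string in a `NegativeRegex` struct when the match should be inverted. A minimal standalone sketch of both ideas follows. The names `extension_pattern`, `files_between_pattern`, `NegativePattern`, and `key_selected?` are hypothetical and exist only in this example; it also uses `Regexp.escape` for the extension, which is a substitution for the hand-built escaping above.

```ruby
require 'date'

# Hypothetical helper: escape an extension, adding the leading period if missing.
def extension_pattern(file_ext)
  return '' if file_ext.nil?
  ext = file_ext.start_with?('.') ? file_ext : ".#{file_ext}"
  Regexp.escape(ext)
end

# Hypothetical helper: regex string matching filenames dated within the range (inclusive).
def files_between_pattern(start_date, end_date, date_format, file_ext = nil)
  dates = (Date.parse(start_date)..Date.parse(end_date)).map { |d| d.strftime(date_format) }
  "(#{dates.join('|')})[^/]+#{extension_pattern(file_ext)}$"
end

# Wrapper marking a pattern whose match should be negated.
NegativePattern = Struct.new(:regex)

# Hypothetical matcher: true when the key should be processed.
def key_selected?(key, pattern)
  if pattern.is_a?(NegativePattern)
    !key.match(pattern.regex)
  else
    !key.match(pattern).nil?
  end
end

pattern = files_between_pattern('2012-11-01', '2012-11-03', '%Y-%m-%d', 'gz')
puts pattern
puts key_selected?('logs/access.2012-11-02-00.gz', pattern)
puts key_selected?('logs/access.2012-11-04-00.gz', pattern)
puts key_selected?('logs/access.2012-11-04-00.gz', NegativePattern.new(pattern))
```

Because the pattern lists every day explicitly, its length grows with the size of the range; that is a reasonable trade-off for short ETL windows but worth keeping in mind for ranges spanning years.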
@@ -0,0 +1,38 @@
1
+ # Copyright (c) 2012 SnowPlow Analytics Ltd. All rights reserved.
2
+ #
3
+ # This program is licensed to you under the Apache License Version 2.0,
4
+ # and you may not use this file except in compliance with the Apache License Version 2.0.
5
+ # You may obtain a copy of the Apache License Version 2.0 at http://www.apache.org/licenses/LICENSE-2.0.
6
+ #
7
+ # Unless required by applicable law or agreed to in writing,
8
+ # software distributed under the Apache License Version 2.0 is distributed on an
9
+ # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the Apache License Version 2.0 for the specific language governing permissions and limitations there under.
11
+
12
+ # Author:: Alex Dean (mailto:support@snowplowanalytics.com)
13
+ # Copyright:: Copyright (c) 2012 SnowPlow Analytics Ltd
14
+ # License:: Apache License Version 2.0
15
+
16
+ # -*- encoding: utf-8 -*-
17
+ lib = File.expand_path('../lib', __FILE__)
18
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
19
+ require 'sluice'
20
+
21
+ Gem::Specification.new do |gem|
22
+ gem.authors = ["Alex Dean", "Michael Tibben"]
23
+ gem.email = ["support@snowplowanalytics.com"]
24
+ gem.summary = %q{Ruby toolkit for cloud-friendly ETL}
25
+ gem.description = %q{A Ruby gem to help you build ETL processes involving Amazon S3. Uses Fog}
26
+ gem.homepage = "http://snowplowanalytics.com"
27
+
28
+ gem.files = `git ls-files`.split($\)
29
+ gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
30
+ gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
31
+ gem.name = Sluice::NAME
32
+ gem.version = Sluice::VERSION
33
+ gem.platform = Gem::Platform::RUBY
34
+ gem.require_paths = ["lib"]
35
+
36
+ # Dependencies
37
+ gem.add_dependency 'fog', '~> 1.6.0'
38
+ end
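The gemspec pins fog with the pessimistic operator, `~> 1.6.0`, which allows patch releases in the 1.6 series but not 1.7. The README's `gem 'sluice', '~> 0.0.1'` works the same way for the 0.0 series. The sketch below checks a few versions with RubyGems' own `Gem::Requirement` and `Gem::Version` classes; `allowed?` is a helper name invented for this example.

```ruby
require 'rubygems'

# Hypothetical helper: does +version+ satisfy +constraint+?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

puts allowed?('~> 1.6.0', '1.6.0')  # true
puts allowed?('~> 1.6.0', '1.6.9')  # true
puts allowed?('~> 1.6.0', '1.7.0')  # false
puts allowed?('~> 0.0.1', '0.1.0')  # false
```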
metadata ADDED
@@ -0,0 +1,91 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: sluice
3
+ version: !ruby/object:Gem::Version
4
+ hash: 29
5
+ prerelease:
6
+ segments:
7
+ - 0
8
+ - 0
9
+ - 1
10
+ version: 0.0.1
11
+ platform: ruby
12
+ authors:
13
+ - Alex Dean
14
+ - Michael Tibben
15
+ autorequire:
16
+ bindir: bin
17
+ cert_chain: []
18
+
19
+ date: 2012-11-05 00:00:00 Z
20
+ dependencies:
21
+ - !ruby/object:Gem::Dependency
22
+ name: fog
23
+ prerelease: false
24
+ requirement: &id001 !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ~>
28
+ - !ruby/object:Gem::Version
29
+ hash: 15
30
+ segments:
31
+ - 1
32
+ - 6
33
+ - 0
34
+ version: 1.6.0
35
+ type: :runtime
36
+ version_requirements: *id001
37
+ description: A Ruby gem to help you build ETL processes involving Amazon S3. Uses Fog
38
+ email:
39
+ - support@snowplowanalytics.com
40
+ executables: []
41
+
42
+ extensions: []
43
+
44
+ extra_rdoc_files: []
45
+
46
+ files:
47
+ - .gitignore
48
+ - Gemfile
49
+ - LICENSE-2.0.txt
50
+ - README.md
51
+ - Rakefile
52
+ - lib/sluice.rb
53
+ - lib/sluice/errors.rb
54
+ - lib/sluice/storage/s3.rb
55
+ - lib/sluice/storage/storage.rb
56
+ - sluice.gemspec
57
+ homepage: http://snowplowanalytics.com
58
+ licenses: []
59
+
60
+ post_install_message:
61
+ rdoc_options: []
62
+
63
+ require_paths:
64
+ - lib
65
+ required_ruby_version: !ruby/object:Gem::Requirement
66
+ none: false
67
+ requirements:
68
+ - - ">="
69
+ - !ruby/object:Gem::Version
70
+ hash: 3
71
+ segments:
72
+ - 0
73
+ version: "0"
74
+ required_rubygems_version: !ruby/object:Gem::Requirement
75
+ none: false
76
+ requirements:
77
+ - - ">="
78
+ - !ruby/object:Gem::Version
79
+ hash: 3
80
+ segments:
81
+ - 0
82
+ version: "0"
83
+ requirements: []
84
+
85
+ rubyforge_project:
86
+ rubygems_version: 1.8.24
87
+ signing_key:
88
+ specification_version: 3
89
+ summary: Ruby toolkit for cloud-friendly ETL
90
+ test_files: []
91
+