sluice 0.0.9 → 0.1.0

data/CHANGELOG CHANGED
@@ -6,7 +6,7 @@ Version 0.0.8 (2013-08-14)
  --------------------------
  Added in ability to process a list of files rather than an S3 location
 
- Version 0.0.7 (2013-XX-XX)
+ Version 0.0.7 (2013-07-12)
  --------------------------
  Added in upload capability
 
data/README.md CHANGED
@@ -8,12 +8,12 @@ Currently Sluice provides the following very robust, very parallel S3 operations
  * File download from S3
  * File delete from S3
  * File move within S3 (from/to the same or different AWS accounts)
- * File copy within S3 (from/to the same or different AWS accounts)
+ * File copy within S3 (from/to the same or different AWS accounts; optionally using a manifest)
 
- Sluice has been extracted from a pair of Ruby ETL applications built by the [SnowPlow Analytics] [snowplow-analytics] team, specifically:
+ Sluice has been extracted from a pair of Ruby ETL applications built by the [Snowplow Analytics] [snowplow-analytics] team, specifically:
 
- 1. [EmrEtlRunner] [emr-etl-runner], a Ruby application to run the SnowPlow ETL process on Elastic MapReduce
- 2. [StorageLoader] [storage-loader], a Ruby application to load SnowPlow event files from Amazon S3 into databases such as Infobright
+ 1. [EmrEtlRunner] [emr-etl-runner], a Ruby application to run the Snowplow ETL process on Elastic MapReduce
+ 2. [StorageLoader] [storage-loader], a Ruby application to load Snowplow event files from Amazon S3 into databases such as Redshift and Postgres
 
  ## Installation
 
@@ -21,7 +21,7 @@ Sluice has been extracted from a pair of Ruby ETL applications built by the [Sno
 
  Or in your Gemfile:
 
- gem 'sluice', '~> 0.0.9'
+ gem 'sluice', '~> 0.1.0'
 
  ## Usage
 
@@ -32,7 +32,7 @@ Rubydoc and usage examples to come.
  To hack on Sluice locally:
 
  $ gem build sluice.gemspec
- $ sudo gem install sluice-0.0.9.gem
+ $ sudo gem install sluice-0.1.0.gem
 
  To contribute:
 
@@ -44,11 +44,11 @@ To contribute:
 
  ## Credits and thanks
 
- Sluice was developed by [Alex Dean] [alexanderdean] ([SnowPlow Analytics] [snowplow-analytics]) and [Michael Tibben] [mtibben] ([99designs] [99designs]).
+ Sluice was developed by [Alex Dean] [alexanderdean] ([Snowplow Analytics] [snowplow-analytics]) and [Michael Tibben] [mtibben] ([99designs] [99designs]).
 
  ## Copyright and license
 
- Sluice is copyright 2012 SnowPlow Analytics Ltd.
+ Sluice is copyright 2012-2013 Snowplow Analytics Ltd.
 
  Licensed under the [Apache License, Version 2.0] [license] (the "License");
  you may not use this software except in compliance with the License.
@@ -66,7 +66,7 @@ limitations under the License.
  [mtibben]: https://github.com/mtibben
  [99designs]: http://99designs.com
 
- [emr-etl-runner]: https://github.com/snowplow/snowplow/tree/master/3-etl/emr-etl-runner
+ [emr-etl-runner]: https://github.com/snowplow/snowplow/tree/master/3-enrich/emr-etl-runner
  [storage-loader]: https://github.com/snowplow/snowplow/tree/master/4-storage/storage-loader
 
  [license]: http://www.apache.org/licenses/LICENSE-2.0
data/lib/sluice.rb CHANGED
@@ -19,5 +19,5 @@ require 'sluice/storage/s3'
 
  module Sluice
  NAME = "sluice"
- VERSION = "0.0.9"
+ VERSION = "0.1.0"
  end
data/lib/sluice/storage/s3.rb CHANGED
@@ -13,10 +13,14 @@
  # Copyright:: Copyright (c) 2012 SnowPlow Analytics Ltd
  # License:: Apache License Version 2.0
 
+ require 'set'
  require 'tmpdir'
  require 'fog'
  require 'thread'
 
+ require 'contracts'
+ include Contracts
+
  module Sluice
  module Storage
  module S3
@@ -29,12 +33,18 @@ module Sluice
  RETRIES = 3 # Attempts
  RETRY_WAIT = 10 # Seconds
 
+ # Aliases for Contracts
+ FogStorage = Fog::Storage::AWS::Real
+ FogFile = Fog::Storage::AWS::File
+
  # Class to describe an S3 location
  # TODO: if we are going to impose trailing line-breaks on
  # buckets, maybe we should make that clearer?
  class Location
  attr_reader :bucket, :dir, :s3location
 
+ # Location constructor
+ #
  # Parameters:
  # +s3location+:: the s3 location config string e.g. "bucket/directory"
  def initialize(s3_location)
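The `"bucket/directory"` config string documented above suggests a single split on the first slash. Location's real internals are not part of this diff, so the sketch below is hypothetical: `parse_s3_location`, its return shape, and the trailing-slash normalisation (hinted at by `dir_as_path`) are all our invention.

```ruby
# Hypothetical sketch only: how a "bucket/directory" s3location string
# could be split into its parts. Not Sluice's actual Location code.
def parse_s3_location(s3_location)
  bucket, dir = s3_location.split('/', 2)
  dir_as_path = (dir.nil? || dir.empty?) ? '' : "#{dir.chomp('/')}/"
  { bucket: bucket, dir: dir, dir_as_path: dir_as_path }
end

loc = parse_s3_location("my-bucket/events/raw")
puts loc[:bucket]      # => my-bucket
puts loc[:dir_as_path] # => events/raw/
```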
@@ -60,6 +70,108 @@ module Sluice
  end
  end
 
+ # Legitimate manifest scopes:
+ # 1. :filename - store only the filename
+ # in the manifest
+ # 2. :relpath - store the relative path
+ # to the file in the manifest
+ # 3. :abspath - store the absolute path
+ # to the file in the manifest
+ # 4. :bucket - store bucket PLUS absolute
+ # path to the file in the manifest
+ #
+ # TODO: add support for 2-4. Currently only 1 supported
+ class ManifestScope
+
+ @@scopes = Set::[](:filename) # TODO add :relpath, :abspath, :bucket
+
+ def self.valid?(val)
+ val.is_a?(Symbol) &&
+ @@scopes.include?(val)
+ end
+ end
+
+ # Class to read and maintain a manifest.
+ class Manifest
+ attr_reader :s3_location, :scope, :manifest_file
+
+ # Manifest constructor
+ #
+ # Parameters:
+ # +path+:: full path to the manifest file
+ # +scope+:: whether file entries in the
+ # manifest should be scoped to
+ # filename, relative path, absolute
+ # path, or absolute path and bucket
+ Contract Location, ManifestScope => nil
+ def initialize(s3_location, scope)
+ @s3_location = s3_location
+ @scope = scope
+ @manifest_file = "%ssluice-%s-manifest" % [s3_location.dir_as_path, scope.to_s]
+ nil
+ end
+
+ # Get the current file entries in the manifest
+ #
+ # Parameters:
+ # +s3+:: A Fog::Storage s3 connection
+ #
+ # Returns an Array of filenames as Strings
+ Contract FogStorage => ArrayOf[String]
+ def get_entries(s3)
+
+ manifest = self.class.get_manifest(s3, @s3_location, @manifest_file)
+ if manifest.nil?
+ return []
+ end
+
+ manifest.body.split("\n").reject(&:empty?)
+ end
+
+ # Add (i.e. append) the following file entries
+ # to the manifest
+ # Files listed previously in the manifest will
+ # be kept in the new manifest file.
+ #
+ # Parameters:
+ # +s3+:: A Fog::Storage s3 connection
+ # +entries+:: an Array of filenames as Strings
+ #
+ # Returns all entries now in the manifest
+ Contract FogStorage, ArrayOf[String] => ArrayOf[String]
+ def add_entries(s3, entries)
+
+ existing = get_entries(s3)
+ filenames = entries.map { |filepath|
+ File.basename(filepath)
+ } # TODO: update when non-filename-based manifests supported
+ all = (existing + filenames)
+
+ manifest = self.class.get_manifest(s3, @s3_location, @manifest_file)
+ body = all.join("\n")
+ if manifest.nil?
+ bucket = s3.directories.get(s3_location.bucket).files.create(
+ :key => @manifest_file,
+ :body => body
+ )
+ else
+ manifest.body = body
+ manifest.save
+ end
+
+ all
+ end
+
+ private
+
+ # Helper to get the manifest file
+ # Contract FogStorage, Location, String => Or[FogFile, nil] TODO: fix this. Expected: File, Actual: <Fog::Storage::AWS::File>
+ def self.get_manifest(s3, s3_location, filename)
+ s3.directories.get(s3_location.bucket, prefix: s3_location.dir).files.get(filename) # TODO: break out into new generic get_file() procedure
+ end
+
+ end
+
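Stripped of Fog, the Manifest bookkeeping added above is just a newline-joined list of filenames that grows monotonically (note that, like `add_entries`, it does not deduplicate). A self-contained sketch, with an in-memory string standing in for the S3 object; the helper names here are ours, not Sluice's:

```ruby
require 'set'

SCOPES = Set[:filename] # mirrors ManifestScope's single supported scope

# Read entries back, as in Manifest#get_entries
def read_entries(body)
  body.split("\n").reject(&:empty?)
end

# Append new entries, as in Manifest#add_entries: keep what was there,
# reduce each filepath to its basename (the :filename scope), rejoin
def append_entries(body, filepaths)
  all = read_entries(body) + filepaths.map { |fp| File.basename(fp) }
  [all, all.join("\n")]
end

all, body = append_entries("2013-09-01-events.gz\n", ["in/2013-09-02-events.gz"])
puts all.length # => 2
```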
  # Helper function to instantiate a new Fog::Storage
  # for S3 based on our config options
  #
@@ -67,6 +179,7 @@ module Sluice
  # +region+:: Amazon S3 region we will be working with
  # +access_key_id+:: AWS access key ID
  # +secret_access_key+:: AWS secret access key
+ Contract String, String, String => FogStorage
  def new_fog_s3_from(region, access_key_id, secret_access_key)
  fog = Fog::Storage.new({
  :provider => 'AWS',
@@ -164,7 +277,7 @@ module Sluice
  def download_files(s3, from_files_or_loc, to_directory, match_regex='.+')
 
  puts " downloading #{describe_from(from_files_or_loc)} to #{to_directory}"
- process_files(:download, s3, from_files_or_loc, match_regex, to_directory)
+ process_files(:download, s3, from_files_or_loc, [], match_regex, to_directory)
  end
  module_function :download_files
 
@@ -177,7 +290,7 @@ module Sluice
  def delete_files(s3, from_files_or_loc, match_regex='.+')
 
  puts " deleting #{describe_from(from_files_or_loc)}"
- process_files(:delete, s3, from_files_or_loc, match_regex)
+ process_files(:delete, s3, from_files_or_loc, [], match_regex)
  end
  module_function :delete_files
 
@@ -203,12 +316,14 @@ module Sluice
  def copy_files_inter(from_s3, to_s3, from_location, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
 
  puts " copying inter-account #{describe_from(from_location)} to #{to_location}"
+ processed = []
  Dir.mktmpdir do |t|
  tmp = Sluice::Storage.trail_slash(t)
- download_files(from_s3, from_location, tmp, match_regex)
+ processed = download_files(from_s3, from_location, tmp, match_regex)
  upload_files(to_s3, tmp, to_location, '**/*') # Upload all files we downloaded
  end
 
+ processed
  end
  module_function :copy_files_inter
 
@@ -224,10 +339,37 @@ module Sluice
  def copy_files(s3, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
 
  puts " copying #{describe_from(from_files_or_loc)} to #{to_location}"
- process_files(:copy, s3, from_files_or_loc, match_regex, to_location, alter_filename_lambda, flatten)
+ process_files(:copy, s3, from_files_or_loc, [], match_regex, to_location, alter_filename_lambda, flatten)
  end
  module_function :copy_files
 
+ # Copies files between S3 locations maintaining a manifest to
+ # avoid copying a file which was copied previously.
+ #
+ # Useful in scenarios such as:
+ # 1. You would like to do a move but only have read permission
+ # on the source bucket
+ # 2. You would like to do a move but some other process needs
+ # to use the files after you
+ #
+ # +s3+:: A Fog::Storage s3 connection
+ # +manifest+:: A Sluice::Storage::S3::Manifest object
+ # +from_files_or_loc+:: Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to copy files from
+ # +to_location+:: S3Location to copy files to
+ # +match_regex+:: a regex string to match the files to copy
+ # +alter_filename_lambda+:: lambda to alter the written filename
+ # +flatten+:: strips off any sub-folders below the from_location
+ def copy_files_manifest(s3, manifest, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
+
+ puts " copying with manifest #{describe_from(from_files_or_loc)} to #{to_location}"
+ ignore = manifest.get_entries(s3) # Files to leave untouched
+ processed = process_files(:copy, s3, from_files_or_loc, ignore, match_regex, to_location, alter_filename_lambda, flatten)
+ manifest.add_entries(s3, processed)
+
+ processed
+ end
+ module_function :copy_files_manifest
+
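The skip behaviour this buys: on a re-run, anything whose basename already appears in the manifest is left untouched. A simplified, single-pass sketch of that filter (the real code applies it per-file inside process_files; filenames below are invented):

```ruby
# Files already recorded in the manifest (by basename) are skipped;
# a simplified stand-in for the ignore handling copy_files_manifest
# feeds into process_files.
def files_to_copy(filepaths, ignore)
  filepaths.reject { |fp| ignore.include?(File.basename(fp)) }
end

ignore = ["2013-09-01-events.gz"] # e.g. from manifest.get_entries(s3)
files  = ["in/2013-09-01-events.gz", "in/2013-09-02-events.gz"]
puts files_to_copy(files, ignore).inspect # => ["in/2013-09-02-events.gz"]
```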
  # Moves files between S3 locations in two different accounts
  #
  # Implementation is as follows:
@@ -248,13 +390,15 @@ module Sluice
  def move_files_inter(from_s3, to_s3, from_location, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
 
  puts " moving inter-account #{describe_from(from_location)} to #{to_location}"
+ processed = []
  Dir.mktmpdir do |t|
  tmp = Sluice::Storage.trail_slash(t)
- download_files(from_s3, from_location, tmp, match_regex)
+ processed = download_files(from_s3, from_location, tmp, match_regex)
  upload_files(to_s3, tmp, to_location, '**/*') # Upload all files we downloaded
  delete_files(from_s3, from_location, '.+') # Delete all files we downloaded
  end
 
+ processed
  end
  module_function :move_files_inter
 
@@ -270,7 +414,7 @@ module Sluice
  def move_files(s3, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
 
  puts " moving #{describe_from(from_files_or_loc)} to #{to_location}"
- process_files(:move, s3, from_files_or_loc, match_regex, to_location, alter_filename_lambda, flatten)
+ process_files(:move, s3, from_files_or_loc, [], match_regex, to_location, alter_filename_lambda, flatten)
  end
  module_function :move_files
 
@@ -284,7 +428,7 @@ module Sluice
  def upload_files(s3, from_files_or_dir, to_location, match_glob='*')
 
  puts " uploading #{describe_from(from_files_or_dir)} to #{to_location}"
- process_files(:upload, s3, from_files_or_dir, match_glob, to_location)
+ process_files(:upload, s3, from_files_or_dir, [], match_glob, to_location)
  end
  module_function :upload_files
 
@@ -357,13 +501,14 @@ module Sluice
  #
  # Parameters:
  # +operation+:: Operation to perform. :download, :upload, :copy, :delete, :move supported
+ # +ignore+:: Array of filenames to ignore (used by manifest code)
  # +s3+:: A Fog::Storage s3 connection
  # +from_files_or_dir_or_loc+:: Array of filepaths or Fog::Storage::AWS::File objects, local directory or S3Location to process files from
  # +match_regex_or_glob+:: a regex or glob string to match the files to process
  # +to_loc_or_dir+:: S3Location or local directory to process files to
  # +alter_filename_lambda+:: lambda to alter the written filename
  # +flatten+:: strips off any sub-folders below the from_loc_or_dir
- def process_files(operation, s3, from_files_or_dir_or_loc, match_regex_or_glob='.+', to_loc_or_dir=nil, alter_filename_lambda=false, flatten=false)
+ def process_files(operation, s3, from_files_or_dir_or_loc, ignore=[], match_regex_or_glob='.+', to_loc_or_dir=nil, alter_filename_lambda=false, flatten=false)
 
  # Validate that the file operation makes sense
  case operation
@@ -401,6 +546,7 @@ module Sluice
  mutex = Mutex.new
  complete = false
  marker_opts = {}
+ processed_files = [] # For manifest updating, determining if any files were moved etc
 
  # If an exception is thrown in a thread that isn't handled, die quickly
  Thread.abort_on_exception = true
@@ -410,7 +556,8 @@ module Sluice
 
  # Each thread pops a file off the files_to_process array, and moves it.
  # We loop until there are no more files
- threads << Thread.new do
+ threads << Thread.new(i) do |thread_idx|
+
  loop do
  file = false
  filepath = false
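Why `Thread.new(i) do |thread_idx|` rather than reading `i` from the closure: arguments to `Thread.new` are evaluated at spawn time and handed to each thread as its own block-local, so a shared loop variable cannot change out from under a running thread. A standalone illustration (the labels are invented):

```ruby
threads = []
# `for` reuses one binding for i, so closure capture would be racy;
# passing i into Thread.new pins the value each thread sees.
for i in 0...3
  threads << Thread.new(i) do |thread_idx|
    Thread.current[:label] = "(t#{thread_idx})"
  end
end
labels = threads.map { |t| t.join; t[:label] }.sort
puts labels.inspect # => ["(t0)", "(t1)", "(t2)"]
```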
@@ -418,7 +565,7 @@ module Sluice
  from_path = false
  match = false
 
- # Critical section:
+ # First critical section:
  # only allow one thread to modify the array at any time
  mutex.synchronize do
 
@@ -476,6 +623,7 @@ module Sluice
  else
  filepath.match(match_regex_or_glob)
  end
+
  end
  end
  end
@@ -485,6 +633,8 @@ module Sluice
 
  # Name file
  basename = get_basename(filepath)
+ break if ignore.include?(basename) # Don't process if in our leave list
+
  if alter_filename_lambda.class == Proc
  filename = alter_filename_lambda.call(basename)
  else
@@ -497,23 +647,23 @@ module Sluice
  when :upload
  source = "#{filepath}"
  target = name_file(filepath, filename, from_path, to_loc_or_dir.dir_as_path, flatten)
- puts " UPLOAD #{source} +-> #{to_loc_or_dir.bucket}/#{target}"
+ puts "(t#{thread_idx}) UPLOAD #{source} +-> #{to_loc_or_dir.bucket}/#{target}"
  when :download
  source = "#{from_bucket}/#{filepath}"
  target = name_file(filepath, filename, from_path, to_loc_or_dir, flatten)
- puts " DOWNLOAD #{source} +-> #{target}"
+ puts "(t#{thread_idx}) DOWNLOAD #{source} +-> #{target}"
  when :move
  source = "#{from_bucket}/#{filepath}"
  target = name_file(filepath, filename, from_path, to_loc_or_dir.dir_as_path, flatten)
- puts " MOVE #{source} -> #{to_loc_or_dir.bucket}/#{target}"
+ puts "(t#{thread_idx}) MOVE #{source} -> #{to_loc_or_dir.bucket}/#{target}"
  when :copy
  source = "#{from_bucket}/#{filepath}"
  target = name_file(filepath, filename, from_path, to_loc_or_dir.dir_as_path, flatten)
- puts " COPY #{source} +-> #{to_loc_or_dir.bucket}/#{target}"
+ puts "(t#{thread_idx}) COPY #{source} +-> #{to_loc_or_dir.bucket}/#{target}"
  when :delete
  source = "#{from_bucket}/#{filepath}"
  # No target
- puts " DELETE x #{source}"
+ puts "(t#{thread_idx}) DELETE x #{source}"
  end
 
  # Upload is a standalone operation vs move/copy/delete
@@ -555,6 +705,12 @@ module Sluice
  " x #{source}",
  "Problem destroying #{filepath}. Retrying.")
  end
+
+ # Second critical section: we need to update
+ # processed_files in a thread-safe way
+ mutex.synchronize do
+ processed_files << filepath
+ end
  end
  end
  end
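The second critical section exists because appending to a shared array with `Array#<<` is not guaranteed thread-safe across Ruby implementations: each worker appends the filepath it just processed, so the append is wrapped in the same mutex used for file selection. A standalone sketch of the pattern (filenames invented):

```ruby
require 'thread'

mutex = Mutex.new
processed_files = []   # shared across workers, guarded by mutex
work = Queue.new
%w[a.gz b.gz c.gz d.gz].each { |f| work << f }

threads = 2.times.map do
  Thread.new do
    loop do
      file = begin
        work.pop(true) # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      # Critical section: only one thread appends at a time
      mutex.synchronize { processed_files << file }
    end
  end
end
threads.each(&:join)
puts processed_files.sort.inspect # => ["a.gz", "b.gz", "c.gz", "d.gz"]
```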
@@ -562,6 +718,7 @@ module Sluice
  # Wait for threads to finish
  threads.each { |aThread| aThread.join }
 
+ processed_files # Return the processed files
  end
  module_function :process_files
 
@@ -600,7 +757,7 @@ module Sluice
  sleep(RETRY_WAIT) # Give us a bit of time before retrying
  i += 1
  retry
- end
+ end
  end
  module_function :retry_x
 
data/sluice.gemspec CHANGED
@@ -35,4 +35,5 @@ Gem::Specification.new do |gem|
 
  # Dependencies
  gem.add_dependency 'fog', '~> 1.14.0'
+ gem.add_dependency 'contracts', '~> 0.2.3'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: sluice
  version: !ruby/object:Gem::Version
- version: 0.0.9
+ version: 0.1.0
  prerelease:
  platform: ruby
  authors:
@@ -10,7 +10,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-08-14 00:00:00.000000000 Z
+ date: 2013-09-09 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: fog
@@ -28,6 +28,22 @@ dependencies:
  - - ~>
  - !ruby/object:Gem::Version
  version: 1.14.0
+ - !ruby/object:Gem::Dependency
+ name: contracts
+ requirement: !ruby/object:Gem::Requirement
+ none: false
+ requirements:
+ - - ~>
+ - !ruby/object:Gem::Version
+ version: 0.2.3
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ none: false
+ requirements:
+ - - ~>
+ - !ruby/object:Gem::Version
+ version: 0.2.3
  description: A Ruby gem to help you build ETL processes involving Amazon S3. Uses
  Fog
  email:
@@ -67,7 +83,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 1.8.24
+ rubygems_version: 1.8.25
  signing_key:
  specification_version: 3
  summary: Ruby toolkit for cloud-friendly ETL