sluice 0.0.9 → 0.1.0

data/CHANGELOG CHANGED
@@ -6,7 +6,7 @@ Version 0.0.8 (2013-08-14)
 --------------------------
 Added in ability to process a list of files rather than an S3 location
 
-Version 0.0.7 (2013-XX-XX)
+Version 0.0.7 (2013-07-12)
 --------------------------
 Added in upload capability
 
data/README.md CHANGED
@@ -8,12 +8,12 @@ Currently Sluice provides the following very robust, very parallel S3 operations
 * File download from S3
 * File delete from S3
 * File move within S3 (from/to the same or different AWS accounts)
-* File copy within S3 (from/to the same or different AWS accounts)
+* File copy within S3 (from/to the same or different AWS accounts; optionally using a manifest)
 
-Sluice has been extracted from a pair of Ruby ETL applications built by the [SnowPlow Analytics] [snowplow-analytics] team, specifically:
+Sluice has been extracted from a pair of Ruby ETL applications built by the [Snowplow Analytics] [snowplow-analytics] team, specifically:
 
-1. [EmrEtlRunner] [emr-etl-runner], a Ruby application to run the SnowPlow ETL process on Elastic MapReduce
-2. [StorageLoader] [storage-loader], a Ruby application to load SnowPlow event files from Amazon S3 into databases such as Infobright
+1. [EmrEtlRunner] [emr-etl-runner], a Ruby application to run the Snowplow ETL process on Elastic MapReduce
+2. [StorageLoader] [storage-loader], a Ruby application to load Snowplow event files from Amazon S3 into databases such as Redshift and Postgres
 
 ## Installation
 
@@ -21,7 +21,7 @@ Sluice has been extracted from a pair of Ruby ETL applications built by the [Sno
 
 Or in your Gemfile:
 
-    gem 'sluice', '~> 0.0.9'
+    gem 'sluice', '~> 0.1.0'
 
 ## Usage
 
@@ -32,7 +32,7 @@ Rubydoc and usage examples to come.
 To hack on Sluice locally:
 
     $ gem build sluice.gemspec
-    $ sudo gem install sluice-0.0.9.gem
+    $ sudo gem install sluice-0.1.0.gem
 
 To contribute:
 
@@ -44,11 +44,11 @@ To contribute:
 
 ## Credits and thanks
 
-Sluice was developed by [Alex Dean] [alexanderdean] ([SnowPlow Analytics] [snowplow-analytics]) and [Michael Tibben] [mtibben] ([99designs] [99designs]).
+Sluice was developed by [Alex Dean] [alexanderdean] ([Snowplow Analytics] [snowplow-analytics]) and [Michael Tibben] [mtibben] ([99designs] [99designs]).
 
 ## Copyright and license
 
-Sluice is copyright 2012 SnowPlow Analytics Ltd.
+Sluice is copyright 2012-2013 Snowplow Analytics Ltd.
 
 Licensed under the [Apache License, Version 2.0] [license] (the "License");
 you may not use this software except in compliance with the License.
@@ -66,7 +66,7 @@ limitations under the License.
 [mtibben]: https://github.com/mtibben
 [99designs]: http://99designs.com
 
-[emr-etl-runner]: https://github.com/snowplow/snowplow/tree/master/3-etl/emr-etl-runner
+[emr-etl-runner]: https://github.com/snowplow/snowplow/tree/master/3-enrich/emr-etl-runner
 [storage-loader]: https://github.com/snowplow/snowplow/tree/master/4-storage/storage-loader
 
 [license]: http://www.apache.org/licenses/LICENSE-2.0
data/lib/sluice.rb CHANGED
@@ -19,5 +19,5 @@ require 'sluice/storage/s3'
 
 module Sluice
   NAME = "sluice"
-  VERSION = "0.0.9"
+  VERSION = "0.1.0"
 end
data/lib/sluice/storage/s3.rb CHANGED
@@ -13,10 +13,14 @@
 # Copyright:: Copyright (c) 2012 SnowPlow Analytics Ltd
 # License:: Apache License Version 2.0
 
+require 'set'
 require 'tmpdir'
 require 'fog'
 require 'thread'
 
+require 'contracts'
+include Contracts
+
 module Sluice
   module Storage
     module S3
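Sluice 0.1.0 adopts runtime type-checking via the contracts gem (added to the gemspec below). A minimal, standalone sketch of how a `Contract` annotation behaves — illustrative only, not Sluice code:

```ruby
require 'contracts'
include Contracts

# A Contract line attaches to the method defined immediately after it:
# both the arguments and the return value are checked at call time.
Contract Num, Num => Num
def add(a, b)
  a + b
end

add(1, 2)     # => 3
add(1, "two") # raises a ContractError naming the offending parameter
```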
@@ -29,12 +33,18 @@ module Sluice
       RETRIES = 3 # Attempts
       RETRY_WAIT = 10 # Seconds
 
+      # Aliases for Contracts
+      FogStorage = Fog::Storage::AWS::Real
+      FogFile = Fog::Storage::AWS::File
+
       # Class to describe an S3 location
       # TODO: if we are going to impose trailing line-breaks on
       # buckets, maybe we should make that clearer?
       class Location
         attr_reader :bucket, :dir, :s3location
 
+        # Location constructor
+        #
         # Parameters:
         # +s3location+:: the s3 location config string e.g. "bucket/directory"
         def initialize(s3_location)
@@ -60,6 +70,108 @@ module Sluice
         end
       end
 
+      # Legitimate manifest scopes:
+      # 1. :filename - store only the filename
+      #                in the manifest
+      # 2. :relpath  - store the relative path
+      #                to the file in the manifest
+      # 3. :abspath  - store the absolute path
+      #                to the file in the manifest
+      # 4. :bucket   - store bucket PLUS absolute
+      #                path to the file in the manifest
+      #
+      # TODO: add support for 2-4. Currently only 1 supported
+      class ManifestScope
+
+        @@scopes = Set::[](:filename) # TODO add :relpath, :abspath, :bucket
+
+        def self.valid?(val)
+          val.is_a?(Symbol) &&
+            @@scopes.include?(val)
+        end
+      end
+
+      # Class to read and maintain a manifest.
+      class Manifest
+        attr_reader :s3_location, :scope, :manifest_file
+
+        # Manifest constructor
+        #
+        # Parameters:
+        # +path+:: full path to the manifest file
+        # +scope+:: whether file entries in the
+        #           manifest should be scoped to
+        #           filename, relative path, absolute
+        #           path, or absolute path and bucket
+        Contract Location, ManifestScope => nil
+        def initialize(s3_location, scope)
+          @s3_location = s3_location
+          @scope = scope
+          @manifest_file = "%ssluice-%s-manifest" % [s3_location.dir_as_path, scope.to_s]
+          nil
+        end
+
+        # Get the current file entries in the manifest
+        #
+        # Parameters:
+        # +s3+:: A Fog::Storage s3 connection
+        #
+        # Returns an Array of filenames as Strings
+        Contract FogStorage => ArrayOf[String]
+        def get_entries(s3)
+
+          manifest = self.class.get_manifest(s3, @s3_location, @manifest_file)
+          if manifest.nil?
+            return []
+          end
+
+          manifest.body.split("\n").reject(&:empty?)
+        end
+
+        # Add (i.e. append) the following file entries
+        # to the manifest
+        # Files listed previously in the manifest will
+        # be kept in the new manifest file.
+        #
+        # Parameters:
+        # +s3+:: A Fog::Storage s3 connection
+        # +entries+:: an Array of filenames as Strings
+        #
+        # Returns all entries now in the manifest
+        Contract FogStorage, ArrayOf[String] => ArrayOf[String]
+        def add_entries(s3, entries)
+
+          existing = get_entries(s3)
+          filenames = entries.map { |filepath|
+            File.basename(filepath)
+          } # TODO: update when non-filename-based manifests supported
+          all = (existing + filenames)
+
+          manifest = self.class.get_manifest(s3, @s3_location, @manifest_file)
+          body = all.join("\n")
+          if manifest.nil?
+            bucket = s3.directories.get(s3_location.bucket).files.create(
+              :key => @manifest_file,
+              :body => body
+            )
+          else
+            manifest.body = body
+            manifest.save
+          end
+
+          all
+        end
+
+        private
+
+        # Helper to get the manifest file
+        # Contract FogStorage, Location, String => Or[FogFile, nil] TODO: fix this. Expected: File, Actual: <Fog::Storage::AWS::File>
+        def self.get_manifest(s3, s3_location, filename)
+          s3.directories.get(s3_location.bucket, prefix: s3_location.dir).files.get(filename) # TODO: break out into new generic get_file() procedure
+        end
+
+      end
+
       # Helper function to instantiate a new Fog::Storage
       # for S3 based on our config options
       #
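Pulling the pieces together, `Location` and `Manifest` can be exercised as below. A hedged sketch with placeholder region, credentials and bucket; per `initialize`, the manifest itself lives in S3 at `<dir>sluice-filename-manifest`:

```ruby
require 'sluice'

# Placeholder credentials for illustration only
s3 = Sluice::Storage::S3.new_fog_s3_from('us-east-1', 'AKIA...', 'secret...')

location = Sluice::Storage::S3::Location.new('my-bucket/events/')
manifest = Sluice::Storage::S3::Manifest.new(location, :filename) # only :filename is supported so far

manifest.get_entries(s3)                      # => [] until a manifest file exists
manifest.add_entries(s3, ['2013-09-08.tsv'])  # => ["2013-09-08.tsv"] (creates and saves the manifest)
```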
@@ -67,6 +179,7 @@ module Sluice
       # +region+:: Amazon S3 region we will be working with
       # +access_key_id+:: AWS access key ID
       # +secret_access_key+:: AWS secret access key
+      Contract String, String, String => FogStorage
       def new_fog_s3_from(region, access_key_id, secret_access_key)
         fog = Fog::Storage.new({
           :provider => 'AWS',
@@ -164,7 +277,7 @@ module Sluice
       def download_files(s3, from_files_or_loc, to_directory, match_regex='.+')
 
         puts " downloading #{describe_from(from_files_or_loc)} to #{to_directory}"
-        process_files(:download, s3, from_files_or_loc, match_regex, to_directory)
+        process_files(:download, s3, from_files_or_loc, [], match_regex, to_directory)
       end
       module_function :download_files
 
@@ -177,7 +290,7 @@
       def delete_files(s3, from_files_or_loc, match_regex='.+')
 
         puts " deleting #{describe_from(from_files_or_loc)}"
-        process_files(:delete, s3, from_files_or_loc, match_regex)
+        process_files(:delete, s3, from_files_or_loc, [], match_regex)
       end
       module_function :delete_files
 
@@ -203,12 +316,14 @@
       def copy_files_inter(from_s3, to_s3, from_location, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
 
         puts " copying inter-account #{describe_from(from_location)} to #{to_location}"
+        processed = []
         Dir.mktmpdir do |t|
           tmp = Sluice::Storage.trail_slash(t)
-          download_files(from_s3, from_location, tmp, match_regex)
+          processed = download_files(from_s3, from_location, tmp, match_regex)
           upload_files(to_s3, tmp, to_location, '**/*') # Upload all files we downloaded
         end
 
+        processed
       end
       module_function :copy_files_inter
 
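With `processed` now returned, callers of `copy_files_inter` (and of `move_files_inter` below) can detect an empty run. A minimal sketch, assuming `from_s3`/`to_s3` connections and `from_loc`/`to_loc` `Location`s built as shown earlier:

```ruby
copied = Sluice::Storage::S3.copy_files_inter(from_s3, to_s3, from_loc, to_loc, '.+\.gz')

# copy_files_inter now returns the Array of processed filepaths
puts "nothing matched the regex" if copied.empty?
```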
@@ -224,10 +339,37 @@
       def copy_files(s3, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
 
         puts " copying #{describe_from(from_files_or_loc)} to #{to_location}"
-        process_files(:copy, s3, from_files_or_loc, match_regex, to_location, alter_filename_lambda, flatten)
+        process_files(:copy, s3, from_files_or_loc, [], match_regex, to_location, alter_filename_lambda, flatten)
       end
       module_function :copy_files
 
+      # Copies files between S3 locations maintaining a manifest to
+      # avoid copying a file which was copied previously.
+      #
+      # Useful in scenarios such as:
+      # 1. You would like to do a move but only have read permission
+      #    on the source bucket
+      # 2. You would like to do a move but some other process needs
+      #    to use the files after you
+      #
+      # +s3+:: A Fog::Storage s3 connection
+      # +manifest+:: A Sluice::Storage::S3::Manifest object
+      # +from_files_or_loc+:: Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to copy files from
+      # +to_location+:: S3Location to copy files to
+      # +match_regex+:: a regex string to match the files to copy
+      # +alter_filename_lambda+:: lambda to alter the written filename
+      # +flatten+:: strips off any sub-folders below the from_location
+      def copy_files_manifest(s3, manifest, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
+
+        puts " copying with manifest #{describe_from(from_files_or_loc)} to #{to_location}"
+        ignore = manifest.get_entries(s3) # Files to leave untouched
+        processed = process_files(:copy, s3, from_files_or_loc, ignore, match_regex, to_location, alter_filename_lambda, flatten)
+        manifest.add_entries(s3, processed)
+
+        processed
+      end
+      module_function :copy_files_manifest
+
       # Moves files between S3 locations in two different accounts
       #
       # Implementation is as follows:
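A sketch of driving the new manifest-aware copy end to end, under the same placeholder assumptions as above (bucket names and regex are illustrative; here the manifest is kept in the destination location):

```ruby
from     = Sluice::Storage::S3::Location.new('source-bucket/enriched/')
to       = Sluice::Storage::S3::Location.new('dest-bucket/enriched/')
manifest = Sluice::Storage::S3::Manifest.new(to, :filename)

# First run copies every matching file and records each filename in the
# manifest; rerunning finds those names via get_entries and skips them.
Sluice::Storage::S3.copy_files_manifest(s3, manifest, from, to, '.+\.tsv')
```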
@@ -248,13 +390,15 @@
       def move_files_inter(from_s3, to_s3, from_location, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
 
         puts " moving inter-account #{describe_from(from_location)} to #{to_location}"
+        processed = []
         Dir.mktmpdir do |t|
           tmp = Sluice::Storage.trail_slash(t)
-          download_files(from_s3, from_location, tmp, match_regex)
+          processed = download_files(from_s3, from_location, tmp, match_regex)
           upload_files(to_s3, tmp, to_location, '**/*') # Upload all files we downloaded
           delete_files(from_s3, from_location, '.+') # Delete all files we downloaded
         end
 
+        processed
       end
       module_function :move_files_inter
 
@@ -270,7 +414,7 @@
       def move_files(s3, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=false, flatten=false)
 
         puts " moving #{describe_from(from_files_or_loc)} to #{to_location}"
-        process_files(:move, s3, from_files_or_loc, match_regex, to_location, alter_filename_lambda, flatten)
+        process_files(:move, s3, from_files_or_loc, [], match_regex, to_location, alter_filename_lambda, flatten)
       end
       module_function :move_files
 
@@ -284,7 +428,7 @@
       def upload_files(s3, from_files_or_dir, to_location, match_glob='*')
 
         puts " uploading #{describe_from(from_files_or_dir)} to #{to_location}"
-        process_files(:upload, s3, from_files_or_dir, match_glob, to_location)
+        process_files(:upload, s3, from_files_or_dir, [], match_glob, to_location)
       end
       module_function :upload_files
 
@@ -357,13 +501,14 @@
       #
       # Parameters:
       # +operation+:: Operation to perform. :download, :upload, :copy, :delete, :move supported
+      # +ignore+:: Array of filenames to ignore (used by manifest code)
       # +s3+:: A Fog::Storage s3 connection
       # +from_files_or_dir_or_loc+:: Array of filepaths or Fog::Storage::AWS::File objects, local directory or S3Location to process files from
       # +match_regex_or_glob+:: a regex or glob string to match the files to process
       # +to_loc_or_dir+:: S3Location or local directory to process files to
       # +alter_filename_lambda+:: lambda to alter the written filename
       # +flatten+:: strips off any sub-folders below the from_loc_or_dir
-      def process_files(operation, s3, from_files_or_dir_or_loc, match_regex_or_glob='.+', to_loc_or_dir=nil, alter_filename_lambda=false, flatten=false)
+      def process_files(operation, s3, from_files_or_dir_or_loc, ignore=[], match_regex_or_glob='.+', to_loc_or_dir=nil, alter_filename_lambda=false, flatten=false)
 
         # Validate that the file operation makes sense
         case operation
@@ -401,6 +546,7 @@
         mutex = Mutex.new
         complete = false
         marker_opts = {}
+        processed_files = [] # For manifest updating, determining if any files were moved etc
 
         # If an exception is thrown in a thread that isn't handled, die quickly
         Thread.abort_on_exception = true
@@ -410,7 +556,8 @@
 
          # Each thread pops a file off the files_to_process array, and moves it.
          # We loop until there are no more files
-         threads << Thread.new do
+         threads << Thread.new(i) do |thread_idx|
+
           loop do
             file = false
             filepath = false
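The switch from `Thread.new do` to `Thread.new(i) do |thread_idx|` matters because Ruby's `for` loop shares a single binding for `i` across iterations; passing it as an argument gives each thread its own snapshot. A standalone illustration (not Sluice code):

```ruby
threads = []
for i in 0...3
  # All three blocks close over the same `i`; by the time a thread
  # runs, the loop may already have advanced it.
  threads << Thread.new { print i, " " }  # may print "2 2 2"
end
threads.each(&:join)

threads = []
for i in 0...3
  # Thread.new(i) hands each thread a copy as a block parameter instead.
  threads << Thread.new(i) { |idx| print idx, " " }  # prints 0 1 2 in some order
end
threads.each(&:join)
```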
@@ -418,7 +565,7 @@
             from_path = false
             match = false
 
-            # Critical section:
+            # First critical section:
             # only allow one thread to modify the array at any time
             mutex.synchronize do
 
@@ -476,6 +623,7 @@
                   else
                     filepath.match(match_regex_or_glob)
                   end
+
                 end
               end
             end
@@ -485,6 +633,8 @@
 
             # Name file
             basename = get_basename(filepath)
+            break if ignore.include?(basename) # Don't process if in our leave list
+
             if alter_filename_lambda.class == Proc
               filename = alter_filename_lambda.call(basename)
             else
@@ -497,23 +647,23 @@
             when :upload
               source = "#{filepath}"
               target = name_file(filepath, filename, from_path, to_loc_or_dir.dir_as_path, flatten)
-              puts " UPLOAD #{source} +-> #{to_loc_or_dir.bucket}/#{target}"
+              puts "(t#{thread_idx}) UPLOAD #{source} +-> #{to_loc_or_dir.bucket}/#{target}"
             when :download
               source = "#{from_bucket}/#{filepath}"
               target = name_file(filepath, filename, from_path, to_loc_or_dir, flatten)
-              puts " DOWNLOAD #{source} +-> #{target}"
+              puts "(t#{thread_idx}) DOWNLOAD #{source} +-> #{target}"
             when :move
               source = "#{from_bucket}/#{filepath}"
               target = name_file(filepath, filename, from_path, to_loc_or_dir.dir_as_path, flatten)
-              puts " MOVE #{source} -> #{to_loc_or_dir.bucket}/#{target}"
+              puts "(t#{thread_idx}) MOVE #{source} -> #{to_loc_or_dir.bucket}/#{target}"
             when :copy
               source = "#{from_bucket}/#{filepath}"
               target = name_file(filepath, filename, from_path, to_loc_or_dir.dir_as_path, flatten)
-              puts " COPY #{source} +-> #{to_loc_or_dir.bucket}/#{target}"
+              puts "(t#{thread_idx}) COPY #{source} +-> #{to_loc_or_dir.bucket}/#{target}"
             when :delete
               source = "#{from_bucket}/#{filepath}"
               # No target
-              puts " DELETE x #{source}"
+              puts "(t#{thread_idx}) DELETE x #{source}"
             end
 
             # Upload is a standalone operation vs move/copy/delete
@@ -555,6 +705,12 @@
                 " x #{source}",
                 "Problem destroying #{filepath}. Retrying.")
             end
+
+            # Second critical section: we need to update
+            # processed_files in a thread-safe way
+            mutex.synchronize do
+              processed_files << filepath
+            end
           end
         end
       end
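The second critical section guards the shared `processed_files` array, since concurrent `Array#<<` from multiple threads is not guaranteed to be safe. A minimal sketch of the same pattern in isolation:

```ruby
results = []
mutex = Mutex.new

threads = 4.times.map do |n|
  Thread.new do
    25.times do
      # Serialize appends so concurrent threads cannot interleave a write
      mutex.synchronize { results << n }
    end
  end
end
threads.each(&:join)

results.size # => 100, whatever the interleaving
```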
@@ -562,6 +718,7 @@
         # Wait for threads to finish
         threads.each { |aThread| aThread.join }
 
+        processed_files # Return the processed files
       end
       module_function :process_files
 
@@ -600,7 +757,7 @@
           sleep(RETRY_WAIT) # Give us a bit of time before retrying
           i += 1
           retry
-          end
+        end
       end
       module_function :retry_x
 
data/sluice.gemspec CHANGED
@@ -35,4 +35,5 @@ Gem::Specification.new do |gem|
 
   # Dependencies
   gem.add_dependency 'fog', '~> 1.14.0'
+  gem.add_dependency 'contracts', '~> 0.2.3'
 end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: sluice
 version: !ruby/object:Gem::Version
-  version: 0.0.9
+  version: 0.1.0
 prerelease:
 platform: ruby
 authors:
@@ -10,7 +10,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-08-14 00:00:00.000000000 Z
+date: 2013-09-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fog
@@ -28,6 +28,22 @@ dependencies:
     - - ~>
       - !ruby/object:Gem::Version
         version: 1.14.0
+- !ruby/object:Gem::Dependency
+  name: contracts
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 0.2.3
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 0.2.3
 description: A Ruby gem to help you build ETL processes involving Amazon S3. Uses
   Fog
 email:
@@ -67,7 +83,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 1.8.24
+rubygems_version: 1.8.25
 signing_key:
 specification_version: 3
 summary: Ruby toolkit for cloud-friendly ETL