mobilize-hdfs 1.0.10 → 1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/README.md +13 -22
- data/lib/mobilize-hdfs/handlers/hadoop.rb +11 -11
- data/lib/mobilize-hdfs/handlers/hdfs.rb +129 -146
- data/lib/mobilize-hdfs/version.rb +1 -1
- data/lib/samples/hadoop.yml +0 -3
- data/mobilize-hdfs.gemspec +2 -2
- data/test/hdfs_job_rows.yml +4 -5
- data/test/mobilize-hdfs_test.rb +25 -2
- metadata +5 -5
data/README.md
CHANGED
@@ -94,14 +94,11 @@ be read. If the data is bigger than the read limit, an exception will be
 raised.
 
 The Hadoop configuration consists of:
-* output_cluster, which is the cluster where stage outputs will be
-stored. Clusters are defined in the clusters parameter as described
-below.
 * output_dir, which is the absolute path to the directory in HDFS that will store stage
-outputs. Directory names should end with a slash (/).
+outputs. Directory names should end with a slash (/). It will choose the
+first cluster as the default cluster to write to.
 * read_limit, which is the maximum size data that can be read from the
-cluster.
--c <size limit>. Default is 1GB.
+cluster. Default is 1GB.
 * clusters - this defines aliases for clusters, which are used as
 parameters for Hdfs stages. Cluster aliases contain 5 parameters:
 * namenode - defines the name and port for accessing the namenode
@@ -118,7 +115,6 @@ Sample hadoop.yml:
 ``` yml
 ---
 development:
-output_cluster: dev_cluster
 output_dir: /user/mobilize/development/
 read_limit: 1000000000
 clusters:
@@ -135,7 +131,6 @@ development:
 gateway_node: dev_hadoop_host
 exec_path: /path/to/hadoop
 test:
-output_cluster: test_cluster
 output_dir: /user/mobilize/test/
 read_limit: 1000000000
 clusters:
@@ -152,7 +147,6 @@ test:
 gateway_node: test_hadoop_host
 exec_path: /path/to/hadoop
 production:
-output_cluster: prod_cluster
 output_dir: /user/mobilize/production/
 read_limit: 1000000000
 clusters:
@@ -181,17 +175,15 @@ Start
 * cluster and user are optional for all of the below.
 * cluster defaults to output_cluster;
 * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
-* hdfs.
-*
-*
-* The gsheet_full_path should be of the form `<gbook_name>/<gsheet_name>`. The test uses "Requestor_mobilize(test)/test_hdfs_1.in".
-* The hdfs_full_path is the cluster alias followed by full path on the cluster.
+* hdfs.write `source:<full_path>, target:<hdfs_full_path>, user:<user>`
+* The full_path can use `<gsheet_path>` or `<hdfs_path>`. The test uses "test_hdfs_1.in".
+* `<hdfs_path>` is the cluster alias followed by absolute path on the cluster.
 * if a full path is supplied without a preceding cluster alias (e.g. "/user/mobilize/test/test_hdfs_1.in"),
-the
+the first listed cluster will be used as the default.
 * The test uses "/user/mobilize/test/test_hdfs_1.in" for the initial
 write, then "test_cluster_2/user/mobilize/test/test_hdfs_copy.out" for
-the
-* both cluster arguments and user are optional. If
+the cross-cluster write.
+* both cluster arguments and user are optional. If writing from
 one cluster to another, your source_cluster gateway_node must be able to
 access both clusters.
 
@@ -216,12 +208,11 @@ same cluster as your first.
 
 3) $ rake test
 
-* The test runs a
+* The test runs a 3 stage job:
 * test_hdfs_1:
-* `hdfs.write target:"/user/mobilize/test/test_hdfs_1.out", source:"
-* `hdfs.
-* `
-* `gsheet.write source:"stage3", target:"Runner_mobilize(test)/test_hdfs_1_copy.out"`
+* `hdfs.write target:"/user/mobilize/test/test_hdfs_1.out", source:"test_hdfs_1.in"`
+* `hdfs.write source:"/user/mobilize/test/test_hdfs_1.out",target:"test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out"`
+* `gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out", target:"test_hdfs_1_copy.out"`
 * at the end of the test, there should be a sheet named "test_hdfs_1_copy.out" with the same data as test_hdfs_1.in
 
 <a name='section_Meta'></a>
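The defaulting rule described in the README changes above (a path without a leading cluster alias falls back to the first cluster listed in hadoop.yml) can be sketched in a few lines of plain Ruby. This is an illustration only: the cluster names are placeholders, and the real logic lives in `Hdfs.url_by_path` in the hdfs.rb diff further down, which also verifies the path against the cluster before returning it.

```ruby
# Sketch of the 1.2 path-defaulting rule; cluster names are placeholders.
CLUSTERS = ["test_cluster", "test_cluster_2"]

def resolve(path)
  cluster = path.split("/").first.to_s
  if CLUSTERS.include?(cluster)
    # path begins with a cluster alias: strip the alias off
    path = "/" + path.split("/")[1..-1].join("/")
  else
    # no alias: fall back to the first listed cluster
    cluster = CLUSTERS.first
    path = path.start_with?("/") ? path : "/#{path}"
  end
  "hdfs://#{cluster}#{path}"
end

puts resolve("/user/mobilize/test/test_hdfs_1.in")
# => hdfs://test_cluster/user/mobilize/test/test_hdfs_1.in
puts resolve("test_cluster_2/user/mobilize/test/test_hdfs_copy.out")
# => hdfs://test_cluster_2/user/mobilize/test/test_hdfs_copy.out
```

(The gem itself uses its own `starts_with?` core extension; plain Ruby's `start_with?` is used here so the sketch runs standalone.)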
data/lib/mobilize-hdfs/handlers/hadoop.rb
CHANGED
@@ -9,15 +9,15 @@ module Mobilize
 end
 
 def Hadoop.gateway_node(cluster)
-Hadoop.clusters[cluster]['gateway_node']
+Hadoop.config['clusters'][cluster]['gateway_node']
 end
 
 def Hadoop.clusters
-Hadoop.config['clusters']
+Hadoop.config['clusters'].keys
 end
 
-def Hadoop.
-Hadoop.
+def Hadoop.default_cluster
+Hadoop.clusters.first
 end
 
 def Hadoop.output_dir
@@ -28,20 +28,20 @@ module Mobilize
 Hadoop.config['read_limit']
 end
 
-def Hadoop.job(command,
+def Hadoop.job(cluster,command,user,file_hash={})
 command = ["-",command].join unless command.starts_with?("-")
-Hadoop.run("job -fs #{Hdfs.root(cluster)} #{command}",
+Hadoop.run(cluster,"job -fs #{Hdfs.root(cluster)} #{command}",user,file_hash).ie do |r|
 r.class==Array ? r.first : r
 end
 end
 
 def Hadoop.job_list(cluster)
-raw_list = Hadoop.job("list"
+raw_list = Hadoop.job(cluster,"list")
 raw_list.split("\n")[1..-1].join("\n").tsv_to_hash_array
 end
 
-def Hadoop.job_status(
-raw_status = Hadoop.job("status #{
+def Hadoop.job_status(cluster,hadoop_job_id)
+raw_status = Hadoop.job(cluster,"status #{hadoop_job_id}",{})
 dhash_status = raw_status.strip.split("\n").map do |sline|
 delim_index = [sline.index("="),sline.index(":")].compact.min
 if delim_index
@@ -54,14 +54,14 @@ module Mobilize
 hash_status
 end
 
-def Hadoop.run(command,
+def Hadoop.run(cluster,command,user_name,file_hash={})
 h_command = if command.starts_with?("hadoop")
 command.sub("hadoop",Hadoop.exec_path(cluster))
 else
 "#{Hadoop.exec_path(cluster)} #{command}"
 end
 gateway_node = Hadoop.gateway_node(cluster)
-Ssh.run(gateway_node,h_command,
+Ssh.run(gateway_node,h_command,user_name,file_hash)
 end
 end
 end
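The hadoop.rb changes above re-key every call on an explicit cluster argument: `Hadoop.clusters` now returns the alias list from the config, and the new `Hadoop.default_cluster` is simply the first alias. A minimal, self-contained sketch of those helpers, with a hypothetical in-memory hash standing in for the parsed hadoop.yml:

```ruby
# Sketch of the 1.2 cluster helpers; the config hash is a stand-in for the
# parsed hadoop.yml, and the host names are placeholders.
module Mobilize
  module Hadoop
    def Hadoop.config
      { 'clusters' => {
          'test_cluster'   => { 'namenode' => { 'name' => 'nn1.example.com', 'port' => 50070 },
                                'gateway_node' => 'test_hadoop_host' },
          'test_cluster_2' => { 'namenode' => { 'name' => 'nn2.example.com', 'port' => 50070 },
                                'gateway_node' => 'test_hadoop_host_2' } } }
    end

    # alias list comes straight from the config keys
    def Hadoop.clusters
      Hadoop.config['clusters'].keys
    end

    # the first listed cluster is the default
    def Hadoop.default_cluster
      Hadoop.clusters.first
    end

    def Hadoop.gateway_node(cluster)
      Hadoop.config['clusters'][cluster]['gateway_node']
    end
  end
end

puts Mobilize::Hadoop.default_cluster                 # => test_cluster
puts Mobilize::Hadoop.gateway_node("test_cluster_2")  # => test_hadoop_host_2
```

Since `Hadoop.run`, `Hadoop.job`, and `Hadoop.job_status` now all take the cluster as their first argument, the gateway node and exec_path are resolved per call rather than from a single output_cluster setting.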
data/lib/mobilize-hdfs/handlers/hdfs.rb
CHANGED
@@ -1,190 +1,173 @@
 module Mobilize
 module Hdfs
+#returns the hdfs path to the root of the cluster
 def Hdfs.root(cluster)
-namenode = Hadoop.clusters[cluster]['namenode']
+namenode = Hadoop.config['clusters'][cluster]['namenode']
 "hdfs://#{namenode['name']}:#{namenode['port']}"
 end
 
-
+#replaces the cluster alias with a proper namenode path
+def Hdfs.hdfs_url(url)
+cluster = url.split("hdfs://").last.split("/").first
+#replace first instance
+url.sub("hdfs://#{cluster}",Hdfs.root(cluster))
+end
+
+def Hdfs.run(cluster,command,user)
 command = ["-",command].join unless command.starts_with?("-")
 command = "dfs -fs #{Hdfs.root(cluster)}/ #{command}"
-Hadoop.run(command,
+Hadoop.run(cluster,command,user)
 end
 
-
-
-cluster
-
-
-
-
-
+#return the size in bytes for an Hdfs file
+def Hdfs.file_size(url,user_name)
+cluster = url.split("://").last.split("/").first
+hdfs_url = Hdfs.hdfs_url(url)
+response = Hadoop.run(cluster, "dfs -du '#{hdfs_url}'", user_name)
+if response['exit_code'] != 0
+raise "Unable to get file size for #{url} with error: #{response['stderr']}"
+else
+#parse out response
+return response['stdout'].split("\n")[1].split(" ")[1].to_i
 end
 end
 
-def Hdfs.
-
-
-
-
-
-rescue
-return false
+def Hdfs.read_by_dataset_path(dst_path,user_name,*args)
+cluster = dst_path.split("/").first
+url = Hdfs.url_by_path(dst_path,user_name)
+#make sure file is not too big
+if Hdfs.file_size(url,user_name) >= Hadoop.read_limit
+raise "Hadoop read limit reached -- please reduce query size"
 end
-
-
-def Hdfs.read(path,user)
-cluster, cluster_path = Hdfs.resolve_path(path)
-gateway_node = Hadoop.gateway_node(cluster)
+hdfs_url = Hdfs.hdfs_url(url)
 #need to direct stderr to dev null since hdfs throws errors at being headed off
-
-
-response
-
-
+read_command = "dfs -cat '#{hdfs_url}'"
+response = Hadoop.run(cluster,read_command,user_name)
+if response['exit_code'] != 0
+raise "Unable to read from #{url} with error: #{response['stderr']}"
+else
+return response['stdout']
 end
-response
 end
 
-
-
-
-
-
-
+#used for writing strings straight up to hdfs
+def Hdfs.write_by_dataset_path(dst_path,string,user_name)
+cluster = dst_path.split("/").first
+url = Hdfs.url_by_path(dst_path,user_name)
+hdfs_url = Hdfs.hdfs_url(url)
+response = Hdfs.write(cluster,hdfs_url,string,user_name)
+if response['exit_code'] != 0
+raise "Unable to write to #{url} with error: #{response['stderr']}"
 else
-
-return [Hadoop.output_cluster,"/#{path.to_s}"]
+return response
 end
 end
 
-def Hdfs.
-cluster, cluster_path = Hdfs.resolve_path(path)
-"#{Hdfs.root(cluster)}#{cluster_path}"
-end
-
-def Hdfs.write(path,string,user)
+def Hdfs.write(cluster,hdfs_url,string,user_name)
 file_hash = {'file.txt'=>string}
-
-
-
-
-
-
-
-def Hdfs.copy(source_path,target_path,user)
-Hdfs.rm(target_path,user) #remove to_path
-source_cluster = Hdfs.resolve_path(source_path).first
-command = "dfs -cp '#{Hdfs.namenode_path(source_path)}' '#{Hdfs.namenode_path(target_path)}'"
-#copy operation implies access to target_url from source_cluster
-Hadoop.run(command,source_cluster,user)
-return Hdfs.namenode_path(target_path)
+#make sure path is clear
+delete_command = "dfs -rm '#{hdfs_url}'"
+Hadoop.run(cluster,delete_command,user_name)
+write_command = "dfs -copyFromLocal file.txt '#{hdfs_url}'"
+response = Hadoop.run(cluster,write_command,user_name,file_hash)
+response
 end
 
-
-
-
-
-
-
-#
-source_cluster
-
-
-
-
-
-
-
+#copy file from one url to another
+#source cluster must be able to issue copy command to target cluster
+def Hdfs.copy(source_url, target_url, user_name)
+#convert aliases
+source_hdfs_url = Hdfs.hdfs_url(source_url)
+target_hdfs_url = Hdfs.hdfs_url(target_url)
+#get cluster names
+source_cluster = source_url.split("://").last.split("/").first
+target_cluster = target_url.split("://").last.split("/").first
+#delete target
+delete_command = "dfs -rm '#{target_hdfs_url}'"
+Hadoop.run(target_cluster,delete_command,user_name)
+#copy source to target
+copy_command = "dfs -cp '#{source_hdfs_url}' '#{target_hdfs_url}'"
+response = Hadoop.run(source_cluster,copy_command,user_name)
+if response['exit_code'] != 0
+raise "Unable to copy #{source_url} to #{target_url} with error: #{response['stderr']}"
+else
+return target_url
 end
-
-source_path = "#{source_cluster}#{source_cluster_path}"
-out_string = Hdfs.read(source_path,user).to_s
-out_url = "hdfs://#{Hadoop.output_cluster}#{Hadoop.output_dir}hdfs/#{stage_path}/out"
-Dataset.write_by_url(out_url,out_string,Gdrive.owner_name)
-out_url
 end
 
-
+# converts a source path or target path to a dst in the context of handler and stage
+def Hdfs.path_to_dst(path,stage_path)
+has_handler = true if path.index("://")
 s = Stage.where(:path=>stage_path).first
-u = s.job.runner.user
 params = s.params
-source_path = params['source']
 target_path = params['target']
-
-
-
-
-
-
-
-
-
-
-
-source_cluster, source_cluster_path = Hdfs.resolve_path(source_path)
-source_path = "#{source_cluster}#{source_cluster_path}"
-source_dst = Dataset.find_or_create_by_handler_and_path("hdfs",source_path)
-in_string = source_dst.read(user)
-raise "No data found at hdfs://#{source_path}" unless in_string.to_s.length>0
+is_target = true if path == target_path
+red_path = path.split("://").last
+cluster = red_path.split("/").first
+#is user has a handler, is specifying a target,
+#has more than 1 slash,
+#or their first path node is a cluster name
+#assume it's an hdfs pointer
+if is_target or has_handler or Hadoop.clusters.include?(cluster) or red_path.split("/").length>2
+user_name = Hdfs.user_name_by_stage_path(stage_path)
+hdfs_url = Hdfs.url_by_path(red_path,user_name,is_target)
+return Dataset.find_or_create_by_url(hdfs_url)
 end
+#otherwise, use ssh convention
+return Ssh.path_to_dst(path,stage_path)
+end
 
-
-
-
-
-
-
-
-
+def Hdfs.url_by_path(path,user_name,is_target=false)
+cluster = path.split("/").first.to_s
+if Hadoop.clusters.include?(cluster)
+#cut node out of path
+path = "/" + path.split("/")[1..-1].join("/")
+else
+cluster = Hadoop.default_cluster
+path = path.starts_with?("/") ? path : "/#{path}"
+end
+url = "hdfs://#{cluster}#{path}"
+hdfs_url = Hdfs.hdfs_url(url)
+begin
+response = Hadoop.run(cluster, "fs -tail '#{hdfs_url}'", user_name)
+if response['exit_code']==0 or is_target
+return "hdfs://#{cluster}#{path}"
+else
+raise "Unable to find #{url} with error: #{response['stderr']}"
+end
+rescue => exc
+raise Exception, "Unable to find #{url} with error: #{exc.to_s}", exc.backtrace
 end
-
-target_path = "#{target_cluster}#{target_cluster_path}"
-out_string = Hdfs.write(target_path,in_string,user)
-
-out_url = "hdfs://#{Hadoop.output_cluster}#{Hadoop.output_dir}hdfs/#{stage_path}/out"
-Dataset.write_by_url(out_url,out_string,Gdrive.owner_name)
-out_url
 end
 
-def Hdfs.
+def Hdfs.user_name_by_stage_path(stage_path,cluster=nil)
 s = Stage.where(:path=>stage_path).first
 u = s.job.runner.user
-
-
-
-
-
-
-
-
-#determine cluster for target
-target_cluster, target_cluster_path = Hdfs.resolve_path(target_path)
-raise "unable to resolve target path" if target_cluster.nil?
-
-node = Hadoop.gateway_node(source_cluster)
-if user and !Ssh.sudoers(node).include?(u.name)
-raise "#{u.name} does not have su permissions for #{node}"
-elsif user.nil? and Ssh.su_all_users(node)
-user = u.name
+user_name = s.params['user']
+cluster ||= s.params['cluster']
+cluster = Hadoop.default_cluster unless Hadoop.clusters.include?(cluster)
+node = Hadoop.gateway_node(cluster)
+if user_name and !Ssh.sudoers(node).include?(u.name)
+raise "#{u.name} does not have su permissions for node #{node}"
+elsif user_name.nil? and Ssh.su_all_users(node)
+user_name = u.name
 end
-
-source_path = "#{source_cluster}#{source_cluster_path}"
-target_path = "#{target_cluster}#{target_cluster_path}"
-out_string = Hdfs.copy(source_path,target_path,user)
-
-out_url = "hdfs://#{Hadoop.output_cluster}#{Hadoop.output_dir}hdfs/#{stage_path}/out"
-Dataset.write_by_url(out_url,out_string,Gdrive.owner_name)
-out_url
-end
-
-def Hdfs.read_by_dataset_path(dst_path,user)
-Hdfs.read(dst_path,user)
+return user_name
 end
 
-def Hdfs.
-
+def Hdfs.write_by_stage_path(stage_path)
+s = Stage.where(:path=>stage_path).first
+source = s.sources.first
+target = s.target
+cluster = target.url.split("://").last.split("/").first
+user_name = Hdfs.user_name_by_stage_path(stage_path,cluster)
+stdout = if source.handler == 'hdfs'
+Hdfs.copy(source.url,target.url,user_name)
+elsif ["gsheet","gfile","ssh"].include?(source.handler)
+in_string = source.read(user_name)
+Dataset.write_by_url(target.url, in_string, user_name)
+end
+return {'out_str'=>stdout, 'signal' => 0}
 end
 end
 end
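Most of the new hdfs.rb methods lean on the `Hdfs.root`/`Hdfs.hdfs_url` pair introduced at the top of this diff: a cluster alias inside an `hdfs://` URL is swapped for that cluster's namenode host and port before any `dfs` command is issued. A standalone sketch of just that conversion, with placeholder namenode values:

```ruby
# Alias-to-namenode conversion mirroring Hdfs.root / Hdfs.hdfs_url above;
# the namenode hosts and ports are placeholders.
NAMENODES = {
  'test_cluster'   => { 'name' => 'nn1.example.com', 'port' => 50070 },
  'test_cluster_2' => { 'name' => 'nn2.example.com', 'port' => 50070 }
}

def root(cluster)
  namenode = NAMENODES[cluster]
  "hdfs://#{namenode['name']}:#{namenode['port']}"
end

def hdfs_url(url)
  cluster = url.split("hdfs://").last.split("/").first
  # replace only the first occurrence of the alias
  url.sub("hdfs://#{cluster}", root(cluster))
end

puts hdfs_url("hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out")
# => hdfs://nn2.example.com:50070/user/mobilize/test/test_hdfs_1_copy.out
```

`Hdfs.copy` then runs the `dfs -cp` from the source cluster's gateway against both expanded URLs, which is why the README notes that the source cluster's gateway_node must be able to reach the target cluster.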
data/lib/samples/hadoop.yml
CHANGED
@@ -1,6 +1,5 @@
 ---
 development:
-output_cluster: dev_cluster
 output_dir: /user/mobilize/development/
 read_limit: 1000000000
 clusters:
@@ -17,7 +16,6 @@ development:
 gateway_node: dev_hadoop_host
 exec_path: /path/to/hadoop
 test:
-output_cluster: test_cluster
 output_dir: /user/mobilize/test/
 read_limit: 1000000000
 clusters:
@@ -34,7 +32,6 @@ test:
 gateway_node: test_hadoop_host
 exec_path: /path/to/hadoop
 production:
-output_cluster: prod_cluster
 output_dir: /user/mobilize/production/
 read_limit: 1000000000
 clusters:
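With output_cluster removed from every environment block, the first entry under clusters: is what `Hadoop.default_cluster` picks up. A quick way to confirm what a given hadoop.yml will default to; the file path and the environment key here are assumptions based on the sample layout, not something this diff specifies:

```ruby
# Print the cluster a hadoop.yml will treat as the default (the first listed).
# File path and environment name are illustrative assumptions.
require 'yaml'

conf = YAML.load_file('lib/samples/hadoop.yml')
puts conf['test']['clusters'].keys.first
```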
data/mobilize-hdfs.gemspec
CHANGED
@@ -7,7 +7,7 @@ Gem::Specification.new do |gem|
 gem.name = "mobilize-hdfs"
 gem.version = Mobilize::Hdfs::VERSION
 gem.authors = ["Cassio Paes-Leme"]
-gem.email = ["cpaesleme@
+gem.email = ["cpaesleme@dena.com"]
 gem.description = %q{Adds hdfs read, write, and copy support to mobilize-ssh}
 gem.summary = %q{Adds hdfs read, write, and copy support to mobilize-ssh}
 gem.homepage = "http://github.com/dena/mobilize-hdfs"
@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
 gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
 gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
 gem.require_paths = ["lib"]
-gem.add_runtime_dependency "mobilize-ssh","1.
+gem.add_runtime_dependency "mobilize-ssh","1.2"
 end
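The gemspec now pins mobilize-ssh to the matching 1.2 release, so the two gems move in lockstep. A Gemfile tracking this release would look roughly like the following (the rubygems source line is the usual convention, not something this diff specifies):

```ruby
# Example Gemfile entry; mobilize-ssh 1.2 comes along as the pinned runtime dependency.
source "https://rubygems.org"

gem "mobilize-hdfs", "1.2"
```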
data/test/hdfs_job_rows.yml
CHANGED
@@ -2,10 +2,9 @@
 active: true
 trigger: once
 status: ""
-stage1: hdfs.write target:"/user/mobilize/test/test_hdfs_1.out",
-source:"
-stage2: hdfs.
+stage1: hdfs.write target:"/user/mobilize/test/test_hdfs_1.out",
+source:"test_hdfs_1.in"
+stage2: hdfs.write source:"/user/mobilize/test/test_hdfs_1.out",
 target:"test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out",
-stage3:
-stage4: gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out",
+stage3: gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out",
 target:"Runner_mobilize(test)/test_hdfs_1_copy.out"
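The fixture above now describes a three-stage job (two hdfs.write stages and a final gsheet.write), matching the README walkthrough. A sketch of how such a row parses as YAML; the `name` key and the list wrapper are assumptions, since the fixture's first line is outside this diff:

```ruby
# Parse a job row shaped like the fixture above (keys outside the diff are assumed).
require 'yaml'

rows = YAML.load(<<~'YML')
  - name: test_hdfs_1
    active: true
    trigger: once
    status: ""
    stage1: hdfs.write target:"/user/mobilize/test/test_hdfs_1.out", source:"test_hdfs_1.in"
    stage2: hdfs.write source:"/user/mobilize/test/test_hdfs_1.out", target:"test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out"
    stage3: gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out", target:"Runner_mobilize(test)/test_hdfs_1_copy.out"
YML

puts rows.first['stage2']
```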
data/test/mobilize-hdfs_test.rb
CHANGED
@@ -34,9 +34,9 @@ describe "Mobilize" do
 hdfs_1_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1_copy.out",gdrive_slot)
 [hdfs_1_target_sheet].each {|s| s.delete if s}
 
-puts "job row added, force enqueued requestor, wait
+puts "job row added, force enqueued requestor, wait for stages"
 r.enqueue!
-
+wait_for_stages
 
 puts "jobtracker posted data to test sheet"
 test_destination_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1_copy.out",gdrive_slot)
@@ -44,4 +44,27 @@ describe "Mobilize" do
 assert test_destination_sheet.read(u.name).length == 599
 end
 
+def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
+time = 0
+time_since_stage = 0
+#check for 10 min
+while time < time_limit and time_since_stage < stage_limit
+sleep wait_length
+job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
+if job_classes.include?("Mobilize::Stage")
+time_since_stage = 0
+puts "saw stage at #{time.to_s} seconds"
+else
+time_since_stage += wait_length
+puts "#{time_since_stage.to_s} seconds since stage seen"
+end
+time += wait_length
+puts "total wait time #{time.to_s} seconds"
+end
+
+if time >= time_limit
+raise "Timed out before stage completion"
+end
+end
+
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: mobilize-hdfs
 version: !ruby/object:Gem::Version
-version: 1.
+version: '1.2'
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-03-
+date: 2013-03-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: mobilize-ssh
@@ -18,7 +18,7 @@ dependencies:
 requirements:
 - - '='
 - !ruby/object:Gem::Version
-version: 1.
+version: '1.2'
 type: :runtime
 prerelease: false
 version_requirements: !ruby/object:Gem::Requirement
@@ -26,10 +26,10 @@ dependencies:
 requirements:
 - - '='
 - !ruby/object:Gem::Version
-version: 1.
+version: '1.2'
 description: Adds hdfs read, write, and copy support to mobilize-ssh
 email:
-- cpaesleme@
+- cpaesleme@dena.com
 executables: []
 extensions: []
 extra_rdoc_files: []