mobilize-hdfs 1.351 → 1.361

data/README.md CHANGED
@@ -1,234 +1,4 @@
- Mobilize-Hdfs
- ===============
+ Mobilize
+ ========
 
- Mobilize-Hdfs adds the power of HDFS to [mobilize-ssh][mobilize-ssh]:
- * read, write, and copy HDFS files through Google Spreadsheets.
-
- Table Of Contents
- -----------------
- * [Overview](#section_Overview)
- * [Install](#section_Install)
-   * [Mobilize-Hdfs](#section_Install_Mobilize-Hdfs)
-   * [Install Dirs and Files](#section_Install_Dirs_and_Files)
- * [Configure](#section_Configure)
-   * [Hadoop](#section_Configure_Hadoop)
- * [Start](#section_Start)
-   * [Create Job](#section_Start_Create_Job)
-   * [Run Test](#section_Start_Run_Test)
- * [Meta](#section_Meta)
- * [Author](#section_Author)
-
- <a name='section_Overview'></a>
- Overview
- -----------
-
- * Mobilize-hdfs adds HDFS methods to mobilize-ssh.
-
- <a name='section_Install'></a>
- Install
- ------------
-
- Make sure you go through all the steps in the
- [mobilize-base][mobilize-base] and [mobilize-ssh][mobilize-ssh]
- install sections first.
-
- <a name='section_Install_Mobilize-Hdfs'></a>
- ### Mobilize-Hdfs
-
- Add this to your Gemfile:
-
- ``` ruby
- gem "mobilize-hdfs"
- ```
-
- or run
-
-     $ gem install mobilize-hdfs
-
- for a ruby-wide install.
-
- <a name='section_Install_Dirs_and_Files'></a>
- ### Dirs and Files
-
- ### Rakefile
-
- Inside the Rakefile in your project's root dir, make sure you have:
-
- ``` ruby
- require 'mobilize-base/tasks'
- require 'mobilize-ssh/tasks'
- require 'mobilize-hdfs/tasks'
- ```
-
- This defines the rake tasks essential to running the environment.
-
- ### Config Dir
-
- Run
-
-     $ rake mobilize_hdfs:setup
-
- This will copy a sample hadoop.yml to your config dir.
-
- <a name='section_Configure'></a>
- Configure
- ------------
-
- <a name='section_Configure_Hadoop'></a>
- ### Configure Hadoop
-
- * Hadoop handles big data, so we need to be careful when reading from
- the cluster: an unbounded read could easily fill up our MongoDB instance,
- RAM, local disk space, etc.
- * Accordingly, all hadoop operations, stage outputs, etc. are
- executed and stored on the cluster only.
- * The exceptions are:
-   * writing to the cluster from an external source, such as a Google
- Sheet. Here there is no risk, as the external source has much stricter
- size limits than HDFS.
-   * reading from the cluster, such as for posting to a Google Sheet. In
- this case, the read_limit parameter dictates the maximum amount that can
- be read. If the data is bigger than the read limit, an exception will be
- raised.
-
- The Hadoop configuration consists of:
- * output_dir - the absolute path to the directory in HDFS that will store stage
- outputs. Directory names should end with a slash (/). Mobilize will choose the
- first listed cluster as the default cluster to write to.
- * read_limit - the maximum amount of data that can be read from the
- cluster. The default is 1GB.
- * clusters - defines aliases for clusters, which are used as
- parameters for Hdfs stages. Each cluster alias contains:
-   * namenode - defines the name and port for accessing the namenode
-     * name - the namenode's full name, as in namenode1.host.com
-     * port - the namenode's port, by default 50070
-   * gateway_node - defines the node that executes the cluster commands.
- This node must be defined in ssh.yml according to the specs in
- [mobilize-ssh][mobilize-ssh]. The gateway node can be the same for
- multiple clusters, depending on your cluster setup.
-   * exec_path - defines the path to the hadoop executable.
-
- Sample hadoop.yml:
-
- ``` yml
- ---
- development:
-   output_dir: /user/mobilize/development/
-   read_limit: 1000000000
-   clusters:
-     dev_cluster:
-       namenode:
-         name: dev_namenode.host.com
-         port: 50070
-       gateway_node: dev_hadoop_host
-       exec_path: /path/to/hadoop
-     dev_cluster_2:
-       namenode:
-         name: dev_namenode_2.host.com
-         port: 50070
-       gateway_node: dev_hadoop_host
-       exec_path: /path/to/hadoop
- test:
-   output_dir: /user/mobilize/test/
-   read_limit: 1000000000
-   clusters:
-     test_cluster:
-       namenode:
-         name: test_namenode.host.com
-         port: 50070
-       gateway_node: test_hadoop_host
-       exec_path: /path/to/hadoop
-     test_cluster_2:
-       namenode:
-         name: test_namenode_2.host.com
-         port: 50070
-       gateway_node: test_hadoop_host
-       exec_path: /path/to/hadoop
- production:
-   output_dir: /user/mobilize/production/
-   read_limit: 1000000000
-   clusters:
-     prod_cluster:
-       namenode:
-         name: prod_namenode.host.com
-         port: 50070
-       gateway_node: prod_hadoop_host
-       exec_path: /path/to/hadoop
-     prod_cluster_2:
-       namenode:
-         name: prod_namenode_2.host.com
-         port: 50070
-       gateway_node: prod_hadoop_host
-       exec_path: /path/to/hadoop
- ```
-
- <a name='section_Start'></a>
- Start
- -----
-
- <a name='section_Start_Create_Job'></a>
- ### Create Job
-
- * For mobilize-hdfs, the following stages are available.
-   * cluster and user are optional for all of the below.
-     * cluster defaults to output_cluster;
-     * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
-   * hdfs.write `source:<full_path>, target:<hdfs_full_path>, user:<user>`
-     * The full_path can use `<gsheet_path>` or `<hdfs_path>`. The test uses "test_hdfs_1.in".
-     * `<hdfs_path>` is the cluster alias followed by the absolute path on the cluster.
-     * If a full path is supplied without a preceding cluster alias (e.g. "/user/mobilize/test/test_hdfs_1.in"),
- the first listed cluster will be used as the default.
-     * The test uses "/user/mobilize/test/test_hdfs_1.in" for the initial
- write, then "test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out" for
- the cross-cluster write.
-     * Both cluster arguments and user are optional. If writing from
- one cluster to another, your source_cluster gateway_node must be able to
- access both clusters.
-
- <a name='section_Start_Run_Test'></a>
- ### Run Test
-
- To run tests, you will need to:
-
- 1) Go through the [mobilize-base][mobilize-base] and [mobilize-ssh][mobilize-ssh] tests first.
-
- 2) Clone the mobilize-hdfs repository.
-
- 3) From the project folder, run
-
-        $ rake mobilize_hdfs:setup
-
-    Copy over the config files from the mobilize-base and mobilize-ssh
- projects into the config dir, and populate the values in the hadoop.yml file.
- If you don't have two clusters, you can populate test_cluster_2 with the
- same cluster as your first.
-
- 4) Run
-
-        $ rake test
-
- * The test runs a 3-stage job:
-   * test_hdfs_1:
-     * `hdfs.write target:"/user/mobilize/test/test_hdfs_1.out", source:"test_hdfs_1.in"`
-     * `hdfs.write source:"/user/mobilize/test/test_hdfs_1.out",target:"test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out"`
-     * `gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out", target:"test_hdfs_1_copy.out"`
-   * At the end of the test, there should be a sheet named "test_hdfs_1_copy.out" with the same data as test_hdfs_1.in.
-
- <a name='section_Meta'></a>
- Meta
- ----
-
- * Code: `git clone git://github.com/dena/mobilize-hdfs.git`
- * Home: <https://github.com/dena/mobilize-hdfs>
- * Bugs: <https://github.com/dena/mobilize-hdfs/issues>
- * Gems: <http://rubygems.org/gems/mobilize-hdfs>
-
- <a name='section_Author'></a>
- Author
- ------
-
- Cassio Paes-Leme :: cpaesleme@dena.com :: @cpaesleme
-
- [mobilize-base]: https://github.com/dena/mobilize-base
- [mobilize-ssh]: https://github.com/dena/mobilize-ssh
+ Please refer to the mobilize-server wiki: https://github.com/DeNA/mobilize-server/wiki
data/lib/mobilize-hdfs.rb CHANGED
@@ -3,6 +3,9 @@ require "mobilize-ssh"
 
   module Mobilize
     module Hdfs
+      def Hdfs.home_dir
+        File.expand_path('..',File.dirname(__FILE__))
+      end
     end
   end
   require "mobilize-hdfs/handlers/hadoop"
@@ -127,17 +127,7 @@ module Mobilize
         path = path.starts_with?("/") ? path : "/#{path}"
       end
       url = "hdfs://#{cluster}#{path}"
-      hdfs_url = Hdfs.hdfs_url(url)
-      begin
-        response = Hadoop.run(cluster, "fs -tail '#{hdfs_url}'", user_name)
-        if response['exit_code']==0 or is_target
-          return "hdfs://#{cluster}#{path}"
-        else
-          raise "Unable to find #{url} with error: #{response['stderr']}"
-        end
-      rescue => exc
-        raise Exception, "Unable to find #{url} with error: #{exc.to_s}", exc.backtrace
-      end
+      return url
     end
 
     def Hdfs.user_name_by_stage_path(stage_path,cluster=nil)
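The first hunk above adds `Hdfs.home_dir`, which resolves to the gem's root directory; the second drops the `hadoop fs -tail` existence check, so `url_by_path` now returns the constructed `hdfs://` URL without verifying it on the cluster. A minimal standalone sketch of the `home_dir` idiom (the `Demo` module name and paths are illustrative, not from the gem):

``` ruby
# Assume this file lives at <gem_root>/lib/demo.rb.
module Demo
  def Demo.home_dir
    # File.dirname(__FILE__) => "<gem_root>/lib";
    # expanding '..' against it => "<gem_root>".
    File.expand_path('..', File.dirname(__FILE__))
  end
end

puts Demo.home_dir
```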
data/lib/mobilize-hdfs/tasks.rb CHANGED
@@ -1,6 +1,7 @@
- namespace :mobilize_hdfs do
+ require 'yaml'
+ namespace :mobilize do
    desc "Set up config and log folders and files"
-   task :setup do
+   task :setup_hdfs do
      sample_dir = File.dirname(__FILE__) + '/../samples/'
      sample_files = Dir.entries(sample_dir)
      config_dir = (ENV['MOBILIZE_CONFIG_DIR'] ||= "config/mobilize/")
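The setup task is renamed here: the namespace moves from `mobilize_hdfs` to `mobilize` and the task from `setup` to `setup_hdfs`, so the command becomes `rake mobilize:setup_hdfs` rather than the `rake mobilize_hdfs:setup` documented in the removed README. The consuming project's Rakefile stays as before:

``` ruby
# Rakefile in a consuming project, per the removed README's install
# section; after this change, run `rake mobilize:setup_hdfs` to copy
# the sample hadoop.yml into the config dir.
require 'mobilize-base/tasks'
require 'mobilize-ssh/tasks'
require 'mobilize-hdfs/tasks'
```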
data/lib/mobilize-hdfs/version.rb CHANGED
@@ -1,5 +1,5 @@
  module Mobilize
    module Hdfs
-     VERSION = "1.351"
+     VERSION = "1.361"
    end
  end
data/mobilize-hdfs.gemspec CHANGED
@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
    gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
    gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
    gem.require_paths = ["lib"]
-   gem.add_runtime_dependency "mobilize-ssh","1.351"
+   gem.add_runtime_dependency "mobilize-ssh","1.361"
  end
data/test/fixtures/hdfs1.in.yml ADDED
@@ -0,0 +1,91 @@
+ ---
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
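This fixture holds the same 10×10 grid that the deleted test (at the bottom of this diff) built inline as a TSV string. Rendered as tab-separated text, the grid is exactly 599 characters, which matches both the old test's `length == 599` assertion and the new test's `'min_length' => 599` check:

``` ruby
# Rebuild the fixture grid as TSV, as the old test did inline.
row = (0..9).map { |i| "test#{i}" }       # ["test0", ..., "test9"]
tsv = ([row.join("\t")] * 10).join("\n")  # 10 identical rows
puts tsv.length  # => 599 (10 rows * 59 chars + 9 newlines)
```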
data/test/fixtures/integration_expected.yml ADDED
@@ -0,0 +1,17 @@
+ ---
+ - path: "Runner_mobilize(test)/jobs"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hdfs1/stage1"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hdfs1/stage2"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hdfs1/stage3"
+   state: working
+   count: 1
+   confirmed_ats: []
data/test/fixtures/integration_jobs.yml ADDED
@@ -0,0 +1,10 @@
+ - name: hdfs1
+   active: true
+   trigger: once
+   status: ""
+   stage1: hdfs.write target:"/user/mobilize/test/hdfs1.out",
+     source:"hdfs1.in"
+   stage2: hdfs.write source:"/user/mobilize/test/hdfs1.out",
+     target:"test_cluster_2/user/mobilize/test/hdfs1_copy.out",
+   stage3: gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/hdfs1_copy.out",
+     target:"Runner_mobilize(test)/hdfs1_copy.out"
data/test/integration/mobilize-hdfs_test.rb ADDED
@@ -0,0 +1,42 @@
+ require 'test_helper'
+ describe "Mobilize" do
+   # enqueues 4 workers on Resque
+   it "runs integration test" do
+
+     puts "restart workers"
+     Mobilize::Jobtracker.restart_workers!
+
+     u = TestHelper.owner_user
+     r = u.runner
+     user_name = u.name
+     gdrive_slot = u.email
+
+     puts "add test data"
+     ["hdfs1.in"].each do |fixture_name|
+       target_url = "gsheet://#{r.title}/#{fixture_name}"
+       TestHelper.write_fixture(fixture_name, target_url, 'replace')
+     end
+
+     puts "add/update jobs"
+     u.jobs.each{|j| j.delete}
+     jobs_fixture_name = "integration_jobs"
+     jobs_target_url = "gsheet://#{r.title}/jobs"
+     TestHelper.write_fixture(jobs_fixture_name, jobs_target_url, 'update')
+
+     puts "job rows added, force enqueue runner, wait for stages"
+     #wait for stages to complete
+     expected_fixture_name = "integration_expected"
+     Mobilize::Jobtracker.stop!
+     r.enqueue!
+     TestHelper.confirm_expected_jobs(expected_fixture_name)
+
+     puts "update job status and activity"
+     r.update_gsheet(gdrive_slot)
+
+     puts "check posted data"
+     ['hdfs1_copy.out'].each do |out_name|
+       url = "gsheet://#{r.title}/#{out_name}"
+       assert TestHelper.check_output(url, 'min_length' => 599) == true
+     end
+   end
+ end
data/test/test_helper.rb CHANGED
@@ -8,3 +8,4 @@ $dir = File.dirname(File.expand_path(__FILE__))
  ENV['MOBILIZE_ENV'] = 'test'
  require 'mobilize-hdfs'
  $TESTING = true
+ require "#{Mobilize::Ssh.home_dir}/test/test_helper"
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: mobilize-hdfs
  version: !ruby/object:Gem::Version
-   version: '1.351'
+   version: '1.361'
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-04-25 00:00:00.000000000 Z
+ date: 2013-05-31 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mobilize-ssh
@@ -18,7 +18,7 @@ dependencies:
      requirements:
      - - '='
        - !ruby/object:Gem::Version
-         version: '1.351'
+         version: '1.361'
      type: :runtime
      prerelease: false
      version_requirements: !ruby/object:Gem::Requirement
@@ -26,7 +26,7 @@ dependencies:
      requirements:
      - - '='
        - !ruby/object:Gem::Version
-         version: '1.351'
+         version: '1.361'
  description: Adds hdfs read, write, and copy support to mobilize-ssh
  email:
  - cpaesleme@dena.com
@@ -46,8 +46,10 @@ files:
  - lib/mobilize-hdfs/version.rb
  - lib/samples/hadoop.yml
  - mobilize-hdfs.gemspec
- - test/hdfs_job_rows.yml
- - test/mobilize-hdfs_test.rb
+ - test/fixtures/hdfs1.in.yml
+ - test/fixtures/integration_expected.yml
+ - test/fixtures/integration_jobs.yml
+ - test/integration/mobilize-hdfs_test.rb
  - test/redis-test.conf
  - test/test_helper.rb
  homepage: http://github.com/dena/mobilize-hdfs
@@ -64,7 +66,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
      version: '0'
    segments:
    - 0
-   hash: -2398358469652320205
+   hash: 2190980204946063989
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements:
@@ -73,7 +75,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      version: '0'
    segments:
    - 0
-   hash: -2398358469652320205
+   hash: 2190980204946063989
  requirements: []
  rubyforge_project:
  rubygems_version: 1.8.25
@@ -81,7 +83,9 @@ signing_key:
  specification_version: 3
  summary: Adds hdfs read, write, and copy support to mobilize-ssh
  test_files:
- - test/hdfs_job_rows.yml
- - test/mobilize-hdfs_test.rb
+ - test/fixtures/hdfs1.in.yml
+ - test/fixtures/integration_expected.yml
+ - test/fixtures/integration_jobs.yml
+ - test/integration/mobilize-hdfs_test.rb
  - test/redis-test.conf
  - test/test_helper.rb
data/test/hdfs_job_rows.yml DELETED
@@ -1,10 +0,0 @@
- - name: test_hdfs_1
-   active: true
-   trigger: once
-   status: ""
-   stage1: hdfs.write target:"/user/mobilize/test/test_hdfs_1.out",
-     source:"test_hdfs_1.in"
-   stage2: hdfs.write source:"/user/mobilize/test/test_hdfs_1.out",
-     target:"test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out",
-   stage3: gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out",
-     target:"Runner_mobilize(test)/test_hdfs_1_copy.out"
data/test/mobilize-hdfs_test.rb DELETED
@@ -1,70 +0,0 @@
- require 'test_helper'
-
- describe "Mobilize" do
-
-   def before
-     puts 'nothing before'
-   end
-
-   # enqueues 4 workers on Resque
-   it "runs integration test" do
-
-     puts "restart workers"
-     Mobilize::Jobtracker.restart_workers!
-
-     gdrive_slot = Mobilize::Gdrive.owner_email
-     puts "create user 'mobilize'"
-     user_name = gdrive_slot.split("@").first
-     u = Mobilize::User.where(:name=>user_name).first
-     r = u.runner
-     hdfs_1_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1.in",gdrive_slot)
-     [hdfs_1_sheet].each {|s| s.delete if s}
-
-     puts "add test_source data"
-     hdfs_1_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1.in",gdrive_slot)
-     hdfs_1_tsv = ([%w{test0 test1 test2 test3 test4 test5 test6 test7 test8 test9}.join("\t")]*10).join("\n")
-     hdfs_1_sheet.write(hdfs_1_tsv,u.name)
-
-     jobs_sheet = r.gsheet(gdrive_slot)
-
-     test_job_rows = ::YAML.load_file("#{Mobilize::Base.root}/test/hdfs_job_rows.yml")
-     test_job_rows.map{|j| r.jobs(j['name'])}.each{|j| j.delete if j}
-     jobs_sheet.add_or_update_rows(test_job_rows)
-
-     hdfs_1_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1_copy.out",gdrive_slot)
-     [hdfs_1_target_sheet].each {|s| s.delete if s}
-
-     puts "job row added, force enqueued requestor, wait for stages"
-     r.enqueue!
-     wait_for_stages
-
-     puts "jobtracker posted data to test sheet"
-     test_destination_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1_copy.out",gdrive_slot)
-
-     assert test_destination_sheet.read(u.name).length == 599
-   end
-
-   def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
-     time = 0
-     time_since_stage = 0
-     #check for 10 min
-     while time < time_limit and time_since_stage < stage_limit
-       sleep wait_length
-       job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
-       if job_classes.include?("Mobilize::Stage")
-         time_since_stage = 0
-         puts "saw stage at #{time.to_s} seconds"
-       else
-         time_since_stage += wait_length
-         puts "#{time_since_stage.to_s} seconds since stage seen"
-       end
-       time += wait_length
-       puts "total wait time #{time.to_s} seconds"
-     end
-
-     if time >= time_limit
-       raise "Timed out before stage completion"
-     end
-   end
-
- end