mobilize-hive 1.35 → 1.36

This diff shows the content changes between publicly released versions of this package, as they appear in their public registries. It is provided for informational purposes only.
data/README.md CHANGED
@@ -1,254 +1,4 @@
1
- Mobilize-Hive
2
- ===============
1
+ Mobilize
2
+ ========
3
3
 
4
- Mobilize-Hive adds the power of Hive to [mobilize-hdfs][mobilize-hdfs]:
5
- * read, write, and copy Hive tables through Google Spreadsheets.
6
-
7
- Table Of Contents
8
- -----------------
9
- * [Overview](#section_Overview)
10
- * [Install](#section_Install)
11
- * [Mobilize-Hive](#section_Install_Mobilize-Hive)
12
- * [Install Dirs and Files](#section_Install_Dirs_and_Files)
13
- * [Configure](#section_Configure)
14
- * [Hive](#section_Configure_Hive)
15
- * [Start](#section_Start)
16
- * [Create Job](#section_Start_Create_Job)
17
- * [Run Test](#section_Start_Run_Test)
18
- * [Meta](#section_Meta)
19
- * [Special Thanks](#section_Special_Thanks)
20
- * [Author](#section_Author)
21
-
22
- <a name='section_Overview'></a>
23
- Overview
24
- -----------
25
-
26
- * Mobilize-hive adds Hive methods to mobilize-hdfs.
27
-
28
- <a name='section_Install'></a>
29
- Install
30
- ------------
31
-
32
- Make sure you go through all the steps in the
33
- [mobilize-base][mobilize-base],
34
- [mobilize-ssh][mobilize-ssh],
35
- and [mobilize-hdfs][mobilize-hdfs]
36
- install sections first.
37
-
38
- <a name='section_Install_Mobilize-Hive'></a>
39
- ### Mobilize-Hive
40
-
41
- add this to your Gemfile:
42
-
43
- ``` ruby
44
- gem "mobilize-hive"
45
- ```
46
-
47
- or do
48
-
49
- $ gem install mobilize-hive
50
-
51
- for a ruby-wide install.
52
-
53
- <a name='section_Install_Dirs_and_Files'></a>
54
- ### Dirs and Files
55
-
56
- ### Rakefile
57
-
58
- Inside the Rakefile in your project's root dir, make sure you have:
59
-
60
- ``` ruby
61
- require 'mobilize-base/tasks'
62
- require 'mobilize-ssh/tasks'
63
- require 'mobilize-hdfs/tasks'
64
- require 'mobilize-hive/tasks'
65
- ```
66
-
67
- This defines rake tasks essential to run the environment.
68
-
69
- ### Config Dir
70
-
71
- run
72
-
73
- $ rake mobilize_hive:setup
74
-
75
- This will copy over a sample hive.yml to your config dir.
76
-
77
- <a name='section_Configure'></a>
78
- Configure
79
- ------------
80
-
81
- <a name='section_Configure_Hive'></a>
82
- ### Configure Hive
83
-
84
- * Hive stores big data, which means we need to be careful when reading from
85
- the cluster, as query results could easily fill up our mongodb instance, RAM, local disk
86
- space, etc.
87
- * To achieve this, all hive operations, stage outputs, etc. are
88
- executed and stored on the cluster only.
89
- * The exceptions are:
90
- * writing to the cluster from an external source, such as a Google
91
- sheet. Here there
92
- is no risk, as the external source has much stricter size limits than
93
- hive.
94
- * reading from the cluster, such as when posting to a Google sheet. In
95
- this case, the read_limit parameter dictates the maximum amount that can
96
- be read. If the data is bigger than the read limit, an exception will be
97
- raised.
98
-
99
- The Hive configuration consists of:
100
- * clusters - this defines aliases for clusters, which are used as
101
- parameters for Hive stages. They should have the same name as those
102
- in hadoop.yml. Each cluster has:
103
- * max_slots - defines the total number of simultaneous slots to be
104
- used for hive jobs on this cluster
105
- * output_db - defines the db which should be used to hold stage outputs.
106
- * This db must have open permissions (777) so any user on the system can
107
- write to it -- the tables inside will be owned by the users themselves.
108
- * exec_path - defines the path to the hive executable
109
-
110
- Sample hive.yml:
111
-
112
- ``` yml
113
- ---
114
- development:
115
- clusters:
116
- dev_cluster:
117
- max_slots: 5
118
- output_db: mobilize
119
- exec_path: /path/to/hive
120
- test:
121
- clusters:
122
- test_cluster:
123
- max_slots: 5
124
- output_db: mobilize
125
- exec_path: /path/to/hive
126
- production:
127
- clusters:
128
- prod_cluster:
129
- max_slots: 5
130
- output_db: mobilize
131
- exec_path: /path/to/hive
132
- ```
133
-
134
- <a name='section_Start'></a>
135
- Start
136
- -----
137
-
138
- <a name='section_Start_Create_Job'></a>
139
- ### Create Job
140
-
141
- * For mobilize-hive, the following stages are available.
142
- * cluster and user are optional for all of the below.
143
- * cluster defaults to the first cluster listed;
144
- * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
145
- * params are also optional for all of the below. They are substituted into the HQL in sources (see the sketch after this list).
146
- * params are passed as a YML or JSON, as in:
147
- * `hive.run source:<source_path>, params:{'date': '2013-03-01', 'unit': 'widgets'}`
148
- * this example replaces each key, preceded by '@', in all source hqls with its value.
149
- * The preceding '@' is used to keep from replacing instances
150
- of "date" and "unit" in the HQL; you should have `@date` and `@unit` in your actual HQL
151
- if you'd like to replace those tokens.
152
- * in addition, the following params are substituted automatically:
153
- * `$utc_date` - replaced with YYYY-MM-DD date, UTC
154
- * `$utc_time` - replaced with HH:MM time, UTC
155
- * any occurrence of these values in HQL will be replaced at runtime.
156
- * hive.run `hql:<hql> || source:<gsheet_path>, user:<user>, cluster:<cluster>`, which executes the
157
- script in the hql or source sheet and returns any output specified at the
158
- end. If the cmd or last query in source is a select statement, column headers will be
159
- returned as well.
160
- * hive.write `hql:<hql> || source:<source_path>, target:<hive_path>, partitions:<partition_path>, user:<user>, cluster:<cluster>, schema:<gsheet_path>, drop:<true/false>`,
161
- which writes the source or query result to the selected hive table.
162
- * hive_path
163
- * should be of the form `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
164
- * source:
165
- * can be a gsheet_path, hdfs_path, or hive_path (no partitions)
166
- * for gsheet and hdfs path,
167
- * if the file ends in .*ql, it's treated the same as passing hql
168
- * otherwise it is treated as a tsv with the first row as column headers
169
- * target:
170
- * Should be a hive_path, as in `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
171
- * partitions:
172
- * Due to a Hive limitation, partition names CANNOT be reserved keywords when writing from a tsv (gsheet or hdfs source).
173
- * Partitions should be specified as a path, as in partitions:`<partition1>/<partition2>`.
174
- * schema:
175
- * optional. gsheet_path to column schema.
176
- * two columns: name, datatype
177
- * Any columns not defined here will receive "string" as the datatype
178
- * partitions can have their datatypes overridden here as well
179
- * columns named here that are not in the dataset will be ignored
180
- * drop:
181
- * optional. drops the target table before performing write
182
- * defaults to false
183
-
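
As a concrete illustration of the parameter substitution described in the list above, here is a minimal sketch in plain Ruby; the `substitute_params` helper is hypothetical and only mimics the documented behavior, it is not part of the gem.

``` ruby
# Hypothetical helper: user params replace '@key' tokens, and the automatic
# $utc_date / $utc_time tokens are filled in from the current UTC time.
def substitute_params(hql, params = {})
  out = hql.dup
  params.each { |k, v| out = out.gsub("@#{k}", v.to_s) }
  now = Time.now.utc
  out = out.gsub("$utc_date", now.strftime("%Y-%m-%d"))
  out.gsub("$utc_time", now.strftime("%H:%M"))
end

hql = "select * from sales where date='@date' and unit='@unit' and run_at<='$utc_date $utc_time';"
puts substitute_params(hql, "date" => "2013-03-01", "unit" => "widgets")
# => select * from sales where date='2013-03-01' and unit='widgets' and run_at<='<today> <now>';
```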
184
- <a name='section_Start_Run_Test'></a>
185
- ### Run Test
186
-
187
- To run tests, you will need to
188
-
189
- 1) go through [mobilize-base][mobilize-base], [mobilize-ssh][mobilize-ssh], [mobilize-hdfs][mobilize-hdfs] tests first
190
-
191
- 2) clone the mobilize-hive repository
192
-
193
- From the project folder, run
194
-
195
- 3) $ rake mobilize_hive:setup
196
-
197
- Copy over the config files from the mobilize-base, mobilize-ssh,
198
- mobilize-hdfs projects into the config dir, and populate the values in the hive.yml file.
199
-
200
- Make sure you use the same names for your hive clusters as you do in
201
- hadoop.yml.
202
-
203
- 4) $ rake test
204
-
205
- * The test runs these jobs:
206
- * hive_test_1:
207
- * `hive.write target:"mobilize/hive_test_1/act_date",source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema", drop:true`
208
- * `hive.run source:"hive_test_1.hql"`
209
- * `hive.run cmd:"show databases"`
210
- * `gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"`
211
- * `gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"`
212
- * hive_test_1.hql runs a select statement on the table created in the
213
- write command.
214
- * at the end of the test, there should be two sheets, one with a
215
- sum of the data as in your write query, one with the results of the show
216
- databases command.
217
- * hive_test_2:
218
- * `hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true`
219
- * `hive.run cmd:"select * from mobilize.hive_test_2"`
220
- * `gsheet.write source:"stage2", target:"hive_test_2.out"`
221
- * this test uses the output from the first hdfs test as an input, so make sure you've run that first.
222
- * hive_test_3:
223
- * `hive.write source:"hive://mobilize.hive_test_1",target:"mobilize/hive_test_3/date/product",drop:true`
224
- * ``hive.run hql:"select act_date as `date`,product,category,value from mobilize.hive_test_1;"``
225
- * `hive.write source:"stage2",target:"mobilize/hive_test_3/date/product", drop:false`
226
- * `gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"`
227
-
228
-
229
- <a name='section_Meta'></a>
230
- Meta
231
- ----
232
-
233
- * Code: `git clone git://github.com/dena/mobilize-hive.git`
234
- * Home: <https://github.com/dena/mobilize-hive>
235
- * Bugs: <https://github.com/dena/mobilize-hive/issues>
236
- * Gems: <http://rubygems.org/gems/mobilize-hive>
237
-
238
- <a name='section_Special_Thanks'></a>
239
- Special Thanks
240
- --------------
241
- * This release goes to Toby Negrin, who championed this project with
242
- DeNA and gave me the support to get it properly architected, tested, and documented.
243
- * Also many thanks to the Analytics team at DeNA who build and maintain
244
- our Big Data infrastructure.
245
-
246
- <a name='section_Author'></a>
247
- Author
248
- ------
249
-
250
- Cassio Paes-Leme :: cpaesleme@dena.com :: @cpaesleme
251
-
252
- [mobilize-base]: https://github.com/dena/mobilize-base
253
- [mobilize-ssh]: https://github.com/dena/mobilize-ssh
254
- [mobilize-hdfs]: https://github.com/dena/mobilize-hdfs
4
+ Please refer to the mobilize-server wiki: https://github.com/DeNA/mobilize-server/wiki
@@ -94,22 +94,29 @@ module Mobilize
94
94
 
95
95
  #run a generic hive command, with the option of passing a file hash to be locally available
96
96
  def Hive.run(cluster,hql,user_name,params=nil,file_hash=nil)
97
- # no TempStatsStore
98
- hql = "set hive.stats.autogather=false;#{hql}"
97
+ preps = Hive.prepends.map do |p|
98
+ prefix = "set "
99
+ suffix = ";"
100
+ prep_out = p
101
+ prep_out = "#{prefix}#{prep_out}" unless prep_out.starts_with?(prefix)
102
+ prep_out = "#{prep_out}#{suffix}" unless prep_out.ends_with?(suffix)
103
+ prep_out
104
+ end.join
105
+ hql = "#{preps}#{hql}"
99
106
  filename = hql.to_md5
100
107
  file_hash||= {}
101
108
  file_hash[filename] = hql
102
- #add in default params
103
109
  params ||= {}
104
- params = params.merge(Hive.default_params)
105
110
  #replace any params in the file_hash and command
106
111
  params.each do |k,v|
107
112
  file_hash.each do |name,data|
108
- if k.starts_with?("$")
109
- data.gsub!(k,v)
110
- else
111
- data.gsub!("@#{k}",v)
112
- end
113
+ data.gsub!("@#{k}",v)
114
+ end
115
+ end
116
+ #add in default params
117
+ Hive.default_params.each do |k,v|
118
+ file_hash.each do |name,data|
119
+ data.gsub!(k,v)
113
120
  end
114
121
  end
115
122
  #silent mode so we don't have logs in stderr; clip output
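
A minimal sketch of what the new `prepends` handling above produces, assuming a `prepends` list like the one added to the sample hive.yml; plain-Ruby `start_with?`/`end_with?` stand in for the gem's `starts_with?`/`ends_with?` helpers.

``` ruby
# Illustrative only: normalize each configured prepend into a "set ...;"
# statement and prefix the result to the HQL, as Hive.run now does.
prepends = ["hive.stats.autogather=false", "set mapred.job.queue.name=default;"]

preps = prepends.map do |p|
  out = p.dup
  out = "set #{out}" unless out.start_with?("set ")
  out = "#{out};"    unless out.end_with?(";")
  out
end.join

hql = "select count(*) from mobilize.hive1;"
puts preps + hql
# => set hive.stats.autogather=false;set mapred.job.queue.name=default;select count(*) from mobilize.hive1;
```

This is how the hard-coded `set hive.stats.autogather=false;` from 1.35 moves into per-environment configuration.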
@@ -155,9 +162,9 @@ module Mobilize
155
162
  Gdrive.unslot_worker_by_path(stage_path)
156
163
 
157
164
  #check for select at end
158
- hql_array = hql.split(";").map{|hc| hc.strip}.reject{|hc| hc.length==0}
159
- last_statement = hql_array.last.downcase.split("\n").reject{|l| l.starts_with?("-- ")}.first
160
- if last_statement.to_s.starts_with?("select")
165
+ hql_array = hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
166
+ last_statement = hql_array.last
167
+ if last_statement.to_s.downcase.starts_with?("select")
161
168
  #nil if no prior commands
162
169
  prior_hql = hql_array[0..-2].join(";") if hql_array.length > 1
163
170
  select_hql = hql_array.last
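
A minimal sketch of the reworked trailing-select detection above; the sample HQL is invented, and core-Ruby `start_with?` stands in for the gem's `starts_with?`.

``` ruby
# Illustrative only: drop '--' comment lines and blanks, split into statements,
# then check whether the final statement is a select whose output should be returned.
hql = <<-HQL
-- daily rollup
set mapred.job.name=rollup;
insert overwrite table mobilize.rollup
select act_date, sum(value) from mobilize.hive1 group by act_date;

-- preview the result
select * from mobilize.rollup limit 10;
HQL

hql_array = hql.split("\n")
               .reject { |l| l.start_with?("--") || l.strip.empty? }
               .join("\n")
               .split(";")
               .map(&:strip)

puts "last statement returns output" if hql_array.last.to_s.downcase.start_with?("select")
```

The practical change from 1.35 is that `--` comment lines anywhere in the script, not just `-- ` lines inside the final statement, are ignored before deciding whether output should be returned.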
@@ -181,41 +188,37 @@ module Mobilize
181
188
  response
182
189
  end
183
190
 
184
- def Hive.schema_hash(schema_path,user_name,gdrive_slot)
185
- if schema_path.index("/")
186
- #slashes mean sheets
187
- out_tsv = Gsheet.find_by_path(schema_path,gdrive_slot).read(user_name)
191
+ def Hive.schema_hash(schema_path,stage_path,user_name,gdrive_slot)
192
+ handler = if schema_path.index("://")
193
+ schema_path.split("://").first
194
+ else
195
+ "gsheet"
196
+ end
197
+ dst = "Mobilize::#{handler.downcase.capitalize}".constantize.path_to_dst(schema_path,stage_path,gdrive_slot)
198
+ out_raw = dst.read(user_name,gdrive_slot)
199
+ #determine the datatype for schema; accept json, yaml, tsv
200
+ if schema_path.ends_with?(".yml")
201
+ out_ha = begin;YAML.load(out_raw);rescue ScriptError, StandardError;nil;end if out_ha.nil?
188
202
  else
189
- u = User.where(:name=>user_name).first
190
- #check sheets in runner
191
- r = u.runner
192
- runner_sheet = r.gbook(gdrive_slot).worksheet_by_title(schema_path)
193
- out_tsv = if runner_sheet
194
- runner_sheet.read(user_name)
195
- else
196
- #check for gfile. will fail if there isn't one.
197
- Gfile.find_by_path(schema_path).read(user_name)
198
- end
203
+ out_ha = begin;JSON.parse(out_raw);rescue ScriptError, StandardError;nil;end
204
+ out_ha = out_raw.tsv_to_hash_array if out_ha.nil?
199
205
  end
200
- #use Gridfs to cache gdrive results
201
- file_name = schema_path.split("/").last
202
- out_url = "gridfs://#{schema_path}/#{file_name}"
203
- Dataset.write_by_url(out_url,out_tsv,user_name)
204
- schema_tsv = Dataset.find_by_url(out_url).read(user_name,gdrive_slot)
205
206
  schema_hash = {}
206
- schema_tsv.tsv_to_hash_array.each do |ha|
207
- schema_hash[ha['name']] = ha['datatype']
207
+ out_ha.each do |hash|
208
+ schema_hash[hash['name']] = hash['datatype']
208
209
  end
209
210
  schema_hash
210
211
  end
211
212
 
212
- def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil, params=nil)
213
+ def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil, run_params=nil)
213
214
  table_path = [db,table].join(".")
214
215
  table_stats = Hive.table_stats(cluster, db, table, user_name)
215
216
  url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
216
217
 
217
- source_hql_array = source_hql.split(";")
218
- last_select_i = source_hql_array.rindex{|hql| hql.downcase.strip.starts_with?("select")}
218
+ #decomment hql
219
+
220
+ source_hql_array = source_hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
221
+ last_select_i = source_hql_array.rindex{|s| s.downcase.starts_with?("select")}
219
222
  #find the last select query -- it should be used for the temp table creation
220
223
  last_select_hql = (source_hql_array[last_select_i..-1].join(";")+";")
221
224
  #if there is anything prior to the last select, add it in prior to table creation
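
A minimal sketch of the three schema formats the rewritten `Hive.schema_hash` above now accepts: a `.yml` path is parsed as YAML, anything else is tried as JSON and falls back to TSV. The `tsv_to_hash_array` stand-in below is hypothetical; the gem provides its own String extension.

``` ruby
# Illustrative only: the same column schema as YAML, JSON, and TSV,
# each reducing to the {"name" => "datatype"} pairs schema_hash builds.
require "yaml"
require "json"

yaml_raw = <<-YML
- name: act_date
  datatype: string
- name: value
  datatype: bigint
YML
json_raw = '[{"name":"act_date","datatype":"string"},{"name":"value","datatype":"bigint"}]'
tsv_raw  = "name\tdatatype\nact_date\tstring\nvalue\tbigint"

# hypothetical stand-in for the gem's String#tsv_to_hash_array
def tsv_to_hash_array(tsv)
  header, *rows = tsv.split("\n").map { |l| l.split("\t") }
  rows.map { |r| header.zip(r).to_h }
end

[YAML.load(yaml_raw), JSON.parse(json_raw), tsv_to_hash_array(tsv_raw)].each do |hash_array|
  schema_hash = {}
  hash_array.each { |h| schema_hash[h["name"]] = h["datatype"] }
  p schema_hash  # {"act_date"=>"string", "value"=>"bigint"}
end
```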
@@ -228,7 +231,7 @@ module Mobilize
228
231
  temp_set_hql = "set mapred.job.name=#{job_name} (temp table);"
229
232
  temp_drop_hql = "drop table if exists #{temp_table_path};"
230
233
  temp_create_hql = "#{temp_set_hql}#{prior_hql}#{temp_drop_hql}create table #{temp_table_path} as #{last_select_hql}"
231
- response = Hive.run(cluster,temp_create_hql,user_name,params)
234
+ response = Hive.run(cluster,temp_create_hql,user_name,run_params)
232
235
  raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
233
236
 
234
237
  source_table_stats = Hive.table_stats(cluster,temp_db,temp_table_name,user_name)
@@ -267,7 +270,7 @@ module Mobilize
267
270
  target_insert_hql,
268
271
  temp_drop_hql].join
269
272
 
270
- response = Hive.run(cluster, target_full_hql, user_name, params)
273
+ response = Hive.run(cluster, target_full_hql, user_name, run_params)
271
274
 
272
275
  raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
273
276
 
@@ -319,7 +322,7 @@ module Mobilize
319
322
  part_set_hql = "set hive.cli.print.header=true;set mapred.job.name=#{job_name} (permutations);"
320
323
  part_select_hql = "select distinct #{target_part_stmt} from #{temp_table_path};"
321
324
  part_perm_hql = part_set_hql + part_select_hql
322
- response = Hive.run(cluster, part_perm_hql, user_name, params)
325
+ response = Hive.run(cluster, part_perm_hql, user_name, run_params)
323
326
  raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
324
327
  part_perm_tsv = response['stdout']
325
328
  #having gotten the permutations, ensure they are dropped
@@ -332,7 +335,7 @@ module Mobilize
332
335
 
333
336
  part_drop_hql = part_hash_array.map do |h|
334
337
  part_drop_stmt = h.map do |name,value|
335
- part_defs[name[1..-2]]=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
338
+ part_defs[name[1..-2]].downcase=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
336
339
  end.join(",")
337
340
  "use #{db};alter table #{table} drop if exists partition (#{part_drop_stmt});"
338
341
  end.join
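
A minimal sketch of the partition-drop HQL assembled just above, with made-up values; it assumes `part_defs` is keyed by the bare column name while the permutation hash keys keep Hive's backtick quoting, which is why `name[1..-2]` trims one character from each end. The `.downcase` added in 1.36 means a schema datatype written as "STRING" still gets its value quoted.

``` ruby
# Illustrative only: quote values for string-typed partitions, leave others bare,
# then emit one "alter table ... drop partition" per permutation.
db, table = "mobilize", "hive3"
part_defs = { "date" => "STRING", "product" => "int" }  # datatypes from the schema sheet

part_hash_array = [{ "`date`" => "2013-01-01", "`product`" => "7" }]

part_drop_hql = part_hash_array.map do |h|
  part_drop_stmt = h.map do |name, value|
    part_defs[name[1..-2]].downcase == "string" ? "#{name}='#{value}'" : "#{name}=#{value}"
  end.join(",")
  "use #{db};alter table #{table} drop if exists partition (#{part_drop_stmt});"
end.join

puts part_drop_hql
# use mobilize;alter table hive3 drop if exists partition (`date`='2013-01-01',`product`=7);
```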
@@ -345,7 +348,7 @@ module Mobilize
345
348
 
346
349
  target_full_hql = [target_set_hql, target_create_hql, target_insert_hql, temp_drop_hql].join
347
350
 
348
- response = Hive.run(cluster, target_full_hql, user_name, params)
351
+ response = Hive.run(cluster, target_full_hql, user_name, run_params)
349
352
  raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
350
353
  else
351
354
  error_msg = "Incompatible partition specs"
@@ -500,7 +503,7 @@ module Mobilize
500
503
  job_name = s.path.sub("Runner_","")
501
504
 
502
505
  schema_hash = if params['schema']
503
- Hive.schema_hash(params['schema'],user_name,gdrive_slot)
506
+ Hive.schema_hash(params['schema'],stage_path,user_name,gdrive_slot)
504
507
  else
505
508
  {}
506
509
  end
@@ -543,7 +546,8 @@ module Mobilize
543
546
  result = begin
544
547
  url = if source_hql
545
548
  #include any params (or nil) at the end
546
- Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,params['params'])
549
+ run_params = params['params']
550
+ Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,run_params)
547
551
  elsif source_tsv
548
552
  Hive.tsv_to_table(cluster, db, table, part_array, source_tsv, user_name, drop, schema_hash)
549
553
  elsif source
@@ -26,6 +26,10 @@ module Mobilize
26
26
  (1..self.clusters[cluster]['max_slots']).to_a.map{|s| "#{cluster}_#{s.to_s}"}
27
27
  end
28
28
 
29
+ def self.prepends
30
+ self.config['prepends']
31
+ end
32
+
29
33
  def self.slot_worker_by_cluster_and_path(cluster,path)
30
34
  working_slots = Mobilize::Resque.jobs.map{|j| begin j['args'][1]['hive_slot'];rescue;nil;end}.compact.uniq
31
35
  self.slot_ids(cluster).each do |slot_id|
@@ -1,3 +1,4 @@
1
+ require 'yaml'
1
2
  namespace :mobilize_hive do
2
3
  desc "Set up config and log folders and files"
3
4
  task :setup do
@@ -1,5 +1,5 @@
1
1
  module Mobilize
2
2
  module Hive
3
- VERSION = "1.35"
3
+ VERSION = "1.36"
4
4
  end
5
5
  end
data/lib/mobilize-hive.rb CHANGED
@@ -3,6 +3,9 @@ require "mobilize-hdfs"
3
3
 
4
4
  module Mobilize
5
5
  module Hive
6
+ def Hive.home_dir
7
+ File.expand_path('..',File.dirname(__FILE__))
8
+ end
6
9
  end
7
10
  end
8
11
  require "mobilize-hive/handlers/hive"
data/lib/samples/hive.yml CHANGED
@@ -1,17 +1,23 @@
1
1
  ---
2
2
  development:
3
+ prepends:
4
+ - "hive.stats.autogather=false"
3
5
  clusters:
4
6
  dev_cluster:
5
7
  max_slots: 5
6
8
  temp_table_db: mobilize
7
9
  exec_path: /path/to/hive
8
10
  test:
11
+ prepends:
12
+ - "hive.stats.autogather=false"
9
13
  clusters:
10
14
  test_cluster:
11
15
  max_slots: 5
12
16
  temp_table_db: mobilize
13
17
  exec_path: /path/to/hive
14
18
  production:
19
+ prepends:
20
+ - "hive.stats.autogather=false"
15
21
  clusters:
16
22
  prod_cluster:
17
23
  max_slots: 5
@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
16
16
  gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
17
17
  gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
18
18
  gem.require_paths = ["lib"]
19
- gem.add_runtime_dependency "mobilize-hdfs","1.35"
19
+ gem.add_runtime_dependency "mobilize-hdfs","1.36"
20
20
  end
File without changes
File without changes
@@ -0,0 +1 @@
1
+ select act_date,product, sum(value) as sum from mobilize.hive_test_1 group by act_date,product;
@@ -0,0 +1 @@
1
+
@@ -0,0 +1,4 @@
1
+ - act_date: ""
2
+ product: ""
3
+ category: ""
4
+ value: ""
@@ -0,0 +1,69 @@
1
+ ---
2
+ - path: "Runner_mobilize(test)/jobs"
3
+ state: working
4
+ count: 1
5
+ confirmed_ats: []
6
+ - path: "Runner_mobilize(test)/jobs/hive1/stage1"
7
+ state: working
8
+ count: 1
9
+ confirmed_ats: []
10
+ - path: "Runner_mobilize(test)/jobs/hive1/stage2"
11
+ state: working
12
+ count: 1
13
+ confirmed_ats: []
14
+ - path: "Runner_mobilize(test)/jobs/hive1/stage3"
15
+ state: working
16
+ count: 1
17
+ confirmed_ats: []
18
+ - path: "Runner_mobilize(test)/jobs/hive1/stage4"
19
+ state: working
20
+ count: 1
21
+ confirmed_ats: []
22
+ - path: "Runner_mobilize(test)/jobs/hive1/stage5"
23
+ state: working
24
+ count: 1
25
+ confirmed_ats: []
26
+ - path: "Runner_mobilize(test)/jobs/hive2/stage1"
27
+ state: working
28
+ count: 1
29
+ confirmed_ats: []
30
+ - path: "Runner_mobilize(test)/jobs/hive2/stage2"
31
+ state: working
32
+ count: 1
33
+ confirmed_ats: []
34
+ - path: "Runner_mobilize(test)/jobs/hive2/stage3"
35
+ state: working
36
+ count: 1
37
+ confirmed_ats: []
38
+ - path: "Runner_mobilize(test)/jobs/hive3/stage1"
39
+ state: working
40
+ count: 1
41
+ confirmed_ats: []
42
+ - path: "Runner_mobilize(test)/jobs/hive3/stage2"
43
+ state: working
44
+ count: 1
45
+ confirmed_ats: []
46
+ - path: "Runner_mobilize(test)/jobs/hive3/stage3"
47
+ state: working
48
+ count: 1
49
+ confirmed_ats: []
50
+ - path: "Runner_mobilize(test)/jobs/hive3/stage4"
51
+ state: working
52
+ count: 1
53
+ confirmed_ats: []
54
+ - path: "Runner_mobilize(test)/jobs/hive4/stage1"
55
+ state: working
56
+ count: 1
57
+ confirmed_ats: []
58
+ - path: "Runner_mobilize(test)/jobs/hive4/stage2"
59
+ state: working
60
+ count: 1
61
+ confirmed_ats: []
62
+ - path: "Runner_mobilize(test)/jobs/hive4/stage3"
63
+ state: working
64
+ count: 1
65
+ confirmed_ats: []
66
+ - path: "Runner_mobilize(test)/jobs/hive4/stage4"
67
+ state: working
68
+ count: 1
69
+ confirmed_ats: []
@@ -0,0 +1,34 @@
1
+ ---
2
+ - name: hive1
3
+ active: true
4
+ trigger: once
5
+ status: ""
6
+ stage1: hive.write target:"mobilize/hive1", partitions:"act_date", drop:true,
7
+ source:"Runner_mobilize(test)/hive1.in", schema:"hive1.schema"
8
+ stage2: hive.run source:"hive1.sql"
9
+ stage3: hive.run hql:"show databases;"
10
+ stage4: gsheet.write source:"stage2", target:"hive1_stage2.out"
11
+ stage5: gsheet.write source:"stage3", target:"hive1_stage3.out"
12
+ - name: hive2
13
+ active: true
14
+ trigger: after hive1
15
+ status: ""
16
+ stage1: hive.write source:"hdfs://user/mobilize/test/hdfs1.out", target:"mobilize.hive2", drop:true
17
+ stage2: hive.run hql:"select * from mobilize.hive2;"
18
+ stage3: gsheet.write source:"stage2", target:"hive2.out"
19
+ - name: hive3
20
+ active: true
21
+ trigger: after hive2
22
+ status: ""
23
+ stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive1;", params:{'date':'2013-01-01'}
24
+ stage2: hive.write source:"stage1",target:"mobilize/hive3", partitions:"date/product", drop:true
25
+ stage3: hive.write hql:"select * from mobilize.hive3;",target:"mobilize/hive3", partitions:"date/product", drop:false
26
+ stage4: gsheet.write source:"hive://mobilize/hive3", target:"hive3.out"
27
+ - name: hive4
28
+ active: true
29
+ trigger: after hive3
30
+ status: ""
31
+ stage1: hive.write source:"hive4_stage1.in", target:"mobilize/hive1", partitions:"act_date"
32
+ stage2: hive.write source:"hive4_stage2.in", target:"mobilize/hive1", partitions:"act_date"
33
+ stage3: hive.run hql:"select '@date $utc_time' as `date_time`,product,category,value from mobilize.hive1;", params:{'date':'$utc_date'}
34
+ stage4: gsheet.write source:stage3, target:"hive4.out"
@@ -0,0 +1,43 @@
1
+ require 'test_helper'
2
+ describe "Mobilize" do
3
+ # enqueues 4 workers on Resque
4
+ it "runs integration test" do
5
+
6
+ puts "restart workers"
7
+ Mobilize::Jobtracker.restart_workers!
8
+
9
+ u = TestHelper.owner_user
10
+ r = u.runner
11
+ user_name = u.name
12
+ gdrive_slot = u.email
13
+
14
+ puts "add test data"
15
+ ["hive1.in","hive4_stage1.in","hive4_stage2.in","hive1.schema","hive1.sql"].each do |fixture_name|
16
+ target_url = "gsheet://#{r.title}/#{fixture_name}"
17
+ TestHelper.write_fixture(fixture_name, target_url, 'replace')
18
+ end
19
+
20
+ puts "add/update jobs"
21
+ u.jobs.each{|j| j.delete}
22
+ jobs_fixture_name = "integration_jobs"
23
+ jobs_target_url = "gsheet://#{r.title}/jobs"
24
+ TestHelper.write_fixture(jobs_fixture_name, jobs_target_url, 'update')
25
+
26
+ puts "job rows added, force enqueue runner, wait for stages"
27
+ #wait for stages to complete
28
+ expected_fixture_name = "integration_expected"
29
+ Mobilize::Jobtracker.stop!
30
+ r.enqueue!
31
+ TestHelper.confirm_expected_jobs(expected_fixture_name,2100)
32
+
33
+ puts "update job status and activity"
34
+ r.update_gsheet(gdrive_slot)
35
+
36
+ puts "check posted data"
37
+ assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage2.out", 'min_length' => 219) == true
38
+ assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage3.out", 'min_length' => 3) == true
39
+ assert TestHelper.check_output("gsheet://#{r.title}/hive2.out", 'min_length' => 599) == true
40
+ assert TestHelper.check_output("gsheet://#{r.title}/hive3.out", 'min_length' => 347) == true
41
+ assert TestHelper.check_output("gsheet://#{r.title}/hive4.out", 'min_length' => 432) == true
42
+ end
43
+ end
data/test/test_helper.rb CHANGED
@@ -8,3 +8,4 @@ $dir = File.dirname(File.expand_path(__FILE__))
8
8
  ENV['MOBILIZE_ENV'] = 'test'
9
9
  require 'mobilize-hive'
10
10
  $TESTING = true
11
+ require "#{Mobilize::Hdfs.home_dir}/test/test_helper"
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: mobilize-hive
3
3
  version: !ruby/object:Gem::Version
4
- version: '1.35'
4
+ version: '1.36'
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2013-04-23 00:00:00.000000000 Z
12
+ date: 2013-05-21 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: mobilize-hdfs
@@ -18,7 +18,7 @@ dependencies:
18
18
  requirements:
19
19
  - - '='
20
20
  - !ruby/object:Gem::Version
21
- version: '1.35'
21
+ version: '1.36'
22
22
  type: :runtime
23
23
  prerelease: false
24
24
  version_requirements: !ruby/object:Gem::Requirement
@@ -26,7 +26,7 @@ dependencies:
26
26
  requirements:
27
27
  - - '='
28
28
  - !ruby/object:Gem::Version
29
- version: '1.35'
29
+ version: '1.36'
30
30
  description: Adds hive read, write, and run support to mobilize-hdfs
31
31
  email:
32
32
  - cpaesleme@dena.com
@@ -46,11 +46,15 @@ files:
46
46
  - lib/mobilize-hive/version.rb
47
47
  - lib/samples/hive.yml
48
48
  - mobilize-hive.gemspec
49
- - test/hive_job_rows.yml
50
- - test/hive_test_1.hql
51
- - test/hive_test_1_in.yml
52
- - test/hive_test_1_schema.yml
53
- - test/mobilize-hive_test.rb
49
+ - test/fixtures/hive1.hql
50
+ - test/fixtures/hive1.in.yml
51
+ - test/fixtures/hive1.schema.yml
52
+ - test/fixtures/hive1.sql
53
+ - test/fixtures/hive4_stage1.in
54
+ - test/fixtures/hive4_stage2.in.yml
55
+ - test/fixtures/integration_expected.yml
56
+ - test/fixtures/integration_jobs.yml
57
+ - test/integration/mobilize-hive_test.rb
54
58
  - test/redis-test.conf
55
59
  - test/test_helper.rb
56
60
  homepage: http://github.com/dena/mobilize-hive
@@ -67,7 +71,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
67
71
  version: '0'
68
72
  segments:
69
73
  - 0
70
- hash: -2302417197076465358
74
+ hash: 837156919845089008
71
75
  required_rubygems_version: !ruby/object:Gem::Requirement
72
76
  none: false
73
77
  requirements:
@@ -76,7 +80,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
76
80
  version: '0'
77
81
  segments:
78
82
  - 0
79
- hash: -2302417197076465358
83
+ hash: 837156919845089008
80
84
  requirements: []
81
85
  rubyforge_project:
82
86
  rubygems_version: 1.8.25
@@ -84,10 +88,14 @@ signing_key:
84
88
  specification_version: 3
85
89
  summary: Adds hive read, write, and run support to mobilize-hdfs
86
90
  test_files:
87
- - test/hive_job_rows.yml
88
- - test/hive_test_1.hql
89
- - test/hive_test_1_in.yml
90
- - test/hive_test_1_schema.yml
91
- - test/mobilize-hive_test.rb
91
+ - test/fixtures/hive1.hql
92
+ - test/fixtures/hive1.in.yml
93
+ - test/fixtures/hive1.schema.yml
94
+ - test/fixtures/hive1.sql
95
+ - test/fixtures/hive4_stage1.in
96
+ - test/fixtures/hive4_stage2.in.yml
97
+ - test/fixtures/integration_expected.yml
98
+ - test/fixtures/integration_jobs.yml
99
+ - test/integration/mobilize-hive_test.rb
92
100
  - test/redis-test.conf
93
101
  - test/test_helper.rb
@@ -1,34 +0,0 @@
1
- ---
2
- - name: hive_test_1
3
- active: true
4
- trigger: once
5
- status: ""
6
- stage1: hive.write target:"mobilize/hive_test_1", partitions:"act_date", drop:true,
7
- source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema"
8
- stage2: hive.run source:"hive_test_1.hql"
9
- stage3: hive.run hql:"show databases;"
10
- stage4: gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"
11
- stage5: gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"
12
- - name: hive_test_2
13
- active: true
14
- trigger: after hive_test_1
15
- status: ""
16
- stage1: hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true
17
- stage2: hive.run hql:"select * from mobilize.hive_test_2;"
18
- stage3: gsheet.write source:"stage2", target:"hive_test_2.out"
19
- - name: hive_test_3
20
- active: true
21
- trigger: after hive_test_2
22
- status: ""
23
- stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive_test_1;", params:{'date':'2013-01-01'}
24
- stage2: hive.write source:"stage1",target:"mobilize/hive_test_3", partitions:"date/product", drop:true
25
- stage3: hive.write hql:"select * from mobilize.hive_test_3;",target:"mobilize/hive_test_3", partitions:"date/product", drop:false
26
- stage4: gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"
27
- - name: hive_test_4
28
- active: true
29
- trigger: after hive_test_3
30
- status: ""
31
- stage1: hive.write source:"hive_test_4_stage_1.in", target:"mobilize/hive_test_1", partitions:"act_date"
32
- stage2: hive.write source:"hive_test_4_stage_2.in", target:"mobilize/hive_test_1", partitions:"act_date"
33
- stage3: hive.run hql:"select '$utc_date $utc_time' as `date_time`,product,category,value from mobilize.hive_test_1;"
34
- stage4: gsheet.write source:stage3, target:"hive_test_4.out"
@@ -1,112 +0,0 @@
1
- require 'test_helper'
2
-
3
- describe "Mobilize" do
4
-
5
- def before
6
- puts 'nothing before'
7
- end
8
-
9
- # enqueues 4 workers on Resque
10
- it "runs integration test" do
11
-
12
- puts "restart workers"
13
- Mobilize::Jobtracker.restart_workers!
14
-
15
- gdrive_slot = Mobilize::Gdrive.owner_email
16
- puts "create user 'mobilize'"
17
- user_name = gdrive_slot.split("@").first
18
- u = Mobilize::User.where(:name=>user_name).first
19
- r = u.runner
20
-
21
- puts "add test_source data"
22
- hive_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
23
- [hive_1_in_sheet].each {|s| s.delete if s}
24
- hive_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
25
- hive_1_in_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_in.yml").hash_array_to_tsv
26
- hive_1_in_sheet.write(hive_1_in_tsv,Mobilize::Gdrive.owner_name)
27
-
28
- #create blank sheet
29
- hive_4_stage_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_1.in",gdrive_slot)
30
- [hive_4_stage_1_in_sheet].each {|s| s.delete if s}
31
- hive_4_stage_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_1.in",gdrive_slot)
32
-
33
- #create sheet w just headers
34
- hive_4_stage_2_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_2.in",gdrive_slot)
35
- [hive_4_stage_2_in_sheet].each {|s| s.delete if s}
36
- hive_4_stage_2_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_2.in",gdrive_slot)
37
- hive_4_stage_2_in_sheet_header = hive_1_in_tsv.tsv_header_array.join("\t")
38
- hive_4_stage_2_in_sheet.write(hive_4_stage_2_in_sheet_header,Mobilize::Gdrive.owner_name)
39
-
40
- hive_1_schema_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
41
- [hive_1_schema_sheet].each {|s| s.delete if s}
42
- hive_1_schema_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
43
- hive_1_schema_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_schema.yml").hash_array_to_tsv
44
- hive_1_schema_sheet.write(hive_1_schema_tsv,Mobilize::Gdrive.owner_name)
45
-
46
- hive_1_hql_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
47
- [hive_1_hql_sheet].each {|s| s.delete if s}
48
- hive_1_hql_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
49
- hive_1_hql_tsv = File.open("#{Mobilize::Base.root}/test/hive_test_1.hql").read
50
- hive_1_hql_sheet.write(hive_1_hql_tsv,Mobilize::Gdrive.owner_name)
51
-
52
- jobs_sheet = r.gsheet(gdrive_slot)
53
-
54
- test_job_rows = ::YAML.load_file("#{Mobilize::Base.root}/test/hive_job_rows.yml")
55
- test_job_rows.map{|j| r.jobs(j['name'])}.each{|j| j.delete if j}
56
- jobs_sheet.add_or_update_rows(test_job_rows)
57
-
58
- hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
59
- [hive_1_stage_2_target_sheet].each{|s| s.delete if s}
60
- hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
61
- [hive_1_stage_3_target_sheet].each{|s| s.delete if s}
62
- hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
63
- [hive_2_target_sheet].each{|s| s.delete if s}
64
- hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
65
- [hive_3_target_sheet].each{|s| s.delete if s}
66
- hive_4_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4.out",gdrive_slot)
67
- [hive_4_target_sheet].each{|s| s.delete if s}
68
-
69
- puts "job row added, force enqueued requestor, wait for stages"
70
- r.enqueue!
71
- wait_for_stages(2100)
72
-
73
- puts "jobtracker posted data to test sheet"
74
- hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
75
- hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
76
- hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
77
- hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
78
- hive_4_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4.out",gdrive_slot)
79
-
80
- assert hive_1_stage_2_target_sheet.read(u.name).length == 219
81
- assert hive_1_stage_3_target_sheet.read(u.name).length > 3
82
- assert hive_2_target_sheet.read(u.name).length == 599
83
- assert hive_3_target_sheet.read(u.name).length == 347
84
- assert hive_4_target_sheet.read(u.name).length == 432
85
- end
86
-
87
- def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
88
- time = 0
89
- time_since_stage = 0
90
- #check for 10 min
91
- while time < time_limit and time_since_stage < stage_limit
92
- sleep wait_length
93
- job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
94
- if job_classes.include?("Mobilize::Stage")
95
- time_since_stage = 0
96
- puts "saw stage at #{time.to_s} seconds"
97
- else
98
- time_since_stage += wait_length
99
- puts "#{time_since_stage.to_s} seconds since stage seen"
100
- end
101
- time += wait_length
102
- puts "total wait time #{time.to_s} seconds"
103
- end
104
-
105
- if time >= time_limit
106
- raise "Timed out before stage completion"
107
- end
108
- end
109
-
110
-
111
-
112
- end