mobilize-hive 1.36 → 1.291

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: 1eb5a243ff499f31c06f0c1e09abba3fafee0b31
+ data.tar.gz: d44471ad7d8fcc72ac8562ebfb529d31d839d195
+ SHA512:
+ metadata.gz: a27bde80d634f949cbf7f82b0ceaad4e087e3fe1bee6fe4631aa8205b68a7897e9d0f8e8ec49700a37cc3bd54e266df06a30e2187e13b6f9a1357d1d270af54c
+ data.tar.gz: 3763262ee0ac27778cc2abb6342b20d9fb1a1c6c546ddd87f4ff73969a8054314480387c230aa232e89fad7e2df16f597f2fa50344430343a7cdde9aa03d0d79
data/README.md CHANGED
@@ -1,4 +1,243 @@
- Mobilize
- ========
+ Mobilize-Hive
+ ===============

- Please refer to the mobilize-server wiki: https://github.com/DeNA/mobilize-server/wiki
+ Mobilize-Hive adds the power of Hive to [mobilize-hdfs][mobilize-hdfs].
+ * read, write, and copy Hive tables through Google Spreadsheets.
+
+ Table Of Contents
+ -----------------
+ * [Overview](#section_Overview)
+ * [Install](#section_Install)
+   * [Mobilize-Hive](#section_Install_Mobilize-Hive)
+   * [Install Dirs and Files](#section_Install_Dirs_and_Files)
+ * [Configure](#section_Configure)
+   * [Hive](#section_Configure_Hive)
+ * [Start](#section_Start)
+   * [Create Job](#section_Start_Create_Job)
+   * [Run Test](#section_Start_Run_Test)
+ * [Meta](#section_Meta)
+ * [Special Thanks](#section_Special_Thanks)
+ * [Author](#section_Author)
+
+ <a name='section_Overview'></a>
+ Overview
+ -----------
+
+ * Mobilize-hive adds Hive methods to mobilize-hdfs.
+
+ <a name='section_Install'></a>
+ Install
+ ------------
+
+ Make sure you go through all the steps in the
+ [mobilize-base][mobilize-base],
+ [mobilize-ssh][mobilize-ssh], and
+ [mobilize-hdfs][mobilize-hdfs]
+ install sections first.
+
+ <a name='section_Install_Mobilize-Hive'></a>
+ ### Mobilize-Hive
+
+ add this to your Gemfile:
+
+ ``` ruby
+ gem "mobilize-hive"
+ ```
+
+ or do
+
+     $ gem install mobilize-hive
+
+ for a ruby-wide install.
+
+ <a name='section_Install_Dirs_and_Files'></a>
+ ### Dirs and Files
+
+ ### Rakefile
+
+ Inside the Rakefile in your project's root dir, make sure you have:
+
+ ``` ruby
+ require 'mobilize-base/tasks'
+ require 'mobilize-ssh/tasks'
+ require 'mobilize-hdfs/tasks'
+ require 'mobilize-hive/tasks'
+ ```
+
+ This defines rake tasks essential to run the environment.
+
+ ### Config Dir
+
+ run
+
+     $ rake mobilize_hive:setup
+
+ This will copy over a sample hive.yml to your config dir.
+
+ <a name='section_Configure'></a>
+ Configure
+ ------------
+
+ <a name='section_Configure_Hive'></a>
+ ### Configure Hive
+
+ * Hive deals with big data, so we need to be careful when reading from
+   the cluster: results could easily fill up our mongodb instance, RAM, local disk
+   space, etc.
+ * To avoid this, all hive operations, stage outputs, etc. are
+   executed and stored on the cluster only.
+ * The exceptions are:
+   * writing to the cluster from an external source, such as a google
+     sheet. There is no risk here, as the external source has much stricter
+     size limits than hive.
+   * reading from the cluster, such as for posting to a google sheet. In
+     this case, the read_limit parameter dictates the maximum amount that can
+     be read. If the data is bigger than the read limit, an exception will be
+     raised.
+
+ The Hive configuration consists of:
+ * clusters - this defines aliases for clusters, which are used as
+   parameters for Hive stages. They should have the same names as those
+   in hadoop.yml. Each cluster has:
+   * max_slots - defines the total number of simultaneous slots to be
+     used for hive jobs on this cluster
+   * output_db - defines the db which should be used to hold stage outputs.
+     * This db must have open permissions (777) so any user on the system can
+       write to it -- the tables inside will be owned by the users themselves.
+   * exec_path - defines the path to the hive executable
+
+ Sample hive.yml:
+
+ ``` yml
+ ---
+ development:
+   clusters:
+     dev_cluster:
+       max_slots: 5
+       output_db: mobilize
+       exec_path: /path/to/hive
+ test:
+   clusters:
+     test_cluster:
+       max_slots: 5
+       output_db: mobilize
+       exec_path: /path/to/hive
+ production:
+   clusters:
+     prod_cluster:
+       max_slots: 5
+       output_db: mobilize
+       exec_path: /path/to/hive
+ ```
+
+ <a name='section_Start'></a>
+ Start
+ -----
+
+ <a name='section_Start_Create_Job'></a>
+ ### Create Job
+
+ * For mobilize-hive, the following stages are available (a sample jobs-sheet entry follows this list).
+   * cluster and user are optional for all of the below.
+     * cluster defaults to the first cluster listed;
+     * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
+   * hive.run `hql:<hql> || source:<gsheet_path>, user:<user>, cluster:<cluster>`, which executes the
+     script in the hql or source sheet and returns any output specified at the
+     end. If the hql or last query in the source is a select statement, column headers will be
+     returned as well.
+   * hive.write `hql:<hql> || source:<source_path>, target:<hive_path>, partitions:<partition_path>, user:<user>, cluster:<cluster>, schema:<gsheet_path>, drop:<true/false>`,
+     which writes the source or query result to the selected hive table.
+     * hive_path
+       * should be of the form `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
+     * source:
+       * can be a gsheet_path, hdfs_path, or hive_path (no partitions)
+       * for gsheet and hdfs paths,
+         * if the file ends in .*ql, it's treated the same as passing hql
+         * otherwise it is treated as a tsv with the first row as column headers
+     * target:
+       * Should be a hive_path, as in `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
+     * partitions:
+       * Due to a Hive limitation, partition names CANNOT be reserved keywords when writing from a tsv (gsheet or hdfs source).
+       * Partitions should be specified as a path, as in partitions:`<partition1>/<partition2>`.
+     * schema:
+       * optional. gsheet_path to a column schema.
+       * two columns: name, datatype
+       * Any columns not defined here will receive "string" as the datatype.
+       * Partitions can have their datatypes overridden here as well.
+       * Columns named here that are not in the dataset will be ignored.
+     * drop:
+       * optional. drops the target table before performing the write
+       * defaults to false
+
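+ As a sketch of how these stages chain together in a jobs sheet (the rows below are adapted from this gem's own test fixture, test/hive_job_rows.yml, so the sheet, table, and file names are purely illustrative):
+
+ ``` yml
+ # write a gsheet tsv into a partitioned hive table, run an hql against it,
+ # then post the stage output back to a gsheet
+ - name: hive_test_1
+   active: true
+   trigger: once
+   status: ""
+   stage1: hive.write target:"mobilize/hive_test_1", partitions:"act_date", drop:true,
+     source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema"
+   stage2: hive.run source:"hive_test_1.hql"
+   stage3: gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"
+ ```
+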
+ <a name='section_Start_Run_Test'></a>
+ ### Run Test
+
+ To run tests, you will need to:
+
+ 1) go through the [mobilize-base][mobilize-base], [mobilize-ssh][mobilize-ssh], and [mobilize-hdfs][mobilize-hdfs] tests first
+
+ 2) clone the mobilize-hive repository
+
+ From the project folder, run
+
+ 3) $ rake mobilize_hive:setup
+
+ Copy over the config files from the mobilize-base, mobilize-ssh, and
+ mobilize-hdfs projects into the config dir, and populate the values in the hive.yml file.
+
+ Make sure you use the same names for your hive clusters as you do in
+ hadoop.yml.
+
+ 4) $ rake test
+
+ * The test runs these jobs:
+   * hive_test_1:
+     * `hive.write target:"mobilize/hive_test_1/act_date",source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema", drop:true`
+     * `hive.run source:"hive_test_1.hql"`
+     * `hive.run hql:"show databases"`
+     * `gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"`
+     * `gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"`
+     * hive_test_1.hql runs a select statement on the table created in the
+       write command.
+     * at the end of the test, there should be two sheets, one with a
+       sum of the data as in your write query, one with the results of the show
+       databases command.
+   * hive_test_2:
+     * `hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true`
+     * `hive.run hql:"select * from mobilize.hive_test_2"`
+     * `gsheet.write source:"stage2", target:"hive_test_2.out"`
+     * this test uses the output from the first hdfs test as an input, so make sure you've run that first.
+   * hive_test_3:
+     * `hive.write source:"hive://mobilize.hive_test_1",target:"mobilize/hive_test_3/date/product",drop:true`
+     * `` hive.run hql:"select act_date as `date`,product,category,value from mobilize.hive_test_1;" ``
+     * `hive.write source:"stage2",target:"mobilize/hive_test_3/date/product", drop:false`
+     * `gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"`
+
+
+ <a name='section_Meta'></a>
+ Meta
+ ----
+
+ * Code: `git clone git://github.com/dena/mobilize-hive.git`
+ * Home: <https://github.com/dena/mobilize-hive>
+ * Bugs: <https://github.com/dena/mobilize-hive/issues>
+ * Gems: <http://rubygems.org/gems/mobilize-hive>
+
+ <a name='section_Special_Thanks'></a>
+ Special Thanks
+ --------------
+ * This release goes to Toby Negrin, who championed this project with
+   DeNA and gave me the support to get it properly architected, tested, and documented.
+ * Also many thanks to the Analytics team at DeNA who build and maintain
+   our Big Data infrastructure.
+
+ <a name='section_Author'></a>
+ Author
+ ------
+
+ Cassio Paes-Leme :: cpaesleme@dena.com :: @cpaesleme
+
+ [mobilize-base]: https://github.com/dena/mobilize-base
+ [mobilize-ssh]: https://github.com/dena/mobilize-ssh
+ [mobilize-hdfs]: https://github.com/dena/mobilize-hdfs
data/lib/mobilize-hive.rb CHANGED
@@ -3,9 +3,6 @@ require "mobilize-hdfs"

  module Mobilize
    module Hive
-     def Hive.home_dir
-       File.expand_path('..',File.dirname(__FILE__))
-     end
    end
  end
  require "mobilize-hive/handlers/hive"
@@ -1,7 +1,56 @@
1
1
  module Mobilize
2
2
  module Hive
3
- #adds convenience methods
4
- require "#{File.dirname(__FILE__)}/../helpers/hive_helper"
3
+ def Hive.config
4
+ Base.config('hive')
5
+ end
6
+
7
+ def Hive.exec_path(cluster)
8
+ Hive.clusters[cluster]['exec_path']
9
+ end
10
+
11
+ def Hive.output_db(cluster)
12
+ Hive.clusters[cluster]['output_db']
13
+ end
14
+
15
+ def Hive.output_db_user(cluster)
16
+ output_db_node = Hadoop.gateway_node(cluster)
17
+ output_db_user = Ssh.host(output_db_node)['user']
18
+ output_db_user
19
+ end
20
+
21
+ def Hive.clusters
22
+ Hive.config['clusters']
23
+ end
24
+
25
+ def Hive.slot_ids(cluster)
26
+ (1..Hive.clusters[cluster]['max_slots']).to_a.map{|s| "#{cluster}_#{s.to_s}"}
27
+ end
28
+
29
+ def Hive.slot_worker_by_cluster_and_path(cluster,path)
30
+ working_slots = Mobilize::Resque.jobs.map{|j| begin j['args'][1]['hive_slot'];rescue;nil;end}.compact.uniq
31
+ Hive.slot_ids(cluster).each do |slot_id|
32
+ unless working_slots.include?(slot_id)
33
+ Mobilize::Resque.set_worker_args_by_path(path,{'hive_slot'=>slot_id})
34
+ return slot_id
35
+ end
36
+ end
37
+ #return false if none are available
38
+ return false
39
+ end
40
+
41
+ def Hive.unslot_worker_by_path(path)
42
+ begin
43
+ Mobilize::Resque.set_worker_args_by_path(path,{'hive_slot'=>nil})
44
+ return true
45
+ rescue
46
+ return false
47
+ end
48
+ end
49
+
50
+ def Hive.databases(cluster,user_name)
51
+ Hive.run(cluster,"show databases",user_name)['stdout'].split("\n")
52
+ end
53
+
5
54
  # converts a source path or target path to a dst in the context of handler and stage
6
55
  def Hive.path_to_dst(path,stage_path,gdrive_slot)
7
56
  has_handler = true if path.index("://")
@@ -93,32 +142,12 @@ module Mobilize
93
142
  end
94
143
 
95
144
  #run a generic hive command, with the option of passing a file hash to be locally available
96
- def Hive.run(cluster,hql,user_name,params=nil,file_hash=nil)
97
- preps = Hive.prepends.map do |p|
98
- prefix = "set "
99
- suffix = ";"
100
- prep_out = p
101
- prep_out = "#{prefix}#{prep_out}" unless prep_out.starts_with?(prefix)
102
- prep_out = "#{prep_out}#{suffix}" unless prep_out.ends_with?(suffix)
103
- prep_out
104
- end.join
105
- hql = "#{preps}#{hql}"
145
+ def Hive.run(cluster,hql,user_name,file_hash=nil)
146
+ # no TempStatsStore
147
+ hql = "set hive.stats.autogather=false;#{hql}"
106
148
  filename = hql.to_md5
107
149
  file_hash||= {}
108
150
  file_hash[filename] = hql
109
- params ||= {}
110
- #replace any params in the file_hash and command
111
- params.each do |k,v|
112
- file_hash.each do |name,data|
113
- data.gsub!("@#{k}",v)
114
- end
115
- end
116
- #add in default params
117
- Hive.default_params.each do |k,v|
118
- file_hash.each do |name,data|
119
- data.gsub!(k,v)
120
- end
121
- end
122
151
  #silent mode so we don't have logs in stderr; clip output
123
152
  #at hadoop read limit
124
153
  command = "#{Hive.exec_path(cluster)} -S -f #{filename} | head -c #{Hadoop.read_limit}"
@@ -162,9 +191,8 @@ module Mobilize
162
191
  Gdrive.unslot_worker_by_path(stage_path)
163
192
 
164
193
  #check for select at end
165
- hql_array = hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
166
- last_statement = hql_array.last
167
- if last_statement.to_s.downcase.starts_with?("select")
194
+ hql_array = hql.split(";").map{|hc| hc.strip}.reject{|hc| hc.length==0}
195
+ if hql_array.last.downcase.starts_with?("select")
168
196
  #nil if no prior commands
169
197
  prior_hql = hql_array[0..-2].join(";") if hql_array.length > 1
170
198
  select_hql = hql_array.last
@@ -172,10 +200,10 @@ module Mobilize
172
200
  "drop table if exists #{output_path}",
173
201
  "create table #{output_path} as #{select_hql};"].join(";")
174
202
  full_hql = [prior_hql, output_table_hql].compact.join(";")
175
- result = Hive.run(cluster,full_hql, user_name,params['params'])
203
+ result = Hive.run(cluster,full_hql, user_name)
176
204
  Dataset.find_or_create_by_url(out_url)
177
205
  else
178
- result = Hive.run(cluster, hql, user_name,params['params'])
206
+ result = Hive.run(cluster, hql, user_name)
179
207
  Dataset.find_or_create_by_url(out_url)
180
208
  Dataset.write_by_url(out_url,result['stdout'],user_name) if result['stdout'].to_s.length>0
181
209
  end
@@ -188,37 +216,40 @@ module Mobilize
188
216
  response
189
217
  end
190
218
 
191
- def Hive.schema_hash(schema_path,stage_path,user_name,gdrive_slot)
192
- handler = if schema_path.index("://")
193
- schema_path.split("://").first
194
- else
195
- "gsheet"
196
- end
197
- dst = "Mobilize::#{handler.downcase.capitalize}".constantize.path_to_dst(schema_path,stage_path,gdrive_slot)
198
- out_raw = dst.read(user_name,gdrive_slot)
199
- #determine the datatype for schema; accept json, yaml, tsv
200
- if schema_path.ends_with?(".yml")
201
- out_ha = begin;YAML.load(out_raw);rescue ScriptError, StandardError;nil;end if out_ha.nil?
219
+ def Hive.schema_hash(schema_path,user_name,gdrive_slot)
220
+ if schema_path.index("/")
221
+ #slashes mean sheets
222
+ out_tsv = Gsheet.find_by_path(schema_path,gdrive_slot).read(user_name)
202
223
  else
203
- out_ha = begin;JSON.parse(out_raw);rescue ScriptError, StandardError;nil;end
204
- out_ha = out_raw.tsv_to_hash_array if out_ha.nil?
224
+ u = User.where(:name=>user_name).first
225
+ #check sheets in runner
226
+ r = u.runner
227
+ runner_sheet = r.gbook(gdrive_slot).worksheet_by_title(schema_path)
228
+ out_tsv = if runner_sheet
229
+ runner_sheet.read(user_name)
230
+ else
231
+ #check for gfile. will fail if there isn't one.
232
+ Gfile.find_by_path(schema_path).read(user_name)
233
+ end
205
234
  end
235
+ #use Gridfs to cache gdrive results
236
+ file_name = schema_path.split("/").last
237
+ out_url = "gridfs://#{schema_path}/#{file_name}"
238
+ Dataset.write_by_url(out_url,out_tsv,user_name)
239
+ schema_tsv = Dataset.find_by_url(out_url).read(user_name,gdrive_slot)
206
240
  schema_hash = {}
207
- out_ha.each do |hash|
208
- schema_hash[hash['name']] = hash['datatype']
241
+ schema_tsv.tsv_to_hash_array.each do |ha|
242
+ schema_hash[ha['name']] = ha['datatype']
209
243
  end
210
244
  schema_hash
211
245
  end
212
246
 
213
- def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil, run_params=nil)
247
+ def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil)
214
248
  table_path = [db,table].join(".")
215
249
  table_stats = Hive.table_stats(cluster, db, table, user_name)
216
- url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
217
250
 
218
- #decomment hql
219
-
220
- source_hql_array = source_hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
221
- last_select_i = source_hql_array.rindex{|s| s.downcase.starts_with?("select")}
251
+ source_hql_array = source_hql.split(";")
252
+ last_select_i = source_hql_array.rindex{|hql| hql.downcase.strip.starts_with?("select")}
222
253
  #find the last select query -- it should be used for the temp table creation
223
254
  last_select_hql = (source_hql_array[last_select_i..-1].join(";")+";")
224
255
  #if there is anything prior to the last select, add it in prior to table creation
@@ -231,8 +262,7 @@ module Mobilize
231
262
  temp_set_hql = "set mapred.job.name=#{job_name} (temp table);"
232
263
  temp_drop_hql = "drop table if exists #{temp_table_path};"
233
264
  temp_create_hql = "#{temp_set_hql}#{prior_hql}#{temp_drop_hql}create table #{temp_table_path} as #{last_select_hql}"
234
- response = Hive.run(cluster,temp_create_hql,user_name,run_params)
235
- raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
265
+ Hive.run(cluster,temp_create_hql,user_name)
236
266
 
237
267
  source_table_stats = Hive.table_stats(cluster,temp_db,temp_table_name,user_name)
238
268
  source_fields = source_table_stats['field_defs']
@@ -270,12 +300,10 @@ module Mobilize
270
300
  target_insert_hql,
271
301
  temp_drop_hql].join
272
302
 
273
- response = Hive.run(cluster, target_full_hql, user_name, run_params)
274
-
275
- raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
303
+ Hive.run(cluster, target_full_hql, user_name)
276
304
 
277
305
  elsif part_array.length > 0 and
278
- table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']}.sort == part_array.sort}
306
+ table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']} == part_array}
279
307
  #partitions and no target table or same partitions in both target table and user params
280
308
 
281
309
  target_headers = source_fields.map{|f| f['name']}.reject{|h| part_array.include?(h)}
@@ -306,7 +334,7 @@ module Mobilize
306
334
 
307
335
  target_set_hql = ["set mapred.job.name=#{job_name};",
308
336
  "set hive.exec.dynamic.partition.mode=nonstrict;",
309
- "set hive.exec.max.dynamic.partitions.pernode=10000;",
337
+ "set hive.exec.max.dynamic.partitions.pernode=1000;",
310
338
  "set hive.exec.dynamic.partition=true;",
311
339
  "set hive.exec.max.created.files = 200000;",
312
340
  "set hive.max.created.files = 200000;"].join
@@ -322,20 +350,12 @@ module Mobilize
322
350
  part_set_hql = "set hive.cli.print.header=true;set mapred.job.name=#{job_name} (permutations);"
323
351
  part_select_hql = "select distinct #{target_part_stmt} from #{temp_table_path};"
324
352
  part_perm_hql = part_set_hql + part_select_hql
325
- response = Hive.run(cluster, part_perm_hql, user_name, run_params)
326
- raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
327
- part_perm_tsv = response['stdout']
353
+ part_perm_tsv = Hive.run(cluster, part_perm_hql, user_name)['stdout']
328
354
  #having gotten the permutations, ensure they are dropped
329
355
  part_hash_array = part_perm_tsv.tsv_to_hash_array
330
- #make sure there is data
331
- if part_hash_array.first.nil? or part_hash_array.first.values.include?(nil)
332
- #blank result set, return url
333
- return url
334
- end
335
-
336
356
  part_drop_hql = part_hash_array.map do |h|
337
357
  part_drop_stmt = h.map do |name,value|
338
- part_defs[name[1..-2]].downcase=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
358
+ part_defs[name[1..-2]]=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
339
359
  end.join(",")
340
360
  "use #{db};alter table #{table} drop if exists partition (#{part_drop_stmt});"
341
361
  end.join
@@ -348,12 +368,12 @@ module Mobilize
348
368
 
349
369
  target_full_hql = [target_set_hql, target_create_hql, target_insert_hql, temp_drop_hql].join
350
370
 
351
- response = Hive.run(cluster, target_full_hql, user_name, run_params)
352
- raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
371
+ Hive.run(cluster, target_full_hql, user_name)
353
372
  else
354
373
  error_msg = "Incompatible partition specs"
355
374
  raise error_msg
356
375
  end
376
+ url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
357
377
  return url
358
378
  end
359
379
 
@@ -361,12 +381,6 @@ module Mobilize
361
381
  #Accepts options to drop existing target if any
362
382
  #also schema with column datatype overrides
363
383
  def Hive.tsv_to_table(cluster, db, table, part_array, source_tsv, user_name, drop=false, schema_hash=nil)
364
- return nil if source_tsv.strip.length==0
365
- if source_tsv.index("\r\n")
366
- source_tsv = source_tsv.gsub("\r\n","\n")
367
- elsif source_tsv.index("\r")
368
- source_tsv = source_tsv.gsub("\r","\n")
369
- end
370
384
  source_headers = source_tsv.tsv_header_array
371
385
 
372
386
  table_path = [db,table].join(".")
@@ -374,8 +388,6 @@ module Mobilize
374
388
 
375
389
  schema_hash ||= {}
376
390
 
377
- url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
378
-
379
391
  if part_array.length == 0 and
380
392
  table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].nil?}
381
393
  #no partitions in either user params or the target table
@@ -402,11 +414,10 @@ module Mobilize
402
414
 
403
415
  target_full_hql = [target_drop_hql,target_create_hql,target_insert_hql].join(";")
404
416
 
405
- response = Hive.run(cluster, target_full_hql, user_name, nil, file_hash)
406
- raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
417
+ Hive.run(cluster, target_full_hql, user_name, file_hash)
407
418
 
408
419
  elsif part_array.length > 0 and
409
- table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']}.sort == part_array.sort}
420
+ table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']} == part_array}
410
421
  #partitions and no target table
411
422
  #or same partitions in both target table and user params
412
423
  #or drop and start fresh
@@ -430,17 +441,13 @@ module Mobilize
430
441
  "partitioned by #{partition_defs}"
431
442
 
432
443
  #create target table early if not here
433
- response = Hive.run(cluster, target_create_hql, user_name)
434
- raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
435
-
436
- #return url (operation complete) if there's no data
437
- source_hash_array = source_tsv.tsv_to_hash_array
438
- return url if source_hash_array.length==1 and source_hash_array.first.values.compact.length==0
444
+ Hive.run(cluster, target_create_hql, user_name)
439
445
 
440
446
  table_stats = Hive.table_stats(cluster, db, table, user_name)
441
447
 
442
448
  #create data hash from source hash array
443
449
  data_hash = {}
450
+ source_hash_array = source_tsv.tsv_to_hash_array
444
451
  source_hash_array.each do |ha|
445
452
  tpmk = part_array.map{|pn| "#{pn}=#{ha[pn]}"}.join("/")
446
453
  tpmv = ha.reject{|k,v| part_array.include?(k)}.values.join("\001")
@@ -473,8 +480,7 @@ module Mobilize
473
480
  #run actual partition adds all at once
474
481
  if target_part_hql.length>0
475
482
  puts "Adding partitions to #{cluster}/#{db}/#{table} for #{user_name} at #{Time.now.utc}"
476
- response = Hive.run(cluster, target_part_hql, user_name)
477
- raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
483
+ Hive.run(cluster, target_part_hql, user_name)
478
484
  end
479
485
  else
480
486
  error_msg = "Incompatible partition specs: " +
@@ -482,7 +488,7 @@ module Mobilize
482
488
  "user_params:#{part_array.to_s}"
483
489
  raise error_msg
484
490
  end
485
-
491
+ url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
486
492
  return url
487
493
  end
488
494
 
@@ -503,7 +509,7 @@ module Mobilize
503
509
  job_name = s.path.sub("Runner_","")
504
510
 
505
511
  schema_hash = if params['schema']
506
- Hive.schema_hash(params['schema'],stage_path,user_name,gdrive_slot)
512
+ Hive.schema_hash(params['schema'],user_name,gdrive_slot)
507
513
  else
508
514
  {}
509
515
  end
@@ -519,11 +525,11 @@ module Mobilize
519
525
  #source table
520
526
  cluster,source_path = source.path.split("/").ie{|sp| [sp.first, sp[1..-1].join(".")]}
521
527
  source_hql = "select * from #{source_path};"
522
- else
528
+ elsif ['gsheet','gridfs','hdfs'].include?(source.handler)
523
529
  if source.path.ie{|sdp| sdp.index(/\.[A-Za-z]ql$/) or sdp.ends_with?(".ql")}
524
530
  source_hql = source.read(user_name,gdrive_slot)
525
531
  else
526
- #tsv from sheet or file
532
+ #tsv from sheet
527
533
  source_tsv = source.read(user_name,gdrive_slot)
528
534
  end
529
535
  end
@@ -545,13 +551,9 @@ module Mobilize
545
551
 
546
552
  result = begin
547
553
  url = if source_hql
548
- #include any params (or nil) at the end
549
- run_params = params['params']
550
- Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,run_params)
554
+ Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash)
551
555
  elsif source_tsv
552
556
  Hive.tsv_to_table(cluster, db, table, part_array, source_tsv, user_name, drop, schema_hash)
553
- elsif source
554
- #null sheet
555
557
  else
556
558
  raise "Unable to determine source tsv or source hql"
557
559
  end
@@ -578,8 +580,11 @@ module Mobilize
578
580
  select_hql = "select * from #{source_path};"
579
581
  hql = [set_hql,select_hql].join
580
582
  response = Hive.run(cluster, hql,user_name)
581
- raise "Unable to read hive://#{dst_path} with error: #{response['stderr']}" if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
582
- return response['stdout']
583
+ if response['exit_code']==0
584
+ return response['stdout']
585
+ else
586
+ raise "Unable to read hive://#{dst_path} with error: #{response['stderr']}"
587
+ end
583
588
  end
584
589
 
585
590
  def Hive.write_by_dataset_path(dst_path,source_tsv,user_name,*args)
@@ -1,4 +1,3 @@
- require 'yaml'
  namespace :mobilize_hive do
    desc "Set up config and log folders and files"
    task :setup do
@@ -1,5 +1,5 @@
  module Mobilize
    module Hive
-     VERSION = "1.36"
+     VERSION = "1.291"
    end
  end
data/lib/samples/hive.yml CHANGED
@@ -1,23 +1,17 @@
  ---
  development:
-   prepends:
-   - "hive.stats.autogather=false"
    clusters:
      dev_cluster:
        max_slots: 5
        temp_table_db: mobilize
        exec_path: /path/to/hive
  test:
-   prepends:
-   - "hive.stats.autogather=false"
    clusters:
      test_cluster:
        max_slots: 5
        temp_table_db: mobilize
        exec_path: /path/to/hive
  production:
-   prepends:
-   - "hive.stats.autogather=false"
    clusters:
      prod_cluster:
        max_slots: 5
@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
    gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
    gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
    gem.require_paths = ["lib"]
-   gem.add_runtime_dependency "mobilize-hdfs","1.36"
+   gem.add_runtime_dependency "mobilize-hdfs","1.291"
  end
@@ -0,0 +1,26 @@
+ ---
+ - name: hive_test_1
+   active: true
+   trigger: once
+   status: ""
+   stage1: hive.write target:"mobilize/hive_test_1", partitions:"act_date", drop:true,
+     source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema"
+   stage2: hive.run source:"hive_test_1.hql"
+   stage3: hive.run hql:"show databases;"
+   stage4: gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"
+   stage5: gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"
+ - name: hive_test_2
+   active: true
+   trigger: after hive_test_1
+   status: ""
+   stage1: hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true
+   stage2: hive.run hql:"select * from mobilize.hive_test_2;"
+   stage3: gsheet.write source:"stage2", target:"hive_test_2.out"
+ - name: hive_test_3
+   active: true
+   trigger: after hive_test_2
+   status: ""
+   stage1: hive.run hql:"select act_date as `date`,product,category,value from mobilize.hive_test_1;"
+   stage2: hive.write source:"stage1",target:"mobilize/hive_test_3", partitions:"date/product", drop:true
+   stage3: hive.write hql:"select * from mobilize.hive_test_3;",target:"mobilize/hive_test_3", partitions:"date/product", drop:false
+   stage4: gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"
File without changes
File without changes
@@ -0,0 +1,96 @@
1
+ require 'test_helper'
2
+
3
+ describe "Mobilize" do
4
+
5
+ def before
6
+ puts 'nothing before'
7
+ end
8
+
9
+ # enqueues 4 workers on Resque
10
+ it "runs integration test" do
11
+
12
+ puts "restart workers"
13
+ Mobilize::Jobtracker.restart_workers!
14
+
15
+ gdrive_slot = Mobilize::Gdrive.owner_email
16
+ puts "create user 'mobilize'"
17
+ user_name = gdrive_slot.split("@").first
18
+ u = Mobilize::User.where(:name=>user_name).first
19
+ r = u.runner
20
+
21
+ puts "add test_source data"
22
+ hive_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
23
+ [hive_1_in_sheet].each {|s| s.delete if s}
24
+ hive_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
25
+ hive_1_in_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_in.yml").hash_array_to_tsv
26
+ hive_1_in_sheet.write(hive_1_in_tsv,Mobilize::Gdrive.owner_name)
27
+
28
+ hive_1_schema_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
29
+ [hive_1_schema_sheet].each {|s| s.delete if s}
30
+ hive_1_schema_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
31
+ hive_1_schema_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_schema.yml").hash_array_to_tsv
32
+ hive_1_schema_sheet.write(hive_1_schema_tsv,Mobilize::Gdrive.owner_name)
33
+
34
+ hive_1_hql_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
35
+ [hive_1_hql_sheet].each {|s| s.delete if s}
36
+ hive_1_hql_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
37
+ hive_1_hql_tsv = File.open("#{Mobilize::Base.root}/test/hive_test_1.hql").read
38
+ hive_1_hql_sheet.write(hive_1_hql_tsv,Mobilize::Gdrive.owner_name)
39
+
40
+ jobs_sheet = r.gsheet(gdrive_slot)
41
+
42
+ test_job_rows = ::YAML.load_file("#{Mobilize::Base.root}/test/hive_job_rows.yml")
43
+ test_job_rows.map{|j| r.jobs(j['name'])}.each{|j| j.delete if j}
44
+ jobs_sheet.add_or_update_rows(test_job_rows)
45
+
46
+ hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
47
+ [hive_1_stage_2_target_sheet].each{|s| s.delete if s}
48
+ hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
49
+ [hive_1_stage_3_target_sheet].each{|s| s.delete if s}
50
+ hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
51
+ [hive_2_target_sheet].each{|s| s.delete if s}
52
+ hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
53
+ [hive_3_target_sheet].each{|s| s.delete if s}
54
+
55
+ puts "job row added, force enqueued requestor, wait for stages"
56
+ r.enqueue!
57
+ wait_for_stages(1200)
58
+
59
+ puts "jobtracker posted data to test sheet"
60
+ hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
61
+ hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
62
+ hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
63
+ hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
64
+
65
+ assert hive_1_stage_2_target_sheet.read(u.name).length == 219
66
+ assert hive_1_stage_3_target_sheet.read(u.name).length > 3
67
+ assert hive_2_target_sheet.read(u.name).length == 599
68
+ assert hive_3_target_sheet.read(u.name).length == 347
69
+ end
70
+
71
+ def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
72
+ time = 0
73
+ time_since_stage = 0
74
+ #check for 10 min
75
+ while time < time_limit and time_since_stage < stage_limit
76
+ sleep wait_length
77
+ job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
78
+ if job_classes.include?("Mobilize::Stage")
79
+ time_since_stage = 0
80
+ puts "saw stage at #{time.to_s} seconds"
81
+ else
82
+ time_since_stage += wait_length
83
+ puts "#{time_since_stage.to_s} seconds since stage seen"
84
+ end
85
+ time += wait_length
86
+ puts "total wait time #{time.to_s} seconds"
87
+ end
88
+
89
+ if time >= time_limit
90
+ raise "Timed out before stage completion"
91
+ end
92
+ end
93
+
94
+
95
+
96
+ end
data/test/test_helper.rb CHANGED
@@ -8,4 +8,3 @@ $dir = File.dirname(File.expand_path(__FILE__))
  ENV['MOBILIZE_ENV'] = 'test'
  require 'mobilize-hive'
  $TESTING = true
- require "#{Mobilize::Hdfs.home_dir}/test/test_helper"
metadata CHANGED
@@ -1,32 +1,29 @@
  --- !ruby/object:Gem::Specification
  name: mobilize-hive
  version: !ruby/object:Gem::Version
- version: '1.36'
- prerelease:
+ version: '1.291'
  platform: ruby
  authors:
  - Cassio Paes-Leme
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-05-21 00:00:00.000000000 Z
+ date: 2013-03-27 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mobilize-hdfs
    requirement: !ruby/object:Gem::Requirement
-     none: false
      requirements:
      - - '='
        - !ruby/object:Gem::Version
-         version: '1.36'
+         version: '1.291'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
-     none: false
      requirements:
      - - '='
        - !ruby/object:Gem::Version
-         version: '1.36'
+         version: '1.291'
  description: Adds hive read, write, and run support to mobilize-hdfs
  email:
  - cpaesleme@dena.com
@@ -41,61 +38,45 @@ files:
  - Rakefile
  - lib/mobilize-hive.rb
  - lib/mobilize-hive/handlers/hive.rb
- - lib/mobilize-hive/helpers/hive_helper.rb
  - lib/mobilize-hive/tasks.rb
  - lib/mobilize-hive/version.rb
  - lib/samples/hive.yml
  - mobilize-hive.gemspec
- - test/fixtures/hive1.hql
- - test/fixtures/hive1.in.yml
- - test/fixtures/hive1.schema.yml
- - test/fixtures/hive1.sql
- - test/fixtures/hive4_stage1.in
- - test/fixtures/hive4_stage2.in.yml
- - test/fixtures/integration_expected.yml
- - test/fixtures/integration_jobs.yml
- - test/integration/mobilize-hive_test.rb
+ - test/hive_job_rows.yml
+ - test/hive_test_1.hql
+ - test/hive_test_1_in.yml
+ - test/hive_test_1_schema.yml
+ - test/mobilize-hive_test.rb
  - test/redis-test.conf
  - test/test_helper.rb
  homepage: http://github.com/dena/mobilize-hive
  licenses: []
+ metadata: {}
  post_install_message:
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - '>='
      - !ruby/object:Gem::Version
        version: '0'
-   segments:
-   - 0
-   hash: 837156919845089008
  required_rubygems_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - '>='
      - !ruby/object:Gem::Version
        version: '0'
-   segments:
-   - 0
-   hash: 837156919845089008
  requirements: []
  rubyforge_project:
- rubygems_version: 1.8.25
+ rubygems_version: 2.0.3
  signing_key:
- specification_version: 3
+ specification_version: 4
  summary: Adds hive read, write, and run support to mobilize-hdfs
  test_files:
- - test/fixtures/hive1.hql
- - test/fixtures/hive1.in.yml
- - test/fixtures/hive1.schema.yml
- - test/fixtures/hive1.sql
- - test/fixtures/hive4_stage1.in
- - test/fixtures/hive4_stage2.in.yml
- - test/fixtures/integration_expected.yml
- - test/fixtures/integration_jobs.yml
- - test/integration/mobilize-hive_test.rb
+ - test/hive_job_rows.yml
+ - test/hive_test_1.hql
+ - test/hive_test_1_in.yml
+ - test/hive_test_1_schema.yml
+ - test/mobilize-hive_test.rb
  - test/redis-test.conf
  - test/test_helper.rb
@@ -1,67 +0,0 @@
1
- module Mobilize
2
- module Hive
3
- def self.config
4
- Base.config('hive')
5
- end
6
-
7
- def self.exec_path(cluster)
8
- self.clusters[cluster]['exec_path']
9
- end
10
-
11
- def self.output_db(cluster)
12
- self.clusters[cluster]['output_db']
13
- end
14
-
15
- def self.output_db_user(cluster)
16
- output_db_node = Hadoop.gateway_node(cluster)
17
- output_db_user = Ssh.host(output_db_node)['user']
18
- output_db_user
19
- end
20
-
21
- def self.clusters
22
- self.config['clusters']
23
- end
24
-
25
- def self.slot_ids(cluster)
26
- (1..self.clusters[cluster]['max_slots']).to_a.map{|s| "#{cluster}_#{s.to_s}"}
27
- end
28
-
29
- def self.prepends
30
- self.config['prepends']
31
- end
32
-
33
- def self.slot_worker_by_cluster_and_path(cluster,path)
34
- working_slots = Mobilize::Resque.jobs.map{|j| begin j['args'][1]['hive_slot'];rescue;nil;end}.compact.uniq
35
- self.slot_ids(cluster).each do |slot_id|
36
- unless working_slots.include?(slot_id)
37
- Mobilize::Resque.set_worker_args_by_path(path,{'hive_slot'=>slot_id})
38
- return slot_id
39
- end
40
- end
41
- #return false if none are available
42
- return false
43
- end
44
-
45
- def self.unslot_worker_by_path(path)
46
- begin
47
- Mobilize::Resque.set_worker_args_by_path(path,{'hive_slot'=>nil})
48
- return true
49
- rescue
50
- return false
51
- end
52
- end
53
-
54
- def self.databases(cluster,user_name)
55
- self.run(cluster,"show databases",user_name)['stdout'].split("\n")
56
- end
57
-
58
- def self.default_params
59
- time = Time.now.utc
60
- {
61
- '$utc_date'=>time.strftime("%Y-%m-%d"),
62
- '$utc_time'=>time.strftime("%H:%M"),
63
- }
64
- end
65
- end
66
- end
67
-
@@ -1 +0,0 @@
1
- select act_date,product, sum(value) as sum from mobilize.hive_test_1 group by act_date,product;
@@ -1 +0,0 @@
1
-
@@ -1,4 +0,0 @@
1
- - act_date: ""
2
- product: ""
3
- category: ""
4
- value: ""
@@ -1,69 +0,0 @@
1
- ---
2
- - path: "Runner_mobilize(test)/jobs"
3
- state: working
4
- count: 1
5
- confirmed_ats: []
6
- - path: "Runner_mobilize(test)/jobs/hive1/stage1"
7
- state: working
8
- count: 1
9
- confirmed_ats: []
10
- - path: "Runner_mobilize(test)/jobs/hive1/stage2"
11
- state: working
12
- count: 1
13
- confirmed_ats: []
14
- - path: "Runner_mobilize(test)/jobs/hive1/stage3"
15
- state: working
16
- count: 1
17
- confirmed_ats: []
18
- - path: "Runner_mobilize(test)/jobs/hive1/stage4"
19
- state: working
20
- count: 1
21
- confirmed_ats: []
22
- - path: "Runner_mobilize(test)/jobs/hive1/stage5"
23
- state: working
24
- count: 1
25
- confirmed_ats: []
26
- - path: "Runner_mobilize(test)/jobs/hive2/stage1"
27
- state: working
28
- count: 1
29
- confirmed_ats: []
30
- - path: "Runner_mobilize(test)/jobs/hive2/stage2"
31
- state: working
32
- count: 1
33
- confirmed_ats: []
34
- - path: "Runner_mobilize(test)/jobs/hive2/stage3"
35
- state: working
36
- count: 1
37
- confirmed_ats: []
38
- - path: "Runner_mobilize(test)/jobs/hive3/stage1"
39
- state: working
40
- count: 1
41
- confirmed_ats: []
42
- - path: "Runner_mobilize(test)/jobs/hive3/stage2"
43
- state: working
44
- count: 1
45
- confirmed_ats: []
46
- - path: "Runner_mobilize(test)/jobs/hive3/stage3"
47
- state: working
48
- count: 1
49
- confirmed_ats: []
50
- - path: "Runner_mobilize(test)/jobs/hive3/stage4"
51
- state: working
52
- count: 1
53
- confirmed_ats: []
54
- - path: "Runner_mobilize(test)/jobs/hive4/stage1"
55
- state: working
56
- count: 1
57
- confirmed_ats: []
58
- - path: "Runner_mobilize(test)/jobs/hive4/stage2"
59
- state: working
60
- count: 1
61
- confirmed_ats: []
62
- - path: "Runner_mobilize(test)/jobs/hive4/stage3"
63
- state: working
64
- count: 1
65
- confirmed_ats: []
66
- - path: "Runner_mobilize(test)/jobs/hive4/stage4"
67
- state: working
68
- count: 1
69
- confirmed_ats: []
@@ -1,34 +0,0 @@
1
- ---
2
- - name: hive1
3
- active: true
4
- trigger: once
5
- status: ""
6
- stage1: hive.write target:"mobilize/hive1", partitions:"act_date", drop:true,
7
- source:"Runner_mobilize(test)/hive1.in", schema:"hive1.schema"
8
- stage2: hive.run source:"hive1.sql"
9
- stage3: hive.run hql:"show databases;"
10
- stage4: gsheet.write source:"stage2", target:"hive1_stage2.out"
11
- stage5: gsheet.write source:"stage3", target:"hive1_stage3.out"
12
- - name: hive2
13
- active: true
14
- trigger: after hive1
15
- status: ""
16
- stage1: hive.write source:"hdfs://user/mobilize/test/hdfs1.out", target:"mobilize.hive2", drop:true
17
- stage2: hive.run hql:"select * from mobilize.hive2;"
18
- stage3: gsheet.write source:"stage2", target:"hive2.out"
19
- - name: hive3
20
- active: true
21
- trigger: after hive2
22
- status: ""
23
- stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive1;", params:{'date':'2013-01-01'}
24
- stage2: hive.write source:"stage1",target:"mobilize/hive3", partitions:"date/product", drop:true
25
- stage3: hive.write hql:"select * from mobilize.hive3;",target:"mobilize/hive3", partitions:"date/product", drop:false
26
- stage4: gsheet.write source:"hive://mobilize/hive3", target:"hive3.out"
27
- - name: hive4
28
- active: true
29
- trigger: after hive3
30
- status: ""
31
- stage1: hive.write source:"hive4_stage1.in", target:"mobilize/hive1", partitions:"act_date"
32
- stage2: hive.write source:"hive4_stage2.in", target:"mobilize/hive1", partitions:"act_date"
33
- stage3: hive.run hql:"select '@date $utc_time' as `date_time`,product,category,value from mobilize.hive1;", params:{'date':'$utc_date'}
34
- stage4: gsheet.write source:stage3, target:"hive4.out"
@@ -1,43 +0,0 @@
1
- require 'test_helper'
2
- describe "Mobilize" do
3
- # enqueues 4 workers on Resque
4
- it "runs integration test" do
5
-
6
- puts "restart workers"
7
- Mobilize::Jobtracker.restart_workers!
8
-
9
- u = TestHelper.owner_user
10
- r = u.runner
11
- user_name = u.name
12
- gdrive_slot = u.email
13
-
14
- puts "add test data"
15
- ["hive1.in","hive4_stage1.in","hive4_stage2.in","hive1.schema","hive1.sql"].each do |fixture_name|
16
- target_url = "gsheet://#{r.title}/#{fixture_name}"
17
- TestHelper.write_fixture(fixture_name, target_url, 'replace')
18
- end
19
-
20
- puts "add/update jobs"
21
- u.jobs.each{|j| j.delete}
22
- jobs_fixture_name = "integration_jobs"
23
- jobs_target_url = "gsheet://#{r.title}/jobs"
24
- TestHelper.write_fixture(jobs_fixture_name, jobs_target_url, 'update')
25
-
26
- puts "job rows added, force enqueue runner, wait for stages"
27
- #wait for stages to complete
28
- expected_fixture_name = "integration_expected"
29
- Mobilize::Jobtracker.stop!
30
- r.enqueue!
31
- TestHelper.confirm_expected_jobs(expected_fixture_name,2100)
32
-
33
- puts "update job status and activity"
34
- r.update_gsheet(gdrive_slot)
35
-
36
- puts "check posted data"
37
- assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage2.out", 'min_length' => 219) == true
38
- assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage3.out", 'min_length' => 3) == true
39
- assert TestHelper.check_output("gsheet://#{r.title}/hive2.out", 'min_length' => 599) == true
40
- assert TestHelper.check_output("gsheet://#{r.title}/hive3.out", 'min_length' => 347) == true
41
- assert TestHelper.check_output("gsheet://#{r.title}/hive4.out", 'min_length' => 432) == true
42
- end
43
- end