mobilize-hive 1.354 → 1.361
- data/README.md +3 -253
- data/lib/mobilize-hive.rb +3 -0
- data/lib/mobilize-hive/handlers/hive.rb +67 -145
- data/lib/mobilize-hive/helpers/hive_helper.rb +4 -0
- data/lib/mobilize-hive/tasks.rb +3 -2
- data/lib/mobilize-hive/version.rb +1 -1
- data/lib/samples/hive.yml +6 -0
- data/mobilize-hive.gemspec +1 -1
- data/test/{hive_test_1.hql → fixtures/hive1.hql} +0 -0
- data/test/{hive_test_1_in.yml → fixtures/hive1.in.yml} +0 -0
- data/test/{hive_test_1_schema.yml → fixtures/hive1.schema.yml} +0 -0
- data/test/fixtures/hive1.sql +1 -0
- data/test/fixtures/hive4_stage1.in +1 -0
- data/test/fixtures/hive4_stage2.in.yml +4 -0
- data/test/fixtures/integration_expected.yml +69 -0
- data/test/fixtures/integration_jobs.yml +34 -0
- data/test/integration/mobilize-hive_test.rb +43 -0
- data/test/test_helper.rb +1 -0
- metadata +24 -16
- data/test/hive_job_rows.yml +0 -34
- data/test/mobilize-hive_test.rb +0 -112
data/README.md CHANGED
@@ -1,254 +1,4 @@
-Mobilize
-
+Mobilize
+========
 
-
-* read, write, and copy hive files through Google Spreadsheets.
-
-Table Of Contents
------------------
-* [Overview](#section_Overview)
-* [Install](#section_Install)
-  * [Mobilize-Hive](#section_Install_Mobilize-Hive)
-  * [Install Dirs and Files](#section_Install_Dirs_and_Files)
-* [Configure](#section_Configure)
-  * [Hive](#section_Configure_Hive)
-* [Start](#section_Start)
-  * [Create Job](#section_Start_Create_Job)
-  * [Run Test](#section_Start_Run_Test)
-* [Meta](#section_Meta)
-* [Special Thanks](#section_Special_Thanks)
-* [Author](#section_Author)
-
-<a name='section_Overview'></a>
-Overview
------------
-
-* Mobilize-hive adds Hive methods to mobilize-hdfs.
-
-<a name='section_Install'></a>
-Install
-------------
-
-Make sure you go through all the steps in the
-[mobilize-base][mobilize-base],
-[mobilize-ssh][mobilize-ssh],
-[mobilize-hdfs][mobilize-hdfs],
-install sections first.
-
-<a name='section_Install_Mobilize-Hive'></a>
-### Mobilize-Hive
-
-add this to your Gemfile:
-
-``` ruby
-gem "mobilize-hive"
-```
-
-or do
-
-  $ gem install mobilize-hive
-
-for a ruby-wide install.
-
-<a name='section_Install_Dirs_and_Files'></a>
-### Dirs and Files
-
-### Rakefile
-
-Inside the Rakefile in your project's root dir, make sure you have:
-
-``` ruby
-require 'mobilize-base/tasks'
-require 'mobilize-ssh/tasks'
-require 'mobilize-hdfs/tasks'
-require 'mobilize-hive/tasks'
-```
-
-This defines rake tasks essential to run the environment.
-
-### Config Dir
-
-run
-
-  $ rake mobilize_hive:setup
-
-This will copy over a sample hive.yml to your config dir.
-
-<a name='section_Configure'></a>
-Configure
-------------
-
-<a name='section_Configure_Hive'></a>
-### Configure Hive
-
-* Hive is big data. That means we need to be careful when reading from
-the cluster as it could easily fill up our mongodb instance, RAM, local disk
-space, etc.
-* To achieve this, all hive operations, stage outputs, etc. are
-executed and stored on the cluster only.
-* The exceptions are:
-  * writing to the cluster from an external source, such as a google
-sheet. Here there
-is no risk as the external source has much more strict size limits than
-hive.
-  * reading from the cluster, such as for posting to google sheet. In
-this case, the read_limit parameter dictates the maximum amount that can
-be read. If the data is bigger than the read limit, an exception will be
-raised.
-
-The Hive configuration consists of:
-* clusters - this defines aliases for clusters, which are used as
-parameters for Hive stages. They should have the same name as those
-in hadoop.yml. Each cluster has:
-  * max_slots - defines the total number of simultaneous slots to be
-used for hive jobs on this cluster
-  * output_db - defines the db which should be used to hold stage outputs.
-    * This db must have open permissions (777) so any user on the system can
-write to it -- the tables inside will be owned by the users themselves.
-  * exec_path - defines the path to the hive executable
-
-Sample hive.yml:
-
-``` yml
----
-development:
-  clusters:
-    dev_cluster:
-      max_slots: 5
-      output_db: mobilize
-      exec_path: /path/to/hive
-test:
-  clusters:
-    test_cluster:
-      max_slots: 5
-      output_db: mobilize
-      exec_path: /path/to/hive
-production:
-  clusters:
-    prod_cluster:
-      max_slots: 5
-      output_db: mobilize
-      exec_path: /path/to/hive
-```
-
-<a name='section_Start'></a>
-Start
------
-
-<a name='section_Start_Create_Job'></a>
-### Create Job
-
-* For mobilize-hive, the following stages are available.
-  * cluster and user are optional for all of the below.
-    * cluster defaults to the first cluster listed;
-    * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
-  * params are also optional for all of the below. They replace HQL in sources.
-    * params are passed as a YML or JSON, as in:
-      * `hive.run source:<source_path>, params:{'date': '2013-03-01', 'unit': 'widgets'}`
-      * this example replaces all the keys, preceded by '@' in all source hqls with the value.
-        * The preceding '@' is used to keep from replacing instances
-of "date" and "unit" in the HQL; you should have `@date` and `@unit` in your actual HQL
-if you'd like to replace those tokens.
-      * in addition, the following params are substituted automatically:
-        * `$utc_date` - replaced with YYYY-MM-DD date, UTC
-        * `$utc_time` - replaced with HH:MM time, UTC
-        * any occurrence of these values in HQL will be replaced at runtime.
-  * hive.run `hql:<hql> || source:<gsheet_path>, user:<user>, cluster:<cluster>`, which executes the
-script in the hql or source sheet and returns any output specified at the
-end. If the cmd or last query in source is a select statement, column headers will be
-returned as well.
-  * hive.write `hql:<hql> || source:<source_path>, target:<hive_path>, partitions:<partition_path>, user:<user>, cluster:<cluster>, schema:<gsheet_path>, drop:<true/false>`,
-which writes the source or query result to the selected hive table.
-    * hive_path
-      * should be of the form `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
-    * source:
-      * can be a gsheet_path, hdfs_path, or hive_path (no partitions)
-      * for gsheet and hdfs path,
-        * if the file ends in .*ql, it's treated the same as passing hql
-        * otherwise it is treated as a tsv with the first row as column headers
-    * target:
-      * Should be a hive_path, as in `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
-    * partitions:
-      * Due to Hive limitation, partition names CANNOT be reserved keywords when writing from tsv (gsheet or hdfs source)
-      * Partitions should be specified as a path, as in partitions:`<partition1>/<partition2>`.
-    * schema:
-      * optional. gsheet_path to column schema.
-      * two columns: name, datatype
-      * Any columns not defined here will receive "string" as the datatype
-      * partitions can have their datatypes overridden here as well
-      * columns named here that are not in the dataset will be ignored
-    * drop:
-      * optional. drops the target table before performing write
-      * defaults to false
-
-<a name='section_Start_Run_Test'></a>
-### Run Test
-
-To run tests, you will need to
-
-1) go through [mobilize-base][mobilize-base], [mobilize-ssh][mobilize-ssh], [mobilize-hdfs][mobilize-hdfs] tests first
-
-2) clone the mobilize-hive repository
-
-From the project folder, run
-
-3) $ rake mobilize_hive:setup
-
-Copy over the config files from the mobilize-base, mobilize-ssh,
-mobilize-hdfs projects into the config dir, and populate the values in the hive.yml file.
-
-Make sure you use the same names for your hive clusters as you do in
-hadoop.yml.
-
-3) $ rake test
-
-* The test runs these jobs:
-  * hive_test_1:
-    * `hive.write target:"mobilize/hive_test_1/act_date",source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema", drop:true`
-    * `hive.run source:"hive_test_1.hql"`
-    * `hive.run cmd:"show databases"`
-    * `gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"`
-    * `gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"`
-    * hive_test_1.hql runs a select statement on the table created in the
-write command.
-    * at the end of the test, there should be two sheets, one with a
-sum of the data as in your write query, one with the results of the show
-databases command.
-  * hive_test_2:
-    * `hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true`
-    * `hive.run cmd:"select * from mobilize.hive_test_2"`
-    * `gsheet.write source:"stage2", target:"hive_test_2.out"`
-    * this test uses the output from the first hdfs test as an input, so make sure you've run that first.
-  * hive_test_3:
-    * `hive.write source:"hive://mobilize.hive_test_1",target:"mobilize/hive_test_3/date/product",drop:true`
-    * `hive.run hql:"select act_date as ```date```,product,category,value from mobilize.hive_test_1;"`
-    * `hive.write source:"stage2",target:"mobilize/hive_test_3/date/product", drop:false`
-    * `gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"`
-
-
-<a name='section_Meta'></a>
-Meta
-----
-
-* Code: `git clone git://github.com/dena/mobilize-hive.git`
-* Home: <https://github.com/dena/mobilize-hive>
-* Bugs: <https://github.com/dena/mobilize-hive/issues>
-* Gems: <http://rubygems.org/gems/mobilize-hive>
-
-<a name='section_Special_Thanks'></a>
-Special Thanks
---------------
-* This release goes to Toby Negrin, who championed this project with
-DeNA and gave me the support to get it properly architected, tested, and documented.
-* Also many thanks to the Analytics team at DeNA who build and maintain
-our Big Data infrastructure.
-
-<a name='section_Author'></a>
-Author
-------
-
-Cassio Paes-Leme :: cpaesleme@dena.com :: @cpaesleme
-
-[mobilize-base]: https://github.com/dena/mobilize-base
-[mobilize-ssh]: https://github.com/dena/mobilize-ssh
-[mobilize-hdfs]: https://github.com/dena/mobilize-hdfs
+Please refer to the mobilize-server wiki: https://github.com/DeNA/mobilize-server/wiki
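The '@'-prefixed params substitution that the removed README describes carries over into the reworked Hive.run below. A minimal standalone sketch of that behavior in plain Ruby (the `substitute_params` helper is hypothetical, not part of the gem):

``` ruby
# Sketch of the '@'-prefixed param substitution described above.
# `substitute_params` is an illustrative helper, not mobilize-hive's API.
def substitute_params(hql, params)
  params.each do |key, value|
    # only '@date', '@unit', etc. are replaced; bare 'date'/'unit' stay untouched
    hql = hql.gsub("@#{key}", value.to_s)
  end
  hql
end

hql = "select * from mobilize.sales where act_date = '@date' and unit = '@unit'"
puts substitute_params(hql, 'date' => '2013-03-01', 'unit' => 'widgets')
# => select * from mobilize.sales where act_date = '2013-03-01' and unit = 'widgets'
```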
data/lib/mobilize-hive/handlers/hive.rb CHANGED
@@ -94,22 +94,29 @@ module Mobilize
 
     #run a generic hive command, with the option of passing a file hash to be locally available
     def Hive.run(cluster,hql,user_name,params=nil,file_hash=nil)
-
-
+      preps = Hive.prepends.map do |p|
+        prefix = "set "
+        suffix = ";"
+        prep_out = p
+        prep_out = "#{prefix}#{prep_out}" unless prep_out.starts_with?(prefix)
+        prep_out = "#{prep_out}#{suffix}" unless prep_out.ends_with?(suffix)
+        prep_out
+      end.join
+      hql = "#{preps}#{hql}"
       filename = hql.to_md5
       file_hash||= {}
       file_hash[filename] = hql
-      #add in default params
      params ||= {}
-      params = params.merge(Hive.default_params)
      #replace any params in the file_hash and command
      params.each do |k,v|
        file_hash.each do |name,data|
-
-
-
-
-
+          data.gsub!("@#{k}",v)
+        end
+      end
+      #add in default params
+      Hive.default_params.each do |k,v|
+        file_hash.each do |name,data|
+          data.gsub!(k,v)
        end
      end
      #silent mode so we don't have logs in stderr; clip output
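The new `prepends` block normalizes each configured prepend into a `set ...;` statement and prefixes it to the HQL. A standalone sketch of that normalization, using core Ruby `start_with?`/`end_with?` in place of the gem's `starts_with?`/`ends_with?` string extensions:

``` ruby
# Sketch: turn configured prepends into "set ...;" statements and prefix them to the HQL.
prepends = ["hive.stats.autogather=false", "set mapred.job.queue.name=etl;"]

preps = prepends.map do |p|
  p = "set #{p}" unless p.start_with?("set ")
  p = "#{p};"    unless p.end_with?(";")
  p
end.join

hql = preps + "select count(*) from mobilize.hive1;"
puts hql
# => set hive.stats.autogather=false;set mapred.job.queue.name=etl;select count(*) from mobilize.hive1;
```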
@@ -181,30 +188,24 @@ module Mobilize
      response
    end
 
-    def Hive.schema_hash(schema_path,user_name,gdrive_slot)
-      if schema_path.index("
-
-
+    def Hive.schema_hash(schema_path,stage_path,user_name,gdrive_slot)
+      handler = if schema_path.index("://")
+                  schema_path.split("://").first
+                else
+                  "gsheet"
+                end
+      dst = "Mobilize::#{handler.downcase.capitalize}".constantize.path_to_dst(schema_path,stage_path,gdrive_slot)
+      out_raw = dst.read(user_name,gdrive_slot)
+      #determine the datatype for schema; accept json, yaml, tsv
+      if schema_path.ends_with?(".yml")
+        out_ha = begin;YAML.load(out_raw);rescue ScriptError, StandardError;nil;end if out_ha.nil?
      else
-
-
-        r = u.runner
-        runner_sheet = r.gbook(gdrive_slot).worksheet_by_title(schema_path)
-        out_tsv = if runner_sheet
-                    runner_sheet.read(user_name)
-                  else
-                    #check for gfile. will fail if there isn't one.
-                    Gfile.find_by_path(schema_path).read(user_name)
-                  end
+        out_ha = begin;JSON.parse(out_raw);rescue ScriptError, StandardError;nil;end
+        out_ha = out_raw.tsv_to_hash_array if out_ha.nil?
      end
-      #use Gridfs to cache gdrive results
-      file_name = schema_path.split("/").last
-      out_url = "gridfs://#{schema_path}/#{file_name}"
-      Dataset.write_by_url(out_url,out_tsv,user_name)
-      schema_tsv = Dataset.find_by_url(out_url).read(user_name,gdrive_slot)
      schema_hash = {}
-
-      schema_hash[
+      out_ha.each do |hash|
+        schema_hash[hash['name']] = hash['datatype']
      end
      schema_hash
    end
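The reworked `schema_hash` reads the schema source through the owning handler's `path_to_dst`, accepts YAML, JSON, or TSV, and folds the parsed rows into a name-to-datatype map used for column overrides. A minimal sketch of that final folding step, assuming the schema rows have already been parsed into an array of hashes:

``` ruby
# Sketch: fold parsed schema rows (name/datatype pairs) into the lookup hash
# that hive.write uses for column datatype overrides.
schema_rows = [
  { 'name' => 'act_date', 'datatype' => 'string' },
  { 'name' => 'value',    'datatype' => 'int' }
]

schema_hash = {}
schema_rows.each { |row| schema_hash[row['name']] = row['datatype'] }

schema_hash                          # => {"act_date"=>"string", "value"=>"int"}
schema_hash['category'] || 'string'  # columns missing from the schema default to "string"
```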
@@ -359,8 +360,13 @@ module Mobilize
    #turn a tsv into a hive table.
    #Accepts options to drop existing target if any
    #also schema with column datatype overrides
-    def Hive.tsv_to_table(cluster,
-
+    def Hive.tsv_to_table(cluster, table_path, user_name, source_tsv)
+      #nil if only header row, or no header row
+      if source_tsv.strip.length==0 or source_tsv.strip.split("\n").length<=1
+        puts "no data in source_tsv for #{cluster}/#{table_path}"
+        return nil
+      end
+      #get rid of freaking carriage return characters
      if source_tsv.index("\r\n")
        source_tsv = source_tsv.gsub("\r\n","\n")
      elsif source_tsv.index("\r")
@@ -368,121 +374,29 @@ module Mobilize
      end
      source_headers = source_tsv.tsv_header_array
 
-
-
-
-
-
-      url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
-
-      if part_array.length == 0 and
-        table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].nil?}
-        #no partitions in either user params or the target table
-        #or drop and start fresh
-
-        #one file only, strip headers, replace tab with ctrl-a for hive
-        #get rid of freaking carriage return characters
-        source_rows = source_tsv.split("\n")[1..-1].join("\n").gsub("\t","\001")
-        source_tsv_filename = "000000_0"
-        file_hash = {source_tsv_filename=>source_rows}
-
-        field_defs = source_headers.map do |name|
-          datatype = schema_hash[name] || "string"
-          "`#{name}` #{datatype}"
-        end.ie{|fs| "(#{fs.join(",")})"}
-
-        #for single insert, use drop table and create table always
-        target_drop_hql = "drop table if exists #{table_path}"
-
-        target_create_hql = "create table #{table_path} #{field_defs}"
+      #one file only, strip headers, replace tab with ctrl-a for hive
+      source_rows = source_tsv.split("\n")[1..-1].join("\n").gsub("\t","\001")
+      source_tsv_filename = "000000_0"
+      file_hash = {source_tsv_filename=>source_rows}
 
-
-
+      field_defs = source_headers.map do |name|
+        "`#{name}` string"
+      end.ie{|fs| "(#{fs.join(",")})"}
 
-
+      #for single insert, use drop table and create table always
+      target_drop_hql = "drop table if exists #{table_path}"
 
-
-        raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
-
-      elsif part_array.length > 0 and
-        table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']}.sort == part_array.sort}
-        #partitions and no target table
-        #or same partitions in both target table and user params
-        #or drop and start fresh
-
-        target_headers = source_headers.reject{|h| part_array.include?(h)}
-
-        field_defs = "(#{target_headers.map do |name|
-          datatype = schema_hash[name] || "string"
-          "`#{name}` #{datatype}"
-        end.join(",")})"
-
-        partition_defs = "(#{part_array.map do |name|
-          datatype = schema_hash[name] || "string"
-          "#{name} #{datatype}"
-        end.join(",")})"
-
-        target_drop_hql = drop ? "drop table if exists #{table_path};" : ""
-
-        target_create_hql = target_drop_hql +
-          "create table if not exists #{table_path} #{field_defs} " +
-          "partitioned by #{partition_defs}"
-
-        #create target table early if not here
-        response = Hive.run(cluster, target_create_hql, user_name)
-        raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
+      target_create_hql = "create table #{table_path} #{field_defs}"
 
-
-
-        return url if source_hash_array.length==1 and source_hash_array.first.values.compact.length==0
+      #load source data
+      target_insert_hql = "load data local inpath '#{source_tsv_filename}' overwrite into table #{table_path};"
 
-
+      target_full_hql = [target_drop_hql,target_create_hql,target_insert_hql].join(";")
 
-
-
-        source_hash_array.each do |ha|
-          tpmk = part_array.map{|pn| "#{pn}=#{ha[pn]}"}.join("/")
-          tpmv = ha.reject{|k,v| part_array.include?(k)}.values.join("\001")
-          if data_hash[tpmk]
-            data_hash[tpmk] += "\n#{tpmv}"
-          else
-            data_hash[tpmk] = tpmv
-          end
-        end
-
-        #go through completed data hash and write each key value to the table in question
-        target_part_hql = ""
-        data_hash.each do |tpmk,tpmv|
-          base_filename = "000000_0"
-          part_pairs = tpmk.split("/").map{|p| p.split("=").ie{|pa| ["#{pa.first}","#{pa.second}"]}}
-          part_dir = part_pairs.map{|pp| "#{pp.first}=#{pp.second}"}.join("/")
-          part_stmt = part_pairs.map{|pp| "#{pp.first}='#{pp.second}'"}.join(",")
-          hdfs_dir = "#{table_stats['location']}/#{part_dir}"
-          #source the partitions from a parallel load folder since filenames are all named the same
-          hdfs_source_url = "#{table_stats['location']}/part_load/#{part_dir}/#{base_filename}"
-          hdfs_target_url = hdfs_dir
-          #load partition into source path
-          puts "Writing to #{hdfs_source_url} for #{user_name} at #{Time.now.utc}"
-          Hdfs.write(cluster,hdfs_source_url,tpmv,user_name)
-          #let Hive know where the partition is
-          target_add_part_hql = "use #{db};alter table #{table} add if not exists partition (#{part_stmt}) location '#{hdfs_target_url}'"
-          target_insert_part_hql = "load data inpath '#{hdfs_source_url}' overwrite into table #{table} partition (#{part_stmt});"
-          target_part_hql += [target_add_part_hql,target_insert_part_hql].join(";")
-        end
-        #run actual partition adds all at once
-        if target_part_hql.length>0
-          puts "Adding partitions to #{cluster}/#{db}/#{table} for #{user_name} at #{Time.now.utc}"
-          response = Hive.run(cluster, target_part_hql, user_name)
-          raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
-        end
-      else
-        error_msg = "Incompatible partition specs: " +
-          "target table:#{table_stats['partitions'].to_s}, " +
-          "user_params:#{part_array.to_s}"
-        raise error_msg
-      end
+      response = Hive.run(cluster, target_full_hql, user_name, nil, file_hash)
+      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
 
-      return
+      return true
    end
 
    def Hive.write_by_stage_path(stage_path)
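The simplified `tsv_to_table` above now only ever builds a plain, unpartitioned staging table: every header becomes a string column and the body rows are loaded from a local ctrl-A-delimited file. A self-contained sketch of that HQL assembly in plain Ruby (no mobilize extensions such as `.ie`; the table name is illustrative):

``` ruby
# Sketch: build the drop/create/load statements for a TSV staged as a Hive table.
source_tsv = "act_date\tproduct\tvalue\n2013-01-01\twidgets\t9\n"
table_path = "mobilize.temptsv_example"   # hypothetical staging table name

headers = source_tsv.split("\n").first.split("\t")
rows    = source_tsv.split("\n")[1..-1].join("\n").gsub("\t", "\001") # ctrl-A delimited body

field_defs        = "(" + headers.map { |h| "`#{h}` string" }.join(",") + ")"
target_drop_hql   = "drop table if exists #{table_path}"
target_create_hql = "create table #{table_path} #{field_defs}"
target_insert_hql = "load data local inpath '000000_0' overwrite into table #{table_path};"

puts [target_drop_hql, target_create_hql, target_insert_hql].join(";")
```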
@@ -502,7 +416,7 @@ module Mobilize
      job_name = s.path.sub("Runner_","")
 
      schema_hash = if params['schema']
-                      Hive.schema_hash(params['schema'],user_name,gdrive_slot)
+                      Hive.schema_hash(params['schema'],stage_path,user_name,gdrive_slot)
                    else
                      {}
                    end
@@ -548,7 +462,16 @@ module Mobilize
        run_params = params['params']
        Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,run_params)
      elsif source_tsv
-
+        #first write tsv to temp table
+        temp_table_path = "#{Hive.output_db(cluster)}.temptsv_#{stage_path.gridsafe}"
+        has_data = Hive.tsv_to_table(cluster, temp_table_path, user_name, source_tsv)
+        if has_data
+          #then do the regular insert, with source hql being select * from temp table
+          source_hql = "select * from #{temp_table_path}"
+          Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash)
+        else
+          nil
+        end
      elsif source
        #null sheet
      else
@@ -583,9 +506,8 @@ module Mobilize
 
    def Hive.write_by_dataset_path(dst_path,source_tsv,user_name,*args)
      cluster,db,table = dst_path.split("/")
-
-
-      Hive.tsv_to_table(cluster, db, table, part_array, source_tsv, user_name, drop)
+      table_path = "#{db}.#{table}"
+      Hive.tsv_to_table(cluster, table_path, user_name, source_tsv)
    end
  end
 
data/lib/mobilize-hive/helpers/hive_helper.rb CHANGED
@@ -26,6 +26,10 @@ module Mobilize
      (1..self.clusters[cluster]['max_slots']).to_a.map{|s| "#{cluster}_#{s.to_s}"}
    end
 
+    def self.prepends
+      self.config['prepends']
+    end
+
    def self.slot_worker_by_cluster_and_path(cluster,path)
      working_slots = Mobilize::Resque.jobs.map{|j| begin j['args'][1]['hive_slot'];rescue;nil;end}.compact.uniq
      self.slot_ids(cluster).each do |slot_id|
data/lib/mobilize-hive/tasks.rb CHANGED
@@ -1,6 +1,7 @@
-
+require 'yaml'
+namespace :mobilize do
  desc "Set up config and log folders and files"
-  task :
+  task :setup_hive do
    sample_dir = File.dirname(__FILE__) + '/../samples/'
    sample_files = Dir.entries(sample_dir)
    config_dir = (ENV['MOBILIZE_CONFIG_DIR'] ||= "config/mobilize/")
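With this rename the setup task lives under the shared `mobilize` namespace, so it is invoked as `rake mobilize:setup_hive`. A Rakefile-style sketch of the task's shape after the change (the body comment summarizes what the real task does; this is not the full implementation):

``` ruby
# Sketch, as it would appear in a Rakefile: the renamed setup task.
namespace :mobilize do
  desc "Set up config and log folders and files"
  task :setup_hive do
    # the real task copies lib/samples/hive.yml into the config dir
    # (ENV['MOBILIZE_CONFIG_DIR'], defaulting to "config/mobilize/")
  end
end
```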
data/lib/samples/hive.yml CHANGED
@@ -1,17 +1,23 @@
 ---
 development:
+  prepends:
+  - "hive.stats.autogather=false"
   clusters:
     dev_cluster:
       max_slots: 5
       temp_table_db: mobilize
       exec_path: /path/to/hive
 test:
+  prepends:
+  - "hive.stats.autogather=false"
   clusters:
     test_cluster:
       max_slots: 5
       temp_table_db: mobilize
       exec_path: /path/to/hive
 production:
+  prepends:
+  - "hive.stats.autogather=false"
   clusters:
     prod_cluster:
       max_slots: 5
data/mobilize-hive.gemspec CHANGED
@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
  gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
  gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
  gem.require_paths = ["lib"]
-  gem.add_runtime_dependency "mobilize-hdfs","1.
+  gem.add_runtime_dependency "mobilize-hdfs","1.361"
 end
data/test/{hive_test_1.hql → fixtures/hive1.hql} RENAMED (file without changes)
data/test/{hive_test_1_in.yml → fixtures/hive1.in.yml} RENAMED (file without changes)
data/test/{hive_test_1_schema.yml → fixtures/hive1.schema.yml} RENAMED (file without changes)
data/test/fixtures/hive1.sql ADDED
@@ -0,0 +1 @@
+select act_date,product, sum(value) as sum from mobilize.hive_test_1 group by act_date,product;
data/test/fixtures/hive4_stage1.in ADDED
@@ -0,0 +1 @@
+
data/test/fixtures/integration_expected.yml ADDED
@@ -0,0 +1,69 @@
+---
+- path: "Runner_mobilize(test)/jobs"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive1/stage1"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive1/stage2"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive1/stage3"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive1/stage4"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive1/stage5"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive2/stage1"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive2/stage2"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive2/stage3"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive3/stage1"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive3/stage2"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive3/stage3"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive3/stage4"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive4/stage1"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive4/stage2"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive4/stage3"
+  state: working
+  count: 1
+  confirmed_ats: []
+- path: "Runner_mobilize(test)/jobs/hive4/stage4"
+  state: working
+  count: 1
+  confirmed_ats: []
data/test/fixtures/integration_jobs.yml ADDED
@@ -0,0 +1,34 @@
+---
+- name: hive1
+  active: true
+  trigger: once
+  status: ""
+  stage1: hive.write target:"mobilize/hive1", partitions:"act_date", drop:true,
+    source:"Runner_mobilize(test)/hive1.in", schema:"hive1.schema"
+  stage2: hive.run source:"hive1.sql"
+  stage3: hive.run hql:"show databases;"
+  stage4: gsheet.write source:"stage2", target:"hive1_stage2.out"
+  stage5: gsheet.write source:"stage3", target:"hive1_stage3.out"
+- name: hive2
+  active: true
+  trigger: after hive1
+  status: ""
+  stage1: hive.write source:"hdfs://user/mobilize/test/hdfs1.out", target:"mobilize.hive2", drop:true
+  stage2: hive.run hql:"select * from mobilize.hive2;"
+  stage3: gsheet.write source:"stage2", target:"hive2.out"
+- name: hive3
+  active: true
+  trigger: after hive2
+  status: ""
+  stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive1;", params:{'date':'2013-01-01'}
+  stage2: hive.write source:"stage1",target:"mobilize/hive3", partitions:"date/product", drop:true
+  stage3: hive.write hql:"select * from mobilize.hive3;",target:"mobilize/hive3", partitions:"date/product", drop:false
+  stage4: gsheet.write source:"hive://mobilize/hive3", target:"hive3.out"
+- name: hive4
+  active: true
+  trigger: after hive3
+  status: ""
+  stage1: hive.write source:"hive4_stage1.in", target:"mobilize/hive1", partitions:"act_date"
+  stage2: hive.write source:"hive4_stage2.in", target:"mobilize/hive1", partitions:"act_date"
+  stage3: hive.run hql:"select '@date $utc_time' as `date_time`,product,category,value from mobilize.hive1;", params:{'date':'$utc_date'}
+  stage4: gsheet.write source:stage3, target:"hive4.out"
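The hive4 job above combines a user param (`'date' => '$utc_date'`) with the automatic `$utc_date`/`$utc_time` substitutions that Hive.run applies from `Hive.default_params`. A worked sketch of how its stage3 HQL resolves at runtime (the date and time values here are illustrative placeholders):

``` ruby
# Sketch: how '@date' and '$utc_time' in hive4's stage3 HQL get resolved.
hql    = "select '@date $utc_time' as `date_time`,product,category,value from mobilize.hive1;"
params = { 'date' => '$utc_date' }

# user params are replaced first, keyed by their '@' prefix
params.each { |k, v| hql = hql.gsub("@#{k}", v) }

# then the automatic params; hard-coded here, whereas Hive.default_params derives them from UTC now
default_params = { '$utc_date' => '2013-05-31', '$utc_time' => '12:00' }
default_params.each { |k, v| hql = hql.gsub(k, v) }

puts hql
# => select '2013-05-31 12:00' as `date_time`,product,category,value from mobilize.hive1;
```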
data/test/integration/mobilize-hive_test.rb ADDED
@@ -0,0 +1,43 @@
+require 'test_helper'
+describe "Mobilize" do
+  # enqueues 4 workers on Resque
+  it "runs integration test" do
+
+    puts "restart workers"
+    Mobilize::Jobtracker.restart_workers!
+
+    u = TestHelper.owner_user
+    r = u.runner
+    user_name = u.name
+    gdrive_slot = u.email
+
+    puts "add test data"
+    ["hive1.in","hive4_stage1.in","hive4_stage2.in","hive1.schema","hive1.sql"].each do |fixture_name|
+      target_url = "gsheet://#{r.title}/#{fixture_name}"
+      TestHelper.write_fixture(fixture_name, target_url, 'replace')
+    end
+
+    puts "add/update jobs"
+    u.jobs.each{|j| j.delete}
+    jobs_fixture_name = "integration_jobs"
+    jobs_target_url = "gsheet://#{r.title}/jobs"
+    TestHelper.write_fixture(jobs_fixture_name, jobs_target_url, 'update')
+
+    puts "job rows added, force enqueue runner, wait for stages"
+    #wait for stages to complete
+    expected_fixture_name = "integration_expected"
+    Mobilize::Jobtracker.stop!
+    r.enqueue!
+    TestHelper.confirm_expected_jobs(expected_fixture_name,3600)
+
+    puts "update job status and activity"
+    r.update_gsheet(gdrive_slot)
+
+    puts "check posted data"
+    assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage2.out", 'min_length' => 219) == true
+    assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage3.out", 'min_length' => 3) == true
+    assert TestHelper.check_output("gsheet://#{r.title}/hive2.out", 'min_length' => 599) == true
+    assert TestHelper.check_output("gsheet://#{r.title}/hive3.out", 'min_length' => 347) == true
+    assert TestHelper.check_output("gsheet://#{r.title}/hive4.out", 'min_length' => 432) == true
+  end
+end
data/test/test_helper.rb CHANGED
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: mobilize-hive
 version: !ruby/object:Gem::Version
-  version: '1.
+  version: '1.361'
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-
+date: 2013-05-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mobilize-hdfs
@@ -18,7 +18,7 @@ dependencies:
     requirements:
     - - '='
      - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.361'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
@@ -26,7 +26,7 @@ dependencies:
     requirements:
    - - '='
      - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.361'
 description: Adds hive read, write, and run support to mobilize-hdfs
 email:
 - cpaesleme@dena.com
@@ -46,11 +46,15 @@ files:
 - lib/mobilize-hive/version.rb
 - lib/samples/hive.yml
 - mobilize-hive.gemspec
-- test/
-- test/
-- test/
-- test/
-- test/
+- test/fixtures/hive1.hql
+- test/fixtures/hive1.in.yml
+- test/fixtures/hive1.schema.yml
+- test/fixtures/hive1.sql
+- test/fixtures/hive4_stage1.in
+- test/fixtures/hive4_stage2.in.yml
+- test/fixtures/integration_expected.yml
+- test/fixtures/integration_jobs.yml
+- test/integration/mobilize-hive_test.rb
 - test/redis-test.conf
 - test/test_helper.rb
 homepage: http://github.com/dena/mobilize-hive
@@ -67,7 +71,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
      version: '0'
    segments:
    - 0
-    hash:
+    hash: -2758952284012723637
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -76,7 +80,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      version: '0'
    segments:
    - 0
-    hash:
+    hash: -2758952284012723637
 requirements: []
 rubyforge_project:
 rubygems_version: 1.8.25
@@ -84,10 +88,14 @@ signing_key:
 specification_version: 3
 summary: Adds hive read, write, and run support to mobilize-hdfs
 test_files:
-- test/
-- test/
-- test/
-- test/
-- test/
+- test/fixtures/hive1.hql
+- test/fixtures/hive1.in.yml
+- test/fixtures/hive1.schema.yml
+- test/fixtures/hive1.sql
+- test/fixtures/hive4_stage1.in
+- test/fixtures/hive4_stage2.in.yml
+- test/fixtures/integration_expected.yml
+- test/fixtures/integration_jobs.yml
+- test/integration/mobilize-hive_test.rb
 - test/redis-test.conf
 - test/test_helper.rb
data/test/hive_job_rows.yml DELETED
@@ -1,34 +0,0 @@
----
-- name: hive_test_1
-  active: true
-  trigger: once
-  status: ""
-  stage1: hive.write target:"mobilize/hive_test_1", partitions:"act_date", drop:true,
-    source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema"
-  stage2: hive.run source:"hive_test_1.hql"
-  stage3: hive.run hql:"show databases;"
-  stage4: gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"
-  stage5: gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"
-- name: hive_test_2
-  active: true
-  trigger: after hive_test_1
-  status: ""
-  stage1: hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true
-  stage2: hive.run hql:"select * from mobilize.hive_test_2;"
-  stage3: gsheet.write source:"stage2", target:"hive_test_2.out"
-- name: hive_test_3
-  active: true
-  trigger: after hive_test_2
-  status: ""
-  stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive_test_1;", params:{'date':'2013-01-01'}
-  stage2: hive.write source:"stage1",target:"mobilize/hive_test_3", partitions:"date/product", drop:true
-  stage3: hive.write hql:"select * from mobilize.hive_test_3;",target:"mobilize/hive_test_3", partitions:"date/product", drop:false
-  stage4: gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"
-- name: hive_test_4
-  active: true
-  trigger: after hive_test_3
-  status: ""
-  stage1: hive.write source:"hive_test_4_stage_1.in", target:"mobilize/hive_test_1", partitions:"act_date"
-  stage2: hive.write source:"hive_test_4_stage_2.in", target:"mobilize/hive_test_1", partitions:"act_date"
-  stage3: hive.run hql:"select '$utc_date $utc_time' as `date_time`,product,category,value from mobilize.hive_test_1;"
-  stage4: gsheet.write source:stage3, target:"hive_test_4.out"
data/test/mobilize-hive_test.rb DELETED
@@ -1,112 +0,0 @@
-require 'test_helper'
-
-describe "Mobilize" do
-
-  def before
-    puts 'nothing before'
-  end
-
-  # enqueues 4 workers on Resque
-  it "runs integration test" do
-
-    puts "restart workers"
-    Mobilize::Jobtracker.restart_workers!
-
-    gdrive_slot = Mobilize::Gdrive.owner_email
-    puts "create user 'mobilize'"
-    user_name = gdrive_slot.split("@").first
-    u = Mobilize::User.where(:name=>user_name).first
-    r = u.runner
-
-    puts "add test_source data"
-    hive_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
-    [hive_1_in_sheet].each {|s| s.delete if s}
-    hive_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
-    hive_1_in_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_in.yml").hash_array_to_tsv
-    hive_1_in_sheet.write(hive_1_in_tsv,Mobilize::Gdrive.owner_name)
-
-    #create blank sheet
-    hive_4_stage_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_1.in",gdrive_slot)
-    [hive_4_stage_1_in_sheet].each {|s| s.delete if s}
-    hive_4_stage_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_1.in",gdrive_slot)
-
-    #create sheet w just headers
-    hive_4_stage_2_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_2.in",gdrive_slot)
-    [hive_4_stage_2_in_sheet].each {|s| s.delete if s}
-    hive_4_stage_2_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_2.in",gdrive_slot)
-    hive_4_stage_2_in_sheet_header = hive_1_in_tsv.tsv_header_array.join("\t")
-    hive_4_stage_2_in_sheet.write(hive_4_stage_2_in_sheet_header,Mobilize::Gdrive.owner_name)
-
-    hive_1_schema_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
-    [hive_1_schema_sheet].each {|s| s.delete if s}
-    hive_1_schema_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
-    hive_1_schema_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_schema.yml").hash_array_to_tsv
-    hive_1_schema_sheet.write(hive_1_schema_tsv,Mobilize::Gdrive.owner_name)
-
-    hive_1_hql_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
-    [hive_1_hql_sheet].each {|s| s.delete if s}
-    hive_1_hql_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
-    hive_1_hql_tsv = File.open("#{Mobilize::Base.root}/test/hive_test_1.hql").read
-    hive_1_hql_sheet.write(hive_1_hql_tsv,Mobilize::Gdrive.owner_name)
-
-    jobs_sheet = r.gsheet(gdrive_slot)
-
-    test_job_rows = ::YAML.load_file("#{Mobilize::Base.root}/test/hive_job_rows.yml")
-    test_job_rows.map{|j| r.jobs(j['name'])}.each{|j| j.delete if j}
-    jobs_sheet.add_or_update_rows(test_job_rows)
-
-    hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
-    [hive_1_stage_2_target_sheet].each{|s| s.delete if s}
-    hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
-    [hive_1_stage_3_target_sheet].each{|s| s.delete if s}
-    hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
-    [hive_2_target_sheet].each{|s| s.delete if s}
-    hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
-    [hive_3_target_sheet].each{|s| s.delete if s}
-    hive_4_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4.out",gdrive_slot)
-    [hive_4_target_sheet].each{|s| s.delete if s}
-
-    puts "job row added, force enqueued requestor, wait for stages"
-    r.enqueue!
-    wait_for_stages(2100)
-
-    puts "jobtracker posted data to test sheet"
-    hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
-    hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
-    hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
-    hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
-    hive_4_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4.out",gdrive_slot)
-
-    assert hive_1_stage_2_target_sheet.read(u.name).length == 219
-    assert hive_1_stage_3_target_sheet.read(u.name).length > 3
-    assert hive_2_target_sheet.read(u.name).length == 599
-    assert hive_3_target_sheet.read(u.name).length == 347
-    assert hive_4_target_sheet.read(u.name).length == 432
-  end
-
-  def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
-    time = 0
-    time_since_stage = 0
-    #check for 10 min
-    while time < time_limit and time_since_stage < stage_limit
-      sleep wait_length
-      job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
-      if job_classes.include?("Mobilize::Stage")
-        time_since_stage = 0
-        puts "saw stage at #{time.to_s} seconds"
-      else
-        time_since_stage += wait_length
-        puts "#{time_since_stage.to_s} seconds since stage seen"
-      end
-      time += wait_length
-      puts "total wait time #{time.to_s} seconds"
-    end
-
-    if time >= time_limit
-      raise "Timed out before stage completion"
-    end
-  end
-
-
-
-end