mobilize-hive 1.36 → 1.291
- checksums.yaml +7 -0
- data/README.md +242 -3
- data/lib/mobilize-hive.rb +0 -3
- data/lib/mobilize-hive/handlers/hive.rb +105 -100
- data/lib/mobilize-hive/tasks.rb +0 -1
- data/lib/mobilize-hive/version.rb +1 -1
- data/lib/samples/hive.yml +0 -6
- data/mobilize-hive.gemspec +1 -1
- data/test/hive_job_rows.yml +26 -0
- data/test/{fixtures/hive1.hql → hive_test_1.hql} +0 -0
- data/test/{fixtures/hive1.in.yml → hive_test_1_in.yml} +0 -0
- data/test/{fixtures/hive1.schema.yml → hive_test_1_schema.yml} +0 -0
- data/test/mobilize-hive_test.rb +96 -0
- data/test/test_helper.rb +0 -1
- metadata +19 -38
- data/lib/mobilize-hive/helpers/hive_helper.rb +0 -67
- data/test/fixtures/hive1.sql +0 -1
- data/test/fixtures/hive4_stage1.in +0 -1
- data/test/fixtures/hive4_stage2.in.yml +0 -4
- data/test/fixtures/integration_expected.yml +0 -69
- data/test/fixtures/integration_jobs.yml +0 -34
- data/test/integration/mobilize-hive_test.rb +0 -43
checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 1eb5a243ff499f31c06f0c1e09abba3fafee0b31
+  data.tar.gz: d44471ad7d8fcc72ac8562ebfb529d31d839d195
+SHA512:
+  metadata.gz: a27bde80d634f949cbf7f82b0ceaad4e087e3fe1bee6fe4631aa8205b68a7897e9d0f8e8ec49700a37cc3bd54e266df06a30e2187e13b6f9a1357d1d270af54c
+  data.tar.gz: 3763262ee0ac27778cc2abb6342b20d9fb1a1c6c546ddd87f4ff73969a8054314480387c230aa232e89fad7e2df16f597f2fa50344430343a7cdde9aa03d0d79
data/README.md CHANGED

Mobilize-Hive
===============

Mobilize-Hive adds the power of hive to [mobilize-hdfs][mobilize-hdfs].
* read, write, and copy hive files through Google Spreadsheets.

Table Of Contents
-----------------
* [Overview](#section_Overview)
* [Install](#section_Install)
  * [Mobilize-Hive](#section_Install_Mobilize-Hive)
  * [Install Dirs and Files](#section_Install_Dirs_and_Files)
* [Configure](#section_Configure)
  * [Hive](#section_Configure_Hive)
* [Start](#section_Start)
  * [Create Job](#section_Start_Create_Job)
  * [Run Test](#section_Start_Run_Test)
* [Meta](#section_Meta)
* [Special Thanks](#section_Special_Thanks)
* [Author](#section_Author)

<a name='section_Overview'></a>
Overview
-----------

* Mobilize-hive adds Hive methods to mobilize-hdfs.

<a name='section_Install'></a>
Install
------------

Make sure you go through all the steps in the
[mobilize-base][mobilize-base],
[mobilize-ssh][mobilize-ssh], and
[mobilize-hdfs][mobilize-hdfs]
install sections first.

<a name='section_Install_Mobilize-Hive'></a>
### Mobilize-Hive

Add this to your Gemfile:

``` ruby
gem "mobilize-hive"
```

or do

    $ gem install mobilize-hive

for a ruby-wide install.

<a name='section_Install_Dirs_and_Files'></a>
### Dirs and Files

### Rakefile

Inside the Rakefile in your project's root dir, make sure you have:

``` ruby
require 'mobilize-base/tasks'
require 'mobilize-ssh/tasks'
require 'mobilize-hdfs/tasks'
require 'mobilize-hive/tasks'
```

This defines the rake tasks essential to running the environment.

### Config Dir

Run

    $ rake mobilize_hive:setup

This will copy a sample hive.yml into your config dir.

<a name='section_Configure'></a>
Configure
------------

<a name='section_Configure_Hive'></a>
### Configure Hive

* Hive is big data. That means we need to be careful when reading from
the cluster, as it could easily fill up our mongodb instance, RAM, local disk
space, etc.
* To achieve this, all hive operations, stage outputs, etc. are
executed and stored on the cluster only.
* The exceptions are:
  * writing to the cluster from an external source, such as a google
sheet. Here there is no risk, as the external source has much stricter size
limits than hive.
  * reading from the cluster, such as for posting to a google sheet. In
this case, the read_limit parameter dictates the maximum amount that can
be read. If the data is bigger than the read limit, an exception will be
raised.

The Hive configuration consists of:
* clusters - this defines aliases for clusters, which are used as
parameters for Hive stages. They should have the same names as those
in hadoop.yml. Each cluster has:
  * max_slots - defines the total number of simultaneous slots to be
used for hive jobs on this cluster
  * output_db - defines the db which should be used to hold stage outputs.
    * This db must have open permissions (777) so any user on the system can
write to it -- the tables inside will be owned by the users themselves.
  * exec_path - defines the path to the hive executable

Sample hive.yml:

``` yml
---
development:
  clusters:
    dev_cluster:
      max_slots: 5
      output_db: mobilize
      exec_path: /path/to/hive
test:
  clusters:
    test_cluster:
      max_slots: 5
      output_db: mobilize
      exec_path: /path/to/hive
production:
  clusters:
    prod_cluster:
      max_slots: 5
      output_db: mobilize
      exec_path: /path/to/hive
```
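As a sketch of how these values are consumed (one concurrency slot id per max_slots, in the `<cluster>_<n>` form used by the gem's Hive.slot_ids helper; the yml content mirrors the sample above):

``` ruby
require 'yaml'

# Parse a hive.yml-style cluster entry and derive its hive job slot ids.
# The "#{cluster}_#{n}" slot-id form follows the gem's slot_ids scheme.
config = YAML.load(<<~YML)
  development:
    clusters:
      dev_cluster:
        max_slots: 5
        output_db: mobilize
        exec_path: /path/to/hive
YML

cluster   = 'dev_cluster'
max_slots = config['development']['clusters'][cluster]['max_slots']
slot_ids  = (1..max_slots).to_a.map { |n| "#{cluster}_#{n}" }
puts slot_ids.join(",")
```

A hive stage grabs the first of these ids not already held by a Resque worker, so at most max_slots hive jobs run per cluster at once.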
<a name='section_Start'></a>
Start
-----

<a name='section_Start_Create_Job'></a>
### Create Job

* For mobilize-hive, the following stages are available.
  * cluster and user are optional for all of the below.
    * cluster defaults to the first cluster listed;
    * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
  * hive.run `hql:<hql> || source:<gsheet_path>, user:<user>, cluster:<cluster>`, which executes the
script in the hql or source sheet and returns any output specified at the
end. If the cmd or last query in source is a select statement, column headers will be
returned as well.
  * hive.write `hql:<hql> || source:<source_path>, target:<hive_path>, partitions:<partition_path>, user:<user>, cluster:<cluster>, schema:<gsheet_path>, drop:<true/false>`,
which writes the source or query result to the selected hive table.
    * hive_path
      * should be of the form `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
    * source:
      * can be a gsheet_path, hdfs_path, or hive_path (no partitions)
      * for gsheet and hdfs paths,
        * if the file ends in .*ql, it's treated the same as passing hql
        * otherwise it is treated as a tsv with the first row as column headers
    * target:
      * should be a hive_path, as in `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
    * partitions:
      * Due to a Hive limitation, partition names CANNOT be reserved keywords when writing from tsv (gsheet or hdfs source)
      * Partitions should be specified as a path, as in partitions:`<partition1>/<partition2>`.
    * schema:
      * optional. gsheet_path to a column schema.
      * two columns: name, datatype
      * Any columns not defined here will receive "string" as the datatype
      * partitions can have their datatypes overridden here as well
      * columns named here that are not in the dataset will be ignored
    * drop:
      * optional. drops the target table before performing the write
      * defaults to false
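As a concrete example, a runner job combining these stages, in the row format the gem's test fixtures use (hive_sample and its sheet names are hypothetical; cluster and user are omitted so the defaults above apply):

``` yml
- name: hive_sample
  active: true
  trigger: once
  status: ""
  stage1: hive.write target:"mobilize/hive_sample", partitions:"act_date", drop:true,
    source:"Runner_mobilize(test)/hive_sample.in", schema:"hive_sample.schema"
  stage2: hive.run hql:"select count(*) from mobilize.hive_sample;"
  stage3: gsheet.write source:"stage2", target:"hive_sample.out"
```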
<a name='section_Start_Run_Test'></a>
### Run Test

To run tests, you will need to

1) go through the [mobilize-base][mobilize-base], [mobilize-ssh][mobilize-ssh], and [mobilize-hdfs][mobilize-hdfs] tests first

2) clone the mobilize-hive repository

From the project folder, run

3) $ rake mobilize_hive:setup

Copy over the config files from the mobilize-base, mobilize-ssh, and
mobilize-hdfs projects into the config dir, and populate the values in the hive.yml file.

Make sure you use the same names for your hive clusters as you do in
hadoop.yml.

4) $ rake test

* The test runs these jobs:
  * hive_test_1:
    * `hive.write target:"mobilize/hive_test_1/act_date",source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema", drop:true`
    * `hive.run source:"hive_test_1.hql"`
    * `hive.run cmd:"show databases"`
    * `gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"`
    * `gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"`
    * hive_test_1.hql runs a select statement on the table created in the
write command.
    * at the end of the test, there should be two sheets: one with a
sum of the data as in your write query, one with the results of the show
databases command.
  * hive_test_2:
    * `hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true`
    * `hive.run cmd:"select * from mobilize.hive_test_2"`
    * `gsheet.write source:"stage2", target:"hive_test_2.out"`
    * this test uses the output from the first hdfs test as an input, so make sure you've run that first.
  * hive_test_3:
    * `hive.write source:"hive://mobilize.hive_test_1",target:"mobilize/hive_test_3/date/product",drop:true`
    * ``hive.run hql:"select act_date as `date`,product,category,value from mobilize.hive_test_1;"``
    * `hive.write source:"stage2",target:"mobilize/hive_test_3/date/product", drop:false`
    * `gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"`

<a name='section_Meta'></a>
Meta
----

* Code: `git clone git://github.com/dena/mobilize-hive.git`
* Home: <https://github.com/dena/mobilize-hive>
* Bugs: <https://github.com/dena/mobilize-hive/issues>
* Gems: <http://rubygems.org/gems/mobilize-hive>

<a name='section_Special_Thanks'></a>
Special Thanks
--------------
* This release goes to Toby Negrin, who championed this project within
DeNA and gave me the support to get it properly architected, tested, and documented.
* Also many thanks to the Analytics team at DeNA, who build and maintain
our Big Data infrastructure.

<a name='section_Author'></a>
Author
------

Cassio Paes-Leme :: cpaesleme@dena.com :: @cpaesleme

[mobilize-base]: https://github.com/dena/mobilize-base
[mobilize-ssh]: https://github.com/dena/mobilize-ssh
[mobilize-hdfs]: https://github.com/dena/mobilize-hdfs
data/lib/mobilize-hive.rb CHANGED

@@ -1,7 +1,56 @@
 module Mobilize
   module Hive
-
-
+    def Hive.config
+      Base.config('hive')
+    end
+
+    def Hive.exec_path(cluster)
+      Hive.clusters[cluster]['exec_path']
+    end
+
+    def Hive.output_db(cluster)
+      Hive.clusters[cluster]['output_db']
+    end
+
+    def Hive.output_db_user(cluster)
+      output_db_node = Hadoop.gateway_node(cluster)
+      output_db_user = Ssh.host(output_db_node)['user']
+      output_db_user
+    end
+
+    def Hive.clusters
+      Hive.config['clusters']
+    end
+
+    def Hive.slot_ids(cluster)
+      (1..Hive.clusters[cluster]['max_slots']).to_a.map{|s| "#{cluster}_#{s.to_s}"}
+    end
+
+    def Hive.slot_worker_by_cluster_and_path(cluster,path)
+      working_slots = Mobilize::Resque.jobs.map{|j| begin j['args'][1]['hive_slot'];rescue;nil;end}.compact.uniq
+      Hive.slot_ids(cluster).each do |slot_id|
+        unless working_slots.include?(slot_id)
+          Mobilize::Resque.set_worker_args_by_path(path,{'hive_slot'=>slot_id})
+          return slot_id
+        end
+      end
+      #return false if none are available
+      return false
+    end
+
+    def Hive.unslot_worker_by_path(path)
+      begin
+        Mobilize::Resque.set_worker_args_by_path(path,{'hive_slot'=>nil})
+        return true
+      rescue
+        return false
+      end
+    end
+
+    def Hive.databases(cluster,user_name)
+      Hive.run(cluster,"show databases",user_name)['stdout'].split("\n")
+    end
+
     # converts a source path or target path to a dst in the context of handler and stage
     def Hive.path_to_dst(path,stage_path,gdrive_slot)
       has_handler = true if path.index("://")
@@ -93,32 +142,12 @@ module Mobilize
     end

     #run a generic hive command, with the option of passing a file hash to be locally available
-    def Hive.run(cluster,hql,user_name,
-
-
-      suffix = ";"
-      prep_out = p
-      prep_out = "#{prefix}#{prep_out}" unless prep_out.starts_with?(prefix)
-      prep_out = "#{prep_out}#{suffix}" unless prep_out.ends_with?(suffix)
-      prep_out
-      end.join
-      hql = "#{preps}#{hql}"
+    def Hive.run(cluster,hql,user_name,file_hash=nil)
+      # no TempStatsStore
+      hql = "set hive.stats.autogather=false;#{hql}"
       filename = hql.to_md5
       file_hash||= {}
       file_hash[filename] = hql
-      params ||= {}
-      #replace any params in the file_hash and command
-      params.each do |k,v|
-        file_hash.each do |name,data|
-          data.gsub!("@#{k}",v)
-        end
-      end
-      #add in default params
-      Hive.default_params.each do |k,v|
-        file_hash.each do |name,data|
-          data.gsub!(k,v)
-        end
-      end
       #silent mode so we don't have logs in stderr; clip output
       #at hadoop read limit
       command = "#{Hive.exec_path(cluster)} -S -f #{filename} | head -c #{Hadoop.read_limit}"
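The clipping in that command can be sketched with a stand-in for the hive binary (the limit here is illustrative, not the gem's default):

``` shell
# Sketch of Hive.run's pipeline: -S keeps hive quiet on stderr, and
# `head -c` clips stdout at the hadoop read limit.
# printf plays the role of the hive binary here.
read_limit=9
printf 'row1\nrow2\nrow3\n' | head -c "$read_limit"
```

Anything past the byte limit is simply cut off, which is why reads larger than read_limit surface as truncated output rather than exhausting local resources.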
@@ -162,9 +191,8 @@ module Mobilize
       Gdrive.unslot_worker_by_path(stage_path)

       #check for select at end
-      hql_array = hql.split("
-
-      if last_statement.to_s.downcase.starts_with?("select")
+      hql_array = hql.split(";").map{|hc| hc.strip}.reject{|hc| hc.length==0}
+      if hql_array.last.downcase.starts_with?("select")
         #nil if no prior commands
         prior_hql = hql_array[0..-2].join(";") if hql_array.length > 1
         select_hql = hql_array.last

@@ -172,10 +200,10 @@ module Mobilize
         "drop table if exists #{output_path}",
         "create table #{output_path} as #{select_hql};"].join(";")
       full_hql = [prior_hql, output_table_hql].compact.join(";")
-      result = Hive.run(cluster,full_hql, user_name
+      result = Hive.run(cluster,full_hql, user_name)
       Dataset.find_or_create_by_url(out_url)
     else
-      result = Hive.run(cluster, hql, user_name
+      result = Hive.run(cluster, hql, user_name)
       Dataset.find_or_create_by_url(out_url)
       Dataset.write_by_url(out_url,result['stdout'],user_name) if result['stdout'].to_s.length>0
     end
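In plain Ruby (core `start_with?` standing in for the gem's `starts_with?` extension), the statement split and trailing-select check work like this:

``` ruby
# Split an hql script into statements, dropping blanks, then check whether
# the final statement is a select (which gets captured into an output table).
hql = "use db1;\nset hive.exec.compress.output=true;\nselect * from t;\n"
hql_array = hql.split(";").map { |hc| hc.strip }.reject { |hc| hc.length == 0 }
# everything before the last statement runs first, unchanged
prior_hql = hql_array[0..-2].join(";") if hql_array.length > 1
puts hql_array.last.downcase.start_with?("select")
```

Because empty fragments are rejected, a trailing `;` or blank lines at the end of the script do not break the check.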
@@ -188,37 +216,40 @@ module Mobilize
       response
     end

-    def Hive.schema_hash(schema_path,
-
-
-
-      "gsheet"
-      end
-      dst = "Mobilize::#{handler.downcase.capitalize}".constantize.path_to_dst(schema_path,stage_path,gdrive_slot)
-      out_raw = dst.read(user_name,gdrive_slot)
-      #determine the datatype for schema; accept json, yaml, tsv
-      if schema_path.ends_with?(".yml")
-        out_ha = begin;YAML.load(out_raw);rescue ScriptError, StandardError;nil;end if out_ha.nil?
+    def Hive.schema_hash(schema_path,user_name,gdrive_slot)
+      if schema_path.index("/")
+        #slashes mean sheets
+        out_tsv = Gsheet.find_by_path(schema_path,gdrive_slot).read(user_name)
       else
-
-
+        u = User.where(:name=>user_name).first
+        #check sheets in runner
+        r = u.runner
+        runner_sheet = r.gbook(gdrive_slot).worksheet_by_title(schema_path)
+        out_tsv = if runner_sheet
+                    runner_sheet.read(user_name)
+                  else
+                    #check for gfile. will fail if there isn't one.
+                    Gfile.find_by_path(schema_path).read(user_name)
+                  end
       end
+      #use Gridfs to cache gdrive results
+      file_name = schema_path.split("/").last
+      out_url = "gridfs://#{schema_path}/#{file_name}"
+      Dataset.write_by_url(out_url,out_tsv,user_name)
+      schema_tsv = Dataset.find_by_url(out_url).read(user_name,gdrive_slot)
       schema_hash = {}
-
-      schema_hash[
+      schema_tsv.tsv_to_hash_array.each do |ha|
+        schema_hash[ha['name']] = ha['datatype']
       end
       schema_hash
     end

-    def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil
+    def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil)
       table_path = [db,table].join(".")
       table_stats = Hive.table_stats(cluster, db, table, user_name)
-      url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")

-
-
-      source_hql_array = source_hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
-      last_select_i = source_hql_array.rindex{|s| s.downcase.starts_with?("select")}
+      source_hql_array = source_hql.split(";")
+      last_select_i = source_hql_array.rindex{|hql| hql.downcase.strip.starts_with?("select")}
       #find the last select query -- it should be used for the temp table creation
       last_select_hql = (source_hql_array[last_select_i..-1].join(";")+";")
       #if there is anything prior to the last select, add it in prior to table creation
@@ -231,8 +262,7 @@ module Mobilize
       temp_set_hql = "set mapred.job.name=#{job_name} (temp table);"
       temp_drop_hql = "drop table if exists #{temp_table_path};"
       temp_create_hql = "#{temp_set_hql}#{prior_hql}#{temp_drop_hql}create table #{temp_table_path} as #{last_select_hql}"
-
-      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
+      Hive.run(cluster,temp_create_hql,user_name)

       source_table_stats = Hive.table_stats(cluster,temp_db,temp_table_name,user_name)
       source_fields = source_table_stats['field_defs']

@@ -270,12 +300,10 @@ module Mobilize
                          target_insert_hql,
                          temp_drop_hql].join

-
-
-      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
+      Hive.run(cluster, target_full_hql, user_name)

     elsif part_array.length > 0 and
-        table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']}
+        table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']} == part_array}
       #partitions and no target table or same partitions in both target table and user params

       target_headers = source_fields.map{|f| f['name']}.reject{|h| part_array.include?(h)}

@@ -306,7 +334,7 @@ module Mobilize

       target_set_hql = ["set mapred.job.name=#{job_name};",
                         "set hive.exec.dynamic.partition.mode=nonstrict;",
-                        "set hive.exec.max.dynamic.partitions.pernode=
+                        "set hive.exec.max.dynamic.partitions.pernode=1000;",
                         "set hive.exec.dynamic.partition=true;",
                         "set hive.exec.max.created.files = 200000;",
                         "set hive.max.created.files = 200000;"].join

@@ -322,20 +350,12 @@ module Mobilize
       part_set_hql = "set hive.cli.print.header=true;set mapred.job.name=#{job_name} (permutations);"
       part_select_hql = "select distinct #{target_part_stmt} from #{temp_table_path};"
       part_perm_hql = part_set_hql + part_select_hql
-
-      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
-      part_perm_tsv = response['stdout']
+      part_perm_tsv = Hive.run(cluster, part_perm_hql, user_name)['stdout']
       #having gotten the permutations, ensure they are dropped
       part_hash_array = part_perm_tsv.tsv_to_hash_array
-      #make sure there is data
-      if part_hash_array.first.nil? or part_hash_array.first.values.include?(nil)
-        #blank result set, return url
-        return url
-      end

       part_drop_hql = part_hash_array.map do |h|
         part_drop_stmt = h.map do |name,value|
-          part_defs[name[1..-2]]
+          part_defs[name[1..-2]]=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
         end.join(",")
         "use #{db};alter table #{table} drop if exists partition (#{part_drop_stmt});"
       end.join
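The datatype-aware ternary above quotes only string-typed partition values; a sketch (the quoted header form in `row`, and `name[1..-2]` stripping those quote characters, are assumptions for illustration):

``` ruby
# String-typed partitions get quoted values in the drop statement,
# numeric ones don't. part_defs maps partition name => datatype;
# row simulates one permutation from the distinct-partitions query,
# with header names carrying surrounding quote characters.
part_defs = { 'act_date' => 'string', 'bucket' => 'int' }
row = { "'act_date'" => '2013-01-01', "'bucket'" => '7' }
part_drop_stmt = row.map do |name, value|
  part_defs[name[1..-2]] == 'string' ? "#{name}='#{value}'" : "#{name}=#{value}"
end.join(",")
puts part_drop_stmt
```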
@@ -348,12 +368,12 @@ module Mobilize

       target_full_hql = [target_set_hql, target_create_hql, target_insert_hql, temp_drop_hql].join

-
-      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
+      Hive.run(cluster, target_full_hql, user_name)
     else
       error_msg = "Incompatible partition specs"
       raise error_msg
     end
+      url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
     return url
   end
@@ -361,12 +381,6 @@ module Mobilize
     #Accepts options to drop existing target if any
     #also schema with column datatype overrides
     def Hive.tsv_to_table(cluster, db, table, part_array, source_tsv, user_name, drop=false, schema_hash=nil)
-      return nil if source_tsv.strip.length==0
-      if source_tsv.index("\r\n")
-        source_tsv = source_tsv.gsub("\r\n","\n")
-      elsif source_tsv.index("\r")
-        source_tsv = source_tsv.gsub("\r","\n")
-      end
       source_headers = source_tsv.tsv_header_array

       table_path = [db,table].join(".")

@@ -374,8 +388,6 @@ module Mobilize

       schema_hash ||= {}

-      url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
-
       if part_array.length == 0 and
         table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].nil?}
         #no partitions in either user params or the target table

@@ -402,11 +414,10 @@ module Mobilize

       target_full_hql = [target_drop_hql,target_create_hql,target_insert_hql].join(";")

-
-      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
+      Hive.run(cluster, target_full_hql, user_name, file_hash)

     elsif part_array.length > 0 and
-        table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']}
+        table_stats.ie{|tts| tts.nil? || drop || tts['partitions'].to_a.map{|p| p['name']} == part_array}
       #partitions and no target table
       #or same partitions in both target table and user params
       #or drop and start fresh

@@ -430,17 +441,13 @@ module Mobilize
                        "partitioned by #{partition_defs}"

       #create target table early if not here
-
-      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
-
-      #return url (operation complete) if there's no data
-      source_hash_array = source_tsv.tsv_to_hash_array
-      return url if source_hash_array.length==1 and source_hash_array.first.values.compact.length==0
+      Hive.run(cluster, target_create_hql, user_name)

       table_stats = Hive.table_stats(cluster, db, table, user_name)

       #create data hash from source hash array
       data_hash = {}
+      source_hash_array = source_tsv.tsv_to_hash_array
       source_hash_array.each do |ha|
         tpmk = part_array.map{|pn| "#{pn}=#{ha[pn]}"}.join("/")
         tpmv = ha.reject{|k,v| part_array.include?(k)}.values.join("\001")

@@ -473,8 +480,7 @@ module Mobilize
       #run actual partition adds all at once
       if target_part_hql.length>0
         puts "Adding partitions to #{cluster}/#{db}/#{table} for #{user_name} at #{Time.now.utc}"
-
-        raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
+        Hive.run(cluster, target_part_hql, user_name)
       end
     else
       error_msg = "Incompatible partition specs: " +

@@ -482,7 +488,7 @@ module Mobilize
                   "user_params:#{part_array.to_s}"
       raise error_msg
     end
-
+      url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
     return url
   end
@@ -503,7 +509,7 @@ module Mobilize
       job_name = s.path.sub("Runner_","")

       schema_hash = if params['schema']
-                      Hive.schema_hash(params['schema'],
+                      Hive.schema_hash(params['schema'],user_name,gdrive_slot)
                     else
                       {}
                     end

@@ -519,11 +525,11 @@ module Mobilize
       #source table
       cluster,source_path = source.path.split("/").ie{|sp| [sp.first, sp[1..-1].join(".")]}
       source_hql = "select * from #{source_path};"
-
+    elsif ['gsheet','gridfs','hdfs'].include?(source.handler)
       if source.path.ie{|sdp| sdp.index(/\.[A-Za-z]ql$/) or sdp.ends_with?(".ql")}
         source_hql = source.read(user_name,gdrive_slot)
       else
-          #tsv from sheet
+        #tsv from sheet
         source_tsv = source.read(user_name,gdrive_slot)
       end
     end

@@ -545,13 +551,9 @@ module Mobilize

     result = begin
       url = if source_hql
-
-        run_params = params['params']
-        Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,run_params)
+        Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash)
       elsif source_tsv
         Hive.tsv_to_table(cluster, db, table, part_array, source_tsv, user_name, drop, schema_hash)
-      elsif source
-        #null sheet
       else
         raise "Unable to determine source tsv or source hql"
       end

@@ -578,8 +580,11 @@ module Mobilize
       select_hql = "select * from #{source_path};"
       hql = [set_hql,select_hql].join
       response = Hive.run(cluster, hql,user_name)
-
-
+      if response['exit_code']==0
+        return response['stdout']
+      else
+        raise "Unable to read hive://#{dst_path} with error: #{response['stderr']}"
+      end
     end

     def Hive.write_by_dataset_path(dst_path,source_tsv,user_name,*args)
data/lib/mobilize-hive/tasks.rb
CHANGED
data/lib/samples/hive.yml CHANGED

@@ -1,23 +1,17 @@
 ---
 development:
-  prepends:
-  - "hive.stats.autogather=false"
   clusters:
     dev_cluster:
       max_slots: 5
       temp_table_db: mobilize
       exec_path: /path/to/hive
 test:
-  prepends:
-  - "hive.stats.autogather=false"
   clusters:
     test_cluster:
       max_slots: 5
       temp_table_db: mobilize
       exec_path: /path/to/hive
 production:
-  prepends:
-  - "hive.stats.autogather=false"
   clusters:
     prod_cluster:
       max_slots: 5
data/mobilize-hive.gemspec CHANGED

@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
   gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
   gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
   gem.require_paths = ["lib"]
-  gem.add_runtime_dependency "mobilize-hdfs","1.
+  gem.add_runtime_dependency "mobilize-hdfs","1.291"
 end
data/test/hive_job_rows.yml ADDED

@@ -0,0 +1,26 @@
+---
+- name: hive_test_1
+  active: true
+  trigger: once
+  status: ""
+  stage1: hive.write target:"mobilize/hive_test_1", partitions:"act_date", drop:true,
+    source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema"
+  stage2: hive.run source:"hive_test_1.hql"
+  stage3: hive.run hql:"show databases;"
+  stage4: gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"
+  stage5: gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"
+- name: hive_test_2
+  active: true
+  trigger: after hive_test_1
+  status: ""
+  stage1: hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true
+  stage2: hive.run hql:"select * from mobilize.hive_test_2;"
+  stage3: gsheet.write source:"stage2", target:"hive_test_2.out"
+- name: hive_test_3
+  active: true
+  trigger: after hive_test_2
+  status: ""
+  stage1: hive.run hql:"select act_date as `date`,product,category,value from mobilize.hive_test_1;"
+  stage2: hive.write source:"stage1",target:"mobilize/hive_test_3", partitions:"date/product", drop:true
+  stage3: hive.write hql:"select * from mobilize.hive_test_3;",target:"mobilize/hive_test_3", partitions:"date/product", drop:false
+  stage4: gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"
File without changes
File without changes
File without changes
data/test/mobilize-hive_test.rb ADDED
@@ -0,0 +1,96 @@
+require 'test_helper'
+
+describe "Mobilize" do
+
+  def before
+    puts 'nothing before'
+  end
+
+  # enqueues 4 workers on Resque
+  it "runs integration test" do
+
+    puts "restart workers"
+    Mobilize::Jobtracker.restart_workers!
+
+    gdrive_slot = Mobilize::Gdrive.owner_email
+    puts "create user 'mobilize'"
+    user_name = gdrive_slot.split("@").first
+    u = Mobilize::User.where(:name=>user_name).first
+    r = u.runner
+
+    puts "add test_source data"
+    hive_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
+    [hive_1_in_sheet].each {|s| s.delete if s}
+    hive_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
+    hive_1_in_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_in.yml").hash_array_to_tsv
+    hive_1_in_sheet.write(hive_1_in_tsv,Mobilize::Gdrive.owner_name)
+
+    hive_1_schema_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
+    [hive_1_schema_sheet].each {|s| s.delete if s}
+    hive_1_schema_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
+    hive_1_schema_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_schema.yml").hash_array_to_tsv
+    hive_1_schema_sheet.write(hive_1_schema_tsv,Mobilize::Gdrive.owner_name)
+
+    hive_1_hql_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
+    [hive_1_hql_sheet].each {|s| s.delete if s}
+    hive_1_hql_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
+    hive_1_hql_tsv = File.open("#{Mobilize::Base.root}/test/hive_test_1.hql").read
+    hive_1_hql_sheet.write(hive_1_hql_tsv,Mobilize::Gdrive.owner_name)
+
+    jobs_sheet = r.gsheet(gdrive_slot)
+
+    test_job_rows = ::YAML.load_file("#{Mobilize::Base.root}/test/hive_job_rows.yml")
+    test_job_rows.map{|j| r.jobs(j['name'])}.each{|j| j.delete if j}
+    jobs_sheet.add_or_update_rows(test_job_rows)
+
+    hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
+    [hive_1_stage_2_target_sheet].each{|s| s.delete if s}
+    hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
+    [hive_1_stage_3_target_sheet].each{|s| s.delete if s}
+    hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
+    [hive_2_target_sheet].each{|s| s.delete if s}
+    hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
+    [hive_3_target_sheet].each{|s| s.delete if s}
+
+    puts "job row added, force enqueued requestor, wait for stages"
+    r.enqueue!
+    wait_for_stages(1200)
+
+    puts "jobtracker posted data to test sheet"
+    hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
+    hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
+    hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
+    hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
+
+    assert hive_1_stage_2_target_sheet.read(u.name).length == 219
+    assert hive_1_stage_3_target_sheet.read(u.name).length > 3
+    assert hive_2_target_sheet.read(u.name).length == 599
+    assert hive_3_target_sheet.read(u.name).length == 347
+  end
+
+  def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
+    time = 0
+    time_since_stage = 0
+    #check for 10 min
+    while time < time_limit and time_since_stage < stage_limit
+      sleep wait_length
+      job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
+      if job_classes.include?("Mobilize::Stage")
+        time_since_stage = 0
+        puts "saw stage at #{time.to_s} seconds"
+      else
+        time_since_stage += wait_length
+        puts "#{time_since_stage.to_s} seconds since stage seen"
+      end
+      time += wait_length
+      puts "total wait time #{time.to_s} seconds"
+    end
+
+    if time >= time_limit
+      raise "Timed out before stage completion"
+    end
+  end
+
+end
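The `wait_for_stages` helper above polls Resque with two cut-offs: a hard total time limit, and an inactivity limit since the last `Mobilize::Stage` job was seen on the queue. A condensed sketch of that polling pattern (the `wait_for` name and block interface are illustrative; sleeping is omitted here so the loop is easy to exercise):

```ruby
# Condensed sketch of wait_for_stages' two-limit polling loop.
# The block reports whether a stage job is still visible on the queue;
# the loop stops when total time exceeds time_limit, or when no stage
# has been seen for stage_limit seconds. sleep is omitted in the sketch.
def wait_for(time_limit: 600, stage_limit: 120, wait_length: 10)
  time = 0
  time_since_stage = 0
  while time < time_limit && time_since_stage < stage_limit
    stage_seen = yield
    time_since_stage = stage_seen ? 0 : time_since_stage + wait_length
    time += time_limit.zero? ? 0 : wait_length
  end
  raise "Timed out before stage completion" if time >= time_limit
  time
end
```

The inactivity limit lets the test return shortly after the queue drains, while the hard limit guards against a stuck stage that keeps re-enqueueing.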
data/test/test_helper.rb CHANGED

metadata CHANGED
@@ -1,32 +1,29 @@
 --- !ruby/object:Gem::Specification
 name: mobilize-hive
 version: !ruby/object:Gem::Version
-  version: '1.
-  prerelease:
+  version: '1.291'
 platform: ruby
 authors:
 - Cassio Paes-Leme
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-
+date: 2013-03-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mobilize-hdfs
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
     - - '='
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.291'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
     - - '='
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.291'
 description: Adds hive read, write, and run support to mobilize-hdfs
 email:
 - cpaesleme@dena.com
@@ -41,61 +38,45 @@ files:
 - Rakefile
 - lib/mobilize-hive.rb
 - lib/mobilize-hive/handlers/hive.rb
-- lib/mobilize-hive/helpers/hive_helper.rb
 - lib/mobilize-hive/tasks.rb
 - lib/mobilize-hive/version.rb
 - lib/samples/hive.yml
 - mobilize-hive.gemspec
-- test/
-- test/
-- test/
-- test/
-- test/
-- test/fixtures/hive4_stage2.in.yml
-- test/fixtures/integration_expected.yml
-- test/fixtures/integration_jobs.yml
-- test/integration/mobilize-hive_test.rb
+- test/hive_job_rows.yml
+- test/hive_test_1.hql
+- test/hive_test_1_in.yml
+- test/hive_test_1_schema.yml
+- test/mobilize-hive_test.rb
 - test/redis-test.conf
 - test/test_helper.rb
 homepage: http://github.com/dena/mobilize-hive
 licenses: []
+metadata: {}
 post_install_message:
 rdoc_options: []
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - '>='
     - !ruby/object:Gem::Version
       version: '0'
-  segments:
-  - 0
-  hash: 837156919845089008
 required_rubygems_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
-  segments:
-  - 0
-  hash: 837156919845089008
 requirements: []
 rubyforge_project:
-rubygems_version:
+rubygems_version: 2.0.3
 signing_key:
-specification_version:
+specification_version: 4
 summary: Adds hive read, write, and run support to mobilize-hdfs
 test_files:
-- test/
-- test/
-- test/
-- test/
-- test/
-- test/fixtures/hive4_stage2.in.yml
-- test/fixtures/integration_expected.yml
-- test/fixtures/integration_jobs.yml
-- test/integration/mobilize-hive_test.rb
+- test/hive_job_rows.yml
+- test/hive_test_1.hql
+- test/hive_test_1_in.yml
+- test/hive_test_1_schema.yml
+- test/mobilize-hive_test.rb
 - test/redis-test.conf
 - test/test_helper.rb
data/lib/mobilize-hive/helpers/hive_helper.rb DELETED
@@ -1,67 +0,0 @@
-module Mobilize
-  module Hive
-    def self.config
-      Base.config('hive')
-    end
-
-    def self.exec_path(cluster)
-      self.clusters[cluster]['exec_path']
-    end
-
-    def self.output_db(cluster)
-      self.clusters[cluster]['output_db']
-    end
-
-    def self.output_db_user(cluster)
-      output_db_node = Hadoop.gateway_node(cluster)
-      output_db_user = Ssh.host(output_db_node)['user']
-      output_db_user
-    end
-
-    def self.clusters
-      self.config['clusters']
-    end
-
-    def self.slot_ids(cluster)
-      (1..self.clusters[cluster]['max_slots']).to_a.map{|s| "#{cluster}_#{s.to_s}"}
-    end
-
-    def self.prepends
-      self.config['prepends']
-    end
-
-    def self.slot_worker_by_cluster_and_path(cluster,path)
-      working_slots = Mobilize::Resque.jobs.map{|j| begin j['args'][1]['hive_slot'];rescue;nil;end}.compact.uniq
-      self.slot_ids(cluster).each do |slot_id|
-        unless working_slots.include?(slot_id)
-          Mobilize::Resque.set_worker_args_by_path(path,{'hive_slot'=>slot_id})
-          return slot_id
-        end
-      end
-      #return false if none are available
-      return false
-    end
-
-    def self.unslot_worker_by_path(path)
-      begin
-        Mobilize::Resque.set_worker_args_by_path(path,{'hive_slot'=>nil})
-        return true
-      rescue
-        return false
-      end
-    end
-
-    def self.databases(cluster,user_name)
-      self.run(cluster,"show databases",user_name)['stdout'].split("\n")
-    end
-
-    def self.default_params
-      time = Time.now.utc
-      {
-        '$utc_date'=>time.strftime("%Y-%m-%d"),
-        '$utc_time'=>time.strftime("%H:%M"),
-      }
-    end
-  end
-end
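The removed `default_params` hash above supplies `$utc_date` and `$utc_time` tokens, which jobs such as `hive4` reference in their HQL and `params`. A minimal sketch of the token substitution it implies (the `substitute` helper is illustrative, not Mobilize's actual interpolation code):

```ruby
# Illustrative only: the substitution implied by default_params.
# Mobilize's real handler may interpolate params differently.
def default_params(now = Time.now.utc)
  {
    '$utc_date' => now.strftime("%Y-%m-%d"),
    '$utc_time' => now.strftime("%H:%M")
  }
end

def substitute(hql, now = Time.now.utc)
  # String patterns passed to gsub are literal, so '$' needs no escaping
  default_params(now).reduce(hql) { |s, (token, value)| s.gsub(token, value) }
end
```

With this shape, a stage like `hive4`'s stage3 would see `$utc_date $utc_time` replaced by the current UTC date and time at run time.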
data/test/fixtures/hive1.sql DELETED
@@ -1 +0,0 @@
-select act_date,product, sum(value) as sum from mobilize.hive_test_1 group by act_date,product;

data/test/fixtures/hive4_stage1.in DELETED
@@ -1 +0,0 @@
-
data/test/fixtures/integration_expected.yml DELETED
@@ -1,69 +0,0 @@
----
-- path: "Runner_mobilize(test)/jobs"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive1/stage1"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive1/stage2"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive1/stage3"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive1/stage4"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive1/stage5"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive2/stage1"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive2/stage2"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive2/stage3"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive3/stage1"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive3/stage2"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive3/stage3"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive3/stage4"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive4/stage1"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive4/stage2"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive4/stage3"
-  state: working
-  count: 1
-  confirmed_ats: []
-- path: "Runner_mobilize(test)/jobs/hive4/stage4"
-  state: working
-  count: 1
-  confirmed_ats: []
data/test/fixtures/integration_jobs.yml DELETED
@@ -1,34 +0,0 @@
----
-- name: hive1
-  active: true
-  trigger: once
-  status: ""
-  stage1: hive.write target:"mobilize/hive1", partitions:"act_date", drop:true,
-    source:"Runner_mobilize(test)/hive1.in", schema:"hive1.schema"
-  stage2: hive.run source:"hive1.sql"
-  stage3: hive.run hql:"show databases;"
-  stage4: gsheet.write source:"stage2", target:"hive1_stage2.out"
-  stage5: gsheet.write source:"stage3", target:"hive1_stage3.out"
-- name: hive2
-  active: true
-  trigger: after hive1
-  status: ""
-  stage1: hive.write source:"hdfs://user/mobilize/test/hdfs1.out", target:"mobilize.hive2", drop:true
-  stage2: hive.run hql:"select * from mobilize.hive2;"
-  stage3: gsheet.write source:"stage2", target:"hive2.out"
-- name: hive3
-  active: true
-  trigger: after hive2
-  status: ""
-  stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive1;", params:{'date':'2013-01-01'}
-  stage2: hive.write source:"stage1",target:"mobilize/hive3", partitions:"date/product", drop:true
-  stage3: hive.write hql:"select * from mobilize.hive3;",target:"mobilize/hive3", partitions:"date/product", drop:false
-  stage4: gsheet.write source:"hive://mobilize/hive3", target:"hive3.out"
-- name: hive4
-  active: true
-  trigger: after hive3
-  status: ""
-  stage1: hive.write source:"hive4_stage1.in", target:"mobilize/hive1", partitions:"act_date"
-  stage2: hive.write source:"hive4_stage2.in", target:"mobilize/hive1", partitions:"act_date"
-  stage3: hive.run hql:"select '@date $utc_time' as `date_time`,product,category,value from mobilize.hive1;", params:{'date':'$utc_date'}
-  stage4: gsheet.write source:stage3, target:"hive4.out"
data/test/integration/mobilize-hive_test.rb DELETED
@@ -1,43 +0,0 @@
-require 'test_helper'
-describe "Mobilize" do
-  # enqueues 4 workers on Resque
-  it "runs integration test" do
-
-    puts "restart workers"
-    Mobilize::Jobtracker.restart_workers!
-
-    u = TestHelper.owner_user
-    r = u.runner
-    user_name = u.name
-    gdrive_slot = u.email
-
-    puts "add test data"
-    ["hive1.in","hive4_stage1.in","hive4_stage2.in","hive1.schema","hive1.sql"].each do |fixture_name|
-      target_url = "gsheet://#{r.title}/#{fixture_name}"
-      TestHelper.write_fixture(fixture_name, target_url, 'replace')
-    end
-
-    puts "add/update jobs"
-    u.jobs.each{|j| j.delete}
-    jobs_fixture_name = "integration_jobs"
-    jobs_target_url = "gsheet://#{r.title}/jobs"
-    TestHelper.write_fixture(jobs_fixture_name, jobs_target_url, 'update')
-
-    puts "job rows added, force enqueue runner, wait for stages"
-    #wait for stages to complete
-    expected_fixture_name = "integration_expected"
-    Mobilize::Jobtracker.stop!
-    r.enqueue!
-    TestHelper.confirm_expected_jobs(expected_fixture_name,2100)
-
-    puts "update job status and activity"
-    r.update_gsheet(gdrive_slot)
-
-    puts "check posted data"
-    assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage2.out", 'min_length' => 219) == true
-    assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage3.out", 'min_length' => 3) == true
-    assert TestHelper.check_output("gsheet://#{r.title}/hive2.out", 'min_length' => 599) == true
-    assert TestHelper.check_output("gsheet://#{r.title}/hive3.out", 'min_length' => 347) == true
-    assert TestHelper.check_output("gsheet://#{r.title}/hive4.out", 'min_length' => 432) == true
-  end
-end