mobilize-hive 1.35 → 1.36

data/README.md CHANGED
@@ -1,254 +1,4 @@
- Mobilize-Hive
- ===============
+ Mobilize
+ ========
 
- Mobilize-Hive adds the power of hive to [mobilize-hdfs][mobilize-hdfs].
- * read, write, and copy hive files through Google Spreadsheets.
-
- Table Of Contents
- -----------------
- * [Overview](#section_Overview)
- * [Install](#section_Install)
-   * [Mobilize-Hive](#section_Install_Mobilize-Hive)
-   * [Install Dirs and Files](#section_Install_Dirs_and_Files)
- * [Configure](#section_Configure)
-   * [Hive](#section_Configure_Hive)
- * [Start](#section_Start)
-   * [Create Job](#section_Start_Create_Job)
-   * [Run Test](#section_Start_Run_Test)
- * [Meta](#section_Meta)
- * [Special Thanks](#section_Special_Thanks)
- * [Author](#section_Author)
-
- <a name='section_Overview'></a>
- Overview
- -----------
-
- * Mobilize-hive adds Hive methods to mobilize-hdfs.
-
- <a name='section_Install'></a>
- Install
- ------------
-
- Make sure you go through all the steps in the
- [mobilize-base][mobilize-base],
- [mobilize-ssh][mobilize-ssh], and
- [mobilize-hdfs][mobilize-hdfs]
- install sections first.
-
- <a name='section_Install_Mobilize-Hive'></a>
- ### Mobilize-Hive
-
- Add this to your Gemfile:
-
- ``` ruby
- gem "mobilize-hive"
- ```
-
- or do
-
-     $ gem install mobilize-hive
-
- for a ruby-wide install.
-
- <a name='section_Install_Dirs_and_Files'></a>
- ### Dirs and Files
-
- ### Rakefile
-
- Inside the Rakefile in your project's root dir, make sure you have:
-
- ``` ruby
- require 'mobilize-base/tasks'
- require 'mobilize-ssh/tasks'
- require 'mobilize-hdfs/tasks'
- require 'mobilize-hive/tasks'
- ```
-
- This defines rake tasks essential to run the environment.
-
- ### Config Dir
-
- Run
-
-     $ rake mobilize_hive:setup
-
- This will copy over a sample hive.yml to your config dir.
-
- <a name='section_Configure'></a>
- Configure
- ------------
-
- <a name='section_Configure_Hive'></a>
- ### Configure Hive
-
- * Hive is big data. That means we need to be careful when reading from
- the cluster, as it could easily fill up our mongodb instance, RAM, local
- disk space, etc.
- * To achieve this, all hive operations, stage outputs, etc. are
- executed and stored on the cluster only.
- * The exceptions are:
-   * writing to the cluster from an external source, such as a google
- sheet. Here there is no risk, as the external source has much stricter
- size limits than hive.
-   * reading from the cluster, such as for posting to a google sheet. In
- this case, the read_limit parameter dictates the maximum amount that can
- be read. If the data is bigger than the read limit, an exception will be
- raised.
-
- The Hive configuration consists of:
- * clusters - this defines aliases for clusters, which are used as
- parameters for Hive stages. They should have the same names as those
- in hadoop.yml. Each cluster has:
-   * max_slots - defines the total number of simultaneous slots to be
- used for hive jobs on this cluster
-   * output_db - defines the db which should be used to hold stage outputs.
-     * This db must have open permissions (777) so any user on the system can
- write to it -- the tables inside will be owned by the users themselves.
-   * exec_path - defines the path to the hive executable
-
- Sample hive.yml:
-
- ``` yml
- ---
- development:
-   clusters:
-     dev_cluster:
-       max_slots: 5
-       output_db: mobilize
-       exec_path: /path/to/hive
- test:
-   clusters:
-     test_cluster:
-       max_slots: 5
-       output_db: mobilize
-       exec_path: /path/to/hive
- production:
-   clusters:
-     prod_cluster:
-       max_slots: 5
-       output_db: mobilize
-       exec_path: /path/to/hive
- ```
-
- <a name='section_Start'></a>
- Start
- -----
-
- <a name='section_Start_Create_Job'></a>
- ### Create Job
-
- * For mobilize-hive, the following stages are available.
-   * cluster and user are optional for all of the below.
-     * cluster defaults to the first cluster listed;
-     * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
-   * params are also optional for all of the below. They substitute values into the HQL in sources.
-   * params are passed as YML or JSON, as in:
-     * `hive.run source:<source_path>, params:{'date': '2013-03-01', 'unit': 'widgets'}`
-     * this example replaces all the keys, preceded by '@', in all source HQLs with their values.
-     * The preceding '@' is used to keep from replacing instances
- of "date" and "unit" in the HQL; you should have `@date` and `@unit` in your actual HQL
- if you'd like to replace those tokens.
-   * in addition, the following params are substituted automatically:
-     * `$utc_date` - replaced with YYYY-MM-DD date, UTC
-     * `$utc_time` - replaced with HH:MM time, UTC
-     * any occurrence of these values in HQL will be replaced at runtime.
- * hive.run `hql:<hql> || source:<gsheet_path>, user:<user>, cluster:<cluster>`, which executes the
- script in the hql or source sheet and returns any output specified at the
- end. If the cmd or last query in source is a select statement, column headers will be
- returned as well.
- * hive.write `hql:<hql> || source:<source_path>, target:<hive_path>, partitions:<partition_path>, user:<user>, cluster:<cluster>, schema:<gsheet_path>, drop:<true/false>`,
- which writes the source or query result to the selected hive table.
-   * hive_path
-     * should be of the form `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
-   * source:
-     * can be a gsheet_path, hdfs_path, or hive_path (no partitions)
-     * for gsheet and hdfs paths,
-       * if the file ends in .*ql, it's treated the same as passing hql
-       * otherwise it is treated as a tsv with the first row as column headers
-   * target:
-     * should be a hive_path, as in `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
-   * partitions:
-     * Due to a Hive limitation, partition names CANNOT be reserved keywords when writing from tsv (gsheet or hdfs source).
-     * Partitions should be specified as a path, as in partitions:`<partition1>/<partition2>`.
-   * schema:
-     * optional. gsheet_path to column schema.
-     * two columns: name, datatype
-     * Any columns not defined here will receive "string" as the datatype
-     * partitions can have their datatypes overridden here as well
-     * columns named here that are not in the dataset will be ignored
-   * drop:
-     * optional. drops the target table before performing the write
-     * defaults to false
-
- <a name='section_Start_Run_Test'></a>
- ### Run Test
-
- To run tests, you will need to:
-
- 1) go through the [mobilize-base][mobilize-base], [mobilize-ssh][mobilize-ssh], [mobilize-hdfs][mobilize-hdfs] tests first
-
- 2) clone the mobilize-hive repository
-
- From the project folder, run
-
- 3) $ rake mobilize_hive:setup
-
- Copy over the config files from the mobilize-base, mobilize-ssh,
- mobilize-hdfs projects into the config dir, and populate the values in the hive.yml file.
-
- Make sure you use the same names for your hive clusters as you do in
- hadoop.yml.
-
- 4) $ rake test
-
- * The test runs these jobs:
-   * hive_test_1:
-     * `hive.write target:"mobilize/hive_test_1/act_date", source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema", drop:true`
-     * `hive.run source:"hive_test_1.hql"`
-     * `hive.run cmd:"show databases"`
-     * `gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"`
-     * `gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"`
-     * hive_test_1.hql runs a select statement on the table created in the
- write command.
-     * at the end of the test, there should be two sheets, one with a
- sum of the data as in your write query, one with the results of the show
- databases command.
-   * hive_test_2:
-     * `hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true`
-     * `hive.run cmd:"select * from mobilize.hive_test_2"`
-     * `gsheet.write source:"stage2", target:"hive_test_2.out"`
-     * this test uses the output from the first hdfs test as an input, so make sure you've run that first.
-   * hive_test_3:
-     * `hive.write source:"hive://mobilize.hive_test_1", target:"mobilize/hive_test_3/date/product", drop:true`
-     * `hive.run hql:"select act_date as `date`,product,category,value from mobilize.hive_test_1;"`
-     * `hive.write source:"stage2", target:"mobilize/hive_test_3/date/product", drop:false`
-     * `gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"`
-
- <a name='section_Meta'></a>
- Meta
- ----
-
- * Code: `git clone git://github.com/dena/mobilize-hive.git`
- * Home: <https://github.com/dena/mobilize-hive>
- * Bugs: <https://github.com/dena/mobilize-hive/issues>
- * Gems: <http://rubygems.org/gems/mobilize-hive>
-
- <a name='section_Special_Thanks'></a>
- Special Thanks
- --------------
- * This release goes to Toby Negrin, who championed this project with
- DeNA and gave me the support to get it properly architected, tested, and documented.
- * Also many thanks to the Analytics team at DeNA, who build and maintain
- our Big Data infrastructure.
-
- <a name='section_Author'></a>
- Author
- ------
-
- Cassio Paes-Leme :: cpaesleme@dena.com :: @cpaesleme
-
- [mobilize-base]: https://github.com/dena/mobilize-base
- [mobilize-ssh]: https://github.com/dena/mobilize-ssh
- [mobilize-hdfs]: https://github.com/dena/mobilize-hdfs
+ Please refer to the mobilize-server wiki: https://github.com/DeNA/mobilize-server/wiki
@@ -94,22 +94,29 @@ module Mobilize
 
    #run a generic hive command, with the option of passing a file hash to be locally available
    def Hive.run(cluster,hql,user_name,params=nil,file_hash=nil)
-     # no TempStatsStore
-     hql = "set hive.stats.autogather=false;#{hql}"
+     preps = Hive.prepends.map do |p|
+       prefix = "set "
+       suffix = ";"
+       prep_out = p
+       prep_out = "#{prefix}#{prep_out}" unless prep_out.starts_with?(prefix)
+       prep_out = "#{prep_out}#{suffix}" unless prep_out.ends_with?(suffix)
+       prep_out
+     end.join
+     hql = "#{preps}#{hql}"
      filename = hql.to_md5
      file_hash||= {}
      file_hash[filename] = hql
-     #add in default params
      params ||= {}
-     params = params.merge(Hive.default_params)
      #replace any params in the file_hash and command
      params.each do |k,v|
        file_hash.each do |name,data|
-         if k.starts_with?("$")
-           data.gsub!(k,v)
-         else
-           data.gsub!("@#{k}",v)
-         end
+         data.gsub!("@#{k}",v)
+       end
+     end
+     #add in default params
+     Hive.default_params.each do |k,v|
+       file_hash.each do |name,data|
+         data.gsub!(k,v)
        end
      end
      #silent mode so we don't have logs in stderr; clip output
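
Note on the reordering above: user-supplied params are now substituted before the automatic defaults, so a user value may itself be a default token (the hive4 fixture later in this diff relies on exactly that). Below is a minimal plain-Ruby sketch of the two passes; the literal values and the `{'$utc_date' => ...}` shape of the defaults are illustrative assumptions, not the gem's API.

``` ruby
# Two-pass substitution sketch (plain Ruby, not the gem's code paths).
# Pass 1: user params fill @-prefixed tokens.
# Pass 2: default params fill literal tokens such as $utc_date,
#         including tokens introduced by pass 1.
hql = "select '@date $utc_time' as `date_time` from mobilize.hive1;"

user_params    = { 'date' => '$utc_date' }                                # from the stage's params:{}
default_params = { '$utc_date' => '2013-05-21', '$utc_time' => '04:00' }  # assumed shape of Hive.default_params

user_params.each    { |k, v| hql = hql.gsub("@#{k}", v) }
default_params.each { |k, v| hql = hql.gsub(k, v) }

puts hql
# => select '2013-05-21 04:00' as `date_time` from mobilize.hive1;
```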
@@ -155,9 +162,9 @@ module Mobilize
      Gdrive.unslot_worker_by_path(stage_path)
 
      #check for select at end
-     hql_array = hql.split(";").map{|hc| hc.strip}.reject{|hc| hc.length==0}
-     last_statement = hql_array.last.downcase.split("\n").reject{|l| l.starts_with?("-- ")}.first
-     if last_statement.to_s.starts_with?("select")
+     hql_array = hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
+     last_statement = hql_array.last
+     if last_statement.to_s.downcase.starts_with?("select")
        #nil if no prior commands
        prior_hql = hql_array[0..-2].join(";") if hql_array.length > 1
        select_hql = hql_array.last
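
Note: the final statement is now located after comment lines are stripped, so a trailing select preceded by a `--` comment is detected correctly, and comments written without a trailing space no longer slip through. A plain-Ruby sketch of the decomment-then-split logic (`start_with?` stands in for the gem's `starts_with?` core extension):

``` ruby
# Decomment, then split on ";" to find the final statement.
hql = "-- summarize the test table\ndrop table if exists scratch.t1;\nselect count(*) from mobilize.hive_test_1;"

statements = hql.split("\n")
                .reject { |l| l.start_with?("--") || l.strip.empty? }
                .join("\n")
                .split(";")
                .map(&:strip)

puts statements.last.downcase.start_with?("select") # => true
puts statements[0..-2].join(";")                    # => drop table if exists scratch.t1
```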
@@ -181,41 +188,37 @@ module Mobilize
      response
    end
 
-   def Hive.schema_hash(schema_path,user_name,gdrive_slot)
-     if schema_path.index("/")
-       #slashes mean sheets
-       out_tsv = Gsheet.find_by_path(schema_path,gdrive_slot).read(user_name)
+   def Hive.schema_hash(schema_path,stage_path,user_name,gdrive_slot)
+     handler = if schema_path.index("://")
+                 schema_path.split("://").first
+               else
+                 "gsheet"
+               end
+     dst = "Mobilize::#{handler.downcase.capitalize}".constantize.path_to_dst(schema_path,stage_path,gdrive_slot)
+     out_raw = dst.read(user_name,gdrive_slot)
+     #determine the datatype for schema; accept json, yaml, tsv
+     if schema_path.ends_with?(".yml")
+       out_ha = begin;YAML.load(out_raw);rescue ScriptError, StandardError;nil;end if out_ha.nil?
      else
-       u = User.where(:name=>user_name).first
-       #check sheets in runner
-       r = u.runner
-       runner_sheet = r.gbook(gdrive_slot).worksheet_by_title(schema_path)
-       out_tsv = if runner_sheet
-                   runner_sheet.read(user_name)
-                 else
-                   #check for gfile. will fail if there isn't one.
-                   Gfile.find_by_path(schema_path).read(user_name)
-                 end
+       out_ha = begin;JSON.parse(out_raw);rescue ScriptError, StandardError;nil;end
+       out_ha = out_raw.tsv_to_hash_array if out_ha.nil?
      end
-     #use Gridfs to cache gdrive results
-     file_name = schema_path.split("/").last
-     out_url = "gridfs://#{schema_path}/#{file_name}"
-     Dataset.write_by_url(out_url,out_tsv,user_name)
-     schema_tsv = Dataset.find_by_url(out_url).read(user_name,gdrive_slot)
      schema_hash = {}
-     schema_tsv.tsv_to_hash_array.each do |ha|
-       schema_hash[ha['name']] = ha['datatype']
+     out_ha.each do |hash|
+       schema_hash[hash['name']] = hash['datatype']
      end
      schema_hash
    end
 
-   def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil, params=nil)
+   def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil, run_params=nil)
      table_path = [db,table].join(".")
      table_stats = Hive.table_stats(cluster, db, table, user_name)
      url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
 
-     source_hql_array = source_hql.split(";")
-     last_select_i = source_hql_array.rindex{|hql| hql.downcase.strip.starts_with?("select")}
+     #decomment hql
+
+     source_hql_array = source_hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
+     last_select_i = source_hql_array.rindex{|s| s.downcase.starts_with?("select")}
      #find the last select query -- it should be used for the temp table creation
      last_select_hql = (source_hql_array[last_select_i..-1].join(";")+";")
      #if there is anything prior to the last select, add it in prior to table creation
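
Note: `schema_hash` now accepts a schema as YAML, JSON, or a headered TSV, each reduced to the same `{name => datatype}` map. A self-contained sketch of that reduction; `tsv_to_hash_array` below is a hypothetical stand-in for the mobilize-base core extension of the same name:

``` ruby
require 'yaml'
require 'json'

yaml_schema = "---\n- name: act_date\n  datatype: string\n- name: value\n  datatype: int\n"
json_schema = '[{"name":"act_date","datatype":"string"},{"name":"value","datatype":"int"}]'
tsv_schema  = "name\tdatatype\nact_date\tstring\nvalue\tint"

# stand-in for the gem's String#tsv_to_hash_array core extension
def tsv_to_hash_array(tsv)
  header, *rows = tsv.split("\n").map { |l| l.split("\t") }
  rows.map { |r| Hash[header.zip(r)] }
end

[YAML.load(yaml_schema), JSON.parse(json_schema), tsv_to_hash_array(tsv_schema)].each do |out_ha|
  schema_hash = {}
  out_ha.each { |h| schema_hash[h['name']] = h['datatype'] }
  p schema_hash # => {"act_date"=>"string", "value"=>"int"} in all three cases
end
```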
@@ -228,7 +231,7 @@ module Mobilize
      temp_set_hql = "set mapred.job.name=#{job_name} (temp table);"
      temp_drop_hql = "drop table if exists #{temp_table_path};"
      temp_create_hql = "#{temp_set_hql}#{prior_hql}#{temp_drop_hql}create table #{temp_table_path} as #{last_select_hql}"
-     response = Hive.run(cluster,temp_create_hql,user_name,params)
+     response = Hive.run(cluster,temp_create_hql,user_name,run_params)
      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
 
      source_table_stats = Hive.table_stats(cluster,temp_db,temp_table_name,user_name)
@@ -267,7 +270,7 @@ module Mobilize
                        target_insert_hql,
                        temp_drop_hql].join
 
-     response = Hive.run(cluster, target_full_hql, user_name, params)
+     response = Hive.run(cluster, target_full_hql, user_name, run_params)
 
      raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
 
@@ -319,7 +322,7 @@ module Mobilize
        part_set_hql = "set hive.cli.print.header=true;set mapred.job.name=#{job_name} (permutations);"
        part_select_hql = "select distinct #{target_part_stmt} from #{temp_table_path};"
        part_perm_hql = part_set_hql + part_select_hql
-       response = Hive.run(cluster, part_perm_hql, user_name, params)
+       response = Hive.run(cluster, part_perm_hql, user_name, run_params)
        raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
        part_perm_tsv = response['stdout']
        #having gotten the permutations, ensure they are dropped
@@ -332,7 +335,7 @@ module Mobilize
 
        part_drop_hql = part_hash_array.map do |h|
          part_drop_stmt = h.map do |name,value|
-           part_defs[name[1..-2]]=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
+           part_defs[name[1..-2]].downcase=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
          end.join(",")
          "use #{db};alter table #{table} drop if exists partition (#{part_drop_stmt});"
        end.join
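
Note: the added `.downcase` makes the quoting decision case-insensitive; a schema sheet may declare the datatype as `STRING`, and string-typed partition values must be quoted in the drop statement. A minimal sketch with hypothetical values:

``` ruby
part_defs = { 'act_date' => 'STRING' }   # datatype as typed in a user's schema sheet
name, value = "`act_date`", "2013-01-01" # partition column (backtick-quoted) and value

# name[1..-2] strips the backticks before the schema lookup
stmt = part_defs[name[1..-2]].downcase == "string" ? "#{name}='#{value}'" : "#{name}=#{value}"
puts stmt # => `act_date`='2013-01-01'
```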
@@ -345,7 +348,7 @@ module Mobilize
 
        target_full_hql = [target_set_hql, target_create_hql, target_insert_hql, temp_drop_hql].join
 
-       response = Hive.run(cluster, target_full_hql, user_name, params)
+       response = Hive.run(cluster, target_full_hql, user_name, run_params)
        raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
      else
        error_msg = "Incompatible partition specs"
@@ -500,7 +503,7 @@ module Mobilize
      job_name = s.path.sub("Runner_","")
 
      schema_hash = if params['schema']
-                     Hive.schema_hash(params['schema'],user_name,gdrive_slot)
+                     Hive.schema_hash(params['schema'],stage_path,user_name,gdrive_slot)
                    else
                      {}
                    end
@@ -543,7 +546,8 @@ module Mobilize
      result = begin
        url = if source_hql
                #include any params (or nil) at the end
-               Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,params['params'])
+               run_params = params['params']
+               Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,run_params)
              elsif source_tsv
                Hive.tsv_to_table(cluster, db, table, part_array, source_tsv, user_name, drop, schema_hash)
              elsif source
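
Note: two distinct "params" layers are disentangled here: the stage's option hash, and its nested `'params'` key holding the HQL substitution values, now forwarded explicitly as `run_params`. A sketch of the shape implied by the stage syntax (values hypothetical):

``` ruby
# Stage options as parsed from e.g.
#   hive.write hql:"...", target:"mobilize/hive3", params:{'date':'2013-01-01'}
stage_params = {
  'hql'    => "select * from mobilize.hive1 where day='@date';",
  'target' => 'mobilize/hive3',
  'params' => { 'date' => '2013-01-01' } # substituted into the HQL at run time
}

run_params = stage_params['params']
p run_params # => {"date"=>"2013-01-01"}
```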
@@ -26,6 +26,10 @@ module Mobilize
      (1..self.clusters[cluster]['max_slots']).to_a.map{|s| "#{cluster}_#{s.to_s}"}
    end
 
+   def self.prepends
+     self.config['prepends']
+   end
+
    def self.slot_worker_by_cluster_and_path(cluster,path)
      working_slots = Mobilize::Resque.jobs.map{|j| begin j['args'][1]['hive_slot'];rescue;nil;end}.compact.uniq
      self.slot_ids(cluster).each do |slot_id|
@@ -1,3 +1,4 @@
+ require 'yaml'
  namespace :mobilize_hive do
    desc "Set up config and log folders and files"
    task :setup do
@@ -1,5 +1,5 @@
  module Mobilize
    module Hive
-     VERSION = "1.35"
+     VERSION = "1.36"
    end
  end
data/lib/mobilize-hive.rb CHANGED
@@ -3,6 +3,9 @@ require "mobilize-hdfs"
 
  module Mobilize
    module Hive
+     def Hive.home_dir
+       File.expand_path('..',File.dirname(__FILE__))
+     end
    end
  end
  require "mobilize-hive/handlers/hive"
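
Note: `Hive.home_dir` resolves the gem root (one level above `lib/`), so in-gem assets can be located by path; the test_helper change later in this diff uses the same pattern via `Mobilize::Hdfs.home_dir`. A sketch of the idea; the sample-config lookup is illustrative, `lib/samples/hive.yml` being the one asset this diff shows shipping with the gem:

``` ruby
# Same resolution as Hive.home_dir, relative to lib/mobilize-hive.rb:
home_dir    = File.expand_path('..', File.dirname(__FILE__))
samples_yml = File.join(home_dir, 'lib', 'samples', 'hive.yml')
puts samples_yml # e.g. .../gems/mobilize-hive-1.36/lib/samples/hive.yml
```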
data/lib/samples/hive.yml CHANGED
@@ -1,17 +1,23 @@
  ---
  development:
+   prepends:
+     - "hive.stats.autogather=false"
    clusters:
      dev_cluster:
        max_slots: 5
        temp_table_db: mobilize
        exec_path: /path/to/hive
  test:
+   prepends:
+     - "hive.stats.autogather=false"
    clusters:
      test_cluster:
        max_slots: 5
        temp_table_db: mobilize
        exec_path: /path/to/hive
  production:
+   prepends:
+     - "hive.stats.autogather=false"
    clusters:
      prod_cluster:
        max_slots: 5
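
Note: each `prepends` entry is normalized to a `set <entry>;` statement and concatenated in front of every HQL string `Hive.run` executes, generalizing the previously hard-coded `hive.stats.autogather` prefix. A plain-Ruby sketch of the normalization (`start_with?`/`end_with?` stand in for the gem's core extensions; the second entry is a hypothetical, already-normalized example):

``` ruby
prepends = ["hive.stats.autogather=false", "set mapred.job.queue.name=etl;"]

preps = prepends.map do |p|
  out = p
  out = "set #{out}" unless out.start_with?("set ") # add missing "set " prefix
  out = "#{out};"    unless out.end_with?(";")      # add missing ";" suffix
  out
end.join

puts preps + "select 1;"
# => set hive.stats.autogather=false;set mapred.job.queue.name=etl;select 1;
```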
@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
    gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
    gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
    gem.require_paths = ["lib"]
-   gem.add_runtime_dependency "mobilize-hdfs","1.35"
+   gem.add_runtime_dependency "mobilize-hdfs","1.36"
  end
@@ -0,0 +1 @@
+ select act_date,product, sum(value) as sum from mobilize.hive_test_1 group by act_date,product;
@@ -0,0 +1 @@
+
@@ -0,0 +1,4 @@
+ - act_date: ""
+   product: ""
+   category: ""
+   value: ""
@@ -0,0 +1,69 @@
+ ---
+ - path: "Runner_mobilize(test)/jobs"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive1/stage1"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive1/stage2"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive1/stage3"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive1/stage4"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive1/stage5"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive2/stage1"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive2/stage2"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive2/stage3"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive3/stage1"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive3/stage2"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive3/stage3"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive3/stage4"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive4/stage1"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive4/stage2"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive4/stage3"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hive4/stage4"
+   state: working
+   count: 1
+   confirmed_ats: []
@@ -0,0 +1,34 @@
+ ---
+ - name: hive1
+   active: true
+   trigger: once
+   status: ""
+   stage1: hive.write target:"mobilize/hive1", partitions:"act_date", drop:true,
+     source:"Runner_mobilize(test)/hive1.in", schema:"hive1.schema"
+   stage2: hive.run source:"hive1.sql"
+   stage3: hive.run hql:"show databases;"
+   stage4: gsheet.write source:"stage2", target:"hive1_stage2.out"
+   stage5: gsheet.write source:"stage3", target:"hive1_stage3.out"
+ - name: hive2
+   active: true
+   trigger: after hive1
+   status: ""
+   stage1: hive.write source:"hdfs://user/mobilize/test/hdfs1.out", target:"mobilize.hive2", drop:true
+   stage2: hive.run hql:"select * from mobilize.hive2;"
+   stage3: gsheet.write source:"stage2", target:"hive2.out"
+ - name: hive3
+   active: true
+   trigger: after hive2
+   status: ""
+   stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive1;", params:{'date':'2013-01-01'}
+   stage2: hive.write source:"stage1",target:"mobilize/hive3", partitions:"date/product", drop:true
+   stage3: hive.write hql:"select * from mobilize.hive3;",target:"mobilize/hive3", partitions:"date/product", drop:false
+   stage4: gsheet.write source:"hive://mobilize/hive3", target:"hive3.out"
+ - name: hive4
+   active: true
+   trigger: after hive3
+   status: ""
+   stage1: hive.write source:"hive4_stage1.in", target:"mobilize/hive1", partitions:"act_date"
+   stage2: hive.write source:"hive4_stage2.in", target:"mobilize/hive1", partitions:"act_date"
+   stage3: hive.run hql:"select '@date $utc_time' as `date_time`,product,category,value from mobilize.hive1;", params:{'date':'$utc_date'}
+   stage4: gsheet.write source:stage3, target:"hive4.out"
@@ -0,0 +1,43 @@
+ require 'test_helper'
+ describe "Mobilize" do
+   # enqueues 4 workers on Resque
+   it "runs integration test" do
+
+     puts "restart workers"
+     Mobilize::Jobtracker.restart_workers!
+
+     u = TestHelper.owner_user
+     r = u.runner
+     user_name = u.name
+     gdrive_slot = u.email
+
+     puts "add test data"
+     ["hive1.in","hive4_stage1.in","hive4_stage2.in","hive1.schema","hive1.sql"].each do |fixture_name|
+       target_url = "gsheet://#{r.title}/#{fixture_name}"
+       TestHelper.write_fixture(fixture_name, target_url, 'replace')
+     end
+
+     puts "add/update jobs"
+     u.jobs.each{|j| j.delete}
+     jobs_fixture_name = "integration_jobs"
+     jobs_target_url = "gsheet://#{r.title}/jobs"
+     TestHelper.write_fixture(jobs_fixture_name, jobs_target_url, 'update')
+
+     puts "job rows added, force enqueue runner, wait for stages"
+     #wait for stages to complete
+     expected_fixture_name = "integration_expected"
+     Mobilize::Jobtracker.stop!
+     r.enqueue!
+     TestHelper.confirm_expected_jobs(expected_fixture_name,2100)
+
+     puts "update job status and activity"
+     r.update_gsheet(gdrive_slot)
+
+     puts "check posted data"
+     assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage2.out", 'min_length' => 219) == true
+     assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage3.out", 'min_length' => 3) == true
+     assert TestHelper.check_output("gsheet://#{r.title}/hive2.out", 'min_length' => 599) == true
+     assert TestHelper.check_output("gsheet://#{r.title}/hive3.out", 'min_length' => 347) == true
+     assert TestHelper.check_output("gsheet://#{r.title}/hive4.out", 'min_length' => 432) == true
+   end
+ end
data/test/test_helper.rb CHANGED
@@ -8,3 +8,4 @@ $dir = File.dirname(File.expand_path(__FILE__))
  ENV['MOBILIZE_ENV'] = 'test'
  require 'mobilize-hive'
  $TESTING = true
+ require "#{Mobilize::Hdfs.home_dir}/test/test_helper"
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: mobilize-hive
  version: !ruby/object:Gem::Version
-   version: '1.35'
+   version: '1.36'
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-04-23 00:00:00.000000000 Z
+ date: 2013-05-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mobilize-hdfs
@@ -18,7 +18,7 @@ dependencies:
    requirements:
    - - '='
      - !ruby/object:Gem::Version
-       version: '1.35'
+       version: '1.36'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
@@ -26,7 +26,7 @@ dependencies:
    requirements:
    - - '='
      - !ruby/object:Gem::Version
-       version: '1.35'
+       version: '1.36'
  description: Adds hive read, write, and run support to mobilize-hdfs
  email:
  - cpaesleme@dena.com
@@ -46,11 +46,15 @@ files:
  - lib/mobilize-hive/version.rb
  - lib/samples/hive.yml
  - mobilize-hive.gemspec
- - test/hive_job_rows.yml
- - test/hive_test_1.hql
- - test/hive_test_1_in.yml
- - test/hive_test_1_schema.yml
- - test/mobilize-hive_test.rb
+ - test/fixtures/hive1.hql
+ - test/fixtures/hive1.in.yml
+ - test/fixtures/hive1.schema.yml
+ - test/fixtures/hive1.sql
+ - test/fixtures/hive4_stage1.in
+ - test/fixtures/hive4_stage2.in.yml
+ - test/fixtures/integration_expected.yml
+ - test/fixtures/integration_jobs.yml
+ - test/integration/mobilize-hive_test.rb
  - test/redis-test.conf
  - test/test_helper.rb
  homepage: http://github.com/dena/mobilize-hive
@@ -67,7 +71,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
      version: '0'
    segments:
    - 0
-   hash: -2302417197076465358
+   hash: 837156919845089008
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements:
@@ -76,7 +80,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      version: '0'
    segments:
    - 0
-   hash: -2302417197076465358
+   hash: 837156919845089008
  requirements: []
  rubyforge_project:
  rubygems_version: 1.8.25
@@ -84,10 +88,14 @@ signing_key:
  specification_version: 3
  summary: Adds hive read, write, and run support to mobilize-hdfs
  test_files:
- - test/hive_job_rows.yml
- - test/hive_test_1.hql
- - test/hive_test_1_in.yml
- - test/hive_test_1_schema.yml
- - test/mobilize-hive_test.rb
+ - test/fixtures/hive1.hql
+ - test/fixtures/hive1.in.yml
+ - test/fixtures/hive1.schema.yml
+ - test/fixtures/hive1.sql
+ - test/fixtures/hive4_stage1.in
+ - test/fixtures/hive4_stage2.in.yml
+ - test/fixtures/integration_expected.yml
+ - test/fixtures/integration_jobs.yml
+ - test/integration/mobilize-hive_test.rb
  - test/redis-test.conf
  - test/test_helper.rb
@@ -1,34 +0,0 @@
- ---
- - name: hive_test_1
-   active: true
-   trigger: once
-   status: ""
-   stage1: hive.write target:"mobilize/hive_test_1", partitions:"act_date", drop:true,
-     source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema"
-   stage2: hive.run source:"hive_test_1.hql"
-   stage3: hive.run hql:"show databases;"
-   stage4: gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"
-   stage5: gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"
- - name: hive_test_2
-   active: true
-   trigger: after hive_test_1
-   status: ""
-   stage1: hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true
-   stage2: hive.run hql:"select * from mobilize.hive_test_2;"
-   stage3: gsheet.write source:"stage2", target:"hive_test_2.out"
- - name: hive_test_3
-   active: true
-   trigger: after hive_test_2
-   status: ""
-   stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive_test_1;", params:{'date':'2013-01-01'}
-   stage2: hive.write source:"stage1",target:"mobilize/hive_test_3", partitions:"date/product", drop:true
-   stage3: hive.write hql:"select * from mobilize.hive_test_3;",target:"mobilize/hive_test_3", partitions:"date/product", drop:false
-   stage4: gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"
- - name: hive_test_4
-   active: true
-   trigger: after hive_test_3
-   status: ""
-   stage1: hive.write source:"hive_test_4_stage_1.in", target:"mobilize/hive_test_1", partitions:"act_date"
-   stage2: hive.write source:"hive_test_4_stage_2.in", target:"mobilize/hive_test_1", partitions:"act_date"
-   stage3: hive.run hql:"select '$utc_date $utc_time' as `date_time`,product,category,value from mobilize.hive_test_1;"
-   stage4: gsheet.write source:stage3, target:"hive_test_4.out"
@@ -1,112 +0,0 @@
- require 'test_helper'
-
- describe "Mobilize" do
-
-   def before
-     puts 'nothing before'
-   end
-
-   # enqueues 4 workers on Resque
-   it "runs integration test" do
-
-     puts "restart workers"
-     Mobilize::Jobtracker.restart_workers!
-
-     gdrive_slot = Mobilize::Gdrive.owner_email
-     puts "create user 'mobilize'"
-     user_name = gdrive_slot.split("@").first
-     u = Mobilize::User.where(:name=>user_name).first
-     r = u.runner
-
-     puts "add test_source data"
-     hive_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
-     [hive_1_in_sheet].each {|s| s.delete if s}
-     hive_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
-     hive_1_in_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_in.yml").hash_array_to_tsv
-     hive_1_in_sheet.write(hive_1_in_tsv,Mobilize::Gdrive.owner_name)
-
-     #create blank sheet
-     hive_4_stage_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_1.in",gdrive_slot)
-     [hive_4_stage_1_in_sheet].each {|s| s.delete if s}
-     hive_4_stage_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_1.in",gdrive_slot)
-
-     #create sheet w just headers
-     hive_4_stage_2_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_2.in",gdrive_slot)
-     [hive_4_stage_2_in_sheet].each {|s| s.delete if s}
-     hive_4_stage_2_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_2.in",gdrive_slot)
-     hive_4_stage_2_in_sheet_header = hive_1_in_tsv.tsv_header_array.join("\t")
-     hive_4_stage_2_in_sheet.write(hive_4_stage_2_in_sheet_header,Mobilize::Gdrive.owner_name)
-
-     hive_1_schema_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
-     [hive_1_schema_sheet].each {|s| s.delete if s}
-     hive_1_schema_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
-     hive_1_schema_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_schema.yml").hash_array_to_tsv
-     hive_1_schema_sheet.write(hive_1_schema_tsv,Mobilize::Gdrive.owner_name)
-
-     hive_1_hql_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
-     [hive_1_hql_sheet].each {|s| s.delete if s}
-     hive_1_hql_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
-     hive_1_hql_tsv = File.open("#{Mobilize::Base.root}/test/hive_test_1.hql").read
-     hive_1_hql_sheet.write(hive_1_hql_tsv,Mobilize::Gdrive.owner_name)
-
-     jobs_sheet = r.gsheet(gdrive_slot)
-
-     test_job_rows = ::YAML.load_file("#{Mobilize::Base.root}/test/hive_job_rows.yml")
-     test_job_rows.map{|j| r.jobs(j['name'])}.each{|j| j.delete if j}
-     jobs_sheet.add_or_update_rows(test_job_rows)
-
-     hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
-     [hive_1_stage_2_target_sheet].each{|s| s.delete if s}
-     hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
-     [hive_1_stage_3_target_sheet].each{|s| s.delete if s}
-     hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
-     [hive_2_target_sheet].each{|s| s.delete if s}
-     hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
-     [hive_3_target_sheet].each{|s| s.delete if s}
-     hive_4_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4.out",gdrive_slot)
-     [hive_4_target_sheet].each{|s| s.delete if s}
-
-     puts "job row added, force enqueued requestor, wait for stages"
-     r.enqueue!
-     wait_for_stages(2100)
-
-     puts "jobtracker posted data to test sheet"
-     hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
-     hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
-     hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
-     hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
-     hive_4_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4.out",gdrive_slot)
-
-     assert hive_1_stage_2_target_sheet.read(u.name).length == 219
-     assert hive_1_stage_3_target_sheet.read(u.name).length > 3
-     assert hive_2_target_sheet.read(u.name).length == 599
-     assert hive_3_target_sheet.read(u.name).length == 347
-     assert hive_4_target_sheet.read(u.name).length == 432
-   end
-
-   def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
-     time = 0
-     time_since_stage = 0
-     #check for 10 min
-     while time < time_limit and time_since_stage < stage_limit
-       sleep wait_length
-       job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
-       if job_classes.include?("Mobilize::Stage")
-         time_since_stage = 0
-         puts "saw stage at #{time.to_s} seconds"
-       else
-         time_since_stage += wait_length
-         puts "#{time_since_stage.to_s} seconds since stage seen"
-       end
-       time += wait_length
-       puts "total wait time #{time.to_s} seconds"
-     end
-
-     if time >= time_limit
-       raise "Timed out before stage completion"
-     end
-   end
-
-
-
- end