mobilize-hive 1.35 → 1.36

This diff shows the content changes between publicly released versions of this package, as they appear in their public registries. It is provided for informational purposes only.
data/README.md CHANGED
@@ -1,254 +1,4 @@
1
- Mobilize-Hive
2
- ===============
1
+ Mobilize
2
+ ========
3
3
 
4
- Mobilize-Hive adds the power of Hive to [mobilize-hdfs][mobilize-hdfs]:
5
- * read, write, and copy Hive tables through Google Spreadsheets.
6
-
7
- Table Of Contents
8
- -----------------
9
- * [Overview](#section_Overview)
10
- * [Install](#section_Install)
11
- * [Mobilize-Hive](#section_Install_Mobilize-Hive)
12
- * [Install Dirs and Files](#section_Install_Dirs_and_Files)
13
- * [Configure](#section_Configure)
14
- * [Hive](#section_Configure_Hive)
15
- * [Start](#section_Start)
16
- * [Create Job](#section_Start_Create_Job)
17
- * [Run Test](#section_Start_Run_Test)
18
- * [Meta](#section_Meta)
19
- * [Special Thanks](#section_Special_Thanks)
20
- * [Author](#section_Author)
21
-
22
- <a name='section_Overview'></a>
23
- Overview
24
- -----------
25
-
26
- * Mobilize-hive adds Hive methods to mobilize-hdfs.
27
-
28
- <a name='section_Install'></a>
29
- Install
30
- ------------
31
-
32
- Make sure you go through all the steps in the
33
- [mobilize-base][mobilize-base],
34
- [mobilize-ssh][mobilize-ssh],
35
- and [mobilize-hdfs][mobilize-hdfs]
36
- install sections first.
37
-
38
- <a name='section_Install_Mobilize-Hive'></a>
39
- ### Mobilize-Hive
40
-
41
- add this to your Gemfile:
42
-
43
- ``` ruby
44
- gem "mobilize-hive"
45
- ```
46
-
47
- or do
48
-
49
- $ gem install mobilize-hive
50
-
51
- for a ruby-wide install.
52
-
53
- <a name='section_Install_Dirs_and_Files'></a>
54
- ### Dirs and Files
55
-
56
- ### Rakefile
57
-
58
- Inside the Rakefile in your project's root dir, make sure you have:
59
-
60
- ``` ruby
61
- require 'mobilize-base/tasks'
62
- require 'mobilize-ssh/tasks'
63
- require 'mobilize-hdfs/tasks'
64
- require 'mobilize-hive/tasks'
65
- ```
66
-
67
- This defines rake tasks essential to run the environment.
68
-
69
- ### Config Dir
70
-
71
- run
72
-
73
- $ rake mobilize_hive:setup
74
-
75
- This will copy over a sample hive.yml to your config dir.
76
-
77
- <a name='section_Configure'></a>
78
- Configure
79
- ------------
80
-
81
- <a name='section_Configure_Hive'></a>
82
- ### Configure Hive
83
-
84
- * Hive stores big data, which means we need to be careful when reading from
85
- the cluster, as query results could easily fill up our mongodb instance, RAM, local disk
86
- space, etc.
87
- * To achieve this, all hive operations, stage outputs, etc. are
88
- executed and stored on the cluster only.
89
- * The exceptions are:
90
- * writing to the cluster from an external source, such as a Google
91
- sheet. Here there
92
- is no risk, as the external source has much stricter size limits than
93
- hive.
94
- * reading from the cluster, such as when posting to a Google sheet. In
95
- this case, the read_limit parameter dictates the maximum amount that can
96
- be read. If the data is bigger than the read limit, an exception will be
97
- raised.
98
-
99
- The Hive configuration consists of:
100
- * clusters - this defines aliases for clusters, which are used as
101
- parameters for Hive stages. They should have the same name as those
102
- in hadoop.yml. Each cluster has:
103
- * max_slots - defines the total number of simultaneous slots to be
104
- used for hive jobs on this cluster
105
- * output_db - defines the db which should be used to hold stage outputs.
106
- * This db must have open permissions (777) so any user on the system can
107
- write to it -- the tables inside will be owned by the users themselves.
108
- * exec_path - defines the path to the hive executable
109
-
110
- Sample hive.yml:
111
-
112
- ``` yml
113
- ---
114
- development:
115
- clusters:
116
- dev_cluster:
117
- max_slots: 5
118
- output_db: mobilize
119
- exec_path: /path/to/hive
120
- test:
121
- clusters:
122
- test_cluster:
123
- max_slots: 5
124
- output_db: mobilize
125
- exec_path: /path/to/hive
126
- production:
127
- clusters:
128
- prod_cluster:
129
- max_slots: 5
130
- output_db: mobilize
131
- exec_path: /path/to/hive
132
- ```
133
-
134
- <a name='section_Start'></a>
135
- Start
136
- -----
137
-
138
- <a name='section_Start_Create_Job'></a>
139
- ### Create Job
140
-
141
- * For mobilize-hive, the following stages are available.
142
- * cluster and user are optional for all of the below.
143
- * cluster defaults to the first cluster listed;
144
- * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
145
- * params are also optional for all of the below. They are substituted into the HQL in sources (see the sketch after this list).
146
- * params are passed as a YML or JSON, as in:
147
- * `hive.run source:<source_path>, params:{'date': '2013-03-01', 'unit': 'widgets'}`
148
- * this example replaces each key, preceded by '@', in all source hqls with its value.
149
- * The preceding '@' is used to keep from replacing instances
150
- of "date" and "unit" in the HQL; you should have `@date` and `@unit` in your actual HQL
151
- if you'd like to replace those tokens.
152
- * in addition, the following params are substituted automatically:
153
- * `$utc_date` - replaced with YYYY-MM-DD date, UTC
154
- * `$utc_time` - replaced with HH:MM time, UTC
155
- * any occurrence of these values in HQL will be replaced at runtime.
156
- * hive.run `hql:<hql> || source:<gsheet_path>, user:<user>, cluster:<cluster>`, which executes the
157
- script in the hql or source sheet and returns any output specified at the
158
- end. If the cmd or last query in source is a select statement, column headers will be
159
- returned as well.
160
- * hive.write `hql:<hql> || source:<source_path>, target:<hive_path>, partitions:<partition_path>, user:<user>, cluster:<cluster>, schema:<gsheet_path>, drop:<true/false>`,
161
- which writes the source or query result to the selected hive table.
162
- * hive_path
163
- * should be of the form `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
164
- * source:
165
- * can be a gsheet_path, hdfs_path, or hive_path (no partitions)
166
- * for gsheet and hdfs path,
167
- * if the file ends in .*ql, it's treated the same as passing hql
168
- * otherwise it is treated as a tsv with the first row as column headers
169
- * target:
170
- * Should be a hive_path, as in `<hive_db>/<table_name>` or `<hive_db>.<table_name>`.
171
- * partitions:
172
- * Due to a Hive limitation, partition names CANNOT be reserved keywords when writing from a tsv (gsheet or hdfs source).
173
- * Partitions should be specified as a path, as in partitions:`<partition1>/<partition2>`.
174
- * schema:
175
- * optional. gsheet_path to column schema.
176
- * two columns: name, datatype
177
- * Any columns not defined here will receive "string" as the datatype
178
- * partitions can have their datatypes overridden here as well
179
- * columns named here that are not in the dataset will be ignored
180
- * drop:
181
- * optional. drops the target table before performing write
182
- * defaults to false
183
-
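
As a concrete illustration of the parameter substitution described in the list above, here is a minimal sketch in plain Ruby; the `substitute_params` helper is hypothetical and only mimics the documented behavior, it is not part of the gem.

``` ruby
# Hypothetical helper: user params replace '@key' tokens, and the automatic
# $utc_date / $utc_time tokens are filled in from the current UTC time.
def substitute_params(hql, params = {})
  out = hql.dup
  params.each { |k, v| out = out.gsub("@#{k}", v.to_s) }
  now = Time.now.utc
  out = out.gsub("$utc_date", now.strftime("%Y-%m-%d"))
  out.gsub("$utc_time", now.strftime("%H:%M"))
end

hql = "select * from sales where date='@date' and unit='@unit' and run_at<='$utc_date $utc_time';"
puts substitute_params(hql, "date" => "2013-03-01", "unit" => "widgets")
# => select * from sales where date='2013-03-01' and unit='widgets' and run_at<='<today> <now>';
```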
184
- <a name='section_Start_Run_Test'></a>
185
- ### Run Test
186
-
187
- To run tests, you will need to
188
-
189
- 1) go through [mobilize-base][mobilize-base], [mobilize-ssh][mobilize-ssh], [mobilize-hdfs][mobilize-hdfs] tests first
190
-
191
- 2) clone the mobilize-hive repository
192
-
193
- From the project folder, run
194
-
195
- 3) $ rake mobilize_hive:setup
196
-
197
- Copy over the config files from the mobilize-base, mobilize-ssh,
198
- mobilize-hdfs projects into the config dir, and populate the values in the hive.yml file.
199
-
200
- Make sure you use the same names for your hive clusters as you do in
201
- hadoop.yml.
202
-
203
- 4) $ rake test
204
-
205
- * The test runs these jobs:
206
- * hive_test_1:
207
- * `hive.write target:"mobilize/hive_test_1/act_date",source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema", drop:true`
208
- * `hive.run source:"hive_test_1.hql"`
209
- * `hive.run cmd:"show databases"`
210
- * `gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"`
211
- * `gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"`
212
- * hive_test_1.hql runs a select statement on the table created in the
213
- write command.
214
- * at the end of the test, there should be two sheets, one with a
215
- sum of the data as in your write query, one with the results of the show
216
- databases command.
217
- * hive_test_2:
218
- * `hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true`
219
- * `hive.run cmd:"select * from mobilize.hive_test_2"`
220
- * `gsheet.write source:"stage2", target:"hive_test_2.out"`
221
- * this test uses the output from the first hdfs test as an input, so make sure you've run that first.
222
- * hive_test_3:
223
- * `hive.write source:"hive://mobilize.hive_test_1",target:"mobilize/hive_test_3/date/product",drop:true`
224
- * ``hive.run hql:"select act_date as `date`,product,category,value from mobilize.hive_test_1;"``
225
- * `hive.write source:"stage2",target:"mobilize/hive_test_3/date/product", drop:false`
226
- * `gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"`
227
-
228
-
229
- <a name='section_Meta'></a>
230
- Meta
231
- ----
232
-
233
- * Code: `git clone git://github.com/dena/mobilize-hive.git`
234
- * Home: <https://github.com/dena/mobilize-hive>
235
- * Bugs: <https://github.com/dena/mobilize-hive/issues>
236
- * Gems: <http://rubygems.org/gems/mobilize-hive>
237
-
238
- <a name='section_Special_Thanks'></a>
239
- Special Thanks
240
- --------------
241
- * This release goes to Toby Negrin, who championed this project with
242
- DeNA and gave me the support to get it properly architected, tested, and documented.
243
- * Also many thanks to the Analytics team at DeNA who build and maintain
244
- our Big Data infrastructure.
245
-
246
- <a name='section_Author'></a>
247
- Author
248
- ------
249
-
250
- Cassio Paes-Leme :: cpaesleme@dena.com :: @cpaesleme
251
-
252
- [mobilize-base]: https://github.com/dena/mobilize-base
253
- [mobilize-ssh]: https://github.com/dena/mobilize-ssh
254
- [mobilize-hdfs]: https://github.com/dena/mobilize-hdfs
4
+ Please refer to the mobilize-server wiki: https://github.com/DeNA/mobilize-server/wiki
@@ -94,22 +94,29 @@ module Mobilize
94
94
 
95
95
  #run a generic hive command, with the option of passing a file hash to be locally available
96
96
  def Hive.run(cluster,hql,user_name,params=nil,file_hash=nil)
97
- # no TempStatsStore
98
- hql = "set hive.stats.autogather=false;#{hql}"
97
+ preps = Hive.prepends.map do |p|
98
+ prefix = "set "
99
+ suffix = ";"
100
+ prep_out = p
101
+ prep_out = "#{prefix}#{prep_out}" unless prep_out.starts_with?(prefix)
102
+ prep_out = "#{prep_out}#{suffix}" unless prep_out.ends_with?(suffix)
103
+ prep_out
104
+ end.join
105
+ hql = "#{preps}#{hql}"
99
106
  filename = hql.to_md5
100
107
  file_hash||= {}
101
108
  file_hash[filename] = hql
102
- #add in default params
103
109
  params ||= {}
104
- params = params.merge(Hive.default_params)
105
110
  #replace any params in the file_hash and command
106
111
  params.each do |k,v|
107
112
  file_hash.each do |name,data|
108
- if k.starts_with?("$")
109
- data.gsub!(k,v)
110
- else
111
- data.gsub!("@#{k}",v)
112
- end
113
+ data.gsub!("@#{k}",v)
114
+ end
115
+ end
116
+ #add in default params
117
+ Hive.default_params.each do |k,v|
118
+ file_hash.each do |name,data|
119
+ data.gsub!(k,v)
113
120
  end
114
121
  end
115
122
  #silent mode so we don't have logs in stderr; clip output
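
A minimal sketch of what the new `prepends` handling above produces, assuming a `prepends` list like the one added to the sample hive.yml; plain-Ruby `start_with?`/`end_with?` stand in for the gem's `starts_with?`/`ends_with?` helpers.

``` ruby
# Illustrative only: normalize each configured prepend into a "set ...;"
# statement and prefix the result to the HQL, as Hive.run now does.
prepends = ["hive.stats.autogather=false", "set mapred.job.queue.name=default;"]

preps = prepends.map do |p|
  out = p.dup
  out = "set #{out}" unless out.start_with?("set ")
  out = "#{out};"    unless out.end_with?(";")
  out
end.join

hql = "select count(*) from mobilize.hive1;"
puts preps + hql
# => set hive.stats.autogather=false;set mapred.job.queue.name=default;select count(*) from mobilize.hive1;
```

This is how the hard-coded `set hive.stats.autogather=false;` from 1.35 moves into per-environment configuration.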
@@ -155,9 +162,9 @@ module Mobilize
155
162
  Gdrive.unslot_worker_by_path(stage_path)
156
163
 
157
164
  #check for select at end
158
- hql_array = hql.split(";").map{|hc| hc.strip}.reject{|hc| hc.length==0}
159
- last_statement = hql_array.last.downcase.split("\n").reject{|l| l.starts_with?("-- ")}.first
160
- if last_statement.to_s.starts_with?("select")
165
+ hql_array = hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
166
+ last_statement = hql_array.last
167
+ if last_statement.to_s.downcase.starts_with?("select")
161
168
  #nil if no prior commands
162
169
  prior_hql = hql_array[0..-2].join(";") if hql_array.length > 1
163
170
  select_hql = hql_array.last
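
A minimal sketch of the reworked trailing-select detection above; the sample HQL is invented, and core-Ruby `start_with?` stands in for the gem's `starts_with?`.

``` ruby
# Illustrative only: drop '--' comment lines and blanks, split into statements,
# then check whether the final statement is a select whose output should be returned.
hql = <<-HQL
-- daily rollup
set mapred.job.name=rollup;
insert overwrite table mobilize.rollup
select act_date, sum(value) from mobilize.hive1 group by act_date;

-- preview the result
select * from mobilize.rollup limit 10;
HQL

hql_array = hql.split("\n")
               .reject { |l| l.start_with?("--") || l.strip.empty? }
               .join("\n")
               .split(";")
               .map(&:strip)

puts "last statement returns output" if hql_array.last.to_s.downcase.start_with?("select")
```

The practical change from 1.35 is that `--` comment lines anywhere in the script, not just `-- ` lines inside the final statement, are ignored before deciding whether output should be returned.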
@@ -181,41 +188,37 @@ module Mobilize
181
188
  response
182
189
  end
183
190
 
184
- def Hive.schema_hash(schema_path,user_name,gdrive_slot)
185
- if schema_path.index("/")
186
- #slashes mean sheets
187
- out_tsv = Gsheet.find_by_path(schema_path,gdrive_slot).read(user_name)
191
+ def Hive.schema_hash(schema_path,stage_path,user_name,gdrive_slot)
192
+ handler = if schema_path.index("://")
193
+ schema_path.split("://").first
194
+ else
195
+ "gsheet"
196
+ end
197
+ dst = "Mobilize::#{handler.downcase.capitalize}".constantize.path_to_dst(schema_path,stage_path,gdrive_slot)
198
+ out_raw = dst.read(user_name,gdrive_slot)
199
+ #determine the datatype for schema; accept json, yaml, tsv
200
+ if schema_path.ends_with?(".yml")
201
+ out_ha = begin;YAML.load(out_raw);rescue ScriptError, StandardError;nil;end if out_ha.nil?
188
202
  else
189
- u = User.where(:name=>user_name).first
190
- #check sheets in runner
191
- r = u.runner
192
- runner_sheet = r.gbook(gdrive_slot).worksheet_by_title(schema_path)
193
- out_tsv = if runner_sheet
194
- runner_sheet.read(user_name)
195
- else
196
- #check for gfile. will fail if there isn't one.
197
- Gfile.find_by_path(schema_path).read(user_name)
198
- end
203
+ out_ha = begin;JSON.parse(out_raw);rescue ScriptError, StandardError;nil;end
204
+ out_ha = out_raw.tsv_to_hash_array if out_ha.nil?
199
205
  end
200
- #use Gridfs to cache gdrive results
201
- file_name = schema_path.split("/").last
202
- out_url = "gridfs://#{schema_path}/#{file_name}"
203
- Dataset.write_by_url(out_url,out_tsv,user_name)
204
- schema_tsv = Dataset.find_by_url(out_url).read(user_name,gdrive_slot)
205
206
  schema_hash = {}
206
- schema_tsv.tsv_to_hash_array.each do |ha|
207
- schema_hash[ha['name']] = ha['datatype']
207
+ out_ha.each do |hash|
208
+ schema_hash[hash['name']] = hash['datatype']
208
209
  end
209
210
  schema_hash
210
211
  end
211
212
 
212
- def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil, params=nil)
213
+ def Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop=false, schema_hash=nil, run_params=nil)
213
214
  table_path = [db,table].join(".")
214
215
  table_stats = Hive.table_stats(cluster, db, table, user_name)
215
216
  url = "hive://" + [cluster,db,table,part_array.compact.join("/")].join("/")
216
217
 
217
- source_hql_array = source_hql.split(";")
218
- last_select_i = source_hql_array.rindex{|hql| hql.downcase.strip.starts_with?("select")}
218
+ #decomment hql
219
+
220
+ source_hql_array = source_hql.split("\n").reject{|l| l.starts_with?("--") or l.strip.length==0}.join("\n").split(";").map{|h| h.strip}
221
+ last_select_i = source_hql_array.rindex{|s| s.downcase.starts_with?("select")}
219
222
  #find the last select query -- it should be used for the temp table creation
220
223
  last_select_hql = (source_hql_array[last_select_i..-1].join(";")+";")
221
224
  #if there is anything prior to the last select, add it in prior to table creation
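
A minimal sketch of the three schema formats the rewritten `Hive.schema_hash` above now accepts: a `.yml` path is parsed as YAML, anything else is tried as JSON and falls back to TSV. The `tsv_to_hash_array` stand-in below is hypothetical; the gem provides its own String extension.

``` ruby
# Illustrative only: the same column schema as YAML, JSON, and TSV,
# each reducing to the {"name" => "datatype"} pairs schema_hash builds.
require "yaml"
require "json"

yaml_raw = <<-YML
- name: act_date
  datatype: string
- name: value
  datatype: bigint
YML
json_raw = '[{"name":"act_date","datatype":"string"},{"name":"value","datatype":"bigint"}]'
tsv_raw  = "name\tdatatype\nact_date\tstring\nvalue\tbigint"

# hypothetical stand-in for the gem's String#tsv_to_hash_array
def tsv_to_hash_array(tsv)
  header, *rows = tsv.split("\n").map { |l| l.split("\t") }
  rows.map { |r| header.zip(r).to_h }
end

[YAML.load(yaml_raw), JSON.parse(json_raw), tsv_to_hash_array(tsv_raw)].each do |hash_array|
  schema_hash = {}
  hash_array.each { |h| schema_hash[h["name"]] = h["datatype"] }
  p schema_hash  # {"act_date"=>"string", "value"=>"bigint"}
end
```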
@@ -228,7 +231,7 @@ module Mobilize
228
231
  temp_set_hql = "set mapred.job.name=#{job_name} (temp table);"
229
232
  temp_drop_hql = "drop table if exists #{temp_table_path};"
230
233
  temp_create_hql = "#{temp_set_hql}#{prior_hql}#{temp_drop_hql}create table #{temp_table_path} as #{last_select_hql}"
231
- response = Hive.run(cluster,temp_create_hql,user_name,params)
234
+ response = Hive.run(cluster,temp_create_hql,user_name,run_params)
232
235
  raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
233
236
 
234
237
  source_table_stats = Hive.table_stats(cluster,temp_db,temp_table_name,user_name)
@@ -267,7 +270,7 @@ module Mobilize
267
270
  target_insert_hql,
268
271
  temp_drop_hql].join
269
272
 
270
- response = Hive.run(cluster, target_full_hql, user_name, params)
273
+ response = Hive.run(cluster, target_full_hql, user_name, run_params)
271
274
 
272
275
  raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
273
276
 
@@ -319,7 +322,7 @@ module Mobilize
319
322
  part_set_hql = "set hive.cli.print.header=true;set mapred.job.name=#{job_name} (permutations);"
320
323
  part_select_hql = "select distinct #{target_part_stmt} from #{temp_table_path};"
321
324
  part_perm_hql = part_set_hql + part_select_hql
322
- response = Hive.run(cluster, part_perm_hql, user_name, params)
325
+ response = Hive.run(cluster, part_perm_hql, user_name, run_params)
323
326
  raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
324
327
  part_perm_tsv = response['stdout']
325
328
  #having gotten the permutations, ensure they are dropped
@@ -332,7 +335,7 @@ module Mobilize
332
335
 
333
336
  part_drop_hql = part_hash_array.map do |h|
334
337
  part_drop_stmt = h.map do |name,value|
335
- part_defs[name[1..-2]]=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
338
+ part_defs[name[1..-2]].downcase=="string" ? "#{name}='#{value}'" : "#{name}=#{value}"
336
339
  end.join(",")
337
340
  "use #{db};alter table #{table} drop if exists partition (#{part_drop_stmt});"
338
341
  end.join
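
A minimal sketch of the partition-drop HQL assembled just above, with made-up values; it assumes `part_defs` is keyed by the bare column name while the permutation hash keys keep Hive's backtick quoting, which is why `name[1..-2]` trims one character from each end. The `.downcase` added in 1.36 means a schema datatype written as "STRING" still gets its value quoted.

``` ruby
# Illustrative only: quote values for string-typed partitions, leave others bare,
# then emit one "alter table ... drop partition" per permutation.
db, table = "mobilize", "hive3"
part_defs = { "date" => "STRING", "product" => "int" }  # datatypes from the schema sheet

part_hash_array = [{ "`date`" => "2013-01-01", "`product`" => "7" }]

part_drop_hql = part_hash_array.map do |h|
  part_drop_stmt = h.map do |name, value|
    part_defs[name[1..-2]].downcase == "string" ? "#{name}='#{value}'" : "#{name}=#{value}"
  end.join(",")
  "use #{db};alter table #{table} drop if exists partition (#{part_drop_stmt});"
end.join

puts part_drop_hql
# use mobilize;alter table hive3 drop if exists partition (`date`='2013-01-01',`product`=7);
```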
@@ -345,7 +348,7 @@ module Mobilize
345
348
 
346
349
  target_full_hql = [target_set_hql, target_create_hql, target_insert_hql, temp_drop_hql].join
347
350
 
348
- response = Hive.run(cluster, target_full_hql, user_name, params)
351
+ response = Hive.run(cluster, target_full_hql, user_name, run_params)
349
352
  raise response['stderr'] if response['stderr'].to_s.ie{|s| s.index("FAILED") or s.index("KILLED")}
350
353
  else
351
354
  error_msg = "Incompatible partition specs"
@@ -500,7 +503,7 @@ module Mobilize
500
503
  job_name = s.path.sub("Runner_","")
501
504
 
502
505
  schema_hash = if params['schema']
503
- Hive.schema_hash(params['schema'],user_name,gdrive_slot)
506
+ Hive.schema_hash(params['schema'],stage_path,user_name,gdrive_slot)
504
507
  else
505
508
  {}
506
509
  end
@@ -543,7 +546,8 @@ module Mobilize
543
546
  result = begin
544
547
  url = if source_hql
545
548
  #include any params (or nil) at the end
546
- Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,params['params'])
549
+ run_params = params['params']
550
+ Hive.hql_to_table(cluster, db, table, part_array, source_hql, user_name, job_name, drop, schema_hash,run_params)
547
551
  elsif source_tsv
548
552
  Hive.tsv_to_table(cluster, db, table, part_array, source_tsv, user_name, drop, schema_hash)
549
553
  elsif source
@@ -26,6 +26,10 @@ module Mobilize
26
26
  (1..self.clusters[cluster]['max_slots']).to_a.map{|s| "#{cluster}_#{s.to_s}"}
27
27
  end
28
28
 
29
+ def self.prepends
30
+ self.config['prepends']
31
+ end
32
+
29
33
  def self.slot_worker_by_cluster_and_path(cluster,path)
30
34
  working_slots = Mobilize::Resque.jobs.map{|j| begin j['args'][1]['hive_slot'];rescue;nil;end}.compact.uniq
31
35
  self.slot_ids(cluster).each do |slot_id|
@@ -1,3 +1,4 @@
1
+ require 'yaml'
1
2
  namespace :mobilize_hive do
2
3
  desc "Set up config and log folders and files"
3
4
  task :setup do
@@ -1,5 +1,5 @@
1
1
  module Mobilize
2
2
  module Hive
3
- VERSION = "1.35"
3
+ VERSION = "1.36"
4
4
  end
5
5
  end
data/lib/mobilize-hive.rb CHANGED
@@ -3,6 +3,9 @@ require "mobilize-hdfs"
3
3
 
4
4
  module Mobilize
5
5
  module Hive
6
+ def Hive.home_dir
7
+ File.expand_path('..',File.dirname(__FILE__))
8
+ end
6
9
  end
7
10
  end
8
11
  require "mobilize-hive/handlers/hive"
data/lib/samples/hive.yml CHANGED
@@ -1,17 +1,23 @@
1
1
  ---
2
2
  development:
3
+ prepends:
4
+ - "hive.stats.autogather=false"
3
5
  clusters:
4
6
  dev_cluster:
5
7
  max_slots: 5
6
8
  temp_table_db: mobilize
7
9
  exec_path: /path/to/hive
8
10
  test:
11
+ prepends:
12
+ - "hive.stats.autogather=false"
9
13
  clusters:
10
14
  test_cluster:
11
15
  max_slots: 5
12
16
  temp_table_db: mobilize
13
17
  exec_path: /path/to/hive
14
18
  production:
19
+ prepends:
20
+ - "hive.stats.autogather=false"
15
21
  clusters:
16
22
  prod_cluster:
17
23
  max_slots: 5
@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
16
16
  gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
17
17
  gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
18
18
  gem.require_paths = ["lib"]
19
- gem.add_runtime_dependency "mobilize-hdfs","1.35"
19
+ gem.add_runtime_dependency "mobilize-hdfs","1.36"
20
20
  end
File without changes
File without changes
@@ -0,0 +1 @@
1
+ select act_date,product, sum(value) as sum from mobilize.hive_test_1 group by act_date,product;
@@ -0,0 +1 @@
1
+
@@ -0,0 +1,4 @@
1
+ - act_date: ""
2
+ product: ""
3
+ category: ""
4
+ value: ""
@@ -0,0 +1,69 @@
1
+ ---
2
+ - path: "Runner_mobilize(test)/jobs"
3
+ state: working
4
+ count: 1
5
+ confirmed_ats: []
6
+ - path: "Runner_mobilize(test)/jobs/hive1/stage1"
7
+ state: working
8
+ count: 1
9
+ confirmed_ats: []
10
+ - path: "Runner_mobilize(test)/jobs/hive1/stage2"
11
+ state: working
12
+ count: 1
13
+ confirmed_ats: []
14
+ - path: "Runner_mobilize(test)/jobs/hive1/stage3"
15
+ state: working
16
+ count: 1
17
+ confirmed_ats: []
18
+ - path: "Runner_mobilize(test)/jobs/hive1/stage4"
19
+ state: working
20
+ count: 1
21
+ confirmed_ats: []
22
+ - path: "Runner_mobilize(test)/jobs/hive1/stage5"
23
+ state: working
24
+ count: 1
25
+ confirmed_ats: []
26
+ - path: "Runner_mobilize(test)/jobs/hive2/stage1"
27
+ state: working
28
+ count: 1
29
+ confirmed_ats: []
30
+ - path: "Runner_mobilize(test)/jobs/hive2/stage2"
31
+ state: working
32
+ count: 1
33
+ confirmed_ats: []
34
+ - path: "Runner_mobilize(test)/jobs/hive2/stage3"
35
+ state: working
36
+ count: 1
37
+ confirmed_ats: []
38
+ - path: "Runner_mobilize(test)/jobs/hive3/stage1"
39
+ state: working
40
+ count: 1
41
+ confirmed_ats: []
42
+ - path: "Runner_mobilize(test)/jobs/hive3/stage2"
43
+ state: working
44
+ count: 1
45
+ confirmed_ats: []
46
+ - path: "Runner_mobilize(test)/jobs/hive3/stage3"
47
+ state: working
48
+ count: 1
49
+ confirmed_ats: []
50
+ - path: "Runner_mobilize(test)/jobs/hive3/stage4"
51
+ state: working
52
+ count: 1
53
+ confirmed_ats: []
54
+ - path: "Runner_mobilize(test)/jobs/hive4/stage1"
55
+ state: working
56
+ count: 1
57
+ confirmed_ats: []
58
+ - path: "Runner_mobilize(test)/jobs/hive4/stage2"
59
+ state: working
60
+ count: 1
61
+ confirmed_ats: []
62
+ - path: "Runner_mobilize(test)/jobs/hive4/stage3"
63
+ state: working
64
+ count: 1
65
+ confirmed_ats: []
66
+ - path: "Runner_mobilize(test)/jobs/hive4/stage4"
67
+ state: working
68
+ count: 1
69
+ confirmed_ats: []
@@ -0,0 +1,34 @@
1
+ ---
2
+ - name: hive1
3
+ active: true
4
+ trigger: once
5
+ status: ""
6
+ stage1: hive.write target:"mobilize/hive1", partitions:"act_date", drop:true,
7
+ source:"Runner_mobilize(test)/hive1.in", schema:"hive1.schema"
8
+ stage2: hive.run source:"hive1.sql"
9
+ stage3: hive.run hql:"show databases;"
10
+ stage4: gsheet.write source:"stage2", target:"hive1_stage2.out"
11
+ stage5: gsheet.write source:"stage3", target:"hive1_stage3.out"
12
+ - name: hive2
13
+ active: true
14
+ trigger: after hive1
15
+ status: ""
16
+ stage1: hive.write source:"hdfs://user/mobilize/test/hdfs1.out", target:"mobilize.hive2", drop:true
17
+ stage2: hive.run hql:"select * from mobilize.hive2;"
18
+ stage3: gsheet.write source:"stage2", target:"hive2.out"
19
+ - name: hive3
20
+ active: true
21
+ trigger: after hive2
22
+ status: ""
23
+ stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive1;", params:{'date':'2013-01-01'}
24
+ stage2: hive.write source:"stage1",target:"mobilize/hive3", partitions:"date/product", drop:true
25
+ stage3: hive.write hql:"select * from mobilize.hive3;",target:"mobilize/hive3", partitions:"date/product", drop:false
26
+ stage4: gsheet.write source:"hive://mobilize/hive3", target:"hive3.out"
27
+ - name: hive4
28
+ active: true
29
+ trigger: after hive3
30
+ status: ""
31
+ stage1: hive.write source:"hive4_stage1.in", target:"mobilize/hive1", partitions:"act_date"
32
+ stage2: hive.write source:"hive4_stage2.in", target:"mobilize/hive1", partitions:"act_date"
33
+ stage3: hive.run hql:"select '@date $utc_time' as `date_time`,product,category,value from mobilize.hive1;", params:{'date':'$utc_date'}
34
+ stage4: gsheet.write source:stage3, target:"hive4.out"
@@ -0,0 +1,43 @@
1
+ require 'test_helper'
2
+ describe "Mobilize" do
3
+ # enqueues 4 workers on Resque
4
+ it "runs integration test" do
5
+
6
+ puts "restart workers"
7
+ Mobilize::Jobtracker.restart_workers!
8
+
9
+ u = TestHelper.owner_user
10
+ r = u.runner
11
+ user_name = u.name
12
+ gdrive_slot = u.email
13
+
14
+ puts "add test data"
15
+ ["hive1.in","hive4_stage1.in","hive4_stage2.in","hive1.schema","hive1.sql"].each do |fixture_name|
16
+ target_url = "gsheet://#{r.title}/#{fixture_name}"
17
+ TestHelper.write_fixture(fixture_name, target_url, 'replace')
18
+ end
19
+
20
+ puts "add/update jobs"
21
+ u.jobs.each{|j| j.delete}
22
+ jobs_fixture_name = "integration_jobs"
23
+ jobs_target_url = "gsheet://#{r.title}/jobs"
24
+ TestHelper.write_fixture(jobs_fixture_name, jobs_target_url, 'update')
25
+
26
+ puts "job rows added, force enqueue runner, wait for stages"
27
+ #wait for stages to complete
28
+ expected_fixture_name = "integration_expected"
29
+ Mobilize::Jobtracker.stop!
30
+ r.enqueue!
31
+ TestHelper.confirm_expected_jobs(expected_fixture_name,2100)
32
+
33
+ puts "update job status and activity"
34
+ r.update_gsheet(gdrive_slot)
35
+
36
+ puts "check posted data"
37
+ assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage2.out", 'min_length' => 219) == true
38
+ assert TestHelper.check_output("gsheet://#{r.title}/hive1_stage3.out", 'min_length' => 3) == true
39
+ assert TestHelper.check_output("gsheet://#{r.title}/hive2.out", 'min_length' => 599) == true
40
+ assert TestHelper.check_output("gsheet://#{r.title}/hive3.out", 'min_length' => 347) == true
41
+ assert TestHelper.check_output("gsheet://#{r.title}/hive4.out", 'min_length' => 432) == true
42
+ end
43
+ end
data/test/test_helper.rb CHANGED
@@ -8,3 +8,4 @@ $dir = File.dirname(File.expand_path(__FILE__))
8
8
  ENV['MOBILIZE_ENV'] = 'test'
9
9
  require 'mobilize-hive'
10
10
  $TESTING = true
11
+ require "#{Mobilize::Hdfs.home_dir}/test/test_helper"
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: mobilize-hive
3
3
  version: !ruby/object:Gem::Version
4
- version: '1.35'
4
+ version: '1.36'
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2013-04-23 00:00:00.000000000 Z
12
+ date: 2013-05-21 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: mobilize-hdfs
@@ -18,7 +18,7 @@ dependencies:
18
18
  requirements:
19
19
  - - '='
20
20
  - !ruby/object:Gem::Version
21
- version: '1.35'
21
+ version: '1.36'
22
22
  type: :runtime
23
23
  prerelease: false
24
24
  version_requirements: !ruby/object:Gem::Requirement
@@ -26,7 +26,7 @@ dependencies:
26
26
  requirements:
27
27
  - - '='
28
28
  - !ruby/object:Gem::Version
29
- version: '1.35'
29
+ version: '1.36'
30
30
  description: Adds hive read, write, and run support to mobilize-hdfs
31
31
  email:
32
32
  - cpaesleme@dena.com
@@ -46,11 +46,15 @@ files:
46
46
  - lib/mobilize-hive/version.rb
47
47
  - lib/samples/hive.yml
48
48
  - mobilize-hive.gemspec
49
- - test/hive_job_rows.yml
50
- - test/hive_test_1.hql
51
- - test/hive_test_1_in.yml
52
- - test/hive_test_1_schema.yml
53
- - test/mobilize-hive_test.rb
49
+ - test/fixtures/hive1.hql
50
+ - test/fixtures/hive1.in.yml
51
+ - test/fixtures/hive1.schema.yml
52
+ - test/fixtures/hive1.sql
53
+ - test/fixtures/hive4_stage1.in
54
+ - test/fixtures/hive4_stage2.in.yml
55
+ - test/fixtures/integration_expected.yml
56
+ - test/fixtures/integration_jobs.yml
57
+ - test/integration/mobilize-hive_test.rb
54
58
  - test/redis-test.conf
55
59
  - test/test_helper.rb
56
60
  homepage: http://github.com/dena/mobilize-hive
@@ -67,7 +71,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
67
71
  version: '0'
68
72
  segments:
69
73
  - 0
70
- hash: -2302417197076465358
74
+ hash: 837156919845089008
71
75
  required_rubygems_version: !ruby/object:Gem::Requirement
72
76
  none: false
73
77
  requirements:
@@ -76,7 +80,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
76
80
  version: '0'
77
81
  segments:
78
82
  - 0
79
- hash: -2302417197076465358
83
+ hash: 837156919845089008
80
84
  requirements: []
81
85
  rubyforge_project:
82
86
  rubygems_version: 1.8.25
@@ -84,10 +88,14 @@ signing_key:
84
88
  specification_version: 3
85
89
  summary: Adds hive read, write, and run support to mobilize-hdfs
86
90
  test_files:
87
- - test/hive_job_rows.yml
88
- - test/hive_test_1.hql
89
- - test/hive_test_1_in.yml
90
- - test/hive_test_1_schema.yml
91
- - test/mobilize-hive_test.rb
91
+ - test/fixtures/hive1.hql
92
+ - test/fixtures/hive1.in.yml
93
+ - test/fixtures/hive1.schema.yml
94
+ - test/fixtures/hive1.sql
95
+ - test/fixtures/hive4_stage1.in
96
+ - test/fixtures/hive4_stage2.in.yml
97
+ - test/fixtures/integration_expected.yml
98
+ - test/fixtures/integration_jobs.yml
99
+ - test/integration/mobilize-hive_test.rb
92
100
  - test/redis-test.conf
93
101
  - test/test_helper.rb
@@ -1,34 +0,0 @@
1
- ---
2
- - name: hive_test_1
3
- active: true
4
- trigger: once
5
- status: ""
6
- stage1: hive.write target:"mobilize/hive_test_1", partitions:"act_date", drop:true,
7
- source:"Runner_mobilize(test)/hive_test_1.in", schema:"hive_test_1.schema"
8
- stage2: hive.run source:"hive_test_1.hql"
9
- stage3: hive.run hql:"show databases;"
10
- stage4: gsheet.write source:"stage2", target:"hive_test_1_stage_2.out"
11
- stage5: gsheet.write source:"stage3", target:"hive_test_1_stage_3.out"
12
- - name: hive_test_2
13
- active: true
14
- trigger: after hive_test_1
15
- status: ""
16
- stage1: hive.write source:"hdfs://user/mobilize/test/test_hdfs_1.out", target:"mobilize.hive_test_2", drop:true
17
- stage2: hive.run hql:"select * from mobilize.hive_test_2;"
18
- stage3: gsheet.write source:"stage2", target:"hive_test_2.out"
19
- - name: hive_test_3
20
- active: true
21
- trigger: after hive_test_2
22
- status: ""
23
- stage1: hive.run hql:"select '@date' as `date`,product,category,value from mobilize.hive_test_1;", params:{'date':'2013-01-01'}
24
- stage2: hive.write source:"stage1",target:"mobilize/hive_test_3", partitions:"date/product", drop:true
25
- stage3: hive.write hql:"select * from mobilize.hive_test_3;",target:"mobilize/hive_test_3", partitions:"date/product", drop:false
26
- stage4: gsheet.write source:"hive://mobilize/hive_test_3", target:"hive_test_3.out"
27
- - name: hive_test_4
28
- active: true
29
- trigger: after hive_test_3
30
- status: ""
31
- stage1: hive.write source:"hive_test_4_stage_1.in", target:"mobilize/hive_test_1", partitions:"act_date"
32
- stage2: hive.write source:"hive_test_4_stage_2.in", target:"mobilize/hive_test_1", partitions:"act_date"
33
- stage3: hive.run hql:"select '$utc_date $utc_time' as `date_time`,product,category,value from mobilize.hive_test_1;"
34
- stage4: gsheet.write source:stage3, target:"hive_test_4.out"
@@ -1,112 +0,0 @@
1
- require 'test_helper'
2
-
3
- describe "Mobilize" do
4
-
5
- def before
6
- puts 'nothing before'
7
- end
8
-
9
- # enqueues 4 workers on Resque
10
- it "runs integration test" do
11
-
12
- puts "restart workers"
13
- Mobilize::Jobtracker.restart_workers!
14
-
15
- gdrive_slot = Mobilize::Gdrive.owner_email
16
- puts "create user 'mobilize'"
17
- user_name = gdrive_slot.split("@").first
18
- u = Mobilize::User.where(:name=>user_name).first
19
- r = u.runner
20
-
21
- puts "add test_source data"
22
- hive_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
23
- [hive_1_in_sheet].each {|s| s.delete if s}
24
- hive_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.in",gdrive_slot)
25
- hive_1_in_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_in.yml").hash_array_to_tsv
26
- hive_1_in_sheet.write(hive_1_in_tsv,Mobilize::Gdrive.owner_name)
27
-
28
- #create blank sheet
29
- hive_4_stage_1_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_1.in",gdrive_slot)
30
- [hive_4_stage_1_in_sheet].each {|s| s.delete if s}
31
- hive_4_stage_1_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_1.in",gdrive_slot)
32
-
33
- #create sheet w just headers
34
- hive_4_stage_2_in_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_2.in",gdrive_slot)
35
- [hive_4_stage_2_in_sheet].each {|s| s.delete if s}
36
- hive_4_stage_2_in_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4_stage_2.in",gdrive_slot)
37
- hive_4_stage_2_in_sheet_header = hive_1_in_tsv.tsv_header_array.join("\t")
38
- hive_4_stage_2_in_sheet.write(hive_4_stage_2_in_sheet_header,Mobilize::Gdrive.owner_name)
39
-
40
- hive_1_schema_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
41
- [hive_1_schema_sheet].each {|s| s.delete if s}
42
- hive_1_schema_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.schema",gdrive_slot)
43
- hive_1_schema_tsv = YAML.load_file("#{Mobilize::Base.root}/test/hive_test_1_schema.yml").hash_array_to_tsv
44
- hive_1_schema_sheet.write(hive_1_schema_tsv,Mobilize::Gdrive.owner_name)
45
-
46
- hive_1_hql_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
47
- [hive_1_hql_sheet].each {|s| s.delete if s}
48
- hive_1_hql_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1.hql",gdrive_slot)
49
- hive_1_hql_tsv = File.open("#{Mobilize::Base.root}/test/hive_test_1.hql").read
50
- hive_1_hql_sheet.write(hive_1_hql_tsv,Mobilize::Gdrive.owner_name)
51
-
52
- jobs_sheet = r.gsheet(gdrive_slot)
53
-
54
- test_job_rows = ::YAML.load_file("#{Mobilize::Base.root}/test/hive_job_rows.yml")
55
- test_job_rows.map{|j| r.jobs(j['name'])}.each{|j| j.delete if j}
56
- jobs_sheet.add_or_update_rows(test_job_rows)
57
-
58
- hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
59
- [hive_1_stage_2_target_sheet].each{|s| s.delete if s}
60
- hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
61
- [hive_1_stage_3_target_sheet].each{|s| s.delete if s}
62
- hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
63
- [hive_2_target_sheet].each{|s| s.delete if s}
64
- hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
65
- [hive_3_target_sheet].each{|s| s.delete if s}
66
- hive_4_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4.out",gdrive_slot)
67
- [hive_4_target_sheet].each{|s| s.delete if s}
68
-
69
- puts "job row added, force enqueued requestor, wait for stages"
70
- r.enqueue!
71
- wait_for_stages(2100)
72
-
73
- puts "jobtracker posted data to test sheet"
74
- hive_1_stage_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_2.out",gdrive_slot)
75
- hive_1_stage_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_1_stage_3.out",gdrive_slot)
76
- hive_2_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_2.out",gdrive_slot)
77
- hive_3_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_3.out",gdrive_slot)
78
- hive_4_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/hive_test_4.out",gdrive_slot)
79
-
80
- assert hive_1_stage_2_target_sheet.read(u.name).length == 219
81
- assert hive_1_stage_3_target_sheet.read(u.name).length > 3
82
- assert hive_2_target_sheet.read(u.name).length == 599
83
- assert hive_3_target_sheet.read(u.name).length == 347
84
- assert hive_4_target_sheet.read(u.name).length == 432
85
- end
86
-
87
- def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
88
- time = 0
89
- time_since_stage = 0
90
- #check for 10 min
91
- while time < time_limit and time_since_stage < stage_limit
92
- sleep wait_length
93
- job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
94
- if job_classes.include?("Mobilize::Stage")
95
- time_since_stage = 0
96
- puts "saw stage at #{time.to_s} seconds"
97
- else
98
- time_since_stage += wait_length
99
- puts "#{time_since_stage.to_s} seconds since stage seen"
100
- end
101
- time += wait_length
102
- puts "total wait time #{time.to_s} seconds"
103
- end
104
-
105
- if time >= time_limit
106
- raise "Timed out before stage completion"
107
- end
108
- end
109
-
110
-
111
-
112
- end