rbhive 0.6.0 → 1.0.0.pre

data/README.md CHANGED
@@ -1,7 +1,7 @@
1
- # RBHive -- Ruby thrift lib for executing Hive queries
1
+ # RBHive - A Ruby Thrift client for Apache Hive
2
2
 
3
3
  RBHive is a simple Ruby gem to communicate with the [Apache Hive](http://hive.apache.org)
4
- Thrift server.
4
+ Thrift servers.
5
5
 
6
6
  It supports:
7
7
  * Hiveserver (the original Thrift service shipped with Hive since early releases)
@@ -13,6 +13,10 @@ It is capable of using the following Thrift transports:
13
13
  * SaslClientTransport ([SASL-enabled](http://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer) transport)
14
14
  * HTTPClientTransport (tunnels Thrift over HTTP)
15
15
 
16
+ As of version 1.0, it supports asynchronous execution of queries. This allows you to submit
17
+ a query, disconnect, then reconnect later to check the status and retrieve the results.
18
+ This frees systems from the need to keep a persistent TCP connection open.
19
+
16
20
  ## About Thrift services and transports
17
21
 
18
22
  ### Hiveserver
@@ -31,6 +35,12 @@ supported; starting with Hive 0.12, HTTPClientTransport is also supported.
31
35
  Each of the versions after Hive 0.10 has a slightly different Thrift interface; when
32
36
  connecting, you must specify the Hive version or you may get an exception.
33
37
 
38
+ In versions later than 0.12, Hiveserver2 supports asynchronous query execution. This
39
+ works by submitting a query and retrieving a handle to the execution process; you can
40
+ then reconnect at a later time and retrieve the results using this handle.
41
+ Using the asynchronous methods has some caveats - please read the Asynchronous Execution
42
+ section of the documentation thoroughly before using them.
43
+
34
44
  RBHive implements this client with the `RBHive::TCLIConnection` class.
35
45
 
36
46
  #### Warning!
@@ -129,6 +139,75 @@ In addition, you can explicitly set the Thrift protocol version according to thi
129
139
  | `:PROTOCOL_V6` | V6 | Updated during Hive 0.13 development, adds binary type for binary payload, uses columnar result set
130
140
  | `:PROTOCOL_V7` | V7 | Used by Hive 0.13 release, support for token-based delegation connections
131
141
 
142
+ ## Asynchronous execution with Hiveserver2
143
+
144
+ In versions of Hive later than 0.12, the Thrift server supports asynchronous execution.
145
+
146
+ The high-level view of using this feature is as follows:
147
+ 1. Submit your query using `async_execute(query)`. This function returns a hash
148
+ with the following keys: `:guid`, `:secret`, and `:session`. You don't need to
149
+ care about the internals of this hash - all methods that interact with an async
150
+ query require this hash, and you can just store it and hand it to the methods.
151
+ 2. To check the state of the query, call `async_state(handles)`, where `handles`
152
+ is the handles hash given to you when you called `async_execute(query)`.
153
+ 3. To retrieve results, call either `async_fetch(handles)` or `async_fetch_in_batch(handles)`,
154
+ which work like their synchronous counterparts.
155
+ 4. When you're done with the query, call `async_close_session(handles)`.
156
+
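The four steps above can be sketched roughly as follows. This is only an illustration, not RBHive's own code: `FakeAsyncConnection` and `run_async_query` are hypothetical names, and the fake connection exists purely so the snippet runs without a Hive server. With a real connection you would call the same `async_*` methods on it instead.

```ruby
# Hypothetical in-memory stand-in for a Hiveserver2 connection. Only the
# async_* method names are taken from the README; everything else is assumed.
class FakeAsyncConnection
  attr_reader :closed

  def initialize(rows, polls_until_finished = 2)
    @rows = rows
    @polls_left = polls_until_finished
    @closed = false
  end

  def async_execute(_query)
    { guid: "g".b, secret: "s".b, session: "x".b }
  end

  # Reports :running a few times, then :finished.
  def async_state(_handles)
    return :finished if @polls_left <= 0
    @polls_left -= 1
    :running
  end

  def async_fetch(_handles)
    @rows
  end

  def async_close_session(_handles)
    @closed = true
  end
end

# Submit a query, poll until it stops running, fetch the rows, and always
# close the session afterwards so nothing is left behind on the server.
def run_async_query(connection, query, poll_interval = 0.01)
  handles = connection.async_execute(query)                 # step 1
  state = connection.async_state(handles)                   # step 2
  while [:initialized, :pending, :running].include?(state)
    sleep(poll_interval)
    state = connection.async_state(handles)
  end
  raise "Query ended in state: #{state}" unless state == :finished
  connection.async_fetch(handles)                           # step 3
ensure
  connection.async_close_session(handles) if handles        # step 4
end
```

Closing the session in an `ensure` block means it is closed even when polling or fetching raises. Keep in mind that this closes every async handle created in that session, so it only suits the case where the session runs a single query.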
157
+ ### Memory leaks
158
+
159
+ When you call `async_close_session(handles)`, *all async handles created during this
160
+ session are closed*.
161
+
162
+ If you do not close the sessions you create, *you will leak memory in the Hiveserver2 process*.
163
+ Be very careful to close your sessions!
164
+
165
+ ### Method documentation
166
+
167
+ #### `async_execute(query)`
168
+
169
+ This method submits a query for async execution. The hash you get back is used in the other
170
+ async methods, and will look like this:
171
+
172
+ {
173
+ :guid => (binary string),
174
+ :secret => (binary string),
175
+ :session => (binary string)
176
+ }
177
+
178
+ The Thrift protocol specifies the strings as "binary" - which means they have no encoding.
179
+ Be *extremely* careful when manipulating or storing these values, as they can quite easily
180
+ get converted to UTF-8 strings, which will make them invalid when trying to retrieve async data.
181
+
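One way to keep the handle values intact while they sit in a text-only store (a JSON file, a database text column, a message queue) is to Base64-encode them on the way out and tag them as binary again on the way in. The sketch below assumes all three values are plain binary strings, as described above; `dump_handles` and `load_handles` are hypothetical helper names, not part of RBHive.

```ruby
require "json"

# Serialize the handles hash to JSON, Base64-encoding every value so that
# arbitrary bytes survive a text-only storage format.
def dump_handles(handles)
  encoded = handles.map { |key, value| [key.to_s, [value].pack("m0")] }.to_h
  JSON.generate(encoded)
end

# Restore the handles hash, decoding each value back to a binary string.
def load_handles(json)
  JSON.parse(json).map { |key, value|
    [key.to_sym, value.unpack1("m0").force_encoding(Encoding::BINARY)]
  }.to_h
end
```

`pack("m0")` and `unpack1("m0")` are Ruby's built-in strict Base64 directives, so nothing beyond the standard `json` library is needed.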
182
+ #### `async_state(handles)`
183
+
184
+ `handles` is the hash returned by `async_execute(query)`. The state will be a symbol with
185
+ one of the following values and meanings:
186
+
187
+ | symbol | meaning
188
+ | --------------------- | -------
189
+ | :initialized | The query is initialized in Hive and ready to run
190
+ | :running | The query is running (either as a MapReduce job or within process)
191
+ | :finished | The query is completed and results can be retrieved
192
+ | :cancelled | The query was cancelled by a user
193
+ | :closed | Unknown at present
194
+ | :error | The query is invalid semantically or broken in another way
195
+ | :unknown | The query is in an unknown state
196
+ | :pending | The query is ready to run but is not running
197
+
198
+ There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
199
+ `async_is_failed?(handles)` and `async_is_cancelled?(handles)`.
200
+
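A polling loop built on `async_state` might look like the sketch below. `wait_for_query` and `TERMINAL_STATES` are hypothetical names, and treating `:unknown` as a non-terminal state is an assumption made for this example, not something the README specifies.

```ruby
# States from the table above in which the query will make no more progress.
# Assumption: :unknown is not terminal, so polling continues until the timeout.
TERMINAL_STATES = [:finished, :cancelled, :closed, :error].freeze

# Polls async_state until a terminal state is reached or the timeout expires.
# `connection` is any object that responds to async_state(handles).
def wait_for_query(connection, handles, timeout: 60, interval: 1)
  deadline = Time.now + timeout
  loop do
    state = connection.async_state(handles)
    return state if TERMINAL_STATES.include?(state)
    raise "Timed out waiting for query (last state: #{state})" if Time.now >= deadline
    sleep(interval)
  end
end
```

The caller receives the final state symbol and can decide whether to fetch results (`:finished`) or report a failure.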
201
+ #### `async_cancel(handles)`
202
+
203
+ Calling this method will cancel the query in execution.
204
+
205
+ #### `async_fetch(handles)`, `async_fetch_in_batch(handles)`
206
+
207
+ These methods let you fetch the results of the async query, if it is complete. If you call
208
+ these methods on an incomplete query, they will raise an exception. They work in exactly the
209
+ same way as the normal synchronous methods.
210
+
132
211
  ## Examples
133
212
 
134
213
  ### Fetching results
@@ -103,7 +103,6 @@ module RBHive
103
103
  @client = Hive2::Thrift::TCLIService::Client.new(@protocol)
104
104
  @session = nil
105
105
  @logger.info("Connecting to HiveServer2 #{server} on port #{port}")
106
- @mutex = Mutex.new
107
106
  end
108
107
 
109
108
  def thrift_hive_protocol(version)
@@ -169,7 +168,11 @@ module RBHive
169
168
  end
170
169
 
171
170
  def execute(query)
172
- execute_safe(query)
171
+ @logger.info("Executing Hive Query: #{query}")
172
+ req = prepare_execute_statement(query)
173
+ exec_result = client.ExecuteStatement(req)
174
+ raise_error_if_failed!(exec_result)
175
+ exec_result
173
176
  end
174
177
 
175
178
  def priority=(priority)
@@ -185,6 +188,118 @@ module RBHive
185
188
  self.execute("SET #{name}=#{value}")
186
189
  end
187
190
 
191
+ # Async execute
192
+ def async_execute(query)
193
+ @logger.info("Executing query asynchronously: #{query}")
194
+ op_handle = @client.ExecuteStatement(
195
+ Hive2::Thrift::TExecuteStatementReq.new(
196
+ sessionHandle: @session.sessionHandle,
197
+ statement: query,
198
+ runAsync: true
199
+ )
200
+ ).operationHandle
201
+
202
+ # Return handles to get hold of this query / session again
203
+ {
204
+ session: @session.sessionHandle,
205
+ guid: op_handle.operationId.guid,
206
+ secret: op_handle.operationId.secret
207
+ }
208
+ end
209
+
210
+ # Is the query complete?
211
+ def async_is_complete?(handles)
212
+ async_state(handles) == :finished
213
+ end
214
+
215
+ # Is the query actually running?
216
+ def async_is_running?(handles)
217
+ async_state(handles) == :running
218
+ end
219
+
220
+ # Has the query failed?
221
+ def async_is_failed?(handles)
222
+ async_state(handles) == :error
223
+ end
224
+
225
+ def async_is_cancelled?(handles)
226
+ async_state(handles) == :cancelled
227
+ end
228
+
229
+ def async_cancel(handles)
230
+ @client.CancelOperation(prepare_cancel_request(handles))
231
+ end
232
+
233
+ # Map states to symbols
234
+ def async_state(handles)
235
+ response = @client.GetOperationStatus(
236
+ Hive2::Thrift::TGetOperationStatusReq.new(operationHandle: prepare_operation_handle(handles))
237
+ )
238
+ @logger.info("Operation state: #{response.operationState}")
239
+ case response.operationState
240
+ when Hive2::Thrift::TOperationState::FINISHED_STATE
241
+ return :finished
242
+ when Hive2::Thrift::TOperationState::INITIALIZED_STATE
243
+ return :initialized
244
+ when Hive2::Thrift::TOperationState::RUNNING_STATE
245
+ return :running
246
+ when Hive2::Thrift::TOperationState::CANCELED_STATE
247
+ return :cancelled
248
+ when Hive2::Thrift::TOperationState::CLOSED_STATE
249
+ return :closed
250
+ when Hive2::Thrift::TOperationState::ERROR_STATE
251
+ return :error
252
+ when Hive2::Thrift::TOperationState::UKNOWN_STATE
253
+ return :unknown
254
+ when Hive2::Thrift::TOperationState::PENDING_STATE
255
+ return :pending
256
+ else
257
+ return :state_not_in_protocol
258
+ end
259
+ end
260
+
261
+ # Async fetch results from an async execute
262
+ def async_fetch(handles, max_rows = 100)
263
+ # Can't get data from an unfinished query
264
+ unless async_is_complete?(handles)
265
+ raise "Can't perform fetch on a query in state: #{async_state(handles)}"
266
+ end
267
+
268
+ # Fetch the rows and return them as a result set
269
+ fetch_rows(prepare_operation_handle(handles), :first, max_rows)
270
+ end
271
+
272
+ # Performs a query on the server, fetches the results in batches of *batch_size* rows
273
+ # and yields the result batches to a given block as arrays of rows.
274
+ def async_fetch_in_batch(handles, batch_size = 1000, &block)
275
+ raise "No block given for the batch fetch request!" unless block_given?
276
+ # Can't get data from an unfinished query
277
+ unless async_is_complete?(handles)
278
+ raise "Can't perform fetch on a query in state: #{async_state(handles)}"
279
+ end
280
+
281
+ # Now let's iterate over the results
282
+ loop do
283
+ rows = fetch_rows(prepare_operation_handle(handles), :next, batch_size)
284
+ break if rows.empty?
285
+ yield rows
286
+ end
287
+ end
288
+
289
+ def async_close_session(handles)
290
+ validate_handles!(handles)
291
+ @client.CloseSession(Hive2::Thrift::TCloseSessionReq.new( sessionHandle: handles[:session] ))
292
+ end
293
+
294
+ # Pull rows from the query result
295
+ def fetch_rows(op_handle, orientation = :first, max_rows = 1000)
296
+ fetch_req = prepare_fetch_results(op_handle, orientation, max_rows)
297
+ fetch_results = @client.FetchResults(fetch_req)
298
+ raise_error_if_failed!(fetch_results)
299
+ rows = fetch_results.results.rows
300
+ TCLIResultSet.new(rows, TCLISchemaDefinition.new(get_schema_for(op_handle), rows.first))
301
+ end
302
+
188
303
  # Performs an EXPLAIN of the supplied query on the server and returns it as an ExplainResult.
189
304
  # (Only works on 0.12 if you have this patch - https://issues.apache.org/jira/browse/HIVE-5492)
190
305
  def explain(query)
@@ -197,58 +312,37 @@ module RBHive
197
312
 
198
313
  # Performs a query on the server, fetches up to *max_rows* rows and returns them as an array.
199
314
  def fetch(query, max_rows = 100)
200
- safe do
201
- # Execute the query and check the result
202
- exec_result = execute_unsafe(query)
203
- raise_error_if_failed!(exec_result)
204
-
205
- # Get search operation handle to fetch the results
206
- op_handle = exec_result.operationHandle
207
-
208
- # Prepare and execute fetch results request
209
- fetch_req = prepare_fetch_results(op_handle, :first, max_rows)
210
- fetch_results = client.FetchResults(fetch_req)
211
- raise_error_if_failed!(fetch_results)
212
-
213
- # Get data rows and format the result
214
- rows = fetch_results.results.rows
215
- the_schema = TCLISchemaDefinition.new(get_schema_for( op_handle ), rows.first)
216
- TCLIResultSet.new(rows, the_schema)
217
- end
315
+ # Execute the query and check the result
316
+ exec_result = execute(query)
317
+ raise_error_if_failed!(exec_result)
318
+
319
+ # Get search operation handle to fetch the results
320
+ op_handle = exec_result.operationHandle
321
+
322
+ # Fetch the rows
323
+ fetch_rows(op_handle, :first, max_rows)
218
324
  end
219
325
 
220
326
  # Performs a query on the server, fetches the results in batches of *batch_size* rows
221
327
  # and yields the result batches to a given block as arrays of rows.
222
328
  def fetch_in_batch(query, batch_size = 1000, &block)
223
329
  raise "No block given for the batch fetch request!" unless block_given?
224
- safe do
225
- # Execute the query and check the result
226
- exec_result = execute_unsafe(query)
227
- raise_error_if_failed!(exec_result)
228
-
229
- # Get search operation handle to fetch the results
230
- op_handle = exec_result.operationHandle
231
-
232
- # Prepare fetch results request
233
- fetch_req = prepare_fetch_results(op_handle, :next, batch_size)
234
-
235
- # Now let's iterate over the results
236
- loop do
237
- # Fetch next batch and raise an exception if it failed
238
- fetch_results = client.FetchResults(fetch_req)
239
- raise_error_if_failed!(fetch_results)
330
+
331
+ # Execute the query and check the result
332
+ exec_result = execute(query)
333
+ raise_error_if_failed!(exec_result)
240
334
 
241
- # Get data rows from the result
242
- rows = fetch_results.results.rows
243
- break if rows.empty?
335
+ # Get search operation handle to fetch the results
336
+ op_handle = exec_result.operationHandle
244
337
 
245
- # Prepare schema definition for the row
246
- schema_for_req ||= get_schema_for(op_handle)
247
- the_schema ||= TCLISchemaDefinition.new(schema_for_req, rows.first)
338
+ # Prepare fetch results request
339
+ fetch_req = prepare_fetch_results(op_handle, :next, batch_size)
248
340
 
249
- # Format the results and yield them to the given block
250
- yield TCLIResultSet.new(rows, the_schema)
251
- end
341
+ # Now let's iterate over the results
342
+ loop do
343
+ rows = fetch_rows(op_handle, :next, batch_size)
344
+ break if rows.empty?
345
+ yield rows
252
346
  end
253
347
  end
254
348
 
@@ -275,26 +369,6 @@ module RBHive
275
369
 
276
370
  private
277
371
 
278
- def execute_safe(query)
279
- safe do
280
- exec_result = execute_unsafe(query)
281
- raise_error_if_failed!(exec_result)
282
- exec_result
283
- end
284
- end
285
-
286
- def execute_unsafe(query)
287
- @logger.info("Executing Hive Query: #{query}")
288
- req = prepare_execute_statement(query)
289
- client.ExecuteStatement(req)
290
- end
291
-
292
- def safe
293
- ret = nil
294
- @mutex.synchronize { ret = yield }
295
- ret
296
- end
297
-
298
372
  def prepare_open_session(client_protocol)
299
373
  req = ::Hive2::Thrift::TOpenSessionReq.new( @options[:sasl_params].nil? ? [] : @options[:sasl_params] )
300
374
  req.client_protocol = client_protocol
@@ -323,6 +397,27 @@ module RBHive
323
397
  )
324
398
  end
325
399
 
400
+ def prepare_operation_handle(handles)
401
+ validate_handles!(handles)
402
+ Hive2::Thrift::TOperationHandle.new(
403
+ operationId: Hive2::Thrift::THandleIdentifier.new(guid: handles[:guid], secret: handles[:secret]),
404
+ operationType: Hive2::Thrift::TOperationType::EXECUTE_STATEMENT,
405
+ hasResultSet: false
406
+ )
407
+ end
408
+
409
+ def prepare_cancel_request(handles)
410
+ Hive2::Thrift::TCancelOperationReq.new(
411
+ operationHandle: prepare_operation_handle(handles)
412
+ )
413
+ end
414
+
415
+ def validate_handles!(handles)
416
+ unless handles.has_key?(:guid) && handles.has_key?(:secret) && handles.has_key?(:session)
417
+ raise "Invalid handles hash: #{handles.inspect}"
418
+ end
419
+ end
420
+
326
421
  def get_schema_for(handle)
327
422
  req = ::Hive2::Thrift::TGetResultSetMetadataReq.new( operationHandle: handle )
328
423
  metadata = client.GetResultSetMetadata( req )
@@ -1,3 +1,3 @@
1
1
  module RBHive
2
- VERSION = '0.6.0'
2
+ VERSION = '1.0.0.pre'
3
3
  end
metadata CHANGED
@@ -1,8 +1,8 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: rbhive
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.6.0
5
- prerelease:
4
+ version: 1.0.0.pre
5
+ prerelease: 6
6
6
  platform: ruby
7
7
  authors:
8
8
  - Forward3D
@@ -10,7 +10,7 @@ authors:
10
10
  autorequire:
11
11
  bindir: bin
12
12
  cert_chain: []
13
- date: 2014-03-28 00:00:00.000000000 Z
13
+ date: 2014-03-31 00:00:00.000000000 Z
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
16
16
  name: thrift
@@ -134,16 +134,13 @@ required_ruby_version: !ruby/object:Gem::Requirement
134
134
  version: '0'
135
135
  segments:
136
136
  - 0
137
- hash: 2810079357689827941
137
+ hash: 2597338757284379755
138
138
  required_rubygems_version: !ruby/object:Gem::Requirement
139
139
  none: false
140
140
  requirements:
141
- - - ! '>='
141
+ - - ! '>'
142
142
  - !ruby/object:Gem::Version
143
- version: '0'
144
- segments:
145
- - 0
146
- hash: 2810079357689827941
143
+ version: 1.3.1
147
144
  requirements: []
148
145
  rubyforge_project:
149
146
  rubygems_version: 1.8.23