rbhive 0.6.0 → 1.0.0.pre

data/README.md CHANGED
@@ -1,7 +1,7 @@
1
- # RBHive -- Ruby thrift lib for executing Hive queries
1
+ # RBHive - A Ruby Thrift client for Apache Hive
2
2
 
3
3
  RBHive is a simple Ruby gem to communicate with the [Apache Hive](http://hive.apache.org)
4
- Thrift server.
4
+ Thrift servers.
5
5
 
6
6
  It supports:
7
7
  * Hiveserver (the original Thrift service shipped with Hive since early releases)
@@ -13,6 +13,10 @@ It is capable of using the following Thrift transports:
13
13
  * SaslClientTransport ([SASL-enabled](http://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer) transport)
14
14
  * HTTPClientTransport (tunnels Thrift over HTTP)
15
15
 
16
+ As of version 1.0, it supports asynchronous execution of queries. This allows you to submit
17
+ a query, disconnect, then reconnect later to check the status and retrieve the results.
18
+ This frees clients from having to keep a persistent TCP connection open while the query runs.
19
+
16
20
  ## About Thrift services and transports
17
21
 
18
22
  ### Hiveserver
@@ -31,6 +35,12 @@ supported; starting with Hive 0.12, HTTPClientTransport is also supported.
31
35
  Each of the versions after Hive 0.10 has a slightly different Thrift interface; when
32
36
  connecting, you must specify the Hive version or you may get an exception.
33
37
 
38
+ Hiveserver2 supports (in versions later than 0.12) asynchronous query execution. This
39
+ works by submitting a query and retrieving a handle to the execution process; you can
40
+ then reconnect at a later time and retrieve the results using this handle.
41
+ Using the asynchronous methods has some caveats - please read the Asynchronous Execution
42
+ section of the documentation thoroughly before using them.
43
+
34
44
  RBHive implements this client with the `RBHive::TCLIConnection` class.
35
45
 
36
46
  #### Warning!
@@ -129,6 +139,75 @@ In addition, you can explicitly set the Thrift protocol version according to thi
129
139
  | `:PROTOCOL_V6` | V6 | Updated during Hive 0.13 development, adds binary type for binary payload, uses columnar result set
130
140
  | `:PROTOCOL_V7` | V7 | Used by Hive 0.13 release, support for token-based delegation connections
131
141
 
142
+ ## Asynchronous execution with Hiveserver2
143
+
144
+ In versions of Hive later than 0.12, the Thrift server supports asynchronous execution.
145
+
146
+ The high-level view of using this feature is as follows:
147
+ 1. Submit your query using `async_execute(query)`. This function returns a hash
148
+ with the following keys: `:guid`, `:secret`, and `:session`. You don't need to
149
+ care about the internals of this hash - all methods that interact with an async
150
+ query require this hash, and you can just store it and hand it to the methods.
151
+ 2. To check the state of the query, call `async_state(handles)`, where `handles`
152
+ is the handles hash given to you when you called `async_execute(query)`.
153
+ 3. To retrieve results, call either `async_fetch(handles)` or `async_fetch_in_batch(handles)`,
154
+ which work like the non-async methods.
155
+ 4. When you're done with the query, call `async_close_session(handles)`.
156
+
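The four steps above can be sketched as a small polling helper. This is only an illustration under stated assumptions: `FakeAsyncConnection` and `run_async` are hypothetical names that are not part of RBHive, and the fake only mimics the method names and return shapes described in the list so that the sketch runs without a Hive server. Treating `:error`, `:cancelled` and `:closed` as reasons to stop polling is also an assumption.

```ruby
# Hypothetical in-memory stand-in for a connection; NOT part of RBHive.
# It reports :running for a few polls and then :finished.
class FakeAsyncConnection
  attr_reader :closed_sessions

  def initialize(rows, polls_until_finished = 2)
    @rows = rows
    @polls_left = polls_until_finished
    @closed_sessions = []
  end

  # Step 1: returns the handles hash (:guid, :secret, :session).
  def async_execute(_query)
    { guid: "g".b, secret: "s".b, session: "sess".b }
  end

  # Step 2: returns the current state as a symbol.
  def async_state(_handles)
    return :finished if @polls_left <= 0
    @polls_left -= 1
    :running
  end

  # Step 3: returns the result rows.
  def async_fetch(_handles)
    @rows
  end

  # Step 4: records which session was closed.
  def async_close_session(handles)
    @closed_sessions << handles[:session]
  end
end

# Hypothetical helper: submit, poll, fetch, and always close the session.
def run_async(connection, query, poll_interval: 0.01, max_polls: 100)
  handles = connection.async_execute(query)
  begin
    max_polls.times do
      state = connection.async_state(handles)
      return connection.async_fetch(handles) if state == :finished
      raise "Query ended in state #{state}" if [:error, :cancelled, :closed].include?(state)
      sleep poll_interval
    end
    raise "Query did not finish after #{max_polls} polls"
  ensure
    # Runs on success, failure, and timeout alike.
    connection.async_close_session(handles)
  end
end

conn = FakeAsyncConnection.new([{ city: "London" }])
rows = run_async(conn, "SELECT city FROM cities")
```

Putting the close call in an `ensure` block means the session is released even when polling raises, which matters given the memory-leak warning in the next section.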
157
+ ### Memory leaks
158
+
159
+ When you call `async_close_session(handles)`, *all async handles created during this
160
+ session are closed*.
161
+
162
+ If you do not close the sessions you create, *you will leak memory in the Hiveserver2 process*.
163
+ Be very careful to close your sessions!
164
+
165
+ ### Method documentation
166
+
167
+ #### `async_execute(query)`
168
+
169
+ This method submits a query for async execution. The hash you get back is used in the other
170
+ async methods, and will look like this:
171
+
172
+ {
173
+ :guid => (binary string),
174
+ :secret => (binary string),
175
+ :session => (binary string)
176
+ }
177
+
178
+ The Thrift protocol specifies the strings as "binary" - which means they have no encoding.
179
+ Be *extremely* careful when manipulating or storing these values, as they can quite easily
180
+ get converted to UTF-8 strings, which will make them invalid when trying to retrieve async data.
181
+
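One way to avoid the encoding problem described above is to Base64-encode the values before storing them and decode them again before use. The sketch below assumes, as the README states, that all three values are plain binary strings; a value that is not a `String` would need its own serialization. `encode_handles` and `decode_handles` are hypothetical helpers, not part of RBHive, and use only core Ruby (`Array#pack` and `String#unpack1` with the `m0` directive).

```ruby
# Hypothetical helpers (NOT part of RBHive): convert binary handle values
# into ASCII-only text for storage (database column, JSON, queue message)
# and back into binary strings.

def encode_handles(handles)
  handles.each_with_object({}) do |(key, value), out|
    # "m0" is strict Base64 without line breaks.
    out[key.to_s] = [value].pack("m0")
  end
end

def decode_handles(stored)
  stored.each_with_object({}) do |(key, value), out|
    # unpack1("m0") returns a string with binary (ASCII-8BIT) encoding.
    out[key.to_sym] = value.unpack1("m0")
  end
end

# Made-up byte sequences that are not valid UTF-8, for illustration only.
original = {
  guid: "\xC3\x28\x00\xFF".b,
  secret: "\x80\x81\x82".b,
  session: "\xFE\xED".b
}

stored   = encode_handles(original) # safe to write anywhere text is expected
restored = decode_handles(stored)   # pass this back to the async methods
```

The stored form contains only ASCII characters, so a later UTF-8 conversion by a database driver or JSON library cannot alter the underlying bytes.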
182
+ #### `async_state(handles)`
183
+
184
+ `handles` is the hash returned by `async_execute(query)`. The state will be a symbol with
185
+ one of the following values and meanings:
186
+
187
+ | symbol | meaning
188
+ | --------------------- | -------
189
+ | :initialized | The query is initialized in Hive and ready to run
190
+ | :running | The query is running (either as a MapReduce job or within process)
191
+ | :finished | The query is completed and results can be retrieved
192
+ | :cancelled | The query was cancelled by a user
193
+ | :closed | Unknown at present
194
+ | :error | The query is invalid semantically or broken in another way
195
+ | :unknown | The query is in an unknown state
196
+ | :pending | The query is ready to run but is not running
197
+
198
+ There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
199
+ `async_is_failed?(handles)` and `async_is_cancelled?(handles)`.
200
+
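When polling, it can help to collapse the eight state symbols into a few coarse outcomes. The helper below is a hypothetical sketch, not part of RBHive. Grouping `:closed` with the failures is an assumption, since its meaning is listed as unknown in the table above.

```ruby
# Hypothetical helper (NOT part of RBHive): maps a state symbol from the
# table above to a coarse outcome for use in a polling loop.
def classify_state(state)
  case state
  when :finished
    :succeeded
  when :error, :cancelled, :closed # :closed treated as failure by assumption
    :failed
  when :initialized, :pending, :running
    :in_progress
  else
    :unrecognized # covers :unknown and any symbol not in the table
  end
end

outcomes = [:pending, :running, :finished].map { |s| classify_state(s) }
```

A caller would keep polling while the outcome is `:in_progress` and stop on anything else.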
201
+ #### `async_cancel(handles)`
202
+
203
+ Calling this method will cancel the query in execution.
204
+
205
+ #### `async_fetch(handles)`, `async_fetch_in_batch(handles)`
206
+
207
+ These methods let you fetch the results of the async query once it has finished. If you call
208
+ these methods on an incomplete query, they will raise an exception. They work in exactly the
209
+ same way as the normal synchronous methods.
210
+
132
211
  ## Examples
133
212
 
134
213
  ### Fetching results
@@ -103,7 +103,6 @@ module RBHive
103
103
  @client = Hive2::Thrift::TCLIService::Client.new(@protocol)
104
104
  @session = nil
105
105
  @logger.info("Connecting to HiveServer2 #{server} on port #{port}")
106
- @mutex = Mutex.new
107
106
  end
108
107
 
109
108
  def thrift_hive_protocol(version)
@@ -169,7 +168,11 @@ module RBHive
169
168
  end
170
169
 
171
170
  def execute(query)
172
- execute_safe(query)
171
+ @logger.info("Executing Hive Query: #{query}")
172
+ req = prepare_execute_statement(query)
173
+ exec_result = client.ExecuteStatement(req)
174
+ raise_error_if_failed!(exec_result)
175
+ exec_result
173
176
  end
174
177
 
175
178
  def priority=(priority)
@@ -185,6 +188,118 @@ module RBHive
185
188
  self.execute("SET #{name}=#{value}")
186
189
  end
187
190
 
191
+ # Async execute
192
+ def async_execute(query)
193
+ @logger.info("Executing query asynchronously: #{query}")
194
+ op_handle = @client.ExecuteStatement(
195
+ Hive2::Thrift::TExecuteStatementReq.new(
196
+ sessionHandle: @session.sessionHandle,
197
+ statement: query,
198
+ runAsync: true
199
+ )
200
+ ).operationHandle
201
+
202
+ # Return handles to get hold of this query / session again
203
+ {
204
+ session: @session.sessionHandle,
205
+ guid: op_handle.operationId.guid,
206
+ secret: op_handle.operationId.secret
207
+ }
208
+ end
209
+
210
+ # Is the query complete?
211
+ def async_is_complete?(handles)
212
+ async_state(handles) == :finished
213
+ end
214
+
215
+ # Is the query actually running?
216
+ def async_is_running?(handles)
217
+ async_state(handles) == :running
218
+ end
219
+
220
+ # Has the query failed?
221
+ def async_is_failed?(handles)
222
+ async_state(handles) == :error
223
+ end
224
+
225
+ def async_is_cancelled?(handles)
226
+ async_state(handles) == :cancelled
227
+ end
228
+
229
+ def async_cancel(handles)
230
+ @client.CancelOperation(prepare_cancel_request(handles))
231
+ end
232
+
233
+ # Map states to symbols
234
+ def async_state(handles)
235
+ response = @client.GetOperationStatus(
236
+ Hive2::Thrift::TGetOperationStatusReq.new(operationHandle: prepare_operation_handle(handles))
237
+ )
238
+ @logger.debug("Operation state: #{response.operationState}")
239
+ case response.operationState
240
+ when Hive2::Thrift::TOperationState::FINISHED_STATE
241
+ return :finished
242
+ when Hive2::Thrift::TOperationState::INITIALIZED_STATE
243
+ return :initialized
244
+ when Hive2::Thrift::TOperationState::RUNNING_STATE
245
+ return :running
246
+ when Hive2::Thrift::TOperationState::CANCELED_STATE
247
+ return :cancelled
248
+ when Hive2::Thrift::TOperationState::CLOSED_STATE
249
+ return :closed
250
+ when Hive2::Thrift::TOperationState::ERROR_STATE
251
+ return :error
252
+ when Hive2::Thrift::TOperationState::UKNOWN_STATE
253
+ return :unknown
254
+ when Hive2::Thrift::TOperationState::PENDING_STATE
255
+ return :pending
256
+ else
257
+ return :state_not_in_protocol
258
+ end
259
+ end
260
+
261
+ # Async fetch results from an async execute
262
+ def async_fetch(handles, max_rows = 100)
263
+ # Can't get data from an unfinished query
264
+ unless async_is_complete?(handles)
265
+ raise "Can't perform fetch on a query in state: #{async_state(handles)}"
266
+ end
267
+
268
+ # Fetch the rows
269
+ fetch_rows(prepare_operation_handle(handles), :first, max_rows)
270
+ end
271
+
272
+ # Performs a query on the server, fetches the results in batches of *batch_size* rows
273
+ # and yields the result batches to a given block as arrays of rows.
274
+ def async_fetch_in_batch(handles, batch_size = 1000, &block)
275
+ raise "No block given for the batch fetch request!" unless block_given?
276
+ # Can't get data from an unfinished query
277
+ unless async_is_complete?(handles)
278
+ raise "Can't perform fetch on a query in state: #{async_state(handles)}"
279
+ end
280
+
281
+ # Now let's iterate over the results
282
+ loop do
283
+ rows = fetch_rows(prepare_operation_handle(handles), :next, batch_size)
284
+ break if rows.empty?
285
+ yield rows
286
+ end
287
+ end
288
+
289
+ def async_close_session(handles)
290
+ validate_handles!(handles)
291
+ @client.CloseSession(Hive2::Thrift::TCloseSessionReq.new( sessionHandle: handles[:session] ))
292
+ end
293
+
294
+ # Pull rows from the query result
295
+ def fetch_rows(op_handle, orientation = :first, max_rows = 1000)
296
+ fetch_req = prepare_fetch_results(op_handle, orientation, max_rows)
297
+ fetch_results = @client.FetchResults(fetch_req)
298
+ raise_error_if_failed!(fetch_results)
299
+ rows = fetch_results.results.rows
300
+ TCLIResultSet.new(rows, TCLISchemaDefinition.new(get_schema_for(op_handle), rows.first))
301
+ end
302
+
188
303
  # Performs an explain on the supplied query on the server, returns it as an ExplainResult.
189
304
  # (Only works on 0.12 if you have this patch - https://issues.apache.org/jira/browse/HIVE-5492)
190
305
  def explain(query)
@@ -197,58 +312,37 @@ module RBHive
197
312
 
198
313
  # Performs a query on the server, fetches up to *max_rows* rows and returns them as an array.
199
314
  def fetch(query, max_rows = 100)
200
- safe do
201
- # Execute the query and check the result
202
- exec_result = execute_unsafe(query)
203
- raise_error_if_failed!(exec_result)
204
-
205
- # Get search operation handle to fetch the results
206
- op_handle = exec_result.operationHandle
207
-
208
- # Prepare and execute fetch results request
209
- fetch_req = prepare_fetch_results(op_handle, :first, max_rows)
210
- fetch_results = client.FetchResults(fetch_req)
211
- raise_error_if_failed!(fetch_results)
212
-
213
- # Get data rows and format the result
214
- rows = fetch_results.results.rows
215
- the_schema = TCLISchemaDefinition.new(get_schema_for( op_handle ), rows.first)
216
- TCLIResultSet.new(rows, the_schema)
217
- end
315
+ # Execute the query and check the result
316
+ exec_result = execute(query)
317
+ raise_error_if_failed!(exec_result)
318
+
319
+ # Get search operation handle to fetch the results
320
+ op_handle = exec_result.operationHandle
321
+
322
+ # Fetch the rows
323
+ fetch_rows(op_handle, :first, max_rows)
218
324
  end
219
325
 
220
326
  # Performs a query on the server, fetches the results in batches of *batch_size* rows
221
327
  # and yields the result batches to a given block as arrays of rows.
222
328
  def fetch_in_batch(query, batch_size = 1000, &block)
223
329
  raise "No block given for the batch fetch request!" unless block_given?
224
- safe do
225
- # Execute the query and check the result
226
- exec_result = execute_unsafe(query)
227
- raise_error_if_failed!(exec_result)
228
-
229
- # Get search operation handle to fetch the results
230
- op_handle = exec_result.operationHandle
231
-
232
- # Prepare fetch results request
233
- fetch_req = prepare_fetch_results(op_handle, :next, batch_size)
234
-
235
- # Now let's iterate over the results
236
- loop do
237
- # Fetch next batch and raise an exception if it failed
238
- fetch_results = client.FetchResults(fetch_req)
239
- raise_error_if_failed!(fetch_results)
330
+
331
+ # Execute the query and check the result
332
+ exec_result = execute(query)
333
+ raise_error_if_failed!(exec_result)
240
334
 
241
- # Get data rows from the result
242
- rows = fetch_results.results.rows
243
- break if rows.empty?
335
+ # Get search operation handle to fetch the results
336
+ op_handle = exec_result.operationHandle
244
337
 
245
- # Prepare schema definition for the row
246
- schema_for_req ||= get_schema_for(op_handle)
247
- the_schema ||= TCLISchemaDefinition.new(schema_for_req, rows.first)
338
+ # Prepare fetch results request
339
+ fetch_req = prepare_fetch_results(op_handle, :next, batch_size)
248
340
 
249
- # Format the results and yield them to the given block
250
- yield TCLIResultSet.new(rows, the_schema)
251
- end
341
+ # Now let's iterate over the results
342
+ loop do
343
+ rows = fetch_rows(op_handle, :next, batch_size)
344
+ break if rows.empty?
345
+ yield rows
252
346
  end
253
347
  end
254
348
 
@@ -275,26 +369,6 @@ module RBHive
275
369
 
276
370
  private
277
371
 
278
- def execute_safe(query)
279
- safe do
280
- exec_result = execute_unsafe(query)
281
- raise_error_if_failed!(exec_result)
282
- exec_result
283
- end
284
- end
285
-
286
- def execute_unsafe(query)
287
- @logger.info("Executing Hive Query: #{query}")
288
- req = prepare_execute_statement(query)
289
- client.ExecuteStatement(req)
290
- end
291
-
292
- def safe
293
- ret = nil
294
- @mutex.synchronize { ret = yield }
295
- ret
296
- end
297
-
298
372
  def prepare_open_session(client_protocol)
299
373
  req = ::Hive2::Thrift::TOpenSessionReq.new( @options[:sasl_params].nil? ? [] : @options[:sasl_params] )
300
374
  req.client_protocol = client_protocol
@@ -323,6 +397,27 @@ module RBHive
323
397
  )
324
398
  end
325
399
 
400
+ def prepare_operation_handle(handles)
401
+ validate_handles!(handles)
402
+ Hive2::Thrift::TOperationHandle.new(
403
+ operationId: Hive2::Thrift::THandleIdentifier.new(guid: handles[:guid], secret: handles[:secret]),
404
+ operationType: Hive2::Thrift::TOperationType::EXECUTE_STATEMENT,
405
+ hasResultSet: false
406
+ )
407
+ end
408
+
409
+ def prepare_cancel_request(handles)
410
+ Hive2::Thrift::TCancelOperationReq.new(
411
+ operationHandle: prepare_operation_handle(handles)
412
+ )
413
+ end
414
+
415
+ def validate_handles!(handles)
416
+ unless handles.has_key?(:guid) and handles.has_key?(:secret) and handles.has_key?(:session)
417
+ raise "Invalid handles hash: #{handles.inspect}"
418
+ end
419
+ end
420
+
326
421
  def get_schema_for(handle)
327
422
  req = ::Hive2::Thrift::TGetResultSetMetadataReq.new( operationHandle: handle )
328
423
  metadata = client.GetResultSetMetadata( req )
@@ -1,3 +1,3 @@
1
1
  module RBHive
2
- VERSION = '0.6.0'
2
+ VERSION = '1.0.0.pre'
3
3
  end
metadata CHANGED
@@ -1,8 +1,8 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: rbhive
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.6.0
5
- prerelease:
4
+ version: 1.0.0.pre
5
+ prerelease: 6
6
6
  platform: ruby
7
7
  authors:
8
8
  - Forward3D
@@ -10,7 +10,7 @@ authors:
10
10
  autorequire:
11
11
  bindir: bin
12
12
  cert_chain: []
13
- date: 2014-03-28 00:00:00.000000000 Z
13
+ date: 2014-03-31 00:00:00.000000000 Z
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
16
16
  name: thrift
@@ -134,16 +134,13 @@ required_ruby_version: !ruby/object:Gem::Requirement
134
134
  version: '0'
135
135
  segments:
136
136
  - 0
137
- hash: 2810079357689827941
137
+ hash: 2597338757284379755
138
138
  required_rubygems_version: !ruby/object:Gem::Requirement
139
139
  none: false
140
140
  requirements:
141
- - - ! '>='
141
+ - - ! '>'
142
142
  - !ruby/object:Gem::Version
143
- version: '0'
144
- segments:
145
- - 0
146
- hash: 2810079357689827941
143
+ version: 1.3.1
147
144
  requirements: []
148
145
  rubyforge_project:
149
146
  rubygems_version: 1.8.23