rbhive 0.6.0 → 1.0.0.pre
- data/README.md +81 -2
- data/lib/rbhive/t_c_l_i_connection.rb +160 -65
- data/lib/rbhive/version.rb +1 -1
- metadata +6 -9
data/README.md
CHANGED
@@ -1,7 +1,7 @@
|
|
1
|
-
# RBHive
|
1
|
+
# RBHive - A Ruby Thrift client for Apache Hive
|
2
2
|
|
3
3
|
RBHive is a simple Ruby gem to communicate with the [Apache Hive](http://hive.apache.org)
|
4
|
-
Thrift
|
4
|
+
Thrift servers.
|
5
5
|
|
6
6
|
It supports:
|
7
7
|
* Hiveserver (the original Thrift service shipped with Hive since early releases)
|
@@ -13,6 +13,10 @@ It is capable of using the following Thrift transports:
|
|
13
13
|
* SaslClientTransport ([SASL-enabled](http://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer) transport)
|
14
14
|
* HTTPClientTransport (tunnels Thrift over HTTP)
|
15
15
|
|
16
|
+
As of version 1.0, it supports asynchronous execution of queries. This allows you to submit
|
17
|
+
a query, disconnect, then reconnect later to check the status and retrieve the results.
|
18
|
+
This removes the need to keep a persistent TCP connection open while the query runs.
|
19
|
+
|
16
20
|
## About Thrift services and transports
|
17
21
|
|
18
22
|
### Hiveserver
|
@@ -31,6 +35,12 @@ supported; starting with Hive 0.12, HTTPClientTransport is also supported.
|
|
31
35
|
Each of the versions after Hive 0.10 has a slightly different Thrift interface; when
|
32
36
|
connecting, you must specify the Hive version or you may get an exception.
|
33
37
|
|
38
|
+
Hiveserver2 supports (in versions later than 0.12) asynchronous query execution. This
|
39
|
+
works by submitting a query and retrieving a handle to the execution process; you can
|
40
|
+
then reconnect at a later time and retrieve the results using this handle.
|
41
|
+
Using the asynchronous methods has some caveats - please read the Asynchronous Execution
|
42
|
+
section of the documentation thoroughly before using them.
|
43
|
+
|
34
44
|
RBHive implements this client with the `RBHive::TCLIConnection` class.
|
35
45
|
|
36
46
|
#### Warning!
|
@@ -129,6 +139,75 @@ In addition, you can explicitly set the Thrift protocol version according to thi
|
|
129
139
|
| `:PROTOCOL_V6` | V6 | Updated during Hive 0.13 development, adds binary type for binary payload, uses columnar result set
|
130
140
|
| `:PROTOCOL_V7` | V7 | Used by Hive 0.13 release, support for token-based delegation connections
|
131
141
|
|
142
|
+
## Asynchronous execution with Hiveserver2
|
143
|
+
|
144
|
+
In versions of Hive later than 0.12, the Thrift server supports asynchronous execution.
|
145
|
+
|
146
|
+
The high-level view of using this feature is as follows:
|
147
|
+
1. Submit your query using `async_execute(query)`. This function returns a hash
|
148
|
+
with the following keys: `:guid`, `:secret`, and `:session`. You don't need to
|
149
|
+
care about the internals of this hash - all methods that interact with an async
|
150
|
+
query require this hash, and you can just store it and hand it to the methods.
|
151
|
+
2. To check the state of the query, call `async_state(handles)`, where `handles`
|
152
|
+
is the handles hash given to you when you called `async_execute(query)`.
|
153
|
+
3. To retrieve results, call either `async_fetch(handles)` or `async_fetch_in_batch(handles)`,
|
154
|
+
which work like their synchronous counterparts, `fetch` and `fetch_in_batch`.
|
155
|
+
4. When you're done with the query, call `async_close_session(handles)`.
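
The four steps above can be sketched as follows. This is a minimal illustration, not RBHive itself: `FakeAsyncConnection` is a hypothetical in-memory stand-in that mimics only the method names and return shapes documented here, so the flow can be run without a Hive server. With a real connection you would call the same methods on the `RBHive::TCLIConnection` object.

```ruby
# Hypothetical stand-in for an RBHive connection; it mimics only the
# documented method names and return shapes so the flow can run locally.
class FakeAsyncConnection
  def initialize
    @polls = 0
    @open = true
  end

  # Step 1: returns the handles hash with :guid, :secret and :session.
  def async_execute(_query)
    { guid: "g".b, secret: "s".b, session: "sess".b }
  end

  # Step 2: pretend the query finishes on the third status check.
  def async_state(_handles)
    @polls += 1
    @polls < 3 ? :running : :finished
  end

  # Step 3: return canned rows once the query has finished.
  def async_fetch(_handles, _max_rows = 100)
    [{ city: "London" }, { city: "Paris" }]
  end

  # Step 4: closing the session releases server-side resources.
  def async_close_session(_handles)
    @open = false
  end

  def open?
    @open
  end
end

connection = FakeAsyncConnection.new

# 1. Submit the query and keep the handles hash.
handles = connection.async_execute("SELECT city FROM cities")

# 2. Check the state until the query is no longer in progress.
state = connection.async_state(handles)
state = connection.async_state(handles) while %i[initialized pending running].include?(state)

# 3. Retrieve the results only if the query finished successfully.
rows = state == :finished ? connection.async_fetch(handles) : []

# 4. Always close the session when done with the query.
connection.async_close_session(handles)
```

In real use, steps 1 and 2 to 4 may run in different processes, as long as the handles hash is carried across unchanged.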
|
156
|
+
|
157
|
+
### Memory leaks
|
158
|
+
|
159
|
+
When you call `async_close_session(handles)`, *all async handles created during this
|
160
|
+
session are closed*.
|
161
|
+
|
162
|
+
If you do not close the sessions you create, *you will leak memory in the Hiveserver2 process*.
|
163
|
+
Be very careful to close your sessions!
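
One way to make sure a session is always closed is to wrap the work in a method with an `ensure` clause. The helper below is a sketch, not part of RBHive: `with_async_query` is a hypothetical name, and `connection` is assumed to be any object that responds to the `async_execute` and `async_close_session` methods described here.

```ruby
# Hypothetical helper (not part of RBHive): submits a query, yields the
# handles hash to the caller's block, and closes the session afterwards
# even if the block raises.
def with_async_query(connection, query)
  handles = connection.async_execute(query)
  begin
    yield handles
  ensure
    # Runs on both normal and exceptional exit, so the session is not
    # left open on the server.
    connection.async_close_session(handles)
  end
end
```

This pattern fits the case where the whole job runs in one process. If you disconnect and resume later, the process that fetches the results would be the one to close the session.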
|
164
|
+
|
165
|
+
### Method documentation
|
166
|
+
|
167
|
+
#### `async_execute(query)`
|
168
|
+
|
169
|
+
This method submits a query for async execution. The hash you get back is used in the other
|
170
|
+
async methods, and will look like this:
|
171
|
+
|
172
|
+
{
|
173
|
+
:guid => (binary string),
|
174
|
+
:secret => (binary string),
|
175
|
+
:session => (binary string)
|
176
|
+
}
|
177
|
+
|
178
|
+
The Thrift protocol specifies the strings as "binary" - which means they have no encoding.
|
179
|
+
Be *extremely* careful when manipulating or storing these values, as they can quite easily
|
180
|
+
get converted to UTF-8 strings, which will make them invalid when trying to retrieve async data.
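
One way to store the handles safely is to Base64-encode each value before writing it to a text-based store such as JSON, and to decode it when reading it back. The sketch below assumes all three values are binary strings, as described above, and uses only Ruby's core `Array#pack` and `String#unpack1` with the `m0` directive (strict Base64). `dump_handles` and `load_handles` are hypothetical helper names, not part of RBHive.

```ruby
require "json"

# Hypothetical helpers (not part of RBHive) for persisting a handles hash.

# Base64-encode each binary value so the JSON holds plain ASCII text.
def dump_handles(handles)
  encoded = handles.each_with_object({}) do |(key, value), out|
    out[key.to_s] = [value].pack("m0")
  end
  JSON.generate(encoded)
end

# Decode the values back into binary (ASCII-8BIT) strings and restore
# the symbol keys that the async methods expect.
def load_handles(json)
  JSON.parse(json).each_with_object({}) do |(key, value), out|
    out[key.to_sym] = value.unpack1("m0")
  end
end

# Example with bytes that are not valid UTF-8.
handles = { guid: "\xFF\x00\x10".b, secret: "\x80\x81".b, session: "\xC3\x28".b }
restored = load_handles(dump_handles(handles))
```

Encoding at the boundary keeps the bytes intact regardless of how the storage layer treats string encodings.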
|
181
|
+
|
182
|
+
#### `async_state(handles)`
|
183
|
+
|
184
|
+
`handles` is the hash returned by `async_execute(query)`. The state will be a symbol with
|
185
|
+
one of the following values and meanings:
|
186
|
+
|
187
|
+
| symbol | meaning
|
188
|
+
| --------------------- | -------
|
189
|
+
| :initialized | The query is initialized in Hive and ready to run
|
190
|
+
| :running | The query is running (either as a MapReduce job or within process)
|
191
|
+
| :finished | The query is completed and results can be retrieved
|
192
|
+
| :cancelled | The query was cancelled by a user
|
193
|
+
| :closed | Unknown at present
|
194
|
+
| :error | The query is invalid semantically or broken in another way
|
195
|
+
| :unknown | The query is in an unknown state
|
196
|
+
| :pending | The query is ready to run but is not running
|
197
|
+
|
198
|
+
There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
|
199
|
+
`async_is_failed?(handles)` and `async_is_cancelled?(handles)`.
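
A polling loop built on `async_state` might look like the sketch below. `wait_for_async` is a hypothetical helper, not part of RBHive; it assumes `connection` responds to `async_state(handles)` and treats `:initialized`, `:pending` and `:running` as the only in-progress states from the table above.

```ruby
# Hypothetical helper (not part of RBHive): polls until the query leaves
# the in-progress states, then returns the last state seen. Raises if the
# timeout (in seconds) is reached first.
def wait_for_async(connection, handles, interval: 5, timeout: 600)
  # States in which the query may still make progress.
  in_progress = %i[initialized pending running]
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    state = connection.async_state(handles)
    return state unless in_progress.include?(state)

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "Query still #{state} after #{timeout} seconds"
    end
    sleep interval
  end
end
```

The caller can then branch on the returned symbol, for example fetching results only when it is `:finished`.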
|
200
|
+
|
201
|
+
#### `async_cancel(handles)`
|
202
|
+
|
203
|
+
Calling this method will cancel the query in execution.
|
204
|
+
|
205
|
+
#### `async_fetch(handles)`, `async_fetch_in_batch(handles)`
|
206
|
+
|
207
|
+
These methods let you fetch the results of the async query once it has finished. If you call
|
208
|
+
these methods on an incomplete query, they will raise an exception. They work in exactly the
|
209
|
+
same way as the normal synchronous methods.
|
210
|
+
|
132
211
|
## Examples
|
133
212
|
|
134
213
|
### Fetching results
|
@@ -103,7 +103,6 @@ module RBHive
|
|
103
103
|
@client = Hive2::Thrift::TCLIService::Client.new(@protocol)
|
104
104
|
@session = nil
|
105
105
|
@logger.info("Connecting to HiveServer2 #{server} on port #{port}")
|
106
|
-
@mutex = Mutex.new
|
107
106
|
end
|
108
107
|
|
109
108
|
def thrift_hive_protocol(version)
|
@@ -169,7 +168,11 @@ module RBHive
|
|
169
168
|
end
|
170
169
|
|
171
170
|
def execute(query)
|
172
|
-
|
171
|
+
@logger.info("Executing Hive Query: #{query}")
|
172
|
+
req = prepare_execute_statement(query)
|
173
|
+
exec_result = client.ExecuteStatement(req)
|
174
|
+
raise_error_if_failed!(exec_result)
|
175
|
+
exec_result
|
173
176
|
end
|
174
177
|
|
175
178
|
def priority=(priority)
|
@@ -185,6 +188,118 @@ module RBHive
|
|
185
188
|
self.execute("SET #{name}=#{value}")
|
186
189
|
end
|
187
190
|
|
191
|
+
# Async execute
|
192
|
+
def async_execute(query)
|
193
|
+
@logger.info("Executing query asynchronously: #{query}")
|
194
|
+
op_handle = @client.ExecuteStatement(
|
195
|
+
Hive2::Thrift::TExecuteStatementReq.new(
|
196
|
+
sessionHandle: @session.sessionHandle,
|
197
|
+
statement: query,
|
198
|
+
runAsync: true
|
199
|
+
)
|
200
|
+
).operationHandle
|
201
|
+
|
202
|
+
# Return handles to get hold of this query / session again
|
203
|
+
{
|
204
|
+
session: @session.sessionHandle,
|
205
|
+
guid: op_handle.operationId.guid,
|
206
|
+
secret: op_handle.operationId.secret
|
207
|
+
}
|
208
|
+
end
|
209
|
+
|
210
|
+
# Is the query complete?
|
211
|
+
def async_is_complete?(handles)
|
212
|
+
async_state(handles) == :finished
|
213
|
+
end
|
214
|
+
|
215
|
+
# Is the query actually running?
|
216
|
+
def async_is_running?(handles)
|
217
|
+
async_state(handles) == :running
|
218
|
+
end
|
219
|
+
|
220
|
+
# Has the query failed?
|
221
|
+
def async_is_failed?(handles)
|
222
|
+
async_state(handles) == :error
|
223
|
+
end
|
224
|
+
|
225
|
+
def async_is_cancelled?(handles)
|
226
|
+
async_state(handles) == :cancelled
|
227
|
+
end
|
228
|
+
|
229
|
+
def async_cancel(handles)
|
230
|
+
@client.CancelOperation(prepare_cancel_request(handles))
|
231
|
+
end
|
232
|
+
|
233
|
+
# Map states to symbols
|
234
|
+
def async_state(handles)
|
235
|
+
response = @client.GetOperationStatus(
|
236
|
+
Hive2::Thrift::TGetOperationStatusReq.new(operationHandle: prepare_operation_handle(handles))
|
237
|
+
)
|
238
|
+
puts response.operationState
|
239
|
+
case response.operationState
|
240
|
+
when Hive2::Thrift::TOperationState::FINISHED_STATE
|
241
|
+
return :finished
|
242
|
+
when Hive2::Thrift::TOperationState::INITIALIZED_STATE
|
243
|
+
return :initialized
|
244
|
+
when Hive2::Thrift::TOperationState::RUNNING_STATE
|
245
|
+
return :running
|
246
|
+
when Hive2::Thrift::TOperationState::CANCELED_STATE
|
247
|
+
return :cancelled
|
248
|
+
when Hive2::Thrift::TOperationState::CLOSED_STATE
|
249
|
+
return :closed
|
250
|
+
when Hive2::Thrift::TOperationState::ERROR_STATE
|
251
|
+
return :error
|
252
|
+
when Hive2::Thrift::TOperationState::UKNOWN_STATE
|
253
|
+
return :unknown
|
254
|
+
when Hive2::Thrift::TOperationState::PENDING_STATE
|
255
|
+
return :pending
|
256
|
+
else
|
257
|
+
return :state_not_in_protocol
|
258
|
+
end
|
259
|
+
end
|
260
|
+
|
261
|
+
# Async fetch results from an async execute
|
262
|
+
def async_fetch(handles, max_rows = 100)
|
263
|
+
# Can't get data from an unfinished query
|
264
|
+
unless async_is_complete?(handles)
|
265
|
+
raise "Can't perform fetch on a query in state: #{async_state(handles)}"
|
266
|
+
end
|
267
|
+
|
268
|
+
# Fetch up to max_rows rows and return them
|
269
|
+
fetch_rows(prepare_operation_handle(handles), :first, max_rows)
|
270
|
+
end
|
271
|
+
|
272
|
+
# Fetches the results of a finished async query in batches of *batch_size* rows
|
273
|
+
# and yields the result batches to a given block as arrays of rows.
|
274
|
+
def async_fetch_in_batch(handles, batch_size = 1000, &block)
|
275
|
+
raise "No block given for the batch fetch request!" unless block_given?
|
276
|
+
# Can't get data from an unfinished query
|
277
|
+
unless async_is_complete?(handles)
|
278
|
+
raise "Can't perform fetch on a query in state: #{async_state(handles)}"
|
279
|
+
end
|
280
|
+
|
281
|
+
# Now let's iterate over the results
|
282
|
+
loop do
|
283
|
+
rows = fetch_rows(prepare_operation_handle(handles), :next, batch_size)
|
284
|
+
break if rows.empty?
|
285
|
+
yield rows
|
286
|
+
end
|
287
|
+
end
|
288
|
+
|
289
|
+
def async_close_session(handles)
|
290
|
+
validate_handles!(handles)
|
291
|
+
@client.CloseSession(Hive2::Thrift::TCloseSessionReq.new( sessionHandle: handles[:session] ))
|
292
|
+
end
|
293
|
+
|
294
|
+
# Pull rows from the query result
|
295
|
+
def fetch_rows(op_handle, orientation = :first, max_rows = 1000)
|
296
|
+
fetch_req = prepare_fetch_results(op_handle, orientation, max_rows)
|
297
|
+
fetch_results = @client.FetchResults(fetch_req)
|
298
|
+
raise_error_if_failed!(fetch_results)
|
299
|
+
rows = fetch_results.results.rows
|
300
|
+
TCLIResultSet.new(rows, TCLISchemaDefinition.new(get_schema_for(op_handle), rows.first))
|
301
|
+
end
|
302
|
+
|
188
303
|
# Performs an explain on the supplied query on the server, returns it as an ExplainResult.
|
189
304
|
# (Only works on 0.12 if you have this patch - https://issues.apache.org/jira/browse/HIVE-5492)
|
190
305
|
def explain(query)
|
@@ -197,58 +312,37 @@ module RBHive
|
|
197
312
|
|
198
313
|
# Performs a query on the server, fetches up to *max_rows* rows and returns them as an array.
|
199
314
|
def fetch(query, max_rows = 100)
|
200
|
-
|
201
|
-
|
202
|
-
|
203
|
-
|
204
|
-
|
205
|
-
|
206
|
-
|
207
|
-
|
208
|
-
|
209
|
-
fetch_req = prepare_fetch_results(op_handle, :first, max_rows)
|
210
|
-
fetch_results = client.FetchResults(fetch_req)
|
211
|
-
raise_error_if_failed!(fetch_results)
|
212
|
-
|
213
|
-
# Get data rows and format the result
|
214
|
-
rows = fetch_results.results.rows
|
215
|
-
the_schema = TCLISchemaDefinition.new(get_schema_for( op_handle ), rows.first)
|
216
|
-
TCLIResultSet.new(rows, the_schema)
|
217
|
-
end
|
315
|
+
# Execute the query and check the result
|
316
|
+
exec_result = execute(query)
|
317
|
+
raise_error_if_failed!(exec_result)
|
318
|
+
|
319
|
+
# Get search operation handle to fetch the results
|
320
|
+
op_handle = exec_result.operationHandle
|
321
|
+
|
322
|
+
# Fetch the rows
|
323
|
+
fetch_rows(op_handle, :first, max_rows)
|
218
324
|
end
|
219
325
|
|
220
326
|
# Performs a query on the server, fetches the results in batches of *batch_size* rows
|
221
327
|
# and yields the result batches to a given block as arrays of rows.
|
222
328
|
def fetch_in_batch(query, batch_size = 1000, &block)
|
223
329
|
raise "No block given for the batch fetch request!" unless block_given?
|
224
|
-
|
225
|
-
|
226
|
-
|
227
|
-
|
228
|
-
|
229
|
-
# Get search operation handle to fetch the results
|
230
|
-
op_handle = exec_result.operationHandle
|
231
|
-
|
232
|
-
# Prepare fetch results request
|
233
|
-
fetch_req = prepare_fetch_results(op_handle, :next, batch_size)
|
234
|
-
|
235
|
-
# Now let's iterate over the results
|
236
|
-
loop do
|
237
|
-
# Fetch next batch and raise an exception if it failed
|
238
|
-
fetch_results = client.FetchResults(fetch_req)
|
239
|
-
raise_error_if_failed!(fetch_results)
|
330
|
+
|
331
|
+
# Execute the query and check the result
|
332
|
+
exec_result = execute(query)
|
333
|
+
raise_error_if_failed!(exec_result)
|
240
334
|
|
241
|
-
|
242
|
-
|
243
|
-
break if rows.empty?
|
335
|
+
# Get search operation handle to fetch the results
|
336
|
+
op_handle = exec_result.operationHandle
|
244
337
|
|
245
|
-
|
246
|
-
|
247
|
-
the_schema ||= TCLISchemaDefinition.new(schema_for_req, rows.first)
|
338
|
+
# Prepare fetch results request
|
339
|
+
fetch_req = prepare_fetch_results(op_handle, :next, batch_size)
|
248
340
|
|
249
|
-
|
250
|
-
|
251
|
-
|
341
|
+
# Now let's iterate over the results
|
342
|
+
loop do
|
343
|
+
rows = fetch_rows(op_handle, :next, batch_size)
|
344
|
+
break if rows.empty?
|
345
|
+
yield rows
|
252
346
|
end
|
253
347
|
end
|
254
348
|
|
@@ -275,26 +369,6 @@ module RBHive
|
|
275
369
|
|
276
370
|
private
|
277
371
|
|
278
|
-
def execute_safe(query)
|
279
|
-
safe do
|
280
|
-
exec_result = execute_unsafe(query)
|
281
|
-
raise_error_if_failed!(exec_result)
|
282
|
-
exec_result
|
283
|
-
end
|
284
|
-
end
|
285
|
-
|
286
|
-
def execute_unsafe(query)
|
287
|
-
@logger.info("Executing Hive Query: #{query}")
|
288
|
-
req = prepare_execute_statement(query)
|
289
|
-
client.ExecuteStatement(req)
|
290
|
-
end
|
291
|
-
|
292
|
-
def safe
|
293
|
-
ret = nil
|
294
|
-
@mutex.synchronize { ret = yield }
|
295
|
-
ret
|
296
|
-
end
|
297
|
-
|
298
372
|
def prepare_open_session(client_protocol)
|
299
373
|
req = ::Hive2::Thrift::TOpenSessionReq.new( @options[:sasl_params].nil? ? [] : @options[:sasl_params] )
|
300
374
|
req.client_protocol = client_protocol
|
@@ -323,6 +397,27 @@ module RBHive
|
|
323
397
|
)
|
324
398
|
end
|
325
399
|
|
400
|
+
def prepare_operation_handle(handles)
|
401
|
+
validate_handles!(handles)
|
402
|
+
Hive2::Thrift::TOperationHandle.new(
|
403
|
+
operationId: Hive2::Thrift::THandleIdentifier.new(guid: handles[:guid], secret: handles[:secret]),
|
404
|
+
operationType: Hive2::Thrift::TOperationType::EXECUTE_STATEMENT,
|
405
|
+
hasResultSet: false
|
406
|
+
)
|
407
|
+
end
|
408
|
+
|
409
|
+
def prepare_cancel_request(handles)
|
410
|
+
Hive2::Thrift::TCancelOperationReq.new(
|
411
|
+
operationHandle: prepare_operation_handle(handles)
|
412
|
+
)
|
413
|
+
end
|
414
|
+
|
415
|
+
def validate_handles!(handles)
|
416
|
+
unless handles.has_key?(:guid) and handles.has_key?(:secret) and handles.has_key?(:session)
|
417
|
+
raise "Invalid handles hash: #{handles.inspect}"
|
418
|
+
end
|
419
|
+
end
|
420
|
+
|
326
421
|
def get_schema_for(handle)
|
327
422
|
req = ::Hive2::Thrift::TGetResultSetMetadataReq.new( operationHandle: handle )
|
328
423
|
metadata = client.GetResultSetMetadata( req )
|
data/lib/rbhive/version.rb
CHANGED
metadata
CHANGED
@@ -1,8 +1,8 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: rbhive
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
5
|
-
prerelease:
|
4
|
+
version: 1.0.0.pre
|
5
|
+
prerelease: 6
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
8
8
|
- Forward3D
|
@@ -10,7 +10,7 @@ authors:
|
|
10
10
|
autorequire:
|
11
11
|
bindir: bin
|
12
12
|
cert_chain: []
|
13
|
-
date: 2014-03-
|
13
|
+
date: 2014-03-31 00:00:00.000000000 Z
|
14
14
|
dependencies:
|
15
15
|
- !ruby/object:Gem::Dependency
|
16
16
|
name: thrift
|
@@ -134,16 +134,13 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
134
134
|
version: '0'
|
135
135
|
segments:
|
136
136
|
- 0
|
137
|
-
hash:
|
137
|
+
hash: 2597338757284379755
|
138
138
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
139
139
|
none: false
|
140
140
|
requirements:
|
141
|
-
- - ! '
|
141
|
+
- - ! '>'
|
142
142
|
- !ruby/object:Gem::Version
|
143
|
-
version:
|
144
|
-
segments:
|
145
|
-
- 0
|
146
|
-
hash: 2810079357689827941
|
143
|
+
version: 1.3.1
|
147
144
|
requirements: []
|
148
145
|
rubyforge_project:
|
149
146
|
rubygems_version: 1.8.23
|