rbhive 0.6.0 → 1.0.0.pre
- data/README.md +81 -2
- data/lib/rbhive/t_c_l_i_connection.rb +160 -65
- data/lib/rbhive/version.rb +1 -1
- metadata +6 -9
data/README.md
CHANGED
@@ -1,7 +1,7 @@
|
|
1
|
-
# RBHive
|
1
|
+
# RBHive - A Ruby Thrift client for Apache Hive
|
2
2
|
|
3
3
|
RBHive is a simple Ruby gem to communicate with the [Apache Hive](http://hive.apache.org)
|
4
|
-
Thrift
|
4
|
+
Thrift servers.
|
5
5
|
|
6
6
|
It supports:
|
7
7
|
* Hiveserver (the original Thrift service shipped with Hive since early releases)
|
@@ -13,6 +13,10 @@ It is capable of using the following Thrift transports:
|
|
13
13
|
* SaslClientTransport ([SASL-enabled](http://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer) transport)
|
14
14
|
* HTTPClientTransport (tunnels Thrift over HTTP)
|
15
15
|
|
16
|
+
As of version 1.0, it supports asynchronous execution of queries. This allows you to submit
|
17
|
+
a query, disconnect, then reconnect later to check the status and retrieve the results.
|
18
|
+
This frees systems from the need to keep a persistent TCP connection open.
|
19
|
+
|
16
20
|
## About Thrift services and transports
|
17
21
|
|
18
22
|
### Hiveserver
|
@@ -31,6 +35,12 @@ supported; starting with Hive 0.12, HTTPClientTransport is also supported.
|
|
31
35
|
Each of the versions after Hive 0.10 has a slightly different Thrift interface; when
|
32
36
|
connecting, you must specify the Hive version or you may get an exception.
|
33
37
|
|
38
|
+
Hiveserver2 supports (in versions later than 0.12) asynchronous query execution. This
|
39
|
+
works by submitting a query and retrieving a handle to the execution process; you can
|
40
|
+
then reconnect at a later time and retrieve the results using this handle.
|
41
|
+
Using the asynchronous methods has some caveats - please read the Asynchronous Execution
|
42
|
+
section of the documentation thoroughly before using them.
|
43
|
+
|
34
44
|
RBHive implements this client with the `RBHive::TCLIConnection` class.
|
35
45
|
|
36
46
|
#### Warning!
|
@@ -129,6 +139,75 @@ In addition, you can explicitly set the Thrift protocol version according to thi
|
|
129
139
|
| `:PROTOCOL_V6` | V6 | Updated during Hive 0.13 development, adds binary type for binary payload, uses columnar result set
|
130
140
|
| `:PROTOCOL_V7` | V7 | Used by Hive 0.13 release, support for token-based delegation connections
|
131
141
|
|
142
|
+
## Asynchronous execution with Hiveserver2
|
143
|
+
|
144
|
+
In versions of Hive later than 0.12, the Thrift server supports asynchronous execution.
|
145
|
+
|
146
|
+
The high-level view of using this feature is as follows:
|
147
|
+
1. Submit your query using `async_execute(query)`. This function returns a hash
|
148
|
+
with the following keys: `:guid`, `:secret`, and `:session`. You don't need to
|
149
|
+
care about the internals of this hash - all methods that interact with an async
|
150
|
+
query require this hash, and you can just store it and hand it to the methods.
|
151
|
+
2. To check the state of the query, call `async_state(handles)`, where `handles`
|
152
|
+
is the handles hash given to you when you called `async_execute(query)`.
|
153
|
+
3. To retrieve results, call either `async_fetch(handles)` or `async_fetch_in_batch(handles)`,
|
154
|
+
which work like the non-async methods.
|
155
|
+
4. When you're done with the query, call `async_close_session(handles)`.
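The steps above can be sketched as a small polling helper. This is a minimal sketch, not part of RBHive: `wait_for_async_query`, its `interval` and `timeout` parameters, and the `FakeConnection` stand-in are hypothetical names introduced for illustration. Only `async_state(handles)` and the state symbols come from the README.

```ruby
# Hypothetical helper: poll async_state until the query reaches a terminal state.
# `connection` is anything that responds to async_state(handles), such as an
# RBHive::TCLIConnection; `interval` and `timeout` are illustrative parameters.
TERMINAL_STATES = [:finished, :cancelled, :closed, :error].freeze

def wait_for_async_query(connection, handles, interval: 5, timeout: 3600)
  deadline = Time.now + timeout
  loop do
    state = connection.async_state(handles)
    return state if TERMINAL_STATES.include?(state)
    raise "Timed out waiting for query (last state: #{state})" if Time.now >= deadline
    sleep interval
  end
end

# Stand-in for a real connection so the sketch runs without a Hive server.
class FakeConnection
  def initialize(states)
    @states = states.dup
  end

  # Returns the next scripted state, repeating the last one once exhausted.
  def async_state(_handles)
    @states.size > 1 ? @states.shift : @states.first
  end
end

conn = FakeConnection.new([:pending, :running, :finished])
final_state = wait_for_async_query(conn, { guid: "g", secret: "s", session: "x" }, interval: 0)
puts final_state
```

With a real connection, the returned state would decide whether to call `async_fetch(handles)` or to report a failure before closing the session.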
|
156
|
+
|
157
|
+
### Memory leaks
|
158
|
+
|
159
|
+
When you call `async_close_session(handles)`, *all async handles created during this
|
160
|
+
session are closed*.
|
161
|
+
|
162
|
+
If you do not close the sessions you create, *you will leak memory in the Hiveserver2 process*.
|
163
|
+
Be very careful to close your sessions!
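One way to make sure a session is always closed is to wrap the work in an `ensure` clause. The following is a sketch under stated assumptions: `with_async_cleanup` and `RecordingConnection` are hypothetical names used for illustration, and only `async_close_session(handles)` comes from the README.

```ruby
# Hypothetical wrapper: always close the async session, even if the block raises.
def with_async_cleanup(connection, handles)
  yield handles
ensure
  connection.async_close_session(handles)
end

# Stand-in connection that records which sessions were closed, so the
# sketch runs without a Hive server.
class RecordingConnection
  attr_reader :closed_sessions

  def initialize
    @closed_sessions = []
  end

  def async_close_session(handles)
    @closed_sessions << handles[:session]
  end
end

conn = RecordingConnection.new
handles = { guid: "g", secret: "s", session: "session-1" }

begin
  with_async_cleanup(conn, handles) { raise "fetch failed" }
rescue RuntimeError => e
  puts "rescued: #{e.message}"
end

puts conn.closed_sessions.inspect
```

Because closing a session closes every async handle created in it, a wrapper like this is best placed around the last piece of work that needs the session.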
|
164
|
+
|
165
|
+
### Method documentation
|
166
|
+
|
167
|
+
#### `async_execute(query)`
|
168
|
+
|
169
|
+
This method submits a query for async execution. The hash you get back is used in the other
|
170
|
+
async methods, and will look like this:
|
171
|
+
|
172
|
+
{
|
173
|
+
:guid => (binary string),
|
174
|
+
:secret => (binary string),
|
175
|
+
:session => (binary string)
|
176
|
+
}
|
177
|
+
|
178
|
+
The Thrift protocol specifies the strings as "binary" - which means they have no encoding.
|
179
|
+
Be *extremely* careful when manipulating or storing these values, as they can quite easily
|
180
|
+
get converted to UTF-8 strings, which will make them invalid when trying to retrieve async data.
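A common way to avoid this problem is to encode the binary values as Base64 text before storing them. The sketch below assumes all three values are plain binary strings, as described above; `dump_handles` and `load_handles` are hypothetical helper names, not part of RBHive, and the byte values are made up for illustration.

```ruby
# Hypothetical helpers: encode each binary value as strict Base64 text for
# storage, and restore the raw bytes (ASCII-8BIT) when loading.
def dump_handles(handles)
  handles.each_with_object({}) do |(key, value), out|
    out[key.to_s] = [value].pack("m0") # "m0" = Base64 without line breaks
  end
end

def load_handles(stored)
  stored.each_with_object({}) do |(key, value), out|
    out[key.to_sym] = value.unpack1("m0")
  end
end

# Illustrative bytes that are not valid UTF-8.
handles = {
  guid: "\xFF\x00\xA1\x10".b,
  secret: "\x80\x81\xFE".b,
  session: "\xC3\x28".b
}

stored = dump_handles(handles)   # ASCII-only text, safe for JSON or a database
restored = load_handles(stored)
puts restored == handles
```

The encoded hash contains only ASCII characters, so it survives storage layers that would otherwise re-encode or reject raw bytes.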
|
181
|
+
|
182
|
+
#### `async_state(handles)`
|
183
|
+
|
184
|
+
`handles` is the hash returned by `async_execute(query)`. The state will be a symbol with
|
185
|
+
one of the following values and meanings:
|
186
|
+
|
187
|
+
| symbol | meaning
|
188
|
+
| --------------------- | -------
|
189
|
+
| :initialized | The query is initialized in Hive and ready to run
|
190
|
+
| :running | The query is running (either as a MapReduce job or within process)
|
191
|
+
| :finished | The query is completed and results can be retrieved
|
192
|
+
| :cancelled | The query was cancelled by a user
|
193
|
+
| :closed | Unknown at present
|
194
|
+
| :error | The query is invalid semantically or broken in another way
|
195
|
+
| :unknown | The query is in an unknown state
|
196
|
+
| :pending | The query is ready to run but is not running
|
197
|
+
|
198
|
+
There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
|
199
|
+
`async_is_failed?(handles)` and `async_is_cancelled?(handles)`.
|
200
|
+
|
201
|
+
#### `async_cancel(handles)`
|
202
|
+
|
203
|
+
Calling this method will cancel the query in execution.
|
204
|
+
|
205
|
+
#### `async_fetch(handles)`, `async_fetch_in_batch(handles)`
|
206
|
+
|
207
|
+
These methods let you fetch the results of the async query once it is complete. If you call
|
208
|
+
these methods on an incomplete query, they will raise an exception. They work in exactly the
|
209
|
+
same way as the normal synchronous methods.
|
210
|
+
|
132
211
|
## Examples
|
133
212
|
|
134
213
|
### Fetching results
|
data/lib/rbhive/t_c_l_i_connection.rb
CHANGED
@@ -103,7 +103,6 @@ module RBHive
|
|
103
103
|
@client = Hive2::Thrift::TCLIService::Client.new(@protocol)
|
104
104
|
@session = nil
|
105
105
|
@logger.info("Connecting to HiveServer2 #{server} on port #{port}")
|
106
|
-
@mutex = Mutex.new
|
107
106
|
end
|
108
107
|
|
109
108
|
def thrift_hive_protocol(version)
|
@@ -169,7 +168,11 @@ module RBHive
|
|
169
168
|
end
|
170
169
|
|
171
170
|
def execute(query)
|
172
|
-
|
171
|
+
@logger.info("Executing Hive Query: #{query}")
|
172
|
+
req = prepare_execute_statement(query)
|
173
|
+
exec_result = client.ExecuteStatement(req)
|
174
|
+
raise_error_if_failed!(exec_result)
|
175
|
+
exec_result
|
173
176
|
end
|
174
177
|
|
175
178
|
def priority=(priority)
|
@@ -185,6 +188,118 @@ module RBHive
|
|
185
188
|
self.execute("SET #{name}=#{value}")
|
186
189
|
end
|
187
190
|
|
191
|
+
# Async execute
|
192
|
+
def async_execute(query)
|
193
|
+
@logger.info("Executing query asynchronously: #{query}")
|
194
|
+
op_handle = @client.ExecuteStatement(
|
195
|
+
Hive2::Thrift::TExecuteStatementReq.new(
|
196
|
+
sessionHandle: @session.sessionHandle,
|
197
|
+
statement: query,
|
198
|
+
runAsync: true
|
199
|
+
)
|
200
|
+
).operationHandle
|
201
|
+
|
202
|
+
# Return handles to get hold of this query / session again
|
203
|
+
{
|
204
|
+
session: @session.sessionHandle,
|
205
|
+
guid: op_handle.operationId.guid,
|
206
|
+
secret: op_handle.operationId.secret
|
207
|
+
}
|
208
|
+
end
|
209
|
+
|
210
|
+
# Is the query complete?
|
211
|
+
def async_is_complete?(handles)
|
212
|
+
async_state(handles) == :finished
|
213
|
+
end
|
214
|
+
|
215
|
+
# Is the query actually running?
|
216
|
+
def async_is_running?(handles)
|
217
|
+
async_state(handles) == :running
|
218
|
+
end
|
219
|
+
|
220
|
+
# Has the query failed?
|
221
|
+
def async_is_failed?(handles)
|
222
|
+
async_state(handles) == :error
|
223
|
+
end
|
224
|
+
|
225
|
+
def async_is_cancelled?(handles)
|
226
|
+
async_state(handles) == :cancelled
|
227
|
+
end
|
228
|
+
|
229
|
+
def async_cancel(handles)
|
230
|
+
@client.CancelOperation(prepare_cancel_request(handles))
|
231
|
+
end
|
232
|
+
|
233
|
+
# Map states to symbols
|
234
|
+
def async_state(handles)
|
235
|
+
response = @client.GetOperationStatus(
|
236
|
+
Hive2::Thrift::TGetOperationStatusReq.new(operationHandle: prepare_operation_handle(handles))
|
237
|
+
)
|
238
|
+
@logger.debug("Operation state: #{response.operationState}")
|
239
|
+
case response.operationState
|
240
|
+
when Hive2::Thrift::TOperationState::FINISHED_STATE
|
241
|
+
return :finished
|
242
|
+
when Hive2::Thrift::TOperationState::INITIALIZED_STATE
|
243
|
+
return :initialized
|
244
|
+
when Hive2::Thrift::TOperationState::RUNNING_STATE
|
245
|
+
return :running
|
246
|
+
when Hive2::Thrift::TOperationState::CANCELED_STATE
|
247
|
+
return :cancelled
|
248
|
+
when Hive2::Thrift::TOperationState::CLOSED_STATE
|
249
|
+
return :closed
|
250
|
+
when Hive2::Thrift::TOperationState::ERROR_STATE
|
251
|
+
return :error
|
252
|
+
when Hive2::Thrift::TOperationState::UKNOWN_STATE
|
253
|
+
return :unknown
|
254
|
+
when Hive2::Thrift::TOperationState::PENDING_STATE
|
255
|
+
return :pending
|
256
|
+
else
|
257
|
+
return :state_not_in_protocol
|
258
|
+
end
|
259
|
+
end
|
260
|
+
|
261
|
+
# Async fetch results from an async execute
|
262
|
+
def async_fetch(handles, max_rows = 100)
|
263
|
+
# Can't get data from an unfinished query
|
264
|
+
unless async_is_complete?(handles)
|
265
|
+
raise "Can't perform fetch on a query in state: #{async_state(handles)}"
|
266
|
+
end
|
267
|
+
|
268
|
+
# Fetch and return the rows
|
269
|
+
fetch_rows(prepare_operation_handle(handles), :first, max_rows)
|
270
|
+
end
|
271
|
+
|
272
|
+
# Performs a query on the server, fetches the results in batches of *batch_size* rows
|
273
|
+
# and yields the result batches to a given block as arrays of rows.
|
274
|
+
def async_fetch_in_batch(handles, batch_size = 1000, &block)
|
275
|
+
raise "No block given for the batch fetch request!" unless block_given?
|
276
|
+
# Can't get data from an unfinished query
|
277
|
+
unless async_is_complete?(handles)
|
278
|
+
raise "Can't perform fetch on a query in state: #{async_state(handles)}"
|
279
|
+
end
|
280
|
+
|
281
|
+
# Now let's iterate over the results
|
282
|
+
loop do
|
283
|
+
rows = fetch_rows(prepare_operation_handle(handles), :next, batch_size)
|
284
|
+
break if rows.empty?
|
285
|
+
yield rows
|
286
|
+
end
|
287
|
+
end
|
288
|
+
|
289
|
+
def async_close_session(handles)
|
290
|
+
validate_handles!(handles)
|
291
|
+
@client.CloseSession(Hive2::Thrift::TCloseSessionReq.new( sessionHandle: handles[:session] ))
|
292
|
+
end
|
293
|
+
|
294
|
+
# Pull rows from the query result
|
295
|
+
def fetch_rows(op_handle, orientation = :first, max_rows = 1000)
|
296
|
+
fetch_req = prepare_fetch_results(op_handle, orientation, max_rows)
|
297
|
+
fetch_results = @client.FetchResults(fetch_req)
|
298
|
+
raise_error_if_failed!(fetch_results)
|
299
|
+
rows = fetch_results.results.rows
|
300
|
+
TCLIResultSet.new(rows, TCLISchemaDefinition.new(get_schema_for(op_handle), rows.first))
|
301
|
+
end
|
302
|
+
|
188
303
|
# Performs an explain on the supplied query on the server and returns it as an ExplainResult.
|
189
304
|
# (Only works on 0.12 if you have this patch - https://issues.apache.org/jira/browse/HIVE-5492)
|
190
305
|
def explain(query)
|
@@ -197,58 +312,37 @@ module RBHive
|
|
197
312
|
|
198
313
|
# Performs a query on the server, fetches up to *max_rows* rows and returns them as an array.
|
199
314
|
def fetch(query, max_rows = 100)
|
200
|
-
|
201
|
-
|
202
|
-
|
203
|
-
|
204
|
-
|
205
|
-
|
206
|
-
|
207
|
-
|
208
|
-
|
209
|
-
fetch_req = prepare_fetch_results(op_handle, :first, max_rows)
|
210
|
-
fetch_results = client.FetchResults(fetch_req)
|
211
|
-
raise_error_if_failed!(fetch_results)
|
212
|
-
|
213
|
-
# Get data rows and format the result
|
214
|
-
rows = fetch_results.results.rows
|
215
|
-
the_schema = TCLISchemaDefinition.new(get_schema_for( op_handle ), rows.first)
|
216
|
-
TCLIResultSet.new(rows, the_schema)
|
217
|
-
end
|
315
|
+
# Execute the query and check the result
|
316
|
+
exec_result = execute(query)
|
317
|
+
raise_error_if_failed!(exec_result)
|
318
|
+
|
319
|
+
# Get search operation handle to fetch the results
|
320
|
+
op_handle = exec_result.operationHandle
|
321
|
+
|
322
|
+
# Fetch the rows
|
323
|
+
fetch_rows(op_handle, :first, max_rows)
|
218
324
|
end
|
219
325
|
|
220
326
|
# Performs a query on the server, fetches the results in batches of *batch_size* rows
|
221
327
|
# and yields the result batches to a given block as arrays of rows.
|
222
328
|
def fetch_in_batch(query, batch_size = 1000, &block)
|
223
329
|
raise "No block given for the batch fetch request!" unless block_given?
|
224
|
-
|
225
|
-
|
226
|
-
|
227
|
-
|
228
|
-
|
229
|
-
# Get search operation handle to fetch the results
|
230
|
-
op_handle = exec_result.operationHandle
|
231
|
-
|
232
|
-
# Prepare fetch results request
|
233
|
-
fetch_req = prepare_fetch_results(op_handle, :next, batch_size)
|
234
|
-
|
235
|
-
# Now let's iterate over the results
|
236
|
-
loop do
|
237
|
-
# Fetch next batch and raise an exception if it failed
|
238
|
-
fetch_results = client.FetchResults(fetch_req)
|
239
|
-
raise_error_if_failed!(fetch_results)
|
330
|
+
|
331
|
+
# Execute the query and check the result
|
332
|
+
exec_result = execute(query)
|
333
|
+
raise_error_if_failed!(exec_result)
|
240
334
|
|
241
|
-
|
242
|
-
|
243
|
-
break if rows.empty?
|
335
|
+
# Get search operation handle to fetch the results
|
336
|
+
op_handle = exec_result.operationHandle
|
244
337
|
|
245
|
-
|
246
|
-
|
247
|
-
the_schema ||= TCLISchemaDefinition.new(schema_for_req, rows.first)
|
338
|
+
# Prepare fetch results request
|
339
|
+
fetch_req = prepare_fetch_results(op_handle, :next, batch_size)
|
248
340
|
|
249
|
-
|
250
|
-
|
251
|
-
|
341
|
+
# Now let's iterate over the results
|
342
|
+
loop do
|
343
|
+
rows = fetch_rows(op_handle, :next, batch_size)
|
344
|
+
break if rows.empty?
|
345
|
+
yield rows
|
252
346
|
end
|
253
347
|
end
|
254
348
|
|
@@ -275,26 +369,6 @@ module RBHive
|
|
275
369
|
|
276
370
|
private
|
277
371
|
|
278
|
-
def execute_safe(query)
|
279
|
-
safe do
|
280
|
-
exec_result = execute_unsafe(query)
|
281
|
-
raise_error_if_failed!(exec_result)
|
282
|
-
exec_result
|
283
|
-
end
|
284
|
-
end
|
285
|
-
|
286
|
-
def execute_unsafe(query)
|
287
|
-
@logger.info("Executing Hive Query: #{query}")
|
288
|
-
req = prepare_execute_statement(query)
|
289
|
-
client.ExecuteStatement(req)
|
290
|
-
end
|
291
|
-
|
292
|
-
def safe
|
293
|
-
ret = nil
|
294
|
-
@mutex.synchronize { ret = yield }
|
295
|
-
ret
|
296
|
-
end
|
297
|
-
|
298
372
|
def prepare_open_session(client_protocol)
|
299
373
|
req = ::Hive2::Thrift::TOpenSessionReq.new( @options[:sasl_params].nil? ? [] : @options[:sasl_params] )
|
300
374
|
req.client_protocol = client_protocol
|
@@ -323,6 +397,27 @@ module RBHive
|
|
323
397
|
)
|
324
398
|
end
|
325
399
|
|
400
|
+
def prepare_operation_handle(handles)
|
401
|
+
validate_handles!(handles)
|
402
|
+
Hive2::Thrift::TOperationHandle.new(
|
403
|
+
operationId: Hive2::Thrift::THandleIdentifier.new(guid: handles[:guid], secret: handles[:secret]),
|
404
|
+
operationType: Hive2::Thrift::TOperationType::EXECUTE_STATEMENT,
|
405
|
+
hasResultSet: false
|
406
|
+
)
|
407
|
+
end
|
408
|
+
|
409
|
+
def prepare_cancel_request(handles)
|
410
|
+
Hive2::Thrift::TCancelOperationReq.new(
|
411
|
+
operationHandle: prepare_operation_handle(handles)
|
412
|
+
)
|
413
|
+
end
|
414
|
+
|
415
|
+
def validate_handles!(handles)
|
416
|
+
unless handles.key?(:guid) && handles.key?(:secret) && handles.key?(:session)
|
417
|
+
raise "Invalid handles hash: #{handles.inspect}"
|
418
|
+
end
|
419
|
+
end
|
420
|
+
|
326
421
|
def get_schema_for(handle)
|
327
422
|
req = ::Hive2::Thrift::TGetResultSetMetadataReq.new( operationHandle: handle )
|
328
423
|
metadata = client.GetResultSetMetadata( req )
|
data/lib/rbhive/version.rb
CHANGED
metadata
CHANGED
@@ -1,8 +1,8 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: rbhive
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
5
|
-
prerelease:
|
4
|
+
version: 1.0.0.pre
|
5
|
+
prerelease: 6
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
8
8
|
- Forward3D
|
@@ -10,7 +10,7 @@ authors:
|
|
10
10
|
autorequire:
|
11
11
|
bindir: bin
|
12
12
|
cert_chain: []
|
13
|
-
date: 2014-03-
|
13
|
+
date: 2014-03-31 00:00:00.000000000 Z
|
14
14
|
dependencies:
|
15
15
|
- !ruby/object:Gem::Dependency
|
16
16
|
name: thrift
|
@@ -134,16 +134,13 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
134
134
|
version: '0'
|
135
135
|
segments:
|
136
136
|
- 0
|
137
|
-
hash:
|
137
|
+
hash: 2597338757284379755
|
138
138
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
139
139
|
none: false
|
140
140
|
requirements:
|
141
|
-
- - ! '
|
141
|
+
- - ! '>'
|
142
142
|
- !ruby/object:Gem::Version
|
143
|
-
version:
|
144
|
-
segments:
|
145
|
-
- 0
|
146
|
-
hash: 2810079357689827941
|
143
|
+
version: 1.3.1
|
147
144
|
requirements: []
|
148
145
|
rubyforge_project:
|
149
146
|
rubygems_version: 1.8.23
|