rbhive-u2i 1.0.0

@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: aa0f81a2728885fabb85feabc4044b2a29cc26f0
+ data.tar.gz: ec38ad569b040b3b0c646272fb3cd95e99a35beb
+ SHA512:
+ metadata.gz: 841210a38b540c3e1513bd440de7e65b4fe77ce40ee5847acfa91fffc66da14b8bf0d02417018ac3ac26756385549f72be176162268e8c232f1f70268ed11953
+ data.tar.gz: 064a4baa9bbd6266eee0a5dda8c5cbeab9a1eafe6497ec9f010858304ba144b8b8c790b9ef2dab1c40ad2bdd9e5ed62058e5c6bd347285592fa989bc77dd1923
@@ -0,0 +1,18 @@
+ .DS_Store
+ *.gem
+ *.rbc
+ .bundle
+ .config
+ .yardoc
+ Gemfile.lock
+ InstalledFiles
+ _yardoc
+ coverage
+ doc/
+ lib/bundler/man
+ pkg
+ rdoc
+ spec/reports
+ test/tmp
+ test/version_tmp
+ tmp
@@ -0,0 +1,16 @@
+ # RBHive changelog
+
+ Versioning prior to 0.5.3 was not tracked, so this changelog only lists changes introduced after 0.5.3.
+
+ ## 0.6.0
+
+ 0.6.0 introduces one backwards-incompatible change:
+
+ * Behaviour change: RBHive will no longer coerce the strings "NULL" or "null" to Ruby's `nil`; the rationale
+ for this change is that the coercion introduces hard-to-trace bugs and does not make sense from a logical
+ perspective (Hive's "NULL" is a very different thing from Ruby's `nil`).
+
+ 0.6.0 also introduces support for Hive 0.13, and for the Hive 0.11 version shipped with CDH5 Beta 1 and Beta 2:
+
+ * Thrift protocol bindings updated to include all the protocols shipped with the Hive 0.13 release.
+ * Allow the user to choose a protocol explicitly; helper symbols / lookups are provided for common protocols (e.g. CDH4, CDH5)
data/Gemfile ADDED
@@ -0,0 +1,3 @@
+ source "https://rubygems.org"
+
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ The MIT License (MIT)
+
+ Copyright (c) [2013] [Forward3D]
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
+ this software and associated documentation files (the "Software"), to deal in
+ the Software without restriction, including without limitation the rights to
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+ the Software, and to permit persons to whom the Software is furnished to do so,
+ subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,348 @@
+ # RBHive - A Ruby Thrift client for Apache Hive
+
+ [![Code Climate](https://codeclimate.com/github/forward3d/rbhive/badges/gpa.svg)](https://codeclimate.com/github/forward3d/rbhive)
+
+ ### WARNING
+
+ This is the u2i fork of [rbhive](https://github.com/forward3d/rbhive).
+
+ RBHive is a simple Ruby gem for communicating with [Apache Hive](http://hive.apache.org)
+ Thrift servers.
+
+ It supports:
+ * Hiveserver (the original Thrift service shipped with Hive since early releases)
+ * Hiveserver2 (the new, concurrent Thrift service shipped with Hive releases since 0.10)
+ * Any other 100% Hive-compatible Thrift service (e.g. [Sharkserver](https://github.com/amplab/shark))
+
+ It is capable of using the following Thrift transports:
+ * BufferedTransport (the default)
+ * SaslClientTransport ([SASL-enabled](http://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer) transport)
+ * HTTPClientTransport (tunnels Thrift over HTTP)
+
+ As of version 1.0, it supports asynchronous execution of queries. This allows you to submit
+ a query, disconnect, then reconnect later to check the status and retrieve the results.
+ This frees systems from the need to keep a persistent TCP connection open.
+
+ ## About Thrift services and transports
+
+ ### Hiveserver
+
+ Hiveserver (the original Thrift interface) only supports a single client at a time. RBHive
+ implements this client with the `RBHive::Connection` class. It only supports a single transport,
+ BufferedTransport.
+
+ ### Hiveserver2
+
+ [Hiveserver2](https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2)
+ (the new Thrift interface) can support many concurrent client connections. It is shipped
+ with Hive 0.10 and later. In Hive 0.10, only BufferedTransport and SaslClientTransport are
+ supported; starting with Hive 0.12, HTTPClientTransport is also supported.
+
+ Each of the versions after Hive 0.10 has a slightly different Thrift interface; when
+ connecting, you must specify the Hive version or you may get an exception.
+
+ Hiveserver2 supports (in versions later than 0.12) asynchronous query execution. This
+ works by submitting a query and retrieving a handle to the execution process; you can
+ then reconnect at a later time and retrieve the results using this handle.
+ Using the asynchronous methods has some caveats - please read the Asynchronous Execution
+ section of the documentation thoroughly before using them.
+
+ RBHive implements this client with the `RBHive::TCLIConnection` class.
+
+ #### Warning!
+
+ We had to set the following in hive-site.xml to get the BufferedTransport Thrift service
+ to work with RBHive:
+
+     <property>
+       <name>hive.server2.enable.doAs</name>
+       <value>false</value>
+     </property>
+
+ Otherwise you'll get this nasty-looking exception in the logs:
+
+     ERROR server.TThreadPoolServer: Error occurred during processing of message.
+     java.lang.ClassCastException: org.apache.thrift.transport.TSocket cannot be cast to org.apache.thrift.transport.TSaslServerTransport
+         at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:35)
+         at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
+         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
+         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
+         at java.lang.Thread.run(Thread.java:662)
+
+ ### Other Hive-compatible services
+
+ Consult the documentation for the service you're using, as details will vary.
+
+ ## Connecting to Hiveserver and Hiveserver2
+
+ ### Hiveserver
+
+ Since Hiveserver has no options, the connection code is very simple:
+
+     RBHive.connect('hive.server.address', 10_000) do |connection|
+       connection.fetch 'SELECT city, country FROM cities'
+     end
+     ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]
+
+ ### Hiveserver2
+
+ Hiveserver2 has several options for how it is run. The connection code takes
+ a hash with these possible parameters:
+ * `:transport` - one of `:buffered` (BufferedTransport), `:http` (HTTPClientTransport), or `:sasl` (SaslClientTransport)
+ * `:hive_version` - the number after the period in the Hive version; e.g. `10`, `11`, `12`, `13` or one of
+ a set of symbols; see [Hiveserver2 protocol versions](#hiveserver2-protocol-versions) below for details
+ * `:timeout` - if using BufferedTransport or SaslClientTransport, the timeout (in seconds) set on the socket
+ * `:sasl_params` - if using SaslClientTransport, a hash of parameters for setting up the SASL connection
+
+ If you pass either an empty hash or nil in place of the options (or do not supply them), the connection
+ is attempted with the Hive version set to 0.10, using `:buffered` as the transport, and a timeout of 1800 seconds.
+
+ Connecting with the defaults:
+
+     RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
+       connection.fetch('SHOW TABLES')
+     end
+
+ Connecting with a Logger:
+
+     RBHive.tcli_connect('hive.server.address', 10_000, { logger: Logger.new(STDOUT) }) do |connection|
+       connection.fetch('SHOW TABLES')
+     end
+
+ Connecting with a specific Hive version (0.12 in this case):
+
+     RBHive.tcli_connect('hive.server.address', 10_000, { hive_version: 12 }) do |connection|
+       connection.fetch('SHOW TABLES')
+     end
+
+ Connecting with a specific Hive version (0.12) and the `:http` transport:
+
+     RBHive.tcli_connect('hive.server.address', 10_000, { hive_version: 12, transport: :http }) do |connection|
+       connection.fetch('SHOW TABLES')
+     end
+
+ We have not tested the SASL connection, as we don't run SASL; pull requests and testing are welcome.
+
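As a starting point, a SASL connection might be configured like the sketch below. We have not verified this against a live SASL setup; in particular the `username`/`password` keys inside `:sasl_params` are assumptions about a typical PLAIN configuration, not documented gem behaviour:

```ruby
# Illustrative options for a SASL-secured Hiveserver2 connection.
# The :sasl_params keys below are assumptions, not a tested configuration.
sasl_options = {
  transport: :sasl,
  hive_version: 12,
  sasl_params: {
    username: 'hive_user',  # hypothetical credential
    password: 'hive_pass'   # hypothetical credential
  }
}

# With a live server you would then connect as usual:
# RBHive.tcli_connect('hive.server.address', 10_000, sasl_options) do |connection|
#   connection.fetch('SHOW TABLES')
# end
```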
+ #### Hiveserver2 protocol versions
+
+ Since the introduction of Hiveserver2 in Hive 0.10, there have been a number of revisions to the Thrift protocol it uses.
+
+ The following table lists the values you can supply to the `:hive_version` parameter when making a connection
+ to Hiveserver2.
+
+ | value | Thrift protocol version | notes
+ | ------- | ----------------------- | -----
+ | `10` | V1 | First version of the Thrift protocol, used only by Hive 0.10
+ | `11` | V2 | Used by the Hive 0.11 release (*but not CDH5, which ships with Hive 0.11!*) - adds asynchronous execution
+ | `12` | V3 | Used by the Hive 0.12 release; adds the varchar type and primitive type qualifiers
+ | `13` | V7 | Used by the Hive 0.13 release; adds the features from V4, V5 and V6, plus token-based delegation connections
+ | `:cdh4` | V1 | CDH4 uses the V1 protocol, as it ships with the upstream Hive 0.10
+ | `:cdh5` | V5 | CDH5 ships with upstream Hive 0.11, but adds patches to bring the Thrift protocol up to V5
+
+ In addition, you can explicitly set the Thrift protocol version according to this table:
+
+ | value | Thrift protocol version | notes
+ | --------------- | ----------------------- | -----
+ | `:PROTOCOL_V1` | V1 | Used by the Hive 0.10 release
+ | `:PROTOCOL_V2` | V2 | Used by the Hive 0.11 release
+ | `:PROTOCOL_V3` | V3 | Used by the Hive 0.12 release
+ | `:PROTOCOL_V4` | V4 | Updated during Hive 0.13 development; adds decimal precision/scale and the char type
+ | `:PROTOCOL_V5` | V5 | Updated during Hive 0.13 development; adds error details when GetOperationStatus returns in an error state
+ | `:PROTOCOL_V6` | V6 | Updated during Hive 0.13 development; adds a binary type for binary payloads and uses a columnar result set
+ | `:PROTOCOL_V7` | V7 | Used by the Hive 0.13 release; adds support for token-based delegation connections
+
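Putting the two tables together, pinning an explicit protocol should just be a matter of passing one of the `:PROTOCOL_V*` symbols as `:hive_version`; this is a sketch, and we have not exercised every protocol version listed:

```ruby
# Hypothetical example: pin the connection to Thrift protocol V5
# (the patched Hive 0.11 that CDH5 ships) using a symbol from the table above.
options = { hive_version: :PROTOCOL_V5, transport: :buffered }

# RBHive.tcli_connect('hive.server.address', 10_000, options) do |connection|
#   connection.fetch('SHOW TABLES')
# end
```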
+ ## Asynchronous execution with Hiveserver2
+
+ In versions of Hive later than 0.12, the Thrift server supports asynchronous execution.
+
+ The high-level view of using this feature is as follows:
+
+ 1. Submit your query using `async_execute(query)`. This method returns a hash
+ with the following keys: `:guid`, `:secret`, and `:session`. You don't need to
+ care about the internals of this hash - all methods that interact with an async
+ query take this hash, so you can simply store it and hand it to those methods.
+ 2. To check the state of the query, call `async_state(handles)`, where `handles`
+ is the handles hash given to you when you called `async_execute(query)`.
+ 3. To retrieve results, call either `async_fetch(handles)` or `async_fetch_in_batch(handles)`,
+ which work like the non-async methods.
+ 4. When you're done with the query, call `async_close_session(handles)`.
+
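The polling step above can be wrapped in a small helper. This is our own sketch (the `wait_for_async` name is ours, not part of the gem); it assumes only the `async_state(handles)` method documented below:

```ruby
# Poll an async query until it leaves the not-yet-finished states,
# returning the final state symbol (e.g. :finished, :error, :cancelled).
def wait_for_async(connection, handles, interval: 5, max_polls: 720)
  max_polls.times do
    state = connection.async_state(handles)
    return state unless [:initialized, :pending, :running].include?(state)
    sleep interval
  end
  :unknown # gave up waiting; the query may still be running
end
```

When this returns `:finished`, retrieve the results with `async_fetch(handles)`, and remember to call `async_close_session(handles)` afterwards.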
+ ### Memory leaks
+
+ When you call `async_close_session(handles)`, *all async handles created during this
+ session are closed*.
+
+ If you do not close the sessions you create, *you will leak memory in the Hiveserver2 process*.
+ Be very careful to close your sessions!
+
+ ### Method documentation
+
+ #### `async_execute(query)`
+
+ This method submits a query for async execution. The hash you get back is used by the other
+ async methods, and will look like this:
+
+     {
+       :guid => (binary string),
+       :secret => (binary string),
+       :session => (binary string)
+     }
+
+ The Thrift protocol specifies these strings as "binary" - which means they have no encoding.
+ Be *extremely* careful when manipulating or storing these values, as they can quite easily
+ get converted to UTF-8 strings, which will make them invalid when you try to retrieve your async results.
+
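One way to avoid accidental re-encoding when persisting these handles (in a database or a job queue, say) is to Base64-encode each value before storage. This helper pair is our own sketch, not part of the gem's API:

```ruby
require 'base64'

# Encode each binary handle value into a plain ASCII-safe string for storage.
def encode_handles(handles)
  handles.transform_values { |value| Base64.strict_encode64(value) }
end

# Decode back to binary strings before passing the hash to the async_* methods.
# Base64.strict_decode64 returns ASCII-8BIT (binary) strings, which is what we want.
def decode_handles(stored)
  stored.transform_values { |value| Base64.strict_decode64(value) }
end
```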
+ #### `async_state(handles)`
+
+ `handles` is the hash returned by `async_execute(query)`. The state will be a symbol with
+ one of the following values and meanings:
+
+ | symbol | meaning
+ | --------------------- | -------
+ | :initialized | The query is initialized in Hive and ready to run
+ | :running | The query is running (either as a MapReduce job or within the Hiveserver2 process)
+ | :finished | The query is complete and its results can be retrieved
+ | :cancelled | The query was cancelled by a user
+ | :closed | Unknown at present
+ | :error | The query is semantically invalid or broken in another way
+ | :unknown | The query is in an unknown state
+ | :pending | The query is ready to run but is not running
+
+ There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
+ `async_is_failed?(handles)` and `async_is_cancelled?(handles)`.
+
+ #### `async_cancel(handles)`
+
+ Calling this method will cancel the query in execution.
+
+ #### `async_fetch(handles)`, `async_fetch_in_batch(handles)`
+
+ These methods let you fetch the results of the async query, if it is complete. If you call
+ these methods on an incomplete query, they will raise an exception. They work in exactly the
+ same way as the normal synchronous methods.
+
+ ## Examples
+
+ ### Fetching results
+
+ #### Hiveserver
+
+     RBHive.connect('hive.server.address', 10_000) do |connection|
+       connection.fetch 'SELECT city, country FROM cities'
+     end
+     ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]
+
+ #### Hiveserver2
+
+     RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
+       connection.fetch 'SELECT city, country FROM cities'
+     end
+     ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]
+
+ ### Executing a query
+
+ #### Hiveserver
+
+     RBHive.connect('hive.server.address') do |connection|
+       connection.execute 'DROP TABLE cities'
+     end
+     ➔ nil
+
+ #### Hiveserver2
+
+     RBHive.tcli_connect('hive.server.address') do |connection|
+       connection.execute 'DROP TABLE cities'
+     end
+     ➔ nil
+
+ ### Creating tables
+
+     table = TableSchema.new('person', 'List of people that owe me money') do
+       column 'name', :string, 'Full name of debtor'
+       column 'address', :string, 'Address of debtor'
+       column 'amount', :float, 'The amount of money borrowed'
+
+       partition 'dated', :string, 'The date money was given'
+       partition 'country', :string, 'The country the person resides in'
+     end
+
+ Then for Hiveserver:
+
+     RBHive.connect('hive.server.address', 10_000) do |connection|
+       connection.create_table(table)
+     end
+
+ Or Hiveserver2:
+
+     RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
+       connection.create_table(table)
+     end
+
+ ### Modifying table schema
+
+     table = TableSchema.new('person', 'List of people that owe me money') do
+       column 'name', :string, 'Full name of debtor'
+       column 'address', :string, 'Address of debtor'
+       column 'amount', :float, 'The amount of money borrowed'
+       column 'new_amount', :float, 'The new amount this person somehow convinced me to give them'
+
+       partition 'dated', :string, 'The date money was given'
+       partition 'country', :string, 'The country the person resides in'
+     end
+
+ Then for Hiveserver:
+
+     RBHive.connect('hive.server.address') do |connection|
+       connection.replace_columns(table)
+     end
+
+ Or Hiveserver2:
+
+     RBHive.tcli_connect('hive.server.address') do |connection|
+       connection.replace_columns(table)
+     end
+
+ ### Setting properties
+
+ You can set various properties for Hive tasks, some of which change how they run. Consult the Apache
+ Hive and Hadoop documentation for the various properties that can be set.
+ For example, you can set the MapReduce job's priority with the following:
+
+     connection.set("mapred.job.priority", "VERY_HIGH")
+
+ ### Inspecting tables
+
+ #### Hiveserver
+
+     RBHive.connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
+       result = connection.fetch("describe some_table")
+       puts result.column_names.inspect
+       puts result.first.inspect
+     }
+
+ #### Hiveserver2
+
+     RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
+       result = connection.fetch("describe some_table")
+       puts result.column_names.inspect
+       puts result.first.inspect
+     }
+
+ ## Testing
+
+ We use RBHive against Hive 0.10, 0.11 and 0.12, and have tested the BufferedTransport and
+ HTTPClientTransport. We use it against both Hiveserver and Hiveserver2 with success.
+
+ We have _not_ tested the SaslClientTransport, and would welcome reports
+ on whether it works correctly.
+
+ ## Contributing
+
+ We welcome contributions, issues and pull requests. If there's a feature missing from RBHive that you need, or you
+ think you've found a bug, please do not hesitate to create an issue.
+
+ 1. Fork it
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
+ 4. Push to the branch (`git push origin my-new-feature`)
+ 5. Create a new Pull Request
@@ -0,0 +1 @@
+ require "bundler/gem_tasks"
@@ -0,0 +1,8 @@
+ require File.join(File.dirname(__FILE__), 'rbhive', 'connection')
+ require File.join(File.dirname(__FILE__), 'rbhive', 'table_schema')
+ require File.join(File.dirname(__FILE__), 'rbhive', 'result_set')
+ require File.join(File.dirname(__FILE__), 'rbhive', 'explain_result')
+ require File.join(File.dirname(__FILE__), 'rbhive', 'schema_definition')
+ require File.join(File.dirname(__FILE__), *%w[rbhive t_c_l_i_result_set])
+ require File.join(File.dirname(__FILE__), *%w[rbhive t_c_l_i_schema_definition])
+ require File.join(File.dirname(__FILE__), *%w[rbhive t_c_l_i_connection])
@@ -0,0 +1,150 @@
+ # suppress warnings
+ old_verbose, $VERBOSE = $VERBOSE, nil
+ # require thrift autogenerated files
+ require File.join(File.split(File.dirname(__FILE__)).first, *%w[thrift thrift_hive])
+ # require 'thrift'
+ # restore warnings
+ $VERBOSE = old_verbose
+
+ module RBHive
+   def connect(server, port=10_000, logger=StdOutLogger.new)
+     connection = RBHive::Connection.new(server, port, logger)
+     ret = nil
+     begin
+       connection.open
+       ret = yield(connection)
+     ensure
+       connection.close
+       ret
+     end
+   end
+   module_function :connect
+
+   class StdOutLogger
+     %w(fatal error warn info debug).each do |level|
+       define_method level.to_sym do |message|
+         STDOUT.puts(message)
+       end
+     end
+   end
+
+   class Connection
+     attr_reader :client
+
+     def initialize(server, port=10_000, logger=StdOutLogger.new)
+       @socket = Thrift::Socket.new(server, port)
+       @transport = Thrift::BufferedTransport.new(@socket)
+       @protocol = Thrift::BinaryProtocol.new(@transport)
+       @client = Hive::Thrift::ThriftHive::Client.new(@protocol)
+       @logger = logger
+       @logger.info("#{Time.now}: Connecting to #{server} on port #{port}")
+       @mutex = Mutex.new
+     end
+
+     def open
+       @transport.open
+     end
+
+     def close
+       @transport.close
+     end
+
+     def client
+       @client
+     end
+
+     def execute(query)
+       execute_safe(query)
+     end
+
+     def explain(query)
+       safe do
+         execute_unsafe("EXPLAIN " + query)
+         ExplainResult.new(client.fetchAll)
+       end
+     end
+
+     def priority=(priority)
+       set("mapred.job.priority", priority)
+     end
+
+     def queue=(queue)
+       set("mapred.job.queue.name", queue)
+     end
+
+     def set(name, value)
+       @logger.info("Setting #{name}=#{value}")
+       client.execute("SET #{name}=#{value}")
+     end
+
+     def fetch(query)
+       safe do
+         execute_unsafe(query)
+         rows = client.fetchAll
+         the_schema = SchemaDefinition.new(client.getSchema, rows.first)
+         ResultSet.new(rows, the_schema)
+       end
+     end
+
+     def fetch_in_batch(query, batch_size=1_000)
+       safe do
+         execute_unsafe(query)
+         until (next_batch = client.fetchN(batch_size)).empty?
+           the_schema ||= SchemaDefinition.new(client.getSchema, next_batch.first)
+           yield ResultSet.new(next_batch, the_schema)
+         end
+       end
+     end
+
+     def first(query)
+       safe do
+         execute_unsafe(query)
+         row = client.fetchOne
+         the_schema = SchemaDefinition.new(client.getSchema, row)
+         ResultSet.new([row], the_schema).first
+       end
+     end
+
+     def schema(example_row=[])
+       safe { SchemaDefinition.new(client.getSchema, example_row) }
+     end
+
+     def create_table(schema)
+       execute(schema.create_table_statement)
+     end
+
+     def drop_table(name)
+       name = name.name if name.is_a?(TableSchema)
+       execute("DROP TABLE `#{name}`")
+     end
+
+     def replace_columns(schema)
+       execute(schema.replace_columns_statement)
+     end
+
+     def add_columns(schema)
+       execute(schema.add_columns_statement)
+     end
+
+     def method_missing(meth, *args)
+       client.send(meth, *args)
+     end
+
+     private
+
+     def execute_safe(query)
+       safe { execute_unsafe(query) }
+     end
+
+     def execute_unsafe(query)
+       @logger.info("Executing Hive Query: #{query}")
+       client.execute(query)
+     end
+
+     def safe
+       ret = nil
+       @mutex.synchronize { ret = yield }
+       ret
+     end
+   end
+ end