dbx-api 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/.rubocop.yml +3 -0
- data/CHANGELOG.md +10 -1
- data/Gemfile.lock +1 -1
- data/README.md +61 -6
- data/lib/dbx/databricks/sql.rb +31 -45
- data/lib/dbx/databricks/sql_response.rb +113 -0
- data/lib/dbx/gateway.rb +4 -1
- data/lib/dbx/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '09cd7a1984478b2761fbe0dca4b69acd123d664c96ea7333997143fe4389aa3b'
+  data.tar.gz: 5e1093ab32b19c13eff195869d12dc159459f4418f3bc260df241e737cd5f78f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 407516bedbe4fa69d01ad765804aa2593966faebfa966eadac4ed08da44fb41fc7ea055b2be781194e762916784c4cb1d0bb0d4c44b9b3ad8ed23273415277f2
+  data.tar.gz: 569cdbc0465214559dd397e27d143cf3e56fbc96e1f1447c45bd18ba472ef444ec712cf16ae69837b6380de54135c3e42dd1af775f22221954b149c70f249a5c
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,14 @@
 ## [Unreleased]
 
-## [0.1.
+## [0.1.1] - 2023-09-27
+- Yanked because I didn't know what I was doing
 
+## [0.1.2]
 - Initial release
+
+## [0.2.0]
+- Added `DatabricksSQLResponse` class
+- `DatabricksGateway::run_sql` now returns an object of type `DatabricksSQLResponse`
+- results can be accessed by `DatabricksSQLResponse::results`
+- query success can be accessed by `DatabricksSQLResponse::success?`
+- Added optional `sleep_timer` parameter to `DatabricksGateway`. This is the number of seconds to wait between checking the status of a query. Defaults to 5 seconds.
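The `sleep_timer` entry describes a wait between status checks. A minimal sketch of that polling pattern follows, assuming a hypothetical `poll_until_finished` helper whose block stands in for the HTTP status request; neither name comes from the gem.

```ruby
# Hypothetical helper: keeps checking a status source until it leaves the
# PENDING/RUNNING states. `sleep_timer` mirrors the option named in the
# changelog; the block is an assumed stand-in for the real status request.
def poll_until_finished(sleep_timer: 5)
  status = yield
  while %w[PENDING RUNNING].include?(status)
    sleep(sleep_timer)
    status = yield
  end
  status
end

# Usage with canned statuses instead of a real warehouse.
statuses = %w[PENDING RUNNING SUCCEEDED]
calls = 0
final = poll_until_finished(sleep_timer: 0) do
  calls += 1
  statuses.shift
end
puts "#{final} after #{calls} checks"
```

A shorter interval returns results sooner at the cost of more status requests, which is presumably why the value is configurable.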
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -4,7 +4,10 @@
 This gem is designed to allow access to the DBX APIs (Jobs and SQL) from ruby applications.
 
 ## Installation
-
+Add the following to your Gemfile to install
+```ruby
+gem 'dbx-api', '~>0.2.0'
+```
 
 ## Usage
 Set up your .env file (optional)
@@ -25,17 +28,69 @@ sql_runner = DatabricksGateway.new
 sql_runner = DatabricksGateway.new(host: 'DBX_CONNECTION_STRING', token: 'DBX_ACCESS_TOKEN', warehouse: 'DBX_SQL_WAREHOUSE_ID')
 
 # Basic sql
-
-
+response = sql_runner.run_sql("SELECT 1")
+response.results
 # => [{"1"=>"1"}]
 
 # Dummy data in public DBX table
-
-
+response = sql_runner.run_sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1")
+response.results
 # => [{"tpep_pickup_datetime"=>"2016-02-14T16:52:13.000Z",
 # "tpep_dropoff_datetime"=>"2016-02-14T17:16:04.000Z",
 # "trip_distance"=>"4.94",
 # "fare_amount"=>"19.0",
 # "pickup_zip"=>"10282",
 # "dropoff_zip"=>"10171"}]
-```
+```
+
+`run_sql` returns an object of type DatabricksSQLResponse.
+
+The response object has a few useful methods. For a complete list, see the class definition: `lib/dbx/databricks/sql_response.rb`
+```ruby
+response = sql_runner.run_sql("SELECT 1")
+
+# checking the status of a response
+response.status # => SUCCEEDED | FAILED | PENDING | RUNNING
+response.failed? # => Boolean
+response.success? # => Boolean
+
+# getting the results of a response
+response.results # => Array of Hashes
+
+# looking at the raw response
+response.raw_response # => HTTP object
+# or just the parsed body of the HTTP response
+response.body
+
+# checking error messages for failed responses
+response.error_message # => String
+```
+
+This gem does not make an inference to how error handling should occur. `run_sql` always returns an array, even if the query fails (it will return `[]` if status.failed?). Users may wish to check the status of the response before attempting to access the results. For example:
+```ruby
+require 'dbx'
+
+sql_runner = DatabricksGateway.new
+res = sql_runner.run_sql("SELECT 1")
+
+# do something with the results if the query succeeded
+return res.results if res.success?
+
+# do something else if the query failed
+puts "query failed: #{res.error_message}"
+```
+
+Since `run_sql` returns an instance of `DatabricksSQLResponse`, you can also chain methods together:
+```ruby
+sql_runner.run_sql("SELECT 1").results
+```
+
+## Development
+- After checking out the repo, run `bin/setup` to install dependencies.
+- Set up your `.env` file as described above.
+- Run `rake spec` to run the rspec tests.
+
+## Build
+- Run `gem build dbx.gemspec ` to build the gem.
+- Run `gem push dbx-api-0.2.0.gem` to push the gem to rubygems.org
+- Requires logging in to rubygems.org first via `gem login`
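The README text in this hunk says `run_sql` always returns an array. Going by the `results` method in `sql_response.rb` elsewhere in this diff, it appears to be `results` that returns `[]` on failure, while `run_sql` returns the response object. Below is a minimal sketch of a caller that checks status before reading results. `FakeResponse` and `results_or_raise` are hypothetical stand-ins so the example runs without a warehouse; they are not part of the gem.

```ruby
# Hypothetical stand-in for DatabricksSQLResponse, exposing only the methods
# the README lists. It exists so this sketch runs without a real warehouse.
FakeResponse = Struct.new(:status, :rows, :error_message) do
  def success?
    status == "SUCCEEDED"
  end

  def failed?
    status == "FAILED"
  end

  # Mirrors the behaviour shown in the diff: an empty array when the query failed.
  def results
    failed? ? [] : rows
  end
end

# Hypothetical caller-side policy: raise on failure instead of silently
# receiving an empty array.
def results_or_raise(response)
  raise "query failed: #{response.error_message}" if response.failed?

  response.results
end

ok = FakeResponse.new("SUCCEEDED", [{ "1" => "1" }], nil)
bad = FakeResponse.new("FAILED", [], "syntax error")

p results_or_raise(ok)
p bad.results
```

Raising is only one possible policy; since the gem leaves error handling to the caller, logging and returning a default would be equally consistent with the README.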
data/lib/dbx/databricks/sql.rb
CHANGED
@@ -1,6 +1,7 @@
 # frozen_string_literal: true
 
 require "json"
+require_relative "sql_response"
 
 # This module handles the execution of SQL statements via the DBX API.
 # For more information about the DBX SQL API, see: https://docs.databricks.com/sql/admin/sql-execution-tutorial.html
@@ -30,7 +31,7 @@ module DatabricksSQL
   # POST SQL query to DBX
   def post_sql_request(sql)
     response = http.request(sql_request(sql))
-    response
+    DatabricksSQLResponse.new(response)
   end
 
   # GET request object
@@ -40,71 +41,56 @@ module DatabricksSQL
     Net::HTTP::Get.new(req_uri, request_headers)
   end
 
-  # GET results of SQL query from DBX.
-  def get_sql_results(http_response)
-    statement_id = JSON.parse(http_response)["statement_id"]
-    response = http.request(sql_results_request(statement_id))
-    puts "#{statement_id}: #{JSON.parse(response.body)["status"]["state"]}"
-    response.body
-  end
-
   # GET SQL chunk from DBX by internal link
+  # @return [Hash<{"chunk_index" => Number, "row_offset" => Number, "row_count" => Number, "data_array" => Array<Array>}>] # rubocop:disable Layout/LineLength
   def get_sql_chunk(chunk_url)
+    puts "GET chunk: #{chunk_url}"
     request = Net::HTTP::Get.new(chunk_url, request_headers)
     response = http.request(request)
-    response
+    DatabricksSQLResponse.new(response)
   end
 
   # Load additional chunks of data from DBX.
   # DBX returns data with maximum chunk size of 16mb.
-  def load_additional_chunks(
-    next_chunk =
+  def load_additional_chunks(response)
+    next_chunk = response.next_chunk
     while next_chunk
-
-
-
-      data = results_hash["result"]["data_array"]
-      results_hash["result"]["data_array"] = [*data, *result]
-      next_chunk = parsed_response["next_chunk_internal_link"]
+      chunk_response = get_sql_chunk(next_chunk)
+      response.add_chunk_to_data(chunk_response)
+      next_chunk = chunk_response.next_chunk
     end
   end
 
+  # GET results of SQL query from DBX.
+  def get_sql_results(dbx_sql_response)
+    statement_id = dbx_sql_response.statement_id
+    http_response = http.request(sql_results_request(statement_id))
+    response = DatabricksSQLResponse.new(http_response)
+    puts "#{statement_id}: #{response.status}"
+    response
+  end
+
   # Wait for SQL query response from DBX.
   # Returns a hash of the results of the SQL query.
   def wait_for_sql_response(response)
     result = get_sql_results(response)
-
-    # PENDING means the warehouse is starting up
-    # RUNNING means the query is still executing
-    while %w[PENDING RUNNING].include?(status)
-      sleep(5)
-      result = get_sql_results(response)
-      status = JSON.parse(result)["status"]["state"]
-    end
-    JSON.parse(result)
-  end
+    still_running = result.pending?
 
-
-
-
-
-    data_array = JSON.parse(http_response)["result"]["data_array"]
-
-    data_array.map do |row|
-      hash = {}
-      keys.each do |key|
-        hash[key["name"]] = row[key["position"]]
-      end
-      hash
+    while still_running
+      sleep(@sleep_timer)
+      result = get_sql_results(response)
+      still_running = result.pending?
     end
+    result
   end
 
   # Submit SQL query to DBX and return results.
-  #
+  # @return [DatabricksSQLResponse]
   def run_sql(sql)
-
-
-
-
+    posted_sql = post_sql_request(sql)
+    sql_results = wait_for_sql_response(posted_sql)
+
+    load_additional_chunks(sql_results) if sql_results.more_chunks?
+    sql_results
   end
 end
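The new `load_additional_chunks` follows a linked list of chunks: fetch the next link, append its rows, repeat until no link remains. A self-contained sketch of that traversal is below. The `CHUNKS` table, the `collect_rows` helper, and the exact placement of the `next_chunk_internal_link` key in a chunk body are assumptions made for illustration, not the gem's code or the API's documented shape.

```ruby
# Hypothetical in-memory stand-in for the chunk endpoint: each link maps to a
# chunk body holding its rows and, optionally, the link to the following chunk.
# The key layout here is assumed for the sake of the example.
CHUNKS = {
  "/chunks/1" => { "data_array" => [["b"]], "next_chunk_internal_link" => "/chunks/2" },
  "/chunks/2" => { "data_array" => [["c"]] }
}.freeze

# Follow chunk links until none remains, appending each chunk's rows.
# `fetch_chunk` is an assumed callable standing in for the HTTP GET.
def collect_rows(first_rows, first_link, fetch_chunk)
  rows = first_rows.dup
  link = first_link
  while link
    chunk = fetch_chunk.call(link)
    rows.concat(chunk["data_array"] || [])
    link = chunk["next_chunk_internal_link"]
  end
  rows
end

all_rows = collect_rows([["a"]], "/chunks/1", ->(link) { CHUNKS.fetch(link) })
puts all_rows.length
```

Passing the fetcher in as a callable keeps the traversal testable without network access.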
data/lib/dbx/databricks/sql_response.rb
ADDED
@@ -0,0 +1,113 @@
+# frozen_string_literal: true
+
+require "pry"
+
+# This class represents a response from the Databricks SQL API.
+# It is used by DatabricksSQL to handle http failures and parse the response body.
+class DatabricksSQLResponse
+  def initialize(http_response)
+    self.raw_response = http_response
+    self.body = parse_body
+    self.data_array = extract_data_array
+  end
+
+  attr_accessor :raw_response, :body, :data_array
+
+  # -------------------- BODY --------------------
+
+  # Parse the response body as JSON.
+  def parse_body
+    return {} unless raw_response.is_a?(Net::HTTPSuccess)
+
+    @body = JSON.parse(raw_response.body)
+  end
+
+  # Dig out the statement_id from the response body.
+  # @return [String | nil]
+  def statement_id
+    body["statement_id"]
+  end
+
+  # -------------------- CHUNKS --------------------
+
+  # Determine if the response contains multiple chunks.
+  def more_chunks?
+    chunk_count = body&.dig("manifest", "total_chunk_count")&.to_i
+    chunk_count && chunk_count > 1
+  end
+
+  # Dig out the next_chunk_internal_link from the response body.
+  # @return [String | nil]
+  def next_chunk
+    body.dig("result", "next_chunk_internal_link")
+  end
+
+  # Combine the data from the chunk response into the data from the original response.
+  # @return [Array]
+  def add_chunk_to_data(chunk_response)
+    chunk_data_array = chunk_response.data_array
+    self.data_array = [*data_array, *chunk_data_array]
+  end
+
+  # -------------------- STATUS --------------------
+
+  # Determine if the response from the API has succeeded.
+  def success?
+    status == "SUCCEEDED"
+  end
+
+  # Determine if the response from the API is still executing.
+  # PENDING means the warehouse is starting up
+  # RUNNING means the query is still executing
+  def pending?
+    %w[PENDING RUNNING].include?(status)
+  end
+
+  # Determine if the response from the API has failed.
+  def failed?
+    status == "FAILED"
+  end
+
+  # Dig out the error message from the response body.
+  # @return [String | nil]
+  def error_message
+    body.dig("status", "error", "message")
+  end
+
+  # Dig out the status of the query from the response body.
+  # @return [String]
+  def status
+    return "FAILED" unless raw_response.is_a?(Net::HTTPSuccess)
+
+    body.dig("status", "state")
+  end
+
+  # ------------------- RESULTS --------------------
+
+  # Dig out the columns array from the response body.
+  # @return [Array<String>]
+  def columns
+    body.dig("manifest", "schema", "columns") || []
+  end
+
+  # Dig out values array for the queried data.
+  # Chunks have a simpler hash structure than initial SQL responses.
+  # @return [Array<Array>]
+  def extract_data_array
+    body.dig("result", "data_array") || body["data_array"] || []
+  end
+
+  # Return the results of the query as an array of hashes.
+  # @return [Array<Hash>]
+  def results
+    return [] if failed?
+
+    data_array.map do |row|
+      hash = {}
+      columns.each do |column|
+        hash[column["name"]] = row[column["position"]]
+      end
+      hash
+    end
+  end
+end
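The `results` method pairs each column's `"name"` with the row value at that column's `"position"`. A standalone sketch of that mapping follows. `rows_to_hashes` is a hypothetical helper name and the sample columns and row are invented for illustration, borrowing field names from the README example.

```ruby
# Build one hash per row by pairing each column's "name" with the value found
# at its "position" in the row. Hypothetical helper, not part of the gem.
def rows_to_hashes(columns, data_array)
  data_array.map do |row|
    columns.each_with_object({}) do |column, hash|
      hash[column["name"]] = row[column["position"]]
    end
  end
end

# Sample manifest columns and one row; the values are invented for illustration.
columns = [
  { "name" => "pickup_zip", "position" => 0 },
  { "name" => "dropoff_zip", "position" => 1 }
]
data_array = [%w[10282 10171]]

records = rows_to_hashes(columns, data_array)
puts records.first["pickup_zip"]
```

Looking values up by `"position"` rather than by array order means the mapping stays correct even if the columns list were to arrive in a different order than the row values.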
data/lib/dbx/gateway.rb
CHANGED
@@ -6,11 +6,13 @@ require_relative "databricks/databricks"
 # This class is a gateway to the Databricks API.
 # https://docs.databricks.com/api-explorer/workspace/introduction
 class DatabricksGateway
-  def initialize(host: ENV.fetch("DBX_HOST", nil), token: ENV.fetch("DBX_TOKEN", nil),
+  def initialize(host: ENV.fetch("DBX_HOST", nil), token: ENV.fetch("DBX_TOKEN", nil),
+                 warehouse: ENV.fetch("DBX_WAREHOUSE_ID", nil), sleep_timer: 5)
     @base_url = host
     @uri = URI(@base_url)
     @token = token
     @warehouse = warehouse
+    @sleep_timer = sleep_timer
   end
 
   # HTTP request headers
@@ -23,6 +25,7 @@ class DatabricksGateway
   end
 
   # HTTP connection object
+  # @return [Net::HTTP]
   def http
     http = Net::HTTP.new(@uri.host, @uri.port)
     http.use_ssl = true
data/lib/dbx/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: dbx-api
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - cmmille
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2023-
+date: 2023-10-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: dotenv
@@ -45,6 +45,7 @@ files:
 - lib/dbx/databricks/databricks.rb
 - lib/dbx/databricks/jobs.rb
 - lib/dbx/databricks/sql.rb
+- lib/dbx/databricks/sql_response.rb
 - lib/dbx/gateway.rb
 - lib/dbx/version.rb
 - sig/dbx.rbs