dbx-api 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/.rubocop.yml +3 -0
- data/CHANGELOG.md +10 -1
- data/Gemfile.lock +1 -1
- data/README.md +61 -6
- data/lib/dbx/databricks/sql.rb +31 -45
- data/lib/dbx/databricks/sql_response.rb +113 -0
- data/lib/dbx/gateway.rb +4 -1
- data/lib/dbx/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '09cd7a1984478b2761fbe0dca4b69acd123d664c96ea7333997143fe4389aa3b'
+  data.tar.gz: 5e1093ab32b19c13eff195869d12dc159459f4418f3bc260df241e737cd5f78f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 407516bedbe4fa69d01ad765804aa2593966faebfa966eadac4ed08da44fb41fc7ea055b2be781194e762916784c4cb1d0bb0d4c44b9b3ad8ed23273415277f2
+  data.tar.gz: 569cdbc0465214559dd397e27d143cf3e56fbc96e1f1447c45bd18ba472ef444ec712cf16ae69837b6380de54135c3e42dd1af775f22221954b149c70f249a5c
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,14 @@
 ## [Unreleased]
 
-## [0.1.
+## [0.1.1] - 2023-09-27
+- Yanked because I didn't know what I was doing
 
+## [0.1.2]
 - Initial release
+
+## [0.2.0]
+- Added `DatabricksSQLResponse` class
+- `DatabricksGateway::run_sql` now returns an object of type `DatabricksSQLResponse`
+- results can be accessed by `DatabricksSQLResponse::results`
+- query success can be accessed by `DatabricksSQLResponse::success?`
+- Added optional `sleep_timer` parameter to `DatabricksGateway`. This is the number of seconds to wait between checking the status of a query. Defaults to 5 seconds.
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -4,7 +4,10 @@
 This gem is designed to allow access to the DBX APIs (Jobs and SQL) from ruby applications.
 
 ## Installation
-
+Add the following to your Gemfile to install
+```ruby
+gem 'dbx-api', '~>0.2.0'
+```
 
 ## Usage
 Set up your .env file (optional)
@@ -25,17 +28,69 @@ sql_runner = DatabricksGateway.new
 sql_runner = DatabricksGateway.new(host: 'DBX_CONNECTION_STRING', token: 'DBX_ACCESS_TOKEN', warehouse: 'DBX_SQL_WAREHOUSE_ID')
 
 # Basic sql
-
-
+response = sql_runner.run_sql("SELECT 1")
+response.results
 # => [{"1"=>"1"}]
 
 # Dummy data in public DBX table
-
-
+response = sql_runner.run_sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1")
+response.results
 # => [{"tpep_pickup_datetime"=>"2016-02-14T16:52:13.000Z",
 # "tpep_dropoff_datetime"=>"2016-02-14T17:16:04.000Z",
 # "trip_distance"=>"4.94",
 # "fare_amount"=>"19.0",
 # "pickup_zip"=>"10282",
 # "dropoff_zip"=>"10171"}]
-```
+```
+
+`run_sql` returns an object of type DatabricksSQLResponse.
+
+The response object has a few useful methods. For a complete list, see the class definition: `lib/dbx/databricks/sql_response.rb`
+```ruby
+response = sql_runner.run_sql("SELECT 1")
+
+# checking the status of a response
+response.status # => SUCCEEDED | FAILED | PENDING | RUNNING
+response.failed? # => Boolean
+response.success? # => Boolean
+
+# getting the results of a response
+response.results # => Array of Hashes
+
+# looking at the raw response
+response.raw_response # => HTTP object
+# or just the parsed body of the HTTP response
+response.body
+
+# checking error messages for failed responses
+response.error_message # => String
+```
+
+This gem does not make an inference to how error handling should occur. `run_sql` always returns an array, even if the query fails (it will return `[]` if status.failed?). Users may wish to check the status of the response before attempting to access the results. For example:
+```ruby
+require 'dbx'
+
+sql_runner = DatabricksGateway.new
+res = sql_runner.run_sql("SELECT 1")
+
+# do something with the results if the query succeeded
+return res.results if res.success?
+
+# do something else if the query failed
+puts "query failed: #{res.error_message}"
+```
+
+Since `run_sql` returns an instance of `DatabricksSQLResponse`, you can also chain methods together:
+```ruby
+sql_runner.run_sql("SELECT 1").results
+```
+
+## Development
+- After checking out the repo, run `bin/setup` to install dependencies.
+- Set up your `.env` file as described above.
+- Run `rake spec` to run the rspec tests.
+
+## Build
+- Run `gem build dbx.gemspec ` to build the gem.
+- Run `gem push dbx-api-0.2.0.gem` to push the gem to rubygems.org
+- Requires logging in to rubygems.org first via `gem login`
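The README text above leaves error handling to the caller. Going by `sql_response.rb` in this release, it is `results` (rather than `run_sql` itself) that returns `[]` when the query has failed. One way to fail loudly instead is sketched below. This is only an illustration: `results_or_raise`, `QueryFailedError` and the `FakeResponse` struct are hypothetical names that are not part of the gem, and the struct stands in for `DatabricksSQLResponse` so that the snippet runs without a Databricks connection.

```ruby
# Hypothetical stand-in for DatabricksSQLResponse; only the methods used
# below are modelled. Not part of the gem.
FakeResponse = Struct.new(:status, :results, :error_message) do
  def success?
    status == "SUCCEEDED"
  end
end

# Hypothetical error class, not part of the gem.
class QueryFailedError < StandardError; end

# Hypothetical helper: return the rows on success, otherwise raise with
# the error message carried by the response.
def results_or_raise(response)
  return response.results if response.success?

  raise QueryFailedError, "query failed: #{response.error_message}"
end

# Sample responses (the error text is made up for illustration).
ok = FakeResponse.new("SUCCEEDED", [{ "1" => "1" }], nil)
bad = FakeResponse.new("FAILED", [], "TABLE_OR_VIEW_NOT_FOUND")

rows = results_or_raise(ok)

begin
  results_or_raise(bad)
rescue QueryFailedError => e
  failure = e.message
end
```

Raising keeps a failed query from being mistaken for an empty result set; callers who prefer the gem's default behaviour can keep checking `success?` as the README shows.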
data/lib/dbx/databricks/sql.rb
CHANGED
@@ -1,6 +1,7 @@
 # frozen_string_literal: true
 
 require "json"
+require_relative "sql_response"
 
 # This module handles the execution of SQL statements via the DBX API.
 # For more information about the DBX SQL API, see: https://docs.databricks.com/sql/admin/sql-execution-tutorial.html
@@ -30,7 +31,7 @@ module DatabricksSQL
   # POST SQL query to DBX
   def post_sql_request(sql)
     response = http.request(sql_request(sql))
-    response
+    DatabricksSQLResponse.new(response)
   end
 
   # GET request object
@@ -40,71 +41,56 @@ module DatabricksSQL
     Net::HTTP::Get.new(req_uri, request_headers)
   end
 
-  # GET results of SQL query from DBX.
-  def get_sql_results(http_response)
-    statement_id = JSON.parse(http_response)["statement_id"]
-    response = http.request(sql_results_request(statement_id))
-    puts "#{statement_id}: #{JSON.parse(response.body)["status"]["state"]}"
-    response.body
-  end
-
   # GET SQL chunk from DBX by internal link
+  # @return [Hash<{"chunk_index" => Number, "row_offset" => Number, "row_count" => Number, "data_array" => Array<Array>}>] # rubocop:disable Layout/LineLength
   def get_sql_chunk(chunk_url)
+    puts "GET chunk: #{chunk_url}"
     request = Net::HTTP::Get.new(chunk_url, request_headers)
     response = http.request(request)
-    response
+    DatabricksSQLResponse.new(response)
   end
 
   # Load additional chunks of data from DBX.
   # DBX returns data with maximum chunk size of 16mb.
-  def load_additional_chunks(
-    next_chunk =
+  def load_additional_chunks(response)
+    next_chunk = response.next_chunk
     while next_chunk
-
-
-
-      data = results_hash["result"]["data_array"]
-      results_hash["result"]["data_array"] = [*data, *result]
-      next_chunk = parsed_response["next_chunk_internal_link"]
+      chunk_response = get_sql_chunk(next_chunk)
+      response.add_chunk_to_data(chunk_response)
+      next_chunk = chunk_response.next_chunk
     end
   end
 
+  # GET results of SQL query from DBX.
+  def get_sql_results(dbx_sql_response)
+    statement_id = dbx_sql_response.statement_id
+    http_response = http.request(sql_results_request(statement_id))
+    response = DatabricksSQLResponse.new(http_response)
+    puts "#{statement_id}: #{response.status}"
+    response
+  end
+
   # Wait for SQL query response from DBX.
   # Returns a hash of the results of the SQL query.
   def wait_for_sql_response(response)
     result = get_sql_results(response)
-
-    # PENDING means the warehouse is starting up
-    # RUNNING means the query is still executing
-    while %w[PENDING RUNNING].include?(status)
-      sleep(5)
-      result = get_sql_results(response)
-      status = JSON.parse(result)["status"]["state"]
-    end
-    JSON.parse(result)
-  end
+    still_running = result.pending?
 
-
-
-
-
-    data_array = JSON.parse(http_response)["result"]["data_array"]
-
-    data_array.map do |row|
-      hash = {}
-      keys.each do |key|
-        hash[key["name"]] = row[key["position"]]
-      end
-      hash
+    while still_running
+      sleep(@sleep_timer)
+      result = get_sql_results(response)
+      still_running = result.pending?
     end
+    result
   end
 
   # Submit SQL query to DBX and return results.
-  #
+  # @return [DatabricksSQLResponse]
   def run_sql(sql)
-
-
-
-
+    posted_sql = post_sql_request(sql)
+    sql_results = wait_for_sql_response(posted_sql)
+
+    load_additional_chunks(sql_results) if sql_results.more_chunks?
+    sql_results
   end
 end
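The reworked `wait_for_sql_response` above keeps polling while the statement is PENDING or RUNNING, sleeping `@sleep_timer` seconds between checks. A minimal self-contained sketch of that loop follows. It rests on a few assumptions: `poll_until_done` is a hypothetical helper, a lambda takes the place of `get_sql_results`, and the `max_attempts` cap is an addition that the gem's code does not have.

```ruby
# States in which the statement is still in flight (as listed in the diff).
IN_FLIGHT = %w[PENDING RUNNING].freeze

# Hypothetical helper, not part of the gem. `fetch_status` is any callable
# that returns the current state string; `max_attempts` is an added cap so
# the loop cannot spin forever.
def poll_until_done(fetch_status, sleep_timer: 5, max_attempts: 10)
  status = fetch_status.call
  attempts = 1
  while IN_FLIGHT.include?(status) && attempts < max_attempts
    sleep(sleep_timer)
    status = fetch_status.call
    attempts += 1
  end
  status
end

# Simulated API: two in-flight states, then success.
states = %w[PENDING RUNNING SUCCEEDED]
calls = 0
fake_fetch = lambda do
  state = states[[calls, states.length - 1].min]
  calls += 1
  state
end

# A zero-second timer keeps the example fast.
final_status = poll_until_done(fake_fetch, sleep_timer: 0)
```

Without a cap like `max_attempts`, a statement that never leaves the RUNNING state would block the caller indefinitely, which may or may not be acceptable depending on the application.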
data/lib/dbx/databricks/sql_response.rb
ADDED
@@ -0,0 +1,113 @@
+# frozen_string_literal: true
+
+require "pry"
+
+# This class represents a response from the Databricks SQL API.
+# It is used by DatabricksSQL to handle http failures and parse the response body.
+class DatabricksSQLResponse
+  def initialize(http_response)
+    self.raw_response = http_response
+    self.body = parse_body
+    self.data_array = extract_data_array
+  end
+
+  attr_accessor :raw_response, :body, :data_array
+
+  # -------------------- BODY --------------------
+
+  # Parse the response body as JSON.
+  def parse_body
+    return {} unless raw_response.is_a?(Net::HTTPSuccess)
+
+    @body = JSON.parse(raw_response.body)
+  end
+
+  # Dig out the statement_id from the response body.
+  # @return [String | nil]
+  def statement_id
+    body["statement_id"]
+  end
+
+  # -------------------- CHUNKS --------------------
+
+  # Determine if the response contains multiple chunks.
+  def more_chunks?
+    chunk_count = body&.dig("manifest", "total_chunk_count")&.to_i
+    chunk_count && chunk_count > 1
+  end
+
+  # Dig out the next_chunk_internal_link from the response body.
+  # @return [String | nil]
+  def next_chunk
+    body.dig("result", "next_chunk_internal_link")
+  end
+
+  # Combine the data from the chunk response into the data from the original response.
+  # @return [Array]
+  def add_chunk_to_data(chunk_response)
+    chunk_data_array = chunk_response.data_array
+    self.data_array = [*data_array, *chunk_data_array]
+  end
+
+  # -------------------- STATUS --------------------
+
+  # Determine if the response from the API has succeeded.
+  def success?
+    status == "SUCCEEDED"
+  end
+
+  # Determine if the response from the API is still executing.
+  # PENDING means the warehouse is starting up
+  # RUNNING means the query is still executing
+  def pending?
+    %w[PENDING RUNNING].include?(status)
+  end
+
+  # Determine if the response from the API has failed.
+  def failed?
+    status == "FAILED"
+  end
+
+  # Dig out the error message from the response body.
+  # @return [String | nil]
+  def error_message
+    body.dig("status", "error", "message")
+  end
+
+  # Dig out the status of the query from the response body.
+  # @return [String]
+  def status
+    return "FAILED" unless raw_response.is_a?(Net::HTTPSuccess)
+
+    body.dig("status", "state")
+  end
+
+  # ------------------- RESULTS --------------------
+
+  # Dig out the columns array from the response body.
+  # @return [Array<String>]
+  def columns
+    body.dig("manifest", "schema", "columns") || []
+  end
+
+  # Dig out values array for the queried data.
+  # Chunks have a simpler hash structure than initial SQL responses.
+  # @return [Array<Array>]
+  def extract_data_array
+    body.dig("result", "data_array") || body["data_array"] || []
+  end
+
+  # Return the results of the query as an array of hashes.
+  # @return [Array<Hash>]
+  def results
+    return [] if failed?
+
+    data_array.map do |row|
+      hash = {}
+      columns.each do |column|
+        hash[column["name"]] = row[column["position"]]
+      end
+      hash
+    end
+  end
+end
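`results` above pairs every row in `data_array` with the column metadata from the manifest, and `add_chunk_to_data` appends the rows of a chunk to the rows of the first response. A minimal sketch of both steps on hand-written data is given below. The `rows_to_hashes` helper name and the sample values are illustrative assumptions; only the `name` and `position` keys are taken from the code in this diff.

```ruby
# Hypothetical helper mirroring the mapping in DatabricksSQLResponse#results:
# each row (an Array) becomes a Hash keyed by column name.
def rows_to_hashes(columns, data_array)
  data_array.map do |row|
    columns.each_with_object({}) do |column, hash|
      hash[column["name"]] = row[column["position"]]
    end
  end
end

# Sample column metadata and rows (made up for illustration).
columns = [
  { "name" => "pickup_zip", "position" => 0 },
  { "name" => "dropoff_zip", "position" => 1 }
]
first_chunk = [%w[10282 10171]]
second_chunk = [%w[10110 10110]]

# Concatenate the chunks the same way add_chunk_to_data does, then map.
all_rows = [*first_chunk, *second_chunk]
records = rows_to_hashes(columns, all_rows)
```

Because the chunks are merged before the mapping runs, the column metadata only needs to come from the first response, which matches the comment in the class that chunk responses have a simpler structure.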
data/lib/dbx/gateway.rb
CHANGED
@@ -6,11 +6,13 @@ require_relative "databricks/databricks"
 # This class is a gateway to the Databricks API.
 # https://docs.databricks.com/api-explorer/workspace/introduction
 class DatabricksGateway
-  def initialize(host: ENV.fetch("DBX_HOST", nil), token: ENV.fetch("DBX_TOKEN", nil),
+  def initialize(host: ENV.fetch("DBX_HOST", nil), token: ENV.fetch("DBX_TOKEN", nil),
+                 warehouse: ENV.fetch("DBX_WAREHOUSE_ID", nil), sleep_timer: 5)
     @base_url = host
     @uri = URI(@base_url)
     @token = token
     @warehouse = warehouse
+    @sleep_timer = sleep_timer
   end
 
   # HTTP request headers
@@ -23,6 +25,7 @@ class DatabricksGateway
   end
 
   # HTTP connection object
+  # @return [Net::HTTP]
   def http
     http = Net::HTTP.new(@uri.host, @uri.port)
     http.use_ssl = true
data/lib/dbx/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: dbx-api
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - cmmille
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2023-
+date: 2023-10-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: dotenv
@@ -45,6 +45,7 @@ files:
 - lib/dbx/databricks/databricks.rb
 - lib/dbx/databricks/jobs.rb
 - lib/dbx/databricks/sql.rb
+- lib/dbx/databricks/sql_response.rb
 - lib/dbx/gateway.rb
 - lib/dbx/version.rb
 - sig/dbx.rbs