dbx-api 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 6f72e4b53309594553a8616e66bd27008585029c0b61d78b4e749c697c964064
4
- data.tar.gz: 8bdd6b0e7025486de43a2951a33a0726e94374f3299f9d6e8526d70d424888c7
3
+ metadata.gz: '09cd7a1984478b2761fbe0dca4b69acd123d664c96ea7333997143fe4389aa3b'
4
+ data.tar.gz: 5e1093ab32b19c13eff195869d12dc159459f4418f3bc260df241e737cd5f78f
5
5
  SHA512:
6
- metadata.gz: 2ae01403a40e9688aea026bff939788e83615a017c8333e8c7b50c5e83b88c6f0b7fe0e8244186f07cb47f81757fc801d62980f6eb160e8c71059283c3b7879f
7
- data.tar.gz: 131e3d2d072e05fcab4d456f767b87bc8871decea1abab8e11c15c92cfafe6afa5916c4637744553d3bef35eac3fb7b9395b12843bb8d6bdeeffe5e2509e9039
6
+ metadata.gz: 407516bedbe4fa69d01ad765804aa2593966faebfa966eadac4ed08da44fb41fc7ea055b2be781194e762916784c4cb1d0bb0d4c44b9b3ad8ed23273415277f2
7
+ data.tar.gz: 569cdbc0465214559dd397e27d143cf3e56fbc96e1f1447c45bd18ba472ef444ec712cf16ae69837b6380de54135c3e42dd1af775f22221954b149c70f249a5c
data/.rubocop.yml CHANGED
@@ -12,3 +12,6 @@ Style/StringLiteralsInInterpolation:
12
12
 
13
13
  Layout/LineLength:
14
14
  Max: 120
15
+
16
+ Metrics/BlockLength:
17
+ Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,5 +1,14 @@
1
1
  ## [Unreleased]
2
2
 
3
- ## [0.1.0] - 2023-09-25
3
+ ## [0.1.1] - 2023-09-27
4
+ - Yanked because I didn't know what I was doing
4
5
 
6
+ ## [0.1.2]
5
7
  - Initial release
8
+
9
+ ## [0.2.0]
10
+ - Added `DatabricksSQLResponse` class
11
+ - `DatabricksGateway::run_sql` now returns an object of type `DatabricksSQLResponse`
12
+ - results can be accessed by `DatabricksSQLResponse::results`
13
+ - query success can be accessed by `DatabricksSQLResponse::success?`
14
+ - Added optional `sleep_timer` parameter to `DatabricksGateway`. This is the number of seconds to wait between checking the status of a query. Defaults to 5 seconds.
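The `sleep_timer` entry above describes a polling interval. A minimal sketch of that idea follows, assuming a hypothetical `poll_until_done` helper that takes the status check as a block. The helper and its `max_checks` cap are illustrative and are not part of the gem, whose own loop has no cap.

```ruby
# Hypothetical helper, not part of dbx-api: poll a status source until it
# leaves the PENDING/RUNNING states, sleeping `sleep_timer` seconds between checks.
def poll_until_done(sleep_timer: 5, max_checks: 10)
  checks = 0
  status = yield
  while %w[PENDING RUNNING].include?(status) && checks < max_checks
    sleep(sleep_timer)
    checks += 1
    status = yield
  end
  status
end

# Simulated sequence of statuses standing in for repeated API calls.
statuses = %w[PENDING RUNNING SUCCEEDED]
final = poll_until_done(sleep_timer: 0) { statuses.shift }
puts final
```

With the gem itself, the interval is passed to the constructor, for example `DatabricksGateway.new(sleep_timer: 1)`.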
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- dbx-api (0.1.0)
4
+ dbx-api (0.2.0)
5
5
  dotenv (~> 2.0)
6
6
 
7
7
  GEM
data/README.md CHANGED
@@ -4,7 +4,10 @@
4
4
  This gem is designed to allow access to the DBX APIs (Jobs and SQL) from ruby applications.
5
5
 
6
6
  ## Installation
7
- TODO: write this section
7
+ Add the following to your Gemfile to install
8
+ ```ruby
9
+ gem 'dbx-api', '~>0.2.0'
10
+ ```
8
11
 
9
12
  ## Usage
10
13
  Set up your .env file (optional)
@@ -25,17 +28,69 @@ sql_runner = DatabricksGateway.new
25
28
  sql_runner = DatabricksGateway.new(host: 'DBX_CONNECTION_STRING', token: 'DBX_ACCESS_TOKEN', warehouse: 'DBX_SQL_WAREHOUSE_ID')
26
29
 
27
30
  # Basic sql
28
- result = sql_runner.run_sql("SELECT 1")
29
- sql_runner.parse_result(result)
31
+ response = sql_runner.run_sql("SELECT 1")
32
+ response.results
30
33
  # => [{"1"=>"1"}]
31
34
 
32
35
  # Dummy data in public DBX table
33
- result = sql_runner.run_sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1")
34
- sql_runner.parse_result(result)
36
+ response = sql_runner.run_sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1")
37
+ response.results
35
38
  # => [{"tpep_pickup_datetime"=>"2016-02-14T16:52:13.000Z",
36
39
  # "tpep_dropoff_datetime"=>"2016-02-14T17:16:04.000Z",
37
40
  # "trip_distance"=>"4.94",
38
41
  # "fare_amount"=>"19.0",
39
42
  # "pickup_zip"=>"10282",
40
43
  # "dropoff_zip"=>"10171"}]
41
- ```
44
+ ```
45
+
46
+ `run_sql` returns an object of type DatabricksSQLResponse.
47
+
48
+ The response object has a few useful methods. For a complete list, see the class definition: `lib/dbx/databricks/sql_response.rb`
49
+ ```ruby
50
+ response = sql_runner.run_sql("SELECT 1")
51
+
52
+ # checking the status of a response
53
+ response.status # => SUCCEEDED | FAILED | PENDING | RUNNING
54
+ response.failed? # => Boolean
55
+ response.success? # => Boolean
56
+
57
+ # getting the results of a response
58
+ response.results # => Array of Hashes
59
+
60
+ # looking at the raw response
61
+ response.raw_response # => HTTP object
62
+ # or just the parsed body of the HTTP response
63
+ response.body
64
+
65
+ # checking error messages for failed responses
66
+ response.error_message # => String
67
+ ```
68
+
69
+ This gem does not prescribe how errors should be handled. `results` always returns an array, even if the query fails (it returns `[]` when `failed?` is true), so users may wish to check the status of the response before accessing the results. For example:
70
+ ```ruby
71
+ require 'dbx'
72
+
73
+ sql_runner = DatabricksGateway.new
74
+ res = sql_runner.run_sql("SELECT 1")
75
+
76
+ # do something with the results if the query succeeded
77
+ return res.results if res.success?
78
+
79
+ # do something else if the query failed
80
+ puts "query failed: #{res.error_message}"
81
+ ```
82
+
83
+ Since `run_sql` returns an instance of `DatabricksSQLResponse`, you can also chain methods together:
84
+ ```ruby
85
+ sql_runner.run_sql("SELECT 1").results
86
+ ```
87
+
88
+ ## Development
89
+ - After checking out the repo, run `bin/setup` to install dependencies.
90
+ - Set up your `.env` file as described above.
91
+ - Run `rake spec` to run the rspec tests.
92
+
93
+ ## Build
94
+ - Run `gem build dbx.gemspec` to build the gem.
95
+ - Run `gem push dbx-api-0.2.0.gem` to push the gem to rubygems.org
96
+ - Requires logging in to rubygems.org first via `gem login`
data/lib/dbx/databricks/sql.rb CHANGED
@@ -1,6 +1,7 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require "json"
4
+ require_relative "sql_response"
4
5
 
5
6
  # This module handles the execution of SQL statements via the DBX API.
6
7
  # For more information about the DBX SQL API, see: https://docs.databricks.com/sql/admin/sql-execution-tutorial.html
@@ -30,7 +31,7 @@ module DatabricksSQL
30
31
  # POST SQL query to DBX
31
32
  def post_sql_request(sql)
32
33
  response = http.request(sql_request(sql))
33
- response.body
34
+ DatabricksSQLResponse.new(response)
34
35
  end
35
36
 
36
37
  # GET request object
@@ -40,71 +41,56 @@ module DatabricksSQL
40
41
  Net::HTTP::Get.new(req_uri, request_headers)
41
42
  end
42
43
 
43
- # GET results of SQL query from DBX.
44
- def get_sql_results(http_response)
45
- statement_id = JSON.parse(http_response)["statement_id"]
46
- response = http.request(sql_results_request(statement_id))
47
- puts "#{statement_id}: #{JSON.parse(response.body)["status"]["state"]}"
48
- response.body
49
- end
50
-
51
44
  # GET SQL chunk from DBX by internal link
45
+ # @return [Hash<{"chunk_index" => Number, "row_offset" => Number, "row_count" => Number, "data_array" => Array<Array>}>] # rubocop:disable Layout/LineLength
52
46
  def get_sql_chunk(chunk_url)
47
+ puts "GET chunk: #{chunk_url}"
53
48
  request = Net::HTTP::Get.new(chunk_url, request_headers)
54
49
  response = http.request(request)
55
- response.body
50
+ DatabricksSQLResponse.new(response)
56
51
  end
57
52
 
58
53
  # Load additional chunks of data from DBX.
59
54
  # DBX returns data with maximum chunk size of 16mb.
60
- def load_additional_chunks(results_hash)
61
- next_chunk = results_hash["result"]["next_chunk_internal_link"]
55
+ def load_additional_chunks(response)
56
+ next_chunk = response.next_chunk
62
57
  while next_chunk
63
- response = get_sql_chunk(next_chunk)
64
- parsed_response = JSON.parse(response)
65
- result = parsed_response["data_array"]
66
- data = results_hash["result"]["data_array"]
67
- results_hash["result"]["data_array"] = [*data, *result]
68
- next_chunk = parsed_response["next_chunk_internal_link"]
58
+ chunk_response = get_sql_chunk(next_chunk)
59
+ response.add_chunk_to_data(chunk_response)
60
+ next_chunk = chunk_response.next_chunk
69
61
  end
70
62
  end
71
63
 
64
+ # GET results of SQL query from DBX.
65
+ def get_sql_results(dbx_sql_response)
66
+ statement_id = dbx_sql_response.statement_id
67
+ http_response = http.request(sql_results_request(statement_id))
68
+ response = DatabricksSQLResponse.new(http_response)
69
+ puts "#{statement_id}: #{response.status}"
70
+ response
71
+ end
72
+
72
73
  # Wait for SQL query response from DBX.
73
74
  # Returns a hash of the results of the SQL query.
74
75
  def wait_for_sql_response(response)
75
76
  result = get_sql_results(response)
76
- status = JSON.parse(result)["status"]["state"]
77
- # PENDING means the warehouse is starting up
78
- # RUNNING means the query is still executing
79
- while %w[PENDING RUNNING].include?(status)
80
- sleep(5)
81
- result = get_sql_results(response)
82
- status = JSON.parse(result)["status"]["state"]
83
- end
84
- JSON.parse(result)
85
- end
77
+ still_running = result.pending?
86
78
 
87
- # Parse JSON response from DBX into array of hashes.
88
- # Provides output c/w Big Query.
89
- def parse_result(http_response)
90
- keys = JSON.parse(http_response)["manifest"]["schema"]["columns"]
91
- data_array = JSON.parse(http_response)["result"]["data_array"]
92
-
93
- data_array.map do |row|
94
- hash = {}
95
- keys.each do |key|
96
- hash[key["name"]] = row[key["position"]]
97
- end
98
- hash
79
+ while still_running
80
+ sleep(@sleep_timer)
81
+ result = get_sql_results(response)
82
+ still_running = result.pending?
99
83
  end
84
+ result
100
85
  end
101
86
 
102
87
  # Submit SQL query to DBX and return results.
103
- # returns a JSON string of the results of the SQL query
88
+ # @return [DatabricksSQLResponse]
104
89
  def run_sql(sql)
105
- response = post_sql_request(sql)
106
- results_hash = wait_for_sql_response(response)
107
- load_additional_chunks(results_hash) if results_hash["manifest"]["total_chunk_count"] > 1
108
- JSON.dump(results_hash)
90
+ posted_sql = post_sql_request(sql)
91
+ sql_results = wait_for_sql_response(posted_sql)
92
+
93
+ load_additional_chunks(sql_results) if sql_results.more_chunks?
94
+ sql_results
109
95
  end
110
96
  end
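The chunk-loading loop in the hunk above follows `next_chunk` links until none remain. The sketch below reproduces that pattern against an in-memory hash instead of HTTP; `CHUNKS` and `collect_rows` are hypothetical names, and the payload shapes are assumptions. It reads the next link from the top level of each chunk body, as the removed 0.1.1 code did, whereas `DatabricksSQLResponse#next_chunk` digs under `"result"`. Whether that difference matters depends on the real chunk payload, which this diff does not show.

```ruby
# Hypothetical in-memory stand-in for the chunk endpoints: link => chunk body.
CHUNKS = {
  "/chunks/1" => { "data_array" => [["c"], ["d"]], "next_chunk_internal_link" => "/chunks/2" },
  "/chunks/2" => { "data_array" => [["e"]] }
}.freeze

# Collect rows from the first response body, then follow chunk links until
# a chunk has no "next_chunk_internal_link".
def collect_rows(first_body, chunks)
  rows = first_body.dig("result", "data_array") || []
  link = first_body.dig("result", "next_chunk_internal_link")
  while link
    chunk = chunks.fetch(link)
    rows += chunk["data_array"] || []
    link = chunk["next_chunk_internal_link"]
  end
  rows
end

first_body = {
  "result" => { "data_array" => [["a"], ["b"]], "next_chunk_internal_link" => "/chunks/1" }
}
all_rows = collect_rows(first_body, CHUNKS)
puts all_rows.inspect
```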
data/lib/dbx/databricks/sql_response.rb ADDED
@@ -0,0 +1,113 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "pry"
4
+
5
+ # This class represents a response from the Databricks SQL API.
6
+ # It is used by DatabricksSQL to handle http failures and parse the response body.
7
+ class DatabricksSQLResponse
8
+ def initialize(http_response)
9
+ self.raw_response = http_response
10
+ self.body = parse_body
11
+ self.data_array = extract_data_array
12
+ end
13
+
14
+ attr_accessor :raw_response, :body, :data_array
15
+
16
+ # -------------------- BODY --------------------
17
+
18
+ # Parse the response body as JSON.
19
+ def parse_body
20
+ return {} unless raw_response.is_a?(Net::HTTPSuccess)
21
+
22
+ @body = JSON.parse(raw_response.body)
23
+ end
24
+
25
+ # Dig out the statement_id from the response body.
26
+ # @return [String | nil]
27
+ def statement_id
28
+ body["statement_id"]
29
+ end
30
+
31
+ # -------------------- CHUNKS --------------------
32
+
33
+ # Determine if the response contains multiple chunks.
34
+ def more_chunks?
35
+ chunk_count = body&.dig("manifest", "total_chunk_count")&.to_i
36
+ chunk_count && chunk_count > 1
37
+ end
38
+
39
+ # Dig out the next_chunk_internal_link from the response body.
40
+ # @return [String | nil]
41
+ def next_chunk
42
+ body.dig("result", "next_chunk_internal_link")
43
+ end
44
+
45
+ # Combine the data from the chunk response into the data from the original response.
46
+ # @return [Array]
47
+ def add_chunk_to_data(chunk_response)
48
+ chunk_data_array = chunk_response.data_array
49
+ self.data_array = [*data_array, *chunk_data_array]
50
+ end
51
+
52
+ # -------------------- STATUS --------------------
53
+
54
+ # Determine if the response from the API has succeeded.
55
+ def success?
56
+ status == "SUCCEEDED"
57
+ end
58
+
59
+ # Determine if the response from the API is still executing.
60
+ # PENDING means the warehouse is starting up
61
+ # RUNNING means the query is still executing
62
+ def pending?
63
+ %w[PENDING RUNNING].include?(status)
64
+ end
65
+
66
+ # Determine if the response from the API has failed.
67
+ def failed?
68
+ status == "FAILED"
69
+ end
70
+
71
+ # Dig out the error message from the response body.
72
+ # @return [String | nil]
73
+ def error_message
74
+ body.dig("status", "error", "message")
75
+ end
76
+
77
+ # Dig out the status of the query from the response body.
78
+ # @return [String]
79
+ def status
80
+ return "FAILED" unless raw_response.is_a?(Net::HTTPSuccess)
81
+
82
+ body.dig("status", "state")
83
+ end
84
+
85
+ # ------------------- RESULTS --------------------
86
+
87
+ # Dig out the columns array from the response body.
88
+ # @return [Array<String>]
89
+ def columns
90
+ body.dig("manifest", "schema", "columns") || []
91
+ end
92
+
93
+ # Dig out values array for the queried data.
94
+ # Chunks have a simpler hash structure than initial SQL responses.
95
+ # @return [Array<Array>]
96
+ def extract_data_array
97
+ body.dig("result", "data_array") || body["data_array"] || []
98
+ end
99
+
100
+ # Return the results of the query as an array of hashes.
101
+ # @return [Array<Hash>]
102
+ def results
103
+ return [] if failed?
104
+
105
+ data_array.map do |row|
106
+ hash = {}
107
+ columns.each do |column|
108
+ hash[column["name"]] = row[column["position"]]
109
+ end
110
+ hash
111
+ end
112
+ end
113
+ end
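To make the `results` mapping above concrete, here is a small standalone sketch that applies the same column-position lookup to a hand-written body. The `rows_to_hashes` helper and the sample JSON are illustrative assumptions, not output captured from the Databricks API.

```ruby
require "json"

# Hand-written sample shaped like the fields the class reads
# ("manifest.schema.columns" and "result.data_array"); values are made up.
sample_body = JSON.parse(<<~BODY)
  {
    "manifest": { "schema": { "columns": [
      { "name": "pickup_zip", "position": 0 },
      { "name": "dropoff_zip", "position": 1 }
    ] } },
    "result": { "data_array": [["10282", "10171"]] }
  }
BODY

# Hypothetical helper: pair each row value with its column name by position.
def rows_to_hashes(body)
  columns = body.dig("manifest", "schema", "columns") || []
  rows = body.dig("result", "data_array") || []
  rows.map do |row|
    columns.to_h { |column| [column["name"], row[column["position"]]] }
  end
end

mapped = rows_to_hashes(sample_body)
puts mapped.inspect
```

Because both lookups fall back to an empty array, a body with no manifest or no result yields `[]` rather than raising.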
data/lib/dbx/gateway.rb CHANGED
@@ -6,11 +6,13 @@ require_relative "databricks/databricks"
6
6
  # This class is a gateway to the Databricks API.
7
7
  # https://docs.databricks.com/api-explorer/workspace/introduction
8
8
  class DatabricksGateway
9
- def initialize(host: ENV.fetch("DBX_HOST", nil), token: ENV.fetch("DBX_TOKEN", nil), warehouse: ENV.fetch("DBX_WAREHOUSE_ID", nil))
9
+ def initialize(host: ENV.fetch("DBX_HOST", nil), token: ENV.fetch("DBX_TOKEN", nil),
10
+ warehouse: ENV.fetch("DBX_WAREHOUSE_ID", nil), sleep_timer: 5)
10
11
  @base_url = host
11
12
  @uri = URI(@base_url)
12
13
  @token = token
13
14
  @warehouse = warehouse
15
+ @sleep_timer = sleep_timer
14
16
  end
15
17
 
16
18
  # HTTP request headers
@@ -23,6 +25,7 @@ class DatabricksGateway
23
25
  end
24
26
 
25
27
  # HTTP connection object
28
+ # @return [Net::HTTP]
26
29
  def http
27
30
  http = Net::HTTP.new(@uri.host, @uri.port)
28
31
  http.use_ssl = true
data/lib/dbx/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Dbx
4
- VERSION = "0.1.1"
4
+ VERSION = "0.2.0"
5
5
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: dbx-api
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - cmmille
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2023-09-27 00:00:00.000000000 Z
11
+ date: 2023-10-06 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: dotenv
@@ -45,6 +45,7 @@ files:
45
45
  - lib/dbx/databricks/databricks.rb
46
46
  - lib/dbx/databricks/jobs.rb
47
47
  - lib/dbx/databricks/sql.rb
48
+ - lib/dbx/databricks/sql_response.rb
48
49
  - lib/dbx/gateway.rb
49
50
  - lib/dbx/version.rb
50
51
  - sig/dbx.rbs