dbx-api 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 6f72e4b53309594553a8616e66bd27008585029c0b61d78b4e749c697c964064
-   data.tar.gz: 8bdd6b0e7025486de43a2951a33a0726e94374f3299f9d6e8526d70d424888c7
+   metadata.gz: '09cd7a1984478b2761fbe0dca4b69acd123d664c96ea7333997143fe4389aa3b'
+   data.tar.gz: 5e1093ab32b19c13eff195869d12dc159459f4418f3bc260df241e737cd5f78f
  SHA512:
-   metadata.gz: 2ae01403a40e9688aea026bff939788e83615a017c8333e8c7b50c5e83b88c6f0b7fe0e8244186f07cb47f81757fc801d62980f6eb160e8c71059283c3b7879f
-   data.tar.gz: 131e3d2d072e05fcab4d456f767b87bc8871decea1abab8e11c15c92cfafe6afa5916c4637744553d3bef35eac3fb7b9395b12843bb8d6bdeeffe5e2509e9039
+   metadata.gz: 407516bedbe4fa69d01ad765804aa2593966faebfa966eadac4ed08da44fb41fc7ea055b2be781194e762916784c4cb1d0bb0d4c44b9b3ad8ed23273415277f2
+   data.tar.gz: 569cdbc0465214559dd397e27d143cf3e56fbc96e1f1447c45bd18ba472ef444ec712cf16ae69837b6380de54135c3e42dd1af775f22221954b149c70f249a5c
data/.rubocop.yml CHANGED
@@ -12,3 +12,6 @@ Style/StringLiteralsInInterpolation:

  Layout/LineLength:
    Max: 120
+
+ Metrics/BlockLength:
+   Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,5 +1,14 @@
  ## [Unreleased]

- ## [0.1.0] - 2023-09-25
+ ## [0.1.1] - 2023-09-27
+ - Yanked because I didn't know what I was doing

+ ## [0.1.2]
  - Initial release
+
+ ## [0.2.0]
+ - Added `DatabricksSQLResponse` class
+ - `DatabricksGateway::run_sql` now returns an object of type `DatabricksSQLResponse`
+ - results can be accessed by `DatabricksSQLResponse::results`
+ - query success can be accessed by `DatabricksSQLResponse::success?`
+ - Added optional `sleep_timer` parameter to `DatabricksGateway`. This is the number of seconds to wait between checking the status of a query. Defaults to 5 seconds.
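
The `sleep_timer` entry describes a polling loop with a configurable pause between status checks. A minimal standalone sketch of that idea follows. `poll_until_done` and the simulated status list are hypothetical and not part of dbx-api; only the `PENDING` and `RUNNING` state names and the 5-second default come from this diff.

```ruby
# Hypothetical stand-in for the gem's polling loop; not part of dbx-api.
# States that mean "keep waiting" (names taken from the diff).
PENDING_STATES = %w[PENDING RUNNING].freeze

# `fetch_status` is any callable that returns the current query state.
def poll_until_done(fetch_status, sleep_timer: 5)
  status = fetch_status.call
  checks = 1
  while PENDING_STATES.include?(status)
    sleep(sleep_timer) # pause between status checks
    status = fetch_status.call
    checks += 1
  end
  { status: status, checks: checks }
end

# Simulate a warehouse that starts up, runs the query, then succeeds.
states = %w[PENDING RUNNING RUNNING SUCCEEDED]
outcome = poll_until_done(-> { states.shift }, sleep_timer: 0.01)
puts "#{outcome[:status]} after #{outcome[:checks]} checks"
```

A shorter interval gives quicker feedback on fast queries at the cost of more status requests; a longer one does the opposite.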
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     dbx-api (0.1.0)
+     dbx-api (0.2.0)
        dotenv (~> 2.0)

  GEM
data/README.md CHANGED
@@ -4,7 +4,10 @@
  This gem is designed to allow access to the DBX APIs (Jobs and SQL) from ruby applications.

  ## Installation
- TODO: write this section
+ Add the following to your Gemfile to install
+ ```ruby
+ gem 'dbx-api', '~>0.2.0'
+ ```

  ## Usage
  Set up your .env file (optional)
@@ -25,17 +28,69 @@ sql_runner = DatabricksGateway.new
  sql_runner = DatabricksGateway.new(host: 'DBX_CONNECTION_STRING', token: 'DBX_ACCESS_TOKEN', warehouse: 'DBX_SQL_WAREHOUSE_ID')

  # Basic sql
- result = sql_runner.run_sql("SELECT 1")
- sql_runner.parse_result(result)
+ response = sql_runner.run_sql("SELECT 1")
+ response.results
  # => [{"1"=>"1"}]

  # Dummy data in public DBX table
- result = sql_runner.run_sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1")
- sql_runner.parse_result(result)
+ response = sql_runner.run_sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1")
+ response.results
  # => [{"tpep_pickup_datetime"=>"2016-02-14T16:52:13.000Z",
  # "tpep_dropoff_datetime"=>"2016-02-14T17:16:04.000Z",
  # "trip_distance"=>"4.94",
  # "fare_amount"=>"19.0",
  # "pickup_zip"=>"10282",
  # "dropoff_zip"=>"10171"}]
- ```
+ ```
+
+ `run_sql` returns an object of type DatabricksSQLResponse.
+
+ The response object has a few useful methods. For a complete list, see the class definition: `lib/dbx/databricks/sql_response.rb`
+ ```ruby
+ response = sql_runner.run_sql("SELECT 1")
+
+ # checking the status of a response
+ response.status # => SUCCEEDED | FAILED | PENDING | RUNNING
+ response.failed? # => Boolean
+ response.success? # => Boolean
+
+ # getting the results of a response
+ response.results # => Array of Hashes
+
+ # looking at the raw response
+ response.raw_response # => HTTP object
+ # or just the parsed body of the HTTP response
+ response.body
+
+ # checking error messages for failed responses
+ response.error_message # => String
+ ```
+
+ This gem does not make an inference to how error handling should occur. `run_sql` always returns an array, even if the query fails (it will return `[]` if status.failed?). Users may wish to check the status of the response before attempting to access the results. For example:
+ ```ruby
+ require 'dbx'
+
+ sql_runner = DatabricksGateway.new
+ res = sql_runner.run_sql("SELECT 1")
+
+ # do something with the results if the query succeeded
+ return res.results if res.success?
+
+ # do something else if the query failed
+ puts "query failed: #{res.error_message}"
+ ```
+
+ Since `run_sql` returns an instance of `DatabricksSQLResponse`, you can also chain methods together:
+ ```ruby
+ sql_runner.run_sql("SELECT 1").results
+ ```
+
+ ## Development
+ - After checking out the repo, run `bin/setup` to install dependencies.
+ - Set up your `.env` file as described above.
+ - Run `rake spec` to run the rspec tests.
+
+ ## Build
+ - Run `gem build dbx.gemspec ` to build the gem.
+ - Run `gem push dbx-api-0.2.0.gem` to push the gem to rubygems.org
+ - Requires logging in to rubygems.org first via `gem login`
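
The README leaves error handling to the caller. One possible approach, sketched here under stated assumptions, is a small wrapper that raises when a query did not succeed. `QueryFailedError`, `rows_or_raise` and `FakeResponse` are hypothetical names and not part of dbx-api. The stand-in only mimics the `success?`, `results` and `error_message` methods documented in the README, so the sketch runs without a Databricks connection, and the error text is illustrative.

```ruby
# Hypothetical error type and helper; neither is defined by dbx-api.
class QueryFailedError < StandardError; end

# Returns the rows when the query succeeded, otherwise raises with the
# message reported by the response object.
def rows_or_raise(response)
  return response.results if response.success?

  raise QueryFailedError, response.error_message || "query failed"
end

# Stand-in exposing the same three methods the README documents.
FakeResponse = Struct.new(:status, :results, :error_message) do
  def success?
    status == "SUCCEEDED"
  end
end

ok = FakeResponse.new("SUCCEEDED", [{ "1" => "1" }], nil)
puts rows_or_raise(ok).length

begin
  rows_or_raise(FakeResponse.new("FAILED", [], "illustrative error text"))
rescue QueryFailedError => e
  puts "query failed: #{e.message}"
end
```

Raising keeps an empty result from a failed query distinct from a successful query that returned no rows.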
data/lib/dbx/databricks/sql.rb CHANGED
@@ -1,6 +1,7 @@
  # frozen_string_literal: true

  require "json"
+ require_relative "sql_response"

  # This module handles the execution of SQL statements via the DBX API.
  # For more information about the DBX SQL API, see: https://docs.databricks.com/sql/admin/sql-execution-tutorial.html
@@ -30,7 +31,7 @@ module DatabricksSQL
    # POST SQL query to DBX
    def post_sql_request(sql)
      response = http.request(sql_request(sql))
-     response.body
+     DatabricksSQLResponse.new(response)
    end

    # GET request object
@@ -40,71 +41,56 @@ module DatabricksSQL
      Net::HTTP::Get.new(req_uri, request_headers)
    end

-   # GET results of SQL query from DBX.
-   def get_sql_results(http_response)
-     statement_id = JSON.parse(http_response)["statement_id"]
-     response = http.request(sql_results_request(statement_id))
-     puts "#{statement_id}: #{JSON.parse(response.body)["status"]["state"]}"
-     response.body
-   end
-
    # GET SQL chunk from DBX by internal link
+   # @return [Hash<{"chunk_index" => Number, "row_offset" => Number, "row_count" => Number, "data_array" => Array<Array>}>] # rubocop:disable Layout/LineLength
    def get_sql_chunk(chunk_url)
+     puts "GET chunk: #{chunk_url}"
      request = Net::HTTP::Get.new(chunk_url, request_headers)
      response = http.request(request)
-     response.body
+     DatabricksSQLResponse.new(response)
    end

    # Load additional chunks of data from DBX.
    # DBX returns data with maximum chunk size of 16mb.
-   def load_additional_chunks(results_hash)
-     next_chunk = results_hash["result"]["next_chunk_internal_link"]
+   def load_additional_chunks(response)
+     next_chunk = response.next_chunk
      while next_chunk
-       response = get_sql_chunk(next_chunk)
-       parsed_response = JSON.parse(response)
-       result = parsed_response["data_array"]
-       data = results_hash["result"]["data_array"]
-       results_hash["result"]["data_array"] = [*data, *result]
-       next_chunk = parsed_response["next_chunk_internal_link"]
+       chunk_response = get_sql_chunk(next_chunk)
+       response.add_chunk_to_data(chunk_response)
+       next_chunk = chunk_response.next_chunk
      end
    end

+   # GET results of SQL query from DBX.
+   def get_sql_results(dbx_sql_response)
+     statement_id = dbx_sql_response.statement_id
+     http_response = http.request(sql_results_request(statement_id))
+     response = DatabricksSQLResponse.new(http_response)
+     puts "#{statement_id}: #{response.status}"
+     response
+   end
+
    # Wait for SQL query response from DBX.
    # Returns a hash of the results of the SQL query.
    def wait_for_sql_response(response)
      result = get_sql_results(response)
-     status = JSON.parse(result)["status"]["state"]
-     # PENDING means the warehouse is starting up
-     # RUNNING means the query is still executing
-     while %w[PENDING RUNNING].include?(status)
-       sleep(5)
-       result = get_sql_results(response)
-       status = JSON.parse(result)["status"]["state"]
-     end
-     JSON.parse(result)
-   end
+     still_running = result.pending?

-   # Parse JSON response from DBX into array of hashes.
-   # Provides output c/w Big Query.
-   def parse_result(http_response)
-     keys = JSON.parse(http_response)["manifest"]["schema"]["columns"]
-     data_array = JSON.parse(http_response)["result"]["data_array"]
-
-     data_array.map do |row|
-       hash = {}
-       keys.each do |key|
-         hash[key["name"]] = row[key["position"]]
-       end
-       hash
+     while still_running
+       sleep(@sleep_timer)
+       result = get_sql_results(response)
+       still_running = result.pending?
      end
+     result
    end

    # Submit SQL query to DBX and return results.
-   # returns a JSON string of the results of the SQL query
+   # @return [DatabricksSQLResponse]
    def run_sql(sql)
-     response = post_sql_request(sql)
-     results_hash = wait_for_sql_response(response)
-     load_additional_chunks(results_hash) if results_hash["manifest"]["total_chunk_count"] > 1
-     JSON.dump(results_hash)
+     posted_sql = post_sql_request(sql)
+     sql_results = wait_for_sql_response(posted_sql)
+
+     load_additional_chunks(sql_results) if sql_results.more_chunks?
+     sql_results
    end
  end
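
The chunk loading changed above is a loop that follows "next chunk" links until none remains. The sketch below illustrates it with an in-memory stand-in for the API, under stated assumptions. `CHUNKS`, `collect_rows`, the `/chunks/...` links and the sample rows are all hypothetical. The `data_array` and `next_chunk_internal_link` keys are the ones this diff reads, and the sketch takes the link from the top level of each chunk, as the removed lines did.

```ruby
# Hypothetical in-memory "API": each internal link maps to one chunk body.
# Links and rows are made up for illustration.
CHUNKS = {
  "/chunks/1" => { "data_array" => [["c"], ["d"]], "next_chunk_internal_link" => "/chunks/2" },
  "/chunks/2" => { "data_array" => [["e"]], "next_chunk_internal_link" => nil }
}.freeze

# Follow next-chunk links until none remains, appending each chunk's rows
# to a copy of the rows from the first response.
def collect_rows(first_rows, first_link, fetch_chunk)
  rows = first_rows.dup
  link = first_link
  while link
    chunk = fetch_chunk.call(link)
    rows.concat(chunk["data_array"])
    link = chunk["next_chunk_internal_link"]
  end
  rows
end

rows = collect_rows([["a"], ["b"]], "/chunks/1", ->(link) { CHUNKS.fetch(link) })
puts rows.length
```

Passing the fetcher in as a callable keeps the loop testable without HTTP; in the gem this role is played by `get_sql_chunk`.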
data/lib/dbx/databricks/sql_response.rb ADDED
@@ -0,0 +1,113 @@
+ # frozen_string_literal: true
+
+ require "pry"
+
+ # This class represents a response from the Databricks SQL API.
+ # It is used by DatabricksSQL to handle http failures and parse the response body.
+ class DatabricksSQLResponse
+   def initialize(http_response)
+     self.raw_response = http_response
+     self.body = parse_body
+     self.data_array = extract_data_array
+   end
+
+   attr_accessor :raw_response, :body, :data_array
+
+   # -------------------- BODY --------------------
+
+   # Parse the response body as JSON.
+   def parse_body
+     return {} unless raw_response.is_a?(Net::HTTPSuccess)
+
+     @body = JSON.parse(raw_response.body)
+   end
+
+   # Dig out the statement_id from the response body.
+   # @return [String | nil]
+   def statement_id
+     body["statement_id"]
+   end
+
+   # -------------------- CHUNKS --------------------
+
+   # Determine if the response contains multiple chunks.
+   def more_chunks?
+     chunk_count = body&.dig("manifest", "total_chunk_count")&.to_i
+     chunk_count && chunk_count > 1
+   end
+
+   # Dig out the next_chunk_internal_link from the response body.
+   # @return [String | nil]
+   def next_chunk
+     body.dig("result", "next_chunk_internal_link")
+   end
+
+   # Combine the data from the chunk response into the data from the original response.
+   # @return [Array]
+   def add_chunk_to_data(chunk_response)
+     chunk_data_array = chunk_response.data_array
+     self.data_array = [*data_array, *chunk_data_array]
+   end
+
+   # -------------------- STATUS --------------------
+
+   # Determine if the response from the API has succeeded.
+   def success?
+     status == "SUCCEEDED"
+   end
+
+   # Determine if the response from the API is still executing.
+   # PENDING means the warehouse is starting up
+   # RUNNING means the query is still executing
+   def pending?
+     %w[PENDING RUNNING].include?(status)
+   end
+
+   # Determine if the response from the API has failed.
+   def failed?
+     status == "FAILED"
+   end
+
+   # Dig out the error message from the response body.
+   # @return [String | nil]
+   def error_message
+     body.dig("status", "error", "message")
+   end
+
+   # Dig out the status of the query from the response body.
+   # @return [String]
+   def status
+     return "FAILED" unless raw_response.is_a?(Net::HTTPSuccess)
+
+     body.dig("status", "state")
+   end
+
+   # ------------------- RESULTS --------------------
+
+   # Dig out the columns array from the response body.
+   # @return [Array<String>]
+   def columns
+     body.dig("manifest", "schema", "columns") || []
+   end
+
+   # Dig out values array for the queried data.
+   # Chunks have a simpler hash structure than initial SQL responses.
+   # @return [Array<Array>]
+   def extract_data_array
+     body.dig("result", "data_array") || body["data_array"] || []
+   end
+
+   # Return the results of the query as an array of hashes.
+   # @return [Array<Hash>]
+   def results
+     return [] if failed?
+
+     data_array.map do |row|
+       hash = {}
+       columns.each do |column|
+         hash[column["name"]] = row[column["position"]]
+       end
+       hash
+     end
+   end
+ end
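
The `results` method above pairs each positional row with the column manifest to build one hash per row. A standalone sketch of that mapping follows. `rows_to_hashes` is a hypothetical helper and the sample columns and rows are made up; only the `name` and `position` keys come from this diff.

```ruby
# Turn a column manifest plus positional rows into an array of hashes.
# Each column hash carries the column "name" and its "position" in a row.
def rows_to_hashes(columns, data_array)
  data_array.map do |row|
    columns.to_h { |column| [column["name"], row[column["position"]]] }
  end
end

# Made-up sample shaped like the manifest columns and data_array in the diff.
columns = [
  { "name" => "pickup_zip", "position" => 0 },
  { "name" => "dropoff_zip", "position" => 1 }
]
data_array = [%w[10282 10171], %w[10110 10110]]

rows_to_hashes(columns, data_array).each { |row| puts row["pickup_zip"] }
```

Looking values up by `position` rather than by array order means the mapping still holds if the manifest lists columns in a different order from the row data.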
data/lib/dbx/gateway.rb CHANGED
@@ -6,11 +6,13 @@ require_relative "databricks/databricks"
  # This class is a gateway to the Databricks API.
  # https://docs.databricks.com/api-explorer/workspace/introduction
  class DatabricksGateway
-   def initialize(host: ENV.fetch("DBX_HOST", nil), token: ENV.fetch("DBX_TOKEN", nil), warehouse: ENV.fetch("DBX_WAREHOUSE_ID", nil))
+   def initialize(host: ENV.fetch("DBX_HOST", nil), token: ENV.fetch("DBX_TOKEN", nil),
+                  warehouse: ENV.fetch("DBX_WAREHOUSE_ID", nil), sleep_timer: 5)
      @base_url = host
      @uri = URI(@base_url)
      @token = token
      @warehouse = warehouse
+     @sleep_timer = sleep_timer
    end

    # HTTP request headers
@@ -23,6 +25,7 @@ class DatabricksGateway
    end

    # HTTP connection object
+   # @return [Net::HTTP]
    def http
      http = Net::HTTP.new(@uri.host, @uri.port)
      http.use_ssl = true
data/lib/dbx/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module Dbx
-   VERSION = "0.1.1"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: dbx-api
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.2.0
  platform: ruby
  authors:
  - cmmille
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-09-27 00:00:00.000000000 Z
+ date: 2023-10-06 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: dotenv
@@ -45,6 +45,7 @@ files:
  - lib/dbx/databricks/databricks.rb
  - lib/dbx/databricks/jobs.rb
  - lib/dbx/databricks/sql.rb
+ - lib/dbx/databricks/sql_response.rb
  - lib/dbx/gateway.rb
  - lib/dbx/version.rb
  - sig/dbx.rbs