dataproc-spark-connect 0.6.0__py2.py3-none-any.whl → 0.7.1__py2.py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,98 @@
+ Metadata-Version: 2.1
+ Name: dataproc-spark-connect
+ Version: 0.7.1
+ Summary: Dataproc client library for Spark Connect
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
+ Author: Google LLC
+ License: Apache 2.0
+ License-File: LICENSE
+ Requires-Dist: google-api-core>=2.19
+ Requires-Dist: google-cloud-dataproc>=5.18
+ Requires-Dist: packaging>=20.0
+ Requires-Dist: pyspark[connect]>=3.5
+ Requires-Dist: tqdm>=4.67
+ Requires-Dist: websockets>=14.0
+
+ # Dataproc Spark Connect Client
+
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
+ client with additional functionalities that allow applications to communicate
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
+ requiring additional steps.
+
+ ## Install
+
+ ```sh
+ pip install dataproc_spark_connect
+ ```
+
+ ## Uninstall
+
+ ```sh
+ pip uninstall dataproc_spark_connect
+ ```
+
+ ## Setup
+
+ This client requires permissions to
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
+ If you are running the client outside of Google Cloud, you must set following
+ environment variables:
+
+ * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
+ workloads
+ * `GOOGLE_CLOUD_REGION` - The Compute
+ Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
+ where you run the Spark workload.
+ * `GOOGLE_APPLICATION_CREDENTIALS` -
+ Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
+
+ ## Usage
+
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark
+ Connect modules:
+
+ ```sh
+ pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
+ ```
+
+ 2. Add the required imports into your PySpark application or notebook and start
+ a Spark session with the following code instead of using
+ environment variables:
+
+ ```python
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
+ from google.cloud.dataproc_v1 import Session
+ session_config = Session()
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
+ session_config.runtime_config.version = '2.2'
+ spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
+ ```
+
+ ## Developing
+
+ For development instructions see [guide](DEVELOPING.md).
+
+ ## Contributing
+
+ We'd love to accept your patches and contributions to this project. There are
+ just a few small guidelines you need to follow.
+
+ ### Contributor License Agreement
+
+ Contributions to this project must be accompanied by a Contributor License
+ Agreement. You (or your employer) retain the copyright to your contribution;
+ this simply gives us permission to use and redistribute your contributions as
+ part of the project. Head over to <https://cla.developers.google.com> to see
+ your current agreements on file or to sign a new one.
+
+ You generally only need to submit a CLA once, so if you've already submitted one
+ (even if it was for a different project), you probably don't need to do it
+ again.
+
+ ### Code reviews
+
+ All submissions, including submissions by project members, require review. We
+ use GitHub pull requests for this purpose. Consult
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
+ information on using pull requests.
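
For context on the README added above: the same session can also be created with the project and region taken from the environment variables listed under Setup. A minimal sketch, assuming the 0.7.1 API shown in this diff; the project, region, and subnet values are placeholders:

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

# Placeholders; substitute your own project, region, and subnet. Outside
# Google Cloud you also need GOOGLE_APPLICATION_CREDENTIALS set.
os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"

# Session config as documented in the README above; runtime version 2.2 is
# the default the 0.7.1 client falls back to when no version is set.
session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = "my-subnet"

spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
spark.sql("SELECT 1").show()
spark.stop()
```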
@@ -1,12 +1,12 @@
  google/cloud/dataproc_spark_connect/__init__.py,sha256=dIqHNWVWWrSuRf26x11kX5e9yMKSHCtmI_GBj1-FDdE,1101
  google/cloud/dataproc_spark_connect/exceptions.py,sha256=ilGyHD5M_yBQ3IC58-Y5miRGIQVJsLaNKvEGcHuk_BE,969
  google/cloud/dataproc_spark_connect/pypi_artifacts.py,sha256=gd-VMwiVP-EJuPp9Vf9Shx8pqps3oSKp0hBcSSZQS-A,1575
- google/cloud/dataproc_spark_connect/session.py,sha256=gKPtWDzlz5WA5lPGLMOhNdtKskMDjbLG8KcTmv0PrWA,26189
+ google/cloud/dataproc_spark_connect/session.py,sha256=OmaxXDqyBltmG0QzK3t0onygsjLjFcX_vSftliUAFbg,24875
  google/cloud/dataproc_spark_connect/client/__init__.py,sha256=6hCNSsgYlie6GuVpc5gjFsPnyeMTScTpXSPYqp1fplY,615
  google/cloud/dataproc_spark_connect/client/core.py,sha256=m3oXTKBm3sBy6jhDu9GRecrxLb5CdEM53SgMlnJb6ag,4616
- google/cloud/dataproc_spark_connect/client/proxy.py,sha256=GNy561Fo8A2ehqLrDMkVWOUYV62YCO2tuN77it3H098,8954
- dataproc_spark_connect-0.6.0.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
- dataproc_spark_connect-0.6.0.dist-info/METADATA,sha256=m8PZHKk353AcATjML-Fgw_6yrtHmgVQLxEDI_90h2_0,4020
- dataproc_spark_connect-0.6.0.dist-info/WHEEL,sha256=OpXWERl2xLPRHTvd2ZXo_iluPEQd8uSbYkJ53NAER_Y,109
- dataproc_spark_connect-0.6.0.dist-info/top_level.txt,sha256=_1QvSJIhFAGfxb79D6DhB7SUw2X6T4rwnz_LLrbcD3c,7
- dataproc_spark_connect-0.6.0.dist-info/RECORD,,
+ google/cloud/dataproc_spark_connect/client/proxy.py,sha256=qUZXvVY1yn934vE6nlO495XUZ53AUx9O74a9ozkGI9U,8976
+ dataproc_spark_connect-0.7.1.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
+ dataproc_spark_connect-0.7.1.dist-info/METADATA,sha256=wIyHOeQkgY4Oj0Hv3zOSSMBDqYaqVmzqjIxTfvSyJOQ,3328
+ dataproc_spark_connect-0.7.1.dist-info/WHEEL,sha256=OpXWERl2xLPRHTvd2ZXo_iluPEQd8uSbYkJ53NAER_Y,109
+ dataproc_spark_connect-0.7.1.dist-info/top_level.txt,sha256=_1QvSJIhFAGfxb79D6DhB7SUw2X6T4rwnz_LLrbcD3c,7
+ dataproc_spark_connect-0.7.1.dist-info/RECORD,,
@@ -18,7 +18,6 @@ import contextlib
  import logging
  import socket
  import threading
- import time

  import websockets.sync.client as websocketclient

@@ -95,6 +94,7 @@ def forward_bytes(name, from_sock, to_sock):
  This method is intended to be run in a separate thread of execution.

  Args:
+ name: forwarding thread name
  from_sock: A socket-like object to stream bytes from.
  to_sock: A socket-like object to stream bytes to.
  """
@@ -131,7 +131,7 @@ def connect_sockets(conn_number, from_sock, to_sock):
  This method continuously streams bytes in both directions between the
  given `from_sock` and `to_sock` socket-like objects.

- The caller is responsible for creating and closing the supplied socekts.
+ The caller is responsible for creating and closing the supplied sockets.
  """
  forward_name = f"{conn_number}-forward"
  t1 = threading.Thread(
@@ -163,7 +163,7 @@ def forward_connection(conn_number, conn, addr, target_host):
  Both the supplied incoming connection (`conn`) and the created outgoing
  connection are automatically closed when this method terminates.

- This method should be run inside of a daemon thread so that it will not
+ This method should be run inside a daemon thread so that it will not
  block program termination.
  """
  with conn:
@@ -11,39 +11,37 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
+
  import atexit
+ import datetime
  import json
  import logging
  import os
  import random
  import string
+ import threading
  import time
- import datetime
- from time import sleep
- from typing import Any, cast, ClassVar, Dict, Optional
+ import tqdm

  from google.api_core import retry
- from google.api_core.future.polling import POLLING_PREDICATE
  from google.api_core.client_options import ClientOptions
  from google.api_core.exceptions import Aborted, FailedPrecondition, InvalidArgument, NotFound, PermissionDenied
- from google.cloud.dataproc_v1.types import sessions
-
- from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
+ from google.api_core.future.polling import POLLING_PREDICATE
  from google.cloud.dataproc_spark_connect.client import DataprocChannelBuilder
+ from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException
+ from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
  from google.cloud.dataproc_v1 import (
+ AuthenticationConfig,
  CreateSessionRequest,
  GetSessionRequest,
  Session,
  SessionControllerClient,
- SessionTemplate,
  TerminateSessionRequest,
  )
- from google.protobuf import text_format
- from google.protobuf.text_format import ParseError
+ from google.cloud.dataproc_v1.types import sessions
  from pyspark.sql.connect.session import SparkSession
  from pyspark.sql.utils import to_str
-
- from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException
+ from typing import Any, cast, ClassVar, Dict, Optional

  # Set up logging
  logging.basicConfig(level=logging.INFO)
@@ -69,6 +67,8 @@ class DataprocSparkSession(SparkSession):
  ... ) # doctest: +SKIP
  """

+ _DEFAULT_RUNTIME_VERSION = "2.2"
+
  _active_s8s_session_uuid: ClassVar[Optional[str]] = None
  _project_id = None
  _region = None
@@ -77,8 +77,6 @@ class DataprocSparkSession(SparkSession):

  class Builder(SparkSession.Builder):

- _dataproc_runtime_spark_version = {"3.0": "3.5.1", "2.2": "3.5.0"}
-
  _session_static_configs = [
  "spark.executor.cores",
  "spark.executor.memoryOverhead",
@@ -93,10 +91,10 @@ class DataprocSparkSession(SparkSession):
  self._options: Dict[str, Any] = {}
  self._channel_builder: Optional[DataprocChannelBuilder] = None
  self._dataproc_config: Optional[Session] = None
- self._project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
- self._region = os.environ.get("GOOGLE_CLOUD_REGION")
+ self._project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
+ self._region = os.getenv("GOOGLE_CLOUD_REGION")
  self._client_options = ClientOptions(
- api_endpoint=os.environ.get(
+ api_endpoint=os.getenv(
  "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
  f"{self._region}-dataproc.googleapis.com",
  )
@@ -117,7 +115,7 @@ class DataprocSparkSession(SparkSession):

  def location(self, location):
  self._region = location
- self._client_options.api_endpoint = os.environ.get(
+ self._client_options.api_endpoint = os.getenv(
  "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
  f"{self._region}-dataproc.googleapis.com",
  )
@@ -155,10 +153,7 @@ class DataprocSparkSession(SparkSession):
  spark_connect_url = session_response.runtime_info.endpoints.get(
  "Spark Connect Server"
  )
- spark_connect_url = spark_connect_url.replace("https", "sc")
- if not spark_connect_url.endswith("/"):
- spark_connect_url += "/"
- url = f"{spark_connect_url.replace('.com/', '.com:443/')};session_id={session_response.uuid};use_ssl=true"
+ url = f"{spark_connect_url}/;session_id={session_response.uuid};use_ssl=true"
  logger.debug(f"Spark Connect URL: {url}")
  self._channel_builder = DataprocChannelBuilder(
  url,
@@ -179,56 +174,64 @@ class DataprocSparkSession(SparkSession):

  if self._options.get("spark.remote", False):
  raise NotImplemented(
- "DataprocSparkSession does not support connecting to an existing remote server"
+ "DataprocSparkSession does not support connecting to an existing Spark Connect remote server"
  )

  from google.cloud.dataproc_v1 import SessionControllerClient

  dataproc_config: Session = self._get_dataproc_config()
- session_template: SessionTemplate = self._get_session_template()

- self._get_and_validate_version(
- dataproc_config, session_template
- )
-
- spark_connect_session = self._get_spark_connect_session(
- dataproc_config, session_template
- )
-
- if not spark_connect_session:
- dataproc_config.spark_connect_session = {}
- os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
- session_request = CreateSessionRequest()
  session_id = self.generate_dataproc_session_id()
-
- session_request.session_id = session_id
  dataproc_config.name = f"projects/{self._project_id}/locations/{self._region}/sessions/{session_id}"
  logger.debug(
- f"Configurations used to create serverless session:\n {dataproc_config}"
+ f"Dataproc Session configuration:\n{dataproc_config}"
  )
+
+ session_request = CreateSessionRequest()
+ session_request.session_id = session_id
  session_request.session = dataproc_config
  session_request.parent = (
  f"projects/{self._project_id}/locations/{self._region}"
  )

- logger.debug("Creating serverless session")
+ logger.debug("Creating Dataproc Session")
  DataprocSparkSession._active_s8s_session_id = session_id
  s8s_creation_start_time = time.time()
- try:
- session_polling = retry.Retry(
- predicate=POLLING_PREDICATE,
- initial=5.0, # seconds
- maximum=5.0, # seconds
- multiplier=1.0,
- timeout=600, # seconds
+
+ stop_create_session_pbar = False
+
+ def create_session_pbar():
+ iterations = 150
+ pbar = tqdm.trange(
+ iterations,
+ bar_format="{bar}",
+ ncols=80,
  )
- print("Creating Spark session. It may take a few minutes.")
+ for i in pbar:
+ if stop_create_session_pbar:
+ break
+ # Last iteration
+ if i >= iterations - 1:
+ # Sleep until session created
+ while not stop_create_session_pbar:
+ time.sleep(1)
+ else:
+ time.sleep(1)
+
+ pbar.close()
+ # Print new line after the progress bar
+ print()
+
+ create_session_pbar_thread = threading.Thread(
+ target=create_session_pbar
+ )
+
+ try:
  if (
- "dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"
- in os.environ
- and os.getenv(
- "dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"
- ).lower()
+ os.getenv(
+ "DATAPROC_SPARK_CONNECT_SESSION_TERMINATE_AT_EXIT",
+ "false",
+ )
  == "true"
  ):
  atexit.register(
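
The `create_session_pbar` closure added above is a plain "worker thread plus stop flag" progress indicator. A self-contained sketch of the same pattern, independent of the Dataproc code (all names here are illustrative only):

```python
import threading
import time

import tqdm

stop = False

def progress_bar(iterations: int = 150) -> None:
    # Tick once per second until the main thread flips the stop flag;
    # on the last tick, keep waiting so the bar stays on screen.
    pbar = tqdm.trange(iterations, bar_format="{bar}", ncols=80)
    for i in pbar:
        if stop:
            break
        if i >= iterations - 1:
            while not stop:
                time.sleep(1)
        else:
            time.sleep(1)
    pbar.close()
    print()  # newline after the bar

t = threading.Thread(target=progress_bar)
t.start()
time.sleep(3)  # stands in for the long-running session creation
stop = True
t.join()
```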
@@ -243,18 +246,25 @@ class DataprocSparkSession(SparkSession):
  client_options=self._client_options
  ).create_session(session_request)
  print(
- f"Interactive Session Detail View: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
+ f"Creating Dataproc Session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
  )
+ create_session_pbar_thread.start()
  session_response: Session = operation.result(
- polling=session_polling
+ polling=retry.Retry(
+ predicate=POLLING_PREDICATE,
+ initial=5.0, # seconds
+ maximum=5.0, # seconds
+ multiplier=1.0,
+ timeout=600, # seconds
+ )
  )
- if (
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
- in os.environ
- ):
- file_path = os.environ[
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
- ]
+ stop_create_session_pbar = True
+ create_session_pbar_thread.join()
+ print("Dataproc Session was successfully created")
+ file_path = (
+ DataprocSparkSession._get_active_session_file_path()
+ )
+ if file_path is not None:
  try:
  session_data = {
  "session_name": session_response.name,
@@ -267,21 +277,27 @@ class DataprocSparkSession(SparkSession):
  json.dump(session_data, json_file, indent=4)
  except Exception as e:
  logger.error(
- f"Exception while writing active session to file {file_path} , {e}"
+ f"Exception while writing active session to file {file_path}, {e}"
  )
  except (InvalidArgument, PermissionDenied) as e:
+ stop_create_session_pbar = True
+ if create_session_pbar_thread.is_alive():
+ create_session_pbar_thread.join()
  DataprocSparkSession._active_s8s_session_id = None
  raise DataprocSparkConnectException(
- f"Error while creating serverless session: {e.message}"
+ f"Error while creating Dataproc Session: {e.message}"
  )
  except Exception as e:
+ stop_create_session_pbar = True
+ if create_session_pbar_thread.is_alive():
+ create_session_pbar_thread.join()
  DataprocSparkSession._active_s8s_session_id = None
  raise RuntimeError(
- f"Error while creating serverless session"
+ f"Error while creating Dataproc Session"
  ) from e

  logger.debug(
- f"Serverless session created: {session_id}, creation time taken: {int(time.time() - s8s_creation_start_time)} seconds"
+ f"Dataproc Session created: {session_id} in {int(time.time() - s8s_creation_start_time)} seconds"
  )
  return self.__create_spark_connect_session_from_s8s(
  session_response, dataproc_config.name
@@ -292,17 +308,20 @@ class DataprocSparkSession(SparkSession):
  ) -> Optional["DataprocSparkSession"]:
  s8s_session_id = DataprocSparkSession._active_s8s_session_id
  session_name = f"projects/{self._project_id}/locations/{self._region}/sessions/{s8s_session_id}"
- session_response = get_active_s8s_session_response(
- session_name, self._client_options
- )
+ session_response = None
+ session = None
+ if s8s_session_id is not None:
+ session_response = get_active_s8s_session_response(
+ session_name, self._client_options
+ )
+ session = DataprocSparkSession.getActiveSession()

- session = DataprocSparkSession.getActiveSession()
  if session is None:
  session = DataprocSparkSession._default_session

  if session_response is not None:
  print(
- f"Using existing session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}, configuration changes may not be applied."
+ f"Using existing Dataproc Session (configuration changes may not be applied): https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}"
  )
  if session is None:
  session = self.__create_spark_connect_session_from_s8s(
@@ -310,10 +329,10 @@ class DataprocSparkSession(SparkSession):
  )
  return session
  else:
- logger.info(
- f"Session: {s8s_session_id} not active, stopping previous spark session and creating new"
- )
  if session is not None:
+ print(
+ f"{s8s_session_id} Dataproc Session is not active, stopping and creating a new one"
+ )
  session.stop()

  return None
@@ -333,21 +352,52 @@ class DataprocSparkSession(SparkSession):
  dataproc_config = self._dataproc_config
  for k, v in self._options.items():
  dataproc_config.runtime_config.properties[k] = v
- elif "DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG" in os.environ:
- filepath = os.environ[
- "DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG"
+ dataproc_config.spark_connect_session = (
+ sessions.SparkConnectConfig()
+ )
+ if not dataproc_config.runtime_config.version:
+ dataproc_config.runtime_config.version = (
+ DataprocSparkSession._DEFAULT_RUNTIME_VERSION
+ )
+ if (
+ not dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type
+ and "DATAPROC_SPARK_CONNECT_AUTH_TYPE" in os.environ
+ ):
+ dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type = AuthenticationConfig.AuthenticationType[
+ os.getenv("DATAPROC_SPARK_CONNECT_AUTH_TYPE")
  ]
- try:
- with open(filepath, "r") as f:
- dataproc_config = Session.wrap(
- text_format.Parse(
- f.read(), Session.pb(dataproc_config)
- )
- )
- except FileNotFoundError:
- raise FileNotFoundError(f"File '{filepath}' not found")
- except ParseError as e:
- raise ParseError(f"Error parsing file '{filepath}': {e}")
+ if (
+ not dataproc_config.environment_config.execution_config.service_account
+ and "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT" in os.environ
+ ):
+ dataproc_config.environment_config.execution_config.service_account = os.getenv(
+ "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"
+ )
+ if (
+ not dataproc_config.environment_config.execution_config.subnetwork_uri
+ and "DATAPROC_SPARK_CONNECT_SUBNET" in os.environ
+ ):
+ dataproc_config.environment_config.execution_config.subnetwork_uri = os.getenv(
+ "DATAPROC_SPARK_CONNECT_SUBNET"
+ )
+ if (
+ not dataproc_config.environment_config.execution_config.ttl
+ and "DATAPROC_SPARK_CONNECT_TTL_SECONDS" in os.environ
+ ):
+ dataproc_config.environment_config.execution_config.ttl = {
+ "seconds": int(
+ os.getenv("DATAPROC_SPARK_CONNECT_TTL_SECONDS")
+ )
+ }
+ if (
+ not dataproc_config.environment_config.execution_config.idle_ttl
+ and "DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS" in os.environ
+ ):
+ dataproc_config.environment_config.execution_config.idle_ttl = {
+ "seconds": int(
+ os.getenv("DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS")
+ )
+ }
  if "COLAB_NOTEBOOK_RUNTIME_ID" in os.environ:
  dataproc_config.labels["colab-notebook-runtime-id"] = (
  os.environ["COLAB_NOTEBOOK_RUNTIME_ID"]
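
Per the new `_get_dataproc_config` branches above, several session fields are now seeded from environment variables when they are left unset on the config. A hedged usage sketch; the variable values are placeholders, not defaults, and each one is only consulted when the corresponding field is not already set:

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

# Placeholders for illustration; substitute real values for your project.
os.environ["DATAPROC_SPARK_CONNECT_SUBNET"] = "my-subnet"
os.environ["DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"] = "sa@my-project.iam.gserviceaccount.com"
os.environ["DATAPROC_SPARK_CONNECT_TTL_SECONDS"] = "3600"
os.environ["DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS"] = "900"

# Subnetwork, service account, and TTLs are intentionally left unset here so
# the client fills them in from the environment variables above.
session_config = Session()
spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
```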
@@ -358,87 +408,8 @@ class DataprocSparkSession(SparkSession):
  ]
  return dataproc_config

- def _get_session_template(self):
- from google.cloud.dataproc_v1 import (
- GetSessionTemplateRequest,
- SessionTemplateControllerClient,
- )
-
- session_template = None
- if self._dataproc_config and self._dataproc_config.session_template:
- session_template = self._dataproc_config.session_template
- get_session_template_request = GetSessionTemplateRequest()
- get_session_template_request.name = session_template
- client = SessionTemplateControllerClient(
- client_options=self._client_options
- )
- try:
- session_template = client.get_session_template(
- get_session_template_request
- )
- except Exception as e:
- logger.error(
- f"Failed to get session template {session_template}: {e}"
- )
- raise
- return session_template
-
- def _get_and_validate_version(self, dataproc_config, session_template):
- trimmed_version = lambda v: ".".join(v.split(".")[:2])
- version = None
- if (
- dataproc_config
- and dataproc_config.runtime_config
- and dataproc_config.runtime_config.version
- ):
- version = dataproc_config.runtime_config.version
- elif (
- session_template
- and session_template.runtime_config
- and session_template.runtime_config.version
- ):
- version = session_template.runtime_config.version
-
- if not version:
- version = "3.0"
- dataproc_config.runtime_config.version = version
- elif (
- trimmed_version(version)
- not in self._dataproc_runtime_spark_version
- ):
- raise ValueError(
- f"runtime_config.version {version} is not supported. "
- f"Supported versions: {self._dataproc_runtime_spark_version.keys()}"
- )
-
- server_version = self._dataproc_runtime_spark_version[
- trimmed_version(version)
- ]
- import importlib.metadata
-
- google_connect_version = importlib.metadata.version(
- "dataproc-spark-connect"
- )
- client_version = importlib.metadata.version("pyspark")
- version_message = f"Spark Connect: {google_connect_version} (PySpark: {client_version}) Session Runtime: {version} (Spark: {server_version})"
- logger.info(version_message)
- if trimmed_version(client_version) != trimmed_version(
- server_version
- ):
- logger.warning(
- f"client and server on different versions: {version_message}"
- )
- return version
-
- def _get_spark_connect_session(self, dataproc_config, session_template):
- spark_connect_session = None
- if dataproc_config and dataproc_config.spark_connect_session:
- spark_connect_session = dataproc_config.spark_connect_session
- elif session_template and session_template.spark_connect_session:
- spark_connect_session = session_template.spark_connect_session
- return spark_connect_session
-
- def generate_dataproc_session_id(self):
+ @staticmethod
+ def generate_dataproc_session_id():
  timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  suffix_length = 6
  random_suffix = "".join(
@@ -451,32 +422,30 @@ class DataprocSparkSession(SparkSession):
  def _repr_html_(self) -> str:
  if not self._active_s8s_session_id:
  return """
- <div>No Active Dataproc Spark Session</div>
+ <div>No Active Dataproc Session</div>
  """

  s8s_session = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{self._active_s8s_session_id}"
  ui = f"{s8s_session}/sparkApplications/applications"
- version = ""
  return f"""
  <div>
  <p><b>Spark Connect</b></p>

- <p><a href="{s8s_session}?project={self._project_id}">Serverless Session</a></p>
+ <p><a href="{s8s_session}?project={self._project_id}">Dataproc Session</a></p>
  <p><a href="{ui}?project={self._project_id}">Spark UI</a></p>
  </div>
  """

- def _remove_stoped_session_from_file(self):
- if "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH" in os.environ:
- file_path = os.environ[
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
- ]
+ @staticmethod
+ def _remove_stopped_session_from_file():
+ file_path = DataprocSparkSession._get_active_session_file_path()
+ if file_path is not None:
  try:
  with open(file_path, "w"):
  pass
  except Exception as e:
  logger.error(
- f"Exception while removing active session in file {file_path} , {e}"
+ f"Exception while removing active session in file {file_path}, {e}"
  )

  def addArtifacts(
@@ -494,7 +463,7 @@ class DataprocSparkSession(SparkSession):

  Parameters
  ----------
- *path : tuple of str
+ *artifact : tuple of str
  Artifact's URIs to add.
  pyfile : bool
  Whether to add them as Python dependencies such as .py, .egg, .zip or .jar files.
@@ -507,7 +476,7 @@ class DataprocSparkSession(SparkSession):
  Add a file to be downloaded with this Spark job on every node.
  The ``path`` passed can only be a local file for now.
  pypi : bool
- This option is only available with DataprocSparkSession. eg. `spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)`
+ This option is only available with DataprocSparkSession. e.g. `spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)`
  Installs PyPi package (with its dependencies) in the active Spark session on the driver and executors.

  Notes
@@ -534,6 +503,10 @@ class DataprocSparkSession(SparkSession):
  *artifact, pyfile=pyfile, archive=archive, file=file
  )

+ @staticmethod
+ def _get_active_session_file_path():
+ return os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")
+
  def stop(self) -> None:
  with DataprocSparkSession._lock:
  if DataprocSparkSession._active_s8s_session_id is not None:
@@ -544,7 +517,7 @@ class DataprocSparkSession(SparkSession):
  self._client_options,
  )

- self._remove_stoped_session_from_file()
+ self._remove_stopped_session_from_file()
  DataprocSparkSession._active_s8s_session_uuid = None
  DataprocSparkSession._active_s8s_session_id = None
  DataprocSparkSession._project_id = None
@@ -565,7 +538,7 @@ def terminate_s8s_session(
  ):
  from google.cloud.dataproc_v1 import SessionControllerClient

- logger.debug(f"Terminating serverless session: {active_s8s_session_id}")
+ logger.debug(f"Terminating Dataproc Session: {active_s8s_session_id}")
  terminate_session_request = TerminateSessionRequest()
  session_name = f"projects/{project_id}/locations/{region}/sessions/{active_s8s_session_id}"
  terminate_session_request.name = session_name
@@ -583,18 +556,20 @@ def terminate_s8s_session(
  ):
  session = session_client.get_session(get_session_request)
  state = session.state
- sleep(1)
+ time.sleep(1)
  except NotFound:
- logger.debug(f"Session {active_s8s_session_id} already deleted")
+ logger.debug(
+ f"{active_s8s_session_id} Dataproc Session already deleted"
+ )
  # Client will get 'Aborted' error if session creation is still in progress and
  # 'FailedPrecondition' if another termination is still in progress.
- # Both are retryable but we catch it and let TTL take care of cleanups.
+ # Both are retryable, but we catch it and let TTL take care of cleanups.
  except (FailedPrecondition, Aborted):
  logger.debug(
- f"Session {active_s8s_session_id} already terminated manually or terminated automatically through session ttl limits"
+ f"{active_s8s_session_id} Dataproc Session already terminated manually or automatically due to TTL"
  )
  if state is not None and state == Session.State.FAILED:
- raise RuntimeError("Serverless session termination failed")
+ raise RuntimeError("Dataproc Session termination failed")


  def get_active_s8s_session_response(
@@ -608,7 +583,7 @@ def get_active_s8s_session_response(
  ).get_session(get_session_request)
  state = get_session_response.state
  except Exception as e:
- logger.info(f"{session_name} deleted: {e}")
+ print(f"{session_name} Dataproc Session deleted: {e}")
  return None
  if state is not None and (
  state == Session.State.ACTIVE or state == Session.State.CREATING
@@ -1,111 +0,0 @@
- Metadata-Version: 2.1
- Name: dataproc-spark-connect
- Version: 0.6.0
- Summary: Dataproc client library for Spark Connect
- Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
- Author: Google LLC
- License: Apache 2.0
- License-File: LICENSE
- Requires-Dist: google-api-core>=2.19.1
- Requires-Dist: google-cloud-dataproc>=5.18.0
- Requires-Dist: websockets
- Requires-Dist: pyspark[connect]>=3.5
- Requires-Dist: packaging>=20.0
-
- # Dataproc Spark Connect Client
-
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
- additional functionalities that allow applications to communicate with a remote Dataproc
- Spark cluster using the Spark Connect protocol without requiring additional steps.
-
- ## Install
-
- ```console
- pip install dataproc_spark_connect
- ```
-
- ## Uninstall
-
- ```console
- pip uninstall dataproc_spark_connect
- ```
-
- ## Setup
- This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
- If you are running the client outside of Google Cloud, you must set following environment variables:
-
- * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
- * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
- * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
- * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
-
- ## Usage
-
- 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
-
- ```console
- pip install google_cloud_dataproc --force-reinstall
- pip install dataproc_spark_connect --force-reinstall
- ```
-
- 2. Add the required import into your PySpark application or notebook:
-
- ```python
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
- ```
-
- 3. There are two ways to create a spark session,
-
- 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
-
- ```python
- spark = DataprocSparkSession.builder.getOrCreate()
- ```
-
- 2. Start a Spark session with the following code instead of using a config file:
-
- ```python
- from google.cloud.dataproc_v1 import SparkConnectConfig
- from google.cloud.dataproc_v1 import Session
- dataproc_session_config = Session()
- dataproc_session_config.spark_connect_session = SparkConnectConfig()
- dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
- dataproc_session_config.runtime_config.version = '3.0'
- spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
- ```
-
- ## Billing
- As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
- This will happen even if you are running the client from a non-GCE instance.
-
- ## Contributing
- ### Building and Deploying SDK
-
- 1. Install the requirements in virtual environment.
-
- ```console
- pip install -r requirements-dev.txt
- ```
-
- 2. Build the code.
-
- ```console
- python setup.py sdist bdist_wheel
- ```
-
- 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
-
- ```sh
- VERSION=<version>
- gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
- ```
-
- 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
-
- ```sh
- %%bash
- export VERSION=<version>
- gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
- yes | pip uninstall dataproc_spark_connect
- pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
- ```