dataproc-spark-connect 0.9.0__py2.py3-none-any.whl → 1.0.0__py2.py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,200 @@
1
+ Metadata-Version: 2.4
2
+ Name: dataproc-spark-connect
3
+ Version: 1.0.0
4
+ Summary: Dataproc client library for Spark Connect
5
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
+ Author: Google LLC
7
+ License: Apache 2.0
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: google-api-core>=2.19
11
+ Requires-Dist: google-cloud-dataproc>=5.18
12
+ Requires-Dist: packaging>=20.0
13
+ Requires-Dist: pyspark-client~=4.0.0
14
+ Requires-Dist: tqdm>=4.67
15
+ Requires-Dist: websockets>=14.0
16
+ Dynamic: author
17
+ Dynamic: description
18
+ Dynamic: home-page
19
+ Dynamic: license
20
+ Dynamic: license-file
21
+ Dynamic: requires-dist
22
+ Dynamic: summary
23
+
24
+ # Dataproc Spark Connect Client
25
+
26
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
27
+ client with additional functionality that allows applications to communicate
28
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
29
+ requiring additional setup steps.
30
+
31
+ ## Install
32
+
33
+ ```sh
34
+ pip install dataproc_spark_connect
35
+ ```
36
+
37
+ ## Uninstall
38
+
39
+ ```sh
40
+ pip uninstall dataproc_spark_connect
41
+ ```
42
+
43
+ ## Setup
44
+
45
+ This client requires permissions to
46
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
47
+
48
+ If you are running the client outside of Google Cloud, you need to provide
49
+ authentication credentials. Set the `GOOGLE_APPLICATION_CREDENTIALS` environment
50
+ variable to point to
51
+ your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
52
+ file.
53
+
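+ For example, in a notebook you can set this variable from Python before creating a
+ session (a minimal sketch; the key file path is a placeholder):
+ 
+ ```python
+ import os
+ 
+ # Placeholder path: point this at your downloaded credentials file.
+ os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/credentials.json'
+ ```
+ 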
54
+ You can specify the project and region either via environment variables or directly
55
+ in your code using the builder API:
56
+
57
+ * Environment variables: `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION`
58
+ * Builder API: `.projectId()` and `.location()` methods (recommended; see the example below)
59
+
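+ For example, a minimal sketch that sets the project and region explicitly through the
+ builder (the project ID and region below are placeholders):
+ 
+ ```python
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
+ 
+ spark = (
+     DataprocSparkSession.builder
+     .projectId('my-project')    # placeholder project ID
+     .location('us-central1')    # placeholder region
+     .getOrCreate()
+ )
+ ```
+ 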
60
+ ## Usage
61
+
62
+ 1. Install the latest version of Dataproc Spark Connect:
63
+
64
+ ```sh
65
+ pip install -U dataproc-spark-connect
66
+ ```
67
+
68
+ 2. Add the required imports into your PySpark application or notebook and start
69
+ a Spark session using the fluent API:
70
+
71
+ ```python
72
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
73
+ spark = DataprocSparkSession.builder.getOrCreate()
74
+ ```
75
+
76
+ 3. You can configure Spark properties using the `.config()` method:
77
+
78
+ ```python
79
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
80
+ spark = (
+     DataprocSparkSession.builder
+     .config('spark.executor.memory', '4g')
+     .config('spark.executor.cores', '2')
+     .getOrCreate()
+ )
81
+ ```
82
+
83
+ 4. For advanced configuration, you can use the `Session` class to customize
84
+ settings like subnetwork or other environment configurations:
85
+
86
+ ```python
87
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
88
+ from google.cloud.dataproc_v1 import Session
89
+ session_config = Session()
90
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
91
+ session_config.runtime_config.version = '3.0'
92
+ spark = (
+     DataprocSparkSession.builder
+     .projectId('my-project')
+     .location('us-central1')
+     .dataprocSessionConfig(session_config)
+     .getOrCreate()
+ )
93
+ ```
94
+
95
+ ### Reusing Named Sessions Across Notebooks
96
+
97
+ Named sessions let you share a single Spark session across multiple notebooks, which avoids repeated session startup time and reduces cost.
98
+
99
+ To create or connect to a named session:
100
+
101
+ 1. Create a session with a custom ID in your first notebook:
102
+
103
+ ```python
104
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
105
+ session_id = 'my-ml-pipeline-session'
106
+ spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
107
+ df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
108
+ df.show()
109
+ ```
110
+
111
+ 2. Reuse the same session in another notebook by specifying the same session ID:
112
+
113
+ ```python
114
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
115
+ session_id = 'my-ml-pipeline-session'
116
+ spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
117
+ df = spark.createDataFrame([(2, 'more-data')], ['id', 'value'])
118
+ df.show()
119
+ ```
120
+
121
+ 3. Session IDs must be 4-63 characters long, start with a lowercase letter, contain only lowercase letters, numbers, and hyphens, and not end with a hyphen.
122
+
123
+ 4. Named sessions persist until explicitly terminated or reach their configured TTL.
124
+
125
+ 5. A session with a given ID that is in a TERMINATED state cannot be reused. It must be deleted before a new session with the same ID can be created (see the sketch below).
126
+
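+ For example, a terminated named session can be deleted with the Dataproc Sessions API so
+ that its ID becomes available again. This is a minimal sketch using the
+ `google-cloud-dataproc` client that this package depends on; the project, region, and
+ session ID are placeholders:
+ 
+ ```python
+ from google.api_core.client_options import ClientOptions
+ from google.cloud.dataproc_v1 import DeleteSessionRequest, SessionControllerClient
+ 
+ # Placeholders: substitute your own project, region, and session ID.
+ project, region, session_id = 'my-project', 'us-central1', 'my-ml-pipeline-session'
+ 
+ client = SessionControllerClient(
+     client_options=ClientOptions(api_endpoint=f'{region}-dataproc.googleapis.com')
+ )
+ # Delete the terminated session so the ID can be reused.
+ client.delete_session(
+     DeleteSessionRequest(
+         name=f'projects/{project}/locations/{region}/sessions/{session_id}'
+     )
+ )
+ ```
+ 
+ By default, calling `stop()` on a named session only performs client-side cleanup; pass `spark.stop(terminate=True)` to also terminate the server-side session.
+ 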
127
+ ### Using Spark SQL Magic Commands (Jupyter Notebooks)
128
+
129
+ The package supports the [sparksql-magic](https://github.com/cryeo/sparksql-magic) library for executing Spark SQL queries directly in Jupyter notebooks.
130
+
131
+ **Installation**: To use magic commands, install the required dependencies manually:
132
+ ```bash
133
+ pip install dataproc-spark-connect
134
+ pip install IPython sparksql-magic
135
+ ```
136
+
137
+ 1. Load the magic extension:
138
+ ```python
139
+ %load_ext sparksql_magic
140
+ ```
141
+
142
+ 2. Configure default settings (optional):
143
+ ```python
144
+ %config SparkSql.limit=20
145
+ ```
146
+
147
+ 3. Execute SQL queries:
148
+ ```python
149
+ %%sparksql
150
+ SELECT * FROM your_table
151
+ ```
152
+
153
+ 4. Advanced usage with options:
154
+ ```python
155
+ %%sparksql --cache --view result_view df
156
+ -- Cache the result, expose it as temporary view result_view, and store the DataFrame in df
157
+ SELECT * FROM your_table WHERE condition = true
158
+ ```
159
+
160
+ Available options (see the combined example below):
161
+ - `--cache` / `-c`: Cache the DataFrame
162
+ - `--eager` / `-e`: Cache with eager loading
163
+ - `--view VIEW` / `-v VIEW`: Create a temporary view
164
+ - `--limit N` / `-l N`: Override default row display limit
165
+ - `variable_name`: Store result in a variable
166
+
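+ For instance, a minimal sketch combining the row-limit override with result capture
+ (the table name is a placeholder):
+ 
+ ```python
+ %%sparksql --limit 5 top_rows
+ SELECT * FROM your_table
+ ```
+ 
+ Here `top_rows` holds the resulting DataFrame for further use in Python.
+ 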
167
+ See [sparksql-magic](https://github.com/cryeo/sparksql-magic) for more examples.
168
+
169
+ **Note**: Magic commands are optional. If you only need basic DataprocSparkSession functionality without Jupyter magic support, install only the base package:
170
+ ```bash
171
+ pip install dataproc-spark-connect
172
+ ```
173
+
174
+ ## Developing
175
+
176
+ For development instructions, see the [development guide](DEVELOPING.md).
177
+
178
+ ## Contributing
179
+
180
+ We'd love to accept your patches and contributions to this project. There are
181
+ just a few small guidelines you need to follow.
182
+
183
+ ### Contributor License Agreement
184
+
185
+ Contributions to this project must be accompanied by a Contributor License
186
+ Agreement. You (or your employer) retain the copyright to your contribution;
187
+ this simply gives us permission to use and redistribute your contributions as
188
+ part of the project. Head over to <https://cla.developers.google.com> to see
189
+ your current agreements on file or to sign a new one.
190
+
191
+ You generally only need to submit a CLA once, so if you've already submitted one
192
+ (even if it was for a different project), you probably don't need to do it
193
+ again.
194
+
195
+ ### Code reviews
196
+
197
+ All submissions, including submissions by project members, require review. We
198
+ use GitHub pull requests for this purpose. Consult
199
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
200
+ information on using pull requests.
@@ -0,0 +1,13 @@
1
+ dataproc_spark_connect-1.0.0.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
2
+ google/cloud/dataproc_spark_connect/__init__.py,sha256=dIqHNWVWWrSuRf26x11kX5e9yMKSHCtmI_GBj1-FDdE,1101
3
+ google/cloud/dataproc_spark_connect/environment.py,sha256=o5WRKI1vyIaxZ8S2UhtDer6pdi4CXYRzI9Xdpq5hVkQ,2771
4
+ google/cloud/dataproc_spark_connect/exceptions.py,sha256=iwaHgNabcaxqquOpktGkOWKHMf8hgdPQJUgRnIbTXVs,970
5
+ google/cloud/dataproc_spark_connect/pypi_artifacts.py,sha256=gd-VMwiVP-EJuPp9Vf9Shx8pqps3oSKp0hBcSSZQS-A,1575
6
+ google/cloud/dataproc_spark_connect/session.py,sha256=loEpKA2ssA89EqT9gWphmfPsZwfHjayxd97J2avdQMc,55890
7
+ google/cloud/dataproc_spark_connect/client/__init__.py,sha256=6hCNSsgYlie6GuVpc5gjFsPnyeMTScTpXSPYqp1fplY,615
8
+ google/cloud/dataproc_spark_connect/client/core.py,sha256=GRc4OCTBvIvdagjxOPoDO22vLtt8xDSerdREMRDeUBY,4659
9
+ google/cloud/dataproc_spark_connect/client/proxy.py,sha256=qUZXvVY1yn934vE6nlO495XUZ53AUx9O74a9ozkGI9U,8976
10
+ dataproc_spark_connect-1.0.0.dist-info/METADATA,sha256=HYCTM2juKp06uDL-9Ec1Ssu7tjBfnqX_LJ6bBjRjJjA,6838
11
+ dataproc_spark_connect-1.0.0.dist-info/WHEEL,sha256=JNWh1Fm1UdwIQV075glCn4MVuCRs0sotJIq-J6rbxCU,109
12
+ dataproc_spark_connect-1.0.0.dist-info/top_level.txt,sha256=_1QvSJIhFAGfxb79D6DhB7SUw2X6T4rwnz_LLrbcD3c,7
13
+ dataproc_spark_connect-1.0.0.dist-info/RECORD,,
@@ -15,14 +15,14 @@ import logging
15
15
 
16
16
  import google
17
17
  import grpc
18
- from pyspark.sql.connect.client import ChannelBuilder
18
+ from pyspark.sql.connect.client import DefaultChannelBuilder
19
19
 
20
20
  from . import proxy
21
21
 
22
22
  logger = logging.getLogger(__name__)
23
23
 
24
24
 
25
- class DataprocChannelBuilder(ChannelBuilder):
25
+ class DataprocChannelBuilder(DefaultChannelBuilder):
26
26
  """
27
27
  This is a helper class that is used to create a GRPC channel based on the given
28
28
  connection string per the documentation of Spark Connect.
@@ -88,7 +88,9 @@ class ProxiedChannel(grpc.Channel):
88
88
  self._proxy = proxy.DataprocSessionProxy(0, target_host)
89
89
  self._proxy.start()
90
90
  self._proxied_connect_url = f"sc://localhost:{self._proxy.port}"
91
- self._wrapped = ChannelBuilder(self._proxied_connect_url).toChannel()
91
+ self._wrapped = DefaultChannelBuilder(
92
+ self._proxied_connect_url
93
+ ).toChannel()
92
94
 
93
95
  def __enter__(self):
94
96
  return self
@@ -13,6 +13,7 @@
13
13
  # limitations under the License.
14
14
 
15
15
  import os
16
+ import sys
16
17
  from typing import Callable, Tuple, List
17
18
 
18
19
 
@@ -46,6 +47,30 @@ def is_jetbrains_ide() -> bool:
46
47
  return "jetbrains" in os.getenv("TERMINAL_EMULATOR", "").lower()
47
48
 
48
49
 
50
+ def is_interactive():
51
+ try:
52
+ from IPython import get_ipython
53
+
54
+ if get_ipython() is not None:
55
+ return True
56
+ except ImportError:
57
+ pass
58
+
59
+ return hasattr(sys, "ps1") or sys.flags.interactive
60
+
61
+
62
+ def is_terminal():
63
+ return sys.stdin.isatty()
64
+
65
+
66
+ def is_interactive_terminal():
67
+ return is_interactive() and is_terminal()
68
+
69
+
70
+ def is_dataproc_batch() -> bool:
71
+ return os.getenv("DATAPROC_WORKLOAD_TYPE") == "batch"
72
+
73
+
49
74
  def get_client_environment_label() -> str:
50
75
  """
51
76
  Map current environment to a standardized client label.
@@ -24,4 +24,4 @@ class DataprocSparkConnectException(Exception):
24
24
  super().__init__(message)
25
25
 
26
26
  def _render_traceback_(self):
27
- return self.message
27
+ return [self.message]
@@ -14,6 +14,7 @@
14
14
 
15
15
  import atexit
16
16
  import datetime
17
+ import functools
17
18
  import json
18
19
  import logging
19
20
  import os
@@ -24,8 +25,9 @@ import threading
24
25
  import time
25
26
  import uuid
26
27
  import tqdm
28
+ from packaging import version
27
29
  from types import MethodType
28
- from typing import Any, cast, ClassVar, Dict, Optional, Union
30
+ from typing import Any, cast, ClassVar, Dict, Iterable, Optional, Union
29
31
 
30
32
  from google.api_core import retry
31
33
  from google.api_core.client_options import ClientOptions
@@ -43,6 +45,7 @@ from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
43
45
  from google.cloud.dataproc_v1 import (
44
46
  AuthenticationConfig,
45
47
  CreateSessionRequest,
48
+ DeleteSessionRequest,
46
49
  GetSessionRequest,
47
50
  Session,
48
51
  SessionControllerClient,
@@ -63,6 +66,10 @@ SYSTEM_LABELS = {
63
66
  "goog-colab-notebook-id",
64
67
  }
65
68
 
69
+ _DATAPROC_SESSIONS_BASE_URL = (
70
+ "https://console.cloud.google.com/dataproc/interactive"
71
+ )
72
+
66
73
 
67
74
  def _is_valid_label_value(value: str) -> bool:
68
75
  """
@@ -84,6 +91,22 @@ def _is_valid_label_value(value: str) -> bool:
84
91
  return bool(re.match(pattern, value))
85
92
 
86
93
 
94
+ def _is_valid_session_id(session_id: str) -> bool:
95
+ """
96
+ Validates if a string complies with Google Cloud session ID format.
97
+ - Must be 4-63 characters
98
+ - Only lowercase letters, numbers, and dashes are allowed
99
+ - Must start with a lowercase letter
100
+ - Cannot end with a dash
101
+ """
102
+ if not session_id:
103
+ return False
104
+
105
+ # The pattern is sufficient for validation and already enforces length constraints.
106
+ pattern = r"^[a-z][a-z0-9-]{2,61}[a-z0-9]$"
107
+ return bool(re.match(pattern, session_id))
108
+
109
+
87
110
  class DataprocSparkSession(SparkSession):
88
111
  """The entry point to programming Spark with the Dataset and DataFrame API.
89
112
 
@@ -103,13 +126,16 @@ class DataprocSparkSession(SparkSession):
103
126
  ... ) # doctest: +SKIP
104
127
  """
105
128
 
106
- _DEFAULT_RUNTIME_VERSION = "2.3"
129
+ _DEFAULT_RUNTIME_VERSION = "3.0"
130
+ _MIN_RUNTIME_VERSION = "3.0"
107
131
 
108
132
  _active_s8s_session_uuid: ClassVar[Optional[str]] = None
109
133
  _project_id = None
110
134
  _region = None
111
135
  _client_options = None
112
136
  _active_s8s_session_id: ClassVar[Optional[str]] = None
137
+ _active_session_uses_custom_id: ClassVar[bool] = False
138
+ _execution_progress_bar = dict()
113
139
 
114
140
  class Builder(SparkSession.Builder):
115
141
 
@@ -117,6 +143,7 @@ class DataprocSparkSession(SparkSession):
117
143
  self._options: Dict[str, Any] = {}
118
144
  self._channel_builder: Optional[DataprocChannelBuilder] = None
119
145
  self._dataproc_config: Optional[Session] = None
146
+ self._custom_session_id: Optional[str] = None
120
147
  self._project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
121
148
  self._region = os.getenv("GOOGLE_CLOUD_REGION")
122
149
  self._client_options = ClientOptions(
@@ -125,6 +152,18 @@ class DataprocSparkSession(SparkSession):
125
152
  f"{self._region}-dataproc.googleapis.com",
126
153
  )
127
154
  )
155
+ self._session_controller_client: Optional[
156
+ SessionControllerClient
157
+ ] = None
158
+
159
+ @property
160
+ def session_controller_client(self) -> SessionControllerClient:
161
+ """Get or create a SessionControllerClient instance."""
162
+ if self._session_controller_client is None:
163
+ self._session_controller_client = SessionControllerClient(
164
+ client_options=self._client_options
165
+ )
166
+ return self._session_controller_client
128
167
 
129
168
  def projectId(self, project_id):
130
169
  self._project_id = project_id
@@ -138,6 +177,35 @@ class DataprocSparkSession(SparkSession):
138
177
  )
139
178
  return self
140
179
 
180
+ def dataprocSessionId(self, session_id: str):
181
+ """
182
+ Set a custom session ID for creating or reusing sessions.
183
+
184
+ The session ID must:
185
+ - Be 4-63 characters long
186
+ - Start with a lowercase letter
187
+ - Contain only lowercase letters, numbers, and hyphens
188
+ - Not end with a hyphen
189
+
190
+ Args:
191
+ session_id: The custom session ID to use
192
+
193
+ Returns:
194
+ This Builder instance for method chaining
195
+
196
+ Raises:
197
+ ValueError: If the session ID format is invalid
198
+ """
199
+ if not _is_valid_session_id(session_id):
200
+ raise ValueError(
201
+ f"Invalid session ID: '{session_id}'. "
202
+ "Session ID must be 4-63 characters, start with a lowercase letter, "
203
+ "contain only lowercase letters, numbers, and hyphens, "
204
+ "and not end with a hyphen."
205
+ )
206
+ self._custom_session_id = session_id
207
+ return self
208
+
141
209
  def dataprocSessionConfig(self, dataproc_config: Session):
142
210
  self._dataproc_config = dataproc_config
143
211
  for k, v in dataproc_config.runtime_config.properties.items():
@@ -158,19 +226,6 @@ class DataprocSparkSession(SparkSession):
158
226
  self.dataproc_config.environment_config.execution_config.service_account = (
159
227
  account
160
228
  )
161
- # Automatically set auth type to SERVICE_ACCOUNT when service account is provided
162
- # This overrides any env var setting to simplify user experience
163
- self.dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type = (
164
- AuthenticationConfig.AuthenticationType.SERVICE_ACCOUNT
165
- )
166
- return self
167
-
168
- def authType(
169
- self, auth_type: "AuthenticationConfig.AuthenticationType"
170
- ):
171
- self.dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type = (
172
- auth_type
173
- )
174
229
  return self
175
230
 
176
231
  def subnetwork(self, subnet: str):
@@ -181,10 +236,7 @@ class DataprocSparkSession(SparkSession):
181
236
 
182
237
  def ttl(self, duration: datetime.timedelta):
183
238
  """Set the time-to-live (TTL) for the session using a timedelta object."""
184
- self.dataproc_config.environment_config.execution_config.ttl = {
185
- "seconds": int(duration.total_seconds())
186
- }
187
- return self
239
+ return self.ttlSeconds(int(duration.total_seconds()))
188
240
 
189
241
  def ttlSeconds(self, seconds: int):
190
242
  """Set the time-to-live (TTL) for the session in seconds."""
@@ -195,10 +247,7 @@ class DataprocSparkSession(SparkSession):
195
247
 
196
248
  def idleTtl(self, duration: datetime.timedelta):
197
249
  """Set the idle time-to-live (idle TTL) for the session using a timedelta object."""
198
- self.dataproc_config.environment_config.execution_config.idle_ttl = {
199
- "seconds": int(duration.total_seconds())
200
- }
201
- return self
250
+ return self.idleTtlSeconds(int(duration.total_seconds()))
202
251
 
203
252
  def idleTtlSeconds(self, seconds: int):
204
253
  """Set the idle time-to-live (idle TTL) for the session in seconds."""
@@ -266,7 +315,11 @@ class DataprocSparkSession(SparkSession):
266
315
  assert self._channel_builder is not None
267
316
  session = DataprocSparkSession(connection=self._channel_builder)
268
317
 
318
+ # Register handler for Cell Execution Progress bar
319
+ session._register_progress_execution_handler()
320
+
269
321
  DataprocSparkSession._set_default_and_active_session(session)
322
+
270
323
  return session
271
324
 
272
325
  def __create(self) -> "DataprocSparkSession":
@@ -281,7 +334,16 @@ class DataprocSparkSession(SparkSession):
281
334
 
282
335
  dataproc_config: Session = self._get_dataproc_config()
283
336
 
284
- session_id = self.generate_dataproc_session_id()
337
+ # Check runtime version compatibility before creating session
338
+ self._check_runtime_compatibility(dataproc_config)
339
+
340
+ # Use custom session ID if provided, otherwise generate one
341
+ session_id = (
342
+ self._custom_session_id
343
+ if self._custom_session_id
344
+ else self.generate_dataproc_session_id()
345
+ )
346
+
285
347
  dataproc_config.name = f"projects/{self._project_id}/locations/{self._region}/sessions/{session_id}"
286
348
  logger.debug(
287
349
  f"Dataproc Session configuration:\n{dataproc_config}"
@@ -296,6 +358,10 @@ class DataprocSparkSession(SparkSession):
296
358
 
297
359
  logger.debug("Creating Dataproc Session")
298
360
  DataprocSparkSession._active_s8s_session_id = session_id
361
+ # Track whether this session uses a custom ID (unmanaged) or auto-generated ID (managed)
362
+ DataprocSparkSession._active_session_uses_custom_id = (
363
+ self._custom_session_id is not None
364
+ )
299
365
  s8s_creation_start_time = time.time()
300
366
 
301
367
  stop_create_session_pbar_event = threading.Event()
@@ -386,6 +452,7 @@ class DataprocSparkSession(SparkSession):
386
452
  if create_session_pbar_thread.is_alive():
387
453
  create_session_pbar_thread.join()
388
454
  DataprocSparkSession._active_s8s_session_id = None
455
+ DataprocSparkSession._active_session_uses_custom_id = False
389
456
  raise DataprocSparkConnectException(
390
457
  f"Error while creating Dataproc Session: {e.message}"
391
458
  )
@@ -394,6 +461,7 @@ class DataprocSparkSession(SparkSession):
394
461
  if create_session_pbar_thread.is_alive():
395
462
  create_session_pbar_thread.join()
396
463
  DataprocSparkSession._active_s8s_session_id = None
464
+ DataprocSparkSession._active_session_uses_custom_id = False
397
465
  raise RuntimeError(
398
466
  f"Error while creating Dataproc Session"
399
467
  ) from e
@@ -407,16 +475,43 @@ class DataprocSparkSession(SparkSession):
407
475
  session_response, dataproc_config.name
408
476
  )
409
477
 
478
+ def _wait_for_session_available(
479
+ self, session_name: str, timeout: int = 300
480
+ ) -> Session:
481
+ start_time = time.time()
482
+ while time.time() - start_time < timeout:
483
+ try:
484
+ session = self.session_controller_client.get_session(
485
+ name=session_name
486
+ )
487
+ if "Spark Connect Server" in session.runtime_info.endpoints:
488
+ return session
489
+ time.sleep(5)
490
+ except Exception as e:
491
+ logger.warning(
492
+ f"Error while polling for Spark Connect endpoint: {e}"
493
+ )
494
+ time.sleep(5)
495
+ raise RuntimeError(
496
+ f"Spark Connect endpoint not available for session {session_name} after {timeout} seconds."
497
+ )
498
+
410
499
  def _display_session_link_on_creation(self, session_id):
411
- session_url = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
500
+ session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{session_id}?project={self._project_id}"
412
501
  plain_message = f"Creating Dataproc Session: {session_url}"
413
- html_element = f"""
502
+ if environment.is_colab_enterprise():
503
+ html_element = f"""
414
504
  <div>
415
505
  <p>Creating Dataproc Spark Session<p>
416
- <p><a href="{session_url}">Dataproc Session</a></p>
417
506
  </div>
418
- """
419
-
507
+ """
508
+ else:
509
+ html_element = f"""
510
+ <div>
511
+ <p>Creating Dataproc Spark Session<p>
512
+ <p><a href="{session_url}">Dataproc Session</a></p>
513
+ </div>
514
+ """
420
515
  self._output_element_or_message(plain_message, html_element)
421
516
 
422
517
  def _print_session_created_message(self):
@@ -435,16 +530,19 @@ class DataprocSparkSession(SparkSession):
435
530
  :param html_element: HTML element to display for interactive IPython
436
531
  environment
437
532
  """
533
+ # Don't print any output (Rich or Plain) for non-interactive
534
+ if not environment.is_interactive():
535
+ return
536
+
537
+ if environment.is_interactive_terminal():
538
+ print(plain_message)
539
+ return
540
+
438
541
  try:
439
542
  from IPython.display import display, HTML
440
- from IPython.core.interactiveshell import InteractiveShell
441
543
 
442
- if not InteractiveShell.initialized():
443
- raise DataprocSparkConnectException(
444
- "Not in an Interactive IPython Environment"
445
- )
446
544
  display(HTML(html_element))
447
- except (ImportError, DataprocSparkConnectException):
545
+ except ImportError:
448
546
  print(plain_message)
449
547
 
450
548
  def _get_exiting_active_session(
@@ -465,10 +563,13 @@ class DataprocSparkSession(SparkSession):
465
563
 
466
564
  if session_response is not None:
467
565
  print(
468
- f"Using existing Dataproc Session (configuration changes may not be applied): https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}"
566
+ f"Using existing Dataproc Session (configuration changes may not be applied): {_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{s8s_session_id}?project={self._project_id}"
469
567
  )
470
568
  self._display_view_session_details_button(s8s_session_id)
471
569
  if session is None:
570
+ session_response = self._wait_for_session_available(
571
+ session_name
572
+ )
472
573
  session = self.__create_spark_connect_session_from_s8s(
473
574
  session_response, session_name
474
575
  )
@@ -484,11 +585,54 @@ class DataprocSparkSession(SparkSession):
484
585
 
485
586
  def getOrCreate(self) -> "DataprocSparkSession":
486
587
  with DataprocSparkSession._lock:
588
+ if environment.is_dataproc_batch():
589
+ # For Dataproc batch workloads, connect to the already initialized local SparkSession
590
+ from pyspark.sql import SparkSession as PySparkSQLSession
591
+
592
+ session = PySparkSQLSession.builder.getOrCreate()
593
+ return session # type: ignore
594
+
595
+ if self._project_id is None:
596
+ raise DataprocSparkConnectException(
597
+ f"Error while creating Dataproc Session: project ID is not set"
598
+ )
599
+
600
+ if self._region is None:
601
+ raise DataprocSparkConnectException(
602
+ f"Error while creating Dataproc Session: location is not set"
603
+ )
604
+
605
+ # Handle custom session ID by setting it early and letting existing logic handle it
606
+ if self._custom_session_id:
607
+ self._handle_custom_session_id()
608
+
487
609
  session = self._get_exiting_active_session()
488
610
  if session is None:
489
611
  session = self.__create()
612
+
613
+ # Register this session as the instantiated SparkSession for compatibility
614
+ # with tools and libraries that expect SparkSession._instantiatedSession
615
+ from pyspark.sql import SparkSession as PySparkSQLSession
616
+
617
+ PySparkSQLSession._instantiatedSession = session
618
+
490
619
  return session
491
620
 
621
+ def _handle_custom_session_id(self):
622
+ """Handle custom session ID by checking if it exists and setting _active_s8s_session_id."""
623
+ session_response = self._get_session_by_id(self._custom_session_id)
624
+ if session_response is not None:
625
+ # Found an active session with the custom ID, set it as the active session
626
+ DataprocSparkSession._active_s8s_session_id = (
627
+ self._custom_session_id
628
+ )
629
+ # Mark that this session uses a custom ID
630
+ DataprocSparkSession._active_session_uses_custom_id = True
631
+ else:
632
+ # No existing session found, clear any existing active session ID
633
+ # so we'll create a new one with the custom ID
634
+ DataprocSparkSession._active_s8s_session_id = None
635
+
492
636
  def _get_dataproc_config(self):
493
637
  # Use the property to ensure we always have a config
494
638
  dataproc_config = self.dataproc_config
@@ -506,20 +650,33 @@ class DataprocSparkSession(SparkSession):
506
650
  self._check_python_version_compatibility(
507
651
  dataproc_config.runtime_config.version
508
652
  )
653
+
654
+ # Use local variable to improve readability of deeply nested attribute access
655
+ exec_config = dataproc_config.environment_config.execution_config
656
+
657
+ # Set service account from environment if not already set
509
658
  if (
510
- not dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type
511
- and "DATAPROC_SPARK_CONNECT_AUTH_TYPE" in os.environ
512
- ):
513
- dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type = AuthenticationConfig.AuthenticationType[
514
- os.getenv("DATAPROC_SPARK_CONNECT_AUTH_TYPE")
515
- ]
516
- if (
517
- not dataproc_config.environment_config.execution_config.service_account
659
+ not exec_config.service_account
518
660
  and "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT" in os.environ
519
661
  ):
520
- dataproc_config.environment_config.execution_config.service_account = os.getenv(
662
+ exec_config.service_account = os.getenv(
521
663
  "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"
522
664
  )
665
+
666
+ # Auto-set authentication type to SERVICE_ACCOUNT when service account is provided
667
+ if exec_config.service_account:
668
+ # When service account is provided, explicitly set auth type to SERVICE_ACCOUNT
669
+ exec_config.authentication_config.user_workload_authentication_type = (
670
+ AuthenticationConfig.AuthenticationType.SERVICE_ACCOUNT
671
+ )
672
+ elif (
673
+ not exec_config.authentication_config.user_workload_authentication_type
674
+ and "DATAPROC_SPARK_CONNECT_AUTH_TYPE" in os.environ
675
+ ):
676
+ # Only set auth type from environment if no service account is present
677
+ exec_config.authentication_config.user_workload_authentication_type = AuthenticationConfig.AuthenticationType[
678
+ os.getenv("DATAPROC_SPARK_CONNECT_AUTH_TYPE")
679
+ ]
523
680
  if (
524
681
  not dataproc_config.environment_config.execution_config.subnetwork_uri
525
682
  and "DATAPROC_SPARK_CONNECT_SUBNET" in os.environ
@@ -568,27 +725,23 @@ class DataprocSparkSession(SparkSession):
568
725
  default_datasource = os.getenv(
569
726
  "DATAPROC_SPARK_CONNECT_DEFAULT_DATASOURCE"
570
727
  )
571
- if (
572
- default_datasource
573
- and dataproc_config.runtime_config.version == "2.3"
574
- ):
575
- if default_datasource == "bigquery":
576
- bq_datasource_properties = {
577
- "spark.datasource.bigquery.viewsEnabled": "true",
578
- "spark.datasource.bigquery.writeMethod": "direct",
728
+ match default_datasource:
729
+ case "bigquery":
730
+ # Merge default configs with existing properties,
731
+ # user configs take precedence
732
+ for k, v in {
579
733
  "spark.sql.catalog.spark_catalog": "com.google.cloud.spark.bigquery.BigQuerySparkSessionCatalog",
580
- "spark.sql.legacy.createHiveTableByDefault": "false",
581
734
  "spark.sql.sources.default": "bigquery",
582
- }
583
- # Merge default configs with existing properties, user configs take precedence
584
- for k, v in bq_datasource_properties.items():
735
+ }.items():
585
736
  if k not in dataproc_config.runtime_config.properties:
586
737
  dataproc_config.runtime_config.properties[k] = v
587
- else:
588
- logger.warning(
589
- f"DATAPROC_SPARK_CONNECT_DEFAULT_DATASOURCE is set to an invalid value:"
590
- f" {default_datasource}. Supported value is 'bigquery'."
591
- )
738
+ case _:
739
+ if default_datasource:
740
+ logger.warning(
741
+ f"DATAPROC_SPARK_CONNECT_DEFAULT_DATASOURCE is set to an invalid value:"
742
+ f" {default_datasource}. Supported value is 'bigquery'."
743
+ )
744
+
592
745
  return dataproc_config
593
746
 
594
747
  def _check_python_version_compatibility(self, runtime_version):
@@ -598,9 +751,7 @@ class DataprocSparkSession(SparkSession):
598
751
 
599
752
  # Runtime version to server Python version mapping
600
753
  RUNTIME_PYTHON_MAP = {
601
- "1.2": (3, 12),
602
- "2.2": (3, 12),
603
- "2.3": (3, 11),
754
+ "3.0": (3, 12),
604
755
  }
605
756
 
606
757
  client_python = sys.version_info[:2] # (major, minor)
@@ -617,9 +768,54 @@ class DataprocSparkSession(SparkSession):
617
768
  stacklevel=3,
618
769
  )
619
770
 
771
+ def _check_runtime_compatibility(self, dataproc_config):
772
+ """Check if runtime version 3.0 client is compatible with older runtime versions.
773
+
774
+ Runtime version 3.0 clients do not support older runtime versions (pre-3.0).
775
+ There is no backward or forward compatibility between different runtime versions.
776
+
777
+ Args:
778
+ dataproc_config: The Session configuration containing runtime version
779
+
780
+ Raises:
781
+ DataprocSparkConnectException: If server is using pre-3.0 runtime version
782
+ """
783
+ runtime_version = dataproc_config.runtime_config.version
784
+
785
+ if not runtime_version:
786
+ return
787
+
788
+ logger.debug(f"Detected server runtime version: {runtime_version}")
789
+
790
+ # Parse runtime version to check if it's below minimum supported version
791
+ try:
792
+ server_version = version.parse(runtime_version)
793
+ min_version = version.parse(
794
+ DataprocSparkSession._MIN_RUNTIME_VERSION
795
+ )
796
+
797
+ if server_version < min_version:
798
+ raise DataprocSparkConnectException(
799
+ f"Specified {runtime_version} Dataproc Runtime version is not supported, "
800
+ f"use {DataprocSparkSession._MIN_RUNTIME_VERSION} version or higher."
801
+ )
802
+ except version.InvalidVersion:
803
+ # If we can't parse the version, log a warning but continue
804
+ logger.warning(
805
+ f"Could not parse runtime version: {runtime_version}"
806
+ )
807
+
620
808
  def _display_view_session_details_button(self, session_id):
809
+ # Display button is only supported in colab enterprise
810
+ if not environment.is_colab_enterprise():
811
+ return
812
+
813
+ # Skip button display for colab enterprise IPython terminals
814
+ if environment.is_interactive_terminal():
815
+ return
816
+
621
817
  try:
622
- session_url = f"https://console.cloud.google.com/dataproc/interactive/sessions/{session_id}/locations/{self._region}?project={self._project_id}"
818
+ session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{session_id}?project={self._project_id}"
623
819
  from IPython.core.interactiveshell import InteractiveShell
624
820
 
625
821
  if not InteractiveShell.initialized():
@@ -633,6 +829,90 @@ class DataprocSparkSession(SparkSession):
633
829
  except ImportError as e:
634
830
  logger.debug(f"Import error: {e}")
635
831
 
832
+ def _get_session_by_id(self, session_id: str) -> Optional[Session]:
833
+ """
834
+ Get existing session by ID.
835
+
836
+ Returns:
837
+ Session if ACTIVE/CREATING, None if not found or not usable
838
+ """
839
+ session_name = f"projects/{self._project_id}/locations/{self._region}/sessions/{session_id}"
840
+
841
+ try:
842
+ get_request = GetSessionRequest(name=session_name)
843
+ session = self.session_controller_client.get_session(
844
+ get_request
845
+ )
846
+
847
+ logger.debug(
848
+ f"Found existing session {session_id} in state: {session.state}"
849
+ )
850
+
851
+ if session.state in [
852
+ Session.State.ACTIVE,
853
+ Session.State.CREATING,
854
+ ]:
855
+ # Reuse the active session
856
+ logger.info(f"Reusing existing session: {session_id}")
857
+ return session
858
+ else:
859
+ # Session exists but is not usable (terminated/failed/terminating)
860
+ logger.info(
861
+ f"Session {session_id} in {session.state.name} state, cannot reuse"
862
+ )
863
+ return None
864
+
865
+ except NotFound:
866
+ # Session doesn't exist, can create new one
867
+ logger.debug(
868
+ f"Session {session_id} not found, can create new one"
869
+ )
870
+ return None
871
+ except Exception as e:
872
+ logger.error(f"Error checking session {session_id}: {e}")
873
+ return None
874
+
875
+ def _delete_session(self, session_name: str):
876
+ """Delete a session to free up the session ID for reuse."""
877
+ try:
878
+ delete_request = DeleteSessionRequest(name=session_name)
879
+ self.session_controller_client.delete_session(delete_request)
880
+ logger.debug(f"Deleted session: {session_name}")
881
+ except NotFound:
882
+ logger.debug(f"Session already deleted: {session_name}")
883
+
884
+ def _wait_for_termination(self, session_name: str, timeout: int = 180):
885
+ """Wait for a session to finish terminating."""
886
+ start_time = time.time()
887
+
888
+ while time.time() - start_time < timeout:
889
+ try:
890
+ get_request = GetSessionRequest(name=session_name)
891
+ session = self.session_controller_client.get_session(
892
+ get_request
893
+ )
894
+
895
+ if session.state in [
896
+ Session.State.TERMINATED,
897
+ Session.State.FAILED,
898
+ ]:
899
+ return
900
+ elif session.state != Session.State.TERMINATING:
901
+ # Session is in unexpected state
902
+ logger.warning(
903
+ f"Session {session_name} in unexpected state while waiting for termination: {session.state}"
904
+ )
905
+ return
906
+
907
+ time.sleep(2)
908
+ except NotFound:
909
+ # Session was deleted
910
+ return
911
+
912
+ logger.warning(
913
+ f"Timeout waiting for session {session_name} to terminate"
914
+ )
915
+
636
916
  @staticmethod
637
917
  def generate_dataproc_session_id():
638
918
  timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
@@ -706,16 +986,111 @@ class DataprocSparkSession(SparkSession):
706
986
  execute_and_fetch_as_iterator_wrapped_method, self.client
707
987
  )
708
988
 
989
+ # Patching clearProgressHandlers method to not remove Dataproc Progress Handler
990
+ clearProgressHandlers_base_method = self.clearProgressHandlers
991
+
992
+ def clearProgressHandlers_wrapper_method(_, *args, **kwargs):
993
+ clearProgressHandlers_base_method(*args, **kwargs)
994
+
995
+ self._register_progress_execution_handler()
996
+
997
+ self.clearProgressHandlers = MethodType(
998
+ clearProgressHandlers_wrapper_method, self
999
+ )
1000
+
1001
+ @staticmethod
1002
+ @functools.lru_cache(maxsize=1)
1003
+ def get_tqdm_bar():
1004
+ """
1005
+ Return a tqdm implementation that works in the current environment.
1006
+
1007
+ - Uses CLI tqdm for interactive terminals.
1008
+ - Uses the notebook tqdm if available, otherwise falls back to CLI tqdm.
1009
+ """
1010
+ from tqdm import tqdm as cli_tqdm
1011
+
1012
+ if environment.is_interactive_terminal():
1013
+ return cli_tqdm
1014
+
1015
+ try:
1016
+ import ipywidgets
1017
+ from tqdm.notebook import tqdm as notebook_tqdm
1018
+
1019
+ return notebook_tqdm
1020
+ except ImportError:
1021
+ return cli_tqdm
1022
+
1023
+ def _register_progress_execution_handler(self):
1024
+ from pyspark.sql.connect.shell.progress import StageInfo
1025
+
1026
+ def handler(
1027
+ stages: Optional[Iterable[StageInfo]],
1028
+ inflight_tasks: int,
1029
+ operation_id: Optional[str],
1030
+ done: bool,
1031
+ ):
1032
+ if operation_id is None:
1033
+ return
1034
+
1035
+ # Don't build / render progress bar for non-interactive (despite
1036
+ # Ipython or non-IPython)
1037
+ if not environment.is_interactive():
1038
+ return
1039
+
1040
+ total_tasks = 0
1041
+ completed_tasks = 0
1042
+
1043
+ for stage in stages or []:
1044
+ total_tasks += stage.num_tasks
1045
+ completed_tasks += stage.num_completed_tasks
1046
+
1047
+ # Don't show progress bar till we receive some tasks
1048
+ if total_tasks == 0:
1049
+ return
1050
+
1051
+ # Get correct tqdm (notebook or CLI)
1052
+ tqdm_pbar = self.get_tqdm_bar()
1053
+
1054
+ # Use a lock to ensure only one thread can access and modify
1055
+ # the shared dictionaries at a time.
1056
+ with self._lock:
1057
+ if operation_id in self._execution_progress_bar:
1058
+ pbar = self._execution_progress_bar[operation_id]
1059
+ if pbar.total != total_tasks:
1060
+ pbar.reset(
1061
+ total=total_tasks
1062
+ ) # This force resets the progress bar % too on next refresh
1063
+ else:
1064
+ pbar = tqdm_pbar(
1065
+ total=total_tasks,
1066
+ leave=True,
1067
+ dynamic_ncols=True,
1068
+ bar_format="{l_bar}{bar} {n_fmt}/{total_fmt} Tasks",
1069
+ )
1070
+ self._execution_progress_bar[operation_id] = pbar
1071
+
1072
+ # To handle skipped or failed tasks.
1073
+ # StageInfo proto doesn't have skipped and failed tasks information to process.
1074
+ if done and completed_tasks < total_tasks:
1075
+ completed_tasks = total_tasks
1076
+
1077
+ pbar.n = completed_tasks
1078
+ pbar.refresh()
1079
+
1080
+ if done:
1081
+ pbar.close()
1082
+ self._execution_progress_bar.pop(operation_id, None)
1083
+
1084
+ self.registerProgressHandler(handler)
1085
+
709
1086
  @staticmethod
710
1087
  def _sql_lazy_transformation(req):
711
1088
  # Select SQL command
712
- if req.plan and req.plan.command and req.plan.command.sql_command:
713
- return (
714
- "select"
715
- in req.plan.command.sql_command.sql.strip().lower().split()
716
- )
717
-
718
- return False
1089
+ try:
1090
+ query = req.plan.command.sql_command.input.sql.query
1091
+ return "select" in query.strip().lower().split()
1092
+ except AttributeError:
1093
+ return False
719
1094
 
720
1095
  def _repr_html_(self) -> str:
721
1096
  if not self._active_s8s_session_id:
@@ -723,7 +1098,7 @@ class DataprocSparkSession(SparkSession):
723
1098
  <div>No Active Dataproc Session</div>
724
1099
  """
725
1100
 
726
- s8s_session = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{self._active_s8s_session_id}"
1101
+ s8s_session = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{self._active_s8s_session_id}"
727
1102
  ui = f"{s8s_session}/sparkApplications/applications"
728
1103
  return f"""
729
1104
  <div>
@@ -735,6 +1110,11 @@ class DataprocSparkSession(SparkSession):
735
1110
  """
736
1111
 
737
1112
  def _display_operation_link(self, operation_id: str):
1113
+ # Don't print per-operation Spark UI link for non-interactive (despite
1114
+ # Ipython or non-IPython)
1115
+ if not environment.is_interactive():
1116
+ return
1117
+
738
1118
  assert all(
739
1119
  [
740
1120
  operation_id is not None,
@@ -745,17 +1125,18 @@ class DataprocSparkSession(SparkSession):
745
1125
  )
746
1126
 
747
1127
  url = (
748
- f"https://console.cloud.google.com/dataproc/interactive/{self._region}/"
1128
+ f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/"
749
1129
  f"{self._active_s8s_session_id}/sparkApplications/application;"
750
1130
  f"associatedSqlOperationId={operation_id}?project={self._project_id}"
751
1131
  )
752
1132
 
1133
+ if environment.is_interactive_terminal():
1134
+ print(f"Spark Query: {url}")
1135
+ return
1136
+
753
1137
  try:
754
1138
  from IPython.display import display, HTML
755
- from IPython.core.interactiveshell import InteractiveShell
756
1139
 
757
- if not InteractiveShell.initialized():
758
- return
759
1140
  html_element = f"""
760
1141
  <div>
761
1142
  <p><a href="{url}">Spark Query</a> (Operation: {operation_id})</p>
@@ -813,7 +1194,7 @@ class DataprocSparkSession(SparkSession):
813
1194
  This is an API dedicated to Spark Connect client only. With regular Spark Session, it throws
814
1195
  an exception.
815
1196
  Regarding pypi: Popular packages are already pre-installed in s8s runtime.
816
- https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-2.2#python_libraries
1197
+ https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-2.3#python_libraries
817
1198
  If there are conflicts/package doesn't exist, it throws an exception.
818
1199
  """
819
1200
  if sum([pypi, file, pyfile, archive]) > 1:
@@ -836,19 +1217,83 @@ class DataprocSparkSession(SparkSession):
836
1217
  def _get_active_session_file_path():
837
1218
  return os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")
838
1219
 
839
- def stop(self) -> None:
1220
+ def stop(self, terminate: Optional[bool] = None) -> None:
1221
+ """
1222
+ Stop the Spark session and optionally terminate the server-side session.
1223
+
1224
+ Parameters
1225
+ ----------
1226
+ terminate : bool, optional
1227
+ Control server-side termination behavior.
1228
+
1229
+ - None (default): Auto-detect based on session type
1230
+
1231
+ - Managed sessions (auto-generated ID): terminate server
1232
+ - Named sessions (custom ID): client-side cleanup only
1233
+
1234
+ - True: Always terminate the server-side session
1235
+ - False: Never terminate the server-side session (client cleanup only)
1236
+
1237
+ Examples
1238
+ --------
1239
+ Auto-detect termination behavior (existing behavior):
1240
+
1241
+ >>> spark.stop()
1242
+
1243
+ Force terminate a named session:
1244
+
1245
+ >>> spark.stop(terminate=True)
1246
+
1247
+ Prevent termination of a managed session:
1248
+
1249
+ >>> spark.stop(terminate=False)
1250
+ """
840
1251
  with DataprocSparkSession._lock:
841
1252
  if DataprocSparkSession._active_s8s_session_id is not None:
842
- terminate_s8s_session(
843
- DataprocSparkSession._project_id,
844
- DataprocSparkSession._region,
845
- DataprocSparkSession._active_s8s_session_id,
846
- self._client_options,
847
- )
1253
+ # Determine if we should terminate the server-side session
1254
+ if terminate is None:
1255
+ # Auto-detect: managed sessions terminate, named sessions don't
1256
+ should_terminate = (
1257
+ not DataprocSparkSession._active_session_uses_custom_id
1258
+ )
1259
+ else:
1260
+ should_terminate = terminate
1261
+
1262
+ if should_terminate:
1263
+ # Terminate the server-side session
1264
+ logger.debug(
1265
+ f"Terminating session {DataprocSparkSession._active_s8s_session_id}"
1266
+ )
1267
+ terminate_s8s_session(
1268
+ DataprocSparkSession._project_id,
1269
+ DataprocSparkSession._region,
1270
+ DataprocSparkSession._active_s8s_session_id,
1271
+ self._client_options,
1272
+ )
1273
+ else:
1274
+ # Client-side cleanup only
1275
+ logger.debug(
1276
+ f"Stopping session {DataprocSparkSession._active_s8s_session_id} without termination"
1277
+ )
848
1278
 
849
1279
  self._remove_stopped_session_from_file()
1280
+
1281
+ # Clean up SparkSession._instantiatedSession if it points to this session
1282
+ try:
1283
+ from pyspark.sql import SparkSession as PySparkSQLSession
1284
+
1285
+ if PySparkSQLSession._instantiatedSession is self:
1286
+ PySparkSQLSession._instantiatedSession = None
1287
+ logger.debug(
1288
+ "Cleared SparkSession._instantiatedSession reference"
1289
+ )
1290
+ except (ImportError, AttributeError):
1291
+ # PySpark not available or _instantiatedSession doesn't exist
1292
+ pass
1293
+
850
1294
  DataprocSparkSession._active_s8s_session_uuid = None
851
1295
  DataprocSparkSession._active_s8s_session_id = None
1296
+ DataprocSparkSession._active_session_uses_custom_id = False
852
1297
  DataprocSparkSession._project_id = None
853
1298
  DataprocSparkSession._region = None
854
1299
  DataprocSparkSession._client_options = None
@@ -1,105 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: dataproc-spark-connect
3
- Version: 0.9.0
4
- Summary: Dataproc client library for Spark Connect
5
- Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
- Author: Google LLC
7
- License: Apache 2.0
8
- License-File: LICENSE
9
- Requires-Dist: google-api-core>=2.19
10
- Requires-Dist: google-cloud-dataproc>=5.18
11
- Requires-Dist: packaging>=20.0
12
- Requires-Dist: pyspark[connect]~=3.5.1
13
- Requires-Dist: tqdm>=4.67
14
- Requires-Dist: websockets>=14.0
15
- Dynamic: author
16
- Dynamic: description
17
- Dynamic: home-page
18
- Dynamic: license
19
- Dynamic: license-file
20
- Dynamic: requires-dist
21
- Dynamic: summary
22
-
23
- # Dataproc Spark Connect Client
24
-
25
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
26
- client with additional functionalities that allow applications to communicate
27
- with a remote Dataproc Spark Session using the Spark Connect protocol without
28
- requiring additional steps.
29
-
30
- ## Install
31
-
32
- ```sh
33
- pip install dataproc_spark_connect
34
- ```
35
-
36
- ## Uninstall
37
-
38
- ```sh
39
- pip uninstall dataproc_spark_connect
40
- ```
41
-
42
- ## Setup
43
-
44
- This client requires permissions to
45
- manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
46
- If you are running the client outside of Google Cloud, you must set following
47
- environment variables:
48
-
49
- * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
50
- workloads
51
- * `GOOGLE_CLOUD_REGION` - The Compute
52
- Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
53
- where you run the Spark workload.
54
- * `GOOGLE_APPLICATION_CREDENTIALS` -
55
- Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
56
-
57
- ## Usage
58
-
59
- 1. Install the latest version of Dataproc Python client and Dataproc Spark
60
- Connect modules:
61
-
62
- ```sh
63
- pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
64
- ```
65
-
66
- 2. Add the required imports into your PySpark application or notebook and start
67
- a Spark session with the following code instead of using
68
- environment variables:
69
-
70
- ```python
71
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
72
- from google.cloud.dataproc_v1 import Session
73
- session_config = Session()
74
- session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
75
- session_config.runtime_config.version = '2.2'
76
- spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
77
- ```
78
-
79
- ## Developing
80
-
81
- For development instructions see [guide](DEVELOPING.md).
82
-
83
- ## Contributing
84
-
85
- We'd love to accept your patches and contributions to this project. There are
86
- just a few small guidelines you need to follow.
87
-
88
- ### Contributor License Agreement
89
-
90
- Contributions to this project must be accompanied by a Contributor License
91
- Agreement. You (or your employer) retain the copyright to your contribution;
92
- this simply gives us permission to use and redistribute your contributions as
93
- part of the project. Head over to <https://cla.developers.google.com> to see
94
- your current agreements on file or to sign a new one.
95
-
96
- You generally only need to submit a CLA once, so if you've already submitted one
97
- (even if it was for a different project), you probably don't need to do it
98
- again.
99
-
100
- ### Code reviews
101
-
102
- All submissions, including submissions by project members, require review. We
103
- use GitHub pull requests for this purpose. Consult
104
- [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
105
- information on using pull requests.
@@ -1,13 +0,0 @@
1
- dataproc_spark_connect-0.9.0.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
2
- google/cloud/dataproc_spark_connect/__init__.py,sha256=dIqHNWVWWrSuRf26x11kX5e9yMKSHCtmI_GBj1-FDdE,1101
3
- google/cloud/dataproc_spark_connect/environment.py,sha256=UICy9XyqAxL-cryVWx7GZPRAxoir5LKk0dtqqY_l--c,2307
4
- google/cloud/dataproc_spark_connect/exceptions.py,sha256=WF-qdzgdofRwILCriIkjjsmjObZfF0P3Ecg4lv-Hmec,968
5
- google/cloud/dataproc_spark_connect/pypi_artifacts.py,sha256=gd-VMwiVP-EJuPp9Vf9Shx8pqps3oSKp0hBcSSZQS-A,1575
6
- google/cloud/dataproc_spark_connect/session.py,sha256=ELj5hDhofK1967eE5YaG_LP5B80KWFQWJn5gxi9yYt0,38577
7
- google/cloud/dataproc_spark_connect/client/__init__.py,sha256=6hCNSsgYlie6GuVpc5gjFsPnyeMTScTpXSPYqp1fplY,615
8
- google/cloud/dataproc_spark_connect/client/core.py,sha256=m3oXTKBm3sBy6jhDu9GRecrxLb5CdEM53SgMlnJb6ag,4616
9
- google/cloud/dataproc_spark_connect/client/proxy.py,sha256=qUZXvVY1yn934vE6nlO495XUZ53AUx9O74a9ozkGI9U,8976
10
- dataproc_spark_connect-0.9.0.dist-info/METADATA,sha256=1z8Ag1P_Lh9db0Rk9nGFoOu6sdeRs0UlrgtOqN_OhIQ,3465
11
- dataproc_spark_connect-0.9.0.dist-info/WHEEL,sha256=JNWh1Fm1UdwIQV075glCn4MVuCRs0sotJIq-J6rbxCU,109
12
- dataproc_spark_connect-0.9.0.dist-info/top_level.txt,sha256=_1QvSJIhFAGfxb79D6DhB7SUw2X6T4rwnz_LLrbcD3c,7
13
- dataproc_spark_connect-0.9.0.dist-info/RECORD,,