dataproc-spark-connect 1.0.0rc5__py2.py3-none-any.whl → 1.0.0rc7__py2.py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,200 @@
1
+ Metadata-Version: 2.4
2
+ Name: dataproc-spark-connect
3
+ Version: 1.0.0rc7
4
+ Summary: Dataproc client library for Spark Connect
5
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
+ Author: Google LLC
7
+ License: Apache 2.0
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: google-api-core>=2.19
11
+ Requires-Dist: google-cloud-dataproc>=5.18
12
+ Requires-Dist: packaging>=20.0
13
+ Requires-Dist: pyspark-client~=4.0.0
14
+ Requires-Dist: tqdm>=4.67
15
+ Requires-Dist: websockets>=14.0
16
+ Dynamic: author
17
+ Dynamic: description
18
+ Dynamic: home-page
19
+ Dynamic: license
20
+ Dynamic: license-file
21
+ Dynamic: requires-dist
22
+ Dynamic: summary
23
+
24
+ # Dataproc Spark Connect Client
25
+
26
+ A wrapper around the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
27
+ client that adds functionality allowing applications to communicate
28
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
29
+ requiring additional steps.
30
+
31
+ ## Install
32
+
33
+ ```sh
34
+ pip install dataproc_spark_connect
35
+ ```
36
+
37
+ ## Uninstall
38
+
39
+ ```sh
40
+ pip uninstall dataproc_spark_connect
41
+ ```
42
+
43
+ ## Setup
44
+
45
+ This client requires permissions to
46
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
47
+
48
+ If you are running the client outside of Google Cloud, you need to provide
49
+ authentication credentials. Set the `GOOGLE_APPLICATION_CREDENTIALS` environment
50
+ variable to point to
51
+ your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
52
+ file.
53
+
54
+ You can specify the project and region either via environment variables or directly
55
+ in your code using the builder API, as shown in the example after this list:
56
+
57
+ * Environment variables: `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION`
58
+ * Builder API: `.projectId()` and `.location()` methods (recommended)
59
+
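+ For example, either approach can be used when creating a session (a minimal sketch; the project ID and region below are placeholders, and you would normally pick just one of the two approaches):
+
+ ```python
+ import os
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
+
+ # Approach 1: environment variables, set before the session is created
+ os.environ['GOOGLE_CLOUD_PROJECT'] = 'my-project'
+ os.environ['GOOGLE_CLOUD_REGION'] = 'us-central1'
+ spark = DataprocSparkSession.builder.getOrCreate()
+
+ # Approach 2 (recommended): pass the project and region explicitly via the builder
+ spark = DataprocSparkSession.builder.projectId('my-project').location('us-central1').getOrCreate()
+ ```
+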
60
+ ## Usage
61
+
62
+ 1. Install the latest version of Dataproc Spark Connect:
63
+
64
+ ```sh
65
+ pip install -U dataproc-spark-connect
66
+ ```
67
+
68
+ 2. Add the required imports into your PySpark application or notebook and start
69
+ a Spark session using the fluent API:
70
+
71
+ ```python
72
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
73
+ spark = DataprocSparkSession.builder.getOrCreate()
74
+ ```
75
+
76
+ 3. You can configure Spark properties using the `.config()` method:
77
+
78
+ ```python
79
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
80
+ spark = DataprocSparkSession.builder.config('spark.executor.memory', '4g').config('spark.executor.cores', '2').getOrCreate()
81
+ ```
82
+
83
+ 4. For advanced configuration, you can use the `Session` class to customize
84
+ settings such as the subnetwork and other environment configuration options:
85
+
86
+ ```python
87
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
88
+ from google.cloud.dataproc_v1 import Session
89
+ session_config = Session()
90
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
91
+ session_config.runtime_config.version = '3.0'
92
+ spark = DataprocSparkSession.builder.projectId('my-project').location('us-central1').dataprocSessionConfig(session_config).getOrCreate()
93
+ ```
94
+
95
+ ### Reusing Named Sessions Across Notebooks
96
+
97
+ Named sessions allow you to share a single Spark session across multiple notebooks, improving efficiency by avoiding repeated session startup times and reducing costs.
98
+
99
+ To create or connect to a named session:
100
+
101
+ 1. Create a session with a custom ID in your first notebook:
102
+
103
+ ```python
104
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
105
+ session_id = 'my-ml-pipeline-session'
106
+ spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
107
+ df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
108
+ df.show()
109
+ ```
110
+
111
+ 2. Reuse the same session in another notebook by specifying the same session ID:
112
+
113
+ ```python
114
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
115
+ session_id = 'my-ml-pipeline-session'
116
+ spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
117
+ df = spark.createDataFrame([(2, 'more-data')], ['id', 'value'])
118
+ df.show()
119
+ ```
120
+
121
+ 3. Session IDs must be 4-63 characters long, start with a lowercase letter, contain only lowercase letters, numbers, and hyphens, and not end with a hyphen.
122
+
123
+ 4. Named sessions persist until they are explicitly terminated or reach their configured TTL.
124
+
125
+ 5. A session with a given ID that is in a TERMINATED state cannot be reused; it must be deleted before a new session with the same ID can be created, as shown in the example below.
126
+
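+ For example, a TERMINATED named session can be deleted with the `google-cloud-dataproc` client, which is already installed as a dependency. This is a minimal sketch; the project, region, and session ID are placeholders, and the regional endpoint is assumed to follow the usual `<region>-dataproc.googleapis.com` pattern:
+
+ ```python
+ from google.cloud.dataproc_v1 import SessionControllerClient
+
+ # Point the client at the regional Dataproc endpoint
+ client = SessionControllerClient(
+     client_options={'api_endpoint': 'us-central1-dataproc.googleapis.com:443'}
+ )
+
+ # Delete the terminated session so the same ID can be reused
+ client.delete_session(
+     name='projects/my-project/locations/us-central1/sessions/my-ml-pipeline-session'
+ )
+ ```
+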
127
+ ### Using Spark SQL Magic Commands (Jupyter Notebooks)
128
+
129
+ The package supports the [sparksql-magic](https://github.com/cryeo/sparksql-magic) library for executing Spark SQL queries directly in Jupyter notebooks.
130
+
131
+ **Installation**: To use magic commands, install the required dependencies manually:
132
+ ```bash
133
+ pip install dataproc-spark-connect
134
+ pip install IPython sparksql-magic
135
+ ```
136
+
137
+ 1. Load the magic extension:
138
+ ```python
139
+ %load_ext sparksql_magic
140
+ ```
141
+
142
+ 2. Configure default settings (optional):
143
+ ```python
144
+ %config SparkSql.limit=20
145
+ ```
146
+
147
+ 3. Execute SQL queries:
148
+ ```python
149
+ %%sparksql
150
+ SELECT * FROM your_table
151
+ ```
152
+
153
+ 4. Advanced usage with options:
154
+ ```python
155
+ %%sparksql --cache --view result_view df
156
+ -- Cache the result in df and create the temporary view result_view
157
+ SELECT * FROM your_table WHERE condition = true
158
+ ```
159
+
160
+ Available options:
161
+ - `--cache` / `-c`: Cache the DataFrame
162
+ - `--eager` / `-e`: Cache with eager loading
163
+ - `--view VIEW` / `-v VIEW`: Create a temporary view
164
+ - `--limit N` / `-l N`: Override default row display limit
165
+ - `variable_name`: Store result in a variable
166
+
167
+ See [sparksql-magic](https://github.com/cryeo/sparksql-magic) for more examples.
168
+
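+ Once the advanced cell above has run, its output can be used from regular Python cells. A minimal sketch, assuming that cell stored its result in `df`, created the temporary view `result_view`, and that `spark` is the session created earlier:
+
+ ```python
+ # The captured result is a regular Spark DataFrame
+ df.printSchema()
+
+ # The temporary view can be queried through the active session
+ spark.sql('SELECT COUNT(*) FROM result_view').show()
+ ```
+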
169
+ **Note**: Magic commands are optional. If you only need basic `DataprocSparkSession` functionality without Jupyter magic support, install only the base package:
170
+ ```bash
171
+ pip install dataproc-spark-connect
172
+ ```
173
+
174
+ ## Developing
175
+
176
+ For development instructions, see the [development guide](DEVELOPING.md).
177
+
178
+ ## Contributing
179
+
180
+ We'd love to accept your patches and contributions to this project. There are
181
+ just a few small guidelines you need to follow.
182
+
183
+ ### Contributor License Agreement
184
+
185
+ Contributions to this project must be accompanied by a Contributor License
186
+ Agreement. You (or your employer) retain the copyright to your contribution;
187
+ this simply gives us permission to use and redistribute your contributions as
188
+ part of the project. Head over to <https://cla.developers.google.com> to see
189
+ your current agreements on file or to sign a new one.
190
+
191
+ You generally only need to submit a CLA once, so if you've already submitted one
192
+ (even if it was for a different project), you probably don't need to do it
193
+ again.
194
+
195
+ ### Code reviews
196
+
197
+ All submissions, including submissions by project members, require review. We
198
+ use GitHub pull requests for this purpose. Consult
199
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
200
+ information on using pull requests.
@@ -1,13 +1,13 @@
1
- dataproc_spark_connect-1.0.0rc5.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
1
+ dataproc_spark_connect-1.0.0rc7.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
2
2
  google/cloud/dataproc_spark_connect/__init__.py,sha256=dIqHNWVWWrSuRf26x11kX5e9yMKSHCtmI_GBj1-FDdE,1101
3
- google/cloud/dataproc_spark_connect/environment.py,sha256=H4KcT-_X64oKlQ9vFhfoRSh5JrmyHgFGCeo8UOAztiM,2678
4
- google/cloud/dataproc_spark_connect/exceptions.py,sha256=WF-qdzgdofRwILCriIkjjsmjObZfF0P3Ecg4lv-Hmec,968
3
+ google/cloud/dataproc_spark_connect/environment.py,sha256=o5WRKI1vyIaxZ8S2UhtDer6pdi4CXYRzI9Xdpq5hVkQ,2771
4
+ google/cloud/dataproc_spark_connect/exceptions.py,sha256=iwaHgNabcaxqquOpktGkOWKHMf8hgdPQJUgRnIbTXVs,970
5
5
  google/cloud/dataproc_spark_connect/pypi_artifacts.py,sha256=gd-VMwiVP-EJuPp9Vf9Shx8pqps3oSKp0hBcSSZQS-A,1575
6
- google/cloud/dataproc_spark_connect/session.py,sha256=e1Z3xpjgZimcaYVrxzFhlMnhWmyxp7v7TTltuQqjhbA,51461
6
+ google/cloud/dataproc_spark_connect/session.py,sha256=ntkuGzONtC3vXooUnHlHwhY32HP0hB4O9Ze-LVCAlKk,55470
7
7
  google/cloud/dataproc_spark_connect/client/__init__.py,sha256=6hCNSsgYlie6GuVpc5gjFsPnyeMTScTpXSPYqp1fplY,615
8
8
  google/cloud/dataproc_spark_connect/client/core.py,sha256=GRc4OCTBvIvdagjxOPoDO22vLtt8xDSerdREMRDeUBY,4659
9
9
  google/cloud/dataproc_spark_connect/client/proxy.py,sha256=qUZXvVY1yn934vE6nlO495XUZ53AUx9O74a9ozkGI9U,8976
10
- dataproc_spark_connect-1.0.0rc5.dist-info/METADATA,sha256=sLRphUFOBZYU8T7h4IgDkGs8EvhqZ1Fm5FMw5-SWi2A,3468
11
- dataproc_spark_connect-1.0.0rc5.dist-info/WHEEL,sha256=JNWh1Fm1UdwIQV075glCn4MVuCRs0sotJIq-J6rbxCU,109
12
- dataproc_spark_connect-1.0.0rc5.dist-info/top_level.txt,sha256=_1QvSJIhFAGfxb79D6DhB7SUw2X6T4rwnz_LLrbcD3c,7
13
- dataproc_spark_connect-1.0.0rc5.dist-info/RECORD,,
10
+ dataproc_spark_connect-1.0.0rc7.dist-info/METADATA,sha256=eln8dtAOo-JrzHyxOdLoTLc0a0MyvLQk9ZfGXySiSRY,6841
11
+ dataproc_spark_connect-1.0.0rc7.dist-info/WHEEL,sha256=JNWh1Fm1UdwIQV075glCn4MVuCRs0sotJIq-J6rbxCU,109
12
+ dataproc_spark_connect-1.0.0rc7.dist-info/top_level.txt,sha256=_1QvSJIhFAGfxb79D6DhB7SUw2X6T4rwnz_LLrbcD3c,7
13
+ dataproc_spark_connect-1.0.0rc7.dist-info/RECORD,,
@@ -67,6 +67,10 @@ def is_interactive_terminal():
67
67
  return is_interactive() and is_terminal()
68
68
 
69
69
 
70
+ def is_dataproc_batch() -> bool:
71
+ return os.getenv("DATAPROC_WORKLOAD_TYPE") == "batch"
72
+
73
+
70
74
  def get_client_environment_label() -> str:
71
75
  """
72
76
  Map current environment to a standardized client label.
@@ -24,4 +24,4 @@ class DataprocSparkConnectException(Exception):
24
24
  super().__init__(message)
25
25
 
26
26
  def _render_traceback_(self):
27
- return self.message
27
+ return [self.message]
@@ -14,6 +14,7 @@
14
14
 
15
15
  import atexit
16
16
  import datetime
17
+ import functools
17
18
  import json
18
19
  import logging
19
20
  import os
@@ -25,8 +26,6 @@ import time
25
26
  import uuid
26
27
  import tqdm
27
28
  from packaging import version
28
- from tqdm import tqdm as cli_tqdm
29
- from tqdm.notebook import tqdm as notebook_tqdm
30
29
  from types import MethodType
31
30
  from typing import Any, cast, ClassVar, Dict, Iterable, Optional, Union
32
31
 
@@ -67,6 +66,10 @@ SYSTEM_LABELS = {
67
66
  "goog-colab-notebook-id",
68
67
  }
69
68
 
69
+ _DATAPROC_SESSIONS_BASE_URL = (
70
+ "https://console.cloud.google.com/dataproc/interactive"
71
+ )
72
+
70
73
 
71
74
  def _is_valid_label_value(value: str) -> bool:
72
75
  """
@@ -472,16 +475,43 @@ class DataprocSparkSession(SparkSession):
472
475
  session_response, dataproc_config.name
473
476
  )
474
477
 
478
+ def _wait_for_session_available(
479
+ self, session_name: str, timeout: int = 300
480
+ ) -> Session:
481
+ start_time = time.time()
482
+ while time.time() - start_time < timeout:
483
+ try:
484
+ session = self.session_controller_client.get_session(
485
+ name=session_name
486
+ )
487
+ if "Spark Connect Server" in session.runtime_info.endpoints:
488
+ return session
489
+ time.sleep(5)
490
+ except Exception as e:
491
+ logger.warning(
492
+ f"Error while polling for Spark Connect endpoint: {e}"
493
+ )
494
+ time.sleep(5)
495
+ raise RuntimeError(
496
+ f"Spark Connect endpoint not available for session {session_name} after {timeout} seconds."
497
+ )
498
+
475
499
  def _display_session_link_on_creation(self, session_id):
476
- session_url = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
500
+ session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{session_id}?project={self._project_id}"
477
501
  plain_message = f"Creating Dataproc Session: {session_url}"
478
- html_element = f"""
502
+ if environment.is_colab_enterprise():
503
+ html_element = f"""
479
504
  <div>
480
505
  <p>Creating Dataproc Spark Session<p>
481
- <p><a href="{session_url}">Dataproc Session</a></p>
482
506
  </div>
483
- """
484
-
507
+ """
508
+ else:
509
+ html_element = f"""
510
+ <div>
511
+ <p>Creating Dataproc Spark Session<p>
512
+ <p><a href="{session_url}">Dataproc Session</a></p>
513
+ </div>
514
+ """
485
515
  self._output_element_or_message(plain_message, html_element)
486
516
 
487
517
  def _print_session_created_message(self):
@@ -533,10 +563,13 @@ class DataprocSparkSession(SparkSession):
533
563
 
534
564
  if session_response is not None:
535
565
  print(
536
- f"Using existing Dataproc Session (configuration changes may not be applied): https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}"
566
+ f"Using existing Dataproc Session (configuration changes may not be applied): {_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{s8s_session_id}?project={self._project_id}"
537
567
  )
538
568
  self._display_view_session_details_button(s8s_session_id)
539
569
  if session is None:
570
+ session_response = self._wait_for_session_available(
571
+ session_name
572
+ )
540
573
  session = self.__create_spark_connect_session_from_s8s(
541
574
  session_response, session_name
542
575
  )
@@ -552,6 +585,13 @@ class DataprocSparkSession(SparkSession):
552
585
 
553
586
  def getOrCreate(self) -> "DataprocSparkSession":
554
587
  with DataprocSparkSession._lock:
588
+ if environment.is_dataproc_batch():
589
+ # For Dataproc batch workloads, connect to the already initialized local SparkSession
590
+ from pyspark.sql import SparkSession as PySparkSQLSession
591
+
592
+ session = PySparkSQLSession.builder.getOrCreate()
593
+ return session # type: ignore
594
+
555
595
  # Handle custom session ID by setting it early and letting existing logic handle it
556
596
  if self._custom_session_id:
557
597
  self._handle_custom_session_id()
@@ -559,6 +599,13 @@ class DataprocSparkSession(SparkSession):
559
599
  session = self._get_exiting_active_session()
560
600
  if session is None:
561
601
  session = self.__create()
602
+
603
+ # Register this session as the instantiated SparkSession for compatibility
604
+ # with tools and libraries that expect SparkSession._instantiatedSession
605
+ from pyspark.sql import SparkSession as PySparkSQLSession
606
+
607
+ PySparkSQLSession._instantiatedSession = session
608
+
562
609
  return session
563
610
 
564
611
  def _handle_custom_session_id(self):
@@ -673,8 +720,6 @@ class DataprocSparkSession(SparkSession):
673
720
  # Merge default configs with existing properties,
674
721
  # user configs take precedence
675
722
  for k, v in {
676
- "spark.datasource.bigquery.viewsEnabled": "true",
677
- "spark.datasource.bigquery.writeMethod": "direct",
678
723
  "spark.sql.catalog.spark_catalog": "com.google.cloud.spark.bigquery.BigQuerySparkSessionCatalog",
679
724
  "spark.sql.sources.default": "bigquery",
680
725
  }.items():
@@ -696,7 +741,7 @@ class DataprocSparkSession(SparkSession):
696
741
 
697
742
  # Runtime version to server Python version mapping
698
743
  RUNTIME_PYTHON_MAP = {
699
- "3.0": (3, 11),
744
+ "3.0": (3, 12),
700
745
  }
701
746
 
702
747
  client_python = sys.version_info[:2] # (major, minor)
@@ -760,7 +805,7 @@ class DataprocSparkSession(SparkSession):
760
805
  return
761
806
 
762
807
  try:
763
- session_url = f"https://console.cloud.google.com/dataproc/interactive/sessions/{session_id}/locations/{self._region}?project={self._project_id}"
808
+ session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{session_id}?project={self._project_id}"
764
809
  from IPython.core.interactiveshell import InteractiveShell
765
810
 
766
811
  if not InteractiveShell.initialized():
@@ -943,6 +988,28 @@ class DataprocSparkSession(SparkSession):
943
988
  clearProgressHandlers_wrapper_method, self
944
989
  )
945
990
 
991
+ @staticmethod
992
+ @functools.lru_cache(maxsize=1)
993
+ def get_tqdm_bar():
994
+ """
995
+ Return a tqdm implementation that works in the current environment.
996
+
997
+ - Uses CLI tqdm for interactive terminals.
998
+ - Uses the notebook tqdm if available, otherwise falls back to CLI tqdm.
999
+ """
1000
+ from tqdm import tqdm as cli_tqdm
1001
+
1002
+ if environment.is_interactive_terminal():
1003
+ return cli_tqdm
1004
+
1005
+ try:
1006
+ import ipywidgets
1007
+ from tqdm.notebook import tqdm as notebook_tqdm
1008
+
1009
+ return notebook_tqdm
1010
+ except ImportError:
1011
+ return cli_tqdm
1012
+
946
1013
  def _register_progress_execution_handler(self):
947
1014
  from pyspark.sql.connect.shell.progress import StageInfo
948
1015
 
@@ -967,9 +1034,12 @@ class DataprocSparkSession(SparkSession):
967
1034
  total_tasks += stage.num_tasks
968
1035
  completed_tasks += stage.num_completed_tasks
969
1036
 
970
- tqdm_pbar = notebook_tqdm
971
- if environment.is_interactive_terminal():
972
- tqdm_pbar = cli_tqdm
1037
+ # Don't show progress bar till we receive some tasks
1038
+ if total_tasks == 0:
1039
+ return
1040
+
1041
+ # Get correct tqdm (notebook or CLI)
1042
+ tqdm_pbar = self.get_tqdm_bar()
973
1043
 
974
1044
  # Use a lock to ensure only one thread can access and modify
975
1045
  # the shared dictionaries at a time.
@@ -1006,13 +1076,11 @@ class DataprocSparkSession(SparkSession):
1006
1076
  @staticmethod
1007
1077
  def _sql_lazy_transformation(req):
1008
1078
  # Select SQL command
1009
- if req.plan and req.plan.command and req.plan.command.sql_command:
1010
- return (
1011
- "select"
1012
- in req.plan.command.sql_command.sql.strip().lower().split()
1013
- )
1014
-
1015
- return False
1079
+ try:
1080
+ query = req.plan.command.sql_command.input.sql.query
1081
+ return "select" in query.strip().lower().split()
1082
+ except AttributeError:
1083
+ return False
1016
1084
 
1017
1085
  def _repr_html_(self) -> str:
1018
1086
  if not self._active_s8s_session_id:
@@ -1020,7 +1088,7 @@ class DataprocSparkSession(SparkSession):
1020
1088
  <div>No Active Dataproc Session</div>
1021
1089
  """
1022
1090
 
1023
- s8s_session = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{self._active_s8s_session_id}"
1091
+ s8s_session = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{self._active_s8s_session_id}"
1024
1092
  ui = f"{s8s_session}/sparkApplications/applications"
1025
1093
  return f"""
1026
1094
  <div>
@@ -1047,7 +1115,7 @@ class DataprocSparkSession(SparkSession):
1047
1115
  )
1048
1116
 
1049
1117
  url = (
1050
- f"https://console.cloud.google.com/dataproc/interactive/{self._region}/"
1118
+ f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/"
1051
1119
  f"{self._active_s8s_session_id}/sparkApplications/application;"
1052
1120
  f"associatedSqlOperationId={operation_id}?project={self._project_id}"
1053
1121
  )
@@ -1139,20 +1207,52 @@ class DataprocSparkSession(SparkSession):
1139
1207
  def _get_active_session_file_path():
1140
1208
  return os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")
1141
1209
 
1142
- def stop(self) -> None:
1210
+ def stop(self, terminate: Optional[bool] = None) -> None:
1211
+ """
1212
+ Stop the Spark session and optionally terminate the server-side session.
1213
+
1214
+ Parameters
1215
+ ----------
1216
+ terminate : bool, optional
1217
+ Control server-side termination behavior.
1218
+
1219
+ - None (default): Auto-detect based on session type
1220
+
1221
+ - Managed sessions (auto-generated ID): terminate server
1222
+ - Named sessions (custom ID): client-side cleanup only
1223
+
1224
+ - True: Always terminate the server-side session
1225
+ - False: Never terminate the server-side session (client cleanup only)
1226
+
1227
+ Examples
1228
+ --------
1229
+ Auto-detect termination behavior (existing behavior):
1230
+
1231
+ >>> spark.stop()
1232
+
1233
+ Force terminate a named session:
1234
+
1235
+ >>> spark.stop(terminate=True)
1236
+
1237
+ Prevent termination of a managed session:
1238
+
1239
+ >>> spark.stop(terminate=False)
1240
+ """
1143
1241
  with DataprocSparkSession._lock:
1144
1242
  if DataprocSparkSession._active_s8s_session_id is not None:
1145
- # Check if this is a managed session (auto-generated ID) or unmanaged session (custom ID)
1146
- if DataprocSparkSession._active_session_uses_custom_id:
1147
- # Unmanaged session (custom ID): Only clean up client-side state
1148
- # Don't terminate as it might be in use by other notebooks or clients
1149
- logger.debug(
1150
- f"Stopping unmanaged session {DataprocSparkSession._active_s8s_session_id} without termination"
1243
+ # Determine if we should terminate the server-side session
1244
+ if terminate is None:
1245
+ # Auto-detect: managed sessions terminate, named sessions don't
1246
+ should_terminate = (
1247
+ not DataprocSparkSession._active_session_uses_custom_id
1151
1248
  )
1152
1249
  else:
1153
- # Managed session (auto-generated ID): Use original behavior and terminate
1250
+ should_terminate = terminate
1251
+
1252
+ if should_terminate:
1253
+ # Terminate the server-side session
1154
1254
  logger.debug(
1155
- f"Terminating managed session {DataprocSparkSession._active_s8s_session_id}"
1255
+ f"Terminating session {DataprocSparkSession._active_s8s_session_id}"
1156
1256
  )
1157
1257
  terminate_s8s_session(
1158
1258
  DataprocSparkSession._project_id,
@@ -1160,8 +1260,27 @@ class DataprocSparkSession(SparkSession):
1160
1260
  DataprocSparkSession._active_s8s_session_id,
1161
1261
  self._client_options,
1162
1262
  )
1263
+ else:
1264
+ # Client-side cleanup only
1265
+ logger.debug(
1266
+ f"Stopping session {DataprocSparkSession._active_s8s_session_id} without termination"
1267
+ )
1163
1268
 
1164
1269
  self._remove_stopped_session_from_file()
1270
+
1271
+ # Clean up SparkSession._instantiatedSession if it points to this session
1272
+ try:
1273
+ from pyspark.sql import SparkSession as PySparkSQLSession
1274
+
1275
+ if PySparkSQLSession._instantiatedSession is self:
1276
+ PySparkSQLSession._instantiatedSession = None
1277
+ logger.debug(
1278
+ "Cleared SparkSession._instantiatedSession reference"
1279
+ )
1280
+ except (ImportError, AttributeError):
1281
+ # PySpark not available or _instantiatedSession doesn't exist
1282
+ pass
1283
+
1165
1284
  DataprocSparkSession._active_s8s_session_uuid = None
1166
1285
  DataprocSparkSession._active_s8s_session_id = None
1167
1286
  DataprocSparkSession._active_session_uses_custom_id = False
@@ -1,105 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: dataproc-spark-connect
3
- Version: 1.0.0rc5
4
- Summary: Dataproc client library for Spark Connect
5
- Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
- Author: Google LLC
7
- License: Apache 2.0
8
- License-File: LICENSE
9
- Requires-Dist: google-api-core>=2.19
10
- Requires-Dist: google-cloud-dataproc>=5.18
11
- Requires-Dist: packaging>=20.0
12
- Requires-Dist: pyspark[connect]~=4.0.0
13
- Requires-Dist: tqdm>=4.67
14
- Requires-Dist: websockets>=14.0
15
- Dynamic: author
16
- Dynamic: description
17
- Dynamic: home-page
18
- Dynamic: license
19
- Dynamic: license-file
20
- Dynamic: requires-dist
21
- Dynamic: summary
22
-
23
- # Dataproc Spark Connect Client
24
-
25
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
26
- client with additional functionalities that allow applications to communicate
27
- with a remote Dataproc Spark Session using the Spark Connect protocol without
28
- requiring additional steps.
29
-
30
- ## Install
31
-
32
- ```sh
33
- pip install dataproc_spark_connect
34
- ```
35
-
36
- ## Uninstall
37
-
38
- ```sh
39
- pip uninstall dataproc_spark_connect
40
- ```
41
-
42
- ## Setup
43
-
44
- This client requires permissions to
45
- manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
46
- If you are running the client outside of Google Cloud, you must set following
47
- environment variables:
48
-
49
- * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
50
- workloads
51
- * `GOOGLE_CLOUD_REGION` - The Compute
52
- Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
53
- where you run the Spark workload.
54
- * `GOOGLE_APPLICATION_CREDENTIALS` -
55
- Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
56
-
57
- ## Usage
58
-
59
- 1. Install the latest version of Dataproc Python client and Dataproc Spark
60
- Connect modules:
61
-
62
- ```sh
63
- pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
64
- ```
65
-
66
- 2. Add the required imports into your PySpark application or notebook and start
67
- a Spark session with the following code instead of using
68
- environment variables:
69
-
70
- ```python
71
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
72
- from google.cloud.dataproc_v1 import Session
73
- session_config = Session()
74
- session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
75
- session_config.runtime_config.version = '2.2'
76
- spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
77
- ```
78
-
79
- ## Developing
80
-
81
- For development instructions see [guide](DEVELOPING.md).
82
-
83
- ## Contributing
84
-
85
- We'd love to accept your patches and contributions to this project. There are
86
- just a few small guidelines you need to follow.
87
-
88
- ### Contributor License Agreement
89
-
90
- Contributions to this project must be accompanied by a Contributor License
91
- Agreement. You (or your employer) retain the copyright to your contribution;
92
- this simply gives us permission to use and redistribute your contributions as
93
- part of the project. Head over to <https://cla.developers.google.com> to see
94
- your current agreements on file or to sign a new one.
95
-
96
- You generally only need to submit a CLA once, so if you've already submitted one
97
- (even if it was for a different project), you probably don't need to do it
98
- again.
99
-
100
- ### Code reviews
101
-
102
- All submissions, including submissions by project members, require review. We
103
- use GitHub pull requests for this purpose. Consult
104
- [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
105
- information on using pull requests.