dataproc-spark-connect 0.6.0__tar.gz → 0.7.0__tar.gz

This diff shows the changes between publicly released versions of this package, as published to their respective public registries. It is provided for informational purposes only.
Files changed (22)
  1. dataproc_spark_connect-0.7.0/PKG-INFO +98 -0
  2. dataproc_spark_connect-0.7.0/README.md +83 -0
  3. dataproc_spark_connect-0.7.0/dataproc_spark_connect.egg-info/PKG-INFO +98 -0
  4. dataproc_spark_connect-0.7.0/dataproc_spark_connect.egg-info/requires.txt +6 -0
  5. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/client/proxy.py +3 -3
  6. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/session.py +186 -172
  7. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/setup.py +6 -5
  8. dataproc_spark_connect-0.6.0/PKG-INFO +0 -111
  9. dataproc_spark_connect-0.6.0/README.md +0 -97
  10. dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/PKG-INFO +0 -111
  11. dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/requires.txt +0 -5
  12. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/LICENSE +0 -0
  13. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/dataproc_spark_connect.egg-info/SOURCES.txt +0 -0
  14. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
  15. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
  16. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/__init__.py +0 -0
  17. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
  18. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/client/core.py +0 -0
  19. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/exceptions.py +0 -0
  20. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/pypi_artifacts.py +0 -0
  21. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/pyproject.toml +0 -0
  22. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/setup.cfg +0 -0
@@ -0,0 +1,98 @@
1
+ Metadata-Version: 2.1
2
+ Name: dataproc-spark-connect
3
+ Version: 0.7.0
4
+ Summary: Dataproc client library for Spark Connect
5
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
+ Author: Google LLC
7
+ License: Apache 2.0
8
+ License-File: LICENSE
9
+ Requires-Dist: google-api-core>=2.19
10
+ Requires-Dist: google-cloud-dataproc>=5.18
11
+ Requires-Dist: packaging>=20.0
12
+ Requires-Dist: pyspark[connect]>=3.5
13
+ Requires-Dist: tqdm>=4.67
14
+ Requires-Dist: websockets>=15.0
15
+
16
+ # Dataproc Spark Connect Client
17
+
18
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
19
+ client with additional functionalities that allow applications to communicate
20
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
21
+ requiring additional steps.
22
+
23
+ ## Install
24
+
25
+ ```sh
26
+ pip install dataproc_spark_connect
27
+ ```
28
+
29
+ ## Uninstall
30
+
31
+ ```sh
32
+ pip uninstall dataproc_spark_connect
33
+ ```
34
+
35
+ ## Setup
36
+
37
+ This client requires permissions to
38
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
39
+ If you are running the client outside of Google Cloud, you must set the following
40
+ environment variables:
41
+
42
+ * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
43
+ workloads
44
+ * `GOOGLE_CLOUD_REGION` - The Compute
45
+ Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
46
+ where you run the Spark workload.
47
+ * `GOOGLE_APPLICATION_CREDENTIALS` -
48
+ Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
49
+
50
+ ## Usage
51
+
52
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark
53
+ Connect modules:
54
+
55
+ ```sh
56
+ pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
57
+ ```
58
+
59
+ 2. Add the required imports into your PySpark application or notebook and start
60
+ a Spark session with the following code instead of using
61
+ environment variables:
62
+
63
+ ```python
64
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
65
+ from google.cloud.dataproc_v1 import Session
66
+ session_config = Session()
67
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
68
+ session_config.runtime_config.version = '2.2'
69
+ spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
70
+ ```
71
+
72
+ ## Developing
73
+
74
+ For development instructions, see the [guide](DEVELOPING.md).
75
+
76
+ ## Contributing
77
+
78
+ We'd love to accept your patches and contributions to this project. There are
79
+ just a few small guidelines you need to follow.
80
+
81
+ ### Contributor License Agreement
82
+
83
+ Contributions to this project must be accompanied by a Contributor License
84
+ Agreement. You (or your employer) retain the copyright to your contribution;
85
+ this simply gives us permission to use and redistribute your contributions as
86
+ part of the project. Head over to <https://cla.developers.google.com> to see
87
+ your current agreements on file or to sign a new one.
88
+
89
+ You generally only need to submit a CLA once, so if you've already submitted one
90
+ (even if it was for a different project), you probably don't need to do it
91
+ again.
92
+
93
+ ### Code reviews
94
+
95
+ All submissions, including submissions by project members, require review. We
96
+ use GitHub pull requests for this purpose. Consult
97
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
98
+ information on using pull requests.
@@ -0,0 +1,83 @@
1
+ # Dataproc Spark Connect Client
2
+
3
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
4
+ client with additional functionalities that allow applications to communicate
5
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
6
+ requiring additional steps.
7
+
8
+ ## Install
9
+
10
+ ```sh
11
+ pip install dataproc_spark_connect
12
+ ```
13
+
14
+ ## Uninstall
15
+
16
+ ```sh
17
+ pip uninstall dataproc_spark_connect
18
+ ```
19
+
20
+ ## Setup
21
+
22
+ This client requires permissions to
23
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
24
+ If you are running the client outside of Google Cloud, you must set the following
25
+ environment variables:
26
+
27
+ * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
28
+ workloads
29
+ * `GOOGLE_CLOUD_REGION` - The Compute
30
+ Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
31
+ where you run the Spark workload.
32
+ * `GOOGLE_APPLICATION_CREDENTIALS` -
33
+ Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
34
+
35
+ ## Usage
36
+
37
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark
38
+ Connect modules:
39
+
40
+ ```sh
41
+ pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
42
+ ```
43
+
44
+ 2. Add the required imports into your PySpark application or notebook and start
45
+ a Spark session with the following code instead of using
46
+ environment variables:
47
+
48
+ ```python
49
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
50
+ from google.cloud.dataproc_v1 import Session
51
+ session_config = Session()
52
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
53
+ session_config.runtime_config.version = '2.2'
54
+ spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
55
+ ```
56
+
57
+ ## Developing
58
+
59
+ For development instructions, see the [guide](DEVELOPING.md).
60
+
61
+ ## Contributing
62
+
63
+ We'd love to accept your patches and contributions to this project. There are
64
+ just a few small guidelines you need to follow.
65
+
66
+ ### Contributor License Agreement
67
+
68
+ Contributions to this project must be accompanied by a Contributor License
69
+ Agreement. You (or your employer) retain the copyright to your contribution;
70
+ this simply gives us permission to use and redistribute your contributions as
71
+ part of the project. Head over to <https://cla.developers.google.com> to see
72
+ your current agreements on file or to sign a new one.
73
+
74
+ You generally only need to submit a CLA once, so if you've already submitted one
75
+ (even if it was for a different project), you probably don't need to do it
76
+ again.
77
+
78
+ ### Code reviews
79
+
80
+ All submissions, including submissions by project members, require review. We
81
+ use GitHub pull requests for this purpose. Consult
82
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
83
+ information on using pull requests.
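
The Setup and Usage sections above take the project and region either from the explicit `Session` config or from environment variables. A minimal sketch of the environment-variable path is below; the project ID and region are placeholders, and whether a default network suffices depends on the project's networking setup.

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Placeholder values; in practice these would be exported in the shell
# or already present in the notebook environment.
os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"      # hypothetical project ID
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"      # hypothetical region

# With no explicit Session config, the builder falls back to these
# environment variables when creating the Dataproc Session.
spark = DataprocSparkSession.builder.getOrCreate()
spark.stop()
```
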
@@ -0,0 +1,98 @@
1
+ Metadata-Version: 2.1
2
+ Name: dataproc-spark-connect
3
+ Version: 0.7.0
4
+ Summary: Dataproc client library for Spark Connect
5
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
+ Author: Google LLC
7
+ License: Apache 2.0
8
+ License-File: LICENSE
9
+ Requires-Dist: google-api-core>=2.19
10
+ Requires-Dist: google-cloud-dataproc>=5.18
11
+ Requires-Dist: packaging>=20.0
12
+ Requires-Dist: pyspark[connect]>=3.5
13
+ Requires-Dist: tqdm>=4.67
14
+ Requires-Dist: websockets>=15.0
15
+
16
+ # Dataproc Spark Connect Client
17
+
18
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
19
+ client with additional functionalities that allow applications to communicate
20
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
21
+ requiring additional steps.
22
+
23
+ ## Install
24
+
25
+ ```sh
26
+ pip install dataproc_spark_connect
27
+ ```
28
+
29
+ ## Uninstall
30
+
31
+ ```sh
32
+ pip uninstall dataproc_spark_connect
33
+ ```
34
+
35
+ ## Setup
36
+
37
+ This client requires permissions to
38
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
39
+ If you are running the client outside of Google Cloud, you must set the following
40
+ environment variables:
41
+
42
+ * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
43
+ workloads
44
+ * `GOOGLE_CLOUD_REGION` - The Compute
45
+ Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
46
+ where you run the Spark workload.
47
+ * `GOOGLE_APPLICATION_CREDENTIALS` -
48
+ Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
49
+
50
+ ## Usage
51
+
52
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark
53
+ Connect modules:
54
+
55
+ ```sh
56
+ pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
57
+ ```
58
+
59
+ 2. Add the required imports into your PySpark application or notebook and start
60
+ a Spark session with the following code instead of using
61
+ environment variables:
62
+
63
+ ```python
64
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
65
+ from google.cloud.dataproc_v1 import Session
66
+ session_config = Session()
67
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
68
+ session_config.runtime_config.version = '2.2'
69
+ spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
70
+ ```
71
+
72
+ ## Developing
73
+
74
+ For development instructions, see the [guide](DEVELOPING.md).
75
+
76
+ ## Contributing
77
+
78
+ We'd love to accept your patches and contributions to this project. There are
79
+ just a few small guidelines you need to follow.
80
+
81
+ ### Contributor License Agreement
82
+
83
+ Contributions to this project must be accompanied by a Contributor License
84
+ Agreement. You (or your employer) retain the copyright to your contribution;
85
+ this simply gives us permission to use and redistribute your contributions as
86
+ part of the project. Head over to <https://cla.developers.google.com> to see
87
+ your current agreements on file or to sign a new one.
88
+
89
+ You generally only need to submit a CLA once, so if you've already submitted one
90
+ (even if it was for a different project), you probably don't need to do it
91
+ again.
92
+
93
+ ### Code reviews
94
+
95
+ All submissions, including submissions by project members, require review. We
96
+ use GitHub pull requests for this purpose. Consult
97
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
98
+ information on using pull requests.
@@ -0,0 +1,6 @@
1
+ google-api-core>=2.19
2
+ google-cloud-dataproc>=5.18
3
+ packaging>=20.0
4
+ pyspark[connect]>=3.5
5
+ tqdm>=4.67
6
+ websockets>=15.0
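
The regenerated requires.txt now pins minimum versions for all six dependencies, including the newly added tqdm and websockets floors. A quick, non-authoritative way to check what is actually installed (using `importlib.metadata`, which session.py itself relies on) could be:

```python
import importlib.metadata

# Distribution names as they appear in requires.txt.
for dist in ("google-api-core", "google-cloud-dataproc", "packaging",
             "pyspark", "tqdm", "websockets"):
    try:
        print(dist, importlib.metadata.version(dist))
    except importlib.metadata.PackageNotFoundError:
        print(dist, "not installed")
```
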
@@ -18,7 +18,6 @@ import contextlib
18
18
  import logging
19
19
  import socket
20
20
  import threading
21
- import time
22
21
 
23
22
  import websockets.sync.client as websocketclient
24
23
 
@@ -95,6 +94,7 @@ def forward_bytes(name, from_sock, to_sock):
95
94
  This method is intended to be run in a separate thread of execution.
96
95
 
97
96
  Args:
97
+ name: forwarding thread name
98
98
  from_sock: A socket-like object to stream bytes from.
99
99
  to_sock: A socket-like object to stream bytes to.
100
100
  """
@@ -131,7 +131,7 @@ def connect_sockets(conn_number, from_sock, to_sock):
131
131
  This method continuously streams bytes in both directions between the
132
132
  given `from_sock` and `to_sock` socket-like objects.
133
133
 
134
- The caller is responsible for creating and closing the supplied socekts.
134
+ The caller is responsible for creating and closing the supplied sockets.
135
135
  """
136
136
  forward_name = f"{conn_number}-forward"
137
137
  t1 = threading.Thread(
@@ -163,7 +163,7 @@ def forward_connection(conn_number, conn, addr, target_host):
163
163
  Both the supplied incoming connection (`conn`) and the created outgoing
164
164
  connection are automatically closed when this method terminates.
165
165
 
166
- This method should be run inside of a daemon thread so that it will not
166
+ This method should be run inside a daemon thread so that it will not
167
167
  block program termination.
168
168
  """
169
169
  with conn:
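
The proxy.py changes above are documentation-only, but the functions they touch implement a common pattern: one daemon-friendly thread per direction copying bytes between two socket-like objects. The following is a rough, self-contained sketch of that pattern, not the module's actual implementation; only the function names and the thread-name format are taken from the hunks above.

```python
import threading


def forward_bytes(name, from_sock, to_sock):
    """Copy bytes from one socket-like object to another until EOF."""
    while True:
        data = from_sock.recv(4096)
        if not data:
            break
        to_sock.sendall(data)


def connect_sockets(conn_number, from_sock, to_sock):
    """Stream bytes in both directions; the caller owns and closes the sockets."""
    t1 = threading.Thread(
        target=forward_bytes,
        args=(f"{conn_number}-forward", from_sock, to_sock),
        daemon=True,
    )
    t2 = threading.Thread(
        target=forward_bytes,
        args=(f"{conn_number}-backward", to_sock, from_sock),
        daemon=True,
    )
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```
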
@@ -11,39 +11,37 @@
11
11
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
12
  # See the License for the specific language governing permissions and
13
13
  # limitations under the License.
14
+
14
15
  import atexit
16
+ import datetime
15
17
  import json
16
18
  import logging
17
19
  import os
18
20
  import random
19
21
  import string
22
+ import threading
20
23
  import time
21
- import datetime
22
- from time import sleep
23
- from typing import Any, cast, ClassVar, Dict, Optional
24
+ import tqdm
24
25
 
25
26
  from google.api_core import retry
26
- from google.api_core.future.polling import POLLING_PREDICATE
27
27
  from google.api_core.client_options import ClientOptions
28
28
  from google.api_core.exceptions import Aborted, FailedPrecondition, InvalidArgument, NotFound, PermissionDenied
29
- from google.cloud.dataproc_v1.types import sessions
30
-
31
- from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
29
+ from google.api_core.future.polling import POLLING_PREDICATE
32
30
  from google.cloud.dataproc_spark_connect.client import DataprocChannelBuilder
31
+ from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException
32
+ from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
33
33
  from google.cloud.dataproc_v1 import (
34
+ AuthenticationConfig,
34
35
  CreateSessionRequest,
35
36
  GetSessionRequest,
36
37
  Session,
37
38
  SessionControllerClient,
38
- SessionTemplate,
39
39
  TerminateSessionRequest,
40
40
  )
41
- from google.protobuf import text_format
42
- from google.protobuf.text_format import ParseError
41
+ from google.cloud.dataproc_v1.types import sessions
43
42
  from pyspark.sql.connect.session import SparkSession
44
43
  from pyspark.sql.utils import to_str
45
-
46
- from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException
44
+ from typing import Any, cast, ClassVar, Dict, Optional
47
45
 
48
46
  # Set up logging
49
47
  logging.basicConfig(level=logging.INFO)
@@ -69,6 +67,8 @@ class DataprocSparkSession(SparkSession):
69
67
  ... ) # doctest: +SKIP
70
68
  """
71
69
 
70
+ _DEFAULT_RUNTIME_VERSION = "2.2"
71
+
72
72
  _active_s8s_session_uuid: ClassVar[Optional[str]] = None
73
73
  _project_id = None
74
74
  _region = None
@@ -77,7 +77,12 @@ class DataprocSparkSession(SparkSession):
77
77
 
78
78
  class Builder(SparkSession.Builder):
79
79
 
80
- _dataproc_runtime_spark_version = {"3.0": "3.5.1", "2.2": "3.5.0"}
80
+ _dataproc_runtime_to_spark_version = {
81
+ "1.2": "3.5",
82
+ "2.2": "3.5",
83
+ "2.3": "3.5",
84
+ "3.0": "4.0",
85
+ }
81
86
 
82
87
  _session_static_configs = [
83
88
  "spark.executor.cores",
@@ -93,10 +98,10 @@ class DataprocSparkSession(SparkSession):
93
98
  self._options: Dict[str, Any] = {}
94
99
  self._channel_builder: Optional[DataprocChannelBuilder] = None
95
100
  self._dataproc_config: Optional[Session] = None
96
- self._project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
97
- self._region = os.environ.get("GOOGLE_CLOUD_REGION")
101
+ self._project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
102
+ self._region = os.getenv("GOOGLE_CLOUD_REGION")
98
103
  self._client_options = ClientOptions(
99
- api_endpoint=os.environ.get(
104
+ api_endpoint=os.getenv(
100
105
  "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
101
106
  f"{self._region}-dataproc.googleapis.com",
102
107
  )
@@ -117,7 +122,7 @@ class DataprocSparkSession(SparkSession):
117
122
 
118
123
  def location(self, location):
119
124
  self._region = location
120
- self._client_options.api_endpoint = os.environ.get(
125
+ self._client_options.api_endpoint = os.getenv(
121
126
  "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
122
127
  f"{self._region}-dataproc.googleapis.com",
123
128
  )
@@ -155,10 +160,7 @@ class DataprocSparkSession(SparkSession):
155
160
  spark_connect_url = session_response.runtime_info.endpoints.get(
156
161
  "Spark Connect Server"
157
162
  )
158
- spark_connect_url = spark_connect_url.replace("https", "sc")
159
- if not spark_connect_url.endswith("/"):
160
- spark_connect_url += "/"
161
- url = f"{spark_connect_url.replace('.com/', '.com:443/')};session_id={session_response.uuid};use_ssl=true"
163
+ url = f"{spark_connect_url}/;session_id={session_response.uuid};use_ssl=true"
162
164
  logger.debug(f"Spark Connect URL: {url}")
163
165
  self._channel_builder = DataprocChannelBuilder(
164
166
  url,
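
The rewritten URL logic drops the `https` → `sc` scheme rewrite and the explicit `:443` port, and simply appends the session parameters to the endpoint reported by the service. Assuming a placeholder endpoint and session UUID, the before/after shapes are roughly:

```python
# Placeholder values for illustration only.
spark_connect_url = "https://example-endpoint.dataproc.googleapis.com"  # hypothetical
session_uuid = "0000-1111"                                              # hypothetical

# 0.6.0-style rewrite (scheme swap plus explicit port):
old_url = spark_connect_url.replace("https", "sc")
if not old_url.endswith("/"):
    old_url += "/"
old_url = f"{old_url.replace('.com/', '.com:443/')};session_id={session_uuid};use_ssl=true"

# 0.7.0-style construction (endpoint used as-is):
new_url = f"{spark_connect_url}/;session_id={session_uuid};use_ssl=true"

print(old_url)
print(new_url)
```
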
@@ -179,56 +181,66 @@ class DataprocSparkSession(SparkSession):
179
181
 
180
182
  if self._options.get("spark.remote", False):
181
183
  raise NotImplemented(
182
- "DataprocSparkSession does not support connecting to an existing remote server"
184
+ "DataprocSparkSession does not support connecting to an existing Spark Connect remote server"
183
185
  )
184
186
 
185
187
  from google.cloud.dataproc_v1 import SessionControllerClient
186
188
 
187
189
  dataproc_config: Session = self._get_dataproc_config()
188
- session_template: SessionTemplate = self._get_session_template()
189
190
 
190
- self._get_and_validate_version(
191
- dataproc_config, session_template
192
- )
191
+ self._validate_version(dataproc_config)
193
192
 
194
- spark_connect_session = self._get_spark_connect_session(
195
- dataproc_config, session_template
196
- )
197
-
198
- if not spark_connect_session:
199
- dataproc_config.spark_connect_session = {}
200
- os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
201
- session_request = CreateSessionRequest()
202
193
  session_id = self.generate_dataproc_session_id()
203
-
204
- session_request.session_id = session_id
205
194
  dataproc_config.name = f"projects/{self._project_id}/locations/{self._region}/sessions/{session_id}"
206
195
  logger.debug(
207
- f"Configurations used to create serverless session:\n {dataproc_config}"
196
+ f"Dataproc Session configuration:\n{dataproc_config}"
208
197
  )
198
+
199
+ session_request = CreateSessionRequest()
200
+ session_request.session_id = session_id
209
201
  session_request.session = dataproc_config
210
202
  session_request.parent = (
211
203
  f"projects/{self._project_id}/locations/{self._region}"
212
204
  )
213
205
 
214
- logger.debug("Creating serverless session")
206
+ logger.debug("Creating Dataproc Session")
215
207
  DataprocSparkSession._active_s8s_session_id = session_id
216
208
  s8s_creation_start_time = time.time()
217
- try:
218
- session_polling = retry.Retry(
219
- predicate=POLLING_PREDICATE,
220
- initial=5.0, # seconds
221
- maximum=5.0, # seconds
222
- multiplier=1.0,
223
- timeout=600, # seconds
209
+
210
+ stop_create_session_pbar = False
211
+
212
+ def create_session_pbar():
213
+ iterations = 150
214
+ pbar = tqdm.trange(
215
+ iterations,
216
+ bar_format="{bar}",
217
+ ncols=80,
224
218
  )
225
- print("Creating Spark session. It may take a few minutes.")
219
+ for i in pbar:
220
+ if stop_create_session_pbar:
221
+ break
222
+ # Last iteration
223
+ if i >= iterations - 1:
224
+ # Sleep until session created
225
+ while not stop_create_session_pbar:
226
+ time.sleep(1)
227
+ else:
228
+ time.sleep(1)
229
+
230
+ pbar.close()
231
+ # Print new line after the progress bar
232
+ print()
233
+
234
+ create_session_pbar_thread = threading.Thread(
235
+ target=create_session_pbar
236
+ )
237
+
238
+ try:
226
239
  if (
227
- "dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"
228
- in os.environ
229
- and os.getenv(
230
- "dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"
231
- ).lower()
240
+ os.getenv(
241
+ "DATAPROC_SPARK_CONNECT_SESSION_TERMINATE_AT_EXIT",
242
+ "false",
243
+ )
232
244
  == "true"
233
245
  ):
234
246
  atexit.register(
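
`getOrCreate` now renders a tqdm progress bar from a background thread while session creation is pending, stopping it via a shared flag once the operation completes or fails. A stripped-down sketch of that pattern follows; the bar width and per-tick sleep match the hunk, but the long-running work is simulated and the last-iteration idle loop is omitted.

```python
import threading
import time

import tqdm

stop_create_session_pbar = False


def create_session_pbar():
    # Tick roughly once per second for up to 150 iterations, exiting early
    # when the creating thread sets the stop flag.
    pbar = tqdm.trange(150, bar_format="{bar}", ncols=80)
    for _ in pbar:
        if stop_create_session_pbar:
            break
        time.sleep(1)
    pbar.close()
    print()  # newline after the progress bar


pbar_thread = threading.Thread(target=create_session_pbar)
pbar_thread.start()

# ... the long-running operation (session creation) would happen here ...
time.sleep(3)

stop_create_session_pbar = True
pbar_thread.join()
```
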
@@ -243,18 +255,25 @@ class DataprocSparkSession(SparkSession):
243
255
  client_options=self._client_options
244
256
  ).create_session(session_request)
245
257
  print(
246
- f"Interactive Session Detail View: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
258
+ f"Creating Dataproc Session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
247
259
  )
260
+ create_session_pbar_thread.start()
248
261
  session_response: Session = operation.result(
249
- polling=session_polling
262
+ polling=retry.Retry(
263
+ predicate=POLLING_PREDICATE,
264
+ initial=5.0, # seconds
265
+ maximum=5.0, # seconds
266
+ multiplier=1.0,
267
+ timeout=600, # seconds
268
+ )
250
269
  )
251
- if (
252
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
253
- in os.environ
254
- ):
255
- file_path = os.environ[
256
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
257
- ]
270
+ stop_create_session_pbar = True
271
+ create_session_pbar_thread.join()
272
+ print("Dataproc Session was successfully created")
273
+ file_path = (
274
+ DataprocSparkSession._get_active_session_file_path()
275
+ )
276
+ if file_path is not None:
258
277
  try:
259
278
  session_data = {
260
279
  "session_name": session_response.name,
@@ -267,21 +286,27 @@ class DataprocSparkSession(SparkSession):
267
286
  json.dump(session_data, json_file, indent=4)
268
287
  except Exception as e:
269
288
  logger.error(
270
- f"Exception while writing active session to file {file_path} , {e}"
289
+ f"Exception while writing active session to file {file_path}, {e}"
271
290
  )
272
291
  except (InvalidArgument, PermissionDenied) as e:
292
+ stop_create_session_pbar = True
293
+ if create_session_pbar_thread.is_alive():
294
+ create_session_pbar_thread.join()
273
295
  DataprocSparkSession._active_s8s_session_id = None
274
296
  raise DataprocSparkConnectException(
275
- f"Error while creating serverless session: {e.message}"
297
+ f"Error while creating Dataproc Session: {e.message}"
276
298
  )
277
299
  except Exception as e:
300
+ stop_create_session_pbar = True
301
+ if create_session_pbar_thread.is_alive():
302
+ create_session_pbar_thread.join()
278
303
  DataprocSparkSession._active_s8s_session_id = None
279
304
  raise RuntimeError(
280
- f"Error while creating serverless session"
305
+ f"Error while creating Dataproc Session"
281
306
  ) from e
282
307
 
283
308
  logger.debug(
284
- f"Serverless session created: {session_id}, creation time taken: {int(time.time() - s8s_creation_start_time)} seconds"
309
+ f"Dataproc Session created: {session_id} in {int(time.time() - s8s_creation_start_time)} seconds"
285
310
  )
286
311
  return self.__create_spark_connect_session_from_s8s(
287
312
  session_response, dataproc_config.name
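
When `DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH` is set, the newly created session is recorded as JSON at that path. A minimal sketch of that write, with the field set reduced to what the hunk shows and a placeholder session name:

```python
import json
import os

file_path = os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")
if file_path is not None:
    session_data = {
        # Fully qualified name, e.g.
        # projects/<project>/locations/<region>/sessions/<session-id>
        "session_name": "projects/my-project/locations/us-central1/sessions/sc-example",  # placeholder
    }
    with open(file_path, "w") as json_file:
        json.dump(session_data, json_file, indent=4)
```
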
@@ -292,17 +317,20 @@ class DataprocSparkSession(SparkSession):
292
317
  ) -> Optional["DataprocSparkSession"]:
293
318
  s8s_session_id = DataprocSparkSession._active_s8s_session_id
294
319
  session_name = f"projects/{self._project_id}/locations/{self._region}/sessions/{s8s_session_id}"
295
- session_response = get_active_s8s_session_response(
296
- session_name, self._client_options
297
- )
320
+ session_response = None
321
+ session = None
322
+ if s8s_session_id is not None:
323
+ session_response = get_active_s8s_session_response(
324
+ session_name, self._client_options
325
+ )
326
+ session = DataprocSparkSession.getActiveSession()
298
327
 
299
- session = DataprocSparkSession.getActiveSession()
300
328
  if session is None:
301
329
  session = DataprocSparkSession._default_session
302
330
 
303
331
  if session_response is not None:
304
332
  print(
305
- f"Using existing session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}, configuration changes may not be applied."
333
+ f"Using existing Dataproc Session (configuration changes may not be applied): https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}"
306
334
  )
307
335
  if session is None:
308
336
  session = self.__create_spark_connect_session_from_s8s(
@@ -310,10 +338,10 @@ class DataprocSparkSession(SparkSession):
310
338
  )
311
339
  return session
312
340
  else:
313
- logger.info(
314
- f"Session: {s8s_session_id} not active, stopping previous spark session and creating new"
315
- )
316
341
  if session is not None:
342
+ print(
343
+ f"{s8s_session_id} Dataproc Session is not active, stopping and creating a new one"
344
+ )
317
345
  session.stop()
318
346
 
319
347
  return None
@@ -333,21 +361,52 @@ class DataprocSparkSession(SparkSession):
333
361
  dataproc_config = self._dataproc_config
334
362
  for k, v in self._options.items():
335
363
  dataproc_config.runtime_config.properties[k] = v
336
- elif "DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG" in os.environ:
337
- filepath = os.environ[
338
- "DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG"
364
+ dataproc_config.spark_connect_session = (
365
+ sessions.SparkConnectConfig()
366
+ )
367
+ if not dataproc_config.runtime_config.version:
368
+ dataproc_config.runtime_config.version = (
369
+ DataprocSparkSession._DEFAULT_RUNTIME_VERSION
370
+ )
371
+ if (
372
+ not dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type
373
+ and "DATAPROC_SPARK_CONNECT_AUTH_TYPE" in os.environ
374
+ ):
375
+ dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type = AuthenticationConfig.AuthenticationType[
376
+ os.getenv("DATAPROC_SPARK_CONNECT_AUTH_TYPE")
339
377
  ]
340
- try:
341
- with open(filepath, "r") as f:
342
- dataproc_config = Session.wrap(
343
- text_format.Parse(
344
- f.read(), Session.pb(dataproc_config)
345
- )
346
- )
347
- except FileNotFoundError:
348
- raise FileNotFoundError(f"File '{filepath}' not found")
349
- except ParseError as e:
350
- raise ParseError(f"Error parsing file '{filepath}': {e}")
378
+ if (
379
+ not dataproc_config.environment_config.execution_config.service_account
380
+ and "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT" in os.environ
381
+ ):
382
+ dataproc_config.environment_config.execution_config.service_account = os.getenv(
383
+ "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"
384
+ )
385
+ if (
386
+ not dataproc_config.environment_config.execution_config.subnetwork_uri
387
+ and "DATAPROC_SPARK_CONNECT_SUBNET" in os.environ
388
+ ):
389
+ dataproc_config.environment_config.execution_config.subnetwork_uri = os.getenv(
390
+ "DATAPROC_SPARK_CONNECT_SUBNET"
391
+ )
392
+ if (
393
+ not dataproc_config.environment_config.execution_config.ttl
394
+ and "DATAPROC_SPARK_CONNECT_TTL_SECONDS" in os.environ
395
+ ):
396
+ dataproc_config.environment_config.execution_config.ttl = {
397
+ "seconds": int(
398
+ os.getenv("DATAPROC_SPARK_CONNECT_TTL_SECONDS")
399
+ )
400
+ }
401
+ if (
402
+ not dataproc_config.environment_config.execution_config.idle_ttl
403
+ and "DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS" in os.environ
404
+ ):
405
+ dataproc_config.environment_config.execution_config.idle_ttl = {
406
+ "seconds": int(
407
+ os.getenv("DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS")
408
+ )
409
+ }
351
410
  if "COLAB_NOTEBOOK_RUNTIME_ID" in os.environ:
352
411
  dataproc_config.labels["colab-notebook-runtime-id"] = (
353
412
  os.environ["COLAB_NOTEBOOK_RUNTIME_ID"]
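
The textproto-based `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG` file is gone; individual `DATAPROC_SPARK_CONNECT_*` environment variables now fill in session fields the caller did not set explicitly. A hedged sketch of configuring a session purely through these variables (all values are placeholders) could look like:

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Placeholder values; each variable is only consulted when the
# corresponding field is not already set on the Session config.
os.environ["DATAPROC_SPARK_CONNECT_SUBNET"] = (
    "projects/my-project/regions/us-central1/subnetworks/default"
)
os.environ["DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"] = (
    "spark-sa@my-project.iam.gserviceaccount.com"
)
os.environ["DATAPROC_SPARK_CONNECT_TTL_SECONDS"] = "3600"
os.environ["DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS"] = "900"

spark = DataprocSparkSession.builder.getOrCreate()
```
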
@@ -358,87 +417,38 @@ class DataprocSparkSession(SparkSession):
358
417
  ]
359
418
  return dataproc_config
360
419
 
361
- def _get_session_template(self):
362
- from google.cloud.dataproc_v1 import (
363
- GetSessionTemplateRequest,
364
- SessionTemplateControllerClient,
365
- )
366
-
367
- session_template = None
368
- if self._dataproc_config and self._dataproc_config.session_template:
369
- session_template = self._dataproc_config.session_template
370
- get_session_template_request = GetSessionTemplateRequest()
371
- get_session_template_request.name = session_template
372
- client = SessionTemplateControllerClient(
373
- client_options=self._client_options
374
- )
375
- try:
376
- session_template = client.get_session_template(
377
- get_session_template_request
378
- )
379
- except Exception as e:
380
- logger.error(
381
- f"Failed to get session template {session_template}: {e}"
382
- )
383
- raise
384
- return session_template
420
+ def _validate_version(self, dataproc_config):
421
+ trim_version = lambda v: ".".join(v.split(".")[:2])
385
422
 
386
- def _get_and_validate_version(self, dataproc_config, session_template):
387
- trimmed_version = lambda v: ".".join(v.split(".")[:2])
388
- version = None
423
+ version = dataproc_config.runtime_config.version
389
424
  if (
390
- dataproc_config
391
- and dataproc_config.runtime_config
392
- and dataproc_config.runtime_config.version
393
- ):
394
- version = dataproc_config.runtime_config.version
395
- elif (
396
- session_template
397
- and session_template.runtime_config
398
- and session_template.runtime_config.version
399
- ):
400
- version = session_template.runtime_config.version
401
-
402
- if not version:
403
- version = "3.0"
404
- dataproc_config.runtime_config.version = version
405
- elif (
406
- trimmed_version(version)
407
- not in self._dataproc_runtime_spark_version
425
+ trim_version(version)
426
+ not in self._dataproc_runtime_to_spark_version
408
427
  ):
409
428
  raise ValueError(
410
- f"runtime_config.version {version} is not supported. "
411
- f"Supported versions: {self._dataproc_runtime_spark_version.keys()}"
429
+ f"Specified {version} Dataproc Spark runtime version is not supported. "
430
+ f"Supported runtime versions: {self._dataproc_runtime_to_spark_version.keys()}"
412
431
  )
413
432
 
414
- server_version = self._dataproc_runtime_spark_version[
415
- trimmed_version(version)
433
+ server_version = self._dataproc_runtime_to_spark_version[
434
+ trim_version(version)
416
435
  ]
436
+
417
437
  import importlib.metadata
418
438
 
419
- google_connect_version = importlib.metadata.version(
439
+ dataproc_connect_version = importlib.metadata.version(
420
440
  "dataproc-spark-connect"
421
441
  )
422
442
  client_version = importlib.metadata.version("pyspark")
423
- version_message = f"Spark Connect: {google_connect_version} (PySpark: {client_version}) Session Runtime: {version} (Spark: {server_version})"
424
- logger.info(version_message)
425
- if trimmed_version(client_version) != trimmed_version(
426
- server_version
427
- ):
428
- logger.warning(
429
- f"client and server on different versions: {version_message}"
443
+ if trim_version(client_version) != trim_version(server_version):
444
+ print(
445
+ f"Spark Connect client and server use different versions:\n"
446
+ f"- Dataproc Spark Connect client {dataproc_connect_version} (PySpark {client_version})\n"
447
+ f"- Dataproc Spark runtime {version} (Spark {server_version})"
430
448
  )
431
- return version
432
449
 
433
- def _get_spark_connect_session(self, dataproc_config, session_template):
434
- spark_connect_session = None
435
- if dataproc_config and dataproc_config.spark_connect_session:
436
- spark_connect_session = dataproc_config.spark_connect_session
437
- elif session_template and session_template.spark_connect_session:
438
- spark_connect_session = session_template.spark_connect_session
439
- return spark_connect_session
440
-
441
- def generate_dataproc_session_id(self):
450
+ @staticmethod
451
+ def generate_dataproc_session_id():
442
452
  timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
443
453
  suffix_length = 6
444
454
  random_suffix = "".join(
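
`_validate_version` trims the configured runtime to major.minor, checks it against the supported map, and compares the installed PySpark version with the runtime's Spark version. A condensed sketch of that check, with the mapping copied from the hunk above and a hard-coded runtime version standing in for `runtime_config.version`:

```python
import importlib.metadata

_dataproc_runtime_to_spark_version = {
    "1.2": "3.5",
    "2.2": "3.5",
    "2.3": "3.5",
    "3.0": "4.0",
}

trim_version = lambda v: ".".join(v.split(".")[:2])

runtime_version = "2.2"  # would come from runtime_config.version
if trim_version(runtime_version) not in _dataproc_runtime_to_spark_version:
    raise ValueError(f"Unsupported runtime version: {runtime_version}")

server_version = _dataproc_runtime_to_spark_version[trim_version(runtime_version)]
client_version = importlib.metadata.version("pyspark")

if trim_version(client_version) != trim_version(server_version):
    print(f"Client PySpark {client_version} differs from server Spark {server_version}")
```
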
@@ -451,32 +461,30 @@ class DataprocSparkSession(SparkSession):
451
461
  def _repr_html_(self) -> str:
452
462
  if not self._active_s8s_session_id:
453
463
  return """
454
- <div>No Active Dataproc Spark Session</div>
464
+ <div>No Active Dataproc Session</div>
455
465
  """
456
466
 
457
467
  s8s_session = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{self._active_s8s_session_id}"
458
468
  ui = f"{s8s_session}/sparkApplications/applications"
459
- version = ""
460
469
  return f"""
461
470
  <div>
462
471
  <p><b>Spark Connect</b></p>
463
472
 
464
- <p><a href="{s8s_session}?project={self._project_id}">Serverless Session</a></p>
473
+ <p><a href="{s8s_session}?project={self._project_id}">Dataproc Session</a></p>
465
474
  <p><a href="{ui}?project={self._project_id}">Spark UI</a></p>
466
475
  </div>
467
476
  """
468
477
 
469
- def _remove_stoped_session_from_file(self):
470
- if "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH" in os.environ:
471
- file_path = os.environ[
472
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
473
- ]
478
+ @staticmethod
479
+ def _remove_stopped_session_from_file():
480
+ file_path = DataprocSparkSession._get_active_session_file_path()
481
+ if file_path is not None:
474
482
  try:
475
483
  with open(file_path, "w"):
476
484
  pass
477
485
  except Exception as e:
478
486
  logger.error(
479
- f"Exception while removing active session in file {file_path} , {e}"
487
+ f"Exception while removing active session in file {file_path}, {e}"
480
488
  )
481
489
 
482
490
  def addArtifacts(
@@ -494,7 +502,7 @@ class DataprocSparkSession(SparkSession):
494
502
 
495
503
  Parameters
496
504
  ----------
497
- *path : tuple of str
505
+ *artifact : tuple of str
498
506
  Artifact's URIs to add.
499
507
  pyfile : bool
500
508
  Whether to add them as Python dependencies such as .py, .egg, .zip or .jar files.
@@ -507,7 +515,7 @@ class DataprocSparkSession(SparkSession):
507
515
  Add a file to be downloaded with this Spark job on every node.
508
516
  The ``path`` passed can only be a local file for now.
509
517
  pypi : bool
510
- This option is only available with DataprocSparkSession. eg. `spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)`
518
+ This option is only available with DataprocSparkSession. e.g. `spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)`
511
519
  Installs PyPi package (with its dependencies) in the active Spark session on the driver and executors.
512
520
 
513
521
  Notes
@@ -534,6 +542,10 @@ class DataprocSparkSession(SparkSession):
534
542
  *artifact, pyfile=pyfile, archive=archive, file=file
535
543
  )
536
544
 
545
+ @staticmethod
546
+ def _get_active_session_file_path():
547
+ return os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")
548
+
537
549
  def stop(self) -> None:
538
550
  with DataprocSparkSession._lock:
539
551
  if DataprocSparkSession._active_s8s_session_id is not None:
@@ -544,7 +556,7 @@ class DataprocSparkSession(SparkSession):
544
556
  self._client_options,
545
557
  )
546
558
 
547
- self._remove_stoped_session_from_file()
559
+ self._remove_stopped_session_from_file()
548
560
  DataprocSparkSession._active_s8s_session_uuid = None
549
561
  DataprocSparkSession._active_s8s_session_id = None
550
562
  DataprocSparkSession._project_id = None
@@ -565,7 +577,7 @@ def terminate_s8s_session(
565
577
  ):
566
578
  from google.cloud.dataproc_v1 import SessionControllerClient
567
579
 
568
- logger.debug(f"Terminating serverless session: {active_s8s_session_id}")
580
+ logger.debug(f"Terminating Dataproc Session: {active_s8s_session_id}")
569
581
  terminate_session_request = TerminateSessionRequest()
570
582
  session_name = f"projects/{project_id}/locations/{region}/sessions/{active_s8s_session_id}"
571
583
  terminate_session_request.name = session_name
@@ -583,18 +595,20 @@ def terminate_s8s_session(
583
595
  ):
584
596
  session = session_client.get_session(get_session_request)
585
597
  state = session.state
586
- sleep(1)
598
+ time.sleep(1)
587
599
  except NotFound:
588
- logger.debug(f"Session {active_s8s_session_id} already deleted")
600
+ logger.debug(
601
+ f"{active_s8s_session_id} Dataproc Session already deleted"
602
+ )
589
603
  # Client will get 'Aborted' error if session creation is still in progress and
590
604
  # 'FailedPrecondition' if another termination is still in progress.
591
- # Both are retryable but we catch it and let TTL take care of cleanups.
605
+ # Both are retryable, but we catch it and let TTL take care of cleanups.
592
606
  except (FailedPrecondition, Aborted):
593
607
  logger.debug(
594
- f"Session {active_s8s_session_id} already terminated manually or terminated automatically through session ttl limits"
608
+ f"{active_s8s_session_id} Dataproc Session already terminated manually or automatically due to TTL"
595
609
  )
596
610
  if state is not None and state == Session.State.FAILED:
597
- raise RuntimeError("Serverless session termination failed")
611
+ raise RuntimeError("Dataproc Session termination failed")
598
612
 
599
613
 
600
614
  def get_active_s8s_session_response(
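
Termination polls the session state roughly once per second, treating `NotFound`, `FailedPrecondition`, and `Aborted` as benign and only failing hard when the session ends up in `FAILED`. The sketch below reconstructs that loop under an assumed loop condition, since the hunk does not show the actual `while` predicate; client construction is elided.

```python
import time

from google.api_core.exceptions import Aborted, FailedPrecondition, NotFound
from google.cloud.dataproc_v1 import GetSessionRequest, Session, SessionControllerClient


def wait_for_termination(session_client: SessionControllerClient, session_name: str) -> None:
    """Poll the session state until it is no longer terminating (sketch only)."""
    get_session_request = GetSessionRequest(name=session_name)
    state = None
    try:
        state = session_client.get_session(get_session_request).state
        while state == Session.State.TERMINATING:  # assumed loop condition
            time.sleep(1)
            state = session_client.get_session(get_session_request).state
    except NotFound:
        pass  # session already deleted
    except (FailedPrecondition, Aborted):
        pass  # creation or another termination in progress; TTL cleans up
    if state == Session.State.FAILED:
        raise RuntimeError("Dataproc Session termination failed")
```
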
@@ -608,7 +622,7 @@ def get_active_s8s_session_response(
608
622
  ).get_session(get_session_request)
609
623
  state = get_session_response.state
610
624
  except Exception as e:
611
- logger.info(f"{session_name} deleted: {e}")
625
+ print(f"{session_name} Dataproc Session deleted: {e}")
612
626
  return None
613
627
  if state is not None and (
614
628
  state == Session.State.ACTIVE or state == Session.State.CREATING
@@ -20,7 +20,7 @@ long_description = (this_directory / "README.md").read_text()
20
20
 
21
21
  setup(
22
22
  name="dataproc-spark-connect",
23
- version="0.6.0",
23
+ version="0.7.0",
24
24
  description="Dataproc client library for Spark Connect",
25
25
  long_description=long_description,
26
26
  author="Google LLC",
@@ -28,10 +28,11 @@ setup(
28
28
  license="Apache 2.0",
29
29
  packages=find_namespace_packages(include=["google.*"]),
30
30
  install_requires=[
31
- "google-api-core>=2.19.1",
32
- "google-cloud-dataproc>=5.18.0",
33
- "websockets",
34
- "pyspark[connect]>=3.5",
31
+ "google-api-core>=2.19",
32
+ "google-cloud-dataproc>=5.18",
35
33
  "packaging>=20.0",
34
+ "pyspark[connect]>=3.5",
35
+ "tqdm>=4.67",
36
+ "websockets>=15.0",
36
37
  ],
37
38
  )
@@ -1,111 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: dataproc-spark-connect
3
- Version: 0.6.0
4
- Summary: Dataproc client library for Spark Connect
5
- Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
- Author: Google LLC
7
- License: Apache 2.0
8
- License-File: LICENSE
9
- Requires-Dist: google-api-core>=2.19.1
10
- Requires-Dist: google-cloud-dataproc>=5.18.0
11
- Requires-Dist: websockets
12
- Requires-Dist: pyspark[connect]>=3.5
13
- Requires-Dist: packaging>=20.0
14
-
15
- # Dataproc Spark Connect Client
16
-
17
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
18
- additional functionalities that allow applications to communicate with a remote Dataproc
19
- Spark cluster using the Spark Connect protocol without requiring additional steps.
20
-
21
- ## Install
22
-
23
- ```console
24
- pip install dataproc_spark_connect
25
- ```
26
-
27
- ## Uninstall
28
-
29
- ```console
30
- pip uninstall dataproc_spark_connect
31
- ```
32
-
33
- ## Setup
34
- This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
35
- If you are running the client outside of Google Cloud, you must set following environment variables:
36
-
37
- * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
38
- * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
39
- * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
40
- * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
41
-
42
- ## Usage
43
-
44
- 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
45
-
46
- ```console
47
- pip install google_cloud_dataproc --force-reinstall
48
- pip install dataproc_spark_connect --force-reinstall
49
- ```
50
-
51
- 2. Add the required import into your PySpark application or notebook:
52
-
53
- ```python
54
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
55
- ```
56
-
57
- 3. There are two ways to create a spark session,
58
-
59
- 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
60
-
61
- ```python
62
- spark = DataprocSparkSession.builder.getOrCreate()
63
- ```
64
-
65
- 2. Start a Spark session with the following code instead of using a config file:
66
-
67
- ```python
68
- from google.cloud.dataproc_v1 import SparkConnectConfig
69
- from google.cloud.dataproc_v1 import Session
70
- dataproc_session_config = Session()
71
- dataproc_session_config.spark_connect_session = SparkConnectConfig()
72
- dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
73
- dataproc_session_config.runtime_config.version = '3.0'
74
- spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
75
- ```
76
-
77
- ## Billing
78
- As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
79
- This will happen even if you are running the client from a non-GCE instance.
80
-
81
- ## Contributing
82
- ### Building and Deploying SDK
83
-
84
- 1. Install the requirements in virtual environment.
85
-
86
- ```console
87
- pip install -r requirements-dev.txt
88
- ```
89
-
90
- 2. Build the code.
91
-
92
- ```console
93
- python setup.py sdist bdist_wheel
94
- ```
95
-
96
- 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
97
-
98
- ```sh
99
- VERSION=<version>
100
- gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
101
- ```
102
-
103
- 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
104
-
105
- ```sh
106
- %%bash
107
- export VERSION=<version>
108
- gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
109
- yes | pip uninstall dataproc_spark_connect
110
- pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
111
- ```
@@ -1,97 +0,0 @@
1
- # Dataproc Spark Connect Client
2
-
3
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
4
- additional functionalities that allow applications to communicate with a remote Dataproc
5
- Spark cluster using the Spark Connect protocol without requiring additional steps.
6
-
7
- ## Install
8
-
9
- ```console
10
- pip install dataproc_spark_connect
11
- ```
12
-
13
- ## Uninstall
14
-
15
- ```console
16
- pip uninstall dataproc_spark_connect
17
- ```
18
-
19
- ## Setup
20
- This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
21
- If you are running the client outside of Google Cloud, you must set following environment variables:
22
-
23
- * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
24
- * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
25
- * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
26
- * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
27
-
28
- ## Usage
29
-
30
- 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
31
-
32
- ```console
33
- pip install google_cloud_dataproc --force-reinstall
34
- pip install dataproc_spark_connect --force-reinstall
35
- ```
36
-
37
- 2. Add the required import into your PySpark application or notebook:
38
-
39
- ```python
40
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
41
- ```
42
-
43
- 3. There are two ways to create a spark session,
44
-
45
- 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
46
-
47
- ```python
48
- spark = DataprocSparkSession.builder.getOrCreate()
49
- ```
50
-
51
- 2. Start a Spark session with the following code instead of using a config file:
52
-
53
- ```python
54
- from google.cloud.dataproc_v1 import SparkConnectConfig
55
- from google.cloud.dataproc_v1 import Session
56
- dataproc_session_config = Session()
57
- dataproc_session_config.spark_connect_session = SparkConnectConfig()
58
- dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
59
- dataproc_session_config.runtime_config.version = '3.0'
60
- spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
61
- ```
62
-
63
- ## Billing
64
- As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
65
- This will happen even if you are running the client from a non-GCE instance.
66
-
67
- ## Contributing
68
- ### Building and Deploying SDK
69
-
70
- 1. Install the requirements in virtual environment.
71
-
72
- ```console
73
- pip install -r requirements-dev.txt
74
- ```
75
-
76
- 2. Build the code.
77
-
78
- ```console
79
- python setup.py sdist bdist_wheel
80
- ```
81
-
82
- 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
83
-
84
- ```sh
85
- VERSION=<version>
86
- gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
87
- ```
88
-
89
- 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
90
-
91
- ```sh
92
- %%bash
93
- export VERSION=<version>
94
- gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
95
- yes | pip uninstall dataproc_spark_connect
96
- pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
97
- ```
@@ -1,111 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: dataproc-spark-connect
3
- Version: 0.6.0
4
- Summary: Dataproc client library for Spark Connect
5
- Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
- Author: Google LLC
7
- License: Apache 2.0
8
- License-File: LICENSE
9
- Requires-Dist: google-api-core>=2.19.1
10
- Requires-Dist: google-cloud-dataproc>=5.18.0
11
- Requires-Dist: websockets
12
- Requires-Dist: pyspark[connect]>=3.5
13
- Requires-Dist: packaging>=20.0
14
-
15
- # Dataproc Spark Connect Client
16
-
17
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
18
- additional functionalities that allow applications to communicate with a remote Dataproc
19
- Spark cluster using the Spark Connect protocol without requiring additional steps.
20
-
21
- ## Install
22
-
23
- ```console
24
- pip install dataproc_spark_connect
25
- ```
26
-
27
- ## Uninstall
28
-
29
- ```console
30
- pip uninstall dataproc_spark_connect
31
- ```
32
-
33
- ## Setup
34
- This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
35
- If you are running the client outside of Google Cloud, you must set following environment variables:
36
-
37
- * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
38
- * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
39
- * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
40
- * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
41
-
42
- ## Usage
43
-
44
- 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
45
-
46
- ```console
47
- pip install google_cloud_dataproc --force-reinstall
48
- pip install dataproc_spark_connect --force-reinstall
49
- ```
50
-
51
- 2. Add the required import into your PySpark application or notebook:
52
-
53
- ```python
54
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
55
- ```
56
-
57
- 3. There are two ways to create a spark session,
58
-
59
- 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
60
-
61
- ```python
62
- spark = DataprocSparkSession.builder.getOrCreate()
63
- ```
64
-
65
- 2. Start a Spark session with the following code instead of using a config file:
66
-
67
- ```python
68
- from google.cloud.dataproc_v1 import SparkConnectConfig
69
- from google.cloud.dataproc_v1 import Session
70
- dataproc_session_config = Session()
71
- dataproc_session_config.spark_connect_session = SparkConnectConfig()
72
- dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
73
- dataproc_session_config.runtime_config.version = '3.0'
74
- spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
75
- ```
76
-
77
- ## Billing
78
- As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
79
- This will happen even if you are running the client from a non-GCE instance.
80
-
81
- ## Contributing
82
- ### Building and Deploying SDK
83
-
84
- 1. Install the requirements in virtual environment.
85
-
86
- ```console
87
- pip install -r requirements-dev.txt
88
- ```
89
-
90
- 2. Build the code.
91
-
92
- ```console
93
- python setup.py sdist bdist_wheel
94
- ```
95
-
96
- 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
97
-
98
- ```sh
99
- VERSION=<version>
100
- gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
101
- ```
102
-
103
- 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
104
-
105
- ```sh
106
- %%bash
107
- export VERSION=<version>
108
- gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
109
- yes | pip uninstall dataproc_spark_connect
110
- pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
111
- ```
@@ -1,5 +0,0 @@
1
- google-api-core>=2.19.1
2
- google-cloud-dataproc>=5.18.0
3
- websockets
4
- pyspark[connect]>=3.5
5
- packaging>=20.0