dataproc-spark-connect 0.2.1__tar.gz → 0.6.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/PKG-INFO +46 -54
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/README.md +42 -48
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/dataproc_spark_connect.egg-info/PKG-INFO +46 -54
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/dataproc_spark_connect.egg-info/SOURCES.txt +2 -0
- dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/requires.txt +5 -0
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/google/cloud/dataproc_spark_connect/__init__.py +14 -8
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/google/cloud/dataproc_spark_connect/client/core.py +34 -8
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/google/cloud/dataproc_spark_connect/client/proxy.py +12 -9
- dataproc_spark_connect-0.6.0/google/cloud/dataproc_spark_connect/exceptions.py +27 -0
- dataproc_spark_connect-0.6.0/google/cloud/dataproc_spark_connect/pypi_artifacts.py +48 -0
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/google/cloud/dataproc_spark_connect/session.py +182 -99
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/setup.py +4 -6
- dataproc_spark_connect-0.2.1/dataproc_spark_connect.egg-info/requires.txt +0 -7
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/LICENSE +0 -0
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/pyproject.toml +0 -0
- {dataproc_spark_connect-0.2.1 → dataproc_spark_connect-0.6.0}/setup.cfg +0 -0
--- dataproc_spark_connect-0.2.1/PKG-INFO
+++ dataproc_spark_connect-0.6.0/PKG-INFO
@@ -1,42 +1,34 @@
 Metadata-Version: 2.1
 Name: dataproc-spark-connect
-Version: 0.2.1
+Version: 0.6.0
 Summary: Dataproc client library for Spark Connect
 Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
 Author: Google LLC
 License: Apache 2.0
 License-File: LICENSE
 Requires-Dist: google-api-core>=2.19.1
-Requires-Dist: google-cloud-dataproc>=5.
-Requires-Dist: wheel
+Requires-Dist: google-cloud-dataproc>=5.18.0
 Requires-Dist: websockets
-Requires-Dist: pyspark>=3.5
-Requires-Dist:
-Requires-Dist: pyarrow
+Requires-Dist: pyspark[connect]>=3.5
+Requires-Dist: packaging>=20.0

 # Dataproc Spark Connect Client

-> ⚠️ **Warning:**
-The package `dataproc-spark-connect` has been renamed to `google-spark-connect`. `dataproc-spark-connect` will no longer be updated.
-For help using `google-spark-connect`, please see [guide](https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python/blob/main/README.md).
-
-
 A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
 additional functionalities that allow applications to communicate with a remote Dataproc
 Spark cluster using the Spark Connect protocol without requiring additional steps.

 ## Install

-
-
-
+```console
+pip install dataproc_spark_connect
+```

 ## Uninstall

-
-
-
-
+```console
+pip uninstall dataproc_spark_connect
+```

 ## Setup
 This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
@@ -51,36 +43,36 @@ If you are running the client outside of Google Cloud, you must set following en

 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:

-
-
-
-
+```console
+pip install google_cloud_dataproc --force-reinstall
+pip install dataproc_spark_connect --force-reinstall
+```

 2. Add the required import into your PySpark application or notebook:

-
-
-
+```python
+from google.cloud.dataproc_spark_connect import DataprocSparkSession
+```

 3. There are two ways to create a spark session,

    1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:

-
-
-
+```python
+spark = DataprocSparkSession.builder.getOrCreate()
+```

    2. Start a Spark session with the following code instead of using a config file:

-
-
-
-
-
-
-
-
-
+```python
+from google.cloud.dataproc_v1 import SparkConnectConfig
+from google.cloud.dataproc_v1 import Session
+dataproc_session_config = Session()
+dataproc_session_config.spark_connect_session = SparkConnectConfig()
+dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
+dataproc_session_config.runtime_config.version = '3.0'
+spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
+```

 ## Billing
 As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
@@ -91,29 +83,29 @@ This will happen even if you are running the client from a non-GCE instance.

 1. Install the requirements in virtual environment.

-
-
-
+```console
+pip install -r requirements-dev.txt
+```

 2. Build the code.

-
-
-
-
+```console
+python setup.py sdist bdist_wheel
+```

 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.

-
-
-
+```sh
+VERSION=<version>
+gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
+```

 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.

-
-
-
-
-
-
-
+```sh
+%%bash
+export VERSION=<version>
+gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
+yes | pip uninstall dataproc_spark_connect
+pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
+```
--- dataproc_spark_connect-0.2.1/README.md
+++ dataproc_spark_connect-0.6.0/README.md
Hunks @@ -1,26 +1,20 @@, @@ -35,36 +29,36 @@ and @@ -75,29 +69,29 @@ apply the same changes as the README body embedded in PKG-INFO above: the rename warning is removed, and the Install, Uninstall, Setup and developer instructions gain fenced code blocks.
--- dataproc_spark_connect-0.2.1/dataproc_spark_connect.egg-info/PKG-INFO
+++ dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/PKG-INFO
Identical to the PKG-INFO diff above (same hunks, +46 -54).
--- dataproc_spark_connect-0.2.1/dataproc_spark_connect.egg-info/SOURCES.txt
+++ dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/SOURCES.txt
@@ -9,6 +9,8 @@ dataproc_spark_connect.egg-info/dependency_links.txt
 dataproc_spark_connect.egg-info/requires.txt
 dataproc_spark_connect.egg-info/top_level.txt
 google/cloud/dataproc_spark_connect/__init__.py
+google/cloud/dataproc_spark_connect/exceptions.py
+google/cloud/dataproc_spark_connect/pypi_artifacts.py
 google/cloud/dataproc_spark_connect/session.py
 google/cloud/dataproc_spark_connect/client/__init__.py
 google/cloud/dataproc_spark_connect/client/core.py
--- dataproc_spark_connect-0.2.1/google/cloud/dataproc_spark_connect/__init__.py
+++ dataproc_spark_connect-0.6.0/google/cloud/dataproc_spark_connect/__init__.py
@@ -11,13 +11,19 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
+import importlib.metadata
 import warnings

-
-
-
-
-
-
-
+from .session import DataprocSparkSession
+
+old_package_name = "google-spark-connect"
+current_package_name = "dataproc-spark-connect"
+try:
+    importlib.metadata.distribution(old_package_name)
+    warnings.warn(
+        f"Package '{old_package_name}' is already installed in your environment. "
+        f"This might cause conflicts with '{current_package_name}'. "
+        f"Consider uninstalling '{old_package_name}' and only install '{current_package_name}'."
+    )
+except:
+    pass
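The conflict warning above hinges on `importlib.metadata.distribution`, which raises `PackageNotFoundError` when the legacy `google-spark-connect` distribution is absent; the bare `except` in the released code simply swallows that case. A minimal standalone sketch of the same check (illustrative only, with the broad except narrowed to the specific error):

```python
import importlib.metadata
import warnings

OLD_PACKAGE = "google-spark-connect"
CURRENT_PACKAGE = "dataproc-spark-connect"

try:
    # Raises importlib.metadata.PackageNotFoundError if OLD_PACKAGE is not installed.
    importlib.metadata.distribution(OLD_PACKAGE)
    warnings.warn(
        f"Package '{OLD_PACKAGE}' is already installed and may conflict with "
        f"'{CURRENT_PACKAGE}'. Consider uninstalling '{OLD_PACKAGE}'."
    )
except importlib.metadata.PackageNotFoundError:
    # The legacy distribution is not installed, so there is nothing to warn about.
    pass
```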
--- dataproc_spark_connect-0.2.1/google/cloud/dataproc_spark_connect/client/core.py
+++ dataproc_spark_connect-0.6.0/google/cloud/dataproc_spark_connect/client/core.py
@@ -11,12 +11,16 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import logging
+
 import google
 import grpc
 from pyspark.sql.connect.client import ChannelBuilder

 from . import proxy

+logger = logging.getLogger(__name__)
+

 class DataprocChannelBuilder(ChannelBuilder):
     """
@@ -36,6 +40,10 @@ class DataprocChannelBuilder(ChannelBuilder):
     True
     """

+    def __init__(self, url, is_active_callback=None):
+        self._is_active_callback = is_active_callback
+        super().__init__(url)
+
     def toChannel(self) -> grpc.Channel:
         """
         Applies the parameters of the connection string and creates a new
@@ -51,7 +59,7 @@ class DataprocChannelBuilder(ChannelBuilder):
         return self._proxied_channel()

     def _proxied_channel(self) -> grpc.Channel:
-        return ProxiedChannel(self.host)
+        return ProxiedChannel(self.host, self._is_active_callback)

     def _direct_channel(self) -> grpc.Channel:
         destination = f"{self.host}:{self.port}"
@@ -75,7 +83,8 @@ class DataprocChannelBuilder(ChannelBuilder):

 class ProxiedChannel(grpc.Channel):

-    def __init__(self, target_host):
+    def __init__(self, target_host, is_active_callback):
+        self._is_active_callback = is_active_callback
         self._proxy = proxy.DataprocSessionProxy(0, target_host)
         self._proxy.start()
         self._proxied_connect_url = f"sc://localhost:{self._proxy.port}"
@@ -94,20 +103,37 @@ class ProxiedChannel(grpc.Channel):
         self._proxy.stop()
         return ret

+    def _wrap_method(self, wrapped_method):
+        if self._is_active_callback is None:
+            return wrapped_method
+
+        def checked_method(*margs, **mkwargs):
+            if (
+                self._is_active_callback is not None
+                and not self._is_active_callback()
+            ):
+                logger.warning(f"Session is no longer active")
+                raise RuntimeError(
+                    "Session not active. Please create a new session"
+                )
+            return wrapped_method(*margs, **mkwargs)
+
+        return checked_method
+
     def stream_stream(self, *args, **kwargs):
-        return self._wrapped.stream_stream(*args, **kwargs)
+        return self._wrap_method(self._wrapped.stream_stream(*args, **kwargs))

     def stream_unary(self, *args, **kwargs):
-        return self._wrapped.stream_unary(*args, **kwargs)
+        return self._wrap_method(self._wrapped.stream_unary(*args, **kwargs))

     def subscribe(self, *args, **kwargs):
-        return self._wrapped.subscribe(*args, **kwargs)
+        return self._wrap_method(self._wrapped.subscribe(*args, **kwargs))

     def unary_stream(self, *args, **kwargs):
-        return self._wrapped.unary_stream(*args, **kwargs)
+        return self._wrap_method(self._wrapped.unary_stream(*args, **kwargs))

     def unary_unary(self, *args, **kwargs):
-        return self._wrapped.unary_unary(*args, **kwargs)
+        return self._wrap_method(self._wrapped.unary_unary(*args, **kwargs))

     def unsubscribe(self, *args, **kwargs):
-        return self._wrapped.unsubscribe(*args, **kwargs)
+        return self._wrap_method(self._wrapped.unsubscribe(*args, **kwargs))
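The `_wrap_method` change above guards every gRPC method factory with a session-liveness callback before delegating to the wrapped channel. A minimal sketch of the same guard pattern outside of gRPC, assuming nothing beyond the standard library (names here are illustrative, not part of the package):

```python
from typing import Callable, Optional


def wrap_with_liveness_check(
    method: Callable, is_active: Optional[Callable[[], bool]]
) -> Callable:
    """Return `method` wrapped so that every call first consults `is_active`."""
    if is_active is None:
        return method

    def checked(*args, **kwargs):
        if not is_active():
            raise RuntimeError("Session not active. Please create a new session")
        return method(*args, **kwargs)

    return checked


# Usage sketch: in the client, the callback polls the Dataproc session state.
send = wrap_with_liveness_check(lambda payload: f"sent {payload}", lambda: True)
print(send("ping"))  # -> sent ping
```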
--- dataproc_spark_connect-0.2.1/google/cloud/dataproc_spark_connect/client/proxy.py
+++ dataproc_spark_connect-0.6.0/google/cloud/dataproc_spark_connect/client/proxy.py
@@ -81,6 +81,7 @@ def connect_tcp_bridge(hostname):
     return websocketclient.connect(
         f"wss://{hostname}/{path}",
         additional_headers={"Authorization": f"Bearer {creds.token}"},
+        open_timeout=30,
     )


@@ -101,8 +102,11 @@ def forward_bytes(name, from_sock, to_sock):
     try:
         bs = from_sock.recv(1024)
         if not bs:
+            to_sock.close()
             return
-
+        attempt = 0
+        while bs and (attempt < 10):
+            attempt += 1
             try:
                 to_sock.send(bs)
                 bs = None
@@ -110,6 +114,8 @@ def forward_bytes(name, from_sock, to_sock):
                 # On timeouts during a send, we retry just the send
                 # to make sure we don't lose any bytes.
                 pass
+        if bs:
+            raise Exception(f"Failed to forward bytes for {name}")
     except TimeoutError:
         # On timeouts during a receive, we retry the entire flow.
         pass
@@ -163,6 +169,11 @@ def forward_connection(conn_number, conn, addr, target_host):
     with conn:
         with connect_tcp_bridge(target_host) as websocket_conn:
             backend_socket = bridged_socket(websocket_conn)
+            # Set a timeout on how long we will allow send/recv calls to block
+            #
+            # The code that reads and writes to this connection will retry
+            # on timeouts, so this is a safe change.
+            conn.settimeout(10)
             connect_sockets(conn_number, conn, backend_socket)


@@ -210,14 +221,6 @@ class DataprocSessionProxy(object):
         s.release()
         while not self._killed:
             conn, addr = frontend_socket.accept()
-            # Set a timeout on how long we will allow send/recv calls to block
-            #
-            # The code that reads and writes to this connection will retry
-            # on timeouts, so this is a safe change.
-            #
-            # The chosen timeout is a very short one because it allows us
-            # to more quickly detect when a connection has been closed.
-            conn.settimeout(1)
             logger.debug(f"Accepted a connection from {addr}...")
             self._conn_number += 1
             threading.Thread(
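The forwarding change above bounds send retries at ten attempts and raises once the buffered bytes could not be delivered. A small sketch of that bounded-retry loop, with a plain socket-like object standing in for the proxied connection (illustrative, not the package's code):

```python
def send_with_bounded_retries(sock, data: bytes, max_attempts: int = 10) -> None:
    """Retry a timed-out send without dropping bytes, giving up after max_attempts."""
    pending = data
    attempt = 0
    while pending and attempt < max_attempts:
        attempt += 1
        try:
            sock.send(pending)
            pending = None  # delivered, stop retrying
        except TimeoutError:
            # Timed out mid-send: retry only the send so no bytes are lost.
            pass
    if pending:
        raise Exception("Failed to forward bytes")
```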
--- /dev/null
+++ dataproc_spark_connect-0.6.0/google/cloud/dataproc_spark_connect/exceptions.py
@@ -0,0 +1,27 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+class DataprocSparkConnectException(Exception):
+    """A custom exception class to only print the error messages.
+    This would be used for exceptions where the stack trace
+    doesn't provide any additional information.h
+    """
+
+    def __init__(self, message):
+        self.message = message
+        super().__init__(message)
+
+    def _render_traceback_(self):
+        return self.message
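`_render_traceback_` is the hook that IPython and Jupyter consult when displaying an uncaught exception, which is how `DataprocSparkConnectException` keeps notebook output down to the message instead of a full stack trace. A hedged usage sketch, assuming the package layout from the diff above:

```python
from google.cloud.dataproc_spark_connect.exceptions import (
    DataprocSparkConnectException,
)

try:
    raise DataprocSparkConnectException(
        "Error while creating serverless session: permission denied"
    )
except DataprocSparkConnectException as e:
    # In plain Python this behaves like any Exception; under IPython the
    # _render_traceback_ hook reduces the displayed traceback to the message.
    print(e.message)
```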
--- /dev/null
+++ dataproc_spark_connect-0.6.0/google/cloud/dataproc_spark_connect/pypi_artifacts.py
@@ -0,0 +1,48 @@
+import json
+import logging
+import os
+import tempfile
+
+from packaging.requirements import Requirement
+
+logger = logging.getLogger(__name__)
+
+
+class PyPiArtifacts:
+    """
+    This is a helper class to serialize the PYPI package installation request with a "magic" file name
+    that Spark Connect server understands
+    """
+
+    @staticmethod
+    def __try_parsing_package(packages: set[str]) -> list[Requirement]:
+        reqs = [Requirement(p) for p in packages]
+        if 0 in [len(req.specifier) for req in reqs]:
+            logger.info("It is recommended to pin the version of the package")
+        return reqs
+
+    def __init__(self, packages: set[str]):
+        self.requirements = PyPiArtifacts.__try_parsing_package(packages)
+
+    def write_packages_config(self, s8s_session_uuid: str) -> str:
+        """
+        Can't use the same file-name as Spark throws exception that file already exists
+        Keep the filename/format in sync with server
+        """
+        dependencies = {
+            "version": "0.5",
+            "packageType": "PYPI",
+            "packages": [str(req) for req in self.requirements],
+        }
+
+        file_path = os.path.join(
+            tempfile.gettempdir(),
+            s8s_session_uuid,
+            "add-artifacts-1729-" + self.__str__() + ".json",
+        )
+
+        os.makedirs(os.path.dirname(file_path), exist_ok=True)
+        with open(file_path, "w") as json_file:
+            json.dump(dependencies, json_file, indent=4)
+        logger.debug("Dumping dependencies request in file: " + file_path)
+        return file_path
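To make the serialization concrete, here is roughly what `write_packages_config` produces for two requirements; the session UUID is a placeholder, and the exact file name embeds the object's default `repr`, so it varies per run:

```python
from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts

artifacts = PyPiArtifacts({"spacy==3.8.4", "torch"})
config_path = artifacts.write_packages_config("0000-1111-2222-3333")
print(config_path)
# e.g. /tmp/0000-1111-2222-3333/add-artifacts-1729-<PyPiArtifacts object ...>.json
#
# The JSON written to that path looks like:
# {
#     "version": "0.5",
#     "packageType": "PYPI",
#     "packages": ["spacy==3.8.4", "torch"]
# }
```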
--- dataproc_spark_connect-0.2.1/google/cloud/dataproc_spark_connect/session.py
+++ dataproc_spark_connect-0.6.0/google/cloud/dataproc_spark_connect/session.py
@@ -11,6 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import atexit
 import json
 import logging
 import os
@@ -24,9 +25,10 @@ from typing import Any, cast, ClassVar, Dict, Optional
 from google.api_core import retry
 from google.api_core.future.polling import POLLING_PREDICATE
 from google.api_core.client_options import ClientOptions
-from google.api_core.exceptions import FailedPrecondition, InvalidArgument, NotFound
+from google.api_core.exceptions import Aborted, FailedPrecondition, InvalidArgument, NotFound, PermissionDenied
 from google.cloud.dataproc_v1.types import sessions

+from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
 from google.cloud.dataproc_spark_connect.client import DataprocChannelBuilder
 from google.cloud.dataproc_v1 import (
     CreateSessionRequest,
@@ -41,6 +43,7 @@ from google.protobuf.text_format import ParseError
 from pyspark.sql.connect.session import SparkSession
 from pyspark.sql.utils import to_str

+from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException

 # Set up logging
 logging.basicConfig(level=logging.INFO)
@@ -61,7 +64,7 @@ class DataprocSparkSession(SparkSession):
     >>> spark = (
     ...     DataprocSparkSession.builder
     ...     .appName("Word Count")
-    ...     .
+    ...     .dataprocSessionConfig(Session())
     ...     .getOrCreate()
     ... )  # doctest: +SKIP
     """
@@ -112,15 +115,15 @@ class DataprocSparkSession(SparkSession):
         self._project_id = project_id
         return self

-    def
-        self._region =
+    def location(self, location):
+        self._region = location
         self._client_options.api_endpoint = os.environ.get(
             "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
             f"{self._region}-dataproc.googleapis.com",
         )
         return self

-    def
+    def dataprocSessionConfig(self, dataproc_config: Session):
         with self._lock:
             self._dataproc_config = dataproc_config
             for k, v in dataproc_config.runtime_config.properties.items():
@@ -135,14 +138,14 @@ class DataprocSparkSession(SparkSession):
         else:
             return self

-    def create(self) -> "
+    def create(self) -> "DataprocSparkSession":
         raise NotImplemented(
             "DataprocSparkSession allows session creation only through getOrCreate"
         )

     def __create_spark_connect_session_from_s8s(
-        self, session_response
-    ) -> "
+        self, session_response, session_name
+    ) -> "DataprocSparkSession":
         DataprocSparkSession._active_s8s_session_uuid = (
             session_response.uuid
         )
@@ -153,9 +156,16 @@ class DataprocSparkSession(SparkSession):
             "Spark Connect Server"
         )
         spark_connect_url = spark_connect_url.replace("https", "sc")
+        if not spark_connect_url.endswith("/"):
+            spark_connect_url += "/"
         url = f"{spark_connect_url.replace('.com/', '.com:443/')};session_id={session_response.uuid};use_ssl=true"
         logger.debug(f"Spark Connect URL: {url}")
-        self._channel_builder = DataprocChannelBuilder(
+        self._channel_builder = DataprocChannelBuilder(
+            url,
+            is_active_callback=lambda: is_s8s_session_active(
+                session_name, self._client_options
+            ),
+        )

         assert self._channel_builder is not None
         session = DataprocSparkSession(connection=self._channel_builder)
@@ -164,7 +174,7 @@ class DataprocSparkSession(SparkSession):
         self.__apply_options(session)
         return session

-    def __create(self) -> "
+    def __create(self) -> "DataprocSparkSession":
         with self._lock:

             if self._options.get("spark.remote", False):
@@ -185,12 +195,8 @@ class DataprocSparkSession(SparkSession):
                 dataproc_config, session_template
             )

-            spark = self._get_spark(dataproc_config, session_template)
-
             if not spark_connect_session:
                 dataproc_config.spark_connect_session = {}
-            if not spark:
-                dataproc_config.spark = {}
             os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
             session_request = CreateSessionRequest()
             session_id = self.generate_dataproc_session_id()
@@ -216,14 +222,28 @@ class DataprocSparkSession(SparkSession):
                 multiplier=1.0,
                 timeout=600,  # seconds
             )
-
-
-
+            print("Creating Spark session. It may take a few minutes.")
+            if (
+                "dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"
+                in os.environ
+                and os.getenv(
+                    "dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"
+                ).lower()
+                == "true"
+            ):
+                atexit.register(
+                    lambda: terminate_s8s_session(
+                        self._project_id,
+                        self._region,
+                        session_id,
+                        self._client_options,
+                    )
+                )
             operation = SessionControllerClient(
                 client_options=self._client_options
             ).create_session(session_request)
             print(
-                f"Interactive Session Detail View: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}"
+                f"Interactive Session Detail View: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
             )
             session_response: Session = operation.result(
                 polling=session_polling
@@ -249,49 +269,32 @@ class DataprocSparkSession(SparkSession):
                 logger.error(
                     f"Exception while writing active session to file {file_path} , {e}"
                 )
-        except InvalidArgument as e:
+        except (InvalidArgument, PermissionDenied) as e:
             DataprocSparkSession._active_s8s_session_id = None
-            raise
-                f"Error while creating serverless session: {e}"
-            )
+            raise DataprocSparkConnectException(
+                f"Error while creating serverless session: {e.message}"
+            )
         except Exception as e:
             DataprocSparkSession._active_s8s_session_id = None
             raise RuntimeError(
-                f"Error while creating serverless session
-            ) from
+                f"Error while creating serverless session"
+            ) from e

         logger.debug(
             f"Serverless session created: {session_id}, creation time taken: {int(time.time() - s8s_creation_start_time)} seconds"
         )
         return self.__create_spark_connect_session_from_s8s(
-            session_response
+            session_response, dataproc_config.name
         )

-    def
-        self,
-    ) -> Optional[
-        session_name = f"projects/{self._project_id}/locations/{self._region}/sessions/{s8s_session_id}"
-        get_session_request = GetSessionRequest()
-        get_session_request.name = session_name
-        state = None
-        try:
-            get_session_response = SessionControllerClient(
-                client_options=self._client_options
-            ).get_session(get_session_request)
-            state = get_session_response.state
-        except Exception as e:
-            logger.debug(f"{s8s_session_id} deleted: {e}")
-            return None
-
-        if state is not None and (
-            state == Session.State.ACTIVE or state == Session.State.CREATING
-        ):
-            return get_session_response
-        return None
-
-    def _get_exiting_active_session(self) -> Optional["SparkSession"]:
+    def _get_exiting_active_session(
+        self,
+    ) -> Optional["DataprocSparkSession"]:
         s8s_session_id = DataprocSparkSession._active_s8s_session_id
-
+        session_name = f"projects/{self._project_id}/locations/{self._region}/sessions/{s8s_session_id}"
+        session_response = get_active_s8s_session_response(
+            session_name, self._client_options
+        )

         session = DataprocSparkSession.getActiveSession()
         if session is None:
@@ -299,11 +302,11 @@ class DataprocSparkSession(SparkSession):

         if session_response is not None:
             print(
-                f"Using existing session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}, configuration changes may not be applied."
+                f"Using existing session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}, configuration changes may not be applied."
             )
             if session is None:
                 session = self.__create_spark_connect_session_from_s8s(
-                    session_response
+                    session_response, session_name
                 )
             return session
         else:
@@ -315,7 +318,7 @@ class DataprocSparkSession(SparkSession):

         return None

-    def getOrCreate(self) -> "
+    def getOrCreate(self) -> "DataprocSparkSession":
         with DataprocSparkSession._lock:
             session = self._get_exiting_active_session()
             if session is None:
@@ -413,11 +416,11 @@ class DataprocSparkSession(SparkSession):
         ]
         import importlib.metadata

-
+        google_connect_version = importlib.metadata.version(
             "dataproc-spark-connect"
         )
         client_version = importlib.metadata.version("pyspark")
-        version_message = f"
+        version_message = f"Spark Connect: {google_connect_version} (PySpark: {client_version}) Session Runtime: {version} (Spark: {server_version})"
         logger.info(version_message)
         if trimmed_version(client_version) != trimmed_version(
             server_version
@@ -435,14 +438,6 @@ class DataprocSparkSession(SparkSession):
         spark_connect_session = session_template.spark_connect_session
         return spark_connect_session

-    def _get_spark(self, dataproc_config, session_template):
-        spark = None
-        if dataproc_config and dataproc_config.spark:
-            spark = dataproc_config.spark
-        elif session_template and session_template.spark:
-            spark = session_template.spark
-        return spark
-
     def generate_dataproc_session_id(self):
         timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
         suffix_length = 6
@@ -466,8 +461,8 @@ class DataprocSparkSession(SparkSession):
         <div>
           <p><b>Spark Connect</b></p>

-          <p><a href="{s8s_session}">
-          <p><a href="{ui}">Spark UI</a></p>
+          <p><a href="{s8s_session}?project={self._project_id}">Serverless Session</a></p>
+          <p><a href="{ui}?project={self._project_id}">Spark UI</a></p>
         </div>
         """

@@ -484,45 +479,70 @@ class DataprocSparkSession(SparkSession):
                 f"Exception while removing active session in file {file_path} , {e}"
             )

+    def addArtifacts(
+        self,
+        *artifact: str,
+        pyfile: bool = False,
+        archive: bool = False,
+        file: bool = False,
+        pypi: bool = False,
+    ) -> None:
+        """
+        Add artifact(s) to the client session. Currently only local files & pypi installations are supported.
+
+        .. versionadded:: 3.5.0
+
+        Parameters
+        ----------
+        *path : tuple of str
+            Artifact's URIs to add.
+        pyfile : bool
+            Whether to add them as Python dependencies such as .py, .egg, .zip or .jar files.
+            The pyfiles are directly inserted into the path when executing Python functions
+            in executors.
+        archive : bool
+            Whether to add them as archives such as .zip, .jar, .tar.gz, .tgz, or .tar files.
+            The archives are unpacked on the executor side automatically.
+        file : bool
+            Add a file to be downloaded with this Spark job on every node.
+            The ``path`` passed can only be a local file for now.
+        pypi : bool
+            This option is only available with DataprocSparkSession. eg. `spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)`
+            Installs PyPi package (with its dependencies) in the active Spark session on the driver and executors.
+
+        Notes
+        -----
+        This is an API dedicated to Spark Connect client only. With regular Spark Session, it throws
+        an exception.
+        Regarding pypi: Popular packages are already pre-installed in s8s runtime.
+        https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-2.2#python_libraries
+        If there are conflicts/package doesn't exist, it throws an exception.
+        """
+        if sum([pypi, file, pyfile, archive]) > 1:
+            raise ValueError(
+                "'pyfile', 'archive', 'file' and/or 'pypi' cannot be True together."
+            )
+        if pypi:
+            artifacts = PyPiArtifacts(set(artifact))
+            logger.debug("Making addArtifact call to install packages")
+            self.addArtifact(
+                artifacts.write_packages_config(self._active_s8s_session_uuid),
+                file=True,
+            )
+        else:
+            super().addArtifacts(
+                *artifact, pyfile=pyfile, archive=archive, file=file
+            )
+
     def stop(self) -> None:
         with DataprocSparkSession._lock:
             if DataprocSparkSession._active_s8s_session_id is not None:
-
-
-
-
+                terminate_s8s_session(
+                    DataprocSparkSession._project_id,
+                    DataprocSparkSession._region,
+                    DataprocSparkSession._active_s8s_session_id,
+                    self._client_options,
                 )
-                terminate_session_request = TerminateSessionRequest()
-                session_name = f"projects/{DataprocSparkSession._project_id}/locations/{DataprocSparkSession._region}/sessions/{DataprocSparkSession._active_s8s_session_id}"
-                terminate_session_request.name = session_name
-                state = None
-                try:
-                    SessionControllerClient(
-                        client_options=self._client_options
-                    ).terminate_session(terminate_session_request)
-                    get_session_request = GetSessionRequest()
-                    get_session_request.name = session_name
-                    state = Session.State.ACTIVE
-                    while (
-                        state != Session.State.TERMINATING
-                        and state != Session.State.TERMINATED
-                        and state != Session.State.FAILED
-                    ):
-                        session = SessionControllerClient(
-                            client_options=self._client_options
-                        ).get_session(get_session_request)
-                        state = session.state
-                        sleep(1)
-                except NotFound:
-                    logger.debug(
-                        f"Session {DataprocSparkSession._active_s8s_session_id} already deleted"
-                    )
-                except FailedPrecondition:
-                    logger.debug(
-                        f"Session {DataprocSparkSession._active_s8s_session_id} already terminated manually or terminated automatically through session ttl limits"
-                    )
-                if state is not None and state == Session.State.FAILED:
-                    raise RuntimeError("Serverless session termination failed")

                 self._remove_stoped_session_from_file()
                 DataprocSparkSession._active_s8s_session_uuid = None
@@ -538,3 +558,66 @@ class DataprocSparkSession(SparkSession):
             DataprocSparkSession._active_session, "session", None
         ):
             DataprocSparkSession._active_session.session = None
+
+
+def terminate_s8s_session(
+    project_id, region, active_s8s_session_id, client_options=None
+):
+    from google.cloud.dataproc_v1 import SessionControllerClient
+
+    logger.debug(f"Terminating serverless session: {active_s8s_session_id}")
+    terminate_session_request = TerminateSessionRequest()
+    session_name = f"projects/{project_id}/locations/{region}/sessions/{active_s8s_session_id}"
+    terminate_session_request.name = session_name
+    state = None
+    try:
+        session_client = SessionControllerClient(client_options=client_options)
+        session_client.terminate_session(terminate_session_request)
+        get_session_request = GetSessionRequest()
+        get_session_request.name = session_name
+        state = Session.State.ACTIVE
+        while (
+            state != Session.State.TERMINATING
+            and state != Session.State.TERMINATED
+            and state != Session.State.FAILED
+        ):
+            session = session_client.get_session(get_session_request)
+            state = session.state
+            sleep(1)
+    except NotFound:
+        logger.debug(f"Session {active_s8s_session_id} already deleted")
+    # Client will get 'Aborted' error if session creation is still in progress and
+    # 'FailedPrecondition' if another termination is still in progress.
+    # Both are retryable but we catch it and let TTL take care of cleanups.
+    except (FailedPrecondition, Aborted):
+        logger.debug(
+            f"Session {active_s8s_session_id} already terminated manually or terminated automatically through session ttl limits"
+        )
+    if state is not None and state == Session.State.FAILED:
+        raise RuntimeError("Serverless session termination failed")
+
+
+def get_active_s8s_session_response(
+    session_name, client_options
+) -> Optional[sessions.Session]:
+    get_session_request = GetSessionRequest()
+    get_session_request.name = session_name
+    try:
+        get_session_response = SessionControllerClient(
+            client_options=client_options
+        ).get_session(get_session_request)
+        state = get_session_response.state
+    except Exception as e:
+        logger.info(f"{session_name} deleted: {e}")
+        return None
+    if state is not None and (
+        state == Session.State.ACTIVE or state == Session.State.CREATING
+    ):
+        return get_session_response
+    return None
+
+
+def is_s8s_session_active(session_name, client_options) -> bool:
+    if get_active_s8s_session_response(session_name, client_options) is None:
+        return False
+    return True
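Taken together, the session.py changes sketch the intended 0.6.0 workflow: opt-in termination at interpreter exit via the environment variable checked in `__create`, PyPI installation into a live session through `addArtifacts(..., pypi=True)` as documented in the docstring above, and cleanup through `terminate_s8s_session` when `stop()` is called. A hedged end-to-end example; the subnet and package pins are placeholders:

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session, SparkConnectConfig

# Best-effort termination of the serverless session when the interpreter exits.
os.environ["dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"] = "true"

config = Session()
config.spark_connect_session = SparkConnectConfig()
config.environment_config.execution_config.subnetwork_uri = "<subnet>"
config.runtime_config.version = "3.0"

spark = DataprocSparkSession.builder.dataprocSessionConfig(config).getOrCreate()

# New in 0.6.0: install PyPI packages into the running session (driver and executors).
spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)

spark.stop()  # terminates the serverless session via terminate_s8s_session()
```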
--- dataproc_spark_connect-0.2.1/setup.py
+++ dataproc_spark_connect-0.6.0/setup.py
@@ -20,7 +20,7 @@ long_description = (this_directory / "README.md").read_text()

 setup(
     name="dataproc-spark-connect",
-    version="0.2.1",
+    version="0.6.0",
     description="Dataproc client library for Spark Connect",
     long_description=long_description,
     author="Google LLC",
@@ -29,11 +29,9 @@ setup(
     packages=find_namespace_packages(include=["google.*"]),
     install_requires=[
         "google-api-core>=2.19.1",
-        "google-cloud-dataproc>=5.
-        "wheel",
+        "google-cloud-dataproc>=5.18.0",
         "websockets",
-        "pyspark>=3.5",
-        "
-        "pyarrow",
+        "pyspark[connect]>=3.5",
+        "packaging>=20.0",
     ],
 )
The remaining files are unchanged: LICENSE, dataproc_spark_connect.egg-info/dependency_links.txt, dataproc_spark_connect.egg-info/top_level.txt, google/cloud/dataproc_spark_connect/client/__init__.py, pyproject.toml, setup.cfg.