dataproc-spark-connect 1.0.0rc5__tar.gz → 1.0.0rc7__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- dataproc_spark_connect-1.0.0rc7/PKG-INFO +200 -0
- dataproc_spark_connect-1.0.0rc7/README.md +177 -0
- dataproc_spark_connect-1.0.0rc7/dataproc_spark_connect.egg-info/PKG-INFO +200 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/dataproc_spark_connect.egg-info/requires.txt +1 -1
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/environment.py +4 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/exceptions.py +1 -1
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/session.py +152 -33
- dataproc_spark_connect-1.0.0rc7/pyproject.toml +9 -0
- dataproc_spark_connect-1.0.0rc7/setup.cfg +14 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/setup.py +2 -2
- dataproc_spark_connect-1.0.0rc5/PKG-INFO +0 -105
- dataproc_spark_connect-1.0.0rc5/README.md +0 -83
- dataproc_spark_connect-1.0.0rc5/dataproc_spark_connect.egg-info/PKG-INFO +0 -105
- dataproc_spark_connect-1.0.0rc5/pyproject.toml +0 -3
- dataproc_spark_connect-1.0.0rc5/setup.cfg +0 -7
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/LICENSE +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/dataproc_spark_connect.egg-info/SOURCES.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/__init__.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/client/core.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/client/proxy.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/pypi_artifacts.py +0 -0
dataproc_spark_connect-1.0.0rc7/PKG-INFO (new file, 200 lines)

Metadata-Version: 2.4
Name: dataproc-spark-connect
Version: 1.0.0rc7
Summary: Dataproc client library for Spark Connect
Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
Author: Google LLC
License: Apache 2.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: google-api-core>=2.19
Requires-Dist: google-cloud-dataproc>=5.18
Requires-Dist: packaging>=20.0
Requires-Dist: pyspark-client~=4.0.0
Requires-Dist: tqdm>=4.67
Requires-Dist: websockets>=14.0
Dynamic: author
Dynamic: description
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

# Dataproc Spark Connect Client

A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
client with additional functionalities that allow applications to communicate
with a remote Dataproc Spark Session using the Spark Connect protocol without
requiring additional steps.

## Install

```sh
pip install dataproc_spark_connect
```

## Uninstall

```sh
pip uninstall dataproc_spark_connect
```

## Setup

This client requires permissions to
manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).

If you are running the client outside of Google Cloud, you need to provide
authentication credentials. Set the `GOOGLE_APPLICATION_CREDENTIALS` environment
variable to point to
your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
file.

You can specify the project and region either via environment variables or directly
in your code using the builder API:

* Environment variables: `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION`
* Builder API: `.projectId()` and `.location()` methods (recommended)

## Usage

1. Install the latest version of Dataproc Spark Connect:

```sh
pip install -U dataproc-spark-connect
```

2. Add the required imports into your PySpark application or notebook and start
a Spark session using the fluent API:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
spark = DataprocSparkSession.builder.getOrCreate()
```

3. You can configure Spark properties using the `.config()` method:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
spark = DataprocSparkSession.builder.config('spark.executor.memory', '4g').config('spark.executor.cores', '2').getOrCreate()
```

4. For advanced configuration, you can use the `Session` class to customize
settings like subnetwork or other environment configurations:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
session_config.runtime_config.version = '3.0'
spark = DataprocSparkSession.builder.projectId('my-project').location('us-central1').dataprocSessionConfig(session_config).getOrCreate()
```

### Reusing Named Sessions Across Notebooks

Named sessions allow you to share a single Spark session across multiple notebooks, improving efficiency by avoiding repeated session startup times and reducing costs.

To create or connect to a named session:

1. Create a session with a custom ID in your first notebook:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
session_id = 'my-ml-pipeline-session'
spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
df.show()
```

2. Reuse the same session in another notebook by specifying the same session ID:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
session_id = 'my-ml-pipeline-session'
spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
df = spark.createDataFrame([(2, 'more-data')], ['id', 'value'])
df.show()
```

3. Session IDs must be 4-63 characters long, start with a lowercase letter, contain only lowercase letters, numbers, and hyphens, and not end with a hyphen.

4. Named sessions persist until explicitly terminated or reach their configured TTL.

5. A session with a given ID that is in a TERMINATED state cannot be reused. It must be deleted before a new session with the same ID can be created.

### Using Spark SQL Magic Commands (Jupyter Notebooks)

The package supports the [sparksql-magic](https://github.com/cryeo/sparksql-magic) library for executing Spark SQL queries directly in Jupyter notebooks.

**Installation**: To use magic commands, install the required dependencies manually:
```bash
pip install dataproc-spark-connect
pip install IPython sparksql-magic
```

1. Load the magic extension:
```python
%load_ext sparksql_magic
```

2. Configure default settings (optional):
```python
%config SparkSql.limit=20
```

3. Execute SQL queries:
```python
%%sparksql
SELECT * FROM your_table
```

4. Advanced usage with options:
```python
# Cache results and create a view
%%sparksql --cache --view result_view df
SELECT * FROM your_table WHERE condition = true
```

Available options:
- `--cache` / `-c`: Cache the DataFrame
- `--eager` / `-e`: Cache with eager loading
- `--view VIEW` / `-v VIEW`: Create a temporary view
- `--limit N` / `-l N`: Override default row display limit
- `variable_name`: Store result in a variable

See [sparksql-magic](https://github.com/cryeo/sparksql-magic) for more examples.

**Note**: Magic commands are optional. If you only need basic DataprocSparkSession functionality without Jupyter magic support, install only the base package:
```bash
pip install dataproc-spark-connect
```

## Developing

For development instructions see [guide](DEVELOPING.md).

## Contributing

We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.

### Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com> to see
your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.

### Code reviews

All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
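Taken together, the Usage and Named Sessions sections of the new README above describe a single fluent builder. A minimal end-to-end sketch combining those options; the project ID, region, session ID, and Spark property values are placeholders:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = (
    DataprocSparkSession.builder
    .projectId('my-project')                      # placeholder project
    .location('us-central1')                      # placeholder region
    .dataprocSessionId('my-ml-pipeline-session')  # optional: named, reusable session
    .config('spark.executor.memory', '4g')        # regular Spark properties
    .getOrCreate()
)

df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
df.show()
spark.stop()
```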
dataproc_spark_connect-1.0.0rc7/README.md (new file, 177 lines): content identical to the Markdown description section of dataproc_spark_connect-1.0.0rc7/PKG-INFO above.
dataproc_spark_connect-1.0.0rc7/dataproc_spark_connect.egg-info/PKG-INFO (new file, 200 lines): content identical to dataproc_spark_connect-1.0.0rc7/PKG-INFO above.
google/cloud/dataproc_spark_connect/environment.py

@@ -67,6 +67,10 @@ def is_interactive_terminal():
     return is_interactive() and is_terminal()


+def is_dataproc_batch() -> bool:
+    return os.getenv("DATAPROC_WORKLOAD_TYPE") == "batch"
+
+
 def get_client_environment_label() -> str:
     """
     Map current environment to a standardized client label.
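The new helper only inspects the `DATAPROC_WORKLOAD_TYPE` environment variable, which the library expects to be set to `batch` inside Dataproc batch workloads. A hedged sketch of how a caller can branch on it; the variable is set here purely for illustration:

```python
import os

# Simulate running inside a Dataproc batch workload (illustration only).
os.environ["DATAPROC_WORKLOAD_TYPE"] = "batch"

from google.cloud.dataproc_spark_connect import environment

if environment.is_dataproc_batch():
    # Batch workloads already carry a local Spark session; session.py (below)
    # short-circuits getOrCreate() to it instead of creating a remote one.
    print("Running as a Dataproc batch workload")
else:
    print("Interactive / non-batch environment")
```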
google/cloud/dataproc_spark_connect/session.py

@@ -14,6 +14,7 @@

 import atexit
 import datetime
+import functools
 import json
 import logging
 import os
@@ -25,8 +26,6 @@ import time
 import uuid
 import tqdm
 from packaging import version
-from tqdm import tqdm as cli_tqdm
-from tqdm.notebook import tqdm as notebook_tqdm
 from types import MethodType
 from typing import Any, cast, ClassVar, Dict, Iterable, Optional, Union

@@ -67,6 +66,10 @@ SYSTEM_LABELS = {
     "goog-colab-notebook-id",
 }

+_DATAPROC_SESSIONS_BASE_URL = (
+    "https://console.cloud.google.com/dataproc/interactive"
+)
+

 def _is_valid_label_value(value: str) -> bool:
     """
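Every Cloud Console link in the hunks below is now built from this single constant. A small sketch of the URL shape those f-strings produce; the project, region, and session ID values are placeholders:

```python
_DATAPROC_SESSIONS_BASE_URL = "https://console.cloud.google.com/dataproc/interactive"

# Placeholder values for illustration.
region, session_id, project_id = "us-central1", "my-ml-pipeline-session", "my-project"

session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{region}/{session_id}?project={project_id}"
print(session_url)
# https://console.cloud.google.com/dataproc/interactive/us-central1/my-ml-pipeline-session?project=my-project
```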
@@ -472,16 +475,43 @@ class DataprocSparkSession(SparkSession):
             session_response, dataproc_config.name
         )

+    def _wait_for_session_available(
+        self, session_name: str, timeout: int = 300
+    ) -> Session:
+        start_time = time.time()
+        while time.time() - start_time < timeout:
+            try:
+                session = self.session_controller_client.get_session(
+                    name=session_name
+                )
+                if "Spark Connect Server" in session.runtime_info.endpoints:
+                    return session
+                time.sleep(5)
+            except Exception as e:
+                logger.warning(
+                    f"Error while polling for Spark Connect endpoint: {e}"
+                )
+                time.sleep(5)
+        raise RuntimeError(
+            f"Spark Connect endpoint not available for session {session_name} after {timeout} seconds."
+        )
+
     def _display_session_link_on_creation(self, session_id):
-        session_url = f"
+        session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{session_id}?project={self._project_id}"
         plain_message = f"Creating Dataproc Session: {session_url}"
-
+        if environment.is_colab_enterprise():
+            html_element = f"""
             <div>
             <p>Creating Dataproc Spark Session<p>
-            <p><a href="{session_url}">Dataproc Session</a></p>
             </div>
-
-
+            """
+        else:
+            html_element = f"""
+            <div>
+            <p>Creating Dataproc Spark Session<p>
+            <p><a href="{session_url}">Dataproc Session</a></p>
+            </div>
+            """
         self._output_element_or_message(plain_message, html_element)

     def _print_session_created_message(self):
@@ -533,10 +563,13 @@ class DataprocSparkSession(SparkSession):

         if session_response is not None:
             print(
-                f"Using existing Dataproc Session (configuration changes may not be applied):
+                f"Using existing Dataproc Session (configuration changes may not be applied): {_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{s8s_session_id}?project={self._project_id}"
             )
             self._display_view_session_details_button(s8s_session_id)
             if session is None:
+                session_response = self._wait_for_session_available(
+                    session_name
+                )
                 session = self.__create_spark_connect_session_from_s8s(
                     session_response, session_name
                 )
@@ -552,6 +585,13 @@ class DataprocSparkSession(SparkSession):

     def getOrCreate(self) -> "DataprocSparkSession":
         with DataprocSparkSession._lock:
+            if environment.is_dataproc_batch():
+                # For Dataproc batch workloads, connect to the already initialized local SparkSession
+                from pyspark.sql import SparkSession as PySparkSQLSession
+
+                session = PySparkSQLSession.builder.getOrCreate()
+                return session  # type: ignore
+
             # Handle custom session ID by setting it early and letting existing logic handle it
             if self._custom_session_id:
                 self._handle_custom_session_id()
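With this hunk, calling the Dataproc builder from inside a batch workload simply hands back the workload's own Spark session. A hedged sketch of the resulting behaviour; the environment check is simulated here, and outside a batch workload the builder still provisions a remote Dataproc session:

```python
import os
from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Inside a Dataproc batch workload, DATAPROC_WORKLOAD_TYPE is expected to be
# "batch" (simulated here), so getOrCreate() returns whatever
# pyspark's SparkSession.builder.getOrCreate() yields locally instead of
# creating a Dataproc Serverless interactive session.
os.environ["DATAPROC_WORKLOAD_TYPE"] = "batch"

spark = DataprocSparkSession.builder.getOrCreate()
print(type(spark))  # the local PySpark session, not a new remote Dataproc session
```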
@@ -559,6 +599,13 @@ class DataprocSparkSession(SparkSession):
             session = self._get_exiting_active_session()
             if session is None:
                 session = self.__create()
+
+            # Register this session as the instantiated SparkSession for compatibility
+            # with tools and libraries that expect SparkSession._instantiatedSession
+            from pyspark.sql import SparkSession as PySparkSQLSession
+
+            PySparkSQLSession._instantiatedSession = session
+
             return session

     def _handle_custom_session_id(self):
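A small sketch of what this registration buys: tooling that reaches the "current" session through plain PySpark now sees the Dataproc-backed session. `_instantiatedSession` is a private PySpark attribute, referenced here only to illustrate the hunk above:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from pyspark.sql import SparkSession

spark = DataprocSparkSession.builder.getOrCreate()

# After getOrCreate(), the generic PySpark entry point is wired to the same
# session object, so session-agnostic libraries pick it up.
assert SparkSession._instantiatedSession is spark
```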
@@ -673,8 +720,6 @@ class DataprocSparkSession(SparkSession):
            # Merge default configs with existing properties,
            # user configs take precedence
            for k, v in {
-               "spark.datasource.bigquery.viewsEnabled": "true",
-               "spark.datasource.bigquery.writeMethod": "direct",
                "spark.sql.catalog.spark_catalog": "com.google.cloud.spark.bigquery.BigQuerySparkSessionCatalog",
                "spark.sql.sources.default": "bigquery",
            }.items():
@@ -696,7 +741,7 @@ class DataprocSparkSession(SparkSession):

        # Runtime version to server Python version mapping
        RUNTIME_PYTHON_MAP = {
-           "3.0": (3,
+           "3.0": (3, 12),
        }

        client_python = sys.version_info[:2]  # (major, minor)
@@ -760,7 +805,7 @@ class DataprocSparkSession(SparkSession):
            return

        try:
-           session_url = f"
+           session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{session_id}?project={self._project_id}"
            from IPython.core.interactiveshell import InteractiveShell

            if not InteractiveShell.initialized():
@@ -943,6 +988,28 @@ class DataprocSparkSession(SparkSession):
             clearProgressHandlers_wrapper_method, self
         )

+    @staticmethod
+    @functools.lru_cache(maxsize=1)
+    def get_tqdm_bar():
+        """
+        Return a tqdm implementation that works in the current environment.
+
+        - Uses CLI tqdm for interactive terminals.
+        - Uses the notebook tqdm if available, otherwise falls back to CLI tqdm.
+        """
+        from tqdm import tqdm as cli_tqdm
+
+        if environment.is_interactive_terminal():
+            return cli_tqdm
+
+        try:
+            import ipywidgets
+            from tqdm.notebook import tqdm as notebook_tqdm
+
+            return notebook_tqdm
+        except ImportError:
+            return cli_tqdm
+
     def _register_progress_execution_handler(self):
         from pyspark.sql.connect.shell.progress import StageInfo

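The factory above is a cached selector: use the plain terminal tqdm when the process looks interactive, otherwise try the ipywidgets-based notebook bar and fall back gracefully. A hedged usage sketch; the task counts are placeholders:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession

# The returned object is a tqdm class, so it is used like tqdm itself.
tqdm_pbar = DataprocSparkSession.get_tqdm_bar()

with tqdm_pbar(total=100, desc="Spark tasks") as pbar:
    for _ in range(100):
        pbar.update(1)
```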
@@ -967,9 +1034,12 @@ class DataprocSparkSession(SparkSession):
                 total_tasks += stage.num_tasks
                 completed_tasks += stage.num_completed_tasks

-
-            if
-
+            # Don't show progress bar till we receive some tasks
+            if total_tasks == 0:
+                return
+
+            # Get correct tqdm (notebook or CLI)
+            tqdm_pbar = self.get_tqdm_bar()

             # Use a lock to ensure only one thread can access and modify
             # the shared dictionaries at a time.
@@ -1006,13 +1076,11 @@ class DataprocSparkSession(SparkSession):
     @staticmethod
     def _sql_lazy_transformation(req):
         # Select SQL command
-
-
-
-
-
-
-        return False
+        try:
+            query = req.plan.command.sql_command.input.sql.query
+            return "select" in query.strip().lower().split()
+        except AttributeError:
+            return False

     def _repr_html_(self) -> str:
         if not self._active_s8s_session_id:
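The rewritten check asks whether the word `select` appears among the whitespace-separated tokens of the SQL text, and treats any request without a SQL command as non-lazy. A standalone sketch of that classification; the helper below is a hypothetical re-implementation for illustration, not the method itself:

```python
def looks_like_select(query: str) -> bool:
    # Same token test as the diff: split on whitespace, compare lowercased words.
    return "select" in query.strip().lower().split()

print(looks_like_select("SELECT * FROM your_table"))    # True
print(looks_like_select("  select id, value from t "))  # True
print(looks_like_select("CREATE TABLE t AS SELECT 1"))  # True (contains the token)
print(looks_like_select("INSERT INTO t VALUES (1)"))    # False
```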
@@ -1020,7 +1088,7 @@ class DataprocSparkSession(SparkSession):
            <div>No Active Dataproc Session</div>
            """

-       s8s_session = f"
+       s8s_session = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{self._active_s8s_session_id}"
        ui = f"{s8s_session}/sparkApplications/applications"
        return f"""
        <div>
@@ -1047,7 +1115,7 @@ class DataprocSparkSession(SparkSession):
            )

            url = (
-               f"
+               f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/"
                f"{self._active_s8s_session_id}/sparkApplications/application;"
                f"associatedSqlOperationId={operation_id}?project={self._project_id}"
            )
@@ -1139,20 +1207,52 @@ class DataprocSparkSession(SparkSession):
     def _get_active_session_file_path():
         return os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")

-    def stop(self) -> None:
+    def stop(self, terminate: Optional[bool] = None) -> None:
+        """
+        Stop the Spark session and optionally terminate the server-side session.
+
+        Parameters
+        ----------
+        terminate : bool, optional
+            Control server-side termination behavior.
+
+            - None (default): Auto-detect based on session type
+
+              - Managed sessions (auto-generated ID): terminate server
+              - Named sessions (custom ID): client-side cleanup only
+
+            - True: Always terminate the server-side session
+            - False: Never terminate the server-side session (client cleanup only)
+
+        Examples
+        --------
+        Auto-detect termination behavior (existing behavior):
+
+        >>> spark.stop()
+
+        Force terminate a named session:
+
+        >>> spark.stop(terminate=True)
+
+        Prevent termination of a managed session:
+
+        >>> spark.stop(terminate=False)
+        """
         with DataprocSparkSession._lock:
             if DataprocSparkSession._active_s8s_session_id is not None:
-                #
-                if
-                #
-
-
-                        f"Stopping unmanaged session {DataprocSparkSession._active_s8s_session_id} without termination"
+                # Determine if we should terminate the server-side session
+                if terminate is None:
+                    # Auto-detect: managed sessions terminate, named sessions don't
+                    should_terminate = (
+                        not DataprocSparkSession._active_session_uses_custom_id
                     )
                 else:
-
+                    should_terminate = terminate
+
+                if should_terminate:
+                    # Terminate the server-side session
                     logger.debug(
-                        f"Terminating
+                        f"Terminating session {DataprocSparkSession._active_s8s_session_id}"
                     )
                     terminate_s8s_session(
                         DataprocSparkSession._project_id,
@@ -1160,8 +1260,27 @@ class DataprocSparkSession(SparkSession):
                         DataprocSparkSession._active_s8s_session_id,
                         self._client_options,
                     )
+                else:
+                    # Client-side cleanup only
+                    logger.debug(
+                        f"Stopping session {DataprocSparkSession._active_s8s_session_id} without termination"
+                    )

                 self._remove_stopped_session_from_file()
+
+                # Clean up SparkSession._instantiatedSession if it points to this session
+                try:
+                    from pyspark.sql import SparkSession as PySparkSQLSession
+
+                    if PySparkSQLSession._instantiatedSession is self:
+                        PySparkSQLSession._instantiatedSession = None
+                        logger.debug(
+                            "Cleared SparkSession._instantiatedSession reference"
+                        )
+                except (ImportError, AttributeError):
+                    # PySpark not available or _instantiatedSession doesn't exist
+                    pass
+
                 DataprocSparkSession._active_s8s_session_uuid = None
                 DataprocSparkSession._active_s8s_session_id = None
                 DataprocSparkSession._active_session_uses_custom_id = False
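Putting the two `stop()` hunks together: a named (custom-ID) session is left running by default so another notebook can re-attach, while `terminate=True` tears it down on the server. A short sketch based on the docstring above; the session ID is a placeholder:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = DataprocSparkSession.builder.dataprocSessionId('my-ml-pipeline-session').getOrCreate()

# Default for a named session: client-side cleanup only; the server-side
# session keeps running until its TTL or an explicit termination.
spark.stop()

# Re-attach from this (or another) notebook, then explicitly terminate it.
spark = DataprocSparkSession.builder.dataprocSessionId('my-ml-pipeline-session').getOrCreate()
spark.stop(terminate=True)
```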
setup.py

@@ -20,7 +20,7 @@ long_description = (this_directory / "README.md").read_text()

 setup(
     name="dataproc-spark-connect",
-    version="1.0.0rc5",
+    version="1.0.0rc7",
     description="Dataproc client library for Spark Connect",
     long_description=long_description,
     author="Google LLC",
@@ -31,7 +31,7 @@ setup(
         "google-api-core>=2.19",
         "google-cloud-dataproc>=5.18",
         "packaging>=20.0",
-        "pyspark[connect]~=4.0.0",
+        "pyspark-client~=4.0.0",
         "tqdm>=4.67",
         "websockets>=14.0",
     ],
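The dependency swap from `pyspark[connect]` to `pyspark-client` pulls in the slimmer Spark Connect client distribution instead of full PySpark. A hedged way to check which distribution an environment actually has installed, using only the standard library:

```python
from importlib import metadata

for dist in ("pyspark-client", "pyspark", "dataproc-spark-connect"):
    try:
        print(f"{dist}=={metadata.version(dist)}")
    except metadata.PackageNotFoundError:
        print(f"{dist}: not installed")
```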
dataproc_spark_connect-1.0.0rc5/PKG-INFO (removed, 105 lines)

Metadata-Version: 2.4
Name: dataproc-spark-connect
Version: 1.0.0rc5
Summary: Dataproc client library for Spark Connect
Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
Author: Google LLC
License: Apache 2.0
License-File: LICENSE
Requires-Dist: google-api-core>=2.19
Requires-Dist: google-cloud-dataproc>=5.18
Requires-Dist: packaging>=20.0
Requires-Dist: pyspark[connect]~=4.0.0
Requires-Dist: tqdm>=4.67
Requires-Dist: websockets>=14.0
Dynamic: author
Dynamic: description
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

# Dataproc Spark Connect Client

A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
client with additional functionalities that allow applications to communicate
with a remote Dataproc Spark Session using the Spark Connect protocol without
requiring additional steps.

## Install

```sh
pip install dataproc_spark_connect
```

## Uninstall

```sh
pip uninstall dataproc_spark_connect
```

## Setup

This client requires permissions to
manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
If you are running the client outside of Google Cloud, you must set following
environment variables:

* `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
workloads
* `GOOGLE_CLOUD_REGION` - The Compute
Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
where you run the Spark workload.
* `GOOGLE_APPLICATION_CREDENTIALS` -
Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)

## Usage

1. Install the latest version of Dataproc Python client and Dataproc Spark
Connect modules:

```sh
pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
```

2. Add the required imports into your PySpark application or notebook and start
a Spark session with the following code instead of using
environment variables:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
session_config.runtime_config.version = '2.2'
spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
```

## Developing

For development instructions see [guide](DEVELOPING.md).

## Contributing

We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.

### Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com> to see
your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.

### Code reviews

All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
dataproc_spark_connect-1.0.0rc5/README.md (removed, 83 lines): content identical to the Markdown description section of the removed dataproc_spark_connect-1.0.0rc5/PKG-INFO above.
dataproc_spark_connect-1.0.0rc5/dataproc_spark_connect.egg-info/PKG-INFO (removed, 105 lines): content identical to the removed dataproc_spark_connect-1.0.0rc5/PKG-INFO above.
Files listed above with +0 -0 are unchanged between the two versions.