dataproc-spark-connect 1.0.0rc5__tar.gz → 1.0.0rc6__tar.gz
This diff shows the changes between two publicly released versions of the package, as they appear in their respective public registries. It is provided for informational purposes only.
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/PKG-INFO +48 -1
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/README.md +47 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/dataproc_spark_connect.egg-info/PKG-INFO +48 -1
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/environment.py +4 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/exceptions.py +1 -1
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/session.py +52 -0
- dataproc_spark_connect-1.0.0rc6/pyproject.toml +9 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/setup.py +1 -1
- dataproc_spark_connect-1.0.0rc5/pyproject.toml +0 -3
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/LICENSE +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/dataproc_spark_connect.egg-info/SOURCES.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/dataproc_spark_connect.egg-info/requires.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/__init__.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/client/core.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/client/proxy.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/pypi_artifacts.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/setup.cfg +0 -0
{dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/PKG-INFO

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: dataproc-spark-connect
-Version: 1.0.0rc5
+Version: 1.0.0rc6
 Summary: Dataproc client library for Spark Connect
 Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
 Author: Google LLC
@@ -76,6 +76,53 @@ environment variables:
 spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
 ```
 
+### Using Spark SQL Magic Commands (Jupyter Notebooks)
+
+The package supports the [sparksql-magic](https://github.com/cryeo/sparksql-magic) library for executing Spark SQL queries directly in Jupyter notebooks.
+
+**Installation**: To use magic commands, install the required dependencies manually:
+```bash
+pip install dataproc-spark-connect
+pip install IPython sparksql-magic
+```
+
+1. Load the magic extension:
+```python
+%load_ext sparksql_magic
+```
+
+2. Configure default settings (optional):
+```python
+%config SparkSql.limit=20
+```
+
+3. Execute SQL queries:
+```python
+%%sparksql
+SELECT * FROM your_table
+```
+
+4. Advanced usage with options:
+```python
+# Cache results and create a view
+%%sparksql --cache --view result_view df
+SELECT * FROM your_table WHERE condition = true
+```
+
+Available options:
+- `--cache` / `-c`: Cache the DataFrame
+- `--eager` / `-e`: Cache with eager loading
+- `--view VIEW` / `-v VIEW`: Create a temporary view
+- `--limit N` / `-l N`: Override default row display limit
+- `variable_name`: Store result in a variable
+
+See [sparksql-magic](https://github.com/cryeo/sparksql-magic) for more examples.
+
+**Note**: Magic commands are optional. If you only need basic DataprocSparkSession functionality without Jupyter magic support, install only the base package:
+```bash
+pip install dataproc-spark-connect
+```
+
 ## Developing
 
 For development instructions see [guide](DEVELOPING.md).
````
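The documented magic workflow assumes a Spark session already exists in the notebook. A hedged end-to-end sketch tying the two together (the import path is assumed from the package layout in the file list above; the table name is a placeholder):

```python
# Cell 1: create (or reuse) the Dataproc Spark Connect session.
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = DataprocSparkSession.builder.getOrCreate()

# Cell 2 (notebook magics, shown here as comments):
# %load_ext sparksql_magic
# %config SparkSql.limit=20
# %%sparksql
# SELECT * FROM your_table
```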
{dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/README.md

````diff
@@ -54,6 +54,53 @@ environment variables:
 spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
 ```
 
+### Using Spark SQL Magic Commands (Jupyter Notebooks)
+
+The package supports the [sparksql-magic](https://github.com/cryeo/sparksql-magic) library for executing Spark SQL queries directly in Jupyter notebooks.
+
+**Installation**: To use magic commands, install the required dependencies manually:
+```bash
+pip install dataproc-spark-connect
+pip install IPython sparksql-magic
+```
+
+1. Load the magic extension:
+```python
+%load_ext sparksql_magic
+```
+
+2. Configure default settings (optional):
+```python
+%config SparkSql.limit=20
+```
+
+3. Execute SQL queries:
+```python
+%%sparksql
+SELECT * FROM your_table
+```
+
+4. Advanced usage with options:
+```python
+# Cache results and create a view
+%%sparksql --cache --view result_view df
+SELECT * FROM your_table WHERE condition = true
+```
+
+Available options:
+- `--cache` / `-c`: Cache the DataFrame
+- `--eager` / `-e`: Cache with eager loading
+- `--view VIEW` / `-v VIEW`: Create a temporary view
+- `--limit N` / `-l N`: Override default row display limit
+- `variable_name`: Store result in a variable
+
+See [sparksql-magic](https://github.com/cryeo/sparksql-magic) for more examples.
+
+**Note**: Magic commands are optional. If you only need basic DataprocSparkSession functionality without Jupyter magic support, install only the base package:
+```bash
+pip install dataproc-spark-connect
+```
+
 ## Developing
 
 For development instructions see [guide](DEVELOPING.md).
````
{dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/dataproc_spark_connect.egg-info/PKG-INFO

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: dataproc-spark-connect
-Version: 1.0.0rc5
+Version: 1.0.0rc6
 Summary: Dataproc client library for Spark Connect
 Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
 Author: Google LLC
@@ -76,6 +76,53 @@ environment variables:
 spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
 ```
 
+### Using Spark SQL Magic Commands (Jupyter Notebooks)
+
+The package supports the [sparksql-magic](https://github.com/cryeo/sparksql-magic) library for executing Spark SQL queries directly in Jupyter notebooks.
+
+**Installation**: To use magic commands, install the required dependencies manually:
+```bash
+pip install dataproc-spark-connect
+pip install IPython sparksql-magic
+```
+
+1. Load the magic extension:
+```python
+%load_ext sparksql_magic
+```
+
+2. Configure default settings (optional):
+```python
+%config SparkSql.limit=20
+```
+
+3. Execute SQL queries:
+```python
+%%sparksql
+SELECT * FROM your_table
+```
+
+4. Advanced usage with options:
+```python
+# Cache results and create a view
+%%sparksql --cache --view result_view df
+SELECT * FROM your_table WHERE condition = true
+```
+
+Available options:
+- `--cache` / `-c`: Cache the DataFrame
+- `--eager` / `-e`: Cache with eager loading
+- `--view VIEW` / `-v VIEW`: Create a temporary view
+- `--limit N` / `-l N`: Override default row display limit
+- `variable_name`: Store result in a variable
+
+See [sparksql-magic](https://github.com/cryeo/sparksql-magic) for more examples.
+
+**Note**: Magic commands are optional. If you only need basic DataprocSparkSession functionality without Jupyter magic support, install only the base package:
+```bash
+pip install dataproc-spark-connect
+```
+
 ## Developing
 
 For development instructions see [guide](DEVELOPING.md).
````
{dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/environment.py

```diff
@@ -67,6 +67,10 @@ def is_interactive_terminal():
     return is_interactive() and is_terminal()
 
 
+def is_dataproc_batch() -> bool:
+    return os.getenv("DATAPROC_WORKLOAD_TYPE") == "batch"
+
+
 def get_client_environment_label() -> str:
     """
     Map current environment to a standardized client label.
```
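A quick way to see the new helper's behavior is to toggle the environment variable it reads; a hypothetical test sketch (the module path is taken from the file list above):

```python
import os
from unittest import mock

# Module path as it appears in the file list above.
from google.cloud.dataproc_spark_connect import environment

# Simulate a Dataproc batch workload runtime.
with mock.patch.dict(os.environ, {"DATAPROC_WORKLOAD_TYPE": "batch"}):
    assert environment.is_dataproc_batch()

# Outside a batch workload the variable is absent (or set to something else).
with mock.patch.dict(os.environ, {}, clear=True):
    assert not environment.is_dataproc_batch()
```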
{dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/google/cloud/dataproc_spark_connect/session.py

```diff
@@ -472,6 +472,27 @@ class DataprocSparkSession(SparkSession):
             session_response, dataproc_config.name
         )
 
+    def _wait_for_session_available(
+        self, session_name: str, timeout: int = 300
+    ) -> Session:
+        start_time = time.time()
+        while time.time() - start_time < timeout:
+            try:
+                session = self.session_controller_client.get_session(
+                    name=session_name
+                )
+                if "Spark Connect Server" in session.runtime_info.endpoints:
+                    return session
+                time.sleep(5)
+            except Exception as e:
+                logger.warning(
+                    f"Error while polling for Spark Connect endpoint: {e}"
+                )
+                time.sleep(5)
+        raise RuntimeError(
+            f"Spark Connect endpoint not available for session {session_name} after {timeout} seconds."
+        )
+
     def _display_session_link_on_creation(self, session_id):
         session_url = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
         plain_message = f"Creating Dataproc Session: {session_url}"
```
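The new method is a plain poll-with-timeout loop: fetch the Session resource, return it once the "Spark Connect Server" endpoint is published, otherwise sleep five seconds and retry, giving up after five minutes by default. A minimal standalone sketch of the same pattern (names here are hypothetical, not part of the package):

```python
import time


def wait_until(predicate, timeout: float = 300.0, interval: float = 5.0):
    """Poll `predicate` every `interval` seconds until it returns a truthy
    value; raise if `timeout` seconds elapse first."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met after {timeout} seconds")
```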
session.py (continued):

```diff
@@ -537,6 +558,9 @@ class DataprocSparkSession(SparkSession):
        )
        self._display_view_session_details_button(s8s_session_id)
        if session is None:
+            session_response = self._wait_for_session_available(
+                session_name
+            )
            session = self.__create_spark_connect_session_from_s8s(
                session_response, session_name
            )
```
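This wiring appears intended to close a race in which the Session resource already exists but its Spark Connect endpoint has not yet been published: the client now blocks, up to the five-minute default, until the endpoint shows up before building the Spark Connect session from it.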
session.py (continued):

```diff
@@ -552,6 +576,13 @@ class DataprocSparkSession(SparkSession):
 
     def getOrCreate(self) -> "DataprocSparkSession":
         with DataprocSparkSession._lock:
+            if environment.is_dataproc_batch():
+                # For Dataproc batch workloads, connect to the already initialized local SparkSession
+                from pyspark.sql import SparkSession as PySparkSQLSession
+
+                session = PySparkSQLSession.builder.getOrCreate()
+                return session  # type: ignore
+
             # Handle custom session ID by setting it early and letting existing logic handle it
             if self._custom_session_id:
                 self._handle_custom_session_id()
```
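With this short-circuit plus the new `is_dataproc_batch` helper, the same entry-point code can run unchanged in a notebook and inside a batch workload. A hedged usage sketch (the import path is assumed from the package layout shown in the file list above):

```python
# Hedged usage sketch; import path assumed from the package layout.
from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Interactively, this creates or reuses a Dataproc Spark Connect session.
# Inside a Dataproc batch workload (DATAPROC_WORKLOAD_TYPE=batch), it now
# returns the local SparkSession the batch runtime already initialized.
spark = DataprocSparkSession.builder.getOrCreate()
spark.sql("SELECT 1 AS ok").show()
```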
session.py (continued):

```diff
@@ -559,6 +590,13 @@ class DataprocSparkSession(SparkSession):
             session = self._get_exiting_active_session()
             if session is None:
                 session = self.__create()
+
+            # Register this session as the instantiated SparkSession for compatibility
+            # with tools and libraries that expect SparkSession._instantiatedSession
+            from pyspark.sql import SparkSession as PySparkSQLSession
+
+            PySparkSQLSession._instantiatedSession = session
+
             return session
 
     def _handle_custom_session_id(self):
```
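Registering the session under `SparkSession._instantiatedSession` is what lets session-discovering tools (sparksql-magic among them) find the Dataproc session instead of trying to build their own. A hypothetical illustration of the lookup such tools perform:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch of how a third-party tool discovers the current
# session; with the change above this now yields the DataprocSparkSession.
current = SparkSession._instantiatedSession
if current is not None:
    current.sql("SELECT 1 AS ok").show()
```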
session.py (continued):

```diff
@@ -1162,6 +1200,20 @@ class DataprocSparkSession(SparkSession):
            )
 
        self._remove_stopped_session_from_file()
+
+       # Clean up SparkSession._instantiatedSession if it points to this session
+       try:
+           from pyspark.sql import SparkSession as PySparkSQLSession
+
+           if PySparkSQLSession._instantiatedSession is self:
+               PySparkSQLSession._instantiatedSession = None
+               logger.debug(
+                   "Cleared SparkSession._instantiatedSession reference"
+               )
+       except (ImportError, AttributeError):
+           # PySpark not available or _instantiatedSession doesn't exist
+           pass
+
        DataprocSparkSession._active_s8s_session_uuid = None
        DataprocSparkSession._active_s8s_session_id = None
        DataprocSparkSession._active_session_uses_custom_id = False
```
{dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc6}/setup.py

```diff
@@ -20,7 +20,7 @@ long_description = (this_directory / "README.md").read_text()
 
 setup(
     name="dataproc-spark-connect",
-    version="1.0.0rc5",
+    version="1.0.0rc6",
     description="Dataproc client library for Spark Connect",
     long_description=long_description,
     author="Google LLC",
```