dataproc-spark-connect 0.1.0__tar.gz → 0.2.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- dataproc_spark_connect-0.2.1/PKG-INFO +119 -0
- dataproc_spark_connect-0.2.1/README.md +103 -0
- dataproc_spark_connect-0.2.1/dataproc_spark_connect.egg-info/PKG-INFO +119 -0
- dataproc_spark_connect-0.2.1/dataproc_spark_connect.egg-info/requires.txt +7 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/__init__.py +9 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/client/proxy.py +59 -20
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/session.py +2 -2
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/setup.py +12 -1
- dataproc_spark_connect-0.1.0/PKG-INFO +0 -10
- dataproc_spark_connect-0.1.0/README.md +0 -90
- dataproc_spark_connect-0.1.0/dataproc_spark_connect.egg-info/PKG-INFO +0 -10
- dataproc_spark_connect-0.1.0/dataproc_spark_connect.egg-info/requires.txt +0 -5
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/LICENSE +0 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/dataproc_spark_connect.egg-info/SOURCES.txt +0 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/client/core.py +0 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/pyproject.toml +0 -0
- {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/setup.cfg +0 -0
dataproc_spark_connect-0.2.1/PKG-INFO (added):

```diff
@@ -0,0 +1,119 @@
+Metadata-Version: 2.1
+Name: dataproc-spark-connect
+Version: 0.2.1
+Summary: Dataproc client library for Spark Connect
+Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
+Author: Google LLC
+License: Apache 2.0
+License-File: LICENSE
+Requires-Dist: google-api-core>=2.19.1
+Requires-Dist: google-cloud-dataproc>=5.15.1
+Requires-Dist: wheel
+Requires-Dist: websockets
+Requires-Dist: pyspark>=3.5
+Requires-Dist: pandas
+Requires-Dist: pyarrow
+
+# Dataproc Spark Connect Client
+
+> ⚠️ **Warning:**
+The package `dataproc-spark-connect` has been renamed to `google-spark-connect`. `dataproc-spark-connect` will no longer be updated.
+For help using `google-spark-connect`, please see [guide](https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python/blob/main/README.md).
+
+
+A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
+additional functionalities that allow applications to communicate with a remote Dataproc
+Spark cluster using the Spark Connect protocol without requiring additional steps.
+
+## Install
+
+.. code-block:: console
+
+pip install dataproc_spark_connect
+
+## Uninstall
+
+.. code-block:: console
+
+pip uninstall dataproc_spark_connect
+
+
+## Setup
+This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
+If you are running the client outside of Google Cloud, you must set following environment variables:
+
+* GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
+* GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
+* GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
+* DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
+
+## Usage
+
+1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
+
+.. code-block:: console
+
+pip install google_cloud_dataproc --force-reinstall
+pip install dataproc_spark_connect --force-reinstall
+
+2. Add the required import into your PySpark application or notebook:
+
+.. code-block:: python
+
+from google.cloud.dataproc_spark_connect import DataprocSparkSession
+
+3. There are two ways to create a spark session,
+
+1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
+
+.. code-block:: python
+
+spark = DataprocSparkSession.builder.getOrCreate()
+
+2. Start a Spark session with the following code instead of using a config file:
+
+.. code-block:: python
+
+from google.cloud.dataproc_v1 import SparkConnectConfig
+from google.cloud.dataproc_v1 import Session
+dataproc_config = Session()
+dataproc_config.spark_connect_session = SparkConnectConfig()
+dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
+dataproc_config.runtime_config.version = '3.0'
+spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
+
+## Billing
+As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
+This will happen even if you are running the client from a non-GCE instance.
+
+## Contributing
+### Building and Deploying SDK
+
+1. Install the requirements in virtual environment.
+
+.. code-block:: console
+
+pip install -r requirements.txt
+
+2. Build the code.
+
+.. code-block:: console
+
+python setup.py sdist bdist_wheel
+
+
+3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
+
+.. code-block:: console
+
+VERSION=<version> gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
+
+4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
+
+.. code-block:: console
+
+%%bash
+export VERSION=<version>
+gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
+yes | pip uninstall dataproc_spark_connect
+pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
```
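The README embedded in this new PKG-INFO walks through setup and both ways of creating a session. As a single runnable sketch of that flow (the project, region, and subnet values are placeholders, the textproto config route via `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG` is skipped, and running it creates a billable Dataproc Serverless session):

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session, SparkConnectConfig

# Required when running outside Google Cloud (see the Setup section above).
os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"    # placeholder
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"    # placeholder
# GOOGLE_APPLICATION_CREDENTIALS should already point at your ADC key file.

# Build the session config in code instead of using a config file.
dataproc_config = Session()
dataproc_config.spark_connect_session = SparkConnectConfig()
dataproc_config.environment_config.execution_config.subnetwork_uri = "my-subnet"  # placeholder
dataproc_config.runtime_config.version = "3.0"

spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
spark.range(5).show()  # trivial query to confirm the remote session works
spark.stop()
```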
dataproc_spark_connect-0.2.1/README.md (added, +103 lines): the added content is identical to the README portion of the PKG-INFO shown above (everything from the `# Dataproc Spark Connect Client` heading onward), so it is not repeated here.
dataproc_spark_connect-0.2.1/dataproc_spark_connect.egg-info/PKG-INFO (added, +119 lines): byte-for-byte identical to the top-level PKG-INFO shown above, so it is not repeated here.
google/cloud/dataproc_spark_connect/__init__.py:

```diff
@@ -12,3 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from .session import DataprocSparkSession
+import warnings
+
+warnings.warn(
+    "The package 'dataproc-spark-connect' has been renamed to 'google-spark-connect'. "
+    "'dataproc-spark-connect' will no longer be updated. "
+    "For help using 'google-spark-connect', "
+    "see https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python/blob/main/README.md. ",
+    DeprecationWarning,
+)
```
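Because the warning is raised at module level, it fires once per process on the first import of the package. Note that Python hides `DeprecationWarning` by default for code triggered outside `__main__`, so a quick way to see it (an illustrative sketch, not part of the package) is to widen the warning filter before importing:

```python
import warnings

# Surface DeprecationWarning, which Python suppresses by default
# when it originates from imported library code.
warnings.simplefilter("default", DeprecationWarning)

import google.cloud.dataproc_spark_connect  # emits the rename warning on import
```

Running a script with `python -W default::DeprecationWarning script.py` has the same effect without code changes.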
google/cloud/dataproc_spark_connect/client/proxy.py:

```diff
@@ -43,13 +43,22 @@ class bridged_socket(object):
         self._conn = websocket_conn
 
     def recv(self, buff_size):
-
+        # N.B. The websockets [recv method](https://websockets.readthedocs.io/en/stable/reference/sync/client.html#websockets.sync.client.ClientConnection.recv)
+        # does not support the buff_size parameter, but it does add a `timeout` keyword parameter not supported by normal
+        # socket objects.
+        #
+        # We set that timeout to 60 seconds to prevent any scenarios where we wind up stuck waiting for a message from a websocket connection
+        # that never comes.
+        msg = self._conn.recv(timeout=60)
         return bytes.fromhex(msg)
 
     def send(self, msg_bytes):
         msg = bytes.hex(msg_bytes)
         self._conn.send(msg)
 
+    def close(self):
+        return self._conn.close()
+
 
 def connect_tcp_bridge(hostname):
     """Create a socket-like connection to the given hostname using websocket.
```
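The `bridged_socket` wrapper carries raw TCP bytes over the websocket as hex-encoded text: `send` hex-encodes outgoing bytes, `recv` decodes the incoming text message back into bytes, and the new `timeout=60` bounds how long a single `recv` may block. A minimal, standalone sketch of just that framing (no live websocket involved):

```python
# Hex framing as used by bridged_socket: raw socket bytes become a text
# message on the websocket and are decoded back on the other side.
payload = b"\x00\x01spark-connect\xff"

wire_msg = bytes.hex(payload)        # what send() writes to the websocket
restored = bytes.fromhex(wire_msg)   # what recv() returns to the caller

assert restored == payload
print(wire_msg)  # '0001737061726b2d636f6e6e656374ff'
```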
google/cloud/dataproc_spark_connect/client/proxy.py (continued):

```diff
@@ -93,12 +102,51 @@ def forward_bytes(name, from_sock, to_sock):
             bs = from_sock.recv(1024)
             if not bs:
                 return
-
+            while bs:
+                try:
+                    to_sock.send(bs)
+                    bs = None
+                except TimeoutError:
+                    # On timeouts during a send, we retry just the send
+                    # to make sure we don't lose any bytes.
+                    pass
+        except TimeoutError:
+            # On timeouts during a receive, we retry the entire flow.
+            pass
         except Exception as ex:
             logger.debug(f"[{name}] Exception forwarding bytes: {ex}")
+            to_sock.close()
             return
 
 
+def connect_sockets(conn_number, from_sock, to_sock):
+    """Create a connection between the two given ports.
+
+    This method continuously streams bytes in both directions between the
+    given `from_sock` and `to_sock` socket-like objects.
+
+    The caller is responsible for creating and closing the supplied socekts.
+    """
+    forward_name = f"{conn_number}-forward"
+    t1 = threading.Thread(
+        name=forward_name,
+        target=forward_bytes,
+        args=[forward_name, from_sock, to_sock],
+        daemon=True,
+    )
+    t1.start()
+    backward_name = f"{conn_number}-backward"
+    t2 = threading.Thread(
+        name=backward_name,
+        target=forward_bytes,
+        args=[backward_name, to_sock, from_sock],
+        daemon=True,
+    )
+    t2.start()
+    t1.join()
+    t2.join()
+
+
 def forward_connection(conn_number, conn, addr, target_host):
     """Create a connection to the target and forward `conn` to it.
 
```
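The new `connect_sockets` helper factors the two forwarding threads out of `forward_connection` (see the next hunk) so the bidirectional byte pump stands alone. It is an internal helper rather than a public API, but a rough sketch of its behavior, using in-process `socket.socketpair()` endpoints in place of a real client connection and websocket bridge:

```python
import socket
import threading

# Internal helper from this package; not a public, stable API.
from google.cloud.dataproc_spark_connect.client.proxy import connect_sockets

client_end, proxy_front = socket.socketpair()  # stands in for the accepted client conn
proxy_back, server_end = socket.socketpair()   # stands in for the websocket-bridged backend

# connect_sockets blocks until both directions finish, so drive it from a thread.
bridge = threading.Thread(
    target=connect_sockets, args=(0, proxy_front, proxy_back), daemon=True
)
bridge.start()

client_end.sendall(b"ping")
print(server_end.recv(1024))   # b'ping' (forwarded client -> server)
server_end.sendall(b"pong")
print(client_end.recv(1024))   # b'pong' (forwarded server -> client)

# Closing the outer endpoints makes each forwarding direction see EOF and exit.
client_end.close()
server_end.close()
bridge.join(timeout=5)
```

In the real proxy, `proxy_front` corresponds to the TCP connection accepted from the local Spark Connect client and `proxy_back` to the `bridged_socket` wrapped around the websocket to the Dataproc session.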
google/cloud/dataproc_spark_connect/client/proxy.py (continued):

```diff
@@ -115,24 +163,7 @@ def forward_connection(conn_number, conn, addr, target_host):
     with conn:
         with connect_tcp_bridge(target_host) as websocket_conn:
             backend_socket = bridged_socket(websocket_conn)
-
-            t1 = threading.Thread(
-                name=forward_name,
-                target=forward_bytes,
-                args=[forward_name, conn, backend_socket],
-                daemon=True,
-            )
-            t1.start()
-            backward_name = f"{conn_number}-backward"
-            t2 = threading.Thread(
-                name=backward_name,
-                target=forward_bytes,
-                args=[backward_name, backend_socket, conn],
-                daemon=True,
-            )
-            t2.start()
-            t1.join()
-            t2.join()
+            connect_sockets(conn_number, conn, backend_socket)
 
 
 class DataprocSessionProxy(object):
```
google/cloud/dataproc_spark_connect/client/proxy.py (continued):

```diff
@@ -179,6 +210,14 @@ class DataprocSessionProxy(object):
            s.release()
            while not self._killed:
                conn, addr = frontend_socket.accept()
+                # Set a timeout on how long we will allow send/recv calls to block
+                #
+                # The code that reads and writes to this connection will retry
+                # on timeouts, so this is a safe change.
+                #
+                # The chosen timeout is a very short one because it allows us
+                # to more quickly detect when a connection has been closed.
+                conn.settimeout(1)
                logger.debug(f"Accepted a connection from {addr}...")
                self._conn_number += 1
                threading.Thread(
```
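This 1-second `settimeout` is what produces the `TimeoutError` retries added to `forward_bytes` above: once a timeout is set, blocking `recv`/`send` calls on the accepted connection raise instead of waiting indefinitely (since Python 3.10, `socket.timeout` is an alias of the built-in `TimeoutError`). A small standalone sketch of that behavior:

```python
import socket

a, b = socket.socketpair()
a.settimeout(1)  # mirrors conn.settimeout(1) in the proxy

try:
    a.recv(1024)          # nothing has been sent, so this times out after ~1s
except TimeoutError:
    print("recv timed out; a forwarding loop would simply retry here")

b.sendall(b"late data")
print(a.recv(1024))       # b'late data' (delivered on the next, retried call)
a.close()
b.close()
```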
google/cloud/dataproc_spark_connect/session.py:

```diff
@@ -196,13 +196,13 @@ class DataprocSparkSession(SparkSession):
         session_id = self.generate_dataproc_session_id()
 
         session_request.session_id = session_id
-        dataproc_config.name = f"projects/{self._project_id}/
+        dataproc_config.name = f"projects/{self._project_id}/locations/{self._region}/sessions/{session_id}"
         logger.debug(
             f"Configurations used to create serverless session:\n {dataproc_config}"
         )
         session_request.session = dataproc_config
         session_request.parent = (
-            f"projects/{self._project_id}/
+            f"projects/{self._project_id}/locations/{self._region}"
         )
 
         logger.debug("Creating serverless session")
```
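The effect of this change is that both the session's `name` and the request's `parent` are now fully qualified Dataproc resource names. A quick illustration of the strings these f-strings produce, with hypothetical project, region, and session values standing in for `self._project_id`, `self._region`, and `session_id`:

```python
# Hypothetical values for illustration only.
project_id = "my-project"
region = "us-central1"
session_id = "sc-20240101-abcdef"

parent = f"projects/{project_id}/locations/{region}"
name = f"projects/{project_id}/locations/{region}/sessions/{session_id}"

print(parent)  # projects/my-project/locations/us-central1
print(name)    # projects/my-project/locations/us-central1/sessions/sc-20240101-abcdef
```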
setup.py:

```diff
@@ -12,13 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from setuptools import find_namespace_packages, setup
+from pathlib import Path
+
+this_directory = Path(__file__).parent
+long_description = (this_directory / "README.md").read_text()
+
 
 setup(
     name="dataproc-spark-connect",
-    version="0.1
+    version="0.2.1",
     description="Dataproc client library for Spark Connect",
+    long_description=long_description,
+    author="Google LLC",
+    url="https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python",
+    license="Apache 2.0",
     packages=find_namespace_packages(include=["google.*"]),
     install_requires=[
+        "google-api-core>=2.19.1",
+        "google-cloud-dataproc>=5.15.1",
         "wheel",
         "websockets",
         "pyspark>=3.5",
```
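These setup.py additions are what produce the richer PKG-INFO at the top of this diff: the Home-page, Author, and License fields, the README as long description, and the new google-api-core / google-cloud-dataproc requirements. One way to confirm an installed copy carries the new metadata, sketched with only the standard library:

```python
from importlib.metadata import metadata, requires, version

print(version("dataproc-spark-connect"))   # expected: 0.2.1
meta = metadata("dataproc-spark-connect")
print(meta["Summary"])                     # Dataproc client library for Spark Connect
print(meta["Home-page"])                   # https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
print(requires("dataproc-spark-connect"))  # includes google-api-core>=2.19.1, google-cloud-dataproc>=5.15.1
```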
@@ -1,10 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.1
|
|
2
|
-
Name: dataproc-spark-connect
|
|
3
|
-
Version: 0.1.0
|
|
4
|
-
Summary: Dataproc client library for Spark Connect
|
|
5
|
-
License-File: LICENSE
|
|
6
|
-
Requires-Dist: wheel
|
|
7
|
-
Requires-Dist: websockets
|
|
8
|
-
Requires-Dist: pyspark>=3.5
|
|
9
|
-
Requires-Dist: pandas
|
|
10
|
-
Requires-Dist: pyarrow
|
|
dataproc_spark_connect-0.1.0/README.md (removed):

```diff
@@ -1,90 +0,0 @@
-# Dataproc Spark Connect Client
-
-A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
-additional functionalities that allow applications to communicate with a remote Dataproc
-Spark cluster using the Spark Connect protocol without requiring additional steps.
-
-## Install
-
-```
-pip install dataproc_spark_connect
-```
-
-## Uninstall
-
-```
-pip uninstall dataproc_spark_connect
-```
-
-## Setup
-This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
-If you are running the client outside of Google Cloud, you must set following environment variables:
-
-* GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
-* GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
-* GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
-* DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
-
-## Usage
-
-1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
-```
-pip install google_cloud_dataproc --force-reinstall
-pip install dataproc_spark_connect --force-reinstall
-```
-
-2. Add the required import into your PySpark application or notebook:
-```python
-from google.cloud.dataproc_spark_connect import DataprocSparkSession
-
-```
-
-3. There are two ways to create a spark session,
-1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
-```python
-spark = DataprocSparkSession.builder.getOrCreate()
-```
-
-2. Start a Spark session with the following code instead of using a config file:
-```python
-from google.cloud.dataproc_v1 import SparkConnectConfig
-from google.cloud.dataproc_v1 import Session
-dataproc_config = Session()
-dataproc_config.spark_connect_session = SparkConnectConfig()
-dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
-dataproc_config.runtime_config.version = '3.0'
-spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
-```
-
-## Billing
-As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
-This will happen even if you are running the client from a non-GCE instance.
-
-## Contributing
-### Building and Deploying SDK
-1. Install the requirements in virtual environment.
-
-```
-pip install -r requirements.txt
-```
-2. Build the code.
-
-```
-python setup.py sdist bdist_wheel
-```
-
-2. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
-
-```
-VERSION=<version> gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
-```
-
-3. Download the new SDK on Vertex, then uninstall the old version and install the new one.
-
-```
-%%bash
-export VERSION=<version>
-gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
-yes | pip uninstall dataproc_spark_connect
-pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
-```
```
dataproc_spark_connect-0.1.0/dataproc_spark_connect.egg-info/PKG-INFO (removed, -10 lines): identical to the 0.1.0 PKG-INFO shown above, so it is not repeated here.
The remaining files listed above (+0 -0) are unchanged between 0.1.0 and 0.2.1.