dataproc-spark-connect 0.6.0__tar.gz → 0.7.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- dataproc_spark_connect-0.7.1/PKG-INFO +98 -0
- dataproc_spark_connect-0.7.1/README.md +83 -0
- dataproc_spark_connect-0.7.1/dataproc_spark_connect.egg-info/PKG-INFO +98 -0
- dataproc_spark_connect-0.7.1/dataproc_spark_connect.egg-info/requires.txt +6 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/client/proxy.py +3 -3
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/session.py +163 -188
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/setup.py +6 -5
- dataproc_spark_connect-0.6.0/PKG-INFO +0 -111
- dataproc_spark_connect-0.6.0/README.md +0 -97
- dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/PKG-INFO +0 -111
- dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/requires.txt +0 -5
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/LICENSE +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/dataproc_spark_connect.egg-info/SOURCES.txt +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/__init__.py +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/client/core.py +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/exceptions.py +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/pypi_artifacts.py +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/pyproject.toml +0 -0
- {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/setup.cfg +0 -0
dataproc_spark_connect-0.7.1/PKG-INFO

@@ -0,0 +1,98 @@
+Metadata-Version: 2.1
+Name: dataproc-spark-connect
+Version: 0.7.1
+Summary: Dataproc client library for Spark Connect
+Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
+Author: Google LLC
+License: Apache 2.0
+License-File: LICENSE
+Requires-Dist: google-api-core>=2.19
+Requires-Dist: google-cloud-dataproc>=5.18
+Requires-Dist: packaging>=20.0
+Requires-Dist: pyspark[connect]>=3.5
+Requires-Dist: tqdm>=4.67
+Requires-Dist: websockets>=14.0
+
+# Dataproc Spark Connect Client
+
+A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
+client with additional functionalities that allow applications to communicate
+with a remote Dataproc Spark Session using the Spark Connect protocol without
+requiring additional steps.
+
+## Install
+
+```sh
+pip install dataproc_spark_connect
+```
+
+## Uninstall
+
+```sh
+pip uninstall dataproc_spark_connect
+```
+
+## Setup
+
+This client requires permissions to
+manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
+If you are running the client outside of Google Cloud, you must set following
+environment variables:
+
+* `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
+  workloads
+* `GOOGLE_CLOUD_REGION` - The Compute
+  Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
+  where you run the Spark workload.
+* `GOOGLE_APPLICATION_CREDENTIALS` -
+  Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
+
+## Usage
+
+1. Install the latest version of Dataproc Python client and Dataproc Spark
+   Connect modules:
+
+   ```sh
+   pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
+   ```
+
+2. Add the required imports into your PySpark application or notebook and start
+   a Spark session with the following code instead of using
+   environment variables:
+
+   ```python
+   from google.cloud.dataproc_spark_connect import DataprocSparkSession
+   from google.cloud.dataproc_v1 import Session
+   session_config = Session()
+   session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
+   session_config.runtime_config.version = '2.2'
+   spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
+   ```
+
+## Developing
+
+For development instructions see [guide](DEVELOPING.md).
+
+## Contributing
+
+We'd love to accept your patches and contributions to this project. There are
+just a few small guidelines you need to follow.
+
+### Contributor License Agreement
+
+Contributions to this project must be accompanied by a Contributor License
+Agreement. You (or your employer) retain the copyright to your contribution;
+this simply gives us permission to use and redistribute your contributions as
+part of the project. Head over to <https://cla.developers.google.com> to see
+your current agreements on file or to sign a new one.
+
+You generally only need to submit a CLA once, so if you've already submitted one
+(even if it was for a different project), you probably don't need to do it
+again.
+
+### Code reviews
+
+All submissions, including submissions by project members, require review. We
+use GitHub pull requests for this purpose. Consult
+[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
+information on using pull requests.
dataproc_spark_connect-0.7.1/README.md

@@ -0,0 +1,83 @@
+# Dataproc Spark Connect Client
+
+A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
+client with additional functionalities that allow applications to communicate
+with a remote Dataproc Spark Session using the Spark Connect protocol without
+requiring additional steps.
+
+## Install
+
+```sh
+pip install dataproc_spark_connect
+```
+
+## Uninstall
+
+```sh
+pip uninstall dataproc_spark_connect
+```
+
+## Setup
+
+This client requires permissions to
+manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
+If you are running the client outside of Google Cloud, you must set following
+environment variables:
+
+* `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
+  workloads
+* `GOOGLE_CLOUD_REGION` - The Compute
+  Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
+  where you run the Spark workload.
+* `GOOGLE_APPLICATION_CREDENTIALS` -
+  Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
+
+## Usage
+
+1. Install the latest version of Dataproc Python client and Dataproc Spark
+   Connect modules:
+
+   ```sh
+   pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
+   ```
+
+2. Add the required imports into your PySpark application or notebook and start
+   a Spark session with the following code instead of using
+   environment variables:
+
+   ```python
+   from google.cloud.dataproc_spark_connect import DataprocSparkSession
+   from google.cloud.dataproc_v1 import Session
+   session_config = Session()
+   session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
+   session_config.runtime_config.version = '2.2'
+   spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
+   ```
+
+## Developing
+
+For development instructions see [guide](DEVELOPING.md).
+
+## Contributing
+
+We'd love to accept your patches and contributions to this project. There are
+just a few small guidelines you need to follow.
+
+### Contributor License Agreement
+
+Contributions to this project must be accompanied by a Contributor License
+Agreement. You (or your employer) retain the copyright to your contribution;
+this simply gives us permission to use and redistribute your contributions as
+part of the project. Head over to <https://cla.developers.google.com> to see
+your current agreements on file or to sign a new one.
+
+You generally only need to submit a CLA once, so if you've already submitted one
+(even if it was for a different project), you probably don't need to do it
+again.
+
+### Code reviews
+
+All submissions, including submissions by project members, require review. We
+use GitHub pull requests for this purpose. Consult
+[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
+information on using pull requests.
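For convenience, a runnable end-to-end variant of the README example above, with the Setup environment supplied from Python rather than the shell. This is a hedged sketch: the project, region, and subnet values are placeholders, not defaults shipped with the package.

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

# Placeholder values; substitute your own project, region, and subnet.
os.environ.setdefault("GOOGLE_CLOUD_PROJECT", "my-project")
os.environ.setdefault("GOOGLE_CLOUD_REGION", "us-central1")

session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = "my-subnet"
session_config.runtime_config.version = "2.2"

spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
spark.sql("SELECT 1").show()
spark.stop()
```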
dataproc_spark_connect-0.7.1/dataproc_spark_connect.egg-info/PKG-INFO

@@ -0,0 +1,98 @@
+(98 added lines, identical to dataproc_spark_connect-0.7.1/PKG-INFO shown above)
{dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/client/proxy.py

@@ -18,7 +18,6 @@ import contextlib
 import logging
 import socket
 import threading
-import time
 
 import websockets.sync.client as websocketclient
 
@@ -95,6 +94,7 @@ def forward_bytes(name, from_sock, to_sock):
     This method is intended to be run in a separate thread of execution.
 
     Args:
+      name: forwarding thread name
       from_sock: A socket-like object to stream bytes from.
       to_sock: A socket-like object to stream bytes to.
     """
@@ -131,7 +131,7 @@ def connect_sockets(conn_number, from_sock, to_sock):
     This method continuously streams bytes in both directions between the
     given `from_sock` and `to_sock` socket-like objects.
 
-    The caller is responsible for creating and closing the supplied
+    The caller is responsible for creating and closing the supplied sockets.
     """
     forward_name = f"{conn_number}-forward"
     t1 = threading.Thread(
@@ -163,7 +163,7 @@ def forward_connection(conn_number, conn, addr, target_host):
     Both the supplied incoming connection (`conn`) and the created outgoing
     connection are automatically closed when this method terminates.
 
-    This method should be run inside
+    This method should be run inside a daemon thread so that it will not
     block program termination.
     """
     with conn:
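For context, the bidirectional forwarding pattern that `connect_sockets` documents — one thread per direction, run as daemons so they cannot block interpreter shutdown — looks roughly like the following self-contained sketch. The names mirror the module, but the bodies are illustrative, not the actual implementation.

```python
import threading


def forward_bytes(name, from_sock, to_sock, chunk_size=4096):
    # Stream bytes one way until the source side closes.
    while True:
        data = from_sock.recv(chunk_size)
        if not data:
            break
        to_sock.sendall(data)


def connect_sockets(conn_number, from_sock, to_sock):
    # One daemon thread per direction; join so the caller can close both ends.
    t1 = threading.Thread(
        target=forward_bytes,
        args=(f"{conn_number}-forward", from_sock, to_sock),
        daemon=True,
    )
    t2 = threading.Thread(
        target=forward_bytes,
        args=(f"{conn_number}-backward", to_sock, from_sock),
        daemon=True,
    )
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```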
{dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/google/cloud/dataproc_spark_connect/session.py

@@ -11,39 +11,37 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+
 import atexit
+import datetime
 import json
 import logging
 import os
 import random
 import string
+import threading
 import time
-import
-from time import sleep
-from typing import Any, cast, ClassVar, Dict, Optional
+import tqdm
 
 from google.api_core import retry
-from google.api_core.future.polling import POLLING_PREDICATE
 from google.api_core.client_options import ClientOptions
 from google.api_core.exceptions import Aborted, FailedPrecondition, InvalidArgument, NotFound, PermissionDenied
-from google.
-
-from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
+from google.api_core.future.polling import POLLING_PREDICATE
 from google.cloud.dataproc_spark_connect.client import DataprocChannelBuilder
+from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException
+from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
 from google.cloud.dataproc_v1 import (
+    AuthenticationConfig,
     CreateSessionRequest,
     GetSessionRequest,
     Session,
     SessionControllerClient,
-    SessionTemplate,
     TerminateSessionRequest,
 )
-from google.
-from google.protobuf.text_format import ParseError
+from google.cloud.dataproc_v1.types import sessions
 from pyspark.sql.connect.session import SparkSession
 from pyspark.sql.utils import to_str
-
-from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException
+from typing import Any, cast, ClassVar, Dict, Optional
 
 # Set up logging
 logging.basicConfig(level=logging.INFO)
@@ -69,6 +67,8 @@ class DataprocSparkSession(SparkSession):
         ...     ) # doctest: +SKIP
     """
 
+    _DEFAULT_RUNTIME_VERSION = "2.2"
+
     _active_s8s_session_uuid: ClassVar[Optional[str]] = None
     _project_id = None
     _region = None
@@ -77,8 +77,6 @@ class DataprocSparkSession(SparkSession):
 
     class Builder(SparkSession.Builder):
 
-        _dataproc_runtime_spark_version = {"3.0": "3.5.1", "2.2": "3.5.0"}
-
         _session_static_configs = [
             "spark.executor.cores",
             "spark.executor.memoryOverhead",
@@ -93,10 +91,10 @@ class DataprocSparkSession(SparkSession):
             self._options: Dict[str, Any] = {}
             self._channel_builder: Optional[DataprocChannelBuilder] = None
             self._dataproc_config: Optional[Session] = None
-            self._project_id = os.
-            self._region = os.
+            self._project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
+            self._region = os.getenv("GOOGLE_CLOUD_REGION")
             self._client_options = ClientOptions(
-                api_endpoint=os.
+                api_endpoint=os.getenv(
                     "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
                     f"{self._region}-dataproc.googleapis.com",
                 )
@@ -117,7 +115,7 @@ class DataprocSparkSession(SparkSession):
 
         def location(self, location):
             self._region = location
-            self._client_options.api_endpoint = os.
+            self._client_options.api_endpoint = os.getenv(
                 "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
                 f"{self._region}-dataproc.googleapis.com",
             )
@@ -155,10 +153,7 @@ class DataprocSparkSession(SparkSession):
             spark_connect_url = session_response.runtime_info.endpoints.get(
                 "Spark Connect Server"
             )
-
-            if not spark_connect_url.endswith("/"):
-                spark_connect_url += "/"
-            url = f"{spark_connect_url.replace('.com/', '.com:443/')};session_id={session_response.uuid};use_ssl=true"
+            url = f"{spark_connect_url}/;session_id={session_response.uuid};use_ssl=true"
             logger.debug(f"Spark Connect URL: {url}")
             self._channel_builder = DataprocChannelBuilder(
                 url,
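The rewrite above drops the old host rewriting to port 443 and simply appends the channel options to the endpoint the service reports. As a small illustration of the resulting string shape (the endpoint and UUID below are fabricated placeholders):

```python
# Placeholder endpoint and UUID, purely to show the URL shape.
spark_connect_url = "sc://example-endpoint.example.com"
session_uuid = "123e4567-e89b-12d3-a456-426614174000"
url = f"{spark_connect_url}/;session_id={session_uuid};use_ssl=true"
print(url)  # sc://example-endpoint.example.com/;session_id=...;use_ssl=true
```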
@@ -179,56 +174,64 @@ class DataprocSparkSession(SparkSession):
 
             if self._options.get("spark.remote", False):
                 raise NotImplemented(
-                    "DataprocSparkSession does not support connecting to an existing remote server"
+                    "DataprocSparkSession does not support connecting to an existing Spark Connect remote server"
                 )
 
             from google.cloud.dataproc_v1 import SessionControllerClient
 
             dataproc_config: Session = self._get_dataproc_config()
-            session_template: SessionTemplate = self._get_session_template()
 
-            self._get_and_validate_version(
-                dataproc_config, session_template
-            )
-
-            spark_connect_session = self._get_spark_connect_session(
-                dataproc_config, session_template
-            )
-
-            if not spark_connect_session:
-                dataproc_config.spark_connect_session = {}
-            os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
-            session_request = CreateSessionRequest()
             session_id = self.generate_dataproc_session_id()
-
-            session_request.session_id = session_id
             dataproc_config.name = f"projects/{self._project_id}/locations/{self._region}/sessions/{session_id}"
             logger.debug(
-                f"
+                f"Dataproc Session configuration:\n{dataproc_config}"
             )
+
+            session_request = CreateSessionRequest()
+            session_request.session_id = session_id
             session_request.session = dataproc_config
             session_request.parent = (
                 f"projects/{self._project_id}/locations/{self._region}"
             )
 
-            logger.debug("Creating
+            logger.debug("Creating Dataproc Session")
             DataprocSparkSession._active_s8s_session_id = session_id
             s8s_creation_start_time = time.time()
-
-
-
-
-
-
-
+
+            stop_create_session_pbar = False
+
+            def create_session_pbar():
+                iterations = 150
+                pbar = tqdm.trange(
+                    iterations,
+                    bar_format="{bar}",
+                    ncols=80,
                 )
-
+                for i in pbar:
+                    if stop_create_session_pbar:
+                        break
+                    # Last iteration
+                    if i >= iterations - 1:
+                        # Sleep until session created
+                        while not stop_create_session_pbar:
+                            time.sleep(1)
+                    else:
+                        time.sleep(1)
+
+                pbar.close()
+                # Print new line after the progress bar
+                print()
+
+            create_session_pbar_thread = threading.Thread(
+                target=create_session_pbar
+            )
+
+            try:
                 if (
-
-
-
-
-                ).lower()
+                    os.getenv(
+                        "DATAPROC_SPARK_CONNECT_SESSION_TERMINATE_AT_EXIT",
+                        "false",
+                    )
                     == "true"
                 ):
                     atexit.register(
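The new creation path drives a purely cosmetic tqdm bar from a background thread and stops it with a flag once the create call returns. A minimal standalone sketch of the same stop-flag pattern, assuming nothing about the session API (the sleep stands in for the blocking create call):

```python
import threading
import time

import tqdm

stop_pbar = False


def run_pbar(iterations=150):
    # Purely visual; the stop flag ends the bar when the real work finishes.
    bar = tqdm.trange(iterations, bar_format="{bar}", ncols=80)
    for _ in bar:
        if stop_pbar:
            break
        time.sleep(0.1)
    bar.close()
    print()  # newline after the bar


pbar_thread = threading.Thread(target=run_pbar)
pbar_thread.start()
time.sleep(3)  # stand-in for the blocking create-session call
stop_pbar = True
pbar_thread.join()
```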
@@ -243,18 +246,25 @@ class DataprocSparkSession(SparkSession):
                     client_options=self._client_options
                 ).create_session(session_request)
                 print(
-                    f"
+                    f"Creating Dataproc Session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
                 )
+                create_session_pbar_thread.start()
                 session_response: Session = operation.result(
-                    polling=
+                    polling=retry.Retry(
+                        predicate=POLLING_PREDICATE,
+                        initial=5.0,  # seconds
+                        maximum=5.0,  # seconds
+                        multiplier=1.0,
+                        timeout=600,  # seconds
+                    )
                 )
-
-
-
-
-
-
-
+                stop_create_session_pbar = True
+                create_session_pbar_thread.join()
+                print("Dataproc Session was successfully created")
+                file_path = (
+                    DataprocSparkSession._get_active_session_file_path()
+                )
+                if file_path is not None:
                     try:
                         session_data = {
                             "session_name": session_response.name,
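The `polling=` argument above pins the long-running operation to a fixed five-second poll with a ten-minute cap: `initial == maximum` and `multiplier == 1.0` disable exponential backoff. Isolated, the same configuration reads as follows (the `operation` object is assumed to come from a prior `create_session` call):

```python
from google.api_core import retry
from google.api_core.future.polling import POLLING_PREDICATE

# Fixed-interval polling: every retry waits 5 s; give up after 600 s.
fixed_polling = retry.Retry(
    predicate=POLLING_PREDICATE,
    initial=5.0,
    maximum=5.0,
    multiplier=1.0,
    timeout=600,
)

# Usage, given an `operation` returned by create_session:
# session = operation.result(polling=fixed_polling)
```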
@@ -267,21 +277,27 @@ class DataprocSparkSession(SparkSession):
                             json.dump(session_data, json_file, indent=4)
                     except Exception as e:
                         logger.error(
-                            f"Exception while writing active session to file {file_path}
+                            f"Exception while writing active session to file {file_path}, {e}"
                         )
             except (InvalidArgument, PermissionDenied) as e:
+                stop_create_session_pbar = True
+                if create_session_pbar_thread.is_alive():
+                    create_session_pbar_thread.join()
                 DataprocSparkSession._active_s8s_session_id = None
                 raise DataprocSparkConnectException(
-                    f"Error while creating
+                    f"Error while creating Dataproc Session: {e.message}"
                 )
             except Exception as e:
+                stop_create_session_pbar = True
+                if create_session_pbar_thread.is_alive():
+                    create_session_pbar_thread.join()
                 DataprocSparkSession._active_s8s_session_id = None
                 raise RuntimeError(
-                    f"Error while creating
+                    f"Error while creating Dataproc Session"
                 ) from e
 
             logger.debug(
-                f"
+                f"Dataproc Session created: {session_id} in {int(time.time() - s8s_creation_start_time)} seconds"
             )
             return self.__create_spark_connect_session_from_s8s(
                 session_response, dataproc_config.name
@@ -292,17 +308,20 @@ class DataprocSparkSession(SparkSession):
         ) -> Optional["DataprocSparkSession"]:
             s8s_session_id = DataprocSparkSession._active_s8s_session_id
             session_name = f"projects/{self._project_id}/locations/{self._region}/sessions/{s8s_session_id}"
-            session_response =
-
-
+            session_response = None
+            session = None
+            if s8s_session_id is not None:
+                session_response = get_active_s8s_session_response(
+                    session_name, self._client_options
+                )
+            session = DataprocSparkSession.getActiveSession()
 
-            session = DataprocSparkSession.getActiveSession()
             if session is None:
                 session = DataprocSparkSession._default_session
 
             if session_response is not None:
                 print(
-                    f"Using existing
+                    f"Using existing Dataproc Session (configuration changes may not be applied): https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}"
                 )
             if session is None:
                 session = self.__create_spark_connect_session_from_s8s(
@@ -310,10 +329,10 @@ class DataprocSparkSession(SparkSession):
                 )
                 return session
             else:
-                logger.info(
-                    f"Session: {s8s_session_id} not active, stopping previous spark session and creating new"
-                )
                 if session is not None:
+                    print(
+                        f"{s8s_session_id} Dataproc Session is not active, stopping and creating a new one"
+                    )
                     session.stop()
 
             return None
@@ -333,21 +352,52 @@ class DataprocSparkSession(SparkSession):
             dataproc_config = self._dataproc_config
             for k, v in self._options.items():
                 dataproc_config.runtime_config.properties[k] = v
-
-
-
+            dataproc_config.spark_connect_session = (
+                sessions.SparkConnectConfig()
+            )
+            if not dataproc_config.runtime_config.version:
+                dataproc_config.runtime_config.version = (
+                    DataprocSparkSession._DEFAULT_RUNTIME_VERSION
+                )
+            if (
+                not dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type
+                and "DATAPROC_SPARK_CONNECT_AUTH_TYPE" in os.environ
+            ):
+                dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type = AuthenticationConfig.AuthenticationType[
+                    os.getenv("DATAPROC_SPARK_CONNECT_AUTH_TYPE")
                 ]
-
-
-
-
-
-
-
-
-
-
-
+            if (
+                not dataproc_config.environment_config.execution_config.service_account
+                and "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT" in os.environ
+            ):
+                dataproc_config.environment_config.execution_config.service_account = os.getenv(
+                    "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"
+                )
+            if (
+                not dataproc_config.environment_config.execution_config.subnetwork_uri
+                and "DATAPROC_SPARK_CONNECT_SUBNET" in os.environ
+            ):
+                dataproc_config.environment_config.execution_config.subnetwork_uri = os.getenv(
+                    "DATAPROC_SPARK_CONNECT_SUBNET"
+                )
+            if (
+                not dataproc_config.environment_config.execution_config.ttl
+                and "DATAPROC_SPARK_CONNECT_TTL_SECONDS" in os.environ
+            ):
+                dataproc_config.environment_config.execution_config.ttl = {
+                    "seconds": int(
+                        os.getenv("DATAPROC_SPARK_CONNECT_TTL_SECONDS")
+                    )
+                }
+            if (
+                not dataproc_config.environment_config.execution_config.idle_ttl
+                and "DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS" in os.environ
+            ):
+                dataproc_config.environment_config.execution_config.idle_ttl = {
+                    "seconds": int(
+                        os.getenv("DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS")
+                    )
+                }
             if "COLAB_NOTEBOOK_RUNTIME_ID" in os.environ:
                 dataproc_config.labels["colab-notebook-runtime-id"] = (
                     os.environ["COLAB_NOTEBOOK_RUNTIME_ID"]
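The net effect of the hunk above is that auth type, service account, subnet, and both TTLs now fall back to `DATAPROC_SPARK_CONNECT_*` environment variables whenever the Session proto leaves them unset. A hedged sketch of relying on those fallbacks (every value below is a placeholder, and each variable is optional):

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Placeholder values; each variable is only read when the corresponding
# Session field is left unset.
os.environ["DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"] = "sa@my-project.iam.gserviceaccount.com"
os.environ["DATAPROC_SPARK_CONNECT_SUBNET"] = "projects/my-project/regions/us-central1/subnetworks/default"
os.environ["DATAPROC_SPARK_CONNECT_TTL_SECONDS"] = "3600"
os.environ["DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS"] = "900"

# With the fallbacks above, no explicit Session config is required.
spark = DataprocSparkSession.builder.getOrCreate()
```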
@@ -358,87 +408,8 @@ class DataprocSparkSession(SparkSession):
                 ]
             return dataproc_config
 
-
-
-            GetSessionTemplateRequest,
-            SessionTemplateControllerClient,
-        )
-
-        session_template = None
-        if self._dataproc_config and self._dataproc_config.session_template:
-            session_template = self._dataproc_config.session_template
-            get_session_template_request = GetSessionTemplateRequest()
-            get_session_template_request.name = session_template
-            client = SessionTemplateControllerClient(
-                client_options=self._client_options
-            )
-            try:
-                session_template = client.get_session_template(
-                    get_session_template_request
-                )
-            except Exception as e:
-                logger.error(
-                    f"Failed to get session template {session_template}: {e}"
-                )
-                raise
-        return session_template
-
-        def _get_and_validate_version(self, dataproc_config, session_template):
-            trimmed_version = lambda v: ".".join(v.split(".")[:2])
-            version = None
-            if (
-                dataproc_config
-                and dataproc_config.runtime_config
-                and dataproc_config.runtime_config.version
-            ):
-                version = dataproc_config.runtime_config.version
-            elif (
-                session_template
-                and session_template.runtime_config
-                and session_template.runtime_config.version
-            ):
-                version = session_template.runtime_config.version
-
-            if not version:
-                version = "3.0"
-                dataproc_config.runtime_config.version = version
-            elif (
-                trimmed_version(version)
-                not in self._dataproc_runtime_spark_version
-            ):
-                raise ValueError(
-                    f"runtime_config.version {version} is not supported. "
-                    f"Supported versions: {self._dataproc_runtime_spark_version.keys()}"
-                )
-
-            server_version = self._dataproc_runtime_spark_version[
-                trimmed_version(version)
-            ]
-            import importlib.metadata
-
-            google_connect_version = importlib.metadata.version(
-                "dataproc-spark-connect"
-            )
-            client_version = importlib.metadata.version("pyspark")
-            version_message = f"Spark Connect: {google_connect_version} (PySpark: {client_version}) Session Runtime: {version} (Spark: {server_version})"
-            logger.info(version_message)
-            if trimmed_version(client_version) != trimmed_version(
-                server_version
-            ):
-                logger.warning(
-                    f"client and server on different versions: {version_message}"
-                )
-            return version
-
-        def _get_spark_connect_session(self, dataproc_config, session_template):
-            spark_connect_session = None
-            if dataproc_config and dataproc_config.spark_connect_session:
-                spark_connect_session = dataproc_config.spark_connect_session
-            elif session_template and session_template.spark_connect_session:
-                spark_connect_session = session_template.spark_connect_session
-            return spark_connect_session
-
-        def generate_dataproc_session_id(self):
+        @staticmethod
+        def generate_dataproc_session_id():
             timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
             suffix_length = 6
             random_suffix = "".join(
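`generate_dataproc_session_id` (now a staticmethod) combines a timestamp with a six-character random suffix. The hunk truncates the tail of the function, so the suffix alphabet and final format in this runnable approximation are assumptions:

```python
import datetime
import random
import string


def generate_dataproc_session_id() -> str:
    # Assumed output shape, e.g. "sc-20250101-120000-ab12cd"; only the
    # timestamp pattern and suffix length are confirmed by the diff.
    timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    random_suffix = "".join(
        random.choices(string.ascii_lowercase + string.digits, k=6)
    )
    return f"sc-{timestamp}-{random_suffix}"
```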
@@ -451,32 +422,30 @@ class DataprocSparkSession(SparkSession):
     def _repr_html_(self) -> str:
         if not self._active_s8s_session_id:
             return """
-              <div>No Active Dataproc
+              <div>No Active Dataproc Session</div>
             """
 
         s8s_session = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{self._active_s8s_session_id}"
         ui = f"{s8s_session}/sparkApplications/applications"
-        version = ""
         return f"""
           <div>
             <p><b>Spark Connect</b></p>
 
-            <p><a href="{s8s_session}?project={self._project_id}">
+            <p><a href="{s8s_session}?project={self._project_id}">Dataproc Session</a></p>
             <p><a href="{ui}?project={self._project_id}">Spark UI</a></p>
           </div>
         """
 
-
-
-
-
-        ]
+    @staticmethod
+    def _remove_stopped_session_from_file():
+        file_path = DataprocSparkSession._get_active_session_file_path()
+        if file_path is not None:
             try:
                 with open(file_path, "w"):
                     pass
             except Exception as e:
                 logger.error(
-                    f"Exception while removing active session in file {file_path}
+                    f"Exception while removing active session in file {file_path}, {e}"
                 )
 
     def addArtifacts(
@@ -494,7 +463,7 @@ class DataprocSparkSession(SparkSession):
 
         Parameters
         ----------
-        *
+        *artifact : tuple of str
             Artifact's URIs to add.
         pyfile : bool
             Whether to add them as Python dependencies such as .py, .egg, .zip or .jar files.
@@ -507,7 +476,7 @@ class DataprocSparkSession(SparkSession):
             Add a file to be downloaded with this Spark job on every node.
             The ``path`` passed can only be a local file for now.
         pypi : bool
-            This option is only available with DataprocSparkSession.
+            This option is only available with DataprocSparkSession. e.g. `spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)`
             Installs PyPi package (with its dependencies) in the active Spark session on the driver and executors.
 
         Notes
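Following the docstring's own example, installing PyPI packages into a live session looks like this (the package pins are illustrative):

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = DataprocSparkSession.builder.getOrCreate()

# Installs the packages (with their dependencies) on the driver and executors
# of the active session, without recreating it.
spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)
```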
@@ -534,6 +503,10 @@ class DataprocSparkSession(SparkSession):
             *artifact, pyfile=pyfile, archive=archive, file=file
         )
 
+    @staticmethod
+    def _get_active_session_file_path():
+        return os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")
+
     def stop(self) -> None:
         with DataprocSparkSession._lock:
             if DataprocSparkSession._active_s8s_session_id is not None:
@@ -544,7 +517,7 @@ class DataprocSparkSession(SparkSession):
                     self._client_options,
                 )
 
-                self.
+                self._remove_stopped_session_from_file()
                 DataprocSparkSession._active_s8s_session_uuid = None
                 DataprocSparkSession._active_s8s_session_id = None
                 DataprocSparkSession._project_id = None
@@ -565,7 +538,7 @@ def terminate_s8s_session(
 ):
     from google.cloud.dataproc_v1 import SessionControllerClient
 
-    logger.debug(f"Terminating
+    logger.debug(f"Terminating Dataproc Session: {active_s8s_session_id}")
     terminate_session_request = TerminateSessionRequest()
     session_name = f"projects/{project_id}/locations/{region}/sessions/{active_s8s_session_id}"
     terminate_session_request.name = session_name
@@ -583,18 +556,20 @@ def terminate_s8s_session(
         ):
             session = session_client.get_session(get_session_request)
             state = session.state
-            sleep(1)
+            time.sleep(1)
     except NotFound:
-        logger.debug(
+        logger.debug(
+            f"{active_s8s_session_id} Dataproc Session already deleted"
+        )
     # Client will get 'Aborted' error if session creation is still in progress and
     # 'FailedPrecondition' if another termination is still in progress.
-    # Both are retryable but we catch it and let TTL take care of cleanups.
+    # Both are retryable, but we catch it and let TTL take care of cleanups.
     except (FailedPrecondition, Aborted):
         logger.debug(
-            f"
+            f"{active_s8s_session_id} Dataproc Session already terminated manually or automatically due to TTL"
         )
     if state is not None and state == Session.State.FAILED:
-        raise RuntimeError("
+        raise RuntimeError("Dataproc Session termination failed")
 
 
 def get_active_s8s_session_response(
@@ -608,7 +583,7 @@ def get_active_s8s_session_response(
         ).get_session(get_session_request)
         state = get_session_response.state
     except Exception as e:
-
+        print(f"{session_name} Dataproc Session deleted: {e}")
         return None
     if state is not None and (
         state == Session.State.ACTIVE or state == Session.State.CREATING
{dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.1}/setup.py

@@ -20,7 +20,7 @@ long_description = (this_directory / "README.md").read_text()
 
 setup(
     name="dataproc-spark-connect",
-    version="0.6.0",
+    version="0.7.1",
     description="Dataproc client library for Spark Connect",
     long_description=long_description,
     author="Google LLC",
@@ -28,10 +28,11 @@ setup(
     license="Apache 2.0",
     packages=find_namespace_packages(include=["google.*"]),
     install_requires=[
-        "google-api-core>=2.19.1",
-        "google-cloud-dataproc>=5.18.0",
-        "websockets",
-        "pyspark[connect]>=3.5",
+        "google-api-core>=2.19",
+        "google-cloud-dataproc>=5.18",
         "packaging>=20.0",
+        "pyspark[connect]>=3.5",
+        "tqdm>=4.67",
+        "websockets>=14.0",
     ],
 )
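After upgrading, the installed release can be confirmed with the standard library; the deleted version-validation code used the same call:

```python
import importlib.metadata

print(importlib.metadata.version("dataproc-spark-connect"))  # expect "0.7.1"
```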
dataproc_spark_connect-0.6.0/PKG-INFO

@@ -1,111 +0,0 @@
-Metadata-Version: 2.1
-Name: dataproc-spark-connect
-Version: 0.6.0
-Summary: Dataproc client library for Spark Connect
-Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
-Author: Google LLC
-License: Apache 2.0
-License-File: LICENSE
-Requires-Dist: google-api-core>=2.19.1
-Requires-Dist: google-cloud-dataproc>=5.18.0
-Requires-Dist: websockets
-Requires-Dist: pyspark[connect]>=3.5
-Requires-Dist: packaging>=20.0
-
-# Dataproc Spark Connect Client
-
-A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
-additional functionalities that allow applications to communicate with a remote Dataproc
-Spark cluster using the Spark Connect protocol without requiring additional steps.
-
-## Install
-
-```console
-pip install dataproc_spark_connect
-```
-
-## Uninstall
-
-```console
-pip uninstall dataproc_spark_connect
-```
-
-## Setup
-This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
-If you are running the client outside of Google Cloud, you must set following environment variables:
-
-* GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
-* GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
-* GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
-* DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
-
-## Usage
-
-1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
-
-    ```console
-    pip install google_cloud_dataproc --force-reinstall
-    pip install dataproc_spark_connect --force-reinstall
-    ```
-
-2. Add the required import into your PySpark application or notebook:
-
-    ```python
-    from google.cloud.dataproc_spark_connect import DataprocSparkSession
-    ```
-
-3. There are two ways to create a spark session,
-
-   1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
-
-       ```python
-       spark = DataprocSparkSession.builder.getOrCreate()
-       ```
-
-   2. Start a Spark session with the following code instead of using a config file:
-
-       ```python
-       from google.cloud.dataproc_v1 import SparkConnectConfig
-       from google.cloud.dataproc_v1 import Session
-       dataproc_session_config = Session()
-       dataproc_session_config.spark_connect_session = SparkConnectConfig()
-       dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
-       dataproc_session_config.runtime_config.version = '3.0'
-       spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
-       ```
-
-## Billing
-As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
-This will happen even if you are running the client from a non-GCE instance.
-
-## Contributing
-### Building and Deploying SDK
-
-1. Install the requirements in virtual environment.
-
-    ```console
-    pip install -r requirements-dev.txt
-    ```
-
-2. Build the code.
-
-    ```console
-    python setup.py sdist bdist_wheel
-    ```
-
-3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
-
-    ```sh
-    VERSION=<version>
-    gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
-    ```
-
-4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
-
-    ```sh
-    %%bash
-    export VERSION=<version>
-    gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
-    yes | pip uninstall dataproc_spark_connect
-    pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
-    ```
dataproc_spark_connect-0.6.0/README.md

@@ -1,97 +0,0 @@
-# Dataproc Spark Connect Client
-
-A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
-additional functionalities that allow applications to communicate with a remote Dataproc
-Spark cluster using the Spark Connect protocol without requiring additional steps.
-
-## Install
-
-```console
-pip install dataproc_spark_connect
-```
-
-## Uninstall
-
-```console
-pip uninstall dataproc_spark_connect
-```
-
-## Setup
-This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
-If you are running the client outside of Google Cloud, you must set following environment variables:
-
-* GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
-* GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
-* GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
-* DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
-
-## Usage
-
-1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
-
-    ```console
-    pip install google_cloud_dataproc --force-reinstall
-    pip install dataproc_spark_connect --force-reinstall
-    ```
-
-2. Add the required import into your PySpark application or notebook:
-
-    ```python
-    from google.cloud.dataproc_spark_connect import DataprocSparkSession
-    ```
-
-3. There are two ways to create a spark session,
-
-   1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
-
-       ```python
-       spark = DataprocSparkSession.builder.getOrCreate()
-       ```
-
-   2. Start a Spark session with the following code instead of using a config file:
-
-       ```python
-       from google.cloud.dataproc_v1 import SparkConnectConfig
-       from google.cloud.dataproc_v1 import Session
-       dataproc_session_config = Session()
-       dataproc_session_config.spark_connect_session = SparkConnectConfig()
-       dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
-       dataproc_session_config.runtime_config.version = '3.0'
-       spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
-       ```
-
-## Billing
-As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
-This will happen even if you are running the client from a non-GCE instance.
-
-## Contributing
-### Building and Deploying SDK
-
-1. Install the requirements in virtual environment.
-
-    ```console
-    pip install -r requirements-dev.txt
-    ```
-
-2. Build the code.
-
-    ```console
-    python setup.py sdist bdist_wheel
-    ```
-
-3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
-
-    ```sh
-    VERSION=<version>
-    gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
-    ```
-
-4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
-
-    ```sh
-    %%bash
-    export VERSION=<version>
-    gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
-    yes | pip uninstall dataproc_spark_connect
-    pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
-    ```
dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/PKG-INFO

@@ -1,111 +0,0 @@
-(111 removed lines, identical to dataproc_spark_connect-0.6.0/PKG-INFO shown above)
The remaining files (LICENSE, SOURCES.txt, dependency_links.txt, top_level.txt, __init__.py, client/__init__.py, client/core.py, exceptions.py, pypi_artifacts.py, pyproject.toml, setup.cfg) are unchanged between 0.6.0 and 0.7.1.