dataproc-spark-connect 0.6.0__tar.gz → 0.7.0__tar.gz

This diff shows the changes between publicly released versions of this package, as published to their respective public registries. It is provided for informational purposes only.
Files changed (22)
  1. dataproc_spark_connect-0.7.0/PKG-INFO +98 -0
  2. dataproc_spark_connect-0.7.0/README.md +83 -0
  3. dataproc_spark_connect-0.7.0/dataproc_spark_connect.egg-info/PKG-INFO +98 -0
  4. dataproc_spark_connect-0.7.0/dataproc_spark_connect.egg-info/requires.txt +6 -0
  5. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/client/proxy.py +3 -3
  6. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/session.py +186 -172
  7. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/setup.py +6 -5
  8. dataproc_spark_connect-0.6.0/PKG-INFO +0 -111
  9. dataproc_spark_connect-0.6.0/README.md +0 -97
  10. dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/PKG-INFO +0 -111
  11. dataproc_spark_connect-0.6.0/dataproc_spark_connect.egg-info/requires.txt +0 -5
  12. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/LICENSE +0 -0
  13. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/dataproc_spark_connect.egg-info/SOURCES.txt +0 -0
  14. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
  15. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
  16. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/__init__.py +0 -0
  17. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
  18. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/client/core.py +0 -0
  19. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/exceptions.py +0 -0
  20. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/google/cloud/dataproc_spark_connect/pypi_artifacts.py +0 -0
  21. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/pyproject.toml +0 -0
  22. {dataproc_spark_connect-0.6.0 → dataproc_spark_connect-0.7.0}/setup.cfg +0 -0
@@ -0,0 +1,98 @@
1
+ Metadata-Version: 2.1
2
+ Name: dataproc-spark-connect
3
+ Version: 0.7.0
4
+ Summary: Dataproc client library for Spark Connect
5
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
+ Author: Google LLC
7
+ License: Apache 2.0
8
+ License-File: LICENSE
9
+ Requires-Dist: google-api-core>=2.19
10
+ Requires-Dist: google-cloud-dataproc>=5.18
11
+ Requires-Dist: packaging>=20.0
12
+ Requires-Dist: pyspark[connect]>=3.5
13
+ Requires-Dist: tqdm>=4.67
14
+ Requires-Dist: websockets>=15.0
15
+
16
+ # Dataproc Spark Connect Client
17
+
18
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
19
+ client with additional functionalities that allow applications to communicate
20
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
21
+ requiring additional steps.
22
+
23
+ ## Install
24
+
25
+ ```sh
26
+ pip install dataproc_spark_connect
27
+ ```
28
+
29
+ ## Uninstall
30
+
31
+ ```sh
32
+ pip uninstall dataproc_spark_connect
33
+ ```
34
+
35
+ ## Setup
36
+
37
+ This client requires permissions to
38
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
39
+ If you are running the client outside of Google Cloud, you must set the following
40
+ environment variables:
41
+
42
+ * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
43
+ workloads
44
+ * `GOOGLE_CLOUD_REGION` - The Compute
45
+ Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
46
+ where you run the Spark workload.
47
+ * `GOOGLE_APPLICATION_CREDENTIALS` -
48
+ Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
49
+
50
+ ## Usage
51
+
52
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark
53
+ Connect modules:
54
+
55
+ ```sh
56
+ pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
57
+ ```
58
+
59
+ 2. Add the required imports into your PySpark application or notebook and start
60
+ a Spark session with the following code instead of using
61
+ environment variables:
62
+
63
+ ```python
64
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
65
+ from google.cloud.dataproc_v1 import Session
66
+ session_config = Session()
67
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
68
+ session_config.runtime_config.version = '2.2'
69
+ spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
70
+ ```
71
+
72
+ ## Developing
73
+
74
+ For development instructions, see the [guide](DEVELOPING.md).
75
+
76
+ ## Contributing
77
+
78
+ We'd love to accept your patches and contributions to this project. There are
79
+ just a few small guidelines you need to follow.
80
+
81
+ ### Contributor License Agreement
82
+
83
+ Contributions to this project must be accompanied by a Contributor License
84
+ Agreement. You (or your employer) retain the copyright to your contribution;
85
+ this simply gives us permission to use and redistribute your contributions as
86
+ part of the project. Head over to <https://cla.developers.google.com> to see
87
+ your current agreements on file or to sign a new one.
88
+
89
+ You generally only need to submit a CLA once, so if you've already submitted one
90
+ (even if it was for a different project), you probably don't need to do it
91
+ again.
92
+
93
+ ### Code reviews
94
+
95
+ All submissions, including submissions by project members, require review. We
96
+ use GitHub pull requests for this purpose. Consult
97
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
98
+ information on using pull requests.
@@ -0,0 +1,83 @@
1
+ # Dataproc Spark Connect Client
2
+
3
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
4
+ client with additional functionalities that allow applications to communicate
5
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
6
+ requiring additional steps.
7
+
8
+ ## Install
9
+
10
+ ```sh
11
+ pip install dataproc_spark_connect
12
+ ```
13
+
14
+ ## Uninstall
15
+
16
+ ```sh
17
+ pip uninstall dataproc_spark_connect
18
+ ```
19
+
20
+ ## Setup
21
+
22
+ This client requires permissions to
23
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
24
+ If you are running the client outside of Google Cloud, you must set the following
25
+ environment variables:
26
+
27
+ * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
28
+ workloads
29
+ * `GOOGLE_CLOUD_REGION` - The Compute
30
+ Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
31
+ where you run the Spark workload.
32
+ * `GOOGLE_APPLICATION_CREDENTIALS` -
33
+ Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
34
+
35
+ ## Usage
36
+
37
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark
38
+ Connect modules:
39
+
40
+ ```sh
41
+ pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
42
+ ```
43
+
44
+ 2. Add the required imports into your PySpark application or notebook and start
45
+ a Spark session with the following code instead of using
46
+ environment variables:
47
+
48
+ ```python
49
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
50
+ from google.cloud.dataproc_v1 import Session
51
+ session_config = Session()
52
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
53
+ session_config.runtime_config.version = '2.2'
54
+ spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
55
+ ```
56
+
57
+ ## Developing
58
+
59
+ For development instructions, see the [guide](DEVELOPING.md).
60
+
61
+ ## Contributing
62
+
63
+ We'd love to accept your patches and contributions to this project. There are
64
+ just a few small guidelines you need to follow.
65
+
66
+ ### Contributor License Agreement
67
+
68
+ Contributions to this project must be accompanied by a Contributor License
69
+ Agreement. You (or your employer) retain the copyright to your contribution;
70
+ this simply gives us permission to use and redistribute your contributions as
71
+ part of the project. Head over to <https://cla.developers.google.com> to see
72
+ your current agreements on file or to sign a new one.
73
+
74
+ You generally only need to submit a CLA once, so if you've already submitted one
75
+ (even if it was for a different project), you probably don't need to do it
76
+ again.
77
+
78
+ ### Code reviews
79
+
80
+ All submissions, including submissions by project members, require review. We
81
+ use GitHub pull requests for this purpose. Consult
82
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
83
+ information on using pull requests.
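
The Setup and Usage sections above take the project and region either from the explicit `Session` config or from environment variables. A minimal sketch of the environment-variable path is below; the project ID and region are placeholders, and whether a default network suffices depends on the project's networking setup.

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Placeholder values; in practice these would be exported in the shell
# or already present in the notebook environment.
os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"      # hypothetical project ID
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"      # hypothetical region

# With no explicit Session config, the builder falls back to these
# environment variables when creating the Dataproc Session.
spark = DataprocSparkSession.builder.getOrCreate()
spark.stop()
```
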
@@ -0,0 +1,98 @@
1
+ Metadata-Version: 2.1
2
+ Name: dataproc-spark-connect
3
+ Version: 0.7.0
4
+ Summary: Dataproc client library for Spark Connect
5
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
+ Author: Google LLC
7
+ License: Apache 2.0
8
+ License-File: LICENSE
9
+ Requires-Dist: google-api-core>=2.19
10
+ Requires-Dist: google-cloud-dataproc>=5.18
11
+ Requires-Dist: packaging>=20.0
12
+ Requires-Dist: pyspark[connect]>=3.5
13
+ Requires-Dist: tqdm>=4.67
14
+ Requires-Dist: websockets>=15.0
15
+
16
+ # Dataproc Spark Connect Client
17
+
18
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
19
+ client with additional functionalities that allow applications to communicate
20
+ with a remote Dataproc Spark Session using the Spark Connect protocol without
21
+ requiring additional steps.
22
+
23
+ ## Install
24
+
25
+ ```sh
26
+ pip install dataproc_spark_connect
27
+ ```
28
+
29
+ ## Uninstall
30
+
31
+ ```sh
32
+ pip uninstall dataproc_spark_connect
33
+ ```
34
+
35
+ ## Setup
36
+
37
+ This client requires permissions to
38
+ manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
39
+ If you are running the client outside of Google Cloud, you must set the following
40
+ environment variables:
41
+
42
+ * `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
43
+ workloads
44
+ * `GOOGLE_CLOUD_REGION` - The Compute
45
+ Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
46
+ where you run the Spark workload.
47
+ * `GOOGLE_APPLICATION_CREDENTIALS` -
48
+ Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
49
+
50
+ ## Usage
51
+
52
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark
53
+ Connect modules:
54
+
55
+ ```sh
56
+ pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
57
+ ```
58
+
59
+ 2. Add the required imports into your PySpark application or notebook and start
60
+ a Spark session with the following code instead of using
61
+ environment variables:
62
+
63
+ ```python
64
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
65
+ from google.cloud.dataproc_v1 import Session
66
+ session_config = Session()
67
+ session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
68
+ session_config.runtime_config.version = '2.2'
69
+ spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
70
+ ```
71
+
72
+ ## Developing
73
+
74
+ For development instructions, see the [guide](DEVELOPING.md).
75
+
76
+ ## Contributing
77
+
78
+ We'd love to accept your patches and contributions to this project. There are
79
+ just a few small guidelines you need to follow.
80
+
81
+ ### Contributor License Agreement
82
+
83
+ Contributions to this project must be accompanied by a Contributor License
84
+ Agreement. You (or your employer) retain the copyright to your contribution;
85
+ this simply gives us permission to use and redistribute your contributions as
86
+ part of the project. Head over to <https://cla.developers.google.com> to see
87
+ your current agreements on file or to sign a new one.
88
+
89
+ You generally only need to submit a CLA once, so if you've already submitted one
90
+ (even if it was for a different project), you probably don't need to do it
91
+ again.
92
+
93
+ ### Code reviews
94
+
95
+ All submissions, including submissions by project members, require review. We
96
+ use GitHub pull requests for this purpose. Consult
97
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
98
+ information on using pull requests.
@@ -0,0 +1,6 @@
1
+ google-api-core>=2.19
2
+ google-cloud-dataproc>=5.18
3
+ packaging>=20.0
4
+ pyspark[connect]>=3.5
5
+ tqdm>=4.67
6
+ websockets>=15.0
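
The regenerated requires.txt now pins minimum versions for all six dependencies, including the newly added tqdm and websockets floors. A quick, non-authoritative way to check what is actually installed (using `importlib.metadata`, which session.py itself relies on) could be:

```python
import importlib.metadata

# Distribution names as they appear in requires.txt.
for dist in ("google-api-core", "google-cloud-dataproc", "packaging",
             "pyspark", "tqdm", "websockets"):
    try:
        print(dist, importlib.metadata.version(dist))
    except importlib.metadata.PackageNotFoundError:
        print(dist, "not installed")
```
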
@@ -18,7 +18,6 @@ import contextlib
18
18
  import logging
19
19
  import socket
20
20
  import threading
21
- import time
22
21
 
23
22
  import websockets.sync.client as websocketclient
24
23
 
@@ -95,6 +94,7 @@ def forward_bytes(name, from_sock, to_sock):
95
94
  This method is intended to be run in a separate thread of execution.
96
95
 
97
96
  Args:
97
+ name: forwarding thread name
98
98
  from_sock: A socket-like object to stream bytes from.
99
99
  to_sock: A socket-like object to stream bytes to.
100
100
  """
@@ -131,7 +131,7 @@ def connect_sockets(conn_number, from_sock, to_sock):
131
131
  This method continuously streams bytes in both directions between the
132
132
  given `from_sock` and `to_sock` socket-like objects.
133
133
 
134
- The caller is responsible for creating and closing the supplied socekts.
134
+ The caller is responsible for creating and closing the supplied sockets.
135
135
  """
136
136
  forward_name = f"{conn_number}-forward"
137
137
  t1 = threading.Thread(
@@ -163,7 +163,7 @@ def forward_connection(conn_number, conn, addr, target_host):
163
163
  Both the supplied incoming connection (`conn`) and the created outgoing
164
164
  connection are automatically closed when this method terminates.
165
165
 
166
- This method should be run inside of a daemon thread so that it will not
166
+ This method should be run inside a daemon thread so that it will not
167
167
  block program termination.
168
168
  """
169
169
  with conn:
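
The proxy.py changes above are documentation-only, but the functions they touch implement a common pattern: one daemon-friendly thread per direction copying bytes between two socket-like objects. The following is a rough, self-contained sketch of that pattern, not the module's actual implementation; only the function names and the thread-name format are taken from the hunks above.

```python
import threading


def forward_bytes(name, from_sock, to_sock):
    """Copy bytes from one socket-like object to another until EOF."""
    while True:
        data = from_sock.recv(4096)
        if not data:
            break
        to_sock.sendall(data)


def connect_sockets(conn_number, from_sock, to_sock):
    """Stream bytes in both directions; the caller owns and closes the sockets."""
    t1 = threading.Thread(
        target=forward_bytes,
        args=(f"{conn_number}-forward", from_sock, to_sock),
        daemon=True,
    )
    t2 = threading.Thread(
        target=forward_bytes,
        args=(f"{conn_number}-backward", to_sock, from_sock),
        daemon=True,
    )
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```
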
@@ -11,39 +11,37 @@
11
11
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
12
  # See the License for the specific language governing permissions and
13
13
  # limitations under the License.
14
+
14
15
  import atexit
16
+ import datetime
15
17
  import json
16
18
  import logging
17
19
  import os
18
20
  import random
19
21
  import string
22
+ import threading
20
23
  import time
21
- import datetime
22
- from time import sleep
23
- from typing import Any, cast, ClassVar, Dict, Optional
24
+ import tqdm
24
25
 
25
26
  from google.api_core import retry
26
- from google.api_core.future.polling import POLLING_PREDICATE
27
27
  from google.api_core.client_options import ClientOptions
28
28
  from google.api_core.exceptions import Aborted, FailedPrecondition, InvalidArgument, NotFound, PermissionDenied
29
- from google.cloud.dataproc_v1.types import sessions
30
-
31
- from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
29
+ from google.api_core.future.polling import POLLING_PREDICATE
32
30
  from google.cloud.dataproc_spark_connect.client import DataprocChannelBuilder
31
+ from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException
32
+ from google.cloud.dataproc_spark_connect.pypi_artifacts import PyPiArtifacts
33
33
  from google.cloud.dataproc_v1 import (
34
+ AuthenticationConfig,
34
35
  CreateSessionRequest,
35
36
  GetSessionRequest,
36
37
  Session,
37
38
  SessionControllerClient,
38
- SessionTemplate,
39
39
  TerminateSessionRequest,
40
40
  )
41
- from google.protobuf import text_format
42
- from google.protobuf.text_format import ParseError
41
+ from google.cloud.dataproc_v1.types import sessions
43
42
  from pyspark.sql.connect.session import SparkSession
44
43
  from pyspark.sql.utils import to_str
45
-
46
- from google.cloud.dataproc_spark_connect.exceptions import DataprocSparkConnectException
44
+ from typing import Any, cast, ClassVar, Dict, Optional
47
45
 
48
46
  # Set up logging
49
47
  logging.basicConfig(level=logging.INFO)
@@ -69,6 +67,8 @@ class DataprocSparkSession(SparkSession):
69
67
  ... ) # doctest: +SKIP
70
68
  """
71
69
 
70
+ _DEFAULT_RUNTIME_VERSION = "2.2"
71
+
72
72
  _active_s8s_session_uuid: ClassVar[Optional[str]] = None
73
73
  _project_id = None
74
74
  _region = None
@@ -77,7 +77,12 @@ class DataprocSparkSession(SparkSession):
77
77
 
78
78
  class Builder(SparkSession.Builder):
79
79
 
80
- _dataproc_runtime_spark_version = {"3.0": "3.5.1", "2.2": "3.5.0"}
80
+ _dataproc_runtime_to_spark_version = {
81
+ "1.2": "3.5",
82
+ "2.2": "3.5",
83
+ "2.3": "3.5",
84
+ "3.0": "4.0",
85
+ }
81
86
 
82
87
  _session_static_configs = [
83
88
  "spark.executor.cores",
@@ -93,10 +98,10 @@ class DataprocSparkSession(SparkSession):
93
98
  self._options: Dict[str, Any] = {}
94
99
  self._channel_builder: Optional[DataprocChannelBuilder] = None
95
100
  self._dataproc_config: Optional[Session] = None
96
- self._project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
97
- self._region = os.environ.get("GOOGLE_CLOUD_REGION")
101
+ self._project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
102
+ self._region = os.getenv("GOOGLE_CLOUD_REGION")
98
103
  self._client_options = ClientOptions(
99
- api_endpoint=os.environ.get(
104
+ api_endpoint=os.getenv(
100
105
  "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
101
106
  f"{self._region}-dataproc.googleapis.com",
102
107
  )
@@ -117,7 +122,7 @@ class DataprocSparkSession(SparkSession):
117
122
 
118
123
  def location(self, location):
119
124
  self._region = location
120
- self._client_options.api_endpoint = os.environ.get(
125
+ self._client_options.api_endpoint = os.getenv(
121
126
  "GOOGLE_CLOUD_DATAPROC_API_ENDPOINT",
122
127
  f"{self._region}-dataproc.googleapis.com",
123
128
  )
@@ -155,10 +160,7 @@ class DataprocSparkSession(SparkSession):
155
160
  spark_connect_url = session_response.runtime_info.endpoints.get(
156
161
  "Spark Connect Server"
157
162
  )
158
- spark_connect_url = spark_connect_url.replace("https", "sc")
159
- if not spark_connect_url.endswith("/"):
160
- spark_connect_url += "/"
161
- url = f"{spark_connect_url.replace('.com/', '.com:443/')};session_id={session_response.uuid};use_ssl=true"
163
+ url = f"{spark_connect_url}/;session_id={session_response.uuid};use_ssl=true"
162
164
  logger.debug(f"Spark Connect URL: {url}")
163
165
  self._channel_builder = DataprocChannelBuilder(
164
166
  url,
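
The rewritten URL logic drops the `https` → `sc` scheme rewrite and the explicit `:443` port, and simply appends the session parameters to the endpoint reported by the service. Assuming a placeholder endpoint and session UUID, the before/after shapes are roughly:

```python
# Placeholder values for illustration only.
spark_connect_url = "https://example-endpoint.dataproc.googleapis.com"  # hypothetical
session_uuid = "0000-1111"                                              # hypothetical

# 0.6.0-style rewrite (scheme swap plus explicit port):
old_url = spark_connect_url.replace("https", "sc")
if not old_url.endswith("/"):
    old_url += "/"
old_url = f"{old_url.replace('.com/', '.com:443/')};session_id={session_uuid};use_ssl=true"

# 0.7.0-style construction (endpoint used as-is):
new_url = f"{spark_connect_url}/;session_id={session_uuid};use_ssl=true"

print(old_url)
print(new_url)
```
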
@@ -179,56 +181,66 @@ class DataprocSparkSession(SparkSession):
179
181
 
180
182
  if self._options.get("spark.remote", False):
181
183
  raise NotImplemented(
182
- "DataprocSparkSession does not support connecting to an existing remote server"
184
+ "DataprocSparkSession does not support connecting to an existing Spark Connect remote server"
183
185
  )
184
186
 
185
187
  from google.cloud.dataproc_v1 import SessionControllerClient
186
188
 
187
189
  dataproc_config: Session = self._get_dataproc_config()
188
- session_template: SessionTemplate = self._get_session_template()
189
190
 
190
- self._get_and_validate_version(
191
- dataproc_config, session_template
192
- )
191
+ self._validate_version(dataproc_config)
193
192
 
194
- spark_connect_session = self._get_spark_connect_session(
195
- dataproc_config, session_template
196
- )
197
-
198
- if not spark_connect_session:
199
- dataproc_config.spark_connect_session = {}
200
- os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
201
- session_request = CreateSessionRequest()
202
193
  session_id = self.generate_dataproc_session_id()
203
-
204
- session_request.session_id = session_id
205
194
  dataproc_config.name = f"projects/{self._project_id}/locations/{self._region}/sessions/{session_id}"
206
195
  logger.debug(
207
- f"Configurations used to create serverless session:\n {dataproc_config}"
196
+ f"Dataproc Session configuration:\n{dataproc_config}"
208
197
  )
198
+
199
+ session_request = CreateSessionRequest()
200
+ session_request.session_id = session_id
209
201
  session_request.session = dataproc_config
210
202
  session_request.parent = (
211
203
  f"projects/{self._project_id}/locations/{self._region}"
212
204
  )
213
205
 
214
- logger.debug("Creating serverless session")
206
+ logger.debug("Creating Dataproc Session")
215
207
  DataprocSparkSession._active_s8s_session_id = session_id
216
208
  s8s_creation_start_time = time.time()
217
- try:
218
- session_polling = retry.Retry(
219
- predicate=POLLING_PREDICATE,
220
- initial=5.0, # seconds
221
- maximum=5.0, # seconds
222
- multiplier=1.0,
223
- timeout=600, # seconds
209
+
210
+ stop_create_session_pbar = False
211
+
212
+ def create_session_pbar():
213
+ iterations = 150
214
+ pbar = tqdm.trange(
215
+ iterations,
216
+ bar_format="{bar}",
217
+ ncols=80,
224
218
  )
225
- print("Creating Spark session. It may take a few minutes.")
219
+ for i in pbar:
220
+ if stop_create_session_pbar:
221
+ break
222
+ # Last iteration
223
+ if i >= iterations - 1:
224
+ # Sleep until session created
225
+ while not stop_create_session_pbar:
226
+ time.sleep(1)
227
+ else:
228
+ time.sleep(1)
229
+
230
+ pbar.close()
231
+ # Print new line after the progress bar
232
+ print()
233
+
234
+ create_session_pbar_thread = threading.Thread(
235
+ target=create_session_pbar
236
+ )
237
+
238
+ try:
226
239
  if (
227
- "dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"
228
- in os.environ
229
- and os.getenv(
230
- "dataproc_spark_connect_SESSION_TERMINATE_AT_EXIT"
231
- ).lower()
240
+ os.getenv(
241
+ "DATAPROC_SPARK_CONNECT_SESSION_TERMINATE_AT_EXIT",
242
+ "false",
243
+ )
232
244
  == "true"
233
245
  ):
234
246
  atexit.register(
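
`getOrCreate` now renders a tqdm progress bar from a background thread while session creation is pending, stopping it via a shared flag once the operation completes or fails. A stripped-down sketch of that pattern follows; the bar width and per-tick sleep match the hunk, but the long-running work is simulated and the last-iteration idle loop is omitted.

```python
import threading
import time

import tqdm

stop_create_session_pbar = False


def create_session_pbar():
    # Tick roughly once per second for up to 150 iterations, exiting early
    # when the creating thread sets the stop flag.
    pbar = tqdm.trange(150, bar_format="{bar}", ncols=80)
    for _ in pbar:
        if stop_create_session_pbar:
            break
        time.sleep(1)
    pbar.close()
    print()  # newline after the progress bar


pbar_thread = threading.Thread(target=create_session_pbar)
pbar_thread.start()

# ... the long-running operation (session creation) would happen here ...
time.sleep(3)

stop_create_session_pbar = True
pbar_thread.join()
```
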
@@ -243,18 +255,25 @@ class DataprocSparkSession(SparkSession):
243
255
  client_options=self._client_options
244
256
  ).create_session(session_request)
245
257
  print(
246
- f"Interactive Session Detail View: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
258
+ f"Creating Dataproc Session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{session_id}?project={self._project_id}"
247
259
  )
260
+ create_session_pbar_thread.start()
248
261
  session_response: Session = operation.result(
249
- polling=session_polling
262
+ polling=retry.Retry(
263
+ predicate=POLLING_PREDICATE,
264
+ initial=5.0, # seconds
265
+ maximum=5.0, # seconds
266
+ multiplier=1.0,
267
+ timeout=600, # seconds
268
+ )
250
269
  )
251
- if (
252
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
253
- in os.environ
254
- ):
255
- file_path = os.environ[
256
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
257
- ]
270
+ stop_create_session_pbar = True
271
+ create_session_pbar_thread.join()
272
+ print("Dataproc Session was successfully created")
273
+ file_path = (
274
+ DataprocSparkSession._get_active_session_file_path()
275
+ )
276
+ if file_path is not None:
258
277
  try:
259
278
  session_data = {
260
279
  "session_name": session_response.name,
@@ -267,21 +286,27 @@ class DataprocSparkSession(SparkSession):
267
286
  json.dump(session_data, json_file, indent=4)
268
287
  except Exception as e:
269
288
  logger.error(
270
- f"Exception while writing active session to file {file_path} , {e}"
289
+ f"Exception while writing active session to file {file_path}, {e}"
271
290
  )
272
291
  except (InvalidArgument, PermissionDenied) as e:
292
+ stop_create_session_pbar = True
293
+ if create_session_pbar_thread.is_alive():
294
+ create_session_pbar_thread.join()
273
295
  DataprocSparkSession._active_s8s_session_id = None
274
296
  raise DataprocSparkConnectException(
275
- f"Error while creating serverless session: {e.message}"
297
+ f"Error while creating Dataproc Session: {e.message}"
276
298
  )
277
299
  except Exception as e:
300
+ stop_create_session_pbar = True
301
+ if create_session_pbar_thread.is_alive():
302
+ create_session_pbar_thread.join()
278
303
  DataprocSparkSession._active_s8s_session_id = None
279
304
  raise RuntimeError(
280
- f"Error while creating serverless session"
305
+ f"Error while creating Dataproc Session"
281
306
  ) from e
282
307
 
283
308
  logger.debug(
284
- f"Serverless session created: {session_id}, creation time taken: {int(time.time() - s8s_creation_start_time)} seconds"
309
+ f"Dataproc Session created: {session_id} in {int(time.time() - s8s_creation_start_time)} seconds"
285
310
  )
286
311
  return self.__create_spark_connect_session_from_s8s(
287
312
  session_response, dataproc_config.name
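
When `DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH` is set, the newly created session is recorded as JSON at that path. A minimal sketch of that write, with the field set reduced to what the hunk shows and a placeholder session name:

```python
import json
import os

file_path = os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")
if file_path is not None:
    session_data = {
        # Fully qualified name, e.g.
        # projects/<project>/locations/<region>/sessions/<session-id>
        "session_name": "projects/my-project/locations/us-central1/sessions/sc-example",  # placeholder
    }
    with open(file_path, "w") as json_file:
        json.dump(session_data, json_file, indent=4)
```
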
@@ -292,17 +317,20 @@ class DataprocSparkSession(SparkSession):
292
317
  ) -> Optional["DataprocSparkSession"]:
293
318
  s8s_session_id = DataprocSparkSession._active_s8s_session_id
294
319
  session_name = f"projects/{self._project_id}/locations/{self._region}/sessions/{s8s_session_id}"
295
- session_response = get_active_s8s_session_response(
296
- session_name, self._client_options
297
- )
320
+ session_response = None
321
+ session = None
322
+ if s8s_session_id is not None:
323
+ session_response = get_active_s8s_session_response(
324
+ session_name, self._client_options
325
+ )
326
+ session = DataprocSparkSession.getActiveSession()
298
327
 
299
- session = DataprocSparkSession.getActiveSession()
300
328
  if session is None:
301
329
  session = DataprocSparkSession._default_session
302
330
 
303
331
  if session_response is not None:
304
332
  print(
305
- f"Using existing session: https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}, configuration changes may not be applied."
333
+ f"Using existing Dataproc Session (configuration changes may not be applied): https://console.cloud.google.com/dataproc/interactive/{self._region}/{s8s_session_id}?project={self._project_id}"
306
334
  )
307
335
  if session is None:
308
336
  session = self.__create_spark_connect_session_from_s8s(
@@ -310,10 +338,10 @@ class DataprocSparkSession(SparkSession):
310
338
  )
311
339
  return session
312
340
  else:
313
- logger.info(
314
- f"Session: {s8s_session_id} not active, stopping previous spark session and creating new"
315
- )
316
341
  if session is not None:
342
+ print(
343
+ f"{s8s_session_id} Dataproc Session is not active, stopping and creating a new one"
344
+ )
317
345
  session.stop()
318
346
 
319
347
  return None
@@ -333,21 +361,52 @@ class DataprocSparkSession(SparkSession):
333
361
  dataproc_config = self._dataproc_config
334
362
  for k, v in self._options.items():
335
363
  dataproc_config.runtime_config.properties[k] = v
336
- elif "DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG" in os.environ:
337
- filepath = os.environ[
338
- "DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG"
364
+ dataproc_config.spark_connect_session = (
365
+ sessions.SparkConnectConfig()
366
+ )
367
+ if not dataproc_config.runtime_config.version:
368
+ dataproc_config.runtime_config.version = (
369
+ DataprocSparkSession._DEFAULT_RUNTIME_VERSION
370
+ )
371
+ if (
372
+ not dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type
373
+ and "DATAPROC_SPARK_CONNECT_AUTH_TYPE" in os.environ
374
+ ):
375
+ dataproc_config.environment_config.execution_config.authentication_config.user_workload_authentication_type = AuthenticationConfig.AuthenticationType[
376
+ os.getenv("DATAPROC_SPARK_CONNECT_AUTH_TYPE")
339
377
  ]
340
- try:
341
- with open(filepath, "r") as f:
342
- dataproc_config = Session.wrap(
343
- text_format.Parse(
344
- f.read(), Session.pb(dataproc_config)
345
- )
346
- )
347
- except FileNotFoundError:
348
- raise FileNotFoundError(f"File '{filepath}' not found")
349
- except ParseError as e:
350
- raise ParseError(f"Error parsing file '{filepath}': {e}")
378
+ if (
379
+ not dataproc_config.environment_config.execution_config.service_account
380
+ and "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT" in os.environ
381
+ ):
382
+ dataproc_config.environment_config.execution_config.service_account = os.getenv(
383
+ "DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"
384
+ )
385
+ if (
386
+ not dataproc_config.environment_config.execution_config.subnetwork_uri
387
+ and "DATAPROC_SPARK_CONNECT_SUBNET" in os.environ
388
+ ):
389
+ dataproc_config.environment_config.execution_config.subnetwork_uri = os.getenv(
390
+ "DATAPROC_SPARK_CONNECT_SUBNET"
391
+ )
392
+ if (
393
+ not dataproc_config.environment_config.execution_config.ttl
394
+ and "DATAPROC_SPARK_CONNECT_TTL_SECONDS" in os.environ
395
+ ):
396
+ dataproc_config.environment_config.execution_config.ttl = {
397
+ "seconds": int(
398
+ os.getenv("DATAPROC_SPARK_CONNECT_TTL_SECONDS")
399
+ )
400
+ }
401
+ if (
402
+ not dataproc_config.environment_config.execution_config.idle_ttl
403
+ and "DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS" in os.environ
404
+ ):
405
+ dataproc_config.environment_config.execution_config.idle_ttl = {
406
+ "seconds": int(
407
+ os.getenv("DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS")
408
+ )
409
+ }
351
410
  if "COLAB_NOTEBOOK_RUNTIME_ID" in os.environ:
352
411
  dataproc_config.labels["colab-notebook-runtime-id"] = (
353
412
  os.environ["COLAB_NOTEBOOK_RUNTIME_ID"]
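
The textproto-based `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG` file is gone; individual `DATAPROC_SPARK_CONNECT_*` environment variables now fill in session fields the caller did not set explicitly. A hedged sketch of configuring a session purely through these variables (all values are placeholders) could look like:

```python
import os

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Placeholder values; each variable is only consulted when the
# corresponding field is not already set on the Session config.
os.environ["DATAPROC_SPARK_CONNECT_SUBNET"] = (
    "projects/my-project/regions/us-central1/subnetworks/default"
)
os.environ["DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT"] = (
    "spark-sa@my-project.iam.gserviceaccount.com"
)
os.environ["DATAPROC_SPARK_CONNECT_TTL_SECONDS"] = "3600"
os.environ["DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS"] = "900"

spark = DataprocSparkSession.builder.getOrCreate()
```
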
@@ -358,87 +417,38 @@ class DataprocSparkSession(SparkSession):
358
417
  ]
359
418
  return dataproc_config
360
419
 
361
- def _get_session_template(self):
362
- from google.cloud.dataproc_v1 import (
363
- GetSessionTemplateRequest,
364
- SessionTemplateControllerClient,
365
- )
366
-
367
- session_template = None
368
- if self._dataproc_config and self._dataproc_config.session_template:
369
- session_template = self._dataproc_config.session_template
370
- get_session_template_request = GetSessionTemplateRequest()
371
- get_session_template_request.name = session_template
372
- client = SessionTemplateControllerClient(
373
- client_options=self._client_options
374
- )
375
- try:
376
- session_template = client.get_session_template(
377
- get_session_template_request
378
- )
379
- except Exception as e:
380
- logger.error(
381
- f"Failed to get session template {session_template}: {e}"
382
- )
383
- raise
384
- return session_template
420
+ def _validate_version(self, dataproc_config):
421
+ trim_version = lambda v: ".".join(v.split(".")[:2])
385
422
 
386
- def _get_and_validate_version(self, dataproc_config, session_template):
387
- trimmed_version = lambda v: ".".join(v.split(".")[:2])
388
- version = None
423
+ version = dataproc_config.runtime_config.version
389
424
  if (
390
- dataproc_config
391
- and dataproc_config.runtime_config
392
- and dataproc_config.runtime_config.version
393
- ):
394
- version = dataproc_config.runtime_config.version
395
- elif (
396
- session_template
397
- and session_template.runtime_config
398
- and session_template.runtime_config.version
399
- ):
400
- version = session_template.runtime_config.version
401
-
402
- if not version:
403
- version = "3.0"
404
- dataproc_config.runtime_config.version = version
405
- elif (
406
- trimmed_version(version)
407
- not in self._dataproc_runtime_spark_version
425
+ trim_version(version)
426
+ not in self._dataproc_runtime_to_spark_version
408
427
  ):
409
428
  raise ValueError(
410
- f"runtime_config.version {version} is not supported. "
411
- f"Supported versions: {self._dataproc_runtime_spark_version.keys()}"
429
+ f"Specified {version} Dataproc Spark runtime version is not supported. "
430
+ f"Supported runtime versions: {self._dataproc_runtime_to_spark_version.keys()}"
412
431
  )
413
432
 
414
- server_version = self._dataproc_runtime_spark_version[
415
- trimmed_version(version)
433
+ server_version = self._dataproc_runtime_to_spark_version[
434
+ trim_version(version)
416
435
  ]
436
+
417
437
  import importlib.metadata
418
438
 
419
- google_connect_version = importlib.metadata.version(
439
+ dataproc_connect_version = importlib.metadata.version(
420
440
  "dataproc-spark-connect"
421
441
  )
422
442
  client_version = importlib.metadata.version("pyspark")
423
- version_message = f"Spark Connect: {google_connect_version} (PySpark: {client_version}) Session Runtime: {version} (Spark: {server_version})"
424
- logger.info(version_message)
425
- if trimmed_version(client_version) != trimmed_version(
426
- server_version
427
- ):
428
- logger.warning(
429
- f"client and server on different versions: {version_message}"
443
+ if trim_version(client_version) != trim_version(server_version):
444
+ print(
445
+ f"Spark Connect client and server use different versions:\n"
446
+ f"- Dataproc Spark Connect client {dataproc_connect_version} (PySpark {client_version})\n"
447
+ f"- Dataproc Spark runtime {version} (Spark {server_version})"
430
448
  )
431
- return version
432
449
 
433
- def _get_spark_connect_session(self, dataproc_config, session_template):
434
- spark_connect_session = None
435
- if dataproc_config and dataproc_config.spark_connect_session:
436
- spark_connect_session = dataproc_config.spark_connect_session
437
- elif session_template and session_template.spark_connect_session:
438
- spark_connect_session = session_template.spark_connect_session
439
- return spark_connect_session
440
-
441
- def generate_dataproc_session_id(self):
450
+ @staticmethod
451
+ def generate_dataproc_session_id():
442
452
  timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
443
453
  suffix_length = 6
444
454
  random_suffix = "".join(
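
`_validate_version` trims the configured runtime to major.minor, checks it against the supported map, and compares the installed PySpark version with the runtime's Spark version. A condensed sketch of that check, with the mapping copied from the hunk above and a hard-coded runtime version standing in for `runtime_config.version`:

```python
import importlib.metadata

_dataproc_runtime_to_spark_version = {
    "1.2": "3.5",
    "2.2": "3.5",
    "2.3": "3.5",
    "3.0": "4.0",
}

trim_version = lambda v: ".".join(v.split(".")[:2])

runtime_version = "2.2"  # would come from runtime_config.version
if trim_version(runtime_version) not in _dataproc_runtime_to_spark_version:
    raise ValueError(f"Unsupported runtime version: {runtime_version}")

server_version = _dataproc_runtime_to_spark_version[trim_version(runtime_version)]
client_version = importlib.metadata.version("pyspark")

if trim_version(client_version) != trim_version(server_version):
    print(f"Client PySpark {client_version} differs from server Spark {server_version}")
```
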
@@ -451,32 +461,30 @@ class DataprocSparkSession(SparkSession):
451
461
  def _repr_html_(self) -> str:
452
462
  if not self._active_s8s_session_id:
453
463
  return """
454
- <div>No Active Dataproc Spark Session</div>
464
+ <div>No Active Dataproc Session</div>
455
465
  """
456
466
 
457
467
  s8s_session = f"https://console.cloud.google.com/dataproc/interactive/{self._region}/{self._active_s8s_session_id}"
458
468
  ui = f"{s8s_session}/sparkApplications/applications"
459
- version = ""
460
469
  return f"""
461
470
  <div>
462
471
  <p><b>Spark Connect</b></p>
463
472
 
464
- <p><a href="{s8s_session}?project={self._project_id}">Serverless Session</a></p>
473
+ <p><a href="{s8s_session}?project={self._project_id}">Dataproc Session</a></p>
465
474
  <p><a href="{ui}?project={self._project_id}">Spark UI</a></p>
466
475
  </div>
467
476
  """
468
477
 
469
- def _remove_stoped_session_from_file(self):
470
- if "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH" in os.environ:
471
- file_path = os.environ[
472
- "DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH"
473
- ]
478
+ @staticmethod
479
+ def _remove_stopped_session_from_file():
480
+ file_path = DataprocSparkSession._get_active_session_file_path()
481
+ if file_path is not None:
474
482
  try:
475
483
  with open(file_path, "w"):
476
484
  pass
477
485
  except Exception as e:
478
486
  logger.error(
479
- f"Exception while removing active session in file {file_path} , {e}"
487
+ f"Exception while removing active session in file {file_path}, {e}"
480
488
  )
481
489
 
482
490
  def addArtifacts(
@@ -494,7 +502,7 @@ class DataprocSparkSession(SparkSession):
494
502
 
495
503
  Parameters
496
504
  ----------
497
- *path : tuple of str
505
+ *artifact : tuple of str
498
506
  Artifact's URIs to add.
499
507
  pyfile : bool
500
508
  Whether to add them as Python dependencies such as .py, .egg, .zip or .jar files.
@@ -507,7 +515,7 @@ class DataprocSparkSession(SparkSession):
507
515
  Add a file to be downloaded with this Spark job on every node.
508
516
  The ``path`` passed can only be a local file for now.
509
517
  pypi : bool
510
- This option is only available with DataprocSparkSession. eg. `spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)`
518
+ This option is only available with DataprocSparkSession. e.g. `spark.addArtifacts("spacy==3.8.4", "torch", pypi=True)`
511
519
  Installs PyPi package (with its dependencies) in the active Spark session on the driver and executors.
512
520
 
513
521
  Notes
@@ -534,6 +542,10 @@ class DataprocSparkSession(SparkSession):
534
542
  *artifact, pyfile=pyfile, archive=archive, file=file
535
543
  )
536
544
 
545
+ @staticmethod
546
+ def _get_active_session_file_path():
547
+ return os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")
548
+
537
549
  def stop(self) -> None:
538
550
  with DataprocSparkSession._lock:
539
551
  if DataprocSparkSession._active_s8s_session_id is not None:
@@ -544,7 +556,7 @@ class DataprocSparkSession(SparkSession):
544
556
  self._client_options,
545
557
  )
546
558
 
547
- self._remove_stoped_session_from_file()
559
+ self._remove_stopped_session_from_file()
548
560
  DataprocSparkSession._active_s8s_session_uuid = None
549
561
  DataprocSparkSession._active_s8s_session_id = None
550
562
  DataprocSparkSession._project_id = None
@@ -565,7 +577,7 @@ def terminate_s8s_session(
565
577
  ):
566
578
  from google.cloud.dataproc_v1 import SessionControllerClient
567
579
 
568
- logger.debug(f"Terminating serverless session: {active_s8s_session_id}")
580
+ logger.debug(f"Terminating Dataproc Session: {active_s8s_session_id}")
569
581
  terminate_session_request = TerminateSessionRequest()
570
582
  session_name = f"projects/{project_id}/locations/{region}/sessions/{active_s8s_session_id}"
571
583
  terminate_session_request.name = session_name
@@ -583,18 +595,20 @@ def terminate_s8s_session(
583
595
  ):
584
596
  session = session_client.get_session(get_session_request)
585
597
  state = session.state
586
- sleep(1)
598
+ time.sleep(1)
587
599
  except NotFound:
588
- logger.debug(f"Session {active_s8s_session_id} already deleted")
600
+ logger.debug(
601
+ f"{active_s8s_session_id} Dataproc Session already deleted"
602
+ )
589
603
  # Client will get 'Aborted' error if session creation is still in progress and
590
604
  # 'FailedPrecondition' if another termination is still in progress.
591
- # Both are retryable but we catch it and let TTL take care of cleanups.
605
+ # Both are retryable, but we catch it and let TTL take care of cleanups.
592
606
  except (FailedPrecondition, Aborted):
593
607
  logger.debug(
594
- f"Session {active_s8s_session_id} already terminated manually or terminated automatically through session ttl limits"
608
+ f"{active_s8s_session_id} Dataproc Session already terminated manually or automatically due to TTL"
595
609
  )
596
610
  if state is not None and state == Session.State.FAILED:
597
- raise RuntimeError("Serverless session termination failed")
611
+ raise RuntimeError("Dataproc Session termination failed")
598
612
 
599
613
 
600
614
  def get_active_s8s_session_response(
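
Termination polls the session state roughly once per second, treating `NotFound`, `FailedPrecondition`, and `Aborted` as benign and only failing hard when the session ends up in `FAILED`. The sketch below reconstructs that loop under an assumed loop condition, since the hunk does not show the actual `while` predicate; client construction is elided.

```python
import time

from google.api_core.exceptions import Aborted, FailedPrecondition, NotFound
from google.cloud.dataproc_v1 import GetSessionRequest, Session, SessionControllerClient


def wait_for_termination(session_client: SessionControllerClient, session_name: str) -> None:
    """Poll the session state until it is no longer terminating (sketch only)."""
    get_session_request = GetSessionRequest(name=session_name)
    state = None
    try:
        state = session_client.get_session(get_session_request).state
        while state == Session.State.TERMINATING:  # assumed loop condition
            time.sleep(1)
            state = session_client.get_session(get_session_request).state
    except NotFound:
        pass  # session already deleted
    except (FailedPrecondition, Aborted):
        pass  # creation or another termination in progress; TTL cleans up
    if state == Session.State.FAILED:
        raise RuntimeError("Dataproc Session termination failed")
```
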
@@ -608,7 +622,7 @@ def get_active_s8s_session_response(
608
622
  ).get_session(get_session_request)
609
623
  state = get_session_response.state
610
624
  except Exception as e:
611
- logger.info(f"{session_name} deleted: {e}")
625
+ print(f"{session_name} Dataproc Session deleted: {e}")
612
626
  return None
613
627
  if state is not None and (
614
628
  state == Session.State.ACTIVE or state == Session.State.CREATING
@@ -20,7 +20,7 @@ long_description = (this_directory / "README.md").read_text()
20
20
 
21
21
  setup(
22
22
  name="dataproc-spark-connect",
23
- version="0.6.0",
23
+ version="0.7.0",
24
24
  description="Dataproc client library for Spark Connect",
25
25
  long_description=long_description,
26
26
  author="Google LLC",
@@ -28,10 +28,11 @@ setup(
28
28
  license="Apache 2.0",
29
29
  packages=find_namespace_packages(include=["google.*"]),
30
30
  install_requires=[
31
- "google-api-core>=2.19.1",
32
- "google-cloud-dataproc>=5.18.0",
33
- "websockets",
34
- "pyspark[connect]>=3.5",
31
+ "google-api-core>=2.19",
32
+ "google-cloud-dataproc>=5.18",
35
33
  "packaging>=20.0",
34
+ "pyspark[connect]>=3.5",
35
+ "tqdm>=4.67",
36
+ "websockets>=15.0",
36
37
  ],
37
38
  )
@@ -1,111 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: dataproc-spark-connect
3
- Version: 0.6.0
4
- Summary: Dataproc client library for Spark Connect
5
- Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
- Author: Google LLC
7
- License: Apache 2.0
8
- License-File: LICENSE
9
- Requires-Dist: google-api-core>=2.19.1
10
- Requires-Dist: google-cloud-dataproc>=5.18.0
11
- Requires-Dist: websockets
12
- Requires-Dist: pyspark[connect]>=3.5
13
- Requires-Dist: packaging>=20.0
14
-
15
- # Dataproc Spark Connect Client
16
-
17
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
18
- additional functionalities that allow applications to communicate with a remote Dataproc
19
- Spark cluster using the Spark Connect protocol without requiring additional steps.
20
-
21
- ## Install
22
-
23
- ```console
24
- pip install dataproc_spark_connect
25
- ```
26
-
27
- ## Uninstall
28
-
29
- ```console
30
- pip uninstall dataproc_spark_connect
31
- ```
32
-
33
- ## Setup
34
- This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
35
- If you are running the client outside of Google Cloud, you must set following environment variables:
36
-
37
- * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
38
- * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
39
- * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
40
- * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
41
-
42
- ## Usage
43
-
44
- 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
45
-
46
- ```console
47
- pip install google_cloud_dataproc --force-reinstall
48
- pip install dataproc_spark_connect --force-reinstall
49
- ```
50
-
51
- 2. Add the required import into your PySpark application or notebook:
52
-
53
- ```python
54
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
55
- ```
56
-
57
- 3. There are two ways to create a spark session,
58
-
59
- 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
60
-
61
- ```python
62
- spark = DataprocSparkSession.builder.getOrCreate()
63
- ```
64
-
65
- 2. Start a Spark session with the following code instead of using a config file:
66
-
67
- ```python
68
- from google.cloud.dataproc_v1 import SparkConnectConfig
69
- from google.cloud.dataproc_v1 import Session
70
- dataproc_session_config = Session()
71
- dataproc_session_config.spark_connect_session = SparkConnectConfig()
72
- dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
73
- dataproc_session_config.runtime_config.version = '3.0'
74
- spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
75
- ```
76
-
77
- ## Billing
78
- As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
79
- This will happen even if you are running the client from a non-GCE instance.
80
-
81
- ## Contributing
82
- ### Building and Deploying SDK
83
-
84
- 1. Install the requirements in virtual environment.
85
-
86
- ```console
87
- pip install -r requirements-dev.txt
88
- ```
89
-
90
- 2. Build the code.
91
-
92
- ```console
93
- python setup.py sdist bdist_wheel
94
- ```
95
-
96
- 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
97
-
98
- ```sh
99
- VERSION=<version>
100
- gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
101
- ```
102
-
103
- 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
104
-
105
- ```sh
106
- %%bash
107
- export VERSION=<version>
108
- gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
109
- yes | pip uninstall dataproc_spark_connect
110
- pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
111
- ```
@@ -1,97 +0,0 @@
1
- # Dataproc Spark Connect Client
2
-
3
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
4
- additional functionalities that allow applications to communicate with a remote Dataproc
5
- Spark cluster using the Spark Connect protocol without requiring additional steps.
6
-
7
- ## Install
8
-
9
- ```console
10
- pip install dataproc_spark_connect
11
- ```
12
-
13
- ## Uninstall
14
-
15
- ```console
16
- pip uninstall dataproc_spark_connect
17
- ```
18
-
19
- ## Setup
20
- This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
21
- If you are running the client outside of Google Cloud, you must set following environment variables:
22
-
23
- * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
24
- * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
25
- * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
26
- * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
27
-
28
- ## Usage
29
-
30
- 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
31
-
32
- ```console
33
- pip install google_cloud_dataproc --force-reinstall
34
- pip install dataproc_spark_connect --force-reinstall
35
- ```
36
-
37
- 2. Add the required import into your PySpark application or notebook:
38
-
39
- ```python
40
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
41
- ```
42
-
43
- 3. There are two ways to create a spark session,
44
-
45
- 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
46
-
47
- ```python
48
- spark = DataprocSparkSession.builder.getOrCreate()
49
- ```
50
-
51
- 2. Start a Spark session with the following code instead of using a config file:
52
-
53
- ```python
54
- from google.cloud.dataproc_v1 import SparkConnectConfig
55
- from google.cloud.dataproc_v1 import Session
56
- dataproc_session_config = Session()
57
- dataproc_session_config.spark_connect_session = SparkConnectConfig()
58
- dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
59
- dataproc_session_config.runtime_config.version = '3.0'
60
- spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
61
- ```
62
-
63
- ## Billing
64
- As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
65
- This will happen even if you are running the client from a non-GCE instance.
66
-
67
- ## Contributing
68
- ### Building and Deploying SDK
69
-
70
- 1. Install the requirements in virtual environment.
71
-
72
- ```console
73
- pip install -r requirements-dev.txt
74
- ```
75
-
76
- 2. Build the code.
77
-
78
- ```console
79
- python setup.py sdist bdist_wheel
80
- ```
81
-
82
- 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
83
-
84
- ```sh
85
- VERSION=<version>
86
- gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
87
- ```
88
-
89
- 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
90
-
91
- ```sh
92
- %%bash
93
- export VERSION=<version>
94
- gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
95
- yes | pip uninstall dataproc_spark_connect
96
- pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
97
- ```
@@ -1,111 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: dataproc-spark-connect
3
- Version: 0.6.0
4
- Summary: Dataproc client library for Spark Connect
5
- Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
6
- Author: Google LLC
7
- License: Apache 2.0
8
- License-File: LICENSE
9
- Requires-Dist: google-api-core>=2.19.1
10
- Requires-Dist: google-cloud-dataproc>=5.18.0
11
- Requires-Dist: websockets
12
- Requires-Dist: pyspark[connect]>=3.5
13
- Requires-Dist: packaging>=20.0
14
-
15
- # Dataproc Spark Connect Client
16
-
17
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
18
- additional functionalities that allow applications to communicate with a remote Dataproc
19
- Spark cluster using the Spark Connect protocol without requiring additional steps.
20
-
21
- ## Install
22
-
23
- ```console
24
- pip install dataproc_spark_connect
25
- ```
26
-
27
- ## Uninstall
28
-
29
- ```console
30
- pip uninstall dataproc_spark_connect
31
- ```
32
-
33
- ## Setup
34
- This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
35
- If you are running the client outside of Google Cloud, you must set following environment variables:
36
-
37
- * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
38
- * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
39
- * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
40
- * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
41
-
42
- ## Usage
43
-
44
- 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
45
-
46
- ```console
47
- pip install google_cloud_dataproc --force-reinstall
48
- pip install dataproc_spark_connect --force-reinstall
49
- ```
50
-
51
- 2. Add the required import into your PySpark application or notebook:
52
-
53
- ```python
54
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
55
- ```
56
-
57
- 3. There are two ways to create a spark session,
58
-
59
- 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
60
-
61
- ```python
62
- spark = DataprocSparkSession.builder.getOrCreate()
63
- ```
64
-
65
- 2. Start a Spark session with the following code instead of using a config file:
66
-
67
- ```python
68
- from google.cloud.dataproc_v1 import SparkConnectConfig
69
- from google.cloud.dataproc_v1 import Session
70
- dataproc_session_config = Session()
71
- dataproc_session_config.spark_connect_session = SparkConnectConfig()
72
- dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
73
- dataproc_session_config.runtime_config.version = '3.0'
74
- spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
75
- ```
76
-
77
- ## Billing
78
- As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
79
- This will happen even if you are running the client from a non-GCE instance.
80
-
81
- ## Contributing
82
- ### Building and Deploying SDK
83
-
84
- 1. Install the requirements in virtual environment.
85
-
86
- ```console
87
- pip install -r requirements-dev.txt
88
- ```
89
-
90
- 2. Build the code.
91
-
92
- ```console
93
- python setup.py sdist bdist_wheel
94
- ```
95
-
96
- 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
97
-
98
- ```sh
99
- VERSION=<version>
100
- gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
101
- ```
102
-
103
- 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
104
-
105
- ```sh
106
- %%bash
107
- export VERSION=<version>
108
- gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
109
- yes | pip uninstall dataproc_spark_connect
110
- pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
111
- ```
@@ -1,5 +0,0 @@
1
- google-api-core>=2.19.1
2
- google-cloud-dataproc>=5.18.0
3
- websockets
4
- pyspark[connect]>=3.5
5
- packaging>=20.0