dataproc-spark-connect 0.1.0__tar.gz → 0.2.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (20)
  1. dataproc_spark_connect-0.2.1/PKG-INFO +119 -0
  2. dataproc_spark_connect-0.2.1/README.md +103 -0
  3. dataproc_spark_connect-0.2.1/dataproc_spark_connect.egg-info/PKG-INFO +119 -0
  4. dataproc_spark_connect-0.2.1/dataproc_spark_connect.egg-info/requires.txt +7 -0
  5. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/__init__.py +9 -0
  6. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/client/proxy.py +59 -20
  7. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/session.py +2 -2
  8. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/setup.py +12 -1
  9. dataproc_spark_connect-0.1.0/PKG-INFO +0 -10
  10. dataproc_spark_connect-0.1.0/README.md +0 -90
  11. dataproc_spark_connect-0.1.0/dataproc_spark_connect.egg-info/PKG-INFO +0 -10
  12. dataproc_spark_connect-0.1.0/dataproc_spark_connect.egg-info/requires.txt +0 -5
  13. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/LICENSE +0 -0
  14. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/dataproc_spark_connect.egg-info/SOURCES.txt +0 -0
  15. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
  16. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
  17. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
  18. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/google/cloud/dataproc_spark_connect/client/core.py +0 -0
  19. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/pyproject.toml +0 -0
  20. {dataproc_spark_connect-0.1.0 → dataproc_spark_connect-0.2.1}/setup.cfg +0 -0
@@ -0,0 +1,119 @@
+ Metadata-Version: 2.1
+ Name: dataproc-spark-connect
+ Version: 0.2.1
+ Summary: Dataproc client library for Spark Connect
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
+ Author: Google LLC
+ License: Apache 2.0
+ License-File: LICENSE
+ Requires-Dist: google-api-core>=2.19.1
+ Requires-Dist: google-cloud-dataproc>=5.15.1
+ Requires-Dist: wheel
+ Requires-Dist: websockets
+ Requires-Dist: pyspark>=3.5
+ Requires-Dist: pandas
+ Requires-Dist: pyarrow
+
+ # Dataproc Spark Connect Client
+
+ > ⚠️ **Warning:**
+ The package `dataproc-spark-connect` has been renamed to `google-spark-connect`. `dataproc-spark-connect` will no longer be updated.
+ For help using `google-spark-connect`, please see [guide](https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python/blob/main/README.md).
+
+
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
+ additional functionalities that allow applications to communicate with a remote Dataproc
+ Spark cluster using the Spark Connect protocol without requiring additional steps.
+
+ ## Install
+
+ .. code-block:: console
+
+ pip install dataproc_spark_connect
+
+ ## Uninstall
+
+ .. code-block:: console
+
+ pip uninstall dataproc_spark_connect
+
+
+ ## Setup
+ This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
+ If you are running the client outside of Google Cloud, you must set following environment variables:
+
+ * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
+ * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
+ * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
+ * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
+
+ ## Usage
+
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
+
+ .. code-block:: console
+
+ pip install google_cloud_dataproc --force-reinstall
+ pip install dataproc_spark_connect --force-reinstall
+
+ 2. Add the required import into your PySpark application or notebook:
+
+ .. code-block:: python
+
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
+
+ 3. There are two ways to create a spark session,
+
+ 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
+
+ .. code-block:: python
+
+ spark = DataprocSparkSession.builder.getOrCreate()
+
+ 2. Start a Spark session with the following code instead of using a config file:
+
+ .. code-block:: python
+
+ from google.cloud.dataproc_v1 import SparkConnectConfig
+ from google.cloud.dataproc_v1 import Session
+ dataproc_config = Session()
+ dataproc_config.spark_connect_session = SparkConnectConfig()
+ dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
+ dataproc_config.runtime_config.version = '3.0'
+ spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
+
+ ## Billing
+ As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
+ This will happen even if you are running the client from a non-GCE instance.
+
+ ## Contributing
+ ### Building and Deploying SDK
+
+ 1. Install the requirements in virtual environment.
+
+ .. code-block:: console
+
+ pip install -r requirements.txt
+
+ 2. Build the code.
+
+ .. code-block:: console
+
+ python setup.py sdist bdist_wheel
+
+
+ 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
+
+ .. code-block:: console
+
+ VERSION=<version> gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
+
+ 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
+
+ .. code-block:: console
+
+ %%bash
+ export VERSION=<version>
+ gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
+ yes | pip uninstall dataproc_spark_connect
+ pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
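The Setup and Usage sections above list the required environment variables and the two session-creation paths separately. A minimal sketch (not part of the package) of how a client running outside Google Cloud might wire them together, with placeholder project, region, and credential values:

```python
# Sketch only: placeholder project, region, and key-file values.
import os

os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"                    # placeholder
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"                    # placeholder
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"   # placeholder

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Uses DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG if that variable points
# at a session config; otherwise the client falls back to its defaults.
spark = DataprocSparkSession.builder.getOrCreate()
```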
@@ -0,0 +1,103 @@
+ # Dataproc Spark Connect Client
+
+ > ⚠️ **Warning:**
+ The package `dataproc-spark-connect` has been renamed to `google-spark-connect`. `dataproc-spark-connect` will no longer be updated.
+ For help using `google-spark-connect`, please see [guide](https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python/blob/main/README.md).
+
+
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
+ additional functionalities that allow applications to communicate with a remote Dataproc
+ Spark cluster using the Spark Connect protocol without requiring additional steps.
+
+ ## Install
+
+ .. code-block:: console
+
+ pip install dataproc_spark_connect
+
+ ## Uninstall
+
+ .. code-block:: console
+
+ pip uninstall dataproc_spark_connect
+
+
+ ## Setup
+ This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
+ If you are running the client outside of Google Cloud, you must set following environment variables:
+
+ * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
+ * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
+ * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
+ * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
+
+ ## Usage
+
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
+
+ .. code-block:: console
+
+ pip install google_cloud_dataproc --force-reinstall
+ pip install dataproc_spark_connect --force-reinstall
+
+ 2. Add the required import into your PySpark application or notebook:
+
+ .. code-block:: python
+
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
+
+ 3. There are two ways to create a spark session,
+
+ 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
+
+ .. code-block:: python
+
+ spark = DataprocSparkSession.builder.getOrCreate()
+
+ 2. Start a Spark session with the following code instead of using a config file:
+
+ .. code-block:: python
+
+ from google.cloud.dataproc_v1 import SparkConnectConfig
+ from google.cloud.dataproc_v1 import Session
+ dataproc_config = Session()
+ dataproc_config.spark_connect_session = SparkConnectConfig()
+ dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
+ dataproc_config.runtime_config.version = '3.0'
+ spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
+
+ ## Billing
+ As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
+ This will happen even if you are running the client from a non-GCE instance.
+
+ ## Contributing
+ ### Building and Deploying SDK
+
+ 1. Install the requirements in virtual environment.
+
+ .. code-block:: console
+
+ pip install -r requirements.txt
+
+ 2. Build the code.
+
+ .. code-block:: console
+
+ python setup.py sdist bdist_wheel
+
+
+ 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
+
+ .. code-block:: console
+
+ VERSION=<version> gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
+
+ 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
+
+ .. code-block:: console
+
+ %%bash
+ export VERSION=<version>
+ gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
+ yes | pip uninstall dataproc_spark_connect
+ pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
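The config-based path above stops at `getOrCreate()`. A hedged sketch that extends that snippet with a trivial query and shutdown; the subnet value is a placeholder, and `spark.stop()` is the standard PySpark call, assumed here to also release the Dataproc Spark Connect session:

```python
# Hedged extension of the README snippet above; the subnet is a placeholder.
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session, SparkConnectConfig

dataproc_config = Session()
dataproc_config.spark_connect_session = SparkConnectConfig()
dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
dataproc_config.runtime_config.version = "3.0"

spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
try:
    spark.range(5).show()   # trivial sanity check against the remote session
finally:
    spark.stop()            # standard PySpark call; assumed to release the session
```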
@@ -0,0 +1,119 @@
+ Metadata-Version: 2.1
+ Name: dataproc-spark-connect
+ Version: 0.2.1
+ Summary: Dataproc client library for Spark Connect
+ Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
+ Author: Google LLC
+ License: Apache 2.0
+ License-File: LICENSE
+ Requires-Dist: google-api-core>=2.19.1
+ Requires-Dist: google-cloud-dataproc>=5.15.1
+ Requires-Dist: wheel
+ Requires-Dist: websockets
+ Requires-Dist: pyspark>=3.5
+ Requires-Dist: pandas
+ Requires-Dist: pyarrow
+
+ # Dataproc Spark Connect Client
+
+ > ⚠️ **Warning:**
+ The package `dataproc-spark-connect` has been renamed to `google-spark-connect`. `dataproc-spark-connect` will no longer be updated.
+ For help using `google-spark-connect`, please see [guide](https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python/blob/main/README.md).
+
+
+ A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
+ additional functionalities that allow applications to communicate with a remote Dataproc
+ Spark cluster using the Spark Connect protocol without requiring additional steps.
+
+ ## Install
+
+ .. code-block:: console
+
+ pip install dataproc_spark_connect
+
+ ## Uninstall
+
+ .. code-block:: console
+
+ pip uninstall dataproc_spark_connect
+
+
+ ## Setup
+ This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
+ If you are running the client outside of Google Cloud, you must set following environment variables:
+
+ * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
+ * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
+ * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
+ * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
+
+ ## Usage
+
+ 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
+
+ .. code-block:: console
+
+ pip install google_cloud_dataproc --force-reinstall
+ pip install dataproc_spark_connect --force-reinstall
+
+ 2. Add the required import into your PySpark application or notebook:
+
+ .. code-block:: python
+
+ from google.cloud.dataproc_spark_connect import DataprocSparkSession
+
+ 3. There are two ways to create a spark session,
+
+ 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
+
+ .. code-block:: python
+
+ spark = DataprocSparkSession.builder.getOrCreate()
+
+ 2. Start a Spark session with the following code instead of using a config file:
+
+ .. code-block:: python
+
+ from google.cloud.dataproc_v1 import SparkConnectConfig
+ from google.cloud.dataproc_v1 import Session
+ dataproc_config = Session()
+ dataproc_config.spark_connect_session = SparkConnectConfig()
+ dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
+ dataproc_config.runtime_config.version = '3.0'
+ spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
+
+ ## Billing
+ As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
+ This will happen even if you are running the client from a non-GCE instance.
+
+ ## Contributing
+ ### Building and Deploying SDK
+
+ 1. Install the requirements in virtual environment.
+
+ .. code-block:: console
+
+ pip install -r requirements.txt
+
+ 2. Build the code.
+
+ .. code-block:: console
+
+ python setup.py sdist bdist_wheel
+
+
+ 3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
+
+ .. code-block:: console
+
+ VERSION=<version> gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
+
+ 4. Download the new SDK on Vertex, then uninstall the old version and install the new one.
+
+ .. code-block:: console
+
+ %%bash
+ export VERSION=<version>
+ gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
+ yes | pip uninstall dataproc_spark_connect
+ pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
@@ -0,0 +1,7 @@
+ google-api-core>=2.19.1
+ google-cloud-dataproc>=5.15.1
+ wheel
+ websockets
+ pyspark>=3.5
+ pandas
+ pyarrow
@@ -12,3 +12,12 @@
  # See the License for the specific language governing permissions and
  # limitations under the License.
  from .session import DataprocSparkSession
+ import warnings
+
+ warnings.warn(
+ "The package 'dataproc-spark-connect' has been renamed to 'google-spark-connect'. "
+ "'dataproc-spark-connect' will no longer be updated. "
+ "For help using 'google-spark-connect', "
+ "see https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python/blob/main/README.md. ",
+ DeprecationWarning,
+ )
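This change makes the package emit a `DeprecationWarning` at import time. Since Python filters `DeprecationWarning` in many contexts by default, here is a short sketch (not from the package) of how a caller could explicitly surface or silence the rename notice:

```python
# Sketch (not part of the package): control the import-time rename warning.
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("always", DeprecationWarning)   # surface the notice
    # or: warnings.simplefilter("ignore", DeprecationWarning) to silence it
    from google.cloud.dataproc_spark_connect import DataprocSparkSession
```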
@@ -43,13 +43,22 @@ class bridged_socket(object):
  self._conn = websocket_conn

  def recv(self, buff_size):
- msg = self._conn.recv()
+ # N.B. The websockets [recv method](https://websockets.readthedocs.io/en/stable/reference/sync/client.html#websockets.sync.client.ClientConnection.recv)
+ # does not support the buff_size parameter, but it does add a `timeout` keyword parameter not supported by normal
+ # socket objects.
+ #
+ # We set that timeout to 60 seconds to prevent any scenarios where we wind up stuck waiting for a message from a websocket connection
+ # that never comes.
+ msg = self._conn.recv(timeout=60)
  return bytes.fromhex(msg)

  def send(self, msg_bytes):
  msg = bytes.hex(msg_bytes)
  self._conn.send(msg)

+ def close(self):
+ return self._conn.close()
+

  def connect_tcp_bridge(hostname):
  """Create a socket-like connection to the given hostname using websocket.
@@ -93,12 +102,51 @@ def forward_bytes(name, from_sock, to_sock):
  bs = from_sock.recv(1024)
  if not bs:
  return
- to_sock.send(bs)
+ while bs:
+ try:
+ to_sock.send(bs)
+ bs = None
+ except TimeoutError:
+ # On timeouts during a send, we retry just the send
+ # to make sure we don't lose any bytes.
+ pass
+ except TimeoutError:
+ # On timeouts during a receive, we retry the entire flow.
+ pass
  except Exception as ex:
  logger.debug(f"[{name}] Exception forwarding bytes: {ex}")
+ to_sock.close()
  return


+ def connect_sockets(conn_number, from_sock, to_sock):
+ """Create a connection between the two given ports.
+
+ This method continuously streams bytes in both directions between the
+ given `from_sock` and `to_sock` socket-like objects.
+
+ The caller is responsible for creating and closing the supplied socekts.
+ """
+ forward_name = f"{conn_number}-forward"
+ t1 = threading.Thread(
+ name=forward_name,
+ target=forward_bytes,
+ args=[forward_name, from_sock, to_sock],
+ daemon=True,
+ )
+ t1.start()
+ backward_name = f"{conn_number}-backward"
+ t2 = threading.Thread(
+ name=backward_name,
+ target=forward_bytes,
+ args=[backward_name, to_sock, from_sock],
+ daemon=True,
+ )
+ t2.start()
+ t1.join()
+ t2.join()
+
+
  def forward_connection(conn_number, conn, addr, target_host):
  """Create a connection to the target and forward `conn` to it.
@@ -115,24 +163,7 @@ def forward_connection(conn_number, conn, addr, target_host):
  with conn:
  with connect_tcp_bridge(target_host) as websocket_conn:
  backend_socket = bridged_socket(websocket_conn)
- forward_name = f"{conn_number}-forward"
- t1 = threading.Thread(
- name=forward_name,
- target=forward_bytes,
- args=[forward_name, conn, backend_socket],
- daemon=True,
- )
- t1.start()
- backward_name = f"{conn_number}-backward"
- t2 = threading.Thread(
- name=backward_name,
- target=forward_bytes,
- args=[backward_name, backend_socket, conn],
- daemon=True,
- )
- t2.start()
- t1.join()
- t2.join()
+ connect_sockets(conn_number, conn, backend_socket)


  class DataprocSessionProxy(object):
@@ -179,6 +210,14 @@ class DataprocSessionProxy(object):
  s.release()
  while not self._killed:
  conn, addr = frontend_socket.accept()
+ # Set a timeout on how long we will allow send/recv calls to block
+ #
+ # The code that reads and writes to this connection will retry
+ # on timeouts, so this is a safe change.
+ #
+ # The chosen timeout is a very short one because it allows us
+ # to more quickly detect when a connection has been closed.
+ conn.settimeout(1)
  logger.debug(f"Accepted a connection from {addr}...")
  self._conn_number += 1
  threading.Thread(
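The 1-second `settimeout(1)` works because a timed-out socket operation raises `socket.timeout`, which is an alias of the builtin `TimeoutError` on Python 3.10+, and the forwarding loops above retry on exactly that exception. A small hedged sketch of the same retry-on-timeout idea:

```python
# Hedged sketch of the retry-on-timeout idea behind conn.settimeout(1).
# socket.timeout is an alias of TimeoutError on Python 3.10+, which is why the
# `except TimeoutError` branches in forward_bytes catch the timed-out calls.
import socket


def recv_with_retry(conn: socket.socket, max_attempts: int = 5) -> bytes:
    """Use short, bounded recv() calls instead of blocking indefinitely."""
    conn.settimeout(1)  # mirror the proxy: fail fast, then retry
    for _ in range(max_attempts):
        try:
            return conn.recv(1024)
        except TimeoutError:
            continue  # nothing arrived within 1 second; try again
    return b""
```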
@@ -196,13 +196,13 @@ class DataprocSparkSession(SparkSession):
  session_id = self.generate_dataproc_session_id()

  session_request.session_id = session_id
- dataproc_config.name = f"projects/{self._project_id}/regions/{self._region}/sessions/{session_id}"
+ dataproc_config.name = f"projects/{self._project_id}/locations/{self._region}/sessions/{session_id}"
  logger.debug(
  f"Configurations used to create serverless session:\n {dataproc_config}"
  )
  session_request.session = dataproc_config
  session_request.parent = (
- f"projects/{self._project_id}/regions/{self._region}"
+ f"projects/{self._project_id}/locations/{self._region}"
  )

  logger.debug("Creating serverless session")
@@ -12,13 +12,24 @@
  # See the License for the specific language governing permissions and
  # limitations under the License.
  from setuptools import find_namespace_packages, setup
+ from pathlib import Path
+
+ this_directory = Path(__file__).parent
+ long_description = (this_directory / "README.md").read_text()
+

  setup(
  name="dataproc-spark-connect",
- version="0.1.0",
+ version="0.2.1",
  description="Dataproc client library for Spark Connect",
+ long_description=long_description,
+ author="Google LLC",
+ url="https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python",
+ license="Apache 2.0",
  packages=find_namespace_packages(include=["google.*"]),
  install_requires=[
+ "google-api-core>=2.19.1",
+ "google-cloud-dataproc>=5.15.1",
  "wheel",
  "websockets",
  "pyspark>=3.5",
@@ -1,10 +0,0 @@
- Metadata-Version: 2.1
- Name: dataproc-spark-connect
- Version: 0.1.0
- Summary: Dataproc client library for Spark Connect
- License-File: LICENSE
- Requires-Dist: wheel
- Requires-Dist: websockets
- Requires-Dist: pyspark>=3.5
- Requires-Dist: pandas
- Requires-Dist: pyarrow
@@ -1,90 +0,0 @@
- # Dataproc Spark Connect Client
-
- A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with
- additional functionalities that allow applications to communicate with a remote Dataproc
- Spark cluster using the Spark Connect protocol without requiring additional steps.
-
- ## Install
-
- ```
- pip install dataproc_spark_connect
- ```
-
- ## Uninstall
-
- ```
- pip uninstall dataproc_spark_connect
- ```
-
- ## Setup
- This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
- If you are running the client outside of Google Cloud, you must set following environment variables:
-
- * GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads
- * GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
- * GOOGLE_APPLICATION_CREDENTIALS - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
- * DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The config location, such as `tests/integration/resources/session.textproto`
-
- ## Usage
-
- 1. Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
- ```
- pip install google_cloud_dataproc --force-reinstall
- pip install dataproc_spark_connect --force-reinstall
- ```
-
- 2. Add the required import into your PySpark application or notebook:
- ```python
- from google.cloud.dataproc_spark_connect import DataprocSparkSession
-
- ```
-
- 3. There are two ways to create a spark session,
- 1. Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
- ```python
- spark = DataprocSparkSession.builder.getOrCreate()
- ```
-
- 2. Start a Spark session with the following code instead of using a config file:
- ```python
- from google.cloud.dataproc_v1 import SparkConnectConfig
- from google.cloud.dataproc_v1 import Session
- dataproc_config = Session()
- dataproc_config.spark_connect_session = SparkConnectConfig()
- dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
- dataproc_config.runtime_config.version = '3.0'
- spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
- ```
-
- ## Billing
- As this client runs the spark workload on Dataproc, your project will be billed as per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
- This will happen even if you are running the client from a non-GCE instance.
-
- ## Contributing
- ### Building and Deploying SDK
- 1. Install the requirements in virtual environment.
-
- ```
- pip install -r requirements.txt
- ```
- 2. Build the code.
-
- ```
- python setup.py sdist bdist_wheel
- ```
-
- 2. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.
-
- ```
- VERSION=<version> gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
- ```
-
- 3. Download the new SDK on Vertex, then uninstall the old version and install the new one.
-
- ```
- %%bash
- export VERSION=<version>
- gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
- yes | pip uninstall dataproc_spark_connect
- pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
- ```
@@ -1,10 +0,0 @@
- Metadata-Version: 2.1
- Name: dataproc-spark-connect
- Version: 0.1.0
- Summary: Dataproc client library for Spark Connect
- License-File: LICENSE
- Requires-Dist: wheel
- Requires-Dist: websockets
- Requires-Dist: pyspark>=3.5
- Requires-Dist: pandas
- Requires-Dist: pyarrow
@@ -1,5 +0,0 @@
- wheel
- websockets
- pyspark>=3.5
- pandas
- pyarrow