dataproc-spark-connect 1.0.0rc5__tar.gz → 1.0.0rc7__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- dataproc_spark_connect-1.0.0rc7/PKG-INFO +200 -0
- dataproc_spark_connect-1.0.0rc7/README.md +177 -0
- dataproc_spark_connect-1.0.0rc7/dataproc_spark_connect.egg-info/PKG-INFO +200 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/dataproc_spark_connect.egg-info/requires.txt +1 -1
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/environment.py +4 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/exceptions.py +1 -1
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/session.py +152 -33
- dataproc_spark_connect-1.0.0rc7/pyproject.toml +9 -0
- dataproc_spark_connect-1.0.0rc7/setup.cfg +14 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/setup.py +2 -2
- dataproc_spark_connect-1.0.0rc5/PKG-INFO +0 -105
- dataproc_spark_connect-1.0.0rc5/README.md +0 -83
- dataproc_spark_connect-1.0.0rc5/dataproc_spark_connect.egg-info/PKG-INFO +0 -105
- dataproc_spark_connect-1.0.0rc5/pyproject.toml +0 -3
- dataproc_spark_connect-1.0.0rc5/setup.cfg +0 -7
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/LICENSE +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/dataproc_spark_connect.egg-info/SOURCES.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/dataproc_spark_connect.egg-info/dependency_links.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/dataproc_spark_connect.egg-info/top_level.txt +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/__init__.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/client/__init__.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/client/core.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/client/proxy.py +0 -0
- {dataproc_spark_connect-1.0.0rc5 → dataproc_spark_connect-1.0.0rc7}/google/cloud/dataproc_spark_connect/pypi_artifacts.py +0 -0
dataproc_spark_connect-1.0.0rc7/PKG-INFO (new file, 200 lines)

Metadata-Version: 2.4
Name: dataproc-spark-connect
Version: 1.0.0rc7
Summary: Dataproc client library for Spark Connect
Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
Author: Google LLC
License: Apache 2.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: google-api-core>=2.19
Requires-Dist: google-cloud-dataproc>=5.18
Requires-Dist: packaging>=20.0
Requires-Dist: pyspark-client~=4.0.0
Requires-Dist: tqdm>=4.67
Requires-Dist: websockets>=14.0
Dynamic: author
Dynamic: description
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

# Dataproc Spark Connect Client

A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
client with additional functionalities that allow applications to communicate
with a remote Dataproc Spark Session using the Spark Connect protocol without
requiring additional steps.

## Install

```sh
pip install dataproc_spark_connect
```

## Uninstall

```sh
pip uninstall dataproc_spark_connect
```

## Setup

This client requires permissions to
manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).

If you are running the client outside of Google Cloud, you need to provide
authentication credentials. Set the `GOOGLE_APPLICATION_CREDENTIALS` environment
variable to point to
your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
file.

You can specify the project and region either via environment variables or directly
in your code using the builder API:

* Environment variables: `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION`
* Builder API: `.projectId()` and `.location()` methods (recommended)

## Usage

1. Install the latest version of Dataproc Spark Connect:

```sh
pip install -U dataproc-spark-connect
```

2. Add the required imports into your PySpark application or notebook and start
a Spark session using the fluent API:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
spark = DataprocSparkSession.builder.getOrCreate()
```

3. You can configure Spark properties using the `.config()` method:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
spark = DataprocSparkSession.builder.config('spark.executor.memory', '4g').config('spark.executor.cores', '2').getOrCreate()
```

4. For advanced configuration, you can use the `Session` class to customize
settings like subnetwork or other environment configurations:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
session_config.runtime_config.version = '3.0'
spark = DataprocSparkSession.builder.projectId('my-project').location('us-central1').dataprocSessionConfig(session_config).getOrCreate()
```

### Reusing Named Sessions Across Notebooks

Named sessions allow you to share a single Spark session across multiple notebooks, improving efficiency by avoiding repeated session startup times and reducing costs.

To create or connect to a named session:

1. Create a session with a custom ID in your first notebook:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
session_id = 'my-ml-pipeline-session'
spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
df.show()
```

2. Reuse the same session in another notebook by specifying the same session ID:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
session_id = 'my-ml-pipeline-session'
spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
df = spark.createDataFrame([(2, 'more-data')], ['id', 'value'])
df.show()
```

3. Session IDs must be 4-63 characters long, start with a lowercase letter, contain only lowercase letters, numbers, and hyphens, and not end with a hyphen.

4. Named sessions persist until explicitly terminated or reach their configured TTL.

5. A session with a given ID that is in a TERMINATED state cannot be reused. It must be deleted before a new session with the same ID can be created.

### Using Spark SQL Magic Commands (Jupyter Notebooks)

The package supports the [sparksql-magic](https://github.com/cryeo/sparksql-magic) library for executing Spark SQL queries directly in Jupyter notebooks.

**Installation**: To use magic commands, install the required dependencies manually:
```bash
pip install dataproc-spark-connect
pip install IPython sparksql-magic
```

1. Load the magic extension:
```python
%load_ext sparksql_magic
```

2. Configure default settings (optional):
```python
%config SparkSql.limit=20
```

3. Execute SQL queries:
```python
%%sparksql
SELECT * FROM your_table
```

4. Advanced usage with options:
```python
# Cache results and create a view
%%sparksql --cache --view result_view df
SELECT * FROM your_table WHERE condition = true
```

Available options:
- `--cache` / `-c`: Cache the DataFrame
- `--eager` / `-e`: Cache with eager loading
- `--view VIEW` / `-v VIEW`: Create a temporary view
- `--limit N` / `-l N`: Override default row display limit
- `variable_name`: Store result in a variable

See [sparksql-magic](https://github.com/cryeo/sparksql-magic) for more examples.

**Note**: Magic commands are optional. If you only need basic DataprocSparkSession functionality without Jupyter magic support, install only the base package:
```bash
pip install dataproc-spark-connect
```

## Developing

For development instructions see [guide](DEVELOPING.md).

## Contributing

We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.

### Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com> to see
your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.

### Code reviews

All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
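Taken together, the Usage and Named Sessions sections of the new README above describe a single fluent builder. A minimal end-to-end sketch combining those options; the project ID, region, session ID, and Spark property values are placeholders:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = (
    DataprocSparkSession.builder
    .projectId('my-project')                      # placeholder project
    .location('us-central1')                      # placeholder region
    .dataprocSessionId('my-ml-pipeline-session')  # optional: named, reusable session
    .config('spark.executor.memory', '4g')        # regular Spark properties
    .getOrCreate()
)

df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
df.show()
spark.stop()
```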
dataproc_spark_connect-1.0.0rc7/README.md (new file, 177 lines): content identical to the Markdown description section of dataproc_spark_connect-1.0.0rc7/PKG-INFO above.
dataproc_spark_connect-1.0.0rc7/dataproc_spark_connect.egg-info/PKG-INFO (new file, 200 lines): content identical to dataproc_spark_connect-1.0.0rc7/PKG-INFO above.
google/cloud/dataproc_spark_connect/environment.py

@@ -67,6 +67,10 @@ def is_interactive_terminal():
     return is_interactive() and is_terminal()


+def is_dataproc_batch() -> bool:
+    return os.getenv("DATAPROC_WORKLOAD_TYPE") == "batch"
+
+
 def get_client_environment_label() -> str:
     """
     Map current environment to a standardized client label.
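The new helper only inspects the `DATAPROC_WORKLOAD_TYPE` environment variable, which the library expects to be set to `batch` inside Dataproc batch workloads. A hedged sketch of how a caller can branch on it; the variable is set here purely for illustration:

```python
import os

# Simulate running inside a Dataproc batch workload (illustration only).
os.environ["DATAPROC_WORKLOAD_TYPE"] = "batch"

from google.cloud.dataproc_spark_connect import environment

if environment.is_dataproc_batch():
    # Batch workloads already carry a local Spark session; session.py (below)
    # short-circuits getOrCreate() to it instead of creating a remote one.
    print("Running as a Dataproc batch workload")
else:
    print("Interactive / non-batch environment")
```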
google/cloud/dataproc_spark_connect/session.py

@@ -14,6 +14,7 @@

 import atexit
 import datetime
+import functools
 import json
 import logging
 import os
@@ -25,8 +26,6 @@ import time
 import uuid
 import tqdm
 from packaging import version
-from tqdm import tqdm as cli_tqdm
-from tqdm.notebook import tqdm as notebook_tqdm
 from types import MethodType
 from typing import Any, cast, ClassVar, Dict, Iterable, Optional, Union

@@ -67,6 +66,10 @@ SYSTEM_LABELS = {
     "goog-colab-notebook-id",
 }

+_DATAPROC_SESSIONS_BASE_URL = (
+    "https://console.cloud.google.com/dataproc/interactive"
+)
+

 def _is_valid_label_value(value: str) -> bool:
     """
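Every Cloud Console link in the hunks below is now built from this single constant. A small sketch of the URL shape those f-strings produce; the project, region, and session ID values are placeholders:

```python
_DATAPROC_SESSIONS_BASE_URL = "https://console.cloud.google.com/dataproc/interactive"

# Placeholder values for illustration.
region, session_id, project_id = "us-central1", "my-ml-pipeline-session", "my-project"

session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{region}/{session_id}?project={project_id}"
print(session_url)
# https://console.cloud.google.com/dataproc/interactive/us-central1/my-ml-pipeline-session?project=my-project
```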
@@ -472,16 +475,43 @@ class DataprocSparkSession(SparkSession):
             session_response, dataproc_config.name
         )

+    def _wait_for_session_available(
+        self, session_name: str, timeout: int = 300
+    ) -> Session:
+        start_time = time.time()
+        while time.time() - start_time < timeout:
+            try:
+                session = self.session_controller_client.get_session(
+                    name=session_name
+                )
+                if "Spark Connect Server" in session.runtime_info.endpoints:
+                    return session
+                time.sleep(5)
+            except Exception as e:
+                logger.warning(
+                    f"Error while polling for Spark Connect endpoint: {e}"
+                )
+                time.sleep(5)
+        raise RuntimeError(
+            f"Spark Connect endpoint not available for session {session_name} after {timeout} seconds."
+        )
+
     def _display_session_link_on_creation(self, session_id):
-        session_url = f"
+        session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{session_id}?project={self._project_id}"
         plain_message = f"Creating Dataproc Session: {session_url}"
-
+        if environment.is_colab_enterprise():
+            html_element = f"""
             <div>
             <p>Creating Dataproc Spark Session<p>
-            <p><a href="{session_url}">Dataproc Session</a></p>
             </div>
-
-
+            """
+        else:
+            html_element = f"""
+            <div>
+            <p>Creating Dataproc Spark Session<p>
+            <p><a href="{session_url}">Dataproc Session</a></p>
+            </div>
+            """
         self._output_element_or_message(plain_message, html_element)

     def _print_session_created_message(self):
@@ -533,10 +563,13 @@ class DataprocSparkSession(SparkSession):

         if session_response is not None:
             print(
-                f"Using existing Dataproc Session (configuration changes may not be applied):
+                f"Using existing Dataproc Session (configuration changes may not be applied): {_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{s8s_session_id}?project={self._project_id}"
             )
             self._display_view_session_details_button(s8s_session_id)
             if session is None:
+                session_response = self._wait_for_session_available(
+                    session_name
+                )
                 session = self.__create_spark_connect_session_from_s8s(
                     session_response, session_name
                 )
@@ -552,6 +585,13 @@ class DataprocSparkSession(SparkSession):

     def getOrCreate(self) -> "DataprocSparkSession":
         with DataprocSparkSession._lock:
+            if environment.is_dataproc_batch():
+                # For Dataproc batch workloads, connect to the already initialized local SparkSession
+                from pyspark.sql import SparkSession as PySparkSQLSession
+
+                session = PySparkSQLSession.builder.getOrCreate()
+                return session  # type: ignore
+
             # Handle custom session ID by setting it early and letting existing logic handle it
             if self._custom_session_id:
                 self._handle_custom_session_id()
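With this hunk, calling the Dataproc builder from inside a batch workload simply hands back the workload's own Spark session. A hedged sketch of the resulting behaviour; the environment check is simulated here, and outside a batch workload the builder still provisions a remote Dataproc session:

```python
import os
from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Inside a Dataproc batch workload, DATAPROC_WORKLOAD_TYPE is expected to be
# "batch" (simulated here), so getOrCreate() returns whatever
# pyspark's SparkSession.builder.getOrCreate() yields locally instead of
# creating a Dataproc Serverless interactive session.
os.environ["DATAPROC_WORKLOAD_TYPE"] = "batch"

spark = DataprocSparkSession.builder.getOrCreate()
print(type(spark))  # the local PySpark session, not a new remote Dataproc session
```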
@@ -559,6 +599,13 @@ class DataprocSparkSession(SparkSession):
             session = self._get_exiting_active_session()
             if session is None:
                 session = self.__create()
+
+            # Register this session as the instantiated SparkSession for compatibility
+            # with tools and libraries that expect SparkSession._instantiatedSession
+            from pyspark.sql import SparkSession as PySparkSQLSession
+
+            PySparkSQLSession._instantiatedSession = session
+
             return session

     def _handle_custom_session_id(self):
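A small sketch of what this registration buys: tooling that reaches the "current" session through plain PySpark now sees the Dataproc-backed session. `_instantiatedSession` is a private PySpark attribute, referenced here only to illustrate the hunk above:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from pyspark.sql import SparkSession

spark = DataprocSparkSession.builder.getOrCreate()

# After getOrCreate(), the generic PySpark entry point is wired to the same
# session object, so session-agnostic libraries pick it up.
assert SparkSession._instantiatedSession is spark
```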
@@ -673,8 +720,6 @@ class DataprocSparkSession(SparkSession):
            # Merge default configs with existing properties,
            # user configs take precedence
            for k, v in {
-               "spark.datasource.bigquery.viewsEnabled": "true",
-               "spark.datasource.bigquery.writeMethod": "direct",
                "spark.sql.catalog.spark_catalog": "com.google.cloud.spark.bigquery.BigQuerySparkSessionCatalog",
                "spark.sql.sources.default": "bigquery",
            }.items():
@@ -696,7 +741,7 @@ class DataprocSparkSession(SparkSession):

        # Runtime version to server Python version mapping
        RUNTIME_PYTHON_MAP = {
-           "3.0": (3,
+           "3.0": (3, 12),
        }

        client_python = sys.version_info[:2]  # (major, minor)
@@ -760,7 +805,7 @@ class DataprocSparkSession(SparkSession):
            return

        try:
-           session_url = f"
+           session_url = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{session_id}?project={self._project_id}"
            from IPython.core.interactiveshell import InteractiveShell

            if not InteractiveShell.initialized():
@@ -943,6 +988,28 @@ class DataprocSparkSession(SparkSession):
             clearProgressHandlers_wrapper_method, self
         )

+    @staticmethod
+    @functools.lru_cache(maxsize=1)
+    def get_tqdm_bar():
+        """
+        Return a tqdm implementation that works in the current environment.
+
+        - Uses CLI tqdm for interactive terminals.
+        - Uses the notebook tqdm if available, otherwise falls back to CLI tqdm.
+        """
+        from tqdm import tqdm as cli_tqdm
+
+        if environment.is_interactive_terminal():
+            return cli_tqdm
+
+        try:
+            import ipywidgets
+            from tqdm.notebook import tqdm as notebook_tqdm
+
+            return notebook_tqdm
+        except ImportError:
+            return cli_tqdm
+
     def _register_progress_execution_handler(self):
         from pyspark.sql.connect.shell.progress import StageInfo

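The factory above is a cached selector: use the plain terminal tqdm when the process looks interactive, otherwise try the ipywidgets-based notebook bar and fall back gracefully. A hedged usage sketch; the task counts are placeholders:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession

# The returned object is a tqdm class, so it is used like tqdm itself.
tqdm_pbar = DataprocSparkSession.get_tqdm_bar()

with tqdm_pbar(total=100, desc="Spark tasks") as pbar:
    for _ in range(100):
        pbar.update(1)
```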
@@ -967,9 +1034,12 @@ class DataprocSparkSession(SparkSession):
                 total_tasks += stage.num_tasks
                 completed_tasks += stage.num_completed_tasks

-
-            if
-
+            # Don't show progress bar till we receive some tasks
+            if total_tasks == 0:
+                return
+
+            # Get correct tqdm (notebook or CLI)
+            tqdm_pbar = self.get_tqdm_bar()

             # Use a lock to ensure only one thread can access and modify
             # the shared dictionaries at a time.
@@ -1006,13 +1076,11 @@ class DataprocSparkSession(SparkSession):
     @staticmethod
     def _sql_lazy_transformation(req):
         # Select SQL command
-
-
-
-
-
-
-        return False
+        try:
+            query = req.plan.command.sql_command.input.sql.query
+            return "select" in query.strip().lower().split()
+        except AttributeError:
+            return False

     def _repr_html_(self) -> str:
         if not self._active_s8s_session_id:
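The rewritten check asks whether the word `select` appears among the whitespace-separated tokens of the SQL text, and treats any request without a SQL command as non-lazy. A standalone sketch of that classification; the helper below is a hypothetical re-implementation for illustration, not the method itself:

```python
def looks_like_select(query: str) -> bool:
    # Same token test as the diff: split on whitespace, compare lowercased words.
    return "select" in query.strip().lower().split()

print(looks_like_select("SELECT * FROM your_table"))    # True
print(looks_like_select("  select id, value from t "))  # True
print(looks_like_select("CREATE TABLE t AS SELECT 1"))  # True (contains the token)
print(looks_like_select("INSERT INTO t VALUES (1)"))    # False
```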
@@ -1020,7 +1088,7 @@ class DataprocSparkSession(SparkSession):
            <div>No Active Dataproc Session</div>
            """

-       s8s_session = f"
+       s8s_session = f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/{self._active_s8s_session_id}"
        ui = f"{s8s_session}/sparkApplications/applications"
        return f"""
        <div>
@@ -1047,7 +1115,7 @@ class DataprocSparkSession(SparkSession):
            )

            url = (
-               f"
+               f"{_DATAPROC_SESSIONS_BASE_URL}/{self._region}/"
                f"{self._active_s8s_session_id}/sparkApplications/application;"
                f"associatedSqlOperationId={operation_id}?project={self._project_id}"
            )
@@ -1139,20 +1207,52 @@ class DataprocSparkSession(SparkSession):
     def _get_active_session_file_path():
         return os.getenv("DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH")

-    def stop(self) -> None:
+    def stop(self, terminate: Optional[bool] = None) -> None:
+        """
+        Stop the Spark session and optionally terminate the server-side session.
+
+        Parameters
+        ----------
+        terminate : bool, optional
+            Control server-side termination behavior.
+
+            - None (default): Auto-detect based on session type
+
+              - Managed sessions (auto-generated ID): terminate server
+              - Named sessions (custom ID): client-side cleanup only
+
+            - True: Always terminate the server-side session
+            - False: Never terminate the server-side session (client cleanup only)
+
+        Examples
+        --------
+        Auto-detect termination behavior (existing behavior):
+
+        >>> spark.stop()
+
+        Force terminate a named session:
+
+        >>> spark.stop(terminate=True)
+
+        Prevent termination of a managed session:
+
+        >>> spark.stop(terminate=False)
+        """
         with DataprocSparkSession._lock:
             if DataprocSparkSession._active_s8s_session_id is not None:
-                #
-                if
-                #
-
-
-                        f"Stopping unmanaged session {DataprocSparkSession._active_s8s_session_id} without termination"
+                # Determine if we should terminate the server-side session
+                if terminate is None:
+                    # Auto-detect: managed sessions terminate, named sessions don't
+                    should_terminate = (
+                        not DataprocSparkSession._active_session_uses_custom_id
                     )
                 else:
-
+                    should_terminate = terminate
+
+                if should_terminate:
+                    # Terminate the server-side session
                     logger.debug(
-                        f"Terminating
+                        f"Terminating session {DataprocSparkSession._active_s8s_session_id}"
                     )
                     terminate_s8s_session(
                         DataprocSparkSession._project_id,
@@ -1160,8 +1260,27 @@ class DataprocSparkSession(SparkSession):
                         DataprocSparkSession._active_s8s_session_id,
                         self._client_options,
                     )
+                else:
+                    # Client-side cleanup only
+                    logger.debug(
+                        f"Stopping session {DataprocSparkSession._active_s8s_session_id} without termination"
+                    )

                 self._remove_stopped_session_from_file()
+
+                # Clean up SparkSession._instantiatedSession if it points to this session
+                try:
+                    from pyspark.sql import SparkSession as PySparkSQLSession
+
+                    if PySparkSQLSession._instantiatedSession is self:
+                        PySparkSQLSession._instantiatedSession = None
+                        logger.debug(
+                            "Cleared SparkSession._instantiatedSession reference"
+                        )
+                except (ImportError, AttributeError):
+                    # PySpark not available or _instantiatedSession doesn't exist
+                    pass
+
                 DataprocSparkSession._active_s8s_session_uuid = None
                 DataprocSparkSession._active_s8s_session_id = None
                 DataprocSparkSession._active_session_uses_custom_id = False
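Putting the two `stop()` hunks together: a named (custom-ID) session is left running by default so another notebook can re-attach, while `terminate=True` tears it down on the server. A short sketch based on the docstring above; the session ID is a placeholder:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = DataprocSparkSession.builder.dataprocSessionId('my-ml-pipeline-session').getOrCreate()

# Default for a named session: client-side cleanup only; the server-side
# session keeps running until its TTL or an explicit termination.
spark.stop()

# Re-attach from this (or another) notebook, then explicitly terminate it.
spark = DataprocSparkSession.builder.dataprocSessionId('my-ml-pipeline-session').getOrCreate()
spark.stop(terminate=True)
```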
setup.py

@@ -20,7 +20,7 @@ long_description = (this_directory / "README.md").read_text()

 setup(
     name="dataproc-spark-connect",
-    version="1.0.0rc5",
+    version="1.0.0rc7",
     description="Dataproc client library for Spark Connect",
     long_description=long_description,
     author="Google LLC",
@@ -31,7 +31,7 @@ setup(
         "google-api-core>=2.19",
         "google-cloud-dataproc>=5.18",
         "packaging>=20.0",
-        "pyspark[connect]~=4.0.0",
+        "pyspark-client~=4.0.0",
         "tqdm>=4.67",
         "websockets>=14.0",
     ],
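The dependency swap from `pyspark[connect]` to `pyspark-client` pulls in the slimmer Spark Connect client distribution instead of full PySpark. A hedged way to check which distribution an environment actually has installed, using only the standard library:

```python
from importlib import metadata

for dist in ("pyspark-client", "pyspark", "dataproc-spark-connect"):
    try:
        print(f"{dist}=={metadata.version(dist)}")
    except metadata.PackageNotFoundError:
        print(f"{dist}: not installed")
```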
dataproc_spark_connect-1.0.0rc5/PKG-INFO (removed, 105 lines)

Metadata-Version: 2.4
Name: dataproc-spark-connect
Version: 1.0.0rc5
Summary: Dataproc client library for Spark Connect
Home-page: https://github.com/GoogleCloudDataproc/dataproc-spark-connect-python
Author: Google LLC
License: Apache 2.0
License-File: LICENSE
Requires-Dist: google-api-core>=2.19
Requires-Dist: google-cloud-dataproc>=5.18
Requires-Dist: packaging>=20.0
Requires-Dist: pyspark[connect]~=4.0.0
Requires-Dist: tqdm>=4.67
Requires-Dist: websockets>=14.0
Dynamic: author
Dynamic: description
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

# Dataproc Spark Connect Client

A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/)
client with additional functionalities that allow applications to communicate
with a remote Dataproc Spark Session using the Spark Connect protocol without
requiring additional steps.

## Install

```sh
pip install dataproc_spark_connect
```

## Uninstall

```sh
pip uninstall dataproc_spark_connect
```

## Setup

This client requires permissions to
manage [Dataproc Sessions and Session Templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
If you are running the client outside of Google Cloud, you must set following
environment variables:

* `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark
workloads
* `GOOGLE_CLOUD_REGION` - The Compute
Engine [region](https://cloud.google.com/compute/docs/regions-zones#available)
where you run the Spark workload.
* `GOOGLE_APPLICATION_CREDENTIALS` -
Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)

## Usage

1. Install the latest version of Dataproc Python client and Dataproc Spark
Connect modules:

```sh
pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
```

2. Add the required imports into your PySpark application or notebook and start
a Spark session with the following code instead of using
environment variables:

```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
session_config.runtime_config.version = '2.2'
spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
```

## Developing

For development instructions see [guide](DEVELOPING.md).

## Contributing

We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.

### Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com> to see
your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.

### Code reviews

All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
dataproc_spark_connect-1.0.0rc5/README.md (removed, 83 lines): content identical to the Markdown description section of the removed dataproc_spark_connect-1.0.0rc5/PKG-INFO above.
dataproc_spark_connect-1.0.0rc5/dataproc_spark_connect.egg-info/PKG-INFO (removed, 105 lines): content identical to the removed dataproc_spark_connect-1.0.0rc5/PKG-INFO above.
Files listed above with +0 -0 are unchanged between the two versions.