xpk 0.14.4__py3-none-any.whl → 0.16.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (91)
  1. integration/README.md +19 -0
  2. integration/gcluster_a3mega_test.py +11 -0
  3. integration/gcluster_a3ultra_test.py +11 -0
  4. integration/gcluster_a4_test.py +11 -0
  5. xpk/blueprints/a3mega/config-map.yaml.tftpl +15 -0
  6. xpk/blueprints/a3mega/storage_crd.yaml +52 -0
  7. xpk/blueprints/a3ultra/config-map.yaml.tftpl +15 -0
  8. xpk/blueprints/a3ultra/mlgru-disable.yaml +59 -0
  9. xpk/blueprints/a3ultra/nccl-installer.yaml +95 -0
  10. xpk/blueprints/a3ultra/storage_crd.yaml +52 -0
  11. xpk/blueprints/a4/config-map.yaml.tftpl +15 -0
  12. xpk/blueprints/a4/nccl-rdma-installer-a4.yaml +66 -0
  13. xpk/blueprints/a4/storage_crd.yaml +52 -0
  14. xpk/commands/cluster.py +89 -32
  15. xpk/commands/cluster_gcluster.py +25 -5
  16. xpk/commands/cluster_gcluster_test.py +16 -3
  17. xpk/commands/cluster_test.py +353 -7
  18. xpk/commands/config.py +3 -5
  19. xpk/commands/inspector.py +5 -3
  20. xpk/commands/kind.py +3 -1
  21. xpk/commands/managed_ml_diagnostics.py +249 -0
  22. xpk/commands/managed_ml_diagnostics_test.py +146 -0
  23. xpk/commands/storage.py +8 -10
  24. xpk/commands/workload.py +143 -142
  25. xpk/commands/workload_test.py +160 -118
  26. xpk/core/blueprint/blueprint_generator.py +73 -33
  27. xpk/core/blueprint/blueprint_test.py +9 -0
  28. xpk/core/blueprint/testing/data/a3_mega.yaml +129 -0
  29. xpk/core/blueprint/testing/data/a3_mega_spot.yaml +125 -0
  30. xpk/core/blueprint/testing/data/a3_ultra.yaml +173 -0
  31. xpk/core/blueprint/testing/data/a4.yaml +185 -0
  32. xpk/core/capacity.py +48 -8
  33. xpk/core/capacity_test.py +32 -1
  34. xpk/core/cluster.py +55 -104
  35. xpk/core/cluster_test.py +170 -0
  36. xpk/core/commands.py +4 -10
  37. xpk/core/config.py +88 -7
  38. xpk/core/config_test.py +67 -11
  39. xpk/core/docker_container.py +3 -1
  40. xpk/core/docker_image.py +10 -6
  41. xpk/core/docker_resources.py +1 -10
  42. xpk/core/gcloud_context.py +18 -12
  43. xpk/core/gcloud_context_test.py +111 -1
  44. xpk/core/kjob.py +17 -19
  45. xpk/core/kueue_manager.py +205 -51
  46. xpk/core/kueue_manager_test.py +158 -4
  47. xpk/core/nap.py +13 -14
  48. xpk/core/nodepool.py +37 -43
  49. xpk/core/nodepool_test.py +42 -19
  50. xpk/core/pathways.py +23 -0
  51. xpk/core/pathways_test.py +57 -0
  52. xpk/core/resources.py +84 -27
  53. xpk/core/scheduling.py +144 -133
  54. xpk/core/scheduling_test.py +298 -6
  55. xpk/core/system_characteristics.py +256 -19
  56. xpk/core/system_characteristics_test.py +128 -5
  57. xpk/core/telemetry.py +263 -0
  58. xpk/core/telemetry_test.py +211 -0
  59. xpk/core/vertex.py +4 -3
  60. xpk/core/workload_decorators/tcpx_decorator.py +5 -1
  61. xpk/main.py +33 -13
  62. xpk/parser/cluster.py +40 -67
  63. xpk/parser/cluster_test.py +83 -3
  64. xpk/parser/common.py +84 -0
  65. xpk/parser/storage.py +10 -0
  66. xpk/parser/storage_test.py +47 -0
  67. xpk/parser/workload.py +14 -29
  68. xpk/parser/workload_test.py +3 -49
  69. xpk/telemetry_uploader.py +29 -0
  70. xpk/templates/arm_gpu_workload_crate.yaml.j2 +46 -0
  71. xpk/templates/kueue_gke_default_topology.yaml.j2 +1 -1
  72. xpk/templates/kueue_sub_slicing_topology.yaml.j2 +3 -8
  73. xpk/utils/console.py +41 -10
  74. xpk/utils/console_test.py +106 -0
  75. xpk/utils/feature_flags.py +10 -1
  76. xpk/utils/file.py +4 -1
  77. xpk/utils/topology.py +4 -0
  78. xpk/utils/user_agent.py +35 -0
  79. xpk/utils/user_agent_test.py +44 -0
  80. xpk/utils/user_input.py +48 -0
  81. xpk/utils/user_input_test.py +92 -0
  82. xpk/utils/validation.py +2 -13
  83. xpk/utils/versions.py +31 -0
  84. xpk-0.16.0.dist-info/METADATA +127 -0
  85. xpk-0.16.0.dist-info/RECORD +168 -0
  86. xpk-0.14.4.dist-info/METADATA +0 -1645
  87. xpk-0.14.4.dist-info/RECORD +0 -139
  88. {xpk-0.14.4.dist-info → xpk-0.16.0.dist-info}/WHEEL +0 -0
  89. {xpk-0.14.4.dist-info → xpk-0.16.0.dist-info}/entry_points.txt +0 -0
  90. {xpk-0.14.4.dist-info → xpk-0.16.0.dist-info}/licenses/LICENSE +0 -0
  91. {xpk-0.14.4.dist-info → xpk-0.16.0.dist-info}/top_level.txt +0 -0
@@ -1,1645 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: xpk
3
- Version: 0.14.4
4
- Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
5
- Author-email: XPK team <xpk-code-reviewers@google.com>
6
- License: Apache-2.0
7
- Project-URL: Homepage, https://github.com/google/xpk
8
- Project-URL: Bug Tracker, https://github.com/google/xpk/issues
9
- Classifier: Programming Language :: Python :: 3.10
10
- Classifier: Programming Language :: Python :: 3.11
11
- Requires-Python: >=3.10
12
- Description-Content-Type: text/markdown
13
- License-File: LICENSE
14
- Requires-Dist: cloud-accelerator-diagnostics==0.1.1
15
- Requires-Dist: tabulate==0.9.0
16
- Requires-Dist: ruamel.yaml==0.18.10
17
- Requires-Dist: pyyaml==6.0.2
18
- Requires-Dist: docker==7.1.0
19
- Requires-Dist: kubernetes==31.0.0
20
- Requires-Dist: google-cloud==0.34.0
21
- Requires-Dist: google-api-core==2.24.1
22
- Requires-Dist: packaging==24.2
23
- Requires-Dist: google-cloud-filestore==1.12.0
24
- Requires-Dist: google-cloud-storage
25
- Requires-Dist: Jinja2==3.1.6
26
- Provides-Extra: dev
27
- Requires-Dist: pyink==24.3.0; extra == "dev"
28
- Requires-Dist: pylint>=2.6.0; extra == "dev"
29
- Requires-Dist: pre-commit; extra == "dev"
30
- Requires-Dist: pytest; extra == "dev"
31
- Requires-Dist: pytest-mock==3.15.1; extra == "dev"
32
- Requires-Dist: docker==7.1.0; extra == "dev"
33
- Requires-Dist: mypy~=1.17; extra == "dev"
34
- Requires-Dist: types-PyYAML==6.0.2; extra == "dev"
35
- Requires-Dist: types-docker~=7.1.0.0; extra == "dev"
36
- Requires-Dist: pylint-per-file-ignores==1.4.0; extra == "dev"
37
- Dynamic: license-file
38
-
39
- <!--
40
- Copyright 2023 Google LLC
41
-
42
- Licensed under the Apache License, Version 2.0 (the "License");
43
- you may not use this file except in compliance with the License.
44
- You may obtain a copy of the License at
45
-
46
- https://www.apache.org/licenses/LICENSE-2.0
47
-
48
- Unless required by applicable law or agreed to in writing, software
49
- distributed under the License is distributed on an "AS IS" BASIS,
50
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
51
- See the License for the specific language governing permissions and
52
- limitations under the License.
53
- -->
54
-
55
- [![Build Tests](https://github.com/google/xpk/actions/workflows/build_tests.yaml/badge.svg?query=branch%3Amain)](https://github.com/google/xpk/actions/workflows/build_tests.yaml?query=branch%3Amain)
56
- [![Nightly Tests](https://github.com/google/xpk/actions/workflows/nightly_tests.yaml/badge.svg?query=branch%3Amain)](https://github.com/google/xpk/actions/workflows/nightly_tests.yaml?query=branch%3Amain)
57
-
58
- # Overview
59
-
60
- XPK (Accelerated Processing Kit, pronounced x-p-k) is a command line interface that simplifies cluster creation and workload execution on Google Kubernetes Engine (GKE). XPK generates preconfigured, training-optimized clusters and allows easy workload scheduling without any Kubernetes expertise.
61
-
62
- XPK is recommended for quick creation of GKE clusters for proofs of concept and testing.
63
-
64
- XPK decouples provisioning capacity from running jobs. There are two structures: clusters (provisioned VMs) and workloads (training jobs). Clusters represent the physical resources you have available. Workloads represent training jobs -- at any time some of these will be completed, others will be running and some will be queued, waiting for cluster resources to become available.
65
-
66
- The ideal workflow starts by provisioning the clusters for all of the ML
67
- hardware you have reserved. Then, without re-provisioning, submit jobs as
68
- needed. By eliminating the need for re-provisioning between jobs, using Docker
69
- containers with pre-installed dependencies and cross-ahead of time compilation,
70
- these queued jobs run with minimal start times. Further, because workloads
71
- return the hardware back to the shared pool when they complete, developers can
72
- achieve better use of finite hardware resources. And automated tests can run
73
- overnight while resources tend to be underutilized.
74
-
75
- XPK supports the following TPU types:
76
- * v4
77
- * v5e
78
- * v5p
79
- * Trillium (v6e)
80
- * Ironwood (tpu7x)
81
-
82
- and the following GPU types:
83
- * A100
84
- * A3-Highgpu (h100)
85
- * A3-Mega (h100-mega) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines)
86
- * A3-Ultra (h200) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines)
87
- * A4 (b200) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines)
88
- * A4X (gb200)
89
-
90
- and the following CPU types:
91
- * n2-standard-32
92
-
93
- XPK also supports [Google Cloud Storage solutions](#storage):
94
- * [Cloud Storage FUSE](#fuse)
95
- * [Filestore](#filestore)
96
- * [Parallelstore](#parallelstore)
97
- * [Block storage (Persistent Disk, Hyperdisk)](#block-storage-persistent-disk-hyperdisk)
98
-
99
- # Permissions needed on Cloud Console:
100
-
101
- * Artifact Registry Writer
102
- * Compute Admin
103
- * Kubernetes Engine Admin
104
- * Logging Admin
105
- * Monitoring Admin
106
- * Service Account User
107
- * Storage Admin
108
- * Vertex AI Administrator
109
- * Filestore Editor (This role is necessary if you want to run the `storage create` command with `--type=gcpfilestore`)
110
-
111
- # Installation
112
-
113
- There are 2 ways to install XPK:
114
-
115
- - via Python package installer (`pip`),
116
- - clone from git and build from source.
117
-
118
- ## Prerequisites
119
-
120
- The following tools must be installed:
121
-
122
- - python >= 3.10: download from [here](https://www.python.org/downloads/)
123
- - pip: [installation instructions](https://pip.pypa.io/en/stable/installation/)
124
- - python venv: [installation instructions](https://virtualenv.pypa.io/en/latest/installation.html)
125
- (all three of the above can be installed at once from [here](https://packaging.python.org/en/latest/guides/installing-using-linux-tools/#installing-pip-setuptools-wheel-with-linux-package-managers))
126
- - gcloud: install from [here](https://cloud.google.com/sdk/gcloud#download_and_install_the) and then:
127
- - Run `gcloud init`
128
- - [Authenticate](https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login) to Google Cloud
129
- - kubectl: install from [here](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_kubectl) and then:
130
- - Install `gke-gcloud-auth-plugin` from [here](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin)
131
- - docker: [installation instructions](https://docs.docker.com/engine/install/) and then:
132
- - Configure sudoless docker: [guide](https://docs.docker.com/engine/install/linux-postinstall/)
133
- - Run `gcloud auth configure-docker` to ensure images can be uploaded to registry
134
-
135
- ### Additional prerequisites when installing from pip
136
-
137
- - kueuectl: install from [here](https://kueue.sigs.k8s.io/docs/reference/kubectl-kueue/installation/)
138
- - kjob: installation instructions [here](https://github.com/kubernetes-sigs/kjob/blob/main/docs/installation.md)
139
-
140
- ### Additional prerequisites when installing from source
141
-
142
- - git: [installation instructions](https://git-scm.com/downloads/linux)
143
- - make: install by running `apt-get -y install make` (`sudo` might be required)
144
-
145
- ## Installation via pip
146
-
147
- To install XPK using pip, first install required tools mentioned in [prerequisites](#prerequisites) and [additional prerequisites](#additional-prerequisites-when-installing-from-pip). Then you can install XPK simply by running:
148
-
149
- ```shell
150
- pip install xpk
151
- ```
152
-
153
- If you see an error saying: `This environment is externally managed`, please use a virtual environment. For example:
154
-
155
- ```shell
156
- # One time step of creating the virtual environment
157
- VENV_DIR=~/venvp3
158
- python3 -m venv $VENV_DIR
159
-
160
- # Activate your virtual environment
161
- source $VENV_DIR/bin/activate
162
-
163
- # Install XPK in virtual environment using pip
164
- pip install xpk
165
- ```
166
-
167
- ## Installation from source
168
-
169
- To install XPK from source, first install the required tools mentioned in [prerequisites](#prerequisites) and [additional prerequisites](#additional-prerequisites-when-installing-from-source). Afterwards, you can install XPK from source using `make`:
170
-
171
- ```shell
172
- # Clone the XPK repository
173
- git clone https://github.com/google/xpk.git
174
- cd xpk
175
-
176
- # Install required dependencies and build XPK with make
177
- make install && export PATH=$PATH:$PWD/bin
178
- ```
179
-
180
- If you want the dependencies to be available in your PATH, run `echo $PWD/bin` and add its value to `PATH` in your .bashrc or .zshrc file (see the sketch below).
181
-
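- A minimal sketch, assuming you run it from the root of the cloned xpk repository and use bash:
-
- ```shell
- # Print the directory that should be added to PATH
- echo $PWD/bin
- # Append it to your shell profile (use ~/.zshrc instead if you use zsh)
- echo "export PATH=\$PATH:$PWD/bin" >> ~/.bashrc
- source ~/.bashrc
- ```
-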
182
- If you see an error saying: `This environment is externally managed`, please use a virtual environment. For example:
183
-
184
- ```shell
185
- # One time step of creating the virtual environment
186
- VENV_DIR=~/venvp3
187
- python3 -m venv $VENV_DIR
188
-
189
- # Activate your virtual environment
190
- source $VENV_DIR/bin/activate
191
-
192
- # Clone the XPK repository
193
- git clone https://github.com/google/xpk.git
194
- cd xpk
195
-
196
- # Install required dependencies and build XPK with make
197
- make install && export PATH=$PATH:$PWD/bin
198
- ```
199
-
200
- # XPK for Large Scale (>1k VMs)
201
-
202
- Follow user instructions in [xpk-large-scale-guide.sh](xpk-large-scale-guide.sh)
203
- to use xpk for a GKE cluster greater than 1000 VMs. Run these steps to set up a
204
- GKE cluster with large scale training and high throughput support with XPK, and
205
- run jobs with XPK. We recommend you manually copy commands per step and verify
206
- the outputs of each step.
207
-
208
- # Example usages:
209
-
210
- To get started, be sure to set your GCP Project and Zone as usual via `gcloud
211
- config set`.
212
-
213
- Below are reference commands. A typical journey starts with a `Cluster Create`
214
- followed by many `Workload Create`s. To understand the state of the system you
215
- might want to use `Cluster List` or `Workload List` commands. Finally, you can
216
- cleanup with a `Cluster Delete`.
217
-
218
- If you have failures with workloads not running, use `xpk inspector` to investigate
219
- more.
220
-
221
- If you need your Workloads to have persistent storage, use `xpk storage` to find out more.
222
-
223
- ## Cluster Create
224
-
225
- First set the project and zone through gcloud config or xpk arguments.
226
-
227
- ```shell
228
- PROJECT_ID=my-project-id
229
- ZONE=us-east5-b
230
- # gcloud config:
231
- gcloud config set project $PROJECT_ID
232
- gcloud config set compute/zone $ZONE
233
- # xpk arguments
234
- xpk .. --zone $ZONE --project $PROJECT_ID
235
- ```
236
-
237
- The cluster created is a regional cluster to enable the GKE control plane across
238
- all zones.
239
-
240
- * Cluster Create (provision reserved capacity):
241
-
242
- ```shell
243
- # Find your reservations
244
- gcloud compute reservations list --project=$PROJECT_ID
245
- # Run cluster create with reservation.
246
- python3 xpk.py cluster create \
247
- --cluster xpk-test --tpu-type=v5litepod-256 \
248
- --num-slices=2 \
249
- --reservation=$RESERVATION_ID
250
- ```
251
-
252
- * Cluster Create (provision on-demand capacity):
253
-
254
- ```shell
255
- python3 xpk.py cluster create \
256
- --cluster xpk-test --tpu-type=v5litepod-16 \
257
- --num-slices=4 --on-demand
258
- ```
259
-
260
- * Cluster Create (provision spot / preemptible capacity):
261
-
262
- ```shell
263
- python3 xpk.py cluster create \
264
- --cluster xpk-test --tpu-type=v5litepod-16 \
265
- --num-slices=4 --spot
266
- ```
267
-
268
- * Cluster Create (DWS flex queued capacity):
269
- ```shell
270
- python3 xpk.py cluster create \
271
- --cluster xpk-test --tpu-type=v5litepod-16 \
272
- --num-slices=4 --flex
273
- ```
274
-
275
- * Cluster Create for Pathways:
276
- Pathways compatible cluster can be created using `cluster create-pathways`.
277
- ```shell
278
- python3 xpk.py cluster create-pathways \
279
- --cluster xpk-pw-test \
280
- --num-slices=4 --on-demand \
281
- --tpu-type=v5litepod-16
282
- ```
283
- Note that Pathways clusters need a CPU nodepool of n2-standard-64 or higher.
284
-
285
- * Cluster Create for Ray:
286
- A cluster with KubeRay enabled and a RayCluster can be created using `cluster create-ray`.
287
- ```shell
288
- python3 xpk.py cluster create-ray \
289
- --cluster xpk-rc-test \
290
- --ray-version=2.39.0 \
291
- --num-slices=4 --on-demand \
292
- --tpu-type=v5litepod-8
293
- ```
294
-
295
- * Cluster Create can be called again with the same `--cluster name` to modify
296
- the number of slices or retry failed steps.
297
-
298
- For example, if a user creates a cluster with 4 slices:
299
-
300
- ```shell
301
- python3 xpk.py cluster create \
302
- --cluster xpk-test --tpu-type=v5litepod-16 \
303
- --num-slices=4 --reservation=$RESERVATION_ID
304
- ```
305
-
306
- and recreates the cluster with 8 slices. The command will rerun to create 4
307
- new slices:
308
-
309
- ```shell
310
- python3 xpk.py cluster create \
311
- --cluster xpk-test --tpu-type=v5litepod-16 \
312
- --num-slices=8 --reservation=$RESERVATION_ID
313
- ```
314
-
315
- and recreates the cluster with 6 slices. The command will rerun to delete 2
316
- slices. The command will warn the user when deleting slices.
317
- Use `--force` to skip prompts.
318
-
319
- ```shell
320
- python3 xpk.py cluster create \
321
- --cluster xpk-test --tpu-type=v5litepod-16 \
322
- --num-slices=6 --reservation=$RESERVATION_ID
323
-
324
- # Skip delete prompts using --force.
325
-
326
- python3 xpk.py cluster create --force \
327
- --cluster xpk-test --tpu-type=v5litepod-16 \
328
- --num-slices=6 --reservation=$RESERVATION_ID
329
- ```
330
-
331
- and recreates the cluster with 4 slices of v4-8. The command will rerun to delete
332
- 6 slices of v5litepod-16 and create 4 slices of v4-8. The command will warn the
333
- user when deleting slices. Use `--force` to skip prompts.
334
-
335
- ```shell
336
- python3 xpk.py cluster create \
337
- --cluster xpk-test --tpu-type=v4-8 \
338
- --num-slices=4 --reservation=$RESERVATION_ID
339
-
340
- # Skip delete prompts using --force.
341
-
342
- python3 xpk.py cluster create --force \
343
- --cluster xpk-test --tpu-type=v4-8 \
344
- --num-slices=4 --reservation=$RESERVATION_ID
345
- ```
346
-
347
- ### Create Private Cluster
348
-
349
- XPK allows you to create a private GKE cluster for enhanced security. In a private cluster, nodes and pods are isolated from the public internet, providing an additional layer of protection for your workloads.
350
-
351
- To create a private cluster, use the following arguments:
352
-
353
- **`--private`**
354
-
355
- This flag enables the creation of a private GKE cluster. When this flag is set:
356
-
357
- * Nodes and pods are isolated from direct internet access.
358
- * `master_authorized_networks` is automatically enabled.
359
- * Access to the cluster's control plane is restricted to your current machine's IP address by default.
360
-
361
- **`--authorized-networks`**
362
-
363
- This argument allows you to specify additional IP ranges (in CIDR notation) that are authorized to access the private cluster's control plane and perform `kubectl` commands.
364
-
365
- * Even if this argument is not set when you have `--private`, your current machine's IP address will always be given access to the control plane.
366
- * If this argument is used with an existing private cluster, it will replace the existing authorized networks.
367
-
368
- **Example Usage:**
369
-
370
- * To create a private cluster and allow access to Control Plane only to your current machine:
371
-
372
- ```shell
373
- python3 xpk.py cluster create \
374
- --cluster=xpk-private-cluster \
375
- --tpu-type=v4-8 --num-slices=2 \
376
- --private
377
- ```
378
-
379
- * To create a private cluster and allow access to Control Plane only to your current machine and the IP ranges `1.2.3.0/24` and `1.2.4.5/32`:
380
-
381
- ```shell
382
- python3 xpk.py cluster create \
383
- --cluster=xpk-private-cluster \
384
- --tpu-type=v4-8 --num-slices=2 \
385
- --authorized-networks 1.2.3.0/24 1.2.4.5/32
386
-
387
- # --private is optional when you set --authorized-networks
388
- ```
389
-
390
- > **Important Notes:**
391
- > * The argument `--private` is only applicable when creating new clusters. You cannot convert an existing public cluster to a private cluster using these flags.
392
- > * The argument `--authorized-networks` is applicable when creating new clusters or using an existing _*private*_ cluster. You cannot convert an existing public cluster to a private cluster using these flags.
393
- > * You need to [set up a Cluster NAT for your VPC network](https://cloud.google.com/nat/docs/set-up-manage-network-address-translation#creating_nat) so that the Nodes and Pods have outbound access to the internet. This is required because XPK installs and configures components such as kueue that need access to external sources like `registry.k8s.io`.
394
-
395
-
396
- ### Create Vertex AI Tensorboard
397
- *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have
398
- [Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
399
- assigned to your user account.*
400
-
401
- Vertex AI Tensorboard is a fully managed version of open-source Tensorboard. To learn more about Vertex AI Tensorboard, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction). Note that Vertex AI Tensorboard is only available in [these](https://cloud.google.com/vertex-ai/docs/general/locations#available-regions) regions.
402
-
403
- You can create a Vertex AI Tensorboard for your cluster with `Cluster Create` command. XPK will create a single Vertex AI Tensorboard instance per cluster.
404
-
405
- * Create Vertex AI Tensorboard in default region with default Tensorboard name:
406
-
407
- ```shell
408
- python3 xpk.py cluster create \
409
- --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
410
- --create-vertex-tensorboard
411
- ```
412
-
413
- will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<args.cluster>-tb-instance*) in `us-central1` (*default region*).
414
-
415
- * Create Vertex AI Tensorboard in user-specified region with default Tensorboard name:
416
-
417
- ```shell
418
- python3 xpk.py cluster create \
419
- --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
420
- --create-vertex-tensorboard --tensorboard-region=us-west1
421
- ```
422
-
423
- will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<args.cluster>-tb-instance*) in `us-west1`.
424
-
425
- * Create Vertex AI Tensorboard in default region with user-specified Tensorboard name:
426
-
427
- ```shell
428
- python3 xpk.py cluster create \
429
- --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
430
- --create-vertex-tensorboard --tensorboard-name=tb-testing
431
- ```
432
-
433
- will create a Vertex AI Tensorboard with the name `tb-testing` in `us-central1`.
434
-
435
- * Create Vertex AI Tensorboard in user-specified region with user-specified Tensorboard name:
436
-
437
- ```shell
438
- python3 xpk.py cluster create \
439
- --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
440
- --create-vertex-tensorboard --tensorboard-region=us-west1 --tensorboard-name=tb-testing
441
- ```
442
-
443
- will create a Vertex AI Tensorboard instance with the name `tb-testing` in `us-west1`.
444
-
445
- * Create Vertex AI Tensorboard in an unsupported region:
446
-
447
- ```shell
448
- python3 xpk.py cluster create \
449
- --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
450
- --create-vertex-tensorboard --tensorboard-region=us-central2
451
- ```
452
-
453
- will fail the cluster creation process because Vertex AI Tensorboard is not supported in `us-central2`.
454
-
455
- ## Cluster Delete
456
- * Cluster Delete (deprovision capacity):
457
-
458
- ```shell
459
- python3 xpk.py cluster delete \
460
- --cluster xpk-test
461
- ```
462
- ## Cluster List
463
- * Cluster List (see provisioned capacity):
464
-
465
- ```shell
466
- python3 xpk.py cluster list
467
- ```
468
- ## Cluster Describe
469
- * Cluster Describe (see capacity):
470
-
471
- ```shell
472
- python3 xpk.py cluster describe \
473
- --cluster xpk-test
474
- ```
475
-
476
- ## Cluster Cacheimage
477
- * Cluster Cacheimage (enables faster start times):
478
-
479
- ```shell
480
- python3 xpk.py cluster cacheimage \
481
- --cluster xpk-test --docker-image gcr.io/your_docker_image \
482
- --tpu-type=v5litepod-16
483
- ```
484
-
485
- ## Provisioning A3 Ultra, A3 Mega and A4 clusters (GPU machines)
486
- To create a cluster with A3 or A4 machines, run the command below with the selected device type. To create workloads on these clusters, see [here](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines).
487
-
488
- **Note:** Creating A3 Ultra, A3 Mega and A4 clusters is currently supported **only** on linux/amd64 architecture.
489
-
490
- Machine | Device type
491
- :- | :-
492
- A3 Mega | `h100-mega-80gb-8`
493
- A3 Ultra | `h200-141gb-8`
494
- A4 | `b200-8`
495
-
496
-
497
- ```shell
498
- python3 xpk.py cluster create \
499
- --cluster CLUSTER_NAME --device-type DEVICE_TYPE \
500
- --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
501
- --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID
502
- ```
503
-
504
- Currently, the flags/arguments below are supported for A3 Mega, A3 Ultra and A4 machines (an example combining them follows the list):
505
- * `--num-nodes`
506
- * `--default-pool-cpu-machine-type`
507
- * `--default-pool-cpu-num-nodes`
508
- * `--reservation`
509
- * `--spot`
510
- * `--on-demand` (A3 Mega only)
511
- * `--flex`
512
-
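- For example, a sketch combining the generic create command with one of the capacity flags listed above (here `--flex` in place of `--reservation`; the device type is illustrative):
-
- ```shell
- python3 xpk.py cluster create \
- --cluster CLUSTER_NAME --device-type h200-141gb-8 \
- --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
- --num-nodes=$NUM_NODES --flex
- ```
-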
513
- ## Running XPK on existing clusters
514
-
515
- In order to run XPK commands on a cluster it needs to be set up correctly. This is done automatically when creating a cluster using `xpk cluster create`. For clusters created differently (e.g.: with 'gcloud' or a Cluster Toolkit blueprint) there is a dedicated command: `xpk cluster adapt`. This command installs required config maps, kueue, jobset, CSI drivers etc.
516
-
517
- Currently `xpk cluster adapt` supports only the following device types:
518
-
519
- - `h200-141gb-8` (A3 Ultra)
520
-
521
- Example usage:
522
- ```shell
523
- python3 xpk.py cluster adapt \
524
- --cluster=$CLUSTER_NAME --device-type=$DEVICE_TYPE \
525
- --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
526
- --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID
527
- ```
528
-
529
- ## Storage
530
- Currently XPK supports the following storage types:
531
- - [Cloud Storage FUSE](#fuse)
532
- - [Google Cloud Filestore](#filestore)
533
- - [Google Cloud Parallelstore](#parallelstore)
534
- - [Google Cloud Block storages (Persistent Disk, Hyperdisk)](#block-storage-persistent-disk-hyperdisk)
535
- - [Google Cloud Managed Lustre](#managed-lustre)
536
-
537
- ### FUSE
538
- A FUSE adapter lets you mount and access Cloud Storage buckets as local file systems, so workloads can read and write objects in your bucket using standard file system semantics.
539
-
540
- To use the GCS FUSE with XPK you need to create a [Storage Bucket](https://console.cloud.google.com/storage/).
541
-
542
- Once it's ready you can use the `xpk storage attach` command with `--type=gcsfuse` to attach a FUSE storage instance to your cluster:
543
-
544
- ```shell
545
- python3 xpk.py storage attach test-fuse-storage --type=gcsfuse \
546
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
547
- --mount-point='/test-mount-point' --readonly=false \
548
- --bucket=test-bucket --size=1 --auto-mount=false
549
- ```
550
-
551
- Parameters:
552
-
553
- - `--type` - type of the storage, currently xpk supports `gcsfuse` and `gcpfilestore` only.
554
- - `--auto-mount` - if set to true all workloads will have this storage mounted by default.
555
- - `--mount-point` - the path on which this storage should be mounted for a workload.
556
- - `--readonly` - if set to true, workload can only read from storage.
557
- `--size` - size of the storage in GB.
558
- - `--bucket` - name of the storage bucket. If not set then the name of the storage is used as a bucket name.
559
- - `--mount-options` - comma-separated list of additional mount options for PersistentVolume ([reference](https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf#mount-options)).
560
- - `--prefetch-metadata` - enables metadata pre-population when mounting the volume by setting parameter `gcsfuseMetadataPrefetchOnMount` to `true` ([reference](https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf#metadata-prefetch)).
561
- `--manifest` - path to the manifest file containing PersistentVolume and PersistentVolumeClaim definitions. If set, then values from the manifest override the following parameters: `--size` and `--bucket`.
562
-
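- As a sketch, the performance-related flags can be combined with the attach command above; the bucket name and mount options below are illustrative, and `--prefetch-metadata` is assumed to act as a simple on/off flag:
-
- ```shell
- python3 xpk.py storage attach test-fuse-storage --type=gcsfuse \
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
- --mount-point='/test-mount-point' --readonly=false \
- --bucket=test-bucket --size=1 --auto-mount=false \
- --mount-options='implicit-dirs,metadata-cache:ttl-secs:-1' \
- --prefetch-metadata
- ```
-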
563
- ### Filestore
564
-
565
- A Filestore adapter lets you mount and access [Filestore instances](https://cloud.google.com/filestore/) as local file systems, so workloads can read and write files in your volumes using standard file system semantics.
566
-
567
- To create and attach a GCP Filestore instance to your cluster use the `xpk storage create` command with `--type=gcpfilestore`:
568
-
569
- ```shell
570
- python3 xpk.py storage create test-fs-storage --type=gcpfilestore \
571
- --auto-mount=false --mount-point=/data-fs --readonly=false \
572
- --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
573
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
574
- ```
575
-
576
- You can also attach an existing Filestore instance to your cluster using the `xpk storage attach` command:
577
-
578
- ```shell
579
- python3 xpk.py storage attach test-fs-storage --type=gcpfilestore \
580
- --auto-mount=false --mount-point=/data-fs --readonly=false \
581
- --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
582
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
583
- ```
584
-
585
- The command above is also useful when attaching multiple volumes from the same Filestore instance.
586
-
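- For example, a sketch of attaching a second file share from an existing instance (the instance name `test-fs-instance` and share name `logs` are illustrative):
-
- ```shell
- python3 xpk.py storage attach test-fs-logs --type=gcpfilestore \
- --instance=test-fs-instance --vol=logs \
- --auto-mount=false --mount-point=/data-logs --readonly=false \
- --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany \
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
- ```
-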
587
- Commands `xpk storage create` and `xpk storage attach` with `--type=gcpfilestore` accept the following arguments:
588
- - `--type` - type of the storage.
589
- - `--auto-mount` - if set to true all workloads will have this storage mounted by default.
590
- - `--mount-point` - the path on which this storage should be mounted for a workload.
591
- - `--readonly` - if set to true, workload can only read from storage.
592
- `--size` - size of the Filestore instance that will be created in GB.
593
- - `--tier` - tier of the Filestore instance that will be created. Possible options are: `[BASIC_HDD, BASIC_SSD, ZONAL, REGIONAL, ENTERPRISE]`
594
- - `--access-mode` - access mode of the Filestore instance that will be created. Possible values are: `[ReadWriteOnce, ReadOnlyMany, ReadWriteMany]`
595
- - `--vol` - file share name of the Filestore instance that will be created.
596
- - `--instance` - the name of the Filestore instance. If not set then the name parameter is used as an instance name. Useful when connecting multiple volumes from the same Filestore instance.
597
- `--manifest` - path to the manifest file containing PersistentVolume, PersistentVolumeClaim and StorageClass definitions. If set, then values from the manifest override the following parameters: `--access-mode`, `--size` and `--volume`.
598
-
599
- ### Parallelstore
600
-
601
- A Parallelstore adapter lets you mount and access [Parallelstore instances](https://cloud.google.com/parallelstore/) as local file systems, so workloads can read and write files in your volumes using standard file system semantics.
602
-
603
- To use Google Cloud Parallelstore with XPK you need to create a [Parallelstore Instance](https://console.cloud.google.com/parallelstore/).
604
-
605
- Once it's ready you can use the `xpk storage attach` command with `--type=parallelstore` to attach a Parallelstore instance to your cluster. Currently, attaching a Parallelstore is supported only by providing a manifest file.
606
-
607
- ```shell
608
- python3 xpk.py storage attach test-parallelstore-storage --type=parallelstore \
609
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
610
- --mount-point='/test-mount-point' --readonly=false \
611
- --auto-mount=true \
612
- --manifest='./examples/storage/parallelstore-manifest-attach.yaml'
613
- ```
614
-
615
- Parameters:
616
-
617
- - `--type` - type of the storage `parallelstore`
618
- - `--auto-mount` - if set to true all workloads will have this storage mounted by default.
619
- - `--mount-point` - the path on which this storage should be mounted for a workload.
620
- - `--readonly` - if set to true, workload can only read from storage.
621
- `--manifest` - path to the manifest file containing PersistentVolume and PersistentVolumeClaim definitions.
622
-
623
- ### Block storage (Persistent Disk, Hyperdisk)
624
-
625
- A PersistentDisk adapter lets you mount and access Google Cloud Block storage solutions ([Persistent Disk](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview#pd), [Hyperdisk](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview#hyperdisk)) as local file systems, so workloads can read and write files in your volumes using standard file system semantics.
626
-
627
- To use the GCE PersistentDisk with XPK you need to create a [disk in GCE](https://cloud.google.com/compute/docs/disks). Please make sure the disk type you are creating is [compatible with the VMs](https://cloud.google.com/compute/docs/machine-resource#machine_type_comparison) in the default and accelerator node pools.
628
-
629
- Once it's ready you can use the `xpk storage attach` command with `--type=pd` to attach a PersistentDisk instance to your cluster. Currently, attaching a PersistentDisk is supported only by providing a manifest file.
630
-
631
- ```shell
632
- python3 xpk.py storage attach test-pd-storage --type=pd \
633
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
634
- --mount-point='/test-mount-point' --readonly=false \
635
- --auto-mount=true \
636
- --manifest='./examples/storage/pd-manifest-attach.yaml'
637
- ```
638
-
639
- Parameters:
640
-
641
- - `--type` - type of the storage `pd`
642
- - `--auto-mount` - if set to true all workloads will have this storage mounted by default.
643
- - `--mount-point` - the path on which this storage should be mounted for a workload.
644
- - `--readonly` - if set to true, workload can only read from storage.
645
- `--manifest` - path to the manifest file containing PersistentVolume and PersistentVolumeClaim definitions.
646
-
647
- ### Managed Lustre
648
-
649
- A Managed Lustre adapter lets you mount and access [Google Cloud Managed Lustre instances](https://cloud.google.com/kubernetes-engine/docs/concepts/managed-lustre) as local file systems, so workloads can read and write files in your volumes using standard file system semantics.
650
-
651
- To use GCP Managed Lustre with XPK you need to create [an instance](https://cloud.google.com/managed-lustre/docs/create-instance). Please make sure you enable GKE support when creating the instance (e.g., with gcloud: `--gke-support-enabled`).
652
-
653
- Once it's ready you can use the `xpk storage attach` command with `--type=lustre` to attach a Managed Lustre instance to your cluster. Currently, attaching a Managed Lustre instance is supported only by providing a manifest file.
654
-
655
- ```shell
656
- python3 xpk.py storage attach test-lustre-storage --type=lustre \
657
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
658
- --mount-point='/test-mount-point' --readonly=false \
659
- --auto-mount=true \
660
- --manifest='./examples/storage/lustre-manifest-attach.yaml'
661
- ```
662
-
663
- Parameters:
664
-
665
- - `--type` - type of the storage `lustre`
666
- - `--auto-mount` - if set to true all workloads will have this storage mounted by default.
667
- - `--mount-point` - the path on which this storage should be mounted for a workload.
668
- - `--readonly` - if set to true, workload can only read from storage.
669
- `--manifest` - path to the manifest file containing PersistentVolume and PersistentVolumeClaim definitions.
670
-
671
- ### List attached storage
672
-
673
- ```shell
674
- python3 xpk.py storage list \
675
- --project=$PROJECT --cluster $CLUSTER --zone=$ZONE
676
- ```
677
-
678
- ### Running workloads with storage
679
-
680
- If you specified `--auto-mount=true` when creating or attaching a storage, then all workloads deployed on the cluster will have the volume attached by default. Otherwise, in order to have the storage attached, you have to add the `--storage` parameter to the `workload create` command:
681
-
682
- ```shell
683
- python3 xpk.py workload create \
684
- --workload xpk-test-workload --command "echo goodbye" \
685
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
686
- --tpu-type=v5litepod-16 --storage=test-storage
687
- ```
688
-
689
- ### Detaching storage
690
-
691
- ```shell
692
- python3 xpk.py storage detach $STORAGE_NAME \
693
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
694
- ```
695
-
696
- ### Deleting storage
697
-
698
- XPK allows you to remove Filestore instances easily with the `xpk storage delete` command. **Warning:** this deletes all data contained in the Filestore!
699
-
700
- ```shell
701
- python3 xpk.py storage delete test-fs-instance \
702
- --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
703
- ```
704
-
705
- ## Workload Create
706
- * Workload Create (submit training job):
707
-
708
- ```shell
709
- python3 xpk.py workload create \
710
- --workload xpk-test-workload --command "echo goodbye" \
711
- --cluster xpk-test \
712
- --tpu-type=v5litepod-16 --project=$PROJECT
713
- ```
714
- * Workload Create (DWS flex with queued provisioning):
715
- ```shell
716
- python3 xpk.py workload create \
717
- --workload xpk-test-workload --command "echo goodbye" \
718
- --cluster xpk-test --flex \
719
- --tpu-type=v5litepod-16 --project=$PROJECT
720
- ```
-
721
- * Workload Create for Pathways:
722
- Pathways workload can be submitted using `workload create-pathways` on a Pathways enabled cluster (created with `cluster create-pathways`)
723
-
724
- Pathways workload example:
725
- ```shell
726
- python3 xpk.py workload create-pathways \
727
- --workload xpk-pw-test \
728
- --num-slices=1 \
729
- --tpu-type=v5litepod-16 \
730
- --cluster xpk-pw-test \
731
- --docker-name='user-workload' \
732
- --docker-image=<maxtext docker image> \
733
- --command='python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1 enable_single_controller=True'
734
- ```
735
-
736
- Regular workload can also be submitted on a Pathways enabled cluster (created with `cluster create-pathways`)
737
-
738
- Regular workload example:
739
- ```shell
740
- python3 xpk.py workload create-pathways \
741
- --workload xpk-regular-test \
742
- --num-slices=1 \
743
- --tpu-type=v5litepod-16 \
744
- --cluster xpk-pw-test \
745
- --docker-name='user-workload' \
746
- --docker-image=<maxtext docker image> \
747
- --command='python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
748
- ```
749
-
750
- Pathways in headless mode - Pathways now offers the capability to run JAX workloads in Vertex AI notebooks or in GCE VMs!
751
- Specify `--headless` with `workload create-pathways` when the user workload is not provided in a docker container.
752
- ```shell
753
- python3 xpk.py workload create-pathways --headless \
754
- --workload xpk-pw-headless \
755
- --num-slices=1 \
756
- --tpu-type=v5litepod-16 \
757
- --cluster xpk-pw-test
758
- ```
759
- Executing the command above would provide the address of the proxy that the user job should connect to.
760
- ```shell
761
- kubectl get pods
762
- kubectl port-forward pod/<proxy-pod-name> 29000:29000
763
- ```
764
- ```shell
765
- JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 python -c 'import pathwaysutils; import jax; print(jax.devices())'
766
- ```
767
- Specify `JAX_PLATFORMS=proxy` and `JAX_BACKEND_TARGET=<proxy address from above>` and `import pathwaysutils` to establish this connection between the user's JAX code and the Pathways proxy. Execute Pathways workloads interactively on Vertex AI notebooks!
768
-
769
- ### Set `max-restarts` for production jobs
770
-
771
- * `--max-restarts <value>`: By default, this is 0. This will restart the job `<value>`
772
- times when the job terminates. For production jobs, it is recommended to
773
- increase this to a large number, say 50. Real jobs can be interrupted due to
774
- hardware failures and software updates. We assume your job has implemented
775
- checkpointing so the job restarts near where it was interrupted, as in the example below.
776
-
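- A sketch of a production-style submission with restarts enabled (the workload name and command are illustrative):
-
- ```shell
- python3 xpk.py workload create \
- --workload xpk-prod-workload --command "bash train.sh" \
- --cluster xpk-test --tpu-type=v5litepod-16 \
- --max-restarts=50
- ```
-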
777
- ### Workloads for A3 Ultra, A3 Mega and A4 clusters (GPU machines)
778
- To submit jobs on a cluster with A3 or A4 machines, run the command with the selected device type. To create a cluster with A3 or A4 machines, see [here](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines).
779
-
780
-
781
- Machine | Device type
782
- :- | :-
783
- A3 Mega | `h100-mega-80gb-8`
784
- A3 Ultra | `h200-141gb-8`
785
- A4 | `b200-8`
786
-
787
- ```shell
788
- python3 xpk.py workload create \
789
- --workload=$WORKLOAD_NAME --command="echo goodbye" \
790
- --cluster=$CLUSTER_NAME --device-type DEVICE_TYPE \
791
- --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
792
- --num-nodes=$WORKLOAD_NUM_NODES
793
- ```
794
-
795
- > The docker image flags/arguments introduced in [workloads section](#workload-create) can be used with A3 or A4 machines as well.
796
-
797
- To run an NCCL test on A3 machines, check out [this guide](/examples/nccl/nccl.md).
798
-
799
- ### Workload Priority and Preemption
800
- * Set the priority level of your workload with `--priority=LEVEL`
801
-
802
- We have five priorities defined: [`very-low`, `low`, `medium`, `high`, `very-high`].
803
- The default priority is `medium`.
804
-
805
- Priority determines:
806
-
807
- 1. Order of queued jobs.
808
-
809
- Queued jobs are ordered by
810
- `very-low` < `low` < `medium` < `high` < `very-high`
811
-
812
- 2. Preemption of lower priority workloads.
813
-
814
- A higher priority job will `evict` lower priority jobs.
815
- Evicted jobs are brought back to the queue and will re-hydrate appropriately.
816
-
817
- #### General Example:
818
- ```shell
819
- python3 xpk.py workload create \
820
- --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
821
- xpk-test --tpu-type=v5litepod-16 --priority=medium
822
- ```
823
-
824
- ### Create Vertex AI Experiment to upload data to Vertex AI Tensorboard
825
- *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have
826
- [Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
827
- assigned to your user account and to the [Compute Engine Service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account) attached to the node pools in the cluster.*
828
-
829
- Vertex AI Experiment is a tool that helps to track and analyze an experiment run on Vertex AI Tensorboard. To learn more about Vertex AI Experiments, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments).
830
-
831
- XPK will create a Vertex AI Experiment in the `workload create` command and attach the Vertex AI Tensorboard created for the cluster during `cluster create`. If the cluster was created before this feature was released, it will have no Vertex AI Tensorboard and `workload create` will fail. Re-run `cluster create` to create a Vertex AI Tensorboard and then run `workload create` again to schedule your workload.
832
-
833
- * Create Vertex AI Experiment with default Experiment name:
834
-
835
- ```shell
836
- python3 xpk.py workload create \
837
- --cluster xpk-test --workload xpk-workload \
838
- --use-vertex-tensorboard
839
- ```
840
-
841
- will create a Vertex AI Experiment with the name `xpk-test-xpk-workload` (*<args.cluster>-<args.workload>*).
842
-
843
- * Create Vertex AI Experiment with user-specified Experiment name:
844
-
845
- ```shell
846
- python3 xpk.py workload create \
847
- --cluster xpk-test --workload xpk-workload \
848
- --use-vertex-tensorboard --experiment-name=test-experiment
849
- ```
850
-
851
- will create a Vertex AI Experiment with the name `test-experiment`.
852
-
853
- Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how to update your workload to automatically upload logs collected in your Tensorboard directory to the Vertex AI Experiment created by `workload create`.
854
-
855
- ## Workload Delete
856
- * Workload Delete (delete training job):
857
-
858
- ```shell
859
- python3 xpk.py workload delete \
860
- --workload xpk-test-workload --cluster xpk-test
861
- ```
862
-
863
- This will delete only the `xpk-test-workload` workload in the `xpk-test` cluster.
864
-
865
- * Workload Delete (delete all training jobs in the cluster):
866
-
867
- ```shell
868
- python3 xpk.py workload delete \
869
- --cluster xpk-test
870
- ```
871
-
872
- This will delete all the workloads in `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt. Multiple workload deletions are processed in batches for optimized processing.
873
-
874
- * Workload Delete supports filtering. Delete a portion of jobs that match user criteria. Multiple workload deletions are processed in batches for optimized processing.
875
- * Filter by Job: `filter-by-job`
876
-
877
- ```shell
878
- python3 xpk.py workload delete \
879
- --cluster xpk-test --filter-by-job=$USER
880
- ```
881
-
882
- This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.
883
-
884
- * Filter by Status: `filter-by-status`
885
-
886
- ```shell
887
- python3 xpk.py workload delete \
888
- --cluster xpk-test --filter-by-status=QUEUED
889
- ```
890
-
891
- This will delete all the workloads in the `xpk-test` cluster whose status is Admitted or Evicted and that have 0 running VMs. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
892
-
893
- ## Workload List
894
- * Workload List (see training jobs):
895
-
896
- ```shell
897
- python3 xpk.py workload list \
898
- --cluster xpk-test
899
- ```
900
-
901
- * Example Workload List Output:
902
-
903
- The below example shows five jobs of different statuses:
904
-
905
- * `user-first-job-failed`: **filter-status** is `FINISHED` and `FAILED`.
906
- * `user-second-job-success`: **filter-status** is `FINISHED` and `SUCCESSFUL`.
907
- * `user-third-job-running`: **filter-status** is `RUNNING`.
908
- * `user-fourth-job-in-queue`: **filter-status** is `QUEUED`.
909
- * `user-fifth-job-preempted`: **filter-status** is `QUEUED`.
910
-
911
- ```
912
- Jobset Name Created Time Priority TPU VMs Needed TPU VMs Running/Ran TPU VMs Done Status Status Message Status Time
913
- user-first-job-failed 2023-1-1T1:00:00Z medium 4 4 <none> Finished JobSet failed 2023-1-1T1:05:00Z
914
- user-second-job-success 2023-1-1T1:10:00Z medium 4 4 4 Finished JobSet finished successfully 2023-1-1T1:14:00Z
915
- user-third-job-running 2023-1-1T1:15:00Z medium 4 4 <none> Admitted Admitted by ClusterQueue cluster-queue 2023-1-1T1:16:00Z
916
- user-fourth-job-in-queue 2023-1-1T1:16:05Z medium 4 <none> <none> Admitted couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need 2023-1-1T1:16:10Z
917
- user-fifth-job-preempted 2023-1-1T1:10:05Z low 4 <none> <none> Evicted Preempted to accommodate a higher priority Workload 2023-1-1T1:10:00Z
918
- ```
919
-
920
- * Workload List supports filtering. Observe a portion of jobs that match user criteria.
921
-
922
- * Filter by Status: `filter-by-status`
923
-
924
- Filter the workload list by the status of respective jobs.
925
- Status can be: `EVERYTHING`,`FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`
926
-
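- For example, to list only running workloads:
-
- ```shell
- python3 xpk.py workload list \
- --cluster xpk-test --filter-by-status=RUNNING
- ```
-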
927
- * Filter by Job: `filter-by-job`
928
-
929
- Filter the workload list by the name of a job.
930
-
931
- ```shell
932
- python3 xpk.py workload list \
933
- --cluster xpk-test --filter-by-job=$USER
934
- ```
935
-
936
- * Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once.
937
-
938
- Wait for a job to complete.
939
-
940
- ```shell
941
- python3 xpk.py workload list \
942
- --cluster xpk-test --wait-for-job-completion=xpk-test-workload
943
- ```
944
-
945
- Wait for a job to complete with a timeout of 300 seconds.
946
-
947
- ```shell
948
- python3 xpk.py workload list \
949
- --cluster xpk-test --wait-for-job-completion=xpk-test-workload \
950
- --timeout=300
951
- ```
952
-
953
- Return codes:
- * `0`: Workload finished and completed successfully.
- * `124`: Timeout was reached before workload finished.
- * `125`: Workload finished but did not complete successfully.
- * `1`: Other failure.
958
-
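- When scripting around `--wait-for-job-completion`, the return code can be checked directly; a minimal bash sketch:
-
- ```shell
- python3 xpk.py workload list \
- --cluster xpk-test --wait-for-job-completion=xpk-test-workload \
- --timeout=300
- rc=$?
- if [ "$rc" -eq 124 ]; then
-   echo "Timed out before the workload finished"
- elif [ "$rc" -ne 0 ]; then
-   echo "Workload did not complete successfully (return code $rc)"
- fi
- ```
-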
959
- ## Job List
960
-
961
- * Job List (see jobs submitted via batch command):
962
-
963
- ```shell
964
- python3 xpk.py job ls --cluster xpk-test
965
- ```
966
-
967
- * Example Job List Output:
968
-
969
- ```
970
- NAME PROFILE LOCAL QUEUE COMPLETIONS DURATION AGE
971
- xpk-def-app-profile-slurm-74kbv xpk-def-app-profile 1/1 15s 17h
972
- xpk-def-app-profile-slurm-brcsg xpk-def-app-profile 1/1 9s 3h56m
973
- xpk-def-app-profile-slurm-kw99l xpk-def-app-profile 1/1 5s 3h54m
974
- xpk-def-app-profile-slurm-x99nx xpk-def-app-profile 3/3 29s 17h
975
- ```
976
-
977
- ## Job Cancel
978
-
979
- * Job Cancel (delete job submitted via batch command):
980
-
981
- ```shell
982
- python3 xpk.py job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test
983
- ```
984
-
985
- ## Inspector
986
- * Inspector provides debug info to understand cluster health, and why workloads are not running.
987
- Inspector output is saved to a file.
988
-
989
- ```shell
990
- python3 xpk.py inspector \
991
- --cluster $CLUSTER_NAME \
992
- --project $PROJECT_ID \
993
- --zone $ZONE
994
- ```
995
-
996
- * Optional Arguments
997
- * `--print-to-terminal`:
998
- Print command output to terminal as well as a file.
999
- * `--workload $WORKLOAD_NAME`
1000
- Inspector will write debug info related to the workload:`$WORKLOAD_NAME`
1001
-
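- For example, combining both optional arguments (a sketch; the cluster, project, zone and workload values are placeholders):
-
- ```shell
- python3 xpk.py inspector \
- --cluster $CLUSTER_NAME --project $PROJECT_ID --zone $ZONE \
- --workload $WORKLOAD_NAME --print-to-terminal
- ```
-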
1002
- * Example Output:
1003
-
1004
- The output of xpk inspector is in `/tmp/tmp0pd6_k1o` in this example.
1005
- ```shell
1006
- [XPK] Starting xpk
1007
- [XPK] Task: `Set Cluster` succeeded.
1008
- [XPK] Task: `Local Setup: gcloud version` is implemented by `gcloud version`, hiding output unless there is an error.
1009
- [XPK] Task: `Local Setup: Project / Zone / Region` is implemented by `gcloud config get project; gcloud config get compute/zone; gcloud config get compute/region`, hiding output unless there is an error.
1010
- [XPK] Task: `GKE: Cluster Details` is implemented by `gcloud beta container clusters list --project $PROJECT --region $REGION | grep -e NAME -e $CLUSTER_NAME`, hiding output unless there is an error.
1011
- [XPK] Task: `GKE: Node pool Details` is implemented by `gcloud beta container node-pools list --cluster $CLUSTER_NAME --project=$PROJECT --region=$REGION`, hiding output unless there is an error.
1012
- [XPK] Task: `Kubectl: All Nodes` is implemented by `kubectl get node -o custom-columns='NODE_NAME:metadata.name, READY_STATUS:.status.conditions[?(@.type=="Ready")].status, NODEPOOL:metadata.labels.cloud\.google\.com/gke-nodepool'`, hiding output unless there is an error.
1013
- [XPK] Task: `Kubectl: Number of Nodes per Node Pool` is implemented by `kubectl get node -o custom-columns=':metadata.labels.cloud\.google\.com/gke-nodepool' | sort | uniq -c`, hiding output unless there is an error.
1014
- [XPK] Task: `Kubectl: Healthy Node Count Per Node Pool` is implemented by `kubectl get node -o custom-columns='NODE_NAME:metadata.name, READY_STATUS:.status.conditions[?(@.type=="Ready")].status, NODEPOOL:metadata.labels.cloud\.google\.com/gke-nodepool' | grep -w True | awk {'print $3'} | sort | uniq -c`, hiding output unless there is an error.
1015
- [XPK] Task: `Kueue: ClusterQueue Details` is implemented by `kubectl describe ClusterQueue cluster-queue`, hiding output unless there is an error.
1016
- [XPK] Task: `Kueue: LocalQueue Details` is implemented by `kubectl describe LocalQueue multislice-queue`, hiding output unless there is an error.
1017
- [XPK] Task: `Kueue: Kueue Deployment Details` is implemented by `kubectl describe Deployment kueue-controller-manager -n kueue-system`, hiding output unless there is an error.
1018
- [XPK] Task: `Jobset: Deployment Details` is implemented by `kubectl describe Deployment jobset-controller-manager -n jobset-system`, hiding output unless there is an error.
1019
- [XPK] Task: `Kueue Manager Logs` is implemented by `kubectl logs deployment/kueue-controller-manager -n kueue-system --tail=100 --prefix=True`, hiding output unless there is an error.
1020
- [XPK] Task: `Jobset Manager Logs` is implemented by `kubectl logs deployment/jobset-controller-manager -n jobset-system --tail=100 --prefix=True`, hiding output unless there is an error.
1021
- [XPK] Task: `List Jobs with filter-by-status=EVERYTHING with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" `, hiding output unless there is an error.
1022
- [XPK] Task: `List Jobs with filter-by-status=QUEUED with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" | awk -e 'NR == 1 || ($7 ~ "Admitted|Evicted|QuotaReserved" && ($5 ~ "<none>" || $5 == 0)) {print $0}' `, hiding output unless there is an error.
1023
- [XPK] Task: `List Jobs with filter-by-status=RUNNING with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" | awk -e 'NR == 1 || ($7 ~ "Admitted|Evicted" && $5 ~ /^[0-9]+$/ && $5 > 0) {print $0}' `, hiding output unless there is an error.
1024
- [XPK] Find xpk inspector output file: /tmp/tmp0pd6_k1o
1025
- [XPK] Exiting XPK cleanly
1026
- ```
1027
-
1028
- ## Run
1029
- * `xpk run` lets you execute scripts on a cluster with ease. It automates task execution, handles interruptions, and streams job output to your console.
1030
-
1031
- ```shell
1032
- python xpk.py run --kind-cluster -n 2 -t 0-2 examples/job.sh
1033
- ```
1034
-
1035
- * Example Output:
1036
-
1037
- ```shell
1038
- [XPK] Starting xpk
1039
- [XPK] Task: `get current-context` is implemented by `kubectl config current-context`, hiding output unless there is an error.
1040
- [XPK] No local cluster name specified. Using current-context `kind-kind`
1041
- [XPK] Task: `run task` is implemented by `kubectl kjob create slurm --profile xpk-def-app-profile --localqueue multislice-queue --wait --rm -- examples/job.sh --partition multislice-queue --ntasks 2 --time 0-2`. Streaming output and input live.
1042
- job.batch/xpk-def-app-profile-slurm-g4vr6 created
1043
- configmap/xpk-def-app-profile-slurm-g4vr6 created
1044
- service/xpk-def-app-profile-slurm-g4vr6 created
1045
- Starting log streaming for pod xpk-def-app-profile-slurm-g4vr6-1-4rmgk...
1046
- Now processing task ID: 3
1047
- Starting log streaming for pod xpk-def-app-profile-slurm-g4vr6-0-bg6dm...
1048
- Now processing task ID: 1
1049
- exit
1050
- exit
1051
- Now processing task ID: 2
1052
- exit
1053
- Job logs streaming finished.[XPK] Task: `run task` terminated with code `0`
1054
- [XPK] XPK Done.
1055
- ```
1056
-
1057
- ## GPU usage
1058
-
1059
- To use XPK for GPUs, specify the GPU with the `--device-type` flag.
1060
-
1061
- * Cluster Create (provision reserved capacity):
1062
-
1063
- ```shell
1064
- # Find your reservations
1065
- gcloud compute reservations list --project=$PROJECT_ID
1066
-
1067
- # Run cluster create with reservation.
1068
- python3 xpk.py cluster create \
1069
- --cluster xpk-test --device-type=h100-80gb-8 \
1070
- --num-nodes=2 \
1071
- --reservation=$RESERVATION_ID
1072
- ```
1073
-
1074
- * Cluster Delete (deprovision capacity):
1075
-
1076
- ```shell
1077
- python3 xpk.py cluster delete \
1078
- --cluster xpk-test
1079
- ```
1080
-
1081
- * Cluster List (see provisioned capacity):
1082
-
1083
- ```shell
1084
- python3 xpk.py cluster list
1085
- ```
1086
-
1087
- * Cluster Describe (see capacity):
1088
-
1089
- ```shell
1090
- python3 xpk.py cluster describe \
1091
- --cluster xpk-test
1092
- ```
1093
-
1094
-
1095
- * Cluster Cacheimage (enables faster start times):
1096
-
1097
- ```shell
1098
- python3 xpk.py cluster cacheimage \
1099
- --cluster xpk-test --docker-image gcr.io/your_docker_image \
1100
- --device-type=h100-80gb-8
1101
- ```
1102
-
1103
-
1104
- * [Install NVIDIA GPU device drivers](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install)
1105
- ```shell
1106
- # List available driver versions
1107
- gcloud compute ssh $NODE_NAME --command "sudo cos-extensions list"
1108
-
1109
- # Install the default driver
1110
- gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu"
1111
- # OR install a specific version of the driver
1112
- gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu -- -version=DRIVER_VERSION"
1113
- ```
1114
-
1115
- * Run a workload:
1116
-
1117
- ```shell
1118
- # Submit a workload
1119
- python3 xpk.py workload create \
1120
- --cluster xpk-test --device-type h100-80gb-8 \
1121
- --workload xpk-test-workload \
1122
- --command="echo hello world"
1123
- ```
1124
-
1125
- * Workload Delete (delete training job):
1126
-
1127
- ```shell
1128
- python3 xpk.py workload delete \
1129
- --workload xpk-test-workload --cluster xpk-test
1130
- ```
1131
-
1132
- This will only delete the `xpk-test-workload` workload in the `xpk-test` cluster.
1133
-
1134
- * Workload Delete (delete all training jobs in the cluster):
1135
-
1136
- ```shell
1137
- python3 xpk.py workload delete \
1138
- --cluster xpk-test
1139
- ```
1140
-
1141
- This will delete all the workloads in the `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt.
1142
-
1143
- * Workload Delete supports filtering, so you can delete only the jobs that match user criteria.
1144
- * Filter by Job: `filter-by-job`
1145
-
1146
- ```shell
1147
- python3 xpk.py workload delete \
1148
- --cluster xpk-test --filter-by-job=$USER
1149
- ```
1150
-
1151
- This will delete all the workloads in the `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.
1152
-
1153
- * Filter by Status: `filter-by-status`
1154
-
1155
- ```shell
1156
- python3 xpk.py workload delete \
1157
- --cluster xpk-test --filter-by-status=QUEUED
1158
- ```
1159
-
1160
- This will delete all the workloads in the `xpk-test` cluster that have status Admitted or Evicted and 0 running VMs. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be one of: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
1161
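-
- For example, to delete only workloads that have already finished, pass one of the other statuses from the list above. A minimal sketch (cluster name is illustrative):
-
- ```shell
- python3 xpk.py workload delete \
- --cluster xpk-test --filter-by-status=FINISHED
- ```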
-
1162
- ## CPU usage
1163
-
1164
- To use XPK for CPUs, specify the CPU machine with the `--device-type` flag.
1165
-
1166
- * Cluster Create (provision on-demand capacity):
1167
-
1168
- ```shell
1169
- # Run cluster create with on demand capacity.
1170
- python3 xpk.py cluster create \
1171
- --cluster xpk-test \
1172
- --device-type=n2-standard-32-256 \
1173
- --num-slices=1 \
1174
- --default-pool-cpu-machine-type=n2-standard-32 \
1175
- --on-demand
1176
- ```
1177
- Note that `device-type` for CPUs has the format `<cpu-machine-type>-<number of VMs>`; thus in the above example, the user requests 256 VMs of type n2-standard-32.
1178
- Currently, workloads using fewer than 1000 VMs are supported.
1179
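-
- For instance, following the same naming convention, a smaller on-demand pool of 16 VMs would use `--device-type=n2-standard-32-16`. A minimal sketch (cluster name and pool size are illustrative):
-
- ```shell
- python3 xpk.py cluster create \
- --cluster xpk-test \
- --device-type=n2-standard-32-16 \
- --num-slices=1 \
- --default-pool-cpu-machine-type=n2-standard-32 \
- --on-demand
- ```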
-
1180
- * Run a workload:
1181
-
1182
- ```shell
1183
- # Submit a workload
1184
- python3 xpk.py workload create \
1185
- --cluster xpk-test \
1186
- --num-slices=1 \
1187
- --device-type=n2-standard-32-256 \
1188
- --workload xpk-test-workload \
1189
- --command="echo hello world"
1190
- ```
1191
-
1192
- # Autoprovisioning with XPK
1193
- XPK can dynamically allocate cluster capacity using [Node Auto Provisioning (NAP)](https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#use_accelerators_for_new_auto-provisioned_node_pools) support.
1194
-
1195
- This allows several topology sizes to be supported from one XPK cluster and provisioned dynamically based on incoming workload requests, so XPK users do not need to re-provision the cluster manually.
1196
-
1197
- Enabling autoprovisioning initially takes up to **30 minutes** for the cluster to upgrade.
1198
-
1199
- ## Create a cluster with autoprovisioning:
1200
-
1201
- Autoprovisioning will be enabled on the cluster below, which will scale between 0 and 8 v4 TPU chips
1202
- (up to 1x v4-16).
1203
-
1204
- XPK doesn't currently support different generations of accelerators in the same cluster (like v4 and v5p TPUs).
1205
-
1206
- ```shell
1207
- CLUSTER_NAME=my_cluster
1208
- NUM_SLICES=2
1209
- DEVICE_TYPE=v4-8
1210
- RESERVATION=reservation_id
1211
- PROJECT=my_project
1212
- ZONE=us-east5-b
1213
-
1214
- python3 xpk.py cluster create \
1215
- --cluster $CLUSTER_NAME \
1216
- --num-slices=$NUM_SLICES \
1217
- --device-type=$DEVICE_TYPE \
1218
- --zone=$ZONE \
1219
- --project=$PROJECT \
1220
- --reservation=$RESERVATION \
1221
- --enable-autoprovisioning
1222
- ```
1223
-
1224
- 1. Define the starting accelerator configuration and capacity type.
1225
-
1226
- ```shell
1227
- --device-type=$DEVICE_TYPE \
1228
- --num-slices=$NUM_SLICES
1229
- ```
1230
- 2. Optionally set custom `minimum` / `maximum` chips. NAP will rescale the cluster between `minimum` and `maximum` chips. By default, `maximum` is set to the current cluster configuration size and `minimum` is set to 0, which allows NAP to rescale using all of the cluster's resources.
1231
-
1232
- ```shell
1233
- --autoprovisioning-min-chips=$MIN_CHIPS \
1234
- --autoprovisioning-max-chips=$MAX_CHIPS
1235
- ```
1236
-
1237
- 3. `FEATURE TO COME SOON:` Set the timeout period for when node pools will automatically be deleted
1238
- if no incoming workloads are run. Currently this is 10 minutes.
1239
-
1240
- 4. `FEATURE TO COME:` Set the timeout period to infinity. This will keep the idle node pool configuration always running until updated by new workloads.
1241
-
1242
- ### Update a cluster with autoprovisioning:
1243
- ```shell
1244
- CLUSTER_NAME=my_cluster
1245
- NUM_SLICES=2
1246
- DEVICE_TYPE=v4-8
1247
- RESERVATION=reservation_id
1248
- PROJECT=my_project
1249
- ZONE=us-east5-b
1250
-
1251
- python3 xpk.py cluster create \
1252
- --cluster $CLUSTER_NAME \
1253
- --num-slices=$NUM_SLICES \
1254
- --device-type=$DEVICE_TYPE \
1255
- --zone=$ZONE \
1256
- --project=$PROJECT \
1257
- --reservation=$RESERVATION \
1258
- --enable-autoprovisioning
1259
- ```
1260
-
1261
- ### Update a previously autoprovisioned cluster with a different number of chips:
1262
-
1263
- * Option 1: By creating a new cluster nodepool configuration.
1264
-
1265
- ```shell
1266
- CLUSTER_NAME=my_cluster
1267
- NUM_SLICES=2
1268
- DEVICE_TYPE=v4-16
1269
- RESERVATION=reservation_id
1270
- PROJECT=my_project
1271
- ZONE=us-east5-b
1272
-
1273
- # This will create 2x v4-16 node pools and set the max autoprovisioned chips to 16.
1274
- python3 xpk.py cluster create \
1275
- --cluster $CLUSTER_NAME \
1276
- --num-slices=$NUM_SLICES \
1277
- --device-type=$DEVICE_TYPE \
1278
- --zone=$ZONE \
1279
- --project=$PROJECT \
1280
- --reservation=$RESERVATION \
1281
- --enable-autoprovisioning
1282
- ```
1283
-
1284
- * Option 2: By increasing the `--autoprovisioning-max-chips`.
1285
- ```shell
1286
- CLUSTER_NAME=my_cluster
1287
- NUM_SLICES=0
1288
- DEVICE_TYPE=v4-16
1289
- RESERVATION=reservation_id
1290
- PROJECT=my_project
1291
- ZONE=us-east5-b
1292
-
1293
- # This will clear the node pools if they exist in the cluster and set the max autoprovisioned chips to 16
1294
- python3 xpk.py cluster create \
1295
- --cluster $CLUSTER_NAME \
1296
- --num-slices=$NUM_SLICES \
1297
- --device-type=$DEVICE_TYPE \
1298
- --zone=$ZONE \
1299
- --project=$PROJECT \
1300
- --reservation=$RESERVATION \
1301
- --enable-autoprovisioning \
1302
- --autoprovisioning-max-chips 16
1303
- ```
1304
-
1305
- ## Run workloads on the cluster with autoprovisioning:
1306
- Reconfigure the `--device-type` and `--num-slices` flags as needed:
1307
- ```shell
1308
- CLUSTER_NAME=my_cluster
1309
- NUM_SLICES=2
1310
- DEVICE_TYPE=v4-8
1311
- NEW_RESERVATION=new_reservation_id
1312
- PROJECT=my_project
1313
- ZONE=us-east5-b
1314
- # Create a 2x v4-8 TPU workload.
1315
- python3 xpk.py workload create \
1316
- --cluster $CLUSTER_NAME \
1317
- --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
1318
- --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
1319
- --device-type=$DEVICE_TYPE \
1320
- --num-slices=$NUM_SLICES \
1321
- --zone=$ZONE \
1322
- --project=$PROJECT
1323
-
1324
- NUM_SLICES=1
1325
- DEVICE_TYPE=v4-16
1326
-
1327
- # Create a 1x v4-16 TPU workload.
1328
- python3 xpk.py workload create \
1329
- --cluster $CLUSTER_NAME \
1330
- --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
1331
- --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
1332
- --device-type=$DEVICE_TYPE \
1333
- --num-slices=$NUM_SLICES \
1334
- --zone=$ZONE \
1335
- --project=$PROJECT
1336
-
1337
- # Use a different reservation from what the cluster was created with.
1338
- python3 xpk.py workload create \
1339
- --cluster $CLUSTER_NAME \
1340
- --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
1341
- --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
1342
- --device-type=$DEVICE_TYPE \
1343
- --num-slices=$NUM_SLICES \
1344
- --zone=$ZONE \
1345
- --project=$PROJECT \
1346
- --reservation=$NEW_RESERVATION
1347
- ```
1348
-
1349
- 1. (Optional) Define the capacity type. By default, the capacity type will
1350
- match what the cluster was created with.
1351
-
1352
- ```shell
1353
- --reservation=my-reservation-id | --on-demand | --spot
1354
- ```
1355
-
1356
- 2. Set the topology of your workload using `--device-type` and `--num-slices`:
1357
-
1358
- ```shell
1359
- NUM_SLICES=1
1360
- DEVICE_TYPE=v4-8
1361
- --device-type=$DEVICE_TYPE \
1362
- --num-slices=$NUM_SLICES \
1363
- ```
1364
-
1365
-
1366
- # How to add docker images to a xpk workload
1367
-
1368
- By default, `xpk workload create` layers the local directory (`--script-dir`) into
1369
- the base docker image (`--base-docker-image`) and runs the workload command.
1370
- If you don't want this layering behavior, you can directly use `--docker-image`. Do not mix arguments from the two flows in the same command.
1371
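-
- Conceptually, the default flow behaves roughly like building a throwaway layered image yourself and then running the workload from it. The sketch below only illustrates the idea (it is not xpk's literal build steps); the base image name is taken from the examples in this document:
-
- ```shell
- # Illustrative only: approximates what the default layering flow does for you.
- cat > Dockerfile.sketch <<'EOF'
- FROM gcr.io/your_dependencies_docker_image
- WORKDIR /app
- COPY . /app
- EOF
- docker build -f Dockerfile.sketch -t layered-workload-image .
- # xpk then runs your --command inside an image layered like this on the cluster.
- ```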
-
1372
- ## Recommended / Default Docker Flow: `--base-docker-image` and `--script-dir`
1373
- This flow pulls the `--script-dir` into the `--base-docker-image` and runs the new docker image.
1374
-
1375
- * The below arguments are optional. By default, xpk will pull the local
1376
- directory into a generic base docker image.
1377
-
1378
- - `--base-docker-image` sets the base image that xpk will start with.
1379
-
1380
- - `--script-dir` sets which directory to pull into the image. This defaults to the current working directory.
1381
-
1382
- See `python3 xpk.py workload create --help` for more info.
1383
-
1384
- * Example with defaults which pulls the local directory into the base image:
1385
- ```shell
1386
- echo -e '#!/bin/bash \n echo "Hello world from a test script!"' > test.sh
1387
- python3 xpk.py workload create --cluster xpk-test \
1388
- --workload xpk-test-workload-base-image --command "bash test.sh" \
1389
- --tpu-type=v5litepod-16 --num-slices=1
1390
- ```
1391
-
1392
- * Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators):
1393
- ```shell
1394
- python3 xpk.py workload create --cluster xpk-test \
1395
- --workload xpk-test-workload-base-image --command "bash custom_script.sh" \
1396
- --base-docker-image=gcr.io/your_dependencies_docker_image \
1397
- --tpu-type=v5litepod-16 --num-slices=1
1398
- ```
1399
-
1400
- ## Optional Direct Docker Image Configuration: `--docker-image`
1401
- If a user wants to directly set the docker image used and not layer in the
1402
- current working directory, set `--docker-image` to the image to be used in the
1403
- workload.
1404
-
1405
- * Running with `--docker-image`:
1406
- ```shell
1407
- python3 xpk.py workload create --cluster xpk-test \
1408
- --workload xpk-test-workload-base-image --command "bash test.sh" \
1409
- --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
1410
- ```
1411
-
1412
- * Recommended Flow For Large Sized Jobs (more than 10k accelerators):
1413
- ```shell
1414
- python3 xpk.py cluster cacheimage \
1415
- --cluster xpk-test --docker-image gcr.io/your_docker_image
1416
- # Run workload create with the same image.
1417
- python3 xpk.py workload create --cluster xpk-test \
1418
- --workload xpk-test-workload-base-image --command "bash test.sh" \
1419
- --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
1420
- ```
1421
-
1422
- # More advanced facts:
1423
-
1424
- * Workload create has two mutually exclusive ways to override the environment of a workload:
1425
- * a `--env` flag to specify each environment variable separately. The format is:
1426
-
1427
- `--env VARIABLE1=value --env VARIABLE2=value`
1428
-
1429
- * a `--env-file` flag to allow specifying the container's
1430
- environment from a file. Usage is the same as Docker's
1431
- [--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env)
1432
-
1433
- Example Env File:
1434
- ```shell
1435
- LIBTPU_INIT_ARGS=--my-flag=true --performance=high
1436
- MY_ENV_VAR=hello
1437
- ```
1438
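-
- A sketch of wiring an env file into a workload (cluster name, file name, and other flag values are illustrative):
-
- ```shell
- python3 xpk.py workload create --cluster xpk-test \
- --workload xpk-test-workload --command "bash test.sh" \
- --tpu-type=v5litepod-16 --num-slices=1 \
- --env-file=./workload.env
- ```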
-
1439
- * Workload create accepts a `--debug-dump-gcs` flag which takes the path to a GCS bucket.
1440
- Passing this flag sets the XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads
1441
- HLO dumps to the specified GCS bucket for each worker.
1442
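-
- A sketch of passing this flag (bucket path and other values are illustrative):
-
- ```shell
- python3 xpk.py workload create --cluster xpk-test \
- --workload xpk-test-workload --command "bash test.sh" \
- --tpu-type=v5litepod-16 --num-slices=1 \
- --debug-dump-gcs=gs://your-bucket/xla-dumps
- ```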
-
1443
- # Integration Test Workflows
1444
- The repository code is tested through GitHub Workflows and Actions. Currently, three kinds of tests are performed:
1445
- * A nightly build that runs every 24 hours
1446
- * A build that runs on push to `main` branch
1447
- * A build that runs for every PR approval
1448
-
1449
- More information is documented [here](https://github.com/google/xpk/tree/main/.github/workflows)
1450
-
1451
- # Troubleshooting
1452
-
1453
- ## `Invalid machine type` for CPUs.
1454
- XPK will create a regional GKE cluster. If you see issues like
1455
-
1456
- ```shell
1457
- Invalid machine type e2-standard-32 in zone $ZONE_NAME
1458
- ```
1459
-
1460
- Please select a CPU type that exists in all zones in the region.
1461
-
1462
- ```shell
1463
- # Find CPU Types supported in zones.
1464
- gcloud compute machine-types list --zones=$ZONE_LIST
1465
- # Adjust default cpu machine type.
1466
- python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
1467
- ```
1468
-
1469
- ## Workload creation fails
1470
-
1471
- If workload creation fails with the error below, some XPK cluster configuration might be missing.
1472
-
1473
- `[XPK] b'error: the server doesn\'t have a resource type "workloads"\n'`
1474
-
1475
- Mitigate this error by re-running your `xpk.py cluster create ...` command to refresh the cluster configuration.
1476
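-
- For example, if the cluster was created as in the GPU example above, the re-run would look like this (values are illustrative; reuse the arguments from your original create command):
-
- ```shell
- python3 xpk.py cluster create \
- --cluster xpk-test --device-type=h100-80gb-8 \
- --num-nodes=2 \
- --reservation=$RESERVATION_ID
- ```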
-
1477
- ## Permission Issues: `requires one of ["permission_name"] permission(s)`.
1478
-
1479
- 1) Determine the role needed based on the permission error:
1480
-
1481
- ```shell
1482
- # For example: `requires one of ["container.*"] permission(s)`
1483
- # Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
1484
- ```
1485
-
1486
- 2) Add the role to the user in your project.
1487
-
1488
- Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli:
1489
- ```shell
1490
- PROJECT_ID=my-project-id
1491
- CURRENT_GKE_USER=$(gcloud config get account)
1492
- ROLE=roles/container.admin # container.admin is the role needed for Kubernetes Engine Admin
1493
- gcloud projects add-iam-policy-binding $PROJECT_ID --member user:$CURRENT_GKE_USER --role=$ROLE
1494
- ```
1495
-
1496
- 3) Check that the permissions are correct for the users.
1497
-
1498
- Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli:
1499
-
1500
- ```shell
1501
- PROJECT_ID=my-project-id
1502
- CURRENT_GKE_USER=$(gcloud config get account)
1503
- gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$CURRENT_GKE_USER" --flatten="bindings[].members"
1504
- ```
1505
-
1506
- 4) Confirm you have logged in locally with the correct user.
1507
-
1508
- ```shell
1509
- gcloud auth login
1510
- ```
1511
-
1512
- ### Roles needed based on permission errors:
1513
-
1514
- * `requires one of ["container.*"] permission(s)`
1515
-
1516
- Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
1517
-
1518
- * `ERROR: (gcloud.monitoring.dashboards.list) User does not have permission to access projects instance (or it may not exist)`
1519
-
1520
- Add [Monitoring Viewer](https://cloud.google.com/iam/docs/understanding-roles#monitoring.viewer) to your user.
1521
-
1522
-
1523
- ## Reservation Troubleshooting:
1524
-
1525
- ### How to determine your reservation and its size / utilization:
1526
-
1527
- ```shell
1528
- PROJECT_ID=my-project
1529
- ZONE=us-east5-b
1530
- RESERVATION=my-reservation-name
1531
- # Find the reservations in your project
1532
- gcloud beta compute reservations list --project=$PROJECT_ID
1533
- # Find the tpu machine type and current utilization of a reservation.
1534
- gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
1535
- ```
1536
-
1537
- ## 403 error on workload create when using `--base-docker-image` flag
1538
- You need permission to push to the registry from your local machine. Try running `gcloud auth configure-docker`.
1539
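- For example (the Artifact Registry host below is illustrative; pass the registry your image is hosted in, if it is not a default gcr.io registry):
- ```shell
- # Authorize docker for the default gcr.io registries
- gcloud auth configure-docker
- # Or for a specific Artifact Registry host (illustrative)
- gcloud auth configure-docker us-docker.pkg.dev
- ```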
- ## `Kubernetes API exception` - 404 error
1540
- If an error of this kind appears after updating the xpk version, you may need to rerun the `cluster create` command to update resource definitions.
1541
-
1542
- # TPU Workload Debugging
1543
-
1544
- ## Verbose Logging
1545
- If you are having trouble with your workload, try setting the `--enable-debug-logs` flag when you schedule it. This will give you more detailed logs to help pinpoint the issue. For example:
1546
- ```shell
1547
- python3 xpk.py workload create \
1548
- --cluster xpk-test --workload xpk-test-workload \
1549
- --command="echo hello world" --enable-debug-logs
1550
- ```
1551
- Please check [libtpu logging](https://cloud.google.com/tpu/docs/troubleshooting/trouble-tf#debug_logs) and [Tensorflow logging](https://deepreg.readthedocs.io/en/latest/docs/logging.html#tensorflow-logging) for more information about the flags that are enabled to get the logs.
1552
-
1553
- ## Collect Stack Traces
1554
- The [cloud-tpu-diagnostics](https://pypi.org/project/cloud-tpu-diagnostics/) PyPI package can be used to generate stack traces for workloads running in GKE. This package dumps Python traces when a fault such as a segmentation fault, floating-point exception, or illegal operation exception occurs in the program. It also periodically collects stack traces to help you debug situations when the program is unresponsive. To enable periodic stack trace collection, make the following changes in the docker image running in the Kubernetes main container.
1555
- ```python
1556
- # main.py
1557
-
1558
- from cloud_tpu_diagnostics import diagnostic
1559
- from cloud_tpu_diagnostics.configuration import debug_configuration
1560
- from cloud_tpu_diagnostics.configuration import diagnostic_configuration
1561
- from cloud_tpu_diagnostics.configuration import stack_trace_configuration
1562
-
1563
- stack_trace_config = stack_trace_configuration.StackTraceConfig(
1564
- collect_stack_trace = True,
1565
- stack_trace_to_cloud = True)
1566
- debug_config = debug_configuration.DebugConfig(
1567
- stack_trace_config = stack_trace_config)
1568
- diagnostic_config = diagnostic_configuration.DiagnosticConfig(
1569
- debug_config = debug_config)
1570
-
1571
- with diagnostic.diagnose(diagnostic_config):
1572
- main_method() # this is the main method to run
1573
- ```
1574
- This configuration will start collecting stack traces inside the `/tmp/debugging` directory on each Kubernetes Pod.
1575
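-
- If you want to pull the collected traces off a pod manually (rather than using the sidecar described below), a minimal sketch (namespace and pod name are illustrative):
-
- ```shell
- kubectl cp default/your-workload-pod-name:/tmp/debugging ./debugging
- ```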
-
1576
- ### Explore Stack Traces
1577
- To explore the stack traces collected in a temporary directory in the Kubernetes Pod, you can run the following command to configure a sidecar container that will read the traces from the `/tmp/debugging` directory.
1578
- ```shell
1579
- python3 xpk.py workload create \
1580
- --workload xpk-test-workload --command "python3 main.py" --cluster \
1581
- xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar
1582
- ```
1583
-
1584
- ### Get information about jobs, queues and resources.
1585
-
1586
- To list available resources and queues, use the `xpk info` command. It lets you see LocalQueues and ClusterQueues and check for available resources.
1587
-
1588
- To see queues with usage and workload info use:
1589
- ```shell
1590
- python3 xpk.py info --cluster my-cluster
1591
- ```
1592
-
1593
- You can specify which kind of resource (ClusterQueue or LocalQueue) you want to see using the `--clusterqueue` or `--localqueue` flag.
1594
- ```shell
1595
- python3 xpk.py info --cluster my-cluster --localqueue
1596
- ```
1597
-
1598
- # Local testing with Kind
1599
-
1600
- To facilitate development and testing locally, we have integrated support for testing with `kind`. This enables you to simulate a Kubernetes environment on your local machine.
1601
-
1602
- ## Prerequisites
1603
-
1604
- - Install kind on your local machine. Follow the official documentation here: [Kind Installation Guide.](https://kind.sigs.k8s.io/docs/user/quick-start#installation)
1605
-
1606
- ## Usage
1607
-
1608
- xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facilitating the orchestration and management of workloads. Below are the commands for managing clusters:
1609
-
1610
- ### Cluster Create
1611
- * Cluster create:
1612
-
1613
- ```shell
1614
- python3 xpk.py kind create \
1615
- --cluster xpk-test
1616
- ```
1617
-
1618
- ### Cluster Delete
1619
- * Cluster Delete:
1620
-
1621
- ```shell
1622
- python3 xpk.py kind delete \
1623
- --cluster xpk-test
1624
- ```
1625
-
1626
- ### Cluster List
1627
- * Cluster List:
1628
-
1629
- ```shell
1630
- python3 xpk.py kind list
1631
- ```
1632
-
1633
- ## Local Testing Basics
1634
-
1635
- Local testing is available exclusively through the `batch` and `job` commands of xpk with the `--kind-cluster` flag. This allows you to simulate training jobs locally:
1636
-
1637
- ```shell
1638
- python xpk.py batch [other-options] --kind-cluster script
1639
- ```
1640
-
1641
- Please note that all other xpk subcommands are intended for use with cloud systems on Google Compute Engine (GCE) and don't support local testing. This includes commands like `cluster`, `info`, `inspector`, etc.
1642
-
1643
- # Other advanced usage
1644
- [Use a Jupyter notebook to interact with a Cloud TPU cluster](xpk-notebooks.md) \
1645
- [Use Slurm like commands in XPK to execute workloads on top of GKE](xpk-slurm-commands.md)