xpk 0.3.0__tar.gz → 0.4.0__tar.gz

xpk-0.4.0/PKG-INFO ADDED
@@ -0,0 +1,1078 @@
1
+ Metadata-Version: 2.1
2
+ Name: xpk
3
+ Version: 0.4.0
4
+ Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
5
+ Author-email: Cloud TPU Team <cloud-tpu-eng@google.com>
6
+ License: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/google/xpk
8
+ Project-URL: Bug Tracker, https://github.com/google/xpk/issues
9
+ Classifier: Programming Language :: Python :: 3.10
10
+ Classifier: Programming Language :: Python :: 3.11
11
+ Requires-Python: >=3.10
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: cloud-accelerator-diagnostics
15
+ Provides-Extra: dev
16
+ Requires-Dist: pyink==24.3.0; extra == "dev"
17
+ Requires-Dist: pylint>=2.6.0; extra == "dev"
18
+ Requires-Dist: pre-commit; extra == "dev"
19
+
20
+ <!--
21
+ Copyright 2023 Google LLC
22
+
23
+ Licensed under the Apache License, Version 2.0 (the "License");
24
+ you may not use this file except in compliance with the License.
25
+ You may obtain a copy of the License at
26
+
27
+ https://www.apache.org/licenses/LICENSE-2.0
28
+
29
+ Unless required by applicable law or agreed to in writing, software
30
+ distributed under the License is distributed on an "AS IS" BASIS,
31
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
32
+ See the License for the specific language governing permissions and
33
+ limitations under the License.
34
+ -->
35
+
36
+ [![Build Tests](https://github.com/google/xpk/actions/workflows/build_tests.yaml/badge.svg)](https://github.com/google/xpk/actions/workflows/build_tests.yaml)
37
+ [![Nightly Tests](https://github.com/google/xpk/actions/workflows/nightly_tests.yaml/badge.svg)](https://github.com/google/xpk/actions/workflows/nightly_tests.yaml)
38
+
39
+ # Overview
40
+
41
+ xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps
+ Cloud developers orchestrate training jobs on accelerators such as TPUs and
+ GPUs on GKE. xpk handles the "multihost pods" of TPUs, GPUs (HGX H100) and CPUs
+ (n2-standard-32) as first-class citizens.
45
+
46
+ xpk decouples provisioning capacity from running jobs. There are two structures:
47
+ clusters (provisioned VMs) and workloads (training jobs). Clusters represent the
48
+ physical resources you have available. Workloads represent training jobs -- at
49
+ any time some of these will be completed, others will be running and some will
50
+ be queued, waiting for cluster resources to become available.
51
+
52
+ The ideal workflow starts by provisioning the clusters for all of the ML
+ hardware you have reserved. Then, without re-provisioning, submit jobs as
+ needed. By eliminating the need for re-provisioning between jobs, and by using
+ Docker containers with pre-installed dependencies and ahead-of-time cross
+ compilation, these queued jobs run with minimal start times. Further, because
+ workloads return the hardware to the shared pool when they complete, developers
+ can achieve better use of finite hardware resources. And automated tests can
+ run overnight while resources tend to be underutilized.
60
+
61
+ xpk supports the following TPU types:
62
+ * v4
63
+ * v5e
64
+ * v5p
65
+
66
+ and the following GPU types:
67
+ * a100
68
+ * h100
69
+
70
+ and the following CPU types:
71
+ * n2-standard-32
72
+
73
+ # Installation
74
+ To install xpk, run the following command:
75
+
76
+ ```shell
77
+ pip install xpk
78
+ ```
79
+
80
+ If you are running XPK by cloning the GitHub repository, first run the
+ following commands to begin using XPK commands:
82
+
83
+ ```shell
84
+ git clone https://github.com/google/xpk.git
85
+ cd xpk
86
+ # Install dependencies such as cloud-accelerator-diagnostics
87
+ pip install .
88
+ ```
89
+
90
+ If you see an error saying: `This environment is externally managed`, please use a virtual environment.
91
+
92
+ Example:
93
+
94
+ ```shell
95
+ ## One time step of creating the venv
96
+ VENV_DIR=~/venvp3
97
+ python3 -m venv $VENV_DIR
98
+ ## Enter your venv.
99
+ source $VENV_DIR/bin/activate
100
+ ## Clone the repository and install dependencies.
101
+ git clone https://github.com/google/xpk.git
102
+ cd xpk
103
+ # Install dependencies such as cloud-accelerator-diagnostics
104
+ pip install .
105
+ ```
106
+
107
+ # XPK for Large Scale (>1k VMs)
108
+
109
+ Follow the user instructions in [xpk-large-scale-guide.sh](xpk-large-scale-guide.sh)
+ to use xpk for a GKE cluster with more than 1000 VMs. Run these steps to set up a
+ GKE cluster with large-scale training and high-throughput support, and to run
+ jobs with XPK. We recommend you manually copy the commands step by step and verify
+ the outputs of each step.
114
+
115
+ # Example usages:
116
+
117
+ To get started, be sure to set your GCP Project and Zone as usual via `gcloud
118
+ config set`.
119
+
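+ For example, with placeholder values:
+ 
+ ```shell
+ # Placeholder project and zone; substitute your own values.
+ gcloud config set project my-project-id
+ gcloud config set compute/zone us-east5-b
+ ```
+ 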
120
+ Below are reference commands. A typical journey starts with a `Cluster Create`
121
+ followed by many `Workload Create`s. To understand the state of the system you
122
+ might want to use `Cluster List` or `Workload List` commands. Finally, you can
123
+ cleanup with a `Cluster Delete`.
124
+
125
+ If you have failures with workloads not running, use `xpk inspector` to
+ investigate further.
127
+
128
+ ## Cluster Create
129
+
130
+ First set the project and zone through gcloud config or xpk arguments.
131
+
132
+ ```shell
133
+ PROJECT_ID=my-project-id
134
+ ZONE=us-east5-b
135
+ # gcloud config:
136
+ gcloud config set project $PROJECT_ID
137
+ gcloud config set compute/zone $ZONE
138
+ # xpk arguments
139
+ xpk .. --zone $ZONE --project $PROJECT_ID
140
+ ```
141
+
142
+ The `Cluster Create` command will create a project-specific Service Account. Note that only one service
+ account will be created per project. This service account will be attached to the node pools instead of the default
+ [Compute Engine Service Account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account).
+ All the required permissions will be assigned to this service account by XPK. Make sure you have the
+ [Service Account Admin](https://cloud.google.com/iam/docs/understanding-roles#iam.serviceAccountAdmin) and
+ [Project IAM Admin](https://cloud.google.com/iam/docs/understanding-roles#resourcemanager.projectIamAdmin)
+ roles assigned to your user account.
149
+
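+ If these roles are missing, a project administrator can grant them with `gcloud`, for example (a sketch; substitute your own project and account):
+ 
+ ```shell
+ PROJECT_ID=my-project-id
+ CURRENT_GKE_USER=$(gcloud config get account)
+ # Roles needed so `cluster create` can manage the XPK service account.
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
+ --member user:$CURRENT_GKE_USER --role=roles/iam.serviceAccountAdmin
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
+ --member user:$CURRENT_GKE_USER --role=roles/resourcemanager.projectIamAdmin
+ ```
+ 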
150
+ The cluster created is a regional cluster to enable the GKE control plane across
151
+ all zones.
152
+
153
+ * Cluster Create (provision reserved capacity):
154
+
155
+ ```shell
156
+ # Find your reservations
157
+ gcloud compute reservations list --project=$PROJECT_ID
158
+ # Run cluster create with reservation.
159
+ python3 xpk.py cluster create \
160
+ --cluster xpk-test --tpu-type=v5litepod-256 \
161
+ --num-slices=2 \
162
+ --reservation=$RESERVATION_ID
163
+ ```
164
+
165
+ * Cluster Create (provision on-demand capacity):
166
+
167
+ ```shell
168
+ python3 xpk.py cluster create \
169
+ --cluster xpk-test --tpu-type=v5litepod-16 \
170
+ --num-slices=4 --on-demand
171
+ ```
172
+
173
+ * Cluster Create (provision spot / preemptible capacity):
174
+
175
+ ```shell
176
+ python3 xpk.py cluster create \
177
+ --cluster xpk-test --tpu-type=v5litepod-16 \
178
+ --num-slices=4 --spot
179
+ ```
180
+
181
+ * Cluster Create for Pathways:
182
+ A Pathways-compatible cluster can be created using `--enable-pathways`:
183
+ ```shell
184
+ python3 xpk.py cluster create \
185
+ --cluster xpk-pw-test \
186
+ --num-slices=4 --on-demand \
187
+ --tpu-type=v5litepod-16 \
188
+ --enable-pathways
189
+ ```
190
+
191
+ * Cluster Create can be called again with the same `--cluster` name to modify
+ the number of slices or retry failed steps.
193
+
194
+ For example, if a user creates a cluster with 4 slices:
195
+
196
+ ```shell
197
+ python3 xpk.py cluster create \
198
+ --cluster xpk-test --tpu-type=v5litepod-16 \
199
+ --num-slices=4 --reservation=$RESERVATION_ID
200
+ ```
201
+
202
+ and then requests 8 slices, the command will rerun to create 4
+ new slices:
204
+
205
+ ```shell
206
+ python3 xpk.py cluster create \
207
+ --cluster xpk-test --tpu-type=v5litepod-16 \
208
+ --num-slices=8 --reservation=$RESERVATION_ID
209
+ ```
210
+
211
+ If the user then requests 6 slices, the command will rerun to delete 2
+ slices. The command will warn the user when deleting slices.
+ Use `--force` to skip prompts.
214
+
215
+ ```shell
216
+ python3 xpk.py cluster create \
217
+ --cluster xpk-test --tpu-type=v5litepod-16 \
218
+ --num-slices=6 --reservation=$RESERVATION_ID
219
+
220
+ # Skip delete prompts using --force.
221
+
222
+ python3 xpk.py cluster create --force \
223
+ --cluster xpk-test --tpu-type=v5litepod-16 \
224
+ --num-slices=6 --reservation=$RESERVATION_ID
225
+
226
+ ```
227
+
228
+ ### Create Vertex AI Tensorboard
229
+ *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature.*
230
+
231
+ Vertex AI Tensorboard is a fully managed version of open-source Tensorboard. To learn more about Vertex AI Tensorboard, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction). Note that Vertex AI Tensorboard is only available in [these](https://cloud.google.com/vertex-ai/docs/general/locations#available-regions) regions.
232
+
233
+ You can create a Vertex AI Tensorboard for your cluster with `Cluster Create` command. XPK will create a single Vertex AI Tensorboard instance per cluster.
234
+
235
+ * Create Vertex AI Tensorboard in default region with default Tensorboard name:
236
+
237
+ ```shell
238
+ python3 xpk.py cluster create \
239
+ --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
240
+ --create-vertex-tensorboard
241
+ ```
242
+
243
+ will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<args.cluster>-tb-instance*) in `us-central1` (*default region*).
244
+
245
+ * Create Vertex AI Tensorboard in user-specified region with default Tensorboard name:
246
+
247
+ ```shell
248
+ python3 xpk.py cluster create \
249
+ --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
250
+ --create-vertex-tensorboard --tensorboard-region=us-west1
251
+ ```
252
+
253
+ will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<args.cluster>-tb-instance*) in `us-west1`.
254
+
255
+ * Create Vertex AI Tensorboard in default region with user-specified Tensorboard name:
256
+
257
+ ```shell
258
+ python3 xpk.py cluster create \
259
+ --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
260
+ --create-vertex-tensorboard --tensorboard-name=tb-testing
261
+ ```
262
+
263
+ will create a Vertex AI Tensorboard with the name `tb-testing` in `us-central1`.
264
+
265
+ * Create Vertex AI Tensorboard in user-specified region with user-specified Tensorboard name:
266
+
267
+ ```shell
268
+ python3 xpk.py cluster create \
269
+ --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
270
+ --create-vertex-tensorboard --tensorboard-region=us-west1 --tensorboard-name=tb-testing
271
+ ```
272
+
273
+ will create a Vertex AI Tensorboard instance with the name `tb-testing` in `us-west1`.
274
+
275
+ * Create Vertex AI Tensorboard in an unsupported region:
276
+
277
+ ```shell
278
+ python3 xpk.py cluster create \
279
+ --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
280
+ --create-vertex-tensorboard --tensorboard-region=us-central2
281
+ ```
282
+
283
+ will fail the cluster creation process because Vertex AI Tensorboard is not supported in `us-central2`.
284
+
285
+ ## Cluster Delete
286
+ * Cluster Delete (deprovision capacity):
287
+
288
+ ```shell
289
+ python3 xpk.py cluster delete \
290
+ --cluster xpk-test
291
+ ```
292
+ ## Cluster List
293
+ * Cluster List (see provisioned capacity):
294
+
295
+ ```shell
296
+ python3 xpk.py cluster list
297
+ ```
298
+ ## Cluster Describe
299
+ * Cluster Describe (see capacity):
300
+
301
+ ```shell
302
+ python3 xpk.py cluster describe \
303
+ --cluster xpk-test
304
+ ```
305
+
306
+ ## Cluster Cacheimage
307
+ * Cluster Cacheimage (enables faster start times):
308
+
309
+ ```shell
310
+ python3 xpk.py cluster cacheimage \
311
+ --cluster xpk-test --docker-image gcr.io/your_docker_image \
312
+ --tpu-type=v5litepod-16
313
+ ```
314
+
315
+ ## Workload Create
316
+ * Workload Create (submit training job):
317
+
318
+ ```shell
319
+ python3 xpk.py workload create \
320
+ --workload xpk-test-workload --command "echo goodbye" \
321
+ --cluster xpk-test \
322
+ --tpu-type=v5litepod-16
323
+ ```
324
+
325
+ * Workload Create for Pathways:
326
+ A Pathways workload can be submitted using `--use-pathways` on a Pathways-enabled cluster (created with `--enable-pathways`).
327
+
328
+ Pathways workload example:
329
+ ```shell
330
+ python3 xpk.py workload create \
331
+ --workload xpk-pw-test \
332
+ --num-slices=1 \
333
+ --tpu-type=v5litepod-16 \
334
+ --use-pathways \
335
+ --cluster xpk-pw-test \
336
+ --docker-name='user-workload' \
337
+ --docker-image=<maxtext docker image> \
338
+ --command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
339
+ ```
340
+
341
+ A regular workload can also be submitted on a Pathways-enabled cluster (created with `--enable-pathways`).
+ 
+ Regular workload example:
344
+ ```shell
345
+ python3 xpk.py workload create \
346
+ --workload xpk-regular-test \
347
+ --num-slices=1 \
348
+ --tpu-type=v5litepod-16 \
349
+ --cluster xpk-pw-test \
350
+ --docker-name='user-workload' \
351
+ --docker-image=<maxtext docker image> \
352
+ --command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
353
+ ```
354
+
355
+ ### Set `max-restarts` for production jobs
356
+
357
+ * `--max-restarts <value>`: By default, this is 0. This will restart the job
+ `<value>` times when the job terminates. For production jobs, it is recommended to
+ increase this to a large number, say 50. Real jobs can be interrupted due to
+ hardware failures and software updates. We assume your job has implemented
+ checkpointing so the job restarts near where it was interrupted.
362
+
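+ For example, building on the workload example above (a sketch; adjust the cluster, TPU type, and command to your own):
+ 
+ ```shell
+ python3 xpk.py workload create \
+ --workload xpk-test-workload --command "echo goodbye" \
+ --cluster xpk-test \
+ --tpu-type=v5litepod-16 \
+ --max-restarts 50
+ ```
+ 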
363
+ ### Workload Priority and Preemption
364
+ * Set the priority level of your workload with `--priority=LEVEL`
365
+
366
+ We have five priorities defined: [`very-low`, `low`, `medium`, `high`, `very-high`].
367
+ The default priority is `medium`.
368
+
369
+ Priority determines:
370
+
371
+ 1. Order of queued jobs.
372
+
373
+ Queued jobs are ordered by
374
+ `very-low` < `low` < `medium` < `high` < `very-high`
375
+
376
+ 2. Preemption of lower priority workloads.
377
+
378
+ A higher priority job will `evict` lower priority jobs.
379
+ Evicted jobs are brought back to the queue and will re-hydrate appropriately.
380
+
381
+ #### General Example:
382
+ ```shell
383
+ python3 xpk.py workload create \
384
+ --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
385
+ xpk-test --tpu-type=v5litepod-16 --priority=medium
386
+ ```
387
+
388
+ ### Create Vertex AI Experiment to upload data to Vertex AI Tensorboard
389
+ *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature.*
390
+
391
+ Vertex AI Experiment is a tool that helps to track and analyze an experiment run on Vertex AI Tensorboard. To learn more about Vertex AI Experiments, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments).
392
+
393
+ XPK will create a Vertex AI Experiment during the `workload create` command and attach it to the Vertex AI Tensorboard created for the cluster during `cluster create`. If a cluster was created before this feature was released, there will be no Vertex AI Tensorboard for the cluster and `workload create` will fail. Re-run `cluster create` to create a Vertex AI Tensorboard and then run `workload create` again to schedule your workload.
394
+
395
+ * Create Vertex AI Experiment with default Experiment name:
396
+
397
+ ```shell
398
+ python3 xpk.py workload create \
399
+ --cluster xpk-test --workload xpk-workload \
400
+ --use-vertex-tensorboard
401
+ ```
402
+
403
+ will create a Vertex AI Experiment with the name `xpk-test-xpk-workload` (*<args.cluster>-<args.workload>*).
404
+
405
+ * Create Vertex AI Experiment with user-specified Experiment name:
406
+
407
+ ```shell
408
+ python3 xpk.py workload create \
409
+ --cluster xpk-test --workload xpk-workload \
410
+ --use-vertex-tensorboard --experiment-name=test-experiment
411
+ ```
412
+
413
+ will create a Vertex AI Experiment with the name `test-experiment`.
414
+
415
+ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how to update your workload to automatically upload logs collected in your Tensorboard directory to the Vertex AI Experiment created by `workload create`.
416
+
417
+ ## Workload Delete
418
+ * Workload Delete (delete training job):
419
+
420
+ ```shell
421
+ python3 xpk.py workload delete \
422
+ --workload xpk-test-workload --cluster xpk-test
423
+ ```
424
+
425
+ This will only delete `xpk-test-workload` workload in `xpk-test` cluster.
426
+
427
+ * Workload Delete (delete all training jobs in the cluster):
428
+
429
+ ```shell
430
+ python3 xpk.py workload delete \
431
+ --cluster xpk-test
432
+ ```
433
+
434
+ This will delete all the workloads in `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt. Multiple workload deletions are processed in batches for optimized processing.
435
+
436
+ * Workload Delete supports filtering. Delete a portion of jobs that match user criteria. Multiple workload deletions are processed in batches for optimized processing.
437
+ * Filter by Job: `filter-by-job`
438
+
439
+ ```shell
440
+ python3 xpk.py workload delete \
441
+ --cluster xpk-test --filter-by-job=$USER
442
+ ```
443
+
444
+ This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.
445
+
446
+ * Filter by Status: `filter-by-status`
447
+
448
+ ```shell
449
+ python3 xpk.py workload delete \
450
+ --cluster xpk-test --filter-by-status=QUEUED
451
+ ```
452
+
453
+ This will delete all the workloads in the `xpk-test` cluster whose status is Admitted or Evicted and whose number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
454
+
455
+ ## Workload List
456
+ * Workload List (see training jobs):
457
+
458
+ ```shell
459
+ python3 xpk.py workload list \
460
+ --cluster xpk-test
461
+ ```
462
+
463
+ * Example Workload List Output:
464
+
465
+ The below example shows five jobs in different states:
+ 
+ * `user-first-job-failed`: **filter-status** is `FINISHED` and `FAILED`.
+ * `user-second-job-success`: **filter-status** is `FINISHED` and `SUCCESSFUL`.
+ * `user-third-job-running`: **filter-status** is `RUNNING`.
+ * `user-fourth-job-in-queue`: **filter-status** is `QUEUED`.
+ * `user-fifth-job-preempted`: **filter-status** is `QUEUED`.
472
+
473
+ ```
474
+ Jobset Name Created Time Priority TPU VMs Needed TPU VMs Running/Ran TPU VMs Done Status Status Message Status Time
475
+ user-first-job-failed 2023-1-1T1:00:00Z medium 4 4 <none> Finished JobSet failed 2023-1-1T1:05:00Z
476
+ user-second-job-success 2023-1-1T1:10:00Z medium 4 4 4 Finished JobSet finished successfully 2023-1-1T1:14:00Z
477
+ user-third-job-running 2023-1-1T1:15:00Z medium 4 4 <none> Admitted Admitted by ClusterQueue cluster-queue 2023-1-1T1:16:00Z
478
+ user-fourth-job-in-queue 2023-1-1T1:16:05Z medium 4 <none> <none> Admitted couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need 2023-1-1T1:16:10Z
479
+ user-fifth-job-preempted 2023-1-1T1:10:05Z low 4 <none> <none> Evicted Preempted to accommodate a higher priority Workload 2023-1-1T1:10:00Z
480
+ ```
481
+
482
+ * Workload List supports filtering. Observe a portion of jobs that match user criteria.
483
+
484
+ * Filter by Status: `filter-by-status`
485
+
486
+ Filter the workload list by the status of respective jobs.
487
+ Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
488
+
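+ For example, to list only the queued workloads:
+ 
+ ```shell
+ python3 xpk.py workload list \
+ --cluster xpk-test --filter-by-status=QUEUED
+ ```
+ 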
489
+ * Filter by Job: `filter-by-job`
490
+
491
+ Filter the workload list by the name of a job.
492
+
493
+ ```shell
494
+ python3 xpk.py workload list \
495
+ --cluster xpk-test --filter-by-job=$USER
496
+ ```
497
+
498
+ * Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached, and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once.
+ (Note: `restart-on-user-code-failure` must be set when creating the workload; otherwise the workload will always finish with `Completed` status.)
501
+
502
+ Wait for a job to complete.
503
+
504
+ ```shell
505
+ python3 xpk.py workload list \
506
+ --cluster xpk-test --wait-for-job-completion=xpk-test-workload
507
+ ```
508
+
509
+ Wait for a job to complete with a timeout of 300 seconds.
510
+
511
+ ```shell
512
+ python3 xpk.py workload list \
513
+ --cluster xpk-test --wait-for-job-completion=xpk-test-workload \
514
+ --timeout=300
515
+ ```
516
+
517
+ Return codes:
+ 
+ * `0`: Workload finished and completed successfully.
+ * `124`: Timeout was reached before the workload finished.
+ * `125`: Workload finished but did not complete successfully.
+ * `1`: Other failure.
522
+
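+ For example, a wrapper script might branch on the return code (a sketch):
+ 
+ ```shell
+ python3 xpk.py workload list \
+ --cluster xpk-test --wait-for-job-completion=xpk-test-workload \
+ --timeout=300
+ ret=$?
+ if [ "$ret" -eq 0 ]; then
+   echo "Workload completed successfully."
+ elif [ "$ret" -eq 124 ]; then
+   echo "Timed out waiting for the workload."
+ else
+   echo "Workload did not complete successfully (return code $ret)."
+ fi
+ ```
+ 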
523
+ ## Inspector
524
+ * Inspector provides debug info to understand cluster health, and why workloads are not running.
525
+ Inspector output is saved to a file.
526
+
527
+ ```shell
528
+ python3 xpk.py inspector \
529
+ --cluster $CLUSTER_NAME \
530
+ --project $PROJECT_ID \
531
+ --zone $ZONE
532
+ ```
533
+
534
+ * Optional Arguments
535
+ * `--print-to-terminal`:
536
+ Print command output to terminal as well as a file.
537
+ * `--workload $WORKLOAD_NAME`:
+ Inspector will write debug info related to the workload `$WORKLOAD_NAME`.
539
+
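+ For example, combining the optional arguments above (variable values are placeholders):
+ 
+ ```shell
+ python3 xpk.py inspector \
+ --cluster $CLUSTER_NAME \
+ --project $PROJECT_ID \
+ --zone $ZONE \
+ --workload $WORKLOAD_NAME \
+ --print-to-terminal
+ ```
+ 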
540
+ * Example Output:
541
+
542
+ The output of xpk inspector is in `/tmp/tmp0pd6_k1o` in this example.
543
+ ```shell
544
+ [XPK] Starting xpk
545
+ [XPK] Task: `Set Cluster` succeeded.
546
+ [XPK] Task: `Local Setup: gcloud version` is implemented by `gcloud version`, hiding output unless there is an error.
547
+ [XPK] Task: `Local Setup: Project / Zone / Region` is implemented by `gcloud config get project; gcloud config get compute/zone; gcloud config get compute/region`, hiding output unless there is an error.
548
+ [XPK] Task: `GKE: Cluster Details` is implemented by `gcloud beta container clusters list --project $PROJECT --region $REGION | grep -e NAME -e $CLUSTER_NAME`, hiding output unless there is an error.
549
+ [XPK] Task: `GKE: Node pool Details` is implemented by `gcloud beta container node-pools list --cluster $CLUSTER_NAME --project=$PROJECT --region=$REGION`, hiding output unless there is an error.
550
+ [XPK] Task: `Kubectl: All Nodes` is implemented by `kubectl get node -o custom-columns='NODE_NAME:metadata.name, READY_STATUS:.status.conditions[?(@.type=="Ready")].status, NODEPOOL:metadata.labels.cloud\.google\.com/gke-nodepool'`, hiding output unless there is an error.
551
+ [XPK] Task: `Kubectl: Number of Nodes per Node Pool` is implemented by `kubectl get node -o custom-columns=':metadata.labels.cloud\.google\.com/gke-nodepool' | sort | uniq -c`, hiding output unless there is an error.
552
+ [XPK] Task: `Kubectl: Healthy Node Count Per Node Pool` is implemented by `kubectl get node -o custom-columns='NODE_NAME:metadata.name, READY_STATUS:.status.conditions[?(@.type=="Ready")].status, NODEPOOL:metadata.labels.cloud\.google\.com/gke-nodepool' | grep -w True | awk {'print $3'} | sort | uniq -c`, hiding output unless there is an error.
553
+ [XPK] Task: `Kueue: ClusterQueue Details` is implemented by `kubectl describe ClusterQueue cluster-queue`, hiding output unless there is an error.
554
+ [XPK] Task: `Kueue: LocalQueue Details` is implemented by `kubectl describe LocalQueue multislice-queue`, hiding output unless there is an error.
555
+ [XPK] Task: `Kueue: Kueue Deployment Details` is implemented by `kubectl describe Deployment kueue-controller-manager -n kueue-system`, hiding output unless there is an error.
556
+ [XPK] Task: `Jobset: Deployment Details` is implemented by `kubectl describe Deployment jobset-controller-manager -n jobset-system`, hiding output unless there is an error.
557
+ [XPK] Task: `Kueue Manager Logs` is implemented by `kubectl logs deployment/kueue-controller-manager -n kueue-system --tail=100 --prefix=True`, hiding output unless there is an error.
558
+ [XPK] Task: `Jobset Manager Logs` is implemented by `kubectl logs deployment/jobset-controller-manager -n jobset-system --tail=100 --prefix=True`, hiding output unless there is an error.
559
+ [XPK] Task: `List Jobs with filter-by-status=EVERYTHING with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" `, hiding output unless there is an error.
560
+ [XPK] Task: `List Jobs with filter-by-status=QUEUED with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" | awk -e 'NR == 1 || ($7 ~ "Admitted|Evicted|QuotaReserved" && ($5 ~ "<none>" || $5 == 0)) {print $0}' `, hiding output unless there is an error.
561
+ [XPK] Task: `List Jobs with filter-by-status=RUNNING with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" | awk -e 'NR == 1 || ($7 ~ "Admitted|Evicted" && $5 ~ /^[0-9]+$/ && $5 > 0) {print $0}' `, hiding output unless there is an error.
562
+ [XPK] Find xpk inspector output file: /tmp/tmp0pd6_k1o
563
+ [XPK] Exiting XPK cleanly
564
+ ```
565
+
566
+ ## GPU usage
567
+
568
+ To use XPK for GPUs, use the `--device-type` flag.
569
+
570
+ * Cluster Create (provision reserved capacity):
571
+
572
+ ```shell
573
+ # Find your reservations
574
+ gcloud compute reservations list --project=$PROJECT_ID
575
+
576
+ # Run cluster create with reservation.
577
+ python3 xpk.py cluster create \
578
+ --cluster xpk-test --device-type=h100-80gb-8 \
579
+ --num-nodes=2 \
580
+ --reservation=$RESERVATION_ID
581
+ ```
582
+
583
+ * Cluster Delete (deprovision capacity):
584
+
585
+ ```shell
586
+ python3 xpk.py cluster delete \
587
+ --cluster xpk-test
588
+ ```
589
+
590
+ * Cluster List (see provisioned capacity):
591
+
592
+ ```shell
593
+ python3 xpk.py cluster list
594
+ ```
595
+
596
+ * Cluster Describe (see capacity):
597
+
598
+ ```shell
599
+ python3 xpk.py cluster describe \
600
+ --cluster xpk-test
601
+ ```
602
+
603
+
604
+ * Cluster Cacheimage (enables faster start times):
605
+
606
+ ```shell
607
+ python3 xpk.py cluster cacheimage \
608
+ --cluster xpk-test --docker-image gcr.io/your_docker_image \
609
+ --device-type=h100-80gb-8
610
+ ```
611
+
612
+
613
+ * [Install NVIDIA GPU device drivers](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install)
614
+ ```shell
615
+ # List available driver versions
616
+ gcloud compute ssh $NODE_NAME --command "sudo cos-extensions list"
617
+
618
+ # Install the default driver
619
+ gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu"
620
+ # OR install a specific version of the driver
621
+ gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu -- -version=DRIVER_VERSION"
622
+ ```
623
+
624
+ * Run a workload:
625
+
626
+ ```shell
627
+ # Submit a workload
628
+ python3 xpk.py workload create \
629
+ --cluster xpk-test --device-type h100-80gb-8 \
630
+ --workload xpk-test-workload \
631
+ --command="echo hello world"
632
+ ```
633
+
634
+ * Workload Delete (delete training job):
635
+
636
+ ```shell
637
+ python3 xpk.py workload delete \
638
+ --workload xpk-test-workload --cluster xpk-test
639
+ ```
640
+
641
+ This will only delete `xpk-test-workload` workload in `xpk-test` cluster.
642
+
643
+ * Workload Delete (delete all training jobs in the cluster):
644
+
645
+ ```shell
646
+ python3 xpk.py workload delete \
647
+ --cluster xpk-test
648
+ ```
649
+
650
+ This will delete all the workloads in `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt.
651
+
652
+ * Workload Delete supports filtering. Delete a portion of jobs that match user criteria.
653
+ * Filter by Job: `filter-by-job`
654
+
655
+ ```shell
656
+ python3 xpk.py workload delete \
657
+ --cluster xpk-test --filter-by-job=$USER
658
+ ```
659
+
660
+ This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.
661
+
662
+ * Filter by Status: `filter-by-status`
663
+
664
+ ```shell
665
+ python3 xpk.py workload delete \
666
+ --cluster xpk-test --filter-by-status=QUEUED
667
+ ```
668
+
669
+ This will delete all the workloads in the `xpk-test` cluster whose status is Admitted or Evicted and whose number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
670
+
671
+ ## CPU usage
672
+
673
+ To use XPK for CPUs, use the `--device-type` flag.
674
+
675
+ * Cluster Create (provision on-demand capacity):
676
+
677
+ ```shell
678
+ # Run cluster create with on demand capacity.
679
+ python3 xpk.py cluster create \
680
+ --cluster xpk-test \
681
+ --device-type=n2-standard-32-256 \
682
+ --num-slices=1 \
683
+ --default-pool-cpu-machine-type=n2-standard-32 \
684
+ --on-demand
685
+ ```
686
+ Note that `device-type` for CPUs is of the format `<cpu-machine-type>-<number of VMs>`; thus in the above example, the user requests 256 VMs of type n2-standard-32.
+ Currently, workloads using fewer than 1000 VMs are supported.
688
+
689
+ * Run a workload:
690
+
691
+ ```shell
692
+ # Submit a workload
693
+ python3 xpk.py workload create \
694
+ --cluster xpk-test \
695
+ --num-slices=1 \
696
+ --device-type=n2-standard-32-256 \
697
+ --workload xpk-test-workload \
698
+ --command="echo hello world"
699
+ ```
700
+
701
+ # Autoprovisioning with XPK
702
+ XPK can dynamically allocate cluster capacity using [Node Auto Provisioning (NAP)](https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#use_accelerators_for_new_auto-provisioned_node_pools) support.
+ 
+ This allows several topology sizes to be supported from one XPK cluster, provisioned dynamically based on incoming workload requests, so XPK users do not need to re-provision the cluster manually.
+ 
+ Enabling autoprovisioning will initially take the cluster up to **30 minutes to upgrade**.
707
+
708
+ ## Create a cluster with autoprovisioning:
709
+
710
+ Autoprovisioning will be enabled on the below cluster, which can scale
+ between [0, 8] chips of v4 TPU (up to 1x v4-16).
712
+
713
+ XPK doesn't currently support different generations of accelerators in the same cluster (like v4 and v5p TPUs).
714
+
715
+ ```shell
716
+ CLUSTER_NAME=my_cluster
717
+ NUM_SLICES=2
718
+ DEVICE_TYPE=v4-8
719
+ RESERVATION=reservation_id
720
+ PROJECT=my_project
721
+ ZONE=us-east5-b
722
+
723
+ python3 xpk.py cluster create \
724
+ --cluster $CLUSTER_NAME \
725
+ --num-slices=$NUM_SLICES \
726
+ --device-type=$DEVICE_TYPE \
727
+ --zone=$ZONE \
728
+ --project=$PROJECT \
729
+ --reservation=$RESERVATION \
730
+ --enable-autoprovisioning
731
+ ```
732
+
733
+ 1. Define the starting accelerator configuration and capacity type.
734
+
735
+ ```shell
736
+ --device-type=$DEVICE_TYPE \
737
+ --num-slices=$NUM_SLICES
738
+ ```
739
+ 2. Optionally set custom `minimum` / `maximum` chips. NAP will rescale the cluster between `minimum` and `maximum` chips. By default, `maximum` is set to the current cluster configuration size, and `minimum` is set to 0. This allows NAP to rescale with all the resources. A full example follows this list.
740
+
741
+ ```shell
742
+ --autoprovisioning-min-chips=$MIN_CHIPS \
743
+ --autoprovisioning-max-chips=$MAX_CHIPS
744
+ ```
745
+
746
+ 3. `FEATURE TO COME SOON:` Set the timeout period before node pools are automatically deleted
+ if no incoming workloads are run. This is currently 10 minutes.
748
+
749
+ 4. `FEATURE TO COME:` Set the timeout period to infinity. This will keep the idle node pool configuration always running until updated by new workloads.
750
+
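+ For example, a cluster that NAP may scale between 0 and 16 v4 chips (a sketch reusing the variables defined above; the chip limits are illustrative):
+ 
+ ```shell
+ python3 xpk.py cluster create \
+ --cluster $CLUSTER_NAME \
+ --num-slices=$NUM_SLICES \
+ --device-type=$DEVICE_TYPE \
+ --zone=$ZONE \
+ --project=$PROJECT \
+ --reservation=$RESERVATION \
+ --enable-autoprovisioning \
+ --autoprovisioning-min-chips=0 \
+ --autoprovisioning-max-chips=16
+ ```
+ 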
751
+ ### Update a cluster with autoprovisioning:
752
+ ```shell
753
+ CLUSTER_NAME=my_cluster
754
+ NUM_SLICES=2
755
+ DEVICE_TYPE=v4-8
756
+ RESERVATION=reservation_id
757
+ PROJECT=my_project
758
+ ZONE=us-east5-b
759
+
760
+ python3 xpk.py cluster create \
761
+ --cluster $CLUSTER_NAME \
762
+ --num-slices=$NUM_SLICES \
763
+ --device-type=$DEVICE_TYPE \
764
+ --zone=$ZONE \
765
+ --project=$PROJECT \
766
+ --reservation=$RESERVATION \
767
+ --enable-autoprovisioning
768
+ ```
769
+
770
+ ### Update a previously autoprovisioned cluster with a different number of chips:
771
+
772
+ * Option 1: By creating a new cluster nodepool configuration.
773
+
774
+ ```shell
775
+ CLUSTER_NAME=my_cluster
776
+ NUM_SLICES=2
777
+ DEVICE_TYPE=v4-16
778
+ RESERVATION=reservation_id
779
+ PROJECT=my_project
780
+ ZONE=us-east5-b
781
+
782
+ # This will create 2x v4-16 node pools and set the max autoprovisioned chips to 16.
783
+ python3 xpk.py cluster create \
784
+ --cluster $CLUSTER_NAME \
785
+ --num-slices=$NUM_SLICES \
786
+ --device-type=$DEVICE_TYPE \
787
+ --zone=$ZONE \
788
+ --project=$PROJECT \
789
+ --reservation=$RESERVATION \
790
+ --enable-autoprovisioning
791
+ ```
792
+
793
+ * Option 2: By increasing the `--autoprovisioning-max-chips`.
794
+ ```shell
795
+ CLUSTER_NAME=my_cluster
796
+ NUM_SLICES=0
797
+ DEVICE_TYPE=v4-16
798
+ RESERVATION=reservation_id
799
+ PROJECT=my_project
800
+ ZONE=us-east5-b
801
+
802
+ # This will clear the node pools if they exist in the cluster and set the max autoprovisioned chips to 16
803
+ python3 xpk.py cluster create \
804
+ --cluster $CLUSTER_NAME \
805
+ --num-slices=$NUM_SLICES \
806
+ --device-type=$DEVICE_TYPE \
807
+ --zone=$ZONE \
808
+ --project=$PROJECT \
809
+ --reservation=$RESERVATION \
810
+ --enable-autoprovisioning \
811
+ --autoprovisioning-max-chips 16
812
+ ```
813
+
814
+ ## Run workloads on the cluster with autoprovisioning:
815
+ Reconfigure the `--device-type` and `--num-slices` as needed.
816
+ ```shell
817
+ CLUSTER_NAME=my_cluster
818
+ NUM_SLICES=2
819
+ DEVICE_TYPE=v4-8
820
+ NEW_RESERVATION=new_reservation_id
821
+ PROJECT=my_project
822
+ ZONE=us-east5-b
823
+ # Create a 2x v4-8 TPU workload.
824
+ python3 xpk.py workload create \
825
+ --cluster $CLUSTER_NAME \
826
+ --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
827
+ --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
828
+ --device-type=$DEVICE_TYPE \
829
+ --num-slices=$NUM_SLICES \
830
+ --zone=$ZONE \
831
+ --project=$PROJECT
832
+
833
+ NUM_SLICES=1
834
+ DEVICE_TYPE=v4-16
835
+
836
+ # Create a 1x v4-16 TPU workload.
837
+ python3 xpk.py workload create \
838
+ --cluster $CLUSTER_NAME \
839
+ --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
840
+ --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
841
+ --device-type=$DEVICE_TYPE \
842
+ --num-slices=$NUM_SLICES \
843
+ --zone=$ZONE \
844
+ --project=$PROJECT
845
+
846
+ # Use a different reservation from what the cluster was created with.
847
+ python3 xpk.py workload create \
848
+ --cluster $CLUSTER_NAME \
849
+ --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
850
+ --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
851
+ --device-type=$DEVICE_TYPE \
852
+ --num-slices=$NUM_SLICES \
853
+ --zone=$ZONE \
854
+ --project=$PROJECT \
855
+ --reservation=$NEW_RESERVATION
856
+ ```
857
+
858
+ 1. (Optional) Define the capacity type. By default, the capacity type will
859
+ match what the cluster was created with.
860
+
861
+ ```shell
862
+ --reservation=my-reservation-id | --on-demand | --spot
863
+ ```
864
+
865
+ 2. Set the topology of your workload using `--device-type` and `--num-slices`.
866
+
867
+ ```shell
868
+ NUM_SLICES=1
869
+ DEVICE_TYPE=v4-8
870
+ --device-type=$DEVICE_TYPE \
871
+ --num-slices=$NUM_SLICES \
872
+ ```
873
+
874
+
875
+ # How to add docker images to an xpk workload
+ 
+ By default, `xpk workload create` will layer the local directory (`--script-dir`) into
+ the base docker image (`--base-docker-image`) and run the workload command.
+ If you don't want this layering behavior, you can directly use `--docker-image`. Do not mix arguments from the two flows in the same command.
880
+
881
+ ## Recommended / Default Docker Flow: `--base-docker-image` and `--script-dir`
882
+ This flow pulls the `--script-dir` into the `--base-docker-image` and runs the new docker image.
883
+
884
+ * The below arguments are optional. By default, xpk will pull the local
+ directory into a generic base docker image.
886
+
887
+ - `--base-docker-image` sets the base image that xpk will start with.
888
+
889
+ - `--script-dir` sets which directory to pull into the image. This defaults to the current working directory.
890
+
891
+ See `python3 xpk.py workload create --help` for more info.
892
+
893
+ * Example with defaults which pulls the local directory into the base image:
894
+ ```shell
895
+ echo -e '#!/bin/bash \n echo "Hello world from a test script!"' > test.sh
896
+ python3 xpk.py workload create --cluster xpk-test \
897
+ --workload xpk-test-workload-base-image --command "bash test.sh" \
898
+ --tpu-type=v5litepod-16 --num-slices=1
899
+ ```
900
+
901
+ * Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators):
902
+ ```shell
903
+ python3 xpk.py workload create --cluster xpk-test \
904
+ --workload xpk-test-workload-base-image --command "bash custom_script.sh" \
905
+ --base-docker-image=gcr.io/your_dependencies_docker_image \
906
+ --tpu-type=v5litepod-16 --num-slices=1
907
+ ```
908
+
909
+ ## Optional Direct Docker Image Configuration: `--docker-image`
910
+ If a user wants to directly set the docker image used and not layer in the
911
+ current working directory, set `--docker-image` to the image to be used in the
912
+ workload.
913
+
914
+ * Running with `--docker-image`:
915
+ ```shell
916
+ python3 xpk.py workload create --cluster xpk-test \
917
+ --workload xpk-test-workload-base-image --command "bash test.sh" \
918
+ --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
919
+ ```
920
+
921
+ * Recommended Flow For Large Sized Jobs (more than 10k accelerators):
922
+ ```shell
923
+ python3 xpk.py cluster cacheimage \
924
+ --cluster xpk-test --docker-image gcr.io/your_docker_image
925
+ # Run workload create with the same image.
926
+ python3 xpk.py workload create --cluster xpk-test \
927
+ --workload xpk-test-workload-base-image --command "bash test.sh" \
928
+ --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
929
+ ```
930
+
931
+ # More advanced facts:
932
+
933
+ * Workload create has two mutually exclusive ways to override the environment of a workload:
934
+ * a `--env` flag to specify each environment variable separately. The format is:
935
+
936
+ `--env VARIABLE1=value --env VARIABLE2=value`
937
+
938
+ * a `--env-file` flag to allow specifying the container's
939
+ environment from a file. Usage is the same as Docker's
940
+ [--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env)
941
+
942
+ Example Env File:
943
+ ```shell
944
+ LIBTPU_INIT_ARGS=--my-flag=true --performance=high
945
+ MY_ENV_VAR=hello
946
+ ```
947
+
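+ For example, setting variables directly with `--env` (a sketch; the variable names and values are placeholders):
+ 
+ ```shell
+ python3 xpk.py workload create --cluster xpk-test \
+ --workload xpk-test-workload --command "bash test.sh" \
+ --tpu-type=v5litepod-16 --num-slices=1 \
+ --env MY_ENV_VAR=hello --env MY_OTHER_VAR=world
+ ```
+ 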
948
+ * Workload create accepts a `--debug-dump-gcs` flag which is a path to a GCS bucket.
+ Passing this flag sets `XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/'` and uploads
+ HLO dumps to the specified GCS bucket for each worker.
951
+
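+ For example (a sketch; the bucket path is a placeholder):
+ 
+ ```shell
+ python3 xpk.py workload create --cluster xpk-test \
+ --workload xpk-test-workload --command "bash test.sh" \
+ --tpu-type=v5litepod-16 --num-slices=1 \
+ --debug-dump-gcs gs://my-bucket/xla-dumps
+ ```
+ 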
952
+ # Integration Test Workflows
953
+ The repository code is tested through GitHub Workflows and Actions. Currently, three kinds of tests are performed:
954
+ * A nightly build that runs every 24 hours
955
+ * A build that runs on push to `main` branch
956
+ * A build that runs for every PR approval
957
+
958
+ More information is documented [here](https://github.com/google/xpk/tree/main/.github/workflows)
959
+
960
+ # Troubleshooting
961
+
962
+ ## `Invalid machine type` for CPUs.
963
+ XPK will create a regional GKE cluster. If you see issues like
964
+
965
+ ```shell
966
+ Invalid machine type e2-standard-32 in zone $ZONE_NAME
967
+ ```
968
+
969
+ Please select a CPU type that exists in all zones in the region.
970
+
971
+ ```shell
972
+ # Find CPU Types supported in zones.
973
+ gcloud compute machine-types list --zones=$ZONE_LIST
974
+ # Adjust default cpu machine type.
975
+ python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
976
+ ```
977
+
978
+ ## Permission Issues: `requires one of ["permission_name"] permission(s)`.
979
+
980
+ 1) Determine the role needed based on the permission error:
981
+
982
+ ```shell
983
+ # For example: `requires one of ["container.*"] permission(s)`
984
+ # Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
985
+ ```
986
+
987
+ 2) Add the role to the user in your project.
988
+
989
+ Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli:
990
+ ```shell
991
+ PROJECT_ID=my-project-id
992
+ CURRENT_GKE_USER=$(gcloud config get account)
993
+ ROLE=roles/container.admin # container.admin is the role needed for Kubernetes Engine Admin
994
+ gcloud projects add-iam-policy-binding $PROJECT_ID --member user:$CURRENT_GKE_USER --role=$ROLE
995
+ ```
996
+
997
+ 3) Check that the permissions are correct for your user.
998
+
999
+ Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli:
1000
+
1001
+ ```shell
1002
+ PROJECT_ID=my-project-id
1003
+ CURRENT_GKE_USER=$(gcloud config get account)
1004
+ gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$CURRENT_GKE_USER" --flatten="bindings[].members"
1005
+ ```
1006
+
1007
+ 4) Confirm you have logged in locally with the correct user.
1008
+
1009
+ ```shell
1010
+ gcloud auth login
1011
+ ```
1012
+
1013
+ ### Roles needed based on permission errors:
1014
+
1015
+ * `requires one of ["container.*"] permission(s)`
1016
+
1017
+ Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
1018
+
1019
+ * `ERROR: (gcloud.monitoring.dashboards.list) User does not have permission to access projects instance (or it may not exist)`
1020
+
1021
+ Add [Monitoring Viewer](https://cloud.google.com/iam/docs/understanding-roles#monitoring.viewer) to your user.
1022
+
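+ For example, Monitoring Viewer can be granted the same way as above (role ID `roles/monitoring.viewer`; reusing the variables from step 2):
+ 
+ ```shell
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
+ --member user:$CURRENT_GKE_USER --role=roles/monitoring.viewer
+ ```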
1023
+
1024
+ ## Reservation Troubleshooting:
1025
+
1026
+ ### How to determine your reservation and its size / utilization:
1027
+
1028
+ ```shell
1029
+ PROJECT_ID=my-project
1030
+ ZONE=us-east5-b
1031
+ RESERVATION=my-reservation-name
1032
+ # Find the reservations in your project
1033
+ gcloud beta compute reservations list --project=$PROJECT_ID
1034
+ # Find the tpu machine type and current utilization of a reservation.
1035
+ gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
1036
+ ```
1037
+
1038
+ # TPU Workload Debugging
1039
+
1040
+ ## Verbose Logging
1041
+ If you are having trouble with your workload, try setting the `--enable-debug-logs` flag when you schedule it. This will give you more detailed logs to help pinpoint the issue. For example:
1042
+ ```shell
1043
+ python3 xpk.py workload create \
1044
+ --cluster xpk-test --workload xpk-test-workload \
1045
+ --command="echo hello world" --enable-debug-logs
1046
+ ```
1047
+ Please check [libtpu logging](https://cloud.google.com/tpu/docs/troubleshooting/trouble-tf#debug_logs) and [Tensorflow logging](https://deepreg.readthedocs.io/en/latest/docs/logging.html#tensorflow-logging) for more information about the flags that are enabled to get the logs.
1048
+
1049
+ ## Collect Stack Traces
1050
+ The [cloud-tpu-diagnostics](https://pypi.org/project/cloud-tpu-diagnostics/) PyPI package can be used to generate stack traces for workloads running in GKE. This package dumps the Python traces when a fault such as a segmentation fault, floating-point exception, or illegal operation exception occurs in the program. Additionally, it will also periodically collect stack traces to help you debug situations when the program is unresponsive. You must make the following changes in the docker image running in a Kubernetes main container to enable periodic stack trace collection.
1051
+ ```python
1052
+ # main.py
1053
+
1054
+ from cloud_tpu_diagnostics import diagnostic
1055
+ from cloud_tpu_diagnostics.configuration import debug_configuration
1056
+ from cloud_tpu_diagnostics.configuration import diagnostic_configuration
1057
+ from cloud_tpu_diagnostics.configuration import stack_trace_configuration
1058
+
1059
+ stack_trace_config = stack_trace_configuration.StackTraceConfig(
1060
+ collect_stack_trace = True,
1061
+ stack_trace_to_cloud = True)
1062
+ debug_config = debug_configuration.DebugConfig(
1063
+ stack_trace_config = stack_trace_config)
1064
+ diagnostic_config = diagnostic_configuration.DiagnosticConfig(
1065
+ debug_config = debug_config)
1066
+
1067
+ with diagnostic.diagnose(diagnostic_config):
1068
+ main_method() # this is the main method to run
1069
+ ```
1070
+ This configuration will start collecting stack traces inside the `/tmp/debugging` directory on each Kubernetes Pod.
1071
+
1072
+ ### Explore Stack Traces
1073
+ To explore the stack traces collected in a temporary directory in the Kubernetes Pod, you can run the following command to configure a sidecar container that will read the traces from the `/tmp/debugging` directory.
1074
+ ```shell
1075
+ python3 xpk.py workload create \
1076
+ --workload xpk-test-workload --command "python3 main.py" --cluster \
1077
+ xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar
1078
+ ```