xpk-0.14.4-py3-none-any.whl → xpk-0.15.0-py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- integration/gcluster_a3mega_test.py +11 -0
- integration/gcluster_a3ultra_test.py +11 -0
- integration/gcluster_a4_test.py +11 -0
- xpk/commands/cluster.py +57 -21
- xpk/commands/cluster_gcluster.py +25 -5
- xpk/commands/cluster_gcluster_test.py +11 -2
- xpk/commands/cluster_test.py +233 -12
- xpk/commands/config.py +3 -5
- xpk/commands/kind.py +1 -1
- xpk/commands/storage.py +8 -10
- xpk/commands/workload.py +28 -12
- xpk/commands/workload_test.py +3 -3
- xpk/core/blueprint/blueprint_generator.py +70 -33
- xpk/core/blueprint/blueprint_test.py +9 -0
- xpk/core/capacity.py +46 -8
- xpk/core/capacity_test.py +32 -1
- xpk/core/cluster.py +37 -57
- xpk/core/cluster_test.py +95 -0
- xpk/core/commands.py +4 -10
- xpk/core/config.py +9 -2
- xpk/core/gcloud_context.py +18 -12
- xpk/core/gcloud_context_test.py +111 -1
- xpk/core/kjob.py +6 -9
- xpk/core/kueue_manager.py +192 -32
- xpk/core/kueue_manager_test.py +132 -4
- xpk/core/nodepool.py +21 -29
- xpk/core/nodepool_test.py +17 -15
- xpk/core/scheduling.py +16 -1
- xpk/core/scheduling_test.py +85 -6
- xpk/core/system_characteristics.py +77 -19
- xpk/core/system_characteristics_test.py +80 -5
- xpk/core/telemetry.py +263 -0
- xpk/core/telemetry_test.py +211 -0
- xpk/main.py +31 -13
- xpk/parser/cluster.py +48 -9
- xpk/parser/cluster_test.py +42 -3
- xpk/parser/workload.py +12 -0
- xpk/parser/workload_test.py +4 -4
- xpk/telemetry_uploader.py +29 -0
- xpk/templates/kueue_gke_default_topology.yaml.j2 +1 -1
- xpk/templates/kueue_sub_slicing_topology.yaml.j2 +3 -8
- xpk/utils/console.py +41 -10
- xpk/utils/console_test.py +106 -0
- xpk/utils/feature_flags.py +7 -1
- xpk/utils/file.py +4 -1
- xpk/utils/topology.py +4 -0
- xpk/utils/user_agent.py +35 -0
- xpk/utils/user_agent_test.py +44 -0
- xpk/utils/user_input.py +48 -0
- xpk/utils/user_input_test.py +92 -0
- xpk/utils/validation.py +0 -11
- xpk/utils/versions.py +31 -0
- {xpk-0.14.4.dist-info → xpk-0.15.0.dist-info}/METADATA +113 -92
- {xpk-0.14.4.dist-info → xpk-0.15.0.dist-info}/RECORD +58 -48
- {xpk-0.14.4.dist-info → xpk-0.15.0.dist-info}/WHEEL +0 -0
- {xpk-0.14.4.dist-info → xpk-0.15.0.dist-info}/entry_points.txt +0 -0
- {xpk-0.14.4.dist-info → xpk-0.15.0.dist-info}/licenses/LICENSE +0 -0
- {xpk-0.14.4.dist-info → xpk-0.15.0.dist-info}/top_level.txt +0 -0
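The remainder of this page is the METADATA diff between the two wheels. To reproduce a similar comparison locally, one rough sketch using standard tooling (the `old/` and `new/` directory names are arbitrary choices, not part of the registry's pipeline):

```shell
# Fetch both wheels without resolving dependencies.
pip download --no-deps xpk==0.14.4 -d old
pip download --no-deps xpk==0.15.0 -d new
# Unpack each wheel (a .whl is a zip archive).
unzip -q old/xpk-0.14.4-py3-none-any.whl -d old/unpacked
unzip -q new/xpk-0.15.0-py3-none-any.whl -d new/unpacked
# Produce a unified diff of the package contents.
diff -ru old/unpacked new/unpacked
```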
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: xpk
-Version: 0.14.4
+Version: 0.15.0
 Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
 Author-email: XPK team <xpk-code-reviewers@google.com>
 License: Apache-2.0
@@ -11,6 +11,7 @@ Classifier: Programming Language :: Python :: 3.11
 Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
+Requires-Dist: argcomplete==3.6.3
 Requires-Dist: cloud-accelerator-diagnostics==0.1.1
 Requires-Dist: tabulate==0.9.0
 Requires-Dist: ruamel.yaml==0.18.10
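The new `Requires-Dist: argcomplete==3.6.3` pin above is resolved automatically by pip on upgrade; as a quick check (a sketch, not part of the diff):

```shell
# Upgrade and let pip pull in the newly pinned dependency.
pip install --upgrade xpk==0.15.0
# Confirm the installed argcomplete version matches the ==3.6.3 pin.
pip show argcomplete
```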
@@ -142,6 +143,18 @@ The following tools must be installed:
 - git: [installation instructions](https://git-scm.com/downloads/linux)
 - make: install by running `apt-get -y install make` (`sudo` might be required)
 
+### Additional prerequisites to enable bash completion
+
+- Install [argcomplete](https://pypi.org/project/argcomplete/) globally on your machine.
+```shell
+pip install argcomplete
+activate-global-python-argcomplete
+```
+- Configure `argcomplete` for XPK.
+```shell
+eval "$(register-python-argcomplete xpk)"
+```
+
 ## Installation via pip
 
 To install XPK using pip, first install required tools mentioned in [prerequisites](#prerequisites) and [additional prerequisites](#additional-prerequisites-when-installing-from-pip). Then you can install XPK simply by running:
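Note that the `eval "$(register-python-argcomplete xpk)"` hook added above only lasts for the current shell session. The usual way to make it permanent (standard argcomplete practice, not something this diff adds) is to append it to your shell rc file:

```shell
# Persist xpk tab completion across shell sessions.
echo 'eval "$(register-python-argcomplete xpk)"' >> ~/.bashrc
```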
@@ -243,7 +256,7 @@ all zones.
 # Find your reservations
 gcloud compute reservations list --project=$PROJECT_ID
 # Run cluster create with reservation.
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --tpu-type=v5litepod-256 \
 --num-slices=2 \
 --reservation=$RESERVATION_ID
@@ -252,7 +265,7 @@ all zones.
 * Cluster Create (provision on-demand capacity):
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --tpu-type=v5litepod-16 \
 --num-slices=4 --on-demand
 ```
@@ -260,22 +273,30 @@ all zones.
 * Cluster Create (provision spot / preemptable capacity):
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --tpu-type=v5litepod-16 \
 --num-slices=4 --spot
 ```
 
 * Cluster Create (DWS flex queued capacity):
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --tpu-type=v5litepod-16 \
 --num-slices=4 --flex
 ```
 
+* Cluster Create with CPU and/or memory quota:
+```shell
+xpk cluster create \
+--cluster xpk-test --tpu-type=v5litepod-16 \
+--cpu-limit=112 --memory-limit=192Gi \
+--on-demand
+```
+
 * Cluster Create for Pathways:
 Pathways compatible cluster can be created using `cluster create-pathways`.
 ```shell
-python3 xpk.py cluster create-pathways \
+xpk cluster create-pathways \
 --cluster xpk-pw-test \
 --num-slices=4 --on-demand \
 --tpu-type=v5litepod-16
@@ -285,7 +306,7 @@ all zones.
 * Cluster Create for Ray:
 A cluster with KubeRay enabled and a RayCluster can be created using `cluster create-ray`.
 ```shell
-python3 xpk.py cluster create-ray \
+xpk cluster create-ray \
 --cluster xpk-rc-test \
 --ray-version=2.39.0 \
 --num-slices=4 --on-demand \
@@ -298,7 +319,7 @@ all zones.
 For example, if a user creates a cluster with 4 slices:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --tpu-type=v5litepod-16 \
 --num-slices=4 --reservation=$RESERVATION_ID
 ```
@@ -307,7 +328,7 @@ all zones.
 new slices:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --tpu-type=v5litepod-16 \
 --num-slices=8 --reservation=$RESERVATION_ID
 ```
@@ -317,13 +338,13 @@ all zones.
 Use `--force` to skip prompts.
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --tpu-type=v5litepod-16 \
 --num-slices=6 --reservation=$RESERVATION_ID
 
 # Skip delete prompts using --force.
 
-python3 xpk.py cluster create --force \
+xpk cluster create --force \
 --cluster xpk-test --tpu-type=v5litepod-16 \
 --num-slices=6 --reservation=$RESERVATION_ID
 ```
@@ -333,13 +354,13 @@ all zones.
 user when deleting slices. Use `--force` to skip prompts.
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --tpu-type=v4-8 \
 --num-slices=4 --reservation=$RESERVATION_ID
 
 # Skip delete prompts using --force.
 
-python3 xpk.py cluster create --force \
+xpk cluster create --force \
 --cluster xpk-test --tpu-type=v4-8 \
 --num-slices=4 --reservation=$RESERVATION_ID
 ```
@@ -370,7 +391,7 @@ This argument allows you to specify additional IP ranges (in CIDR notation) that
 * To create a private cluster and allow access to Control Plane only to your current machine:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster=xpk-private-cluster \
 --tpu-type=v4-8 --num-slices=2 \
 --private
@@ -379,7 +400,7 @@ This argument allows you to specify additional IP ranges (in CIDR notation) that
 * To create a private cluster and allow access to Control Plane only to your current machine and the IP ranges `1.2.3.0/24` and `1.2.4.5/32`:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster=xpk-private-cluster \
 --tpu-type=v4-8 --num-slices=2 \
 --authorized-networks 1.2.3.0/24 1.2.4.5/32
@@ -405,7 +426,7 @@ You can create a Vertex AI Tensorboard for your cluster with `Cluster Create` co
 * Create Vertex AI Tensorboard in default region with default Tensorboard name:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
 --create-vertex-tensorboard
 ```
@@ -415,7 +436,7 @@ will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<args
 * Create Vertex AI Tensorboard in user-specified region with default Tensorboard name:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
 --create-vertex-tensorboard --tensorboard-region=us-west1
 ```
@@ -425,7 +446,7 @@ will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<args
 * Create Vertex AI Tensorboard in default region with user-specified Tensorboard name:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
 --create-vertex-tensorboard --tensorboard-name=tb-testing
 ```
@@ -435,7 +456,7 @@ will create a Vertex AI Tensorboard with the name `tb-testing` in `us-central1`.
 * Create Vertex AI Tensorboard in user-specified region with user-specified Tensorboard name:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
 --create-vertex-tensorboard --tensorboard-region=us-west1 --tensorboard-name=tb-testing
 ```
@@ -445,7 +466,7 @@ will create a Vertex AI Tensorboard instance with the name `tb-testing` in `us-w
 * Create Vertex AI Tensorboard in an unsupported region:
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
 --create-vertex-tensorboard --tensorboard-region=us-central2
 ```
@@ -456,20 +477,20 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp
 * Cluster Delete (deprovision capacity):
 
 ```shell
-python3 xpk.py cluster delete \
+xpk cluster delete \
 --cluster xpk-test
 ```
 ## Cluster List
 * Cluster List (see provisioned capacity):
 
 ```shell
-python3 xpk.py cluster list
+xpk cluster list
 ```
 ## Cluster Describe
 * Cluster Describe (see capacity):
 
 ```shell
-python3 xpk.py cluster describe \
+xpk cluster describe \
 --cluster xpk-test
 ```
 
@@ -477,7 +498,7 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp
 * Cluster Cacheimage (enables faster start times):
 
 ```shell
-python3 xpk.py cluster cacheimage \
+xpk cluster cacheimage \
 --cluster xpk-test --docker-image gcr.io/your_docker_image \
 --tpu-type=v5litepod-16
 ```
@@ -495,7 +516,7 @@ A4 | `b200-8`
 
 
 ```shell
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster CLUSTER_NAME --device-type DEVICE_TYPE \
 --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
 --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID
@@ -520,7 +541,7 @@ Currently `xpk cluster adapt` supports only the following device types:
 
 Example usage:
 ```shell
-python3 xpk.py cluster adapt \
+xpk cluster adapt \
 --cluster=$CLUSTER_NAME --device-type=$DEVICE_TYPE \
 --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
 --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID
@@ -542,7 +563,7 @@ To use the GCS FUSE with XPK you need to create a [Storage Bucket](https://conso
 Once it's ready you can use `xpk storage attach` with `--type=gcsfuse` command to attach a FUSE storage instance to your cluster:
 
 ```shell
-python3 xpk.py storage attach test-fuse-storage --type=gcsfuse \
+xpk storage attach test-fuse-storage --type=gcsfuse \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
 --mount-point='/test-mount-point' --readonly=false \
 --bucket=test-bucket --size=1 --auto-mount=false
@@ -567,7 +588,7 @@ A Filestore adapter lets you mount and access [Filestore instances](https://clou
 To create and attach a GCP Filestore instance to your cluster use `xpk storage create` command with `--type=gcpfilestore`:
 
 ```shell
-python3 xpk.py storage create test-fs-storage --type=gcpfilestore \
+xpk storage create test-fs-storage --type=gcpfilestore \
 --auto-mount=false --mount-point=/data-fs --readonly=false \
 --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
@@ -576,7 +597,7 @@ python3 xpk.py storage create test-fs-storage --type=gcpfilestore \
 You can also attach an existing Filestore instance to your cluster using `xpk storage attach` command:
 
 ```shell
-python3 xpk.py storage attach test-fs-storage --type=gcpfilestore \
+xpk storage attach test-fs-storage --type=gcpfilestore \
 --auto-mount=false --mount-point=/data-fs --readonly=false \
 --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
@@ -605,7 +626,7 @@ To use the GCS Parallelstore with XPK you need to create a [Parallelstore Instan
 Once it's ready you can use `xpk storage attach` with `--type=parallelstore` command to attach a Parallelstore instance to your cluster. Currently, attaching a Parallelstore is supported only by providing a manifest file.
 
 ```shell
-python3 xpk.py storage attach test-parallelstore-storage --type=parallelstore \
+xpk storage attach test-parallelstore-storage --type=parallelstore \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
 --mount-point='/test-mount-point' --readonly=false \
 --auto-mount=true \
@@ -629,7 +650,7 @@ To use the GCE PersistentDisk with XPK you need to create a [disk in GCE](https:
 Once it's ready you can use `xpk storage attach` with `--type=pd` command to attach a PersistentDisk instance to your cluster. Currently, attaching a PersistentDisk is supported only by providing a manifest file.
 
 ```shell
-python3 xpk.py storage attach test-pd-storage --type=pd \
+xpk storage attach test-pd-storage --type=pd \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
 --mount-point='/test-mount-point' --readonly=false \
 --auto-mount=true \
@@ -653,7 +674,7 @@ To use the GCP Managed Lustre with XPK you need to create [an instance](https://
 Once it's ready you can use `xpk storage attach` with `--type=lustre` command to attach a Managed Lustre instance to your cluster. Currently, attaching a Managed Lustre instance is supported only by providing a manifest file.
 
 ```shell
-python3 xpk.py storage attach test-lustre-storage --type=lustre \
+xpk storage attach test-lustre-storage --type=lustre \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
 --mount-point='/test-mount-point' --readonly=false \
 --auto-mount=true \
@@ -671,7 +692,7 @@ Parameters:
 ### List attached storages
 
 ```shell
-python3 xpk.py storage list \
+xpk storage list \
 --project=$PROJECT --cluster $CLUSTER --zone=$ZONE
 ```
 
@@ -680,7 +701,7 @@ python3 xpk.py storage list \
 If you specified `--auto-mount=true` when creating or attaching a storage, then all workloads deployed on the cluster will have the volume attached by default. Otherwise, in order to have the storage attached, you have to add `--storage` parameter to `workload create` command:
 
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --workload xpk-test-workload --command "echo goodbye" \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
 --tpu-type=v5litepod-16 --storage=test-storage
@@ -689,7 +710,7 @@ python3 xpk.py workload create \
 ### Detaching storage
 
 ```shell
-python3 xpk.py storage detach $STORAGE_NAME \
+xpk storage detach $STORAGE_NAME \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
 ```
 
@@ -698,7 +719,7 @@ python3 xpk.py storage detach $STORAGE_NAME \
 XPK allows you to remove Filestore instances easily with `xpk storage delete` command. **Warning:** this deletes all data contained in the Filestore!
 
 ```shell
-python3 xpk.py storage delete test-fs-instance \
+xpk storage delete test-fs-instance \
 --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
 ```
 
@@ -706,14 +727,14 @@ python3 xpk.py storage delete test-fs-instance \
 * Workload Create (submit training job):
 
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --workload xpk-test-workload --command "echo goodbye" \
 --cluster xpk-test \
 --tpu-type=v5litepod-16 --project=$PROJECT
 ```
 * Workload create(DWS flex with queued provisioning):
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --workload xpk-test-workload --command "echo goodbye" \
 --cluster xpk-test --flex \
 --tpu-type=v5litepod-16 --project=$PROJECT
@@ -723,7 +744,7 @@ python3 xpk.py storage delete test-fs-instance \
 
 Pathways workload example:
 ```shell
-python3 xpk.py workload create-pathways \
+xpk workload create-pathways \
 --workload xpk-pw-test \
 --num-slices=1 \
 --tpu-type=v5litepod-16 \
@@ -737,7 +758,7 @@ python3 xpk.py storage delete test-fs-instance \
 
 Pathways workload example:
 ```shell
-python3 xpk.py workload create-pathways \
+xpk workload create-pathways \
 --workload xpk-regular-test \
 --num-slices=1 \
 --tpu-type=v5litepod-16 \
@@ -750,7 +771,7 @@ python3 xpk.py storage delete test-fs-instance \
 Pathways in headless mode - Pathways now offers the capability to run JAX workloads in Vertex AI notebooks or in GCE VMs!
 Specify `--headless` with `workload create-pathways` when the user workload is not provided in a docker container.
 ```shell
-python3 xpk.py workload create-pathways --headless \
+xpk workload create-pathways --headless \
 --workload xpk-pw-headless \
 --num-slices=1 \
 --tpu-type=v5litepod-16 \
@@ -785,7 +806,7 @@ A3 Ultra | `h200-141gb-8`
 A4 | `b200-8`
 
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --workload=$WORKLOAD_NAME --command="echo goodbye" \
 --cluster=$CLUSTER_NAME --device-type DEVICE_TYPE \
 --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
@@ -816,7 +837,7 @@ In order to run NCCL test on A3 machines check out [this guide](/examples/nccl/n
 
 #### General Example:
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
 xpk-test --tpu-type=v5litepod-16 --priority=medium
 ```
@@ -833,7 +854,7 @@ XPK will create a Vertex AI Experiment in `workload create` command and attach t
 * Create Vertex AI Experiment with default Experiment name:
 
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --cluster xpk-test --workload xpk-workload \
 --use-vertex-tensorboard
 ```
@@ -843,7 +864,7 @@ will create a Vertex AI Experiment with the name `xpk-test-xpk-workload` (*<args
 * Create Vertex AI Experiment with user-specified Experiment name:
 
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --cluster xpk-test --workload xpk-workload \
 --use-vertex-tensorboard --experiment-name=test-experiment
 ```
@@ -856,7 +877,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 * Workload Delete (delete training job):
 
 ```shell
-python3 xpk.py workload delete \
+xpk workload delete \
 --workload xpk-test-workload --cluster xpk-test
 ```
 
@@ -865,7 +886,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 * Workload Delete (delete all training jobs in the cluster):
 
 ```shell
-python3 xpk.py workload delete \
+xpk workload delete \
 --cluster xpk-test
 ```
 
@@ -875,7 +896,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 * Filter by Job: `filter-by-job`
 
 ```shell
-python3 xpk.py workload delete \
+xpk workload delete \
 --cluster xpk-test --filter-by-job=$USER
 ```
 
@@ -884,7 +905,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 * Filter by Status: `filter-by-status`
 
 ```shell
-python3 xpk.py workload delete \
+xpk workload delete \
 --cluster xpk-test --filter-by-status=QUEUED
 ```
 
@@ -894,7 +915,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 * Workload List (see training jobs):
 
 ```shell
-python3 xpk.py workload list \
+xpk workload list \
 --cluster xpk-test
 ```
 
@@ -929,7 +950,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 Filter the workload list by the name of a job.
 
 ```shell
-python3 xpk.py workload list \
+xpk workload list \
 --cluster xpk-test --filter-by-job=$USER
 ```
 
@@ -938,14 +959,14 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 Wait for a job to complete.
 
 ```shell
-python3 xpk.py workload list \
+xpk workload list \
 --cluster xpk-test --wait-for-job-completion=xpk-test-workload
 ```
 
 Wait for a job to complete with a timeout of 300 seconds.
 
 ```shell
-python3 xpk.py workload list \
+xpk workload list \
 --cluster xpk-test --wait-for-job-completion=xpk-test-workload \
 --timeout=300
 ```
@@ -961,7 +982,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 * Job List (see jobs submitted via batch command):
 
 ```shell
-python3 xpk.py job ls --cluster xpk-test
+xpk job ls --cluster xpk-test
 ```
 
 * Example Job List Output:
@@ -979,7 +1000,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 * Job Cancel (delete job submitted via batch command):
 
 ```shell
-python3 xpk.py job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test
+xpk job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test
 ```
 
 ## Inspector
@@ -987,7 +1008,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
 Inspector output is saved to a file.
 
 ```shell
-python3 xpk.py inspector \
+xpk inspector \
 --cluster $CLUSTER_NAME \
 --project $PROJECT_ID \
 --zone $ZONE
@@ -1029,7 +1050,7 @@ Inspector output is saved to a file.
 * `xpk run` lets you execute scripts on a cluster with ease. It automates task execution, handles interruptions, and streams job output to your console.
 
 ```shell
-python3 xpk.py run --kind-cluster -n 2 -t 0-2 examples/job.sh
+xpk run --kind-cluster -n 2 -t 0-2 examples/job.sh
 ```
 
 * Example Output:
@@ -1065,7 +1086,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
 gcloud compute reservations list --project=$PROJECT_ID
 
 # Run cluster create with reservation.
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test --device-type=h100-80gb-8 \
 --num-nodes=2 \
 --reservation=$RESERVATION_ID
@@ -1074,20 +1095,20 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
 * Cluster Delete (deprovision capacity):
 
 ```shell
-python3 xpk.py cluster delete \
+xpk cluster delete \
 --cluster xpk-test
 ```
 
 * Cluster List (see provisioned capacity):
 
 ```shell
-python3 xpk.py cluster list
+xpk cluster list
 ```
 
 * Cluster Describe (see capacity):
 
 ```shell
-python3 xpk.py cluster describe \
+xpk cluster describe \
 --cluster xpk-test
 ```
 
@@ -1095,7 +1116,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
 * Cluster Cacheimage (enables faster start times):
 
 ```shell
-python3 xpk.py cluster cacheimage \
+xpk cluster cacheimage \
 --cluster xpk-test --docker-image gcr.io/your_docker_image \
 --device-type=h100-80gb-8
 ```
@@ -1116,7 +1137,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
 
 ```shell
 # Submit a workload
-python3 xpk.py workload create \
+xpk workload create \
 --cluster xpk-test --device-type h100-80gb-8 \
 --workload xpk-test-workload \
 --command="echo hello world"
@@ -1125,7 +1146,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
 * Workload Delete (delete training job):
 
 ```shell
-python3 xpk.py workload delete \
+xpk workload delete \
 --workload xpk-test-workload --cluster xpk-test
 ```
 
@@ -1134,7 +1155,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
 * Workload Delete (delete all training jobs in the cluster):
 
 ```shell
-python3 xpk.py workload delete \
+xpk workload delete \
 --cluster xpk-test
 ```
 
@@ -1144,7 +1165,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
 * Filter by Job: `filter-by-job`
 
 ```shell
-python3 xpk.py workload delete \
+xpk workload delete \
 --cluster xpk-test --filter-by-job=$USER
 ```
 
@@ -1153,7 +1174,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
 * Filter by Status: `filter-by-status`
 
 ```shell
-python3 xpk.py workload delete \
+xpk workload delete \
 --cluster xpk-test --filter-by-status=QUEUED
 ```
 
@@ -1167,7 +1188,7 @@ In order to use XPK for CPU, you can do so by using `device-type` flag.
 
 ```shell
 # Run cluster create with on demand capacity.
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster xpk-test \
 --device-type=n2-standard-32-256 \
 --num-slices=1 \
@@ -1181,7 +1202,7 @@ In order to use XPK for CPU, you can do so by using `device-type` flag.
 
 ```shell
 # Submit a workload
-python3 xpk.py workload create \
+xpk workload create \
 --cluster xpk-test \
 --num-slices=1 \
 --device-type=n2-standard-32-256 \
@@ -1211,7 +1232,7 @@ RESERVATION=reservation_id
 PROJECT=my_project
 ZONE=us-east5-b
 
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster $CLUSTER_NAME \
 --num-slices=$NUM_SLICES \
 --device-type=$DEVICE_TYPE \
@@ -1248,7 +1269,7 @@ RESERVATION=reservation_id
 PROJECT=my_project
 ZONE=us-east5-b
 
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster $CLUSTER_NAME \
 --num-slices=$NUM_SLICES \
 --device-type=$DEVICE_TYPE \
@@ -1271,7 +1292,7 @@ PROJECT=my_project
 ZONE=us-east5-b
 
 # This will create 2x v4-16 node pools and set the max autoprovisioned chips to 16.
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster $CLUSTER_NAME \
 --num-slices=$NUM_SLICES \
 --device-type=$DEVICE_TYPE \
@@ -1291,7 +1312,7 @@ PROJECT=my_project
 ZONE=us-east5-b
 
 # This will clear the node pools if they exist in the cluster and set the max autoprovisioned chips to 16
-python3 xpk.py cluster create \
+xpk cluster create \
 --cluster $CLUSTER_NAME \
 --num-slices=$NUM_SLICES \
 --device-type=$DEVICE_TYPE \
@@ -1312,7 +1333,7 @@ Reconfigure the `--device-type` and `--num-slices`
 PROJECT=my_project
 ZONE=us-east5-b
 # Create a 2x v4-8 TPU workload.
-python3 xpk.py workload create \
+xpk workload create \
 --cluster $CLUSTER \
 --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
 --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
@@ -1325,7 +1346,7 @@ Reconfigure the `--device-type` and `--num-slices`
 DEVICE_TYPE=v4-16
 
 # Create a 1x v4-16 TPU workload.
-python3 xpk.py workload create \
+xpk workload create \
 --cluster $CLUSTER \
 --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
 --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
@@ -1335,7 +1356,7 @@ Reconfigure the `--device-type` and `--num-slices`
 --project=$PROJECT
 
 # Use a different reservation from what the cluster was created with.
-python3 xpk.py workload create \
+xpk workload create \
 --cluster $CLUSTER \
 --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
 --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
@@ -1379,19 +1400,19 @@ This flow pulls the `--script-dir` into the `--base-docker-image` and runs the n
 
 - `--script-dir` sets which directory to pull into the image. This defaults to the current working directory.
 
-See `python3 xpk.py workload create --help` for more info.
+See `xpk workload create --help` for more info.
 
 * Example with defaults which pulls the local directory into the base image:
 ```shell
 echo -e '#!/bin/bash \n echo "Hello world from a test script!"' > test.sh
-python3 xpk.py workload create --cluster xpk-test \
+xpk workload create --cluster xpk-test \
 --workload xpk-test-workload-base-image --command "bash test.sh" \
 --tpu-type=v5litepod-16 --num-slices=1
 ```
 
 * Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators):
 ```shell
-python3 xpk.py workload create --cluster xpk-test \
+xpk workload create --cluster xpk-test \
 --workload xpk-test-workload-base-image --command "bash custom_script.sh" \
 --base-docker-image=gcr.io/your_dependencies_docker_image \
 --tpu-type=v5litepod-16 --num-slices=1
@@ -1404,17 +1425,17 @@ workload.
 
 * Running with `--docker-image`:
 ```shell
-python3 xpk.py workload create --cluster xpk-test \
+xpk workload create --cluster xpk-test \
 --workload xpk-test-workload-base-image --command "bash test.sh" \
 --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
 ```
 
 * Recommended Flow For Large Sized Jobs (more than 10k accelerators):
 ```shell
-python3 xpk.py cluster cacheimage \
+xpk cluster cacheimage \
 --cluster xpk-test --docker-image gcr.io/your_docker_image
 # Run workload create with the same image.
-python3 xpk.py workload create --cluster xpk-test \
+xpk workload create --cluster xpk-test \
 --workload xpk-test-workload-base-image --command "bash test.sh" \
 --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
 ```
@@ -1463,7 +1484,7 @@ Please select a CPU type that exists in all zones in the region.
 # Find CPU Types supported in zones.
 gcloud compute machine-types list --zones=$ZONE_LIST
 # Adjust default cpu machine type.
-python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
+xpk cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
 ```
 
 ## Workload creation fails
@@ -1472,7 +1493,7 @@ Some XPK cluster configuration might be missing, if workload creation fails with
 
 `[XPK] b'error: the server doesn\'t have a resource type "workloads"\n'`
 
-Mitigate this error by re-running your `xpk
+Mitigate this error by re-running your `xpk cluster create ...` command, to refresh the cluster configurations.
 
 ## Permission Issues: `requires one of ["permission_name"] permission(s)`.
 
@@ -1544,7 +1565,7 @@ If error of this kind appeared after updating xpk version it's possible that you
 ## Verbose Logging
 If you are having trouble with your workload, try setting the `--enable-debug-logs` when you schedule it. This will give you more detailed logs to help pinpoint the issue. For example:
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --cluster --workload xpk-test-workload \
 --command="echo hello world" --enable-debug-logs
 ```
@@ -1576,7 +1597,7 @@ This configuration will start collecting stack traces inside the `/tmp/debugging
 ### Explore Stack Traces
 To explore the stack traces collected in a temporary directory in Kubernetes Pod, you can run the following command to configure a sidecar container that will read the traces from `/tmp/debugging` directory.
 ```shell
-python3 xpk.py workload create \
+xpk workload create \
 --workload xpk-test-workload --command "python3 main.py" --cluster \
 xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar
 ```
@@ -1587,12 +1608,12 @@ To list available resources and queues use ```xpk info``` command. It allows to
 
 To see queues with usage and workload info use:
 ```shell
-python3 xpk.py info --cluster my-cluster
+xpk info --cluster my-cluster
 ```
 
 You can specify what kind of resources(clusterqueue or localqueue) you want to see using flags --clusterqueue or --localqueue.
 ```shell
-python3 xpk.py info --cluster my-cluster --localqueue
+xpk info --cluster my-cluster --localqueue
 ```
 
 # Local testing with Kind
@@ -1611,7 +1632,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil
 * Cluster create:
 
 ```shell
-python3 xpk.py kind create \
+xpk kind create \
 --cluster xpk-test
 ```
 
@@ -1619,7 +1640,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil
 * Cluster Delete:
 
 ```shell
-python3 xpk.py kind delete \
+xpk kind delete \
 --cluster xpk-test
 ```
 
@@ -1627,7 +1648,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil
 * Cluster List:
 
 ```shell
-python3 xpk.py kind list
+xpk kind list
 ```
 
 ## Local Testing Basics
@@ -1635,7 +1656,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil
 Local testing is available exclusively through the `batch` and `job` commands of xpk with the `--kind-cluster` flag. This allows you to simulate training jobs locally:
 
 ```shell
-python3 xpk.py batch [other-options] --kind-cluster script
+xpk batch [other-options] --kind-cluster script
 ```
 
 Please note that all other xpk subcommands are intended for use with cloud systems on Google Cloud Engine (GCE) and don't support local testing. This includes commands like cluster, info, inspector, etc.