xpk 0.14.3__py3-none-any.whl → 0.15.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (58)
  1. integration/gcluster_a3mega_test.py +11 -0
  2. integration/gcluster_a3ultra_test.py +11 -0
  3. integration/gcluster_a4_test.py +11 -0
  4. xpk/commands/cluster.py +57 -21
  5. xpk/commands/cluster_gcluster.py +25 -5
  6. xpk/commands/cluster_gcluster_test.py +11 -2
  7. xpk/commands/cluster_test.py +233 -12
  8. xpk/commands/config.py +3 -5
  9. xpk/commands/kind.py +1 -1
  10. xpk/commands/storage.py +8 -10
  11. xpk/commands/workload.py +28 -11
  12. xpk/commands/workload_test.py +3 -3
  13. xpk/core/blueprint/blueprint_generator.py +70 -33
  14. xpk/core/blueprint/blueprint_test.py +9 -0
  15. xpk/core/capacity.py +46 -8
  16. xpk/core/capacity_test.py +32 -1
  17. xpk/core/cluster.py +37 -57
  18. xpk/core/cluster_test.py +95 -0
  19. xpk/core/commands.py +4 -10
  20. xpk/core/config.py +9 -2
  21. xpk/core/gcloud_context.py +18 -12
  22. xpk/core/gcloud_context_test.py +111 -1
  23. xpk/core/kjob.py +6 -9
  24. xpk/core/kueue_manager.py +192 -32
  25. xpk/core/kueue_manager_test.py +132 -4
  26. xpk/core/nodepool.py +21 -29
  27. xpk/core/nodepool_test.py +17 -15
  28. xpk/core/scheduling.py +16 -1
  29. xpk/core/scheduling_test.py +85 -6
  30. xpk/core/system_characteristics.py +77 -19
  31. xpk/core/system_characteristics_test.py +80 -5
  32. xpk/core/telemetry.py +263 -0
  33. xpk/core/telemetry_test.py +211 -0
  34. xpk/main.py +31 -13
  35. xpk/parser/cluster.py +48 -9
  36. xpk/parser/cluster_test.py +42 -3
  37. xpk/parser/workload.py +12 -0
  38. xpk/parser/workload_test.py +4 -4
  39. xpk/telemetry_uploader.py +29 -0
  40. xpk/templates/kueue_gke_default_topology.yaml.j2 +1 -1
  41. xpk/templates/kueue_sub_slicing_topology.yaml.j2 +3 -8
  42. xpk/utils/console.py +41 -10
  43. xpk/utils/console_test.py +106 -0
  44. xpk/utils/feature_flags.py +7 -1
  45. xpk/utils/file.py +4 -1
  46. xpk/utils/topology.py +4 -0
  47. xpk/utils/user_agent.py +35 -0
  48. xpk/utils/user_agent_test.py +44 -0
  49. xpk/utils/user_input.py +48 -0
  50. xpk/utils/user_input_test.py +92 -0
  51. xpk/utils/validation.py +0 -11
  52. xpk/utils/versions.py +31 -0
  53. {xpk-0.14.3.dist-info → xpk-0.15.0.dist-info}/METADATA +113 -92
  54. {xpk-0.14.3.dist-info → xpk-0.15.0.dist-info}/RECORD +58 -48
  55. {xpk-0.14.3.dist-info → xpk-0.15.0.dist-info}/WHEEL +0 -0
  56. {xpk-0.14.3.dist-info → xpk-0.15.0.dist-info}/entry_points.txt +0 -0
  57. {xpk-0.14.3.dist-info → xpk-0.15.0.dist-info}/licenses/LICENSE +0 -0
  58. {xpk-0.14.3.dist-info → xpk-0.15.0.dist-info}/top_level.txt +0 -0
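The METADATA diff below bumps the package version and pins a new `argcomplete` dependency. For reference, a minimal sketch of pulling this release and checking the metadata locally (plain `pip` commands; exact output formatting depends on your pip version):

```shell
# Upgrade to the release covered by this diff and inspect its metadata.
pip install --upgrade xpk==0.15.0
pip show xpk   # should report Version: 0.15.0 and list argcomplete under Requires
```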
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: xpk
- Version: 0.14.3
+ Version: 0.15.0
  Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
  Author-email: XPK team <xpk-code-reviewers@google.com>
  License: Apache-2.0
@@ -11,6 +11,7 @@ Classifier: Programming Language :: Python :: 3.11
  Requires-Python: >=3.10
  Description-Content-Type: text/markdown
  License-File: LICENSE
+ Requires-Dist: argcomplete==3.6.3
  Requires-Dist: cloud-accelerator-diagnostics==0.1.1
  Requires-Dist: tabulate==0.9.0
  Requires-Dist: ruamel.yaml==0.18.10
@@ -142,6 +143,18 @@ The following tools must be installed:
  - git: [installation instructions](https://git-scm.com/downloads/linux)
  - make: install by running `apt-get -y install make` (`sudo` might be required)

+ ### Additional prerequisites to enable bash completion
+
+ - Install [argcomplete](https://pypi.org/project/argcomplete/) globally on your machine.
+ ```shell
+ pip install argcomplete
+ activate-global-python-argcomplete
+ ```
+ - Configure `argcomplete` for XPK.
+ ```shell
+ eval "$(register-python-argcomplete xpk)"
+ ```
+
  ## Installation via pip

  To install XPK using pip, first install required tools mentioned in [prerequisites](#prerequisites) and [additional prerequisites](#additional-prerequisites-when-installing-from-pip). Then you can install XPK simply by running:
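The bash-completion hunk above registers completion only for the current shell session. A minimal sketch of making it persistent, assuming a bash shell and that the pip-installed `xpk` entry point is already on `PATH` (this step is not part of the upstream README):

```shell
# Append the completion hook to ~/.bashrc so every new shell picks it up.
echo 'eval "$(register-python-argcomplete xpk)"' >> ~/.bashrc
source ~/.bashrc
# Typing `xpk clu<TAB>` should now complete to `xpk cluster`.
```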
@@ -243,7 +256,7 @@ all zones.
  # Find your reservations
  gcloud compute reservations list --project=$PROJECT_ID
  # Run cluster create with reservation.
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --tpu-type=v5litepod-256 \
  --num-slices=2 \
  --reservation=$RESERVATION_ID
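This hunk is representative of the rest of the README diff: every `python3 xpk.py …` invocation is replaced by the `xpk` console entry point that pip installs. A quick sanity check before following the updated examples (assuming a pip installation; only `command -v` and `--help` are used):

```shell
# Verify the entry point is on PATH and responds before running the examples.
command -v xpk
xpk --help
```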
@@ -252,7 +265,7 @@ all zones.
  * Cluster Create (provision on-demand capacity):

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=4 --on-demand
  ```
@@ -260,22 +273,30 @@ all zones.
  * Cluster Create (provision spot / preemptable capacity):

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=4 --spot
  ```

  * Cluster Create (DWS flex queued capacity):
  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=4 --flex
  ```

+ * Cluster Create with CPU and/or memory quota:
+ ```shell
+ xpk cluster create \
+ --cluster xpk-test --tpu-type=v5litepod-16 \
+ --cpu-limit=112 --memory-limit=192Gi \
+ --on-demand
+ ```
+
  * Cluster Create for Pathways:
  Pathways compatible cluster can be created using `cluster create-pathways`.
  ```shell
- python3 xpk.py cluster create-pathways \
+ xpk cluster create-pathways \
  --cluster xpk-pw-test \
  --num-slices=4 --on-demand \
  --tpu-type=v5litepod-16
@@ -285,7 +306,7 @@ all zones.
  * Cluster Create for Ray:
  A cluster with KubeRay enabled and a RayCluster can be created using `cluster create-ray`.
  ```shell
- python3 xpk.py cluster create-ray \
+ xpk cluster create-ray \
  --cluster xpk-rc-test \
  --ray-version=2.39.0 \
  --num-slices=4 --on-demand \
@@ -298,7 +319,7 @@ all zones.
  For example, if a user creates a cluster with 4 slices:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=4 --reservation=$RESERVATION_ID
  ```
@@ -307,7 +328,7 @@ all zones.
  new slices:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=8 --reservation=$RESERVATION_ID
  ```
@@ -317,13 +338,13 @@ all zones.
  Use `--force` to skip prompts.

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=6 --reservation=$RESERVATION_ID

  # Skip delete prompts using --force.

- python3 xpk.py cluster create --force \
+ xpk cluster create --force \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=6 --reservation=$RESERVATION_ID
  ```
@@ -333,13 +354,13 @@ all zones.
  user when deleting slices. Use `--force` to skip prompts.

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --tpu-type=v4-8 \
  --num-slices=4 --reservation=$RESERVATION_ID

  # Skip delete prompts using --force.

- python3 xpk.py cluster create --force \
+ xpk cluster create --force \
  --cluster xpk-test --tpu-type=v4-8 \
  --num-slices=4 --reservation=$RESERVATION_ID
  ```
@@ -370,7 +391,7 @@ This argument allows you to specify additional IP ranges (in CIDR notation) that
  * To create a private cluster and allow access to Control Plane only to your current machine:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster=xpk-private-cluster \
  --tpu-type=v4-8 --num-slices=2 \
  --private
@@ -379,7 +400,7 @@ This argument allows you to specify additional IP ranges (in CIDR notation) that
  * To create a private cluster and allow access to Control Plane only to your current machine and the IP ranges `1.2.3.0/24` and `1.2.4.5/32`:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster=xpk-private-cluster \
  --tpu-type=v4-8 --num-slices=2 \
  --authorized-networks 1.2.3.0/24 1.2.4.5/32
@@ -405,7 +426,7 @@ You can create a Vertex AI Tensorboard for your cluster with `Cluster Create` co
  * Create Vertex AI Tensorboard in default region with default Tensorboard name:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
  --create-vertex-tensorboard
  ```
@@ -415,7 +436,7 @@ will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<args
  * Create Vertex AI Tensorboard in user-specified region with default Tensorboard name:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
  --create-vertex-tensorboard --tensorboard-region=us-west1
  ```
@@ -425,7 +446,7 @@ will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<args
  * Create Vertex AI Tensorboard in default region with user-specified Tensorboard name:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
  --create-vertex-tensorboard --tensorboard-name=tb-testing
  ```
@@ -435,7 +456,7 @@ will create a Vertex AI Tensorboard with the name `tb-testing` in `us-central1`.
  * Create Vertex AI Tensorboard in user-specified region with user-specified Tensorboard name:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
  --create-vertex-tensorboard --tensorboard-region=us-west1 --tensorboard-name=tb-testing
  ```
@@ -445,7 +466,7 @@ will create a Vertex AI Tensorboard instance with the name `tb-testing` in `us-w
  * Create Vertex AI Tensorboard in an unsupported region:

  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
  --create-vertex-tensorboard --tensorboard-region=us-central2
  ```
@@ -456,20 +477,20 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp
  * Cluster Delete (deprovision capacity):

  ```shell
- python3 xpk.py cluster delete \
+ xpk cluster delete \
  --cluster xpk-test
  ```
  ## Cluster List
  * Cluster List (see provisioned capacity):

  ```shell
- python3 xpk.py cluster list
+ xpk cluster list
  ```
  ## Cluster Describe
  * Cluster Describe (see capacity):

  ```shell
- python3 xpk.py cluster describe \
+ xpk cluster describe \
  --cluster xpk-test
  ```

@@ -477,7 +498,7 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp
  * Cluster Cacheimage (enables faster start times):

  ```shell
- python3 xpk.py cluster cacheimage \
+ xpk cluster cacheimage \
  --cluster xpk-test --docker-image gcr.io/your_docker_image \
  --tpu-type=v5litepod-16
  ```
@@ -495,7 +516,7 @@ A4 | `b200-8`


  ```shell
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster CLUSTER_NAME --device-type DEVICE_TYPE \
  --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
  --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID
@@ -520,7 +541,7 @@ Currently `xpk cluster adapt` supports only the following device types:

  Example usage:
  ```shell
- python3 xpk.py cluster adapt \
+ xpk cluster adapt \
  --cluster=$CLUSTER_NAME --device-type=$DEVICE_TYPE \
  --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
  --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID
@@ -542,7 +563,7 @@ To use the GCS FUSE with XPK you need to create a [Storage Bucket](https://conso
  Once it's ready you can use `xpk storage attach` with `--type=gcsfuse` command to attach a FUSE storage instance to your cluster:

  ```shell
- python3 xpk.py storage attach test-fuse-storage --type=gcsfuse \
+ xpk storage attach test-fuse-storage --type=gcsfuse \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
  --mount-point='/test-mount-point' --readonly=false \
  --bucket=test-bucket --size=1 --auto-mount=false
@@ -567,7 +588,7 @@ A Filestore adapter lets you mount and access [Filestore instances](https://clou
  To create and attach a GCP Filestore instance to your cluster use `xpk storage create` command with `--type=gcpfilestore`:

  ```shell
- python3 xpk.py storage create test-fs-storage --type=gcpfilestore \
+ xpk storage create test-fs-storage --type=gcpfilestore \
  --auto-mount=false --mount-point=/data-fs --readonly=false \
  --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
@@ -576,7 +597,7 @@ python3 xpk.py storage create test-fs-storage --type=gcpfilestore \
  You can also attach an existing Filestore instance to your cluster using `xpk storage attach` command:

  ```shell
- python3 xpk.py storage attach test-fs-storage --type=gcpfilestore \
+ xpk storage attach test-fs-storage --type=gcpfilestore \
  --auto-mount=false --mount-point=/data-fs --readonly=false \
  --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
@@ -605,7 +626,7 @@ To use the GCS Parallelstore with XPK you need to create a [Parallelstore Instan
  Once it's ready you can use `xpk storage attach` with `--type=parallelstore` command to attach a Parallelstore instance to your cluster. Currently, attaching a Parallelstore is supported only by providing a manifest file.

  ```shell
- python3 xpk.py storage attach test-parallelstore-storage --type=parallelstore \
+ xpk storage attach test-parallelstore-storage --type=parallelstore \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
  --mount-point='/test-mount-point' --readonly=false \
  --auto-mount=true \
@@ -629,7 +650,7 @@ To use the GCE PersistentDisk with XPK you need to create a [disk in GCE](https:
  Once it's ready you can use `xpk storage attach` with `--type=pd` command to attach a PersistentDisk instance to your cluster. Currently, attaching a PersistentDisk is supported only by providing a manifest file.

  ```shell
- python3 xpk.py storage attach test-pd-storage --type=pd \
+ xpk storage attach test-pd-storage --type=pd \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
  --mount-point='/test-mount-point' --readonly=false \
  --auto-mount=true \
@@ -653,7 +674,7 @@ To use the GCP Managed Lustre with XPK you need to create [an instance](https://
  Once it's ready you can use `xpk storage attach` with `--type=lustre` command to attach a Managed Lustre instance to your cluster. Currently, attaching a Managed Lustre instance is supported only by providing a manifest file.

  ```shell
- python3 xpk.py storage attach test-lustre-storage --type=lustre \
+ xpk storage attach test-lustre-storage --type=lustre \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
  --mount-point='/test-mount-point' --readonly=false \
  --auto-mount=true \
@@ -671,7 +692,7 @@ Parameters:
  ### List attached storages

  ```shell
- python3 xpk.py storage list \
+ xpk storage list \
  --project=$PROJECT --cluster $CLUSTER --zone=$ZONE
  ```

@@ -680,7 +701,7 @@ python3 xpk.py storage list \
  If you specified `--auto-mount=true` when creating or attaching a storage, then all workloads deployed on the cluster will have the volume attached by default. Otherwise, in order to have the storage attached, you have to add `--storage` parameter to `workload create` command:

  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --workload xpk-test-workload --command "echo goodbye" \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
  --tpu-type=v5litepod-16 --storage=test-storage
@@ -689,7 +710,7 @@ python3 xpk.py workload create \
  ### Detaching storage

  ```shell
- python3 xpk.py storage detach $STORAGE_NAME \
+ xpk storage detach $STORAGE_NAME \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
  ```

@@ -698,7 +719,7 @@ python3 xpk.py storage detach $STORAGE_NAME \
  XPK allows you to remove Filestore instances easily with `xpk storage delete` command. **Warning:** this deletes all data contained in the Filestore!

  ```shell
- python3 xpk.py storage delete test-fs-instance \
+ xpk storage delete test-fs-instance \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
  ```

@@ -706,14 +727,14 @@ python3 xpk.py storage delete test-fs-instance \
  * Workload Create (submit training job):

  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --workload xpk-test-workload --command "echo goodbye" \
  --cluster xpk-test \
  --tpu-type=v5litepod-16 --project=$PROJECT
  ```
  * Workload create(DWS flex with queued provisioning):
  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --workload xpk-test-workload --command "echo goodbye" \
  --cluster xpk-test --flex \
  --tpu-type=v5litepod-16 --project=$PROJECT
@@ -723,7 +744,7 @@ python3 xpk.py storage delete test-fs-instance \

  Pathways workload example:
  ```shell
- python3 xpk.py workload create-pathways \
+ xpk workload create-pathways \
  --workload xpk-pw-test \
  --num-slices=1 \
  --tpu-type=v5litepod-16 \
@@ -737,7 +758,7 @@ python3 xpk.py storage delete test-fs-instance \

  Pathways workload example:
  ```shell
- python3 xpk.py workload create-pathways \
+ xpk workload create-pathways \
  --workload xpk-regular-test \
  --num-slices=1 \
  --tpu-type=v5litepod-16 \
@@ -750,7 +771,7 @@ python3 xpk.py storage delete test-fs-instance \
  Pathways in headless mode - Pathways now offers the capability to run JAX workloads in Vertex AI notebooks or in GCE VMs!
  Specify `--headless` with `workload create-pathways` when the user workload is not provided in a docker container.
  ```shell
- python3 xpk.py workload create-pathways --headless \
+ xpk workload create-pathways --headless \
  --workload xpk-pw-headless \
  --num-slices=1 \
  --tpu-type=v5litepod-16 \
@@ -785,7 +806,7 @@ A3 Ultra | `h200-141gb-8`
  A4 | `b200-8`

  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --workload=$WORKLOAD_NAME --command="echo goodbye" \
  --cluster=$CLUSTER_NAME --device-type DEVICE_TYPE \
  --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
@@ -816,7 +837,7 @@ In order to run NCCL test on A3 machines check out [this guide](/examples/nccl/n

  #### General Example:
  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
  xpk-test --tpu-type=v5litepod-16 --priority=medium
  ```
@@ -833,7 +854,7 @@ XPK will create a Vertex AI Experiment in `workload create` command and attach t
  * Create Vertex AI Experiment with default Experiment name:

  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --cluster xpk-test --workload xpk-workload \
  --use-vertex-tensorboard
  ```
@@ -843,7 +864,7 @@ will create a Vertex AI Experiment with the name `xpk-test-xpk-workload` (*<args
  * Create Vertex AI Experiment with user-specified Experiment name:

  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --cluster xpk-test --workload xpk-workload \
  --use-vertex-tensorboard --experiment-name=test-experiment
  ```
@@ -856,7 +877,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  * Workload Delete (delete training job):

  ```shell
- python3 xpk.py workload delete \
+ xpk workload delete \
  --workload xpk-test-workload --cluster xpk-test
  ```

@@ -865,7 +886,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  * Workload Delete (delete all training jobs in the cluster):

  ```shell
- python3 xpk.py workload delete \
+ xpk workload delete \
  --cluster xpk-test
  ```

@@ -875,7 +896,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  * Filter by Job: `filter-by-job`

  ```shell
- python3 xpk.py workload delete \
+ xpk workload delete \
  --cluster xpk-test --filter-by-job=$USER
  ```

@@ -884,7 +905,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  * Filter by Status: `filter-by-status`

  ```shell
- python3 xpk.py workload delete \
+ xpk workload delete \
  --cluster xpk-test --filter-by-status=QUEUED
  ```

@@ -894,7 +915,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  * Workload List (see training jobs):

  ```shell
- python3 xpk.py workload list \
+ xpk workload list \
  --cluster xpk-test
  ```

@@ -929,7 +950,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  Filter the workload list by the name of a job.

  ```shell
- python3 xpk.py workload list \
+ xpk workload list \
  --cluster xpk-test --filter-by-job=$USER
  ```

@@ -938,14 +959,14 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  Wait for a job to complete.

  ```shell
- python3 xpk.py workload list \
+ xpk workload list \
  --cluster xpk-test --wait-for-job-completion=xpk-test-workload
  ```

  Wait for a job to complete with a timeout of 300 seconds.

  ```shell
- python3 xpk.py workload list \
+ xpk workload list \
  --cluster xpk-test --wait-for-job-completion=xpk-test-workload \
  --timeout=300
  ```
@@ -961,7 +982,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  * Job List (see jobs submitted via batch command):

  ```shell
- python3 xpk.py job ls --cluster xpk-test
+ xpk job ls --cluster xpk-test
  ```

  * Example Job List Output:
@@ -979,7 +1000,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  * Job Cancel (delete job submitted via batch command):

  ```shell
- python3 xpk.py job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test
+ xpk job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test
  ```

  ## Inspector
@@ -987,7 +1008,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  Inspector output is saved to a file.

  ```shell
- python3 xpk.py inspector \
+ xpk inspector \
  --cluster $CLUSTER_NAME \
  --project $PROJECT_ID \
  --zone $ZONE
@@ -1029,7 +1050,7 @@ Inspector output is saved to a file.
  * `xpk run` lets you execute scripts on a cluster with ease. It automates task execution, handles interruptions, and streams job output to your console.

  ```shell
- python xpk.py run --kind-cluster -n 2 -t 0-2 examples/job.sh
+ xpk run --kind-cluster -n 2 -t 0-2 examples/job.sh
  ```

  * Example Output:
@@ -1065,7 +1086,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
  gcloud compute reservations list --project=$PROJECT_ID

  # Run cluster create with reservation.
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test --device-type=h100-80gb-8 \
  --num-nodes=2 \
  --reservation=$RESERVATION_ID
@@ -1074,20 +1095,20 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
  * Cluster Delete (deprovision capacity):

  ```shell
- python3 xpk.py cluster delete \
+ xpk cluster delete \
  --cluster xpk-test
  ```

  * Cluster List (see provisioned capacity):

  ```shell
- python3 xpk.py cluster list
+ xpk cluster list
  ```

  * Cluster Describe (see capacity):

  ```shell
- python3 xpk.py cluster describe \
+ xpk cluster describe \
  --cluster xpk-test
  ```

@@ -1095,7 +1116,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
  * Cluster Cacheimage (enables faster start times):

  ```shell
- python3 xpk.py cluster cacheimage \
+ xpk cluster cacheimage \
  --cluster xpk-test --docker-image gcr.io/your_docker_image \
  --device-type=h100-80gb-8
  ```
@@ -1116,7 +1137,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.

  ```shell
  # Submit a workload
- python3 xpk.py workload create \
+ xpk workload create \
  --cluster xpk-test --device-type h100-80gb-8 \
  --workload xpk-test-workload \
  --command="echo hello world"
@@ -1125,7 +1146,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
  * Workload Delete (delete training job):

  ```shell
- python3 xpk.py workload delete \
+ xpk workload delete \
  --workload xpk-test-workload --cluster xpk-test
  ```

@@ -1134,7 +1155,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
  * Workload Delete (delete all training jobs in the cluster):

  ```shell
- python3 xpk.py workload delete \
+ xpk workload delete \
  --cluster xpk-test
  ```

@@ -1144,7 +1165,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
  * Filter by Job: `filter-by-job`

  ```shell
- python3 xpk.py workload delete \
+ xpk workload delete \
  --cluster xpk-test --filter-by-job=$USER
  ```

@@ -1153,7 +1174,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag.
  * Filter by Status: `filter-by-status`

  ```shell
- python3 xpk.py workload delete \
+ xpk workload delete \
  --cluster xpk-test --filter-by-status=QUEUED
  ```

@@ -1167,7 +1188,7 @@ In order to use XPK for CPU, you can do so by using `device-type` flag.

  ```shell
  # Run cluster create with on demand capacity.
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster xpk-test \
  --device-type=n2-standard-32-256 \
  --num-slices=1 \
@@ -1181,7 +1202,7 @@ In order to use XPK for CPU, you can do so by using `device-type` flag.

  ```shell
  # Submit a workload
- python3 xpk.py workload create \
+ xpk workload create \
  --cluster xpk-test \
  --num-slices=1 \
  --device-type=n2-standard-32-256 \
@@ -1211,7 +1232,7 @@ RESERVATION=reservation_id
  PROJECT=my_project
  ZONE=us-east5-b

- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster $CLUSTER_NAME \
  --num-slices=$NUM_SLICES \
  --device-type=$DEVICE_TYPE \
@@ -1248,7 +1269,7 @@ RESERVATION=reservation_id
  PROJECT=my_project
  ZONE=us-east5-b

- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster $CLUSTER_NAME \
  --num-slices=$NUM_SLICES \
  --device-type=$DEVICE_TYPE \
@@ -1271,7 +1292,7 @@ PROJECT=my_project
  ZONE=us-east5-b

  # This will create 2x v4-16 node pools and set the max autoprovisioned chips to 16.
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster $CLUSTER_NAME \
  --num-slices=$NUM_SLICES \
  --device-type=$DEVICE_TYPE \
@@ -1291,7 +1312,7 @@ PROJECT=my_project
  ZONE=us-east5-b

  # This will clear the node pools if they exist in the cluster and set the max autoprovisioned chips to 16
- python3 xpk.py cluster create \
+ xpk cluster create \
  --cluster $CLUSTER_NAME \
  --num-slices=$NUM_SLICES \
  --device-type=$DEVICE_TYPE \
@@ -1312,7 +1333,7 @@ Reconfigure the `--device-type` and `--num-slices`
  PROJECT=my_project
  ZONE=us-east5-b
  # Create a 2x v4-8 TPU workload.
- python3 xpk.py workload create \
+ xpk workload create \
  --cluster $CLUSTER \
  --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
  --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
@@ -1325,7 +1346,7 @@ Reconfigure the `--device-type` and `--num-slices`
  DEVICE_TYPE=v4-16

  # Create a 1x v4-16 TPU workload.
- python3 xpk.py workload create \
+ xpk workload create \
  --cluster $CLUSTER \
  --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
  --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
@@ -1335,7 +1356,7 @@ Reconfigure the `--device-type` and `--num-slices`
  --project=$PROJECT

  # Use a different reservation from what the cluster was created with.
- python3 xpk.py workload create \
+ xpk workload create \
  --cluster $CLUSTER \
  --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \
  --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \
@@ -1379,19 +1400,19 @@ This flow pulls the `--script-dir` into the `--base-docker-image` and runs the n

  - `--script-dir` sets which directory to pull into the image. This defaults to the current working directory.

- See `python3 xpk.py workload create --help` for more info.
+ See `xpk workload create --help` for more info.

  * Example with defaults which pulls the local directory into the base image:
  ```shell
  echo -e '#!/bin/bash \n echo "Hello world from a test script!"' > test.sh
- python3 xpk.py workload create --cluster xpk-test \
+ xpk workload create --cluster xpk-test \
  --workload xpk-test-workload-base-image --command "bash test.sh" \
  --tpu-type=v5litepod-16 --num-slices=1
  ```

  * Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators):
  ```shell
- python3 xpk.py workload create --cluster xpk-test \
+ xpk workload create --cluster xpk-test \
  --workload xpk-test-workload-base-image --command "bash custom_script.sh" \
  --base-docker-image=gcr.io/your_dependencies_docker_image \
  --tpu-type=v5litepod-16 --num-slices=1
@@ -1404,17 +1425,17 @@ workload.

  * Running with `--docker-image`:
  ```shell
- python3 xpk.py workload create --cluster xpk-test \
+ xpk workload create --cluster xpk-test \
  --workload xpk-test-workload-base-image --command "bash test.sh" \
  --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
  ```

  * Recommended Flow For Large Sized Jobs (more than 10k accelerators):
  ```shell
- python3 xpk.py cluster cacheimage \
+ xpk cluster cacheimage \
  --cluster xpk-test --docker-image gcr.io/your_docker_image
  # Run workload create with the same image.
- python3 xpk.py workload create --cluster xpk-test \
+ xpk workload create --cluster xpk-test \
  --workload xpk-test-workload-base-image --command "bash test.sh" \
  --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
  ```
@@ -1463,7 +1484,7 @@ Please select a CPU type that exists in all zones in the region.
  # Find CPU Types supported in zones.
  gcloud compute machine-types list --zones=$ZONE_LIST
  # Adjust default cpu machine type.
- python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
+ xpk cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
  ```

  ## Workload creation fails
@@ -1472,7 +1493,7 @@ Some XPK cluster configuration might be missing, if workload creation fails with

  `[XPK] b'error: the server doesn\'t have a resource type "workloads"\n'`

- Mitigate this error by re-running your `xpk.py cluster create ...` command, to refresh the cluster configurations.
+ Mitigate this error by re-running your `xpk cluster create ...` command, to refresh the cluster configurations.

  ## Permission Issues: `requires one of ["permission_name"] permission(s)`.

@@ -1544,7 +1565,7 @@ If error of this kind appeared after updating xpk version it's possible that you
  ## Verbose Logging
  If you are having trouble with your workload, try setting the `--enable-debug-logs` when you schedule it. This will give you more detailed logs to help pinpoint the issue. For example:
  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --cluster --workload xpk-test-workload \
  --command="echo hello world" --enable-debug-logs
  ```
@@ -1576,7 +1597,7 @@ This configuration will start collecting stack traces inside the `/tmp/debugging
  ### Explore Stack Traces
  To explore the stack traces collected in a temporary directory in Kubernetes Pod, you can run the following command to configure a sidecar container that will read the traces from `/tmp/debugging` directory.
  ```shell
- python3 xpk.py workload create \
+ xpk workload create \
  --workload xpk-test-workload --command "python3 main.py" --cluster \
  xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar
  ```
@@ -1587,12 +1608,12 @@ To list available resources and queues use ```xpk info``` command. It allows to

  To see queues with usage and workload info use:
  ```shell
- python3 xpk.py info --cluster my-cluster
+ xpk info --cluster my-cluster
  ```

  You can specify what kind of resources(clusterqueue or localqueue) you want to see using flags --clusterqueue or --localqueue.
  ```shell
- python3 xpk.py info --cluster my-cluster --localqueue
+ xpk info --cluster my-cluster --localqueue
  ```

  # Local testing with Kind
@@ -1611,7 +1632,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil
  * Cluster create:

  ```shell
- python3 xpk.py kind create \
+ xpk kind create \
  --cluster xpk-test
  ```

@@ -1619,7 +1640,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil
  * Cluster Delete:

  ```shell
- python3 xpk.py kind delete \
+ xpk kind delete \
  --cluster xpk-test
  ```

@@ -1627,7 +1648,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil
  * Cluster List:

  ```shell
- python3 xpk.py kind list
+ xpk kind list
  ```

  ## Local Testing Basics
@@ -1635,7 +1656,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil
  Local testing is available exclusively through the `batch` and `job` commands of xpk with the `--kind-cluster` flag. This allows you to simulate training jobs locally:

  ```shell
- python xpk.py batch [other-options] --kind-cluster script
+ xpk batch [other-options] --kind-cluster script
  ```

  Please note that all other xpk subcommands are intended for use with cloud systems on Google Cloud Engine (GCE) and don't support local testing. This includes commands like cluster, info, inspector, etc.