xpk 0.5.0__tar.gz → 0.6.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (65)
  1. {xpk-0.5.0 → xpk-0.6.0}/PKG-INFO +298 -25
  2. xpk-0.5.0/xpk.egg-info/PKG-INFO → xpk-0.6.0/README.md +288 -41
  3. {xpk-0.5.0 → xpk-0.6.0}/pyproject.toml +15 -4
  4. xpk-0.6.0/src/xpk/__init__.py +15 -0
  5. xpk-0.6.0/src/xpk/commands/__init__.py +15 -0
  6. xpk-0.6.0/src/xpk/commands/batch.py +109 -0
  7. xpk-0.6.0/src/xpk/commands/cluster.py +784 -0
  8. xpk-0.6.0/src/xpk/commands/cluster_gcluster.py +185 -0
  9. xpk-0.6.0/src/xpk/commands/info.py +245 -0
  10. xpk-0.6.0/src/xpk/commands/inspector.py +363 -0
  11. xpk-0.6.0/src/xpk/commands/job.py +197 -0
  12. xpk-0.6.0/src/xpk/commands/kind.py +253 -0
  13. xpk-0.6.0/src/xpk/commands/shell.py +120 -0
  14. xpk-0.6.0/src/xpk/commands/version.py +39 -0
  15. xpk-0.6.0/src/xpk/commands/workload.py +692 -0
  16. xpk-0.6.0/src/xpk/core/__init__.py +15 -0
  17. xpk-0.6.0/src/xpk/core/blueprint/__init__.py +15 -0
  18. xpk-0.6.0/src/xpk/core/blueprint/blueprint_definitions.py +61 -0
  19. xpk-0.6.0/src/xpk/core/blueprint/blueprint_generator.py +652 -0
  20. xpk-0.6.0/src/xpk/core/cluster_private.py +197 -0
  21. xpk-0.6.0/src/xpk/core/commands.py +352 -0
  22. xpk-0.6.0/src/xpk/core/core.py +2824 -0
  23. xpk-0.6.0/src/xpk/core/docker_manager.py +308 -0
  24. xpk-0.6.0/src/xpk/core/gcluster_manager.py +158 -0
  25. xpk-0.6.0/src/xpk/core/kjob.py +205 -0
  26. xpk-0.6.0/src/xpk/core/kueue.py +352 -0
  27. xpk-0.6.0/src/xpk/core/nap.py +349 -0
  28. xpk-0.6.0/src/xpk/core/pathways.py +298 -0
  29. xpk-0.6.0/src/xpk/core/ray.py +222 -0
  30. xpk-0.6.0/src/xpk/core/system_characteristics.py +1395 -0
  31. xpk-0.6.0/src/xpk/core/workload.py +133 -0
  32. xpk-0.6.0/src/xpk/core/workload_decorators/__init__.py +15 -0
  33. xpk-0.6.0/src/xpk/core/workload_decorators/rdma_decorator.py +109 -0
  34. xpk-0.6.0/src/xpk/core/workload_decorators/tcpxo_decorator.py +157 -0
  35. xpk-0.6.0/src/xpk/main.py +73 -0
  36. xpk-0.6.0/src/xpk/parser/__init__.py +15 -0
  37. xpk-0.6.0/src/xpk/parser/batch.py +184 -0
  38. xpk-0.6.0/src/xpk/parser/cluster.py +621 -0
  39. xpk-0.6.0/src/xpk/parser/common.py +71 -0
  40. xpk-0.6.0/src/xpk/parser/core.py +109 -0
  41. xpk-0.6.0/src/xpk/parser/info.py +63 -0
  42. xpk-0.6.0/src/xpk/parser/inspector.py +65 -0
  43. xpk-0.6.0/src/xpk/parser/job.py +126 -0
  44. xpk-0.6.0/src/xpk/parser/kind.py +94 -0
  45. xpk-0.6.0/src/xpk/parser/shell.py +50 -0
  46. xpk-0.6.0/src/xpk/parser/validators.py +39 -0
  47. xpk-0.6.0/src/xpk/parser/version.py +23 -0
  48. xpk-0.6.0/src/xpk/parser/workload.py +684 -0
  49. xpk-0.6.0/src/xpk/utils/__init__.py +15 -0
  50. xpk-0.6.0/src/xpk/utils/console.py +55 -0
  51. xpk-0.6.0/src/xpk/utils/file.py +82 -0
  52. xpk-0.6.0/src/xpk/utils/network.py +168 -0
  53. xpk-0.6.0/src/xpk/utils/objects.py +85 -0
  54. xpk-0.6.0/src/xpk/utils/yaml.py +30 -0
  55. xpk-0.5.0/README.md → xpk-0.6.0/src/xpk.egg-info/PKG-INFO +314 -22
  56. xpk-0.6.0/src/xpk.egg-info/SOURCES.txt +60 -0
  57. xpk-0.6.0/src/xpk.egg-info/entry_points.txt +2 -0
  58. {xpk-0.5.0 → xpk-0.6.0/src}/xpk.egg-info/requires.txt +7 -0
  59. xpk-0.5.0/xpk.egg-info/SOURCES.txt +0 -10
  60. xpk-0.5.0/xpk.egg-info/entry_points.txt +0 -2
  61. xpk-0.5.0/xpk.py +0 -7282
  62. {xpk-0.5.0 → xpk-0.6.0}/LICENSE +0 -0
  63. {xpk-0.5.0 → xpk-0.6.0}/setup.cfg +0 -0
  64. {xpk-0.5.0 → xpk-0.6.0/src}/xpk.egg-info/dependency_links.txt +0 -0
  65. {xpk-0.5.0 → xpk-0.6.0/src}/xpk.egg-info/top_level.txt +0 -0
@@ -1,8 +1,8 @@
- Metadata-Version: 2.1
+ Metadata-Version: 2.2
  Name: xpk
- Version: 0.5.0
+ Version: 0.6.0
  Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
- Author-email: Cloud TPU Team <cloud-tpu-eng@google.com>
+ Author-email: XPK team <xpk-code-reviewers@google.com>
  License: Apache-2.0
  Project-URL: Homepage, https://github.com/google/xpk
  Project-URL: Bug Tracker, https://github.com/google/xpk/issues
@@ -12,10 +12,17 @@ Requires-Python: >=3.10
  Description-Content-Type: text/markdown
  License-File: LICENSE
  Requires-Dist: cloud-accelerator-diagnostics
+ Requires-Dist: tabulate
+ Requires-Dist: ruamel.yaml
+ Requires-Dist: pyyaml
+ Requires-Dist: docker
+ Requires-Dist: packaging
  Provides-Extra: dev
  Requires-Dist: pyink==24.3.0; extra == "dev"
  Requires-Dist: pylint>=2.6.0; extra == "dev"
  Requires-Dist: pre-commit; extra == "dev"
+ Requires-Dist: pytest; extra == "dev"
+ Requires-Dist: docker; extra == "dev"

  <!--
  Copyright 2023 Google LLC
@@ -62,31 +69,73 @@ xpk supports the following TPU types:
  * v4
  * v5e
  * v5p
+ * Trillium (v6e)

  and the following GPU types:
- * a100
- * h100
+ * A100
+ * A3-Highgpu (h100)
+ * A3-Mega (h100-mega) - [Create cluster](#provisioning-a3-ultra-and-a3-mega-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-and-a3-mega-clusters-gpu-machines)
+ * A3-Ultra (h200) - [Create cluster](#provisioning-a3-ultra-and-a3-mega-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-and-a3-mega-clusters-gpu-machines)

  and the following CPU types:
  * n2-standard-32

+ # Cloud Console permissions needed on the user or service account running XPK
+
+ * Artifact Registry Writer
+ * Compute Admin
+ * Kubernetes Engine Admin
+ * Logging Admin
+ * Monitoring Admin
+ * Service Account User
+ * Storage Admin
+ * Vertex AI Administrator
+
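These roles are typically granted with `gcloud projects add-iam-policy-binding`. A minimal sketch for one of the roles above, assuming a hypothetical user `you@example.com` (`roles/container.admin` is the role ID behind Kubernetes Engine Admin):

```shell
# Sketch: grant Kubernetes Engine Admin to a user; member and project are placeholders
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member=user:you@example.com \
  --role=roles/container.admin
```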
+ # Prerequisites
+
+ The following tools must be installed:
+
+ - python >= 3.10 (download from [here](https://www.python.org/downloads/))
+ - pip ([installation instructions](https://pip.pypa.io/en/stable/installation/))
+ - python venv ([installation instructions](https://virtualenv.pypa.io/en/latest/installation.html))
+ (all three of the above can be installed at once from [here](https://packaging.python.org/en/latest/guides/installing-using-linux-tools/#installing-pip-setuptools-wheel-with-linux-package-managers))
+ - gcloud (install from [here](https://cloud.google.com/sdk/gcloud#download_and_install_the))
+   - Run `gcloud init`
+   - [Authenticate](https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login) to Google Cloud
+ - kubectl (install from [here](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_kubectl))
+   - Install `gke-gcloud-auth-plugin` from [here](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin)
+ - docker ([installation instructions](https://docs.docker.com/engine/install/))
+   - Run `gcloud auth configure-docker` to ensure images can be uploaded to the registry
+ - make - run the command below:
+ ```shell
+ # sudo may be required
+ apt-get -y install make
+ ```
+ In addition, the following dependencies will be installed by the `make install` command:
+ - kueuectl (install from [here](https://kueue.sigs.k8s.io/docs/reference/kubectl-kueue/installation/))
+ - kjob (installation instructions [here](https://github.com/kubernetes-sigs/kjob/blob/main/docs/installation.md))
+
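Taken together, the gcloud-related steps above amount to roughly this sequence. A sketch only, assuming gcloud was installed via the SDK installer (if it came from a package manager, install kubectl and the auth plugin with apt instead of `gcloud components`):

```shell
# Sketch: one-time gcloud setup for XPK, in the order listed above
gcloud init                               # pick project and default zone
gcloud auth application-default login     # authenticate to Google Cloud
gcloud components install kubectl gke-gcloud-auth-plugin
gcloud auth configure-docker              # allow docker pushes to the registry
```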
  # Installation
- To install xpk, run the following command:
+ To install xpk, run the following command and install the additional tools mentioned in [prerequisites](#prerequisites). The [Makefile](https://github.com/AI-Hypercomputer/xpk/blob/main/Makefile) provides a way to install all necessary tools:

  ```shell
  pip install xpk
  ```

+
  If you are running XPK by cloning the GitHub repository, first run the
  following commands to begin using XPK commands:

  ```shell
  git clone https://github.com/google/xpk.git
  cd xpk
- # Install dependencies such as cloud-accelerator-diagnostics
- pip install .
+ # Install required dependencies with make
+ make install && export PATH=$PATH:$PWD/bin
  ```

+ If you want the installed dependencies to persist in your PATH, run
+ `echo $PWD/bin` and add its value to `PATH` in .bashrc or .zshrc.
+
  If you see an error saying: `This environment is externally managed`, please use a virtual environment.

  Example:
@@ -100,8 +149,8 @@ Example:
  ## Clone the repository and install dependencies.
  git clone https://github.com/google/xpk.git
  cd xpk
- # Install dependencies such as cloud-accelerator-diagnostics
- pip install .
+ # Install required dependencies with make
+ make install && export PATH=$PATH:$PWD/bin
  ```

  # XPK for Large Scale (>1k VMs)
@@ -171,13 +220,22 @@ all zones.
  ```

  * Cluster Create for Pathways:
- Pathways compatible cluster can be created using `--enable-pathways`
+ A Pathways-compatible cluster can be created using `cluster create-pathways`.
  ```shell
- python3 xpk.py cluster create \
+ python3 xpk.py cluster create-pathways \
  --cluster xpk-pw-test \
  --num-slices=4 --on-demand \
- --tpu-type=v5litepod-16 \
- --enable-pathways
+ --tpu-type=v5litepod-16
+ ```
+
+ * Cluster Create for Ray:
+ A cluster with KubeRay enabled and a RayCluster can be created using `cluster create-ray`.
+ ```shell
+ python3 xpk.py cluster create-ray \
+ --cluster xpk-rc-test \
+ --ray-version=2.39.0 \
+ --num-slices=4 --on-demand \
+ --tpu-type=v5litepod-8
  ```
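One way to confirm the RayCluster came up after `cluster create-ray` is via KubeRay's custom resource; a hedged sketch, assuming a standard KubeRay install (the `ray.io/cluster` label is an assumption about KubeRay's pod labeling):

```shell
# Sketch: check that the RayCluster resource and its pods exist
kubectl get rayclusters
kubectl get pods -l ray.io/cluster
```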

  * Cluster Create can be called again with the same `--cluster name` to modify
@@ -214,9 +272,73 @@ all zones.
  python3 xpk.py cluster create --force \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=6 --reservation=$RESERVATION_ID
+ ```
+
+ and recreates the cluster with 4 slices of v4-8. The command will rerun to delete
+ 6 slices of v5litepod-16 and create 4 slices of v4-8. The command will warn the
+ user when deleting slices. Use `--force` to skip prompts.
+
+ ```shell
+ python3 xpk.py cluster create \
+ --cluster xpk-test --tpu-type=v4-8 \
+ --num-slices=4 --reservation=$RESERVATION_ID
+
+ # Skip delete prompts using --force.

+ python3 xpk.py cluster create --force \
+ --cluster xpk-test --tpu-type=v4-8 \
+ --num-slices=4 --reservation=$RESERVATION_ID
  ```

+ ### Create Private Cluster
+
+ XPK allows you to create a private GKE cluster for enhanced security. In a private cluster, nodes and pods are isolated from the public internet, providing an additional layer of protection for your workloads.
+
+ To create a private cluster, use the following arguments:
+
+ **`--private`**
+
+ This flag enables the creation of a private GKE cluster. When this flag is set:
+
+ * Nodes and pods are isolated from direct internet access.
+ * `master_authorized_networks` is automatically enabled.
+ * Access to the cluster's control plane is restricted to your current machine's IP address by default.
+
+ **`--authorized-networks`**
+
+ This argument allows you to specify additional IP ranges (in CIDR notation) that are authorized to access the private cluster's control plane and perform `kubectl` commands.
+
+ * Even if this argument is not set when you use `--private`, your current machine's IP address will always be given access to the control plane.
+ * If this argument is used with an existing private cluster, it will replace the existing authorized networks.
+
+ **Example Usage:**
+
+ * To create a private cluster and allow control plane access only from your current machine:
+
+ ```shell
+ python3 xpk.py cluster create \
+ --cluster=xpk-private-cluster \
+ --tpu-type=v4-8 --num-slices=2 \
+ --private
+ ```
+
+ * To create a private cluster and allow control plane access only from your current machine and the IP ranges `1.2.3.0/24` and `1.2.4.5/32`:
+
+ ```shell
+ python3 xpk.py cluster create \
+ --cluster=xpk-private-cluster \
+ --tpu-type=v4-8 --num-slices=2 \
+ --authorized-networks 1.2.3.0/24 1.2.4.5/32
+
+ # --private is optional when you set --authorized-networks
+ ```
+
+ > **Important Notes:**
+ > * The argument `--private` is only applicable when creating new clusters. You cannot convert an existing public cluster to a private cluster using these flags.
+ > * The argument `--authorized-networks` is applicable when creating new clusters or when using an existing *private* cluster. You cannot convert an existing public cluster to a private cluster using these flags.
+ > * You need to [set up Cloud NAT for your VPC network](https://cloud.google.com/nat/docs/set-up-manage-network-address-translation#creating_nat) so that the nodes and pods have outbound access to the internet. This is required because XPK installs and configures components such as Kueue that need access to external sources like `registry.k8s.io`.
+
+
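As a rough guide to the NAT requirement in the notes above: Cloud NAT hangs off a Cloud Router in the cluster's region. A minimal sketch, where `xpk-nat-router`, `xpk-nat`, `$NETWORK`, and `$REGION` are placeholders:

```shell
# Sketch: minimal Cloud NAT so private nodes and pods can reach registry.k8s.io etc.
gcloud compute routers create xpk-nat-router \
  --network=$NETWORK --region=$REGION
gcloud compute routers nats create xpk-nat \
  --router=xpk-nat-router --region=$REGION \
  --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges
```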
  ### Create Vertex AI Tensorboard
  *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have
  [Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
@@ -306,6 +428,26 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp
  --tpu-type=v5litepod-16
  ```

+ ## Provisioning A3-Ultra and A3-Mega clusters (GPU machines)
+ To create a cluster with A3 machines, run the command below. To create workloads on these clusters, see [here](#workloads-for-a3-ultra-and-a3-mega-clusters-gpu-machines).
+ * For A3-Ultra: --device-type=h200-141gb-8
+ * For A3-Mega: --device-type=h100-mega-80gb-8
+
+ ```shell
+ python3 xpk.py cluster create \
+ --cluster CLUSTER_NAME --device-type=h200-141gb-8 \
+ --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
+ --num-nodes=4 --reservation=$RESERVATION_ID
+ ```
+ Currently, the following flags/arguments are supported for A3-Mega and A3-Ultra machines:
+ * --num-nodes
+ * --default-pool-cpu-machine-type
+ * --default-pool-cpu-num-nodes
+ * --reservation
+ * --spot
+ * --on-demand (only A3-Mega)
+
+
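For the A3-Mega case, the same command only swaps the device type and, since A3-Mega supports it, can use `--on-demand` instead of a reservation; a sketch built from the flags listed above:

```shell
# Sketch: A3-Mega variant of the cluster create command above
python3 xpk.py cluster create \
  --cluster CLUSTER_NAME --device-type=h100-mega-80gb-8 \
  --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
  --num-nodes=4 --on-demand
```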
  ## Workload Create
  * Workload Create (submit training job):

@@ -317,26 +459,25 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp
  ```

  * Workload Create for Pathways:
- Pathways workload can be submitted using `--use-pathways` on a Pathways enabled cluster (created with `--enable-pathways`)
+ A Pathways workload can be submitted using `workload create-pathways` on a Pathways-enabled cluster (created with `cluster create-pathways`)

  Pathways workload example:
  ```shell
- python3 xpk.py workload create \
+ python3 xpk.py workload create-pathways \
  --workload xpk-pw-test \
  --num-slices=1 \
  --tpu-type=v5litepod-16 \
- --use-pathways \
  --cluster xpk-pw-test \
  --docker-name='user-workload' \
  --docker-image=<maxtext docker image> \
  --command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
  ```

- Regular workload can also be submitted on a Pathways enabled cluster (created with `--enable-pathways`)
+ A regular workload can also be submitted on a Pathways-enabled cluster (created with `cluster create-pathways`)

  Regular workload example:
  ```shell
- python3 xpk.py workload create \
+ python3 xpk.py workload create-pathways \
  --workload xpk-regular-test \
  --num-slices=1 \
  --tpu-type=v5litepod-16 \
@@ -346,6 +487,25 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp
  --command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
  ```

+ Pathways in headless mode - Pathways now offers the capability to run JAX workloads in Vertex AI notebooks or in GCE VMs!
+ Specify `--headless` with `workload create-pathways` when the user workload is not provided in a docker container.
+ ```shell
+ python3 xpk.py workload create-pathways --headless \
+ --workload xpk-pw-headless \
+ --num-slices=1 \
+ --tpu-type=v5litepod-16 \
+ --cluster xpk-pw-test
+ ```
+ Executing the command above prints the address of the proxy that the user job should connect to.
+ ```shell
+ kubectl get pods
+ kubectl port-forward pod/<proxy-pod-name> 29000:29000
+ ```
+ ```shell
+ JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 python -c 'import pathwaysutils; import jax; print(jax.devices())'
+ ```
+ Specify `JAX_PLATFORMS=proxy` and `JAX_BACKEND_TARGET=<proxy address from above>` and `import pathwaysutils` to establish the connection between the user's JAX code and the Pathways proxy. Execute Pathways workloads interactively on Vertex AI notebooks!
+
  ### Set `max-restarts` for production jobs

  * `--max-restarts <value>`: By default, this is 0. This will restart the job ""
@@ -354,6 +514,20 @@ increase this to a large number, say 50. Real jobs can be interrupted due to
  hardware failures and software updates. We assume your job has implemented
  checkpointing so the job restarts near where it was interrupted.
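
For instance, a production run might combine the flags from [Workload Create](#workload-create) with a generous restart budget; a sketch reusing names from the examples above:

```shell
# Sketch: production workload with up to 50 automatic restarts
python3 xpk.py workload create \
  --workload xpk-prod-workload --command "python3 main.py" \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --max-restarts 50
```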
+ ### Workloads for A3-Ultra and A3-Mega clusters (GPU machines)
+ To submit jobs on a cluster with A3 machines, run the command below. To create a cluster with A3 machines, see [here](#provisioning-a3-ultra-and-a3-mega-clusters-gpu-machines).
+ * For A3-Ultra: --device-type=h200-141gb-8
+ * For A3-Mega: --device-type=h100-mega-80gb-8
+
+ ```shell
+ python3 xpk.py workload create \
+ --workload=$WORKLOAD_NAME --command="echo goodbye" \
+ --cluster=$CLUSTER_NAME --device-type=h200-141gb-8 \
+ --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
+ --num-nodes=$WORKLOAD_NUM_NODES
+ ```
+ > The docker image flags/arguments introduced in the [workloads section](#workload-create) can be used with A3 machines as well.
+
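As an illustration of that note, a sketch that adds `--docker-image` from the workloads section to an A3-Mega submission; the image name is a placeholder:

```shell
# Sketch: A3-Mega workload using a custom docker image
python3 xpk.py workload create \
  --workload=$WORKLOAD_NAME --command="nvidia-smi" \
  --cluster=$CLUSTER_NAME --device-type=h100-mega-80gb-8 \
  --docker-image=<your training image> \
  --num-nodes=$WORKLOAD_NUM_NODES
```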
  ### Workload Priority and Preemption
  * Set the priority level of your workload with `--priority=LEVEL`

@@ -491,7 +665,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t
  --cluster xpk-test --filter-by-job=$USER
  ```

- * Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once.
+ * Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once.
  (Note: `restart-on-user-code-failure` must be set
  when creating the workload, otherwise the workload will always finish with `Completed` status.)

@@ -510,11 +684,37 @@ when creating the workload otherwise the workload will always finish with `Compl
  --timeout=300
  ```

- Return codes
- `0`: Workload finished and completed successfully.
- `124`: Timeout was reached before workload finished.
- `125`: Workload finished but did not complete successfully.
- `1`: Other failure.
+ Return codes
+ `0`: Workload finished and completed successfully.
+ `124`: Timeout was reached before workload finished.
+ `125`: Workload finished but did not complete successfully.
+ `1`: Other failure.
+
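In a script, those return codes can drive control flow. A sketch, assuming the wait-enabled `workload list` invocation from the example above is substituted where indicated:

```shell
# Sketch: branch on the documented return codes
python3 xpk.py workload list --cluster xpk-test --timeout=300  # plus the wait flag from the example above
rc=$?
case $rc in
  0)   echo "workload completed successfully" ;;
  124) echo "timeout reached before the workload finished" ;;
  125) echo "workload finished but did not complete successfully" ;;
  *)   echo "other failure (rc=$rc)" ;;
esac
```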
+ ## Job List
+
+ * Job List (see jobs submitted via the batch command):
+
+ ```shell
+ python3 xpk.py job ls --cluster xpk-test
+ ```
+
+ * Example Job List output:
+
+ ```
+ NAME                              PROFILE               LOCAL QUEUE   COMPLETIONS   DURATION   AGE
+ xpk-def-app-profile-slurm-74kbv   xpk-def-app-profile                 1/1           15s        17h
+ xpk-def-app-profile-slurm-brcsg   xpk-def-app-profile                 1/1           9s         3h56m
+ xpk-def-app-profile-slurm-kw99l   xpk-def-app-profile                 1/1           5s         3h54m
+ xpk-def-app-profile-slurm-x99nx   xpk-def-app-profile                 3/3           29s        17h
+ ```
+
+ ## Job Cancel
+
+ * Job Cancel (delete a job submitted via the batch command):
+
+ ```shell
+ python3 xpk.py job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test
+ ```

  ## Inspector
  * Inspector provides debug info to understand cluster health, and why workloads are not running.
@@ -971,6 +1171,14 @@ gcloud compute machine-types list --zones=$ZONE_LIST
  python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
  ```

+ ## Workload creation fails
+
+ Some XPK cluster configuration might be missing if workload creation fails with the error below.
+
+ `[XPK] b'error: the server doesn\'t have a resource type "workloads"\n'`
+
+ Mitigate this error by re-running your `xpk.py cluster create ...` command to refresh the cluster configuration.
+
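Concretely, that means repeating your original create invocation unchanged; a sketch using flag values from the earlier examples:

```shell
# Sketch: re-apply cluster configuration with the same arguments as the original create
python3 xpk.py cluster create \
  --cluster xpk-test --tpu-type=v5litepod-16 \
  --num-slices=4 --on-demand
```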
  ## Permission Issues: `requires one of ["permission_name"] permission(s)`.

  1) Determine the role needed based on the permission error:
@@ -1031,6 +1239,9 @@ gcloud beta compute reservations list --project=$PROJECT_ID
  gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
  ```

+ ## 403 error on workload create when using the `--base-docker-image` flag
+ You need permission to push to the registry from your local machine. Try running `gcloud auth configure-docker`.
+
  # TPU Workload Debugging

  ## Verbose Logging
@@ -1072,3 +1283,65 @@ To explore the stack traces collected in a temporary directory in Kubernetes Pod
  --workload xpk-test-workload --command "python3 main.py" --cluster \
  xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar
  ```
+
+ ### Get information about jobs, queues and resources
+
+ To list available resources and queues, use the `xpk info` command. It lets you see local queues and cluster queues and check for available resources.
+
+ To see queues with usage and workload info, use:
+ ```shell
+ python3 xpk.py info --cluster my-cluster
+ ```
+
+ You can specify which kind of resource (clusterqueue or localqueue) you want to see using the `--clusterqueue` or `--localqueue` flag.
+ ```shell
+ python3 xpk.py info --cluster my-cluster --localqueue
+ ```
+
+ # Local testing with Kind
+
+ To facilitate development and testing locally, we have integrated support for testing with `kind`. This enables you to simulate a Kubernetes environment on your local machine.
+
+ ## Prerequisites
+
+ - Install kind on your local machine. Follow the official documentation here: [Kind Installation Guide](https://kind.sigs.k8s.io/docs/user/quick-start#installation).
+
+ ## Usage
+
+ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facilitating the orchestration and management of workloads. Below are the commands for managing clusters:
+
+ ### Cluster Create
+ * Cluster Create:
+
+ ```shell
+ python3 xpk.py kind create \
+ --cluster xpk-test
+ ```
+
+ ### Cluster Delete
+ * Cluster Delete:
+
+ ```shell
+ python3 xpk.py kind delete \
+ --cluster xpk-test
+ ```
+
+ ### Cluster List
+ * Cluster List:
+
+ ```shell
+ python3 xpk.py kind list
+ ```
+
+ ## Local Testing Basics
+
+ Local testing is available exclusively through the `batch` and `job` commands of xpk with the `--kind-cluster` flag. This allows you to simulate training jobs locally:
+
+ ```shell
+ python xpk.py batch [other-options] --kind-cluster script
+ ```
+
+ Please note that all other xpk subcommands are intended for use with cloud systems on Google Compute Engine (GCE) and don't support local testing. This includes commands like cluster, info, inspector, etc.
+
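Putting the pieces together, a local round trip might look like the sketch below. The option order and the `script` argument mirror the `batch` example above; passing `--kind-cluster` to `job ls` is an assumption based on the note that `job` commands support it:

```shell
# Sketch: local end-to-end loop with kind
python3 xpk.py kind create --cluster xpk-test
python3 xpk.py batch --kind-cluster --cluster xpk-test script
python3 xpk.py job ls --kind-cluster --cluster xpk-test
python3 xpk.py kind delete --cluster xpk-test
```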
+ # Other advanced usage
+ [Use a Jupyter notebook to interact with a Cloud TPU cluster](xpk-notebooks.md)