xpk 0.3.0__py3-none-any.whl → 0.5.0__py3-none-any.whl

@@ -1,586 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: xpk
3
- Version: 0.3.0
4
- Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
5
- Author-email: Cloud TPU Team <cloud-tpu-eng@google.com>
6
- License: Apache-2.0
7
- Project-URL: Homepage, https://github.com/google/xpk
8
- Project-URL: Bug Tracker, https://github.com/google/xpk/issues
9
- Classifier: Programming Language :: Python :: 3.10
10
- Classifier: Programming Language :: Python :: 3.11
11
- Requires-Python: >=3.10
12
- Description-Content-Type: text/markdown
13
- License-File: LICENSE
14
-
15
- <!--
16
- Copyright 2023 Google LLC
17
-
18
- Licensed under the Apache License, Version 2.0 (the "License");
19
- you may not use this file except in compliance with the License.
20
- You may obtain a copy of the License at
21
-
22
- https://www.apache.org/licenses/LICENSE-2.0
23
-
24
- Unless required by applicable law or agreed to in writing, software
25
- distributed under the License is distributed on an "AS IS" BASIS,
26
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
27
- See the License for the specific language governing permissions and
28
- limitations under the License.
29
- -->
30
-
31
- # Overview
32
-
33
- xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps
34
- Cloud developers orchestrate training jobs on accelerators such as TPUs and
35
- GPUs on GKE. xpk handles the "multihost pods" of TPUs, GPUs (HGX H100) and CPUs
36
- (n2-standard-32) as first-class citizens.
37
-
38
- xpk decouples provisioning capacity from running jobs. There are two structures:
39
- clusters (provisioned VMs) and workloads (training jobs). Clusters represent the
40
- physical resources you have available. Workloads represent training jobs -- at
41
- any time some of these will be completed, others will be running and some will
42
- be queued, waiting for cluster resources to become available.
43
-
44
- The ideal workflow starts by provisioning the clusters for all of the ML
45
- hardware you have reserved. Then, without re-provisioning, submit jobs as
46
- needed. By eliminating the need for re-provisioning between jobs, using Docker
47
- containers with pre-installed dependencies and cross ahead-of-time compilation,
48
- these queued jobs run with minimal start times. Further, because workloads
49
- return the hardware back to the shared pool when they complete, developers can
50
- achieve better use of finite hardware resources. And automated tests can run
51
- overnight while resources tend to be underutilized.
52
-
53
- xpk supports the following TPU types:
54
- * v4
55
- * v5e
56
- * v5p
57
-
58
- and the following GPU types:
59
- * a100
60
- * h100
61
-
62
- and the following CPU types:
63
- * n2-standard-32
64
-
65
- # Installation
66
- To install xpk, run the following command:
67
-
68
- ```shell
69
- pip install xpk
70
- ```
71
-
72
- # XPK for Large Scale (>1k VMs)
73
-
74
- Follow user instructions in [xpk-large-scale-guide.sh](xpk-large-scale-guide.sh)
75
- to use xpk for a GKE cluster greater than 1000 VMs. Run these steps to set up a
76
- GKE cluster with large scale training and high throughput support with XPK, and
77
- run jobs with XPK. We recommend you manually copy commands per step and verify
78
- the outputs of each step.
79
-
80
- # Example usages:
81
-
82
- To get started, be sure to set your GCP Project and Zone as usual via `gcloud
83
- config set`.
84
-
85
- Below are reference commands. A typical journey starts with a `Cluster Create`
86
- followed by many `Workload Create`s. To understand the state of the system you
87
- might want to use `Cluster List` or `Workload List` commands. Finally, you can
88
- clean up with a `Cluster Delete`.
89
-
90
- ## Cluster Create
91
-
92
- First set the project and zone through gcloud config or xpk arguments.
93
-
94
- ```shell
95
- PROJECT_ID=my-project-id
96
- ZONE=us-east5-b
97
- # gcloud config:
98
- gcloud config set project $PROJECT_ID
99
- gcloud config set compute/zone $ZONE
100
- # xpk arguments
101
- xpk .. --zone $ZONE --project $PROJECT_ID
102
- ```
103
-
104
-
105
- The cluster created is a regional cluster to enable the GKE control plane across
106
- all zones.
107
-
108
- * Cluster Create (provision reserved capacity):
109
-
110
- ```shell
111
- # Find your reservations
112
- gcloud compute reservations list --project=$PROJECT_ID
113
- # Run cluster create with reservation.
114
- python3 xpk.py cluster create \
115
- --cluster xpk-test --tpu-type=v5litepod-256 \
116
- --num-slices=2 \
117
- --reservation=$RESERVATION_ID
118
- ```
119
-
120
- * Cluster Create (provision on-demand capacity):
121
-
122
- ```shell
123
- python3 xpk.py cluster create \
124
- --cluster xpk-test --tpu-type=v5litepod-16 \
125
- --num-slices=4 --on-demand
126
- ```
127
-
128
- * Cluster Create (provision spot / preemptible capacity):
129
-
130
- ```shell
131
- python3 xpk.py cluster create \
132
- --cluster xpk-test --tpu-type=v5litepod-16 \
133
- --num-slices=4 --spot
134
- ```
135
-
136
- * Cluster Create can be called again with the same `--cluster name` to modify
137
- the number of slices or retry failed steps.
138
-
139
- For example, if a user creates a cluster with 4 slices:
140
-
141
- ```shell
142
- python3 xpk.py cluster create \
143
- --cluster xpk-test --tpu-type=v5litepod-16 \
144
- --num-slices=4 --reservation=$RESERVATION_ID
145
- ```
146
-
147
- and recreates the cluster with 8 slices. The command will rerun to create 4
148
- new slices:
149
-
150
- ```shell
151
- python3 xpk.py cluster create \
152
- --cluster xpk-test --tpu-type=v5litepod-16 \
153
- --num-slices=8 --reservation=$RESERVATION_ID
154
- ```
155
-
156
- and recreates the cluster with 6 slices. The command will rerun to delete 2
157
- slices. The command will warn the user when deleting slices.
158
- Use `--force` to skip prompts.
159
-
160
- ```shell
161
- python3 xpk.py cluster create \
162
- --cluster xpk-test --tpu-type=v5litepod-16 \
163
- --num-slices=6 --reservation=$RESERVATION_ID
164
-
165
- # Skip delete prompts using --force.
166
-
167
- python3 xpk.py cluster create --force \
168
- --cluster xpk-test --tpu-type=v5litepod-16 \
169
- --num-slices=6 --reservation=$RESERVATION_ID
170
-
171
- ```
172
- ## Cluster Delete
173
- * Cluster Delete (deprovision capacity):
174
-
175
- ```shell
176
- python3 xpk.py cluster delete \
177
- --cluster xpk-test
178
- ```
179
- ## Cluster List
180
- * Cluster List (see provisioned capacity):
181
-
182
- ```shell
183
- python3 xpk.py cluster list
184
- ```
185
- ## Cluster Describe
186
- * Cluster Describe (see capacity):
187
-
188
- ```shell
189
- python3 xpk.py cluster describe \
190
- --cluster xpk-test
191
- ```
192
-
193
- ## Cluster Cacheimage
194
- * Cluster Cacheimage (enables faster start times):
195
-
196
- ```shell
197
- python3 xpk.py cluster cacheimage \
198
- --cluster xpk-test --docker-image gcr.io/your_docker_image \
199
- --tpu-type=v5litepod-16
200
- ```
201
-
202
- ## Workload Create
203
- * Workload Create (submit training job):
204
-
205
- ```shell
206
- python3 xpk.py workload create \
207
- --workload xpk-test-workload --command "echo goodbye" --cluster \
208
- xpk-test --tpu-type=v5litepod-16
209
- ```
210
-
211
- ### Set `max-restarts` for production jobs
212
-
213
- * `--max-restarts <value>`: By default, this is 0. This will restart the job `<value>`
214
- times when the job terminates. For production jobs, it is recommended to
215
- increase this to a large number, say 50. Real jobs can be interrupted due to
216
- hardware failures and software updates. We assume your job has implemented
217
- checkpointing so the job restarts near where it was interrupted.
218
-
219
- ### Workload Priority and Preemption
220
- * Set the priority level of your workload with `--priority=LEVEL`
221
-
222
- We have five priorities defined: [`very-low`, `low`, `medium`, `high`, `very-high`].
223
- The default priority is `medium`.
224
-
225
- Priority determines:
226
-
227
- 1. Order of queued jobs.
228
-
229
- Queued jobs are ordered by
230
- `very-low` < `low` < `medium` < `high` < `very-high`
231
-
232
- 2. Preemption of lower priority workloads.
233
-
234
- A higher priority job will `evict` lower priority jobs.
235
- Evicted jobs are brought back to the queue and will re-hydrate appropriately.
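
As a rough illustration of the queue ordering described above, the following Python sketch sorts jobs so that higher-priority jobs come first. The job names and the `order_queue` helper are hypothetical and are not part of xpk. Keeping submission order for jobs of equal priority is an assumption of this sketch, not something the text above states.

```python
# Hypothetical sketch, not xpk code. The priority names come from the text above.
PRIORITY_LEVELS = ["very-low", "low", "medium", "high", "very-high"]
PRIORITY_RANK = {name: rank for rank, name in enumerate(PRIORITY_LEVELS)}


def order_queue(jobs):
    """Return (name, priority) pairs with the highest priority first.

    sorted() is stable, so jobs with equal priority keep their input order
    (an assumption of this sketch).
    """
    return sorted(jobs, key=lambda job: -PRIORITY_RANK[job[1]])


queue = [
    ("nightly-tests", "low"),
    ("train-a", "medium"),
    ("urgent-fix", "very-high"),
    ("train-b", "medium"),
]
for name, priority in order_queue(queue):
    print(f"{priority:>9}  {name}")
```

An unknown priority name raises `KeyError` here, which is a deliberate choice for the sketch: a misspelled level fails loudly instead of being silently ranked.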
236
-
237
- #### General Example:
238
- ```shell
239
- python3 xpk.py workload create \
240
- --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
241
- xpk-test --tpu-type=v5litepod-16 --priority=medium
242
- ```
243
-
244
- ## Workload Delete
245
- * Workload Delete (delete training job):
246
-
247
- ```shell
248
- python3 xpk.py workload delete \
249
- --workload xpk-test-workload --cluster xpk-test
250
- ```
251
-
252
- This deletes only the `xpk-test-workload` workload in the `xpk-test` cluster.
253
-
254
- * Workload Delete (delete all training jobs in the cluster):
255
-
256
- ```shell
257
- python3 xpk.py workload delete \
258
- --cluster xpk-test
259
- ```
260
-
261
- This will delete all the workloads in the `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt. Multiple workload deletions are processed in batches.
262
-
263
- * Workload Delete supports filtering. Delete a portion of jobs that match user criteria. Multiple workload deletions are processed in batches for optimized processing.
264
- * Filter by Job: `filter-by-job`
265
-
266
- ```shell
267
- python3 xpk.py workload delete \
268
- --cluster xpk-test --filter-by-job=$USER
269
- ```
270
-
271
- This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.
272
-
273
- * Filter by Status: `filter-by-status`
274
-
275
- ```shell
276
- python3 xpk.py workload delete \
277
- --cluster xpk-test --filter-by-status=QUEUED
278
- ```
279
-
280
- This will delete all the workloads in the `xpk-test` cluster whose status is Admitted or Evicted and whose number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
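
The two filters can be pictured with a short Python sketch. It only restates the rules given above (a name prefix for `filter-by-job`; status Admitted or Evicted with 0 running VMs for `QUEUED`). The `Workload` record, its field names, and the helper functions are hypothetical, and xpk's actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical record; the field names are assumptions for this sketch.
    name: str
    status: str       # e.g. "Admitted", "Evicted", "Finished"
    running_vms: int  # number of VMs currently running for the job


def matches_job_filter(workload, prefix):
    """filter-by-job: keep workloads whose names start with the prefix."""
    return workload.name.startswith(prefix)


def is_queued(workload):
    """QUEUED: status is Admitted or Evicted and no VMs are running."""
    return workload.status in ("Admitted", "Evicted") and workload.running_vms == 0


workloads = [
    Workload("alice-job-running", "Admitted", 4),
    Workload("alice-job-in-queue", "Admitted", 0),
    Workload("bob-job-preempted", "Evicted", 0),
]

print([w.name for w in workloads if matches_job_filter(w, "alice")])
print([w.name for w in workloads if is_queued(w)])
```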
281
-
282
- ## Workload List
283
- * Workload List (see training jobs):
284
-
285
- ```shell
286
- python3 xpk.py workload list \
287
- --cluster xpk-test
288
- ```
289
-
290
- * Example Workload List Output:
291
-
292
- The example below shows five jobs with different statuses:
293
-
294
- * `user-first-job-failed`: **filter-status** is `FINISHED` and `FAILED`.
295
- * `user-second-job-success`: **filter-status** is `FINISHED` and `SUCCESSFUL`.
296
- * `user-third-job-running`: **filter-status** is `RUNNING`.
297
- * `user-forth-job-in-queue`: **filter-status** is `QUEUED`.
298
- * `user-fifth-job-preempted`: **filter-status** is `QUEUED`.
299
-
300
- ```
301
- Jobset Name Created Time Priority TPU VMs Needed TPU VMs Running/Ran TPU VMs Done Status Status Message Status Time
302
- user-first-job-failed 2023-1-1T1:00:00Z medium 4 4 <none> Finished JobSet failed 2023-1-1T1:05:00Z
303
- user-second-job-success 2023-1-1T1:10:00Z medium 4 4 4 Finished JobSet finished successfully 2023-1-1T1:14:00Z
304
- user-third-job-running 2023-1-1T1:15:00Z medium 4 4 <none> Admitted Admitted by ClusterQueue cluster-queue 2023-1-1T1:16:00Z
305
- user-forth-job-in-queue 2023-1-1T1:16:05Z medium 4 <none> <none> Admitted couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need 2023-1-1T1:16:10Z
306
- user-fifth-job-preempted 2023-1-1T1:10:05Z low 4 <none> <none> Evicted Preempted to accommodate a higher priority Workload 2023-1-1T1:10:00Z
307
- ```
308
-
309
- * Workload List supports filtering. Observe a portion of jobs that match user criteria.
310
-
311
- * Filter by Status: `filter-by-status`
312
-
313
- Filter the workload list by the status of respective jobs.
314
- Status can be: `EVERYTHING`,`FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`
315
-
316
- * Filter by Job: `filter-by-job`
317
-
318
- Filter the workload list by the name of a job.
319
-
320
- ```shell
321
- python3 xpk.py workload list \
322
- --cluster xpk-test --filter-by-job=$USER
323
- ```
324
-
325
-
326
- ## GPU usage
327
-
328
- To use XPK with GPUs, pass the `--device-type` flag.
329
-
330
- * Cluster Create (provision reserved capacity):
331
-
332
- ```shell
333
- # Find your reservations
334
- gcloud compute reservations list --project=$PROJECT_ID
335
-
336
- # Run cluster create with reservation.
337
- python3 xpk.py cluster create \
338
- --cluster xpk-test --device-type=h100-80gb-8 \
339
- --num-slices=2 \
340
- --reservation=$RESERVATION_ID
341
- ```
342
-
343
- * [Install NVIDIA GPU device drivers](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install)
344
- ```shell
345
- # List available driver versions
346
- gcloud compute ssh $NODE_NAME --command "sudo cos-extensions list"
347
-
348
- # Install the default driver
349
- gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu"
350
- # OR install a specific version of the driver
351
- gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu -- -version=DRIVER_VERSION"
352
- ```
353
-
354
- * Run a workload:
355
-
356
- ```shell
357
- # Submit a workload
358
- python3 xpk.py workload create \
359
- --cluster xpk-test --device-type h100-80gb-8 \
360
- --workload xpk-test-workload \
361
- --command="echo hello world"
362
- ```
363
-
364
- ## CPU usage
365
-
366
- To use XPK with CPUs, pass the `--device-type` flag.
367
-
368
- * Cluster Create (provision on-demand capacity):
369
-
370
- ```shell
371
- # Run cluster create with on demand capacity.
372
- python3 xpk.py cluster create \
373
- --cluster xpk-test \
374
- --device-type=n2-standard-32-256 \
375
- --num-slices=1 \
376
- --default-pool-cpu-machine-type=n2-standard-32 \
377
- --on-demand
378
- ```
379
- Note that `device-type` for CPUs has the format `<cpu-machine-type>-<number of VMs>`. In the example above, the user requests 256 VMs of type n2-standard-32.
380
- Currently, only workloads using fewer than 1000 VMs are supported.
381
-
382
- * Run a workload:
383
-
384
- ```shell
385
- # Submit a workload
386
- python3 xpk.py workload create \
387
- --cluster xpk-test \
388
- --num-slices=1 \
389
- --device-type=n2-standard-32-256 \
390
- --workload xpk-test-workload \
391
- --command="echo hello world"
392
- ```
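
The CPU `device-type` format used in this section, `<cpu-machine-type>-<number of VMs>`, can be split mechanically on its last hyphen. The Python sketch below illustrates that format only; `parse_cpu_device_type` is a hypothetical helper, not an xpk function, and xpk itself may validate the value differently.

```python
def parse_cpu_device_type(device_type):
    """Split '<cpu-machine-type>-<number of VMs>' into its two parts.

    The machine type itself contains hyphens, so the split happens at the
    last hyphen only.
    """
    machine_type, sep, vm_count = device_type.rpartition("-")
    if not sep or not machine_type or not vm_count.isdigit():
        raise ValueError(
            f"expected <cpu-machine-type>-<number of VMs>, got {device_type!r}"
        )
    return machine_type, int(vm_count)


print(parse_cpu_device_type("n2-standard-32-256"))  # ('n2-standard-32', 256)
```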
393
-
394
-
395
- # How to add docker images to an xpk workload
396
-
397
- By default, `xpk workload create` layers the local directory (`--script-dir`) into
398
- the base docker image (`--base-docker-image`) and runs the workload command.
399
- If you don't want this layering behavior, you can directly use `--docker-image`. Do not mix arguments from the two flows in the same command.
400
-
401
- ## Recommended / Default Docker Flow: `--base-docker-image` and `--script-dir`
402
- This flow pulls the `--script-dir` into the `--base-docker-image` and runs the new docker image.
403
-
404
- * The arguments below are optional. By default, xpk layers the local
405
- directory onto a generic base docker image.
406
-
407
- - `--base-docker-image` sets the base image that xpk will start with.
408
-
409
- - `--script-dir` sets which directory to pull into the image. This defaults to the current working directory.
410
-
411
- See `python3 xpk.py workload create --help` for more info.
412
-
413
- * Example with defaults which pulls the local directory into the base image:
414
- ```shell
415
- echo -e '#!/bin/bash \n echo "Hello world from a test script!"' > test.sh
416
- python3 xpk.py workload create --cluster xpk-test \
417
- --workload xpk-test-workload-base-image --command "bash test.sh" \
418
- --tpu-type=v5litepod-16 --num-slices=1
419
- ```
420
-
421
- * Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators):
422
- ```shell
423
- python3 xpk.py workload create --cluster xpk-test \
424
- --workload xpk-test-workload-base-image --command "bash custom_script.sh" \
425
- --base-docker-image=gcr.io/your_dependencies_docker_image \
426
- --tpu-type=v5litepod-16 --num-slices=1
427
- ```
428
-
429
- ## Optional Direct Docker Image Configuration: `--docker-image`
430
- If a user wants to directly set the docker image used and not layer in the
431
- current working directory, set `--docker-image` to the image to be used in the
432
- workload.
433
-
434
- * Running with `--docker-image`:
435
- ```shell
436
- python3 xpk.py workload create --cluster xpk-test \
437
- --workload xpk-test-workload-base-image --command "bash test.sh" \
438
- --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
439
- ```
440
-
441
- * Recommended Flow For Large Sized Jobs (more than 10k accelerators):
442
- ```shell
443
- python3 xpk.py cluster cacheimage \
444
- --cluster xpk-test --docker-image gcr.io/your_docker_image
445
- # Run workload create with the same image.
446
- python3 xpk.py workload create --cluster xpk-test \
447
- --workload xpk-test-workload-base-image --command "bash test.sh" \
448
- --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image
449
- ```
450
-
451
- # More advanced facts:
452
-
453
- * Workload create accepts a --env-file flag to allow specifying the container's
454
- environment from a file. Usage is the same as Docker's
455
- [--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env)
456
-
457
- Example File:
458
- ```shell
459
- LIBTPU_INIT_ARGS=--my-flag=true --performance=high
460
- MY_ENV_VAR=hello
461
- ```
462
-
463
- * Workload create accepts a `--debug-dump-gcs` flag, which takes a path to a GCS bucket.
464
- Passing this flag sets `XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/'` and uploads
465
- HLO dumps to the specified GCS bucket for each worker.
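
To make the env-file format above concrete, here is a minimal Python sketch of how `KEY=VALUE` lines are read when each line is split on its first `=` only, which is why a value such as `--my-flag=true --performance=high` survives intact. `parse_env_file` is a hypothetical helper written for illustration. It is neither xpk's nor Docker's parser, and it simplifies details such as bare variable names.

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines, splitting each line on the first '=' only."""
    env = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        key, sep, value = stripped.partition("=")
        if not sep:
            continue  # this sketch ignores bare names without '='
        env[key] = value
    return env


example = "LIBTPU_INIT_ARGS=--my-flag=true --performance=high\nMY_ENV_VAR=hello\n"
env = parse_env_file(example)
print(env["LIBTPU_INIT_ARGS"])
print(env["MY_ENV_VAR"])
```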
466
-
467
-
468
- # Troubleshooting
469
-
470
- ## `Invalid machine type` for CPUs.
471
- XPK will create a regional GKE cluster. If you see issues like
472
-
473
- ```shell
474
- Invalid machine type e2-standard-32 in zone $ZONE_NAME
475
- ```
476
-
477
- Please select a CPU type that exists in all zones in the region.
478
-
479
- ```shell
480
- # Find CPU Types supported in zones.
481
- gcloud compute machine-types list --zones=$ZONE_LIST
482
- # Adjust default cpu machine type.
483
- python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
484
- ```
485
-
486
- ## Permission Issues: `requires one of ["permission_name"] permission(s)`.
487
-
488
- 1) Determine the role needed based on the permission error:
489
-
490
- ```shell
491
- # For example: `requires one of ["container.*"] permission(s)`
492
- # Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
493
- ```
494
-
495
- 2) Add the role to the user in your project.
496
-
497
- Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli:
498
- ```shell
499
- PROJECT_ID=my-project-id
500
- CURRENT_GKE_USER=$(gcloud config get account)
501
- ROLE=roles/container.admin # container.admin is the role needed for Kubernetes Engine Admin
502
- gcloud projects add-iam-policy-binding $PROJECT_ID --member user:$CURRENT_GKE_USER --role=$ROLE
503
- ```
504
-
505
- 3) Check the permissions are correct for the users.
506
-
507
- Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli:
508
-
509
- ```shell
510
- PROJECT_ID=my-project-id
511
- CURRENT_GKE_USER=$(gcloud config get account)
512
- gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$CURRENT_GKE_USER" --flatten="bindings[].members"
513
- ```
514
-
515
- 4) Confirm you have logged in locally with the correct user.
516
-
517
- ```shell
518
- gcloud auth login
519
- ```
520
-
521
- ### Roles needed based on permission errors:
522
-
523
- * `requires one of ["container.*"] permission(s)`
524
-
525
- Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
526
-
527
- * `ERROR: (gcloud.monitoring.dashboards.list) User does not have permission to access projects instance (or it may not exist)`
528
-
529
- Add [Monitoring Viewer](https://cloud.google.com/iam/docs/understanding-roles#monitoring.viewer) to your user.
530
-
531
-
532
- ## Reservation Troubleshooting:
533
-
534
- ### How to determine your reservation and its size / utilization:
535
-
536
- ```shell
537
- PROJECT_ID=my-project
538
- ZONE=us-east5-b
539
- RESERVATION=my-reservation-name
540
- # Find the reservations in your project
541
- gcloud beta compute reservations list --project=$PROJECT_ID
542
- # Find the tpu machine type and current utilization of a reservation.
543
- gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
544
- ```
545
-
546
- # TPU Workload Debugging
547
-
548
- ## Verbose Logging
549
- If you are having trouble with your workload, try setting the `--enable-debug-logs` flag when you schedule it. This will give you more detailed logs to help pinpoint the issue. For example:
550
- ```shell
551
- python3 xpk.py workload create \
552
- --cluster xpk-test --workload xpk-test-workload \
553
- --command="echo hello world" --enable-debug-logs
554
- ```
555
- Please check [libtpu logging](https://cloud.google.com/tpu/docs/troubleshooting/trouble-tf#debug_logs) and [TensorFlow logging](https://deepreg.readthedocs.io/en/latest/docs/logging.html#tensorflow-logging) for more information about the flags that are enabled to get the logs.
556
-
557
- ## Collect Stack Traces
558
- The [cloud-tpu-diagnostics](https://pypi.org/project/cloud-tpu-diagnostics/) PyPI package can be used to generate stack traces for workloads running in GKE. This package dumps the Python traces when a fault such as a segmentation fault, floating-point exception, or illegal operation exception occurs in the program. It also periodically collects stack traces to help you debug situations when the program is unresponsive. You must make the following changes in the docker image running in a Kubernetes main container to enable periodic stack trace collection.
559
- ```python
560
- # main.py
561
-
562
- from cloud_tpu_diagnostics import diagnostic
563
- from cloud_tpu_diagnostics.configuration import debug_configuration
564
- from cloud_tpu_diagnostics.configuration import diagnostic_configuration
565
- from cloud_tpu_diagnostics.configuration import stack_trace_configuration
566
-
567
- stack_trace_config = stack_trace_configuration.StackTraceConfig(
567
-     collect_stack_trace=True,
568
-     stack_trace_to_cloud=True)
569
- debug_config = debug_configuration.DebugConfig(
570
-     stack_trace_config=stack_trace_config)
571
- diagnostic_config = diagnostic_configuration.DiagnosticConfig(
572
-     debug_config=debug_config)
573
-
574
- with diagnostic.diagnose(diagnostic_config):
575
-     main_method()  # this is the main method to run
577
- ```
578
- This configuration will start collecting stack traces inside the `/tmp/debugging` directory on each Kubernetes Pod.
579
-
580
- ### Explore Stack Traces
581
- To explore the stack traces collected in a temporary directory in the Kubernetes Pod, you can run the following command to configure a sidecar container that will read the traces from the `/tmp/debugging` directory.
582
- ```shell
583
- python3 xpk.py workload create \
584
- --workload xpk-test-workload --command "python3 main.py" --cluster \
585
- xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar
586
- ```
@@ -1,7 +0,0 @@
1
- xpk.py,sha256=2JnMVsnDG7w0zr8HHHepc-kcuApQ6_PIlNh20aNh5pw,111620
2
- xpk-0.3.0.dist-info/LICENSE,sha256=z8d0m5b2O9McPEK1xHG_dWgUBT6EfBDz6wA0F7xSPTA,11358
3
- xpk-0.3.0.dist-info/METADATA,sha256=y9XYFZ0pR1uIjgmQz6B07ZnjcyjXZZLNRNpx4GB9cyE,21884
4
- xpk-0.3.0.dist-info/WHEEL,sha256=oiQVh_5PnQM0E3gPdiz09WCNmwiHDMaGer_elqB3coM,92
5
- xpk-0.3.0.dist-info/entry_points.txt,sha256=lhrMqkTA09DLePaqxSMyW2RCLUKs2X1c84baGhMev_k,33
6
- xpk-0.3.0.dist-info/top_level.txt,sha256=aDe4N0jicmuWExx_6w0TxWQJaEuPSs9BnLU-3aF1GLo,4
7
- xpk-0.3.0.dist-info/RECORD,,
File without changes