xpk 0.17.3__tar.gz → 1.1.0__tar.gz
This diff compares two publicly released versions of the package, as published to a supported registry. It is provided for informational purposes only and reflects the package contents exactly as they appear in the public registry.
- {xpk-0.17.3 → xpk-1.1.0}/.github/actions/setup-test-env/action.yml +0 -1
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/build_tests.yaml +1 -2
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/integration_basic_cluster_create.yaml +3 -91
- xpk-1.1.0/.github/workflows/integration_gpu_cluster_create.yaml +78 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/label-validation.yaml +2 -2
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/nightly_tests.yaml +11 -12
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_goldens.yaml +1 -2
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_lint_and_format.yml +0 -1
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_storage_create.yaml +0 -41
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_storage_delete.yaml +0 -3
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_unit_tests.yaml +0 -1
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/stale.yaml +4 -4
- {xpk-0.17.3 → xpk-1.1.0}/Makefile +4 -26
- {xpk-0.17.3/src/xpk.egg-info → xpk-1.1.0}/PKG-INFO +50 -23
- {xpk-0.17.3 → xpk-1.1.0}/README.md +49 -22
- {xpk-0.17.3 → xpk-1.1.0}/docs/installation.md +0 -1
- {xpk-0.17.3 → xpk-1.1.0}/docs/testing.md +37 -16
- {xpk-0.17.3 → xpk-1.1.0}/docs/troubleshooting.md +1 -1
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/clusters.md +30 -1
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/tpu7x/recipes/flex_filestore_recipe.md +0 -4
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/tpu7x/recipes/flex_lustre_recipe.md +0 -4
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/workloads.md +3 -0
- xpk-1.1.0/recipes/Basic_cluster_adapt.md +143 -0
- xpk-0.17.3/goldens/Basic_cluster_create.txt → xpk-1.1.0/recipes/Basic_cluster_create.md +15 -6
- xpk-1.1.0/recipes/Cluster_create_RayCluster.md +288 -0
- xpk-0.17.3/goldens/Cluster_create_for_multi-host_nodepool.txt → xpk-1.1.0/recipes/Cluster_create_for_multi-host_nodepool.md +16 -7
- xpk-1.1.0/recipes/Cluster_create_for_single-host_nodepool.md +275 -0
- xpk-0.17.3/goldens/Cluster_create_private.txt → xpk-1.1.0/recipes/Cluster_create_private.md +18 -7
- xpk-0.17.3/goldens/Cluster_create_sub-slicing.txt → xpk-1.1.0/recipes/Cluster_create_sub-slicing.md +18 -7
- xpk-0.17.3/goldens/Cluster_create_super-slicing.txt → xpk-1.1.0/recipes/Cluster_create_super-slicing.md +21 -10
- xpk-0.17.3/goldens/Cluster_create_with_CPU_and_memory_limits_above_capacity.txt → xpk-1.1.0/recipes/Cluster_create_with_CPU_and_memory_limits_above_capacity.md +15 -6
- xpk-0.17.3/goldens/Cluster_create_with_CPU_and_memory_limits_below_capacity.txt → xpk-1.1.0/recipes/Cluster_create_with_CPU_and_memory_limits_below_capacity.md +15 -6
- xpk-0.17.3/goldens/Cluster_create_with_Managed_Lustre_driver.txt → xpk-1.1.0/recipes/Cluster_create_with_Managed_Lustre_driver.md +15 -6
- xpk-0.17.3/goldens/Cluster_create_with_Managed_Lustre_driver_and_legacy_port.txt → xpk-1.1.0/recipes/Cluster_create_with_Managed_Lustre_driver_and_legacy_port.md +15 -6
- xpk-0.17.3/goldens/Cluster_create_with_gb200-4.txt → xpk-1.1.0/recipes/Cluster_create_with_gb200-4.md +51 -40
- xpk-0.17.3/goldens/Cluster_create_with_shared_reservation.txt → xpk-1.1.0/recipes/Cluster_create_with_shared_reservation.md +17 -6
- xpk-0.17.3/goldens/Cluster_delete.txt → xpk-1.1.0/recipes/Cluster_delete.md +10 -1
- xpk-0.17.3/goldens/Cluster_delete_force.txt → xpk-1.1.0/recipes/Cluster_delete_force.md +10 -1
- xpk-0.17.3/goldens/NAP_cluster-create.txt → xpk-1.1.0/recipes/NAP_cluster-create.md +15 -6
- xpk-0.17.3/goldens/NAP_cluster-create_with_pathways.txt → xpk-1.1.0/recipes/NAP_cluster-create_with_pathways.md +15 -6
- xpk-0.17.3/goldens/Storage_list.txt → xpk-1.1.0/recipes/Storage_list.md +10 -1
- xpk-0.17.3/goldens/Workload_create.txt → xpk-1.1.0/recipes/Workload_create.md +15 -8
- xpk-0.17.3/goldens/Workload_create_pathways.txt → xpk-1.1.0/recipes/Workload_create_pathways.md +13 -6
- xpk-0.17.3/goldens/Workload_create_sub-slicing.txt → xpk-1.1.0/recipes/Workload_create_sub-slicing.md +15 -8
- xpk-0.17.3/goldens/Workload_create_super-slicing.txt → xpk-1.1.0/recipes/Workload_create_super-slicing.md +59 -11
- xpk-0.17.3/goldens/Workload_create_with_output-manifest-file.txt → xpk-1.1.0/recipes/Workload_create_with_output-manifest-file.md +15 -8
- xpk-0.17.3/goldens/Workload_delete.txt → xpk-1.1.0/recipes/Workload_delete.md +10 -1
- xpk-0.17.3/goldens/Workload_list.txt → xpk-1.1.0/recipes/Workload_list.md +10 -1
- xpk-1.1.0/recipes/comprehensive-demo.md +83 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/cluster.py +33 -43
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/cluster_gcluster.py +19 -14
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/cluster_gcluster_test.py +2 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/cluster_test.py +1 -21
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/common.py +39 -6
- xpk-1.1.0/src/xpk/commands/common_test.py +170 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/info.py +9 -5
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/inspector.py +33 -4
- xpk-1.1.0/src/xpk/commands/inspector_test.py +142 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/workload.py +32 -11
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/workload_test.py +71 -3
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/blueprint_generator.py +19 -8
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/testing/data/a3_ultra.yaml +3 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/testing/data/a4.yaml +3 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/capacity.py +37 -17
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/capacity_test.py +66 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/cluster.py +11 -10
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/cluster_private.py +3 -3
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/cluster_test.py +29 -2
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/config.py +5 -2
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/docker_container.py +31 -24
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/docker_manager.py +4 -4
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/docker_resources.py +4 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/kueue_manager.py +6 -8
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/kueue_manager_test.py +6 -5
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/nap.py +14 -3
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/nodepool.py +52 -13
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/nodepool_test.py +147 -8
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/remote_state/fuse_remote_state.py +1 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/scheduling.py +32 -4
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/scheduling_test.py +39 -2
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/system_characteristics.py +44 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/system_characteristics_test.py +11 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/telemetry.py +11 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/telemetry_test.py +39 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/testing/commands_tester.py +26 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/testing/commands_tester_test.py +20 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/rdma_decorator.py +9 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/cluster.py +11 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/cluster_test.py +59 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/common.py +11 -17
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/core.py +0 -8
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/storage.py +3 -14
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/console.py +1 -1
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/feature_flags.py +8 -4
- {xpk-0.17.3 → xpk-1.1.0/src/xpk.egg-info}/PKG-INFO +50 -23
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk.egg-info/SOURCES.txt +32 -52
- xpk-1.1.0/src/xpk.egg-info/top_level.txt +1 -0
- xpk-1.1.0/tools/install-xpk.sh +7 -0
- xpk-1.1.0/tools/recipes.py +235 -0
- xpk-0.17.3/.github/actions/install-kjob/action.yml +0 -35
- xpk-0.17.3/.github/workflows/integration_legacy_tests.yaml +0 -67
- xpk-0.17.3/.github/workflows/reusable_build_kjob.yaml +0 -23
- xpk-0.17.3/.github/workflows/reusable_integration_tests.yaml +0 -62
- xpk-0.17.3/docs/local_testing.md +0 -61
- xpk-0.17.3/docs/usage/job.md +0 -41
- xpk-0.17.3/docs/usage/run.md +0 -44
- xpk-0.17.3/docs/usage/tpu7x/clusters.md +0 -329
- xpk-0.17.3/docs/usage/tpu7x/workloads.md +0 -269
- xpk-0.17.3/examples/batch.md +0 -24
- xpk-0.17.3/examples/job.sh +0 -12
- xpk-0.17.3/golden_buddy.sh +0 -150
- xpk-0.17.3/goldens/Cluster_create_for_single-host_single-slice_TPU.txt +0 -199
- xpk-0.17.3/goldens.yaml +0 -47
- xpk-0.17.3/src/integration/README.md +0 -19
- xpk-0.17.3/src/integration/docker_manager_test.py +0 -102
- xpk-0.17.3/src/integration/gcluster_a3mega_test.py +0 -215
- xpk-0.17.3/src/integration/gcluster_a3ultra_test.py +0 -187
- xpk-0.17.3/src/integration/gcluster_a4_test.py +0 -187
- xpk-0.17.3/src/integration/gcluster_test.py +0 -107
- xpk-0.17.3/src/xpk/commands/kind.py +0 -265
- xpk-0.17.3/src/xpk/parser/kind.py +0 -95
- xpk-0.17.3/src/xpk/utils/__init__.py +0 -15
- xpk-0.17.3/src/xpk/utils/user_input.py +0 -48
- xpk-0.17.3/src/xpk/utils/user_input_test.py +0 -92
- xpk-0.17.3/src/xpk.egg-info/top_level.txt +0 -2
- xpk-0.17.3/tools/Dockerfile-kjob +0 -33
- xpk-0.17.3/tools/build-kjob.sh +0 -9
- xpk-0.17.3/tools/install-xpk.sh +0 -11
- xpk-0.17.3/xpk-slurm-commands.md +0 -382
- {xpk-0.17.3 → xpk-1.1.0}/.dockerignore +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/CODEOWNERS +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/actions/install-kueue/action.yml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/release.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/README.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/build_wheels.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/cleanup.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-dispatch.yml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-invoke.yml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-review.yml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-scheduled-triage.yml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-triage.yml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/integration_pathways_cluster_create.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/integration_ray_cluster_create.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/integration_storage_tests.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/periodic_release.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/release_branch_versioning.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_build_scripts.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_build_wheel.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.gitignore +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/.pre-commit-config.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/LICENSE +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/backoff_retry.sh +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/data/Dockerfile +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/code-of-conduct.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/contributing.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/permissions.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/advanced.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/autoprovisioning.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/cpu.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/docker.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/gpu.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/inspector.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/storage.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/docs/usage/tpu7x/recipes/reservation_gcs_bucket_recipe.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/fake_training.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/check_cuda.sh +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/requirements.txt +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/train.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/train.slurm +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/training_data.jsonl +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/nccl/nccl-a3mega.sh +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/nccl/nccl-a3ultra.sh +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/nccl/nccl.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/storage/filestore-manifest-attach.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/storage/gcsfuse-manifest.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/storage/lustre-manifest-attach.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/storage/parallelstore-manifest-attach.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/examples/storage/pd-manifest-attach.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/pylintrc +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/pyproject.toml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/setup.cfg +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/__init__.py +0 -0
- {xpk-0.17.3/src/integration → xpk-1.1.0/src/xpk/api}/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/api/storage_crd.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3mega/config-map.yaml.tftpl +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3mega/storage_crd.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3ultra/config-map.yaml.tftpl +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3ultra/mlgru-disable.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3ultra/nccl-installer.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3ultra/storage_crd.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a4/config-map.yaml.tftpl +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a4/nccl-rdma-installer-a4.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a4/storage_crd.yaml +0 -0
- {xpk-0.17.3/src/xpk/api → xpk-1.1.0/src/xpk/commands}/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/config.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/managed_ml_diagnostics.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/managed_ml_diagnostics_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/storage.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/version.py +0 -0
- {xpk-0.17.3/src/xpk/commands → xpk-1.1.0/src/xpk/core}/__init__.py +0 -0
- {xpk-0.17.3/src/xpk/core → xpk-1.1.0/src/xpk/core/blueprint}/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/blueprint_definitions.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/blueprint_test.py +0 -0
- {xpk-0.17.3/src/xpk/core/blueprint → xpk-1.1.0/src/xpk/core/blueprint/testing}/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/testing/data/a3_mega.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/testing/data/a3_mega_spot.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/commands.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/config_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/docker_image.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/filestore.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/gcloud_context.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/gcloud_context_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/gcluster_manager.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/gcsfuse.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/jobset.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/monitoring.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/mtc.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/network.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/pathways.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/pathways_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/ray.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/remote_state/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/remote_state/remote_state_client.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/resources.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/storage.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/testing/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/updates.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/updates_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/vertex.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload.py +0 -0
- {xpk-0.17.3/src/xpk/core/blueprint/testing → xpk-1.1.0/src/xpk/core/workload_decorators}/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/storage_decorator.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/tcpx_decorator.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/tcpx_decorator_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/tcpxo_decorator.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/main.py +0 -0
- {xpk-0.17.3/src/xpk/core/workload_decorators → xpk-1.1.0/src/xpk/parser}/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/common_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/config.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/info.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/inspector.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/storage_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/validators.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/version.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/workload.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/workload_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/telemetry_uploader.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/arm_gpu_workload_crate.yaml.j2 +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/cluster_preheat.yaml.j2 +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/filestore-pv.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/filestore-pvc.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/filestore-sc.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/fuse-pv.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/fuse-pvc.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/kueue_config.yaml.j2 +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/kueue_gke_default_topology.yaml.j2 +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/kueue_sub_slicing_topology.yaml.j2 +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/kueue_super_slicing_topology.yaml.j2 +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/mtc-cpc.yaml +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/storage.yaml +0 -0
- {xpk-0.17.3/src/xpk/parser → xpk-1.1.0/src/xpk/utils}/__init__.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/console_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/execution_context.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/file.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/gcs_utils.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/kubectl.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/kueue.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/network.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/objects.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/templates.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/topology.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/topology_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/user_agent.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/user_agent_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/validation.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/validation_test.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/versions.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/yaml.py +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk.egg-info/dependency_links.txt +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk.egg-info/entry_points.txt +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/src/xpk.egg-info/requires.txt +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/tools/install-gke-auth-plugin.sh +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/xpk-large-scale-guide.sh +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/xpk-notebooks.md +0 -0
- {xpk-0.17.3 → xpk-1.1.0}/xpk.py +0 -0
.github/workflows/build_tests.yaml
@@ -49,14 +49,13 @@ jobs:
         lookup-only: true
     - name: install dependencies
       if : steps.check-cache.outputs.cache-hit != 'true'
-      run: make install-dev && cp ./bin/kubectl-kueue /usr/local/bin/kubectl-kueue
+      run: make install-dev && cp ./bin/kubectl-kueue /usr/local/bin/kubectl-kueue
     - name: Cache dependencies
       if : steps.check-cache.outputs.cache-hit != 'true'
       uses: actions/cache/save@v3
       with:
         path: |
           /usr/local/bin/kubectl-kueue
-          /usr/local/bin/kubectl-kjob
           ~/.cache/pip
           ${{env.pythonLocation}}
         key: xpk-deps-${{ matrix.python-version }}-${{github.run_id}}-${{github.run_attempt}}
.github/workflows/integration_basic_cluster_create.yaml
@@ -31,7 +31,7 @@ jobs:
       group: nightly-test-cluster-group-empty
       cancel-in-progress: false
     env:
-      EMPTY_CLUSTER_NAME: nightly-xpk-zero
+      EMPTY_CLUSTER_NAME: nightly-xpk-zero
     steps:
       - uses: actions/download-artifact@v4
         with:
@@ -59,7 +59,7 @@ jobs:
       group: nightly-test-cluster-group-private
       cancel-in-progress: false
     env:
-      PRIVATE_CLUSTER_NAME: nightly-xpk-private-2-v4-8
+      PRIVATE_CLUSTER_NAME: nightly-xpk-private-2-v4-8
     steps:
      - uses: actions/download-artifact@v4
        with:
@@ -83,38 +83,6 @@ jobs:
        with:
          name: empty-private-cluster-nodepool-log-${{github.run_id}}
          path: /tmp/NodepoolCreate-${{ env.PRIVATE_CLUSTER_NAME }}-np-*
-  dws_flex_cluster:
-    runs-on: [ubuntu-22.04]
-    concurrency: # We support one build test to run at a time currently.
-      group: nightly-test-cluster-group-flex
-      cancel-in-progress: false
-    env:
-      DWS_FLEX_CLUSTER_NAME: xpk-dws-nightly-test-2-v4-8
-    steps:
-      - uses: actions/download-artifact@v4
-        with:
-          name: custom-scripts
-      - name: Setup environment
-        uses: ./.github/actions/setup-test-env
-        with:
-          credentials_json: "${{ secrets.GCP_SA_KEY }}"
-      - name: Check xpk installation
-        run: xpk version
-      - name: Create a DWS flex queued xpk cluster
-        run: xpk cluster create --cluster ${DWS_FLEX_CLUSTER_NAME} --tpu-type=v5p-8 --num-slices=1 --zone=us-east5-a --default-pool-cpu-num-nodes=2 --flex --custom-cluster-arguments="${CLUSTER_NETWORK_ARGUMENTS_DWS}"
-      - name: Run dws flex queued TPU workload
-        run: xpk workload create --workload xpktest-build-${{ github.run_attempt }}-dws --cluster ${DWS_FLEX_CLUSTER_NAME} --zone=us-east5-a --tpu-type=v5p-8 --flex --command "echo foo" --num-slices=1
-      - name: Wait for workload completion and confirm it succeeded
-        run: xpk workload list --cluster ${DWS_FLEX_CLUSTER_NAME} --zone=us-east5-a --wait-for-job-completion xpktest-build-${{ github.run_attempt }}-dws --timeout 1000
-      - name: Delete the DWS flex queued cluster
-        if: always()
-        run: xpk cluster delete --cluster ${DWS_FLEX_CLUSTER_NAME} --zone=us-east5-a --force
-      - name: Upload DWS cluster nodepool creation log
-        if: always()
-        uses: actions/upload-artifact@v4
-        with:
-          name: empty-dws-cluster-nodepool-log-${{github.run_id}}
-          path: /tmp/NodepoolCreate-${{ env.DWS_FLEX_CLUSTER_NAME }}-np-*
 
   cluster-create-and-delete:
     runs-on: [ubuntu-22.04]
@@ -122,7 +90,7 @@ jobs:
       group: nightly-test-cluster-group-tpu
       cancel-in-progress: false
     env:
-      TPU_CLUSTER_NAME: nightly-xpk-2-v5p-8
+      TPU_CLUSTER_NAME: nightly-xpk-2-v5p-8
       WORKLOAD_NAME: xpktest-nightly-${{ github.run_attempt }}
     steps:
       - uses: actions/download-artifact@v4
@@ -152,62 +120,6 @@ jobs:
        run: xpk info --cluster $TPU_CLUSTER_NAME --zone=us-central2-b
      - name: Delete the workload on the cluster
        run: xpk workload delete --workload $WORKLOAD_NAME --cluster $TPU_CLUSTER_NAME --zone=us-central2-b
-      - name: Create test script to execute in batch
-        run: echo -e '#!/bin/bash \n#SBATCH --unknown-flag=value\n echo "Hello world from a test script!"' > batch.sh
-      - name: Run a batch job on the cluster
-        run: xpk batch --cluster $TPU_CLUSTER_NAME --zone=us-central2-b batch.sh --ignore-unknown-flags --array 1-5 --nodes 2 --ntasks 3
-      - name: List out the jobs on the cluster
-        run: xpk job ls --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | grep 'xpk-def-app-profile-slurm-'
-      - name: Get created job name
-        run: |
-          JOB_NAME=$(xpk job ls --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | grep 'xpk-def-app-profile-slurm-' | grep 'multislice-queue' | head -1 | awk '{print $1}')
-          echo "JOB_NAME=${JOB_NAME}" >> $GITHUB_ENV
-      - name: Check job spec
-        run: |
-          job_spec=$(kubectl get job ${JOB_NAME} -o jsonpath='{.spec}')
-          echo "$job_spec" | grep '"completions":2'
-          echo "$job_spec" | grep '"parallelism":2'
-          echo "$job_spec" | jq '.template.spec.containers | length' | grep 3
-      - name: Get job info for the last job created on the cluster
-        run: xpk job info ${JOB_NAME} --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | grep -e "Entrypoint environment variables template:" -e "Job name:" -e "Labels:" -e "Mounts:" -e "Pods:" -e "Profile:" -e "Script name:" | wc -l | grep "7"
-      - name: Cancel the batch job on the cluster
-        run: xpk job cancel ${JOB_NAME} --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | grep "job.batch/${JOB_NAME} deleted"
-      - name: Create shell and exit it immediately
-        run: |
-          cat <<EOF > create-shell.exp
-          #!/usr/bin/expect
-          set timeout 180
-          spawn sh -c "xpk shell --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | tee shell.log"
-          send "\n"
-          expect {
-            "/ # " {
-              send "exit\n"
-              # Wait for EOF after exit
-              expect eof
-              exit 0
-            }
-            timeout {
-              puts "Timed out waiting for pod to be running"
-              exit 1
-            }
-            eof {
-              puts "Unexpected EOF before getting prompt"
-              exit 1
-            }
-          }
-          EOF
-          chmod +x ./create-shell.exp
-          expect ./create-shell.exp
-      - name: Check if shell exists and is running
-        run: |
-          pod_name=$(grep 'waiting for pod' shell.log | awk -F'"' '{print $2}')
-          kubectl wait --for='jsonpath={.status.conditions[?(@.type=="Ready")].status}=True' --timeout=1m pod/${pod_name}
-      - name: Stop the shell
-        run: xpk shell stop --cluster $TPU_CLUSTER_NAME --zone=us-central2-b
-      - name: Delete create-shell.exp file
-        run: rm create-shell.exp
-      - name: Delete shell.log file
-        run: rm shell.log
      - name: Delete the cluster created
        if: always()
        run: xpk cluster delete --cluster $TPU_CLUSTER_NAME --zone=us-central2-b --force
.github/workflows/integration_gpu_cluster_create.yaml (new file)
@@ -0,0 +1,78 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License
+
+name: Basic GPU cluster create
+
+on:
+  workflow_call:
+
+permissions:
+  contents: read
+
+jobs:
+  gpu-cluster-create-and-delete:
+    runs-on: [ubuntu-22.04]
+    concurrency:
+      group: nightly-test-cluster-group-gpu
+      cancel-in-progress: false
+    env:
+      GPU_CLUSTER_NAME: nightly-xpk-b200
+      WORKLOAD_NAME: xpktest-gpu-nightly-${{ github.run_attempt }}
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          name: custom-scripts
+      - name: Setup environment
+        uses: ./.github/actions/setup-test-env
+        with:
+          credentials_json: "${{ secrets.GCP_SA_KEY }}"
+      - name: Check xpk installation
+        run: xpk version
+      - name: 'Setup Service Account for XPK'
+        run: |
+          # 1. Clear any existing WIF configurations to avoid conflicts
+          rm -rf $HOME/.config/gcloud
+          mkdir -p $HOME/.config/gcloud
+
+          # 2. Write the Key File
+          echo '${{ secrets.GCP_SA_KEY }}' > $HOME/.config/gcloud/application_default_credentials.json
+
+          # 3. Activate the Service Account
+          # This updates the internal config files to point to the key file.
+          # When Docker mounts the directory, it will now see "Active Account: Service Account"
+          gcloud auth activate-service-account --key-file=$HOME/.config/gcloud/application_default_credentials.json --project=cloud-tpu-multipod-dev
+
+          # 4. Set Env Var for the host (GitHub Runner)
+          echo "GOOGLE_APPLICATION_CREDENTIALS=$HOME/.config/gcloud/application_default_credentials.json" >> $GITHUB_ENV
+      - name: Create an XPK Cluster with 1 x b200 GPU
+        run: xpk cluster create --cluster $GPU_CLUSTER_NAME --device-type=b200-8 --zone=asia-northeast1-b --default-pool-cpu-machine-type=n1-standard-16 --spot
+      - name: Authenticate Docker
+        run: gcloud auth configure-docker --quiet
+      - name: Run a base-docker-image workload
+        run: xpk workload create --cluster $GPU_CLUSTER_NAME --workload $WORKLOAD_NAME --docker-image='nvidia/cuda:12.1.0-base-ubuntu22.04' --command "nvidia-smi" --zone=asia-northeast1-b --device-type=b200-8
+      - name: List out the workloads on the cluster
+        run: xpk workload list --cluster $GPU_CLUSTER_NAME --zone=asia-northeast1-b
+      - name: Wait for workload completion and confirm it succeeded
+        run: xpk workload list --cluster $GPU_CLUSTER_NAME --zone=asia-northeast1-b --wait-for-job-completion $WORKLOAD_NAME --timeout 600
+      - name: Delete the workload on the cluster
+        run: xpk workload delete --workload $WORKLOAD_NAME --cluster $GPU_CLUSTER_NAME --zone=asia-northeast1-b
+      - name: Delete the cluster created
+        if: always()
+        run: xpk cluster delete --cluster $GPU_CLUSTER_NAME --zone=asia-northeast1-b --force
+      - name: Upload cluster nodepool creation log
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: gpu-cluster-nodepool-log-${{github.run_id}}
+          path: /tmp/NodepoolCreate-${{ env.GPU_CLUSTER_NAME }}-np-*
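The cluster lifecycle exercised by the new GPU workflow above can be sketched as a local script. The flag values are copied from the hunk; the `run`/`DRY_RUN` wrapper and the `xpktest-gpu-demo` workload name are illustrative additions, not part of xpk, and running it for real requires an installed `xpk` and an authenticated `gcloud`.

```shell
#!/bin/sh
# Sketch of the GPU cluster lifecycle from the new nightly workflow.
# With DRY_RUN=1 (the default) commands are only printed, keeping the
# sketch side-effect free; set DRY_RUN=0 to actually execute them.
CLUSTER="nightly-xpk-b200"
ZONE="asia-northeast1-b"
WORKLOAD="xpktest-gpu-demo"   # illustrative name

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

run xpk cluster create --cluster "$CLUSTER" --device-type=b200-8 \
  --zone="$ZONE" --default-pool-cpu-machine-type=n1-standard-16 --spot
run xpk workload create --cluster "$CLUSTER" --workload "$WORKLOAD" \
  --docker-image='nvidia/cuda:12.1.0-base-ubuntu22.04' \
  --command "nvidia-smi" --zone="$ZONE" --device-type=b200-8
run xpk workload list --cluster "$CLUSTER" --zone="$ZONE" \
  --wait-for-job-completion "$WORKLOAD" --timeout 600
run xpk workload delete --workload "$WORKLOAD" --cluster "$CLUSTER" --zone="$ZONE"
run xpk cluster delete --cluster "$CLUSTER" --zone="$ZONE" --force
```

The dry-run default mirrors the workflow's step order without touching any GCP project.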
.github/workflows/label-validation.yaml
@@ -36,8 +36,8 @@ jobs:
       with:
         mode: minimum
         count: 1
-        labels: "release-improvements, release-bugfix, release-features"
-        message: "This PR is being prevented from merging because it is not labeled. Please add a label to this PR. Accepted labels: release-improvements, release-bugfix, release-features"
+        labels: "release-improvements, release-bugfix, release-features, release-breaking"
+        message: "This PR is being prevented from merging because it is not labeled. Please add a label to this PR. Accepted labels: release-improvements, release-bugfix, release-features, release-breaking"
     - id: do-not-merge
       uses: mheap/github-action-required-labels@v5
       with:
{xpk-0.17.3 → xpk-1.1.0}/.github/workflows/nightly_tests.yaml:

@@ -16,38 +16,37 @@ name: Nightly Tests
 
 on:
   workflow_dispatch:
-  schedule: # Schedule the job run at
-    - cron: "0
+  schedule: # Schedule the job run at 6AM UTC daily.
+    - cron: "0 6 * * *"
 
 permissions:
   contents: read
 
 jobs:
-  build_kjob:
-    uses: ./.github/workflows/reusable_build_kjob.yaml
   build_wheel:
     uses: ./.github/workflows/reusable_build_wheel.yaml
   build_actions:
     uses: ./.github/workflows/reusable_build_scripts.yaml
   basic_cluster_create:
-    needs: [
+    needs: [build_actions, build_wheel]
     uses: ./.github/workflows/integration_basic_cluster_create.yaml
     secrets: inherit
 
+  gpu_cluster_create:
+    needs: [build_actions, build_wheel]
+    uses: ./.github/workflows/integration_gpu_cluster_create.yaml
+    secrets: inherit
+
   pathways_cluster_create:
-    needs: [
+    needs: [build_actions, build_wheel]
     uses: ./.github/workflows/integration_pathways_cluster_create.yaml
     secrets: inherit
 
   ray_cluster_create:
-    needs: [
+    needs: [build_actions, build_wheel]
     uses: ./.github/workflows/integration_ray_cluster_create.yaml
     secrets: inherit
-  legacy_integration:
-    needs: [build_kjob, build_actions, build_wheel]
-    uses: ./.github/workflows/integration_legacy_tests.yaml
-    secrets: inherit
   storage-tests:
-    needs: [
+    needs: [build_actions, build_wheel]
    uses: ./.github/workflows/integration_storage_tests.yaml
     secrets: inherit
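The new schedule line `- cron: "0 6 * * *"` uses the five standard crontab fields (minute, hour, day-of-month, month, day-of-week), so it means 06:00 UTC every day. A quick way to decompose such an expression in plain shell — `set -f` is needed so the literal `*` fields are not glob-expanded during word splitting:

```shell
set -f                  # disable globbing so the literal "*" fields survive
cron="0 6 * * *"        # the nightly_tests.yaml schedule
set -- $cron            # split into the five positional crontab fields
minute=$1 hour=$2 dom=$3 month=$4 dow=$5
echo "runs daily at $hour:0$minute UTC (day-of-month=$dom month=$month day-of-week=$dow)"
```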
{xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_goldens.yaml:

@@ -33,13 +33,12 @@ jobs:
         with:
           path: |
             /usr/local/bin/kubectl-kueue
-            /usr/local/bin/kubectl-kjob
             ~/.cache/pip
             ${{env.pythonLocation}}
           key: xpk-deps-3.10-${{github.run_id}}-${{github.run_attempt}}
           restore-keys: xpk-deps-3.10-
       - name: Verify goldens
-        run:
+        run: python3 tools/recipes.py golden recipes/*.md
         env:
           UPDATE_GOLDEN_COMMAND: make goldens
           XPK_VERSION_OVERRIDE: v0.0.0
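The "Verify goldens" step now checks the recipe files against committed golden output, with `UPDATE_GOLDEN_COMMAND` telling contributors how to regenerate on mismatch. An illustrative sketch of that check (the real logic lives in `tools/recipes.py`; this only shows the diff-or-hint shape):

```shell
# Sketch of a golden-file check: compare freshly generated output against the
# committed golden, and on mismatch point at the update command.
check_golden() {
  # $1 = committed golden file, $2 = freshly generated output
  if diff -u "$1" "$2"; then
    echo "goldens match"
  else
    echo "goldens differ; regenerate with: ${UPDATE_GOLDEN_COMMAND:-make goldens}"
    return 1
  fi
}

golden=$(mktemp); actual=$(mktemp)
echo "xpk cluster create --cluster demo ..." > "$golden"
cp "$golden" "$actual"
check_golden "$golden" "$actual"
rm -f "$golden" "$actual"
```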
{xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_storage_create.yaml:

@@ -92,8 +92,6 @@ jobs:
           --auto-mount=true --vol=vol1 --mount-point='/${{inputs.storage-type}}-test-mount-point' --readonly=false
       - name: List and verify existing Storages
         run: xpk storage list --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} | tee output.txt | grep ${{inputs.storage-name}} || (echo 'No storage found' && exit 143)
-      - name: Verify VolumeBundle created
-        run: kubectl get volumebundle ${{inputs.storage-name}} -o jsonpath='{.spec.containerVolumeMounts[0].mountPath}' | grep '/${{inputs.storage-type}}-test-mount-point'
       - name: Verify Persistent Volume mount options
         if: inputs.storage-command == 'attach' && inputs.storage-type == 'gcsfuse'
         run: kubectl get pv ${{inputs.storage-name}}-pv -oyaml | grep rename-dir-limit=10000 || (echo 'Invalid storage mount options' && exit 143)
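The verify steps above rely on a tee-then-grep pattern: the full command output is saved for debugging while `grep` asserts it mentions the expected name, failing the step with a distinctive exit code otherwise. In isolation (with `storage-a` standing in for `${{inputs.storage-name}}` and a canned listing standing in for the `xpk storage list` output):

```shell
# Keep the full listing in output.txt for debugging, but fail fast if the
# expected storage name is missing.
list_output="NAME       TYPE
storage-a  gcsfuse"
printf '%s\n' "$list_output" | tee output.txt | grep -q storage-a \
  || { echo 'No storage found'; exit 143; }
echo "storage-a found (full listing kept in output.txt)"
rm -f output.txt
```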
{xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_storage_create.yaml:

@@ -114,45 +112,6 @@
         run: xpk workload list --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} --wait-for-job-completion $STORAGE_READ_WORKLOAD --timeout 300
       - name: Delete the reader workload on the cluster
         run: xpk workload delete --workload $STORAGE_READ_WORKLOAD --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}}
-      - name: Create batch-read.sh script
-        run: |
-          cat <<EOF > batch-read.sh
-          #!/bin/bash
-          grep 'Test text message' /${{inputs.storage-type}}-test-mount-point/$RANDOM_SEED/test.txt || (echo 'Reading from filestore failed' && exit 143)
-          EOF
-      - name: Run a batch-read job on the cluster
-        run: xpk batch --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} batch-read.sh | tee batch-read.log
-      - name: Get job name
-        run: |
-          cat batch-read.log | grep 'xpk-def-app-profile-slurm-'
-          READ_JOB_NAME=$(grep 'Job name: xpk-def-app-profile-slurm-' batch-read.log | awk -F': ' '{print $2}')
-          echo "READ_JOB_NAME=${READ_JOB_NAME}" >> $GITHUB_ENV
-      - name: Wait for the batch-read job to finish
-        run: kubectl wait job.batch/$READ_JOB_NAME --for=condition=Complete --timeout=1m
-      - name: Cancel the batch-read job
-        run: xpk job cancel $READ_JOB_NAME --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} | grep "job.batch/$READ_JOB_NAME deleted"
-      - name: Delete batch-read.log file
-        run: rm batch-read.log
-      - name: Run a run-read job on the cluster
-        run: xpk run --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} batch-read.sh --timeout 60
-      - name: Delete batch-read.sh file
-        run: rm batch-read.sh
-      - name: Create shell and exit it immediately
-        run: |
-          cat <<EOF >> create-shell.exp
-          ##!/usr/bin/expect
-          spawn xpk shell --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}}
-          expect "/ # "
-          send "cat /${{inputs.storage-type}}-test-mount-point/$RANDOM_SEED/test.txt\n"
-          expect "Test text message"
-          send "exit\n"
-          EOF
-          chmod +x ./create-shell.exp
-          expect ./create-shell.exp
-      - name: Stop the shell
-        run: xpk shell stop --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}}
-      - name: Delete create-shell.exp file
-        run: rm create-shell.exp
       - name: Run workload to delete file on filestore
         run : xpk workload create --workload $STORAGE_DELETE_WORKLOAD --command "rm -rf /${{inputs.storage-type}}-test-mount-point/$RANDOM_SEED/test.txt || exit 143" --num-slices=1 --cluster ${{inputs.cluster-name}} --device-type=${{inputs.device-type}} --zone ${{inputs.zone}}
       - name: Wait for delete workload completion and confirm it succeeded
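The removed "Get job name" step extracted the Slurm-style job id from the `xpk batch` log with grep + awk; the parsing itself runs anywhere. A standalone sketch (the sample log line is hypothetical — only the `Job name: xpk-def-app-profile-slurm-` prefix comes from the workflow's grep pattern):

```shell
# Split on ": " and keep the second field, i.e. the job name.
log_line="Job name: xpk-def-app-profile-slurm-abc12"
READ_JOB_NAME=$(printf '%s\n' "$log_line" \
  | grep 'Job name: xpk-def-app-profile-slurm-' \
  | awk -F': ' '{print $2}')
echo "$READ_JOB_NAME"
```

In the workflow this value was exported via `$GITHUB_ENV` so later steps could `kubectl wait` on and cancel the job.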
{xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_storage_delete.yaml:

@@ -61,9 +61,6 @@ jobs:
       - name: Detach storage volumes
         if: always()
         run: xpk storage detach ${{inputs.storage-name}} --cluster=${{inputs.cluster-name}} --zone=${{inputs.zone}}
-      - name: Verify VolumeBundle deleted
-        run: |
-          ! kubectl get volumebundle | grep ${{inputs.storage-name}}
       - name: Delete GCP Filestore Storage instance
         if: always() && inputs.storage-command == 'delete'
         run: xpk storage delete ${{inputs.storage-name}} --cluster=${{inputs.cluster-name}} --zone=${{inputs.zone}}
{xpk-0.17.3 → xpk-1.1.0}/.github/workflows/stale.yaml:

@@ -12,11 +12,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License
 
-
-name: 'Close stale issues and PRs'
+name: "Close stale issues and PRs"
 on:
   schedule:
-    - cron:
+    - cron: "30 1 * * *"
 
 jobs:
   stale:
@@ -24,7 +23,8 @@ jobs:
     steps:
       - uses: actions/stale@5f858e3efba33a5ca4407a664cc011ad407f2008 # v10.1.0
         with:
-
+          days-before-issue-stale: -1
+          stale-pr-message: "This pull request is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days."
           days-before-pr-stale: 30
           days-before-pr-close: 7
           operations-per-run: 100
{xpk-0.17.3 → xpk-1.1.0}/Makefile:

@@ -1,26 +1,17 @@
-KUEUE_REPO=https://github.com/kubernetes-sigs/kueue.git
-
-KUBECTL_VERSION := $(shell curl -L -s https://dl.k8s.io/release/stable.txt)
-KUEUE_VERSION=v0.14.3
-KJOB_VERSION=v0.1.0
-
 OS := $(shell uname -s | tr A-Z a-z)
 PLATFORM := $(shell uname -m | sed -e 's/aarch64/arm64/' | sed -e 's/x86_64/amd64/')
 
-
+KUEUE_VERSION=v0.15.2
 KUEUECTL_URL = "https://github.com/kubernetes-sigs/kueue/releases/download/$(KUEUE_VERSION)/kubectl-kueue-$(OS)-$(PLATFORM)"
-KJOBCTL_URL = "https://github.com/kubernetes-sigs/kjob/releases/download/$(KJOB_VERSION)/kubectl-kjob-$(OS)-$(PLATFORM)"
 
 PROJECT_DIR := $(realpath $(shell dirname $(firstword $(MAKEFILE_LIST))))
-KJOB_DOCKER_IMG := xpk_kjob
-KJOB_DOCKER_CONTAINER := xpk_kjob_container
 BIN_PATH=$(PROJECT_DIR)/bin
 
 .PHONY: install
-install: check-python check-gcloud install-gcloud-auth-plugin install-kueuectl
+install: check-python check-gcloud install-gcloud-auth-plugin install-kueuectl pip-install
 
 .PHONY: install-dev
-install-dev: check-python check-gcloud mkdir-bin install-kueuectl
+install-dev: check-python check-gcloud mkdir-bin install-kueuectl pip-install pip-install-dev install-pytest install-lint
 
 .PHONY: pip-install-dev
 pip-install-dev:
@@ -38,12 +29,9 @@ install-pytest:
 run-unittests:
 	XPK_TESTER=false XPK_VERSION_OVERRIDE=v0.0.0 pytest -vv src/xpk/
 
-run-integrationtests:
-	XPK_TESTER=false XPK_VERSION_OVERRIDE=v0.0.0 pytest src/integration/
-
 .PHONY: goldens
 goldens:
-	XPK_TESTER=false XPK_VERSION_OVERRIDE=v0.0.0
+	XPK_TESTER=false XPK_VERSION_OVERRIDE=v0.0.0 python3 tools/recipes.py update recipes/*.md
 
 .PHONY: mkdir-bin
 mkdir-bin:
@@ -54,16 +42,6 @@ install-kueuectl: mkdir-bin
 	curl -Lo $(BIN_PATH)/kubectl-kueue $(KUEUECTL_URL);
 	chmod +x $(BIN_PATH)/kubectl-kueue;
 
-.PHONY: install-kjobctl
-install-kjobctl: mkdir-bin
-	#curl -Lo $(BIN_PATH)/kubectl-kjob $(KJOBCTL_URL)
-	#chmod +x $(BIN_PATH)/kubectl-kjob
-	# TODO: Switch to kjob release-based installation once version >=0.2.0 is available.
-	chmod +x tools/build-kjob.sh
-	./tools/build-kjob.sh
-	mv kubectl-kjob $(BIN_PATH)/kubectl-kjob
-	chmod +x $(BIN_PATH)/kubectl-kjob
-
 .PHONY: install-gcloud-auth-plugin
 install-gcloud-auth-plugin:
 	chmod +x tools/install-gke-auth-plugin.sh
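The Makefile's OS/PLATFORM detection, which builds the kueuectl download URL, translates directly to plain shell: lowercase `uname -s` and normalize the machine name so it matches the Kueue release asset naming (the Makefile runs the two `sed` substitutions as separate pipeline stages; combining them into one `sed` invocation here behaves identically):

```shell
# Detect OS (linux/darwin) and CPU architecture (amd64/arm64) the way the
# Makefile does, then assemble the kueuectl release URL.
OS=$(uname -s | tr A-Z a-z)
PLATFORM=$(uname -m | sed -e 's/aarch64/arm64/' -e 's/x86_64/amd64/')
KUEUE_VERSION=v0.15.2
echo "https://github.com/kubernetes-sigs/kueue/releases/download/${KUEUE_VERSION}/kubectl-kueue-${OS}-${PLATFORM}"
```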
{xpk-0.17.3/src/xpk.egg-info → xpk-1.1.0}/PKG-INFO:

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: xpk
-Version:
+Version: 1.1.0
 Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
 Author-email: XPK team <xpk-code-reviewers@google.com>
 License: Apache-2.0
@@ -93,36 +93,63 @@ XPK supports a variety of hardware accelerators.
 
 XPK also supports the following [Google Cloud Storage solutions](./docs/usage/storage.md):
 
-| Storage Type | Documentation
-
-| Cloud Storage FUSE | [docs](./docs/usage/storage.md#fuse)
-| Filestore | [docs](./docs/usage/storage.md#filestore)
-| Parallelstore | [docs](./docs/usage/storage.md#parallelstore)
-| Block storage (Persistent Disk, Hyperdisk) | [docs](./docs/usage/storage.md#block-storage-persistent-disk-hyperdisk)
+| Storage Type | Documentation |
+| ------------------------------------------ | ----------------------------------------------------------------------- |
+| Cloud Storage FUSE | [docs](./docs/usage/storage.md#fuse) |
+| Filestore | [docs](./docs/usage/storage.md#filestore) |
+| Parallelstore | [docs](./docs/usage/storage.md#parallelstore) |
+| Block storage (Persistent Disk, Hyperdisk) | [docs](./docs/usage/storage.md#block-storage-persistent-disk-hyperdisk) |
 
 # Documentation
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+- [Permissions](./docs/permissions.md)
+- [Installation](./docs/installation.md)
+- Usage:
+  - [Clusters](./docs/usage/clusters.md)
+  - [GPU](./docs/usage/gpu.md)
+  - [CPU](./docs/usage/cpu.md)
+  - [Autoprovisioning](./docs/usage/autoprovisioning.md)
+  - [Workloads](./docs/usage/workloads.md)
+  - [Docker](./docs/usage/docker.md)
+  - [Storage](./docs/usage/storage.md)
+  - [Advanced](./docs/usage/advanced.md)
+  - [Inspector](./docs/usage/inspector.md)
+- [Troubleshooting](./docs/troubleshooting.md)
+
+# Dependencies
+
+| Dependency | When used |
+| ------------------------------------------------------------------------------------------------------------ | --------------------------- |
+| [Google Cloud SDK (gcloud)](https://cloud.google.com/sdk/docs/install) | _always_ |
+| [kubectl](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_kubectl) | _always_ |
+| [ClusterToolkit](https://github.com/GoogleCloudPlatform/cluster-toolkit) | Provisioning GPU clusters |
+| [Kueue](https://github.com/kubernetes-sigs/kueue) | Scheduling workloads |
+| [JobSet](https://github.com/kubernetes-sigs/jobset) | Workload creation |
+| [Docker](https://docs.docker.com/engine/install/) | Building workload container |
+| [CoreDNS](https://github.com/coredns/deployment/tree/master/kubernetes) | Cluster set up |
+| [PathwaysJob](https://github.com/google/pathways-job) | Running Pathways workloads |
+
+# Privacy notice
+
+To help improve XPK, feature usage statistics are collected and sent to Google. You can opt-out at any time by executing
+the following shell command:
+
+```shell
+xpk config set send-telemetry <true/false>
+```
+
+XPK telemetry overall is handled in accordance with the [Google Privacy Policy](https://policies.google.com/privacy). When
+you use XPK to interact with or utilize GCP Services, your information is handled in accordance with the
+[Google Cloud Privacy Notice](https://cloud.google.com/terms/cloud-privacy-notice).
 
 # Contributing
 
 Please read [`contributing.md`](./docs/contributing.md) for details on our code of conduct, and the process for submitting pull requests to us.
 
+# Get involved
+
+We'd love to hear from you! If you have questions or want to discuss ideas, join us on [GitHub Discussions](https://github.com/AI-Hypercomputer/xpk/discussions). Found a bug or have a feature request? Please let us know on [GitHub Issues](https://github.com/AI-Hypercomputer/xpk/issues).
+
 # License
 
 This project is licensed under the Apache License 2.0 - see the [`LICENSE`](./LICENSE) file for details