xpk 0.17.3__tar.gz → 1.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (288)
  1. {xpk-0.17.3 → xpk-1.1.0}/.github/actions/setup-test-env/action.yml +0 -1
  2. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/build_tests.yaml +1 -2
  3. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/integration_basic_cluster_create.yaml +3 -91
  4. xpk-1.1.0/.github/workflows/integration_gpu_cluster_create.yaml +78 -0
  5. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/label-validation.yaml +2 -2
  6. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/nightly_tests.yaml +11 -12
  7. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_goldens.yaml +1 -2
  8. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_lint_and_format.yml +0 -1
  9. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_storage_create.yaml +0 -41
  10. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_storage_delete.yaml +0 -3
  11. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_unit_tests.yaml +0 -1
  12. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/stale.yaml +4 -4
  13. {xpk-0.17.3 → xpk-1.1.0}/Makefile +4 -26
  14. {xpk-0.17.3/src/xpk.egg-info → xpk-1.1.0}/PKG-INFO +50 -23
  15. {xpk-0.17.3 → xpk-1.1.0}/README.md +49 -22
  16. {xpk-0.17.3 → xpk-1.1.0}/docs/installation.md +0 -1
  17. {xpk-0.17.3 → xpk-1.1.0}/docs/testing.md +37 -16
  18. {xpk-0.17.3 → xpk-1.1.0}/docs/troubleshooting.md +1 -1
  19. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/clusters.md +30 -1
  20. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/tpu7x/recipes/flex_filestore_recipe.md +0 -4
  21. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/tpu7x/recipes/flex_lustre_recipe.md +0 -4
  22. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/workloads.md +3 -0
  23. xpk-1.1.0/recipes/Basic_cluster_adapt.md +143 -0
  24. xpk-0.17.3/goldens/Basic_cluster_create.txt → xpk-1.1.0/recipes/Basic_cluster_create.md +15 -6
  25. xpk-1.1.0/recipes/Cluster_create_RayCluster.md +288 -0
  26. xpk-0.17.3/goldens/Cluster_create_for_multi-host_nodepool.txt → xpk-1.1.0/recipes/Cluster_create_for_multi-host_nodepool.md +16 -7
  27. xpk-1.1.0/recipes/Cluster_create_for_single-host_nodepool.md +275 -0
  28. xpk-0.17.3/goldens/Cluster_create_private.txt → xpk-1.1.0/recipes/Cluster_create_private.md +18 -7
  29. xpk-0.17.3/goldens/Cluster_create_sub-slicing.txt → xpk-1.1.0/recipes/Cluster_create_sub-slicing.md +18 -7
  30. xpk-0.17.3/goldens/Cluster_create_super-slicing.txt → xpk-1.1.0/recipes/Cluster_create_super-slicing.md +21 -10
  31. xpk-0.17.3/goldens/Cluster_create_with_CPU_and_memory_limits_above_capacity.txt → xpk-1.1.0/recipes/Cluster_create_with_CPU_and_memory_limits_above_capacity.md +15 -6
  32. xpk-0.17.3/goldens/Cluster_create_with_CPU_and_memory_limits_below_capacity.txt → xpk-1.1.0/recipes/Cluster_create_with_CPU_and_memory_limits_below_capacity.md +15 -6
  33. xpk-0.17.3/goldens/Cluster_create_with_Managed_Lustre_driver.txt → xpk-1.1.0/recipes/Cluster_create_with_Managed_Lustre_driver.md +15 -6
  34. xpk-0.17.3/goldens/Cluster_create_with_Managed_Lustre_driver_and_legacy_port.txt → xpk-1.1.0/recipes/Cluster_create_with_Managed_Lustre_driver_and_legacy_port.md +15 -6
  35. xpk-0.17.3/goldens/Cluster_create_with_gb200-4.txt → xpk-1.1.0/recipes/Cluster_create_with_gb200-4.md +51 -40
  36. xpk-0.17.3/goldens/Cluster_create_with_shared_reservation.txt → xpk-1.1.0/recipes/Cluster_create_with_shared_reservation.md +17 -6
  37. xpk-0.17.3/goldens/Cluster_delete.txt → xpk-1.1.0/recipes/Cluster_delete.md +10 -1
  38. xpk-0.17.3/goldens/Cluster_delete_force.txt → xpk-1.1.0/recipes/Cluster_delete_force.md +10 -1
  39. xpk-0.17.3/goldens/NAP_cluster-create.txt → xpk-1.1.0/recipes/NAP_cluster-create.md +15 -6
  40. xpk-0.17.3/goldens/NAP_cluster-create_with_pathways.txt → xpk-1.1.0/recipes/NAP_cluster-create_with_pathways.md +15 -6
  41. xpk-0.17.3/goldens/Storage_list.txt → xpk-1.1.0/recipes/Storage_list.md +10 -1
  42. xpk-0.17.3/goldens/Workload_create.txt → xpk-1.1.0/recipes/Workload_create.md +15 -8
  43. xpk-0.17.3/goldens/Workload_create_pathways.txt → xpk-1.1.0/recipes/Workload_create_pathways.md +13 -6
  44. xpk-0.17.3/goldens/Workload_create_sub-slicing.txt → xpk-1.1.0/recipes/Workload_create_sub-slicing.md +15 -8
  45. xpk-0.17.3/goldens/Workload_create_super-slicing.txt → xpk-1.1.0/recipes/Workload_create_super-slicing.md +59 -11
  46. xpk-0.17.3/goldens/Workload_create_with_output-manifest-file.txt → xpk-1.1.0/recipes/Workload_create_with_output-manifest-file.md +15 -8
  47. xpk-0.17.3/goldens/Workload_delete.txt → xpk-1.1.0/recipes/Workload_delete.md +10 -1
  48. xpk-0.17.3/goldens/Workload_list.txt → xpk-1.1.0/recipes/Workload_list.md +10 -1
  49. xpk-1.1.0/recipes/comprehensive-demo.md +83 -0
  50. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/cluster.py +33 -43
  51. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/cluster_gcluster.py +19 -14
  52. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/cluster_gcluster_test.py +2 -0
  53. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/cluster_test.py +1 -21
  54. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/common.py +39 -6
  55. xpk-1.1.0/src/xpk/commands/common_test.py +170 -0
  56. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/info.py +9 -5
  57. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/inspector.py +33 -4
  58. xpk-1.1.0/src/xpk/commands/inspector_test.py +142 -0
  59. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/workload.py +32 -11
  60. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/workload_test.py +71 -3
  61. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/blueprint_generator.py +19 -8
  62. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/testing/data/a3_ultra.yaml +3 -1
  63. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/testing/data/a4.yaml +3 -1
  64. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/capacity.py +37 -17
  65. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/capacity_test.py +66 -1
  66. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/cluster.py +11 -10
  67. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/cluster_private.py +3 -3
  68. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/cluster_test.py +29 -2
  69. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/config.py +5 -2
  70. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/docker_container.py +31 -24
  71. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/docker_manager.py +4 -4
  72. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/docker_resources.py +4 -1
  73. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/kueue_manager.py +6 -8
  74. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/kueue_manager_test.py +6 -5
  75. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/nap.py +14 -3
  76. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/nodepool.py +52 -13
  77. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/nodepool_test.py +147 -8
  78. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/remote_state/fuse_remote_state.py +1 -1
  79. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/scheduling.py +32 -4
  80. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/scheduling_test.py +39 -2
  81. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/system_characteristics.py +44 -0
  82. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/system_characteristics_test.py +11 -0
  83. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/telemetry.py +11 -1
  84. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/telemetry_test.py +39 -0
  85. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/testing/commands_tester.py +26 -0
  86. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/testing/commands_tester_test.py +20 -1
  87. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/rdma_decorator.py +9 -0
  88. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/cluster.py +11 -1
  89. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/cluster_test.py +59 -1
  90. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/common.py +11 -17
  91. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/core.py +0 -8
  92. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/storage.py +3 -14
  93. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/console.py +1 -1
  94. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/feature_flags.py +8 -4
  95. {xpk-0.17.3 → xpk-1.1.0/src/xpk.egg-info}/PKG-INFO +50 -23
  96. {xpk-0.17.3 → xpk-1.1.0}/src/xpk.egg-info/SOURCES.txt +32 -52
  97. xpk-1.1.0/src/xpk.egg-info/top_level.txt +1 -0
  98. xpk-1.1.0/tools/install-xpk.sh +7 -0
  99. xpk-1.1.0/tools/recipes.py +235 -0
  100. xpk-0.17.3/.github/actions/install-kjob/action.yml +0 -35
  101. xpk-0.17.3/.github/workflows/integration_legacy_tests.yaml +0 -67
  102. xpk-0.17.3/.github/workflows/reusable_build_kjob.yaml +0 -23
  103. xpk-0.17.3/.github/workflows/reusable_integration_tests.yaml +0 -62
  104. xpk-0.17.3/docs/local_testing.md +0 -61
  105. xpk-0.17.3/docs/usage/job.md +0 -41
  106. xpk-0.17.3/docs/usage/run.md +0 -44
  107. xpk-0.17.3/docs/usage/tpu7x/clusters.md +0 -329
  108. xpk-0.17.3/docs/usage/tpu7x/workloads.md +0 -269
  109. xpk-0.17.3/examples/batch.md +0 -24
  110. xpk-0.17.3/examples/job.sh +0 -12
  111. xpk-0.17.3/golden_buddy.sh +0 -150
  112. xpk-0.17.3/goldens/Cluster_create_for_single-host_single-slice_TPU.txt +0 -199
  113. xpk-0.17.3/goldens.yaml +0 -47
  114. xpk-0.17.3/src/integration/README.md +0 -19
  115. xpk-0.17.3/src/integration/docker_manager_test.py +0 -102
  116. xpk-0.17.3/src/integration/gcluster_a3mega_test.py +0 -215
  117. xpk-0.17.3/src/integration/gcluster_a3ultra_test.py +0 -187
  118. xpk-0.17.3/src/integration/gcluster_a4_test.py +0 -187
  119. xpk-0.17.3/src/integration/gcluster_test.py +0 -107
  120. xpk-0.17.3/src/xpk/commands/kind.py +0 -265
  121. xpk-0.17.3/src/xpk/parser/kind.py +0 -95
  122. xpk-0.17.3/src/xpk/utils/__init__.py +0 -15
  123. xpk-0.17.3/src/xpk/utils/user_input.py +0 -48
  124. xpk-0.17.3/src/xpk/utils/user_input_test.py +0 -92
  125. xpk-0.17.3/src/xpk.egg-info/top_level.txt +0 -2
  126. xpk-0.17.3/tools/Dockerfile-kjob +0 -33
  127. xpk-0.17.3/tools/build-kjob.sh +0 -9
  128. xpk-0.17.3/tools/install-xpk.sh +0 -11
  129. xpk-0.17.3/xpk-slurm-commands.md +0 -382
  130. {xpk-0.17.3 → xpk-1.1.0}/.dockerignore +0 -0
  131. {xpk-0.17.3 → xpk-1.1.0}/.github/CODEOWNERS +0 -0
  132. {xpk-0.17.3 → xpk-1.1.0}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
  133. {xpk-0.17.3 → xpk-1.1.0}/.github/actions/install-kueue/action.yml +0 -0
  134. {xpk-0.17.3 → xpk-1.1.0}/.github/release.yaml +0 -0
  135. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/README.md +0 -0
  136. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/build_wheels.yaml +0 -0
  137. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/cleanup.yaml +0 -0
  138. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-dispatch.yml +0 -0
  139. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-invoke.yml +0 -0
  140. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-review.yml +0 -0
  141. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-scheduled-triage.yml +0 -0
  142. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/gemini-triage.yml +0 -0
  143. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/integration_pathways_cluster_create.yaml +0 -0
  144. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/integration_ray_cluster_create.yaml +0 -0
  145. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/integration_storage_tests.yaml +0 -0
  146. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/periodic_release.yaml +0 -0
  147. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/release_branch_versioning.yaml +0 -0
  148. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_build_scripts.yaml +0 -0
  149. {xpk-0.17.3 → xpk-1.1.0}/.github/workflows/reusable_build_wheel.yaml +0 -0
  150. {xpk-0.17.3 → xpk-1.1.0}/.gitignore +0 -0
  151. {xpk-0.17.3 → xpk-1.1.0}/.pre-commit-config.yaml +0 -0
  152. {xpk-0.17.3 → xpk-1.1.0}/LICENSE +0 -0
  153. {xpk-0.17.3 → xpk-1.1.0}/backoff_retry.sh +0 -0
  154. {xpk-0.17.3 → xpk-1.1.0}/data/Dockerfile +0 -0
  155. {xpk-0.17.3 → xpk-1.1.0}/docs/code-of-conduct.md +0 -0
  156. {xpk-0.17.3 → xpk-1.1.0}/docs/contributing.md +0 -0
  157. {xpk-0.17.3 → xpk-1.1.0}/docs/permissions.md +0 -0
  158. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/advanced.md +0 -0
  159. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/autoprovisioning.md +0 -0
  160. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/cpu.md +0 -0
  161. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/docker.md +0 -0
  162. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/gpu.md +0 -0
  163. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/inspector.md +0 -0
  164. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/storage.md +0 -0
  165. {xpk-0.17.3 → xpk-1.1.0}/docs/usage/tpu7x/recipes/reservation_gcs_bucket_recipe.md +0 -0
  166. {xpk-0.17.3 → xpk-1.1.0}/examples/fake_training.py +0 -0
  167. {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/check_cuda.sh +0 -0
  168. {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/requirements.txt +0 -0
  169. {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/train.py +0 -0
  170. {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/train.slurm +0 -0
  171. {xpk-0.17.3 → xpk-1.1.0}/examples/llama-3.1-finetuning/training_data.jsonl +0 -0
  172. {xpk-0.17.3 → xpk-1.1.0}/examples/nccl/nccl-a3mega.sh +0 -0
  173. {xpk-0.17.3 → xpk-1.1.0}/examples/nccl/nccl-a3ultra.sh +0 -0
  174. {xpk-0.17.3 → xpk-1.1.0}/examples/nccl/nccl.md +0 -0
  175. {xpk-0.17.3 → xpk-1.1.0}/examples/storage/filestore-manifest-attach.yaml +0 -0
  176. {xpk-0.17.3 → xpk-1.1.0}/examples/storage/gcsfuse-manifest.yaml +0 -0
  177. {xpk-0.17.3 → xpk-1.1.0}/examples/storage/lustre-manifest-attach.yaml +0 -0
  178. {xpk-0.17.3 → xpk-1.1.0}/examples/storage/parallelstore-manifest-attach.yaml +0 -0
  179. {xpk-0.17.3 → xpk-1.1.0}/examples/storage/pd-manifest-attach.yaml +0 -0
  180. {xpk-0.17.3 → xpk-1.1.0}/pylintrc +0 -0
  181. {xpk-0.17.3 → xpk-1.1.0}/pyproject.toml +0 -0
  182. {xpk-0.17.3 → xpk-1.1.0}/setup.cfg +0 -0
  183. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/__init__.py +0 -0
  184. {xpk-0.17.3/src/integration → xpk-1.1.0/src/xpk/api}/__init__.py +0 -0
  185. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/api/storage_crd.yaml +0 -0
  186. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3mega/config-map.yaml.tftpl +0 -0
  187. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3mega/storage_crd.yaml +0 -0
  188. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3ultra/config-map.yaml.tftpl +0 -0
  189. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3ultra/mlgru-disable.yaml +0 -0
  190. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3ultra/nccl-installer.yaml +0 -0
  191. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a3ultra/storage_crd.yaml +0 -0
  192. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a4/config-map.yaml.tftpl +0 -0
  193. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a4/nccl-rdma-installer-a4.yaml +0 -0
  194. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/blueprints/a4/storage_crd.yaml +0 -0
  195. {xpk-0.17.3/src/xpk/api → xpk-1.1.0/src/xpk/commands}/__init__.py +0 -0
  196. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/config.py +0 -0
  197. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/managed_ml_diagnostics.py +0 -0
  198. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/managed_ml_diagnostics_test.py +0 -0
  199. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/storage.py +0 -0
  200. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/commands/version.py +0 -0
  201. {xpk-0.17.3/src/xpk/commands → xpk-1.1.0/src/xpk/core}/__init__.py +0 -0
  202. {xpk-0.17.3/src/xpk/core → xpk-1.1.0/src/xpk/core/blueprint}/__init__.py +0 -0
  203. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/blueprint_definitions.py +0 -0
  204. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/blueprint_test.py +0 -0
  205. {xpk-0.17.3/src/xpk/core/blueprint → xpk-1.1.0/src/xpk/core/blueprint/testing}/__init__.py +0 -0
  206. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/testing/data/a3_mega.yaml +0 -0
  207. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/blueprint/testing/data/a3_mega_spot.yaml +0 -0
  208. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/commands.py +0 -0
  209. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/config_test.py +0 -0
  210. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/docker_image.py +0 -0
  211. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/filestore.py +0 -0
  212. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/gcloud_context.py +0 -0
  213. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/gcloud_context_test.py +0 -0
  214. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/gcluster_manager.py +0 -0
  215. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/gcsfuse.py +0 -0
  216. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/jobset.py +0 -0
  217. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/monitoring.py +0 -0
  218. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/mtc.py +0 -0
  219. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/network.py +0 -0
  220. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/pathways.py +0 -0
  221. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/pathways_test.py +0 -0
  222. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/ray.py +0 -0
  223. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/remote_state/__init__.py +0 -0
  224. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/remote_state/remote_state_client.py +0 -0
  225. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/resources.py +0 -0
  226. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/storage.py +0 -0
  227. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/testing/__init__.py +0 -0
  228. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/updates.py +0 -0
  229. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/updates_test.py +0 -0
  230. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/vertex.py +0 -0
  231. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload.py +0 -0
  232. {xpk-0.17.3/src/xpk/core/blueprint/testing → xpk-1.1.0/src/xpk/core/workload_decorators}/__init__.py +0 -0
  233. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/storage_decorator.py +0 -0
  234. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/tcpx_decorator.py +0 -0
  235. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/tcpx_decorator_test.py +0 -0
  236. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_decorators/tcpxo_decorator.py +0 -0
  237. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/core/workload_test.py +0 -0
  238. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/main.py +0 -0
  239. {xpk-0.17.3/src/xpk/core/workload_decorators → xpk-1.1.0/src/xpk/parser}/__init__.py +0 -0
  240. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/common_test.py +0 -0
  241. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/config.py +0 -0
  242. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/info.py +0 -0
  243. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/inspector.py +0 -0
  244. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/storage_test.py +0 -0
  245. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/validators.py +0 -0
  246. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/version.py +0 -0
  247. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/workload.py +0 -0
  248. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/parser/workload_test.py +0 -0
  249. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/telemetry_uploader.py +0 -0
  250. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/__init__.py +0 -0
  251. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/arm_gpu_workload_crate.yaml.j2 +0 -0
  252. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/cluster_preheat.yaml.j2 +0 -0
  253. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/filestore-pv.yaml +0 -0
  254. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/filestore-pvc.yaml +0 -0
  255. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/filestore-sc.yaml +0 -0
  256. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/fuse-pv.yaml +0 -0
  257. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/fuse-pvc.yaml +0 -0
  258. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/kueue_config.yaml.j2 +0 -0
  259. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/kueue_gke_default_topology.yaml.j2 +0 -0
  260. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/kueue_sub_slicing_topology.yaml.j2 +0 -0
  261. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/kueue_super_slicing_topology.yaml.j2 +0 -0
  262. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/mtc-cpc.yaml +0 -0
  263. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/templates/storage.yaml +0 -0
  264. {xpk-0.17.3/src/xpk/parser → xpk-1.1.0/src/xpk/utils}/__init__.py +0 -0
  265. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/console_test.py +0 -0
  266. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/execution_context.py +0 -0
  267. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/file.py +0 -0
  268. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/gcs_utils.py +0 -0
  269. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/kubectl.py +0 -0
  270. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/kueue.py +0 -0
  271. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/network.py +0 -0
  272. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/objects.py +0 -0
  273. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/templates.py +0 -0
  274. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/topology.py +0 -0
  275. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/topology_test.py +0 -0
  276. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/user_agent.py +0 -0
  277. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/user_agent_test.py +0 -0
  278. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/validation.py +0 -0
  279. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/validation_test.py +0 -0
  280. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/versions.py +0 -0
  281. {xpk-0.17.3 → xpk-1.1.0}/src/xpk/utils/yaml.py +0 -0
  282. {xpk-0.17.3 → xpk-1.1.0}/src/xpk.egg-info/dependency_links.txt +0 -0
  283. {xpk-0.17.3 → xpk-1.1.0}/src/xpk.egg-info/entry_points.txt +0 -0
  284. {xpk-0.17.3 → xpk-1.1.0}/src/xpk.egg-info/requires.txt +0 -0
  285. {xpk-0.17.3 → xpk-1.1.0}/tools/install-gke-auth-plugin.sh +0 -0
  286. {xpk-0.17.3 → xpk-1.1.0}/xpk-large-scale-guide.sh +0 -0
  287. {xpk-0.17.3 → xpk-1.1.0}/xpk-notebooks.md +0 -0
  288. {xpk-0.17.3 → xpk-1.1.0}/xpk.py +0 -0
--- a/.github/actions/setup-test-env/action.yml
+++ b/.github/actions/setup-test-env/action.yml
@@ -44,7 +44,6 @@ runs:
  run: gcloud auth configure-docker --quiet
  shell: bash
  - uses: ./.github/actions/install-kueue
- - uses: ./.github/actions/install-kjob
  - name: Install XPK
  run: pip install dist/xpk-*.whl
  shell: bash
--- a/.github/workflows/build_tests.yaml
+++ b/.github/workflows/build_tests.yaml
@@ -49,14 +49,13 @@ jobs:
  lookup-only: true
  - name: install dependencies
  if : steps.check-cache.outputs.cache-hit != 'true'
- run: make install-dev && cp ./bin/kubectl-kueue /usr/local/bin/kubectl-kueue && cp ./bin/kubectl-kjob /usr/local/bin/kubectl-kjob
+ run: make install-dev && cp ./bin/kubectl-kueue /usr/local/bin/kubectl-kueue
  - name: Cache dependencies
  if : steps.check-cache.outputs.cache-hit != 'true'
  uses: actions/cache/save@v3
  with:
  path: |
  /usr/local/bin/kubectl-kueue
- /usr/local/bin/kubectl-kjob
  ~/.cache/pip
  ${{env.pythonLocation}}
  key: xpk-deps-${{ matrix.python-version }}-${{github.run_id}}-${{github.run_attempt}}
--- a/.github/workflows/integration_basic_cluster_create.yaml
+++ b/.github/workflows/integration_basic_cluster_create.yaml
@@ -31,7 +31,7 @@ jobs:
  group: nightly-test-cluster-group-empty
  cancel-in-progress: false
  env:
- EMPTY_CLUSTER_NAME: nightly-xpk-zero-nodepools
+ EMPTY_CLUSTER_NAME: nightly-xpk-zero
  steps:
  - uses: actions/download-artifact@v4
  with:
@@ -59,7 +59,7 @@ jobs:
  group: nightly-test-cluster-group-private
  cancel-in-progress: false
  env:
- PRIVATE_CLUSTER_NAME: nightly-xpk-private-2-v4-8-nodepools
+ PRIVATE_CLUSTER_NAME: nightly-xpk-private-2-v4-8
  steps:
  - uses: actions/download-artifact@v4
  with:
@@ -83,38 +83,6 @@ jobs:
  with:
  name: empty-private-cluster-nodepool-log-${{github.run_id}}
  path: /tmp/NodepoolCreate-${{ env.PRIVATE_CLUSTER_NAME }}-np-*
- dws_flex_cluster:
- runs-on: [ubuntu-22.04]
- concurrency: # We support one build test to run at a time currently.
- group: nightly-test-cluster-group-flex
- cancel-in-progress: false
- env:
- DWS_FLEX_CLUSTER_NAME: xpk-dws-nightly-test-2-v4-8
- steps:
- - uses: actions/download-artifact@v4
- with:
- name: custom-scripts
- - name: Setup environment
- uses: ./.github/actions/setup-test-env
- with:
- credentials_json: "${{ secrets.GCP_SA_KEY }}"
- - name: Check xpk installation
- run: xpk version
- - name: Create a DWS flex queued xpk cluster
- run: xpk cluster create --cluster ${DWS_FLEX_CLUSTER_NAME} --tpu-type=v5p-8 --num-slices=1 --zone=us-east5-a --default-pool-cpu-num-nodes=2 --flex --custom-cluster-arguments="${CLUSTER_NETWORK_ARGUMENTS_DWS}"
- - name: Run dws flex queued TPU workload
- run: xpk workload create --workload xpktest-build-${{ github.run_attempt }}-dws --cluster ${DWS_FLEX_CLUSTER_NAME} --zone=us-east5-a --tpu-type=v5p-8 --flex --command "echo foo" --num-slices=1
- - name: Wait for workload completion and confirm it succeeded
- run: xpk workload list --cluster ${DWS_FLEX_CLUSTER_NAME} --zone=us-east5-a --wait-for-job-completion xpktest-build-${{ github.run_attempt }}-dws --timeout 1000
- - name: Delete the DWS flex queued cluster
- if: always()
- run: xpk cluster delete --cluster ${DWS_FLEX_CLUSTER_NAME} --zone=us-east5-a --force
- - name: Upload DWS cluster nodepool creation log
- if: always()
- uses: actions/upload-artifact@v4
- with:
- name: empty-dws-cluster-nodepool-log-${{github.run_id}}
- path: /tmp/NodepoolCreate-${{ env.DWS_FLEX_CLUSTER_NAME }}-np-*

  cluster-create-and-delete:
  runs-on: [ubuntu-22.04]
@@ -122,7 +90,7 @@ jobs:
  group: nightly-test-cluster-group-tpu
  cancel-in-progress: false
  env:
- TPU_CLUSTER_NAME: nightly-xpk-2-v5p-8-nodepools
+ TPU_CLUSTER_NAME: nightly-xpk-2-v5p-8
  WORKLOAD_NAME: xpktest-nightly-${{ github.run_attempt }}
  steps:
  - uses: actions/download-artifact@v4
@@ -152,62 +120,6 @@
  run: xpk info --cluster $TPU_CLUSTER_NAME --zone=us-central2-b
  - name: Delete the workload on the cluster
  run: xpk workload delete --workload $WORKLOAD_NAME --cluster $TPU_CLUSTER_NAME --zone=us-central2-b
- - name: Create test script to execute in batch
- run: echo -e '#!/bin/bash \n#SBATCH --unknown-flag=value\n echo "Hello world from a test script!"' > batch.sh
- - name: Run a batch job on the cluster
- run: xpk batch --cluster $TPU_CLUSTER_NAME --zone=us-central2-b batch.sh --ignore-unknown-flags --array 1-5 --nodes 2 --ntasks 3
- - name: List out the jobs on the cluster
- run: xpk job ls --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | grep 'xpk-def-app-profile-slurm-'
- - name: Get created job name
- run: |
- JOB_NAME=$(xpk job ls --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | grep 'xpk-def-app-profile-slurm-' | grep 'multislice-queue' | head -1 | awk '{print $1}')
- echo "JOB_NAME=${JOB_NAME}" >> $GITHUB_ENV
- - name: Check job spec
- run: |
- job_spec=$(kubectl get job ${JOB_NAME} -o jsonpath='{.spec}')
- echo "$job_spec" | grep '"completions":2'
- echo "$job_spec" | grep '"parallelism":2'
- echo "$job_spec" | jq '.template.spec.containers | length' | grep 3
- - name: Get job info for the last job created on the cluster
- run: xpk job info ${JOB_NAME} --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | grep -e "Entrypoint environment variables template:" -e "Job name:" -e "Labels:" -e "Mounts:" -e "Pods:" -e "Profile:" -e "Script name:" | wc -l | grep "7"
- - name: Cancel the batch job on the cluster
- run: xpk job cancel ${JOB_NAME} --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | grep "job.batch/${JOB_NAME} deleted"
- - name: Create shell and exit it immediately
- run: |
- cat <<EOF > create-shell.exp
- #!/usr/bin/expect
- set timeout 180
- spawn sh -c "xpk shell --cluster $TPU_CLUSTER_NAME --zone=us-central2-b | tee shell.log"
- send "\n"
- expect {
- "/ # " {
- send "exit\n"
- # Wait for EOF after exit
- expect eof
- exit 0
- }
- timeout {
- puts "Timed out waiting for pod to be running"
- exit 1
- }
- eof {
- puts "Unexpected EOF before getting prompt"
- exit 1
- }
- }
- EOF
- chmod +x ./create-shell.exp
- expect ./create-shell.exp
- - name: Check if shell exists and is running
- run: |
- pod_name=$(grep 'waiting for pod' shell.log | awk -F'"' '{print $2}')
- kubectl wait --for='jsonpath={.status.conditions[?(@.type=="Ready")].status}=True' --timeout=1m pod/${pod_name}
- - name: Stop the shell
- run: xpk shell stop --cluster $TPU_CLUSTER_NAME --zone=us-central2-b
- - name: Delete create-shell.exp file
- run: rm create-shell.exp
- - name: Delete shell.log file
- run: rm shell.log
  - name: Delete the cluster created
  if: always()
  run: xpk cluster delete --cluster $TPU_CLUSTER_NAME --zone=us-central2-b --force
--- /dev/null
+++ b/.github/workflows/integration_gpu_cluster_create.yaml
@@ -0,0 +1,78 @@
+ # Copyright 2025 Google LLC
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # https://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License
+
+ name: Basic GPU cluster create
+
+ on:
+ workflow_call:
+
+ permissions:
+ contents: read
+
+ jobs:
+ gpu-cluster-create-and-delete:
+ runs-on: [ubuntu-22.04]
+ concurrency:
+ group: nightly-test-cluster-group-gpu
+ cancel-in-progress: false
+ env:
+ GPU_CLUSTER_NAME: nightly-xpk-b200
+ WORKLOAD_NAME: xpktest-gpu-nightly-${{ github.run_attempt }}
+ steps:
+ - uses: actions/download-artifact@v4
+ with:
+ name: custom-scripts
+ - name: Setup environment
+ uses: ./.github/actions/setup-test-env
+ with:
+ credentials_json: "${{ secrets.GCP_SA_KEY }}"
+ - name: Check xpk installation
+ run: xpk version
+ - name: 'Setup Service Account for XPK'
+ run: |
+ # 1. Clear any existing WIF configurations to avoid conflicts
+ rm -rf $HOME/.config/gcloud
+ mkdir -p $HOME/.config/gcloud
+
+ # 2. Write the Key File
+ echo '${{ secrets.GCP_SA_KEY }}' > $HOME/.config/gcloud/application_default_credentials.json
+
+ # 3. Activate the Service Account
+ # This updates the internal config files to point to the key file.
+ # When Docker mounts the directory, it will now see "Active Account: Service Account"
+ gcloud auth activate-service-account --key-file=$HOME/.config/gcloud/application_default_credentials.json --project=cloud-tpu-multipod-dev
+
+ # 4. Set Env Var for the host (GitHub Runner)
+ echo "GOOGLE_APPLICATION_CREDENTIALS=$HOME/.config/gcloud/application_default_credentials.json" >> $GITHUB_ENV
+ - name: Create an XPK Cluster with 1 x b200 GPU
+ run: xpk cluster create --cluster $GPU_CLUSTER_NAME --device-type=b200-8 --zone=asia-northeast1-b --default-pool-cpu-machine-type=n1-standard-16 --spot
+ - name: Authenticate Docker
+ run: gcloud auth configure-docker --quiet
+ - name: Run a base-docker-image workload
+ run: xpk workload create --cluster $GPU_CLUSTER_NAME --workload $WORKLOAD_NAME --docker-image='nvidia/cuda:12.1.0-base-ubuntu22.04' --command "nvidia-smi" --zone=asia-northeast1-b --device-type=b200-8
+ - name: List out the workloads on the cluster
+ run: xpk workload list --cluster $GPU_CLUSTER_NAME --zone=asia-northeast1-b
+ - name: Wait for workload completion and confirm it succeeded
+ run: xpk workload list --cluster $GPU_CLUSTER_NAME --zone=asia-northeast1-b --wait-for-job-completion $WORKLOAD_NAME --timeout 600
+ - name: Delete the workload on the cluster
+ run: xpk workload delete --workload $WORKLOAD_NAME --cluster $GPU_CLUSTER_NAME --zone=asia-northeast1-b
+ - name: Delete the cluster created
+ if: always()
+ run: xpk cluster delete --cluster $GPU_CLUSTER_NAME --zone=asia-northeast1-b --force
+ - name: Upload cluster nodepool creation log
+ if: always()
+ uses: actions/upload-artifact@v4
+ with:
+ name: gpu-cluster-nodepool-log-${{github.run_id}}
+ path: /tmp/NodepoolCreate-${{ env.GPU_CLUSTER_NAME }}-np-*
@@ -36,8 +36,8 @@ jobs:
  with:
  mode: minimum
  count: 1
- labels: "release-improvements, release-bugfix, release-features"
- message: "This PR is being prevented from merging because it is not labeled. Please add a label to this PR. Accepted labels: release-improvements, release-bugfix, release-features"
+ labels: "release-improvements, release-bugfix, release-features, release-breaking"
+ message: "This PR is being prevented from merging because it is not labeled. Please add a label to this PR. Accepted labels: release-improvements, release-bugfix, release-features, release-breaking"
  - id: do-not-merge
  uses: mheap/github-action-required-labels@v5
  with:
@@ -16,38 +16,37 @@ name: Nightly Tests

  on:
  workflow_dispatch:
- schedule: # Schedule the job run at 12AM PST daily.
- - cron: "0 8 * * *"
+ schedule: # Schedule the job run at 6AM UTC daily.
+ - cron: "0 6 * * *"

  permissions:
  contents: read

  jobs:
- build_kjob:
- uses: ./.github/workflows/reusable_build_kjob.yaml
  build_wheel:
  uses: ./.github/workflows/reusable_build_wheel.yaml
  build_actions:
  uses: ./.github/workflows/reusable_build_scripts.yaml
  basic_cluster_create:
- needs: [build_kjob, build_actions, build_wheel]
+ needs: [build_actions, build_wheel]
  uses: ./.github/workflows/integration_basic_cluster_create.yaml
  secrets: inherit

+ gpu_cluster_create:
+ needs: [build_actions, build_wheel]
+ uses: ./.github/workflows/integration_gpu_cluster_create.yaml
+ secrets: inherit
+
  pathways_cluster_create:
- needs: [build_kjob, build_actions, build_wheel]
+ needs: [build_actions, build_wheel]
  uses: ./.github/workflows/integration_pathways_cluster_create.yaml
  secrets: inherit

  ray_cluster_create:
- needs: [build_kjob, build_actions, build_wheel]
+ needs: [build_actions, build_wheel]
  uses: ./.github/workflows/integration_ray_cluster_create.yaml
  secrets: inherit
- legacy_integration:
- needs: [build_kjob, build_actions, build_wheel]
- uses: ./.github/workflows/integration_legacy_tests.yaml
- secrets: inherit
  storage-tests:
- needs: [build_kjob, build_actions, build_wheel]
+ needs: [build_actions, build_wheel]
  uses: ./.github/workflows/integration_storage_tests.yaml
  secrets: inherit
@@ -33,13 +33,12 @@ jobs:
  with:
  path: |
  /usr/local/bin/kubectl-kueue
- /usr/local/bin/kubectl-kjob
  ~/.cache/pip
  ${{env.pythonLocation}}
  key: xpk-deps-3.10-${{github.run_id}}-${{github.run_attempt}}
  restore-keys: xpk-deps-3.10-
  - name: Verify goldens
- run: ./golden_buddy.sh verify goldens.yaml goldens
+ run: python3 tools/recipes.py golden recipes/*.md
  env:
  UPDATE_GOLDEN_COMMAND: make goldens
  XPK_VERSION_OVERRIDE: v0.0.0
@@ -39,7 +39,6 @@ jobs:
  with:
  path: |
  /usr/local/bin/kubectl-kueue
- /usr/local/bin/kubectl-kjob
  ~/.cache/pip
  ${{env.pythonLocation}}
  key: xpk-deps-${{matrix.python-version}}-${{github.run_id}}-${{github.run_attempt}}
@@ -92,8 +92,6 @@ jobs:
  --auto-mount=true --vol=vol1 --mount-point='/${{inputs.storage-type}}-test-mount-point' --readonly=false
  - name: List and verify existing Storages
  run: xpk storage list --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} | tee output.txt | grep ${{inputs.storage-name}} || (echo 'No storage found' && exit 143)
- - name: Verify VolumeBundle created
- run: kubectl get volumebundle ${{inputs.storage-name}} -o jsonpath='{.spec.containerVolumeMounts[0].mountPath}' | grep '/${{inputs.storage-type}}-test-mount-point'
  - name: Verify Persistent Volume mount options
  if: inputs.storage-command == 'attach' && inputs.storage-type == 'gcsfuse'
  run: kubectl get pv ${{inputs.storage-name}}-pv -oyaml | grep rename-dir-limit=10000 || (echo 'Invalid storage mount options' && exit 143)
@@ -114,45 +112,6 @@ jobs:
  run: xpk workload list --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} --wait-for-job-completion $STORAGE_READ_WORKLOAD --timeout 300
  - name: Delete the reader workload on the cluster
  run: xpk workload delete --workload $STORAGE_READ_WORKLOAD --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}}
- - name: Create batch-read.sh script
- run: |
- cat <<EOF > batch-read.sh
- #!/bin/bash
- grep 'Test text message' /${{inputs.storage-type}}-test-mount-point/$RANDOM_SEED/test.txt || (echo 'Reading from filestore failed' && exit 143)
- EOF
- - name: Run a batch-read job on the cluster
- run: xpk batch --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} batch-read.sh | tee batch-read.log
- - name: Get job name
- run: |
- cat batch-read.log | grep 'xpk-def-app-profile-slurm-'
- READ_JOB_NAME=$(grep 'Job name: xpk-def-app-profile-slurm-' batch-read.log | awk -F': ' '{print $2}')
- echo "READ_JOB_NAME=${READ_JOB_NAME}" >> $GITHUB_ENV
- - name: Wait for the batch-read job to finish
- run: kubectl wait job.batch/$READ_JOB_NAME --for=condition=Complete --timeout=1m
- - name: Cancel the batch-read job
- run: xpk job cancel $READ_JOB_NAME --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} | grep "job.batch/$READ_JOB_NAME deleted"
- - name: Delete batch-read.log file
- run: rm batch-read.log
- - name: Run a run-read job on the cluster
- run: xpk run --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}} batch-read.sh --timeout 60
- - name: Delete batch-read.sh file
- run: rm batch-read.sh
- - name: Create shell and exit it immediately
- run: |
- cat <<EOF >> create-shell.exp
- ##!/usr/bin/expect
- spawn xpk shell --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}}
- expect "/ # "
- send "cat /${{inputs.storage-type}}-test-mount-point/$RANDOM_SEED/test.txt\n"
- expect "Test text message"
- send "exit\n"
- EOF
- chmod +x ./create-shell.exp
- expect ./create-shell.exp
- - name: Stop the shell
- run: xpk shell stop --cluster ${{inputs.cluster-name}} --zone=${{inputs.zone}}
- - name: Delete create-shell.exp file
- run: rm create-shell.exp
  - name: Run workload to delete file on filestore
  run : xpk workload create --workload $STORAGE_DELETE_WORKLOAD --command "rm -rf /${{inputs.storage-type}}-test-mount-point/$RANDOM_SEED/test.txt || exit 143" --num-slices=1 --cluster ${{inputs.cluster-name}} --device-type=${{inputs.device-type}} --zone ${{inputs.zone}}
  - name: Wait for delete workload completion and confirm it succeeded
@@ -61,9 +61,6 @@ jobs:
  - name: Detach storage volumes
  if: always()
  run: xpk storage detach ${{inputs.storage-name}} --cluster=${{inputs.cluster-name}} --zone=${{inputs.zone}}
- - name: Verify VolumeBundle deleted
- run: |
- ! kubectl get volumebundle | grep ${{inputs.storage-name}}
  - name: Delete GCP Filestore Storage instance
  if: always() && inputs.storage-command == 'delete'
  run: xpk storage delete ${{inputs.storage-name}} --cluster=${{inputs.cluster-name}} --zone=${{inputs.zone}}
@@ -33,7 +33,6 @@ jobs:
  with:
  path: |
  /usr/local/bin/kubectl-kueue
- /usr/local/bin/kubectl-kjob
  ~/.cache/pip
  ${{env.pythonLocation}}
  key: xpk-deps-3.10-${{github.run_id}}-${{github.run_attempt}}
@@ -12,11 +12,10 @@
  # See the License for the specific language governing permissions and
  # limitations under the License

-
- name: 'Close stale issues and PRs'
+ name: "Close stale issues and PRs"
  on:
  schedule:
- - cron: '30 1 * * *'
+ - cron: "30 1 * * *"

  jobs:
  stale:
@@ -24,7 +23,8 @@ jobs:
  steps:
  - uses: actions/stale@5f858e3efba33a5ca4407a664cc011ad407f2008 # v10.1.0
  with:
- stale-pr-message: 'This pull request is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.'
+ days-before-issue-stale: -1
+ stale-pr-message: "This pull request is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days."
  days-before-pr-stale: 30
  days-before-pr-close: 7
  operations-per-run: 100
@@ -1,26 +1,17 @@
- KUEUE_REPO=https://github.com/kubernetes-sigs/kueue.git
-
- KUBECTL_VERSION := $(shell curl -L -s https://dl.k8s.io/release/stable.txt)
- KUEUE_VERSION=v0.14.3
- KJOB_VERSION=v0.1.0
-
  OS := $(shell uname -s | tr A-Z a-z)
  PLATFORM := $(shell uname -m | sed -e 's/aarch64/arm64/' | sed -e 's/x86_64/amd64/')

- KUBECTL_URL = "https://dl.k8s.io/release/$(KUBECTL_VERSION)/bin/$(OS)/$(PLATFORM)/kubectl"
+ KUEUE_VERSION=v0.15.2
  KUEUECTL_URL = "https://github.com/kubernetes-sigs/kueue/releases/download/$(KUEUE_VERSION)/kubectl-kueue-$(OS)-$(PLATFORM)"
- KJOBCTL_URL = "https://github.com/kubernetes-sigs/kjob/releases/download/$(KJOB_VERSION)/kubectl-kjob-$(OS)-$(PLATFORM)"

  PROJECT_DIR := $(realpath $(shell dirname $(firstword $(MAKEFILE_LIST))))
- KJOB_DOCKER_IMG := xpk_kjob
- KJOB_DOCKER_CONTAINER := xpk_kjob_container
  BIN_PATH=$(PROJECT_DIR)/bin

  .PHONY: install
- install: check-python check-gcloud install-gcloud-auth-plugin install-kueuectl install-kjobctl pip-install
+ install: check-python check-gcloud install-gcloud-auth-plugin install-kueuectl pip-install

  .PHONY: install-dev
- install-dev: check-python check-gcloud mkdir-bin install-kueuectl install-kjobctl pip-install pip-install-dev install-pytest install-lint
+ install-dev: check-python check-gcloud mkdir-bin install-kueuectl pip-install pip-install-dev install-pytest install-lint

  .PHONY: pip-install-dev
  pip-install-dev:
@@ -38,12 +29,9 @@ install-pytest:
  run-unittests:
  XPK_TESTER=false XPK_VERSION_OVERRIDE=v0.0.0 pytest -vv src/xpk/

- run-integrationtests:
- XPK_TESTER=false XPK_VERSION_OVERRIDE=v0.0.0 pytest src/integration/
-
  .PHONY: goldens
  goldens:
- XPK_TESTER=false XPK_VERSION_OVERRIDE=v0.0.0 ./golden_buddy.sh update goldens.yaml goldens
+ XPK_TESTER=false XPK_VERSION_OVERRIDE=v0.0.0 python3 tools/recipes.py update recipes/*.md

  .PHONY: mkdir-bin
  mkdir-bin:
@@ -54,16 +42,6 @@ install-kueuectl: mkdir-bin
  curl -Lo $(BIN_PATH)/kubectl-kueue $(KUEUECTL_URL);
  chmod +x $(BIN_PATH)/kubectl-kueue;

- .PHONY: install-kjobctl
- install-kjobctl: mkdir-bin
- #curl -Lo $(BIN_PATH)/kubectl-kjob $(KJOBCTL_URL)
- #chmod +x $(BIN_PATH)/kubectl-kjob
- # TODO: Switch to kjob release-based installation once version >=0.2.0 is available.
- chmod +x tools/build-kjob.sh
- ./tools/build-kjob.sh
- mv kubectl-kjob $(BIN_PATH)/kubectl-kjob
- chmod +x $(BIN_PATH)/kubectl-kjob
-
  .PHONY: install-gcloud-auth-plugin
  install-gcloud-auth-plugin:
  chmod +x tools/install-gke-auth-plugin.sh
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: xpk
- Version: 0.17.3
+ Version: 1.1.0
  Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
  Author-email: XPK team <xpk-code-reviewers@google.com>
  License: Apache-2.0
@@ -93,36 +93,63 @@ XPK supports a variety of hardware accelerators.

  XPK also supports the following [Google Cloud Storage solutions](./docs/usage/storage.md):

- | Storage Type | Documentation |
- |--------------------------------------------|------------------------------------------------------------------------------------------|
- | Cloud Storage FUSE | [docs](./docs/usage/storage.md#fuse) |
- | Filestore | [docs](./docs/usage/storage.md#filestore) |
- | Parallelstore | [docs](./docs/usage/storage.md#parallelstore) |
- | Block storage (Persistent Disk, Hyperdisk) | [docs](./docs/usage/storage.md#block-storage-persistent-disk-hyperdisk) |
+ | Storage Type | Documentation |
+ | ------------------------------------------ | ----------------------------------------------------------------------- |
+ | Cloud Storage FUSE | [docs](./docs/usage/storage.md#fuse) |
+ | Filestore | [docs](./docs/usage/storage.md#filestore) |
+ | Parallelstore | [docs](./docs/usage/storage.md#parallelstore) |
+ | Block storage (Persistent Disk, Hyperdisk) | [docs](./docs/usage/storage.md#block-storage-persistent-disk-hyperdisk) |

  # Documentation

- * [Permissions](./docs/permissions.md)
- * [Installation](./docs/installation.md)
- * Usage:
- * [Clusters](./docs/usage/clusters.md)
- * [GPU](./docs/usage/gpu.md)
- * [CPU](./docs/usage/cpu.md)
- * [Autoprovisioning](./docs/usage/autoprovisioning.md)
- * [Workloads](./docs/usage/workloads.md)
- * [Docker](./docs/usage/docker.md)
- * [Storage](./docs/usage/storage.md)
- * [Advanced](./docs/usage/advanced.md)
- * [Inspector](./docs/usage/inspector.md)
- * [Run](./docs/usage/run.md)
- * [Job](./docs/usage/job.md)
- * [Troubleshooting](./docs/troubleshooting.md)
- * [Local Testing](./docs/local_testing.md)
+ - [Permissions](./docs/permissions.md)
+ - [Installation](./docs/installation.md)
+ - Usage:
+ - [Clusters](./docs/usage/clusters.md)
+ - [GPU](./docs/usage/gpu.md)
+ - [CPU](./docs/usage/cpu.md)
+ - [Autoprovisioning](./docs/usage/autoprovisioning.md)
+ - [Workloads](./docs/usage/workloads.md)
+ - [Docker](./docs/usage/docker.md)
+ - [Storage](./docs/usage/storage.md)
+ - [Advanced](./docs/usage/advanced.md)
+ - [Inspector](./docs/usage/inspector.md)
+ - [Troubleshooting](./docs/troubleshooting.md)
+
+ # Dependencies
+
+ | Dependency | When used |
+ | ------------------------------------------------------------------------------------------------------------ | --------------------------- |
+ | [Google Cloud SDK (gcloud)](https://cloud.google.com/sdk/docs/install) | _always_ |
+ | [kubectl](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_kubectl) | _always_ |
+ | [ClusterToolkit](https://github.com/GoogleCloudPlatform/cluster-toolkit) | Provisioning GPU clusters |
+ | [Kueue](https://github.com/kubernetes-sigs/kueue) | Scheduling workloads |
+ | [JobSet](https://github.com/kubernetes-sigs/jobset) | Workload creation |
+ | [Docker](https://docs.docker.com/engine/install/) | Building workload container |
+ | [CoreDNS](https://github.com/coredns/deployment/tree/master/kubernetes) | Cluster set up |
+ | [PathwaysJob](https://github.com/google/pathways-job) | Running Pathways workloads |
+
+ # Privacy notice
+
+ To help improve XPK, feature usage statistics are collected and sent to Google. You can opt-out at any time by executing
+ the following shell command:
+
+ ```shell
+ xpk config set send-telemetry <true/false>
+ ```
+
+ XPK telemetry overall is handled in accordance with the [Google Privacy Policy](https://policies.google.com/privacy). When
+ you use XPK to interact with or utilize GCP Services, your information is handled in accordance with the
+ [Google Cloud Privacy Notice](https://cloud.google.com/terms/cloud-privacy-notice).

  # Contributing

  Please read [`contributing.md`](./docs/contributing.md) for details on our code of conduct, and the process for submitting pull requests to us.

+ # Get involved
+
+ We'd love to hear from you! If you have questions or want to discuss ideas, join us on [GitHub Discussions](https://github.com/AI-Hypercomputer/xpk/discussions). Found a bug or have a feature request? Please let us know on [GitHub Issues](https://github.com/AI-Hypercomputer/xpk/issues).
+

  # License

  This project is licensed under the Apache License 2.0 - see the [`LICENSE`](./LICENSE) file for details